Privacy-Preserving Record Linkage (PPRL)

Overview

Washington University (WU) participated in analyses of the utility of synthetic data derived from electronic health records (EHRs). These studies used a national, multi-institution dataset focused on COVID-19. Some analyses addressed COVID-19-specific research questions, such as predicting regional infection levels or demonstrating equivalence with epidemiologic curves. Many others demonstrated a general capability: analyses run on synthetic data produced results equivalent to analyses run on the actual data in a limited data set. Examples include characterizing patients demographically and by multiple clinical measures, and predicting hospitalization or hospital severity. Even the analyses specific to epidemiologic curves reflect a general capability of synthetic data when considered as representing demographic variables.

With the analysis of synthetic data utility complete, the project shifted to analyzing privacy protection in the COVID-19 EHR data. The goal was to demonstrate that the synthetic data could be defined and shared as a de-identified dataset that is not restricted by regulations for protected health information. These analyses used an adversarial attack approach and a data uniqueness characterization to demonstrate how synthetic data protected privacy.
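
As a rough illustration of what a data uniqueness characterization measures, the sketch below computes the share of records that are unique on a set of quasi-identifiers. The field names and records are invented for illustration, and this is a generic calculation, not necessarily the method used in the project's analyses.

```python
from collections import Counter

def uniqueness_rate(records, quasi_identifiers):
    """Return the fraction of records whose quasi-identifier
    combination appears exactly once in the dataset."""
    if not records:
        return 0.0
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(records)

# Hypothetical toy records; real data would have many more fields.
records = [
    {"age_band": "40-49", "sex": "F", "zip3": "631"},
    {"age_band": "40-49", "sex": "F", "zip3": "631"},
    {"age_band": "70-79", "sex": "M", "zip3": "630"},
    {"age_band": "20-29", "sex": "F", "zip3": "652"},
]
rate = uniqueness_rate(records, ["age_band", "sex", "zip3"])
print(rate)
```

A lower rate means fewer records stand out on those fields alone, which is one signal (among others) of lower re-identification risk.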

Synthetic data can support easier access to data while ensuring privacy protection, but its greatest benefit is perhaps in supporting data sharing. Linking datasets creates a risk that information could be inferred from the combination that could not be inferred from the datasets before they were linked. This risk can preclude data linking across organizations, which prevents analyses and makes access more difficult. By creating synthetic derivatives from linked data, organizations can allow linking while ensuring privacy protections for individuals. Cancer data analysis is one example of the need for linked data: datasets developed for different purposes (e.g., cancer registries and EHRs) can be linked to expand the understanding of both.

WU evaluated the accuracy and concordance of PPRL-based matching between health system EHR data and cancer registry data using the validation methodology developed by Regenstrief.
WU also conducted an adversarial privacy risk assessment on the synthetic data derivative.
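
One common way to quantify linkage accuracy is to compare the links produced by the matcher against a manually reviewed reference set. The sketch below is a generic illustration of that comparison. It is not the Regenstrief validation methodology, which is not described on this page, and the record identifiers are hypothetical.

```python
def linkage_metrics(predicted_links, reference_links):
    """Compare predicted (ehr_id, registry_id) pairs against a reviewed
    reference set and return precision, recall, and F1."""
    predicted = set(predicted_links)
    reference = set(reference_links)
    true_pos = len(predicted & reference)
    false_pos = len(predicted - reference)
    false_neg = len(reference - predicted)
    precision = true_pos / (true_pos + false_pos) if predicted else 0.0
    recall = true_pos / (true_pos + false_neg) if reference else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical identifiers for illustration only.
predicted = [("e1", "r1"), ("e2", "r2"), ("e3", "r9")]
reference = [("e1", "r1"), ("e2", "r2"), ("e4", "r4")]
metrics = linkage_metrics(predicted, reference)
print(metrics)
```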

ICS' Role

Generate a complete EHR dataset from Washington University and BJC HealthCare covering July 2021 to June 2022, and facilitate its linkage (through the Regenstrief honest broker) with data from the Missouri Cancer Registry using PPRL. From the linked dataset, create a synthetic data derivative using MDClone technology.
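
For readers unfamiliar with PPRL, the sketch below shows the general idea under simplifying assumptions: each data holder converts identifying fields into keyed hash tokens, and an honest broker joins the two sources on tokens without seeing the underlying identifiers. The choice of fields, the normalization, and the use of HMAC-SHA256 are assumptions for illustration. This is not the token scheme used in the project, and all function names are hypothetical.

```python
import hashlib
import hmac
import unicodedata

def normalize(value):
    """Strip accents, case, and punctuation so trivial variants match."""
    text = unicodedata.normalize("NFKD", value)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return "".join(ch for ch in text.lower() if ch.isalnum())

def make_token(secret_key, first, last, dob, sex):
    """Build a keyed hash over normalized identifiers (illustrative only)."""
    message = "|".join(normalize(p) for p in (first, last, dob, sex))
    return hmac.new(secret_key, message.encode("utf-8"), hashlib.sha256).hexdigest()

def link_by_token(ehr_tokens, registry_tokens):
    """Honest-broker step: join two token -> local ID maps on shared tokens."""
    shared = ehr_tokens.keys() & registry_tokens.keys()
    return sorted((ehr_tokens[t], registry_tokens[t]) for t in shared)

# Hypothetical shared key and made-up people; never hard-code a real key.
key = b"demo-key-not-for-real-use"
ehr = {
    make_token(key, "María", "O'Neil", "1960-02-03", "F"): "ehr-001",
    make_token(key, "John", "Smith", "1975-11-30", "M"): "ehr-002",
}
registry = {
    make_token(key, "Maria", "ONEIL", "1960-02-03", "f"): "mcr-17",
}
links = link_by_token(ehr, registry)
print(links)
```

Exact-match tokens like these miss records with typos or changed names, which is one reason a separate matching validation step is needed.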

Contacts

Internal

Name | Role
Nicole Venteris | PM
Adam Wilcox | PI
Snehil Gupta | Technical Lead
John Newland | Technical
Philip Payne | PI
Randi Foraker | PI
Zachary Abrams | Technical

External

Name | Role | Contact Info
Ken Gersing | Director of Informatics, NCATS/DCI | Kenneth.Gersing@nih.gov
Shaun Grannis | VP Data Analytics, Regenstrief | sgrannis@regenstrief.org
Brandy Phalora | Senior Program Manager, Regenstrief | bphalora@regenstrief.org
Chris Beesley | Systems Engineer, Regenstrief | beesleyc@iu.edu
Jasmine Phua | Head of Government Solutions, Datavant | jas@datavant.com
Sara Rogovin | Head of Technical Solutions, Datavant | sara@datavant.com
Iris Zachary | Director, Missouri Cancer Registry | zacharyi@health.missouri.edu
Josh Day | Manager Clinical Programs, BJC HealthCare | josh.day@bjc.org
Lori Grove | Manager of Oncology Data Services, BJC HealthCare | lori.grove@bjc.org
Susan Weilmuenster | Supervisory, Oncology Data Services, BJC HealthCare | susan.weilmuenster@bjc.org

Project Management

Major Tasks & Initiatives

Important Dates & Notes

As of December 2024, this project was completed and the final reports were submitted.

Deliverables

1.    Deliverable: WU Clinical Dataset Extraction. WU will create the EHR dataset to be linked from existing electronic health records at WU. WU will also obtain the data as they were submitted from WU to the Missouri Cancer Registry to serve as the registry dataset. These two datasets will be linked using privacy-preserving record linkage (PPRL) approaches.

2.    Deliverable: PPRL-matched Dataset. WU will apply and demonstrate the PPRL matching according to the matching validation methodology developed by Regenstrief Institute. The matching between the datasets will then be tested for validity by measuring overall correctness and by reviewing specific performance on a sample of records.

3.    Deliverable: Synthetic Derivative Dataset. Using licensed synthetic data software (MDClone), WU will create a synthetic data derivative from the linked dataset and confirm its consistency with the original data through established data characterization techniques.

4.    Deliverable: Privacy Assessment. WU will perform an adversarial privacy risk assessment of the synthetic data that uses external data to attempt to match records to individuals. This follows a demonstrated approach for privacy validation of synthetic data.

5.    Deliverable: Progress Reports. WU will provide regular progress reports, in collaboration with Regenstrief Institute.
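
As a small illustration of the kind of consistency check mentioned under the Synthetic Derivative Dataset deliverable, the sketch below compares the distribution of one categorical variable in the original and synthetic data using total variation distance. The metric choice, variable, and values are assumptions for illustration. The page does not specify which data characterization techniques were actually used.

```python
from collections import Counter

def category_proportions(values):
    """Return the proportion of each category in a list of values."""
    if not values:
        raise ValueError("values must not be empty")
    counts = Counter(values)
    total = len(values)
    return {category: n / total for category, n in counts.items()}

def total_variation_distance(original, synthetic):
    """0.0 means identical category distributions; 1.0 means disjoint."""
    p = category_proportions(original)
    q = category_proportions(synthetic)
    categories = p.keys() | q.keys()
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)

# Hypothetical cancer stage values for illustration only.
original_stage = ["I", "I", "II", "III"]
synthetic_stage = ["I", "II", "II", "III"]
tvd = total_variation_distance(original_stage, synthetic_stage)
print(tvd)
```

In practice a check like this would be repeated across many variables, alongside comparisons of numeric summaries and model results.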

Standard Meetings

Name | Occurrence | Frequency
Working Group | W 8:30 am CT | bi-weekly

Administrative Details

Tracking Time

Please ensure all work for this project is categorized under the "PPRL Synthetic Project" project in Tracking Time.

Digital Landmarks

Project Web Pages

Other References

Document Repositories

Technical Information

Guides, Tutorials, & References

Glossary / Acronyms

Term | Definition
MCR | Missouri Cancer Registry
PPRL | Privacy-Preserving Record Linkage
N3C | National COVID Cohort Collaborative
IRB | Institutional Review Board

Miscellaneous


Updated on August 7, 2025