Privacy-Preserving Record Linkage (PPRL)
Overview
Washington University (WU) participated in analyses of the utility of synthetic data derived from electronic health records (EHRs). These studies used a national, multi-institution dataset focused on COVID-19. While some analyses addressed COVID-19 research questions specifically, such as predicting regional infection levels or demonstrating equivalence with epidemiologic curves, many demonstrated that analyses of synthetic data can be generally equivalent to analyses of actual data in a limited data set. For example, characterizing patients demographically and by multiple clinical measures, and predicting hospitalization or hospital severity, demonstrate general capabilities of synthetic data. Even the analyses of epidemiologic curves reflect a general capability when the curves are considered as representing demographic variables. Once the analysis of synthetic data utility was complete, the project shifted to analyzing privacy protection for the COVID-19 EHR data, to demonstrate that the synthetic data could be defined and shared as a de-identified dataset that is not restricted by regulations for protected health information. These analyses used an adversarial attack approach and a data uniqueness characterization to demonstrate how synthetic data protected privacy.
Synthetic data can support easier access to data while ensuring privacy protection, but perhaps its greatest benefit is supporting data sharing. Linking datasets creates a risk that information could be inferred that could not be inferred from the datasets before they were linked. This risk precludes data linking across organizations, which prevents analyses and makes access more difficult. By creating synthetic derivatives from linked data, organizations can allow linking while ensuring privacy protections for individuals. An example of the need for linked data is cancer data analysis, where datasets developed for different purposes (e.g., cancer registries and EHRs) can be linked to expand the understanding of both.
WU evaluated the accuracy and concordance of PPRL-based matching between health system EHR data and cancer registry data using the validation methodology developed by Regenstrief.
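As a rough illustration of what accuracy and concordance measures for a linkage can look like, the sketch below computes precision, recall, and F1 for a set of predicted record pairs against a reviewed reference set. It is a generic example, not the Regenstrief validation methodology, which this page does not describe; the function name `linkage_metrics` and all record IDs are hypothetical.

```python
def linkage_metrics(predicted_links, true_links):
    """Return precision, recall, and F1 for predicted (ehr_id, registry_id) pairs."""
    predicted = set(predicted_links)
    truth = set(true_links)
    true_positives = len(predicted & truth)

    # Guard against empty inputs so the function never divides by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical pairs: links produced by the matcher vs. links confirmed by manual review.
predicted = [("e1", "r1"), ("e2", "r2"), ("e3", "r9")]
truth = [("e1", "r1"), ("e2", "r2"), ("e4", "r4")]
metrics = linkage_metrics(predicted, truth)
print(metrics)
```

In practice the reference set would usually come from a manually reviewed sample rather than the full dataset, so the resulting figures are estimates.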
WU also conducted an adversarial privacy risk assessment on the synthetic data derivative.
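One component named in the overview, the data uniqueness characterization, can be sketched as the share of records that are the only ones with a given combination of quasi-identifiers. The example below is a minimal sketch under assumed field names (`age_band`, `sex`, `zip3`) and made-up records; it is not the assessment WU actually ran, and a real adversarial assessment would also involve external data.

```python
from collections import Counter


def uniqueness_rate(records, quasi_identifiers):
    """Fraction of records whose quasi-identifier combination appears exactly once."""
    keys = [tuple(record[q] for q in quasi_identifiers) for record in records]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(keys) if keys else 0.0


# Hypothetical records; field names and values are illustrative only.
records = [
    {"age_band": "60-69", "sex": "F", "zip3": "631"},
    {"age_band": "60-69", "sex": "F", "zip3": "631"},
    {"age_band": "70-79", "sex": "M", "zip3": "630"},
    {"age_band": "50-59", "sex": "F", "zip3": "633"},
]
rate = uniqueness_rate(records, ["age_band", "sex", "zip3"])
print(rate)
```

A lower rate means fewer records stand alone on those attributes, which generally makes re-identification by matching on them harder.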
ICS' Role
Generate a complete EHR data set from Washington University and BJC HealthCare from July 2021 to June 2022 and facilitate the linkage (through the Regenstrief honest broker) with data from the Missouri Cancer Registry using PPRL. From the linked dataset, formulate a synthetic data derivative using MDClone technology.
Contacts
Internal
Name | Role |
---|---|
Nicole Venteris | PM |
Adam Wilcox | PI |
Snehil Gupta | Technical Lead |
John Newland | Technical |
Philip Payne | PI |
Randi Foraker | PI |
Zachary Abrams | Technical |
External
Name | Role | Contact Info |
---|---|---|
Ken Gersing | Director of Informatics NCATS/DCI | Kenneth.Gersing@nih.gov |
Shaun Grannis | VP Data Analytics, Regenstrief | sgrannis@regenstrief.org |
Brandy Phalora | Senior Program Manager, Regenstrief | bphalora@regenstrief.org |
Chris Beesley | Systems Engineer, Regenstrief | beesleyc@iu.edu |
Jasmine Phua | Head of Government Solutions, Datavant | jas@datavant.com |
Sara Rogovin | Head of Technical Solutions, Datavant | sara@datavant.com |
Iris Zachary | Director Missouri Cancer Registry | zacharyi@health.missouri.edu |
Josh Day | Manager Clinical Programs, BJC HealthCare | josh.day@bjc.org |
Lori Grove | Manager of Oncology Data Services, BJC HealthCare | lori.grove@bjc.org |
Susan Weilmuenster | Supervisory, Oncology Data Services, BJC HealthCare | susan.weilmuenster@bjc.org |
Project Management
Major Tasks & Initiatives
Important Dates & Notes
As of December 2024, this project was completed and the final reports were submitted.
Deliverables
1. Deliverable: WU Clinical Dataset Extraction. WU will create a dataset from its existing electronic health records to serve as the EHR dataset to be linked. WU will also obtain the data as they were submitted from WU to the Missouri Cancer Registry to serve as the registry dataset. These two datasets will be linked using privacy-preserving record linkage (PPRL) approaches.
2. Deliverable: PPRL-matched Dataset. WU will apply and demonstrate the PPRL matching according to the matching validation methodology developed by Regenstrief Institute. The matching will then be tested for validity across the datasets by measuring overall correctness and by reviewing specific performance through a sample of records.
3. Deliverable: Synthetic Derivative Dataset. Using licensed synthetic data software (MDClone), WU will create a synthetic data derivative from the linked dataset and confirm its consistency with the original data through established data characterization techniques.
4. Deliverable: Privacy Assessment. WU will perform a privacy assessment of the synthetic data, using an adversarial privacy risk assessment that includes external data that could potentially be used to match records to individuals. This follows a demonstrated approach for privacy validation of synthetic data.
5. Deliverable: Progress Reports. WU will provide regular progress reports, in collaboration with Regenstrief Institute.
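To make the idea behind the PPRL deliverables concrete, the sketch below shows one common pattern: each site normalizes identifying fields and replaces them with a keyed hash, and only the resulting tokens are compared. This is an assumption-laden illustration, not the tokenization used in this project (the vendor scheme is not described on this page); `normalize`, `make_token`, the choice of fields, the secret key, and all names and dates are hypothetical.

```python
import hashlib
import hmac
import unicodedata


def normalize(value):
    """Uppercase and keep only letters and digits so formatting differences do not break matches."""
    text = unicodedata.normalize("NFKD", value)
    return "".join(ch for ch in text if ch.isalnum()).upper()


def make_token(first_name, last_name, birth_date, secret_key):
    """Build a keyed hash (HMAC-SHA256) over normalized identifiers."""
    message = "|".join([normalize(first_name), normalize(last_name), normalize(birth_date)])
    return hmac.new(secret_key, message.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical shared secret; in a real deployment this would be managed by the honest broker.
key = b"hypothetical-shared-secret"

# Each site tokenizes its own records; only tokens leave the site.
ehr_tokens = {
    "E1": make_token("Ann", "O'Neil", "1950-02-03", key),
    "E2": make_token("Bob", "Smith", "1948-11-30", key),
}
registry_tokens = {
    "R7": make_token("ann", "oneil", "1950-02-03", key),
}

# The linking party matches on tokens without seeing names or birth dates.
registry_by_token = {token: rid for rid, token in registry_tokens.items()}
links = [(eid, registry_by_token[token])
         for eid, token in ehr_tokens.items()
         if token in registry_by_token]
print(links)
```

Exact token matching like this only tolerates differences removed by normalization; production PPRL systems typically combine several tokens or use error-tolerant encodings to handle typos and name changes.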
Standard Meetings
Name | Occurrence | Frequency |
---|---|---|
Working Group | W 8:30 am CT | bi-weekly |
Administrative Details
Tracking Time
Please ensure all work for this project is categorized under the PPRL Synthetic Project project in Tracking Time.
Digital Landmarks
Project Web Pages
Other References
Document Repositories
- PPRL Documents on WUSTL Box
- PPRL Meeting Notes on Google Drive
- PPRL Documents on Google Drive
- PPRL Data Linkage Documents
- PPRL MDClone Data on Box
Technical Information
Guides, Tutorials, & References
Glossary / Acronyms
Term | Definition |
---|---|
MCR | Missouri Cancer Registry |
PPRL | Privacy-Preserving Record Linkage |
N3C | National COVID Cohort Collaborative |
IRB | Institutional Review Board |