Purpose
The purpose of this relational database is to track the MRNs (medical record numbers) accessed by data brokers.
The primary application of the database would be during an audit. It can also serve as a querying platform to obtain details on various kinds of data brokerage tasks.
Example use cases:
- How many MRNs were accessed in the last 11 months?
- Was MRN #xyz retrieved by data brokers for a request?
- If yes, for which DevOps tickets / request IDs was it retrieved?
- How many requests were completed in the past month?
- For an IRB ID / DevOps ticket number, give all the MRNs accessed.
- How many requests did data broker “abc” handle last month?
- What was the query used for a DevOps ticket?
- How long did it take to complete DevOps task #123?
- For a request, give all access dates.
- When did the data broker team last provide data to a PI?
Database structure
The database consists of 4 tables:
1. Request
*This table holds the data associated with an IRB ID. The columns of the table are described in the table below. The column “id” is an autogenerated unique identifier that also serves as the primary key.*
Columns | Data type | Comments |
---|---|---|
id | int4 | Autogenerated ID |
irb_id | varchar | IRB ID associated with the request |
data_broker | varchar | Data broker assigned to the request |
pi_first_name | varchar | First name of the PI |
pi_last_name | varchar | Last name of the PI |
pi_email | varchar | Email ID of the PI |
requestor_first_name | varchar | First name of the requestor, if the person requesting the data is different from the PI |
requestor_last_name | varchar | Last name of the requestor |
requestor_email | varchar | Email ID of the requestor |
request_date | varchar | Request date for the task. Can be used to identify the time taken for completion of a request. |
ticket_system | varchar | System where the ticket for this task was created. Example – DevOps, Jira, REDCap, etc. |
ticket_id | varchar | Ticket number or ID used in DevOps/Jira/REDCap to track this task. Can be the billable epic, the task ID, or the link to the DevOps/Jira ticket. |
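Based on the column list above, the request table could be created with DDL along these lines (a sketch; NOT NULL choices and the use of serial for the autogenerated ID are assumptions):

```sql
-- Sketch of the request table DDL; constraints are assumptions.
CREATE TABLE request (
    id                   serial PRIMARY KEY,   -- autogenerated ID (int4)
    irb_id               varchar NOT NULL,     -- IRB ID associated with the request
    data_broker          varchar,              -- data broker assigned to the request
    pi_first_name        varchar,
    pi_last_name         varchar,
    pi_email             varchar,
    requestor_first_name varchar,
    requestor_last_name  varchar,
    requestor_email      varchar,
    request_date         varchar,              -- stored as text per the schema above
    ticket_system        varchar,              -- DevOps, Jira, REDCap, etc.
    ticket_id            varchar
);
```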
2. request_mrn
This table holds the MRN data associated with an IRB ID. The columns of the table are described in the table below. The column “request_id” is a foreign key that references column “id” from the table “request”.
Columns | Data type | Comments |
---|---|---|
request_id | int4 | Foreign key – reference column is id from table request |
mrn | varchar | MRNs associated with a request |
data_source | varchar | Data source for the MRN. Example – RDC, HDC etc. |
mrn_access_date | Date | The date MRNs were accessed for the request. |
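The request_mrn table could be sketched as follows (NOT NULL choices are assumptions; the foreign key follows the description above):

```sql
-- Sketch of the request_mrn table DDL; constraints are assumptions.
CREATE TABLE request_mrn (
    request_id      int4 NOT NULL REFERENCES request (id),  -- FK to request.id
    mrn             varchar NOT NULL,  -- MRN associated with the request
    data_source     varchar,           -- e.g. RDC, HDC
    mrn_access_date date               -- date the MRN was accessed
);
```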
3. request_query
This table holds the query associated with an IRB ID/project. The columns of the table are described in the table below. The column “request_id” is a foreign key that references column “id” from the table “request”.
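The request_query column list is not reproduced here; a minimal sketch consistent with the foreign-key description (the “query” column name is an assumption, inferred from the CSV header list later in this document):

```sql
-- Sketch only: every column except request_id is an assumption.
CREATE TABLE request_query (
    request_id int4 NOT NULL REFERENCES request (id),  -- FK to request.id
    query      text                                    -- SQL text from the uploaded .sql file
);
```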
4. Log
This table holds the log data for database uploads. The columns of the table are described in the table below. The column “comments” holds the error message generated after a failed database-entry attempt.
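Only the “comments” column is named in this document; one possible shape for the log table (all other columns are assumptions):

```sql
-- Sketch only: all columns except "comments" are assumptions.
CREATE TABLE log (
    id          serial PRIMARY KEY,
    file_name   varchar,                  -- CSV/SQL file that was processed
    upload_date timestamp DEFAULT now(),  -- when the upload attempt ran
    status      varchar,                  -- e.g. SUCCESS / FAILED
    comments    text                      -- error message from a failed attempt
);
```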
Additional table notes
Apart from the 4 tables mentioned above, a staging table called stage_dat is used to temporarily stage all of the data from CSV files before it is entered into the respective tables.
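The staging step could be sketched with PostgreSQL's COPY, followed by a staging-to-final move of the kind the stored procedure performs. This is a sketch only: the file path is illustrative, and the procedure's real signature and logic are not specified in this document.

```sql
-- Sketch: load one CSV into the staging table (path is illustrative).
COPY stage_dat
FROM '/path/to/databroker_name_taskid_pi_name_date.csv'
WITH (FORMAT csv, HEADER true);

-- Sketch of the staging-to-final flow (assumed logic): one request row
-- per file, then its MRN rows from the same staged data.
WITH new_request AS (
    INSERT INTO request (irb_id, data_broker, pi_first_name, pi_last_name,
                         pi_email, requestor_first_name, requestor_last_name,
                         requestor_email, request_date, ticket_system, ticket_id)
    SELECT DISTINCT irb_id, data_broker, pi_first_name, pi_last_name,
                    pi_email, requestor_first_name, requestor_last_name,
                    requestor_email, request_date, ticket_system, ticket_id
    FROM stage_dat
    RETURNING id
)
INSERT INTO request_mrn (request_id, mrn, data_source, mrn_access_date)
SELECT nr.id, s.mrn, s.data_source, s.data_access_date::date
FROM stage_dat s, new_request nr;
```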
System requirements
- Azure PostgreSQL Version - 11.11 (community edition)
- HIPAA compliance, as it will store MRNs
- Production-level system
- Data storage projected over the next 5 years – ~2 GB
- Memory requirement – 4 GB
- Software to send notification emails in case of failed run(s)
Process Flowchart
Data Input (refer to the flowchart above)
After completing a request, the data broker will create a CSV file and SQL file(s) as in the example below:
CSV file with the following column headers:
- irb_id
- pi_first_name
- pi_last_name
- requestor_first_name
- requestor_last_name
- pi_email
- requestor_email
- request_date
- query
- data_broker
- ticket_system
- ticket_id
- MRN
- data_source
- data_access_date
Notes
- The CSV file and the SQL file should follow a naming pattern of “databroker_name_task id_pi_name_date.csv” and “databroker_name_task id_pi_name_date.sql”
- We will use BOX storage space for uploading the CSV and SQL files.
- The Data Broker Admin User will, twice a month, pick the files from BOX and put them in the Azure File Storage space. There will be two folders in the Azure File Storage space allocated for this project – Active and Archive. All active files that need to be uploaded to the DB will be housed in the Active folder, and files already uploaded will be moved to the Archive folder. Files that fail upload due to technical reasons will stay in the Active folder for retry.
- A time triggered Azure Function will pick all the files from the active folder and put content from each one of them into the database.
- This Azure function internally will call a stored procedure named “uploadauditdata” for each CSV file and SQL file pair. This stored procedure will populate the data from CSV files to the “request” and “request_mrn” tables and from the SQL file to the “request_query” table.
- The query column of the CSV file contains the name of the SQL file, prefixed with “FILE:”. Example – “FILE:Eric_h_20201521_05272021.sql”. The program then picks up the content of the SQL file automatically.
- On successful upload of a CSV file, an entry will be added to the “log” table.
- In case of a failed attempt, the cron job will copy the error message from the database and record it, with other details, in the “log” table.
- The program cannot handle non-ASCII (multi-byte UTF-8) characters in the SQL file. These characters should be removed from the input files.
- An email with the error message, date and filename will also be sent in case of a failed run.
- In case of multiple SQL files, the entries in the CSV file should be as follows
Actors
Task | Responsibility |
---|---|
Database Schema | Madhurima / Warren |
Query to create the tables | Madhurima / Warren |
Stored procedure code | Madhurima / Warren |
Cron Job | Warren |
Creation of CSV file for data input | Data brokers |
Creation of SQL file for data input | Data brokers |
Putting the CSV and SQL files in the BOX storage location | Data brokers |
Putting the CSV and SQL files in the Azure file share location | Madhu (Brianna as backup) |
Table creation, update, and maintenance of production DB | Warren |
Querying the Audit database | Warren / Data brokers (read-only access) / Madhurima (read-only access) |
Investigating failed runs | Warren / Madhurima |
Box folder for data upload
Architecture
Example queries for audit
1. How many MRNs were accessed in the last 11 months
select count(distinct mrn) from request_mrn where mrn_access_date > date_trunc('day', NOW() - interval '11 month');
2. Was MRN xyz accessed by us? For which project?
select irb_id, ticket_id, data_broker from request rt, request_mrn rmt where rt.id = rmt.request_id and rmt.mrn = '35664';
3. How many requests were fulfilled in the past month
select count(id) from request rt where id in (select distinct request_id from request_mrn rmt where mrn_access_date > date_trunc('day', NOW() - interval '1 month'));
4. For an IRB ID / DevOps User Story, give all the MRNs
select mrn from request rt, request_mrn rmt where rt.id = rmt.request_id and rt.irb_id = '201810010';
select mrn from request rt, request_mrn rmt where rt.id = rmt.request_id and rt.ticket_id = '25069';
5. How many requests did a data broker handle last month
select count(ticket_id) from request where data_broker = 'Eric Hogue' and request_date::date > now() - interval '1 month';
6. What was the query used for the DevOps ticket 20735
select rq.query from request rt, request_query rq where rt.id = rq.request_id and rt.ticket_id = '20735';
7. For a request, give all access dates
select distinct rmt.mrn_access_date from request rt, request_mrn rmt where rt.id = rmt.request_id and rt.irb_id = '202121234';
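One use case from the list at the top (time taken to complete a DevOps task) has no example query above; a sketch is possible if we assume request_date parses as a date and approximate completion by the latest MRN access date (an assumption, since the schema has no completion-date column):

```sql
-- Sketch: approximate turnaround for DevOps task #123 (ticket_id '123' is illustrative).
-- Assumes request_date is a parseable date string and the last MRN access
-- approximates completion; neither is guaranteed by the schema.
select max(rmt.mrn_access_date) - rt.request_date::date as days_to_complete
from request rt
join request_mrn rmt on rt.id = rmt.request_id
where rt.ticket_id = '123'
group by rt.request_date;
```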
Enhancements for future releases
- Improve the text in the notification emails
- For failed uploads - provide a reason for failure
- For some failed uploads the subject line of email says WARNING. It should say ERROR.
- For successful uploads – provide how many records were uploaded from each file
- For uploads where some files failed and some passed – provide how many records were uploaded from the files that passed, and the reason for failure
- Source of MRN
- A person ID from RDC can be associated with different MRNs in RDC. Hence it is wise to use a separate column called Type of ID in the CSV file and in the database.
- For the projects already in the database, we need to figure out a way to handle the Type of ID values.
- Use drop-down menu options in the CSV template to reduce spelling errors by enforcing values like Source of MRN
- Investigate Azure Active Directory, which could replace the SAS key authentication. Azure Active Directory will be synced with WUSTL Key.