Warehouse Development
RDC v2
The RDC ETL workflow
The workflow begins by executing the RDC ETL notebook, which orchestrates the execution of all ETL processes in the Data Lake.
- General Workflow Overview
- Setup
- Staging: Functions for transforming Clarity data have been converted into individual notebooks
- OMOP: Loads staged data into the OMOP CDM tables
- ETL Workflow
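The stages above run in order for each department. A minimal Python sketch of that orchestration loop is shown below; the stage names and `run_notebook` stand-in are assumptions for illustration (on Databricks the real runner would be something like `dbutils.notebook.run`), not the actual RDC implementation:

```python
# Minimal sketch of the RDC ETL orchestration loop.
# Stage names and run_notebook are illustrative stand-ins.

STAGES = ["setup", "staging", "omop"]

def run_notebook(name: str, params: dict) -> str:
    # Placeholder: a real implementation would invoke the notebook
    # (e.g. dbutils.notebook.run on Databricks) and return its result.
    return f"ran {name} for {params['department_name']}"

def run_etl(department_name: str) -> list[str]:
    """Run every stage of the ETL, in order, for one department."""
    return [
        run_notebook(stage, {"department_name": department_name})
        for stage in STAGES
    ]
```

Each stage completes before the next begins, so a failure in staging stops the department's run before the OMOP load starts.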
Control Tables
- config_connections
- Connection information for every database used in the ETLs
- config_main
- Contains a record for each department_name
- Each record has values for
- last_run_date
- run_schedule (cron string)
- active (true/false - whether this department is enabled for processing)
- currently_running (true only while the ETL is running)
- job_id (the job_id of the most recent run)
- run_type (full, ingress, transform)
- target_db_connection_id (points to the database connection info in config_connections)
- reuse_job_id (typically used for reprocessing a run)
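Taken together, these fields gate whether a department is picked up on a given pass. A sketch of that gating logic, using field names from the list above (the record shape is an assumption, and the cron evaluation is reduced to a boolean for illustration):

```python
from dataclasses import dataclass

@dataclass
class ConfigMainRecord:
    # Illustrative subset of the config_main fields listed above.
    department_name: str
    last_run_date: str
    active: bool
    currently_running: bool
    run_type: str  # "full", "ingress", or "transform"

def should_run(rec: ConfigMainRecord, schedule_due: bool) -> bool:
    """A department runs only if it is active, not already mid-run,
    and its run_schedule (cron) says it is due."""
    return rec.active and not rec.currently_running and schedule_due
```

Because currently_running should only be true while an ETL is in flight, this check also prevents overlapping runs for the same department.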
- config_ingestion_artifacts (used in full and ingress run types)
- Lists the schemas and tables for each department in config_main
- delta_rules (used in full, ingress and transform run types)
- Records contain the
- etl_operation (ingress or omop, AKA transform)
- source and target tables
- list of fields in the source table (NOTE: in some ETLs this list is being replaced by field names extracted from the information schema)
- primary key of the source table
- is_enabled indicator for including the transform process
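The source/target tables, field list, and primary key on a delta_rules record are enough to drive an upsert. A sketch of building a SQL MERGE from one record (illustrative only; the real ETL's SQL generation is not shown in this document):

```python
def build_merge_sql(source: str, target: str, fields: list[str], pk: str) -> str:
    """Build an upsert-style MERGE from one delta_rules record.
    NOTE: in some ETLs the field list is pulled from the information
    schema instead of being stored on the rule."""
    set_clause = ", ".join(f"t.{f} = s.{f}" for f in fields if f != pk)
    cols = ", ".join(fields)
    vals = ", ".join(f"s.{f}" for f in fields)
    return (
        f"MERGE INTO {target} t USING {source} s ON t.{pk} = s.{pk} "
        f"WHEN MATCHED THEN UPDATE SET {set_clause} "
        f"WHEN NOT MATCHED THEN INSERT ({cols}) VALUES ({vals})"
    )
```

An is_enabled of false would simply skip generating and running this statement for that rule.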
- stage_rules (used in full and transform run types)
- Handles the priority and is_enabled for staging functions (notebooks), e.g. epic_clarity_allergy
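In effect, stage_rules yields an ordered list of staging notebooks to run. Filtering on is_enabled and ordering by priority could look like the sketch below (the record shape and the "lower number runs first" convention are assumptions):

```python
def staging_run_order(rules: list[dict]) -> list[str]:
    """Return enabled staging notebooks, lowest priority number first.
    Each rule is assumed to carry notebook, priority, and is_enabled."""
    enabled = [r for r in rules if r["is_enabled"]]
    return [r["notebook"] for r in sorted(enabled, key=lambda r: r["priority"])]
```

Disabling a notebook in stage_rules removes it from the run without touching the notebook itself.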
Rule Tables
- concept_rules
- nk_rules