Warehouse Development

RDC v2

The RDC ETL workflow

The workflow starts with the RDC ETL notebook, which orchestrates all ETL processes in the Data Lake.

  • General Workflow Overview
    • Setup
    • Staging: Transforms Clarity data; each former transformation function is now its own notebook
    • OMOP: Loads staged data into the OMOP CDM tables
  • ETL Workflow

Control Tables

  • config_connections
    • Connection information for every database used in the ETLs
  • config_main
    • Contains a record for each department_name
    • Each record has values for
      • last_run_date
      • run_schedule (cron string)
      • active (true/false; whether this department is processed)
      • currently_running (should be true only while the ETL is running)
      • job_id (job_id of the most recent run)
      • run_type (full, ingress, transform)
      • target_db_connection_id (points to the database connection info in config_connections)
      • reuse_job_id (typically used for reprocessing a run)
  • config_ingestion_artifacts (used in full and ingress run types)
    • Lists the schemas and tables for each department in config_main
  • delta_rules (used in full, ingress and transform run types)
    • Each record contains
      • etl_operation: ingress or omop (omop is also known as transform)
      • source and target tables
      • list of fields in the source table (NOTE: in some of the ETLs this list is being replaced by field names extracted from the information schema)
      • primary key of the source table
      • is_enabled flag indicating whether the transform process is included
  • stage_rules (used in full and transform run types)
    • Sets the priority and is_enabled flag for each staging function (notebook), e.g. epic_clarity_allergy
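
To make the relationships between the control tables concrete, here is a minimal Python sketch. It is illustrative only and is not the RDC ETL notebook's actual code. The run_type-to-table mapping is taken directly from the list above. Everything else is an assumption: the class names (ConfigMainRow, StageRule), the helper functions, the column types, the notebook column name, and the rule that a lower priority value runs first. Evaluation of the run_schedule cron string is deliberately left out.

```python
from dataclasses import dataclass
from typing import List, Optional

# Control tables read by each run_type, as described in the list above.
CONTROL_TABLES_BY_RUN_TYPE = {
    "full": ["config_ingestion_artifacts", "delta_rules", "stage_rules"],
    "ingress": ["config_ingestion_artifacts", "delta_rules"],
    "transform": ["delta_rules", "stage_rules"],
}


@dataclass
class ConfigMainRow:
    """Hypothetical in-memory view of one config_main record (types assumed)."""
    department_name: str
    active: bool
    currently_running: bool
    run_type: str                      # "full", "ingress", or "transform"
    target_db_connection_id: int       # key into config_connections (type assumed)
    job_id: Optional[str] = None       # last job_id that was run
    reuse_job_id: Optional[str] = None # set when reprocessing a run


@dataclass
class StageRule:
    """Hypothetical view of one stage_rules record."""
    notebook: str      # assumed column name, e.g. "epic_clarity_allergy"
    priority: int
    is_enabled: bool


def departments_to_process(rows: List[ConfigMainRow]) -> List[ConfigMainRow]:
    """Keep departments that are active and not already mid-run."""
    return [r for r in rows if r.active and not r.currently_running]


def staging_order(rules: List[StageRule]) -> List[str]:
    """Return enabled staging notebooks, assuming lower priority runs first."""
    enabled = [r for r in rules if r.is_enabled]
    return [r.notebook for r in sorted(enabled, key=lambda r: r.priority)]


if __name__ == "__main__":
    # Department names below are placeholders, not real config_main values.
    rows = [
        ConfigMainRow("dept_a", True, False, "full", 1),
        ConfigMainRow("dept_b", False, False, "ingress", 1),
        ConfigMainRow("dept_c", True, True, "transform", 2),
    ]
    for row in departments_to_process(rows):
        print(row.department_name, CONTROL_TABLES_BY_RUN_TYPE[row.run_type])
```

In this sketch, a department whose currently_running flag is stuck at true is skipped, which is why that flag should be true only while the ETL is running.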

Rule Tables

  • concept_rules
  • nk_rules

Common Tasks


Updated on August 7, 2025