MDClone Monthly Extract Guide - Work in progress!

Requirements

1. Access to MDClone ETL host (10.25.13.19 via RDP)
2. WUSTL Key email added to the Talend notification list

Prepare for monthly extract

1. Send notification to users 48 hours ahead of time for the full weekend window starting Friday at 5pm

    This is done by reaching out to the I2 Help Desk and requesting that a notification be sent for the Monthly MDClone Extract

2. Review Hadoop storage capacity and system health via Cloudera

3. Prepare the job and hierarchy settings files for the regular run

    In both files, job_is_Active and main_Set must be set to true to trigger a job

    For the first run, turn off the demographics job

4. Back up the demographics event table in Impala via Cloudera (storage permitting):
CREATE TABLE datalakedb_new.ev_demographics_event_<placeholder for initials>_<placeholder for date> LIKE datalakedb_new.ev_demographics_event;
INSERT INTO datalakedb_new.ev_demographics_event_<placeholder for initials>_<placeholder for date> PARTITION (partition_source)
    SELECT * FROM datalakedb_new.ev_demographics_event;
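After the insert completes, a quick sanity check is to compare row counts between the live table and the backup; the counts should match. (The suffix below is the same placeholder used above — substitute the actual initials and date.)

```sql
-- Row counts of source and backup should be identical.
SELECT COUNT(*) FROM datalakedb_new.ev_demographics_event;
SELECT COUNT(*) FROM datalakedb_new.ev_demographics_event_<placeholder for initials>_<placeholder for date>;
```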

Perform monthly extract

1. Confirm the correct job and hierarchy settings files are configured for the regular run; use:

    jobs_settings_new_format_main_prod_incremental.csv

    hierarchy_settings_new_format_main_prod_incremental.csv

2. Confirm context settings and selection for the Master Job 3.6 in the Talend job designs folder

    Navigate to 'C:\ETL\WashU Main\Talend\settings_incremental' in Windows Explorer and review:

        - jobs_settings_new_format_main_prod_incremental_PART1.csv
        - jobs_settings_new_format_main_prod_incremental_PART2.csv

    Pay close attention to:

        - JDBC Connection URL
        - hierarchy_settings_csv_file_name
        - job_settings_csv_file_name

3. Launch the extract by clicking Run in the Talend application
4. Monitor the job via email and Postgres (DBeaver):
-- This query is for use on the Talend localhost instance of postgres (ETL.public schema)
select
    master_id,
    process_name,
    sub_process_name,
    job_type,
    table_name,
    src_table_name,
    end_time - start_time as duration,
    start_time,
    end_time,
    status
from
    talend.vw_mdc_etl_loading_status
where
    start_time between '2024-10-01 00:00:00' and '2024-12-20 23:59:00'
    and process_name like '%- Prod'
    and process_name <> 'Solr Monitoring'
    and process_name <> 'Maintenance'
order by
    start_time desc;
-- This query is for use on the Cloudera postgres instance (postgres.mdclone schema)
SELECT
    run_id,
    process_name,
    subprocess_name,
    end_time - start_time AS duration,
    start_time,
    end_time,
    run_status,
    run_message
FROM
    common.mdc_run_history
WHERE
    start_time between '2024-10-01 00:00:00' and '2024-12-20 23:59:00'
    --and process_name like '%Prod%'
    and process_name <> 'Solr Monitoring'
    and process_name <> 'Maintenance'
ORDER BY
    start_time desc;
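As a quick failure check, the Cloudera-side history table above can be filtered down to non-successful runs. The 'SUCCESS' status literal below is an assumption; confirm the actual run_status values present in common.mdc_run_history before relying on it.

```sql
-- Show only runs that did not complete successfully
-- ('SUCCESS' is an assumed status value; verify against real data).
SELECT
    process_name,
    subprocess_name,
    run_status,
    run_message,
    start_time
FROM
    common.mdc_run_history
WHERE
    start_time >= '2024-10-01 00:00:00'
    and run_status <> 'SUCCESS'
ORDER BY
    start_time desc;
```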
5. Prepare the job file for the demographics and back fill runs

    (Discussion: what needs to be back filled?)

6. Set the hierarchy file to off, use:

    hierarchy_settings_new_format_main_off_incremental.csv

7. Launch the extract for the demographics and back fill run from Talend
8. Monitor the job via email and Postgres (DBeaver) as in step 4

Validate monthly extract

1. Review CPATool

    http://localhost:8787/dashboard and Postgres (process_name like '%cpa%')

    Where CPA tests fail completely, it is faster to rerun them from the Management tab

    Technical errors should be rerun; for these, time and experience will inform which known errors can be ignored
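Adapting the Talend-side monitoring query from the previous section (an assumption here is that CPA runs are logged in the same talend.vw_mdc_etl_loading_status view, and that failures are anything other than a 'SUCCESS' status — check the actual values in the view):

```sql
-- List CPA test runs that did not succeed in the extract window
-- (dates are examples; 'SUCCESS' is an assumed status value).
select
    process_name,
    sub_process_name,
    table_name,
    status,
    start_time,
    end_time
from
    talend.vw_mdc_etl_loading_status
where
    process_name like '%cpa%'
    and status <> 'SUCCESS'
    and start_time between '2024-10-01 00:00:00' and '2024-12-20 23:59:00'
order by
    start_time desc;
```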

2. Send notification to users

Updated on August 7, 2025