Requirements
1. Access to MDClone ETL host (10.25.13.19 via RDP)
2. WUSTL_KEY Email added to Talend notification list
Preparation
1. Send notification to users 48 hours ahead of time for the full weekend run starting Friday at 5pm
This is done by reaching out to the I2 Help Desk and requesting that a notification be sent for the Monthly MDClone Extract
2. Review Hadoop storage capacity and system status via Cloudera
3. Prepare the job and hierarchy settings files for the regular run
Both fields job_is_Active and main_Set need to be set to true in order to trigger a job (see the illustrative rows below)
For the first run, turn off the demographics job
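A purely hypothetical illustration of those two flags (the actual column names, column order, and job list are defined by the settings CSVs referenced later in this runbook); here demographics is off for the first run, while a job with both flags true will trigger:

job_name,job_is_Active,main_Set
demographics,false,true
encounters,true,true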
4. Back up the demographics event table in Impala via Cloudera (storage permitting):
CREATE TABLE datalakedb_new.ev_demographics_event_<placeholder for initials>_<placeholder for date> LIKE datalakedb_new.ev_demographics_event;
INSERT INTO datalakedb_new.ev_demographics_event_<placeholder for initials>_<placeholder for date> PARTITION (partition_source)
SELECT * FROM datalakedb_new.ev_demographics_event;
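Before the copy, a table-stats check helps answer the "storage permitting" question; after the copy, matching row counts confirm the backup is complete. A minimal sketch, assuming Impala:

-- Before the copy: review source table size and partition stats (Impala)
SHOW TABLE STATS datalakedb_new.ev_demographics_event;
-- After the copy: row counts of source and backup should match
SELECT COUNT(*) FROM datalakedb_new.ev_demographics_event;
SELECT COUNT(*) FROM datalakedb_new.ev_demographics_event_<placeholder for initials>_<placeholder for date>;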
Execution
1. Confirm the correct job and hierarchy settings files are configured for the regular run; use:
jobs_settings_new_format_main_prod_incremental.csv
hierarchy_settings_new_format_main_prod_incremental.csv
2. Confirm context settings and selection for the Master Job 3.6 in the Talend job designs folder
Navigate to the 'C:\ETL\WashU Main\Talend\settings_incremental' folder in Windows Explorer and review:
- jobs_settings_new_format_main_prod_incremental_PART1.csv
- jobs_settings_new_format_main_prod_incremental_PART2.csv
Pay close attention to the following (illustrative examples appear after this list):
- JDBC Connection URL
- hierarchy_settings_csv_file_name
- job_settings_csv_file_name
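These values are environment-specific. The shapes below are hypothetical examples only; the Impala JDBC URL format and host placeholder are assumptions, while the two file names are the ones named earlier in this runbook:

-- Hypothetical shapes of the values under review; substitute the production host/port/database
-- JDBC Connection URL:              jdbc:impala://<cloudera-host>:21050/datalakedb_new
-- hierarchy_settings_csv_file_name: hierarchy_settings_new_format_main_prod_incremental.csv
-- job_settings_csv_file_name:       jobs_settings_new_format_main_prod_incremental.csv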
3. Launch the extract by clicking Run in the Talend application
4. Monitor the job via email and Postgres (DBeaver)
-- This query is for use on the Talend localhost instance of postgres (ETL.public schema)
select
    master_id,
    process_name,
    sub_process_name,
    job_type,
    table_name,
    src_table_name,
    end_time - start_time as duration,
    start_time,
    end_time,
    status
from
    talend.vw_mdc_etl_loading_status
where
    start_time between '2024-10-01 00:00:00' and '2024-12-20 23:59:00'
    and process_name like '%- Prod'
    and process_name <> 'Solr Monitoring'
    and process_name <> 'Maintenance'
order by
    start_time desc;
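Optionally, a rollup by status over the same window can surface failures at a glance; a sketch reusing the view and filters above (the possible status values are site-specific):

-- Rollup of job outcomes over the same window; status values are site-specific
select
    status,
    count(*) as job_count,
    max(end_time) as last_end_time
from
    talend.vw_mdc_etl_loading_status
where
    start_time between '2024-10-01 00:00:00' and '2024-12-20 23:59:00'
    and process_name like '%- Prod'
    and process_name <> 'Solr Monitoring'
    and process_name <> 'Maintenance'
group by
    status
order by
    job_count desc;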
-- This query is for use on the Cloudera postgres instance (postgres.mdclone schema)
SELECT
    run_id,
    process_name,
    subprocess_name,
    end_time - start_time AS duration,
    start_time,
    end_time,
    run_status,
    run_message
FROM
    common.mdc_run_history
WHERE
    start_time between '2024-10-01 00:00:00' and '2024-12-20 23:59:00'
    --and process_name like '%Prod%'
    and process_name <> 'Solr Monitoring'
    and process_name <> 'Maintenance'
ORDER BY
    start_time desc;
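To spot long-running steps on the Cloudera side, the same history table can be sorted by duration; a sketch, where the 2-hour threshold is an assumption to tune from experience:

-- Longest-running steps in the window; the 2-hour threshold is an assumption
SELECT
    run_id,
    process_name,
    subprocess_name,
    end_time - start_time AS duration
FROM
    common.mdc_run_history
WHERE
    start_time between '2024-10-01 00:00:00' and '2024-12-20 23:59:00'
    and end_time - start_time > interval '2 hours'
ORDER BY
    duration DESC;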
5. Prepare the job file for the demographics and backfill runs
(to be discussed: what needs to be backfilled?)
6. Set the hierarchy file to off; use:
hierarchy_settings_new_format_main_off_incremental.csv
7. Launch the extract for the demographics and backfill run from Talend
8. Monitor the job via email and Postgres (DBeaver) as in step 4
Post-Run
1. Review CPATool
Check http://localhost:8787/dashboard and postgres with process_name like '%cpa%' (see the query below)
Where CPA tests fail completely, it is faster to rerun them from the Management tab
Technical errors should be rerun; time and experience will inform which known errors can be ignored
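A sketch of the CPA review query against the Cloudera postgres run history, reusing common.mdc_run_history from the monitoring step; the case-insensitive ILIKE match is an assumption based on the '%cpa%' note above:

-- CPA runs from the run history; ILIKE makes the '%cpa%' match case-insensitive
SELECT
    run_id,
    process_name,
    subprocess_name,
    run_status,
    run_message,
    start_time
FROM
    common.mdc_run_history
WHERE
    process_name ILIKE '%cpa%'
ORDER BY
    start_time DESC;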
2. Send notification to users