TPI Transition Outline
General Topics
- transparency to the larger team (6/30?)
- what are Chris' day-to-day tasks
- someone should shadow all of Chris' tasks
- direct DB/DW to us
- who to shadow: Niel, Ian, and Suhas
- Chris to send all requests to PE
- PE to field and complete the request, with Chris' oversight if needed
- What does DW need from Chris
- Meet with Snehil to get her list
- What does DB need from Chris
- Ask Bri what Chris does for them
- Onboarding users
- Adding new tables/data sources
- Troubleshooting existing ETLs
- Chris to assist with moving off of NiFi
- John & Rob validation of OMOP?
- TimeTracker knowledge transfer for time spent
Things to Document
Below is a list of things we need to ensure are documented before TPI transitions away from ICS
- a day in the life of Chris
- admin / maintenance tasks
- databasin troubleshooting guide - Ian
- how to check pipelines and automations in Azure
- how to troubleshoot Clarity ETLs
- how to deploy changes to curate schemas - Ian
- sandbox vs cleansed vs curated schemas and when to use which - Ian
- In progress
- billing process (in progress)
- review and expand
- include storage billing info
- how to make changes to RDC & Datalake architecture - Dave0
- what repos are most important
- what can be retired
- what do we not know
- inventory storage accounts
- document RDC (OMOP) deployment process in databricks
- Suhas?
- BJC neural frame process
- data broker auditing details - Chris
- legacy and databasin auditing
- external/non-databasin processes
- how to onboard new data sources - Niel
- document details on delta load options - Chris (see the sketch after this list)
- working with files and tables
- Databasin best practices - Chris
- file based ingestion
- How to request Databasin support - Ian
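For the delta-load item above, here is a minimal sketch of one common option: a MERGE from freshly landed files into a Delta table. This is illustrative only; the storage path, key column, and table name are hypothetical, and the actual Databasin pipelines may load differently.

    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    # On Databricks the `spark` session already exists; this is only needed standalone.
    spark = SparkSession.builder.getOrCreate()

    # Hypothetical landing location for the latest extract.
    updates = spark.read.parquet("abfss://landing@example.dfs.core.windows.net/source_x/latest/")

    # Upsert into a hypothetical curated Delta table keyed on record_id.
    target = DeltaTable.forName(spark, "curated.source_x")
    (
        target.alias("t")
        .merge(updates.alias("s"), "t.record_id = s.record_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )

Append-only writes and COPY INTO are the other typical options; the write-up should note which one each source actually uses.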
Tasks to Complete
Below is a list of things we need TPI to complete before they transition away from ICS
- Verantos - hand off to PE & DW
- EPIC - assign an ICS rep that is Tier5 and ingests data
- Loop in Ian, Niel, and Snehil
- GPC - Should probably try to get that to run in under 24 hours.
- Niel and John N
- In progress, still tweaking performance
- John to try it with June data with performance updates
- ArcGIS - All the logic is done, but it needs to live in its own Catalog. Will be easy. - Ian, Niel
- Finalizing documentation - done
- Switch target schema
- Schedule to run 1/week (at least)
- Shape file ingestion, ad-hoc task once a year
- Move sparc to Workflow
- Dev hour task
- BJC neural frame incremental load code changes/documentation
- pending BJC SFTP location, Chris to request SFTP site
- databasin file ingestion, snapshot
- automation job pulls ingested data and inserts it into BJC Synapse (see the JDBC sketch at the end of this list)
- Button up this billing process, so we can start charging. - Ian, Niel
- completed storage script
- WUSM azure storage analysis (notebook name)
- add tags to script
- move to wusm-data-lake-automations
- verify it is crawling the metastore
- move all billing notebooks into monorepo/Billing
- Group, permissions, and schema clean up
- OMOP, Clarity, etc.
- IMPORTANT: We/I need to move the old databricks jobs to databasin. Most of these are "file drop" jobs. - Ian, Niel, and Jack
- $0 marketplace install, approval by Amy
- Move code for notebooks, etc. that are currently used in workflows and pipelines into a repo
- find all existing/legacy notebooks / workflows
- add some docs about each
- same as billing process, but all other "administrative" automated tasks
- remove anything that is no longer needed
- RDC migration - Suhas, Snehil, Shinji?
- PE Code review
- Provide any guidance on issues we find
- Tempus changes?
- Move to repository
- Get RAW data from SFTP https://databasin.wustl.edu/projects/uJEj8d/pipelines/70 (adls.file_tempus)
- Chris made changes in code already
- Document tempus process/code
- waiting on Snehil for direction, key not being parsed correctly by Spark
- clinical trials key, maybe case sensitive now? Need to re-run all Tempus data provided to date
- Onboard PE to Zoho for support
- Niel, Nicole, and Ian for now
- Document process to request support
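For the BJC neural frame item above ("automation job pulls ingested data and inserts it into BJC Synapse"), a rough sketch of what that write could look like as a plain JDBC append. Server, database, table, and credentials are placeholders; the real job may use the dedicated Azure Synapse connector instead, and secrets should come from a secret scope rather than literals.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # predefined on Databricks

    # Read the snapshot that the Databasin file ingestion landed (path is hypothetical).
    snapshot = spark.read.format("delta").load(
        "abfss://ingest@example.dfs.core.windows.net/bjc_neural_frame/"
    )

    # Append into the BJC Synapse pool over JDBC (all connection details are placeholders).
    (
        snapshot.write.format("jdbc")
        .option("url", "jdbc:sqlserver://example-synapse.sql.azuresynapse.net:1433;database=bjc")
        .option("dbtable", "dbo.neural_frame_incremental")
        .option("user", "svc_loader")
        .option("password", "<from-secret-scope>")
        .mode("append")
        .save()
    )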
Training for the team
- Databricks administration / best practices
- Spark
- INSERT ... VALUES will not work; mass insert via Spark or through a staging location instead (see the sketch after this list)
- Packer for devops build agent
- add azcopy
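On the Spark note above: row-by-row INSERT ... VALUES does not scale, so bulk loads should go through a DataFrame write or a staged file. A minimal sketch with hypothetical paths and table names:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # predefined on Databricks

    # Instead of looping INSERT INTO ... VALUES, read the whole extract from a
    # staging location and append it in one distributed write.
    staged = spark.read.option("header", "true").csv(
        "abfss://staging@example.dfs.core.windows.net/extract/"
    )
    staged.write.format("delta").mode("append").saveAsTable("sandbox.staged_extract")

The SQL-only equivalent on Databricks is COPY INTO <table> FROM '<staging path>' FILEFORMAT = CSV, which also avoids per-row inserts.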
DW & DB Needs
Brokerage
- help understanding automated tasks
- audit database
- table ingestion, mostly done
- databasin
Data warehouse
- databasin
- Epic Clarity pipeline
- intake/onboarding
- initial meeting/discovery
- data sources
- teams
- cluster scaling
- 7GB / 5 hours timeout