SmartCumulus Technology Overview
Home page: https://docs.smarthealthit.org/cumulus/
- Host machine running Docker, deployed to Azure
- Hosted in Azure to allow for long-running processes
- Required a 10.x IP address and VNet peering to allow SSH access to the VM
- Multiple Docker images running on the host machine
  - Bulk FHIR client container
  - Cumulus ETL Compose containers
  - Cumulus Library (pip package on the host; see Library below)
Process Overview
- Load the virtual machine with the Cumulus tools/containers
- Work with the Epic team to register the Cumulus tools as a client of the FHIR API
- Generate keys for the applications to use during authentication
- Set up tool configurations
  - Set the client app credentials registered with the Epic team
- Run the data export via bulk FHIR
- Run the ETL process via the ETL containers
- Build and upload datasets using the Cumulus Library (a condensed sketch of the full flow follows this list)
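- A condensed sketch of the flow, using commands detailed in the sections below:
docker run -v /data-share:/data -it bulk-fhir-client   # 1. export from Epic via the bulk client (see Running Exports)
sh /data-share/run-etl.sh                              # 2. de-identify and load via Cumulus ETL (see Cumulus ETL)
# 3. build and upload datasets with the pip-installed Cumulus Library (see Library)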
Virtual Machine
- Runs the Docker engine
- Mounts file shares for config and data storage
  - Mount point is /data-share on the host machine (see the example mount entry below)
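- Azure file shares are typically mounted over CIFS; a hypothetical /etc/fstab entry for the mount (storage account, share name, and credentials path are placeholders, not taken from this deployment):
//<storage account>.file.core.windows.net/<share> /data-share cifs nofail,credentials=/etc/smbcredentials/<storage account>.cred,dir_mode=0777,file_mode=0777,serverino 0 0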
Software
Bulk Client
https://github.com/smart-on-fhir/bulk-data-client
https://hl7.org/fhir/uv/bulkdata/
- Used to export data from Epic using the Bulk FHIR API
- Required configuration by the Epic team
  - Client ID and secret in App Orchard
  - Group (registry) creation and group ID
- The bulk client requires a JWKS key to be passed with requests
  - Must generate a PEM file and then convert it to JWKS (see the key-generation sketch after this list)
  - See scripts/create-keys.sh, scripts/generate_jwks_from_pem.py, and scripts/convert_pem_jwks.py
  - The ETL uses the .jwks file and the bulk client uses the jwks.json file
- The container mounts host /data-share to /data
- Uses the /data/config/bjc-prod-config.js configuration file
- See /data-share/run-export.sh for examples of running exports from the container
- The standalone bulk client may no longer be needed, as Cumulus ETL now supports bulk export; however, the steps to work with the Epic team will be the same or similar when registering the Cumulus ETL tool instead of the Bulk Data Client
  - https://docs.smarthealthit.org/cumulus/etl/bulk-exports.html
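- A minimal key-generation sketch; the PEM file name is illustrative, and the helper script's arguments are not shown here (see the scripts themselves for the exact conversion steps):
openssl genrsa -out /data-share/config/cumulus-etl-prd.pem 4096   # generate the RSA private key in PEM form
python3 scripts/generate_jwks_from_pem.py   # converts the PEM to JWKS; see the script for its usage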
Running Exports
- Start the container
docker run -v /data-share:/data -it bulk-fhir-client
- Run this from the /app folder inside the container
node . --config /data/config/bjc-prod-config.js -d /data/<export dir> -t <comma list of domains to export> --reporter cli
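- For example, a hypothetical export of a few core domains (the export directory and domain list are illustrative):
node . --config /data/config/bjc-prod-config.js -d /data/export-2024 -t Patient,Encounter,Condition --reporter cli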
Cumulus ETL
- Used to transform and load the data extracted with the bulk client
- Identifiable data stayed in the Azure storage account, and the de-identified data was uploaded to the AWS S3 bucket
  - Some PHI went to a secured S3 bucket to enable patient linking
    - This data was not shared with the study team
  - The de-identified data was then shared with the study team using the Cumulus Library tool
https://github.com/smart-on-fhir/cumulus-etl
https://docs.smarthealthit.org/cumulus/etl/
- Build the ETL images:
docker compose -f compose.yaml --profile etl build
- Run the ETL via the helper script:
sh /data-share/run-etl.sh
- run-etl.sh runs a command of the form:
docker compose --env-file /data-share/config/.cumulus-etl.env run --remove-orphans --rm --volume /data-share:/data \
cumulus-etl \
--errors-to=/data/etl-errors/2024-06-06.1 \
--task condition,encounter,patient,servicerequest \
--fhir-url=https://epicproxy.et0965.epichosted.com/OAuth2-PRD/api/FHIR/R4 \
--smart-client-id=213f3c39-76d0-4d98-a42c-02456afa13c3 \
--smart-jwks=/data/config/cumulus-etl-prd.jwks \
--input-format=ndjson --output-format=deltalake --batch-size=300000 \
/data/2024 \
s3://cumulus-510155166665-us-east-1/patient-output/ \
s3://cumulus-phi-510155166665-us-east-1/patient-output/
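- To spot-check that de-identified output landed in the bucket named above (uses the AWS CLI credentials described under Library):
aws s3 ls s3://cumulus-510155166665-us-east-1/patient-output/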
Library
https://github.com/smart-on-fhir/cumulus-library
https://docs.smarthealthit.org/cumulus/library/first-time-setup.html
- Installed via pip on the host machine
- Used the local AWS CLI credentials to connect to S3 and other AWS resources
  - The local credentials were for a service account created in AWS and were stored in a .credentials file
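- A minimal install-and-explore sketch; the pip package name is from the repo above, and subcommands and flags should be checked against the installed version's help:
pip install cumulus-library
cumulus-library --help   # list the subcommands available in the installed version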
Azure
- Resource group: wusm-prod-rg-cumulus
- Templates and more information are located in azure-templates
AWS
- Installed the AWS CLI to config/AWS on the host machine
- Cumulus AWS setup guide
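- The CLI reads a standard AWS credentials file; a hypothetical layout for the service-account keys (values are placeholders, and the exact file location under config/AWS is an assumption):
[default]
aws_access_key_id = <service account access key>
aws_secret_access_key = <service account secret key>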
Note
You may need additional rights granted to the AWS account from WUIT to complete the configuration.
Project Challenges
- Epic's implementation of bulk FHIR
  - Created issues when exporting Encounters and DocumentReferences
  - Did not provide a way to filter data by time frame
    - Had to rely on a 7-day rolling window of ED patients, defined via an Epic registry, to set our data range
Other Information
FHIR Crawler
- Used only when/if the Bulk Data Client failed
https://github.com/smart-on-fhir/fhir-crawler
Running
docker build . -t fhir-crawler:latest
docker run -v /data-share/crawler2024/:/app/volume fhir-crawler:latest