SmartCumulus Technology Overview
Home page: https://docs.smarthealthit.org/cumulus/
- Host machine running Docker, deployed to Azure
- Hosted in Azure to allow for long-running processes
- Required a 10.x IP address and VNet peering to allow SSH access to the VM
- Multiple Docker images running on the host machine
  - Bulk FHIR client container
  - Cumulus ETL Compose containers
  - Cumulus Library (pip package on the host; see Library below)
Process Overview
- Load the virtual machine with the Cumulus tools/containers
- Work with the Epic team to register the Cumulus tools as a client of the FHIR API
- Generate keys for the applications to use during authentication
- Set up tool configurations
  - Set the client app credentials registered with the Epic team
- Run the data export via bulk FHIR
- Run the ETL process via the ETL containers
- Build and upload datasets using the Cumulus Library (a condensed sketch of the full flow follows this list)
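- A condensed sketch of the flow, using commands detailed in the sections below:
docker run -v /data-share:/data -it bulk-fhir-client   # 1. export from Epic via the bulk client (see Running Exports)
sh /data-share/run-etl.sh                              # 2. de-identify and load via Cumulus ETL (see Cumulus ETL)
# 3. build and upload datasets with the pip-installed Cumulus Library (see Library)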
Virtual Machine
- Runs the Docker engine
- Mounts file shares for config and data storage
  - Mount point is /data-share on the host machine (see the example mount entry below)
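- Azure file shares are typically mounted over CIFS; a hypothetical /etc/fstab entry for the mount (storage account, share name, and credentials path are placeholders, not taken from this deployment):
//<storage account>.file.core.windows.net/<share> /data-share cifs nofail,credentials=/etc/smbcredentials/<storage account>.cred,dir_mode=0777,file_mode=0777,serverino 0 0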
Software
Bulk Client
https://github.com/smart-on-fhir/bulk-data-client
https://hl7.org/fhir/uv/bulkdata/
- Used to export data from Epic using the Bulk FHIR API
- Required configuration by the Epic team
  - Client ID and secret in App Orchard
  - Group (registry) creation and group ID
- The bulk client requires a JWKS key to be passed with requests
  - Must generate a PEM file and then convert it to JWKS (see the key-generation sketch after this list)
  - See scripts/create-keys.sh, scripts/generate_jwks_from_pem.py, and scripts/convert_pem_jwks.py
  - The ETL uses the .jwks file and the bulk client uses the jwks.json file
- The container mounts host /data-share to /data
- Uses the /data/config/bjc-prod-config.js configuration file
- See /data-share/run-export.sh for examples of running exports from the container
- The standalone bulk client may no longer be needed, as Cumulus ETL now supports bulk export; however, the steps to work with the Epic team will be the same or similar when registering the Cumulus ETL tool instead of the Bulk Data Client
  - https://docs.smarthealthit.org/cumulus/etl/bulk-exports.html
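- A minimal key-generation sketch; the PEM file name is illustrative, and the helper script's arguments are not shown here (see the scripts themselves for the exact conversion steps):
openssl genrsa -out /data-share/config/cumulus-etl-prd.pem 4096   # generate the RSA private key in PEM form
python3 scripts/generate_jwks_from_pem.py   # converts the PEM to JWKS; see the script for its usage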
Running Exports
- Start the container
docker run -v /data-share:/data -it bulk-fhir-client
- Run this from the /app folder inside the container
node . --config /data/config/bjc-prod-config.js -d /data/<export dir> -t <comma list of domains to export> --reporter cli
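- For example, a hypothetical export of a few core domains (the export directory and domain list are illustrative):
node . --config /data/config/bjc-prod-config.js -d /data/export-2024 -t Patient,Encounter,Condition --reporter cli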
Cumulus ETL
- Used to transform and load the data extracted with the bulk client
- Identifiable data stayed in the Azure storage account, and the de-identified data was uploaded to the AWS S3 bucket
  - Some PHI went to a secured S3 bucket to enable patient linking
    - This data was not shared with the study team
  - The de-identified data was then shared with the study team using the Cumulus Library tool
https://github.com/smart-on-fhir/cumulus-etl
https://docs.smarthealthit.org/cumulus/etl/
- Build the ETL images:
docker compose -f compose.yaml --profile etl build
- Run the ETL via the helper script:
sh /data-share/run-etl.sh
- run-etl.sh runs a command of the form:
docker compose --env-file /data-share/config/.cumulus-etl.env run --remove-orphans --rm --volume /data-share:/data \
cumulus-etl \
--errors-to=/data/etl-errors/2024-06-06.1 \
--task condition,encounter,patient,servicerequest \
--fhir-url=https://epicproxy.et0965.epichosted.com/OAuth2-PRD/api/FHIR/R4 \
--smart-client-id=213f3c39-76d0-4d98-a42c-02456afa13c3 \
--smart-jwks=/data/config/cumulus-etl-prd.jwks \
--input-format=ndjson --output-format=deltalake --batch-size=300000 \
/data/2024 \
s3://cumulus-510155166665-us-east-1/patient-output/ \
s3://cumulus-phi-510155166665-us-east-1/patient-output/
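- To spot-check that de-identified output landed in the bucket named above (uses the AWS CLI credentials described under Library):
aws s3 ls s3://cumulus-510155166665-us-east-1/patient-output/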
Library
https://github.com/smart-on-fhir/cumulus-library
https://docs.smarthealthit.org/cumulus/library/first-time-setup.html
- Installed via pip on the host machine
- Used the local AWS CLI credentials to connect to S3 and other AWS resources
  - The local credentials were for a service account created in AWS and were stored in a .credentials file
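- A minimal install-and-explore sketch; the pip package name is from the repo above, and subcommands and flags should be checked against the installed version's help:
pip install cumulus-library
cumulus-library --help   # list the subcommands available in the installed version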
Azure
- Resource group: wusm-prod-rg-cumulus
- Templates and more information are located in azure-templates
AWS
- Installed the AWS CLI to config/AWS on the host machine
- Cumulus AWS setup guide
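- The CLI reads a standard AWS credentials file; a hypothetical layout for the service-account keys (values are placeholders, and the exact file location under config/AWS is an assumption):
[default]
aws_access_key_id = <service account access key>
aws_secret_access_key = <service account secret key>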
Note
You may need additional rights granted to the AWS account from WUIT to complete the configuration.
Project Challenges
- Epic's implementation of bulk FHIR
  - Created issues when exporting Encounters and DocumentReferences
  - Did not provide a way to filter data by time frame
    - Had to rely on a 7-day rolling window of ED patients, defined via an Epic registry, to set our data range
Other Information
FHIR Crawler
- Used only when/if the Bulk Data Client failed
https://github.com/smart-on-fhir/fhir-crawler
Running
docker build . -t fhir-crawler:latest
docker run -v /data-share/crawler2024/:/app/volume fhir-crawler:latest