SmartCumulus Technology Overview

Home page: https://docs.smarthealthit.org/cumulus/

  • Host machine running Docker, deployed to Azure
    • Hosted in Azure to allow for long-running processes
    • Required a 10.x IP and VNet peering to allow SSH access to the VM
  • Multiple Docker images running on the host machine
  • Bulk FHIR container
  • Cumulus ETL compose containers
  • Cumulus Library (pip package on the host?)

Process Overview

  1. Load the virtual machine with the Cumulus tools/containers
  2. Work with the EPIC team to register the Cumulus tools as a client of the FHIR API
    1. Generate keys for the applications to use during authentication.
  3. Set up tool configurations
    1. Set the client app credentials registered with the EPIC team
  4. Run the data export via bulk FHIR
  5. Run the ETL process via the ETL containers
  6. Build and upload datasets using the Cumulus Library

Virtual Machine

  • Runs the Docker engine
  • Mounts file shares for config and data storage
  • Mount point is /data-share on the host machine

Software

Bulk Client

https://github.com/smart-on-fhir/bulk-data-client

https://hl7.org/fhir/uv/bulkdata/

  • Used to export data from EPIC using the Bulk FHIR API
  • Required configuration by the EPIC team
    • Client ID and secret in App Orchard
    • Group (registry) creation and group ID
    • The bulk client requires a JWKS key to be passed with requests
    • Must generate a PEM file and then convert it to JWKS
    • See scripts/create-keys.sh, scripts/generate_jwks_from_pem.py, and scripts/convert_pem_jwks.py
    • The ETL uses the .jwks file and the bulk client uses the jwks.json file
  • The container mounts host /data-share to /data
  • Uses the /data/config/bjc-prod-config.js configuration file
  • See /data-share/run-export.sh for examples of running exports from the container
  • May no longer be needed, as Cumulus ETL now supports bulk export. However, the steps to work with the EPIC team will be the same or similar when registering the Cumulus ETL tool instead of the Bulk Data Client.
  • https://docs.smarthealthit.org/cumulus/etl/bulk-exports.html
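The key generation mentioned above is handled by the repo scripts; as a minimal sketch of the PEM step only (hypothetical file names — the actual procedure and the JWKS conversion live in scripts/create-keys.sh and scripts/convert_pem_jwks.py):

```shell
# Generate an RSA private key (hypothetical file name; see scripts/create-keys.sh
# for the real procedure used in this deployment)
openssl genrsa -out private.pem 4096

# Extract the matching public key; its modulus and exponent become the
# "n" and "e" fields of the JWK during conversion
openssl rsa -in private.pem -pubout -out public.pem
```

The resulting PEM is then converted to the .jwks / jwks.json formats noted above by the conversion scripts.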

Running Exports

  1. Start the container: docker run -v /data-share:/data -it bulk-fhir-client
  2. Run this from the /app folder in the container
    • node . --config /data/config/bjc-prod-config.js -d /data/<export dir> -t <comma list of domains to export> --reporter cli

Cumulus ETL

  • Used to transform and load the data extracted using the bulk client
  • Identifiable data stayed in the Azure storage account and the de-identified data was uploaded to the AWS S3 bucket
  • Some PHI went to a secured S3 bucket to enable patient linking
    • This data was not shared with the study team
  • The de-identified data was then shared with the study team using the Cumulus Library tool

https://github.com/smart-on-fhir/cumulus-etl
https://docs.smarthealthit.org/cumulus/etl/

  1. docker compose -f compose.yaml --profile etl build
  2. sh /data-share/run-etl.sh, which runs:
docker compose --env-file /data-share/config/.cumulus-etl.env run --remove-orphans --rm --volume /data-share:/data \
  cumulus-etl \
  --errors-to=/data/etl-errors/2024-06-06.1 \
  --task   condition,encounter,patient,servicerequest \
  --fhir-url=https://epicproxy.et0965.epichosted.com/OAuth2-PRD/api/FHIR/R4 \
  --smart-client-id=213f3c39-76d0-4d98-a42c-02456afa13c3 \
  --smart-jwks=/data/config/cumulus-etl-prd.jwks \
  --input-format=ndjson --output-format=deltalake --batch-size=300000 \
  /data/2024 \
  s3://cumulus-510155166665-us-east-1/patient-output/ \
  s3://cumulus-phi-510155166665-us-east-1/patient-output/
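The file passed via --env-file typically supplies the AWS credentials the ETL needs to write to the S3 targets. A hypothetical sketch of .cumulus-etl.env using the standard AWS SDK environment variables (values are placeholders):

```
# Standard AWS SDK environment variables (placeholder values)
AWS_ACCESS_KEY_ID=AKIAXXXXXXXXXXXXXXXX
AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AWS_DEFAULT_REGION=us-east-1
```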

Library

https://github.com/smart-on-fhir/cumulus-library
https://docs.smarthealthit.org/cumulus/library/first-time-setup.html

  • Installed via pip on the host machine
  • Used local AWS CLI credentials to connect to S3 and other AWS resources
  • The local credentials were for a service account created in AWS and were stored in a .credentials file
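The host-side setup above amounts to installing the PyPI package (pip install cumulus-library) and providing the service-account credentials in the standard AWS shared-credentials format. A sketch of the credentials file with placeholder values:

```
[default]
aws_access_key_id = AKIAXXXXXXXXXXXXXXXX
aws_secret_access_key = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
```

The AWS SDK picks this file up automatically from ~/.aws/credentials (or from the path named in AWS_SHARED_CREDENTIALS_FILE).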

Azure

AWS

Note

You may need additional rights granted to the AWS account from WUIT to complete the configuration.

Project Challenges

  • EPIC's implementation of bulk FHIR
    • Created issues when exporting Encounters and DocumentReferences
    • Did not provide a way to filter data by time frame
    • Had to rely on a 7-day rolling window of ED patients, defined via an EPIC registry, to scope the data range

Other Information

FHIR Crawler

  • Used only when/if the Bulk Data Client failed

https://github.com/smart-on-fhir/fhir-crawler

Running

  1. docker build . -t fhir-crawler:latest
  2. docker run -v /data-share/crawler2024/:/app/volume fhir-crawler:latest

Updated on August 7, 2025