How To Set Up AWS for Cumulus
This guide explains how to configure your AWS account with the S3 buckets, Glue tables,
Athena workgroup, and other resources that Cumulus needs. It assumes you are familiar with AWS.
At the end, you'll find a CloudFormation template that does all the hard work for you.
End Goal
By the end of this guide, you'll have multiple S3 buckets, both to receive the output of Cumulus ETL
and to store the results of Athena queries.
You'll also have defined a Glue database and crawler to map the output of Cumulus ETL onto Glue
tables, which Athena can then query.
All of that can be grouped into three stages:
- Cumulus ETL (S3 buckets)
- Glue (tables)
- Athena (bucket & configuration)
Cumulus ETL
You're going to create two buckets: one for the de-identified Cumulus ETL output and one for build
artifacts (which hold PHI).
They'll each need similar security policies, but Glue will only look at the output bucket.
The buckets will require encryption and grant access to the user role that is running Cumulus ETL.
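The encryption requirement affects anything that uploads to these buckets: the template's bucket policy denies any `s3:PutObject` request that does not ask for `aws:kms` server-side encryption. Cumulus ETL does its own uploading, so the following is only a minimal sketch of what a compliant manual upload might look like with boto3. The helper name `upload_encrypted` is hypothetical, and the S3 client is passed in so the function can be exercised without AWS credentials.

```python
def upload_encrypted(s3_client, bucket, key, body, kms_key_id):
    """Upload one object, explicitly requesting SSE-KMS encryption.

    `s3_client` is expected to behave like boto3.client("s3").
    The helper name and argument order are illustrative assumptions.
    """
    return s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=body,
        # The bucket policy rejects uploads whose encryption header is not aws:kms.
        ServerSideEncryption="aws:kms",
        SSEKMSKeyId=kms_key_id,
    )


# Example (requires boto3 and AWS credentials, so it is left commented out):
# import boto3
# upload_encrypted(boto3.client("s3"), "my-bucket", "test/hello.txt", b"hi", "my-kms-key-id")
```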
Glue
Glue is an AWS product that creates table schemas based on the files in the Cumulus ETL output
bucket.
You're going to create a database to hold the tables and a crawler that scans the Cumulus ETL
output bucket, creating the tables.
This crawler can simply be a manually run job for now.
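Running the crawler manually can be done from the AWS console, or scripted. As a rough sketch, assuming the crawler is named `cumulus` (the name used in the template at the end of this guide) and that boto3 is available, a script could start the crawler and wait for it to finish. The function name `run_crawler` and the polling interval are illustrative choices, not part of Cumulus.

```python
import time


def run_crawler(glue_client, name="cumulus", poll_seconds=10):
    """Start a Glue crawler and block until it returns to the READY state.

    `glue_client` is expected to behave like boto3.client("glue").
    Returns the number of status checks made (useful for logging).
    """
    glue_client.start_crawler(Name=name)
    checks = 0
    while True:
        checks += 1
        state = glue_client.get_crawler(Name=name)["Crawler"]["State"]
        # Crawler states are READY, RUNNING, or STOPPING; READY means idle.
        if state == "READY":
            return checks
        time.sleep(poll_seconds)


# Example (requires boto3 and AWS credentials, so it is left commented out):
# import boto3
# run_crawler(boto3.client("glue"))
```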
Athena
Athena is an AWS product that can run SQL queries against Glue tables.
It's how we'll generate the patient counts for studies.
You're going to create a bucket to hold Athena query results and an Athena workgroup to configure
Athena.
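Once the workgroup exists, queries can be submitted through it programmatically as well as from the console. The sketch below assumes the workgroup and Glue database are both named `cumulus` (as in the template at the end of this guide) and uses boto3's Athena client. The helper name `run_query` and the `patient` table in the usage comment are hypothetical; the real table names depend on what the crawler finds in your output bucket.

```python
import time


def run_query(athena_client, sql, database="cumulus", workgroup="cumulus", poll_seconds=2):
    """Submit a query to Athena and wait for it to reach a final state.

    `athena_client` is expected to behave like boto3.client("athena").
    Returns a (query_execution_id, final_state) tuple.
    """
    started = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        # The workgroup supplies the output location and encryption settings.
        WorkGroup=workgroup,
    )
    query_id = started["QueryExecutionId"]
    while True:
        execution = athena_client.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return query_id, state
        time.sleep(poll_seconds)


# Example (requires boto3 and AWS credentials; "patient" is a hypothetical table):
# import boto3
# run_query(boto3.client("athena"), "SELECT COUNT(*) FROM patient")
```

Because the workgroup enforces its own configuration, the caller does not need to pass an output location or encryption settings with each query.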
CloudFormation
The easy way to set this all up is simply to use a CloudFormation template.
Here's an example that should work for your needs.
It takes four parameters:
- Bucket prefix
- ETL Subdirectory, matching the subdirectory you pass to Cumulus ETL
- KMS key ID for encryption
- Upload Role ARN, matching the user that runs Cumulus ETL
Once you create this CloudFormation stack, your infrastructure will be ready to run Cumulus ETL.
AWSTemplateFormatVersion: 2010-09-09
Description: Create an Athena database for the Cumulus project
Parameters:
  BucketPrefix:
    Type: "String"
    Description: "Prefix for Cumulus bucket names (they will look like ${BucketPrefix}-${purpose}-${AWS::AccountId}-${AWS::Region})"
  EtlSubdir:
    Type: "String"
    Description: "Subdirectory on the Cumulus ETL output bucket where files will be placed. This should match the path you give when running Cumulus ETL. Using a subdirectory is recommended to allow for test runs of Cumulus ETL in different subdirectories and general future-proofing."
  KMSMasterKeyID:
    Type: "String"
    Description: "KMS key ID for Cumulus buckets"
  UploadRoleArn:
    Type: "String"
    Description: "ARN for role that is running Cumulus ETL and thus uploading files to S3"
Resources:
  ####################################################
  # S3 Buckets for raw Cumulus output
  ####################################################
  S3Bucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Sub ${BucketPrefix}-${AWS::AccountId}-${AWS::Region}
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault:
              KMSMasterKeyID: !Ref KMSMasterKeyID
              SSEAlgorithm: aws:kms
      PublicAccessBlockConfiguration:
        BlockPublicAcls: True
        BlockPublicPolicy: True
        IgnorePublicAcls: True
        RestrictPublicBuckets: True
  S3BucketPolicy:
    Type: AWS::S3::BucketPolicy
    Properties:
      Bucket: !Ref S3Bucket
      PolicyDocument:
        Version: 2012-10-17
        Statement:
          - Sid: AWSBucketRequiresEncryption
            Effect: Deny
            Principal: "*"
            Action: s3:PutObject
            Resource: !Sub "arn:aws:s3:::${S3Bucket}/*"
            Condition:
              StringNotEquals:
                s3:x-amz-server-side-encryption: aws:kms
          - Sid: AWSBucketAllowUploads
            Effect: Allow
            Principal:
              AWS: !Ref UploadRoleArn
            Action:
              - s3:DeleteObject
              - s3:PutObject
            Resource: !Sub "arn:aws:s3:::${S3Bucket}/*"
          - Sid: AWSBucketAllowListing
            Effect: Allow
            Principal:
              AWS: !Ref UploadRoleArn
            Action:
              - s3:ListBucket
            Resource: !Sub "arn:aws:s3:::${S3Bucket}"
  PHIBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Sub ${BucketPrefix}-phi-${AWS::AccountId}-${AWS::Region}
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault:
              KMSMasterKeyID: !Ref KMSMasterKeyID
              SSEAlgorithm: aws:kms
      PublicAccessBlockConfiguration:
        BlockPublicAcls: True
        BlockPublicPolicy: True
        IgnorePublicAcls: True
        RestrictPublicBuckets: True
  PHIBucketPolicy:
    Type: AWS::S3::BucketPolicy
    Properties:
      Bucket: !Ref PHIBucket
      PolicyDocument:
        Version: 2012-10-17
        Statement:
          - Sid: AWSBucketRequiresEncryption
            Effect: Deny
            Principal: "*"
            Action: s3:PutObject
            Resource: !Sub "arn:aws:s3:::${PHIBucket}/*"
            Condition:
              StringNotEquals:
                s3:x-amz-server-side-encryption: aws:kms
          - Sid: AWSBucketAllowAccess
            Effect: Allow
            Principal:
              AWS: !Ref UploadRoleArn
            Action:
              - s3:GetObject
              - s3:PutObject
            Resource: !Sub "arn:aws:s3:::${PHIBucket}/*"
  ####################################################
  # Glue database & tables for raw Cumulus data
  ####################################################
  GlueSecurity:
    Type: AWS::Glue::SecurityConfiguration
    Properties:
      EncryptionConfiguration:
        S3Encryptions:
          - KmsKeyArn: !Sub "arn:aws:kms:${AWS::Region}:${AWS::AccountId}:key/${KMSMasterKeyID}"
            S3EncryptionMode: SSE-KMS
      Name: cumulus-kms
  CrawlerRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - glue.amazonaws.com
            Action:
              - sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
      Policies:
        - PolicyName: S3BucketAccessPolicy
          PolicyDocument:
            Version: 2012-10-17
            Statement:
              - Effect: Allow
                Action:
                  - s3:GetObject
                  - s3:PutObject
                Resource: !Sub "arn:aws:s3:::${S3Bucket}/*"
  GlueDB:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: cumulus
  GlueCrawler:
    Type: AWS::Glue::Crawler
    Properties:
      Name: cumulus
      DatabaseName: !Ref GlueDB
      Role: !GetAtt CrawlerRole.Arn
      CrawlerSecurityConfiguration: !Ref GlueSecurity
      RecrawlPolicy:
        RecrawlBehavior: CRAWL_EVERYTHING
      SchemaChangePolicy:
        DeleteBehavior: DEPRECATE_IN_DATABASE
        UpdateBehavior: UPDATE_IN_DATABASE
      Targets:
        S3Targets:
          - Path: !Sub "s3://${S3Bucket}/${EtlSubdir}/"
            Exclusions:
              - "JobConfig/**"
            SampleSize: 1
  ####################################################
  # Athena queries and where to store them
  ####################################################
  AthenaBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Sub ${BucketPrefix}-athena-${AWS::AccountId}-${AWS::Region}
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault:
              KMSMasterKeyID: !Ref KMSMasterKeyID
              SSEAlgorithm: aws:kms
      PublicAccessBlockConfiguration:
        BlockPublicAcls: True
        BlockPublicPolicy: True
        IgnorePublicAcls: True
        RestrictPublicBuckets: True
  AthenaBucketPolicy:
    Type: AWS::S3::BucketPolicy
    Properties:
      Bucket: !Ref AthenaBucket
      PolicyDocument:
        Version: 2012-10-17
        Statement:
          # We don't expect non-Athena uploads to this bucket, but just as a safeguard against misconfiguration,
          # let's enforce encryption on all incoming data.
          - Sid: AWSBucketRequiresEncryption
            Effect: Deny
            Principal: "*"
            Action: s3:PutObject
            Resource: !Sub "arn:aws:s3:::${AthenaBucket}/*"
            Condition:
              StringNotEquals:
                s3:x-amz-server-side-encryption: aws:kms
  AthenaWorkGroup:
    Type: AWS::Athena::WorkGroup
    Properties:
      Name: cumulus
      State: ENABLED
      WorkGroupConfiguration:
        EnforceWorkGroupConfiguration: True
        PublishCloudWatchMetricsEnabled: True
        EngineVersion:
          SelectedEngineVersion: "Athena engine version 3"
        ResultConfiguration:
          EncryptionConfiguration:
            EncryptionOption: SSE_KMS
            KmsKey: !Ref KMSMasterKeyID
          OutputLocation: !Sub "s3://${AthenaBucket}/"
Outputs:
  BucketName:
    Description: Cumulus de-identified output bucket ID
    Value: !Ref S3Bucket
  PhiBucketName:
    Description: Cumulus PHI output bucket ID
    Value: !Ref PHIBucket
  AthenaBucketName:
    Description: Cumulus Athena results bucket ID
    Value: !Ref AthenaBucket
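
The stack can be created from the CloudFormation console, or scripted. The following is a minimal sketch of the scripted route with boto3, assuming the template above has been saved to a local file. The parameter keys match the template's four parameters; the helper names, the stack name `cumulus`, and the file name in the usage comment are hypothetical. Because the template creates an IAM role for the crawler, CloudFormation requires the `CAPABILITY_IAM` acknowledgement.

```python
def stack_parameters(bucket_prefix, etl_subdir, kms_key_id, upload_role_arn):
    """Build the Parameters list in the shape CloudFormation's API expects."""
    values = {
        "BucketPrefix": bucket_prefix,
        "EtlSubdir": etl_subdir,
        "KMSMasterKeyID": kms_key_id,
        "UploadRoleArn": upload_role_arn,
    }
    return [{"ParameterKey": k, "ParameterValue": v} for k, v in values.items()]


def create_cumulus_stack(cfn_client, template_body, parameters, stack_name="cumulus"):
    """Create the stack; `cfn_client` should behave like boto3.client("cloudformation")."""
    return cfn_client.create_stack(
        StackName=stack_name,
        TemplateBody=template_body,
        Parameters=parameters,
        # Needed because the template defines an AWS::IAM::Role resource.
        Capabilities=["CAPABILITY_IAM"],
    )


# Example (requires boto3 and AWS credentials; all values are placeholders):
# import boto3
# with open("cumulus-template.yaml") as f:
#     body = f.read()
# params = stack_parameters("my-prefix", "etl-output", "my-kms-key-id",
#                           "arn:aws:iam::123456789012:role/my-etl-role")
# create_cumulus_stack(boto3.client("cloudformation"), body, params)
```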