Data Lake Catalog Policies
This document outlines the policies and procedures for the use, management, and lifecycle of catalogs in the WUSM Data Lake. It is intended for ICS team members, data stewards, and all data lake users. The goal is to ensure proper usage, compliance, and effective management of data assets across different environments.
Catalogs
The WUSM Data Lake organizes data into several catalogs, each serving a distinct role in the data lifecycle and access model. The most commonly used catalogs are `sandbox`, `review`, `curated`, and `cleansed`. Each catalog is designed to support a specific stage of data management and to ensure compliance, security, and efficient collaboration. The table below summarizes the purpose, access model, primary users, and typical contents of each catalog.
| Catalog | Purpose | Access/Restrictions | Primary Users | Example Assets |
|---|---|---|---|---|
| `sandbox` | Team development and experimentation | Team only; not shared outside of the team | Project/ICS/BYOB Teams | Work-in-progress tables, scripts |
| `review` | Data validation and review before promotion | ICS teams read/write; project team read-only | Project/ICS/BYOB Teams | Validated tables, review-ready assets |
| `curated` | Approved, shared data assets | Requires ICS review/approval | Project Teams, ICS | Project data marts, OMOP views |
| `cleansed` | Mirror of original data sources | Read-only, primarily for ICS use | ICS DW Team | Tables loaded directly from external sources |
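
Assets in every catalog are addressed with the three-level `catalog.schema.table` namespace. The following is a minimal sketch of that addressing pattern; the schema and table names (`my_team`, `cohort_draft`, `omop_standard.person`) are illustrative, not actual asset names.

```python
from pyspark.sql import SparkSession

# On Databricks, `spark` is already provided in notebooks and jobs.
spark = SparkSession.builder.getOrCreate()

# A work-in-progress asset in a team's private sandbox schema.
wip = spark.table("sandbox.my_team.cohort_draft")

# An approved, shared asset in the curated catalog
# (reading it requires an approved access request).
omop_person = spark.table("curated.omop_standard.person")

# The same three-level addressing works in SQL.
spark.sql("SELECT COUNT(*) FROM curated.omop_standard.person").show()
```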
Sandbox
Each team is provisioned with a dedicated schema within the `sandbox` catalog, which serves as a private development environment. Only members of the associated team have access to their schema, ensuring that work-in-progress assets remain isolated and secure. Teams are expected to perform all development and experimentation in this space. When a data asset is ready to be shared outside the team, it must be submitted for review and approval by the ICS team before being published to the appropriate schema in the `curated` catalog. For more details on this process, see the Review section below.
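
A minimal sketch of sandbox development, assuming a team schema named `my_team`; the source table, filter, and target table name are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Build a work-in-progress asset from data the team can already read.
draft = (
    spark.table("curated.omop_standard.person")
         .where("year_of_birth >= 1990")
)

# Write it to the team's private sandbox schema; only team members can see it.
draft.write.mode("overwrite").saveAsTable("sandbox.my_team.young_cohort_draft")
```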
Review Catalog Overview
The `review` catalog is used for data validation and review after work is completed in the sandbox and before promotion to curated. Each team has a dedicated schema in the `review` catalog with the same name as their `sandbox` schema. The ICS teams have read/write access to all `review` schemas. The ICS DW team is responsible for publishing assets to the appropriate team schema in the `review` catalog after completing work in the `data_warehouse_dev` catalog. Project teams have read-only access to their `review` schema, allowing them to review the work of the DW team. Once the project team has reviewed and approved the assets, the DW team uses Databasin to promote the assets from the `review` catalog to the `curated` catalog. This process ensures that all assets are validated and approved before being shared more broadly.
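
Because project-team access to the `review` schema is read-only, the review itself is just ordinary querying. A minimal sketch of what a review step might look like, assuming a team schema named `my_team` and a hypothetical `cohort_summary` table with a `patient_count` column:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

candidate = spark.table("review.my_team.cohort_summary")

# Basic sanity checks before approving promotion to curated.
candidate.printSchema()
print("row count:", candidate.count())
candidate.select("patient_count").summary("min", "max").show()
```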
The data broker teams have a slightly different workflow for using the `review` schema. The BYOB teams do their work in their `sandbox` schema and then promote the assets to their `review` schema. The ICS teams then review the BYOB work in the `review` catalog. Upon approval, ICS promotes the assets to the project team's `curated` schema.
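
A hedged sketch of the sandbox-to-review step of the BYOB workflow. `DEEP CLONE` is a Delta Lake feature on Databricks; whether this copy is performed manually or through tooling is an internal detail, and the `byob_cardiology` schema and `readmission_rates` table are hypothetical names.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Publish a finished sandbox asset into the team's review schema of the
# same name, preserving data and metadata via a full (deep) copy.
spark.sql("""
    CREATE OR REPLACE TABLE review.byob_cardiology.readmission_rates
    DEEP CLONE sandbox.byob_cardiology.readmission_rates
""")
```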
Curated
The `curated` catalog is the primary location for accessing approved data assets, whether within a single project team or across project teams. All data in this catalog has undergone review and approval by ICS to ensure quality, compliance, and suitability for broader use. Access to schemas in the `curated` catalog requires an approved access request. This catalog contains normalized views and tables derived from data sources in the `cleansed` catalog, such as EPIC Clarity, OMOP Standard, and GIS extended patient data. It also includes project-specific schemas, like `critical_care_datamart` and `critical_care_datamart_peds`, which are created and approved by ICS for particular initiatives or projects. Only a service principal has write or modify access to this catalog, maintaining strict control over its contents.
All `curated` assets must be deployed using Databasin.
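
For consumers, working with `curated` is read-only. A minimal sketch, assuming an approved access request; the `icu_stays` table and `admission_year` column inside `critical_care_datamart` are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Discover which curated schemas are visible to you.
spark.sql("SHOW SCHEMAS IN curated").show(truncate=False)

# Query a project data mart named in this document.
icu = spark.table("curated.critical_care_datamart.icu_stays")
icu.groupBy("admission_year").count().show()
```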
Cleansed
The `cleansed` catalog is managed by the ICS Data Warehousing (DW) team and is intended to mirror original data sources as closely as possible. Table names may vary depending on ingestion options, but the goal is to preserve the structure and content of the source systems. Data in this catalog is typically ingested and maintained through automated Databasin pipelines. The `cleansed` catalog holds comprehensive datasets, such as the full RDC OMOP and EPIC Clarity data, which are used primarily by ICS. Standardized views of these datasets are made available to approved users in the `curated` catalog. ICS may also publish custom datasets or views to curated schemas for specific projects. Write and modify access to the `cleansed` catalog is restricted to a service principal, ensuring data integrity and security.
All `cleansed` assets must be deployed using Databasin.
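
Databasin pipeline internals are not documented here, but the general shape of a source-mirroring load might look like the following sketch. The JDBC connection settings, source table, and target schema are all assumptions, and in practice the write runs as the service principal, not an individual user.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read a source-system table over JDBC (hypothetical host and credentials;
# real pipelines would pull secrets from a secret scope, never literals).
source = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://clarity-host:1433;databaseName=clarity")
    .option("dbtable", "dbo.PATIENT")
    .option("user", "svc_ingest")
    .option("password", "<from-secret-scope>")
    .load()
)

# Mirror the source table as closely as possible, preserving its structure.
source.write.mode("overwrite").saveAsTable("cleansed.epic_clarity.patient")
```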
Data Ingestion
Data ingestion occurs primarily in the `cleansed` catalog and is managed by the ICS teams. Most ingestion and management tasks are automated using Databasin pipelines, which run on a defined schedule. The ICS teams are responsible for developing and maintaining these processes. Because many assets in the `curated` catalog depend on data from `cleansed`, any changes to assets in the `cleansed` catalog require impact notification and ICS approval before deployment. This ensures that downstream processes and users are not adversely affected by modifications.
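
One way to scope an impact notification is to list downstream dependents before changing a `cleansed` asset. A hedged sketch using the Databricks lineage system table `system.access.table_lineage`; availability of that table depends on workspace configuration, and the `cleansed` table name is hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Find curated assets that have read from a given cleansed table.
downstream = spark.sql("""
    SELECT DISTINCT target_table_full_name
    FROM system.access.table_lineage
    WHERE source_table_full_name = 'cleansed.epic_clarity.patient'
      AND target_table_full_name LIKE 'curated.%'
""")
downstream.show(truncate=False)
```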
Data Assets Lifecycle
The lifecycle of data assets in the WUSM Data Lake includes development, review, approval, promotion, maintenance, and removal. Each stage is designed to ensure data quality, compliance, and proper management throughout the asset's existence.
- Development: Assets are created and tested in the team's `sandbox` catalog.
- Review: After development, assets are published to the team's schema in the `review` catalog for review by the appropriate team.
- Approval: The appropriate team reviews the assets in the `review` catalog. Assets may be sent back for revision or approved for promotion.
- Promotion: Upon approval, assets are published to the `curated` catalog for sharing with approved teams. All promotions are logged for auditing.
- Maintenance: Assets in `curated` are periodically reviewed for relevance and accuracy.
- Removal: Obsolete or superseded assets are archived or deleted according to data retention policies. Assets within the `review` schema are removed once the assets have been moved to `curated`.
See also: Data Broker Workflow for a detailed step-by-step process.
All assets must be appropriately tagged with metadata and classified according to sensitivity and compliance requirements.
Development
Development of data assets takes place within each team's dedicated `sandbox` schema. Team members are responsible for creating and testing assets in this private environment. Data brokers and the ICS DW team use the `data_warehouse_dev` catalog or their own `sandbox` schemas for development before assets are considered for broader use. Before any asset can be promoted, it must be published to the `review` catalog for review.
Review
After development, the assets are published to the appropriate team schema in the `review` catalog. Project teams and ICS teams have read-only access to their `review` schema to review the work; the ICS DW team retains read/write access to all `review` schemas. Once the appropriate team has reviewed and approved the assets, the ICS team uses Databasin to promote the assets to the `curated` catalog.
Approval
When a data asset is ready to be shared with the project team, other teams, or treated as a production resource, it undergoes a formal approval process. Typically, assets are built by a data broker and reviewed by ICS, with data stewards included as needed. The approval process may require multiple iterations and verifies that assets are appropriately tagged with metadata and classified according to sensitivity and compliance requirements. See Data Asset Tagging for standards and process.
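
Tagging and classification can be applied with Unity Catalog tags. A minimal sketch; the tag keys and values and the table name below are illustrative only, and the actual standards live in the Data Asset Tagging documentation.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Attach classification metadata to an asset awaiting approval.
spark.sql("""
    ALTER TABLE review.my_team.cohort_summary
    SET TAGS ('sensitivity' = 'phi', 'steward' = 'jdoe', 'project' = 'critical_care')
""")

# Reviewers can inspect the tags during approval.
spark.sql("""
    SELECT tag_name, tag_value
    FROM system.information_schema.table_tags
    WHERE catalog_name = 'review'
      AND schema_name  = 'my_team'
      AND table_name   = 'cohort_summary'
""").show()
```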
Promotion
Once an asset has approval, it is published to the `curated` catalog, usually by ICS using Databasin. All promotions are logged for auditing purposes to maintain transparency and traceability.
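
Databasin performs the actual promotion; purely to illustrate the shape of the step, here is a hedged sketch pairing a Delta `DEEP CLONE` with an audit record. The `governance.audit.promotions` table is a hypothetical example of where such a log could live, not a documented asset.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

src = "review.my_team.cohort_summary"
dst = "curated.my_team.cohort_summary"

# Promote the approved asset into curated.
spark.sql(f"CREATE OR REPLACE TABLE {dst} DEEP CLONE {src}")

# Record the promotion for auditing (hypothetical audit table).
spark.sql(f"""
    INSERT INTO governance.audit.promotions
    VALUES ('{src}', '{dst}', current_user(), current_timestamp())
""")
```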
Maintenance
Assets in the `curated` catalog are periodically reviewed by ICS and appointed data stewards. These reviews assess whether assets are still needed, properly tagged, and meet quality standards. If an asset fails review, ICS initiates a new development cycle to address the issues, or removes the asset as necessary.
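
A sketch of one input to such a review: curated tables that have not been altered recently, pulled from Unity Catalog's `information_schema`. The 365-day threshold is an illustrative choice, not policy.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# List curated tables untouched for roughly a year as candidates for review.
spark.sql("""
    SELECT table_schema, table_name, last_altered
    FROM curated.information_schema.tables
    WHERE last_altered < current_timestamp() - INTERVAL 365 DAYS
    ORDER BY last_altered
""").show(truncate=False)
```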
Removal
When assets are no longer required, they are archived or deleted according to applicable data retention policies. For example, data from closed studies may be moved to cold storage, or outdated assets may be removed from the data lake. All removals are logged for auditing, and more information can be found in the Data Retention Policies documentation.
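
A hedged sketch of an archive-then-remove step, assuming a hypothetical cold-storage path and table name; actual retention handling follows the Data Retention Policies documentation.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

table = "curated.my_team.closed_study_results"

# Write a final copy to cold storage (hypothetical archive location).
spark.table(table).write.mode("overwrite").parquet(
    "abfss://archive@coldstorage.dfs.core.windows.net/closed_study_results"
)

# Remove the asset from the data lake once the archive copy is verified.
spark.sql(f"DROP TABLE {table}")
```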
Databasin
Databasin is the platform and set of tools used for data ingestion, automation, and management within the data lake. All `curated` and `cleansed` assets must be deployed using Databasin. For more information, see the Databasin documentation.
Additional Catalogs (Internal Use)
The following catalogs are for internal or system use and are not intended for general consumption:
- `data_warehouse_dev` - ICS DW team development
- `system` - Databricks system-level information
- `config` - Used by internal ICS processes such as billing
- `governance` - Used by internal ICS processes such as auditing
- `marketscan` - Used by ADCS to supply administrative data to customers
- `mdclone` - Used by the MDClone ETL processes
- `omop_dw_*` - Used by the ICS DW team as the development environment for OMOP/RDC ETL development
- `permissions` - Provides a centralized view of access controls for all data assets, detailing user roles, data sensitivity levels, and audit logs for access control changes. Maintained by the ICS team.
- `postgres-config-*` - Used to link to the config database PSQL server
- `raw` - Used by Databasin during ingestion pipelines; not intended for consumption
- `staging` - Used by Databasin during ingestion pipelines; not intended for consumption
- `__databricks_internal` - Used by Databricks for internal management processes
Glossary / Acronyms
- Review: The process by which ICS and data stewards evaluate data assets for quality, compliance, and readiness for promotion. Assets may be sent back for revision or approved for promotion.
- Approval: The formal acceptance of a data asset for promotion to `curated` after review.
- Catalog: A logical grouping of schemas within the data lake, each serving a specific purpose (e.g., `sandbox`, `review`, `curated`, `cleansed`).
- Schema: A collection of database objects (tables, views) within a catalog.
- Data Steward: The individual responsible for managing and maintaining a data source.
- ICS: Institute for Clinical and Translational Sciences.
- WUSM: Washington University School of Medicine.
- Data Asset: Tables, views, flat files, models, and other resources managed in the data lake.
- Data Broker: A team member or group responsible for building, preparing, and submitting data assets for review and promotion.
- Service Principal: A security identity used by applications or automation tools to access or modify data in restricted catalogs.
- Databasin: The platform and set of tools used for data ingestion, automation, and management within the data lake.
- DW Team: Data Warehousing team, responsible for managing ingestion pipelines and maintaining the `cleansed` catalog.
- Project Team: A group of individuals associated with a particular study or initiative.
- BYOB Team: A group of individuals designated as "Bring Your Own Brokers" for a specific department or business unit.
- Cold Storage: Long-term, lower-cost storage for data assets that are no longer actively used but must be retained for compliance or archival purposes.