Data Lake Catalog Policies

This document outlines the policies and procedures for the use, management, and lifecycle of catalogs in the WUSM Data Lake. It is intended for ICS team members, data stewards, and all data lake users. The goal is to ensure proper usage, compliance, and effective management of data assets across different environments.

Catalogs

The WUSM Data Lake organizes data into several catalogs, each serving a distinct role in the data lifecycle and access model. The most commonly used catalogs are sandbox, review, curated, and cleansed. Each catalog is designed to support a specific stage of data management and to ensure compliance, security, and efficient collaboration. The table below summarizes the purpose, access model, primary users, and typical contents of each catalog.

Catalog  | Purpose                                      | Access/Restrictions                            | Primary Users          | Example Assets
---------|----------------------------------------------|------------------------------------------------|------------------------|----------------------------------------------
sandbox  | Team development and experimentation        | Team only; not shared outside the team         | Project/ICS/BYOB Teams | Work-in-progress tables, scripts
review   | Data validation and review before promotion | ICS teams read/write; project teams read-only  | Project/ICS/BYOB Teams | Validated tables, review-ready assets
curated  | Approved, shared data assets                 | Requires ICS review/approval                   | Project Teams, ICS     | Project data marts, OMOP views
cleansed | Mirror of original data sources              | Read-only; primarily for ICS use               | ICS DW Team            | Tables loaded directly from external sources

Sandbox

Each team is provisioned with a dedicated schema within the sandbox catalog, which serves as a private development environment. Only members of the associated team have access to their schema, ensuring that work-in-progress assets remain isolated and secure. Teams are expected to perform all development and experimentation in this space. When a data asset is ready to be shared outside the team, it must be submitted for review and approval by the ICS team before being published to the appropriate schema in the curated catalog. For more details on this process, see the Review section below.
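
As an illustration, this team-only isolation can be expressed with Unity Catalog grants along the following lines. This is a minimal sketch intended for a Databricks notebook (where spark is predefined); the schema and group names are hypothetical examples, and actual provisioning is handled by ICS.

    # Minimal sketch: provision a team sandbox schema with team-only access.
    # Schema and group names are hypothetical, not actual WUSM identifiers.
    team_schema = "sandbox.team_alpha"
    team_group = "team-alpha-members"

    spark.sql(f"CREATE SCHEMA IF NOT EXISTS {team_schema}")
    # The team gets full working privileges on its own schema and nothing else.
    spark.sql(
        f"GRANT USE SCHEMA, CREATE TABLE, SELECT, MODIFY "
        f"ON SCHEMA {team_schema} TO `{team_group}`"
    )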

Review Catalog Overview

The review catalog is used for data validation and review after work is completed in the sandbox and before promotion to curated. Each team has a dedicated schema in the review catalog, with the same name as its sandbox schema. The ICS teams have read/write access to all review schemas. The ICS DW team is responsible for publishing assets to the appropriate team schema in the review catalog after completing work in the data_warehouse_dev catalog. Project teams have read-only access to their review schema, allowing them to review the work of the DW team. Once the project team has reviewed and approved the assets, the DW team uses Databasin to promote the assets from the review catalog to the curated catalog. This process ensures that all assets are validated and approved before being shared more broadly.
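
The access split described above (ICS read/write, project team read-only) might be granted along these lines; a sketch with hypothetical schema and group names, run from a Databricks notebook:

    # Minimal sketch of the review-schema access split described above.
    # Names are hypothetical; actual grants are managed by ICS.
    review_schema = "review.team_alpha"  # mirrors the team's sandbox schema name
    spark.sql(
        f"GRANT USE SCHEMA, CREATE TABLE, SELECT, MODIFY "
        f"ON SCHEMA {review_schema} TO `ics-dw-team`"
    )
    # The project team can read and review, but not change, the assets.
    spark.sql(f"GRANT USE SCHEMA, SELECT ON SCHEMA {review_schema} TO `team-alpha-members`")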

The data broker teams have a slightly different workflow for using the review schema. The BYOB teams do their work in their sandbox schema and then promote the assets to their review schema. The ICS teams then review the BYOB work in the review catalog. Upon approval, ICS promotes the assets to the project team's curated schema.
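
For a Delta table, the sandbox-to-review step can be as simple as a deep clone. Below is a sketch with hypothetical table names; see the Data Broker Workflow for the full process.

    # Minimal sketch: copy a finished Delta table from sandbox to review.
    # Table names are hypothetical examples.
    spark.sql("""
        CREATE OR REPLACE TABLE review.team_alpha.cohort_summary
        DEEP CLONE sandbox.team_alpha.cohort_summary
    """)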

Curated

The curated catalog is the primary location for approved data assets, whether accessed within a single project team or shared across project teams. All data in this catalog has undergone review and approval by ICS to ensure quality, compliance, and suitability for broader use. Access to schemas in the curated catalog requires an approved access request. This catalog contains normalized views and tables derived from data sources in the cleansed catalog, such as EPIC Clarity, OMOP Standard, and GIS extended patient data. It also includes project-specific schemas, such as critical_care_datamart and critical_care_datamart_peds, which are created and approved by ICS for particular initiatives or projects. Only a service principal has write or modify access to this catalog, maintaining strict control over its contents.

All curated assets must be deployed using Databasin.
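
To illustrate how a curated asset relates to its cleansed source, a normalized view might be defined as follows. The schema, table, and column names are invented for illustration; in practice the view would be deployed through Databasin under the service principal.

    # Minimal sketch: a normalized curated view over a cleansed source table.
    # Schema, table, and column names are invented examples.
    spark.sql("""
        CREATE OR REPLACE VIEW curated.omop_standard.person_view AS
        SELECT person_id,
               gender_concept_id,
               year_of_birth
        FROM cleansed.omop.person
    """)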

Cleansed

The cleansed catalog is managed by the ICS Data Warehousing (DW) team and is intended to mirror original data sources as closely as possible. Table names may vary depending on ingestion options, but the goal is to preserve the structure and content of the source systems. Data in this catalog is typically ingested and maintained through automated Databasin pipelines. The cleansed catalog holds comprehensive datasets, such as the full RDC OMOP and EPIC Clarity data, which are used primarily by ICS. Standardized views of these datasets are made available to approved users in the curated catalog. ICS may also publish custom datasets or views to curated schemas for specific projects. Write and modify access to the cleansed catalog is restricted to a service principal, ensuring data integrity and security.

All cleansed assets must be deployed using Databasin.
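
To illustrate the mirroring goal, an ingestion step might land a source table into cleansed without reshaping it, along these lines. The connection details and names are placeholders; actual ingestion runs through Databasin pipelines under the service principal.

    # Minimal sketch: mirror a source table into the cleansed catalog as-is.
    # JDBC URL, secret scope, and table names are placeholders.
    src = (spark.read.format("jdbc")
           .option("url", "jdbc:sqlserver://source-host:1433;databaseName=clarity")
           .option("dbtable", "dbo.PATIENT")
           .option("user", "svc_ingest")
           .option("password", dbutils.secrets.get("ingest", "clarity-pw"))
           .load())
    # Preserve source structure and content; no transformations at this stage.
    src.write.mode("overwrite").saveAsTable("cleansed.clarity.patient")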

Data Ingestion

Data ingestion occurs primarily in the cleansed catalog and is managed by the ICS teams. Most ingestion and management tasks are automated using Databasin pipelines, which run on a defined schedule. The ICS teams are responsible for developing and maintaining these processes. Because many assets in the curated catalog depend on data from cleansed, any changes to assets in the cleansed catalog require impact notification and ICS approval before deployment. This ensures that downstream processes and users are not adversely affected by modifications.
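
Where Databricks lineage system tables are enabled, they offer one way to scope the required impact notification by listing curated assets that read from a cleansed table. A sketch follows; the cleansed table name is an example, and availability of system.access.table_lineage depends on workspace configuration.

    # Minimal sketch: list curated tables downstream of a cleansed table,
    # to scope the impact notification before changing it.
    impacted = spark.sql("""
        SELECT DISTINCT target_table_full_name
        FROM system.access.table_lineage
        WHERE source_table_full_name = 'cleansed.clarity.patient'
          AND target_table_catalog = 'curated'
    """)
    impacted.show(truncate=False)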

Data Assets Lifecycle

The lifecycle of data assets in the WUSM Data Lake includes development, review, approval, promotion, maintenance, and removal. Each stage is designed to ensure data quality, compliance, and proper management throughout the asset's existence.

  1. Development: Assets are created and tested in the team's schema in the sandbox catalog.
  2. Review: After development, assets are published to the team's schema in the review catalog for review by the appropriate team.
  3. Approval: The appropriate team reviews the assets in the review catalog. Assets may be sent back for revision or approved for promotion.
  4. Promotion: Upon approval, assets are published to the curated catalog for sharing with approved teams. All promotions are logged for auditing.
  5. Maintenance: Assets in curated are periodically reviewed for relevance and accuracy.
  6. Removal: Obsolete or superseded assets are archived or deleted according to data retention policies. Assets in the review schema are removed once they have been promoted to curated.

See also: Data Broker Workflow for a detailed step-by-step process.

All assets must be appropriately tagged with metadata and classified according to sensitivity and compliance requirements.
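
Unity Catalog tags are one mechanism for recording this classification. The sketch below uses example tag keys and values; see Data Asset Tagging for the actual standard.

    # Minimal sketch: tag an asset with sensitivity and compliance metadata.
    # Tag keys and values are examples, not the official tagging standard.
    spark.sql("""
        ALTER TABLE curated.critical_care_datamart.encounters
        SET TAGS ('sensitivity' = 'phi', 'compliance' = 'hipaa', 'steward' = 'jdoe')
    """)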

Development

Development of data assets takes place within each team's dedicated sandbox schema. Team members are responsible for creating and testing assets in this private environment. Data brokers and the ICS DW team use the data_warehouse_dev catalog or their own sandbox schemas for development before assets are considered for broader use. Before any asset can be promoted, it must be published to the review catalog for review.

Review

After development, assets are published to the appropriate team schema in the review catalog, where the reviewing team examines the work: project teams have read-only access to their review schema, while the ICS DW team retains read/write access to all review schemas. Once the reviewing team has approved the assets, the ICS team uses Databasin to promote them to the curated catalog.

Approval

When a data asset is ready to be shared with the project team, other teams, or treated as a production resource, it undergoes a formal approval process. Typically, assets are built by a data broker and reviewed by ICS, with data stewards included as needed. The approval process ensures that assets are properly tagged and classified, and may require multiple iterations before approval is granted. All assets must be appropriately tagged with metadata and classified according to sensitivity and compliance requirements. See Data Asset Tagging for standards and process.

Promotion

Once an asset is approved, it is published to the curated catalog, usually by ICS using Databasin. All promotions are logged for auditing purposes to maintain transparency and traceability.
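
A minimal sketch of a logged promotion follows, assuming a hypothetical governance.audit.promotions table; in practice Databasin performs and logs the promotion.

    # Minimal sketch: promote an approved asset and record the promotion.
    # The governance.audit.promotions table is an invented example.
    spark.sql("""
        CREATE OR REPLACE TABLE curated.team_alpha.cohort_summary
        DEEP CLONE review.team_alpha.cohort_summary
    """)
    spark.sql("""
        INSERT INTO governance.audit.promotions
        VALUES ('review.team_alpha.cohort_summary',
                'curated.team_alpha.cohort_summary',
                current_user(), current_timestamp())
    """)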

Maintenance

Assets in the curated catalog are periodically reviewed by ICS and appointed data stewards. These reviews assess whether assets are still needed, properly tagged, and meet quality standards. If an asset fails review, ICS initiates a new development cycle to address issues or remove the asset as necessary.
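
One way to surface candidates for this review is the Unity Catalog information schema, for example by listing curated tables that have not been altered in the past year. This is a sketch; the actual review criteria are set by ICS and the data stewards.

    # Minimal sketch: find curated tables untouched for a year or more,
    # as candidates for the periodic relevance and quality review.
    stale = spark.sql("""
        SELECT table_schema, table_name, last_altered
        FROM curated.information_schema.tables
        WHERE last_altered < current_timestamp() - INTERVAL 365 DAYS
        ORDER BY last_altered
    """)
    stale.show(truncate=False)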

Removal

When assets are no longer required, they are archived or deleted according to applicable data retention policies. For example, data from closed studies may be moved to cold storage, or outdated assets may be removed from the data lake. All removals are logged for auditing, and more information can be found in the Data Retention Policies documentation.
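
The sketch below shows one possible removal path, assuming a hypothetical cold-storage location and audit table; the authoritative rules and mechanisms are defined in the Data Retention Policies documentation.

    # Minimal sketch: archive a table to cold storage, then remove and log it.
    # The archive path and governance.audit.removals table are invented examples.
    spark.sql("""
        CREATE TABLE delta.`abfss://archive@coldstore.dfs.core.windows.net/closed_study/cohort_summary`
        DEEP CLONE curated.team_alpha.cohort_summary
    """)
    spark.sql("DROP TABLE curated.team_alpha.cohort_summary")
    spark.sql("""
        INSERT INTO governance.audit.removals
        VALUES ('curated.team_alpha.cohort_summary', 'archived',
                current_user(), current_timestamp())
    """)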

Databasin

Databasin is the platform and set of tools used for data ingestion, automation, and management within the data lake. All curated and cleansed assets must be deployed using Databasin. For more information, see the Databasin documentation.

Additional Catalogs (Internal Use)

The following catalogs are for internal or system use and are not intended for general consumption:

  • data_warehouse_dev - ICS DW team development
  • system - Databricks system level information
  • config - Used by internal ICS processes such as billing
  • governance - Used by internal ICS processes such as auditing
  • marketscan - Used by ADCS to supply administrative data to customers
  • mdclone - Used by the MDClone ETL processes
  • omop_dw_* - Used by ICS DW team as the development environment for OMOP/RDC ETL development
  • permissions - Provides a centralized view of access controls for all data assets, detailing user roles, data sensitivity levels, and audit logs for access control changes. Maintained by the ICS team.
  • postgres-config-* - Used to link to the config database PostgreSQL server
  • raw - Used by Databasin during ingestion pipelines and not intended for consumption
  • staging - Used by Databasin during ingestion pipelines and not intended for consumption
  • __databricks_internal - Used by Databricks for internal management processes.

Glossary / Acronyms

  • Review: The process by which ICS and data stewards evaluate data assets for quality, compliance, and readiness for promotion. Assets may be sent back for revision or approved for promotion.
  • Approval: The formal acceptance of a data asset for promotion to curated after review.
  • Catalog: A logical grouping of schemas within the data lake, each serving a specific purpose (e.g., sandbox, review, curated, cleansed).
  • Schema: A collection of database objects (tables, views) within a catalog.
  • Data Steward: The individual responsible for managing and maintaining a data source.
  • ICS: Institute for Clinical and Translational Sciences.
  • WUSM: Washington University School of Medicine.
  • Data Asset: Tables, views, flat files, models, and other resources managed in the data lake.
  • Data Broker: A team member or group responsible for building, preparing, and submitting data assets for review and promotion.
  • Service Principal: A security identity used by applications or automation tools to access or modify data in restricted catalogs.
  • Databasin: The platform and set of tools used for data ingestion, automation, and management within the data lake.
  • DW Team: Data Warehousing team, responsible for managing ingestion pipelines and maintaining the cleansed catalog.
  • Project Team: A group of individuals associated with a particular study or initiative.
  • BYOB Team: A group of individuals designated as "Bring Your Own Brokers" for a specific department or business unit.
  • Cold Storage: Long-term, lower-cost storage for data assets that are no longer actively used but must be retained for compliance or archival purposes.

Updated on August 7, 2025