Data Lake Brokerage Workflow
Overview
In the WUSM Data Lake environment (managed by the Informatics Core Services, ICS team), data is organized into three main tiers: a Sandbox catalog for working data, a Review catalog for validation and approval, and a Curated catalog for finalized, approved data. Each project team or department is provisioned with its own schemas in these catalogs, following a consistent naming convention and access control policy. This structure ensures that teams can freely develop and prepare data in a restricted area, while final datasets are reviewed and stored in a governed area for broader access (read-only to the team). Below, we outline the naming conventions, role-based access, and the process of moving data from sandbox to review to curated.
Schema Architecture and Access
Below is a diagram depicting the overall schema architecture in the Sandbox, Review, and Curated catalogs, along with the relevant user groups and their access privileges:
flowchart LR
subgraph Sandbox Catalog
direction TB
S2["**sandbox**.
**data_brokers_[dept]**
(Dept broker schema - S2)"]
S3["**sandbox**.
**[prefix]\_[ID]\_[dept]**
(Project schema - S3)"]
end
subgraph Review Catalog
direction TB
R2["**review**.
**[prefix]\_[ID]\_[dept]**
(Project team review
schema - R2)"]
R1["**review**.
**data_brokers_[dept]**
(Dept broker
review schema - R1)"]
end
subgraph Curated Catalog
direction TB
C1["**curated**.
**[prefix]\_[ID]\_[dept]**
(Project schema - C1)"]
end
%% User groups
subgraph User_Groups [User Security Groups]
direction TB
DeptGrp["data_brokers_[dept]
(Dept Broker Group)"]
ICS["data_brokers
(ICS Brokers Group)"]
TeamGrp["[prefix]\_[ID]\_[dept]
(Project Team Group)"]
Databasin["Databasin
(Used by ICS Brokers)"]
end
%% Access relationships
ICS -- "R/W access" --> R2
ICS -- "Read access" --> R1
TeamGrp -- "R/W access" --> S3
DeptGrp -- "R/W access" --> R1
DeptGrp -- "R/W access" --> S2
TeamGrp -- "Read access" --> C1
Databasin -- "R/W access" --> C1
%% (Department brokers and ICS have no direct access to S3 team schema)
Diagram Key:
- S2 is a departmental broker’s sandbox schema (e.g.
data_brokers_med
for a Medical dept broker) - S3 represents a project team's sandbox schema (named with the appropriate prefix_ID_dept)
- R1 is the departmental broker's review schema in the Review catalog
- R2 is the project's review schema in the Review catalog
- C1 is the project’s curated schema for approved data
The User Groups on the right correspond to Databricks security groups with the same naming; membership in these groups controls access to the schemas. ICS brokers do not have access to the departmental broker's sandbox. ICS brokers have read-only access to the departmental broker's review schema in the review catalog, while the departmental broker team has read/write access to their own review schema. All deployments to the curated
catalog must be done using Databasin. No one has write access to the curated
catalog outside of Databasin.
From the diagram, we can note the following:
-
Project Team Schemas: Each project team (IRB study, QA/QI, class, ops, etc.) has a dedicated schema in the Sandbox catalog named after its group (format
[prefix]_[ID]_[dept]
). This is the team’s private sandbox area where they have full read/write permissions. For example, a research study with IRB #12345 in the Department of Medicine might have a sandbox schemasandbox.irb_12345_med
. The same name is reserved in the Review catalog (review.irb_12345_med
) for validation and approval, and in the Curated catalog (curated.irb_12345_med
) to hold the final approved dataset for that project. This data is delivered by ICS as described in the workflow below. -
Data Broker Schemas: Data broker teams have their own sandbox areas. The central ICS brokers team has a schema
sandbox.data_brokers
(S1), and each department’s data broker team has a schemasandbox.data_brokers_[dept]
(S2). These schemas are used by data brokers to develop data assets. Departmental brokers only have access to their own departmental sandbox (S2 for their department). ICS brokers do not have access to the departmental broker's sandbox. Instead, departmental brokers promote assets to their review schema in the Review catalog (R1), where ICS brokers have read-only access for review and approval, and the departmental broker team has read/write access. -
Review Schemas: Departmental brokers have read/write access to their review schemas (R1), and ICS brokers have read-only access to departmental review schemas (R1).
Note: Project teams may have read-only access to their review schema for validation and review. ICS teams have read/write access to project team review schemas for final review and promotion. For more information, see the Data Collaboration Workflow document.
-
Curated Schemas: For each non-broker team, a corresponding schema exists in the Curated catalog (C1) with the same name as the team’s sandbox. Only ICS data brokers can write/publish to curated schemas via Databasin. Project teams do not have write access in Curated. Instead, the team's group is granted read (SELECT) access to their curated schema. This allows the team to query and utilize the finalized data, but not modify it, enforcing data integrity for certified datasets. Other users or analysts may also be given read access to curated data as appropriate, but the sandbox remains private to the team, based on approval from the appointed data steward.
-
Security Groups and Access Control: Each schema is protected by Databricks access controls tied to group membership. The project team’s group (named identically to their schema) is granted R/W on their sandbox schema (and typically also access to a private file storage volume in that sandbox). That same group is granted read-only on the curated schema. Departmental broker groups are granted R/W on their departmental sandbox schema (S2), and on their review schema (R1), where they prepare data for review. The ICS broker group has R/W on the top-level brokers schema (S1) and read-only on all departmental review schemas (R1). ICS brokers act as administrators who can review any data produced by other brokers and manage publication to Curated. Importantly, neither ICS nor departmental broker groups are directly added to the project team sandbox permissions (they do not work in S3 schemas under normal circumstances, to maintain the principle of least privilege and let teams work independently). Instead, collaboration happens by reviewing data in the departmental broker review schema and publishing to the project's Curated schema.
Data Review and Publication Workflow
The following sequence diagram illustrates the workflow of how data moves from a broker's sandbox to the project's review area and then to the curated area, highlighting the roles of the project team, departmental broker, and ICS broker in the process:
sequenceDiagram
participant Team as Project Team
participant Dept as Dept. Data Broker
participant ICS as ICS Data Broker
participant Review as BYOB Review Schema
participant ProjectReview as Project Review Schema
participant Curated as Project Curated Schema
Team->>Dept: Request dataset
Dept->>Dept: Build/finalize dataset in departmental sandbox (S2)
Dept-->>Review: Promote dataset to departmental review schema (R1)
ICS->>Review: Review dataset for compliance/quality
ICS->>Review: Query approved by ICS
ICS->>ProjectReview: Publish approved data to review schema (R2)
Team->>ProjectReview: Query approved by project team (R2)
ICS->>Curated: Publish approved data to curated schema (C1) using Databasin
Curated->>Team: (Optional) Copy data to sandbox for further work
Workflow Explanation
-
Team Requests Data: The project Team requests a dataset from ICS or their Departmental Data Broker. The broker team is responsible for building the requested dataset in the appropriate sandbox. Either the ICS broker sandbox (S1) or their own departmental broker sandbox (S2).
-
Broker Builds Dataset: The Broker assembles and prepares the requested dataset within the appropriate broker sandbox (e.g., S1, S2). The project team does not have access to this area and does not participate in the data preparation process at this stage.
-
Promote to Review: Once the dataset is ready, the Broker promotes the dataset to the departmental review schema (R1) in the Review catalog. Departmental brokers have R/W access to their review schema (R1), and ICS brokers have read-only access to departmental review schemas (R1).
-
ICS Review: The ICS team member reviews the dataset in the departmental review schema (R1) for regulatory compliance, data quality, and adherence to institutional standards. If the dataset passes ICS review, it is promoted to the project team's review schema (R2) in the Review catalog.
-
Project Team Review (R2): The project team has the opportunity to review the assets in their review schema (R2) before deployment to their curated schema. The project team can provide feedback and request changes from the departmental broker during this review. If changes are required, the departmental broker makes the necessary updates in their sandbox and the review process starts over: the updated dataset is promoted to R1, ICS reviews R1, and if approved, the project team reviews R2 again.
-
Promotion to Curated: When the project team approves the assets in R2, ICS deploys the assets to the curated schema (C1) using Databasin. Only ICS has write access to the curated schema; the project team and other users have read-only access.
-
Team Accesses and Works with Curated Data: The team can now select and query the approved dataset from the curated schema and, if needed, copy or transform the data into their own sandbox schema (S3) for further analysis or project work. Any new or modified data would require ICS repeating the review and publication process.
This workflow ensures that the project team only works with data after it has been published and approved by ICS and the project team, maintaining strict governance and compliance throughout the process.
By following this structured approach, the data lake ensures that all teams have a sandbox for unrestricted experimentation and development, a review area for validation and approval, and a curated area as a trusted repository of vetted data. The use of distinct schemas and controlled group permissions (enforced via Databricks Unity Catalog) provides clear separation between working data and production-quality data, and the involvement of departmental and ICS data brokers adds critical governance checkpoints to maintain data quality and compliance. This governance model, illustrated above, helps balance flexibility for researchers with responsibility and control over sensitive data assets.