Data Lake Administration Guide

Overview

This document describes how ICS manages users and data sources in the WUSM data lake. Additionally, we aim to create a standard vocabulary around the concepts discussed in the sections below. To this end, we have included a glossary at the end of this document that can be used to clarify terms as they are introduced in the content below.

This document is not meant to provide step-by-step instructions for carrying out these practices. For details on executing the corresponding steps, please consult the Data Lake How-To's or the Databasin documentation.

Data Sources

In the context of this document, a data source generally refers to a schema within the Databricks workspace that contains structured data organized into tables and columns. These data sources provide the foundational datasets that make up the data lake.

Note

There are cases in which a data source can span multiple Databricks schemas and may include unstructured data in volumes or other resources.

Knowledge Catalog

A knowledge catalog will be maintained, containing comprehensive information about each shared data source. This catalog will include metadata about the data sources, tables, and columns. The responsibility for maintaining and updating this knowledge catalog will fall to members of the WUSM data lake team and designated data stewards. This ensures that the catalog remains current and accurate, providing a reliable resource for users of the WUSM data lake.

Data Source Groups

Each "shared" data source in the WUSM data lake will have specific groups associated with it. These groups allow us to provision access to all of the data or to only a limited subset of it, and to delegate data steward tasks to users outside of ICS if needed.

These groups can be identified by their naming convention: [CATALOG]_[DATASOURCE]_[GROUP]. For example, the group to access the curated.omop data source is the curated_omop_identified group. Members of this group are allowed to query data from all tables and views located in that schema.
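As an illustration of the naming convention, the following minimal Python sketch builds and parses group names. The helper names (group_name, parse_group_name, DataSourceGroup) are hypothetical and are not part of Databasin or Databricks. The suffixes come from the group types described in this section. Because catalog and data source names may themselves contain underscores, the parser assumes the caller supplies the set of known catalogs.

```python
from typing import NamedTuple

# Suffixes for the group types described in this section.
GROUP_TYPES = ("identified", "limited", "stewards")


class DataSourceGroup(NamedTuple):
    catalog: str
    datasource: str
    group_type: str


def group_name(catalog: str, datasource: str, group_type: str) -> str:
    """Build a name of the form [CATALOG]_[DATASOURCE]_[GROUP]."""
    if group_type not in GROUP_TYPES:
        raise ValueError(f"unknown group type: {group_type!r}")
    return f"{catalog}_{datasource}_{group_type}"


def parse_group_name(name: str, known_catalogs: set[str]) -> DataSourceGroup:
    """Split a group name into catalog, data source, and group type."""
    prefix, _, group_type = name.rpartition("_")
    if group_type not in GROUP_TYPES:
        raise ValueError(f"not a data source group name: {name!r}")
    # Try longer catalog names first so "a_b" wins over "a" when both exist.
    for catalog in sorted(known_catalogs, key=len, reverse=True):
        if prefix.startswith(catalog + "_") and len(prefix) > len(catalog) + 1:
            return DataSourceGroup(catalog, prefix[len(catalog) + 1:], group_type)
    raise ValueError(f"no known catalog matches: {name!r}")


print(group_name("curated", "omop", "identified"))  # curated_omop_identified
```

Passing the known catalogs explicitly avoids guessing where the catalog name ends when names contain underscores.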

Identified Group

[CATALOG]_[DATASOURCE]_identified

This is the default group created for a data source. It is intended to provide read-only access to all of the data contained within that data source.

Limited Group

[CATALOG]_[DATASOURCE]_limited

An optional group that has read-only access to a subset of the data within the data source, enforced using masked data in views and various other security measures. This group will not exist for every data source and is only added when there is a need for it.

Note

This does not mean that the available data is a "limited data set" per HIPAA guidelines.

Data Stewards Group

[CATALOG]_[DATASOURCE]_stewards

An optional group that is designed to include the data stewards for each data source. This group allows its members to apply tags to the schema, tables, and columns that represent the associated data source. In the future, there is potential for members of this group to be granted additional permissions, such as applying column masks.
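The permissions described for the identified and stewards groups could be expressed as SQL GRANT statements. The sketch below only generates the statement text and does not execute anything. It assumes the workspace uses Unity Catalog and that the privilege names shown (USE CATALOG, USE SCHEMA, SELECT, APPLY TAG) match what the platform team actually grants; the function name data_source_grants is hypothetical. The limited group is omitted because its grants depend on which masked views exist for a given data source.

```python
def data_source_grants(catalog: str, schema: str, stewards: bool = False) -> list[str]:
    """Return GRANT statements for a data source's groups (text only)."""
    identified = f"{catalog}_{schema}_identified"
    statements = [
        # Read-only access to everything in the schema.
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{identified}`",
        f"GRANT USE SCHEMA, SELECT ON SCHEMA {catalog}.{schema} TO `{identified}`",
    ]
    if stewards:
        steward_group = f"{catalog}_{schema}_stewards"
        statements += [
            # Assumption: stewards only need to apply tags, not read data.
            f"GRANT USE CATALOG ON CATALOG {catalog} TO `{steward_group}`",
            f"GRANT USE SCHEMA, APPLY TAG ON SCHEMA {catalog}.{schema} TO `{steward_group}`",
        ]
    return statements


for statement in data_source_grants("curated", "omop", stewards=True):
    print(statement)
```

Review the generated statements against the current Databricks documentation before running them in the workspace.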

Adding a Data Source

Adding a new data source to the WUSM data lake involves a series of coordinated steps to ensure proper integration and management. This process includes gathering necessary team information, determining the ingestion strategy, documenting sensitive fields, and utilizing the Databasin tool to automate and streamline the setup. Below is an overview of the process:

  1. Collect Team Information: Gather the team name and WUSM cost center number. This is essential as all data sources must be tagged with a WusmDept to ensure the correct cost center is charged for the associated costs.

    Note

    We currently do not charge for hosting data sources. However, ensuring the tags exist when creating a new data source allows us to do so if needed.

  2. Determine Ingestion Strategy: This process requires some investigation and communication with the data source owner. Once access has been established, determine which data (or whether all data) needs to be ingested, the frequency of ingestion, and any details related to the ingestion strategy for each artifact/table.

  3. Document Sensitive Fields: Identify and document any sensitive fields included in the data source. This can be used to determine what access should be restricted when providing limited access to the data source.

  4. Create an Ingestion Pipeline: The Databasin tool should be used to create a new pipeline for the ingestion that is configured per the ingestion strategy defined in step two.

  5. Finalize Data Source Configuration: Once the data source is available in the data lake:

    • Build out the associated data source groups.
    • Assign the appropriate permissions for the data source groups.
    • Add the data source metadata, including the sensitive fields documented in step three, to the knowledge catalog.
    • Optionally, create masked views or other restrictions for the limited group.
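The information gathered in steps one through three can be captured in a small intake record and checked for completeness before a pipeline is created. The following is a minimal sketch: the class and field names are hypothetical, the frequency values are free-form labels, and it assumes the WusmDept tag value is the cost center number, which this document does not confirm.

```python
from dataclasses import dataclass, field


@dataclass
class TableIngestion:
    table: str
    frequency: str  # free-form label such as "daily" (hypothetical values)


@dataclass
class DataSourceIntake:
    team_name: str
    cost_center: str
    # Step 2: one entry per artifact/table to ingest.
    tables: list[TableIngestion] = field(default_factory=list)
    # Step 3: maps table name -> sensitive column names.
    sensitive_fields: dict[str, list[str]] = field(default_factory=dict)

    def missing_items(self) -> list[str]:
        """List what still has to be collected before creating the pipeline."""
        missing = []
        if not self.team_name.strip():
            missing.append("team name")
        if not self.cost_center.strip():
            missing.append("cost center number")
        if not self.tables:
            missing.append("ingestion strategy")
        unknown = set(self.sensitive_fields) - {t.table for t in self.tables}
        if unknown:
            missing.append("ingestion entries for: " + ", ".join(sorted(unknown)))
        return missing

    def tags(self) -> dict[str, str]:
        # Assumption: the WusmDept tag value is the cost center number.
        return {"WusmDept": self.cost_center}
```

An empty result from missing_items() indicates that the record covers steps one through three; the sensitive_fields mapping can then feed both the knowledge catalog entry and any masked views for the limited group.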

Assigning Data Source Access

Access to data sources should be controlled entirely by related data source groups. Team groups should be added to the appropriate data source group based on the team's needs and approval.

The one exception is the data source groups associated with the EPIC Clarity data source. Due to the additional restrictions placed upon this data source, we do not grant access to entire teams of users. Instead, each user must be approved to gain access, and is then placed directly into the appropriate data source group.

Important

Users should never be granted access directly to a data source within the data lake. If you have questions or need guidance, please contact the ICS Platform Engineering team.
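The rule above lends itself to a simple audit: every member of a data source group should be a team group, except in the groups for the EPIC Clarity data source. The sketch below works on membership data that has already been exported from the workspace. The Member type, the function name, and the example group name curated_clarity_identified are hypothetical; the real Clarity group names are not given in this document.

```python
from typing import NamedTuple


class Member(NamedTuple):
    name: str
    is_user: bool  # True for an individual user, False for a group


def direct_user_violations(
    memberships: dict[str, list[Member]],
    exempt_groups: set[str],
) -> list[tuple[str, str]]:
    """Return (data source group, user) pairs that break the team-group rule."""
    violations = []
    for group, members in sorted(memberships.items()):
        if group in exempt_groups:
            continue  # e.g. Clarity groups, where users are added directly
        for member in members:
            if member.is_user:
                violations.append((group, member.name))
    return violations


memberships = {
    "curated_omop_identified": [
        Member("wusm_datalake_pathology", is_user=False),
        Member("someone@example.edu", is_user=True),
    ],
    # Hypothetical name for a Clarity data source group.
    "curated_clarity_identified": [Member("approved@example.edu", is_user=True)],
}
print(direct_user_violations(memberships, {"curated_clarity_identified"}))
```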

Team Groups

Each team within the WUSM data lake will have a specific group associated with it in the Databricks workspace. These team groups are designed to manage access to various resources and data sources efficiently. The naming convention for these groups is wusm_datalake_[TEAM].

TODO: update this to provide guidance on using the type of access as the team name suffix. ie pathology_byob, pathology_operational, pi-name_irbXYZ

Users should only be added directly to team groups. Team groups will be added to the appropriate data source groups based on the team's needs and the data they require access to. This ensures that users within a team can access the necessary data sources without being granted direct access to the data sources themselves.

This approach simplifies user management and ensures that access permissions are consistently applied across the data lake.

Default Team Resources

  • Cluster Access: Each team group will have access to a dedicated cluster named [TEAM].
  • SQL Warehouse Access: Each team group will have access to a SQL Warehouse named [TEAM].
  • Sandbox Schema Access: Each team group will have access to a schema under the sandbox catalog named [TEAM].
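The team naming convention and the default resources listed above can be summarized in a small helper. This is a sketch for illustration only: the function names are hypothetical, and it assumes the [TEAM] placeholder is substituted verbatim in each resource name.

```python
def team_group_name(team: str) -> str:
    """Build a team group name of the form wusm_datalake_[TEAM]."""
    return f"wusm_datalake_{team}"


def default_team_resources(team: str) -> dict[str, str]:
    """Names of the resources each team group receives by default."""
    return {
        "group": team_group_name(team),
        "cluster": team,
        "sql_warehouse": team,
        "sandbox_schema": f"sandbox.{team}",
    }


print(default_team_resources("pathology"))
```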

Adding a Team to WUSM Data Lake

  • Collect the team name and WUSM cost center number
  • Determine if the standard setup will work for the team
    • If there are additional requirements for the cluster/compute, a ticket will need to be filed with Tier 3 data lake support
    • If the team can use the standard compute, proceed to the next step
  • Use the Databasin tool to create the team. You will provide the team name and the members' email addresses
    • Optional: Also provide the data source groups to assign the team group to
  • Databasin will create the team group and assign the members to it
    • If a list of data source groups was also provided, Databasin will add the team group to each of the provided groups
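The team onboarding flow above can be sketched as a function that returns the ordered actions for a request. It does not call Databasin, whose interface is not described here; it only models the decisions. The function name and the wording of each action are hypothetical. Because this document does not say what follows a Tier 3 ticket, the sketch stops at that point for teams with custom compute requirements.

```python
from collections.abc import Sequence


def plan_team_onboarding(
    team: str,
    cost_center: str,
    member_emails: Sequence[str],
    needs_custom_compute: bool = False,
    data_source_groups: Sequence[str] = (),
) -> list[str]:
    """Return the ordered actions for onboarding a team (description only)."""
    if not team or not cost_center:
        raise ValueError("team name and cost center number are required")
    if needs_custom_compute:
        # The standard setup does not fit, so the request goes to Tier 3.
        return ["File a ticket with Tier 3 data lake support for the compute requirements"]
    group = f"wusm_datalake_{team}"
    steps = [f"Create team group {group} with Databasin"]
    steps += [f"Add {email} to {group}" for email in member_emails]
    # Optional: attach the team group to the requested data source groups.
    steps += [f"Add {group} to {dsg}" for dsg in data_source_groups]
    return steps
```

For example, plan_team_onboarding("pathology", "CC0000", ["a@example.edu"], data_source_groups=["curated_omop_identified"]) yields three actions: create the group, add the member, and add the team group to curated_omop_identified. The cost center value shown is a placeholder.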

Adding a User to WUSM Data Lake

  1. Determine what team the user should be assigned to.
  2. Has the team already been added to the data lake?
    • If the team has not been created, then complete the process to onboard the team before proceeding.
    • If the team has already been created, continue to the next step.
  3. Add the user to the Databricks workspace.
  4. Assign the user to the appropriate team group.
graph TB
  A["Determine what team the user should be assigned to."] --> B["Has the team already been added to the data lake?"]
  B -- "No" --> C["Onboard the team"]
  B -- "Yes" --> D["Add the user to the Databricks workspace."]
  C --> D
  D --> E["Assign the user to the appropriate team group."]
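The same user onboarding decision can be sketched in Python. The function below only lists the steps that remain for a given user; it does not touch the workspace. Its name and the step wording are hypothetical, and existing_teams stands in for whatever lookup is actually used to check whether a team has been onboarded.

```python
def user_onboarding_steps(user_email: str, team: str, existing_teams: set[str]) -> list[str]:
    """Return the ordered steps needed to give a user access via a team group."""
    steps = []
    if team not in existing_teams:
        # The team must exist before any user can be assigned to it.
        steps.append(f"Onboard team {team}")
    steps.append(f"Add {user_email} to the Databricks workspace")
    steps.append(f"Assign {user_email} to wusm_datalake_{team}")
    return steps


print(user_onboarding_steps("someone@example.edu", "pathology", {"pathology"}))
```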

Glossary

  • Data Source: A schema within the Databricks workspace containing structured data organized into tables and columns.
  • Knowledge Catalog: A comprehensive resource containing metadata about data sources, tables, and columns.
  • Data Steward: A designated individual responsible for maintaining and updating the knowledge catalog and ensuring data quality.
  • Shared Data Source: A data source accessible to multiple users or teams, often with specific access controls.
  • Identified Group: A group with read-only access to all data within a data source.
  • Limited Group: A group with read-only access to a subset of data, often using masked data or other security measures.

Updated on August 7, 2025