Adding a New Data Source to the WUSM Data Lake

Adding a new data source to the WUSM Data Lake is a collaborative and structured process designed to ensure data integrity, security, and compliance. This guide provides a step-by-step walkthrough for administrators, data stewards, and technical teams.

Intake and Approval Process

The process begins with the designated intake form, through which all requests for new data sources must be submitted. The form collects essential information, including:

  • A description of the data source and its contents
  • The business need for adding the data source
  • The department making the request
  • The billing cost center (WusmDept)
  • The designated data steward

Once submitted, the data lake team reviews the request and collaborates with the data steward to clarify requirements and ensure all necessary details are provided.

Planning and Integration

After approval, the following steps are taken:

  1. Collect Team and Cost Center Information:
    Every data source must be tagged with the appropriate WusmDept to ensure accurate cost allocation and, where needed, future billing.

  2. Determine Ingestion Strategy:
    The data lake team works with the data source owner to define what data will be ingested, the frequency of ingestion, and any special requirements. This includes identifying which tables, files, or data subsets are needed and how often they should be updated.

  3. Document Sensitive Fields:
    Sensitive fields must be identified and documented. This is critical for compliance and for configuring access controls, especially when creating masked views for limited-access groups.

  4. Configure Ingestion Pipeline:
    The Databasin tool is used to automate the creation of the data assets in the cleansed catalog.

  5. Finalize Data Source Configuration:
    Once the data is available in the lake, the team:

    • Builds out the associated data source groups (see below)
    • Assigns permissions based on user roles and sensitivity
    • Ensures the proper tags are applied to the schemas and tables
    • Optionally, creates masked views for limited-access groups (see the sketch after this list)
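
A masked view can be implemented as a Unity Catalog dynamic view that checks group membership at query time. The sketch below is illustrative only: it assumes a Databricks notebook (where a SparkSession is available) and uses placeholder names (the example_source schema, encounters table, patient_mrn column, and cleansed_example_source_identified group) that do not correspond to real WUSM assets.

    # Minimal sketch of a masked view for a limited-access group.
    # All catalog, schema, table, column, and group names are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # pre-created in Databricks notebooks

    spark.sql("""
        CREATE OR REPLACE VIEW cleansed.example_source.encounters_limited AS
        SELECT
            encounter_id,
            department,
            -- Sensitive field: only members of the identified group see the value.
            CASE
                WHEN is_account_group_member('cleansed_example_source_identified')
                    THEN patient_mrn
                ELSE NULL
            END AS patient_mrn,
            encounter_date
        FROM cleansed.example_source.encounters
    """)

Members of the identified group see the real value when they query the view; everyone else sees NULL. The limited group is then granted access to the view rather than to the underlying table (see the grants sketch in the next section).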

Data Access Control and Group Management

Access to data sources is managed through Databricks groups:

  • Identified Group:
    [CATALOG]_[DATASOURCE]_identified
    Provides read-only access to all data in the data source.

  • Limited Group:
    [CATALOG]_[DATASOURCE]_limited
    (Optional) Provides read-only access to a subset of data, typically using masked views for sensitive fields.

  • Data Stewards Group:
    [CATALOG]_[DATASOURCE]_stewards
    (Optional) Includes data stewards who can manage metadata and, in the future, may have additional permissions.

Team groups (e.g., wusm_datalake_[TEAM]) are linked to the appropriate data source groups to grant access. Users should only be added to team groups, not directly to data source groups, except in special cases (e.g., EPIC Clarity).
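
The sketch below shows what the underlying read-only grants might look like in Unity Catalog SQL, run from a Databricks notebook. The catalog, schema, view, and group names are placeholders, and the statements assume that privileges granted at the schema level are inherited by its tables and views.

    # Sketch of the read-only grants behind the identified and limited groups.
    # All names are placeholders, not real WUSM assets.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # pre-created in Databricks notebooks

    # Identified group: read-only access to everything in the data source schema.
    for stmt in [
        "GRANT USE CATALOG ON CATALOG cleansed TO `cleansed_example_source_identified`",
        "GRANT USE SCHEMA ON SCHEMA cleansed.example_source TO `cleansed_example_source_identified`",
        "GRANT SELECT ON SCHEMA cleansed.example_source TO `cleansed_example_source_identified`",
    ]:
        spark.sql(stmt)

    # Limited group: may browse the schema but can SELECT only the masked views.
    for stmt in [
        "GRANT USE CATALOG ON CATALOG cleansed TO `cleansed_example_source_limited`",
        "GRANT USE SCHEMA ON SCHEMA cleansed.example_source TO `cleansed_example_source_limited`",
        "GRANT SELECT ON cleansed.example_source.encounters_limited TO `cleansed_example_source_limited`",
    ]:
        spark.sql(stmt)

    # Team groups (e.g., wusm_datalake_example_team) are linked by nesting them
    # inside the data source groups in the Databricks account console or via
    # SCIM provisioning; group membership is not managed through SQL.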

Metadata Tagging

Every new data asset must be tagged with relevant metadata, including department and billing cost center. This ensures assets are easily identifiable and organized within the data lake.
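
As an illustration, tags can be applied with Unity Catalog SET TAGS statements. The schema, table, tag keys other than WusmDept, and all tag values below are placeholders.

    # Sketch of tagging a data source schema and one of its tables.
    # Names and values are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # pre-created in Databricks notebooks

    # Tag the schema with the owning department and billing cost center.
    spark.sql("""
        ALTER SCHEMA cleansed.example_source
        SET TAGS ('department' = 'Example Department', 'WusmDept' = '1234567')
    """)

    # Child tables can carry the same tags so they remain identifiable on their own.
    spark.sql("""
        ALTER TABLE cleansed.example_source.encounters
        SET TAGS ('department' = 'Example Department', 'WusmDept' = '1234567')
    """)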

Ongoing Maintenance

  • Inventory:
    Maintain an up-to-date inventory of all data sources and their metadata.
  • Auditing:
    Regularly review access and usage to ensure compliance (see the sketch after this list).
  • Change Management:
    Log all changes to data assets and notify relevant stakeholders.
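
For the auditing step, one lightweight check is to list the current grants on each data source schema and confirm that only the expected groups appear. The sketch below assumes a Databricks notebook and uses a placeholder schema name.

    # Sketch of a periodic access review using SHOW GRANTS.
    # The schema name is a placeholder.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # pre-created in Databricks notebooks

    # List every principal and privilege on the data source schema so the
    # data steward can confirm only the expected groups are present.
    grants = spark.sql("SHOW GRANTS ON SCHEMA cleansed.example_source")
    grants.show(truncate=False)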

Conclusion

By following this process, the WUSM Data Lake team ensures that new data sources are onboarded efficiently, securely, and in compliance with institutional policies. Collaboration with data stewards and adherence to access control best practices are essential for maintaining the integrity and value of the data lake.

Updated on August 7, 2025