WUSM Data Lake Glossary
Introduction
Welcome to the WUSM Data Lake Administrator's Guide. This guide is written for the administrators and technical teams who manage the WUSM data lake environment. It describes the procedures for data management, user access, and general administration of the data lake, with the aim of keeping those processes consistent, secure, and efficient.
Overview
The WUSM data lake serves as a centralized repository, facilitating the storage, management, and analysis of vast amounts of structured and unstructured data. The data lake's architecture is built around several key components and concepts:
- Data Source Groups: These are specific groups associated with each data source in the WUSM data lake. They play a pivotal role in determining the level of access to the data.
- Team Groups: Within the Databricks workspace, each team has an associated group. Users are added to these team groups, which are then linked to the appropriate data source groups to ensure the right access levels.
- Knowledge Catalog: This is a meticulously curated database containing metadata about each data source. It encompasses details about tables, sensitive fields, and other pertinent information. The WUSM data lake team and designated data stewards maintain it.
- Data Stewards: These individuals are entrusted with specific data sources. Their responsibilities include approving and auditing user access and ensuring the knowledge catalog's accuracy and relevance.
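The relationship between users, team groups, and data source groups can be pictured with a small in-memory model. This is a minimal sketch, not the data lake's actual implementation: the class name `AccessModel`, its methods, and the sample team, user, and group names are all hypothetical.

```python
from collections import defaultdict


class AccessModel:
    """Hypothetical in-memory model of team groups and data source groups."""

    def __init__(self):
        # team group name -> set of user emails
        self.team_members = defaultdict(set)
        # team group name -> set of linked data source group names
        self.team_links = defaultdict(set)

    def add_user(self, team_group, email):
        """Add a user to a team group."""
        self.team_members[team_group].add(email)

    def link(self, team_group, source_group):
        """Link a team group to a data source group."""
        self.team_links[team_group].add(source_group)

    def source_groups_for(self, email):
        """Return every data source group a user reaches through their teams."""
        reachable = set()
        for team, members in self.team_members.items():
            if email in members:
                reachable |= self.team_links[team]
        return reachable


# Illustrative names only; none of these come from the real environment.
model = AccessModel()
model.add_user("team-cardiology", "analyst@example.edu")
model.link("team-cardiology", "src-ehr-masked")
print(sorted(model.source_groups_for("analyst@example.edu")))
```

The point of the indirection is that users never hold data source access directly: access follows from team membership plus the team-to-source links.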
Adding a Data Source to WUSM Data Lake
When introducing a new data source to the WUSM Data Lake, make sure the data is correctly categorized and the right access levels are set. Here's how to go about it:
- Collect Information: Start by obtaining the team name and the WUSM cost center number. Every data source must be tagged with a WusmDept so that costs are allocated accurately.
- Determine Ingestion Strategy: Talk with the data source owner to understand the nature of the data, its requirements, and how best to integrate it into the data lake. Decide which data subsets to ingest and how often to ingest them.
- Document Sensitive Fields: Identify and document every field in the data source that is considered sensitive, so that these fields can be masked or restricted as needed.
- Use the Databasin Tool: This automation tool will:
  - Set up a new ingestion pipeline.
  - Populate the knowledge catalog with relevant metadata.
  - Create the necessary data source groups.
  - Generate masked views for users with limited access.
  - Assign permissions based on the knowledge catalog's data.
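To make the idea of a masked view concrete, the sketch below builds a generic SQL view definition from a list of columns and a set of sensitive fields, the kind of metadata the knowledge catalog is described as holding. It does not reproduce what the Databasin tool actually generates: the function name `masked_view_sql`, the `_masked` view suffix, the constant mask value, and the sample table are all assumptions made for illustration.

```python
def masked_view_sql(table, columns, sensitive, mask="'***'"):
    """Build a CREATE VIEW statement that hides sensitive columns.

    table, columns, and sensitive stand in for knowledge catalog metadata.
    The `_masked` suffix and the constant mask value are assumptions.
    """
    unknown = set(sensitive) - set(columns)
    if unknown:
        raise ValueError(f"sensitive fields not in table: {sorted(unknown)}")
    # Replace each sensitive column with the mask, keeping its column name.
    select_list = [
        f"{mask} AS {col}" if col in sensitive else col
        for col in columns
    ]
    return (
        f"CREATE OR REPLACE VIEW {table}_masked AS\n"
        f"SELECT {', '.join(select_list)}\n"
        f"FROM {table}"
    )


# Hypothetical table and fields.
print(masked_view_sql("ehr.patients", ["id", "name", "zip"], {"name"}))
```

Users with limited access would then be granted the masked view rather than the underlying table, which is why documenting sensitive fields has to happen before the tool runs.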
Adding a Team to WUSM Data Lake
When onboarding a new team to the data lake, administrators must ensure that the team has the resources and access they need. Here's the procedure:
- Collect Information: As with data sources, begin by obtaining the team name and the WUSM cost center number.
- Determine Setup Requirements: Not all teams have the same requirements. Assess whether the team can work with the standard setup or has unique needs. If it has unique needs, contact Tier 3 data lake support.
- Use the Databasin Tool: This tool simplifies the team onboarding process. Provide it with the team name and the email addresses of all team members. Optionally, you can also specify which data source groups the team should have access to. The tool will handle group creation, member addition, and access assignments.
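Before handing a team to the tool, it can help to check that the inputs listed above are complete. The sketch below is one way to do that; it does not call the Databasin tool, and the `TeamRequest` class, its field names, and the sample values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TeamRequest:
    """Hypothetical record of the inputs gathered before onboarding a team."""
    team_name: str
    cost_center: str
    member_emails: list
    data_source_groups: list = field(default_factory=list)  # optional

    def problems(self):
        """Return a list of human-readable problems; empty means ready."""
        found = []
        if not self.team_name.strip():
            found.append("team name is missing")
        if not self.cost_center.strip():
            found.append("cost center number is missing")
        if not self.member_emails:
            found.append("no member emails provided")
        for email in self.member_emails:
            # Deliberately loose check; real validation may be stricter.
            if "@" not in email:
                found.append(f"not an email address: {email}")
        return found


# Illustrative values only.
request = TeamRequest("Cardiology", "CC1234", ["analyst@example.edu"])
print(request.problems())
```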
Adding a User to WUSM Data Lake
Onboarding a new user involves ensuring they have access to the right resources and are part of the correct teams. Here's the step-by-step process:
- Team Assignment: Identify which team the new user will belong to.
- Team Verification: Before proceeding, verify that the team is already set up in the data lake. If not, the team needs to be onboarded first.
- User Onboarding: With the team in place, add the user to the Databricks workspace.
- Group Assignment: Finally, assign the user to their team group, which gives them the right level of access to data sources.
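The ordering of these steps matters: the team check comes before any change is made. A minimal sketch of that ordering follows, using plain Python collections as stand-ins for the workspace and its groups. The function `onboard_user`, the exception `TeamNotOnboarded`, and the sample names are hypothetical and do not correspond to any Databricks or Databasin API.

```python
class TeamNotOnboarded(Exception):
    """Raised when a user is assigned to a team that does not exist yet."""


def onboard_user(email, team, teams, workspace_users):
    """Apply the steps in order against in-memory stand-ins.

    teams: dict mapping team name -> set of member emails (the team groups)
    workspace_users: set of emails already present in the workspace
    """
    # Team assignment and verification: the team must already exist.
    if team not in teams:
        raise TeamNotOnboarded(f"onboard team {team!r} before adding {email}")
    # User onboarding: add the user to the workspace (no-op if present).
    workspace_users.add(email)
    # Group assignment: add the user to the team group.
    teams[team].add(email)


# Illustrative data only.
teams = {"cardiology": set()}
workspace_users = set()
onboard_user("analyst@example.edu", "cardiology", teams, workspace_users)
```

Because the check raises before anything is modified, a user is never left in the workspace without a team group.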
Assigning Data Source Access
Ensuring teams have the right access to data sources is a critical administrative task. Here's how to manage it:
- Data Source Verification: Before assigning access, confirm that the data source is already integrated into the data lake. If it isn't, you'll need to add the data source first.
- Determine Access Level: Different teams may require different levels of access to a data source. Decide on the appropriate access level for the team in question.
- Review Documentation: Always refer to any regulatory or internal documentation related to the data source. This ensures compliance and security.
- Group Linking: The final step is to link the team group to the appropriate data source group, granting the team access to the data source.
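The verification and linking steps can be sketched as a single function that refuses to link until the data source exists and the requested level is valid. This is an illustration under stated assumptions: the function `link_team_to_source`, the access level names `full` and `masked`, and the group names are hypothetical. The documentation review is a manual step and is not modeled here.

```python
def link_team_to_source(team, source, level, source_groups, links):
    """Link a team group to the data source group for the chosen level.

    source_groups: dict mapping source name -> {access level: group name}
    links: dict mapping team group name -> set of linked data source groups
    """
    # Data source verification: the source must already be in the data lake.
    if source not in source_groups:
        raise LookupError(f"add data source {source!r} to the data lake first")
    # Access level: it must be one the data source actually offers.
    levels = source_groups[source]
    if level not in levels:
        raise ValueError(
            f"{source!r} has no {level!r} group; choose from {sorted(levels)}"
        )
    # Group linking: record the team-to-source-group link.
    links.setdefault(team, set()).add(levels[level])
    return levels[level]


# Illustrative data only; level and group names are assumptions.
source_groups = {"ehr": {"full": "src-ehr-full", "masked": "src-ehr-masked"}}
links = {}
linked_group = link_team_to_source(
    "team-cardiology", "ehr", "masked", source_groups, links
)
print(linked_group)
```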
By adhering to the procedures and best practices outlined in this guide, administrators can ensure the smooth operation, security, and efficiency of the WUSM data lake environment. Always prioritize data security and compliance with organizational policies and regulatory requirements.