Introduction
This document provides a comprehensive guide for implementing, managing, and using the metadata tagging solution for the WUSM Data Lake. The solution is designed to ensure compliance with HIPAA standards, facilitate efficient data organization, and improve discoverability for data consumers. While end-users will not manage tags, they will be able to browse metadata to gain insights into the data assets.
Metadata Tagging Framework
Supported Tags
Tag Name | Description | Usage | Examples |
---|---|---|---|
WusmDepartment | Identifies the department responsible for the data asset. | Used for billing and organizational tracking. | |
ProjectAssociation | Links the data asset to a specific project or initiative. | Useful for tracking project-related data assets. | |
DataSource | Indicates that a schema is a "supported dataset" and provides its name. | Essential for generating a list of supported datasets and their associated metadata. | |
IsSupportedDataSource | Indicates that a schema is a "supported dataset". | Essential for generating a list of supported datasets and their associated metadata. Future use may include powering a portion of the ICS community site. | |
DataStewards | List of email contacts that manage the data source. | Allows consumers to find the appropriate contact for questions regarding a data source. Also allows for automated reporting for Data Stewards. For ICS-supported data sources, this could be set to datalake@wustl.edu. | |
DataSensitivity | Indicates the sensitivity level of the data (e.g., Public, Internal, Restricted). | Helps enforce access controls and compliance with data privacy regulations. | |
RetentionPolicy | Specifies the retention period for the data asset. | Ensures compliance with data retention policies and facilitates lifecycle management. | |
ComplianceClassification | Identifies compliance requirements (e.g., HIPAA, GDPR). | Ensures adherence to regulatory standards. | |
Descriptions for Metadata
- Schemas and Tables: All schemas and tables must have a description applied to them. Descriptions should provide clear and concise information about the purpose and contents of the schema or table.
- Fields: Fields within tables should also have descriptions, especially if they are tagged with DataSensitivity. This ensures users can easily identify the level of HIPAA data contained within a table.
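The description requirement above can be checked with a query. The following is a minimal sketch, assuming the Data Lake runs on Unity Catalog and exposes table descriptions through the comment column of system.information_schema.tables; <catalog_name> is a placeholder.

```sql
-- List tables that have no description (comment) set.
-- <catalog_name> is a placeholder for the catalog being reviewed.
SELECT table_schema, table_name
FROM system.information_schema.tables
WHERE table_catalog = '<catalog_name>'
  AND table_schema <> 'information_schema'      -- skip the built-in metadata schema
  AND (comment IS NULL OR trim(comment) = '')   -- missing or blank description
ORDER BY table_schema, table_name;
```

An empty result means every table in the catalog carries a description; it says nothing about whether the description is accurate.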
Implementation Steps
Schema and Table Tagging
- Schema Creation: When onboarding a new team to Databricks, create a schema and assign the WusmDepartment tag to account for billing. Example SQL command:
  ALTER SCHEMA <schema_name> SET TAGS ('WusmDepartment' = '<Department Name>');
- Assigning Additional Tags:
  Add the DataSource tag to indicate the schema is a supported dataset:
  ALTER SCHEMA <schema_name> SET TAGS ('DataSource' = '<Dataset Name>');
  Add the DataSensitivity tag to tables and fields as needed:
  ALTER TABLE <table_name> SET TAGS ('DataSensitivity' = 'Restricted');
  ALTER TABLE <table_name> ALTER COLUMN <column_name> SET TAGS ('DataSensitivity' = 'PHI');
  Add the RetentionPolicy tag to specify the retention period for the data asset:
  ALTER TABLE <table_name> SET TAGS ('RetentionPolicy' = '7 years');
  Add the ProjectAssociation tag to link the data asset to a specific project:
  ALTER SCHEMA <schema_name> SET TAGS ('ProjectAssociation' = 'Project ABC');
  Add the ComplianceClassification tag to identify compliance requirements:
  ALTER TABLE <table_name> SET TAGS ('ComplianceClassification' = 'HIPAA');
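Several schema-level tags can be applied in one statement and then read back to confirm the result. This is a sketch only: the catalog main, the schema cardiology_research, and the tag values are hypothetical, and the verification query assumes Unity Catalog's system.information_schema.schema_tags view is available.

```sql
-- Hypothetical catalog, schema, and values, used for illustration only.
ALTER SCHEMA main.cardiology_research SET TAGS (
  'WusmDepartment'     = 'Cardiology',
  'DataSource'         = 'Cardiology Research Dataset',
  'ProjectAssociation' = 'Project ABC'
);

-- Read the tags back to confirm what is recorded on the schema.
SELECT tag_name, tag_value
FROM system.information_schema.schema_tags
WHERE catalog_name = 'main'
  AND schema_name  = 'cardiology_research'
ORDER BY tag_name;
```

Setting a tag that already exists overwrites its value, so the same statement can be re-run when a value changes.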
Adding Descriptions
Apply descriptions to schemas, tables, and fields using the following SQL commands:
COMMENT ON SCHEMA <schema_name> IS 'Description of the schema';
COMMENT ON TABLE <table_name> IS 'Description of the table';
COMMENT ON COLUMN <table_name>.<column_name> IS 'Description of the field';
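Because fields tagged with DataSensitivity are the ones that most need a description, it can help to list tagged columns that still lack one. A minimal sketch, assuming Unity Catalog's column_tags and columns views in system.information_schema:

```sql
-- Columns that carry a DataSensitivity tag but have no description.
SELECT t.schema_name,
       t.table_name,
       t.column_name,
       t.tag_value AS data_sensitivity
FROM system.information_schema.column_tags AS t
JOIN system.information_schema.columns AS c
  ON  c.table_catalog = t.catalog_name
  AND c.table_schema  = t.schema_name
  AND c.table_name    = t.table_name
  AND c.column_name   = t.column_name
WHERE t.tag_name = 'DataSensitivity'
  AND (c.comment IS NULL OR trim(c.comment) = '')   -- missing or blank description
ORDER BY t.schema_name, t.table_name, t.column_name;
```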
Management Workflow
Role-Based Access Control (RBAC)
- Data Stewards: Responsible for managing tags and descriptions for their assigned schemas. Permissions are granted using Databricks groups:
  GRANT MODIFY ON SCHEMA <schema_name> TO `<data_steward_group>`;
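Under Unity Catalog, setting tags is typically governed by the APPLY TAG privilege (together with USE CATALOG and USE SCHEMA) rather than by MODIFY alone, so a steward group may need additional grants. The sketch below assumes Unity Catalog; the schema and group names are hypothetical, and the exact privileges required should be confirmed against the workspace's configuration.

```sql
-- Hypothetical catalog, schema, and group names.
GRANT USE CATALOG ON CATALOG main TO `cardiology_data_stewards`;
GRANT USE SCHEMA  ON SCHEMA main.cardiology_research TO `cardiology_data_stewards`;
GRANT APPLY TAG   ON SCHEMA main.cardiology_research TO `cardiology_data_stewards`;

-- Review what the group currently holds on the schema.
SHOW GRANTS `cardiology_data_stewards` ON SCHEMA main.cardiology_research;
```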
Periodic Reviews
Data stewards should periodically review tags and descriptions to ensure accuracy and compliance.
Missing or outdated tags should be updated promptly.
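A periodic review can start from a list of schemas that are missing a required tag. The sketch below checks for WusmDepartment, assuming Unity Catalog's schemata and schema_tags views; <catalog_name> is a placeholder, and the tag name can be swapped for any other required tag.

```sql
-- Schemas in a catalog that do not yet have a WusmDepartment tag.
SELECT s.schema_name
FROM system.information_schema.schemata AS s
LEFT JOIN system.information_schema.schema_tags AS t
  ON  t.catalog_name = s.catalog_name
  AND t.schema_name  = s.schema_name
  AND t.tag_name     = 'WusmDepartment'
WHERE s.catalog_name = '<catalog_name>'
  AND s.schema_name <> 'information_schema'   -- skip the built-in metadata schema
  AND t.tag_name IS NULL                      -- no matching tag row was found
ORDER BY s.schema_name;
```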
Compliance Reporting
Generate reports to identify:
- Individuals with access to sensitive tables.
- Supported datasets and their associated metadata.
Example SQL query for compliance reporting:
SELECT * FROM system.information_schema.table_tags WHERE tag_name = 'DataSensitivity';
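The two reports listed above can be sketched as follows, assuming Unity Catalog's table_tags, table_privileges, schema_tags, and schemata views in system.information_schema. The 'Restricted' filter is an example value taken from the tagging steps. Grantees are often groups, so resolving them to individuals requires a separate group-membership lookup that is not shown here.

```sql
-- Report 1: principals holding privileges on tables tagged as Restricted.
SELECT t.schema_name,
       t.table_name,
       t.tag_value AS data_sensitivity,
       p.grantee,
       p.privilege_type
FROM system.information_schema.table_tags AS t
JOIN system.information_schema.table_privileges AS p
  ON  p.table_catalog = t.catalog_name
  AND p.table_schema  = t.schema_name
  AND p.table_name    = t.table_name
WHERE t.tag_name  = 'DataSensitivity'
  AND t.tag_value = 'Restricted'
ORDER BY t.schema_name, t.table_name, p.grantee;

-- Report 2: supported datasets (schemas with a DataSource tag) and their descriptions.
SELECT s.schema_name,
       t.tag_value AS dataset_name,
       s.comment   AS description
FROM system.information_schema.schema_tags AS t
JOIN system.information_schema.schemata AS s
  ON  s.catalog_name = t.catalog_name
  AND s.schema_name  = t.schema_name
WHERE t.tag_name = 'DataSource'
ORDER BY s.schema_name;
```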
Data Lineage
Tracking data lineage is essential for understanding the flow and transformations of data. Databricks provides tools to capture lineage information, which can be included in metadata to support debugging and compliance audits. Ensure lineage is working and available for all critical datasets.
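One way to confirm that lineage is being captured for a critical dataset is to look for recent upstream records. This is a sketch under the assumption that Databricks system tables are enabled and that lineage is exposed through system.access.table_lineage; the three-part table name is a placeholder, and the 30-day window is an arbitrary choice.

```sql
-- Upstream tables that fed a given table in the last 30 days.
SELECT source_table_full_name,
       max(event_time) AS last_seen
FROM system.access.table_lineage
WHERE target_table_full_name = '<catalog_name>.<schema_name>.<table_name>'
  AND source_table_full_name IS NOT NULL
  AND event_date >= date_sub(current_date(), 30)
GROUP BY source_table_full_name
ORDER BY last_seen DESC;
```

No rows for a table that is known to be refreshed regularly suggests lineage capture should be investigated for that dataset.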
Automated Tagging
Some tagging automation has been included in the Databasin application to manage the WusmDepartment tag for billing purposes. All other tags must be managed manually through the Databricks UI or code.
Conclusion
This implementation guide provides a robust framework for managing metadata tags in the WUSM Data Lake. By following the outlined steps and workflows, data stewards can ensure compliance, improve data discoverability, and support end-users in accessing valuable metadata insights. Collaboration between data stewards and support staff is essential for the success of this initiative.