Introduction

This document provides a comprehensive guide for implementing, managing, and using the metadata tagging solution for the WUSM Data Lake. The solution is designed to ensure compliance with HIPAA standards, facilitate efficient data organization, and improve discoverability for data consumers. While end-users will not manage tags, they will be able to browse metadata to gain insights into the data assets.

Metadata Tagging Framework

Supported Tags

Tag Name | Description | Usage Examples
WusmDepartment | Identifies the department responsible for the data asset. | Used for billing and organizational tracking.
ProjectAssociation | Links the data asset to a specific project or initiative. | Useful for tracking project-related data assets.
DataSource | Indicates that a schema is a "supported dataset" and provides its name. | Essential for generating a list of supported datasets and their associated metadata.
IsSupportedDataSource | Indicates that a schema is a "supported dataset". | Essential for generating a list of supported datasets and their associated metadata. Future use may include powering a portion of the ICS community site.
DataStewards | List of email contacts that manage the data source. | Allows consumers to find the appropriate contact for questions regarding a data source. Also allows for automated reporting for Data Stewards. For ICS-supported data sources, this could be set to datalake@wustl.edu.
DataSensitivity | Indicates the sensitivity level of the data (e.g., Public, Internal, Restricted). | Helps enforce access controls and compliance with data privacy regulations.
RetentionPolicy | Specifies the retention period for the data asset. | Ensures compliance with data retention policies and facilitates lifecycle management.
ComplianceClassification | Identifies compliance requirements (e.g., HIPAA, GDPR). | Ensures adherence to regulatory standards.
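
Before tags are applied, it can help to check them against the supported tag names listed above. The following is a minimal Python sketch of such a check, not part of the documented solution. The helper name validate_tags and the sample department value are hypothetical, and the function only inspects a dictionary; it does not connect to Databricks.

```python
# Tag names copied from the Supported Tags list in this guide.
SUPPORTED_TAGS = {
    "WusmDepartment",
    "ProjectAssociation",
    "DataSource",
    "IsSupportedDataSource",
    "DataStewards",
    "DataSensitivity",
    "RetentionPolicy",
    "ComplianceClassification",
}


def validate_tags(tags):
    """Return a list of problems found in a {tag_name: tag_value} mapping.

    Hypothetical helper: it checks only for unsupported names and empty
    values. It does not validate the values themselves.
    """
    problems = []
    for name, value in tags.items():
        if name not in SUPPORTED_TAGS:
            problems.append(f"unsupported tag: {name}")
        elif not str(value).strip():
            problems.append(f"empty value for tag: {name}")
    return problems


# Sample input (made-up values): one valid tag and one misspelled tag name.
print(validate_tags({"WusmDepartment": "Pediatrics", "Departmnt": "x"}))
```

Tag names are compared case-sensitively here, which is an assumption; adjust the comparison if local conventions differ.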

Descriptions for Metadata

  • Schemas and Tables: All schemas and tables must have a description applied to them. Descriptions should provide clear and concise information about the purpose and contents of the schema or table.
  • Fields: Fields within tables should also have descriptions, especially if they are tagged with DataSensitivity, so that users can tell which fields hold HIPAA-regulated data and at what sensitivity level.

Implementation Steps

Schema and Table Tagging

  1. Schema Creation:

    When onboarding a new team to Databricks, create a schema and assign the WusmDepartment tag to account for billing.

    Example SQL command:

    ALTER SCHEMA <schema_name> SET TAGS ('WusmDepartment' = '<Department Name>');
  2. Assigning Additional Tags:

    Add the DataSource tag to indicate the schema is a supported dataset:

    ALTER SCHEMA <schema_name> SET TAGS ('DataSource' = '<Dataset Name>');

    Add the DataSensitivity tag to tables and fields as needed:

    ALTER TABLE <table_name> SET TAGS ('DataSensitivity' = 'Restricted');
    ALTER TABLE <table_name> ALTER COLUMN <column_name> SET TAGS ('DataSensitivity' = 'PHI');

    Add the RetentionPolicy tag to specify the retention period for the data asset:

    ALTER TABLE <table_name> SET TAGS ('RetentionPolicy' = '7 years');

    Add the ProjectAssociation tag to link the data asset to a specific project:

    ALTER SCHEMA <schema_name> SET TAGS ('ProjectAssociation' = 'Project ABC');

    Add the ComplianceClassification tag to identify compliance requirements:

    ALTER TABLE <table_name> SET TAGS ('ComplianceClassification' = 'HIPAA');
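
When many schemas and tables need tagging, the statements can be generated rather than typed by hand. The following Python sketch builds one statement per object using the ALTER ... SET TAGS ('name' = 'value') form. The function name set_tags_sql, the object name research.visits, and the column name mrn are hypothetical. The sketch rejects values containing quotes or backslashes instead of escaping them, and it does not validate the object name, so inputs should come from a trusted source.

```python
def _check_value(text):
    # Reject characters that would need escaping inside a SQL string literal.
    if "'" in text or "\\" in text:
        raise ValueError(f"unsupported character in {text!r}")
    return text


def set_tags_sql(object_type, object_name, tags, column=None):
    """Build one SET TAGS statement for a schema, a table, or a table column."""
    if object_type not in ("SCHEMA", "TABLE"):
        raise ValueError("object_type must be 'SCHEMA' or 'TABLE'")
    if column is not None and object_type != "TABLE":
        raise ValueError("column tags require object_type 'TABLE'")
    if not tags:
        raise ValueError("tags must not be empty")
    pairs = ", ".join(
        f"'{_check_value(name)}' = '{_check_value(value)}'"
        for name, value in tags.items()
    )
    target = f"ALTER {object_type} {object_name}"
    if column is not None:
        target += f" ALTER COLUMN {column}"
    return f"{target} SET TAGS ({pairs});"


# Hypothetical table and column names, for illustration only.
print(set_tags_sql("TABLE", "research.visits",
                   {"DataSensitivity": "Restricted", "RetentionPolicy": "7 years"}))
print(set_tags_sql("TABLE", "research.visits", {"DataSensitivity": "PHI"}, column="mrn"))
```

The generated strings can be reviewed before they are run in a Databricks SQL editor or notebook.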

Adding Descriptions

Apply descriptions to schemas, tables, and fields using the following SQL commands:

COMMENT ON SCHEMA <schema_name> IS 'Description of the schema';
COMMENT ON TABLE <table_name> IS 'Description of the table';
COMMENT ON COLUMN <table_name>.<column_name> IS 'Description of the field';

Management Workflow

Role-Based Access Control (RBAC)

  • Data Stewards:

    Responsible for managing tags and descriptions for their assigned schemas.

    Permissions are granted to Databricks groups. Unity Catalog controls tagging through the APPLY TAG privilege, so grant it alongside MODIFY:

    GRANT APPLY TAG, MODIFY ON SCHEMA <schema_name> TO `<data_steward_group>`;

Periodic Reviews

Data stewards should periodically review tags and descriptions to ensure accuracy and compliance.

Missing or outdated tags should be updated promptly.
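
A periodic review can start from a simple list of gaps. The following Python sketch flags assets without a description and schemas without the WusmDepartment tag, the two requirements stated earlier in this guide. The function name audit_assets and the dictionary keys are hypothetical; how the asset list is exported from Databricks is left open.

```python
# Assumption: only WusmDepartment is required on every schema. Extend as needed.
REQUIRED_SCHEMA_TAGS = {"WusmDepartment"}


def audit_assets(assets):
    """Yield (asset_name, issue) pairs for assets that need steward attention.

    Each asset is a dict with hypothetical keys: "name", "kind"
    ("schema" or "table"), "description", and "tags" (a dict).
    """
    for asset in assets:
        if not (asset.get("description") or "").strip():
            yield asset["name"], "missing description"
        if asset.get("kind") == "schema":
            missing = REQUIRED_SCHEMA_TAGS - set(asset.get("tags") or {})
            for tag in sorted(missing):
                yield asset["name"], f"missing tag: {tag}"


# Made-up sample data: an untagged, undescribed schema and a described table.
sample = [
    {"name": "research", "kind": "schema", "description": "", "tags": {}},
    {"name": "research.visits", "kind": "table", "description": "Clinic visits", "tags": {}},
]
for name, issue in audit_assets(sample):
    print(name, "->", issue)
```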

Compliance Reporting

Generate reports to identify:

  • Individuals with access to sensitive tables.
  • Supported datasets and their associated metadata.

Example SQL query listing the tables tagged with DataSensitivity:

SELECT * FROM system.information_schema.table_tags WHERE tag_name = 'DataSensitivity';
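
To report who has access to sensitive tables, tag rows need to be matched against privilege rows. The following Python sketch does that matching in memory. It assumes the rows were already fetched as dictionaries, with column names modeled on the information_schema table_tags and table_privileges views; verify those names against the workspace before relying on it. The function name and sample values are hypothetical.

```python
from collections import defaultdict


def grantees_by_sensitive_table(tag_rows, privilege_rows, tag_name="DataSensitivity"):
    """Map each tagged table to the sorted grantees holding privileges on it.

    tag_rows: dicts with catalog_name, schema_name, table_name, tag_name
    privilege_rows: dicts with table_catalog, table_schema, table_name, grantee
    (assumed column names; tables with no matching grants are omitted).
    """
    sensitive = {
        (r["catalog_name"], r["schema_name"], r["table_name"])
        for r in tag_rows
        if r["tag_name"] == tag_name
    }
    report = defaultdict(set)
    for p in privilege_rows:
        key = (p["table_catalog"], p["table_schema"], p["table_name"])
        if key in sensitive:
            report[".".join(key)].add(p["grantee"])
    return {table: sorted(grantees) for table, grantees in sorted(report.items())}


# Made-up sample rows for illustration.
tags = [{"catalog_name": "main", "schema_name": "research",
         "table_name": "visits", "tag_name": "DataSensitivity"}]
grants = [{"table_catalog": "main", "table_schema": "research",
           "table_name": "visits", "grantee": "analysts"}]
print(grantees_by_sensitive_table(tags, grants))
```

A grantee may be a group rather than a person, so listing individuals would require expanding group membership in a separate step.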

Data Lineage

Tracking data lineage is essential for understanding the flow and transformations of data. Databricks provides tools to capture lineage information, which can be included in metadata to support debugging and compliance audits. Verify that lineage is being captured and can be viewed for all critical datasets.

Automated Tagging

Some tagging automation has been included in the Databasin application to manage the WusmDepartment tag for billing purposes. All other tags must be managed manually through the Databricks UI or code.

Conclusion

This implementation guide provides a robust framework for managing metadata tags in the WUSM Data Lake. By following the outlined steps and workflows, data stewards can ensure compliance, improve data discoverability, and support end-users in accessing valuable metadata insights. Collaboration between data stewards and support staff is essential for the success of this initiative.


Updated on August 7, 2025