Metadata tags provide context and additional information about data assets. This document outlines how to correctly use these tags.

Types of Tags

Schema Tags

  • WusmDepartment: Identifies the department responsible for the schema. This tag is used for billing and organizational tracking.
  • DataSource: Provides the name of the supported dataset that the schema holds (e.g., "Clarity", "OMOP"). Used together with IsSupportedDataSource to generate the list of supported datasets and their associated metadata.
  • ProjectAssociation: Links the data asset to a specific project or initiative. Useful for tracking project-related data assets.
  • IsSupportedDataSource: A boolean value, either True or False, that indicates whether a schema is a "supported dataset". Essential for generating a list of supported datasets and their associated metadata. Future use may include powering a portion of the ICS community site. Examples of supported datasets are Clarity, OMOP, and patient geocoded data.
  • DataStewards: List of email contacts that manage the data source. Allows consumers to find the appropriate contact for questions regarding a data source and supports automated reporting for Data Stewards. For ICS-supported data sources, this could be set to datalake@wustl.edu.
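
The schema tags above can be checked before they are applied. The following is a minimal Python sketch of such a check, not an official validator. It assumes that WusmDepartment, IsSupportedDataSource, and DataStewards are mandatory, that DataSource is required only for supported datasets, and that DataStewards is stored as a comma-separated string of emails. The function name validate_schema_tags and the email pattern are hypothetical.

```python
import re

# Assumption: which schema tags are mandatory is a hypothetical policy, not stated in this document.
REQUIRED_SCHEMA_TAGS = ("WusmDepartment", "IsSupportedDataSource", "DataStewards")
# Deliberately loose email check: something@something.something
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_schema_tags(tags: dict[str, str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the tags pass."""
    problems = []
    for name in REQUIRED_SCHEMA_TAGS:
        if not tags.get(name, "").strip():
            problems.append(f"missing required tag: {name}")

    flag = tags.get("IsSupportedDataSource", "").strip()
    if flag and flag not in ("True", "False"):
        problems.append("IsSupportedDataSource must be 'True' or 'False'")
    # Assumption: a supported dataset must also carry its name in DataSource.
    if flag == "True" and not tags.get("DataSource", "").strip():
        problems.append("supported datasets must also set DataSource")

    # Assumption: DataStewards is a comma-separated string of email addresses.
    for email in tags.get("DataStewards", "").split(","):
        email = email.strip()
        if email and not EMAIL_PATTERN.match(email):
            problems.append(f"invalid steward email: {email}")
    return problems
```

A call such as validate_schema_tags({"WusmDepartment": "ICS", "IsSupportedDataSource": "True", "DataSource": "OMOP", "DataStewards": "datalake@wustl.edu"}) returns an empty list under these assumptions.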

Table and View Tags

  • RetentionPolicy: Specifies the retention period for the data asset. Ensures compliance with data retention policies and facilitates lifecycle management.
  • ComplianceClassification: Identifies compliance requirements (e.g., HIPAA, GDPR). Ensures adherence to regulatory standards.

Field Tags

  • DataSensitivity: Indicates the sensitivity level of the data (e.g., Public, Internal, Restricted, PHI, HIPAA Limited Data Set). Apply this tag at the field level to identify exactly which fields contain sensitive information.

Descriptions

  • Schemas and Tables: All schemas and tables must have a description applied to them. Descriptions should provide clear and concise information about the purpose and contents of the schema or table. If a description is not provided manually, Databricks will use AI to generate one; users should at least review the generated text.
  • Fields: Fields within tables should also have comments or descriptions, especially if they are tagged with DataSensitivity. This ensures users can easily identify the level of HIPAA data contained within a table.
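
The field rule above lends itself to an automated check. The sketch below flags fields that carry a DataSensitivity tag but have no description. The FieldInfo record and the function name are hypothetical; in practice the field metadata would need to be loaded from the catalog first.

```python
from dataclasses import dataclass, field


@dataclass
class FieldInfo:
    """Hypothetical in-memory record for one table field."""
    name: str
    description: str = ""
    tags: dict[str, str] = field(default_factory=dict)


def sensitive_fields_missing_description(fields: list[FieldInfo]) -> list[str]:
    """Return names of fields tagged with DataSensitivity that lack a description."""
    return [
        f.name
        for f in fields
        if "DataSensitivity" in f.tags and not f.description.strip()
    ]
```

Fields without a DataSensitivity tag are ignored here, so a separate check would be needed if every field is expected to carry a description.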

How To Apply Tags

(TODO: Further elaboration needed.)

Ensure that every new data asset added to the WUSM data lake receives appropriate metadata tags.

  • Maintaining Metadata via Databricks
  • Tagging and Descriptions
  • Knowledge Catalog Creation
    • Descriptions for Schemas, Tables, and Views
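
Until this section is elaborated, the following sketch shows one possible way to generate the SQL statements for tags and descriptions. It assumes Unity Catalog and the Databricks SQL forms ALTER SCHEMA ... SET TAGS, ALTER TABLE ... SET TAGS, ALTER TABLE ... ALTER COLUMN ... SET TAGS, and COMMENT ON SCHEMA; verify these against the documentation for your workspace before relying on them. The helper names and the catalog, schema, and table names in the usage note are hypothetical.

```python
def sql_literal(value: str) -> str:
    """Quote a value as a SQL string literal, backslash-escaping \\ and '."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def quote_ident(*parts: str) -> str:
    """Backtick-quote each identifier part and join the parts with dots."""
    return ".".join("`" + part.replace("`", "``") + "`" for part in parts)


def tag_clause(tags: dict[str, str]) -> str:
    """Render a SET TAGS clause from a tag-name to tag-value mapping."""
    pairs = ", ".join(f"{sql_literal(k)} = {sql_literal(v)}" for k, v in tags.items())
    return f"SET TAGS ({pairs})"


def schema_tags_sql(catalog: str, schema: str, tags: dict[str, str]) -> str:
    return f"ALTER SCHEMA {quote_ident(catalog, schema)} {tag_clause(tags)}"


def table_tags_sql(catalog: str, schema: str, table: str, tags: dict[str, str]) -> str:
    return f"ALTER TABLE {quote_ident(catalog, schema, table)} {tag_clause(tags)}"


def column_tags_sql(catalog: str, schema: str, table: str, column: str,
                    tags: dict[str, str]) -> str:
    target = quote_ident(catalog, schema, table)
    return f"ALTER TABLE {target} ALTER COLUMN {quote_ident(column)} {tag_clause(tags)}"


def schema_comment_sql(catalog: str, schema: str, text: str) -> str:
    return f"COMMENT ON SCHEMA {quote_ident(catalog, schema)} IS {sql_literal(text)}"
```

For example, column_tags_sql("main", "omop", "person", "birth_datetime", {"DataSensitivity": "PHI"}) produces a statement that could be run from a notebook or the SQL editor. Generating statements as strings keeps the sketch testable without a cluster; it does not replace reviewing each statement before execution.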

Compliance Reporting

(TODO: Further elaboration needed.)

  • Generating Compliance Reports
    • Identify individuals who have access to sensitive tables.
    • Generate lists of supported datasets and their associated metadata using the IsSupportedDataSource tag.
  • Review and Action Items
    • Verify intake approval and reach out to get updated documentation, etc.
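
The two report types above can be sketched as plain Python over exported metadata rows. This is an illustration under stated assumptions, not the team's reporting pipeline: the tuple shapes are hypothetical (in practice they might be pulled from Unity Catalog's information_schema views), and the set of DataSensitivity values treated as sensitive is an assumption.

```python
from collections import defaultdict

# Assumption: which DataSensitivity values count as "sensitive" for reporting.
SENSITIVE_LEVELS = {"Restricted", "PHI", "HIPAA Limited Data Set"}


def sensitive_table_access(column_tags, table_grants):
    """Map each sensitive table to the sorted grantees holding any privilege on it.

    column_tags:  iterable of (schema, table, column, tag_name, tag_value)
    table_grants: iterable of (grantee, schema, table, privilege)
    """
    sensitive = {
        (schema, table)
        for schema, table, _column, tag_name, tag_value in column_tags
        if tag_name == "DataSensitivity" and tag_value in SENSITIVE_LEVELS
    }
    access = defaultdict(set)
    for grantee, schema, table, _privilege in table_grants:
        if (schema, table) in sensitive:
            access[(schema, table)].add(grantee)
    # Sensitive tables with no direct grants appear with an empty list.
    return {key: sorted(access[key]) for key in sorted(sensitive)}


def supported_datasets(schema_tags):
    """List schemas flagged as supported, with their DataSource and DataStewards.

    schema_tags: iterable of (schema, tag_name, tag_value)
    """
    by_schema = defaultdict(dict)
    for schema, tag_name, tag_value in schema_tags:
        by_schema[schema][tag_name] = tag_value
    return [
        {
            "schema": schema,
            "DataSource": tags.get("DataSource", ""),
            "DataStewards": tags.get("DataStewards", ""),
        }
        for schema, tags in sorted(by_schema.items())
        if tags.get("IsSupportedDataSource") == "True"
    ]
```

This sketch only considers grants recorded directly against a table. Access inherited from schema-level or catalog-level grants, or through group membership, would need to be resolved separately before the report is treated as complete.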

Conclusion

Metadata tags are essential for easy identification and organization of data assets within the data lake. By applying tags such as WusmDepartment, IsSupportedDataSource, and DataSensitivity, and ensuring all schemas, tables, and fields have descriptions, we can maintain compliance and improve data discoverability.


Updated on August 7, 2025