Data Asset Introduction
Overview
Data assets are structured or unstructured datasets that hold value for an organization. In the context of Databricks and the WUSM Data Lake, data assets include schemas, tables, and views used for analytics, reporting, and decision-making.
Effective management of data assets ensures compliance, discoverability, data quality, collaboration, and scalability. This document explains the key concepts, processes, and criteria related to data asset management in the WUSM Data Lake.
Key Concepts
Structured vs. Unstructured Datasets
- Structured Datasets: Data organized according to a predefined schema, typically as rows and columns. Examples include relational database tables and spreadsheets.
- Unstructured Datasets: Data without a predefined schema, such as images, videos, or free-text documents. These require specialized tools for storage and analysis.
Data Ingestion
Data ingestion is the process of collecting, importing, and processing data from various sources into a centralized storage system such as a data lake. Ingestion makes the data available for analysis, reporting, and decision-making.
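As a rough illustration of the collect, import, and process steps, the sketch below parses CSV text from a source and lands it as a list of records stamped with load metadata. It uses only the Python standard library. The function name `ingest_csv`, the `_source` and `_ingested_at` columns, and the sample data are hypothetical and do not describe the actual WUSM Data Lake pipeline.

```python
import csv
import io
from datetime import datetime, timezone


def ingest_csv(text, source_name):
    """Parse CSV text and return rows tagged with hypothetical load metadata."""
    loaded_at = datetime.now(timezone.utc).isoformat()
    reader = csv.DictReader(io.StringIO(text))
    rows = []
    for record in reader:
        row = dict(record)
        row["_source"] = source_name     # where the row came from (assumed column)
        row["_ingested_at"] = loaded_at  # when the row was loaded (assumed column)
        rows.append(row)
    return rows


# Made-up sample standing in for a source extract.
sample = "record_id,visit_date\n101,2024-01-05\n102,2024-01-06\n"
landed = ingest_csv(sample, source_name="clinic_export")
print(len(landed), landed[0]["record_id"])
```

Recording the source and load time on every row is one common way to keep ingested data traceable; whether the Data Lake does so is an assumption of this sketch.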
Curated vs. Cleansed Catalogs
- Curated Catalog: Focuses on enriching and validating datasets for specific use cases, ensuring they are ready for immediate use by stakeholders.
- Cleansed Catalog: Emphasizes data quality by removing errors, inconsistencies, and redundancies, serving as a foundation for further processing or analysis.
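The distinction can be sketched in plain Python: a cleansing step removes errors and redundancies, and a curation step then validates and enriches the cleansed rows for one specific use case. The `cleanse` and `curate` helpers, the field names, and the sample rows are all hypothetical. This illustrates the idea only, not the catalogs' actual processing.

```python
def cleanse(rows, required=("id", "email")):
    """Trim whitespace, drop rows missing required fields, and remove duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        tidy = {k: (v.strip() if isinstance(v, str) else v) for k, v in row.items()}
        if any(not tidy.get(name) for name in required):
            continue  # incomplete row: treated as an error
        key = tuple(tidy[name] for name in required)
        if key in seen:
            continue  # redundant row
        seen.add(key)
        cleaned.append(tidy)
    return cleaned


def curate(rows):
    """Validate and enrich cleansed rows for a hypothetical contact-list use case."""
    curated = []
    for row in rows:
        if "@" not in row["email"]:
            continue  # fails the use-case validation
        enriched = dict(row)
        enriched["email_domain"] = row["email"].split("@", 1)[1].lower()
        curated.append(enriched)
    return curated


raw = [
    {"id": "1", "email": " a@Example.org "},
    {"id": "1", "email": "a@Example.org"},  # duplicate once whitespace is trimmed
    {"id": "2", "email": ""},               # missing a required value
    {"id": "3", "email": "not-an-email"},
]
cleansed = cleanse(raw)      # general-purpose foundation
curated = curate(cleansed)   # ready for one specific use
```

In this sketch the cleansed output stays general (it keeps the row with the malformed address because that row is complete and unique), while the curated output applies a stricter, use-case-specific rule.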
Management Processes
Roles and Responsibilities
Data assets in the WUSM Data Lake are managed collaboratively by the ICS team and appointed data stewards:
- ICS Team: Oversees the technical infrastructure, ensuring compliance, security, and scalability.
- Data Stewards: Maintain metadata, review data quality, and ensure assets meet organizational and regulatory standards.
Criteria for Promotion During Review Period
To be promoted during the review period, a data asset must meet all of the following criteria:
- Compliance: Adherence to organizational and regulatory standards.
- Metadata Completeness: Inclusion of all required metadata tags.
- Data Quality: High accuracy, consistency, and reliability.
- Relevance: Alignment with organizational goals and stakeholder needs.
- Approval: Validation by data stewards and relevant stakeholders.
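The criteria above can be read as a checklist that an asset must pass in full. The sketch below encodes that reading. The `AssetReview` fields, the `REQUIRED_TAGS` set, and the tag names are hypothetical placeholders, and the real review is carried out by data stewards and stakeholders rather than by code like this.

```python
from dataclasses import dataclass, field

# Hypothetical required tags; the actual list is defined by the tagging policy.
REQUIRED_TAGS = {"owner", "sensitivity", "description"}


@dataclass
class AssetReview:
    name: str
    compliant: bool = False
    tags: dict = field(default_factory=dict)
    quality_checked: bool = False
    relevant: bool = False
    approved_by_steward: bool = False


def unmet_criteria(review):
    """Return the names of the promotion criteria the asset does not yet meet."""
    present_tags = {key for key, value in review.tags.items() if value}
    missing_tags = REQUIRED_TAGS - present_tags
    checks = {
        "Compliance": review.compliant,
        "Metadata Completeness": not missing_tags,
        "Data Quality": review.quality_checked,
        "Relevance": review.relevant,
        "Approval": review.approved_by_steward,
    }
    return [name for name, passed in checks.items() if not passed]


review = AssetReview(
    name="example_schema.example_table",
    compliant=True,
    tags={"owner": "ics", "sensitivity": "internal"},  # "description" is missing
    quality_checked=True,
    relevant=True,
    approved_by_steward=True,
)
unmet = unmet_criteria(review)
ready_for_promotion = not unmet
```

Returning the list of unmet criteria, rather than a single pass/fail flag, tells the asset owner what still needs attention before the next review.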
Risks of Non-Compliance
Failing to tag data assets can lead to compliance violations, reduced discoverability, and inefficient data management. Depending on the applicable regulatory requirements, the consequences may include restricted access to data, additional audits, or legal action.
See Also
- Data Asset Tagging Overview: Learn about the types of metadata tags and their applications.
- Data Asset Tagging Implementation Guide: Understand how metadata tagging is implemented in Databricks.
- Data Lake Catalog Policies: Review the policies governing the use and lifecycle of data catalogs.
- Databricks Documentation
- Databricks Blog: Best practices and updates on data lake architecture.