Data Lake Expansion Summary
Overview
This document explains the strategic approach, governance model, and best practices for expanding the Databricks instance into a university-wide analytics platform at Washington University in St. Louis. The goal is to support research, clinical operations, finance, and student services while ensuring compliance, scalability, and effective governance.
Strategic Objectives
Core Principles
- Central Platform Support: Federated governance across domains.
- Domain-Specific Policies: Tailored access, compliance, and workflows.
- Shared Data Literacy: Emphasis on stewardship and standardization.
- Scalable Support Model: Clear roles and responsibilities.
Comparable Examples
- Stanford Medicine’s STARR Ecosystem: Self-service clinical data platform with a defined compliance framework and separate domain workflows (arXiv, PMC).
- University of Iowa Health Care: Governance task forces managing complex clinical-research data access workflows (PMC).
- Duke University’s Protected Network: Secure platform for sensitive research data, governed by enterprise IAM and data stewards.
Governance Model
University Data Governance Committee (UDGC)
- Scope: University-wide standards, policies, tool selection, and platform roadmap.
- Membership: Chief Data Officer (chair), informatics platform lead, and representatives from research (IRB), clinical operations, finance, student services, and compliance/legal.
- Responsibilities:
- Approve domain-specific governance frameworks.
- Define data classification, retention, auditing, and compliance policies.
- Oversee the data catalog, metadata standards, and stewardship program.
Domain-Specific Governance Subgroups
Each domain (e.g., Research, Clinical Operations, Finance, Student Services) has a governance working group reporting to the UDGC. These groups:
- Handle domain-appropriate approval workflows (e.g., IRB approvals, privacy rules).
- Maintain domain data definitions and set the criteria for access requests.
- Align domain needs with centralized standards.
Advantages:
- Tailored workflows and domain ownership.
- Specialist insights with alignment to overarching standards.
Platform Engineering & Support
The ICS Platform Engineering Team is central to the platform's success. Responsibilities include:
- Managing Azure and Databricks infrastructure.
- Performing upgrades, patches, and runtime management.
- Implementing role-based access and user provisioning.
- Monitoring usage, auditing, and ensuring compliance.
- Maintaining documentation and providing user support.
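The role-based access item above can be made concrete in several ways. The following is a minimal sketch only. It assumes the platform uses Unity Catalog-style SQL grants and that each domain role maps to an account group. The group name, table name, and the `build_grants` helper are hypothetical and not part of any existing configuration. The sketch renders `GRANT` statements as text so they can be reviewed before anyone runs them.

```python
from dataclasses import dataclass

# Privileges this sketch is willing to render (an assumption; extend as needed).
ALLOWED_PRIVILEGES = {"SELECT", "MODIFY"}


@dataclass(frozen=True)
class RoleGrant:
    group: str                 # hypothetical account group, e.g. "research-analysts"
    table: str                 # hypothetical three-level name: catalog.schema.table
    privilege: str = "SELECT"


def build_grants(grants):
    """Render one SQL GRANT statement per requested role grant."""
    statements = []
    for g in grants:
        if g.privilege not in ALLOWED_PRIVILEGES:
            raise ValueError(f"unsupported privilege: {g.privilege}")
        # Backticks quote the group name, which may contain hyphens.
        statements.append(
            f"GRANT {g.privilege} ON TABLE {g.table} TO `{g.group}`;"
        )
    return statements


if __name__ == "__main__":
    for sql in build_grants([RoleGrant("research-analysts", "research.curated.cohorts")]):
        print(sql)
```

Generating statements as text, rather than executing them directly, leaves room for a review step before any permission changes.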
Best Practices
Data Classification & Access Control
- Tiers: Public, internal, operational, restricted/PHI.
- Workflows: IRB approvals for research access; supervisor or business-owner sign-off for operational data.
- Automation: Scheduled access reviews and automatic expiration of access grants.
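As an illustration of the automation item, the sketch below flags access grants that have passed an expiration window tied to their classification tier. The tier names come from the list above. The lifetimes in `MAX_GRANT_DAYS`, the `AccessGrant` record, and the function names are assumptions made for the example, not agreed policy.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical maximum grant lifetime per tier, in days (None means no expiry).
MAX_GRANT_DAYS = {
    "public": None,
    "internal": 365,
    "operational": 180,
    "restricted/PHI": 90,
}


@dataclass
class AccessGrant:
    user: str
    tier: str
    granted_on: date


def expiry_date(grant):
    """Return the date a grant lapses, or None if its tier never expires."""
    days = MAX_GRANT_DAYS[grant.tier]
    if days is None:
        return None
    return grant.granted_on + timedelta(days=days)


def expired_grants(grants, today):
    """Return the grants whose expiry date is on or before `today`."""
    expired = []
    for grant in grants:
        lapse = expiry_date(grant)
        if lapse is not None and lapse <= today:
            expired.append(grant)
    return expired
```

A scheduled job could pass the result to the domain steward for renewal or revocation. The actual review cadence would be set by the UDGC.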
Data Stewardship & Cataloging
- Domain stewards maintain metadata, quality, and definitions.
- A shared university catalog ensures discoverability.
Compliance & Auditing
- Regular access reviews and audit trails for sensitive data.
- Compliance officers ensure adherence to HIPAA, FERPA, and financial standards.
Training & Documentation
- A documentation portal provides onboarding guides, glossaries, and policies.
- Regular training sessions cover domain-specific workflows and data usage guidelines.
Continuous Improvement
- The UDGC reviews metrics (e.g., requests, approvals, compliance incidents).
- Workflows are iteratively enhanced based on feedback and evolving needs.
Workflow Overview
graph TD
A[User Requests Data Access] -->|Submit Request| B[Domain Data Steward Review]
B -->|Approve| C[Compliance Approver Validation]
C -->|Approve| D[Platform Engineering Team Provisioning]
D -->|Setup Access| E[User Access Granted]
E -->|Training Session| F[User Training on Azure Databricks & Knowledge Catalog]
F -->|Ongoing Support| G[Community & Peer Network Support]
The data access workflow balances accessibility, compliance, and user enablement. Each step is described below:
- User Requests Data Access: The process begins when a user submits a formal request for data access. The request typically describes the data needed, the purpose of access, and any relevant project or research information.
- Domain Data Steward Review: The Domain Data Steward evaluates whether the request aligns with defined metadata standards, data quality rules, and domain-specific policies. The steward confirms that the request is well documented and meets the criteria for further review.
- Compliance Approver Validation: Once the steward approves the request, it is forwarded to the Compliance Approver. This role checks that the request adheres to regulatory requirements such as HIPAA, FERPA, or financial standards, and that all necessary approvals (e.g., IRB sign-off for research data) are in place.
- Platform Engineering Team Provisioning: After compliance validation, the ICS Platform Engineering Team provisions access. This includes setting up role-based access controls (RBAC), configuring secure workspaces, and granting the user the permissions needed for the requested data.
- User Access Granted: Once provisioning is complete, the user is notified that access has been granted and can begin working with the data within the platform's secure environment.
- Training Session: The user participates in a training session covering tools such as Azure Databricks and the Knowledge Catalog, with hands-on guidance on navigating and using the platform.
- Ongoing Support: After training, the user has access to ongoing support through the Community & Peer Network, which facilitates knowledge sharing, troubleshooting, and best practices across the university.
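The steps above can be read as a simple state machine. The sketch below is one hedged way to model it. The `Stage` names mirror the workflow, while the `DENIED` terminal state, the `advance` function, and the rule that only the two review steps can deny a request are assumptions added for illustration.

```python
from enum import Enum


class Stage(Enum):
    SUBMITTED = "user request submitted"
    STEWARD_APPROVED = "approved by domain data steward"
    COMPLIANCE_APPROVED = "validated by compliance approver"
    PROVISIONED = "access provisioned by platform engineering"
    ACCESS_GRANTED = "user notified of access"
    TRAINED = "training completed, ongoing support available"
    DENIED = "request denied"  # assumed terminal state, not in the diagram


# The approval path, in the order described in the workflow.
HAPPY_PATH = [
    Stage.SUBMITTED,
    Stage.STEWARD_APPROVED,
    Stage.COMPLIANCE_APPROVED,
    Stage.PROVISIONED,
    Stage.ACCESS_GRANTED,
    Stage.TRAINED,
]

# Stages whose next step is a review that may deny the request (assumption).
AWAITING_REVIEW = {Stage.SUBMITTED, Stage.STEWARD_APPROVED}


def advance(stage, approved=True):
    """Return the next stage, or DENIED if a reviewer rejects the request."""
    if stage in (Stage.DENIED, Stage.TRAINED):
        raise ValueError(f"{stage.name} is a final stage")
    if not approved:
        if stage not in AWAITING_REVIEW:
            raise ValueError("only a steward or compliance review can deny a request")
        return Stage.DENIED
    return HAPPY_PATH[HAPPY_PATH.index(stage) + 1]
```

For example, `advance(Stage.SUBMITTED)` returns `Stage.STEWARD_APPROVED`, and `advance(Stage.STEWARD_APPROVED, approved=False)` returns `Stage.DENIED`. Keeping transitions in one function makes it straightforward to log each change for the audit trail.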
Validation of the Workflow
This workflow is designed to address all critical aspects of a data access request:
- Governance: The involvement of Domain Data Stewards and Compliance Approvers ensures that requests are evaluated against both domain-specific and regulatory standards.
- Security: The Platform Engineering Team enforces robust access controls and secure provisioning to protect sensitive data.
- Enablement: Training sessions and ongoing support empower users to make the most of the platform while adhering to best practices.
- Scalability: The structured process allows for consistent handling of requests across multiple domains, making it scalable for university-wide adoption.
Together, these elements keep data access both secure and usable, supporting the university's analytics and research mission while maintaining compliance and governance standards.
Institutional Precedents
- Stanford's STARR: Combines self-service clinical data access with a compliance framework.
- University of Iowa Health Care Data Governance Task Force: Improved data sharing for research while addressing workflow bottlenecks.
- Duke’s Protected Network: Virtualized enclaves for sensitive research data with shared platform governance.
These examples highlight the importance of governance, clear workflows, and secure, scalable platform design.
Next Steps
- Share this summary with senior leadership and IT governance for feedback.
- Recruit domain-specific stewards and assign time commitments.
- Form/engage the UDGC and nominate a chair/secretariat.
- Draft policies for access, data classification, retention, and auditing.
- Build a roadmap for documentation, training, and platform automation.
- Plan a phase 1 pilot with limited scope (e.g., research plus one operational team) before scaling.