WUSM Data Lake
The WUSM Data Lake is a secure, Databricks-based data management and analytics system hosted on Microsoft Azure. It contains historical and current healthcare data generated and recorded across the WashU and BJC healthcare ecosystem.
A data lake is a central repository for storing all types of data at any scale, which can be accessed and analyzed by researchers to gain insights into health trends, disease patterns, and treatment outcomes in medical research. It enables the storage of raw data without the need for pre-structuring, making it a flexible and scalable platform for researchers to analyze large volumes of data and discover new patterns and correlations. With the increasing amount of health data being generated, data lakes are becoming an essential tool for advancing medical knowledge and improving healthcare outcomes.
The WUSM Data Lake is intended solely for research use by WashU faculty, staff, and students.
A data lake is a centralized repository that stores both structured and unstructured data at any scale.
Databricks is built on Apache Spark, a massively scalable data processing engine. Spark runs on cloud compute resources (here, Microsoft Azure) that are decoupled from the data storage repository.
The Databricks workspace provides a unified interface and tools for most data tasks, including:
- Scheduling and managing data processing workflows
- Generating dashboards and visualizations
- Managing security, governance, high availability, and disaster recovery
- Data discovery, annotation, and exploration
- Machine learning (ML) modeling, tracking, and model serving
- Generative AI solutions
The WUSM Data Lake currently hosts the following kinds of data:
- OMOP Data
- TODO: More data description details.
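As a rough illustration of how OMOP data might be explored from a Databricks notebook, the sketch below builds a simple aggregate query against the OMOP `person` table and runs it through a Spark session. It is a minimal sketch under stated assumptions, not the Data Lake's documented access pattern: the catalog and schema names are hypothetical placeholders, the three-level `catalog.schema.table` naming assumes the workspace uses Unity Catalog, and `build_person_birth_year_query` and `birth_year_counts` are helper names invented for this example. Only the `person` table and its `year_of_birth` column come from the OMOP Common Data Model.

```python
import re

# Plain SQL identifiers only: letters, digits, and underscores.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_person_birth_year_query(catalog: str, schema: str) -> str:
    """Build a query counting OMOP persons per year of birth.

    `catalog` and `schema` are hypothetical placeholders; substitute the
    names that apply to your own workspace.
    """
    for name in (catalog, schema):
        # Reject anything that is not a plain identifier before interpolating.
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"invalid identifier: {name!r}")
    return (
        "SELECT year_of_birth, COUNT(*) AS n_persons "
        f"FROM {catalog}.{schema}.person "
        "GROUP BY year_of_birth "
        "ORDER BY year_of_birth"
    )


def birth_year_counts(spark, catalog: str, schema: str):
    """Run the query on a Spark session and return the resulting DataFrame."""
    return spark.sql(build_person_birth_year_query(catalog, schema))


# In a Databricks notebook, `spark` is already defined, so usage might be:
#   df = birth_year_counts(spark, "example_catalog", "example_omop")
#   df.show()
```

Keeping the query construction separate from the Spark call makes the SQL easy to inspect before any cluster resources are used.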
For more information regarding the Data Lake, please visit the WashU Internal Site (link currently unavailable).
The WUSM Data Lake is currently curated and managed by the Infrastructure Core Services (ICS) group (link currently unavailable) within the Institute for Informatics, Data Science & Biostatistics.
The recipes in this chapter cover initializing, configuring, and administering the Data Lake, as well as educating others on its use.