WUSM Data Lake

The WUSM Data Lake is a secure, Databricks-based data management and analytics system hosted on Microsoft Azure. It contains past and current healthcare data generated and recorded across the WashU and BJC healthcare ecosystem.

A data lake is a central repository for storing all types of data at any scale. Researchers can access and analyze it to gain insights into health trends, disease patterns, and treatment outcomes. Because raw data can be stored without pre-structuring, a data lake is a flexible and scalable platform for analyzing large volumes of data and discovering new patterns and correlations. As the amount of health data being generated grows, data lakes are becoming an essential tool for advancing medical knowledge and improving healthcare outcomes.

The WUSM Data Lake is intended only for research use by WashU faculty, staff, and students.

The Data Lake stores both structured and unstructured data. Databricks is built on Apache Spark, a massively scalable data engine that runs on cloud compute resources (Microsoft Azure, in this case) that are decoupled from the data storage repository.

The Databricks workspace provides a unified interface and tools for most data tasks, including:

Data processing workflow scheduling and management
Generating dashboards and visualizations
Managing security, governance, high availability, and disaster recovery
Data discovery, annotation, and exploration
Machine learning (ML) modeling, tracking, and model serving
Generative AI solutions

The WUSM Data Lake currently hosts the following kinds of data:

OMOP Data
TODO: More data description details.
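
To give a rough sense of what working with OMOP data looks like, the following is a minimal sketch that counts distinct patients per condition. It assumes the standard OMOP Common Data Model table and column names (condition_occurrence, concept, person_id, condition_concept_id, concept_name). It runs against a tiny in-memory SQLite database filled with synthetic rows rather than against the Data Lake itself; the concept IDs, concept names, and the helper patients_per_condition are hypothetical, and the actual catalog and schema names used in the WUSM Data Lake are not specified here.

```python
import sqlite3

# Synthetic stand-in for two OMOP CDM tables. None of these rows are real
# Data Lake data; the concept IDs and names are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (
    concept_id   INTEGER PRIMARY KEY,
    concept_name TEXT
);
CREATE TABLE condition_occurrence (
    condition_occurrence_id INTEGER PRIMARY KEY,
    person_id               INTEGER,
    condition_concept_id    INTEGER,
    condition_start_date    TEXT
);
INSERT INTO concept VALUES
    (1001, 'Example condition A'),
    (1002, 'Example condition B');
INSERT INTO condition_occurrence VALUES
    (10, 1, 1001, '2020-01-15'),
    (11, 1, 1001, '2021-03-02'),
    (12, 2, 1001, '2019-07-30'),
    (13, 3, 1002, '2022-11-11');
""")

# OMOP stores clinical events as concept IDs, so a typical query joins an
# event table to the concept table to recover a readable name.
QUERY = """
SELECT c.concept_name,
       COUNT(DISTINCT co.person_id) AS n_patients
FROM condition_occurrence AS co
JOIN concept AS c
  ON c.concept_id = co.condition_concept_id
GROUP BY c.concept_name
ORDER BY n_patients DESC, c.concept_name
"""


def patients_per_condition(connection):
    """Return (concept_name, distinct patient count) rows (hypothetical helper)."""
    return connection.execute(QUERY).fetchall()


for name, n_patients in patients_per_condition(conn):
    print(f"{name}: {n_patients}")
```

In a Databricks notebook, a query of the same shape would typically be run through the workspace's SQL interface against the hosted OMOP tables, using whatever catalog and schema names the Data Lake administrators provide.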

Please visit CURRENTLY UNAVAILABLE: WashU Internal Site for more information regarding the Data Lake.

The WUSM Data Lake is currently curated and managed by the CURRENTLY UNAVAILABLE: Infrastructure Core Services (ICS) group within the Institute for Informatics, Data Science & Biostatistics.

The recipes in this chapter cover initializing, configuring, and administering the Data Lake, as well as educating others on its use.
