Welcome to Databricks Email Template
Hello,
Welcome aboard the WUSM Data Lake! Here is a bit of information to get you started.
- We use Databricks as our analytics provider. You can log in to Databricks HERE.
- To log in, please use your WUSTL email, and make sure you are either on campus or connected to the WashU VPN.
- You have been onboarded with a security group named GROUP_NAME, and your Databricks compute resources will be named after this group.
- You will have access to general compute clusters if you wish to run Databricks Notebooks (Python/SQL). You will also have access to Databricks SQL Warehouses if you wish to run Spark SQL.
- You can find a short introduction to Spark SQL and its functions HERE.
The data lake acts much like a traditional database server and offers navigation very similar to browsing databases, schemas, and tables. In Databricks, these objects are found in the “Catalog” view. In this view, you will find a few common “databases” (referred to as “catalogs” by Databricks) that are shared by all data lake users. You will also see additional catalogs and schemas that you may not have permission to query. These are visible so that you can browse the metadata of other data hosted in Databricks.
Here is an overview of navigating the catalogs, schemas, tables, and views in Azure Databricks:
- Log into Databricks.
- Click “Catalog” on the left-hand side.
- Expand the relevant catalog to view the available schemas.
- Within each schema, you will find the tables and views you have been granted access to.
Note: the naming convention in the data lake is typically “database” + “schema” + “table name”, separated by underscores.
You will have “select” access to the approved tables and views and are free to query the data via a Databricks Notebook or a Databricks SQL Warehouse.
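As a rough illustration of the naming note above, the Python sketch below joins the three parts with underscores and builds a read-only query string. The helper names (lake_table_name, preview_query) and every catalog, schema, and table name are hypothetical placeholders, not real data lake objects. It also assumes the composed name is used as the table identifier within a catalog and schema, which may not match every table in the lake.

```python
def lake_table_name(database: str, schema: str, table: str) -> str:
    """Join the three parts with underscores, following the naming note."""
    return "_".join([database, schema, table])


def preview_query(catalog: str, schema: str, table: str, limit: int = 10) -> str:
    """Build a read-only SELECT against a three-level catalog.schema.table name."""
    # Backticks guard against identifiers containing unusual characters.
    qualified = ".".join(f"`{part}`" for part in (catalog, schema, table))
    return f"SELECT * FROM {qualified} LIMIT {int(limit)}"


# Hypothetical source names, for illustration only.
name = lake_table_name("exampledb", "exampleschema", "exampletable")
print(name)

# Hypothetical catalog and schema; substitute ones you see in the Catalog view.
sql = preview_query("example_catalog", "example_schema", name)
print(sql)

# In a Databricks notebook, the predefined `spark` session could run the query:
# df = spark.sql(sql)
```

The same query text can be pasted into the SQL editor and run on a SQL Warehouse.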
You also have a sandbox area, which you can find under “Catalog -> Sandbox -> GROUP_NAME”. This is your team's private schema, and you have full permissions on it, so you can create and write tables there.
Your team also has a “volume” under this same sandbox schema. This is a shared drive you may use to upload and work with files. It is private and only your team has access to it.
More information on Databricks volumes is available HERE.
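Files in a Databricks volume are addressed with paths of the form /Volumes/<catalog>/<schema>/<volume>/<path>. The short Python sketch below builds such a path for a file in your team's volume. The volume name (example_volume), the file path, and the helper name volume_path are hypothetical, and it assumes the sandbox catalog identifier is spelled "sandbox"; check the Catalog view for the exact names.

```python
from pathlib import PurePosixPath


def volume_path(catalog: str, schema: str, volume: str, *parts: str) -> str:
    """Build a /Volumes/<catalog>/<schema>/<volume>/... path as a string."""
    return str(PurePosixPath("/Volumes", catalog, schema, volume, *parts))


# GROUP_NAME stands in for your team's schema; the other names are hypothetical.
path = volume_path("sandbox", "GROUP_NAME", "example_volume", "uploads", "data.csv")
print(path)

# In a Databricks notebook, a path like this could be read with, for example:
# df = spark.read.csv(path, header=True)
```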
To get you a bit more acquainted with Databricks, we have some high-level training videos available HERE. After reviewing these training materials, if you would like a quick training session with members of the Data Lake team, please reach out to Nicole Venteris.
Let us know if you have any questions or experience any issues using the data lake.
Thank you,
The WUSM Data Lake Team
datalake@wustl.edu