Connecting to Databricks as a Database

Problem

You want to connect to the Data Lake as just another "database" (i.e., you want to access data lake data outside Databricks right away, using SQL tools you already know).

Solution

Given that you have access to an existing SQL warehouse cluster and are on the WUSTL VPN, gather the following three details from your Azure Databricks workspace:

  1. Server Hostname
  2. HTTP path
  3. access_token

Part A: Obtain the Server Hostname and HTTP Path

  1. Log in to your Azure Databricks workspace
  2. In the sidebar, click SQL > SQL Warehouses
  3. Choose an appropriate warehouse to connect to
  4. On the Connection Details tab, copy the connection details.

This should get you the Server Hostname and HTTP Path parameters.

Part B: Obtain a personal access token (if you don't already have one)

If you don't already have an Azure Databricks personal access token, do the following:

  1. In your Azure Databricks workspace, click your Azure Databricks username in the top bar, and then select User Settings from the drop down.
  2. Click Developer.
  3. Next to Access tokens, click Manage.
  4. Click Generate new token.
  5. (Optional) Enter a comment that helps you to identify this token in the future, and change the token’s default lifetime of 90 days. To create a token with no lifetime (not recommended), leave the Lifetime (days) box empty (blank).
  6. Click Generate.
  7. Copy the displayed token to a secure location, and then click Done.

NOTE:
Be sure to save the copied token in a secure location. Do not share your copied token with others. If you lose the copied token, you cannot regenerate that exact same token. Instead, you must repeat this procedure to create a new token. If you lose the copied token, or you believe that the token has been compromised, Databricks strongly recommends that you immediately delete that token from your workspace by clicking the trash can (Revoke) icon next to the token on the Access tokens page.

If you are not able to create or use tokens in your workspace, this might be because your workspace administrator has disabled tokens or has not given you permission to create or use tokens. See your workspace administrator.

Part C: Use your tooling

DBeaver

(See DBeaver integration with Azure Databricks for the most current setup details.)

Python

(See Databricks SQL Connector for Python for the most current setup details.)

#!/usr/bin/env python

# assuming you've done a "pip install databricks-sql-connector" in your environment

# from prior art based on https://learn.microsoft.com/en-us/azure/databricks/dev-tools/python-sql-connector
# export DATABRICKS_SERVER_HOSTNAME=<your server hostname you determined in Part A above>
# export DATABRICKS_HTTP_PATH=<your HTTP path you determined in Part A above>
# export DATABRICKS_TOKEN=<your access token you created in Part B above>


import os
from databricks import sql

databricks_sql_environment = {
    'server-hostname' : os.environ['DATABRICKS_SERVER_HOSTNAME'],
    'http-path' : os.environ['DATABRICKS_HTTP_PATH'],
    'access-token' : os.environ['DATABRICKS_TOKEN']
}

def main():
    with sql.connect(
        server_hostname = databricks_sql_environment['server-hostname'],
        http_path = databricks_sql_environment['http-path'],
        access_token = databricks_sql_environment['access-token']
    ) as connection:
        with connection.cursor() as cursor:
            cursor.execute("select * from samples.nyctaxi.trips LIMIT 2")
            result = cursor.fetchall()

            for row in result:
                print(row)


if __name__ == "__main__":
    main()
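
Because the connector follows the Python DB API (PEP 249), you can also page through a larger result set in batches instead of pulling everything back with fetchall(). A minimal sketch, assuming the same environment variables as above and a hypothetical larger query:

#!/usr/bin/env python

# A minimal sketch of paging through a larger result set in batches.
# Assumes the same DATABRICKS_* environment variables as the example above.

import os
from databricks import sql

with sql.connect(
    server_hostname = os.environ['DATABRICKS_SERVER_HOSTNAME'],
    http_path = os.environ['DATABRICKS_HTTP_PATH'],
    access_token = os.environ['DATABRICKS_TOKEN']
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("select * from samples.nyctaxi.trips LIMIT 1000")

        # Column names are exposed through the DB API cursor description.
        print([col[0] for col in cursor.description])

        # Fetch rows in batches rather than all at once with fetchall().
        while True:
            batch = cursor.fetchmany(100)
            if not batch:
                break
            for row in batch:
                print(row)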

Tableau

(See Connect Tableau to Databricks for the most current setup details.)

Follow these steps to connect your report to the WUSM data lake while running in Tableau Desktop:

  1. Add a Databricks data source to your Tableau report.
  2. Enter the Server Hostname and HTTP Path for your SQL warehouse from the connection details above.
  3. Select the Personal Access Token option for Authentication and enter the token you created previously.

Once connected, you should be able to query data from the tables and views you have been granted access to within the Databricks environment.

Additional configuration steps are required when publishing to the Wash U Tableau server. At this time, please contact the ICS team for assistance with the initial deployment of your report; in the future we hope to provide guidance for publishing in a self-service model.

To request assistance, submit a ticket for Data Lake Support here.

R / RStudio

(See RStudio on Azure Databricks for the most current setup details.)

# Install and configure the Databricks ODBC driver for Windows, macOS, or Linux, 
# based on your local machine’s operating system.

library(odbc)
library(DBI)

# The "Databricks" DSN below must already be configured for the Databricks ODBC
# driver (the DSN holds your Server Hostname, HTTP Path, and personal access token).
conn <- dbConnect(
  drv = odbc(),
  dsn = "Databricks"
)

print(dbGetQuery(conn, "SELECT * FROM default.diamonds LIMIT 2"))

Java / JVM languages

(See Install the Databricks JDBC driver in a Java project for the most current setup details.)

You'll need to ensure that the latest Databricks JDBC driver is included in your JVM project specifications, for example as a Maven dependency:

<dependency>
  <groupId>com.databricks</groupId>
  <artifactId>databricks-jdbc</artifactId>
  <version>2.6.36</version>
</dependency>
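
If your project uses Gradle rather than Maven, the equivalent declaration for the same coordinates would look roughly like this:

dependencies {
  implementation 'com.databricks:databricks-jdbc:2.6.36'
}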

Then in your code, you'll need to use the following formatted JDBC connection URL:

jdbc:databricks://<server-hostname>:443;httpPath=<http-path>[;<property>=<value>[;<property>=<value>]]
  • Replace <server-hostname> with the Azure Databricks compute resource’s Server Hostname value.
  • Replace <http-path> with the Azure Databricks compute resource’s HTTP Path value.
  • Replace <property> and <value> with any additional driver properties you need (for example, ConnCatalog to set the default catalog), as shown in the filled-in example below.
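
For reference, a fully assembled URL with hypothetical values (your own hostname, warehouse path, and token will differ) might look like the following; note that the Java example below passes AuthMech, UID, and PWD as Properties instead of embedding them in the URL:

jdbc:databricks://adb-1234567890123456.7.azuredatabricks.net:443;httpPath=/sql/1.0/warehouses/abcdef1234567890;AuthMech=3;UID=token;PWD=<personal-access-token>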

An example Java invocation with the connection url parameters specified as environment variables:

package org.example;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;
import java.util.Properties;

public class Main {
  public static void main(String[] args) throws Exception {
    Class.forName("com.databricks.client.jdbc.Driver");
    String url = "jdbc:databricks://" + System.getenv("DATABRICKS_SERVER_HOSTNAME") + ":443";
    Properties p = new Properties();
    p.put("httpPath", System.getenv("DATABRICKS_HTTP_PATH"));
    p.put("AuthMech", "3");
    p.put("UID", "token");
    p.put("PWD", System.getenv("DATABRICKS_TOKEN"));
    try (Connection conn = DriverManager.getConnection(url, p);
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT * FROM samples.nyctaxi.trips")) {
      ResultSetMetaData md = rs.getMetaData();
      String[] columns = new String[md.getColumnCount()];
      for (int i = 0; i < columns.length; i++) {
        columns[i] = md.getColumnName(i + 1);
      }
      while (rs.next()) {
        System.out.print("Row " + rs.getRow() + "=[");
        for (int i = 0; i < columns.length; i++) {
          if (i != 0) {
            System.out.print(", ");
          }
          System.out.print(columns[i] + "='" + rs.getObject(i + 1) + "'");
        }
        System.out.println("]");
      }
    }
    System.exit(0);
  }
}

Discussion

Sometimes, to get started, it helps to extend the tools you are already familiar with. One way to access the Data Lake is to treat it as just another "database"; this is one of the purposes of a Databricks SQL warehouse.

If you want to do batch processing with SQL, a classic cluster is usually the better choice because it is cheaper; for interactive queries, where performance is key, SQL warehouses are fast and optimized for reads and queries.

Administrative Details

Please note that, at the end of the day, a SQL warehouse traces back to a running cluster of Azure cloud compute instances, and those compute instances are billed to your respective cost center accordingly.

If you don't have an existing SQL warehouse cluster to connect to, you'll first need to contact I2DB's ICS group and request them to:

  • create a SQL warehouse linked to an appropriate cost center
  • adjust permissions on the SQL warehouse so that either a dedicated ICS administrator or a selected group administrator can manage permissions on the SQL warehouse.

See Also

Further References


Updated on August 7, 2025