Saving Results From a Databricks Notebook to a File

Problem

While working in a Databricks notebook, you want to save the results of a query or session to a file.

Solution

import os

# Run the SQL query with spark.sql(), which returns a PySpark DataFrame
dataframe = spark.sql("<your-important-sql-query>")

# Build the destination path inside a Unity Catalog volume.
# Only the first os.path.join() argument should start with "/";
# an absolute component later in the call would discard everything before it.
example_volume_path = os.path.join(
    "/Volumes/my_catalog",
    "my_schema",
    "my_volume",
    "username",
    "special-project-dir",
)

# live example: /Volumes/sandbox/data_brokers/data_brokers_volume/BMinor/samples

(dataframe
  .coalesce(1)
  .write
  .format("csv")
  .mode("overwrite")
  .option("compression", "gzip")
  .save(example_volume_path))

# Clean up: Spark also writes bookkeeping files (_SUCCESS and similar),
# so remove everything that is not the CSV and rename the single part file
all_files = dbutils.fs.ls(example_volume_path)
desired_final_file = "important-data.csv.gz"
desired_final_file_path = os.path.join(example_volume_path, desired_final_file)

for f in all_files:
    f_absolute_path = f.path
    if f_absolute_path.endswith(".gz"):
        dbutils.fs.mv(f_absolute_path, desired_final_file_path)
    else:
        dbutils.fs.rm(f_absolute_path, False)
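
To confirm the export, you can read the file back from the volume. This is a minimal sanity check, assuming the example_volume_path and desired_final_file_path defined above; Spark decompresses the gzip file transparently based on its extension.

# Read the compressed CSV back from the volume; no header row was written above
check_df = (spark.read
    .format("csv")
    .load(desired_final_file_path))

print(check_df.count())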

Discussion

Azure Databricks provides multiple utilities and APIs for interacting with files across its storage locations: Unity Catalog volumes, workspace files, cloud object storage, the Databricks File System (DBFS), and ephemeral storage attached to the cluster's driver node.
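
For example, the dbutils.fs utilities accept volume paths directly, while ordinary Python file I/O against a path such as /tmp uses the ephemeral disk attached to the driver node. This is a rough sketch; the volume path below is the placeholder used throughout this recipe.

# List the contents of a Unity Catalog volume with the built-in file utilities
display(dbutils.fs.ls("/Volumes/my_catalog/my_schema/my_volume/"))

# Plain Python file I/O on a local path writes to the driver's ephemeral disk,
# which disappears when the cluster terminates
with open("/tmp/scratch-notes.txt", "w") as fh:
    fh.write("temporary, driver-local data\n")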

Unity Catalog volumes are the preferred option for long-term storage. A volume is a Unity Catalog object representing a logical volume of storage in an Azure cloud object storage location, and it provides capabilities for accessing, storing, governing, and organizing files. You can use volumes to store and access files in any format, including structured, semi-structured, and unstructured data, and Databricks maintains secure access to the underlying cloud object storage through Unity Catalog.

The path to access volumes is the same whether you use Apache Spark, SQL, Python, or other languages and libraries. The path to access files in volumes uses the following format:

/Volumes/<catalog>/<schema>/<volume>/<path>/<file-name>

Azure Databricks also supports an optional dbfs:/ scheme when working with Apache Spark, so the following path also works:

dbfs:/Volumes/<catalog>/<schema>/<volume>/<path>/<file-name>
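
For example, both path forms below refer to the same file when loaded with the PySpark DataFrame reader; the catalog, schema, volume, and file names are the placeholders used earlier in this recipe.

# Volume path without a scheme
df1 = spark.read.format("csv").load(
    "/Volumes/my_catalog/my_schema/my_volume/username/special-project-dir/important-data.csv.gz")

# The same file addressed with the optional dbfs:/ scheme
df2 = spark.read.format("csv").load(
    "dbfs:/Volumes/my_catalog/my_schema/my_volume/username/special-project-dir/important-data.csv.gz")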

The sequence /<catalog>/<schema>/<volume> in the path corresponds to the three Unity Catalog object names associated with the file. These path elements are read-only and not directly writeable by users: you cannot create or delete these directories with filesystem operations, because Unity Catalog manages them automatically.

You can see the catalogs accessible to you in the Catalog Explorer: in your Azure Databricks workspace, click the "Catalog" icon in the left-hand sidebar, which is also available from the left panel of a notebook.

There's also a SQL interface for working with volumes.
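
For example, you can query the exported file directly from SQL, or via spark.sql in a Python cell. This is a minimal sketch that assumes the read_files table-valued function (available in recent Databricks runtimes) and the placeholder paths used earlier.

# Query the saved CSV in the volume through SQL; read_files infers the schema
csv_df = spark.sql("""
    SELECT *
    FROM read_files(
      '/Volumes/my_catalog/my_schema/my_volume/username/special-project-dir/important-data.csv.gz',
      format => 'csv')
""")
display(csv_df)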

See Also

Azure References


Updated on August 7, 2025