Transferring Data to/from RIS Storage in a Databricks Notebook
Problem
You want to import/export data from/to RIS storage from a Databricks notebook.
Solution
Inside the notebook, run the following pieces of code:
Install the dependencies
%pip install pysmb==1.2.8
Restart the Python Environment
dbutils.library.restartPython()
Import the Library and Configure Connection Details
# Imports and connection details
from smb.SMBConnection import SMBConnection
import tempfile
import pandas as pd
userid = "lundeberg@wustl.edu"
password = dbutils.secrets.get("wusm-prod-kv", "lundeberg-ris-password")
client_machine_name = "lundebergMacPro"
remote_machine_name = "smbRemoteMachine"
server_ip = "storage1.ris.wustl.edu"
Define a few Helper Functions
def listFilesInPath(conn, rootPath, folderPath):
    # List the contents of a folder on the RIS share
    return conn.listPath(rootPath, folderPath)

def getFileFromRIS(conn, rootPath, folderFilePath, fileObject):
    # Download a remote file into the supplied file-like object
    return conn.retrieveFile(rootPath, folderFilePath, fileObject)

def uploadFileToRIS(conn, filePath, desiredFileName, rootPath, folderPath):
    # Upload a local file to the given folder on the RIS share
    with open(filePath + "/" + desiredFileName, "rb") as file:
        conn.storeFile(rootPath, f"{folderPath}/{desiredFileName}", file)
    print("Successfully uploaded file to RIS")
def renameSavedFile(filePath, desiredFileName):
    # Spark writes a folder of part files; keep the CSV, rename it to
    # something meaningful, and delete the marker/temp files
    allFiles = dbutils.fs.ls(filePath)
    for fileInfo in allFiles:
        fullFileLocation = fileInfo.path
        if fullFileLocation.endswith(".csv"):
            dbutils.fs.mv(fullFileLocation, filePath + "/" + desiredFileName)
        else:
            dbutils.fs.rm(fullFileLocation, False)
    print("Successfully renamed file")
Example Transfer
# Connect to RIS
conn = SMBConnection(userid, password, client_machine_name, remote_machine_name, use_ntlm_v2=True)
# Default ports are 139, 445 - RIS is 445
conn.connect(server_ip, 445)
all_files = listFilesInPath(conn, 'my-ris-group', 'Active/T3 2018-2023/lundeberg')
for i in all_files:
    print(i.short_name)
# Retrieve the file from remote share
file_obj = tempfile.NamedTemporaryFile(delete=False)
remote_file = getFileFromRIS(conn,
                             'my-ris-group',
                             'Active/T3 2018-2023/output/CCUDM_list_121023.csv',
                             file_obj)
"""
# Open and print contents of file
with open(file_obj.name, 'r') as f:
print(f.read())
"""
# Close File Object
file_obj.close()
filePath = "/Volumes/sandbox/data_brokers/data_brokers_volume/sampleRISUpload"
desiredFileName = "mySample.csv"
# Upload a file from the datalake to the SMB
df = spark.sql("select * from samples.nyctaxi.trips limit 1000")
df.coalesce(1) \
    .write \
    .format("csv") \
    .mode("overwrite") \
    .save(filePath)
# Optional: list the files Spark wrote before cleaning up
allFiles = dbutils.fs.ls(filePath)
# Get rid of temp files and rename the actual CSV to something meaningful
renameSavedFile(filePath, desiredFileName)
# Upload file to RIS
uploadFileToRIS(conn, filePath, desiredFileName, "my-ris-group", "Active/T3 2018-2023/lundeberg")
Discussion
RIS services and maintains a large, WashU-wide accessible scientific compute and storage cluster intended for research computing endeavors throughout the University. Researchers often want data sourced from, or sent to, this storage for their research activities. One can think of it as a WashU "in-house" alternative to Databricks' Unity Catalog.
Unfortunately, at the moment the link between the RIS storage service, storage1, and the WUSM Data Lake is not tightly integrated. This will hopefully change with time.
Data transfers between Unity Catalog and RIS storage can happen via Samba/SMB.
Note:
Using SMB to transfer small to moderately large files is fine. For many large or "very large" files, other approaches may need to be investigated, such as Azure's AzCopy utility.
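As a rough illustration of that note, a hypothetical helper could suggest a transfer method based on file size. The 2 GB cutoff below is an assumption for illustration only, not an RIS policy:

```python
import os

# Assumed cutoff: beyond ~2 GB, a bulk tool such as AzCopy is likely a
# better fit than a single SMB stream. This number is illustrative only.
SMB_SIZE_LIMIT_BYTES = 2 * 1024**3

def choose_transfer_method(path: str) -> str:
    """Return 'smb' for files small enough for a plain SMB transfer,
    'azcopy' otherwise (a sketch, not an RIS recommendation)."""
    return "smb" if os.path.getsize(path) <= SMB_SIZE_LIMIT_BYTES else "azcopy"
```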
Note:
The Python code described in the solution may eventually be packaged into a proper pip-installable package. When that happens, this solution will be updated accordingly.
Passwords
Notice how the example user's RIS password is stored as a secret in Azure Key Vault:
password = dbutils.secrets.get("wusm-prod-kv", "lundeberg-ris-password")
This is a security feature: other members of the WashU community can collaborate on your notebooks without gaining access to your sensitive information.
Please contact your group administrator to set up your RIS password in Azure Key Vault.
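As a sketch of how a notebook might stay runnable both inside and outside Databricks, a hypothetical helper (not part of the recipe above) could fall back to an environment variable when `dbutils` is unavailable. The function name and fallback convention here are assumptions:

```python
import os

def get_ris_password(scope: str, key: str) -> str:
    """Fetch the RIS password from the Databricks secret scope when running
    in a notebook, falling back to an environment variable (e.g.
    LUNDEBERG_RIS_PASSWORD) for local experimentation."""
    try:
        return dbutils.secrets.get(scope, key)  # only defined inside Databricks
    except NameError:
        # Outside Databricks: map "lundeberg-ris-password" -> LUNDEBERG_RIS_PASSWORD
        return os.environ.get(key.upper().replace("-", "_"), "")
```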
Unity Volume Path Structure
In our example we transfer a file from RIS into the Unity Catalog. Please see the recipe "Saving Results From a Databricks Notebook to a File" for more details about the Unity Catalog Volume path structure.
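To make the path structure concrete, the sketch below splits a Volume path like the one used in the example transfer (`/Volumes/sandbox/data_brokers/data_brokers_volume/sampleRISUpload`) into its catalog, schema, volume, and subpath components. The helper name is hypothetical:

```python
from pathlib import PurePosixPath

def parse_volume_path(path: str) -> dict:
    """Split a Unity Catalog Volume path of the form
    /Volumes/<catalog>/<schema>/<volume>/<subpath> into its components."""
    parts = PurePosixPath(path).parts
    # parts[0] is "/" and parts[1] must be the literal "Volumes" root
    if len(parts) < 5 or parts[1] != "Volumes":
        raise ValueError(f"Not a Unity Volume path: {path}")
    return {
        "catalog": parts[2],
        "schema": parts[3],
        "volume": parts[4],
        "subpath": "/".join(parts[5:]),
    }
```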