Billing Documentation

Summary

This document provides an overview of the billing pipeline in the WUSM Data Lake. It explains how billing data is ingested, processed, and made available for analysis.

JSON data is ingested into the standard raw and cleansed tables under the billing schema. A curated-layer view is then built on top of the cleansed table; its logic derives a costinbillingcurrency value from quantity together with metercategory, metername, reservationid, and clustername.
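The derivation logic itself is not spelled out in this document, so the following is only a hypothetical sketch of what the curated-layer view could look like. The catalog, schema, and object names (curated.billing.v_billing_cost, cleansed.billing.usage) and the rate columns (reserved_rate, unit_rate) are assumptions, not the production definition.

    # Hypothetical sketch of the curated-layer view, run from a Databricks
    # notebook cell where `spark` is predefined. Table names and the CASE
    # logic are assumptions, not the production definition.
    spark.sql("""
        CREATE OR REPLACE VIEW curated.billing.v_billing_cost AS
        SELECT
            metercategory,
            metername,
            reservationid,
            clustername,
            quantity,
            -- Illustrative only: reserved_rate and unit_rate are assumed columns.
            CASE
                WHEN reservationid IS NOT NULL THEN quantity * reserved_rate
                ELSE quantity * unit_rate
            END AS costinbillingcurrency
        FROM cleansed.billing.usage
    """)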

Pipeline Details

Job

A Databricks job has been configured to run the pipeline daily at 3:00 AM Central Time.
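If the job is managed through the Databricks Jobs API, that schedule corresponds to a block like the one below. This is a sketch of the schedule settings only; the rest of the job definition is omitted.

    # Schedule portion of a Databricks Jobs API job definition (sketch only).
    schedule = {
        "quartz_cron_expression": "0 0 3 * * ?",  # 3:00 AM every day
        "timezone_id": "America/Chicago",         # Central Time
        "pause_status": "UNPAUSED",
    }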

Pre-Pipeline Stage

A pre-pipeline notebook runs first and does three things, one per cell (a sketch follows the list):

  1. Set the proper Spark configs to access the billing container in the wusmbilling storage account, into which the billing JSON files are copied by an upstream process.
  2. For each folder within the prod/WUSMPRODBillingMTD directory, find the most recently modified file.
  3. Copy each file found in step 2 into a temporary upload directory used by the main ingestion stage.
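A hedged sketch of those three cells, assuming they run in a Databricks notebook (spark and dbutils predefined); the secret scope/key and the temporary upload path are hypothetical, while the source paths come from the list above.

    # Cell 1: Spark config for the billing container on the wusmbilling account.
    spark.conf.set(
        "fs.azure.account.key.wusmbilling.dfs.core.windows.net",
        dbutils.secrets.get(scope="billing", key="wusmbilling-key"),  # hypothetical secret
    )

    src_root = "abfss://billing@wusmbilling.dfs.core.windows.net/prod/WUSMPRODBillingMTD"
    upload_dir = "abfss://billing@wusmbilling.dfs.core.windows.net/tmp/upload"  # hypothetical

    # Cell 2: for each folder, keep only the most recently modified file.
    latest_files = []
    for folder in dbutils.fs.ls(src_root):
        files = [f for f in dbutils.fs.ls(folder.path) if not f.isDir()]
        if files:
            # modificationTime is epoch milliseconds on FileInfo objects.
            latest_files.append(max(files, key=lambda f: f.modificationTime))

    # Cell 3: copy each of those files into the temporary upload directory.
    for f in latest_files:
        dbutils.fs.cp(f.path, upload_dir + "/" + f.name)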

Main Pipeline Stage

A Databasin job ingests the files placed in the temporary upload directory by the pre-pipeline stage into the billing schema in the raw and cleansed catalogs.
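Databasin performs the actual ingestion; conceptually, the step amounts to something like the PySpark below. This is not the Databasin implementation, and the table names are assumptions.

    # Conceptual sketch only; the real work is done by the Databasin job.
    upload_dir = "abfss://billing@wusmbilling.dfs.core.windows.net/tmp/upload"  # hypothetical

    # Raw: land the JSON files as-is.
    raw_df = spark.read.json(upload_dir)
    raw_df.write.mode("append").saveAsTable("raw.billing.usage")  # assumed table name

    # Cleansed: light standardization, e.g. lower-cased column names.
    cleansed_df = raw_df.toDF(*[c.lower() for c in raw_df.columns])
    cleansed_df.write.mode("append").saveAsTable("cleansed.billing.usage")  # assumed table name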

Post-Pipeline Stage

A post-pipeline notebook runs last and does four things, one per cell (a sketch of the cleanup and archive cells follows the list):

  1. As in the pre-pipeline notebook, set the proper Spark configs to access the billing container in the wusmbilling storage account.
  2. Clean up the files the pre-pipeline process copied into the temporary upload directory.
  3. Copy all ingested files into an archive directory.
  4. Create or replace the curated-layer view described in the Summary.
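A hedged sketch of cells 2 and 3. The archive path is hypothetical, and the sketch assumes the archive copies are taken from the source prod/WUSMPRODBillingMTD folders (since the temporary copies are removed in cell 2); the document does not state this explicitly. Cell 4 would re-issue a CREATE OR REPLACE VIEW statement like the one sketched in the Summary.

    # Sketch of cells 2 and 3. src_root comes from the pre-pipeline stage;
    # upload_dir and archive_dir are hypothetical paths.
    src_root = "abfss://billing@wusmbilling.dfs.core.windows.net/prod/WUSMPRODBillingMTD"
    upload_dir = "abfss://billing@wusmbilling.dfs.core.windows.net/tmp/upload"  # hypothetical
    archive_dir = "abfss://billing@wusmbilling.dfs.core.windows.net/archive"    # hypothetical

    # Cell 2: remove the temporary copies made by the pre-pipeline stage.
    dbutils.fs.rm(upload_dir, recurse=True)

    # Cell 3: copy the ingested files into the archive directory. Archiving
    # from the source folders is an assumption, since the temp copies are gone.
    for folder in dbutils.fs.ls(src_root):
        files = [f for f in dbutils.fs.ls(folder.path) if not f.isDir()]
        if files:
            latest = max(files, key=lambda f: f.modificationTime)
            dbutils.fs.cp(latest.path, archive_dir + "/" + latest.name)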

Updated on August 7, 2025