DataBasin Broker Audit Process - Technical Documentation

Overview

The Broker Audit Process is a comprehensive auditing system designed to track and log all data access events, particularly for Protected Health Information (PHI). This system enables data brokers and compliance teams to perform full audits of data transmissions, both internal and external, by maintaining detailed records of what data was accessed, when it was accessed, and by which processes.

Architecture and Components

Core Scala Implementation

The audit system is implemented through the AuditTable object in the com.databasin.platform.automation package. This object contains two primary methods that work together to create a complete audit trail.

Primary Audit Tables

The system maintains two interconnected tables within the governance schema:

1. databasin_audit_log - This is the detail-level table that captures individual record access events. For each record accessed, it stores:

The audit identifier (the actual PHI field value, such as an MRN)
The timestamp when the data was accessed
The extension run ID (representing the overall automation execution)
The query run ID (representing the specific SQL statement that produced the results)
Data lake load timestamps for reconciliation purposes

2. databasin_audit_query_information - This metadata table provides context about the queries executed. It contains:

The automation name
The actual SQL query text
The number of results returned
The audit identifier column name used
Additional audit request information
A count of distinct audit identifiers accessed
Associated run IDs for joining with the detail log

Technical Process Flow

Step 1: File Drop Processing

When a file drop occurs through an automation, the system initiates the audit logging process. The createFileDropAuditTable method handles the initial data capture by:

Determining the appropriate audit identifier column (defaulting to "mrn" if not specified)
Creating a streamlined DataFrame containing only the audit identifier and access timestamp
Enriching the data with system-generated audit columns including run IDs and load timestamps
Writing the audit records to the database using JDBC batch processing with a batch size of 100,000 records for performance optimization

Step 2: Query Metadata Capture

Simultaneously, the createQueryMapAuditTable method captures the query-level metadata by:

Collecting automation details including the name and SQL query text
Computing aggregate statistics such as total results returned and distinct audit identifier count
Creating a single-row DataFrame with all query metadata
Appending the same system audit columns for correlation
Writing to the query information table

Step 3: Data Enrichment

Both methods utilize the private addAuditDLColumns function to standardize audit metadata across tables. This function adds:

extension_run_id: Links all queries within a single automation execution
query_run_id: Uniquely identifies each SQL statement execution
dl_load_dt: Date-only timestamp for partitioning
dl_load_ts: Full timestamp for precise timing

Usage and Query Patterns

Basic Audit Query

To retrieve all audit records for a specific automation:

SELECT * FROM governance.audit.databasin_audit_log 
WHERE extension_run_id = 'your_run_id';

Complete Audit Trail

For a comprehensive audit trail that combines detail records with query metadata and enriches with source system information:

SELECT 
    qi.name AS automation_name,
    al.audit_identifier AS phi_value,
    qi.audit_identifier AS audit_column_name,
    iit.ID_TYPE_NAME AS data_source,
    al.data_access_date,
    qi.query AS sql_executed,
    qi.results_returned,
    qi.audit_request_information,
    qi.extension_run_id,
    qi.query_run_id
FROM governance.audit.databasin_audit_log al 
LEFT JOIN governance.audit.databasin_audit_query_information qi 
    ON al.extension_run_id = qi.extension_run_id 
    AND al.query_run_id = qi.query_run_id
LEFT JOIN cleansed.epic_clarity.clarity_orgfilter_identity_id ii
    ON al.audit_identifier = ii.identity_id
JOIN cleansed.epic_clarity.clarity_orgfilter_identity_id_type iit
    ON iit.ID_TYPE = ii.IDENTITY_TYPE_ID;

Additional Monitoring Tables

The system integrates with broader governance infrastructure:

system.query.history: Real-time query tracking for SQL Warehouse queries
governance.audit.queries: Daily-updated comprehensive query log from Microsoft, useful for workspace-wide query analysis

Example query for workspace analysis:

-- Find queries by specific user containing certain keywords
SELECT * FROM governance.audit.queries 
WHERE user = 'brianna.magnusen@wustl.edu'
AND code LIKE '%greenberg%'
ORDER BY execute_datetime DESC;

Key Design Considerations

Performance Optimization

The system employs several strategies to handle large-scale data processing:

Batch inserts with configurable batch sizes (default: 100,000 records)
Column pruning to minimize data movement
Append-only writes to avoid expensive update operations

Compliance and Security

The architecture ensures:

Complete traceability of PHI access
Immutable audit logs through append-only operations
Correlation capabilities between different audit levels
Support for both internal and external audit requirements

Extensibility

The system design allows for:

Custom audit identifier columns per automation
Additional audit metadata through the auditInformation field
Integration with various source systems through configurable JDBC connections

Quick Reference

Table Schemas

governance.audit.databasin_audit_log

Column	Type	Description
audit_identifier	String	PHI field value (e.g., MRN)
data_access_date	Timestamp	When the data was accessed
extension_run_id	String	Overall automation execution ID
query_run_id	String	Specific SQL statement execution ID
dl_load_dt	Date	Data lake load date
dl_load_ts	Timestamp	Data lake load timestamp

governance.audit.databasin_audit_query_information

Column	Type	Description
name	String	Automation name
query	String	SQL query text
results_returned	Long	Number of results
audit_identifier	String	Audit column name
audit_request_information	String	Additional audit metadata
distinct_audit_identifier_count	Long	Count of unique audit identifiers
extension_run_id	String	Overall automation execution ID
query_run_id	String	Specific SQL statement execution ID
dl_load_dt	Date	Data lake load date
dl_load_ts	Timestamp	Data lake load timestamp

Conclusion

This audit system provides a robust foundation for maintaining compliance with data governance requirements while enabling efficient investigation of data access patterns when needed. By capturing both detailed access logs and query-level metadata, the system supports comprehensive audit trails that can satisfy both internal governance needs and external regulatory requirements.

Data Sources

Platforms

Teams