DataBasin Broker Audit Process - Technical Documentation

Overview

The Broker Audit Process is a comprehensive auditing system designed to track and log all data access events, particularly for Protected Health Information (PHI). This system enables data brokers and compliance teams to perform full audits of data transmissions, both internal and external, by maintaining detailed records of what data was accessed, when it was accessed, and by which processes.

Architecture and Components

Core Scala Implementation

The audit system is implemented through the AuditTable object in the com.databasin.platform.automation package. This object contains two primary methods that work together to create a complete audit trail.

Primary Audit Tables

The system maintains two interconnected tables within the governance schema:

1. databasin_audit_log - This is the detail-level table that captures individual record access events. For each record accessed, it stores:

  • The audit identifier (the actual PHI field value, such as an MRN)
  • The timestamp when the data was accessed
  • The extension run ID (representing the overall automation execution)
  • The query run ID (representing the specific SQL statement that produced the results)
  • Data lake load timestamps for reconciliation purposes

2. databasin_audit_query_information - This metadata table provides context about the queries executed. It contains:

  • The automation name
  • The actual SQL query text
  • The number of results returned
  • The audit identifier column name used
  • Additional audit request information
  • A count of distinct audit identifiers accessed
  • Associated run IDs for joining with the detail log

Technical Process Flow

Step 1: File Drop Processing

When a file drop occurs through an automation, the system initiates the audit logging process. The createFileDropAuditTable method handles the initial data capture by:

  1. Determining the appropriate audit identifier column (defaulting to "mrn" if not specified)
  2. Creating a streamlined DataFrame containing only the audit identifier and access timestamp
  3. Enriching the data with system-generated audit columns including run IDs and load timestamps
  4. Writing the audit records to the database using JDBC batch processing with a batch size of 100,000 records for performance optimization

Step 2: Query Metadata Capture

Simultaneously, the createQueryMapAuditTable method captures the query-level metadata by:

  1. Collecting automation details including the name and SQL query text
  2. Computing aggregate statistics such as total results returned and distinct audit identifier count
  3. Creating a single-row DataFrame with all query metadata
  4. Appending the same system audit columns for correlation
  5. Writing to the query information table

Step 3: Data Enrichment

Both methods utilize the private addAuditDLColumns function to standardize audit metadata across tables. This function adds:

  • extension_run_id: Links all queries within a single automation execution
  • query_run_id: Uniquely identifies each SQL statement execution
  • dl_load_dt: Date-only timestamp for partitioning
  • dl_load_ts: Full timestamp for precise timing

Usage and Query Patterns

Basic Audit Query

To retrieve all audit records for a specific automation:

SELECT * FROM governance.audit.databasin_audit_log 
WHERE extension_run_id = 'your_run_id';

Complete Audit Trail

For a comprehensive audit trail that combines detail records with query metadata and enriches with source system information:

SELECT 
    qi.name AS automation_name,
    al.audit_identifier AS phi_value,
    qi.audit_identifier AS audit_column_name,
    iit.ID_TYPE_NAME AS data_source,
    al.data_access_date,
    qi.query AS sql_executed,
    qi.results_returned,
    qi.audit_request_information,
    qi.extension_run_id,
    qi.query_run_id
FROM governance.audit.databasin_audit_log al 
LEFT JOIN governance.audit.databasin_audit_query_information qi 
    ON al.extension_run_id = qi.extension_run_id 
    AND al.query_run_id = qi.query_run_id
LEFT JOIN cleansed.epic_clarity.clarity_orgfilter_identity_id ii
    ON al.audit_identifier = ii.identity_id
JOIN cleansed.epic_clarity.clarity_orgfilter_identity_id_type iit
    ON iit.ID_TYPE = ii.IDENTITY_TYPE_ID;

Additional Monitoring Tables

The system integrates with broader governance infrastructure:

  • system.query.history: Real-time query tracking for SQL Warehouse queries
  • governance.audit.queries: Daily-updated comprehensive query log from Microsoft, useful for workspace-wide query analysis

Example query for workspace analysis:

-- Find queries by specific user containing certain keywords
SELECT * FROM governance.audit.queries 
WHERE user = 'brianna.magnusen@wustl.edu'
AND code LIKE '%greenberg%'
ORDER BY execute_datetime DESC;

Key Design Considerations

Performance Optimization

The system employs several strategies to handle large-scale data processing:

  • Batch inserts with configurable batch sizes (default: 100,000 records)
  • Column pruning to minimize data movement
  • Append-only writes to avoid expensive update operations

Compliance and Security

The architecture ensures:

  • Complete traceability of PHI access
  • Immutable audit logs through append-only operations
  • Correlation capabilities between different audit levels
  • Support for both internal and external audit requirements

Extensibility

The system design allows for:

  • Custom audit identifier columns per automation
  • Additional audit metadata through the auditInformation field
  • Integration with various source systems through configurable JDBC connections

Quick Reference

Table Schemas

governance.audit.databasin_audit_log

Column Type Description
audit_identifier String PHI field value (e.g., MRN)
data_access_date Timestamp When the data was accessed
extension_run_id String Overall automation execution ID
query_run_id String Specific SQL statement execution ID
dl_load_dt Date Data lake load date
dl_load_ts Timestamp Data lake load timestamp

governance.audit.databasin_audit_query_information

Column Type Description
name String Automation name
query String SQL query text
results_returned Long Number of results
audit_identifier String Audit column name
audit_request_information String Additional audit metadata
distinct_audit_identifier_count Long Count of unique audit identifiers
extension_run_id String Overall automation execution ID
query_run_id String Specific SQL statement execution ID
dl_load_dt Date Data lake load date
dl_load_ts Timestamp Data lake load timestamp

Conclusion

This audit system provides a robust foundation for maintaining compliance with data governance requirements while enabling efficient investigation of data access patterns when needed. By capturing both detailed access logs and query-level metadata, the system supports comprehensive audit trails that can satisfy both internal governance needs and external regulatory requirements.


Updated on August 7, 2025