DataBasin Broker Audit Process - Technical Documentation
Overview
The Broker Audit Process is a comprehensive auditing system designed to track and log all data access events, particularly for Protected Health Information (PHI). This system enables data brokers and compliance teams to perform full audits of data transmissions, both internal and external, by maintaining detailed records of what data was accessed, when it was accessed, and by which processes.
Architecture and Components
Core Scala Implementation
The audit system is implemented through the AuditTable
object in the com.databasin.platform.automation
package. This object contains two primary methods that work together to create a complete audit trail.
Primary Audit Tables
The system maintains two interconnected tables within the governance schema:
1. databasin_audit_log
- This is the detail-level table that captures individual record access events. For each record accessed, it stores:
- The audit identifier (the actual PHI field value, such as an MRN)
- The timestamp when the data was accessed
- The extension run ID (representing the overall automation execution)
- The query run ID (representing the specific SQL statement that produced the results)
- Data lake load timestamps for reconciliation purposes
2. databasin_audit_query_information
- This metadata table provides context about the queries executed. It contains:
- The automation name
- The actual SQL query text
- The number of results returned
- The audit identifier column name used
- Additional audit request information
- A count of distinct audit identifiers accessed
- Associated run IDs for joining with the detail log
Technical Process Flow
Step 1: File Drop Processing
When a file drop occurs through an automation, the system initiates the audit logging process. The createFileDropAuditTable
method handles the initial data capture by:
- Determining the appropriate audit identifier column (defaulting to "mrn" if not specified)
- Creating a streamlined DataFrame containing only the audit identifier and access timestamp
- Enriching the data with system-generated audit columns including run IDs and load timestamps
- Writing the audit records to the database using JDBC batch processing with a batch size of 100,000 records for performance optimization
Step 2: Query Metadata Capture
Simultaneously, the createQueryMapAuditTable
method captures the query-level metadata by:
- Collecting automation details including the name and SQL query text
- Computing aggregate statistics such as total results returned and distinct audit identifier count
- Creating a single-row DataFrame with all query metadata
- Appending the same system audit columns for correlation
- Writing to the query information table
Step 3: Data Enrichment
Both methods utilize the private addAuditDLColumns
function to standardize audit metadata across tables. This function adds:
extension_run_id
: Links all queries within a single automation executionquery_run_id
: Uniquely identifies each SQL statement executiondl_load_dt
: Date-only timestamp for partitioningdl_load_ts
: Full timestamp for precise timing
Usage and Query Patterns
Basic Audit Query
To retrieve all audit records for a specific automation:
SELECT * FROM governance.audit.databasin_audit_log
WHERE extension_run_id = 'your_run_id';
Complete Audit Trail
For a comprehensive audit trail that combines detail records with query metadata and enriches with source system information:
SELECT
qi.name AS automation_name,
al.audit_identifier AS phi_value,
qi.audit_identifier AS audit_column_name,
iit.ID_TYPE_NAME AS data_source,
al.data_access_date,
qi.query AS sql_executed,
qi.results_returned,
qi.audit_request_information,
qi.extension_run_id,
qi.query_run_id
FROM governance.audit.databasin_audit_log al
LEFT JOIN governance.audit.databasin_audit_query_information qi
ON al.extension_run_id = qi.extension_run_id
AND al.query_run_id = qi.query_run_id
LEFT JOIN cleansed.epic_clarity.clarity_orgfilter_identity_id ii
ON al.audit_identifier = ii.identity_id
JOIN cleansed.epic_clarity.clarity_orgfilter_identity_id_type iit
ON iit.ID_TYPE = ii.IDENTITY_TYPE_ID;
Additional Monitoring Tables
The system integrates with broader governance infrastructure:
system.query.history
: Real-time query tracking for SQL Warehouse queriesgovernance.audit.queries
: Daily-updated comprehensive query log from Microsoft, useful for workspace-wide query analysis
Example query for workspace analysis:
-- Find queries by specific user containing certain keywords
SELECT * FROM governance.audit.queries
WHERE user = 'brianna.magnusen@wustl.edu'
AND code LIKE '%greenberg%'
ORDER BY execute_datetime DESC;
Key Design Considerations
Performance Optimization
The system employs several strategies to handle large-scale data processing:
- Batch inserts with configurable batch sizes (default: 100,000 records)
- Column pruning to minimize data movement
- Append-only writes to avoid expensive update operations
Compliance and Security
The architecture ensures:
- Complete traceability of PHI access
- Immutable audit logs through append-only operations
- Correlation capabilities between different audit levels
- Support for both internal and external audit requirements
Extensibility
The system design allows for:
- Custom audit identifier columns per automation
- Additional audit metadata through the
auditInformation
field - Integration with various source systems through configurable JDBC connections
Quick Reference
Table Schemas
governance.audit.databasin_audit_log
Column | Type | Description |
---|---|---|
audit_identifier | String | PHI field value (e.g., MRN) |
data_access_date | Timestamp | When the data was accessed |
extension_run_id | String | Overall automation execution ID |
query_run_id | String | Specific SQL statement execution ID |
dl_load_dt | Date | Data lake load date |
dl_load_ts | Timestamp | Data lake load timestamp |
governance.audit.databasin_audit_query_information
Column | Type | Description |
---|---|---|
name | String | Automation name |
query | String | SQL query text |
results_returned | Long | Number of results |
audit_identifier | String | Audit column name |
audit_request_information | String | Additional audit metadata |
distinct_audit_identifier_count | Long | Count of unique audit identifiers |
extension_run_id | String | Overall automation execution ID |
query_run_id | String | Specific SQL statement execution ID |
dl_load_dt | Date | Data lake load date |
dl_load_ts | Timestamp | Data lake load timestamp |
Conclusion
This audit system provides a robust foundation for maintaining compliance with data governance requirements while enabling efficient investigation of data access patterns when needed. By capturing both detailed access logs and query-level metadata, the system supports comprehensive audit trails that can satisfy both internal governance needs and external regulatory requirements.