Data Governance in Databricks
When reviewing datasets, tables, or reports for certification in Databricks, the Data Governance team evaluates a defined set of foundational elements that enable trust, reuse, and long-term sustainability. Meeting these criteria indicates that a data asset is understandable, accountable, reusable, and fit for purpose across the organization.
1. Governance Principles
- Data as a product: data is intentionally designed, documented, and maintained to meet specific business needs.
- Shared responsibility: makes data quality sustainable by distributing accountability across business, IT/DA, and governance roles instead of relying on a single team to fix data after the fact.
- Single source of truth: prevents conflicting answers, reduces reconciliation effort, and builds confidence that decisions are based on consistent, authoritative data.
- Privacy by design: embedding privacy into data design protects sensitive information from the start, reducing compliance risk while still enabling appropriate business use.
- Transparency and traceability: allow users to understand where data comes from, how it is transformed, and who is accountable, which is essential for trust, auditability, and decision confidence.
2. Governance Roles and Accountability
Every dataset must have clearly defined ownership roles. At a minimum, datasets should identify the following roles, where applicable:
Data Owner (Business): Accountable for the overall quality, accuracy, security, and appropriate use of a dataset. The Data Owner defines access requirements, approves data usage, and ensures the dataset meets business, regulatory, and compliance obligations.
Note: The Data Owner is based on business accountability for the information itself and may differ from the system or application owner responsible for the technology where the data resides.
Business Data Subject Matter Expert (SME): The individual with deep functional knowledge of the dataset, including its meaning, business rules, context, dependencies, and typical usage patterns. The SME provides authoritative guidance on data interpretation, validates definitions, and supports issue resolution.
Note: In some cases, the SME may have more day-to-day familiarity with the dataset than the Data Owner, but accountability remains with the Data Owner.
Data Steward (Technical Services): Responsible for bringing data from source systems into Databricks and ensuring it lands in a usable, well-structured format. This role manages ingestion pipelines, validates that data is loading as expected, and applies the initial technical setup for tables (schemas, field types, basic documentation, and operational checks). Technical Data Stewards understand the source system, the extraction process, and how the data should flow into the lakehouse so downstream teams and business SMEs can work with it effectively.
Data Governance Owner (Information Management): Responsible for the operational management of a dataset’s governance requirements. The Data Governance Owner maintains metadata, applies data quality and profiling standards, coordinates tagging and classification, and ensures governance practices are consistently followed. This role does not own the data itself; rather, it oversees governance processes, provides guidance, and ensures consistent application of governance practices across datasets.
Business Systems Analyst (BSA): Responsible for translating business requirements into system and data specifications, ensuring alignment between business processes, data structures, and technical implementations. They support data quality, clarify rules, and facilitate communication between SMEs, stewards, developers, and system owners.
3. Dataset Standards, Metadata, and Data Cataloguing
Governance Requirements
Datasets must be clearly defined and documented, including:
- A meaningful table description explaining the purpose and intended use of the dataset
- Column-level comments, centrally managed by the Data Governance team, which serve as the column definitions that are used across ARC
- Appropriate tags applied to support discovery, classification, and governance (e.g., owner, compliance, retention classification)
- Authoritative Source: Identify the system of record for the data (e.g., SCADA, GIS, SAP) at the table level. If similar data exists in multiple systems, clearly state which source is authoritative and why.
- Any upstream or downstream systems that contribute to or consume the data
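The documentation requirements above can be expressed as a simple automated check. The sketch below is illustrative only: the dictionary layout, required tag names, and example values are assumptions for this example, not an ARC or Databricks schema.

```python
# Hedged sketch: flag gaps against the dataset documentation requirements.
# The dict layout, tag names, and sample values are illustrative assumptions.

REQUIRED_TAGS = {"owner", "compliance", "retention_classification"}  # assumed taxonomy

def documentation_gaps(dataset: dict) -> list:
    """Return human-readable gaps; an empty list means the dataset is documented."""
    gaps = []
    if not dataset.get("description", "").strip():
        gaps.append("missing table description")
    for col, comment in dataset.get("columns", {}).items():
        if not (comment or "").strip():
            gaps.append("missing comment on column '%s'" % col)
    for tag in sorted(REQUIRED_TAGS - set(dataset.get("tags", {}))):
        gaps.append("missing tag '%s'" % tag)
    if not dataset.get("authoritative_source"):
        gaps.append("authoritative source not identified")
    return gaps

# Example: one undocumented column and two missing tags are flagged.
well_headers = {
    "description": "Well header attributes for operated wells",
    "columns": {"well_id": "Unique well identifier", "spud_date": ""},
    "tags": {"owner": "Operations"},
    "authoritative_source": "SCADA",
}
print(documentation_gaps(well_headers))
```

A check like this could run as a certification gate before a dataset is submitted for review, so gaps surface before the Data Governance team evaluates the asset.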
Metadata Required
- Business Glossary terms
- Technical metadata (schema, data types, lineage, formulas)
- Operational metadata (refresh schedule/freshness, quality score, SLAs)
Standards
- Naming conventions
- Field definitions
- Tagging taxonomy
Central Glossary and Tag Management
The Data Governance team maintains a centrally managed business glossary and is responsible for:
- Defining and curating approved business terms and definitions
- Applying and managing standardized definitions across datasets with the Data Services team
- Managing the enterprise tag framework within Databricks (including controlled vocabularies and tag values)
Dataset documentation should align with the centrally managed glossary to ensure consistent terminology and interpretation across the organization. Clear and consistent definitions at the table, column, and glossary level are foundational to trusted reporting, operational use, and compliance.
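Alignment with the central glossary can also be verified programmatically. A minimal sketch, assuming a glossary held as a simple term-to-definition mapping (the terms and definitions below are invented for illustration):

```python
# Hedged sketch: flag column comments that drift from the centrally managed
# glossary. The glossary contents here are illustrative assumptions.

GLOSSARY = {  # assumed approved terms -> definitions
    "well_id": "Unique identifier assigned to a well.",
    "spud_date": "Date drilling operations commenced.",
}

def misaligned_definitions(column_comments: dict) -> list:
    """Columns whose comment differs from the approved glossary definition."""
    return [
        col for col, comment in column_comments.items()
        if col in GLOSSARY and comment != GLOSSARY[col]
    ]

print(misaligned_definitions({
    "well_id": "Unique identifier assigned to a well.",
    "spud_date": "When the well was drilled",  # drifted wording -> flagged
}))
```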
4. Core Data Quality Components
For datasets used in Operations, Reporting, or Compliance, business logic and definitions must be explicitly documented. This ensures consistency, traceability, and confidence in how data is interpreted and used. The Information Management and Data Services teams work together to define and monitor data quality. At a minimum, datasets should be evaluated against these data quality dimensions:
- Accuracy – Does the data reflect reality? (IM)
- Completeness – Are required fields populated? (Data Services)
- Consistency – Does the data align across systems and over time? (IM)
- Timeliness – Is the data refreshed when users expect it? (Data Services)
- Uniqueness – Are duplicates understood and managed? (IM)
- Validity – Do formats, ranges, and codes conform to expectations? (e.g., license numbers are valid) (IM)
- Integrity – Does the data maintain logical relationships? (e.g., spud date precedes completion date) (IM)
Together, these components support reliable analytics, reporting, and decision-making.
The Data Governance team looks for evidence that data quality is being intentionally managed through these defined validation checks (completeness, consistency, accuracy, uniqueness, validity, timeliness, and integrity) with those checks explicitly based on and informed by reference data, business rules, and data relationships.
- Reference Data underpins validity, consistency, accuracy, and uniqueness checks
- Business Rules drive completeness, accuracy, validity, and timeliness checks (Define how key fields and metrics are calculated, including assumptions, thresholds, exclusions, and conditional logic.)
- Data Relationships support integrity, completeness, and accuracy checks
- Known limitations or acceptable exceptions
Controls should be proportionate to how the data is used and the decisions it supports.
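The mapping above between check sources and quality dimensions can be sketched as row-level validations. The field names, the status code set, and the rules themselves are illustrative assumptions, not defined ARC checks:

```python
# Hedged sketch: one check of each type (reference data, business rule,
# data relationship) applied to a single record. All names are assumptions.

STATUS_CODES = {"ACTIVE", "SUSPENDED", "ABANDONED"}  # assumed reference data

def check_row(row: dict) -> list:
    """Return quality issues for one record; empty means the row passes."""
    issues = []
    # Reference data -> validity: status must come from the controlled code set
    if row.get("status") not in STATUS_CODES:
        issues.append("invalid status code")
    # Business rule -> completeness: active wells must report a daily rate
    if row.get("status") == "ACTIVE" and row.get("daily_rate") is None:
        issues.append("missing daily_rate for active well")
    # Data relationship -> integrity: spud date must precede completion date
    spud, comp = row.get("spud_date"), row.get("completion_date")
    if spud and comp and spud > comp:  # ISO-8601 strings compare chronologically
        issues.append("spud_date after completion_date")
    return issues

print(check_row({
    "status": "ACTIVE",
    "daily_rate": None,
    "spud_date": "2024-05-01",
    "completion_date": "2024-03-01",
}))
```

Proportionality, as noted above, would be applied by choosing which of these checks gate certification for a given use: a compliance dataset might fail on any issue, while an exploratory one might only log them.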
5. Master Data Alignment – Unity Catalog Master Data Tables as the Master Data System
Databricks Unity Catalog serves as the enterprise Master Data System for analytics and downstream consumption. Master data from multiple source systems is consolidated, standardized, and reconciled within Databricks to produce a certified golden record.
Authoritative Source Hierarchy
The authoritative representation of master data is determined by governance policy and applied consistently during mastering:
- System of Record – designated upstream source system(s) per domain
- Steward Overrides – approved manual or rule-based corrections
- Reference Data – controlled code sets and hierarchies used for standardization and validation
- All transformations, mappings, and reconciliation logic must be documented and version-controlled
- Golden master datasets must pass defined data quality gates before certification
- Full lineage from source systems to golden records must be maintained
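The hierarchy above implies a survivorship rule for each field of a golden record: an approved steward override wins, otherwise the highest-precedence source that supplies a value. A minimal sketch, in which the source names and precedence order are illustrative assumptions rather than ARC policy:

```python
# Hedged sketch of survivorship under the authoritative source hierarchy.
# SOURCE_PRECEDENCE and the sample values are assumptions for illustration.

SOURCE_PRECEDENCE = ["SAP", "GIS", "SCADA"]  # assumed per-domain ranking

def golden_value(field, source_values, overrides):
    """Pick the surviving value for one master-data field."""
    if field in overrides:                 # steward override takes priority
        return overrides[field]
    for source in SOURCE_PRECEDENCE:       # then system-of-record order
        value = source_values.get(source, {}).get(field)
        if value is not None:
            return value
    return None                            # no source supplied the field

sources = {"GIS": {"operator": "ARC Res."}, "SCADA": {"operator": "ARC"}}
print(golden_value("operator", sources, {"operator": "ARC Resources Ltd."}))
print(golden_value("operator", sources, {}))  # GIS outranks SCADA when SAP is silent
```

In practice this logic would itself be documented and version-controlled, and its outputs would pass the defined quality gates before a golden record is certified, consistent with the requirements listed above.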
Golden Master Data Sets
- Master Well List
- Master Facilities List
- Master Pipeline List
- Master Equipment List
- Master Business Partner List
- Business Glossary
- Glossary Mapping
- ARC Unique Asset ID table