Version: 1.3.1.0

Data Governance Overview

Data governance is the discipline of ensuring that data assets are discoverable, understandable, trustworthy, and used in compliance with legal and organizational policies. As data volumes grow and regulatory pressure intensifies, governance has moved from a "nice to have" to a foundational requirement.

Why Governance Matters

Regulatory Compliance

GDPR (General Data Protection Regulation) requires organizations to know where personal data resides, document its processing, restrict access to authorized parties, and demonstrate the ability to delete it on request. Without systematic metadata management, answering a regulatory audit becomes a manual, error-prone exercise.

The EU AI Act, which applies in phases starting in 2025, imposes additional obligations on organizations using automated decision systems. High-risk AI systems must maintain logs of training data, model versions, and inference outputs. Data lineage captured at the platform level (rather than reconstructed after the fact) is the most reliable way to meet these traceability requirements.

Data Quality and Trust

When analysts cannot identify which pipeline produced a dataset, or whether a table has been updated recently, they lose trust in the data — and either duplicate work by producing their own copies or make decisions on stale data. A governed data platform makes data provenance explicit, reducing duplication and improving decision quality.

Self-Service Enablement

Governance enables self-service: a business glossary and a searchable metadata catalog let analysts find the datasets they need without relying on tribal knowledge or IT tickets. Because access policies are enforced at the platform level by Apache Ranger, data owners can grant access confidently.

Apache Atlas — Metadata Management

Apache Atlas is the metadata and governance layer of ODP. It provides a metadata store, an automatic lineage engine, a classification system, a business glossary, and a full-text search interface — all accessible through a web UI and a REST API.

Automatic Lineage Capture

Atlas automatically captures lineage from:

  • Hive: The Atlas Hive hook intercepts operations in HiveServer2. Statements that move data, such as CREATE TABLE AS SELECT and INSERT INTO, are recorded as lineage edges, while DDL such as ALTER TABLE updates the metadata of the affected entity. Atlas builds a directed acyclic graph (DAG) showing which source tables contributed to each output table, including intermediate transformations.
  • Spark: The Atlas Spark listener records DataFrame read and write operations, linking source datasets to output datasets at the job level.
  • Kafka: Atlas records producer and consumer relationships between topics, enabling end-to-end lineage from ingestion pipeline to analytical table.

The resulting lineage graph is navigable in the Atlas UI: starting from any table or column, you can trace data forward to all downstream consumers or backward to all upstream sources — without reading pipeline code.
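
The same backward trace can be done programmatically. The sketch below assumes the Atlas v2 lineage endpoint (`/api/atlas/v2/lineage/{guid}`) and a response whose `relations` list carries `fromEntityId` and `toEntityId` fields. The host name, GUIDs, and sample response are hypothetical, and no network call is made.

```python
from collections import deque
from urllib.parse import quote, urlencode


def lineage_url(base_url: str, guid: str, direction: str = "BOTH", depth: int = 3) -> str:
    """Build the URL of the Atlas v2 lineage endpoint for one entity GUID."""
    query = urlencode({"direction": direction, "depth": depth})
    return f"{base_url.rstrip('/')}/api/atlas/v2/lineage/{quote(guid)}?{query}"


def upstream_guids(lineage: dict, start_guid: str) -> list:
    """Walk the 'relations' edges backward from start_guid, breadth-first."""
    parents = {}
    for edge in lineage.get("relations", []):
        parents.setdefault(edge["toEntityId"], []).append(edge["fromEntityId"])
    seen, order, queue = {start_guid}, [], deque([start_guid])
    while queue:
        for parent in parents.get(queue.popleft(), []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


# Hypothetical response fragment: raw_orders -> etl_process -> fact_orders
sample = {
    "relations": [
        {"fromEntityId": "guid-raw-orders", "toEntityId": "guid-etl-process"},
        {"fromEntityId": "guid-etl-process", "toEntityId": "guid-fact-orders"},
    ]
}
print(lineage_url("https://atlas.example.com:21443", "guid-fact-orders", "INPUT"))
print(upstream_guids(sample, "guid-fact-orders"))
```

In a real client the URL would be fetched with an authenticated HTTP request and the JSON body passed to `upstream_guids`.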

Classification and Tagging

Atlas supports classifications (also called tags) — labels attached to entities (databases, tables, columns, processes) to convey business meaning or regulatory sensitivity. Examples:

  • PII — Personally Identifiable Information
  • GDPR_SENSITIVE — data subject to GDPR access restrictions
  • FINANCIAL_CONFIDENTIAL — data restricted to the finance team
  • AI_TRAINING_DATA — dataset used to train a production model
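
Classifications such as these must exist as type definitions before they can be attached to entities. As a minimal sketch, assuming the Atlas v2 typedefs endpoint (`POST /api/atlas/v2/types/typedefs`) and its `classificationDefs` payload field, the four example tags could be declared as follows. The helper name is hypothetical and the code only builds the JSON body.

```python
import json


def classification_typedefs(tags: dict) -> dict:
    """Build a typedefs payload that contains only classification definitions."""
    return {
        "classificationDefs": [
            {"name": name, "description": description, "superTypes": [], "attributeDefs": []}
            for name, description in tags.items()
        ]
    }


payload = classification_typedefs({
    "PII": "Personally Identifiable Information",
    "GDPR_SENSITIVE": "Data subject to GDPR access restrictions",
    "FINANCIAL_CONFIDENTIAL": "Data restricted to the finance team",
    "AI_TRAINING_DATA": "Dataset used to train a production model",
})
print(json.dumps(payload, indent=2))
```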

Classifications propagate through the lineage graph: if a source column is tagged PII and propagation is enabled for that classification, Atlas automatically tags derived columns in downstream tables with the same classification. This ensures that sensitive data is not overlooked when it is transformed or copied.
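
The propagation idea can be illustrated with a toy model. The sketch below is not Atlas's implementation; it is a small in-memory approximation that pushes tags along lineage edges until nothing changes. The column names are hypothetical.

```python
from collections import deque


def propagate(edges, tagged):
    """Push tags from tagged columns to every column derived from them.

    edges:  list of (source_column, derived_column) pairs
    tagged: dict mapping a column to the set of tags applied directly to it
    """
    children = {}
    for src, dst in edges:
        children.setdefault(src, []).append(dst)
    result = {col: set(tags) for col, tags in tagged.items()}
    queue = deque(tagged)
    while queue:
        col = queue.popleft()
        for child in children.get(col, []):
            before = len(result.setdefault(child, set()))
            result[child] |= result[col]
            if len(result[child]) > before:  # revisit only when new tags arrived
                queue.append(child)
    return result


edges = [
    ("raw.customers.email", "staging.customers.email"),
    ("staging.customers.email", "mart.contacts.email_domain"),
    ("raw.orders.amount", "mart.revenue.total"),
]
result = propagate(edges, {"raw.customers.email": {"PII"}})
for column in sorted(result):
    print(column, sorted(result[column]))
```

Columns that do not descend from a tagged source, such as `mart.revenue.total` here, receive no tag.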

Ranger-Atlas Integration: Tag-Based Policies

The integration between Atlas and Ranger is one of the most powerful governance capabilities in ODP:

  1. A data steward tags a Hive column PII in Atlas.
  2. Atlas propagates the tag to all derived columns in downstream tables.
  3. Ranger Tagsync synchronizes the new tag from Atlas into Ranger, where a pre-configured tag-based policy masks the column for users without the PII_ACCESS role.
  4. The masking is enforced simultaneously in every engine that runs a Ranger plugin (Hive, Spark, Impala, Trino), without creating separate per-table policies.

This approach decouples the what (which data is sensitive) from the how (which users can access it), making governance policies maintainable as the data landscape evolves.
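
To make the decoupling concrete, here is a toy model of a tag-based masking decision. It is not Ranger's policy engine or policy format. The class, function, and column names are hypothetical, and only the PII tag and the PII_ACCESS role come from the steps above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TagMaskPolicy:
    tag: str            # classification the policy is attached to
    exempt_role: str    # role that may see unmasked values
    mask: str = "****"  # replacement shown to everyone else


def read_value(value, column_tags, user_roles, policies):
    """Return the value, or the mask of the first policy whose tag is on the
    column and whose exempt role the user does not hold."""
    for policy in policies:
        if policy.tag in column_tags and policy.exempt_role not in user_roles:
            return policy.mask
    return value


policies = [TagMaskPolicy(tag="PII", exempt_role="PII_ACCESS")]
tags_by_column = {"fact_orders.cust_email": {"PII"}, "fact_orders.amount": set()}

analyst = {"ANALYST"}
steward = {"ANALYST", "PII_ACCESS"}
print(read_value("ana@example.com", tags_by_column["fact_orders.cust_email"], analyst, policies))
print(read_value("ana@example.com", tags_by_column["fact_orders.cust_email"], steward, policies))
print(read_value(42.5, tags_by_column["fact_orders.amount"], analyst, policies))
```

The policy never names a table or column. Tagging a new column PII is enough to bring it under the existing rule.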

Business Glossary

The Atlas business glossary links technical metadata (database names, column names, data types) to business terms that non-technical stakeholders understand. For example, the column cust_id in fact_orders can be linked to the glossary term "Customer Identifier", which includes a business definition, ownership information, and related terms.

Linking entities to glossary terms makes the catalog navigable for business users and ensures consistent terminology across the organization.
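
Term creation and assignment can also be scripted. The sketch below assumes the Atlas v2 glossary endpoints (`POST /api/atlas/v2/glossary/term` to create a term and `POST /api/atlas/v2/glossary/terms/{termGuid}/assignedEntities` to link entities). All GUIDs, the host, and the definition text are placeholders, and the code only prepares the request data.

```python
def glossary_term_payload(name: str, definition: str, glossary_guid: str) -> dict:
    """Body for creating a term inside an existing glossary."""
    return {
        "name": name,
        "longDescription": definition,
        "anchor": {"glossaryGuid": glossary_guid},
    }


def term_assignment_request(base_url: str, term_guid: str, entity_guids: list):
    """URL and body for linking entities (for example columns) to a term."""
    url = f"{base_url.rstrip('/')}/api/atlas/v2/glossary/terms/{term_guid}/assignedEntities"
    body = [{"guid": guid} for guid in entity_guids]
    return url, body


term = glossary_term_payload(
    "Customer Identifier",
    "Placeholder business definition of the customer key.",
    "glossary-guid-1",
)
url, body = term_assignment_request(
    "https://atlas.example.com:21443", "term-guid-1", ["guid-fact-orders-cust-id"]
)
print(term)
print(url)
print(body)
```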

Search and Discovery

Atlas provides a full-text and faceted search interface over all metadata. Analysts can search for:

  • Tables containing columns named revenue with type DECIMAL
  • All Hive tables tagged PII created in the last 30 days
  • Datasets produced by a specific Spark job or NiFi pipeline

The REST API exposes the same search capabilities for integration with data catalog tools and internal portals.
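
As an illustration, the second query above (Hive tables tagged PII created in the last 30 days) might be expressed as a request body for the Atlas basic search endpoint (`POST /api/atlas/v2/search/basic`). This is a sketch that assumes the `hive_table` type exposes `createTime` as epoch milliseconds. The function name is hypothetical.

```python
import json
from datetime import datetime, timedelta, timezone


def pii_tables_created_since(days: int, now=None, limit: int = 25) -> dict:
    """Basic-search body: hive_table entities tagged PII created in the last `days` days."""
    now = now or datetime.now(timezone.utc)
    cutoff_ms = int((now - timedelta(days=days)).timestamp() * 1000)
    return {
        "typeName": "hive_table",
        "classification": "PII",
        "excludeDeletedEntities": True,
        "limit": limit,
        "offset": 0,
        "entityFilters": {
            "condition": "AND",
            "criterion": [
                {
                    "attributeName": "createTime",
                    "operator": "gte",
                    "attributeValue": str(cutoff_ms),
                }
            ],
        },
    }


search_body = pii_tables_created_since(30)
print(json.dumps(search_body, indent=2))
```

Passing `now` explicitly makes the cutoff reproducible, which is useful when the query is part of a scheduled audit report.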

Lineage for AI Model Traceability

As organizations build machine learning models on data stored in ODP, lineage becomes critical for AI governance. The AI Act requires that high-risk systems document the data used to train and validate models. With Atlas lineage:

  • The Spark job that produces a training dataset is linked to its source tables.
  • The source tables carry classifications indicating data quality and sensitivity.
  • The training dataset entity in Atlas records when it was created, by whom, and from which upstream sources.

This lineage record supports the documentation obligations of the AI Act and aids post-incident investigation when model behavior needs to be explained.
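
A provenance summary for a training dataset can be derived from a lineage response. The sketch below assumes the response contains a `guidEntityMap` of entity headers (with `displayText`, `typeName`, and `classificationNames`) alongside the `relations` list. The sample entities and GUIDs are hypothetical.

```python
def training_data_provenance(lineage: dict, dataset_guid: str) -> dict:
    """Summarize a dataset and everything upstream of it, with classifications."""
    entities = lineage.get("guidEntityMap", {})
    parents = {}
    for edge in lineage.get("relations", []):
        parents.setdefault(edge["toEntityId"], []).append(edge["fromEntityId"])

    # Depth-first walk backward through the lineage edges.
    seen, stack, upstream = {dataset_guid}, [dataset_guid], []
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
                upstream.append(parent)

    def describe(guid):
        header = entities.get(guid, {})
        return {
            "guid": guid,
            "name": header.get("displayText"),
            "type": header.get("typeName"),
            "classifications": sorted(header.get("classificationNames", [])),
        }

    return {"dataset": describe(dataset_guid), "upstream": [describe(g) for g in upstream]}


sample_lineage = {
    "guidEntityMap": {
        "g1": {"displayText": "customers", "typeName": "hive_table", "classificationNames": ["PII"]},
        "g2": {"displayText": "build_features", "typeName": "spark_process"},
        "g3": {"displayText": "churn_training_set", "typeName": "hive_table",
               "classificationNames": ["PII", "AI_TRAINING_DATA"]},
    },
    "relations": [
        {"fromEntityId": "g1", "toEntityId": "g2"},
        {"fromEntityId": "g2", "toEntityId": "g3"},
    ],
}
record = training_data_provenance(sample_lineage, "g3")
print(record)
```

Such a record could be stored with the model version so that the upstream sources and their sensitivity are available at audit time.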

Atlas in ODP 1.3.1.0

ODP 1.3.1.0 deploys Atlas with:

  • Hive hook enabled by default for automatic lineage on all HiveServer2 queries
  • Spark Atlas Connector pre-configured for lineage from Spark jobs submitted via YARN
  • Kafka Atlas hook for topic-level lineage
  • Ranger-Atlas tag synchronization enabled out of the box
  • Atlas UI accessible via Knox for external clients
  • Ambari-managed Atlas configuration, startup, and monitoring