AI on Sovereign Data
This feature is part of the upcoming ODP 1.3.2.0 release, currently in qualification. Documentation is provided for preview purposes. Do not use in production until the official release.
The Problem with Cloud AI
Modern AI development has largely converged on a cloud-centric model: you send your data to a cloud provider's infrastructure, use their training APIs or foundation-model fine-tuning services, and retrieve the resulting model. This approach is operationally convenient, but it carries a fundamental implication: your data leaves your infrastructure.
For many organizations, this is an unacceptable risk:
- Healthcare: patient data is subject to strict data residency requirements. Uploading clinical records or medical imaging to a cloud AI service may violate GDPR, HDS (France), or NHS data governance rules (UK), regardless of contractual agreements.
- Finance: trading strategies, customer financial profiles, and risk models are commercially sensitive. Exposing them to cloud infrastructure creates both regulatory and competitive risk.
- Government and defense: national security data, public infrastructure maps, and sensitive administrative data cannot be processed on infrastructure outside national control.
- Legal: attorney-client privilege, sealed documents, and litigation strategy data have strict confidentiality requirements that cloud infrastructure cannot guarantee.
The response to this challenge is not to avoid AI — it is to run AI workloads on sovereign infrastructure, where you control the hardware, the network, and the data.
How ODP Solves It
ODP is designed to run entirely on-premise or in a sovereign private cloud. It does not require any connection to external services at runtime. Your data flows:
Your data sources → NiFi (ingestion) → HDFS/Ozone (storage)
↓
Spark (processing & ML training)
↓
HDFS (model artifacts storage)
↓
Your inference infrastructure
Every step in this pipeline runs on your hardware, in your network, under your control.
SecNumCloud Compatibility
ODP's architecture is compatible with SecNumCloud — the French ANSSI security qualification for cloud services, required for processing sensitive government and regulated data. SecNumCloud requires, among other things:
- Data must remain within French territory
- The cloud provider must be legally and operationally independent from non-EU jurisdiction
- Infrastructure must meet stringent security requirements (encryption at rest and in transit, access control, audit logs, vulnerability management)
ODP running on a SecNumCloud-qualified private cloud or on-premise infrastructure satisfies these requirements. ODP itself provides the encryption (HDFS transparent encryption with KMS, TLS for all communications), access control (Ranger + Kerberos), and audit logging (Ranger audit) layers.
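As a concrete illustration of the encryption-at-rest layer, HDFS transparent encryption is enabled by creating a KMS-backed key and an encryption zone. A minimal sketch; the key name and path are illustrative choices, not ODP defaults:

```shell
# Create a 256-bit key in the Hadoop KMS (key name is illustrative)
hadoop key create ml-data-key -size 256

# Create the target directory and turn it into an HDFS encryption zone;
# all files written under it are transparently encrypted at rest
hdfs dfs -mkdir -p /data/ml_datasets
hdfs crypto -createZone -keyName ml-data-key -path /data/ml_datasets
```

Files in the zone are encrypted and decrypted transparently for authorized clients; the encryption keys themselves never leave the KMS.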
GDPR Compliance for AI
GDPR creates specific challenges for AI systems:
Right to erasure (Art. 17): a data subject requests deletion of their personal data. For a conventional data lake, this means finding and deleting records. For an ML model, the question is harder: if the deleted record was in the training set, does the model "remember" it? This is the "right to be forgotten" problem for AI.
ODP helps address this at the data layer:
- Iceberg's GDPR delete capability: use DELETE statements on Iceberg tables to remove a data subject's records. Ranger audit captures the deletion event.
- Training data version tracking: if you record the Iceberg snapshot ID used for each model training run, you can determine whether a specific record was in the training set for a given model version.
- Data lineage via Atlas: Atlas lineage chains show which datasets contributed to which derived datasets and models.
ODP does not automatically perform model unlearning (removing a data point's influence from an already-trained model). That remains a research-level problem. But ODP ensures that you have the evidence trail to understand model data provenance and make informed decisions about model retirement.
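The evidence trail can be mechanized. A minimal sketch, assuming per-model metadata in the shape recorded at training time (table name plus snapshot timestamp; all names are illustrative), that lists model versions whose training snapshot predates a deletion and may therefore have seen the deleted records:

```python
# Sketch: flag model versions that may have seen a now-deleted record.
# `models` holds training metadata captured at training time (illustrative
# shape); timestamps are ISO-8601 UTC strings, which compare correctly as
# plain strings when formatted identically.
def models_possibly_affected(models, table, deletion_committed_at):
    affected = []
    for m in models:
        for ds in m["training_datasets"]:
            if (ds["table"] == table
                    and ds["snapshot_timestamp"] < deletion_committed_at):
                affected.append(m["model_version"])
    return affected
```

The output is a candidate list for review, not an unlearning mechanism: it tells you which model versions to consider retiring or retraining.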
Right to explanation (Art. 22): users subject to automated decisions must be able to receive an explanation. The Atlas lineage graph showing training data provenance supports the documentation requirement — you can show what data was used to train the model making a decision.
Governing AI Training Data with Ranger
Apache Ranger provides fine-grained access control for all data stored in ODP. For AI training data, Ranger enables:
Dataset-Level Access Control
Grant specific users or service accounts (ML training jobs) read access to training datasets, while blocking access to raw data containing sensitive personal information:
Policy: "ML Training - Customer Features"
Resource: hive / ml_datasets / customer_features (table)
Allow: GROUP ml-engineers (SELECT)
Allow: USER spark-ml-svc (SELECT)
Deny: * (default)
The raw customers table — with PII fields like name, email, and phone number — remains inaccessible to the training job, which only reads the pre-processed feature table.
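Expressed against the Ranger REST API (`POST /service/public/v2/api/policy`), the policy above would look roughly like this; the service name is an assumption for your deployment, and Ranger denies by default, so only the allow items are listed:

```json
{
  "service": "odp_hive",
  "name": "ML Training - Customer Features",
  "resources": {
    "database": {"values": ["ml_datasets"]},
    "table": {"values": ["customer_features"]},
    "column": {"values": ["*"]}
  },
  "policyItems": [
    {"groups": ["ml-engineers"], "accesses": [{"type": "select", "isAllowed": true}]},
    {"users": ["spark-ml-svc"], "accesses": [{"type": "select", "isAllowed": true}]}
  ]
}
```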
Column-Level Masking for Training Data
When training data must retain some structure but cannot expose specific columns, use Ranger's column masking policies:
Policy: "ML Data - Mask PII in training"
Resource: hive / ml_datasets / raw_customer_data / columns: [email, phone, ssn]
Masking: HASH (apply SHA-256 hash)
Apply to: GROUP data-scientists
Data scientists can work with the data for feature engineering, but cannot read raw PII values. The hashed values still carry relational information (same email → same hash) useful for joining datasets.
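The join-preserving property of hash masking is easy to see in isolation. A minimal Python sketch; Ranger's exact hash output encoding may differ from a plain hex digest:

```python
import hashlib

def mask_pii(value: str) -> str:
    # Deterministic SHA-256 digest, analogous to a Ranger HASH masking
    # policy (illustrative; Ranger's output encoding may differ).
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

# Equal inputs produce equal digests, so masked columns still join correctly...
assert mask_pii("alice@example.com") == mask_pii("alice@example.com")
# ...while the raw value is not exposed by the digest itself.
assert mask_pii("alice@example.com") != "alice@example.com"
```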
Tag-Based Policies for Sensitive Training Data
The most powerful approach is to combine Atlas tags with Ranger tag-based policies. Apply a tag such as RESTRICTED_AI_TRAINING in Atlas to any table or column containing data that should only be accessible to approved ML training jobs:
- In Atlas, tag the raw_patient_records table with RESTRICTED_AI_TRAINING.
- In Ranger, create a tag-based policy:
  - Tag: RESTRICTED_AI_TRAINING
  - Allow: SERVICE ml-training-approved-jobs
  - Deny: all others
- The policy applies automatically to any table or column tagged RESTRICTED_AI_TRAINING, regardless of which database or table it lives in.
As your data estate grows, new tables are automatically governed by this policy when the tag is applied — no manual Ranger policy updates required.
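Applying the tag programmatically keeps this workflow automatable. A sketch using the Atlas REST API; the entity GUID and credentials are placeholders:

```shell
# Attach the RESTRICTED_AI_TRAINING classification to a table entity by GUID
curl -s -u admin:admin123 -X POST \
  -H "Content-Type: application/json" \
  "https://atlas.example.com/api/atlas/v2/entity/guid/<table-guid>/classifications" \
  -d '[{"typeName": "RESTRICTED_AI_TRAINING"}]'
```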
Atlas Lineage for Training Data Provenance
Apache Atlas records lineage for data transformations across the ODP stack. For AI workflows, this creates a chain of provenance from raw data to trained model.
What Atlas Captures Automatically
With ODP's Atlas integration, the following lineage is captured automatically:
- Hive queries: CREATE TABLE ... AS SELECT (CTAS), INSERT INTO, and INSERT OVERWRITE operations. Atlas records which source tables contributed to which target table.
- Spark SQL: Spark jobs using the Hive Metastore create Atlas lineage for SQL operations.
- Iceberg operations: Iceberg table creation and data manipulation via Hive and Spark are captured by the Atlas hook.
Lineage Chain for a Typical ML Dataset
[NiFi: raw data ingestion]
↓ (NiFi Atlas hook)
[HDFS: raw_data/customers_raw.parquet] (Atlas entity: hdfs_path)
↓ (Hive CTAS, captured by Hive Atlas hook)
[Hive: raw.customers] (Atlas entity: hive_table)
↓ (Spark SQL, captured by Spark Atlas hook)
[Hive/Iceberg: ml_datasets.customer_features] (Atlas entity: hive_table)
↓ (recorded manually or via custom Atlas entity)
[Model: churn-rf-pipeline-v2] (Atlas entity: ml_model — custom type)
The middle three steps are captured automatically. The final step (linking the model to the training dataset) requires either manual Atlas entity creation or integration with a model registry that publishes Atlas entities.
Querying Lineage via Atlas REST API
# Find all datasets that contributed (directly or indirectly) to a given table
curl -s -u admin:admin123 \
"https://atlas.example.com/api/atlas/v2/lineage/<table-guid>?direction=INPUT&depth=5" \
| jq '.relations[] | {fromEntityId, toEntityId, relationshipType}'
The lineage depth parameter controls how many hops upstream to trace. For complex feature engineering pipelines with many transformation steps, set depth to 10 or more.
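The response's relations array can be walked client-side to collect the full upstream set. A small sketch, assuming the fromEntityId/toEntityId shape shown above:

```python
# Sketch: walk the `relations` array returned by the Atlas lineage API to
# collect every upstream entity GUID for a starting entity. Each relation
# is a dict with "fromEntityId" and "toEntityId" keys.
def upstream_guids(relations, start_guid):
    parents = {}
    for r in relations:
        parents.setdefault(r["toEntityId"], []).append(r["fromEntityId"])
    seen, stack = set(), [start_guid]
    while stack:
        for p in parents.get(stack.pop(), []):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen  # every GUID reachable by walking inputs upstream
```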
Iceberg Time Travel for ML Reproducibility
One of the most challenging aspects of ML development is reproducibility: ensuring that a model can be retrained on exactly the same data that was used initially. Without versioned data this is effectively impossible: tables are mutable, and re-reading the same table a month later can return different results.
Iceberg solves this with time travel: every write to an Iceberg table creates a new snapshot, and old snapshots are preserved (for a configurable retention period). You can always read the table as it existed at any point in time.
Reproducible Training Workflow
Step 1: Before training, record the current snapshot ID of each training dataset:
snapshot_df = spark.sql("""
SELECT snapshot_id, committed_at, operation
FROM hive_catalog.ml_datasets.customer_features.snapshots
ORDER BY committed_at DESC
LIMIT 1
""")
snapshot_id = snapshot_df.first()["snapshot_id"]
committed_at = snapshot_df.first()["committed_at"]
print(f"Training on snapshot: {snapshot_id} (committed at {committed_at})")
# Store this in your experiment tracking system / model metadata
Step 2: Train the model.
Step 3: Save the model to HDFS along with a metadata file:
{
"model_name": "customer-churn-rf",
"model_version": "2025-10-15-001",
"model_path": "hdfs:///models/churn-rf-20251015",
"training_datasets": [
{
"table": "hive_catalog.ml_datasets.customer_features",
"iceberg_snapshot_id": "5931985158436469021",
"snapshot_timestamp": "2025-10-14T22:00:00Z"
}
],
"training_timestamp": "2025-10-15T08:30:00Z",
"spark_version": "3.5.6",
"algorithm": "RandomForestClassifier",
"hyperparameters": {"numTrees": 100, "maxDepth": 10}
}
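A small helper can enforce that this metadata file is complete before it lands next to the model artifact. A sketch; the required-field list is an assumption, not an ODP rule:

```python
import json

# Fields we choose to require before persisting (illustrative, not an ODP rule)
REQUIRED_FIELDS = {"model_name", "model_version", "model_path", "training_datasets"}

def write_model_metadata(meta: dict, path: str) -> None:
    """Validate and persist training metadata like the example above."""
    missing = REQUIRED_FIELDS - meta.keys()
    if missing:
        raise ValueError(f"missing metadata fields: {sorted(missing)}")
    with open(path, "w") as f:
        json.dump(meta, f, indent=2)
```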
Step 4: To retrain on the exact same data, use the stored snapshot ID:
df = spark.read \
.option("snapshot-id", "5931985158436469021") \
.format("iceberg") \
.load("hive_catalog.ml_datasets.customer_features")
This guarantees that the retrained model sees exactly the same records, even if the table has been updated many times since the original training.
Snapshot Retention Policy
Configure Iceberg snapshot retention to balance reproducibility with storage costs:
ALTER TABLE hive_catalog.ml_datasets.customer_features
SET TBLPROPERTIES (
'history.expire.min-snapshots-to-keep' = '10',
'history.expire.max-snapshot-age-ms' = '2592000000' -- 30 days
);
For datasets used in production models, keep snapshots for at least as long as the model is in production. This ensures you can always reproduce the training data if the model needs to be audited or retrained.
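The max-snapshot-age-ms value is easy to get wrong by a factor of 1000; a one-line helper avoids that:

```python
def retention_ms(days: int) -> str:
    # Days -> milliseconds, as a string value suitable for the
    # 'history.expire.max-snapshot-age-ms' table property
    return str(days * 24 * 60 * 60 * 1000)

assert retention_ms(30) == "2592000000"  # matches the 30-day example above
```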
Polaris MCP Server: LLM Access to Catalog Metadata
ODP includes a Polaris REST catalog (deployed on master03:8181 in the reference cluster). The Model Context Protocol (MCP) Server built on top of Polaris allows LLMs and AI agents to query the data catalog without accessing raw data.
This is a key architectural pattern for safe AI integration with ODP data:
LLM / AI Agent
↓ (MCP protocol)
Polaris MCP Server
↓ (Iceberg REST catalog API)
Polaris Catalog (metadata only)
↓ (metadata queries — no data transfer)
Iceberg table metadata (schemas, partitions, statistics)
↑
Actual data (HDFS/Ozone) — NOT accessible to the LLM
An LLM can answer questions like:
- "What tables exist in the ml_datasets schema?"
- "What are the columns and data types of customer_features?"
- "How many partitions does sensor_readings have?"
- "When was transaction_history last updated?"
The LLM never sees the actual row data. Ranger policies on the Polaris catalog control which tables the MCP server can expose metadata for, ensuring that sensitive table schemas are also protected.
This pattern enables natural language data discovery — users can ask an LLM to help them find the right dataset for their ML project — without compromising data sovereignty or Ranger access controls.
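The metadata-only boundary can be illustrated with a toy handler: every question above resolves against catalog metadata, and there is simply no code path that touches row data. The catalog dict, table names, and columns below are invented for the sketch and stand in for responses from the Polaris REST catalog:

```python
# Toy stand-in for catalog metadata (illustrative names and columns).
# An MCP tool backed by Polaris would serve the same shapes; row data is
# never loaded anywhere in this module.
CATALOG = {
    "ml_datasets": {
        "customer_features": {"columns": {"customer_id": "string", "tenure_days": "int"}},
        "sensor_readings": {"columns": {"ts": "timestamp", "value": "double"}},
    }
}

def list_tables(schema: str):
    """Answer 'what tables exist in this schema?' from metadata alone."""
    return sorted(CATALOG.get(schema, {}))

def describe_table(schema: str, table: str):
    """Answer 'what are the columns and types?' from metadata alone."""
    return CATALOG.get(schema, {}).get(table, {}).get("columns", {})
```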