Version: 1.3.1.0

Apache Kafka Overview

Apache Kafka is a distributed event streaming platform designed for high-throughput, fault-tolerant, real-time data pipelines. ODP 1.3.1.0 ships Kafka 3.8.1, which includes the fully KRaft-based architecture that eliminates the ZooKeeper dependency for Kafka's own metadata management.

What is Kafka?

Kafka models data as a continuous, ordered, durable stream of events (also called records or messages). Unlike traditional message queues where messages are consumed and discarded, Kafka retains events for a configurable period (by default 7 days), allowing multiple independent consumers to read from the same stream at different positions and at different times.
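
The retained-log model can be illustrated with a minimal sketch (this is not Kafka's implementation, just the idea): an append-only log assigns each event an offset, and independent readers track their own positions in the same stream.

```python
# Minimal sketch of a retained, append-only log with independent readers.
# Names and structure are illustrative, not Kafka's actual internals.

class AppendOnlyLog:
    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1  # offset assigned to the new event

    def read(self, offset, max_events=10):
        # Events are retained, so any reader can start from any offset.
        return self._events[offset:offset + max_events]

log = AppendOnlyLog()
for e in ["login", "click", "purchase"]:
    log.append(e)

# Two consumers at different positions see different slices of the same stream.
replayed = log.read(0)   # full history: ['login', 'click', 'purchase']
recent = log.read(2)     # only the latest event: ['purchase']
```

Because reads do not remove events, the "analytics" reader replaying from offset 0 and the "alerting" reader starting at offset 2 never interfere with each other.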

Kafka serves as the central nervous system for real-time data architectures: producers write events as they happen, and any number of consumers process them — in real time or by replaying history.

Core Concepts

Topics and Partitions

Data in Kafka is organized into topics — named, append-only logs. Each topic is divided into partitions that are distributed across the broker cluster. Partitioning provides:

  • Parallelism: Multiple consumers can read from different partitions of the same topic simultaneously.
  • Ordering: Events within a single partition are strictly ordered; events across partitions are not.
  • Scalability: Topics can have hundreds of partitions distributed across many brokers, allowing throughput to scale horizontally.

Each partition is replicated across multiple brokers for fault tolerance. A designated leader handles all reads and writes for a partition; followers replicate the data, and one of them is elected leader if the current leader fails.
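
How a record's key maps to a partition can be sketched as follows. Kafka's default partitioner hashes the record key with murmur2 modulo the partition count; the sketch below substitutes an MD5-based hash purely for illustration, so the partition numbers it produces will differ from Kafka's, but the key property is the same.

```python
import hashlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Simplified stand-in for Kafka's default partitioner (which uses
    murmur2). Same idea: hash the key, take it modulo the partition count."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Records with the same key always land in the same partition,
# which is what gives Kafka its per-key ordering guarantee.
p1 = partition_for(b"user-42", 6)
p2 = partition_for(b"user-42", 6)
assert p1 == p2
```

This is why events for a single key (say, one user or one device) stay strictly ordered even though the topic as a whole is processed in parallel.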

Consumer Groups

A consumer group is a set of consumers that cooperatively process a topic. Each partition is assigned to exactly one consumer within the group, enabling parallel processing while guaranteeing that each event is processed by exactly one consumer in the group. Different consumer groups independently track their own read position (offset) in each partition, so the same topic can feed multiple downstream systems without interference.
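
The one-partition-per-consumer assignment can be sketched in a few lines. Kafka's real assignors (range, round-robin, sticky, cooperative) are more sophisticated, but the invariant is the same: every partition maps to exactly one consumer in the group.

```python
def assign_partitions(partitions, consumers):
    """Round-robin sketch of consumer-group partition assignment.
    Illustrative only; Kafka's built-in assignors handle rebalances,
    stickiness, and subscription differences."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

groups = assign_partitions(range(6), ["c0", "c1", "c2"])
# Each of the 6 partitions appears exactly once across the group:
# {'c0': [0, 3], 'c1': [1, 4], 'c2': [2, 5]}
```

Note the practical consequence: a group with more consumers than partitions leaves the extras idle, so partition count caps a group's parallelism.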

Kafka 3.8.1 in ODP

ODP ships Kafka 3.8.1 deployed and managed by Ambari. Key features:

  • KRaft mode: Kafka 3.8.1 runs in KRaft mode, using an internal Raft-based quorum controller for metadata management. This simplifies operations by removing the ZooKeeper dependency from the Kafka control plane (ZooKeeper is still used by other ODP services).
  • Ambari integration: Broker configuration, topic management, and broker-level metrics are surfaced in the Ambari UI. Ambari handles rolling restarts and configuration changes.
  • Multi-broker deployment: ODP reference architectures deploy Kafka brokers on dedicated worker nodes for isolation from HDFS DataNode I/O.

Use Cases

Real-Time Data Pipelines

Kafka decouples data producers from consumers, enabling event-driven architectures where upstream systems (databases, applications, IoT devices, APIs) publish events without needing to know which downstream systems will consume them. NiFi in ODP can ingest data from external sources and publish to Kafka topics; Spark Streaming and Flink then consume these topics for processing.

Event Sourcing

Because Kafka retains the full history of events, it is a natural fit for event sourcing patterns where the current state of a system is derived by replaying past events. This enables audit trails, temporal queries, and recovery from downstream processing errors by replaying from a known offset.
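
The "state as a fold over events" idea can be shown concretely. The account example below is hypothetical; the point is that current state is a pure function of the event history, so replaying a prefix of the log reproduces any past state.

```python
from functools import reduce

def apply_event(balance, event):
    """Fold one event into the current state (a toy account balance)."""
    kind, amount = event
    return balance + amount if kind == "deposit" else balance - amount

events = [("deposit", 100), ("withdraw", 30), ("deposit", 50)]

# Current state is derived entirely by replaying the event log...
balance = reduce(apply_event, events, 0)               # 120

# ...and replaying up to an earlier offset reproduces an earlier state,
# which is what makes audit trails and reprocessing possible.
balance_after_two = reduce(apply_event, events[:2], 0)  # 70
```

A downstream bug fix becomes a replay from a known-good offset rather than a data-repair exercise.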

Change Data Capture (CDC)

Kafka Connect (included with Kafka) supports CDC connectors that capture row-level changes from relational databases and publish them as Kafka events. These can be consumed by Spark or Flink to maintain near-real-time replicas in HDFS or Iceberg tables.
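
A CDC connector is registered with Kafka Connect as a JSON configuration. The fragment below is an illustrative sketch in the style of a Debezium MySQL source connector; the hostnames, credentials, and table names are placeholders, and exact property names vary with the connector and its version.

```json
{
  "name": "orders-cdc",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "db01.example.com",
    "database.port": "3306",
    "database.user": "cdc_user",
    "database.password": "********",
    "topic.prefix": "mysql-prod",
    "table.include.list": "shop.orders"
  }
}
```

Posted to the Kafka Connect REST API, a configuration like this produces a stream of row-level change events on topics prefixed with `mysql-prod`.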

Spark Structured Streaming

Spark treats a Kafka topic as an unbounded source table. A streaming DataFrame reads events continuously:

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker01:9092,broker02:9092")
      .option("subscribe", "events")
      .load())

Structured Streaming manages Kafka offsets through its checkpoint mechanism, supports end-to-end exactly-once semantics with fault-tolerant sinks (such as Iceberg tables or HDFS files), and handles late-arriving data via event-time watermarks.

Apache Flink

Flink's native Kafka connector supports both source and sink roles with exactly-once semantics via Kafka transactions. Flink's event-time processing model, combined with Kafka's offset tracking, enables stateful stream processing with millisecond latency, well suited for complex event processing, fraud detection, and real-time aggregations.

Ranger Plugin for Kafka Authorization

In ODP, the Apache Ranger Kafka plugin enforces access control on Kafka resources. Ranger policies can:

  • Grant or deny Publish (produce) and Consume permissions per topic, per user or group
  • Restrict Create topic and Delete topic operations
  • Log all access attempts to the Ranger audit trail for compliance

The Ranger Kafka plugin integrates with Kafka's built-in Authorizer interface and is configured automatically by Ambari when Ranger is enabled on the cluster.

Kafka Metrics in Ambari

Ambari collects and displays key Kafka metrics out of the box:

  • Broker metrics: bytes in/out per second, messages in per second, under-replicated partitions
  • Consumer lag: the difference between the latest offset and the consumer group's current offset — a critical health indicator for streaming pipelines
  • Topic metrics: per-topic throughput and partition leader distribution

These metrics feed into Ambari's alerting system, allowing operators to configure alerts for consumer lag thresholds, broker unavailability, or under-replication events.
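
The lag computation itself is simple arithmetic, sketched below with hypothetical offset values: per partition, lag is the broker's latest (log-end) offset minus the consumer group's committed offset, and total lag is the number typically alerted on.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition consumer lag: latest broker offset minus the
    group's committed offset. Values here are illustrative."""
    return {p: log_end_offsets[p] - committed_offsets.get(p, 0)
            for p in log_end_offsets}

log_end = {0: 1500, 1: 1490, 2: 1510}    # broker-side latest offsets
committed = {0: 1500, 1: 1450, 2: 1200}  # group's committed offsets

lag = consumer_lag(log_end, committed)   # {0: 0, 1: 40, 2: 310}
total_lag = sum(lag.values())            # 350
```

A steadily growing total indicates consumers are falling behind producers, which is why lag thresholds are the most common Kafka alert configured in Ambari.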