Apache NiFi Overview
Apache NiFi is a dataflow automation platform that enables the visual design, deployment, and management of data pipelines. Its browser-based interface allows engineers to build complex ingestion, routing, and transformation workflows by connecting processors on a canvas, without writing code. ODP includes NiFi deployed in a clustered, highly available configuration managed by Ambari.
What is NiFi?
NiFi was originally developed at the NSA and open-sourced under the Apache License in 2014. Its design philosophy centers on:
- Flow-based programming: Data pipelines are graphs of processors connected by queues. Data flows from one processor to the next as FlowFiles — lightweight objects that carry a payload (bytes) and a set of attributes (key-value metadata).
- Visual pipeline builder: The NiFi UI provides a drag-and-drop canvas for building pipelines. Complex multi-branch flows, conditional routing, and error handling are expressed visually.
- Back-pressure and flow control: NiFi monitors queue depths and automatically applies back-pressure when downstream processors cannot keep up, preventing memory exhaustion and data loss without manual rate limiting.
- Provenance tracking: NiFi records a full provenance event for every FlowFile at every processor, providing a complete audit trail of where data came from, what transformations it underwent, and where it was delivered.
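The FlowFile and provenance concepts above can be sketched in a few lines of Python. This is a conceptual model only, not NiFi's actual Java API; the class and field names are illustrative:

```python
from dataclasses import dataclass, field
from typing import Dict, List
import time
import uuid


@dataclass
class FlowFile:
    """Toy model of a NiFi FlowFile: a byte payload plus key-value attributes."""
    content: bytes
    attributes: Dict[str, str] = field(default_factory=dict)
    uuid: str = field(default_factory=lambda: str(uuid.uuid4()))


@dataclass
class ProvenanceEvent:
    """Toy provenance record: what happened to which FlowFile, where, and when."""
    event_type: str      # e.g. RECEIVE, ROUTE, SEND
    flowfile_uuid: str
    component: str       # name of the processor that emitted the event
    timestamp: float = field(default_factory=time.time)


events: List[ProvenanceEvent] = []

ff = FlowFile(b'{"sensor": 7, "temp": 21.4}', {"source": "mqtt", "priority": "high"})
events.append(ProvenanceEvent("RECEIVE", ff.uuid, "GetMQTT"))
events.append(ProvenanceEvent("ROUTE", ff.uuid, "RouteOnAttribute"))

# Filtering events by FlowFile UUID reconstructs that FlowFile's lineage.
lineage = [e.component for e in events if e.flowfile_uuid == ff.uuid]
print(lineage)  # ['GetMQTT', 'RouteOnAttribute']
```

In real NiFi the provenance repository is persisted and indexed, so lineage queries like the one above can be run from the UI long after the FlowFile has left the system.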
300+ Processors
NiFi ships with over 300 built-in processors covering virtually every integration pattern:
| Category | Example processors |
|---|---|
| Ingestion | GetHTTP, ListenSyslog, GetMQTT, QueryDatabaseTable, ConsumeKafka |
| Routing | RouteOnAttribute, RouteOnContent, DistributeLoad |
| Transformation | JoltTransformJSON, TransformXml, ConvertRecord, ExecuteScript |
| Delivery | PutHDFS, PutHiveStreaming, PublishKafka, PutS3Object, PutDatabaseRecord |
| Enrichment | LookupRecord, LookupAttribute, GeoEnrichIP |
| Compression/Encoding | CompressContent, Base64EncodeContent, EncryptContent |
Custom processors can be developed in Java and deployed as NAR (NiFi Archive) files, extending the platform to proprietary or specialized protocols.
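Real custom processors are Java classes extending `org.apache.nifi.processor.AbstractProcessor` and packaged as NARs; the Python sketch below only mirrors the contract conceptually (an `onTrigger`-style method that transfers each FlowFile to a named relationship). The `UppercaseContent` processor is hypothetical:

```python
# Conceptual mirror of NiFi's processor contract. The real API is Java
# (AbstractProcessor + NAR packaging); names here are illustrative only.
class Processor:
    RELATIONSHIPS = ("success", "failure")

    def on_trigger(self, flowfile):
        """Return (relationship, flowfile); the framework routes accordingly."""
        raise NotImplementedError


class UppercaseContent(Processor):
    """Hypothetical custom processor: uppercases the payload bytes."""

    def on_trigger(self, flowfile):
        try:
            payload, attrs = flowfile
            return "success", (payload.upper(), attrs)
        except Exception:
            # Anything unexpected is routed to the failure relationship,
            # mirroring NiFi's error-handling convention.
            return "failure", flowfile


proc = UppercaseContent()
rel, ff = proc.on_trigger((b"hello nifi", {"mime.type": "text/plain"}))
print(rel, ff[0])  # success b'HELLO NIFI'
```

The key design point carried over from NiFi: a processor never decides *where* data goes next, only *which relationship* it belongs to; the flow graph wires relationships to downstream queues.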
Back-Pressure and Flow Control
NiFi's back-pressure mechanism prevents fast producers from overwhelming slow consumers. Each connection between processors has configurable thresholds:
- Object threshold: Maximum number of FlowFiles queued before back-pressure is applied (default: 10,000)
- Data size threshold: Maximum total bytes queued (default: 1 GB)
When a queue reaches its threshold, the upstream processor stops scheduling new tasks until the queue drains below the threshold. This propagates upstream through the flow graph, naturally throttling ingestion at the source. No manual rate limiting or sleep timers are needed.
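The threshold mechanics can be simulated with a small queue model. The class below is a sketch of the behavior described above, not NiFi's implementation; defaults match the documented thresholds:

```python
from collections import deque


class Connection:
    """Toy queue between two processors with NiFi-style back-pressure thresholds."""

    def __init__(self, max_objects=10_000, max_bytes=1 << 30):  # 10,000 objects / 1 GB
        self.queue = deque()
        self.max_objects = max_objects
        self.max_bytes = max_bytes
        self.queued_bytes = 0

    def backpressure(self):
        # Back-pressure applies when EITHER threshold is reached.
        return len(self.queue) >= self.max_objects or self.queued_bytes >= self.max_bytes

    def offer(self, payload: bytes) -> bool:
        if self.backpressure():
            return False  # upstream processor is not scheduled this round
        self.queue.append(payload)
        self.queued_bytes += len(payload)
        return True

    def poll(self) -> bytes:
        payload = self.queue.popleft()
        self.queued_bytes -= len(payload)
        return payload


# Tiny thresholds so the effect is visible immediately.
conn = Connection(max_objects=3, max_bytes=1_000)
accepted = [conn.offer(b"x" * 10) for _ in range(5)]
print(accepted)        # [True, True, True, False, False]

conn.poll()            # the consumer drains one FlowFile...
ok = conn.offer(b"y")  # ...and the producer is scheduled again
print(ok)              # True
```

Because a full queue simply stops the upstream processor from being scheduled, the effect composes through the whole graph: a slow sink eventually pauses the source.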
NiFi Cluster and High Availability in ODP
ODP deploys NiFi in cluster mode for production environments. In a NiFi cluster:
- All nodes run the same flow definition, synchronized via Apache ZooKeeper.
- A Primary Node (elected via ZooKeeper) handles source processors that should run on exactly one node (e.g., QueryDatabaseTable, to avoid duplicate reads from a database).
- A Cluster Coordinator (also elected via ZooKeeper) manages node membership and flow synchronization.
- All other processors run on every node simultaneously, providing horizontal throughput scaling.
Ambari manages NiFi cluster configuration, node health monitoring, rolling restarts, and flow deployment. If a NiFi node fails, ZooKeeper triggers leader re-election and the remaining nodes continue processing without manual intervention.
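The "primary node only" scheduling rule can be illustrated with a toy election. Real NiFi delegates leader election to ZooKeeper; the lowest-node-id rule and node names below are stand-ins for illustration only:

```python
# Toy stand-in for ZooKeeper leader election: the surviving node with the
# lowest identifier becomes Primary Node, so processors marked "primary
# node only" run on exactly one node across the cluster.
def elect_primary(live_nodes):
    return min(live_nodes)


def should_run(primary_only: bool, node: str, live_nodes) -> bool:
    """Decide whether this node runs a given processor right now."""
    if primary_only:
        return node == elect_primary(live_nodes)
    return True  # normal processors run on every node for throughput


nodes = {"nifi-1", "nifi-2", "nifi-3"}
runs = [n for n in sorted(nodes) if should_run(True, n, nodes)]
print(runs)  # ['nifi-1'] -- exactly one node polls the database

nodes.discard("nifi-1")  # node failure triggers re-election
runs = [n for n in sorted(nodes) if should_run(True, n, nodes)]
print(runs)  # ['nifi-2'] -- processing continues without intervention
```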
Kerberos Integration
In ODP, NiFi operates fully within the Kerberos security perimeter:
- NiFi nodes authenticate to HDFS, Hive, HBase, Kafka, and ZooKeeper using Kerberos keytabs provisioned by Ambari.
- The NiFi UI is protected by certificate-based client authentication (two-way TLS) when operating without Knox, or by Knox SSO when accessed through the Knox gateway.
- Kerberos ticket renewal (periodic re-login from the provisioned keytabs) is handled automatically by NiFi, so long-running flows do not lose authentication when tickets expire.
Ranger Integration
Authorization
The Ranger NiFi plugin enforces access control on NiFi resources. Policies can restrict:
- Which users or groups can access the NiFi UI
- Which users can modify specific process groups or processors
- Which users can view provenance data (which may contain sensitive payload details)
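The policy model can be pictured as resource/action/group allow-lists. The snippet below is illustrative only: real policies are defined in the Ranger Admin UI and evaluated by the NiFi plugin, and the resource paths and group names here are hypothetical:

```python
# Illustrative allow-list check in the spirit of Ranger resource policies.
# Resource paths and group names are hypothetical, not Ranger's actual model.
POLICIES = [
    {"resource": "/flow", "action": "read", "groups": {"analysts", "admins"}},
    {"resource": "/process-groups/payments", "action": "write", "groups": {"admins"}},
    {"resource": "/provenance", "action": "read", "groups": {"admins"}},
]


def is_allowed(user_groups: set, resource: str, action: str) -> bool:
    """Grant access if any policy covers this resource/action for the user's groups."""
    return any(
        p["resource"] == resource
        and p["action"] == action
        and user_groups & p["groups"]
        for p in POLICIES
    )


print(is_allowed({"analysts"}, "/flow", "read"))                      # True
print(is_allowed({"analysts"}, "/process-groups/payments", "write"))  # False
print(is_allowed({"admins"}, "/provenance", "read"))                  # True
```

Note the last case: provenance access is its own resource, so payload-level audit data can be locked down independently of flow-editing rights.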
Ranger Audit for NiFi
Starting in ODP 1.3.2.0, the Ranger NiFi plugin includes audit logging: every access to a NiFi resource (process group access, processor configuration changes, provenance queries) is recorded in the Ranger audit trail. This provides compliance-grade auditability for NiFi pipelines, answering questions such as:
- Who modified a pipeline that processed payment data?
- Which users have accessed provenance records for PII-carrying flows?
The audit events are written to the same Ranger audit infrastructure (HDFS or Solr) used by all other ODP services, enabling unified audit reporting.
Use Cases
IoT and Sensor Data Ingestion
NiFi's ListenSyslog, GetMQTT, and ListenUDP processors can receive high-frequency sensor data from IoT devices. NiFi normalizes, filters, and routes the data — sending high-priority alerts to Kafka for real-time processing and raw records to HDFS for archival — all in a single visual flow.
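At the socket level, what a processor like ListenUDP does on receive can be sketched in stdlib Python: accept a datagram and wrap it as payload plus attributes before it enters the flow. The sensor payload and attribute names are hypothetical, and the sender is simulated in-process so the example is self-contained:

```python
import socket


def receive_one(port=0):
    """Receive one UDP datagram and wrap it FlowFile-style (payload, attributes)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", port))       # port=0 lets the OS pick a free port
    bound_port = server.getsockname()[1]

    # Simulate an IoT device sending one reading (hypothetical payload).
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender.sendto(b'{"sensor": 12, "temp": 38.9}', ("127.0.0.1", bound_port))

    payload, addr = server.recvfrom(65535)
    attributes = {"udp.sender": addr[0], "udp.port": str(bound_port)}

    sender.close()
    server.close()
    return payload, attributes


payload, attrs = receive_one()
print(payload)  # b'{"sensor": 12, "temp": 38.9}'
```

From here, a routing processor would inspect the attributes or payload and fan the record out to Kafka, HDFS, or both.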
API and Web Scraping
The InvokeHTTP and GetHTTP processors support REST API ingestion with configurable authentication (OAuth 2.0, Basic, API key), rate limiting, and retry logic. Responses in JSON or XML are parsed and routed by content or attribute values.
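The retry behavior such a processor provides can be sketched with stdlib Python. The `fetch_with_retry` helper and the endpoint data are illustrative, not NiFi code; a throwaway local HTTP server stands in for the remote API so the example runs on its own:

```python
import json
import threading
import time
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def fetch_with_retry(url, retries=3, backoff=0.5):
    """Poll a REST endpoint with simple exponential-backoff retry logic."""
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                return json.loads(resp.read())
        except (urllib.error.URLError, TimeoutError):
            if attempt == retries - 1:
                raise
            time.sleep(backoff * 2 ** attempt)


# --- local stand-in for the remote API (hypothetical response body) ---
class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = json.dumps({"status": "ok", "items": [1, 2, 3]}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass


server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

data = fetch_with_retry(f"http://127.0.0.1:{server.server_port}/api")
server.shutdown()
print(data["items"])  # [1, 2, 3]
```

In a real flow, the parsed response would become a FlowFile and downstream processors would route it by content or attribute values, as described above.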
Database Change Data Capture
The QueryDatabaseTable processor tracks the maximum value of an incrementing column (e.g., a timestamp or sequence ID) across executions, efficiently fetching only new or changed rows from relational databases. Combined with PutHDFS or PutHiveStreaming, this builds a near-real-time replica of a relational table in the Hadoop ecosystem.
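The incremental-fetch strategy is straightforward to sketch: persist the maximum value seen for the incrementing column and query only past it. The snippet below uses an in-memory SQLite table as a stand-in for the source database; NiFi itself persists this watermark in processor state rather than a Python dict:

```python
import sqlite3

# In-memory table standing in for the source relational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 3.0)])

state = {"max_id": 0}  # NiFi keeps this watermark in persisted processor state


def fetch_new_rows():
    """Fetch only rows beyond the stored watermark, then advance it."""
    rows = conn.execute(
        "SELECT id, amount FROM orders WHERE id > ? ORDER BY id",
        (state["max_id"],),
    ).fetchall()
    if rows:
        state["max_id"] = rows[-1][0]
    return rows


first = fetch_new_rows()
print(first)   # [(1, 9.5), (2, 3.0)] -- first run fetches everything

conn.execute("INSERT INTO orders VALUES (3, 7.25)")
second = fetch_new_rows()
print(second)  # [(3, 7.25)] -- later runs see only new rows
```

Scheduled at a short interval, each run emits only the delta, which is what makes the near-real-time replica pattern cheap on the source database.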
Routing and Transformation
NiFi handles complex multi-destination routing: a single incoming flow can be split, enriched with lookup data, transformed into different formats (JSON to ORC, XML to Parquet), and delivered simultaneously to HDFS, a Kafka topic, and a REST API — with full provenance tracking and guaranteed delivery.
Delivery to HDFS, Kafka, and Hive
Core delivery targets in ODP:
- HDFS: PutHDFS writes FlowFiles to HDFS using the Kerberos-authenticated WebHDFS or native HDFS client, with configurable directory structures based on FlowFile attributes (e.g., partition by date).
- Kafka: PublishKafka delivers FlowFiles as Kafka records, with configurable key extraction, partition selection, and exactly-once delivery using Kafka transactions.
- Hive: PutHiveStreaming uses the Hive Streaming API to write directly into Hive ACID tables, enabling low-latency ingestion into queryable Hive tables without intermediate files.