Apache Hadoop Overview
Apache Hadoop is the foundational open source framework for distributed storage and large-scale data processing. It underpins the entire ODP stack, providing the infrastructure on which every other component runs.
What is Hadoop?
Hadoop is a framework that allows distributed processing of large datasets across clusters of commodity hardware. It achieves reliability through software-level fault tolerance: data is replicated across multiple machines, and failed tasks are automatically retried on healthy nodes. Three core subsystems define Hadoop:
- HDFS (Hadoop Distributed File System) — the distributed storage layer
- YARN (Yet Another Resource Negotiator) — the cluster resource management layer
- MapReduce — the original batch processing engine (now largely superseded by Tez, Spark, and Hive, but still supported)
HDFS Architecture
HDFS splits files into large blocks (128 MB by default; the final block of a file may be smaller) and distributes replicated copies — three per block by default — across DataNodes. Metadata (the namespace, file-to-block mappings, permissions) is managed by the NameNode.
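The arithmetic behind this layout is worth making concrete. The following sketch (illustrative Python, not the HDFS API; `block_layout` and the constants are assumptions based on the defaults mentioned above) shows how a file's size translates into block count and raw storage footprint:

```python
# Sketch: how a file maps onto HDFS blocks. Illustrative only — this is
# arithmetic, not a call into Hadoop. Assumes the default 128 MB block
# size and replication factor 3.
import math

BLOCK_SIZE = 128 * 1024 * 1024   # default HDFS block size (dfs.blocksize)
REPLICATION = 3                  # default replication factor (dfs.replication)

def block_layout(file_bytes: int) -> tuple[int, int]:
    """Return (block_count, total_stored_bytes) for a file of this size."""
    blocks = math.ceil(file_bytes / BLOCK_SIZE)
    # Every block, including a possibly smaller final block, is replicated,
    # so raw storage is simply file size times the replication factor.
    stored = file_bytes * REPLICATION
    return blocks, stored

# A 1 GB file occupies 8 blocks and 3 GB of raw DataNode storage.
blocks, stored = block_layout(1024 * 1024 * 1024)
print(blocks, stored)
```

This is why capacity planning for HDFS clusters typically multiplies expected data volume by the replication factor before sizing disks.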
NameNode and DataNodes
The NameNode holds the entire filesystem namespace in memory, making metadata lookups fast. DataNodes store the actual data blocks and report their block inventory to the NameNode via heartbeats. When a client reads or writes a file, it first contacts the NameNode for block locations, then streams data directly to/from DataNodes — the NameNode is never in the data path.
High Availability with ZooKeeper
A single NameNode is a potential single point of failure. HDFS HA eliminates this by running an Active NameNode and a Standby NameNode in tandem. They share a set of JournalNodes that form a quorum-based edit log. ZooKeeper and the ZKFC (ZooKeeper Failover Controller) monitor liveness and trigger automatic failover to the Standby within seconds if the Active NameNode fails.
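In configuration terms, an HA deployment is expressed in hdfs-site.xml roughly as follows (a minimal sketch: the nameservice ID `ns1`, NameNode IDs `nn1`/`nn2`, and JournalNode hostnames are placeholders for your own cluster, and core-site.xml must additionally point `ha.zookeeper.quorum` at your ZooKeeper ensemble):

```xml
<!-- hdfs-site.xml fragment (placeholders: ns1, nn1/nn2, jn1..jn3) -->
<property>
  <name>dfs.nameservices</name>
  <value>ns1</value>
</property>
<property>
  <name>dfs.ha.namenodes.ns1</name>
  <value>nn1,nn2</value>
</property>
<property>
  <!-- quorum-based shared edit log hosted on the JournalNodes -->
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1:8485;jn2:8485;jn3:8485/ns1</value>
</property>
<property>
  <!-- let the ZKFC trigger failover without operator intervention -->
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
```

In ODP, the Ambari HA wizard generates this configuration, so the fragment is shown here only to make the moving parts — nameservice, NameNode pair, JournalNode quorum, automatic failover — visible.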
HDFS Federation
HDFS Federation allows multiple independent NameNodes, each managing a separate namespace volume, to share a common pool of DataNodes. This enables horizontal scaling of the namespace (beyond the memory limits of a single NameNode) and provides namespace-level isolation between teams or workloads.
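A federated deployment is declared by listing several nameservices in hdfs-site.xml (a sketch with assumed names: `ns-analytics`, `ns-logs`, and the hostnames are placeholders). DataNodes register with every NameNode listed and serve blocks for all namespaces:

```xml
<!-- hdfs-site.xml fragment: two independent namespaces sharing one
     DataNode pool (all names below are placeholders) -->
<property>
  <name>dfs.nameservices</name>
  <value>ns-analytics,ns-logs</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns-analytics</name>
  <value>nn-a.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns-logs</name>
  <value>nn-b.example.com:8020</value>
</property>
```

Each NameNode manages only its own namespace, so the in-memory metadata limit described earlier applies per namespace rather than per cluster.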
YARN Architecture
YARN decouples resource management from data processing, allowing many different execution engines (Hive/Tez, Spark, Flink, MapReduce) to share a single cluster.
ResourceManager and NodeManagers
The ResourceManager is the global arbiter of cluster resources. It tracks available CPU and memory across all nodes and allocates containers (resource slices) to applications. Each cluster node runs a NodeManager that manages containers on that node, monitors their resource usage, and reports to the ResourceManager.
Application Lifecycle
When an application is submitted, YARN launches an ApplicationMaster (AM) in a container. The AM negotiates with the ResourceManager for additional containers and orchestrates the actual computation. This architecture keeps the ResourceManager lightweight and allows application-specific scheduling logic to live in the AM.
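The division of labor can be illustrated with a toy simulation (the classes below are illustrative stand-ins, not the YARN API): the ResourceManager only grants containers from a global pool, while the ApplicationMaster decides how many it needs and keeps negotiating until its demand is met.

```python
# Toy walk-through of the YARN application lifecycle. Illustrative
# classes only — not real Hadoop/YARN APIs.

class ResourceManager:
    """Global arbiter: hands out containers from a fixed pool."""
    def __init__(self, total_containers):
        self.free = total_containers

    def allocate(self, n):
        # Grant as many containers as are currently available.
        granted = min(n, self.free)
        self.free -= granted
        return granted

class ApplicationMaster:
    """Per-application orchestrator running inside its own container."""
    def __init__(self, rm):
        self.rm = rm
        self.containers = 0

    def run(self, needed):
        # Application-specific scheduling logic lives here, not in the RM:
        # keep asking until the workload's container demand is satisfied.
        while self.containers < needed:
            granted = self.rm.allocate(needed - self.containers)
            if granted == 0:
                break  # cluster full; a real AM would wait and retry
            self.containers += granted
        return self.containers

rm = ResourceManager(total_containers=10)
am_container = rm.allocate(1)   # submission: RM launches the AM itself
am = ApplicationMaster(rm)
print(am.run(needed=4))         # AM then negotiates worker containers
```

Because the per-application loop runs in the AM, the ResourceManager stays a thin allocator, which is exactly what lets one RM serve many concurrent engines.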
Capacity Scheduler
ODP ships with the Capacity Scheduler, which partitions cluster resources into a hierarchy of queues. Each queue is guaranteed a minimum percentage of capacity and can expand up to a configurable maximum when other queues are idle. This allows multiple teams to share the cluster with predictable resource guarantees and prevents any single workload from monopolizing resources.
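A minimal two-queue layout in capacity-scheduler.xml looks roughly like this (queue names `etl` and `adhoc` and the percentages are illustrative, not ODP defaults):

```xml
<!-- capacity-scheduler.xml fragment: guaranteed shares plus an
     elastic ceiling (all queue names and values are placeholders) -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>etl,adhoc</value>
</property>
<property>
  <!-- guaranteed minimum: 70% of cluster resources -->
  <name>yarn.scheduler.capacity.root.etl.capacity</name>
  <value>70</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.adhoc.capacity</name>
  <value>30</value>
</property>
<property>
  <!-- adhoc may borrow idle capacity, but never beyond 60% -->
  <name>yarn.scheduler.capacity.root.adhoc.maximum-capacity</name>
  <value>60</value>
</property>
```

The `capacity` values of sibling queues must sum to 100; `maximum-capacity` is what gives the scheduler its elasticity when other queues are idle.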
Why Hadoop Remains Relevant
Despite the rise of cloud-native data platforms, Hadoop remains the reference architecture for on-premises petabyte-scale data processing for several reasons:
- Scale: HDFS scales to hundreds of petabytes across thousands of nodes in a single cluster.
- Commodity hardware: No specialized storage hardware is required. Hadoop is designed to run on standard x86 servers.
- Mature ecosystem: Over 15 years of production hardening, a rich plugin ecosystem, and broad enterprise adoption mean battle-tested solutions for most data engineering challenges.
- Data locality: By co-locating compute and storage, Hadoop minimizes network I/O — YARN preferentially schedules tasks on nodes that hold the relevant HDFS blocks.
- Open standards: HDFS, YARN, and the surrounding ecosystem are fully open source (Apache License 2.0), avoiding vendor lock-in.
How ODP Extends Hadoop
ODP (Open Data Platform) packages Hadoop 3.4.1 with a set of enhancements designed for production deployments:
- Apache Ambari management: ODP uses a hardened build of Ambari to provision, configure, monitor, and upgrade Hadoop clusters through a unified UI and REST API — no manual configuration file editing required.
- Latest upstream versions: ODP ships the most recent stable releases of every component, including Hadoop 3.4.1, with upstream patches backported for compatibility.
- Security by default: Kerberos authentication, Ranger authorization, and TLS encryption are first-class citizens in ODP. The Ambari Kerberos wizard automates the full KDC integration for all services, including HDFS, YARN, and MapReduce.
- High availability by default: NameNode HA, ResourceManager HA, and JournalNode quorums are configured out of the box in ODP reference architectures.
- Extended storage formats: ODP integrates Apache Iceberg 1.6.1 on top of HDFS, enabling ACID transactions, schema evolution, and time travel without replacing the underlying distributed storage layer.