Worker Node Sizing
Worker nodes in an ODP cluster provide the storage and compute capacity for data workloads. Each worker node typically runs HDFS DataNode, YARN NodeManager, and optionally co-located services such as HBase RegionServer, Impala daemon, Kafka broker, or Kudu tablet server.
This page covers HDFS storage planning, YARN memory and vcore allocation, disk layout best practices, and Kafka co-location sizing.
HDFS Storage Planning
Replication Factor
HDFS stores each data block across multiple nodes for fault tolerance. The default replication factor is 3, meaning each block consumes 3x the logical data size in raw disk space.
Raw storage formula:
Required raw storage per node = (total logical data / number of data nodes) × replication factor × 1.25
The 1.25 factor accounts for HDFS overhead (block metadata, intermediate files, temporary job data, and free space to prevent DataNode disk full conditions).
Example:
- 100 TB logical data, 10 worker nodes, replication factor 3
- Per-node raw:
(100 TB / 10) × 3 × 1.25 = 37.5 TB raw disk per worker
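The storage formula above can be sketched as a small helper; the function name is illustrative, not part of any ODP tooling:

```python
def raw_storage_per_node_tb(logical_tb, num_nodes, replication=3, overhead=1.25):
    """Raw disk needed per worker node, including the 1.25 HDFS overhead factor."""
    return (logical_tb / num_nodes) * replication * overhead

# The worked example: 100 TB logical data, 10 workers, replication factor 3
print(raw_storage_per_node_tb(100, 10))  # 37.5
```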
Disk Configuration — JBOD Only
Do not use hardware RAID (RAID 5, RAID 6, RAID 10) for HDFS data disks. HDFS provides its own fault tolerance through replication. Hardware RAID adds cost and complexity without benefit, and can reduce throughput due to parity overhead.
Configure HDFS data disks as JBOD (Just a Bunch of Disks). Each disk is mounted independently (e.g., /data/disk1, /data/disk2, ..., /data/diskN) and listed in dfs.datanode.data.dir.
Example dfs.datanode.data.dir configuration for a node with 8 data disks:
/data/disk1/hdfs,/data/disk2/hdfs,/data/disk3/hdfs,/data/disk4/hdfs,
/data/disk5/hdfs,/data/disk6/hdfs,/data/disk7/hdfs,/data/disk8/hdfs
This allows HDFS to distribute blocks across all spindles, maximizing aggregate I/O throughput.
DataNode Heap
The DataNode JVM heap scales with the number of blocks managed per node. A commonly used formula is 1 GB per 1 million blocks stored on the node.
JVM heap example (hadoop-env):
HADOOP_DATANODE_OPTS="-Xms2g -Xmx4g -XX:+UseG1GC"
For most worker nodes in medium clusters, 4 GB DataNode heap is sufficient.
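A minimal sketch of the 1 GB per 1 million blocks rule; the 4 GB floor reflects the guidance above for medium clusters and is an assumption, not an HDFS default:

```python
def datanode_heap_gb(block_count, gb_per_million_blocks=1.0, floor_gb=4.0):
    """Estimate DataNode heap from block count (1 GB per 1M blocks rule of thumb)."""
    return max(floor_gb, block_count / 1_000_000 * gb_per_million_blocks)

print(datanode_heap_gb(2_500_000))  # 4.0  (floor applies below 4M blocks)
print(datanode_heap_gb(8_000_000))  # 8.0
```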
YARN Memory and vCore Allocation
YARN allocates container resources from the NodeManager's configured capacity. Proper sizing avoids both resource underutilization and OOM conditions.
Total Available Memory
The available YARN memory (yarn.nodemanager.resource.memory-mb) should be:
yarn.nodemanager.resource.memory-mb = total RAM - OS overhead - co-located services heap
Example for a 128 GB worker node with HBase RegionServer (16 GB heap) and DataNode (4 GB heap):
128 GB - 8 GB (OS) - 16 GB (HBase) - 4 GB (DataNode) - 4 GB (buffer) = 96 GB → 98304 MB
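The subtraction above can be expressed as a small sizing helper (function name and the default OS/buffer reservations mirror the example and are assumptions to adjust per node):

```python
def yarn_memory_mb(total_ram_gb, os_gb=8, colocated_heap_gb=0, buffer_gb=4):
    """yarn.nodemanager.resource.memory-mb per the formula above (inputs in GB)."""
    return (total_ram_gb - os_gb - colocated_heap_gb - buffer_gb) * 1024

# 128 GB node with HBase RegionServer (16 GB) and DataNode (4 GB) co-located:
print(yarn_memory_mb(128, colocated_heap_gb=16 + 4))  # 98304
```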
Total Available vCores
Set yarn.nodemanager.resource.cpu-vcores to the number of logical CPUs on the node minus overhead for OS and co-located services:
yarn.nodemanager.resource.cpu-vcores = total logical CPUs - 2 (OS) - [cores reserved for co-located services]
For a 24-core node: 24 - 2 = 22 vcores available to YARN.
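Taken together, the memory and vcore results from the examples above could be applied in yarn-site.xml as follows (the values shown assume the 128 GB, 24-core worker discussed here):

```xml
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>98304</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>22</value>
</property>
```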
Container Memory Settings
| Parameter | Recommended value |
|---|---|
| yarn.scheduler.minimum-allocation-mb | 1024 MB (1 GB) |
| yarn.scheduler.maximum-allocation-mb | Equal to yarn.nodemanager.resource.memory-mb |
| yarn.scheduler.minimum-allocation-vcores | 1 |
| yarn.scheduler.maximum-allocation-vcores | Equal to yarn.nodemanager.resource.cpu-vcores |
MapReduce and Spark Memory Defaults
For MapReduce:
- mapreduce.map.memory.mb: 2048–4096 MB
- mapreduce.reduce.memory.mb: 4096–8192 MB
For Spark (per executor):
- spark.executor.memory: 4–16 GB depending on workload
- spark.executor.cores: 2–5 (avoid single-core executors for efficiency)
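As one illustrative combination within these ranges (not a universal default), a spark-defaults.conf entry might look like:

```
spark.executor.memory   8g
spark.executor.cores    4
```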
Disk Layout Best Practices
Separation of OS and Data Disks
Always use separate disks for the operating system (and service logs) versus HDFS data:
| Disk role | Recommended type | Mount points |
|---|---|---|
| OS + logs | SSD 200–500 GB | /, /var/log/ |
| HDFS data | HDD 4–12 TB each | /data/disk1 ... /data/diskN |
| Kudu data (if deployed) | NVMe / SATA SSD | /kudu/disk1 ... /kudu/diskN |
| Kafka logs (if deployed) | HDD or SSD | /kafka/disk1 ... /kafka/diskN |
HDFS Short-Circuit Reads
Enable HDFS short-circuit reads (dfs.client.read.shortcircuit=true) to allow co-located compute tasks (MapReduce, Spark, Impala) to read HDFS blocks directly from local disk via Unix domain socket, bypassing the DataNode network stack. This significantly reduces latency for local reads.
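Short-circuit reads require both the client flag and a Unix domain socket path in hdfs-site.xml; the socket path below is a commonly used location, adjust for your distribution:

```xml
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>
```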
Kafka Broker Sizing (Co-Located)
When Kafka brokers are co-located with HDFS DataNodes on worker nodes (acceptable for small to medium clusters), follow these guidelines:
| Resource | Guidance |
|---|---|
| Kafka heap | 4–8 GB (-Xms6g -Xmx6g) |
| Kafka log disks | Separate disks from HDFS data disks |
| Kafka log retention | Size per broker: partitions × retention size per partition |
| Network | Kafka is network-intensive; 10 GbE minimum, 25 GbE recommended for high-throughput topics |
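The retention-sizing row above (partitions × retention size per partition) can be sketched as a quick per-broker disk estimate; the 1.3 headroom factor for open segments and cleanup lag is an assumption, not a Kafka default:

```python
def broker_log_disk_tb(partitions_on_broker, retention_gb_per_partition, headroom=1.3):
    """Estimate Kafka log disk needed per broker, in TB (1 TB = 1024 GB)."""
    return partitions_on_broker * retention_gb_per_partition * headroom / 1024

# 2000 partition replicas on the broker, 5 GB retained per partition:
print(round(broker_log_disk_tb(2000, 5), 2))  # 12.7
```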
JVM heap example (kafka-env):
export KAFKA_HEAP_OPTS="-Xms6g -Xmx6g"
Key Kafka broker settings for co-located deployments:
- Set log.dirs to dedicated disks (not shared with HDFS or OS)
- Set num.io.threads and num.network.threads based on available cores (typically num.io.threads = 2 × number of data disks, num.network.threads = 3–8)
- Tune replica.fetch.max.bytes and message.max.bytes for your expected message sizes
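A server.properties fragment reflecting these guidelines might look like the following; the mount points match the disk layout table above, and the thread counts assume a broker with four dedicated Kafka disks:

```
log.dirs=/kafka/disk1,/kafka/disk2,/kafka/disk3,/kafka/disk4
num.io.threads=8
num.network.threads=5
```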
For large Kafka deployments (high message throughput or large retention), consider dedicated Kafka nodes rather than co-location with HDFS DataNodes to avoid I/O and network contention.
HBase RegionServer Sizing (Co-Located)
HBase RegionServers are typically co-located on worker nodes alongside HDFS DataNode and YARN NodeManager.
| Resource | Guidance |
|---|---|
| RegionServer heap | 16–32 GB |
| hfile.block.cache.size | 0.40 (40% of heap for read cache) |
| hbase.regionserver.global.memstore.size | 0.40 (40% of heap for write buffer) |
| Regions per RegionServer | 20–200 (start with 50 and tune based on workload) |
JVM heap example (hbase-env):
HBASE_REGIONSERVER_OPTS="-Xms16g -Xmx16g -XX:+UseG1GC -XX:MaxGCPauseMillis=100"
Use G1GC for HBase RegionServers to minimize GC pause times, which can cause region server timeouts and unnecessary region reassignment.
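The cache and memstore fractions from the table above would be set in hbase-site.xml; note that HBase requires their sum to leave at least 20% of heap free, so 0.40 + 0.40 is at the limit:

```xml
<property>
  <name>hfile.block.cache.size</name>
  <value>0.40</value>
</property>
<property>
  <name>hbase.regionserver.global.memstore.size</name>
  <value>0.40</value>
</property>
```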