Version: 1.3.1.0

Apache Spark Overview

Apache Spark is a unified analytics engine designed for large-scale data processing. It provides a single platform that handles batch processing, SQL analytics, stream processing, machine learning, and graph computation — eliminating the need to maintain separate systems for each workload type. ODP 1.3.1.0 ships Spark 3.5.6.

What is Spark?

Spark processes data in memory using a distributed collection abstraction. Rather than writing intermediate results to disk between stages (as MapReduce does), Spark keeps data in RAM across operations, drastically reducing I/O. A Spark application runs as a driver process that coordinates a set of executor processes distributed across the YARN cluster.

Core APIs

RDD (Resilient Distributed Dataset)

The RDD is Spark's foundational data abstraction: an immutable, partitioned collection of records that can be operated on in parallel. RDDs track their lineage (the sequence of transformations that produced them), which allows lost partitions to be recomputed from the original data without replication. While most new code uses DataFrames, RDDs remain the underlying execution primitive.

DataFrame and Dataset APIs

DataFrames are distributed collections of data organized into named columns — conceptually equivalent to a table in a relational database. The Catalyst optimizer automatically optimizes DataFrame operations (predicate pushdown, column pruning, join reordering), producing efficient physical plans regardless of the order in which transformations are written.

Datasets extend DataFrames with compile-time type safety using JVM objects. They are primarily used in Java and Scala; Python and R users work with DataFrames.

Spark SQL

Spark SQL allows SQL queries to be mixed freely with DataFrame operations. A single application can load data from HDFS, filter it with a SQL WHERE clause, join it against a Hive Metastore table, and write the result to Iceberg — all within one Spark job. Spark SQL supports the HiveQL dialect and uses the Hive Metastore for schema resolution when deployed with ODP.

Spark Streaming (Structured Streaming)

Structured Streaming models a live data stream as an unbounded table that grows over time. Queries written against this table use the same DataFrame/SQL API as batch jobs, making stream processing accessible to engineers familiar with batch analytics. Structured Streaming provides:

  • Exactly-once semantics via checkpointing and write-ahead logs combined with replayable sources and idempotent sinks
  • Watermarking for handling late-arriving events
  • Stateful operations (windowed aggregations, sessionization, stream-stream joins)
  • Native integration with Apache Kafka as a source and sink

In ODP, Spark Structured Streaming is the recommended approach for building real-time pipelines that consume from Kafka and write to Hive ACID tables or Iceberg.

Spark MLlib

MLlib is Spark's scalable machine learning library. It provides:

  • Classification, regression, clustering, and recommendation algorithms
  • Feature engineering pipelines (tokenization, scaling, encoding, PCA)
  • Model selection via cross-validation and hyperparameter tuning
  • ML Pipelines: a scikit-learn-inspired API for composing and persisting multi-stage ML workflows

MLlib scales to datasets that do not fit in the memory of a single machine by distributing computation across the cluster. For teams building ML models on Hadoop-resident data, MLlib eliminates the data movement step of exporting data to a standalone ML platform.

Spark 3.5.6 in ODP

ODP 1.3.1.0 ships Spark 3.5.6 with the following integrations pre-configured:

  • Hive Metastore: Spark uses the ODP Hive Metastore as its default catalog for table resolution and schema management.
  • Hive Warehouse Connector (HWC): Enables transactional reads and writes to Hive ACID tables. See the Hive Overview for details.
  • Kerberos: Spark jobs submitted to YARN authenticate automatically using the cluster's Kerberos configuration.
  • Ranger plugin: Spark SQL queries are subject to Ranger column-level and row-level access policies.
  • Dynamic resource allocation: Spark executors are requested and released based on workload, improving cluster utilization.
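Dynamic allocation is governed by standard Spark properties, typically set in spark-defaults.conf. The min/max values below are illustrative examples, not ODP defaults:

```
spark.dynamicAllocation.enabled        true
spark.dynamicAllocation.minExecutors   1
spark.dynamicAllocation.maxExecutors   20
spark.shuffle.service.enabled          true
```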

Native Iceberg Support

In ODP, Spark 3.5.6 ships with the Iceberg Spark runtime pre-integrated. Spark can read and write Iceberg tables using the standard DataFrame API:

# Write
df.writeTo("catalog.db.events").using("iceberg").createOrReplace()

# Read with time travel: as-of-timestamp takes milliseconds since the epoch
# (1735689600000 is 2025-01-01 00:00:00 UTC)
spark.read.option("as-of-timestamp", "1735689600000").format("iceberg").load("catalog.db.events")

Because Iceberg tables written by Spark share the same metadata with Hive, Impala, and Trino, data produced by Spark pipelines is immediately queryable by analysts using SQL tools without any format conversion.

Running on YARN

In ODP, Spark runs on YARN in either client mode (driver on the submitting machine) or cluster mode (driver runs inside a YARN container). YARN provides:

  • Resource isolation between Spark and other workloads
  • Capacity Scheduler queues for multi-tenant priority management
  • Integration with Kerberos delegation tokens for secure data access

Kubernetes Support (Coming in ODP 1.3.2.0)

Support for running Spark on Kubernetes as an alternative to YARN is planned for ODP 1.3.2.0. This will allow Spark workloads to run in containerized environments alongside the existing YARN-based cluster, supporting hybrid deployment models.