Version: 1.3.1.0

Apache Hive Overview

Apache Hive is the SQL-on-Hadoop engine that makes large-scale data warehousing accessible to analysts and engineers who already know SQL. ODP 1.3.1.0 ships Hive 4.0.1, the most recent release in the Hive 4 line, which brings full ACID support, materialized views, and native Iceberg integration.

What is Hive?

Hive translates HiveQL (a SQL dialect) into distributed execution plans that run on YARN via the Tez engine. Data lives in HDFS (or Ozone) and is typically stored in columnar formats such as ORC or Parquet. Hive provides a Metastore — a relational database (PostgreSQL in ODP) that records table schemas, partition information, and statistics — which other engines such as Spark and Impala can also query.
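As a quick illustration of this workflow, the sketch below creates a partitioned ORC table and runs a standard aggregate against it (table and column names are hypothetical):

```sql
-- Create a partitioned table stored as ORC (hypothetical schema)
CREATE TABLE sales (
  order_id BIGINT,
  amount   DECIMAL(10,2)
)
PARTITIONED BY (sale_date DATE)
STORED AS ORC;

-- A standard SQL aggregate; Hive compiles this into a Tez DAG on YARN.
-- The partition filter lets the Metastore prune partitions before any data is read.
SELECT sale_date, SUM(amount) AS daily_total
FROM sales
WHERE sale_date >= '2024-01-01'
GROUP BY sale_date;
```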

Hive 4 Architecture

HiveServer2

HiveServer2 (HS2) is the JDBC/ODBC server that clients connect to. It accepts HiveQL queries, compiles them into a logical plan, optimizes that plan, and submits execution to YARN. HS2 is multi-tenant: it handles concurrent sessions with session-level isolation. ODP deploys multiple HS2 instances behind a load balancer for availability.
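To inspect the plan HS2 produces without actually submitting work to YARN, any query can be prefixed with EXPLAIN (the table name here is hypothetical):

```sql
-- Prints the optimized Tez DAG for the query without executing it
EXPLAIN
SELECT sale_date, COUNT(*) FROM sales GROUP BY sale_date;
```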

Metastore

The Hive Metastore is the central schema registry for the Hadoop ecosystem. It stores:

  • Database and table definitions (columns, types, SerDe)
  • Partition metadata (critical for query pruning at scale)
  • Table statistics used by the cost-based optimizer
  • Iceberg table properties when using the Iceberg catalog

The Metastore exposes a Thrift API consumed by HiveServer2, Spark, Impala, Trino, and other engines — making it the de facto interoperability layer of the lakehouse.
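The metadata listed above can be inspected from any Hive session; a sketch, using a hypothetical table name:

```sql
-- Table definition, SerDe, location, and statistics from the Metastore
DESCRIBE FORMATTED sales;

-- Partition metadata used for query pruning
SHOW PARTITIONS sales;

-- Refresh column statistics for the cost-based optimizer
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS;
```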

Tez Execution Engine

Hive 4 executes queries via Apache Tez, a DAG-based execution framework that replaces the older MapReduce engine. Tez eliminates the write-to-HDFS step between stages, keeps intermediate data in memory or local disk, and reuses JVM containers across tasks. For typical analytical queries, Tez is 2–10x faster than MapReduce.
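The execution engine and Tez behavior are session-level settings. Tez is the default (and only supported) engine in Hive 4, but the configuration can still be inspected and tuned per session; the property below is illustrative:

```sql
-- Confirm the active execution engine for this session
SET hive.execution.engine;

-- Container reuse is what avoids JVM startup cost between tasks
SET tez.am.container.reuse.enabled=true;
```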

ACID Transactions

Hive 4 provides full ACID (Atomicity, Consistency, Isolation, Durability) semantics on ORC tables, including:

  • INSERT, UPDATE, DELETE, and MERGE (upsert) statements
  • Read isolation through multi-version concurrency control (MVCC)
  • Automatic compaction of delta files to maintain read performance

ACID tables in Hive require the ORC storage format and a transactional table property. In ODP, ACID is enabled by default for managed tables.
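A minimal sketch of ACID DML on a managed ORC table (table and column names are hypothetical):

```sql
-- Managed ORC tables are transactional by default in ODP;
-- the property can also be set explicitly, as shown here
CREATE TABLE customers (
  id    BIGINT,
  email STRING
)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Row-level DML, executed with full ACID guarantees
UPDATE customers SET email = 'new@example.com' WHERE id = 42;
DELETE FROM customers WHERE id = 43;

-- Upsert from a staging table in a single MERGE statement
MERGE INTO customers AS t
USING customers_staging AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET email = s.email
WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.email);
```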

Materialized Views

Hive 4 supports materialized views — precomputed query results stored as physical tables. The query optimizer automatically rewrites eligible queries to use materialized views when doing so reduces cost, transparently accelerating dashboards and repeated analytical patterns without requiring application changes.
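A sketch of the feature (table and view names are hypothetical; note that materialized views require their source tables to be transactional):

```sql
-- Precompute a daily aggregate once and store it as a physical table
CREATE MATERIALIZED VIEW daily_sales AS
SELECT sale_date, SUM(amount) AS total
FROM sales
GROUP BY sale_date;

-- The optimizer can now rewrite this query to scan daily_sales
-- instead of the (much larger) base table
SELECT sale_date, SUM(amount) FROM sales GROUP BY sale_date;
```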

Iceberg Integration in ODP 1.3.1.0

ODP integrates Iceberg 1.6.1 as a first-class table format alongside ORC. Hive 4 can create and query Iceberg tables natively:

```sql
CREATE TABLE events (
  event_id BIGINT,
  event_time TIMESTAMP,
  payload STRING
)
STORED BY ICEBERG
STORED AS PARQUET;
```

Iceberg tables managed through Hive are readable by Spark, Impala, and Trino without format conversion, enabling true multi-engine lakehouse architectures. See the Apache Iceberg Overview for details.

Hive Warehouse Connector (HWC)

The Hive Warehouse Connector allows Spark applications to read from and write to Hive ACID and Iceberg tables with full transactional guarantees. Without HWC, Spark reads Hive tables through a compatibility layer that bypasses ACID semantics. HWC provides:

  • spark.read.format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector") for batch reads
  • Streaming writes to Hive ACID tables from Spark Structured Streaming
  • Consistent metadata access via the Hive Metastore

ODP ships HWC pre-configured for the deployed Hive and Spark versions.

When to Use Hive vs Impala vs Trino

| Use case | Recommended engine |
| --- | --- |
| Large batch ETL / complex transformations | Hive (Tez, full ACID, UDFs) |
| Interactive BI queries on HDFS/Iceberg (<30s) | Impala (MPP, lowest latency) |
| Federated queries across multiple data sources | Trino (connectors for RDBMS, object store, Hive) |
| Streaming ingestion into Hive ACID tables | Hive (via HWC + Spark Streaming) |
| Ad hoc exploration by SQL analysts | Hive or Trino, depending on latency needs |

Hive is the right choice when query complexity, ACID guarantees, or long-running batch workloads matter more than sub-second latency. For interactive dashboards on pre-aggregated data, Impala or Trino will typically outperform Hive.