Apache Hive Overview
Apache Hive is the SQL-on-Hadoop engine that makes large-scale data warehousing accessible to analysts and engineers who already know SQL. ODP 1.3.1.0 ships Hive 4.0.1, the latest major release, which brings full ACID support, materialized views, and native Iceberg integration.
What is Hive?
Hive translates HiveQL (a SQL dialect) into distributed execution plans that run on YARN via the Tez engine. Data lives in HDFS (or Ozone) and is typically stored in columnar formats such as ORC or Parquet. Hive provides a Metastore — a relational database (PostgreSQL in ODP) that records table schemas, partition information, and statistics — which other engines such as Spark and Impala can also query.
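As a sketch of how these pieces fit together, the following HiveQL (table and column names are hypothetical) defines a partitioned ORC table — its schema and partition list land in the Metastore, and the partition predicate lets Hive prune the scan:

```sql
-- Hypothetical partitioned table; schema and partitions are
-- recorded in the Metastore.
CREATE EXTERNAL TABLE web_logs (
  user_id BIGINT,
  url     STRING,
  ts      TIMESTAMP
)
PARTITIONED BY (log_date DATE)
STORED AS ORC
LOCATION '/data/web_logs';

-- The partition predicate lets Hive read a single partition
-- instead of scanning the whole table.
SELECT COUNT(*) FROM web_logs WHERE log_date = '2024-01-15';
```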
Hive 4 Architecture
HiveServer2
HiveServer2 (HS2) is the JDBC/ODBC server that clients connect to. It accepts HiveQL queries, compiles them into a logical plan, optimizes that plan, and submits execution to YARN. HS2 is multi-tenant: it handles concurrent sessions with session-level isolation. ODP deploys multiple HS2 instances behind a load balancer for availability.
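Clients reach HS2 through a standard Hive JDBC URL. The connection strings below are illustrative (host names and ZooKeeper namespace are assumptions, not ODP defaults):

```
# Direct connection through a load balancer in front of HS2:
jdbc:hive2://hs2-lb.example.com:10000/default

# ZooKeeper service discovery across multiple HS2 instances:
jdbc:hive2://zk1:2181,zk2:2181,zk3:2181/default;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2
```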
Metastore
The Hive Metastore is the central schema registry for the Hadoop ecosystem. It stores:
- Database and table definitions (columns, types, SerDe)
- Partition metadata (critical for query pruning at scale)
- Table statistics used by the cost-based optimizer
- Iceberg table properties when using the Iceberg catalog
The Metastore exposes a Thrift API consumed by HiveServer2, Spark, Impala, Trino, and other engines — making it the de facto interoperability layer of the lakehouse.
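Other engines attach to the same Metastore by pointing at its Thrift endpoint. A hedged `hive-site.xml` fragment (host names are illustrative):

```xml
<!-- Hypothetical hive-site.xml fragment: Spark, Impala, or Trino
     can consume the same Metastore via its Thrift URI(s). -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-1.example.com:9083,thrift://metastore-2.example.com:9083</value>
</property>
```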
Tez Execution Engine
Hive 4 executes queries via Apache Tez, a DAG-based execution framework that replaces the older MapReduce engine. Tez eliminates the write-to-HDFS step between stages, keeps intermediate data in memory or local disk, and reuses JVM containers across tasks. For typical analytical queries, Tez is 2–10x faster than MapReduce.
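You can inspect the Tez DAG Hive compiles a query into with EXPLAIN (the table here is hypothetical):

```sql
-- EXPLAIN prints the Tez plan: vertices such as Map and Reducer
-- stages connected by edges, with intermediate data flowing
-- between them rather than being written back to HDFS.
EXPLAIN
SELECT url, COUNT(*) AS hits
FROM web_logs
GROUP BY url;
```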
ACID Transactions
Hive 4 provides full ACID (Atomicity, Consistency, Isolation, Durability) semantics on ORC tables, including:
- INSERT, UPDATE, DELETE, and MERGE (upsert) statements
- Read isolation through multi-version concurrency control (MVCC)
- Automatic compaction of delta files to maintain read performance
ACID tables in Hive require the ORC storage format and a transactional table property. In ODP, ACID is enabled by default for managed tables.
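A minimal sketch of an ACID table and row-level DML (table and column names, and the staging table, are hypothetical):

```sql
-- Transactional table: ORC format plus the transactional property
-- (set implicitly for managed tables in ODP).
CREATE TABLE accounts (
  id      BIGINT,
  balance DECIMAL(18,2)
)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Row-level update:
UPDATE accounts SET balance = balance - 100 WHERE id = 1;

-- Upsert from a staging table:
MERGE INTO accounts AS t
USING staging_accounts AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET balance = s.balance
WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.balance);
```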
Materialized Views
Hive 4 supports materialized views — precomputed query results stored as physical tables. The query optimizer automatically rewrites eligible queries to use materialized views when doing so reduces cost, transparently accelerating dashboards and repeated analytical patterns without requiring application changes.
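A hedged example of the rewrite in action (names are hypothetical; note that Hive requires the base tables of a materialized view to be transactional):

```sql
-- Precompute a daily aggregate over a hypothetical base table.
CREATE MATERIALIZED VIEW daily_hits AS
SELECT log_date, COUNT(*) AS hits
FROM web_logs
GROUP BY log_date;

-- A query of matching shape can be rewritten by the optimizer to
-- read daily_hits instead of scanning web_logs:
SELECT log_date, COUNT(*) FROM web_logs GROUP BY log_date;

-- Rebuild after new data lands in the base table:
ALTER MATERIALIZED VIEW daily_hits REBUILD;
```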
Iceberg Integration in ODP 1.3.1.0
ODP integrates Iceberg 1.6.1 as a first-class table format alongside ORC. Hive 4 can create and query Iceberg tables natively:
```sql
CREATE TABLE events (
  event_id   BIGINT,
  event_time TIMESTAMP,
  payload    STRING
)
STORED BY ICEBERG
STORED AS PARQUET;
```
Iceberg tables managed through Hive are readable by Spark, Impala, and Trino without format conversion, enabling true multi-engine lakehouse architectures. See the Apache Iceberg Overview for details.
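Because Iceberg tracks table snapshots, Hive 4 can also query an earlier state of the table via time travel — a sketch against the events table above (the timestamp is illustrative, and the feature assumes ODP's Hive build exposes Iceberg time travel):

```sql
-- Read the table as of a point in time, using an earlier snapshot:
SELECT * FROM events
FOR SYSTEM_TIME AS OF '2024-06-01 00:00:00';
```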
Hive Warehouse Connector (HWC)
The Hive Warehouse Connector allows Spark applications to read from and write to Hive ACID and Iceberg tables with full transactional guarantees. Without HWC, Spark reads Hive tables through a compatibility layer that bypasses ACID semantics. HWC provides:
- `spark.read.format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector")` for batch reads
- Streaming writes to Hive ACID tables from Spark Structured Streaming
- Consistent metadata access via the Hive Metastore
ODP ships HWC pre-configured for the deployed Hive and Spark versions.
When to Use Hive vs Impala vs Trino
| Use case | Recommended engine |
|---|---|
| Large batch ETL / complex transformations | Hive (Tez, full ACID, UDFs) |
| Interactive BI queries on HDFS/Iceberg (<30s) | Impala (MPP, lowest latency) |
| Federated queries across multiple data sources | Trino (connectors for RDBMS, object store, Hive) |
| Streaming ingestion into Hive ACID tables | Hive (via HWC + Spark Streaming) |
| Ad hoc exploration by SQL analysts | Hive or Trino depending on latency needs |
Hive is the right choice when query complexity, ACID guarantees, or long-running batch workloads matter more than sub-second latency. For interactive dashboards on pre-aggregated data, Impala or Trino will typically outperform Hive.