Hive Architecture

Apache Hive is a data warehouse infrastructure built on Hadoop that allows for querying large datasets using HiveQL, which is translated into MapReduce, Tez, or Spark jobs. The architecture includes key components such as the User Interface, Hive Driver, Query Compiler, Optimizer, Execution Engine, Metastore, and Storage Layer. Hive is designed for batch processing and analytics on large datasets, offering advantages like scalability and ease of use, but has limitations in real-time transaction support and high latency.

Uploaded by

mytreyan197

Hive Architecture

Apache Hive is a data warehouse infrastructure built on top of Hadoop that
allows querying and managing large datasets using an SQL-like language called
HiveQL. Hive translates HiveQL queries into MapReduce, Tez, or Spark jobs,
making it well suited to large-scale batch processing.

1. Hive Architecture Overview


The Hive architecture consists of the following key components:
1. User Interface (UI)
2. Hive Driver
3. Query Compiler
4. Optimizer
5. Execution Engine
6. Metastore
7. Storage (HDFS, HBase, S3, etc.)
High-Level Hive Architecture Diagram
+-----------------------+
| User Interface (UI) |
+-----------------------+
|
v
+-----------------------+
| Hive Driver |
+-----------------------+
|
v
+-----------------------+
| Query Compiler |
+-----------------------+
|
v
+-----------------------+
| Optimizer |
+-----------------------+
|
v
+-----------------------+
| Execution Engine |
+-----------------------+
|
v
+-----------------------+
| Hadoop (HDFS, YARN) |
+-----------------------+

2. Hive Architecture Components


a) User Interface (UI)
 The UI allows users to interact with Hive using Hive CLI, Beeline,
JDBC/ODBC, or Web Interfaces.
 Users submit queries written in HiveQL.
b) Hive Driver
 Manages the session and workflow of queries.
 Interacts with:
o Query Compiler (parsing queries)

o Execution Engine (running jobs)

o Metastore (fetching metadata)

c) Query Compiler
 Parses HiveQL queries and checks syntax and semantics.
 Converts the query into an Abstract Syntax Tree (AST).
 Generates a logical execution plan.
d) Query Optimizer
 Optimizes the execution plan for better performance.
 Uses techniques like:
o Predicate Pushdown (filtering early)
o Join Optimization (reordering joins)

o Partition Pruning (processing only necessary partitions)
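Partition pruning and predicate pushdown can be observed in a query plan. A hedged sketch (the sales table and its year partition are assumed here for illustration):

```sql
-- Only the year=2023 partition directories are scanned (partition pruning),
-- and the amount filter is applied as early as possible (predicate pushdown).
EXPLAIN
SELECT product, amount
FROM sales
WHERE year = 2023 AND amount > 500;
```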

e) Execution Engine
 Converts the optimized logical plan into a physical execution plan.
 Executes queries using:
o MapReduce

o Apache Tez

o Apache Spark

 Distributes the query execution across a Hadoop cluster.
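The engine can be chosen per session via a standard Hive configuration property. A minimal sketch, assuming Tez is installed on the cluster:

```sql
-- Tez and Spark usually reduce latency versus classic MapReduce.
SET hive.execution.engine=tez;   -- or mr / spark
SELECT COUNT(*) FROM products;   -- now runs as a Tez DAG
```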


f) Metastore (Hive Metadata Repository)
 Stores metadata about tables, columns, partitions, and data
locations.
 Can use RDBMS like MySQL, PostgreSQL, or Derby.
 Key metadata stored:
o Table Schema (column names, types)

o Partitions & Buckets

o Data Storage Location (HDFS paths)
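This metadata can be inspected directly from HiveQL. A sketch, assuming tables named employees and sales exist:

```sql
DESCRIBE FORMATTED employees;   -- columns, storage format, HDFS location
SHOW PARTITIONS sales;          -- partition values registered in the metastore
```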

g) Storage Layer (HDFS, HBase, S3, etc.)


 Stores raw data files.
 Supported storage formats:
o TextFile (CSV, TSV)

o ORC (Optimized Row Columnar)

o Parquet (columnar format)

o Avro, JSON
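The storage format is chosen at table creation time. An illustrative sketch (the table name and data are hypothetical; orc.compress is a standard ORC table property):

```sql
CREATE TABLE web_logs (
  ip STRING,
  url STRING,
  ts TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");  -- ORC supports block compression
```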

3. Hive Query Execution Process


Step-by-Step Query Execution Flow
1. User submits HiveQL Query via UI (CLI, Beeline, JDBC, etc.).
2. Query Compiler parses and validates the query.
3. Optimizer improves the execution plan.
4. Execution Engine generates tasks and submits them to Hadoop.
5. MapReduce/Tez/Spark processes the query in a distributed fashion.
6. Results are retrieved and displayed to the user.
Example Query Execution
SELECT category, COUNT(*)
FROM products
WHERE price > 100
GROUP BY category;
1. Query Compiler validates syntax.
2. Optimizer rewrites for efficiency.
3. Execution Engine converts into MapReduce/Tez tasks.
4. Tasks read data from HDFS and execute in parallel.
5. Final results are returned.
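The steps above can be sketched as a toy Python model of the map, shuffle, and reduce phases for the example query. This is illustrative only, not Hive's actual runtime, and the row data is invented:

```python
from collections import defaultdict

# Toy rows standing in for HDFS blocks of the products table
# (category, price); the data is made up for illustration.
blocks = [
    [("books", 120.0), ("toys", 80.0)],     # block 1
    [("books", 150.0), ("garden", 310.0)],  # block 2
]

def map_phase(block):
    """Map task: apply the WHERE price > 100 filter early, emit (category, 1)."""
    return [(category, 1) for category, price in block if price > 100]

def reduce_phase(pairs):
    """Reduce task: sum counts per category (the COUNT(*) aggregation)."""
    counts = defaultdict(int)
    for category, one in pairs:
        counts[category] += one
    return dict(counts)

# Shuffle: in a real cluster, pairs are grouped by key across nodes.
mapped = [pair for block in blocks for pair in map_phase(block)]
result = reduce_phase(mapped)
print(result)  # {'books': 2, 'garden': 1}
```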

4. Storage & Data Organization in Hive


a) Tables in Hive
 Hive tables are logically similar to RDBMS tables but store data in
HDFS.
 Tables can be managed or external.
Managed Table (Default)
 Hive controls the table and data storage.
 Dropping the table deletes data from HDFS.
CREATE TABLE employees (
  id INT,
  name STRING,
  salary FLOAT
) STORED AS ORC;
External Table
 Hive only manages metadata, but data is stored externally.
 Dropping the table does not delete data.
CREATE EXTERNAL TABLE sales (
  id INT,
  product STRING,
  amount FLOAT
) LOCATION '/user/data/sales';
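To illustrate the drop behavior just described, using the sales table above:

```sql
-- Removes only the metastore entry; the files under the table's
-- HDFS location remain intact.
DROP TABLE sales;
```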

b) Partitions & Buckets


1. Partitioning
o Used to divide data into smaller segments based on a column.

o Improves query performance by scanning only relevant partitions.

CREATE TABLE sales (
  id INT,
  product STRING,
  amount FLOAT
)
PARTITIONED BY (year INT, month INT)
STORED AS PARQUET;
o Querying only a specific partition:

SELECT * FROM sales WHERE year = 2023 AND month = 1;
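Partitions can also be registered explicitly (static partitioning); the partition values below are illustrative:

```sql
ALTER TABLE sales ADD PARTITION (year=2023, month=2);
-- Each partition maps to its own HDFS directory, e.g.
-- .../sales/year=2023/month=2/
SHOW PARTITIONS sales;
```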
2. Bucketing
o Distributes data into a fixed number of buckets for balanced data layout and efficient joins/sampling.

o Uses HASH function on a column.

CREATE TABLE customers (
  id INT,
  name STRING
)
CLUSTERED BY (id) INTO 10 BUCKETS;
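A rough sketch of how bucketing assigns rows, as Python. This is illustrative only; Hive's internal bucketing hash differs (e.g. Murmur3 under bucketing version 2):

```python
# Hypothetical helper modeling hash-based bucket assignment.
NUM_BUCKETS = 10  # matches CLUSTERED BY (id) INTO 10 BUCKETS above

def bucket_for(customer_id: int) -> int:
    """Rows with equal ids always land in the same bucket file."""
    return customer_id % NUM_BUCKETS  # simple stand-in for Hive's hash

ids = [101, 202, 303, 111]
buckets = {cid: bucket_for(cid) for cid in ids}
print(buckets)  # {101: 1, 202: 2, 303: 3, 111: 1}
```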

5. Hive Execution Modes


Mode               Description

Local Mode         Runs on a single machine; useful for debugging.

Distributed Mode   Uses a Hadoop cluster for distributed execution.
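Local-mode behavior is controlled with standard Hive and Hadoop properties. A sketch (values shown are illustrative):

```sql
-- Let Hive fall back to local mode automatically for small inputs:
SET hive.exec.mode.local.auto=true;
-- Or force local execution explicitly:
SET mapreduce.framework.name=local;
```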

6. Hive vs Traditional Databases

Feature             Hive (Data Warehouse)                 RDBMS (SQL Databases)

Query Language      HiveQL (SQL-like)                     SQL

Storage             HDFS (distributed)                    Disk (single server)

Schema              Schema-on-Read                        Schema-on-Write

Scalability         High (big data)                       Low (single-server limits)

Query Execution     MapReduce, Tez, Spark                 Traditional SQL engine

ACID Transactions   Limited (available from Hive 2.0+)    Full ACID support

7. Advantages & Disadvantages of Hive


✅ Advantages
1. Scalable & Handles Big Data – Works on HDFS, supports petabytes of
data.
2. SQL-Like Language – Easy to learn and use.
3. Supports Multiple Execution Engines – MapReduce, Tez, Spark.
4. Partitioning & Bucketing – Improves performance on large datasets.
5. Integration with Hadoop Ecosystem – Works with HDFS, HBase, Spark.
❌ Disadvantages
1. Not Suitable for OLTP – Hive is designed for batch processing, not
real-time transactions.
2. High Latency – Queries take longer due to MapReduce execution.
3. Limited ACID Support – Full transaction support is limited compared to
traditional RDBMS.
4. Not Ideal for Small Datasets – Best suited for large datasets.

Final Thoughts
 Hive is best suited for batch processing and data analytics on
large datasets.
 Uses HiveQL, which is similar to SQL, making it easy to use.
 Performance is improved using partitioning and bucketing (Hive's indexing feature is deprecated in recent versions).
 Works well with Hadoop and supports multiple execution engines
like MapReduce, Tez, and Spark.
