Hive Architecture

Apache Hive is a data warehouse infrastructure built on Hadoop that allows for querying large datasets using HiveQL, which is translated into MapReduce, Tez, or Spark jobs. The architecture includes key components such as the User Interface, Hive Driver, Query Compiler, Optimizer, Execution Engine, Metastore, and Storage Layer. Hive is designed for batch processing and analytics on large datasets, offering advantages like scalability and ease of use, but has limitations in real-time transaction support and high latency.

Uploaded by

mytreyan197

Hive Architecture

Apache Hive is a data warehouse infrastructure built on top of Hadoop that
allows querying and managing large datasets using an SQL-like language called
HiveQL. Hive translates HiveQL queries into MapReduce, Tez, or Spark jobs,
making it well suited to large-scale batch processing.

1. Hive Architecture Overview


The Hive architecture consists of the following key components:
1. User Interface (UI)
2. Hive Driver
3. Query Compiler
4. Optimizer
5. Execution Engine
6. Metastore
7. Storage (HDFS, HBase, S3, etc.)
High-Level Hive Architecture Diagram
+-----------------------+
| User Interface (UI) |
+-----------------------+
|
v
+-----------------------+
| Hive Driver |
+-----------------------+
|
v
+-----------------------+
| Query Compiler |
+-----------------------+
|
v
+-----------------------+
| Optimizer |
+-----------------------+
|
v
+-----------------------+
| Execution Engine |
+-----------------------+
|
v
+-----------------------+
| Hadoop (HDFS, YARN) |
+-----------------------+

2. Hive Architecture Components


a) User Interface (UI)
 The UI allows users to interact with Hive using Hive CLI, Beeline,
JDBC/ODBC, or Web Interfaces.
 Users submit queries written in HiveQL.
b) Hive Driver
 Manages the session and workflow of queries.
 Interacts with:
o Query Compiler (parsing queries)

o Execution Engine (running jobs)

o Metastore (fetching metadata)

c) Query Compiler
 Parses HiveQL queries and checks syntax and semantics.
 Converts the query into an Abstract Syntax Tree (AST).
 Generates a logical execution plan.
d) Query Optimizer
 Optimizes the execution plan for better performance.
 Uses techniques like:
o Predicate Pushdown (filtering early)
o Join Optimization (reordering joins)

o Partition Pruning (processing only necessary partitions)
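Partition pruning and predicate pushdown can be observed in a query plan. A hedged sketch (the sales table and its year partition are assumed here for illustration):

```sql
-- Only the year=2023 partition directories are scanned (partition pruning),
-- and the amount filter is applied as early as possible (predicate pushdown).
EXPLAIN
SELECT product, amount
FROM sales
WHERE year = 2023 AND amount > 500;
```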

e) Execution Engine
 Converts the optimized logical plan into a physical execution plan.
 Executes queries using:
o MapReduce

o Apache Tez

o Apache Spark

 Distributes the query execution across a Hadoop cluster.
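The engine can be chosen per session via a standard Hive configuration property. A minimal sketch, assuming Tez is installed on the cluster:

```sql
-- Tez and Spark usually reduce latency versus classic MapReduce.
SET hive.execution.engine=tez;   -- or mr / spark
SELECT COUNT(*) FROM products;   -- now runs as a Tez DAG
```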


f) Metastore (Hive Metadata Repository)
 Stores metadata about tables, columns, partitions, and data
locations.
 Can use RDBMS like MySQL, PostgreSQL, or Derby.
 Key metadata stored:
o Table Schema (column names, types)

o Partitions & Buckets

o Data Storage Location (HDFS paths)
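This metadata can be inspected directly from HiveQL. A sketch, assuming tables named employees and sales exist:

```sql
DESCRIBE FORMATTED employees;   -- columns, storage format, HDFS location
SHOW PARTITIONS sales;          -- partition values registered in the metastore
```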

g) Storage Layer (HDFS, HBase, S3, etc.)


 Stores raw data files.
 Supported storage formats:
o TextFile (CSV, TSV)

o ORC (Optimized Row Columnar)

o Parquet (columnar format)

o Avro, JSON
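The storage format is chosen at table creation time. An illustrative sketch (the table name and data are hypothetical; orc.compress is a standard ORC table property):

```sql
CREATE TABLE web_logs (
  ip STRING,
  url STRING,
  ts TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");  -- ORC supports block compression
```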

3. Hive Query Execution Process


Step-by-Step Query Execution Flow
1. User submits HiveQL Query via UI (CLI, Beeline, JDBC, etc.).
2. Query Compiler parses and validates the query.
3. Optimizer improves the execution plan.
4. Execution Engine generates tasks and submits them to Hadoop.
5. MapReduce/Tez/Spark processes the query in a distributed fashion.
6. Results are retrieved and displayed to the user.
Example Query Execution
SELECT category, COUNT(*)
FROM products
WHERE price > 100
GROUP BY category;
1. Query Compiler validates syntax.
2. Optimizer rewrites for efficiency.
3. Execution Engine converts into MapReduce/Tez tasks.
4. Tasks read data from HDFS and execute in parallel.
5. Final results are returned.
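The steps above can be sketched as a toy Python model of the map, shuffle, and reduce phases for the example query. This is illustrative only, not Hive's actual runtime, and the row data is invented:

```python
from collections import defaultdict

# Toy rows standing in for HDFS blocks of the products table
# (category, price); the data is made up for illustration.
blocks = [
    [("books", 120.0), ("toys", 80.0)],     # block 1
    [("books", 150.0), ("garden", 310.0)],  # block 2
]

def map_phase(block):
    """Map task: apply the WHERE price > 100 filter early, emit (category, 1)."""
    return [(category, 1) for category, price in block if price > 100]

def reduce_phase(pairs):
    """Reduce task: sum counts per category (the COUNT(*) aggregation)."""
    counts = defaultdict(int)
    for category, one in pairs:
        counts[category] += one
    return dict(counts)

# Shuffle: in a real cluster, pairs are grouped by key across nodes.
mapped = [pair for block in blocks for pair in map_phase(block)]
result = reduce_phase(mapped)
print(result)  # {'books': 2, 'garden': 1}
```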

4. Storage & Data Organization in Hive


a) Tables in Hive
 Hive tables are logically similar to RDBMS tables but store data in
HDFS.
 Tables can be managed or external.
Managed Table (Default)
 Hive controls the table and data storage.
 Dropping the table deletes data from HDFS.
CREATE TABLE employees (
  id INT,
  name STRING,
  salary FLOAT
) STORED AS ORC;
External Table
 Hive only manages metadata, but data is stored externally.
 Dropping the table does not delete data.
CREATE EXTERNAL TABLE sales (
  id INT,
  product STRING,
  amount FLOAT
) LOCATION '/user/data/sales';
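To illustrate the drop behavior just described, using the sales table above:

```sql
-- Removes only the metastore entry; the files under the table's
-- HDFS location remain intact.
DROP TABLE sales;
```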

b) Partitions & Buckets


1. Partitioning
o Used to divide data into smaller segments based on a column.

o Improves query performance by scanning only relevant partitions.

CREATE TABLE sales (
  id INT,
  product STRING,
  amount FLOAT
)
PARTITIONED BY (year INT, month INT)
STORED AS PARQUET;
o Querying only a specific partition:

SELECT * FROM sales WHERE year = 2023 AND month = 1;
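Partitions can also be registered explicitly (static partitioning); the partition values below are illustrative:

```sql
ALTER TABLE sales ADD PARTITION (year=2023, month=2);
-- Each partition maps to its own HDFS directory, e.g.
-- .../sales/year=2023/month=2/
SHOW PARTITIONS sales;
```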
2. Bucketing
o Distributes data into a fixed number of buckets for balanced data layout and efficient joins/sampling.

o Uses HASH function on a column.

CREATE TABLE customers (
  id INT,
  name STRING
)
CLUSTERED BY (id) INTO 10 BUCKETS;
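A rough sketch of how bucketing assigns rows, as Python. This is illustrative only; Hive's internal bucketing hash differs (e.g. Murmur3 under bucketing version 2):

```python
# Hypothetical helper modeling hash-based bucket assignment.
NUM_BUCKETS = 10  # matches CLUSTERED BY (id) INTO 10 BUCKETS above

def bucket_for(customer_id: int) -> int:
    """Rows with equal ids always land in the same bucket file."""
    return customer_id % NUM_BUCKETS  # simple stand-in for Hive's hash

ids = [101, 202, 303, 111]
buckets = {cid: bucket_for(cid) for cid in ids}
print(buckets)  # {101: 1, 202: 2, 303: 3, 111: 1}
```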

5. Hive Execution Modes


Mode               Description

Local Mode         Runs on a single machine; useful for debugging.

Distributed Mode   Uses a Hadoop cluster for distributed execution.
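Local-mode behavior is controlled with standard Hive and Hadoop properties. A sketch (values shown are illustrative):

```sql
-- Let Hive fall back to local mode automatically for small inputs:
SET hive.exec.mode.local.auto=true;
-- Or force local execution explicitly:
SET mapreduce.framework.name=local;
```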

6. Hive vs Traditional Databases

Feature             Hive (Data Warehouse)                 RDBMS (SQL Databases)

Query Language      HiveQL (SQL-like)                     SQL

Storage             HDFS (distributed)                    Disk (single server)

Schema              Schema-on-Read                        Schema-on-Write

Scalability         High (big data)                       Low (single-server limits)

Query Execution     MapReduce, Tez, Spark                 Traditional SQL engine

ACID Transactions   Limited (available from Hive 2.0+)    Full ACID support

7. Advantages & Disadvantages of Hive


✅ Advantages
1. Scalable & Handles Big Data – Works on HDFS, supports petabytes of
data.
2. SQL-Like Language – Easy to learn and use.
3. Supports Multiple Execution Engines – MapReduce, Tez, Spark.
4. Partitioning & Bucketing – Improves performance on large datasets.
5. Integration with Hadoop Ecosystem – Works with HDFS, HBase, Spark.
❌ Disadvantages
1. Not Suitable for OLTP – Hive is designed for batch processing, not
real-time transactions.
2. High Latency – Queries take longer due to MapReduce execution.
3. Limited ACID Support – Full transaction support is limited compared to
traditional RDBMS.
4. Not Ideal for Small Datasets – Best suited for large datasets.

Final Thoughts
 Hive is best suited for batch processing and data analytics on
large datasets.
 Uses HiveQL, which is similar to SQL, making it easy to use.
 Performance is improved using partitioning and bucketing (Hive's indexing feature is deprecated in recent versions).
 Works well with Hadoop and supports multiple execution engines
like MapReduce, Tez, and Spark.
