Hive Architecture
c) Query Compiler
Parses HiveQL queries and checks syntax/semantics.
Converts SQL-like queries into an Abstract Syntax Tree (AST).
Generates a logical execution plan from the AST.
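One way to look at what the compiler produces is Hive's EXPLAIN statement, which prints the plan for a query without running it. The sketch below assumes the sales table from the partitioning example in these notes already exists; the aggregate query is only an illustration.

```sql
-- Show the plan Hive compiles for a query, without executing it.
-- Assumes a sales table (id, product, amount) partitioned by (year, month).
EXPLAIN
SELECT product, SUM(amount) AS total_amount
FROM sales
WHERE year = 2023
GROUP BY product;
```

The output lists the stages and operators (such as the table scan, filter, and group-by) that the execution engine would run.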
d) Query Optimizer
Optimizes the execution plan for better performance.
Uses techniques like:
o Predicate Pushdown (filtering early)
o Join Optimization (reordering joins)
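As a rough illustration of both techniques, the sketch below sets Hive's predicate pushdown and cost-based optimizer properties and then inspects a join plan. The orders table and its columns are hypothetical, introduced only for this example; customers is assumed to have the id and name columns used in the bucketing example.

```sql
-- Hypothetical table, used only for this illustration.
CREATE TABLE IF NOT EXISTS orders (
  order_id    INT,
  customer_id INT,
  amount      FLOAT
);

-- Assumed to match the customers table used elsewhere in these notes.
CREATE TABLE IF NOT EXISTS customers (
  id   INT,
  name STRING
);

-- Predicate pushdown: apply filters as close to the table scan as possible.
SET hive.optimize.ppd=true;
-- Cost-based optimizer: lets Hive pick a join order using table statistics.
SET hive.cbo.enable=true;

-- With pushdown, the amount filter can run before the join instead of after it.
EXPLAIN
SELECT c.name, o.amount
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.amount > 100;
```

Both properties are typically enabled by default in recent Hive releases, so the SET lines mainly make the assumption explicit.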
e) Execution Engine
Converts the optimized logical plan into a physical execution plan.
Executes queries using:
o MapReduce
o Apache Tez
o Apache Spark
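The engine is chosen per session through the hive.execution.engine property. A minimal sketch, assuming Tez is installed and configured on the cluster:

```sql
-- Print the engine currently in effect for this session.
SET hive.execution.engine;

-- Switch this session to Tez (documented values: mr, tez, spark).
SET hive.execution.engine=tez;
```

Which values are actually usable depends on the Hive version and on how the cluster is set up.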
Supported data formats include Avro and JSON.
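1. Partitioning
o Splits a table into separate directories by partition column values, so queries can skip irrelevant data.
sql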
CREATE TABLE sales (
  id      INT,
  product STRING,
  amount  FLOAT
)
PARTITIONED BY (year INT, month INT)
STORED AS PARQUET;
o Querying only a specific partition:
sql
SELECT * FROM sales WHERE year = 2023 AND month = 1;
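To make the partition layout concrete, the sketch below writes one row into a specific partition and then lists the table's partitions. It assumes the sales table defined above; the row values are made up for illustration.

```sql
-- Static partition insert: the row is stored under the year=2023/month=1 partition.
-- The values (1, 'laptop', 1200.0) are illustrative only.
INSERT INTO TABLE sales PARTITION (year = 2023, month = 1)
VALUES (1, 'laptop', 1200.0);

-- List the partitions Hive has registered for the table.
SHOW PARTITIONS sales;
```

SHOW PARTITIONS reports each partition in the form year=2023/month=1, which mirrors the directory layout that a filter on year and month lets Hive restrict its scan to.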
2. Bucketing
o Distributes rows into a fixed number of buckets by hashing the bucketing column, which helps with load balancing, sampling, and joins.
sql
CREATE TABLE customers (
  id   INT,
  name STRING
)
CLUSTERED BY (id) INTO 10 BUCKETS;
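One practical use of buckets is sampling. The sketch below assumes the customers table defined above; the inserted rows are made up for illustration.

```sql
-- Illustrative rows; Hive assigns each one to a bucket based on a hash of id.
INSERT INTO TABLE customers VALUES (1, 'Alice'), (2, 'Bob'), (11, 'Carol');

-- Read only one of the 10 buckets instead of scanning the whole table.
SELECT * FROM customers TABLESAMPLE(BUCKET 1 OUT OF 10 ON id);
```

Because the sample is taken on the same column and bucket count the table is clustered by, Hive can read a single bucket rather than filtering every row.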
Distributed Mode: Uses the Hadoop cluster for distributed execution.
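A session can check where its jobs would run by printing the relevant properties. This is a minimal sketch that assumes the MapReduce engine; hive.exec.mode.local.auto is an optional setting and is not required for distributed execution.

```sql
-- Print the framework MapReduce jobs are submitted to
-- ("yarn" means the cluster, "local" means a single local JVM).
SET mapreduce.framework.name;

-- Optional: let Hive run sufficiently small queries locally instead of on the cluster.
SET hive.exec.mode.local.auto=true;
```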
Feature | Hive (Data Warehouse) | RDBMS (SQL Databases)
Final Thoughts
Hive is best suited for batch processing and data analytics on
large datasets.
Uses HiveQL, which is similar to SQL, making it easy to use.
Queries can be optimized using partitioning, bucketing, and, in Hive versions before 3.0, indexing.
Works well with Hadoop and supports multiple execution engines
like MapReduce, Tez, and Spark.