Apache Hive Optimization Techniques - 1
Apache Hive is a query and analysis engine built on top of Apache Hadoop that uses the
MapReduce programming model. It provides an abstraction layer to query big data
using SQL syntax by implementing traditional SQL queries through the Java API. The
main components of Hive are as follows:
Metastore
Driver
Compiler
Optimizer
Executor
Client
While Hadoop/Hive can process nearly any amount of data, optimizations can lead to
big savings, proportional to the amount of data, in terms of processing time and cost.
There are a whole lot of optimizations that can be applied in Hive. Let us look at the
optimization techniques we are going to cover:
1. Partitioning
2. Bucketing
3. Using Tez as the Execution Engine
4. Using Compression
5. Using the ORC Format
6. Join Optimizations
7. Cost-based Optimizer
. . .
Partitioning
Partitioning divides the table into parts based on the values of particular columns. A
table can have multiple partition columns to identify a particular partition. Using
partitions, it is easy to run queries on slices of the data. The data of the partition columns
is not saved in the files; on checking the file structure you would notice that Hive creates
folders on the basis of the partition column values. This makes sure that only relevant data is
read for the execution of a particular job, decreasing the I/O time required by the query
and thus increasing query performance.
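As a minimal sketch (the table and column names here are hypothetical, not from this article), a partitioned table is declared with PARTITIONED BY, and each distinct partition value becomes its own folder on disk:

-- Hypothetical sales table partitioned by country.
CREATE TABLE sales (
    order_id BIGINT,
    amount   DOUBLE
)
PARTITIONED BY (country STRING)
STORED AS TEXTFILE;
-- On disk this produces folders such as .../sales/country=IN/ and
-- .../sales/country=US/; the country value itself is not stored in the data files.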
When we query data on a partitioned table, only the relevant partitions are scanned and
irrelevant partitions are skipped. Now assume that, even after partitioning, the data
in a partition is still quite big; to divide it further into more manageable chunks, we can use
Bucketing.
In insert queries, partitions are mentioned at the start, and their column values are
given along with the values of the other columns, but at the end.
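For illustration, using the hypothetical sales table from above (the staging table is also an assumption), a dynamic-partition insert names the partition column up front and supplies its values last, and a query filtering on the partition column only scans the matching folder:

-- Dynamic partition insert: the partition column appears in the PARTITION clause
-- and its values come at the end of the select list.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT INTO TABLE sales PARTITION (country)
SELECT order_id, amount, country FROM staging_sales;

-- Only the folder country='IN' is scanned; other partitions are skipped.
SELECT order_id, amount FROM sales WHERE country = 'IN';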
Static Partitioning
This is practiced when we have prior knowledge of the partitions into which the data is going
to be loaded. It should be preferred when loading data into a table from large files. It is
performed in strict mode, as sketched below:
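A rough sketch of a static-partition load (the file path and partition value are illustrative assumptions):

SET hive.exec.dynamic.partition.mode = strict;
-- The partition value is supplied explicitly, so Hive does not have to
-- derive it from the data being loaded.
LOAD DATA INPATH '/data/sales/country_in.csv'
INTO TABLE sales PARTITION (country = 'IN');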
. . .
Bucketing
Bucketing provides flexibility to further segregate the data into more manageable
sections called buckets or clusters. Bucketing is based on a hash function, which
depends on the type of the bucketing column. Records with the same bucketing-column
value are always saved in the same bucket. The CLUSTERED BY clause is used to
divide the table into buckets, and it works well for columns with high cardinality.
Bucketing also has its own benefits when used with ORC files and when the bucketing
column is used as the join column. We will discuss these benefits further.
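As an illustrative sketch (the bucketing column and bucket count are assumptions, not from the article), a bucketed version of the hypothetical sales table could be declared as:

-- On older Hive versions this setting is needed so inserts honor the bucket count.
SET hive.enforce.bucketing = true;

-- Rows are assigned to one of 32 buckets by hashing the high-cardinality
-- order_id column; rows with the same order_id always land in the same bucket.
CREATE TABLE sales_bucketed (
    order_id BIGINT,
    amount   DOUBLE
)
PARTITIONED BY (country STRING)
CLUSTERED BY (order_id) INTO 32 BUCKETS
STORED AS ORC;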
Using Tez as the Execution Engine
To look into how Tez helps in optimizing jobs, we will first look at the typical
processing sequence of a MapReduce job:
The Mapper function reads data from the file system and processes it into key-value
pairs, which are stored temporarily on the local disk. These key-value pairs,
grouped by key, are sent to the reducers over the network.
On the nodes where the reducers are to run, the data is received and saved on the local
disk until the output of all the mappers has arrived. Then the entire set of
values for a key is read by a single reducer and processed, and the output is written
back, where it is further replicated based on the configuration.
Tez improves on this sequence in several ways, including:
Skipping the DFS write by the reducers and piping the output of a reducer directly into
the subsequent Mapper as input.
Cost-based Optimizations.
We can set the execution engine using the following query, or by setting it in
hive-site.xml:
set hive.execution.engine=tez;   -- or mr for classic MapReduce
Using Compression
As you might have noticed, Hive queries involve a lot of disk I/O and network I/O
operations, which can be reduced by reducing the size of the data through
compression. Most of the data formats in Hive are text-based formats, which are very
compressible and can lead to big savings. But there is a trade-off to consider: the
CPU cost of compression and decompression.
The main situations where I/O operations are performed, and where compression can
save cost, follow directly from the MapReduce sequence described above: reading the
input data from the DFS, shuffling the intermediate map output to the reducers over the
network, and writing the final output back to the DFS.
Also, since the DFS replicates data to be fault-tolerant, there are additional I/O
operations involved whenever data is replicated.
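As a minimal sketch of where these settings live (the choice of Snappy as the codec is an assumption; any installed codec works), intermediate and final output compression can be switched on per session:

-- Compress the intermediate map output that is shuffled to the reducers.
SET hive.exec.compress.intermediate = true;
-- Compress the final output written back to the DFS.
SET hive.exec.compress.output = true;
SET mapreduce.output.fileoutputformat.compress = true;
SET mapreduce.output.fileoutputformat.compress.codec = org.apache.hadoop.io.compress.SnappyCodec;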
You can import text files compressed with Gzip or Bzip2 directly into a table stored as
TextFile. Compressed data can be loaded into Hive directly, using the LOAD statement or
by creating a table over the compressed data location. The compression will be detected
automatically and the file will be decompressed on the fly during query execution.
However, in this case Hadoop will not be able to split the file into chunks/blocks and
run multiple mappers in parallel. Compressed sequence files, on the other hand, can be split.
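For example (the table name and file path are illustrative assumptions), a Gzip-compressed text file can be loaded into a TextFile table as-is, and Hive decompresses it on the fly at query time:

CREATE TABLE raw_logs (line STRING) STORED AS TEXTFILE;
-- The .gz file is stored unchanged; it is decompressed on the fly at query time,
-- but a single Gzip file is processed by a single mapper.
LOAD DATA LOCAL INPATH '/tmp/access_log.gz' INTO TABLE raw_logs;
SELECT count(*) FROM raw_logs;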
The above optimizations will save a whole lot of execution cost and lead to much
quicker execution of jobs. In the next article, we will discuss the remaining techniques:
optimizations using ORC files, optimizations in join queries, and the Cost-Based
Optimizer.
I hope you find this article informative and easy to learn. If you have any queries, feel free
to reach me at [email protected]