Apache Hive Optimization Techniques - 1
Apache Hive is a query and analysis engine built on top of Apache Hadoop that uses the
MapReduce programming model. It provides an abstraction layer to query big data
using SQL syntax by implementing traditional SQL queries through the Java API. The
main components of Hive are as follows:
Metastore
Driver
Compiler
Optimizer
Executor
Client
While Hadoop/Hive can process nearly any amount of data, optimizations can lead to
big savings, proportional to the amount of data, in terms of processing time and cost.
There are a whole lot of optimizations that can be applied in Hive. Let us look at the
optimization techniques we are going to cover:
1. Partitioning
2. Bucketing
3. Using Tez as the Execution Engine
4. Using Compression
5. Using the ORC Format
6. Join Optimizations
7. Cost-based Optimizer
. . .
Partitioning
Partitioning divides the table into parts based on the values of particular columns. A
table can have multiple partition columns to identify a particular partition. Using
partitions, it is easy to run queries on slices of the data. The data of the partition columns
is not saved in the files; on checking the file structure you would notice that Hive creates
folders on the basis of the partition column values. This makes sure that only relevant data is
read for the execution of a particular job, decreasing the I/O time required by the query
and thus increasing query performance.
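As a minimal sketch (the table and column names here are hypothetical, not from this article), a partitioned table is declared with PARTITIONED BY, and each distinct partition value becomes its own folder on disk:

-- Hypothetical sales table partitioned by country.
CREATE TABLE sales (
    order_id BIGINT,
    amount   DOUBLE
)
PARTITIONED BY (country STRING)
STORED AS TEXTFILE;
-- On disk this produces folders such as .../sales/country=IN/ and
-- .../sales/country=US/; the country value itself is not stored in the data files.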
When we query data on a partitioned table, only the relevant partitions are scanned and
irrelevant partitions are skipped. Now assume that, even after partitioning, the data
in a partition is still quite big; to divide it further into more manageable chunks, we can use
Bucketing.
In insert queries, partitions are mentioned at the start, and their column values are
given along with the values of the other columns, but at the end.
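For illustration, using the hypothetical sales table from above (the staging table is also an assumption), a dynamic-partition insert names the partition column up front and supplies its values last, and a query filtering on the partition column only scans the matching folder:

-- Dynamic partition insert: the partition column appears in the PARTITION clause
-- and its values come at the end of the select list.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT INTO TABLE sales PARTITION (country)
SELECT order_id, amount, country FROM staging_sales;

-- Only the folder country='IN' is scanned; other partitions are skipped.
SELECT order_id, amount FROM sales WHERE country = 'IN';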
Static Partitioning
This is practiced when we have prior knowledge of the partitions into which the data is going
to be loaded. It should be preferred when loading data into a table from large files. It is
performed in strict mode, as sketched below:
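A rough sketch of a static-partition load (the file path and partition value are illustrative assumptions):

SET hive.exec.dynamic.partition.mode = strict;
-- The partition value is supplied explicitly, so Hive does not have to
-- derive it from the data being loaded.
LOAD DATA INPATH '/data/sales/country_in.csv'
INTO TABLE sales PARTITION (country = 'IN');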
. . .
Bucketing
Bucketing provides flexibility to further segregate the data into more manageable
sections called buckets or clusters. Bucketing is based on a hash function, which
depends on the type of the bucketing column. Records with the same bucketing-column
value are always saved in the same bucket. The CLUSTERED BY clause is used to
divide the table into buckets, and it works well for columns with high cardinality.
Bucketing also has its own benefits when used with ORC files and when the bucketing
column is used as the join column. We will discuss these benefits further.
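As an illustrative sketch (the bucketing column and bucket count are assumptions, not from the article), a bucketed version of the hypothetical sales table could be declared as:

-- On older Hive versions this setting is needed so inserts honor the bucket count.
SET hive.enforce.bucketing = true;

-- Rows are assigned to one of 32 buckets by hashing the high-cardinality
-- order_id column; rows with the same order_id always land in the same bucket.
CREATE TABLE sales_bucketed (
    order_id BIGINT,
    amount   DOUBLE
)
PARTITIONED BY (country STRING)
CLUSTERED BY (order_id) INTO 32 BUCKETS
STORED AS ORC;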
Using Tez as the Execution Engine
To look into how Tez helps in optimizing jobs, we will first look at the typical
processing sequence of a MapReduce job:
The Mapper function reads data from the file system and processes it into key-value
pairs, which are stored temporarily on the local disk. These key-value pairs,
grouped by key, are sent to the reducers over the network.
On the nodes where the reducers are to run, the data is received and saved on the local
disk until the output of all the mappers has arrived. Then the entire set of
values for a key is read by a single reducer and processed, and the output is written
back, where it is further replicated based on the configuration.
Tez improves on this sequence in several ways, including:
Skipping the DFS write by the reducers and piping the output of a reducer directly into
the subsequent Mapper as input.
Cost-based Optimizations.
We can set the execution engine using the following query, or by setting it in
hive-site.xml:
set hive.execution.engine=tez;   -- or mr for classic MapReduce
Using Compression
As you might have noticed, Hive queries involve a lot of disk I/O and network I/O
operations, which can be reduced by reducing the size of the data through
compression. Most of the data formats in Hive are text-based formats, which are very
compressible and can lead to big savings. But there is a trade-off to consider: the
CPU cost of compression and decompression.
The main situations where I/O operations are performed, and where compression can
save cost, follow directly from the MapReduce sequence described above: reading the
input data from the DFS, shuffling the intermediate map output to the reducers over the
network, and writing the final output back to the DFS.
Also, since the DFS replicates data to be fault-tolerant, there are additional I/O
operations involved whenever data is replicated.
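As a minimal sketch of where these settings live (the choice of Snappy as the codec is an assumption; any installed codec works), intermediate and final output compression can be switched on per session:

-- Compress the intermediate map output that is shuffled to the reducers.
SET hive.exec.compress.intermediate = true;
-- Compress the final output written back to the DFS.
SET hive.exec.compress.output = true;
SET mapreduce.output.fileoutputformat.compress = true;
SET mapreduce.output.fileoutputformat.compress.codec = org.apache.hadoop.io.compress.SnappyCodec;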
You can import text files compressed with Gzip or Bzip2 directly into a table stored as
TextFile. Compressed data can be loaded into Hive directly, using the LOAD statement or
by creating a table over the compressed data location. The compression will be detected
automatically and the file will be decompressed on the fly during query execution.
However, in this case Hadoop will not be able to split the file into chunks/blocks and
run multiple mappers in parallel. Compressed sequence files, on the other hand, can be split.
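For example (the table name and file path are illustrative assumptions), a Gzip-compressed text file can be loaded into a TextFile table as-is, and Hive decompresses it on the fly at query time:

CREATE TABLE raw_logs (line STRING) STORED AS TEXTFILE;
-- The .gz file is stored unchanged; it is decompressed on the fly at query time,
-- but a single Gzip file is processed by a single mapper.
LOAD DATA LOCAL INPATH '/tmp/access_log.gz' INTO TABLE raw_logs;
SELECT count(*) FROM raw_logs;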
The above optimizations will save a whole lot of execution cost and lead to much
quicker execution of jobs. In the next article, we will discuss the remaining techniques:
optimizations using ORC files, optimizations in join queries, and the Cost-Based
Optimizer.
I hope you find this article informative and easy to learn. If you have any queries, feel free
to reach me at [email protected]