Big Data Analytics and Developers Training Session 10
Hive Advanced
Today’s Objectives
• Use HQL commands to perform DML queries
Apart from the basic SELECT statement, other important HiveQL constructs include the LIMIT
clause, nested queries, CASE...WHEN...THEN expressions, LIKE and RLIKE comparisons, and GROUP BY queries.
Select Statement
SELECT <column1>, <column2> FROM <table_name>;
Select Records from Columns Starting with a Given String
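SELECT `string.*` FROM <table_name>;
(The column specification is a regular expression in backticks; in Hive 0.13 and later it requires hive.support.quoted.identifiers to be set to none.)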
Limit Clause
SELECT * FROM <table_name> LIMIT 10;
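As a minimal sketch of the SELECT and LIMIT forms above, assuming a hypothetical table named employees with columns name, dept and salary (none of these names come from the original slides):

```sql
-- Hypothetical table: employees(name STRING, dept STRING, salary DOUBLE)

-- Project two columns from every row.
SELECT name, salary FROM employees;

-- Return at most 10 rows. Without ORDER BY, which rows come back is not guaranteed.
SELECT * FROM employees LIMIT 10;
```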
Nested Queries
SELECT * FROM <table_name> WHERE <column> IN (SELECT <column>
FROM <table_name>);
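The two common shapes of a nested query can be sketched as follows. The tables employees and departments and all of their columns are hypothetical, and the literal values are only illustrative.

```sql
-- Hypothetical tables:
--   employees(name STRING, dept STRING, salary DOUBLE)
--   departments(dept_name STRING, location STRING)

-- Subquery in the WHERE clause (IN form, available in Hive 0.13 and later).
SELECT *
FROM employees
WHERE dept IN (SELECT dept_name FROM departments WHERE location = 'HQ');

-- Subquery in the FROM clause: the inner query's output is the outer query's input.
-- Hive requires an alias (here t) for a FROM-clause subquery.
SELECT t.dept, t.avg_salary
FROM (SELECT dept, AVG(salary) AS avg_salary
      FROM employees
      GROUP BY dept) t
WHERE t.avg_salary > 50000;
```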
CASE...WHEN...THEN Queries
SELECT <column1>, CASE WHEN <condition1> THEN <option1> WHEN <condition2>
THEN <option2> ELSE <option3> END AS <column2> FROM <table_name>;
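A small illustration of the CASE form, assuming the same hypothetical employees table. The thresholds and labels are made up for the example.

```sql
-- Classify each employee into a salary band.
-- Branches are evaluated in order; the first matching WHEN wins.
SELECT name,
       CASE
         WHEN salary < 30000 THEN 'low'
         WHEN salary < 70000 THEN 'medium'
         ELSE 'high'
       END AS salary_band
FROM employees;
```

Note that the WHEN branches are separated by whitespace only, not by commas.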
LIKE and RLIKE Queries
SELECT * FROM <table_name> WHERE <column1> LIKE '%string%';
SELECT * FROM <table_name> WHERE <column2> RLIKE '.*(string).*';
For example, 'foobar' RLIKE '^f.*r$' evaluates to TRUE
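The difference between the two operators can be sketched like this, again assuming a hypothetical employees table with a name column. LIKE uses SQL wildcards, while RLIKE takes a Java regular expression.

```sql
-- LIKE: % matches any run of characters, _ matches exactly one character.
SELECT * FROM employees WHERE name LIKE '%son%';

-- RLIKE: Java regular expression; it is TRUE if any substring of the value matches.
-- Here the anchors ^ and $ force the whole name to start with A and end with n.
SELECT * FROM employees WHERE name RLIKE '^A.*n$';

-- Checking literals directly (SELECT without FROM needs Hive 0.13 or later).
SELECT 'foobar' RLIKE 'foo';     -- TRUE: 'foo' occurs inside 'foobar'
SELECT 'foobar' RLIKE '^f.*r$';  -- TRUE: starts with f, ends with r
```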
GROUP BY Queries
SELECT <column1>, <aggregate_function>(<column2>) FROM <table_name> GROUP BY <column1>;
HAVING Queries
SELECT <column1>, <aggregate_function>(<column2>) FROM <table_name> GROUP BY <column1> HAVING
<column1> = <value1> OR <column1> = <value2>;
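A brief sketch of GROUP BY and HAVING together, assuming the hypothetical employees table. Every selected column that is not inside an aggregate function has to appear in the GROUP BY clause.

```sql
-- One output row per department.
SELECT dept, COUNT(*) AS headcount, AVG(salary) AS avg_salary
FROM employees
GROUP BY dept;

-- HAVING filters the groups after aggregation (WHERE filters rows before it).
SELECT dept, COUNT(*) AS headcount
FROM employees
GROUP BY dept
HAVING COUNT(*) > 5;
```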
Manipulating Column Values Using Functions
Types of Functions in Hive
Built-In Functions in Hive
• Arithmetic Functions
• Mathematical Functions
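To illustrate, a few of Hive's arithmetic operators and built-in mathematical functions applied to a column might look like this. The employees table and its salary column are hypothetical.

```sql
SELECT name,
       salary * 1.10      AS raised_salary,   -- arithmetic operator
       salary % 1000      AS remainder,       -- modulo operator
       round(salary, 2)   AS rounded,         -- round to 2 decimal places
       floor(salary / 12) AS monthly_floor,   -- largest integer <= value
       ceil(salary / 12)  AS monthly_ceil,    -- smallest integer >= value
       sqrt(salary)       AS root,            -- square root
       pow(salary, 2)     AS squared          -- power
FROM employees;
```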
Outer Join
• Left Outer Join
• Right Outer Join
• Full Outer Join
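The three outer joins can be sketched as follows, assuming hypothetical tables employees(name, dept) and departments(dept_name, location):

```sql
-- LEFT OUTER JOIN: every employee; location is NULL when no department matches.
SELECT e.name, d.location
FROM employees e
LEFT OUTER JOIN departments d ON (e.dept = d.dept_name);

-- RIGHT OUTER JOIN: every department; name is NULL when it has no employees.
SELECT e.name, d.location
FROM employees e
RIGHT OUTER JOIN departments d ON (e.dept = d.dept_name);

-- FULL OUTER JOIN: all rows from both sides, with NULLs wherever there is no match.
SELECT e.name, d.location
FROM employees e
FULL OUTER JOIN departments d ON (e.dept = d.dept_name);
```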
Cross Join
SELECT * FROM <table1> JOIN <table2>;
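A JOIN with no ON condition produces the Cartesian product of the two tables. Hive 0.10 and later also accept the explicit CROSS JOIN keyword, which makes the intent clearer. A sketch with the hypothetical employees and departments tables:

```sql
-- Every employee row is paired with every department row,
-- so the result has rows(employees) * rows(departments) rows.
SELECT e.name, d.dept_name
FROM employees e
CROSS JOIN departments d;
```

Because the result grows multiplicatively, cross joins are best kept to small tables. Hive's strict mode typically rejects Cartesian products.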
Hive Best Practices
Apart from these points, effective use of compression is also a Hive best practice.
Performance Tuning and Query Optimization
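Tuning or optimizing Hive queries requires an understanding of how a Hive query is executed.
The EXPLAIN command shows the execution plan that Hive generates for a query.
Its output usually consists of three parts: the abstract syntax tree of the query, the dependencies between the stages of the plan, and a description of each stage.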
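As a sketch, EXPLAIN is simply prefixed to the query to be inspected. The employees table is hypothetical.

```sql
-- Show the plan for an aggregation query without running it.
EXPLAIN
SELECT dept, COUNT(*) FROM employees GROUP BY dept;

-- EXTENDED adds further detail, such as file paths, to the plan output.
EXPLAIN EXTENDED
SELECT dept, COUNT(*) FROM employees GROUP BY dept;
```

The exact layout of the output varies between Hive versions.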
Parallel Execution
Hive breaks a job into stages that are executed sequentially by default. Setting
hive.exec.parallel to true enables parallel execution of the stages that do not depend on one another.
Indexes
Speculative Execution
Hive allows indexes to be created on columns to speed up operations such as GROUP BY.
It also allows speculative execution of tasks, which is controlled by the parameters given below:
Parameter                                    Description
mapred.map.tasks.speculative.execution       Runs more than one instance of map tasks.
mapred.reduce.tasks.speculative.execution    Runs more than one instance of reduce tasks.
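The parallel and speculative execution settings can be applied for a single session with SET statements, as sketched below. The thread count shown is only an illustrative value.

```sql
-- Allow independent stages of a query to run at the same time.
SET hive.exec.parallel=true;
-- Upper bound on how many stages may run in parallel (illustrative value).
SET hive.exec.parallel.thread.number=8;

-- Launch duplicate attempts of slow map and reduce tasks; the first to finish wins.
SET mapred.map.tasks.speculative.execution=true;
SET mapred.reduce.tasks.speculative.execution=true;
```

Settings made this way last only for the current session. Cluster-wide defaults are normally placed in the Hive or Hadoop configuration files.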
Hive File and Record Formats
File Formats
Text Files
• Input Format: org.apache.hadoop.mapred.TextInputFormat
• Output Format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Sequence File
RCFile
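The file format is chosen per table with the STORED AS clause. A minimal sketch, using hypothetical table names:

```sql
-- Plain text, the default format.
CREATE TABLE logs_text (line STRING) STORED AS TEXTFILE;

-- The same format, spelled out with the input and output format classes.
CREATE TABLE logs_explicit (line STRING)
STORED AS
  INPUTFORMAT  'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

-- Binary key-value container.
CREATE TABLE logs_seq (line STRING) STORED AS SEQUENCEFILE;

-- Record Columnar File.
CREATE TABLE logs_rc (line STRING) STORED AS RCFILE;

-- Convert existing data by selecting from one table into another.
INSERT OVERWRITE TABLE logs_seq SELECT line FROM logs_text;
```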
Record Formats (SerDes)
A SerDe (serializer/deserializer) holds the logic to convert unstructured data into records and is implemented in Java.
• Regex
• Avro
• ORC
• Thrift
In addition to the Regex and Avro SerDes, Hive offers a JSON SerDe. JavaScript Object Notation (JSON) is a standard format that uses human-readable text to represent data objects consisting of attribute-value pairs.
JSON is commonly used to transmit data between applications, and the JSON SerDe lets Hive read and write such data.
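A SerDe is attached to a table with ROW FORMAT SERDE. The sketch below uses hypothetical table and column names together with two SerDe classes that ship with Hive. The JSON class lives in the hive-hcatalog-core JAR, which may need to be added to the session first.

```sql
-- Regex SerDe: each capture group in input.regex maps to one column, in order.
-- All columns must be STRING. This pattern splits a line at the first space.
CREATE TABLE access_log (host STRING, request STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "([^ ]*) (.*)")
STORED AS TEXTFILE;

-- JSON SerDe: one JSON object per line; attribute names map to column names.
CREATE TABLE events_json (id INT, name STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;
```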
Features of JSON SerDe
Hive security is primarily a matter of two concepts: authentication and authorization.
Points to Remember
Hive supports nested queries, where the output of the inner query is used as the
input to the outer query.
Hive allows the use of CASE statements to classify records depending on various
inputs.
LIKE and RLIKE operators compare and match strings or substrings from a given set of
records.
Hive supports joining two or more tables to get useful aggregate information. The
various joins Hive supports are as follows:
• Inner joins
• Outer joins
Map-side joins are recommended when you need to join two tables and one of them is
small enough to be held in memory.
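There are two usual ways to request a map-side join, sketched here with the hypothetical employees (large) and departments (small) tables:

```sql
-- Option 1: let Hive convert eligible joins automatically,
-- based on the size of the smaller table.
SET hive.auto.convert.join=true;
SELECT e.name, d.location
FROM employees e
JOIN departments d ON (e.dept = d.dept_name);

-- Option 2: name the small table explicitly with a MAPJOIN hint.
-- Newer Hive versions may ignore the hint unless hive.ignore.mapjoin.hint is false.
SELECT /*+ MAPJOIN(d) */ e.name, d.location
FROM employees e
JOIN departments d ON (e.dept = d.dept_name);
```

In both cases the small table is loaded into memory on each mapper, so the join needs no reduce phase.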
Partitioning in Hive makes it easy to maintain a separate partition for each new day
of data.
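A daily partitioning scheme might look like the following sketch. The table, columns and date value are hypothetical.

```sql
-- The partition column dt is not stored in the data files;
-- each value gets its own directory under the table location.
CREATE TABLE page_views (user_id STRING, url STRING)
PARTITIONED BY (dt STRING);

-- Register the partition for a new day.
ALTER TABLE page_views ADD IF NOT EXISTS PARTITION (dt = '2024-01-15');

-- A filter on the partition column reads only that day's directory.
SELECT COUNT(*) FROM page_views WHERE dt = '2024-01-15';
```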
Hive does not have any primary and foreign keys because it is not meant to run complex
relational queries.
By keeping data denormalized, you avoid the multiple disk seeks that are generally needed
when data is spread across tables linked by foreign key relations.
By default, Hive does a complete table scan to process a query; partitioning limits the scan to the relevant partitions.
Compressing data on HDFS reduces the amount of data that has to be read from disk, which
ultimately helps in reducing the query time.
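Compression is usually switched on through session settings such as the ones sketched below. The Gzip codec is only an example choice; other codecs can be substituted.

```sql
-- Compress the data passed between the stages of a query.
SET hive.exec.compress.intermediate=true;

-- Compress the final output written by the query.
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
```

Gzip-compressed text files cannot be split across mappers, so a splittable container such as a sequence file is often paired with compression.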
A sequence file improves performance because it stores key-value pairs in a compact
binary format.