Chapter - 4 - Data Access - Hive
Hive
References
1. Achari, Shiva. Hadoop Essentials. Packt Publishing, 2015. ProQuest Ebook Central, https://ebookcentral.proquest.com/lib/shctom/detail.action?docID=2039889
2. https://www.edureka.co/blog/hive-tutorial/
3. https://www.guru99.com/introduction-hive.html
4. https://cwiki.apache.org/confluence/display/Hive/Tutorial
5. https://www.simplilearn.com/tutorials/hadoop-tutorial/hive
What is Hive?
Hive is a data warehousing infrastructure built on Apache Hadoop that provides an SQL-like language for
querying and analyzing Big Data.
Hive provides a mechanism to project structure onto the data and query it using an SQL-like language
called HiveQL.
Hive uses MapReduce for processing and HDFS for storage and retrieval of data.
Hive is used for analyzing structured and semi-structured data.
SQL commands in Hive are written in HiveQL.
HiveQL queries are converted into MapReduce jobs by the Hive compiler.
Apache Hive supports Data Definition Language (DDL), Data Manipulation Language (DML) and User
Defined Functions (UDFs).
Hive is not designed for online transaction processing; it is best used for traditional data warehousing tasks.
Reference #1: Page 83 & Reference #2 and #4
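The SQL-like flavor of HiveQL can be seen in a small sketch; the `sales` table and its columns here are hypothetical, not from the references:

```sql
-- Hypothetical table of sales records stored as files in HDFS
SELECT region, SUM(amount) AS total_sales
FROM sales
WHERE sale_date >= '2015-01-01'
GROUP BY region;
```

The Hive compiler turns a query like this into one or more MapReduce jobs that scan the table's files in HDFS.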
Advantages of Hive
It is an efficient ETL tool
Provides querying and analysis capabilities
HiveQL is similar to SQL, and thus easy to understand
Performs analytics on large datasets and works well for complex queries
Reduces the need to write complex MapReduce programs to process data using
Hadoop
Where Not to Use Hive?
When the data to be processed is less than a GB
When a schema is difficult or impossible to derive for the data
When a response is needed in seconds, i.e., for low-latency applications
When the problem can be solved with an RDBMS
Hive Features
It stores the schema in a database and the processed data in HDFS.
It is designed for OLAP.
It provides an SQL-type query language called HiveQL (or HQL).
It is fast, scalable and extensible.
Hive architecture
Hive architecture has different components such as clients, services, and storage:
Hive Clients: applications interact with Hive through Thrift clients, JDBC drivers and
ODBC drivers. These clients and drivers in turn communicate with the Hive server in the Hive services layer.
Reference #3
Hive architecture
Hive Services:
Hive CLI (Command Line Interface): This is the default shell provided by
the Hive where you can execute your Hive queries and commands directly.
Apache Hive Web Interfaces: Apart from the command line interface, Hive
also provides a web-based GUI for executing Hive queries and commands.
Hive Server: Hive server is built on Apache Thrift and is therefore also
referred to as the Thrift Server; it allows different clients to submit requests to
Hive and retrieve the result.
Reference #3
Hive architecture
Hive Services:
Apache Hive Driver: It is responsible for receiving the queries submitted through the
CLI, the web UI, Thrift, ODBC or JDBC interfaces by a client.
Then, the driver passes the query to the compiler where parsing, type checking and
semantic analysis takes place with the help of schema present in the metastore.
In the next step, an optimized logical plan is generated in the form of a DAG (Directed
Acyclic Graph) of MapReduce tasks and HDFS tasks.
Finally, the execution engine executes these tasks in the order of their dependencies,
using Hadoop.
Reference #3
Hive architecture
Hive Services:
Metastore: The Metastore acts as a central repository for storing all the Hive metadata
information.
Metastore stores all the details about the tables, partitions, schemas, columns, types,
and so on which is required for Read/Write operation on the data present in HDFS.
Metastore is very critical for Hive without which the structure design details cannot be
retrieved and data cannot be accessed. Hence, Metastore is backed up regularly.
Hive ensures that the Metastore is not directly accessed by the Mappers and Reducers of a job;
instead, the information needed at runtime is passed to them through an XML plan
generated by the compiler.
Reference #1 Page 84 and Reference #3
Job Execution in Hive
7. The Execution Engine (EE) acts as a bridge between Hive and Hadoop to process the query. It executes
the plan step by step, waiting for each task's dependencies to complete before running it. The
results of tasks are stored in a temporary location, and in the final step the data is moved to the
desired location.
Reference #3
Job Execution in Hive
The EE first contacts the Name Node and then the Data Nodes to get the values stored in tables.
The EE fetches the desired records from the Data Nodes: the actual table data resides only on the
Data Nodes, while the Name Node supplies just the metadata needed for the query.
It collects the actual data relevant to the query from the Data Nodes.
The Execution Engine communicates bi-directionally with the Metastore in Hive to
perform DDL (Data Definition Language) operations such as CREATE, DROP and ALTER on
tables and databases. The Metastore stores only information such as database names, table
names and column names, which it supplies for the query at hand.
The Execution Engine in turn communicates with Hadoop daemons such as the Name Node,
Data Nodes and Job Tracker to execute the query on top of the Hadoop file system.
8. The driver fetches the results from the Execution Engine.
9. Once the results are fetched from the Data Nodes, the EE sends them back to the driver,
which forwards them to the UI (front end).
Data Units in Hive
Hive data is organized into:
Databases: Namespaces that avoid naming conflicts for tables, views, partitions,
columns, and so on. Databases can also be used to enforce security for a user or group
of users.
Tables: Homogeneous units of data which have the same schema.
Partitions: Each table can have one or more partition keys, which determine how the
data is stored.
Buckets (or Clusters): Data in each partition may in turn be divided into buckets based
on the value of a hash function of some column of the table.
Reference #4
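The four data units can be sketched in DDL; the database, table and column names below are hypothetical:

```sql
CREATE DATABASE IF NOT EXISTS retail;        -- database: a namespace

CREATE TABLE retail.orders (                 -- table: rows sharing one schema
  order_id INT,
  amount   DOUBLE
)
PARTITIONED BY (order_date STRING)           -- partition key: one HDFS directory per date
CLUSTERED BY (order_id) INTO 8 BUCKETS;      -- buckets: rows assigned by hash(order_id) mod 8
```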
Modes of Hive
Hive operates in two modes, depending on the number of data nodes and the size of the data.
They are:
Local Mode - Used when Hadoop has one data node and the amount of data is
small. Processing is very fast on smaller datasets that are present on the
local machine.
MapReduce Mode - Used when the data in Hadoop is spread across multiple
data nodes. Processing large datasets is more efficient in this mode.
Reference #5
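As a sketch of how the mode is selected in practice: `hive.exec.mode.local.auto` is a real Hive session property that lets Hive pick local mode automatically when the input is small; this setting is illustrative, not from the references above:

```sql
-- Let Hive choose local mode automatically for small inputs
SET hive.exec.mode.local.auto=true;
```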
Hive vs RDBMS
Hive enforces schema on read; an RDBMS enforces schema on write.
Hive data size is in petabytes; RDBMS data size is in terabytes.
Hive is based on the notion of write once, read many times; an RDBMS is based on the notion of reading and writing many times.
Hive resembles a traditional database by supporting SQL, but it is a data warehouse, not a database; an RDBMS is a database management system based on the relational model of data.
Hive is easily scalable at low cost; an RDBMS is not scalable at low cost.
Reference #5
Hive Query Language : Data Types (Numeric Type)
TINYINT (1 Byte Signed Integer)
SMALLINT (2 Byte Signed Integer)
INT/INTEGER (4 Byte Signed Integer)
BIGINT (8 Byte Signed Integer)
FLOAT ( 4 Byte Single Precision Floating Point Number)
DOUBLE (8 Byte Double Precision Floating Point Number)
DECIMAL
NUMERIC (Same as Decimal)
Hive Query Language : Data Types (Date/Time Type)
TIMESTAMP yyyy-mm-dd hh:mm:ss[.f...]
DATE YYYY-MM-DD
INTERVAL e.g., INTERVAL '1' DAY
Hive Query Language : Data Types (String Type)
STRING (unbounded sequence of characters)
VARCHAR (variable length, 1 to 65535 characters)
CHAR (fixed length, up to 255 characters)
Hive Query Language : Data Types (Complex Type)
ARRAY (ordered collection of elements of the same type)
MAP (collection of key-value pairs)
STRUCT (collection of named fields, possibly of different types)
UNIONTYPE (a value that may be any one of several specified types)
Hive Query Language : Queries
CREATING A DATABASE
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT database_comment] [LOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, ...)]
SHOWING DATABASES
Hive> SHOW DATABASES;
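A concrete instance of the CREATE DATABASE syntax; the comment text and HDFS path here are hypothetical:

```sql
CREATE DATABASE IF NOT EXISTS utas
COMMENT 'Student data warehouse'
LOCATION '/user/hive/warehouse/utas.db';
```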
Hive Query Language : Queries
DROPPING DATABASE
DROP (DATABASE | SCHEMA) [IF EXISTS] database_name [RESTRICT |
CASCADE];
Hive> DROP DATABASE IF EXISTS UTAS;
USING A DATABASE
USE database_name;
Hive Query Language : Queries
CREATING A TABLE
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name [(col_name data_type [COMMENT
col_comment], ...)] [COMMENT table_comment] [PARTITIONED BY (col_name data_type
[COMMENT col_comment], ...)] [CLUSTERED BY (col_name, col_name, ...) [SORTED BY
(col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS] [ROW FORMAT row_format]
[STORED AS file_format] [LOCATION hdfs_path] [TBLPROPERTIES
(property_name=property_value, ...)] [AS select_statement]
Hive Query Language : Queries
LOADING DATA IN TABLE
LOAD DATA [LOCAL] INPATH '<the table data location>' [OVERWRITE] INTO
TABLE <table_name> [PARTITION (partcol1=val1, partcol2=val2, ...)]
Hive> LOAD DATA LOCAL INPATH '/user/cloudera/stu.txt' INTO TABLE
STUDENT;
DISPLAYING CONTENTS OF TABLE
Hive> select * from student;
ALTERING A TABLE
ALTER TABLE <table_name> ADD COLUMNS (column type);
Hive> alter table student add columns (grade string);
RENAMING A TABLE
ALTER TABLE <table_name> RENAME TO <new table_name>
Hive > alter table student rename to students;
DROPPING TABLE
DROP TABLE <table_name>
Hive > drop table students;
Hive Query Language : SELECT Operation
SELECT: SELECT is the projection operator in SQL. The clauses used for this
function are:
SELECT scans the table specified by the FROM clause
WHERE gives the condition of what to filter
GROUP BY gives a list of columns which then specify how to aggregate the
records
CLUSTER BY, DISTRIBUTE BY, and SORT BY specify the sort order and
algorithm
LIMIT specifies the number of records to retrieve
SELECT [ALL | DISTINCT] select_expr, select_expr, ... FROM table_reference
[WHERE where_condition] [GROUP BY col_list] [HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]] [LIMIT
number];
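The clauses above combine in a query like this sketch, using the `student` table from the earlier slides with its `grade` column:

```sql
SELECT grade, COUNT(*) AS num_students
FROM student
WHERE grade IS NOT NULL
GROUP BY grade
LIMIT 10;
```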
Hive Query Language : JOINS
HiveQL supports the following types of joins:
JOIN
LEFT OUTER JOIN
RIGHT OUTER JOIN
FULL OUTER JOIN
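A sketch of a LEFT OUTER JOIN in HiveQL; the `dept` table and the join columns are hypothetical:

```sql
-- Every student is returned, with NULL dept_name where no match exists
SELECT s.name, d.dept_name
FROM student s
LEFT OUTER JOIN dept d
ON (s.dept_id = d.dept_id);
```

Note that classic Hive joins require equality conditions in the ON clause.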
Hive Query Language : Aggregation
HiveQL supports aggregations and also allows for multiple aggregations to be
done at the same time.
The possible aggregators are:
count(*), count(expr), count(DISTINCT expr[, expr...])
sum(col), sum(DISTINCT col)
avg(col), avg(DISTINCT col)
min(col)
max(col)
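Several aggregations can be computed in a single pass, as the slide describes; the `sales` table here is hypothetical:

```sql
SELECT COUNT(*), SUM(amount), AVG(amount), MIN(amount), MAX(amount)
FROM sales;
```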
Hive Query Language : Built-In Functions
Hive has numerous built-in functions and some of its widely used functions
are:
concat(string A, string B,...)
substr(string A, int start)
round(double a)
upper(string A), lower(string A)
trim(string A)
to_date(string timestamp)
year(string date), month(string date), day(string date)
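A sketch combining several of these built-in functions; the `student` table's `name` and `joined` columns are hypothetical:

```sql
SELECT upper(trim(name)),        -- normalize the name
       substr(name, 1, 3),       -- first three characters
       year(joined), month(joined)
FROM student;
```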
Hive: Partitioning
Partitions are a way to divide a table into coarse-grained parts,
based on the value of a partition column such as 'date'.
Using partitions, you can make queries on slices of the data faster.
A table can have one or more partition columns. A separate data directory
is created for each distinct value combination of the partition columns.
Partitions are defined at the time the table is created.
Usage
Use the PARTITIONED BY clause with a list of column definitions when
creating the table.
Partitions can be added or removed using the ALTER TABLE statement.
Partitions can be viewed using, for example, SHOW PARTITIONS logs;
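The usage points above can be sketched together; the `logs` table and its columns are hypothetical, and each distinct (dt, country) pair gets its own HDFS directory:

```sql
CREATE TABLE logs (ts BIGINT, line STRING)
PARTITIONED BY (dt STRING, country STRING);

-- Add a partition explicitly, then list all partitions
ALTER TABLE logs ADD PARTITION (dt='2015-01-01', country='GB');
SHOW PARTITIONS logs;
```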
Hive: Bucketing
We just discussed the fact about partitioning that it can unevenly distribute the data,
but usually it is very less likely to get even distribution. But, we can achieve almost
even distributed data for processing using bucketing. Bucketing has a value of data
into a bucket due to which the same value records can be present in the same
bucket, and a bucket can have multiple groups of values. Bucketing provides control
to a number of files, as we have to mention the number of buckets while using
bucketing in create table using CLUSTERED BY (month) INTO #noofBuckets
BUCKETS.
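A sketch of a bucketed table; the table and column names are hypothetical, and Hive assigns each row to bucket hash(month) mod 12:

```sql
CREATE TABLE sales_bucketed (id INT, amount DOUBLE, month STRING)
CLUSTERED BY (month) INTO 12 BUCKETS;

-- In older Hive versions this setting is needed so that
-- INSERT statements actually populate all the buckets
SET hive.enforce.bucketing=true;
```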