Course On: Big Data Analytics
Jaya Gangwani
Dept of CSE
Unit 5: Hive
Outline
• Introduction to Hive
• Hive Features
• Limitations
• Hive Architecture
• Hive Workflow
• Data types
• Data Models
• Hive Built-in Functions
• Hive Commands
What is Hive?
• Hive is a data warehouse system used to analyze structured data. It is built on top of Hadoop and was originally developed by Facebook.
• Hive provides the functionality of reading, writing, and managing large datasets residing in distributed storage. It runs SQL-like queries, called HQL (Hive Query Language), which are internally converted into MapReduce jobs.
• Using Hive, we can avoid the traditional approach of writing complex MapReduce programs. Hive supports Data Definition Language (DDL), Data Manipulation Language (DML), and User-Defined Functions (UDFs).
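For example, a simple HQL query looks just like SQL (table and column names here are illustrative); Hive compiles it into MapReduce jobs behind the scenes:

```sql
-- Average salary per department; Hive turns this GROUP BY
-- into a MapReduce job over the files backing the table.
SELECT dept, AVG(salary) AS avg_salary
FROM employees
GROUP BY dept;
```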
Features of Hive
Hive offers the following features:
• Hive is fast and scalable.
• It provides SQL-like queries (i.e., HQL) that are implicitly transformed
to MapReduce or Spark jobs.
• It is capable of analyzing large datasets stored in HDFS.
• It allows different storage types such as plain text, RCFile, and HBase.
• It uses indexing to accelerate queries.
• It can operate on compressed data stored in the Hadoop ecosystem.
• It supports user-defined functions (UDFs), so users can plug in custom logic.
Limitations
• Not a full database: row-level update, alter, and delete of records are supported only on ACID-enabled tables.
• Not designed for unstructured data.
• Hive cannot handle real-time data.
• It is not designed for online transaction processing (OLTP).
• Hive queries have high latency.
Hive vs. Pig
(comparison table)
Hive vs. RDBMS
(comparison table)
Hive Architecture
(architecture diagram)
Hive Architecture
The Hive architecture can be categorized into the following components:
1. Hive Clients: Hive supports applications written in many languages, such as Java, C++, and Python, using the JDBC, Thrift, and ODBC drivers. Hence, a Hive client application can be written in the language of one's choice.
2. Hive Services: Apache Hive provides various services, such as the CLI and the web interface, for performing queries.
3. Processing Framework and Resource Management: Internally, Hive uses the Hadoop MapReduce framework as the default engine to execute queries.
4. Distributed Storage: As Hive is installed on top of Hadoop, it uses the underlying HDFS for distributed storage.
Hive Architecture
1. Hive Clients:
Apache Hive supports different types of client applications for performing
queries on Hive. These clients can be categorized into three types:
• Thrift Clients: As the Hive server is based on Apache Thrift, it can serve
requests from any programming language that supports Thrift.
• JDBC Clients: Hive allows Java applications to connect to it using the
JDBC driver which is defined in the class
org.apache.hadoop.hive.jdbc.HiveDriver.
• ODBC Clients: The Hive ODBC Driver allows applications that support
the ODBC protocol to connect to Hive. (Like the JDBC driver, the ODBC
driver uses Thrift to communicate with the Hive server.)
Hive Architecture
2. Hive Services:
Hive provides many services. Let us have a look at each of them:
• Hive CLI (Command Line Interface): This is the default shell provided by
the Hive where you can execute your Hive queries and commands
directly.
• Apache Hive Web Interfaces: Apart from the command line interface,
Hive also provides a web based GUI for executing Hive queries and
commands.
• Hive Server: The Hive server is built on Apache Thrift and is therefore also
referred to as the Thrift Server. It allows different clients to submit requests
to Hive and retrieve the final result.
Hive Architecture
• Apache Hive Driver: It is responsible for receiving the queries submitted through
the CLI, the web UI, Thrift, ODBC or JDBC interfaces by a client. Then, the driver
passes the query to the compiler where parsing, type checking and semantic
analysis takes place with the help of schema present in the metastore. In the next
step, an optimized logical plan is generated in the form of a DAG (Directed Acyclic
Graph) of map-reduce tasks and HDFS tasks. Finally, the execution engine
executes these tasks in the order of their dependencies, using Hadoop.
• Metastore: The metastore is a central repository for all the Hive metadata. This metadata includes the structure of tables and partitions, along with column names, column types, and the serializers and deserializers required for read/write operations on the data in HDFS. The metastore comprises two fundamental units:
• A service that provides metastore access to other Hive services.
• Disk storage for the metadata, separate from HDFS storage.
Hive Dataflow
The data flows in the following sequence:
1. We execute a query, which goes to the driver.
2. The driver asks the compiler for a query execution plan.
3. The compiler requests the metadata from the metastore.
4. The metastore responds with the metadata.
5. The compiler gathers this information and sends the plan back to the driver.
6. The driver sends the execution plan to the execution engine.
7. The execution engine acts as a bridge between Hive and Hadoop to process the query.
8. The execution engine also communicates bidirectionally with the metastore to perform operations such as creating and dropping tables.
9. Finally, the results are fetched and sent back to the client over a bidirectional channel.
Hive Data Types
Hive Data types are used for specifying the column/field type
in Hive tables. Hive data types can be classified into two
categories.
• Primitive Data Types
• Complex Data Types
Hive Data Types - Primitive Data Types
Primitive data types are further classified into four categories:
• Numeric Types
• String Types
• Date/Time Types
• Miscellaneous Types
These data types and their sizes are similar to the primitive data types of Java/SQL.
Hive Data Types - Numeric Data Types
(table of numeric types and their ranges)
Hive Data Types - Numeric Data Types
• In Hive, integral literals are treated as INT by default unless they exceed the INT range. To have a small literal such as 100 treated as TINYINT, SMALLINT, or BIGINT, suffix it with Y, S, or L respectively.
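As a sketch (requires a Hive version that allows SELECT without a FROM clause):

```sql
-- 100 alone is an INT literal; suffixes force a specific integral type.
SELECT 100   AS an_int,
       100Y  AS a_tinyint,
       100S  AS a_smallint,
       100L  AS a_bigint;
```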
Hive Data Types - String Data Types
CHAR vs VARCHAR
• CHAR is fixed length; values shorter than the declared length are padded with spaces.
• VARCHAR is variable length, but the maximum length of the field must be specified (example: name VARCHAR(64)). If a value is shorter than the maximum length, the remaining space is released.
• The maximum length of CHAR is 255, while VARCHAR can be up to 65535.
• VARCHAR optimizes storage by releasing unused bytes; in CHAR, unused bytes are not released but are filled with spaces.
• If a string value assigned to a VARCHAR column exceeds the specified length, the string is silently truncated.
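A minimal sketch of the two types in a table definition (table and column names are illustrative):

```sql
CREATE TABLE employees (
  country CHAR(2),      -- always stored as 2 characters, space-padded
  name    VARCHAR(64)   -- stores only the characters actually used
);
```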
Hive Data Types - Time/Date Types
Hive Data Types - Miscellaneous Types
BOOLEAN
• Boolean types in Hive store either true or false.
BINARY
• BINARY type in Hive is an array of bytes.
Hive Data Types – Complex Data Types
• In addition to primitive data types, Hive also supports complex
data types (also known as collection data types), which are
not available in many RDBMSs.
• Complex types can be built from primitive types and other
composite types. The data types of the fields in a collection are
specified using angle-bracket notation. Currently, Hive
supports four complex data types. They are:
• Array
• Map
• Struct
• Union
Hive Data Types – Complex Data Types
• ARRAY – An ordered sequence of elements of the same type,
indexable using zero-based integers. It is similar to arrays in Java.
• Example – array('siva', 'bala', 'praveen'); the second element is
accessed with array[1].
Hive Data Types – Complex Data Types
• STRUCT – Similar to a struct in the C language. It is a record type
which encapsulates a set of named fields, each of which can be any
primitive data type. Elements of a STRUCT are accessed using dot (.)
notation.
• Example – For a column c of type STRUCT {a INT; b INT}, the field a is accessed
by the expression c.a
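As a sketch, a table mixing these complex types (all names are illustrative):

```sql
CREATE TABLE employee (
  name    STRING,
  skills  ARRAY<STRING>,                 -- accessed as skills[0]
  address STRUCT<city:STRING, zip:INT>   -- accessed as address.city
);

-- Accessing complex fields in a query:
SELECT name, skills[0], address.city FROM employee;
```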
Hive Data Models
Databases
• Namespace for tables.
Tables
• Similar to tables in RDBMS.
• Support filter, project, join and union operations.
• Table data is stored in a directory in HDFS.
Hive Data Models
Partitions
• A table can have one or more partition keys that determine how its
data is stored.
Buckets
• Data in a partition can be further divided into buckets, based on the
hash of a column in the table.
• Each bucket is stored as a file in the partition directory.
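A minimal sketch of a partitioned, bucketed table (table, column, and bucket-count choices are illustrative):

```sql
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING
)
PARTITIONED BY (view_date STRING)        -- one HDFS subdirectory per date
CLUSTERED BY (user_id) INTO 32 BUCKETS;  -- 32 files per partition, by hash of user_id
```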
Hive Built-in Functions
• Hive provides various built-in functions, categorized as:
• Mathematical Functions
• Aggregate Functions
• String Functions
Hive Built-in Functions - Mathematical Functions
(table of functions)
Hive Built-in Functions - Aggregate Functions
(table of functions)
Hive Built-in Functions - String Functions
(table of functions)
Hive Commands
• Hive DDL commands are the statements used for defining and
changing the structure of a table or database in Hive. It is used
to build or modify the tables and other objects in the database.
• The several types of Hive DDL commands are:
1. CREATE
2. SHOW
3. DESCRIBE
4. USE
5. DROP
6. ALTER
7. TRUNCATE
Hive Commands
1. CREATE DATABASE in Hive
• The CREATE DATABASE statement is used to create a database in
Hive. The keywords DATABASE and SCHEMA are interchangeable; either
may be used.
• Syntax:
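As a sketch, the general form is (database name is illustrative):

```sql
CREATE DATABASE [IF NOT EXISTS] database_name;

-- Example:
CREATE DATABASE IF NOT EXISTS student_db;
```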
Hive Commands
7. TRUNCATE TABLE
• TRUNCATE TABLE statement in Hive removes all the rows from the
table or partition.
• Syntax:
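As a sketch (table and partition names are illustrative):

```sql
-- Remove all rows from a table:
TRUNCATE TABLE page_views;

-- Or remove rows from specific partitions only:
TRUNCATE TABLE page_views PARTITION (view_date = '2023-01-01');
```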
Hive Commands
• Hive DML (Data Manipulation Language) commands are used to insert,
update, retrieve, and delete data from a Hive table once the table
and database schema have been defined using Hive DDL commands.
• The various Hive DML commands are:
1. LOAD
2. SELECT
3. INSERT
4. DELETE
5. UPDATE
6. EXPORT
7. IMPORT
Hive Commands
1. LOAD Command
The LOAD statement in Hive is used to move data files into the locations
corresponding to Hive tables.
• If a LOCAL keyword is specified, then the LOAD command will look for the
file path in the local filesystem.
• If the LOCAL keyword is not specified, then Hive needs the
absolute URI of the file.
• In case the keyword OVERWRITE is specified, then the contents of the
target table/partition will be deleted and replaced by the files referred by
filepath.
• If the OVERWRITE keyword is not specified, then the files referred by
filepath will be appended to the table.
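The rules above can be sketched as follows (paths and table name are illustrative):

```sql
-- Load from the local filesystem, replacing existing table contents:
LOAD DATA LOCAL INPATH '/tmp/employees.csv'
OVERWRITE INTO TABLE employees;

-- Load from HDFS, appending to the table:
LOAD DATA INPATH '/user/hive/staging/employees.csv'
INTO TABLE employees;
```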
Hive Commands
2. SELECT Command
• The SELECT statement in Hive is similar to the SELECT statement in
SQL used for retrieving data from the database.
• Syntax:
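As a sketch (table and column names are illustrative):

```sql
SELECT country, COUNT(*) AS views
FROM page_views
WHERE view_date = '2023-01-01'
GROUP BY country
ORDER BY views DESC
LIMIT 10;
```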
Hive Commands
3. INSERT Command
• The INSERT command in Hive loads data into a Hive table. Data can
be inserted into either a table or a partition.
a. INSERT INTO
• The INSERT INTO statement appends data to the existing data in the table or
partition. INSERT INTO is available from Hive 0.8.
• Syntax:
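As a sketch (table names are illustrative):

```sql
INSERT INTO TABLE employees_backup
SELECT * FROM employees;
```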
Hive Commands
b. INSERT OVERWRITE
• The INSERT OVERWRITE statement overwrites the existing data in the table
or partition.
• Syntax:
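As a sketch (table names are illustrative):

```sql
INSERT OVERWRITE TABLE employees_backup
SELECT * FROM employees;   -- replaces any existing rows in employees_backup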
Hive Commands
c. INSERT ... VALUES
• The INSERT ... VALUES statement in Hive inserts literal values into a table
directly from SQL. It is available from Hive 0.14.
• Syntax:
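As a sketch (table, columns, and values are illustrative):

```sql
INSERT INTO TABLE employees VALUES
  (1, 'siva', 'CSE'),
  (2, 'bala', 'ECE');
```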
Hive Commands
4. DELETE Command
• The DELETE statement in Hive deletes table data. If a WHERE
clause is specified, it deletes only the rows that satisfy the condition.
• The DELETE statement can be used only on Hive tables that
support ACID transactions.
• Syntax:
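As a sketch (names are illustrative; the table must be ACID-enabled):

```sql
DELETE FROM employees WHERE dept = 'CSE';
```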
Hive Commands
5. UPDATE Command
• UPDATE can be performed only on Hive tables that support ACID transactions.
• The UPDATE statement in Hive updates table data. If a WHERE
clause is specified, it updates the columns of only the rows that satisfy
the condition.
• Partitioning and bucketing columns cannot be updated.
• Syntax:
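As a sketch (names are illustrative; the table must be ACID-enabled):

```sql
UPDATE employees SET salary = salary * 1.10 WHERE dept = 'CSE';
```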
Hive Commands
6. EXPORT Command
• The Hive EXPORT statement exports table or partition data, along
with the metadata, to a specified output location in HDFS.
• The metadata is exported in a _metadata file, and the data is exported in a
subdirectory named 'data'.
• Syntax:
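As a sketch (table name and HDFS path are illustrative):

```sql
EXPORT TABLE employees TO '/user/hive/export/employees';
```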
Hive Commands
7. IMPORT Command
• The Hive IMPORT command imports the data from a specified
location to a new table or already existing table.
• Syntax:
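As a sketch, importing the data exported above into a new table (names and path are illustrative):

```sql
IMPORT TABLE employees_copy FROM '/user/hive/export/employees';
```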