Unit-5
Hive Architecture
The following architecture explains the flow of submission of query into Hive.
Hive Client
Hive allows writing applications in various languages, including Java, Python, and C++. It
supports the following types of clients:
o Thrift Server - A cross-language service provider platform that serves requests from all
programming languages that support Thrift.
o JDBC Driver - Used to establish a connection between Hive and Java applications. The
JDBC driver is defined in the class org.apache.hadoop.hive.jdbc.HiveDriver.
o ODBC Driver - Allows applications that support the ODBC protocol to connect to Hive.
Hive Services
o Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute
Hive queries and commands.
o Hive Web User Interface - The Hive Web UI is just an alternative of Hive CLI. It
provides a web-based GUI for executing Hive queries and commands.
o Hive MetaStore - A central repository that stores the structure information of the various
tables and partitions in the warehouse. It also includes metadata about each column and its
type, the serializers and deserializers that are used to read and write data, and the
corresponding HDFS files where the data is stored.
o Hive Server - Also referred to as the Apache Thrift Server. It accepts requests from
different clients and forwards them to the Hive Driver.
o Hive Driver - It receives queries from different sources like web UI, CLI, Thrift, and
JDBC/ODBC driver. It transfers the queries to the compiler.
o Hive Compiler - The purpose of the compiler is to parse the query and perform semantic
analysis on the different query blocks and expressions, converting HiveQL statements into
an execution plan of MapReduce jobs.
o Hive Execution Engine - The optimizer generates the logical plan in the form of a DAG of
map-reduce tasks and HDFS tasks; the execution engine then runs these tasks in the order
of their dependencies (an EXPLAIN example is shown below).
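You can inspect the plan that the compiler and optimizer produce by prefixing a query with
EXPLAIN. A minimal illustration, using the employee table from the HiveQL example later
in this unit:

hive> EXPLAIN SELECT * FROM employee WHERE salary > 30000;

The output lists the stages of the plan, including the map-reduce stages that the execution
engine will run.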
Metastore
The metastore is the central repository of Hive metadata. The metastore is divided into two
pieces: a service and the backing store for the data. By default, the metastore service runs in
the same JVM as the Hive service and contains an embedded Derby database instance backed
by the local disk. This is called the embedded metastore configuration (see Figure 12-2).
Using an embedded metastore is a simple way to get started with Hive; however, only one
embedded Derby database can access the database files on disk at any one time, which means
you can only have one Hive session open at a time that shares the same metastore. Trying to
start a second session gives the error:
Failed to start database 'metastore_db'
when it attempts to open a connection to the metastore.
The solution to supporting multiple sessions (and therefore multiple users) is to use a
standalone database. This configuration is referred to as a local metastore, since the metastore
service still runs in the same process as the Hive service, but connects to a database running
in a separate process, either on the same machine or on a remote machine.
HiveQL
The Hive Query Language (HiveQL) is a query language for Hive to process and analyze
structured data in a Metastore. This section explains how to use the SELECT statement with
the WHERE clause.
The SELECT statement is used to retrieve data from a table. The WHERE clause works like
a condition: it filters the data using the condition and gives you a finite result. The built-in
operators and functions are used to build the expression that fulfils the condition.
Syntax
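A simplified sketch of the grammar (the full SELECT syntax has more optional clauses):

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[LIMIT number];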
Example
Let us take an example of the SELECT…WHERE clause. Assume we have the employee table
as given below, with fields named Id, Name, Salary, Designation, and Dept. Generate a query
to retrieve the details of the employees who earn a salary of more than Rs 30000.
+------+--------------+-------------+-------------------+--------+
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 40000 | Hr Admin | HR |
|1205 | Kranthi | 30000 | Op Admin | Admin |
+------+--------------+-------------+-------------------+--------+
The following query retrieves the employee details using the above scenario:
hive> SELECT * FROM employee WHERE salary>30000;
On successful execution of the query, you get to see the following response:
+------+--------------+-------------+-------------------+--------+
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 40000 | Hr Admin | HR |
+------+--------------+-------------+-------------------+--------+
JDBC Program
The JDBC program to apply the WHERE clause for the given example is as follows.
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveQLWhere {
   private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

   public static void main(String[] args) throws SQLException {
      // register the Hive JDBC driver
      try {
         Class.forName(driverName);
      } catch (ClassNotFoundException e) {
         e.printStackTrace();
         System.exit(1);
      }

      // get connection
      Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/userdb", "", "");

      // create statement
      Statement stmt = con.createStatement();

      // execute statement
      ResultSet res = stmt.executeQuery("SELECT * FROM employee WHERE salary>30000");
      System.out.println("Result:");
      System.out.println(" ID \t Name \t Salary \t Designation \t Dept ");

      while (res.next()) {
         System.out.println(res.getInt(1) + " " + res.getString(2) + " " + res.getDouble(3) + " " + res.getString(4) + " " + res.getString(5));
      }
      con.close();
   }
}
Save the program in a file named HiveQLWhere.java. Use the following commands to
compile and execute this program.
$ javac HiveQLWhere.java
$ java HiveQLWhere
Output:
ID Name Salary Designation Dept
1201 Gopal 45000 Technical manager TP
1202 Manisha 45000 Proofreader PR
1203 Masthanvali 40000 Technical writer TP
1204 Krian 40000 Hr Admin HR
HBasics
HBase is a distributed, column-oriented database built on top of HDFS, designed for
real-time read/write random access to very large datasets.
Features
Concepts
Logical View of the CustomerContactInformation table in HBase:
In the HBase data model, column qualifiers are specific names assigned to your data values
in order to make sure you are able to accurately identify them.
Regions
Tables are automatically partitioned horizontally by HBase into regions. Each region
comprises a subset of a table's rows. A region is denoted by the table it belongs to, its first
row (inclusive), and its last row (exclusive).
Locking
Row updates are atomic, no matter how many row columns constitute the row-level
transaction. This keeps the locking model simple.
Implementation
HMaster
HMaster is the master server that assigns regions to Region Servers, monitors them, handles
load balancing, and performs schema changes such as table and column family creation.
Region Server
These are the worker nodes that handle read, write, update, and delete requests from
clients. A Region Server process runs on every node in the Hadoop cluster, alongside the
HDFS DataNode, and consists of the following components:
o Block Cache - This is the read cache. The most frequently read data is stored here, and
when the block cache is full, the least recently used data is evicted.
o MemStore - This is the write cache. It stores new data that has not yet been written to
disk; every column family in a region has its own MemStore.
o Write Ahead Log (WAL) - A file that stores new data that has not yet been persisted to
permanent storage, so that it can be recovered after a failure.
o HFile - The actual storage file, which stores the rows as sorted key-values on disk.
ZooKeeper
The ZooKeeper service keeps track of all the region servers in an HBase cluster: how many
region servers there are, which of them are alive and available, and it provides notification
when a region server fails.
Clients
There are a number of client options for interacting with an HBase cluster. Creating a table
and inserting data into an HBase table through the Java client are shown in the following
program.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.*;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class ExampleClient {
  public static void main(String[] args) throws IOException {
    Configuration config = HBaseConfiguration.create();

    // Create a table called "test" with a single column family "data"
    HBaseAdmin admin = new HBaseAdmin(config);
    HTableDescriptor htd = new HTableDescriptor("test");
    HColumnDescriptor hcd = new HColumnDescriptor("data");
    htd.addFamily(hcd);
    admin.createTable(htd);

    // Insert one cell: row "row1", column data:FN, value "value1"
    HTable table = new HTable(config, "test");
    Put p1 = new Put(Bytes.toBytes("row1"));
    byte[] databytes = Bytes.toBytes("data");
    p1.add(databytes, Bytes.toBytes("FN"), Bytes.toBytes("value1"));
    table.put(p1);
    table.close();
  }
}
MapReduce:
TableMapper:
HBase TableMapper is an abstract class that extends the Hadoop Mapper class, fixing the
input key type to ImmutableBytesWritable and the input value type to Result:
package org.apache.hadoop.hbase.mapreduce;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.mapreduce.Mapper;

public abstract class TableMapper<KEYOUT, VALUEOUT>
    extends Mapper<ImmutableBytesWritable, Result, KEYOUT, VALUEOUT> {
}
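As a usage illustration, here is a minimal sketch of a concrete mapper; the class name and
output types are assumptions for the example, not part of the original notes:

import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

// Emits (row key, 1) for every row scanned from the input table.
public class RowKeyMapper extends TableMapper<Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);

  @Override
  protected void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException, InterruptedException {
    context.write(new Text(Bytes.toString(row.get())), ONE);
  }
}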
TableReducer
src: HBASE_HOME/src/java/org/apache/hadoop/hbase/mapreduce/TableReducer.java
package org.apache.hadoop.hbase.mapreduce;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Reducer;

public abstract class TableReducer<KEYIN, VALUEIN, KEYOUT>
    extends Reducer<KEYIN, VALUEIN, KEYOUT, Writable> {
}

TableReducer can take any KEY2/VALUE2 classes and emit any KEY3 class, with a Writable
VALUE4 class.
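And a matching sketch of a concrete reducer, written against the older HBase API in which
Put is a Writable; the class name, the column family data, and the qualifier count are
assumptions for the example:

import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

// Sums the counts for each key and writes the total back to HBase as a Put.
public class RowCountReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    Put put = new Put(Bytes.toBytes(key.toString()));
    put.add(Bytes.toBytes("data"), Bytes.toBytes("count"), Bytes.toBytes(String.valueOf(sum)));
    context.write(new ImmutableBytesWritable(Bytes.toBytes(key.toString())), put);
  }
}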
Interfaces:
Avro, REST, and Thrift
HBase ships with Avro, REST, and Thrift interfaces. These are useful when the interacting
application is written in a language other than Java. In all cases, a Java server hosts an
instance of the HBase client, brokering Avro, REST, and Thrift application requests into and
out of the HBase cluster.
Loading Data :
Let's assume that there are billions of individual observations to be loaded. This kind of import
is normally an extremely complex and long-running database operation, but MapReduce and
HBase's distribution model allow us to make full use of the cluster: copy the raw input data
onto HDFS, then run a MapReduce job that reads the input and writes the rows to HBase, as
sketched below.
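A rough outline of the two steps; the file name, jar, and driver class here are hypothetical
placeholders, not part of the original notes:

% hadoop fs -copyFromLocal observations.txt input/observations.txt
% hadoop jar hbase-import.jar ObservationImporter input/observations.txt

The importer's map tasks would parse each observation line and emit a Put against the target
table, so the write load is spread across all the region servers in the cluster.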
Web Queries :
To implement the web application, we will use the HBase Java API directly. Here it becomes
clear how important your choice of schema and storage format is.
Hive
Hive is a framework for data warehousing on top of Hadoop. Hive was created to make it
possible for analysts with strong SQL skills (but meager Java programming skills) to run
queries on the huge volumes of data that Facebook stored in HDFS.
Of course, SQL isn’t ideal for every big data problem—it’s not a good fit for building
complex machine learning algorithms, for example—but it’s great for many analyses, and it
has the huge advantage of being very well known in the industry.
Installing Hive
In normal use, Hive runs on your workstation and converts your SQL query into a series of
MapReduce jobs for execution on a Hadoop cluster. Hive organizes data into tables, which
provide a means for attaching structure to data stored in HDFS. Metadata— such as table
schemas—is stored in a database called the metastore.
% export HIVE_INSTALL=/home/tom/hive-x.y.z-dev
% export PATH=$PATH:$HIVE_INSTALL/bin
% hive
hive>
The shell is the primary way that we will interact with Hive, by issuing commands in
HiveQL. HiveQL is Hive’s query language, a dialect of SQL. It is heavily influenced by
MySQL.
hive> SHOW TABLES;
OK
Time taken: 10.425 seconds
The database stores its files in a directory called metastore_db, which is relative to where
you ran the hive command from.
You can also run the Hive shell in non-interactive mode. The -f option runs the commands in
the specified file, script.q, as follows:
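% hive -f script.q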
In both interactive and non-interactive mode, Hive will print information to standard error—
such as the time taken to run a query—during the course of operation. You can suppress these
messages using the -S option at launch time, which has the effect of only showing the output
result for queries:
% hive -S -e 'select * from dummy'

An Example
Let’s see how to use Hive to run a query on the weather dataset.
We create a table to hold the weather data using the CREATE TABLE statement:
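CREATE TABLE records (year STRING, temperature INT, quality INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';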
Hive expects there to be three fields in each row, corresponding to the table columns, with
fields separated by tabs, and rows by newlines.
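We can then populate Hive with the data; the sample file path here is illustrative:

LOAD DATA LOCAL INPATH 'input/ncdc/micro-tab/sample.txt'
OVERWRITE INTO TABLE records;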
Running this command tells Hive to put the specified local file in its warehouse directory.
Thus, the files for the records table are found in the /user/hive/warehouse/records directory on
the local filesystem:
% ls /user/hive/warehouse/records/
sample.txt
Now that the data is in Hive, we can run a query against it:
hive> SELECT year, MAX(temperature)
> FROM records
> WHERE temperature != 9999
> AND (quality = 0 OR quality = 1 OR quality = 4 OR quality = 5 OR quality = 9)
> GROUP BY year;
Output:
1949 111
1950 22
Hive transforms this query into a MapReduce job, which it executes on our behalf, then
prints the results to the console.
Hive Services
The Hive shell is only one of several services that you can run using the hive command. You
can specify the service to run using the --service option. Type hive --service help to get a list
of available service names; the most useful are described below.
cli:
The command line interface to Hive (the shell). This is the default service.
hiveserver:
Runs Hive as a server exposing a Thrift service, enabling access from a range of clients
written in different languages. Applications using the Thrift, JDBC, and ODBC connectors
need to run a Hive server to communicate with Hive. Set the HIVE_PORT environment
variable to specify the port the server will listen on (defaults to 10000).
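For example, to start the server listening on a non-default port:

% export HIVE_PORT=10001
% hive --service hiveserver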
hwi:
The Hive Web Interface, a web-based GUI alternative to the Hive CLI.
jar:
The Hive equivalent to hadoop jar, a convenient way to run Java applications that includes
both Hadoop and Hive classes on the classpath.
metastore:
By default, the metastore is run in the same process as the Hive service. Using this service, it
is possible to run the metastore as a standalone (remote) process. Set the
METASTORE_PORT environment variable to specify the port the server will listen on.
Hive clients:
If you run Hive as a server (hive --service hiveserver), then there are a number of different
mechanisms for connecting to it from applications. The relationship between Hive clients and
Hive services is described below.
Thrift Client
The Hive Thrift Client makes it easy to run Hive commands from a wide range of
programming languages. Thrift bindings for Hive are available for C++, Java, PHP, Python,
and Ruby.
JDBC Driver
Hive provides a Type 4 (pure Java) JDBC driver, defined in the class
org.apache.hadoop.hive.jdbc.HiveDriver. When configured with a JDBC URI of the form
jdbc:hive://host:port/dbname, a Java application can connect to a Hive server running in a
separate process at the given host and port.
ODBC Driver:
The Hive ODBC Driver allows applications that support the ODBC protocol to connect to
Hive.
The Metastore :
The metastore is the central repository of Hive metadata. The metastore is divided into two
pieces: a service and the backing store for the data.
Embedded Mode
In this mode, the metastore uses a Derby database, and both the database and the metastore
service are embedded in the main HiveServer process. Both are started for you when you start
the HiveServer process.
It can support only one active user at a time and is not certified for production use.
Local Mode:
In Local mode, the Hive metastore service runs in the same process as the main HiveServer
process, but the metastore database runs in a separate process, and can be on a separate host.
The metastore service communicates with the metastore database over JDBC.
Remote Metastore
There is another metastore configuration called a remote metastore, where one or more
metastore servers run in separate processes to the Hive service. This brings better
manageability and security.
MySQL is a popular choice for the standalone metastore. In this case,
javax.jdo.option.ConnectionURL is set to
jdbc:mysql://host/dbname?createDatabaseIfNotExist=true, and
javax.jdo.option.ConnectionDriverName is set to com.mysql.jdbc.Driver. The connection
user name and password must also be set.
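Putting these together, a minimal hive-site.xml for a MySQL-backed metastore might look
like the following sketch; the host name, database name, and credentials are placeholders:

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://dbhost/hive_metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hivepassword</value>
  </property>
</configuration>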
In a traditional database, a table’s schema is enforced at data load time. If the data
being loaded doesn’t conform to the schema, then it is rejected. This design is
sometimes called schema on write, since the data is checked against the schema when
it is written into the database.
Hive, on the other hand, doesn’t verify the data when it is loaded, but rather when a
query is issued. This is called schema on read.
There are trade-offs between the two approaches. Schema on read makes for a very
fast initial load. The load operation is just a file copy or move.
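As a small illustration (the table and file names here are hypothetical), a load that a
schema-on-write database would reject outright simply succeeds in Hive, and malformed
fields only surface, as NULLs, when the data is queried:

hive> CREATE TABLE t (a INT, b STRING)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
hive> LOAD DATA LOCAL INPATH 'data.txt' INTO TABLE t;  -- just a file copy; nothing is validated
hive> SELECT * FROM t;  -- any non-numeric value in column a is returned as NULL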
Schema on write makes query time performance faster, since the database can index
columns and perform compression on the data. The trade-off, however, is that it takes
longer to load data into the database.
Updates, Transactions, and Indexes
Updates, transactions, and indexes are mainstays of traditional databases. Yet, until recently,
these features have not been considered a part of Hive’s feature set.
HiveQL
Data Types:
Hive supports both primitive types (BOOLEAN, TINYINT, SMALLINT, INT, BIGINT,
FLOAT, DOUBLE, STRING, TIMESTAMP, and BINARY) and complex types (ARRAY,
MAP, STRUCT, and UNION).
Operators and Functions:
The usual set of SQL operators is provided by Hive: relational operators (such as x = 'a'
for testing equality, x IS NULL for testing nullity, x LIKE 'a%' for pattern matching),
arithmetic operators (such as x + 1 for addition), and logical operators (such as x OR y
for logical OR).
Hive comes with a large number of built-in functions, too many to list here, divided
into categories including mathematical and statistical functions, string functions, date
functions (for operating on string representations of dates), conditional functions,
aggregate functions, and functions for working with XML (using the xpath function) and
JSON.
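As a quick illustration of a few of these operators and functions, run against the one-row
dummy table used earlier (any one-row table works):

hive> SELECT 1 + 1, 'Hive' LIKE 'H%', upper('hive'), length('hive') FROM dummy;
2 true HIVE 4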
The Hive Metastore is the component of Hive that stores the system catalog, containing
metadata about Hive tables, columns, and partitions. The metadata is usually stored in a
traditional RDBMS.
By default, Apache Hive uses an embedded Derby database to store the metadata. Any
JDBC-compliant database, such as MySQL, can also be used for the Hive Metastore.
Several primary attributes must be configured for the Hive Metastore, as shown in the
hive-site.xml sketch above. Some of them are:
o Connection URL
o Connection driver
o Connection user ID
o Connection password