UNIT 5 Notes

19EAI433: BIG DATA ANALYTICS

Module V
HBase, data model and implementations, HBase clients, HBase examples, praxis. Cassandra,
Cassandra data model, Cassandra examples, Cassandra clients, Hadoop integration. Hive,
datatypes and file formats, HiveQL data definition, HiveQL data manipulation, HiveQL
queries.
5.1 HBase
5.1.1 What is HBase?
HBase is a distributed column-oriented database built on top of the Hadoop file system. It is an open-
source project and is horizontally scalable.

HBase has a data model similar to Google's Bigtable, designed to provide quick random access to huge
amounts of structured data. It leverages the fault tolerance provided by the Hadoop File System
(HDFS).

It is a part of the Hadoop ecosystem that provides random real-time read/write access to data in the
Hadoop File System.

One can store the data in HDFS either directly or through HBase. Data consumers read/access the data
in HDFS randomly using HBase. HBase sits on top of the Hadoop File System and provides read and
write access.
5.1.2 HBase and HDFS

HDFS: It is a distributed file system suitable for storing large files.
HBase: It is a database built on top of HDFS.

HDFS: It does not support fast individual record lookups.
HBase: It provides fast lookups for larger tables.

HDFS: It provides high-latency batch processing.
HBase: It provides low-latency access to single rows from billions of records (random access).

HDFS: It provides only sequential access to data.
HBase: It internally uses hash tables, provides random access, and stores the data in indexed HDFS files for faster lookups.

5.1.3 Storage Mechanism in HBase

HBase is a column-oriented database and the tables in it are sorted by row. The table schema defines
only column families, which are the key-value pairs. A table has multiple column families and each
column family can have any number of columns. Subsequent column values are stored contiguously on
the disk. Each cell value of the table has a timestamp. In short, in an HBase:

 Table is a collection of rows.


 Row is a collection of column families.
 Column family is a collection of columns.
 Column is a collection of key value pairs.

Given below is an example schema of a table in HBase.

Rowid | Column Family 1 (col1, col2, col3) | Column Family 2 (col1, col2, col3) | Column Family 3 (col1, col2, col3) | Column Family 4 (col1, col2, col3)

5.1.4 Column Oriented and Row Oriented

Column-oriented databases are those that store data tables as sections of columns of data, rather than as
rows of data. In short, they have column families.

Row-Oriented Database: It is suitable for Online Transaction Processing (OLTP).
Column-Oriented Database: It is suitable for Online Analytical Processing (OLAP).

Row-Oriented Database: Such databases are designed for a small number of rows and columns.
Column-Oriented Database: Such databases are designed for huge tables.

The following image shows column families in a column-oriented database:

5.1.5 HBase and RDBMS

HBase: It is schema-less; it does not have the concept of a fixed-columns schema and defines only column families.
RDBMS: It is governed by its schema, which describes the whole structure of the tables.

HBase: It is built for wide tables and is horizontally scalable.
RDBMS: It is thin and built for small tables, and it is hard to scale.

HBase: No transactions are there in HBase.
RDBMS: It is transactional.

HBase: It has de-normalized data.
RDBMS: It has normalized data.

HBase: It is good for semi-structured as well as structured data.
RDBMS: It is good for structured data.
Features of HBase

 HBase is linearly scalable.


 It has automatic failure support.
 It provides consistent reads and writes.
 It integrates with Hadoop, both as a source and a destination.
 It has an easy Java API for clients.
 It provides data replication across clusters.

Where to Use HBase

 Apache HBase is used to have random, real-time read/write access to Big Data.
 It hosts very large tables on top of clusters of commodity hardware.
 Apache HBase is a non-relational database modeled after Google's Bigtable. Just as Bigtable works on
top of the Google File System, Apache HBase works on top of Hadoop and HDFS.

Applications of HBase

 It is used whenever there is a need for write-heavy applications.


 HBase is used whenever we need to provide fast random access to available data.
 Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.

5.2 Data model and implementations

5.2.1 HBase Data Model

The HBase Data Model is designed to handle semi-structured data that may differ in field size, data
type, and columns. The data model's layout partitions the data into simpler components and spreads
them across the cluster. HBase's Data Model consists of various logical components, such as tables,
rows, column families, columns, cells, and versions.
Table:

An HBase table is made up of several column families, which must be defined upfront at the time of
schema specification.

Row:

An HBase row consists of a row key and one or more associated value columns. Row keys are
uninterpreted bytes. Rows are sorted lexicographically by row key, so the row with the lowest key
appears first in the table. The design of the row key is therefore very critical.

Column:

A column in HBase consists of a column family and a column qualifier, which are delimited by a :
(colon) character.

5.2.2 Column Family:

Apache HBase columns are grouped into column families. A column family physically colocates a
group of columns and their values to improve performance. Every row in a table has the same column
families, but a given row may not store anything in a particular column family.

The same prefix is granted to all members of a column family. For example, the columns courses:history
and courses:math are both members of the courses column family. The colon (:) character separates the
column family from the column qualifier. The column family prefix must be made up of printable
characters.

During schema definition time, column families must be declared upfront while columns are not
specified during schema time. They can be conjured on the fly when the table is up and running.
Physically, all members of the column family are stored on the file system together.

5.2.3 Column Qualifier

The column qualifier is added to a column family to provide an index for a given piece of data. For
example, given a column family content, qualifiers might be content:html and content:pdf. Although
column families are fixed at table creation, column qualifiers are mutable and can vary significantly
from row to row.

Cell: A cell stores data and is uniquely identified by the combination of row key, column family, and
column qualifier. The data stored in a cell is called its value, and the value is always treated as a byte[].
Timestamp:

In addition to each value, a timestamp is written, which is the identifier for a given version of the value.
The timestamp reflects the time when the data is written on the Region Server, but when we put data
into a cell, we can assign a different timestamp value.

5.3 HBase clients

The three main components of the HBase architecture are described below:

1. HMaster –
The implementation of the Master Server in HBase is HMaster. It is the process that assigns regions
to Region Servers and handles DDL (create, delete table) operations. It monitors all Region
Server instances present in the cluster. In a distributed environment, the Master runs several
background threads. HMaster has many features like controlling load balancing, failover, etc.
2. Region Server –

HBase tables are divided horizontally by row key range into Regions. Regions are the basic
building elements of an HBase cluster; they hold the distributed portions of the tables and are
comprised of column families. A Region Server runs on an HDFS DataNode present in the Hadoop
cluster. A Region Server is responsible for several things, such as handling, managing, and executing
read and write HBase operations on its set of regions. The default size of a region
is 256 MB.
3. Zookeeper –
It is like a coordinator in HBase. It provides services like maintaining configuration information,
naming, providing distributed synchronization, server failure notification etc. Clients
communicate with region servers via zookeeper.

Features of HBase architecture :

Distributed and Scalable: HBase is designed to be distributed and scalable, which means it can handle
large datasets and can scale out horizontally by adding more nodes to the cluster.

Column-oriented Storage: HBase stores data in a column-oriented manner, which means data is
organized by columns rather than rows. This allows for efficient data retrieval and aggregation.

Hadoop Integration: HBase is built on top of Hadoop, which means it can leverage Hadoop’s
distributed file system (HDFS) for storage and MapReduce for data processing.

Consistency and Replication: HBase provides strong consistency guarantees for read and write
operations, and supports replication of data across multiple nodes for fault tolerance.

Built-in Caching: HBase has a built-in caching mechanism that can cache frequently accessed data in
memory, which can improve query performance.

Compression: HBase supports compression of data, which can reduce storage requirements and
improve query performance.

Flexible Schema: HBase supports flexible schemas, which means the schema can be updated on the fly
without requiring a database schema migration.

Note – HBase is extensively used for online real-time operations; for example, in banking applications it
can be used for real-time data updates from ATM machines.

5.4 HBase examples

Class HBaseConfiguration

Adds HBase configuration files to a Configuration. This class belongs to the org.apache.hadoop.hbase
package.
Methods and description

1. static org.apache.hadoop.conf.Configuration create(): This method creates a Configuration with HBase resources.

Class HTable

HTable is an HBase internal class that represents an HBase table. It is an implementation of Table that is
used to communicate with a single HBase table. This class belongs to the
org.apache.hadoop.hbase.client package.

Constructors

1. HTable()

2. HTable(TableName tableName, ClusterConnection connection, ExecutorService pool): Using this constructor, you can create an object to access an HBase table.

Methods and description

1. void close(): Releases all the resources of the HTable.
2. void delete(Delete delete): Deletes the specified cells/row.
3. boolean exists(Get get): Using this method, you can test the existence of columns in the table, as specified by Get.
4. Result get(Get get): Retrieves certain cells from a given row.
5. org.apache.hadoop.conf.Configuration getConfiguration(): Returns the Configuration object used by this instance.
6. TableName getName(): Returns the table name instance of this table.
7. HTableDescriptor getTableDescriptor(): Returns the table descriptor for this table.
8. byte[] getTableName(): Returns the name of this table.
9. void put(Put put): Using this method, you can insert data into the table.
Class Put

This class is used to perform Put operations for a single row. It belongs to the
org.apache.hadoop.hbase.client package.

Constructors

1. Put(byte[] row): Using this constructor, you can create a Put operation for the specified row.
2. Put(byte[] rowArray, int rowOffset, int rowLength): Using this constructor, you can make a copy of the passed-in row key to keep local.
3. Put(byte[] rowArray, int rowOffset, int rowLength, long ts): Using this constructor, you can make a copy of the passed-in row key to keep local.
4. Put(byte[] row, long ts): Using this constructor, we can create a Put operation for the specified row, using a given timestamp.

Methods

1. Put add(byte[] family, byte[] qualifier, byte[] value): Adds the specified column and value to this Put operation.
2. Put add(byte[] family, byte[] qualifier, long ts, byte[] value): Adds the specified column and value, with the specified timestamp as its version, to this Put operation.
3. Put add(byte[] family, ByteBuffer qualifier, long ts, ByteBuffer value): Adds the specified column and value, with the specified timestamp as its version, to this Put operation.

Class Get

This class is used to perform Get operations on a single row. This class belongs to the
org.apache.hadoop.hbase.client package.

Class Delete

This class is used to perform Delete operations on a single row. It also belongs to the
org.apache.hadoop.hbase.client package. Its main methods are:

1. Delete addColumn(byte[] family, byte[] qualifier): Deletes the latest version of the specified column.
2. Delete addColumns(byte[] family, byte[] qualifier, long timestamp): Deletes all versions of the specified column with a timestamp less than or equal to the specified timestamp.
3. Delete addFamily(byte[] family): Deletes all versions of all columns of the specified family.
4. Delete addFamily(byte[] family, long timestamp): Deletes all columns of the specified family with a timestamp less than or equal to the specified timestamp.

Class Result

This class is used to get a single row result of a Get or a Scan query.

HBase - Create Data

Inserting Data using HBase Shell

This chapter demonstrates how to create data in an HBase table. To create data in an HBase table, the
following commands and methods are used:

 put command,
 add() method of Put class, and
 put() method of HTable class.

As an example, we are going to create the following table in HBase.


Using put command, you can insert rows into a table. Its syntax is as follows:

put '<table name>','row1','<colfamily:colname>','<value>'

Inserting the First Row

Let us insert the first row values into the emp table as shown below.

hbase(main):005:0> put 'emp','1','personal data:name','raju'


0 row(s) in 0.6600 seconds
hbase(main):006:0> put 'emp','1','personal data:city','hyderabad'
0 row(s) in 0.0410 seconds
hbase(main):007:0> put 'emp','1','professional
data:designation','manager'
0 row(s) in 0.0240 seconds
hbase(main):007:0> put 'emp','1','professional data:salary','50000'
0 row(s) in 0.0240 seconds

Insert the remaining rows using the put command in the same way. If you insert the whole table, you
will get the following output.

hbase(main):022:0> scan 'emp'

ROW COLUMN+CELL
1 column=personal data:city, timestamp=1417524216501, value=hyderabad

1 column=personal data:name, timestamp=1417524185058, value=ramu

1 column=professional data:designation, timestamp=1417524232601,

value=manager

1 column=professional data:salary, timestamp=1417524244109, value=50000

2 column=personal data:city, timestamp=1417524574905, value=chennai

2 column=personal data:name, timestamp=1417524556125, value=ravi

2 column=professional data:designation, timestamp=1417524592204,

value=sr:engg

2 column=professional data:salary, timestamp=1417524604221, value=30000

3 column=personal data:city, timestamp=1417524681780, value=delhi

3 column=personal data:name, timestamp=1417524672067, value=rajesh

3 column=professional data:designation, timestamp=1417524693187,

value=jr:engg
3 column=professional data:salary, timestamp=1417524702514,

value=25000

Inserting Data Using Java API

You can insert data into HBase using the add() method of the Put class. You can save it using the put()
method of the HTable class. These classes belong to the org.apache.hadoop.hbase.client package.
Below given are the steps to create data in a Table of HBase.

Step 1: Instantiate the Configuration Class

The Configuration class adds HBase configuration files to its object. You can create a configuration
object using the create() method of the HBaseConfiguration class as shown below.

Configuration conf = HBaseConfiguration.create();

Step 2: Instantiate the HTable Class

You have a class called HTable, an implementation of Table in HBase. This class is used to
communicate with a single HBase table. While instantiating this class, it accepts configuration object
and table name as parameters. You can instantiate HTable class as shown below.

HTable hTable = new HTable(conf, tableName);

Step 3: Instantiate the Put Class

To insert data into an HBase table, the add() method and its variants are used. This method belongs to
the Put class, so instantiate the Put class. This class requires the row key of the row into which you want
to insert the data, in byte array form. You can instantiate the Put class as shown below.

Put p = new Put(Bytes.toBytes("row1"));

Step 4: Insert Data

The add() method of Put class is used to insert data. It requires 3 byte arrays representing column
family, column qualifier (column name), and the value to be inserted, respectively. Insert data into the
HBase table using the add() method as shown below.

p.add(Bytes.toBytes("coloumn family "), Bytes.toBytes("column


name"),Bytes.toBytes("value"));
Step 5: Save the Data in Table

After inserting the required rows, save the changes by adding the put instance to the put() method of
HTable class as shown below.

hTable.put(p);

Step 6: Close the HTable Instance

After creating data in the HBase Table, close the HTable instance using the close() method as shown
below.

Given below is the complete program to create data in HBase Table.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class InsertData {
   public static void main(String[] args) throws IOException {

      // Instantiating the Configuration class
      Configuration config = HBaseConfiguration.create();

      // Instantiating the HTable class
      HTable hTable = new HTable(config, "emp");

      // Instantiating the Put class; accepts a row key
      Put p = new Put(Bytes.toBytes("row1"));

      // Adding values using the add() method;
      // accepts column family name, column qualifier, value
      p.add(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("raju"));
      p.add(Bytes.toBytes("personal"), Bytes.toBytes("city"), Bytes.toBytes("hyderabad"));
      p.add(Bytes.toBytes("professional"), Bytes.toBytes("designation"), Bytes.toBytes("manager"));
      p.add(Bytes.toBytes("professional"), Bytes.toBytes("salary"), Bytes.toBytes("50000"));

      // Saving the Put instance to the HTable
      hTable.put(p);
      System.out.println("data inserted");

      // Closing the HTable
      hTable.close();
   }
}

5.5 Cassandra

Cassandra - Data Model

The data model of Cassandra is significantly different from what we normally see in an RDBMS. This
chapter provides an overview of how Cassandra stores its data.

Cluster

Cassandra database is distributed over several machines that operate together. The outermost container
is known as the Cluster. For failure handling, every node contains a replica, and in case of a failure, the
replica takes charge. Cassandra arranges the nodes in a cluster, in a ring format, and assigns data to
them.

Keyspace

Keyspace is the outermost container for data in Cassandra. The basic attributes of a Keyspace in
Cassandra are −

 Replication factor − It is the number of machines in the cluster that will receive copies of the
same data.
 Replica placement strategy − It is nothing but the strategy to place replicas in the ring. We
have strategies such as simple strategy (rack-unaware strategy), old network topology strategy
(rack-aware strategy), and network topology strategy (datacenter-aware strategy).
 Column families − Keyspace is a container for a list of one or more column families. A column
family, in turn, is a container of a collection of rows. Each row contains ordered columns.
Column families represent the structure of your data. Each keyspace has at least one and often
many column families.
The syntax of creating a Keyspace is as follows −

CREATE KEYSPACE keyspace_name
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

The following illustration shows a schematic view of a Keyspace.

Column Family

A column family is a container for an ordered collection of rows. Each row, in turn, is an ordered
collection of columns. The following table lists the points that differentiate a column family from a table
of relational databases.

Relational Table: A schema in a relational model is fixed. Once we define certain columns for a table, while inserting data, every row must have all the columns filled, at least with a null value.
Cassandra Column Family: In Cassandra, although the column families are defined, the columns are not. You can freely add any column to any column family at any time.

Relational Table: Relational tables define only columns, and the user fills in the table with values.
Cassandra Column Family: In Cassandra, a table contains columns, or can be defined as a super column family.
A Cassandra column family has the following attributes −

 keys_cached − It represents the number of locations to keep cached per SSTable.


 rows_cached − It represents the number of rows whose entire contents will be cached in
memory.
 preload_row_cache − It specifies whether you want to pre-populate the row cache.

Note − Unlike relational tables, where the schema is fixed, a Cassandra column family's schema is not
fixed; Cassandra does not force individual rows to have all the columns.

The following figure shows an example of a Cassandra column family.

Column

A column is the basic data structure of Cassandra with three values, namely key or column name, value,
and a time stamp. Given below is the structure of a column.

SuperColumn

A super column is a special column, therefore, it is also a key-value pair. But a super column stores a
map of sub-columns.

Generally, column families are stored on disk in individual files. Therefore, to optimize performance, it
is important to keep columns that you are likely to query together in the same column family, and a
super column can be helpful here. Given below is the structure of a super column.
Data Models of Cassandra and RDBMS

The following table lists down the points that differentiate the data model of Cassandra from that of an
RDBMS.

RDBMS: RDBMS deals with structured data.
Cassandra: Cassandra deals with unstructured data.

RDBMS: It has a fixed schema.
Cassandra: Cassandra has a flexible schema.

RDBMS: In RDBMS, a table is an array of arrays (ROW x COLUMN).
Cassandra: In Cassandra, a table is a list of "nested key-value pairs" (ROW x COLUMN key x COLUMN value).

RDBMS: Database is the outermost container that contains data corresponding to an application.
Cassandra: Keyspace is the outermost container that contains data corresponding to an application.

RDBMS: Tables are the entities of a database.
Cassandra: Tables or column families are the entities of a keyspace.

RDBMS: Row is an individual record in RDBMS.
Cassandra: Row is a unit of replication in Cassandra.

RDBMS: Column represents the attributes of a relation.
Cassandra: Column is a unit of storage in Cassandra.

RDBMS: RDBMS supports the concepts of foreign keys and joins.
Cassandra: Relationships are represented using collections.

5.6 Cassandra examples

Real World Data Modeling Examples

Example: Facebook Posts

Suppose that we are storing Facebook posts of different users in Cassandra. One of the common query
patterns will be fetching the top ‘N‘ posts made by a given user.

Thus, we need to store all data for a particular user on a single partition as per the above guidelines.
Also, using the post timestamp as the clustering key will be helpful for retrieving the top ‘N‘ posts more
efficiently.

Let’s define the Cassandra table schema for this use case:

CREATE TABLE posts_facebook (


user_id uuid,
post_id timeuuid,
content text,
PRIMARY KEY (user_id, post_id) )
WITH CLUSTERING ORDER BY (post_id DESC);

Now, let’s write a query to find the top 20 posts for the user Anna:

SELECT content FROM posts_facebook WHERE user_id = "Anna_id" LIMIT 20

Example : Gyms Across the Country

Suppose that we are storing the details of different partner gyms across the different cities and states of
many countries and we would like to fetch the gyms for a given city.

Also, let’s say we need to return the results having gyms sorted by their opening date.

Based on the above guidelines, we should store the gyms located in a given city of a specific state and
country on a single partition and use the opening date and gym name as a clustering key.

Let’s define the Cassandra table schema for this example:

CREATE TABLE gyms_by_city (
   country_code text,
   state text,
   city text,
   gym_name text,
   opening_date timestamp,
   PRIMARY KEY (
      (country_code, state, city),
      opening_date, gym_name)
) WITH CLUSTERING ORDER BY (opening_date ASC, gym_name ASC);

Now, let’s look at a query that fetches the first ten gyms by their opening date for the city of Phoenix
within the U.S. state of Arizona:

SELECT * FROM gyms_by_city
WHERE country_code = 'us' AND state = 'Arizona' AND city = 'Phoenix'
LIMIT 10
Next, let’s see a query that fetches the ten most recently-opened gyms in the city of Phoenix within
the U.S. state of Arizona:

SELECT * FROM gyms_by_city
WHERE country_code = 'us' AND state = 'Arizona' AND city = 'Phoenix'
ORDER BY opening_date DESC
LIMIT 10

Note: As the last query’s sort order is opposite of the sort order defined during the table creation, the
query will run slower as Cassandra will first fetch the data and then sort it in memory.

E-commerce Customers and Products

Let’s say we are running an e-commerce store and that we are storing the Customer and Product
information within Cassandra. Let’s look at some of the common query patterns around this use case:

1. Get Customer info


2. Get Product info
3. Get all Customers who like a given Product
4. Get all Products a given Customer likes

We will start by using separate tables for storing the Customer and Product information. However, we
need to introduce a fair amount of denormalization to support the 3rd and 4th queries shown above.

We will create two more tables to achieve this: “Customer_By_Liked_Product” and “Product_Liked_By_Customer”.

Let’s look at the Cassandra table schema for this example:

CREATE TABLE Customer (


cust_id text,
first_name text,
last_name text,
registered_on timestamp,
PRIMARY KEY (cust_id));

CREATE TABLE Product (


prdt_id text,
title text,
PRIMARY KEY (prdt_id));

CREATE TABLE Customer_By_Liked_Product (
   liked_prdt_id text,
   liked_on timestamp,
   title text,
   cust_id text,
   first_name text,
   last_name text,
   PRIMARY KEY (liked_prdt_id, liked_on));
CREATE TABLE Product_Liked_By_Customer (
cust_id text,
first_name text,
last_name text,
liked_prdt_id text,
liked_on timestamp,
title text,
PRIMARY KEY (cust_id, liked_on));

Note: To support both the queries, recently-liked products by a given customer and customers who
recently liked a given product, we have used the “liked_on” column as a clustering key.

Let’s look at the query to find the ten Customers who most recently liked the product “Pepsi“:

SELECT * FROM Customer_By_Liked_Product WHERE title = 'Pepsi' LIMIT 10

And let’s see the query that finds the recently-liked products (up to ten) by a customer named “Anna“:

SELECT * FROM Product_Liked_By_Customer


WHERE first_name = 'Anna' LIMIT 10

5.7 Cassandra clients

What is a Cassandra client? A Cassandra client is used to connect to, manage, and develop your
Cassandra database. What is the Cassandra database client used for? The database client is used to
manage your Cassandra database with actions such as inserting, deleting, and updating tables.

Using the Cassandra Client

Apache Cassandra® is a free and open-source, distributed, wide column store, NoSQL database
management system designed to handle large amounts of data across many commodity servers,
providing high availability with no single point of failure.

This quickstart guide shows how to build a REST application using the Cassandra Quarkus extension,
which allows you to connect to an Apache Cassandra, DataStax Enterprise (DSE) or DataStax Astra
database, using the DataStax Java driver.

Cassandra Java Client example

This is a simplistic code example of connecting to the trial Cassandra cluster, creating a time series data
table, filling it with realistic-looking data, querying it, and saving the results into a CSV file for graphing.
To customise the code for your cluster, change the public IP addresses, and provide the
data centre name and user/password details (it's safest to use a non-super user).
The Cluster.builder() call uses a fluent API to configure the client with the IP addresses, load balancing
policy, port number and user/password information. I’ve obviously been under a rock for a while as I
haven't come across fluent programming before. It's all about the cascading of method invocations, and
they are supported in Java 8 by Lambda functions (and used in Java Streams). This is a very simple
configuration which I’ll revisit in the future with the Instaclustr recommended settings for production
clusters.

The program then builds, gets meta data and prints out the host and cluster information, and then creates
a session by connecting. You have the option of dropping the test table if it already exists or adding
data to the existing table.

Next, we fill the table with some realistic time series sensor data. You can change how many host names
(100 by default) are used, and how many timestamps are generated. For each time 3 metrics and values
will be inserted. There are several types of statements in the Java clients including simple and prepared
statements. In theory prepared statements are faster so there’s an option to use either in the code. In
practice it seems that prepared statements may not improve response time significantly but may be
designed to improve throughput. Realistic looking data is generated by a simple random walk.

The code illustrates some possible queries (SELECTs), including a simple aggregate function (max) and
retrieving all the values for one host/metric combination, finding all host/metric permutations (to assist
with subsequent queries as we made the primary key a compound key of host and metric so both are
needed to select on), and finally retrieving the whole table and reporting the number of rows and total
bytes returned.

5.8 Hadoop integration

Hadoop architecture is designed to be easily integrated with other systems. Integration is very important
because, although we can process data efficiently in Hadoop, we should also be able to send the results
to other systems to take the data to the next level. Data has to be integrated with other systems
to achieve interoperability and flexibility.

The following figure depicts the Hadoop system integrated with different systems and with some
implemented tools for reference:
Hadoop Integration with other systems

The Hadoop framework includes :

 Hadoop Distributed File System (HDFS) – It is a robust distributed file system. Hadoop also has a
framework named YARN that is used for job scheduling and cluster resource management.
 Hadoop MapReduce – It is a system for parallel processing of large data sets that implements the
MapReduce model of distributed programming.

Hadoop provides easier distributed storage with the help of HDFS and provides an analysis system
through MapReduce. It has a well-designed architecture to scale the servers up or down as per the
requirements of the user, from one to hundreds or thousands of computers, with a high degree of fault
tolerance. Hadoop has proved its worth and set standards in big data processing and efficient
storage management; it provides unlimited scalability and is supported by major vendors in the software
industry.
Why integrate R with Hadoop?

R is an open-source programming language. It is best suited for statistical and graphical analysis. Also,
if we need strong data analytics and visualization features, we have to combine R with Hadoop.

The purpose behind R and Hadoop integration:

1. To use Hadoop to execute R code.


2. To use R to access the data stored in Hadoop.

R Hadoop Integration Method

Hadoop and R complement each other very well in terms of big data visualization and analytics. There
are four ways of using Hadoop and R together, which are as follows:

R Hadoop

R Hadoop is a collection of packages. It contains three packages, i.e., rmr, rhbase, and
rhdfs.

The rmr package

For the Hadoop framework, the rmr package provides MapReduce functionality by executing the
Mapping and Reducing codes in R.

The rhbase package

This package provides R database management capability with integration with HBASE.

The rhdfs package

This package provides file management capabilities by integrating with HDFS.

Hadoop Streaming

Hadoop Streaming is a utility that allows users to create and run jobs with any executable as the mapper
and/or the reducer. Using the streaming system, we can develop working Hadoop jobs with just enough
knowledge of Java to write two shell scripts which work in tandem.

The combination of R and Hadoop appears as a must-have toolkit for people working with large data
sets and statistics. However, some Hadoop enthusiasts have raised a red flag when dealing with very
large Big Data excerpts.
They claim that the benefit of R is not its syntax, but the entire library of primitives for visualization and
data. These libraries are fundamentally non-distributed, making data retrieval a time-consuming affair.
This is an inherent flaw with R, and if you choose to ignore it, both R and Hadoop can work together.

5.9 Hive
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of
Hadoop to summarize Big Data, and makes querying and analyzing easy.

Initially Hive was developed by Facebook, later the Apache Software Foundation took it up and
developed it further as an open source under the name Apache Hive. It is used by different companies.
For example, Amazon uses it in Amazon Elastic MapReduce.

Hive is not

 A relational database
 A design for OnLine Transaction Processing (OLTP)
 A language for real-time queries and row-level updates

Features of Hive

 It stores schema in a database and processed data into HDFS.


 It is designed for OLAP.
 It provides SQL type language for querying called HiveQL or HQL.
 It is familiar, fast, scalable, and extensible.

Architecture of Hive

The following component diagram depicts the architecture of Hive:


This component diagram contains different units. The following table describes each unit:

User Interface: Hive is a data warehouse infrastructure software that can create interaction between the user and HDFS. The user interfaces that Hive supports are Hive Web UI, Hive command line, and Hive HD Insight (on Windows Server).

Meta Store: Hive chooses respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types, and HDFS mapping.

HiveQL Process Engine: HiveQL is similar to SQL for querying schema information in the Metastore. It is one of the replacements for the traditional approach to MapReduce programs. Instead of writing a MapReduce program in Java, we can write a query for the MapReduce job and process it.

Execution Engine: The conjunction part of the HiveQL Process Engine and MapReduce is the Hive Execution Engine. The execution engine processes the query and generates results the same as MapReduce results. It uses the flavor of MapReduce.

HDFS or HBASE: The Hadoop Distributed File System or HBase are the data storage techniques used to store data in the file system.
Working of Hive

The following diagram depicts the workflow between Hive and Hadoop.

The following table defines how Hive interacts with the Hadoop framework:

1. Execute Query: The Hive interface, such as the Command Line or Web UI, sends the query to the Driver (any database driver such as JDBC, ODBC, etc.) to execute.
2. Get Plan: The driver takes the help of the query compiler, which parses the query to check the syntax and the query plan or the requirement of the query.
3. Get Metadata: The compiler sends a metadata request to the Metastore (any database).
4. Send Metadata: The Metastore sends the metadata as a response to the compiler.
5. Send Plan: The compiler checks the requirement and resends the plan to the driver. Up to here, the parsing and compiling of the query is complete.
6. Execute Plan: The driver sends the execute plan to the execution engine.
7. Execute Job: Internally, the process of executing the job is a MapReduce job. The execution engine sends the job to the JobTracker, which is in the Name node, and it assigns this job to the TaskTracker, which is in the Data node. Here, the query executes the MapReduce job.
7.1 Metadata Ops: Meanwhile, during execution, the execution engine can execute metadata operations with the Metastore.
8. Fetch Result: The execution engine receives the results from the Data nodes.
9. Send Results: The execution engine sends those resultant values to the driver.
10. Send Results: The driver sends the results to the Hive interfaces.

5.10 Data types and file formats

This chapter takes you through the different data types in Hive, which are involved in the table creation.
All the data types in Hive are classified into four types, given as follows:

 Column Types
 Literals
 Null Values
 Complex Types

Column Types

Column type are used as column data types of Hive. They are as follows:

Integral Types

Integer type data can be specified using integral data types, INT. When the data range exceeds the range
of INT, you need to use BIGINT and if the data range is smaller than the INT, you use SMALLINT.
TINYINT is smaller than SMALLINT.

The following table depicts various INT data types:

TINYINT: postfix Y (example: 10Y)
SMALLINT: postfix S (example: 10S)
INT: no postfix (example: 10)
BIGINT: postfix L (example: 10L)
String Types

String type data types can be specified using single quotes (' ') or double quotes (" "). It contains two
data types: VARCHAR and CHAR. Hive follows C-types escape characters.

The following table depicts various CHAR data types:

VARCHAR: length 1 to 65535
CHAR: length 255

Timestamp

It supports traditional UNIX timestamp with optional nanosecond precision. It supports


java.sql.Timestamp format “YYYY-MM-DD HH:MM:SS.fffffffff” and format “yyyy-mm-dd
hh:mm:ss.ffffffffff”.

Dates

DATE values are described in year/month/day format in the form {{YYYY-MM-DD}}.

Decimals

The DECIMAL type in Hive is the same as the Big Decimal format of Java. It is used for representing
immutable arbitrary precision. The syntax and example is as follows:

DECIMAL(precision, scale)
decimal(10,0)

Union Types

Union is a collection of heterogeneous data types. You can create an instance using create union. The
syntax and example is as follows:

UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>

{0:1}
{1:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
{0:9}
{1:10.0}

Literals

The following literals are used in Hive:

Floating Point Types

Floating point types are nothing but numbers with decimal points. Generally, this type of data is
composed of DOUBLE data type.

Decimal Type
Decimal type data is nothing but a floating point value with a higher range than the DOUBLE data type.
The range of the decimal type is approximately -10^-308 to 10^308.

Null Value

Missing values are represented by the special value NULL.

Complex Types

The Hive complex data types are as follows:

Arrays

Arrays in Hive are used the same way they are used in Java.

Syntax: ARRAY<data_type>

Maps

Maps in Hive are similar to Java Maps.

Syntax: MAP<primitive_type, data_type>

Structs

Structs in Hive are similar to using complex data with comments.


Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>
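
For illustration, a hedged HiveQL sketch of a table definition that uses these complex types (the table and column names here are assumptions, not taken from the source):

CREATE TABLE employee_contacts (
   ename STRING,
   skills ARRAY<STRING>,
   phone_numbers MAP<STRING, STRING>,
   address STRUCT<street:STRING, city:STRING, zip:INT>
);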
HIVE FILE FORMAT
1) TEXT FILE FORMAT:
Hive's text file format is the default storage format, used to load data from comma-separated values (CSV), tab-
delimited, space-delimited, or text files that are delimited by other special characters. You can use the text
format to interchange the data with other client applications. The text file format is very common for
most applications. Data is stored in lines, with each line being a record. Each line is terminated by
a newline character (\n). The text file format storage option is defined by specifying “STORED AS
TEXTFILE” at the end of the table creation.
2) SEQUENCE FILE FORMAT:
Sequence files are flat files consisting of binary key-value pairs. When converting queries to
MapReduce jobs, Hive chooses to use the necessary key-value pairs for a given record. The key
advantage of using a sequence file is that it incorporates two or more files into one file. The sequence
file format storage option is defined by specifying “STORED AS SEQUENCEFILE” at the end of the
table creation.
3) RCFILE FORMAT:
The row columnar file format is very much similar to the sequence file format. It is a data placement
structure designed for MapReduce-based data warehouse systems. This also stores the data as key-value
pairs and offers a high row-level compression rate. This will be used when there is a requirement to
perform multiple rows at a time. RCFile format is supported by Hive version 0.6.0 and later. The RC file
format storage option is defined by specifying “STORED AS RCFILE” at the end of the table creation.

4) AVRO FILE FORMAT:


Hive version 0.14.0 and later versions support Avro files. It is a row-based storage format for Hadoop
which is widely used as a serialization platform. It’s a remote procedure call and data serialization
framework that uses JSON for defining data types and protocols and serializes data in a compact binary
format to make it compact and efficient. This file format can be used in any of the Hadoop’s tools like
Pig and Hive. Avro is one of the common file formats in applications based on Hadoop. The option to
store the data in the Avro file format is defined by specifying “STORED AS AVRO” at the end of the
table creation.

5) ORC FILE FORMAT:

The Optimized Row Columnar (ORC) file format provides a highly efficient way to store data in the
Hive table. This file system was actually designed to overcome limitations of the other Hive file
formats. ORC reduces I/O overhead by accessing only the columns that are required for the current
query. It requires significantly fewer seek operations because all columns within a single group of row
data are stored together on disk. The ORC file format storage option is defined by specifying
“STORED AS ORC” at the end of the table creation.
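
For illustration, a hedged sketch showing how the STORED AS clause is attached to a table definition (the table and column names are assumptions; TEXTFILE, SEQUENCEFILE, RCFILE, AVRO, or PARQUET can be substituted for ORC):

CREATE TABLE employee_orc (
   ename STRING,
   salary FLOAT,
   designation STRING)
STORED AS ORC;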

6) PARQUET:

Parquet is a binary file format that is column driven. It is an open-source available to any project in the
Hadoop ecosystem and is designed for data storage in an effective and efficient flat columnar format
compared to row-based files such as CSV or TSV files. It only reads the necessary columns, which
significantly reduces the IO and thus makes it highly efficient for large-scale query types. The Parquet
table uses Snappy, which is a fast data compression and decompression library, as the default
compression.

The parquet file format storage option is defined by specifying “STORED AS PARQUET” at the end
of the table creation.

7) CUSTOM INPUTFORMAT & OUTPUTFORMAT:

We can implement our own input format and output format in case the data comes in a different
format. These input and output formats are similar to Hadoop MapReduce's input and output
formats.

5.11 HiveQL data definition

Data Definition Language (DDL) is used to describe data and data structures of a
database. Hive has its own DDL, such as SQL DDL, which is used for managing,
creating, altering, and dropping databases, tables, and other objects in a database.
Similar to other SQL databases, Hive databases also contain namespaces for tables. If
the name of the database is not specified, the table is created in the default database.

Some of the main commands used in DDL are as follows:

Creating Databases

In order to create a database, we can use the following command:
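
A minimal sketch consistent with the description that follows (the names temp_database and temp_table come from the text; the column definitions are assumptions):

CREATE DATABASE IF NOT EXISTS temp_database;

CREATE TABLE temp_database.temp_table (id INT, name STRING);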
In the preceding command, the name of the database is added before the name of the
table. Therefore, temp_table gets added in the temp_database. In addition, you can
also create a table in the database by using the following commands:
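
A sketch of this second approach (the column definitions are again assumptions):

USE temp_database;

CREATE TABLE temp_table (id INT, name STRING);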

In the preceding commands, the USE statement is used for setting the current database
to execute all the subsequent HiveQL statements. In this way, you do not need to add
the name of the database before the table name. The table temp_table is created in the
database temp_database. Furthermore, you can specify DBPROPERTIES in the form

of key-value pairs in the following manner:
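
For example (the property keys and values shown are assumptions):

CREATE DATABASE temp_database
WITH DBPROPERTIES ('creator' = 'admin', 'created_on' = '2024-01-01');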

Viewing a Database

You can view all the databases present in a particular path by using the following command:
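
For example, SHOW DATABASES lists all databases, and DESCRIBE DATABASE shows the storage path of a particular database:

SHOW DATABASES;

DESCRIBE DATABASE temp_database;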

Dropping a Database
Dropping a database means deleting it from its storage location.

The database can be deleted by using the following command:
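
For example (the CASCADE keyword also drops any tables remaining in the database):

DROP DATABASE IF EXISTS temp_database CASCADE;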

Creating Tables

You can create a table in a database by using the CREATE command, as discussed earlier. Now, let's learn how to provide the complete definition of a table in a database by using the following commands:
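
A sketch matching the description that follows (the exact data types, comments, and properties are assumptions):

USE temp_database;

CREATE TABLE IF NOT EXISTS employee (
   ename STRING COMMENT 'Employee name',
   salary FLOAT COMMENT 'Employee salary',
   designation STRING COMMENT 'Employee designation')
COMMENT 'Table holding employee details'
TBLPROPERTIES ('creator' = 'admin', 'created_on' = '2024-01-01');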

In the preceding commands, temp_database is first set and then the table is created with the name
employee. In the employee table, the columns (ename, salary, and designation) are specified with their
respective data types. The TBLPROPERTIES clause is a set of key-value properties. Comments are also
added in the table to provide more details.

Altering Tables

Altering a table means modifying or changing an existing table. By altering a table,
you can modify the metadata associated with the table. The table can be modified by
using the ALTER TABLE statement (see the sketches after the following list). The altering of a table allows you to:

1. Rename tables

2. Modify columns

3. Add new columns

4. Delete some columns


5. Change table properties

6. Alter tables for adding partitions
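
Hedged ALTER TABLE sketches for these cases (the table, column, and partition names are assumptions; the ADD PARTITION example assumes a table created with a matching PARTITIONED BY clause):

ALTER TABLE employee RENAME TO employee_details;                              -- rename a table
ALTER TABLE employee_details CHANGE salary salary DOUBLE;                     -- modify a column
ALTER TABLE employee_details ADD COLUMNS (dept STRING COMMENT 'Department');  -- add a new column
ALTER TABLE employee_details REPLACE COLUMNS (ename STRING, salary DOUBLE);   -- drop columns by replacing the column list
ALTER TABLE employee_details SET TBLPROPERTIES ('notes' = 'reviewed');        -- change table properties
ALTER TABLE sales_log ADD PARTITION (sale_year = 2024);                       -- add a partition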


5.12 HiveQL data manipulation

After specifying the database schema and creating a database, the data can be
modified by using a set of procedures/mechanisms defined by a special language
known as Data Manipulation Language (DML).

Data can be manipulated in the following ways:

1. Loading files into tables

2. Inserting data into Hive tables from queries

3. Updating existing tables

4. Deleting records in tables

Let's learn about each of these mechanisms in detail.

Loading Files into Tables

While loading data into tables, Hive does not perform any type of transformations.
The data load operations in Hive are, at present, pure copy/move operations, which
move data files from one location to another. You can upload data into Hive tables
from the local file system as well as from HDFS. The syntax of loading data from
files into tables is as follows:
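
The general form of the statement is (square brackets denote optional clauses):

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE]
INTO TABLE tablename [PARTITION (partcol1 = val1, partcol2 = val2, ...)];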

When the LOCAL keyword is specified in the LOAD DATA command, Hive
searches for the local directory. If the LOCAL keyword is not used, Hive checks the
directory on HDFS. On the other hand, when the OVERWRITE keyword is specified,
it deletes all the files under Hive’s warehouse directory for the given table. After that,
the latest files get uploaded. If you do not specify the OVERWRITE keyword, the
latest files are added in the already existing folder.
Inserting Data into Tables
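
A sketch of the INSERT syntax discussed below (square brackets denote optional clauses):

INSERT OVERWRITE TABLE tablename [PARTITION (partcol1 = val1, ...) [IF NOT EXISTS]]
SELECT select_statement FROM from_statement;

INSERT INTO TABLE tablename [PARTITION (partcol1 = val1, ...)]
SELECT select_statement FROM from_statement;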

In the preceding syntax, the INSERT OVERWRITE statement overwrites the current
data in the table or partition. The IF NOT EXISTS statement is given for a partition.
On the other hand, the INSERT INTO statement either appends the table or creates a
partition without modifying the existing data. The insert operation can be performed
on a table or a partition. You can also specify multiple insert clauses in the same
query.

Consider two tables, T1 and T2. We want to copy the sal column from T2 to T1 by
using the INSERT command. It can be done as follows:
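
A minimal sketch (assuming the schema of T1 matches the selected column):

INSERT OVERWRITE TABLE T1 SELECT sal FROM T2;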

Static Partition Insertion

Static partition insertion refers to the task of inserting data into a table by specifying a
partition column value. Consider the following example:
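
A hedged sketch (the employees and staged_employees tables and their columns are assumptions):

INSERT OVERWRITE TABLE employees
PARTITION (country = 'US', state = 'CA')
SELECT ename, salary, designation
FROM staged_employees
WHERE cnty = 'US' AND st = 'CA';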
Dynamic Partition Insertion

In dynamic partition insertion, you need to specify a list of partition column names in
the PARTITION() clause along with the optional column values. A dynamic partition
column always has a corresponding input column in the SELECT statement. If the
SELECT statement has multiple column names, the dynamic partition columns must
be specified at the end of the columns and in the same order in which they appear in
the PARTITION() clause. By default, the feature of dynamic partition is disabled.
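
A hedged sketch using the same assumed tables; the two SET statements enable the dynamic partition feature mentioned above:

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE employees
PARTITION (country, state)
SELECT ename, salary, designation, cnty, st
FROM staged_employees;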
Inserting Data into Local Files

Sometimes, you might require to save the result of the SELECT query in flat files so
that you do not have to execute the queries again and again. Consider the following
example:
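
A hedged sketch (the output directory and table name are assumptions):

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/employee_export'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM employee;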

Creating and Inserting Data into a Table Using a Single Query
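
A hedged sketch of a create-table-as-select (CTAS) statement (the table and column names are assumptions):

CREATE TABLE employee_backup AS
SELECT ename, salary FROM employee WHERE salary > 30000;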

Delete in Hive

The delete operation is available in Hive from the Hive 0.14 version. The delete
operation can only be performed on those tables that support the ACID property. The
syntax for performing the delete operation is as follows:
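
The general form, followed by a hedged usage example (the table must be declared with ACID/transactional support, as noted above):

DELETE FROM tablename [WHERE expression];

DELETE FROM employee WHERE salary < 10000;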
5.13 Data Retrieval queries
Hive allows you to perform data retrieval queries by using the SELECT command
along with various types of operators and clauses. In this section, you learn about the
following:

1. Using the SELECT command

2. Using the WHERE clause

3. Using the GROUP BY clause

4. Using the HAVING clause

5. Using the LIMIT clause

6. Executing HiveQL queries

Using the SELECT Command

The SELECT statement is the most common operation in SQL. You can filter the
required columns, rows, or both. The syntax for using the SELECT command is as
follows:
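
A sketch of the general SELECT syntax (square brackets denote optional clauses):

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]
[LIMIT number];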
Using the WHERE Clause

The WHERE clause is used to search records of a table on the basis of a given
condition. This clause returns a boolean result. For example, a query can return only
those sales records that have an amount greater than 15000 from the US region. Hive
also supports a number of operators (such as > and <) in the WHERE clause. The
following query shows an example of using the WHERE clause:
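
A hedged sketch based on the example described above (the sales table and its columns are assumptions):

SELECT * FROM sales
WHERE amount > 15000 AND region = 'US';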

Using the GROUP BY Clause

The GROUP BY clause is used to put all the related records together. It can also be
used with aggregate functions. Often, it is required to group the resultsets in complex
queries.

In such scenarios, the ‘GROUP BY’ clause can be used.
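
A hedged sketch using the same assumed sales table:

SELECT region, SUM(amount) AS total_amount
FROM sales
GROUP BY region;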

Using the HAVING Clause

The HAVING clause is used to specify a condition on the use of the GROUP BY
clause. Support for the HAVING clause was added in Hive version 0.7.0.

The following query shows an example of using the HAVING clause:
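
A hedged sketch using the same assumed sales table:

SELECT region, SUM(amount) AS total_amount
FROM sales
GROUP BY region
HAVING SUM(amount) > 50000;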


Using the LIMIT Clause
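
The LIMIT clause restricts the number of rows returned by a query. A hedged sketch using the same assumed sales table:

SELECT * FROM sales LIMIT 10;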

Using JOINS in Hive

Hive supports joining two or more tables to aggregate information. The various
joins supported by Hive are:

Inner joins

Outer joins

Inner Joins

In case of inner joins, only the records satisfying the given condition get selected. All
the other records get discarded. Figure 12.11 illustrates the concept of inner joins:
Let’s take an example to describe the concept of inner joins. Consider two tables,
order and customer

Table 12.6 lists the data of the order table:
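
For illustration, a hedged inner-join sketch (the column names cust_id, cname, order_id, and amount are assumptions about the customer and order tables; order is backquoted because it is a reserved word):

SELECT c.cname, o.order_id, o.amount
FROM customer c
JOIN `order` o ON (c.cust_id = o.cust_id);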


Outer Joins

Sometimes, you need to retrieve all the records from one table and only some records
from the other table.

In such cases, you have to use the outer join. Figure 12.12 illustrates the concept of
outer joins:

Outer joins are of three types:

Right Outer Join

Left Outer Join

Full Outer Join

Right Outer Join

In this type of join, all the records from the table on the right side of the join are
retained.
Left Outer Join

In this type of join, all the records from the table on the left side of the join are
retained.

Figure 12.14 illustrates the concept of left outer joins:


Full Outer Join

In this case, all the fields from both tables are included. For the entries that do not
have any match, a NULL value would be displayed.
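
Hedged sketches of the three outer-join forms, using the same assumed customer and order tables:

SELECT c.cname, o.amount
FROM customer c RIGHT OUTER JOIN `order` o ON (c.cust_id = o.cust_id);

SELECT c.cname, o.amount
FROM customer c LEFT OUTER JOIN `order` o ON (c.cust_id = o.cust_id);

SELECT c.cname, o.amount
FROM customer c FULL OUTER JOIN `order` o ON (c.cust_id = o.cust_id);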
Cartesian Product Joins

In cartesian product joins, all the records of one table are combined with another
table in all possible combinations.

This type of join does not involve any key column to join the tables. The following is
a query with a Cartesian product join:
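
A hedged sketch using the same assumed tables:

SELECT c.*, o.*
FROM customer c CROSS JOIN `order` o;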

Joining Tables

You can combine the data of two or more tables in Hive by using HiveQL queries.

For this, we need to create tables and load them into Hive from HDFS.
