BDA Complete Notes

UNIT II

Working with Big Data:


1. Google File System,
2. Hadoop Distributed File System (HDFS)
Building blocks of Hadoop
A. Namenode
B. Datanode
C. Secondary Name node
D. JobTracker
E. TaskTracker

3. Introducing and Configuring Hadoop cluster


A. Local
B. Pseudo-distributed mode
C. Fully Distributed mode

4. Configuring XML files.

The Google File System


The Google File System (GFS) is a scalable distributed file system for large distributed data-intensive
applications. It provides fault tolerance while running on inexpensive commodity hardware, and it
delivers high aggregate performance to a large number of clients.
GFS provides a familiar file system interface, though it does not implement a standard API such
as POSIX. Files are organized hierarchically in directories and identified by path-names. We
support the usual operations to create, delete, open, close, read, and write files.
Moreover, GFS has snapshot and record append operations. Snapshot creates a copy of a file or a
directory tree at low cost. Record append allows multiple clients to append data to the same file
concurrently while guaranteeing the atomicity of each individual client’s append.
Architecture:

A GFS cluster consists of a single master and multiple chunk servers and is accessed by multiple
clients, as shown in Figure.

Each of these is typically a commodity Linux machine running a user-level server process.
It is easy to run both a chunkserver and a client on the same machine, as long as machine resources
permit and the lower reliability caused by running possibly flaky application code is acceptable.

Files are divided into fixed-size chunks. Each chunk is identified by an immutable and globally
unique 64 bit chunk handle assigned by the master at the time of chunk creation.
Chunkservers store chunks on local disks as Linux files and read or write chunk data specified by a
chunk handle and byte range. For reliability, each chunk is replicated on multiple chunkservers. By
default, we store three replicas, though users can designate different replication levels for different
regions of the file namespace. The master maintains all file system metadata. This includes the
namespace, access control information, the mapping from files to chunks, and the current locations
of chunks.
It also controls system-wide activities such as chunk lease management, garbage collection of
orphaned chunks, and chunk migration between chunkservers. The master periodically
communicates with each chunkserver in HeartBeat messages to give it instructions and collect its
state.
GFS client code linked into each application implements the file system API and communicates
with the master and chunkservers to read or write data on behalf of the application. Clients interact
with the master for metadata operations, but all data-bearing communication goes directly to the
chunkservers. We do not provide the POSIX API and therefore need not hook into the Linux vnode
layer.

Neither the client nor the chunkserver caches file data. Client caches offer little benefit because most
applications stream through huge files or have working sets too large to be cached. Not having them
simplifies the client and the overall system by eliminating cache coherence issues. (Clients do cache
metadata, however.) Chunkservers need not cache file data because chunks are stored as local files
and so Linux’s buffer cache already keeps frequently accessed data in memory.

Single Master:
Having a single master vastly simplifies our design and enables the master to make sophisticated
chunk placement and replication decisions using global knowledge. However, we must
minimize its involvement in reads and writes so that it does not become a bottleneck. Clients never
read and write file data through the master. Instead, a client asks the master which chunkservers
it should contact.
Chunk Size:
Chunk size is one of the key design parameters. We have chosen 64 MB, which is much larger than
typical file system block sizes. Each chunk replica is stored as a plain Linux file on a chunkserver
and is extended only as needed.

Lazy space allocation avoids wasting space due to internal fragmentation, perhaps the greatest
objection against such a large chunk size.

A large chunk size offers several important advantages.


First, it reduces clients’ need to interact with the master because reads and writes on the
same chunk require only one initial request to the master for chunk location information. The
reduction is especially significant for our workloads because applications mostly read and write
large files sequentially. Even for small random reads, the client can comfortably cache all the chunk
location information for a multi-TB working set.

Second, since a client is more likely to perform many operations on a given large chunk, it can
reduce network overhead by keeping a persistent TCP connection to the chunkserver
over an extended period of time.

Third, it reduces the size of the metadata stored on the master. This allows us to keep the
metadata in memory, which in turn brings other advantages.
On the other hand, a large chunk size, even with lazy space allocation, has its disadvantages. A
small file consists of a small number of chunks, perhaps just one. The chunkservers storing those
chunks may become hot spots if many clients are accessing the same file. In practice, hot spots
have not been a major issue because our applications mostly read large multi-chunk files
sequentially.
However, hot spots did develop when GFS was first used by a batch-queue system: an executable
was written to GFS as a single-chunk file and then started on hundreds of machines at the same time.

Metadata
The master stores three major types of metadata: the file and chunk namespaces, the mapping from
files to chunks, and the locations of each chunk’s replicas.

All metadata is kept in the master’s memory. The first two types (namespaces and file-to- chunk
mapping) are also kept persistent by logging mutations to an operation log stored on the master’s
local disk and replicated on remote machines. Using a log allows us to update the master state
simply, reliably, and without risking inconsistencies in the event of a master crash. The master does
not store chunk location information persistently. Instead, it asks each chunkserver about its chunks
at master startup and whenever a chunkserver joins the cluster.

In-Memory Data Structures


Since metadata is stored in memory, master operations are fast. Furthermore, it is easy and
efficient for the master to periodically scan through its entire state in the background.

This periodic scanning is used to implement chunk garbage collection, re-replication in the
presence of chunkserver failures, and chunk migration to balance load and disk space usage across
chunkservers.

Advantages and disadvantages of large sized chunks in Google File System


Chunk size is one of the key design parameters. In GFS it is 64 MB, which is much larger than
typical file system block sizes. Each chunk replica is stored as a plain Linux file on a chunkserver
and is extended only as needed.
Advantages
➢ It reduces clients’ need to interact with the master because reads and writes on the same chunk
require only one initial request to the master for chunk location information.
➢ Since on a large chunk a client is more likely to perform many operations on a given chunk,
it can reduce network overhead by keeping a persistent TCP connection to the chunk server
over an extended period of time.
➢ It reduces the size of the metadata stored on the master. This allows us to keep the metadata
in memory, which in turn brings other advantages.

Disadvantages
➢ Internal fragmentation is the main objection to such a large chunk size, although lazy space
allocation avoids wasting space in practice.
➢ Even with lazy space allocation, a small file consists of a small number of chunks, perhaps
just one. The chunk servers storing those chunks may become hot spots if many clients are
accessing the same file. In practice, hot spots have not been a major issue because the
applications mostly read large multi-chunk files sequentially. Hot spots can be mitigated by
storing such files with a higher replication factor and by allowing clients to read from other clients.

Hadoop Distributed File System (HDFS): Building blocks of Hadoop


A. Namenode
B. Datanode
C. Secondary Name node
D. JobTracker
E. TaskTracker

Hadoop is made up of 2 parts:


1. HDFS – Hadoop Distributed File System
2. MapReduce – The programming model that is used to work on the data present in HDFS.

HDFS – Hadoop Distributed File System


HDFS is a file system that is written in Java and resides in user space, unlike traditional
file systems like FAT, NTFS, ext2, etc. that reside in kernel space. HDFS was primarily
written to store large amounts of data (terabytes and petabytes). HDFS was built along the lines of
Google’s paper on GFS.

MapReduce
MapReduce is the programming model that uses Java as the programming language to retrieve
data from files stored in HDFS. All data in HDFS is stored as files. MapReduce, too, was
built along the lines of another paper by Google.
Apart from publishing these papers, Google did not release its implementations of GFS and MapReduce.
However, the open source community built Hadoop and MapReduce based on those papers.
The initial adoption of Hadoop was at Yahoo Inc., where it gained good momentum and went
on to become a part of their production systems. After Yahoo, many organizations like LinkedIn,
Facebook, Netflix and many more have successfully implemented Hadoop within their
organizations.

Hadoop uses HDFS to store files efficiently in the cluster. When a file is placed in HDFS it
is broken down into blocks, 64 MB block size by default. These blocks are then replicated
across the different nodes (DataNodes) in the cluster. The default replication value is 3, i.e.
there will be 3 copies of the same block in the cluster. We will see later on why we maintain
replicas of the blocks in the cluster.

A Hadoop cluster can comprise a single node (single-node cluster) or thousands of
nodes.

Once you have installed Hadoop you can try out the following few basic commands to work
with HDFS:
hadoop fs -ls
hadoop fs -put <path_of_local> <path_in_hdfs>
hadoop fs -get <path_in_hdfs> <path_of_local>
hadoop fs -cat <path_of_file_in_hdfs>
hadoop fs -rmr <path_in_hdfs>
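
The same operations can also be performed from Java through Hadoop's FileSystem API
(org.apache.hadoop.fs). The following is only a small illustrative sketch; the paths used here are
placeholders, not part of any standard setup.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBasics {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);       // handle to HDFS (or the local FS in standalone mode)

    // roughly: hadoop fs -put /tmp/local.txt /user/hadoop/data.txt
    fs.copyFromLocalFile(new Path("/tmp/local.txt"), new Path("/user/hadoop/data.txt"));

    // roughly: hadoop fs -ls /user/hadoop
    for (FileStatus status : fs.listStatus(new Path("/user/hadoop"))) {
      System.out.println(status.getPath());
    }

    // roughly: hadoop fs -get /user/hadoop/data.txt /tmp/copy.txt
    fs.copyToLocalFile(new Path("/user/hadoop/data.txt"), new Path("/tmp/copy.txt"));
  }
}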

The different components of a Hadoop cluster are:


NameNode (Master) – NameNode, Secondary NameNode, JobTracker
DataNode 1 (Slave) – TaskTracker, DataNode
DataNode 2 (Slave) – TaskTracker, DataNode
DataNode 3 (Slave) – TaskTracker, DataNode
DataNode 4 (Slave) – TaskTracker, DataNode
DataNode 5 (Slave) – TaskTracker, DataNode


[Diagram: a 6-node Hadoop cluster in which the blocks of files A, B and C are replicated across the DataNodes]

The above diagram depicts a 6-node Hadoop cluster.


In the diagram you see that the NameNode, Secondary NameNode and the JobTracker are
running on a single machine. Usually, in production clusters having more than 20-30 nodes, these
daemons run on separate nodes.
Hadoop follows a Master-Slave architecture. As mentioned earlier, a file in HDFS is split into
blocks and replicated across Datanodes in a Hadoop cluster. You can see that the three files A, B
and C have been split across with a replication factor of 3 across the different Datanodes.

Now let us go through each node and daemon:


NameNode
The NameNode in Hadoop is the node where Hadoop stores all the location information of the files
in HDFS. In other words, it holds the metadata for HDFS. Whenever a file is placed in the cluster,
a corresponding entry of its location is maintained by the NameNode. So, for the files A, B and C
we would have something as follows in the NameNode:

File A – DataNode1, DataNode2, DataNode


File B – DataNode1, DataNode3, DataNode4
File C – DataNode2, DataNode3, DataNode4

This information is required when retrieving data from the cluster as the data is spread across
multiple machines. The NameNode is a Single Point of Failure for the Hadoop Cluster.
Secondary NameNode
IMPORTANT – The Secondary NameNode is not a failover node for the NameNode.
The secondary name node is responsible for performing periodic housekeeping functions
for the NameNode. It only creates checkpoints of the file system present in the NameNode.
DataNode
The DataNode is responsible for storing the files in HDFS. It manages the file blocks within
the node. It sends information to the NameNode about the files and blocks stored in that node
and responds to the NameNode for all filesystem operations.
JobTracker

JobTracker is responsible for taking in requests from a client and assigning TaskTrackers
with tasks to be performed. The JobTracker tries to assign tasks to the
TaskTracker on the DataNode where the data is locally present (Data Locality). If that is not
possible it will at least try to assign tasks to TaskTrackers within the same rack. If for some
reason the node fails the JobTracker assigns the task to another TaskTracker where the replica
of the data exists since the data blocks are replicated across the DataNodes. This ensures that
the job does not fail even if a node fails within the cluster.

TaskTracker
TaskTracker is a daemon that accepts tasks (Map, Reduce and Shuffle) from the JobTracker.
The TaskTracker keeps sending a heartbeat message to the JobTracker to notify that it is
alive. Along with the heartbeat it also sends the number of free slots available within it to
process tasks. The TaskTracker starts and monitors the Map and Reduce tasks and sends
progress/status information back to the JobTracker.
All the above daemons run within their own JVMs. A typical (simplified) flow in Hadoop
is as follows:
➢ A client (usually a MapReduce program) submits a job to the JobTracker.

➢ The JobTracker gets information from the NameNode on the location of the data
within the DataNodes. The JobTracker places the client program (usually a jar file
along with the configuration file) in HDFS. Once placed, the JobTracker tries
to assign tasks to TaskTrackers on the DataNodes based on data locality.
➢ The TaskTracker takes care of starting the Map tasks on the DataNodes by picking up the
client program from the shared location on HDFS.
➢ The progress of the operation is relayed back to the JobTracker by the TaskTracker.

➢ On completion of the Map task, an intermediate file is created on the local filesystem
of the TaskTracker.
➢ Results from the Map tasks are then passed on to the Reduce task.
➢ The Reduce task works on all data received from the Map tasks and writes the final output
to HDFS.
➢ After the task completes, the intermediate data generated by the TaskTracker is deleted.

A very important feature of Hadoop to note here is that the program goes to where the data
is, and not the other way around, thus resulting in efficient processing of data.

Introducing and Configuring Hadoop cluster


A. Local
B. Pseudo-distributed mode
C. Fully Distributed mode
Apache Hadoop is an open source framework that allows distributed processing of
large data sets across clusters of computers using simple programming models. Hadoop can
scale up from a single server to thousands of machines. Choosing the right installation mode is
therefore important. We can install Hadoop in three different modes:
• Standalone mode - Single Node Cluster
• Pseudo distributed mode - Single Node Cluster
• Distributed mode. - Multi Node Cluster

Purpose for Different Installation Modes


When Apache Hadoop is used in a production environment, multiple server nodes are used for
distributed computing. But for understanding the basics and playing around with Hadoop, single

node installation is sufficient. There is another mode known as 'pseudo distributed' mode. This mode
is used to simulate the multi node environment on a single server.

In this document we will discuss how to install Hadoop on Ubuntu Linux. In any mode, the system
should have Java version 1.6.x installed on it.
Standalone Mode Installation
Now, let us check the standalone mode installation process by following the steps mentioned below.
Install Java
Java (JDK Version 1.6.x) either from Sun/Oracle or Open Java is required.
• Step 1 - If you are unable to use OpenJDK and need the proprietary Sun JDK/JRE instead,
install sun-java6 from the Canonical Partner Repository by using the following command.
Note: The Canonical Partner Repository contains free-of-cost closed source third party
software. Canonical does not have access to the source code; they just package
and test it.

Add the canonical partner to the apt repositories using -


$ sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"


Step 2 - Update the source list.


$ sudo apt-get update
• Step 3 - Install JDK version 1.6.x from Sun/Oracle.
$ sudo apt-get install sun-java6-jdk
• Step 4 - Once the JDK installation is over, make sure that it is correctly set up by checking the
version (1.6.x from Sun/Oracle):
user@ubuntu:~# java -version
java version "1.6.0_45"
Java(TM) SE Runtime Environment (build 1.6.0_45-b02)
Java HotSpot(TM) Client VM (build 16.4-b01, mixed mode, sharing)

Add Hadoop User


• Step 5 - Add a dedicated Hadoop unix user to your system as under, to isolate this installation
from other software -
$ sudo adduser hadoop_admin

Download the Hadoop binary and install


• Step 6 - Download Apache Hadoop from the Apache web site. Hadoop comes as a
tar.gz archive. Copy this archive into the /usr/local/installables folder. The folder - installables
should be created first under /usr/local before this step. Now run the following commands as
sudo
$ cd /usr/local/installables
$ sudo tar xzf hadoop-0.20.2.tar.gz
$ sudo chown -R hadoop_admin /usr/local/hadoop-0.20.2

Define env variable - JAVA_HOME


• Step 7 - Open the Hadoop configuration file (hadoop-env.sh) in the location -
/usr/local/installables/hadoop-0.20.2/conf/hadoop-env.sh and define the JAVA_HOME as
under –
export JAVA_HOME=path/where/jdk/is/installed
(e.g. /usr/lib/jvm/java-6-sun)
Installation in Single mode
• Step 8 - Now go to the HADOOP_HOME directory (the location where Hadoop was extracted)
and run the following command –

$ bin/hadoop
The following output will be displayed –

Usage: hadoop [--config confdir] COMMAND


Some of the COMMAND options are mentioned below. There are other options available and
can be checked using the command mentioned above.
namenode -format format the DFS filesystem
secondarynamenode run the DFS secondary namenode
namenode run the DFS namenode
datanode run a DFS datanode
dfsadmin run a DFS admin client
mradmin run a Map-Reduce admin client
fsck run a DFS filesystem checking utility

The above output indicates that Standalone installation is completed successfully. Now you can run
the sample examples of your choice by calling –
$ bin/hadoop jar hadoop-*-examples.jar <NAME> <PARAMS>

Pseudo Distributed Mode Installation


This is a simulated multi-node environment based on a single node server.
Here, the first step required is to configure SSH in order to access and manage the different nodes.
It is mandatory to have SSH access to the different nodes. Once SSH is configured, enabled
and accessible, we should start configuring Hadoop. The following configuration files need to
be modified:
• conf/core-site.xml
• conf/hdfs-site.xml
• conf/mapred-site.xml
Open all the configuration files in the vi editor and update the configuration.
Configure core-site.xml file:
$ vi conf/core-site.xml

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/tmp/hadoop-${user.name}</value>
</property>
</configuration>

Configure hdfs-site.xml file:



$ vi conf/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>

Configure mapred-site.xml file:


$ vi conf/mapred-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>

Once these changes are done, we need to format the name node by using the following command. The
command prompt will show all the messages one after another and finally success message.

$ bin/hadoop namenode -format


Our setup for pseudo-distributed mode is now done. Let's start the single node cluster by using the
following command. It will again show some set of messages on the command prompt and start the
server processes.
$ bin/start-all.sh
Now we should check the status of the Hadoop processes by executing the jps command as shown below.
It will show all the running processes.
$ jps
14799 NameNode
14977 SecondaryNameNode
15183 DataNode
15596 JobTracker
15897 TaskTracker

Accessing Hadoop on Browser:


The default port number to access the Hadoop (NameNode) web interface is 50070. Use the following
URL to get Hadoop services in a browser.

http://localhost:50070/

Stopping the Single node Cluster: We can stop the single node cluster by using the following
command. The command prompt will display all the stopping processes.
$ bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode

Fully Distributed Mode Installation


Compatibility Requirements
S.No  Category          Supported
1     Languages         Java, Python, Perl, Ruby etc.
2     Operating System  Linux (server deployment, mostly preferred), Windows (development only), Solaris
3     Hardware          32-bit Linux (64-bit for large deployments)

Installation Items

S.No  Item                          Version
1     jdk-6u25-linux-i586.bin       Java 1.6 or higher
2     hadoop-0.20.2-cdh3u0.tar.gz   Hadoop 0.20.2

Note: Both items are required to be installed on the Namenode and Datanode machines.
em

Installation Requirements
S.No  Requirement                      Reason
1     Operating system – Linux         Recommended for server deployment (production env.)
2     Language – Java 1.6 or higher
3     RAM – at least 3 GB/node
4     Hard disk – at least 1 TB        For the namenode machine.
5     Root credentials                 For changing some system files you need admin permissions.

High level Steps


1 Binding IP address with the host name under /etc/hosts
2 Setting passwordless SSH
3 Installing Java
4 Installing Hadoop
5 Setting the JAVA_HOME and HADOOP_HOME variables
6 Updating .bash_profile file for hadoop
7 Creating required folders for namenode and datanode
8 Configuring the .xml files
9 Setting the masters and slaves in all the machines
10 Formatting the namenode
11 Starting the DFS and MapReduce services
12 Stopping all services

Before we start the distributed mode installation, we must ensure that we have the pseudo-distributed
setup done and we have at least two machines, one acting as master and the other acting as a slave.
Now we run the following commands in sequence.
• $ bin/stop-all.sh – make sure none of the daemons are running

Binding IP address with the host names


• Before starting the installation of Hadoop, you first need to bind the IP address of the
machines along with their host names in the /etc/hosts file.


• First check the hostname of your machine by using the following command:
$ hostname

• Open /etc/hosts file for binding IP with the hostname


$ vi /etc/hosts
• Provide ip & hostname of the all the machines in the cluster e.g:
10.11.22.33 hostname1
10.11.22.34 hostname2

Setting Passwordless SSh login


• SSH is used to login from one system to another without requiring passwords. This
will be required when you run a cluster, it will not prompt you for the password again and
again.

• First log in on Host1 (hostname of namenode machine) as hadoop user and generate a
pair of authentication keys. Command is:

hadoop@Host1$ ssh-keygen -t rsa


Note: Give the hostname obtained with the hostname command above. Do not enter any passphrase if asked.
Now use ssh to create a directory ~/.ssh as user hadoop on Host2 (the hostname of a machine other
than the namenode machine).

Now we open the two files - conf/masters and conf/slaves. The conf/masters file defines the master nodes of
our multi-node cluster. The conf/slaves file lists the hosts where the Hadoop slave daemons will be running.
• Edit the conf/core-site.xml file to have the following entries -
<property>
<name>fs.default.name</name>
<value>hdfs://master:54310</value>
</property>

• Edit the conf/mapred-site.xml file to have the following entries -
<property>
<name>mapred.job.tracker</name>
<value>master:54311</value>
</property>
• Edit the conf/hdfs-site.xml file to have the following entries -

<property>
<name>dfs.replication</name>
<value>2</value>
</property>

• Edit the conf/mapred-site.xml file to have the following entries -


<property>
<name>mapred.local.dir</name>
<value>${hadoop.tmp.dir}/mapred/local</value>
</property>
<property>
<name>mapred.map.tasks</name>
<value>50</value>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>5</value>
</property>
Now start the master by using the following command.
bin/start-dfs.sh
Once started, check the status on the master by using jps command. You should get the following
output –
14799 NameNode
15314 Jps
16977 SecondaryNameNode

On the slave, the output should be as shown as:


15183 DataNode
15616 Jps

Now start the MapReduce daemons using the following command.


$ bin/start-mapred.sh

Once started, check the status on the master by using jps command. You should get the following
output:

16017 Jps
14799 NameNode

15596 JobTracker
14977 SecondaryNameNode

On the slaves, the output should be as shown below.

15183 DataNode
15897 TaskTracker
16284 Jps

Configuring XML files.



Hadoop Cluster Configuration Files

All these files are available under the ‘conf’ directory of the Hadoop installation directory.

Here is a listing of these files in the file system:

[Screenshot: listing of the Hadoop configuration files under the conf directory]

Let’s look at the files and their usage one by one!

hadoop-env.sh
This file specifies environment variables that affect the JDK used by Hadoop
Daemon (bin/hadoop).
As Hadoop framework is written in Java and uses Java Runtime environment, one of the
important environment variables for Hadoop daemon is $JAVA_HOME in hadoop-env.sh.
This variable directs Hadoop daemon to the Java path in the system.

This file is also used for setting other Hadoop daemon execution environment settings such as heap
size (HADOOP_HEAPSIZE), Hadoop home (HADOOP_HOME), log file location
(HADOOP_LOG_DIR), etc.
Note: For the simplicity of understanding the cluster setup, we have configured only necessary
parameters to start a cluster.
The following three files are the important configuration files for the runtime environment
settings of a Hadoop cluster.
core-site.xml
This file informs Hadoop daemon where NameNode runs in the cluster. It contains the
configuration settings for Hadoop Core such as I/O settings that are common
to HDFS and MapReduce.
[XML snippet: core-site.xml setting fs.default.name to hdfs://<hostname>:<port>]

Here hostname and port are the machine and port on which the NameNode daemon runs and
listens. It also informs the NameNode as to which IP and port it should bind. The commonly
used port is 8020 and you can also specify an IP address rather than a hostname.
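
To see how clients pick up this setting, the small sketch below (an illustration only, not one of the
configuration files) reads fs.default.name through Hadoop's Configuration class, which loads
core-site.xml from the classpath:

import org.apache.hadoop.conf.Configuration;

public class ShowNameNodeAddress {
  public static void main(String[] args) {
    Configuration conf = new Configuration();   // loads core-default.xml and core-site.xml
    // In Hadoop 0.20.x the NameNode address is configured under fs.default.name
    String nameNodeUri = conf.get("fs.default.name", "file:///");
    System.out.println("NameNode URI: " + nameNodeUri);
  }
}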
hdfs-site.xml
This file contains the configuration settings for HDFS daemons; the Name Node, the Secondary
Name Node, and the data nodes.
You can also configure hdfs-site.xml to specify default block replication and permission
checking on HDFS. The actual number of replications can also be specified when the file is created.
The default is used if replication is not specified at create time.
[XML snippet: hdfs-site.xml setting dfs.replication and dfs.permissions]

The value “true” for property ‘dfs.permissions’ enables permission checking in


HDFS and the value “false” turns off the permission checking. Switching from one parameter value
to the other does not change the mode, owner or group of files or directories.

mapred-site.xml
This file contains the configuration settings for MapReduce daemons: the job tracker and the
task-trackers. The mapred.job.tracker parameter is a hostname (or IP address) and port pair on
which the Job Tracker listens for RPC communication. This parameter specifies the location of the
Job Tracker to the Task Trackers and MapReduce clients.

You can replicate all of the four files explained above to all the Data Nodes and the Secondary
Namenode. These files can then be configured for any node-specific configuration, e.g. in case of a
different JAVA_HOME on one of the Datanodes.
The following two files, ‘masters’ and ‘slaves’, determine the master and slave nodes in a Hadoop
cluster.
Masters
This file informs the Hadoop daemon about the Secondary Namenode location. The ‘masters’ file
on the Master server contains the hostname of the Secondary NameNode server.

The ‘masters’ file on Slave Nodes is blank.



Slaves
The ‘slaves’ file at Master node contains a list of hosts, one per line, that are to host Data Node
and Task Tracker servers.

The ‘slaves’ file on Slave server contains the IP address of the slave node. Notice that the
‘slaves’ file at Slave node contains only its own IP address and not of any other Data Nodes in
the cluster.

UNIT-III

MapReduce is a programming model for data processing. The model is simple, yet
not too simple to express useful programs in. Hadoop can run MapReduce programs written
in various languages. Most important, MapReduce programs are inherently parallel, thus
putting very large-scale data analysis into the hands of anyone with enough machines at their
disposal. MapReduce comes into its own for large datasets, so let’s start by looking at one.

A Weather Dataset
For our example, we will write a program that mines weather data. Weather sensors
collecting data every hour at many locations across the globe gather a large volume of log
data, which is a good candidate for analysis with MapReduce, since it is semistructured and
record-oriented.
Data Format
The data we will use is from the National Climatic Data Center (NCDC, http://www.ncdc.noaa.gov/).
The data is stored using a line-oriented ASCII format, in which each line is
a record. The format supports a rich set of meteorological elements, many of which are
optional or with variable data lengths. For simplicity, we shall focus on the basic elements,
such as temperature, which are always present and are of fixed width.
Example shows a sample line with some of the salient fields highlighted. The line has been
split into multiple lines to show each field: in the real file, fields are packed into one line with
no delimiters.
Data files are organized by date and weather station. There is a directory for each year
from 1901 to 2001, each containing a gzipped file for each weather station with its readings
for that year. For example, here are the first entries for 1990:

% ls raw/1990 | head
010010-99999-1990.gz
010014-99999-1990.gz
010015-99999-1990.gz
010016-99999-1990.gz
010017-99999-1990.gz
010030-99999-1990.gz
Since there are tens of thousands of weather stations, the whole dataset is made up of
a large number of relatively small files. It’s generally easier and more efficient to process a
smaller number of relatively large files, so the data was preprocessed so that each year’s
readings were concatenated into a single file.
Analyzing the Data with Hadoop
To take advantage of the parallel processing that Hadoop provides, we need to express
our query as a MapReduce job. After some local, small-scale testing, we will be able to run it
on a cluster of machines.
Map and Reduce
MapReduce works by breaking the processing into two phases: the map phase and
the reduce phase. Each phase has key-value pairs as input and output, the types of which
may be chosen by the programmer. The programmer also specifies two functions: the map
function and the reduce function.
The input to our map phase is the raw NCDC data. We choose a text input format that
gives us each line in the dataset as a text value. The key is the offset of the beginning of the
line from the beginning of the file, but as we have no need for this, we ignore it.

Our map function is simple. We pull out the year and the air temperature, since these
are the only fields we are interested in. In this case, the map function is just a data preparation
phase, setting up the data in such a way that the reducer function can do its work on it:
finding the maximum temperature for each year. The map function is also a good place to
drop bad records: here we filter out temperatures that are missing, suspect, or erroneous.
To visualize the way the map works, consider the following sample lines of input data
(some unused columns have been dropped to fit the page, indicated by ellipses):
0067011990999991950051507004...9999999N9+00001+99999999999...
0043011990999991950051512004...9999999N9+00221+99999999999...
0043011990999991950051518004...9999999N9-00111+99999999999...
0043012650999991949032412004...0500001N9+01111+99999999999...
0043012650999991949032418004...0500001N9+00781+99999999999...
These lines are presented to the map function as the key-value pairs:
(0, 0067011990999991950051507004...9999999N9+00001+99999999999...)
(106, 0043011990999991950051512004...9999999N9+00221+99999999999...)
(212, 0043011990999991950051518004...9999999N9-00111+99999999999...)
(318, 0043012650999991949032412004...0500001N9+01111+99999999999...)
(424, 0043012650999991949032418004...0500001N9+00781+99999999999...)

The keys are the line offsets within the file, which we ignore in our map function. The map
function merely extracts the year and the air temperature (indicated in bold text), and emits
them as its output (the temperature values have been interpreted as integers):

(1950, 0)
(1950, 22)
(1950, −11)
(1949, 111)
(1949, 78)
The output from the map function is processed by the MapReduce framework before being
sent to the reduce function. This processing sorts and groups the key-value pairs by key.
So, continuing the example, our reduce function sees the following input:
(1949, [111, 78])
(1950, [0, 22, −11])
Each year appears with a list of all its air temperature readings. All the reduce function has to
do now is iterate through the list and pick up the maximum reading:
(1949, 111)
(1950, 22)
This is the final output: the maximum global temperature recorded in each year.
The whole data flow is illustrated in the figure below. At the bottom of the diagram
is a Unix pipeline.

Figure 2-1. MapReduce logical data flow

Java MapReduce
Having run through how the MapReduce program works, the next step is to express it
in code. We need three things: a map function, a reduce function, and some code to run
the job. The map function is represented by an implementation of the Mapper interface,
which declares a map() method. Example -1 shows the implementation of our map function.

Example - 1. Mapper for maximum temperature example


import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;

import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;

import org.apache.hadoop.mapred.Reporter;
public class MaxTemperatureMapper extends MapReduceBase implements
Mapper <LongWritable, Text, Text, IntWritable>
{

private static final int MISSING = 9999;


public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException
{
String line = value.toString();
String year = line.substring(15, 19);
int airTemperature;
if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
airTemperature = Integer.parseInt(line.substring(88, 92));
} else {
airTemperature = Integer.parseInt(line.substring(87, 92));
}
String quality = line.substring(92, 93);
if (airTemperature != MISSING && quality.matches("[01459]")) {
output.collect(new Text(year), new IntWritable(airTemperature));
}
}
}
The Mapper interface is a generic type, with four formal type parameters that specify
the input key, input value, output key, and output value types of the map function. For the
present example, the input key is a long integer offset, the input value is a line of text, the
output key is a year, and the output value is an air temperature (an integer). Rather than use
built-in Java types, Hadoop provides its own set of basic types that are optimized for network
serialization. These are found in the org.apache.hadoop.io package.
Here we use LongWritable, which corresponds to a Java Long, Text (like Java
String), and IntWritable (like Java Integer).

The map() method is passed a key and a value. We convert the Text value containing the line
of input into a Java String, then use its substring() method to extract the columns we are
interested in.
The map() method also provides an instance of OutputCollector to write the output to. In this
case, we write the year as a Text object (since we are just using it as a key), and the
temperature is wrapped in an IntWritable. We write an output record only if the temperature
is present and the quality code indicates the temperature reading is OK.

The reduce function is similarly defined using a Reducer, as illustrated in Example - 2.

Example - 2. Reducer for maximum temperature example

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;

import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class MaxTemperatureReducer extends MapReduceBase


implements Reducer<Text, IntWritable, Text, IntWritable>
{
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
{
int maxValue = Integer.MIN_VALUE;
while (values.hasNext()) {
maxValue = Math.max(maxValue, values.next().get());
}
output.collect(key, new IntWritable(maxValue));
}
}

Again, four formal type parameters are used to specify the input and output types, this
time for the reduce function. The input types of the reduce function must match the output
types of the map function: Text and IntWritable. And in this case, the output types of the
reduce function are Text and IntWritable, for a year and its maximum temperature, which we
find by iterating through the temperatures and comparing each with a record of the highest
found so far.
The third piece of code runs the MapReduce job (see Example -3).
Example -3. Application to find the maximum temperature in the weather dataset
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
public class MaxTemperature
{
public static void main(String[] args) throws IOException
{
if (args.length != 2)
{
System.err.println("Usage: MaxTemperature <input path> <output path>");
System.exit(-1);
}
JobConf conf = new JobConf(MaxTemperature.class);
conf.setJobName("Max temperature");
FileInputFormat.addInputPath(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
conf.setMapperClass(MaxTemperatureMapper.class);
conf.setReducerClass(MaxTemperatureReducer.class);

conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
JobClient.runJob(conf);

}
}

A JobConf object forms the specification of the job. It gives you control over how the
job is run. When we run this job on a Hadoop cluster, we will package the code into a JAR
file (which Hadoop will distribute around the cluster). Rather than explicitly specify the name
of the JAR file, we can pass a class in the JobConf constructor, which Hadoop will use to
locate the relevant JAR file by looking for the JAR file containing this class.
Having constructed a JobConf object, we specify the input and output paths. An input
path is specified by calling the static addInputPath() method on FileInputFormat, and it
can be a single file, a directory (in which case, the input forms all the files in that directory),
or a file pattern. As the name suggests, addInputPath() can be called more
than once to use input from multiple paths.

The output path (of which there is only one) is specified by the static setOutputPath()
method on FileOutputFormat. It specifies a directory where the output files from the
reducer functions are written. The directory shouldn’t exist before running the job, as
Hadoop will complain and not run the job. This precaution is to prevent data loss (it can be
very annoying to accidentally overwrite the output of a long job with another).
Next, we specify the map and reduce types to use via the setMapperClass() and
setReducerClass() methods. The setOutputKeyClass() and setOutputValueClass()
methods control the output types for the map and the reduce functions, which are often the
same, as they are in our case.
If they are different, then the map output types can be set using the methods
setMapOutputKeyClass() and setMapOutputValueClass(). The input types are controlled via
the input format, which we have not explicitly set since we are using the default
TextInputFormat. After setting the classes that define the map and reduce functions, we are
ready to run the job.
The static runJob() method on JobClient submits the job and waits for it to finish,
writing information about its progress to the console.

The output was written to the output directory, which contains one output file per
reducer. The job had a single reducer, so we find a single file, named part-00000:

% cat output/part-00000
1949 111
1950 22
This result is the same as when we went through it by hand earlier. We interpret this
as saying that the maximum temperature recorded in 1949 was 11.1°C, and in 1950 it was
2.2°C.
The new Java MapReduce API:
Release 0.20.0 of Hadoop included a new Java MapReduce API, sometimes referred
to as “Context Objects,” designed to make the API easier to evolve in the future. The new
API is type-incompatible with the old, however, so applications need to be rewritten to take
advantage of it.

There are several notable differences between the two APIs:


 The new API favours abstract classes over interfaces, since these are easier to evolve.
For example, you can add a method (with a default implementation) to an abstract class
without breaking old implementations of the class. In the new API, the Mapper and
Reducer interfaces are now abstract classes.
 The new API is in the org.apache.hadoop.mapreduce package (and subpackages). The
old API can still be found in org.apache.hadoop.mapred.
 The new API makes extensive use of context objects that allow the user code to
communicate with the MapReduce system. The MapContext, for example, essentially
unifies the role of the JobConf, the OutputCollector, and the Reporter.
 The new API supports both a “push” and a “pull” style of iteration. In both APIs, key-
value record pairs are pushed to the mapper, but in addition, the new API allows a
mapper to pull records from within the map() method. The same goes for the reducer. An
example of how the “pull” style can be useful is processing records in batches, rather
than one by one.
 Configuration has been unified. The old API has a special JobConf object for job
configuration, which is an extension of Hadoop’s vanilla Configuration object. In the
new API, this distinction is dropped, so job configuration is done through a
Configuration.
 Job control is performed through the Job class, rather than JobClient, which no longer
exists in the new API.
 Output files are named slightly differently: part-m-nnnnn for map outputs, and part-r-nnnnn
for reduce outputs (where nnnnn is an integer designating the part number,
starting from zero).

When converting your Mapper and Reducer classes to the new API, don’t forget to
change the signature of the map() and reduce() methods to the new form. Just changing your
class to extend the new Mapper or Reducer classes will not produce a compilation error or
warning, since these classes provide an identity form of the map() or reduce() method
(respectively). Your mapper or reducer code, however, will not be invoked, which can lead to
some hard-to-diagnose errors.
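
To make these differences concrete, here is a sketch of the earlier maximum-temperature mapper
rewritten against the new API (org.apache.hadoop.mapreduce). It assumes the same NCDC record
layout as Example 1; the class name is illustrative.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class NewApiMaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature;
    if (line.charAt(87) == '+') {
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      // The Context object replaces the OutputCollector and Reporter of the old API
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}

Note that Mapper is now an abstract class, the map() method is overridden rather than implemented
from an interface, and output is emitted through the Context object.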

Combiner
A Combiner, also known as a semi-reducer, is an optional class that operates by accepting
the inputs from the Map class and thereafter passing the output key-value pairs to the
Reducer class.

The main function of a Combiner is to summarize the map output records with the same key.
The output (key-value collection) of the combiner will be sent over the network to the actual
Reducer task as input.

Combiner class
The Combiner class is used in between the Map class and the Reduce class to reduce the
volume of data transfer between Map and Reduce. Usually, the output of the map task is
large and the data transferred to the reduce task is high.


The following MapReduce task diagram shows the COMBINER PHASE.
How Combiner Works?
Here is a brief summary on how MapReduce Combiner works −

 A combiner does not have a predefined interface and it must implement the Reducer
interface’s reduce() method.

 A combiner operates on each map output key. It must have the same output key-value
types as the Reducer class.

 A combiner can produce summary information from a large dataset because it


replaces the original Map output.

Although the Combiner is optional, it helps segregate data into multiple groups for the Reduce
phase, which makes it easier to process.

MapReduce Combiner Implementation

The following example provides a theoretical idea about combiners. Let us assume we have
the following input text file named input.txt for MapReduce.
What do you mean by Object
What do you know about Java
What is Java Virtual Machine
How Java enabled High Performance

The important phases of the MapReduce program with Combiner are discussed below.

Record Reader

This is the first phase of MapReduce where the Record Reader reads every line from the

input text file as text and yields output as key-value pairs.

Input − Line by line text from the input file.


Output − Forms the key-value pairs. The following is the set of expected key-value pairs.
<1, What do you mean by Object>
<2, What do you know about Java>
<3, What is Java Virtual Machine>
<4, How Java enabled High Performance>
Map Phase

The Map phase takes input from the Record Reader, processes it, and produces the output as
another set of key-value pairs.

Input − The following key-value pair is the input taken from the Record Reader.

<1, What do you mean by Object>


<2, What do you know about Java>
<3, What is Java Virtual Machine>
<4, How Java enabled High Performance>
The Map phase reads each key-value pair, divides each word from the value using
StringTokenizer, and treats each word as key and the count of that word as value. The
following code snippet shows the Mapper class and the map function.

public static class WordCountMapper extends Mapper<Object, Text, Text, IntWritable>


{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context) throws IOException,
InterruptedException
{
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens())
{
word.set(itr.nextToken());
context.write(word, one);
}
}
}
Output − The expected output is as follows −
<What,1> <do,1> <you,1> <mean,1> <by,1> <Object,1>
<What,1> <do,1> <you,1> <know,1> <about,1> <Java,1>
<What,1> <is,1> <Java,1> <Virtual,1> <Machine,1>
<How,1> <Java,1> <enabled,1> <High,1> <Performance,1>

Combiner Phase

The Combiner phase takes each key-value pair from the Map phase, processes it, and
produces the output as key-value collection pairs.
Input − The following key-value pair is the input taken from the Map phase.
<What,1> <do,1> <you,1> <mean,1> <by,1> <Object,1>
<What,1> <do,1> <you,1> <know,1> <about,1> <Java,1>
<What,1> <is,1> <Java,1> <Virtual,1> <Machine,1>
<How,1> <Java,1> <enabled,1> <High,1> <Performance,1>

The Combiner phase reads each key-value pair, combines the common words as key and
values as collection. Usually, the code and operation for a Combiner is similar to that of a
Reducer. Following is the code snippet for Mapper, Combiner and Reducer class
declaration.

job.setMapperClass(WordCountMapper.class);
job.setCombinerClass(WordCountReducer.class);
job.setReducerClass(WordCountReducer.class);
Output − The expected output is as follows −

<What,1,1,1> <do,1,1> <you,1,1> <mean,1> <by,1> <Object,1>


<know,1> <about,1> <Java,1,1,1>
<is,1> <Virtual,1> <Machine,1>
<How,1> <enabled,1> <High,1> <Performance,1>

Partitioner Phase

The partitioning phase takes place after the map phase and before the reduce phase. The
number of partitions is equal to the number of reducers. The data gets partitioned across
the reducers according to the partitioning function.

The difference between a partitioner and a combiner is that the partitioner divides
the data according to the number of reducers so that all the data in a single partition gets
executed by a single reducer. However, the combiner functions similar to the reducer and
processes the data in each partition. The combiner is an optimization to the reducer.

The default partitioning function is the hash partitioning function where the hashing
is done on the key. However it might be useful to partition the data according to some other
function of the key or the value.
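
As a rough sketch of a custom partitioner (the class name and routing rule below are purely
illustrative, using the new API's org.apache.hadoop.mapreduce.Partitioner), the following sends all
words that start with the same letter to the same reducer:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    if (numPartitions == 0 || key.getLength() == 0) {
      return 0;
    }
    // Route every word sharing the same first character to the same reducer
    char first = Character.toLowerCase(key.toString().charAt(0));
    return first % numPartitions;
  }
}

It would be registered on the job with job.setPartitionerClass(FirstLetterPartitioner.class); if nothing
is set, Hadoop's default HashPartitioner hashes the key as described above.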
Reducer Phase

The Reducer phase takes each key-value collection pair from the Combiner phase, processes
it, and passes the output as key-value pairs. Note that the Combiner functionality is the same as
the Reducer's.

Input − The following key-value pair is the input taken from the Combiner phase.

<What,1,1,1> <do,1,1> <you,1,1> <mean,1> <by,1> <Object,1>


<know,1> <about,1> <Java,1,1,1>
<is,1> <Virtual,1> <Machine,1>
<How,1> <enabled,1> <High,1> <Performance,1>

The Reducer phase reads each key-value pair. Following is the code snippet for the Reducer
(the same class also serves as the Combiner, as set above).

public static class WordCountReducer extends Reducer<Text,IntWritable,Text,IntWritable>


{
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws
IOException, InterruptedException
{
int sum = 0;
for (IntWritable val : values)
{
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}

Output − The expected output from the Reducer phase is as follows –

<What,3> <do,2> <you,2> <mean,1> <by,1> <Object,1>


<know,1> <about,1> <Java,3>
<is,1> <Virtual,1> <Machine,1>
<How,1> <enabled,1> <High,1> <Performance,1>
Record Writer

This is the last phase of MapReduce where the Record Writer writes every key-value pair
from the Reducer phase and sends the output as text.

Input − Each key-value pair from the Reducer phase along with the Output format.
Output − It gives you the key-value pairs in text format. Following is the expected output.

What 3
do 2
you 2
mean 1
by 1

Object 1
know 1
about 1
Java 3
is 1
Virtual 1
Machine 1
How 1
enabled 1
High 1
Performance 1
Hadoop and Big Data

Unit 4: Hadoop I/O: The Writable Interface, WritableComparable and comparators, Writable
Classes: Writable wrappers for Java primitives, Text, BytesWritable, NullWritable,
ObjectWritable and GenericWritable, Writable collections, Implementing a Custom Writable:
Implementing a RawComparator for speed, Custom comparators

Reference: Hadoop: The Definitive Guide by Tom White, 3rd Edition, O’Reilly

Unit 4

1. Hadoop I/O:

• The Writable Interface
• WritableComparable and comparators

2. Writable Classes:

• Writable wrappers for Java primitives
• Text
• BytesWritable
• NullWritable
• ObjectWritable and GenericWritable
• Writable collections

3. Implementing a Custom Writable:

• Implementing a RawComparator for speed
• Custom comparators
1. Hadoop I/O

The Writable Interface

Writable is an interface in Hadoop, and types in Hadoop must implement this interface. Hadoop
provides these writable wrappers for almost all Java primitive types and some other types, but
sometimes we need to pass custom objects, and these custom objects should implement Hadoop's
Writable interface. Hadoop MapReduce uses implementations of Writables for interacting with
user-provided Mappers and Reducers.

To implement the Writable interface we require two methods:

public interface Writable {

void readFields(DataInput in) throws IOException;

void write(DataOutput out) throws IOException;

}

Why use Hadoop Writable(s)?

As we already know, data needs to be transmitted between different nodes in a distributed


computing environment. This requires serialization and de-serialization of data to convert the
data that is in structured format to a byte stream and vice-versa. Hadoop therefore uses a simple and
efficient serialization protocol to serialize data between the map and reduce phases, and these are

called Writable(s). Some of the examples of writables as already mentioned before are
IntWritable, LongWritable, BooleanWritable and FloatWritable.

The WritableComparable interface is just a subinterface of the Writable and java.lang.Comparable


interfaces. For implementing a WritableComparable we must have compareTo method apart
from readFields and write methods, as shown below:

public interface WritableComparable extends Writable, Comparable
{
void readFields(DataInput in) throws IOException;
void write(DataOutput out) throws IOException;
int compareTo(WritableComparable o);
}

Comparison of types is crucial for MapReduce, where there is a sorting phase during which keys
are compared with one another.

Implementing a comparator for WritableComparables like the


org.apache.hadoop.io.RawComparator interface will definitely help speed up your Map/Reduce
(MR) Jobs. As you may recall, a MR Job is composed of receiving and sending key-value pairs.
The process looks like the following.
(K1,V1) –> Map –> (K2,V2)
(K2,List[V2]) –> Reduce –> (K3,V3)
The key-value pairs (K2,V2) are called the intermediary key-value pairs. They are passed from
the mapper to the reducer. Before these intermediary key-value pairs reach the reducer, a shuffle
and sort step is performed.

The shuffle is the assignment of the intermediary keys (K2) to reducers and the sort is the sorting
of these keys. By implementing a RawComparator to compare the intermediary
keys, this sorting step can be sped up considerably. Sorting is improved because the
RawComparator will compare the keys byte by byte. If we did not use a RawComparator, the
intermediary keys would have to be completely deserialized to perform a comparison.

Note (In Short):


1) WritableComparables can be compared to each other, typically via Comparators. Any type
which is to be used as a key in the Hadoop Map-Reduce framework should implement this
interface.

2) Any type which is to be used as a value in the Hadoop Map-Reduce framework should
implement the Writable interface.
Writables and its Importance in Hadoop

Writable is an interface in Hadoop. Writable in Hadoop acts as a wrapper class to almost all the
primitive data type of Java. That is how int of java has become IntWritable in Hadoop and String
of Java has become Text in Hadoop.

Writables are used for creating serialized data types in Hadoop. So, let us start by understanding
what data type, interface and serilization is.

Data Type

A data type is a set of data with values having predefined characteristics. There are several kinds
of data types in Java, for example int, short, byte, long, char etc. These are called primitive data
types. All these primitive data types are bound to classes called wrapper classes; for example, the
int primitive is wrapped by the Integer class. These wrapper classes are predefined in Java.

Interface in Java

An interface in Java is a complete abstract class. The methods within an interface are abstract
methods which do not have a body, and the fields within the interface are public, static and final,
which means that the fields cannot be modified.
The structure of an interface is very similar to that of a class. We cannot create an object for an
interface; the only way to use an interface is to implement it in another class by using the
'implements' keyword.

Serialization

Serialization is nothing but converting the raw data into a stream of bytes which can travel along
different networks and can reside in different systems. Serialization is not the only concern of the
Writable interface; it also has to support the compare and sort operations in Hadoop.

Why are Writables Introduced in Hadoop?

Now the question is whether Writables are necessary for Hadoop. The Hadoop framework definitely
needs the Writable type of interface in order to perform the following tasks:

Implement serialization, Transfer data between clusters and networks

Store the de-serialized data in the local disk of the system

Implementation of Writable is similar to implementation of an interface in Java. It can be done by
simply writing the keyword 'implements' and overriding the default Writable methods.

Writable is a strong interface in Hadoop which, while serializing the data, reduces the data size
enormously, so that data can be exchanged easily within the network. It has separate read and
write fields to read data from the network and write data into the local disk respectively. Every data
type inside Hadoop should accept the Writable and Comparable interface properties.

We have seen how Writables reduce the data size overhead and make data transfer easier in
the network.

What if Writable were not there in Hadoop?

Let us now understand what happens if Writable is not present in Hadoop.


Serialization is important in Hadoop because it enables easy transfer of data. If Writable is not
present in Hadoop, then it uses the serialization of Java, which increases the data overhead in the
network.
smallInt serialized value using Java serializer
aced0005737200116a6176612e6c616e672e496e74656765
7212e2a0a4f781873802000149000576616c7565787200106a6176612e
6c616e672e4e756d62657286ac951d0b94e08b020000787000000064

smallInt serialized value using IntWritable


00000064
This shows the clear difference between serialization in Java and Hadoop and also the difference
between ObjectInputStream and Writable interface. If the size of serialized data in Hadoop is
like that of Java, then it will definitely become an overhead in the network.
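To see where the compact 4-byte value shown above comes from, here is a minimal sketch (the class and method names are illustrative, not from the original notes) that serializes an IntWritable holding 100 and prints its bytes in hex:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;

// Illustrative sketch: serialize an IntWritable through a DataOutputStream and
// print the resulting bytes; an int value of 100 produces exactly 4 bytes, 00000064.
public class SerializationSizeDemo {

    public static byte[] serialize(IntWritable writable) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        writable.write(out);   // the Writable writes itself to the DataOutput
        out.close();
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = serialize(new IntWritable(100));
        StringBuilder hex = new StringBuilder();
        for (byte b : data) {
            hex.append(String.format("%02x", b));
        }
        System.out.println(data.length + " bytes: " + hex); // 4 bytes: 00000064
    }
}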

Also the core part of Hadoop framework i.e., shuffle and sort phase won’t be executed without
using Writable.

How can Writables be Implemented in Hadoop?

Writable variables in Hadoop have the default properties of Comparable. For example:

When we write a key as IntWritable in the Mapper class and send it to the reducer class, there is
an intermediate phase between the Mapper and Reducer class i.e., shuffle and sort, where each
key has to be compared with many other keys. If the keys are not comparable, then shuffle and
sort phase won’t be executed or may be executed with high amount of overhead.

If a key is taken as IntWritable by default, then it has the comparable feature because of the
RawComparator acting on that variable. It will compare the key taken with the other keys in the
network. This cannot take place in the absence of Writable.

Can we make custom Writables? The answer is definitely 'yes'. We can make our own custom
Writable type.

Let us now see how to make a custom type in Java.

The steps to make a custom type in Java are as follows:

public class add {

    int a;
    int b;

    public add(int a, int b) {
        this.a = a;
        this.b = b;
    }
}
Similarly we can make a custom type in Hadoop using Writables.

For implementing Writables, we need a few more methods in Hadoop:

public interface Writable {

void readFields(DataInput in) throws IOException;

void write(DataOutput out) throws IOException;

}

Here, readFields reads the data from the network and write writes the data to the local disk. Both
are necessary for transferring data through clusters. The DataInput and DataOutput classes (part of
java.io) contain methods to serialize the most basic types of data.

Suppose we want to make a composite key in Hadoop by combining two Writables then follow
the steps below:

public class add implements Writable {

    public int a;
    public int b;

    public add(int a, int b) {
        this.a = a;
        this.b = b;
    }

    public void write(DataOutput out) throws IOException {
        out.writeInt(a);
        out.writeInt(b);
    }

    public void readFields(DataInput in) throws IOException {
        a = in.readInt();
        b = in.readInt();
    }

    public String toString() {
        return Integer.toString(a) + ", " + Integer.toString(b);
    }
}

Thus we can create our custom Writables in a way similar to custom types in Java but with two
additional methods, write and readFields. The custom writable can travel through networks and
can reside in other systems.

Instances of this custom type cannot be compared with each other by default, so again we need to
make them comparable with each other.

Let us now discuss what WritableComparable is and the solution to the above problem.
As explained above, if a key is taken as IntWritable, by default it has the comparable feature because
of the RawComparator acting on that variable, and it will compare the key taken with the other keys
in the network; if Writable is not there, this comparison won't be executed.

By default, IntWritable, LongWritable and Text have a RawComparator which can execute this
comparable phase for them. Then, will RawComparator help the custom Writable? The answer is
no. So, we need to have WritableComparable.

WritableComparable can be defined as a sub interface of Writable, which has the feature of
Comparable too. If we have created our custom type writable, then

why do we need WritableComparable?

We need to make our custom type comparable if we want to compare it with other types. If we
want to make our custom type a key, then we should definitely make our key type
WritableComparable rather than simply Writable. This enables the custom type to be compared
with other types and it is also sorted accordingly. Otherwise, the keys won't be compared with
each other and they are just passed through the network.

What happens if WritableComparable is not present?


If we have made our custom type Writable rather than WritableComparable, our data won't be
compared with other data types. There is no compulsion that our custom type needs to be
WritableComparable unless it is used as a key, because values don't need to be compared with
each other the way keys are.
If our custom type is a key, then it should be WritableComparable or else the data won't be
sorted.

How can WritableComparable be implemented in Hadoop?

The implementation of WritableComparable is similar to Writable but with an additional
'compareTo' method inside it.

public interface WritableComparable extends Writable, Comparable
{

void readFields(DataInput in) throws IOException;

void write(DataOutput out) throws IOException;

int compareTo(WritableComparable o);

}
ri
ad
How to make our custom type, WritableComparable?
em

We can make custom type a WritableComparable by following the method below:

public class add implements WritableComparable<add> {

    public int a;
    public int b;

    public add(int a, int b) {
        this.a = a;
        this.b = b;
    }

    public void write(DataOutput out) throws IOException {
        out.writeInt(a);
        out.writeInt(b);
    }

    public void readFields(DataInput in) throws IOException {
        a = in.readInt();
        b = in.readInt();
    }

    public int compareTo(add c) {
        int presentValue = this.a;
        int compareValue = c.a;
        if (presentValue != compareValue) {
            return (presentValue < compareValue) ? -1 : 1;
        }
        return (this.b < c.b) ? -1 : (this.b == c.b ? 0 : 1);
    }

    public int hashCode() {
        return a ^ b;
    }
}

These readFields and write methods make the comparison of data faster in the network.

With the use of these Writables and WritableComparables in Hadoop, we can make our serialized
custom type with less difficulty. This gives developers the ease to make their custom types
based on their requirements.


Writable Classes – Hadoop Data Types
Hadoop provides classes that wrap the Java primitive types and implement
the WritableComparable and Writable Interfaces. They are provided in
the org.apache.hadoop.io package.
All the Writable wrapper classes have a get() and a set() method for retrieving and storing the
wrapped value.
Primitive Writable Classes
These are Writable Wrappers for Java primitive data types and they hold a single primitive value
that can be set either at construction or via a setter method.
All these primitive writable wrappers have get() and set() methods to read or write the wrapped
value. Below is the list of primitive writable data types available in Hadoop.

BooleanWritable
ByteWritable
IntWritable
VIntWritable
FloatWritable
LongWritable
VLongWritable
DoubleWritable

In the above list VIntWritable and VLongWritable are used for variable length Integer types and
variable length long types respectively.

Serialized sizes of the above primitive writable data types are same as the size of actual java data
type. So, the size of IntWritable is 4 bytes and LongWritable is 8 bytes.
Array Writable Classes
Hadoop provides two types of array writable classes, one for single-dimensional and another
for two-dimensional arrays. But the elements of these arrays must be other writable objects like
IntWritable or LongWritable only, not the Java native data types like int or float.

ArrayWritable
TwoDArrayWritable

Map Writable Classes


Hadoop provides the below MapWritable data types, which implement the java.util.Map interface:

o AbstractMapWritable – This is abstract or base class for other MapWritable classes.


o MapWritable – This is a general purpose map mapping Writable keys to Writable values.
o SortedMapWritable – This is a specialization of the MapWritable class that also implements the
SortedMap interface.

Other Writable Classes

NullWritable

NullWritable is a special type of Writable representing a null value. No bytes are read or written
when a data type is specified as NullWritable. So, in Mapreduce, a key or a value can be
declared as a NullWritable when we don’t need to use that field.
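As an illustration (this mapper is a hypothetical sketch, not part of the original notes), a mapper that only needs to emit values can use NullWritable.get() as its output key, so the key contributes no bytes at all to the intermediate data:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: only the lines themselves matter, so the output key is
// NullWritable, which is a singleton and serializes to zero bytes.
public class LinesOnlyMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(NullWritable.get(), value);
    }
}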

ObjectWritable

This is a general-purpose generic object wrapper which can store any objects like Java
primitives, String, Enum, Writable, null, or arrays.

Text

Text can be used as the Writable equivalent of java.lang.String and its max size is 2 GB. Unlike
Java's String data type, Text is mutable in Hadoop.
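A small sketch of this mutability (the class name is illustrative, not from the original notes) — the same Text object can be reused by calling set(), which is how Hadoop itself reuses objects between records:

import org.apache.hadoop.io.Text;

// Text is mutable: the same instance is reused via set(), and getLength()
// reports the number of bytes in the current UTF-8 encoded value.
public class TextReuseDemo {
    public static void main(String[] args) {
        Text t = new Text("hadoop");
        System.out.println(t + " length=" + t.getLength()); // hadoop length=6

        t.set("hive");                                       // reuse the same instance
        System.out.println(t + " length=" + t.getLength()); // hive length=4
    }
}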
ad
BytesWritable
em

BytesWritable is a wrapper for an array of binary data.

GenericWritable
H

It is similar to ObjectWritable but supports only a few types. The user needs to subclass this
GenericWritable class and specify the types to support.
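A minimal sketch of such a subclass is shown below (the class name and the two chosen types are illustrative assumptions, not from the original notes):

import org.apache.hadoop.io.GenericWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// Illustrative GenericWritable subclass that may only carry an IntWritable or a Text.
// Usage: IntOrTextWritable w = new IntOrTextWritable(); w.set(new Text("hi")); w.get();
public class IntOrTextWritable extends GenericWritable {

    @Override
    @SuppressWarnings("unchecked")
    protected Class<? extends Writable>[] getTypes() {
        return new Class[] { IntWritable.class, Text.class };
    }
}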

Example Program to Test Writables


Let us write a WritablesTest.java program to test most of the data types mentioned above, using
the get(), set(), getBytes(), getLength(), put(), containsKey() and keySet() methods.
WritablesTest.java
import org.apache.hadoop.io.* ;
import java.util.* ;
public class WritablesTest
{
public static class TextArrayWritable extends ArrayWritable
{
public TextArrayWritable()
{
super(Text.class) ;
}
}

public static class IntArrayWritable extends ArrayWritable


{
public IntArrayWritable()
{
super(IntWritable.class) ;
}
} ri
ad
public static void main(String[] args)
{
em

IntWritable i1 = new IntWritable(2) ;


IntWritable i2 = new IntWritable() ;
H

i2.set(5);

IntWritable i3 = new IntWritable();


i3.set(i2.get());

System.out.printf("Int Writables Test I1:%d , I2:%d , I3:%d", i1.get(), i2.get(), i3.get()) ;

BooleanWritable bool1 = new BooleanWritable() ;


bool1.set(true);

ByteWritable byte1 = new ByteWritable( (byte)7) ;

System.out.printf("\n Boolean Value:%s Byte Value:%d", bool1.get(), byte1.get()) ;

Text t = new Text("hadoop");


Text t2 = new Text();
t2.set("pig");

System.out.printf("\n t: %s, t.length: %d, t2: %s, t2.length: %d \n", t.toString(), t.getLength(),
t2.getBytes(), t2.getBytes().length);
ArrayWritable a = new ArrayWritable(IntWritable.class) ;
a.set( new IntWritable[]{ new IntWritable(10), new IntWritable(20), new IntWritable(30)}) ;

ArrayWritable b = new ArrayWritable(Text.class) ;

b.set( new Text[]{ new Text("Hello"), new Text("Writables"), new Text("World !!!")}) ;

for (IntWritable i: (IntWritable[])a.get())


System.out.println(i) ;

for (Text i: (Text[])b.get())


System.out.println(i) ;

IntArrayWritable ia = new IntArrayWritable() ;


ia.set( new IntWritable[]{ new IntWritable(100), new IntWritable(300), new
IntWritable(500)}) ;

IntWritable[] ivalues = (IntWritable[])ia.get() ;

for (IntWritable i : ivalues) ri


ad
System.out.println(i);

MapWritable m = new MapWritable() ;


em

IntWritable key1 = new IntWritable(1) ;


NullWritable value1 = NullWritable.get() ;
H

m.put(key1, value1) ;
m.put(new VIntWritable(2), new LongWritable(163));
m.put(new VIntWritable(3), new Text("Mapreduce"));

System.out.println(m.containsKey(key1)) ;
System.out.println(m.get(new VIntWritable(3))) ;

m.put(new LongWritable(1000000000), key1) ;


Set<Writable> keys = m.keySet() ;

for(Writable w: keys)
System.out.println(m.get(w)) ;
}
}
Implementing a Custom Writable:

IMPLEMENTING RAWCOMPARATOR WILL SPEED UP YOUR HADOOP MAP/REDUCE (MR) JOBS

Implementing the org.apache.hadoop.io.RawComparator interface will definitely help speed up
your Map/Reduce (MR) Jobs. As you may recall, a MR Job is composed of receiving and
sending key-value pairs. The process looks like the following.

(K1,V1) –> Map –> (K2,V2)

(K2,List[V2]) –> Reduce –> (K3,V3)

The key-value pairs (K2,V2) are called the intermediary key-value pairs. They are passed from
the mapper to the reducer. Before these intermediary key-value pairs reach the reducer, a shuffle
and sort step is performed. The shuffle is the assignment of the intermediary keys (K2) to
reducers and the sort is the sorting of these keys. In this blog, by implementing the
RawComparator to compare the intermediary keys, this extra effort will greatly improve sorting.
Sorting is improved because the RawComparator will compare the keys by byte. If we did not
use RawComparator, the intermediary keys would have to be completely deserialized to perform
a comparison.
BACKGROUND

Two ways you may compare your keys is by implementing the


org.apache.hadoop.io.WritableComparable interface or by implementing the RawComparator
interface. In the former approach, you will compare (deserialized) objects, but in the latter
approach, you will compare the keys using their corresponding raw bytes.

Here is an empirical test to demonstrate the advantage of RawComparator over WritableComparable.

Let's say we are processing a file that has a list of pairs of indexes {i,j}. These pairs of indexes
could refer to the i-th and j-th matrix element. The input data (file) will look something like the
following.

1, 2

3, 4

5, 6

...

...

...

0, 0

What we want to do is simply count the occurrences of the {i,j} pair of indexes. Our MR Job will
look like the following.

(LongWritable,Text) –> Map –> ({i,j},IntWritable)

({i,j},List[IntWritable]) –> Reduce –> ({i,j},IntWritable)

METHOD

The first thing we have to do is model our intermediary key K2={i,j}. Below is a snippet of the
IndexPair. As you can see, it implements WritableComparable. Also, we are sorting the keys
ascendingly by the i-th and then j-th indexes.

public class IndexPair implements WritableComparable<IndexPair> {

    private IntWritable i;
    private IntWritable j;

    public IndexPair(int i, int j) {
        this.i = new IntWritable(i);
        this.j = new IntWritable(j);
    }

    public int compareTo(IndexPair o) {
        int cmp = i.compareTo(o.i);
        if (0 != cmp)
            return cmp;
        return j.compareTo(o.j);
    }

    //....
}

Below is a snippet of the RawComparator. As you notice, it does not directly implement RawComparator. Rather, it extends WritableComparator (which implements RawComparator). We could have directly implemented RawComparator, but by extending WritableComparator, depending on the complexity of our intermediary key, we may use some of the utility methods of WritableComparator.

public class IndexPairComparator extends WritableComparator {

    protected IndexPairComparator() {
        super(IndexPair.class);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int i1 = readInt(b1, s1);
        int i2 = readInt(b2, s2);

        int comp = (i1 < i2) ? -1 : (i1 == i2) ? 0 : 1;
        if (0 != comp)
            return comp;

        int j1 = readInt(b1, s1 + 4);
        int j2 = readInt(b2, s2 + 4);

        comp = (j1 < j2) ? -1 : (j1 == j2) ? 0 : 1;
        return comp;
    }
}

As you can see in the above code, for the two objects we are comparing, there are two
corresponding byte arrays (b1 and b2), the starting positions of the objects in the byte arrays, and
the length of the bytes they occupy. Please note that the byte arrays themselves represent other
things and not only the objects we are comparing. That is why the starting position and length are
also passed in as arguments. Since we want to sort ascendingly by i then j, we first compare the
bytes representing the i-th indexes and, if they are equal, we then compare the j-th indexes. You
can also see that we use the util method, readInt(byte[], start), inherited from
WritableComparator. This method simply converts the 4 consecutive bytes beginning at start into
a primitive int (a primitive int in Java is 4 bytes). If the i-th indexes are equal, then we shift the
starting point by 4, read in the j-th indexes and then compare them.

A snippet of the mapper is shown below.

public void map(LongWritable key, Text value, Context context) throws IOException,
InterruptedException {

String[] tokens = value.toString().split(",");

int i = Integer.parseInt(tokens[0].trim());

int j = Integer.parseInt(tokens[1].trim());

IndexPair indexPair = new IndexPair(i, j);

context.write(indexPair, ONE);

}
A snippet of the reducer is shown below.

public void reduce(IndexPair key, Iterable<IntWritable> values, Context context) throws
IOException, InterruptedException {

    int sum = 0;

    for (IntWritable value : values) {
        sum += value.get();
    }

    context.write(key, new IntWritable(sum));
}

The snippet of code below shows how I wired up the MR Job that does NOT use raw byte
comparison.

public int run(String[] args) throws Exception {

Configuration conf = getConf();
em

Job job = new Job(conf, "raw comparator example");

job.setJarByClass(RcJob1.class);
H

job.setMapOutputKeyClass(IndexPair.class);

job.setMapOutputValueClass(IntWritable.class);

job.setOutputKeyClass(IndexPair.class);

job.setOutputValueClass(IntWritable.class);

job.setMapperClass(RcMapper.class);

job.setReducerClass(RcReducer.class);

job.waitForCompletion(true);
return 0;

}
The snippet of code below shows how I wired up the MR Job using the raw byte comparator.

public int run(String[] args) throws Exception {

Configuration conf = getConf();

Job job = new Job(conf, "raw comparator example");

job.setJarByClass(RcJob1.class);

job.setSortComparatorClass(IndexPairComparator.class);

job.setMapOutputKeyClass(IndexPair.class);

job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(IndexPair.class);

job.setOutputValueClass(IntWritable.class);

job.setMapperClass(RcMapper.class);
job.setReducerClass(RcReducer.class);

job.waitForCompletion(true);
return 0;
}

As you can see, the only difference is that in the MR Job using the raw comparator, we explicitly
set its sort comparator class.
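As a side note (this registration pattern is a common alternative and is not part of the original write-up), instead of calling setSortComparatorClass() in every job, the raw comparator can be registered once as the default comparator for the key type:

import org.apache.hadoop.io.WritableComparator;

// Hypothetical placement: this static block would normally live inside the IndexPair
// class itself, making IndexPairComparator the default comparator whenever
// IndexPair is used as a key.
public class IndexPairComparatorRegistration {
    static {
        WritableComparator.define(IndexPair.class, new IndexPairComparator());
    }
}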

RESULTS

I ran the MR Jobs (without and with raw byte comparisons) 10 times on a dataset of 4 million
rows of {i,j} pairs. The runs were against Hadoop v0.20 in standalone mode on Cygwin. The
average running time for the MR Job without raw byte comparison is 60.6 seconds, and the
average running time for the job with raw byte comparison is 31.1 seconds. A two-tail paired t-
test showed p < 0.001, meaning, there is a statistically significant difference between the two
implementations in terms of empirical running time.
I then ran each implementation on datasets of increasing record sizes from 1, 2, …, and 10
million records. At 10 million records, without using raw byte comparison took 127 seconds
(over 2 minutes) to complete, while using raw byte comparison took 75 seconds (1 minute and
15 seconds) to complete. Below is a line graph.

[Line graph: running time versus number of records, with and without raw byte comparison]
Custom comparators.

Frequently, objects in one Tuple are compared to objects in a second Tuple. This is especially
true during the sort phase of GroupBy and CoGroup in Cascading Hadoop mode.

By default, Hadoop and Cascading use the native Object methods equals() and hashCode()
to compare two values and get a consistent hash code for a given value, respectively.

To override this default behavior, you can create a custom java.util.Comparator class to perform
comparisons on a given field in a Tuple. For instance, to secondary-sort a collection of custom
Person objects in a GroupBy, use the Fields.setComparator() method to designate the
custom Comparator to the Fields instance that specifies the sort fields.

Alternatively, you can set a default Comparator to be used by a Flow, or used locally on a
given Pipe instance. There are two ways to do this. Call
FlowProps.setDefaultTupleElementComparator() on a Properties instance, or use the property
key cascading.flow.tuple.element.comparator.

If the hash code must also be customized, the custom Comparator can implement the interface
cascading.tuple.Hasher
import java.text.Collator;
import java.util.Locale;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class CustomTextComparator extends WritableComparator {

    private Collator collator;

    public CustomTextComparator() {
        super(Text.class);
        final Locale locale = new Locale("pl");
        collator = Collator.getInstance(locale);
    }

    public int compare(WritableComparable a, WritableComparable b) {
        synchronized (collator) {
            return collator.compare(((Text) a).toString(), ((Text) b).toString());
        }
    }
}
Unit 5:

Pig: Hadoop Programming Made Easier Admiring the Pig Architecture, Going with
the Pig Latin Application Flow, Working through the ABCs of Pig Latin,
Evaluating Local and Distributed Modes of Running Pig Scripts, Checking out
the Pig Script Interfaces, Scripting with Pig Latin

Pig:
Hadoop Programming Made Easier Admiring the Pig Architecture

Apache Pig is an abstraction over MapReduce. It is a tool/platform which is


used to analyze larger sets of data representing them as data flows. Pig is
generally used with Hadoop; we can perform all the data manipulation
operations in Hadoop using Pig.

To write data analysis programs, Pig provides a high-level language known as


Pig Latin. This language provides various operators using which programmers
can develop their own functions for reading, writing, and processing data.

To analyze data using Apache Pig, programmers need to write scripts using the Pig Latin
language. All these scripts are internally converted to Map and Reduce tasks. Apache Pig has a
component known as Pig Engine that accepts the Pig Latin scripts as input and converts those
scripts into MapReduce jobs.
em

Why Do We Need Apache Pig?


Programmers who are not so good at Java normally used to struggle working with Hadoop,
especially while performing any MapReduce tasks.


Apache Pig is a boon for all such programmers.

Using Pig Latin, programmers can perform MapReduce tasks easily


without having to type complex codes in Java.

Apache Pig uses a multi-query approach, thereby reducing the length of code. For example, an
operation that would require you to type 200 lines of code (LoC) in Java can easily be done by
typing as few as 10 LoC in Apache Pig. Ultimately Apache Pig reduces the development time by
almost 16 times.

Pig Latin is an SQL-like language and it is easy to learn Apache Pig when you are familiar
with SQL.

Apache Pig provides many built-in operators to support data operations


like joins, filters, ordering, etc. In addition, it also provides nested data types
like tuples, bags, and maps that are missing from MapReduce.

Features of Pig
Apache Pig comes with the following features −

Rich set of operators − It provides many operators to perform operations like join, sort, filter, etc.

Ease of programming − Pig Latin is similar to SQL and it is easy to write a


Pig script if you are good at SQL.

Optimization opportunities − The tasks in Apache Pig optimize their


execution automatically, so the programmers need to focus only on
semantics of the language.

Extensibility − Using the existing operators, users can develop their own
functions to read, process, and write data.

UDF’s − Pig provides the facility to create User-defined Functions in other


programming languages such as Java and invoke or embed them in Pig Scripts.

Handles all kinds of data − Apache Pig analyzes all kinds of data, both
structured as well as unstructured. It stores the results in HDFS.

Apache Pig Vs MapReduce


Listed below are the major differences between Apache Pig and MapReduce.

 Apache Pig is a data flow language, whereas MapReduce is a data processing paradigm.

 Pig Latin is a high-level language, whereas MapReduce is low level and rigid.

 Performing a Join operation in Apache Pig is pretty simple, whereas it is quite difficult in
MapReduce to perform a Join operation between datasets.

 Any novice programmer with a basic knowledge of SQL can work conveniently with Apache Pig,
whereas exposure to Java is a must to work with MapReduce.

 Apache Pig uses a multi-query approach, thereby reducing the length of the code to a great
extent, whereas MapReduce will require almost 20 times more lines to perform the same task.

 There is no need for compilation in Apache Pig; on execution, every Apache Pig operator is
converted internally into a MapReduce job, whereas MapReduce jobs have a long compilation
process.

Apache Pig Vs SQL


Listed below are the major differences between Apache Pig and SQL.

 Pig Latin is a procedural language, whereas SQL is a declarative language.

 In Apache Pig, schema is optional; we can store data without designing a schema (values are
stored as $01, $02 etc.), whereas schema is mandatory in SQL.

 The data model in Apache Pig is nested relational, whereas the data model used in SQL is flat
relational.

 Apache Pig provides limited opportunity for query optimization, whereas there is more
opportunity for query optimization in SQL.
In addition to above differences, Apache Pig Latin −
em

Allows splits in the pipeline.


Allows developers to store data anywhere in the pipeline.

Declares execution plans.

Provides operators to perform ETL (Extract, Transform, and Load) functions.

Apache Pig - Architecture


The language used to analyze data in Hadoop using Pig is known as
Pig Latin. It is a high level data processing language which provides a rich
set of data types and operators to perform various operations on the data.

To perform a particular task, programmers using Pig need to
write a Pig script using the Pig Latin language, and execute it using any
of the execution mechanisms (Grunt Shell, UDFs, Embedded). After
execution, these scripts will go through a series of transformations applied
by the Pig Framework, to produce the desired output.
Internally, Apache Pig converts these scripts into a series of MapReduce jobs,
and thus, it makes the programmer’s job easy. The architecture of Apache Pig
is shown below.

[Figure: Apache Pig architecture]

Admiring the pig architecture

Pig is made up of two components:

The language itself: As proof that programmers have a sense of humor, the Programming
language for Pig is known as Pig Latin, a high-level language that allows you to write data
processing and analysis programs.

The Pig Latin compiler: The Pig Latin compiler converts the Pig Latin code into executable code.
The executable code is either in the form of MapReduce jobs or it can spawn a process where a
virtual Hadoop instance is created to run the Pig node on a single node.

The sequence of MapReduce programs enables Pig programs to do data processing and
analysis in parallel, leveraging Hadoop MapReduce and HDFS. Running the Pig job in the
virtual Hadoop instance is a useful strategy for testing your Pig scripts.
Pig relates to the hadoop ecosystem

Pig programs can run on MapReduce v1 or MapReduce v2 without any code changes, regardless of what
mode your cluster is running. However, Pig scripts can also run using the Tez API instead. Apache Tez provides a
more efficient execution framework than MapReduce. YARN enables application frameworks other than
MapReduce (like Tez) to run on Hadoop. Hive can also run against the Tez framework.

Apache Pig Components


As shown in the figure, there are various components in the Apache Pig
framework. Let us take a look at the major components.

Parser
Initially the Pig Scripts are handled by the Parser. It checks the syntax of the script, does type
checking, and other miscellaneous checks. The output of the parser will be a DAG (directed
acyclic graph), which represents the Pig Latin statements and logical operators.
In the DAG, the logical operators of the script are represented as the nodes and the data
flows are represented as edges.

Optimizer


The logical plan (DAG) is passed to the logical optimizer, which carries out the
logical optimizations such as projection and pushdown.

Compiler
The compiler compiles the optimized logical plan into a series of MapReduce jobs.

Execution engine
The MapReduce jobs are submitted to Hadoop in a sorted order. Finally, these MapReduce
jobs are executed on Hadoop, producing the desired results.
Going with the Pig Latin Application Flow
At its core, Pig Latin is a dataflow language, where we define a data stream and a series of
transformations that are applied to the data as it flows through your application.
This is in contrast to a control flow language (like C or Java), where we write a series of
instructions. In control flow languages, we use constructs like loops and conditional logic (like
an if statement). You won’t find loops and if statements in Pig Latin.

Sample pig code

Looking at each line in turn, you can see the basic flow of a Pig program.
1. Load: we first load (LOAD) the data you want to manipulate. As in a typical MapReduce job,
that data is stored in HDFS. For a Pig program to access the data, you first tell Pig what file or
files to use. For that task, you use the LOAD 'data_file' command.
em

Here, 'data_file' can specify either an HDFS file or a directory. If a directory is specified, all files in
that directory are loaded into the program.
If the data is stored in a file format that isn’t natively accessible to Pig, you can optionally add
H

the USING function to the LOAD statement to specify a user-defined function that can read in
(and interpret) the data.

2. Transform: You run the data through a set of transformations that, way under the hood
and far removed from anything you have to concern yourself with, are translated into a set of
Map and Reduce tasks.
The transformation logic is where all the data manipulation happens. Here, you can FILTER
out rows that aren’t of interest, JOIN two sets of data files, GROUP data to build aggregations,
ORDER results, and do much, much more.

3. Dump: Finally, you dump (DUMP) the results to the screen or Store (STORE) the results in a
file somewhere.

Pig Latin Data Model (Pig data types)


The data model of Pig Latin is fully nested and it allows complex non-atomic data
types such as map and tuple. Given below is the diagrammatical
representation of Pig Latin’s data model.
Atom: An atom is any single value, such as a string or a number — 'Diego', for example. Pig's
atomic values are scalar types that appear in most programming languages — int, long, float,
double, chararray, and bytearray.

Tuple: A tuple is a record that consists of a sequence of fields. Each field


can be of any type — ‘Diego’, ‘Gomez’, or 6, for example. Think of a tuple
as a row in a table.

Bag: A bag is a collection of non-unique tuples. The schema of the bag is


flexible — each tuple in the collection can contain an arbitrary number
of fields, and each field can be of any type.
Map: A map is a collection of key value pairs. Any type can be stored in
the value, and the key needs to be unique. The key of a map must be a
chararray and the value can be of any type.
em

Pig Latin operations:

In a Hadoop context, accessing data means allowing developers to load, store,


and stream data, whereas transforming data means taking advantage of Pig’s
ability to group, join, combine, split, filter, and sort data.

Pig latin operators

Operators for Debugging and Troubleshooting

Apache Pig Execution Modes


We can run Apache Pig in two modes, namely, Local Mode and MapReduce (HDFS) Mode.

Local Mode
In this mode, all the files are installed and run from your local host and
local file system. There is no need of Hadoop or HDFS. This mode is
generally used for testing purpose.
MapReduce Mode
MapReduce mode is where we load or process the data that exists in the
Hadoop File System (HDFS) using Apache Pig. In this mode, whenever we execute
the Pig Latin statements to process the data, a MapReduce job is invoked in the
back-end to perform a particular operation on the data that exists in the HDFS.

Apache Pig Execution Mechanisms

Apache Pig scripts can be executed in three ways, namely, interactive
mode, batch mode, and embedded mode.

Interactive Mode (Grunt shell) − You can run Apache Pig in interactive mode using
the Grunt shell. In this shell, you can enter the Pig Latin statements and get the

output (using Dump operator).

Batch Mode (Script) − You can run Apache Pig in Batch mode by writing the Pig
Latin script in a single file with .pig extension.
Embedded Mode (UDF) − Apache Pig provides the provision of defining our own
functions (User Defined Functions) in programming languages such as Java, and
using them in our script.
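To make the Embedded Mode (UDF) point above concrete, here is a minimal sketch of a Java UDF (the class name and behaviour are illustrative assumptions, not from the original notes). Such a class would be packaged in a jar, registered in a Pig script with REGISTER, and then invoked like a built-in function:

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Illustrative Pig UDF: returns the upper-case form of its first (chararray) argument.
public class UpperCase extends EvalFunc<String> {

    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        return ((String) input.get(0)).toUpperCase();
    }
}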

Important Questions

1. Explain Pig architecture in detail with neat diagram?


2. What is the difference between Hive and Apache Pig?
3. What is the difference between MapReduce and Apache Pig?
4. Apache Pig scripting statements and data types?
Applying Structure to Hadoop Data with Hive: Saying Hello to Hive, Seeing How the Hive is Put Together,
Getting Started with Apache Hive, Examining the Hive Clients, Working with Hive Data Types, Creating and
Managing Databases and Tables, Seeing How the Hive Data Manipulation Language Works, Querying and
Analyzing Data.

HIVE Introduction
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It
resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.

The term ‘Big Data’ is used for collections of large datasets that include huge
volume, high velocity, and a variety of data that is increasing day by day. Using traditional
data management systems, it is difficult to process Big Data. Therefore, the Apache
Software Foundation introduced a framework called Hadoop to solve Big Data management
and processing challenges.

Hadoop
Hadoop is an open-source framework to store and process Big Data in a distributed
environment. It contains two modules, one is MapReduce and another is Hadoop
Distributed File System (HDFS).

MapReduce: It is a parallel programming model for processing large amounts of structured,
semi-structured, and unstructured data on large clusters of commodity hardware.

HDFS: Hadoop Distributed File System is a part of the Hadoop framework, used to store and
process the datasets. It provides a fault-tolerant file system to run on commodity hardware.

The Hadoop ecosystem contains different sub-projects (tools) such as Sqoop, Pig,
and Hive that are used to help Hadoop modules.

Sqoop: It is used to import and export data to and from between HDFS and RDBMS.

Pig: It is a procedural language platform used to develop a script for MapReduce


operations.

Hive: It is a platform used to develop SQL type scripts to do MapReduce


operations.

Note: There are various ways to execute MapReduce operations:

The traditional approach using Java MapReduce program for structured,


semi- structured, and unstructured data.
The scripting approach for MapReduce to process structured and semi structured data using
Pig.
The Hive Query Language (HiveQL or HQL) for MapReduce to process structured data
using Hive.
What is Hive
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It
resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.

Initially Hive was developed by Facebook, later the Apache Software Foundation took it
up and developed it further as an open source under the name Apache Hive. It is used by
different companies. For example, Amazon uses it in Amazon Elastic MapReduce.

Hive is not

A relational database
A design for OnLine Transaction Processing (OLTP)
A language for real-time queries and row-level updates

Features of Hive

 It stores schema in a database and processed data into HDFS.


 It is designed for OLAP.
 It provides SQL type language for querying called HiveQL or HQL.
 It is familiar, fast, scalable, and extensible.

Architecture of Hive
The following component diagram depicts the architecture of Hive:

The apache Hive architecture

This component diagram contains different units. The following


table describes each unit:
User Interface: Hive is a data warehouse infrastructure software that can create interaction
between user and HDFS. The user interfaces that Hive supports are Hive Web UI, Hive command
line, and Hive HD Insight (in Windows Server).

Meta Store: Hive chooses respective database servers to store the schema or metadata of tables,
databases, columns in a table, their data types, and HDFS mapping.

HiveQL Process Engine: HiveQL is similar to SQL for querying on schema info on the Metastore.
It is one of the replacements of the traditional approach for MapReduce programs. Instead of
writing a MapReduce program in Java, we can write a query for a MapReduce job and process it.

Execution Engine: The conjunction part of HiveQL Process Engine and MapReduce is the Hive
Execution Engine. The execution engine processes the query and generates results the same as
MapReduce results. It uses the flavor of MapReduce.

HDFS or HBASE: Hadoop Distributed File System or HBASE are the data storage techniques to
store data into the file system.

Working of Hive
The following diagram depicts the workflow between Hive and Hadoop.
The following table defines how Hive interacts with Hadoop framework:

Step No. – Operation

Execute Query

1 The Hive interface such as Command Line or Web UI sends query to Driver (any
database driver such as JDBC, ODBC, etc.) to execute.

Get Plan
2 The driver takes the help of query compiler that parses the query to check the syntax and
query plan or the requirement of query.

Get Metadata
3 The compiler sends metadata request to Metastore (any database).

Send Metadata
4
Metastore sends metadata as a response to the compiler.

Send Plan
5 The compiler checks the requirement and resends the plan to the driver. Up to
here, the parsing and compiling of a query is complete.

Execute Plan
6 The driver sends the execute plan to the execution engine.

Execute Job
Internally, the process of execution job is a MapReduce job. The execution engine sends the job
7 to JobTracker, which is in Name node and it assigns this job to TaskTracker, which is in Data
node. Here, the query executes MapReduce job.

Metadata Ops
Meanwhile in execution, the execution engine can execute metadata operations with
7.1 Metastore.

8 Fetch Result
The execution engine receives the results from Data nodes.

9 Send Results
The execution engine sends those resultant values to the driver.

10 Send Results
The driver sends the results to Hive Interfaces.
Hive - Data Types
All the data types in Hive are classified into four types, given as follows:

 Column Types
 Literals
 Null Values
 Complex Types

Column Types
Column type are used as column data types of Hive. They are as follows:

Integral Types
Integer type data can be specified using integral data types, INT. When the data range
exceeds the range of INT, you need to use BIGINT and if the data range is smaller
than the INT, you use SMALLINT. TINYINT is smaller than SMALLINT.

The following table depicts various INT data types:


Type Postfix Example

TINYINT Y 10Y
SMALLINT S 10S

INT - 10

BIGINT L 10L

String Types
String type data types can be specified using single quotes (' ') or double quotes (" "). It
contains two data types: VARCHAR and CHAR. Hive follows C-types escape characters.

The following table depicts various CHAR data types:

Data Type Length

VARCHAR 1 to 65535

CHAR 255
Timestamp
It supports traditional UNIX timestamp with optional nanosecond precision. It supports
java.sql.Timestamp format “YYYY-MM-DD HH:MM:SS.fffffffff” and format “yyyy-mm-dd
hh:mm:ss.ffffffffff”.
Dates
DATE values are described in year/month/day format in the form {{YYYY- MM-DD}}.
Decimals
The DECIMAL type in Hive is as same as Big Decimal format of Java. It is used for
representing immutable arbitrary precision. The syntax and example is as follows:

DECIMAL(precision, scale)
decimal(10,0)

Union Types
Union is a collection of heterogeneous data types. You can create
an instance using create union. The syntax and example is as follows:

UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>


{0:1}
{1:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
{0:9}
{1:10.0}

Literals
The following literals are used in Hive:

Floating Point Types


Floating point types are nothing but numbers with decimal
points. Generally, this type of data is composed of DOUBLE data type.
Decimal Type
Decimal type data is nothing but a floating point value with a higher range than the DOUBLE data
type. The range of the decimal type is approximately -10^-308 to 10^308.
Null Value
Missing values are represented by the special value NULL.

Complex Types
The Hive complex data types are as follows:
Arrays
Arrays in Hive are used the same way they are used in Java.

Syntax: ARRAY<data_type>

Maps
Maps in Hive are similar to Java Maps.

Syntax: MAP<primitive_type, data_type>

Structs
Structs in Hive is similar to using complex data with comment.

Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>

Hive - Create Database


Hive is a database technology that can define databases and tables to analyze
structured data. The theme for structured data analysis is to store the data in a
tabular manner, and pass queries to analyze it. This chapter explains how to create
Hive database. Hive contains a default database named default.
Create Database Statement
Create Database is a statement used to create a database in Hive. A database in
Hive is a namespace or a collection of tables. The syntax for this
statement is as follows:

CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>

Here, IF NOT EXISTS is an optional clause, which notifies the user that a database with
the same name already exists. We can use SCHEMA in place of
DATABASE in this command. The following query is
executed to create a database named userdb:
hive> CREATE DATABASE [IF NOT EXISTS] userdb;

or

hive> CREATE SCHEMA userdb;

The following query is used to verify a databases list:

hive> SHOW DATABASES;


default
userdb

Drop Database Statement


Drop Database is a statement that drops all the tables and deletes the database. Its
syntax is as follows:

DROP (DATABASE|SCHEMA) [IF EXISTS] database_name
[RESTRICT|CASCADE];

The following queries are used to drop a database. Let us assume that the database name
is userdb.

hive> DROP DATABASE IF EXISTS userdb;

The following query drops the database using CASCADE. It means dropping respective
tables before dropping the database.

hive> DROP DATABASE IF EXISTS userdb CASCADE;

The following query drops the database using SCHEMA.

hive> DROP SCHEMA userdb;

create Table Statement


Create Table is a statement used to create a table in Hive. The syntax and example are as
follows:

Syntax
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.] table_name

[(col_name data_type [COMMENT col_comment], ...)]


[COMMENT table_comment]
[ROW FORMAT row_format] [STORED AS file_format]
Example
Let us assume you need to create a table named employee using CREATE TABLE
statement. The following table lists the fields and their data types in employee table:
Sr.No Field Name Data Type

1 Eid int
2 Name String

3 Salary Float
4 Designation string

The following data is a Comment, Row formatted fields such as


Field terminator, Lines terminator, and Stored File type.
COMMENT 'Employee details' FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' STORED IN
TEXT FILE

The following query creates a table named employee using the above data.

hive> CREATE TABLE IF NOT EXISTS employee ( eid int, name String, salary String, destination
String)
COMMENT 'Employee details'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' STORED AS
TEXTFILE;

If you add the option IF NOT EXISTS, Hive ignores the statement in case the table
already exists.

On successful creation of table, you get to see the following response:

OK
Time taken: 5.905 seconds
hive>

Load Data Statement


Generally, after creating a table in SQL, we can insert data using the Insert statement.
But in Hive, we can insert data using the LOAD DATA statement.

While inserting data into Hive, it is better to use LOAD DATA to store bulk records.
There are two ways to load data: one is from local file system and second is from Hadoop
file system.
Syntax
The syntax for load data is as follows:

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename


[PARTITION (partcol1=val1, partcol2=val2 ...)]

LOCAL is identifier to specify the local path. It is optional.


OVERWRITE is optional to overwrite the data in the table.
PARTITION is optional.
Example
We will insert the following data into the table. It is a text file named sample.txt in the
/home/user directory.

1201 Gopal 45000 Technical manager


1202 Manisha 45000 Proof reader
1203 Masthanvali 40000 Technical writer
1204 Kiran 40000 Hr Admin
1205 Kranthi 30000 Op Admin

The following query loads the given text into the table.
hive> LOAD DATA LOCAL INPATH
'/home/user/sample.txt' OVERWRITE INTO TABLE employee;

On successful download, you get to see the following response:

OK
Time taken: 15.905 seconds
hive>

Alter Table Statement


It is used to alter a table in Hive.

Syntax
The statement takes any of the following syntaxes based on what attributes we wish to
modify in a table.

ALTER TABLE name RENAME TO new_name


ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...])
ALTER TABLE name DROP [COLUMN] column_name
ALTER TABLE name CHANGE column_name new_name new_type
ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])

Rename To… Statement


The following query renames the table from employee to emp.

hive> ALTER TABLE employee RENAME TO emp;


Change Statement
The following table contains the fields of employee table and it shows the fields to be
changed (in bold).

Field Name – Convert from Data Type – Change Field Name – Convert to Data Type

eid int eid int

name String ename String

salary Float salary Double

designation String designation String

The following queries rename the column name and column data type using the above
data:

hive> ALTER TABLE employee CHANGE name ename String;


hive> ALTER TABLE employee CHANGE salary salary Double;

Add Columns Statement


The following query adds a column named dept to the employee table.

hive> ALTER TABLE employee ADD COLUMNS (


dept STRING COMMENT 'Department name');

Replace Statement
The following query deletes all the columns from the employee table and replaces them with the
empid and name columns:
hive> ALTER TABLE employee REPLACE COLUMNS (
empid Int,
name String);

Drop Table Statement


When you drop a table from the Hive Metastore, it removes the table/column data and their metadata. It can be a normal
table (stored in Metastore) or an external table (stored in local file system); Hive treats
both in the same manner, irrespective of their types.
The syntax is as follows:
DROP TABLE [IF EXISTS] table_name;

The following query drops a table named employee:


hive> DROP TABLE IF EXISTS employee;

On successful execution of the query, you get to see the following response:
OK
Time taken: 5.3 seconds
hive>

The following query is used to verify the list of tables:

hive> SHOW TABLES;


emp
OK
Time taken: 2.1 seconds
hive>

Operators in HIVE:

There are four types of operators in Hive:


 Relational Operators

 Arithmetic Operators

 Logical Operators

 Complex Operators

Relational Operators: These operators are used to compare two operands. The following
table describes the relational operators available in Hive:

Operator Operand Description

A=B all primitive types TRUE if expression A is equivalent to


expression B otherwise FALSE.

A != B all primitive types TRUE if expression A is not equivalent to


expression B otherwise FALSE.

A<B all primitive types TRUE if expression A is less than expression B


otherwise FALSE.

A <= B all primitive types TRUE if expression A is less than or equal to


expression B otherwise FALSE.

A>B all primitive types TRUE if expression A is greater than


expression B otherwise FALSE.

A >= B all primitive types TRUE if expression A is greater than or equal to


expression B otherwise FALSE.
A IS NULL all types TRUE if expression A evaluates to NULL
otherwise FALSE.

A IS NOT NULL all types FALSE if expression A evaluates to NULL


otherwise TRUE.

A LIKE B Strings TRUE if string pattern A matches to B


otherwise FALSE.

A RLIKE B Strings NULL if A or B is NULL, TRUE if any


substring of A matches the Java regular
expression B , otherwise FALSE.
A REGEXP B Strings Same as RLIKE.

Example
Let us assume the employee table is composed of fields named Id, Name, Salary,
Designation, and Dept as shown below. Generate a query to retrieve the employee details
whose Id is 1205.

+-----+--------------+--------+---------------------------+------+
| Id | Name | Salary | Designation | Dept |
+-----+--------------+------------------------------------+------+
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 40000 | Hr Admin | HR |
|1205 | Kranthi | 30000 | Op Admin | Admin|
+-----+--------------+--------+---------------------------+------+

The following query is executed to retrieve the employee details using the above table:

hive> SELECT * FROM employee WHERE Id=1205;

On successful execution of query, you get to see the following response:

+-----+-----------+-----------+----------------------------------+
| ID | Name | Salary | Designation | Dept |
+-----+---------------+-------+----------------------------------+
|1205 | Kranthi | 30000 | Op Admin | Admin |
+-----+-----------+-----------+----------------------------------+

The following query is executed to retrieve the employee details whose salary is
more than or equal to Rs 40000.

hive> SELECT * FROM employee WHERE Salary>=40000;

On successful execution of query, you get to see the following response:


+-----+------------+--------+----------------------------+------+
| ID | Name | Salary | Designation | Dept |
+-----+------------+--------+----------------------------+------+
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali| 40000 | Technical writer | TP |
|1204 | Krian | 40000 | Hr Admin | HR |
+-----+------------+--------+----------------------------+------+

Arithmetic Operators
These operators support various common arithmetic operations on the operands. All of
them return number types. The following table describes the arithmetic operators
available in Hive:

Operators Operand Description

A+B all number types Gives the result of adding A and B.

A-B all number types Gives the result of subtracting B from A.

A*B all number types Gives the result of multiplying A and B.

A/B all number types Gives the result of dividing B from A.

A%B all number types Gives the reminder resulting from dividing A by B.

A&B all number types Gives the result of bitwise AND of A and B.

A|B all number types Gives the result of bitwise OR of A and B.

A^B all number types Gives the result of bitwise XOR of A and B.

~A all number types Gives the result of bitwise NOT of A.


Example
The following query adds two numbers, 20 and 30.

hive> SELECT 20+30 ADD FROM temp;

On successful execution of the query, you get to see the


following response:
+--------+
| ADD |
+--------+
| 50 |
+--------+

Logical Operators
The operators are logical expressions. All of them return either TRUE
or
FALSE.

Operators Operands Description

A AND B boolean TRUE if both A and B are TRUE, otherwise FALSE.

A && B boolean Same as A AND B.

A OR B boolean TRUE if either A or B or both are TRUE, otherwise FALSE.

A || B boolean Same as A OR B.

NOT A boolean TRUE if A is FALSE, otherwise FALSE.

!A boolean Same as NOT A.

Example
The following query is used to retrieve employee details whose Department is TP and
Salary is more than Rs 40000.
hive> SELECT * FROM employee WHERE Salary>40000 && Dept=TP;

On successful execution of the query, you get to see the


following response:
+------+--------------+-------------+-------------------+--------+
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1201 | Gopal | 45000 | Technical manager | TP |
+------+--------------+-------------+-------------------+--------+

Complex Operators
These operators provide an expression to access the elements of Complex Types.

Operator Operand Description

A[n] A is an Array and n is It returns the nth element in the array A. The first
an int element has index 0.

M[key] M is a Map<K, V> and It returns the value corresponding to the key in
key has type K the map.

S.x S is a struct It returns the x field of S.

HiveQL - Select-Where
The Hive Query Language (HiveQL) is a query language for Hive to process
and analyze structured data in a Metastore. This chapter explains how to use the
SELECT statement with WHERE clause.

SELECT statement is used to retrieve the data from a table. WHERE clause
works similar to a condition. It filters the data using the condition and gives you a
finite result. The built-in operators and functions generate an expression, which fulfils
the condition.

Syntax
Given below is the syntax of the SELECT query:

SELECT [ALL | DISTINCT] select_expr, select_expr, ...


FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];

Example
Let us take an example for SELECT…WHERE clause. Assume we have the employee table
as given below, with fields named Id, Name, Salary, Designation, and Dept. Generate
a query to retrieve the employee details who earn a salary of more than Rs 30000.

+------+--------------+-------------+-------------------+--------+
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 40000 | Hr Admin | HR |
|1205 | Kranthi | 30000 | Op Admin | Admin |
+------+--------------+-------------+-------------------+--------+

The following query retrieves the employee details using the above scenario:

hive> SELECT * FROM employee WHERE salary>30000;

On successful execution of the query, you get to see the following response:

+------+--------------+-------------+-------------------+--------+
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 40000 | Hr Admin | HR |
+------+--------------+-------------+-------------------+--------+
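
The optional DISTINCT and LIMIT keywords shown in the syntax can be combined with WHERE in the same way. A minimal sketch against the same employee table: the first query returns each department name only once, and the second returns at most two of the matching rows.

hive> SELECT DISTINCT Dept FROM employee;
hive> SELECT * FROM employee WHERE Salary>30000 LIMIT 2;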

HiveQL - Select-Order By
This chapter explains how to use the ORDER BY clause in a SELECT statement.
The ORDER BY clause is used to retrieve the details based on one column and sort
the result set by ascending or descending order.
Syntax
Given below is the syntax of the ORDER BY clause:
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]
[LIMIT number];

Example
Let us take an example for SELECT...ORDER BY clause. Assume employee
table as given below, with the fields named Id, Name, Salary, Designation, and Dept.
Generate a query to retrieve the employee details in order by using Department name.

+------+--------------+-------------+-------------------+--------+
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 40000 | Hr Admin | HR |
|1205 | Kranthi | 30000 | Op Admin | Admin |
+------+--------------+-------------+-------------------+--------+

The following query retrieves the employee details using the above
scenario:

hive> SELECT Id, Name, Dept FROM employee ORDER BY DEPT;
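
The sort direction can also be stated explicitly with ASC or DESC. As a minimal sketch against the same employee table, the following query lists the employees by salary in descending order:

hive> SELECT Id, Name, Salary FROM employee ORDER BY Salary DESC;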

HiveQL - Select-Group By
This chapter explains the details of GROUP BY clause in a SELECT statement.
The GROUP BY clause is used to group all the records in a result set using a
particular collection column. It is used to query a group of records.
The syntax of GROUP BY clause is as follows:

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]
[LIMIT number];
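
Example
No worked example is given above, so the following is a minimal sketch against the same employee table used earlier in this unit. It groups the records by department and counts the employees in each group; with the sample data, TP would appear with a count of 2, and PR, HR, and Admin with a count of 1 each.

hive> SELECT Dept, count(*) FROM employee GROUP BY Dept;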

HiveQL - Select-Joins
JOIN is a clause that is used for combining specific fields from two tables by using values
common to each one. It is used to combine records from two or more tables in the
database. It is more or less similar to SQL JOIN.
Syntax

join_table:
    table_reference JOIN table_factor [join_condition]
  | table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
  | table_reference LEFT SEMI JOIN table_reference join_condition
  | table_reference CROSS JOIN table_reference [join_condition]

Example
We will use the following two tables in this chapter. Consider the following table
named CUSTOMERS:

+----+----------+-----+-----------+----------+
| ID | NAME | AGE | ADDRESS | SALARY |
+----+----------+-----+-----------+----------+
| 1 | Ramesh | 32 | Ahmedabad | 2000.00 |
| 2 | Khilan | 25 | Delhi | 1500.00 |
| 3 | kaushik | 23 | Kota | 2000.00 |
| 4 | Chaitali | 25 | Mumbai | 6500.00 |
| 5 | Hardik | 27 | Bhopal | 8500.00 |
| 6 | Komal | 22 | MP | 4500.00 |
| 7 | Muffy | 24 | Indore | 10000.00 |
+----+----------+-----+-----------+----------+

Consider another table ORDERS as follows:

+-----+---------------------+-------------+--------+
|OID | DATE | CUSTOMER_ID | AMOUNT |
+-----+---------------------+-------------+--------+
| 102 | 2009-10-08 00:00:00 | 3 | 3000 |
| 100 | 2009-10-08 00:00:00 | 3 | 1500 |
| 101 | 2009-11-20 00:00:00 | 2 | 1560 |
| 103 | 2008-05-20 00:00:00 | 4 | 2060 |
+-----+---------------------+-------------+--------+
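
The notes do not show how these two tables are created. The following is only a hedged sketch of one way to define and load them: the comma delimiter and the local file paths /home/user/customers.txt and /home/user/orders.txt are assumptions, and the DATE column is written in backticks because DATE is a reserved word in recent Hive versions.

hive> CREATE TABLE IF NOT EXISTS CUSTOMERS (
          ID INT, NAME STRING, AGE INT, ADDRESS STRING, SALARY DOUBLE)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> CREATE TABLE IF NOT EXISTS ORDERS (
          OID INT, `DATE` TIMESTAMP, CUSTOMER_ID INT, AMOUNT INT)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> LOAD DATA LOCAL INPATH '/home/user/customers.txt' INTO TABLE CUSTOMERS;
hive> LOAD DATA LOCAL INPATH '/home/user/orders.txt' INTO TABLE ORDERS;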

There are different types of joins given as follows:

JOIN
LEFT OUTER JOIN
RIGHT OUTER JOIN
FULL OUTER JOIN

JOIN
JOIN clause is used to combine and retrieve the records from multiple tables. JOIN is the same
as INNER JOIN in SQL. A JOIN condition is specified using the primary keys and foreign
keys of the tables.
The following query executes JOIN on the CUSTOMERS and ORDERS tables, and retrieves the records:

hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT
      FROM CUSTOMERS c
      JOIN ORDERS o
      ON (c.ID = o.CUSTOMER_ID);

On successful execution of the query, you get to see the following response:

+----+----------+-----+--------+
| ID | NAME | AGE | AMOUNT |
+----+----------+-----+--------+
| 3 | kaushik | 23 | 3000 |
| 3 | kaushik | 23 | 1500 |
| 2 | Khilan | 25 | 1560 |
| 4 | Chaitali | 25 | 2060 |
+----+----------+-----+--------+
LEFT OUTER JOIN


The HiveQL LEFT OUTER JOIN returns all the rows from the left table, even if there
are no matches in the right table. This means, if the ON clause matches 0 (zero) records
in the right table, the JOIN still returns a row in the result, but with NULL in each
column from the right table. A LEFT JOIN returns all the values from the left table,
plus the matched values from the right table, or NULL in case of no matching JOIN
predicate.

The following query demonstrates LEFT OUTER JOIN between the CUSTOMERS and ORDERS tables:

hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
      FROM CUSTOMERS c
      LEFT OUTER JOIN ORDERS o
      ON (c.ID = o.CUSTOMER_ID);

On successful execution of the query, you get to see the following response:

+----+----------+--------+---------------------+
| ID | NAME | AMOUNT | DATE |
+----+----------+--------+---------------------+
| 1 | Ramesh | NULL | NULL |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
| 5 | Hardik | NULL | NULL |
| 6 | Komal | NULL | NULL |
| 7 | Muffy | NULL | NULL |
+----+----------+--------+---------------------+
RIGHT OUTER JOIN
The HiveQL RIGHT OUTER JOIN returns all the rows from the right table, even if
there are no matches in the left table. If the ON clause matches 0 (zero) records in the left
table, the JOIN still returns a row in the result, but with NULL in each column from the left
table.

A RIGHT JOIN returns all the values from the right table, plus the matched values from
the left table, or NULL in case of no matching join predicate.

The following query demonstrates RIGHT OUTER JOIN between the CUSTOMERS and ORDERS tables:

hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
      FROM CUSTOMERS c
      RIGHT OUTER JOIN ORDERS o
      ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:

+------+----------+--------+---------------------+
| ID | NAME | AMOUNT | DATE |
+------+----------+--------+---------------------+
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
+------+----------+--------+---------------------+

FULL OUTER JOIN


The HiveQL FULL OUTER JOIN combines the records of both the left and the right outer
tables that fulfil the JOIN condition. The joined table contains either all the records from
both the tables, or fills in NULL values for missing matches on either side.

The following query demonstrates FULL OUTER JOIN between the CUSTOMERS and ORDERS tables:

hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
      FROM CUSTOMERS c
      FULL OUTER JOIN ORDERS o
      ON (c.ID = o.CUSTOMER_ID);

On successful execution of the query, you get to see the following response. Since every order in the sample data has a matching customer, the result contains each customer row once, with NULL for customers who have no orders:

+------+----------+--------+---------------------+
| ID | NAME | AMOUNT | DATE |
+------+----------+--------+---------------------+
| 1 | Ramesh | NULL | NULL |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
| 5 | Hardik | NULL | NULL |
| 6 | Komal | NULL | NULL |
| 7 | Muffy | NULL | NULL |
+------+----------+--------+---------------------+
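
The join syntax at the start of this chapter also lists LEFT SEMI JOIN and CROSS JOIN, for which no examples are given above. A minimal sketch of LEFT SEMI JOIN, which returns only the columns of the left table for rows that have at least one match on the right, is shown below; with the sample data it would list Khilan, kaushik, and Chaitali once each, since only those customers have orders.

hive> SELECT c.ID, c.NAME
      FROM CUSTOMERS c
      LEFT SEMI JOIN ORDERS o
      ON (c.ID = o.CUSTOMER_ID);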

IMP Questions

1. Explain Hive architecture in detail with a neat diagram.
2. How does the workflow happen between Hive and Hadoop?
3. What is the difference between Hive and Apache Pig?
4. Write the Hive script for the following:
   A. Create database statement
   B. Select statement
   C. Order by statement
   D. Alter table statement
   E. Drop table statement
