BDA Complete Notes
UNIT II
A GFS cluster consists of a single master and multiple chunk servers and is accessed by multiple
clients, as shown in Figure.
Each of these is typically a commodity Linux machine running a user-level server process.
It is easy to run both a chunk server and a client on the same machine, as long as machine resources
permit and the lower reliability caused by running possibly flaky application code is acceptable.
Files are divided into fixed-size chunks. Each chunk is identified by an immutable and globally
unique 64 bit chunk handle assigned by the master at the time of chunk creation.
Chunkservers store chunks on local disks as Linux files and read or write chunk data specified by a
chunk handle and byte range. For reliability, each chunk is replicated on multiple chunkservers. By
default, we store three replicas, though users can designate different replication levels for different
regions of the file namespace. The master maintains all file system metadata. This includes the
namespace, access control information, the mapping from files to chunks, and the current locations
of chunks.
It also controls system-wide activities such as chunk lease management, garbage collection of
orphaned chunks, and chunk migration between chunkservers. The master periodically
communicates with each chunkserver in HeartBeat messages to give it instructions and collect its
state.
GFS client code linked into each application implements the file system API and communicates
with the master and chunkservers to read or write data on behalf of the application. Clients interact
with the master for metadata operations, but all data-bearing communication goes directly to the
chunkservers. We do not provide the POSIX API and therefore need not hook into the Linux vnode
layer.
Neither the client nor the chunkserver caches file data. Client caches offer little benefit because most
applications stream through huge files or have working sets too large to be cached. Not having them
simplifies the client and the overall system by eliminating cache coherence issues.(Clients do cache
metadata, however.) Chunkservers need not cache file data because chunks are stored as local files
and so Linux’s buffer cache already keeps frequently accessed data in memory.
Single Master:
Having a single master vastly simplifies our design and enables the master to make sophisticated
chunk placement and replication decisions using global knowledge. However, we must
minimize its involvement in reads and writes so that it does not become a bottleneck. Clients never
read and write file data through the master. Instead, a client asks the master which chunkservers
it should contact.
Chunk Size:
Chunk size is one of the key design parameters. We have chosen 64 MB, which is much larger than
typical file system block sizes. Each chunk replica is stored as a plain Linux file on a chunkserver
and is extended only as needed.
Lazy space allocation avoids wasting space due to internal fragmentation, perhaps the greatest
objection against such a large chunk size.
Second, since on a large chunk, a client is more likely to perform many operations on a given
chunk, it can reduce network overhead by keeping a persistent TCP connection to the chunkserver
over an extended period of time.
Third, it reduces the size of the metadata stored on the master. This allows us to keep the
metadata in memory, which in turn brings other advantages.
On the other hand, a large chunk size, even with lazy space allocation, has its disadvantages. A
small file consists of a small number of chunks, perhaps just one. The chunkservers storing those
chunks may become hot spots if many clients are accessing the same file. In practice, hot spots
have not been a major issue because our applications mostly read large multi-chunk files
sequentially.
However, hot spots did develop when GFS was first used by a batch-queue system: an executable
was written to GFS as a single-chunk file and then started on hundreds of machines at the same time.
Metadata
The master stores three major types of metadata: the file and chunk namespaces, the mapping from
files to chunks, and the locations of each chunk’s replicas.
All metadata is kept in the master’s memory. The first two types (namespaces and file-to- chunk
mapping) are also kept persistent by logging mutations to an operation log stored on the master’s
local disk and replicated on remote machines. Using a log allows us to update the master state
simply, reliably, and without risking inconsistencies in the event of a master crash. The master does
not store chunk location information persistently. Instead, it asks each chunkserver about its chunks
at master startup and whenever a chunkserver joins the cluster.
This periodic scanning is used to implement chunk garbage collection, re-replication in the
presence of chunkserver failures, and chunk migration to balance load and disk space usage across
chunkservers.
Advantages
➢ Since a client is more likely to perform many operations on a given large chunk, it can reduce
network overhead by keeping a persistent TCP connection to the chunk server over an extended
period of time.
➢ It reduces the size of the metadata stored on the master. This allows us to keep the metadata
in memory, which in turn brings other advantages.
Disadvantages
➢ Internal fragmentation could waste space with such a large chunk size, though lazy space
allocation avoids most of this waste.
➢ Even with lazy space allocation, a small file consists of a small number of chunks, perhaps
just one. The chunk servers storing those chunks may become hot spots if many clients are
accessing the same file. In practice, hot spots have not been a major issue because the
applications mostly read large multi-chunk files sequentially. To mitigate it, replication and
allowance to read from other clients can be done.
MapReduce
MapReduce is the programming model that uses Java as the programming language to retrieve
data from files stored in HDFS. All data in HDFS is stored as files. MapReduce too was
built in line with another paper published by Google.
Google did not release its implementations of GFS and MapReduce, apart from the papers.
However, the open source community built Hadoop and MapReduce based on those papers.
The initial adoption of Hadoop was at Yahoo Inc., where it gained good momentum and went
on to become a part of their production systems. After Yahoo, many organizations like LinkedIn,
Facebook, Netflix and many more have successfully implemented Hadoop within their
organizations.
Hadoop uses HDFS to store files efficiently in the cluster. When a file is placed in HDFS it
is broken down into blocks, 64 MB block size by default. These blocks are then replicated
across the different nodes (DataNodes) in the cluster. The default replication value is 3, i.e.
there will be 3 copies of the same block in the cluster. We will see later on why we maintain
replicas of the blocks in the cluster.
A Hadoop cluster can comprise a single node (single node cluster) or thousands of nodes.
Once you have installed Hadoop you can try out the following few basic commands to work
with HDFS:
hadoop fs -ls
hadoop fs -put <path_of_local> <path_in_hdfs>
hadoop fs -get <path_in_hdfs> <path_of_local>
hadoop fs -cat <path_of_file_in_hdfs>
hadoop fs -rmr <path_in_hdfs>
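For example (the local and HDFS paths below are hypothetical):
hadoop fs -put /home/user/sales.csv /user/hadoop/sales.csv
hadoop fs -ls /user/hadoop
hadoop fs -cat /user/hadoop/sales.csv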
[Figure: a file split into blocks A, B and C, with each block replicated on multiple DataNodes.]
NameNode
The NameNode maintains the metadata for HDFS: the filesystem namespace and the mapping of
file blocks to the DataNodes on which they are stored.
This information is required when retrieving data from the cluster as the data is spread across
multiple machines. The NameNode is a Single Point of Failure for the Hadoop Cluster.
Secondary NameNode
IMPORTANT – The Secondary NameNode is not a failover node for the NameNode.
The secondary name node is responsible for performing periodic housekeeping functions
for the NameNode. It only creates checkpoints of the file system present in the NameNode.
DataNode
The DataNode is responsible for storing the files in HDFS. It manages the file blocks within
the node. It sends information to the NameNode about the files and blocks stored in that node
and responds to the NameNode for all filesystem operations.
JobTracker
JobTracker is responsible for taking in requests from a client and assigning TaskTrackers
with tasks to be performed. The JobTracker tries to assign tasks to the
TaskTracker on the DataNode where the data is locally present (Data Locality). If that is not
possible it will at least try to assign tasks to TaskTrackers within the same rack. If for some
reason the node fails the JobTracker assigns the task to another TaskTracker where the replica
of the data exists since the data blocks are replicated across the DataNodes. This ensures that
the job does not fail even if a node fails within the cluster.
TaskTracker
TaskTracker is a daemon that accepts tasks (Map, Reduce and Shuffle) from the JobTracker.
The TaskTracker keeps sending a heartbeat message to the JobTracker to notify that it is
alive. Along with the heartbeat, it also sends the number of free slots available within it to
process tasks. The TaskTracker starts and monitors the Map & Reduce Tasks and sends
progress/status information back to the JobTracker.
All the above daemons run within their own JVMs. A typical (simplified) flow in Hadoop
is as follows:
➢ A Client (usually a MapReduce program) submits a job to the JobTracker.
➢ The JobTracker gets information from the NameNode on the location of the data
within the DataNodes. The JobTracker places the client program (usually a jar file
along with the configuration file) in the HDFS. Once placed, the JobTracker tries
to assign tasks to TaskTrackers on the DataNodes based on data locality.
➢ The TaskTracker takes care of starting the Map tasks on the DataNodes by picking up the
client program from the shared location on the HDFS.
➢ The progress of the operation is relayed back to the JobTracker by the TaskTracker.
➢ On completion of the Map task, an intermediate file is created on the local filesystem
of the TaskTracker.
➢ Results from Map tasks are then passed on to the Reduce task.
➢ The Reduce task works on all data received from the map tasks and writes the final output
to HDFS.
➢ After the task completes, the intermediate data generated by the TaskTracker is deleted.
A very important feature of Hadoop to note here is that the program goes to where the data
is and not the other way around, thus resulting in efficient processing of data.
For standalone mode, a single node installation is sufficient. There is another mode known as 'pseudo distributed' mode. This mode
is used to simulate the multi node environment on a single server.
In this document we will discuss how to install Hadoop on Ubuntu Linux. In any mode, the system
should have java version 1.6.x installed on it.
Standalone Mode Installation
Now, let us check the standalone mode installation process by following the steps mentioned below.
Install Java
Java (JDK Version 1.6.x) either from Sun/Oracle or Open Java is required.
• Step 1 - If you are not able to switch to OpenJDK instead of using the proprietary Sun JDK/JRE,
install sun-java6 from the Canonical Partner Repository by using the following command.
Note: The Canonical Partner Repository contains free-of-cost, closed-source third-party
software. Canonical does not have access to the source code; they only package and test it.
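A sketch of the install command, assuming the Canonical Partner Repository has already been added to the APT sources:
$ sudo apt-get update
$ sudo apt-get install sun-java6-jdk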
$ bin/hadoop
The following output will be displayed –
The above output indicates that Standalone installation is completed successfully. Now you can run
the sample examples of your choice by calling –
$ bin/hadoop jar hadoop-*-examples.jar <NAME> <PARAMS>
Pseudo Distributed Mode Installation
$ vi conf/core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/tmp/hadoop-${user.name}</value>
</property>
</configuration>
$ vi conf/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
Once these changes are done, we need to format the name node by using the following command. The
command prompt will show all the messages one after another and finally a success message.
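A sketch of the format and start commands, assuming the Hadoop 0.20/1.x release line used throughout these notes:
$ bin/hadoop namenode -format
$ bin/start-all.sh
Once the daemons are up, the NameNode web interface can be viewed in a browser at the address below.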
http://localhost:50070/
Stopping the Single node Cluster: We can stop the single node cluster by using the following
command. The command prompt will display all the stopping processes.
$ bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode
Installation Items
Installation Requirements
S.No Requirement Reason
Before we start the distributed mode installation, we must ensure that we have the pseudo distributed
setup done and we have at least two machines, one acting as master and the other acting as a slave.
Now we run the following commands in sequence.
• $ bin/stop-all.sh - Make sure none of the nodes are running.
• First log in on Host1 (the hostname of the namenode machine) as the hadoop user and generate a
pair of authentication keys. The command is:
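A sketch of the key-generation command (the empty passphrase is the usual choice for passwordless SSH between the nodes; adjust to your security policy):
$ ssh-keygen -t rsa -P ""
The public key then has to be copied into the authorized_keys file of the hadoop user on each slave.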
Now we open the two files - conf/masters and conf/slaves. The conf/masters file defines the name nodes of
our multi node cluster. The conf/slaves file lists the hosts where the Hadoop Slave will be running.
• Edit the conf/core-site.xml file to have the following entries -
<property>
<name>fs.default.name</name>
<value>hdfs://master:54310</value>
</property>
• Edit the conf/mapred-site.xml file to have the following entries -
<property>
<name>mapred.job.tracker</name>
<value>master:54311</value>
</property>
• Edit the conf/hdfs-site.xml file to have the following entries -
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
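To bring the cluster up, a sketch of the start command for the Hadoop 0.20/1.x script names assumed in these notes (run on the master):
$ bin/start-all.sh
Alternatively, HDFS and MapReduce can be started separately with bin/start-dfs.sh followed by bin/start-mapred.sh.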
Once started, check the status on the master by using jps command. You should get the following
output:
16017 Jps
14799 NameNode
15596 JobTracker
14977 SecondaryNameNode
15183 DataNode
15897 TaskTracker
All these files are available under ‘conf’ directory of Hadoop installation directory.
hadoop-env.sh
This file specifies environment variables that affect the JDK used by Hadoop
Daemon (bin/hadoop).
As Hadoop framework is written in Java and uses Java Runtime environment, one of the
important environment variables for Hadoop daemon is $JAVA_HOME in hadoop-env.sh.
This variable directs Hadoop daemon to the Java path in the system.
This file is also used for setting other Hadoop daemon execution environment options such as heap
size (HADOOP_HEAPSIZE), Hadoop home (HADOOP_HOME), and the log file location
(HADOOP_LOG_DIR), etc.
Note: For the simplicity of understanding the cluster setup, we have configured only necessary
parameters to start a cluster.
The following three files are the important configuration files for the runtime environment
settings of a Hadoop cluster.
core-site.xml
This file informs Hadoop daemon where NameNode runs in the cluster. It contains the
configuration settings for Hadoop Core such as I/O settings that are common
to HDFS and MapReduce.
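A sketch of the corresponding entry (hostname and port are placeholders to be replaced for your cluster):
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://hostname:port</value>
  </property>
</configuration>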
Where hostname and port are the machine and port on which NameNode daemon runs and
listens. It also informs the Name Node as to which IP and port it should bind. The commonly
used port is 8020 and you can also specify IP address rather than hostname.
hdfs-site.xml
This file contains the configuration settings for HDFS daemons; the Name Node, the Secondary
Name Node, and the data nodes.
You can also configure hdfs-site.xml to specify default block replication and permission
checking on HDFS. The actual number of replications can also be specified when the file is created.
The default is used if replication is not specified at create time.
mapred-site.xml
This file contains the configuration settings for MapReduce daemons: the job tracker and the
task-trackers. The mapred.job.tracker parameter is a hostname (or IP address) and port pair on
which the Job Tracker listens for RPC communication. This parameter specifies the location of the
Job Tracker to the Task Trackers and MapReduce clients.
You can replicate all of the four files explained above to all the Data Nodes and the Secondary
Namenode. These files can then be configured for any node-specific configuration, e.g. in case of a
different JAVA_HOME on one of the Datanodes.
The following two files, 'masters' and 'slaves', determine the master and slave nodes in the Hadoop
cluster.
Masters
This file informs the Hadoop daemon about the Secondary Namenode location. The 'masters' file
at the Master server contains the hostname of the Secondary Name Node server.
Slaves
The ‘slaves’ file at Master node contains a list of hosts, one per line, that are to host Data Node
and Task Tracker servers.
The ‘slaves’ file on Slave server contains the IP address of the slave node. Notice that the
‘slaves’ file at Slave node contains only its own IP address and not of any other Data Nodes in
the cluster.
UNIT-III
MapReduce is a programming model for data processing. The model is simple, yet
not too simple to express useful programs in. Hadoop can run MapReduce programs written
in various languages. Most importantly, MapReduce programs are inherently parallel, thus
putting very large-scale data analysis into the hands of anyone with enough machines at their
disposal. MapReduce comes into its own for large datasets, so let’s start by looking at one.
A Weather Dataset
For our example, we will write a program that mines weather data. Weather sensors
collecting data every hour at many locations across the globe gather a large volume of log
data, which is a good candidate for analysis with MapReduce, since it is semistructured and
record-oriented.
Data Format
The data we will use is from the National Climatic Data Center (NCDC, http://www
.ncdc.noaa.gov/). The data is stored using a line-oriented ASCII format, in which each line is
a record. The format supports a rich set of meteorological elements, many of which are
optional or with variable data lengths. For simplicity, we shall focus on the basic elements,
such as temperature, which are always present and are of fixed width.
Example shows a sample line with some of the salient fields highlighted. The line has been
split into multiple lines to show each field: in the real file, fields are packed into one line with
no delimiters.
Data files are organized by date and weather station. There is a directory for each year
from 1901 to 2001, each containing a gzipped file for each weather station with its readings
for that year. For example, here are the first entries for 1990:
% ls raw/1990 | head
010010-99999-1990.gz
010014-99999-1990.gz
010015-99999-1990.gz
010016-99999-1990.gz
010017-99999-1990.gz
010030-99999-1990.gz
Since there are tens of thousands of weather stations, the whole dataset is made up of
a large number of relatively small files. It’s generally easier and more efficient to process a
smaller number of relatively large files, so the data was preprocessed so that each year’s
readings were concatenated into a single file.
Analyzing the Data with Hadoop
To take advantage of the parallel processing that Hadoop provides, we need to express
our query as a MapReduce job. After some local, small-scale testing, we will be able to run it
on a cluster of machines.
Map and Reduce
MapReduce works by breaking the processing into two phases: the map phase and
the reduce phase. Each phase has key-value pairs as input and output, the types of which
may be chosen by the programmer. The programmer also specifies two functions: the map
function and the reduce function.
The input to our map phase is the raw NCDC data. We choose a text input format that
gives us each line in the dataset as a text value. The key is the offset of the beginning of the
line from the beginning of the file, but as we have no need for this, we ignore it.
Our map function is simple. We pull out the year and the air temperature, since these
are the only fields we are interested in. In this case, the map function is just a data preparation
phase, setting up the data in such a way that the reducer function can do its work on it:
finding the maximum temperature for each year. The map function is also a good place to
drop bad records: here we filter out temperatures that are missing, suspect, or erroneous.
To visualize the way the map works, consider the following sample lines of input data
(some unused columns have been dropped to fit the page, indicated by ellipses):
0067011990999991950051507004...9999999N9+00001+99999999999...
0043011990999991950051512004...9999999N9+00221+99999999999...
0043011990999991950051518004...9999999N9-00111+99999999999...
0043012650999991949032412004...0500001N9+01111+99999999999...
0043012650999991949032418004...0500001N9+00781+99999999999...
These lines are presented to the map function as the key-value pairs:
(0, 0067011990999991950051507004...9999999N9+00001+99999999999...)
(106, 0043011990999991950051512004...9999999N9+00221+99999999999...)
(212, 0043011990999991950051518004...9999999N9-00111+99999999999...)
(318, 0043012650999991949032412004...0500001N9+01111+99999999999...)
(424, 0043012650999991949032418004...0500001N9+00781+99999999999...)
The keys are the line offsets within the file, which we ignore in our map function. The map
function merely extracts the year and the air temperature, and emits them as its output
(the temperature values have been interpreted as integers):
(1950, 0)
(1950, 22)
(1950, −11)
(1949, 111)
(1949, 78)
The output from the map function is processed by the MapReduce framework before being
sent to the reduce function. This processing sorts and groups the key-value pairs by key.
So, continuing the example, our reduce function sees the following input:
(1949, [111, 78])
(1950, [0, 22, −11])
Each year appears with a list of all its air temperature readings. All the reduce function has to
do now is iterate through the list and pick up the maximum reading:
(1949, 111)
(1950, 22)
This is the final output: the maximum global temperature recorded in each year.
The whole data flow is illustrated in the figure below. At the bottom of the diagram
is a Unix pipeline that mimics the whole MapReduce flow.
Java MapReduce
Having run through how the MapReduce program works, the next step is to express it
in code. We need three things: a map function, a reduce function, and some code to run
the job. The map function is represented by an implementation of the Mapper interface,
which declares a map() method. Example -1 shows the implementation of our map function.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class MaxTemperatureMapper extends MapReduceBase implements
    Mapper<LongWritable, Text, Text, IntWritable>
{
The map() method is passed a key and a value. We convert the Text value containing the line
of input into a Java String, then use its substring() method to extract the columns we are
interested in.
The map() method also provides an instance of OutputCollector to write the output to. In this
case, we write the year as a Text object (since we are just using it as a key), and the
temperature is wrapped in an IntWritable. We write an output record only if the temperature
is present and the quality code indicates the temperature reading is OK.
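Continuing Example 1, a minimal sketch of the map() method body just described (a reconstruction based on the NCDC record layout; the MISSING constant and the exact column offsets are assumptions, not guaranteed to match the original listing):
  private static final int MISSING = 9999;

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    String year = line.substring(15, 19);          // year field
    int airTemperature;
    if (line.charAt(87) == '+') {                  // parseInt does not like leading plus signs
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      output.collect(new Text(year), new IntWritable(airTemperature));
    }
  }
}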
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
Again, four formal type parameters are used to specify the input and output types, this
time for the reduce function. The input types of the reduce function must match the output
types of the map function: Text and IntWritable. And in this case, the output types of the
reduce function are Text and IntWritable, for a year and its maximum temperature, which we
find by iterating through the temperatures and comparing each with a record of the highest
found so far.
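A minimal sketch of the reduce function just described (a reconstruction of the MaxTemperatureReducer; not guaranteed to match the original listing verbatim):
public class MaxTemperatureReducer extends MapReduceBase implements
    Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int maxValue = Integer.MIN_VALUE;
    while (values.hasNext()) {
      maxValue = Math.max(maxValue, values.next().get());   // track the highest reading so far
    }
    output.collect(key, new IntWritable(maxValue));
  }
}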
The third piece of code runs the MapReduce job (see Example -3).
Example -3. Application to find the maximum temperature in the weather dataset
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
public class MaxTemperature
{
public static void main(String[] args) throws IOException
{
if (args.length != 2)
{
System.err.println("Usage: MaxTemperature <input path> <output path>");
System.exit(-1);
}
JobConf conf = new JobConf(MaxTemperature.class);
conf.setJobName("Max temperature");
FileInputFormat.addInputPath(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
conf.setMapperClass(MaxTemperatureMapper.class);
conf.setReducerClass(MaxTemperatureReducer.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
JobClient.runJob(conf);
}
}
A JobConf object forms the specification of the job. It gives you control over how the
job is run. When we run this job on a Hadoop cluster, we will package the code into a JAR
file (which Hadoop will distribute around the cluster). Rather than explicitly specify the name
of the JAR file, we can pass a class in the JobConf constructor, which Hadoop will use to
locate the relevant JAR file by looking for the JAR file containing this class.
Having constructed a JobConf object, we specify the input and output paths. An input
path is specified by calling the static addInputPath() method on FileInputFormat, and it
can be a single file, a directory (in which case, the input forms all the files in that directory),
or a file pattern. As the name suggests, addInputPath() can be called more
than once to use input from multiple paths.
The output path (of which there is only one) is specified by the static setOutputPath()
method on FileOutputFormat. It specifies a directory where the output files from the
reducer functions are written. The directory shouldn’t exist before running the job, as
Hadoop will complain and not run the job. This precaution is to prevent data loss (it can be
very annoying to accidentally overwrite the output of a long job with another).
Next, we specify the map and reduce types to use via the setMapperClass() and
setReducerClass() methods. The setOutputKeyClass() and setOutputValueClass()
methods control the output types for the map and the reduce functions, which are often the
same, as they are in our case.
If they are different, then the map output types can be set using the methods
setMapOutputKeyClass() and setMapOutputValueClass(). The input types are controlled via
the input format, which we have not explicitly set since we are using the default
TextInputFormat. After setting the classes that define the map and reduce functions, we are
ready to run the job.
The static runJob() method on JobClient submits the job and waits for it to finish,
writing information about its progress to the console.
The output was written to the output directory, which contains one output file per
reducer. The job had a single reducer, so we find a single file, named part-00000:
% cat output/part-00000
1949 111
1950 22
This result is the same as when we went through it by hand earlier. We interpret this
as saying that the maximum temperature recorded in 1949 was 11.1°C, and in 1950 it was
2.2°C.
The new Java MapReduce API:
Release 0.20.0 of Hadoop included a new Java MapReduce API, sometimes referred
to as “Context Objects,” designed to make the API easier to evolve in the future. The new
API is type-incompatible with the old, however, so applications need to be rewritten to take
API is type-incompatible with the old, however, so applications need to be rewritten to take
advantage of it.
When converting your Mapper and Reducer classes to the new API, don’t forget to
change the signature of the map() and reduce() methods to the new form. Just changing your
class to extend the new Mapper or Reducer classes will not produce a compilation error or
warning, since these classes provide an identity form of the map() or reduce() method
(respectively). Your mapper or reducer code, however, will not be invoked, which can lead to
some hard-to-diagnose errors.
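For illustration, a minimal sketch of the new-API form of a mapper (the class name is hypothetical and the parsing logic is elided into comments):
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class NewApiMaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // same year/temperature parsing as in the old-API example, then emit via the context:
    // context.write(new Text(year), new IntWritable(airTemperature));
  }
}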
Combiner
A Combiner, also known as a semi-reducer, is an optional class that operates by accepting
the inputs from the Map class and thereafter passing the output key-value pairs to the
Reducer class.
The main function of a Combiner is to summarize the map output records with the same key.
The output (key-value collection) of the combiner will be sent over the network to the actual
Reducer task as input.
Combiner class
The Combiner class is used in between the Map class and the Reduce class to reduce the
volume of data transferred between Map and Reduce. Usually, the output of the map task is
large, and the data transferred to the reduce task is correspondingly high.
A combiner does not have a predefined interface and it must implement the Reducer
interface's reduce() method.
A combiner operates on each map output key. It must have the same output key-value
types as the Reducer class.
Although the Combiner is optional, it helps segregate the data into multiple groups for the Reduce
phase, which makes it easier to process.
The following example provides a theoretical idea about combiners. Let us assume we have
the following input text file named input.txt for MapReduce.
What do you mean by Object
What do you know about Java
What is Java Virtual Machine
How Java enabled High Performance
The important phases of the MapReduce program with Combiner are discussed below.
Record Reader
This is the first phase of MapReduce, where the Record Reader reads every line from the
input text file as text and yields the output as key-value pairs.
Map Phase
The Map phase takes input from the Record Reader, processes it, and produces the output as
another set of key-value pairs.
Input − The following key-value pair is the input taken from the Record Reader.
Combiner Phase
The Combiner phase takes each key-value pair from the Map phase, processes it, and
produces the output as key-value collection pairs.
Input − The following key-value pair is the input taken from the Map phase.
<What,1> <do,1> <you,1> <mean,1> <by,1> <Object,1>
<What,1> <do,1> <you,1> <know,1> <about,1> <Java,1>
<What,1> <is,1> <Java,1> <Virtual,1> <Machine,1>
<How,1> <Java,1> <enabled,1> <High,1> <Performance,1>
The Combiner phase reads each key-value pair, combines the common words as key and
values as collection. Usually, the code and operation for a Combiner is similar to that of a
Reducer. Following is the code snippet for Mapper, Combiner and Reducer class
declaration.
job.setMapperClass(WordCountMapper.class);
job.setCombinerClass(WordCountReducer.class);
job.setReducerClass(WordCountReducer.class);
Output − The expected output is as follows −
Partitioner Phase
The partitioning phase takes place after the map phase and before the reduce phase. The
number of partitions is equal to the number of reducers. The data gets partitioned across
the reducers according to the partitioning function.
The difference between a partitioner and a combiner is that the partitioner divides
the data according to the number of reducers so that all the data in a single partition gets
executed by a single reducer. However, the combiner functions similar to the reducer and
processes the data in each partition. The combiner is an optimization to the reducer.
The default partitioning function is the hash partitioning function where the hashing
is done on the key. However it might be useful to partition the data according to some other
function of the key or the value.
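To illustrate partitioning on something other than the default key hash, here is a minimal sketch of a custom Partitioner (the class name and the first-letter rule are hypothetical):
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes words to reducers by their first letter instead of the hash of the whole key.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    if (key.getLength() == 0) {
      return 0;
    }
    return Character.toLowerCase(key.charAt(0)) % numPartitions;
  }
}
It would be wired into the job with job.setPartitionerClass(FirstLetterPartitioner.class).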
Reducer Phase
The Reducer phase takes each key-value collection pair from the Combiner phase, processes
it, and passes the output as key-value pairs. Note that the Combiner functionality is the same as
that of the Reducer.
Input − The following key-value pair is the input taken from the Combiner phase.
The Reducer phase reads each key-value pair. (The Reducer here is the WordCountReducer class
referenced in the snippet above.)
Record Writer Phase
This is the last phase of MapReduce, where the Record Writer writes every key-value pair
from the Reducer phase and sends the output as text.
Input − Each key-value pair from the Reducer phase, along with the Output format.
Output − It gives you the key-value pairs in text format. Following is the expected output.
What 3
em
do 2
you 2
mean 1
by 1
Object 1
know 1
about 1
Java 3
is 1
Virtual 1
Machine 1
How 1
enabled 1
High 1
Performance 1
Hadoop and Big Data
Unit 4: Hadoop I/O: The Writable Interface, WritableComparable and comparators, Writable
Classes: Writable wrappers for Java primitives, Text, BytesWritable, NullWritable,
ObjectWritable and GenericWritable, Writable collections, Implementing a Custom Writable:
Implementing a RawComparator for speed, Custom comparators
Reference: Hadoop: The Definitive Guide by Tom White, 3rd Edition, O’reilly
Unit 4
1. Hadoop I/O
2. Writable Classes: Text, BytesWritable, NullWritable, ObjectWritable, GenericWritable, Writable collections
3. Custom comparators
1. Hadoop I/O
Writable in an interface in Hadoop and types in Hadoop must implement this interface. Hadoop
provides these writable wrappers for almost all Java primitive types and some other types,but
sometimes we need to pass custom objects and these custom objects should implement Hadoop's
Writable interface. Hadoop MapReduce uses implementations of Writables for interacting with
user-provided Mappers and Reducers.
These wrapper types are called Writable(s). Some examples of Writables, as already mentioned before, are
IntWritable, LongWritable, BooleanWritable and FloatWritable.
Comparison of types is crucial for MapReduce, where there is a sorting phase during which keys
are compared with one another.
The shuffle is the assignment of the intermediary keys (K2) to reducers, and the sort is the sorting
of these keys. By implementing a RawComparator to compare the intermediary keys, this sorting
effort can be greatly reduced. Sorting is improved because the RawComparator compares the keys
byte by byte. If we did not use a RawComparator, the intermediary keys would have to be
completely deserialized to perform a comparison.
1) Any type which is to be used as a key in the Hadoop Map-Reduce framework should
implement the WritableComparable interface.
2) Any type which is to be used as a value in the Hadoop Map-Reduce framework should
implement the Writable interface.
Writables and their Importance in Hadoop
Writable is an interface in Hadoop. Writables in Hadoop act as wrapper classes for almost all the
primitive data types of Java. That is how the int of Java has become IntWritable in Hadoop and the
String of Java has become Text in Hadoop.
Writables are used for creating serialized data types in Hadoop. So, let us start by understanding
what data type, interface and serialization are.
Data Type
A data type is a set of data with values having predefined characteristics. There are several kinds
of data types in Java, for example int, short, byte, long, char etc. These are called primitive
data types. All these primitive data types are bound to classes called wrapper classes. For
example, int, short, byte and long are grouped under INTEGER, which is a wrapper class. These
wrapper classes are predefined in Java.
Interface in Java
An interface in Java is a completely abstract class. The methods within an interface are abstract
methods which do not have a body, and the fields within the interface are public, static and final,
which means that the fields cannot be modified.
The structure of an interface is much like that of a class. We cannot create an object for an
interface, and the only way to use an interface is to implement it in another class by using the
'implements' keyword.
Serialization
Serialization is nothing but converting the raw data into a stream of bytes which can travel along
different networks and can reside in different systems. Serialization is not the only concern of
Writable interface; it also has to perform compare and sorting operation in Hadoop.
Now the question is whether Writables are really necessary for Hadoop. The Hadoop framework
definitely needs Writable types in order to perform the following tasks:
We have seen how Writables reduce the data size overhead and make the data transfer easier in
the network.
Also, the core part of the Hadoop framework, i.e., the shuffle and sort phase, won't be executed
without using Writable.
Writable variables in Hadoop have the default properties of Comparable. For example:
When we write a key as IntWritable in the Mapper class and send it to the reducer class, there is
an intermediate phase between the Mapper and Reducer class, i.e., shuffle and sort, where each
key has to be compared with many other keys. If the keys are not comparable, then the shuffle and
sort phase won't be executed, or may be executed with a high amount of overhead.
Can we make custom Writables? The answer is definitely 'yes'. We can make our own custom
Writable type.
Let us now see how to make a custom type in Java. The following is a simple custom type
that holds two int fields:
class add {
    int a;
    int b;
    public add(int a, int b) {
        this.a = a;
        this.b = b;
    }
}
Similarly we can make a custom type in Hadoop using Writables.
Here, readFields, reads the data from network and write will write the data into local disk. Both
are necessary for transferring data through clusters. DataInput and DataOutput classes (part of
java.io) contain methods to serialize the most basic types of data.
Suppose we want to make a composite key in Hadoop by combining two Writables; then follow
the steps below:
public class add implements Writable {
    private int a;
    private int b;
    public add(int a, int b) { this.a = a; this.b = b; }
    public void write(DataOutput out) throws IOException {
        out.writeInt(a);
        out.writeInt(b);
    }
    public void readFields(DataInput in) throws IOException {
        a = in.readInt();
        b = in.readInt();
    }
}
Thus we can create our custom Writables in a way similar to custom types in Java but with two
additional methods, write and readFields. The custom writable can travel through networks and
can reside in other systems.
This custom type cannot be compared with each other by default, so again we need to make them
comparable with each other.
Let us now discuss what is WritableComparable and the solution to the above problem.
As explained above, if a key is taken as IntWritable, by default it has the comparable feature because
a RawComparator acts on that type; it will compare the key with the other keys in the network, and
without it this comparison won't be executed.
By default, IntWritable, LongWritable and Text have a RawComparator which can execute this
comparable phase for them. Then, will that RawComparator help a custom Writable? The answer is
no. So, we need to have WritableComparable.
WritableComparable can be defined as a sub-interface of Writable which also has the feature of
Comparable. If we have created our custom Writable type, then:
We need to make our custom type comparable if we want to compare it with other types.
If we want to make our custom type a key, then we should definitely make the key type
WritableComparable rather than simply Writable. This enables the custom type to be compared
with other types, and it is also sorted accordingly. Otherwise, the keys won't be compared with
each other and they are just passed through the network.
In addition to write() and readFields(), a WritableComparable therefore has to provide:
int compareTo(WritableComparable o)
How to make our custom type WritableComparable?
public class add implements WritableComparable<add> {
    public int a;
    public int b;
    public add(int a, int b) { this.a = a; this.b = b; }
    public void write(DataOutput out) throws IOException {
        out.writeInt(a);
        out.writeInt(b);
    }
    public void readFields(DataInput in) throws IOException {
        a = in.readInt();
        b = in.readInt();
    }
    public int compareTo(add o) {
        int cmp = Integer.compare(a, o.a);
        return (cmp != 0) ? cmp : Integer.compare(b, o.b);
    }
}
These readFields() and write() methods make serializing and comparing the data in the network faster.
With the use of these Writables and WritableComparables in Hadoop, we can make our serialized
custom types with less difficulty. This gives developers the ease to make their custom types based
on their requirements.
Primitive Writable Classes
BooleanWritable
ByteWritable
IntWritable
VIntWritable
FloatWritable
LongWritable
VLongWritable
DoubleWritable
In the above list, VIntWritable and VLongWritable are used for variable-length integer types and
variable-length long types respectively.
Serialized sizes of the above primitive writable data types are same as the size of actual java data
type. So, the size of IntWritable is 4 bytes and LongWritable is 8 bytes.
Array Writable Classes
Hadoop provides two types of array Writable classes, one for single-dimensional and another
for two-dimensional arrays. The elements of these arrays must be other Writable objects like
IntWritable or LongWritable only, not Java native data types like int or float.
ArrayWritable
TwoDArrayWritable
NullWritable
NullWritable is a special type of Writable representing a null value. No bytes are read or written
when a data type is specified as NullWritable. So, in Mapreduce, a key or a value can be
declared as a NullWritable when we don’t need to use that field.
ObjectWritable
This is a general-purpose generic object wrapper which can store any objects like Java
primitives, String, Enum, Writable, null, or arrays.
Text
Text can be used as the Writable equivalent of java.lang.String and its max size is 2 GB. Unlike
Java's String data type, Text is mutable in Hadoop.
BytesWritable
BytesWritable is a Writable wrapper for an array of binary data (a byte[] together with its length).
GenericWritable
It is similar to ObjectWritable but supports only a few types. User need to subclass this
GenericWritable class and need to specify the types to support.
// Example usage of some common Writable types; the variable declarations and sample
// values shown here are illustrative additions around the original calls.
IntWritable i2 = new IntWritable();
i2.set(5);
Text t = new Text("hadoop");
Text t2 = new Text("writables");
System.out.printf("\n t: %s, t.length: %d, t2: %s, t2.length: %d \n", t.toString(), t.getLength(),
        t2.getBytes(), t2.getBytes().length);
ArrayWritable a = new ArrayWritable(IntWritable.class);
a.set(new IntWritable[]{ new IntWritable(10), new IntWritable(20), new IntWritable(30) });
ArrayWritable b = new ArrayWritable(Text.class);
b.set(new Text[]{ new Text("Hello"), new Text("Writables"), new Text("World !!!") });
MapWritable m = new MapWritable();
VIntWritable key1 = new VIntWritable(1);
LongWritable value1 = new LongWritable(100);
m.put(key1, value1);
m.put(new VIntWritable(2), new LongWritable(163));
m.put(new VIntWritable(3), new Text("Mapreduce"));
System.out.println(m.containsKey(key1));
System.out.println(m.get(new VIntWritable(3)));
for (Writable w : m.keySet())
    System.out.println(m.get(w));
Implementing a Custom Writable: Implementing a RawComparator for speed
Implementing a RawComparator can speed up your Map/Reduce (MR) Jobs. As you may recall,
an MR Job is composed of receiving and sending key-value pairs. The process looks like the following:
(K1, V1) → Map → (K2, V2) → Shuffle/Sort → Reduce → (K3, V3)
The key-value pairs (K2, V2) are called the intermediary key-value pairs. They are passed from
the mapper to the reducer. Before these intermediary key-value pairs reach the reducer, a shuffle
and sort step is performed. The shuffle is the assignment of the intermediary keys (K2) to
reducers and the sort is the sorting of these keys. Here, by implementing a RawComparator to
compare the intermediary keys, this extra effort will greatly improve sorting. Sorting is improved
because the RawComparator compares the keys byte by byte. If we did not use a RawComparator,
the intermediary keys would have to be completely deserialized to perform a comparison.
BACKGROUND
Assume the input is a large file in which each line holds a pair of indexes {i, j}, for example:
1, 2
3, 4
5, 6
...
0, 0
What we want to do is simply count the occurrences of the {i,j} pair of indexes. Our MR Job will
look like the following.
METHOD
The first thing we have to do is model our intermediary key K2={i,j}. Below is a snippet of the
IndexPair. As you can see, it implements WritableComparable. Also, we are sorting the keys
ascendingly by the i-th and then j-th indexes.
public class IndexPair implements WritableComparable<IndexPair> {
    private IntWritable i;
    private IntWritable j;
    public IndexPair(int i, int j) {
        this.i = new IntWritable(i);
        this.j = new IntWritable(j);
    }
    public int compareTo(IndexPair o) {
        int cmp = i.compareTo(o.i);
        if (0 != cmp)
            return cmp;
        return j.compareTo(o.j);
    }
    //....
}
Below is a snippet of the RawComparator. As you notice, it does not directly implement
RawComparator. Rather, it extends WritableComparator (which implements RawComparator).
public class IndexPairComparator extends WritableComparator {
    protected IndexPairComparator() {
        super(IndexPair.class);
    }
    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int comp = Integer.compare(readInt(b1, s1), readInt(b2, s2));
        if (comp != 0)
            return comp;
        return Integer.compare(readInt(b1, s1 + 4), readInt(b2, s2 + 4));
    }
}
As you can see in the above code, for the two objects we are comparing, there are two
corresponding byte arrays (b1 and b2), the starting positions of the objects in the byte arrays, and
the length of the bytes they occupy. Please note that the byte arrays themselves represent other
things and not only the objects we are comparing. That is why the starting position and length are
also passed in as arguments. Since we want to sort ascendingly by i then j, we first compare the
bytes representing the i-th indexes and, if they are equal, we then compare the j-th indexes. You
can also see that we use the util method readInt(byte[], start), inherited from
WritableComparator. This method simply converts the 4 consecutive bytes beginning at start into
a primitive int (a primitive int in Java is 4 bytes). If the i-th indexes are equal, then we shift the
starting point by 4, read in the j-th indexes and then compare them.
A snippet of the mapper is shown below.
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    String[] tokens = value.toString().split(",");
    int i = Integer.parseInt(tokens[0].trim());
    int j = Integer.parseInt(tokens[1].trim());
    IndexPair indexPair = new IndexPair(i, j);
    context.write(indexPair, ONE);   // ONE is a constant IntWritable(1)
}
A snippet of the reducer is shown below.
int sum = 0;
for (IntWritable value : values)
    sum += value.get();
context.write(key, new IntWritable(sum));
The snippet of code below shows how I wired up the MR Job that does NOT use raw byte
comparison.
job.setJarByClass(RcJob1.class);
job.setMapOutputKeyClass(IndexPair.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(IndexPair.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(RcMapper.class);
job.setReducerClass(RcReducer.class);
job.waitForCompletion(true);
return 0;
}
The snippet of code below shows how I wired up the MR Job using the raw byte comparator.
job.setJarByClass(RcJob1.class);
job.setSortComparatorClass(IndexPairComparator.class);
job.setMapOutputKeyClass(IndexPair.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(IndexPair.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(RcMapper.class);
job.setReducerClass(RcReducer.class);
job.waitForCompletion(true);
return 0;
As you can see, the only difference is that in the MR Job using the raw comparator, we explicitly
set its sort comparator class.
RESULTS
I ran the MR Jobs (without and with raw byte comparisons) 10 times on a dataset of 4 million
rows of {i,j} pairs. The runs were against Hadoop v0.20 in standalone mode on Cygwin. The
average running time for the MR Job without raw byte comparison is 60.6 seconds, and the
average running time for the job with raw byte comparison is 31.1 seconds. A two-tail paired t-
test showed p < 0.001, meaning, there is a statistically significant difference between the two
implementations in terms of empirical running time.
I then ran each implementation on datasets of increasing record sizes from 1, 2, …, and 10
million records. At 10 million records, without using raw byte comparison took 127 seconds
(over 2 minutes) to complete, while using raw byte comparison took 75 seconds (1 minute and
15 seconds) to complete.
[Figure: line graph of running time versus number of records for the two implementations.]
Custom comparators.
Frequently, objects in one Tuple are compared to objects in a second Tuple. This is especially
true during the sort phase of GroupBy and CoGroup in Cascading Hadoop mode.
By default, Hadoop and Cascading use the native Object methods equals() and hashCode()
to compare two values and get a consistent hash code for a given value, respectively.
To override this default behavior, you can create a custom java.util.Comparator class to perform
comparisons on a given field in a Tuple. For instance, to secondary-sort a collection of custom
Person objects in a GroupBy, use the Fields.setComparator() method to designate the
custom Comparator to the Fields instance that specifies the sort fields.
Alternatively, you can set a default Comparator to be used by a Flow, or used locally on a
given Pipe instance. There are two ways to do this. Call
FlowProps.setDefaultTupleElementComparator() on a Properties instance, or use the property
key cascading.flow.tuple.element.comparator.
If the hash code must also be customized, the custom Comparator can implement the interface
cascading.tuple.Hasher
// The superclass is an assumption here; the original snippet only showed the fragments below.
public class CustomTextComparator extends WritableComparator {
    private final Collator collator;
    public CustomTextComparator() {
        super(Text.class);
        final Locale locale = new Locale("pl");
        collator = Collator.getInstance(locale);
    }
    public int compare(WritableComparable a, WritableComparable b) {
        synchronized (collator) {
            return collator.compare(a.toString(), b.toString());
        }
    }
}
Unit 5:
Pig: Hadoop Programming Made Easier Admiring the Pig Architecture, Going with
the Pig Latin Application Flow, Working through the ABCs of Pig Latin,
Evaluating Local and Distributed Modes of Running Pig Scripts, Checking out
the Pig Script Interfaces, Scripting with Pig Latin
Pig:
Hadoop Programming Made Easier Admiring the Pig Architecture
Apache Pig uses multi-query approach, thereby reducing the length of codes.
For example, an operation that would require you to type 200 lines of code
(LoC) in Java can be easily done by typing as less as just 10 LoC in Apache Pig.
Ultimately Apache Pig reduces the development time by almost 16 times.
Pig Latin is a SQL-like language, and it is easy to learn Apache Pig when you are
familiar with SQL.
Features of Pig
Apache Pig comes with the following features −
Extensibility − Using the existing operators, users can develop their own
functions to read, process, and write data.
Handles all kinds of data − Apache Pig analyzes all kinds of data, both
structured as well as unstructured. It stores the results in HDFS.
Pig vs SQL
Pig − The data model in Apache Pig is nested relational.
SQL − The data model used in SQL is flat relational.
The language itself: As proof that programmers have a sense of humor, the programming
language for Pig is known as Pig Latin, a high-level language that allows you to write data
processing and analysis programs.
The Pig Latin compiler: The Pig Latin compiler converts the Pig Latin code into executable code.
The executable code is either in the form of MapReduce jobs or it can spawn a process where a
virtual Hadoop instance is created to run the Pig node on a single node.
The sequence of MapReduce programs enables Pig programs to do data processing and
analysis in parallel, leveraging Hadoop MapReduce and HDFS. Running the Pig job in the
virtual Hadoop instance is a useful strategy for testing your Pig scripts.
How Pig relates to the Hadoop ecosystem
Pig programs can run on MapReduce v1 or MapReduce v2 without any code changes, regardless of what
mode your cluster is running. However, Pig scripts can also run using the Tez API instead. Apache Tez provides a
more efficient execution framework than MapReduce. YARN enables application frameworks other than
MapReduce (like Tez) to run on Hadoop. Hive can also run against the Tez framework.
Parser
Initially the Pig scripts are handled by the Parser. It checks the syntax of the
script, does type checking, and other miscellaneous checks. The output of the
parser will be a DAG (directed acyclic graph), which represents the Pig Latin
statements and logical operators.
In the DAG, the logical operators of the script are represented and the data
flows are represented as edges.
Optimizer
The logical plan (DAG) is then passed to a logical optimizer, which carries out logical optimizations.
Compiler
The compiler compiles the optimized logical plan into a series of MapReduce jobs.
Execution engine
Finally, the MapReduce jobs are submitted to Hadoop in sorted order, and they are executed
on Hadoop, producing the desired results.
Going with the Pig Latin Application Flow
At its core, Pig Latin is a dataflow language, where we define a data stream and a series of
transformations that are applied to the data as it flows through your application.
This is in contrast to a control flow language (like C or Java), where we write a series of
instructions. In control flow languages, we use constructs like loops and conditional logic (like
an if statement). You won’t find loops and if statements in Pig Latin.
Looking at each step in turn, you can see the basic flow of a Pig program.
1. Load: We first load (LOAD) the data we want to manipulate. As in a typical MapReduce job,
that data is stored in HDFS. For a Pig program to access the data, you first tell Pig what file or
files to use. For that task, you use the LOAD 'data_file' command.
Here, 'data_file' can specify either an HDFS file or a directory. If a directory is specified, all files in
that directory are loaded into the program.
If the data is stored in a file format that isn't natively accessible to Pig, you can optionally add
the USING function to the LOAD statement to specify a user-defined function that can read in
(and interpret) the data.
2. Transform: You run the data through a set of transformations that, way under the hood
and far removed from anything you have to concern yourself with, are translated into a set of
Map and Reduce tasks.
The transformation logic is where all the data manipulation happens. Here, you can FILTER
out rows that aren’t of interest, JOIN two sets of data files, GROUP data to build aggregations,
ORDER results, and do much, much more.
3. Dump: Finally, you dump (DUMP) the results to the screen or Store (STORE) the results in a
file somewhere.
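As an illustration, a minimal Pig Latin sketch of this Load → Transform → Dump flow (the file name, field names and filter condition are hypothetical):
-- load raw records from a file in HDFS
records = LOAD 'weather.txt' AS (year:chararray, temperature:int);
-- transform: drop bad readings, then find the maximum temperature per year
good    = FILTER records BY temperature != 9999;
grouped = GROUP good BY year;
maxima  = FOREACH grouped GENERATE group, MAX(good.temperature);
-- dump the result to the screen (or STORE maxima INTO 'output';)
DUMP maxima;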
Pig Latin operators
Operators for debugging and troubleshooting
Local Mode
In this mode, all the files are installed and run from your local host and
local file system. There is no need of Hadoop or HDFS. This mode is
generally used for testing purpose.
MapReduce Mode
MapReduce mode is where we load or process the data that exists in the
Hadoop File System (HDFS) using Apache Pig. In this mode, whenever we execute
the Pig Latin statements to process the data, a MapReduce job is invoked in the
back-end to perform a particular operation on the data that exists in the HDFS.
Interactive Mode (Grunt shell) − You can run Apache Pig in interactive mode using
the Grunt shell. In this shell, you can enter the Pig Latin statements and get the
output (using the Dump operator).
Batch Mode (Script) − You can run Apache Pig in Batch mode by writing the Pig
Latin script in a single file with .pig extension.
Embedded Mode (UDF) − Apache Pig provides the provision of defining our own
functions (User Defined Functions) in programming languages such as Java, and
using them in our script.
Important Questions
HIVE Introduction
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It
resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
The term ‘Big Data’ is used for collections of large datasets that include huge
volume, high velocity, and a variety of data that is increasing day by day. Using traditional
data management systems, it is difficult to process Big Data. Therefore, the Apache
Software Foundation introduced a framework called Hadoop to solve Big Data management
and processing challenges.
Hadoop
Hadoop is an open-source framework to store and process Big Data in a distributed
environment. It contains two modules, one is MapReduce and another is Hadoop
Distributed File System (HDFS).
MapReduce: It is a parallel programming model for processing large amounts of structured, semi-
structured, and unstructured data on large clusters of commodity hardware.
HDFS:Hadoop Distributed File System is a part of Hadoop framework, used to store and
process the datasets. It provides a fault-tolerant file system to run on commodity hardware.
The Hadoop ecosystem contains different sub-projects (tools) such as Sqoop, Pig,
and Hive that are used to help Hadoop modules.
Sqoop: It is used to import and export data to and from between HDFS and RDBMS.
Initially Hive was developed by Facebook, later the Apache Software Foundation took it
up and developed it further as an open source under the name Apache Hive. It is used by
different companies. For example, Amazon uses it in Amazon Elastic MapReduce.
Hive is not
A relational database
A design for OnLine Transaction Processing (OLTP)
A language for real-time queries and row-level updates
Features of Hive
Architecture of Hive
The following component diagram depicts the architecture of Hive:
Working of Hive
The following diagram depicts the workflow between Hive and Hadoop.
The following table defines how Hive interacts with Hadoop framework:
Step No. - Operation
Execute Query
1 The Hive interface such as Command Line or Web UI sends query to Driver (any
database driver such as JDBC, ODBC, etc.) to execute.
Get Plan
2 The driver takes the help of the query compiler, which parses the query to check the syntax and
work out the query plan (the requirements of the query).
Get Metadata
3 The compiler sends metadata request to Metastore (any database).
Send Metadata
4
Metastore sends metadata as a response to the compiler.
Send Plan
5 The compiler checks the requirement and resends the plan to the driver. Up to
here, the parsing and compiling of a query is complete.
Execute Plan
6 The driver sends the execute plan to the execution engine.
Execute Job
Internally, the process of execution job is a MapReduce job. The execution engine sends the job
7 to JobTracker, which is in Name node and it assigns this job to TaskTracker, which is in Data
node. Here, the query executes MapReduce job.
Metadata Ops
Meanwhile in execution, the execution engine can execute metadata operations with
7.1 Metastore.
8 Fetch Result
The execution engine receives the results from Data nodes.
9 Send Results
The execution engine sends those resultant values to the driver.
10 Send Results
The driver sends the results to Hive Interfaces.
Hive - Data Types
All the data types in Hive are classified into four types, given as follows:
Column Types
Literals
Null Values
Complex Types
Column Types
Column types are used as the column data types of Hive. They are as follows:
Integral Types
Integer type data can be specified using integral data types, INT. When the data range
exceeds the range of INT, you need to use BIGINT and if the data range is smaller
than the INT, you use SMALLINT. TINYINT is smaller than SMALLINT.
Type       Postfix    Example
TINYINT    Y          10Y
SMALLINT   S          10S
INT        -          10
BIGINT     L          10L
String Types
String type data can be specified using single quotes (' ') or double quotes (" "). Hive
contains two string data types: VARCHAR and CHAR. Hive follows C-style escape characters.
Data Type    Length
VARCHAR      1 to 65535
CHAR         255
Timestamp
Hive supports the traditional UNIX timestamp with optional nanosecond precision, in the
java.sql.Timestamp format “yyyy-mm-dd hh:mm:ss.fffffffff”.
Dates
DATE values are described in year/month/day format in the form {{YYYY-MM-DD}}.
Decimals
The DECIMAL type in Hive is the same as the BigDecimal format of Java. It is used for
representing immutable arbitrary-precision decimal values. The syntax and example are as follows:
DECIMAL(precision, scale)
decimal(10,0)
Union Types
Union is a collection of heterogeneous data types. You can create
an instance using the create_union UDF.
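For reference, the declaration uses the UNIONTYPE keyword; the member types below are only an illustration:
UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>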
Literals
The following literals are used in Hive: floating point literals (numbers with a decimal point, such as 10.5) and decimal literals (arbitrary-precision fixed-point values handled by the DECIMAL type described above).
Complex Types
The Hive complex data types are as follows:
Arrays
Arrays in Hive are used the same way they are used in Java.
Syntax: ARRAY<data_type>
Maps
Maps in Hive are similar to Java Maps.
Structs
Structs in Hive are similar to using complex data with comments.
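For reference, the corresponding declaration forms, in the same style as the ARRAY syntax above, are:
Syntax: MAP<primitive_type, data_type>
Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>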
Hive - Create Database
A database in Hive is created with the CREATE DATABASE statement. Here, IF NOT EXISTS is
an optional clause that suppresses the error if a database with the same name already exists.
We can use SCHEMA in place of DATABASE in this command. The following query is
executed to create a database named userdb:
hive> CREATE DATABASE [IF NOT EXISTS] userdb;
or
hive> CREATE SCHEMA userdb;
The following queries are used to drop a database. Let us assume that the database name
is userdb.
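For example, using IF EXISTS so that no error is raised if the database does not exist:
hive> DROP DATABASE IF EXISTS userdb;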
The following query drops the database using CASCADE. It means dropping respective
tables before dropping the database.
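For example, dropping the database together with its tables:
hive> DROP DATABASE IF EXISTS userdb CASCADE;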
Hive - Create Table
Syntax
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[ROW FORMAT row_format]
[STORED AS file_format]
Example: Let us assume you need to create a table named employee with the following fields:
Sr.No    Field Name      Data Type
1        Eid             int
2        Name            String
3        Salary          Float
4        Designation     String
The following query creates a table named employee using the above data.
hive> CREATE TABLE IF NOT EXISTS employee ( eid int, name String, salary Float, designation String)
COMMENT 'Employee details'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
If you add the option IF NOT EXISTS, Hive ignores the statement in case the table
already exists.
OK
Time taken: 5.905 seconds
hive>
While inserting data into Hive, it is better to use LOAD DATA to store bulk records.
There are two ways to load data: one is from the local file system and the other is from the
Hadoop file system.
Syntax
The syntax for load data is as follows:
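In its standard HiveQL form (the PARTITION clause is optional):
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename
[PARTITION (partcol1=val1, partcol2=val2, ...)];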
The following query loads the given text into the table.
hive> LOAD DATA LOCAL INPATH
'/home/user/sample.txt' OVERWRITE INTO TABLE employee;
OK
Time taken: 15.905 seconds
hive>
Hive - Alter Table
Syntax
The ALTER TABLE statement takes any of the following syntaxes, depending on which
attributes of the table we wish to modify.
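The commonly used forms, in standard HiveQL (col_spec stands for a column name followed by its data type), are:
ALTER TABLE name RENAME TO new_name
ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...])
ALTER TABLE name CHANGE column_name new_name new_type
ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])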
The following queries rename the column name and column data type using the above
data:
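For instance, with the employee table created earlier, queries along these lines rename the name column to ename and change the data type of the salary column to Double:
hive> ALTER TABLE employee CHANGE name ename String;
hive> ALTER TABLE employee CHANGE salary salary Double;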
Replace Statement
The following query deletes all the columns from the employee table and replaces it with
emp and name columns:
hive> ALTER TABLE employee REPLACE COLUMNS (
empid INT,
name STRING);
Operators in HIVE:
Relational Operators
Arithmetic Operators
Logical Operators
Complex Operators
Relational Operators: These operators are used to compare two operands. The following
table describes the relational operators available in Hive:
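In brief, the main relational operators are A = B, A != B, A < B, A <= B, A > B, and A >= B for all primitive types (each returning TRUE or FALSE, and NULL when either operand is NULL), A IS NULL and A IS NOT NULL for all types, and A LIKE B and A RLIKE B (Java regular expression match) for strings.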
Example
Let us assume the employee table is composed of fields named Id, Name, Salary,
Designation, and Dept as shown below. Generate a query to retrieve the employee details
whose Id is 1205.
+------+-------------+--------+-------------------+-------+
| Id   | Name        | Salary | Designation       | Dept  |
+------+-------------+--------+-------------------+-------+
| 1201 | Gopal       | 45000  | Technical manager | TP    |
| 1202 | Manisha     | 45000  | Proofreader       | PR    |
| 1203 | Masthanvali | 40000  | Technical writer  | TP    |
| 1204 | Krian       | 40000  | Hr Admin          | HR    |
| 1205 | Kranthi     | 30000  | Op Admin          | Admin |
+------+-------------+--------+-------------------+-------+
The following query is executed to retrieve the employee details using the above table:
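For instance, a query of the following form returns the record with Id 1205:
hive> SELECT * FROM employee WHERE Id=1205;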
+------+---------+--------+-------------+-------+
| ID   | Name    | Salary | Designation | Dept  |
+------+---------+--------+-------------+-------+
| 1205 | Kranthi | 30000  | Op Admin    | Admin |
+------+---------+--------+-------------+-------+
The following query is executed to retrieve the employee details whose salary is
more than or equal to Rs 40000.
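A query of the following form does this:
hive> SELECT * FROM employee WHERE Salary>=40000;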
Arithmetic Operators
These operators support various common arithmetic operations on the operands. All of
them return number types. The following table describes the arithmetic operators
available in Hive:
Operator   Operand types      Description
A + B      all number types   Gives the result of adding A and B.
A - B      all number types   Gives the result of subtracting B from A.
A * B      all number types   Gives the result of multiplying A and B.
A / B      all number types   Gives the result of dividing A by B.
A % B      all number types   Gives the remainder resulting from dividing A by B.
A & B      all number types   Gives the result of bitwise AND of A and B.
A | B      all number types   Gives the result of bitwise OR of A and B.
A ^ B      all number types   Gives the result of bitwise XOR of A and B.
~A         all number types   Gives the result of bitwise NOT of A.
Logical Operators
These operators are logical expressions. All of them return either TRUE or FALSE.
Operator   Operand types   Description
A AND B    boolean         TRUE if both A and B are TRUE, otherwise FALSE.
A && B     boolean         Same as A AND B.
A OR B     boolean         TRUE if either A or B or both are TRUE, otherwise FALSE.
A || B     boolean         Same as A OR B.
NOT A      boolean         TRUE if A is FALSE, otherwise FALSE.
!A         boolean         Same as NOT A.
Example
The following query is used to retrieve employee details whose Department is TP and
Salary is more than Rs 40000.
hive> SELECT * FROM employee WHERE Salary>40000 AND Dept='TP';
Complex Operators
These operators provide an expression to access the elements of Complex Types.
Operator   Operands                              Description
A[n]       A is an Array and n is an int         Returns the nth element in the array A. The first element has index 0.
M[key]     M is a Map<K, V> and key has type K   Returns the value corresponding to the key in the map.
HiveQL - Select-Where
The Hive Query Language (HiveQL) is a query language for Hive to process
and analyze structured data in a Metastore. This chapter explains how to use the
SELECT statement with WHERE clause.
SELECT statement is used to retrieve the data from a table. WHERE clause
works similar to a condition. It filters the data using the condition and gives you a
finite result. The built-in operators and functions generate an expression, which fulfils
the condition.
Syntax
Given below is the syntax of the SELECT query:
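In its standard HiveQL form, mirroring the ORDER BY syntax given later in this unit:
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]
[LIMIT number];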
Example
Let us take an example for SELECT…WHERE clause. Assume we have the employee table
as given below, with fields named Id, Name, Salary, Designation, and Dept. Generate
a query to retrieve the employee details who earn a salary of more than Rs 30000.
+------+-------------+--------+-------------------+-------+
| ID   | Name        | Salary | Designation       | Dept  |
+------+-------------+--------+-------------------+-------+
| 1201 | Gopal       | 45000  | Technical manager | TP    |
| 1202 | Manisha     | 45000  | Proofreader       | PR    |
| 1203 | Masthanvali | 40000  | Technical writer  | TP    |
| 1204 | Krian       | 40000  | Hr Admin          | HR    |
| 1205 | Kranthi     | 30000  | Op Admin          | Admin |
+------+-------------+--------+-------------------+-------+
The following query retrieves the employee details using the above scenario:
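For instance, a query of the following form does this:
hive> SELECT * FROM employee WHERE Salary>30000;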
On successful execution of the query, you get to see the following response:
+------+--------------+-------------+-------------------+--------+
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 40000 | Hr Admin | HR |
+------+--------------+-------------+-------------------+--------+
HiveQL - Select-Order By
This chapter explains how to use the ORDER BY clause in a SELECT statement.
The ORDER BY clause is used to retrieve the details based on one column and sort
the result set by ascending or descending order.
Syntax
Given below is the syntax of the ORDER BY clause:
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]
[LIMIT number];
Example
Let us take an example for the SELECT...ORDER BY clause. Assume the employee
table as given below, with the fields named Id, Name, Salary, Designation, and Dept.
Generate a query to retrieve the employee details ordered by department name.
+------+--------------+-------------+-------------------+--------+
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 40000 | Hr Admin | HR |
|1205 | Kranthi | 30000 | Op Admin | Admin |
+------+--------------+-------------+-------------------+--------+
The following query retrieves the employee details using the above
scenario:
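For instance, a query of the following form does this:
hive> SELECT * FROM employee ORDER BY Dept;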
HiveQL - Select-Group By
This chapter explains the details of GROUP BY clause in a SELECT statement.
The GROUP BY clause is used to group all the records in a result set using a
particular collection column. It is used to query a group of records.
The syntax of the GROUP BY clause is as follows:
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]
[LIMIT number];
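For example, a query of the following form counts the employees in each department using the built-in count aggregate:
hive> SELECT Dept, count(*) FROM employee GROUP BY Dept;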
HiveQL - Select-Joins
JOIN is a clause that is used for combining specific fields from two tables by using values
common to each one. It is used to combine records from two or more tables in the
database. It is more or less similar to SQL JOIN.
Syntax
join_table:
   table_reference JOIN table_factor [join_condition]
 | table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
 | table_reference LEFT SEMI JOIN table_reference join_condition
 | table_reference CROSS JOIN table_reference [join_condition]
Example
We will use the following two tables in this chapter. Consider the following table
named CUSTOMERS:
+----+----------+-----+-----------+----------+
| ID | NAME | AGE | ADDRESS | SALARY |
+----+----------+-----+-----------+----------+
| 1 | Ramesh | 32 | Ahmedabad | 2000.00 |
| 2 | Khilan | 25 | Delhi | 1500.00 |
| 3 | kaushik | 23 | Kota | 2000.00 |
| 4 | Chaitali | 25 | Mumbai | 6500.00 |
| 5 | Hardik | 27 | Bhopal | 8500.00 |
| 6 | Komal | 22 | MP | 4500.00 |
| 7 | Muffy | 24 | Indore | 10000.00 |
+----+----------+-----+-----------+----------+
And the following table named ORDERS:
+-----+---------------------+-------------+--------+
| OID | DATE                | CUSTOMER_ID | AMOUNT |
+-----+---------------------+-------------+--------+
| 102 | 2009-10-08 00:00:00 | 3           | 3000   |
| 100 | 2009-10-08 00:00:00 | 3           | 1500   |
+-----+---------------------+-------------+--------+
JOIN
The JOIN clause is used to combine and retrieve records from multiple tables. JOIN is the same
as INNER JOIN in SQL. A JOIN condition is formed using the primary keys and foreign
keys of the tables.
The following query executes JOIN on the CUSTOMER and ORDER
tables, and retrieves the records:
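For instance, an inner join on the customer id, with the table and column names shown above:
hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT
FROM CUSTOMERS c JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);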
The following query demonstrates LEFT OUTER JOIN between CUSTOMER and
ORDER tables:
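For instance (in recent Hive versions the DATE column name may need to be written in backticks, since it is a reserved word):
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c LEFT OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);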
RIGHT OUTER JOIN: A RIGHT OUTER JOIN returns all the rows from the right table, plus the
matched values from the left table, or NULL in case of no matching join predicate.
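A query of the following form produces the result shown below:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c RIGHT OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);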
+------+----------+--------+---------------------+
| ID | NAME | AMOUNT | DATE |
+------+----------+--------+---------------------+
| 3 | kaushik | 3000 | 2009-10-08 |
| 3 | kaushik | 1500 | 2009-10-08 |
| 2 | Khilan | 1560 | 2009-11-20 |
| 4 | Chaitali | 2060 | 2008-05-20 |
+------+----------+--------+---------------------+
IMP Questions