UNIT II
Working with Big Data
Google File System (GFS):
A GFS cluster consists of a single master and multiple chunkservers and is accessed by multiple
clients. Each of these is typically a commodity Linux machine running a user-level server process.
It is easy to run both a chunkserver and a client on the same machine, as long as machine
resources permit and the lower reliability caused by running possibly flaky application code is
acceptable.
Files are divided into fixed-size chunks. Each chunk is identified by an immutable and globally
unique 64-bit chunk handle assigned by the master at the time of chunk creation.
Chunkservers store chunks on local disks as Linux files and read or write chunk data specified by a
chunk handle and byte range. For reliability, each chunk is replicated on multiple chunkservers.
By default, we store three replicas, though users can designate different replication levels for
different regions of the file namespace. The master maintains all file system metadata. This
includes the namespace, access control information, the mapping from files to chunks, and the
current locations of chunks.
It also controls system-wide activities such as chunk lease management, garbage collection
of orphaned chunks, and chunk migration between chunkservers. The master periodically
communicates with each chunkserver in HeartBeat messages to give it instructions and collect its
state.
GFS client code linked into each application implements the file system API and communicates
with the master and chunkservers to read or write data on behalf of the application. Clients interact
with the master for metadata operations, but all data-bearing communication goes directly to the
chunkservers. We do not provide the POSIX API and therefore need not hook into the Linux
vnode layer.
Neither the client nor the chunkserver caches file data. Client caches offer little benefit because
most applications stream through huge files or have working sets too large to be cached. Not
having them simplifies the client and the overall system by eliminating cache coherence
issues. (Clients do cache metadata, however.) Chunkservers need not cache file data because
chunks are stored as local files and so Linux’s buffer cache already keeps frequently accessed data
in memory.
Single Master:
Having a single master vastly simplifies our design and enables the master to make sophisticated
chunk placement and replication decisions using global knowledge. However, we must
minimize its involvement in reads and writes so that it does not become a bottleneck. Clients
never read and write file data through the master. Instead, a client asks the master which
chunkservers it should contact, caches this information for a limited time, and interacts with the
chunkservers directly for many subsequent operations.
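The read path just described can be sketched in a few lines of Java. The interfaces below are hypothetical placeholders invented for illustration (GFS exposes no public API); they only show the control flow: metadata comes from the single master, while data comes directly from a chunkserver.

// Hypothetical sketch of a GFS-style read path (illustration only, not Google's code).
interface GfsMaster {
    // returns the 64-bit chunk handle and current replica locations for (file, chunk index)
    ChunkInfo findChunk(String fileName, long chunkIndex);
}
interface GfsChunkServer {
    byte[] read(long chunkHandle, long offsetInChunk, int length);
}
class ChunkInfo {
    long chunkHandle;
    GfsChunkServer[] replicas;
}
class GfsClientSketch {
    static final long CHUNK_SIZE = 64L * 1024 * 1024;          // 64 MB chunks

    static byte[] read(GfsMaster master, String file, long offset, int length) {
        long chunkIndex = offset / CHUNK_SIZE;                  // which chunk holds this offset
        ChunkInfo info = master.findChunk(file, chunkIndex);    // metadata request goes to the master
        GfsChunkServer replica = info.replicas[0];              // data request bypasses the master
        return replica.read(info.chunkHandle, offset % CHUNK_SIZE, length);
    }
}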
Chunk Size:
Chunk size is one of the key design parameters. We have chosen 64 MB, which is much larger
than typical file system block sizes. Each chunk replica is stored as a plain Linux file on a
chunkserver and is extended only as needed.
Lazy space allocation avoids wasting space due to internal fragmentation, perhaps the greatest
objection against such a large chunk size.
A large chunk size offers several important advantages. First, it reduces clients' need to interact
with the master, because reads and writes on the same chunk require only one initial request to
the master for chunk location information.
Second, since a client is more likely to perform many operations on a given large chunk, it can
reduce network overhead by keeping a persistent TCP connection to the chunkserver over an
extended period of time.
Third, it reduces the size of the metadata stored on the master. This allows us to keep the
metadata in memory, which in turn brings other advantages.
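To get a feel for this, consider a rough calculation (assuming roughly 64 bytes of metadata per chunk, the approximate figure reported for GFS):
1 TB of file data / 64 MB per chunk = 16,384 chunks
16,384 chunks × ~64 bytes ≈ 1 MB of metadata in the master's memory
With a conventional 4 KB block size, the same 1 TB would require 268,435,456 block entries, far too many to track comfortably in memory.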
On the other hand, a large chunk size, even with lazy space allocation, has its disadvantages. A
small file consists of a small number of chunks, perhaps just one. The chunkservers storing those
chunks may become hot spots if many clients are accessing the same file. In practice, hot spots
have not been a major issue because our applications mostly read large multi-chunk files
sequentially.
However, hot spots did develop when GFS was first used by a batch-queue system: an executable
was written to GFS as a single-chunk file and then started on hundreds of machines at the same
time.
Metadata
The master stores three major types of metadata: the file and chunk namespaces, the mapping from
files to chunks, and the locations of each chunk’s replicas.
All metadata is kept in the master’s memory. The first two types (namespaces and file-to-chunk
mapping) are also kept persistent by logging mutations to an operation log stored on the master’s
local disk and replicated on remote machines. Using a log allows us to update the master state
simply, reliably, and without risking inconsistencies in the event of a master crash. The master
does not store chunk location information persistently. Instead, it asks each chunkserver about its
chunks at master startup and whenever a chunkserver joins the cluster.
This periodic scanning is used to implement chunk garbage collection, re-replication in the
presence of chunkserver failures, and chunk migration to balance load and disk space usage across
chunkservers.
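The three kinds of metadata can be pictured as in-memory maps. The Java below is a hypothetical sketch for illustration, not GFS code; note that only the namespace and file-to-chunk mapping would be logged persistently, while replica locations are rebuilt from chunkserver reports.

import java.util.*;

// Hypothetical sketch of the master's in-memory metadata (illustration only).
class MasterMetadataSketch {
    // namespace and file -> ordered chunk handles (persisted via the operation log)
    Map<String, List<Long>> fileToChunkHandles = new HashMap<String, List<Long>>();
    // chunk handle -> replica locations (not persisted; rebuilt from chunkserver reports)
    Map<Long, Set<String>> chunkToChunkservers = new HashMap<Long, Set<String>>();

    // Called when a chunkserver reports its chunks at startup or in a HeartBeat message.
    void onChunkReport(String chunkserver, List<Long> reportedHandles) {
        for (Long handle : reportedHandles) {
            Set<String> locations = chunkToChunkservers.get(handle);
            if (locations == null) {
                locations = new HashSet<String>();
                chunkToChunkservers.put(handle, locations);
            }
            locations.add(chunkserver);
        }
    }
}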
Disadvantages
Even with lazy space allocation (which avoids wasting space due to internal fragmentation), a
small file consists of a small number of chunks, perhaps just one. The chunkservers storing those
chunks may become hot spots if many clients are accessing the same file. In practice, hot spots
have not been a major issue because the applications mostly read large multi-chunk files
sequentially. To mitigate hot spots, such files can be stored with a higher replication factor and
clients can be allowed to read data from other clients.
MapReduce
MapReduce is the programming model used to process the data stored as files in HDFS; in
Hadoop it is implemented in Java. Like HDFS, MapReduce was built in line with another paper
published by Google.
Google, apart from publishing the papers, did not release its implementations of GFS and MapReduce.
However, the open source community built Hadoop (HDFS and MapReduce) based on those papers.
The initial adoption of Hadoop was at Yahoo Inc., where it gained good momentum and went
on to become a part of their production systems. After Yahoo, many organizations such as LinkedIn,
Facebook, Netflix and many more have successfully implemented Hadoop within their
organizations.
Hadoop uses HDFS to store files efficiently in the cluster. When a file is placed in HDFS it
is broken down into blocks, 64 MB block size by default. These blocks are then replicated
across the different nodes (DataNodes) in the cluster. The default replication value is 3, i.e.
there will be 3 copies of the same block in the cluster. We will see later on why we maintain
replicas of the blocks in the cluster.
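As a quick illustration of block splitting and replication (the file size here is chosen arbitrarily):
a 200 MB file placed in HDFS with the default 64 MB block size is split into 4 blocks
(64 MB + 64 MB + 64 MB + 8 MB). With the default replication factor of 3, the cluster ends up
storing 4 × 3 = 12 block replicas, spread across different DataNodes.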
A Hadoop cluster can consist of a single node (single node cluster) or thousands of
nodes.
Once you have installed Hadoop you can try out the following few basic commands to work
with HDFS:
hadoop fs -ls
hadoop fs -put <path_of_local> <path_in_hdfs>
hadoop fs -get <path_in_hdfs> <path_of_local>
hadoop fs -cat <path_of_file_in_hdfs>
hadoop fs -rmr <path_in_hdfs>
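The same operations are also available programmatically through Hadoop's Java FileSystem API. The sketch below mirrors the shell commands above; the paths used are placeholders chosen for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Rough Java equivalents of the HDFS shell commands above (paths are illustrative).
public class HdfsOperations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml etc. from the classpath
        FileSystem fs = FileSystem.get(conf);

        // hadoop fs -ls
        for (FileStatus status : fs.listStatus(new Path("/user/hadoop"))) {
            System.out.println(status.getPath());
        }
        // hadoop fs -put <path_of_local> <path_in_hdfs>
        fs.copyFromLocalFile(new Path("/tmp/local.txt"), new Path("/user/hadoop/input/local.txt"));
        // hadoop fs -get <path_in_hdfs> <path_of_local>
        fs.copyToLocalFile(new Path("/user/hadoop/input/local.txt"), new Path("/tmp/copy.txt"));
        // hadoop fs -cat corresponds to reading the stream returned by fs.open(path)
        // hadoop fs -rmr <path_in_hdfs> (recursive delete)
        fs.delete(new Path("/user/hadoop/old_data"), true);

        fs.close();
    }
}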
JobTracker
JobTracker is responsible for taking in requests from a client and assigning TaskTrackers
with tasks to be performed. The JobTracker tries to assign tasks to the
TaskTracker on the DataNode where the data is locally present (data locality). If that is
not possible, it will at least try to assign tasks to TaskTrackers within the same rack. If for
some reason the node fails, the JobTracker assigns the task to another TaskTracker where a
replica of the data exists, since the data blocks are replicated across the DataNodes. This
ensures that the job does not fail even if a node fails within the cluster.
TaskTracker
TaskTracker is a daemon that accepts tasks (Map, Reduce and Shuffle) from the
JobTracker. The TaskTracker keeps sending a heartbeat message to the JobTracker to notify it
that it is alive. Along with the heartbeat it also reports the free slots available
within it to process tasks. The TaskTracker starts and monitors the Map and Reduce tasks
and sends progress/status information back to the JobTracker.
All the above daemons run within their own JVMs. A typical (simplified) flow in Hadoop
is as follows:
A client (usually a MapReduce program) submits a job to the JobTracker.
The JobTracker gets information from the NameNode on the location of the data
within the DataNodes. The JobTracker places the client program (usually a jar file
along with the configuration file) in HDFS. Once placed, the JobTracker
tries to assign tasks to TaskTrackers on the DataNodes based on data locality.
The TaskTracker takes care of starting the Map tasks on the DataNodes by picking up
the client program from the shared location on HDFS.
The progress of the operation is relayed back to the JobTracker by the TaskTracker.
The Reduce tasks work on all data received from the Map tasks and write the final
output to HDFS.
After the tasks complete, the intermediate data generated by the TaskTracker is deleted.
A very important feature of Hadoop to note here is that the program goes to where the data
is and not the other way around, thus resulting in efficient processing of data. A standard
word-count job is sketched below to make this flow concrete.
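The following is essentially the classic word-count example that ships with Hadoop (MapReduce Java API, Hadoop 1.x style). The Map tasks run close to the input blocks and emit (word, 1) pairs; the Reduce tasks sum the counts and write the result back to HDFS.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);          // emit (word, 1) for every token
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();                   // add up all counts for this word
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");    // Hadoop 1.x style job setup
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist yet)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}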
Hadoop Installation
Hadoop can be run in three modes: standalone (local) mode, pseudo distributed mode and fully
distributed (cluster) mode. For learning and initial experimentation, a single node installation is
sufficient. The 'pseudo distributed' mode is used to simulate the multi node environment on a
single server.
In this document we will discuss how to install Hadoop on Ubuntu Linux. In any mode, the system
should have Java version 1.6.x installed on it.
Standalone Mode Installation
Now, let us check the standalone mode installation process by following the steps mentioned below.
Install Java
Java (JDK Version 1.6.x) either from Sun/Oracle or Open Java is required.
Step 1 - If you are not able to switch to OpenJDK and want to use the proprietary Sun JDK/JRE,
install sun-java6 from the Canonical Partner Repository by using the following command.
Note: The Canonical Partner Repository contains free-of-cost closed-source third-party
software. Canonical does not have access to the source code; it only packages and tests it.
Add the Canonical Partner Repository to the apt repositories using -
$ sudo add-apt-repository "deb http://archive.canonical.com/lucid partner"
Step 2 - Update the source list.
$ sudo apt-get update
Step 3 - Install JDK version 1.6.x from Sun/Oracle.
$ sudo apt-get install sun-java6-jdk
Step 4 - Once the JDK installation is over, make sure that it is correctly set up and that the version
is 1.6.x from Sun/Oracle.
user@ubuntu:~# java -version
java version "1.6.0_45"
Java(TM) SE Runtime Environment (build 1.6.0_45-b02)
Java HotSpot(TM) Client VM (build 16.4-b01, mixed mode, sharing)
The above output indicates that Java is set up correctly and the standalone installation is complete.
Now you can run the sample examples of your choice by calling –
$ bin/hadoop jar hadoop-*-examples.jar <NAME> <PARAMS>
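For instance, with the examples jar bundled with Hadoop (the argument values below are only illustrative):
$ bin/hadoop jar hadoop-*-examples.jar pi 10 100
$ bin/hadoop jar hadoop-*-examples.jar wordcount input_dir output_dir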
Pseudo Distributed Mode Installation
Here, the first step required is to configure SSH in order to access and manage the different
nodes. It is mandatory to have SSH access to the different nodes. Once SSH is configured,
enabled and accessible, we should start configuring Hadoop. The following configuration files
need to be modified:
conf/core-site.xml
conf/hdfs-site.xml
conf/mapred-site.xml
Open all the configuration files in the vi editor and update the configuration.
Configure core-site.xml file:
$ vi conf/core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/tmp/hadoop-${user.name}</value>
</property>
</configuration>
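The other two files get similar minimal entries for a single-node setup. The property names below are the standard Hadoop 1.x ones; the values shown (a single replica and a local JobTracker on port 9001) are the usual single-node choices, not requirements.
Configure hdfs-site.xml file:
$ vi conf/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
Configure mapred-site.xml file:
$ vi conf/mapred-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>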
Once these changes are done, we need to format the NameNode by using the following command.
The command prompt will show all the messages one after another and finally a success message.
$ bin/hadoop namenode -format
Our setup for the pseudo distributed node is done. Let's now start the single node cluster by using the
following command. It will again show some messages on the command prompt and start the
server processes.
$ bin/start-all.sh
Now we should check the status of Hadoop process by executing the jps command as shown below.
It will show all the running processes.
$ jps
14799 NameNode
14977 SecondaryNameNode
15183 DataNode
15596 JobTracker
15897 TaskTracker
Once the daemons are running, the NameNode web interface can be opened in a browser at:
http://localhost:50070/
Stopping the Single node Cluster: We can stop the single node cluster by using the following
command. The command prompt will display all the stopping processes.
$ bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode
Installation Items
Installation Requirements
S.No Requirement
Operating system – Linux
3 Installing Java
4 Installing Hadoop
5 Setting JAVA_HOME and HADOOP_HOME variables
6 Updating the .bash_profile file for hadoop
7 Creating required folders for the namenode and datanode
8 Configuring the .xml files
9 Setting the masters and slaves in all the machines
10 Formatting the namenode
11 Starting the DFS services and mapred services
12 Stopping all services
Fully Distributed Mode Installation
Before we start the distributed mode installation, we must ensure that we have the pseudo distributed
setup done and we have at least two machines, one acting as master and the other acting as a slave.
Now we run the following commands in sequence.
$ bin/stop-all.sh – make sure none of the Hadoop daemons are currently running
Binding IP addresses with the host names
Before starting the installation of Hadoop, you first need to bind the IP addresses of the
machines along with their host names in the /etc/hosts file.
First check the hostname of your machine by using the following command:
$ hostname
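For example, with one machine named master and one named slave, /etc/hosts on every node would contain entries such as the following (the IP addresses are placeholders; use the real addresses of your machines):
192.168.0.1    master
192.168.0.2    slave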
Now we open the two files – conf/masters and conf/slaves. The conf/masters file defines the host on
which the Secondary NameNode runs (see the 'Masters' section below), while the conf/slaves file lists
the hosts where the Hadoop slave daemons (DataNode and TaskTracker) will be running. Typical
contents for a two-node cluster are shown below.
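Assuming the host names from the /etc/hosts example above, and letting the master machine also run a DataNode and TaskTracker (as the jps output further down suggests), the two files on the master would contain:
conf/masters:
master
conf/slaves:
master
slave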
Edit the conf/core-site.xml file to have the following entries -
<property>
<name>fs.default.name</name>
<value>hdfs://master:54310</value>
</property>
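The other configuration files get matching multi-node entries on all machines. The values below are conventional choices (port 54311 for the JobTracker, a replication factor of 2 for a two-node cluster), not requirements.
Edit the conf/mapred-site.xml file to have -
<property>
<name>mapred.job.tracker</name>
<value>master:54311</value>
</property>
Edit the conf/hdfs-site.xml file to have -
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
The NameNode is then formatted once from the master, and the daemons are started there:
$ bin/hadoop namenode -format
$ bin/start-dfs.sh
$ bin/start-mapred.sh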
Once everything has started, check the status on the master by using the jps command. You should
get output similar to the following:
16017 Jps
14799 NameNode
15596 JobTracker
14977 SecondaryNameNode
15183 DataNode
15897 TaskTracker
16284 Jps
Hadoop Configuration Files
All these configuration files are available under the ‘conf’ directory of the Hadoop installation
directory.
hadoop-env.sh
This file specifies environment variables that affect the JDK used by Hadoop
Daemon (bin/hadoop).
As Hadoop framework is written in Java and uses Java Runtime environment, one of the
important environment variables for Hadoop daemon is $JAVA_HOME in hadoop-env.sh.
This variable directs Hadoop daemon to the Java path in the system.
This file is also used for setting other Hadoop daemon execution environment settings such as heap
size (HADOOP_HEAPSIZE), the Hadoop home directory (HADOOP_HOME), the log file location
(HADOOP_LOG_DIR), etc.
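A typical hadoop-env.sh therefore contains lines such as the following (the JDK path is only an example and depends on where Java is installed on your system):
export JAVA_HOME=/usr/lib/jvm/java-6-sun
export HADOOP_HEAPSIZE=1000          # heap size for each daemon, in MB
export HADOOP_LOG_DIR=${HADOOP_HOME}/logs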
Note: For the simplicity of understanding the cluster setup, we have configured only necessary
parameters to start a cluster.
The following three files are the important configuration files for the runtime environment
settings of a Hadoop cluster.
core-site.xml
This file informs Hadoop daemon where NameNode runs in the cluster. It contains the
configuration settings for Hadoop Core such as I/O settings that are common
to HDFS and MapReduce.
In the fs.default.name value (hdfs://hostname:port, as shown earlier), hostname and port are the
machine and port on which the NameNode daemon runs and listens. It also tells the NameNode
which IP and port it should bind to. The commonly used port is 8020, and you can also specify an IP
address rather than a hostname.
hdfs-site.xml
This file contains the configuration settings for the HDFS daemons: the NameNode, the Secondary
NameNode, and the DataNodes.
You can also configure hdfs-site.xml to specify default block replication and permission
checking on HDFS. The actual number of replications can also be specified when the file is
created. The default is used if replication is not specified at create time.
mapred-site.xml
This file contains the configuration settings for the MapReduce daemons: the JobTracker and the
TaskTrackers. The mapred.job.tracker parameter is a hostname (or IP address) and port pair on
which the JobTracker listens for RPC communication. This parameter specifies the location of the
JobTracker to the TaskTrackers and MapReduce clients.
You can replicate all of the four files explained above to all the DataNodes and the Secondary
NameNode. These files can then be configured for any node-specific configuration, e.g. in case of a
different JAVA_HOME on one of the DataNodes.
The following two files, ‘masters’ and ‘slaves’, determine the master and slave nodes in the Hadoop
cluster.
Masters
This file informs the Hadoop daemon about the location of the Secondary NameNode. The ‘masters’
file at the master server contains the hostname of the Secondary NameNode server.
Slaves
The ‘slaves’ file at the master node contains a list of hosts, one per line, that are to host the DataNode
and TaskTracker servers.
The ‘slaves’ file on a slave server contains the IP address of that slave node. Notice that the
‘slaves’ file at a slave node contains only its own IP address and not that of any other DataNodes in
the cluster.