Big Data Assignment PDF
What is Hadoop?
Hadoop is a framework that allows for the distributed processing of large datasets across clusters of computers using simple programming models. Its main characteristics are:
Economical: It is highly economical, as ordinary (commodity) computers can be used for data processing.
Reliable: It is reliable, as it stores copies of the data on different machines and is resistant to hardware failure.
Scalable: It is easily scalable, both horizontally and vertically; a few extra nodes help in scaling up the framework.
Flexible: It is flexible, and you can store as much structured and unstructured data as you need and decide to use it later.
Hadoop Ecosystem
Hadoop has an ecosystem that has evolved from its three core components: processing, resource management, and storage. The Hadoop ecosystem is continuously growing to meet the needs of Big Data. It comprises twelve components, including:
Impala
Hive
Cloudera Search
Oozie
Hue
YARN: Yet Another Resource Negotiator
MapReduce: Programming-based data processing
HDFS:
HDFS is the primary or major component of the Hadoop ecosystem and is responsible for storing large data sets of structured or unstructured data across various nodes, thereby maintaining the metadata in the form of log files.
HDFS consists of two core components:
1. NameNode
2. DataNode
NameNode/Master Node
Stores metadata for the files, like the directory structure of a typical file system.
The server holding the NameNode instance is quite crucial, as there is only one.
Maintains a transaction log for file deletes, adds, etc.; the log records only metadata changes, not whole blocks or file streams.
Handles creation of additional replica blocks when necessary after a DataNode failure.
DataNode/Slave Node
Stores the actual data blocks of HDFS files on the local file system of the node.
Serves read and write requests from clients and performs block creation, deletion, and replication upon instruction from the NameNode.
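As a brief illustration of this split between metadata and data, the sketch below (my own example, not part of the original assignment) lists file metadata through the Hadoop FileSystem Java API; this information is served by the NameNode, while the block contents themselves would be streamed from DataNodes. The path /user/demo is illustrative, and a reachable HDFS configuration on the classpath is assumed.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHdfsMetadata {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // loads core-site.xml / hdfs-site.xml if present
        FileSystem fs = FileSystem.get(conf);

        // Path, length, replication factor and block size are metadata kept by the NameNode;
        // no block data is transferred from the DataNodes for this call.
        for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
            System.out.printf("%s  %d bytes  replication=%d  blockSize=%d%n",
                    status.getPath(), status.getLen(),
                    status.getReplication(), status.getBlockSize());
        }
        fs.close();
    }
}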
YARN:
Yet Another Resource Negotiator: as the name implies, YARN helps to manage the resources across the clusters. In short, it performs scheduling and resource allocation for the Hadoop system.
It consists of three major components:
1. Resource Manager
2. Node Manager
3. Application Manager
The Resource Manager has the privilege of allocating resources for the applications in the system, whereas the Node Managers work on the allocation of resources such as CPU, memory, and bandwidth per machine and later report back to the Resource Manager. The Application Manager works as an interface between the Resource Manager and the Node Managers and performs negotiations as per the requirements of the two.
MapReduce:
By making use of distributed and parallel algorithms, MapReduce carries the processing logic over to the data and helps to write applications that transform big data sets into manageable ones.
MapReduce uses two functions, Map() and Reduce(), whose tasks are:
1. Map() performs sorting and filtering of the data, thereby organizing it into groups. Map generates key-value pair results which are later processed by the Reduce() method.
2. Reduce(), as the name suggests, performs summarization by aggregating the mapped data. In short, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.
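To make the two functions concrete, here is a minimal word-count sketch of Map() and Reduce() written against the Hadoop MapReduce Java API. The class names WordCountMapper and WordCountReducer are illustrative, not prescribed by the assignment.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map(): filters and organizes the input by emitting a (word, 1) key-value pair per word.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // key-value pair later processed by Reduce()
        }
    }
}

// Reduce(): aggregates the mapped data, combining the tuples into a smaller set of (word, total) tuples.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}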
Communication between NameNode and DataNode
All communication between the NameNode and a DataNode is initiated by the DataNode and responded to by the NameNode. The NameNode never initiates communication to the DataNode, although NameNode responses may include commands to the DataNode that cause it to send further communications.
The DataNode sends information to the NameNode through four major interfaces defined in the DataNode protocol:
1. DataNode registers. When a DataNode starts up, it registers itself with the NameNode so that the NameNode can track it and the storage it provides.
2. DataNode sends heartbeat. The DataNode sends a heartbeat message every few seconds. This includes some statistics about capacity and current activity. The NameNode returns a list of block-oriented commands for the DataNode to execute. These commands
primarily consist of instructions to transfer blocks to other DataNodes for replication
purposes, or instructions to delete blocks. The NameNode can also command an
immediate Block Report from the DataNode, but this is only done to recover from severe
problems.
3. DataNode sends block report. The DataNode periodically reports the blocks contained in its storage. The period is typically configured to one hour.
4. DataNode notifies Block Received. The DataNode reports that it has received a new block, either from a client (during a file write) or from another DataNode (during replication). It reports each block immediately upon receipt.
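The exact intervals are configurable. As a small illustration (the property names dfs.heartbeat.interval and dfs.blockreport.intervalMsec and the default values shown are assumptions based on common Hadoop defaults, not taken from this assignment), a client can read them from the Hadoop configuration:

import org.apache.hadoop.conf.Configuration;

public class HdfsIntervals {
    public static void main(String[] args) {
        Configuration conf = new Configuration();   // loads hdfs-site.xml if it is on the classpath
        // Heartbeat interval is given in seconds, block report interval in milliseconds.
        long heartbeatSecs = conf.getLong("dfs.heartbeat.interval", 3);
        long blockReportMs = conf.getLong("dfs.blockreport.intervalMsec", 3600000L);
        System.out.println("Heartbeat every " + heartbeatSecs
                + " s, block report every " + blockReportMs + " ms");
    }
}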
Read Operation in HDFS
1. The client initiates the read request by calling the 'open()' method of the file system object.
2. The file system object connects to the NameNode using RPC and asks for the locations of the blocks of the file.
3. The NameNode responds with the addresses of the DataNodes holding copies of those blocks.
4. Once the addresses of the DataNodes are received, an object of type FSDataInputStream is returned to the client. FSDataInputStream contains a DFSInputStream, which takes care of interactions with the DataNodes and the NameNode. The client then invokes the 'read()' method, which causes DFSInputStream to establish a connection with the first DataNode holding the first block of the file.
5. Data is read in the form of streams, with the client invoking the 'read()' method repeatedly. This read() process continues until it reaches the end of the block.
6. Once the end of a block is reached, DFSInputStream closes the connection and moves on to locate the next DataNode for the next block.
7. Once the client is done with the reading, it calls the close() method.
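A minimal sketch of this read path from the client side, using the Hadoop FileSystem Java API (the class name HdfsReadExample and the path /user/demo/input.txt are illustrative):

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // open() asks the NameNode for block locations and returns an FSDataInputStream
        // whose DFSInputStream streams the blocks from the DataNodes.
        FSDataInputStream in = fs.open(new Path("/user/demo/input.txt"));
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {   // repeated read() calls under the hood
                System.out.println(line);
            }
        }                                                  // closing the reader ends the streaming
        fs.close();
    }
}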
Write Operation in HDFS
1. A client initiates the write operation by calling the 'create()' method of the file system object, which connects to the NameNode and initiates new file creation. The NameNode verifies that the file (which is being created) does not exist already and that the client has the correct permissions to create a new file. If the file already exists or the client does not have sufficient permission to create a new file, then an IOException is thrown to the client. Otherwise, the operation succeeds and a new record for the file is created by the NameNode.
2. Once a new record in the NameNode is created, an object of type FSDataOutputStream is returned to the client. The client uses it to write data into HDFS by invoking its data write method.
3. FSDataOutputStream contains a DFSOutputStream object which looks after communication with the DataNodes and the NameNode. While the client continues writing data, DFSOutputStream continues creating packets with this data. These packets are enqueued into a queue called the 'Data Queue'.
4. There is one more component, called the DataStreamer, which consumes this Data Queue. The DataStreamer also asks the NameNode for allocation of new blocks, thereby picking desirable DataNodes to be used for replication.
5. Now the process of replication starts by creating a pipeline of DataNodes. In our case, a replication level of 3 has been chosen, and hence there are 3 DataNodes in the pipeline.
6. The DataStreamer pours packets into the first DataNode in the pipeline.
7. Every DataNode in the pipeline stores each packet it receives and forwards it to the next DataNode in the pipeline.
8. Another queue, the 'Ack Queue', is maintained by DFSOutputStream to store packets which are waiting for acknowledgment from the DataNodes.
9. Once acknowledgment for a packet in the queue is received from all DataNodes in the pipeline, it is removed from the 'Ack Queue'. In the event of any DataNode failure, packets from this queue are used to reinitiate the operation.
10. After the client is done writing data, it calls the close() method. The call to close() flushes the remaining data packets to the pipeline and then waits for acknowledgments.
11. Once the final acknowledgment is received, the NameNode is contacted to tell it that the file write operation is complete.
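A minimal sketch of this write path from the client side, again using the Hadoop FileSystem Java API (the class name HdfsWriteExample and the path /user/demo/output.txt are illustrative; the replication level, e.g. 3, comes from the dfs.replication setting):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // create() asks the NameNode to record the new file and returns an
        // FSDataOutputStream backed by a DFSOutputStream; an IOException is
        // thrown if the file already exists or the client lacks permission.
        FSDataOutputStream out = fs.create(new Path("/user/demo/output.txt"));

        // Data written here is split into packets, placed on the Data Queue,
        // and streamed through the DataNode replication pipeline.
        out.writeBytes("hello hdfs\n");

        // close() flushes the remaining packets and waits for acknowledgments;
        // the NameNode is then told that the write is complete.
        out.close();
        fs.close();
    }
}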
Q2: Explain the steps required for the configuration of Eclipse with Apache Hadoop.
Support your answer with screenshots.
Eclipse is the most popular Integrated Development Environment (IDE) for developing Java applications. It is a robust, feature-rich, easy-to-use and powerful IDE that is the first choice of most Java programmers, and it is completely free.
https://www.eclipse.org/downloads/download.php?file=/oomph/epp/2020
After downloading the file, we need to move this file to the home directory.
There are two methods to extract the file: either right-click on the file and click on 'Extract Here', or extract it from the terminal.
5. Create a Java project in the Package Explorer (on the left side of the interface):
File > New > Java Project > (Name it – WordCountProject1) > Finish
6. Create a package inside the project:
Right Click > New > Package (Name it – WordCountPackage1) > Finish
7. After creating the package, the next step is to create the class.
8. Download the following JAR files, which the word count project needs on its build path:
hadoop-core-1.2.1.jar
commons-cli-1.2.jar
9. Add the downloaded libraries to the project:
Downloads/hadoop-core-1.2.1.jar
Downloads/commons-cli-1.2.jar
10. The Hadoop libraries have been added successfully.
The next step is to write the mapper and reducer logic for the word count project.
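As a hedged sketch of what that word count code can look like, the driver class below uses the Hadoop MapReduce Java API (compatible with hadoop-core-1.2.1) and assumes mapper and reducer classes such as the WordCountMapper and WordCountReducer sketched in the MapReduce section above; the class name WordCount is illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");      // Hadoop 1.x style job creation

        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output HDFS paths are passed on the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged as a JAR, such a job is typically launched with a command of the form: hadoop jar wordcount.jar WordCount <input path in HDFS> <output path in HDFS>.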
Q3: Explain the YARN model and give its comparison with that of Hadoop.
Hadoop YARN knits the storage unit of Hadoop, i.e. HDFS (Hadoop Distributed File System), together with the various processing tools. YARN stands for "Yet Another Resource Negotiator".
Why YARN?
In Hadoop version 1.0, MapReduce performed both processing and resource management
functions. It consisted of a Job Tracker which was the single master. The Job Tracker allocated the
resources, performed scheduling and monitored the processing jobs. It assigned map and reduce tasks to a number of subordinate processes called the Task Trackers. The Task Trackers periodically reported their progress to the Job Tracker.
This design resulted in a scalability bottleneck due to the single Job Tracker. IBM mentioned in its article that, according to Yahoo!, the practical limits of such a design are reached with a cluster of 5,000 nodes and 40,000 tasks running concurrently. Apart from this limitation, the utilization of computational resources is inefficient in MRv1. The Hadoop framework was also limited to the MapReduce processing paradigm only.
To overcome all these issues, YARN was introduced in Hadoop version 2.0 in the year 2012 by
Yahoo and Hortonworks. The basic idea behind YARN is to relieve MapReduce by taking over
the responsibility of Resource Management and Job Scheduling. YARN started to give Hadoop
the ability to run non-MapReduce jobs within the Hadoop framework.
YARN allows different data processing methods like graph processing, interactive processing, and stream processing, as well as batch processing, to run and process data stored in HDFS. Therefore, YARN opens up Hadoop to other types of distributed applications beyond MapReduce.
YARN enabled the users to perform operations as per requirement by using a variety of tools
like Spark for real-time processing, Hive for SQL, HBase for NoSQL and others.
Apart from Resource Management, YARN also performs Job Scheduling. YARN performs all
your processing activities by allocating resources and scheduling tasks. Apache Hadoop YARN
Architecture consists of the following main components:
1. Resource Manager: Runs on a master daemon and manages the resource allocation in the
cluster.
2. Node Manager: Node Managers run on the slave daemons and are responsible for the execution of tasks on every single DataNode.
3. Application Master: Manages the user job lifecycle and resource needs of individual
applications. It works along with the Node Manager and monitors the execution of tasks.
4. Container: A package of resources, including RAM, CPU, network, HDD, etc., on a single node.
Components of YARN
Resource Manager
a) Scheduler
The Scheduler is responsible for allocating resources to the various running applications, subject to constraints of capacities, queues, etc.
Within the ResourceManager it is called a pure scheduler, which means that it does not perform any monitoring or tracking of application status.
If there is an application failure or hardware failure, the Scheduler does not guarantee to restart the failed tasks.
It performs scheduling based on the resource requirements of the applications.
It has a pluggable policy plug-in, which is responsible for partitioning the cluster resources among the various applications. There are two such plug-ins, the Capacity Scheduler and the Fair Scheduler, which are currently used as schedulers in the ResourceManager.
b) Application Manager
It is responsible for accepting job submissions.
It negotiates the first container from the Resource Manager for executing the application-specific Application Master.
It manages the running Application Masters in the cluster and provides the service of restarting the Application Master container on failure.
Node Manager
It takes care of individual nodes in a Hadoop cluster and manages user jobs and workflow
on the given node.
It registers with the Resource Manager and sends heartbeats with the health status of the
node.
Its primary goal is to manage application containers assigned to it by the resource
manager.
It keeps up-to-date with the Resource Manager.
The Application Master requests the assigned container from the Node Manager by sending it a Container Launch Context (CLC), which includes everything the application needs in order to run. The Node Manager creates the requested container process and starts it.
Monitors resource usage (memory, CPU) of individual containers.
Performs Log management.
It also kills the container as directed by the Resource Manager.
Application Master
An application is a single job submitted to the framework. Each such application has a
unique Application Master associated with it which is a framework specific entity.
It is the process that coordinates an application’s execution in the cluster and also
manages faults.
Its task is to negotiate resources from the Resource Manager and work with the Node
Manager to execute and monitor the component tasks.
It is responsible for negotiating appropriate resource containers from the
ResourceManager, tracking their status and monitoring progress.
Once started, it periodically sends heartbeats to the Resource Manager to affirm its health
and to update the record of its resource demands.
Container
It is a collection of physical resources such as RAM, CPU cores, and disks on a single
node.
A YARN container is launched using a Container Launch Context (CLC), the record that describes its life cycle. This record contains a map of environment variables, dependencies stored in remotely accessible storage, security tokens, the payload for Node Manager services, and the command necessary to create the process.
It grants rights to an application to use a specific amount of resources (memory, CPU
etc.) on a specific host.
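As a rough sketch of the Container Launch Context described above (assuming the YARN client API of Hadoop 2.x; the class name ClcSketch, the environment entry, and the launch command are illustrative, not part of the assignment):

import java.nio.ByteBuffer;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.util.Records;

public class ClcSketch {
    public static ContainerLaunchContext buildClc() {
        ContainerLaunchContext clc = Records.newRecord(ContainerLaunchContext.class);

        // Dependencies stored in remotely accessible storage (e.g. application jars in HDFS).
        Map<String, LocalResource> localResources = new HashMap<String, LocalResource>();
        clc.setLocalResources(localResources);

        // Map of environment variables for the container process.
        Map<String, String> environment = new HashMap<String, String>();
        environment.put("APP_HOME", "/opt/myapp");   // illustrative value
        clc.setEnvironment(environment);

        // The command necessary to create (launch) the process inside the container.
        List<String> commands = Collections.singletonList(
                "java -Xmx256m com.example.Worker 1>stdout 2>stderr");
        clc.setCommands(commands);

        // Payload for Node Manager services (empty in this sketch); security tokens omitted.
        clc.setServiceData(new HashMap<String, ByteBuffer>());

        return clc;
    }
}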