
Computer Science and Engineering

VII-Semester

CS 802(A) Big Data & Hadoop

Presented By

Vishal Chhabra
Asst. Prof.
UNIT III
• HDFS Daemons – Namenode, Datanode, Secondary Namenode; Hadoop FS and Processing Environment's UIs; Fault Tolerance, High Availability, Block Replication
• Hadoop Processing Framework: YARN Daemons – Resource Manager, Node Manager; Job Assignment & Execution Flow
• MapReduce Architecture
• MapReduce Life Cycle
• Word Count Example (or Election Vote Count)
HDFS (Hadoop Distributed File System)
HDFS Introduction
• HDFS is a file system specially designed for storing huge data sets on a cluster of commodity hardware with a streaming access pattern (write once, read any number of times, but do not change the contents of the file).

• Google first came up with the design of GFS and published it in white papers; the Apache open-source community then developed Hadoop based on Google's white papers.

• Apache named its file system the Hadoop Distributed File System (HDFS).


Architecture
The cluster consists of one Name Node and many Data Nodes (Data Node-1, Data Node-2, Data Node-3, …, Data Node-N).

Functions of the Name Node:
1. Manages the Data Nodes.
2. Records the metadata of all the files stored in the cluster.
3. Receives a heartbeat from each Data Node to ensure that the Data Nodes are alive.

Functions of the Data Nodes:
1. Store the actual data.
2. Perform the low-level read and write requests from the file system's clients.
Blocks
• A block is defined as the smallest logical unit of space needed to store data on the hard drive.

• Hence, to store a file on HDFS, the data-set file is broken into blocks that are distributed across the cluster.

• Hadoop 1.x uses a block size of 64 MB.

• Hadoop 2.x uses a block size of 128 MB.
Example: a client has a file named File.txt of size 200 MB and wants to store it on a cluster of ten Data Nodes (DN-1 … DN-10) managed by a Name Node. Assuming a Hadoop 1.x environment with a 64 MB block size, the file is broken into four blocks:

200 MB = 64 MB (part 1, a.txt) + 64 MB (part 2, b.txt) + 64 MB (part 3, c.txt) + 8 MB (part 4, d.txt)
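To make the block arithmetic concrete, here is a minimal Java sketch (not from the slides; the file and block sizes are hard-coded to match the example above):

public class BlockSplit {
    public static void main(String[] args) {
        long fileSizeMb = 200;  // File.txt from the example
        long blockSizeMb = 64;  // Hadoop 1.x default block size

        long fullBlocks = fileSizeMb / blockSizeMb;               // 3 full 64 MB blocks
        long remainderMb = fileSizeMb % blockSizeMb;              // 8 MB left over
        long totalBlocks = fullBlocks + (remainderMb > 0 ? 1 : 0); // a partial block still occupies its own block

        System.out.println(totalBlocks + " blocks: " + fullBlocks + " x "
                + blockSizeMb + " MB + " + remainderMb + " MB");
        // Prints: 4 blocks: 3 x 64 MB + 8 MB
    }
}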

(JT = Job Tracker, TT = Task Tracker.)

Now these blocks are stored on the cluster. The client asks the Name Node, "Where should I keep my files?" and the Name Node answers with a set of Data Nodes, for example: keep them in DN-2, DN-5, DN-6 and DN-9 (one per block). The Name Node records this block-to-node mapping in its metadata file.
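The slides show no code for this interaction, but a minimal Java sketch of the client side, using the standard org.apache.hadoop.fs.FileSystem API, looks like this (the hdfs://localhost:9000 address and the /user/demo path are assumptions for a single-node setup):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS must point at the Name Node (assumption: local single-node cluster).
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        // The client exchanges only metadata with the Name Node;
        // the file's bytes stream to the Data Nodes the Name Node selected.
        Path file = new Path("/user/demo/File.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");
        }
        System.out.println("Block size used: " + fs.getFileStatus(file).getBlockSize());
        fs.close();
    }
}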
YARN (Yet Another Resource Negotiator)
Introduction
YARN was introduced in Hadoop 2.0 to remove the bottleneck on the Job Tracker that was present in Hadoop 1.0.
YARN was described as a "Redesigned Resource Manager" at launch, but it has since evolved into a large-scale distributed operating system for Big Data processing.
YARN Features
• Scalability: The scheduler in the YARN Resource Manager allows Hadoop to extend to and manage thousands of nodes and clusters.
• Compatibility: YARN supports existing map-reduce applications without disruption, making it compatible with Hadoop 1.0 as well.
• Cluster Utilization: YARN supports dynamic utilization of the cluster in Hadoop, which enables optimized cluster utilization.
• Multi-tenancy: It allows multiple engines to access the cluster, giving organizations the benefit of multi-tenancy.
YARN Introduction
YARN consists of three core components:

• Resource Manager (one per cluster)

• Application Master (one per application)

• Node Managers (one per node)


Components of YARN architecture
• Client: Submits map-reduce jobs.
• Resource Manager: The master daemon of YARN, responsible for resource assignment and management across all applications. Whenever it receives a processing request, it forwards the request to the corresponding Node Manager and allocates resources for its completion accordingly. It has two major components:
  • Scheduler: Performs scheduling based on the submitted applications and available resources. It is a pure scheduler: it does not perform other tasks such as monitoring or tracking, and it does not guarantee a restart if a task fails. The YARN scheduler supports plugins such as the Capacity Scheduler and the Fair Scheduler to partition the cluster resources.
  • Application Manager: Responsible for accepting an application and negotiating its first container from the Resource Manager. It also restarts the Application Master container if it fails.
• Node Manager: Takes care of an individual node in the Hadoop cluster and manages the applications and workflow on that particular node. Its primary job is to keep up with the Resource Manager. It monitors resource usage, performs log management, and kills a container when directed to by the Resource Manager. It is also responsible for creating a container process and starting it at the request of the Application Master.
• Application Master: An application is a single job submitted to the framework. The Application Master is responsible for negotiating resources with the Resource Manager and for tracking the status and monitoring the progress of a single application. It asks the Node Manager to launch a container by sending it a Container Launch Context (CLC), which includes everything the application needs to run. Once the application is started, it sends health reports to the Resource Manager from time to time.
• Container: A collection of physical resources such as RAM, CPU cores and disk on a single node. Containers are invoked via a Container Launch Context (CLC): a record that contains information such as environment variables, security tokens, dependencies, etc. (see the sketch below).
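As an illustration of the CLC record just described, here is a minimal Java sketch using the real org.apache.hadoop.yarn.api.records.ContainerLaunchContext class (the APP_HOME variable and the /bin/date command are made-up examples, not from the slides):

import java.util.Collections;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.util.Records;

public class ClcExample {
    // Builds a minimal Container Launch Context: the record the Application
    // Master hands to a Node Manager so it knows what process to start.
    static ContainerLaunchContext buildClc() {
        ContainerLaunchContext clc = Records.newRecord(ContainerLaunchContext.class);
        // Environment variables visible to the launched process (APP_HOME is hypothetical).
        clc.setEnvironment(Collections.singletonMap("APP_HOME", "/opt/myapp"));
        // The shell command(s) the Node Manager will execute inside the container.
        clc.setCommands(Collections.singletonList("/bin/date"));
        return clc;
    }
}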
1. The client submits an application.
2. The Resource Manager allocates a container to start the Application Master.
3. The Application Master registers itself with the Resource Manager.
4. The Application Master negotiates containers from the Resource Manager.
5. The Application Master notifies the Node Manager to launch the containers.
6. The application code is executed in the containers.
7. The client contacts the Resource Manager / Application Master to monitor the application's status.
8. Once the processing is complete, the Application Master unregisters with the Resource Manager.
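Step 7 of this flow can also be done programmatically. Below is a minimal sketch, assuming a yarn-site.xml on the classpath that points at the Resource Manager, using the standard org.apache.hadoop.yarn.client.api.YarnClient API:

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnStatusExample {
    public static void main(String[] args) throws Exception {
        // Assumption: yarn.resourcemanager.address is configured in yarn-site.xml.
        YarnConfiguration conf = new YarnConfiguration();
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Ask the Resource Manager for the status of all applications.
        for (ApplicationReport report : yarnClient.getApplications()) {
            System.out.println(report.getApplicationId()
                    + " " + report.getName()
                    + " " + report.getYarnApplicationState());
        }
        yarnClient.stop();
    }
}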
MapReduce
Introduction
MapReduce programs transform lists of input data elements into lists of output data elements.
A MapReduce program does this twice, using two different list-processing idioms:
• Map
• Reduce
In between Map and Reduce there is a small phase called Shuffle and Sort. For example, in word count the Map phase emits a (word, 1) pair for every word, Shuffle and Sort groups the pairs by word, and the Reduce phase sums the counts for each word.
Architecture overview
In the Hadoop 1.x MapReduce architecture, the user submits a job to the Job Tracker running on the master node. The Job Tracker assigns tasks to the Task Trackers running on the slave nodes (Slave node 1, Slave node 2, …, Slave node N), and each Task Tracker manages the workers on its node that execute the map and reduce tasks.
Word Count Dataflow and Workflow
[Figure: the word count dataflow through the Map, Shuffle and Sort, and Reduce phases.]
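The classic Word Count program below follows this dataflow; it is essentially the example from the Apache Hadoop MapReduce tutorial, with input and output paths taken from the command line:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every token in the input line.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce phase: after Shuffle and Sort, sum the counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // pre-aggregates on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The program is typically packaged into a jar and run with hadoop jar wordcount.jar WordCount <input dir> <output dir>; the output directory must not already exist.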
Thank You
Any Queries?
