Bigdata and Hadoop - Unit III
VII-Semester
Presented By
Vishal Chhabra
Asst. Prof.
UNIT III
• HDFS Daemons – Namenode, Datanode, Secondary Namenode,
Hadoop FS and Processing Environment's UIs, Fault Tolerance, High
Availability, Block Replication
• Hadoop Processing Framework: YARN Daemons – Resource Manager,
Node Manager, Job assignment & Execution flow
• MapReduce Architecture
• MapReduce life cycle
• Word Count Example (or) Election Vote Count
HDFS
Hadoop Distributed File System
HDFS Introduction
• HDFS is a file system specially designed for storing huge data sets on a
cluster of commodity hardware, with a streaming access pattern (write once,
read any number of times, but do not change the contents of the file).
• Google first came up with the design of GFS and published it in white papers;
Apache then developed Hadoop as open source, based on Google's white
papers.
• To store a file on HDFS, the data-set file is broken into blocks, which are
stored across the cluster.
Consider an example:
• A client has a file named File.txt (200 MB).
• Assume a Hadoop 1.x environment.
• The cluster consists of a Name Node and ten Data Nodes, DN-1 to DN-10.
Therefore, File.txt (200 MB) is broken into four blocks:
• 64 MB (part 1) – a.txt
• 64 MB (part 2) – b.txt
• 64 MB (part 3) – c.txt
• 8 MB (part 4) – d.txt
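The split above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not HDFS code: the function name `split_into_blocks` is hypothetical, and the 64 MB default is the block size used in the example above.

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the block sizes (in MB) that a file of the given size is split into.

    Illustrative helper only; block_size_mb defaults to the 64 MB
    used in the Hadoop 1.x example above.
    """
    if file_size_mb < 0 or block_size_mb <= 0:
        raise ValueError("file size must be >= 0 and block size must be > 0")
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    blocks = [block_size_mb] * full_blocks
    if remainder:
        # The last block only occupies the space it actually needs.
        blocks.append(remainder)
    return blocks


blocks = split_into_blocks(200)
print(blocks)       # [64, 64, 64, 8]
print(len(blocks))  # 4
```

Note that the last block is 8 MB, not a padded 64 MB: three full blocks plus the remainder add up to the original 200 MB.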
JT: Job Tracker
TT: Task Tracker
Now, these blocks are going to be stored on the cluster under
Client asks the Name Node: "Where should I keep my files?"
Cluster: Data Nodes DN-1 to DN-10
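The Name Node's answer to that question can be pictured as a mapping from each block to a set of Data Nodes. The sketch below is a toy illustration only: `place_blocks` is a hypothetical name, the round-robin policy is an assumption made for simplicity (real HDFS placement is rack-aware), and the replication factor of 3 is the usual HDFS default rather than something stated on this slide.

```python
def place_blocks(block_names, datanodes, replication=3):
    """Assign each block to `replication` distinct Data Nodes, round-robin.

    Toy policy for illustration; it ignores racks, free space and load.
    """
    if replication < 1 or replication > len(datanodes):
        raise ValueError("replication must be between 1 and the number of Data Nodes")
    placement = {}
    for i, name in enumerate(block_names):
        start = i * replication
        # Consecutive slots (wrapping around) are distinct because
        # replication <= len(datanodes).
        placement[name] = [
            datanodes[(start + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


datanodes = [f"DN-{n}" for n in range(1, 11)]  # DN-1 ... DN-10
placement = place_blocks(["a.txt", "b.txt", "c.txt", "d.txt"], datanodes)
for block, nodes in placement.items():
    print(block, "->", nodes)
# a.txt -> ['DN-1', 'DN-2', 'DN-3']
# b.txt -> ['DN-4', 'DN-5', 'DN-6']
# c.txt -> ['DN-7', 'DN-8', 'DN-9']
# d.txt -> ['DN-10', 'DN-1', 'DN-2']
```

Because every block lives on more than one Data Node, losing a single node does not lose any block, which is the idea behind block replication.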
YARN
(Yet Another Resource Negotiator)
Introduction
YARN was introduced in Hadoop 2.0 to remove the Job Tracker
bottleneck that was present in Hadoop 1.0.
YARN was described as a "Redesigned Resource Manager" at the time
of its launch, but it has since evolved into what is known as a large-scale
distributed operating system for Big Data processing.
YARN Features
• Scalability: The scheduler in the Resource Manager of the YARN architecture
allows Hadoop to extend to and manage thousands of nodes and
clusters.
• Compatibility: YARN supports existing MapReduce applications
without disruption, making it compatible with Hadoop 1.0 as
well.
• Cluster Utilization: YARN supports dynamic utilization of the cluster
in Hadoop, which enables optimized cluster utilization.
• Multi-tenancy: It allows access by multiple processing engines, giving
organizations the benefit of multi-tenancy.
YARN Introduction
YARN consists of three core components:
user
Job tracker