
CS 3006

Parallel and Distributed Computing


HDFS, MapReduce, Yarn
Hadoop
• Apache Hadoop is an open-source software framework for the storage and large-scale processing of datasets on clusters of commodity hardware.

• It consists of the following basic modules:
Hadoop Distributed File System (HDFS)
Hadoop YARN
Hadoop MapReduce
Hadoop Modules
Hadoop Distributed File System
• HDFS is a distributed file system written in Java that is fault-tolerant and scalable.
• HDFS is the primary distributed storage for Hadoop applications.
• HDFS provides interfaces for applications to move themselves closer to the data.
• There are two types of machines in an HDFS cluster (a small sketch follows):
The NameNode is the heart of an HDFS filesystem: it maintains and manages the file-system metadata, e.g., which blocks make up a file and on which DataNodes those blocks are stored.
DataNodes are where HDFS stores the actual data; there are usually quite a few of these.
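To make the NameNode/DataNode split concrete, here is a minimal sketch (not from the slides) that asks the NameNode which DataNodes hold each block of a file via the Java FileSystem API; the path /user/demo/data.txt is a hypothetical example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        FileStatus status = fs.getFileStatus(new Path("/user/demo/data.txt"));
        // The block-to-DataNode mapping is metadata served by the NameNode:
        // one BlockLocation per block, listing the hosts that store a replica.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset " + block.getOffset() + " -> "
                    + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}
```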
HDFS Architecture
HDFS Features
• Failure tolerant - data is duplicated across multiple DataNodes to protect against machine failures. The default is a replication factor of 3 (every block is stored on three machines); see the sketch after this list.
• Scalability - data transfers happen directly with the DataNodes, so your read/write capacity scales fairly well with the number of DataNodes.
• Space - need more disk space? Just add more DataNodes and re-balance.
• Industry standard - other distributed applications are built on top of HDFS (HBase, MapReduce).
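As a concrete illustration of the replication feature, here is a minimal sketch, assuming an HDFS client classpath and a hypothetical file path, that raises one file's replication factor above the default of 3 (replication is a per-file attribute; the cluster-wide default comes from dfs.replication in hdfs-site.xml):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Ask HDFS to keep 5 replicas of this (hypothetical) hot file
        // instead of the default 3; the NameNode schedules the extra copies.
        fs.setReplication(new Path("/user/demo/hot-data.txt"), (short) 5);
        fs.close();
    }
}
```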
Read Operation in HDFS
Write Operation in HDFS
HDFS Security
• Authentication to Hadoop
Simple – an insecure mode that uses the OS username to determine Hadoop identity
Kerberos – authentication using a Kerberos ticket
✔ Set by hadoop.security.authentication=simple|kerberos
• File and directory permissions are the same as in POSIX
read (r), write (w), and execute (x) permissions
each file and directory also has an owner, a group, and a mode
enabled by default (dfs.permissions.enabled=true)
• ACLs are used to implement permissions that differ from the natural hierarchy of users and groups
enabled by dfs.namenode.acls.enabled=true
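The three configuration keys above normally live in core-site.xml and hdfs-site.xml on the cluster; as a minimal sketch, they can also be set programmatically on a client-side Configuration object:

```java
import org.apache.hadoop.conf.Configuration;

public class SecurityConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos"); // or "simple"
        conf.setBoolean("dfs.permissions.enabled", true);   // POSIX-style permission checks
        conf.setBoolean("dfs.namenode.acls.enabled", true); // ACLs beyond owner/group/other
        System.out.println(conf.get("hadoop.security.authentication"));
    }
}
```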
Interfaces to HDFS
• Java API (DistributedFileSystem)
• C wrapper (libhdfs)
• HTTP protocol
• WebDAV protocol
• Shell Commands
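A minimal sketch of the first interface listed, the Java API: opening an HDFS file and streaming it to stdout. The namenode host, port, and path are hypothetical placeholders.

```java
import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"),
                new Configuration());
        try (InputStream in = fs.open(new Path("/user/demo/data.txt"))) {
            // Data bytes stream directly from the DataNodes holding each block.
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}
```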
MapReduce
• MapReduce is a programming model for efficient distributed computing
The processing unit of Hadoop; the model originated at Google
• It works like a Unix pipeline
cat input | grep | sort | uniq -c | cat > output
Input | Map | Shuffle & Sort | Reduce | Output
• Efficiency comes from
Streaming through data, reducing seeks
Pipelining
• A good fit for a lot of applications
Log processing
Web index building
MapReduce (Cont.)
MapReduce - Dataflow
MapReduce - Features
• Fine grained Map and Reduce tasks
Improved load balancing
Faster recovery from failed tasks
• Automatic re-execution on failure
In a large cluster, some nodes are always slow or flaky
Framework re-executes failed tasks
• Locality optimizations
With large data, bandwidth to the data is a problem
MapReduce + HDFS is a very effective solution
MapReduce queries HDFS for the locations of the input data
Map tasks are scheduled close to the inputs when possible

Word Count Example
• Mapper
Input: value: a line of input text
Output: key: word, value: 1
• Reducer
Input: key: word, value: the set of counts for that word
Output: key: word, value: sum
• Launching program
Defines the job
Submits the job to the cluster (full sketch below)
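The sketch below is a condensed version of the classic Hadoop WordCount, showing the mapper, reducer, and launching program described above:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Mapper: one line of text in, a (word, 1) pair out per token.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: (word, set of counts) in, (word, sum) out.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Launching program: defines the job and submits it to the cluster.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```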
Word Count Dataflow
Yarn
• YARN is the prerequisite for Enterprise Hadoop
It provides resource management and a central platform to deliver consistent operations, security, and data-governance tools across Hadoop clusters.
YARN Cluster Basics
• In a YARN cluster, there are two types of hosts:
The ResourceManager is the master daemon that communicates with the client, tracks
resources on the cluster, and orchestrates work by assigning tasks to NodeManagers.
A NodeManager is a worker daemon that launches and tracks processes spawned on
worker hosts.
Yarn Resource Monitoring
• YARN currently defines two resources:
v-cores
Memory
• Each NodeManager tracks its own local resources and communicates its resource configuration to the ResourceManager.
• The ResourceManager keeps a running total of the cluster's available resources.
Yarn Resource Monitoring (Cont.)
Yarn Container
• Containers
A container is a request to hold resources on the YARN cluster.
A container hold request consists of vcores and memory.
A container holds a collection of physical resources; the task runs as a process inside the container (see the sketch below).
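As a minimal sketch (not a complete ApplicationMaster), this is how a container hold request of 2 vcores and 1024 MB of memory can be expressed with the YARN records API; the priority value is an arbitrary example:

```java
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class ContainerRequestSketch {
    public static void main(String[] args) {
        // A container hold request names the two resources YARN tracks:
        // memory (in MB) and vcores.
        Resource capability = Resource.newInstance(1024, 2);
        ContainerRequest request =
                new ContainerRequest(capability, null, null, Priority.newInstance(1));
        // A real ApplicationMaster would register with the ResourceManager and
        // then submit this via AMRMClient.addContainerRequest(request).
        System.out.println("Requesting: " + request.getCapability());
    }
}
```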
Yarn Application and ApplicationMaster
• YARN application
It is a YARN client program that is made up of one or more tasks.
Example: a MapReduce application

• ApplicationMaster
It helps coordinate tasks on the YARN cluster for each running application.
It is the first process run after the application starts.
Hadoop Related Subprojects
• Pig
High-level language for data analysis
• HBase
Table storage for semi-structured data
• Zookeeper
Coordinating distributed applications
• Hive
SQL-like query language and Metastore
• Mahout
Machine learning
Thank You!
