
CS 3006

Parallel and Distributed Computing


HDFS, MapReduce, Yarn
Hadoop
• Apache Hadoop is an open-source software framework for the storage and large-scale processing of datasets on clusters of commodity hardware.

• It consists of the following basic modules:
Hadoop Distributed File System (HDFS)
Hadoop YARN
Hadoop MapReduce
Hadoop Modules
Hadoop Distributed File System
• HDFS is a distributed file system written in Java that is fault-tolerant and scalable.
• HDFS is the primary distributed storage for Hadoop applications.
• HDFS provides interfaces for applications to move themselves closer to the data.
• There are two types of machines in an HDFS cluster (a small sketch follows):
The NameNode is the heart of an HDFS filesystem: it maintains and manages the file-system metadata, e.g., which blocks make up a file and on which DataNodes those blocks are stored.
DataNodes are where HDFS stores the actual data; there are usually quite a few of these.
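To make the NameNode/DataNode split concrete, here is a minimal sketch (not from the slides) that asks the NameNode which DataNodes hold each block of a file via the Java FileSystem API; the path /user/demo/data.txt is a hypothetical example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        FileStatus status = fs.getFileStatus(new Path("/user/demo/data.txt"));
        // The block-to-DataNode mapping is metadata served by the NameNode:
        // one BlockLocation per block, listing the hosts that store a replica.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset " + block.getOffset() + " -> "
                    + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}
```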
HDFS Architecture
HDFS Features
• Failure tolerant - data is duplicated across multiple DataNodes to protect against machine failures. The default is a replication factor of 3 (every block is stored on three machines); see the sketch after this list.
• Scalability - data transfers happen directly with the DataNodes, so your read/write capacity scales fairly well with the number of DataNodes.
• Space - need more disk space? Just add more DataNodes and re-balance.
• Industry standard - other distributed applications are built on top of HDFS (HBase, MapReduce).
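As a concrete illustration of the replication feature, here is a minimal sketch, assuming an HDFS client classpath and a hypothetical file path, that raises one file's replication factor above the default of 3 (replication is a per-file attribute; the cluster-wide default comes from dfs.replication in hdfs-site.xml):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Ask HDFS to keep 5 replicas of this (hypothetical) hot file
        // instead of the default 3; the NameNode schedules the extra copies.
        fs.setReplication(new Path("/user/demo/hot-data.txt"), (short) 5);
        fs.close();
    }
}
```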
Read Operation in HDFS
Write Operation in HDFS
HDFS Security
• Authentication to Hadoop
Simple – an insecure mode that uses the OS username to determine Hadoop identity
Kerberos – authentication using a Kerberos ticket
✔ Set by hadoop.security.authentication=simple|kerberos
• File and directory permissions are the same as in POSIX
read (r), write (w), and execute (x) permissions
each file and directory also has an owner, a group, and a mode
enabled by default (dfs.permissions.enabled=true)
• ACLs are used to implement permissions that differ from the natural hierarchy of users and groups
enabled by dfs.namenode.acls.enabled=true
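The three configuration keys above normally live in core-site.xml and hdfs-site.xml on the cluster; as a minimal sketch, they can also be set programmatically on a client-side Configuration object:

```java
import org.apache.hadoop.conf.Configuration;

public class SecurityConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos"); // or "simple"
        conf.setBoolean("dfs.permissions.enabled", true);   // POSIX-style permission checks
        conf.setBoolean("dfs.namenode.acls.enabled", true); // ACLs beyond owner/group/other
        System.out.println(conf.get("hadoop.security.authentication"));
    }
}
```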
Interfaces to HDFS
• Java API (DistributedFileSystem)
• C wrapper (libhdfs)
• HTTP protocol
• WebDAV protocol
• Shell Commands
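A minimal sketch of the first interface listed, the Java API: opening an HDFS file and streaming it to stdout. The namenode host, port, and path are hypothetical placeholders.

```java
import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"),
                new Configuration());
        try (InputStream in = fs.open(new Path("/user/demo/data.txt"))) {
            // Data bytes stream directly from the DataNodes holding each block.
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}
```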
MapReduce
• MapReduce is a programming model for efficient distributed computing
The processing unit of Hadoop; the model originated at Google
• It works like a Unix pipeline
cat input | grep | sort | uniq -c | cat > output
Input | Map | Shuffle & Sort | Reduce | Output
• Efficiency comes from
Streaming through data, reducing seeks
Pipelining
• A good fit for a lot of applications
Log processing
Web index building
MapReduce (Cont.)
MapReduce - Dataflow
MapReduce - Features
• Fine grained Map and Reduce tasks
Improved load balancing
Faster recovery from failed tasks
• Automatic re-execution on failure
In a large cluster, some nodes are always slow or flaky
Framework re-executes failed tasks
• Locality optimizations
With large data, bandwidth to the data is a problem
MapReduce + HDFS is a very effective solution
MapReduce queries HDFS for the locations of the input data
Map tasks are scheduled close to the inputs when possible

Word Count Example
• Mapper
Input: value: a line of input text
Output: key: word, value: 1
• Reducer
Input: key: word, value: the set of counts for that word
Output: key: word, value: sum
• Launching program
Defines the job
Submits the job to the cluster (full sketch below)
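The sketch below is a condensed version of the classic Hadoop WordCount, showing the mapper, reducer, and launching program described above:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Mapper: one line of text in, a (word, 1) pair out per token.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: (word, set of counts) in, (word, sum) out.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Launching program: defines the job and submits it to the cluster.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```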
Word Count Dataflow
Yarn
• YARN is the prerequisite for Enterprise Hadoop
It provides resource management and a central platform to deliver consistent operations, security, and data-governance tools across Hadoop clusters.
YARN Cluster Basics
• In a YARN cluster, there are two types of hosts:
The ResourceManager is the master daemon that communicates with the client, tracks
resources on the cluster, and orchestrates work by assigning tasks to NodeManagers.
A NodeManager is a worker daemon that launches and tracks processes spawned on
worker hosts.
Yarn Resource Monitoring
• YARN currently defines two resources:
v-cores
Memory
• Each NodeManager tracks its own local resources and communicates its resource configuration to the ResourceManager.
• The ResourceManager keeps a running total of the cluster's available resources.
Yarn Resource Monitoring (Cont.)
Yarn Container
• Containers
A container is a request to hold resources on the YARN cluster.
A container hold request consists of vcores and memory.
A container holds a collection of physical resources; the task runs as a process inside the container (see the sketch below).
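As a minimal sketch (not a complete ApplicationMaster), this is how a container hold request of 2 vcores and 1024 MB of memory can be expressed with the YARN records API; the priority value is an arbitrary example:

```java
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class ContainerRequestSketch {
    public static void main(String[] args) {
        // A container hold request names the two resources YARN tracks:
        // memory (in MB) and vcores.
        Resource capability = Resource.newInstance(1024, 2);
        ContainerRequest request =
                new ContainerRequest(capability, null, null, Priority.newInstance(1));
        // A real ApplicationMaster would register with the ResourceManager and
        // then submit this via AMRMClient.addContainerRequest(request).
        System.out.println("Requesting: " + request.getCapability());
    }
}
```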
Yarn Application and ApplicationMaster
• YARN application
It is a YARN client program that is made up of one or more tasks.
Example: a MapReduce application

• ApplicationMaster
It helps coordinate tasks on the YARN cluster for each running application.
It is the first process run after the application starts.
Hadoop Related Subprojects
• Pig
High-level language for data analysis
• HBase
Table storage for semi-structured data
• Zookeeper
Coordinating distributed applications
• Hive
SQL-like query language and Metastore
• Mahout
Machine learning
Thank You!
