CS621 Week 15

MapReduce is a software framework designed for parallel and distributed computing on large datasets, enabling tasks like data mining and statistical analysis. It operates in three phases: Map, Sort, and Reduce, and is commonly used in conjunction with Hadoop, which provides a scalable and fault-tolerant environment for data processing. The document also covers the Google File System (GFS) and Hadoop Distributed File System (HDFS), highlighting their roles in managing and processing large volumes of data.


Dr. Muhammad Anwaar Saeed
Dr. Said Nabi
Ms. Hina Ishaq

CS621 Parallel and Distributed Computing
MapReduce

CS621 Parallel and Distributed Computing

Objectives
What is MapReduce?
Usage of MapReduce.

What is MapReduce?

“MapReduce is a software framework which supports parallel and distributed computing on large data sets.”
MapReduce - Introduction

Simple data-parallel programming model designed for:
• Scalability
• Fault-tolerance

Pioneered by Google
• Processes 200 petabytes of data per day (updated 2022)

Popularized by the open-source Hadoop project
• Used at Yahoo!, Facebook, Amazon

What is MapReduce used for?
At Google
• Index construction for Google Search
• Article clustering for Google News
• Statistical machine translation

At Facebook
• Data mining
• Ad optimization
• Spam detection

At Yahoo!
• “Web map” powering Yahoo! Search
• Spam detection for Yahoo! Mail
MapReduce Usage in Research

In research, MapReduce has been used for:
• Astronomical image analysis (Washington)
• Ocean climate simulation (Washington)
• Bioinformatics (Maryland)
• Analyzing Wikipedia conflicts (PARC)
• Particle physics (Nebraska)
• Natural language processing (CMU)

How does MapReduce work?

MapReduce has three main phases:
• Map
• Sort
• Reduce
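
To make the three phases concrete, here is a minimal, single-machine Python sketch of the pipeline; the names run_mapreduce, map_fn and reduce_fn are illustrative only and not part of any Hadoop API.

from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, map_fn, reduce_fn):
    """Simulate the Map, Sort, and Reduce phases sequentially on one machine."""
    # Map phase: each input record is turned into zero or more (key, value) pairs.
    intermediate = []
    for record in records:
        intermediate.extend(map_fn(record))

    # Sort phase: intermediate pairs are sorted so equal keys become adjacent.
    intermediate.sort(key=itemgetter(0))

    # Reduce phase: each key and its list of values is reduced to a final result.
    results = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        values = [value for _, value in group]
        results.append(reduce_fn(key, values))
    return results
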
MapReduce Overview
MapReduce: Examples

CS621 Parallel and Distributed Computing

Objectives
MapReduce example based on three phases.
MapReduce example based on five processing stages.

MapReduce Example
(based on Three Phases)

The canonical MapReduce example: Word Count

• Example corpus:
Jane likes toast with jam
Joe likes toast
Joe burnt the toast
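
Below is a minimal Python sketch of word count over this corpus, run locally rather than on a cluster; it simply walks through the Map, Sort and Reduce phases from the previous slides.

from collections import defaultdict

corpus = [
    "Jane likes toast with jam",
    "Joe likes toast",
    "Joe burnt the toast",
]

# Map: emit a (word, 1) pair for every word in every line.
pairs = [(word, 1) for line in corpus for word in line.split()]

# Sort/group: collect all values that share the same key.
grouped = defaultdict(list)
for word, count in sorted(pairs):
    grouped[word].append(count)

# Reduce: sum the counts for each word.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)  # e.g. 'toast' maps to 3, 'Joe' and 'likes' to 2
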
MapReduce: Map (Slow Motion)
MapReduce: Sort (Slow Motion)
MapReduce: Reduce (Slow Motion)

MapReduce logical data flow in 5 processing stages over successive (key, value) pairs.
MapReduce logical data flow in 5 processing stages: Example
MapReduce Actual Data and Computation

Data and Control Flow

The main responsibility of the MapReduce framework is to efficiently run a user’s program on a distributed computing system. Therefore, the MapReduce framework meticulously handles all processing steps, such as:
• Data partitioning and distribution
• Reading the input data
• Determining where and when to run the Map and Reduce functions
• Sorting and grouping of intermediate (key, value) pairs
• Combining
• Communication
• Synchronization
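
As a rough illustration of the data-partitioning step, the sketch below hashes each intermediate key to one of the reduce tasks. Hadoop's default partitioner behaves analogously (using the key's hash code), but the exact hashing shown here is only an assumption for illustration.

def partition(key, num_reducers):
    """Assign an intermediate key to one of num_reducers reduce tasks."""
    # hash() is Python's built-in; a real framework would use its own hash of the key.
    return hash(key) % num_reducers

# All pairs with the same key land on the same reducer,
# so each reducer sees every value for the keys it owns.
pairs = [("toast", 1), ("jam", 1), ("toast", 1)]
for key, value in pairs:
    print(key, "-> reducer", partition(key, num_reducers=4))
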
MapReduce Design Goals

Scalability to large data volumes:
• 1000s of machines, 10,000s of disks

Cost-efficiency:
• Commodity machines (cheap, but unreliable)
• Commodity network
• Automatic fault-tolerance (fewer administrators)
• Easy to use (fewer programmers)
Hadoop

CS621 Parallel and Distributed Computing

Objectives
Introduction to Hadoop.
Key functions of Hadoop.

What is Hadoop?

“An open-source platform for distributed processing of large data. Hadoop provides a simplified programming model that makes it easy to write distributed algorithms.”
Key functions of Hadoop

Hadoop functions:
• Distribution of data and processing across machines
• Management of the cluster
Hadoop scalability

Hadoop can reach massive scalability by exploiting a simple distribution architecture and coordination model.

Huge clusters can be built using (cheap) commodity hardware:
• A 1000-CPU machine would be much more expensive than 1000 single-CPU or 250 quad-core machines.

Clusters can easily scale up with little or no modification to the programs.

Hadoop Components

HDFS: Hadoop Distributed File System
• Abstraction of a file system over a cluster
• Stores large amounts of data by transparently spreading it over different machines

MapReduce
• Simple programming model that enables parallel execution of data processing programs
• Executes the work on the data, near the data

In a nutshell: HDFS places the data on the cluster and MapReduce does the processing work.

Hadoop Principle

• Hadoop is basically a middleware platform that manages a cluster of machines
• The core component is a distributed file system (HDFS)
• Files in HDFS are split into blocks that are scattered over the cluster
• The cluster can grow indefinitely simply by adding new nodes

Hadoop Components

[Diagram] The Hadoop stack: MapReduce running on top of HDFS.

Hadoop and MapReduce

• MR works on (big) files loaded on HDFS
• Each node in the cluster executes the MR program in parallel, applying the map and reduce phases on the HDFS blocks it stores
• Output is written back to HDFS
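
As a sketch of what the per-node program can look like, the following word-count mapper and reducer follow the Hadoop Streaming convention (read lines from standard input, write tab-separated key/value pairs to standard output); job submission details and HDFS paths are omitted, and in a real Streaming job the two functions would usually live in separate scripts.

import sys
from itertools import groupby

def mapper(lines):
    # Streaming-style mapper: emit "word<TAB>1" for each word read from stdin.
    for line in lines:
        for word in line.split():
            print(f"{word}\t1")

def reducer(lines):
    # Streaming-style reducer: input arrives sorted by key, so consecutive
    # lines with the same word can be summed directly.
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        print(f"{word}\t{total}")

if __name__ == "__main__":
    # Run as "python wordcount.py map" or "python wordcount.py reduce".
    if sys.argv[1] == "map":
        mapper(sys.stdin)
    else:
        reducer(sys.stdin)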


Hadoop: Good and Bad

Good for:
• Repetitive tasks on big data

Not good for:
• Replacing an RDBMS
• Complex processing requiring various phases and/or iterations
• Processing small to medium size data

GFS: Google File System

CS621 Parallel and Distributed Computing

Objectives
Introduction to GFS.
GFS working process.

GFS: Google File System

• “GFS was built primarily as the fundamental storage service for Google’s search engine.
• As the size of the web data that was crawled and saved was quite substantial, Google needed a distributed file system to redundantly store massive amounts of data on cheap and unreliable computers.”
Why GFS?

Component failures
• Component failures are the norm, not the exception

Files are huge
• By traditional standards (many TB)
• Typically 1000 nodes & 300 TB

Most mutations are appends
• Not random-access overwrites

Co-designing apps & file system
• GFS was co-designed with the applications using it

GFS: Design Assumptions

• Must monitor & recover from component failures
• Modest number of large files
• Workload:
  • Large streaming reads + small random reads
  • Many large sequential writes
• Need semantics for concurrent appends
• High sustained bandwidth (more important than low latency)

GFS: Interface

Familiar:
• Create, delete, open, close, read, write

Novel:
• Snapshot (low cost)
• Record append (atomicity with multiple concurrent writes)

GFS: Architecture
GFS: Architecture details

CS621 Parallel and Distributed Computing

Objectives
What are the functions of the GFS architecture components?
GFS implementation.

GFS Architecture: Master
• Stores all metadata
  • Namespace
  • Access-control information
  • Chunk locations
  • ‘Lease’ management
• Heartbeats
• Having one master → global knowledge
  • Allows better placement / replication
  • Simplifies design

GFS Architecture: Chunk Servers

• Store all files in fixed-size chunks
  • 64 MB each
  • 64-bit unique handle
  • Triple redundancy

GFS Architecture (client read flow)

• Contact the single master
• Obtain chunk locations
• Contact one of the chunk servers
• Obtain the data
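
GFS has no public API, so the following Python sketch is purely illustrative: the Master and ChunkServer classes are hypothetical stand-ins used to trace the read path described above, including how a byte offset maps to a chunk index with 64 MB chunks.

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB fixed chunk size

class Master:
    """Hypothetical stand-in for the GFS master (metadata only)."""
    def __init__(self, chunk_table):
        self.chunk_table = chunk_table  # (path, chunk_index) -> (handle, replicas)

    def lookup(self, path, chunk_index):
        return self.chunk_table[(path, chunk_index)]

class ChunkServer:
    """Hypothetical stand-in for a chunk server holding chunk data."""
    def __init__(self, chunks):
        self.chunks = chunks  # handle -> bytes

    def read(self, handle, offset_in_chunk, length):
        return self.chunks[handle][offset_in_chunk:offset_in_chunk + length]

def gfs_read(master, path, offset, length):
    # 1. Translate the byte offset into a chunk index within the file.
    chunk_index = offset // CHUNK_SIZE
    # 2. Ask the single master for the chunk handle and replica locations.
    handle, replicas = master.lookup(path, chunk_index)
    # 3. Read directly from one of the chunk servers; file data never flows
    #    through the master.
    return replicas[0].read(handle, offset % CHUNK_SIZE, length)

server = ChunkServer({"h1": b"hello world"})
master = Master({("/demo.txt", 0): ("h1", [server])})
print(gfs_read(master, "/demo.txt", 6, 5))  # b'world'
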
GFS Architecture: Master → Metadata

The master stores three types of metadata:
• File & chunk namespaces
• Mapping from files → chunks
• Location of chunk replicas

Stored in memory
Kept persistent through logging
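
A rough sketch, assuming hypothetical structure names, of how these three kinds of metadata might be held in the master's memory; it mirrors the description above, not Google's actual implementation.

from dataclasses import dataclass, field

@dataclass
class MasterMetadata:
    # 1. File and chunk namespaces.
    namespace: set = field(default_factory=set)
    # 2. Mapping from each file to the ordered list of its chunk handles.
    file_to_chunks: dict = field(default_factory=dict)
    # 3. Location of each chunk's replicas (refreshed via chunkserver heartbeats).
    chunk_locations: dict = field(default_factory=dict)

meta = MasterMetadata()
meta.namespace.add("/logs/web-00")
meta.file_to_chunks["/logs/web-00"] = ["chunk-0001", "chunk-0002"]
meta.chunk_locations["chunk-0001"] = ["cs-a", "cs-b", "cs-c"]  # triple redundancy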


GFS Architecture: Master Operations
• Replica placement
• New chunk and replica creation
• Load balancing
• Unused storage reclamation

GFS: Consistency Model

All file namespace mutations are atomic
• Handled exclusively by the master

The status of a file region can be:
• Consistent: all clients see the same data
• Defined: all clients see the same data, which includes the entirety of the last mutation
• Undefined but consistent: all clients see the same data, but it may not reflect what any one mutation has written
• Inconsistent

GFS: Leases and Mutation Order

• The master uses leases to maintain a consistent mutation order among replicas
• The primary is the chunkserver that is granted a chunk lease
• All other chunkservers holding replicas are secondaries
• The primary defines a serial order for mutations
• All secondaries follow this order
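
The sketch below illustrates the idea with hypothetical PrimaryReplica and SecondaryReplica classes: the primary assigns serial numbers to incoming mutations and every secondary applies them in that same order; it is a simplification, not GFS code.

class PrimaryReplica:
    """Hypothetical primary: holds the chunk lease and fixes the mutation order."""
    def __init__(self, secondaries):
        self.serial = 0
        self.applied = []
        self.secondaries = secondaries

    def apply(self, mutation):
        # Assign the next serial number, apply locally, then forward the
        # (serial, mutation) pair so every secondary replays the same order.
        self.serial += 1
        self.applied.append((self.serial, mutation))
        for secondary in self.secondaries:
            secondary.apply(self.serial, mutation)

class SecondaryReplica:
    """Hypothetical secondary: applies mutations strictly in the primary's order."""
    def __init__(self):
        self.applied = []

    def apply(self, serial, mutation):
        self.applied.append((serial, mutation))

s1, s2 = SecondaryReplica(), SecondaryReplica()
primary = PrimaryReplica([s1, s2])
primary.apply("append record A")
primary.apply("append record B")
print(s1.applied == s2.applied == primary.applied)  # True: identical order everywhere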


GFS Write Control & Dataflow

• Applying mutations in the same order keeps the replicas identical
• A file region may end up containing mingled fragments from different clients (consistent but undefined)

GFS: Limitations

• Custom designed
• Only viable in a specific environment
• Limited security

HDFS: Hadoop Distributed File System

CS621 Parallel and Distributed Computing

Objectives
Introduction to HDFS.
HDFS blocks and nodes.

HDFS: Background

• At Google, MapReduce operations are run on a special file system called the Google File System (GFS) that is highly optimized for this purpose.
• GFS is not open source.
• Doug Cutting and Yahoo! reverse engineered GFS and called the result the Hadoop Distributed File System (HDFS).
• The software framework that supports HDFS, MapReduce and other related entities is called the Hadoop project, or simply Hadoop.
• Hadoop is open source and distributed by Apache.

HDFS: Basic Features

• Highly fault-tolerant
• High throughput
• Suitable for applications with large data sets
• Streaming access to file system data
• Can be built out of commodity hardware

HDFS: Basic Features

• HDFS was designed for optimal performance with a WORM (Write Once, Read Many times) access pattern
• HDFS is designed to run on clusters of general-purpose computers & servers from multiple vendors

HDFS: Blocks

• Files in HDFS are divided into block-sized chunks
  • 64 MB default block size
• A block is the minimum amount of data that HDFS can read or write
• Blocks simplify the storage and replication process
  • Provides fault tolerance & processing speed enhancement for larger files
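
For example, with the 64 MB default block size a 200 MB file occupies four blocks (three full 64 MB blocks plus one 8 MB block); the small sketch below works this out (blocks_for is an illustrative helper, not an HDFS API).

import math

BLOCK_SIZE_MB = 64  # default HDFS block size in this material

def blocks_for(file_size_mb):
    """Number of HDFS blocks needed to store a file of the given size."""
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)

print(blocks_for(200))  # 4 blocks: 64 + 64 + 64 + 8 MB
print(blocks_for(50))   # 1 block
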
HDFS: Nodes

HDFS clusters use 2 types of nodes:
• Namenode (master node)
• Datanode (worker node)

HDFS: Nodes

Namenode:
• Manages the file system namespace
• Keeps track of the datanodes that have blocks of a distributed file assigned
• Maintains the file system tree and the metadata for all the files and directories in the tree
• Stores this metadata on the local disk using 2 file forms:
  • Namespace image
  • Edit log
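
A very rough sketch, using hypothetical names, of how a namespace image plus an edit log reconstruct the namespace: the image is a checkpoint, and the edit log replays the changes recorded since that checkpoint.

def load_namespace(image, edit_log):
    """Illustrative only: rebuild the in-memory namespace at startup."""
    # Start from the last checkpoint (the namespace image)...
    namespace = dict(image)
    # ...then replay every operation recorded since that checkpoint.
    for op, path, value in edit_log:
        if op == "create":
            namespace[path] = value
        elif op == "delete":
            namespace.pop(path, None)
    return namespace

image = {"/user/data/part-0": {"blocks": ["blk_1", "blk_2"]}}
edits = [("create", "/user/data/part-1", {"blocks": ["blk_3"]}),
         ("delete", "/user/data/part-0", None)]
print(load_namespace(image, edits))  # only /user/data/part-1 remains
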
HDFS: Namenode

• The namenode holds the filesystem metadata in its memory
• The namenode’s memory size therefore limits the number of files in a filesystem

But then, what is metadata?

HDFS: Metadata

• Analogous to the traditional concept of library card catalogs
• Categorizes and describes the contents and context of the data files
• Maximizes the usefulness of the original data file by making it easy to find and use

HDFS: Metadata Types

Structural metadata
• Focuses on the data structure’s design and specification

Descriptive metadata
• Focuses on the individual instances of application data or the data content

HDFS: Datanodes

• The workhorses of the filesystem
• Store and retrieve blocks when requested by the client or the namenode
• Periodically report back to the namenode with lists of the blocks they are storing

HDFS: Client Access

• The client accesses the filesystem (on behalf of the user) by communicating with the namenode and datanodes
• The client can use a filesystem interface similar to POSIX (Portable Operating System Interface), so the user code does not need to know about the namenode and datanodes to function properly

HDFS: Namenode Failure

• The namenode keeps track of the datanodes that have blocks of a distributed file assigned
• Without the namenode, the filesystem cannot be used
• If the computer running the namenode malfunctions, reconstruction of the files (from the blocks on the datanodes) would not be possible
• Files on the filesystem would be lost

HDFS: Namenode Failure Resilience

Namenode failure prevention schemes:
• Namenode file backup
• Secondary namenode

Hadoop 2.x release series HDFS reliability enhancements:
• HDFS Federation
• HDFS HA (High Availability)