CS621 Week 15

MapReduce is a software framework designed for parallel and distributed computing on large datasets, enabling tasks like data mining and statistical analysis. It operates in three phases: Map, Sort, and Reduce, and is commonly used in conjunction with Hadoop, which provides a scalable and fault-tolerant environment for data processing. The document also covers the Google File System (GFS) and Hadoop Distributed File System (HDFS), highlighting their roles in managing and processing large volumes of data.


Dr. Muhammad Anwaar Saeed
Dr. Said Nabi
Ms. Hina Ishaq

CS621 Parallel and Distributed Computing
MapReduce

CS621 Parallel and Distributed Computing

Objectives
What is MapReduce?
Usage of MapReduce.

What is MapReduce?

“MapReduce is a software framework which supports parallel and distributed computing on large data sets.”
MapReduce - Introduction

Simple data-parallel programming model designed for:
• Scalability
• Fault-tolerance

Pioneered by Google
• Processes 200 petabytes of data per day (updated 2022)

Popularized by the open-source Hadoop project
• Used at Yahoo!, Facebook, Amazon

What is MapReduce used for?
At Google
• Index construction for Google Search
• Article clustering for Google News
• Statistical machine translation

At Facebook
• Data mining
• Ad optimization
• Spam detection

At Yahoo!
• “Web map” powering Yahoo! Search
• Spam detection for Yahoo! Mail
MapReduce Usage in Research

In research, MapReduce has been used for:
• Astronomical image analysis (Washington)
• Ocean climate simulation (Washington)
• Bioinformatics (Maryland)
• Analyzing Wikipedia conflicts (PARC)
• Particle physics (Nebraska)
• Natural language processing (CMU)

How does MapReduce work?

MapReduce has three main phases:
• Map
• Sort
• Reduce
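
To make the three phases concrete, here is a minimal, single-machine Python sketch of the pipeline; the names run_mapreduce, map_fn and reduce_fn are illustrative only and not part of any Hadoop API.

from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, map_fn, reduce_fn):
    """Simulate the Map, Sort, and Reduce phases sequentially on one machine."""
    # Map phase: each input record is turned into zero or more (key, value) pairs.
    intermediate = []
    for record in records:
        intermediate.extend(map_fn(record))

    # Sort phase: intermediate pairs are sorted so equal keys become adjacent.
    intermediate.sort(key=itemgetter(0))

    # Reduce phase: each key and its list of values is reduced to a final result.
    results = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        values = [value for _, value in group]
        results.append(reduce_fn(key, values))
    return results
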
MapReduce Overview
MapReduce: Examples

CS621 Parallel and Distributed Computing

Objectives
MapReduce example based on three phases.
MapReduce example based on five processing stages.

MapReduce Example
(based on Three Phases)

The canonical MapReduce example: Word Count

• Example corpus:
Jane likes toast with jam
Joe likes toast
Joe burnt the toast
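
Below is a minimal Python sketch of word count over this corpus, run locally rather than on a cluster; it simply walks through the Map, Sort and Reduce phases from the previous slides.

from collections import defaultdict

corpus = [
    "Jane likes toast with jam",
    "Joe likes toast",
    "Joe burnt the toast",
]

# Map: emit a (word, 1) pair for every word in every line.
pairs = [(word, 1) for line in corpus for word in line.split()]

# Sort/group: collect all values that share the same key.
grouped = defaultdict(list)
for word, count in sorted(pairs):
    grouped[word].append(count)

# Reduce: sum the counts for each word.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)  # e.g. 'toast' maps to 3, 'Joe' and 'likes' to 2
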
MapReduce: Map (Slow Motion)
MapReduce: Sort (Slow Motion)
MapReduce: Reduce (Slow Motion)

MapReduce logical data flow in 5 processing stages over successive (key, value) pairs.
MapReduce logical data flow in 5 processing stages: Example
MapReduce Actual Data and Computation

Data and Control Flow

The main responsibility of the MapReduce framework is to efficiently run a user’s program on a distributed computing system. Therefore, the MapReduce framework meticulously handles all processing steps, such as:
• Data partitioning and distribution
• Reading the input data
• Determining where and when to run the Map and Reduce functions
• Sorting and grouping of intermediate (key, value) pairs
• Combining
• Communication
• Synchronization
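
As a rough illustration of the data-partitioning step, the sketch below hashes each intermediate key to one of the reduce tasks. Hadoop's default partitioner behaves analogously (using the key's hash code), but the exact hashing shown here is only an assumption for illustration.

def partition(key, num_reducers):
    """Assign an intermediate key to one of num_reducers reduce tasks."""
    # hash() is Python's built-in; a real framework would use its own hash of the key.
    return hash(key) % num_reducers

# All pairs with the same key land on the same reducer,
# so each reducer sees every value for the keys it owns.
pairs = [("toast", 1), ("jam", 1), ("toast", 1)]
for key, value in pairs:
    print(key, "-> reducer", partition(key, num_reducers=4))
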
MapReduce Design Goals

Scalability to large data volumes:
• 1000s of machines, 10,000s of disks

Cost-efficiency:
• Commodity machines (cheap, but unreliable)
• Commodity network
• Automatic fault-tolerance (fewer administrators)
• Easy to use (fewer programmers)
Hadoop

CS621 Parallel and Distributed Computing

Objectives
Introduction to Hadoop.
Key functions of Hadoop.

What is Hadoop?

“An open-source platform for distributed processing of large data. Hadoop provides a simplified programming model that makes it easy to write distributed algorithms.”
Key functions of Hadoop

Hadoop functions:
• Distribution of data and processing across machines
• Management of the cluster
Hadoop scalability

Hadoop can reach massive scalability by exploiting a simple distribution architecture and coordination model.

Huge clusters can be built using (cheap) commodity hardware:
• A 1000-CPU machine would be much more expensive than 1000 single-CPU or 250 quad-core machines.

Clusters can easily scale up with little or no modification to the programs.

Hadoop Components

HDFS: Hadoop Distributed File System
• Abstraction of a file system over a cluster
• Stores large amounts of data by transparently spreading it over different machines

MapReduce
• Simple programming model that enables parallel execution of data processing programs
• Executes the work on the data, near the data

In a nutshell: HDFS places the data on the cluster and MapReduce does the processing work.

Hadoop Principle

• Hadoop is basically a middleware platform that manages a cluster of machines
• The core component is a distributed file system (HDFS)
• Files in HDFS are split into blocks that are scattered over the cluster
• The cluster can grow indefinitely simply by adding new nodes

Hadoop Components

[Diagram] The Hadoop stack: MapReduce running on top of HDFS.

Hadoop and MapReduce

• MR works on (big) files loaded on HDFS
• Each node in the cluster executes the MR program in parallel, applying the map and reduce phases on the HDFS blocks it stores
• Output is written back to HDFS
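
As a sketch of what the per-node program can look like, the following word-count mapper and reducer follow the Hadoop Streaming convention (read lines from standard input, write tab-separated key/value pairs to standard output); job submission details and HDFS paths are omitted, and in a real Streaming job the two functions would usually live in separate scripts.

import sys
from itertools import groupby

def mapper(lines):
    # Streaming-style mapper: emit "word<TAB>1" for each word read from stdin.
    for line in lines:
        for word in line.split():
            print(f"{word}\t1")

def reducer(lines):
    # Streaming-style reducer: input arrives sorted by key, so consecutive
    # lines with the same word can be summed directly.
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        print(f"{word}\t{total}")

if __name__ == "__main__":
    # Run as "python wordcount.py map" or "python wordcount.py reduce".
    if sys.argv[1] == "map":
        mapper(sys.stdin)
    else:
        reducer(sys.stdin)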


Hadoop: Good and Bad

Good for:
• Repetitive tasks on big data

Not good for:
• Replacing an RDBMS
• Complex processing requiring various phases and/or iterations
• Processing small to medium size data

GFS: Google File System

CS621 Parallel and Distributed Computing

Objectives
Introduction to GFS.
GFS working process.

GFS: Google File System

• “GFS was built primarily as the fundamental storage service for Google’s search engine.
• As the size of the web data that was crawled and saved was quite substantial, Google needed a distributed file system to redundantly store massive amounts of data on cheap and unreliable computers.”
Why GFS?

Component failures
• Component failures are the norm, not the exception

Files are huge
• By traditional standards (many TB)
• Typically 1000 nodes & 300 TB

Most mutations are appends
• Not random-access overwrites

Co-designing apps & file system
• GFS was co-designed with the applications using it

GFS: Design Assumptions

• Must monitor & recover from component failures
• Modest number of large files
• Workload:
  • Large streaming reads + small random reads
  • Many large sequential writes
• Need semantics for concurrent appends
• High sustained bandwidth (more important than low latency)

GFS: Interface

Familiar:
• Create, delete, open, close, read, write

Novel:
• Snapshot (low cost)
• Record append (atomicity with multiple concurrent writes)

GFS: Architecture
GFS: Architecture details

CS621 Parallel and Distributed Computing

Objectives
What are the functions of the GFS architecture components?
GFS implementation.

GFS Architecture: Master
• Stores all metadata
  • Namespace
  • Access-control information
  • Chunk locations
  • ‘Lease’ management
• Heartbeats
• Having one master → global knowledge
  • Allows better placement / replication
  • Simplifies design

GFS Architecture: Chunk Servers

• Store all files in fixed-size chunks
  • 64 MB each
  • 64-bit unique handle
  • Triple redundancy

GFS Architecture (client read flow)

• Contact the single master
• Obtain chunk locations
• Contact one of the chunk servers
• Obtain the data
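
GFS has no public API, so the following Python sketch is purely illustrative: the Master and ChunkServer classes are hypothetical stand-ins used to trace the read path described above, including how a byte offset maps to a chunk index with 64 MB chunks.

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB fixed chunk size

class Master:
    """Hypothetical stand-in for the GFS master (metadata only)."""
    def __init__(self, chunk_table):
        self.chunk_table = chunk_table  # (path, chunk_index) -> (handle, replicas)

    def lookup(self, path, chunk_index):
        return self.chunk_table[(path, chunk_index)]

class ChunkServer:
    """Hypothetical stand-in for a chunk server holding chunk data."""
    def __init__(self, chunks):
        self.chunks = chunks  # handle -> bytes

    def read(self, handle, offset_in_chunk, length):
        return self.chunks[handle][offset_in_chunk:offset_in_chunk + length]

def gfs_read(master, path, offset, length):
    # 1. Translate the byte offset into a chunk index within the file.
    chunk_index = offset // CHUNK_SIZE
    # 2. Ask the single master for the chunk handle and replica locations.
    handle, replicas = master.lookup(path, chunk_index)
    # 3. Read directly from one of the chunk servers; file data never flows
    #    through the master.
    return replicas[0].read(handle, offset % CHUNK_SIZE, length)

server = ChunkServer({"h1": b"hello world"})
master = Master({("/demo.txt", 0): ("h1", [server])})
print(gfs_read(master, "/demo.txt", 6, 5))  # b'world'
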
GFS Architecture: Master → Metadata

The master stores three types of metadata:
• File & chunk namespaces
• Mapping from files → chunks
• Location of chunk replicas

Stored in memory
Kept persistent through logging
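
A rough sketch, assuming hypothetical structure names, of how these three kinds of metadata might be held in the master's memory; it mirrors the description above, not Google's actual implementation.

from dataclasses import dataclass, field

@dataclass
class MasterMetadata:
    # 1. File and chunk namespaces.
    namespace: set = field(default_factory=set)
    # 2. Mapping from each file to the ordered list of its chunk handles.
    file_to_chunks: dict = field(default_factory=dict)
    # 3. Location of each chunk's replicas (refreshed via chunkserver heartbeats).
    chunk_locations: dict = field(default_factory=dict)

meta = MasterMetadata()
meta.namespace.add("/logs/web-00")
meta.file_to_chunks["/logs/web-00"] = ["chunk-0001", "chunk-0002"]
meta.chunk_locations["chunk-0001"] = ["cs-a", "cs-b", "cs-c"]  # triple redundancy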


GFS Architecture: Master Operations
• Replica placement
• New chunk and replica creation
• Load balancing
• Unused storage reclamation

GFS: Consistency Model

All file namespace mutations are atomic
• Handled exclusively by the master

The status of a file region can be:
• Consistent: all clients see the same data
• Defined: all clients see the same data, which includes the entirety of the last mutation
• Undefined but consistent: all clients see the same data, but it may not reflect what any one mutation has written
• Inconsistent

GFS: Leases and Mutation Order

• The master uses leases to maintain a consistent mutation order among replicas
• The primary is the chunkserver that is granted a chunk lease
• All other chunkservers holding replicas are secondaries
• The primary defines a serial order for mutations
• All secondaries follow this order
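
The sketch below illustrates the idea with hypothetical PrimaryReplica and SecondaryReplica classes: the primary assigns serial numbers to incoming mutations and every secondary applies them in that same order; it is a simplification, not GFS code.

class PrimaryReplica:
    """Hypothetical primary: holds the chunk lease and fixes the mutation order."""
    def __init__(self, secondaries):
        self.serial = 0
        self.applied = []
        self.secondaries = secondaries

    def apply(self, mutation):
        # Assign the next serial number, apply locally, then forward the
        # (serial, mutation) pair so every secondary replays the same order.
        self.serial += 1
        self.applied.append((self.serial, mutation))
        for secondary in self.secondaries:
            secondary.apply(self.serial, mutation)

class SecondaryReplica:
    """Hypothetical secondary: applies mutations strictly in the primary's order."""
    def __init__(self):
        self.applied = []

    def apply(self, serial, mutation):
        self.applied.append((serial, mutation))

s1, s2 = SecondaryReplica(), SecondaryReplica()
primary = PrimaryReplica([s1, s2])
primary.apply("append record A")
primary.apply("append record B")
print(s1.applied == s2.applied == primary.applied)  # True: identical order everywhere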


GFS Write Control & Dataflow

• Applying mutations in the same order keeps the replicas identical
• A file region may end up containing mingled fragments from different clients (consistent but undefined)

GFS: Limitations

• Custom designed
• Only viable in a specific environment
• Limited security

HDFS: Hadoop Distributed File System

CS621 Parallel and Distributed Computing

Objectives
Introduction to HDFS.
HDFS blocks and nodes.

HDFS: Background

• At Google, MapReduce operations are run on a special file system called the Google File System (GFS) that is highly optimized for this purpose.
• GFS is not open source.
• Doug Cutting and Yahoo! reverse engineered GFS and called the result the Hadoop Distributed File System (HDFS).
• The software framework that supports HDFS, MapReduce and other related entities is called the Hadoop project, or simply Hadoop.
• Hadoop is open source and distributed by Apache.

HDFS: Basic Features

• Highly fault-tolerant
• High throughput
• Suitable for applications with large data sets
• Streaming access to file system data
• Can be built out of commodity hardware

HDFS: Basic Features

• HDFS was designed for optimal performance with a WORM (Write Once, Read Many times) access pattern
• HDFS is designed to run on clusters of general-purpose computers & servers from multiple vendors

HDFS: Blocks

• Files in HDFS are divided into block-sized chunks
  • 64 MB default block size
• A block is the minimum amount of data that HDFS can read or write
• Blocks simplify the storage and replication process
  • Provides fault tolerance & processing speed enhancement for larger files
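
For example, with the 64 MB default block size a 200 MB file occupies four blocks (three full 64 MB blocks plus one 8 MB block); the small sketch below works this out (blocks_for is an illustrative helper, not an HDFS API).

import math

BLOCK_SIZE_MB = 64  # default HDFS block size in this material

def blocks_for(file_size_mb):
    """Number of HDFS blocks needed to store a file of the given size."""
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)

print(blocks_for(200))  # 4 blocks: 64 + 64 + 64 + 8 MB
print(blocks_for(50))   # 1 block
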
HDFS: Nodes

HDFS clusters use 2 types of nodes:
• Namenode (master node)
• Datanode (worker node)

HDFS: Nodes

Namenode:
• Manages the file system namespace
• Keeps track of the datanodes that have blocks of a distributed file assigned
• Maintains the file system tree and the metadata for all the files and directories in the tree
• Stores this metadata on the local disk using 2 file forms:
  • Namespace image
  • Edit log
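
A very rough sketch, using hypothetical names, of how a namespace image plus an edit log reconstruct the namespace: the image is a checkpoint, and the edit log replays the changes recorded since that checkpoint.

def load_namespace(image, edit_log):
    """Illustrative only: rebuild the in-memory namespace at startup."""
    # Start from the last checkpoint (the namespace image)...
    namespace = dict(image)
    # ...then replay every operation recorded since that checkpoint.
    for op, path, value in edit_log:
        if op == "create":
            namespace[path] = value
        elif op == "delete":
            namespace.pop(path, None)
    return namespace

image = {"/user/data/part-0": {"blocks": ["blk_1", "blk_2"]}}
edits = [("create", "/user/data/part-1", {"blocks": ["blk_3"]}),
         ("delete", "/user/data/part-0", None)]
print(load_namespace(image, edits))  # only /user/data/part-1 remains
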
HDFS: Namenode

• The namenode holds the filesystem metadata in its memory
• The namenode’s memory size therefore limits the number of files in a filesystem

But then, what is metadata?

HDFS: Metadata

• Analogous to the traditional concept of library card catalogs
• Categorizes and describes the contents and context of the data files
• Maximizes the usefulness of the original data file by making it easy to find and use

HDFS: Metadata Types

Structural metadata
• Focuses on the data structure’s design and specification

Descriptive metadata
• Focuses on the individual instances of application data or the data content

HDFS: Datanodes

• The workhorses of the filesystem
• Store and retrieve blocks when requested by the client or the namenode
• Periodically report back to the namenode with lists of the blocks they are storing

HDFS: Client Access

• The client accesses the filesystem (on behalf of the user) by communicating with the namenode and datanodes
• The client can use a filesystem interface similar to POSIX (Portable Operating System Interface), so the user code does not need to know about the namenode and datanodes to function properly

HDFS: Namenode Failure

• The namenode keeps track of the datanodes that have blocks of a distributed file assigned
• Without the namenode, the filesystem cannot be used
• If the computer running the namenode malfunctions, reconstruction of the files (from the blocks on the datanodes) would not be possible
• Files on the filesystem would be lost

HDFS: Namenode Failure Resilience

Namenode failure prevention schemes:
• Namenode file backup
• Secondary namenode

Hadoop 2.x release series HDFS reliability enhancements:
• HDFS Federation
• HDFS HA (High Availability)