
MapReduce: Simplified Data Processing on Large Clusters


Jeffrey Dean and Sanjay Ghemawat
Google, Inc.
OSDI ’04: 6th Symposium on Operating
Systems Design and Implementation
What Is It?
• “. . . A programming model and an
associated implementation for processing
and generating large data sets.”
• Google's version runs on a typical Google cluster: a large number of commodity machines, switched Ethernet, and inexpensive disks attached directly to each machine in the cluster.
Motivation
• Data-intensive applications
• Huge amounts of data, fairly simple
processing requirements, but …
• For efficiency, parallelize
• MapReduce is designed to simplify
parallelization and distribution so
programmers don’t have to worry about
details.
Advantages of Parallel Programming
• Improves performance and efficiency.
• Divide processing into several parts which
can be executed concurrently.
• Each part can run simultaneously on different CPUs in a single machine, or on CPUs in a set of computers connected via a network.
Programming Model
• The model is “inspired by” Lisp primitives
map and reduce.
• map applies the same operation to several
different data items; e.g.,
(mapcar #'abs '(3 -4 2 -5))=>(3 4 2 5)
• reduce applies a single operation across a set of values to produce one result; e.g.,
(reduce #'+ '(3 4 2 5)) => 14
Programming Model
• MapReduce was developed by Google to
process large amounts of raw data, for
example, crawled documents or web
request logs.
• There is so much data it must be
distributed across thousands of machines
in order to be processed in a reasonable
time.
Programming Model
• Input & Output: a set of key/value pairs
• The programmer supplies two functions:
• map (in_key, in_val) =>
list(intermediate_key,intermed_val)
• reduce (intermediate_key,
list-of(intermediate_val)) =>
list(out_val)
• The program takes a set of input key/value pairs
and merges all the intermediate values for a
given key into a smaller set of final values.
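Rendered as Python type hints, the two user-supplied functions look roughly like this (a purely illustrative sketch; the alias names are not from the paper, and keys/values are shown as strings for simplicity):

from typing import Callable, Iterable, List, Tuple

# map: (in_key, in_val) -> list of (intermediate_key, intermediate_val)
MapFn = Callable[[str, str], List[Tuple[str, str]]]
# reduce: (intermediate_key, list of intermediate_val) -> list of out_val
ReduceFn = Callable[[str, Iterable[str]], List[str]]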
Example: Count occurrences of words in a
set of files
• Map function: for each word in each file, count
occurrences
• Input_key: file name; Input_value: file contents
• Intermediate results: for each file, a list of words
and frequency counts
– out_key = a word; int_value = word count in this file
• Reduce function: for each word, sum its
occurrences over all files
• Input key: a word; Input value: a list of counts
• Final results: A list of words, and the number of
occurrences of each word in all the files.
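A minimal Python sketch of the word-count example described above (function names are illustrative; per the slide, the map side counts words within one file and the reduce side sums those per-file counts):

from collections import Counter

def wc_map(file_name, file_contents):
    # emit (word, count-in-this-file) pairs for one input file
    return list(Counter(file_contents.split()).items())

def wc_reduce(word, per_file_counts):
    # sum this word's counts across all files
    return [sum(per_file_counts)]

# wc_map("a.txt", "the cat sat on the mat")
#   -> [('the', 2), ('cat', 1), ('sat', 1), ('on', 1), ('mat', 1)]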
Other Examples
• Distributed Grep: find all occurrences of a
pattern supplied by the programmer
– Input: the pattern and set of files
• key = pattern (regexp), data = a file name
– Map function: greps the file for the pattern
– Intermediate results: lines in which the pattern
appeared, keyed to files
• key = file name, data = line
– Reduce function is the identity function:
passes on the intermediate results
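A hedged Python sketch of the distributed-grep map and reduce, following the slide's convention that the map input key is the pattern and the value is a file name (reading a local file stands in for reading the data from GFS):

import re

def grep_map(pattern, file_name):
    # emit (file_name, line) for each line of the file matching the pattern
    with open(file_name) as f:
        return [(file_name, line.rstrip("\n"))
                for line in f if re.search(pattern, line)]

def grep_reduce(file_name, matching_lines):
    # identity reduce: pass the matched lines through unchanged
    return matching_lines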
Other Examples
• Count URL Access Frequency
– Map function: counts the requests for each URL in a log of requests
• input value: a request log; intermediate key: a URL
– Intermediate results: (URL, count) pairs for this log
– Reduce function: combines the counts for each URL across all logs and emits (URL, total_count)
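A similar sketch for URL access frequency (the log format here, one request per line beginning with the URL, is an assumption made purely for illustration):

def url_map(log_name, log_contents):
    # emit (URL, 1) for each request recorded in this log
    return [(line.split()[0], 1)
            for line in log_contents.splitlines() if line.strip()]

def url_reduce(url, counts):
    # add up the per-log counts and emit (URL, total_count)
    return [(url, sum(counts))]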
Implementation
• More than one way to implement
MapReduce, depending on environment
• Google chooses to use the same
environment that it uses for the GFS: large
(~1000 machines) clusters of PCs with
attached disks, based on 100 megabit/sec
or 1 gigabit/sec Ethernet.
• Batch environment: user submits job to a
scheduler (Master)
Implementation
• Job scheduling:
– User submits job to scheduler (one program
consists of many tasks)
– scheduler assigns tasks to machines.
General Approach
• The MASTER:
– initializes the problem; divides it up among a
set of workers
– sends each worker a portion of the data
– receives the results from each worker
• The WORKER:
– receives data from the master
– performs processing on its part of the data
– returns results to master
Overview
• The Map invocations are distributed
across multiple machines by automatically
partitioning the input data into a set of M
splits or shards.
• The worker process parses the input to
identify the key/value pairs and passes
them to the Map function (defined by the
programmer).
Overview
• The input shards can be processed in
parallel on different machines.
– It’s essential that the Map function be able to
operate independently – what happens on
one machine doesn’t depend on what
happens on any other machine.
• Intermediate results are stored on local
disks, partitioned into R regions as
determined by the user’s partitioning
function. (R <= # of output keys)
Overview
• The number of partitions (R) and the
partitioning function are specified by the user.
• Map workers notify Master of the location of the
intermediate key-value pairs; the master
forwards the addresses to the reduce workers.
• Reduce workers use RPC to read the data
remotely from the map workers and then
process it.
• Each reduction takes all the values associated with a single key and reduces them to one or more results.
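To make the M-split / R-partition flow concrete, the following purely local Python sketch runs the map phase over the input splits, partitions the intermediate pairs with the default partitioning function hash(key) mod R (the one detail here taken from the paper), groups values by key, and runs the reduce phase, all in one process with no RPC, GFS, or parallelism:

from collections import defaultdict

def partition(key, R):
    # default partitioning function: hash(key) mod R
    return hash(key) % R

def run_local_mapreduce(input_splits, user_map, user_reduce, R=3):
    # map phase: each split's intermediate pairs land in one of R regions
    regions = [defaultdict(list) for _ in range(R)]
    for in_key, in_value in input_splits:
        for k, v in user_map(in_key, in_value):
            regions[partition(k, R)][k].append(v)
    # reduce phase: one reduce task per region; every key's values are
    # handed to the user's reduce function together
    output = {}
    for region in regions:
        for k, values in region.items():
            output[k] = user_reduce(k, values)
    return output

# e.g., with the word-count functions sketched earlier:
# run_local_mapreduce([("a.txt", "the cat"), ("b.txt", "the dog")],
#                     wc_map, wc_reduce)  ->  {'the': [2], 'cat': [1], 'dog': [1]}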
Example
• In the word-count app, a worker emits a
list of word-frequency pairs; e.g. (a,
100), (an, 25), (ant, 1), …
• out_key = a word; value = word count
for some file
• All the results for a given out_key are
passed to a reduce worker for the next
processing phase.
Overview
• Final results are appended to an output file
that is part of the global file system.
• When all map and reduce tasks are done, the master wakes up the user program and the MapReduce call returns.
Fault Tolerance
• Important: because MapReduce relies on hundreds or even thousands of machines, failures are inevitable.
• Periodically, the master pings workers.
• Workers that don’t respond in a pre-
determined amount of time are considered
to have failed.
• Any map task or reduce task in progress
on a failed worker is reset to idle and
becomes eligible for rescheduling.
Fault Tolerance
• Any map tasks completed by the worker are
reset to idle state, and are eligible for
scheduling on other workers.
• Reason: since the results are stored on the
disk of the failed machine, they are
inaccessible.
• Completed reduce tasks on failed machines
don’t need to be redone because output
goes to a global file system.
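The re-execution rules on these two slides can be summarized in a small Python sketch (illustrative only; the paper gives no code, and the Task class here is invented for the example):

from dataclasses import dataclass

@dataclass
class Task:
    kind: str        # "map" or "reduce"
    state: str       # "idle", "in-progress", or "completed"
    worker: object   # the worker the task is/was assigned to, or None

def handle_worker_failure(failed_worker, tasks):
    for task in tasks:
        if task.worker is not failed_worker:
            continue
        if task.kind == "map":
            # map output sits on the failed machine's local disk, so both
            # in-progress and completed map tasks become eligible again
            task.state = "idle"
        elif task.state == "in-progress":
            # completed reduce output is already in the global file system,
            # so only in-progress reduce tasks are reset
            task.state = "idle"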
Failure of the Master
• Regular checkpoints of all the Master’s
data structures would make it possible to
roll back to a known state and start again.
• However, since there is only one master, its failure is highly unlikely, so the current approach is simply to abort the computation if the master fails.
Locality
• Recall Google File system implementation:
• Files are divided into 64MB blocks, each replicated on (typically) three machines.
• The Master knows the location of data and
tries to schedule map operations on
machines that have the necessary input.
Or, if that’s not possible, schedule on a
nearby machine to reduce network traffic.
Task Granularity
• Map phase is subdivided into M pieces
and the reduce phase into R pieces.
• Objective: M and R should be much larger than the number of worker machines.
– Improves dynamic load balancing
– Speeds up recovery in case of failure; a failed machine's many completed map tasks can be spread out across all other workers.
Task Granularity
• Practical limits on size of M and R:
– Master must make O(M + R) scheduling decisions and keep O(M * R) pieces of state in memory
– Users typically restrict size of R, because the
output of each reduce worker goes to a
different output file
– Authors say they “often” set M = 200,000 and
R = 5,000. Number of workers = 2,000.
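– Worked example: with M = 200,000 and R = 5,000, that is M + R = 205,000 scheduling decisions and M * R = 10^9 map-task/reduce-task pairs of state (the paper notes this state is roughly one byte per pair, i.e., on the order of a gigabyte of master memory).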
“Stragglers”
• A machine that takes a long time to finish
its last few map or reduce tasks.
– Causes: bad disk (slows read ops), other
tasks are scheduled on the same machine,
etc.
– Solution: when the computation is near completion, schedule backup executions of the remaining in-progress tasks on machines that have finished their own work. Use the result from whichever copy, original or backup, finishes first.
Experience
• Google used MapReduce to rewrite the indexing
system that constructs the Google search engine
data structures.
• Input: GFS documents retrieved by the web
crawlers – about 20 terabytes of data.
• Benefits
– Simpler, smaller, more readable indexing code
– Many problems, such as machine failures, are dealt
with automatically by the MapReduce library.
Conclusions
• Easy to use. Programmers are shielded
from the problems of parallel processing
and distributed systems.
• Can be used for many classes of problems, including generating data for the search engine, sorting, data mining, machine learning, and others.
• Scales to clusters consisting of 1000’s of
machines
• But ….
Not everyone agrees that MapReduce is
wonderful!
• The database community believes parallel
database systems are a better solution.
