
MapReduce: Simplified Data Processing on Large Clusters


Jeffrey Dean and Sanjay Ghemawat
Google, Inc.
OSDI ’04: 6th Symposium on Operating
Systems Design and Implementation
What Is It?
• “. . . A programming model and an
associated implementation for processing
and generating large data sets.”
• Google's version runs on a typical Google cluster: a large number of commodity machines, switched Ethernet, and inexpensive disks attached directly to each machine in the cluster.
Motivation
• Data-intensive applications
• Huge amounts of data, fairly simple
processing requirements, but …
• For efficiency, parallelize
• MapReduce is designed to simplify
parallelization and distribution so
programmers don’t have to worry about
details.
Advantages of Parallel Programming
• Improves performance and efficiency.
• Divide processing into several parts which
can be executed concurrently.
• Each part can run simultaneously on different CPUs in a single machine, or on CPUs in a set of computers connected via a network.
Programming Model
• The model is “inspired by” Lisp primitives
map and reduce.
• map applies the same operation to several
different data items; e.g.,
(mapcar #'abs '(3 -4 2 -5))=>(3 4 2 5)
• reduce applies a single operation across a set of values to produce one result; e.g.,
(reduce #'+ '(3 4 2 5)) => 14
Programming Model
• MapReduce was developed by Google to
process large amounts of raw data, for
example, crawled documents or web
request logs.
• There is so much data it must be
distributed across thousands of machines
in order to be processed in a reasonable
time.
Programming Model
• Input & Output: a set of key/value pairs
• The programmer supplies two functions:
• map (in_key, in_val) =>
list(intermediate_key,intermed_val)
• reduce (intermediate_key,
list-of(intermediate_val)) =>
list(out_val)
• The program takes a set of input key/value pairs
and merges all the intermediate values for a
given key into a smaller set of final values.
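Rendered as Python type hints, the two user-supplied functions look roughly like this (a purely illustrative sketch; the alias names are not from the paper, and keys/values are shown as strings for simplicity):

from typing import Callable, Iterable, List, Tuple

# map: (in_key, in_val) -> list of (intermediate_key, intermediate_val)
MapFn = Callable[[str, str], List[Tuple[str, str]]]
# reduce: (intermediate_key, list of intermediate_val) -> list of out_val
ReduceFn = Callable[[str, Iterable[str]], List[str]]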
Example: Count occurrences of words in a
set of files
• Map function: for each word in each file, count
occurrences
• Input_key: file name; Input_value: file contents
• Intermediate results: for each file, a list of words
and frequency counts
– out_key = a word; int_value = word count in this file
• Reduce function: for each word, sum its
occurrences over all files
• Input key: a word; Input value: a list of counts
• Final results: A list of words, and the number of
occurrences of each word in all the files.
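A minimal Python sketch of the word-count example described above (function names are illustrative; per the slide, the map side counts words within one file and the reduce side sums those per-file counts):

from collections import Counter

def wc_map(file_name, file_contents):
    # emit (word, count-in-this-file) pairs for one input file
    return list(Counter(file_contents.split()).items())

def wc_reduce(word, per_file_counts):
    # sum this word's counts across all files
    return [sum(per_file_counts)]

# wc_map("a.txt", "the cat sat on the mat")
#   -> [('the', 2), ('cat', 1), ('sat', 1), ('on', 1), ('mat', 1)]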
Other Examples
• Distributed Grep: find all occurrences of a
pattern supplied by the programmer
– Input: the pattern and set of files
• key = pattern (regexp), data = a file name
– Map function: greps the file for the pattern
– Intermediate results: lines in which the pattern
appeared, keyed to files
• key = file name, data = line
– Reduce function is the identity function:
passes on the intermediate results
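A hedged Python sketch of the distributed-grep map and reduce, following the slide's convention that the map input key is the pattern and the value is a file name (reading a local file stands in for reading the data from GFS):

import re

def grep_map(pattern, file_name):
    # emit (file_name, line) for each line of the file matching the pattern
    with open(file_name) as f:
        return [(file_name, line.rstrip("\n"))
                for line in f if re.search(pattern, line)]

def grep_reduce(file_name, matching_lines):
    # identity reduce: pass the matched lines through unchanged
    return matching_lines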
Other Examples
• Count URL Access Frequency
– Map function: counts the requests for each URL in a log of requests
• input value: a request log; intermediate key: a URL
– Intermediate results: (URL, count) pairs for this log
– Reduce function: combines the counts for each URL across all logs and emits (URL, total_count)
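A similar sketch for URL access frequency (the log format here, one request per line beginning with the URL, is an assumption made purely for illustration):

def url_map(log_name, log_contents):
    # emit (URL, 1) for each request recorded in this log
    return [(line.split()[0], 1)
            for line in log_contents.splitlines() if line.strip()]

def url_reduce(url, counts):
    # add up the per-log counts and emit (URL, total_count)
    return [(url, sum(counts))]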
Implementation
• More than one way to implement
MapReduce, depending on environment
• Google chooses to use the same
environment that it uses for the GFS: large
(~1000 machines) clusters of PCs with
attached disks, based on 100 megabit/sec
or 1 gigabit/sec Ethernet.
• Batch environment: user submits job to a
scheduler (Master)
Implementation
• Job scheduling:
– User submits job to scheduler (one program
consists of many tasks)
– scheduler assigns tasks to machines.
General Approach
• The MASTER:
– initializes the problem; divides it up among a
set of workers
– sends each worker a portion of the data
– receives the results from each worker
• The WORKER:
– receives data from the master
– performs processing on its part of the data
– returns results to master
Overview
• The Map invocations are distributed
across multiple machines by automatically
partitioning the input data into a set of M
splits or shards.
• The worker process parses the input to
identify the key/value pairs and passes
them to the Map function (defined by the
programmer).
Overview
• The input shards can be processed in
parallel on different machines.
– It’s essential that the Map function be able to
operate independently – what happens on
one machine doesn’t depend on what
happens on any other machine.
• Intermediate results are stored on local
disks, partitioned into R regions as
determined by the user’s partitioning
function. (R <= # of output keys)
Overview
• The number of partitions (R) and the
partitioning function are specified by the user.
• Map workers notify Master of the location of the
intermediate key-value pairs; the master
forwards the addresses to the reduce workers.
• Reduce workers use RPC to read the data
remotely from the map workers and then
process it.
• Each reduction takes all the values associated with a single key and reduces them to one or more results.
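To make the M-split / R-partition flow concrete, the following purely local Python sketch runs the map phase over the input splits, partitions the intermediate pairs with the default partitioning function hash(key) mod R (the one detail here taken from the paper), groups values by key, and runs the reduce phase, all in one process with no RPC, GFS, or parallelism:

from collections import defaultdict

def partition(key, R):
    # default partitioning function: hash(key) mod R
    return hash(key) % R

def run_local_mapreduce(input_splits, user_map, user_reduce, R=3):
    # map phase: each split's intermediate pairs land in one of R regions
    regions = [defaultdict(list) for _ in range(R)]
    for in_key, in_value in input_splits:
        for k, v in user_map(in_key, in_value):
            regions[partition(k, R)][k].append(v)
    # reduce phase: one reduce task per region; every key's values are
    # handed to the user's reduce function together
    output = {}
    for region in regions:
        for k, values in region.items():
            output[k] = user_reduce(k, values)
    return output

# e.g., with the word-count functions sketched earlier:
# run_local_mapreduce([("a.txt", "the cat"), ("b.txt", "the dog")],
#                     wc_map, wc_reduce)  ->  {'the': [2], 'cat': [1], 'dog': [1]}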
Example
• In the word-count app, a worker emits a
list of word-frequency pairs; e.g. (a,
100), (an, 25), (ant, 1), …
• out_key = a word; value = word count
for some file
• All the results for a given out_key are
passed to a reduce worker for the next
processing phase.
Overview
• Final results are appended to an output file
that is part of the global file system.
• When all map and reduce tasks are done, the master wakes up the user program and the MapReduce call returns.
Fault Tolerance
• Important: because MapReduce relies on hundreds or even thousands of machines, failures are inevitable.
• Periodically, the master pings workers.
• Workers that don’t respond in a pre-
determined amount of time are considered
to have failed.
• Any map task or reduce task in progress
on a failed worker is reset to idle and
becomes eligible for rescheduling.
Fault Tolerance
• Any map tasks completed by the worker are
reset to idle state, and are eligible for
scheduling on other workers.
• Reason: since the results are stored on the
disk of the failed machine, they are
inaccessible.
• Completed reduce tasks on failed machines
don’t need to be redone because output
goes to a global file system.
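The re-execution rules on these two slides can be summarized in a small Python sketch (illustrative only; the paper gives no code, and the Task class here is invented for the example):

from dataclasses import dataclass

@dataclass
class Task:
    kind: str        # "map" or "reduce"
    state: str       # "idle", "in-progress", or "completed"
    worker: object   # the worker the task is/was assigned to, or None

def handle_worker_failure(failed_worker, tasks):
    for task in tasks:
        if task.worker is not failed_worker:
            continue
        if task.kind == "map":
            # map output sits on the failed machine's local disk, so both
            # in-progress and completed map tasks become eligible again
            task.state = "idle"
        elif task.state == "in-progress":
            # completed reduce output is already in the global file system,
            # so only in-progress reduce tasks are reset
            task.state = "idle"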
Failure of the Master
• Regular checkpoints of all the Master’s
data structures would make it possible to
roll back to a known state and start again.
• However, since there is only one master, its failure is highly unlikely, so the current approach is simply to abort the computation if the master fails.
Locality
• Recall Google File system implementation:
• Files are divided into 64MB blocks, each replicated on (typically) three machines.
• The Master knows the location of data and
tries to schedule map operations on
machines that have the necessary input.
Or, if that’s not possible, schedule on a
nearby machine to reduce network traffic.
Task Granularity
• Map phase is subdivided into M pieces
and the reduce phase into R pieces.
• Objective: M and R should be much larger than the number of worker machines.
– Improves dynamic load balancing
– Speeds up recovery in case of failure; a failed machine's many completed map tasks can be spread out across all other workers.
Task Granularity
• Practical limits on size of M and R:
– Master must make O(M + R) scheduling decisions and keep O(M * R) pieces of state in memory
– Users typically restrict size of R, because the
output of each reduce worker goes to a
different output file
– Authors say they “often” set M = 200,000 and
R = 5,000. Number of workers = 2,000.
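– Worked example: with M = 200,000 and R = 5,000, that is M + R = 205,000 scheduling decisions and M * R = 10^9 map-task/reduce-task pairs of state (the paper notes this state is roughly one byte per pair, i.e., on the order of a gigabyte of master memory).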
“Stragglers”
• A machine that takes a long time to finish
its last few map or reduce tasks.
– Causes: bad disk (slows read ops), other
tasks are scheduled on the same machine,
etc.
– Solution: when the computation is near completion, schedule backup executions of the remaining in-progress tasks on machines that have finished their own work. Use the result from whichever copy, original or backup, finishes first.
Experience
• Google used MapReduce to rewrite the indexing
system that constructs the Google search engine
data structures.
• Input: GFS documents retrieved by the web
crawlers – about 20 terabytes of data.
• Benefits
– Simpler, smaller, more readable indexing code
– Many problems, such as machine failures, are dealt
with automatically by the MapReduce library.
Conclusions
• Easy to use. Programmers are shielded
from the problems of parallel processing
and distributed systems.
• Can be used for many classes of problems, including generating data for the search engine, sorting, data mining, machine learning, and others.
• Scales to clusters consisting of 1000’s of
machines
• But ….
Not everyone agrees that MapReduce is
wonderful!
• The database community believes parallel
database systems are a better solution.
