MapReduce:
Simplified Data Processing on Large Clusters
By Jeffrey Dean and Sanjay Ghemawat
Presented by Areej Qasrawi
OUTLINE
1. Introduction
2. Programming Model
3. Implementation
4. Refinements
5. Performance
6. Experience and Conclusion
7. Review
INTRODUCTION
o Many large-scale data processing tasks consist of:
o Computations that process large amounts of raw data and produce large amounts of derived data.
o Because the input data is so massive, the computation must be distributed across hundreds or thousands of machines to finish in a
reasonable period of time.
o Google has built many special-purpose computations over data such as crawled documents and web request logs; each one has to parallelize the
computation, distribute the data, and handle failures.
o But this special-purpose code is complex and hard to write and maintain.
o Jeffrey Dean and Sanjay Ghemawat introduced MapReduce, which simplifies data
processing by hiding the messy details of parallelization, fault tolerance, data distribution and
load balancing in a library.
o What is MapReduce?
A programming model and approach for processing large data sets.
Contains Map and Reduce functions.
Runs on a large cluster of commodity machines.
Many real world tasks are expressible in this model.
o MapReduce provides:
User-defined functions
Automatic parallelization and distribution
Fault-tolerance
I/O scheduling
Status and monitoring
INTRODUCTION CONT…
o Input & output are sets of key/value pairs
o Programmer specifies two functions:
▪ map (in_key, in_value) -> list(out_key,
intermediate_value)
Processes input key/value pair
Produces set of intermediate pairs
▪ reduce (out_key, list(intermediate_value)) ->
list(out_value)
Combines all intermediate values for a particular key
Produces a set of merged output values (in most cases, just one)
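A minimal Python sketch of these two functions, using the word-count example from the paper; the function names, type hints, and generator-style emission are illustrative choices, not the paper's C++ API:

```python
from typing import Iterator, List, Tuple

def word_count_map(in_key: str, in_value: str) -> Iterator[Tuple[str, int]]:
    """map(in_key, in_value) -> list(out_key, intermediate_value).
    in_key is a document name, in_value is the document contents."""
    for word in in_value.split():
        yield (word, 1)                 # one intermediate pair per word occurrence

def word_count_reduce(out_key: str, values: List[int]) -> List[int]:
    """reduce(out_key, list(intermediate_value)) -> list(out_value).
    Combines all counts emitted for one word into a single total."""
    return [sum(values)]                # usually exactly one output value per key
```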
PROGRAMMING MODEL
PROGRAMMING MODEL
o Word Count Example (pipeline diagram)
• Input files (e.g., input file1, input file2) are split; each line is passed to an individual mapper instance
• Map: emits intermediate (key, value) pairs
• Sort and Shuffle: intermediate pairs are grouped by key
• Reduce: merges the values for each key
• Final Output: written to the output file
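A toy, single-process simulation of the pipeline in this diagram (splitting into documents, mapping, shuffling by key, reducing), reusing the word_count_map and word_count_reduce sketches above; a real MapReduce run distributes these phases across many machines:

```python
from collections import defaultdict

def run_word_count(documents: dict) -> dict:
    """Toy in-memory stand-in for the distributed map/shuffle/reduce pipeline."""
    # Map phase: every (name, contents) pair goes through the map function.
    intermediate = []
    for name, contents in documents.items():
        intermediate.extend(word_count_map(name, contents))

    # Shuffle/sort phase: group intermediate values by key.
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)

    # Reduce phase: one reduce call per distinct key.
    return {key: word_count_reduce(key, values)[0] for key, values in grouped.items()}

# Example: run_word_count({"file1": "the quick fox", "file2": "the lazy dog"})
# -> {"the": 2, "quick": 1, "fox": 1, "lazy": 1, "dog": 1}
```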
PROGRAMMING MODEL
More Examples …
Distributed Grep
The map function emits a line if it matches a supplied pattern
Count of URL access frequency.
The map function processes logs of web page requests and outputs <URL, 1>
Reverse web-link graph
The map function outputs <target, source> pairs for each link to a target URL found in a page named source
Term-Vector per Host
A term vector summarizes the most important words in a document or set of documents as a list
of (word, frequency) pairs; the map function emits a (hostname, term vector) pair per document, and the reduce function merges the vectors per host
Inverted Index
The map function parses each document, and emits a sequence of (word, document ID) pairs
Distributed Sort
The map function extracts the key from each record, and emits a (key, record) pair
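As one concrete illustration, a sketch of the inverted-index example in Python; the deduplication in the reduce step is an added assumption, since the paper only says the document IDs are sorted:

```python
from typing import Iterator, List, Tuple

def inverted_index_map(doc_id: str, contents: str) -> Iterator[Tuple[str, str]]:
    """Emit a (word, document ID) pair for every word occurrence in the document."""
    for word in contents.split():
        yield (word, doc_id)

def inverted_index_reduce(word: str, doc_ids: List[str]) -> List[List[str]]:
    """Sort (and, here, also deduplicate) the IDs of all documents containing the word."""
    return [sorted(set(doc_ids))]       # deduplication is an illustrative choice
```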
➢ Many different implementations are possible
➢ The right choice depends on the environment.
➢ Typical cluster (in wide use at Google: large clusters of commodity PCs connected
via switched Ethernet):
• Hundreds to thousands of dual-processor x86 machines running Linux, with 2-4 GB of
memory per machine
• Connected with commodity networking hardware; limited bisection bandwidth
• Storage on inexpensive local IDE disks
• GFS, a distributed file system, manages the data
• A scheduling system lets users submit jobs (a job is a set of tasks, mapped by the
scheduler to the available machines in the cluster)
➢ Implemented as a C++ library and linked into user programs
IMPLEMENTATION
Execution Overview
➢ Map
• Divide the input into M equal-sized splits
• Each split is 16-64 MB
➢ Reduce
• Partitioning intermediate key space into R pieces
• hash(intermediate_key) mod R
➢ Typical setting:
• 2,000 machines
• M = 200,000
• R = 5,000
IMPLEMENTATION...
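A small sketch of this partitioning step; CRC32 is used here only as a stand-in for whatever hash function the real library uses, and the value of R is the "typical setting" above:

```python
import zlib

R = 5_000   # number of reduce tasks / output partitions (typical setting above)

def partition(intermediate_key: str, num_partitions: int = R) -> int:
    """hash(intermediate_key) mod R: assign an intermediate key to one reduce region."""
    return zlib.crc32(intermediate_key.encode("utf-8")) % num_partitions

# Every pair with the same key lands in the same region, so a single reduce task
# sees all intermediate values for that key.
```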
(Figure: execution overview)
• (0) The user program calls mapreduce(spec, &result)
• The input is divided into M splits of 16-64 MB each
• The partitioning function hash(intermediate_key) mod R divides the intermediate key space into R regions
• Each reduce worker reads all the intermediate data for its region and sorts it by intermediate keys
Execution Overview… IMPLEMENTATION…
Fault Tolerance
➢ Worker failure: handled through re-execution
• Detect failure via periodic heartbeats
• Re-execute completed and in-progress map tasks
• Why re-execute even the completed map tasks? Their output is stored on the failed machine's
local disk, so it is no longer accessible.
• Re-execute in-progress reduce tasks (completed reduce output is already in the global file system)
• Task completion is committed through the master
➢ Master failure:
• Could be handled (e.g., by checkpointing the master's state), but is not handled in the current
implementation, since master failure is unlikely
IMPLEMENTATION…
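An illustrative sketch (not the paper's actual master code) of the bookkeeping this implies; the Worker/Task structures and the 60-second timeout are assumptions:

```python
import time
from dataclasses import dataclass
from typing import List, Optional

HEARTBEAT_TIMEOUT = 60.0   # seconds of silence before a worker is presumed dead (assumed value)

@dataclass
class Worker:
    last_heartbeat: float

@dataclass
class Task:
    worker: Optional[Worker] = None
    state: str = "idle"        # "idle" | "in_progress" | "completed"

def handle_worker_failures(workers: List[Worker], map_tasks: List[Task]) -> None:
    """Reset map tasks owned by silent workers so the master can re-execute them."""
    now = time.time()
    dead = [w for w in workers if now - w.last_heartbeat > HEARTBEAT_TIMEOUT]
    for task in map_tasks:
        # Completed map output lives on the failed machine's local disk and is lost,
        # which is why completed map tasks are re-executed along with in-progress ones.
        if task.worker in dead and task.state in ("in_progress", "completed"):
            task.state, task.worker = "idle", None   # eligible for rescheduling
```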
Locality
➢ Master scheduling policy:
• Asks GFS for the locations of the replicas of the input file blocks
• The input is typically split into 64 MB pieces (the GFS block size)
• Map tasks are scheduled so that a replica of the input block is on the same machine or the
same rack
➢ As a result:
• Most tasks' input data is read locally and consumes no network bandwidth
IMPLEMENTATION…
Backup Tasks
➢ One of the common causes that lengthens the total time taken for
a MapReduce operation is a straggler: a machine that takes an unusually long time
to complete one of the last few tasks.
➢ MapReduce has a general mechanism to alleviate the problem of stragglers.
➢ When the operation is close to completion, the master schedules backup executions of the remaining
in-progress tasks; a task is marked complete whenever either the primary or the backup execution finishes.
➢ This significantly reduces the time to complete large
MapReduce operations.
IMPLEMENTATION…
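A sketch of the backup-task idea under assumed data structures; the 5% "close to completion" threshold is an invented heuristic, not a figure from the paper:

```python
from typing import Callable, Dict, List

def schedule_backups(tasks: List[Dict], assign_to_idle_worker: Callable[[Dict], None]) -> None:
    """Once the operation is close to completion, launch duplicate ("backup") executions
    of the remaining in-progress tasks; whichever copy finishes first wins."""
    remaining = [t for t in tasks if t["state"] == "in_progress"]
    close_to_done = remaining and len(remaining) <= max(1, len(tasks) // 20)  # assumed 5% heuristic
    if close_to_done:
        for task in remaining:
            if not task.get("backup_scheduled"):
                task["backup_scheduled"] = True
                assign_to_idle_worker(task)   # run a second copy on another machine
```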
• Different partitioning functions
• Users specify the number of reduce tasks/output files they desire (R)
• Combiner function
• Performs partial merging of intermediate data on the map worker; useful for saving network bandwidth
• Different input/output types
• Skipping bad records
• When a record causes repeated crashes, the master tells the next worker that picks up the task to skip that record
• Local execution
• An alternative implementation of the MapReduce library that sequentially executes all of the work for
a MapReduce operation on the local machine (useful for debugging)
• Status info
• Status pages show the progress of the computation and more
• Counters
• Count occurrences of various events (e.g., total number of words processed)
REFINEMENTS
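For the combiner refinement, a sketch using word count, where the combiner can simply perform the same summation as the reduce function (valid because addition is commutative and associative); names and types are illustrative:

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

def word_count_combiner(map_output: Iterable[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Partially merge (word, count) pairs on the map worker before the shuffle."""
    partial = defaultdict(int)
    for word, count in map_output:
        partial[word] += count
    # Far fewer pairs cross the network, e.g. ("the", 412) instead of 412 separate ("the", 1) pairs.
    return list(partial.items())
```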
Measure the performance of MapReduce on two
computations running on a large cluster of machines.
➢ Grep
• searches through approximately one terabyte of
data looking for a particular pattern
➢ Sort
• sorts approximately one terabyte of data
PERFORMANCE
➢ Cluster Configuration Specifications
▪ Cluster: 1,800 machines
▪ Memory: 4 GB per machine
▪ Processors: dual-processor 2 GHz Xeons with Hyper-Threading
▪ Hard disks: dual 160 GB IDE disks per machine
▪ Network bandwidth: Gigabit Ethernet per machine; approximately 100 Gbps aggregate bisection bandwidth
PERFORMANCE…
Grep
Computation
➢ Scans 10^10 (10 billion) 100-byte
records, about 1 TB of data, searching for a rare
3-character pattern
(which occurs in 92,337 records)
➢ The input is split into
approximately 64 MB pieces
(M = 15,000);
the entire output is placed in
one file (R = 1)
➢ Startup overhead is
significant for short jobs
(Figure: data transfer rate over time)
PERFORMANCE…
▪ Backup tasks improve completion time considerably: without backup tasks, the sort takes 44% longer.
▪ The system handles machine failures relatively quickly: with 200 worker processes killed, the sort takes only 5% longer than normal.
PERFORMANCE…
Sort Computation
(Figure: data transfer rates over time for different executions of the sort program)
➢ MapReduce has proven to be a useful abstraction
➢ Greatly simplifies large-scale computations at Google
➢ Fun to use: focus on problem, let library deal with messy
details
➢ No deep parallelization knowledge required
(the library relieves the user from dealing with low-level parallelization details)
Conclusions & Experience
Review
▪ Strong points
✓ The paper follows reasonable logical organization.
✓ The simplified programming model proposed in this paper opened up the
parallel computation field to general purpose programmers.
✓ The paper provides many simple examples of applications for MapReduce, and it
clearly lays out the steps for implementing MapReduce.
✓ The arguments and designs are straightforward.
Review …
▪ Weak points
✓ The framework described in the paper is rather myopic: the intermediate
data produced by the map tasks is not meant to be reused across jobs.
✓ When dealing with data at such a large scale, failures are inevitable, and re-execution of tasks can be costly.
✓ The rigid two-stage (map then reduce) structure limits more complex or iterative computations.
✓ The shuffle before the reduce phase requires a lot of network communication.
✓ The MapReduce framework proposed here does not address the
requirements of large-scale machine learning algorithm executions.
Thank you!