MapReduce:
Simplified Data Processing on Large Clusters
By Jeffrey Dean and Sanjay Ghemawat
Presented by Areej Qasrawi
OUTLINE
1. Introduction
2. Programming Model
3. Implementation
4. Refinements
5. Performance
6. Experience and Conclusion
7. Review
INTRODUCTION
o Many large-scale data processing tasks consist of:
o Computations that process large amounts of raw data and produce large amounts of derived data.
o Because the input data is so massive, the computation must be distributed across hundreds or thousands of machines to finish in a
reasonable period of time.
o Google has built many special-purpose computations over data such as crawled documents and web request logs; each one has to parallelize the
computation, distribute the data, and handle failures.
o But this special-purpose code is complex and hard to write and maintain.
o Jeffrey Dean and Sanjay Ghemawat introduced MapReduce, which simplifies data
processing by hiding the messy details of parallelization, fault tolerance, data distribution and
load balancing in a library.
o What is MapReduce?
A programming model and approach for processing large data sets.
Contains Map and Reduce functions.
Runs on a large cluster of commodity machines.
Many real world tasks are expressible in this model.
o MapReduce provides:
User-defined functions
Automatic parallelization and distribution
Fault-tolerance
I/O scheduling
Status and monitoring
INTRODUCTION CONT…
o Input & output are sets of key/value pairs
o Programmer specifies two functions:
▪ map (in_key, in_value) -> list(out_key,
intermediate_value)
Processes input key/value pair
Produces set of intermediate pairs
▪ reduce (out_key, list(intermediate_value)) ->
list(out_value)
Combines all intermediate values for a particular key
Produces a set of merged output values (in most cases, just one)
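A minimal Python sketch of these two functions, using the word-count example from the paper; the function names, type hints, and generator-style emission are illustrative choices, not the paper's C++ API:

```python
from typing import Iterator, List, Tuple

def word_count_map(in_key: str, in_value: str) -> Iterator[Tuple[str, int]]:
    """map(in_key, in_value) -> list(out_key, intermediate_value).
    in_key is a document name, in_value is the document contents."""
    for word in in_value.split():
        yield (word, 1)                 # one intermediate pair per word occurrence

def word_count_reduce(out_key: str, values: List[int]) -> List[int]:
    """reduce(out_key, list(intermediate_value)) -> list(out_value).
    Combines all counts emitted for one word into a single total."""
    return [sum(values)]                # usually exactly one output value per key
```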
PROGRAMMING MODEL
PROGRAMMING MODEL
o Word Count Example (pipeline diagram)
• Input files (e.g., input file1, input file2) are split; each line is passed to an individual mapper instance
• Map: emits intermediate (key, value) pairs
• Sort and Shuffle: intermediate pairs are grouped by key
• Reduce: merges the values for each key
• Final Output: written to the output file
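A toy, single-process simulation of the pipeline in this diagram (splitting into documents, mapping, shuffling by key, reducing), reusing the word_count_map and word_count_reduce sketches above; a real MapReduce run distributes these phases across many machines:

```python
from collections import defaultdict

def run_word_count(documents: dict) -> dict:
    """Toy in-memory stand-in for the distributed map/shuffle/reduce pipeline."""
    # Map phase: every (name, contents) pair goes through the map function.
    intermediate = []
    for name, contents in documents.items():
        intermediate.extend(word_count_map(name, contents))

    # Shuffle/sort phase: group intermediate values by key.
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)

    # Reduce phase: one reduce call per distinct key.
    return {key: word_count_reduce(key, values)[0] for key, values in grouped.items()}

# Example: run_word_count({"file1": "the quick fox", "file2": "the lazy dog"})
# -> {"the": 2, "quick": 1, "fox": 1, "lazy": 1, "dog": 1}
```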
PROGRAMMING MODEL
More Examples …
Distributed Grep
The map function emits a line if it matches a supplied pattern
Count of URL access frequency.
The map function processes logs of web page requests and outputs <URL, 1>
Reverse web-link graph
The map function outputs <target, source> pairs for each link to a target URL found in a page named source
Term-Vector per Host
A term vector summarizes the most important words in a document or set of documents as a list
of (word, frequency) pairs; the map function emits a (hostname, term vector) pair per document, and the reduce function merges the vectors per host
Inverted Index
The map function parses each document, and emits a sequence of (word, document ID) pairs
Distributed Sort
The map function extracts the key from each record, and emits a (key, record) pair
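As one concrete illustration, a sketch of the inverted-index example in Python; the deduplication in the reduce step is an added assumption, since the paper only says the document IDs are sorted:

```python
from typing import Iterator, List, Tuple

def inverted_index_map(doc_id: str, contents: str) -> Iterator[Tuple[str, str]]:
    """Emit a (word, document ID) pair for every word occurrence in the document."""
    for word in contents.split():
        yield (word, doc_id)

def inverted_index_reduce(word: str, doc_ids: List[str]) -> List[List[str]]:
    """Sort (and, here, also deduplicate) the IDs of all documents containing the word."""
    return [sorted(set(doc_ids))]       # deduplication is an illustrative choice
```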
➢ Many different implementations are possible
➢ The right choice depends on the environment.
➢ Typical cluster (in wide use at Google: large clusters of commodity PCs connected
via switched Ethernet):
• Hundreds to thousands of dual-processor x86 machines running Linux, with 2-4 GB of
memory per machine
• Connected with commodity networking hardware; limited bisection bandwidth
• Storage on inexpensive local IDE disks
• GFS, a distributed file system, manages the data
• A scheduling system lets users submit jobs (a job is a set of tasks, mapped by the
scheduler to the available machines in the cluster)
➢ Implemented as a C++ library and linked into user programs
IMPLEMENTATION
Execution Overview
➢ Map
• Divide the input into M equal-sized splits
• Each split is 16-64 MB
➢ Reduce
• Partitioning intermediate key space into R pieces
• hash(intermediate_key) mod R
➢ Typical setting:
• 2,000 machines
• M = 200,000
• R = 5,000
IMPLEMENTATION...
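A small sketch of this partitioning step; CRC32 is used here only as a stand-in for whatever hash function the real library uses, and the value of R is the "typical setting" above:

```python
import zlib

R = 5_000   # number of reduce tasks / output partitions (typical setting above)

def partition(intermediate_key: str, num_partitions: int = R) -> int:
    """hash(intermediate_key) mod R: assign an intermediate key to one reduce region."""
    return zlib.crc32(intermediate_key.encode("utf-8")) % num_partitions

# Every pair with the same key lands in the same region, so a single reduce task
# sees all intermediate values for that key.
```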
(Figure: execution overview)
• (0) The user program calls mapreduce(spec, &result)
• The input is divided into M splits of 16-64 MB each
• The partitioning function hash(intermediate_key) mod R divides the intermediate key space into R regions
• Each reduce worker reads all the intermediate data for its region and sorts it by intermediate keys
Execution Overview… IMPLEMENTATION…
Fault Tolerance
➢ Worker failure: handled through re-execution
• Detect failure via periodic heartbeats
• Re-execute completed and in-progress map tasks
• Why re-execute even the completed map tasks? Their output is stored on the failed machine's
local disk, so it is no longer accessible.
• Re-execute in-progress reduce tasks (completed reduce output is already in the global file system)
• Task completion is committed through the master
➢ Master failure:
• Could be handled (e.g., by checkpointing the master's state), but is not handled in the current
implementation, since master failure is unlikely
IMPLEMENTATION…
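An illustrative sketch (not the paper's actual master code) of the bookkeeping this implies; the Worker/Task structures and the 60-second timeout are assumptions:

```python
import time
from dataclasses import dataclass
from typing import List, Optional

HEARTBEAT_TIMEOUT = 60.0   # seconds of silence before a worker is presumed dead (assumed value)

@dataclass
class Worker:
    last_heartbeat: float

@dataclass
class Task:
    worker: Optional[Worker] = None
    state: str = "idle"        # "idle" | "in_progress" | "completed"

def handle_worker_failures(workers: List[Worker], map_tasks: List[Task]) -> None:
    """Reset map tasks owned by silent workers so the master can re-execute them."""
    now = time.time()
    dead = [w for w in workers if now - w.last_heartbeat > HEARTBEAT_TIMEOUT]
    for task in map_tasks:
        # Completed map output lives on the failed machine's local disk and is lost,
        # which is why completed map tasks are re-executed along with in-progress ones.
        if task.worker in dead and task.state in ("in_progress", "completed"):
            task.state, task.worker = "idle", None   # eligible for rescheduling
```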
Locality
➢ Master scheduling policy:
• Asks GFS for the locations of the replicas of the input file blocks
• The input is typically split into 64 MB pieces (the GFS block size)
• Map tasks are scheduled so that a replica of the input block is on the same machine or the
same rack
➢ As a result:
• Most tasks' input data is read locally and consumes no network bandwidth
IMPLEMENTATION…
Backup Tasks
➢ One of the common causes that lengthens the total time taken for
a MapReduce operation is a straggler: a machine that takes an unusually long time
to complete one of the last few tasks.
➢ MapReduce has a general mechanism to alleviate the problem of stragglers.
➢ When the operation is close to completion, the master schedules backup executions of the remaining
in-progress tasks; a task is marked complete whenever either the primary or the backup execution finishes.
➢ This significantly reduces the time to complete large
MapReduce operations.
IMPLEMENTATION…
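A sketch of the backup-task idea under assumed data structures; the 5% "close to completion" threshold is an invented heuristic, not a figure from the paper:

```python
from typing import Callable, Dict, List

def schedule_backups(tasks: List[Dict], assign_to_idle_worker: Callable[[Dict], None]) -> None:
    """Once the operation is close to completion, launch duplicate ("backup") executions
    of the remaining in-progress tasks; whichever copy finishes first wins."""
    remaining = [t for t in tasks if t["state"] == "in_progress"]
    close_to_done = remaining and len(remaining) <= max(1, len(tasks) // 20)  # assumed 5% heuristic
    if close_to_done:
        for task in remaining:
            if not task.get("backup_scheduled"):
                task["backup_scheduled"] = True
                assign_to_idle_worker(task)   # run a second copy on another machine
```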
• Different partitioning functions
• Users specify the number of reduce tasks/output files they desire (R)
• Combiner function
• Performs partial merging of intermediate data on the map worker; useful for saving network bandwidth
• Different input/output types
• Skipping bad records
• When a record causes repeated crashes, the master tells the next worker that picks up the task to skip that record
• Local execution
• An alternative implementation of the MapReduce library that sequentially executes all of the work for
a MapReduce operation on the local machine (useful for debugging)
• Status info
• Status pages show the progress of the computation and more
• Counters
• Count occurrences of various events (e.g., total number of words processed)
REFINEMENTS
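For the combiner refinement, a sketch using word count, where the combiner can simply perform the same summation as the reduce function (valid because addition is commutative and associative); names and types are illustrative:

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

def word_count_combiner(map_output: Iterable[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Partially merge (word, count) pairs on the map worker before the shuffle."""
    partial = defaultdict(int)
    for word, count in map_output:
        partial[word] += count
    # Far fewer pairs cross the network, e.g. ("the", 412) instead of 412 separate ("the", 1) pairs.
    return list(partial.items())
```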
Measure the performance of MapReduce on two
computations running on a large cluster of machines.
➢ Grep
• searches through approximately one terabyte of
data looking for a particular pattern
➢ Sort
• sorts approximately one terabyte of data
PERFORMANCE
➢ Cluster Configuration Specifications
▪ Cluster: 1,800 machines
▪ Memory: 4 GB per machine
▪ Processors: dual-processor 2 GHz Xeons with Hyper-Threading
▪ Hard disks: dual 160 GB IDE disks per machine
▪ Network bandwidth: Gigabit Ethernet per machine; approximately 100 Gbps aggregate bisection bandwidth
PERFORMANCE…
Grep
Computation
➢ Scans 10^10 (10 billion) 100-byte
records, about 1 TB of data, searching for a rare
3-character pattern
(which occurs in 92,337 records)
➢ The input is split into
approximately 64 MB pieces
(M = 15,000);
the entire output is placed in
one file (R = 1)
➢ Startup overhead is
significant for short jobs
(Figure: data transfer rate over time)
PERFORMANCE…
▪ Backup tasks improve completion time considerably: without backup tasks, the sort takes 44% longer.
▪ The system handles machine failures relatively quickly: with 200 worker processes killed, the sort takes only 5% longer than normal.
PERFORMANCE…
Sort Computation
(Figure: data transfer rates over time for different executions of the sort program)
➢ MapReduce has proven to be a useful abstraction
➢ Greatly simplifies large-scale computations at Google
➢ Fun to use: focus on problem, let library deal with messy
details
➢ No deep parallelization knowledge required
(the library relieves the user from dealing with low-level parallelization details)
Conclusions & Experience
Review
▪ Strong points
✓ The paper follows reasonable logical organization.
✓ The simplified programming model proposed in this paper opened up the
parallel computation field to general purpose programmers.
✓ The paper provides many simple examples of applications for MapReduce, and it
clearly lays out the steps for implementing MapReduce.
✓ The arguments and designs are straightforward.
Review …
▪ Weak points
✓ The framework described in the paper is rather myopic: the intermediate
data produced by the map tasks is not meant to be reused across jobs.
✓ When dealing with data at such a large scale, failures are inevitable, and re-execution of tasks can be costly.
✓ The rigid two-stage (map then reduce) structure limits more complex or iterative computations.
✓ The shuffle before the reduce phase requires a lot of network communication.
✓ The MapReduce framework proposed here does not address the
requirements of large-scale machine learning algorithm executions.
Thank you!