MapReduce:
Limitations, Optimizations
and Open Issues
The 11th IEEE International Symposium on Parallel and
Distributed Processing with Applications (ISPA-13)
Vasiliki Kalavri, Vladimir Vlassov
{kalavri, vladv}@kth.se
17 July 2013, Melbourne, Australia
Outline
● MapReduce / Hadoop
○ background
○ current state
● Limitations and Existing Optimizations
○ performance
○ programming model
○ configuration and automation
● Trends, Open Issues, Future Directions
Big Data & Hadoop MapReduce
Motivation and Goal
● Numerous Hadoop variations and
enhancements over the past few years
○ each branching out from vanilla Hadoop
○ hard to choose the appropriate tool
○ no categorization / classification exists
● In our survey
○ overview existing variations
○ classify the optimizations
○ identify trends and open issues
MapReduce Programming Model
MapReduce
● Key-Value Pairs
○ Partitioning functions
● 2nd Order Functions
○ User-Defined Map - Reduce
● Input / Output
○ Distributed Fault-Tolerant File System
● Data-Centric Computation
○ Move the computation to the data
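The model above can be sketched in a few lines of single-process Python. This is only an illustration of the ideas on the slide (key-value pairs, a partitioning function, user-defined map and reduce passed to a second-order function), not Hadoop's API. The names `run_mapreduce`, `hash_partition`, `word_count_map` and `word_count_reduce` are assumptions made for this example.

```python
from collections import defaultdict


def hash_partition(key, num_reducers):
    # Hypothetical partitioning function: decides which reducer receives a key.
    # A byte sum is used so the result is stable across runs (Python's built-in
    # hash() of a str is randomized per process).
    return sum(key.encode("utf-8")) % num_reducers


def run_mapreduce(records, map_fn, reduce_fn, num_reducers=2,
                  partition_fn=hash_partition):
    # Second-order function: takes the user-defined map_fn and reduce_fn.
    # Map phase: each record yields (key, value) pairs, routed to a partition.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for record in records:
        for key, value in map_fn(record):
            partitions[partition_fn(key, num_reducers)][key].append(value)
    # Reduce phase: each partition processes its keys in sorted order.
    output = {}
    for partition in partitions:
        for key in sorted(partition):
            output[key] = reduce_fn(key, partition[key])
    return output


def word_count_map(line):
    # User-defined map: emit (word, 1) for every word in the line.
    for word in line.split():
        yield word.lower(), 1


def word_count_reduce(word, counts):
    # User-defined reduce: sum all counts collected for one word.
    return sum(counts)


if __name__ == "__main__":
    lines = ["move the computation", "to the data"]
    print(run_mapreduce(lines, word_count_map, word_count_reduce))
```

In a real deployment the partitions would live on different machines and the input would be read from the distributed file system. Here everything runs in one process to keep the data flow visible.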
Hadoop MapReduce 1.0
Limitations
● Scalability
● Cluster Utilization
● No support for non-MR applications
YARN (MapReduce v.2)
● JobTracker => Resource Manager
and Application Master
● Map/Reduce Slots => Resource
Container
MapReduce Limitations
● Performance
○ initialization, scheduling, coordination
○ data materialization - intensive disk I/O
● Programming Model
○ single-input operators
○ fixed processing pipeline - job chaining
○ no support for iterations
● Configuration and Automation
○ sensitive to configuration parameters
○ complicated tuning
Performance Issues (1)
● job setup, initialization
● task scheduling
● monitoring and coordination
Performance Issues (2)
● Data Materialization and Replication
● Intensive Disk I/O
Programming Model Issues (1)
● Single Input Operators
○ Hard to Join / Cross Datasets
[Figure: Input A and Input B are tagged in a pre-processing step and combined into a single Merged Input]
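The tagging workaround can be sketched as a reduce-side join: because the operator accepts one input, records from both datasets are tagged with their origin, merged, and separated again inside the reducer. This is a minimal single-process sketch under that assumption. The helper names (`tag`, `join_reduce`, `reduce_side_join`) and the sample data are hypothetical.

```python
from collections import defaultdict


def tag(records, source):
    # Pre-processing step: wrap each value with the name of its input dataset.
    for key, value in records:
        yield key, (source, value)


def join_reduce(key, tagged_values):
    # The reducer sees values from both inputs mixed together and must
    # separate them by tag before it can pair them up.
    left = [value for source, value in tagged_values if source == "A"]
    right = [value for source, value in tagged_values if source == "B"]
    return [(key, a, b) for a in left for b in right]


def reduce_side_join(input_a, input_b):
    # Merge the two tagged datasets into the single input the model expects.
    merged = list(tag(input_a, "A")) + list(tag(input_b, "B"))
    # Group by key (this stands in for the shuffle phase).
    groups = defaultdict(list)
    for key, tagged in merged:
        groups[key].append(tagged)
    # Reduce every group, in key order.
    results = []
    for key in sorted(groups):
        results.extend(join_reduce(key, groups[key]))
    return results


if __name__ == "__main__":
    users = [(1, "ann"), (2, "bob")]
    orders = [(1, "book"), (1, "pen"), (3, "cup")]
    print(reduce_side_join(users, orders))
```

Every record of both inputs is shuffled, including those that have no partner in the other dataset, which is one reason joins are costly in this model.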
Programming Model Issues (2)
● Fixed, Static Processing Pipeline
○ Job Chaining
● No support for Iterations
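Without built-in iteration, an iterative algorithm is typically expressed as a driver loop that submits one job per step and tests for convergence itself. The sketch below illustrates this with label propagation for connected components. The algorithm choice and the names `propagate_labels_job` and `connected_components` are assumptions made for this example, and each "job" is just a Python function call.

```python
from collections import defaultdict


def propagate_labels_job(edges, labels):
    # One MapReduce-style job. "Map": every vertex sends its label to itself
    # and to its neighbours. "Reduce": every vertex keeps the smallest label.
    messages = defaultdict(list)
    for vertex, label in labels.items():
        messages[vertex].append(label)
    for u, v in edges:
        messages[v].append(labels[u])
        messages[u].append(labels[v])
    return {vertex: min(candidates) for vertex, candidates in messages.items()}


def connected_components(edges, max_jobs=50):
    # Driver program: chains jobs and checks for a fixpoint between them.
    vertices = {vertex for edge in edges for vertex in edge}
    labels = {vertex: vertex for vertex in vertices}
    for jobs_run in range(1, max_jobs + 1):
        # Each iteration is a full job that re-reads the unchanged edge list.
        new_labels = propagate_labels_job(edges, labels)
        if new_labels == labels:
            return labels, jobs_run
        labels = new_labels
    return labels, max_jobs


if __name__ == "__main__":
    component_of, jobs = connected_components([(1, 2), (2, 3), (4, 5)])
    print(component_of, jobs)
```

Note that the last job only confirms that nothing changed, and that the loop-invariant input (the edges) is passed to every job. On a cluster each of those steps would also pay the job setup and materialization costs listed earlier.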
Performance Optimizations
● Operator Pipelining
● Approximate Results
● Indexing and Sorting
● Work Sharing
● Data Reuse
● Skew Mitigation
● Data Colocation
Programming Model Extensions
● High-Level Languages
○ Declarative, SQL-like
○ Semi-structured JSON data
○ Java / Scala libraries for complex processing
flows
● Domain-Specific Systems
○ Iterations
○ Incremental Computations
Configuration and Automation
● Self-Tuning
○ dynamic configuration based on workload
○ learn performance models
○ data-flow sharing
● Disk I/O Minimization
○ dynamically setting number of reducers
○ handle skew and batch I/O operations
● Data-aware Optimizations
○ static code analysis
○ index creation and selective input scans
Trends
● In-memory processing
○ minimize disk I/O and communication
● Traditional database techniques
○ organize and structure data, indexing
● Caching
○ reuse of previous computations
● Relaxation of fault-tolerance
○ materialize less often
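The caching trend (reuse of previous computations) can be illustrated with a small result cache keyed by a fingerprint of the job and its input. This is a simplified sketch and not the mechanism of any system surveyed here. The class `ResultCache`, its methods and the sample job `total_length` are hypothetical, and the input is assumed to be JSON-serializable.

```python
import hashlib
import json


class ResultCache:
    """Stores job outputs so that an identical job on identical input is not re-run."""

    def __init__(self):
        self._store = {}
        self.hits = 0

    @staticmethod
    def _fingerprint(job_name, records):
        # Assumption: the job name plus the serialized input identify a computation.
        payload = json.dumps([job_name, records], sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def run(self, job_name, job_fn, records):
        key = self._fingerprint(job_name, records)
        if key in self._store:
            # Reuse the stored output instead of recomputing it.
            self.hits += 1
            return self._store[key]
        result = job_fn(records)
        self._store[key] = result
        return result


def total_length(lines):
    # Stand-in for an expensive job.
    return sum(len(line) for line in lines)


if __name__ == "__main__":
    cache = ResultCache()
    print(cache.run("total_length", total_length, ["ab", "cde"]))
    print(cache.run("total_length", total_length, ["ab", "cde"]), cache.hits)
```

A practical system would also have to decide which intermediate results are worth storing and when to evict them, which this sketch ignores.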
System         | Major Contribution                 | Open-Source, Available? | Transparent
MR Online      | Pipelining, Online aggregation     | yes                     | yes
EARL           | Fast approximate results           | yes                     | no
Hadoop++, HAIL | Improve relational operations      | no                      | yes / no
MRShare        | Concurrent work sharing            | no                      | no
ReStore        | Reuse of previous computations     | no                      | yes
SkewTune       | Automatic skew mitigation          | no                      | yes
CoHadoop       | Data colocation                    | no                      | no
HaLoop         | Iterations support                 | yes                     | no
Incoop         | Incremental processing             | no                      | no
Starfish       | Dynamic self-tuning                | no                      | yes
Sailfish       | I/O minimization, automatic tuning | no                      | yes
Manimal        | Automatic data-aware optimizations | no                      | yes
Open Issues
● No standard benchmark
● No "typical" MapReduce workload
● Each system is evaluated using different
○ datasets
○ applications
○ deployments
■ results cannot be compared across systems; most are
compared only against vanilla Hadoop
● Application transparency
Future Directions
● Fault-tolerance adjustment mechanisms
● Standardize workloads and comparison
metrics
● Support for interactive analysis
○ query optimization techniques
○ data reuse
○ fast approximate results
Conclusions
● MapReduce and Hadoop are very useful,
successful and interesting tools
● There is still a lot of room for
optimizations and research
● But, MapReduce might not always be the
right tool for the job
○ more flexible data-flows
○ relational operations
○ graph processing
○ machine learning