MapReduce vs/and Spark
Tudor Lapusan
BigData Romanian Tour - Timisoara
History
MapReduce basic functionalities
● Fault tolerance
● Monitoring & status updates
● Scalability
Hadoop MapReduce
Input → Map → Reduce → Output
Hadoop MapReduce
Input → Map → Shuffle → Reduce → Output
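The Map, Shuffle and Reduce steps can be sketched with plain Scala collections. This is only a single-machine model of the data flow, assuming word count as the job; it does not use the Hadoop API, and the object and function names are made up for illustration.

```scala
// Plain Scala collections stand in for a cluster; no Hadoop involved.
object MapShuffleReduceSketch {
  // Map: emit a (word, 1) pair for every word in every input line.
  def mapPhase(lines: Seq[String]): Seq[(String, Int)] =
    lines.flatMap(_.split("\\s+")).filter(_.nonEmpty).map(word => (word, 1))

  // Shuffle: bring together all values that share the same key.
  def shufflePhase(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, group) => (word, group.map(_._2)) }

  // Reduce: collapse the values of each key into one result.
  def reducePhase(grouped: Map[String, Seq[Int]]): Map[String, Int] =
    grouped.map { case (word, ones) => (word, ones.sum) }

  def wordCount(lines: Seq[String]): Map[String, Int] =
    reducePhase(shufflePhase(mapPhase(lines)))

  def main(args: Array[String]): Unit = {
    val input = Seq("spark and mapreduce", "mapreduce and hadoop")
    println(wordCount(input))
  }
}
```

In a real cluster the map and reduce functions run on different machines and the shuffle moves data over the network, which is why it appears as its own step.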
MapReduce DAG
(diagram: DAG of MapReduce jobs A, B, C, D, E, F)
Spark
● RDD
● Operations : Transformations and Actions
RDD - Resilient Distributed Dataset
An RDD is a fault-tolerant collection of elements distributed
across many servers, on which we can perform parallel operations.
RDD
Scala code
// sc is the SparkContext (spark-shell creates it for you)
val data = Array(1, 2, 3, 4, 5, 6, 7, 8)
val rddData = sc.parallelize(data)
RDD
Scala code
val rddFile = sc.textFile("data.txt")
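Outside the Spark shell the `sc` value has to be created by the program. A minimal sketch, assuming a local run inside one JVM (`local[*]`); the object name and application name are arbitrary choices for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddCreationSketch {
  // Builds an RDD from a local array and returns its element count.
  def countElements(sc: SparkContext): Long = {
    val data = Array(1, 2, 3, 4, 5, 6, 7, 8)
    val rddData = sc.parallelize(data) // split the array into partitions
    rddData.count()
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption: run Spark in this JVM using all cores.
    val conf = new SparkConf().setAppName("rdd-creation-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try println(countElements(sc))
    finally sc.stop()
  }
}
```

`sc.textFile` works the same way but reads lazily, so a missing file is only reported once an action runs on the RDD.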
RDD persistence
MEMORY_ONLY
MEMORY_AND_DISK
MEMORY_ONLY_SER
MEMORY_AND_DISK_SER
DISK_ONLY
MEMORY_ONLY_2
MEMORY_AND_DISK_2
OFF_HEAP
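A storage level from the list above is chosen with `persist`. The sketch below assumes a local run and picks MEMORY_AND_DISK purely as an example; the data set and names are invented.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistenceSketch {
  // Returns (element count, sum of elements, storage level in use).
  def run(sc: SparkContext): (Long, Long, StorageLevel) = {
    val squares = sc.parallelize(1 to 1000).map(x => x.toLong * x)
    // Mark the RDD for caching; nothing is computed yet.
    squares.persist(StorageLevel.MEMORY_AND_DISK)
    val n = squares.count()            // first action computes and caches the partitions
    val total = squares.reduce(_ + _)  // second action can reuse the cached partitions
    val level = squares.getStorageLevel
    squares.unpersist()                // release the cached partitions
    (n, total, level)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("persistence-sketch").setMaster("local[*]"))
    try println(run(sc))
    finally sc.stop()
  }
}
```

Without `persist`, each action would recompute the `map` from the source data.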
Transformations
RDD 1 → RDD 2
Transformations are operations on RDDs that return new RDDs.
Transformations
InputRDD {1,2,3,4,5,6}
MapRDD {2,3,4,5,6,7} (map x => x + 1)
FilterRDD {1,2,3,5,6} (filter x => x != 4)
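The example can be reproduced as a sketch, assuming a local SparkContext. The values listed on the slide match applying `map` and `filter` each to InputRDD; chaining the filter after the map would give {2,3,5,6,7} instead, so both variants are shown. `collect()` is an action, used here only to look at the results.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TransformationsSketch {
  // Returns the contents of (MapRDD, FilterRDD, filter chained after map).
  def run(sc: SparkContext): (List[Int], List[Int], List[Int]) = {
    val inputRDD = sc.parallelize(Seq(1, 2, 3, 4, 5, 6))
    val mapRDD = inputRDD.map(x => x + 1)         // new RDD, inputRDD is unchanged
    val filterRDD = inputRDD.filter(x => x != 4)  // filter applied to the input
    val chainedRDD = mapRDD.filter(x => x != 4)   // filter applied after the map
    (mapRDD.collect().toList, filterRDD.collect().toList, chainedRDD.collect().toList)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("transformations-sketch").setMaster("local[*]"))
    try {
      val (mapped, filtered, chained) = run(sc)
      println(mapped.mkString(","))   // 2,3,4,5,6,7
      println(filtered.mkString(",")) // 1,2,3,5,6
      println(chained.mkString(","))  // 2,3,5,6,7
    } finally sc.stop()
  }
}
```

Transformations are lazy: the `map` and `filter` calls only record what to do, and the work happens when `collect()` is called.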
Actions
Actions are operations on RDDs that return a final value to the
driver or write data to an external storage system.
Actions
InputRDD {1,2,3,4,5,6}
MapRDD {2,3,4,5,6,7} (map x => x + 1)
FilterRDD {1,2,3,5,6} (filter x => x != 4)
count()=6 take(2)={1,2} saveAsTextFile()
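The three actions can be sketched as follows, assuming they are called on InputRDD (which matches count()=6 and take(2)={1,2}) and a local SparkContext. The output location is a fresh temporary directory chosen for this example, since `saveAsTextFile` refuses to write into a directory that already exists.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object ActionsSketch {
  // Returns (count, first two elements, output directory).
  def run(sc: SparkContext): (Long, List[Int], String) = {
    val inputRDD = sc.parallelize(Seq(1, 2, 3, 4, 5, 6))
    val n = inputRDD.count()                // number of elements, returned to the driver
    val firstTwo = inputRDD.take(2).toList  // first two elements, returned to the driver
    // Hypothetical output path: a not-yet-existing child of a temp directory.
    val outDir = Files.createTempDirectory("actions-sketch").resolve("out").toString
    inputRDD.saveAsTextFile(outDir)         // writes one part file per partition
    (n, firstTwo, outDir)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("actions-sketch").setMaster("local[*]"))
    try {
      val (n, firstTwo, outDir) = run(sc)
      println(s"count=$n take(2)=${firstTwo.mkString(",")} saved to $outDir")
    } finally sc.stop()
  }
}
```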
Spark DAG
(diagram: DAG of RDD 1 through RDD 6; legend: Transformation, Action, Stage)
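Spark can print the lineage it has recorded for an RDD with `toDebugString`. A small sketch, assuming a local SparkContext and a made-up word-count pipeline: the `map` stays in the same stage as its parent, while `reduceByKey` needs a shuffle and so starts a new stage.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageSketch {
  // Builds a two-stage pipeline and returns its lineage description.
  def lineage(sc: SparkContext): String = {
    val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))
    val pairs = words.map(w => (w, 1))     // narrow transformation, same stage
    val counts = pairs.reduceByKey(_ + _)  // wide transformation, shuffle boundary
    counts.toDebugString                   // describes the DAG; runs no job
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("lineage-sketch").setMaster("local[*]"))
    try println(lineage(sc))
    finally sc.stop()
  }
}
```

No job is submitted here because no action is called; the DAG is only executed, stage by stage, once an action such as `count()` runs.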
Spark DAG vs MapReduce DAG
(diagram: Spark DAG of RDD 1 through RDD 6 next to the MapReduce DAG of jobs A through F)
Programming languages
MapReduce: Java, Ruby, Perl, Python, PHP, R, C++
Spark: Java, Scala, Python
Ease of use
- Spark is easier to program and includes an interactive mode.
- Hadoop MapReduce is harder to program, but many tools are
available to make it easier.
Performance : Sort Benchmark 2013
Performance : Sort Benchmark 2014
Costs
Costs : hardware recommendation
          Spark                     Hadoop MapReduce
Cores     8-16                      4
Memory    8 GB to hundreds of GB    24 GB
Disks     4-8                       4-6 one-TB disks
Network   10 Gb or more             1 Gb Ethernet
Source    Spark recommendation      Hortonworks recommendation
Costs : developers
Questions
tudor.lapusan@gmail.com
@tlapusan
