Summary of Apache Spark
Original Papers:
1. “Spark: Cluster Computing with Working Sets” by Matei Zaharia, et al.
Hotcloud 2010.
2. “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory
Cluster Computing” by Matei Zaharia, et al. NSDI 2012.
Motivation
• MapReduce is great, but,
• There are applications where you iterate on the same set of data, e.g.,
for many iterations {
old_data = new_data;
new_data = func(old_data);
}
• Problem: The body of each iteration can be described as a MapReduce task whose
inputs and final outputs are GFS files. Storing the output data to GFS and then
reading it back in the next iteration is redundant work.
• Idea: We can provide a mode that caches the final outputs in memory if possible.
o Challenge: but if a machine crashes, we lose the outputs. Can we recover?
o Solution: store the lineage of the data so that it can be reconstructed as needed (e.g., if
it is lost or memory is insufficient).
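In Spark, the cached-iteration idea above can be written as a short loop (a minimal sketch, assuming a running SparkContext `sc`; `step` is a placeholder for the real per-iteration update):

```scala
// Iterate on the same dataset without writing it back to stable storage.
// `step` is a hypothetical per-iteration update function.
val step = (x: Double) => x * 0.9 + 1.0
var data = sc.parallelize(1 to 1000).map(_.toDouble).cache()
for (i <- 1 to 10) {
  data = data.map(step).cache() // result stays in memory across iterations
}
data.count() // action forces evaluation of the whole chain
```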
Motivation
• Spark’s goal was to generalize MapReduce to support new apps
within same engine
o MapReduce problems can be expressed in Spark too.
o Where Spark shines and MapReduce does not: applications that need to
reuse a working set of data across multiple parallel operations
• Two reasonably small additions allowed the previous specialized
models to be expressed within Spark:
o fast data sharing
o general DAGs
Motivation
Some key points about Spark:
• handles batch, interactive, and real-time within a single framework
• native integration with Java, Python, and Scala
o Has APIs written in these languages
• programming at a higher level of abstraction
• more general: map/reduce is just one set of supported constructs
Use Example
• We’ll run Spark’s interactive shell…
./bin/spark-shell
• Then from the “scala>” REPL prompt, let’s create some data…
val data = 1 to 10000
• Create an RDD based on that data…
val distData = sc.parallelize(data)
• Then use a filter to select values less than 10…
distData.filter(_ < 10).collect()
Resilient Distributed Datasets (RDD)
• Represents a read-only collection of objects partitioned across a set
of machines that can be rebuilt if a partition is lost.
• RDDs can only be created through deterministic operations (aka
transformations) on:
o Either data in stable storage:
 Any file stored in HDFS
 Or other storage systems supported by Hadoop
o Or other RDDs
• A program cannot reference an RDD that it cannot reconstruct after a
failure
Resilient Distributed Datasets (RDD)
• Two types of operations on RDDs: transformations and actions
o Programmers start by defining one or more RDDs through transformations on data in stable
storage (e.g., map and filter). Transformations create a new dataset from an existing one
 transformations are lazy (not computed immediately)
 instead they remember the transformations applied to some base dataset
 The transformed RDD gets recomputed when an action is run on it (default)
o They can then use these RDDs in actions, which are operations that return a value to the
application or export data to a storage system.
• However, an RDD can be persisted in memory or on disk
o Each node stores in memory any slices of it that it computes and reuses them in other
actions on that dataset – often making future actions more than 10x faster
o The cache is fault-tolerant: if any partition of an RDD is lost, it will automatically be
recomputed using the transformations that originally created it
o Note that by default, the base RDD (the original data in stable storage) is not loaded into
RAM because the useful data after transformation might be only a small fraction (small
enough to fit into memory).
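Laziness and persistence can be seen in a short shell session (a sketch, assuming the usual `sc` from spark-shell):

```scala
val nums  = sc.parallelize(1 to 1000000)  // nothing is computed yet
val evens = nums.filter(_ % 2 == 0)       // still nothing: only lineage is recorded
evens.persist()                           // mark the RDD for in-memory caching
evens.count()                             // first action: computes and caches partitions
evens.take(5)                             // reuses the cached partitions
```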
Study Notes: Apache Spark
RDD Implementation
Common interface of each RDD:
• A set of partitions: atomic pieces of the dataset
• A set of dependencies on parent RDDs: one RDD can have multiple parents
o narrow dependencies: each partition of the parent RDD is used by at most one
partition of the child RDD.
o wide dependencies: multiple child partitions may depend on one parent
partition; these require a shuffle operation.
• A function for computing the dataset based on its parents
• Metadata about its partitioning scheme
• Metadata about its data placement, e.g., preferredLocations(p) returns a
list of nodes where partition p can be accessed faster due to data locality
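The interface above can be summarized as a Scala sketch (names are illustrative; Spark's internal API differs in detail):

```scala
// Illustrative stubs, not Spark's real types.
trait Partition { def index: Int }
trait Dependency[T]
trait Partitioner

abstract class SketchRDD[T] {
  def partitions: Seq[Partition]                     // atomic pieces of the dataset
  def dependencies: Seq[Dependency[_]]               // parent RDDs (narrow or wide)
  def compute(p: Partition): Iterator[T]             // derive a partition from parents
  def partitioner: Option[Partitioner]               // partitioning scheme metadata
  def preferredLocations(p: Partition): Seq[String]  // nodes where p is local
}
```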
Narrow vs Wide Dependencies
• Narrow dependencies allow for pipelined
execution on one cluster node, which can
compute all the parent partitions. Wide
dependencies require data from all parent
partitions to be available and to be
shuffled across the nodes using a
MapReduce-like operation.
• Recovery after a node failure is more
efficient with a narrow dependency, as
only the lost parent partitions need be
recomputed, and re-computation can be
done in parallel
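For instance (a sketch, assuming `sc`):

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
// Narrow: each output partition depends on exactly one parent partition.
val narrow = pairs.mapValues(_ + 1)
// Wide: output partitions need data from all parent partitions, so a shuffle is required.
val wide = pairs.groupByKey()
```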
Job Scheduling on RDDs
• Similar to Dryad, but takes data locality into account
• When a user runs an action, the scheduler builds a DAG of
stages to execute. Each stage contains as many
pipelined transformations with narrow dependencies
as possible.
• Stage boundaries are the shuffle operations required
for wide dependencies
• Scheduler launches tasks to compute missing
partitions from each stage. Tasks are assigned to
machines based on data locality using delay
scheduling.
• For wide dependencies, intermediate records are
materialized on the nodes holding parent partitions
(similar to the intermediate map outputs of
MapReduce) to simplify fault recovery.
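The stage structure can be inspected from the lineage; in `toDebugString` output, a shuffle boundary appears as a new indentation level (sketch, assuming `sc`):

```scala
val words  = sc.textFile("hdfs://...").flatMap(_.split(" "))
val counts = words.map((_, 1)).reduceByKey(_ + _)  // reduceByKey introduces a shuffle
println(counts.toDebugString)                      // prints the DAG and its stage boundary
```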
Checkpointing
• Although lineage can always be used to recover RDDs after a failure,
checkpointing can be helpful for RDDs with long lineage chains
containing wide dependencies.
• For RDDs with narrow dependencies on data in stable storage,
checkpointing is not worthwhile. Reconstruction can be done in
parallel for these RDDs, at a fraction of the cost of replicating the
whole RDD.
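A checkpointing sketch (assuming `sc`; the checkpoint directory path is illustrative):

```scala
sc.setCheckpointDir("hdfs://.../checkpoints")  // durable location for checkpoint data
var rdd = sc.parallelize(1 to 1000)
for (i <- 1 to 100) rdd = rdd.map(_ + 1)       // builds a long lineage chain
rdd.checkpoint()                               // request truncation of the lineage here
rdd.count()                                    // the action materializes the checkpoint
```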
RDD vs Distributed Shared Memory (DSM)
• Previous frameworks that support data reuse (e.g., Pregel and Piccolo):
o Perform data sharing implicitly for these patterns
o Specialized frameworks; do not provide abstractions for more general reuse
o Programming interface supports fine-grained updates (reads and writes to each
memory location): fault-tolerance requires expensive replication of data across
machines or logging of updates across machines
• RDD:
o Only coarse-grained transformations (e.g., map, filter, and join): apply the same
operation to many data items. Note that reads on RDDs can still be fine-grained.
o Fault-tolerance only requires logging the transformation used to build a dataset
instead of the actual data
o RDDs are not suitable for applications that make asynchronous fine-grained updates
to shared state.
RDD vs Distributed Shared Memory (DSM)
• Other benefits of RDDs:
o RDDs are immutable. A system can mitigate slow nodes (stragglers) by
running backup copies of slow tasks, as in MapReduce.
o In bulk operations, a runtime can schedule tasks based on data locality to
improve performance.
o RDDs degrade gracefully when there is not enough memory to store them. An
LRU eviction policy is used at the level of RDDs: a partition from the least
recently accessed RDD is evicted to make room for a newly computed RDD
partition. This is user-configurable via the “persistence priority” for each RDD.
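Storage behavior is controlled through persistence levels; for example, evicted partitions can spill to disk instead of being recomputed (sketch, assuming `sc`):

```scala
import org.apache.spark.storage.StorageLevel

val big = sc.textFile("hdfs://...")
big.persist(StorageLevel.MEMORY_AND_DISK)  // evicted partitions spill to disk
big.count()                                // action computes and stores partitions
```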
Debugging RDDs
• One can reconstruct the RDDs later from the lineage and let the user
query them interactively
• One can re-run any task from the job in a single-process debugger by
recomputing the RDD partitions it depends on.
• Similar to the replay debuggers but without the capturing/recording
overhead.
RDD Example
// load error messages from a log into memory; then interactively search for
// various patterns
// base RDD
val lines = sc.textFile("hdfs://...")
// transformed RDDs
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")).map(r => r(1))
messages.cache()
// action 1
messages.filter(_.contains("mysql")).count()
// action 2
messages.filter(_.contains("php")).count()
(Diagram: each transformation (filter, map) yields a new RDD; the count() actions return a value to the driver.)
Shared Variables
• Broadcast variables let the programmer keep a read-only variable cached
on each machine rather than shipping a copy of it with every task
o For example, to give every node a copy of a large input dataset efficiently
• Spark also attempts to distribute broadcast variables using efficient
broadcast algorithms to reduce communication cost
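A broadcast sketch (assuming `sc`; the lookup table is made up for illustration):

```scala
// Ship the lookup table to each node once, not once per task.
val lookup = sc.broadcast(Map("ERROR" -> 3, "WARN" -> 2, "INFO" -> 1))
val logs   = sc.parallelize(Seq("ERROR", "INFO", "ERROR"))
val levels = logs.map(l => lookup.value.getOrElse(l, 0))  // workers read the local copy
levels.sum()
```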
Shared Variables
• Accumulators are variables that can only be “added” to through an
associative operation.
o Used to implement counters and sums efficiently in parallel
• Spark natively supports accumulators of numeric value types and
standard mutable collections, and programmers can extend support to new
types
• Only the driver program can read an accumulator’s value, not the
workers
o Each accumulator is given a unique ID upon creation
o Each worker creates a separate copy of the accumulator
o Each worker sends a message to the driver about its updates to the accumulator
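An accumulator sketch using the `longAccumulator` API of newer Spark versions (older releases used `sc.accumulator`); the CSV layout is made up for illustration:

```scala
val badLines = sc.longAccumulator("badLines")  // counter readable only by the driver
val parsed = sc.textFile("hdfs://...").flatMap { line =>
  val cols = line.split(",")
  if (cols.length == 2) Some(cols) else { badLines.add(1); None }  // count malformed rows
}
parsed.count()          // the action runs the job and applies the updates
println(badLines.value) // driver reads the aggregated total
```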
Ad

More Related Content

What's hot (19)

Hadoop - Introduction to mapreduce
Hadoop -  Introduction to mapreduceHadoop -  Introduction to mapreduce
Hadoop - Introduction to mapreduce
Vibrant Technologies & Computers
 
Cppt
CpptCppt
Cppt
chunkypandey12
 
A Travel Through Mesos
A Travel Through MesosA Travel Through Mesos
A Travel Through Mesos
Datio Big Data
 
Bd class 2 complete
Bd class 2 completeBd class 2 complete
Bd class 2 complete
JigsawAcademy2014
 
Introduction to HDFS
Introduction to HDFSIntroduction to HDFS
Introduction to HDFS
Bhavesh Padharia
 
Hadoop scalability
Hadoop scalabilityHadoop scalability
Hadoop scalability
WANdisco Plc
 
Hadoop2
Hadoop2Hadoop2
Hadoop2
Gagan Agrawal
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Hadoop HDFS
Hadoop HDFSHadoop HDFS
Hadoop HDFS
Vigen Sahakyan
 
HDFS Architecture
HDFS ArchitectureHDFS Architecture
HDFS Architecture
Jeff Hammerbacher
 
Fundamental of Big Data with Hadoop and Hive
Fundamental of Big Data with Hadoop and HiveFundamental of Big Data with Hadoop and Hive
Fundamental of Big Data with Hadoop and Hive
Sharjeel Imtiaz
 
Cassandra no sql ecosystem
Cassandra no sql ecosystemCassandra no sql ecosystem
Cassandra no sql ecosystem
Sandeep Sharma IIMK Smart City,IoT,Bigdata,Cloud,BI,DW
 
BDAS Shark study report 03 v1.1
BDAS Shark study report  03 v1.1BDAS Shark study report  03 v1.1
BDAS Shark study report 03 v1.1
Stefanie Zhao
 
Hadoop 2
Hadoop 2Hadoop 2
Hadoop 2
EasyMedico.com
 
Introducción a hadoop
Introducción a hadoopIntroducción a hadoop
Introducción a hadoop
datasalt
 
Map Reduce basics
Map Reduce basicsMap Reduce basics
Map Reduce basics
Abhishek Mukherjee
 
Hadoop operations basic
Hadoop operations basicHadoop operations basic
Hadoop operations basic
Hafizur Rahman
 
Introduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopIntroduction to MapReduce and Hadoop
Introduction to MapReduce and Hadoop
Mohamed Elsaka
 
Hadoop
HadoopHadoop
Hadoop
Himanshu Soni
 

Viewers also liked (12)

Derrick Miles on Executive Book Summaries
Derrick Miles on Executive Book SummariesDerrick Miles on Executive Book Summaries
Derrick Miles on Executive Book Summaries
TheMilestoneBrand
 
Visual book summaries
Visual book summariesVisual book summaries
Visual book summaries
chrisvdberge
 
ProQuest Safari: essentials of computing and popular technology
ProQuest Safari: essentials of computing and popular technologyProQuest Safari: essentials of computing and popular technology
ProQuest Safari: essentials of computing and popular technology
Scottish Library & Information Council (SLIC), CILIP in Scotland (CILIPS)
 
SparkNotes
SparkNotesSparkNotes
SparkNotes
Demet Aksoy
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
Richard Kuo
 
Julius Caesar - Summary
Julius Caesar - SummaryJulius Caesar - Summary
Julius Caesar - Summary
Maximoff
 
Beneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek LaskowskiBeneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek Laskowski
Spark Summit
 
Great Executive Summaries
Great Executive SummariesGreat Executive Summaries
Great Executive Summaries
Andy Forbes
 
The Lean Startup - Visual Summary
The Lean Startup - Visual SummaryThe Lean Startup - Visual Summary
The Lean Startup - Visual Summary
Brett Suddreth
 
Inside Apple
Inside AppleInside Apple
Inside Apple
Business Book Summaries
 
American Wheels Chinese Roads
American Wheels Chinese RoadsAmerican Wheels Chinese Roads
American Wheels Chinese Roads
Business Book Summaries
 
Fantastic
FantasticFantastic
Fantastic
Business Book Summaries
 
Ad

Similar to Study Notes: Apache Spark (20)

Apache Spark
Apache SparkApache Spark
Apache Spark
SugumarSarDurai
 
Spark
SparkSpark
Spark
Heena Madan
 
Geek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and ScalaGeek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and Scala
Atif Akhtar
 
Spark Deep Dive
Spark Deep DiveSpark Deep Dive
Spark Deep Dive
Corey Nolet
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
Databricks
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
DataArt
 
Apache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptApache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.ppt
bhargavi804095
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
Massimo Schenone
 
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015 Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Sameer Farooqui
 
Spark architechure.pptx
Spark architechure.pptxSpark architechure.pptx
Spark architechure.pptx
SaiSriMadhuriYatam
 
RDD
RDDRDD
RDD
Tien-Yang (Aiden) Wu
 
hadoop-spark.ppt
hadoop-spark.ppthadoop-spark.ppt
hadoop-spark.ppt
NouhaElhaji1
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scale
samthemonad
 
Hadoop mapreduce and yarn frame work- unit5
Hadoop mapreduce and yarn frame work-  unit5Hadoop mapreduce and yarn frame work-  unit5
Hadoop mapreduce and yarn frame work- unit5
RojaT4
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
Reynold Xin
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
Urvashi Kataria
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
Robert Sanders
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
clairvoyantllc
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2
Fabio Fumarola
 
Resilient Distributed DataSets - Apache SPARK
Resilient Distributed DataSets - Apache SPARKResilient Distributed DataSets - Apache SPARK
Resilient Distributed DataSets - Apache SPARK
Taposh Roy
 
Geek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and ScalaGeek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and Scala
Atif Akhtar
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
Databricks
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
DataArt
 
Apache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptApache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.ppt
bhargavi804095
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
Massimo Schenone
 
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015 Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Sameer Farooqui
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scale
samthemonad
 
Hadoop mapreduce and yarn frame work- unit5
Hadoop mapreduce and yarn frame work-  unit5Hadoop mapreduce and yarn frame work-  unit5
Hadoop mapreduce and yarn frame work- unit5
RojaT4
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
Reynold Xin
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
Urvashi Kataria
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2
Fabio Fumarola
 
Resilient Distributed DataSets - Apache SPARK
Resilient Distributed DataSets - Apache SPARKResilient Distributed DataSets - Apache SPARK
Resilient Distributed DataSets - Apache SPARK
Taposh Roy
 
Ad

Recently uploaded (19)

5-Proses-proses Akuisisi Citra Digital.pptx
5-Proses-proses Akuisisi Citra Digital.pptx5-Proses-proses Akuisisi Citra Digital.pptx
5-Proses-proses Akuisisi Citra Digital.pptx
andani26
 
project_based_laaaaaaaaaaearning,kelompok 10.pptx
project_based_laaaaaaaaaaearning,kelompok 10.pptxproject_based_laaaaaaaaaaearning,kelompok 10.pptx
project_based_laaaaaaaaaaearning,kelompok 10.pptx
redzuriel13
 
Reliable Vancouver Web Hosting with Local Servers & 24/7 Support
Reliable Vancouver Web Hosting with Local Servers & 24/7 SupportReliable Vancouver Web Hosting with Local Servers & 24/7 Support
Reliable Vancouver Web Hosting with Local Servers & 24/7 Support
steve198109
 
APNIC Update, presented at NZNOG 2025 by Terry Sweetser
APNIC Update, presented at NZNOG 2025 by Terry SweetserAPNIC Update, presented at NZNOG 2025 by Terry Sweetser
APNIC Update, presented at NZNOG 2025 by Terry Sweetser
APNIC
 
Determining Glass is mechanical textile
Determining  Glass is mechanical textileDetermining  Glass is mechanical textile
Determining Glass is mechanical textile
Azizul Hakim
 
Mobile database for your company telemarketing or sms marketing campaigns. Fr...
Mobile database for your company telemarketing or sms marketing campaigns. Fr...Mobile database for your company telemarketing or sms marketing campaigns. Fr...
Mobile database for your company telemarketing or sms marketing campaigns. Fr...
DataProvider1
 
White and Red Clean Car Business Pitch Presentation.pptx
White and Red Clean Car Business Pitch Presentation.pptxWhite and Red Clean Car Business Pitch Presentation.pptx
White and Red Clean Car Business Pitch Presentation.pptx
canumatown
 
OSI TCP IP Protocol Layers description f
OSI TCP IP Protocol Layers description fOSI TCP IP Protocol Layers description f
OSI TCP IP Protocol Layers description f
cbr49917
 
IT Services Workflow From Request to Resolution
IT Services Workflow From Request to ResolutionIT Services Workflow From Request to Resolution
IT Services Workflow From Request to Resolution
mzmziiskd
 
Best web hosting Vancouver 2025 for you business
Best web hosting Vancouver 2025 for you businessBest web hosting Vancouver 2025 for you business
Best web hosting Vancouver 2025 for you business
steve198109
 
Understanding the Tor Network and Exploring the Deep Web
Understanding the Tor Network and Exploring the Deep WebUnderstanding the Tor Network and Exploring the Deep Web
Understanding the Tor Network and Exploring the Deep Web
nabilajabin35
 
(Hosting PHising Sites) for Cryptography and network security
(Hosting PHising Sites) for Cryptography and network security(Hosting PHising Sites) for Cryptography and network security
(Hosting PHising Sites) for Cryptography and network security
aluacharya169
 
Top Vancouver Green Business Ideas for 2025 Powered by 4GoodHosting
Top Vancouver Green Business Ideas for 2025 Powered by 4GoodHostingTop Vancouver Green Business Ideas for 2025 Powered by 4GoodHosting
Top Vancouver Green Business Ideas for 2025 Powered by 4GoodHosting
steve198109
 
DNS Resolvers and Nameservers (in New Zealand)
DNS Resolvers and Nameservers (in New Zealand)DNS Resolvers and Nameservers (in New Zealand)
DNS Resolvers and Nameservers (in New Zealand)
APNIC
 
highend-srxseries-services-gateways-customer-presentation.pptx
highend-srxseries-services-gateways-customer-presentation.pptxhighend-srxseries-services-gateways-customer-presentation.pptx
highend-srxseries-services-gateways-customer-presentation.pptx
elhadjcheikhdiop
 
Perguntas dos animais - Slides ilustrados de múltipla escolha
Perguntas dos animais - Slides ilustrados de múltipla escolhaPerguntas dos animais - Slides ilustrados de múltipla escolha
Perguntas dos animais - Slides ilustrados de múltipla escolha
socaslev
 
APNIC -Policy Development Process, presented at Local APIGA Taiwan 2025
APNIC -Policy Development Process, presented at Local APIGA Taiwan 2025APNIC -Policy Development Process, presented at Local APIGA Taiwan 2025
APNIC -Policy Development Process, presented at Local APIGA Taiwan 2025
APNIC
 
Computers Networks Computers Networks Computers Networks
Computers Networks Computers Networks Computers NetworksComputers Networks Computers Networks Computers Networks
Computers Networks Computers Networks Computers Networks
Tito208863
 
Smart Mobile App Pitch Deck丨AI Travel App Presentation Template
Smart Mobile App Pitch Deck丨AI Travel App Presentation TemplateSmart Mobile App Pitch Deck丨AI Travel App Presentation Template
Smart Mobile App Pitch Deck丨AI Travel App Presentation Template
yojeari421237
 
5-Proses-proses Akuisisi Citra Digital.pptx
5-Proses-proses Akuisisi Citra Digital.pptx5-Proses-proses Akuisisi Citra Digital.pptx
5-Proses-proses Akuisisi Citra Digital.pptx
andani26
 
project_based_laaaaaaaaaaearning,kelompok 10.pptx
project_based_laaaaaaaaaaearning,kelompok 10.pptxproject_based_laaaaaaaaaaearning,kelompok 10.pptx
project_based_laaaaaaaaaaearning,kelompok 10.pptx
redzuriel13
 
Reliable Vancouver Web Hosting with Local Servers & 24/7 Support
Reliable Vancouver Web Hosting with Local Servers & 24/7 SupportReliable Vancouver Web Hosting with Local Servers & 24/7 Support
Reliable Vancouver Web Hosting with Local Servers & 24/7 Support
steve198109
 
APNIC Update, presented at NZNOG 2025 by Terry Sweetser
APNIC Update, presented at NZNOG 2025 by Terry SweetserAPNIC Update, presented at NZNOG 2025 by Terry Sweetser
APNIC Update, presented at NZNOG 2025 by Terry Sweetser
APNIC
 
Determining Glass is mechanical textile
Determining  Glass is mechanical textileDetermining  Glass is mechanical textile
Determining Glass is mechanical textile
Azizul Hakim
 
Mobile database for your company telemarketing or sms marketing campaigns. Fr...
Mobile database for your company telemarketing or sms marketing campaigns. Fr...Mobile database for your company telemarketing or sms marketing campaigns. Fr...
Mobile database for your company telemarketing or sms marketing campaigns. Fr...
DataProvider1
 
White and Red Clean Car Business Pitch Presentation.pptx
White and Red Clean Car Business Pitch Presentation.pptxWhite and Red Clean Car Business Pitch Presentation.pptx
White and Red Clean Car Business Pitch Presentation.pptx
canumatown
 
OSI TCP IP Protocol Layers description f
OSI TCP IP Protocol Layers description fOSI TCP IP Protocol Layers description f
OSI TCP IP Protocol Layers description f
cbr49917
 
IT Services Workflow From Request to Resolution
IT Services Workflow From Request to ResolutionIT Services Workflow From Request to Resolution
IT Services Workflow From Request to Resolution
mzmziiskd
 
Best web hosting Vancouver 2025 for you business
Best web hosting Vancouver 2025 for you businessBest web hosting Vancouver 2025 for you business
Best web hosting Vancouver 2025 for you business
steve198109
 
Understanding the Tor Network and Exploring the Deep Web
Understanding the Tor Network and Exploring the Deep WebUnderstanding the Tor Network and Exploring the Deep Web
Understanding the Tor Network and Exploring the Deep Web
nabilajabin35
 
(Hosting PHising Sites) for Cryptography and network security
(Hosting PHising Sites) for Cryptography and network security(Hosting PHising Sites) for Cryptography and network security
(Hosting PHising Sites) for Cryptography and network security
aluacharya169
 
Top Vancouver Green Business Ideas for 2025 Powered by 4GoodHosting
Top Vancouver Green Business Ideas for 2025 Powered by 4GoodHostingTop Vancouver Green Business Ideas for 2025 Powered by 4GoodHosting
Top Vancouver Green Business Ideas for 2025 Powered by 4GoodHosting
steve198109
 
DNS Resolvers and Nameservers (in New Zealand)
DNS Resolvers and Nameservers (in New Zealand)DNS Resolvers and Nameservers (in New Zealand)
DNS Resolvers and Nameservers (in New Zealand)
APNIC
 
highend-srxseries-services-gateways-customer-presentation.pptx
highend-srxseries-services-gateways-customer-presentation.pptxhighend-srxseries-services-gateways-customer-presentation.pptx
highend-srxseries-services-gateways-customer-presentation.pptx
elhadjcheikhdiop
 
Perguntas dos animais - Slides ilustrados de múltipla escolha
Perguntas dos animais - Slides ilustrados de múltipla escolhaPerguntas dos animais - Slides ilustrados de múltipla escolha
Perguntas dos animais - Slides ilustrados de múltipla escolha
socaslev
 
APNIC -Policy Development Process, presented at Local APIGA Taiwan 2025
APNIC -Policy Development Process, presented at Local APIGA Taiwan 2025APNIC -Policy Development Process, presented at Local APIGA Taiwan 2025
APNIC -Policy Development Process, presented at Local APIGA Taiwan 2025
APNIC
 
Computers Networks Computers Networks Computers Networks
Computers Networks Computers Networks Computers NetworksComputers Networks Computers Networks Computers Networks
Computers Networks Computers Networks Computers Networks
Tito208863
 
Smart Mobile App Pitch Deck丨AI Travel App Presentation Template
Smart Mobile App Pitch Deck丨AI Travel App Presentation TemplateSmart Mobile App Pitch Deck丨AI Travel App Presentation Template
Smart Mobile App Pitch Deck丨AI Travel App Presentation Template
yojeari421237
 

Study Notes: Apache Spark

  • 1. Summary of Apache Spark Original Papers: 1. “Spark: Cluster Computing with Working Sets” by Matei Zaharia, et al. Hotcloud 2010. 2. “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing” by Matei Zaharia, et al. NSDI 2012.
  • 2. Motivation • MapReduce is great, but, • There are applications where you iterate on the same set of data, e.g., for many iterations { old_data = new_data; new_data = func(old_data); } • Problem: The body of each iteration can be described as a MapReduce task, where the inputs and final outputs are GFS files. There is redundant work storing the output data to GFS and then reading it out again in the next iteration. • Idea: We can provide a mode to cache the final outputs in memory if possible. o Challenge: but if the machine crashes, we lose the outputs, can we recover? o Solution: store the lineage of the data so that they can be reconstructed as needed (e.g., if they get lost or insufficient memory).
  • 3. Motivation • Spark’s goal was to generalize MapReduce to support new apps within same engine oMapReduce problems can be expressed in Spark too. oWhere Spark shines and MapReduce does not: applications that need to reuse a working set of data across multiple parallel operations • Two reasonably small additions allowed the previous specialized models to be expressed within Spark: ofast data sharing ogeneral DAGs
  • 4. Motivation Some key points about Spark: • handles batch, interactive, and real-time within a single framework • native integration with Java, Python, Scala. oHas APIs written in these languages • programming at a higher level of abstraction • more general: map/reduce is just one set of supported constructs
  • 5. Use Example • We’ll run Spark’s interactive shell… ./bin/spark-shell • Then from the “scala>” REPL prompt, let’s create some data… val data = 1 to 10000 • Create an RDD based on that data… val distData = sc.parallelize(data) • Then use a filter to select values less than 10… distData.filter(_ < 10).collect()
  • 6. Resilient Distributed Datasets (RDD) • Represents a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. • RDDs can only be created through deterministic operations (aka transformations) on: oEither data in stable storage:  Any file stored in HDFS  Or other storage systems supported by Hadoop oOr other RDDs • A program cannot reference an RDD that it cannot reconstruct after a failure
  • 7. Resilient Distributed Datasets (RDD) • Two types of operations on RDDs: transformations and actions o Programmers start by defining one or more RDDs through transformations on data in stable storage (e.g., map and filter). Transformations create a new dataset from an existing one  transformations are lazy (not computed immediately)  instead they remember the transformations applied to some base dataset  The transformed RDD gets recomputed when an action is run on it (default) o They can then use these RDDs in actions, which are operations that return a value to the application or export data to a storage system. • However, an RDD can be persisted into storage in memory or disk o Each node stores in memory any slices of it that it computes and reuses them in other actions on that dataset – often making future actions more than 10x faster o The cache is fault-tolerant: if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it o Note that by default, the base RDD (the original data in stable storage) is not loaded into RAM because the useful data after transformation might be only a small fraction (small enough to fit into memory).
9. RDD Implementation
Common interface of each RDD:
• A set of partitions: atomic pieces of the dataset
• A set of dependencies on parent RDDs: one RDD can have multiple parents
o narrow dependencies: each partition of the parent RDD is used by at most one partition of the child RDD.
o wide dependencies: multiple child partitions may depend on one parent partition. Requires the shuffle operation.
• A function for computing the dataset based on its parents
• Metadata about its partitioning scheme
• Metadata about its data placement, e.g., preferredLocations(p) returns a list of nodes where partition p can be accessed faster due to data locality
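The common interface can be sketched as a Scala trait; the names below (ToyRDD, CollectionRDD) are simplified illustrations, not Spark's actual internals.

```scala
// Illustrative sketch of the RDD interface listed above (names are invented).
case class Partition(index: Int)

sealed trait Dependency
case class NarrowDependency(parent: ToyRDD[_]) extends Dependency
case class WideDependency(parent: ToyRDD[_]) extends Dependency  // requires a shuffle

trait ToyRDD[T] {
  def partitions: Seq[Partition]                           // atomic pieces of the dataset
  def dependencies: Seq[Dependency]                        // parent RDDs
  def compute(p: Partition): Iterator[T]                   // derive a partition from parents
  def preferredLocations(p: Partition): Seq[String] = Nil  // data-locality hints
}

// A leaf RDD backed by an in-memory collection, split into two partitions.
class CollectionRDD(data: Seq[Int]) extends ToyRDD[Int] {
  val partitions = Seq(Partition(0), Partition(1))
  val dependencies = Nil
  def compute(p: Partition) = {
    val half = (data.length + 1) / 2
    if (p.index == 0) data.take(half).iterator else data.drop(half).iterator
  }
}

val rdd = new CollectionRDD(1 to 10)
val part0 = rdd.compute(Partition(0)).toList
```

A derived RDD would implement `compute` in terms of its parents' partitions, which is exactly the information lineage-based recovery replays.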
10. Narrow vs Wide Dependencies
• Narrow dependencies allow for pipelined execution on one cluster node, which can compute all the parent partitions. Wide dependencies require data from all parent partitions to be available and to be shuffled across the nodes using a MapReduce-like operation.
• Recovery after a node failure is more efficient with a narrow dependency, as only the lost parent partitions need to be recomputed, and the re-computation can be done in parallel.
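The recovery-cost difference can be made concrete with a toy model: given the dependency type of a lost child partition, which parent partitions must be recomputed? (A simplification: the narrow case assumes a one-to-one partition mapping.)

```scala
// Toy model of recomputation cost after losing one child partition.
sealed trait Dep
case object Narrow extends Dep
case object Wide extends Dep

def parentsToRecompute(dep: Dep, lostChild: Int, numParentPartitions: Int): Set[Int] =
  dep match {
    case Narrow => Set(lostChild)                       // 1-to-1: only the matching parent
    case Wide   => (0 until numParentPartitions).toSet  // shuffle: every parent partition
  }

val narrowCost = parentsToRecompute(Narrow, lostChild = 3, numParentPartitions = 8)
val wideCost   = parentsToRecompute(Wide,   lostChild = 3, numParentPartitions = 8)
```

With a narrow dependency one lost partition costs one parent partition; with a wide dependency the same loss touches all eight, which is why long chains of wide dependencies motivate checkpointing.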
11. Job Scheduling on RDDs
• Similar to Dryad, but takes data locality into account
• When the user runs an action, the scheduler builds a DAG of stages to execute. Each stage contains as many pipelined transformations with narrow dependencies as possible.
• Stage boundaries are the shuffle operations required for wide dependencies
• The scheduler launches tasks to compute missing partitions from each stage. Tasks are assigned to machines based on data locality using delay scheduling.
• For wide dependencies, intermediate records are materialized on the nodes holding parent partitions (similar to the intermediate map outputs of MapReduce) to simplify fault recovery.
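The stage-building rule can be sketched for the simple case of a linear chain of operations: narrow dependencies are pipelined into the current stage, and each wide dependency starts a new one. (A simplification of the real DAG scheduler, which handles arbitrary graphs.)

```scala
// Sketch: split a linear chain of ops into stages, cutting at wide (shuffle) deps.
// Each op is (name, isWide).
def stages(ops: List[(String, Boolean)]): List[List[String]] =
  ops.foldLeft(List(List.empty[String])) { case (acc, (name, isWide)) =>
    if (isWide) (name :: Nil) :: acc          // wide dep: start a new stage
    else (name :: acc.head) :: acc.tail       // narrow dep: pipeline into current stage
  }.map(_.reverse).reverse.filter(_.nonEmpty)

val plan = stages(List(
  ("map", false), ("filter", false),          // pipelined together
  ("groupByKey", true),                       // shuffle: stage boundary
  ("mapValues", false)                        // pipelined after the shuffle
))
```

Here `map` and `filter` fuse into one stage, and the shuffle required by `groupByKey` opens a second stage that also absorbs the narrow `mapValues`.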
12. Checkpointing
• Although lineage can always be used to recover RDDs after a failure, checkpointing can be helpful for RDDs with long lineage chains containing wide dependencies.
• For RDDs with narrow dependencies on data in stable storage, checkpointing is not worthwhile: reconstruction can be done in parallel for these RDDs, at a fraction of the cost of replicating the whole RDD.
13. RDD vs Distributed Shared Memory (DSM)
• Previous frameworks that support data reuse, e.g., Pregel and Piccolo:
o Perform data sharing implicitly for these patterns
o Specialized frameworks; do not provide abstractions for more general reuse
o Programming interface supports fine-grained updates (reads and writes to each memory location): fault tolerance requires expensive replication of data across machines or logging of updates across machines
• RDD:
o Only coarse-grained transformations (e.g., map, filter and join): apply the same operation to many data items. Note that reads on RDDs can still be fine-grained.
o Fault tolerance only requires logging the transformation used to build a dataset instead of the actual data
o RDDs are not suitable for applications that make asynchronous fine-grained updates to shared state.
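The "log the transformation, not the data" idea can be sketched in a few lines: recovery replays a small log of coarse-grained functions on the base data, instead of restoring replicated bytes.

```scala
// Sketch of lineage-based fault tolerance: we log only the coarse-grained
// transformations; a lost result is rebuilt by replaying them on the base data.
val baseData = (1 to 1000).toList                 // stands in for data in stable storage

// The "lineage": each entry transforms the whole dataset.
val lineage: List[List[Int] => List[Int]] = List(
  _.map(_ * 2),
  _.filter(_ % 3 == 0)
)

def materialize(base: List[Int], log: List[List[Int] => List[Int]]): List[Int] =
  log.foldLeft(base)((data, step) => step(data))

val original = materialize(baseData, lineage)
// ... suppose `original` is lost in a crash; replaying the log rebuilds it ...
val recovered = materialize(baseData, lineage)
```

The log is tiny (two closures) regardless of how large the dataset is, which is why this is so much cheaper than DSM-style replication of the data itself.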
14. RDD vs Distributed Shared Memory (DSM)
• Other benefits of RDDs:
oRDDs are immutable. A system can mitigate slow nodes (stragglers) by running backup copies of slow tasks, as in MapReduce.
oIn bulk operations, a runtime can schedule tasks based on data locality to improve performance.
oRDDs degrade gracefully when there is not enough memory to store them. An LRU eviction policy is used at the level of RDDs: a partition from the least recently accessed RDD is evicted to make room for a newly computed RDD partition. This is user-configurable via the “persistence priority” for each RDD.
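The RDD-level LRU policy can be sketched with a simple access-timestamp model (an assumption for illustration, not Spark's actual memory manager): the victim partition comes from the least recently accessed RDD, never from the RDD being computed right now.

```scala
// Sketch of RDD-level LRU eviction (a toy model, not Spark's BlockManager).
import scala.collection.mutable

val cached = mutable.Map[String, Int]()       // rddName -> number of cached partitions
val lastAccess = mutable.Map[String, Long]()  // rddName -> logical access time
var clock = 0L

def access(rdd: String): Unit = { clock += 1; lastAccess(rdd) = clock }

// Evict one partition from the least recently accessed RDD other than the
// RDD requesting space, so the working set is not evicted to serve itself.
def evictOnePartition(requestingRdd: String): Option[String] = {
  val victims = cached.keys.filter(r => r != requestingRdd && cached(r) > 0)
  val victim = victims.toSeq.sortBy(lastAccess.getOrElse(_, 0L)).headOption
  victim.foreach { r => cached(r) -= 1 }
  victim
}

cached ++= Map("logsRDD" -> 4, "featuresRDD" -> 4)
access("logsRDD"); access("featuresRDD")      // featuresRDD is now the most recent
val evictedFrom = evictOnePartition(requestingRdd = "featuresRDD")
```

Because eviction is per-RDD rather than per-partition, an RDD whose partitions are being cycled through does not evict its own still-needed partitions.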
15. Debugging RDDs
• One can reconstruct the RDDs later from the lineage and let the user query them interactively.
• One can re-run any task from the job in a single-process debugger by recomputing the RDD partitions it depends on.
• Similar to replay debuggers, but without the capturing/recording overhead.
16. RDD Example
// load error messages from a log into memory; then interactively search for
// various patterns

// base RDD
val lines = sc.textFile("hdfs://...")

// transformed RDDs!
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")).map(r => r(1))
messages.cache()

// action 1
messages.filter(_.contains("mysql")).count()

// action 2
messages.filter(_.contains("php")).count()

• lines, errors, and messages are all RDDs; filter and map are transformations; each count() is an action that returns a value to the driver.
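The same pipeline can be run on a plain Scala collection to see what each step produces (a local sketch, no SparkContext or cluster involved):

```scala
// Local version of the log-mining example, using ordinary Scala collections.
val logLines = List(
  "INFO\tstartup",
  "ERROR\tmysql connection refused",
  "ERROR\tphp timeout",
  "ERROR\tmysql deadlock"
)

val errs = logLines.filter(_.startsWith("ERROR"))   // keep only error lines
val msgs = errs.map(_.split("\t")).map(r => r(1))   // extract the message field

val mysqlCount = msgs.count(_.contains("mysql"))    // "action 1"
val phpCount   = msgs.count(_.contains("php"))      // "action 2"
```

In Spark the same chain would be lazy and distributed, and `messages.cache()` would keep the extracted messages in memory across the two counts.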
20. Shared Variables
• Broadcast variables let the programmer keep a read-only variable cached on each machine, rather than shipping a copy of it with tasks.
oFor example, to give every node a copy of a large input dataset efficiently
• Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
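The saving can be sketched with a toy cost model: without broadcast, the read-only dataset ships once per task; with broadcast, once per node. (The numbers below are illustrative.)

```scala
// Toy cost model for broadcast variables.
case class Cluster(nodes: Int, tasksPerNode: Int)

def copiesShipped(c: Cluster, broadcast: Boolean): Int =
  if (broadcast) c.nodes                  // cached once per machine
  else c.nodes * c.tasksPerNode           // shipped with every task

val c = Cluster(nodes = 10, tasksPerNode = 16)
val without = copiesShipped(c, broadcast = false)
val withBc  = copiesShipped(c, broadcast = true)
```

In Spark itself this corresponds to wrapping the dataset with `sc.broadcast(...)` and reading it inside tasks via `.value`.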
21. Shared Variables
• Accumulators are variables that can only be “added” to through an associative operation.
oUsed to implement counters and sums efficiently in parallel
• Spark natively supports accumulators of numeric value types and standard mutable collections, and programmers can extend support to new types.
• Only the driver program can read an accumulator’s value, not the workers.
oEach accumulator is given a unique ID upon creation
oEach worker creates a separate copy of the accumulator
oEach worker sends a message to the driver about its updates to the accumulator
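The protocol above can be sketched in plain Scala: each "worker" accumulates into its own local copy, sends one update back, and the driver merges the updates with an associative add. (ToyAccumulator is an invented name for illustration.)

```scala
// Sketch of the accumulator protocol: local copies on workers, merged at the driver.
class ToyAccumulator(val id: Int) {
  private var total = 0L                      // only the driver reads this
  def merge(workerLocal: Long): Unit = total += workerLocal
  def value: Long = total                     // driver-side read
}

val acc = new ToyAccumulator(id = 1)

// Each "worker" counts the records in its partition into a local copy...
val partitions = Seq(Seq("a", "b"), Seq("c"), Seq("d", "e", "f"))
val workerLocals = partitions.map(p => p.size.toLong)

// ...and sends a single update message back to the driver, which merges them.
workerLocals.foreach(acc.merge)
```

Because addition is associative, the updates can arrive in any order and the driver-side total is still correct, which is what makes accumulators cheap to implement in parallel.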

Editor's Notes

  • #9: Lineage: a pointer to the parent RDD, plus information about the transformation performed on the parent