Advanced Spark
Reynold Xin, July 2, 2014 @ Spark Summit Training
This Talk
Formalize RDD concept

Life of a Spark Application

Performance Debugging


* Assumes you can write word count and know what a
transformation/action is
“Mechanical sympathy” by Jackie Stewart: a driver does not
need to know how to build an engine but they need to know
the fundamentals of how one works to get the best out of it
Reynold Xin
Apache Spark committer (worked on almost every
module: core, sql, mllib, graph)

Product & open-source eng @ Databricks

On leave from PhD @ UC Berkeley AMPLab
Example Application
val sc = new SparkContext(...)
val file = sc.textFile("hdfs://...")            // resilient distributed dataset (RDD)
val errors = file.filter(_.contains("ERROR"))   // resilient distributed dataset (RDD)
errors.cache()
errors.count()                                  // action
Quiz: what is an “RDD”?
A: distributed collection of objects on disk

B: distributed collection of objects in memory

C: distributed collection of objects in Cassandra

Answer: could be any of the above!
Scientific Answer: RDD is an Interface!
1.  Set of partitions (“splits” in Hadoop)
2.  List of dependencies on parent RDDs
3.  Function to compute a partition (as an Iterator) given its parent(s)
4.  (Optional) partitioner (hash, range)
5.  (Optional) preferred location(s) for each partition

Properties 1-3: “lineage”
Properties 4-5: optimized execution
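The five properties can be pictured as a small Scala trait. This is a simplified model for illustration only, not Spark's actual RDD class; the names ToyRDD, ToyPartition, ParallelCollection and collect are hypothetical.

```scala
object RddInterfaceSketch {
  // A partition is identified only by its index in this toy model.
  final case class ToyPartition(index: Int)

  // Simplified model of the five properties (hypothetical, not Spark's real API).
  trait ToyRDD[T] {
    def partitions: Seq[ToyPartition]                              // 1. set of partitions
    def dependencies: Seq[ToyRDD[_]]                               // 2. parent RDDs
    def compute(part: ToyPartition): Iterator[T]                   // 3. compute one partition
    def partitioner: Option[Any => Int] = None                     // 4. optional partitioner
    def preferredLocations(part: ToyPartition): Seq[String] = Nil  // 5. optional locality hints
  }

  // A source RDD over an in-memory sequence, split into numSlices partitions.
  final class ParallelCollection[T](data: IndexedSeq[T], numSlices: Int) extends ToyRDD[T] {
    def partitions: Seq[ToyPartition] = (0 until numSlices).map(ToyPartition(_))
    def dependencies: Seq[ToyRDD[_]] = Nil
    def compute(part: ToyPartition): Iterator[T] = {
      val start = part.index * data.length / numSlices
      val end = (part.index + 1) * data.length / numSlices
      data.slice(start, end).iterator
    }
  }

  // Conceptually what an action does: evaluate every partition and combine the results.
  def collect[T](rdd: ToyRDD[T]): Seq[T] =
    rdd.partitions.flatMap(p => rdd.compute(p).toSeq)
}
```

Only the first three members are needed to recompute lost data; the last two have defaults because they only help the scheduler make better choices.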
Example: HadoopRDD
partitions = one per HDFS block

dependencies = none

compute(part) = read corresponding block

preferredLocations(part) = HDFS block location

partitioner = none
Example: Filtered RDD
partitions = same as parent RDD

dependencies = “one-to-one” on parent

compute(part) = compute parent and filter it

preferredLocations(part) = none (ask parent)

partitioner = none
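A minimal sketch of the one-to-one dependency described above, using a toy model rather than Spark's classes (ToyRDD, Source and Filtered are hypothetical names). It shows why a filter can be pipelined: computing partition i of the child only wraps the iterator of partition i of the parent.

```scala
object FilteredRddSketch {
  // Minimal stand-in for an RDD: a partition count plus a per-partition compute function.
  trait ToyRDD[T] {
    def numPartitions: Int
    def compute(partition: Int): Iterator[T]
  }

  // Source with explicitly listed partitions (stands in for "one partition per HDFS block").
  final class Source[T](blocks: Vector[Vector[T]]) extends ToyRDD[T] {
    def numPartitions: Int = blocks.length
    def compute(partition: Int): Iterator[T] = blocks(partition).iterator
  }

  // One-to-one dependency: partition i of the child reads only partition i of the parent.
  final class Filtered[T](parent: ToyRDD[T], pred: T => Boolean) extends ToyRDD[T] {
    def numPartitions: Int = parent.numPartitions   // same partitions as the parent
    def compute(partition: Int): Iterator[T] =
      parent.compute(partition).filter(pred)        // lazy, so it is pipelined with the parent
  }
}
```

No intermediate collection is built between the source and the filter; records flow through both iterators one at a time.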
RDD Graph (DAG of tasks)
Dataset-level view:
file: HadoopRDD (path = hdfs://...)
errors: FilteredRDD (func = _.contains(…), shouldCache = true)

Partition-level view:
Task1, Task2, ...
Example: JoinedRDD
partitions = one per reduce task

dependencies = “shuffle” on each parent

compute(partition) = read and join shuffled data

preferredLocations(part) = none

partitioner = HashPartitioner(numTasks)
Spark will now know
this data is hashed!
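To illustrate what "hashed" means here, the sketch below assigns each key to a partition by taking its hash code modulo the number of partitions, kept non-negative. It is a plain-Scala illustration of the idea; partitionFor and partitionByKey are hypothetical helper names, not Spark API.

```scala
object HashPartitionSketch {
  // Map a key to a partition in [0, numPartitions), handling null keys and negative hash codes.
  def partitionFor(key: Any, numPartitions: Int): Int = key match {
    case null => 0
    case _ =>
      val raw = key.hashCode % numPartitions
      if (raw < 0) raw + numPartitions else raw
  }

  // Group key-value records by the partition their key hashes to.
  def partitionByKey[K, V](records: Seq[(K, V)], numPartitions: Int): Map[Int, Seq[(K, V)]] =
    records.groupBy { case (k, _) => partitionFor(k, numPartitions) }
}
```

Because the mapping depends only on the key and the partition count, two datasets hashed with the same partition count place equal keys at the same partition index, which is what makes a later join on those keys cheaper.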
Dependency Types
“Narrow” (pipeline-able):
>  map, filter
>  union
>  join with inputs co-partitioned

“Wide” (shuffle):
>  groupByKey on non-partitioned data
>  join with inputs not co-partitioned
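The join cases can be summarized as a per-parent check: a parent needs no shuffle only if it is already partitioned exactly like the join output. The sketch below is a simplified model of that rule; ToyPartitioner and joinDependency are hypothetical names, and real partitioners carry more than a kind and a count.

```scala
object DependencySketch {
  sealed trait Dependency
  case object Narrow extends Dependency   // pipeline-able, no data movement
  case object Wide extends Dependency     // requires a shuffle

  // A partitioner is modeled only by its kind and its number of partitions.
  final case class ToyPartitioner(kind: String, numPartitions: Int)

  // Narrow if the parent is already partitioned like the join output, wide otherwise.
  def joinDependency(parent: Option[ToyPartitioner], target: ToyPartitioner): Dependency =
    if (parent.contains(target)) Narrow else Wide
}
```

For example, joining two inputs that both use a hash partitioner with 4 partitions yields two narrow dependencies, while an input with no partitioner yields a wide one.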
Recap
Each RDD consists of 5 properties: 

1.  partitions
2.  dependencies
3.  compute
4.  (optional) partitioner
5.  (optional) preferred locations
Life of a Spark Application
Spark Application

Your program (JVM / Python):
sc = new SparkContext
f = sc.textFile("…")
f.filter(…)
.count()
...

Spark driver (app master): RDD graph, Scheduler, Block tracker, Shuffle tracker

Cluster manager

Spark executor (multiple of them): Task threads, Block manager

HDFS, HBase, …
A single application often contains multiple actions
Job Scheduling Process
rdd1.join(rdd2)
.groupBy(…)
.filter(…)
.count()

1.  RDD Objects: build operator DAG
2.  Scheduler (DAGScheduler): split graph into stages of tasks; submit each stage as ready
3.  Executors: execute tasks; store and serve blocks (Block manager, Threads)

Flow: RDD Objects → (DAG) → Scheduler → (Task) → Executors
DAG Scheduler
Input: RDD and partitions to compute

Output: output from actions on those partitions

Roles:
>  Build stages of tasks
>  Submit them to lower level scheduler (e.g. YARN,
Mesos, Standalone) as ready
>  Lower level scheduler will schedule tasks based on
data locality
>  Resubmit failed stages if outputs are lost
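The "build stages" role can be sketched as a graph walk that follows narrow edges and cuts at shuffle edges. This is a deliberately simplified model, not the DAGScheduler's implementation: Node, stageMembers, shuffleParents and stages are hypothetical names, and caching, locality and failure handling are ignored.

```scala
object StageSketch {
  // A node in the RDD graph; each parent edge is flagged true if it is a shuffle ("wide") edge.
  final case class Node(name: String, parents: List[(Node, Boolean)] = Nil)

  // RDDs that run in the same stage as `root`: follow narrow edges only.
  def stageMembers(root: Node): Set[String] =
    root.parents.foldLeft(Set(root.name)) {
      case (acc, (parent, isShuffle)) =>
        if (isShuffle) acc else acc ++ stageMembers(parent)
    }

  // Parents reached across a shuffle edge are the last RDDs of upstream stages.
  def shuffleParents(root: Node): List[Node] =
    root.parents.flatMap {
      case (parent, true)  => List(parent)
      case (parent, false) => shuffleParents(parent)
    }

  // All stages, upstream stages first and the final stage last.
  def stages(finalRdd: Node): List[Set[String]] =
    (shuffleParents(finalRdd).flatMap(stages) :+ stageMembers(finalRdd)).distinct
}
```

Applied to a graph shaped like rdd1.join(rdd2).groupBy(…).filter(…), with both join inputs unpartitioned, this yields four stages: one per input, one for the join, and one that pipelines groupBy with filter.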
Scheduler Optimizations
Pipelines operations
within a stage

Picks join algorithms
based on partitioning
(minimize shuffles)

Reuses previously
cached data
[Figure: example job over RDDs A to G using groupBy, map, union and join, split into Stage 1, Stage 2 and Stage 3. Legend: previously computed partition; Task]
Task
Unit of work to execute in an executor thread

Unlike MR, there is no “map” vs “reduce” task

Each task either partitions its output for “shuffle”, or
sends the output back to the driver
Shuffle
[Figure: data redistributed from Stage 1 to Stage 2]

Redistributes data among partitions

Partition keys into buckets
(user-defined partitioner)

Optimizations:
>  Avoided when possible, if data is already properly partitioned
>  Partial aggregation reduces data movement
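A small plain-Scala sketch of why partial aggregation reduces data movement: each partition first sums its own values per key, so at most one record per key per partition crosses the shuffle. The helper names combineLocally and mergeCombined are hypothetical, and the per-key sum stands in for any associative aggregation.

```scala
object PartialAggregationSketch {
  // Map side: sum values per key inside one partition before shuffling.
  def combineLocally(partition: Seq[(String, Int)]): Map[String, Int] =
    partition.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sum }

  // Reduce side: merge the per-partition partial sums into final totals.
  def mergeCombined(partials: Seq[Map[String, Int]]): Map[String, Int] =
    partials.flatten.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sum }
}
```

With two partitions of three records each and two distinct keys, four partial sums are shuffled instead of six raw records; the saving grows with the number of repeated keys per partition.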
Shuffle
[Figure: Stage 1 → Disk → Stage 2]
Write intermediate files to disk
Fetched by the next stage of tasks (“reduce” in MR)
Recap: Job Scheduling
rdd1.join(rdd2)
.groupBy(…)
.filter(…)
.count()

1.  RDD Objects: build operator DAG
2.  Scheduler (DAGScheduler): split graph into stages of tasks; submit each stage as ready
3.  Executors: execute tasks; store and serve blocks (Block manager, Threads)

Flow: RDD Objects → (DAG) → Scheduler → (Task) → Executors
Performance Debugging
Performance Debugging
Distributed performance: program slow due to
scheduling, coordination, or data distribution

Local performance: program slow because whatever
I’m running is just slow on a single node

Two useful tools:
>  Application web UI (default port 4040)
>  Executor logs (spark/work)
Find Slow Stage(s)
Stragglers?
Some tasks are just slower than others.

Easy to identify from summary metrics:
Stragglers due to slow nodes
sc.parallelize(1 to 15, 15).map { index =>
  // Simulate one slow node: tasks on this host take 10x longer
  val host = java.net.InetAddress.getLocalHost.getHostName
  if (host == "ip-172-31-2-222") {
    Thread.sleep(10000)
  } else {
    Thread.sleep(1000)
  }
}.count()
Stragglers due to slow nodes
Turn speculation on to mitigate this problem.

Speculation: Spark identifies slow tasks (by looking
at runtime distribution), and re-launches those tasks
on other nodes.

spark.speculation true
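The idea of "looking at runtime distribution" can be sketched as a threshold test against the median runtime of finished tasks. This is a toy model, not Spark's scheduler code; the function speculatable is hypothetical, and its multiplier and quantile parameters are only loosely modeled on the spark.speculation.multiplier and spark.speculation.quantile settings, with illustrative values.

```scala
object SpeculationSketch {
  // finished: runtimes (ms) of completed tasks.
  // running: elapsed time (ms) of tasks still in flight, keyed by task id.
  // Returns the ids of running tasks that look like stragglers.
  def speculatable(
      finished: Seq[Long],
      running: Map[Int, Long],
      multiplier: Double,
      quantile: Double): Set[Int] = {
    val total = finished.size + running.size
    // Only speculate once enough of the stage has finished to trust the distribution.
    if (finished.isEmpty || finished.size < quantile * total) Set.empty
    else {
      val sorted = finished.sorted
      val median = sorted(sorted.size / 2)
      val threshold = multiplier * median
      running.collect { case (id, elapsed) if elapsed > threshold => id }.toSet
    }
  }
}
```

In a stage like the slow-node example, where most tasks finish in about one second, a task that has already run for five seconds exceeds 1.5 times the median and would be a candidate for re-launch on another node.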
Demo Time: slow node
Stragglers due to data skew
sc.parallelize(1 to 15, 15)
  .flatMap { i => 1 to i }            // element i expands to i records
  .map { i => Thread.sleep(1000) }    // one second of work per record
  .count()

Speculation is not going to help because the
problem is inherent in the algorithm/data.

Pick a different algorithm or restructure the data.
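The skew in this example can be quantified without a cluster. The sketch below assumes each of the 15 partitions starts with exactly one element, so partition i ends up with i records after the flatMap; recordsPerPartition and skewRatio are hypothetical helper names.

```scala
object SkewSketch {
  // Records per partition after flatMap { i => 1 to i }, assuming partition i holds only element i.
  def recordsPerPartition(n: Int): Seq[Int] = (1 to n).map(i => (1 to i).size)

  // Ratio of the slowest task to the average task, assuming equal cost per record.
  def skewRatio(records: Seq[Int]): Double =
    records.max.toDouble / (records.sum.toDouble / records.size)
}
```

For 15 partitions the largest task processes 15 records against an average of 8, so re-running it elsewhere cannot make it finish sooner than its own 15 seconds of work.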
Demo Time
Tasks are just slow
Garbage collection


Performance of the code running in each task
Garbage Collection
Look at the “GC Time” column in the web UI
What if the task is still running?
To discover whether GC is the problem:

1.  Set spark.executor.extraJavaOptions to include:
"-XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
2.  Look at spark/work/app…/[n]/stdout on
executors
3.  Short GC times are OK. Long ones are bad.
jmap: heap analysis
jmap -histo [pid]
Gets a histogram of objects in the JVM heap

jmap -histo:live [pid]
Gets a histogram of objects in the heap after GC
(thus “live”)
Find out what objects are the trouble
Demo: GC log & jmap
Reduce GC impact
class DummyObject(var i: Int) {
  def toInt = i
}

sc.parallelize(1 to 100 * 1000 * 1000, 1).map { i =>
  val obj = new DummyObject(i) // new object every record
  obj.toInt
}


sc.parallelize(1 to 100 * 1000 * 1000, 1).mapPartitions { iter =>
  val obj = new DummyObject(0) // reuse the same object
  iter.map { i =>
    obj.i = i
    obj.toInt
  }
}
Local Performance
Each Spark executor runs a JVM/Python process

Insert your favorite JVM/Python profiling tool
>  jstack
>  YourKit
>  VisualVM
>  println
>  (sorry I don’t know a whole lot about Python)
>  …
Example: identify expensive comp.
def someCheapComputation(record: Int): Int = record + 1

def someExpensiveComputation(record: Int): String = {
  Thread.sleep(1000)
  record.toString
}

sc.parallelize(1 to 100000).map { record =>
  val step1 = someCheapComputation(record)
  val step2 = someExpensiveComputation(step1)
  step2
}.saveAsTextFile("hdfs:/tmp1")
Demo Time
jstack
Can often pinpoint problems just by running “jstack” a few times
YourKit (free for open source dev)
Debugging Tip
Local Debugging
Run in local mode (i.e. Spark master “local”) and
debug with your favorite debugger
>  IntelliJ
>  Eclipse
>  println

With a sample dataset
What have we learned?
RDD abstraction
>  lineage info: partitions, dependencies, compute
>  optimization info: partitioner, preferred locations

Execution process (from RDD to tasks)

Performance & debugging
Thank You!
Ad

More Related Content

What's hot (18)

Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
DataArt
 
Learning spark ch10 - Spark Streaming
Learning spark ch10 - Spark StreamingLearning spark ch10 - Spark Streaming
Learning spark ch10 - Spark Streaming
phanleson
 
Scala and spark
Scala and sparkScala and spark
Scala and spark
Fabio Fumarola
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
Ahmet Bulut
 
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
spark-project
 
Apache Spark Internals
Apache Spark InternalsApache Spark Internals
Apache Spark Internals
Knoldus Inc.
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
Anton Kirillov
 
Spark and spark streaming internals
Spark and spark streaming internalsSpark and spark streaming internals
Spark and spark streaming internals
Sigmoid
 
Learning spark ch01 - Introduction to Data Analysis with Spark
Learning spark ch01 - Introduction to Data Analysis with SparkLearning spark ch01 - Introduction to Data Analysis with Spark
Learning spark ch01 - Introduction to Data Analysis with Spark
phanleson
 
Internals
InternalsInternals
Internals
Sandeep Purohit
 
Spark overview
Spark overviewSpark overview
Spark overview
Lisa Hua
 
Strata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingStrata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark Streaming
Databricks
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
Pietro Michiardi
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
Farzad Nozarian
 
Learning spark ch01 - Introduction to Data Analysis with Spark
Learning spark ch01 - Introduction to Data Analysis with SparkLearning spark ch01 - Introduction to Data Analysis with Spark
Learning spark ch01 - Introduction to Data Analysis with Spark
phanleson
 
Learning spark ch07 - Running on a Cluster
Learning spark ch07 - Running on a ClusterLearning spark ch07 - Running on a Cluster
Learning spark ch07 - Running on a Cluster
phanleson
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
Dean Chen
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
Duyhai Doan
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
DataArt
 
Learning spark ch10 - Spark Streaming
Learning spark ch10 - Spark StreamingLearning spark ch10 - Spark Streaming
Learning spark ch10 - Spark Streaming
phanleson
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
Ahmet Bulut
 
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
spark-project
 
Apache Spark Internals
Apache Spark InternalsApache Spark Internals
Apache Spark Internals
Knoldus Inc.
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
Anton Kirillov
 
Spark and spark streaming internals
Spark and spark streaming internalsSpark and spark streaming internals
Spark and spark streaming internals
Sigmoid
 
Learning spark ch01 - Introduction to Data Analysis with Spark
Learning spark ch01 - Introduction to Data Analysis with SparkLearning spark ch01 - Introduction to Data Analysis with Spark
Learning spark ch01 - Introduction to Data Analysis with Spark
phanleson
 
Spark overview
Spark overviewSpark overview
Spark overview
Lisa Hua
 
Strata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingStrata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark Streaming
Databricks
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
Pietro Michiardi
 
Learning spark ch01 - Introduction to Data Analysis with Spark
Learning spark ch01 - Introduction to Data Analysis with SparkLearning spark ch01 - Introduction to Data Analysis with Spark
Learning spark ch01 - Introduction to Data Analysis with Spark
phanleson
 
Learning spark ch07 - Running on a Cluster
Learning spark ch07 - Running on a ClusterLearning spark ch07 - Running on a Cluster
Learning spark ch07 - Running on a Cluster
phanleson
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
Dean Chen
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
Duyhai Doan
 

Similar to Advanced spark training advanced spark internals and tuning reynold xin (20)

Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
Databricks
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
Patrick Wendell
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
Massimo Schenone
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
Databricks
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
Sneha Challa
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark
YahooTechConference
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Vincent Poncet
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Mohamed hedi Abidi
 
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
IT Event
 
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
Lucidworks
 
Python and Bigdata - An Introduction to Spark (PySpark)
Python and Bigdata -  An Introduction to Spark (PySpark)Python and Bigdata -  An Introduction to Spark (PySpark)
Python and Bigdata - An Introduction to Spark (PySpark)
hiteshnd
 
Scala Meetup Hamburg - Spark
Scala Meetup Hamburg - SparkScala Meetup Hamburg - Spark
Scala Meetup Hamburg - Spark
Ivan Morozov
 
Spark 101
Spark 101Spark 101
Spark 101
Mohit Garg
 
Brief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICMEBrief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICME
Paco Nathan
 
Bigdata processing with Spark - part II
Bigdata processing with Spark - part IIBigdata processing with Spark - part II
Bigdata processing with Spark - part II
Arjen de Vries
 
TriHUG talk on Spark and Shark
TriHUG talk on Spark and SharkTriHUG talk on Spark and Shark
TriHUG talk on Spark and Shark
trihug
 
Microservices, Containers, and Machine Learning
Microservices, Containers, and Machine LearningMicroservices, Containers, and Machine Learning
Microservices, Containers, and Machine Learning
Paco Nathan
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
Benjamin Bengfort
 
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Databricks
 
Apache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptApache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.ppt
bhargavi804095
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
Databricks
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
Patrick Wendell
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
Massimo Schenone
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
Databricks
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
Sneha Challa
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark
YahooTechConference
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Vincent Poncet
 
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
IT Event
 
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
Lucidworks
 
Python and Bigdata - An Introduction to Spark (PySpark)
Python and Bigdata -  An Introduction to Spark (PySpark)Python and Bigdata -  An Introduction to Spark (PySpark)
Python and Bigdata - An Introduction to Spark (PySpark)
hiteshnd
 
Scala Meetup Hamburg - Spark
Scala Meetup Hamburg - SparkScala Meetup Hamburg - Spark
Scala Meetup Hamburg - Spark
Ivan Morozov
 
Brief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICMEBrief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICME
Paco Nathan
 
Bigdata processing with Spark - part II
Bigdata processing with Spark - part IIBigdata processing with Spark - part II
Bigdata processing with Spark - part II
Arjen de Vries
 
TriHUG talk on Spark and Shark
TriHUG talk on Spark and SharkTriHUG talk on Spark and Shark
TriHUG talk on Spark and Shark
trihug
 
Microservices, Containers, and Machine Learning
Microservices, Containers, and Machine LearningMicroservices, Containers, and Machine Learning
Microservices, Containers, and Machine Learning
Paco Nathan
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
Benjamin Bengfort
 
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Databricks
 
Apache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptApache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.ppt
bhargavi804095
 
Ad

Recently uploaded (20)

DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
charlesdick1345
 
π0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalizationπ0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalization
NABLAS株式会社
 
Degree_of_Automation.pdf for Instrumentation and industrial specialist
Degree_of_Automation.pdf for  Instrumentation  and industrial specialistDegree_of_Automation.pdf for  Instrumentation  and industrial specialist
Degree_of_Automation.pdf for Instrumentation and industrial specialist
shreyabhosale19
 
RICS Membership-(The Royal Institution of Chartered Surveyors).pdf
RICS Membership-(The Royal Institution of Chartered Surveyors).pdfRICS Membership-(The Royal Institution of Chartered Surveyors).pdf
RICS Membership-(The Royal Institution of Chartered Surveyors).pdf
MohamedAbdelkader115
 
Process Parameter Optimization for Minimizing Springback in Cold Drawing Proc...
Process Parameter Optimization for Minimizing Springback in Cold Drawing Proc...Process Parameter Optimization for Minimizing Springback in Cold Drawing Proc...
Process Parameter Optimization for Minimizing Springback in Cold Drawing Proc...
Journal of Soft Computing in Civil Engineering
 
The Gaussian Process Modeling Module in UQLab
The Gaussian Process Modeling Module in UQLabThe Gaussian Process Modeling Module in UQLab
The Gaussian Process Modeling Module in UQLab
Journal of Soft Computing in Civil Engineering
 
Structural Response of Reinforced Self-Compacting Concrete Deep Beam Using Fi...
Structural Response of Reinforced Self-Compacting Concrete Deep Beam Using Fi...Structural Response of Reinforced Self-Compacting Concrete Deep Beam Using Fi...
Structural Response of Reinforced Self-Compacting Concrete Deep Beam Using Fi...
Journal of Soft Computing in Civil Engineering
 
new ppt artificial intelligence historyyy
new ppt artificial intelligence historyyynew ppt artificial intelligence historyyy
new ppt artificial intelligence historyyy
PianoPianist
 
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptxExplainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
MahaveerVPandit
 
Level 1-Safety.pptx Presentation of Electrical Safety
Level 1-Safety.pptx Presentation of Electrical SafetyLevel 1-Safety.pptx Presentation of Electrical Safety
Level 1-Safety.pptx Presentation of Electrical Safety
JoseAlbertoCariasDel
 
AI-assisted Software Testing (3-hours tutorial)
AI-assisted Software Testing (3-hours tutorial)AI-assisted Software Testing (3-hours tutorial)
AI-assisted Software Testing (3-hours tutorial)
Vəhid Gəruslu
 
Data Structures_Introduction to algorithms.pptx
Data Structures_Introduction to algorithms.pptxData Structures_Introduction to algorithms.pptx
Data Structures_Introduction to algorithms.pptx
RushaliDeshmukh2
 
Artificial Intelligence (AI) basics.pptx
Artificial Intelligence (AI) basics.pptxArtificial Intelligence (AI) basics.pptx
Artificial Intelligence (AI) basics.pptx
aditichinar
 
Compiler Design Unit1 PPT Phases of Compiler.pptx
Compiler Design Unit1 PPT Phases of Compiler.pptxCompiler Design Unit1 PPT Phases of Compiler.pptx
Compiler Design Unit1 PPT Phases of Compiler.pptx
RushaliDeshmukh2
 
IntroSlides-April-BuildWithAI-VertexAI.pdf
IntroSlides-April-BuildWithAI-VertexAI.pdfIntroSlides-April-BuildWithAI-VertexAI.pdf
IntroSlides-April-BuildWithAI-VertexAI.pdf
Luiz Carneiro
 
fluke dealers in bangalore..............
fluke dealers in bangalore..............fluke dealers in bangalore..............
fluke dealers in bangalore..............
Haresh Vaswani
 
ELectronics Boards & Product Testing_Shiju.pdf
ELectronics Boards & Product Testing_Shiju.pdfELectronics Boards & Product Testing_Shiju.pdf
ELectronics Boards & Product Testing_Shiju.pdf
Shiju Jacob
 
some basics electrical and electronics knowledge
some basics electrical and electronics knowledgesome basics electrical and electronics knowledge
some basics electrical and electronics knowledge
nguyentrungdo88
 
"Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E...
"Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E..."Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E...
"Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E...
Infopitaara
 
railway wheels, descaling after reheating and before forging
railway wheels, descaling after reheating and before forgingrailway wheels, descaling after reheating and before forging
railway wheels, descaling after reheating and before forging
Javad Kadkhodapour
 
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
charlesdick1345
 
π0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalizationπ0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalization
NABLAS株式会社
 
Degree_of_Automation.pdf for Instrumentation and industrial specialist
Degree_of_Automation.pdf for  Instrumentation  and industrial specialistDegree_of_Automation.pdf for  Instrumentation  and industrial specialist
Degree_of_Automation.pdf for Instrumentation and industrial specialist
shreyabhosale19
 
RICS Membership-(The Royal Institution of Chartered Surveyors).pdf
RICS Membership-(The Royal Institution of Chartered Surveyors).pdfRICS Membership-(The Royal Institution of Chartered Surveyors).pdf
RICS Membership-(The Royal Institution of Chartered Surveyors).pdf
MohamedAbdelkader115
 
new ppt artificial intelligence historyyy
new ppt artificial intelligence historyyynew ppt artificial intelligence historyyy
new ppt artificial intelligence historyyy
PianoPianist
 
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptxExplainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
MahaveerVPandit
 
Level 1-Safety.pptx Presentation of Electrical Safety
Level 1-Safety.pptx Presentation of Electrical SafetyLevel 1-Safety.pptx Presentation of Electrical Safety
Level 1-Safety.pptx Presentation of Electrical Safety
JoseAlbertoCariasDel
 
AI-assisted Software Testing (3-hours tutorial)
AI-assisted Software Testing (3-hours tutorial)AI-assisted Software Testing (3-hours tutorial)
AI-assisted Software Testing (3-hours tutorial)
Vəhid Gəruslu
 
Data Structures_Introduction to algorithms.pptx
Data Structures_Introduction to algorithms.pptxData Structures_Introduction to algorithms.pptx
Data Structures_Introduction to algorithms.pptx
RushaliDeshmukh2
 
Artificial Intelligence (AI) basics.pptx
Artificial Intelligence (AI) basics.pptxArtificial Intelligence (AI) basics.pptx
Artificial Intelligence (AI) basics.pptx
aditichinar
 
Compiler Design Unit1 PPT Phases of Compiler.pptx
Compiler Design Unit1 PPT Phases of Compiler.pptxCompiler Design Unit1 PPT Phases of Compiler.pptx
Compiler Design Unit1 PPT Phases of Compiler.pptx
RushaliDeshmukh2
 
IntroSlides-April-BuildWithAI-VertexAI.pdf
IntroSlides-April-BuildWithAI-VertexAI.pdfIntroSlides-April-BuildWithAI-VertexAI.pdf
IntroSlides-April-BuildWithAI-VertexAI.pdf
Luiz Carneiro
 
fluke dealers in bangalore..............
fluke dealers in bangalore..............fluke dealers in bangalore..............
fluke dealers in bangalore..............
Haresh Vaswani
 
ELectronics Boards & Product Testing_Shiju.pdf
ELectronics Boards & Product Testing_Shiju.pdfELectronics Boards & Product Testing_Shiju.pdf
ELectronics Boards & Product Testing_Shiju.pdf
Shiju Jacob
 
some basics electrical and electronics knowledge
some basics electrical and electronics knowledgesome basics electrical and electronics knowledge
some basics electrical and electronics knowledge
nguyentrungdo88
 
"Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E...
"Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E..."Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E...
"Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E...
Infopitaara
 
railway wheels, descaling after reheating and before forging
railway wheels, descaling after reheating and before forgingrailway wheels, descaling after reheating and before forging
railway wheels, descaling after reheating and before forging
Javad Kadkhodapour
 
Ad

Advanced spark training advanced spark internals and tuning reynold xin

  • 1. Advanced Spark Reynold Xin, July 2, 2014 @ Spark Summit Training
  • 3. This Talk Formalize RDD concept Life of a Spark Application Performance Debugging * Assumes you can write word count, knows what transformation/action is “Mechanical sympathy” by Jackie Stewart: a driver does not need to know how to build an engine but they need to know the fundamentals of how one works to get the best out of it
  • 4. Reynold Xin Apache Spark committer (worked on almost every module: core, sql, mllib, graph) Product & open-source eng @ Databricks On leave from PhD @ UC Berkeley AMPLab
  • 5. Example Application val sc = new SparkContext(...) val file = sc.textFile(“hdfs://...”) val errors = file.filter(_.contains(“ERROR”)) errors.cache() errors.count() Resilient distributed datasets (RDDs) Action
  • 6. Quiz: what is an “RDD”? A: distributed collection of objects on disk B: distributed collection of objects in memory C: distributed collection of objects in Cassandra Answer: could be any of the above!
  • 7. Scientific Answer: RDD is an Interface! 1.  Set of partitions (“splits” in Hadoop) 2.  List of dependencies on parent RDDs 3.  Function to compute a partition" (as an Iterator) given its parent(s) 4.  (Optional) partitioner (hash, range) 5.  (Optional) preferred location(s)" for each partition “lineage” optimized execution
  • 8. Example: HadoopRDD partitions = one per HDFS block dependencies = none compute(part) = read corresponding block preferredLocations(part) = HDFS block location partitioner = none
  • 9. Example: Filtered RDD partitions = same as parent RDD dependencies = “one-to-one” on parent compute(part) = compute parent and filter it preferredLocations(part) = none (ask parent) partitioner = none
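The five-property interface and the two example RDDs above can be sketched in plain Scala. This is a toy model under stated assumptions, not Spark's actual classes: `SimpleRDD`, `BlockRDD`, `Partition` and `count` are hypothetical names, and "blocks" are just in-memory sequences tagged with a host name.

```scala
// Toy model of the five-property RDD interface. NOT Spark's real classes:
// every name here is hypothetical and exists only for illustration.
object ToyRdd {
  final case class Partition(index: Int)

  trait Partitioner {
    def numPartitions: Int
    def getPartition(key: Any): Int
  }

  trait SimpleRDD[T] {
    def partitions: Seq[Partition]                              // 1. set of partitions
    def dependencies: Seq[SimpleRDD[_]]                         // 2. parent RDDs
    def compute(part: Partition): Iterator[T]                   // 3. compute one partition
    def partitioner: Option[Partitioner] = None                 // 4. optional
    def preferredLocations(part: Partition): Seq[String] = Nil  // 5. optional
  }

  // Stand-in for HadoopRDD: one partition per "block", no parents,
  // preferred location = the host that holds the block.
  class BlockRDD(blocks: Vector[(String, Seq[String])]) extends SimpleRDD[String] {
    val partitions: Seq[Partition] = blocks.indices.map(Partition(_))
    val dependencies: Seq[SimpleRDD[_]] = Nil
    def compute(part: Partition): Iterator[String] = blocks(part.index)._2.iterator
    override def preferredLocations(part: Partition): Seq[String] =
      Seq(blocks(part.index)._1)
  }

  // Same partitions as the parent, one-to-one dependency, no locations of its own.
  class FilteredRDD[T](parent: SimpleRDD[T], f: T => Boolean) extends SimpleRDD[T] {
    def partitions: Seq[Partition] = parent.partitions
    val dependencies: Seq[SimpleRDD[_]] = Seq(parent)
    def compute(part: Partition): Iterator[T] = parent.compute(part).filter(f)
  }

  // A toy "action": computes every partition and sums the sizes.
  def count[T](rdd: SimpleRDD[T]): Long =
    rdd.partitions.map(p => rdd.compute(p).size.toLong).sum
}
```

Nothing is computed until `count` walks the partitions, which is the same laziness the example application relies on.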
  • 10. RDD Graph (DAG of tasks) Dataset-level view: file: HadoopRDD, path = hdfs://... errors: FilteredRDD, func = _.contains(…), shouldCache = true Partition-level view: Task1 Task2 ...
  • 11. Example: JoinedRDD partitions = one per reduce task dependencies = “shuffle” on each parent compute(partition) = read and join shuffled data preferredLocations(part) = none partitioner = HashPartitioner(numTasks) Spark will now know this data is hashed!
  • 12. Dependency Types “Narrow” (pipeline-able): map, filter; union; join with inputs co-partitioned “Wide” (shuffle): groupByKey on non-partitioned data; join with inputs not co-partitioned
  • 13. Recap Each RDD consists of 5 properties: 1.  partitions 2.  dependencies 3.  compute 4.  (optional) partitioner 5.  (optional) preferred locations
  • 14. Life of a Spark Application
  • 15. Spark Application Your program (JVM / Python): sc = new SparkContext; f = sc.textFile(“…”); f.filter(…).count(); ... Spark driver (app master): RDD graph, Scheduler, Block tracker, Shuffle tracker Cluster manager Spark executor (multiple of them): Task threads, Block manager Storage: HDFS, HBase, … A single application often contains multiple actions
  • 16. Job Scheduling Process rdd1.join(rdd2) .groupBy(…) .filter(…) .count() RDD Objects: build operator DAG Scheduler (DAGScheduler): split graph into stages of tasks; submit each stage as ready Executors: execute tasks; store and serve blocks (Block manager, Threads) Hand-offs between the boxes: DAG, Task
  • 17. DAG Scheduler Input: RDD and partitions to compute Output: output from actions on those partitions Roles: > Build stages of tasks > Submit them to lower level scheduler (e.g. YARN, Mesos, Standalone) as ready > Lower level scheduler will schedule tasks based on data locality > Resubmit failed stages if outputs are lost
  • 18. Scheduler Optimizations Pipelines operations within a stage Picks join algorithms based on partitioning (minimize shuffles) Reuses previously cached data [Figure: RDDs A to G connected by groupBy, map, union and join and grouped into Stage 1, Stage 2 and Stage 3; legend marks previously computed partitions and tasks]
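To illustrate how narrow dependencies are pipelined into one stage while a wide dependency starts a new one, here is a minimal sketch in plain Scala. It handles only a linear chain of operators, whereas the real scheduler works on a graph; `Op`, `Narrow`, `Wide` and `splitIntoStages` are hypothetical names, not part of Spark's DAGScheduler.

```scala
// Toy stage splitter for a LINEAR chain of operators (an assumption made
// for brevity). All names are hypothetical.
object ToyStages {
  sealed trait Dep
  case object Narrow extends Dep   // pipeline-able: map, filter, ...
  case object Wide extends Dep     // needs a shuffle: groupByKey, ...

  final case class Op(name: String, dep: Dep)

  // A wide dependency opens a new stage; a narrow operator is appended
  // to (pipelined into) the current stage.
  def splitIntoStages(ops: List[Op]): List[List[String]] =
    ops.foldLeft(List.empty[List[String]]) {
      case (Nil, op)                  => List(List(op.name))
      case (stages, Op(name, Wide))   => stages :+ List(name)
      case (stages, Op(name, Narrow)) => stages.init :+ (stages.last :+ name)
    }
}
```

For a chain such as textFile, map, groupByKey, filter, this puts the first two operators in one stage and the last two in another, with the shuffle boundary in between.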
  • 19. Task Unit of work to execute in an executor thread Unlike MR, there is no “map” vs “reduce” task Each task either partitions its output for “shuffle”, or sends the output back to the driver
  • 20. Shuffle Redistributes data among partitions Partition keys into buckets (user-defined partitioner) Optimizations: > Avoided when possible, if data is already properly partitioned > Partial aggregation reduces data movement [Figure: Stage 1 feeding Stage 2]
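The two ideas on this slide, bucketing keys with a partitioner and partial aggregation before data moves, can be sketched in plain Scala for a word-count style shuffle. This is an illustrative model only: `bucketFor`, `mapSide` and `reduceSide` are hypothetical helpers, and the in-memory maps stand in for the files a real shuffle would write.

```scala
// Toy hash-partitioned shuffle with map-side partial aggregation.
// Hypothetical helper names; in-memory maps replace shuffle files.
object ToyShuffle {
  // Non-negative modulo, so keys with negative hashCodes get a valid bucket.
  def bucketFor(key: Any, numBuckets: Int): Int = {
    val raw = key.hashCode % numBuckets
    if (raw < 0) raw + numBuckets else raw
  }

  // Map side: pre-aggregate counts per key, then group the combined
  // records by destination bucket. Only one record per distinct key moves.
  def mapSide(records: Seq[String], numBuckets: Int): Map[Int, Map[String, Int]] = {
    val combined = records.groupBy(identity).map { case (k, vs) => k -> vs.size }
    combined.groupBy { case (k, _) => bucketFor(k, numBuckets) }
  }

  // Reduce side: merge the partial counts fetched for one bucket.
  def reduceSide(fetched: Seq[Map[String, Int]]): Map[String, Int] =
    fetched.flatten.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sum }
}
```

Because a key always hashes to the same bucket, each reduce-side bucket sees every partial count for its keys and no others.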
  • 21. Shuffle Write intermediate files to disk Fetched by the next stage of tasks (“reduce” in MR) [Figure: Stage 1 writes to Disk, Stage 2 reads from it]
  • 22. Recap: Job Scheduling rdd1.join(rdd2) .groupBy(…) .filter(…) .count() RDD Objects: build operator DAG Scheduler (DAGScheduler): split graph into stages of tasks; submit each stage as ready Executors: execute tasks; store and serve blocks (Block manager, Threads) Hand-offs between the boxes: DAG, Task
  • 24. Performance Debugging Distributed performance: program slow due to scheduling, coordination, or data distribution Local performance: program slow because whatever I’m running is just slow on a single node Two useful tools: > Application web UI (default port 4040) > Executor logs (spark/work)
  • 26. Stragglers? Some tasks are just slower than others. Easy to identify from summary metrics:
  • 27. Stragglers due to slow nodes sc.parallelize(1 to 15, 15).map { index => val host = java.net.InetAddress.getLocalHost.getHostName if (host == "ip-172-31-2-222") { Thread.sleep(10000) } else { Thread.sleep(1000) } }.count()
  • 28. Stragglers due to slow nodes Turn speculation on to mitigate this problem. Speculation: Spark identifies slow tasks (by looking at runtime distribution), and re-launches those tasks on other nodes. spark.speculation true
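"Looking at the runtime distribution" can be made concrete with a small plain-Scala model. This is a simplified sketch, not Spark's scheduler code: it waits until a fraction of the tasks has finished, then flags running tasks whose elapsed time exceeds a multiple of the median finished duration. The `quantile` and `multiplier` defaults are assumed values for illustration, and `speculatable` is a hypothetical name.

```scala
// Simplified model of straggler detection. Hypothetical names; the
// default quantile and multiplier are assumptions for illustration.
object ToySpeculation {
  def median(xs: Seq[Long]): Long = xs.sorted.apply(xs.size / 2)

  // Returns the ids of running tasks that look like stragglers.
  def speculatable(
      finishedMs: Seq[Long],     // durations of tasks that already finished
      runningMs: Map[Int, Long], // task id -> elapsed time so far
      totalTasks: Int,
      quantile: Double = 0.75,
      multiplier: Double = 1.5): Set[Int] =
    if (finishedMs.isEmpty || finishedMs.size < quantile * totalTasks) Set.empty
    else {
      val threshold = multiplier * median(finishedMs)
      runningMs.collect { case (id, elapsed) if elapsed > threshold => id }.toSet
    }
}
```

With the slow-node example from the previous slide in mind, twelve tasks finishing in about a second would put the threshold at 1.5 seconds, so a task still running after five seconds would be flagged for re-launch.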
  • 30. Stragglers due to data skew sc.parallelize(1 to 15, 15) .flatMap { i => 1 to i } .map { i => Thread.sleep(1000) } .count() Speculation is not going to help because the problem is inherent in the algorithm/data. Pick a different algorithm or restructure the data.
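The skew in this example can be worked out without a cluster. The sketch below is plain Scala and assumes, as the code on the slide implies, that `parallelize(1 to 15, 15)` places one element in each partition and that each record costs one second; `ToySkew` and its fields are hypothetical names.

```scala
// Back-of-the-envelope model of the skewed job on the slide.
// Assumes one input element per partition and 1 second per record.
object ToySkew {
  // flatMap(i => 1 to i) expands the partition holding i into i records.
  val recordsPerPartition: Seq[Int] = (1 to 15).map(i => (1 to i).size)

  // At one second per record, task time in seconds equals the record count.
  val slowestTaskSec: Int = recordsPerPartition.max
  val medianTaskSec: Int = recordsPerPartition.sorted.apply(recordsPerPartition.size / 2)
  val totalWorkSec: Int = recordsPerPartition.sum

  // If the same records were spread evenly over 15 tasks instead.
  val balancedTaskSec: Int = math.ceil(totalWorkSec.toDouble / 15).toInt
}
```

The slowest task does 15 seconds of work against a median of 8, and re-running it elsewhere would take just as long, which is why only restructuring the data (here, about 8 seconds per task if evenly spread) helps.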
  • 32. Tasks are just slow Garbage collection Performance of the code running in each task
  • 33. Garbage Collection Look at the “GC Time” column in the web UI
  • 34. What if the task is still running? To discover whether GC is the problem: 1. Set spark.executor.extraJavaOptions to include: "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" 2. Look at spark/work/app…/[n]/stdout on executors 3. Short GC times are OK. Long ones are bad.
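As a complement to GC logs, and not something the slides themselves show, GC time can also be read from inside a JVM through the standard `java.lang.management` beans. The sketch below is plain Scala with hypothetical helper names (`GcProbe`, `gcTimeMs`, `withGcTime`); it measures whole-JVM GC time, so on a busy executor it also includes collections triggered by other threads.

```scala
import java.lang.management.ManagementFactory

// In-process GC probe built on the JDK's GarbageCollectorMXBean.
// Helper names are hypothetical; the measurement is JVM-wide.
object GcProbe {
  // Total time (ms) this JVM has spent in GC so far, over all collectors.
  def gcTimeMs(): Long = {
    val it = ManagementFactory.getGarbageCollectorMXBeans().iterator()
    var total = 0L
    while (it.hasNext) {
      val t = it.next().getCollectionTime() // -1 if the collector does not report it
      if (t > 0) total += t
    }
    total
  }

  // Runs `body` and returns its result with the GC time accrued meanwhile.
  def withGcTime[A](body: => A): (A, Long) = {
    val before = gcTimeMs()
    val result = body
    (result, gcTimeMs() - before)
  }
}
```

Wrapping a suspicious block in `GcProbe.withGcTime { ... }` and printing the second element gives a rough per-block figure to compare against the “GC Time” column.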
  • 36. jmap: heap analysis jmap -histo [pid] Gets a histogram of objects in the JVM heap jmap -histo:live [pid] Gets a histogram of objects in the heap after GC (thus “live”)
  • 37. Find out what objects are the trouble
  • 38. Demo: GC log & jmap
  • 39. Reduce GC impact class DummyObject(var i: Int) { def toInt = i } sc.parallelize(1 to 100 * 1000 * 1000, 1).map { i => val obj = new DummyObject(i) /* new object every record */; obj.toInt } sc.parallelize(1 to 100 * 1000 * 1000, 1).mapPartitions { iter => val obj = new DummyObject(0) /* reuse the same object */; iter.map { i => obj.i = i; obj.toInt } }
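Object reuse is only safe when the mutable object never leaves the closure. The plain-Scala sketch below, using ordinary iterators in place of `mapPartitions`, contrasts the pattern from the slide with a variant that leaks the shared instance; `reuseSafely` and `reuseUnsafely` are hypothetical names.

```scala
// Object reuse on a plain iterator, standing in for mapPartitions.
// Function names are hypothetical.
object ToyReuse {
  class DummyObject(var i: Int) { def toInt: Int = i }

  // Safe: only the extracted Int escapes; the mutable object stays local.
  def reuseSafely(input: Iterator[Int]): Iterator[Int] = {
    val obj = new DummyObject(0)
    input.map { i => obj.i = i; obj.toInt }
  }

  // Unsafe: every element is the SAME instance, so once the results are
  // buffered they all show the last value written.
  def reuseUnsafely(input: Iterator[Int]): List[DummyObject] = {
    val obj = new DummyObject(0)
    input.map { i => obj.i = i; obj }.toList
  }
}
```

The same caution applies to anything that buffers records downstream, such as caching or collecting them.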
  • 40. Local Performance Each Spark executor runs a JVM/Python process Insert your favorite JVM/Python profiling tool >  jstack >  YourKit >  VisualVM >  println >  (sorry I don’t know a whole lot about Python) >  …
  • 41. Example: identify expensive comp. def someCheapComputation(record: Int): Int = record + 1 def someExpensiveComputation(record: Int): String = { Thread.sleep(1000) record.toString } sc.parallelize(1 to 100000).map { record => val step1 = someCheapComputation(record) val step2 = someExpensiveComputation(step1) step2 }.saveAsTextFile("hdfs:/tmp1")
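Before reaching for a profiler, the "println" approach can be made slightly more systematic by timing each step. The following is a minimal plain-Scala sketch with hypothetical names (`ToyTiming`, `timed`, `report`); it is not thread-safe, so it suits a single-threaded local run rather than an executor with several task threads.

```scala
import scala.collection.mutable

// Minimal per-label stopwatch. Hypothetical helper, NOT thread-safe.
object ToyTiming {
  private val totals = mutable.Map.empty[String, Long].withDefaultValue(0L)

  // Runs `body`, adds its wall-clock time (ns) to the label's total,
  // and returns the body's result unchanged.
  def timed[A](label: String)(body: => A): A = {
    val start = System.nanoTime()
    try body
    finally {
      totals(label) += System.nanoTime() - start
    }
  }

  // Snapshot of accumulated nanoseconds per label.
  def report: Map[String, Long] = totals.toMap
}
```

Wrapping `someCheapComputation` and `someExpensiveComputation` in `timed("cheap")` and `timed("expensive")` and printing `report` after a few records makes the imbalance visible immediately.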
  • 44. jstack Can often pinpoint problems just by running “jstack” a few times
  • 45. YourKit (free for open source dev)
  • 47. Local Debugging Run in local mode (i.e. Spark master “local”) and debug with your favorite debugger >  IntelliJ >  Eclipse >  println With a sample dataset
  • 48. What have we learned? RDD abstraction > lineage info: partitions, dependencies, compute > optimization info: partitioner, preferred locations Execution process (from RDD to tasks) Performance & debugging