SlideShare a Scribd company logo
Debugging PySpark
Or why is there a JVM stack trace and what
does it mean?
Holden Karau
IBM - Spark Technology Center
Who am I?
● My name is Holden Karau
● Prefered pronouns are she/her
● I’m a Principal Software Engineer at IBM’s Spark Technology Center
● Apache Spark committer (as of last month!) :)
● previously Alpine, Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & Fast Data processing with Spark
○ co-author of a new book focused on Spark performance coming this year*
● @holdenkarau
● Slide share http://www.slideshare.net/hkarau
● Linkedin https://www.linkedin.com/in/holdenkarau
● Github https://github.com/holdenk
● Spark Videos http://bit.ly/holdenSparkVideos
Debugging PySpark: Spark Summit East talk by Holden Karau
What is the Spark Technology Center?
● An IBM technology center focused around Spark
● We work on open source Apache Spark to make it more awesome
○ Python, SQL, ML, and more! :)
● Related components as well:
○ Apache Toree [Incubating] (Notebook solution for Spark with Jupyter)
○ spark-testing-base (testing utilites on top of Spark)
○ Apache Bahir
○ Apache System ML Incubating - Machine Learning
● Partner with the Scala Foundation and other important players
● Multiple Spark Committers (Nick Pentreath, Xiao (Sean) Li, Prashant Sharma,
Holden Karau (me!))
● Lots of contributions in Spark 2.0 & beyond :)
Debugging PySpark: Spark Summit East talk by Holden Karau
Who I think you wonderful humans are?
● Friendly people (this is a Python focused talk after all)
● Don’t mind pictures of cats or stuffed animals
● Know some Python
● Know some Spark
● Want to debug your Spark applications
● Ok with things getting a little bit silly
Lori Erickson
What will be covered?
● A quick overview of PySpark architecture to understand how it can impact our
debugging
● Getting at Spark’s logs & persisting them
● What your options for logging are
● Attempting to understand Spark error messages
● My some what subtle attempts to get you to use spark-testing-base or similar
● My even less subtle attempts to get you to buy my new book
● Pictures of cats & stuffed animals
Aka: Building our Monster Identification Guide
First: a detour into PySpark’s internals
Photo by Bill Ward
Spark in Scala, how does PySpark work?
● Py4J + pickling + magic
○ This can be kind of slow sometimes
● RDDs are generally RDDs of pickled objects
● Spark SQL (and DataFrames) avoid some of this
kristin klein
So what does that look like?
Driver
py4j
Worker 1
Worker K
pipe
pipe
So how does that impact PySpark?
● Data from Spark worker serialized and piped to Python
worker
○ Multiple iterator-to-iterator transformations are still pipelined :)
● Double serialization cost makes everything more
expensive
● Python worker startup takes a bit of extra time
● Python memory isn’t controlled by the JVM - easy to go
over container limits if deploying on YARN or similar
● Error messages make ~0 sense
● etc.
So where are the logs/errors?
(e.g. before we can identify a monster we have to find it)
● Error messages reported to the console*
● Log messages reported to the console*
● Log messages on the workers - access through the
Spark Web UI or Spark History Server :)
(*When running in client mode)
PROAndrey
Working in Jupyter?
“The error messages were so useless -
I looked up how to disabled error reporting in Jupyter”
(paraphrased from PyData DC)
Working in Jupyter - try your terminal for help
Possibly fix by https://issues.apache.org/jira/browse/SPARK-19094 but may not get in
tonynetone
Working in YARN?
(e.g. before we can identify a monster we have to find it)
● Use yarn logs to get logs after log collection
● Or set up the Spark history server
● Or yarn.nodemanager.delete.debug-delay-sec :)
Lauren Mitchell
Spark is pretty verbose by default
● Most of the time it tells you things you already know
● Or don’t need to know
● You can dynamically control the log level with
sc.setLogLevel
● This is especially useful to increase logging near the
point of error in your code
But what about when we get an error?
● Python Spark errors come in two-ish-parts often
● JVM Stack Trace (Friend Monster - comes most errors)
● Python Stack Trace (Boo - has information)
● Buddy - Often used to report the information from Friend
Monster and Boo
So what is that JVM stack trace?
● Doesn’t want your error messages to get lonely
● Often not very informative
○ Except if the error happens purely in the JVM - like asking Spark to
load a file which doesn’t exist
Let’s make some mistakes & debug :)
● Error in transformation
● Run out of memory in the workers
Image by: Tomomi
Bad outer transformation:
data = sc.parallelize(range(10))
transform1 = data.map(lambda x: x + 1)
transform2 = transform1.map(lambda x: x / 0)
transform2.count()
David Martyn
Hunt
Let’s look at the error messages for it:
[Stage 0:> (0 + 0) / 4]17/02/01 09:52:07 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/holden/repos/spark/python/lib/pyspark.zip/pyspark/worker.py", line 180, in main
process()
File "/home/holden/repos/spark/python/lib/pyspark.zip/pyspark/worker.py", line 175, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func
return func(split, prev_func(split, iterator))
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func
return func(split, prev_func(split, iterator))
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func
return func(split, prev_func(split, iterator))
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 345, in func
return f(iterator)
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 1040, in <lambda>
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
Continued for ~400 lines
File "high_performance_pyspark/bad_pyspark.py", line 32, in <lambda>
Ok maybe the web UI is easier? Mr Thinktank
And click through... afu007
A scroll down (not quite to the bottom)
File "high_performance_pyspark/bad_pyspark.py",
line 32, in <lambda>
transform2 = transform1.map(lambda x: x / 0)
ZeroDivisionError: integer division or modulo by zero
Or look at the bottom of console logs:
File "/home/holden/repos/spark/python/lib/pyspark.zip/pyspark/worker.py", line
180, in main
process()
File "/home/holden/repos/spark/python/lib/pyspark.zip/pyspark/worker.py", line
175, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in
pipeline_func
return func(split, prev_func(split, iterator))
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in
pipeline_func
return func(split, prev_func(split, iterator))
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in
pipeline_func
return func(split, prev_func(split, iterator))
Or look at the bottom of console logs:
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 345, in func
return f(iterator)
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 1040, in <lambda>
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 1040, in <genexpr>
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "high_performance_pyspark/bad_pyspark.py", line 32, in <lambda>
transform2 = transform1.map(lambda x: x / 0)
ZeroDivisionError: integer division or modulo by zero
Python Pipelines
● Some pipelining happens inside of Python
○ For performance (less copies from Python to Scala)
● DAG visualization is generated inside of Scala
○ Misses Python pipelines :(
Regardless of language
● Can be difficult to determine which element failed
● Stack trace _sometimes_ helps (it did this time)
● take(1) + count() are your friends - but a lot of work :(
Side note: Lambdas aren’t always your friend
● Lambda’s can make finding the error more challenging
● I love lambda x, y: x / y as much as the next human but
when y is zero :(
● A small bit of refactoring for your debugging never hurt
anyone*
● If your inner functions are causing errors it’s a good time
to have tests for them!
● Difficult to put logs inside of them
*A blatant lie, but…. it hurts less often than it helps
Testing - you should do it!
● spark-testing-base is on pip now for your happy test
adventures
● That’s a talk unto itself though (but it's on YouTube)
Adding your own logging:
● Java users use Log4J & friends
● Python users: use logging library (or even print!)
● Accumulators
○ Behave a bit weirdly, don’t put large amounts of data in them
Also not all errors are “hard” errors
● Parsing input? Going to reject some malformed records
● flatMap or filter + map can make this simpler
● Still want to track number of rejected records (see
accumulators)
Mustafasari
So using names & logging & accs could be:
data = sc.parallelize(range(10))
rejectedCount = sc.accumulator(0)
def loggedDivZero(x):
import logging
try:
return [x / 0]
except Exception as e:
rejectedCount.add(1)
logging.warning("Error found " + repr(e))
return []
transform1 = data.flatMap(loggedDivZero)
transform2 = transform1.map(add1)
transform2.count()
print("Reject " + str(rejectedCount.value))
Spark accumulators
● Really “great” way for keeping track of failed records
● Double counting makes things really tricky
○ Jobs which worked “fine” don’t continue to work “fine” when minor changes happen
● Relative rules can save us* under certain conditions
Found Animals Foundation Follow
Could we just us -mtrace?
● Spark makes certain assumptions about how Python is
launched on the workers this doesn’t (currently) work
● Namely it assumes PYSPARK_PYTHON points to a file
● Also assumes arg[0] has certain meanings :(
paul
Ok what about if we run out of memory?
In the middle of some Java stack traces:
File "/home/holden/repos/spark/python/lib/pyspark.zip/pyspark/worker.py", line 180, in main
process()
File "/home/holden/repos/spark/python/lib/pyspark.zip/pyspark/worker.py", line 175, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func
return func(split, prev_func(split, iterator))
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func
return func(split, prev_func(split, iterator))
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func
return func(split, prev_func(split, iterator))
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 345, in func
return f(iterator)
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 1040, in <lambda>
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "/home/holden/repos/spark/python/pyspark/rdd.py", line 1040, in <genexpr>
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "high_performance_pyspark/bad_pyspark.py", line 132, in generate_too_much
return range(10000000000000)
MemoryError
Tubbs doesn’t always look the same
● Out of memory can be pure JVM (worker)
○ OOM exception during join
○ GC timelimit exceeded
● OutOfMemory error, Executors being killed by kernel,
etc.
● Running in YARN? “Application overhead exceeded”
● JVM out of memory on the driver side from Py4J
Reasons for JVM worker OOMs
(w/PySpark)
● Unbalanced shuffles
● Buffering of Rows with PySpark + UDFs
○ If you have a down stream select move it up stream
● Individual jumbo records (after pickling)
Reasons for Python worker OOMs
(w/PySpark)
● Insufficient memory reserved for Python worker
● Jumbo records
● Eager entire partition evaluation (e.g. sort +
mapPartitions)
● Too large partitions (unbalanced or not enough
partitions)
● Native code memory leak
And loading invalid paths:
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/doesnotexist
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
Oooh Boo found food! Let’s finish quickly :)
What about if that isn’t enough to debug?
● Move take(1) up the dependency chain
● DAG in the WebUI -- less useful for Python :(
● toDebugString -- also less useful in Python :(
● Sample data and run locally
Learning Spark
Fast Data
Processing with
Spark
(Out of Date)
Fast Data
Processing with
Spark
(2nd edition)
Advanced
Analytics with
Spark
Coming soon:
Spark in Action
Coming soon:
High Performance Spark
Coming Soon:
Learning PySpark
High Performance Spark (soon!)
First seven chapters are available in “Early Release”*:
● Buy from O’Reilly - http://bit.ly/highPerfSpark
● Python is in Chapter 7 & Debugging in Appendix
Get notified when updated & finished:
● http://www.highperformancespark.com
● https://twitter.com/highperfspark
* Early Release means extra mistakes, but also a chance to help us make a more awesome
book.
K thnx bye!
Get in touch if you want:
@holdenkarau on twitter
Have some simple UDFs you wish ran faster?: http://bit.ly/pySparkUDF
If you care about Spark testing: http://bit.ly/holdenTestingSpark
Want to start contributing to PySpark? Talk to me IRL or
E-mail: holden.karau+contributing@gmail.com

More Related Content

What's hot (20)

PPTX
Pulsar in the Lakehouse: Apache Pulsar™ with Apache Spark™ and Delta Lake - P...
StreamNative
 
PDF
An Insider’s Guide to Maximizing Spark SQL Performance
Takuya UESHIN
 
PDF
The Apache Spark File Format Ecosystem
Databricks
 
PDF
Architect’s Open-Source Guide for a Data Mesh Architecture
Databricks
 
PDF
Productionizing Spark and the Spark Job Server
Evan Chan
 
PDF
NetflixにおけるPresto/Spark活用事例
Amazon Web Services Japan
 
PDF
Apache Kafkaって本当に大丈夫?~故障検証のオーバービューと興味深い挙動の紹介~
NTT DATA OSS Professional Services
 
PDF
Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)
NTT DATA OSS Professional Services
 
PDF
Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes
Databricks
 
PDF
Efficient Data Storage for Analytics with Apache Parquet 2.0
Cloudera, Inc.
 
PDF
Javaコードが速く実⾏される秘密 - JITコンパイラ⼊⾨(JJUG CCC 2020 Fall講演資料)
NTT DATA Technology & Innovation
 
PDF
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Databricks
 
PPTX
Hadoop -NameNode HAの仕組み-
Yuki Gonda
 
PDF
Hadoop/Spark で Amazon S3 を徹底的に使いこなすワザ (Hadoop / Spark Conference Japan 2019)
Noritaka Sekiyama
 
PDF
Unified Stream and Batch Processing with Apache Flink
DataWorks Summit/Hadoop Summit
 
PPTX
Apache Sparkの基本と最新バージョン3.2のアップデート(Open Source Conference 2021 Online/Fukuoka ...
NTT DATA Technology & Innovation
 
PDF
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Noritaka Sekiyama
 
PDF
Amazon Redshift パフォーマンスチューニングテクニックと最新アップデート
Amazon Web Services Japan
 
PPTX
The columnar roadmap: Apache Parquet and Apache Arrow
DataWorks Summit
 
PDF
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
Databricks
 
Pulsar in the Lakehouse: Apache Pulsar™ with Apache Spark™ and Delta Lake - P...
StreamNative
 
An Insider’s Guide to Maximizing Spark SQL Performance
Takuya UESHIN
 
The Apache Spark File Format Ecosystem
Databricks
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Databricks
 
Productionizing Spark and the Spark Job Server
Evan Chan
 
NetflixにおけるPresto/Spark活用事例
Amazon Web Services Japan
 
Apache Kafkaって本当に大丈夫?~故障検証のオーバービューと興味深い挙動の紹介~
NTT DATA OSS Professional Services
 
Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)
NTT DATA OSS Professional Services
 
Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes
Databricks
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Cloudera, Inc.
 
Javaコードが速く実⾏される秘密 - JITコンパイラ⼊⾨(JJUG CCC 2020 Fall講演資料)
NTT DATA Technology & Innovation
 
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Databricks
 
Hadoop -NameNode HAの仕組み-
Yuki Gonda
 
Hadoop/Spark で Amazon S3 を徹底的に使いこなすワザ (Hadoop / Spark Conference Japan 2019)
Noritaka Sekiyama
 
Unified Stream and Batch Processing with Apache Flink
DataWorks Summit/Hadoop Summit
 
Apache Sparkの基本と最新バージョン3.2のアップデート(Open Source Conference 2021 Online/Fukuoka ...
NTT DATA Technology & Innovation
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Noritaka Sekiyama
 
Amazon Redshift パフォーマンスチューニングテクニックと最新アップデート
Amazon Web Services Japan
 
The columnar roadmap: Apache Parquet and Apache Arrow
DataWorks Summit
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
Databricks
 

Similar to Debugging PySpark: Spark Summit East talk by Holden Karau (20)

PDF
Debugging PySpark - PyCon US 2018
Holden Karau
 
PDF
Debugging Apache Spark - Scala & Python super happy fun times 2017
Holden Karau
 
PDF
Debugging Apache Spark
Joey Echeverria
 
PDF
Debugging Spark: Scala and Python - Super Happy Fun Times @ Data Day Texas 2018
Holden Karau
 
PDF
A fast introduction to PySpark with a quick look at Arrow based UDFs
Holden Karau
 
PDF
A really really fast introduction to PySpark - lightning fast cluster computi...
Holden Karau
 
PDF
Data Summer Conf 2018, “Hands-on with Apache Spark for Beginners (ENG)” — Akm...
Provectus
 
PDF
Scaling with apache spark (a lesson in unintended consequences) strange loo...
Holden Karau
 
PDF
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Databricks
 
PPTX
Alpine academy apache spark series #1 introduction to cluster computing wit...
Holden Karau
 
PDF
A super fast introduction to Spark and glance at BEAM
Holden Karau
 
PPTX
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018
Holden Karau
 
PDF
Keeping the fun in functional w/ Apache Spark @ Scala Days NYC
Holden Karau
 
PPTX
Apache spark sneha challa- google pittsburgh-aug 25th
Sneha Challa
 
PDF
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Euangelos Linardos
 
PPTX
Dive into spark2
Gal Marder
 
PDF
Getting The Best Performance With PySpark
Spark Summit
 
PDF
Beyond shuffling - Scala Days Berlin 2016
Holden Karau
 
PDF
Python and Bigdata - An Introduction to Spark (PySpark)
hiteshnd
 
PPTX
Frustration-Reduced PySpark: Data engineering with DataFrames
Ilya Ganelin
 
Debugging PySpark - PyCon US 2018
Holden Karau
 
Debugging Apache Spark - Scala & Python super happy fun times 2017
Holden Karau
 
Debugging Apache Spark
Joey Echeverria
 
Debugging Spark: Scala and Python - Super Happy Fun Times @ Data Day Texas 2018
Holden Karau
 
A fast introduction to PySpark with a quick look at Arrow based UDFs
Holden Karau
 
A really really fast introduction to PySpark - lightning fast cluster computi...
Holden Karau
 
Data Summer Conf 2018, “Hands-on with Apache Spark for Beginners (ENG)” — Akm...
Provectus
 
Scaling with apache spark (a lesson in unintended consequences) strange loo...
Holden Karau
 
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Databricks
 
Alpine academy apache spark series #1 introduction to cluster computing wit...
Holden Karau
 
A super fast introduction to Spark and glance at BEAM
Holden Karau
 
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018
Holden Karau
 
Keeping the fun in functional w/ Apache Spark @ Scala Days NYC
Holden Karau
 
Apache spark sneha challa- google pittsburgh-aug 25th
Sneha Challa
 
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Euangelos Linardos
 
Dive into spark2
Gal Marder
 
Getting The Best Performance With PySpark
Spark Summit
 
Beyond shuffling - Scala Days Berlin 2016
Holden Karau
 
Python and Bigdata - An Introduction to Spark (PySpark)
hiteshnd
 
Frustration-Reduced PySpark: Data engineering with DataFrames
Ilya Ganelin
 
Ad

More from Spark Summit (20)

PDF
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
Spark Summit
 
PDF
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
Spark Summit
 
PDF
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
PDF
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
PDF
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
Spark Summit
 
PDF
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
Spark Summit
 
PDF
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
PDF
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
PDF
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
PDF
Next CERN Accelerator Logging Service with Jakub Wozniak
Spark Summit
 
PDF
Powering a Startup with Apache Spark with Kevin Kim
Spark Summit
 
PDF
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
PDF
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Spark Summit
 
PDF
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
Spark Summit
 
PDF
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spark Summit
 
PDF
Goal Based Data Production with Sim Simeonov
Spark Summit
 
PDF
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Spark Summit
 
PDF
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Spark Summit
 
PDF
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
PDF
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Spark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Spark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spark Summit
 
Goal Based Data Production with Sim Simeonov
Spark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 
Ad

Recently uploaded (20)

PPTX
big data eco system fundamentals of data science
arivukarasi
 
PDF
apidays Singapore 2025 - Building a Federated Future, Alex Szomora (GSMA)
apidays
 
PPTX
Feb 2021 Ransomware Recovery presentation.pptx
enginsayin1
 
PDF
UNISE-Operation-Procedure-InDHIS2trainng
ahmedabduselam23
 
PPTX
03_Ariane BERCKMOES_Ethias.pptx_AIBarometer_release_event
FinTech Belgium
 
PPTX
covid 19 data analysis updates in our municipality
RhuAyungon1
 
PPTX
Data anlytics Hospitals Research India.pptx
SayantanChakravorty2
 
PPTX
美国史蒂文斯理工学院毕业证书{SIT学费发票SIT录取通知书}哪里购买
Taqyea
 
PDF
apidays Singapore 2025 - Surviving an interconnected world with API governanc...
apidays
 
PPTX
BinarySearchTree in datastructures in detail
kichokuttu
 
PPTX
thid ppt defines the ich guridlens and gives the information about the ICH gu...
shaistabegum14
 
PPTX
01_Nico Vincent_Sailpeak.pptx_AI_Barometer_2025
FinTech Belgium
 
PDF
Technical-Report-GPS_GIS_RS-for-MSF-finalv2.pdf
KPycho
 
PPTX
Generative AI Boost Data Governance and Quality- Tejasvi Addagada
Tejasvi Addagada
 
PPTX
在线购买英国本科毕业证苏格兰皇家音乐学院水印成绩单RSAMD学费发票
Taqyea
 
PDF
A GraphRAG approach for Energy Efficiency Q&A
Marco Brambilla
 
PDF
NIS2 Compliance for MSPs: Roadmap, Benefits & Cybersecurity Trends (2025 Guide)
GRC Kompas
 
PPTX
04_Tamás Marton_Intuitech .pptx_AI_Barometer_2025
FinTech Belgium
 
PDF
2025 Global Data Summit - FOM with AI.pdf
Marco Wobben
 
PDF
IT GOVERNANCE 4-1 - Information System Security (1).pdf
mdirfanuddin1322
 
big data eco system fundamentals of data science
arivukarasi
 
apidays Singapore 2025 - Building a Federated Future, Alex Szomora (GSMA)
apidays
 
Feb 2021 Ransomware Recovery presentation.pptx
enginsayin1
 
UNISE-Operation-Procedure-InDHIS2trainng
ahmedabduselam23
 
03_Ariane BERCKMOES_Ethias.pptx_AIBarometer_release_event
FinTech Belgium
 
covid 19 data analysis updates in our municipality
RhuAyungon1
 
Data anlytics Hospitals Research India.pptx
SayantanChakravorty2
 
美国史蒂文斯理工学院毕业证书{SIT学费发票SIT录取通知书}哪里购买
Taqyea
 
apidays Singapore 2025 - Surviving an interconnected world with API governanc...
apidays
 
BinarySearchTree in datastructures in detail
kichokuttu
 
thid ppt defines the ich guridlens and gives the information about the ICH gu...
shaistabegum14
 
01_Nico Vincent_Sailpeak.pptx_AI_Barometer_2025
FinTech Belgium
 
Technical-Report-GPS_GIS_RS-for-MSF-finalv2.pdf
KPycho
 
Generative AI Boost Data Governance and Quality- Tejasvi Addagada
Tejasvi Addagada
 
在线购买英国本科毕业证苏格兰皇家音乐学院水印成绩单RSAMD学费发票
Taqyea
 
A GraphRAG approach for Energy Efficiency Q&A
Marco Brambilla
 
NIS2 Compliance for MSPs: Roadmap, Benefits & Cybersecurity Trends (2025 Guide)
GRC Kompas
 
04_Tamás Marton_Intuitech .pptx_AI_Barometer_2025
FinTech Belgium
 
2025 Global Data Summit - FOM with AI.pdf
Marco Wobben
 
IT GOVERNANCE 4-1 - Information System Security (1).pdf
mdirfanuddin1322
 

Debugging PySpark: Spark Summit East talk by Holden Karau

  • 1. Debugging PySpark Or why is there a JVM stack trace and what does it mean? Holden Karau IBM - Spark Technology Center
  • 2. Who am I? ● My name is Holden Karau ● Prefered pronouns are she/her ● I’m a Principal Software Engineer at IBM’s Spark Technology Center ● Apache Spark committer (as of last month!) :) ● previously Alpine, Databricks, Google, Foursquare & Amazon ● co-author of Learning Spark & Fast Data processing with Spark ○ co-author of a new book focused on Spark performance coming this year* ● @holdenkarau ● Slide share http://www.slideshare.net/hkarau ● Linkedin https://www.linkedin.com/in/holdenkarau ● Github https://github.com/holdenk ● Spark Videos http://bit.ly/holdenSparkVideos
  • 4. What is the Spark Technology Center? ● An IBM technology center focused around Spark ● We work on open source Apache Spark to make it more awesome ○ Python, SQL, ML, and more! :) ● Related components as well: ○ Apache Toree [Incubating] (Notebook solution for Spark with Jupyter) ○ spark-testing-base (testing utilites on top of Spark) ○ Apache Bahir ○ Apache System ML Incubating - Machine Learning ● Partner with the Scala Foundation and other important players ● Multiple Spark Committers (Nick Pentreath, Xiao (Sean) Li, Prashant Sharma, Holden Karau (me!)) ● Lots of contributions in Spark 2.0 & beyond :)
  • 6. Who I think you wonderful humans are? ● Friendly people (this is a Python focused talk after all) ● Don’t mind pictures of cats or stuffed animals ● Know some Python ● Know some Spark ● Want to debug your Spark applications ● Ok with things getting a little bit silly Lori Erickson
  • 7. What will be covered? ● A quick overview of PySpark architecture to understand how it can impact our debugging ● Getting at Spark’s logs & persisting them ● What your options for logging are ● Attempting to understand Spark error messages ● My some what subtle attempts to get you to use spark-testing-base or similar ● My even less subtle attempts to get you to buy my new book ● Pictures of cats & stuffed animals
  • 8. Aka: Building our Monster Identification Guide
  • 9. First: a detour into PySpark’s internals Photo by Bill Ward
  • 10. Spark in Scala, how does PySpark work? ● Py4J + pickling + magic ○ This can be kind of slow sometimes ● RDDs are generally RDDs of pickled objects ● Spark SQL (and DataFrames) avoid some of this kristin klein
  • 11. So what does that look like? Driver py4j Worker 1 Worker K pipe pipe
  • 12. So how does that impact PySpark? ● Data from Spark worker serialized and piped to Python worker ○ Multiple iterator-to-iterator transformations are still pipelined :) ● Double serialization cost makes everything more expensive ● Python worker startup takes a bit of extra time ● Python memory isn’t controlled by the JVM - easy to go over container limits if deploying on YARN or similar ● Error messages make ~0 sense ● etc.
  • 13. So where are the logs/errors? (e.g. before we can identify a monster we have to find it) ● Error messages reported to the console* ● Log messages reported to the console* ● Log messages on the workers - access through the Spark Web UI or Spark History Server :) (*When running in client mode) PROAndrey
  • 14. Working in Jupyter? “The error messages were so useless - I looked up how to disabled error reporting in Jupyter” (paraphrased from PyData DC)
  • 15. Working in Jupyter - try your terminal for help Possibly fix by https://issues.apache.org/jira/browse/SPARK-19094 but may not get in tonynetone
  • 16. Working in YARN? (e.g. before we can identify a monster we have to find it) ● Use yarn logs to get logs after log collection ● Or set up the Spark history server ● Or yarn.nodemanager.delete.debug-delay-sec :) Lauren Mitchell
  • 17. Spark is pretty verbose by default ● Most of the time it tells you things you already know ● Or don’t need to know ● You can dynamically control the log level with sc.setLogLevel ● This is especially useful to increase logging near the point of error in your code
  • 18. But what about when we get an error? ● Python Spark errors come in two-ish-parts often ● JVM Stack Trace (Friend Monster - comes most errors) ● Python Stack Trace (Boo - has information) ● Buddy - Often used to report the information from Friend Monster and Boo
  • 19. So what is that JVM stack trace? ● Doesn’t want your error messages to get lonely ● Often not very informative ○ Except if the error happens purely in the JVM - like asking Spark to load a file which doesn’t exist
  • 20. Let’s make some mistakes & debug :) ● Error in transformation ● Run out of memory in the workers Image by: Tomomi
  • 21. Bad outer transformation: data = sc.parallelize(range(10)) transform1 = data.map(lambda x: x + 1) transform2 = transform1.map(lambda x: x / 0) transform2.count() David Martyn Hunt
  • 22. Let’s look at the error messages for it: [Stage 0:> (0 + 0) / 4]17/02/01 09:52:07 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/home/holden/repos/spark/python/lib/pyspark.zip/pyspark/worker.py", line 180, in main process() File "/home/holden/repos/spark/python/lib/pyspark.zip/pyspark/worker.py", line 175, in process serializer.dump_stream(func(split_index, iterator), outfile) File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func return func(split, prev_func(split, iterator)) File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func return func(split, prev_func(split, iterator)) File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func return func(split, prev_func(split, iterator)) File "/home/holden/repos/spark/python/pyspark/rdd.py", line 345, in func return f(iterator) File "/home/holden/repos/spark/python/pyspark/rdd.py", line 1040, in <lambda> return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() Continued for ~400 lines File "high_performance_pyspark/bad_pyspark.py", line 32, in <lambda>
  • 23. Ok maybe the web UI is easier? Mr Thinktank
  • 25. A scroll down (not quite to the bottom) File "high_performance_pyspark/bad_pyspark.py", line 32, in <lambda> transform2 = transform1.map(lambda x: x / 0) ZeroDivisionError: integer division or modulo by zero
  • 26. Or look at the bottom of console logs: File "/home/holden/repos/spark/python/lib/pyspark.zip/pyspark/worker.py", line 180, in main process() File "/home/holden/repos/spark/python/lib/pyspark.zip/pyspark/worker.py", line 175, in process serializer.dump_stream(func(split_index, iterator), outfile) File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func return func(split, prev_func(split, iterator)) File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func return func(split, prev_func(split, iterator)) File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func return func(split, prev_func(split, iterator))
  • 27. Or look at the bottom of console logs: File "/home/holden/repos/spark/python/pyspark/rdd.py", line 345, in func return f(iterator) File "/home/holden/repos/spark/python/pyspark/rdd.py", line 1040, in <lambda> return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File "/home/holden/repos/spark/python/pyspark/rdd.py", line 1040, in <genexpr> return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File "high_performance_pyspark/bad_pyspark.py", line 32, in <lambda> transform2 = transform1.map(lambda x: x / 0) ZeroDivisionError: integer division or modulo by zero
  • 28. Python Pipelines ● Some pipelining happens inside of Python ○ For performance (less copies from Python to Scala) ● DAG visualization is generated inside of Scala ○ Misses Python pipelines :( Regardless of language ● Can be difficult to determine which element failed ● Stack trace _sometimes_ helps (it did this time) ● take(1) + count() are your friends - but a lot of work :(
  • 29. Side note: Lambdas aren’t always your friend ● Lambda’s can make finding the error more challenging ● I love lambda x, y: x / y as much as the next human but when y is zero :( ● A small bit of refactoring for your debugging never hurt anyone* ● If your inner functions are causing errors it’s a good time to have tests for them! ● Difficult to put logs inside of them *A blatant lie, but…. it hurts less often than it helps
  • 30. Testing - you should do it! ● spark-testing-base is on pip now for your happy test adventures ● That’s a talk unto itself though (but it's on YouTube)
  • 31. Adding your own logging: ● Java users use Log4J & friends ● Python users: use logging library (or even print!) ● Accumulators ○ Behave a bit weirdly, don’t put large amounts of data in them
  • 32. Also not all errors are “hard” errors ● Parsing input? Going to reject some malformed records ● flatMap or filter + map can make this simpler ● Still want to track number of rejected records (see accumulators) Mustafasari
  • 33. So using names & logging & accs could be: data = sc.parallelize(range(10)) rejectedCount = sc.accumulator(0) def loggedDivZero(x): import logging try: return [x / 0] except Exception as e: rejectedCount.add(1) logging.warning("Error found " + repr(e)) return [] transform1 = data.flatMap(loggedDivZero) transform2 = transform1.map(add1) transform2.count() print("Reject " + str(rejectedCount.value))
  • 34. Spark accumulators ● Really “great” way for keeping track of failed records ● Double counting makes things really tricky ○ Jobs which worked “fine” don’t continue to work “fine” when minor changes happen ● Relative rules can save us* under certain conditions Found Animals Foundation Follow
  • 35. Could we just us -mtrace? ● Spark makes certain assumptions about how Python is launched on the workers this doesn’t (currently) work ● Namely it assumes PYSPARK_PYTHON points to a file ● Also assumes arg[0] has certain meanings :( paul
  • 36. Ok what about if we run out of memory? In the middle of some Java stack traces: File "/home/holden/repos/spark/python/lib/pyspark.zip/pyspark/worker.py", line 180, in main process() File "/home/holden/repos/spark/python/lib/pyspark.zip/pyspark/worker.py", line 175, in process serializer.dump_stream(func(split_index, iterator), outfile) File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func return func(split, prev_func(split, iterator)) File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func return func(split, prev_func(split, iterator)) File "/home/holden/repos/spark/python/pyspark/rdd.py", line 2406, in pipeline_func return func(split, prev_func(split, iterator)) File "/home/holden/repos/spark/python/pyspark/rdd.py", line 345, in func return f(iterator) File "/home/holden/repos/spark/python/pyspark/rdd.py", line 1040, in <lambda> return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File "/home/holden/repos/spark/python/pyspark/rdd.py", line 1040, in <genexpr> return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File "high_performance_pyspark/bad_pyspark.py", line 132, in generate_too_much return range(10000000000000) MemoryError
  • 37. Tubbs doesn’t always look the same ● Out of memory can be pure JVM (worker) ○ OOM exception during join ○ GC timelimit exceeded ● OutOfMemory error, Executors being killed by kernel, etc. ● Running in YARN? “Application overhead exceeded” ● JVM out of memory on the driver side from Py4J
  • 38. Reasons for JVM worker OOMs (w/PySpark) ● Unbalanced shuffles ● Buffering of Rows with PySpark + UDFs ○ If you have a down stream select move it up stream ● Individual jumbo records (after pickling)
  • 39. Reasons for Python worker OOMs (w/PySpark) ● Insufficient memory reserved for Python worker ● Jumbo records ● Eager entire partition evaluation (e.g. sort + mapPartitions) ● Too large partitions (unbalanced or not enough partitions) ● Native code memory leak
  • 40. And loading invalid paths: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/doesnotexist at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:250) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
  • 41. Oooh Boo found food! Let’s finish quickly :)
  • 42. What about if that isn’t enough to debug? ● Move take(1) up the dependency chain ● DAG in the WebUI -- less useful for Python :( ● toDebugString -- also less useful in Python :( ● Sample data and run locally
  • 43. Learning Spark Fast Data Processing with Spark (Out of Date) Fast Data Processing with Spark (2nd edition) Advanced Analytics with Spark Coming soon: Spark in Action Coming soon: High Performance Spark Coming Soon: Learning PySpark
  • 44. High Performance Spark (soon!) First seven chapters are available in “Early Release”*: ● Buy from O’Reilly - http://bit.ly/highPerfSpark ● Python is in Chapter 7 & Debugging in Appendix Get notified when updated & finished: ● http://www.highperformancespark.com ● https://twitter.com/highperfspark * Early Release means extra mistakes, but also a chance to help us make a more awesome book.
  • 45. K thnx bye! Get in touch if you want: @holdenkarau on twitter Have some simple UDFs you wish ran faster?: http://bit.ly/pySparkUDF If you care about Spark testing: http://bit.ly/holdenTestingSpark Want to start contributing to PySpark? Talk to me IRL or E-mail: [email protected]