Effective Testing for Spark Programs: Scala Bay preview (pre-Strata NY 2015)

Effective Testing for Spark
Programs
or avoiding “I didn’t think that could happen”
Hella-Legit
Preview!

Who am I?
● My name is Holden Karau
● Preferred pronouns are she/her
● I’m a Software Engineer
● currently Alpine and previously Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & Fast Data Processing with Spark
● @holdenkarau
● Slide share http://www.slideshare.net/hkarau
● Linkedin https://www.linkedin.com/in/holdenkarau
● Spark Videos http://bit.ly/holdenSparkVideos

What is going to be covered:
● Disclaimer
● What I think I might know about you
● A bit about why you should test your programs
● Doing traditional unit testing for Spark programs
○ Along with special considerations for SQL/Hive & Streaming
● Using counters & other job acceptance tests w/ Spark
● Cute & scary pictures
○ I promise at least one panda and one cat
● “Future Work”
○ Some of this future work might even get done!

This is an early stage talk (and cat and code)

Who I think you wonderful humans are?
● Nice* people
● Like silly pictures
● Familiar with Apache Spark
○ If not, buy one of my books :p
● Familiar with one of Scala, Java, or Python
○ If you know R well I’d love to chat though
● Want to make better software
○ (or models, or w/e)

Some Spark terms
Spark Context (aka sc): The window to the world of Spark
sqlContext: The window to the world of DataFrames
Transformation: Takes an RDD (or DataFrame) and returns a new RDD or DataFrame
Action: Causes an RDD to be evaluated (often storing the result)
Photo by mgstanton

Why don’t we test?
● It’s hard
○ Faking data, setting up integration tests, urgh w/e
● Our tests can get too slow
● It takes a lot of time
○ and people always want everything done yesterday
○ or I just want to go home and see my partner
○ etc.

So why should you test?
● Makes you a better person
● May help you avoid losing your employer all of their money
○ Or “users” since we are in the bay area.
● Really though, almost all of us know that we should be testing more
● This is really just to guilt trip you & give you flashbacks to your QA internships
● If you weren’t already convinced, you are probably trying to find the talk you are interested in anyways (I don’t mind if you slip out the back)

Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455

An artisanal Spark unit test
@transient private var _sc: SparkContext = _
def sc: SparkContext = _sc // accessor used by the tests
override def beforeAll() {
  // SparkContext needs an app name as well as a master
  _sc = new SparkContext("local[4]", "test")
  super.beforeAll()
}
override def afterAll() {
  if (_sc != null)
    _sc.stop()
  System.clearProperty("spark.driver.port") // rebind issue
  _sc = null
  super.afterAll()
}
Photo by morinesque

And on to the actual test...
test("really simple transformation") {
  val input = List("hi", "hi holden", "bye")
  val expected = List(List("hi"), List("hi", "holden"), List("bye"))
  assert(tokenize(sc.parallelize(input)).collect().toList === expected)
}
def tokenize(f: RDD[String]) = {
  f.map(_.split(" ").toList)
}
Photo by morinesque

Wait, where were the batteries?
Photo by Jim Bauer

Let’s get batteries!
● Spark unit testing
○ spark-testing-base - https://github.com/holdenk/spark-testing-base
○ sscheck - https://github.com/juanrh/sscheck
● Integration testing
○ About as painful as normal (e.g. those weird CR232 batteries)
● Performance - spark-perf (designed for testing Spark itself but…)
● Spark job validation
○ spark-validator - Not overly bothered with existing
Photo by Mike Mozart

A simple unit test re-visited (Scala)
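class SampleRDDTest extends FunSuite with SharedSparkContext {
  test("really simple transformation") {
    val input = List("hi", "hi holden", "bye")
    val expected = List(List("hi"), List("hi", "holden"), List("bye"))
    assert(SampleRDD.tokenize(sc.parallelize(input)).collect().toList === expected)
  }
}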

A simple unit test (Java)
public class SampleJavaRDDTest
    extends SharedJavaSparkContext implements Serializable {
  @Test public void verifyMapTest() {
    List<Integer> input = Arrays.asList(1,2);
    JavaRDD<Integer> result = jsc().parallelize(input).map(
      new Function<Integer, Integer>() {
        public Integer call(Integer x) { return x * x; }
      });
    assertEquals(input.size(), result.count());
  }
}

A simple unit test (Python)
class SimpleTest(SparkTestingBaseTestCase):
    """A simple test."""
    def test_basic(self):
        """Test a simple collect."""
        input = ["hello world"]
        rdd = self.sc.parallelize(input)
        result = rdd.collect()
        assert result == input

Making fake data
● If you have production data you can sample, you are lucky!
○ If possible you can try and save in the same format
● sc.parallelize is pretty good for small tests
○ Note that we can specify the number of partitions
● Coming up with good test data can take a long time
Lori Rielly

QuickCheck / ScalaCheck
● QuickCheck generates test data under a set of constraints
● Scala version is ScalaCheck - supported by the two unit testing libraries for Spark
● sscheck
○ Awesome people*, supports generating DStreams too!
● spark-testing-base
○ Also Awesome people*, generates more pathological RDDs
*I assume
Photo by tara hunt

With sscheck
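def forallRDDGenOfNtoM = {
  // destructure the tuple; "val minWords, maxWords = (50, 100)" would bind both names to the whole tuple
  val (minWords, maxWords) = (50, 100)
  Prop.forAll(RDDGen.ofNtoM(minWords, maxWords, arbitrary[String])) { rdd : RDD[String] =>
    rdd.map(_.length()).sum must be_>=(0.0)
  }
}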

With spark-testing-base
test("map should not change number of elements") {
  forAll(RDDGenerator.genRDD[String](sc)){
    rdd => rdd.map(_.length).count() == rdd.count()
  }
}
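The idea behind both libraries can be sketched without Spark at all. What follows is a hand-rolled illustration in plain Python, not the ScalaCheck, sscheck, or spark-testing-base API: gen_strings and for_all are made-up names, and a plain list stands in for an RDD. It generates random inputs under a size constraint and checks the "map should not change number of elements" property on each one.

```python
import random
import string


def gen_strings(rng, min_size, max_size, max_len=10):
    """Generate a list of random strings with between min_size and max_size items."""
    size = rng.randint(min_size, max_size)
    return [
        "".join(rng.choice(string.ascii_letters)
                for _ in range(rng.randint(0, max_len)))
        for _ in range(size)
    ]


def for_all(generator, prop, runs=100, seed=42):
    """Check prop on `runs` generated inputs.

    Returns the first counterexample found, or None if every input passed.
    """
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    for _ in range(runs):
        value = generator(rng)
        if not prop(value):
            return value
    return None


# Property: mapping over the data must not change the number of elements.
counterexample = for_all(
    lambda rng: gen_strings(rng, 50, 100),
    lambda words: len([len(w) for w in words]) == len(words),
)
```

Returning the failing input rather than just False is the useful part: it tells you which generated data broke the property. The real libraries go further and shrink the counterexample to something small.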

Testing streaming….
Photo by Steve Jurvetson

Testing streaming….
● Creating test data is hard
○ ssc.queueStream works - unless you need checkpoints (1.4.1+)
● Collecting the data locally is hard
○ foreachRDD & a var
● Figuring out when your test is “done”
Let’s abstract all that away into testOperation

Let’s skip straight to the batteries included version:
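test("really simple transformation") {
  val input = List(List("hi"), List("hi holden"), List("bye"))
  val expected = List(List("hi"), List("hi", "holden"), List("bye"))
  testOperation[String, String](input, tokenize _, expected, useSet = true)
}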

What about DataFrames?
● We can do the same as we did for RDDs
● Inside of Spark validation looks like:
def checkAnswer(df: DataFrame, expectedAnswer: Seq[Row])
● Sadly it’s not in a published package :(
def equalDataFrames(expected: DataFrame, result: DataFrame) {
def approxEqualDataFrames(e: DataFrame, r: DataFrame, tol: Double) {
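A tolerance-based comparison in the spirit of approxEqualDataFrames can be sketched in plain Python on rows that have already been collected. This is a sketch under assumptions, not spark-testing-base's implementation: rows are plain tuples, both sides are in the same order (sort first if your job does not guarantee order), and the name approx_equal_rows is made up for the example.

```python
import math


def approx_equal_rows(expected, result, tol):
    """Compare two lists of row tuples.

    Float fields may differ by at most tol; every other field must match exactly.
    NaN is treated as equal only to NaN.
    """
    if len(expected) != len(result):
        return False
    for e_row, r_row in zip(expected, result):
        if len(e_row) != len(r_row):
            return False
        for e, r in zip(e_row, r_row):
            if isinstance(e, float) and isinstance(r, float):
                if math.isnan(e) or math.isnan(r):
                    if not (math.isnan(e) and math.isnan(r)):
                        return False
                elif abs(e - r) > tol:
                    return False
            elif e != r:
                return False
    return True


same = approx_equal_rows([("holden", 1.0)], [("holden", 1.0005)], tol=0.001)
different = approx_equal_rows([("holden", 1.0)], [("holden", 1.1)], tol=0.001)
```

The NaN branch matters because abs(nan - x) > tol is always False, so without it a NaN on one side would silently pass.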

We can make it easier!*
test("dataframe should be equal to its self") {
  val sqlCtx = sqlContext
  import sqlCtx.implicits._ // Yah I know this is ugly
  val input = sc.parallelize(inputList).toDF
  equalDataFrames(input, input)
}
*This may or may not be easier.

Let’s talk about local mode
● It’s way better than you would expect
● It does its best to try and catch serialization errors
● It’s still not exactly the same as running on a “real” cluster

Running on a real* cluster
● Start one with your shell scripts & change the master
● YarnMiniCluster
○ https://github.com/apache/spark/blob/master/yarn/src/test/scala/org/apache/spark/deploy/yarn/BaseYarnClusterSuite.scala
Photo by Richard Masoner

Why should we validate our jobs?
● Sometimes data sources fail in new & exciting ways
● That jerk on that other floor changed the meaning of a field :(
● Our tests didn’t catch all of the corner cases that the real world finds
Photo of GSM intercept by Matt E
Photo by Quinn Dombrowski

So how do we validate our jobs?
● Spark has its own counters
○ Per-stage bytes r/w, shuffle r/w, record r/w, execution time, etc.
● We can add counters for things we care about
○ invalid records, users with no recommendations, etc.
● We can write rules to check whether the values are what we expect
○ Simple rules (X > J)
○ Historic rules (X > Avg(J))

What does that look like?
Photo by Dvortygirl

Related talks & blog posts
● Testing Spark Best Practices (Spark Summit 2014)
● Every Day I’m Shuffling (Strata 2015) & slides
● Spark and Spark Streaming Unit Testing
● Effective Testing For Spark Programs - Strata NY 2015 (upcoming!)

Related packages
● spark-testing-base: https://github.com/holdenk/spark-testing-base
● sscheck: https://github.com/juanrh/sscheck
● spark-validator: https://github.com/holdenk/spark-validator *NOT READY*
● scalacheck - https://www.scalacheck.org/

“Future Work”
● Integrating into Apache Spark
○ Using their style rules to simplify future transition
● Better ScalaCheck integration (with the help of the sscheck people)
● Some reasonable prefab rules for Job validation
● Whatever* you all want
○ Testing with Spark survey: http://bit.ly/holdenTestingSpark
Semi-likely:
● integration testing
*That I feel like doing, or you feel like making a pull request for.
Photo by bullet101

Cat wave photo by Quinn Dombrowski
k thnx bye!
Remember survey plz: http://bit.ly/holdenTestingSpark
