Apache Spark has been a great driver not only of Scala adoption, but also of introducing a new generation of developers to functional programming concepts. As Spark places more emphasis on its newer DataFrame & Dataset APIs, it’s important to ask ourselves how we can benefit from them while still keeping our fun functional roots. We will explore the cases where the Dataset APIs empower us to do cool things we couldn’t before, what the different approaches to serialization mean, and how to figure out when the shiny new API is actually just trying to steal your lunch money (aka CPU cycles).
Building Recoverable (and optionally async) Pipelines with Apache Spark (+ s... - Holden Karau
This document summarizes Holden Karau's presentation on building recoverable pipelines with Apache Spark. The presentation explored ways that Spark jobs can fail late, presented initial attempts to make a WordCount job recoverable, and discussed improvements to the approach using non-blocking saves and the Spark DAG. The presentation concluded with recommendations to replace WordCount with a real pipeline and clean up files, as well as links for learning more about Spark.
Big Data Beyond the JVM - Strata San Jose 2018 - Holden Karau
The document discusses accelerating big data processing beyond just the Java Virtual Machine (JVM). It introduces Rachel Warren and Holden Karau, the presenters. It then covers the current state of PySpark and its performance limitations due to serialization between Python and the JVM. Future improvements discussed include using Apache Arrow to accelerate UDFs, Dask for pure Python processing, and Apache Beam for additional languages. The presenters promote their new book on high performance Spark and take questions at the end.
Testing and validating distributed systems with Apache Spark and Apache Beam ... - Holden Karau
As distributed data parallel systems, like Spark, are used for more mission-critical tasks, it is important to have effective tools for testing and validation. This talk explores the general considerations and challenges of testing systems like Spark through spark-testing-base and other related libraries.
With over 40% of folks automatically deploying the results of their Spark jobs to production, testing is especially important. Many of the tools for working with big data systems (like notebooks) are great for exploratory work, and can give a false sense of security (as well as additional excuses not to test). This talk explores why testing these systems is hard, special considerations for simulating "bad" partitioning, figuring out when your stream tests are stopped, and solutions to these challenges.
Validating big data jobs - Spark AI Summit EU - Holden Karau
As big data jobs move from the proof-of-concept phase into powering real production services, we have to start considering what will happen when everything eventually goes wrong (such as recommending inappropriate products or other decisions taken on bad data). This talk will attempt to convince you that we will all eventually get aboard the failboat (especially with ~40% of respondents automatically deploying their Spark jobs results to production), and it's important to automatically recognize when things have gone wrong so we can stop deployment before we have to update our resumes.
Figuring out when things have gone terribly wrong is trickier than it first appears, since we want to catch the errors before our users notice them (or failing that before CNN notices them). We will explore general techniques for validation, look at responses from people validating big data jobs in production environments, and libraries that can assist us in writing relative validation rules based on historical data.
For folks working in streaming, we will talk about the unique challenges of attempting to validate in a real-time system, and what we can do besides keeping an up-to-date resume on file for when things go wrong. To keep the talk interesting, real-world examples (with company names removed) will be presented, as well as several Creative Commons-licensed cat pictures and an adorable panda GIF.
If you’ve seen Holden’s previous testing Spark talks, this can be viewed as a deep dive on the second half, focused on what else we need to do besides good testing practices to create production-quality pipelines. If you haven’t seen the testing talks, watch those on YouTube after you come see this one.
The magic of (data parallel) distributed systems and where it all breaks - Re... - Holden Karau
Distributed systems can seem magical, and sometimes all of the magic works and our job succeeds. However, if you've worked with them for long enough, you've found a few places where the magic starts to break down, and discovered that it's actually a collection of several hundred garden gnomes* rather than a single large garden gnome.
This talk will use Apache Spark, Beam, Flink, Kafka, and Map Reduce to explore the world of data parallel distributed systems. We'll start with some happy pieces of magic, like how we can combine different transformations into a single pass over the data, working between different languages, data partitioning, and lambda serialization. After each new piece of magic is introduced we'll look at how it breaks in one (or two) of the systems.
Come to be told it's not your fault everything is broken, or, if your distributed software still works, for an exciting preview of everything that's going to go wrong. Don't work with distributed systems? Come to be reassured you've made good life choices.
A fast introduction to PySpark with a quick look at Arrow based UDFs - Holden Karau
This talk will introduce Apache Spark (one of the most popular big data tools), the different built ins (from SQL to ML), and, of course, everyone's favorite wordcount example. Once we've got the nice parts out of the way, we'll talk about some of the limitations and the work being undertaken to improve those limitations. We'll also look at the cases where Spark is more like trying to hammer a screw. Since we want to finish on a happy note, we will close out with looking at the new vectorized UDFs in PySpark 2.3.
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. This talk will examine how to debug Apache Spark applications, the different options for logging in PySpark, as well as some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose, and this talk will examine how to effectively search logs from Apache Spark to spot common problems. In addition to the internal logging, this talk will look at options for logging from within our program itself.
Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but this talk will look at how to effectively use Spark’s current accumulators for debugging, as well as a look to the future at data property type accumulators which may be coming to Spark in a future version.
In addition to reading logs, and instrumenting our program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems.
Debuggers are a wonderful tool; however, when you have 100 computers, the “wonder” can be a bit more like “pain”. This talk will look at how to connect remote debuggers, but also remind you that it’s probably not the easiest path forward.
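As a rough, hedged illustration of the accumulator-style debugging described above (the input path, record format, and validation rule are invented for the example, not taken from the talk):

```scala
import scala.util.Try
import org.apache.spark.sql.SparkSession

object AccumulatorDebugExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("accumulator-debug").getOrCreate()
    val sc = spark.sparkContext

    // Count records we had to skip, without failing the whole job.
    val badRecords = sc.longAccumulator("badRecords")

    val parsed = sc.textFile("input.txt").flatMap { line =>
      line.split(",") match {
        case Array(id, value) if Try(value.toDouble).isSuccess =>
          Some((id, value.toDouble))
        case _ =>
          badRecords.add(1)
          None
      }
    }

    // Accumulator values are only dependable after an action has run, and
    // cached/recomputed partitions can inflate them - treat them as a debugging aid.
    println(s"parsed ${parsed.count()} records, skipped ${badRecords.value} bad ones")
    spark.stop()
  }
}
```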
Validating big data pipelines - FOSDEM 2019 - Holden Karau
This document discusses validating data pipelines built with Apache Spark and Apache Airflow. It emphasizes that tests are not perfect and failures will occur, so validation is important to minimize impacts. Simple validation rules can check for invalid records, changes in data distributions, and schema mismatches. Validation rules can run as separate Spark jobs and metrics from jobs can be compared against expected values. Airflow can coordinate validation jobs and check for anomalies before publishing results. Overall, the key is to have validation rules that alert infrequently but catch meaningful issues.
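A minimal sketch of what one such rule might look like, assuming a simple row-count metric compared against the last known-good run (the paths and the 25% threshold are invented for illustration):

```scala
import org.apache.spark.sql.SparkSession

object RowCountValidation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("validate-output").getOrCreate()

    // Hypothetical paths: today's pipeline output and a stored copy of the last good run.
    val todayCount    = spark.read.parquet("/data/output/today").count()
    val previousCount = spark.read.parquet("/data/output/last_good").count()

    // Relative rule: alert only on large swings so the check fires infrequently.
    val change = math.abs(todayCount - previousCount).toDouble / math.max(previousCount, 1L)
    if (change > 0.25) {
      // In a real pipeline this would fail the Airflow task instead of just erroring out.
      sys.error(f"Row count moved by ${change * 100}%.1f%% - refusing to publish")
    }
    spark.stop()
  }
}
```

In practice the job's own metrics (such as records read and written) can be collected the same way and compared relatively, so the rule alerts only on large, meaningful swings.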
Validating big data pipelines - Scala eXchange 2018 - Holden Karau
Note: the link to the resource page should have been http://bit.ly/2QRVw0S
As big data jobs move from the proof-of-concept phase into powering real production services, you will need to consider what will happen when everything eventually goes wrong (such as recommending inappropriate products or other decisions taken on bad data).
During this talk, you will discover that you will eventually get aboard the failboat (especially with ~40% of respondents automatically deploying their Spark jobs results to production). It's important to automatically recognise when things have gone wrong, so you can stop deployment before you have to update your resume.
Figuring out when things have gone terribly wrong is trickier than it first appears, since you want to catch the errors before your users notice them (or failing that before CNN notices them). We will explore general techniques for validation, look at responses from people validating big data jobs in production environments, and libraries that can assist you in writing relative validation rules based on historical data. For folks working in streaming, you will learn about the unique challenges of attempting to validate in a real-time system, and what you can do besides keeping an up-to-date resume on file for when things go wrong.
You will discover code examples in Apache Spark, as well as learn about similar concepts in Apache BEAM (a cross platform tool), but the techniques should be applicable across systems.
Real-world examples (with company names removed) will be presented, as well as several Creative Commons-licensed cat pictures.
This document discusses contributing to Apache Spark. It provides an overview of finding issues to work on, the different components of Spark one could contribute to, and the process for contributing code changes through pull requests and code reviews. Key steps include searching Spark's JIRA issue tracker for starter issues, choosing a component to work in, making code and test changes, submitting a pull request for review, addressing review feedback, and getting the change merged once approved.
Intro - End to end ML with Kubeflow @ SignalConf 2018 - Holden Karau
There are many great tools for training machine learning models, ranging from scikit-learn to Apache Spark and TensorFlow. However, many of these systems largely leave open the question of how to use our models outside of the batch world (like in a reactive application). Different options exist for persisting the results and using them for live training, and we will explore the trade-offs of the different formats and their corresponding serving/prediction layers.
Big data with Python on kubernetes (pyspark on k8s) - Big Data Spain 2018 - Holden Karau
Big Data applications are increasingly being run on Kubernetes. Data scientists commonly use python-based workflows, with tools like PySpark and Jupyter for wrangling large amounts of data. The Kubernetes community over the past year has been actively investing in tools and support for frameworks such as Apache Spark, Jupyter and Apache Airflow. Attendees will learn how these tools can be used together to build a scalable self-service platform for data science on Kubernetes as well as the benefits that Kubernetes can provide over traditional options.
Getting started contributing to Apache Spark - Holden Karau
Are you interested in contributing to Apache Spark? This workshop and associated slides walk through the basics of contributing to Apache Spark as a developer. This advice is based on my 3 years of contributing to Apache Spark but should not be considered official in any way.
Validating Big Data Pipelines - Big Data Spain 2018 - Holden Karau
As big data jobs move from the proof-of-concept phase into powering real production services, we have to start considering what will happen when everything eventually goes wrong (such as recommending inappropriate products or other decisions taken on bad data). This talk will attempt to convince you that we will all eventually get aboard the failboat (especially with ~40% of respondents automatically deploying their Spark jobs results to production), and it’s important to automatically recognize when things have gone wrong so we can stop deployment before we have to update our resumes.
Extending spark ML for custom models now with python! - Holden Karau
Are you interested in adding your own custom algorithms to Spark ML? This is the talk for you! See the companion examples in High Performance Spark and the Sparkling ML project.
Introduction to and Extending Spark ML - Holden Karau
This document discusses extending Spark ML pipelines with custom estimators and transformers. It begins with an overview of Spark ML and the pipeline API. Then it demonstrates how to build a simple hardcoded word count transformer and configurable transformer. It discusses important aspects like transforming the input schema, parameters, and model fitting. The document provides guidance on configuration, persistence, serving models, and resources for learning more about custom Spark ML components.
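As a hedged sketch of the kind of hardcoded word count transformer the summary mentions (the column names and behaviour are assumptions for illustration, not the talk's exact code), a minimal Spark ML Transformer could look roughly like this:

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, size, split}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Hypothetical minimal transformer: counts words in a hardcoded "text" column.
class HardcodedWordCount(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("hardcodedWordCount"))

  override def transformSchema(schema: StructType): StructType = {
    // Validate the input column exists and describe the added output column.
    require(schema.fieldNames.contains("text"), "Input must have a 'text' column")
    schema.add(StructField("wordCount", IntegerType, nullable = false))
  }

  override def transform(dataset: Dataset[_]): DataFrame = {
    // Split on whitespace and count the resulting tokens.
    dataset.withColumn("wordCount", size(split(col("text"), "\\s+")))
  }

  override def copy(extra: ParamMap): HardcodedWordCount = defaultCopy(extra)
}
```

A configurable version would replace the hardcoded "text" and "wordCount" names with input/output column Params, which is the next step the document describes.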
Debugging Spark: Scala and Python - Super Happy Fun Times @ Data Day Texas 2018 - Holden Karau
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. Holden Karau and Joey Echeverria explore how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, and some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose. Holden and Joey demonstrate how to effectively search logs from Apache Spark to spot common problems and discuss options for logging from within your program itself. Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but Holden and Joey look at how to effectively use Spark’s current accumulators for debugging before gazing into the future to see the data property type accumulators that may be coming to Spark in future versions. And in addition to reading logs and instrumenting your program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems. Holden and Joey cover how to quickly use the UI to figure out if certain types of issues are occurring in your job.
The talk will wrap up with Holden trying to get everyone to buy several copies of her new book, High Performance Spark.
Powering tensor flow with big data using apache beam, flink, and spark cern... - Holden Karau
This document summarizes Holden Karau's presentation on powering TensorFlow with big data using Apache Beam, Apache Spark, and Apache Flink. The presentation covers why deep learning requires large datasets for training, how to prepare features from big data for TensorFlow using TensorFlow Transform, and how TensorFlow Transform can run on Apache Beam and integrate feature preparation into model serving. It also discusses challenges in integrating Python and big data systems beyond the Java Virtual Machine and efforts to improve cross-language interoperability.
Using Spark ML on Spark Errors - What do the clusters tell us? - Holden Karau
If you’re subscribed to [email protected], or work in a large company, you may see some common Spark error messages. Even attending Spark Summit over the past few years you have seen talks like the “Top K Mistakes in Spark.” While cool non-machine-learning-based tools do exist to examine Spark’s logs, they don’t use machine learning and therefore are not as cool, but they are also limited by the amount of effort humans can put into writing rules for them. This talk will look at what happens when we train “regular” clustering models on stack traces, and explore DL models for classifying user messages to the Spark list. Come for the reassurance that the robots are not yet able to fix themselves, and stay to learn how to work better with the help of our robot friends. The tl;dr of this talk is Spark ML on Spark output, plus a little bit of TensorFlow, is fun for the whole family, but probably shouldn’t automatically respond to user list posts just yet.
Debugging PySpark: Spark Summit East talk by Holden Karau - Spark Summit
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. This talk will examine how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, as well as some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose, and this talk will examine how to effectively search logs from Apache Spark to spot common problems. In addition to the internal logging, this talk will look at options for logging from within our program itself.
Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but this talk will look at how to effectively use Spark’s current accumulators for debugging, as well as a look to the future at data property type accumulators which may be coming to Spark in a future version.
In addition to reading logs, and instrumenting our program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems.
Apache Spark Super Happy Funtimes - CHUG 2016 - Holden Karau
This document provides an introduction to Apache Spark, including:
- An overview of what Spark is and the types of problems it can solve
- A brief look at the Spark API through the word count example
- Details on Spark's core abstractions of RDDs and how transformations and actions work
- Potential pitfalls of using groupByKey and how reduceByKey is preferable (see the sketch after this list)
- Resources for learning more about Spark including books and video tutorials
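For the groupByKey/reduceByKey bullet above, a minimal word count sketch (the input path is invented) of why reduceByKey is usually preferable:

```scala
import org.apache.spark.sql.SparkSession

object WordCountShuffleExample {
  def main(args: Array[String]): Unit = {
    val sc = SparkSession.builder().appName("wordcount").getOrCreate().sparkContext
    val words = sc.textFile("input.txt").flatMap(_.split("\\s+")).map((_, 1))

    // groupByKey ships every individual (word, 1) pair across the network
    // before summing, which can blow up on skewed keys.
    val withGroupBy = words.groupByKey().mapValues(_.sum)

    // reduceByKey sums within each partition first (map-side combine),
    // so far less data is shuffled.
    val withReduceBy = words.reduceByKey(_ + _)

    withReduceBy.take(10).foreach(println)
  }
}
```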
Spark Autotuning Talk - Strata New York - Holden Karau
This document discusses how to tune Apache Spark jobs for optimal performance. It begins with introductions of the presenters and an overview of what will be covered, including the most important Spark settings, using the auto tuner, examples of common errors that can be addressed by tuning, and collecting historical data. Examples are provided of how to address errors like out of memory issues by increasing resources or adjusting partitioning. While tuning can help with many issues, some problems like unnecessary shuffles or unbalanced data cannot be addressed without code changes.
Many of the recent big data systems, like Hadoop, Spark, and Kafka, are written primarily in JVM languages. At the same time, there is a wealth of tools for data science and data analytics that exist outside of the JVM. Holden Karau and Rachel Warren explore the state of the current big data ecosystem and explain how to best work with it in non-JVM languages. While much of the focus will be on Python + Spark, the talk will also include interesting anecdotes about how these lessons apply to other systems (including Kafka).
Holden and Rachel detail how to bridge the gap using PySpark and discuss other solutions like Kafka Streams as well. They also outline the challenges of pure Python solutions like dask. Holden and Rachel start with the current architecture of PySpark and its evolution. They then turn to the future, covering Arrow-accelerated interchange for Python functions, how to expose Python machine learning models into Spark, and how to use systems like Spark to accelerate training of traditional Python models. They also dive into what other similar systems are doing as well as what the options are for (almost) completely ignoring the JVM in the big data space.
Python users will learn how to more effectively use systems like Spark and understand how the design is changing. JVM developers will gain an understanding of how to work with Python code from data scientists and Python developers while avoiding the traditional trap of needing to rewrite everything.
Streaming & Scaling Spark - London Spark Meetup 2016 - Holden Karau
This talk walks through a number of common mistakes which can keep our Spark programs from scaling and examines the solutions, as well as general techniques useful for moving beyond a proof of concept to production. It covers topics like effective RDD re-use, considerations for working with key/value data, and finishes up with an introduction to Datasets with Structured Streaming (new in Spark 2.0) and how to do weird things with them.
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018 - Holden Karau
The document discusses Apache Spark Datasets and how they compare to RDDs and DataFrames. Some key points:
- Datasets provide better performance than RDDs due to a smarter optimizer, more efficient storage formats, and faster serialization. They also offer simplicity advantages over RDDs for things like windowed operations and multi-column aggregates.
- Datasets allow mixing of functional and relational styles more easily than RDDs or DataFrames. The optimizer has more information from Datasets' schemas and can perform optimizations like partial aggregation.
- Datasets address some of the limitations of DataFrames, making it easier to write UDFs and handle iterative algorithms. They provide a typed API compared to the untyped DataFrame API (see the sketch after this list).
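A small sketch of the mixed functional/relational style described in these bullets (the Purchase schema and values are invented for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

case class Purchase(user: String, amount: Double)

object MixedStyleExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("datasets-mixed").getOrCreate()
    import spark.implicits._

    val purchases = Seq(Purchase("ada", 10.0), Purchase("grace", 42.0)).toDS()

    // Functional style: a typed filter over the case class...
    val large = purchases.filter(p => p.amount > 20.0)

    // ...mixed with relational style: Catalyst-optimized grouping and aggregation.
    val perUser = large.groupBy($"user").agg(avg($"amount").as("avg_amount"))

    perUser.show()
    spark.stop()
  }
}
```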
Introduction to Spark Datasets - Functional and relational together at last - Holden Karau
Spark Datasets are an evolution of Spark DataFrames which allow us to work with both functional and relational transformations on big data with the speed of Spark.
Beyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup - Holden Karau
Spark is a general purpose distributed system for large-scale data processing. The presentation covers techniques for scaling Apache Spark jobs including caching and persisting RDDs, avoiding shuffle explosions using reduceByKey instead of groupByKey, and using Datasets for strongly typed operations. It also introduces structured streaming, a new feature in Spark 2.0 for building continuous data pipelines on streaming data.
Beyond shuffling - Scala Days Berlin 2016 - Holden Karau
This session will cover our & the community's experiences scaling Spark jobs to large datasets and the resulting best practices, along with code snippets to illustrate.
The planned topics are:
Using Spark counters for performance investigation
Spark collects a large number of statistics about our code, but how often do we really look at them? We will cover how to investigate performance issues and figure out where to best spend our time using both counters and the UI.
Working with Key/Value Data
Replacing groupByKey for awesomeness
groupByKey makes it too easy to accidentally collect individual records which are too large to process. We will talk about how to replace it in different common cases with more memory efficient operations.
Effective caching & checkpointing
Being able to reuse previously computed RDDs without recomputing can substantially reduce execution time. Choosing when to cache, checkpoint, or what storage level to use can have a huge performance impact (see the sketch after this list).
Considerations for noisy clusters
Functional transformations with Spark Datasets
How to have some of the benefits of Spark’s DataFrames while still having the ability to work with arbitrary Scala code
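A brief sketch of the caching and checkpointing topic above (paths are invented; when to persist or checkpoint depends on how often the data is reused and how expensive the lineage is to replay):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CachingExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("caching").getOrCreate()
    val sc = spark.sparkContext
    sc.setCheckpointDir("/tmp/spark-checkpoints") // hypothetical path

    val cleaned = sc.textFile("events.txt")
      .filter(_.nonEmpty)
      .map(_.toLowerCase)

    // Persist because we reuse `cleaned` for two different actions below;
    // MEMORY_AND_DISK spills to disk instead of recomputing when memory is tight.
    cleaned.persist(StorageLevel.MEMORY_AND_DISK)

    // Checkpointing truncates the lineage, which helps long jobs recover
    // without replaying every upstream transformation.
    cleaned.checkpoint()

    val total  = cleaned.count()
    val errors = cleaned.filter(_.contains("error")).count()
    println(s"$errors errors out of $total events")
    spark.stop()
  }
}
```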
A super fast introduction to Spark and glance at BEAM - Holden Karau
Apache Spark is one of the most popular general purpose distributed systems, with built in libraries to support everything from ML to SQL. Spark has APIs across languages including Scala, Java, Python, and R -- with more 3rd party language support (like Julia & C#). Apache BEAM is a cross-platform tool for building on top of different distributed systems, but it's in its early stages. This talk will introduce the core concepts of Apache Spark, and look to the potential future of Apache BEAM.
Apache Spark has two core abstractions for representing distributed data and computations. This talk will introduce the basics of RDDs and Spark DataFrames & Datasets, and Spark’s method for achieving resiliency. Since it’s a big data talk, we will include the almost-required wordcount example, and end the Spark part with follow-up pointers on Spark’s new ML APIs. For folks who are interested, we’ll then talk a bit about portability, and how Apache BEAM aims to improve portability (as well as its unique approach to cross-language support).
Slides from Holden's talk at https://www.meetup.com/Wellington-Data-Scaling-Chats/events/mdcsdpyxcbxb/
Holden Karau walks attendees through a number of common mistakes that can keep your Spark programs from scaling and examines solutions and general techniques useful for moving beyond a proof of concept to production.
Topics include:
Working with key/value data
Replacing groupByKey for awesomeness
Key skew: your data probably has it and how to survive
Effective caching and checkpointing
Considerations for noisy clusters
Functional transformations with Spark Datasets: getting the benefits of Catalyst with the ease of functional development
How to make our code testable
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018 - Holden Karau
Apache Spark has driven a lot of adoption of both Scala and functional programming concepts in non-traditional industries. Many programmers in the big data world come looking for a solution to scaling their code, quickly find themselves dealing with immutable data structures and lambdas, and those who love it stay. However, there is a dark side (of escape): much of Spark’s functional programming is changing, and even though it encourages functional programming, it does so in a variety of languages with different expectations (in-line XML as a valid part of your language is fun!). This talk will look at how Spark does a good job of introducing folks to concepts like immutability, but also places where we maybe don’t do a great job of setting up developers for a life of functional programming: things like accumulators, our three different models for streaming data, and an “interesting” approach to closures (come to find out what the ClosureCleaner does, stay to find out why). The talk will close out with a look at how the functional-inspired API is exposed in the different languages, and how this impacts the kind of code written (Scala, Java, and Python -- other languages are supported by Spark but I don’t want to re-learn Javascript or learn R just for this talk). Pictures of cute animals will be included in the slides to distract from the sad parts.
Video: https://www.youtube.com/watch?v=EDJfpkDpoE4
Getting The Best Performance With PySpark - Spark Summit
This document provides an overview of techniques for getting the best performance with PySpark. It discusses RDD reuse through caching and checkpointing. It explains how to avoid issues with groupByKey by using reduceByKey or aggregateByKey instead. Spark SQL and DataFrames are presented as alternatives that can improve performance by avoiding serialization costs for Python users. The document also covers mixing Python and Scala code by exposing Scala functions to be callable from Python.
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ... - Holden Karau
Beyond Shuffling - Tips & Tricks for scaling your Apache Spark programs. This talk walks through a number of common mistakes which can keep our Spark programs from scaling and examines the solutions, as well as general techniques useful for moving beyond a proof of concept to production. It covers topics like effective RDD re-use, considerations for working with key/value data, and finishes up with an introduction to one of Spark's newest features: Datasets.
Scaling with apache spark (a lesson in unintended consequences) strange loo... - Holden Karau
This document discusses scaling Apache Spark applications and some of the unintended consequences that can arise. It covers Spark's core abstractions of RDDs and DataFrames for distributed data and computation. It explains how Spark's lazy evaluation model and use of deterministic partitioning can impact reusing data and operations like groupByKey. It also discusses challenges that can arise from Spark's support for arbitrary functions and working with non-JVM languages like Python.
Strata NYC 2015 - What's coming for the Spark community - Databricks
In the last year Spark has seen substantial growth in adoption as well as the pace and scope of development. This talk will look forward and discuss both technical initiatives and the evolution of the Spark community.
On the technical side, I’ll discuss two key initiatives ahead for Spark. The first is a tighter integration of Spark’s libraries through shared primitives such as the data frame API. The second is across-the-board performance optimizations that exploit schema information embedded in Spark’s newer APIs. These initiatives are both designed to make Spark applications easier to write and faster to run.
On the community side, this talk will focus on the growing ecosystem of extensions, tools, and integrations evolving around Spark. I’ll survey popular language bindings, data sources, notebooks, visualization libraries, statistics libraries, and other community projects. Extensions will be a major point of growth in the future, and this talk will discuss how we can position the upstream project to help encourage and foster this growth.
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016 - Holden Karau
Description
This talk assumes you have a basic understanding of Spark (if not check out one of the intro videos on youtube - http://bit.ly/hkPySpark ) and takes us beyond the standard intro to explore what makes PySpark fast and how to best scale our PySpark jobs. If you are using Python and Spark together and want to get faster jobs - this is the talk for you.
Abstract
This talk covers a number of important topics for making scalable Apache Spark programs - from RDD re-use to considerations for working with Key/Value data, why avoiding groupByKey is important and more. We also include Python specific considerations, like the difference between DataFrames and traditional RDDs with Python. Looking at Spark 2.0; we examine how to mix functional transformations with relational queries for performance using the new (to PySpark) Dataset API. We also explore some tricks to intermix Python and JVM code for cases where the performance overhead is too high.
Introducing Apache Spark's Data Frames and Dataset APIs workshop series - Holden Karau
This session of the workshop introduces Spark SQL along with DataFrames and Datasets. Datasets give us the ability to easily intermix relational and functional style programming. So that we can explore the new Dataset API, this iteration will be focused on Scala.
This document summarizes the upcoming features in Spark 2.0, including major performance improvements from Tungsten optimizations, unifying DataFrames and Datasets into a single API, and new capabilities for streaming data with Structured Streaming. Spark 2.0 aims to further simplify programming models while delivering up to 10x speedups for queries through compiler techniques that generate efficient low-level execution plans.
Apache Spark is a fast, general-purpose, and easy-to-use cluster computing system for large-scale data processing. It provides APIs in Scala, Java, Python, and R. Spark is versatile and can run on YARN/HDFS, standalone, or Mesos. It leverages in-memory computing to be faster than Hadoop MapReduce. Resilient Distributed Datasets (RDDs) are Spark's abstraction for distributed data. RDDs support transformations like map and filter, which are lazily evaluated, and actions like count and collect, which trigger computation. Caching RDDs in memory improves performance of subsequent jobs on the same data.
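A minimal sketch of the lazy evaluation and caching behaviour described above (the numbers are arbitrary): transformations only build up the plan, actions execute it, and caching pays off on the second action.

```scala
import org.apache.spark.sql.SparkSession

object LazyEvaluationExample {
  def main(args: Array[String]): Unit = {
    val sc = SparkSession.builder().appName("lazy").getOrCreate().sparkContext

    // Transformations only record the computation; nothing runs yet.
    val numbers = sc.parallelize(1 to 1000000)
    val squares = numbers.map(n => n.toLong * n).filter(_ % 2 == 0)

    squares.cache() // mark for reuse; still nothing has executed

    // Actions trigger the actual distributed computation.
    println(squares.count())               // first action: computes and populates the cache
    println(squares.take(5).mkString(", ")) // second action: served from the cache
  }
}
```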
SparkSQL: A Compiler from Queries to RDDs - Databricks
SparkSQL, a module for processing structured data in Spark, is one of the fastest SQL on Hadoop systems in the world. This talk will dive into the technical details of SparkSQL spanning the entire lifecycle of a query execution. The audience will walk away with a deeper understanding of how Spark analyzes, optimizes, plans and executes a user’s query.
Speaker: Sameer Agarwal
This talk was originally presented at Spark Summit East 2017.
The document provides an introduction to Apache Spark and Scala. It discusses that Apache Spark is a fast and general-purpose cluster computing system that provides high-level APIs for Scala, Java, Python and R. It supports structured data processing using Spark SQL, graph processing with GraphX, and machine learning using MLlib. Scala is a modern programming language that is object-oriented, functional, and type-safe. The document then discusses Resilient Distributed Datasets (RDDs), DataFrames, and Datasets in Spark and how they provide different levels of abstraction and functionality. It also covers Spark operations and transformations, and how the Spark logical query plan is optimized into a physical execution plan.
Alpine academy apache spark series #1 introduction to cluster computing wit... - Holden Karau
Alpine Academy Apache Spark series #1: introduction to cluster computing with Python & a wee bit of Scala. This is the first in the series and is aimed at the intro level; the next one will cover MLlib & ML.
Getting the best performance with PySpark - Spark Summit West 2016 - Holden Karau
This talk assumes you have a basic understanding of Spark and takes us beyond the standard intro to explore what makes PySpark fast and how to best scale our PySpark jobs. If you are using Python and Spark together and want to get faster jobs – this is the talk for you. This talk covers a number of important topics for making scalable Apache Spark programs – from RDD re-use to considerations for working with Key/Value data, why avoiding groupByKey is important and more. We also include Python specific considerations, like the difference between DataFrames/Datasets and traditional RDDs with Python. We also explore some tricks to intermix Python and JVM code for cases where the performance overhead is too high.
Improving PySpark performance: Spark Performance Beyond the JVM - Holden Karau
This talk covers a number of important topics for making scalable Apache Spark programs - from RDD re-use to considerations for working with Key/Value data, why avoiding groupByKey is important and more. We also include Python specific considerations, like the difference between DataFrames/Datasets and traditional RDDs with Python. We also explore some tricks to intermix Python and JVM code for cases where the performance overhead is too high.
6. Who is Boo?
● Boo uses she/her pronouns (as I told the Texas house committee)
● Best doge
● Lots of experience barking at computers to make them go faster
● Author of “Learning to Bark” & “High Performance Barking”
○ Currently out of print, discussing a reprint re-run with my wife
● On twitter @BooProgrammer
7. Why does Google Cloud care about Spark?
● Lots of data!
○ We mostly use different, although similar FP-inspired, tools internally
● We have two hosted solutions for using Spark (Dataproc & GKE)
○ I have a blog post on how to try out custom/new versions of Spark if you want to help us test the next RCs (2.1.3 / 2.4 probably) - https://cloud.google.com/blog/big-data/2018/03/testing-future-apache-spark-releases-and-changes-on-google-kubernetes-engine-and-cloud-dataproc
8. Who do I think y’all are?
● Friendly[ish] people
● Don’t mind pictures of cats or stuffed animals
● May or may not know some Scala
○ If you’re new to Scala welcome to the community!
● Might know some Spark
● Want to keep things functional
● Ok with things getting a little bit silly
Lori Erickson
9. What will be covered?
● What is Spark (super brief) & how it’s helped drive FP to the enterprise
● What Datasets mean for Spark instead of RDDs
● Current limitations of Datasets (and the sad implications as a result)
● What Datasets let us accomplish that we couldn’t* before
● What we can do to make this more awesome for future generations
● We’re going to talk about a lot of things we need to fix, but please remember everything has lots of things that need fixing too.
10. What is Spark?
● General purpose distributed system
○ Built in Scala with an FP-inspired API
● Apache project (one of the most active)
● Much faster than Hadoop Map/Reduce
● Good when the data is too big for a single machine
● Built on top of two abstractions for distributed data: RDDs & Datasets
11. Why people come to Spark:
“Well this MapReduce job is going to take 16 hours - how long could it take to learn Spark?”
dougwoods
12. Why people come to Spark:
“My DataFrame won’t fit in memory on my cluster anymore, let alone my MacBook Pro :( Maybe this Spark business will solve that...”
brownpau
14. What is the “magic” of Spark?
● Automatically distributed functional programming :)
● DAG / “query plan” is the root of much of it
● Optimizer to combine steps
● Resiliency: recover from failures rather than protecting from failures
● “In-memory” + “spill-to-disk”
● Functional programming to build the DAG for “free” (see the sketch below)
● Select operations without deserialization
● The best way to trick people into learning functional programming
Richard Gillin
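A rough sketch of that “free” DAG building (not from the slides; the session setup and input path are assumptions) - transformations only record the plan, and nothing runs until an action:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dag-sketch").getOrCreate()
import spark.implicits._

val lines = spark.read.textFile("/tmp/input.txt") // Dataset[String]

// Each transformation just extends the DAG / query plan; nothing executes yet.
val words = lines.flatMap(_.split(" "))
val nonEmpty = words.filter(_.nonEmpty)

// Only an action triggers execution, after the optimizer has combined the steps.
val total = nonEmpty.count()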
15. The different pieces of Spark
[Diagram of the Spark stack: Apache Spark core; SQL, DataFrames & Datasets; Structured Streaming; Spark ML; GraphFrames (Scala, Java, Python, & R); plus the older MLlib, Streaming, and Bagel & GraphX (Scala, Java, Python)]
Paul Hudson
16. What Spark got right (for Scala/FP):
● Strong enforced[ish] requirement for immutable data
○ Recompute is used for failure recovery, so immutability is a core part of the logic
● Functional operators (map, filter, flatMap, etc.) - see the sketch below
● Lambdas for everyone!
○ Sometimes too many….
● Solved a “business need”
○ Even if that need was imaginary
● Made it hard to have side effects against external variables without being very explicit & verbose
○ Even then strongly discouraged
Stuart
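A minimal sketch of those functional operators on an RDD (the numbers are made up; assumes a SparkSession `spark` is already in scope) - every call returns a new, immutable RDD:

val sc = spark.sparkContext

val nums = sc.parallelize(1 to 10)
val doubled = nums.map(_ * 2) // map: transform each element
val evens = doubled.filter(_ % 4 == 0) // filter: keep a subset
val digits = evens.flatMap(n => n.toString.map(_.asDigit)) // flatMap: 0..n outputs per input

digits.collect() // only the action actually runs the job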
17. What Spark got … less right (for Scala/FP):
● Serialization… complications
○ Makes people think closures are more limited than they can be
● Lots of Map[String, String] (equivalent) settings
○ Hey buddy, can you spare a type checker?
● Hard to debug - easily conflated with Scala being hard to debug
○ Not completely unjustified sometimes
● New ML & SQL APIs without “any” types (initially)
indamage
18. What are these “new” APIs?
● First, what is “new” - replaces an old, not yet removed, working thing with something that might work
● DataFrames - not that new, kind of superseded-ish by Datasets (yay)
● “New” ML API (called ML) - Look ma, no types :(
○ We “forgot” to add a serving layer. We started, but then got bored.
● Structured Streaming
○ Hey buddy, want to try a new execution engine? It might not lose your data. Don’t pay any attention to the missing/broken windows, self-joins, changing APIs, and…. yeah, maybe give it a few months
Susanne Nilsson
19. DataFrames/Datasets
● DataFrames: Everything is a Row. Even case classes are Rows.
● Datasets: Oh shit, types were useful, let’s add those back (see the sketch below)
● More SQL inspired than functional inspired
○ select etc.
● Started out with no functional operations or types, added later (and it shows)
● Schema (not type) inference
○ “How many people know the types of their JSON data?” / eskati everyone say “fuck json”
○ If you don’t get that reference listen to lil’ pump (or not)
● No automatic tuple magic on read, instead a “Row” of pretty much anything
● Overhead to apply strict types
● Many many operations throw away types
● Required for much of Spark’s new functionality
○ RDDs will still be around, but… the cool new toys are in Datasets :(
Paul Harrison
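A small sketch of the Row-vs-typed split (RawPanda is borrowed from the later slides; the JSON path is an assumption):

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

case class RawPanda(id: Long, zip: String, happy: Boolean, attributes: Array[Double])

val spark = SparkSession.builder().appName("df-vs-ds").getOrCreate()
import spark.implicits._

// DataFrame: everything is a Row; column names & types are only checked at runtime.
val df: DataFrame = spark.read.json("/tmp/pandas.json")
df.select("happy") // a typo'd column name fails at runtime, not compile time

// Dataset: attach the type back with as[T]; lambdas are checked at compile time.
val ds: Dataset[RawPanda] = df.as[RawPanda]
ds.filter(_.happy) // the compiler knows RawPanda has happy: Boolean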
20. Why are Datasets so awesome?
● Easier to mix functional style and relational style
○ No more hive UDFs!
● Nice performance of Spark SQL, flexibility of RDDs
○ Tungsten (better serialization)
○ Equivalent of Sortable trait
● Strongly typed
● The future (ML, Graph, etc.)
● Potential for better language interop
○ Something like Arrow has a much better chance with Datasets
○ Cross-platform libraries are easier to make & use
Will Folsom
23. What about compared to Kryo?
● Depends who you listen to
○ According to the people who wrote it, still better
● Nominally also allows sort operations directly on serialized data
○ Some restrictions do apply
● Custom classes with complex types require custom work :( (a Kryo config sketch follows below)
laurenbeth93
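For contrast, a minimal sketch of wiring up Kryo on the RDD side (the class and app names here are placeholders, not anything from the slides):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

case class RawPanda(id: Long, zip: String, happy: Boolean, attributes: Array[Double])

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[RawPanda])) // avoids writing full class names into every record

val spark = SparkSession.builder().config(conf).appName("kryo-sketch").getOrCreate()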
24. Using Datasets to mix functional & relational style:
val ds: Dataset[RawPanda] = ...
val happiness = ds.filter($"happy" === true).
select($"attributes"(0).as[Double]).
reduce((x, y) => x + y)
25. So what was that?
ds.toDF().filter($"happy" === true).as[RawPanda].
  select($"attributes"(0).as[Double]).
  reduce((x, y) => x + y)
● toDF(): convert the Dataset to a DataFrame to access more DataFrame functions (pre-2.0)
● as[RawPanda]: convert the DataFrame back to a Dataset
● select($"attributes"(0).as[Double]): a typed query (specifies the return type)
● reduce((x, y) => x + y): traditional functional reduction - arbitrary scala code :)
26. And functional style maps:
/**
 * Functional map + Dataset, sums the positive attributes for the pandas
 */
def funMap(ds: Dataset[RawPanda]): Dataset[Double] = {
  ds.map{rp => rp.attributes.filter(_ > 0).sum}
}
Chris Isherwood
27. A Word count w/Datasets (ish)
val df = spark.read.load(src).select("text")
val ds = df.as[String]
// Returns a Dataset!
val words = ds.flatMap(x => x.split(" "))
val grouped = words.groupBy("value")
val word_count = grouped.agg(count("*") as "count")
word_count.write.format("parquet").save("wc")
● Can’t push down filters from here
● If it’s a simple type we don’t have to define a case class
● Lose type information (see the typed variant sketched below)
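A typed variant that stays in the Dataset API throughout (a sketch, not from the slides; the input and output paths are assumptions) - groupByKey keeps the key type, at the cost of the grouping function being opaque to the optimizer:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("typed-wc").getOrCreate()
import spark.implicits._

val ds = spark.read.load("/tmp/input.parquet").select("text").as[String]

// groupByKey keeps the type -- the result is a Dataset[(String, Long)].
val wordCount = ds
  .flatMap(_.split(" "))
  .groupByKey(identity)
  .count()

wordCount.write.format("parquet").save("/tmp/typed-wc")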
33. UDFS: Adding custom code
sqlContext.udf.register("strLen", (s: String) => s.length())

sqlCtx.registerFunction("strLen", lambda x: len(x), IntegerType())
Yağmur Adam
34. Using UDF on a table:
First register the table:
df.registerTempTable("myTable")
sqlContext.sql("SELECT firstCol, strLen(stringCol) from myTable")
35. Aggregates - Classes are fun right?
abstract class UserDefinedAggregateFunction {
  def initialize(buffer: MutableAggregationBuffer): Unit
  def update(buffer: MutableAggregationBuffer, input: Row): Unit
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit
  def evaluate(buffer: Row): Any
}
(a filled-in sketch follows below)
Sil Silv
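A minimal sketch of filling that class in (the column and class names are made up for illustration):

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class SumHappiness extends UserDefinedAggregateFunction {
  // The real trait also needs these schema/type members, which the slide elides.
  def inputSchema: StructType = StructType(StructField("happiness", DoubleType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("total", DoubleType) :: Nil)
  def dataType: DataType = DoubleType
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0.0
  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) buffer(0) = buffer.getDouble(0) + input.getDouble(0)
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getDouble(0) + buffer2.getDouble(0)
  def evaluate(buffer: Row): Double = buffer.getDouble(0)
}

// Usage: df.agg(new SumHappiness()(df("happiness")))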
36. Spark SQL Aggregates
● We could make a functional version, but we haven’t yet
● Maybe a simple, good PR for someone looking to help us keep it functional :p
○ Although to be fair there might be pushback
● Hint hint :)
37. Using UDFs Programmatically
def dateTimeFunction(format: String): UserDefinedFunction = {
  import org.apache.spark.sql.functions.udf
  udf((time: Long) => new Timestamp(time * 1000))
}
val format = "dd-mm-yyyy"
df.select(df(firstCol),
  dateTimeFunction(format)(df(unixTimeStamp).cast(TimestampType)))
38. Functions.scala: Everything is a string (or column)
● Lots of operators, yay!
● Mini sadness
● Frameless brings typed columns! (rough sketch below) - https://github.com/typelevel/frameless/blob/master/dataset/src/main/scala/frameless/TypedColumn.scala
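Very roughly what that gives you, loosely based on the frameless README - treat the exact imports and method names as assumptions, since they shift between frameless versions:

import frameless.TypedDataset
import frameless.syntax._

// Assumes RawPanda, rawPandas: Dataset[RawPanda], and the needed implicits
// (SparkSession / TypedEncoder derivation) are already in scope.
val typedPandas: TypedDataset[RawPanda] = TypedDataset.create(rawPandas)

// Column references are checked at compile time:
// typedPandas('happyy) would fail to compile rather than blow up at runtime.
val happyCols = typedPandas.select(typedPandas('happy))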
39. Spark ML pipelines
● Scikit inspired
● No types :(
○ Instead kind of hokey runtime schema checking that isn’t always correct
○ When it fails you can have a job fail after 8+ hours :(
● Frameless to the (optional) rescue - https://github.com/typelevel/frameless/tree/master/ml/src/main/scala/frameless/ml/feature
● Also similar efforts exist inside of certain companies
○ Which I wish they would open source
george erws
40. Basic Dataprep pipeline for “ML”
// Combines a list of double input features into a vector
val assembler = new VectorAssembler().setInputCols(Array("age", "education-num"))
  .setOutputCol("features")
// String indexer converts a set of strings into doubles
val indexer = new StringIndexer().setInputCol("category")
  .setOutputCol("category-index")
// Can be used to combine pipeline components together
val pipeline = new Pipeline().setStages(Array(assembler, indexer))
Huang Yun Chung
41. So a bit more about that pipeline
● Each of our previous components has a “fit” & “transform” stage
● Constructing the pipeline this way makes it easier to work with (only need to call one fit & one transform)
● Can re-use the fitted model on future data (sketch below)
val model = pipeline.fit(df)
val prepared = model.transform(df)
Andrey
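Since the fitted model can be re-used, a small sketch of what that looks like (newDf and the save path are assumptions, not from the slides):

import org.apache.spark.ml.PipelineModel

// Apply the already-fitted stages to new data -- no re-fitting needed.
val preparedNew = model.transform(newDf)

// Fitted pipelines are writable, so they can be reloaded later for batch or serving jobs.
model.write.overwrite().save("/tmp/prep-pipeline-model")
val reloaded = PipelineModel.load("/tmp/prep-pipeline-model")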
42. What does our pipeline look like so far?
[Diagram: Input Data → Assembler → Input Data + Vectors → StringIndexer → Input Data + Cat ID + Vectors]
● The StringIndexer, while not an ML learning algorithm, still needs to be fit
● The Assembler is a regular transformer - no fitting required
43. Adding some ML (no longer cool -- DL)
// Specify model
val dt = new DecisionTreeClassifier()
  .setLabelCol("category-index")
  .setFeaturesCol("features")
// Add it to the pipeline
val pipeline_and_model = new Pipeline().setStages(
  Array(assembler, indexer, dt))
val pipeline_model = pipeline_and_model.fit(df)
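To close the loop, a short sketch of scoring with the fitted pipeline (testDf is an assumption):

// Runs the assembler, indexer, and tree on new data in one call.
val predictions = pipeline_model.transform(testDf)
predictions.select("category-index", "prediction").show()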
45. What does the future look like?*
*Source: https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
*Vendor benchmark. Trust but verify.
46. Arrow powered magic (numeric :p):
add = pandas_udf(lambda x, y: x + y, IntegerType())
James Willamor
47. And now we can use it for streaming too!
● Structured Streaming - new in Spark 2.0
○ Emphasis on new - be cautious when using
● New execution engine option in 2.3
● Extends the Dataset & DataFrame APIs to represent continuous tables
● Still early stages - but we now have flexibility to change engines (sort of)
48. Get a streaming dataset
// Read a streaming dataframe
val schema = new StructType()
  .add("happiness", "double")
  .add("coffees", "integer")
val streamingDS = spark
  .readStream
  .schema(schema)
  .format("parquet")
  .load(path)
[Diagram: a Dataset with isStreaming = true, backed by a streaming source]
49. Build the recipe for each query
val happinessByCoffee = streamingDS
  .groupBy($"coffees")
  .agg(avg($"happiness"))
[Diagram: the streaming Dataset feeding an Aggregate node with groupBy = “coffees”, expr = avg(“happiness”)]
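The slides stop at the recipe; a hedged sketch of actually starting the query (the sink, output mode, and checkpoint location are assumptions):

val query = happinessByCoffee
  .writeStream
  .outputMode("complete") // aggregations need complete/update output modes
  .format("console") // any supported sink works; console is handy for demos
  .option("checkpointLocation", "/tmp/happiness-checkpoint")
  .start()

query.awaitTermination()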
50. Scala might matter “less”
● I float between Python & Scala so I’ll still have a job
● But I _like_ functional programming & types
● Traditionally (for better or worse) large overhead to work in Python on distributed data
○ The overhead is quickly going down
○ As Kelly mentioned in her talk this morning, PySpark folks sometimes used to learn (some) Scala for performance -- we’ll have to offer new shiny things instead
KLMircea
51. Key takeaways
● Datasets are a functional API
○ With easier “support” for window operations and similar compared to RDDs
○ We can still sell enterprise support contracts and training to banks.
● Spark ML still uses DataFrames (no types)
○ Frameless has types for (some of) it!
○ Yes you can use deep learning with it. No, I didn’t talk about that, it’s extra.
● We have some important work to do to keep functional programming competitive with SQL in Spark.
○ And with Python, seriously.
jeffreyw
52. Learning Spark
● Fast Data Processing with Spark (out of date)
● Fast Data Processing with Spark (2nd edition)
● Advanced Analytics with Spark
● Spark in Action
● High Performance Spark
● Learning PySpark
53. High Performance Spark!
Available today!
You can buy it from that scrappy Seattle bookstore; Jeff Bezos needs another newspaper and I want a cup of coffee.
http://bit.ly/hkHighPerfSpark
54. And some upcoming talks:
● June
○ Live streams (this Friday & weekly*) - follow me on twitch & YouTube
● July
○ Possible PyData Meetup in Amsterdam (tentative)
○ Curry on Amsterdam
○ OSCON Portland
● August
○ JupyterCon NYC
● September
○ Strata NYC
○ Strangeloop STL
55. k thnx bye :)
If you care about Spark testing and don’t hate surveys: http://bit.ly/holdenTestingSpark
Will tweet results “eventually” @holdenkarau
Pssst: Have feedback on the presentation? Give me a shout ([email protected] or http://bit.ly/holdenTalkFeedback) if you feel comfortable doing so :)
Feedback (if you are so inclined): http://bit.ly/holdenTalkFeedback