User Defined Aggregation
In Apache Spark
A Love Story
Erik Erlandson
Principal Software Engineer
All Love Stories
Are The Same
Hero Meets Aggregators
Hero Files Spark JIRA
Hero Merges Spark PR
Establish
The Plot
Spark’s Scale-Out World
[Diagram: a logical dataset (2 3 2 5 3 5 2 3 5) is physically split across three partitions: (2 3 2), (5 3 5), (2 3 5)]
Scale-Out Sum
Each partition folds its values into a running sum, starting from zero:
s = 0
s = s + 2 (2)
s = s + 3 (5)
s = s + 5 (10)
The partitions (2 3 5), (5 3 5), and (2 3 2) yield partial sums 10, 13, and 7.
The partials are then merged: 13 + 7 = 20, then 10 + 20 = 30.
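The scale-out sum above can be sketched with plain Scala collections (no Spark needed): each partition folds its rows into a local accumulator (the update step), and the partial sums are then combined (the merge step).

```scala
// Three "partitions" of the logical dataset from the slides.
val partitions = Seq(Seq(2, 3, 5), Seq(5, 3, 5), Seq(2, 3, 2))

// Update: each partition folds its rows into a local accumulator,
// starting from the zero element 0.
val partials = partitions.map(_.foldLeft(0)(_ + _)) // Seq(10, 13, 7)

// Merge: combine the partial accumulators into the final result.
val total = partials.reduce(_ + _) // 30
```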
Spark Aggregators

Operation | Data    | Accumulator  | Zero   | Update               | Merge              | Present
Sum       | Numbers | Number       | 0      | a + x                | a1 + a2            |
Max       | Numbers | Number       | -∞     | max(a, x)            | max(a1, a2)        |
Average   | Numbers | (sum, count) | (0, 0) | (sum + x, count + 1) | (s1 + s2, c1 + c2) | sum / count
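The Average row maps directly onto Spark's Aggregator trait; a minimal sketch (the encoders use Spark's built-in Encoders factory):

```scala
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.{Encoder, Encoders}

// Average as an Aggregator: the accumulator is (sum, count), per the table.
object AverageAgg extends Aggregator[Double, (Double, Long), Double] {
  def zero: (Double, Long) = (0.0, 0L)                                // Zero
  def reduce(a: (Double, Long), x: Double): (Double, Long) =          // Update
    (a._1 + x, a._2 + 1)
  def merge(a1: (Double, Long), a2: (Double, Long)): (Double, Long) = // Merge
    (a1._1 + a2._1, a1._2 + a2._2)
  def finish(a: (Double, Long)): Double = a._1 / a._2                 // Present
  def bufferEncoder: Encoder[(Double, Long)] =
    Encoders.tuple(Encoders.scalaDouble, Encoders.scalaLong)
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}
```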
Love Interest
Data Sketching: T-Digest
[Figure: a CDF curve running from 0 to 1; the point (x, q) with q = 0.9 shows that x is the 90th %-ile]
Is T-Digest an Aggregator?

Data Type:        Numeric
Accumulator Type: T-Digest Sketch
Zero:             Empty T-Digest
Update:           tdigest + x
Merge:            tdigest1 + tdigest2
Present:          tdigest.cdfInverse(quantile)
Romantic Chemistry
val sketchCDF = tdigestUDAF[Double]
spark.udf.register("p50",
(c:Any)=>c.asInstanceOf[TDigestSQL].tdigest.cdfInverse(0.5))
spark.udf.register("p90",
(c:Any)=>c.asInstanceOf[TDigestSQL].tdigest.cdfInverse(0.9))
Romantic Chemistry
val query = records
.writeStream //...
+---------+
|wordcount|
+---------+
| 12|
| 5|
| 9|
| 18|
| 12|
+---------+
val r = records.withColumn("time", current_timestamp())
.groupBy(window($"time", "30 seconds"))
.agg(sketchCDF($"wordcount").alias("CDF"))
.select(callUDF("p50", $"CDF").alias("p50"),
callUDF("p90", $"CDF").alias("p90"))
val query = r.writeStream //...
+----+----+
| p50| p90|
+----+----+
|15.6|31.0|
|16.0|30.8|
|15.8|30.0|
|15.7|31.0|
|16.0|31.0|
+----+----+
Romantic Montage
Sketching Data with T-Digest In Apache Spark
Smart Scalable Feature Reduction With Random Forests
One-Pass Data Science In Apache Spark With Generative T-Digests
Apache Spark for Library Developers
Extending Structured Streaming Made Easy with Algebra
Conflict!
UDAF Anatomy
class TDigestUDAF(deltaV: Double, maxDiscreteV: Int) extends
UserDefinedAggregateFunction {
def initialize(buf: MutableAggregationBuffer): Unit =
buf(0) = TDigestSQL(TDigest.empty(deltaV, maxDiscreteV))
def update(buf: MutableAggregationBuffer, input: Row): Unit =
buf(0) = TDigestSQL(buf.getAs[TDigestSQL](0).tdigest + input.getDouble(0))
def merge(buf1: MutableAggregationBuffer, buf2: Row): Unit =
buf1(0) = TDigestSQL(buf1.getAs[TDigestSQL](0).tdigest ++
buf2.getAs[TDigestSQL](0).tdigest)
def bufferSchema: StructType = StructType(StructField("tdigest", TDigestUDT) :: Nil)
// yada yada yada ...
}
User Defined Type Anatomy
class TDigestUDT extends UserDefinedType[TDigestSQL] {
def sqlType: DataType = StructType(
StructField("delta", DoubleType, false) ::
StructField("maxDiscrete", IntegerType, false) ::
StructField("nclusters", IntegerType, false) ::
StructField("clustX", ArrayType(DoubleType, false), false) ::
StructField("clustM", ArrayType(DoubleType, false), false) ::
Nil)
def serialize(tdsql: TDigestSQL): Any = { /* pack T-Digest */ }
def deserialize(datum: Any): TDigestSQL = { /* unpack T-Digest */ }
// yada yada yada ...
}
serialize and deserialize are Expensive
What Could Go Wrong?
class TDigestUDT extends UserDefinedType[TDigestSQL] {
def serialize(tdsql: TDigestSQL): Any = {
print("In serialize")
// ...
}
def deserialize(datum: Any): TDigestSQL = {
print("In deserialize")
// ...
}
// yada yada yada ...
}
What Could Go Wrong?
[Diagram: three partitions (2 3 2), (5 3 5), (2 3 5); each runs Init → Updates → Serialize, then a final Merge]
Wait What?
val sketchCDF = tdigestUDAF[Double]
val data = /* data frame with 1000 rows of data */
val sketch = data.agg(sketchCDF($"column").alias("sketch")).first
In deserialize
In serialize
In deserialize
In serialize
… 997 more times !
In deserialize
In serialize
Oh No
def update(buf: MutableAggregationBuffer, input: Row): Unit =
buf(0) = TDigestSQL(buf.getAs[TDigestSQL](0).tdigest +
input.getDouble(0))
// is equivalent to ...
def update(buf: MutableAggregationBuffer, input: Row): Unit = {
val tdigest = buf.getAs[TDigestSQL](0).tdigest // deserialize
val updated = tdigest + input.getDouble(0) // do the actual update
buf(0) = TDigestSQL(updated) // re-serialize
}
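The cost compounds with input size. This hypothetical instrumented version of the same update pattern (using an Int in place of the sketch) counts one serialization round trip per input row:

```scala
// Hypothetical stand-in for the UDAF update path: every call round-trips
// the accumulator through (de)serialization, so n rows cost n round trips
// instead of one per partition.
var serdeRoundTrips = 0
def update(acc: Int, x: Int): Int = {
  val a = acc           // stands in for: deserialize the buffer
  val updated = a + x   // the actual update
  serdeRoundTrips += 1  // stands in for: re-serialize into the buffer
  updated
}
val total = (1 to 1000).foldLeft(0)(update)
// serdeRoundTrips is now 1000: one per row, not one per partition
```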
SPARK-27296
Resolution: PR #25024
Aggregator Anatomy
class TDigestAggregator(deltaV: Double, maxDiscreteV: Int) extends
Aggregator[Double, TDigestSQL, TDigestSQL] {
def zero: TDigestSQL = TDigestSQL(TDigest.empty(deltaV, maxDiscreteV))
def reduce(b: TDigestSQL, a: Double): TDigestSQL = TDigestSQL(b.tdigest + a)
def merge(b1: TDigestSQL, b2: TDigestSQL): TDigestSQL =
TDigestSQL(b1.tdigest ++ b2.tdigest)
def finish(b: TDigestSQL): TDigestSQL = b
val serde = ExpressionEncoder[TDigestSQL]()
def bufferEncoder: Encoder[TDigestSQL] = serde
def outputEncoder: Encoder[TDigestSQL] = serde
}
Intuitive Serialization
[Diagram: three partitions (2 3 2), (5 3 5), (2 3 5); each runs Init → Updates → Serialize, then a final Merge (one serialization per partition)]
Custom Aggregation in Spark 3.0
import org.apache.spark.sql.functions.udaf
val sketchAgg = TDigestAggregator(0.5, 0)
val sketchCDF: UserDefinedFunction = udaf(sketchAgg)
val sketch = data.agg(sketchCDF($"column")).first
Performance
scala> val sketchOld = TDigestUDAF(0.5, 0)
sketchOld: org.apache.spark.tdigest.TDigestUDAF = // ...
scala> Benchmark.sample(5) { data.agg(sketchOld($"x1")).first }
res4: Array[Double] = Array(6.655, 7.044, 8.875, 9.425, 9.846)
scala> val sketchNew = udaf(TDigestAggregator(0.5, 0))
sketchNew: org.apache.spark.sql.expressions.UserDefinedFunction = // ...
scala> Benchmark.sample(5) { data.agg(sketchNew($"x1")).first }
res5: Array[Double] = Array(0.128, 0.12, 0.118, 0.12, 0.112)
Performance
scala> val sketchOld = TDigestUDAF(0.5, 0)
sketchOld: org.apache.spark.tdigest.TDigestUDAF = // ...
scala> Benchmark.sample(5) { data.agg(sketchOld($"x1")).first }
res4: Array[Double] = Array(6.655, 7.044, 8.875, 9.425, 9.846)
scala> val sketchNew = udaf(TDigestAggregator(0.5, 0))
sketchNew: org.apache.spark.sql.expressions.UserDefinedFunction = // ...
scala> Benchmark.sample(5) { data.agg(sketchNew($"x1")).first }
res5: Array[Double] = Array(0.128, 0.12, 0.118, 0.12, 0.112)
70x
Faster
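Benchmark.sample above is not a Spark API; a plausible sketch of such a helper, running a block n times and returning per-run wall-clock seconds:

```scala
// Hypothetical helper matching the usage above: run `block` n times,
// returning the elapsed wall-clock seconds for each run.
object Benchmark {
  def sample(n: Int)(block: => Unit): Array[Double] =
    Array.fill(n) {
      val t0 = System.nanoTime
      block
      (System.nanoTime - t0) / 1e9
    }
}
```

Usage: `Benchmark.sample(5) { data.agg(sketchNew($"x1")).first }` yields five timing samples, as in the REPL session above.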
Epilogue
Don’t Give Up
Patience
Respect
Erik Erlandson
Principal Software Engineer
eje@redhat.com
@ManyAngled
