User Defined Aggregation
In Apache Spark
A Love Story
Erik Erlandson
Principal Software Engineer
All Love Stories
Are The Same
Hero Meets Aggregators
Hero Files Spark JIRA
Hero Merges Spark PR
Establish
The Plot
Spark’s Scale-Out World
[Diagram: a logical dataset (2 3 2 5 3 5 2 3 5) is physically split across three partitions: (2 3 2), (5 3 5), (2 3 5)]
Scale-Out Sum
Each partition folds its values into a running sum, starting from zero:
s = 0
s = s + 2 (2)
s = s + 3 (5)
s = s + 5 (10)
The partitions (2 3 5), (5 3 5), and (2 3 2) yield partial sums 10, 13, and 7.
The partials are then merged: 13 + 7 = 20, then 10 + 20 = 30.
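The scale-out sum above can be sketched with plain Scala collections (no Spark needed): each partition folds its rows into a local accumulator (the update step), and the partial sums are then combined (the merge step).

```scala
// Three "partitions" of the logical dataset from the slides.
val partitions = Seq(Seq(2, 3, 5), Seq(5, 3, 5), Seq(2, 3, 2))

// Update: each partition folds its rows into a local accumulator,
// starting from the zero element 0.
val partials = partitions.map(_.foldLeft(0)(_ + _)) // Seq(10, 13, 7)

// Merge: combine the partial accumulators into the final result.
val total = partials.reduce(_ + _) // 30
```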
Spark Aggregators

Operation | Data    | Accumulator  | Zero   | Update               | Merge              | Present
Sum       | Numbers | Number       | 0      | a + x                | a1 + a2            |
Max       | Numbers | Number       | -∞     | max(a, x)            | max(a1, a2)        |
Average   | Numbers | (sum, count) | (0, 0) | (sum + x, count + 1) | (s1 + s2, c1 + c2) | sum / count
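The Average row maps directly onto Spark's Aggregator trait; a minimal sketch (the encoders use Spark's built-in Encoders factory):

```scala
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.{Encoder, Encoders}

// Average as an Aggregator: the accumulator is (sum, count), per the table.
object AverageAgg extends Aggregator[Double, (Double, Long), Double] {
  def zero: (Double, Long) = (0.0, 0L)                                // Zero
  def reduce(a: (Double, Long), x: Double): (Double, Long) =          // Update
    (a._1 + x, a._2 + 1)
  def merge(a1: (Double, Long), a2: (Double, Long)): (Double, Long) = // Merge
    (a1._1 + a2._1, a1._2 + a2._2)
  def finish(a: (Double, Long)): Double = a._1 / a._2                 // Present
  def bufferEncoder: Encoder[(Double, Long)] =
    Encoders.tuple(Encoders.scalaDouble, Encoders.scalaLong)
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}
```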
Love Interest
Data Sketching: T-Digest
[Figure: a CDF curve running from 0 to 1; the point (x, q) with q = 0.9 shows that x is the 90th %-ile]
Is T-Digest an Aggregator?

Data Type:        Numeric
Accumulator Type: T-Digest Sketch
Zero:             Empty T-Digest
Update:           tdigest + x
Merge:            tdigest1 + tdigest2
Present:          tdigest.cdfInverse(quantile)
Romantic Chemistry
val sketchCDF = tdigestUDAF[Double]
spark.udf.register("p50",
(c:Any)=>c.asInstanceOf[TDigestSQL].tdigest.cdfInverse(0.5))
spark.udf.register("p90",
(c:Any)=>c.asInstanceOf[TDigestSQL].tdigest.cdfInverse(0.9))
Romantic Chemistry
val query = records
.writeStream //...
+---------+
|wordcount|
+---------+
| 12|
| 5|
| 9|
| 18|
| 12|
+---------+
val r = records.withColumn("time", current_timestamp())
.groupBy(window($"time", "30 seconds"))
.agg(sketchCDF($"wordcount").alias("CDF"))
.select(callUDF("p50", $"CDF").alias("p50"),
callUDF("p90", $"CDF").alias("p90"))
val query = r.writeStream //...
+----+----+
| p50| p90|
+----+----+
|15.6|31.0|
|16.0|30.8|
|15.8|30.0|
|15.7|31.0|
|16.0|31.0|
+----+----+
Romantic Montage
Sketching Data with T-Digest In Apache Spark
Smart Scalable Feature Reduction With Random Forests
One-Pass Data Science In Apache Spark With Generative T-Digests
Apache Spark for Library Developers
Extending Structured Streaming Made Easy with Algebra
Conflict!
UDAF Anatomy
class TDigestUDAF(deltaV: Double, maxDiscreteV: Int) extends
UserDefinedAggregateFunction {
def initialize(buf: MutableAggregationBuffer): Unit =
buf(0) = TDigestSQL(TDigest.empty(deltaV, maxDiscreteV))
def update(buf: MutableAggregationBuffer, input: Row): Unit =
buf(0) = TDigestSQL(buf.getAs[TDigestSQL](0).tdigest + input.getDouble(0))
def merge(buf1: MutableAggregationBuffer, buf2: Row): Unit =
buf1(0) = TDigestSQL(buf1.getAs[TDigestSQL](0).tdigest ++
buf2.getAs[TDigestSQL](0).tdigest)
def bufferSchema: StructType = StructType(StructField("tdigest", TDigestUDT) :: Nil)
// yada yada yada ...
}
User Defined Type Anatomy
class TDigestUDT extends UserDefinedType[TDigestSQL] {
def sqlType: DataType = StructType(
StructField("delta", DoubleType, false) ::
StructField("maxDiscrete", IntegerType, false) ::
StructField("nclusters", IntegerType, false) ::
StructField("clustX", ArrayType(DoubleType, false), false) ::
StructField("clustM", ArrayType(DoubleType, false), false) ::
Nil)
def serialize(tdsql: TDigestSQL): Any = { /* pack T-Digest */ }
def deserialize(datum: Any): TDigestSQL = { /* unpack T-Digest */ }
// yada yada yada ...
}
serialize and deserialize are Expensive
What Could Go Wrong?
class TDigestUDT extends UserDefinedType[TDigestSQL] {
def serialize(tdsql: TDigestSQL): Any = {
print("In serialize")
// ...
}
def deserialize(datum: Any): TDigestSQL = {
print("In deserialize")
// ...
}
// yada yada yada ...
}
What Could Go Wrong?
[Diagram: three partitions (2 3 2), (5 3 5), (2 3 5); each runs Init → Updates → Serialize, then a final Merge]
Wait What?
val sketchCDF = tdigestUDAF[Double]
val data = /* data frame with 1000 rows of data */
val sketch = data.agg(sketchCDF($"column").alias("sketch")).first
In deserialize
In serialize
In deserialize
In serialize
… 997 more times !
In deserialize
In serialize
Oh No
def update(buf: MutableAggregationBuffer, input: Row): Unit =
buf(0) = TDigestSQL(buf.getAs[TDigestSQL](0).tdigest +
input.getDouble(0))
// is equivalent to ...
def update(buf: MutableAggregationBuffer, input: Row): Unit = {
val tdigest = buf.getAs[TDigestSQL](0).tdigest // deserialize
val updated = tdigest + input.getDouble(0) // do the actual update
buf(0) = TDigestSQL(updated) // re-serialize
}
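The cost compounds with input size. This hypothetical instrumented version of the same update pattern (using an Int in place of the sketch) counts one serialization round trip per input row:

```scala
// Hypothetical stand-in for the UDAF update path: every call round-trips
// the accumulator through (de)serialization, so n rows cost n round trips
// instead of one per partition.
var serdeRoundTrips = 0
def update(acc: Int, x: Int): Int = {
  val a = acc           // stands in for: deserialize the buffer
  val updated = a + x   // the actual update
  serdeRoundTrips += 1  // stands in for: re-serialize into the buffer
  updated
}
val total = (1 to 1000).foldLeft(0)(update)
// serdeRoundTrips is now 1000: one per row, not one per partition
```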
SPARK-27296
Resolution: PR #25024
Aggregator Anatomy
class TDigestAggregator(deltaV: Double, maxDiscreteV: Int) extends
Aggregator[Double, TDigestSQL, TDigestSQL] {
def zero: TDigestSQL = TDigestSQL(TDigest.empty(deltaV, maxDiscreteV))
def reduce(b: TDigestSQL, a: Double): TDigestSQL = TDigestSQL(b.tdigest + a)
def merge(b1: TDigestSQL, b2: TDigestSQL): TDigestSQL =
TDigestSQL(b1.tdigest ++ b2.tdigest)
def finish(b: TDigestSQL): TDigestSQL = b
val serde = ExpressionEncoder[TDigestSQL]()
def bufferEncoder: Encoder[TDigestSQL] = serde
def outputEncoder: Encoder[TDigestSQL] = serde
}
Intuitive Serialization
[Diagram: three partitions (2 3 2), (5 3 5), (2 3 5); each runs Init → Updates → Serialize, then a final Merge (one serialization per partition)]
Custom Aggregation in Spark 3.0
import org.apache.spark.sql.functions.udaf
val sketchAgg = TDigestAggregator(0.5, 0)
val sketchCDF: UserDefinedFunction = udaf(sketchAgg)
val sketch = data.agg(sketchCDF($"column")).first
Performance
scala> val sketchOld = TDigestUDAF(0.5, 0)
sketchOld: org.apache.spark.tdigest.TDigestUDAF = // ...
scala> Benchmark.sample(5) { data.agg(sketchOld($"x1")).first }
res4: Array[Double] = Array(6.655, 7.044, 8.875, 9.425, 9.846)
scala> val sketchNew = udaf(TDigestAggregator(0.5, 0))
sketchNew: org.apache.spark.sql.expressions.UserDefinedFunction = // ...
scala> Benchmark.sample(5) { data.agg(sketchNew($"x1")).first }
res5: Array[Double] = Array(0.128, 0.12, 0.118, 0.12, 0.112)
Performance
scala> val sketchOld = TDigestUDAF(0.5, 0)
sketchOld: org.apache.spark.tdigest.TDigestUDAF = // ...
scala> Benchmark.sample(5) { data.agg(sketchOld($"x1")).first }
res4: Array[Double] = Array(6.655, 7.044, 8.875, 9.425, 9.846)
scala> val sketchNew = udaf(TDigestAggregator(0.5, 0))
sketchNew: org.apache.spark.sql.expressions.UserDefinedFunction = // ...
scala> Benchmark.sample(5) { data.agg(sketchNew($"x1")).first }
res5: Array[Double] = Array(0.128, 0.12, 0.118, 0.12, 0.112)
70x
Faster
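Benchmark.sample above is not a Spark API; a plausible sketch of such a helper, running a block n times and returning per-run wall-clock seconds:

```scala
// Hypothetical helper matching the usage above: run `block` n times,
// returning the elapsed wall-clock seconds for each run.
object Benchmark {
  def sample(n: Int)(block: => Unit): Array[Double] =
    Array.fill(n) {
      val t0 = System.nanoTime
      block
      (System.nanoTime - t0) / 1e9
    }
}
```

Usage: `Benchmark.sample(5) { data.agg(sketchNew($"x1")).first }` yields five timing samples, as in the REPL session above.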
Epilogue
Don’t Give Up
Patience
Respect
Erik Erlandson
Principal Software Engineer
eje@redhat.com
@ManyAngled
