Exciting New Features
of Apache Spark 4.0
Daniel Tenedorio dtenedor
Agenda
New Functionalities
Spark Connect, ANSI Mode, Variant Data Types,
Collation Support
Extensions
Custom Functions and Procedures
Arrow Optimized Python UDF Support,
Python Data Source APIs, Python UDTF
SQL Pipe Syntax, SQL UDF/UDTF, Stored Procedures
Usability
PySpark UDF Unified Profiling, Spark Native
Plotting
Spark Connect
Spark’s Monolith Driver
How to embed Spark in applications?
Up until Spark 3.4: Hard to support today’s developer experience requirements
[Diagram: applications, IDEs / notebooks, and programming-language SDKs all embed Spark's monolithic driver (application logic, analyzer, optimizer, scheduler, distributed execution engine); limitations: no JVM interop, tied to the REPL, SQL only]
Spark Connect Client API
Connect to Spark from Any App
Thin client, with full power of Apache Spark
[Diagram: applications, IDEs / notebooks, and language SDKs embed only the thin Spark Connect client, which talks to Spark's driver through an application gateway in front of the analyzer, optimizer, scheduler, and distributed execution engine]
Spark Connect GA in Apache Spark 4.0
pip install pyspark>=3.4.0
in your favorite IDE!
Interactively develop
& debug from your IDE
Check out Databricks Connect,
use & contribute the Go client
New Connectors and SDKs
in any language!
Build interactive Data
Applications
Get started with our
github example!
Databricks
Connect
Scala3
The lightweight Spark Connect Package
pip install pyspark-connect
• Pure Python library, no JVM.
• A pure Spark Connect client, not the entire PySpark package.
• Only ~1.5 MB (full PySpark: ~355 MB).
• Preferred if your application has fully migrated to the Spark Connect API.
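A minimal sketch of connecting through Spark Connect from Python (the sc:// endpoint below is a placeholder for your own server; 15002 is the default port):
from pyspark.sql import SparkSession

# Connect to a running Spark Connect server (placeholder endpoint)
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
spark.range(5).show()
Python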
ANSI MODE
ON by default in 4.0
Migration
to ANSI ON
Action: Turn on ANSI mode
to fix your data corruption!
Data Corruption!
Without ANSI mode
Error callsite is captured
With ANSI mode (3.5)
With ANSI mode (4.0)
Error callsite is highlighted
With ANSI mode (4.0)
Culprit operation Line number
DataFrame queries with error callsite
DataFrame queries
with error callsite
• PySpark support is on the way.
• Spark Connect support is on the way.
• Native notebook integration is on the way
(so that you can see the highlighting).
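For illustration, a minimal sketch of the behavioral difference, using the standard spark.sql.ansi.enabled flag:
# With ANSI mode off, an invalid cast silently produces NULL (data corruption)
spark.conf.set("spark.sql.ansi.enabled", False)
spark.sql("SELECT CAST('abc' AS INT)").show()   # -> NULL

# With ANSI mode on (the default in Spark 4.0), the same query raises an error
spark.conf.set("spark.sql.ansi.enabled", True)
spark.sql("SELECT CAST('abc' AS INT)").show()   # -> raises CAST_INVALID_INPUT
Python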
Variant Data Type for
Semi-Structured Data
Data Engineer’s Dilemma: Pick 2 out of 3…
Flexible
Fast Open
What if you could ingest JSON,
maintain flexibility,
boost performance, and
use an open standard?
Motivation
Variant is flexible
SQL
INSERT INTO variant_tbl (event_data)
VALUES
(
PARSE_JSON(
'{"level": "warning",
"message": "invalid request",
"user_agent": "Mozilla/5.0 ..."}'
)
);
SELECT
*
FROM
variant_tbl
WHERE
event_data:user_agent ilike '%Mozilla%';
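The same data can be queried from PySpark; a minimal sketch below reads individual fields with the SQL variant_get function via expr (table and field names from the example above):
from pyspark.sql.functions import expr

df = spark.table("variant_tbl")
df.select(
    expr("variant_get(event_data, '$.level', 'string')").alias("level"),
    expr("variant_get(event_data, '$.user_agent', 'string')").alias("user_agent"),
).show(truncate=False)
Python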
Performance
String Collation Support
ANSI SQL COLLATE ● Associate columns, fields, array elements
with a collation of choice
○ Case insensitive
○ Accent insensitive
○ Locale aware
● Supported by many string functions such as
○ lower()/upper()
○ substr()
○ locate()
○ like
● GROUP BY, ORDER BY, comparisons, …
● Supported by Delta and Photon
Sorting and comparing strings
according to locale
A look at the default collation
A < Z < a < z < Ā
Is this really what we want here?
> SELECT name FROM names ORDER BY name;
name
Anthony
Bertha
anthony
bertha
Ānthōnī
SQL
COLLATE unicode
One size fits most
> SELECT name
FROM names
ORDER BY name COLLATE unicode;
name
Ānthōnī
anthony
Anthony
bertha
Bertha
SQL
Root collation with decent sort order for most locales
COLLATE unicode_ci
Case insensitive comparisons have entered the chat
> SELECT name
FROM names
WHERE startswith(name COLLATE unicode_ci, 'a')
ORDER BY name COLLATE unicode_ci;
name
anthony
Anthony
SQL
Case insensitive is not accent insensitive: We lost Ānthōnī
COLLATE unicode_ci_ai
Equality from a to Ź
> SELECT name
FROM names
WHERE startswith(name COLLATE unicode_ci_ai, 'a')
ORDER BY name COLLATE unicode_ci_ai;
name
Ānthōnī
anthony
Anthony
SQL
100s of supported predefined collations across many locales
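The same behavior can be reproduced from PySpark with an inline VALUES table; a minimal self-contained sketch:
spark.sql("""
  SELECT name
  FROM VALUES ('Anthony'), ('anthony'), ('Ānthōnī'), ('Bertha') AS t(name)
  WHERE startswith(name COLLATE unicode_ci_ai, 'a')
  ORDER BY name COLLATE unicode_ci_ai
""").show()
Python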
Agenda
New Functionalities
Spark Connect, ANSI Mode, Variant Data Types,
Collation Support
Extensions
Custom Functions and Procedures
Arrow Optimized Python UDF Support,
Python Data Source APIs, Python UDTF
SQL Pipe Syntax, SQL UDF/UDTF, Stored Procedures
Usability
PySpark UDF Unified Profiling, Spark Native
Plotting
Arrow Optimized Python UDF
Enhancing Python UDFs with Apache Arrow
Introduction to Arrow and Its
Role in UDF Optimization:
● Utilizes Apache Arrow
● Supported since Spark 3.5
Key Benefits
● Enhances data serialization and
deserialization speed
● Provides standardized type coercion
spark.conf.set("spark.sql.execution.pythonUDF
.arrow.enabled", True)
# An Arrow Python UDF
@udf(returnType='int')
def arrow_slen(s):
return len(s)
Enabling Arrow Optimization
Local Activation in a UDF
Activates Arrow optimization for a specific
UDF, improving performance
Global Activation in a UDF
Activates Arrow optimization for all Python UDFs
in the Spark session
# An Arrow Python UDF
@udf(returnType='int', useArrow=True)
def arrow_slen(s):
return len(s)
Python
Python
@udf(returnType='int', useArrow=True)
def arrow_slen(s):
return len(s)
Python
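A minimal usage sketch (the example DataFrame below is illustrative, not from the slides):
# arrow_slen is the Arrow-optimized UDF defined above
df = spark.createDataFrame([("Alice",), ("Bob",)], ["name"])
df.select(arrow_slen("name").alias("name_len")).show()
Python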
Performance
sdf.select(
    udf(lambda v: v + 1, DoubleType(), useArrow=True)("v"),
    udf(lambda v: v - 1, DoubleType(), useArrow=True)("v"),
    udf(lambda v: v * v, DoubleType(), useArrow=True)("v"),
)
Python
Performance
Comparing Pickled and Arrow-optimized Python UDFs on type coercion (Link)
Pickled Python UDF
>>> df.select(udf(lambda x: x, 'string')('value').alias('date_in_string')).show()
+-----------------------------------------------------------------------+
| date_in_string |
+-----------------------------------------------------------------------+
|java.util.GregorianCalendar[time=?,areFieldsSet=false,areAllFieldsSet..|
|java.util.GregorianCalendar[time=?,areFieldsSet=false,areAllFieldsSet..|
+-----------------------------------------------------------------------+
Python
Arrow-optimized Python UDF
>>> df.select(udf(lambda x: x, 'string')('value').alias('date_in_string')).show()
+--------------+
|date_in_string|
+--------------+
| 1970-01-01|
| 1970-01-02|
+--------------+
Python
Python Data Sources
Why Python Data Sources?
• People like writing Python!
• pip install is so convenient.
• Simplified API without complicated
performance features in Data Source V2.
Spark ❤ Python
● SPIP: Python Data Source API (SPARK-44076)
● Available in Spark 4.0 preview version and Databricks Runtime 15.2
● Supports both batch and streaming, read and write
Python Data Source APIs
Python Data Source Overview
Three easy steps to create and use your custom data sources
Step 1:
Create a Data Source
class MySource(DataSource):
    ...
Step 2:
Register the Data Source
Register the data source in the current Spark session using the Python data source class:
spark.dataSource.register(MySource)
Step 3:
Read from or write to the data source
spark.read.format("my-source").load(...)
df.write.format("my-source").mode("append").save(...)
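For a fuller picture, here is a minimal batch-reader sketch assuming the pyspark.sql.datasource classes as documented for Spark 4.0; the source name, schema, and generated rows are illustrative:
from pyspark.sql.datasource import DataSource, DataSourceReader

class MySource(DataSource):
    @classmethod
    def name(cls):
        return "my-source"          # the name used in .format("my-source")

    def schema(self):
        return "id INT, value STRING"

    def reader(self, schema):
        return MyReader()

class MyReader(DataSourceReader):
    def read(self, partition):
        # Yield rows as tuples matching the schema above.
        for i in range(3):
            yield (i, f"row-{i}")

spark.dataSource.register(MySource)
spark.read.format("my-source").load().show()
Python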
Python UDTF
Python User Defined Table Functions
This is a new kind of function that returns an entire
table as output instead of a single scalar result value
● Once registered, they can appear in the FROM clause of a
SQL query
● Or use the DataFrame API to call them
from pyspark.sql.functions import udtf

@udtf(returnType="num: int, squared: int")
class SquareNumbers:
    def eval(self, start: int, end: int):
        for num in range(start, end + 1):
            yield (num, num * num)
Python
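To call it from SQL as on the following slides, the class can first be registered with the session; a minimal sketch (the registration name is chosen to match the class):
# Register the UDTF defined above so it can be used in SQL queries
spark.udtf.register("SquareNumbers", SquareNumbers)
spark.sql("SELECT * FROM SquareNumbers(start => 1, end => 3)").show()
Python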
Python User Defined Table Functions
SELECT *
FROM SquareNumbers(
start => 1,
end => 3);
+-----+--------+
| num | squared|
+-----+--------+
| 1 | 1 |
| 2 | 4 |
| 3 | 9 |
+-----+--------+
Python User Defined Table Functions
SquareNumbers(lit(1), lit(3)).show()
+-----+--------+
| num | squared|
+-----+--------+
| 1 | 1 |
| 2 | 4 |
| 3 | 9 |
+-----+--------+
Python User Defined Table Functions
class ReadFromConfigFile:
    @staticmethod
    def analyze(filename: AnalyzeArgument):
        with open(os.path.join(
                SparkFiles.getRootDirectory(),
                filename.value), "r") as f:
            # Compute the UDTF output schema
            # based on the contents of the file.
            return AnalyzeResult(
                from_file(f.read()))
    ...
Polymorphic Analysis
Compute the output schema for each call depending on arguments, using analyze
Python User Defined Table Functions
Polymorphic Analysis
Compute the output schema for each call depending on arguments, using analyze
ReadFromConfigFile(lit("config.txt")).show()
+------------+-------------+
| start_date | other_field |
+------------+-------------+
| 2024-04-02 | 1 |
+------------+-------------+
Python User Defined Table Functions
WITH t AS (SELECT id FROM RANGE(0, 100))
SELECT * FROM CountAndMax(
TABLE(t) PARTITION BY id / 10 ORDER BY id);
+-------+-----+
| count | max |
+-------+-----+
| 10 | 0 |
| 10 | 1 |
...
Input Table Partitioning
Split input rows among instances: eval runs once per row, then terminate runs last
Python User Defined Table Functions
class CountAndMax:
    def __init__(self):
        self._count = 0
        self._max = 0

    def eval(self, row: Row):
        self._count += 1
        self._max = max(self._max, row[0])

    def terminate(self):
        yield self._count, self._max
Input Table Partitioning
Split input rows among instances: eval runs once per row, then terminate runs last
Python User Defined Table Functions
Variable Keyword Arguments
The analyze and eval methods may accept *args or **kwargs
class VarArgs:
    @staticmethod
    def analyze(**kwargs: AnalyzeArgument):
        return AnalyzeResult(StructType(
            [StructField(key, arg.dataType)
             for key, arg in sorted(
                 kwargs.items())]))

    def eval(self, **kwargs):
        yield tuple(value for _, value
                    in sorted(kwargs.items()))
Python User Defined Table Functions
Variable Keyword Arguments
The analyze and eval methods may accept *args or **kwargs
SELECT * FROM VarArgs(a => 10, b => 'x');
+----+-----+
| a | b |
+----+-----+
| 10 | "x" |
+----+-----+
Python User Defined Table Functions
Custom Initialization
Create a subclass of AnalyzeResult and consume it in each subsequent __init__
class SplitWords:
    @dataclass
    class MyAnalyzeResult(AnalyzeResult):
        numWords: int
        numArticles: int

    def __init__(self, r: MyAnalyzeResult):
        ...
Python User Defined Table Functions
Custom Initialization
Create a subclass of AnalyzeResult and consume it in each subsequent __init__
@staticmethod
def analyze(text: str):
    words = text.split(" ")
    return MyAnalyzeResult(
        schema=StructType()
            .add("word", StringType())
            .add("total", IntegerType()),
        withSinglePartition=True,
        numWords=len(words),
        numArticles=sum(
            1 for word in words
            if word in ("a", "an", "the")))
Agenda
New Functionalities
Spark Connect, ANSI Mode, Variant Data Types,
Collation Support
Extensions
Custom Functions and Procedures
Arrow Optimized Python UDF Support,
Python Data Source APIs, Python UDTF
SQL Pipe Syntax, SQL UDF/UDTF, Stored Procedures
Usability
PySpark UDF Unified Profiling, Spark Native
Plotting
SQL Pipe Syntax
● Apache Spark supports DataFrames which allow expressing
logic in a sequence of operations from start to finish.
● Queries generally begin with a table reference and then chain
operations onto the end of the sequence one by one.
DataFrames and the People who Love Them
DataFrames and the People who Love Them
from pyspark.sql import functions as F

# Load the CSV file into a DataFrame.
df = spark.read.option("header", "true").csv("lineitem.csv")

# Filter the DataFrame based on a column.
filtered = df.filter(df['price'] < 100)

# Group the remaining rows, compute the total price * quantity per
# country, and show the result.
(filtered.groupBy("country")
    .agg(F.sum(filtered.price * filtered.quantity)
         .alias("total"))
    .show())
Introducing New SQL Pipe Syntax
FROM T
|> WHERE A < 100
|> SELECT B + C AS D;
New!
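Pipe syntax is just SQL, so it can also be submitted from PySpark; a minimal self-contained sketch, using the range table-valued function as a stand-in table:
spark.sql("""
  FROM range(10)
  |> WHERE id > 5
  |> SELECT id * 2 AS doubled
""").show()
Python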
TPC-H Query 13
SELECT c_count, COUNT(*) AS custdist
FROM
( SELECT c_custkey, COUNT(o_orderkey) c_count
FROM customer
LEFT OUTER JOIN orders ON c_custkey = o_custkey
AND o_comment NOT LIKE '%unusual%packages%'
GROUP BY c_custkey
) AS c_orders
GROUP BY c_count
ORDER BY custdist DESC, c_count DESC;
Introducing New SQL Pipe Syntax
FROM customer
|> LEFT OUTER JOIN orders ON c_custkey = o_custkey
AND o_comment NOT LIKE '%unusual%packages%'
|> AGGREGATE COUNT(o_orderkey) c_count
GROUP BY c_custkey
|> AGGREGATE COUNT(*) AS custdist
GROUP BY c_count
|> ORDER BY custdist DESC, c_count DESC;
New!
Fun With Subqueries
SELECT
  i_item_id,
  i_item_desc,
  i_category,
  i_class,
  i_manufact_id,
  i_brand_id,
  SUM(ss_sales_price) AS sales_price,
  SUM(ss_coupon_amt) AS coupon_amt
FROM
  (SELECT ss_item_sk, ss_sales_price, ss_coupon_amt
   FROM
     (SELECT ss_item_sk, ss_sales_price, ss_coupon_amt
      FROM store_sales
      WHERE ss_sold_date_sk BETWEEN 2450814 AND 2451199
     ) AS t1
   WHERE ss_sales_price > 100
  ) AS t2
JOIN item
  ON t2.ss_item_sk = i_item_sk
GROUP BY
  i_item_id,
  i_item_desc,
  i_category,
  i_class,
  i_manufact_id,
  i_brand_id
ORDER BY
  sales_price DESC;
FROM store_sales
|> WHERE ss_sold_date_sk BETWEEN 2450814 AND 2451199
|> SELECT ss_item_sk, ss_sales_price, ss_coupon_amt
|> WHERE ss_sales_price > 100
|> SELECT ss_item_sk, ss_sales_price, ss_coupon_amt
|> AS t2
|> JOIN item ON t2.ss_item_sk = i_item_sk
|> AGGREGATE
SUM(ss_sales_price) AS sales_price,
SUM(ss_coupon_amt) AS coupon_amt
GROUP BY
i_item_id,
i_item_desc,
i_category,
i_class,
i_manufact_id,
i_brand_id
|> ORDER BY sales_price DESC;
Fun With Subqueries
New!
Get Started With Backwards Compatibility
SELECT c_count, COUNT(*) AS custdist
FROM
(
-- We can update the table subquery to pipe syntax
-- while leaving the rest of the SQL alone:
FROM customer
|> LEFT OUTER JOIN orders ON c_custkey = o_custkey
AND o_comment NOT LIKE '%unusual%packages%'
|> AGGREGATE COUNT(o_orderkey) c_count
GROUP BY c_custkey
) AS c_orders
GROUP BY c_count
ORDER BY custdist DESC, c_count DESC;
New!
Get Started With Backwards Compatibility
-- If we had started with the inner table subquery of TPC-H Q13
-- from earlier:
SELECT c_custkey, COUNT(o_orderkey) c_count
FROM customer
LEFT OUTER JOIN orders ON c_custkey = o_custkey
AND o_comment NOT LIKE '%unusual%packages%'
GROUP BY c_custkey
-- We can just add more aggregation and sorting steps directly
-- to the end like this:
|> AGGREGATE COUNT(*) AS custdist
GROUP BY c_count
|> ORDER BY custdist DESC, c_count DESC;
New!
SQL UDF / UDTF
• SQL User Defined Scalar Functions
‐ Persisted SQL Expressions
• SQL User Defined Table Functions
‐ Persisted Parameterized Views
• Support named parameter invocation
and defaulting
• Table functions with lateral correlation
Easily extend SQL
function library
• Encapsulate (complex) expressions,
including subqueries
• May contain subqueries
• Return a scalar value
• Can be used in most places where built-in
functions go
SQL User Defined
Scalar Functions
SQL User Defined Scalar Functions
Persists complex expression patterns
> CREATE FUNCTION roll_dice(
    num_dice INT DEFAULT 1 COMMENT 'number of dice to roll (Default: 1)',
    num_sides INT DEFAULT 6 COMMENT 'number of sides per die (Default: 6)'
  ) COMMENT 'Roll a number of n-sided dice'
  RETURN aggregate(
    sequence(1, roll_dice.num_dice, 1),
    0,
    (acc, x) -> acc + (rand() * roll_dice.num_sides) :: INT,
    acc -> acc + roll_dice.num_dice
  );
> SELECT roll_dice();
3
-- Roll 3 6-sided dice
> SELECT roll_dice(3);
15
-- Roll 3 10-sided dice
> SELECT roll_dice(3, 10);
21
SQL
• Encapsulate (complex) correlated
subqueries aka a parameterized view
• Can be used in the FROM clause
SQL User Defined
Table Functions
SQL User Defined Table Functions
Persist complex parameterized queries
CREATE FUNCTION weekdays(start DATE, end DATE)
RETURNS TABLE (day_of_week STRING, day DATE)
RETURN SELECT
    to_char(day, 'E'),
    day
  FROM
    (
      SELECT sequence(weekdays.start, weekdays.end)
    ) AS t(days),
    LATERAL (explode(days)) AS dates(day)
  WHERE
    extract(DAYOFWEEK_ISO FROM day) BETWEEN 1 AND 5;
SQL
SQL User Defined Table Functions
Persist complex parameterized queries
> SELECT
    day_of_week,
    day
  FROM
    weekdays(DATE '2024-01-01',
             DATE '2024-01-14');
Mon 2024-01-01
…
Fri 2024-01-05
Mon 2024-01-08
SQL
> -- Return weekdays for date ranges originating from a LATERAL correlation
> SELECT
    weekdays.*
  FROM
    VALUES
      (DATE '2020-01-01'),
      (DATE '2021-01-01') AS starts(start),
    LATERAL weekdays(start, start + INTERVAL '7' DAYS);
Wed 2020-01-01
Thu 2020-01-02
Fri 2020-01-03
…
SQL
Named parameter invocation
Self-documenting and safer SQL UDF invocation
> DESCRIBE FUNCTION roll_dice;
Function: default.roll_dice
Type: SCALAR
Input: num_dice INT
num_sides INT
Returns: INT
> -- Roll 1 10-sided die - skip the dice count
> SELECT roll_dice(num_sides => 10);
7
> -- Roll 3 10-sided dice - reversed argument order
> SELECT roll_dice(num_sides => 10, num_dice => 3);
21
SQL
Stored Procedures
External Stored Procedures
External Iceberg Stored Procedures
SQL Scripting
Support for control flow, iterators & error handling natively in SQL
• Control flow → IF/ELSE, CASE
• Looping → WHILE, REPEAT, ITERATE
• Resultset iterator → FOR
• Exception handling → CONTINUE/EXIT
• Parameterized queries → EXECUTE IMMEDIATE
Following the SQL/PSM standard
It’s SQL, but with control flow!
SQL Scripting
BEGIN
  DECLARE c INT = 10;
  WHILE c > 0 DO
    INSERT INTO t VALUES (c);
    SET VAR c = c - 1;
  END WHILE;
END
SQL
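A scripting block is submitted as a single statement; a minimal sketch of running the script above from PySpark (assumes the table t exists, and that scripting may need to be switched on in the 4.0 preview):
# SQL scripting may sit behind a flag in the 4.0 preview (assumption):
spark.conf.set("spark.sql.scripting.enabled", "true")

spark.sql("CREATE TABLE IF NOT EXISTS t (c INT)")
spark.sql("""
BEGIN
  DECLARE c INT = 10;
  WHILE c > 0 DO
    INSERT INTO t VALUES (c);
    SET VAR c = c - 1;
  END WHILE;
END
""")
Python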
SQL Scripting
Example: rename a column across every table that contains it
-- parameters
DECLARE oldColName = 'ColoUr';
DECLARE newColName = 'color';
BEGIN
DECLARE tableArray ARRAY<STRING>;
DECLARE tableType STRING;
DECLARE i INT = 0;
DECLARE alterQuery STRING;
SET
tableArray = (
SELECT
array_agg(table_name)
FROM
INFORMATION_SCHEMA.columns
WHERE
column_name
COLLATE UNICODE_CI = oldColName
);
SQL
WHILE i < array_size(tableArray) DO
SET
tableType = (
SELECT
table_type
FROM
INFORMATION_SCHEMA.tables
WHERE
table_name = tableArray [i]
);
IF tableType != 'VIEW' COLLATE UNICODE_CI THEN
SET
alterQuery = 'ALTER TABLE ' || tableArray [i] ||
' RENAME COLUMN ' || oldColName || ' TO ' ||
newColName;
EXECUTE IMMEDIATE alterQuery;
END IF;
SET i = i + 1;
END WHILE;
END;
SQL
Agenda
New Functionalities
Spark Connect, ANSI Mode, Variant Data Types,
Collation Support
Extensions
Custom Functions and Procedures
Arrow Optimized Python UDF Support,
Python Data Source APIs, Python UDTF
SQL Pipe Syntax, SQL UDF/UDTF, Stored Procedures
Usability
PySpark UDF Unified Profiling, Spark Native
Plotting
PySpark UDF Unified Profiling
Overview of Unified Profiling
• Key Components: Performance and memory profiling
• Benefits: Tracks function calls, execution time, memory usage
• Replacement for Legacy Profiling
• Drawbacks of Legacy Profiling
• Advantages of New Unified Profiling
• Session-based, works with Spark Connect, runtime toggling
Overview of Unified Profiling
• How to Enable:
• Performance Profiler: spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")
• Memory Profiler: spark.conf.set("spark.sql.pyspark.udf.profiler", "memory")
• API Features: "show", "dump", and "clear" commands
• Show results:
• Performance: spark.profile.show(type="perf")
• Memory: spark.profile.show(type="memory")
• Dump results: spark.profile.dump("/your_path/...")
• Clear results: spark.profile.clear()
PySpark
Performance
Profiler
from pyspark.sql.functions import pandas_udf

df = spark.range(10)

@pandas_udf("long")
def add1(x):
    return x + 1

added = df.select(add1("id"))
spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")
added.show()
Python
PySpark
Performance
Profiler
PySpark
Memory
Profiler
from pyspark.sql.functions import pandas_udf

df = spark.range(10)

@pandas_udf("long")
def add1(x):
    return x + 1

added = df.select(add1("id"))
spark.conf.set("spark.sql.pyspark.udf.profiler", "memory")
added.show()
Python
PySpark
Memory
Profiler
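Once the profiled UDF has run, the collected results can be inspected with the session-level API from the overview slide; a minimal sketch (the dump path is a placeholder):
spark.profile.show(type="memory")        # per-UDF memory profile
spark.profile.show(type="perf")          # per-UDF performance profile
spark.profile.dump("/tmp/udf_profiles")  # write results to a directory
spark.profile.clear()                    # discard collected profiles
Python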
Spark Native Plotting
Since Spark 4.0, PySpark supports native plotting.
How does it work?
Plotly is used as the default visualization backend.
Supported Plot Types
Type: Description
Area plot: Shows data as filled areas under lines. It's great for showing trends over time or how values add up cumulatively across categories.
Bar plot: Uses bars to represent data. The height (or length) of each bar shows the size of the value.
Horizontal bar plot: Similar to a bar plot, but the bars go sideways, which is useful for longer category names or side-by-side comparisons.
Box plot: Summarizes data by showing its range (minimum to maximum), interquartile range (IQR), the middle value (median), and any unusual data points (outliers).
Density/KDE plot: Shows how data is spread out, using a smooth curve to estimate the data's distribution.
Histogram: Groups data into ranges (bins) and shows how many values fall into each range. It helps visualize the distribution and frequency of the data.
Line plot: Connects data points with lines, making it easy to see trends or changes over time.
Pie chart: Displays data as slices of a circle, representing proportions or percentages of a whole.
Scatter plot: Shows the relationship between two variables by plotting them as points on a graph.
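A minimal end-to-end sketch with one of the supported plot types (illustrative data; assumes Plotly is installed, since it is the default backend):
df = spark.createDataFrame(
    [(1, 2.0), (2, 3.5), (3, 2.8), (4, 4.1)], ["month", "revenue"])
fig = df.plot.line(x="month", y="revenue")  # returns a Plotly figure
fig.show()
Python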
Example: Scatter plot
+---------------+-----+------+------------+
| Category|Sales|Profit|ProfitMargin|
+---------------+-----+------+------------+
| Electronics|20000| 5000| 25.0|
| Furniture|15000| 2000| 13.33|
|Office Supplies|12000| 3000| 25.0|
| Clothing| 8000| 1000| 12.5|
| Food|10000| 2500| 25.0|
| Books| 6000| 1500| 25.0|
| Sports| 7000| 2000| 28.57|
| Toys| 5000| 500| 10.0|
+---------------+-----+------+------------+
fig = df.plot.scatter(
x="Sales",
y="Profit",
size="ProfitMargin",
text="Category",
hover_name="Category",
color="ProfitMargin", # Add color gradient based on ProfitMargin
color_continuous_scale="Plasma", # Use a visually appealing color scale
)
Example: Pie plot
+--------------------+--------+
| country| pop|
+--------------------+--------+
| Albania| 3600523|
| Austria| 8199783|
| Belgium|10392226|
|Bosnia and Herzeg...| 4552198|
| Bulgaria| 7322858|
| Croatia| 4493312|
| Czech Republic|10228744|
| Denmark| 5468120|
| Finland| 5238460|
| France|61083916|
| Germany|82400996|
| Greece|10706290|
| Hungary| 9956108|
| Iceland| 301931|
| Ireland| 4109086|
| Italy|58147733|
| Montenegro| 684736|
| Netherlands|16570613|
| Norway| 4627926|
sdf.plot.pie(x="country", y="pop")
Python: The Number One Choice
Birth of PySpark: 2013
>330 million downloads in 2024
[Chart: PyPI downloads of PySpark by year]
210 countries & regions, PyPI downloads of PySpark in the last 12 months
Join the community today!
spark.apache.org
github.com/apache/spark
apache-spark.slack.com
spark.apache.org/community/
  • 1. Exciting New Features of Apache Spark 4.0 Daniel Tenedorio dtenedor
  • 2. Agenda New Functionalities Spark Connect, ANSI Mode, Variant Data Types, Collation Support Extensions Custom Functions and Procedures Arrow Optimized Python UDF Support, Python Data Source APIs, Python UDTF SQL Pipe Syntax, SQL UDF/UDTF, Stored Procedures Usability PySpark UDF Unified Profiling, Spark Native Plotting
  • 4. Spark’s Monolith Driver How to embed Spark in applications? Up until Spark 3.4: Hard to support today’s developer experience requirements Applications IDEs / Notebooks Programming Languages / SDKs No JVM InterOp Close to REPL SQL only Modern data application Application Logic Analyzer Optimizer Scheduler Distributed Execution Engine
  • 5. Spark Connect Client API Connect to Spark from Any App Thin client, with full power of Apache Spark Applications IDEs / Notebooks Programming Languages / SDKs Modern data application Spark’s Driver Application Gateway Analyzer Optimizer Scheduler Distributed Execution Engine
  • 6. Spark Connect GA in Apache Spark 4.0 pip install pyspark>=3.4.0 in your favorite IDE! Interactively develop & debug from your IDE Check out Databricks Connect, use & contribute the Go client New Connectors and SDKs in any language! Build interactive Data Applications Get started with our github example! Databricks Connect Scala3
  • 7. The lightweight Spark Connect Package pip install pyspark-connect • Pure Python library, no JVM. • Pure Spark Connect client, not entire PySpark • Only 1.5MB (PySpark 355 MB) • Preferred if your application has fully migrated to the Spark Connect API.
  • 8. ANSI MODE ON by default in 4.0
  • 9. Migration to ANSI ON Action: Turn On ANSI Mode to fix your data corruptions!
  • 11. Error callsite is captured With ANSI mode (3.5)
  • 12. With ANSI mode (4.0) Error callsite is highlighted
  • 13. With ANSI mode (4.0) Culprit operation Line number DataFrame queries with error callsite
  • 14. DataFrame queries with error callsite • PySpark support is on the way. • Spark Connect support is on the way. • Native notebook integration is on the way (so that you can see highlight).
  • 15. Variant Data Type for Semi-Structured Data
  • 16. Data Engineer’s Dilemma: Pick 2 out of 3… Flexible Fast Open What if you could ingest JSON, maintain flexibility, boost performance, and use an open standard? Motivation
  • 17. Variant is flexible SQL INSERT INTO variant_tbl (event_data) VALUES ( PARSE_JSON( '{"level": "warning", "message": "invalid request", "user_agent": "Mozilla/5.0 ..."}' ) ); SELECT * FROM variant_tbl WHERE event_data:user_agent ilike '%Mozilla%';
  • 20. ANSI SQL COLLATE ● Associate columns, fields, array elements with a collation of choice ○ Case insensitive ○ Accent insensitive ○ Locale aware ● Supported by many string functions such as ○ lower()/upper() ○ substr() ○ locate() ○ like ● GROUP BY, ORDER BY, comparisons, … ● Supported by Delta and Photon Sorting and comparing strings according to locale
  • 21. A look at the default collation A < Z < a < z < Ā Is this really what we want here? > SELECT name FROM names ORDER BY name; name Anthony Bertha anthony bertha Ānthōnī SQL
  • 22. COLLATE unicode One size, fits most > SELECT name FROM names ORDER BY name COLLATE unicode; name Ānthōnī anthony Anthony bertha Bertha SQL Root collation with decent sort order for most locales
  • 23. COLLATE unicode_ci Case insensitive comparisons have entered the chat > SELECT name FROM names WHERE startswith(name COLLATE unicode_ci, 'a') ORDER BY name COLLATE unicode_ci; name anthony Anthony SQL Case insensitive is not accent insensitive: We lost Ānthōnī
  • 24. COLLATE unicode_ci_ai Equality from a to Ź > SELECT name FROM names WHERE startswith(name COLLATE unicode_ci_ai, 'a') ORDER BY name COLLATE unicode_ci_ai; name Ānthōnī anthony Anthony SQL 100s of supported predefined collations across many locales
  • 25. Agenda New Functionalities Spark Connect, ANSI Mode, Variant Data Types, Collation Support Extensions Custom Functions and Procedures Arrow Optimized Python UDF Support, Python Data Source APIs, Python UDTF SQL Pipe Syntax, SQL UDF/UDTF, Stored Procedures Usability PySpark UDF Unified Profiling, Spark Native Plotting
  • 27. Enhancing Python UDFs with Apache Arrow Introduction to Arrow and Its Role in UDF Optimization: ● Utilizes Apache Arrow ● Supported since Spark 3.5 Key Benefits ● Enhances data serialization and deserialization speed ● Provides standardized type coercion
  • 28. spark.conf.set("spark.sql.execution.pythonUDF .arrow.enabled", True) # An Arrow Python UDF @udf(returnType='int') def arrow_slen(s): return len(s) Enabling Arrow Optimization Local Activation in a UDF Activates Arrow optimization for a specific UDF, improving performance Global Activation in a UDF Activates Arrow optimization for all Python UDFs in the Spark session # An Arrow Python UDF @udf(returnType='int', useArrow=True) def arrow_slen(s): return len(s) Python Python
  • 30. sdf.select( udf(lambda v: v + 1, DoubleType(), useArrow=True)("v"), udf(lambda v: v - 1, DoubleType(), useArrow=True)("v"), udf(lambda v: v * v, DoubleType(), useArrow=True)("v") ) Python Performance
  • 31. >>> df.select(udf(lambda x: x, 'string')('value').alias('date_in_string')).show() +-----------------------------------------------------------------------+ | date_in_string | +-----------------------------------------------------------------------+ |java.util.GregorianCalendar[time=?,areFieldsSet=false,areAllFieldsSet..| |java.util.GregorianCalendar[time=?,areFieldsSet=false,areAllFieldsSet..| +-----------------------------------------------------------------------+ Python Pickled Python UDF >>> df.select(udf(lambda x: x, 'string')('value').alias('date_in_string')).show() +--------------+ |date_in_string| +--------------+ | 1970-01-01| | 1970-01-02| +--------------+ Python Comparing Pickled and Arrow-optimized Python UDFs on type coercion Link Arrow-optimized Python UDF
  • 33. Why Python Data Sources? • People like writing Python! • pip install is so convenient. • Simplified API without complicated performance features in Data Source V2. Spark ❤ Python
  • 34. ● SPIP: Python Data Source API (SPARK-44076) ● Available in Spark 4.0 preview version and Databricks Runtime 15.2 ● Support both batch and streaming, read and write Python Data Source APIs
  • 35. Register the data source in the current Spark session using the Python data source class: spark .dataSource .register(MySource ) spark.read .format ("my-source") .load(...) df.write .format("my-source") .mode("append") .save(...) Step 1: Create a Data Source Step 2: Register the Data Source Step 3: Read from or write to the data source Python Data Source Overview Easy Three Steps to create and use your custom data sources class MySource(DataSource): ...
  • 37. Python User Defined Table Functions This is a new kind of function that returns an entire table as output instead of a single scalar result value ● Once registered, they can appear in the FROM clause of a SQL query ● Or use the DataFrame API to call them from pyspark.sql.functions import udtf @udtf(returnType="num: int, squared: int") class SquareNumbers: def eval(self, start: int, end: int): for num in range(start, end + 1): yield (num, num * num) Python
  • 38. ©2024 Databricks Inc. — All rights reserved Python User Defined Table Functions from pyspark.sql.functions import udtf @udtf(returnType="num: int, squared: int") class SquareNumbers: def eval(self, start: int, end: int): for num in range(start, end + 1): yield (num, num * num)
  • 39. ©2024 Databricks Inc. — All rights reserved Python User Defined Table Functions SELECT * FROM SquareNumbers( start => 1, end => 3); +-----+--------+ | num | squared| +-----+--------+ | 1 | 1 | | 2 | 4 | | 3 | 9 | +-----+--------+
  • 40. ©2024 Databricks Inc. — All rights reserved Python User Defined Table Functions SquareNumbers( lit(1), lit(3)) .show() +-----+--------+ | num | squared| +-----+--------+ | 1 | 1 | | 2 | 4 | | 3 | 9 | +-----+--------+
  • 41. ©2024 Databricks Inc. — All rights reserved Python User Defined Table Functions class ReadFromConfigFile: @staticmethod def analyze(filename: AnalyzeArgument): with open(os.path.join( SparkFiles.getRootDirectory(), filename.value), ”r”) as f: # Compute the UDTF output schema # based on the contents of the file. return AnalyzeResult( from_file(f.read())) ... Polymorphic Analysis Compute the output schema for each call depending on arguments, using analyze
  • 42. ©2024 Databricks Inc. — All rights reserved Python User Defined Table Functions Polymorphic Analysis Compute the output schema for each call depending on arguments, using analyze ReadFromConfigFile(lit(“config.txt”)).show() +------------+-------------+ | start_date | other_field | +------------+-------------+ | 2024-04-02 | 1 | +------------+-------------+
  • 43. ©2024 Databricks Inc. — All rights reserved Python User Defined Table Functions WITH t AS (SELECT id FROM RANGE(0, 100)) SELECT * FROM CountAndMax( TABLE(t) PARTITION BY id / 10 ORDER BY id); +-------+-----+ | count | max | +-------+-----+ | 10 | 0 | | 10 | 1 | ... Input Table Partitioning Split input rows among instances: eval runs once per row, then terminate runs last
  • 44. ©2024 Databricks Inc. — All rights reserved Python User Defined Table Functions class CountAndMax: def __init__(self): self._count = 0 self._max = 0 def eval(self, row: Row): self._count += 1 self._max = max(self._max, row[0]) def terminate(self): yield self._count, self._max Input Table Partitioning Split input rows among instances: eval runs once per row, then terminate runs last
  • 45. ©2024 Databricks Inc. — All rights reserved Python User Defined Table Functions Variable Keyword Arguments The analyze and eval methods may accept *args or **kwargs class VarArgs: @staticmethod def analyze(**kwargs: AnalyzeArgument): return AnalyzeResult(StructType( [StructField(key, arg.dataType) for key, arg in sorted( kwargs.items())])) def eval(self, **kwargs): yield tuple(value for _, value in sorted(kwargs.items()))
  • 46. ©2024 Databricks Inc. — All rights reserved Python User Defined Table Functions Variable Keyword Arguments The analyze and eval methods may accept *args or **kwargs SELECT * FROM VarArgs(a => 10, b => ‘x’); +----+-----+ | a | b | +----+-----+ | 10 | “x” | +----+-----+
  • 47. ©2024 Databricks Inc. — All rights reserved Python User Defined Table Functions Custom Initialization Create a subclass of AnalyzeResult and consume it in each subsequent __init__ class SplitWords: @dataclass class MyAnalyzeResult(AnalyzeResult): numWords: int numArticles: int def __init__(self, r: MyAnalyzeResult): ...
  • 48. ©2024 Databricks Inc. — All rights reserved Python User Defined Table Functions Custom Initialization Create a subclass of AnalyzeResult and consume it in each subsequent __init__ @staticmethod def analyze(text: str): words = text.split(” ”) return MyAnalyzeResult( schema=StructType() .add(”word”, StringType()) .add(”total”, IntegerType()), withSinglePartition=true, numWords=len(words) numArticles=len(( word for word in words if word in (”a”, ”an”, ”the”)))
  • 49. Agenda New Functionalities Spark Connect, ANSI Mode, Variant Data Types, Collation Support Extensions Custom Functions and Procedures Arrow Optimized Python UDF Support, Python Data Source APIs, Python UDTF SQL Pipe Syntax, SQL UDF/UDTF, Stored Procedures Usability PySpark UDF Unified Profiling, Spark Native Plotting
  • 51. ©2024 Databricks Inc. — All rights reserved ● Apache Spark supports DataFrames which allow expressing logic in a sequence of operations from start to finish. ● Queries generally begin with a table reference and then chain operations onto the end sequence one-by-one. DataFrames and the People who Love Them
  • 52. ©2024 Databricks Inc. — All rights reserved DataFrames and the People who Love Them from pyspark.sql import functions as F # Load the CSV file into a DataFrame. df = spark.read.option("header", "true").csv("lineitem.csv") # Filter the DataFrame based on a column. filtered = df.filter(df['price'] < 100) # Group the remaining rows and compute the product, and show the result. filtered.groupBy("country") .agg(F.sum(lineitem_filtered.price * lineitem_filtered.quantity) .alias("total")) .show()
  • 53. ©2024 Databricks Inc. — All rights reserved Introducing New SQL Pipe Syntax FROM T |> WHERE A < 100 |> SELECT B + C AS D; New!
  • 54. ©2024 Databricks Inc. — All rights reserved TPC-H Query 13 SELECT c_count, COUNT(*) AS custdist FROM ( SELECT c_custkey, COUNT(o_orderkey) c_count FROM customer LEFT OUTER JOIN orders ON c_custkey = o_custkey AND o_comment NOT LIKE '%unusual%packages%' GROUP BY c_custkey ) AS c_orders GROUP BY c_count ORDER BY custdist DESC, c_count DESC;
  • 55. ©2024 Databricks Inc. — All rights reserved Introducing New SQL Pipe Syntax FROM customer |> LEFT OUTER JOIN orders ON c_custkey = o_custkey AND o_comment NOT LIKE '%unusual%packages%' |> AGGREGATE COUNT(o_orderkey) c_count GROUP BY c_custkey |> AGGREGATE COUNT(*) AS custdist GROUP BY c_count |> ORDER BY custdist DESC, c_count DESC; New!
  • 56. ©2024 Databricks Inc. — All rights reserved SELECT i_item_id, i_item_desc, i_category, i_class, i_manufact_id, i_brand_id, SUM(ss_sales_price) AS sales_price, SUM(ss_coupon_amt) AS coupon_amt FROM (SELECT ss_item_sk, ss_sales_price, ss_coupon_amt FROM (SELECT ss_item_sk, ss_sales_price, ss_coupon_amt FROM store_sales WHERE ss_sold_date_sk BETWEEN 2450814 AND 2451199 ) AS t1 WHERE ss_sales_price > 100 ) AS t2 <continued> Fun With Subqueries JOIN item ON t2.ss_item_sk = i_item_sk GROUP BY i_item_id, i_item_desc, i_category, i_class, i_manufact_id, i_brand_id ORDER BY sales_price DESC;
  • 57. ©2024 Databricks Inc. — All rights reserved FROM store_sales |> WHERE ss_sold_date_sk BETWEEN 2450814 AND 2451199 |> SELECT ss_item_sk, ss_sales_price, ss_coupon_amt |> WHERE ss_sales_price > 100 |> SELECT ss_item_sk, ss_sales_price, ss_coupon_amt |> AS t2 |> JOIN item ON t2.ss_item_sk = i_item_sk |> AGGREGATE SUM(ss_sales_price) AS sales_price, SUM(ss_coupon_amt) AS coupon_amt GROUP BY i_item_id, i_item_desc, i_category, i_class, i_manufact_id, i_brand_id |> ORDER BY sales_price DESC; Fun With Subqueries New!
  • 58. ©2024 Databricks Inc. — All rights reserved Get Started With Backwards Compatibility SELECT c_count, COUNT(*) AS custdist FROM ( -- We can update the table subquery to pipe syntax -- while leaving the rest of the SQL alone: FROM customer |> LEFT OUTER JOIN orders ON c_custkey = o_custkey AND o_comment NOT LIKE '%unusual%packages%' |> AGGREGATE COUNT(o_orderkey) c_count GROUP BY c_custkey ) AS c_orders GROUP BY c_count ORDER BY custdist DESC, c_count DESC; New!
  • 59. ©2024 Databricks Inc. — All rights reserved Get Started With Backwards Compatibility -- If we had started with the inner table subquery of TPC-H Q13 -- from earlier: SELECT c_custkey, COUNT(o_orderkey) c_count FROM customer LEFT OUTER JOIN orders ON c_custkey = o_custkey AND o_comment NOT LIKE '%unusual%packages%' GROUP BY c_custkey -- We can just add more aggregation and sorting steps directly -- to the end like this: |> AGGREGATE COUNT(*) AS custdist GROUP BY c_count |> ORDER BY custdist DESC, c_count DESC; New!
  • 60. SQL UDF / UDTF
  • 61. • SQL User Defined Scalar Functions ‐ Persisted SQL Expressions • SQL User Defined Table Functions ‐ Persisted Parameterized Views • Support named parameter invocation and defaulting • Table functions with lateral correlation Easily extend SQL function library
  • 62. • Encapsulate (complex) expressions, including subqueries • May contain subqueries • Return a scalar value • Can be used in most places where builtin functions go SQL User Defined Scalar Functions
  • 63. SQL User Defined Scalar Functions Persists complex expression patterns > CREATE FUNCTION roll_dice( num_dice INT DEFAULT 1 COMMENT 'number of dice to roll (Default: 1)’, num_sides INT DEFAULT 6 COMMENT 'number of sides per die (Default: 6)' ) COMMENT 'Roll a number of n-sided dice’ RETURN aggregate( sequence(1, roll_dice.num_dice, 1), 0, (acc, x) -> (rand() * roll_dice.num_sides) :: INT, acc -> acc + roll_dice.num_dice ); > SELECT roll_dice(); 3 -- Roll 3 6-sided dice > SELECT roll_dice(3); 15 -- Roll 3 10-sided dice > SELECT roll_dice(3, 10) 21 SQL
• 64. SQL User Defined Table Functions
• Encapsulate (complex) correlated subqueries, aka a parameterized view
• Can be used in the FROM clause
• 65. SQL User Defined Table Functions
Persist complex parameterized queries
CREATE FUNCTION weekdays(start DATE, end DATE)
  RETURNS TABLE (day_of_week STRING, day DATE)
  RETURN SELECT to_char(day, 'E'), day
    FROM (SELECT sequence(weekdays.start, weekdays.end)) AS t(days),
         LATERAL (explode(days)) AS dates(day)
    WHERE extract(DAYOFWEEK_ISO FROM day) BETWEEN 1 AND 5;
SQL
• 66. SQL User Defined Table Functions
Persist complex parameterized queries
> SELECT day_of_week, day
  FROM weekdays(DATE '2024-01-01', DATE '2024-01-14');
 Mon 2024-01-01
 …
 Fri 2024-01-05
 Mon 2024-01-08
SQL
> -- Return weekdays for date ranges originating from a LATERAL correlation
> SELECT weekdays.*
  FROM VALUES (DATE '2020-01-01'), (DATE '2021-01-01') AS starts(start),
       LATERAL weekdays(start, start + INTERVAL '7' DAYS);
 Wed 2020-01-01
 Thu 2020-01-02
 Fri 2020-01-03
 …
SQL
• 67. Named parameter invocation
Self-documenting and safer SQL UDF invocation
> DESCRIBE FUNCTION roll_dice;
 Function: default.roll_dice
 Type:     SCALAR
 Input:    num_dice INT
           num_sides INT
 Returns:  INT
> -- Roll one 10-sided die - skip the dice count
> SELECT roll_dice(num_sides => 10);
 7
> -- Roll 3 10-sided dice - reversed argument order
> SELECT roll_dice(num_sides => 10, num_dice => 3);
 21
SQL
• 71. SQL Scripting
Support for control flow, iterators & error handling natively in SQL
• Control flow → IF/ELSE, CASE
• Looping → WHILE, REPEAT, ITERATE
• Result-set iterator → FOR
• Exception handling → CONTINUE/EXIT
• Parameterized queries → EXECUTE IMMEDIATE
Following the SQL/PSM standard. It’s SQL, but with control flow!
• 72. SQL Scripting
BEGIN
  DECLARE c INT = 10;
  WHILE c > 0 DO
    INSERT INTO t VALUES (c);
    SET VAR c = c - 1;
  END WHILE;
END
SQL
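The WHILE loop above is just one building block; the same compound statement can mix IF/ELSE control flow with the parameterized EXECUTE IMMEDIATE called out on slide 71. Below is a minimal sketch, not from the deck, submitted through spark.sql(); it assumes an existing table t(value INT), and the conf name used to opt in to scripting is an assumption about the 4.0 session setup.
# Hedged sketch: IF/ELSE plus a parameterized EXECUTE IMMEDIATE inside one
# SQL script. Table `t(value INT)` and the conf name are assumptions.
spark.conf.set("spark.sql.scripting.enabled", "true")  # scripting is opt-in in 4.0

spark.sql("""
BEGIN
  DECLARE threshold INT DEFAULT 100;
  DECLARE big_rows BIGINT;

  -- Bind the ? marker from a script variable via USING
  EXECUTE IMMEDIATE
    'SELECT COUNT(*) FROM t WHERE value > ?'
    INTO big_rows
    USING threshold;

  -- Only insert a sentinel row if nothing exceeds the threshold yet
  IF big_rows = 0 THEN
    INSERT INTO t VALUES (threshold);
  END IF;
END
""")
Python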
• 73. SQL Scripting
Example: rename a column across all tables
-- parameters
DECLARE oldColName = 'ColoUr';
DECLARE newColName = 'color';
BEGIN
  DECLARE tableArray ARRAY<STRING>;
  DECLARE tableType STRING;
  DECLARE i INT = 0;
  DECLARE alterQuery STRING;
  SET tableArray = (
    SELECT array_agg(table_name)
    FROM INFORMATION_SCHEMA.columns
    WHERE column_name COLLATE UNICODE_CI = oldColName);
  WHILE i < array_size(tableArray) DO
    SET tableType = (
      SELECT table_type
      FROM INFORMATION_SCHEMA.tables
      WHERE table_name = tableArray[i]);
    IF tableType != 'VIEW' COLLATE UNICODE_CI THEN
      SET alterQuery = 'ALTER TABLE ' || tableArray[i] ||
                       ' RENAME COLUMN ' || oldColName || ' TO ' || newColName;
      EXECUTE IMMEDIATE alterQuery;
    END IF;
    SET i = i + 1;
  END WHILE;
END;
SQL
• 74. Agenda
New Functionalities: Spark Connect, ANSI Mode, Variant Data Types, Collation Support
Extensions: Custom Functions and Procedures
  Arrow Optimized Python UDF Support, Python Data Source APIs, Python UDTF
  SQL Pipe Syntax, SQL UDF/UDTF, Stored Procedures
Usability: PySpark UDF Unified Profiling, Spark Native Plotting
  • 75. PySpark UDF Unified Profiling
• 76. Overview of Unified Profiling
• Key components: performance and memory profiling
• Benefits: tracks function calls, execution time, and memory usage
• Replacement for legacy profiling
• Drawbacks of legacy profiling
• Advantages of the new unified profiling:
  • Session-based, works with Spark Connect, runtime toggling
• 77. Overview of Unified Profiling
• How to enable:
  • Performance profiler: spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")
  • Memory profiler: spark.conf.set("spark.sql.pyspark.udf.profiler", "memory")
• API features: "show", "dump", and "clear" commands
  • Show results:
    • Performance: spark.profile.show(type="perf")
    • Memory: spark.profile.show(type="memory")
  • Dump results: spark.profile.dump("/your_path/...")
  • Clear results: spark.profile.clear()
• 78. PySpark Performance Profiler
from pyspark.sql.functions import pandas_udf

df = spark.range(10)

@pandas_udf("long")
def add1(x):
    return x + 1

added = df.select(add1("id"))

spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")
added.show()
Python
• 80. PySpark Memory Profiler
from pyspark.sql.functions import pandas_udf

df = spark.range(10)

@pandas_udf("long")
def add1(x):
    return x + 1

added = df.select(add1("id"))

spark.conf.set("spark.sql.pyspark.udf.profiler", "memory")
added.show()
Python
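Once the profiled action has run, the commands listed on slide 77 retrieve or reset the collected results. A minimal sketch continuing the example above (the dump path is only illustrative):
# Inspect the collected memory-profile results for the UDFs run above
spark.profile.show(type="memory")

# Optionally persist the raw results to a directory (example path)
spark.profile.dump("/tmp/pyspark_udf_profiles")

# Reset the accumulated profiling results
spark.profile.clear()
Python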
• 82. Spark Native Plotting
Since Spark 4.0, PySpark supports native plotting.
• 83. How does it work?
Plotly is used as the default visualization backend.
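A minimal sketch, not from the deck, of the basic workflow: build a DataFrame, call the plot accessor, and render the returned Plotly figure. The column names are made up, and the x/y keyword pattern is assumed to match the scatter and pie examples shown later in the deck.
# Hedged sketch of PySpark 4.0 native plotting (illustrative column names).
df = spark.createDataFrame(
    [("2024-01", 120), ("2024-02", 135), ("2024-03", 128)],
    ["month", "orders"],
)

# With Plotly as the backend, the plot accessor returns a Plotly figure
fig = df.plot.line(x="month", y="orders")
fig.show()
Python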
• 84. Supported Plot Types
Area plot: Shows data as filled areas under lines. It’s great for showing trends over time or how values add up cumulatively across categories.
Bar plot: Uses bars to represent data. The height (or length) of each bar shows the size of the value.
Horizontal bar plot: Similar to a bar plot, but the bars go sideways, which is useful for longer category names or side-by-side comparisons.
Box plot: Summarizes data by showing its range (minimum to maximum), interquartile range (IQR), the middle value (median), and any unusual data points (outliers).
Density/KDE plot: Shows how data is spread out, using a smooth curve to estimate the data’s distribution.
Histogram: Groups data into ranges (bins) and shows how many values fall into each range. It helps visualize the distribution and frequency of the data.
Line plot: Connects data points with lines, making it easy to see trends or changes over time.
Pie chart: Displays data as slices of a circle, representing proportions or percentages of a whole.
Scatter plot: Shows the relationship between two variables by plotting them as points on a graph.
• 85. Example: Scatter plot
+---------------+-----+------+------------+
|       Category|Sales|Profit|ProfitMargin|
+---------------+-----+------+------------+
|    Electronics|20000|  5000|        25.0|
|      Furniture|15000|  2000|       13.33|
|Office Supplies|12000|  3000|        25.0|
|       Clothing| 8000|  1000|        12.5|
|           Food|10000|  2500|        25.0|
|          Books| 6000|  1500|        25.0|
|         Sports| 7000|  2000|       28.57|
|           Toys| 5000|   500|        10.0|
+---------------+-----+------+------------+

fig = df.plot.scatter(
    x="Sales",
    y="Profit",
    size="ProfitMargin",
    text="Category",
    hover_name="Category",
    color="ProfitMargin",            # Add color gradient based on ProfitMargin
    color_continuous_scale="Plasma", # Use a visually appealing color scale
)
• 86. Example: Pie plot
+--------------------+--------+
|             country|     pop|
+--------------------+--------+
|             Albania| 3600523|
|             Austria| 8199783|
|             Belgium|10392226|
|Bosnia and Herzeg...| 4552198|
|            Bulgaria| 7322858|
|             Croatia| 4493312|
|      Czech Republic|10228744|
|             Denmark| 5468120|
|             Finland| 5238460|
|              France|61083916|
|             Germany|82400996|
|              Greece|10706290|
|             Hungary| 9956108|
|             Iceland|  301931|
|             Ireland| 4109086|
|               Italy|58147733|
|          Montenegro|  684736|
|         Netherlands|16570613|
|              Norway| 4627926|

sdf.plot.pie(x="country", y="pop")
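The other types listed on slide 84 follow the same accessor pattern. A small sketch, not from the deck, reusing the country/population DataFrame above and assuming the bar plot accepts the same x/y keywords as the scatter and pie examples:
# Hedged sketch: a bar plot over the same country/population DataFrame,
# assuming the x/y keyword pattern shown in the scatter and pie examples.
fig = sdf.plot.bar(x="country", y="pop")
fig.show()
Python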
  • 87. Python: The Number One Choice
• 91. PyPI downloads of PySpark in the last 12 months: 210 countries & regions
• 92. Join the community today!
spark.apache.org
github.com/apache/spark
apache-spark.slack.com
spark.apache.org/community/