Exciting New Features
of Apache Spark 4.0
Daniel Tenedorio dtenedor
Agenda
New Functionalities
Spark Connect, ANSI Mode, Variant Data Types,
Collation Support
Extensions
Custom Functions and Procedures
Arrow Optimized Python UDF Support,
Python Data Source APIs, Python UDTF
SQL Pipe Syntax, SQL UDF/UDTF, Stored Procedures
Usability
PySpark UDF Unified Profiling, Spark Native
Plotting
Spark Connect
Spark’s Monolith Driver
How to embed Spark in applications?
Up until Spark 3.4: Hard to support today’s developer experience requirements
[Diagram: applications, IDEs / notebooks, and programming-language SDKs all embed Spark's monolithic driver (application logic, analyzer, optimizer, scheduler, distributed execution engine); limitations: no JVM interop, tied to the REPL, SQL only]
Spark Connect Client API
Connect to Spark from Any App
Thin client, with full power of Apache Spark
[Diagram: applications, IDEs / notebooks, and language SDKs embed only the thin Spark Connect client, which talks to Spark's driver through an application gateway in front of the analyzer, optimizer, scheduler, and distributed execution engine]
Spark Connect GA in Apache Spark 4.0
pip install pyspark>=3.4.0
in your favorite IDE!
Interactively develop
& debug from your IDE
Check out Databricks Connect,
use & contribute the Go client
New Connectors and SDKs
in any language!
Build interactive Data
Applications
Get started with our
github example!
Databricks
Connect
Scala3
The lightweight Spark Connect Package
pip install pyspark-connect
• Pure Python library, no JVM.
• A pure Spark Connect client, not the entire PySpark package.
• Only ~1.5 MB (full PySpark: ~355 MB).
• Preferred if your application has fully migrated to the Spark Connect API.
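A minimal sketch of connecting through Spark Connect from Python (the sc:// endpoint below is a placeholder for your own server; 15002 is the default port):
from pyspark.sql import SparkSession

# Connect to a running Spark Connect server (placeholder endpoint)
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
spark.range(5).show()
Python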
ANSI MODE
ON by default in 4.0
Migration
to ANSI ON
Action: Turn on ANSI mode
to fix your data corruption!
Data Corruption!
Without ANSI mode
Error callsite is captured
With ANSI mode (3.5)
With ANSI mode (4.0)
Error callsite is highlighted
With ANSI mode (4.0)
Culprit operation Line number
DataFrame queries with error callsite
DataFrame queries
with error callsite
• PySpark support is on the way.
• Spark Connect support is on the way.
• Native notebook integration is on the way
(so that you can see the highlighting).
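For illustration, a minimal sketch of the behavioral difference, using the standard spark.sql.ansi.enabled flag:
# With ANSI mode off, an invalid cast silently produces NULL (data corruption)
spark.conf.set("spark.sql.ansi.enabled", False)
spark.sql("SELECT CAST('abc' AS INT)").show()   # -> NULL

# With ANSI mode on (the default in Spark 4.0), the same query raises an error
spark.conf.set("spark.sql.ansi.enabled", True)
spark.sql("SELECT CAST('abc' AS INT)").show()   # -> raises CAST_INVALID_INPUT
Python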
Variant Data Type for
Semi-Structured Data
Data Engineer’s Dilemma: Pick 2 out of 3…
Flexible
Fast Open
What if you could ingest JSON,
maintain flexibility,
boost performance, and
use an open standard?
Motivation
Variant is flexible
SQL
INSERT INTO variant_tbl (event_data)
VALUES
(
PARSE_JSON(
'{"level": "warning",
"message": "invalid request",
"user_agent": "Mozilla/5.0 ..."}'
)
);
SELECT
*
FROM
variant_tbl
WHERE
event_data:user_agent ilike '%Mozilla%';
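The same data can be queried from PySpark; a minimal sketch below reads individual fields with the SQL variant_get function via expr (table and field names from the example above):
from pyspark.sql.functions import expr

df = spark.table("variant_tbl")
df.select(
    expr("variant_get(event_data, '$.level', 'string')").alias("level"),
    expr("variant_get(event_data, '$.user_agent', 'string')").alias("user_agent"),
).show(truncate=False)
Python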
Performance
String Collation Support
ANSI SQL COLLATE ● Associate columns, fields, array elements
with a collation of choice
○ Case insensitive
○ Accent insensitive
○ Locale aware
● Supported by many string functions such as
○ lower()/upper()
○ substr()
○ locate()
○ like
● GROUP BY, ORDER BY, comparisons, …
● Supported by Delta and Photon
Sorting and comparing strings
according to locale
A look at the default collation
A < Z < a < z < Ā
Is this really what we want here?
> SELECT name FROM names ORDER BY name;
name
Anthony
Bertha
anthony
bertha
Ānthōnī
SQL
COLLATE unicode
One size fits most
> SELECT name
FROM names
ORDER BY name COLLATE unicode;
name
Ānthōnī
anthony
Anthony
bertha
Bertha
SQL
Root collation with decent sort order for most locales
COLLATE unicode_ci
Case insensitive comparisons have entered the chat
> SELECT name
FROM names
WHERE startswith(name COLLATE unicode_ci, 'a')
ORDER BY name COLLATE unicode_ci;
name
anthony
Anthony
SQL
Case insensitive is not accent insensitive: We lost Ānthōnī
COLLATE unicode_ci_ai
Equality from a to Ź
> SELECT name
FROM names
WHERE startswith(name COLLATE unicode_ci_ai, 'a')
ORDER BY name COLLATE unicode_ci_ai;
name
Ānthōnī
anthony
Anthony
SQL
100s of supported predefined collations across many locales
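The same behavior can be reproduced from PySpark with an inline VALUES table; a minimal self-contained sketch:
spark.sql("""
  SELECT name
  FROM VALUES ('Anthony'), ('anthony'), ('Ānthōnī'), ('Bertha') AS t(name)
  WHERE startswith(name COLLATE unicode_ci_ai, 'a')
  ORDER BY name COLLATE unicode_ci_ai
""").show()
Python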
Agenda
New Functionalities
Spark Connect, ANSI Mode, Variant Data Types,
Collation Support
Extensions
Custom Functions and Procedures
Arrow Optimized Python UDF Support,
Python Data Source APIs, Python UDTF
SQL Pipe Syntax, SQL UDF/UDTF, Stored Procedures
Usability
PySpark UDF Unified Profiling, Spark Native
Plotting
Arrow Optimized Python UDF
Enhancing Python UDFs with Apache Arrow
Introduction to Arrow and Its
Role in UDF Optimization:
● Utilizes Apache Arrow
● Supported since Spark 3.5
Key Benefits
● Enhances data serialization and
deserialization speed
● Provides standardized type coercion
spark.conf.set("spark.sql.execution.pythonUDF
.arrow.enabled", True)
# An Arrow Python UDF
@udf(returnType='int')
def arrow_slen(s):
return len(s)
Enabling Arrow Optimization
Local Activation in a UDF
Activates Arrow optimization for a specific
UDF, improving performance
Global Activation in a UDF
Activates Arrow optimization for all Python UDFs
in the Spark session
# An Arrow Python UDF
@udf(returnType='int', useArrow=True)
def arrow_slen(s):
return len(s)
Python
Python
@udf(returnType='int', useArrow=True)
def arrow_slen(s):
return len(s)
Python
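A minimal usage sketch (the example DataFrame below is illustrative, not from the slides):
# arrow_slen is the Arrow-optimized UDF defined above
df = spark.createDataFrame([("Alice",), ("Bob",)], ["name"])
df.select(arrow_slen("name").alias("name_len")).show()
Python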
Performance
sdf.select(
    udf(lambda v: v + 1, DoubleType(), useArrow=True)("v"),
    udf(lambda v: v - 1, DoubleType(), useArrow=True)("v"),
    udf(lambda v: v * v, DoubleType(), useArrow=True)("v"),
)
Python
Performance
Comparing Pickled and Arrow-optimized Python UDFs on type coercion (Link)
Pickled Python UDF
>>> df.select(udf(lambda x: x, 'string')('value').alias('date_in_string')).show()
+-----------------------------------------------------------------------+
| date_in_string |
+-----------------------------------------------------------------------+
|java.util.GregorianCalendar[time=?,areFieldsSet=false,areAllFieldsSet..|
|java.util.GregorianCalendar[time=?,areFieldsSet=false,areAllFieldsSet..|
+-----------------------------------------------------------------------+
Python
Arrow-optimized Python UDF
>>> df.select(udf(lambda x: x, 'string')('value').alias('date_in_string')).show()
+--------------+
|date_in_string|
+--------------+
| 1970-01-01|
| 1970-01-02|
+--------------+
Python
Python Data Sources
Why Python Data Sources?
• People like writing Python!
• pip install is so convenient.
• Simplified API without complicated
performance features in Data Source V2.
Spark ❤ Python
● SPIP: Python Data Source API (SPARK-44076)
● Available in Spark 4.0 preview version and Databricks Runtime 15.2
● Supports both batch and streaming, read and write
Python Data Source APIs
Python Data Source Overview
Three easy steps to create and use your custom data sources
Step 1:
Create a Data Source
class MySource(DataSource):
    ...
Step 2:
Register the Data Source
Register the data source in the current Spark session using the Python data source class:
spark.dataSource.register(MySource)
Step 3:
Read from or write to the data source
spark.read.format("my-source").load(...)
df.write.format("my-source").mode("append").save(...)
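For a fuller picture, here is a minimal batch-reader sketch assuming the pyspark.sql.datasource classes as documented for Spark 4.0; the source name, schema, and generated rows are illustrative:
from pyspark.sql.datasource import DataSource, DataSourceReader

class MySource(DataSource):
    @classmethod
    def name(cls):
        return "my-source"          # the name used in .format("my-source")

    def schema(self):
        return "id INT, value STRING"

    def reader(self, schema):
        return MyReader()

class MyReader(DataSourceReader):
    def read(self, partition):
        # Yield rows as tuples matching the schema above.
        for i in range(3):
            yield (i, f"row-{i}")

spark.dataSource.register(MySource)
spark.read.format("my-source").load().show()
Python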
Python UDTF
Python User Defined Table Functions
This is a new kind of function that returns an entire
table as output instead of a single scalar result value
● Once registered, they can appear in the FROM clause of a
SQL query
● Or use the DataFrame API to call them
from pyspark.sql.functions import udtf

@udtf(returnType="num: int, squared: int")
class SquareNumbers:
    def eval(self, start: int, end: int):
        for num in range(start, end + 1):
            yield (num, num * num)
Python
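To call it from SQL as on the following slides, the class can first be registered with the session; a minimal sketch (the registration name is chosen to match the class):
# Register the UDTF defined above so it can be used in SQL queries
spark.udtf.register("SquareNumbers", SquareNumbers)
spark.sql("SELECT * FROM SquareNumbers(start => 1, end => 3)").show()
Python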
Python User Defined Table Functions
SELECT *
FROM SquareNumbers(
start => 1,
end => 3);
+-----+--------+
| num | squared|
+-----+--------+
| 1 | 1 |
| 2 | 4 |
| 3 | 9 |
+-----+--------+
Python User Defined Table Functions
SquareNumbers(lit(1), lit(3)).show()
+-----+--------+
| num | squared|
+-----+--------+
| 1 | 1 |
| 2 | 4 |
| 3 | 9 |
+-----+--------+
Python User Defined Table Functions
class ReadFromConfigFile:
    @staticmethod
    def analyze(filename: AnalyzeArgument):
        with open(os.path.join(
                SparkFiles.getRootDirectory(),
                filename.value), "r") as f:
            # Compute the UDTF output schema
            # based on the contents of the file.
            return AnalyzeResult(
                from_file(f.read()))
    ...
Polymorphic Analysis
Compute the output schema for each call depending on arguments, using analyze
Python User Defined Table Functions
Polymorphic Analysis
Compute the output schema for each call depending on arguments, using analyze
ReadFromConfigFile(lit("config.txt")).show()
+------------+-------------+
| start_date | other_field |
+------------+-------------+
| 2024-04-02 | 1 |
+------------+-------------+
Python User Defined Table Functions
WITH t AS (SELECT id FROM RANGE(0, 100))
SELECT * FROM CountAndMax(
TABLE(t) PARTITION BY id / 10 ORDER BY id);
+-------+-----+
| count | max |
+-------+-----+
| 10 | 0 |
| 10 | 1 |
...
Input Table Partitioning
Split input rows among instances: eval runs once per row, then terminate runs last
Python User Defined Table Functions
class CountAndMax:
    def __init__(self):
        self._count = 0
        self._max = 0

    def eval(self, row: Row):
        self._count += 1
        self._max = max(self._max, row[0])

    def terminate(self):
        yield self._count, self._max
Input Table Partitioning
Split input rows among instances: eval runs once per row, then terminate runs last
Python User Defined Table Functions
Variable Keyword Arguments
The analyze and eval methods may accept *args or **kwargs
class VarArgs:
    @staticmethod
    def analyze(**kwargs: AnalyzeArgument):
        return AnalyzeResult(StructType(
            [StructField(key, arg.dataType)
             for key, arg in sorted(
                 kwargs.items())]))

    def eval(self, **kwargs):
        yield tuple(value for _, value
                    in sorted(kwargs.items()))
Python User Defined Table Functions
Variable Keyword Arguments
The analyze and eval methods may accept *args or **kwargs
SELECT * FROM VarArgs(a => 10, b => 'x');
+----+-----+
| a | b |
+----+-----+
| 10 | "x" |
+----+-----+
Python User Defined Table Functions
Custom Initialization
Create a subclass of AnalyzeResult and consume it in each subsequent __init__
class SplitWords:
    @dataclass
    class MyAnalyzeResult(AnalyzeResult):
        numWords: int
        numArticles: int

    def __init__(self, r: MyAnalyzeResult):
        ...
Python User Defined Table Functions
Custom Initialization
Create a subclass of AnalyzeResult and consume it in each subsequent __init__
@staticmethod
def analyze(text: str):
    words = text.split(" ")
    return MyAnalyzeResult(
        schema=StructType()
            .add("word", StringType())
            .add("total", IntegerType()),
        withSinglePartition=True,
        numWords=len(words),
        numArticles=sum(
            1 for word in words
            if word in ("a", "an", "the")))
Agenda
New Functionalities
Spark Connect, ANSI Mode, Variant Data Types,
Collation Support
Extensions
Custom Functions and Procedures
Arrow Optimized Python UDF Support,
Python Data Source APIs, Python UDTF
SQL Pipe Syntax, SQL UDF/UDTF, Stored Procedures
Usability
PySpark UDF Unified Profiling, Spark Native
Plotting
SQL Pipe Syntax
● Apache Spark supports DataFrames which allow expressing
logic in a sequence of operations from start to finish.
● Queries generally begin with a table reference and then chain
operations onto the end of the sequence one by one.
DataFrames and the People who Love Them
DataFrames and the People who Love Them
from pyspark.sql import functions as F

# Load the CSV file into a DataFrame.
df = spark.read.option("header", "true").csv("lineitem.csv")

# Filter the DataFrame based on a column.
filtered = df.filter(df['price'] < 100)

# Group the remaining rows, compute the total price * quantity per
# country, and show the result.
(filtered.groupBy("country")
    .agg(F.sum(filtered.price * filtered.quantity)
         .alias("total"))
    .show())
Introducing New SQL Pipe Syntax
FROM T
|> WHERE A < 100
|> SELECT B + C AS D;
New!
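Pipe syntax is just SQL, so it can also be submitted from PySpark; a minimal self-contained sketch, using the range table-valued function as a stand-in table:
spark.sql("""
  FROM range(10)
  |> WHERE id > 5
  |> SELECT id * 2 AS doubled
""").show()
Python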
TPC-H Query 13
SELECT c_count, COUNT(*) AS custdist
FROM
( SELECT c_custkey, COUNT(o_orderkey) c_count
FROM customer
LEFT OUTER JOIN orders ON c_custkey = o_custkey
AND o_comment NOT LIKE '%unusual%packages%'
GROUP BY c_custkey
) AS c_orders
GROUP BY c_count
ORDER BY custdist DESC, c_count DESC;
Introducing New SQL Pipe Syntax
FROM customer
|> LEFT OUTER JOIN orders ON c_custkey = o_custkey
AND o_comment NOT LIKE '%unusual%packages%'
|> AGGREGATE COUNT(o_orderkey) c_count
GROUP BY c_custkey
|> AGGREGATE COUNT(*) AS custdist
GROUP BY c_count
|> ORDER BY custdist DESC, c_count DESC;
New!
Fun With Subqueries
SELECT
  i_item_id,
  i_item_desc,
  i_category,
  i_class,
  i_manufact_id,
  i_brand_id,
  SUM(ss_sales_price) AS sales_price,
  SUM(ss_coupon_amt) AS coupon_amt
FROM
  (SELECT ss_item_sk, ss_sales_price, ss_coupon_amt
   FROM
     (SELECT ss_item_sk, ss_sales_price, ss_coupon_amt
      FROM store_sales
      WHERE ss_sold_date_sk BETWEEN 2450814 AND 2451199
     ) AS t1
   WHERE ss_sales_price > 100
  ) AS t2
JOIN item
  ON t2.ss_item_sk = i_item_sk
GROUP BY
  i_item_id,
  i_item_desc,
  i_category,
  i_class,
  i_manufact_id,
  i_brand_id
ORDER BY
  sales_price DESC;
FROM store_sales
|> WHERE ss_sold_date_sk BETWEEN 2450814 AND 2451199
|> SELECT ss_item_sk, ss_sales_price, ss_coupon_amt
|> WHERE ss_sales_price > 100
|> SELECT ss_item_sk, ss_sales_price, ss_coupon_amt
|> AS t2
|> JOIN item ON t2.ss_item_sk = i_item_sk
|> AGGREGATE
SUM(ss_sales_price) AS sales_price,
SUM(ss_coupon_amt) AS coupon_amt
GROUP BY
i_item_id,
i_item_desc,
i_category,
i_class,
i_manufact_id,
i_brand_id
|> ORDER BY sales_price DESC;
Fun With Subqueries
New!
Get Started With Backwards Compatibility
SELECT c_count, COUNT(*) AS custdist
FROM
(
-- We can update the table subquery to pipe syntax
-- while leaving the rest of the SQL alone:
FROM customer
|> LEFT OUTER JOIN orders ON c_custkey = o_custkey
AND o_comment NOT LIKE '%unusual%packages%'
|> AGGREGATE COUNT(o_orderkey) c_count
GROUP BY c_custkey
) AS c_orders
GROUP BY c_count
ORDER BY custdist DESC, c_count DESC;
New!
Get Started With Backwards Compatibility
-- If we had started with the inner table subquery of TPC-H Q13
-- from earlier:
SELECT c_custkey, COUNT(o_orderkey) c_count
FROM customer
LEFT OUTER JOIN orders ON c_custkey = o_custkey
AND o_comment NOT LIKE '%unusual%packages%'
GROUP BY c_custkey
-- We can just add more aggregation and sorting steps directly
-- to the end like this:
|> AGGREGATE COUNT(*) AS custdist
GROUP BY c_count
|> ORDER BY custdist DESC, c_count DESC;
New!
SQL UDF / UDTF
• SQL User Defined Scalar Functions
‐ Persisted SQL Expressions
• SQL User Defined Table Functions
‐ Persisted Parameterized Views
• Support named parameter invocation
and defaulting
• Table functions with lateral correlation
Easily extend SQL
function library
• Encapsulate (complex) expressions,
including subqueries
• May contain subqueries
• Return a scalar value
• Can be used in most places where built-in
functions go
SQL User Defined
Scalar Functions
SQL User Defined Scalar Functions
Persists complex expression patterns
> CREATE FUNCTION roll_dice(
    num_dice INT DEFAULT 1 COMMENT 'number of dice to roll (Default: 1)',
    num_sides INT DEFAULT 6 COMMENT 'number of sides per die (Default: 6)'
  ) COMMENT 'Roll a number of n-sided dice'
  RETURN aggregate(
    sequence(1, roll_dice.num_dice, 1),
    0,
    (acc, x) -> acc + (rand() * roll_dice.num_sides) :: INT,
    acc -> acc + roll_dice.num_dice
  );
> SELECT roll_dice();
3
-- Roll 3 6-sided dice
> SELECT roll_dice(3);
15
-- Roll 3 10-sided dice
> SELECT roll_dice(3, 10);
21
SQL
• Encapsulate (complex) correlated
subqueries aka a parameterized view
• Can be used in the FROM clause
SQL User Defined
Table Functions
SQL User Defined Table Functions
Persist complex parameterized queries
CREATE FUNCTION weekdays(start DATE, end DATE)
RETURNS TABLE (day_of_week STRING, day DATE)
RETURN SELECT
    to_char(day, 'E'),
    day
  FROM
    (
      SELECT sequence(weekdays.start, weekdays.end)
    ) AS t(days),
    LATERAL (explode(days)) AS dates(day)
  WHERE
    extract(DAYOFWEEK_ISO FROM day) BETWEEN 1 AND 5;
SQL
SQL User Defined Table Functions
Persist complex parameterized queries
> SELECT
    day_of_week,
    day
  FROM
    weekdays(DATE '2024-01-01',
             DATE '2024-01-14');
Mon 2024-01-01
…
Fri 2024-01-05
Mon 2024-01-08
SQL
> -- Return weekdays for date ranges originating from a LATERAL correlation
> SELECT
    weekdays.*
  FROM
    VALUES
      (DATE '2020-01-01'),
      (DATE '2021-01-01') AS starts(start),
    LATERAL weekdays(start, start + INTERVAL '7' DAYS);
Wed 2020-01-01
Thu 2020-01-02
Fri 2020-01-03
…
SQL
Named parameter invocation
Self-documenting and safer SQL UDF invocation
> DESCRIBE FUNCTION roll_dice;
Function: default.roll_dice
Type: SCALAR
Input: num_dice INT
num_sides INT
Returns: INT
> -- Roll 1 10-sided die - skip the dice count
> SELECT roll_dice(num_sides => 10);
7
> -- Roll 3 10-sided dice - reversed argument order
> SELECT roll_dice(num_sides => 10, num_dice => 3);
21
SQL
Stored Procedures
External Stored Procedures
External Iceberg Stored Procedures
SQL Scripting
Support for control flow, iterators & error handling natively in SQL
• Control flow → IF/ELSE, CASE
• Looping → WHILE, REPEAT, ITERATE
• Resultset iterator → FOR
• Exception handling → CONTINUE/EXIT
• Parameterized queries → EXECUTE IMMEDIATE
Following the SQL/PSM standard
It’s SQL, but with control flow!
SQL Scripting
BEGIN
  DECLARE c INT = 10;
  WHILE c > 0 DO
    INSERT INTO t VALUES (c);
    SET VAR c = c - 1;
  END WHILE;
END
SQL
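A scripting block is submitted as a single statement; a minimal sketch of running the script above from PySpark (assumes the table t exists, and that scripting may need to be switched on in the 4.0 preview):
# SQL scripting may sit behind a flag in the 4.0 preview (assumption):
spark.conf.set("spark.sql.scripting.enabled", "true")

spark.sql("CREATE TABLE IF NOT EXISTS t (c INT)")
spark.sql("""
BEGIN
  DECLARE c INT = 10;
  WHILE c > 0 DO
    INSERT INTO t VALUES (c);
    SET VAR c = c - 1;
  END WHILE;
END
""")
Python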
SQL Scripting
Example: rename a column across every table that contains it
-- parameters
DECLARE oldColName = 'ColoUr';
DECLARE newColName = 'color';
BEGIN
DECLARE tableArray ARRAY<STRING>;
DECLARE tableType STRING;
DECLARE i INT = 0;
DECLARE alterQuery STRING;
SET
tableArray = (
SELECT
array_agg(table_name)
FROM
INFORMATION_SCHEMA.columns
WHERE
column_name
COLLATE UNICODE_CI = oldColName
);
SQL
WHILE i < array_size(tableArray) DO
SET
tableType = (
SELECT
table_type
FROM
INFORMATION_SCHEMA.tables
WHERE
table_name = tableArray [i]
);
IF tableType != 'VIEW' COLLATE UNICODE_CI THEN
SET
alterQuery = 'ALTER TABLE ' || tableArray [i] ||
' RENAME COLUMN ' || oldColName || ' TO ' ||
newColName;
EXECUTE IMMEDIATE alterQuery;
END IF;
SET i = i + 1;
END WHILE;
END;
SQL
Agenda
New Functionalities
Spark Connect, ANSI Mode, Variant Data Types,
Collation Support
Extensions
Custom Functions and Procedures
Arrow Optimized Python UDF Support,
Python Data Source APIs, Python UDTF
SQL Pipe Syntax, SQL UDF/UDTF, Stored Procedures
Usability
PySpark UDF Unified Profiling, Spark Native
Plotting
PySpark UDF Unified Profiling
Overview of Unified Profiling
• Key Components: Performance and memory profiling
• Benefits: Tracks function calls, execution time, memory usage
• Replacement for Legacy Profiling
• Drawbacks of Legacy Profiling
• Advantages of New Unified Profiling
• Session-based, works with Spark Connect, runtime toggling
Overview of Unified Profiling
• How to Enable:
• Performance Profiler: spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")
• Memory Profiler: spark.conf.set("spark.sql.pyspark.udf.profiler", "memory")
• API Features: "show", "dump", and "clear" commands
• Show results:
• Performance: spark.profile.show(type="perf")
• Memory: spark.profile.show(type="memory")
• Dump results: spark.profile.dump("/your_path/...")
• Clear results: spark.profile.clear()
PySpark
Performance
Profiler
from pyspark.sql.functions import pandas_udf

df = spark.range(10)

@pandas_udf("long")
def add1(x):
    return x + 1

added = df.select(add1("id"))
spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")
added.show()
Python
PySpark
Performance
Profiler
PySpark
Memory
Profiler
from pyspark.sql.functions import pandas_udf

df = spark.range(10)

@pandas_udf("long")
def add1(x):
    return x + 1

added = df.select(add1("id"))
spark.conf.set("spark.sql.pyspark.udf.profiler", "memory")
added.show()
Python
PySpark
Memory
Profiler
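Once the profiled UDF has run, the collected results can be inspected with the session-level API from the overview slide; a minimal sketch (the dump path is a placeholder):
spark.profile.show(type="memory")        # per-UDF memory profile
spark.profile.show(type="perf")          # per-UDF performance profile
spark.profile.dump("/tmp/udf_profiles")  # write results to a directory
spark.profile.clear()                    # discard collected profiles
Python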
Spark Native Plotting
Since Spark 4.0, PySpark supports native plotting.
How does it work?
Plotly is used as the default visualization backend.
Supported Plot Types
Type: Description
Area plot: Shows data as filled areas under lines. It's great for showing trends over time or how values add up cumulatively across categories.
Bar plot: Uses bars to represent data. The height (or length) of each bar shows the size of the value.
Horizontal bar plot: Similar to a bar plot, but the bars go sideways, which is useful for longer category names or side-by-side comparisons.
Box plot: Summarizes data by showing its range (minimum to maximum), interquartile range (IQR), the middle value (median), and any unusual data points (outliers).
Density/KDE plot: Shows how data is spread out, using a smooth curve to estimate the data's distribution.
Histogram: Groups data into ranges (bins) and shows how many values fall into each range. It helps visualize the distribution and frequency of the data.
Line plot: Connects data points with lines, making it easy to see trends or changes over time.
Pie chart: Displays data as slices of a circle, representing proportions or percentages of a whole.
Scatter plot: Shows the relationship between two variables by plotting them as points on a graph.
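A minimal end-to-end sketch with one of the supported plot types (illustrative data; assumes Plotly is installed, since it is the default backend):
df = spark.createDataFrame(
    [(1, 2.0), (2, 3.5), (3, 2.8), (4, 4.1)], ["month", "revenue"])
fig = df.plot.line(x="month", y="revenue")  # returns a Plotly figure
fig.show()
Python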
Example: Scatter plot
+---------------+-----+------+------------+
| Category|Sales|Profit|ProfitMargin|
+---------------+-----+------+------------+
| Electronics|20000| 5000| 25.0|
| Furniture|15000| 2000| 13.33|
|Office Supplies|12000| 3000| 25.0|
| Clothing| 8000| 1000| 12.5|
| Food|10000| 2500| 25.0|
| Books| 6000| 1500| 25.0|
| Sports| 7000| 2000| 28.57|
| Toys| 5000| 500| 10.0|
+---------------+-----+------+------------+
fig = df.plot.scatter(
x="Sales",
y="Profit",
size="ProfitMargin",
text="Category",
hover_name="Category",
color="ProfitMargin", # Add color gradient based on ProfitMargin
color_continuous_scale="Plasma", # Use a visually appealing color scale
)
Example: Pie plot
+--------------------+--------+
| country| pop|
+--------------------+--------+
| Albania| 3600523|
| Austria| 8199783|
| Belgium|10392226|
|Bosnia and Herzeg...| 4552198|
| Bulgaria| 7322858|
| Croatia| 4493312|
| Czech Republic|10228744|
| Denmark| 5468120|
| Finland| 5238460|
| France|61083916|
| Germany|82400996|
| Greece|10706290|
| Hungary| 9956108|
| Iceland| 301931|
| Ireland| 4109086|
| Italy|58147733|
| Montenegro| 684736|
| Netherlands|16570613|
| Norway| 4627926|
sdf.plot.pie(x="country", y="pop")
Python: The Number One Choice
Birth of PySpark: 2013
>330 million downloads in 2024
[Chart: PyPI downloads of PySpark by year]
210 countries & regions, PyPI downloads of PySpark in the last 12 months
Join the community today!
spark.apache.org
github.com/apache/spark
apache-spark.slack.com
spark.apache.org/community/
  • 1. Exciting New Features of Apache Spark 4.0 Daniel Tenedorio dtenedor
  • 2. Agenda New Functionalities Spark Connect, ANSI Mode, Variant Data Types, Collation Support Extensions Custom Functions and Procedures Arrow Optimized Python UDF Support, Python Data Source APIs, Python UDTF SQL Pipe Syntax, SQL UDF/UDTF, Stored Procedures Usability PySpark UDF Unified Profiling, Spark Native Plotting
  • 4. Spark’s Monolith Driver How to embed Spark in applications? Up until Spark 3.4: Hard to support today’s developer experience requirements Applications IDEs / Notebooks Programming Languages / SDKs No JVM InterOp Close to REPL SQL only Modern data application Application Logic Analyzer Optimizer Scheduler Distributed Execution Engine
  • 5. Spark Connect Client API Connect to Spark from Any App Thin client, with full power of Apache Spark Applications IDEs / Notebooks Programming Languages / SDKs Modern data application Spark’s Driver Application Gateway Analyzer Optimizer Scheduler Distributed Execution Engine
  • 6. Spark Connect GA in Apache Spark 4.0 pip install pyspark>=3.4.0 in your favorite IDE! Interactively develop & debug from your IDE Check out Databricks Connect, use & contribute the Go client New Connectors and SDKs in any language! Build interactive Data Applications Get started with our github example! Databricks Connect Scala3
  • 7. The lightweight Spark Connect Package pip install pyspark-connect • Pure Python library, no JVM. • Pure Spark Connect client, not entire PySpark • Only 1.5MB (PySpark 355 MB) • Preferred if your application has fully migrated to the Spark Connect API.
  • 8. ANSI MODE ON by default in 4.0
  • 9. Migration to ANSI ON Action: Turn On ANSI Mode to fix your data corruptions!
  • 11. Error callsite is captured With ANSI mode (3.5)
  • 12. With ANSI mode (4.0) Error callsite is highlighted
  • 13. With ANSI mode (4.0) Culprit operation Line number DataFrame queries with error callsite
  • 14. DataFrame queries with error callsite • PySpark support is on the way. • Spark Connect support is on the way. • Native notebook integration is on the way (so that you can see highlight).
  • 15. Variant Data Type for Semi-Structured Data
  • 16. Data Engineer’s Dilemma: Pick 2 out of 3… Flexible Fast Open What if you could ingest JSON, maintain flexibility, boost performance, and use an open standard? Motivation
  • 17. Variant is flexible SQL INSERT INTO variant_tbl (event_data) VALUES ( PARSE_JSON( '{"level": "warning", "message": "invalid request", "user_agent": "Mozilla/5.0 ..."}' ) ); SELECT * FROM variant_tbl WHERE event_data:user_agent ilike '%Mozilla%';
  • 20. ANSI SQL COLLATE ● Associate columns, fields, array elements with a collation of choice ○ Case insensitive ○ Accent insensitive ○ Locale aware ● Supported by many string functions such as ○ lower()/upper() ○ substr() ○ locate() ○ like ● GROUP BY, ORDER BY, comparisons, … ● Supported by Delta and Photon Sorting and comparing strings according to locale
  • 21. A look at the default collation A < Z < a < z < Ā Is this really what we want here? > SELECT name FROM names ORDER BY name; name Anthony Bertha anthony bertha Ānthōnī SQL
  • 22. COLLATE unicode One size, fits most > SELECT name FROM names ORDER BY name COLLATE unicode; name Ānthōnī anthony Anthony bertha Bertha SQL Root collation with decent sort order for most locales
  • 23. COLLATE unicode_ci Case insensitive comparisons have entered the chat > SELECT name FROM names WHERE startswith(name COLLATE unicode_ci, 'a') ORDER BY name COLLATE unicode_ci; name anthony Anthony SQL Case insensitive is not accent insensitive: We lost Ānthōnī
  • 24. COLLATE unicode_ci_ai Equality from a to Ź > SELECT name FROM names WHERE startswith(name COLLATE unicode_ci_ai, 'a') ORDER BY name COLLATE unicode_ci_ai; name Ānthōnī anthony Anthony SQL 100s of supported predefined collations across many locales
  • 25. Agenda New Functionalities Spark Connect, ANSI Mode, Variant Data Types, Collation Support Extensions Custom Functions and Procedures Arrow Optimized Python UDF Support, Python Data Source APIs, Python UDTF SQL Pipe Syntax, SQL UDF/UDTF, Stored Procedures Usability PySpark UDF Unified Profiling, Spark Native Plotting
  • 27. Enhancing Python UDFs with Apache Arrow Introduction to Arrow and Its Role in UDF Optimization: ● Utilizes Apache Arrow ● Supported since Spark 3.5 Key Benefits ● Enhances data serialization and deserialization speed ● Provides standardized type coercion
  • 28. spark.conf.set("spark.sql.execution.pythonUDF .arrow.enabled", True) # An Arrow Python UDF @udf(returnType='int') def arrow_slen(s): return len(s) Enabling Arrow Optimization Local Activation in a UDF Activates Arrow optimization for a specific UDF, improving performance Global Activation in a UDF Activates Arrow optimization for all Python UDFs in the Spark session # An Arrow Python UDF @udf(returnType='int', useArrow=True) def arrow_slen(s): return len(s) Python Python
  • 30. sdf.select( udf(lambda v: v + 1, DoubleType(), useArrow=True)("v"), udf(lambda v: v - 1, DoubleType(), useArrow=True)("v"), udf(lambda v: v * v, DoubleType(), useArrow=True)("v") ) Python Performance
  • 31. >>> df.select(udf(lambda x: x, 'string')('value').alias('date_in_string')).show() +-----------------------------------------------------------------------+ | date_in_string | +-----------------------------------------------------------------------+ |java.util.GregorianCalendar[time=?,areFieldsSet=false,areAllFieldsSet..| |java.util.GregorianCalendar[time=?,areFieldsSet=false,areAllFieldsSet..| +-----------------------------------------------------------------------+ Python Pickled Python UDF >>> df.select(udf(lambda x: x, 'string')('value').alias('date_in_string')).show() +--------------+ |date_in_string| +--------------+ | 1970-01-01| | 1970-01-02| +--------------+ Python Comparing Pickled and Arrow-optimized Python UDFs on type coercion Link Arrow-optimized Python UDF
  • 33. Why Python Data Sources? • People like writing Python! • pip install is so convenient. • Simplified API without complicated performance features in Data Source V2. Spark ❤ Python
  • 34. ● SPIP: Python Data Source API (SPARK-44076) ● Available in Spark 4.0 preview version and Databricks Runtime 15.2 ● Support both batch and streaming, read and write Python Data Source APIs
  • 35. Register the data source in the current Spark session using the Python data source class: spark .dataSource .register(MySource ) spark.read .format ("my-source") .load(...) df.write .format("my-source") .mode("append") .save(...) Step 1: Create a Data Source Step 2: Register the Data Source Step 3: Read from or write to the data source Python Data Source Overview Easy Three Steps to create and use your custom data sources class MySource(DataSource): ...
  • 37. Python User Defined Table Functions This is a new kind of function that returns an entire table as output instead of a single scalar result value ● Once registered, they can appear in the FROM clause of a SQL query ● Or use the DataFrame API to call them from pyspark.sql.functions import udtf @udtf(returnType="num: int, squared: int") class SquareNumbers: def eval(self, start: int, end: int): for num in range(start, end + 1): yield (num, num * num) Python
  • 38. ©2024 Databricks Inc. — All rights reserved Python User Defined Table Functions from pyspark.sql.functions import udtf @udtf(returnType="num: int, squared: int") class SquareNumbers: def eval(self, start: int, end: int): for num in range(start, end + 1): yield (num, num * num)
  • 39. ©2024 Databricks Inc. — All rights reserved Python User Defined Table Functions SELECT * FROM SquareNumbers( start => 1, end => 3); +-----+--------+ | num | squared| +-----+--------+ | 1 | 1 | | 2 | 4 | | 3 | 9 | +-----+--------+
  • 40. ©2024 Databricks Inc. — All rights reserved Python User Defined Table Functions SquareNumbers( lit(1), lit(3)) .show() +-----+--------+ | num | squared| +-----+--------+ | 1 | 1 | | 2 | 4 | | 3 | 9 | +-----+--------+
  • 41. ©2024 Databricks Inc. — All rights reserved Python User Defined Table Functions class ReadFromConfigFile: @staticmethod def analyze(filename: AnalyzeArgument): with open(os.path.join( SparkFiles.getRootDirectory(), filename.value), ”r”) as f: # Compute the UDTF output schema # based on the contents of the file. return AnalyzeResult( from_file(f.read())) ... Polymorphic Analysis Compute the output schema for each call depending on arguments, using analyze
  • 42. ©2024 Databricks Inc. — All rights reserved Python User Defined Table Functions Polymorphic Analysis Compute the output schema for each call depending on arguments, using analyze ReadFromConfigFile(lit(“config.txt”)).show() +------------+-------------+ | start_date | other_field | +------------+-------------+ | 2024-04-02 | 1 | +------------+-------------+
  • 43. ©2024 Databricks Inc. — All rights reserved Python User Defined Table Functions WITH t AS (SELECT id FROM RANGE(0, 100)) SELECT * FROM CountAndMax( TABLE(t) PARTITION BY id / 10 ORDER BY id); +-------+-----+ | count | max | +-------+-----+ | 10 | 0 | | 10 | 1 | ... Input Table Partitioning Split input rows among instances: eval runs once per row, then terminate runs last
  • 44. ©2024 Databricks Inc. — All rights reserved Python User Defined Table Functions class CountAndMax: def __init__(self): self._count = 0 self._max = 0 def eval(self, row: Row): self._count += 1 self._max = max(self._max, row[0]) def terminate(self): yield self._count, self._max Input Table Partitioning Split input rows among instances: eval runs once per row, then terminate runs last
  • 45. ©2024 Databricks Inc. — All rights reserved Python User Defined Table Functions Variable Keyword Arguments The analyze and eval methods may accept *args or **kwargs class VarArgs: @staticmethod def analyze(**kwargs: AnalyzeArgument): return AnalyzeResult(StructType( [StructField(key, arg.dataType) for key, arg in sorted( kwargs.items())])) def eval(self, **kwargs): yield tuple(value for _, value in sorted(kwargs.items()))
  • 46. ©2024 Databricks Inc. — All rights reserved Python User Defined Table Functions Variable Keyword Arguments The analyze and eval methods may accept *args or **kwargs SELECT * FROM VarArgs(a => 10, b => ‘x’); +----+-----+ | a | b | +----+-----+ | 10 | “x” | +----+-----+
  • 47. ©2024 Databricks Inc. — All rights reserved Python User Defined Table Functions Custom Initialization Create a subclass of AnalyzeResult and consume it in each subsequent __init__ class SplitWords: @dataclass class MyAnalyzeResult(AnalyzeResult): numWords: int numArticles: int def __init__(self, r: MyAnalyzeResult): ...
  • 48. ©2024 Databricks Inc. — All rights reserved Python User Defined Table Functions Custom Initialization Create a subclass of AnalyzeResult and consume it in each subsequent __init__ @staticmethod def analyze(text: str): words = text.split(” ”) return MyAnalyzeResult( schema=StructType() .add(”word”, StringType()) .add(”total”, IntegerType()), withSinglePartition=true, numWords=len(words) numArticles=len(( word for word in words if word in (”a”, ”an”, ”the”)))
  • 49. Agenda New Functionalities Spark Connect, ANSI Mode, Variant Data Types, Collation Support Extensions Custom Functions and Procedures Arrow Optimized Python UDF Support, Python Data Source APIs, Python UDTF SQL Pipe Syntax, SQL UDF/UDTF, Stored Procedures Usability PySpark UDF Unified Profiling, Spark Native Plotting
  • 51. ©2024 Databricks Inc. — All rights reserved ● Apache Spark supports DataFrames which allow expressing logic in a sequence of operations from start to finish. ● Queries generally begin with a table reference and then chain operations onto the end sequence one-by-one. DataFrames and the People who Love Them
  • 52. ©2024 Databricks Inc. — All rights reserved DataFrames and the People who Love Them from pyspark.sql import functions as F # Load the CSV file into a DataFrame. df = spark.read.option("header", "true").csv("lineitem.csv") # Filter the DataFrame based on a column. filtered = df.filter(df['price'] < 100) # Group the remaining rows and compute the product, and show the result. filtered.groupBy("country") .agg(F.sum(lineitem_filtered.price * lineitem_filtered.quantity) .alias("total")) .show()
  • 53. ©2024 Databricks Inc. — All rights reserved Introducing New SQL Pipe Syntax FROM T |> WHERE A < 100 |> SELECT B + C AS D; New!
  • 54. ©2024 Databricks Inc. — All rights reserved TPC-H Query 13 SELECT c_count, COUNT(*) AS custdist FROM ( SELECT c_custkey, COUNT(o_orderkey) c_count FROM customer LEFT OUTER JOIN orders ON c_custkey = o_custkey AND o_comment NOT LIKE '%unusual%packages%' GROUP BY c_custkey ) AS c_orders GROUP BY c_count ORDER BY custdist DESC, c_count DESC;
  • 55. ©2024 Databricks Inc. — All rights reserved Introducing New SQL Pipe Syntax FROM customer |> LEFT OUTER JOIN orders ON c_custkey = o_custkey AND o_comment NOT LIKE '%unusual%packages%' |> AGGREGATE COUNT(o_orderkey) c_count GROUP BY c_custkey |> AGGREGATE COUNT(*) AS custdist GROUP BY c_count |> ORDER BY custdist DESC, c_count DESC; New!
  • 56. ©2024 Databricks Inc. — All rights reserved SELECT i_item_id, i_item_desc, i_category, i_class, i_manufact_id, i_brand_id, SUM(ss_sales_price) AS sales_price, SUM(ss_coupon_amt) AS coupon_amt FROM (SELECT ss_item_sk, ss_sales_price, ss_coupon_amt FROM (SELECT ss_item_sk, ss_sales_price, ss_coupon_amt FROM store_sales WHERE ss_sold_date_sk BETWEEN 2450814 AND 2451199 ) AS t1 WHERE ss_sales_price > 100 ) AS t2 <continued> Fun With Subqueries JOIN item ON t2.ss_item_sk = i_item_sk GROUP BY i_item_id, i_item_desc, i_category, i_class, i_manufact_id, i_brand_id ORDER BY sales_price DESC;
  • 57. ©2024 Databricks Inc. — All rights reserved FROM store_sales |> WHERE ss_sold_date_sk BETWEEN 2450814 AND 2451199 |> SELECT ss_item_sk, ss_sales_price, ss_coupon_amt |> WHERE ss_sales_price > 100 |> SELECT ss_item_sk, ss_sales_price, ss_coupon_amt |> AS t2 |> JOIN item ON t2.ss_item_sk = i_item_sk |> AGGREGATE SUM(ss_sales_price) AS sales_price, SUM(ss_coupon_amt) AS coupon_amt GROUP BY i_item_id, i_item_desc, i_category, i_class, i_manufact_id, i_brand_id |> ORDER BY sales_price DESC; Fun With Subqueries New!
  • 58. ©2024 Databricks Inc. — All rights reserved Get Started With Backwards Compatibility SELECT c_count, COUNT(*) AS custdist FROM ( -- We can update the table subquery to pipe syntax -- while leaving the rest of the SQL alone: FROM customer |> LEFT OUTER JOIN orders ON c_custkey = o_custkey AND o_comment NOT LIKE '%unusual%packages%' |> AGGREGATE COUNT(o_orderkey) c_count GROUP BY c_custkey ) AS c_orders GROUP BY c_count ORDER BY custdist DESC, c_count DESC; New!
  • 59. ©2024 Databricks Inc. — All rights reserved Get Started With Backwards Compatibility -- If we had started with the inner table subquery of TPC-H Q13 -- from earlier: SELECT c_custkey, COUNT(o_orderkey) c_count FROM customer LEFT OUTER JOIN orders ON c_custkey = o_custkey AND o_comment NOT LIKE '%unusual%packages%' GROUP BY c_custkey -- We can just add more aggregation and sorting steps directly -- to the end like this: |> AGGREGATE COUNT(*) AS custdist GROUP BY c_count |> ORDER BY custdist DESC, c_count DESC; New!
  • 60. SQL UDF / UDTF
  • 61. • SQL User Defined Scalar Functions ‐ Persisted SQL Expressions • SQL User Defined Table Functions ‐ Persisted Parameterized Views • Support named parameter invocation and defaulting • Table functions with lateral correlation Easily extend SQL function library
  • 62. • Encapsulate (complex) expressions, including subqueries • May contain subqueries • Return a scalar value • Can be used in most places where builtin functions go SQL User Defined Scalar Functions
  • 63. SQL User Defined Scalar Functions Persists complex expression patterns > CREATE FUNCTION roll_dice( num_dice INT DEFAULT 1 COMMENT 'number of dice to roll (Default: 1)’, num_sides INT DEFAULT 6 COMMENT 'number of sides per die (Default: 6)' ) COMMENT 'Roll a number of n-sided dice’ RETURN aggregate( sequence(1, roll_dice.num_dice, 1), 0, (acc, x) -> (rand() * roll_dice.num_sides) :: INT, acc -> acc + roll_dice.num_dice ); > SELECT roll_dice(); 3 -- Roll 3 6-sided dice > SELECT roll_dice(3); 15 -- Roll 3 10-sided dice > SELECT roll_dice(3, 10) 21 SQL
• 64. SQL User Defined Table Functions
• Encapsulate (complex) correlated subqueries, aka a parameterized view
• Can be used in the FROM clause
• 65. SQL User Defined Table Functions
Persist complex parameterized queries
CREATE FUNCTION weekdays(start DATE, end DATE)
  RETURNS TABLE (day_of_week STRING, day DATE)
  RETURN SELECT to_char(day, 'E'), day
    FROM (SELECT sequence(weekdays.start, weekdays.end)) AS t(days),
         LATERAL (explode(days)) AS dates(day)
    WHERE extract(DAYOFWEEK_ISO FROM day) BETWEEN 1 AND 5;
SQL
• 66. SQL User Defined Table Functions
Persist complex parameterized queries
> SELECT day_of_week, day
  FROM weekdays(DATE '2024-01-01', DATE '2024-01-14');
 Mon 2024-01-01
 …
 Fri 2024-01-05
 Mon 2024-01-08
SQL
> -- Return weekdays for date ranges originating from a LATERAL correlation
> SELECT weekdays.*
  FROM VALUES (DATE '2020-01-01'), (DATE '2021-01-01') AS starts(start),
       LATERAL weekdays(start, start + INTERVAL '7' DAYS);
 Wed 2020-01-01
 Thu 2020-01-02
 Fri 2020-01-03
 …
SQL
• 67. Named parameter invocation
Self-documenting and safer SQL UDF invocation
> DESCRIBE FUNCTION roll_dice;
 Function: default.roll_dice
 Type:     SCALAR
 Input:    num_dice INT
           num_sides INT
 Returns:  INT
> -- Roll one 10-sided die - skip the dice count
> SELECT roll_dice(num_sides => 10);
 7
> -- Roll 3 10-sided dice - reversed argument order
> SELECT roll_dice(num_sides => 10, num_dice => 3);
 21
SQL
• 71. SQL Scripting
Support for control flow, iterators & error handling natively in SQL
• Control flow → IF/ELSE, CASE
• Looping → WHILE, REPEAT, ITERATE
• Result-set iterator → FOR
• Exception handling → CONTINUE/EXIT
• Parameterized queries → EXECUTE IMMEDIATE
Following the SQL/PSM standard. It’s SQL, but with control flow!
• 72. SQL Scripting
BEGIN
  DECLARE c INT = 10;
  WHILE c > 0 DO
    INSERT INTO t VALUES (c);
    SET VAR c = c - 1;
  END WHILE;
END
SQL
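The WHILE loop above is just one building block; the same compound statement can mix IF/ELSE control flow with the parameterized EXECUTE IMMEDIATE called out on slide 71. Below is a minimal sketch, not from the deck, submitted through spark.sql(); it assumes an existing table t(value INT), and the conf name used to opt in to scripting is an assumption about the 4.0 session setup.
# Hedged sketch: IF/ELSE plus a parameterized EXECUTE IMMEDIATE inside one
# SQL script. Table `t(value INT)` and the conf name are assumptions.
spark.conf.set("spark.sql.scripting.enabled", "true")  # scripting is opt-in in 4.0

spark.sql("""
BEGIN
  DECLARE threshold INT DEFAULT 100;
  DECLARE big_rows BIGINT;

  -- Bind the ? marker from a script variable via USING
  EXECUTE IMMEDIATE
    'SELECT COUNT(*) FROM t WHERE value > ?'
    INTO big_rows
    USING threshold;

  -- Only insert a sentinel row if nothing exceeds the threshold yet
  IF big_rows = 0 THEN
    INSERT INTO t VALUES (threshold);
  END IF;
END
""")
Python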
• 73. SQL Scripting
Example: rename a column across all tables
-- parameters
DECLARE oldColName = 'ColoUr';
DECLARE newColName = 'color';
BEGIN
  DECLARE tableArray ARRAY<STRING>;
  DECLARE tableType STRING;
  DECLARE i INT = 0;
  DECLARE alterQuery STRING;
  SET tableArray = (
    SELECT array_agg(table_name)
    FROM INFORMATION_SCHEMA.columns
    WHERE column_name COLLATE UNICODE_CI = oldColName);
  WHILE i < array_size(tableArray) DO
    SET tableType = (
      SELECT table_type
      FROM INFORMATION_SCHEMA.tables
      WHERE table_name = tableArray[i]);
    IF tableType != 'VIEW' COLLATE UNICODE_CI THEN
      SET alterQuery = 'ALTER TABLE ' || tableArray[i] ||
                       ' RENAME COLUMN ' || oldColName || ' TO ' || newColName;
      EXECUTE IMMEDIATE alterQuery;
    END IF;
    SET i = i + 1;
  END WHILE;
END;
SQL
• 74. Agenda
New Functionalities: Spark Connect, ANSI Mode, Variant Data Types, Collation Support
Extensions: Custom Functions and Procedures
  Arrow Optimized Python UDF Support, Python Data Source APIs, Python UDTF
  SQL Pipe Syntax, SQL UDF/UDTF, Stored Procedures
Usability: PySpark UDF Unified Profiling, Spark Native Plotting
  • 75. PySpark UDF Unified Profiling
• 76. Overview of Unified Profiling
• Key components: performance and memory profiling
• Benefits: tracks function calls, execution time, and memory usage
• Replacement for legacy profiling
• Drawbacks of legacy profiling
• Advantages of the new unified profiling:
  • Session-based, works with Spark Connect, runtime toggling
• 77. Overview of Unified Profiling
• How to enable:
  • Performance profiler: spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")
  • Memory profiler: spark.conf.set("spark.sql.pyspark.udf.profiler", "memory")
• API features: "show", "dump", and "clear" commands
  • Show results:
    • Performance: spark.profile.show(type="perf")
    • Memory: spark.profile.show(type="memory")
  • Dump results: spark.profile.dump("/your_path/...")
  • Clear results: spark.profile.clear()
• 78. PySpark Performance Profiler
from pyspark.sql.functions import pandas_udf

df = spark.range(10)

@pandas_udf("long")
def add1(x):
    return x + 1

added = df.select(add1("id"))

spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")
added.show()
Python
• 80. PySpark Memory Profiler
from pyspark.sql.functions import pandas_udf

df = spark.range(10)

@pandas_udf("long")
def add1(x):
    return x + 1

added = df.select(add1("id"))

spark.conf.set("spark.sql.pyspark.udf.profiler", "memory")
added.show()
Python
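Once the profiled action has run, the commands listed on slide 77 retrieve or reset the collected results. A minimal sketch continuing the example above (the dump path is only illustrative):
# Inspect the collected memory-profile results for the UDFs run above
spark.profile.show(type="memory")

# Optionally persist the raw results to a directory (example path)
spark.profile.dump("/tmp/pyspark_udf_profiles")

# Reset the accumulated profiling results
spark.profile.clear()
Python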
• 82. Spark Native Plotting
Since Spark 4.0, PySpark supports native plotting.
• 83. How does it work?
Plotly is used as the default visualization backend.
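A minimal sketch, not from the deck, of the basic workflow: build a DataFrame, call the plot accessor, and render the returned Plotly figure. The column names are made up, and the x/y keyword pattern is assumed to match the scatter and pie examples shown later in the deck.
# Hedged sketch of PySpark 4.0 native plotting (illustrative column names).
df = spark.createDataFrame(
    [("2024-01", 120), ("2024-02", 135), ("2024-03", 128)],
    ["month", "orders"],
)

# With Plotly as the backend, the plot accessor returns a Plotly figure
fig = df.plot.line(x="month", y="orders")
fig.show()
Python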
• 84. Supported Plot Types
Area plot: Shows data as filled areas under lines. It’s great for showing trends over time or how values add up cumulatively across categories.
Bar plot: Uses bars to represent data. The height (or length) of each bar shows the size of the value.
Horizontal bar plot: Similar to a bar plot, but the bars go sideways, which is useful for longer category names or side-by-side comparisons.
Box plot: Summarizes data by showing its range (minimum to maximum), interquartile range (IQR), the middle value (median), and any unusual data points (outliers).
Density/KDE plot: Shows how data is spread out, using a smooth curve to estimate the data’s distribution.
Histogram: Groups data into ranges (bins) and shows how many values fall into each range. It helps visualize the distribution and frequency of the data.
Line plot: Connects data points with lines, making it easy to see trends or changes over time.
Pie chart: Displays data as slices of a circle, representing proportions or percentages of a whole.
Scatter plot: Shows the relationship between two variables by plotting them as points on a graph.
• 85. Example: Scatter plot
+---------------+-----+------+------------+
|       Category|Sales|Profit|ProfitMargin|
+---------------+-----+------+------------+
|    Electronics|20000|  5000|        25.0|
|      Furniture|15000|  2000|       13.33|
|Office Supplies|12000|  3000|        25.0|
|       Clothing| 8000|  1000|        12.5|
|           Food|10000|  2500|        25.0|
|          Books| 6000|  1500|        25.0|
|         Sports| 7000|  2000|       28.57|
|           Toys| 5000|   500|        10.0|
+---------------+-----+------+------------+

fig = df.plot.scatter(
    x="Sales",
    y="Profit",
    size="ProfitMargin",
    text="Category",
    hover_name="Category",
    color="ProfitMargin",            # Add color gradient based on ProfitMargin
    color_continuous_scale="Plasma", # Use a visually appealing color scale
)
• 86. Example: Pie plot
+--------------------+--------+
|             country|     pop|
+--------------------+--------+
|             Albania| 3600523|
|             Austria| 8199783|
|             Belgium|10392226|
|Bosnia and Herzeg...| 4552198|
|            Bulgaria| 7322858|
|             Croatia| 4493312|
|      Czech Republic|10228744|
|             Denmark| 5468120|
|             Finland| 5238460|
|              France|61083916|
|             Germany|82400996|
|              Greece|10706290|
|             Hungary| 9956108|
|             Iceland|  301931|
|             Ireland| 4109086|
|               Italy|58147733|
|          Montenegro|  684736|
|         Netherlands|16570613|
|              Norway| 4627926|

sdf.plot.pie(x="country", y="pop")
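The other types listed on slide 84 follow the same accessor pattern. A small sketch, not from the deck, reusing the country/population DataFrame above and assuming the bar plot accepts the same x/y keywords as the scatter and pie examples:
# Hedged sketch: a bar plot over the same country/population DataFrame,
# assuming the x/y keyword pattern shown in the scatter and pie examples.
fig = sdf.plot.bar(x="country", y="pop")
fig.show()
Python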
  • 87. Python: The Number One Choice
• 91. PyPI downloads of PySpark in the last 12 months: 210 countries & regions
• 92. Join the community today!
spark.apache.org
github.com/apache/spark
apache-spark.slack.com
spark.apache.org/community/