What s new in spark 2.3 and spark 2.4

1 © Hortonworks Inc. 2011–2018. All rights reserved
What’s New in Apache Spark 2.3 & 2.4
DWS Melbourne, Australia 2019
Robert Hryniewicz

• Apache Spark and Kubernetes
• Native Vectorized ORC and SQL Cache Readers
• Pandas UDFs for PySpark
• Continuous Stream Processing
• Barrier Execution
• Avro/Image Data Source
• Higher-order Functions
Agenda Highlights

Apache Spark 2.3

Native Vectorized ORC Reader
• Native ORC read and write: ‘spark.sql.orc.impl’ to ‘native’.
• Vectorized ORC reader: ‘spark.sql.orc.enableVectorizedReader’ to ‘true’
See also ORC Improvement in Apache Spark 2.3 by Dongjoon Hyun

Normal UDF
Pandas UDFs (a.k.a. Vectorized UDFs)
Apache Spark
Python Worker
Internal Spark data
Convert to standard Java type
Pickled
Unpickled
Evaluate row by row
Convert to Python data

Pandas UDF
Internal Spark data
Apache Arrow format
Convert to Pandas (cheap)
Vectorized operation by Pandas API
Apache Spark
Python Worker

Pandas UDF
Internal Spark data
Apache Arrow format
Apache Spark
Python Worker

Pandas UDF
See also Introducing Pandas UDFs for PySpark by Li Jin

Conversion To/From Pandas With Apache Arrow
• Enable Apache Arrow optimization:
‘spark.sql.execution.arrow.enabled’ to
‘true’.
See also Speeding up PySpark with Apache Arrow by Bryan Cutler

Structured Streaming
Continuous Stream Processing

Structured Streaming: Microbatch
See also Continuous Processing in Structured Streaming by Josh Torres

Structured Streaming: Continuous Processing
See also Continuous Processing in Structured Streaming by Josh Torres

Structured Streaming: Continuous Processing
See also Spark Summit Keynote Demo by Michael Armbrust
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#continuous-processing

Stream-to-Stream Joins
See also Introducing Stream-Stream Joins in Apache Spark 2.3 by Tathagata Das and Joseph Torres

Apache Spark and Kubernetes
See also Running Spark on Kubernetes

Image Support in Spark
• Convert from compressed Images format
(e.g., PNG and JPG) to raw representation
of an image for OpenCV
• One record per one image file
See also SPARK-21866 by Ilya Matiach, and Deep Learning Pipelines for Apache Spark

(Stateless) History Server
History Server Using K-V Store
• Requires storing app lists and UI in the memory
• Requires reading/parsing the whole log file
See also SPARK-18085 and the proposal by Marcelo Vanzin

• Store app lists and UI in a persistent K-V store (LevelDB)
• Set ‘spark.history.store.path’ to use this feature
• The event log written by lower versions is still compatible
See also SPARK-18085 and the proposal by Marcelo Vanzin

R Structured Streaming
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
See also SSR: Structured Streaming on R for Machine Learning by Felix Cheung

R Native Function Execution Stability
See also SPARK-21093

Apache Spark 2.4

Apache Spark 2.4
Barrier Execution
Apache Spark 3.0
• [SPARK-24374] barrier execution mode
• [SPARK-24374] barrier execution mode
• [SPARK-24579] optimized data exchange
• [SPARK-24615] accelerator-aware scheduling
See also Project Hydrogen: Unifying State-of-the-art AI and Big Data in Apache Spark by Reynold Xin
See also Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark by Xiangrui Meng

Pandas UDFs: Grouped Aggregate Pandas UDFs
https://github.com/apache/spark/commit/9786ce66c
https://github.com/apache/spark/commit/b2ce17b4c

Pandas UDFs: Grouped Aggregate Pandas UDFs
https://github.com/apache/spark/pull/22620/commits/06a7bd0c
Internal Spark data
Apache Arrow format
Apache Spark
Python Worker

Eager Evaluation
• Set ‘spark.sql.repl.eagerEval.enabled’ to true to enable eager evaluation in Jupyter

Eager Evaluation
• Set ‘spark.sql.repl.eagerEval.enabled’ to true to enable eager evaluation in Jupyter
See also (ongoing) SPARK-24572 for Eagar Evaluation at R side

Flexible Streaming Sink
• Exposing output rows of each microbatch as a DataFrame
• foreachBatch(f: Dataset[T] => Unit) Scala/Java/Python APIs in DataStreamWriter.

Avro Data Source
• Apache Avro (https://avro.apache.org)
• A data serialization format
• Widely used in the Spark and Hadoop ecosystem, especially for Kafka-based data pipelines.
• Spark-Avro package (https://github.com/databricks/spark-avro)
• Spark SQL can read and write the Avro data.
• Inlining Spark-Avro package [SPARK-24768]
• Better experience for first-time users of Spark SQL and structured streaming
• Expect further improve the adoption of structured streaming

Avro Data Source
• from_avro/to_avro functions to read and write Avro data within a DataFrame instead of
just files.
• Example:
• Decode the Avro data into a struct
• Filter by column `favorite_color`
• Encode the column `name` in Avro format

Avro Data Source
• Refactor Avro Serializer and Deserializer
• External
• Arrow Data -> Row -> InternalRow
• Native
• Arrow Data -> InternalRow

Avro Data Source
• Options:
• compression: compression codec in write
• ignoreExtension: if ignore .avro or not in read
• recordNamespace: record namespace in write
• recordName: top root record name in write
• avroSchema: avro schema to use
• Logical type support:
• Date [SPARK-24772]
• Decimal [SPARK-24774]
• Timestamp [SPARK-24773]

Image Data Source
• Spark datasource for image format
• ImageSchema deprecated use instead:
• SQL syntax support
• Partition discovery

Higher-order Functions
• Takes functions to transform complex datatype like map, array and struct

Higher-order Functions

Built-in Functions
• New or extended built-in functions for ArrayTypes and MapTypes
• 26 functions for ArrayTypes
• transform, filter, reduce, array_distinct, array_intersect, array_union, array_except, array_join,
array_max, array_min, ...
• 8 functions for MapTypes
• map_from_arrays, map_from_entries, map_entries, map_concat, map_filter, map_zip_with,
transform_keys, transform_values

Apache Spark and Kubernetes
• New Spark scheduler backend
• PySpark support [SPARK-23984]
• SparkR support [SPARK-24433]
• Client-mode support [SPARK-23146]
• Support for mounting K8S volumes [SPARK-23529]
Scala 2.12 (Beta) Support
Build Spark against Scala 2.12 [SPARK-14220]

PySpark Custom Worker
• Configuration to select the modules for daemon and worker in PySpark
• Set ‘spark.python.daemon.module’and/or ‘spark.python.worker.module’ to the worker or
daemon modules
See also Remote Python Debugging 4 Spark

Data Source Changes
• CSV
• Option samplingRatio
• for schema inference [SPARK-23846]
• Option enforceSchema
• for throwing an exception when user-
specified schema doesn‘t match the CSV
header [SPARK-23786]
• Option encoding
• for specifying the encoding of outputs.
[SPARK-19018]
• JSON
• Option dropFieldIfAllNull
• for ignoring column of all null values or
empty array/struct during JSON schema
inference [SPARK-23772]
• Option lineSep
• for defining the line separator that should
be used for parsing [SPARK-23765]
• Option encoding
• for specifying the encoding of inputs and
outputs. [SPARK-23723]

Data Source Changes
• Parquet
• Push down
• STRING [SPARK-23972]
• Decimal [SPARK-24549]
• Timestamp [SPARK-24718]
• Date [SPARK-23727]
• Byte/Short [SPARK-24706]
• StringStartsWith [SPARK-24638]
• IN [SPARK-17091]
• ORC
• Native ORC reader is on by default
[SPARK-23456]
• Turn on ORC filter push-down by
default [SPARK-21783]
• Use native ORC reader to read Hive
serde tables by default [SPARK-
22279]

Data Source Changes
• JDBC
• Option queryTimeout
• for the number of seconds the the driver will wait for a Statement object to execute.
[SPARK-23856]
• Option query
• for specifying the query to read from JDBC [SPARK-24423]
• Option pushDownFilters
• for specifying whether the filter pushdown is allowed [SPARK-24288]
• Option cascadeTruncate [SPARK-22880]

What About Apache Spark 3.0?
Spark 2.2.0 RC1
2017/05
Spark 2.2.0 released
2018/07
Spark 2.2.0 RC2, RC3, RC4, RC5
2017/06
Spark 2.2.0 RC6
2017/07
Spark 2.3.0 RC1
2018/01
Spark 2.3.0 RC2, RC3, RC4, RC5
2018/02
Spark 2.3.0 released
2018/02
Spark 2.4.0 RC1
2018/09
Spark 3.0.0
2019/05 (?)
Spark 2.4.0 RC2
2018/10
Spark 2.4.0
2018/11
See also the thread in Spark dev mailing list for Spark 3.0 discussion

Newer Integration for Apache Hive with Apache Spark
• Apache Hive 3 support: Apache Spark
provides a basic Hive compatibility
• Apache Hive ACID table support
• Structured Streaming Support
• Apache Ranger integration support
• Use LLAP and vectorized read/write – fast!
See also this article for Hive warehouse connector

Questions?

Thanks!
Robert Hryniewicz

What s new in spark 2.3 and spark 2.4

Recommended

More Related Content

What's hot (20)

Similar to What s new in spark 2.3 and spark 2.4 (20)

More from DataWorks Summit (20)

Recently uploaded (20)

What s new in spark 2.3 and spark 2.4