Best Practice of Compression/Decompression Codes in Apache Spark with Sophia Sun and Qi Xie

Sophia Sun (sophia.sun@intel.com)
Qi Xie (qi.xie@intel.com)
Hao Cheng (hao.cheng@intel.com)
Best Practice of Compression
Codecs in Spark

2
Legal Disclaimer
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular
purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.
This document contains information on products, services and/or processes in development. All information provided here is subject to change
without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
The products and services described may contain defects or errors known as errata which may cause deviations from published specifications.
Current characterized errata are available on request.
Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by
visiting www.intel.com/design/literature.htm.
Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site
and confirm whether referenced data are accurate.
Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others
Copyright ©2018 Intel Corporation.

3
For Performance Claims and Optimization
Notice
Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests,
such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change
to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating
your contemplated purchases, including the performance of that product when combined with other products.
Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique
to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the
availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations
in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel
microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets
covered by this notice. Notice Revision #20110804.
Performance estimates were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred
to as "Spectre" and "Meltdown." Implementation of these updates may make these results inapplicable to your device or system.
Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others
For more information go to http://www.intel.com/performance.

About me
• Big data software engineer from Intel.
• Focus on Spark performance profiling and optimization
for Intel Architecture.
4

Outlines
• Compression Needs & Motivations
• Data Compression Pipelines in Spark
• Experiment Compression Codecs Intros
• Intel® Codec Accelerator Architecture Overview
• Takeaways
• Future Works
5

Compression Needs
• Compression Needs
• Reduce data volume and save storage space.
• Speed up the disk I/O operations and data transfer across network,
optimize workload performance.
• Trade-off
• Computation overhead for high compression ratio codecs.
6

Motivations
• Understanding popular compression codecs in Spark.
• Take advantage of Intel® optimized libraries or
accelerate hardware for data
compression/decompression.
7

Data Compression Pipeline in Spark
8
Map
Map
Input
A HDFS file
Map
reduce
Output
A HDFS file
reduce
reduce
Intermediate Data
Each Map’s output
Shuffle (Multiple iterations)
Partition 0
Partition 1
Partition 0
Partition 2
Partition 1
Partition 0
Partition 2
Partition 0
Partition 2
Partition 1
Output 0
Output 1
Output 2
Input
split0
Input
split1
Input
split2
Input Decompression
Shuffle Compression
Output Compression
Shuffle Decompression

Data Compression Pipeline in Spark - I/O
Characteristics
• HDFS Storage
• Generally sequence read/write
• Generally one time read/write
9
Shuffle OperationsHDFS Storage
Input Read(Data Decompression) Shuffle Write(Data Compression)
Output Write(Data Compression) Shuffle Read(Data Decompression)
• Shuffle Operations
• Random read/write
• Multiple times read/write

Experiment Compression Codecs Intros
10
Codecs Supported
levels
Default
level
Degree of
Compression
Compression
speed
CPU Usage Comments
ISA-L(igzip) (0~1) 1 Medium Medium Medium~High Based on Intel® ISA-L
ver 2.0.19 optimization
Zlib-ipp (1~9) Best
balance(near
to 6)
High Slow High Based on Intel® IPP
library optimization
Zlib/gzip (1~9) Best
balance(near
to 6)
High Slow High Open source codec
zstd 1~22 3 High Medium Medium~High Open source codec
Lz4-ipp N/A N/A Medium Fast Low Based on Intel® IPP
library optimization
Lz4 Lz4 fast
Lz4 hc
Lz4 fast Low
Medium
Fast
Low
Low
Medium
Open source codec
snappy N/A N/A Low Fast Low Open source codec
High compression ratio codecs
High throughput codecs
Intel® ISA-L reference: https://software.intel.com/en-us/storage/ISA-L ; Intel® IPP reference: https://software.intel.com/en-us/intel-ipp

Compression Level
11
• zstd, gzip, zlib-ipp and igzip support compression level adjustment, while codec lz4 and
snappy does not support.
• No big data size difference among different compression level in TPC-DS parquet format data
generation test.
Compression
codec
Level9
Data Size
Level1
Data Size
*Default level
Data Size
Default
Vs Level9
Level1
Vs
Level9
gzip/zlib 2,500,252,836,007 2,528,269,315,543 2,502,656,222,082 0.096% 1.12%
zlib-ipp 2,482,050,449,516 2,492,687,484,854 2,482,595,509,721 0.022% 0.429%
Compression
codec
*Default level
Data Size
Level6
Data Size
Level9
Data Size
Default
Vs
Level6
Default
Vs
Level 9
zstd 2,472,315,429,619 2,446,857,474,146 2,440,389,051,782 1.04% 1.31%
0 2,000,000,000,000 4,000,000,000,000
gzip
zlib-ipp
zstd
TPC-DS Different Codec Compression
Level Data Size(Raw data: 10TB)
Default* Level 1 Level 6 Level 9
Performance estimates were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown." Implementation of these updates may make these results
inapplicable to your device or system. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using
specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your
contemplated purchases, including the performance of that product when combined with other products. For more information go to www.intel.com/benchmarks. Tests performed by Intel® company. Configurations: see slides 16

Compression in Parquet Format
12
Col 1 Col 2 … Col N
… … … …
Col 1 Col 2 … Col N
Column Chunk
Row Group
Parquet File
…
• Columnar Storage (For Column Pruning)
• Compression / Decompression for each
Column Chunk
• Column Chunk has same data type even same
values (Default Compression Level is usually
effective)

Intel® Codec Accelerator Architecture(1/2)
Notes:
• QAT and ISA-L AVX512 is available on Intel® Skylake-X platform.
• Open Source codec zstd also can build with Intel® ISA-L AVX512 support to accelerate data compression/decompression.

Intel® Codec Accelerator Architecture(2/2)
14

Takeaways
15
• Better to choose high compression codecs for source data* for IO
intensive workload, such as zstd, zlib-ipp, zlib, igzip.
• Better to use high throughput codecs for spark shuffle compression
codec, such as lz4-ipp, lz4.
• Higher compression codec reduce I/O and network pressure, but
consumes CPU resource, use accelerate hardware such as QAT and
FPGA can help to offload CPU resources.
• Zstd can qualify as both a reasonably strong compressor and a fast
one.
• Best balance of compression codec depends on cluster characteristics
and workloads.

Future Plan
• Open source Intel® Codec Accelerator project and make it as well
supported library.
• Add codec compatibility support.
• Integrate with more IA optimized codecs along with the acceleration
library releases under different platform.
• Introduce more big data frameworks (Cassandra / HBase etc.)
• Besides compression / decompression, we will support more types
of codec like the encryption / decryption etc.
• Keep release new version along with new Intel® Platform release or
new acceleration libraries released.

HiBench Sort Workload bottleneck – No
data compression
18
• No compression data has big data size, mapping data make the IO disk as bottleneck in stage0
• No compression data cause big pressure in shuffle stage(Stage1). 10Gb(~1.2GB) network as
bottleneck in experiment environment. While CPU still has much idle resource.
0
500000
1000000
1500000
2000000
0
128
263
388
498
608
718
828
938
1048
1158
1268
1378
1488
1598
1708
1818
Network IO
Sum of rxkB/s
Sum of txkB/s
0
20
40
60
80
100
120
0
101
200
299
398
498
601
712
832
936
1035
1134
1233
1332
1431
1530
1629
1728
1827
Cpu Utilization
Average of %idle
Average of %steal
Average of %iowait
Average of %nice
Average of %system
Average of %user
stage0
stage1

HiBench Sort Workload Resource
Utilization Examples
19
0
50
100
150
0
86
172
258
344
430
516
603
690
778
867
962
Cpu Utilization – zlibipp
Average of %idle
Average of %steal
Average of %iowait
Average of %nice
Average of %system
Average of %user
• CPU as bottleneck on High compression ratio codecs (like
zstd, zlibipp and igzip)
• Codec lz4, lz4ipp and snappy has lower compression ratio,
large size of data read/write caused the disk as the
bottleneck in stage0 and large shuffle data caused network
as bottleneck in stage10
50
100
150
0
94
181
267
367
464
555
643
728
815
905
992
1077
Cpu Utilization – lz4ipp
Average of %idle
Average of %steal
Average of %iowait
Average of %nice
Average of %system
Average of %user
0
1000000
2000000
3000000
0
108
225
327
418
509
600
693
785
878
971
1062
Network IO – lz4ipp
Sum of rxkB/s
Sum of txkB/s
Low Compression ratio codec example
High compression ratio codec example

Best Practice of Compression/Decompression Codes in Apache Spark with Sophia Sun and Qi Xie

More Related Content

What's hot (20)

Similar to Best Practice of Compression/Decompression Codes in Apache Spark with Sophia Sun and Qi Xie (20)

More from Databricks (20)

Recently uploaded (20)

Best Practice of Compression/Decompression Codes in Apache Spark with Sophia Sun and Qi Xie