Sophia Sun (sophia.sun@intel.com)
Qi Xie (qi.xie@intel.com)
Hao Cheng (hao.cheng@intel.com)
Best Practice of Compression Codecs in Spark
Legal Disclaimer
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular
purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.
This document contains information on products, services and/or processes in development. All information provided here is subject to change
without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
The products and services described may contain defects or errors known as errata which may cause deviations from published specifications.
Current characterized errata are available on request.
Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by
visiting www.intel.com/design/literature.htm.
Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site
and confirm whether referenced data are accurate.
Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others
Copyright ©2018 Intel Corporation.
For Performance Claims and Optimization Notice
Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests,
such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change
to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating
your contemplated purchases, including the performance of that product when combined with other products.
Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique
to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the
availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations
in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel
microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets
covered by this notice. Notice Revision #20110804.
Performance estimates were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred
to as "Spectre" and "Meltdown." Implementation of these updates may make these results inapplicable to your device or system.
For more information go to http://www.intel.com/performance.
About me
• Big data software engineer at Intel.
• Focused on Spark performance profiling and optimization for Intel Architecture.
Outline
• Compression Needs & Motivations
• Data Compression Pipeline in Spark
• Introduction to the Evaluated Compression Codecs
• Intel® Codec Accelerator Architecture Overview
• Takeaways
• Future Work
Compression Needs
• Compression needs
  • Reduce data volume and save storage space.
  • Speed up disk I/O and data transfer across the network, improving workload performance.
• Trade-off
  • Codecs with a high compression ratio add computation overhead.
Motivations
• Understand the popular compression codecs in Spark.
• Take advantage of Intel® optimized libraries or hardware accelerators for data compression/decompression.
Data Compression Pipeline in Spark
[Figure: Spark data flow with its compression points. Input: an HDFS file read as input splits 0 to 2, one per map task (input decompression). Intermediate data: each map's output is written as partitions 0 to 2 (shuffle compression). Shuffle (multiple iterations): reduce tasks read their partitions (shuffle decompression). Output: the reduce tasks write Output 0 to 2 to an HDFS file (output compression).]
Data Compression Pipeline in Spark - I/O Characteristics
• HDFS storage
  • Generally sequential read/write
  • Generally read and written once
  • Input read (data decompression), output write (data compression)
• Shuffle operations
  • Random read/write
  • Read and written multiple times
  • Shuffle write (data compression), shuffle read (data decompression)
Introduction to the Evaluated Compression Codecs

| Codec | Supported levels | Default level | Degree of compression | Compression speed | CPU usage | Comments |
| --- | --- | --- | --- | --- | --- | --- |
| ISA-L (igzip) | 0~1 | 1 | Medium | Medium | Medium~High | Based on Intel® ISA-L ver 2.0.19 optimization |
| Zlib-ipp | 1~9 | Best balance (near 6) | High | Slow | High | Based on Intel® IPP library optimization |
| Zlib/gzip | 1~9 | Best balance (near 6) | High | Slow | High | Open source codec |
| zstd | 1~22 | 3 | High | Medium | Medium~High | Open source codec |
| Lz4-ipp | N/A | N/A | Medium | Fast | Low | Based on Intel® IPP library optimization |
| Lz4 (fast) | Lz4 fast, Lz4 hc | Lz4 fast | Low | Fast | Low | Open source codec |
| Lz4 (hc) | Lz4 fast, Lz4 hc | Lz4 fast | Medium | Low | Medium | Open source codec |
| snappy | N/A | N/A | Low | Fast | Low | Open source codec |

High compression ratio codecs: ISA-L (igzip), zlib-ipp, zlib/gzip, zstd.
High throughput codecs: lz4-ipp, lz4, snappy.
Intel® ISA-L reference: https://software.intel.com/en-us/storage/ISA-L ; Intel® IPP reference: https://software.intel.com/en-us/intel-ipp
Compression Level
• zstd, gzip, zlib-ipp and igzip support adjustable compression levels; lz4 and snappy do not.
• In the TPC-DS Parquet data generation test, the compression level made little difference to the data size.
| Compression codec | Level 9 data size | Level 1 data size | Default level* data size | Default vs. Level 9 | Level 1 vs. Level 9 |
| --- | --- | --- | --- | --- | --- |
| gzip/zlib | 2,500,252,836,007 | 2,528,269,315,543 | 2,502,656,222,082 | 0.096% | 1.12% |
| zlib-ipp | 2,482,050,449,516 | 2,492,687,484,854 | 2,482,595,509,721 | 0.022% | 0.429% |

| Compression codec | Default level* data size | Level 6 data size | Level 9 data size | Default vs. Level 6 | Default vs. Level 9 |
| --- | --- | --- | --- | --- | --- |
| zstd | 2,472,315,429,619 | 2,446,857,474,146 | 2,440,389,051,782 | 1.04% | 1.31% |
[Chart: TPC-DS data size by codec and compression level (raw data: 10 TB). Bars for gzip, zlib-ipp and zstd at Default*, Level 1, Level 6 and Level 9; the size axis runs from 0 to 4,000,000,000,000.]
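The percentage columns in the tables above can be reproduced from the reported data sizes. The short Python sketch below is one way to do that arithmetic; the dictionary layout and the helper name `pct_larger` are illustrative choices, not part of the original slides, and the sizes are copied as reported.

```python
# Data sizes copied from the slide's tables (units as reported there).
SIZES = {
    "gzip/zlib": {
        "level9": 2_500_252_836_007,
        "level1": 2_528_269_315_543,
        "default": 2_502_656_222_082,
    },
    "zlib-ipp": {
        "level9": 2_482_050_449_516,
        "level1": 2_492_687_484_854,
        "default": 2_482_595_509_721,
    },
    "zstd": {
        "default": 2_472_315_429_619,
        "level6": 2_446_857_474_146,
        "level9": 2_440_389_051_782,
    },
}


def pct_larger(size: int, baseline: int) -> float:
    """Return how much larger `size` is than `baseline`, in percent."""
    return (size - baseline) / baseline * 100.0


if __name__ == "__main__":
    for codec in ("gzip/zlib", "zlib-ipp"):
        s = SIZES[codec]
        print(codec, "default vs level 9:", pct_larger(s["default"], s["level9"]))
        print(codec, "level 1 vs level 9:", pct_larger(s["level1"], s["level9"]))
    z = SIZES["zstd"]
    print("zstd default vs level 6:", pct_larger(z["default"], z["level6"]))
    print("zstd default vs level 9:", pct_larger(z["default"], z["level9"]))
```

Each difference is taken relative to the higher compression level, which is the convention that matches the percentages on the slide.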
Performance estimates were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown." Implementation of these updates may make these results
inapplicable to your device or system. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using
specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your
contemplated purchases, including the performance of that product when combined with other products. For more information go to www.intel.com/benchmarks. Tests performed by Intel® company. Configurations: see slides 16
Compression in Parquet Format
[Figure: Parquet file layout. A Parquet file is divided into row groups; each row group holds one column chunk per column (Col 1, Col 2, …, Col N).]
• Columnar storage (enables column pruning).
• Compression and decompression are applied to each column chunk.
• A column chunk holds values of a single data type, often even the same values, so the default compression level is usually effective.
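To get a feel for why the level matters little on column-like data, the sketch below compresses a synthetic low-cardinality column at several levels with Python's standard `zlib` module. It is an illustration only: the column contents, the row count and the helper names are hypothetical, and it uses neither Parquet nor the Intel-optimized codecs from the slides.

```python
import random
import zlib


def column_chunk(n_rows: int = 200_000, seed: int = 0) -> bytes:
    """Build a stand-in for one column chunk: one data type, few distinct values."""
    rng = random.Random(seed)
    values = [b"US", b"CN", b"DE", b"IN"]  # hypothetical low-cardinality column
    return b"\n".join(rng.choice(values) for _ in range(n_rows))


def sizes_by_level(data: bytes, levels=(1, 6, 9)) -> dict:
    """Compress `data` at each zlib level and return the compressed sizes in bytes."""
    return {level: len(zlib.compress(data, level)) for level in levels}


if __name__ == "__main__":
    raw = column_chunk()
    for level, size in sizes_by_level(raw).items():
        print(f"level {level}: {size} bytes ({size / len(raw):.2%} of raw)")
```

Running it on data that resembles your own columns is a cheap way to check whether a higher level is worth the extra CPU time before changing any cluster setting.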
Intel® Codec Accelerator Architecture(1/2)
Notes:
• QAT and ISA-L with AVX-512 are available on the Intel® Skylake-X platform.
• The open source codec zstd can also be built with Intel® ISA-L AVX-512 support to accelerate data compression/decompression.
Intel® Codec Accelerator Architecture(2/2)
Takeaways
• For I/O-intensive workloads, prefer high compression ratio codecs such as zstd, zlib-ipp, zlib or igzip for source data*.
• For Spark shuffle compression, prefer high throughput codecs such as lz4-ipp or lz4.
• Codecs with a higher compression ratio reduce I/O and network pressure but consume more CPU; hardware accelerators such as QAT and FPGAs can offload that work from the CPU.
• zstd qualifies as both a reasonably strong compressor and a fast one.
• The best-balanced codec depends on cluster characteristics and workloads.
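As a rough illustration of the first two takeaways, the sketch below collects stock Apache Spark properties that control output and shuffle compression and turns them into `spark-submit` arguments. The property names are standard Spark settings; the chosen values are assumptions for illustration (gzip stands in for the zlib family, and the Intel-optimized codecs from the slides need additional libraries that are not shown), and the helper `to_submit_args` is hypothetical.

```python
# Standard Spark property names; the values are illustrative assumptions.
SPARK_COMPRESSION_CONF = {
    # Source/output data written as Parquet: favor compression ratio.
    "spark.sql.parquet.compression.codec": "gzip",
    # Shuffle data: keep compression on and favor throughput.
    "spark.shuffle.compress": "true",
    "spark.io.compression.codec": "lz4",
}


def to_submit_args(conf: dict) -> list:
    """Turn a property dict into a flat list of spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(["spark-submit"] + to_submit_args(SPARK_COMPRESSION_CONF)))
```

Which values are accepted depends on the Spark version and the libraries installed on the cluster, so treat the dictionary as a starting point to test rather than a recommendation.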
Future Plan
• Open source the Intel® Codec Accelerator project and make it a well-supported library.
• Add codec compatibility support.
• Integrate more IA-optimized codecs as acceleration libraries are released for different platforms.
• Support more big data frameworks (Cassandra, HBase, etc.).
• Beyond compression/decompression, support more types of codec, such as encryption/decryption.
• Keep releasing new versions alongside new Intel® platform releases and new acceleration library releases.
Thanks!
HiBench Sort Workload Bottleneck – No Data Compression
• Uncompressed data is large, so the map stage's data makes disk I/O the bottleneck in stage0.
• Uncompressed data puts heavy pressure on the shuffle stage (stage1): the 10 Gb/s (~1.2 GB/s) network is the bottleneck in the experiment environment, while the CPU still has plenty of idle capacity.
[Charts: Network IO (sum of rxkB/s and sum of txkB/s) and CPU utilization (average of %idle, %steal, %iowait, %nice, %system, %user) over the run, with stage0 and stage1 marked.]
HiBench Sort Workload Resource Utilization Examples
• The CPU is the bottleneck with high compression ratio codecs (such as zstd, zlibipp and igzip).
• The codecs lz4, lz4ipp and snappy have lower compression ratios: the large volume of data read and written makes the disk the bottleneck in stage0, and the large shuffle data makes the network the bottleneck in stage1.
[Chart, high compression ratio codec example: CPU utilization with zlibipp (average of %idle, %steal, %iowait, %nice, %system, %user).]
[Charts, low compression ratio codec example: CPU utilization with lz4ipp (same series) and Network IO with lz4ipp (sum of rxkB/s and sum of txkB/s).]
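The two regimes described on this slide can be illustrated with a deliberately simple model: a stage takes roughly as long as the slower of codec work and data movement. The Python sketch below encodes that assumption; the function name, the codec throughputs and the compression ratios are hypothetical values chosen to show both outcomes, not measurements from the talk.

```python
def stage_estimate(raw_gb, ratio, codec_gb_per_s, io_gb_per_s):
    """Return (bottleneck, seconds) for one stage under a max(cpu, io) model."""
    cpu_s = raw_gb / codec_gb_per_s        # time the codec needs for the raw data
    io_s = (raw_gb / ratio) / io_gb_per_s  # time to move the compressed bytes
    return ("cpu", cpu_s) if cpu_s > io_s else ("io", io_s)


# A 10 Gb/s link carries 10 / 8 = 1.25 GB/s before protocol overhead.
NETWORK_GB_PER_S = 10 / 8

if __name__ == "__main__":
    # Hypothetical "strong but slow" codec: CPU-bound.
    print(stage_estimate(100, ratio=4.0, codec_gb_per_s=0.2, io_gb_per_s=NETWORK_GB_PER_S))
    # Hypothetical "weak but fast" codec: network-bound.
    print(stage_estimate(100, ratio=2.0, codec_gb_per_s=4.0, io_gb_per_s=NETWORK_GB_PER_S))
```

Real stages overlap computation and I/O and share resources across tasks, so this is only a way to reason about which side is likely to saturate first on a given cluster.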

More Related Content

What's hot (20)

PDF
Parquet performance tuning: the missing guide
Ryan Blue
 
PDF
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Databricks
 
PDF
The Parquet Format and Performance Optimization Opportunities
Databricks
 
PDF
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark Summit
 
PDF
Spark SQL
Joud Khattab
 
PPTX
Using Apache Hive with High Performance
Inderaj (Raj) Bains
 
PDF
Apache Iceberg: An Architectural Look Under the Covers
ScyllaDB
 
PDF
Iceberg: a fast table format for S3
DataWorks Summit
 
PPTX
Performance Optimizations in Apache Impala
Cloudera, Inc.
 
PDF
Making Apache Spark Better with Delta Lake
Databricks
 
PDF
Efficient Data Storage for Analytics with Apache Parquet 2.0
Cloudera, Inc.
 
PDF
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Spark Summit
 
PDF
Hudi architecture, fundamentals and capabilities
Nishith Agarwal
 
PPTX
Elastic Stack Introduction
Vikram Shinde
 
PPTX
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Flink Forward
 
PDF
Apache Iceberg - A Table Format for Hige Analytic Datasets
Alluxio, Inc.
 
PDF
Enabling Vectorized Engine in Apache Spark
Kazuaki Ishizaki
 
PDF
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Databricks
 
PDF
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Adam Doyle
 
PDF
Mastering PostgreSQL Administration
EDB
 
Parquet performance tuning: the missing guide
Ryan Blue
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Databricks
 
The Parquet Format and Performance Optimization Opportunities
Databricks
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark Summit
 
Spark SQL
Joud Khattab
 
Using Apache Hive with High Performance
Inderaj (Raj) Bains
 
Apache Iceberg: An Architectural Look Under the Covers
ScyllaDB
 
Iceberg: a fast table format for S3
DataWorks Summit
 
Performance Optimizations in Apache Impala
Cloudera, Inc.
 
Making Apache Spark Better with Delta Lake
Databricks
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Cloudera, Inc.
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Spark Summit
 
Hudi architecture, fundamentals and capabilities
Nishith Agarwal
 
Elastic Stack Introduction
Vikram Shinde
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Flink Forward
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Alluxio, Inc.
 
Enabling Vectorized Engine in Apache Spark
Kazuaki Ishizaki
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Databricks
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Adam Doyle
 
Mastering PostgreSQL Administration
EDB
 

Similar to Best Practice of Compression/Decompression Codes in Apache Spark with Sophia Sun and Qi Xie (20)

PDF
QATCodec: past, present and future
boxu42
 
PDF
Optimizing Apache Spark Throughput Using Intel Optane and Intel Memory Drive...
Databricks
 
PDF
Crooke CWF Keynote FINAL final platinum
Alan Frost
 
PDF
Accelerate Your Apache Spark with Intel Optane DC Persistent Memory
Databricks
 
PDF
Intel Technologies for High Performance Computing
Intel Software Brasil
 
PDF
Intel: How to Use Alluxio to Accelerate BigData Analytics on the Cloud and Ne...
Alluxio, Inc.
 
PDF
Trends in Systems and How to Get Efficient Performance
inside-BigData.com
 
PDF
TDC2018SP | Trilha IA - Inteligencia Artificial na Arquitetura Intel
tdc-globalcode
 
PDF
Big data intel platform commenting
Intel IT Center
 
PDF
Accelerating Apache Spark-based Analytics on Intel Architecture-(Michael Gree...
Spark Summit
 
PDF
What are latest new features that DPDK brings into 2018?
Michelle Holley
 
PDF
Intel® Open Image Denoise in Unity*
Intel® Software
 
PDF
TDC2017 | São Paulo - Trilha Machine Learning How we figured out we had a SRE...
tdc-globalcode
 
PDF
Tendências da junção entre Big Data Analytics, Machine Learning e Supercomput...
Igor José F. Freitas
 
PDF
Big Data Intel® Platform
xband
 
PDF
FPGAs and Machine Learning
inside-BigData.com
 
PPTX
E5 Intel Xeon Processor E5 Family Making the Business Case
Intel IT Center
 
PDF
How to Boost 100x Performance for Real World Application with Apache Spark-(G...
Spark Summit
 
PDF
AIDC Summit LA- Hands-on Training
Intel® Software
 
PPTX
Unleashing Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Inside the ...
Intel® Software
 
QATCodec: past, present and future
boxu42
 
Optimizing Apache Spark Throughput Using Intel Optane and Intel Memory Drive...
Databricks
 
Crooke CWF Keynote FINAL final platinum
Alan Frost
 
Accelerate Your Apache Spark with Intel Optane DC Persistent Memory
Databricks
 
Intel Technologies for High Performance Computing
Intel Software Brasil
 
Intel: How to Use Alluxio to Accelerate BigData Analytics on the Cloud and Ne...
Alluxio, Inc.
 
Trends in Systems and How to Get Efficient Performance
inside-BigData.com
 
TDC2018SP | Trilha IA - Inteligencia Artificial na Arquitetura Intel
tdc-globalcode
 
Big data intel platform commenting
Intel IT Center
 
Accelerating Apache Spark-based Analytics on Intel Architecture-(Michael Gree...
Spark Summit
 
What are latest new features that DPDK brings into 2018?
Michelle Holley
 
Intel® Open Image Denoise in Unity*
Intel® Software
 
TDC2017 | São Paulo - Trilha Machine Learning How we figured out we had a SRE...
tdc-globalcode
 
Tendências da junção entre Big Data Analytics, Machine Learning e Supercomput...
Igor José F. Freitas
 
Big Data Intel® Platform
xband
 
FPGAs and Machine Learning
inside-BigData.com
 
E5 Intel Xeon Processor E5 Family Making the Business Case
Intel IT Center
 
How to Boost 100x Performance for Real World Application with Apache Spark-(G...
Spark Summit
 
AIDC Summit LA- Hands-on Training
Intel® Software
 
Unleashing Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Inside the ...
Intel® Software
 
Ad

More from Databricks (20)

PPTX
DW Migration Webinar-March 2022.pptx
Databricks
 
PPTX
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
PPT
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
PPTX
Data Lakehouse Symposium | Day 2
Databricks
 
PPTX
Data Lakehouse Symposium | Day 4
Databricks
 
PDF
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
PDF
Democratizing Data Quality Through a Centralized Platform
Databricks
 
PDF
Learn to Use Databricks for Data Science
Databricks
 
PDF
Why APM Is Not the Same As ML Monitoring
Databricks
 
PDF
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
PDF
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
PDF
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
PDF
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
PDF
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
PDF
Sawtooth Windows for Feature Aggregations
Databricks
 
PDF
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
PDF
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
PDF
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
PDF
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
PDF
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
Ad

Recently uploaded (20)

PDF
Using AI/ML for Space Biology Research
VICTOR MAESTRE RAMIREZ
 
PPTX
ER_Model_with_Diagrams_Presentation.pptx
dharaadhvaryu1992
 
PPTX
办理学历认证InformaticsLetter新加坡英华美学院毕业证书,Informatics成绩单
Taqyea
 
PPTX
What Is Data Integration and Transformation?
subhashenia
 
PDF
A GraphRAG approach for Energy Efficiency Q&A
Marco Brambilla
 
PDF
Driving Employee Engagement in a Hybrid World.pdf
Mia scott
 
PDF
SQL for Accountants and Finance Managers
ysmaelreyes
 
PDF
Group 5_RMB Final Project on circular economy
pgban24anmola
 
PPTX
在线购买英国本科毕业证苏格兰皇家音乐学院水印成绩单RSAMD学费发票
Taqyea
 
PPTX
Feb 2021 Ransomware Recovery presentation.pptx
enginsayin1
 
PPTX
美国史蒂文斯理工学院毕业证书{SIT学费发票SIT录取通知书}哪里购买
Taqyea
 
PPTX
thid ppt defines the ich guridlens and gives the information about the ICH gu...
shaistabegum14
 
PPT
tuberculosiship-2106031cyyfuftufufufivifviviv
AkshaiRam
 
PPTX
b6057ea5-8e8c-4415-90c0-ed8e9666ffcd.pptx
Anees487379
 
PPTX
SlideEgg_501298-Agentic AI.pptx agentic ai
530BYManoj
 
PPTX
Comparative Study of ML Techniques for RealTime Credit Card Fraud Detection S...
Debolina Ghosh
 
PDF
Data Science Course Certificate by Sigma Software University
Stepan Kalika
 
PPTX
BinarySearchTree in datastructures in detail
kichokuttu
 
PDF
Optimizing Large Language Models with vLLM and Related Tools.pdf
Tamanna36
 
PDF
apidays Singapore 2025 - Building a Federated Future, Alex Szomora (GSMA)
apidays
 
Using AI/ML for Space Biology Research
VICTOR MAESTRE RAMIREZ
 
ER_Model_with_Diagrams_Presentation.pptx
dharaadhvaryu1992
 
办理学历认证InformaticsLetter新加坡英华美学院毕业证书,Informatics成绩单
Taqyea
 
What Is Data Integration and Transformation?
subhashenia
 
A GraphRAG approach for Energy Efficiency Q&A
Marco Brambilla
 
Driving Employee Engagement in a Hybrid World.pdf
Mia scott
 
SQL for Accountants and Finance Managers
ysmaelreyes
 
Group 5_RMB Final Project on circular economy
pgban24anmola
 
在线购买英国本科毕业证苏格兰皇家音乐学院水印成绩单RSAMD学费发票
Taqyea
 
Feb 2021 Ransomware Recovery presentation.pptx
enginsayin1
 
美国史蒂文斯理工学院毕业证书{SIT学费发票SIT录取通知书}哪里购买
Taqyea
 
thid ppt defines the ich guridlens and gives the information about the ICH gu...
shaistabegum14
 
tuberculosiship-2106031cyyfuftufufufivifviviv
AkshaiRam
 
b6057ea5-8e8c-4415-90c0-ed8e9666ffcd.pptx
Anees487379
 
SlideEgg_501298-Agentic AI.pptx agentic ai
530BYManoj
 
Comparative Study of ML Techniques for RealTime Credit Card Fraud Detection S...
Debolina Ghosh
 
Data Science Course Certificate by Sigma Software University
Stepan Kalika
 
BinarySearchTree in datastructures in detail
kichokuttu
 
Optimizing Large Language Models with vLLM and Related Tools.pdf
Tamanna36
 
apidays Singapore 2025 - Building a Federated Future, Alex Szomora (GSMA)
apidays
 

Best Practice of Compression/Decompression Codes in Apache Spark with Sophia Sun and Qi Xie

  • 1. Sophia Sun ([email protected]) Qi Xie ([email protected]) Hao Cheng ([email protected]) Best Practice of Compression Codecs in Spark
  • 2. 2 Legal Disclaimer No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document. Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade. This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps. The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request. Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm. Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others Copyright ©2018 Intel Corporation.
  • 3. 3 For Performance Claims and Optimization Notice Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice Revision #20110804. Performance estimates were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown." Implementation of these updates may make these results inapplicable to your device or system. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others For more information go to http://www.intel.com/performance.
  • 4. About me • Big data software engineer from Intel. • Focus on Spark performance profiling and optimization for Intel Architecture. 4
  • 5. Outlines • Compression Needs & Motivations • Data Compression Pipelines in Spark • Experiment Compression Codecs Intros • Intel® Codec Accelerator Architecture Overview • Takeaways • Future Works 5
  • 6. Compression Needs • Compression Needs • Reduce data volume and save storage space. • Speed up the disk I/O operations and data transfer across network, optimize workload performance. • Trade-off • Computation overhead for high compression ratio codecs. 6
  • 7. Motivations • Understanding popular compression codecs in Spark. • Take advantage of Intel® optimized libraries or accelerate hardware for data compression/decompression. 7
  • 8. Data Compression Pipeline in Spark 8 Map Map Input A HDFS file Map reduce Output A HDFS file reduce reduce Intermediate Data Each Map’s output Shuffle (Multiple iterations) Partition 0 Partition 1 Partition 0 Partition 2 Partition 1 Partition 0 Partition 2 Partition 0 Partition 2 Partition 1 Output 0 Output 1 Output 2 Input split0 Input split1 Input split2 Input Decompression Shuffle Compression Output Compression Shuffle Decompression
  • 9. Data Compression Pipeline in Spark - I/O Characteristics • HDFS Storage • Generally sequence read/write • Generally one time read/write 9 Shuffle OperationsHDFS Storage Input Read(Data Decompression) Shuffle Write(Data Compression) Output Write(Data Compression) Shuffle Read(Data Decompression) • Shuffle Operations • Random read/write • Multiple times read/write
  • 10. Experiment Compression Codecs Intros 10 Codecs Supported levels Default level Degree of Compression Compression speed CPU Usage Comments ISA-L(igzip) (0~1) 1 Medium Medium Medium~High Based on Intel® ISA-L ver 2.0.19 optimization Zlib-ipp (1~9) Best balance(near to 6) High Slow High Based on Intel® IPP library optimization Zlib/gzip (1~9) Best balance(near to 6) High Slow High Open source codec zstd 1~22 3 High Medium Medium~High Open source codec Lz4-ipp N/A N/A Medium Fast Low Based on Intel® IPP library optimization Lz4 Lz4 fast Lz4 hc Lz4 fast Low Medium Fast Low Low Medium Open source codec snappy N/A N/A Low Fast Low Open source codec High compression ratio codecs High throughput codecs Intel® ISA-L reference: https://software.intel.com/en-us/storage/ISA-L ; Intel® IPP reference: https://software.intel.com/en-us/intel-ipp
  • 11. Compression Level 11 • zstd, gzip, zlib-ipp and igzip support compression level adjustment, while codec lz4 and snappy does not support. • No big data size difference among different compression level in TPC-DS parquet format data generation test. Compression codec Level9 Data Size Level1 Data Size *Default level Data Size Default Vs Level9 Level1 Vs Level9 gzip/zlib 2,500,252,836,007 2,528,269,315,543 2,502,656,222,082 0.096% 1.12% zlib-ipp 2,482,050,449,516 2,492,687,484,854 2,482,595,509,721 0.022% 0.429% Compression codec *Default level Data Size Level6 Data Size Level9 Data Size Default Vs Level6 Default Vs Level 9 zstd 2,472,315,429,619 2,446,857,474,146 2,440,389,051,782 1.04% 1.31% 0 2,000,000,000,000 4,000,000,000,000 gzip zlib-ipp zstd TPC-DS Different Codec Compression Level Data Size(Raw data: 10TB) Default* Level 1 Level 6 Level 9 Performance estimates were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown." Implementation of these updates may make these results inapplicable to your device or system. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to www.intel.com/benchmarks. Tests performed by Intel® company. Configurations: see slides 16
  • 12. Compression in Parquet Format
    [Figure: a Parquet file is made up of row groups; each row group holds one column chunk per column (Col 1, Col 2, …, Col N)]
    • Columnar storage (for column pruning)
    • Compression / decompression is applied to each column chunk
    • A column chunk holds values of a single data type, often even identical values, so the default compression level is usually effective
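The effect of compressing each column chunk separately can be illustrated without Parquet itself. The following is a minimal sketch, assuming Python's standard `zlib` module as a stand-in codec and a hypothetical three-column table; it is not the Parquet file format, only the same idea of one independently compressed chunk per column.

```python
import zlib

# Hypothetical three-column table; values chosen only for illustration.
rows = [(i, ("US", "DE", "JP", "BR")[i % 4], "ok") for i in range(10_000)]

def encode(values) -> bytes:
    """Serialize a sequence of values as newline-separated UTF-8 text."""
    return "\n".join(str(v) for v in values).encode("utf-8")

# Columnar layout: one chunk per column, each compressed independently
# at zlib's default level.
columns = list(zip(*rows))
chunks = [zlib.compress(encode(col)) for col in columns]

for name, col, chunk in zip(("id", "country", "status"), columns, chunks):
    print(f"{name}: {len(encode(col))} bytes raw -> {len(chunk)} bytes compressed")

# Each chunk decompresses on its own, so a reader that needs only one
# column can skip the others (column pruning).
status_only = zlib.decompress(chunks[2]).decode("utf-8").split("\n")
```

Columns with a single type and many repeated values, like `status` here, shrink to a small fraction of their raw size even at the default level.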
  • 13. Intel® Codec Accelerator Architecture (1/2)
    Notes:
    • QAT and ISA-L AVX512 are available on the Intel® Skylake-X platform.
    • The open source codec zstd can also be built with Intel® ISA-L AVX512 support to accelerate data compression/decompression.
  • 14. Intel® Codec Accelerator Architecture (2/2)
  • 15. Takeaways
    • For IO-intensive workloads, prefer high compression ratio codecs for source data*, such as zstd, zlib-ipp, zlib and igzip.
    • For Spark shuffle compression, prefer high throughput codecs, such as lz4-ipp and lz4.
    • Higher compression ratio codecs reduce I/O and network pressure but consume more CPU; acceleration hardware such as QAT and FPGA can help offload that CPU work.
    • Zstd qualifies as both a reasonably strong compressor and a fast one.
    • The best balance of compression codec depends on cluster characteristics and workloads.
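The first two takeaways map onto separate Spark settings. Here is a minimal sketch, assuming stock Apache Spark property names and stock codec values; the Intel-optimized codecs named in the slides (lz4-ipp, zlib-ipp, igzip) require their own library and configuration, which are not shown, and the application name `app.py` and helper `to_submit_args` are hypothetical.

```python
# Stock Spark properties: a high compression ratio codec for Parquet source
# data, a high throughput codec for shuffle and other internal data.
# Codec availability depends on the Spark and Parquet versions in use.
codec_conf = {
    "spark.sql.parquet.compression.codec": "zstd",
    "spark.shuffle.compress": "true",
    "spark.io.compression.codec": "lz4",
}

def to_submit_args(conf: dict) -> list:
    """Render a conf mapping as repeated `--conf key=value` arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args

# `app.py` is a placeholder for the actual Spark application.
print(" ".join(["spark-submit"] + to_submit_args(codec_conf) + ["app.py"]))
```

Keeping the two codecs in separate properties lets a cluster trade CPU for storage on source data without slowing down shuffle.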
  • 16. Future Plan
    • Open source the Intel® Codec Accelerator project and make it a well-supported library.
    • Add codec compatibility support.
    • Integrate more IA-optimized codecs as acceleration libraries are released for different platforms.
    • Support more big data frameworks (Cassandra, HBase, etc.).
    • Beyond compression / decompression, support more codec types such as encryption / decryption.
    • Keep releasing new versions along with new Intel® platform releases or new acceleration library releases.
  • 18. HiBench Sort Workload bottleneck – No data compression
    • Uncompressed data is large; reading and writing the map-side data makes disk IO the bottleneck in stage0.
    • Uncompressed data puts heavy pressure on the shuffle stage (stage1): the 10 Gb/s (~1.2 GB/s) network is the bottleneck in the experiment environment, while the CPU still has plenty of idle capacity.
    [Chart: Network IO; series: Sum of rxkB/s, Sum of txkB/s]
    [Chart: Cpu Utilization, stage0 and stage1; series: Average of %idle, %steal, %iowait, %nice, %system, %user]
    Tests performed by Intel® company. Configurations: see slides 16
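The "10Gb (~1.2GB)" figure on this slide is a unit conversion from link speed in bits to throughput in bytes. Below is a minimal sketch of that arithmetic, assuming a nominal 10 Gb/s link with no protocol overhead; whether the charts' kB means 1000 or 1024 bytes is not stated in the slides, so both are computed. The helper name `link_capacity` is hypothetical.

```python
def link_capacity(gbit_per_s: float) -> dict:
    """Convert a nominal link speed in Gb/s to byte-based throughput units."""
    bytes_per_s = gbit_per_s * 1e9 / 8  # 8 bits per byte
    return {
        "GB/s": bytes_per_s / 1e9,
        "kB/s (1000 B)": bytes_per_s / 1000,
        "kB/s (1024 B)": bytes_per_s / 1024,
    }

cap = link_capacity(10)
for unit, value in cap.items():
    print(f"{unit}: {value:,.3f}")
```

A per-node receive or transmit rate close to this ceiling indicates a saturated link, which is the situation the slide describes for the shuffle stage.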
  • 19. HiBench Sort Workload Resource Utilization Examples
    • With high compression ratio codecs (such as zstd, zlibipp and igzip), the CPU is the bottleneck.
    • lz4, lz4ipp and snappy have lower compression ratios: the large volume of data read and written makes the disk the bottleneck in stage0, and the large shuffle data makes the network the bottleneck in stage1.
    [Chart: Cpu Utilization – zlibipp (high compression ratio codec example); series: Average of %idle, %steal, %iowait, %nice, %system, %user]
    [Chart: Cpu Utilization – lz4ipp (low compression ratio codec example); series: Average of %idle, %steal, %iowait, %nice, %system, %user]
    [Chart: Network IO – lz4ipp (low compression ratio codec example); series: Sum of rxkB/s, Sum of txkB/s]
    Tests performed by Intel® company. Configurations: see slides 16