Spark performance tuning for Apache Kylin
Shaofeng Shi
Background
• Kylin 2.0 started using Spark as the Cube build engine
• Proven to improve build performance by 2x to 3x
• Requires Spark tuning experience
• Kylin 2.5 will move more jobs onto Spark
• Convert to HFile (KYLIN-3427)
• Merge segments (KYLIN-3441)
• Merge dictionaries on YARN (KYLIN-3471)
• Fact distinct columns in Spark (KYLIN-3442)
• In the future, Spark engine will replace MR
Cubing in Spark
Agenda
• Why Spark
• Spark on YARN Model
• Spark Executor Memory Model
• Executor/Driver memory/core configuration
• Dynamic Resource Allocation
• RDD Partitioning
• Shuffle
• Compression
• DFS Replication
• Deploy Modes
• Other Tips
Why Apache Spark
• Fast, memory-centric distributed computing framework
• Flexible API
• Spark Core
• DataFrames, Datasets and SparkSQL
• Spark Streaming
• MLLib/SparkR
• Language support
• Java, Scala, Python, R
• Deployment options:
• Standalone/YARN/Mesos/Kubernetes (Spark 2.3+)
Spark on YARN memory model
• Overhead memory
• The JVM needs extra memory to run
• By default: executor memory * 0.1, with a minimum of 384 MB
• Executor memory
Spark on YARN memory model (cont.)
• If you allocate 4 GB to an executor, Spark will request:
• 4 * 0.1 + 4 = 4.4 GB as the container memory from YARN (see the sketch below)
• From our observation, the default factor (0.1) is a little small for Kylin; the executor is very likely to be killed.
• Give 1 GB or more as overhead memory
• spark.yarn.executor.memoryOverhead=1024
• From Kylin 2.5, the default overhead request is 1 GB.
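To make the arithmetic explicit, here is a minimal Scala sketch of the container request; the 0.1 factor and 384 MB floor are the defaults described above, and the function is an illustration rather than Spark's actual code:

```scala
// Sketch: YARN container size = executor memory + overhead (default 10%, min 384 MB).
object ContainerSizeSketch {
  private val OverheadFactor = 0.10
  private val MinOverheadMB  = 384

  def containerMB(executorMB: Int, overheadOverrideMB: Option[Int] = None): Int = {
    val overhead = overheadOverrideMB.getOrElse(
      math.max((executorMB * OverheadFactor).toInt, MinOverheadMB))
    executorMB + overhead
  }

  def main(args: Array[String]): Unit = {
    println(containerMB(4096))             // 4096 + 409 MB ~= 4.4 GB requested from YARN
    println(containerMB(4096, Some(1024))) // with spark.yarn.executor.memoryOverhead=1024 -> 5 GB
  }
}
```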
Spark executor memory model
• Reserved memory
• 300 MB, reserved just to avoid OOM
• Spark memory
• spark.memory.fraction=0.6
• For both storage/cache and
execution (shuffle, sort)
• spark.memory.storageFraction=0.5: cache and execution split half and half
• User memory
• The rest is for user code execution
Spark executor memory model (cont.)
• An example (see the sketch below):
• Given an executor with 4 GB memory, its max storage/execution memory is:
• (4096 – 300) * 0.6 = 2.27 GB
• If the executor needs to run computation (sorting/shuffling), the space for the RDD cache can shrink to:
• 2.27 GB * 0.5 = 1.13 GB
• User memory:
• (4096 – 300) * 0.4 = 1.52 GB
• When you have big dictionaries, consider allocating more to user memory
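The same example as a small Scala sketch; the 300 MB reserved memory and the two fractions are the defaults from the previous slide, and the numbers are only an illustration of the unified memory model:

```scala
// Rough breakdown of a 4 GB executor heap under Spark's unified memory model.
object ExecutorMemorySketch {
  def main(args: Array[String]): Unit = {
    val executorMB      = 4096
    val reservedMB      = 300
    val memoryFraction  = 0.6   // spark.memory.fraction
    val storageFraction = 0.5   // spark.memory.storageFraction

    val usableMB  = executorMB - reservedMB
    val sparkMB   = usableMB * memoryFraction         // storage + execution, ~2.27 GB
    val storageMB = sparkMB * storageFraction         // RDD cache portion, ~1.13 GB
    val userMB    = usableMB * (1 - memoryFraction)   // user code, e.g. dictionaries, ~1.52 GB

    println(f"spark = $sparkMB%.0f MB, storage = $storageMB%.0f MB, user = $userMB%.0f MB")
  }
}
```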
Executor memory/core configuration
• Check how much memory and how many cores are available in your Hadoop cluster
• To maximize resource utilization, use a similar ratio for Spark.
• For example, a cluster has 100 cores and 500 GB memory. You can allocate 1 core and 5 GB (1 GB for overhead, 4 GB for the executor) to each executor instance.
• If you use multiple cores in one executor, increase the memory accordingly
• e.g., 2 cores + 10 GB per instance
• No more than 40 GB of memory per instance
Driver memory configuration
• Kylin does not collect data to the driver, so you can configure fewer resources for the driver
• spark.driver.memory=2g
• spark.driver.cores=1
More instances with fewer cores, or fewer instances with more cores?
• Spark active task number = executor instances * cores per executor
• Both can achieve similar parallelism
• If you use more cores in one executor, tasks can share references in the same JVM
• Share big objects like dictionaries
• With Spark dynamic resource allocation, use 1 core per instance.
Dynamic resource allocation
• Dynamic allocation can improve resource utilization
• Not enabled by default
Dynamic resource allocation
• Static allocation does not fit Kylin.
• Cubing is done layer by layer; each layer’s size is different
• Workload is unbalanced: small -> medium -> big -> extremely big -> small -> tiny
• DRA is highly recommended (a plain-Spark config sketch follows below).
• With DRA enabled, give each executor 1 core.
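As a sketch, these are the plain-Spark properties that DRA needs; Kylin passes the same keys through its kylin.engine.spark-conf.* prefix (see the recommended configurations slide near the end), and the values here are starting points, not tested recommendations:

```scala
import org.apache.spark.SparkConf

// Dynamic resource allocation; it also requires the external shuffle service
// to be running on the YARN NodeManagers.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "1000")
  .set("spark.dynamicAllocation.executorIdleTimeout", "300")
  .set("spark.executor.cores", "1") // 1 core per executor, as recommended above
```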
RDD partitioning
• An RDD partition is similar to a file split in MapReduce
• Spark prefers many small partitions over a few big ones
• Kylin splits partitions by estimated file size (after aggregation), by default 1 partition per 10 MB (see the sketch after this list):
• kylin.engine.spark.rdd-partition-cut-mb=10
• The real size may vary, as the estimation can be inaccurate
• This can affect performance greatly!
• Min/max partition cap:
• kylin.engine.spark.min-partition=1
• kylin.engine.spark.max-partition=5000
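A hypothetical sketch of the split logic described above; the three parameters mirror the Kylin properties, but the function is an illustration, not the actual implementation:

```scala
// Estimate the RDD partition count for a layer from its estimated output size (MB).
def estimatePartitions(estimatedSizeMB: Double,
                       cutMB: Int = 10,          // kylin.engine.spark.rdd-partition-cut-mb
                       minPartitions: Int = 1,   // kylin.engine.spark.min-partition
                       maxPartitions: Int = 5000 // kylin.engine.spark.max-partition
                      ): Int = {
  val raw = math.ceil(estimatedSizeMB / cutMB).toInt
  math.min(maxPartitions, math.max(minPartitions, raw))
}

// A layer estimated at 2 GB -> ~205 partitions; a badly over-estimated 500 GB layer
// would be capped at the 5000-partition maximum.
```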
Partition number is important
• When the partition number is too small
• Less parallelism, low resource utilization
• Executor OOM (especially when using mapPartitions)
• When the partition number is too large
• Shuffle is slow
• Many small file fragments are generated
• Pay attention if you observe a job with > 1000 partitions
Partition number can be wild in certain cases
• If your cube has Count Distinct or TopN measures, the estimated size may be far bigger than the actual size, causing too many partitions.
• Tune the parameter manually at the Cube level, according to the actual cuboid file size:
• kylin.engine.spark.rdd-partition-cut-mb=100
• Or, reduce the max. partition number:
• kylin.engine.spark.max-partition=500
• KYLIN-3453 Make the size estimation more accurate
• KYLIN-3472 TopN in Spark is slow
Shuffle
• Spark shuffle is similar to MapReduce’s
• Partitions the mapper’s output and sends each partition only to its reducer
• The reducer buffers data in memory, then sorts, aggregates, and reduces
• But with differences
• Spark sorts the data on the map side, but doesn’t merge-sort it on the reduce side
• If the user needs the data sorted and calls `sortByKey` or similar, Spark will re-sort the data; the re-sort is not aware that the map output is already sorted (see the sketch below)
• The sorting is done in memory, spilling to disk when memory is full
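A minimal illustration of the last two points; the input path is hypothetical, and the shapes of the operations, not the data, are what matter:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("shuffle-sketch"))

val pairs = sc.textFile("hdfs:///tmp/input").map(line => (line, 1L))
// Shuffle #1: map output is sorted and partitioned, but the reduce side does not merge-sort it.
val aggregated = pairs.reduceByKey(_ + _)
// Shuffle #2: sortByKey sorts again from scratch; it is unaware the upstream output was already sorted.
val sorted = aggregated.sortByKey()
sorted.saveAsTextFile("hdfs:///tmp/output")
```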
Shuffle (cont.)
• Shuffle spill
• Spill memory = (executor memory – 300 MB) * spark.memory.fraction * (1 – spark.memory.storageFraction)
• Spilled files are not merged until the data is requested; merging happens on the fly
• If you need the data sorted, Spark is slower than MR.
• SPARK-2926 tries to introduce MR-style merge sort.
• Kylin’s “Convert to HFile” step needs the values sorted; Spark may spend 2x the time of MR on this step.
Compression
• Compression can significantly reduce IO
• By default Kylin enables compression for MR in `conf/kylin_job_conf.xml`, but not for Spark
• If your Hadoop cluster does not enable compression, you may see files of 2x the size when switching from MR to the Spark engine
• Manually enable compression by adding (a plain-Spark equivalent is sketched below):
• kylin.engine.spark-conf.spark.hadoop.mapreduce.output.fileoutputformat.compress=true
• kylin.engine.spark-conf.spark.hadoop.mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.DefaultCodec
• Kylin 2.5 will enable compression by default.
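In plain Spark terms, the two keys above are spark.hadoop.* properties, which Spark copies into the Hadoop Configuration used for output; a sketch (assuming DefaultCodec is available on the cluster, as it normally is):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Compress MapReduce-format output written by the Spark engine.
  .set("spark.hadoop.mapreduce.output.fileoutputformat.compress", "true")
  .set("spark.hadoop.mapreduce.output.fileoutputformat.compress.codec",
       "org.apache.hadoop.io.compress.DefaultCodec")
  // Codec for Spark's own internal I/O (shuffle files, spills).
  .set("spark.io.compression.codec", "org.apache.spark.io.SnappyCompressionCodec")
```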
Compression (cont.)
• 40% performance improvement + 50% disk saving
Chart: no compression vs. compression (merge segments on Spark)
DFS replication
• Kylin keeps 2 replicas for intermediate files, configured in `kylin_job_conf.xml` and `kylin_hive_conf.xml`
• But this setting does not take effect for Spark
• Manually add:
• kylin.engine.spark-conf.spark.hadoop.dfs.replication=2
• Saves 1/3 of the disk space
• Kylin 2.5 will enable this by default.
Deployment modes
• Spark on YARN has two deploy modes
• Cluster: the driver runs inside the application master
• Client: the driver runs in the client process
• When developing/debugging, use `client` mode
• Starts fast, with detailed log messages printed to the console
• But it occupies memory on the client node
• In production deployment, use `cluster` mode.
• Kylin 2.5 will use `cluster` mode by default
Other tips
• Pre-upload YARN archive
• Avoid uploading big files repeatedly
• Accelerate job startup
• Run the Spark history server for troubleshooting
• Identify bottlenecks much more easily
• https://kylin.apache.org/docs/tutorial/cube_spark.html
Recommended configurations (Kylin 2.2-2.4,
Spark 2.1)
• kylin.engine.spark-conf.spark.submit.deployMode=cluster
• kylin.engine.spark-conf.spark.dynamicAllocation.enabled=true
• kylin.engine.spark-conf.spark.dynamicAllocation.minExecutors=1
• kylin.engine.spark-conf.spark.dynamicAllocation.maxExecutors=1000
• kylin.engine.spark-conf.spark.dynamicAllocation.executorIdleTimeout=300
• kylin.engine.spark-conf.spark.driver.memory=2G
• kylin.engine.spark-conf.spark.executor.memory=4G
• kylin.engine.spark-conf.spark.yarn.executor.memoryOverhead=1024
• kylin.engine.spark-conf.spark.executor.cores=1
• kylin.engine.spark-conf.spark.network.timeout=600
• kylin.engine.spark-conf.spark.yarn.archive=hdfs://nameservice/kylin/spark/spark-libs.jar
• kylin.engine.spark-conf.spark.shuffle.service.enabled=true
• kylin.engine.spark-conf.spark.hadoop.dfs.replication=2
• kylin.engine.spark-conf.spark.hadoop.mapreduce.output.fileoutputformat.compress=true
• kylin.engine.spark-conf.spark.hadoop.mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.DefaultCodec
• kylin.engine.spark-conf.spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec
Key takeaway
• Kylin will move more jobs to Spark
• Mastering Spark tuning will help you run Kylin better
• Kylin aims to provide an out-of-the-box user experience with Spark, like MR.
We are hiring
Apache Kylin: dev@kylin.apache.org
Kyligence Inc: info@kyligence.io
Editor's Notes

  • #7: spark.shuffle.memoryFraction and spark.storage.memoryFraction are deprecated; they are replaced by spark.memory.fraction. See https://spark.apache.org/docs/2.1.2/configuration.html and https://spark.apache.org/docs/2.1.2/running-on-yarn.html
  • #9: https://0x0fff.com/spark-memory-management/
  • #19: https://0x0fff.com/spark-architecture-shuffle/ and https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/4-shuffleDetails.md
  • #20: https://0x0fff.com/spark-architecture-shuffle/, https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/4-shuffleDetails.md, and https://issues.apache.org/jira/browse/SPARK-2926. In one case, MR ("Convert to HFile") took 6 min while Spark took 11 min; even with more Spark memory, the performance didn't improve.