Control dataset partitioning and cache to
optimize performances in Spark
Christophe Préaud & Florian Fauvarque
2
Who are we?
Christophe Préaud
Big data and distributed computing enthusiast
Christophe is a data engineer at Kelkoo Group, in charge of the maintenance and evolution of the big
data technology stack, the development of Spark applications and Spark support for other teams.
Florian Fauvarque
Open-source enthusiast who loves neat and clean code and, more generally, good software
craftsmanship practices
Florian is a software engineer at Kelkoo Group, in charge of the development of Spark applications that
produce analyses and product feeds for affiliate web sites.
This presentation is also available at https://aquilae.eu/snowcamp2019-spark
3
The global data-driven marketing platform that
connects consumers to products
22 countries
International presence
20 years
of ecommerce experience
4 price comparison sites
7
We are hiring!
Over 30 roles in the company
Roles in Grenoble:
• Java/Scala Developers
• Front-End Developers
• Data Scientists
• Internships
8
• 2 billion logs written per day
• 60 TB in HDFS
• 15 servers in our production YARN cluster: 1.73 TB of memory, 520 vcores
• 3300 jobs executed every day
KelkooGroup – Some numbers
9
Spark is a unified processing engine that can analyze big data using SQL, machine learning, graph
processing or real-time stream analysis: http://spark.apache.org
What is Apache Spark?
11
• Task
• Slot
• Shuffle
Spark glossary
12
• Narrow transformation (e.g. coalesce, filter, map, ...)
Spark glossary
13
• Wide transformation (e.g. repartition, distinct, groupBy, ...)
Spark glossary
14
1. Partitions
2. Cache
3. Profiling
15
• What does it mean to partition data?
• To divide a single dataset into smaller manageable chunks
• → A Partition is a small piece of the total dataset
• How do the DataFrameReaders decide how to partition data?
• It depends on the reader (CSV, Parquet, ORC, ...)
• Task / Partition relationship:
• A typical Task processes a single Partition
• → The number of Partitions determines the number of Tasks needed to process
the dataset
What is a partition in Spark?
16
During the first part of this presentation, we will focus mainly on...
• The number of Partitions my data is divided into
• The number of Slots I have for parallel execution
The goal is to maximize Slot usage, i.e. to ensure as far as possible that
every Slot is processing a Task
What is a partition in Spark?
17
• 4 executors
• 2 cores / executor
• College Scorecards (source: catalog.data.gov) make it easier for
students to search for a college that is a good fit for them. They
can use the College Scorecard to find out more about a college's
affordability and value so they can make more informed decisions
about which college to attend.
Configuration for demo
→ 8 Slots (4 executors × 2 cores)
18
Partition tuning: reading a file
[Figure: numPartitions: 1; durations shown: 3.3 min, 3 min 24 s]
19
Partition tuning: reading a file
[Figure: numPartitions: 9; durations shown: 38 s, 42 s]
20
Why 9 partitions?
• File size is 1.04 GB
• Max partition size is 128 MB
• 1.04 GB × 1024 MB/GB / 128 MB = 8.32, rounded up to 9 partitions
Partition tuning: reading a file
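The arithmetic behind the 9 partitions can be checked with a short sketch. It assumes, as the slide does, that the file is simply cut into chunks of at most the maximum partition size; the function name `estimate_read_partitions` is hypothetical, and the real reader may take other settings into account.

```python
import math


def estimate_read_partitions(file_size_mb: float, max_partition_mb: float = 128) -> int:
    """Simplified model: one partition per chunk of at most max_partition_mb."""
    return math.ceil(file_size_mb / max_partition_mb)


file_size_mb = 1.04 * 1024                      # 1.04 GB expressed in MB
print(file_size_mb / 128)                       # about 8.32 chunks of 128 MB
print(estimate_read_partitions(file_size_mb))   # the ratio rounded up
```

The last chunk is only partly filled, which is why the count is rounded up rather than down.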
21
Partition tuning: reading a file
• As a rule of thumb, the number of Partitions should be a multiple of the
number of Slots, so that every Slot is being used (i.e. assigned a Task) throughout the
processing
• With 9 Partitions and 8 Slots, we are under-utilizing 7 of the 8 Slots (7 Slots will be
assigned 1 Task, 1 Slot will be assigned 2 Tasks, so 7 Slots sit idle while the last Task runs)
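The rule of thumb can be illustrated with a small calculation. This is a simplified model that assumes every Task takes the same time, so Tasks run in successive "waves" of at most one Task per Slot; `slot_usage` is a hypothetical helper, not a Spark API.

```python
import math


def slot_usage(num_partitions: int, num_slots: int):
    """Return (waves, busy slots in the last wave, idle slots in the last wave)."""
    if num_partitions < 1 or num_slots < 1:
        raise ValueError("num_partitions and num_slots must be at least 1")
    waves = math.ceil(num_partitions / num_slots)
    busy_last = num_partitions - (waves - 1) * num_slots
    return waves, busy_last, num_slots - busy_last


for partitions in (8, 9, 16):
    waves, busy, idle = slot_usage(partitions, 8)
    print(f"{partitions} partitions on 8 slots: {waves} wave(s), "
          f"last wave uses {busy} slot(s), {idle} idle")
```

With 9 Partitions the job needs a second wave for a single Task, whereas 8 or 16 Partitions keep all 8 Slots busy in every wave.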
22
Partition tuning: reading a file
[Figure: repartition(8); numPartitions: 8; durations shown: 14 s, 15 s, 32 s]
23
Partition tuning: reading a file
spark.sql.files.maxPartitionBytes:
The maximum number of bytes to pack into a single partition when reading files.
[Figure: numPartitions: 8; value shown: 320; durations shown: 20 s, 22 s]
24
Partition tuning: reading a file
[Figure: numPartitions: 8; value shown: 128; durations shown: 45 s, 49 s]
25
Partition tuning: repartition and coalesce
repartition(4) vs. coalesce(4)
29
Partition tuning: repartition and coalesce
coalesce:
• Performs better: no shuffle
• Records are not evenly distributed across all partitions → risk of a skewed
dataset (i.e. a few partitions containing most of the data)
repartition:
• Extra cost because of the shuffle operation
• Ensures uniform distribution of the records across all partitions → Slot
usage will be optimal
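The trade-off can be seen in a toy model. The two functions below are simplified stand-ins written for illustration, not Spark's actual algorithms: `coalesce` merges whole existing partitions without moving individual records, while `repartition` redistributes every record round-robin, which is what makes a shuffle necessary.

```python
def coalesce(partitions, n):
    """Toy model of coalesce: merge whole existing partitions, no record-level shuffle."""
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)      # each input partition moves as one block
    return merged


def repartition(partitions, n):
    """Toy model of repartition: redistribute every record round-robin (a shuffle)."""
    result = [[] for _ in range(n)]
    records = [r for part in partitions for r in part]
    for i, record in enumerate(records):
        result[i % n].append(record)
    return result


# A skewed dataset: one large partition and three tiny ones.
skewed = [list(range(97)), [97], [98], [99]]
print([len(p) for p in coalesce(skewed, 2)])     # sizes stay unbalanced
print([len(p) for p in repartition(skewed, 2)])  # sizes are even
```

In this model the large partition survives coalesce intact, so the skew is carried over to the output, while repartition pays for touching every record but ends up with balanced partitions.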
30
Partition tuning: writing a file
numPartitions: 19
39 s
31
Partition tuning: writing a file
[Figure: coalesce(1); durations shown: 3.9 min, 3 min 57 s]
32
Partition tuning: writing a file
[Figure: repartition(1); durations shown: 1.8 min, 22 s, 2 min 18 s]
33
Partition tuning: repartition or coalesce?
• If your dataset is skewed: use repartition
• If you want more partitions: use repartition
• If you want to drastically reduce the number of partitions (e.g. numPartitions = 1): use
repartition
• If your dataset is well balanced (i.e. not skewed) and you want fewer partitions (but
not drastically fewer, i.e. not fewer than the number of Slots): use coalesce
• If in doubt: use repartition
34
spark.sql.files.maxRecordsPerFile:
Maximum number of records to write out to a single file. If this value is
zero or negative, there is no limit.
Partition tuning: writing a file
35
Partition tuning: writing a file
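The number of records is checked for each partition (and not for the whole dataset) while the partition is being written: once the
current file has reached the threshold, a new file is created for the next record.
for each partition {
  numRecords = 0                  // the counter is per partition, not global
  for each record {
    if (numRecords >= 15000) {    // threshold (maxRecordsPerFile), 15000 in this example
      closeFile()
      openNewFile()
      numRecords = 0
    }
    writeRecordInFile()
    numRecords++                  // count the record once it has been written
  }
}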
36
Partition tuning: writing a file
There cannot be fewer than one file per partition.
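The effect of a per-partition check can be sketched as follows. This is a plain simulation of the behaviour described on the slides, not Spark code; `files_written` is a hypothetical helper and the partition sizes are made-up numbers.

```python
def files_written(partition_sizes, max_records_per_file):
    """Return, for each partition, the record count of every file written."""
    layout = []
    for size in partition_sizes:
        if max_records_per_file <= 0:       # zero or negative: no limit
            layout.append([size])
            continue
        files = []
        remaining = size
        while remaining > max_records_per_file:
            files.append(max_records_per_file)
            remaining -= max_records_per_file
        files.append(remaining)             # per the slide: at least one file per partition
        layout.append(files)
    return layout


# Two partitions of 20,000 records each, limit of 15,000 records per file.
print(files_written([20000, 20000], 15000))
```

Because the limit is applied partition by partition, the 40,000 records end up in four files (15,000 + 5,000 for each partition) rather than in the three files a dataset-wide count would give.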
37
Wide transformation: The data required to compute the records in a single Partition may reside in many
Partitions of the parent Dataset (i.e. it triggers a shuffle operation)
Partition tuning: wide transformation
[Figure: durations shown: 45 s, 1 min 32 s]
38
spark.sql.shuffle.partitions:
The default number of partitions to use when shuffling data for joins or aggregations.
Partition tuning: wide transformation
39
Partition tuning: wide transformation
[Figure: durations shown: 28 s, 1 min 17 s; value shown: 8]
40
1. Partitions
2. Cache
3. Profiling
41
When to use cache
• When re-using a Dataset multiple times
• To recover quickly from a node failure
• Data scientist: training data in an iterative loop 👍
• Data analyst: most of the time no; caching hides the fact that the data is not organized properly 👎
• Data engineer: usually no, but it depends on the case. Benchmark before going to prod ❔
42
When to use cache
7 sec
43
When to use cache
1 min 41 sec
44
How to cache a data set in Spark
Cache strategy: Storage Level
• NONE: no cache
• MEMORY_ONLY:
• Data is cached non-serialized in memory
• If there is not enough memory: data is evicted and rebuilt from the source when needed
• DISK_ONLY: data is serialized and stored on disk
• MEMORY_AND_DISK:
• Data is cached non-serialized in memory
• If there is not enough memory: data is serialized and stored on disk
• OFF_HEAP: data is serialized and stored off heap with Alluxio (formerly Tachyon)
45
How to cache a data set in Spark
Cache strategy: Storage Level
• _SER suffix:
• Always serializes the data in memory
• Saves space, at the cost of a serialization penalty
• _2 suffix:
• Replicates each partition on 2 cluster nodes
• Improves recovery time after a node failure
NONE
DISK_ONLY DISK_ONLY_2
MEMORY_ONLY MEMORY_ONLY_2 MEMORY_ONLY_SER MEMORY_ONLY_SER_2
MEMORY_AND_DISK MEMORY_AND_DISK_2 MEMORY_AND_DISK_SER MEMORY_AND_DISK_SER_2
OFF_HEAP
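The naming scheme can be made explicit with a small decoder. This is only an illustration of how the names are composed (base level, optional `_SER`, optional `_2`); `describe_storage_level` and `StorageFlags` are hypothetical names, not part of Spark, and NONE and OFF_HEAP are deliberately left out because they do not follow the suffix rules.

```python
from typing import NamedTuple


class StorageFlags(NamedTuple):
    use_memory: bool
    use_disk: bool
    serialized_in_memory: bool
    replication: int


def describe_storage_level(name: str) -> StorageFlags:
    """Decode a storage level name of the form BASE[_SER][_2]."""
    replication = 2 if name.endswith("_2") else 1
    base = name[:-2] if replication == 2 else name
    serialized = base.endswith("_SER")
    if serialized:
        base = base[:-4]
    if base not in ("MEMORY_ONLY", "DISK_ONLY", "MEMORY_AND_DISK"):
        raise ValueError(f"unsupported storage level name: {name}")
    use_memory = base.startswith("MEMORY")
    use_disk = base in ("DISK_ONLY", "MEMORY_AND_DISK")
    if serialized and not use_memory:
        raise ValueError("_SER only applies to levels that keep data in memory")
    return StorageFlags(use_memory, use_disk, serialized, replication)


print(describe_storage_level("MEMORY_AND_DISK_SER_2"))
```

Reading the flags back confirms the two rules: `_SER` changes how the in-memory copy is stored, and `_2` only changes the number of replicas.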
46
How to cache a data set in Spark
Cache strategy: Storage Level
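• .cache() is an alias for .persist(MEMORY_AND_DISK) on a Dataset; on an RDD it is an alias for .persist(MEMORY_ONLY)
• Caching is lazy: the data is only cached once an action (e.g. .count()) is executed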
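Lazy caching can be pictured with a toy class. This is a stand-in written for illustration, not Spark's implementation: marking the data as cached does nothing by itself, the first action materializes it, and later actions reuse the stored copy. `LazyCachedDataset` is a hypothetical name.

```python
class LazyCachedDataset:
    """Toy stand-in for a cached Dataset: nothing is computed until an action runs."""

    def __init__(self, compute):
        self._compute = compute     # function that builds the data from the source
        self._cached = None
        self.computations = 0       # how many times the source was actually read

    def count(self):
        """An action: materializes the data on first use, then reuses the cache."""
        if self._cached is None:
            self.computations += 1
            self._cached = list(self._compute())
        return len(self._cached)


ds = LazyCachedDataset(lambda: range(1000))
print(ds.computations)   # no action yet, so the source has not been read
ds.count()
ds.count()
print(ds.computations)   # the second count() was served from the cache
```

This is why a `.count()` is often called right after `.cache()`: it forces the cache to be populated before the Dataset is reused.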
47
Broadcast variable
Useful to share small immutable data
48
Broadcast variable
• spark.sql.autoBroadcastJoinThreshold: automatically optimizes join queries
when the size of one side of the join is below the threshold (default: 10 MB)
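The idea behind a broadcast join can be sketched in plain Python. This is a simplified model, not Spark code: the small side is turned into a read-only lookup that every partition of the large side can use locally, so no record of the large side has to move. `broadcast_join` and the sample rows are hypothetical, and the sketch assumes the join key is unique in the small table.

```python
def broadcast_join(large_partitions, small_table, key):
    """Inner-join each partition of the large side against a shared read-only lookup."""
    lookup = {row[key]: row for row in small_table}   # the "broadcast" copy
    joined = []
    for partition in large_partitions:                # each partition is joined locally
        out = []
        for row in partition:
            match = lookup.get(row[key])
            if match is not None:                     # inner join: drop unmatched rows
                out.append({**row, **match})
        joined.append(out)
    return joined


countries = [{"code": "FR", "name": "France"}, {"code": "DE", "name": "Germany"}]
offers = [
    [{"offer": 1, "code": "FR"}, {"offer": 2, "code": "XX"}],
    [{"offer": 3, "code": "DE"}],
]
print(broadcast_join(offers, countries, "code"))
```

The output keeps the same two partitions as the input, which is the point of the technique: only the small table is copied around.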
1. Partitions
2. Cache
3. Profiling
53
How to Profile a Spark App?
https://github.com/criteo/babar
54
Questions ?
55
Resources
• Spark official documentation: https://spark.apache.org/docs/latest/tuning.html
• Mastering Apache Spark by Jacek Laskowski: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
• Apache Spark - Best Practices and Tuning by Umberto Griffo:
https://umbertogriffo.gitbooks.io/apache-spark-best-practices-and-tuning/
• High Performance Spark by Rachel Warren and Holden Karau, O'Reilly
Control dataset partitioning and cache to optimize performances in Spark
Ad

More Related Content

What's hot (20)

Understanding Hadoop
Understanding HadoopUnderstanding Hadoop
Understanding Hadoop
Mahendran Ponnusamy
 
Big data- HDFS(2nd presentation)
Big data- HDFS(2nd presentation)Big data- HDFS(2nd presentation)
Big data- HDFS(2nd presentation)
Takrim Ul Islam Laskar
 
Less is More: 2X Storage Efficiency with HDFS Erasure Coding
Less is More: 2X Storage Efficiency with HDFS Erasure CodingLess is More: 2X Storage Efficiency with HDFS Erasure Coding
Less is More: 2X Storage Efficiency with HDFS Erasure Coding
Zhe Zhang
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013
Julien Le Dem
 
Chapter13
Chapter13Chapter13
Chapter13
gourab87
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce Overview
Nisanth Simon
 
Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology Overview
Konstantin V. Shvachko
 
(Julien le dem) parquet
(Julien le dem)   parquet(Julien le dem)   parquet
(Julien le dem) parquet
NAVER D2
 
Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
Delhi/NCR HUG
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
Rutvik Bapat
 
HDFS Erasure Coding in Action
HDFS Erasure Coding in Action HDFS Erasure Coding in Action
HDFS Erasure Coding in Action
DataWorks Summit/Hadoop Summit
 
Meethadoop
MeethadoopMeethadoop
Meethadoop
IIIT-H
 
Hadoop-Introduction
Hadoop-IntroductionHadoop-Introduction
Hadoop-Introduction
Sandeep Deshmukh
 
Snapshot in Hadoop Distributed File System
Snapshot in Hadoop Distributed File SystemSnapshot in Hadoop Distributed File System
Snapshot in Hadoop Distributed File System
Bhavesh Padharia
 
HDF5 I/O Performance
HDF5 I/O PerformanceHDF5 I/O Performance
HDF5 I/O Performance
The HDF-EOS Tools and Information Center
 
RAID: High-Performance, Reliable Secondary Storage
RAID: High-Performance, Reliable Secondary StorageRAID: High-Performance, Reliable Secondary Storage
RAID: High-Performance, Reliable Secondary Storage
Uğur Tılıkoğlu
 
Bigdata processing with Spark - part II
Bigdata processing with Spark - part IIBigdata processing with Spark - part II
Bigdata processing with Spark - part II
Arjen de Vries
 
An Introduction to Hadoop
An Introduction to HadoopAn Introduction to Hadoop
An Introduction to Hadoop
DerrekYoungDotCom
 
getFamiliarWithHadoop
getFamiliarWithHadoopgetFamiliarWithHadoop
getFamiliarWithHadoop
AmirReza Mohammadi
 
Bigdata processing with Spark
Bigdata processing with SparkBigdata processing with Spark
Bigdata processing with Spark
Arjen de Vries
 
Less is More: 2X Storage Efficiency with HDFS Erasure Coding
Less is More: 2X Storage Efficiency with HDFS Erasure CodingLess is More: 2X Storage Efficiency with HDFS Erasure Coding
Less is More: 2X Storage Efficiency with HDFS Erasure Coding
Zhe Zhang
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013
Julien Le Dem
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce Overview
Nisanth Simon
 
Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology Overview
Konstantin V. Shvachko
 
(Julien le dem) parquet
(Julien le dem)   parquet(Julien le dem)   parquet
(Julien le dem) parquet
NAVER D2
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
Rutvik Bapat
 
Meethadoop
MeethadoopMeethadoop
Meethadoop
IIIT-H
 
Snapshot in Hadoop Distributed File System
Snapshot in Hadoop Distributed File SystemSnapshot in Hadoop Distributed File System
Snapshot in Hadoop Distributed File System
Bhavesh Padharia
 
RAID: High-Performance, Reliable Secondary Storage
RAID: High-Performance, Reliable Secondary StorageRAID: High-Performance, Reliable Secondary Storage
RAID: High-Performance, Reliable Secondary Storage
Uğur Tılıkoğlu
 
Bigdata processing with Spark - part II
Bigdata processing with Spark - part IIBigdata processing with Spark - part II
Bigdata processing with Spark - part II
Arjen de Vries
 
Bigdata processing with Spark
Bigdata processing with SparkBigdata processing with Spark
Bigdata processing with Spark
Arjen de Vries
 

Similar to Control dataset partitioning and cache to optimize performances in Spark (20)

Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Codemotion
 
Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101
Mark Kromer
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
Databricks
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
Rose Toomey
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hood
Adarsh Pannu
 
Presentation by TachyonNexus & Intel at Strata Singapore 2015
Presentation by TachyonNexus & Intel at Strata Singapore 2015Presentation by TachyonNexus & Intel at Strata Singapore 2015
Presentation by TachyonNexus & Intel at Strata Singapore 2015
Tachyon Nexus, Inc.
 
Tachyon: An Open Source Memory-Centric Distributed Storage System
Tachyon: An Open Source Memory-Centric Distributed Storage SystemTachyon: An Open Source Memory-Centric Distributed Storage System
Tachyon: An Open Source Memory-Centric Distributed Storage System
Tachyon Nexus, Inc.
 
Managing Data and Operation Distribution In MongoDB
Managing Data and Operation Distribution In MongoDBManaging Data and Operation Distribution In MongoDB
Managing Data and Operation Distribution In MongoDB
Jason Terpko
 
Memory Management Strategies - III.pdf
Memory Management Strategies - III.pdfMemory Management Strategies - III.pdf
Memory Management Strategies - III.pdf
Harika Pudugosula
 
Fast and Scalable Python
Fast and Scalable PythonFast and Scalable Python
Fast and Scalable Python
Travis Oliphant
 
OVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptxOVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptx
Aishg4
 
Tachyon Presentation at AMPCamp 6 (November, 2015)
Tachyon Presentation at AMPCamp 6 (November, 2015)Tachyon Presentation at AMPCamp 6 (November, 2015)
Tachyon Presentation at AMPCamp 6 (November, 2015)
Tachyon Nexus, Inc.
 
Multivariate algorithms in distributed data processing computing.pptx
Multivariate algorithms in distributed data processing computing.pptxMultivariate algorithms in distributed data processing computing.pptx
Multivariate algorithms in distributed data processing computing.pptx
ms236400269
 
Multivariate algorithms in distributed data processing computing.pptx
Multivariate algorithms in distributed data processing computing.pptxMultivariate algorithms in distributed data processing computing.pptx
Multivariate algorithms in distributed data processing computing.pptx
ms236400269
 
Managing data and operation distribution in MongoDB
Managing data and operation distribution in MongoDBManaging data and operation distribution in MongoDB
Managing data and operation distribution in MongoDB
Antonios Giannopoulos
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
Databricks
 
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Databricks
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Spark Summit
 
Storage talk
Storage talkStorage talk
Storage talk
christkv
 
Vmfs
VmfsVmfs
Vmfs
Erick Treviño
 
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Codemotion
 
Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101
Mark Kromer
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
Databricks
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
Rose Toomey
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hood
Adarsh Pannu
 
Presentation by TachyonNexus & Intel at Strata Singapore 2015
Presentation by TachyonNexus & Intel at Strata Singapore 2015Presentation by TachyonNexus & Intel at Strata Singapore 2015
Presentation by TachyonNexus & Intel at Strata Singapore 2015
Tachyon Nexus, Inc.
 
Tachyon: An Open Source Memory-Centric Distributed Storage System
Tachyon: An Open Source Memory-Centric Distributed Storage SystemTachyon: An Open Source Memory-Centric Distributed Storage System
Tachyon: An Open Source Memory-Centric Distributed Storage System
Tachyon Nexus, Inc.
 
Managing Data and Operation Distribution In MongoDB
Managing Data and Operation Distribution In MongoDBManaging Data and Operation Distribution In MongoDB
Managing Data and Operation Distribution In MongoDB
Jason Terpko
 
Memory Management Strategies - III.pdf
Memory Management Strategies - III.pdfMemory Management Strategies - III.pdf
Memory Management Strategies - III.pdf
Harika Pudugosula
 
Fast and Scalable Python
Fast and Scalable PythonFast and Scalable Python
Fast and Scalable Python
Travis Oliphant
 
OVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptxOVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptx
Aishg4
 
Tachyon Presentation at AMPCamp 6 (November, 2015)
Tachyon Presentation at AMPCamp 6 (November, 2015)Tachyon Presentation at AMPCamp 6 (November, 2015)
Tachyon Presentation at AMPCamp 6 (November, 2015)
Tachyon Nexus, Inc.
 
Multivariate algorithms in distributed data processing computing.pptx
Multivariate algorithms in distributed data processing computing.pptxMultivariate algorithms in distributed data processing computing.pptx
Multivariate algorithms in distributed data processing computing.pptx
ms236400269
 
Multivariate algorithms in distributed data processing computing.pptx
Multivariate algorithms in distributed data processing computing.pptxMultivariate algorithms in distributed data processing computing.pptx
Multivariate algorithms in distributed data processing computing.pptx
ms236400269
 
Managing data and operation distribution in MongoDB
Managing data and operation distribution in MongoDBManaging data and operation distribution in MongoDB
Managing data and operation distribution in MongoDB
Antonios Giannopoulos
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
Databricks
 
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Databricks
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Spark Summit
 
Storage talk
Storage talkStorage talk
Storage talk
christkv
 
Ad

Recently uploaded (20)

Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
SOFTTECHHUB
 
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-UmgebungenHCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
panagenda
 
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveDesigning Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
ScyllaDB
 
Heap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and DeletionHeap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and Deletion
Jaydeep Kale
 
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded DevelopersLinux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Toradex
 
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
Alan Dix
 
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
organizerofv
 
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager APIUiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPathCommunity
 
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Impelsys Inc.
 
Generative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in BusinessGenerative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in Business
Dr. Tathagat Varma
 
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptxSpecial Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
shyamraj55
 
Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.
hpbmnnxrvb
 
How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?
Daniel Lehner
 
Cyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of securityCyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of security
riccardosl1
 
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Aqusag Technologies
 
Cybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure ADCybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure AD
VICTOR MAESTRE RAMIREZ
 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
SOFTTECHHUB
 
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-UmgebungenHCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
panagenda
 
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveDesigning Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
ScyllaDB
 
Heap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and DeletionHeap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and Deletion
Jaydeep Kale
 
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded DevelopersLinux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Toradex
 
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
Alan Dix
 
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Control dataset partitioning and cache to optimize performances in Spark

  • 1. Control dataset partitioning and cache to optimize performances in Spark Christophe Préaud & Florian Fauvarque
  • 2. Who are we? Christophe Préaud: big data and distributed computing enthusiast. Christophe is a data engineer at Kelkoo Group, in charge of the maintenance and evolution of the big data technology stack, the development of Spark applications, and Spark support for other teams. Florian Fauvarque: open-source enthusiast who loves neat and clean code and, more generally, good software craftsmanship practices. Florian is a software engineer at Kelkoo Group, in charge of developing Spark applications that produce analyses and product feeds for affiliate web sites. This presentation is also available at https://aquilae.eu/snowcamp2019-spark
  • 3. The global data-driven marketing platform that connects consumers to products • 22 countries: international presence • 20 years of ecommerce experience • 4 price comparison sites
  • 4. We are hiring! Over 30 roles in the company. Roles in Grenoble: • Java/Scala Developers • Front-End Developers • Data Scientists • Internships
  • 5. KelkooGroup – Some numbers • 2 billion logs written per day • 60 TB in HDFS • 15 servers in our prod YARN cluster: 1.73 TB memory, 520 vcores • 3300 jobs executed every day
  • 6. What is Apache Spark? Spark is a unified processing engine that can analyze big data using SQL, machine learning, graph processing or real-time stream analysis: http://spark.apache.org
  • 7. Spark glossary • Task • Slot • Shuffle
  • 8. Spark glossary • Narrow transformation (ex: coalesce, filter, map, …)
  • 9. Spark glossary • Wide transformation (ex: repartition, distinct, groupBy, ...)
  • 11. What is a partition in Spark? • What does it mean to partition data? • To divide a single dataset into smaller, manageable chunks • → A Partition is a small piece of the total dataset • How do the DataFrameReaders decide how to partition data? • It depends on the reader (CSV, Parquet, ORC, ...) • Task / Partition relationship: • A typical Task processes a single Partition • → The number of Partitions determines the number of Tasks needed to process the dataset
  • 12. What is a partition in Spark? During the first part of this presentation, we will focus mainly on: • The number of Partitions my data is divided into • The number of Slots I have for parallel execution. The goal is to maximize Slot usage, i.e. to ensure as far as possible that each Slot is processing a Task
  • 13. Configuration for demo • 4 executors • 2 cores / executor (8 Slots in total) • Dataset: College Scorecards (source: catalog.data.gov) make it easier for students to search for a college that is a good fit for them. They can use the College Scorecard to find out more about a college's affordability and value so they can make more informed decisions about which college to attend.
  • 14. Partition tuning: reading a file (numPartitions: 1; durations shown on the slide: 3.3 min and 3 min 24)
  • 15. Partition tuning: reading a file (numPartitions: 9; durations shown on the slide: 38 s and 42 s)
  • 16. Partition tuning: reading a file. Why 9 partitions? • File size is 1.04 GB • Max partition size is 128 MB • 1.04 * 1024 / 128 = 8.32, which rounds up to 9 partitions
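The arithmetic on this slide can be reproduced with a few lines of Python. This is only a sketch of the slide's simplified rule (file size divided by maximum partition size, rounded up); the function name `estimated_partitions` is hypothetical, and Spark's actual file planner takes other settings into account as well, so real partition counts can differ.

```python
import math


def estimated_partitions(file_size_mb: float, max_partition_mb: float = 128.0) -> int:
    """Simplified slide rule: chunks of at most max_partition_mb needed to hold the file."""
    return math.ceil(file_size_mb / max_partition_mb)


file_size_mb = 1.04 * 1024  # 1.04 GB expressed in MB
print(estimated_partitions(file_size_mb))  # 9
```

Raising the maximum partition size until the result is 8 is the idea behind tuning spark.sql.files.maxPartitionBytes for the 8-Slot demo cluster.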
  • 17. Partition tuning: reading a file • As a rule of thumb, the number of Partitions should be a multiple of the number of Slots, so that every Slot is being used (i.e. assigned a Task) throughout the processing • With 9 Partitions and 8 Slots, we are under-utilizing 7 of the 8 Slots (7 Slots will be assigned 1 Task, 1 Slot will be assigned 2 Tasks)
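The Slot usage described on this slide can be sketched as follows. The model assumes every Task takes the same time, so Tasks run in "waves" of at most one Task per Slot; `slot_usage` is a hypothetical helper for illustration, not a Spark API.

```python
import math


def slot_usage(num_partitions: int, num_slots: int) -> dict:
    """Model Tasks running in waves, assuming equal Task durations and at least one Partition."""
    waves = math.ceil(num_partitions / num_slots)
    tasks_in_last_wave = num_partitions - (waves - 1) * num_slots
    return {
        "waves": waves,
        "idle_slots_in_last_wave": num_slots - tasks_in_last_wave,
        # Share of the available Slot time that is spent running a Task.
        "utilization": num_partitions / (waves * num_slots),
    }


print(slot_usage(9, 8))  # {'waves': 2, 'idle_slots_in_last_wave': 7, 'utilization': 0.5625}
print(slot_usage(8, 8))  # {'waves': 1, 'idle_slots_in_last_wave': 0, 'utilization': 1.0}
```

With 9 Partitions the ninth Task forces a second wave in which 7 Slots sit idle; with 8 (or 16, 24, ...) Partitions every wave is full.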
  • 18. Partition tuning: reading a file (repartition(8), numPartitions: 8; durations shown on the slide: 14 s, 15 s and 32 s)
  • 19. Partition tuning: reading a file. spark.sql.files.maxPartitionBytes: the maximum number of bytes to pack into a single partition when reading files. (numPartitions: 8; values shown on the slide: 20 s, 320, 22 s)
  • 20. Partition tuning: reading a file (numPartitions: 8; values shown on the slide: 45 s, 128, 49 s)
  • 21. Partition tuning: repartition and coalesce (diagrams: repartition(4), coalesce(4))
  • 25. Partition tuning: repartition and coalesce. coalesce: • Performs better: no shuffle • Records are not evenly distributed across all partitions → risk of skewed dataset (i.e. a few partitions containing most of the data). repartition: • Extra cost because of the shuffle operation • Ensures uniform distribution of the records across all partitions → Slot usage will be optimal
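The trade-off between the two operations can be illustrated with a toy model in plain Python. This is a sketch under simplifying assumptions, not Spark's implementation: `coalesce_model` merges whole neighbouring partitions (Spark's real grouping also considers data locality), and `repartition_model` deals records round-robin to stand in for the shuffle. Both function names are hypothetical.

```python
from typing import List


def coalesce_model(partitions: List[list], n: int) -> List[list]:
    """Merge whole neighbouring partitions into n groups; no record is redistributed."""
    merged: List[list] = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i * n // len(partitions)].extend(part)
    return merged


def repartition_model(partitions: List[list], n: int) -> List[list]:
    """Shuffle model: deal every record round-robin across n new partitions."""
    out: List[list] = [[] for _ in range(n)]
    records = [r for part in partitions for r in part]
    for i, record in enumerate(records):
        out[i % n].append(record)
    return out


# 8 partitions, 97 records, with almost all of the data in the first partition.
skewed = [list(range(90)), [1], [2], [3], [4], [5], [6], [7]]
print([len(p) for p in coalesce_model(skewed, 4)])     # [91, 2, 2, 2]
print([len(p) for p in repartition_model(skewed, 4)])  # [25, 24, 24, 24]
```

The coalesce model keeps the skew because it never splits a partition, while the repartition model pays for moving every record but ends up balanced.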
  • 26. Partition tuning: writing a file (numPartitions: 19; duration shown on the slide: 39 s)
  • 27. Partition tuning: writing a file (coalesce(1); durations shown on the slide: 3.9 min and 3 min 57)
  • 28. Partition tuning: writing a file (repartition(1); durations shown on the slide: 1.8 min, 22 s and 2 min 18)
  • 29. Partition tuning: repartition or coalesce? • If your dataset is skewed: use repartition • If you want more partitions: use repartition • If you want to drastically reduce the number of partitions (e.g. numPartitions = 1): use repartition • If your dataset is well balanced (i.e. not skewed) and you want fewer partitions (but not drastically fewer, i.e. not fewer than the number of Slots): use coalesce • If in doubt: use repartition
  • 30. Partition tuning: writing a file. spark.sql.files.maxRecordsPerFile: maximum number of records to write out to a single file. If this value is zero or negative, there is no limit.
  • 31. Partition tuning: writing a file. The number of records is checked for each partition (not for the whole dataset) while the partition is being written; when it exceeds the threshold, a new file is created. for each partition { numRecords = 0; for each record { if (numRecords == 15000) { closeFile(); openNewFile(); numRecords = 0 }; writeRecordInFile(); numRecords++ } }
  • 32. Partition tuning: writing a file. There cannot be fewer than one file per partition.
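Because the record limit is applied per partition, the number of output files can be estimated from the record count of each partition. The sketch below assumes the behaviour stated on the slides (at least one file per partition, a non-positive limit meaning "no limit"); `files_written` is a hypothetical helper, and Spark's handling of empty partitions may differ from this model.

```python
import math
from typing import List


def files_written(records_per_partition: List[int], max_records_per_file: int) -> int:
    """Count output files when the limit is checked per partition, not per dataset."""
    total = 0
    for n in records_per_partition:
        if max_records_per_file <= 0:
            total += 1  # zero or negative: no limit, one file per partition
        else:
            # At least one file per partition, as stated on the slide.
            total += max(1, math.ceil(n / max_records_per_file))
    return total


# Two partitions of 16,000 records with a 15,000-record limit: 2 files each.
print(files_written([16000, 16000], 15000))  # 4
```

Counting over the whole dataset would suggest 3 files for 32,000 records, which is why the per-partition check matters when sizing output.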
  • 33. Partition tuning: wide transformation. Wide transformation: the data required to compute the records in a single Partition may reside in many Partitions of the parent Dataset (i.e. it triggers a shuffle operation). (durations shown on the slide: 45 s and 1 min 32)
  • 34. Partition tuning: wide transformation. spark.sql.shuffle.partitions: the default number of partitions to use when shuffling data for joins or aggregations.
  • 35. Partition tuning: wide transformation (value shown on the slide: 8; durations shown: 28 s and 1 min 17)
  • 37. When to use cache • When re-using a Dataset multiple times • To recover quickly from a node failure • Data scientist: training data in an iterative loop 👍 • Data analyst: most of the time no; caching hides the fact that the data is not organized properly 👎 • Data engineer: usually no, but it depends on the case. Benchmark before going to prod ❔
  • 39. When to use cache (duration shown on the slide: 1 min 41 sec)
  • 40. How to cache a data set in Spark. Cache strategy: Storage Level • NONE: no cache • MEMORY_ONLY: • data cached non-serialized in memory • if there is not enough memory: data is evicted and, when needed, rebuilt from source • DISK_ONLY: data is serialized and stored on disk • MEMORY_AND_DISK: • data cached non-serialized in memory • if there is not enough memory: data is serialized and stored on disk • OFF_HEAP: data is serialized and stored off heap with Alluxio (formerly Tachyon)
  • 41. How to cache a data set in Spark. Cache strategy: Storage Level • _SER suffix: • Always serialize the data in memory • Saves space, but with a serialization penalty • _2 suffix: • Replicate each partition on 2 cluster nodes • Improves recovery time after a node failure. Available levels: NONE, DISK_ONLY, DISK_ONLY_2, MEMORY_ONLY, MEMORY_ONLY_2, MEMORY_ONLY_SER, MEMORY_ONLY_SER_2, MEMORY_AND_DISK, MEMORY_AND_DISK_2, MEMORY_AND_DISK_SER, MEMORY_AND_DISK_SER_2, OFF_HEAP
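The naming rules for storage levels (base level, optional _SER suffix, optional _2 suffix) can be made concrete with a small decoder. This is an illustrative sketch only: `describe_storage_level` is a hypothetical helper that reads the level name as a string, and it is not part of Spark's StorageLevel API.

```python
def describe_storage_level(name: str) -> dict:
    """Decode a storage level name using the suffix rules from the slides."""
    replication = 1
    base = name
    if base.endswith("_2"):  # _2 suffix: each partition replicated on 2 nodes
        replication = 2
        base = base[: -len("_2")]
    serialized_in_memory = base.endswith("_SER")  # _SER suffix: serialized in memory
    if serialized_in_memory:
        base = base[: -len("_SER")]
    return {
        "memory": base in ("MEMORY_ONLY", "MEMORY_AND_DISK"),
        "disk": base in ("DISK_ONLY", "MEMORY_AND_DISK"),
        "off_heap": base == "OFF_HEAP",
        "serialized_in_memory": serialized_in_memory,
        "replication": replication,
    }


print(describe_storage_level("MEMORY_AND_DISK_SER_2"))
```

For MEMORY_AND_DISK_SER_2 the decoder reports memory and disk both in use, serialized in-memory storage, and a replication factor of 2.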
  • 42. How to cache a data set in Spark. Cache strategy: Storage Level • .cache() is an alias for .persist(MEMORY_AND_DISK) on a Dataset; on an RDD it is an alias for .persist(MEMORY_ONLY) • Caching is lazy: the data is only materialized by an action such as .count()
  • 43. Broadcast variable. Useful to share small immutable data
  • 44. Broadcast variable • spark.sql.autoBroadcastJoinThreshold: automatically optimizes join queries when the size of one side of the join is below the threshold (default 10 MB)
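The threshold rule can be sketched as a simple size comparison. This is a simplified model under stated assumptions: `would_auto_broadcast` is a hypothetical helper, the size passed in stands for Spark's own estimate of one join side, treating a size exactly equal to the threshold as eligible is an assumption, and a negative threshold is modelled as disabling automatic broadcast (setting the property to -1 is the documented way to turn it off).

```python
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024  # 10 MB, the default quoted on the slide


def would_auto_broadcast(side_size_bytes: int,
                         threshold_bytes: int = DEFAULT_THRESHOLD_BYTES) -> bool:
    """Return True if one join side is small enough to be broadcast automatically."""
    if threshold_bytes < 0:
        return False  # negative threshold: automatic broadcast disabled
    # Boundary case (size == threshold) treated as eligible: an assumption of this sketch.
    return 0 <= side_size_bytes <= threshold_bytes


print(would_auto_broadcast(2 * 1024 * 1024))    # True: a 2 MB lookup table
print(would_auto_broadcast(500 * 1024 * 1024))  # False: a 500 MB table
```

In practice the decision depends on the size statistics Spark has for each side, so a table that is small on disk is not guaranteed to be broadcast.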
  • 46. How to Profile a Spark App?
  • 47. How to Profile a Spark App?
  • 48. How to Profile a Spark App?
  • 49. How to Profile a Spark App? https://github.com/criteo/babar
  • 51. Resources • Spark official documentation: https://spark.apache.org/docs/latest/tuning.html • Mastering Apache Spark by Jacek Laskowski: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/ • Apache Spark - Best Practices and Tuning by Umberto Griffo: https://umbertogriffo.gitbooks.io/apache-spark-best-practices-and-tuning/ • High Performance Spark by Rachel Warren and Holden Karau, O'Reilly

Editor's Notes