Flash for Apache Spark Shuffle with Cosco
Flash for Spark Shuffle with Cosco
Aaron Gabriel Feldman
Software Engineer at Facebook
Agenda
1. Motivation
2. Intro to shuffle architecture
3. Flash
4. Hybrid RAM + flash techniques
5. Future improvements
6. Testing techniques
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.
Why should you care?
▪ IO efficiency
▪ Cosco is a service that improves IO efficiency (disk service time) by 3x for shuffle data
▪ Compute efficiency
▪ Flash supports more workload with less Cosco hardware
▪ Query latency is less of a focus
▪ Cosco helps shuffle-heavy queries, but query latency has not been our focus. We have been focused on batch workloads.
▪ Flash unlocks new possibilities to improve query latency, but that is future work
▪ Techniques for development and analysis
▪ Hopefully, some of these are applicable outside of Cosco
Intro to Shuffle Architecture
Spark Shuffle Recap
Map 0
Map 1
Map m
Reduce 0
Reduce 1
Reduce r
Partition
Mappers
Map Output Files
(on disk/DFS) Reducers
Map output files written to local storage or distributed filesystem
Adapted from Cosco: An Efficient Facebook-Scale Shuffle Service, Spark Summit 2019
Spark Shuffle Recap
Map 0
Map 1
Map m
Reduce 0
Reduce 1
Reduce r
Partition
Mappers
Map Output Files
(on disk/DFS) Reducers
Reducers pull from map output files
Adapted from Cosco: An Efficient Facebook-Scale Shuffle Service, Spark Summit 2019
Spark Shuffle Recap
Map 0
Map 1
Map m
Reduce 0
Reduce 1
Reduce r
Partition
Mappers
Map Output Files
(on disk/DFS) Reducers
Sort by
key
Iterator
Iterator
Iterator
Adapted from Cosco: An Efficient Facebook-Scale Shuffle Service, Spark Summit 2019
Spark Shuffle Recap
Map 0
Map 1
Map m
Reduce 0
Reduce 1
Reduce r
Partition
Mappers
Map Output Files
(on disk/DFS) Reducers
Sort by
key
Iterator
Iterator
Iterator
Write amplification is ~3x
Write amplification problem
Adapted from Cosco: An Efficient Facebook-Scale Shuffle Service, Spark Summit 2019
Spark Shuffle Recap
Map 0
Map 1
Map m
Reduce 0
Reduce 1
Reduce r
Partition
Sort by
key
Iterator
Iterator
Iterator
Write amplification is ~3x
And small IOs problem
M x R
Avg IO size is ~200 KiB
Mappers
Map Output Files
(on disk/DFS) Reducers
Adapted from Cosco: An Efficient Facebook-Scale Shuffle Service, Spark Summit 2019
Spark Shuffle Recap
Map 0
Map 1
Map m
Reduce 0
Reduce 1
Reduce r
Partition
Mappers
Map Output Files
(on disk/DFS) Reducers
Reducers pull from map output files
Sort by
key
Iterator
Iterator
Iterator
Simplified drawing
Adapted from Cosco: An Efficient Facebook-Scale Shuffle Service, Spark Summit 2019
Spark Shuffle Recap
Map 1
Map m
Reduce 1
Reduce r
Mappers
Map Output Files
(on disk/DFS) Reducers
Reducers pull from map output files
Sort by
key
Iterator
Iterator
Simplified drawing
Adapted from Cosco: An Efficient Facebook-Scale Shuffle Service, Spark Summit 2019
Spark Shuffle Recap
Map 1
Map m
Reduce 1
Reduce r
Mappers
Map Output Files
(on disk/DFS) Reducers
Simplified drawing
Adapted from Cosco: An Efficient Facebook-Scale Shuffle Service, Spark Summit 2019
Spark Shuffle Recap
Map 1
Map m
Mappers
Map Output Files
(on disk/DFS)
Simplified drawing
Reduce 1
Reduce r
Reducers
Adapted from Cosco: An Efficient Facebook-Scale Shuffle Service, Spark Summit 2019
Cosco Shuffle for Spark
[Diagram: mappers (Map 1 … Map m) stream to Shuffle Service 1 … Shuffle Service N (N = thousands), each holding in-memory buffers for Partition 1 … Partition r; the services write files (File 0, File 1, File 2, …) to a Distributed Filesystem (HDFS/Warm Storage), which reducers (Reduce 1 … Reduce r) read]
▪ Mappers stream their output to Cosco Shuffle Services, which buffer it in memory
▪ When a partition's buffer fills, the service sorts it (if required by the query) and flushes it to the DFS as that partition's next file
▪ Reducers do a streaming merge of a partition's files after the map stage completes
Adapted from Cosco: An Efficient Facebook-Scale Shuffle Service, Spark Summit 2019
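For concreteness, here is a minimal Python sketch of the buffer-and-flush behavior described above. It is not Cosco's actual implementation: the class, the 10 MB buffer limit, and the dfs_writer callback are illustrative assumptions.

from collections import defaultdict

BUFFER_LIMIT_BYTES = 10 * 1024 * 1024  # assumed per-partition buffer size, not Cosco's real limit

class ShuffleService:
    def __init__(self, dfs_writer, sort_required=True):
        self.buffers = defaultdict(list)      # partition id -> buffered (key, value) records
        self.buffer_bytes = defaultdict(int)  # partition id -> bytes currently buffered
        self.next_file = defaultdict(int)     # partition id -> next DFS file number
        self.dfs_writer = dfs_writer          # callable(partition, file_number, records)
        self.sort_required = sort_required

    def append(self, partition, records, nbytes):
        # Buffer one package (a few tens of KiB) streamed from a mapper.
        self.buffers[partition].extend(records)
        self.buffer_bytes[partition] += nbytes
        if self.buffer_bytes[partition] >= BUFFER_LIMIT_BYTES:
            self.flush(partition)

    def flush(self, partition):
        # Sort the buffer (if the query needs it) and write it as one DFS file.
        records = self.buffers.pop(partition, [])
        if self.sort_required:
            records.sort(key=lambda kv: kv[0])
        self.dfs_writer(partition, self.next_file[partition], records)
        self.next_file[partition] += 1
        self.buffer_bytes[partition] = 0

A reducer would then read that partition's sequence of files and merge them in a streaming fashion.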
Replace DRAM with Flash for Buffering
Buffering Is Appending
[Diagram: mappers (Map 1 … Map m) stream packages to Shuffle Service 1 … Shuffle Service N (N = thousands); each package is appended to the end of a partition buffer (Partition r)]
Each package is a few 10s of KiB
Replace DRAM with Flash for Buffering
[Diagram: the same appending pattern, but the partition buffers live on flash; buffered data is read back to main memory for sorting]
Simply buffer to flash instead of memory
▪ Appending is a friendly pattern for flash
▪ Minimizes flash write amplification -> minimizes wear on the drive
▪ Flash write/read latency is negligible
▪ Generally non-blocking
▪ Latency is much less than buffering time
Example Rule of Thumb
Would you rather consume 1 GB DRAM or flash that can endure 100 GB/day of write throughput?
▪ Hypothetical example numbers
▪ Assume 1 GB Flash can endure ~10 GB of writes per day for the lifetime of the device
▪ Assume you are indifferent between consuming 1 GB DRAM vs ~10 GB Flash with write throughput at the endurance limit
▪ Then, you would be indifferent between consuming 1 GB DRAM vs flash absorbing ~100 GB/day of writes
▪ Notes
▪ These numbers were chosen entirely because they are round -> easier to illustrate the math on slides
▪ DRAM consumes more power than Flash
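The rule of thumb reduces to a tiny calculation. This is only the hypothetical round-number math from the slide, not real device specs.

ENDURANCE_GB_WRITES_PER_GB_PER_DAY = 10  # assume 1 GB flash endures ~10 GB of writes/day
FLASH_GB_PER_DRAM_GB = 10                # assume 1 GB DRAM trades against ~10 GB flash at that limit

def dram_gb_equivalent(flash_gb_written_per_day):
    # Flash capacity needed to absorb this write volume within the endurance limit,
    # then converted to the DRAM you would be indifferent to.
    flash_gb_needed = flash_gb_written_per_day / ENDURANCE_GB_WRITES_PER_GB_PER_DAY
    return flash_gb_needed / FLASH_GB_PER_DRAM_GB

# 100 GB/day of writes -> ~10 GB of flash -> trades against ~1 GB of DRAM.
assert dram_gb_equivalent(100) == 1.0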
Basic Evaluation
▪ Example Cosco cluster
▪ 10 nodes
▪ Each node uses 100 GB DRAM for buffering
▪ And has additional DRAM for sorting, RPCs, etc.
▪ So, 1 TB DRAM for buffering in total
▪ Again, numbers are chosen for illustration only
▪ Apply the example rule of thumb
▪ Indifferent between consuming 1 TB DRAM vs 100 TB/day flash endurance
▪ If this cluster shuffles less than 100 TB/day, then it is efficient to replace DRAM with Flash
▪ Each node replaces 100 GB DRAM with ~1 TB flash for buffering
▪ Nodes keep some DRAM for sorting, RPCs, etc.
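A sketch of the same break-even calculation for the example cluster, again using the illustration-only numbers from the slides.

NODES = 10
BUFFER_DRAM_GB_PER_NODE = 100
FLASH_GB_WRITES_PER_DAY_PER_DRAM_GB = 100  # from the rule of thumb above

total_buffer_dram_gb = NODES * BUFFER_DRAM_GB_PER_NODE  # 1 TB
breakeven_shuffle_gb_per_day = total_buffer_dram_gb * FLASH_GB_WRITES_PER_DAY_PER_DRAM_GB  # 100 TB/day

def flash_is_efficient(cluster_shuffle_gb_per_day):
    # Replacing buffering DRAM with flash pays off if the cluster writes less
    # shuffle data per day than the break-even volume.
    return cluster_shuffle_gb_per_day < breakeven_shuffle_gb_per_day

assert flash_is_efficient(50_000) and not flash_is_efficient(200_000)  # 50 TB/day yes, 200 TB/day no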
Basic Evaluation
Summary for cluster shuffling 100 TB/day
[Diagram: 10 shuffle service nodes, before and after. Before: each node (Shuffle Service 1 … Shuffle Service 10) has CPU, DRAM for sorting, RPCs, etc., and 100 GB of DRAM for buffering. After: each node keeps the CPU and the DRAM for sorting, RPCs, etc., and replaces the 100 GB of buffering DRAM with 1 TB of flash for buffering.]
Hybrid Techniques for Efficiency
Two Hybrid Techniques
Two ways to use both DRAM and flash for buffering
1. Buffer in DRAM first, flush to flash only under memory pressure
2. Buffer fastest-filling partitions in DRAM, send slowest-filling partitions to flash
Hybrid Technique #1
Take advantage of variation in shuffle workload over time
[Chart: bytes buffered in a Cosco Shuffle Service as it varies over time. Buffer only in DRAM -> provision 1 TB of DRAM. Buffer only in flash -> 100 TB written/day to flash. Hybrid (buffer in DRAM and flash) -> 250 GB of DRAM plus 25 TB written/day to flash.]
Hybrid Technique #1
Buffer in DRAM first, flush to flash only under memory pressure
250 GB DRAM, 25 TB written/day to flash
▪ Example: 25% RAM + 25% flash supports 100% throughput
▪ Spikier workload -> more win
▪ Safer to push the system to its limits
▪ Run out of memory -> immediate bad consequences
▪ But exceed flash endurance guidelines -> okay if you make up for it by writing less in the future
Hybrid Technique #1
Buffer in DRAM first, flush to flash (compared with pure-DRAM Cosco)
Implementation requires balancing. Flash adds another dimension. How to adapt the balancing logic?
[Diagram: in pure-DRAM Cosco, "Shuffle Service is out of DRAM" feeds balancing logic that chooses among: redirect to another shuffle service, flush to DFS, or backpressure mappers. With flash, "Flush to Flash" becomes an additional option, and the open question (???) is how the balancing logic should choose it.]
Hybrid Technique #1
Buffer in DRAM first, flush to flash (compared with pure-DRAM Cosco)
Plug into pre-existing balancing logic
[Diagram: the same balancing logic (redirect to another shuffle service, flush to DFS, backpressure mappers) is kept. Before it runs, a new check asks whether the flash working set is smaller than THRESHOLD: Yes -> flush to flash; No -> fall through to the same pre-existing logic.]
Hybrid Technique #1
Plug into pre-existing balancing logic
[Diagram: "Shuffle Service is out of DRAM" -> flash working set smaller than THRESHOLD? Yes -> flush to flash; No -> redirect to another shuffle service, flush to DFS, or backpressure mappers]
▪ THRESHOLD limits the flash working set size
▪ Configure THRESHOLD to stay under flash endurance limits
▪ Then predict cluster performance as if working-set flash were DRAM
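A minimal sketch of the THRESHOLD check layered in front of the pre-existing options. The action names are illustrative, and the real balancing policy is more involved than a single if-statement.

def choose_action(flash_working_set_bytes, threshold_bytes):
    # Decide what to do when a shuffle service runs out of DRAM.
    if flash_working_set_bytes < threshold_bytes:
        return "flush_to_flash"  # stay under the flash endurance budget
    # Otherwise fall through to the pre-existing options; the real balancing
    # logic picks among redirect / flush-to-DFS / backpressure.
    return "redirect_or_flush_to_dfs_or_backpressure"

# With THRESHOLD set to respect endurance limits, a 200 GB flash working set
# still allows spilling to flash under a 500 GB threshold.
assert choose_action(200 * 2**30, threshold_bytes=500 * 2**30) == "flush_to_flash"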
Hybrid Technique #1
Summary
▪ Take advantage of variation in total shuffle workload over time
▪ Buffer in DRAM first, flush to flash only under memory pressure
▪ Adapt balancing logic
Hybrid Technique #2
Take advantage of variation in partition fill rate
▪ Some partitions fill more slowly than others
▪ Slower partitions wear out flash less quickly
▪ So, use flash to buffer slower partitions, and use DRAM to buffer faster partitions
Hybrid Technique #2
Take advantage of variation in partition fill rate: illustrated with numbers
DRAM: 1 TB
▪ Supports 100K streams each buffering up to 10 MB
Flash: 10 TB, 100 TB written/day
▪ 100K streams each writing 1 GB/day, which is 12 KB/second. (Sanity check: 5 min map stage -> 3.6 MB partition.)
▪ Or 200K streams each writing 6 KB/second -> these streams are better on flash
▪ Or 50K streams each writing 24 KB/second -> these streams would be better on DRAM
Hybrid Technique #2
Buffer fastest-filling partitions in DRAM and slowest-filling partitions in flash
▪ Technique
▪ Periodically measure partition fill rate
▪ If fill rate is less than threshold KB/s, then buffer partition data in flash
▪ Else, buffer partition data in DRAM
▪ Evaluation
▪ Assume “break-even” threshold of 12 KB/s from previous slide
▪ Suppose that 50% of buffer time is spent on partitions that are slower than 12 KB/s
▪ Suppose these slow partitions write an average of 3 KB/s
▪ Then, you can replace half of your buffering DRAM with 25% as much flash
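A minimal sketch of the fill-rate routing rule, assuming the illustrative 12 KB/s break-even figure from the previous slide; the measurement window and function names are assumptions.

BREAK_EVEN_BYTES_PER_SEC = 12 * 1024  # illustrative break-even fill rate from the slide

def buffer_medium(bytes_buffered, seconds_elapsed):
    # Route slow-filling partitions to flash, fast-filling ones to DRAM.
    fill_rate = bytes_buffered / seconds_elapsed
    return "flash" if fill_rate < BREAK_EVEN_BYTES_PER_SEC else "dram"

assert buffer_medium(3 * 1024 * 60, 60) == "flash"   # ~3 KB/s partition -> flash
assert buffer_medium(24 * 1024 * 60, 60) == "dram"   # ~24 KB/s partition -> DRAM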
Hybrid Technique #2
Real-world partition fill rates
[Charts: fill rate vs percentile of partitions, on linear and log scales, ranging from ~0 KiB/sec at the 1st percentile to MiB's/sec at the 99th. A second pair of charts shows the same distribution with partitions weighted by buffering time.]
Combine both hybrid techniques
Buffer in DRAM first, then send the slowest partitions to flash when under memory pressure
▪ Evaluation
▪ Difficult theoretical estimation
▪ Or, do a discrete-event simulation -> Later in this presentation
Future Improvements
Lower-Latency Queries
Made possible by flash
▪ Serve shuffle data directly from flash for some jobs
▪ This is “free” until the flash drive gets so full that the write amplification factor increases (~80% full)
▪ Prioritize interactive/low-latency queries to serve from flash
▪ Buffer bigger chunks to decrease reducer merging
▪ Fewer chunks -> Less chance that reducer needs to do an external merge
Further Efficiency Wins
Made possible by flash
▪ Decrease Cosco replication factor since flash is non-volatile
▪ Currently Cosco replication is R2: Each map output byte is stored on two shuffle services until it is flushed to durable DFS
▪ Most Shuffle Service crashes in production are resolved in a few minutes with process restart
▪ Decrease Cosco replication to R1 for some queries, and attempt to automatically recover map output data from flash after restart
▪ Buffer bigger chunks to allow more efficient Reed-Solomon encodings on DFS
Practical Evaluation Techniques
Practical Evaluation Techniques
▪ Discrete event simulation
▪ Synthetic load generation on a test cluster
▪ Shadow testing on a test cluster
▪ Special canary in a production cluster
Discrete Event Simulation
https://en.wikipedia.org/wiki/Discrete-event_simulation, 2020-05-18
Discrete Event Simulation
Example
[Simulation state: a Shuffle Service Model with buffers for Partition 3 and Partition 42, feeding a DFS Model. Each discrete event advances the clock and updates the counters, e.g.:
Time: 00h:01m:30.000s, Total KB written to flash: 9,000, Overall avg file size written to DFS: NaN
Time: 00h:01m:30.250s, Total KB written to flash: 9,050, Overall avg file size written to DFS: NaN
…
Time: 00h:01m:32.000s, Total KB written to flash: 9,300; one partition buffer is sorted & flushed to DFS as File 0, and the overall avg file size written to DFS becomes 9,200
Time: 00h:01m:32.500s, Total KB written to flash: 9,350, Overall avg file size written to DFS: 9,200]
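A toy discrete-event simulator in the spirit of this example: it replays (time, partition, size) append events in time order, tracks total flash writes, and flushes a partition's buffer to the DFS when it reaches a size limit. The 9,200 KB flush limit and all names are illustrative, not Cosco's actual simulator.

import heapq

FLUSH_LIMIT_KB = 9_200  # illustrative per-partition buffer limit for this example

def simulate(events):
    # events: iterable of (time_s, partition, kb) append events.
    queue = list(events)
    heapq.heapify(queue)            # process events in time order
    buffered_kb = {}                # partition -> KB currently buffered on flash
    total_kb_to_flash = 0
    dfs_file_sizes_kb = []
    while queue:
        time_s, partition, kb = heapq.heappop(queue)
        buffered_kb[partition] = buffered_kb.get(partition, 0) + kb
        total_kb_to_flash += kb     # every buffered package lands on flash
        if buffered_kb[partition] >= FLUSH_LIMIT_KB:
            # Sort & flush: the buffer becomes one file on the DFS.
            dfs_file_sizes_kb.append(buffered_kb.pop(partition))
    avg_file_kb = (sum(dfs_file_sizes_kb) / len(dfs_file_sizes_kb)
                   if dfs_file_sizes_kb else float("nan"))
    return total_kb_to_flash, avg_file_kb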
Discrete Event Simulation
Drive simulation based on production data: the cosco_chunks dataset
Partition | Shuffle Service ID | Chunk (DFS file) number | Chunk Start Time | Chunk Size | Chunk Buffering Time | Chunk Fill Rate (derived from size and buffering time)
3 | 10 | 5 | 2020-05-19 00:00:00.000 | 10 MiB | 5000 ms | 2 MiB/s
42 | 10 | 2 | 2020-05-19 00:01:00.000 | 31 MiB | 10000 ms | 3.1 MiB/s
…
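A sketch of how a cosco_chunks row can be represented and its fill rate derived (size divided by buffering time), matching the table above; the field names are assumptions about the dataset schema.

from dataclasses import dataclass

@dataclass
class ChunkRecord:
    partition: int
    shuffle_service_id: int
    chunk_number: int        # DFS file number
    start_time: str
    size_bytes: int
    buffering_time_ms: int

    @property
    def fill_rate_bytes_per_sec(self) -> float:
        # Fill rate is derived from size and buffering time, as in the table.
        return self.size_bytes / (self.buffering_time_ms / 1000.0)

# First row of the table: 10 MiB buffered over 5000 ms -> 2 MiB/s.
row = ChunkRecord(3, 10, 5, "2020-05-19 00:00:00.000", 10 * 2**20, 5000)
assert row.fill_rate_bytes_per_sec == 2 * 2**20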
Canary on a Production Cluster
▪ Many important metrics are observed on mappers
▪ Example: “percentage of task time spent shuffling”
▪ Example: “map task success rate”
▪ Problem: Mappers talk to many Shuffle Services
▪ Simultaneously
▪ Dynamic balancing can re-route to different Shuffle Services
▪ Solution: Subclusters
▪ Pre-existing feature for large clusters
▪ Each Shuffle Service belongs to one subcluster
▪ Each mapper is assigned to one subcluster, and only uses Shuffle Services in that subcluster
▪ Compare performance of subclusters that contain flash machines vs subclusters that don’t
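A sketch of the subcluster comparison: group mapper-side metrics by the subcluster each mapper was assigned to, then compare flash subclusters against the others. The record layout and metric names are illustrative, not the production schema.

from collections import defaultdict
from statistics import mean

def compare_subclusters(task_records, flash_subclusters):
    # task_records: iterable of (subcluster, pct_task_time_shuffling, succeeded).
    groups = defaultdict(list)
    for subcluster, pct_shuffling, succeeded in task_records:
        group = "flash" if subcluster in flash_subclusters else "control"
        groups[group].append((pct_shuffling, succeeded))
    return {
        group: {
            "avg_pct_task_time_shuffling": mean(r[0] for r in rows),
            "map_task_success_rate": mean(1.0 if r[1] else 0.0 for r in rows),
        }
        for group, rows in groups.items()
    }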
Special Thanks
Chen Yang, Software Engineer at Facebook
Sergey Makagonov, Software Engineer at Facebook
Previous Shuffle presentations from Facebook:
▪ SOS: Optimizing Shuffle IO, Spark Summit 2018
▪ Cosco: An Efficient Facebook-Scale Shuffle Service, Spark Summit 2019
Contact: cosco@fb.com mailing list
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.
Flash for Apache Spark Shuffle with Cosco
Ad

More Related Content

What's hot (20)

The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
Databricks
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
Databricks
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
Databricks
 
Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas Patil
Databricks
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013
Julien Le Dem
 
Delta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDelta Lake: Optimizing Merge
Delta Lake: Optimizing Merge
Databricks
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Databricks
 
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at RuntimeAdaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Databricks
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
Anton Kirillov
 
Spark on YARN
Spark on YARNSpark on YARN
Spark on YARN
Adarsh Pannu
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
Databricks
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
Nishith Agarwal
 
Spark SQL Beyond Official Documentation
Spark SQL Beyond Official DocumentationSpark SQL Beyond Official Documentation
Spark SQL Beyond Official Documentation
Databricks
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the Cloud
Databricks
 
What’s New in the Upcoming Apache Spark 3.0
What’s New in the Upcoming Apache Spark 3.0What’s New in the Upcoming Apache Spark 3.0
What’s New in the Upcoming Apache Spark 3.0
Databricks
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
Databricks
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Spark Summit
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
Ryan Blue
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
Pietro Michiardi
 
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 Best Practice of Compression/Decompression Codes in Apache Spark with Sophia... Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Databricks
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
Databricks
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
Databricks
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
Databricks
 
Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas Patil
Databricks
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013
Julien Le Dem
 
Delta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDelta Lake: Optimizing Merge
Delta Lake: Optimizing Merge
Databricks
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Databricks
 
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at RuntimeAdaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Databricks
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
Anton Kirillov
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
Databricks
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
Nishith Agarwal
 
Spark SQL Beyond Official Documentation
Spark SQL Beyond Official DocumentationSpark SQL Beyond Official Documentation
Spark SQL Beyond Official Documentation
Databricks
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the Cloud
Databricks
 
What’s New in the Upcoming Apache Spark 3.0
What’s New in the Upcoming Apache Spark 3.0What’s New in the Upcoming Apache Spark 3.0
What’s New in the Upcoming Apache Spark 3.0
Databricks
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
Databricks
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Spark Summit
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
Ryan Blue
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
Pietro Michiardi
 
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 Best Practice of Compression/Decompression Codes in Apache Spark with Sophia... Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Databricks
 

Similar to Flash for Apache Spark Shuffle with Cosco (20)

An Optimized Diffusion Depth Of Field Solver
An Optimized Diffusion Depth Of Field SolverAn Optimized Diffusion Depth Of Field Solver
An Optimized Diffusion Depth Of Field Solver
Holger Gruen
 
CS 542 Putting it all together -- Storage Management
CS 542 Putting it all together -- Storage ManagementCS 542 Putting it all together -- Storage Management
CS 542 Putting it all together -- Storage Management
J Singh
 
White paper hadoop performancetuning
White paper hadoop performancetuningWhite paper hadoop performancetuning
White paper hadoop performancetuning
Anil Reddy
 
Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...
Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...
Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...
Chris Fregly
 
Hadoop performance optimization tips
Hadoop performance optimization tipsHadoop performance optimization tips
Hadoop performance optimization tips
Subhas Kumar Ghosh
 
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Databricks
 
Programming Languages & Tools for Higher Performance & Productivity
Programming Languages & Tools for Higher Performance & ProductivityProgramming Languages & Tools for Higher Performance & Productivity
Programming Languages & Tools for Higher Performance & Productivity
Linaro
 
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Chris Fregly
 
MapR, Implications for Integration
MapR, Implications for IntegrationMapR, Implications for Integration
MapR, Implications for Integration
trihug
 
Media storage
Media storageMedia storage
Media storage
Mohammed El Hedhly
 
HDT for Mainframe Considerations: Simplified Tiered Storage
HDT for Mainframe Considerations: Simplified Tiered StorageHDT for Mainframe Considerations: Simplified Tiered Storage
HDT for Mainframe Considerations: Simplified Tiered Storage
Hitachi Vantara
 
All About Storeconfigs
All About StoreconfigsAll About Storeconfigs
All About Storeconfigs
Brice Figureau
 
Llnl talk
Llnl talkLlnl talk
Llnl talk
Ted Dunning
 
Virtual memory
Virtual memoryVirtual memory
Virtual memory
Asif Iqbal
 
Berkeley Performance Tuning
Berkeley Performance TuningBerkeley Performance Tuning
Berkeley Performance Tuning
George Ang
 
SOS: Optimizing Shuffle I/O with Brian Cho and Ergin Seyfe
SOS: Optimizing Shuffle I/O with Brian Cho and Ergin SeyfeSOS: Optimizing Shuffle I/O with Brian Cho and Ergin Seyfe
SOS: Optimizing Shuffle I/O with Brian Cho and Ergin Seyfe
Databricks
 
Measuring Firebird Disk I/O
Measuring Firebird Disk I/OMeasuring Firebird Disk I/O
Measuring Firebird Disk I/O
Mind The Firebird
 
Apache Nemo
Apache NemoApache Nemo
Apache Nemo
NAVER Engineering
 
Skew Mitigation For Facebook PetabyteScale Joins
Skew Mitigation For Facebook PetabyteScale JoinsSkew Mitigation For Facebook PetabyteScale Joins
Skew Mitigation For Facebook PetabyteScale Joins
Databricks
 
London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015
Chris Fregly
 
An Optimized Diffusion Depth Of Field Solver
An Optimized Diffusion Depth Of Field SolverAn Optimized Diffusion Depth Of Field Solver
An Optimized Diffusion Depth Of Field Solver
Holger Gruen
 
CS 542 Putting it all together -- Storage Management
CS 542 Putting it all together -- Storage ManagementCS 542 Putting it all together -- Storage Management
CS 542 Putting it all together -- Storage Management
J Singh
 
White paper hadoop performancetuning
White paper hadoop performancetuningWhite paper hadoop performancetuning
White paper hadoop performancetuning
Anil Reddy
 
Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...
Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...
Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...
Chris Fregly
 
Hadoop performance optimization tips
Hadoop performance optimization tipsHadoop performance optimization tips
Hadoop performance optimization tips
Subhas Kumar Ghosh
 
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Databricks
 
Programming Languages & Tools for Higher Performance & Productivity
Programming Languages & Tools for Higher Performance & ProductivityProgramming Languages & Tools for Higher Performance & Productivity
Programming Languages & Tools for Higher Performance & Productivity
Linaro
 
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Chris Fregly
 
MapR, Implications for Integration
MapR, Implications for IntegrationMapR, Implications for Integration
MapR, Implications for Integration
trihug
 
HDT for Mainframe Considerations: Simplified Tiered Storage
HDT for Mainframe Considerations: Simplified Tiered StorageHDT for Mainframe Considerations: Simplified Tiered Storage
HDT for Mainframe Considerations: Simplified Tiered Storage
Hitachi Vantara
 
All About Storeconfigs
All About StoreconfigsAll About Storeconfigs
All About Storeconfigs
Brice Figureau
 
Virtual memory
Virtual memoryVirtual memory
Virtual memory
Asif Iqbal
 
Berkeley Performance Tuning
Berkeley Performance TuningBerkeley Performance Tuning
Berkeley Performance Tuning
George Ang
 
SOS: Optimizing Shuffle I/O with Brian Cho and Ergin Seyfe
SOS: Optimizing Shuffle I/O with Brian Cho and Ergin SeyfeSOS: Optimizing Shuffle I/O with Brian Cho and Ergin Seyfe
SOS: Optimizing Shuffle I/O with Brian Cho and Ergin Seyfe
Databricks
 
Skew Mitigation For Facebook PetabyteScale Joins
Skew Mitigation For Facebook PetabyteScale JoinsSkew Mitigation For Facebook PetabyteScale Joins
Skew Mitigation For Facebook PetabyteScale Joins
Databricks
 
London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015
Chris Fregly
 
Ad

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
Ad

Recently uploaded (20)

Minions Want to eat presentacion muy linda
Minions Want to eat presentacion muy lindaMinions Want to eat presentacion muy linda
Minions Want to eat presentacion muy linda
CarlaAndradesSoler1
 
183409-christina-rossetti.pdfdsfsdasggsag
183409-christina-rossetti.pdfdsfsdasggsag183409-christina-rossetti.pdfdsfsdasggsag
183409-christina-rossetti.pdfdsfsdasggsag
fardin123rahman07
 
VKS-Python-FIe Handling text CSV Binary.pptx
VKS-Python-FIe Handling text CSV Binary.pptxVKS-Python-FIe Handling text CSV Binary.pptx
VKS-Python-FIe Handling text CSV Binary.pptx
Vinod Srivastava
 
LLM finetuning for multiple choice google bert
LLM finetuning for multiple choice google bertLLM finetuning for multiple choice google bert
LLM finetuning for multiple choice google bert
ChadapornK
 
04302025_CCC TUG_DataVista: The Design Story
04302025_CCC TUG_DataVista: The Design Story04302025_CCC TUG_DataVista: The Design Story
04302025_CCC TUG_DataVista: The Design Story
ccctableauusergroup
 
chapter3 Central Tendency statistics.ppt
chapter3 Central Tendency statistics.pptchapter3 Central Tendency statistics.ppt
chapter3 Central Tendency statistics.ppt
justinebandajbn
 
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
ThanushsaranS
 
Conic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptxConic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptx
taiwanesechetan
 
How iCode cybertech Helped Me Recover My Lost Funds
How iCode cybertech Helped Me Recover My Lost FundsHow iCode cybertech Helped Me Recover My Lost Funds
How iCode cybertech Helped Me Recover My Lost Funds
ireneschmid345
 
Data Analytics Overview and its applications
Data Analytics Overview and its applicationsData Analytics Overview and its applications
Data Analytics Overview and its applications
JanmejayaMishra7
 
Ch3MCT24.pptx measure of central tendency
Ch3MCT24.pptx measure of central tendencyCh3MCT24.pptx measure of central tendency
Ch3MCT24.pptx measure of central tendency
ayeleasefa2
 
Cleaned_Lecture 6666666_Simulation_I.pdf
Cleaned_Lecture 6666666_Simulation_I.pdfCleaned_Lecture 6666666_Simulation_I.pdf
Cleaned_Lecture 6666666_Simulation_I.pdf
alcinialbob1234
 
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
Simran112433
 
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your CompetitorsAI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
Contify
 
GenAI for Quant Analytics: survey-analytics.ai
GenAI for Quant Analytics: survey-analytics.aiGenAI for Quant Analytics: survey-analytics.ai
GenAI for Quant Analytics: survey-analytics.ai
Inspirient
 
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.pptJust-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
ssuser5f8f49
 
Calories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptxCalories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptx
TijiLMAHESHWARI
 
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnTemplate_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
cegiver630
 
Molecular methods diagnostic and monitoring of infection - Repaired.pptx
Molecular methods diagnostic and monitoring of infection  -  Repaired.pptxMolecular methods diagnostic and monitoring of infection  -  Repaired.pptx
Molecular methods diagnostic and monitoring of infection - Repaired.pptx
7tzn7x5kky
 
Simple_AI_Explanation_English somplr.pptx
Simple_AI_Explanation_English somplr.pptxSimple_AI_Explanation_English somplr.pptx
Simple_AI_Explanation_English somplr.pptx
ssuser2aa19f
 
Minions Want to eat presentacion muy linda
Minions Want to eat presentacion muy lindaMinions Want to eat presentacion muy linda
Minions Want to eat presentacion muy linda
CarlaAndradesSoler1
 
183409-christina-rossetti.pdfdsfsdasggsag
183409-christina-rossetti.pdfdsfsdasggsag183409-christina-rossetti.pdfdsfsdasggsag
183409-christina-rossetti.pdfdsfsdasggsag
fardin123rahman07
 
VKS-Python-FIe Handling text CSV Binary.pptx
VKS-Python-FIe Handling text CSV Binary.pptxVKS-Python-FIe Handling text CSV Binary.pptx
VKS-Python-FIe Handling text CSV Binary.pptx
Vinod Srivastava
 
LLM finetuning for multiple choice google bert
LLM finetuning for multiple choice google bertLLM finetuning for multiple choice google bert
LLM finetuning for multiple choice google bert
ChadapornK
 
04302025_CCC TUG_DataVista: The Design Story
04302025_CCC TUG_DataVista: The Design Story04302025_CCC TUG_DataVista: The Design Story
04302025_CCC TUG_DataVista: The Design Story
ccctableauusergroup
 
chapter3 Central Tendency statistics.ppt
chapter3 Central Tendency statistics.pptchapter3 Central Tendency statistics.ppt
chapter3 Central Tendency statistics.ppt
justinebandajbn
 
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
ThanushsaranS
 
Conic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptxConic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptx
taiwanesechetan
 
How iCode cybertech Helped Me Recover My Lost Funds
How iCode cybertech Helped Me Recover My Lost FundsHow iCode cybertech Helped Me Recover My Lost Funds
How iCode cybertech Helped Me Recover My Lost Funds
ireneschmid345
 
Data Analytics Overview and its applications
Data Analytics Overview and its applicationsData Analytics Overview and its applications
Data Analytics Overview and its applications
JanmejayaMishra7
 
Ch3MCT24.pptx measure of central tendency
Ch3MCT24.pptx measure of central tendencyCh3MCT24.pptx measure of central tendency
Ch3MCT24.pptx measure of central tendency
ayeleasefa2
 
Cleaned_Lecture 6666666_Simulation_I.pdf
Cleaned_Lecture 6666666_Simulation_I.pdfCleaned_Lecture 6666666_Simulation_I.pdf
Cleaned_Lecture 6666666_Simulation_I.pdf
alcinialbob1234
 
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
Simran112433
 
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your CompetitorsAI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
Contify
 
GenAI for Quant Analytics: survey-analytics.ai
GenAI for Quant Analytics: survey-analytics.aiGenAI for Quant Analytics: survey-analytics.ai
GenAI for Quant Analytics: survey-analytics.ai
Inspirient
 
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.pptJust-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
ssuser5f8f49
 
Calories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptxCalories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptx
TijiLMAHESHWARI
 
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnTemplate_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
cegiver630
 
Molecular methods diagnostic and monitoring of infection - Repaired.pptx
Molecular methods diagnostic and monitoring of infection  -  Repaired.pptxMolecular methods diagnostic and monitoring of infection  -  Repaired.pptx
Molecular methods diagnostic and monitoring of infection - Repaired.pptx
7tzn7x5kky
 
Simple_AI_Explanation_English somplr.pptx
Simple_AI_Explanation_English somplr.pptxSimple_AI_Explanation_English somplr.pptx
Simple_AI_Explanation_English somplr.pptx
ssuser2aa19f
 

Flash for Apache Spark Shuffle with Cosco

  • 2. Flash for Spark Shuffle with Cosco Aaron Gabriel Feldman Software Engineer at Facebook
  • 3. Agenda 1. Motivation 2. Intro to shuffle architecture 3. Flash 4. Hybrid RAM + flash techniques 5. Future improvements 6. Testing techniques
  • 4. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.
  • 5. Why should you care? ▪ IO efficiency ▪ Cosco is a service that improves IO efficiency (disk service time) by 3x for shuffle data ▪ Compute efficiency ▪ Flash supports more workload with less Cosco hardware ▪ Query latency is less of a focus ▪ Cosco helps shuffle-heavy queries, but query latency has not been our focus. We have been focused on batch workloads. ▪ Flash unlocks new possibilities to improve query latency, but that is future work ▪ Techniques for development and analysis ▪ Hopefully, some of these are applicable outside of Cosco
  • 6. Intro to Shuffle Architecture
  • 7. Spark Shuffle Recap Map 0 Map 1 Map m Reduce 0 Reduce 1 Reduce r Partition Mappers Map Output Files (on disk/DFS) Reducers Map output files written to local storage or distributed filesystem Adapted from Cosco: An Efficient Facebook-Scale Shuffle Service, Spark Summit 2019
  • 8. Spark Shuffle Recap Map 0 Map 1 Map m Reduce 0 Reduce 1 Reduce r Partition Mappers Map Output Files (on disk/DFS) Reducers Reducers pull from map output files Adapted from Cosco: An Efficient Facebook-Scale Shuffle Service, Spark Summit 2019
  • 9. Spark Shuffle Recap Map 0 Map 1 Map m Reduce 0 Reduce 1 Reduce r Partition Mappers Map Output Files (on disk/DFS) Reducers Sort by key Iterator Iterator Iterator Adapted from Cosco: An Efficient Facebook-Scale Shuffle Service, Spark Summit 2019
  • 10. Spark Shuffle Recap Map 0 Map 1 Map m Reduce 0 Reduce 1 Reduce r Partition Mappers Map Output Files (on disk/DFS) Reducers Sort by key Iterator Iterator Iterator Write amplification is ~3x Write amplification problem Adapted from Cosco: An Efficient Facebook-Scale Shuffle Service, Spark Summit 2019
  • 11. Spark Shuffle Recap Map 0 Map 1 Map m Reduce 0 Reduce 1 Reduce r Partition Sort by key Iterator Iterator Iterator Write amplification is ~3x And small IOs problem M x R Avg IO size is ~200 KiB Mappers Map Output Files (on disk/DFS) Reducers Adapted from Cosco: An Efficient Facebook-Scale Shuffle Service, Spark Summit 2019
  • 12. Spark Shuffle Recap Map 0 Map 1 Map m Reduce 0 Reduce 1 Reduce r Partition Mappers Map Output Files (on disk/DFS) Reducers Reducers pull from map output files Sort by key Iterator Iterator Iterator Simplified drawing Adapted from Cosco: An Efficient Facebook-Scale Shuffle Service, Spark Summit 2019
  • 13. Spark Shuffle Recap Map 1 Map m Reduce 1 Reduce r Mappers Map Output Files (on disk/DFS) Reducers Reducers pull from map output files Sort by key Iterator Iterator Simplified drawing Adapted from Cosco: An Efficient Facebook-Scale Shuffle Service, Spark Summit 2019
  • 14. Spark Shuffle Recap Map 1 Map m Reduce 1 Reduce r Mappers Map Output Files (on disk/DFS) Reducers Simplified drawing Adapted from Cosco: An Efficient Facebook-Scale Shuffle Service, Spark Summit 2019
  • 15. Spark Shuffle Recap Map 1 Map m Mappers Map Output Files (on disk/DFS) Simplified drawing Reduce 1 Reduce r Reducers Adapted from Cosco: An Efficient Facebook-Scale Shuffle Service, Spark Summit 2019
  • 16. Cosco Shuffle for Spark Reduce 1 Reduce r Mappers Reducers Shuffle Service 1 Shuffle Service 2 Shuffle Service N Partition 1 Partition r Shuffle Services (N = thousands) Map m Map 1 Mappers stream their output to Cosco Shuffle Services, which buffer in memory Streaming output In-memory buffering Adapted from Cosco: An Efficient Facebook-Scale Shuffle Service, Spark Summit 2019
  • 17. Cosco Shuffle for Spark Reduce 1 Reduce r Mappers Reducers Shuffle Service 1 Shuffle Service 2 Shuffle Service N Partition 1 (file 0 buffer) Partition r (file 0 buffer) File 0 File 0 Shuffle Services (N = thousands) Distributed Filesystem (HDFS/Warm Storage) Map m Map 1 Sort and flush buffers to DFS when full Streaming output In-memory buffering Sort (if required by query) Flush Flush Adapted from Cosco: An Efficient Facebook-Scale Shuffle Service, Spark Summit 2019
  • 18. Cosco Shuffle for Spark Reduce 1 Reduce r Mappers Reducers Shuffle Service 1 Shuffle Service 2 Shuffle Service N Partition 1 (file 1 buffer) Partition r (file 0 buffer) File 0 File 1 File 0 Shuffle Services (N = thousands) Distributed Filesystem (HDFS/Warm Storage) Map m Map 1 Streaming output In-memory buffering Flush Sort (if required by query) Flush Sort and flush buffers to DFS when full Adapted from Cosco: An Efficient Facebook-Scale Shuffle Service, Spark Summit 2019
  • 19. Cosco Shuffle for Spark Reduce 1 Reduce r Mappers Reducers Shuffle Service 1 Shuffle Service 2 Shuffle Service N Partition 1 (file 2 buffer) Partition r (file 0 buffer) File 0 File 1 File 0 Shuffle Services (N = thousands) Distributed Filesystem (HDFS/Warm Storage) Map m Map 1 File 2 Streaming output In-memory buffering Flush Sort (if required by query) Flush Sort and flush buffers to DFS when full Adapted from Cosco: An Efficient Facebook-Scale Shuffle Service, Spark Summit 2019
  • 20. Cosco Shuffle for Spark Reduce 1 Reduce r Mappers Reducers Shuffle Service 1 Shuffle Service 2 Shuffle Service N Partition 1 (file 2 buffer) Partition r (file 1 buffer) File 0 File 1 File 2 File 0 File 1 Shuffle Services (N = thousands) Distributed Filesystem (HDFS/Warm Storage) Map m Map 1 Streaming output In-memory buffering Flush Sort (if required by query) Flush Sort and flush buffers to DFS when full Adapted from Cosco: An Efficient Facebook-Scale Shuffle Service, Spark Summit 2019
  • 21. Iterator Iterator Cosco Shuffle for Spark Reduce 1 Reduce r Mappers Reducers Shuffle Service 1 Shuffle Service 2 Shuffle Service N Partition 1 (file 2 buffer) Partition r (file 1 buffer) File 0 File 1 File 2 File 0 File 1 Shuffle Services (N = thousands) Distributed Filesystem (HDFS/Warm Storage) Map m Map 1 Streaming output In-memory buffering Flush Sort (if required by query) Flush Reducers do a streaming merge after map stage completes Streaming merge Adapted from Cosco: An Efficient Facebook-Scale Shuffle Service, Spark Summit 2019
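To make the reducer-side streaming merge concrete, here is a minimal sketch of how a reducer could lazily merge the sorted per-partition files that the shuffle services flushed to DFS. This is not Cosco's actual code: the record format, the `read_sorted_records` helper, and reading local paths as a stand-in for a DFS client are all assumptions for illustration.

```python
import heapq
from typing import Iterator, Tuple

def read_sorted_records(dfs_path: str) -> Iterator[Tuple[bytes, bytes]]:
    """Hypothetical reader: yields (key, value) records from one flushed
    Cosco file, assumed already sorted by key when the query requires it."""
    with open(dfs_path, "rb") as f:  # local file as a stand-in for a DFS client
        for line in f:
            key, _, value = line.rstrip(b"\n").partition(b"\t")
            yield key, value

def streaming_merge(partition_files: list) -> Iterator[Tuple[bytes, bytes]]:
    """Reducer-side streaming merge: open every file for this reduce partition
    and merge lazily, so memory stays bounded by one record per open file
    rather than by the full partition size."""
    streams = [read_sorted_records(path) for path in partition_files]
    return heapq.merge(*streams, key=lambda kv: kv[0])

# Usage: iterate the merged stream exactly once, as a reducer would.
# for key, value in streaming_merge(["partition1_file0", "partition1_file1"]):
#     process(key, value)
```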
  • 22. Replace DRAM with Flash for Buffering
  • 23. Buffering Is Appending Mappers Shuffle Service 1 Shuffle Service 2 Shuffle Service N Partition r Shuffle Services (N = thousands) Map m Map 1 Each package is a few 10s of KiB
  • 28. Replace DRAM with Flash for Buffering Mappers Shuffle Service 1 Shuffle Service 2 Shuffle Service N Partition r Shuffle Services (N = thousands) Map m Map 1 Each package is a few 10s of KiB Simply buffer to flash instead of memory On flash
  • 29. Replace DRAM with Flash for Buffering Mappers Shuffle Service 1 Shuffle Service 2 Shuffle Service N Partition r Shuffle Services (N = thousands) Map m Map 1 Each package is a few 10s of KiB ▪ Appending is a friendly pattern for flash ▪ Minimize flash write amplification -> minimizing wear on the drive Simply buffer to flash instead of memory On flash
  • 30. Replace DRAM with Flash for Buffering Mappers Shuffle Service 1 Shuffle Service 2 Shuffle Service N Partition r Shuffle Services (N = thousands) Map m Map 1 Each package is a few 10s of KiB ▪ Appending is a friendly pattern for flash ▪ Minimize flash write amplification -> minimizing wear on the drive Simply buffer to flash instead of memory On flash Read back to main memory for sorting
  • 31. Replace DRAM with Flash for Buffering Mappers Shuffle Service 1 Shuffle Service 2 Shuffle Service N Partition r Shuffle Services (N = thousands) Map m Map 1 Each package is a few 10s of KiB ▪ Appending is a friendly pattern for flash ▪ Minimize flash write amplification -> minimizing wear on the drive ▪ Flash write/read latency is negligible ▪ Generally non-blocking ▪ Latency is much less than buffering time Simply buffer to flash instead of memory On flash Read back to main memory for sorting
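As a rough illustration of this append-only buffering pattern, the sketch below buffers one partition's packages on a flash-backed path and reads them back into main memory to sort before flushing to the DFS. It is a simplified model under stated assumptions, not Cosco's implementation; the flash directory layout, the length-prefixed package format, and the `write_file_to_dfs` / `sort_records` callables are all hypothetical.

```python
import os

class FlashPartitionBuffer:
    """Buffers one partition's map output on flash instead of DRAM. Packages
    (a few tens of KiB each) are only ever appended, which keeps flash write
    amplification, and therefore drive wear, low."""

    def __init__(self, flash_dir: str, partition_id: int):
        self.path = os.path.join(flash_dir, f"partition_{partition_id}.buf")
        self.bytes_buffered = 0

    def append(self, package: bytes) -> None:
        # Append-only write; its latency is small relative to buffering time,
        # so mappers are effectively never blocked on it.
        with open(self.path, "ab") as f:
            f.write(len(package).to_bytes(4, "big"))
            f.write(package)
        self.bytes_buffered += len(package)

    def flush_to_dfs(self, write_file_to_dfs, sort_records) -> None:
        # Read the whole buffer back into main memory, sort if the query
        # requires it, write one large file to the DFS, then reset the buffer.
        packages = []
        with open(self.path, "rb") as f:
            while header := f.read(4):
                packages.append(f.read(int.from_bytes(header, "big")))
        write_file_to_dfs(sort_records(packages))  # both callables are assumptions
        os.remove(self.path)
        self.bytes_buffered = 0
```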
  • 32. Example Rule of Thumb ▪ Hypothetical example numbers ▪ Assume 1 GB Flash can endure ~10 GB of writes per day for the lifetime of the device ▪ Assume you are indifferent between consuming 1 GB DRAM vs ~10 GB Flash with write throughput at the endurance limit ▪ Then, you would be indifferent between consuming 1 GB DRAM vs ~100 GB/day Flash ▪ Notes ▪ These numbers chosen entirely because they are round -> Easier to illustrate math on slides ▪ DRAM consumes more power than Flash Would you rather consume 1 GB DRAM or flash that can endure 100 GB/day of write throughput?
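The rule of thumb is just a multiplication, but writing it down makes the later evaluations easier to follow. A minimal sketch using only the hypothetical round numbers from this slide (10 GB/day of write endurance per GB of flash, and indifference between 1 GB of DRAM and 10 GB of endurance-limited flash):

```python
def flash_write_budget_per_gb_dram(endurance_gb_per_day_per_gb_flash: float = 10.0,
                                   flash_gb_equivalent_to_1_gb_dram: float = 10.0) -> float:
    """GB/day of flash writes you can 'afford' in place of 1 GB of DRAM,
    under the slide's hypothetical round numbers."""
    return endurance_gb_per_day_per_gb_flash * flash_gb_equivalent_to_1_gb_dram

# 1 GB DRAM <-> 10 GB flash <-> 10 GB x 10 GB/day per GB = 100 GB/day of writes
assert flash_write_budget_per_gb_dram() == 100.0
```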
  • 33. Basic Evaluation ▪ Example Cosco cluster ▪ 10 nodes ▪ Each node uses 100 GB DRAM for buffering ▪ And has additional DRAM for sorting, RPCs, etc. ▪ So, 1 TB DRAM for buffering in total ▪ Again, numbers are chosen for illustration only ▪ Apply the example rule of thumb ▪ Indifferent between consuming 1 TB DRAM vs 100 TB/day flash endurance ▪ If this cluster shuffles less than 100 TB/day, then it is efficient to replace DRAM with Flash ▪ Each node replaces 100 GB DRAM with ~1 TB flash for buffering ▪ Nodes keep some DRAM for sorting, RPCs, etc.
  • 34. Basic Evaluation: Summary for a cluster shuffling 100 TB/day [diagram: 10 shuffle services; before, each node has CPU, DRAM for sorting/RPCs/etc., and 100 GB DRAM for buffering; after, each node has CPU, DRAM for sorting/RPCs/etc., and 1 TB flash for buffering]
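Applying the rule of thumb to this example cluster is the same arithmetic at cluster scale; a quick sketch of the break-even check, again with illustrative numbers only:

```python
def dram_buffer_replaceable_by_flash(nodes: int,
                                     dram_buffer_gb_per_node: float,
                                     shuffle_tb_per_day: float,
                                     write_budget_gb_per_day_per_gb_dram: float = 100.0) -> bool:
    """True if the cluster's daily shuffle volume fits within the flash write
    budget implied by the buffering DRAM it would replace."""
    total_dram_gb = nodes * dram_buffer_gb_per_node
    flash_budget_tb_per_day = total_dram_gb * write_budget_gb_per_day_per_gb_dram / 1000
    return shuffle_tb_per_day <= flash_budget_tb_per_day

# 10 nodes x 100 GB DRAM = 1 TB of buffering -> 100 TB/day of flash budget,
# so a cluster shuffling 100 TB/day sits right at the break-even point.
assert dram_buffer_replaceable_by_flash(nodes=10, dram_buffer_gb_per_node=100, shuffle_tb_per_day=100)
```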
  • 35. Hybrid Techniques for Efficiency
  • 36. Two Hybrid Techniques Two ways to use both DRAM and flash for buffering 1. Buffer in DRAM first, flush to flash only under memory pressure 2. Buffer fastest-filling partitions in DRAM, send slowest-filling partitions to flash
  • 37. Hybrid Technique #1: Take advantage of variation in shuffle workload over time [chart: bytes buffered in a Cosco Shuffle Service vs. time]
  • 38. Hybrid Technique #1: Take advantage of variation in shuffle workload over time [chart: bytes buffered vs. time]. Buffer only in DRAM: 1 TB. Buffer only in flash: 100 TB written/day.
  • 39. Hybrid Technique #1: Take advantage of variation in shuffle workload over time. Buffer only in DRAM: 1 TB. Buffer only in flash: 100 TB written/day. Hybrid (buffer in DRAM and flash): 250 GB DRAM plus 25 TB written/day to flash.
  • 40. Hybrid Technique #1 Buffer in DRAM first, flush to flash only under memory pressure 250 GB DRAM 25 TB written/day to flash ▪ Example: 25% RAM + 25% flash supports 100% throughput ▪ Spikier workload -> more win ▪ Safer to push the system to its limits ▪ Run out of memory -> immediate bad consequences ▪ But exceed flash endurance guidelines -> okay if you make up for it by writing less in the future
  • 41. Hybrid Technique #1: Buffer in DRAM first, flush to flash. Implementation requires balancing, and flash adds another dimension: how do we adapt the balancing logic? Pure-DRAM Cosco: Shuffle Service is out of DRAM -> Balancing Logic -> Redirect to another shuffle service / Flush to DFS / Backpressure mappers. With flash: Shuffle Service is out of DRAM -> ??? -> Redirect to another shuffle service / Flush to DFS / Backpressure mappers / Flush to Flash
  • 42. Hybrid Technique #1: Buffer in DRAM first, flush to flash, plugging into the pre-existing balancing logic. Pure-DRAM Cosco: Shuffle Service is out of DRAM -> Balancing Logic -> Redirect to another shuffle service / Flush to DFS / Backpressure mappers. With flash: Shuffle Service is out of DRAM -> Flash working set smaller than THRESHOLD? Yes -> Flush to Flash; No -> same Balancing Logic (Redirect to another shuffle service / Flush to DFS / Backpressure mappers)
  • 43. Hybrid Technique #1 Plug into pre-existing balancing logic Balancing Logic Redirect to another shuffle service Flush to DFS Backpressure mappers Shuffle Service is out of DRAM Flash working set smaller than THRESHOLD ? No Flush to Flash Yes ▪ THRESHOLD limits flash working set size ▪ Configure THRESHOLD to stay under flash endurance limits ▪ Then predict cluster performance as if working-set flash were DRAM
  • 44. Hybrid Technique #1 Summary ▪ Take advantage of variation in total shuffle workload over time ▪ Buffer in DRAM first, flush to flash only under memory pressure ▪ Adapt balancing logic
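The adapted balancing decision can be pictured as one small function. This is a simplified model of the flow in the preceding slides, not Cosco's actual code; the THRESHOLD value, the argument names, and the fallback ordering are assumptions.

```python
FLASH_WORKING_SET_THRESHOLD_GB = 1_000.0  # assumed value, chosen to stay under flash endurance limits

def on_out_of_dram(flash_working_set_gb: float,
                   can_redirect: bool,
                   buffer_large_enough_to_flush: bool) -> str:
    """Decision when a shuffle service runs out of DRAM for buffering.
    Hybrid Technique #1: flush to flash while the flash working set stays
    under the endurance-derived cap; otherwise fall back to the pre-existing
    pure-DRAM balancing actions."""
    if flash_working_set_gb < FLASH_WORKING_SET_THRESHOLD_GB:
        return "flush_to_flash"
    # Same logic as pure-DRAM Cosco from here on.
    if can_redirect:
        return "redirect_to_another_shuffle_service"
    if buffer_large_enough_to_flush:
        return "flush_to_dfs"
    return "backpressure_mappers"

# A service 600 GB into its flash working set still prefers flash:
assert on_out_of_dram(600.0, can_redirect=True, buffer_large_enough_to_flush=False) == "flush_to_flash"
```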
  • 45. Hybrid Technique #2 Take advantage of variation in partition fill rate ▪ Some partitions fill more slowly than others ▪ Slower partitions wear out flash less quickly ▪ So, use flash to buffer slower partitions, and use DRAM to buffer faster partitions
  • 46. Hybrid Technique #2: Take advantage of variation in partition fill rate, illustrated with numbers. DRAM: ▪ 1 TB ▪ Supports 100K streams each buffering up to 10 MB. Flash: ▪ 10 TB, 100 TB written/day ▪ 100K streams each writing 1 GB/day, which is 12 KB/second. (Sanity check: 5 min map stage -> 3.6 MB partition.) ▪ Or 200K streams each writing 6 KB/second -> these streams are better on flash ▪ Or 50K streams each writing 24 KB/second -> these streams would be better on DRAM
  • 47. Hybrid Technique #2 Buffer fastest-filling partitions in DRAM and slowest-filling partitions in flash ▪ Technique ▪ Periodically measure partition fill rate ▪ If fill rate is less than threshold KB/s, then buffer partition data in flash ▪ Else, buffer partition data in DRAM ▪ Evaluation ▪ Assume “break-even” threshold of 12 KB/s from previous slide ▪ Suppose that 50% of buffer time is spent on partitions that are slower than 12 KB/s ▪ Suppose these slow partitions write an average of 3 KB/s ▪ Then, you can replace half of your buffering DRAM with 25% as much flash
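A minimal sketch of the fill-rate rule, using the illustrative 12 KB/s break-even threshold; the measurement window and surrounding plumbing are assumptions rather than Cosco's implementation:

```python
BREAK_EVEN_FILL_RATE_KB_S = 12.0  # from the illustrative DRAM-vs-flash numbers above

def choose_buffer_medium(bytes_appended_in_window: int, window_seconds: float) -> str:
    """Hybrid Technique #2: periodically re-measure a partition's fill rate and
    keep slow partitions (which wear flash slowly) on flash, fast ones in DRAM."""
    fill_rate_kb_s = bytes_appended_in_window / 1024 / window_seconds
    return "flash" if fill_rate_kb_s < BREAK_EVEN_FILL_RATE_KB_S else "dram"

# A partition receiving 180 KiB over a 60 s window fills at 3 KiB/s -> flash.
assert choose_buffer_medium(180 * 1024, 60.0) == "flash"
# One receiving 1500 KiB over 60 s fills at 25 KiB/s -> DRAM.
assert choose_buffer_medium(1500 * 1024, 60.0) == "dram"
```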
  • 48. Hybrid Technique #2: Real-world partition fill rates [charts: fill rate (0 KiB/sec up to MiB's/sec, linear and log scale) vs. percentile of partitions, 1st to 99th]
  • 49. Hybrid Technique #2: Real-world partition fill rates [charts: fill rate (0 KiB/sec up to MiB's/sec, linear and log scale) vs. percentile of partitions, 1st to 99th, shown both unweighted and weighted by buffering time]
  • 50. Combine both hybrid techniques Buffer in DRAM first, then send the slowest partitions to flash when under memory pressure ▪ Evaluation ▪ Difficult theoretical estimation ▪ Or, do a discrete-event simulation -> Later in this presentation
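One way to picture the combined policy: under memory pressure, evict the slowest-filling partitions from DRAM to flash, still capped by the flash working-set threshold from Hybrid Technique #1. A minimal sketch under the same illustrative assumptions as the two sketches above (hypothetical names and thresholds):

```python
FLASH_WORKING_SET_THRESHOLD_GB = 1_000.0  # same assumed cap as in Hybrid Technique #1

def partitions_to_move_to_flash(fill_rates_kb_s: dict,
                                gb_to_free: float,
                                partition_buffer_gb: float,
                                flash_working_set_gb: float) -> list:
    """Combined hybrid policy: when DRAM is under pressure, move the
    slowest-filling partitions to flash (they wear it least), stopping once
    enough DRAM is freed or the flash working-set cap would be exceeded."""
    chosen, freed = [], 0.0
    for partition in sorted(fill_rates_kb_s, key=fill_rates_kb_s.get):
        if freed >= gb_to_free:
            break
        if flash_working_set_gb + partition_buffer_gb > FLASH_WORKING_SET_THRESHOLD_GB:
            break  # fall back to redirect / flush-to-DFS / backpressure instead
        chosen.append(partition)
        freed += partition_buffer_gb
        flash_working_set_gb += partition_buffer_gb
    return chosen

# Freeing 0.02 GB moves the two slowest partitions (3 and 6 KiB/s) to flash:
rates = {"p1": 3.0, "p2": 25.0, "p3": 6.0}
assert partitions_to_move_to_flash(rates, gb_to_free=0.02, partition_buffer_gb=0.01,
                                   flash_working_set_gb=500.0) == ["p1", "p3"]
```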
  • 52. Lower-Latency Queries Made possible by flash ▪ Serve shuffle data directly from flash for some jobs ▪ This is "free" until the flash drive gets so full that the write amplification factor increases (at roughly 80% full) ▪ Prioritize interactive/low-latency queries to serve from flash ▪ Buffer bigger chunks to decrease reducer merging ▪ Fewer chunks -> less chance that the reducer needs to do an external merge
  • 53. Further Efficiency Wins Made possible by flash ▪ Decrease Cosco replication factor since flash is non-volatile ▪ Currently Cosco replication is R2: Each map output byte is stored on two shuffle services until it is flushed to durable DFS ▪ Most Shuffle Service crashes in production are resolved in a few minutes with process restart ▪ Decrease Cosco replication to R1 for some queries, and attempt to automatically recover map output data from flash after restart ▪ Buffer bigger chunks to allow more efficient Reed-Solomon encodings on DFS
  • 55. Practical Evaluation Techniques ▪ Discrete event simulation ▪ Synthetic load generation on a test cluster ▪ Shadow testing on a test cluster ▪ Special canary in a production cluster
  • 57. Discrete Event Simulation Shuffle Service Model DFS Model Example Partition 3 Partition 42 Time: 00h:01m:30.000s Total KB written to flash: 9,000 Overall avg file size written to DFS: NaN
  • 58. Discrete Event Simulation Shuffle Service Model DFS Model Example Partition 3 Partition 42 Discrete event Time: 00h:01m:30.250s Total KB written to flash: 9,050 Overall avg file size written to DFS: NaN
  • 59. Discrete Event Simulation Shuffle Service Model DFS Model Example Partition 3 Partition 42 Discrete event Time: 00h:01m:30.500s Total KB written to flash: 9,100 Overall avg file size written to DFS: NaN
  • 60. Discrete Event Simulation Shuffle Service Model DFS Model Example Partition 3 Partition 42 Discrete event Time: 00h:01m:30.750s Total KB written to flash: 9,150 Overall avg file size written to DFS: NaN
  • 61. Discrete Event Simulation Shuffle Service Model DFS Model Example Partition 3 Partition 42 Discrete event Time: 00h:01m:31.000s Total KB written to flash: 9,200 Overall avg file size written to DFS: NaN
  • 62. Discrete Event Simulation Shuffle Service Model DFS Model Example Partition 3 Partition 42 Discrete event Time: 00h:01m:31.500s Total KB written to flash: 9,250 Overall avg file size written to DFS: NaN
  • 63. Discrete Event Simulation Shuffle Service Model DFS Model Example Partition 3 Partition 42 Discrete event Time: 00h:01m:32.000s Total KB written to flash: 9,300 Overall avg file size written to DFS: NaN
  • 64. DFS Model File 0 Discrete Event Simulation Shuffle Service Model Example Partition 3 Partition 42 Sort & flush Discrete event Time: 00h:01m:32.000s Total KB written to flash: 9,300 Overall avg file size written to DFS: 9,200
  • 65. DFS Model File 0 Discrete Event Simulation Shuffle Service Model Example Partition 3 Partition 42 Discrete event Time: 00h:01m:32.500s Total KB written to flash: 9,350 Overall avg file size written to DFS: 9,200
  • 66. Discrete Event Simulation: Drive the simulation based on production data (the cosco_chunks dataset)
    Partition | Shuffle Service ID | Chunk (DFS file) number | Chunk Start Time        | Chunk Size | Chunk Buffering Time | Chunk Fill Rate (derived from size and buffering time)
    3         | 10                 | 5                       | 2020-05-19 00:00:00.000 | 10 MiB     | 5000 ms              | 2 MiB/s
    42        | 10                 | 2                       | 2020-05-19 00:01:00.000 | 31 MiB     | 10000 ms             | 3.1 MiB/s
    …         | …
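To give a feel for the shape of such a simulation, here is a heavily simplified sketch: it replays cosco_chunks-style records as discrete events and tracks the two counters shown in the earlier frames (total KB written to flash and average file size written to the DFS model). The field names follow the table above; the event granularity and everything else are assumptions.

```python
import heapq
from dataclasses import dataclass

@dataclass
class Chunk:
    """One row of the (simplified) cosco_chunks dataset from the table above."""
    partition: int
    start_time_s: float
    size_kb: float
    buffering_time_s: float

def simulate(chunks: list, append_kb: float = 50.0):
    """Replay each chunk as a series of discrete append events followed by a
    flush event, then report total KB written to flash and the average size
    of files written to the DFS model."""
    events = []  # (time_s, kind, kb) with kind 0 = append, 1 = flush
    for c in chunks:
        rate_kb_s = c.size_kb / c.buffering_time_s  # the derived fill-rate column
        t, written = c.start_time_s, 0.0
        while written + 1e-9 < c.size_kb:
            step = min(append_kb, c.size_kb - written)
            written += step
            t += step / rate_kb_s
            heapq.heappush(events, (t, 0, step))
        heapq.heappush(events, (t, 1, c.size_kb))  # buffer full -> sort & flush to DFS

    flash_kb, dfs_files, dfs_kb = 0.0, 0, 0.0
    while events:
        _, kind, kb = heapq.heappop(events)
        if kind == 0:
            flash_kb += kb  # every append lands on flash first
        else:
            dfs_files += 1
            dfs_kb += kb
    avg_file_kb = dfs_kb / dfs_files if dfs_files else float("nan")
    return flash_kb, avg_file_kb

# e.g. the two rows from the table above (sizes converted to KB):
# simulate([Chunk(3, 0.0, 10 * 1024, 5.0), Chunk(42, 60.0, 31 * 1024, 10.0)])
```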
  • 67. Canary on a Production Cluster ▪ Many important metrics are observed on mappers ▪ Example: “percentage of task time spent shuffling” ▪ Example: “map task success rate” ▪ Problem: Mappers talk to many Shuffle Services ▪ Simultaneously ▪ Dynamic balancing can re-route to different Shuffle Services ▪ Solution: Subclusters ▪ Pre-existing feature for large clusters ▪ Each Shuffle Service belongs to one subcluster ▪ Each mapper is assigned to one subcluster, and only uses Shuffle Services in that subcluster ▪ Compare performance of subclusters that contain flash machines vs subclusters that don’t
  • 68. Chen Yang Software Engineer at Facebook Sergey Makagonov Software Engineer at Facebook Special Thanks
  • 69. SOS: Optimizing Shuffle IO, Spark Summit 2018 Cosco: An Efficient Facebook-Scale Shuffle Service, Spark Summit 2019 Previous Shuffle presentations from Facebook