Cosco: An Efficient Facebook-Scale Shuffle Service

Brian Cho & Dmitry Borovsky
Spark + AI Summit 2019

Disaggregated compute and storage
• Advantages
• Server types optimized for
compute or storage
• Separate capacity management
and configuration
• Different hardware cycles
• Compute clusters
• CPU, RAM, no disks for data
• Spark executors
• Storage clusters
• Spindle disks
• DFS (Warm Storage)
• Permanent data: size dominant, uses
less IO
• Temporary data: IO dominant, uses
less space

Spindle disk storage
• A single spindle is used to
read/write data on the drive
• Small IO sizes cause low throughput
as seek times dominate
[Figure: time (s) versus read request size (MiB). Data points: 64 KiB, 140 s; 128 KiB, 73 s; 256 KiB, 39 s; 1 MiB, 14 s; 4 MiB, 8 s]
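The shape of the curve above can be reproduced with a simple seek-plus-transfer model. The sketch below is illustrative only: the seek time, the sequential bandwidth, and the 1 GiB total read size are assumed values, not figures from the slides, and `read_time_s` is a hypothetical helper.

```python
# Toy model of a spinning disk: every request pays one seek, then transfers
# its bytes at the sequential rate. Both constants are assumed values.
SEEK_S = 0.008            # assumed average seek + rotational delay, seconds
BANDWIDTH_MIB_S = 200.0   # assumed sequential transfer rate, MiB/s


def read_time_s(total_mib: float, request_mib: float) -> float:
    """Seconds to read total_mib using fixed-size requests of request_mib."""
    requests = total_mib / request_mib
    return requests * (SEEK_S + request_mib / BANDWIDTH_MIB_S)


if __name__ == "__main__":
    # Same request sizes as the figure; the 1 GiB total is an assumption.
    for size_kib in (64, 128, 256, 1024, 4096):
        seconds = read_time_s(1024.0, size_kib / 1024.0)
        print(f"{size_kib:>5} KiB requests: {seconds:6.1f} s to read 1 GiB")
```

With small requests the seek term dominates, so halving the request size roughly doubles the total time; with large requests the transfer term takes over and the curve flattens.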

• Drive sizes increase over time
• Must increase IO size to maintain the
same throughput per TB
[Figure: average IO size (MiB) versus HDD capacity (TiB) at a constant 10 MiB/s per TiB. Marked points: 7 TB drive, 1 MiB IO size; 15 TB drive, 7 MiB IO size]
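Why the required IO size grows faster than drive capacity can be sketched with the same seek-plus-transfer model: a request of size io takes seek + io / bandwidth seconds, so throughput is io / (seek + io / bandwidth), and solving for io gives the formula in the code. The seek time and bandwidth are assumed values and `required_io_mib` is a hypothetical helper, so the output will not match the slide's marked points exactly.

```python
SEEK_S = 0.008            # assumed average seek + rotational delay, seconds
BANDWIDTH_MIB_S = 200.0   # assumed sequential transfer rate, MiB/s


def required_io_mib(capacity_tib: float, mib_s_per_tib: float) -> float:
    """IO size (MiB) needed so one drive sustains mib_s_per_tib per TiB stored."""
    target = capacity_tib * mib_s_per_tib  # required drive throughput, MiB/s
    if target >= BANDWIDTH_MIB_S:
        raise ValueError("target throughput exceeds sequential bandwidth")
    # From target = io / (SEEK_S + io / BANDWIDTH_MIB_S), solved for io.
    return target * SEEK_S / (1.0 - target / BANDWIDTH_MIB_S)


if __name__ == "__main__":
    for capacity in (7, 10, 15):
        print(f"{capacity} TiB drive: {required_io_mib(capacity, 10.0):.2f} MiB per IO")
```

The denominator shrinks as the target approaches the drive's sequential bandwidth, which is why the curve bends upward rather than growing linearly with capacity.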

• Must increase IO size to maintain the same throughput per TB, or
• Read/write less data to reduce throughput demand
[Figure: average IO size (MiB) versus HDD capacity (TiB), with curves for 10 MiB/s per TiB, 8 MiB/s per TiB, and 6 MiB/s per TiB]

Spindle disk storage: key metrics
• Disk service time
• Average IO size
• Write amplification

Spark shuffle recap
[Diagram: Mappers (Map 0, Map 1, ..., Map m) partition their output into Map Output Files, which are read by Reducers (Reduce 0, Reduce 1, ..., Reduce r)]

Spark shuffle recap
[Diagram: Map 0, Map 1, ..., Map m feed Reduce 0, Reduce 1, ..., Reduce r. Stage labels: Partition, Iterator (three shown), Sort by key]

Spark shuffle recap: Write amplification
[Diagram: Map 0, Map 1, ..., Map m feed Reduce 0, Reduce 1, ..., Reduce r. Stage labels: Partition, Iterator (three shown), Sort by key]
• Write amplification is 3x

Spark shuffle recap: Small IOs problem
[Diagram: Map 0, Map 1, ..., Map m feed Reduce 0, Reduce 1, ..., Reduce r. Stage labels: Partition, Iterator (three shown), Sort by key]
• M x R reads
• Avg IO size is 200 KiB

Spark shuffle recap: SOS
[Diagram: Map 0, Map 1, ..., Map m feed Reduce 0, Reduce 1, ..., Reduce r. Stage labels: Partition, Iterator (three shown), Sort by key]
• SOS: merge map outputs
• 10-way merge increases Avg IO size to 2 MiB
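The arithmetic behind the SOS numbers can be sketched as follows. It assumes that, after merging, each reducer's data within a merged file is contiguous, so one read replaces merge_factor reads. Both helper names are hypothetical.

```python
def avg_io_after_merge_kib(avg_io_kib: float, merge_factor: int) -> float:
    """Average read size once merge_factor map outputs are merged into one file."""
    return avg_io_kib * merge_factor


def reads_after_merge(mappers: int, reducers: int, merge_factor: int) -> int:
    """Each reducer reads one chunk per merged file instead of one per mapper."""
    merged_files = -(-mappers // merge_factor)  # ceiling division
    return merged_files * reducers


if __name__ == "__main__":
    # 200 KiB average IO with a 10-way merge, as on the slide.
    print(avg_io_after_merge_kib(200, 10), "KiB per read after merging")
    # Mapper and reducer counts here are made up for illustration.
    print(reads_after_merge(1000, 100, 10), "reads instead of", 1000 * 100)
```

A 10-way merge of 200 KiB chunks gives 2000 KiB, which is the roughly 2 MiB average quoted above.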

Spark shuffle using Cosco
• Mappers share a write-ahead buffer per reduce partition
• Reducers can read the written data sequentially
• Solves the small IOs problem
• Sequential reads: Avg IO size 200 KiB → 2.5 MiB
• Solves the write amplification problem
• Avoiding spills: Write amplification 3x → 1.2x
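The idea of a shared write-ahead buffer per reduce partition can be sketched in a few lines. This is a toy in-memory model, not Cosco's implementation: the class name, the flush threshold, and the use of Python lists in place of DFS files are all assumptions made for illustration.

```python
from collections import defaultdict


class PartitionBuffers:
    """Toy shuffle service: one buffer per reduce partition, shared by all mappers."""

    def __init__(self, flush_threshold_bytes: int):
        self.flush_threshold_bytes = flush_threshold_bytes  # assumed tuning knob
        self.buffers = defaultdict(list)       # partition -> buffered rows
        self.buffered_bytes = defaultdict(int)  # partition -> bytes buffered
        self.files = defaultdict(list)          # partition -> flushed "files"

    def collect(self, partition: int, row: bytes) -> None:
        """Called by any mapper; rows for one partition land in one buffer."""
        self.buffers[partition].append(row)
        self.buffered_bytes[partition] += len(row)
        if self.buffered_bytes[partition] >= self.flush_threshold_bytes:
            self.flush(partition)

    def flush(self, partition: int) -> None:
        """Write the whole buffer out as one file (one large sequential write)."""
        if self.buffers[partition]:
            self.files[partition].append(self.buffers.pop(partition))
        self.buffered_bytes[partition] = 0

    def read(self, partition: int):
        """A reducer reads its partition's files one after another."""
        for file_rows in self.files[partition]:
            yield from file_rows


if __name__ == "__main__":
    service = PartitionBuffers(flush_threshold_bytes=8)
    service.collect(0, b"aaaa")   # from one mapper
    service.collect(0, b"bbbb")   # from another mapper; reaches the threshold
    service.collect(0, b"cc")
    service.flush(0)              # final flush at the end of the exchange
    print(list(service.read(0)))
```

Because rows from many mappers are combined before they reach disk, the reducer reads a few large files instead of one small chunk per mapper.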

Results / Current status
• Hive
• Rolled out to 90%+ of Hive workloads, in production for 1+ year
• 3.2x more efficient disk service time
• Spark
• Analysis shows potential 3.5x more efficient disk service time
• Analysis shows CPU neutral
• Integration is complete, rollout planned during next few months

Cosco deep dive
Dmitry Borovsky
Spark + AI Summit 2019

Problem
• Shuffle exchange on spinning disks (disaggregated compute and
storage)
• Single shuffle exchange scale: PiBs in size, 100Ks of mappers,
10Ks of reducers
• Write amplification is ~3x (a 1 PiB shuffle does 3 PiB of writes to disk)
• Small average IO size: ~200 KiB (at least M x R reads)
• IO is spiky (all readers may start at the same time and do MxR reads)
• Cosco is shared between users
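The quantities in this list are linked by simple arithmetic, sketched below. The helper names are hypothetical, and the job size in the usage example is a made-up illustration rather than a measured Facebook workload.

```python
PIB = 2**50
TIB = 2**40
KIB = 2**10


def bytes_written(shuffle_bytes: float, write_amplification: float) -> float:
    """Total bytes hitting disk for a shuffle of the given logical size."""
    return shuffle_bytes * write_amplification


def min_reads(mappers: int, reducers: int) -> int:
    """Every reducer fetches at least one chunk from every mapper."""
    return mappers * reducers


def avg_read_bytes(shuffle_bytes: float, mappers: int, reducers: int) -> float:
    """Upper bound on average read size when exactly M x R reads are issued."""
    return shuffle_bytes / min_reads(mappers, reducers)


if __name__ == "__main__":
    print(bytes_written(1 * PIB, 3) / PIB, "PiB written for a 1 PiB shuffle")
    # Hypothetical job: 10 TiB shuffled by 10,000 mappers to 5,000 reducers.
    print(round(avg_read_bytes(10 * TIB, 10_000, 5_000) / KIB, 1), "KiB per read")
```

Since M x R is only a lower bound on the number of reads, the real average IO size can be smaller than this estimate.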

Sailfish: a framework for large scale data processing
SoCC '12: Proceedings of the Third ACM Symposium on Cloud Computing, Article No. 4, San Jose, California, October 14-17, 2012
Source code: https://code.google.com/archive/p/sailfish/

Write-ahead buffers
[Diagram: Mappers 0-2 (Spark processes, Cosco clients) send data over the network to the Cosco Shuffle Services (thousands of Cosco processes, shared between apps). The shuffle services hold per-partition write-ahead buffers (Partition 0, file 1 buffer; Partition 0, file 2 buffer) that back Files 0-2 in DFS. Reducers 0-1 read the files. Other labels: Sorts (if needed); arrows mark dependencies.]

Exactly once delivery
[Diagram: Mapper 0 (Spark process, Cosco client) sends Data to Partition 0 (file 2 buffer) on the Cosco Shuffle Services (thousands of Cosco processes) and receives an Ack. On Failover, Partition 0' (file 3 buffer) takes over. Files 0-3 are in DFS and Reducer 0 depends on them.]

At least once delivery and deduplication
[Diagram: Mapper 0 (Spark process, Cosco client) sends Data to Partition 0 (file 2 buffer) on the Cosco Shuffle Services (thousands of Cosco processes) and receives an Ack. On Failover, Partition 0' (file 3 buffer) takes over. Files 0-3 are in DFS and Reducer 0 depends on them. Steps are numbered 1 and 2.]
• Adds row_id and mapper_id to each row
• Resends non-acked data
• De-duplicates
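De-duplication keyed on (mapper_id, row_id) can be sketched as below. This is a minimal illustration, assuming each row arrives as a (mapper_id, row_id, payload) tuple and that the set of seen keys fits in memory; the function name and tuple layout are hypothetical, not Cosco's actual format.

```python
from typing import Iterable, Iterator, Tuple

Row = Tuple[int, int, bytes]  # (mapper_id, row_id, payload), assumed layout


def deduplicate(rows: Iterable[Row]) -> Iterator[bytes]:
    """Yield each payload once, dropping rows whose (mapper_id, row_id) repeats."""
    seen = set()
    for mapper_id, row_id, payload in rows:
        key = (mapper_id, row_id)
        if key in seen:
            continue  # duplicate created by a resend after a missing ack
        seen.add(key)
        yield payload


if __name__ == "__main__":
    # Row (0, 1) was written once before failover and resent afterwards.
    delivered = [(0, 0, b"a"), (0, 1, b"b"), (0, 1, b"b"), (1, 0, b"c")]
    print(list(deduplicate(delivered)))
```

Resending anything that was not acknowledged gives at-least-once delivery; filtering on the unique key then restores exactly-once semantics for the consumer.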

Replication
[Diagram: Mapper 0 (Spark process, Cosco client) sends Data to both Partition 0 (file 2 buffer) and Partition 0' (file 2' buffer), both Cosco processes, and each returns an Ack. Files 0-2 are in DFS and Reducer 0 depends on them.]

Replication
[Diagram: Mapper 0 (Spark process, Cosco client) sends Data to both Partition 0 (file 2 buffer) and Partition 0' (file 2' buffer), both Cosco processes, and each returns an Ack. DFS holds File 0, File 1, File 2, and File 2', and Reducer 0 depends on them.]

// Mapper
writer = new CoscoWriter(
    shuffleId: String,
    mapper: long);
writer.collect(
    partition: int, row: byte[]);
// ...
writer.collect(
    partition: int, row: byte[]);
writer.close();
// Reducer
reader = new CoscoReader(
    shuffleId: String,
    mappers: long[],
    partition: int);
while (reader.next()) {
    // using row
    row = reader.row();
}
reader.close();
// Driver
shuffle = new CoscoExchange(
    shuffleId: String,
    partitions: int,
    recomputeWritersCallback: (mappers: long[], reason: String) -> void);
// end of exchange
shuffle.close();
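To show how the three roles fit together, here is a toy in-memory stand-in for the interface above, written in Python. Only the class and method names come from the slide. The module-level store, the omission of the recompute callback, and all behavior are assumptions, and nothing here reflects how the real client talks to the shuffle services.

```python
from collections import defaultdict

# (shuffle_id, partition) -> list of (mapper, row); stands in for the service + DFS.
_STORE = defaultdict(list)


class CoscoWriter:
    def __init__(self, shuffle_id: str, mapper: int):
        self.shuffle_id, self.mapper = shuffle_id, mapper

    def collect(self, partition: int, row: bytes) -> None:
        _STORE[(self.shuffle_id, partition)].append((self.mapper, row))

    def close(self) -> None:
        pass  # a real client would flush and wait for acks here


class CoscoReader:
    def __init__(self, shuffle_id: str, mappers, partition: int):
        wanted = set(mappers)  # only rows from these mappers are returned
        stored = _STORE[(shuffle_id, partition)]
        self._rows = iter([row for mapper, row in stored if mapper in wanted])
        self._current = None

    def next(self) -> bool:
        self._current = next(self._rows, None)
        return self._current is not None

    def row(self) -> bytes:
        return self._current

    def close(self) -> None:
        pass


class CoscoExchange:
    def __init__(self, shuffle_id: str, partitions: int):
        self.shuffle_id, self.partitions = shuffle_id, partitions

    def close(self) -> None:
        """End of exchange: drop everything stored for this shuffle."""
        for key in [k for k in _STORE if k[0] == self.shuffle_id]:
            del _STORE[key]


if __name__ == "__main__":
    shuffle = CoscoExchange("demo", partitions=2)
    writer = CoscoWriter("demo", mapper=0)
    writer.collect(1, b"hello")
    writer.close()
    reader = CoscoReader("demo", mappers=[0], partition=1)
    while reader.next():
        print(reader.row())
    reader.close()
    shuffle.close()
```

Passing the list of mappers to the reader is what lets a reducer ignore output from mapper attempts that the driver did not accept.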

Metadata
[Diagram: Mappers 0-2 (Spark processes, Cosco clients) send data over the network to partition buffers (Partition 0, file 1 buffer; Partition 0, file 2 buffer) on the Cosco Shuffle Services (labeled "thousands" and "10Ks"), which back Files 0-2 in DFS. Reducers 0-1 read the files. A Cosco Metadata Service (Cosco process) is added.]
• Mappers submit what files they wrote to
• Commits files
• Reducers ask for files
Metadata
[Diagram: Mappers 0-2 (Spark processes, Cosco clients) send data over the network to partition buffers (Partition 0, file 1 buffer; Partition 0, file 2 buffer) on the Cosco Shuffle Services (labeled "thousands" and "10Ks"), which back Files 0-2 in DFS. Reducers 0-1 read the files. The metadata view lists Mappers 0-4 next to the files File 0, File 1, and File 2.]
• Mappers submit what files they wrote to
• Commits files

Metadata
[Diagram: Mappers 0-2 (Spark processes, Cosco clients) send data over the network to partition buffers (Partition 0, file 1 buffer; Partition 0, file 2 buffer) on the Cosco Shuffle Services (labeled "thousands" and "10Ks"), which back Files 0-2 in DFS. Reducers 0-1 read the files. The metadata view lists Mappers 0-4 next to the files File 0, File 1, and File 2, and adds Mapper 3' (recompute) and Mapper 4' (recompute) next to File 3 and File 2.]
• Mappers submit what files they wrote to
• Commits files

Metadata
[Diagram: Mappers 0-2 (Spark processes, Cosco clients) send data over the network to partition buffers (Partition 0, file 1 buffer; Partition 0, file 2 buffer) on the Cosco Shuffle Services (thousands), which back Files 0-2 in DFS. Reducers 0-1 read the files. The Driver is added.]
• Mappers submit what files they wrote to
• Commits files
• Recompute request
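The bookkeeping implied by these slides can be sketched as a small in-memory class. This is speculative: the slides only name the three interactions (mappers submit files, files are committed, reducers ask for files) and show recomputed mappers. The class, its method names, the per-mapper commit, and the rule that losing a file triggers recomputation of every mapper that wrote to it are all assumptions.

```python
class MetadataService:
    """Toy metadata store: which mapper wrote to which files, and what is committed."""

    def __init__(self):
        self.files_by_mapper = {}  # mapper -> set of file names it wrote to
        self.committed = set()     # mappers whose output has been committed

    def submit(self, mapper: int, files) -> None:
        """A mapper reports the files it wrote to."""
        self.files_by_mapper[mapper] = set(files)

    def commit(self, mapper: int) -> None:
        """Mark a mapper's reported files as visible to reducers."""
        self.committed.add(mapper)

    def files_for_reducer(self):
        """Files a reducer should read: those written by committed mappers."""
        committed_sets = (self.files_by_mapper[m] for m in self.committed)
        return sorted(set().union(*committed_sets))

    def mappers_to_recompute(self, lost_file: str):
        """Mappers that wrote to a lost file; the driver would be asked to rerun them."""
        return sorted(m for m, files in self.files_by_mapper.items() if lost_file in files)


if __name__ == "__main__":
    metadata = MetadataService()
    metadata.submit(3, ["file2"])
    metadata.submit(4, ["file2", "file3"])
    metadata.commit(3)
    print(metadata.files_for_reducer())
    print(metadata.mappers_to_recompute("file2"))
```

Keeping the mapper-to-file mapping is what makes a targeted recompute request possible: only the mappers that touched the affected file need to run again.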

Scheduler
[Diagram: Mappers 0-2 (Spark processes, Cosco clients) send data over the network to partition buffers (Partition 0, file 1 buffer; Partition 0, file 2 buffer) on the Cosco Shuffle Services (thousands), which back Files 0-2 in DFS. Reducers 0-1 read the files. The Driver and a Cosco Scheduler (Cosco process) are shown.]
• Mappers submit what files they wrote to
• Commits files
• Recompute request

Limits
• Cosco doesn’t support large rows (rows must be < 4 MiB)
• Capacity: shuffle services memory, number of write-ahead buffers

Future work
• “Unlimited” shuffle exchange:
• millions of splits/partitions
• 10s of PiBs
• Streaming

Questions?
Brian Cho (bcho@fb.com)
Dmitry Borovsky (borovsky@fb.com)
