Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval Degani

Yuval Degani, Mellanox Technologies
#EUres3

TeraSort accelerated with RDMA
[Chart: TeraSort run time in seconds (axis 0 to 120), RDMA vs. Standard]

• Founded 1999
• End-to-end designer and supplier of
interconnect solutions: network adapters,
switches, system-on-a-chip, cables, silicon
and software
• 10-400 Gb/s Ethernet and InfiniBand
[Diagram: product portfolio spanning Storage (Front / Backend), Server / Compute, and Switch / Gateway, connected by 56/100/200G InfiniBand and 10/25/40/50/100/200/400GbE through Virtual Protocol Interconnect]

Agenda
• What’s RDMA?
• Spark’s Shuffle Internals
• SparkRDMA Shuffle Plugin
• Results
• What’s next?

What’s RDMA?
• Remote Direct Memory Access
– Read/write from/to remote memory locations
• Zero-copy
• Direct hardware interface – bypasses the kernel
and TCP/IP in IO path
• Flow control and reliability are offloaded to hardware
• Sub-microsecond latency
• Supported on almost all mid-range/high-end
network adapters
• Growing cloud support
– Already supported in Microsoft Azure (A, H instances)
[Diagram: a Java app buffer reaching the network adapter either through the socket path (OS, sockets, TCP/IP, driver, with a context switch) or directly through the RDMA path]

Hardware acceleration in Big Data/Machine Learning platforms
• Hardware acceleration adoption is continuously growing
– GPU integration is now standard
– ASIC integration is spreading fast
• RDMA is already integrated in mainstream code of popular
frameworks:
– TensorFlow
– Caffe2
– CNTK
• Now it’s Spark’s turn to catch up

Spark’s Shuffle Internals
Under the hood

MapReduce vs. Spark
• Spark’s in-memory model completely changed how shuffle is done
• In both Spark and MapReduce, map output is saved on the local disk (usually in buffer cache)
• In MapReduce, map output is then copied over the network to the destined reducer’s local disk
• In Spark, map output is fetched from the network, on-demand, to the reducer’s memory
Memory-to-network-to-memory? RDMA is a perfect fit!

Spark’s Shuffle Basics
[Diagram: map tasks turn input into map output files; reduce tasks fetch blocks from those files; the driver is shown alongside]

Shuffle Read Protocol
[Diagram: shuffle read between Driver, Reader and Writer, in eight steps]
1. Request Map Statuses
2. Send back Map Statuses
3. Group block locations by writer
4. Request blocks from writers
5. Locate blocks, and set up as stream
6. Request blocks from stream, one by one
7. Locate block, send back
8. Block data is now ready
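The "group block locations by writer" step of the read protocol can be sketched in a few lines. This is a minimal illustration and not Spark's own code: the `BlockLocation` type, the worker names and the block sizes are hypothetical, and the block ids are made-up examples.

```python
from collections import defaultdict
from typing import NamedTuple


class BlockLocation(NamedTuple):
    """Hypothetical record: which writer holds which shuffle block."""
    writer: str
    block_id: str
    size_bytes: int


def group_by_writer(locations):
    """Group block ids per writer so each writer gets a single request."""
    grouped = defaultdict(list)
    for loc in locations:
        grouped[loc.writer].append(loc.block_id)
    return dict(grouped)


# Map statuses as a reader might hold them once the driver has replied
# (made-up values for illustration).
statuses = [
    BlockLocation("worker-1", "shuffle_0_0_3", 1024),
    BlockLocation("worker-2", "shuffle_0_1_3", 2048),
    BlockLocation("worker-1", "shuffle_0_2_3", 512),
]

requests = group_by_writer(statuses)
for writer, block_ids in requests.items():
    print(writer, block_ids)
```

Grouping first means the reader contacts each writer once for all of its blocks instead of once per block.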

The Cost of Shuffling
• Shuffling is very expensive in terms of CPU,
RAM, disk and network IOs
• Spark users try to avoid shuffles as much as
they can
• Speedy shuffles can relieve developers of such
concerns, and simplify applications

The Potential in Accelerating Shuffle
What’s the opportunity here?

The Potential in Accelerating Shuffle
• Before setting off on the journey of adding RDMA to Spark, it was essential
to estimate the ROI of such work
• What’s better than a controlled experiment?
Goal: Quantify the potential performance gain in accelerating network transfers
Method: Bypass the network in Spark shuffle and compare to the original code (no network = maximum performance gain)

Experiment: Bypassing The Network in Shuffle
• Do not fetch blocks from the network; instead, reuse a locally sampled block over and over
• Compare to standard Spark, also with block reuse, so that the reduce data is identical
[Diagram: each reduce task’s “fetch blocks” step is served from local sample data]
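The setup of the bypass experiment can be sketched as two fetchers with the same interface. This is only an illustration of the control-versus-experiment idea, not the patch used for the measurement: the class names and the `fake_network_fetch` stand-in are hypothetical.

```python
class StandardFetcher:
    """Baseline: pay the network cost, then hand back the sampled block."""

    def __init__(self, network_fetch, sample_block):
        self._network_fetch = network_fetch
        self._sample_block = sample_block
        self.network_calls = 0

    def fetch(self, block_id):
        self._network_fetch(block_id)  # the transfer still happens
        self.network_calls += 1
        return self._sample_block      # reduce side sees the sample data


class BypassFetcher:
    """Experiment: skip the network and reuse the sampled block."""

    def __init__(self, sample_block):
        self._sample_block = sample_block
        self.network_calls = 0  # stays at zero by construction

    def fetch(self, block_id):
        return self._sample_block


def fake_network_fetch(block_id):
    """Stand-in for a remote fetch; a real one would move bytes over the wire."""
    return b"remote-bytes-for-" + block_id.encode()


sample = b"sample-block"
standard = StandardFetcher(fake_network_fetch, sample)
bypass = BypassFetcher(sample)

block_ids = ["block-a", "block-b", "block-c"]
standard_data = [standard.fetch(b) for b in block_ids]
bypass_data = [bypass.fetch(b) for b in block_ids]
print(standard.network_calls, bypass.network_calls)
```

Because both fetchers return the same bytes, any difference in run time between the two configurations can be attributed to the network transfers.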

Results
Benchmark:
• HiBench TeraSort
• Workload: 600GB
Testbed:
• HDFS on Hadoop 2.6.0
• Spark 2.0.0
– Standalone mode
– Master + 30 workers
– 28 cores per worker, 840 total
• Machine info:
– Intel Xeon E5-2697 v3 @ 2.60GHz
– 256GB RAM, 128GB of it reserved as RAMDISK
– RAMDISK is used for Spark local directories and HDFS
[Chart: run time in seconds (axis 0 to 800), Network bypass vs. Standard Spark]

SparkRDMA Shuffle Plugin
Accelerating Shuffle with RDMA

Design Goals
• Demonstrate significant improvements over
standard Spark
• Seamlessly accelerate Shuffles with RDMA – no
functional limitations
• Easy to use and deploy
• Minimize code impact

Design Approach
• Entire Shuffle-related communication is done with RDMA
– RPC messaging for meta-data transfers
– Block transfers
• SparkRDMA is an independent plugin
– Implements the ShuffleManager interface
– No changes to Spark’s code – use with any existing Spark installation
• Reuse Spark facilities
– Maximize reliability
– Minimize impact on code
• RDMA functionality is provided by “DiSNI”
– Open-source Java interface to RDMA user libraries
– https://github.com/zrlio/disni
• No functionality loss of any kind, SparkRDMA supports:
– Compression
– Spilling to disk
– Recovery from failed map or reduce tasks

ShuffleManager Plugin
• Spark allows for external implementations
of ShuffleManagers to be plugged in
– Configurable per-job using: “spark.shuffle.manager”
• Interface allows proprietary
implementations of Shuffle Writers and
Readers, and essentially defers the entire
Shuffle process to the new component
• SparkRDMA utilizes this interface to
introduce RDMA in the Shuffle process
[Diagram: SortShuffleManager and RdmaShuffleManager as interchangeable ShuffleManager implementations]
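Switching the ShuffleManager for a single job comes down to a few configuration properties. The sketch below only builds the `--conf` arguments for `spark-submit`. The fully qualified class name, the JAR path and the job file name are assumptions for illustration and should be checked against the plugin's installation guide.

```python
# Assumed values, not confirmed by the slides: verify before use.
RDMA_SHUFFLE_MANAGER = "org.apache.spark.shuffle.rdma.RdmaShuffleManager"
PLUGIN_JAR = "/path/to/spark-rdma.jar"  # hypothetical location of the plugin JAR


def rdma_shuffle_conf(manager=RDMA_SHUFFLE_MANAGER, jar=PLUGIN_JAR):
    """Return per-job properties that select the shuffle implementation."""
    return {
        "spark.shuffle.manager": manager,
        "spark.driver.extraClassPath": jar,
        "spark.executor.extraClassPath": jar,
    }


def to_submit_args(conf):
    """Render properties as spark-submit command-line arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


submit_args = to_submit_args(rdma_shuffle_conf())
# "my_job.py" is a placeholder application name.
print(" ".join(["spark-submit"] + submit_args + ["my_job.py"]))
```

Since the properties are passed per job, jobs submitted without them keep using the default shuffle implementation.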

SparkRDMA Components
• SparkRDMA reuses the main
Shuffle Writer implementations of
mainstream Spark: Unsafe & Sort
• Shuffle data is written and stored
identically to the original
implementation
• All-new ShuffleReader and
ShuffleBlockResolver provide an
optimized RDMA transport when
blocks are being read over the
network
[Diagram: components of the two shuffle managers]
• RdmaShuffleManager
– Writers: SortShuffleWriter, UnsafeShuffleWriter
– RdmaWrapperShuffleWriter
– RdmaShuffleReader
– RdmaShuffleBlockResolver
• SortShuffleManager
– Writers: SortShuffleWriter, UnsafeShuffleWriter, BypassMergeSortShuffleWriter
– BlockStoreShuffleReader
– IndexShuffleBlockResolver

Shuffle Read Protocol – Standard vs. RDMA
[Diagram: shuffle read between Driver, Reader and Writer]
Standard:
1. Request Map Statuses
2. Send back Map Statuses
3. Group block locations by writer
4. Request blocks from writers
5. Locate blocks, and set up as stream
6. Request blocks from stream, one by one
7. Locate block, send back
8. Block data is now ready
RDMA:
1. Request Map Statuses
2. Send back Map Statuses
3. Group block locations by writer
4. RDMA-Read blocks from writers (no-op on writer, HW offloads transfers)
5. Block data is now ready

Shuffle Read: Standard vs. RDMA
[Diagram: the standard and RDMA read protocols side by side, with the same steps as on the previous slide]
Server-side:
✓ 0 CPU
✓ Shuffle transfers are not blocked by GC in executor
✓ No buffering
Client-side:
✓ Instant transfers
✓ Reduced messaging
✓ Direct, unblocked access to remote blocks
[Diagram: close-up of the Reader/Writer exchange. Standard: request blocks from writers; locate blocks, and set up as stream; request blocks from stream, one by one; locate block, send back; block data is now ready. RDMA: RDMA-Read blocks from writers (no-op on writer, HW offloads transfers); block data is now ready.]
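A toy count makes the server-side difference concrete. It assumes, as a simplification of the two read paths, that in the standard path a writer's CPU serves one request to set up the stream plus one request per block, and that in the RDMA path it serves none because the adapter handles the reads. The function names and the example numbers are illustrative only.

```python
def writer_requests_standard(blocks_per_writer):
    """Requests each writer's CPU serves for one reader:
    one stream setup plus one request per block."""
    return {writer: 1 + blocks for writer, blocks in blocks_per_writer.items()}


def writer_requests_rdma(blocks_per_writer):
    """With one-sided RDMA reads, the writer's CPU serves no requests."""
    return {writer: 0 for writer in blocks_per_writer}


# Hypothetical reader that needs blocks from three writers.
blocks_per_writer = {"worker-1": 4, "worker-2": 2, "worker-3": 6}

standard_total = sum(writer_requests_standard(blocks_per_writer).values())
rdma_total = sum(writer_requests_rdma(blocks_per_writer).values())
print(standard_total, rdma_total)
```

The model ignores message sizes and timing; it only counts how often a writer has to react in software.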

Benefits
• Substantial improvements in:
– Block transfer times: latency and total transfer time
– Memory consumption and management
– CPU utilization
• Easy to deploy and configure:
– Packed into a single JAR file
– Plugin is enabled through a simple configuration handle
– Allows finer tuning with a set of configuration handles
• Configuration and deployment are on a per-job basis:
– Can be deployed incrementally
– May be limited to Shuffle-intensive jobs

Performance Results: TeraSort
Testbed:
• HiBench TeraSort
– Workload: 175GB
• HDFS on Hadoop 2.6.0
– No replication
• Spark 2.0.0
– 1 Master
– 15 Workers
– 28 active Spark cores on each node, 420 total
• Node info:
2.60GHz
– RoCE 100GbE
– 256GB RAM
– HDD is used for Spark local
[Chart: TeraSort run time in seconds (axis 0 to 120), RDMA vs. Standard]

Performance Results: GroupBy
Testbed:
• GroupBy
– 48M keys
– Each value: 4096 bytes
– Workload: 183GB
• Spark 2.0.0
– 1 Master
– 15 Workers
– 28 active Spark cores on each node, 420 total
• Node info:
2.60GHz
– RoCE 100GbE
– 256GB RAM
– HDD is used for Spark local
[Chart: GroupBy run time in seconds (axis 0 to 30), RDMA vs. Standard]
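The GroupBy workload size and the core count can be checked from the figures on the slide. The calculation below is a quick sanity check under two assumptions: "48M keys" is read as 48 × 10^6, and the size counts values only, with GB taken as 2^30 bytes.

```python
KEYS = 48_000_000      # "48M keys", read as 48 * 10**6 (assumption)
VALUE_BYTES = 4096     # size of each value, from the slide
WORKERS = 15
CORES_PER_WORKER = 28

total_bytes = KEYS * VALUE_BYTES
total_gib = total_bytes / 2**30
total_cores = WORKERS * CORES_PER_WORKER

print(total_bytes)          # 196608000000
print(round(total_gib, 1))  # 183.1
print(total_cores)          # 420
```

Both results agree with the stated 183GB workload and 420 total cores.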

What’s next?
• SparkRDMA official v1.0 release is publicly available on GitHub
• Integration to upstream Apache Spark
• Extend RDMA capabilities to more Spark components:
– RDD accesses
– Broadcast
• Also in the works: RDMA acceleration for HDFS

Open-source
• SparkRDMA is available at https://github.com/Mellanox/SparkRDMA
– Quick installation guide
– Wiki pages for advanced settings
• Vote for integrating SparkRDMA into mainstream Spark: https://issues.apache.org/jira/browse/SPARK-22229
• Feel free to reach out at: yuvaldeg@mellanox.com
