Yuval Degani, LinkedIn
Dr. Jithin Jose, Microsoft Azure
Tackling Network
Bottlenecks with
Hardware Accelerations:
Cloud vs. On-Premise
#UnifiedAnalytics #SparkAISummit
Intro
• An infinite loop of removing performance roadblocks
• With faster storage devices (DRAM, NVMe, SSD) and stronger-than-ever processing power (CPU, GPU, ASIC), a traditional network just can't keep up with the I/O flow
• Upgrading to higher wire speeds will rarely do the trick
• This is where co-designed hardware acceleration can be used to truly utilize the power of a compute cluster
Previous talks
Spark Summit Europe 2017
First open-source stand-alone RDMA-accelerated shuffle plugin for Spark (SparkRDMA)

Spark+AI Summit North America 2018
First preview of SparkRDMA on Azure HPC nodes, demonstrating a 2.6x job speed-up on cloud VMs
Network Bottlenecks in the Wild
• Not always caused by a lack of bandwidth
• Network I/O imposes overhead in many system components:
– Memory management
– Memory copies
– Garbage collection
– Serialization/compression/encryption
• Overhead = CPU cycles, cycles that are not available for the actual job at hand
• Hardware acceleration can reduce this overhead and allow better utilization of compute and network resources
Network Bottlenecks: Shuffle
• Most expensive non-storage network I/O in compute clusters
• Blocking, massive movement of transient data
• Acceleration opportunities:
– Efficient serving with reduced server-side logic
– Serialization/compression/encryption
– Reduce I/O overhead and latency by employing modern transport protocols
[Chart] HiBench TeraSort on Spark, stage time breakdown: Shuffle Read 57%, Output 28%, Input 11%, Partitioning 4%
Network Bottlenecks: Distributed Training
• Model updates create massive network traffic
• Model update frequency rises as GPUs get faster
• Acceleration opportunities:
– Inter-GPU RDMA communication
– Lower-latency network transport
– Collectives offloads
[Chart] ResNet-269 training on K80, M60, and V100 GPUs: total time vs. GPU active time*
* "Parameter Hub: High Performance Parameter Servers for Efficient Distributed Deep Neural Network Training" by Luo et al.
Network Bottlenecks: Storage
• Massive data movement
• Premium devices (DRAM, Flash) provide storage access speeds never seen before
• Acceleration opportunities:
– Higher bandwidth
– Reduced transport overhead
– OS/CPU bypass: direct storage access from network devices
Major Hardware Acceleration Technologies
Speeds
• 1, 10, 25, 40, 100, 200 Gbps
• A faster network doesn't necessarily mean a faster runtime
• Many workloads consist of relatively short bursts rather than sustained throughput: higher bandwidth may not have any effect
[Chart] Effect of network speed (1GbE, 10GbE, 40GbE) on workload runtime* for Flink TeraSort, Flink PageRank, PowerGraph PageRank, and Timely PageRank
* "On The [Ir]relevance of Network Performance for Data Processing" by Trivedi et al.
InfiniBand
• De facto standard in the HPC world
• FDR: 56 Gbps, EDR: 100 Gbps, HDR: 200 Gbps
• Sub-microsecond latency
• Native support for RDMA
• Hardware-accelerated transport layer
• True SDN: standard fabric components are developed as open source and are cross-platform
• Native support for switch collectives offload
[Chart] TOP500 supercomputers, interconnect performance share*: InfiniBand 38%, Custom 28%, Ethernet 23%, Omnipath 10%, Proprietary 1%
* www.top500.org
RDMA
• Remote Direct Memory Access
– Read/write from/to remote memory locations
• Zero-copy
• Direct hardware interface: bypasses the kernel and TCP/IP in the I/O path
• Flow control and reliability are offloaded to hardware
• Supported on almost all mid-range/high-end network adapters, both InfiniBand and Ethernet
[Diagram] Socket vs. RDMA data path: a socket send from a Java app buffer crosses into the OS (context switch) and traverses the sockets layer, TCP/IP stack, and driver before reaching the network adapter; RDMA goes from the application buffer straight to the network adapter
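To make the one-sided model concrete, here is a minimal sketch using MPI-3 RMA windows via mpi4py, which map onto RDMA verbs on InfiniBand fabrics. This illustrates remote read/write semantics only (mpi4py and an RDMA-capable MPI are assumed); it is not a verbs-level implementation.

```python
# rma_demo.py -- run with: mpiexec -n 2 python rma_demo.py
# A minimal sketch of RDMA-style one-sided communication using MPI-3
# RMA windows (mpi4py). On InfiniBand, Put/Get map onto RDMA write/read:
# the target rank's CPU is not involved in the data transfer.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

nbytes = 8 * MPI.DOUBLE.Get_size()
# Allocate a registered memory window that remote ranks can access.
win = MPI.Win.Allocate(nbytes, comm=comm)
local = np.frombuffer(win.tomemory(), dtype='d')
local[:] = 0.0

win.Fence()  # open an access epoch (collective synchronization)
if rank == 0:
    payload = np.arange(8, dtype='d')
    # One-sided write directly into rank 1's window -- no Recv on rank 1.
    win.Put([payload, MPI.DOUBLE], 1)
win.Fence()  # close the epoch; the data is now visible at the target

if rank == 1:
    print("rank 1 window contents:", local)
win.Free()
```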
NVIDIA GPUDirect
• Direct DMA over PCIe
• RDMA devices can write/read directly to/from GPU memory over the network
• No CPU overhead
• Zero-copy
[Diagram] GPUDirect vs. non-GPUDirect path between NIC and GPU: without GPUDirect, transfers are staged through CPU/host memory; with GPUDirect, the NIC DMAs directly to and from GPU memory
“Smart NIC” – FPGA/ASIC Offloads
• FPGA – tailor-made accelerations
• ASIC – less flexibility, better performance
• Common use cases:
– I/O: Serialization, compression, encryption offloads
– Data: Aggregation, sorting, group-by, reduce
• Deployment options:
– Pipeline
– Look-aside
– Bump-on-the-wire
“Smart Switch”
• In-network processing
– Data reduction during movement
– Wire-speed
• Generic: MPI Switch Collectives Offloads (e.g.
Mellanox SHArP)
• Per-workload: Programmable switches (e.g.
Barefoot Tofino)
– Example: Network-Accelerated Query Processing
NVMeOF
• Network protocol for NVM Express (PCIe) disks
• Uses RDMA to provide direct NIC<->disk access
• Completely bypasses the host CPU
• Minimal latency difference between local and remote access
[Diagram] Traditional remote storage path vs. NVMeOF: the traditional path routes disk I/O through the host CPU; with NVMeOF, the NIC accesses the NVMe disk directly
Azure Network Acceleration Offering
Offer 'Bare Metal' Experience – Azure HPC Solution

Eliminate Jitter
Host holdback is a start, but the guest must be completely isolated from the host
Minroot & CPU Groups; separated host and guest VM sandboxes

Full Network Experience
Enable customers to use Mellanox or OFED drivers
Supports all MPI types and versions
Leverage hardware offload to the Mellanox InfiniBand ASIC

Transparent Exposure of Hardware
Core N in the guest VM should equal core N in silicon
1:1 mapping between the physical pNUMA topology and the vNUMA topology
Latest Azure HPC Offerings – HB/HC

                      HB Series (AMD EPYC)      HC Series (Intel Xeon Platinum)
Workload Targets      Bandwidth Intensive       Compute Intensive
Core Count            60                        44
System Memory         240 GB                    352 GB
Network               100 Gbps EDR InfiniBand, 40 Gbps Ethernet
Storage Support       Standard / Premium Azure Storage, and 700 GB local SSD
OS Support for RDMA   CentOS/RHEL, Ubuntu, SLES 12, Windows
MPI Support           OpenMPI, HPC-X, MVAPICH2, MPICH, Intel MPI, PlatformMPI, Microsoft MPI
Hardware Collectives  Enabled
Access Model          Azure CLI, ARM template, Azure CycleCloud, Azure Batch, Partner Platform
Other Azure HPC Highlights
• SR-IOV going broad
– All HPC SKUs will support SR-IOV
– Driver/SKU performance optimizations
• GPUs
– Latest NDv2 series
• 8 NVIDIA Tesla V100 GPUs, interconnected with NVLink
• Intel Skylake, 672 GB memory
• Excellent platform for HPC and AI workloads
• Azure FPGA
– Based on Project Brainwave
– Deploy a model to Azure FPGA; reconfigure for different models
– Supports ResNet-50, ResNet-152, DenseNet-121, and VGG-16
Accelerate Your Framework
MPI Microbenchmarks
• Experiments on an HC cluster
• OSU Benchmarks 5.6.1
• OpenMPI (4.0.0) + UCX (1.5.0)
• MPI ranks pinned near the HCA
• MPI latency (4 B): 1.77 us, getting even better later this year
• MPI bandwidth (4 MB): 12.06 GB/s
[Chart] MPI bandwidth (MB/s) vs. message size (1 B to 4 MB): Ethernet (40 Gbps), IPoIB (100 Gbps), RDMA (100 Gbps)
[Chart] MPI latency (us) vs. message size (0 B to 2 KB): Ethernet (40 Gbps), IPoIB (100 Gbps), RDMA (100 Gbps)
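For reference, the latency measurement pattern behind osu_latency can be sketched in a few lines of mpi4py (assumed installed on the cluster); the OSU suite itself remains the authoritative benchmark.

```python
# pingpong.py -- run with: mpiexec -n 2 python pingpong.py
# A minimal OSU-style latency ping-pong sketch using mpi4py.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

msg = np.zeros(4, dtype=np.uint8)   # 4-byte message, as in the slide
warmup, iters = 100, 10000

for i in range(warmup + iters):
    if i == warmup:
        comm.Barrier()
        t0 = MPI.Wtime()
    if rank == 0:
        comm.Send([msg, MPI.BYTE], dest=1)
        comm.Recv([msg, MPI.BYTE], source=1)
    else:
        comm.Recv([msg, MPI.BYTE], source=0)
        comm.Send([msg, MPI.BYTE], dest=0)

if rank == 0:
    # One-way latency: half the round-trip time, averaged over iterations.
    print("latency: %.2f us" % ((MPI.Wtime() - t0) / iters / 2 * 1e6))
```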
SparkRDMA
• RDMA-powered ShuffleManager plugin for Apache Spark
• Similarly specced 8-node clusters:
– On-prem: 100GbE RoCE
– Cloud: Azure "h16mr" instances with 56 Gbps InfiniBand
• https://github.com/Mellanox/SparkRDMA
[Chart] Execution time for TeraSort 320GB and PageRank 19GB: on-prem non-RDMA 100GbE, on-prem RDMA 100GbE, Azure IPoIB 56Gbps, Azure RDMA 56Gbps
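Enabling the plugin is purely a configuration change. A sketch following the pattern in the SparkRDMA README is below; the jar path is a placeholder, and exact settings may vary by plugin version.

```python
# Sketch: enabling SparkRDMA for a PySpark job. The ShuffleManager class
# name follows the SparkRDMA README; the jar path is a placeholder.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("terasort-rdma")
    # Swap the default sort-based shuffle for the RDMA shuffle manager.
    .config("spark.shuffle.manager",
            "org.apache.spark.shuffle.rdma.RdmaShuffleManager")
    # The plugin jar must be on both driver and executor classpaths.
    .config("spark.driver.extraClassPath", "/opt/spark-rdma/spark-rdma.jar")
    .config("spark.executor.extraClassPath", "/opt/spark-rdma/spark-rdma.jar")
    .getOrCreate()
)
```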
SparkRDMA on Azure
• Azure HC cluster:
– 100 Gbps InfiniBand
– 16 Spark Workers/HDFS DataNodes
– Separate NameNode
– Data folder hosted on SSD
– HiBench benchmarks ('gigantic' scale profile)
• Spark 2.4.0, Hadoop 2.7.7, SparkRDMA 3.1
[Chart] Execution time (s) for TeraSort 320 GB and PageRank 19GB: RDMA (100 Gbps) vs. IPoIB (100 Gbps)
HDFS-RDMA on Azure
• OSU HDFS RDMA 0.9.1
• Based on Hadoop 3.0.0
• http://hibd.cse.ohio-state.edu/#hadoop3
• HDFS on HC cluster
• 1 NameNode
• 16 DataNodes
• Data folder hosted on SSD
• Packet Size: 128 KB
• Containers per Node: 32
[Chart] TestDFSIO (Write) execution time (s) vs. total data size (512 GB to 1 TB): Ethernet (40 Gbps), IPoIB (100 Gbps), RDMA (100 Gbps)
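A hedged sketch of driving these TestDFSIO write runs from Python follows; the tests-jar path is a placeholder, and TestDFSIO option names (-size vs. -fileSize) vary across Hadoop versions, so verify against your distribution first.

```python
# Sketch: reproducing the TestDFSIO write runs. The jar path and the
# -size/-nrFiles flags are assumptions -- check the TestDFSIO usage
# output on your Hadoop cluster before running.
import subprocess

TESTS_JAR = "/opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.0.0-tests.jar"  # placeholder

for total_gb in (512, 640, 768, 896, 1024):
    n_files = 512  # 16 DataNodes x 32 containers, as in the slide setup
    size_mb = total_gb * 1024 // n_files
    subprocess.run(
        ["hadoop", "jar", TESTS_JAR, "TestDFSIO",
         "-write", "-nrFiles", str(n_files), "-size", f"{size_mb}MB"],
        check=True,
    )
```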
Memcached-RDMA on Azure
• OSU Memcached RDMA 0.9.6
• Based on Memcached 1.5.3 and
libmemcached 1.0.18
• http://hibd.cse.ohio-state.edu/#memcached
• Experiment run on HC Nodes
• Memcached GET (8 B) latency: 5.5 us
• Memcached SET (8 B) latency: 6.45 us
[Chart] Memcached GET latency (us) vs. message size (1 B to 4 KB): Ethernet (40 Gbps), IPoIB (100 Gbps), RDMA (100 Gbps)
[Chart] Memcached SET latency (us) vs. message size (1 B to 4 KB): Ethernet (40 Gbps), IPoIB (100 Gbps), RDMA (100 Gbps)
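The measurement pattern behind these GET/SET latencies can be sketched with pymemcache over plain TCP (the host below is a placeholder); the RDMA transport itself lives inside the OSU-modified Memcached client and server.

```python
# Sketch: measuring Memcached GET/SET latency with pymemcache over TCP.
# The host is a placeholder; this only shows the measurement pattern.
import time
from pymemcache.client.base import Client

client = Client(("memcached-host", 11211))  # placeholder host
payload = b"x" * 8                           # 8-byte value, as in the slide
iters = 10000

client.set("key", payload)
t0 = time.perf_counter()
for _ in range(iters):
    client.get("key")
print("GET latency: %.2f us" % ((time.perf_counter() - t0) / iters * 1e6))

t0 = time.perf_counter()
for _ in range(iters):
    client.set("key", payload)
print("SET latency: %.2f us" % ((time.perf_counter() - t0) / iters * 1e6))
```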
Kafka-RDMA on Azure
• OSU Kafka RDMA 0.9.1
• Based on Apache Kafka 1.0.0
• http://hibd.cse.ohio-state.edu/#kafka
• HC cluster
• Broker with 100 GB Ramdisk
• Record Size – 100 bytes
• Number of Records – 500,000
[Chart] Kafka producer latency, time (s): IPoIB (100 Gbps) vs. RDMA (100 Gbps)
[Chart] Kafka producer bandwidth (MB/s): IPoIB (100 Gbps) vs. RDMA (100 Gbps)
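A producer run with the slide's record size and count can be sketched with kafka-python against a stock broker (broker address and topic are placeholders); the RDMA path is inside the OSU-modified Kafka.

```python
# Sketch: a producer micro-benchmark matching the slide's parameters,
# using kafka-python. Broker address and topic are placeholders.
import time
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="broker-host:9092")  # placeholder
record = b"x" * 100          # 100-byte records, as in the slide
num_records = 500_000

t0 = time.perf_counter()
for _ in range(num_records):
    producer.send("bench-topic", record)
producer.flush()             # wait until all records are acknowledged
elapsed = time.perf_counter() - t0

print("time: %.1f s, bandwidth: %.1f MB/s"
      % (elapsed, num_records * len(record) / elapsed / 1e6))
```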
Horovod on Azure
• TensorFlow 1.13
– ResNet-50 training
– Partial ImageNet data
– Batch size = 64 per worker
– 2 workers per node
– 100 batches total
– CPU-only version
• HC cluster
– OpenMPI 4.0 + UCX 1.5
– Singularity container
• ~97% scaling efficiency
[Chart] ResNet-50 training throughput (images/second) and scaling efficiency (%) on 2, 4, 8, and 16 nodes: IPoIB (100 Gbps) vs. RDMA (100 Gbps). Efficiency: IPoIB 100.00 / 96.78 / 95.58 / 94.93; RDMA 100.00 / 98.86 / 98.37 / 96.94
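The Horovod changes to a training script are small. Below is a minimal sketch in the tf.keras style of the TensorFlow 1.13 era; the model input and random data are placeholders standing in for the ImageNet pipeline.

```python
# Sketch: the core Horovod wiring for data-parallel training, tf.keras
# style circa TensorFlow 1.13. Launch with: mpiexec -n <ranks> python train.py
import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one rank per worker (2 per node in the slide's setup)

model = tf.keras.applications.ResNet50(weights=None)

# Scale the learning rate by world size, then wrap the optimizer so
# gradients are averaged across ranks (MPI/RDMA underneath).
opt = tf.keras.optimizers.SGD(lr=0.001 * hvd.size())
opt = hvd.DistributedOptimizer(opt)

model.compile(loss="categorical_crossentropy", optimizer=opt)

callbacks = [
    # Make all ranks start from rank 0's initial weights.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

# Placeholder random data standing in for the ImageNet pipeline.
x = np.random.rand(64, 224, 224, 3).astype("float32")
y = tf.keras.utils.to_categorical(np.random.randint(1000, size=64), 1000)
model.fit(x, y, batch_size=64, epochs=1, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```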
Wrapping up
What's available on major clouds?

Technology       Azure    AWS            GCP
Network speeds   100Gbps  100Gbps        20Gbps?
InfiniBand       ✔        ✘              ✘
RDMA             ✔        (limited)      ✘
GPUDirect        ✘        (single host)  ✘
Smart NIC        ✘        ✘              ✘
Smart Switch     ✘        ✘              ✘
NVMeOF           ✘        ✘              ✘
Take-aways
• Accelerated Frameworks:
– SparkRDMA on GitHub
– High Performance Big Data (From OSU)
– Horovod
• Azure instances
– Azure HPC HB/HC
– Azure NDv2 GPUs
– Azure FPGA
Questions?
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT