Deep Learning Performance: Comparing Scale-out vs Scale-up
Abstract
This whitepaper examines the performance and efficiency of deep learning training when using the Dell EMC PowerEdge C4140 server to run neural models. The objective is to show how the C4140 in a scale-out configuration performs against a scale-up server.
February 2019
Revisions
Date Description
February 2019 Initial release
Acknowledgements
This paper was produced by the following persons:
Authors:
Bhavesh Patel, Dell EMC Server Advanced Engineering
Vilmara Sanchez, Dell EMC Server Advanced Engineering
Contributor: Josh Anderson, Dell EMC System Engineering
The information in this publication is provided “as is.” Dell Inc. makes no representations or warranties of any kind with respect to
the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose.
Use, copying, and distribution of any software described in this publication requires an applicable software license.
© August 2018 Dell Inc. or its subsidiaries. All Rights Reserved. Dell, EMC, Dell EMC and other trademarks are trademarks of Dell
Inc. or its subsidiaries. Other trademarks may be trademarks of their respective owners.
Dell believes the information in this document is accurate as of its publication date. The information is subject to change without
notice.
Contents
1 Overview
1.1 Definition
2 Introduction
2.1 Deep Learning
3 Background
3.1 Criteria
3.2 Why TensorFlow as the framework of choice?
4 Test Methodology
4.1 Testing Methodology
4.1.1 Short Test
4.1.2 Long Test
4.2 Throughput Testing
4.3 Neural Models & parameters
5 PowerEdge Server Details
5.1 PowerEdge C4140
5.1.1 Why is C4140 Configuration-M better?
5.2 PowerEdge R740/R740xd
6 Framework Setup Details
6.1 Distributed Horovod-TensorFlow Setup
6.2 Evaluation Platform Setup
7 Performance Results
7.1 Single Node – Throughput (images/sec)
7.1.1 PowerEdge R740xd
7.1.2 PowerEdge C4140-V100-16GB-PCIe [Config B] – Single Node
7.1.3 PowerEdge C4140-K-V100-16GB-SXM2 – Single Node
7.1.4 PowerEdge C4140-K-V100-32GB-SXM2 – Single Node
7.1.5 PowerEdge C4140-M-V100-16GB-SXM2 – Single Node
7.1.7 PowerEdge C4140-V100-SXM2 Configuration K versus Configuration M – Single Node
7.1.8 What role does CPU play in Deep Learning?
7.1.9 Conclusion
7.1.10 Non-Dell EMC server: 8x V100-16GB-SXM2 – Single Node
Acknowledgements
We would like to acknowledge the following individuals: Jaime Edwards (Director of PowerEdge Advanced Engineering) for setting the direction of this project, and April Berman (PowerEdge Acceleration Product Manager), Shreya Shah (PowerEdge C4140 Product Manager), and Trevor Montgomery (Enterprise Systems Group Business Development) for providing us the resources for this paper.
We would also like to acknowledge:
❖ The Google TensorFlow team for helping us debug issues related to TensorFlow.
❖ The Uber Horovod team for explaining how to use their library in a distributed training environment.
❖ The NVIDIA CUDA support team and NVIDIA account team for their expedited support.
1 Overview
The objective of this whitepaper is to compare Dell's PowerEdge acceleration-optimized servers and determine their performance when running deep learning workloads. The purpose is to highlight how Dell's scale-out solution is ideally suited for these emerging workloads.
We compare how the PowerEdge C4140 performs using TensorFlow, one of the most popular frameworks, with various neural architectures, and measure it against other acceleration-optimized servers in the market targeting the same workloads. The idea is to investigate whether the architectural implementation helps the PowerEdge C4140 better utilize its accelerators when running hardware-level benchmarks such as Baidu DeepBench and TensorFlow-based benchmarks.
Using Baidu DeepBench, we can profile kernel operations, the lowest-level compute and communication primitives for deep learning (DL) applications. This allows us to profile how the accelerators perform at the component level in different server systems, which is important because it shows which hardware provides the best performance on the basic operations used in deep neural networks. DeepBench includes operations and workloads that are important to both training and inference.
Using TensorFlow as the primary framework, we compare performance in terms of throughput and the training time needed to reach a given accuracy on the ImageNet dataset. We look at performance at both the single-node and the multi-node level, using popular neural architectures such as ResNet-50, VGG, GoogLeNet, and AlexNet for this comparison.
1.1 Definition
Scale-up: Scale-up is achieved by putting the workload on a bigger, more powerful server
(e.g., migrating from a two-socket server to a four- or eight-socket x86 server in a rack-based or
blade form factor). This is a common way to scale databases and several other workloads. It has
the advantage of allowing organizations to avoid making significant changes to the workload; IT
managers can just install the workload on a bigger box and keep running it the way they always
have.
Scale-out: Scale-out refers to expanding to multiple servers rather than a single bigger server. The use of availability and clustering software (ACS) and its server node management, which enables IT managers to move workloads from one server to another or to combine them into a single computing resource, is a prime example of scale-out. It adds flexibility by allowing IT organizations to add nodes as the number of users or workloads increases, which helps with better control of IT budgets.
2 Introduction
Figure 1: Artificial Intelligence, Machine Learning and Deep Learning [Source: MIT]
Artificial Intelligence
First coined as a term in 1956 by John McCarthy, AI involves machines that can perform tasks characteristic of human intelligence. While this is rather general, it includes things like planning, understanding language, recognizing objects and sounds, learning, and problem solving.
Machine Learning
Arthur Samuel coined the phrase not long after AI, in 1959, defining it as "the ability to learn without being explicitly programmed." You can get AI without using machine learning, but that would require building millions of lines of code with complex rules and decision trees. So instead of hard-coding software routines with specific instructions to accomplish a task, machine learning is a way of "training" an algorithm so that it can learn how to perform the task. "Training" involves feeding massive amounts of data to the algorithm and allowing the algorithm to adjust itself and improve.
Spiking Neural Networks
Spiking neural networks (SNNs) are artificial neural network models that more closely mimic
natural neural networks. In addition to neuronal and synaptic state, SNNs also incorporate the
concept of time into their operating model. The idea is that neurons in the SNN do not fire at
each propagation cycle (as it happens with typical multi-layer perceptron networks), but rather
fire only when a membrane potential – an intrinsic quality of the neuron related to its membrane
electrical charge – reaches a specific value. When a neuron fires, it generates a signal which
travels to other neurons which, in turn, increase or decrease their potentials in accordance with
this signal.
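To make the firing rule concrete, the following is a minimal, self-contained sketch of a leaky integrate-and-fire neuron; the parameter values are illustrative assumptions, not taken from any model discussed in this paper.

```python
def lif_neuron(input_current, dt=1.0, tau=20.0, v_rest=0.0, v_thresh=1.0, v_reset=0.0):
    """Toy leaky integrate-and-fire neuron: integrates its input over time and
    emits a spike only when the membrane potential reaches the threshold."""
    v = v_rest
    spikes = []
    for i in input_current:
        v += (dt / tau) * (v_rest - v) + i   # leaky integration of the membrane potential
        if v >= v_thresh:                    # fire only when the threshold is crossed
            spikes.append(1)
            v = v_reset                      # reset after the spike
        else:
            spikes.append(0)
    return spikes

# A constant drive produces periodic spikes rather than firing on every step.
print(lif_neuron([0.15] * 40))
```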
3 Background
With the recent advances in the field of Machine Learning and especially Deep Learning, it is becoming more and more important to figure out the right set of tools that will meet the performance characteristics of these workloads.
Since Deep Learning is compute intensive, the use of accelerators such as GPUs has become the norm. But GPUs are costly, and the decision often comes down to the performance difference between a system with and without GPUs.
The plot in Figure 3 shows GPU performance for single precision and Figure 4 shows GPU performance for half precision. Most Deep Learning frameworks and models take advantage of half precision, since it lets them work with larger datasets within the available memory. It is important to look at the raw FLOPS numbers for a GPU, because we want to extract the same level of performance when that GPU is put into a system.
To see whether Dell EMC PowerEdge servers can meet the raw FLOPS numbers indicated in the charts above, we approached the problem by breaking it up into different sections. The picture below illustrates how we approached this testing:
1. System bandwidth performance, i.e. the PCIe connectivity to the GPUs – p2p bandwidth and latency tests
2. GPU hardware performance without any deep learning framework – Baidu DeepBench
3. System running the GPUs and a framework – TensorFlow benchmarks
3.1 Criteria
1. In order to bound our testing, we picked TensorFlow as the framework of choice since it has better support and models are readily available.
2. For distributed training, we selected Uber's Horovod implementation, since it is one of the best-performing distributed implementations [2].
4 Test Methodology
The test methodology consists of three distinct phases.
In Phase 1 we test the hardware performance of each server using the NVIDIA-supplied p2p bandwidth and latency tests and Baidu DeepBench.
In Phase 2 we use the TensorFlow framework and run several well-known neural models to compare performance in terms of throughput and training time.
In Phase 3 we again use TensorFlow but compare performance at the multi-node level using Uber's Horovod implementation.
Each phase is described in more detail below.
1) Phase 1 – In this phase we perform basic tests such as PCIe bandwidth and latency tests to ensure the results align with what we expect from the theoretical numbers. We then run the Baidu DeepBench benchmarks to evaluate deep learning performance for the accelerators and the system. The results for this step are presented in a separate whitepaper.
2) Phase 2 – We used the official TensorFlow benchmarks, which were run across several servers.
3) Phase 3 – To benchmark the servers in distributed mode we used Horovod, which is a distributed training framework for TensorFlow.
• Use Case – The benchmark tests target image classification with convolutional neural network (CNN) models.
• Benchmark code – TensorFlow benchmark scripts.
• Hardware Configuration – Each server is configured with its maximum supported number of GPUs.
• Servers – The servers tested are the PowerEdge R740, PowerEdge C4130, PowerEdge C4140, and a non-Dell EMC 8x NVLink GPU server.
• Frameworks – TensorFlow for single node, and TensorFlow with the Horovod library for distributed training.
• Performance – The performance metrics used for comparison across servers are throughput (images per second) and training time to reach top-5 and top-1 accuracy.
• Training tests – We conducted two types of tests. 1) Short tests: for each test, 10 warm-up steps were run and the next 100 steps were averaged (a sketch of this measurement is shown after this list). 2) Long tests: to capture training accuracy convergence and elapsed training time.
• Dataset – ILSVRC2012
• Software stack configuration – The benchmarks were run in a Docker container environment; see Table 1 for details.
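As an illustration of the short-test methodology (warm-up steps discarded, the next 100 steps averaged), the minimal sketch below shows how an images/sec figure can be derived; run_step is a placeholder for a single training iteration and is not part of the benchmark scripts themselves.

```python
import time

def measure_throughput(run_step, batch_size, num_gpus, warmup_steps=10, timed_steps=100):
    """Discard warm-up iterations, then average images/sec over the timed steps."""
    for _ in range(warmup_steps):
        run_step()                              # warm-up: not included in the timing
    start = time.time()
    for _ in range(timed_steps):
        run_step()                              # timed training iterations
    elapsed = time.time() - start
    # Total images processed divided by elapsed wall-clock time.
    return (batch_size * num_gpus * timed_steps) / elapsed
```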
5 PowerEdge Server Details
5.1 PowerEdge C4140
The Dell EMC PowerEdge C4140, an accelerator-optimized, high-density 1U rack server, is used as the compute node unit in this solution. The PowerEdge C4140 can support four NVIDIA Volta GPUs, in both V100-SXM2 and V100-PCIe models.
The Dell EMC PowerEdge C4140 supporting NVIDIA Volta SXM2 in topology 'M', with high-bandwidth host-to-GPU communication, is one of the most advantageous topologies for deep learning. Most competitive systems supporting 4-way, 8-way, or 16-way NVIDIA Volta SXM2 use PCIe switches, which limits the total available bandwidth between CPU and GPU.
5.1.1 Why is C4140 Configuration-M better?
Configuration   Link interface (CPU-GPU complex)   Total bandwidth   Notes
K               x16 Gen3                           32 GB/s           PCIe switch between host and GPU complex
G               x16 Gen3                           32 GB/s           PCIe switch between host and GPU complex
M               4x x16 Gen3                        128 GB/s          Each GPU has an individual x16 Gen3 link to the host CPU
Table 3: Host-GPU complex PCIe bandwidth comparison
As shown in Table 3, the total available bandwidth between the CPU and GPU complex in Configuration M is much higher than in the other configurations. This greatly benefits neural models by letting them take advantage of the larger-capacity, albeit lower-bandwidth, DDR system memory to speed up learning.
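As a rough sanity check of the figures in Table 3 (assuming standard PCIe Gen3 signaling of 8 GT/s per lane with 128b/130b encoding):

$$
\text{BW}_{\text{x16, per direction}} = 16 \times \frac{8\,\text{GT/s} \times \tfrac{128}{130}}{8\ \text{bits/byte}} \approx 15.75\,\text{GB/s}
\;\Rightarrow\; \approx 32\,\text{GB/s bidirectional per link},
\qquad 4 \times 32\,\text{GB/s} = 128\,\text{GB/s for Configuration M.}
$$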
Figure 8 shows the CPU-GPU and GPU-GPU connection topology for the C4140-K, Figure 9 shows the topology for the C4140-M, and Figure 10 shows the topology for the C4140-B.
Figure 8: Configuration K topology (four V100-SXM2 GPUs behind a PCIe switch attached to CPU1 by a single x16 Gen3 link; GPUs interconnected with NVLink)
Figure 9: Configuration M topology (each of the four V100-SXM2 GPUs has its own x16 Gen3 link to a host CPU; GPUs interconnected with NVLink)
Figure 10: Configuration B topology (four V100-PCIe GPUs behind a PCIe switch attached to CPU1 by a single x16 Gen3 link)
6 Framework Setup Details
6.1 Distributed Horovod-TensorFlow Setup
Horovod [8] [9] [10] is a distributed training framework for TensorFlow, Keras, and PyTorch, initially developed by Uber. It uses bandwidth-optimal communication protocols such as RDMA [2].
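As a point of reference, the following is a minimal sketch (not the benchmark code used for these tests) of how a TensorFlow 1.x training loop is typically wrapped with Horovod; the toy model, learning rate, and step count are placeholders. Each GPU gets its own process, launched for example with mpirun -np 8 -H host1:4,host2:4 python train_hvd.py.

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()                                           # one process per GPU

# Pin each Horovod rank to a single GPU on its node.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Toy model standing in for the CNN benchmarks; shapes are illustrative only.
x = tf.random_normal([128, 224 * 224 * 3])
y = tf.layers.dense(x, 1000)
loss = tf.reduce_mean(tf.square(y))

# Scale the learning rate by the number of workers and wrap the optimizer so
# gradients are averaged across workers with ring-allreduce (NCCL underneath).
opt = tf.train.MomentumOptimizer(0.01 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)
global_step = tf.train.get_or_create_global_step()
train_op = opt.minimize(loss, global_step=global_step)

# Broadcast initial variables from rank 0 so every worker starts identically.
hooks = [hvd.BroadcastGlobalVariablesHook(0),
         tf.train.StopAtStepHook(last_step=100)]

with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    while not sess.should_stop():
        sess.run(train_op)
```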
In this section we briefly explain the software stack configuration we used to extract multi-node throughput with distributed Horovod-TensorFlow, using high-speed Mellanox InfiniBand ConnectX-5 network adapters at 100 Gbit/s over IPoIB with GPUDirect RDMA.
To set up the configuration, we used as our reference the configuration procedure presented by Mellanox on its community blog [3] and the basic installation of Horovod in Docker [4].
The tests were run in a Docker environment; Figure 12 shows the different logical layers involved in the software stack configuration. Each server is connected to the InfiniBand switch and has the Mellanox OFED for Ubuntu, Docker CE, and the GPUDirect RDMA API installed on the host, plus a container image built with Horovod and Mellanox OFED among other supporting libraries. To build the extended container image, we used the Horovod Dockerfile and modified it by adding the installation of the Mellanox OFED drivers.
Figure 13 below shows how the PowerEdge C4130/C4140 is connected via the InfiniBand fabric for multi-node testing.
Figure 13: Using Mellanox CX5 InfiniBand adapter to connect C4130/PowerEdge C4140 in multi-node
configuration
In Figure 14 we see how, with GPUDirect RDMA, GPU memory is accessed directly instead of the data being copied multiple times across system components; this feature is reflected directly in the throughput performance of the server.
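As an illustrative note (not the exact settings used for these tests), NCCL exposes standard environment variables that can be used to confirm that the InfiniBand/GPUDirect path is actually being taken by the allreduce traffic; the device and interface names below are placeholders.

```python
import os

# Standard NCCL environment variables; exact values depend on the cluster.
os.environ["NCCL_DEBUG"] = "INFO"         # log which transport (IB, GDR) NCCL selects
os.environ["NCCL_IB_DISABLE"] = "0"       # keep the InfiniBand transport enabled
os.environ["NCCL_IB_HCA"] = "mlx5_0"      # ConnectX-5 HCA to use (placeholder name)
os.environ["NCCL_SOCKET_IFNAME"] = "ib0"  # IPoIB interface for bootstrap traffic (placeholder)
```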
7 Performance Results
7.1 Single Node – Throughput (images/sec)
The charts below show the results for different servers running the short tests to extract throughput (images/second) using ResNet-50 with a batch size of 128 and 100 steps.
The single-node results use the maximum number of GPUs supported within that node.
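For reference, a hypothetical single-node invocation of the official tf_cnn_benchmarks script with the short-test settings above might look like the following; the script path and GPU count are placeholders, and the flags shown are the commonly used ones rather than the exact commands from our runs.

```python
import subprocess

subprocess.run([
    "python", "tf_cnn_benchmarks.py",     # path to the official benchmark script (placeholder)
    "--model=resnet50",
    "--batch_size=128",                   # per-GPU batch size used in the short tests
    "--num_batches=100",                  # 100 timed steps
    "--num_gpus=4",                       # set to the maximum GPUs supported by the server
    "--variable_update=replicated",       # single-node multi-GPU variable handling
], check=True)
```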
7.1.1 PowerEdge R740xd
The PowerEdge R740 with P40 GPUs was tested with different neural models. The results show how the models differ in how much of the available memory they use; e.g., ResNet-50 uses more memory than GoogLeNet or AlexNet, and therefore we see lower throughput.
7.1.7 PowerEdge C4140-V100-SXM2 Configuration K versus Configuration M – Single Node
The plot below compares the performance of Config-K and Config-M, although it is not a like-for-like comparison because of the CPU difference: Config-K has an Intel Xeon 4116 (2.1 GHz, 12 cores) while Config-M has an Intel Xeon 6148 (2.4 GHz, 20 cores). The number of CPU cores does matter, and in one of the later plots we show the difference in performance using different processors with the same PCIe topology.
7.1.8 What role does CPU play in Deep Learning?
As shown in Figure 21 below, the number of CPU cores does play a role in throughput, and the biggest difference appears when running AlexNet.
Figure 21: Performance difference between Intel Xeon 4116 and Intel Xeon 6148 in the C4140-M
➢ As you can see in Figure 21, there is only a slight performance difference for most of the neural models when using the Intel Xeon 6148, except for AlexNet, which shows an almost doubling in performance.
➢ There are quite a few in-depth articles describing the AlexNet architecture, and at a later point we will try to explain why AlexNet behaves differently based on the number of CPU cores and their frequency.
7.1.9 Conclusion
7.1.9.1 Single Node: 1GPU
➢ As shown in the plots above in Figure 22 and Figure 23, there is not much difference in throughput (images/sec) when comparing the PowerEdge C4140 Configuration B with V100-16GB PCIe GPUs and Configuration K with V100-16GB SXM2 GPUs. The reason is that in both configurations the GPUs are in peer-to-peer mode behind a PCIe switch.
➢ The C4140-M shows higher performance than the C4140-K whether using the 12-core or the 20-core Intel Xeon CPU.
The PowerEdge C4130 server scales very well, with 97% efficiency within a node and 92% efficiency across nodes. The ideal performance is computed by multiplying the single-GPU throughput by the number of GPUs in the system; see Figure 26.
Figure 26: Scaling efficiency of C4130-P100-16GB-PCIe across multiple GPUs and multiple nodes
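Stated as a formula, with T_1 the single-GPU throughput and T_N the measured throughput on N GPUs:

$$
\text{Ideal}_N = N \cdot T_1, \qquad
\text{Scaling efficiency} = \frac{T_N}{N \cdot T_1} \times 100\%
$$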
As seen from the table above, the PowerEdge C4140 with SXM2 shows good performance across various neural models. The most common ones, i.e. ResNet-50 and Inception-v3, show performance within 8% of the 8x SXM2 system. The only exception is AlexNet, which shows quite a large difference between the 8x SXM2 system and the PowerEdge C4140.
The good performance shown by the PowerEdge C4140 in multi-node mode, comparable to a single-node 8x V100-16GB server, was reached after the right software stack configuration with the distributed framework Horovod over InfiniBand/GPUDirect RDMA; Figure 32 below shows the scaling efficiency reached by the PowerEdge C4140.
Figure 32: Performance with distributed Horovod-TensorFlow, connected by Mellanox ConnectX-5 network adapters at 100 Gbit/s over IPoIB with GPUDirect RDMA
In Figure 33 above we can see the throughput improvement when using a server with a higher-capacity CPU. As seen in the table below, almost all the models trained on the multi-node C4140-M-V100-16GB-SXM2 with the Intel Xeon 6148 CPU (8 GPUs) performed better than the single-node 8x V100. The exception was AlexNet, which still performed below the single-node 8x V100; however, its throughput improved significantly compared with training on the C4140-K-V100-16GB-SXM2 with the Intel Xeon 4116. See the summary in the table below.
7.2.6 PowerEdge C4140 Multi Node Training with Different CPU Models vs 8x V100-16GB-
SXM2
In the results shown in Figure 31 and Figure 33, we configured the multi-node system with PowerEdge C4140-V100-SXM2 servers in Configuration-K (Intel Xeon 4116 CPU) and Configuration-M (Intel Xeon 6148 CPU) respectively, versus single-node training on the non-Dell EMC 8x V100-16GB-SXM2.
To show the impact of the CPU on the training of deep learning workloads, we ran additional tests configuring the multi-node system with PowerEdge C4140-V100-SXM2 servers in Configuration-M with the Intel Xeon 6148 CPU. In Figure 34 we see how more advanced CPU models boost GPU performance even further, since most of the data loading, data preprocessing, and batch transformation tasks occur on the CPU, whereas the training tasks occur on the GPU. A sketch of such a CPU-side input pipeline is shown below.
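The following is a minimal TensorFlow 1.x style sketch of the CPU-side work referred to above; the TFRecord feature names, file pattern, and parallelism settings are illustrative assumptions rather than the exact pipeline used by the benchmark scripts.

```python
import tensorflow as tf

def parse_and_preprocess(record):
    # Hypothetical parser: decode a JPEG image stored in a TFRecord and resize it.
    features = tf.parse_single_example(record, {
        "image/encoded": tf.FixedLenFeature([], tf.string),
        "image/class/label": tf.FixedLenFeature([], tf.int64),
    })
    image = tf.image.decode_jpeg(features["image/encoded"], channels=3)
    image = tf.image.resize_images(image, [224, 224])
    return image, features["image/class/label"]

# Decode, resize, shuffle and batch all run on the CPU; prefetch() overlaps this
# work with GPU compute, so a faster CPU raises the delivered images/sec.
files = tf.data.Dataset.list_files("/data/imagenet/train-*")   # placeholder path
dataset = (tf.data.TFRecordDataset(files)
           .map(parse_and_preprocess, num_parallel_calls=16)
           .shuffle(10000)
           .batch(128)
           .prefetch(2))
```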
Figure 35 shows the results for the 8x SXM2 server, the PowerEdge C4140-V100 in single- and multi-node mode, the C4130-P100 in single- and multi-node mode, and the R740-P40.
Results Highlights
➢ The fastest training time was achieved by the 8x SXM2 system, reaching 93% accuracy convergence in 6.6 hours.
➢ The PowerEdge C4140 Configuration K with SXM2 in multi-node configuration achieved 95% accuracy convergence in 7.2 hours.
➢ The R740-P40 reached 93% accuracy convergence in 48.8 hours.
Figure 35: Long tests to extract accuracy convergence and training time
After training the PowerEdge C4140 Configuration M with SXM2 in multi-node configuration, we saw it reach the fastest training time of 5.3 hours, surpassing the non-Dell EMC SN-8x-V100-16GB-SXM2, which completed training in 6.6 hours. See Figure 36.
Another aspect we wanted to explore was the accuracy convergence for other models, so we selected models with different network depths (VGG-19, ResNet-50, and Inception-v4) and ran the long tests on the PowerEdge C4140 in multi-node configuration and the non-Dell EMC 8x V100-SXM2. The results are shown in Figure 37 below.
Figure 37: Training long tests to extract accuracy convergence and training time with 8X SXM2 and
different models
In Figure 37 above we observe that ResNet-50 and Inception-v4 reached between 92% and 96% accuracy convergence in 90 epochs; however, ResNet-50 at the different batch sizes converged faster than Inception-v4. On the other hand, VGG-19 did not produce acceptable accuracy, suggesting it requires more than 90 epochs to converge.
Figure 38 below shows comparable results for the PowerEdge C4140-K (multi-node) with V100-SXM2.
Figure 38: Training long tests to extract accuracy convergence and training time with PowerEdge C4140-
K multi-node and different models
Figure 39: Training long tests to extract accuracy convergence and training time with PowerEdge C4140-
K multi-node and single-node 8x V100-SXM2 with different models
Figure 39 above shows a comparison between the 8x SXM2 and the PowerEdge C4140 Configuration-K in multi-node configuration using ResNet-50. The training time difference between the 8x SXM2 and the multi-node PowerEdge C4140 is within 7%, which shows that using Mellanox InfiniBand RDMA allows the PowerEdge C4140 to achieve performance similar to a scale-up server.
To show the impact of the CPU on training deep learning workloads to accuracy convergence, we ran additional tests configuring the multi-node system with PowerEdge C4140-V100-SXM2 servers in Configuration-M with the Intel Xeon 6148 CPU. In Figure 40 we see that the multi-node C4140-V100-SXM2 Configuration-M system with the Intel Xeon 6148 CPU performs 1.3x faster than the single-node 8x V100 for ResNet-50 trained at several batch sizes. Again, this shows the relationship between the CPU model and deep learning performance, where most of the data loading, data preprocessing, and batch transformation tasks occur on the CPU, whereas the training tasks occur on the GPU.
Figure 40: Multi-node training on PowerEdge C4140-V100-SXM2 Configuration-K with Intel Xeon 4116 CPU and Configuration-M with Intel Xeon 6148 CPU, versus single-node training on non-Dell EMC 8x V100-16GB-SXM2
In Figure 40 we can see how the C4140-V100-SXM2 Configuration-M system outperforms the other systems in training time across the different batch sizes.
This section shows the results for other aspects explored during this project, such as hyperparameter tuning, the effect of the learning rate in single-node and multi-node mode, and the critical kernels executed in the TensorFlow benchmarks. These aspects could be the subject of deeper study in future projects.
In this experiment we used the following learning rate schedule: the initial learning rate was set to 0.4 for the first 10 epochs, after which it was decreased to 0.04 until the model reached 60 epochs of training, and finally to 0.004 until the end of training at 90 epochs. A sketch of this schedule is shown below.
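A minimal sketch of this piecewise schedule in TensorFlow 1.x; the steps-per-epoch value is a placeholder that depends on the dataset size and the global batch size.

```python
import tensorflow as tf

steps_per_epoch = 10000                      # placeholder: dataset size / global batch size
global_step = tf.train.get_or_create_global_step()

# 0.4 for epochs 0-9, 0.04 for epochs 10-59, 0.004 for epochs 60-90.
boundaries = [10 * steps_per_epoch, 60 * steps_per_epoch]
values = [0.4, 0.04, 0.004]
learning_rate = tf.train.piecewise_constant(global_step, boundaries, values)
```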
Figures 42 and 43 show how the learning rate has a different effect in single-node versus multi-node mode. In multi-node mode the learning rate update is not reflected immediately; there is a delay until the change propagates to every worker.
We wanted to explore the critical kernels executed on one GPU when running the TensorFlow benchmarks, so we used the NVIDIA profiling tool nvprof to analyze one TensorFlow benchmark trained on the C4130-P100-PCIe-16GB in multi-node mode.
Figure 44 shows the critical kernels executed. We found that the all-reduce communication primitives were called 38.4% of the time, which may suggest that the GPU spends too much time exchanging and communicating data rather than computing; this is something that can be explored in depth in future projects.
Figure 44: Critical kernels executed when training with C4130-P100-16GB-SXM2 (8 GPUs) –
multi-node
The CUDA Toolkit offers several performance analysis tools to optimize the performance of CUDA or OpenACC applications. The Visual Profiler (nvvp) traces CUDA activities, profiles CUDA kernels, and correlates performance instrumentation with source code. The nvprof tool collects performance events and metrics, and cuda-memcheck detects memory access issues, incorrect GPU thread synchronization, and other important aspects to optimize performance [5].
• Table 5 shows that the PowerEdge C4140 in multi-node configuration is within 7.8% of the single-node non-Dell EMC 8x NVLink system for the most widely used model, ResNet-50. Moreover, the C4140-M in multi-node configuration outperforms the single-node 8x NVLink system by at least 18% on ResNet-50. The only disclaimer is that the C4140-M results use the latest versions of the NCCL and TensorFlow containers.
• There is a lot of performance improvement being added continuously, whether at the GPU level, library level, or framework level. We are continuously looking at how we can improve our performance results by experimenting with different hyperparameters.
• Some of our future work in this area will relate to exploring the latest software optimizations released by NVIDIA and looking at the fast.ai library, where Jeremy Howard and researchers at fast.ai achieved a training time of 3 hours on 8x V100 with ResNet-50.
9 Citation
@article{sergeev2018horovod,
  author  = {Alexander Sergeev and Mike Del Balso},
  journal = {arXiv preprint arXiv:1802.05799},
  title   = {Horovod: fast and easy distributed deep learning in {TensorFlow}},
  year    = {2018}
}
10 References
• [1] Nvidia Blogs, “What’s the Difference between Artificial Intelligence, Machine
Learning, and Deep Learning?” [Online]. Available:
https://blogs.Nvidia.com/blog/2016/07/29/whats-difference-artificial-intelligence-
machine-learning-deep-learning-ai/
• [2] Cornell University Library, “Horovod: fast and easy distributed deep learning in
TensorFlow” [Online]. Available: https://arxiv.org/abs/1802.05799
• [3] Mellanox Community, “How to Create a Docker Container with RDMA Accelerated
Applications Over 100Gb InfiniBand Network” [Online]. Available:
https://community.mellanox.com/docs/DOC-2971
• [7] Medium, “Hardware for Deep Learning. Part 3: GPU” [Online]. Available:
https://blog.inten.to/hardware-for-deep-learning-part-3-gpu-8906c1644664
• [8] Sergeev, A., Del Balso, M. (2017) Meet Horovod: Uber’s Open Source Distributed Deep
Learning Framework for TensorFlow. Retrieved from https://eng.uber.com/horovod/
• [9] Sergeev, A. (2017) Horovod - Distributed TensorFlow Made Easy. Retrieved from
https://www.slideshare.net/AlexanderSergeev4/horovod-distributed-tensorflow-made-easy
• [10] Sergeev, A., Del Balso, M. (2018) Horovod: fast and easy distributed deep learning in
TensorFlow. arXiv:1802.05799