From Hours to Minutes: The Journey of
Optimizing Mask-RCNN and BERT Using MXNet
Haibin Lin, Applied Scientist, AWS AI
Lin Yuan, Software Design Engineer, AWS AI
Dataset and Model Size Keep Growing
(Charts: dataset size for training in GB; model parameter size in millions.)
Large Scale Distributed Training for Deep Neural Networks
(Diagrams: data parallelism and model parallelism.)
Optimization for Large Scale Distributed Training
• System-Level Optimization
• Accelerate training on a single GPU
• fused operators, data prefetching, vectorization, cache utilization, tensor cores
• Distributed training with multiple GPUs
• large batch size, NCCL allreduce, Elastic Fabric Adapter
• Algorithm-Level Optimization
• Large-batch optimization algorithm
• Model architecture
• Accuracy/runtime trade off
Performance Optimization on AWS Cloud
• Leverage the Amazon EC2 P3dn.24xlarge GPU instances
• 8 Nvidia V100 Tensor Core GPUs with 32 GB of memory each
• 96 Intel Xeon Scalable vCPUs
• 1.8 TB local NVMe SSD
• 100 Gbps network throughput
• supports the Elastic Fabric Adapter
• Software
• Apache MXNet
• GluonNLP and GluonCV toolkits
• Horovod distributed training library
Case Study: Mask R-CNN
Deep learning nowadays - Mask R-CNN
• Widely used for object detection
and instance segmentation
• Target accuracy
• bounding box AP: 37.7
• mask AP: 33.9
GluonCV: a Deep Learning Toolkit for Computer Vision
• Training scripts that reproduce SOTA results reported in the latest papers
• A large set of pre-trained models
• Carefully designed APIs and easy-to-understand implementations
• Community support
• Built on top of Apache MXNet framework
Supported tasks: image classification, object detection, semantic
segmentation, pose estimation, video action recognition
GPU Profiling
• Analyze runtime using Nvidia Visual Profiler
• Identify large kernels to optimize
(Profiler view highlights: a slow operator, NHWC layout conversions, and many small kernels.)
GPU Optimization
Runtime improvements:
• optimized ROIAlign: +10%
• optimized NMS: +10%
• fused RCNN target generator: +5%
• NHWC layout conversion: +10%
• pointwise operator fusion: +3%
Automatic Mixed Precision
• Automatic casting of the model
• Convolution, FullyConnected -> FP16
• Norm, Mean, SoftMax, etc. -> FP32
• Add, Mul etc. -> Cast to widest type
• AMP boosted throughput by 5~10%
• Casting the gradients to FP16 gave another 1~2% throughput improvement
without compromising accuracy
• Utilities for dynamic loss scaling (a usage sketch follows)
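A minimal sketch of this AMP flow with MXNet's mxnet.contrib.amp module (MXNet >= 1.5); the network, data, and hyperparameters below are placeholders, not the Mask R-CNN configuration from the talk:

```python
# Minimal AMP sketch; toy network and data stand in for the real model.
import mxnet as mx
from mxnet.contrib import amp
from mxnet.gluon import nn

amp.init()  # patch operators: FP16-safe ops cast down, reductions stay FP32

net = nn.Dense(10)
net.initialize(ctx=mx.gpu(0))
trainer = mx.gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})
amp.init_trainer(trainer)  # attach dynamic loss scaling to the trainer

loss_fn = mx.gluon.loss.SoftmaxCrossEntropyLoss()
x = mx.nd.random.uniform(shape=(8, 128), ctx=mx.gpu(0))
y = mx.nd.zeros((8,), ctx=mx.gpu(0))

with mx.autograd.record():
    loss = loss_fn(net(x), y)
    with amp.scale_loss(loss, trainer) as scaled_loss:  # dynamic loss scaling
        mx.autograd.backward(scaled_loss)
trainer.step(x.shape[0])
```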
Model Hybridization
• MXNet provides APIs to construct and debug models using imperative
programming
• Users can invoke the hybridize API to boost model performance to a level
equivalent to symbolic programming
• Applying hybridization to the model achieved a 5% runtime improvement
• Hybridizing the model with static_alloc gave another 1~2% throughput
improvement (see the sketch below)
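A small sketch of the hybridize API on a toy Gluon network (the layers are illustrative, not the Mask R-CNN model):

```python
# Build and debug imperatively, then compile to a symbolic graph.
import mxnet as mx
from mxnet.gluon import nn

net = nn.HybridSequential()
net.add(nn.Dense(256, activation='relu'),
        nn.Dense(10))
net.initialize()

# static_alloc reuses memory buffers across iterations, giving the extra
# 1~2% throughput noted above.
net.hybridize(static_alloc=True)

out = net(mx.nd.random.uniform(shape=(4, 128)))  # first call builds the graph
```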
Performance Tuning in an AWS cluster
• Binding each GPU to 12 vCPUs (6 from each CPU socket) on the Amazon
P3dn.24xlarge EC2 instance gave us an 8% improvement in throughput
• Autotuning Horovod hyperparameters such as the tensor fusion threshold,
cycle time, cache capacity, and hierarchical allreduce: +9% throughput
• Increasing the number of data workers from 4 to 8 also helped accelerate data
loading. Note, however, that more data workers do not necessarily mean better
performance, due to the overhead of context switching.
• Accelerated the dataloader through Cython
• Distributed validation significantly reduced validation compute time:
13 secs/epoch across 24 P3dn instances vs. several minutes for
non-distributed validation (a configuration sketch follows)
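A hedged sketch of two of these knobs, Horovod autotuning and CPU affinity for data loading; the contiguous vCPU pinning shown is a simplification of the 6-per-socket layout used in the talk, and psutil is an assumed helper:

```python
import os

# Must be set before hvd.init(); enables autotuning of the fusion threshold,
# cycle time, cache capacity, etc.
os.environ.setdefault('HOROVOD_AUTOTUNE', '1')

import horovod.mxnet as hvd
import psutil  # assumed available, used here for process CPU affinity

hvd.init()

VCPUS_PER_GPU = 12  # 96 vCPUs / 8 GPUs on p3dn.24xlarge
first = hvd.local_rank() * VCPUS_PER_GPU
psutil.Process().cpu_affinity(list(range(first, first + VCPUS_PER_GPU)))

NUM_DATA_WORKERS = 8  # raised from 4 in the talk; more is not always better
```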
Case Study: BERT
Deep learning nowadays - BERT
(Chart: General Language Understanding Evaluation (GLUE) Benchmark scores,
y-axis from 55 to 95, with BERT approaching the human baseline.)
Transfer learning with BERT for NLP
• Pre-training for NLP
• learn text representations on a large-scale corpus
• Fine-tuning for downstream tasks
• Named Entity Recognition
• Question Answering
• Search
• Chatbot
• Text Summarization
• Text Classification
• Models available in GluonNLP toolkit
(Diagram: BERT as a feature extractor, e.g. classifying "GTC is awesome!" as
positive; the pre-train/fine-tune paradigm spans both NLP and CV.)
Image credit to: d2l.ai
GluonNLP: a deep learning natural language toolkit
• Open source, available on SageMaker and in deep learning containers
• State-of-the-art NLP models
• Easy prototyping
• Fast deployment
• Multiple built-in NLP tasks
BERT model architecture
(Diagram: BERT as a stack of N multi-head attention blocks (Vaswani et al., '17).)
Image credit to: d2l.ai
Pre-training objectives
1. Masked language modeling
• Estimate the probability of the masked tokens given the surrounding context
• Randomly mask 15% of all tokens and predict them (a masking sketch follows
the examples below)
2. Next sentence prediction
• 50% of the time, replace the next sentence with a random one
• Learn logical coherence
MLM example: "I went to the bank to deposit some money." →
"I went to the <mask> to deposit some money."
NSP examples: "<CLS> Haibin is obnoxious <SEP> I don’t like his shirt" (coherent)
vs. "<CLS> Haibin is obnoxious <SEP> Hello world!" (random)
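A sketch of BERT-style masking for one token sequence; the slide states only the 15% selection rate, while the 80/10/10 split below is the standard BERT recipe, and vocab is a stand-in token list:

```python
import random

def mask_tokens(tokens, vocab, mask_token='<mask>', mask_prob=0.15):
    masked, targets = list(tokens), []
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            targets.append((i, tok))              # predict the original token
            r = random.random()
            if r < 0.8:
                masked[i] = mask_token            # 80%: replace with <mask>
            elif r < 0.9:
                masked[i] = random.choice(vocab)  # 10%: random token
            # remaining 10%: keep the original token unchanged
    return masked, targets

sentence = 'I went to the bank to deposit some money .'.split()
masked, targets = mask_tokens(sentence, vocab=sentence)
```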
Data loading
• Mini-batches are generated on the fly for dynamic masking [1]
• Multi-process DatasetLoader with pre-fetching in the background (see the
loader sketch below)
• AWS FSx for Lustre: file system for compute-intensive workloads
• Profiling result visualization
(Timeline: a data-loading gap between the previous batch and the current batch.)
Image credit to: d2l.ai
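A minimal sketch of multi-process background loading with Gluon's DataLoader; the random dataset is a stand-in for the pre-training shards:

```python
import mxnet as mx

dataset = mx.gluon.data.ArrayDataset(mx.nd.random.uniform(shape=(1024, 128)))
loader = mx.gluon.data.DataLoader(dataset,
                                  batch_size=32,
                                  shuffle=True,
                                  num_workers=8)  # worker processes prefetch
                                                  # batches in the background
for batch in loader:
    pass  # consume prefetched mini-batches while the GPU computes
```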
Fast Multi-head Self-Attention (credit to: Clement Fuji Tsang)

Baseline:
  For each layer:
    Separate projections: Qproj = Q·Wq, Kproj = Q·Wk, Vproj = Q·Wv
    Transpose Qproj, Kproj, Vproj: from (N, T, H, C) to (N, H, T, C)
    Compute attention:
      score = batch_gemm(Qproj, Kproj)
      result = batch_gemm(score, Vproj)
    Transpose result: from (N, H, T, C) to (N, T, H, C)

Optimized (higher cache utilization; 1.58x faster end to end):
  Transpose Q once: from (N, T, HC) to (T, N, HC)
  For each layer:
    Joint projection: Wqkv = concat(Wq, Wk, Wv); Q_K_Vproj = Q·Wqkv
    Compute attention:
      score = strided_batch_gemm(Qproj, Kproj)
      result = strided_batch_gemm(score, Vproj)
  Transpose final result: from (T, N, HC) to (N, T, HC)
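A NumPy sketch of the joint-projection idea: one GEMM against the concatenated weight matrix reproduces the three separate projections (shapes follow the slide: T = sequence length, N = batch, HC = hidden size; the strided_batch_gemm kernel itself is not shown):

```python
import numpy as np

T, N, HC = 32, 4, 768
Q = np.random.randn(T, N, HC).astype(np.float32)
Wq, Wk, Wv = (np.random.randn(HC, HC).astype(np.float32) for _ in range(3))

Wqkv = np.concatenate([Wq, Wk, Wv], axis=1)  # (HC, 3*HC) joint weight
qkv = Q @ Wqkv                               # single GEMM: (T, N, 3*HC)
q, k, v = np.split(qkv, 3, axis=-1)          # strided views into one buffer

assert np.allclose(q, Q @ Wq, atol=1e-3)     # matches the separate projection
```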
GPU memory is precious
- For each mini-batch, the gradient is synchronized across GPUs
- Gradient allreduce can overlap with backward computation
- A larger batch size leaves more time to hide communication latency
- A 1-bit dropout mask yields a 20% memory reduction, enabling larger batch sizes
Image credit to: d2l.ai
(Timeline: Allreduce1–3 run concurrently with subsequent backward/forward
passes — we can overlap computation and communication.)
NCCL + Elastic Fabric Adapter
(Diagram: the traditional HPC software stack in EC2 routes the MPI
implementation through the kernel-space TCP/IP stack and ENA network driver;
with EFA, MPI talks to Libfabric and the EFA kernel driver, keeping the data
path in user space over the same ENA device.)
- Elastic Fabric Adapter (EFA)
- for HPC and distributed ML
- bypasses the OS kernel
- integrated with MPI, NCCL
- BERT training
- 32 p3dn.24xlarge instances
- 256 V100 GPUs
- 100 Gb/s networking
- BERT-large with GluonNLP
- batch size 64K, phase 1
- 90% strong scaling efficiency with
EFA enabled
Distributed Stochastic Optimization (credit to: Shuai Zheng)
SGD update: $x_{t+1} = x_t - \eta_t g_t$
Normalized-gradient update: $x_{t+1} = x_t - \eta_t \frac{g_t}{\lVert g_t \rVert_2}$
| Framework  | Batch size | #XPUs    | #steps    | Optimizer | F1 score | Training time |
|------------|------------|----------|-----------|-----------|----------|---------------|
| TensorFlow | 64K/32K    | 1K TPUs  | 8599      | LAMB [2]  | 90.58%   | 76.19m        |
| MXNet      | 32K/32K    | 512 GPUs | 7038/1563 | LAMB + NG | 90.60%   | 141.5m        |
References
[1] Liu, Yinhan, et al. "RoBERTa: A Robustly Optimized BERT Pretraining
Approach." arXiv preprint arXiv:1907.11692 (2019).
[2] You, Yang, et al. "Large Batch Optimization for Deep Learning: Training
BERT in 76 Minutes." International Conference on Learning Representations, 2019.
Thank you
Haibin Lin
haibilin@amazon.com
Lin Yuan
lnyuan@amazon.com