From Hours to Minutes: The Journey of
Optimizing Mask-RCNN and BERT Using MXNet
Haibin Lin, Applied Scientist, AWS AI
Lin Yuan, Software Design Engineer, AWS AI
Dataset and Model Size Keep Growing
(Charts: dataset size for training in GB; model parameter size in millions.)
Large Scale Distributed Training for Deep Neural Networks
(Diagrams: data parallelism and model parallelism.)
Optimization for Large Scale Distributed Training
• System-Level Optimization
• Accelerate training on a single GPU
• fused operators, data prefetching, vectorization, cache utilization, tensor cores
• Distributed training with multiple GPUs
• large batch size, NCCL allreduce, Elastic Fabric Adapter
• Algorithm-Level Optimization
• Large-batch optimization algorithm
• Model architecture
• Accuracy/runtime trade off
Performance Optimization on AWS Cloud
• Leverage the Amazon EC2 P3dn.24xlarge GPU instances
• 8 Nvidia V100 Tensor Core GPUs with 32 GB of memory each
• 96 Intel Xeon Scalable vCPUs
• 1.8 TB local NVMe SSD
• 100 Gbps network throughput
• supports the Elastic Fabric Adapter
• Software
• Apache MXNet
• GluonNLP and GluonCV toolkits
• Horovod distributed training library
Case Study: Mask R-CNN
Deep learning nowadays - Mask R-CNN
• Widely used for object detection
and instance segmentation
• Target accuracy
• bounding box AP: 37.7
• mask AP: 33.9
GluonCV: a Deep Learning Toolkit for Computer Vision
• Training scripts that reproduce SOTA results reported in the latest papers
• A large set of pre-trained models
• Carefully designed APIs and easy-to-understand implementations
• Community support
• Built on top of Apache MXNet framework
Supported tasks: image classification, object detection, semantic
segmentation, pose estimation, video action recognition
GPU Profiling
• Analyze runtime using Nvidia Visual Profiler
• Identify large kernels to optimize
(Profiler view highlights: a slow operator, NHWC layout conversions, and many small kernels.)
GPU Optimization
Runtime improvements:
• optimized ROIAlign: +10%
• optimized NMS: +10%
• fused RCNN target generator: +5%
• NHWC layout conversion: +10%
• pointwise operator fusion: +3%
Automatic Mixed Precision
• Automatic casting of the model
• Convolution, FullyConnected -> FP16
• Norm, Mean, SoftMax, etc. -> FP32
• Add, Mul etc. -> Cast to widest type
• AMP boosted throughput by 5~10%
• Casting the gradients to FP16 gave another 1~2% throughput improvement
without compromising accuracy
• Utilities for dynamic loss scaling (a usage sketch follows)
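A minimal sketch of this AMP flow with MXNet's mxnet.contrib.amp module (MXNet >= 1.5); the network, data, and hyperparameters below are placeholders, not the Mask R-CNN configuration from the talk:

```python
# Minimal AMP sketch; toy network and data stand in for the real model.
import mxnet as mx
from mxnet.contrib import amp
from mxnet.gluon import nn

amp.init()  # patch operators: FP16-safe ops cast down, reductions stay FP32

net = nn.Dense(10)
net.initialize(ctx=mx.gpu(0))
trainer = mx.gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})
amp.init_trainer(trainer)  # attach dynamic loss scaling to the trainer

loss_fn = mx.gluon.loss.SoftmaxCrossEntropyLoss()
x = mx.nd.random.uniform(shape=(8, 128), ctx=mx.gpu(0))
y = mx.nd.zeros((8,), ctx=mx.gpu(0))

with mx.autograd.record():
    loss = loss_fn(net(x), y)
    with amp.scale_loss(loss, trainer) as scaled_loss:  # dynamic loss scaling
        mx.autograd.backward(scaled_loss)
trainer.step(x.shape[0])
```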
Model Hybridization
• MXNet provides APIs to construct and debug models using imperative
programming
• Users can invoke the hybridize API to boost model performance to a level
equivalent to symbolic programming
• Applying hybridization to the model achieved a 5% runtime improvement
• Hybridizing the model with static_alloc gave another 1~2% throughput
improvement (see the sketch below)
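A small sketch of the hybridize API on a toy Gluon network (the layers are illustrative, not the Mask R-CNN model):

```python
# Build and debug imperatively, then compile to a symbolic graph.
import mxnet as mx
from mxnet.gluon import nn

net = nn.HybridSequential()
net.add(nn.Dense(256, activation='relu'),
        nn.Dense(10))
net.initialize()

# static_alloc reuses memory buffers across iterations, giving the extra
# 1~2% throughput noted above.
net.hybridize(static_alloc=True)

out = net(mx.nd.random.uniform(shape=(4, 128)))  # first call builds the graph
```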
Performance Tuning in an AWS cluster
• Binding each GPU to 12 vCPUs (6 from each CPU socket) on the Amazon
P3dn.24xlarge EC2 instance gave us an 8% improvement in throughput
• Autotuning Horovod hyperparameters such as the tensor fusion threshold,
cycle time, cache capacity, and hierarchical allreduce: +9% throughput
• Increasing the number of data workers from 4 to 8 also helped accelerate data
loading. Note, however, that more data workers do not necessarily mean better
performance, due to the overhead of context switching.
• Accelerated the dataloader through Cython
• Distributed validation significantly reduced validation compute time:
13 secs/epoch across 24 P3dn instances vs. several minutes for
non-distributed validation (a configuration sketch follows)
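A hedged sketch of two of these knobs, Horovod autotuning and CPU affinity for data loading; the contiguous vCPU pinning shown is a simplification of the 6-per-socket layout used in the talk, and psutil is an assumed helper:

```python
import os

# Must be set before hvd.init(); enables autotuning of the fusion threshold,
# cycle time, cache capacity, etc.
os.environ.setdefault('HOROVOD_AUTOTUNE', '1')

import horovod.mxnet as hvd
import psutil  # assumed available, used here for process CPU affinity

hvd.init()

VCPUS_PER_GPU = 12  # 96 vCPUs / 8 GPUs on p3dn.24xlarge
first = hvd.local_rank() * VCPUS_PER_GPU
psutil.Process().cpu_affinity(list(range(first, first + VCPUS_PER_GPU)))

NUM_DATA_WORKERS = 8  # raised from 4 in the talk; more is not always better
```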
Case Study: BERT
Deep learning nowadays - BERT
(Chart: General Language Understanding Evaluation (GLUE) Benchmark scores,
y-axis from 55 to 95, with BERT approaching the human baseline.)
Transfer learning with BERT for NLP
• Pre-training for NLP
• learn text representations on a large-scale corpus
• Fine-tuning for downstream tasks
• Named Entity Recognition
• Question Answering
• Search
• Chatbot
• Text Summarization
• Text Classification
• Models available in GluonNLP toolkit
(Diagram: BERT as a feature extractor, e.g. classifying "GTC is awesome!" as
positive; the pre-train/fine-tune paradigm spans both NLP and CV.)
Image credit to: d2l.ai
GluonNLP: a deep learning natural language toolkit
• Open source, available on SageMaker and in deep learning containers
• State-of-the-art NLP models
• Easy prototyping
• Fast deployment
• Multiple built-in NLP tasks
BERT model architecture
(Diagram: BERT as a stack of N multi-head attention blocks (Vaswani et al., '17).)
Image credit to: d2l.ai
Pre-training objectives
1. Masked language modeling
• Estimate the probability of the masked tokens given the surrounding context
• Randomly mask 15% of all tokens and predict them (a masking sketch follows
the examples below)
2. Next sentence prediction
• 50% of the time, replace the next sentence with a random one
• Learn logical coherence
MLM example: "I went to the bank to deposit some money." →
"I went to the <mask> to deposit some money."
NSP examples: "<CLS> Haibin is obnoxious <SEP> I don’t like his shirt" (coherent)
vs. "<CLS> Haibin is obnoxious <SEP> Hello world!" (random)
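A sketch of BERT-style masking for one token sequence; the slide states only the 15% selection rate, while the 80/10/10 split below is the standard BERT recipe, and vocab is a stand-in token list:

```python
import random

def mask_tokens(tokens, vocab, mask_token='<mask>', mask_prob=0.15):
    masked, targets = list(tokens), []
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            targets.append((i, tok))              # predict the original token
            r = random.random()
            if r < 0.8:
                masked[i] = mask_token            # 80%: replace with <mask>
            elif r < 0.9:
                masked[i] = random.choice(vocab)  # 10%: random token
            # remaining 10%: keep the original token unchanged
    return masked, targets

sentence = 'I went to the bank to deposit some money .'.split()
masked, targets = mask_tokens(sentence, vocab=sentence)
```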
Data loading
• Mini-batches are generated on the fly for dynamic masking [1]
• Multi-process DatasetLoader with pre-fetching in the background (see the
loader sketch below)
• AWS FSx for Lustre: file system for compute-intensive workloads
• Profiling result visualization
(Timeline: a data-loading gap between the previous batch and the current batch.)
Image credit to: d2l.ai
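A minimal sketch of multi-process background loading with Gluon's DataLoader; the random dataset is a stand-in for the pre-training shards:

```python
import mxnet as mx

dataset = mx.gluon.data.ArrayDataset(mx.nd.random.uniform(shape=(1024, 128)))
loader = mx.gluon.data.DataLoader(dataset,
                                  batch_size=32,
                                  shuffle=True,
                                  num_workers=8)  # worker processes prefetch
                                                  # batches in the background
for batch in loader:
    pass  # consume prefetched mini-batches while the GPU computes
```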
Fast Multi-head Self-Attention (credit to: Clement Fuji Tsang)

Baseline:
  For each layer:
    Separate projections: Qproj = Q·Wq, Kproj = Q·Wk, Vproj = Q·Wv
    Transpose Qproj, Kproj, Vproj: from (N, T, H, C) to (N, H, T, C)
    Compute attention:
      score = batch_gemm(Qproj, Kproj)
      result = batch_gemm(score, Vproj)
    Transpose result: from (N, H, T, C) to (N, T, H, C)

Optimized (higher cache utilization; 1.58x faster end to end):
  Transpose Q once: from (N, T, HC) to (T, N, HC)
  For each layer:
    Joint projection: Wqkv = concat(Wq, Wk, Wv); Q_K_Vproj = Q·Wqkv
    Compute attention:
      score = strided_batch_gemm(Qproj, Kproj)
      result = strided_batch_gemm(score, Vproj)
  Transpose final result: from (T, N, HC) to (N, T, HC)
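A NumPy sketch of the joint-projection idea: one GEMM against the concatenated weight matrix reproduces the three separate projections (shapes follow the slide: T = sequence length, N = batch, HC = hidden size; the strided_batch_gemm kernel itself is not shown):

```python
import numpy as np

T, N, HC = 32, 4, 768
Q = np.random.randn(T, N, HC).astype(np.float32)
Wq, Wk, Wv = (np.random.randn(HC, HC).astype(np.float32) for _ in range(3))

Wqkv = np.concatenate([Wq, Wk, Wv], axis=1)  # (HC, 3*HC) joint weight
qkv = Q @ Wqkv                               # single GEMM: (T, N, 3*HC)
q, k, v = np.split(qkv, 3, axis=-1)          # strided views into one buffer

assert np.allclose(q, Q @ Wq, atol=1e-3)     # matches the separate projection
```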
GPU memory is precious
- For each mini-batch, the gradient is synchronized across GPUs
- Gradient allreduce can overlap with backward computation
- A larger batch size leaves more time to hide communication latency
- A 1-bit dropout mask yields a 20% memory reduction, enabling larger batch sizes
Image credit to: d2l.ai
(Timeline: Allreduce1–3 run concurrently with subsequent backward/forward
passes — we can overlap computation and communication.)
NCCL + Elastic Fabric Adapter
(Diagram: the traditional HPC software stack in EC2 routes the MPI
implementation through the kernel-space TCP/IP stack and ENA network driver;
with EFA, MPI talks to Libfabric and the EFA kernel driver, keeping the data
path in user space over the same ENA device.)
- Elastic Fabric Adapter (EFA)
- for HPC and distributed ML
- bypasses the OS kernel
- integrated with MPI, NCCL
- BERT training
- 32 p3dn.24xlarge instances
- 256 V100 GPUs
- 100 Gb/s networking
- BERT-large with GluonNLP
- batch size 64K, phase 1
- 90% strong scaling efficiency with
EFA enabled
Distributed Stochastic Optimization (credit to: Shuai Zheng)
SGD update: $x_{t+1} = x_t - \eta_t g_t$
Normalized-gradient update: $x_{t+1} = x_t - \eta_t \frac{g_t}{\lVert g_t \rVert_2}$
| Framework  | Batch size | #XPUs    | #steps    | Optimizer | F1 score | Training time |
|------------|------------|----------|-----------|-----------|----------|---------------|
| TensorFlow | 64K/32K    | 1K TPUs  | 8599      | LAMB [2]  | 90.58%   | 76.19m        |
| MXNet      | 32K/32K    | 512 GPUs | 7038/1563 | LAMB + NG | 90.60%   | 141.5m        |
References
[1] Liu, Yinhan, et al. "RoBERTa: A Robustly Optimized BERT Pretraining
Approach." arXiv preprint arXiv:1907.11692 (2019).
[2] You, Yang, et al. "Large Batch Optimization for Deep Learning: Training
BERT in 76 Minutes." International Conference on Learning Representations, 2019.
Thank you
Haibin Lin
haibilin@amazon.com
Lin Yuan
lnyuan@amazon.com