Implementing AI: High Performance Architectures: A Universal Accelerated Computing Platform

A UNIVERSAL ACCELERATED
COMPUTING PLATFORM

25 YEARS OF SCIENTIFIC COMPUTING ACCELERATION
X-FACTOR SPEEDUP FULL STACK ONE ARCHITECTURE SOFTWARE DEFINED
EXTREME SCALE
25 YEARS OF COMPUTING ACCELERATION
DEVELOPMENT

THE NEW COMPUTING
[Diagram labels: EDGE APPLIANCE, SUPERCOMPUTER | AI, Edge, Streaming, Simulation, Visualization, Data Analytics, Cloud | EXTREME IO, NETWORK]

A100 AVAILABLE VIA NVIDIA HGX A100 AND A100 PCIE
• HGX A100 8-GPU: Scale-up, fastest time-to-solution for AI. 8 GPUs, full NVLink bandwidth between all GPUs with NVSwitch.
• HGX A100 4-GPU: Scale-up, mixed AI & HPC. 4 A100s, fully connected with shared NVLinks.
• A100 PCIe: For mainstream servers. 1-8 GPUs per server, optional NVLink Bridge between 2 GPUs.

5 MIRACLES OF A100
NVIDIA Ampere Architecture
World’s Largest 7nm chip
54B XTORS, HBM2
3rd Gen NVLINK and NVSWITCH
Efficient Scaling to Enable Super GPU
2X More Bandwidth
3rd Gen Tensor Cores
Faster, Flexible, Easier to use
20x AI Perf with TF32
2.5x HPC Perf
New Sparsity Acceleration
Harness Sparsity in AI Models
2x AI Performance
New Multi-Instance GPU
Optimal utilization with right sized GPU
7x Simultaneous Instances per GPU

INTRODUCING DGX A100
The Universal AI System – Data Analytics, Training and Inference
• 8x NVIDIA A100 GPUs with 320GB Total GPU Memory: 12 NVLinks/GPU, 600GB/sec GPU-to-GPU Bi-directional Bandwidth
• 6x NVIDIA NVSwitches: 4.8TB/sec Bi-directional Bandwidth, 2X More than Previous Generation NVSwitch
• 9x Mellanox ConnectX-6 200Gb/s Network Interface: 450GB/sec Peak Bi-directional Bandwidth
• Dual 64-core AMD Rome CPUs and 1TB RAM: 3.2X More Cores to Power the Most Intensive AI Jobs
• 15TB Gen4 NVME SSD: 25GB/sec Peak Bandwidth, 2X Faster than Gen3 NVME SSDs
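The DGX A100 bandwidth and memory figures are mutually consistent, which a short calculation can confirm. The sketch below uses only numbers quoted on the slide. It assumes decimal units (1 TB = 1000 GB, 8 bits per byte) and assumes the 4.8TB/sec figure is the sum of per-GPU NVLink bandwidth, which the slide does not state explicitly.

```python
# Figures quoted on the DGX A100 slide.
NUM_GPUS = 8
TOTAL_GPU_MEMORY_GB = 320
NVLINKS_PER_GPU = 12
GPU_BIDIR_GB_PER_S = 600      # GPU-to-GPU bi-directional bandwidth
NUM_NICS = 9
NIC_GBIT_PER_S = 200          # per ConnectX-6 interface, one direction

# Memory per GPU implied by the total.
memory_per_gpu_gb = TOTAL_GPU_MEMORY_GB / NUM_GPUS

# Bi-directional bandwidth per NVLink implied by 12 links per GPU.
per_link_gb_per_s = GPU_BIDIR_GB_PER_S / NVLINKS_PER_GPU

# ASSUMPTION: the system-level figure is the sum over all 8 GPUs.
aggregate_tb_per_s = NUM_GPUS * GPU_BIDIR_GB_PER_S / 1000

# Network: convert Gb/s to GB/s, then count both directions.
nic_bidir_gb_per_s = NUM_NICS * NIC_GBIT_PER_S / 8 * 2

print(memory_per_gpu_gb, per_link_gb_per_s, aggregate_tb_per_s, nic_bidir_gb_per_s)
```

Under these assumptions the results are 40 GB per GPU, 50 GB/sec per NVLink, 4.8 TB/sec aggregate, and 450 GB/sec for the network, matching the slide.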

UNIFIED AI ACCELERATION
BERT-LARGE TRAINING (Sequences/s), V100 vs A100, FP32 and FP16
[Chart: bar values 216, 822, 1260, 2274 sequences/s. Callouts: 6X out-of-the-box speedup with TF32; 3X speedup with AMP (FP16).]
BERT-LARGE INFERENCE (Sequences/s)
[Chart: T4 0.6x | V100 1x | 1 MIG (1/7 A100) 1x | 7 MIG (1 A100) 7x]
BERT Pre-Training Throughput using PyTorch including (2/3) Phase 1 and (1/3) Phase 2 | Phase 1 Seq Len = 128, Phase 2 Seq Len = 512 | V100: DGX-1 Server with 8xV100 using FP32 and FP16 precision | A100: DGX A100 Server with 8xA100 using TF32 precision and FP16
BERT Large Inference | T4: TRT 7.1, Precision = INT8, Batch Size = 256 | V100: TRT 7.1, Precision = FP16, Batch Size = 256 | A100 with 7 MIG instances of 1g.5gb: Pre-production TRT, Batch Size = 94, Precision = INT8 with Sparsity
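The two training callouts (6X with TF32, 3X with AMP) can be checked against the four bar values. The sketch below assumes a pairing of values to bars that the extracted slide text does not state: 216 and 822 as V100 FP32 and FP16, 1260 and 2274 as A100 TF32 and FP16. That pairing is an inference, chosen because it reproduces the callouts.

```python
# Bar values (sequences/s) from the BERT-Large training chart.
# ASSUMPTION: which value belongs to which bar is inferred, not given in the text.
V100_FP32 = 216
V100_FP16 = 822
A100_TF32 = 1260
A100_FP16 = 2274

def speedup(new: float, baseline: float) -> float:
    """Throughput ratio of the new system over the baseline."""
    return new / baseline

# TF32 on A100 versus FP32 on V100 (no code changes: "out-of-the-box").
tf32_speedup = speedup(A100_TF32, V100_FP32)

# Mixed precision (AMP, FP16) on A100 versus the same on V100.
amp_speedup = speedup(A100_FP16, V100_FP16)

print(f"TF32 vs FP32: {tf32_speedup:.2f}x, AMP FP16: {amp_speedup:.2f}x")
```

With this pairing the ratios come out near 5.8x and 2.8x, which round to the 6X and 3X shown on the slide.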

NVIDIA SHATTERS BIG DATA ANALYTICS BENCHMARK
19.5X Faster TPCx-BB Performance Results on DGX A100 with RAPIDS
• 350 CPU Servers: $23M | 22 Racks | 300 kW (16 Servers / Rack)
• 16 NVIDIA DGX A100 Systems: $3.3M | 4 Racks | 100 kW
Equivalent Performance | 1/7th Cost | 1/3rd Power
Performance: CPU = 4.7 hr, DGX A100 = 14.5 min (19.5x faster); After normalizing performance across CPU and GPU clusters -> Cost: CPU = $23M, DGX A100 = $3.3M (1/7th the cost); Power: CPU = 298kW, DGX A100 = 104kW (1/3rd the power); Space: CPU = 22 racks, DGX A100 = 4 racks (less than 1/5th the space)
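The headline ratios follow from the footnote figures. A minimal sketch of that arithmetic, using only numbers from the slide; because the inputs are themselves rounded, the results land close to the quoted ratios rather than exactly on them.

```python
import math

# Footnote figures from the TPCx-BB slide.
CPU_RUNTIME_HOURS = 4.7
DGX_RUNTIME_MINUTES = 14.5
CPU_COST_MUSD, DGX_COST_MUSD = 23.0, 3.3
CPU_POWER_KW, DGX_POWER_KW = 298.0, 104.0
CPU_RACKS, DGX_RACKS = 22, 4
CPU_SERVERS, SERVERS_PER_RACK = 350, 16

# Runtime speedup, after converting hours to minutes.
speedup = CPU_RUNTIME_HOURS * 60 / DGX_RUNTIME_MINUTES

# How many times cheaper and lower-power the DGX cluster is.
cost_ratio = CPU_COST_MUSD / DGX_COST_MUSD
power_ratio = CPU_POWER_KW / DGX_POWER_KW

# Fraction of the rack space the DGX cluster needs.
space_fraction = DGX_RACKS / CPU_RACKS

# Racks needed for the CPU cluster at 16 servers per rack.
cpu_racks_needed = math.ceil(CPU_SERVERS / SERVERS_PER_RACK)

print(f"{speedup:.1f}x faster, 1/{cost_ratio:.1f} cost, "
      f"1/{power_ratio:.1f} power, {space_fraction:.2f} of the space")
```

The computed values are roughly 19.4x faster, 1/7.0 of the cost, 1/2.9 of the power, and 0.18 of the space, consistent with the rounded claims on the slide. The rack count of 22 also matches 350 servers at 16 per rack.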

GPU-ACCELERATED APACHE SPARK 3.0
RAPIDS Accelerator for Apache Spark
[Diagram, Spark 2.x: Data Sources -> Data Preparation (Spark) on a CPU Powered Cluster -> Shared Storage -> Model Training (XGBoost | TensorFlow | PyTorch) on a GPU Powered Cluster; label: Spark Orchestrated]
[Diagram, Spark 3.0: Data Sources -> Data Preparation (Spark) -> Model Training (XGBoost | TensorFlow | PyTorch) on one GPU Powered Cluster; label: Spark Orchestrated]
Spark 3.0 enables:
• A single pipeline, from ingest to data preparation to model training
• GPU-accelerated data preparation
• Consolidation and simplification of infrastructure
Built on Foundations of RAPIDS
Now Available on Leading Cloud Analytics Platforms
Learn More @ nvidia.com/spark-book
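As an illustration of what enabling GPU-accelerated data preparation can look like, the sketch below assembles a spark-submit command line with the RAPIDS Accelerator plugin turned on. The configuration keys are standard Spark 3.0 and RAPIDS Accelerator settings. The jar path, the application script name, and the resource amounts are hypothetical placeholders, and a real deployment needs further settings (for example a GPU discovery script) that are omitted here.

```python
# Sketch only: builds the argument list, does not launch Spark.
# ASSUMPTION: jar location and job script are hypothetical placeholders.
RAPIDS_JAR = "/opt/sparkRapidsPlugin/rapids-4-spark.jar"

def rapids_conf(gpus_per_executor: int = 1, tasks_per_gpu: int = 4) -> dict:
    """Spark settings that load the RAPIDS SQL plugin and schedule GPUs."""
    return {
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        "spark.rapids.sql.enabled": "true",
        "spark.executor.resource.gpu.amount": str(gpus_per_executor),
        # Each task claims a fraction of a GPU so several tasks can share it.
        "spark.task.resource.gpu.amount": str(1 / tasks_per_gpu),
    }

def spark_submit_args(app: str, conf: dict) -> list:
    """Turn a settings dict into a spark-submit argument list."""
    args = ["spark-submit", "--jars", RAPIDS_JAR]
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args

args = spark_submit_args("etl_job.py", rapids_conf())  # hypothetical script
print(" ".join(args))
```

Keeping the settings in a dict makes it easy to compare a CPU-only run with a GPU run of the same job by toggling `spark.rapids.sql.enabled`.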

UP TO 2X MORE HPC PERFORMANCE
[Chart: A100 speedup over V100 (V100 = 1.0x), grouped as Molecular Dynamics, Physics, Geo Science, Physics]
NAMD 1.5X | GROMACS 1.5X | AMBER 1.6X | LAMMPS 1.9X | FUN3D 1.7X | SPECFEM3D 1.8X | RTM 1.9X | BerkeleyGW 2.0X | Chroma 2.1X
All results are measured. Except BerkeleyGW, V100 used is single V100 SXM2; A100 used is single A100 SXM4.
More apps detail: AMBER based on PME-Cellulose, GROMACS with STMV (h-bond), LAMMPS with Atomic Fluid LJ-2.5, NAMD with v3.0a1 STMV_NVE, Chroma with szscl21_24_128, FUN3D with dpw, RTM with Isotropic Radius 4 1024^3, SPECFEM3D with Cartesian four material model.
BerkeleyGW based on Chi Sum and uses 8xV100 in DGX-1 vs 8xA100 in DGX A100.
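To put the "up to 2X" headline in context, the nine speedups can be summarized with a few statistics. This is a rough sketch: the values are the bar labels from the chart, and the geometric mean is used only as one common way to average ratios, not as a method the slide itself applies.

```python
from statistics import geometric_mean, mean

# A100-over-V100 speedups as labeled on the chart (nine applications).
SPEEDUPS = [1.5, 1.5, 1.6, 1.9, 1.7, 1.8, 1.9, 2.0, 2.1]

best = max(SPEEDUPS)
worst = min(SPEEDUPS)
arithmetic = mean(SPEEDUPS)
geometric = geometric_mean(SPEEDUPS)   # conventional average for ratios
at_least_2x = sum(1 for s in SPEEDUPS if s >= 2.0)

print(f"range {worst}x to {best}x, geometric mean {geometric:.2f}x, "
      f"{at_least_2x} of {len(SPEEDUPS)} apps at 2x or more")
```

The range is 1.5x to 2.1x with an average a little under 1.8x, so "up to 2X" describes the best cases rather than the typical one.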

MLPERF: DGX SUPERPOD SETS ALL 8 AT SCALE AI RECORDS
Under 18 Minutes To Train Each MLPerf Benchmark
[Chart: Time to Train in minutes (lower is better), Commercially Available Solutions. Series: NVIDIA A100, NVIDIA V100, Google TPUv3, Huawei Ascend. X = No result submitted.]
NVIDIA A100 results:
• Reinforcement Learning, MiniGo: 17.1 (1792 A100)
• Object Detection (Heavy Weight), Mask R-CNN: 10.5 (256 A100)
• Recommendation, DLRM: 3.3 (8 A100)
• NLP, BERT: 0.8 (2048 A100)
• Object Detection (Light Weight), SSD: 0.8 (1024 A100)
• Image Classification, ResNet-50 v.1.5: 0.8 (1840 A100)
• Translation (Recurrent), GNMT: 0.7 (1024 A100)
• Translation (Non-recurrent), Transformer: 0.6 (480 A100)
Google TPUv3 results shown: 28.7 (16 TPUv3), 56.7 (16 TPUv3)
MLPerf 0.7 Performance comparison at Max Scale. Max scale used for NVIDIA A100, NVIDIA V100, TPUv3 and Huawei Ascend for all applicable benchmarks. | MLPerf ID at Scale: Transformer: 0.7-30, 0.7-52, GNMT: 0.7-34, 0.7-54, ResNet-50 v1.5: 0.7-37, 0.7-55, 0.7-1, 0.7-3, SSD: 0.7-33, 0.7-53, BERT: 0.7-38, 0.7-56, 0.7-1, DLRM: 0.7-17, 0.7-43, Mask R-CNN: 0.7-28, 0.7-48, MiniGo: 0.7-36, 0.7-51 | MLPerf name and logo are trademarks. See www.mlperf.org for more information.
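The subtitle's claim that every benchmark trains in under 18 minutes can be checked directly against the A100 figures. The sketch below assumes each value pairs with its benchmark in the order the slide lists them, which the extracted text implies but does not state.

```python
# A100 max-scale results: benchmark -> (minutes to train, number of A100 GPUs).
# ASSUMPTION: values are paired with benchmarks in the order listed on the slide.
RESULTS = {
    "MiniGo": (17.1, 1792),
    "Mask R-CNN": (10.5, 256),
    "DLRM": (3.3, 8),
    "BERT": (0.8, 2048),
    "SSD": (0.8, 1024),
    "ResNet-50 v1.5": (0.8, 1840),
    "GNMT": (0.7, 1024),
    "Transformer": (0.6, 480),
}

# The slowest benchmark sets the bound for the "under 18 minutes" claim.
slowest = max(RESULTS, key=lambda name: RESULTS[name][0])
all_under_18 = all(minutes < 18 for minutes, _ in RESULTS.values())

# Benchmarks that finish in less than one minute at max scale.
sub_minute = sorted(name for name, (minutes, _) in RESULTS.items() if minutes < 1)

print(f"slowest: {slowest} at {RESULTS[slowest][0]} min; "
      f"{len(sub_minute)} of {len(RESULTS)} finish in under a minute")
```

MiniGo is the slowest at 17.1 minutes, so the 18-minute bound holds for all eight, and five of the eight finish in under a minute.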

MLPERF: ALL 8 PER CHIP AI PERFORMANCE RECORDS
[Chart: Relative Speedup over V100 per chip, Commercially Available Solutions. Series: Huawei Ascend, TPUv3, V100, A100. X = No result submitted.]
Benchmarks: Image Classification ResNet-50 v.1.5 | NLP BERT | Object Detection (Heavy Weight) Mask R-CNN | Reinforcement Learning MiniGo | Object Detection (Light Weight) SSD | Translation (Recurrent) GNMT | Translation (Non-recurrent) Transformer | Recommendation DLRM
V100 (baseline): 1.0X on all 8 benchmarks
A100: 1.5X, 1.6X, 1.9X, 2.0X, 2.0X, 2.4X, 2.4X, 2.5X
Huawei Ascend and TPUv3 (3 submitted results): 0.7X, 1.2X, 0.9X; the remaining 13 entries are marked X
Per Chip Performance arrived at by comparing performance at same scale when possible and normalizing it to a single chip. 8 chip scale: V100, A100 Mask R-CNN, MiniGo, SSD, GNMT, Transformer. 16 chip scale: V100, A100, TPUv3 for ResNet-50 v1.5 and BERT. 512 chip scale: Huawei Ascend 910 for ResNet-50. DLRM compared 8 A100 and 16 V100. Submission IDs: ResNet-50 v1.5: 0.7-3, 0.7-1, 0.7-44, 0.7-18, 0.7-21, 0.7-15, BERT: 0.7-1, 0.7-45, 0.7-22, Mask R-CNN: 0.7-40, 0.7-19, MiniGo: 0.7-41, 0.7-20, SSD: 0.7-40, 0.7-19, GNMT: 0.7-40, 0.7-19, Transformer: 0.7-40, 0.7-19, DLRM: 0.7-43, 0.7-17 | MLPerf name and logo are trademarks. See www.mlperf.org for more information.

SELENE
DGX SuperPOD Deployment
• #7 on TOP500 (27.6 PetaFLOPS HPL)
• #2 on Green500 (20.5 GigaFLOPS/watt)
• Fastest Industrial System in U.S.: 1+ ExaFLOPS AI
Built with NVIDIA DGX SuperPOD Arch in 3 Weeks
• NVIDIA DGX A100 and NVIDIA Mellanox IB
• NVIDIA’s decade of AI experience
Configuration:
• 2,240 NVIDIA A100 Tensor Core GPUs
• 280 NVIDIA DGX A100 systems
• 494 Mellanox 200G HDR IB switches
• 7 PB of all-flash storage
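The Selene configuration and benchmark figures can be cross-checked with simple arithmetic. In the sketch below, the GPU count follows from 8 GPUs per DGX A100. The power figure is only an estimate implied by dividing the HPL result by the Green500 efficiency, not a number given on the slide.

```python
# Figures from the Selene slide.
DGX_SYSTEMS = 280
GPUS_PER_DGX = 8            # each DGX A100 holds 8 A100 GPUs
HPL_PFLOPS = 27.6           # TOP500 result
GFLOPS_PER_WATT = 20.5      # Green500 result

# GPU count implied by the number of systems.
total_gpus = DGX_SYSTEMS * GPUS_PER_DGX

# ESTIMATE: power during the HPL run implied by performance / efficiency.
implied_power_mw = (HPL_PFLOPS * 1e15) / (GFLOPS_PER_WATT * 1e9) / 1e6

# Average delivered HPL performance per GPU, in TFLOPS.
hpl_tflops_per_gpu = HPL_PFLOPS * 1000 / total_gpus

print(f"{total_gpus} GPUs, ~{implied_power_mw:.2f} MW implied, "
      f"~{hpl_tflops_per_gpu:.1f} HPL TFLOPS per GPU")
```

The product 280 x 8 gives the 2,240 GPUs listed, and the two benchmark figures together imply roughly 1.35 MW during the HPL run, or about 12.3 TFLOPS of delivered HPL performance per GPU.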

ACCELERATED COMPUTING FIGHTS COVID-19
Data Analytics | Simulation & Visualization | AI | Edge
• Oxford Nanopore: Sequence Viral Genome in 7 Hrs
• Plotly, NVIDIA: Real-Time Infection Rate Analysis
• ORNL, Scripps: Screen 2B Drug Compounds in 1 Day vs 1 Year
• Structura, NIH, UT Austin: CryoSPARC, 1st 3D Structure of Virus Spike Protein
• NIH, NVIDIA: AI COVID-19 Classification
• Kiwibot: Robot Medical Supply Delivery
• Whiteboard Coordinator: AI Elevated Body Temp Screening System
