SuperPod reference architecture
Key figures (architecture diagram):
• GPU:NIC = 8:8 (one HCA per GPU)
• NVLink: 2 × 6 = 12 links per GPU (two links to each of the 6 NVSwitches)
• 400 GB/s NVLink bandwidth per GPU (A800)
• Scalable Unit (SU): 1 SU = 20 nodes
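These figures fix the cluster-sizing arithmetic. A minimal sketch of that arithmetic, using the constants from the slide (the helper name su_totals is ours, not from any framework):

```python
# Cluster-sizing arithmetic from the SuperPod figures above.
GPUS_PER_NODE = 8     # GPU:NIC = 8:8, so one HCA per GPU
NICS_PER_NODE = 8
NODES_PER_SU = 20     # 1 SU = 20 nodes

def su_totals(num_sus: int) -> dict:
    """Node, GPU, and NIC counts for a cluster of num_sus Scalable Units."""
    nodes = num_sus * NODES_PER_SU
    return {"nodes": nodes,
            "gpus": nodes * GPUS_PER_NODE,
            "nics": nodes * NICS_PER_NODE}

print(su_totals(1))  # {'nodes': 20, 'gpus': 160, 'nics': 160}
```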
Computing Network Architecture
With HDR InfiniBand (200 Gb/s per HCA)
(See the SuperPod whitepaper.)
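As a quick check on these figures, the per-node injection bandwidth with 8 HDR HCAs works out to 200 GB/s:

```python
# Per-node injection bandwidth (figures from the slides above).
HCAS_PER_NODE = 8
HDR_GBITS_PER_HCA = 200                       # HDR InfiniBand line rate

gbits = HCAS_PER_NODE * HDR_GBITS_PER_HCA     # 1600 Gb/s per node
print(f"{gbits} Gb/s = {gbits / 8:.0f} GB/s injection bandwidth per node")
```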
MLPERF TRAINING 0.7
NVIDIA Selene with DGX A100 (40 GB)
Tested in Q3 2020.
• Pure data parallelism.
• Up to 43% performance gain for MLPerf BERT training with 8 HCAs vs. 1 HCA, at 128-node scale (HCA selection sketched below).
• Up to 30% performance gain for MLPerf ResNet-50 (RN50) training with 8 HCAs vs. 1 HCA, at 230-node scale.
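The 8-HCA vs. 1-HCA comparison amounts to restricting which NICs NCCL may use. A hedged sketch of how such an A/B test is commonly set up (NCCL_IB_HCA is NCCL's standard environment variable for selecting InfiniBand devices; the mlx5_* device names are assumptions for a typical 8-HCA node, check `ibstat` for the real names):

```python
import os
import torch.distributed as dist

# A/B test sketch: expose 1 HCA vs. all 8 HCAs to NCCL. The env var must
# be set before the process group (and thus the NCCL communicator) is
# created. Assumes launch via torchrun, which supplies MASTER_ADDR,
# RANK, and WORLD_SIZE.
USE_ALL_HCAS = True

if USE_ALL_HCAS:
    os.environ["NCCL_IB_HCA"] = ",".join(f"mlx5_{i}" for i in range(8))
else:
    os.environ["NCCL_IB_HCA"] = "mlx5_0"   # single-HCA baseline

dist.init_process_group(backend="nccl")
```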
OPTIMIZED IMPLEMENTATION
7.5B Model, 32 Nodes
Training a 7.5B-parameter model with Megatron on 32 nodes:
• 32 nodes: TPS=4, PPS=1, DPS=64 (tensor-, pipeline-, and data-parallel sizes; 4 × 1 × 64 = 256 GPUs = 32 nodes × 8 GPUs, checked in the sketch below)
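The layout is internally consistent: once the tensor- and pipeline-parallel sizes are fixed, the data-parallel size falls out of the world size. A quick arithmetic check (variable names are ours, not Megatron's):

```python
# Parallelism layout check for the 7.5B-model / 32-node configuration.
GPUS_PER_NODE = 8
nodes = 32
world_size = nodes * GPUS_PER_NODE      # 256 GPUs total

tp, pp = 4, 1                           # TPS=4, PPS=1 from the slide
assert world_size % (tp * pp) == 0
dp = world_size // (tp * pp)            # data-parallel size
assert dp == 64                         # matches DPS=64 on the slide
```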
[Performance chart omitted; axis values not recoverable. Annotations:]
• Some GPUs have to cross the SMP (CPU-to-CPU) interconnect to reach an HCA, which degrades GPUDirect RDMA (GDR) performance.
• The speedup comes mainly from improved AllReduce of gradients within the data-parallel group (minimal sketch below).
• global_batch_size=512
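For reference, the collective being optimized is the gradient AllReduce over the data-parallel group. A minimal PyTorch sketch of that operation (not Megatron's actual implementation, which buckets gradients and overlaps communication with the backward pass):

```python
import torch
import torch.distributed as dist

def allreduce_gradients(model: torch.nn.Module, dp_group=None) -> None:
    """Average gradients across all ranks in the data-parallel group."""
    dp_world_size = dist.get_world_size(group=dp_group)
    for param in model.parameters():
        if param.grad is not None:
            # Sum gradients across data-parallel ranks, then normalize.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, group=dp_group)
            param.grad /= dp_world_size
```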