SuperPod reference architecture


System within one node
• GPU:NIC = 8:8
• IB NIC: 200 Gb/s each
• PCIe Gen4: 64 GB/s
• NVLink: 2 x 6 = 12 links per GPU, 400 GB/s
• 6 NVSwitches

Scalable Unit (SU)
• 1 SU = 20 nodes
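
The per-node figures above compose into simple aggregates. The sketch below is a back-of-the-envelope calculation using only the values on this slide; the constant names are illustrative and not taken from any NVIDIA tooling.

# Back-of-the-envelope aggregates from the per-node figures above
# (8 GPUs, 8 IB NICs at 200 Gb/s, 400 GB/s NVLink per GPU, 20 nodes per SU).
# Constant names are illustrative only.

GPUS_PER_NODE = 8
NICS_PER_NODE = 8               # GPU:NIC = 8:8, one HCA per GPU
IB_NIC_GBITS = 200              # HDR InfiniBand, Gb/s per NIC
PCIE_GEN4_GBYTES = 64           # GB/s per x16 link
NVLINK_GBYTES_PER_GPU = 400     # 2 x 6 = 12 links through 6 NVSwitches
NODES_PER_SU = 20               # Scalable Unit

node_ib_gbits = NICS_PER_NODE * IB_NIC_GBITS                 # 1600 Gb/s leaving the node
node_ib_gbytes = node_ib_gbits / 8                           # ~200 GB/s
node_nvlink_gbytes = GPUS_PER_NODE * NVLINK_GBYTES_PER_GPU   # 3200 GB/s inside the node

print(f"Per-node IB:     {node_ib_gbits} Gb/s (~{node_ib_gbytes:.0f} GB/s)")
print(f"Per-node NVLink: {node_nvlink_gbytes} GB/s")
print(f"GPUs per SU:     {NODES_PER_SU * GPUS_PER_NODE}")

The gap between intra-node NVLink and inter-node InfiniBand bandwidth is what motivates keeping the most communication-heavy parallelism inside a node, as the Megatron results later in this deck illustrate.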

A800 Computing Network Architecture with HDR IB (200 Gb/s)

[Figure: compute network topology for the 140-node and 80-node configurations]

For more info, please refer to the SuperPOD reference architecture whitepaper.
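
Both example deployments are whole multiples of the 20-node SU; a quick check, using only the figures on these slides:

# Both example deployments are whole multiples of the 20-node Scalable Unit.
NODES_PER_SU = 20
GPUS_PER_NODE = 8

for nodes in (140, 80):
    sus, remainder = divmod(nodes, NODES_PER_SU)
    assert remainder == 0, "expected a whole number of SUs"
    print(f"{nodes} nodes = {sus} SUs = {nodes * GPUS_PER_NODE} GPUs")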
MLPERF TRAINING 0.7
NVIDIA Selene with DGX A100 (40 GB)
Tested in 2020 Q3.
• Pure data parallelism.
• Up to 43% perf gain for MLPerf BERT training with 8 HCAs vs. 1 HCA, at 128-node scale.
• Up to 30% perf gain for MLPerf RN50 (ResNet-50) training with 8 HCAs vs. 1 HCA, at 230-node scale.
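
The gains above come from giving each GPU its own path to the fabric. With NCCL, the set of InfiniBand adapters to use is typically selected through the NCCL_IB_HCA environment variable; the sketch below is a minimal illustration of that, where the mlx5_* device names and the torchrun-style LOCAL_RANK variable are assumptions about the launch environment, not details taken from the MLPerf runs.

# Minimal sketch: expose all eight HCAs to NCCL before initializing the
# process group. The mlx5_* names are placeholders (check ibstat for the
# real adapter names); this is illustrative, not the exact configuration
# used for the MLPerf runs on Selene.
import os

import torch
import torch.distributed as dist

# Comma-separated list of IB adapters NCCL may use (one per GPU here).
os.environ.setdefault("NCCL_IB_HCA", ",".join(f"mlx5_{i}" for i in range(8)))

# Allow GPUDirect RDMA as long as GPU and NIC share at most a host bridge.
os.environ.setdefault("NCCL_NET_GDR_LEVEL", "PHB")

def init_distributed() -> None:
    # env:// rendezvous assumes RANK, WORLD_SIZE, MASTER_ADDR/PORT are set
    # by the launcher (e.g. torchrun), which also sets LOCAL_RANK.
    dist.init_process_group(backend="nccl", init_method="env://")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

if __name__ == "__main__":
    init_distributed()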

OPTIMIZED IMPLEMENTATION
7.5B Model, 32 Nodes
7.5B model trained with Megatron on 32 nodes.
• 32 nodes: TPS=4, PPS=1, DPS=64 (tensor, pipeline, and data parallel sizes).
• Forward-compute and backward-compute include the all-reduce within each tensor model parallel group.
• AllReduce time within each data parallel group is significant.
• 1 HCA and 2 HCAs: extremely poor, bounded by communication performance; some GPUs have to cross the SMP interconnect to reach an HCA, giving bad GPUDirect RDMA (GDR) performance.
• 8 HCAs vs. 4 HCAs: 6.6% improvement, mainly from faster AllReduce of gradients within the data parallel group.

[Chart: elapsed time per step (ms) for 1, 2, 4, and 8 HCAs, broken down into Forward-compute, Backward-compute, All reduce (Data P), and Optimizer; global_batch_size=512]
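
The DPS=64 above follows from the world size divided by the product of the tensor and pipeline parallel sizes (32 nodes x 8 GPUs = 256; 256 / (4 x 1) = 64). The sketch below works through that arithmetic and shows a gradient AllReduce over a data parallel group; it mirrors the idea behind Megatron's process groups but is a simplified illustration, not Megatron's actual implementation.

# Parallel-size arithmetic for this slide, plus the per-step gradient
# all-reduce that shows up as the "All reduce (Data P)" bucket in the chart.
# Simplified illustration; not Megatron's actual group-construction code.
import torch
import torch.distributed as dist

NODES = 32
GPUS_PER_NODE = 8
WORLD_SIZE = NODES * GPUS_PER_NODE         # 256 GPUs

TPS = 4                                    # tensor model parallel size
PPS = 1                                    # pipeline parallel size
DPS = WORLD_SIZE // (TPS * PPS)            # data parallel size -> 64
assert TPS * PPS * DPS == WORLD_SIZE

def allreduce_gradients(model: torch.nn.Module, dp_group: dist.ProcessGroup) -> None:
    """Average gradients across one data parallel group after backward()."""
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, group=dp_group)
            param.grad /= DPS

Because TPS=4 fits inside an 8-GPU node, the tensor-parallel all-reduce can stay on NVLink, while this data-parallel all-reduce crosses the InfiniBand fabric, which is consistent with why the number of HCAs per node matters so much in the chart above.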
