Instinct MI300 Series Cluster Reference Guide
Network Fabrics.........................................................................................................................................................9
Software Components........................................................................................................................................... 10
Compute Node........................................................................................................................................................ 11
System Management..............................................................................................................................................14
Appendix A: Vendor Information............................................................................................................... 24
Appendix B: Acronyms................................................................................................................................. 26
Notices.......................................................................................................................................................................30
Trademarks.........................................................................................................................................................................30
List of Figures
Figure 2.1: AMD MI300 Series Platform.............................................................................................. 8
Figure 4.1: Server and Rack Design with MI300 Series Accelerators..........................................15
Figure 5.3: Layout of a Full 128 Node Cluster with 64 Port Switches........................................ 20
Figure 6.1: Software Stack with AMD ROCm, Container, and Infrastructure Blocks...............22
List of Tables
Table 1.1: Key Terms................................................................................................................................ 7
Table 4.1: Component Count with 128 Radix Switches, 8 Nodes per Rack............................. 16
Table 5.1: Fat Tree 2-Tier with Switch Radix = 64 (non-blocking 32 downlink, 32
uplink).....................................................................................................................................18
Table 5.2: Rail Topology with 64 Port Switch (32 downlink, 32 uplink)......................................19
Chapter 1: Abstract
Artificial Intelligence (AI) and Machine Learning (ML) models continue to advance in capability and scale
at an increasingly rapid rate. This advancement raises performance and efficiency requirements for every
element of AI/ML infrastructure, including the network. As model scale increases, the workload must be
distributed across Graphics Processing Units (GPUs) operating in parallel at massive scale, and performance
becomes increasingly dependent on the network that moves data between these GPUs.
AI/ML model training and inference operations require the movement and processing of massive
volumes of data. The GPU-GPU communication network must support a wide range of requirements
such as latency-sensitive inference operations and iterative, high-throughput, parallel mathematical
training operations. The highest throughput, lowest latency GPU-GPU data movement occurs within a
node’s accelerator network that connects a group of GPUs.
Additional GPU-GPU data movement occurs across a backend network that connects multiple such
nodes into a large-scale cluster. AI/ML model deployments require a wide range of network scaling, so
the design of the accelerator network and the backend network is crucial as the system grows from small
clusters of inference nodes to much larger training backend networks supporting thousands of nodes and
beyond. The efficiency and performance of GPU-GPU communication fabrics is therefore of critical
importance.
The frontend network in existing data centers supports a wide range of functions, including AI/ML data
ingestion, storage, and management. Traditionally, the frontend network is connected directly to the CPUs
in the nodes.
This reference architecture document outlines the components required to build a backend network
cluster of MI300 Series GPUs, utilizing primarily Ethernet-based network interface cards (NICs) and
switches that scale to meet the growing demands of AI models. AMD MI300 Series products support a
wide range of networking technologies and topologies beyond Ethernet via standard PCIe-based NICs.
AMD is committed to the development and enhancement of open, standards-based networking through
the Ultra Ethernet Consortium (UEC) and the Ultra Accelerator Link (UALink) Consortium.
AMD works with partners to support an open ecosystem of multiple networking solutions including
AMD networking products. This reference architecture document describes a wide range of networking
topologies including fat tree and rail-based topologies.
Key Terms
The following table defines the key terms used in this document.
Terminology Description
Chapter 2: Components of an MI300 Series Cluster
A compute node consists of the AMD MI300 Series Platform together with CPUs, memory, and NIC
devices. Specifications of the AMD MI300 Series reference compute node are given in the following
table. Compute nodes with AMD MI300 Series platforms are available from select vendors (see Vendor
Information).
Component Specification
Network Fabrics
Network fabrics are composed of at least three networks with NICs, switches and cables, as detailed in
the following table. Several such components are listed in Vendor Information.
Out-of-band management network: a separate network with its own NICs, connecting the baseboard management controllers (BMCs)
Software Components
An MI300 Series Cluster requires the following software components.
Chapter 3: Design Requirements
This reference architecture provides a starting point with common usage models for AI/ML or High
Performance Computing (HPC) workloads.
System Design
The following hardware system design components are recommended:
• Scalable cluster architecture based on a scalable unit of 32 compute nodes
• Datacenter racks may have 1, 4, 8 or 16 compute nodes. The specific number of nodes is influenced
by rack power and cooling requirements.
• Supported networking adapters (NICs) and switches from AMD and partners, supporting up to
400 Gb/s
• Storage networking components to support storage servers and the storage network
Compute Node
The compute node consists of eight MI300 Series GPUs interconnected by 4th Gen AMD Infinity
Fabric™ links. A typical compute node also includes dual-socket CPUs, memory, and NICs connected via
two PCIe Gen 5.0 switches.
Performance-optimized designs have a specific mapping of MI300 Series GPU to frontend NICs and
backend NICs as illustrated in Figure 3.1. Each CPU has one directly connected frontend NIC, so there
are a total of two frontend NICs per compute node. Each MI300 Series GPU has one backend NIC that
is connected through the PCIe Switch, so there are a total of eight backend NICs per compute node.
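To make this wiring concrete, the following minimal Python sketch models the mapping described above. The device names are illustrative placeholders rather than real operating-system identifiers, and the split of four GPU/NIC pairs behind each PCIe switch is an assumption made for the illustration.

# Illustrative model of the reference compute node wiring described above.
# Device names (fnic0, bnic0, pcie_sw0, ...) are placeholders, and the split of
# four GPU/NIC pairs per PCIe switch is assumed for this sketch.
from pprint import pprint

def reference_node_map():
    node = {"frontend_nics": {}, "backend_nics": {}}
    # One frontend NIC attached directly to each of the two CPU sockets.
    for cpu in range(2):
        node["frontend_nics"][f"cpu{cpu}"] = f"fnic{cpu}"
    # One backend NIC per GPU, reached through one of the two PCIe Gen 5 switches.
    for gpu in range(8):
        node["backend_nics"][f"gpu{gpu}"] = {
            "nic": f"bnic{gpu}",
            "pcie_switch": f"pcie_sw{gpu // 4}",
        }
    return node

pprint(reference_node_map())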
Frontend Network
The frontend network is the traditional datacenter network composed of switches and network adapters
(NICs) that support storage and management functions.
Accelerator Network
The accelerator network is a high bandwidth, low latency network that connects a group of GPUs and
supports load/store transactions between the GPUs, as shown in the following figure. In MI300 Series
based designs, this network is Infinity Fabric™ interconnecting 8 GPUs in a mesh topology within a
compute node.
Backend Network
The backend network connects a larger set of GPUs, beyond those reachable within the accelerator
(scale-up) network. Each compute node has eight NICs with a 1:1 GPU:NIC ratio, utilizing a PCIe switch
between the GPUs and NICs. RDMA communication using the RoCE protocol, congestion management,
and support for UEC-defined transport-layer improvements are essential in this network. Communication
between GPUs in this network is enhanced by NICs that support acceleration of collective operations.
System Management
For server management and maintenance, system vendors provide management software and interfaces
that perform real-time health monitoring and management on each server, including firmware updates.
Chapter 4: Cluster Architecture
Figure 4.1: Server and Rack Design with MI300 Series Accelerators
Table 4.1: Component Count with 128 Radix Switches, 8 Nodes per Rack
128 1024 8192 128 128 64 320 8192 8192 8192 24576
256 2048 16384 256 256 128 640 16384 16384 16384 49152
Chapter 5: Topology of Network Fabrics
The following configuration choices need to be considered when designing a network topology:
• Blocking factor: A switch has downlink and uplink ports; the blocking factor is defined as
“downlink_port:uplink_port”. For a 64-port switch using 32 uplink and 32 downlink ports, the
blocking factor is 1:1. (A 1:1 blocking factor is defined as a non-blocking configuration).
• Undersubscription: A switch can have fewer downlink ports than uplink ports; for example, a 64-port
switch can have 24 downlink ports and 28 uplink ports, with 12 ports unused. This is referred to as
16% undersubscription. The safe approach is to undersubscribe (especially at higher switch
tiers), but the cost-effective approach is 1:1, which utilizes all switch ports (see the sketch after this list).
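For reference, the following minimal Python sketch computes the blocking factor and the undersubscription percentage for a given port split. Taking the percentage relative to the downlink count is an assumed definition; with it, the 24/28 example above works out to roughly 16%.

from math import gcd

def blocking_factor(downlinks: int, uplinks: int) -> str:
    # Downlink:uplink ratio in lowest terms; 1:1 is non-blocking.
    g = gcd(downlinks, uplinks)
    return f"{downlinks // g}:{uplinks // g}"

def undersubscription_pct(downlinks: int, uplinks: int) -> float:
    # Percent by which uplink ports exceed downlink ports (assumed definition).
    return 100.0 * (uplinks - downlinks) / downlinks

print(blocking_factor(32, 32))                   # 64-port switch, 32/32 -> 1:1
print(f"{undersubscription_pct(24, 28):.1f}%")   # 24/28 with 12 unused -> 16.7%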
The fat tree topology is a familiar, scalable design; some networks may require undersubscription to
mitigate ECMP hash collisions when a blocking design is used. The following diagram and table illustrate a
2-tier, non-blocking fat tree topology.
Table 5.1: Fat Tree 2-Tier with Switch Radix = 64 (non-blocking, 32 downlink, 32 uplink)
GPUs, NICs (1:1) Nodes Leaf Switches Spine Switches
32 4 1 0
64 8 2 1
128 16 4 2
256 32 8 4
512 64 16 8
1024 128 32 16
2048 256 64 32
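The switch counts in Table 5.1 follow directly from the port split. The sketch below, which assumes eight GPUs per node and one backend NIC per GPU, reproduces the rows of the table for a 64-port switch with 32 downlinks and 32 uplinks.

from math import ceil

def fat_tree_2tier(gpus: int, radix: int = 64, downlinks: int = 32) -> dict:
    # Leaf and spine counts for a non-blocking 2-tier fat tree, one NIC per GPU.
    uplinks = radix - downlinks
    leaves = ceil(gpus / downlinks)
    # A single leaf needs no spine layer; otherwise the spines absorb all leaf
    # uplinks, with up to `radix` downlinks per spine switch.
    spines = 0 if leaves == 1 else ceil(leaves * uplinks / radix)
    return {"nodes": gpus // 8, "leaves": leaves, "spines": spines}

for gpus in (32, 64, 128, 256, 512, 1024, 2048):
    print(gpus, fat_tree_2tier(gpus))   # matches the rows of Table 5.1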
Rail Topology
A 2-tier rail design consists of two layers of leaf-spine switches (T1, T2), with the T1 (leaf) switches
connected to the NICs in the backend network. Each of a node's NICs is connected to one port on a
different T1 leaf switch, so same-index NICs across nodes share a leaf and form a rail. A 3rd tier adds a
T3 layer of switches.
A rail topology benefits from containing traffic within rails, thereby minimizing the probability of
congestion. The communication libraries must be aware of the rail connections and the scale-up fabric.
The following diagram and table illustrate a 2-tier rail topology.
Table 5.2: Rail Topology with 64 Port Switch (32 downlink, 32 uplink)
GPUs, NICs (1:1) Nodes Leaf Switches Spine Switches Core Switches
256 32 8 4 --
512 64 16 8 --
768 96 24 12 --
1024 128 32 16 --
2048 256 64 64 32
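As an illustration of the rail wiring described above, the sketch below maps backend NIC i of every node in one 32-node scalable unit to leaf switch i; the switch names and port numbering are placeholders rather than a cabling specification.

NODES_PER_UNIT = 32   # one scalable unit
NICS_PER_NODE = 8     # one backend NIC per GPU

def rail_wiring(unit: int = 0) -> dict:
    # Map (node, nic) -> (leaf switch, downlink port) within one scalable unit.
    wiring = {}
    for node in range(NODES_PER_UNIT):
        for nic in range(NICS_PER_NODE):
            # NIC i of every node lands on the same leaf (rail i), so same-index
            # GPUs are one switch hop apart within their rail.
            leaf = unit * NICS_PER_NODE + nic
            wiring[(node, nic)] = (f"leaf{leaf}", node)
    return wiring

wiring = rail_wiring()
print(wiring[(0, 3)], wiring[(31, 3)])   # both on leaf3: one hop within rail 3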
The following figure shows a full 128-node cluster with the nodes in a rail layout. Within a rail, any node
is one switch hop from any other node. The layout can also be a fat tree in which all the links from a rack
terminate in a leaf switch, with a similar number of leaf and spine switches for the same number of
scalable units.
Figure 5.3: Layout of a Full 128 Node Cluster with 64 Port Switches
To build larger clusters, refer to the following table. The maximum number of nodes depends on the
switch radix and the topology.
Parameter (Count)   2-Tier Rail/Fat Tree (64 port, 400G)   3-Tier Rail/Fat Tree (64 port, 400G)   2-Tier Rail/Fat Tree (128 port, 400G)   3-Tier Rail/Fat Tree (128 port, 400G)
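As a rough guide to the scale each configuration supports, the following sketch computes the standard upper bound on endpoint count for non-blocking 2-tier and 3-tier fat trees (radix^2/2 and radix^3/4 endpoints, respectively); these textbook limits and the assumption of eight GPUs and backend NICs per node are illustrative rather than figures taken from the table above.

def max_nodes(radix: int, tiers: int, gpus_per_node: int = 8) -> int:
    # Upper bound on nodes for a standard non-blocking fat tree, one NIC per GPU.
    if tiers == 2:
        endpoints = radix ** 2 // 2
    elif tiers == 3:
        endpoints = radix ** 3 // 4
    else:
        raise ValueError("only 2- or 3-tier trees are covered in this sketch")
    return endpoints // gpus_per_node

for radix in (64, 128):
    for tiers in (2, 3):
        print(f"{tiers}-tier, {radix}-port: up to {max_nodes(radix, tiers)} nodes")
# 64-port:  2-tier up to 256 nodes,  3-tier up to 8192 nodes
# 128-port: 2-tier up to 1024 nodes, 3-tier up to 65536 nodes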
As datasets for AI workloads continue to expand in size, it is becoming increasingly critical that GPUs are
not constrained by the I/O network and storage systems. The storage fabric provides the path between
GPU memory and the storage systems. Storage systems can be connected via the frontend network, but
they benefit from a separate, high-speed storage network, as shown in the following figure.
A separate, optimized storage network as shown above provides benefits such as:
• Deep learning models access large datasets during training; a dedicated network provides frequent,
iterative access to the data from the GPUs over the storage network.
• As datasets grow in size, the capital and operating expenditures for storage are kept separate from the compute needs.
Chapter 6: AMD Software Stack
The AMD software stack, shown below, consists of a collection of drivers, development tools, and
APIs that enable GPU programming from low-level kernels to end-user applications. The ROCm Data
Center Tool (RDC) and AMD SMI are essential building blocks for cluster management and datacenter
environments.
Figure 6.1: Software Stack with AMD ROCm, Container, and Infrastructure Blocks
The AMD System Management Interface (SMI) is a C library for Linux that provides a user-space interface
to monitor and control AMD devices. The SMI libraries are available in the AMD SMI GitHub repository.
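A minimal health-monitoring sketch using the amdsmi Python binding that accompanies the library is shown below; the function names follow the amdsmi package but may vary between ROCm releases, so verify them against the documentation of the installed version.

# Minimal monitoring sketch using the amdsmi Python binding (function names may
# vary between ROCm releases; verify against the installed library's documentation).
import amdsmi

amdsmi.amdsmi_init()
try:
    for handle in amdsmi.amdsmi_get_processor_handles():
        # Basic per-GPU identification; further metric queries follow the same pattern.
        print(amdsmi.amdsmi_get_gpu_asic_info(handle))
finally:
    amdsmi.amdsmi_shut_down()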
Appendix A: Vendor Information
Vendor Link
Gigabyte G593-ZX1-AAX1
Supermicro AS-8125GS-TNMR2
Appendix B: Acronyms
The acronyms used in this document are expanded in the following table.
Acronym Definition
AI Artificial Intelligence
AS Autonomous System
I/O Input/Output
IP Internet Protocol
ML Machine Learning
OS Operating System
SerDes Serializer/Deserializer
SP Strict Priority
VM Virtual Machine
Additional Resources and Legal Notices
Revision Summary
Initial release.
Notices
© Copyright 2024 Advanced Micro Devices, Inc.
The information presented in this document is for informational purposes only and may contain technical
inaccuracies, omissions, and typographical errors. The information contained herein is subject to change
and may be rendered inaccurate for many reasons, including but not limited to product and roadmap
changes, component and motherboard version changes, new model and/or product releases, product
differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or
the like. Any computer system has risks of security vulnerabilities that cannot be completely prevented
or mitigated. AMD assumes no obligation to update or otherwise correct or revise this information.
However, AMD reserves the right to revise this information and to make changes from time to time to
the content hereof without obligation of AMD to notify any person of such revisions or changes.
Trademarks
AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc.
Other product names used in this publication are for identification purposes only and may be trademarks
of their respective companies.