
AMD Instinct™ MI300 Series Cluster

Reference Architecture Guide

Publication Number: 63916 v1.00


Date: November 2024
Contents

Chapter 1: Abstract
    Key Terms
Chapter 2: Components of an MI300 Series Cluster
    AMD MI300 Series Platform Compute Node
    Network Fabrics
    Software Components
Chapter 3: Design Requirements
    System Design
    Compute Node
    Frontend, Accelerator and Backend Networks
        Frontend Network
        Accelerator Network
        Backend Network
    Out-of-band Management Network
    System Management
Chapter 4: Cluster Architecture
Chapter 5: Topology of Network Fabrics
    Backend Network Topology
        Fat Tree Non-blocking Topology
        Rail Topology
    Frontend Network Topology
Chapter 6: AMD Software Stack
    Cluster Management with RDC and SMI Tools
Appendix A: Vendor Information
Appendix B: Acronyms
Appendix C: Additional Resources and Legal Notices
    Revision History
    Notices
        Trademarks
List of Figures

Figure 2.1: AMD MI300 Series Platform
Figure 3.1: Frontend, Accelerator and Backend Networks
Figure 3.2: Accelerator Mesh Network with MI300 Series Accelerators
Figure 4.1: Server and Rack Design with MI300 Series Accelerators
Figure 5.1: A 32 Node 2-Tier Fat Tree Topology
Figure 5.2: A 32 Node 2-Tier Rail Topology
Figure 5.3: Layout of a Full 128 Node Cluster with 64 Port Switches
Figure 5.4: Storage Network for a 32 Node Design
Figure 6.1: Software Stack with AMD ROCm, Container, and Infrastructure Blocks
List of Tables

Table 1.1: Key Terms
Table 2.1: Compute Node Reference Design Specifications
Table 2.2: Network Hardware Components
Table 2.3: Software Components
Table 4.1: Component Count with 128 Radix Switches, 8 Nodes per Rack
Table 5.1: Fat Tree 2-Tier with Switch Radix = 64 (non-blocking 32 downlink, 32 uplink)
Table 5.2: Rail Topology with 64 Port Switch (32 downlink, 32 uplink)
Table 5.3: Maximum Counts Based on Radix and Switch Tiers
Table A.1: System Designs with MI300X
Table A.2: NICs in Backend and Frontend Network
Table A.3: Switches in Backend and Frontend Network
Table A.4: Switches in Storage Network
Table A.5: Out-of-band Management Software
Table A.6: Storage Systems
Table B.1: Acronyms

Chapter 1: Abstract
Artificial Intelligence (AI) and Machine Learning (ML) models continue to advance in capability and scale
at an increasingly rapid rate. This advancement increases performance and efficiency requirements
on every element of AI/ML infrastructure, including networking infrastructure. As AI/ML model scale
has increased, the workload must be distributed across Graphics Processing Units (GPUs) operating
in parallel at massive scale. Performance is increasingly dependent on the network that enables data
movement between these GPUs.

AI/ML model training and inference operations require the movement and processing of massive
volumes of data. The GPU-GPU communication network must support a wide range of requirements
such as latency-sensitive inference operations and iterative, high-throughput, parallel mathematical
training operations. The highest throughput, lowest latency GPU-GPU data movement occurs within a
node's accelerator network that connects a group of GPUs.

Additional GPU-GPU data movement occurs across a backend network that connects multiple such
nodes into a large-scale cluster. AI/ML model deployments require a wide range of network scaling.
The design of the accelerator network and the backend network is therefore crucial as deployments
scale from small clusters of inferencing nodes to much larger training backend networks supporting
thousands of nodes and beyond. The efficiency and performance of GPU-GPU communication fabrics
is of critical importance.

The frontend network in existing data centers supports a wide range of functions, including AI/ML data
ingestion, storage, and management. Traditionally, the frontend network connects directly to the CPUs
in the nodes.

This reference architecture document outlines the components required to build a backend network
cluster of MI300 Series GPUs, primarily using Ethernet-based network interface cards (NICs) and
switches that scale to meet the growing requirements of AI models. AMD MI300 Series products
support a wide range of networking technologies and topologies beyond Ethernet via standard
PCIe-based NICs. AMD is committed to the development and enhancement of open standards-based
networks such as those of the Ultra Ethernet Consortium (UEC) and the Ultra Accelerator Link
(UALink) Consortium. AMD works with partners to support an open ecosystem of multiple networking
solutions, including AMD networking products. This document describes a range of networking
topologies, including fat tree and rail-based topologies.

Key Terms
The following table defines the key terms used in this document.


Table 1.1: Key Terms

| Terminology | Description |
|---|---|
| AMD MI300 Series | AMD Instinct™ MI300X Platform; AMD Instinct™ MI325X Platform |
| Backend Network | Network forming the cluster with GPU NICs; also referred to as the scale-out network, backend scale-out network, and backside scale-out network. The NICs in this network are referred to as backend NICs. |
| Accelerator Network | Network connecting GPUs within a node in a mesh with Infinity Fabric™ links; also referred to as the scale-up network, backend scale-up network, and backside scale-up network. NICs are not used in the MI300 Series. |
| Frontend Network | Network with a different set of NICs (from the backend network); also referred to as the frontside network. Depending on the server design, this network can also support storage and in-band management operations. The NICs in this network are referred to as frontend NICs. |

Chapter 2: Components of an MI300 Series Cluster

A cluster consists of several components:

• AMD MI300 Series Platform compute nodes,
• Network fabrics composed of at least three networks with NICs, switches, and cables, and
• Software libraries, and system and cluster management components.

AMD MI300 Series Platform Compute Node


The AMD MI300 Series platform comprises eight OCP Accelerator Module (OAM) form-factor MI300
Series GPUs in a Universal Baseboard (UBB) 2.0 design. The following figure shows the air-cooled
platform.

Figure 2.1: AMD MI300 Series Platform


A compute node consists of the AMD MI300 Series Platform together with CPUs, memory, and NIC
devices. Specifications of the AMD MI300 Series reference compute node are given in the following
table. Compute nodes with AMD MI300 Series platforms are available from select vendors (see Vendor
Information).

Table 2.1: Compute Node Reference Design Specifications

| Component | Specification |
|---|---|
| CPU | 2x 4th-gen AMD EPYC processors |
| GPU | 8x AMD Instinct™ MI300X accelerators with AMD Universal Baseboard (UBB 2.0) |
| Memory | Configurable; typical designs use 6 TB (24x 256 GB DRAM) DDR5 |
| Drives | NVMe SSDs; typical designs use 8-16 2.5-inch drives: 1-2 OS drives plus high-performance scratch drives |
| Networking | 8x PCIe 5.0 high-performance networking cards, 400G Ethernet |
| Accelerator Interconnect | AMD Infinity Architecture platform with 4th-gen AMD Infinity Fabric™ links; up to 896 GB/s aggregate bidirectional, 64 GB/s peer inter-GPU connectivity |
| Cooling | Air cooling or liquid cooling |

Network Fabrics
Network fabrics are composed of at least three networks with NICs, switches and cables, as detailed in
the following table. Several such components are listed in Vendor Information.

Table 2.2: Network Hardware Components

| Hardware Component | Description |
|---|---|
| Backend scale-out network | Fat-tree or rail-optimized cluster topology with RDMA-optimized Ethernet NICs and switches |
| Accelerator network | Infinity Fabric™ mesh interconnecting 8 GPUs in the compute node |
| Storage network (frontend network) | Storage network topology connected through frontend NICs |
| In-band management network (frontend network) | Management network connected through frontend NICs; also provides services accessible by users |
| Out-of-band management network | Separate network with its own NICs connecting the BMCs |

Software Components
An MI300 Series Cluster requires the following software components.

Table 2.3: Software Components

| Software Component | Description |
|---|---|
| Data center management software (RDC, SMI) | ROCm Data Center Tools and System Management Interface libraries (see Cluster Management with RDC and SMI Tools) |
| System management | Software and user interfaces for system management of nodes (examples are given in Vendor Information) |


Chapter 3: Design Requirements


AI/ML deployments have a wide range of cluster network scale requirements. The optimal system
design should consider the node design, target NIC cards, switch capabilities, and target workloads to
deliver required efficiency and performance.

This reference architecture provides a starting point with common usage models for AI/ML or High
Performance Computing (HPC) workloads.

System Design
The following hardware system design components are recommended:
• Scalable cluster architecture based on a scalable unit of 32 compute nodes
• Datacenter racks may have 1, 4, 8 or 16 compute nodes. The specific number of nodes is influenced
by rack power and cooling requirements.
• Supported networking adapters (NICs) and switches from AMD and partners, supporting up to
400 Gb/s
• Storage Networking components to support storage servers and storage network

The following software system design components are recommended:

• Cluster management software from AMD and partners


• System management software from AMD and partners

Compute Node
The compute node consists of eight MI300 Series GPUs interconnected by 4th-gen AMD Infinity
Fabric™ links. A typical compute node also includes dual-socket CPUs, memory, and NICs connected via
two PCIe Gen 5 switches.

Performance-optimized designs have a specific mapping of MI300 Series GPU to frontend NICs and
backend NICs as illustrated in Figure 3.1. Each CPU has one directly connected frontend NIC, so there
are a total of two frontend NICs per compute node. Each MI300 Series GPU has one backend NIC that
is connected through the PCIe Switch, so there are a total of eight backend NICs per compute node.

Frontend, Accelerator and Backend Networks


The cluster network fabric is composed of at least three networks, as illustrated in the following figure
and discussed in the following subsections.


Figure 3.1: Frontend, Accelerator and Backend Networks

Frontend Network
The frontend network is the traditional datacenter network, composed of switches and network adapters
(NICs), which supports storage and management functions.

Storage network (part of frontend network):


• As language models grow and training time increases, writing checkpoints is necessary for fault
tolerance. Checkpoint files can be terabytes in size, and while not written frequently, they are
typically written synchronously (see the sketch after this list).
• Independent of the backend network, this can be Ethernet (preferred). RDMA is a prerequisite (RoCE
for Ethernet).
• If a separate network is designed, a 400 Gbps network is ideal for storage needs.
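
As a back-of-the-envelope illustration of the bandwidth these synchronous checkpoint writes demand, the following sketch estimates the line-rate write time for a hypothetical checkpoint over a 400 Gbps storage link. Both numbers are illustrative assumptions, not figures from this guide:

```python
# Rough estimate of synchronous checkpoint write time over the storage network.
checkpoint_bytes = 2e12        # hypothetical 2 TB checkpoint
link_gbps = 400                # 400 Gbps storage network link

seconds = checkpoint_bytes * 8 / (link_gbps * 1e9)
print(f"Line-rate write time: {seconds:.0f} s")   # ~40 s
# Real writes take longer: protocol overhead and file-system or
# storage-server limits reduce effective throughput below line rate.
```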

In-band management network (part of frontend network):


• The in-band management fabric is used for node provisioning, data movements, SLURM, Kubernetes,
and downloading from package repos such as pypi, docker repo, gcr.io, etc.
• Ethernet-based; used for provisioning of nodes and for services that need to be accessed by users.
• If a separate network is designed by vendors, a 100 Gbps network is desirable.


Accelerator Network
The accelerator network is a high bandwidth, low latency network that connects a group of GPUs and
supports load/store transactions between the GPUs, as shown in the following figure. In MI300 Series
based designs, this network is Infinity Fabric™ interconnecting 8 GPUs in a mesh topology within a
compute node.

Figure 3.2: Accelerator Mesh Network with MI300 Series Accelerators

Backend Network
The backend network connects a larger set of GPUs, beyond those reachable within the accelerator
(scale-up) network. Each compute node has eight NICs with a 1:1 GPU:NIC ratio, utilizing a PCIe
switch between the GPUs and NICs. RDMA communication using the RoCE protocol, congestion
management, and support for UEC-defined transport-layer improvements are essential in this
network. Communication between GPUs in this network is enhanced by NICs supporting acceleration of
collective operations.

This network is designed to be highly scalable:


• The backend network topology can be either a fat tree or rail optimized.


• Ethernet switches form the 2- (leaf-spine) or 3-tier (leaf-spine-core) switching fabric.

NICs and switches:


• NICs connect compute nodes to the backend network through a set of switches in a well-defined
topology.
• The number of required switches depends on the number of nodes in the cluster, the switch radix,
and the cluster performance requirements.
• NICs and leaf (ToR) switches reside with the compute nodes in a rack.

Out-of-band Management Network


The out-of-band management fabric is a separate, low-speed (usually 1-10 Gbps) network that connects
to the management ports of all nodes, storage servers, racks, and switches in the cluster. Within
a node, it connects to the Baseboard Management Controller (BMC), which is used to change BIOS
settings and to monitor and manage node health (fan speeds, voltage levels, temperatures, etc.). Users
can interact with the BMC through the IPMI or Redfish APIs, or through the BMC web portal.
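
As a minimal illustration of out-of-band access, the following sketch queries a node's BMC through the standard Redfish REST API. The BMC address and credentials are placeholders; the endpoints and properties shown come from the DMTF Redfish Systems schema:

```python
import requests

BMC_URL = "https://10.0.0.42"   # hypothetical BMC address
AUTH = ("admin", "password")    # placeholder credentials

# Enumerate the systems managed by this BMC (standard Redfish collection).
resp = requests.get(f"{BMC_URL}/redfish/v1/Systems", auth=AUTH, verify=False)
resp.raise_for_status()
for member in resp.json()["Members"]:
    system = requests.get(f"{BMC_URL}{member['@odata.id']}",
                          auth=AUTH, verify=False).json()
    # Power state and health roll-up are standard Systems properties.
    print(system["Id"], system["PowerState"], system["Status"]["Health"])
```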

System Management
For management and maintenance of a server, system vendors provide management software and
interfaces that perform real-time health monitoring and management on each server, including firmware
updates.


Chapter 4: Cluster Architecture


A cluster consists of a group of racks, each of which consists of a group of servers. These servers
are placed in a rack with backend switches for the backend network. In MI300 Series systems these
racks are designed with qualified vendors (see Vendor Information). The rack layouts are scalable and
adjustable to meet the data center requirements. Following is a reference rack layout consisting of
either 4, 8 or 16 GPU server nodes, each with 8 MI300 Series GPUs.

Figure 4.1: Server and Rack Design with MI300 Series Accelerators


Table 4.1: Component Count with 128 Radix Switches, 8 Nodes per Rack

| Rack Count (8 nodes/rack) | Node Count | GPU Count | Leaf Switches | Spine Switches | Core Switches | Total Switches | Node-Leaf Cables | Leaf-Spine Cables | Spine-Core Cables | Total Cables |
|---|---|---|---|---|---|---|---|---|---|---|
| 16 | 128 | 1024 | 16 | 8 | -- | 24 | 1024 | 1024 | -- | 2048 |
| 32 | 256 | 2048 | 32 | 16 | -- | 48 | 2048 | 2048 | -- | 4096 |
| 64 | 512 | 4096 | 64 | 64 | 32 | 160 | 4096 | 4096 | 4096 | 12288 |
| 128 | 1024 | 8192 | 128 | 128 | 64 | 320 | 8192 | 8192 | 8192 | 24576 |
| 256 | 2048 | 16384 | 256 | 256 | 128 | 640 | 16384 | 16384 | 16384 | 49152 |
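
The counts in Table 4.1 follow from simple port arithmetic. The sketch below reproduces the rows, assuming (per the table) a non-blocking 128-port switch with half its ports as downlinks, a 2-tier fabric up to 32 racks, and a 3-tier fabric beyond. It is a consistency check on the table, not a design tool:

```python
def component_counts(racks, tiers, nodes_per_rack=8, radix=128, nics_per_node=8):
    """Switch and cable counts for a non-blocking backend fabric."""
    nodes = racks * nodes_per_rack
    nics = nodes * nics_per_node            # one backend NIC per GPU
    down = radix // 2                       # half the ports face downward
    leafs = nics // down
    spines = leafs // 2 if tiers == 2 else leafs
    cores = 0 if tiers == 2 else leafs // 2
    switches = leafs + spines + cores
    cables = nics * tiers                   # one cable per NIC per tier boundary
    return nodes, nics, leafs, spines, cores, switches, cables

for racks, tiers in [(16, 2), (32, 2), (64, 3), (128, 3), (256, 3)]:
    print(racks, component_counts(racks, tiers))
```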


Chapter 5: Topology of Network Fabrics


A network fabric in a cluster design consists of the following fabrics:

• Compute fabric (backend network),


• Storage fabric (frontend network),
• In-band management fabric (frontend network), and
• Out-of-band management fabric.

Backend Network Topology


There are two topologies that will be discussed for the backend network:
• Fat tree, and
• Rail.

There are some configuration choices to consider when designing a network topology (the port
arithmetic is sketched after this list):

• Blocking factor: A switch has downlink and uplink ports; the blocking factor is defined as
"downlink_ports:uplink_ports". For a 64-port switch using 32 uplink and 32 downlink ports, the
blocking factor is 1:1. (A 1:1 blocking factor is defined as a non-blocking configuration.)
• Undersubscription: A switch can have fewer downlink ports than uplink ports; for example, a 64-port
switch can have 24 downlink ports and 28 uplink ports, with 12 ports unused. This is referred to as
roughly 16% undersubscription, since (28 − 24)/24 ≈ 16.7%. The safe approach is undersubscription
(especially at higher switch tiers); the cost-effective approach is 1:1, which utilizes all switch ports.
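
The port arithmetic behind these two definitions, as a minimal sketch:

```python
def undersubscription(downlinks: int, uplinks: int) -> float:
    """Fraction by which uplink ports exceed downlink ports."""
    return (uplinks - downlinks) / downlinks

# 64-port switch, 32 downlink / 32 uplink: blocking factor 1:1 (non-blocking).
print(undersubscription(32, 32))   # 0.0
# 64-port switch, 24 downlink / 28 uplink (12 ports unused):
print(undersubscription(24, 28))   # ~0.167, the ~16% undersubscription above
```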

Fat Tree Non-blocking Topology


A 2-tier fat-tree consists of 2 layers of leaf-spine switches (T1, T2), with the T1 (leaf) switches
connected to the NICs in the backend network. All NICs of a node are connected to the same T1 switch.
A third tier adds a T3 layer of switches.

The fat tree topology is a familiar, scalable design; some networks may require undersubscription to
mitigate ECMP hash collisions in a blocking design. The following diagram and table illustrate a 2-tier
fat tree non-blocking topology.


Figure 5.1: A 32 Node 2-Tier Fat Tree Topology

Table 5.1: Fat Tree 2-Tier with Switch Radix = 64 (non-blocking 32 downlink, 32 uplink)

| GPUs, NICs (1:1) | Nodes | Leaf Switches | Spine Switches |
|---|---|---|---|
| 32 | 4 | 1 | 0 |
| 64 | 8 | 2 | 1 |
| 128 | 16 | 4 | 2 |
| 256 | 32 | 8 | 4 |
| 512 | 64 | 16 | 8 |
| 1024 | 128 | 32 | 16 |
| 2048 | 256 | 64 | 32 |
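
The rows of Table 5.1 can be reproduced from the port arithmetic of a non-blocking 64-port switch (32 downlinks, 32 uplinks); a minimal sketch:

```python
import math

def fat_tree_2tier(nodes, radix=64, nics_per_node=8):
    """Leaf and spine counts for a non-blocking 2-tier fat tree."""
    downlinks = radix // 2                    # NIC-facing ports per leaf
    nics = nodes * nics_per_node
    leafs = math.ceil(nics / downlinks)
    spines = 0 if leafs == 1 else leafs // 2  # a single leaf needs no spine
    return nics, leafs, spines

for nodes in (4, 8, 16, 32, 64, 128, 256):
    nics, leafs, spines = fat_tree_2tier(nodes)
    print(f"{nics:5d} NICs  {nodes:4d} nodes  {leafs:3d} leafs  {spines:3d} spines")
```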

Rail Topology
A 2-tier rail consists of 2 layers of leaf-spine switches (T1, T2), with the T1 (leaf) switches connected to
the NICs in the backend network. Each NIC of a node is connected to one port of each T1 leaf switch. A
3rd tier adds a T3 layer of switches.

Rail topology benefits from containing traffic within rails, thereby minimizing the probability of
congestion. The communication libraries must be aware of the rail connections and the scale-up
fabric. The following diagram and table illustrate a 2-tier rail topology.


Figure 5.2: A 32 Node 2-Tier Rail Topology

Table 5.2: Rail Topology with 64 Port Switch (32 downlink, 32 uplink)

| GPUs, NICs (1:1) | Nodes | Leaf Switches | Spine Switches | Core Switches |
|---|---|---|---|---|
| 256 | 32 | 8 | 4 | -- |
| 512 | 64 | 16 | 8 | -- |
| 768 | 96 | 24 | 12 | -- |
| 1024 | 128 | 32 | 16 | -- |
| 2048 | 256 | 64 | 64 | 32 |
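
The 2-tier rows of Table 5.2 follow the same arithmetic, with leaf switches allocated one per rail per scalable unit of 32 nodes. A sketch (the 2048-GPU row exceeds the 2-tier limit and adds a core tier, which this sketch does not model):

```python
import math

def rail_2tier(nodes, radix=64, rails=8):
    """Leaf and spine counts for a non-blocking 2-tier rail topology."""
    nodes_per_group = radix // 2          # 32 nodes fill one leaf per rail
    groups = math.ceil(nodes / nodes_per_group)
    leafs = groups * rails                # one leaf per rail per group
    spines = leafs // 2                   # 1:1 up/down, as in the fat tree
    return leafs, spines

for nodes in (32, 64, 96, 128):
    print(nodes, rail_2tier(nodes))       # matches the 2-tier rows above
```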

The following figure shows a full 128 node cluster with the nodes in a rail layout. Within a rail, a
node is one hop from any other node. The layout can also be a fat tree in which all the links from a rack
terminate in a leaf switch, with a similar number of leaf and spine switches for the same number of
scalable units.


Figure 5.3: Layout of a Full 128 Node Cluster with 64 Port Switches

To build larger clusters, refer to the following table. The maximum number of nodes depends on the
switch radix and the topology.

Table 5.3: Maximum Counts Based on Radix and Switch Tiers

| Parameter | 2-Tier Rail/Fat Tree (64 port, 400G) | 3-Tier Rail/Fat Tree (64 port, 400G) | 2-Tier Rail/Fat Tree (128 port, 400G) | 3-Tier Rail/Fat Tree (128 port, 400G) |
|---|---|---|---|---|
| Switch Radix | 64 | 64 | 128 | 128 |
| NICs per Node | 8 | 8 | 8 | 8 |
| Max Leaf Switches | 64 | 2048 | 128 | 8192 |
| Max Spine Switches | 32 | 2048 | 64 | 8192 |
| Max Core Switches | -- | 1024 | -- | 4096 |
| Max NICs | 2048 | 65536 | 8192 | 524288 |
| Max GPUs | 2048 | 65536 | 8192 | 524288 |
| Max Nodes | 256 | 8192 | 1024 | 65536 |
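
The maxima in Table 5.3 follow from the same non-blocking assumption (half of each switch's ports face downward); a sketch that reproduces the leaf, NIC/GPU, and node rows:

```python
def max_scale(radix, tiers, nics_per_node=8):
    """Maximum leaf, NIC/GPU, and node counts for a non-blocking fabric."""
    down = radix // 2
    # 2 tiers: at most `radix` leafs (each spine port reaches one leaf).
    # 3 tiers: the spine layer itself scales out, multiplying by `down`.
    max_leafs = radix if tiers == 2 else radix * down
    max_nics = max_leafs * down
    return max_leafs, max_nics, max_nics // nics_per_node

for radix in (64, 128):
    for tiers in (2, 3):
        leafs, nics, nodes = max_scale(radix, tiers)
        print(f"radix {radix}, {tiers}-tier: {leafs} max leafs, "
              f"{nics} max NICs/GPUs, {nodes} max nodes")
```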

Frontend Network Topology


The frontend network, composed of Ethernet NICs in a 1:1 NIC:CPU organization, carries the storage
and in-band management traffic if separate fabrics are not provided. The in-band management network
connects the cluster management services.

As datasets for AI workloads continue to expand in size, it is becoming increasingly critical that GPUs are
not constrained by the I/O network and storage systems. The storage fabric provides the path between
GPU memory and the storage systems. Storage systems can be connected through the frontend network,
but benefit from a separate, high-speed storage network, as shown in the following figure.


Figure 5.4: Storage Network for a 32 Node Design

A separate, optimized storage network as shown above provides benefits such as:

• Deep learning models access large datasets during training; a dedicated network supports the
frequent, iterative access to this data from the GPUs over the storage network.
• As datasets grow in size, storage capex and opex are kept separate from compute needs.


Chapter 6: AMD Software Stack


The AMD open-source ROCm software platform, containers for AI/ML, and data center infrastructure
components empower the accelerated computing community to innovate on top of a robust, flexible
stack designed for scalability. These components work in concert to extract the full potential of
heterogeneous architectures. The platform's open-source philosophy gives developers complete
visibility while enabling customization and co-development. Users can optimize the ROCm software
platform's runtimes, programming models, and utilities based on their workloads and scale requirements.

The AMD software stack, shown below, consists of a collection of drivers, development tools, and
APIs that enable GPU programming from low-level kernels to end-user applications. The ROCm Data
Center tools (RDC) and AMD SMI are essential building blocks for cluster management in datacenter
environments.

Figure 6.1: Software Stack with AMD ROCm, Container, and Infrastructure Blocks


Cluster Management with RDC and SMI Tools


ROCm Data Center (RDC) enables GPU cluster administration with capabilities for monitoring,
validation, and policy configuration. It enables full diagnostics and stress testing at the cluster level.
Administrators can use device monitoring, job statistics, and error collection for a group of GPUs in a
cluster, and RDC provides APIs for third-party integration. Full documentation and an API reference are
available in the ROCm Data Center Tool documentation.

The AMD System Management Interface (SMI) is a C library for Linux that provides a user-space
interface to monitor and control AMD devices. The SMI libraries are available in the AMD SMI GitHub
repository.
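
As a minimal monitoring sketch using the Python bindings that accompany the AMD SMI library (function names follow the AMD SMI repository but should be treated as illustrative, since the API has evolved across ROCm releases):

```python
# Enumerate GPUs and read utilization via the amdsmi Python bindings.
from amdsmi import (
    amdsmi_init,
    amdsmi_shut_down,
    amdsmi_get_processor_handles,
    amdsmi_get_gpu_activity,
)

amdsmi_init()
try:
    for handle in amdsmi_get_processor_handles():
        # Returns engine-utilization metrics for this GPU.
        print(amdsmi_get_gpu_activity(handle))
finally:
    amdsmi_shut_down()
```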


Appendix A: Vendor Information


Table A.1: System Designs with MI300X

| Vendor | Link |
|---|---|
| Dell | PowerEdge XE9680 |
| Gigabyte | G593-ZX1-AAX1 |
| Lenovo | ThinkSystem SR685a V3 |
| Supermicro | AS-8125GS-TNMR2 |

Table A.2: NICs in Backend and Frontend Network

| Vendor | Link |
|---|---|
| AMD | AMD Pensando™ Giglio DPU 200G |
| AMD | AMD Pensando™ Pollara 400 |
| AMD | AMD Pensando™ DSC3-400 |
| Broadcom | BRCM957608 (Thor 2) 400G |
| NVIDIA | ConnectX®-7 400G |

Table A.3: Switches in Backend and Frontend Network

| Vendor | Link |
|---|---|
| Arista | Arista 7060 51.2T TH5 |
| Arista | Arista 7060 25.6T TH4 |
| Arista | Arista 7700R4C-38PE Jer 14.4T |
| Arista | Arista 7204-128PE Rmn 102T |
| Juniper | Juniper QFX5240 51.2T TH5 |
| Cisco | G200 51.2T |

Table A.4: Switches in Storage Network

| Vendor | Link |
|---|---|
| Juniper | Juniper QFX5230 |

Table A.5: Out-of-band Management Software

| Vendor | Link |
|---|---|
| Dell | Dell Managed Services |
| Gigabyte | Gigabyte Management Console |
| Lenovo | Lenovo Managed Services |
| Supermicro | Supermicro Server Manager, Supermicro Update Manager |

Table A.6: Storage Systems

| Vendor | Link |
|---|---|
| AMD-Supermicro | WEKAIO Reference Storage |
| IBM | IBM Storage Scale System |


Appendix B: Acronyms
The acronyms used in this document are expanded in the following table.

Table B.1: Acronyms

| Acronym | Definition |
|---|---|
| AAA | Authentication, Authorization, and Accounting |
| AI | Artificial Intelligence |
| AOC | Active Optical Cable |
| APC | Angled Polished Connector |
| API | Application Programming Interface |
| ARP | Address Resolution Protocol |
| AS | Autonomous System |
| BFD | Bidirectional Forwarding Detection |
| BGP | Border Gateway Protocol |
| BIOS | Basic Input/Output System |
| BMC | Baseboard Management Controller |
| CCDM | Collective Communication Data Messages |
| CNP | Congestion Notification Packet |
| CoPP | Control Plane Policing |
| COS | Class of Service |
| CPU | Central Processing Unit |
| DAC | Direct Attached Cable |
| DCBx | Data Center Bridging eXchange |
| DDR | Double Data Rate |
| DHCP | Dynamic Host Configuration Protocol |
| DLB | Dynamic Load Balancing |
| DNN | Deep Neural Network |
| DNS | Domain Name System |
| DRAM | Dynamic Random Access Memory |
| DSCP | Differentiated Services Code Point |
| eBGP | External Border Gateway Protocol |
| ECMP | Equal Cost MultiPath |
| ECN | Explicit Congestion Notification |
| ETS | Enhanced Transmission Selection |
| GPU | Graphics Processing Unit |
| GUI | Graphical User Interface |
| HPC | High Performance Computing |
| I/O | Input/Output |
| IP | Internet Protocol |
| IPAM | IP Address Management |
| IPMI | Intelligent Platform Management Interface |
| LAG | Link Aggregation Group |
| LLDP | Link Layer Discovery Protocol |
| ML | Machine Learning |
| MLAG | Multi-chassis Link Aggregation |
| MMF | MultiMode Fiber |
| NFS | Network File System |
| NIC | Network Interface Card |
| NOC | Network Operations Center |
| NRZ | Non Return to Zero |
| NVM | Non Volatile Memory |
| OAM | OCP Accelerator Module |
| OCP | Open Compute Project |
| ODM | Original Design Manufacturer |
| OEM | Original Equipment Manufacturer |
| OOBM | Out-Of-Band Management Network |
| OS | Operating System |
| OSFP | Octal Small Form-factor Pluggable |
| PAM4 | Pulse Amplitude Modulation 4 |
| PCI | Peripheral Component Interconnect |
| PCIe | PCI Express |
| PDU | Power Distribution Unit |
| PFC | Priority Flow Control |
| QoS | Quality of Service |
| RBAC | Role-Based Access Control |
| RCCL | ROCm Communication Collectives Library |
| RDC | ROCm Data Center |
| RDMA | Remote Direct Memory Access |
| RFC | Request For Comments |
| RoCE | RDMA over Converged Ethernet |
| SerDes | Serializer/Deserializer |
| SFP | Small Form-factor Pluggable |
| SMI | System Management Interface |
| SNMP | Simple Network Management Protocol |
| SOTA | State Of The Art |
| SP | Strict Priority |
| SSD | Solid State Drive |
| SSH | Secure Shell Protocol |
| SVI | Switch Virtual Interface |
| ToR | Top of Rack |
| UALink | Ultra Accelerator Link Consortium |
| UBB | Universal Baseboard |
| UEC | Ultra Ethernet Consortium |
| VLAN | Virtual LAN |
| VM | Virtual Machine |
| VoQ | Virtual Output Queue |
| vPC | Virtual Port Channel |
| VPN | Virtual Private Network |


Appendix C: Additional Resources and Legal Notices
Revision History
The following table shows the revision history for this document.

| Revision | Summary |
|---|---|
| November 2024, Version 1.00 | Initial release. |

Notices
© Copyright 2024 Advanced Micro Devices, Inc.

The information presented in this document is for informational purposes only and may contain technical
inaccuracies, omissions, and typographical errors. The information contained herein is subject to change
and may be rendered inaccurate for many reasons, including but not limited to product and roadmap
changes, component and motherboard version changes, new model and/or product releases, product
differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or
the like. Any computer system has risks of security vulnerabilities that cannot be completely prevented
or mitigated. AMD assumes no obligation to update or otherwise correct or revise this information.
However, AMD reserves the right to revise this information and to make changes from time to time to
the content hereof without obligation of AMD to notify any person of such revisions or changes.

THIS INFORMATION IS PROVIDED "AS IS." AMD MAKES NO REPRESENTATIONS OR WARRANTIES
WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR
ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF NON-INFRINGEMENT,
MERCHANTABILITY, OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD
BE LIABLE TO ANY PERSON FOR ANY RELIANCE, DIRECT, INDIRECT, SPECIAL, OR OTHER
CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED
HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

Trademarks
AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc.

Other product names used in this publication are for identification purposes only and may be trademarks
of their respective companies.
