Instinct MI300 Series Cluster Reference Guide
Network Fabrics.........................................................................................................................................................9
Software Components........................................................................................................................................... 10
Compute Node........................................................................................................................................................ 11
System Management..............................................................................................................................................14
Appendix A: Vendor Information............................................................................................................... 24
Appendix B: Acronyms................................................................................................................................. 26
Notices.......................................................................................................................................................................30
Trademarks.........................................................................................................................................................................30
List of Figures
Figure 2.1: AMD MI300 Series Platform.............................................................................................. 8
Figure 4.1: Server and Rack Design with MI300 Series Accelerators..........................................15
Figure 5.3: Layout of a Full 128 Node Cluster with 64 Port Switches........................................ 20
Figure 6.1: Software Stack with AMD ROCm, Container, and Infrastructure Blocks...............22
List of Tables
Table 1.1: Key Terms................................................................................................................................ 7
Table 4.1: Component Count with 128 Radix Switches, 8 Nodes per Rack............................. 16
Table 5.1: Fat Tree 2-Tier with Switch Radix = 64 (non-blocking 32 downlink, 32
uplink).....................................................................................................................................18
Table 5.2: Rail Topology with 64 Port Switch (32 downlink, 32 uplink)......................................19
Chapter 1: Abstract
Artificial Intelligence (AI) and Machine Learning (ML) models continue to advance in capability and scale
at an increasingly rapid rate. This advancement raises performance and efficiency requirements for every
element of AI/ML infrastructure, including the network. As model scale increases, the workload must be
distributed across Graphics Processing Units (GPUs) operating in parallel at massive scale, and performance
becomes increasingly dependent on the network that moves data between these GPUs.
AI/ML model training and inference operations require the movement and processing of massive
volumes of data. The GPU-GPU communication network must support a wide range of requirements
such as latency-sensitive inference operations and iterative, high-throughput, parallel mathematical
training operations. The highest throughput, lowest latency GPU-GPU data movement occurs within a
node’s accelerator network that connects a group of GPUs.
Additional GPU-GPU data movement occurs across a backend network that connects multiple such
nodes into a large-scale cluster. AI/ML model deployments require a wide range of network scaling, so
the design of the accelerator network and the backend network is crucial as the system grows from small
clusters of inference nodes to much larger training backend networks supporting thousands of nodes and
beyond. The efficiency and performance of GPU-GPU communication fabrics is therefore of critical
importance.
The frontend network in existing data centers supports a wide range of functions, including AI/ML data
ingestion, storage, and management. Traditionally, the frontend network is connected directly to the CPUs
in the nodes.
This reference architecture document outlines the components required to build a backend network
cluster of MI300 Series GPUs, utilizing primarily Ethernet-based network interface cards (NICs) and
switches that scale to meet the growing demands of AI models. AMD MI300 Series products support a
wide range of networking technologies and topologies beyond Ethernet via standard PCIe-based NICs.
AMD is committed to the development and enhancement of open, standards-based networking through
the Ultra Ethernet Consortium (UEC) and the Ultra Accelerator Link (UALink) Consortium.
AMD works with partners to support an open ecosystem of multiple networking solutions including
AMD networking products. This reference architecture document describes a wide range of networking
topologies including fat tree and rail-based topologies.
Key Terms
The following table defines the key terms used in this document.
Terminology Description
Chapter 2: Components of an MI300 Series Cluster
A compute node consists of the AMD MI300 Series Platform together with CPUs, memory, and NIC
devices. Specifications of the AMD MI300 Series reference compute node are given in the following
table. Compute nodes with AMD MI300 Series platforms are available from select vendors (see Vendor
Information).
Component Specification
Network Fabrics
Network fabrics are composed of at least three networks with NICs, switches and cables, as detailed in
the following table. Several such components are listed in Vendor Information.
Out-of-band management network: a separate network with its own NICs, connecting the baseboard management controllers (BMCs)
Software Components
An MI300 Series Cluster requires the following software components.
Chapter 3: Design Requirements
This reference architecture provides a starting point with common usage models for AI/ML or High
Performance Computing (HPC) workloads.
System Design
The following hardware system design components are recommended:
• Scalable cluster architecture based on a scalable unit of 32 compute nodes
• Datacenter racks may have 1, 4, 8 or 16 compute nodes. The specific number of nodes is influenced
by rack power and cooling requirements.
• Supported networking adapters (NICs) and switches from AMD and partners, supporting up to
400 Gb/s
• Storage networking components to support storage servers and the storage network
Compute Node
The compute node consists of eight MI300 Series GPUs interconnected by 4th Gen AMD Infinity
Fabric™ links. A typical compute node also includes dual-socket CPUs, memory, and NICs connected via
two PCIe Gen 5.0 switches.
Performance-optimized designs have a specific mapping of MI300 Series GPU to frontend NICs and
backend NICs as illustrated in Figure 3.1. Each CPU has one directly connected frontend NIC, so there
are a total of two frontend NICs per compute node. Each MI300 Series GPU has one backend NIC that
is connected through the PCIe Switch, so there are a total of eight backend NICs per compute node.
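To make this wiring concrete, the following minimal Python sketch models the mapping described above. The device names are illustrative placeholders rather than real operating-system identifiers, and the split of four GPU/NIC pairs behind each PCIe switch is an assumption made for the illustration.

# Illustrative model of the reference compute node wiring described above.
# Device names (fnic0, bnic0, pcie_sw0, ...) are placeholders, and the split of
# four GPU/NIC pairs per PCIe switch is assumed for this sketch.
from pprint import pprint

def reference_node_map():
    node = {"frontend_nics": {}, "backend_nics": {}}
    # One frontend NIC attached directly to each of the two CPU sockets.
    for cpu in range(2):
        node["frontend_nics"][f"cpu{cpu}"] = f"fnic{cpu}"
    # One backend NIC per GPU, reached through one of the two PCIe Gen 5 switches.
    for gpu in range(8):
        node["backend_nics"][f"gpu{gpu}"] = {
            "nic": f"bnic{gpu}",
            "pcie_switch": f"pcie_sw{gpu // 4}",
        }
    return node

pprint(reference_node_map())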
Frontend Network
The frontend network is the traditional datacenter network composed of switches and network adapters
(NICs) that support storage and management functions.
Accelerator Network
The accelerator network is a high bandwidth, low latency network that connects a group of GPUs and
supports load/store transactions between the GPUs, as shown in the following figure. In MI300 Series
based designs, this network is Infinity Fabric™ interconnecting 8 GPUs in a mesh topology within a
compute node.
Backend Network
The backend network connects a larger set of GPUs, beyond those reachable within the accelerator
(scale-up) network. Each compute node has eight NICs with a 1:1 GPU:NIC ratio, utilizing a PCIe switch
between the GPUs and NICs. RDMA communication using the RoCE protocol, congestion management,
and support for UEC-defined transport-layer improvements are essential in this network. Communication
between GPUs in this network is enhanced by NICs that support acceleration of collective operations.
System Management
For server management and maintenance, system vendors provide management software and interfaces
that perform real-time health monitoring and management on each server, including firmware updates.
Chapter 4: Cluster Architecture
Figure 4.1: Server and Rack Design with MI300 Series Accelerators
Table 4.1: Component Count with 128 Radix Switches, 8 Nodes per Rack
128 1024 8192 128 128 64 320 8192 8192 8192 24576
256 2048 16384 256 256 128 640 16384 16384 16384 49152
Chapter 5: Topology of Network Fabrics
The following configuration choices need to be considered when designing a network topology:
• Blocking factor: A switch has downlink and uplink ports; the blocking factor is defined as
“downlink_port:uplink_port”. For a 64-port switch using 32 uplink and 32 downlink ports, the
blocking factor is 1:1. (A 1:1 blocking factor is defined as a non-blocking configuration).
• Undersubscription: A switch can have fewer downlink ports than uplink ports; for example, a 64-port
switch can have 24 downlink ports and 28 uplink ports, with 12 ports unused. This is referred to as
16% undersubscription. The safe approach is to undersubscribe (especially at higher switch
tiers), but the cost-effective approach is 1:1, which utilizes all switch ports (see the sketch after this list).
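For reference, the following minimal Python sketch computes the blocking factor and the undersubscription percentage for a given port split. Taking the percentage relative to the downlink count is an assumed definition; with it, the 24/28 example above works out to roughly 16%.

from math import gcd

def blocking_factor(downlinks: int, uplinks: int) -> str:
    # Downlink:uplink ratio in lowest terms; 1:1 is non-blocking.
    g = gcd(downlinks, uplinks)
    return f"{downlinks // g}:{uplinks // g}"

def undersubscription_pct(downlinks: int, uplinks: int) -> float:
    # Percent by which uplink ports exceed downlink ports (assumed definition).
    return 100.0 * (uplinks - downlinks) / downlinks

print(blocking_factor(32, 32))                   # 64-port switch, 32/32 -> 1:1
print(f"{undersubscription_pct(24, 28):.1f}%")   # 24/28 with 12 unused -> 16.7%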
The fat tree topology is a familiar, scalable design; some networks may require undersubscription to
mitigate ECMP hash collisions when a blocking design is used. The following diagram and table illustrate a
2-tier, non-blocking fat tree topology.
Table 5.1: Fat Tree 2-Tier with Switch Radix = 64 (non-blocking, 32 downlink, 32 uplink)
GPUs, NICs (1:1) Nodes Leaf Switches Spine Switches
32 4 1 0
64 8 2 1
128 16 4 2
256 32 8 4
512 64 16 8
1024 128 32 16
2048 256 64 32
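The switch counts in Table 5.1 follow directly from the port split. The sketch below, which assumes eight GPUs per node and one backend NIC per GPU, reproduces the rows of the table for a 64-port switch with 32 downlinks and 32 uplinks.

from math import ceil

def fat_tree_2tier(gpus: int, radix: int = 64, downlinks: int = 32) -> dict:
    # Leaf and spine counts for a non-blocking 2-tier fat tree, one NIC per GPU.
    uplinks = radix - downlinks
    leaves = ceil(gpus / downlinks)
    # A single leaf needs no spine layer; otherwise the spines absorb all leaf
    # uplinks, with up to `radix` downlinks per spine switch.
    spines = 0 if leaves == 1 else ceil(leaves * uplinks / radix)
    return {"nodes": gpus // 8, "leaves": leaves, "spines": spines}

for gpus in (32, 64, 128, 256, 512, 1024, 2048):
    print(gpus, fat_tree_2tier(gpus))   # matches the rows of Table 5.1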
Rail Topology
A 2-tier rail design consists of two layers of leaf-spine switches (T1, T2), with the T1 (leaf) switches
connected to the NICs in the backend network. Each of a node's NICs is connected to one port on a
different T1 leaf switch, so same-index NICs across nodes share a leaf and form a rail. A 3rd tier adds a
T3 layer of switches.
A rail topology benefits from containing traffic within rails, thereby minimizing the probability of
congestion. The communication libraries must be aware of the rail connections and the scale-up fabric.
The following diagram and table illustrate a 2-tier rail topology.
Table 5.2: Rail Topology with 64 Port Switch (32 downlink, 32 uplink)
GPUs, NICs (1:1) Nodes Leaf Switches Spine Switches Core Switches
256 32 8 4 --
512 64 16 8 --
768 96 24 12 --
1024 128 32 16 --
2048 256 64 64 32
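As an illustration of the rail wiring described above, the sketch below maps backend NIC i of every node in one 32-node scalable unit to leaf switch i; the switch names and port numbering are placeholders rather than a cabling specification.

NODES_PER_UNIT = 32   # one scalable unit
NICS_PER_NODE = 8     # one backend NIC per GPU

def rail_wiring(unit: int = 0) -> dict:
    # Map (node, nic) -> (leaf switch, downlink port) within one scalable unit.
    wiring = {}
    for node in range(NODES_PER_UNIT):
        for nic in range(NICS_PER_NODE):
            # NIC i of every node lands on the same leaf (rail i), so same-index
            # GPUs are one switch hop apart within their rail.
            leaf = unit * NICS_PER_NODE + nic
            wiring[(node, nic)] = (f"leaf{leaf}", node)
    return wiring

wiring = rail_wiring()
print(wiring[(0, 3)], wiring[(31, 3)])   # both on leaf3: one hop within rail 3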
The following figure shows a full 128-node cluster with the nodes in a rail layout. Within a rail, any node
is one switch hop from any other node. The layout can also be a fat tree in which all the links from a rack
terminate in a leaf switch, with a similar number of leaf and spine switches for the same number of
scalable units.
Figure 5.3: Layout of a Full 128 Node Cluster with 64 Port Switches
To build larger clusters, refer to the following table. The maximum number of nodes depends on the
switch radix and the topology.
Parameter (Count)   2-Tier Rail/Fat Tree (64 port, 400G)   3-Tier Rail/Fat Tree (64 port, 400G)   2-Tier Rail/Fat Tree (128 port, 400G)   3-Tier Rail/Fat Tree (128 port, 400G)
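As a rough guide to the scale each configuration supports, the following sketch computes the standard upper bound on endpoint count for non-blocking 2-tier and 3-tier fat trees (radix^2/2 and radix^3/4 endpoints, respectively); these textbook limits and the assumption of eight GPUs and backend NICs per node are illustrative rather than figures taken from the table above.

def max_nodes(radix: int, tiers: int, gpus_per_node: int = 8) -> int:
    # Upper bound on nodes for a standard non-blocking fat tree, one NIC per GPU.
    if tiers == 2:
        endpoints = radix ** 2 // 2
    elif tiers == 3:
        endpoints = radix ** 3 // 4
    else:
        raise ValueError("only 2- or 3-tier trees are covered in this sketch")
    return endpoints // gpus_per_node

for radix in (64, 128):
    for tiers in (2, 3):
        print(f"{tiers}-tier, {radix}-port: up to {max_nodes(radix, tiers)} nodes")
# 64-port:  2-tier up to 256 nodes,  3-tier up to 8192 nodes
# 128-port: 2-tier up to 1024 nodes, 3-tier up to 65536 nodes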
As datasets for AI workloads continue to expand in size, it is becoming increasingly critical that GPUs are
not constrained by the I/O network and storage systems. The storage fabric provides the path between
GPU memory and the storage systems. Storage systems can be connected via the frontend network, but
they benefit from a separate, high-speed storage network, as shown in the following figure.
A separate, optimized storage network as shown above provides benefits such as:
• Deep learning models access large datasets during training; a dedicated network provides frequent,
iterative access to the data from the GPUs over the storage network.
• As datasets grow in size, the capital and operating expenditures for storage are kept separate from the compute needs.
Chapter 6: AMD Software Stack
The AMD software stack, shown below, consists of a collection of drivers, development tools, and
APIs that enable GPU programming from low-level kernels to end-user applications. The ROCm Data
Center Tool (RDC) and AMD SMI are essential building blocks for cluster management and datacenter
environments.
Figure 6.1: Software Stack with AMD ROCm, Container, and Infrastructure Blocks
The AMD System Management Interface (SMI) is a C library for Linux that provides a user-space interface
to monitor and control AMD devices. The SMI libraries are available in the AMD SMI GitHub repository.
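A minimal health-monitoring sketch using the amdsmi Python binding that accompanies the library is shown below; the function names follow the amdsmi package but may vary between ROCm releases, so verify them against the documentation of the installed version.

# Minimal monitoring sketch using the amdsmi Python binding (function names may
# vary between ROCm releases; verify against the installed library's documentation).
import amdsmi

amdsmi.amdsmi_init()
try:
    for handle in amdsmi.amdsmi_get_processor_handles():
        # Basic per-GPU identification; further metric queries follow the same pattern.
        print(amdsmi.amdsmi_get_gpu_asic_info(handle))
finally:
    amdsmi.amdsmi_shut_down()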
Appendix A: Vendor Information
Vendor Link
Gigabyte G593-ZX1-AAX1
Supermicro AS-8125GS-TNMR2
Appendix B: Acronyms
The acronyms used in this document are expanded in the following table.
Acronym Definition
AI Artificial Intelligence
AS Autonomous System
I/O Input/Output
IP Internet Protocol
ML Machine Learning
OS Operating System
SerDes Serializer/Deserializer
SP Strict Priority
VM Virtual Machine
Additional Resources and Legal Notices
Revision Summary
Initial release.
Notices
© Copyright 2024 Advanced Micro Devices, Inc.
The information presented in this document is for informational purposes only and may contain technical
inaccuracies, omissions, and typographical errors. The information contained herein is subject to change
and may be rendered inaccurate for many reasons, including but not limited to product and roadmap
changes, component and motherboard version changes, new model and/or product releases, product
differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or
the like. Any computer system has risks of security vulnerabilities that cannot be completely prevented
or mitigated. AMD assumes no obligation to update or otherwise correct or revise this information.
However, AMD reserves the right to revise this information and to make changes from time to time to
the content hereof without obligation of AMD to notify any person of such revisions or changes.
Trademarks
AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc.
Other product names used in this publication are for identification purposes only and may be trademarks
of their respective companies.