Building a High-Performance AI DC

As the world moves toward artificial intelligence, the question of how to build a high-performance AI data center arises, and Cisco answers it.


#CiscoLiveAPJC

Building a High-Performance AI DC
Using Nexus, UCS, and NVIDIA GPUs
Andy Sholomon
Director of Technical Marketing
@asholomon
BRKAPP-2698

Cisco Webex App

Questions?
Use Cisco Webex App to chat
with the speaker after the session

How
1. Find this session in the Cisco Live Mobile App
2. Click “Join the Discussion”
3. Install the Webex App or go directly to the Webex space
4. Enter messages/questions in the Webex space

Webex spaces will be moderated by the speaker until December 22, 2023.
https://ciscolive.ciscoevents.com/ciscolivebot/#BRKAPP-2698

#CiscoLiveAPJC BRKAPP-2698 © 2023 Cisco and/or its affiliates. All rights reserved. Cisco Public 4
Agenda
• AI/ML Applications
• What Are Their Requirements?
• AI/ML Blueprint for DC Networks
• Industry Talk
• Conclusion
AI/ML Applications
AI is everywhere!

• Self-driving vehicles
• Conversational agents
• Face recognition
• Machine translation
• Analysis of medical images
• Image generation
• Recommender systems
Main Types of AI/ML Workloads

Training: Training refers to the process of using a machine learning algorithm to build a model. Training involves the use of a deep-learning framework and a training dataset.

Inference: Inference refers to the process of using the trained machine learning model to make a prediction on new data.
Dog or Muffin?

LLMs

LLM stands for Large Language Model. These are machine learning models that have been trained on massive amounts of text data, such as books, articles, and web pages, to understand and generate human language. OpenAI’s GPT-4 is estimated to have about 1.5 trillion parameters; OpenAI has not published the figure.

There are now hundreds of open-source LLMs available. As of this 2023 session, the most powerful is Llama-2, released by Meta in July 2023, with 70 billion parameters. Open-source LLMs have the advantage over proprietary models that they can be fine-tuned for task-specific goals.

https://www.hopsworks.ai/dictionary/llms-large-language-models
What It Takes to Train an LLM

With 8 GPUs per server, it takes 1250 servers running for one month to train an LLM like GPT-3.

The estimate is that it would take ~6000 H100 GPUs, or 750 servers, today.

256 servers for LLaMa.

https://blog.apnic.net/2023/08/10/large-language-models-the-hardware-connection/
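The server counts on this slide follow from the 8-GPUs-per-server figure. A minimal sketch of that arithmetic, where the function names are illustrative and not from the presentation:

```python
import math

GPUS_PER_SERVER = 8  # figure quoted on the slide


def servers_needed(gpu_count: int, gpus_per_server: int = GPUS_PER_SERVER) -> int:
    """Round up, because a partially populated server is still a server."""
    return math.ceil(gpu_count / gpus_per_server)


def gpus_in(servers: int, gpus_per_server: int = GPUS_PER_SERVER) -> int:
    """Total GPUs in a cluster of fully populated servers."""
    return servers * gpus_per_server


print(gpus_in(1250))         # GPUs behind the "1250 servers" figure
print(servers_needed(6000))  # servers behind the "~6000 H100 GPUs" figure
```

With these inputs the 1250-server cluster corresponds to 10,000 GPUs, and ~6000 H100 GPUs correspond to the 750 servers quoted above.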
Power

LLMs

https://deepchecks.com/llm-models-comparison/
99% of customers will not be building infrastructure to train their own LLMs

Inference and Fine Tuning

https://blog.apnic.net/2023/08/10/large-language-models-the-hardware-connection/
Many customers will build GPU clusters in their existing DCs for training “smaller” models, for fine-tuning existing models, and to do inference or generative AI.
LLMs everywhere?

• Self-driving vehicles
• Conversational agents
• Face recognition
• Machine translation
• Analysis of medical images
• Image generation
• Recommender systems
Industry Talk
Ultra Ethernet Consortium - UEC

ECMP, Telemetry Assisted Ethernet and Fully Scheduled Fabrics
Cisco 8000: Power of Disaggregation

Distributed Scheduled Fabric
• 32K+ GPUs/cluster
• Leaf to spine packet spray
• VoQ & Request Grant
• PFC & ECN Leaf to Host

[Diagram: a cluster of Tier 1 spines above a row of leaves, with accelerator racks below the leaves]
AI/ML: High Performance Ethernet Fabrics

One Fabric for all workloads: AI/ML, Storage, Distributed databases, Video / broadcast, VoIP

• Systems & Solutions: Nexus 9K with Nexus Dashboard
  Primary target: Enterprise/Public Sector/Commercial, Service Providers, Tier 2 Web
• Disaggregated or Components: 8000 with SONiC
  Primary target: Hyperscalers
What About InfiniBand?
AI Fabric POC test procedure and results
GPU to GPU TEST

NCCL (NVIDIA Collective Communications Library) Test, ALL GPU to ALL GPU

Test setup
• 2 x NVIDIA DGX A100 6U server with 8 x NVIDIA A100 Tensor Core GPUs
• 8 x Mellanox ConnectX-6
• 2TB of Mem

Verification Result
• GPU server 1 <-> GPU server 2

# mpirun --allow-run-as-root -np 16 -H dgx01:8,dgx02:8 -mca pml ucx \
    -bind-to numa -map-by numa -x LD_LIBRARY_PATH \
    -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=enp12s0,^enp226s0 \
    -x UCX_NET_DEVICES=enp12s0,enp18s0,enp75s0,enp84s0,enp141s0,enp148s0,enp186s0,enp204s0 \
    -x NCCL_IGNORE_CPU_AFFINITY=1 \
    -x NCCL_UCX_TLS=rc,cuda,cuda_copy -x NCCL_UCX_MAX_RNDV_RAILS=4 \
    -x NCCL_MAX_EAGER_RAILS=4 -x NCCL_UCX_RNDV_THRESH=0 \
    -x UCX_MEMTYPE_CACHE=n -x NCCL_PLUGIN_P2P=ucx \
    -x NCCL_COLLNET_ENABLE=1 -x NCCL_MAX_P2P_NCHANNEL=1 \
    -x NCCL_ALGO=Tree,CollNet -x NCCL_DEBUG_SUBSYS=graph,tuning,init,p2p \
    -x NCCL_NET_GDR_READ=1 -x NCCL_NET_GDR_LEVEL=5 \
    -x NCCL_IB_PCI_RELAXED_ORDERING=1 -x NCCL_NET_P2P_LEVEL=5 \
    -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
    /bmt/mkis/workspace/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1 -n 10 -w 100 -p 0 -z 0 -m 1 -t 1 -c 1 \
    | tee result/2node_1set_`date +%y%m%d_%H%M`_$$.log

Max 161 GB/s
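The all_reduce_perf tool from nccl-tests reports two bandwidth figures: algorithm bandwidth (data size divided by elapsed time) and bus bandwidth, which for all-reduce scales the algorithm bandwidth by 2(n-1)/n for n ranks. The slide does not say which of the two the 161 GB/s figure is, so the sketch below only illustrates the relationship; the sample size and timing are made up.

```python
def algorithm_bandwidth_gbps(size_bytes: int, seconds: float) -> float:
    """Algorithm bandwidth in GB/s: bytes moved per rank divided by elapsed time."""
    return size_bytes / seconds / 1e9


def allreduce_bus_bandwidth_gbps(algbw_gbps: float, n_ranks: int) -> float:
    """Bus bandwidth for all-reduce as defined by nccl-tests: algbw * 2(n-1)/n."""
    return algbw_gbps * 2 * (n_ranks - 1) / n_ranks


# Hypothetical sample: an 8 GB all-reduce across 16 ranks finishing in 0.1 s.
algbw = algorithm_bandwidth_gbps(8 * 10**9, 0.1)
print(algbw, allreduce_bus_bandwidth_gbps(algbw, 16))
```

The 16 ranks match the `-np 16` in the command above (two servers with 8 GPUs each).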
AI Fabric POC test procedure and results
GPU to GPU TEST

NCCL test result comparison charts (Ethernet vs InfiniBand)
[Chart: bandwidth vs. message size (bytes)]
[Chart: latency vs. message size (bytes)]
What Are AI/ML Application Network Requirements
What is a Neural Network?

A neural network is a method in artificial intelligence that teaches computers to process data in a way that is inspired by the human brain. It is a type of machine learning process, called deep learning, that uses interconnected nodes or neurons in a layered structure that resembles the human brain.
Neural Network Example (Very Oversimplified)
Should I go surfing?
1. Are the waves good? (Yes: 1, No: 0)
2. Is the line-up empty? (Yes: 1, No: 0)
3. Has there been a recent shark attack? (Yes: 0, No: 1)

Then, let’s assume the following inputs:

X1 = 1, since the waves are pumping
X2 = 0, since the crowds are out
X3 = 1, since there hasn’t been a recent shark attack

Now, we need to assign some weights to determine importance. Larger weights signify that particular variables are of greater importance to the decision or outcome.

W1 = 5, since large swells don’t come around often
W2 = 2, since you’re used to the crowds
W3 = 4, since you have a fear of sharks

Finally, we’ll also assume a threshold value of 3, which translates to a bias value of –3. With all the inputs in hand, we can plug values into the formula to get the output.

Y-hat = (1*5) + (0*2) + (1*4) – 3 = 6

If we use the activation function, we can determine that the output of this node would be 1, since 6 is greater than 0. In this instance, you would go surfing.

https://www.ibm.com/topics/neural-networks
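The surfing decision can be written as a single neuron: a weighted sum plus a bias, passed through a step activation. This is a minimal sketch using the inputs, weights, and bias from the example; the function names are illustrative.

```python
def step(z: float) -> int:
    """Step activation: fire (1) only when the weighted sum is greater than 0."""
    return 1 if z > 0 else 0


def neuron(inputs, weights, bias):
    """Return (weighted sum plus bias, activated output) for one node."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return z, step(z)


# X1..X3, W1..W3 and the bias of -3 come from the surfing example.
y_hat, go_surfing = neuron([1, 0, 1], [5, 2, 4], -3)
print(y_hat, go_surfing)  # 6 1
```

Setting X1 to 0 (no waves) gives 0 + 0 + 4 – 3 = 1, which still clears the threshold; the weights decide how much each answer matters.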
Parallelism Mechanisms (collectives)
• NVIDIA Collective Communications Library (NCCL) supports several data/job distribution methods called collectives, namely:
• AllReduce
• Broadcast
• Reduce
• AllGather
• ReduceScatter
• RingReduce (a ring-based way of carrying out a reduction rather than a separate NCCL collective)

• Each method has its pros and cons; choice is left to the developer
• PyTorch (the top Python library for deep learning) defaults to AllReduce*
• Informal research (scanning Git repos, reading blogs) shows virtually every developer leaves this unchanged
* https://pytorch.org/docs/stable/ddp_comm_hooks.html
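To make the semantics of the collectives concrete, the sketch below simulates them in a single process, with one Python list per rank. It is not the NCCL or PyTorch API and moves no data over a network; it only shows what each rank holds after the operation, assuming a sum reduction.

```python
def all_reduce(bufs):
    """Every rank ends up with the element-wise sum of all ranks' buffers."""
    total = [sum(vals) for vals in zip(*bufs)]
    return [list(total) for _ in bufs]


def broadcast(bufs, root=0):
    """Every rank ends up with a copy of the root rank's buffer."""
    return [list(bufs[root]) for _ in bufs]


def all_gather(bufs):
    """Every rank ends up with the concatenation of all ranks' buffers."""
    gathered = [x for buf in bufs for x in buf]
    return [list(gathered) for _ in bufs]


def reduce_scatter(bufs):
    """The element-wise sum is split into equal chunks, one chunk per rank."""
    n = len(bufs)
    total = [sum(vals) for vals in zip(*bufs)]
    chunk = len(total) // n
    return [total[r * chunk:(r + 1) * chunk] for r in range(n)]


ranks = [[1, 2], [3, 4]]  # two ranks, two gradient values each
print(all_reduce(ranks))      # [[4, 6], [4, 6]]
print(all_gather(ranks))      # [[1, 2, 3, 4], [1, 2, 3, 4]]
print(reduce_scatter(ranks))  # [[4], [6]]
```

AllReduce is equivalent to a ReduceScatter followed by an AllGather, which is why every GPU in the job exchanges data with the others on each training step.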
Different Traffic Patterns and Job Sizes

GPU Communications “Inside” the Server

Main Infrastructure Requirements for Training
In a DL training infrastructure, it is crucial to get as much raw compute power and as many nodes as you can afford. Think multi-core processors and GPUs.

The more nodes and the more mathematical accuracy you can build into your cluster, the faster and more accurate your training will be.

Huge training datasets require massive networking and storage capabilities to hold and transfer the data, especially if your data is image-based or heterogeneous. Plan ahead for adequate networking and storage capacity, not just for strong computing.

Based on https://www.hpcwire.com/2022/06/13/infrastructure-requirements-for-ai-inference-vs-training/
Main Infrastructure Requirements for Inference
Inference clusters should be optimized for performance. Think simpler hardware with less power than the training cluster but with the lowest latency possible.

Throughput is critical to inference. The process requires high I/O bandwidth and enough memory to hold both the required trained model(s) and the input data without having to make calls back to the storage components of the cluster.

Data center resource requirements for a single inference instance are typically not as great as those for training.

Based on https://www.hpcwire.com/2022/06/13/infrastructure-requirements-for-ai-inference-vs-training/
Backend and Frontend Connectivity

The “backend network” is used only for GPU-to-GPU connectivity

The “frontend network” is used for all other communications and can be
used for IP storage access as well

For very large clusters an IP storage network can also be used

Multiple Connections May Be Required For AI Workloads
Front End / Storage & Outside World
• Nexus design across 2 x N9K-C9364C-GX
• All 32 servers connect 2 x 100G NICs = 64 x 100G

Leaf Layer: 8 x N9K-C9364D-GX2A
• 8 x servers per pair of TORs
• 4 NICs x 400G (2x) per server
• 32 x 400G server ports per leaf
• 32 x 400G to spines per leaf (8 x 400G to each spine)

Spine Layer: 4 x N9K-C9364D-GX2A

[Diagram: 100G front-end links and 400G leaf-to-spine links]
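The leaf port counts above give equal server-facing and spine-facing capacity, which is what makes the fabric non-blocking. A small sketch of that check, with an illustrative function name:

```python
def oversubscription(down_ports: int, down_gbps: int, up_ports: int, up_gbps: int) -> float:
    """Ratio of server-facing to spine-facing capacity; 1.0 means non-blocking."""
    return (down_ports * down_gbps) / (up_ports * up_gbps)


# Per-leaf figures from the slide: 32 x 400G to servers, 32 x 400G to spines.
ratio = oversubscription(32, 400, 32, 400)
print(f"{ratio:.1f}:1")  # 1.0:1
```

A ratio above 1.0 would mean the servers on a leaf can offer more traffic than the uplinks can carry, which is what a backend GPU fabric is designed to avoid.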
What is Needed to Build The Best AI/ML DC Network?
• Non-Blocking Fabric
• Low Latency
• Lossless RoCE
• Visibility
RDMA Over Converged Ethernet - RoCE

[Diagram: two servers, each with system memory, CPU, GPU memory, GPU, PCIe and an RDMA NIC, connected across either an IB network or an Ethernet network]

Frame formats (Eth Type identifies the frame):
RoCE:   Eth L2 Header | IB GRH | IB BTH+ (L4 HDR) | IB Payload | ICRC | FCS
RoCEv2: Eth L2 Header | IP Header | UDP Header | IB BTH+ (L4 HDR) | IB Payload | ICRC | FCS
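The RoCEv2 layout can be turned into a rough per-packet overhead estimate. The sketch below assumes IPv4 without options, the 12-byte base BTH with no extended transport headers (the "+" in "BTH+"), and an optional 802.1Q tag; RoCEv2 uses UDP destination port 4791. Preamble and inter-frame gap are ignored.

```python
ROCEV2_UDP_DST_PORT = 4791  # IANA-assigned port for RoCEv2

# (name, bytes) in on-the-wire order. Sizes are an assumption of this sketch:
# untagged Ethernet, IPv4 without options, base BTH only.
HEADERS = [("Ethernet", 14), ("IPv4", 20), ("UDP", 8), ("IB BTH", 12)]
TRAILERS = [("ICRC", 4), ("FCS", 4)]
VLAN_TAG_BYTES = 4


def overhead_bytes(vlan_tagged: bool = False) -> int:
    """Bytes added around the IB payload by the RoCEv2 encapsulation."""
    total = sum(size for _, size in HEADERS + TRAILERS)
    return total + (VLAN_TAG_BYTES if vlan_tagged else 0)


def wire_efficiency(payload_bytes: int, vlan_tagged: bool = False) -> float:
    """Fraction of each frame that is IB payload."""
    return payload_bytes / (payload_bytes + overhead_bytes(vlan_tagged))


print(overhead_bytes(), overhead_bytes(vlan_tagged=True))
print(round(wire_efficiency(4096, vlan_tagged=True), 4))
```

Because the overhead is fixed per packet, larger payloads (and therefore a larger MTU) carry proportionally more useful data.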
Lossless Ethernet

• Lossless Ethernet for RoCEv2
• End-to-End
• ECN (Explicit Congestion Notification) + WRED or AFD
• PFC (Priority Flow Control)
• Buffers (smart and the right amount)
• Tuning
Intelligent Buffers

Congestion Management Needs to Work As a System
No Congestion Example

[Diagram: spine S1, leaves L1, L2 and LX, and Host A, Host B and Host X; RoCE traffic uses queue Q3 on every switch. Host X sends to Host A and Host B. The packets (SIP:Host X; DIP:Host A and SIP:Host X; DIP:Host B) carry ECN bits 10 on every hop.]

Light Congestion Example

[Diagram: same topology and flows. The packets carry ECN bits 10 on the links through S1 and ECN bits 11 (congestion experienced) on the last hop toward Host A and Host B.]

Light Congestion Example – CNP Message

[Diagram: same topology. CNP packets for the Host X to Host A and Host X to Host B flows are shown on every link.]

CNP messages must be marked correctly to get into the priority queue.
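The ECN field is the low two bits of the IP TOS byte, and the values in the congestion examples are best read as binary: 10 is ECT(0), meaning ECN-capable, and 11 is CE, meaning congestion experienced. A minimal sketch of how a congested switch remarks a packet; the helper names are illustrative, and DSCP 24 is used only because it is the RoCE marking configured elsewhere in this deck.

```python
# ECN codepoints from RFC 3168 (low two bits of the IP TOS / Traffic Class byte).
NOT_ECT, ECT_1, ECT_0, CE = 0b00, 0b01, 0b10, 0b11


def ecn_bits(tos: int) -> int:
    """Extract the two ECN bits from a TOS byte."""
    return tos & 0b11


def mark_congestion(tos: int) -> int:
    """Set CE on an ECN-capable packet; leave non-ECN-capable packets untouched."""
    if ecn_bits(tos) == NOT_ECT:
        return tos  # a real switch would fall back to dropping, not marking
    return tos | CE


tos = (24 << 2) | ECT_0           # DSCP 24 with ECT(0)
marked = mark_congestion(tos)
print(bin(ecn_bits(tos)), bin(ecn_bits(marked)))  # 0b10 0b11
```

The receiver that sees CE answers with a CNP toward the sender, which then slows down. This is why the CNP itself must land in a priority queue.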
PFC Congestion Leaf

[Diagram: same topology; Host A and Host B send to Host X (SIP:Host A; DIP:Host X; ECN bits 10 and SIP:Host B; DIP:Host X; ECN bits 10). One PFC frame is shown between leaf LX and spine S1.]

PFC Congestion Leaf + Spine

[Diagram: same flows. PFC frames are now also shown between spine S1 and leaves L1 and L2.]

PFC Congestion Leaf + Spine + Leaf

[Diagram: same flows. PFC frames are now also shown between leaves L1 and L2 and Host A and Host B.]
To get RoCEv2 to work as expected, it is critical to configure the NICs and OS correctly
Configure Your Ethernet NICs
Mellanox ConnectX-6 NIC, 2 Port 100G Ethernet NIC

cat /etc/netplan/00-installer-config.yaml
<CLIP>
    ens1f0np0:
      dhcp4: false
    ens1f1np1:
      dhcp4: false
  vlans:                        # VLAN config required for PFC
    ens1f0np0.101:
      id: 101                   # VLAN encap is 101
      link: ens1f0np0
      addresses:
      - 192.168.101.11/24
      routes:
      - to: 192.168.0.0/16
        via: 192.168.101.1
        metric: 100
    ens1f1np1.101:
      id: 101
      link: ens1f1np1
      addresses:
      - 172.16.101.11/24
      routes:
      - to: 172.16.0.0/16
        via: 172.16.101.1
        metric: 100
Configure the “marking” (Ubuntu Example)

sudo apt-get install vlan

# Enable ECN for class 3
sudo bash -c 'echo 1 > /sys/class/net/ens1f0np0/ecn/roce_np/enable/3'
sudo bash -c 'echo 1 > /sys/class/net/ens1f1np1/ecn/roce_np/enable/3'

# Set CNP L2 egress priority to 6 (“enable CNP”)
sudo bash -c 'echo 6 > /sys/class/net/ens1f0np0/ecn/roce_np/cnp_802p_prio'
sudo bash -c 'echo 6 > /sys/class/net/ens1f1np1/ecn/roce_np/cnp_802p_prio'

# Map CNP to DSCP 48
sudo bash -c 'echo 48 > /sys/class/net/ens1f0np0/ecn/roce_np/cnp_dscp'
sudo bash -c 'echo 48 > /sys/class/net/ens1f1np1/ecn/roce_np/cnp_dscp'

# Map PFC to priority 3
sudo mlnx_qos -i ens1f0np0 --trust=dscp --pfc 0,0,0,1,0,0,0,0
sudo mlnx_qos -i ens1f1np1 --trust=dscp --pfc 0,0,0,1,0,0,0,0

# Enable ECN for traffic
sudo sysctl -w net.ipv4.tcp_ecn=1

# Set mode to RoCEv2 on the NIC
sudo cma_roce_mode -d mlx5_0 -p 1 -m 2
sudo cma_roce_mode -d mlx5_1 -p 1 -m 2

# Set TOS = DSCP CS3 for RoCEv2 traffic
sudo cma_roce_tos -d mlx5_0 -t 96
sudo cma_roce_tos -d mlx5_1 -t 96

# Egress map RoCE traffic to priority 3
sudo vconfig set_egress_map ens1f0np0.101 4 3
sudo vconfig set_egress_map ens1f1np1.101 4 3
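The numbers in these commands are related: a TOS byte is the 6-bit DSCP shifted left by two bits (the low two bits are ECN), so TOS 96 is DSCP 24 (CS3), and the `--pfc` argument is one flag per priority 0 to 7. A small sketch for sanity-checking the values; the helper names are illustrative.

```python
def dscp_to_tos(dscp: int) -> int:
    """TOS byte for a DSCP value with the two ECN bits left at 0."""
    return dscp << 2


def tos_to_dscp(tos: int) -> int:
    """DSCP value carried in the upper six bits of a TOS byte."""
    return tos >> 2


def pfc_priorities(bitmap: str) -> list[int]:
    """Priorities enabled in an mlnx_qos-style --pfc list such as '0,0,0,1,0,0,0,0'."""
    return [prio for prio, flag in enumerate(bitmap.split(",")) if flag.strip() == "1"]


print(tos_to_dscp(96))                    # 24, i.e. CS3, used for RoCEv2 traffic
print(dscp_to_tos(48))                    # 192, the TOS byte for the CNP DSCP
print(pfc_priorities("0,0,0,1,0,0,0,0"))  # [3]
```

The host values need to line up with the switch side: DSCP 24 for RoCE traffic and DSCP 48 for CNP are the same values the fabric classifies on.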
Recommendations for Server Tuning*
*These recommendations were developed in conjunction with Vijay Durairaj
Cisco GPU-Accelerated Platform Offerings

System Tuning Recommendations
• BIOS tunings:
  - Processor Configuration:
      Intel Virtualization Technology : <Enabled>
      Intel Hyper-Threading Technology : <Enabled>
      CPU Performance : <HPC>
  - Power & Performance Configuration:
      Processor C1E : <Disabled>
      Processor C6 : <Enabled>
      Package C-State control : <C0/C1 State>
      Uncore Frequency Scaling: <Enabled>
• Thermal : Fan Control Policy : <Maximum Power>
• PCIe : Width, Speed, Max Payload Size, Max Read Request
• OS tunings : CPU Power, TCP Parameters & Memory Settings

Notes on the BIOS settings:
• C1E is one of the automatic power-saving features and is triggered when the system is idle. It is best to disable it when overclocking.
• With C6, the BIOS will automatically disable CPU core and cache for power saving.
• In the C0/C1 state all of the cores are locked at maximum performance, which causes higher power consumption.
• Uncore Frequency Scaling enabled stops the processor from dynamically changing frequencies based on workload to save power.
BIOS Tuning - CPU Performance (HPC)

BIOS Tuning – Processor Power States

Thermal – Fan control Policy
OS Tuning - 1
• Mellanox automatic tuning for performance. This command is part of the mlnx_tune utility and applies all the NIC optimizations recommended by Mellanox:
# mlnx_tune -p HIGH_THROUGHPUT

The rest of these are recommendations by Mellanox for “Tuning the Network Adapter for Improved IPv4 Traffic Performance”:
https://mellanox.my.site.com/mellanoxcommunity/s/article/linux-sysctl-tuning

• Disable the TCP timestamps for better CPU utilization:
# sysctl -w net.ipv4.tcp_timestamps=0

• Enable the TCP selective acks option for better throughput:
# sysctl -w net.ipv4.tcp_sack=1

• Increase the maximum length of processor input queues:
# sysctl -w net.core.netdev_max_backlog=300000

• Increase the TCP maximum and default buffer sizes using setsockopt():
# sysctl -w net.core.rmem_max=4194304
# sysctl -w net.core.wmem_max=4194304
# sysctl -w net.core.rmem_default=4194304
# sysctl -w net.core.wmem_default=4194304
# sysctl -w net.core.optmem_max=4194304
OS Tuning - 2
These are recommendations by Mellanox for “Tuning the Network Adapter for Improved IPv4 Traffic Performance”:
https://mellanox.my.site.com/mellanoxcommunity/s/article/linux-sysctl-tuning

• Increase memory thresholds to prevent packet dropping:
# sysctl -w net.ipv4.tcp_rmem="4096 87380 4194304"
# sysctl -w net.ipv4.tcp_wmem="4096 65536 4194304"

• Enable low latency mode for TCP:
# sysctl -w net.ipv4.tcp_low_latency=1
# sysctl -w net.ipv4.tcp_adv_win_scale=1

• Disable OS watchdog & NUMA balancing:
# echo 0 > /proc/sys/kernel/nmi_watchdog
# echo 0 > /proc/sys/kernel/numa_balancing

• Fine-tune the kernel task scheduler. Values are in nanoseconds (100000000 ns = 100 ms and 50000000 ns = 50 ms):
# echo 100000000 > /proc/sys/kernel/sched_min_granularity_ns
# echo 50000000 > /proc/sys/kernel/sched_migration_cost_ns
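Settings applied with `sysctl -w` are lost at reboot. One common way to keep them is a drop-in file under /etc/sysctl.d/. The sketch below only renders the file contents from a dictionary of the values listed on these slides; the selection of keys is partial and the target file name in the comment is a hypothetical choice, not something the presentation specifies.

```python
# A subset of the sysctl values from the OS tuning slides.
SETTINGS = {
    "net.ipv4.tcp_timestamps": "0",
    "net.ipv4.tcp_sack": "1",
    "net.core.netdev_max_backlog": "300000",
    "net.core.rmem_max": "4194304",
    "net.core.wmem_max": "4194304",
    "net.ipv4.tcp_rmem": "4096 87380 4194304",
    "net.ipv4.tcp_wmem": "4096 65536 4194304",
}


def render_sysctl_conf(settings: dict[str, str]) -> str:
    """Render settings in the 'key = value' format read by sysctl --system."""
    return "".join(f"{key} = {value}\n" for key, value in settings.items())


# Hypothetical destination: /etc/sysctl.d/90-ai-fabric.conf
print(render_sysctl_conf(SETTINGS), end="")
```

Writing the file is left out on purpose; reviewing the rendered text before placing it on a server is the safer workflow.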
PCIE Tuning for the Best Performance
Default PCIE Attributes : Width | Speed | Max Payload Size | Max Read Request

Ubuntu/Debian: sudo apt install pciutils

• Use the list PCI command (lspci) to find out where the Mellanox NICs are connected.
• Use the list PCI command to show the settings of location 5e:00.0.
• Speed determines the number of PCIe transfers possible. It is measured in GT/s, which stands for gigatransfers per second (billions of transfers per second).
• The PCIe Max Payload Size determines the maximum size of a PCIe packet, or PCIe MTU (similar to networking protocols). This parameter is set only by the system and depends on the chipset architecture (e.g. x86_64, Power8, ARM, etc.).
PCIE Tuning for the Best Performance
Tuned PCIE Attributes : Width | Speed | Max Payload Size | Max Read Request

The first digit is the PCIe Max Read Request size selector.

The acceptable values are: 0 - 128B, 1 - 256B, 2 - 512B, 3 - 1024B, 4 - 2048B and 5 - 4096B.

Ubuntu/Debian: sudo apt install pciutils
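The selector table follows a simple rule: each step doubles the size, starting at 128 bytes. The sketch below decodes and encodes the selector and swaps the first hex digit of a four-digit register value, as the slide describes. The sample value "2936" is hypothetical, and the helper names are illustrative.

```python
# Selector -> Max Read Request size in bytes (0 -> 128B ... 5 -> 4096B).
MAX_READ_REQUEST_BYTES = {selector: 128 << selector for selector in range(6)}


def selector_for(size_bytes: int) -> int:
    """Return the selector digit for a Max Read Request size."""
    for selector, size in MAX_READ_REQUEST_BYTES.items():
        if size == size_bytes:
            return selector
    raise ValueError(f"unsupported Max Read Request size: {size_bytes}")


def with_max_read_request(register_hex: str, size_bytes: int) -> str:
    """Replace the first hex digit of a 4-digit register value with the new selector."""
    return f"{selector_for(size_bytes):x}{register_hex[1:]}"


print(MAX_READ_REQUEST_BYTES[5])              # 4096
print(with_max_read_request("2936", 4096))    # 5936
```

Only the first digit changes; the remaining three digits hold other settings and are carried over untouched.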
What Does The Network Configuration Look Like?
QOS Configs Classification

CLASS MAP
! Put packets with DSCP 24 into “class-q3”
class-map type qos match-all class-q3
  match dscp 24
! Put packets with DSCP 48 into “class-q7”
class-map type qos match-all class-q7
  match dscp 48

POLICY MAP
policy-map type qos QOS_classification_policy
  ! Put packets from class-q3 into queue 3
  class class-q3
    set qos-group 3
  ! Put packets from class-q7 into queue 7
  class class-q7
    set qos-group 7
  ! Everything else goes into default or queue 0
  class class-default
    set qos-group 0
QOS Configs Queuing and Scheduling

POLICY MAP “type queuing”
policy-map type queuing custom-8q-out-policy
  ! Queue 7 is set up as a priority queue
  class type queuing c-out-8q-q7
    priority level 1
  class type queuing c-out-8q-q6
    bandwidth remaining percent 0
  class type queuing c-out-8q-q5
    bandwidth remaining percent 0
  class type queuing c-out-8q-q4
    bandwidth remaining percent 0
  ! Queue 3 receives 60% of the bandwidth, with ECN marking enabled starting at
  ! 150KB of buffer utilization and a drop probability of 7%
  class type queuing c-out-8q-q3
    bandwidth remaining percent 60
    random-detect minimum-threshold 150 kbytes maximum-threshold 3000 kbytes drop-probability 7 weight 0 ecn
  ! Queues 1 and 2 receive 0% guaranteed bandwidth
  class type queuing c-out-8q-q2
    bandwidth remaining percent 0
  class type queuing c-out-8q-q1
    bandwidth remaining percent 0
  ! Default queue receives 40%
  class type queuing c-out-8q-q-default
    bandwidth remaining percent 40
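The `random-detect` line sets a minimum threshold, a maximum threshold, and a maximum probability. The sketch below uses the textbook WRED ramp to show how those three numbers interact: no marking below the minimum, a linear rise to the configured probability at the maximum, and marking of everything beyond it. The exact behaviour of the switch hardware may differ, so treat this as an assumption rather than a description of NX-OS internals.

```python
def ecn_mark_probability(
    queue_kbytes: float,
    min_kbytes: float = 150.0,      # minimum-threshold from the policy
    max_kbytes: float = 3000.0,     # maximum-threshold from the policy
    max_probability: float = 0.07,  # drop-probability 7
) -> float:
    """Textbook WRED ramp: 0 below min, linear up to max_probability, then 1.0."""
    if queue_kbytes <= min_kbytes:
        return 0.0
    if queue_kbytes >= max_kbytes:
        return 1.0
    fill = (queue_kbytes - min_kbytes) / (max_kbytes - min_kbytes)
    return max_probability * fill


for depth in (100, 150, 1575, 2999, 3000):
    print(depth, round(ecn_mark_probability(depth), 4))
```

With these thresholds a queue that is halfway between 150 KB and 3000 KB marks about 3.5% of packets, so senders are slowed gently long before PFC has to pause the link.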
QOS Configs cont.

POLICY MAP “type network-qos”
! Use a policy-map of type network-qos to add PFC to queue 3
policy-map type network-qos custom-8q-nq-policy
<snip>
  class type network-qos c-8q-nq3
    ! MTU is used as a way to calculate headroom for the non-drop queue
    mtu 9216
    pause pfc-cos 3
<snip>

SYSTEM QOS
! Attach policy maps of "type queuing" and "type network-qos" system wide
system qos
  service-policy type network-qos custom-8q-nq-policy
  service-policy type queuing output custom-8q-out-policy

This enables ECN marking and ensures that ports configured with PFC receive and honor pause frames, as well as generate pause frames as needed.
QOS Configs Interfaces

interface Ethernet1/1
  ! Attach the service policy to the interface
  service-policy type qos input QOS_classification_policy
  ! Enable PFC
  priority-flow-control mode on
  ! Enable PFC watchdog using the default interval of 100 milliseconds
  priority-flow-control watch-dog-interval on
AI/ML Blueprint for DC Networks
The Blueprint For Today
Built to accommodate 1024 GPUs along with storage devices

The Blueprint and the CVD

Nexus Dashboard Fabric Controller Automation

Nexus Dashboard Insights - Visibility
* Coming Soon
Conclusion
• The AI/ML market and technology are moving quite quickly
• If you are building a very large training cluster, please speak with your sales team
• Use the best practices in the presentation to get started
• We have a blueprint and best practices to get you started on your journey today
• The blueprint will continue evolving as standards and equipment change
Session Surveys
We would love to know your feedback on this session!
• Complete a minimum of four session surveys and the overall event surveys to claim a Cisco Live T-Shirt
Continue your education
• Visit the Cisco Showcase for related demos
• Book your one-on-one Meet the Expert meeting
• Attend the interactive education with DevNet, Capture the Flag, and Walk-in Labs
• Visit the On-Demand Library for more sessions at www.CiscoLive.com/on-demand
Thank you
