Building a High-Performance AI DC

As the entire world moves toward artificial intelligence, the question of how to build a high-performance AI DC arises, and Cisco answers it.

Uploaded by

srikrishnak


#CiscoLiveAPJC

Building a High-Performance AI DC
Using Nexus, UCS, and NVIDIA GPUs
Andy Sholomon
Director of Technical Marketing
@asholomon
BRKAPP-2698

Cisco Webex App

Questions?
Use the Cisco Webex App to chat with the speaker after the session.

How
1. Find this session in the Cisco Live Mobile App
2. Click "Join the Discussion"
3. Install the Webex App or go directly to the Webex space
4. Enter messages/questions in the Webex space

Webex spaces will be moderated by the speaker until December 22, 2023.
https://ciscolive.ciscoevents.com/ciscolivebot/#BRKAPP-2698

#CiscoLiveAPJC BRKAPP-2698 © 2023 Cisco and/or its affiliates. All rights reserved. Cisco Public 4
Agenda
• AI/ML Applications
• What Are Their Requirements?
• AI/ML Blueprint for DC Networks
• Industry Talk
• Conclusion

AI/ML Applications
AI is everywhere!

Self-driving vehicles

Conversational agents
Face recognition

Machine translation

Analysis of medical images

Image generation

Recommender systems
Main Types of AI/ML Workloads

Training: Training refers to the process of using a machine learning algorithm to build a model. Training involves the use of a deep-learning framework and a training dataset.

Inference: Inference refers to the process of using the trained machine learning model to make a prediction.
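
The difference between the two workload types can be illustrated with a deliberately tiny sketch. It assumes a one-parameter linear model y = w * x fitted by least squares; the function names `train` and `infer` and the sample data are hypothetical and not taken from the session.

```python
# Training: use an algorithm plus a training dataset to build a model.
# The "model" here is a single weight w for y = w * x (illustrative assumption).
def train(xs, ys):
    # Closed-form least squares for a line through the origin.
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


# Inference: use the already-trained model to make a prediction on new input.
def infer(w, x):
    return w * x


training_xs = [1.0, 2.0, 3.0]   # hypothetical training inputs
training_ys = [2.0, 4.0, 6.0]   # hypothetical training labels

model = train(training_xs, training_ys)
prediction = infer(model, 4.0)
print(model, prediction)
```

Training touches the whole dataset to produce the weight, while inference is a single cheap evaluation, which is one reason the two workloads stress infrastructure so differently.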

Dog or Muffin?

LLMs

LLM stands for Large Language Model. These are machine learning models that have been trained on massive amounts of text data, such as books, articles, and web pages, to understand and generate human language. OpenAI's GPT-4 is estimated at roughly 1.5 trillion parameters; OpenAI has not published an official count.

There are now hundreds of open-source LLMs available. At the time of this session, the most powerful is Llama-2, released by Meta in July 2023, with 70 billion parameters. Open-source LLMs have the advantage over proprietary models that they can be fine-tuned for task-specific goals.

https://www.hopsworks.ai/dictionary/llms-large-language-models
What It Takes to Train an LLM

With 8 GPUs per server, it takes 1250 servers running for one month to train an LLM like GPT-3.

The estimate is that it would take ~6000 H100 GPUs, or 750 servers, today.

256 servers for LLaMA.

https://blog.apnic.net/2023/08/10/large-language-models-the-hardware-connection/
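
The server and GPU counts on this slide are related by simple arithmetic, sketched below. The only input taken from the slide is 8 GPUs per server; the helper names `servers_needed` and `gpus_in` are hypothetical.

```python
import math

GPUS_PER_SERVER = 8  # stated on the slide


def servers_needed(gpus, gpus_per_server=GPUS_PER_SERVER):
    # Round up: a partially filled server is still a server.
    return math.ceil(gpus / gpus_per_server)


def gpus_in(servers, gpus_per_server=GPUS_PER_SERVER):
    return servers * gpus_per_server


print(servers_needed(6000))  # servers for ~6000 H100 GPUs
print(gpus_in(1250))         # GPUs in the 1250-server GPT-3 estimate
print(gpus_in(256))          # GPUs in the 256-server LLaMA figure
```

With these inputs, 6000 GPUs map to the 750 servers quoted on the slide, and 1250 and 256 servers correspond to 10,000 and 2,048 GPUs respectively.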

Power

LLMs

https://deepchecks.com/llm-models-comparison/
99% of customers will not be building infrastructure to train their own LLMs
Inference and Fine Tuning

https://blog.apnic.net/2023/08/10/large-language-models-the-hardware-connection/

Many customers will build GPU clusters in their existing DCs for training "smaller" models, for fine-tuning existing models, and to do inference or generative AI.
LLMs everywhere?

Self-driving vehicles

Conversational agents
Face recognition

Machine translation

Analysis of medical images

Image generation

Recommender systems
Industry Talk
Ultra Ethernet Consortium - UEC

ECMP, Telemetry Assisted Ethernet and Fully Scheduled Fabrics

Cisco 8000: Power of Disaggregation

Distributed Scheduled Fabric
• 32K+ GPUs/cluster
• Leaf to spine packet spray
• VoQ & Request Grant
• PFC & ECN Leaf to Host

[Figure: a cluster of accelerator racks attached to leaf switches, which connect upward to Tier 1 spine switches]
AI/ML: High Performance Ethernet Fabrics
One Fabric for all workloads: AI/ML, Storage, Distributed databases, Video / broadcast, VoIP

Primary target: Enterprise / Public Sector / Commercial, Service Providers, Tier 2 Web
• Systems & Solutions: Nexus 9K with Nexus Dashboard

Primary target: Hyperscalers
• Disaggregated or Components: 8000 with SONiC
What About InfiniBand?

AI Fabric POC test procedure and results
GPU to GPU TEST

NCCL (NVIDIA Collective Communications Library) Test, ALL GPU to ALL GPU

Hardware: 2 x NVIDIA DGX A100 6U server with 8 x NVIDIA A100 Tensor Core GPUs, 8 x Mellanox ConnectX-6, 2TB of Mem

Verification
• GPU server 1 <-> GPU server 2

# mpirun --allow-run-as-root -np 16 -H dgx01:8,dgx02:8 -mca pml ucx \
    -bind-to numa -map-by numa -x LD_LIBRARY_PATH \
    -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=enp12s0,^enp226s0 \
    -x UCX_NET_DEVICES=enp12s0,enp18s0,enp75s0,enp84s0,enp141s0,enp148s0,enp186s0,enp204s0 \
    -x NCCL_IGNORE_CPU_AFFINITY=1 \
    -x NCCL_UCX_TLS=rc,cuda,cuda_copy -x NCCL_UCX_MAX_RNDV_RAILS=4 \
    -x NCCL_MAX_EAGER_RAILS=4 -x NCCL_UCX_RNDV_THRESH=0 \
    -x UCX_MEMTYPE_CACHE=n -x NCCL_PLUGIN_P2P=ucx \
    -x NCCL_COLLNET_ENABLE=1 -x NCCL_MAX_P2P_NCHANNEL=1 \
    -x NCCL_ALGO=Tree,CollNet -x NCCL_DEBUG_SUBSYS=graph,tuning,init,p2p \
    -x NCCL_NET_GDR_READ=1 -x NCCL_NET_GDR_LEVEL=5 \
    -x NCCL_IB_PCI_RELAXED_ORDERING=1 -x NCCL_NET_P2P_LEVEL=5 \
    -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
    /bmt/mkis/workspace/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1 -n 10 -w 100 -p 0 -z 0 -m 1 -t 1 -c 1 \
    | tee result/2node_1set_`date +%y%m%d_%H%M`_$$.log

Result
Max 161 GB/s
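
For readers interpreting `all_reduce_perf` output, the nccl-tests suite reports two bandwidth columns: algorithm bandwidth (message size divided by time) and bus bandwidth, which for AllReduce scales the former by 2(n-1)/n for n ranks. The sketch below shows that relationship. The message size and elapsed time are hypothetical, and the slide does not state which of the two columns the 161 GB/s figure refers to.

```python
def allreduce_bandwidth(message_bytes, seconds, ranks):
    """Return (algbw, busbw) in GB/s, following the nccl-tests definitions."""
    algbw = message_bytes / seconds / 1e9
    # AllReduce moves each byte roughly 2*(n-1)/n times per rank.
    busbw = algbw * 2 * (ranks - 1) / ranks
    return algbw, busbw


# Hypothetical sample: a 1 GB message reduced in 20 ms across 16 ranks
# (16 matches the "-np 16" in the command above; size and time are made up).
algbw, busbw = allreduce_bandwidth(1_000_000_000, 0.02, 16)
print(f"algbw={algbw:.2f} GB/s busbw={busbw:.2f} GB/s")
```

Bus bandwidth is the number usually compared against link speed, since it is independent of the number of ranks.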

AI Fabric POC test procedure and results
GPU to GPU TEST

NCCL test result comparison charts (Ethernet vs InfiniBand)

[Figure: Bandwidth vs. message size (Byte)]
[Figure: Latency vs. message size (Byte)]
What Are AI/ML Application Network Requirements?
What is a Neural Network?

A neural network is a method in artificial intelligence that teaches computers to process data in a way that is inspired by the human brain. It is a type of machine learning process, called deep learning, that uses interconnected nodes or neurons in a layered structure that resembles the human brain.
Neural Network Example (Very Oversimplified)
Should I go surfing?
1. Are the waves good? (Yes: 1, No: 0)
2. Is the line-up empty? (Yes: 1, No: 0)
3. Has there been a recent shark attack? (Yes: 0, No: 1)

Then, let’s assume the following, giving us the following inputs:

X1 = 1, since the waves are pumping

X2 = 0, since the crowds are out
X3 = 1, since there hasn’t been a recent shark attack

Now, we need to assign some weights to determine importance. Larger weights signify that particular variables are
of greater importance to the decision or outcome.

W1 = 5, since large swells don’t come around often

W2 = 2, since you’re used to the crowds
W3 = 4, since you have a fear of sharks
https://www.ibm.com/topics/neural-networks

Finally, we'll also assume a threshold value of 3, which would translate to a bias value of -3. With all the various inputs, we can start to plug values into the formula to get the desired output.

Y-hat = (1*5) + (0*2) + (1*4) - 3 = 6

If we use the activation function, we can determine that the output of this node would be 1, since 6 is greater than 0. In this instance, you would go surfing.

https://www.ibm.com/topics/neural-networks
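
The surfing example can be reproduced as a single artificial neuron. The sketch below uses the inputs, weights, and bias from the slides; the function names `weighted_sum` and `step` are hypothetical, and the step function stands in for the activation function the slide mentions.

```python
inputs = [1, 0, 1]    # X1 waves good, X2 line-up not empty, X3 no recent shark attack
weights = [5, 2, 4]   # W1, W2, W3 from the slide
bias = -3             # threshold of 3 expressed as a bias


def weighted_sum(xs, ws, b):
    # Sum of input * weight pairs, plus the bias term.
    return sum(x * w for x, w in zip(xs, ws)) + b


def step(value):
    # Step activation: fire (1) only when the weighted sum is above 0.
    return 1 if value > 0 else 0


y_hat = weighted_sum(inputs, weights, bias)
output = step(y_hat)
print(y_hat, output)  # prints "6 1"
```

Changing a weight or an input and re-running shows how the decision flips, which is the behavior training adjusts at scale.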

Parallelism Mechanisms (collectives)
• NVIDIA Collective Communications Library (NCCL) supports several
data/job distribution methods called collectives, namely:
• AllReduce
• Broadcast
• Reduce
• AllGather
• ReduceScatter
• RingReduce

• Each method has its pros and cons; the choice is left to the developer
• PyTorch, the leading Python library for deep learning, defaults to AllReduce for gradient synchronization in DistributedDataParallel*
• Informal research (scanning Git repos, reading blogs) shows virtually every developer leaves this unchanged
* https://pytorch.org/docs/stable/ddp_comm_hooks.html
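
To make the AllReduce traffic pattern concrete, here is a minimal pure-Python simulation of a ring-based AllReduce (a reduce-scatter phase followed by an all-gather phase). It is a sketch of the well-known ring algorithm, not NCCL's implementation. It assumes each rank's vector has exactly one element per rank, and the name `ring_all_reduce` is hypothetical.

```python
def ring_all_reduce(per_rank):
    """Sum equal-length vectors across ranks arranged in a ring.

    Simplifying assumption: each vector has exactly one element per rank,
    so element i plays the role of "chunk i".
    """
    n = len(per_rank)
    data = [list(v) for v in per_rank]
    assert all(len(v) == n for v in data)

    # Phase 1, reduce-scatter: after n-1 steps, rank r holds the full sum
    # of chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot all messages first so every rank "sends" simultaneously.
        msgs = [((r - step) % n, data[r][(r - step) % n]) for r in range(n)]
        for r, (idx, val) in enumerate(msgs):
            data[(r + 1) % n][idx] += val

    # Phase 2, all-gather: circulate the finished chunks around the ring.
    for step in range(n - 1):
        msgs = [((r + 1 - step) % n, data[r][(r + 1 - step) % n]) for r in range(n)]
        for r, (idx, val) in enumerate(msgs):
            data[(r + 1) % n][idx] = val

    return data


gradients = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]  # hypothetical per-GPU gradients
print(ring_all_reduce(gradients))
```

Every rank sends to and receives from a neighbor in every step, which is why AllReduce produces sustained, synchronized traffic on all links at once.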

Different Traffic Patterns and Job Sizes

GPU Communications “Inside” the Server

Main Infrastructure Requirements for Training
In a DL training infrastructure, it is crucial to get as much raw compute
power and as many nodes as you can afford. Think multi-core processors
and GPUs.

The more nodes and the more mathematical accuracy you can build into
your cluster, the faster and more accurate your training will be.

Huge training datasets require massive networking and storage capabilities to hold and transfer the data, especially if your data is image-based or heterogeneous. Plan ahead for adequate networking and storage capacity, not just for strong computing.

Based on https://www.hpcwire.com/2022/06/13/infrastructure-requirements-for-ai-inference-vs-training/
Main Infrastructure Requirements for Inference
Inference clusters should be optimized for performance. Think simpler
hardware with less power than the training cluster but with the lowest
latency possible.

Throughput is critical to inference. The process requires high I/O bandwidth and enough memory to hold both the required trained model(s) and the input data without having to make calls back to the storage components of the cluster.

Data center resource requirements for inference are typically not as great
for a single instance compared to training needs.

Based on https://www.hpcwire.com/2022/06/13/infrastructure-requirements-for-ai-inference-vs-training/
Backend and Frontend Connectivity

The “backend network” is used only for GPU-to-GPU connectivity

The “frontend network” is used for all other communications and can be
used for IP storage access as well

For very large clusters an IP storage network can also be used

Multiple Connections May Be Required For AI Workloads

Front End / Storage & Outside World
• Nexus design: 2 x N9K-C9364C-GX
• All 32 servers connect 2 x 100G NICs = 64 x 100G across 2 x N9K-C9364C-GX

Leaf Layer
• 8 x N9K-C9364D-GX2A
• 8 x servers per pair of TORs
• 4 NICs x 400G (2x)
• 32 x 400G server ports
• 32 x 400G to spines (8 x 400G to each of the 4 spines)

Spine Layer
• 4 x N9K-C9364D-GX2A
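
The port counts on this slide can be checked with a few lines of arithmetic. The sketch below assumes that "4 NICs x 400G (2x)" means each server attaches four 400G NICs to each leaf of its TOR pair; that reading, and all variable names, are assumptions rather than statements from the slide.

```python
SERVERS = 32
SERVERS_PER_LEAF_PAIR = 8
NICS_PER_SERVER_PER_LEAF = 4      # assumed reading of "4 NICs x 400G (2x)"
UPLINKS_PER_LEAF_PER_SPINE = 8    # "8 x 400G" toward each spine
SPINES = 4
PORT_GBPS = 400

leaves = SERVERS // SERVERS_PER_LEAF_PAIR * 2
server_ports_per_leaf = SERVERS_PER_LEAF_PAIR * NICS_PER_SERVER_PER_LEAF
uplinks_per_leaf = UPLINKS_PER_LEAF_PER_SPINE * SPINES

# A ratio of 1.0 means downlink and uplink capacity match (non-blocking leaf).
oversubscription = (server_ports_per_leaf * PORT_GBPS) / (uplinks_per_leaf * PORT_GBPS)
ports_used_per_spine = leaves * UPLINKS_PER_LEAF_PER_SPINE

print(leaves, server_ports_per_leaf, uplinks_per_leaf, oversubscription, ports_used_per_spine)
```

Under that reading the numbers line up with the slide: 8 leaves, 32 server ports and 32 uplinks per leaf, and 64 ports in use on each spine.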
What is Needed to Build The Best AI/ML DC Network?
• Non-Blocking Fabric

• Low Latency

• Lossless RoCE

• Visibility

RDMA Over Converged Ethernet - RoCE
[Figure: two hosts, each with system memory, GPU memory, CPU, GPU, PCIe, and an RDMA NIC, connected across an Ethernet network or an IB network]

RoCE frame:   Eth L2 Header | EtherType | IB GRH | IB BTH+ (L4 HDR) | IB Payload | ICRC | FCS
RoCEv2 frame: Eth L2 Header | EtherType | IP Header | UDP Header | IB BTH+ (L4 HDR) | IB Payload | ICRC | FCS
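
As a rough illustration of the two encapsulations, the sketch below adds up per-packet header overhead. It assumes an untagged Ethernet header, IPv4 for RoCEv2, and only the 12-byte Base Transport Header with no extended headers (the "+" in "BTH+"); those choices are simplifying assumptions, not details from the slide.

```python
# Field sizes in bytes. RoCE (v1) is carried directly in Ethernet (EtherType 0x8915);
# RoCEv2 is carried in UDP (destination port 4791), which makes it routable.
ROCE_V1 = [
    ("Ethernet L2 header", 14),
    ("IB GRH", 40),
    ("IB BTH", 12),
    ("ICRC", 4),
    ("FCS", 4),
]
ROCE_V2 = [
    ("Ethernet L2 header", 14),
    ("IPv4 header", 20),
    ("UDP header", 8),
    ("IB BTH", 12),
    ("ICRC", 4),
    ("FCS", 4),
]


def overhead(fields):
    # Total bytes around the IB payload.
    return sum(size for _, size in fields)


print("RoCE overhead:", overhead(ROCE_V1), "bytes")
print("RoCEv2 overhead:", overhead(ROCE_V2), "bytes")
```

A VLAN tag, as used in the NIC configuration later in this deck, would add 4 bytes to either total.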

Lossless Ethernet

• Lossless Ethernet for RoCEv2

• End-to-End
• ECN (Explicit Congestion Notification) + WRED or AFD
• PFC (Priority Flow Control)

• Buffers (smart and the right amount)

• Tuning

Intelligent Buffers

Congestion Management Needs to Work As a System
No Congestion Example

[Figure: Host X sends traffic to Host A and Host B across leaf LX, spine S1, and leaves L1 and L2, using queue Q3 on each switch. All packets (SIP:Host X; DIP:Host A or Host B) carry ECN 0x10 on every hop.]
Light Congestion Example

[Figure: same topology and flows. Packets leave Host X with ECN 0x10; under light congestion, packets to Host A and Host B are re-marked to ECN 0x11 before they reach the receivers.]
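
The ECN values in these examples can be read as the two-bit ECN field defined in RFC 3168: 10 means ECN-capable transport, 11 means congestion experienced. The sketch below assumes the slide's "0x10" and "0x11" denote those binary values, and the helper names are hypothetical.

```python
ECN_NAMES = {
    0b00: "Not-ECT",  # sender does not support ECN
    0b01: "ECT(1)",   # ECN-capable transport
    0b10: "ECT(0)",   # ECN-capable transport
    0b11: "CE",       # congestion experienced
}


def ecn_of(tos):
    # ECN is the low two bits of the IP TOS / Traffic Class byte.
    return tos & 0b11


def mark_congestion(tos):
    # A switch may only set CE on packets that advertise ECN capability.
    if ecn_of(tos) in (0b01, 0b10):
        return tos | 0b11
    return tos


tos = (24 << 2) | 0b10             # DSCP 24 with ECT(0), as a RoCEv2 sender might use
marked = mark_congestion(tos)
print(ECN_NAMES[ecn_of(tos)], "->", ECN_NAMES[ecn_of(marked)])
```

Note that marking leaves the DSCP bits untouched, so the packet stays in the same queue on every subsequent hop.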

Light Congestion Example - CNP Message

CNP messages must be marked correctly to get into the priority queue.

[Figure: same topology (hosts A, B, and X; leaves L1, L2, and LX; spine S1; queue Q3). Congestion Notification Packets (CNP) traverse the fabric; the packets are labeled "SIP:Host X; DIP:Host A; CNP" and "SIP:Host X; DIP:Host B; CNP".]
PFC Congestion Leaf

[Figure: Host A and Host B both send to Host X (SIP:Host A or Host B; DIP:Host X; ECN:0x10) through leaves L1 and L2 and spine S1, using queue Q3. The congested leaf LX sends PFC toward spine S1.]
PFC Congestion Leaf + Spine

[Figure: same topology and flows. Congestion has spread from leaf LX to spine S1, and PFC is now also sent from S1 toward leaves L1 and L2.]
PFC Congestion Leaf + Spine + Leaf

[Figure: same topology and flows. PFC now extends to the source leaves as well: L1 and L2 send PFC toward Host A and Host B.]
To get RoCEv2 to work as expected it is critical to configure the NICs and OS correctly
Configure Your Ethernet NICs
Mellanox ConnectX-6 NIC, 2 Port 100G Ethernet NIC

cat /etc/netplan/00-installer-config.yaml
<CLIP>
    ens1f0np0:
      dhcp4: false
    ens1f1np1:
      dhcp4: false
  vlans:                        # VLAN Config Required for PFC
    ens1f0np0.101:
      id: 101                   # VLAN Encap is 101
      link: ens1f0np0
      addresses:
      - 192.168.101.11/24
      routes:
      - to: 192.168.0.0/16
        via: 192.168.101.1
        metric: 100
    ens1f1np1.101:
      id: 101
      link: ens1f1np1
      addresses:
      - 172.16.101.11/24
      routes:
      - to: 172.16.0.0/16
        via: 172.16.101.1
        metric: 100

# Enable ECN for class 3
sudo bash -c 'echo 1 > /sys/class/net/ens1f0np0/ecn/roce_np/enable/3'
sudo bash -c 'echo 1 > /sys/class/net/ens1f1np1/ecn/roce_np/enable/3'

# Set CNP L2 egress priority to 6 ("enable CNP")
sudo bash -c 'echo 6 > /sys/class/net/ens1f0np0/ecn/roce_np/cnp_802p_prio'
sudo bash -c 'echo 6 > /sys/class/net/ens1f1np1/ecn/roce_np/cnp_802p_prio'

# Map CNP to DSCP 48
sudo bash -c 'echo 48 > /sys/class/net/ens1f0np0/ecn/roce_np/cnp_dscp'
sudo bash -c 'echo 48 > /sys/class/net/ens1f1np1/ecn/roce_np/cnp_dscp'

# Map PFC to priority 3
sudo mlnx_qos -i ens1f0np0 --trust=dscp --pfc 0,0,0,1,0,0,0,0
sudo mlnx_qos -i ens1f1np1 --trust=dscp --pfc 0,0,0,1,0,0,0,0

# Enable ECN for traffic
sudo sysctl -w net.ipv4.tcp_ecn=1

# Set mode to RoCEv2 on the NIC
sudo cma_roce_mode -d mlx5_0 -p 1 -m 2
sudo cma_roce_mode -d mlx5_1 -p 1 -m 2

# Set TOS = DSCP CS3 for RoCEv2 traffic
sudo cma_roce_tos -d mlx5_0 -t 96
sudo cma_roce_tos -d mlx5_1 -t 96

# Egress map RoCE traffic to priority 3
sudo vconfig set_egress_map ens1f0np0.101 4 3
sudo vconfig set_egress_map ens1f1np1.101 4 3
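
The numbers in these commands are related: a TOS byte of 96 carries DSCP 24 (class selector CS3), DSCP 48 is CS6, and the `--pfc` list enables pause on priority 3 only. The sketch below shows the bit arithmetic; the helper names are hypothetical, and the code only decodes values, it does not configure anything.

```python
def tos_to_dscp(tos):
    # DSCP is the upper six bits of the TOS byte; the lower two bits are ECN.
    return tos >> 2


def class_selector(dscp):
    # Class selector codepoints CS0..CS7 are the DSCP values divisible by 8.
    return dscp >> 3 if dscp % 8 == 0 else None


def pfc_enabled_priorities(pfc_list):
    # The list has one flag per priority 0..7, e.g. "0,0,0,1,0,0,0,0".
    return [prio for prio, flag in enumerate(pfc_list.split(",")) if flag.strip() == "1"]


print(tos_to_dscp(96), class_selector(tos_to_dscp(96)))   # RoCEv2 data traffic
print(class_selector(48))                                  # CNP traffic
print(pfc_enabled_priorities("0,0,0,1,0,0,0,0"))
```

The same DSCP values (24 and 48) are what a switch classification policy would need to match for the host and fabric settings to agree.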

Recommendations for Server Tuning*
*These recommendations were done in conjunction with Vijay Durairaj
Cisco GPU-Accelerated Platform Offerings

System Tuning Recommendations
• BIOS tunings:
  - Processor Configuration:
      Intel Virtualization Technology : <Enabled>
      Intel Hyper-Threading Technology : <Enabled>
      CPU Performance : <HPC>
  - Power & Performance Configuration:
      Processor C1E : <Disabled>
      Processor C6 : <Enabled>
      Package C-State control : <C0/C1 State>
      Uncore Frequency Scaling : <Enabled>
• Thermal : Fan Control Policy : <Maximum Power>
• PCIe : Width, Speed, Max Payload Size, Max Read Request
• OS tunings : CPU Power, TCP Parameters & Memory Settings

Notes:
• C1E is one of the automatic power-saving features which is triggered when the system is idle. Best to disable it when overclocking.
• C6: BIOS will automatically disable CPU core and cache for power saving.
• In C0/C1 state all of the cores are locked at maximum performance, which causes a higher consumption of power.
• Uncore Frequency Scaling enabled stops the processor from dynamically changing frequencies based on workload to save power.
OS Tuning - 1
• Mellanox automatic tuning for performance (mlnx_tune utility; this command does all the NIC optimization recommended by Mellanox):
# mlnx_tune -p HIGH_THROUGHPUT

• Disable the TCP timestamps for better CPU utilization:
# sysctl -w net.ipv4.tcp_timestamps=0

• Enable the TCP selective acks option for better throughput:
# sysctl -w net.ipv4.tcp_sack=1

• Increase the maximum length of processor input queues:
# sysctl -w net.core.netdev_max_backlog=300000

• Increase the TCP maximum and default buffer sizes using setsockopt():
# sysctl -w net.core.rmem_max=4194304
# sysctl -w net.core.wmem_max=4194304
# sysctl -w net.core.rmem_default=4194304
# sysctl -w net.core.wmem_default=4194304
# sysctl -w net.core.optmem_max=4194304

The rest of these are recommendations by Mellanox for "Tuning the Network Adapter for Improved IPv4 Traffic Performance":
https://mellanox.my.site.com/mellanoxcommunity/s/article/linux-sysctl-tuning
OS Tuning - 2
• Increase memory thresholds to prevent packet dropping:
# sysctl -w net.ipv4.tcp_rmem="4096 87380 4194304"
# sysctl -w net.ipv4.tcp_wmem="4096 65536 4194304"

These are recommendations by Mellanox for "Tuning the Network Adapter for Improved IPv4 Traffic Performance":
https://mellanox.my.site.com/mellanoxcommunity/s/article/linux-sysctl-tuning

• Enable low latency mode for TCP:
# sysctl -w net.ipv4.tcp_low_latency=1
# sysctl -w net.ipv4.tcp_adv_win_scale=1

• Disable OS watchdog & NUMA balancing:
# echo 0 > /proc/sys/kernel/nmi_watchdog
# echo 0 > /proc/sys/kernel/numa_balancing

• Fine tuning Kernel task scheduler (values are in nanoseconds: 100000000 ns = 0.1 sec, 50000000 ns = 0.05 sec):
# echo 100000000 > /proc/sys/kernel/sched_min_granularity_ns
# echo 50000000 > /proc/sys/kernel/sched_migration_cost_ns
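
Because `sysctl -w` and `echo` changes do not survive a reboot, it can help to audit a host against the recommended values. The sketch below reads the current settings from `/proc/sys` and compares them with a subset of the values listed above. The script, its names (`RECOMMENDED`, `sysctl_path`, `check`), and the choice of keys are hypothetical and only read state; they change nothing on the system.

```python
from pathlib import Path

# Subset of the values recommended on the OS Tuning slides.
RECOMMENDED = {
    "net.ipv4.tcp_timestamps": "0",
    "net.ipv4.tcp_sack": "1",
    "net.core.netdev_max_backlog": "300000",
    "net.core.rmem_max": "4194304",
    "net.core.wmem_max": "4194304",
    "net.ipv4.tcp_rmem": "4096 87380 4194304",
    "kernel.numa_balancing": "0",
}


def sysctl_path(key, root="/proc/sys"):
    # "net.ipv4.tcp_sack" -> /proc/sys/net/ipv4/tcp_sack
    return Path(root) / key.replace(".", "/")


def check(recommended=RECOMMENDED, root="/proc/sys"):
    report = {}
    for key, wanted in recommended.items():
        try:
            # Normalize tabs/newlines so multi-value entries compare cleanly.
            current = " ".join(sysctl_path(key, root).read_text().split())
        except OSError:
            current = None  # key missing on this kernel, or not readable
        report[key] = (current, wanted, current == wanted)
    return report


for key, (current, wanted, ok) in check().items():
    print(f"{'OK  ' if ok else 'DIFF'} {key}: current={current} recommended={wanted}")
```

Keys that a given kernel does not expose are reported with `current=None` rather than raising an error.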

PCIE Tuning for the Best Performance
Default PCIE Attributes : Width | Speed | Max Payload Size | Max Read Request

[Figure: output of the list PCI (lspci) command used to find out where the Mellanox NICs are connected]
[Figure: output of the list PCI (lspci) command showing the settings for location 5e:00.0]

Speed determines the number of PCIe transfers possible. The speed is measured in GT/s, which stands for gigatransfers per second (billions of transfers per second).

The PCIe Max Payload Size determines the maximum size of a PCIe packet, or PCIe MTU (similar to networking protocols). This parameter is set only by the system and depends on the chipset architecture (e.g. x86_64, Power8, ARM, etc).

Ubuntu/Debian: sudo apt install pciutils
PCIE Tuning for the Best Performance
Tuned PCIE Attributes : Width | Speed | Max Payload Size | Max Read Request

The first digit is the PCIe Max Read Request size selector. The acceptable values are: 0 - 128B, 1 - 256B, 2 - 512B, 3 - 1024B, 4 - 2048B, and 5 - 4096B.

Ubuntu/Debian: sudo apt install pciutils
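
The selector table above follows a simple pattern: each step doubles the request size starting from 128 bytes. A minimal sketch of that mapping, with a hypothetical function name:

```python
def max_read_request_bytes(selector):
    # Selector values 0..5 map to 128B..4096B, doubling at each step.
    if not 0 <= selector <= 5:
        raise ValueError("selector must be between 0 and 5")
    return 128 << selector


for selector in range(6):
    print(selector, max_read_request_bytes(selector))
```

This makes it easy to translate a selector digit read from the device into the byte size shown in the tuned attributes.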

What Does The Network Configuration Look Like?
QOS Configs Classification

CLASS MAP
class-map type qos match-all class-q3
  match dscp 24
class-map type qos match-all class-q7
  match dscp 48

POLICY MAP
policy-map type qos QOS_classification_policy
  class class-q3
    set qos-group 3
  class class-q7
    set qos-group 7
  class class-default
    set qos-group 0

Notes:
• Put packets with DSCP 24 into "class-q3"
• Put packets with DSCP 48 into "class-q7"
• Put packets from class-q3 into queue 3
• Put packets from class-q7 into queue 7
• Everything else goes into default or queue 0
QOS Configs Queuing and Scheduling

POLICY MAP "type queuing"
policy-map type queuing custom-8q-out-policy
  class type queuing c-out-8q-q7
    priority level 1
  class type queuing c-out-8q-q6
    bandwidth remaining percent 0
  class type queuing c-out-8q-q5
    bandwidth remaining percent 0
  class type queuing c-out-8q-q4
    bandwidth remaining percent 0
  class type queuing c-out-8q-q3
    bandwidth remaining percent 60
    random-detect minimum-threshold 150 kbytes maximum-threshold 3000 kbytes drop-probability 7 weight 0 ecn
  class type queuing c-out-8q-q2
    bandwidth remaining percent 0
  class type queuing c-out-8q-q1
    bandwidth remaining percent 0
  class type queuing c-out-8q-q-default
    bandwidth remaining percent 40

Notes:
• Queue 7 is set up as a priority queue
• Queue 3 receives 60% of the bandwidth, with ECN marking enabled starting at 150KB of buffer utilization and a drop probability of 7%
• Queues 1 and 2 receive 0% guaranteed bandwidth
• Default queue receives 40%
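
To show how the thresholds in the queue 3 `random-detect` line interact, the sketch below models ECN marking probability as a function of queue depth. It assumes the conventional WRED ramp: no marking below the minimum threshold, a linear rise to the configured probability at the maximum threshold, and every packet marked above the maximum. The behavior above the maximum threshold and the function name are assumptions, not details taken from the slide.

```python
def ecn_mark_probability(queue_kbytes, min_kb=150, max_kb=3000, max_prob=0.07):
    """Probability that a packet is ECN-marked at a given queue depth (assumed WRED ramp)."""
    if queue_kbytes < min_kb:
        return 0.0                      # below minimum threshold: never mark
    if queue_kbytes >= max_kb:
        return 1.0                      # assumed: mark everything past the maximum
    # Linear ramp from 0 at min_kb up to max_prob at max_kb.
    return max_prob * (queue_kbytes - min_kb) / (max_kb - min_kb)


for depth in (100, 150, 1575, 2999, 3000):
    print(depth, round(ecn_mark_probability(depth), 4))
```

The low 7% ceiling means senders are slowed gently as the queue builds, leaving PFC as the backstop if the queue keeps growing.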

QOS Configs cont.

POLICY MAP "type network-qos"
policy-map type network-qos custom-8q-nq-policy
  <snip>
  class type network-qos c-8q-nq3
    mtu 9216
    pause pfc-cos 3
  <snip>

SYSTEM QOS
system qos
  service-policy type network-qos custom-8q-nq-policy
  service-policy type queuing output custom-8q-out-policy

Notes:
• Use a policy-map of type network-qos to add PFC to queue 3
• MTU is used as a way to calculate headroom for the non-drop queue
• Attach policy maps of "type queuing" and "type network-qos" system wide
• This triggers ECN marking and ensures that ports configured with PFC receive and honor pause frames, as well as generate pause frames as needed
QOS Configs Interfaces

interface Ethernet1/1
  service-policy type qos input QOS_classification_policy
  priority-flow-control mode on
  priority-flow-control watch-dog-interval on

Notes:
• Attach the service policy to the interface
• Enable PFC
• Enable PFC Watch Dog using a default interval of 100 milliseconds
AI/ML Blueprint for DC Networks
The Blueprint For Today
Built to accommodate 1024 GPUs along with storage devices

• If you are building a very large training cluster please speak with your
sales team

• Use the best practices in the presentation to get started

• We have a blueprint and best practices to get you started on your journey
today

• The blueprint will continue evolving as standards and equipment change

Session Surveys
We would love to know your feedback on this session!
• Complete a minimum of four session surveys and the overall event surveys to claim
a Cisco Live T-Shirt


Continue your education
• Book your one-on-one Meet the Expert meeting
• Attend the interactive education with DevNet, Capture the Flag, and Walk-in Labs
• Visit the On-Demand Library for more sessions at www.CiscoLive.com/on-demand

Tessent ATPG Simulation Mismatch Debug: 2016 Mentor Graphics Corporation
0% (1)
Tessent ATPG Simulation Mismatch Debug: 2016 Mentor Graphics Corporation
13 pages
Azure Databricks Interview
100% (2)
Azure Databricks Interview
35 pages
CCIE SPv5 Webinar - Build Your Own Lab-V1
No ratings yet
CCIE SPv5 Webinar - Build Your Own Lab-V1
51 pages
Abstraction
No ratings yet
Abstraction
2 pages
BRKDCN 2613
No ratings yet
BRKDCN 2613
97 pages
Brkcom 2018
No ratings yet
Brkcom 2018
72 pages
Brkent 2209
No ratings yet
Brkent 2209
75 pages
Brkcom 1008
No ratings yet
Brkcom 1008
139 pages
Brkarc 2095
No ratings yet
Brkarc 2095
85 pages
AI Security through the Lens of Large Language
No ratings yet
AI Security through the Lens of Large Language
77 pages
Cisco AI Solns on Cisco Infra Ess - dcaie
No ratings yet
Cisco AI Solns on Cisco Infra Ess - dcaie
3 pages
BRKDCN 2921
No ratings yet
BRKDCN 2921
102 pages
AIHUB-1000
No ratings yet
AIHUB-1000
49 pages
CISCO ACI ai
No ratings yet
CISCO ACI ai
35 pages
AIHUB-1009
No ratings yet
AIHUB-1009
25 pages
AI-ML-Cisco - ACI
No ratings yet
AI-ML-Cisco - ACI
20 pages
Modeling Labs For Personal
No ratings yet
Modeling Labs For Personal
2 pages
Flexpod Datacenter For Ai/Ml With Cisco Ucs 480 ML For Deep Learning
No ratings yet
Flexpod Datacenter For Ai/Ml With Cisco Ucs 480 ML For Deep Learning
105 pages
AIHUB-1002
No ratings yet
AIHUB-1002
20 pages
AIHUB-1025
No ratings yet
AIHUB-1025
27 pages
Brkcom 1004
No ratings yet
Brkcom 1004
49 pages
Dcuci Ver4.0 Lab Guide
No ratings yet
Dcuci Ver4.0 Lab Guide
158 pages
Jim Grubb - Cisco Corporate Vision and Strategy
No ratings yet
Jim Grubb - Cisco Corporate Vision and Strategy
78 pages
Brkopt 2699
No ratings yet
Brkopt 2699
83 pages
AIHUB-1012
No ratings yet
AIHUB-1012
39 pages
CCNP DC V1.1-Learning Matrix
No ratings yet
CCNP DC V1.1-Learning Matrix
16 pages
modeling-labs-for-personal
No ratings yet
modeling-labs-for-personal
2 pages
Nexus Cloud Scale Infrastructure BDM
No ratings yet
Nexus Cloud Scale Infrastructure BDM
31 pages
CCNP BCMSN Slides
No ratings yet
CCNP BCMSN Slides
424 pages
LTRCRT-2608 - CCNP Data Center Unified Fabric Implementation (DCUFI) Lab (2014 San Francisco) - 4 Hours
No ratings yet
LTRCRT-2608 - CCNP Data Center Unified Fabric Implementation (DCUFI) Lab (2014 San Francisco) - 4 Hours
14 pages
AIHUB-2170 AI Agents and Agentic Frameworks - An Overview
No ratings yet
AIHUB-2170 AI Agents and Agentic Frameworks - An Overview
49 pages
Cisco Modeling Labs: Jump-Start Your Netdevops Journey With Network Simulation On Actual Ios Images
No ratings yet
Cisco Modeling Labs: Jump-Start Your Netdevops Journey With Network Simulation On Actual Ios Images
2 pages
AIHUB-1011
No ratings yet
AIHUB-1011
21 pages
CCIE DC 21 Learning Matrix
No ratings yet
CCIE DC 21 Learning Matrix
22 pages
High-Touch Delivery Learning Services: Configuring The Cisco Nexus Data Center (CCNDC)
No ratings yet
High-Touch Delivery Learning Services: Configuring The Cisco Nexus Data Center (CCNDC)
4 pages
Brkaci 1000
No ratings yet
Brkaci 1000
80 pages