Building a High-Performance AI DC
Using Nexus, UCS, and NVIDIA GPUs
Andy Sholomon
Director of Technical Marketing
@asholomon
BRKAPP-2698
#CiscoLiveAPJC
#CiscoLiveAPJC BRKAPP-2698 © 2023 Cisco and/or its affiliates. All rights reserved. Cisco Public 4
Agenda
• AI/ML Applications
• What Are Their Requirements?
• AI/ML Blueprint for DC Networks
• Industry Talk
• Conclusion
AI/ML Applications
AI is everywhere!
Self-driving vehicles
Conversational agents
Face recognition
Machine translation
Recommender systems
Main Types of AI/ML Workloads
Dog or Muffin?
LLMs
LLM stands for Large Language Model. These are machine learning
models that have been trained on massive amounts of text data, such as
books, articles, and web pages, to understand and generate human
language. OpenAI has not published GPT-4's parameter count; the
1.5 trillion parameters quoted here is an unofficial estimate.
https://www.hopsworks.ai/dictionary/llms-large-language-models
What It Takes to Train an LLM
Power
LLMs
https://deepchecks.com/llm-models-comparison/
99% of customers will not be building infrastructure to train their own LLMs
Inference and Fine Tuning
https://blog.apnic.net/2023/08/10/large-language-models-the-hardware-connection/
Many customers will build GPU clusters in their existing DCs for training
“smaller” models, for fine-tuning existing models, and for inference or
generative AI.
LLMs everywhere?
Self-driving vehicles
Conversational agents
Face recognition
Machine translation
Recommender systems
Industry Talk
Ultra Ethernet Consortium - UEC
ECMP, Telemetry Assisted Ethernet and Fully Scheduled Fabrics
Cisco 8000: Power of Disaggregation
[Figure: a row of Tier 1 spines above the accelerator racks that make up the cluster]
AI/ML: High Performance Ethernet Fabrics
[Figure: workload types (storage, distributed databases, video/broadcast, VoIP, hyperscalers) positioned against Ethernet fabric options, with "Primary target" callouts and the label "Disaggregated or Components 8000 with SONiC"]
What About InfiniBand?
AI Fabric POC test procedure and results
GPU to GPU TEST
NCCL (NVIDIA Collective Communications Library) test, all GPUs to all GPUs
Verification Result
[Chart: GPU to GPU test, bandwidth vs. message size (bytes)]
[Chart: GPU to GPU test, latency vs. message size (bytes)]
What Are AI/ML Application Network Requirements?
What is a Neural Network?
Neural Network Example (Very Oversimplified)
Should I go surfing?
1. Are the waves good? (Yes: 1, No: 0)
2. Is the line-up empty? (Yes: 1, No: 0)
3. Has there been a recent shark attack? (Yes: 0, No: 1)
Now, we need to assign some weights to determine importance. Larger weights signify that particular variables are
of greater importance to the decision or outcome.
Finally, we’ll also assume a threshold value of 3, which translates to a bias value of –3. With all the
inputs defined, we can plug the values into the formula to get the output.
Applying the activation function, the output of this node is 1, since 6 is
greater than 0. In this instance, you would go surfing.
https://www.ibm.com/topics/neural-networks
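The single-node decision above can be sketched in a few lines of Python. The slide text does not list the input or weight values, so the ones below (inputs 1, 0, 1 and weights 5, 2, 4) are assumptions, chosen so that the weighted sum plus the bias of –3 gives the 6 quoted above.

```python
# Single-neuron sketch of the surfing example.
# Inputs and weights are assumptions (the slide text does not list them);
# the bias of -3 comes from the threshold of 3 stated in the text.
inputs = [1, 0, 1]    # waves good, line-up not empty, no recent shark attack
weights = [5, 2, 4]   # assumed importance of each input
bias = -3

def step(z):
    """Binary step activation: output 1 when z is greater than 0, else 0."""
    return 1 if z > 0 else 0

# Weighted sum of the inputs plus the bias.
z = sum(x * w for x, w in zip(inputs, weights)) + bias
output = step(z)
print(f"weighted sum + bias = {z}, output = {output}")
```

Changing the third input to 0 (a recent shark attack) drops the sum to 2, which still fires with these assumed weights; a larger shark weight or a higher threshold would be needed to flip the decision.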
Parallelism Mechanisms (collectives)
• NVIDIA Collective Communications Library (NCCL) supports several
data/job distribution methods called collectives, namely:
• AllReduce
• Broadcast
• Reduce
• AllGather
• ReduceScatter
• Ring AllReduce (a ring-based way of running AllReduce, not a separate NCCL collective)
• Each method has its pros and cons; choice is left to the developer
• PyTorch (a leading Python library for deep learning) defaults to AllReduce*
• Informal research (scanning Git repos, reading blogs) suggests that virtually
every developer leaves this unchanged
* https://pytorch.org/docs/stable/ddp_comm_hooks.html
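To make the collectives above concrete, here is a minimal pure-Python simulation of three of them. It is only a sketch of the data-movement semantics: it does not call NCCL or PyTorch, the function names are illustrative, and each inner list stands in for one rank's (GPU's) buffer.

```python
# Pure-Python simulation of three collectives. Each inner list is one
# rank's buffer. No GPUs or NCCL involved; names here are illustrative.

def all_reduce(buffers):
    """Every rank ends up with the element-wise sum of all ranks' buffers."""
    total = [sum(vals) for vals in zip(*buffers)]
    return [list(total) for _ in buffers]

def reduce_scatter(buffers):
    """Rank r ends up with only the r-th chunk of the element-wise sum."""
    n = len(buffers)
    total = [sum(vals) for vals in zip(*buffers)]
    chunk = len(total) // n  # assumes buffer length is divisible by rank count
    return [total[r * chunk:(r + 1) * chunk] for r in range(n)]

def all_gather(chunks):
    """Every rank ends up with the concatenation of all ranks' chunks."""
    gathered = [x for c in chunks for x in c]
    return [list(gathered) for _ in chunks]

grads = [[1, 2, 3, 4], [10, 20, 30, 40]]  # hypothetical gradients on 2 ranks
result = all_reduce(grads)
two_step = all_gather(reduce_scatter(grads))
print(result)
print(two_step)
```

The last two lines show why AllReduce is the heavy hitter on the network: it is equivalent to a ReduceScatter followed by an AllGather, so every rank both sends and receives data in each step.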
Different Traffic Patterns and Job Sizes
GPU Communications “Inside” the Server
Main Infrastructure Requirements for Training
In a DL training infrastructure, it is crucial to get as much raw compute
power and as many nodes as you can afford. Think multi-core processors
and GPUs.
The more nodes and the more mathematical accuracy you can build into
your cluster, the faster and more accurate your training will be.
Based on https://www.hpcwire.com/2022/06/13/infrastructure-requirements-for-ai-inference-vs-training/
Main Infrastructure Requirements for Inference
Inference clusters should be optimized for performance. Think simpler
hardware with less power than the training cluster but with the lowest
latency possible.
Data center resource requirements for inference are typically not as great
for a single instance compared to training needs.
Based on https://www.hpcwire.com/2022/06/13/infrastructure-requirements-for-ai-inference-vs-training/
Backend and Frontend Connectivity
The “frontend network” is used for all other communications and can be
used for IP storage access as well
Multiple Connections May Be Required For AI Workloads
Front End / Storage & Outside World: all 32 servers connect with 2 x 100G NICs each (64 x 100G), Nexus design across 2 x N9K-C9364C-GX
Leaf Layer: 8 x N9K-C9364D-GX2A, 8 servers per pair of ToRs, 4 NICs x 400G (2x)
Port split: 32 x 400G server ports, 32 x 400G to spines (drawn as four groups of 8 x 400G)
• Low Latency
• Lossless RoCE
• Visibility
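The port counts on this slide can be checked with simple arithmetic. The sketch below uses only numbers from the slide; treating the 32/32 split as the port budget of a single leaf switch is an assumption.

```python
# Port-budget arithmetic using only the counts shown on the slide.
SERVERS = 32
FRONTEND_NICS_PER_SERVER = 2          # 2 x 100G NICs per server
frontend_ports = SERVERS * FRONTEND_NICS_PER_SERVER  # 100G ports needed

# Assumption: the 32/32 split describes one leaf switch.
LEAF_SERVER_PORTS = 32                # 32 x 400G server-facing ports
LEAF_SPINE_PORTS = 32                 # 32 x 400G uplinks to spines
PORT_SPEED_GBPS = 400

downlink_gbps = LEAF_SERVER_PORTS * PORT_SPEED_GBPS
uplink_gbps = LEAF_SPINE_PORTS * PORT_SPEED_GBPS
oversubscription = downlink_gbps / uplink_gbps

print(f"frontend 100G ports: {frontend_ports}")
print(f"leaf downlink {downlink_gbps}G, uplink {uplink_gbps}G, "
      f"ratio {oversubscription:.1f}:1")
```

A ratio of 1:1 means the leaf layer is non-blocking: server-facing bandwidth never exceeds the bandwidth toward the spines.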
RDMA Over Converged Ethernet - RoCE
[Figure: RDMA data path. On each host, system memory and GPU memory attach over PCIe to an RDMA NIC; the NICs are connected by an Ethernet network or an IB network. Frame-format rows are shown for RoCE and RoCEv2 (Ethernet L2 header, EtherType, IB BTH+ payload).]
Lossless Ethernet
• Tuning
Intelligent Buffers
Congestion Management Needs to Work As a System
No Congestion Example
[Figure: spine S1 above leaves L1, L2 … LX, traffic in queue Q3. Packets from Host X to Host A and Host B carry ECN 0x10 at every hop.]
Light Congestion Example
[Figure: same topology. Packets from Host X to Host A and Host B enter with ECN 0x10; a packet to Host A leaves the congested hop marked ECN 0x11.]
Light Congestion Example –CNP Message
[Figure: same topology, with CNP packets (labelled SIP: Host X, DIP: Host A or Host B) crossing S1 and leaves L1, L2 … LX in queue Q3.]
CNP messages must be marked correctly to get into the priority queue
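The ECN values in these examples live in the two low-order bits of the IP ToS / Traffic Class byte, next to the DSCP value used for queue selection. A minimal sketch follows, assuming the slide's "0x10" and "0x11" are meant as the two-bit binary values 10 (ECN-capable) and 11 (congestion experienced); the DSCP value of 24 is only an illustrative choice.

```python
# Decode the IPv4 ToS / IPv6 Traffic Class byte into DSCP and ECN.
# ECN codepoints per RFC 3168.
ECN_NAMES = {0b00: "Not-ECT", 0b01: "ECT(1)", 0b10: "ECT(0)", 0b11: "CE"}

def decode_tos(tos):
    """Split the 8-bit field into (DSCP, ECN name)."""
    dscp = tos >> 2          # upper six bits
    ecn = tos & 0b11         # lower two bits
    return dscp, ECN_NAMES[ecn]

def mark_congestion(tos):
    """Set Congestion Experienced on an ECN-capable packet, as a congested
    switch queue would. Packets that are not ECN-capable are left unchanged
    (a real switch would have to drop them instead)."""
    if tos & 0b11 == 0b00:
        return tos
    return tos | 0b11

tos = (24 << 2) | 0b10       # hypothetical data packet: DSCP 24, ECT(0)
before = decode_tos(tos)
after = decode_tos(mark_congestion(tos))
print(before, "->", after)
```

The DSCP bits are untouched by the marking, so the packet stays in the same queue; only the receiver's reaction changes, since a CE-marked packet is what prompts it to send a CNP back to the sender.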
PFC Congestion Leaf
[Figure: flows from Host A and Host B to Host X (ECN 0x10) fill queue Q3; PFC frames are shown at the leaf. Topology: S1 above L1, L2 … LX.]
PFC Congestion Leaf + Spine
[Figure: the same flows from Host A and Host B to Host X (ECN 0x10), with PFC frames now shown at both the leaf and the spine S1 and Q3 filling on each.]
PFC Congestion Leaf + Spine + Leaf
[Figure: PFC frames shown at the leaf, spine, and leaf layers, with Q3 marked on S1 and on L1, L2 and LX.]
To get RoCEv2 to work as expected, it is critical to configure the NICs and OS correctly
Configure Your Ethernet NICs
Mellanox ConnectX-6 NIC, 2 Port 100G Ethernet NIC
cat /etc/netplan/00-installer-config.yaml
<CLIP>
    ens1f0np0:
      dhcp4: false
    ens1f1np1:
      dhcp4: false
  vlans:                     # VLAN config required for PFC
    ens1f0np0.101:
      id: 101                # VLAN encap is 101
      link: ens1f0np0
      addresses:
      - 192.168.101.11/24
      routes:
      - to: 192.168.0.0/16
        via: 192.168.101.1
        metric: 100
    ens1f1np1.101:
      id: 101
      link: ens1f1np1
      addresses:
      - 172.16.101.11/24
      routes:
      - to: 172.16.0.0/16
        via: 172.16.101.1
        metric: 100
Configure the “marking” Ubuntu Example
sudo apt-get install vlan
Recommendations for Server Tuning*
*These recommendations were developed in conjunction with Vijay Durairaj
Cisco GPU-Accelerated Platform Offerings
System Tuning Recommendations
• BIOS tunings:
- Processor Configuration:
Intel Virtualization Technology : <Enabled>
Intel Hyper-Threading Technology : <Enabled>
CPU Performance : <HPC>
- Power & Performance Configuration:
Processor C1E : <Disabled>
Processor C6 : <Enabled>
Package C-State control : <C0/C1 State>
Notes:
C1E is one of the automatic power-saving features, triggered when the system is idle. Best to disable it when overclocking.
With C6, the BIOS will automatically disable the CPU core and cache for power saving.
In the C0/C1 state all of the cores are locked at maximum performance, which causes higher power consumption.
BIOS Tuning - CPU Performance (HPC)
BIOS Tuning – Processor Power States
Thermal – Fan control Policy
OS Tuning - 1
• Mellanox automatic tuning for performance (part of the mlnx_tune utility; this command applies the NIC optimizations recommended by Mellanox)
# mlnx_tune -p HIGH_THROUGHPUT
• Disable TCP timestamps for better CPU utilization, and enable selective acknowledgments (SACK)
# sysctl -w net.ipv4.tcp_timestamps=0
# sysctl -w net.ipv4.tcp_sack=1
• Increase the maximum socket receive buffer size available to setsockopt():
# sysctl -w net.core.rmem_max=4194304
OS Tuning - 2
• Increase memory thresholds to prevent packet dropping:
# sysctl -w net.ipv4.tcp_rmem="4096 87380 4194304"
# sysctl -w net.ipv4.tcp_adv_win_scale=1
These are recommendations by Mellanox for “Tuning the Network Adapter for Improved IPv4 Traffic Performance”
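Once the sysctl settings are applied, it helps to verify that they stuck (they do not survive a reboot unless persisted). The following is a minimal sketch of such a check: the recommended values are copied from the commands on the OS Tuning slides, while the helper names and the approach of reading /proc/sys directly are assumptions for illustration.

```python
from pathlib import Path

# Values taken from the sysctl commands on the OS Tuning slides.
RECOMMENDED = {
    "net.ipv4.tcp_timestamps": "0",
    "net.ipv4.tcp_sack": "1",
    "net.core.rmem_max": "4194304",
    "net.ipv4.tcp_rmem": "4096 87380 4194304",
    "net.ipv4.tcp_adv_win_scale": "1",
}

def normalize(value):
    """Collapse tabs and repeated spaces so multi-value keys compare cleanly."""
    return " ".join(value.split())

def read_sysctl(key, root="/proc/sys"):
    """Read one key from /proc/sys (Linux only); returns None if absent."""
    path = Path(root, *key.split("."))
    return normalize(path.read_text()) if path.exists() else None

def find_mismatches(current, recommended=RECOMMENDED):
    """Return {key: (current, recommended)} for every key that differs."""
    mismatches = {}
    for key, want in recommended.items():
        have = current.get(key)
        if have is None or normalize(have) != want:
            mismatches[key] = (have, want)
    return mismatches

if __name__ == "__main__":
    live = {key: read_sysctl(key) for key in RECOMMENDED}
    for key, (have, want) in find_mismatches(live).items():
        print(f"{key}: current={have!r} recommended={want!r}")
```

Running it on a tuned host should print nothing; any line of output names a key that still differs from the slide's value.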
PCIE Tuning for the Best Performance
Default PCIE Attributes : Width | Speed | Max Payload Size | Max Read Request
Use the list PCI command (lspci) to find out where the Mellanox NICs are connected
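The four attributes named above (width, speed, Max Payload Size, Max Read Request) can be pulled out of verbose lspci output with a couple of regular expressions. This is only a sketch: the sample text is a simplified, hypothetical excerpt in the style of `lspci -vv`, and the function name is illustrative.

```python
import re

# Hypothetical, simplified excerpt in the style of `lspci -vv` output.
# The device line and the values are illustrative, not captured from a host.
SAMPLE = """
3b:00.0 Ethernet controller: Mellanox Technologies ConnectX-6 (illustrative)
        DevCtl: MaxPayload 256 bytes, MaxReadReq 512 bytes
        LnkSta: Speed 16GT/s (ok), Width x16 (ok)
"""

def parse_pcie_attrs(text):
    """Extract link speed, link width, Max Payload Size and Max Read Request."""
    link = re.search(
        r"LnkSta:\s*Speed\s+([\d.]+GT/s)[^,]*,\s*Width\s+(x\d+)", text)
    ctl = re.search(
        r"MaxPayload\s+(\d+)\s+bytes,\s*MaxReadReq\s+(\d+)\s+bytes", text)
    return {
        "speed": link.group(1) if link else None,
        "width": link.group(2) if link else None,
        "max_payload": int(ctl.group(1)) if ctl else None,
        "max_read_req": int(ctl.group(2)) if ctl else None,
    }

attrs = parse_pcie_attrs(SAMPLE)
print(attrs)
```

On a real host the text would come from running lspci with verbose flags against the NIC's bus address, typically as root so that the capability fields are visible.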
What Does The Network Configuration Look Like?
QOS Configs Classification
CLASS MAP
Put packets with DSCP 24 into “class-q3” and packets with DSCP 48 into “class-q7”.
class-map type qos match-all class-q3
  match dscp 24
class-map type qos match-all class-q7
  match dscp 48
POLICY MAP
Put packets from class-q3 into queue 3 and packets from class-q7 into queue 7. Everything else goes into the default queue, queue 0.
policy-map type qos QOS_classification_policy
  class class-q3
    set qos-group 3
  class class-q7
    set qos-group 7
  class class-default
    set qos-group 0
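The classification policy boils down to a lookup from DSCP value to qos-group with a default. A minimal Python model of that behavior, using only the values from the configuration (the function and table names are illustrative):

```python
# Model of the classification policy: DSCP -> qos-group, default 0.
DSCP_TO_QOS_GROUP = {
    24: 3,   # class-q3
    48: 7,   # class-q7
}

def classify(dscp):
    """Return the qos-group a packet with this DSCP value is placed in."""
    return DSCP_TO_QOS_GROUP.get(dscp, 0)   # class-default -> qos-group 0

for dscp in (24, 48, 0, 46):
    print(f"DSCP {dscp:2d} -> qos-group {classify(dscp)}")
```

Any traffic whose DSCP marking differs from 24 or 48, for example a NIC left at its default marking, silently lands in qos-group 0, which is why the host-side marking has to match the switch policy.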
QOS Configs Queuing and Scheduling
POLICY MAP “type queuing”
Queue 7 is set up as a priority queue. Queue 3 receives 60% of the remaining bandwidth, with ECN marking enabled starting at 150 KB of buffer utilization and a drop probability of 7%.
policy-map type queuing custom-8q-out-policy
  class type queuing c-out-8q-q7
    priority level 1
  class type queuing c-out-8q-q6
    bandwidth remaining percent 0
  class type queuing c-out-8q-q5
    bandwidth remaining percent 0
  class type queuing c-out-8q-q4
    bandwidth remaining percent 0
  class type queuing c-out-8q-q3
    bandwidth remaining percent 60
    random-detect minimum-threshold 150 kbytes maximum-threshold 3000 kbytes drop-probability 7 weight 0 ecn
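The random-detect line can be read as a marking curve over queue depth. The sketch below assumes the classic WRED model: no marking below the minimum threshold, a linear ramp up to the configured probability at the maximum threshold, and marking of every packet above it. The exact behavior on a given platform may differ, so treat this as an illustration of the parameters rather than a description of the switch.

```python
def ecn_mark_probability(queue_kbytes, min_kb=150, max_kb=3000, max_prob=0.07):
    """Probability that a packet is ECN-marked at a given queue depth.

    Assumed WRED model: 0 up to the minimum threshold, a linear ramp to
    max_prob at the maximum threshold, and 1.0 above the maximum threshold.
    Defaults mirror the random-detect line in the configuration.
    """
    if queue_kbytes <= min_kb:
        return 0.0
    if queue_kbytes > max_kb:
        return 1.0
    return max_prob * (queue_kbytes - min_kb) / (max_kb - min_kb)

for depth in (100, 150, 1575, 3000, 4000):
    print(f"{depth:5d} KB -> {ecn_mark_probability(depth):.3f}")
```

Lowering the minimum threshold makes senders back off earlier at the cost of some throughput; raising it lets queues build further before ECN reacts, which moves the fabric closer to relying on PFC.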
QOS Configs cont.
POLICY MAP “type network-qos”
Use a policy-map of type network-qos to add PFC to queue 3. The MTU is used to calculate headroom for the no-drop queue.
policy-map type network-qos custom-8q-nq-policy
  <snip>
  class type network-qos c-8q-nq3
    mtu 9216
    pause pfc-cos 3
  <snip>
SYSTEM QOS
Attach the policy maps of "type queuing" and "type network-qos" system wide. This enables ECN marking, and ports configured with PFC will receive and honor pause frames as well as generate them as needed.
system qos
  service-policy type network-qos custom-8q-nq-policy
  service-policy type queuing output custom-8q-out-policy
QOS Configs Interfaces
AI/ML Blueprint for DC Networks
The Blueprint For Today
Built to accommodate 1024 GPUs along with storage devices
The Blueprint and the CVD
Nexus Dashboard Fabric Controller Automation
Nexus Dashboard Insights - Visibility
* Coming Soon
Conclusion
• The AI/ML market and technology are moving quickly
• If you are building a very large training cluster, please speak with your sales team
• We have a blueprint and best practices to get you started on your journey today
Thank you