
Tesla Transport Protocol over Ethernet (TTPoE)

A new lossy, Exa-Scale fabric for the Dojo AI Supercomputer

Eric Quinnell, Ph.D.
Dojo Fabric Lead
HOT CHIPS 2024


Problem Statement

TCP/IP is too slow for scaled AI interconnect
• Bound by the CPU software kernel

Lossless fabrics are complex and brittle
• Priority Flow Control (PFC) affects the global network

Ideal fabric:
• Lowest latency
• Highest bandwidth
• Simple software

For Tesla AI:
• Layer 2 only
• Collective communications and ingest
• Low congestion, single application



TTPoE

Tesla Transport Protocol over Ethernet (TTPoE) is a peer-to-peer Ethernet Transport Layer protocol executed entirely in hardware.

Why a custom transport protocol?
1. Vertical integration – extend Dojo RDMA onto the optical fabric
2. “Lossy” Ethernet network – eases scaling, cost, and congestion management
3. Use 3rd-party hardware – Ethernet II frames “just work”

TCP got it right – just do it in hardware


Dojo OSI Layers

Standard Stack (TCP/IP implementation):
• Layer 7 Application – HTTP, Telnet, FTP (Software)
• Layer 6 Presentation – JPEG, PNG, MPEG (Software)
• Layer 5 Session – NFS, SQL (Software)
• Layer 4 Transport – TCP, UDP (Software)
• Layer 3 Network – IPv4/IPv6 (Software)
• Layer 2 Data Link – Ethernet frames, MAC addresses, VLAN (Hardware)
• Layer 1 Physical – Data encoding, physical specs (Hardware)

Dojo Stack:
• Layer 7 Application – Pytorch, Dojotorch (Software)
• Layer 6 Presentation – FFMPEG, HEVC, YUV (Software)
• Layer 5 Session – Dojo RDMA Descriptors (Software)
• Layer 4 Transport – TTP (Hardware)
• Layer 3 Network – IPv4/IPv6 (Optional) (Hardware)
• Layer 2 Data Link – Ethernet frames, MAC addresses, VLAN (Hardware)
• Layer 1 Physical – Data encoding, physical specs (Hardware)


TTP transaction examples

[Diagram: two TTP Link Communication timelines between TTP Device A and TTP Device B]
• Clean TTP transfer example
• NACK TTP transfer example: TTP_PAYLOAD, ID=3 is either lost or out of order


Transport Layer State Machines

[Diagram: the TCP state machine (IETF RFC-793) next to the TTP state machine. TTP states: CLOSED, OPEN SENT, OPEN RECD, OPEN, CLOSE SENT, CLOSE RECD. Transitions are driven by OPEN/OPEN_ACK/OPEN_NACK and CLOSE/CLOSE_ACK/CLOSE_NACK messages (TX/RX), timeout-driven resends, and hardware conditions such as HW-constrained, idle-timer, !victim, and !quiesced guards]

Modifications made for hardware-only execution (vs. IETF RFC-793):
• A 2-millisecond quiesce in a microsecond protocol is too long
• No reliance on virtual memory – physical memory only
• Automatic OPEN/CLOSE with no SW involvement
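The OPEN side of the TTP handshake can be sketched as a small transition table. This is a hypothetical reconstruction from only the state and message names on the slide (CLOSED, OPEN SENT, OPEN RECD, OPEN; OPEN/OPEN_ACK/OPEN_NACK); timeouts, the CLOSE flow, and victimization are omitted.

```python
# Transition table keyed by (current state, event); events are prefixed
# tx_/rx_ for messages this endpoint sends/receives.
TRANSITIONS = {
    ("CLOSED",    "tx_OPEN"):      "OPEN_SENT",   # initiator sends OPEN
    ("CLOSED",    "rx_OPEN"):      "OPEN_RECD",   # target receives OPEN
    ("OPEN_SENT", "rx_OPEN_ACK"):  "OPEN",        # handshake complete
    ("OPEN_SENT", "rx_OPEN_NACK"): "CLOSED",      # peer refused the link
    ("OPEN_RECD", "tx_OPEN_ACK"):  "OPEN",        # accept the link
}

def step(state, event):
    # Unknown (state, event) pairs leave the state unchanged
    return TRANSITIONS.get((state, event), state)

# An initiator opening a link: CLOSED -> OPEN_SENT -> OPEN
s = "CLOSED"
for ev in ("tx_OPEN", "rx_OPEN_ACK"):
    s = step(s, ev)
assert s == "OPEN"
```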
TTP Header Frame

TTP uses the simple Ethernet II frame format with optional standard layers

• Dojo at scale uses only Layer 2, currently not using Layer 3
• MAC addresses are a hardware hash of the SOW Physical Address (PA)
• A TTP endpoint can concurrently handle 512 unique links, dynamically replaced via victimization and LRU
• Virtual channels (VCs) allow for non-blocking control, semaphore, completion, and data movement
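The 512-link table with LRU victimization described above can be sketched as a fixed-capacity map that evicts the least-recently-used link when full. The class name, capacity default, and link-state contents are illustrative; the hardware's actual replacement policy details are not public.

```python
from collections import OrderedDict

class LinkTable:
    """Fixed-capacity link table with LRU victimization (a sketch)."""

    def __init__(self, capacity=512):
        self.capacity = capacity
        self.links = OrderedDict()  # mac -> link state, oldest first

    def lookup(self, mac):
        """Return (link_state, victim_mac); victim_mac is the evicted
        link's MAC if an LRU victimization occurred, else None."""
        if mac in self.links:
            self.links.move_to_end(mac)       # refresh LRU position
            return self.links[mac], None
        victim = None
        if len(self.links) >= self.capacity:  # table full: evict LRU link
            victim, _ = self.links.popitem(last=False)
        self.links[mac] = {"state": "OPEN"}
        return self.links[mac], victim
```

With a toy capacity of 2, touching a third MAC evicts the least-recently-used one, matching the dynamic-replacement behavior the slide describes.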



Lossy Protocol

TTPoE is a "lossy" transport protocol

• "Lossy" transport means the underlying medium is expected to lose packets and retry – full packet delivery is still guaranteed.
  o Similar to TCP and unlike UDP.
• TTP defaults to packet drops and replays in corner cases of congestion, backpressure, or errors
• Speculative transmission is limited by SRAM size before an RTT ACK. This, in effect, forces a “TTP window size” beyond which bandwidth is lost
• Local SRAM lines are not retired/deallocated until the ACK comes back, allowing HW to replay the line
• Replay amounts are also limited by SRAM, constraining the scale of replay storms
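The SRAM-bounded window above behaves like a classic sliding window: lines stay buffered until ACKed, transmission stalls once the buffer is full, and any un-ACKed line can be replayed. A minimal sketch, with illustrative capacity and line granularity:

```python
class TxBuffer:
    """Sketch of an SRAM-backed TX buffer with ACK-based retirement."""

    def __init__(self, capacity_lines):
        self.capacity = capacity_lines
        self.inflight = {}              # seq -> payload, awaiting ACK

    def send(self, seq, payload):
        if len(self.inflight) >= self.capacity:
            return False                # window full: stall/backpressure
        self.inflight[seq] = payload
        return True

    def ack(self, seq):
        self.inflight.pop(seq, None)    # retire/deallocate the line

    def replay(self, seq):
        return self.inflight.get(seq)   # un-ACKed line is still available

buf = TxBuffer(capacity_lines=2)
assert buf.send(0, b"a") and buf.send(1, b"b")
assert not buf.send(2, b"c")   # beyond the "TTP window size": stalled
buf.ack(0)                     # ACK retires a line...
assert buf.send(2, b"c")       # ...freeing space for new transmission
assert buf.replay(1) == b"b"   # dropped packets can be replayed from SRAM
```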



Congestion Management

Congestion management is distributed

• Exponential backoff, rate control, and algorithms are handled by local link TX channels, not by a central network or switch
• Fault-tolerant flow “flushes” the TTP network and removes a bad link before continuing training
• No PFC, no Nagle algorithm, no QoS, no tokens, no lossless artifacts
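Sender-local exponential backoff of the kind described above is typically implemented as a doubling, capped, jittered delay. This is a generic sketch; TTP's actual constants and algorithm are not disclosed on the slide, and `base_us`/`cap_us` are illustrative.

```python
import random

def backoff_delay(attempt, base_us=1.0, cap_us=1000.0):
    """Return a retry delay in microseconds for the given attempt number.

    The backoff window doubles each attempt (capped at cap_us), and the
    actual delay is drawn uniformly from it to de-synchronize senders.
    Each link's TX channel computes this locally, with no central switch
    or network coordination.
    """
    window = min(cap_us, base_us * (2 ** attempt))
    return random.uniform(0, window)
```

Because every sender jitters independently, colliding links naturally spread out their retries without any PFC-style global signaling.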



TTP MAC IP

The Transport Layer hardware is an IP block between a NOC and an Ethernet standard MAC
• Translates and coalesces 64B/cycle NOC packets into up to 1kB TTP Ethernet packets
• Speaks AXI-S or SOP/EOP formats
• Optionally activates standard MAC features – pause packets, counters, stats, LLDP
• IP block instantiated in FPGA and Silicon implementations
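The coalescing step above, packing 64 B/cycle NOC packets into Ethernet payloads of up to 1 kB, can be sketched behaviorally. Flushing and timeout rules are simplified away; the function and its parameters are illustrative, not the IP block's interface.

```python
def coalesce(flits, max_frame=1024, flit_bytes=64):
    """Pack fixed-size 64 B NOC flits into TTP payloads of up to 1 kB."""
    frames, current = [], b""
    for flit in flits:
        assert len(flit) == flit_bytes
        if len(current) + flit_bytes > max_frame:
            frames.append(current)       # frame full: emit and start fresh
            current = b""
        current += flit
    if current:
        frames.append(current)           # flush the partial final frame
    return frames

# 20 flits of 64 B = 1280 B -> one full 1024 B frame plus a 256 B frame
frames = coalesce([bytes(64)] * 20)
assert [len(f) for f in frames] == [1024, 256]
```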

[Diagram: NOC RX/TX ↔ TTP MAC (AXI-S) ↔ standard Ethernet IP (MAC, MII, 64/66b PCS per IEEE 802.3, PCS/FEC, PMA/PHY) ↔ serdes[3:0]]



TTP MAC Micro-Architecture

TTP’s micro-architecture uses techniques from SMP caches, snoop filters, and CPUs
• 4-stage Read-Modify-Write (RMW) pipeline
• TX buffer size determines the maximum outstanding packets before stall/backpressure
• ACK packets “retire” a packet from the common buffer
• 1MB TX buffer allows for ~80 microseconds of RTT latency tolerance
• Virtual channels to prioritize and avoid livelock/deadlock
• Multi-channel “coherent” arbitration to update the link and use the TX physical channel
• DMA descriptors issue to the TTP MAC
  • Can be PUSH for implicit pass-thru local-to-remote
  • Can be an explicit HBM2HBM fabric memcpy
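The ~80 µs figure is consistent with a simple bandwidth-delay-product check, assuming the full 1 MB buffer holds un-ACKed data at the 100 Gbps line rate:

```python
# Bandwidth-delay product: a 1 MB TX buffer at 100 Gbps covers the data
# in flight for an RTT of buffer_bytes * 8 / line_rate seconds.
buffer_bytes = 1_000_000            # 1 MB TX buffer
line_rate_bps = 100_000_000_000     # 100 Gbps line rate
rtt_tolerance_us = buffer_bytes * 8 / line_rate_bps * 1e6
assert abs(rtt_tolerance_us - 80.0) < 1e-6   # matches the slide's ~80 us
```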



“Mojo” 100Gbps Dumb-NIC

Feature          Spec
Ethernet Speed   100 Gbps QSFP
PCI-e            Gen3 x16
Memory           8 GB DDR4
Power            <20 W max
Reliability      5-year tested
DMA engine       Dojo DMA
CPU+OS           None
Active Links     512 unique, 2-way, LRU

[Diagram: Mojo Interface Processor (MIP) block diagram – QSFP28 100 Gbps Ethernet controller, TTP, NOC, DMA engine, DDR memctl, PCIe Gen3 x16 controller, clocks/reset/debug/power, CSR/perfmons]


Hot Chips 34

First integration box – D1 Die
• TSMC 7nm, 645 mm²
• Physically and logically arranged as a 2D array
  • 354 Dojo processing nodes on die
• Extremely modular design
• 362 TFlops BF16/CFP8, 22 TFlops FP32 @ 2GHz
• 440 MB SRAM
• Custom low-power serdes channels on all edges
  • 576 bidirectional channels
  • 2 TB/s bandwidth on each edge
• Seamless connection to neighboring dies


Second integration box – Dojo Training Tile
• 5x5 array of known-good D1 chips
  • 4.5 TB/s off-tile bandwidth per edge
  • Half of in-tile bandwidth
• Fully integrated module
  • Electrical + thermal + mechanical
  • 15kW of power delivery
• Custom power delivery
  • Horizontal data communication plane
  • Vertical power delivery and cooling
  • 15kW per module
• Custom high-density connectors
  • Seamless connection to neighboring training tiles
V1 Dojo Interface Processor

32GB High-Bandwidth Memory


- 800 GB/s Total Memory Bandwidth

900 GB/s TTP Interface


- Tesla Transport Protocol (TTP) - Full custom protocol
- Provides full DRAM bandwidth to Training Tile

50 GB/s TTP over Ethernet (TTPoE)


- Enables extending communication over standard Ethernet
- Native hardware support

32 GB/s Gen4 PCIe Interface


“Mojo” Hosts – Variable Ingest via TTP Network

Vision networks can be heavily ingest-limited
• Vision-based tensors and training clips in GBs
• “Mojo” Hosts are scheduled on demand from a generic compute pool
• Forward/backward-pass TTP traffic is mutually exclusive
  • i.e., ingest and all-reduce share the same TTP DIP ports but execute during different phases of training

[Diagram: remote Mojo hosts (PCIe Gen3, 100 Gbps TTP NICs) provide variable ingest (forward pass) at 1 Tbps through the TTP network into DIPs feeding the D1 SOW at 2 Tbps; the Main Host provides local ingest over PCIe Gen4; all-reduce (backward pass) traffic flows to other partitions]


MDCH – Mojo Dojo Compute Hall



Dojo Engineering System

• 4x ExaFLOP BF16/FP16 cluster
• 40 PB local storage
• 40,960 Main Host cores
• 61,440 Mojo Host cores
• 320 Tbps TTP all-reduce I/O (endpoint)
• 128 Tbps TTP ingest I/O (endpoint)
• 208 Tbps TCP/IP (endpoint)
• Converged and non-converged network experiments

[Diagram: EVPN/VXLAN leaf-spine topology (SPINE-1..4, LEAF-1..8) spanning a converged Ethernet network (TTP + TCP/IP) and independent networks (TTP-only, TCP/IP-only), with per-leaf loads of 80 Tbps TTP, 16 Tbps TCP/IP, 32 Tbps TTP, and 36 Tbps TCP/IP]


Results
• Measured on Arista 7060, 7808, and 7816 switches
• RTT latency is a random sampling of in-flight packets + ACK return
• Gbps is wall-time, real-data movement
• All-reduce measurement is network-only, non-pipelined
  ▪ SOW all-reduce not shown (pre-network)
• All-reduce throughput is determined by the slowest node in the system


TTPoE in Ultra Ethernet Consortium (UEC)

https://ultraethernet.org/

https://ultraethernet.org/wp-content/uploads/sites/20/2023/10/23.07.12-UEC-1.0-Overview-FINAL-WITH-LOGO.pdf

Tesla has achieved Exa-scale with a lossy fabric, executing real training runs deployed in FSD

Tesla is joining the UEC and offering the TTPoE protocol publicly



Team Acknowledgements

Prototyping is Easy. Scaling is Hard

Thanks to the TTPoE Original Inventors, Network Deployment Team, Silicon Design Team, System and Infrastructure Team, SW and Drivers Team, Linux Patch Team, SDN Team, DevOps Team, QA Team, DC Tech Team, Supply Team, and all TTP/Mojo Interns



Tesla Transport Protocol over Ethernet (TTPoE)


Backup – Latencies

Intended de-emphasis on synthetic latency measurements

Differences of greater consequence:
• lossy vs. lossless
• centralized vs. distributed congestion
• proprietary vs. open source
• sustained bandwidths at scale

Measurement platforms:
• TTPoE, TCP/IP – Spectrum3 SN4700
• IB – Spectrum 9700 IB
• NVLink – DGX-H100 NVSwitch level 1 (internal)
• RoCEv2 – 7812 R3

Inconsistent methodology and hardware, not at scale