
Introduction to

PCIe and CXL


Paolo Durante
(CERN EP-LBC)

24/06/2024 ISOTDAQ 2024 - Introduction to PCIe & CXL 1


Where can you find PCIe?
PCI (Peripheral Component Interconnect) Express is a
popular standard for high-speed computer expansion
overseen by PCI-SIG
(Special Interest Group)
• PCIe interconnects can be present at all levels of your DAQ
chain…
• Readout boards
• Storage media
• Network interfaces
• Compute accelerators (GPUs, FPGAs…)
• …and may be even more so in the future (with CXL)
• Memory expanders
• Understanding your data acquisition system requires (some)
level of understanding of PCI Express



What is this presentation about?
• PCIe history and evolution

• PCIe concepts

• PCIe layers

• PCIe practical aspects

• PCIe performance

• PCIe future roadmap



PCI (“conventional PCI”)
• 1992
• Peripheral Component
Interconnect
• Parallel Interface
• Bandwidth
• 133 MB/s (~1.0 Gb/s)
(32-bit@33 MHz)
• 533 MB/s (~4.2 Gb/s)
(64-bit@66 MHz)
• Plug-and-Play
configuration (BARs)
PCI example: ATLAS FILAR
• ~2003
• 4 optical channels
• 160 MB/s (1.28 Gb/s)
• S-LINK protocol
• 2 Altera FPGAs
• Burst-DMA over PCI
• 3rd Altera FPGA
• 64-bit@66MHz PCI



PCI-X (“Extended PCI”)
• 1998
• PCI compatible
• Hardware and software
• Half-duplex bidirectional
• Higher bus efficiency
• Split-responses
• Message Signaled Interrupts
• Bandwidth
• ≤ 1066 MB/s (~8.5 Gb/s)
(64-bit@133 MHz)
• 2133 MB/s (~17 Gb/s)
(PCI-X 266)
• 4266 MB/s (~34 Gb/s)
(PCI-X 533)



PCI-X example: CMS FEROL
• ~2011
• 4 SFP+ cages
• 1x 10 Gb/s Ethernet
• 3x SlinkXpress
• PCI-X interface to
legacy FE (Slink64)
• Altera FPGA
• Simplex TCP-IP



PCI Express (PCIe)
(Photo: slot comparison, top to bottom: PCIe x16 | PCI | PCIe x8 | PCI-X)
• 2004
• PCI “inspired”
• software, topology
• Serial interface
• Full-duplex bidirectional
• Bandwidth (Gen4)
• x1: ≤2 GB/s (16 Gb/s)
(in each direction)
• x16: ≤32 GB/s (256 Gb/s)
(in each direction)
• Still evolving
• 1.0, 2.0, 3.0, 4.0, 5.0, 6.0…



PCIe example: ALICE C-RORC
• ~2014
• 3x QSFP
• 36 channels
• up to 6.6Gb/s/channel
• 2x DDR SO-DIMM
• Xilinx Virtex-4 FPGA
• PCIe Gen2 x8

• Also used by ATLAS


PCIe example: LHCb TELL40
• Introduced for LHC Run3
• Currently in production
• ≤ 48 duplex optical links
• GBT (3.2 Gb/s)
• WideBus (4.48 Gb/s)
• GWT (5.12 Gb/s)
• Altera Arria10 FPGA
• 110 Gb/s DMA
• PCIe 3.0 x16
• Also used by ALICE



PCIe example: ATLAS FELIX
• Introduced for LHC Run3
• ≤ 48 duplex optical links
• Xilinx UltraScale FPGA
• 2x DDR4 SO-DIMM
• PCIe 3.0 x16
• Wupper DMA
(Open Source)
• Also used by DUNE



PCIe example: CPPM PCIe400
• PCIe add-in card, 3/4 length
• Agilex 7 M-series
AGMF039R47A1E2V
• 8-12x the processing capability of the
previous-generation FPGA (Arria 10)
• 32 GiB HBM2e
• Up to 48x26Gbps NRZ for FE
• PCIe Gen 5 / CXL
• QSFP112 for 400GbE
(experimental)
• 2 SFP+ for White Rabbit clock
distribution or PON fast control
• High precision PLLs jitter <100fs
RMS with phase control



PCIe example: BNL FLX-155
• FPGA: Xilinx Versal Premium
XCVP1552
• PCIe Gen5 x16 (32 GT/s/lane, 512 Gb/s raw)
• 48 FireFly data links @25 Gb/s
• LTI link
• 100/400 GbE
• DDR4
• GbE
• White Rabbit
• PetaLinux



What is this presentation about?
• PCIe history and evolution

• PCIe concepts

• PCIe layers

• PCIe practical aspects

• PCIe performance

• PCIe future roadmap



PCIe concepts – Packets
• Point-to-point connection
• “Serial” “bus” (fewer pins)
• Scalable link: x1, x2, x4, x8, x12, x16, x32
• Packet encapsulation
(Diagram: PCI Express Device A ⇄ Device B exchanging packets over a link)



PCIe concepts – Root complex

• Connects the processor and memory
subsystems to the PCIe fabric via a Root Port
• Generates and processes transactions
with Endpoints on behalf of the processor



PCIe concepts – Topology
Relative to root – up is towards, down is away

“UPSTREAM”

“DOWNSTREAM”



PCIe concepts – BDF
• “Geographical” addressing: Bus : Device . Function
• Functions form a hierarchy-based address
• Multiple logical “Functions” allowed on one physical device
• Bridges (PCI/PCI-X) form the hierarchy
• Switches (PCIe) form the hierarchy
• On Linux: $ man lspci

$ lspci -tv
-+-[0000:ff]-+-08.0  Intel Corporation Xeon ...
 |           +-08.3  Intel Corporation Xeon ...
 |           +-08.4  Intel Corporation Xeon ...
 |           +-09.0  Intel Corporation Xeon ...
 |           ...
 +-[0000:80]-+-00.0-[81]--
 |           +-01.0-[82]--
 |           +-02.0-[83]----00.0  Intel Corporation Xeon Phi coprocessor 31S1
 |           +-03.0-[84]--
 |           +-03.2-[85]----00.0  Intel Corporation Xeon Phi coprocessor 31S1
 |           +-05.0  Intel Corporation Xeon E5/Core i7 Address Map, VTd_Misc, System Management
 |           +-05.2  Intel Corporation Xeon E5/Core i7 Control Status and Global Errors
 |           \-05.4  Intel Corporation Xeon E5/Core i7 I/O APIC
 +-[0000:7f]-+-08.0  Intel Corporation Xeon E5/Core i7 QPI Link 0
 |           +-08.3  Intel Corporation Xeon E5/Core i7 QPI Link Reut 0
 |           ...
 \-[0000:00]-+-00.0  Intel Corporation Xeon E5/Core i7 DMI2
             +-01.0-[01]--
             +-01.1-[02]--
             +-02.0-[03]----00.0  Intel Corporation Xeon Phi coprocessor 31S1
             +-03.0-[04]----00.0  Intel Corporation Xeon Phi coprocessor 31S1
             +-05.0  Intel Corporation Xeon E5/Core i7 Address Map, VTd_Misc, System Management
             +-05.2  Intel Corporation Xeon E5/Core i7 Control Status and Global Errors
             +-05.4  Intel Corporation Xeon E5/Core i7 I/O APIC
             +-11.0-[05]--+-00.0  Intel Corporation C602 chipset 4-Port SATA Storage Control Unit
             |            \-00.3  Intel Corporation C600/X79 series chipset SMBus Controller 0
             +-1c.0-[06]----00.0  Intel Corporation 82574L Gigabit Network Connection
             ...

00:00.0 Host bridge: Intel Corporation Xeon E5/Core i7 DMI2 (rev 07)
80:02.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express Root Port 2a (rev 07)
83:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor 31S1 (rev 11)
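A BDF address such as 0000:83:00.0 can be decomposed programmatically. A minimal sketch (the helper name is ours; addresses without the domain prefix are not handled):

```python
def parse_bdf(addr):
    """Split an lspci-style 'domain:bus:device.function' address."""
    domain, bus, devfn = addr.split(":")
    device, function = devfn.split(".")
    return {
        "domain": int(domain, 16),      # PCI domain (segment)
        "bus": int(bus, 16),            # bus number (8 bits)
        "device": int(device, 16),      # device number (5 bits)
        "function": int(function, 16),  # function number (3 bits)
    }

# The Xeon Phi co-processor from the lspci output above:
print(parse_bdf("0000:83:00.0"))
```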


Troubleshooting with lspci
• Device works but is “slow”
• Link speed
• Link width
• MaxPayloadSize
• Interrupts
• Error flags
• Look for bottlenecks upstream
• Device is “there” but driver fails to load
• Unreadable config space
• Unallocated BARs
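Such checks can be scripted. A minimal sketch (function name is ours) that scans `lspci -vvv` output for links that trained below their capability:

```python
import re

def degraded_links(lspci_output):
    """Return (device, LnkSta) pairs for links that trained below capability.

    lspci -vvv marks such links with '(degraded)' or '(downgraded)' in the
    LnkSta line; feed this function the captured output of: lspci -vvv
    """
    found, current = [], None
    for line in lspci_output.splitlines():
        if line and not line[0].isspace():
            current = line.split()[0]       # device address, e.g. 'e1:00.0'
        m = re.search(r"LnkSta:\s*(.+)", line)
        if m and ("degraded" in m.group(1) or "downgraded" in m.group(1)):
            found.append((current, m.group(1).strip()))
    return found

sample = ("e1:00.0 Ethernet controller: Intel Corporation X550\n"
          "\tLnkSta:\tSpeed 5GT/s (degraded), Width x4 (ok)\n")
print(degraded_links(sample))
```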



Practical troubleshooting (1/3)
The LHCb experiment is composed of 6 subdetectors.
Dataflow (in ~150 readout nodes):
• Readout board → memory (PCIe)
• Local memory → remote memory (PCIe + InfiniBand RDMA)
• Memory → GPU (PCIe)
• Memory → HLT1 buffer (PCIe + Ethernet)
(Diagram: frontend → DAQ → RAM → EB RU → (InfiniBand) → EB BU → GPU/HLT → HLT1 buffer)

The issue: the DAQ performs well with 5 subdetectors, but throughput drops as soon as a sixth subdetector is added.
Practical troubleshooting (2/3)
Check all connections:
• DAQ → RAM (OK): no backpressure
• EB RU → EB BU (OK): full ib_write_bw throughput, no congestion
• RAM → GPU (OK): nominal HLT rate, no backpressure
• HLT send → HLT recv (OK): full iperf throughput on each port of the dual-port NIC
(Diagram: same dataflow as on the previous slide)
Practical troubleshooting (3/3)
$ sudo lspci | grep Ethernet
62:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
62:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
e1:00.0 Ethernet controller: Intel Corporation Ethernet Controller X550 (rev 01)
e1:00.1 Ethernet controller: Intel Corporation Ethernet Controller X550 (rev 01)
$ ifconfig
bond0: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST> mtu 9000
inet 10.132.2.108 netmask 255.255.255.128 broadcast 10.132.2.127
ether b4:2e:99:ac:a7:94 txqueuelen 1000 (Ethernet)
RX packets 165258503 bytes 11423375211 (10.6 GiB)
RX errors 0 dropped 70093 overruns 0 frame 0
TX packets 5062760768 bytes 45520865620164 (41.4 TiB)
TX errors 0 dropped 2 overruns 0 carrier 0 collisions 0
enp225s0f0: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 9000
ether b4:2e:99:ac:a7:94 txqueuelen 1000 (Ethernet)
RX packets 84474162 bytes 5875677732 (5.4 GiB)
RX errors 0 dropped 2 overruns 0 frame 0
TX packets 2529040640 bytes 22741702699689 (20.6 TiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
enp225s0f1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 9000
ether b4:2e:99:ac:a7:94 txqueuelen 1000 (Ethernet)
RX packets 80784342 bytes 5547697851 (5.1 GiB)
RX errors 0 dropped 3 overruns 0 frame 0
TX packets 2533720128 bytes 22779162920475 (20.7 TiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
$ sudo lspci -s e1:00 -vvv | grep LnkSta:
        LnkSta: Speed 5GT/s (degraded), Width x4 (ok)

The X550 NIC's link trained at 5 GT/s (Gen2) instead of 8 GT/s: a Gen2 x4 link carries only ~16 Gb/s of usable bandwidth, not enough for both 10 GbE ports at line rate.



PCIe concepts – Address spaces
• Address spaces
• Configuration
(Bus/Device/Function)
• Memory (64-bit)
• I/O (32-bit)

• Configuration space
• Base Address Registers
(BARs) (32/64-bit)
• Capabilities (linked list)
• On linux: $ man setpci



PCIe concepts – Memory & I/O
• Memory space maps cleanly to CPU semantics
• 32-bits of address space initially
• 64-bits introduced via Dual-Address Cycles (DAC)
• Extra period of address time on PCI/PCI-X
• 4DWORD header in PCI Express
• Burstable (= Multiple DWORDs)
• I/O space maps cleanly to CPU semantics
• 32-bits of address space
• Non-burstable



PCIe concepts – Bus address
This is actually not specific to PCIe, but a generic
reminder:
• Physical address: the address the CPU sends to the
memory controller
• Virtual address: an indirect address created by the
operating system, translated by the CPU to physical
• Bus address: an address understood by the devices
connected to a specific bus
• On Linux, see: pci_iomap(), remap_pfn_range(), …



PCIe concepts – Bridges

Transparent:
• Single root (or SR-IOV)
• Single address space
• Multiple downstreams (switch)
• Downstreams appear in the same topology
• Addresses are passed through unchanged

Non-Transparent:
• Joins two independent topologies
• One root on each side
• Each side has its own address space
• Needs a translation table
• Use cases: fault tolerance, “networking”, HPC
PCIe concepts – Interrupts
• PCI
  • INTx#, x ∈ {A, B, C, D}
  • Level sensitive
  • Can be mapped to a CPU interrupt number
• PCIe
  • “Virtual Wire” emulation
  • Assert_INTx / Deassert_INTx message codes

Corresponding Linux kernel calls:
  pci_read_config_byte(dev, PCI_INTERRUPT_PIN, &(...));
  pci_read_config_byte(dev, PCI_INTERRUPT_LINE, &(...));
  pci_enable_msi(dev);
  request_irq(dev->irq, my_isr, IRQF_SHARED, devname, cookie);
PCIe concepts – MSI & MSI-X
• Based on messages (MWr)
• MSI uses one address with a
variable data value indicating
which “vector” is asserting
• ≤ 32 per device (in theory)
• MSI-X uses a table of
independent address and
data pairs for each “vector”
• ≤ 2048 per device (use affinity!)
• Vector: interrupt id
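The two schemes can be contrasted with a small sketch (all constants are illustrative, not from the slides; 0xFEE00000 is the conventional x86 MSI address window, and in practice the OS, not the driver, programs these values):

```python
# MSI: a single address; the data value encodes which vector fired.
# A device with N vectors (N <= 32) adds the vector number to one
# base data value.
def msi_message(base_addr, base_data, vector):
    assert vector < 32
    return (base_addr, base_data + vector)

# MSI-X: a table of fully independent (address, data) pairs, one per
# vector (<= 2048), so each vector can target its own CPU (affinity).
msix_table = [(0xFEE00000 + 4 * v, 0x4000 + v) for v in range(8)]

print(msi_message(0xFEE00000, 0x40, 3))
print(msix_table[3])
```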



PCIe Gen1 (2003)
• Introduced at 2.5 GT/sec (32 Gb/s/d in x16)
• Signaling rate is also referred to as 2.5 GHz, 2.5 Gb/s
• 100 MHz reference clock
• Eases synchronization between ends
• Can use Spread Spectrum Clocking to reduce EMI
• Optional, but nearly universal
• 8b/10b encoding used to provide DC balance and
reduce “runs” of 0s or 1s which make clock recovery
difficult
• Specification Revisions: 1.0, 1.0a, 1.1



PCIe Gen2 (2006)
• Speed doubled to 5 GT/sec (64 Gb/s/d in x16)
• Reference clock remains at 100 MHz
• Lower jitter clock sources required vs 2.5 GT/sec
• Generally higher quality clock generation/distribution
required
• 8b/10b encoding continues to be used
• Specification Revisions: 2.0, 2.1
• Devices choosing to implement a maximum rate of
2.5 GT/sec can still be fully 2.x compliant



PCIe Gen3 (2010)

2x5=?



PCIe Gen3 (2010)

2x5=8
• Speed “doubled” from 5 GT/sec (126 Gb/s/d in x16)
• More efficient encoding (20% → ~1%)
• 8b/10b → 128b/130b
• 8 GT/sec electrical rate
• 10 GT/sec required significant cost and complexity in
channel, receiver design, etc.
• Reference clock remains at 100 MHz
• Backwards-compatible speed negotiation



PCIe Gen4 (2017)

2x8=?



PCIe Gen4 (2017)

2 x 8 = 16
• Speed doubled from 8 GT/sec (252 Gb/s/d in x16)
• Same 128b/130b encoding
• 16 GT/sec electrical rate
• Channel length: ≤ 10”/14”
• Retimer mandatory for longer channels
• More complex pre-amplification, equalization stages
• Reference clock remains at 100 MHz
• Backwards-compatible protocol negotiation
and CEM spec



PCIe Gen5 (2019)

2 x 16 = 32
• Speed doubled from 16 GT/sec (504 Gb/s/d in x16)
• Same 128b/130b encoding (with small differences)
• 32 GT/sec electrical rate
• Channel length: ≤ 10”/14”
• Up to 2 retimers for longer channels
• More complex pre-amplification, equalization stages
• Support for alternate protocols (see CXL)



PCIe Gen6 (2022)

2 x 32 = 64
• Speed doubled from 32 GT/sec (1024 Gb/s/d in x16)
• NRZ → PAM4 signaling
• 2 bits per Unit Interval
• Lower eye-height and width, much higher First Bit-Error Rate
(FBER)
• Forward Error Correction (FEC)
• Light-weight and low-latency (2ns) FEC for initial correction
• CRC and link-level retry for larger errors
• Flow Control Unit (FLIT) encoding
• Fixed-size and fixed(lower)-latency, compared to TLPs



PCIe Gen7 (2023)



What is this presentation about?
• PCIe history and evolution

• PCIe concepts

• PCIe layers

• PCIe practical aspects

• PCIe performance

• PCIe future roadmap



PCIe – Protocol stack
(Diagram: PCI Express Devices A and B, each stacked as Application Layer → PCI Express Logic Interface → Transaction Layer → Data Link Layer → Physical Layer, with separate TX and RX paths, connected by the Link.)
FPGA Hardened PCIe IP



PCIe – Transaction layer
• Four possible transaction types
• Memory Read | Memory Write
• Transfer data from or to a memory mapped location
• Address routing
• IO Read | IO Write
• Transfer data from or to an IO location (on a legacy endpoint)
• Address routing
• Config Read | Config Write
• Discover device capabilities, status, parameters
• ID routing (BDF)
• Messages
• Event signaling



PCIe – TLP structure
The MaxPayloadSize (MPS) Application Layer parameter limits the payload and dominates performance.

Transmit order:
| STP | Sequence | Header | Data Payload | ECRC | LCRC | End |
| 1B  | 2B       | 3-4 DW | 0-1024 DW    | 1 DW | 1 DW | 1B  |

• Header, Data Payload and (optional) ECRC: created by the Transaction Layer
• Sequence and LCRC: appended by the Data Link Layer
• STP and End: appended by the Physical Layer

(DW = DWORD = 4 Bytes)
PCIe – Split transaction model
• Posted transaction
• Single TLP, no completion

• Non-posted transaction
• Split transaction model
• Requester initiates transaction (Requester ID + Tag)
• Requester and Completer IDs encode the sender BDF
• Completer executes transaction internally
• Completer creates completion transaction (Cpl/CplD)

• Bus efficiency of Read is different (lower) wrt Write


• Writes are posted while Reads are not
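The split-transaction model can be sketched in a few lines; the tag bookkeeping is the essential idea (class and field names are ours, not from the spec):

```python
class Requester:
    """Toy split-transaction model: tags match completions to requests."""
    def __init__(self, requester_id):
        self.rid = requester_id        # Requester ID encodes the sender BDF
        self.next_tag = 0
        self.outstanding = {}          # tag -> address awaiting a CplD

    def memory_read(self, addr):
        tag = self.next_tag
        self.next_tag += 1
        self.outstanding[tag] = addr
        return {"type": "MRd", "rid": self.rid, "tag": tag, "addr": addr}

def complete(request, memory):
    """Completer executes the read and echoes Requester ID + Tag in a CplD."""
    return {"type": "CplD", "rid": request["rid"], "tag": request["tag"],
            "data": memory[request["addr"]]}

mem = {0x1000: 0xCAFE}
rq = Requester(requester_id="01:00.0")
req = rq.memory_read(0x1000)     # non-posted: a completion must follow
cpl = complete(req, mem)
print(cpl)                       # matched back to the request via (rid, tag)
```

A posted write, by contrast, would simply be sent with no completion and no tag to track, which is why write efficiency is higher.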
PCIe – DMA transaction



PCIe – Peer-to-Peer transaction
Typical use cases:
▪ High-Performance Computing
▪ Machine Learning



PCIe – Data Link Layer
• ACK / NAK Packets
• Error handling mechanism
• Flow Control Packets (FCPs)
• Receiver sends FCPs (which are a type of DLLP) to
provide the transmitter with credits so that it can
transmit packets to the receiver
• Power Management Packets
• Vendor extensions
• E.g.: CAPI, CCIX (memory coherency)
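The ACK/NAK error-handling mechanism can be sketched as a replay buffer: the transmitter keeps every TLP until it is acknowledged, and retransmits from the buffer on a NAK. A toy model (names are ours; real ACKs carry the 12-bit sequence number of the last good TLP):

```python
from collections import OrderedDict

class DllTransmitter:
    """Toy model of Data Link Layer ACK/NAK replay (not the real protocol)."""
    def __init__(self):
        self.seq = 0
        self.replay_buffer = OrderedDict()  # seq -> TLP, kept until ACKed

    def send(self, tlp):
        self.replay_buffer[self.seq] = tlp
        self.seq = (self.seq + 1) % 4096    # 12-bit sequence number
        return tlp

    def on_ack(self, acked_seq):
        # ACK is cumulative: drop everything up to and including acked_seq
        for s in list(self.replay_buffer):
            del self.replay_buffer[s]
            if s == acked_seq:
                break

    def on_nak(self, nak_seq):
        # Replay every still-unacknowledged TLP, oldest first
        return list(self.replay_buffer.values())

tx = DllTransmitter()
for p in ["TLP0", "TLP1", "TLP2"]:
    tx.send(p)
tx.on_ack(0)                 # TLP0 delivered
print(tx.on_nak(1))          # ['TLP1', 'TLP2'] are retransmitted
```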



PCIe – DLLP structure

Transmit order:
| SDP | DLLP | CRC | End |
| 1B  | 4B   | 2B  | 1B  |

• DLLP: created by the Data Link Layer
• SDP and End: appended by the Physical Layer


PCIe – Flow control
• Credit-based
• Point-to-point (not end-to-end)
(Diagram: the transmitter's Data Link Layer sends TLPs into the receiver's VC buffer; the receiver's Data Link Layer returns Flow Control DLLPs (FCx) advertising the available space.)


PCIe – Flow Control Update Loop
“If the write requester sources the data as quickly as possible, and the completer consumes
the data as quickly as possible, then the Flow Control Update loop may be the biggest
determining factor in write throughput, after the actual bandwidth of the link.” (Intel)
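The credit loop can be caricatured in a few lines of Python: the transmitter may only send while it holds credits, and throughput stalls until the receiver returns them. A toy model (names are ours; one credit per TLP, whereas real PCIe counts header and data credits separately):

```python
class CreditLink:
    """Toy credit-based flow control: 1 credit = room for 1 TLP."""
    def __init__(self, buffer_slots):
        self.credits = buffer_slots   # advertised at link init (InitFC)
        self.vc_buffer = []

    def try_send(self, tlp):
        if self.credits == 0:
            return False              # transmitter must stall, never drop
        self.credits -= 1
        self.vc_buffer.append(tlp)
        return True

    def drain_one(self):
        # Application consumes a TLP; receiver returns a credit (UpdateFC)
        self.vc_buffer.pop(0)
        self.credits += 1

link = CreditLink(buffer_slots=2)
sent = [link.try_send(f"TLP{i}") for i in range(3)]
print(sent)                   # [True, True, False]: the third TLP stalls
link.drain_one()              # UpdateFC arrives...
print(link.try_send("TLP2"))  # ...and the stalled TLP can now be sent: True
```

The slower the UpdateFC round trip, the longer the stall, which is exactly the effect the Intel quote describes.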



PCIe – RAS/QoS features
• Data Integrity and Error Handling
• PCIe is RAS (Reliable, Available, Serviceable)
• Data integrity at
• link level (LCRC)
• end-to-end (ECRC, optional)
• Virtual channels (VCs) and traffic classes (TCs) to
support differentiated traffic or Quality of Service (QoS)
• In theory
• Ability to define levels of service for packets of different TCs
• 8 TCs and 8 VCs available
• In practice
• Rarely more than 1 VC and 1 TC are implemented



PCIe – Error handling

Correctable:
• Recovery happens automatically in the DLL
• Performance is degraded
• Example: LCRC error → automatic DLL retry
  (there is no forward error correction until PCIe Gen 6.0)

Uncorrectable:
• Fatal: platform-specific handling
• Non-fatal: can be exposed to the application layer and handled explicitly
• Can and do cause system deadlock / reset
• Recovery mechanisms are outside the spec
• Example: failover for HA


PCIe – ACK/NAK



PCIe – Physical layer
“While the lanes are not tightly synchronized, there is a limit to the lane to lane skew of
20/8/6 ns for 2.5/5/8 GT/s so the hardware buffers can re-align the striped data.” (Wikipedia)
(Diagram: a Lane consists of one differential Signal pair per direction, i.e. four Wires; a Link bundles one or more Lanes between two PCI Express Devices.)



PCIe – Ordered-Set Structure
Transmit order: COM | Identifier | Identifier | … | Identifier

Six ordered sets are possible:
• Training Sequences (TS1, TS2): 1 COM + 15 TS
  • Used to de-skew between lanes
• SKIP: 1 COM + 3 SKP identifiers
  • Used to recalibrate the receiver clock
• Fast Training Sequence (FTS): 1 COM + 3 FTS
  • Power management
• Electrical Idle (IDLE): 1 COM + 3 IDL
  • Transmitted continuously when there is no data
• Electrical Idle Exit (EIEOS): 16 characters (since 2.0)
(character = 8 unscrambled bits)
PCIe – Framing (x1)
Transmit order (TIME), top to bottom:
• STP framing symbol (Physical Layer)
• Reserved bits + Sequence Number (Data Link Layer)
• TLP structure (Transaction Layer)
• LCRC (Data Link Layer)
• END framing symbol (Physical Layer)
PCIe – Framing (x4)
Transmit order (TIME) runs down; lane order (SPACE) runs across Lanes 0-3 (lane-reversal possible). The STP symbol, the Data Link and Transaction Layer bytes, and the END symbol are striped across the four lanes.
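Byte striping across lanes can be sketched as plain round-robin distribution; this is a simplification (real framing also inserts ordered sets and per-lane symbols), with helper names of our own choosing:

```python
def stripe(data: bytes, lanes: int):
    """Round-robin byte striping of a serialized packet across N lanes."""
    out = [bytearray() for _ in range(lanes)]
    for i, byte in enumerate(data):
        out[i % lanes].append(byte)
    return out

def unstripe(lanes_data):
    """Receiver side: re-interleave the per-lane streams (after de-skew)."""
    out = bytearray()
    for i in range(sum(len(l) for l in lanes_data)):
        out.append(lanes_data[i % len(lanes_data)][i // len(lanes_data)])
    return bytes(out)

pkt = b"STP+TLP+LCRC+END"
striped = stripe(pkt, 4)
assert unstripe(striped) == pkt   # lossless, given correct lane ordering
```

The de-skew budget quoted above (20/8/6 ns) is what lets the receiver line the four streams back up before re-interleaving.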


PCIe – Link training
Link training negotiates:
• Lane polarity
• Link width / lane ordering
• Link equalization (dynamic equalization!)
• Link speed
• ...
(Diagram: Device 1 ⇄ Device 2 exchanging TS ordered sets)

Example LTSSM trace (EP = endpoint, RP = root port):
37312 ns  EP LTSSM State: RECOVERY.RCVRLOCK
37312 ns  RP Link Status: Negotiated Link Width: x8
37312 ns  RP Link Status: Slot Clock Config: System Reference Clock Used
37949 ns  EP LTSSM State: RECOVERY.RCVRCFG
38845 ns  RP LTSSM State: RECOVERY.RCVRCFG
41053 ns  RP LTSSM State: RECOVERY.SPEED
41309 ns  EP LTSSM State: RECOVERY.SPEED
43573 ns  EP LTSSM State: RECOVERY.RCVRLOCK
43765 ns  RP LTSSM State: RECOVERY.RCVRLOCK
43797 ns  RP LTSSM State: REC_EQULZ.PHASE0
43825 ns  RP LTSSM State: REC_EQULZ.PHASE1
44141 ns  EP LTSSM State: REC_EQULZ.PHASE0
44673 ns  EP LTSSM State: REC_EQULZ.PHASE1
44929 ns  RP LTSSM State: REC_EQULZ.DONE
44949 ns  RP LTSSM State: RECOVERY.RCVRLOCK
45209 ns  EP LTSSM State: REC_EQULZ.DONE
45229 ns  EP LTSSM State: RECOVERY.RCVRLOCK
45425 ns  EP LTSSM State: RECOVERY.RCVRCFG
45581 ns  RP LTSSM State: RECOVERY.RCVRCFG
45925 ns  RP LTSSM State: RECOVERY.IDLE
46073 ns  EP LTSSM State: RECOVERY.IDLE
46169 ns  EP LTSSM State: L0
46313 ns  RP LTSSM State: L0
47824 ns  Current Link Speed: 8.0GT/s


PCIe Link-Training State Machine (LTSSM)
(Diagram: Detect → Polling → Configuration → L0; Recovery handles link re-training, with a DLL-reset path back to Detect; power-management states branch off L0.)
• L0: active
• L0s (standby), L1: lower power, higher latency
• L2: cold standby, even lower power
• L3: power off
Simulate a PCIe link on your own!
• https://github.com/wyvernSemi/pcievhost
• http://www.anita-simulators.org.uk/wyvernsemi/articles/pci_express.pdf
• Written in C/Verilog
• Compatible with ModelSim (via DPI)
• Simulates link training, flow control, ACK/NAK,
completions…



What is this presentation about?
• PCIe history and evolution

• PCIe concepts

• PCIe layers

• PCIe practical aspects

• PCIe performance

• PCIe future roadmap



PCIe link training
Signal integrity – Environment



PCIe link training
Signal integrity – Robustness
“PCIe will practically run over wet string”


https://www.youtube.com/watch?v=q5xvwPa3r7M
PCIe link training
Signal integrity – Connectors



Troubleshooting PCIe deployments
at scale
If you run a large data acquisition system, and most
of your I/O goes through PCIe links, you have to
monitor all your endpoints and root ports

https://github.com/facebook/pcicrawler
https://engineering.fb.com/2020/08/05/open-source/pcicrawler



PCIe CEM Spec – AIC form factors
(Diagram: add-in card, component side (B) and solder side (A))
• Height
  • Standard Height: 4.20" (106.7 mm)
  • Low Profile: 2.536" (64.4 mm)
• Length
  • Half Length (e.g. "HHHL"): 6.6" (167.65 mm)
  • Full Length (e.g. "FHFL"): 12.283" (312 mm)
• Width: single or dual slot
• Power: up to 10 W, 25 W, 75 W, 300 W or 375 W depending on form factor & optional extra power connectors
PCIe storage – More form factors
M.2
≤ 4 lanes

U.2
≤ 4 lanes

“ruler” (EDSFF, NGSFF) ≤ 8 lanes



PCIe CEM Spec – Power Cables
EPS receptacle

PCIe cable

GPU power



PCIe CEM Spec – Power Cables

https://support.xilinx.com/s/article/72298?language=en_US



PCIe – 12VHPWR power failures



What is this presentation about?
• PCIe history and evolution

• PCIe concepts

• PCIe layers

• PCIe practical aspects

• PCIe performance

• PCIe future roadmap



PCIe – Theoretical data rates
(Chart: per-generation data rates; encoding moves from 8b/10b to 128b/130b, signaling from NRZ to PAM4.)


PCIe – Effective data rates
Effective throughput = theoretical bandwidth × packet efficiency:

  ρ = (Lane rate × Lane width × Encoding efficiency) × MPS / (MPS + Headers)

• Example: Gen2 x8, 128-Byte MPS
  ρ = 40 × 0.8 × 128/(128+24) = 32 × 0.84 ≈ 26.9 Gb/s
• Example: Gen3 x8, 128-Byte MPS
  ρ = 64 × 0.98 × 128/(128+24) = 62.7 × 0.84 ≈ 52.6 Gb/s
• Example: Gen3 x8, 256-Byte MPS
  ρ = 64 × 0.98 × 256/(256+24) = 62.7 × 0.91 ≈ 57 Gb/s
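The formula can be checked numerically. A small sketch (function name is ours; a fixed header cost of 24 bytes per TLP, as in the examples above):

```python
def effective_rate_gbps(lane_rate_gt, lanes, enc_eff, mps, headers=24):
    """Effective PCIe throughput: raw rate x encoding x packet efficiency."""
    return lane_rate_gt * lanes * enc_eff * mps / (mps + headers)

# Gen2 x8 (5 GT/s, 8b/10b encoding) with 128-Byte MaxPayloadSize:
print(round(effective_rate_gbps(5, 8, 0.8, 128), 1))        # ≈ 26.9 Gb/s
# Gen3 x8 (8 GT/s, 128b/130b encoding) with 256-Byte MaxPayloadSize:
print(round(effective_rate_gbps(8, 8, 128/130, 256), 1))    # ≈ 57.6 Gb/s
```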



PCIe 3.0 x8 – DMA Performance
(Plot: measured DMA throughput with MPS = 128 Bytes vs MPS = 256 Bytes; the larger MPS gains ~10%.)


PCIe performance – interrupt coalescing



PCIe performance – latency

(Plot: latency distribution; typical ~1 µs)



NVMe performance
NVM Express (NVMe) is an interface specification for accessing a
computer's non-volatile storage media attached via the PCIe bus.
NVM Express allows host hardware and software to fully exploit the
levels of parallelism possible in modern SSDs.



What is this presentation about?
• PCIe history and evolution

• PCIe concepts

• PCIe layers

• PCIe practical aspects

• PCIe performance

• PCIe future roadmap



What is Compute Express Link?
• Alternate protocol that
runs across the standard
PCIe physical layer
• Uses a flexible processor
port that can auto-
negotiate to either the
standard PCIe transaction
protocol or the alternate
CXL transaction protocols
• First generation CXL
aligns to 32 Gbps PCIe 5.0
• 8 Gbps in degraded mode



CXL Consortium
• Alibaba, Cisco, Dell EMC, Facebook, Google, Hewlett Packard Enterprise,
Huawei, Intel and Microsoft announced their intent to incorporate in
March 2019.
• In April 2020, the CXL Consortium and the Gen-Z Consortium executed a
Memorandum of Understanding (MOU) describing a mutual plan for
collaboration between the two organizations.



Why CXL?
Need a new class of interconnect for heterogeneous
computing and disaggregation usages:
• Efficient resource sharing
• Shared memory pools with efficient access
mechanisms
• Enhanced movement of operands and results
between accelerators and target devices
• Significant latency reduction to enable
disaggregated memory



CXL – Dynamic Multiplexing
CXL multiplexes three different protocols at the PCIe PHY layer



CXL protocols
• cxl.io
• device discovery, configuration, initialization, I/O
virtualization, and direct memory access (DMA)
• cxl.cache
• enables a device to cache data from the host memory,
employing a simple request and response protocol
• the host processor manages coherency of data
• cxl.mem
• allows a host processor to access memory attached to a
CXL device



CXL device types
“Mix and match” protocols depending on application requirements



CXL evolution timeline
• CXL 1.0 – March 2019
• enables device-level memory expansion and coherent
acceleration modes
• CXL 1.1 – September 2019
• CXL 2.0 – November 2020
• augments CXL 1.1 with enhanced fanout support and a
variety of additional features
• CXL 3.0 – in the making
• CXL supporting platforms coming to market now



CXL 2.0 new features
• CXL switches
• Multiple hosts
• Virtual hierarchies
• Multi-Logical Devices (MLDs)
• Management
• Fabric manager
• Device allocation
• QoS telemetry
• Memory interleaving



CXL-attached memory



Conclusions
• PCIe has a track record of 2x throughput improvements per
generation
• PCIe has maintained backwards compatibility for decades
• PCIe has won the interconnect wars
• Gen-Z has joined the CXL consortium
• All CCIX consortium members have moved to CXL
• CAPI never gained mindshare outside of IBM
• PCIe is proving suitable also for chip-to-chip interconnect
• Universal Chiplet Interconnect Express (UCIe)
• NVLink-C2C will be compatible with CXL
• However, the AI revolution has left PCIe behind
• AI prefers low link counts at much higher rates to save chip shoreline
area; minimizing latency is less of a concern
• Proprietary alternatives: NVLink, AMD AFL, Google ICI, UltraEthernet…

