Introduction to PCIe and CXL
• PCIe concepts
• PCIe layers
• PCIe performance
Packet
“UPSTREAM” / “DOWNSTREAM”
... addressing”
• Form a hierarchy-based address hierarchy ...
| +-08.4 Intel Corporation Xeon ...
| +-09.0 Intel Corporation Xeon ...
| ...
| +-03.0-[84]--
| +-03.2-[85]----00.0 Intel Corporation Xeon Phi coprocessor 31S1
| +-05.0 Intel Corporation Xeon E5/Core i7 Address Map, VTd_Misc, System Management
| +-05.2 Intel Corporation Xeon E5/Core i7 Control Status and Global Errors
| \-05.4 Intel Corporation Xeon E5/Core i7 I/O APIC
00:00.0 Host bridge: Intel Corporation Xeon E5/Core i7 DMI2 (rev 07)
• Configuration space
• Base Address Registers
(BARs) (32/64-bit)
• Capabilities (linked list)
• On Linux: $ man setpci
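The configuration-space items above (BARs and the capability linked list) can be illustrated with a small parser. This is a minimal sketch, assuming a standard type 0 header layout (vendor/device ID at offset 0x00, status at 0x06, BARs at 0x10 to 0x24, capabilities pointer at 0x34). The helper names `decode_bars` and `walk_capabilities`, the device ID and all byte values in the synthetic header are hypothetical and not taken from the slides.

```python
import struct

def decode_bars(cfg):
    """Decode the six BAR slots of a type 0 header (offsets 0x10..0x24)."""
    bars = []
    i = 0
    while i < 6:
        (raw,) = struct.unpack_from("<I", cfg, 0x10 + 4 * i)
        if raw & 0x1:                        # bit 0 set: I/O space BAR
            bars.append(("io", raw & ~0x3))
            i += 1
        elif (raw >> 1) & 0x3 == 0x2:        # type field 10b: 64-bit memory BAR
            (hi,) = struct.unpack_from("<I", cfg, 0x10 + 4 * (i + 1))
            bars.append(("mem64", (hi << 32) | (raw & ~0xF)))
            i += 2                           # a 64-bit BAR consumes two slots
        else:
            if raw:                          # all-zero slot: BAR not implemented
                bars.append(("mem32", raw & ~0xF))
            i += 1
    return bars

def walk_capabilities(cfg):
    """Follow the capability linked list starting at the pointer at 0x34."""
    caps = []
    (status,) = struct.unpack_from("<H", cfg, 0x06)
    if not status & 0x10:                    # status bit 4: capability list present
        return caps
    ptr = cfg[0x34] & 0xFC
    seen = set()
    while ptr and ptr not in seen:           # guard against malformed loops
        seen.add(ptr)
        caps.append((ptr, cfg[ptr]))         # (offset, capability ID)
        ptr = cfg[ptr + 1] & 0xFC            # next pointer, 0 terminates the list
    return caps

# Synthetic 256-byte configuration space (all values are made up).
cfg = bytearray(256)
struct.pack_into("<HH", cfg, 0x00, 0x8086, 0x1234)          # vendor ID, device ID
struct.pack_into("<H", cfg, 0x06, 0x0010)                   # status: capability list
struct.pack_into("<I", cfg, 0x10, 0xF0000000)               # BAR0: 32-bit memory
struct.pack_into("<II", cfg, 0x14, 0x0000000C, 0x00000020)  # BAR1+2: 64-bit memory
cfg[0x34] = 0x40
cfg[0x40:0x42] = bytes([0x05, 0x50])   # capability ID 0x05 (MSI), next at 0x50
cfg[0x50:0x52] = bytes([0x10, 0x00])   # capability ID 0x10 (PCI Express), end

bars = decode_bars(cfg)
caps = walk_capabilities(cfg)
print([(kind, hex(addr)) for kind, addr in bars])
print([(hex(off), hex(cap_id)) for off, cap_id in caps])
```

On Linux the same parser could be fed with the bytes of `/sys/bus/pci/devices/<BDF>/config`; reading beyond the first 64 bytes usually needs root privileges.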
Transparent
• Single root (or SR-IOV)
• Single address space
• Multiple downstreams (switch)
• Downstreams appear in the same topology
• Addresses are passed through unchanged
Non-Transparent
• Joins two independent topologies
• One root on each side
• Each side has its own address space
• Needs translation table
• Fault tolerance, “networking”, HPC
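The difference in address handling between the two bridge types can be sketched as follows. This is only an illustration, assuming a translation table made of fixed windows; the `Window` type, the function names and all addresses are hypothetical, and real non-transparent bridges expose this through vendor-specific registers.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Window:
    local_base: int    # start of the window in the local address space
    size: int          # window size in bytes
    remote_base: int   # where the window lands in the remote address space

def transparent(addr):
    """A transparent bridge forwards the address unchanged."""
    return addr

def translate(windows, addr):
    """Non-transparent bridge: map a local address into the remote space.

    Returns None when the address falls outside every window.
    """
    for w in windows:
        if w.local_base <= addr < w.local_base + w.size:
            return w.remote_base + (addr - w.local_base)
    return None

# One 1 MiB window (made-up addresses).
table = [Window(local_base=0xA000_0000, size=0x10_0000, remote_base=0x4_0000_0000)]

print(hex(transparent(0xA000_1000)))
print(hex(translate(table, 0xA000_1000)))
print(translate(table, 0xB000_0000))
```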
24/06/2024 ISOTDAQ 2024 - Introduction to PCIe & CXL 26
PCIe concepts – Interrupts
• PCI
• INTx#
• x ∈ {A, B, C, D}
• Level sensitive
• Can be mapped to CPU interrupt number
• PCIe
• “Virtual Wire” emulation
• Assert_INTx code
• Deassert_INTx code

pci_read_config_byte(dev, PCI_INTERRUPT_PIN, &(...));
pci_read_config_byte(dev, PCI_INTERRUPT_LINE, &(...));
pci_enable_msi(dev);
request_irq(dev->irq, my_isr, IRQF_SHARED, devname, cookie);
PCIe concepts – MSI & MSI-X
• Based on messages (MWr)
• MSI uses one address with a
variable data value indicating
which “vector” is asserting
• ≤ 32 per device (in theory)
• MSI-X uses a table of
independent address and
data pairs for each “vector”
• ≤ 2048 per device (use affinity!)
• Vector: interrupt id
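The two schemes above can be contrasted in a few lines. This is a minimal sketch, assuming that MSI encodes the vector number in the low-order bits of a single data value and that MSI-X keeps one independent (address, data) pair per vector. The function names and all addresses and data values are hypothetical.

```python
def msi_message(address, base_data, n_enabled, vector):
    """MSI: one address; the low bits of the data value select the vector."""
    if n_enabled not in (1, 2, 4, 8, 16, 32):
        raise ValueError("MSI allocates a power-of-two number of vectors, at most 32")
    if not 0 <= vector < n_enabled:
        raise ValueError("vector out of range")
    mask = n_enabled - 1
    return address, (base_data & ~mask) | vector

def msix_message(table, vector):
    """MSI-X: each vector has its own (address, data) table entry."""
    if len(table) > 2048:
        raise ValueError("MSI-X supports at most 2048 table entries")
    return table[vector]

# Made-up values: every MSI vector shares one address...
msi = [msi_message(0xFEE0_0000, 0x4020, 4, v) for v in range(4)]
# ...while MSI-X vectors can target different addresses (e.g. different CPUs).
msix_table = [(0xFEE0_1000, 0x41), (0xFEE0_2000, 0x42), (0xFEE0_3000, 0x43)]

for address, data in msi:
    print("MSI   MWr", hex(address), hex(data))
for v in range(len(msix_table)):
    address, data = msix_message(msix_table, v)
    print("MSI-X MWr", hex(address), hex(data))
```

Either way the interrupt reaches the host as an ordinary memory write (MWr) carrying the data value to the given address.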
2 x 5 = ?
2 x 5 = 8 (PCIe Gen3)
• Speed “doubled” from 5 GT/sec (126 Gb/s per direction in x16)
• More efficient encoding (overhead 20% → ~1.5%)
• 8b/10b → 128b/130b
• 8 GT/sec electrical rate
• 10 GT/sec would have required significant cost and complexity in channel, receiver design, etc.
• Reference clock remains at 100 MHz
• Backwards-compatible speed negotiation
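The arithmetic behind “2 x 5 = 8” can be checked directly: the raw rate goes from 5 to 8 GT/sec, but the cheaper encoding brings the usable rate close to double. The sketch below assumes only the line rate and the encoding ratio and ignores packet and protocol overhead; the table and the function name `link_gbps` are illustrative.

```python
# name: (rate in GT/s per lane, payload bits, total bits per encoded block)
GENERATIONS = {
    "Gen1": (2.5, 8, 10),
    "Gen2": (5.0, 8, 10),
    "Gen3": (8.0, 128, 130),
    "Gen4": (16.0, 128, 130),
    "Gen5": (32.0, 128, 130),
}

def link_gbps(gen, lanes=16):
    """Usable bit rate per direction in Gb/s, after encoding overhead only."""
    rate, payload, total = GENERATIONS[gen]
    return rate * lanes * payload / total

for gen in GENERATIONS:
    print(f"{gen}: {link_gbps(gen):7.2f} Gb/s per direction in x16")

# Usable-rate ratio between Gen3 and Gen2: close to 2 although 8 / 5 = 1.6.
ratio = link_gbps("Gen3") / link_gbps("Gen2")
print(f"Gen3 / Gen2 = {ratio:.2f}")
```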
2 x 8 = ?
2 x 8 = 16 (PCIe Gen4)
• Speed doubled from 8 GT/sec (252 Gb/s per direction in x16)
• Same 128b/130b encoding
• 16 GT/sec electrical rate
• Channel length: ≤ 10”/14”
• Retimer mandatory for longer channels
• More complex pre-amplification, equalization stages
• Reference clock remains at 100 MHz
• Backwards-compatible protocol negotiation and CEM spec
2 x 16 = 32 (PCIe Gen5)
• Speed doubled from 16 GT/sec (504 Gb/s per direction in x16)
• Same 128b/130b encoding (with small differences)
• 32 GT/sec electrical rate
• Channel length: ≤ 10”/14”
• Up to 2 retimers for longer channels
• More complex pre-amplification, equalization stages
• Support for alternate protocols (see CXL)
2 x 32 = 64 (PCIe Gen6)
• Speed doubled from 32 GT/sec (1024 Gb/s per direction in x16)
• NRZ → PAM4 signaling
• 2 bits per Unit Interval
• Lower eye-height and width, much higher First Bit-Error Rate (FBER)
• Forward Error Correction (FEC)
• Light-weight and low-latency (2 ns) FEC for initial correction
• CRC and link-level retry for larger errors
• Flow Control Unit (FLIT) encoding
• Fixed-size and fixed (lower) latency, compared to TLPs
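The “2 bits per Unit Interval” point can be made concrete with a toy encoder. This is a sketch only: the Gray-coded level mapping below is an assumption chosen for illustration, and the function names are hypothetical.

```python
# Assumed Gray-coded mapping: neighbouring levels differ in exactly one bit,
# so mistaking a level for its neighbour corrupts a single bit.
PAM4_LEVELS = {(0, 0): 0, (0, 1): 1, (1, 1): 2, (1, 0): 3}
PAM4_BITS = {level: bits for bits, level in PAM4_LEVELS.items()}

def nrz_encode(bits):
    """NRZ: one bit per unit interval, two signal levels."""
    return list(bits)

def pam4_encode(bits):
    """PAM4: two bits per unit interval, four signal levels."""
    if len(bits) % 2:
        raise ValueError("PAM4 needs an even number of bits")
    return [PAM4_LEVELS[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]

def pam4_decode(symbols):
    """Recover the bit stream from a list of PAM4 levels."""
    bits = []
    for level in symbols:
        bits.extend(PAM4_BITS[level])
    return bits

bits = [1, 0, 0, 1, 1, 1, 0, 0]
symbols = pam4_encode(bits)
print("NRZ  unit intervals:", len(nrz_encode(bits)))
print("PAM4 unit intervals:", len(symbols), symbols)
```

The same number of bits needs half as many unit intervals, which is how the data rate doubles without doubling the symbol rate. The price is three stacked, smaller eyes instead of one, hence the higher FBER and the need for FEC.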
[Figure: two devices, each with a Data Link Layer and TX/RX pairs, connected by a Link]
FPGA Hardened PCIe IP
• Non-posted transaction
• Split transaction model
• Requester initiates transaction (Requester ID + Tag)
• Requester and Completer IDs encode the sender BDF
• Completer executes transaction internally
• Completer creates completion transaction (Cpl/CplD)
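The split transaction model above can be sketched as two small classes. This is a toy model, assuming dictionaries in place of real TLPs and a 16-bit ID packed as 8 bits of bus, 5 bits of device and 3 bits of function; the class names, field names, BDF values and memory contents are hypothetical.

```python
def bdf_to_id(bus, dev, fn):
    """Pack Bus/Device/Function into a 16-bit ID (8/5/3 bits)."""
    return (bus << 8) | (dev << 3) | fn

def id_to_bdf(ident):
    """Unpack a 16-bit ID into (bus, device, function)."""
    return (ident >> 8) & 0xFF, (ident >> 3) & 0x1F, ident & 0x7

class Requester:
    def __init__(self, requester_id):
        self.requester_id = requester_id
        self.outstanding = {}            # tag -> address of the pending read
        self.next_tag = 0

    def read(self, addr):
        """Issue a non-posted memory read (MRd) and remember its tag."""
        tag = self.next_tag
        self.next_tag = (self.next_tag + 1) % 256
        self.outstanding[tag] = addr
        return {"type": "MRd", "requester_id": self.requester_id,
                "tag": tag, "addr": addr}

    def complete(self, cpl):
        """Match a completion to its request using Requester ID + Tag."""
        if cpl["requester_id"] != self.requester_id:
            raise ValueError("completion routed to the wrong requester")
        addr = self.outstanding.pop(cpl["tag"])
        return addr, cpl["data"]

class Completer:
    def __init__(self, completer_id, memory):
        self.completer_id = completer_id
        self.memory = memory

    def handle(self, req):
        """Execute the request internally and build a completion with data."""
        return {"type": "CplD", "requester_id": req["requester_id"],
                "tag": req["tag"], "completer_id": self.completer_id,
                "data": self.memory[req["addr"]]}

requester = Requester(bdf_to_id(0x00, 0x03, 0x2))
completer = Completer(bdf_to_id(0x85, 0x00, 0x0), {0x1000: 0xCAFE, 0x2000: 0xBEEF})

req_a = requester.read(0x1000)
req_b = requester.read(0x2000)
# Completions may come back in a different order; the tag pairs them up.
cpl_b = completer.handle(req_b)
cpl_a = completer.handle(req_a)
results = [requester.complete(cpl_b), requester.complete(cpl_a)]
print([(hex(addr), hex(data)) for addr, data in results])
```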
[Figure labels: Transmit order, VC buffer, Data Link Layer on both sides]
Correctable
• Recovery happens automatically in DLL
• Performance is degraded
• Example: LCRC error → automatic DLL retry (there is no forward error correction until PCIe Gen 6.0)
Uncorrectable
• Fatal
• Platform-specific handling
• Non-fatal
• Can be exposed to application layer and handled explicitly
• Can and do cause system deadlock / reset
• Recovery mechanisms are outside the spec
• Example: failover for HA
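The correctable case (LCRC error followed by an automatic retry in the Data Link Layer) can be sketched as a replay loop. This is a simplified model under stated assumptions: `zlib.crc32` stands in for the real LCRC, a 2-byte field stands in for the sequence number, and the function names, the retry limit and the fault injection are hypothetical.

```python
import zlib

def frame(seq, payload):
    """Prepend a sequence number and append a 32-bit CRC (stand-in for LCRC)."""
    body = seq.to_bytes(2, "big") + payload
    return body + zlib.crc32(body).to_bytes(4, "big")

def check(frame_bytes):
    """Receiver side: return (seq, payload) if the CRC matches, else None."""
    body, crc = frame_bytes[:-4], frame_bytes[-4:]
    if zlib.crc32(body).to_bytes(4, "big") != crc:
        return None                      # receiver would answer with a NAK
    return int.from_bytes(body[:2], "big"), body[2:]

def send_with_retry(seq, payload, channel, max_attempts=4):
    """Keep the frame in a replay buffer and resend it until it is ACKed."""
    replay = frame(seq, payload)
    for attempt in range(1, max_attempts + 1):
        received = check(channel(replay))
        if received is not None:         # ACK: the replay buffer can be freed
            return received, attempt
    raise RuntimeError("retry limit exceeded, link needs retraining")

def flaky_channel(corrupt_first_n):
    """Channel that flips one bit in the first n frames it carries."""
    state = {"sent": 0}
    def channel(data):
        state["sent"] += 1
        if state["sent"] <= corrupt_first_n:
            damaged = bytearray(data)
            damaged[3] ^= 0x01
            return bytes(damaged)
        return data
    return channel

result, attempts = send_with_retry(7, b"hello", flaky_channel(1))
print(result, "delivered after", attempts, "attempts")
```

The application never sees the error; it only observes the extra latency of the retransmission, which matches the “performance is degraded” point above.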
[Figure labels: Signal, Wire, Lane, Link]
TLP structure
[Figure: TLP framing across the layers: ... (Transaction Layer), LCRC (Data Link Layer), ... END (Physical Layer)]
• Lane polarity
• Link equalization
• Dynamic equalization!
• ...

43797 ns RP LTSSM State: REC_EQULZ.PHASE0
43825 ns RP LTSSM State: REC_EQULZ.PHASE1
44141 ns EP LTSSM State: REC_EQULZ.PHASE0
44949 ns RP LTSSM State: RECOVERY.RCVRLOCK
45209 ns EP LTSSM State: REC_EQULZ.DONE
45229 ns EP LTSSM State: RECOVERY.RCVRLOCK
45425 ns EP LTSSM State: RECOVERY.RCVRCFG
45581 ns RP LTSSM State: RECOVERY.RCVRCFG
46169 ns EP LTSSM State: L0
46313 ns RP LTSSM State: L0
47824 ns Current Link Speed: 8.0GT/s
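A trace like the one above is easy to post-process. The sketch below parses the LTSSM lines shown on the slide and measures how long each port (RP = root port, EP = endpoint) takes from entering equalization to reaching L0. The line format is inferred from this excerpt only, and the function names are hypothetical.

```python
import re

LOG = """\
43797 ns RP LTSSM State: REC_EQULZ.PHASE0
43825 ns RP LTSSM State: REC_EQULZ.PHASE1
44141 ns EP LTSSM State: REC_EQULZ.PHASE0
44949 ns RP LTSSM State: RECOVERY.RCVRLOCK
45209 ns EP LTSSM State: REC_EQULZ.DONE
45229 ns EP LTSSM State: RECOVERY.RCVRLOCK
45425 ns EP LTSSM State: RECOVERY.RCVRCFG
45581 ns RP LTSSM State: RECOVERY.RCVRCFG
46169 ns EP LTSSM State: L0
46313 ns RP LTSSM State: L0
47824 ns Current Link Speed: 8.0GT/s
"""

PATTERN = re.compile(r"^(\d+) ns (RP|EP) LTSSM State: (\S+)$")

def parse(log):
    """Return (time_ns, port, state) for every LTSSM line; skip the rest."""
    events = []
    for line in log.splitlines():
        match = PATTERN.match(line.strip())
        if match:
            events.append((int(match.group(1)), match.group(2), match.group(3)))
    return events

def first_time(events, port, state):
    """Timestamp of the first time a port enters a state, or None."""
    for time_ns, p, s in events:
        if p == port and s == state:
            return time_ns
    return None

events = parse(LOG)
durations = {}
for port in ("RP", "EP"):
    start = first_time(events, port, "REC_EQULZ.PHASE0")
    done = first_time(events, port, "L0")
    durations[port] = done - start
    print(f"{port}: equalization start to L0 took {durations[port]} ns")
```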
[Figure: link states: L0, L0s, L1, L2 (Power Management), Recovery (Link Re-Training)]
Simulate a PCIe link on your own!
• https://github.com/wyvernSemi/pcievhost
• http://www.anita-simulators.org.uk/wyvernsemi/articles/pci_express.pdf
• Written in C/Verilog
• Compatible with ModelSim (via DPI)
• Simulates link training, flow control, ACK/NAK, completions…
https://www.youtube.com/watch?v=q5xvwPa3r7M
PCIe link training
Signal integrity – Connectors
https://github.com/facebook/pcicrawler
https://engineering.fb.com/2020/08/05/open-source/pcicrawler
[Figure labels: U.2 (≤ 4 lanes), PCIe cable, GPU power]
https://support.xilinx.com/s/article/72298?language=en_US
[Figure labels: 8b/10b, 128b/130b, NRZ, PAM4; Typical: ~1 us]