Benini ISC2023: Paving the Road for RISC-V
GPT-4 (OpenAI ’23) – training compute: 2.1E+25 FLOP [Sevilla ’22: arXiv:2202.05924, epochai.org]
Energy Efficiency: 10x every 12 years…
Efficient Architecture: Heterogeneous + Parallel
Heterogeneous + Parallel… Why?

Decide (jump to a different program part):
- Modulates the flow of instructions
- Mostly sequential decisions: don’t work too much
- Be clever about the battles you pick (latency is king)
- Lots of decisions, little number crunching

Compute (plough through numbers):
- Modulates the flow of data
- Embarrassingly data-parallel: don’t think too much
- Plough through the data (throughput is king)
- Few decisions, lots of number crunching
[Figure: architecture hierarchy – core (RegFile) → PE → cluster (PEs + DMA + L1) → multi-cluster with L2]
PE: Snitch, a Tiny RISC-V Core
A versatile building block:
- Latency-tolerant scoreboard: tracks instruction dependencies – much simpler than OoO support!
- Extensible with FP, Vector, Stencil, and ML/Tensor units

F. Zaruba, F. Schuiki, T. Hoefler, and L. Benini, “Snitch: A Tiny Pseudo Dual-Issue Processor for Area- and Energy-Efficient Execution of Floating-Point Intensive Workloads,” IEEE Transactions on Computers, vol. 70, no. 11, pp. 1845–1860, Nov. 2021.
Snitch PE: ISA Extensions for Efficient “Compute”
How can we remove the Von Neumann bottleneck? Targeting “compute” code.
Stream Semantic Registers (SSRs): LD/ST elision
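A conceptual sketch of what LD/ST elision means (a software model of the idea, not the hardware): reading an SSR-mapped register implicitly pops the next element of a configured address stream, so the inner loop contains only compute instructions. Class and function names below are illustrative, not from the Snitch codebase.

```python
# Conceptual model of Stream Semantic Registers (SSRs): a sketch, not RTL.
# Reading ft0/ft1 pops the next element from a hardware-generated address
# stream, so the inner loop needs no explicit load instructions.

class SSR:
    """Models one stream semantic register: a base/stride/length address generator."""
    def __init__(self, memory, base, stride, length):
        self.memory, self.base, self.stride, self.length = memory, base, stride, length
        self.i = 0

    def read(self):
        # Each register read triggers an implicit load from the next address.
        addr = self.base + self.i * self.stride
        self.i += 1
        return self.memory[addr]

def ssr_dot(memory, n):
    ft0 = SSR(memory, base=0, stride=1, length=n)   # stream over x
    ft1 = SSR(memory, base=n, stride=1, length=n)   # stream over y
    acc = 0.0
    for _ in range(n):                              # loop body: FMA only, no ld/st
        acc += ft0.read() * ft1.read()
    return acc

mem = [1.0, 2.0, 3.0] + [4.0, 5.0, 6.0]             # x followed by y
print(ssr_dot(mem, 3))                              # 1*4 + 2*5 + 3*6 = 32.0
```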
Floating-Point Repetition Buffer (FREP): removes control-flow overhead from the compute stream
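The effect of the repetition buffer can be sketched in software (an illustrative model, not the real sequencer): the core issues the loop body once, and the FPU sequencer replays it from a small buffer, so no dynamic branches or loop counters reach the instruction stream.

```python
# Conceptual model of the FPU repetition buffer (FREP): a sketch.
# The integer core issues the FP loop body once; the sequencer replays
# it n_iter times from a loop buffer, spending no cycles on branches.

def frep(n_iter, body):
    """Replay the buffered FP instruction sequence n_iter times."""
    issued = []
    for _ in range(n_iter):
        for instr in body:          # replayed from the buffer, not refetched
            issued.append(instr)
    return issued

# Body issued once by the core: a single fmadd (as in a dot product).
trace = frep(4, ["fmadd ft3, ft0, ft1, ft3"])
print(len(trace))                   # 4 dynamic FP ops, 0 dynamic branch ops
```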
RISC-V ISA Extensions for the Target Workload
- Mixed precision: inference ≠ training; quantization
- EXFMA and EXSDOTP (expanding FMA / sum-dot-product): DOTP = A*B + (C*D + E)
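A minimal numerical sketch of an expanding sum-dot-product, assuming FP16 inputs widened to FP32 before the multiply-accumulate (the function name and operand handling are illustrative, not the hardware datapath):

```python
import numpy as np

# Sketch of an expanding sum-dot-product (EXSDOTP): two narrow (FP16)
# products accumulated into a wide (FP32) result, DOTP = A*B + (C*D + E).
# Operands are widened first, so the products are formed without FP16
# rounding before accumulation.

def exsdotp(a, b, c, d, e):
    a, b, c, d = (np.float16(x) for x in (a, b, c, d))
    return np.float32(a) * np.float32(b) + (np.float32(c) * np.float32(d) + np.float32(e))

print(exsdotp(2.0, 3.0, 4.0, 5.0, 1.0))   # 2*3 + (4*5 + 1) = 27.0
```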
What About Sparsity? Indirect SSR (ISSR) Streamer
Based on the existing 3-SSR streamer:
1. Extend 2 SSRs to ISSRs (each streams an index and a data stream from the TCDM scratchpad)
2. Add an index comparison unit between the ISSRs, with a control interface to the FPU sequencer (frep.s) – the result index count is unknown ahead of time

Enables general sparse-sparse linear algebra on fibers:
- dotp: index match + fmadd
- elem-mul: index match + fmul
- vadd: index merge + fadd
- vec-mac: index merge + fmadd

[Figure: ISSR 0 and ISSR 1 feed the index comparator; matched/merged indices and data drive the FPU subsystem via ft0/ft1, while SSR 2 (ft2) streams results back to memory]
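The index match and index merge primitives above can be sketched as two-pointer walks over sorted (index, value) fibers — a software model of what the ISSR comparator streams to the FPU, with illustrative function names:

```python
# Sketch of the two ISSR index-comparison modes on sparse fibers
# stored as parallel sorted index and value arrays.

def fiber_dotp(ia, va, ib, vb):
    """Index match + fmadd: dot product over the intersection of indices."""
    i = j = 0
    acc = 0.0
    while i < len(ia) and j < len(ib):
        if ia[i] == ib[j]:
            acc += va[i] * vb[j]            # matched indices -> fmadd
            i += 1; j += 1
        elif ia[i] < ib[j]:
            i += 1
        else:
            j += 1
    return acc

def fiber_vadd(ia, va, ib, vb):
    """Index merge + fadd: sparse vector add over the union of indices."""
    i = j = 0
    out = []
    while i < len(ia) or j < len(ib):
        if j == len(ib) or (i < len(ia) and ia[i] < ib[j]):
            out.append((ia[i], va[i])); i += 1
        elif i == len(ia) or ib[j] < ia[i]:
            out.append((ib[j], vb[j])); j += 1
        else:                                # equal indices -> fadd
            out.append((ia[i], va[i] + vb[j])); i += 1; j += 1
    return out

print(fiber_dotp([0, 2, 5], [1.0, 2.0, 3.0], [2, 5, 7], [4.0, 5.0, 6.0]))  # 2*4 + 3*5 = 23.0
```

elem-mul and vec-mac follow the same two patterns with fmul/fmadd substituted for the arithmetic step.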
ISSR Performance Benefits
- ISSRs enable pseudo-dual-issue execution
- Baseline performance degrades for large (3D) stencils: it cannot maintain unrolling and keep data reusable
[Plot: FP utilization and IPC, baseline vs. ISSR]
Efficient PE (Snitch) architecture in perspective
Compute Efficiency: the Cluster (PEs + On-chip TCDM)
The Cluster: Design Challenges
High-Speed Logarithmic Interconnect
- Do not underestimate on-chip wires…
- Multibanked L1 (banking factor BF > 1) behind a logarithmic interconnect
- Word-level bank interleaving “emulates” a multiported memory
[Figure: processors P1–P4 reach memory banks B1–B8 through a routing tree and an arbitration tree]
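The interleaving idea can be shown in a few lines (the bank count and word size below are illustrative assumptions): consecutive words map to consecutive banks, so cores touching adjacent words never collide on a bank.

```python
# Sketch of word-level bank interleaving: consecutive 32-bit words map to
# consecutive banks, so N cores accessing N consecutive words hit N
# distinct banks and the multibanked L1 behaves like a multiported memory.

N_BANKS = 8          # assumption: 8 banks (BF > 1 for an 8-core cluster)
WORD = 4             # 32-bit words

def bank_of(byte_addr):
    return (byte_addr // WORD) % N_BANKS

# Four cores reading four consecutive words: all requests land in
# different banks, so all can be served in the same cycle.
addrs = [0x100 + 4 * core for core in range(4)]
banks = [bank_of(a) for a in addrs]
print(banks)
assert len(set(banks)) == 4      # conflict-free
```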
Where Does the Energy Go? (8-core cluster)
Local memory is inevitable (e.g., GPU L1 cache, vector register file).
- FPU: 87.44 (~50% of power)
- L1 Memory: 47.19
- Miscellaneous: 25.26
- SSR/FREP hardware: 9.52 (~5% of power)
- ICACHE: 4.82
- Integer core: 4.24 (~2% of power)
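A quick cross-check that the quoted percentages follow from the per-component numbers on the slide (units as reported there):

```python
# Power breakdown of the 8-core cluster, per-component values from the slide.
power = {
    "FPU": 87.44,
    "L1 Memory": 47.19,
    "Miscellaneous": 25.26,
    "SSR/FREP": 9.52,
    "ICACHE": 4.82,
    "Integer Core": 4.24,
}
total = sum(power.values())
shares = {k: round(100 * v / total) for k, v in power.items()}
print(shares)   # FPU ~49%, SSR/FREP ~5%, Integer Core ~2% -- matches the slide
```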
Back to the Cluster… Can We Make It Bigger? Why?
- Better global latency tolerance if L1size > 2 × L2latency × L2bandwidth (Little’s law + double buffering)
- Easier to program (data-parallel, functional pipeline…)
- Smaller data-partitioning overhead
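A worked instance of the sizing rule (the latency and bandwidth numbers are illustrative assumptions, not measured values): the bandwidth-delay product gives the data in flight needed to keep L2 busy, and the factor of 2 provides one buffer to compute on while the other refills.

```python
# Little's law + double buffering: L1size > 2 * L2latency * L2bandwidth.

def min_l1_bytes(l2_latency_cycles, l2_bw_bytes_per_cycle):
    # Bandwidth-delay product = bytes in flight needed to saturate L2;
    # x2 so one buffer is computed on while the other is refilled by DMA.
    return 2 * l2_latency_cycles * l2_bw_bytes_per_cycle

# Assumed: 100-cycle L2 latency, 64 B/cycle L2 bandwidth.
print(min_l1_bytes(100, 64))   # 12800 bytes -> a few KiB of L1 suffice here
```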
Hierarchical Physical Architecture (MemPool)

Tile: 4 32-bit cores, 16 banks, single-cycle memory access
Group: 64 cores, 256 banks, 3-cycle latency
Cluster: 256 cores, 1024 banks (1 MiB of memory), 5-cycle latency

[Figure: a MemPool tile (4 cores with L0 I$, shared L1 instruction cache, 16-bank scratchpad memory, local/north/northeast/east interconnect ports), a MemPool group of 16 tiles, and the full MemPool cluster of 4 groups (tiles 0–63)]
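The hierarchy numbers above compose consistently, which a few lines of arithmetic confirm:

```python
# Cross-check of the MemPool hierarchy: tiles aggregate into groups,
# groups into the cluster.

cores_per_tile, banks_per_tile = 4, 16
tiles_per_group, groups_per_cluster = 16, 4

cores_group = cores_per_tile * tiles_per_group        # 64 cores per group
banks_group = banks_per_tile * tiles_per_group        # 256 banks per group
cores_cluster = cores_group * groups_per_cluster      # 256 cores per cluster
banks_cluster = banks_group * groups_per_cluster      # 1024 banks per cluster
kib_per_bank = (1 * 1024) // banks_cluster            # 1 MiB over 1024 banks

print(cores_group, banks_group, cores_cluster, banks_cluster, kib_per_bank)
# 64 256 256 1024 1
```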
Occamy: RISC-V Goes HPC Chiplet!
- System-level DMA: long & short bursts, 1D & 2D patterns
- Chiplets are a big trend!
[Figure: off-die serial link (8 GB/s), die-to-die serial links (8 GB/s, 64 GB/s), 512-bit and 64-bit on-die channels, HBM PHY / HBM2e PHY (512 GB/s) to HBM2e DRAM (<410 GB/s)]
High-Performance, General-Purpose
Our scalable architecture is general-purpose and high-performance.
Peak chiplet performance @ 1 GHz:
- FP64: 384 GFLOP/s
- FP32: 768 GFLOP/s
- FP16: 1.536 TFLOP/s
- FP8: 3.072 TFLOP/s
Dense kernels:
- GEMMs: ≥ 80% FPU utilization (also for SIMD MiniFloat)
[Die shot: 7.0 mm × 10.5 mm chiplet – compute groups with clusters, CVA6 host, SPMs, HBM controller, die-to-die interfaces]
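The peak numbers follow a clean pattern worth making explicit: each halving of precision doubles SIMD width and hence throughput. The FMA-lane count derived below is an inference from the stated numbers (assuming one FMA = 2 flops), not a figure from the slide.

```python
# Sanity check of the peak-performance figures: throughput doubles at
# each precision step down from FP64.

peak_fp64_gflops = 384
peaks = {"FP64": peak_fp64_gflops}
for i, fmt in enumerate(["FP32", "FP16", "FP8"], start=1):
    peaks[fmt] = peak_fp64_gflops * 2 ** i

print(peaks)                     # {'FP64': 384, 'FP32': 768, 'FP16': 1536, 'FP8': 3072}

# Derived (assumption: FMA counts as 2 flops): 384 GFLOP/s at 1 GHz
# implies 192 FP64 FMA lanes per chiplet.
print(peak_fp64_gflops // 2)     # 192
```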
Programming Occamy: DaCe
Highly expressive DSL family – high-level transformations, support for explicitly managed memory. Applications such as BERT and YOLOv5 map to CUDA, FPGA, RISC-V, and other backends.
Efficient Chiplet Architecture in Perspective
System Level: Monte Cimone, the first RISC-V Cluster
Conclusion
Game-changing technologies [AMD Naffziger ISCAS22] [RIKEN Matsuoka MODSIM22]:
- “Commoditized” chiplets: 2.5D, 3D
- Computing “at” memory (DRAM MemPool)
- Coming: optical I/O, smart NICs, and switches
Challenges:
- High-performance RV host?
- RV HPC software ecosystem?
Want to use the stuff? You can! Free, open source, with a liberal (Apache) license!
http://pulp-platform.org @pulp_platform

Thanks to Luca Benini, Alessandro Capotondi, Alessandro Ottaviano, Alessandro Nadalini, Alessio Burrello, Alfio Di Mauro, Andrea Borghesi, Andrea Cossettini, Andreas Kurth, Angelo Garofalo, Antonio Pullini, Arpan Prasad, Bjoern Forsberg, Corrado Bonfanti, Cristian Cioflan, Daniele Palossi, Davide Rossi, Davide Nadalini, Fabio Montagna, Florian Glaser, Florian Zaruba, Francesco Conti, Frank K. Gürkaynak, Georg Rutishauser, Germain Haugou, Gianna Paulin, Gianmarco Ottavi, Giuseppe Tagliavini, Hanna Müller, Lorenzo Lamberti, Luca Bertaccini, Luca Valente, Luca Colagrande, Luka Macan, Manuel Eggimann, Manuele Rusci, Marco Guermandi, Marcello Zanghieri, Matheus Cavalcante, Matteo Perotti, Matteo Spallanzani, Mattia Sinigaglia, Michael Rogenmoser, Moritz Scherer, Moritz Schneider, Nazareno Bruschi, Nils Wistoff, Pasquale Davide Schiavone, Paul Scheffler, Philipp Mayer, Robert Balas, Samuel Riedel, Sergio Mazzola, Sergei Vostrikov, Simone Benatti, Stefan Mach, Thomas Benz, Thorir Ingolfsson, Tim Fischer, Victor Javier Kartsch Morinigo, Vlad Niculescu, Xiaying Wang, Yichao Zhang, Yvan Tortorella, all our past collaborators, and many more that we forgot to mention.