Benini ISC2023 Paving The Road For Riscv

The PULP Platform focuses on developing open-source hardware for RISC-V supercomputers, emphasizing efficient architecture through heterogeneous and parallel computing. It introduces the Snitch core, a versatile RISC-V processor designed for high energy efficiency and performance, particularly in machine learning and data-intensive workloads. The document discusses various architectural innovations, including memory management and instruction set extensions, aimed at overcoming traditional computing bottlenecks.

PULP PLATFORM

Open Source Hardware, the way it should be!

Paving the Road for RISC-V Supercomputers with Open Hardware

Luca Benini

http://pulp-platform.org  @pulp_platform  https://www.youtube.com/pulp_platform


Computing is Power Bound: HPC

[Chart: TOP500 list, 9/22: HPC performance under a fixed 20 MW power budget]

HPC performance: 10x every ~5 years
Computing is Power Bound: ML

[Chart: training compute of ML systems over time (Sevilla et al. '22, arXiv:2202.05924, epochai.org); GPT-4 (OpenAI '23) training compute: 2.1E+25 FLOP; largest datacenter: <150 MW]

AI training: 10x every year!!!
Machine learning (training): 10x every 2 years
Technology Scaling?

[Chart: node-to-node scaling trends (TSMC, ISSCC'21); at iso-area, power increases by 1.24x]

Energy efficiency, 1/(Power × Time): 10x every 12 years…
Efficient Architecture: Heterogeneous + Parallel

[Figure: a program alternates between "Decide" and "Compute" phases]
Heterogeneous + Parallel… Why?

Decide (jump to a different program part):
 Modulates the flow of instructions
 Mostly sequential decisions: don't work too much, be clever about the battles you pick (latency is king)
 Lots of decisions, little number crunching

Compute (plough through numbers):
 Modulates the flow of data
 Embarrassingly data-parallel: don't think too much, plough through the data (throughput is king)
 Few decisions, lots of number crunching

 Today's workloads are dominated by "Compute":
 Tons of data; few (as fast as possible) decisions based on the computed values
 "Data-oblivious algorithms" (ML, or better, DNNs are so!)
 Large data footprint + sparsity

How do we design an efficient "Compute" fabric?
Compute Efficiency: Data (…and Instruction) Movement is Key

[Figure: memory hierarchy: a PE (register file + core), a cluster (PEs + DMA sharing an L1 TCDM), and multiple clusters sharing the L2 main memory]

          L0: Operand Memory   L1: Tightly Coupled DM   L2: Main Memory
Latency   =1                   <10                      >100
Density   =1                   ≈10                      ≈100
Sharing   Private              Shared                   Shared, Remote
PE: Snitch, a Tiny RISC-V Core
A versatile building block

 Simplest core: around 20 kGE
 Speed via simplicity (1 GHz+)
 L0 I-cache/buffer for low-energy fetch
 Shared L1 for instruction reuse (SPMD)

 Extensible "accelerator" port
 Minimal baseline ISA (RISC-V)
 Extensibility: performance through ISA extensions via the accelerator port (FP, vector, stencil, ML/tensor)

 Latency-tolerant scoreboard
 Tracks instruction dependencies
 Much simpler than OoO support!

F. Zaruba, F. Schuiki, T. Hoefler and L. Benini, "Snitch: A Tiny Pseudo Dual-Issue Processor for Area and Energy Efficient Execution of Floating-Point Intensive Workloads," IEEE Transactions on Computers, vol. 70, no. 11, pp. 1845-1860, Nov. 2021.
Snitch PE: ISA Extensions for Efficient "Compute"

 How can we remove the Von Neumann bottleneck?
 Targeting "compute" code

double sum = 0;
for (int i = 0; i < N; ++i) {
    sum += A[i] * B[i];
}

Per-iteration instruction stream and energy cost:

    fld     ft0, 0(a1)             70 pJ
    fld     ft1, 0(a2)             70 pJ
    addi    a1, a1, 8              50 pJ
    addi    a2, a2, 8              50 pJ
    fmadd.d fa0, ft0, ft1, fa0     80 pJ
    bne     a1, a3, -5             50 pJ

Memory access, operation, iteration control – can we do better?

Note: memory accesses take >1 cycle even for L1  we need latency tolerance for LD/ST.
Stream Semantic Registers (SSRs)
LD/ST elision

 Intuition: high FPU utilization ≈ high energy efficiency

 Idea: turn register reads/writes into implicit memory loads/stores
 Extension around the core's register file
 Address-generation hardware

 Increases FPU/ALU utilization by ~3x, up to 100%

 SSRs ≠ memory operands
 Perfect prefetching, latency-tolerant
 1-3 SSRs (2-3 kGE per SSR)
Floating-Point Repetition Buffer (FREP)
Remove control-flow overhead from the compute stream

 Programmable micro-loop buffer
 A sequencer steps through the buffer, independently of the FPU
 The integer core is free to operate in parallel: pseudo-dual issue
 High area and energy efficiency

A combined SSR + FREP sketch for the dot product follows below.
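A minimal sketch of how the dot-product loop from the earlier slide collapses once SSRs supply the operands and FREP repeats the FMA. The snrt_ssr_* prototypes and the frep.o operand layout are assumptions modeled on the public Snitch runtime and ISA documentation, not taken from this deck; treat it as illustrative code for the Snitch toolchain only.

    #include <stdint.h>

    /* Assumed Snitch-runtime-style helpers (names and signatures are
       illustrative): program a 1D address pattern into an SSR data mover
       and bind it to a read stream. */
    void snrt_ssr_loop_1d(int dm, uint32_t n, uint32_t stride_bytes);
    void snrt_ssr_read(int dm, int dim, const void *addr);
    void snrt_ssr_enable(void);
    void snrt_ssr_disable(void);

    double dotp_ssr_frep(const double *a, const double *b, uint32_t n) {
        /* Stream A[i] into ft0 (data mover 0) and B[i] into ft1 (data mover 1). */
        snrt_ssr_loop_1d(0, n, sizeof(double));
        snrt_ssr_loop_1d(1, n, sizeof(double));
        snrt_ssr_read(0, 0, a);
        snrt_ssr_read(1, 0, b);

        register double sum asm("fa0") = 0.0;
        snrt_ssr_enable();
        /* frep.o repeats the following single instruction n times; every read
           of ft0/ft1 is an implicit load, so no fld/addi/bne remain in the loop. */
        asm volatile(
            "frep.o %[rep], 1, 0, 0          \n"
            "fmadd.d %[sum], ft0, ft1, %[sum]\n"
            : [sum] "+f"(sum)
            : [rep] "r"(n - 1)
            : "ft0", "ft1", "memory");
        snrt_ssr_disable();
        return sum;
    }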
RISC-V ISA Extensions for the Target Workload
Mixed precision, quantization: inference ≠ training

Efficient DNN inference & training:
 Inference: INT8 quantization is SoA
 Training: high dynamic range is needed for weights and weight updates
 fp32 is still standard for DNN training workloads; low-precision training uses bf16 and fp8

Support a wide variety of FP formats and instructions:
 Standard: fp64, fp32, fp16, bf16
 Low precision: fp8, altfp8
 fp8 (1-4-3): forward prop.
 altfp8 (1-5-2): backward prop.
 Expanding ops: accumulation
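Why two fp8 flavors: 1-4-3 spends its bits on mantissa (precision for the forward pass), while 1-5-2 trades mantissa for exponent range (gradients span many orders of magnitude). Assuming plain IEEE-754-style encodings with no reserved values, the largest normal magnitudes work out to:

$$\text{fp8 (1-4-3)}: \;(2 - 2^{-3}) \cdot 2^{15-7} = 480 \qquad \text{altfp8 (1-5-2)}: \;(2 - 2^{-2}) \cdot 2^{30-15} = 57344$$

so altfp8 covers roughly 120x more dynamic range at the cost of one mantissa bit.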
Cascade of EXFMAs vs. EXSDOTP

[Figure: a cascade of two expanding FMAs, DOTP = A*B + (C*D + E), vs. a fused expanding sum-of-dot-products unit, DOTP = A*B + C*D + E]

 Cascaded FMAs: non-distributive FP addition  precision loss
 Fused EXSDOTP (i.e., lossless):
 Single normalization and rounding step
 Smaller area and shorter critical path
 Product bypass to compute the fused three-term addition (vector inner sum)
 Stochastic rounding supported (+3% area)
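A host-side illustration of the precision-loss point (not the hardware unit itself): with cascaded FMAs the inner sum is rounded before the outer addition, whereas a single-rounding fused three-term sum keeps the small term. The single rounding is emulated here in standard C by using double as the "wide" accumulator.

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        float a = -1.0f, b = 1.0f;     /* a*b = -1 */
        float c = 0x1p-25f, d = 1.0f;  /* c*d = 2^-25, below fp32's 24-bit mantissa at magnitude 1 */
        float e = 1.0f;

        /* Cascade of FMAs: the inner fmaf rounds 1 + 2^-25 back to 1.0f, losing the small term. */
        float cascaded = fmaf(a, b, fmaf(c, d, e));

        /* Single-rounding three-term sum (EXSDOTP-style), emulated in wider precision. */
        float fused = (float)((double)a * b + (double)c * d + (double)e);

        printf("cascaded = %a\n", cascaded); /* 0x0p+0   (the 2^-25 vanished) */
        printf("fused    = %a\n", fused);    /* 0x1p-25  (kept by the single rounding) */
        return 0;
    }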
What About Sparsity? The Indirect SSR (ISSR) Streamer

 Based on the existing 3-SSR streamer:
 1. Extend two SSRs to ISSRs
 2. Add an index-comparison unit between the ISSRs
 3. Forward the resulting indices to the third SSR

[Figure: ISSR 0 and ISSR 1 fetch index and data streams from the TCDM scratchpad; the index comparator matches/merges indices, feeding ft0/ft1 and driving SSR 2 (ft2) with the matched/merged indices; instructions are offloaded to the FPU subsystem]
 Control interface to the FPU sequencer (frep.s)
 The result index count is unknown ahead of time
 Enables general sparse-sparse linear algebra on fibers:
 dotp: index match + fmadd
 vadd: index merge + fadd
 elem-mul: index match + fmul
 vec-mac: index merge + fmadd
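A plain-C reference for the "index match + fmadd" case: this is the intersection the ISSR index comparator resolves in hardware, while the FPU only ever sees a dense stream of matched operands. The 16-bit index type follows the slide; everything else is an illustrative sketch.

    #include <stdint.h>

    /* Sparse-sparse dot product over two fibers in sorted coordinate format:
       walk both index lists, multiply-accumulate only where indices match. */
    double sparse_dotp(const uint16_t *idx_a, const double *val_a, int nnz_a,
                       const uint16_t *idx_b, const double *val_b, int nnz_b) {
        double sum = 0.0;
        int i = 0, j = 0;
        while (i < nnz_a && j < nnz_b) {
            if (idx_a[i] == idx_b[j])
                sum += val_a[i++] * val_b[j++];  /* index match -> fmadd */
            else if (idx_a[i] < idx_b[j])
                i++;                             /* advance the lagging fiber */
            else
                j++;
        }
        return sum;
    }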
ISSR Performance Benefits

[Charts: SpVV and VTI FP utilization vs. sparse nonzeros / timesteps]

 Notable single-core speedups over the RV baseline (16-bit indices):
 CsrMV: up to 7.0x faster, 79% FP utilization
 SpV+SpV: up to 9.8x faster, higher FP utilization
 SpV·SpV: up to 7.7x faster, higher FP utilization
 VTI (3D stencil code): up to 2.9x faster, 78% FP utilization

 Significant benefits in a multicore cluster (16-bit indices):
 CsrMV: up to 5.0x faster, 2.9x less energy
 CsrMSpV: up to 5.8x faster, 3.0x less energy
 VTI: up to 2.7x faster

 Notably higher peak FP utilization than SoA CPUs (69x) and GPUs (2.8x) on CsrMV
ISSR Performance on Stencils

[Charts: per-kernel ISSR speedup, FP utilization, and IPC for the baseline vs. ISSR kernels]

 Various 2D/3D stencils on an 8-worker-core cluster
 FP64, 64²/16³ grid chunks, up to 4x unroll
 Tuned LLVM RV32G baseline vs. ISSR-enhanced kernels

 Geomean 2.7x speedup, 82% FP utilization
 ISSR IPC consistently >1, as ISSRs enable pseudo-dual issue

 Baseline performance degrades for large (3D) stencils:
 Cannot maintain unrolling and keep reusable inner-loop data in the register file
 ISSR streams avoid this bottleneck: 2.5x (2D)  3.2x (3D) geomean speedup
The Efficient PE (Snitch) Architecture in Perspective

1. Minimize control overhead  simple, shallow pipelines
2. Reduce the VNB  amortize instruction fetch: SSR + FREP + SIMD (vector processing)
3. Hide memory latency  non-blocking (indexed) LD/ST + dependency tracking
4. Highly expressive, domain-specific instruction extensions (thanks, RISC-V!)
Compute Efficiency: the Cluster (PEs + On-chip TCDM)

[Figure: the memory hierarchy from before, now focusing on the cluster level: PEs and a DMA engine sharing the L1 tightly coupled data memory (latency <10, density ≈10, shared), between the private L0 register files and the remote L2 main memory]
The Cluster: Design Challenges

[Figure: cluster with a multi-banked tightly coupled data memory connected to the RISC-V cores through a logarithmic interconnect]

 Efficient PEs
 Hide the TCDM's "residual" latency
 Remove the Von Neumann bottleneck

 Low-latency access to the TCDM
 Multi-banked architecture
 Fast logarithmic interconnect

 Fast synchronization
 Atomics
 Barriers
High-Speed Logarithmic Interconnect
Do not underestimate on-chip wires…

[Figure: processors P1-P4 reach memory banks B1-B8 through a routing tree and an arbitration tree]

Word-level bank interleaving «emulates» a multiported memory.

@1 GHz, 8-16 PEs, latency: 2 cycles + stalls on banking conflicts.
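For intuition, word-level interleaving maps consecutive words to consecutive banks, so unit-stride streams from the PEs fan out across all banks instead of serializing on one. A minimal sketch (the bank count and word size are illustrative):

    #define N_BANKS    32u  /* illustrative bank count */
    #define WORD_BYTES 8u   /* 64-bit words */

    /* Consecutive words land in consecutive banks (round-robin), which is
       what makes the multi-banked TCDM behave like a multiported memory
       for unit-stride accesses. */
    static inline unsigned bank_of(unsigned long addr) {
        return (addr / WORD_BYTES) % N_BANKS;
    }
    static inline unsigned row_in_bank(unsigned long addr) {
        return (addr / WORD_BYTES) / N_BANKS;
    }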
Efficient Explicit Global Data Mover
Hide the L2/main-memory latency

[Figure: Snitch core with a DMA decoder extension driving a 2D DMA backend]

 64-bit AXI DMA, explicit double-buffered transfers: better than a data cache
 Tightly coupled with Snitch (<10 cycles to configure)
 Operates on a wide 512-bit data bus
 Hardware support for copying 2- to 4-dimensional shapes
 Higher dimensionality handled by SW
 Intrinsics/library for easy programming
 Domain-specific autotilers
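The double-buffering pattern these transfers enable, as a sketch: the DMA fills one L1 tile while the cores compute on the other. dma_copy_async, dma_wait, and compute_on_tile are hypothetical stand-ins, not the actual intrinsics from the library mentioned above.

    #include <stddef.h>

    /* Hypothetical DMA intrinsics (stand-ins for the real library). */
    typedef int dma_txid_t;
    dma_txid_t dma_copy_async(void *dst, const void *src, size_t bytes);
    void dma_wait(dma_txid_t id);
    void compute_on_tile(double *tile, size_t len);

    /* Double buffering: overlap the DMA-in of tile t+1 with compute on tile t. */
    void run_tiles(double *l1_buf[2], const double *l2_in,
                   size_t n_tiles, size_t tile_len) {
        dma_txid_t pending =
            dma_copy_async(l1_buf[0], l2_in, tile_len * sizeof(double));
        for (size_t t = 0; t < n_tiles; t++) {
            dma_wait(pending);           /* tile t has arrived in L1 */
            if (t + 1 < n_tiles)         /* prefetch tile t+1 into the other buffer */
                pending = dma_copy_async(l1_buf[(t + 1) % 2],
                                         l2_in + (t + 1) * tile_len,
                                         tile_len * sizeof(double));
            compute_on_tile(l1_buf[t % 2], tile_len); /* FPUs run while the DMA streams */
        }
    }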
Snitch Cluster Architecture

[Figure: Snitch cores 0…N, each with an FPU and SSRs, plus a shared DIV unit and a DMA core (Snitch N+1), connected through a logarithmic interconnect to a multi-banked L1 (banking factor > 1) and served by a shared instruction cache]
Where Does the Energy Go?

Power breakdown of an 8-core cluster (pie-chart values, with shares):

FPU            87.44   (~50%)
L1 memory      47.19   (~26%)
Miscellaneous  25.26   (~14%)
SSR/FREP        9.52   (~5%)
ICACHE          4.82   (~3%)
Integer core    4.24   (~2%)

Local memory is inevitable (cf. a GPU's L1 cache or vector register file).
Spending energy where it contributes to the result  high efficiency.
The Efficient Cluster Architecture in Perspective

1. Memory pool: efficient sharing of L1 memory
2. Fast and parsimonious synchronization
3. Data mover + double buffering: explicitly managed block transfers at the boundary
4. More cores and more memory per cluster… that would be nice!
Back to the Cluster… Can We Make It Bigger?

 Why?
 Better global latency tolerance if L1size > 2 × L2latency × L2bandwidth (Little's law + double buffering; worked numbers below)
 Easier to program (data-parallel, functional pipeline…)
 Smaller data-partitioning overhead

 Goal: an efficient many-core cluster with a low-latency shared L1
 256+ cores
 1+ MiB of shared L1 data memory
 ≤10 cycles of L1 latency (without contention)

 Physical-aware design: MemPool
 Worst-case frequency > 500 MHz
 Targeting iso-frequency with a small cluster
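Plugging in the numbers quoted later for Terapool (hiding 500 ns of main-memory latency at 4 TB/s of bandwidth), the double-buffered sizing rule gives:

$$L1_{\text{size}} \ge 2 \times 500\,\text{ns} \times 4\,\text{TB/s} = 4\,\text{MB},$$

which matches Terapool's 4 MB of shared L1.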
Hierarchical Physical Architecture

Tile  Group  Cluster:

                 Tile            Group      Cluster
Cores            4 (32-bit)      64         256
Banks            16              256        1024 (1 MiB of memory)
Memory access    single cycle    3 cycles   5 cycles

[Figure: a MemPool tile (4 cores with L0 I$, 16 scratchpad banks, local interconnect, shared L1 instruction cache); 16 tiles per group with local, north, northeast, and east ports; 4 groups (tiles 0-15, 16-31, 32-47, 48-63) forming the cluster]

TopH: butterfly multi-stage interconnect, 0.3 req/core/cycle.
Can We Push It Further? MemPool  Terapool

[Figure: Terapool hierarchy with 3-, 5-, and 9-cycle access latencies]

 GF12, 0.8 V, 16 Snitch cores per tile, multi-stage interconnect, 0.23 req/core/cycle
 1024 cores, 4 MB of L1 in 4096 banks!
 69 mm², 3.8 W, 900 MHz, 0.6 TOPS (MMUL)  @5 nm: 23 mm², 2.2 W, 1.2 GHz, 1 TOPS
 4 MB can hide a latency of 500 ns at a bandwidth of 4 TB/s
 …need more?  Terapool-3D
Compute Efficiency: the Chip(let) (Clusters + Off-die Memory)

[Figure: the memory hierarchy from before, now focusing on the top level: clusters and their DMAs sharing the L2 main memory (latency >100, density ≈100, shared and remote)]
Occamy: RISC-V Goes HPC Chiplet!

[Figure: Occamy chiplet top level. A 64-bit CVA6 host runs Linux and owns the peripherals (SPI, I2C, UART, GPIO, timers, peripheral manager; <1% of traffic). A system-level DMA handles long & short bursts with 1D & 2D patterns. Two 8 GB/s off-die serial links and a 64 GB/s die-to-die serial link, a simplified 512-bit system crossbar (512 GB/s, 384 GB/s group-to-group), 512 KB + 1 MB SPM, ZeroMem, and an HBM2e PHY with <410 GB/s to 8 GB of HBM2e DRAM]

Multi-cluster, multi-core accelerator:
 6 groups of 4 clusters each; each cluster has 8 compute cores + 1 DMA core
 A total of 216 Snitch cores @ 1 GHz with multi-precision FPUs (64- down to 8-bit)
Occamy NoC: Efficient and Flexible Data Movement

[Figure: clusters under group crossbars, group crossbars under the system crossbar, which connects to HBM and the die-to-die link]

 Problem: HBM accesses are critical in terms of
 Access energy
 Congestion
 High latency

 Instead, reuse data at the lower levels of the memory hierarchy:
 Between clusters
 Across groups

 Smartly distribute the workload:
 Clusters: tiling, depth-first execution
 Chiplets: e.g., layer pipelining

Big trend!
High-Performance, General-Purpose

Our scalable architecture is general-purpose and high-performance.

Peak chiplet performance @ 1 GHz:
 FP64: 384 GFLOP/s
 FP32: 768 GFLOP/s
 FP16: 1.536 TFLOP/s
 FP8: 3.072 TFLOP/s

Preliminary measured results:
 Dense kernels:
 GEMMs: ≥80% FPU utilization (also for SIMD MiniFloat)
 Conv2d: ≥75% FPU utilization (also for SIMD MiniFloat)
 Stencil kernels: ≤60% FPU utilization
 Sparse kernels: ≤50% FPU utilization

[Figure: 10.5 mm × 7.0 mm chiplet floorplan: six cluster groups, CVA6, SPMs, HBM controller, die-to-die links]

Chiplet taped out: 1st July '22
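The FP64 figure follows from the core counts two slides back, assuming one FMA (2 FLOPs) per FPU per cycle; each halving of precision then doubles the SIMD throughput:

$$6 \times 4 \times 8 = 192\ \text{compute FPUs}, \qquad 192 \times 2\,\text{FLOP} \times 1\,\text{GHz} = 384\ \text{GFLOP/s}.$$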
Silicon Interposer: Hedwig (65 nm, passive, GF)

Taped out: 15th of October 2022

[Figure: interposer layout with Occamy 0/1 and HBM DRAM 0/1 in an interlocked arrangement]

 Interlocked die arrangement
 Prevents bending, increases stability
 Compact die arrangement
 No dummy dies or stitching needed

 Fairly low I/O pin count, since there is no high-bandwidth periphery
 Off-package connectivity: ~200 wires
 Array of 40 × 35 (-1) C4s (1,399 C4 bumps in total)
 Diameter: 400 µm, pitch: 650 µm
 Die-to-die: ~600 wires
 HBM: ~1700 wires
Approaching 1 T(DP)FLOP/s

Dual-chiplet Occamy system:
 430+ RV cores
 0.8 T DP-FLOP/s (no overclocking)
 32 GB of HBM2e DRAM
 Low tens of watts (est.)

Aggressive 2.5D integration; carrier PCB:
 RO4350B (low-CTE, high stability)
 52.5 mm × 45 mm

Industry partners are key (thanks)!
Programming Occamy: DaCe
A highly expressive DSL family: high-level transformations, support for explicitly managed memory

DaCeML: Data-Centric Machine Learning
 Deep learning models (BERT, YOLOv5, …) enter through the DaCeML frontend
 DaCe, the Data-Centric Parallel Programming framework, targets x86, FPGA, CUDA, and RISC-V
 A library of optimized deep-learning kernels: GEMM, convolution, LayerNorm, Softmax, BatchNorm, …
 On RISC-V, the kernels map onto SSR, FREP, and the DMA
The Efficient Chiplet Architecture in Perspective

1. Multi-cluster single-die scaling  strong latency tolerance, modularity
2. NoC for flexible cluster-to-cluster, cluster-to-memory, and chip-to-chip traffic  reduced pressure on main memory
3. Top-level NoC routes to "local main memory" / "global main memory" with balanced bandwidth
4. Modular chiplet architecture: HBM2e, NoC-wrapped C2C links, multi-chiplet ready
System Level: Monte Cimone, the First RISC-V Cluster

Designed for HPC "pipe cleaning".
Preparing for Occamy: Accelerator on PCIe Cards

 Currently using an FPGA-mapped "tiny Occamy"
 VCU128 with HBM

 Supporting hybrid usage:
 Boot directly on the standalone CVA6, or
 Don't boot, and let the host control the cluster
 HW probing via on-board device-tree overlays

 High SW-stack reusability across both modes:
 Same Linux drivers to map the cluster
 Same OpenMP offloading runtime
Conclusion
[AMD Naffziger, ISCAS'22] [RIKEN Matsuoka, MODSIM'22]

 The energy-efficiency quest spans PE, cluster, SoC, and system

 Key ideas:
 Deep PE optimization  extensible ISAs (RISC-V!)
 VNB removal + latency hiding: large OoO processors are not needed
 Low-overhead work distribution and latency hiding  a large "MemPool"
 Heterogeneous architecture: host + accelerator(s)

 Game-changing technologies:
 "Commoditized" chiplets: 2.5D, 3D
 Computing "at" memory (DRAM MemPool)
 Coming: optical I/O, smart NICs, switches

 Challenges:
 A high-performance RV host?
 The RV HPC software ecosystem?
Want to use the stuff? You can!
Free, open source, with a liberal (Apache) license!

Luca Benini, Alessandro Capotondi, Alessandro Ottaviano, Alessandro Nadalini, Alessio Burrello, Alfio Di Mauro, Andrea Borghesi, Andrea Cossettini, Andreas Kurth, Angelo Garofalo, Antonio Pullini, Arpan Prasad, Bjoern Forsberg, Corrado Bonfanti, Cristian Cioflan, Daniele Palossi, Davide Rossi, Davide Nadalini, Fabio Montagna, Florian Glaser, Florian Zaruba, Francesco Conti, Frank K. Gürkaynak, Georg Rutishauser, Germain Haugou, Gianna Paulin, Gianmarco Ottavi, Giuseppe Tagliavini, Hanna Müller, Lorenzo Lamberti, Luca Bertaccini, Luca Valente, Luca Colagrande, Luka Macan, Manuel Eggimann, Manuele Rusci, Marco Guermandi, Marcello Zanghieri, Matheus Cavalcante, Matteo Perotti, Matteo Spallanzani, Mattia Sinigaglia, Michael Rogenmoser, Moritz Scherer, Moritz Schneider, Nazareno Bruschi, Nils Wistoff, Pasquale Davide Schiavone, Paul Scheffler, Philipp Mayer, Robert Balas, Samuel Riedel, Sergio Mazzola, Sergei Vostrikov, Simone Benatti, Stefan Mach, Thomas Benz, Thorir Ingolfsson, Tim Fischer, Victor Javier Kartsch Morinigo, Vlad Niculescu, Xiaying Wang, Yichao Zhang, Yvan Tortorella, all our past collaborators, and many more that we forgot to mention.

http://pulp-platform.org  @pulp_platform