Ventana HotChips23 - Final


Veyron V1 Data Center-Class RISC-V Processor

August 2023

Company confidential
Veyron V1 Target Markets
RISC-V Performance Leader

• Data Center
• Automotive
• 5G
• Edge & Client
• Generative AI

Veyron V1: Server Class RISC-V IP + Chiplets

[Figure: a Veyron V1 cluster of up to 16 Veyron V1 cores (each with 512 KB I-cache, 64 KB L1 D-cache, and 512 KB L2 D-cache), 48MB shared L3 (sliced per core), coherent cluster interconnect, IOMMU, AIA, and AMBA® CHI coherent interconnect; chiplet system with Ventana compute chiplets, D2D interface, IO Hub, and domain-specific acceleration]

Veyron High Performance RISC-V CPU IP
• Superscalar, aggressive out-of-order instruction pipeline
• High core count multi-cluster scalability (up to 192 cores)
• Comprehensive RAS features
• IOMMU & Advanced Interrupt Architecture (AIA) system IP

Veyron Chiplet Solutions
• Rapid productization with chiplets
• Veyron compute chiplets
  o In latest process node technology
  o Scalable CPU performance/count
• IO Hub
  o Implemented in lower-cost process node of choice
  o Customized for application requirements
• Custom Domain Specific Acceleration

Veyron V1 Overview

[Figure: up to 16 V1 cores (RV64GC; 512 KB L1/L2 I-cache, 64 KB L1 D-cache, 512 KB L2 D-cache) and 48MB shared L3 (sliced per core) on a coherent CHI bus form the CPU cluster / V1 compute chiplet; an OCP/ODSA BoW interface connects it to a hub SoC with AMBA® CHI coherent interconnect, two groups of 4-6x DDR to DIMM memory, a system-level L4 cache, PCIe / CXL / Ethernet, and pooled memory]

16 High Performance RISC-V Cores
• Decode, dispatch, and execute up to eight instructions per cycle
• Symmetric execution of any mix of up to four integer Reg/Ld/St/Br ops per cycle
• Decoupled predict/fetch front-end with advanced branch prediction

High Performance Cache Hierarchy
• 1MB L2 cache per core (512 KB instruction L2 + 512 KB data L2)
• Up to 48MB of globally shared cluster-level L3 cache

Coherent CHI System Integration
• Cluster/chiplet compliant with the AMBA Coherent Hub Interface (CHI)
• ODSA-compliant BoW die-to-die interface covering cost-effective organic to advanced package integrations, with Ventana-supplied D2D IP

Server-Class Product
• Full architectural support to run virtualized workloads
• RAS protection of all caches / functional RAMs, with end-to-end data poisoning and background cache scrubbing
• Ground-up microarchitecture with side-channel attack resilience

Veyron V1 Arch/Uarch Overview
RISC-V Architecture Support

• RV64GC plus many additional User, Supervisor, and Machine level architecture extensions
• Hypervisor extension
  o Type 1 and 2 hypervisors; nested virtualization
• Advanced Interrupt Architecture (AIA)
  o Including native MSI handling and interrupt virtualization
• 48-bit virtual addressing and 52-bit physical addressing
• External and self-hosted debug; trace-to-memory
• Rich set of performance events and perf counters
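The slide gives 48-bit virtual and 52-bit physical addressing but does not name a paging mode. Assuming the 48-bit figure corresponds to the standard RISC-V Sv48 scheme, a virtual address splits into four 9-bit VPN fields plus a 12-bit page offset, and bits 63:48 must copy bit 47. The Python sketch below shows that split; the helper names are illustrative and not from the source. Sv48 page-table entries carry a 44-bit PPN (up to 56-bit physical addresses), so a 52-bit physical address space fits within the format.

```python
# Sketch: split a virtual address the way RISC-V Sv48 paging does.
# Assumption: Veyron V1's "48-bit virtual addressing" means Sv48.

PAGE_OFFSET_BITS = 12   # 4 KiB base pages
VPN_BITS = 9            # each page-table level indexes 512 PTEs
LEVELS = 4              # Sv48 uses a four-level page table
VA_BITS = 48


def is_canonical(va: int) -> bool:
    """Sv48 requires bits 63..48 of a 64-bit VA to equal bit 47."""
    upper = va >> (VA_BITS - 1)                      # bits 63..47 (17 bits)
    return upper in (0, (1 << (64 - VA_BITS + 1)) - 1)


def split_sv48(va: int) -> tuple[list[int], int]:
    """Return ([VPN[3], VPN[2], VPN[1], VPN[0]], page_offset) for a 64-bit VA."""
    if not is_canonical(va):
        raise ValueError(f"non-canonical Sv48 address: {va:#x}")
    offset = va & ((1 << PAGE_OFFSET_BITS) - 1)
    vpns = []
    for level in reversed(range(LEVELS)):           # the walk starts at VPN[3]
        shift = PAGE_OFFSET_BITS + VPN_BITS * level
        vpns.append((va >> shift) & ((1 << VPN_BITS) - 1))
    return vpns, offset


print(split_sv48(0x0000_7FFF_FFFF_F123))  # ([255, 511, 511, 511], 291)
```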

Core Microarchitecture Highlights

• Superscalar, aggressive out-of-order design
• Innovative microarchitecture focused on …
  o Power efficiency and high performance
  o Efficient physical implementation and high frequency without custom memory macros
• Decoupled predict / fetch front-end
  o Predict fetch stream ahead of actual just-in-time fetch to keep decode pipe fed
  o Advanced branch prediction of direction and target address
  o High-capacity BTB and predictors
  o Fetch up to 64B per cycle; decode up to eight instructions per cycle
  o Code decompression (16b-to-32b) and fusion of common instruction-pair code idioms
• Decode, dispatch, issue, execute, and commit all operate in terms of “ops” (fused and unfused)
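To make the 16b-to-32b decompression bullet concrete, here is a minimal Python sketch for a single RVC instruction, C.ADDI, which expands to ADDI rd, rd, imm. It follows the standard RISC-V encodings; the source does not describe how Veyron's decoder is built, and a real expander covers every compressed opcode, not just this one.

```python
# Minimal sketch: expand one RVC instruction, C.ADDI, into its 32-bit ADDI form.
# Only C.ADDI (quadrant 1, funct3 = 000) is handled; a real expander covers all RVC opcodes.

OP_IMM = 0b0010011  # major opcode shared by ADDI and the other I-type ALU ops


def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def expand_c_addi(c: int) -> int:
    """C.ADDI rd, nzimm  ->  ADDI rd, rd, nzimm (returned as a 32-bit encoding)."""
    if (c & 0b11) != 0b01 or ((c >> 13) & 0b111) != 0b000:
        raise ValueError(f"not a C.ADDI encoding: {c:#06x}")
    rd = (c >> 7) & 0x1F                                  # bits 11:7 hold rd/rs1
    imm6 = (((c >> 12) & 0x1) << 5) | ((c >> 2) & 0x1F)   # nzimm[5] = bit 12, nzimm[4:0] = bits 6:2
    imm12 = sign_extend(imm6, 6) & 0xFFF                  # sign-extend into the I-type immediate field
    return (imm12 << 20) | (rd << 15) | (0b000 << 12) | (rd << 7) | OP_IMM


print(hex(expand_c_addi(0x0505)))   # c.addi a0, 1 -> 0x150513 (addi a0, a0, 1)
```

The same pattern (extract fields, sign-extend, re-pack) applies to the other compressed formats; only the bit positions change.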

Core Microarchitecture Highlights

• Four symmetric integer execution pipes
  o Execute any mix of four register / load / store / branch ops per cycle
  o Int mul/div, pcnt, clmul, and CSR accesses execute via a separate shared execution unit
  o Large associated schedulers: 128-entry scheduling window in total
• Constant register loads pre-executed at dispatch
  o Effective zero-cycle latency and no back-end resources consumed
• Scalar FP execution pipe and int/FP transfer/conversion pipe (and associated schedulers)
• Cache and TLB hierarchies optimized for large code and data working sets, and for low latency
  o 512 KB Instruction L2 with power-efficient L0 cache/loop buffer
  o 64 KB Data L1 / 512 KB Data L2 closely coupled for low latency
  o Separate 3K+-entry main Instruction TLB and Data TLB (including caching clusters of similar PTEs)
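The source says only that constant register loads are pre-executed at dispatch, with effectively zero latency and no back-end resources consumed. One way to model the idea is the Python sketch below: ops that merely materialize a constant (LUI, or ADDI from x0, i.e. `li` with a small immediate) are resolved into a value table at dispatch and never enter a scheduler. The `Op` format and the `dispatch` function are hypothetical, and the slide does not say which instructions Veyron treats as constant loads.

```python
from dataclasses import dataclass
from typing import Optional

MASK64 = (1 << 64) - 1


@dataclass
class Op:
    """Simplified decoded op (hypothetical format for this sketch)."""
    name: str
    rd: int
    rs1: int = 0
    rs2: int = 0
    imm: int = 0     # signed immediate as decoded (20-bit field for LUI)


def constant_value(op: Op) -> Optional[int]:
    """Return the 64-bit result if `op` only materializes a constant, else None."""
    if op.name == "lui":
        # RV64 LUI: imm[19:0] goes to bits 31:12, then sign-extends from bit 31.
        value = (op.imm & 0xFFFFF) << 12
        if value & (1 << 31):
            value -= 1 << 32
        return value & MASK64
    if op.name == "addi" and op.rs1 == 0:
        # `li rd, imm` with a 12-bit immediate is ADDI rd, x0, imm.
        return op.imm & MASK64
    return None


def dispatch(ops: list[Op]) -> tuple[list[Op], dict[int, int]]:
    """Split ops into those needing a scheduler and constants resolved at dispatch."""
    constants: dict[int, int] = {}   # architectural register -> known value
    scheduled: list[Op] = []
    for op in ops:
        value = constant_value(op)
        if value is not None:
            if op.rd != 0:               # writes to x0 are discarded
                constants[op.rd] = value
            continue                     # no scheduler entry, no execution pipe
        constants.pop(op.rd, None)       # rd is now produced by the back end
        scheduled.append(op)
    return scheduled, constants


ops = [Op("lui", rd=5, imm=0x12345), Op("addi", rd=6, rs1=0, imm=-1), Op("add", rd=7, rs1=5, rs2=6)]
scheduled, constants = dispatch(ops)
print([op.name for op in scheduled], {r: hex(v) for r, v in constants.items()})
# ['add'] {5: '0x12345000', 6: '0xffffffffffffffff'}
```

Only the `add` reaches a scheduler here; in a real core its source operands would be read from the dispatch-time value table rather than waiting on an execution pipe.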
V1 CPU Pipelines

Restart Pipe:           RPS RP1 RP2 RP3
Predict Pipe:           PNI PRS PR1 PR2 PR3
Fetch Pipe:             QFS QX1 QX2 QT1 QT2 QD1 QD2
Decode Pipe:            DPD DXE DRN DDS
Int Execute Pipe:       IIS IOF IX1 --- --- --- IWB
                                IX1 IX2 IX3 IX4 IWB
Load/Store:             LS1 LS2 LS3 LS4 LS5; St Commit / Ld Commit: CST
FP Execute Pipe:        FWK FIS FOF FX1 FX2 --- --- FWB
                                    FX1 FX2 FX3 FX4 FWB
FP Data Transfer Pipe:  XWK XIS XOF XD1 XD2
Reg Op Retire Pipe:     ZDN ZRT
Ldst Op Retire Pipe:    ZDN RRT

Predict, Fetch, and Decode Units

• Predict fetch stream of sequential runs of instructions up to 64B long
  o Single-level 12K-entry BTB and similarly large collection of branch predictors
  o Fully pipelined, driven by single-cycle Next Lookup Predictor
  o Predicts lookup hashes and history updates
  o Three-cycle redirect on mispredict
• IL2 + ITLB (large single-level instruction cache and instruction TLB)
  o 512 KB IL2
  o Physical I/D partitioning allows separate I and D cache hierarchy optimizations for latency and power, and eliminates code/data conflicts on large-footprint workloads
  o Fully pipelined misaligned fetch of up to 64B per cycle
  o Two-cycle latency for overlapped ITLB, IL2 tag, and IL2 data accesses
• First instruction decode pipe stage does …
  o Decompress 16-bit ‘C’ instructions to equivalent 32-bit instructions
  o Pre-decode instruction length and find next 8 instruction boundaries
  o Pre-decode instruction pair fusion opportunities
  o Combine these results to set up the muxes that extract instructions from the instruction buffer
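The boundary-finding step can be sketched with the standard RISC-V length rule: a 16-bit parcel whose low two bits are 0b11 starts a 32-bit instruction; otherwise it is a 16-bit ‘C’ instruction. The Python sketch below scans a 64-byte fetch window from any 2-byte-aligned start (mirroring misaligned fetch), stops after eight instructions, and flags one commonly cited fusion idiom, LUI followed by ADDI to the same register. The choice of idiom and all function names are assumptions; the slides do not list which pairs Veyron fuses.

```python
# Hypothetical first-decode-stage model: find instruction boundaries in a
# 64-byte fetch window and flag one example fusion idiom (LUI + ADDI).
FETCH_BYTES = 64   # fetch width stated on the slide
MAX_DECODE = 8     # decode width stated on the slide


def instruction_length(parcel: int) -> int:
    """Length in bytes from the first 16-bit parcel (16/32-bit encodings only)."""
    # Low two bits 0b11 mean a 32-bit instruction; anything else is compressed.
    return 4 if (parcel & 0b11) == 0b11 else 2


def predecode(window: bytes, start: int = 0) -> list[tuple[int, int]]:
    """Return up to MAX_DECODE (byte_offset, length) pairs, beginning at `start`."""
    if len(window) != FETCH_BYTES or start % 2:
        raise ValueError("expected a 64-byte window and a 2-byte-aligned start")
    boundaries = []
    pos = start
    while pos + 2 <= FETCH_BYTES and len(boundaries) < MAX_DECODE:
        length = instruction_length(int.from_bytes(window[pos:pos + 2], "little"))
        if pos + length > FETCH_BYTES:
            break  # 32-bit instruction straddles the window edge; finish it next fetch
        boundaries.append((pos, length))
        pos += length
    return boundaries


def fusion_candidates(window: bytes, boundaries: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Flag adjacent `lui rd, hi` / `addi rd, rd, lo` pairs as fusible."""
    pairs = []
    for (p0, l0), (p1, l1) in zip(boundaries, boundaries[1:]):
        if l0 != 4 or l1 != 4:
            continue  # this example idiom uses two 32-bit instructions
        first = int.from_bytes(window[p0:p0 + 4], "little")
        second = int.from_bytes(window[p1:p1 + 4], "little")
        rd = (first >> 7) & 0x1F
        is_lui = (first & 0x7F) == 0b0110111
        is_addi = (second & 0x7F) == 0b0010011 and ((second >> 12) & 0x7) == 0
        same_reg = ((second >> 7) & 0x1F) == rd and ((second >> 15) & 0x1F) == rd
        if is_lui and is_addi and same_reg and rd != 0:
            pairs.append((p0, p1))
    return pairs
```

For example, a window that starts with `lui a0, 0x12345` (0x12345537), `addi a0, a0, 0x678` (0x67850513), and `c.addi a0, 1` (0x0505), padded with `c.nop`, yields boundaries at offsets 0, 4, 8, 10, ..., 18 and one fusion candidate, (0, 4). A 32-bit instruction that begins in the last parcel of the window is left for the next fetch.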

Load/Store Unit

• Can execute any mix of up to four loads and/or stores per cycle
• Closely coupled L1/L2 data cache hierarchy for low latency
• DL1
  o 64 KB virtual cache (VIVT)
  o Four-cycle load-to-use latency
  o Large single-level DTLB accessed on cache misses (on the way to DL2)
  o Hardware synonym handling: multiple read-only synonyms can be co-resident
  o Hardware coherent based on inclusion with respect to DL2
  o Hardware TLB-consistent with respect to TLB invalidates
• 512 KB DL2
  o Pipelined 64B-wide fills into DL1
• Hardware data prefetchers
  o Next line, sequential, strided, and multi-stride patterns
  o Prefetch next line from DL2 into DL1
  o Prefetch much farther ahead from L3/DRAM into DL2 as staging
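The slide names next-line, sequential, strided, and multi-stride prefetching without describing the detectors. As an illustration only, here is a textbook PC-indexed stride prefetcher in Python; it is a swapped-in classic technique, not Ventana's design. The 64-byte line size, the `degree` and `threshold` parameters, and the saturating confidence counter are all assumptions.

```python
from dataclasses import dataclass

LINE_BYTES = 64  # assumed cache-line size


@dataclass
class StrideEntry:
    last_addr: int
    stride: int = 0
    confidence: int = 0


class StridePrefetcher:
    """PC-indexed stride detector (classic reference-prediction-table style)."""

    def __init__(self, degree: int = 2, threshold: int = 2, max_confidence: int = 3):
        self.table: dict[int, StrideEntry] = {}   # load PC -> stride state
        self.degree = degree                      # how many strides ahead to prefetch
        self.threshold = threshold                # confirmations needed before prefetching
        self.max_confidence = max_confidence

    def access(self, pc: int, addr: int) -> list[int]:
        """Observe a demand load and return line addresses to prefetch."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = StrideEntry(last_addr=addr)
            return []
        stride = addr - entry.last_addr
        if stride != 0 and stride == entry.stride:
            entry.confidence = min(entry.confidence + 1, self.max_confidence)
        else:
            entry.stride, entry.confidence = stride, 0   # retrain on a new stride
        entry.last_addr = addr
        if entry.confidence < self.threshold:
            return []
        current_line = addr // LINE_BYTES
        lines = []
        for k in range(1, self.degree + 1):
            line = (addr + k * stride) // LINE_BYTES
            if line != current_line and line * LINE_BYTES not in lines:
                lines.append(line * LINE_BYTES)
        return lines


p = StridePrefetcher()
print([p.access(0x400, a) for a in (0, 256, 512, 768)])  # [[], [], [], [1024, 1280]]
```

The stride is confirmed twice by the fourth access, which then issues prefetches for the next two lines. A two-level arrangement like the one described could pair a short lookahead into DL1 with a much larger distance for staging into DL2; this sketch models only a single level.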

Processor Cluster Highlights

• Support for up to 16 cores
• Cluster-level shared L3 cache
  o Support for up to 48 MB
  o Victim cache with respect to the DL2s
  o Non-inclusive (exclusive except for selective shared code/data optimizations)
  o Advanced reuse-based and scan-resistant replacement policies
• N-way sliced L3 / snoop filter organization
  o Each slice responsible for 1/Nth of address space
  o “Core + L3/SF slice” physical building block
  o Per-core (non-shared) cluster-level snoop filters for IL2 and DL2 caches
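“Each slice responsible for 1/Nth of address space” implies a function from physical address to slice. Ventana does not disclose it, so the Python sketch below uses a hypothetical XOR-fold of the cache-line address (64-byte lines assumed) for the power-of-two slice counts the cluster supports. Because the low log2(N) bits of the line address enter the XOR directly, every aligned group of N consecutive lines covers all N slices exactly once.

```python
LINE_BITS = 6  # assumed 64-byte cache lines


def slice_for(paddr: int, num_slices: int) -> int:
    """Hypothetical slice hash: XOR-fold the line address into log2(N) bits."""
    if num_slices <= 0 or num_slices & (num_slices - 1):
        raise ValueError("num_slices must be a power of two")
    if num_slices == 1:
        return 0
    bits = num_slices.bit_length() - 1
    line = paddr >> LINE_BITS
    result = 0
    while line:
        result ^= line & (num_slices - 1)   # fold in the next log2(N)-bit chunk
        line >>= bits
    return result


def slice_histogram(num_slices: int, num_lines: int) -> list[int]:
    """Count how many consecutive cache lines land on each slice."""
    counts = [0] * num_slices
    for line in range(num_lines):
        counts[slice_for(line << LINE_BITS, num_slices)] += 1
    return counts


print(slice_histogram(16, 16 * 1000))  # each of the 16 slices receives 1000 lines
```

Folding the upper address bits in, rather than using only the low bits, is a common way to keep power-of-two strides from piling onto one slice; whether Veyron does something similar is not stated.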

Processor Cluster Highlights (cont.)

• Standard CHI-compatible external interface from cluster to SoC
  o Enables direct connection to 3rd party SoC interconnect IP
• Enhanced intra-cluster cache coherency protocol
  o Comparable to CHI, plus features to support various caching optimizations within a cluster
    o Exclusive / non-inclusive cache allocation
    o Data sharing
    o Enhanced L3 replacement policy
• Bidirectional “race track” interconnect topology
  o Equivalent to dual counter-rotating rings with ends cut off
  o Best PPA for up to 16 cores
  o 160 GB/s of bisection data bandwidth at 2.5 GHz
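Two pieces of arithmetic follow from this slide. Dual counter-rotating rings with the ends cut off behave like a bidirectional line of stops, so a message needs |src - dst| hops; and 160 GB/s at 2.5 GHz works out to 64 bytes per clock across the bisection. The Python sketch below computes both. Treating each of 16 “Core + L3/SF slice” tiles as one stop is an assumption, not something the slide states.

```python
def hops(src: int, dst: int) -> int:
    """Hop count on a bidirectional line of stops ("race track").

    Both directions are available and there is no wrap-around link, so a
    message simply travels toward its destination.
    """
    return abs(src - dst)


def average_hops(stops: int) -> float:
    """Mean hop count over all ordered (src, dst) pairs with src != dst."""
    total = sum(hops(s, d) for s in range(stops) for d in range(stops) if s != d)
    return total / (stops * (stops - 1))


def bytes_per_cycle(bandwidth_gb_s: float, clock_ghz: float) -> float:
    """Convert GB/s at a given clock into bytes per clock cycle."""
    return bandwidth_gb_s / clock_ghz


print(bytes_per_cycle(160, 2.5))   # 64.0
print(average_hops(16))            # assumes one stop per core + L3/SF slice
```

Under that assumption, the mean distance between two distinct stops on a 16-stop line is 17/3, about 5.7 hops; in general it is (n + 1) / 3 for n stops.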

Veyron V1: World’s First Server Class RISC-V Processor

• Highest Performance RISC-V CPU
  o 3.6GHz in 5nm process technology
• ASSP Based on High Performance Chiplet Architecture
  o Significant reduction in development time and cost compared to prevailing monolithic SoC model

[Chart: SPECint2017 per socket]
  Xeon® Ice Lake 8380, 270W TDP
  EPYC™ Milan 7763, 280W TDP
  AWS G3 Neoverse V1, TDP not disclosed
  Veyron V1-128C, 280W TDP

Disruptive ROI: Highest Single Socket Performance at Compelling Perf/Watt/$
Veyron V1 Reference Implementation PPA

[Figure: 16-Core Cluster with 48MB L3 (62.5 mm²); CPU Core (1.61 mm²), L3 3MB Slice (1.86 mm²), Fabric Slice (0.85 mm²)]

• TSMC 5nm
  o Standard TSMC 5nm metal stack
  o Width scales linearly with tiled dual core+L3 slice
  o Highly portable design across processes and foundries
• Veyron V1 cluster structure
  o Up to 16 cores with fixed private 512 KB IL2 and 64 KB DL1 / 512 KB DL2
  o Up to 48 MB shared L3, physically sliced per core
  o Configurable 2/4/8/16 Core+L3 slices, 3.0/1.5/0.75 MB L3 per core
  o 5-6 SPECint2017 @ 3.6GHz with 40W total cluster power
  o Excellent multi-core scalability with high bandwidth interconnect and large L3 cache
  o Dedicated core per thread provides superior multi-core performance compared to a large SMT2 cluster (equal threads, same area, twice the cores)
• Per-core power under max “TDP” workloads
  o <0.9W @ 2.4 GHz
  o 1.9W @ 3.2 GHz
• Active “Turbo” power management
  o Per cluster DVFS, per core DFS
  o Accurate digital power model for all components of cluster
  o Temp sensor coverage across entire cluster
  o Configurable TDP
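The configurable L3 sizes multiply out directly: total shared L3 is the number of Core+L3 slices times the L3 capacity per core, so the 16-core, 3.0 MB-per-core configuration gives the 48 MB maximum quoted above. The short Python sketch below enumerates the combinations; it assumes any slice count can be paired with any per-core size, which the slide does not state explicitly.

```python
SLICE_COUNTS = (2, 4, 8, 16)          # configurable Core+L3 slices
L3_MB_PER_CORE = (3.0, 1.5, 0.75)     # configurable L3 capacity per core


def l3_total_mb(cores: int, mb_per_core: float) -> float:
    """Shared L3 capacity when each Core+L3 slice contributes mb_per_core."""
    return cores * mb_per_core


# Assumption: every slice count can be paired with every per-core L3 size.
CONFIGS = {(c, m): l3_total_mb(c, m) for c in SLICE_COUNTS for m in L3_MB_PER_CORE}

for (cores, mb), total in sorted(CONFIGS.items()):
    print(f"{cores:2d} cores x {mb:4.2f} MB = {total:5.2f} MB L3")
```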
Thank You
