Benini ISC2023 Paving The Road For Riscv

The PULP Platform focuses on developing open-source hardware for RISC-V supercomputers, emphasizing efficient architecture through heterogeneous and parallel computing. It introduces the Snitch core, a versatile RISC-V processor designed for high energy efficiency and performance, particularly in machine learning and data-intensive workloads. The document discusses various architectural innovations, including memory management and instruction set extensions, aimed at overcoming traditional computing bottlenecks.

PULP PLATFORM

Open Source Hardware, the way it should be!

Paving the Road for RISC-V Supercomputers with Open Hardware

Luca Benini

http://pulp-platform.org  @pulp_platform  https://www.youtube.com/pulp_platform


Computing is Power Bound: HPC

[Chart: TOP500 list, 9/22: HPC performance under a fixed 20 MW power budget]

HPC performance: 10x every ~5 years
Computing is Power Bound: ML

[Chart: training compute of ML systems over time (Sevilla et al. '22, arXiv:2202.05924, epochai.org); GPT-4 (OpenAI '23) training compute: 2.1E+25 FLOP; largest datacenter: <150 MW]

AI training: 10x every year!!!
Machine learning (training): 10x every 2 years
Technology Scaling?

[Chart: node-to-node scaling trends (TSMC, ISSCC'21); at iso-area, power increases by 1.24x]

Energy efficiency, 1/(Power × Time): 10x every 12 years…
Efficient Architecture: Heterogeneous + Parallel

[Figure: a program alternates between "Decide" and "Compute" phases]
Heterogeneous + Parallel… Why?

Decide (jump to a different program part):
 Modulates the flow of instructions
 Mostly sequential decisions: don't work too much, be clever about the battles you pick (latency is king)
 Lots of decisions, little number crunching

Compute (plough through numbers):
 Modulates the flow of data
 Embarrassingly data-parallel: don't think too much, plough through the data (throughput is king)
 Few decisions, lots of number crunching

 Today's workloads are dominated by "Compute":
 Tons of data; few (as fast as possible) decisions based on the computed values
 "Data-oblivious algorithms" (ML, or better, DNNs are so!)
 Large data footprint + sparsity

How do we design an efficient "Compute" fabric?
Compute Efficiency: Data (…and Instruction) Movement is Key

[Figure: memory hierarchy: a PE (register file + core), a cluster (PEs + DMA sharing an L1 TCDM), and multiple clusters sharing the L2 main memory]

          L0: Operand Memory   L1: Tightly Coupled DM   L2: Main Memory
Latency   =1                   <10                      >100
Density   =1                   ≈10                      ≈100
Sharing   Private              Shared                   Shared, Remote
PE: Snitch, a Tiny RISC-V Core
A versatile building block

 Simplest core: around 20 kGE
 Speed via simplicity (1 GHz+)
 L0 I-cache/buffer for low-energy fetch
 Shared L1 for instruction reuse (SPMD)

 Extensible "accelerator" port
 Minimal baseline ISA (RISC-V)
 Extensibility: performance through ISA extensions via the accelerator port (FP, vector, stencil, ML/tensor)

 Latency-tolerant scoreboard
 Tracks instruction dependencies
 Much simpler than OoO support!

F. Zaruba, F. Schuiki, T. Hoefler and L. Benini, "Snitch: A Tiny Pseudo Dual-Issue Processor for Area and Energy Efficient Execution of Floating-Point Intensive Workloads," IEEE Transactions on Computers, vol. 70, no. 11, pp. 1845-1860, Nov. 2021.
Snitch PE: ISA Extensions for Efficient "Compute"

 How can we remove the Von Neumann bottleneck?
 Targeting "compute" code

double sum = 0;
for (int i = 0; i < N; ++i) {
    sum += A[i] * B[i];
}

Per-iteration instruction stream and energy cost:

    fld     ft0, 0(a1)             70 pJ
    fld     ft1, 0(a2)             70 pJ
    addi    a1, a1, 8              50 pJ
    addi    a2, a2, 8              50 pJ
    fmadd.d fa0, ft0, ft1, fa0     80 pJ
    bne     a1, a3, -5             50 pJ

Memory access, operation, iteration control – can we do better?

Note: memory accesses take >1 cycle even for L1  we need latency tolerance for LD/ST.
Stream Semantic Registers (SSRs)
LD/ST elision

 Intuition: high FPU utilization ≈ high energy efficiency

 Idea: turn register reads/writes into implicit memory loads/stores
 Extension around the core's register file
 Address-generation hardware

 Increases FPU/ALU utilization by ~3x, up to 100%

 SSRs ≠ memory operands
 Perfect prefetching, latency-tolerant
 1-3 SSRs (2-3 kGE per SSR)
Floating-Point Repetition Buffer (FREP)
Remove control-flow overhead from the compute stream

 Programmable micro-loop buffer
 A sequencer steps through the buffer, independently of the FPU
 The integer core is free to operate in parallel: pseudo-dual issue
 High area and energy efficiency

A combined SSR + FREP sketch for the dot product follows below.
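A minimal sketch of how the dot-product loop from the earlier slide collapses once SSRs supply the operands and FREP repeats the FMA. The snrt_ssr_* prototypes and the frep.o operand layout are assumptions modeled on the public Snitch runtime and ISA documentation, not taken from this deck; treat it as illustrative code for the Snitch toolchain only.

    #include <stdint.h>

    /* Assumed Snitch-runtime-style helpers (names and signatures are
       illustrative): program a 1D address pattern into an SSR data mover
       and bind it to a read stream. */
    void snrt_ssr_loop_1d(int dm, uint32_t n, uint32_t stride_bytes);
    void snrt_ssr_read(int dm, int dim, const void *addr);
    void snrt_ssr_enable(void);
    void snrt_ssr_disable(void);

    double dotp_ssr_frep(const double *a, const double *b, uint32_t n) {
        /* Stream A[i] into ft0 (data mover 0) and B[i] into ft1 (data mover 1). */
        snrt_ssr_loop_1d(0, n, sizeof(double));
        snrt_ssr_loop_1d(1, n, sizeof(double));
        snrt_ssr_read(0, 0, a);
        snrt_ssr_read(1, 0, b);

        register double sum asm("fa0") = 0.0;
        snrt_ssr_enable();
        /* frep.o repeats the following single instruction n times; every read
           of ft0/ft1 is an implicit load, so no fld/addi/bne remain in the loop. */
        asm volatile(
            "frep.o %[rep], 1, 0, 0          \n"
            "fmadd.d %[sum], ft0, ft1, %[sum]\n"
            : [sum] "+f"(sum)
            : [rep] "r"(n - 1)
            : "ft0", "ft1", "memory");
        snrt_ssr_disable();
        return sum;
    }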
RISC-V ISA Extensions for the Target Workload
Mixed precision, quantization: inference ≠ training

Efficient DNN inference & training:
 Inference: INT8 quantization is SoA
 Training: high dynamic range is needed for weights and weight updates
 fp32 is still standard for DNN training workloads; low-precision training uses bf16 and fp8

Support a wide variety of FP formats and instructions:
 Standard: fp64, fp32, fp16, bf16
 Low precision: fp8, altfp8
 fp8 (1-4-3): forward prop.
 altfp8 (1-5-2): backward prop.
 Expanding ops: accumulation
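Why two fp8 flavors: 1-4-3 spends its bits on mantissa (precision for the forward pass), while 1-5-2 trades mantissa for exponent range (gradients span many orders of magnitude). Assuming plain IEEE-754-style encodings with no reserved values, the largest normal magnitudes work out to:

$$\text{fp8 (1-4-3)}: \;(2 - 2^{-3}) \cdot 2^{15-7} = 480 \qquad \text{altfp8 (1-5-2)}: \;(2 - 2^{-2}) \cdot 2^{30-15} = 57344$$

so altfp8 covers roughly 120x more dynamic range at the cost of one mantissa bit.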
Cascade of EXFMAs vs. EXSDOTP

[Figure: a cascade of two expanding FMAs, DOTP = A*B + (C*D + E), vs. a fused expanding sum-of-dot-products unit, DOTP = A*B + C*D + E]

 Cascaded FMAs: non-distributive FP addition  precision loss
 Fused EXSDOTP (i.e., lossless):
 Single normalization and rounding step
 Smaller area and shorter critical path
 Product bypass to compute the fused three-term addition (vector inner sum)
 Stochastic rounding supported (+3% area)
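A host-side illustration of the precision-loss point (not the hardware unit itself): with cascaded FMAs the inner sum is rounded before the outer addition, whereas a single-rounding fused three-term sum keeps the small term. The single rounding is emulated here in standard C by using double as the "wide" accumulator.

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        float a = -1.0f, b = 1.0f;     /* a*b = -1 */
        float c = 0x1p-25f, d = 1.0f;  /* c*d = 2^-25, below fp32's 24-bit mantissa at magnitude 1 */
        float e = 1.0f;

        /* Cascade of FMAs: the inner fmaf rounds 1 + 2^-25 back to 1.0f, losing the small term. */
        float cascaded = fmaf(a, b, fmaf(c, d, e));

        /* Single-rounding three-term sum (EXSDOTP-style), emulated in wider precision. */
        float fused = (float)((double)a * b + (double)c * d + (double)e);

        printf("cascaded = %a\n", cascaded); /* 0x0p+0   (the 2^-25 vanished) */
        printf("fused    = %a\n", fused);    /* 0x1p-25  (kept by the single rounding) */
        return 0;
    }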
What About Sparsity? The Indirect SSR (ISSR) Streamer

 Based on the existing 3-SSR streamer:
 1. Extend two SSRs to ISSRs
 2. Add an index-comparison unit between the ISSRs
 3. Forward the resulting indices to the third SSR

[Figure: ISSR 0 and ISSR 1 fetch index and data streams from the TCDM scratchpad; the index comparator matches/merges indices, feeding ft0/ft1 and driving SSR 2 (ft2) with the matched/merged indices; instructions are offloaded to the FPU subsystem]
 Control interface to the FPU sequencer (frep.s)
 The result index count is unknown ahead of time
 Enables general sparse-sparse linear algebra on fibers:
 dotp: index match + fmadd
 vadd: index merge + fadd
 elem-mul: index match + fmul
 vec-mac: index merge + fmadd
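A plain-C reference for the "index match + fmadd" case: this is the intersection the ISSR index comparator resolves in hardware, while the FPU only ever sees a dense stream of matched operands. The 16-bit index type follows the slide; everything else is an illustrative sketch.

    #include <stdint.h>

    /* Sparse-sparse dot product over two fibers in sorted coordinate format:
       walk both index lists, multiply-accumulate only where indices match. */
    double sparse_dotp(const uint16_t *idx_a, const double *val_a, int nnz_a,
                       const uint16_t *idx_b, const double *val_b, int nnz_b) {
        double sum = 0.0;
        int i = 0, j = 0;
        while (i < nnz_a && j < nnz_b) {
            if (idx_a[i] == idx_b[j])
                sum += val_a[i++] * val_b[j++];  /* index match -> fmadd */
            else if (idx_a[i] < idx_b[j])
                i++;                             /* advance the lagging fiber */
            else
                j++;
        }
        return sum;
    }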
ISSR Performance Benefits

[Charts: SpVV and VTI FP utilization vs. sparse nonzeros / timesteps]

 Notable single-core speedups over the RV baseline (16-bit indices):
 CsrMV: up to 7.0x faster, 79% FP utilization
 SpV+SpV: up to 9.8x faster, higher FP utilization
 SpV·SpV: up to 7.7x faster, higher FP utilization
 VTI (3D stencil code): up to 2.9x faster, 78% FP utilization

 Significant benefits in a multicore cluster (16-bit indices):
 CsrMV: up to 5.0x faster, 2.9x less energy
 CsrMSpV: up to 5.8x faster, 3.0x less energy
 VTI: up to 2.7x faster

 Notably higher peak FP utilization than SoA CPUs (69x) and GPUs (2.8x) on CsrMV
ISSR Performance on Stencils

[Charts: per-kernel ISSR speedup, FP utilization, and IPC for the baseline vs. ISSR kernels]

 Various 2D/3D stencils on an 8-worker-core cluster
 FP64, 64²/16³ grid chunks, up to 4x unroll
 Tuned LLVM RV32G baseline vs. ISSR-enhanced kernels

 Geomean 2.7x speedup, 82% FP utilization
 ISSR IPC consistently >1, as ISSRs enable pseudo-dual issue

 Baseline performance degrades for large (3D) stencils:
 Cannot maintain unrolling and keep reusable inner-loop data in the register file
 ISSR streams avoid this bottleneck: 2.5x (2D)  3.2x (3D) geomean speedup
The Efficient PE (Snitch) Architecture in Perspective

1. Minimize control overhead  simple, shallow pipelines
2. Reduce the VNB  amortize instruction fetch: SSR + FREP + SIMD (vector processing)
3. Hide memory latency  non-blocking (indexed) LD/ST + dependency tracking
4. Highly expressive, domain-specific instruction extensions (thanks, RISC-V!)
Compute Efficiency: the Cluster (PEs + On-chip TCDM)

[Figure: the memory hierarchy from before, now focusing on the cluster level: PEs and a DMA engine sharing the L1 tightly coupled data memory (latency <10, density ≈10, shared), between the private L0 register files and the remote L2 main memory]
The Cluster: Design Challenges

[Figure: cluster with a multi-banked tightly coupled data memory connected to the RISC-V cores through a logarithmic interconnect]

 Efficient PEs
 Hide the TCDM's "residual" latency
 Remove the Von Neumann bottleneck

 Low-latency access to the TCDM
 Multi-banked architecture
 Fast logarithmic interconnect

 Fast synchronization
 Atomics
 Barriers
High-Speed Logarithmic Interconnect
Do not underestimate on-chip wires…

[Figure: processors P1-P4 reach memory banks B1-B8 through a routing tree and an arbitration tree]

Word-level bank interleaving «emulates» a multiported memory.

@1 GHz, 8-16 PEs, latency: 2 cycles + stalls on banking conflicts.
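For intuition, word-level interleaving maps consecutive words to consecutive banks, so unit-stride streams from the PEs fan out across all banks instead of serializing on one. A minimal sketch (the bank count and word size are illustrative):

    #define N_BANKS    32u  /* illustrative bank count */
    #define WORD_BYTES 8u   /* 64-bit words */

    /* Consecutive words land in consecutive banks (round-robin), which is
       what makes the multi-banked TCDM behave like a multiported memory
       for unit-stride accesses. */
    static inline unsigned bank_of(unsigned long addr) {
        return (addr / WORD_BYTES) % N_BANKS;
    }
    static inline unsigned row_in_bank(unsigned long addr) {
        return (addr / WORD_BYTES) / N_BANKS;
    }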
Efficient Explicit Global Data Mover
Hide the L2/main-memory latency

[Figure: Snitch core with a DMA decoder extension driving a 2D DMA backend]

 64-bit AXI DMA, explicit double-buffered transfers: better than a data cache
 Tightly coupled with Snitch (<10 cycles to configure)
 Operates on a wide 512-bit data bus
 Hardware support for copying 2- to 4-dimensional shapes
 Higher dimensionality handled by SW
 Intrinsics/library for easy programming
 Domain-specific autotilers
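The double-buffering pattern these transfers enable, as a sketch: the DMA fills one L1 tile while the cores compute on the other. dma_copy_async, dma_wait, and compute_on_tile are hypothetical stand-ins, not the actual intrinsics from the library mentioned above.

    #include <stddef.h>

    /* Hypothetical DMA intrinsics (stand-ins for the real library). */
    typedef int dma_txid_t;
    dma_txid_t dma_copy_async(void *dst, const void *src, size_t bytes);
    void dma_wait(dma_txid_t id);
    void compute_on_tile(double *tile, size_t len);

    /* Double buffering: overlap the DMA-in of tile t+1 with compute on tile t. */
    void run_tiles(double *l1_buf[2], const double *l2_in,
                   size_t n_tiles, size_t tile_len) {
        dma_txid_t pending =
            dma_copy_async(l1_buf[0], l2_in, tile_len * sizeof(double));
        for (size_t t = 0; t < n_tiles; t++) {
            dma_wait(pending);           /* tile t has arrived in L1 */
            if (t + 1 < n_tiles)         /* prefetch tile t+1 into the other buffer */
                pending = dma_copy_async(l1_buf[(t + 1) % 2],
                                         l2_in + (t + 1) * tile_len,
                                         tile_len * sizeof(double));
            compute_on_tile(l1_buf[t % 2], tile_len); /* FPUs run while the DMA streams */
        }
    }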
Snitch Cluster Architecture

[Figure: Snitch cores 0…N, each with an FPU and SSRs, plus a shared DIV unit and a DMA core (Snitch N+1), connected through a logarithmic interconnect to a multi-banked L1 (banking factor > 1) and served by a shared instruction cache]
Where Does the Energy Go?

Power breakdown of an 8-core cluster (pie-chart values, with shares):

FPU            87.44   (~50%)
L1 memory      47.19   (~26%)
Miscellaneous  25.26   (~14%)
SSR/FREP        9.52   (~5%)
ICACHE          4.82   (~3%)
Integer core    4.24   (~2%)

Local memory is inevitable (cf. a GPU's L1 cache or vector register file).
Spending energy where it contributes to the result  high efficiency.
The Efficient Cluster Architecture in Perspective

1. Memory pool: efficient sharing of L1 memory
2. Fast and parsimonious synchronization
3. Data mover + double buffering: explicitly managed block transfers at the boundary
4. More cores and more memory per cluster… that would be nice!
Back to the Cluster… Can We Make It Bigger?

 Why?
 Better global latency tolerance if L1size > 2 × L2latency × L2bandwidth (Little's law + double buffering; worked numbers below)
 Easier to program (data-parallel, functional pipeline…)
 Smaller data-partitioning overhead

 Goal: an efficient many-core cluster with a low-latency shared L1
 256+ cores
 1+ MiB of shared L1 data memory
 ≤10 cycles of L1 latency (without contention)

 Physical-aware design: MemPool
 Worst-case frequency > 500 MHz
 Targeting iso-frequency with a small cluster
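Plugging in the numbers quoted later for Terapool (hiding 500 ns of main-memory latency at 4 TB/s of bandwidth), the double-buffered sizing rule gives:

$$L1_{\text{size}} \ge 2 \times 500\,\text{ns} \times 4\,\text{TB/s} = 4\,\text{MB},$$

which matches Terapool's 4 MB of shared L1.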
Hierarchical Physical Architecture

Tile  Group  Cluster:

                 Tile            Group      Cluster
Cores            4 (32-bit)      64         256
Banks            16              256        1024 (1 MiB of memory)
Memory access    single cycle    3 cycles   5 cycles

[Figure: a MemPool tile (4 cores with L0 I$, 16 scratchpad banks, local interconnect, shared L1 instruction cache); 16 tiles per group with local, north, northeast, and east ports; 4 groups (tiles 0-15, 16-31, 32-47, 48-63) forming the cluster]

TopH: butterfly multi-stage interconnect, 0.3 req/core/cycle.
Can We Push It Further? MemPool  Terapool

[Figure: Terapool hierarchy with 3-, 5-, and 9-cycle access latencies]

 GF12, 0.8 V, 16 Snitch cores per tile, multi-stage interconnect, 0.23 req/core/cycle
 1024 cores, 4 MB of L1 in 4096 banks!
 69 mm², 3.8 W, 900 MHz, 0.6 TOPS (MMUL)  @5 nm: 23 mm², 2.2 W, 1.2 GHz, 1 TOPS
 4 MB can hide a latency of 500 ns at a bandwidth of 4 TB/s
 …need more?  Terapool-3D
Compute Efficiency: the Chip(let) (Clusters + Off-die Memory)

[Figure: the memory hierarchy from before, now focusing on the top level: clusters and their DMAs sharing the L2 main memory (latency >100, density ≈100, shared and remote)]
Occamy: RISC-V Goes HPC Chiplet!

[Figure: Occamy chiplet top level. A 64-bit CVA6 host runs Linux and owns the peripherals (SPI, I2C, UART, GPIO, timers, peripheral manager; <1% of traffic). A system-level DMA handles long & short bursts with 1D & 2D patterns. Two 8 GB/s off-die serial links and a 64 GB/s die-to-die serial link, a simplified 512-bit system crossbar (512 GB/s, 384 GB/s group-to-group), 512 KB + 1 MB SPM, ZeroMem, and an HBM2e PHY with <410 GB/s to 8 GB of HBM2e DRAM]

Multi-cluster, multi-core accelerator:
 6 groups of 4 clusters each; each cluster has 8 compute cores + 1 DMA core
 A total of 216 Snitch cores @ 1 GHz with multi-precision FPUs (64- down to 8-bit)
Occamy NoC: Efficient and Flexible Data Movement

[Figure: clusters under group crossbars, group crossbars under the system crossbar, which connects to HBM and the die-to-die link]

 Problem: HBM accesses are critical in terms of
 Access energy
 Congestion
 High latency

 Instead, reuse data at the lower levels of the memory hierarchy:
 Between clusters
 Across groups

 Smartly distribute the workload:
 Clusters: tiling, depth-first execution
 Chiplets: e.g., layer pipelining

Big trend!
High-Performance, General-Purpose

Our scalable architecture is general-purpose and high-performance.

Peak chiplet performance @ 1 GHz:
 FP64: 384 GFLOP/s
 FP32: 768 GFLOP/s
 FP16: 1.536 TFLOP/s
 FP8: 3.072 TFLOP/s

Preliminary measured results:
 Dense kernels:
 GEMMs: ≥80% FPU utilization (also for SIMD MiniFloat)
 Conv2d: ≥75% FPU utilization (also for SIMD MiniFloat)
 Stencil kernels: ≤60% FPU utilization
 Sparse kernels: ≤50% FPU utilization

[Figure: 10.5 mm × 7.0 mm chiplet floorplan: six cluster groups, CVA6, SPMs, HBM controller, die-to-die links]

Chiplet taped out: 1st July '22
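The FP64 figure follows from the core counts two slides back, assuming one FMA (2 FLOPs) per FPU per cycle; each halving of precision then doubles the SIMD throughput:

$$6 \times 4 \times 8 = 192\ \text{compute FPUs}, \qquad 192 \times 2\,\text{FLOP} \times 1\,\text{GHz} = 384\ \text{GFLOP/s}.$$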
Silicon Interposer: Hedwig (65 nm, passive, GF)

Taped out: 15th of October 2022

[Figure: interposer layout with Occamy 0/1 and HBM DRAM 0/1 in an interlocked arrangement]

 Interlocked die arrangement
 Prevents bending, increases stability
 Compact die arrangement
 No dummy dies or stitching needed

 Fairly low I/O pin count, since there is no high-bandwidth periphery
 Off-package connectivity: ~200 wires
 Array of 40 × 35 (-1) C4s (1,399 C4 bumps in total)
 Diameter: 400 µm, pitch: 650 µm
 Die-to-die: ~600 wires
 HBM: ~1700 wires
Approaching 1 T(DP)FLOP/s

Dual-chiplet Occamy system:
 430+ RV cores
 0.8 T DP-FLOP/s (no overclocking)
 32 GB of HBM2e DRAM
 Low tens of watts (est.)

Aggressive 2.5D integration; carrier PCB:
 RO4350B (low-CTE, high stability)
 52.5 mm × 45 mm

Industry partners are key (thanks)!
Programming Occamy: DaCe
A highly expressive DSL family: high-level transformations, support for explicitly managed memory

DaCeML: Data-Centric Machine Learning
 Deep learning models (BERT, YOLOv5, …) enter through the DaCeML frontend
 DaCe, the Data-Centric Parallel Programming framework, targets x86, FPGA, CUDA, and RISC-V
 A library of optimized deep-learning kernels: GEMM, convolution, LayerNorm, Softmax, BatchNorm, …
 On RISC-V, the kernels map onto SSR, FREP, and the DMA
The Efficient Chiplet Architecture in Perspective

1. Multi-cluster single-die scaling  strong latency tolerance, modularity
2. NoC for flexible cluster-to-cluster, cluster-to-memory, and chip-to-chip traffic  reduced pressure on main memory
3. Top-level NoC routes to "local main memory" / "global main memory" with balanced bandwidth
4. Modular chiplet architecture: HBM2e, NoC-wrapped C2C links, multi-chiplet ready
System Level: Monte Cimone, the First RISC-V Cluster

Designed for HPC "pipe cleaning".
Preparing for Occamy: Accelerator on PCIe Cards

 Currently using an FPGA-mapped "tiny Occamy"
 VCU128 with HBM

 Supporting hybrid usage:
 Boot directly on the standalone CVA6, or
 Don't boot, and let the host control the cluster
 HW probing via on-board device-tree overlays

 High SW-stack reusability across both modes:
 Same Linux drivers to map the cluster
 Same OpenMP offloading runtime
Conclusion
[AMD Naffziger, ISCAS'22] [RIKEN Matsuoka, MODSIM'22]

 The energy-efficiency quest spans PE, cluster, SoC, and system

 Key ideas:
 Deep PE optimization  extensible ISAs (RISC-V!)
 VNB removal + latency hiding: large OoO processors are not needed
 Low-overhead work distribution and latency hiding  a large "MemPool"
 Heterogeneous architecture: host + accelerator(s)

 Game-changing technologies:
 "Commoditized" chiplets: 2.5D, 3D
 Computing "at" memory (DRAM MemPool)
 Coming: optical I/O, smart NICs, switches

 Challenges:
 A high-performance RV host?
 The RV HPC software ecosystem?
Want to use the stuff? You can!
Free, open source, with a liberal (Apache) license!

Luca Benini, Alessandro Capotondi, Alessandro Ottaviano, Alessandro Nadalini, Alessio Burrello, Alfio Di Mauro, Andrea Borghesi, Andrea Cossettini, Andreas Kurth, Angelo Garofalo, Antonio Pullini, Arpan Prasad, Bjoern Forsberg, Corrado Bonfanti, Cristian Cioflan, Daniele Palossi, Davide Rossi, Davide Nadalini, Fabio Montagna, Florian Glaser, Florian Zaruba, Francesco Conti, Frank K. Gürkaynak, Georg Rutishauser, Germain Haugou, Gianna Paulin, Gianmarco Ottavi, Giuseppe Tagliavini, Hanna Müller, Lorenzo Lamberti, Luca Bertaccini, Luca Valente, Luca Colagrande, Luka Macan, Manuel Eggimann, Manuele Rusci, Marco Guermandi, Marcello Zanghieri, Matheus Cavalcante, Matteo Perotti, Matteo Spallanzani, Mattia Sinigaglia, Michael Rogenmoser, Moritz Scherer, Moritz Schneider, Nazareno Bruschi, Nils Wistoff, Pasquale Davide Schiavone, Paul Scheffler, Philipp Mayer, Robert Balas, Samuel Riedel, Sergio Mazzola, Sergei Vostrikov, Simone Benatti, Stefan Mach, Thomas Benz, Thorir Ingolfsson, Tim Fischer, Victor Javier Kartsch Morinigo, Vlad Niculescu, Xiaying Wang, Yichao Zhang, Yvan Tortorella, all our past collaborators, and many more that we forgot to mention.

http://pulp-platform.org  @pulp_platform