
ECE408/CS483/CSE408 Spring 2023

Applied Parallel Programming

Lecture 19
Parallel Sparse Methods

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
ECE408/CS483/ University of Illinois at Urbana-Champaign
Course Reminders
• MP5.2 is due this week
• MP 6 is out
– Due on April 14th
• Project Milestone 2: Baseline Convolution Kernel
– Due next week
• Note the day/time of midterm 2
– May 2nd evening

Objective
• To learn the key techniques for compacting input data in parallel sparse methods, reducing memory bandwidth consumption
  – better utilization of on-chip memory
  – fewer bytes transferred to on-chip memory
  – better utilization of global memory
  – challenge: retaining regularity

Sparse Matrix
• Many real-world systems are sparse in nature
– Linear systems described as sparse matrices
• Solving sparse linear systems
  – Iterative conjugate-gradient solvers based on sparse matrix-vector multiplication are a common method
• Solutions of PDE systems can be formulated as linear operations expressed as sparse matrix-vector multiplication

Sparse Matrix in Analytics and AI
• Recommender systems: predict missing ratings; group similar users/items (matrix factorization)
• Natural language processing: latent semantic models; word embeddings as input to DNNs
• Complex networks: link prediction; vertex clustering
• Deep learning: model compression; embedding layers
• Web search: matching queries and documents
• Tensor decomposition: in machine learning and HPC applications

[Figure: matrix factorization of an m-users-by-n-items ratings matrix, R ≈ X · ΘT]
Sparse Matrix in Scientific Computing
Science Area / Number of Teams / Codes / methods used (Struct Grids, Unstruct Grids, Dense Matrix, Sparse Matrix, N-Body, Monte Carlo, FFT, PIC, Sig I/O):
Climate and Weather 3 CESM, GCRM, CM1/WRF, HOMME X X X X X
Plasmas/Magnetosphere 2 H3D(M), VPIC, OSIRIS, Magtail/UPIC X X X X
Stellar Atmospheres and Supernovae 5 PPM, MAESTRO, CASTRO, SEDONA, ChaNGa, MS-FLUKSS X X X X X X
Cosmology 2 Enzo, pGADGET X X X
Combustion/Turbulence 2 PSDNS, DISTUF X X
General Relativity 2 Cactus, Harm3D, LazEV X X
Molecular Dynamics 4 AMBER, Gromacs, NAMD, LAMMPS X X X
Quantum Chemistry 2 SIAL, GAMESS, NWChem X X X X X
Material Science 3 NEMOS, OMEN, GW, QMCPACK X X X X
Earthquakes/Seismology 2 AWP-ODC, HERCULES, PLSQR, SPECFEM3D X X X X
Quantum Chromodynamics 1 Chroma, MILC, USQCD X X X
Social Networks 1 EPISIMDEMICS
Evolution 1 Eve
Engineering/System of Systems 1 GRIPS, Revisit X
Computer Science 1 X X X X X
Sparse Matrix-Vector Multiplication (SpMV)

[Figure: sparse matrix A times dense vector X produces dense vector Y]

Challenges
• Compared to dense matrix multiplication, SpMV
– Is irregular/unstructured
– Has little input data reuse
– Benefits little from compiler transformation tools

• Key to maximal performance
  – Maximize regularity (by reducing divergence and load imbalance)
  – Maximize DRAM burst utilization (layout arrangement)

A Simple Parallel SpMV

Row 0 3 0 1 0 Thread 0
Row 1 0 0 0 0 Thread 1
Row 2 0 2 4 1 Thread 2
Row 3 1 0 0 1 Thread 3

• Each thread processes one row

Compressed Sparse Row (CSR) Format
CSR Representation
Row 0 Row 2 Row 3
Nonzero values data[7] { 3, 1, 2, 4, 1, 1, 1 }
Column indices col_index[7] { 0, 2, 1, 2, 3, 0, 3 }
Row Pointers row_ptr[5] { 0, 2, 2, 5, 7 }

Dense representation

Row 0 3 0 1 0 Thread 0
Row 1 0 0 0 0 Thread 1
Row 2 0 2 4 1 Thread 2
Row 3 1 0 0 1 Thread 3
CSR Data Layout

row_ptr 0 2 2 5 7

data 3 1 2 4 1 1 1

col_index 0 2 1 2 3 0 3

CSR Kernel Design

[Figure: each thread reads row_ptr to find its row's extent, then computes a dot product of that row with the vector x]

A Parallel SpMV/CSR Kernel (CUDA)
__global__ void SpMV_CSR(int num_rows, float *data, int *col_index,
                         int *row_ptr, float *x, float *y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        float dot = 0;
        int row_start = row_ptr[row];
        int row_end = row_ptr[row + 1];
        for (int elem = row_start; elem < row_end; elem++)
            dot += data[elem] * x[col_index[elem]];
        y[row] = dot;
    }
}

Row 0 Row 2 Row 3


Nonzero values data[7] { 3, 1, 2, 4, 1, 1, 1 }
Column indices col_index[7] { 0, 2, 1, 2, 3, 0, 3 }
Row Pointers row_ptr[5] { 0, 2, 2, 5, 7 }
CSR Kernel Control Divergence

• Threads execute different number of iterations

row_ptr 0 2 2 5 7

data 3 1 2 4 1 1 1

col_index 0 2 1 2 3 0 3

CSR Kernel Memory Divergence
(Uncoalesced Accesses)
• Adjacent threads access non-adjacent memory locations
  – In iteration 0, the active threads touch data[0], data[2], and data[5] (the first elements of rows 0, 2, and 3): these are not contiguous, so the accesses cannot coalesce

row_ptr 0 2 2 5 7

data 3 1 2 4 1 1 1

col_index 0 2 1 2 3 0 3

Regularizing SpMV with ELL(PACK) Format

Thread 0

Thread 1

Thread 2

Thread 3
3 1 *

* * *
3 * 2 1
2 4 1
1 * 4 1
1 1 *
* * 1 *
CSR with Padding
Transposed
• Pad all rows to the same length
– Inefficient if a few rows are much longer than others
• Transpose (Column Major) for DRAM efficiency
• Both data and col_index padded/transposed
ELL Kernel Design

A in ELL format (transposed, * = padding):
  3 * 2 1
  1 * 4 1
  * * 1 *

A Parallel SpMV/ELL Kernel (CUDA)

__global__ void SpMV_ELL(int num_rows, float *data, int *col_index,
                         int num_elem, float *x, float *y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        float dot = 0;
        for (int i = 0; i < num_elem; i++)
            dot += data[row + i * num_rows] * x[col_index[row + i * num_rows]];
        y[row] = dot;
    }
}

Memory Coalescing with ELL

Transposed ELL layout (* = padding), one thread per row:
  3 * 2 1   <- iteration 0 (threads 0-3)
  1 * 4 1   <- iteration 1
  * * 1 *   <- iteration 2

data      3 * 2 1 | 1 * 4 1 | * * 1 *
col_index 0 0 1 0 | 2 1 2 3 | 3 2 3 1

In each iteration, threads 0-3 read four consecutive elements: a coalesced access.
Coordinate (COO) format

• Explicitly list the column and row indices for every non-zero element

Row 0 Row 2 Row 3


Nonzero values data[7] { 3, 1, 2, 4, 1, 1, 1 }
Column indices col_index[7] { 0, 2, 1, 2, 3, 0, 3 }
Row indices row_index[7] { 0, 0, 2, 2, 2, 3, 3 }

COO Allows Reordering of Elements

Row 0 Row 2 Row 3


Nonzero values data[7] { 3, 1, 2, 4, 1, 1, 1 }
Column indices col_index[7] { 0, 2, 1, 2, 3, 0, 3 }
Row indices row_index[7] { 0, 0, 2, 2, 2, 3, 3 }

Nonzero values data[7] { 1, 1, 2, 4, 3, 1, 1 }
Column indices col_index[7] { 0, 2, 1, 2, 0, 3, 3 }
Row indices row_index[7] { 3, 0, 2, 2, 0, 2, 3 }

COO Kernel

for (int i = 0; i < num_elem; i++)
    y[row_index[i]] += data[i] * x[col_index[i]];

A sequential loop that implements SpMV/COO (y must be zero-initialized first).

COO Kernel Design
Accessing the Input Matrix and Vector (Ax = v)
• Every thread can read its matrix element directly and use the corresponding col_index to gather from the input vector x
• Maximal parallelism for the reads

[Figure: A in COO storage format (Row, Col, Data arrays)]


COO Kernel Design
Accumulating into the Output Vector (Ax = v)
• Each thread uses the row_index of its element to accumulate into one of the output y elements
• Need atomic operations!
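A parallel version assigns one element per thread and resolves the conflicting updates with atomicAdd. This kernel is not in the slides; it is a sketch following the same conventions as the CSR and ELL kernels above, and it assumes y has been zeroed before launch:

```cuda
__global__ void SpMV_COO(int num_elem, float *data, int *col_index,
                         int *row_index, float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < num_elem)
        // Multiple threads may target the same output row,
        // so the accumulation must be a single atomic read-modify-write
        atomicAdd(&y[row_index[i]], data[i] * x[col_index[i]]);
}
```

The atomics serialize updates to the same row, which is why COO is usually reserved for the few exceptional elements (see the hybrid format below) rather than the whole matrix.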



Hybrid Format (ELL + COO)
• ELL handles the typical entries
• COO handles the exceptional entries
  – Implemented with segmented reduction on the GPU; in practice, the COO part is often processed in sequential host code

Reduced Padding with Hybrid Format
Row-major padded layout before transposition (ELL width 2, * = padding):
            data   col_index
  Thread 0: 3 1    0 2
  Thread 1: * *    * *
  Thread 2: 2 4    1 2
  Thread 3: 1 1    0 3

ELL part (transposed):
  data      3 * 2 1 | 1 * 4 1
  col_index 0 * 1 0 | 2 * 2 3

COO part (the overflow element of row 2):
  data { 1 }, col_index { 3 }, row_index { 2 }
Any more questions? Read Chapter 10.
Problem Solving
• Q: Given matrix A, which of the following are correct?

CSR representation:
  Data    = [1,2,1,1,2,3,4,1,1]
  Col_idx = [0,2,3,0,1,2,3,0,3]
  Row_ptr = [0,1,3,7,9]

COO representation:
  Data    = [1,2,1,1,2,3,4,1,1]
  Col_idx = [0,2,3,0,1,2,3,0,3]
  Row_idx = [0,1,1,3,3,3,3,7,7]

• A: only CSR
© Volodymyr Kindratenko, 2023, ECE408/CS483, University of Illinois, Urbana-Champaign
