ECE408 Lecture 19 Sparse Matrix VK SP23
Lecture 19
Parallel Sparse Methods
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
ECE408/CS483/ University of Illinois at Urbana-Champaign
Course Reminders
• MP5.2 is due this week
• MP 6 is out
– Due on April 14th
• Project Milestone 2: Baseline Convolution Kernel
– Due next week
• Take note of the day/time of Midterm 2
– May 2nd evening
Objective
• To learn the key techniques for compacting input data in parallel sparse methods, reducing memory bandwidth consumption:
– Better utilization of on-chip memory
– Fewer bytes transferred to on-chip memory
– Better utilization of global memory
– Challenge: retaining regularity
Sparse Matrix
• Many real-world systems are sparse in nature
– Linear systems described as sparse matrices
• Solving sparse linear systems
– Iterative conjugate-gradient solvers based on sparse matrix-vector multiplication are a common method
• Solving PDE systems can be formulated as linear operations expressed as sparse matrix-vector multiplication
Sparse Matrix in Analytics and AI
[Figure: Recommender systems factor a sparse m-users × n-items ratings matrix R into dense factors, R ≈ X Θᵀ, with user feature vectors xᵤ and item feature vectors θᵥ of f latent features; natural language processing likewise operates on sparse matrices.]
Sparse Matrix in Scientific Computing
Science Area | Teams | Codes | Methods used (X marks among: Struct Grids, Unstruct Grids, Dense Matrix, Sparse Matrix, N-Body, Monte Carlo, FFT, PIC, Sig I/O)
Climate and Weather | 3 | CESM, GCRM, CM1/WRF, HOMME | X X X X X
Plasmas/Magnetosphere | 2 | H3D(M), VPIC, OSIRIS, Magtail/UPIC | X X X X
Stellar Atmospheres and Supernovae | 5 | PPM, MAESTRO, CASTRO, SEDONA, ChaNGa, MS-FLUKSS | X X X X X X
Cosmology | 2 | Enzo, pGADGET | X X X
Combustion/Turbulence | 2 | PSDNS, DISTUF | X X
General Relativity | 2 | Cactus, Harm3D, LazEV | X X
Molecular Dynamics | 4 | AMBER, Gromacs, NAMD, LAMMPS | X X X
Quantum Chemistry | 2 | SIAL, GAMESS, NWChem | X X X X X
Material Science | 3 | NEMOS, OMEN, GW, QMCPACK | X X X X
Earthquakes/Seismology | 2 | AWP-ODC, HERCULES, PLSQR, SPECFEM3D | X X X X
Quantum Chromo Dynamics | 1 | Chroma, MILC, USQCD | X X X
Social Networks | 1 | EPISIMDEMICS |
Evolution | 1 | Eve |
Engineering/System of Systems | 1 | GRIPS, Revisit | X
Computer Science | 1 | | X X X X X
Sparse Matrix-Vector Multiplication (SpMV)
[Figure: Y = A X — a sparse matrix A multiplied by a dense vector X yields a dense vector Y.]
Challenges
• Compared to dense matrix multiplication, SpMV
– Is irregular/unstructured
– Has little input data reuse
– Benefits little from compiler transformation tools
A Simple Parallel SpMV
Row 0 3 0 1 0 Thread 0
Row 1 0 0 0 0 Thread 1
Row 2 0 2 4 1 Thread 2
Row 3 1 0 0 1 Thread 3
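The assignment above gives each thread one row of the dense storage. A sequential C sketch of that per-row dot product (function name and values are illustrative; in the CUDA version the outer loop is replaced by one thread per row, with row = blockIdx.x * blockDim.x + threadIdx.x):

```c
#define N 4

/* One "thread" per row: each row computes a dot product with x.
   Zeros are stored and multiplied, wasting work and bandwidth. */
void spmv_dense(int num_rows, const float A[][N], const float *x, float *y) {
    for (int row = 0; row < num_rows; row++) {   /* parallel over rows on the GPU */
        float dot = 0.0f;
        for (int col = 0; col < N; col++)
            dot += A[row][col] * x[col];
        y[row] = dot;
    }
}
```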
Compressed Sparse Row (CSR) Format
CSR Representation
Nonzero values: data[7] = { 3, 1, 2, 4, 1, 1, 1 } (row 0 | row 2 | row 3)
Column indices: col_index[7] = { 0, 2, 1, 2, 3, 0, 3 }
Row pointers: row_ptr[5] = { 0, 2, 2, 5, 7 }
Dense representation
Row 0 3 0 1 0 Thread 0
Row 1 0 0 0 0 Thread 1
Row 2 0 2 4 1 Thread 2
Row 3 1 0 0 1 Thread 3
CSR Data Layout
row_ptr 0 2 2 5 7
data 3 1 2 4 1 1 1
col_index 0 2 1 2 3 0 3
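The layout above can be produced by a single scan of the dense matrix. A C sketch of the conversion (the function name is illustrative; arrays are assumed pre-allocated by the caller):

```c
#define N 4

/* Convert a dense N x N matrix to CSR. row_ptr[r] records where row r's
   nonzeros start in data/col_index; row_ptr[N] is a one-past-the-end
   sentinel, so row r spans row_ptr[r] .. row_ptr[r+1]-1. Returns nnz. */
int dense_to_csr(const float A[][N], float *data, int *col_index, int *row_ptr) {
    int nnz = 0;
    for (int row = 0; row < N; row++) {
        row_ptr[row] = nnz;
        for (int col = 0; col < N; col++)
            if (A[row][col] != 0.0f) {
                data[nnz] = A[row][col];   /* keep only nonzeros */
                col_index[nnz] = col;      /* remember their columns */
                nnz++;
            }
    }
    row_ptr[N] = nnz;
    return nnz;
}
```

Running this on the 4x4 example matrix reproduces exactly the data, col_index, and row_ptr arrays shown above.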
CSR Kernel Design
A Parallel SpMV/CSR Kernel (CUDA)
__global__ void SpMV_CSR(int num_rows, float *data, int *col_index,
                         int *row_ptr, float *x, float *y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        float dot = 0;
        int row_start = row_ptr[row];
        int row_end   = row_ptr[row + 1];
        for (int elem = row_start; elem < row_end; elem++)
            dot += data[elem] * x[col_index[elem]];
        y[row] = dot;
    }
}
row_ptr 0 2 2 5 7
data 3 1 2 4 1 1 1
col_index 0 2 1 2 3 0 3
CSR Kernel Memory Divergence
(Uncoalesced Accesses)
• Adjacent threads access non-adjacent memory locations
– In iteration 0, threads 0, 2, and 3 access data[0], data[2], and data[5] (the shaded elements), which are far apart in memory
row_ptr 0 2 2 5 7
data 3 1 2 4 1 1 1
col_index 0 2 1 2 3 0 3
Regularizing SpMV with ELL(PACK) Format
One thread per row (Threads 0-3).
CSR with padding (rows padded to 3 elements; * = padding):
Row 0: 3 1 *
Row 1: * * *
Row 2: 2 4 1
Row 3: 1 1 *
Transposed (column-major):
3 * 2 1
1 * 4 1
* * 1 *
• Pad all rows to the same length
– Inefficient if a few rows are much longer than others
• Transpose (Column Major) for DRAM efficiency
• Both data and col_index padded/transposed
ELL Kernel Design
3 * 2 1
1 * 4 1
* * 1 *
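With the padded, transposed layout, element i of row r sits at offset i * num_rows + r. A sequential C sketch of the ELL kernel logic (function name illustrative; padded data entries are stored as 0 so they contribute nothing; on the GPU the outer loop is one thread per row):

```c
/* ELL SpMV reference: data/col_index are padded to num_elem entries per
   row and stored column-major, so consecutive rows are adjacent in memory
   within each iteration. */
void spmv_ell(int num_rows, int num_elem, const float *data,
              const int *col_index, const float *x, float *y) {
    for (int row = 0; row < num_rows; row++) {        /* one thread per row on the GPU */
        float dot = 0.0f;
        for (int i = 0; i < num_elem; i++)
            dot += data[row + i * num_rows] * x[col_index[row + i * num_rows]];
        y[row] = dot;                                 /* padded entries add 0 */
    }
}
```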
Memory Coalescing with ELL
Each thread (Threads 0-3, one per row of A) walks down one column of the transposed arrays:
3 * 2 1
1 * 4 1
* * 1 *
data 3 * 2 1 1 * 4 1 * * 1 *
col_index 0 0 1 0 2 1 2 3 3 2 3 1
In each iteration, the four threads access consecutive elements of data and col_index, so the accesses coalesce.
Coordinate (COO) format
• Explicitly list the row & column indices for every non-zero element
– For the running example: row_index { 0, 0, 2, 2, 2, 3, 3 }, col_index { 0, 2, 1, 2, 3, 0, 3 }, data { 3, 1, 2, 4, 1, 1, 1 }
COO Allows Reordering of Elements
COO Kernel
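In the COO scheme each thread handles one non-zero element and accumulates its contribution into y[row_index[i]]; on the GPU this accumulation must use atomicAdd, since several threads may target the same row. A sequential C sketch of the same logic (function name illustrative):

```c
/* COO SpMV reference: one element per "thread". y must be zero-initialized;
   the += below becomes atomicAdd(&y[row_index[i]], ...) in the CUDA kernel. */
void spmv_coo(int nnz, const float *data, const int *row_index,
              const int *col_index, const float *x, float *y) {
    for (int i = 0; i < nnz; i++)
        y[row_index[i]] += data[i] * x[col_index[i]];
}
```

Because every element carries its own row index, the elements can be processed in any order, which is what makes reordering safe.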
COO Kernel Design
Accessing the Input Matrix and Vector (computing Ax = v)
• Every thread reads one element's (row, col, data) triple, uses col_index to gather from the vector x, and uses row_index to accumulate into v
• Maximal parallelism: one thread per non-zero element
Reduced Padding with Hybrid Format
ELL part (rows padded to 2 elements; * = padding):
Thread 0: data 3 1, col_index 0 2
Thread 1: data * *, col_index * *
Thread 2: data 2 4, col_index 1 2
Thread 3: data 1 1, col_index 0 3
Transposed ELL arrays:
data 3 * 2 1 1 * 4 1
col_index 0 * 1 0 2 * 2 3
COO part (overflow element of row 2):
data 1
col_index 3
row_index 2
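The two parts combine as follows: typical rows go through the padded ELL pass, and the overflow elements of unusually long rows are accumulated on top by a COO pass. A sequential C sketch under those assumptions (function name illustrative; the COO loop uses atomicAdd in the CUDA version):

```c
/* Hybrid ELL/COO SpMV reference. The ELL arrays are column-major and
   padded to num_elem entries per row (padding stored as 0); the COO
   arrays hold the elements that did not fit. */
void spmv_hybrid(int num_rows, int num_elem,
                 const float *ell_data, const int *ell_col,
                 int coo_nnz, const float *coo_data,
                 const int *coo_row, const int *coo_col,
                 const float *x, float *y) {
    for (int row = 0; row < num_rows; row++) {         /* ELL pass: one thread per row */
        float dot = 0.0f;
        for (int i = 0; i < num_elem; i++)
            dot += ell_data[row + i * num_rows] * x[ell_col[row + i * num_rows]];
        y[row] = dot;
    }
    for (int i = 0; i < coo_nnz; i++)                  /* COO pass: overflow elements */
        y[coo_row[i]] += coo_data[i] * x[coo_col[i]];
}
```

On the slide's example, row 2 contributes 2*x[1] + 4*x[2] from ELL plus 1*x[3] from COO, matching the full CSR result.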
Any more questions? Read Chapter 10.
Problem Solving
• Q: Given matrix A, which of the candidate representations shown on the slide are correct?
• A: Only the CSR representation
© Volodymyr Kindratenko, 2023 ECE408/CS483, 28
University of Illinois, Urbana-Champaign