
ECE408/CS483/CSE408 Spring 2023

Applied Parallel Programming

Lecture 19
Parallel Sparse Methods

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
ECE408/CS483/ University of Illinois at Urbana-Champaign
Course Reminders
• MP5.2 is due this week
• MP 6 is out
– Due on April 14th
• Project Milestone 2: Baseline Convolution Kernel
– Due next week
• Note the day/time of midterm 2
– May 2nd evening

Objective
• To learn the key techniques for compacting input data in parallel sparse methods, reducing memory bandwidth consumption
  – better utilization of on-chip memory
  – fewer bytes transferred to on-chip memory
  – better utilization of global memory
  – challenge: retaining regularity

Sparse Matrix
• Many real-world systems are sparse in nature
– Linear systems described as sparse matrices
• Solving sparse linear systems
  – Iterative conjugate-gradient solvers based on sparse matrix-vector multiplication are a common method
• Solutions of PDE systems can be formulated as linear operations expressed as sparse matrix-vector multiplication

Sparse Matrix in Analytics and AI
• Recommender systems: predict missing ratings; group similar users/items (matrix factorization)
• Natural language processing: latent semantic models; word embeddings as input to DNNs
• Complex networks: link prediction; vertex clustering
• Deep learning: model compression; embedding layers
• Web search: matching queries and documents
• Tensor decomposition: in machine learning and HPC applications

[Figure: matrix factorization of an m-users-by-n-items ratings matrix, R ≈ X · ΘT]
Sparse Matrix in Scientific Computing
Science Area / Number of Teams / Codes / methods used (Struct Grids, Unstruct Grids, Dense Matrix, Sparse Matrix, N-Body, Monte Carlo, FFT, PIC, Sig I/O):
Climate and Weather 3 CESM, GCRM, CM1/WRF, HOMME X X X X X
Plasmas/Magnetosphere 2 H3D(M), VPIC, OSIRIS, Magtail/UPIC X X X X
Stellar Atmospheres and Supernovae 5 PPM, MAESTRO, CASTRO, SEDONA, ChaNGa, MS-FLUKSS X X X X X X
Cosmology 2 Enzo, pGADGET X X X
Combustion/Turbulence 2 PSDNS, DISTUF X X
General Relativity 2 Cactus, Harm3D, LazEV X X
Molecular Dynamics 4 AMBER, Gromacs, NAMD, LAMMPS X X X
Quantum Chemistry 2 SIAL, GAMESS, NWChem X X X X X
Material Science 3 NEMOS, OMEN, GW, QMCPACK X X X X
Earthquakes/Seismology 2 AWP-ODC, HERCULES, PLSQR, SPECFEM3D X X X X
Quantum Chromodynamics 1 Chroma, MILC, USQCD X X X
Social Networks 1 EPISIMDEMICS
Evolution 1 Eve
Engineering/System of Systems 1 GRIPS, Revisit X
Computer Science 1 X X X X X
Sparse Matrix-Vector Multiplication (SpMV)

[Figure: sparse matrix A times dense vector X produces dense vector Y]

Challenges
• Compared to dense matrix multiplication, SpMV
– Is irregular/unstructured
– Has little input data reuse
– Benefits little from compiler transformation tools

• Key to maximal performance
  – Maximize regularity (by reducing divergence and load imbalance)
  – Maximize DRAM burst utilization (layout arrangement)

A Simple Parallel SpMV

Row 0 3 0 1 0 Thread 0
Row 1 0 0 0 0 Thread 1
Row 2 0 2 4 1 Thread 2
Row 3 1 0 0 1 Thread 3

• Each thread processes one row

Compressed Sparse Row (CSR) Format
CSR Representation
Row 0 Row 2 Row 3
Nonzero values data[7] { 3, 1, 2, 4, 1, 1, 1 }
Column indices col_index[7] { 0, 2, 1, 2, 3, 0, 3 }
Row Pointers row_ptr[5] { 0, 2, 2, 5, 7 }

Dense representation

Row 0 3 0 1 0 Thread 0
Row 1 0 0 0 0 Thread 1
Row 2 0 2 4 1 Thread 2
Row 3 1 0 0 1 Thread 3
CSR Data Layout

row_ptr 0 2 2 5 7

data 3 1 2 4 1 1 1

col_index 0 2 1 2 3 0 3

CSR Kernel Design

[Figure: each thread reads row_ptr to find its row's extent, then computes a dot product of that row with the vector x]

A Parallel SpMV/CSR Kernel (CUDA)
__global__ void SpMV_CSR(int num_rows, float *data, int *col_index,
                         int *row_ptr, float *x, float *y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        float dot = 0;
        int row_start = row_ptr[row];
        int row_end = row_ptr[row + 1];
        for (int elem = row_start; elem < row_end; elem++)
            dot += data[elem] * x[col_index[elem]];
        y[row] = dot;
    }
}

Row 0 Row 2 Row 3


Nonzero values data[7] { 3, 1, 2, 4, 1, 1, 1 }
Column indices col_index[7] { 0, 2, 1, 2, 3, 0, 3 }
Row Pointers row_ptr[5] { 0, 2, 2, 5, 7 }
CSR Kernel Control Divergence

• Threads execute different number of iterations

row_ptr 0 2 2 5 7

data 3 1 2 4 1 1 1

col_index 0 2 1 2 3 0 3

CSR Kernel Memory Divergence
(Uncoalesced Accesses)
• Adjacent threads access non-adjacent memory locations
  – In iteration 0, the active threads touch data[0], data[2], and data[5] (the first elements of rows 0, 2, and 3): these are not contiguous, so the accesses cannot coalesce

row_ptr 0 2 2 5 7

data 3 1 2 4 1 1 1

col_index 0 2 1 2 3 0 3

Regularizing SpMV with ELL(PACK) Format

Thread 0

Thread 1

Thread 2

Thread 3
3 1 *

* * *
3 * 2 1
2 4 1
1 * 4 1
1 1 *
* * 1 *
CSR with Padding
Transposed
• Pad all rows to the same length
– Inefficient if a few rows are much longer than others
• Transpose (Column Major) for DRAM efficiency
• Both data and col_index padded/transposed
ELL Kernel Design

A in ELL format (transposed, * = padding):
  3 * 2 1
  1 * 4 1
  * * 1 *

A Parallel SpMV/ELL Kernel (CUDA)

__global__ void SpMV_ELL(int num_rows, float *data, int *col_index,
                         int num_elem, float *x, float *y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        float dot = 0;
        for (int i = 0; i < num_elem; i++)
            dot += data[row + i * num_rows] * x[col_index[row + i * num_rows]];
        y[row] = dot;
    }
}

Memory Coalescing with ELL

Transposed ELL layout (* = padding), one thread per row:
  3 * 2 1   <- iteration 0 (threads 0-3)
  1 * 4 1   <- iteration 1
  * * 1 *   <- iteration 2

data      3 * 2 1 | 1 * 4 1 | * * 1 *
col_index 0 0 1 0 | 2 1 2 3 | 3 2 3 1

In each iteration, threads 0-3 read four consecutive elements: a coalesced access.
Coordinate (COO) format

• Explicitly list the column and row indices for every non-zero element

Row 0 Row 2 Row 3


Nonzero values data[7] { 3, 1, 2, 4, 1, 1, 1 }
Column indices col_index[7] { 0, 2, 1, 2, 3, 0, 3 }
Row indices row_index[7] { 0, 0, 2, 2, 2, 3, 3 }

COO Allows Reordering of Elements

Row 0 Row 2 Row 3


Nonzero values data[7] { 3, 1, 2, 4, 1, 1, 1 }
Column indices col_index[7] { 0, 2, 1, 2, 3, 0, 3 }
Row indices row_index[7] { 0, 0, 2, 2, 2, 3, 3 }

Nonzero values data[7] { 1, 1, 2, 4, 3, 1, 1 }
Column indices col_index[7] { 0, 2, 1, 2, 0, 3, 3 }
Row indices row_index[7] { 3, 0, 2, 2, 0, 2, 3 }

COO Kernel

for (int i = 0; i < num_elem; i++)
    y[row_index[i]] += data[i] * x[col_index[i]];

A sequential loop that implements SpMV/COO (y must be zero-initialized first).

COO Kernel Design
Accessing the Input Matrix and Vector (Ax = v)
• Every thread can read its matrix element directly and use the corresponding col_index to gather from the input vector x
• Maximal parallelism for the reads

[Figure: A in COO storage format (Row, Col, Data arrays)]


COO Kernel Design
Accumulating into the Output Vector (Ax = v)
• Each thread uses the row_index of its element to accumulate into one of the output y elements
• Need atomic operations!
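A parallel version assigns one element per thread and resolves the conflicting updates with atomicAdd. This kernel is not in the slides; it is a sketch following the same conventions as the CSR and ELL kernels above, and it assumes y has been zeroed before launch:

```cuda
__global__ void SpMV_COO(int num_elem, float *data, int *col_index,
                         int *row_index, float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < num_elem)
        // Multiple threads may target the same output row,
        // so the accumulation must be a single atomic read-modify-write
        atomicAdd(&y[row_index[i]], data[i] * x[col_index[i]]);
}
```

The atomics serialize updates to the same row, which is why COO is usually reserved for the few exceptional elements (see the hybrid format below) rather than the whole matrix.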



Hybrid Format (ELL + COO)
• ELL handles the typical entries
• COO handles the exceptional entries
  – Implemented with segmented reduction on the GPU; in practice, the COO part is often processed in sequential host code

Reduced Padding with Hybrid Format
Row-major padded layout before transposition (ELL width 2, * = padding):
            data   col_index
  Thread 0: 3 1    0 2
  Thread 1: * *    * *
  Thread 2: 2 4    1 2
  Thread 3: 1 1    0 3

ELL part (transposed):
  data      3 * 2 1 | 1 * 4 1
  col_index 0 * 1 0 | 2 * 2 3

COO part (the overflow element of row 2):
  data { 1 }, col_index { 3 }, row_index { 2 }
Any more questions? Read Chapter 10.
Problem Solving
• Q: Given matrix A, which of the following are correct?

CSR representation:
  Data    = [1,2,1,1,2,3,4,1,1]
  Col_idx = [0,2,3,0,1,2,3,0,3]
  Row_ptr = [0,1,3,7,9]

COO representation:
  Data    = [1,2,1,1,2,3,4,1,1]
  Col_idx = [0,2,3,0,1,2,3,0,3]
  Row_idx = [0,1,1,3,3,3,3,7,7]

• A: only CSR
© Volodymyr Kindratenko, 2023, ECE408/CS483, University of Illinois, Urbana-Champaign
