
A Scalable, Numerically Stable, High-Performance Tridiagonal Solver for GPUs
Li-Wen Chang, Wen-mei Hwu
University of Illinois

How to Build a gtsv for CUSPARSE
GTC 2013
Li-Wen Chang, Wen-mei Hwu
University of Illinois
Material in this Session
• This talk is based on our SC'12 paper
  – Chang, Li-Wen; Stratton, John; Kim, Hee-Seok; Hwu, Wen-mei, "A Scalable, Numerically Stable, High-Performance Tridiagonal Solver using GPUs", Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2012 (SC'12)
• But this session contains more
  – Details not shown in the paper due to the page limit
  – An extension developed with the NVIDIA CUSPARSE team
3
Comparison among Tridiagonal Solvers
(built up one column at a time over slides 4–7)

Solver                  | Numerical Stability | CPU Performance | GPU Performance | Cluster Scalability
Matlab (backslash)      | Yes                 | Poor            | Not supported   | Not supported
Intel MKL (gtsv)        | Yes                 | Good            | Not supported   | Not supported
Intel SPIKE             | Yes                 | Good            | Not supported   | Supported
CUSPARSE gtsv (2012)    | No                  | Not supported   | Good            | Not supported
Our gtsv                | Yes                 | Not supported   | Good            | Supported
Our heterogeneous gtsv  | Yes                 | Good            | Good            | Supported

7
Numerical Stability on GPUs
• All previous related work for GPUs
  – Unstable algorithms, like the Thomas algorithm, Cyclic Reduction (CR), or Parallel Cyclic Reduction (PCR)
  – No pivoting
• Why is pivoting important?
  – Example: the matrix [0 1; 1 0] is perfectly well conditioned, but elimination that keeps the rows in order must divide by the zero pivot; swapping the rows gives [1 0; 0 1] and the problem disappears

8
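The zero-pivot failure can be reproduced in a few lines. This is an illustrative sketch (not the paper's code): the pivot-free Thomas algorithm divides by each diagonal pivot, so a zero pivot breaks the solve even when the matrix itself is well conditioned.

```python
def thomas(a, b, c, d):
    """Solve a tridiagonal system without pivoting.
    a: sub-diagonal (a[0] unused), b: main diagonal,
    c: super-diagonal (c[-1] unused), d: right-hand side."""
    n = len(b)
    b, d = list(b), list(d)
    for i in range(1, n):
        m = a[i] / b[i - 1]          # fails if b[i-1] == 0
        b[i] -= m * c[i - 1]
        d[i] -= m * d[i - 1]
    x = [0.0] * n
    x[-1] = d[-1] / b[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (d[i] - c[i] * x[i + 1]) / b[i]
    return x

# Well-behaved system [[2, 1], [1, 2]] x = [3, 3]: works fine.
x = thomas([0, 1.0], [2.0, 2.0], [1.0, 0], [3.0, 3.0])

# Permutation-like matrix [[0, 1], [1, 0]]: solvable by inspection,
# but the pivot-free sweep divides by zero on the first pivot.
try:
    thomas([0, 1.0], [0.0, 0.0], [1.0, 0], [1.0, 2.0])
    failed = False
except ZeroDivisionError:
    failed = True
```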
CUSPARSE gtsv (2012)
• CR (+ PCR)
  – Each CR step eliminates the odd-indexed rows: every surviving row i is updated using its neighbors, with multipliers a_i/b_{i-1} and c_i/b_{i+1}
• But when the b_i's are 0's, those multipliers divide by zero, e.g.

    ⎡0   c0        ⎤
    ⎢a1  0   c1    ⎥
    ⎢    a2  0   c2⎥
    ⎣        a3  0 ⎦

  and CR breaks down even though the matrix may be well conditioned

9
Why Numerical Stability is Difficult on GPUs
• Why didn't people apply pivoting on GPUs?
  – They worried about performance
  – Pivoting does not seem to fit the GPU
• Pivoting may serialize computation
• Pivoting requires data-dependent control flow
  – GPUs like regular computation and regular memory access
  – Branch divergence may hurt performance

10
Our gtsv
• For parallelization
  – The SPIKE algorithm is applied to decompose the problem
  – An optimization technique is applied to achieve high memory efficiency
    • Data layout transformation
• For data-dependent control flow
  – Diagonal pivoting is chosen
  – An optimization technique is proposed to achieve high memory efficiency
    • Dynamic tiling

11
Part 1: SPIKE Algorithm
• SPIKE algorithm decomposes a tridiagonal
matrix A into several blocks

12
SPIKE Algorithm
• D and S can be defined so that A = DS

• AX = F can then be solved by solving

DY = F, and then SX = Y

13
A Small Example

        ⎡2  1      ⎤        ⎡ 4⎤
    A = ⎢5  1  2   ⎥,   F = ⎢13⎥
        ⎢   1  3  4⎥        ⎢27⎥
        ⎣      2  5⎦        ⎣26⎦

• Partition: A1 = [2 1; 5 1] and A2 = [3 4; 2 5]; B1 holds the coupling entry 2 (row 2 into block 2), C2 holds the coupling entry 1 (row 3 into block 1)

• A = DS, written block by block:

    ⎡2 1    ⎤ ⎡ 1   0  v11  0⎤
    ⎢5 1    ⎥ ⎢ 0   1  v12  0⎥
    ⎢    3 4⎥ ⎢ 0  w21  1   0⎥
    ⎣    2 5⎦ ⎣ 0  w22  0   1⎦

  where A1·[v11; v12] = [0; 2] and A2·[w21; w22] = [1; 0]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2010-2013
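The small example can be checked end to end with a few lines of scalar code. This is an illustrative sketch (2x2 blocks solved directly by Cramer's rule), not the GPU implementation, which solves each block with diagonal pivoting:

```python
# SPIKE on the slide's 4x4 example: solve D Y = F and the spikes,
# then a tiny reduced system on the interface unknowns.

def solve2(M, f):
    """Solve a 2x2 system M x = f by Cramer's rule."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(f[0] * M[1][1] - M[0][1] * f[1]) / det,
            (M[0][0] * f[1] - f[0] * M[1][0]) / det]

A1, A2 = [[2.0, 1.0], [5.0, 1.0]], [[3.0, 4.0], [2.0, 5.0]]
c1, a2 = 2.0, 1.0                    # coupling entries (B1, C2)
F1, F2 = [4.0, 13.0], [27.0, 26.0]

# Spikes and block solves: A1 v = [0, c1], A2 w = [a2, 0], D Y = F.
v = solve2(A1, [0.0, c1])
w = solve2(A2, [a2, 0.0])
y1 = solve2(A1, F1)
y2 = solve2(A2, F2)

# Reduced system couples only the interface unknowns x1 and x2:
#   x1 + v[1]*x2 = y1[1],   w[0]*x1 + x2 = y2[0]
x1, x2 = solve2([[1.0, v[1]], [w[0], 1.0]], [y1[1], y2[0]])

# Back-substitute into the remaining rows of S X = Y.
x = [y1[0] - v[0] * x2, x1, x2, y2[1] - w[1] * x1]
```

Running this recovers x = [1, 2, 3, 4], which satisfies all four equations of the example.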
SPIKE Algorithm
• How to build S?

• Solve DY = F
– Solve several independent tridiagonal matrices Ai’s

16
SPIKE Algorithm
• How to solve SX = Y?
  – Solve the collection of the first and last rows of all blocks
  – Reduction*
    • Problem size: 4L -> 6; with 4 blocks of size L, the reduced system is

        ⎡ 1   v1L              ⎤
        ⎢w21   1   0   v21     ⎥
        ⎢w2L   0   1   v2L     ⎥
        ⎢     w31   1   0   v31⎥
        ⎢     w3L   0   1   v3L⎥
        ⎣          w41   1     ⎦

  – Backward substitution

*E. Polizzi and A. H. Sameh, "A parallel hybrid banded system solver: The SPIKE algorithm," Parallel Computing, vol. 32, no. 2, pp. 177–194, 2006.
17
Part 2: Diagonal Pivoting
for Tridiagonal Matrices
• How to solve each block Ai in a numerically
stable way?
– Diagonal pivoting*
– Ai can be solved sequentially by each thread
• Why diagonal pivoting?
  – Its data-dependent control flow is easier to handle on GPUs than that of other pivoting schemes

*J. B. Erway, R. F. Marcia, and J. Tyson, “Generalized diagonal pivoting methods for
tridiagonal systems without interchanges,” IAENG International Journal of Applied
Mathematics, vol. 4, no. 40, pp. 269–275, 2010. 18
Diagonal Pivoting
• A tridiagonal matrix A can be decomposed into LBM^T
– Instead of LDU
– L and M are unit lower triangular matrices
– B is a block diagonal matrix with 1-by-1 or 2-by-2 blocks

• Criteria for choosing 1-by-1 or 2-by-2 blocks


– Asymmetric Bunch-Kaufman pivoting

19
LBM^T Decomposition

• Bd is a 1-by-1 or 2-by-2 block

• As is also a tridiagonal matrix
  – As is updated by modifying the leading elements of T22
  – As can be decomposed recursively
  – For a 2-by-2 pivot, ∆ = b1b2 − a2c1

20
Diagonal Pivoting
• A can be solved by solving L, B, and then MT

• It has data-dependent control flow
  – B contains 1-by-1 or 2-by-2 blocks
  – It is better than other pivoting schemes
    • Only accesses nearby rows
  – Requires dynamic tiling to perform efficiently on GPUs

21
More Optimization
• Example 4x4 block with diagonals (b1, b2, b3, b4), lower diagonal (a2, a3, a4), upper diagonal (c1, c2, c3):
  – d=1 (1-by-1 pivot): pivot b1; L entry a2/b1; M^T entry c1/b1
  – d=2 (2-by-2 pivot): pivot [b1 c1; a2 b2] with ∆ = b1b2 − a2c1; L entries (−a2a3/∆, b1a3/∆); M^T entries (−c1c2/∆, b1c2/∆)
• We store only the pivot conditions and the leading elements of B
  – L and M^T are not stored; they are recomputed on the fly
An Example

        ⎡0    2        ⎤
        ⎢0.5  0   1    ⎥
    A = ⎢     1   0   1⎥
        ⎣         1   0⎦

• The zero diagonal forces a 2-by-2 pivot at every step (d=2 twice)
• What we really store: the leading elements of B plus the condition vector
  – condition = [2, 0, 2, 0] (a 2 marks the start of a 2-by-2 pivot)
Pivoting Criteria
• Bunch-Kaufman algorithm for unsymmetric cases

    κ = (√5 − 1)/2
    σ = max(|c1|, |a2|, |b2|, |c2|, |a3|)
    if |b1|·σ ≥ κ·|c1·a2|:
        1-by-1 pivoting
    else:
        2-by-2 pivoting
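The criterion fits in a few lines. A hedged sketch (the function name and test values are illustrative, not from the paper):

```python
import math

# Bunch-Kaufman growth-factor constant, (sqrt(5) - 1) / 2 ≈ 0.618.
KAPPA = (math.sqrt(5.0) - 1.0) / 2.0

def pivot_size(b1, c1, a2, b2, c2, a3):
    """Return 1 for a 1-by-1 pivot, 2 for a 2-by-2 pivot,
    using the unsymmetric Bunch-Kaufman test from the slide."""
    sigma = max(abs(c1), abs(a2), abs(b2), abs(c2), abs(a3))
    return 1 if abs(b1) * sigma >= KAPPA * abs(c1 * a2) else 2

# A zero leading diagonal forces the stable 2-by-2 choice ...
p_zero = pivot_size(0.0, 2.0, 0.5, 0.0, 1.0, 1.0)
# ... while a dominant diagonal keeps the cheap 1-by-1 pivot.
p_dom = pivot_size(4.0, 1.0, 1.0, 4.0, 1.0, 1.0)
```

Because the test only reads a handful of nearby entries, each thread can evaluate it locally without any row interchange, which is what makes this pivoting strategy GPU-friendly.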
Our gtsv Algorithm
• Solving each Ai dominates
the runtime
– Using diagonal pivoting

• One Ai is solved sequentially,


and all Ai’s are solved in
parallel
– Requires data layout transformation to perform efficiently on GPUs

25
Data Layout
• Observation
– GPU requires stride-one memory
access to fully utilize memory
bandwidth
• Contradiction
– Consecutive elements in a diagonal
are stored in consecutive memory in
gtsv interface
– Each block is processed by one thread
• Solution
– Data layout transformation

26
Data Layout Transformation
• Local transpose
– bi’s are elements in a diagonal
– 6 four-element blocks (the blocks in SPIKE)
[figure: memory addresses before and after the local transpose]
27
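The local transpose can be sketched on the CPU; the real transformation runs as a GPU kernel, and the tile width below is an illustrative parameter, not the paper's value:

```python
def local_transpose(diag, block_len, tile):
    """Reorder `diag` so that, within each tile of `tile` SPIKE blocks,
    element i of every block is stored contiguously. One thread per
    block then makes stride-one (coalesced) accesses at each step."""
    out = []
    nblocks = len(diag) // block_len
    for t0 in range(0, nblocks, tile):          # one tile of blocks
        for i in range(block_len):              # step within a block
            for b in range(t0, t0 + tile):      # blocks in the tile
                out.append(diag[b * block_len + i])
    return out

# 6 blocks of 4 elements, tiles of 3 blocks; values are block ids, so
# in the old layout each block's elements are contiguous.
diag = [b for b in range(6) for _ in range(4)]
layout = local_transpose(diag, 4, 3)
```

After the transpose, the threads of a tile touch consecutive addresses at every step instead of addresses `block_len` apart.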
Data Layout Transformation
• Runtime (ms), old layout (gtsv interface) vs. proposed layout:
  – Random (1-by-1 or 2-by-2 pivoting): 69.46 -> 59.87
  – Diagonally dominant (always 1-by-1 pivoting): 38.63 -> 9.68
  – Zero diagonal (always 2-by-2 pivoting): 34.59 -> 7.07
  – Data marshaling overhead: 4.73
• 4-5x speedup for the diagonally dominant and zero-diagonal cases

28
Dynamic Tiling
• Observation
  – Memory access with a compact footprint can be handled well by L1
    • Even when branch divergence exists
  – A scattered footprint dramatically reduces memory efficiency
• Solution
  – Insert barriers to regularize the memory access footprint
  [figure: per-thread row positions for threads T1-T4 drifting apart over time; compact vs. scattered footprint in the address space]

29
Dynamic Tiling
[figure: without barriers, the per-thread positions T1-T4 drift apart; with a real barrier at each estimated tiling boundary, all threads re-align before entering the next tile]

30
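The effect can be modeled with a toy simulation (illustrative only; the numbers, tile size, and advance pattern are assumptions, not measurements). Each "thread" advances 1 or 2 rows per step, mimicking 1-by-1 vs. 2-by-2 pivots, so positions drift apart; a barrier at each estimated tile boundary re-compacts the footprint:

```python
def max_spread(steps_per_thread, tile=None):
    """Run threads lock-step and return the widest gap ever observed
    between the fastest and slowest thread positions. With `tile` set,
    a thread that reaches the next tile boundary waits there until
    every thread has arrived (the dynamic-tiling barrier)."""
    pos = [0] * len(steps_per_thread)
    boundary = tile
    spread = 0
    for s in range(len(steps_per_thread[0])):
        for t, adv in enumerate(steps_per_thread):
            if tile is not None and pos[t] >= boundary:
                continue                  # wait at the barrier
            pos[t] += adv[s]
        if tile is not None and all(p >= boundary for p in pos):
            boundary += tile              # everyone arrived: next tile
        spread = max(spread, max(pos) - min(pos))
    return spread

# Thread 0 always takes 2-by-2 pivots, thread 1 always 1-by-1.
adv = [[2] * 32, [1] * 32]
untiled = max_spread(adv)
tiled = max_spread(adv, tile=8)
```

Without barriers the footprint spread grows without bound; with them it stays within one tile, which is what keeps the accesses L1-friendly despite the divergent control flow.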
Dynamic Tiling
• Runtime (ms), data layout only vs. dynamic tiling (with data layout):
  – Random: 59.87 -> 16.83 (3.5x)
  – Diagonally dominant: 9.68 -> 9.88
  – Zero diagonal: 7.07 -> 7.13

31
Dynamic Tiling
• Performance counters (%): global memory load efficiency, global memory store efficiency, L1 hit rate, and warp execution efficiency, with and without tiling, for the random, diagonally dominant, and zero-diagonal matrices
  – Tiling improves memory efficiency by roughly 3x (random), 1.8x (diagonally dominant), and 3x (zero diagonal)
  – Warp execution efficiency remains low because of branch divergence

32
Final Evaluation
• 3 kinds of evaluation
  – Numerical stability
    • A backward analysis: relative backward error ‖Ax − b‖ / ‖b‖
    • 16 selected types of matrices*
  – Single-GPU performance
  – Cluster scalability
    • Multiple GPUs
    • Multiple GPUs + multiple CPUs
• Multiple GPUs
• Multiple GPUs + multiple CPUs

*J. B. Erway, R. F. Marcia, and J. Tyson, “Generalized diagonal pivoting methods for
tridiagonal systems without interchanges,” IAENG International Journal of Applied
Mathematics, vol. 4, no. 40, pp. 269–275, 2010 33
Numerical Stability
Relative Backward Error
Matrix type Our gtsv Our dtsvb CUSPARSE MKL Intel SPIKE Matlab
1 1.82E-14 1.97E-14 7.14E-12 1.88E-14 1.39E-15 1.96E-14
2 1.27E-16 1.27E-16 1.69E-16 1.03E-16 1.02E-16 1.03E-16
3 1.55E-16 1.52E-16 2.57E-16 1.35E-16 1.29E-16 1.35E-16
4 1.37E-14 1.22E-14 1.39E-12 3.10E-15 1.69E-15 2.78E-15
5 1.07E-14 1.13E-14 1.82E-14 1.56E-14 4.62E-15 2.93E-14
6 1.05E-16 1.06E-16 1.57E-16 9.34E-17 9.51E-17 9.34E-17
7 2.42E-16 2.46E-16 5.13E-16 2.52E-16 2.55E-16 2.27E-16
8 2.14E-04 2.14E-04 1.50E+10 3.76E-04 2.32E-16 2.14E-04
9 2.32E-05 3.90E-04 1.93E+08 3.15E-05 9.07E-16 1.19E-05
10 4.27E-05 4.83E-05 2.74E+05 3.21E-05 4.72E-16 3.21E-05
11 7.52E-04 6.59E-02 4.54E+11 2.99E-04 2.20E-15 2.28E-04
12 5.58E-05 7.95E-05 5.55E-04 2.24E-05 5.52E-05 2.24E-05
13 5.51E-01 5.45E-01 1.12E+16 3.34E-01 3.92E-15 3.08E-01
14 2.86E+49 4.49E+49 2.92E+51 1.77E+48 3.86E+54 1.77E+48
15 2.09E+60 NaN NaN 1.47E+59 Fail 3.69E+58
16 Inf NaN NaN Inf Fail 4.68E+171

34
GPU Performance
• Runtime of solving an 8M matrix (ms, axis 0–300), for random and diagonally dominant matrices:
  – Our dgtsv (GPU)
  – Our ddtsvb (GPU)
  – CUSPARSE dtsv (GPU)
  – Data transfer (pageable)
  – Data transfer (pinned)
  – MKL dgtsv (sequential, CPU)
  [figure: horizontal bar chart; exact runtimes not recoverable]

35
Our Heterogeneous gtsv
• SPIKE algorithm
• OpenMP for multicore in one node
• CUDA stream for multi-GPUs
• MPI for multi-nodes
• MKL gtsv for CPU
• Our gtsv for GPU

36
Cluster Scalability (GPUs)
• Strong scaling: runtime (ms, log scale) of our gtsv, with and without predistributed data, on 1, 2, 4, 8, and 16 GPUs
  [figure: bar chart; exact runtimes not recoverable]

37
Cluster Scalability (GPUs)
• Weak scaling: runtime (ms, log scale) from 1 GPU on a 2M-sized matrix up to 16 GPUs on a 32M-sized matrix, with and without predistributed data
  [figure: bar chart; exact runtimes not recoverable]

38
Cluster Scalability (GPUs+CPUs)
• Strong scaling, strong scaling with predistributed data, and weak scaling
  [figure: exact runtimes not recoverable]

39
Short Summary

Solver                  | Numerical Stability | CPU Performance | GPU Performance | Cluster Scalability
Matlab (backslash)      | Yes                 | Poor            | Not supported   | Not supported
Intel MKL (gtsv)        | Yes                 | Good            | Not supported   | Not supported
Intel SPIKE             | Yes                 | Good            | Not supported   | Supported
CUSPARSE gtsv (2012)    | No                  | Not supported   | Good            | Not supported
Our gtsv                | Yes                 | Not supported   | Good            | Supported
Our heterogeneous gtsv  | Yes                 | Good            | Good            | Supported

40
More Features for Our gtsv
(in CUSPARSE 2013)
• Support 4 data types
– Float(S), double(D), complex(C), double
complex(Z)
• Support arbitrary sizes
• Support multiple right-hand-side vectors
• Support both general matrices (gtsv) and
diagonally dominant matrices (dtsvb)

41
More Details
• 4 data types
– CUSPARSE built-in operators
• dtsvb
– SPIKE + Thomas algorithm
• Arbitrary sizes
– Padding
– Pad 1’s for the main diagonal, and 0’s for the
lower and upper diagonals

42
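The padding rule on this slide can be sketched in a few lines (illustrative helper names; the pivot-free solve is safe here only because the padded pivots are 1's):

```python
def pad_system(a, b, c, d, m):
    """Pad tridiagonal (sub a, main b, super c, rhs d) from size n up
    to size m: 1's on the main diagonal, 0's on the off-diagonals and
    the right-hand side, so each padded row reduces to x_pad = 0."""
    k = m - len(b)
    return (a + [0.0] * k, b + [1.0] * k, c + [0.0] * k, d + [0.0] * k)

def thomas(a, b, c, d):
    """Plain tridiagonal solve, used only to check the padding."""
    n = len(b)
    b, d = list(b), list(d)
    for i in range(1, n):
        m = a[i] / b[i - 1]
        b[i] -= m * c[i - 1]
        d[i] -= m * d[i - 1]
    x = [0.0] * n
    x[-1] = d[-1] / b[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (d[i] - c[i] * x[i + 1]) / b[i]
    return x

# 2x2 system [[2, 1], [1, 2]] x = [3, 3] (solution [1, 1]), padded to 4.
a, b, c, d = pad_system([0.0, 1.0], [2.0, 2.0], [1.0, 0.0], [3.0, 3.0], 4)
x = thomas(a, b, c, d)
```

The zero seam entries decouple the padded rows, so the first n components of the padded solution match the original system exactly.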
More Details
• Multiple right-hand-side vectors

• Yi’s have multiple columns, but Wi’s and Vi’s


only have one column

43
More Details

• Solve Vi's, Wi's, and the first column of Yi's
  – Build L, B, and M^T
• Then solve the remaining columns of Yi's using the pre-built L, B, and M^T

44
Summary
• The first numerically stable tridiagonal solver for
GPUs
– Comparable numerical stability with Intel MKL
– Comparable speed with NVIDIA CUSPARSE 2012
• Supports large matrices
• CUSPARSE gtsv 2013
  – Cluster support is removed
• Source code for a prototype is available at http://impact.crhc.illinois.edu/
  – With a BSD-like license

45
Something We Forgot…
• How about the batch version?
– A batch version means multiple matrices of the same size
– Currently, you can simply merge them into one large matrix
  • This even works for multiple matrices of different sizes

46
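The merge trick is easy to sketch (illustrative helper, not a CUSPARSE API): concatenate the diagonals and write a 0 into the sub- and super-diagonal at every seam, so the merged matrix stays block-diagonal and the solves never mix.

```python
def merge_batch(systems):
    """systems: list of (a, b, c, d) tuples, possibly different sizes.
    a[0] and c[-1] are each system's unused boundary slots; they become
    the zero seam entries of the merged off-diagonals."""
    A, B, C, D = [], [], [], []
    for a, b, c, d in systems:
        a = list(a); a[0] = 0.0     # no coupling to the previous system
        c = list(c); c[-1] = 0.0    # no coupling to the next system
        A += a; B += list(b); C += c; D += list(d)
    return A, B, C, D

sys1 = ([0.0, 1.0], [2.0, 2.0], [1.0, 0.0], [3.0, 3.0])         # 2 rows
sys2 = ([0.0, 1.0, 1.0], [2.0, 2.0, 2.0], [1.0, 1.0, 0.0],
        [1.0, 2.0, 3.0])                                         # 3 rows
A, B, C, D = merge_batch([sys1, sys2])
```

One call on the merged 5-row system then solves both original systems at once.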
A Case Study
• Empirical Mode Decomposition (EMD)
– An adaptive time-(or spatial-)frequency analysis

• Applications
– Climate research
– Orbit research
– Structural health monitoring
– Water wave analysis
– Biomedical signal analysis
– …
47
Empirical Mode Decomposition
• Spline interpolation is the core of the sifting procedure
  [figure: Sifting Procedure — an extrema detector finds the maxima and minima of the signal; each set feeds a spline interpolation (tridiagonal solver + interpolation) producing an envelope; the mean of the two envelopes is subtracted from the signal]
  [figure: IMF Procedure — repeated sifting extracts IMFs c1(t), ..., cN(t) from x(t), with residues r1(t), ..., rN(t); mM(t) denotes the envelope mean inside each sifting step]

48
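Why a tridiagonal solver sits inside EMD: the natural cubic spline through the detected extrema reduces to a tridiagonal system in the second derivatives M_i. A minimal setup sketch, using the standard textbook formulation rather than the paper's kernel:

```python
def spline_system(x, y):
    """Return diagonals (a, b, c) and rhs d of the interior natural
    cubic spline equations
        h[i-1]*M[i-1] + 2*(h[i-1]+h[i])*M[i] + h[i]*M[i+1]
            = 6*(s[i] - s[i-1]),
    with boundary conditions M[0] = M[n-1] = 0, where h are knot
    spacings and s the secant slopes between knots."""
    h = [x[i + 1] - x[i] for i in range(len(x) - 1)]
    s = [(y[i + 1] - y[i]) / h[i] for i in range(len(h))]
    n = len(x)
    a = [h[i - 1] for i in range(1, n - 1)]            # sub-diagonal
    b = [2.0 * (h[i - 1] + h[i]) for i in range(1, n - 1)]
    c = [h[i] for i in range(1, n - 1)]                # super-diagonal
    d = [6.0 * (s[i] - s[i - 1]) for i in range(1, n - 1)]
    return a, b, c, d

# Collinear extrema: all secant slopes agree, so the rhs vanishes and
# the spline degenerates to the straight line (all M[i] = 0).
a, b, c, d = spline_system([0.0, 1.0, 2.0, 3.0], [1.0, 2.0, 3.0, 4.0])
```

With millions of extrema per channel, assembling and solving these systems is exactly the large tridiagonal workload the gtsv targets.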
Characteristics of Tridiagonal Matrices in EMD
• Large size
• Different numbers of matrices
  – Dimensions or channels of signals
    • Simultaneous tridiagonal matrices
    • 1D (1-channel) signals / 1D multi-channel signals / 2D signals
  – Variations of EMD
    • Ensemble EMD (EEMD)
      – Adding noise and performing EMD several times
    • Multi-dimensional EEMD

49
Benefits of Our gtsv
• Large size matrices
– Some previous GPU EMD work used B-splines to approximate the spline, because it could not solve large systems efficiently
– Our gtsv perfectly fits
• Multiple matrices of different sizes
– Our gtsv perfectly fits

50
Short Summary
• This is still ongoing work
• New GPU EMD source codes coming soon
– Check http://impact.crhc.illinois.edu/
• A joint project with Norden Huang’s group
– http://rcada.ncu.edu.tw

51
Q&A

• Thank you

Li-Wen Chang at SC'12 52
