A Scalable, Numerically Stable, High-Performance Tridiagonal Solver for GPUs
Comparison among Tridiagonal Solvers
• Solvers are compared on four axes: numerical stability, CPU performance, GPU performance, and cluster scalability
Numerical Stability on GPUs
CUSPARSE gtsv (2012)
• Cyclic reduction (CR), combined with parallel cyclic reduction (PCR)
[Figure: one step of cyclic reduction applied to a 4-row tridiagonal system, showing how the off-diagonal entries are eliminated]
Why Numerical Stability is Difficult on GPUs
Our gtsv
• For parallelization
– The SPIKE algorithm is applied to decompose the problem
• For high memory efficiency
– A data layout transformation is applied
Part 1: SPIKE Algorithm
• The SPIKE algorithm decomposes a tridiagonal matrix A into several diagonal blocks
SPIKE Algorithm
• D and S are defined so that A = DS, where D is the block-diagonal part of A and S is the corresponding spike matrix
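For reference, here is the standard SPIKE structure written out for a three-block partition (the notation is ours; the slides use the same D and S):

```latex
A =
\begin{pmatrix}
A_1 & B_1 &     \\
C_2 & A_2 & B_2 \\
    & C_3 & A_3
\end{pmatrix}
=
\underbrace{\begin{pmatrix} A_1 & & \\ & A_2 & \\ & & A_3 \end{pmatrix}}_{D}
\underbrace{\begin{pmatrix} I & V_1 & \\ W_2 & I & V_2 \\ & W_3 & I \end{pmatrix}}_{S},
\qquad V_i = A_i^{-1} B_i, \quad W_i = A_i^{-1} C_i .
```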
A Small Example
The 4-by-4 system AX = F is partitioned into two 2-by-2 diagonal blocks A1 and A2, with coupling blocks B1 (top right) and C2 (bottom left):
A = [ 2 1 0 0 ; 5 1 2 0 ; 0 1 3 4 ; 0 0 2 5 ],   F = [ 4 ; 13 ; 27 ; 26 ]
A1 = [ 2 1 ; 5 1 ],   A2 = [ 3 4 ; 2 5 ],   B1 = [ 0 0 ; 2 0 ],   C2 = [ 0 1 ; 0 0 ]
A Small Example
The spike entries of S are obtained from two small solves:
[ 2 1 ; 5 1 ] [ v11 ; v12 ] = [ 0 ; 2 ]   and   [ 3 4 ; 2 5 ] [ w21 ; w22 ] = [ 1 ; 0 ]
so that S = [ 1 0 v11 0 ; 0 1 v12 0 ; 0 w21 1 0 ; 0 w22 0 1 ]
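Completing the arithmetic of this example (our computation, not shown on the slide):

```latex
\begin{pmatrix} 2 & 1 \\ 5 & 1 \end{pmatrix}
\begin{pmatrix} v_{11} \\ v_{12} \end{pmatrix}
=
\begin{pmatrix} 0 \\ 2 \end{pmatrix}
\;\Rightarrow\;
v_{11} = \tfrac{2}{3},\; v_{12} = -\tfrac{4}{3};
\qquad
\begin{pmatrix} 3 & 4 \\ 2 & 5 \end{pmatrix}
\begin{pmatrix} w_{21} \\ w_{22} \end{pmatrix}
=
\begin{pmatrix} 1 \\ 0 \end{pmatrix}
\;\Rightarrow\;
w_{21} = \tfrac{5}{7},\; w_{22} = -\tfrac{2}{7}.
```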
SPIKE Algorithm
• How to build S?
– Compute the spikes Vi and Wi by solving Ai Vi = Bi and Ai Wi = Ci
• Solve DY = F
– Both steps amount to solving several independent tridiagonal systems, one per block Ai
SPIKE Algorithm
• How to solve SX = Y?
– Collect the first and last rows of all blocks into a small reduced system
– Reduction*
• Problem size: 4L → 6 (four blocks of size L each)
– Backward substitution recovers the remaining unknowns
[Reduced system matrix (spike entries from the first and last rows of each block):
1 v1L
w21 1 0 v21
w2L 0 1 v2L
w31 1 0 v31
w3L 0 1 v3L
w41 1 ]
*E. Polizzi and A. H. Sameh, “A parallel hybrid banded system solver: The SPIKE algorithm,” Parallel Computing, vol. 32, no. 2, pp. 177–194, 2006.
Part 2: Diagonal Pivoting for Tridiagonal Matrices
• How to solve each block Ai in a numerically stable way?
– Diagonal pivoting*
– Each Ai can be solved sequentially by one thread
• Why diagonal pivoting?
– Data-dependent control flow that we can handle on GPUs, without row interchanges
*J. B. Erway, R. F. Marcia, and J. Tyson, “Generalized diagonal pivoting methods for tridiagonal systems without interchanges,” IAENG International Journal of Applied Mathematics, vol. 4, no. 40, pp. 269–275, 2010.
Diagonal Pivoting
• A tridiagonal matrix A can be decomposed into LBM^T
– Instead of LDU
– L and M are unit lower triangular matrices
– B is a block diagonal matrix with 1-by-1 or 2-by-2 blocks
LBM^T Decomposition
Δ = b1·b2 − a2·c1 (the determinant of the leading 2-by-2 block, used for 2-by-2 pivots)
Diagonal Pivoting
• A can be solved by solving with L, B, and then M^T in turn
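Written out (notation ours), solving Ax = f with A = LBM^T is three short sweeps:

```latex
L z = f, \qquad B y = z, \qquad M^{T} x = y .
```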
More Optimization
• Example 4-by-4 tridiagonal matrix:
A = [ b1 c1 0 0 ; a2 b2 c2 0 ; 0 a3 b3 c3 ; 0 0 a4 b4 ]
• We store the pivoting conditions and the leading elements of B; the entries of L and M^T are not stored but recomputed on the fly
• d = 1 (1-by-1 pivot):
L2 = [ 1 ; a2/b1 ],   B2 = [ b1 ],   M2^T = [ 1  c1/b1 ]
• d = 2 (2-by-2 pivot), with Δ = b1b2 − a2c1:
L2 = [ 1 0 ; 0 1 ; −a2a3/Δ  b1a3/Δ ],   B2 = [ b1 c1 ; a2 b2 ],   M2^T = [ 1 0 −c1c2/Δ ; 0 1 b1c2/Δ ]
An Example
A = [ 0 2 0 0 ; 0.5 0 1 0 ; 0 1 0 1 ; 0 0 1 0 ]
Because b1 = 0, the first step takes a 2-by-2 pivot (d = 2), and the remaining 2-by-2 block again needs a 2-by-2 pivot:
L = [ 1 0 0 0 ; 0 1 0 0 ; 0.5 0 1 0 ; 0 0 0 1 ],   B = [ 0 2 0 0 ; 0.5 0 0 0 ; 0 0 0 1 ; 0 0 1 0 ],   M^T = [ 1 0 2 0 ; 0 1 0 0 ; 0 0 1 0 ; 0 0 0 1 ]
condition = (2, 0, 2, 0)
Pivoting Criteria
• Bunch-Kaufman-style criterion for unsymmetric cases
κ = (√5 − 1) / 2
σ = max( |c1|, |a2|, |b2|, |c2|, |a3| )
if |b1| · σ ≥ κ · |c1 a2|:
    1-by-1 pivoting
else:
    2-by-2 pivoting
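A minimal sketch of this criterion in Python (our own illustration; the function and argument names are not from the talk):

```python
import math

KAPPA = (math.sqrt(5.0) - 1.0) / 2.0  # Bunch-Kaufman constant, about 0.618

def pivot_size(b1, c1, a2, b2, c2, a3):
    """Choose a 1-by-1 or 2-by-2 diagonal pivot for the current window.

    The arguments are the entries of the leading 3-by-3 window of the
    remaining tridiagonal matrix; entries past the end are passed as 0.
    """
    sigma = max(abs(c1), abs(a2), abs(b2), abs(c2), abs(a3))
    if abs(b1) * sigma >= KAPPA * abs(c1 * a2):
        return 1  # b1 is an acceptable 1-by-1 pivot
    return 2      # otherwise take the leading 2-by-2 block as the pivot
```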
Our gtsv Algorithm
• Solving each Ai dominates the runtime
– Using diagonal pivoting
Data Layout
• Observation
– The GPU requires stride-one memory accesses to fully utilize memory bandwidth
• Contradiction
– In the gtsv interface, consecutive elements of a diagonal are stored in consecutive memory
– But each block is processed by one thread
• Solution
– Data layout transformation
Data Layout Transformation
• Local transpose
– bi's are elements of one diagonal
– Example: 6 blocks of 4 elements each (one block per SPIKE partition)
[Figure: memory addresses before and after the local transpose]
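A rough NumPy sketch of the local transpose (our own illustration; `group` is the number of neighboring SPIKE partitions whose elements get interleaved, e.g., a warp's worth of threads):

```python
import numpy as np

def local_transpose(diag, block_len, group):
    """Reorder one diagonal from the gtsv layout to the proposed layout.

    diag      : concatenated diagonal, one SPIKE partition of `block_len`
                elements after another (the gtsv interface layout)
    block_len : number of elements per partition (one partition per thread)
    group     : number of neighboring partitions to interleave

    Afterwards, element k of `group` neighboring partitions is contiguous,
    so the threads of a warp make stride-one memory accesses.
    """
    d = diag.reshape(-1, group, block_len)   # (groups, partition, element)
    return d.transpose(0, 2, 1).reshape(-1)  # (groups, element, partition)
```

For the slide's example of 6 four-element blocks, `local_transpose(diag, block_len=4, group=6)` produces the proposed layout; the inverse transform (marshaling the data back) is the same reshape/transpose with `block_len` and `group` exchanged.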
Data Layout Transformation
Runtime (ms), old layout (gtsv interface) vs. proposed layout:
• random (1-by-1 or 2-by-2 pivoting): 69.46 → 59.87
• diagonally dominant (always 1-by-1 pivoting): 38.63 → 9.68
• zero diagonal (always 2-by-2 pivoting): 34.59 → 7.07
• data marshaling overhead: 4.73
The proposed layout is about 4-5x faster for the diagonally dominant and zero-diagonal cases.
Dynamic Tiling
• Observation
– Memory accesses with a compact footprint can be handled well by the L1 cache, even when branch divergence exists
– A scattered footprint dramatically reduces memory efficiency
• Solution
– Insert barriers to regularize the memory access footprint
[Figure: per-thread working positions (threads T1-T4) over time, contrasting a compact footprint with a scattered one]
Dynamic Tiling
[Figure: per-thread working positions of T1-T4 before and after dynamic tiling; barriers inserted at the estimated tiling boundaries realign the threads and keep the footprint compact]
Dynamic Tiling
Runtime (ms), data layout only vs. dynamic tiling (with data layout):
• random: 59.87 → 16.83 (about 3.5x faster)
• diagonally dominant: 9.68 → 9.88
• zero diagonal: 7.07 → 7.13
Dynamic tiling helps the scattered (random) case and adds almost no overhead when the footprint is already compact.
Dynamic Tiling
Performance counters
[Figure: global memory load efficiency, global memory store efficiency, L1 hit rate, and warp execution efficiency (%), with and without dynamic tiling; dynamic tiling improves these counters by roughly 1.8x to 3x]
Numerical Stability
Relative Backward Error
| Matrix type | Our gtsv | Our dtsvb | CUSPARSE | MKL | Intel SPIKE | Matlab |
| 1 | 1.82E-14 | 1.97E-14 | 7.14E-12 | 1.88E-14 | 1.39E-15 | 1.96E-14 |
| 2 | 1.27E-16 | 1.27E-16 | 1.69E-16 | 1.03E-16 | 1.02E-16 | 1.03E-16 |
| 3 | 1.55E-16 | 1.52E-16 | 2.57E-16 | 1.35E-16 | 1.29E-16 | 1.35E-16 |
| 4 | 1.37E-14 | 1.22E-14 | 1.39E-12 | 3.10E-15 | 1.69E-15 | 2.78E-15 |
| 5 | 1.07E-14 | 1.13E-14 | 1.82E-14 | 1.56E-14 | 4.62E-15 | 2.93E-14 |
| 6 | 1.05E-16 | 1.06E-16 | 1.57E-16 | 9.34E-17 | 9.51E-17 | 9.34E-17 |
| 7 | 2.42E-16 | 2.46E-16 | 5.13E-16 | 2.52E-16 | 2.55E-16 | 2.27E-16 |
| 8 | 2.14E-04 | 2.14E-04 | 1.50E+10 | 3.76E-04 | 2.32E-16 | 2.14E-04 |
| 9 | 2.32E-05 | 3.90E-04 | 1.93E+08 | 3.15E-05 | 9.07E-16 | 1.19E-05 |
| 10 | 4.27E-05 | 4.83E-05 | 2.74E+05 | 3.21E-05 | 4.72E-16 | 3.21E-05 |
| 11 | 7.52E-04 | 6.59E-02 | 4.54E+11 | 2.99E-04 | 2.20E-15 | 2.28E-04 |
| 12 | 5.58E-05 | 7.95E-05 | 5.55E-04 | 2.24E-05 | 5.52E-05 | 2.24E-05 |
| 13 | 5.51E-01 | 5.45E-01 | 1.12E+16 | 3.34E-01 | 3.92E-15 | 3.08E-01 |
| 14 | 2.86E+49 | 4.49E+49 | 2.92E+51 | 1.77E+48 | 3.86E+54 | 1.77E+48 |
| 15 | 2.09E+60 | NaN | NaN | 1.47E+59 | Fail | 3.69E+58 |
| 16 | Inf | NaN | NaN | Inf | Fail | 4.68E+171 |
GPU Performance
Our Heterogeneous gtsv
• SPIKE algorithm
• OpenMP for the multicore CPU within one node
• CUDA streams for multiple GPUs
• MPI for multiple nodes
• MKL gtsv on the CPUs
• Our gtsv on the GPUs
Cluster Scalability (GPUs)
Strong Scaling
[Figure: runtime (ms, log scale) of our gtsv and our gtsv with predistributed data on 1, 2, 4, 8, and 16 GPUs]
Cluster Scalability (GPUs)
Weak Scaling
[Figure: runtime (ms, log scale) of our gtsv and our gtsv with predistributed data; the matrix grows with the GPU count, from a 2M-sized matrix on 1 GPU up to a 32M-sized matrix on 16 GPUs]
Cluster Scalability (GPUs+CPUs)
[Figure: strong scaling, strong scaling with predistributed data, and weak scaling when GPUs and CPUs are used together]
Short Summary
• Solver comparison revisited: numerical stability, CPU performance, GPU performance, and cluster scalability
More Features for Our gtsv
(in CUSPARSE 2013)
• Support 4 data types
– float (S), double (D), complex (C), double complex (Z)
• Support arbitrary sizes
• Support multiple right-hand-side vectors
• Support both general matrices (gtsv) and diagonally dominant matrices (dtsvb)
More Details
• 4 data types
– Implemented with CUSPARSE built-in operators
• dtsvb
– SPIKE + Thomas algorithm
• Arbitrary sizes
– Handled by padding (see the sketch below)
– Pad 1's on the main diagonal and 0's on the lower and upper diagonals
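A small NumPy sketch of that padding (our own illustration, not the library code): the extra rows get 1 on the main diagonal and 0 elsewhere, so they decouple from the original system and simply yield zeros.

```python
import numpy as np

def pad_tridiagonal(dl, d, du, rhs, padded_n):
    """Pad an n-row tridiagonal system (dl, d, du, rhs) up to padded_n rows.

    Follows the gtsv convention that dl[0] and du[n-1] are zero, so the
    padded rows (main diagonal 1, off-diagonals 0, right-hand side 0) do
    not couple to the original equations.
    """
    pad = padded_n - d.size
    return (np.concatenate([dl,  np.zeros(pad)]),   # lower diagonal
            np.concatenate([d,   np.ones(pad)]),    # main diagonal
            np.concatenate([du,  np.zeros(pad)]),   # upper diagonal
            np.concatenate([rhs, np.zeros(pad)]))   # right-hand side
```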
More Details
• Multiple right-hand-side vectors
Summary
• The first numerically stable tridiagonal solver for GPUs
– Comparable numerical stability to Intel MKL
– Comparable speed to NVIDIA CUSPARSE 2012
• Supports large matrices
• CUSPARSE gtsv 2013
– Cluster support is removed
• Source code for a prototype is available at http://impact.crhc.illinois.edu/
– With a BSD-like license
Something We Forgot…
• How about the batch version?
– The batch version means multiple matrices of the same size
– Currently, you can simply merge them into one large matrix (see the sketch below)
• This even works for multiple matrices of different sizes
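As an illustration (ours, not from the talk) of that merging in NumPy, using the same (dl, d, du, rhs) representation as the padding sketch above: zeroing the boundary coupling entries makes the merged matrix block diagonal, so the merged solution is just the concatenation of the individual solutions.

```python
import numpy as np

def merge_tridiagonal_systems(systems):
    """Merge independent tridiagonal systems into one large system.

    `systems` is a list of (dl, d, du, rhs) tuples, possibly of different
    sizes. The first lower-diagonal and last upper-diagonal entry of each
    system are forced to zero so neighboring systems do not couple.
    """
    dls, ds, dus, rhss = [], [], [], []
    for dl, d, du, rhs in systems:
        dl = dl.copy(); du = du.copy()
        dl[0]  = 0.0   # no coupling to the previous system
        du[-1] = 0.0   # no coupling to the next system
        dls.append(dl); ds.append(d); dus.append(du); rhss.append(rhs)
    return (np.concatenate(dls), np.concatenate(ds),
            np.concatenate(dus), np.concatenate(rhss))
```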
A Case Study
• Empirical Mode Decomposition (EMD)
– An adaptive time- (or spatial-) frequency analysis
• Applications
– Climate research
– Orbit research
– Structural health monitoring
– Water wave analysis
– Biomedical signal analysis
– …
Empirical Mode Decomposition
• Spline interpolation
[Figure: EMD flowchart. The sifting procedure detects extrema, fits spline envelopes through the maxima and minima (each spline interpolation uses a tridiagonal solver), computes the mean of the two envelopes, and subtracts it from the signal; the IMF procedure applies sifting repeatedly, producing the intrinsic mode functions ci(t) and residues ri(t).]
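To make the connection to the solver concrete, here is a rough NumPy/SciPy sketch (ours, not from the talk) of the envelope step: fitting a natural cubic spline through the detected extrema reduces to one tridiagonal solve, which is where gtsv is used on the GPU.

```python
import numpy as np
from scipy.linalg import solve_banded  # CPU stand-in for the GPU gtsv

def natural_spline_moments(x, y):
    """Second derivatives (moments) of the natural cubic spline through (x, y).

    In EMD, x and y are the locations and values of the detected maxima
    (or minima); the resulting spline is the upper (or lower) envelope.
    """
    n = len(x)
    h = np.diff(x)                                   # knot spacings
    dl = np.zeros(n); d = np.ones(n); du = np.zeros(n); rhs = np.zeros(n)
    for i in range(1, n - 1):                        # interior equations
        dl[i] = h[i - 1]
        d[i]  = 2.0 * (h[i - 1] + h[i])
        du[i] = h[i]
        rhs[i] = 6.0 * ((y[i + 1] - y[i]) / h[i] - (y[i] - y[i - 1]) / h[i - 1])
    ab = np.zeros((3, n))                            # banded storage
    ab[0, 1:] = du[:-1]                              # upper diagonal
    ab[1, :]  = d                                    # main diagonal
    ab[2, :-1] = dl[1:]                              # lower diagonal
    return solve_banded((1, 1), ab, rhs)             # the tridiagonal solve
```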
Characteristics of Tridiagonal Matrices in EMD
• Large sizes
• Different numbers of simultaneous tridiagonal matrices
– Depends on the dimensions or channels of the signals
• 1D (single-channel) signals, multi-channel 1D signals, 2D signals
– Depends on the variation of EMD
• Ensemble EMD (EEMD): adding noise and performing EMD several times
• Multi-dimensional EEMD
Benefits of Our gtsv
• Large size matrices
– Some previous GPU EMD implementations used B-splines to approximate the interpolating spline because they could not solve large systems efficiently
– Our gtsv fits this case perfectly
• Multiple matrices of different sizes
– Our gtsv fits this case perfectly as well
Short Summary
• This is still ongoing work
• New GPU EMD source code is coming soon
– Check http://impact.crhc.illinois.edu/
• A joint project with Norden Huang's group
– http://rcada.ncu.edu.tw
Q&A
• Thank you