A Scalable, Numerically Stable, High-Performance Tridiagonal Solver for GPUs
Comparison among Tridiagonal Solvers
• Solvers are compared on four axes: numerical stability, CPU performance, GPU performance, and cluster scalability
Numerical Stability on GPUs
CUSPARSE gtsv (2012)
• Cyclic reduction (CR), combined with parallel cyclic reduction (PCR)
[Figure: one step of cyclic reduction applied to a 4-row tridiagonal system, showing how the off-diagonal entries are eliminated]
Why Numerical Stability is Difficult on GPUs
Our gtsv
• For parallelization
– The SPIKE algorithm is applied to decompose the problem
• For high memory efficiency
– A data layout transformation is applied
Part 1: SPIKE Algorithm
• The SPIKE algorithm decomposes a tridiagonal matrix A into several diagonal blocks
SPIKE Algorithm
• D and S are defined so that A = DS, where D is the block-diagonal part of A and S is the corresponding spike matrix
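For reference, here is the standard SPIKE structure written out for a three-block partition (the notation is ours; the slides use the same D and S):

```latex
A =
\begin{pmatrix}
A_1 & B_1 &     \\
C_2 & A_2 & B_2 \\
    & C_3 & A_3
\end{pmatrix}
=
\underbrace{\begin{pmatrix} A_1 & & \\ & A_2 & \\ & & A_3 \end{pmatrix}}_{D}
\underbrace{\begin{pmatrix} I & V_1 & \\ W_2 & I & V_2 \\ & W_3 & I \end{pmatrix}}_{S},
\qquad V_i = A_i^{-1} B_i, \quad W_i = A_i^{-1} C_i .
```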
A Small Example
The 4-by-4 system AX = F is partitioned into two 2-by-2 diagonal blocks A1 and A2, with coupling blocks B1 (top right) and C2 (bottom left):
A = [ 2 1 0 0 ; 5 1 2 0 ; 0 1 3 4 ; 0 0 2 5 ],   F = [ 4 ; 13 ; 27 ; 26 ]
A1 = [ 2 1 ; 5 1 ],   A2 = [ 3 4 ; 2 5 ],   B1 = [ 0 0 ; 2 0 ],   C2 = [ 0 1 ; 0 0 ]
A Small Example
The spike entries of S are obtained from two small solves:
[ 2 1 ; 5 1 ] [ v11 ; v12 ] = [ 0 ; 2 ]   and   [ 3 4 ; 2 5 ] [ w21 ; w22 ] = [ 1 ; 0 ]
so that S = [ 1 0 v11 0 ; 0 1 v12 0 ; 0 w21 1 0 ; 0 w22 0 1 ]
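Completing the arithmetic of this example (our computation, not shown on the slide):

```latex
\begin{pmatrix} 2 & 1 \\ 5 & 1 \end{pmatrix}
\begin{pmatrix} v_{11} \\ v_{12} \end{pmatrix}
=
\begin{pmatrix} 0 \\ 2 \end{pmatrix}
\;\Rightarrow\;
v_{11} = \tfrac{2}{3},\; v_{12} = -\tfrac{4}{3};
\qquad
\begin{pmatrix} 3 & 4 \\ 2 & 5 \end{pmatrix}
\begin{pmatrix} w_{21} \\ w_{22} \end{pmatrix}
=
\begin{pmatrix} 1 \\ 0 \end{pmatrix}
\;\Rightarrow\;
w_{21} = \tfrac{5}{7},\; w_{22} = -\tfrac{2}{7}.
```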
SPIKE Algorithm
• How to build S?
– Compute the spikes Vi and Wi by solving Ai Vi = Bi and Ai Wi = Ci
• Solve DY = F
– Both steps amount to solving several independent tridiagonal systems, one per block Ai
SPIKE Algorithm
• How to solve SX = Y?
– Collect the first and last rows of all blocks into a small reduced system
– Reduction*
• Problem size: 4L → 6 (four blocks of size L each)
– Backward substitution recovers the remaining unknowns
[Reduced system matrix (spike entries from the first and last rows of each block):
1 v1L
w21 1 0 v21
w2L 0 1 v2L
w31 1 0 v31
w3L 0 1 v3L
w41 1 ]
*E. Polizzi and A. H. Sameh, “A parallel hybrid banded system solver: The SPIKE algorithm,” Parallel Computing, vol. 32, no. 2, pp. 177–194, 2006.
Part 2: Diagonal Pivoting for Tridiagonal Matrices
• How to solve each block Ai in a numerically stable way?
– Diagonal pivoting*
– Each Ai can be solved sequentially by one thread
• Why diagonal pivoting?
– Data-dependent control flow that we can handle on GPUs, without row interchanges
*J. B. Erway, R. F. Marcia, and J. Tyson, “Generalized diagonal pivoting methods for tridiagonal systems without interchanges,” IAENG International Journal of Applied Mathematics, vol. 4, no. 40, pp. 269–275, 2010.
Diagonal Pivoting
• A tridiagonal matrix A can be decomposed into LBM^T
– Instead of LDU
– L and M are unit lower triangular matrices
– B is a block diagonal matrix with 1-by-1 or 2-by-2 blocks
LBM^T Decomposition
Δ = b1·b2 − a2·c1 (the determinant of the leading 2-by-2 block, used for 2-by-2 pivots)
Diagonal Pivoting
• A can be solved by solving with L, B, and then M^T in turn
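Written out (notation ours), solving Ax = f with A = LBM^T is three short sweeps:

```latex
L z = f, \qquad B y = z, \qquad M^{T} x = y .
```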
More Optimization
• Example 4-by-4 tridiagonal matrix:
A = [ b1 c1 0 0 ; a2 b2 c2 0 ; 0 a3 b3 c3 ; 0 0 a4 b4 ]
• We store the pivoting conditions and the leading elements of B; the entries of L and M^T are not stored but recomputed on the fly
• d = 1 (1-by-1 pivot):
L2 = [ 1 ; a2/b1 ],   B2 = [ b1 ],   M2^T = [ 1  c1/b1 ]
• d = 2 (2-by-2 pivot), with Δ = b1b2 − a2c1:
L2 = [ 1 0 ; 0 1 ; −a2a3/Δ  b1a3/Δ ],   B2 = [ b1 c1 ; a2 b2 ],   M2^T = [ 1 0 −c1c2/Δ ; 0 1 b1c2/Δ ]
An Example
A = [ 0 2 0 0 ; 0.5 0 1 0 ; 0 1 0 1 ; 0 0 1 0 ]
Because b1 = 0, the first step takes a 2-by-2 pivot (d = 2), and the remaining 2-by-2 block again needs a 2-by-2 pivot:
L = [ 1 0 0 0 ; 0 1 0 0 ; 0.5 0 1 0 ; 0 0 0 1 ],   B = [ 0 2 0 0 ; 0.5 0 0 0 ; 0 0 0 1 ; 0 0 1 0 ],   M^T = [ 1 0 2 0 ; 0 1 0 0 ; 0 0 1 0 ; 0 0 0 1 ]
condition = (2, 0, 2, 0)
Pivoting Criteria
• Bunch-Kaufman-style criterion for unsymmetric cases
κ = (√5 − 1) / 2
σ = max( |c1|, |a2|, |b2|, |c2|, |a3| )
if |b1| · σ ≥ κ · |c1 a2|:
    1-by-1 pivoting
else:
    2-by-2 pivoting
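A minimal sketch of this criterion in Python (our own illustration; the function and argument names are not from the talk):

```python
import math

KAPPA = (math.sqrt(5.0) - 1.0) / 2.0  # Bunch-Kaufman constant, about 0.618

def pivot_size(b1, c1, a2, b2, c2, a3):
    """Choose a 1-by-1 or 2-by-2 diagonal pivot for the current window.

    The arguments are the entries of the leading 3-by-3 window of the
    remaining tridiagonal matrix; entries past the end are passed as 0.
    """
    sigma = max(abs(c1), abs(a2), abs(b2), abs(c2), abs(a3))
    if abs(b1) * sigma >= KAPPA * abs(c1 * a2):
        return 1  # b1 is an acceptable 1-by-1 pivot
    return 2      # otherwise take the leading 2-by-2 block as the pivot
```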
Our gtsv Algorithm
• Solving each Ai dominates the runtime
– Using diagonal pivoting
Data Layout
• Observation
– The GPU requires stride-one memory accesses to fully utilize memory bandwidth
• Contradiction
– In the gtsv interface, consecutive elements of a diagonal are stored in consecutive memory
– But each block is processed by one thread
• Solution
– Data layout transformation
Data Layout Transformation
• Local transpose
– bi's are elements of one diagonal
– Example: 6 blocks of 4 elements each (one block per SPIKE partition)
[Figure: memory addresses before and after the local transpose]
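A rough NumPy sketch of the local transpose (our own illustration; `group` is the number of neighboring SPIKE partitions whose elements get interleaved, e.g., a warp's worth of threads):

```python
import numpy as np

def local_transpose(diag, block_len, group):
    """Reorder one diagonal from the gtsv layout to the proposed layout.

    diag      : concatenated diagonal, one SPIKE partition of `block_len`
                elements after another (the gtsv interface layout)
    block_len : number of elements per partition (one partition per thread)
    group     : number of neighboring partitions to interleave

    Afterwards, element k of `group` neighboring partitions is contiguous,
    so the threads of a warp make stride-one memory accesses.
    """
    d = diag.reshape(-1, group, block_len)   # (groups, partition, element)
    return d.transpose(0, 2, 1).reshape(-1)  # (groups, element, partition)
```

For the slide's example of 6 four-element blocks, `local_transpose(diag, block_len=4, group=6)` produces the proposed layout; the inverse transform (marshaling the data back) is the same reshape/transpose with `block_len` and `group` exchanged.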
Data Layout Transformation
Runtime (ms), old layout (gtsv interface) vs. proposed layout:
• random (1-by-1 or 2-by-2 pivoting): 69.46 → 59.87
• diagonally dominant (always 1-by-1 pivoting): 38.63 → 9.68
• zero diagonal (always 2-by-2 pivoting): 34.59 → 7.07
• data marshaling overhead: 4.73
The proposed layout is about 4-5x faster for the diagonally dominant and zero-diagonal cases.
Dynamic Tiling
• Observation
– Memory accesses with a compact footprint can be handled well by the L1 cache, even when branch divergence exists
– A scattered footprint dramatically reduces memory efficiency
• Solution
– Insert barriers to regularize the memory access footprint
[Figure: per-thread working positions (threads T1-T4) over time, contrasting a compact footprint with a scattered one]
Dynamic Tiling
[Figure: per-thread working positions of T1-T4 before and after dynamic tiling; barriers inserted at the estimated tiling boundaries realign the threads and keep the footprint compact]
Dynamic Tiling
Runtime (ms), data layout only vs. dynamic tiling (with data layout):
• random: 59.87 → 16.83 (about 3.5x faster)
• diagonally dominant: 9.68 → 9.88
• zero diagonal: 7.07 → 7.13
Dynamic tiling helps the scattered (random) case and adds almost no overhead when the footprint is already compact.
Dynamic Tiling
Performance counters
[Figure: global memory load efficiency, global memory store efficiency, L1 hit rate, and warp execution efficiency (%), with and without dynamic tiling; dynamic tiling improves these counters by roughly 1.8x to 3x]
Numerical Stability
Relative Backward Error
| Matrix type | Our gtsv | Our dtsvb | CUSPARSE | MKL | Intel SPIKE | Matlab |
| 1 | 1.82E-14 | 1.97E-14 | 7.14E-12 | 1.88E-14 | 1.39E-15 | 1.96E-14 |
| 2 | 1.27E-16 | 1.27E-16 | 1.69E-16 | 1.03E-16 | 1.02E-16 | 1.03E-16 |
| 3 | 1.55E-16 | 1.52E-16 | 2.57E-16 | 1.35E-16 | 1.29E-16 | 1.35E-16 |
| 4 | 1.37E-14 | 1.22E-14 | 1.39E-12 | 3.10E-15 | 1.69E-15 | 2.78E-15 |
| 5 | 1.07E-14 | 1.13E-14 | 1.82E-14 | 1.56E-14 | 4.62E-15 | 2.93E-14 |
| 6 | 1.05E-16 | 1.06E-16 | 1.57E-16 | 9.34E-17 | 9.51E-17 | 9.34E-17 |
| 7 | 2.42E-16 | 2.46E-16 | 5.13E-16 | 2.52E-16 | 2.55E-16 | 2.27E-16 |
| 8 | 2.14E-04 | 2.14E-04 | 1.50E+10 | 3.76E-04 | 2.32E-16 | 2.14E-04 |
| 9 | 2.32E-05 | 3.90E-04 | 1.93E+08 | 3.15E-05 | 9.07E-16 | 1.19E-05 |
| 10 | 4.27E-05 | 4.83E-05 | 2.74E+05 | 3.21E-05 | 4.72E-16 | 3.21E-05 |
| 11 | 7.52E-04 | 6.59E-02 | 4.54E+11 | 2.99E-04 | 2.20E-15 | 2.28E-04 |
| 12 | 5.58E-05 | 7.95E-05 | 5.55E-04 | 2.24E-05 | 5.52E-05 | 2.24E-05 |
| 13 | 5.51E-01 | 5.45E-01 | 1.12E+16 | 3.34E-01 | 3.92E-15 | 3.08E-01 |
| 14 | 2.86E+49 | 4.49E+49 | 2.92E+51 | 1.77E+48 | 3.86E+54 | 1.77E+48 |
| 15 | 2.09E+60 | NaN | NaN | 1.47E+59 | Fail | 3.69E+58 |
| 16 | Inf | NaN | NaN | Inf | Fail | 4.68E+171 |
GPU Performance
Our Heterogeneous gtsv
• SPIKE algorithm
• OpenMP for the multicore CPU within one node
• CUDA streams for multiple GPUs
• MPI for multiple nodes
• MKL gtsv on the CPUs
• Our gtsv on the GPUs
Cluster Scalability (GPUs)
Strong Scaling
[Figure: runtime (ms, log scale) of our gtsv and our gtsv with predistributed data on 1, 2, 4, 8, and 16 GPUs]
Cluster Scalability (GPUs)
Weak Scaling
[Figure: runtime (ms, log scale) of our gtsv and our gtsv with predistributed data; the matrix grows with the GPU count, from a 2M-sized matrix on 1 GPU up to a 32M-sized matrix on 16 GPUs]
Cluster Scalability (GPUs+CPUs)
[Figure: strong scaling, strong scaling with predistributed data, and weak scaling when GPUs and CPUs are used together]
Short Summary
• Solver comparison revisited: numerical stability, CPU performance, GPU performance, and cluster scalability
More Features for Our gtsv
(in CUSPARSE 2013)
• Support 4 data types
– float (S), double (D), complex (C), double complex (Z)
• Support arbitrary sizes
• Support multiple right-hand-side vectors
• Support both general matrices (gtsv) and diagonally dominant matrices (dtsvb)
More Details
• 4 data types
– Implemented with CUSPARSE built-in operators
• dtsvb
– SPIKE + Thomas algorithm
• Arbitrary sizes
– Handled by padding (see the sketch below)
– Pad 1's on the main diagonal and 0's on the lower and upper diagonals
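A small NumPy sketch of that padding (our own illustration, not the library code): the extra rows get 1 on the main diagonal and 0 elsewhere, so they decouple from the original system and simply yield zeros.

```python
import numpy as np

def pad_tridiagonal(dl, d, du, rhs, padded_n):
    """Pad an n-row tridiagonal system (dl, d, du, rhs) up to padded_n rows.

    Follows the gtsv convention that dl[0] and du[n-1] are zero, so the
    padded rows (main diagonal 1, off-diagonals 0, right-hand side 0) do
    not couple to the original equations.
    """
    pad = padded_n - d.size
    return (np.concatenate([dl,  np.zeros(pad)]),   # lower diagonal
            np.concatenate([d,   np.ones(pad)]),    # main diagonal
            np.concatenate([du,  np.zeros(pad)]),   # upper diagonal
            np.concatenate([rhs, np.zeros(pad)]))   # right-hand side
```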
More Details
• Multiple right-hand-side vectors
Summary
• The first numerically stable tridiagonal solver for GPUs
– Comparable numerical stability to Intel MKL
– Comparable speed to NVIDIA CUSPARSE 2012
• Supports large matrices
• CUSPARSE gtsv 2013
– Cluster support is removed
• Source code for a prototype is available at http://impact.crhc.illinois.edu/
– With a BSD-like license
Something We Forgot…
• How about the batch version?
– The batch version means multiple matrices of the same size
– Currently, you can simply merge them into one large matrix (see the sketch below)
• This even works for multiple matrices of different sizes
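As an illustration (ours, not from the talk) of that merging in NumPy, using the same (dl, d, du, rhs) representation as the padding sketch above: zeroing the boundary coupling entries makes the merged matrix block diagonal, so the merged solution is just the concatenation of the individual solutions.

```python
import numpy as np

def merge_tridiagonal_systems(systems):
    """Merge independent tridiagonal systems into one large system.

    `systems` is a list of (dl, d, du, rhs) tuples, possibly of different
    sizes. The first lower-diagonal and last upper-diagonal entry of each
    system are forced to zero so neighboring systems do not couple.
    """
    dls, ds, dus, rhss = [], [], [], []
    for dl, d, du, rhs in systems:
        dl = dl.copy(); du = du.copy()
        dl[0]  = 0.0   # no coupling to the previous system
        du[-1] = 0.0   # no coupling to the next system
        dls.append(dl); ds.append(d); dus.append(du); rhss.append(rhs)
    return (np.concatenate(dls), np.concatenate(ds),
            np.concatenate(dus), np.concatenate(rhss))
```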
A Case Study
• Empirical Mode Decomposition (EMD)
– An adaptive time- (or spatial-) frequency analysis
• Applications
– Climate research
– Orbit research
– Structural health monitoring
– Water wave analysis
– Biomedical signal analysis
– …
Empirical Mode Decomposition
• Spline interpolation
[Figure: EMD flowchart. The sifting procedure detects extrema, fits spline envelopes through the maxima and minima (each spline interpolation uses a tridiagonal solver), computes the mean of the two envelopes, and subtracts it from the signal; the IMF procedure applies sifting repeatedly, producing the intrinsic mode functions ci(t) and residues ri(t).]
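To make the connection to the solver concrete, here is a rough NumPy/SciPy sketch (ours, not from the talk) of the envelope step: fitting a natural cubic spline through the detected extrema reduces to one tridiagonal solve, which is where gtsv is used on the GPU.

```python
import numpy as np
from scipy.linalg import solve_banded  # CPU stand-in for the GPU gtsv

def natural_spline_moments(x, y):
    """Second derivatives (moments) of the natural cubic spline through (x, y).

    In EMD, x and y are the locations and values of the detected maxima
    (or minima); the resulting spline is the upper (or lower) envelope.
    """
    n = len(x)
    h = np.diff(x)                                   # knot spacings
    dl = np.zeros(n); d = np.ones(n); du = np.zeros(n); rhs = np.zeros(n)
    for i in range(1, n - 1):                        # interior equations
        dl[i] = h[i - 1]
        d[i]  = 2.0 * (h[i - 1] + h[i])
        du[i] = h[i]
        rhs[i] = 6.0 * ((y[i + 1] - y[i]) / h[i] - (y[i] - y[i - 1]) / h[i - 1])
    ab = np.zeros((3, n))                            # banded storage
    ab[0, 1:] = du[:-1]                              # upper diagonal
    ab[1, :]  = d                                    # main diagonal
    ab[2, :-1] = dl[1:]                              # lower diagonal
    return solve_banded((1, 1), ab, rhs)             # the tridiagonal solve
```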
Characteristics of Tridiagonal Matrices in EMD
• Large sizes
• Different numbers of simultaneous tridiagonal matrices
– Depends on the dimensions or channels of the signals
• 1D (single-channel) signals, multi-channel 1D signals, 2D signals
– Depends on the variation of EMD
• Ensemble EMD (EEMD): adding noise and performing EMD several times
• Multi-dimensional EEMD
Benefits of Our gtsv
• Large size matrices
– Some previous GPU EMD implementations used B-splines to approximate the interpolating spline because they could not solve large systems efficiently
– Our gtsv fits this case perfectly
• Multiple matrices of different sizes
– Our gtsv fits this case perfectly as well
Short Summary
• This is still ongoing work
• New GPU EMD source code is coming soon
– Check http://impact.crhc.illinois.edu/
• A joint project with Norden Huang's group
– http://rcada.ncu.edu.tw
Q&A
• Thank you