hpc_scaling
Victor Eijkhout
Fall 2023
Justification
Collectives as building blocks; complexity
Collectives
Gathering and spreading information:
Basic cases:
Collective scenarios
Simple model of parallel computation
• α: message latency
• β: time per word (inverse of bandwidth)
• γ: time per floating point operation
Cost of sending a message of n words and performing m operations:

cost = α + β · n + γ · m
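A minimal sketch of this cost model in Python; the α, β, γ values below are illustrative placeholders, not measured machine parameters.

def message_cost(n_words, alpha=1e-6, beta=1e-9):
    """Time to send one message of n_words words: alpha + beta*n."""
    return alpha + beta * n_words

def compute_cost(m_flops, gamma=1e-10):
    """Time to perform m_flops floating point operations: gamma*m."""
    return gamma * m_flops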
Model for collectives
• One simultaneous send and receive per step: doubling of active processors
• Collectives have an α log2 p cost component
Broadcast
       t=0                     t=1                     t=2
p0     x0↓, x1↓, x2↓, x3↓      x0↓, x1↓, x2↓, x3↓      x0, x1, x2, x3
p1                             x0↓, x1↓, x2↓, x3↓      x0, x1, x2, x3
p2                                                     x0, x1, x2, x3
p3                                                     x0, x1, x2, x3
On t = 0, p0 sends to p1; on t = 1, p0 and p1 send to p2 and p3.
Optimal complexity:
⌈log2 p⌉α + nβ.
Actual complexity:
⌈log2 p⌉(α + nβ).
Good enough for short vectors.
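A small sketch contrasting the optimal broadcast bound with the spanning-tree cost above: for long messages the tree is roughly log2(p) times slower, which motivates the long-vector algorithm that follows. The α, β values are made-up placeholders.

from math import ceil, log2

def bcast_lower_bound(p, n, alpha, beta):
    # optimal complexity: ceil(log2 p)*alpha + n*beta
    return ceil(log2(p)) * alpha + n * beta

def bcast_spanning_tree(p, n, alpha, beta):
    # actual spanning-tree complexity: ceil(log2 p)*(alpha + n*beta)
    return ceil(log2(p)) * (alpha + n * beta)

# illustrative parameters: p = 64, alpha = 1e-6 s, beta = 1e-9 s/word
for n in (1, 1_000_000):
    ratio = bcast_spanning_tree(64, n, 1e-6, 1e-9) / bcast_lower_bound(64, n, 1e-6, 1e-9)
    print(n, ratio)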
Long vector broadcast
Start with a scatter:
       t=0                  t=1                  t=2                  t=3
p0     x0↓, x1, x2, x3      x0, x1↓, x2, x3      x0, x1, x2↓, x3      x0, x1, x2, x3↓
p1                          x1
p2                                               x2
p3                                                                    x3
Tscatter(N, P) = (p − 1)α + (p − 1) · (N/p) · β.
Bucket brigade
Long vector broadcast
After the scatter do a bucket-allgather:
       t=0       t=1            etcetera
p0     x0↓       x0, x3↓        x0, x2, x3
p1     x1↓       x0↓, x1        x0, x1, x3
p2     x2↓       x1↓, x2        x0, x1, x2
p3     x3↓       x2↓, x3        x1, x2, x3
Each partial message gets sent p − 1 times, so this stage also has a complexity of

Tbucket(N, P) = (p − 1)α + (p − 1) · (N/p) · β.

Better than the spanning tree broadcast if N is large.
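A small Python sketch of when scatter plus bucket-allgather beats the single spanning-tree broadcast, using the cost formulas above; the α, β values are placeholders.

from math import ceil, log2

def bcast_tree(p, N, alpha, beta):
    # spanning-tree broadcast: ceil(log2 p) * (alpha + N*beta)
    return ceil(log2(p)) * (alpha + N * beta)

def bcast_scatter_allgather(p, N, alpha, beta):
    # scatter followed by bucket-allgather, each (p-1)*alpha + (p-1)*(N/p)*beta
    stage = (p - 1) * alpha + (p - 1) * (N / p) * beta
    return 2 * stage

for N in (100, 10_000, 1_000_000):        # message lengths in words
    print(N, bcast_tree(64, N, 1e-6, 1e-9),
             bcast_scatter_allgather(64, N, 1e-6, 1e-9))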
Reduce
Optimal complexity:

⌈log2 p⌉α + nβ + ((p − 1)/p) γn.
Spanning tree algorithm:
(Notation: xi^(j:k) denotes xi reduced over the contributions of processors j through k.)

       t=1                                     t=2                                          t=3
p0     x0^(0), x1^(0), x2^(0), x3^(0)          x0^(0:1), x1^(0:1), x2^(0:1), x3^(0:1)       x0^(0:3), x1^(0:3), x2^(0:3), x3^(0:3)
p1     x0^(1)↑, x1^(1)↑, x2^(1)↑, x3^(1)↑
p2     x0^(2), x1^(2), x2^(2), x3^(2)          x0^(2:3)↑, x1^(2:3)↑, x2^(2:3)↑, x3^(2:3)↑
p3     x0^(3)↑, x1^(3)↑, x2^(3)↑, x3^(3)↑
Running time:

⌈log2 p⌉(α + nβ + ((p − 1)/p) γn).
Good enough for short vectors.
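As an illustration, a small Python sketch of the spanning-tree reduce pattern in the table above, with plain numbers standing in for each processor's data; the processor count is assumed to be a power of two.

def tree_reduce(values):
    vals = list(values)              # vals[i] plays the role of processor i's contribution
    p = len(vals)                    # assumed to be a power of two
    dist = 1
    while dist < p:
        # processor i+dist sends its partial result "up" to processor i
        for i in range(0, p, 2 * dist):
            vals[i] += vals[i + dist]
        dist *= 2
    return vals[0]                   # processor 0 ends up with the full reduction

print(tree_reduce([1, 2, 3, 4]))     # 10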
Allreduce
Allreduce ≡ Reduce+Broadcast
       t=1                                       t=2                                               t=3
p0     x0^(0)↓, x1^(0)↓, x2^(0)↓, x3^(0)↓        x0^(0:1)↓↓, x1^(0:1)↓↓, x2^(0:1)↓↓, x3^(0:1)↓↓     x0^(0:3), x1^(0:3), x2^(0:3), x3^(0:3)
p1     x0^(1)↑, x1^(1)↑, x2^(1)↑, x3^(1)↑        x0^(0:1)↓↓, x1^(0:1)↓↓, x2^(0:1)↓↓, x3^(0:1)↓↓     x0^(0:3), x1^(0:3), x2^(0:3), x3^(0:3)
p2     x0^(2)↓, x1^(2)↓, x2^(2)↓, x3^(2)↓        x0^(2:3)↑↑, x1^(2:3)↑↑, x2^(2:3)↑↑, x3^(2:3)↑↑     x0^(0:3), x1^(0:3), x2^(0:3), x3^(0:3)
p3     x0^(3)↑, x1^(3)↑, x2^(3)↑, x3^(3)↑        x0^(2:3)↑↑, x1^(2:3)↑↑, x2^(2:3)↑↑, x3^(2:3)↑↑     x0^(0:3), x1^(0:3), x2^(0:3), x3^(0:3)
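A sketch of the pairwise-exchange (butterfly) pattern shown in the table: in step k, each processor exchanges its partial result with the processor at distance 2^k, so after log2(p) steps every processor holds the full sum. Plain Python, with p assumed to be a power of two.

def butterfly_allreduce(values):
    vals = list(values)                   # vals[i] is processor i's data
    p = len(vals)                         # assumed to be a power of two
    dist = 1
    while dist < p:
        # every processor i exchanges with its partner i XOR dist and adds
        vals = [vals[i] + vals[i ^ dist] for i in range(p)]
        dist *= 2
    return vals                           # all entries are now the same total

print(butterfly_allreduce([1, 2, 3, 4]))  # [10, 10, 10, 10]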
Allgather
Gather n elements: each processor owns n/p;
optimal running time
⌈log2 p⌉α + ((p − 1)/p) nβ.
       t=1       t=2            t=3
p0     x0↓       x0, x1↓        x0, x1, x2, x3
p1     x1↑       x0, x1↓        x0, x1, x2, x3
p2     x2↓       x2, x3↑        x0, x1, x2, x3
p3     x3↑       x2, x3↑        x0, x1, x2, x3
Same time as gather, half of gather-and-broadcast.
Reduce-scatter
       t=1                                         t=2                              t=3
p0     x0^(0), x1^(0), x2^(0)↓, x3^(0)↓            x0^(0:2:2), x1^(0:2:2)↓          x0^(0:3)
p1     x0^(1), x1^(1), x2^(1)↓, x3^(1)↓            x0^(1:3:2)↑, x1^(1:3:2)          x1^(0:3)
p2     x0^(2)↑, x1^(2)↑, x2^(2), x3^(2)            x2^(0:2:2), x3^(0:2:2)↓          x2^(0:3)
p3     x0^(3)↑, x1^(3)↑, x2^(3), x3^(3)            x2^(1:3:2)↑, x3^(1:3:2)          x3^(0:3)

⌈log2 p⌉α + ((p − 1)/p) n(β + γ).
Efficiency and scaling
Speedup
• Single processor time T1, on p processors Tp
• speedup is Sp = T1/Tp, Sp ≤ p
• efficiency is Ep = Sp/p, 0 < Ep ≤ 1
Many caveats
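A trivial worked example in Python; the timings below are hypothetical, purely to show how speedup and efficiency are computed.

def speedup(t1, tp):
    return t1 / tp

def efficiency(t1, tp, p):
    return speedup(t1, tp) / p

t1, tp, p = 100.0, 8.0, 16          # hypothetical runtimes (seconds) and processor count
print(speedup(t1, tp))              # 12.5
print(efficiency(t1, tp, p))        # 0.78125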
Limits on speedup/efficiency
Scaling
Scalability analysis of dense matrix-vector product
Parallel matrix-vector product; general
• Assume a division by block rows
• Every processor p has a set of row indices Ip
Mvp on processor p: yi = Σj aij xj for i ∈ Ip
Local and remote operations
Local and remote parts:
How to deal with remote parts
Dense MVP
Cost computation 1.
Algorithm:
Allgather
• latency: α log2 P
• transmission time: receiving n/P elements from each of P − 1 processors, so ((P − 1)/P) nβ
Algorithm with cost:
Parallel efficiency
Speedup:

Sp^{1D-row}(n) = T1(n) / Tp^{1D-row}(n)
              = 2n²γ / ( 2(n²/p)γ + log2(p)·α + n·β )
              = p / ( 1 + (p·log2(p))/(2n²)·(α/γ) + (p/(2n))·(β/γ) )

Efficiency:

Ep^{1D-row}(n) = Sp^{1D-row}(n) / p
              = 1 / ( 1 + (p·log2(p))/(2n²)·(α/γ) + (p/(2n))·(β/γ) ).
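A short Python sketch evaluating the efficiency formula above for a few problem sizes and processor counts; α, β, γ are placeholder machine parameters, not measurements.

from math import log2

def efficiency_1d_row(n, p, alpha=1e-6, beta=1e-9, gamma=1e-10):
    # E_p^{1D-row}(n) as derived above
    return 1.0 / (1.0
                  + p * log2(p) / (2 * n**2) * (alpha / gamma)
                  + p / (2 * n) * (beta / gamma))

for n, p in [(1_000, 16), (10_000, 16), (10_000, 1_024)]:
    print(n, p, round(efficiency_1d_row(n, p), 3))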
Optimistic scaling
For fixed p and growing problem size n:

Ep^{1D-row}(n) = 1 / ( 1 + (p·log2(p))/(2n²)·(α/γ) + (p/(2n))·(β/γ) )

Roughly Ep ∼ 1 − n⁻¹
Strong scaling
Problem fixed, p → ∞:

Ep^{1D-row}(n) = 1 / ( 1 + (p·log2(p))/(2n²)·(α/γ) + (p/(2n))·(β/γ) )

Roughly Ep ∼ p⁻¹
Weak scaling
Memory per processor fixed: M = n²/p

Ep^{1D-row}(n) = 1 / ( 1 + (p·log2(p))/(2n²)·(α/γ) + (p/(2n))·(β/γ) )
              = 1 / ( 1 + (log2(p))/(2M)·(α/γ) + (√p/(2√M))·(β/γ) )

Does not scale: Ep ∼ 1/√p
Problem is in the β term: too much communication.
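A sketch of this weak-scaling behaviour: with the per-processor memory M = n²/p held fixed, the efficiency formula above decays roughly like 1/√p. Parameter values are again placeholders.

from math import log2, sqrt

def weak_efficiency_1d_row(p, M, alpha=1e-6, beta=1e-9, gamma=1e-10):
    # substitute n^2 = M*p into E_p^{1D-row}(n)
    return 1.0 / (1.0
                  + log2(p) / (2 * M) * (alpha / gamma)
                  + sqrt(p) / (2 * sqrt(M)) * (beta / gamma))

for p in (4, 64, 1_024, 16_384):
    print(p, round(weak_efficiency_1d_row(p, M=1e8), 3))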
Two-dimensional partitioning
[Figure: a 12 × 12 matrix (aij) distributed over a two-dimensional processor grid; the input elements xj and the output elements yi are each spread over the processors.]
Two-dimensional partitioning
Processor grid p = r × c; assume r, c ≈ √p.
[Figure: the same matrix on the r × c processor grid; each xj lives on one processor of its block column and each yi on one processor of its block row, with arrows indicating the xj being gathered within their processor column.]
Key to the algorithm
• Consider block (i, j):
• it needs to multiply by the x elements in block column j
• it produces part of the result of block row i
Algorithm
Analysis 1.
Reduce-scatter
Time:

⌈log2 p⌉α + ((p − 1)/p) n(β + γ).
Step                                     Cost (lower bound)
Allgather xi's within columns            ⌈log2(r)⌉α + ((r − 1)/p) nβ  ≈  log2(r)α + (n/c)β
Perform local matrix-vector multiply     ≈ 2(n²/p)γ
Reduce-scatter yi's within rows          ⌈log2(c)⌉α + ((c − 1)/p) nβ + ((c − 1)/p) nγ  ≈  log2(c)α + (n/r)β + (n/r)γ
Efficiency
Let r = c = √p, then

Ep^{√p×√p}(n) = 1 / ( 1 + (p·log2(p))/(2n²)·(α/γ) + (√p/(2n))·((2β + γ)/γ) )
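A Python sketch comparing the 1D-row and √p × √p efficiencies derived above; the communication term of the 2D version grows only like √p instead of p. Machine parameters are illustrative placeholders.

from math import log2, sqrt

ALPHA, BETA, GAMMA = 1e-6, 1e-9, 1e-10    # assumed latency, per-word, per-flop times

def eff_1d_row(n, p):
    return 1 / (1 + p * log2(p) / (2 * n**2) * (ALPHA / GAMMA)
                  + p / (2 * n) * (BETA / GAMMA))

def eff_2d_grid(n, p):
    return 1 / (1 + p * log2(p) / (2 * n**2) * (ALPHA / GAMMA)
                  + sqrt(p) / (2 * n) * ((2 * BETA + GAMMA) / GAMMA))

n = 10_000
for p in (16, 256, 4_096):
    print(p, round(eff_1d_row(n, p), 3), round(eff_2d_grid(n, p), 3))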
Strong scaling
No strong scaling
Weak scaling
With M = n²/p fixed, for p → ∞ the efficiency is ≈ 1/ log2 p:
only slowly decreasing.
LU factorizations
Boundary value problems
Consider in 1D:

    −u″(x) = f(x, u, u′)    for x ∈ [a, b]
    u(a) = ua, u(b) = ub

and in 2D:

    −uxx(x̄) − uyy(x̄) = f(x̄)    for x̄ ∈ Ω = [0,1]²
    u(x̄) = u0                   for x̄ ∈ ∂Ω
Approximation of 2nd order derivatives
Taylor series (write h for δx):
u(x + h) = u(x) + u′(x)h + u″(x)h²/2! + u‴(x)h³/3! + u⁽⁴⁾(x)h⁴/4! + u⁽⁵⁾(x)h⁵/5! + ···

and

u(x − h) = u(x) − u′(x)h + u″(x)h²/2! − u‴(x)h³/3! + u⁽⁴⁾(x)h⁴/4! − u⁽⁵⁾(x)h⁵/5! + ···

Add the two expansions:

u(x + h) + u(x − h) = 2u(x) + u″(x)h² + u⁽⁴⁾(x)h⁴/12 + ···

so

u″(x) = ( u(x + h) − 2u(x) + u(x − h) ) / h²  −  u⁽⁴⁾(x)h²/12 + ···

Numerical scheme:

−( u(x + h) − 2u(x) + u(x − h) ) / h² = f(x, u(x), u′(x))

(2nd order PDEs are very common!)
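A quick numerical check (a sketch, using the sine function as a test case) that the central difference above is second-order accurate: halving h should shrink the error by about a factor of four.

from math import sin

def second_derivative_fd(u, x, h):
    # central difference (u(x+h) - 2u(x) + u(x-h)) / h^2
    return (u(x + h) - 2 * u(x) + u(x - h)) / h**2

x = 0.7
exact = -sin(x)                      # for u = sin, u'' = -sin
for h in (0.1, 0.05, 0.025):
    print(h, abs(second_derivative_fd(sin, x, h) - exact))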
This leads to linear algebra
−uxx = f   →   ( 2u(x) − u(x + h) − u(x − h) ) / h² = f(x, u(x), u′(x))

Equally spaced points on [0, 1]: xk = kh where h = 1/(n + 1); then for k = 1, …, n:

−uk−1 + 2uk − uk+1 = h² f(xk, uk, u′k)
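A sketch (using numpy for convenience) of the resulting tridiagonal matrix; a real code would of course store it in a sparse or banded format.

import numpy as np

def laplace_1d(n):
    # n x n tridiagonal matrix with 2 on the diagonal and -1 off the diagonal
    A = np.zeros((n, n))
    for k in range(n):
        A[k, k] = 2.0
        if k > 0:
            A[k, k - 1] = -1.0
        if k < n - 1:
            A[k, k + 1] = -1.0
    return A

print(laplace_1d(5))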
Matrix properties
Sparse matrix in 2D case
The sparse matrices so far were tridiagonal: that holds only in the 1D case.
Difference equation:

4u(x, y) − u(x + h, y) − u(x − h, y) − u(x, y + h) − u(x, y − h) = h² f(x, y)

4uk − uk−1 − uk+1 − uk−n − uk+n = fk
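A companion sketch for the 2D case: the five-point matrix on an n × n grid of unknowns in the natural (lexicographic) ordering, again assembled densely with numpy purely for illustration.

import numpy as np

def laplace_2d(n):
    N = n * n                                  # one unknown per grid point
    A = np.zeros((N, N))
    for k in range(N):
        A[k, k] = 4.0
        if k % n != 0:     A[k, k - 1] = -1.0  # left neighbor in the same grid row
        if k % n != n - 1: A[k, k + 1] = -1.0  # right neighbor
        if k >= n:         A[k, k - n] = -1.0  # neighbor one grid row down
        if k < N - n:      A[k, k + n] = -1.0  # neighbor one grid row up
    return A

print(laplace_2d(3).astype(int))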
The graph view of things
Poisson eq:
This is a graph!
This is the (adjacency) graph of a sparse matrix.
Sparse matrix from 2D equation
A typical row of the matrix, for an unknown k away from the domain boundary, has entries

    column:   ...   k−n   ...   k−1    k    k+1   ...   k+n   ...
    entry:          −1          −1     4    −1          −1

that is, 4 on the diagonal and −1 at offsets ±1 and ±n, with the ±1 entries absent (marked ∅ in the original figure) where k sits at the start or end of a grid line.
Matrix properties
• Very sparse, banded
• Factorization takes less than n² space, n³ work
• Symmetric (only because it is a 2nd order problem)
• Sign pattern: positive diagonal, nonpositive off-diagonal
  (true for many second order methods)
• Positive definite (just like the continuous problem)
• Constant diagonals: only because of the constant coefficient differential equation
• Factorization: lower complexity than dense, recursion length less than N.
Realistic meshes