Scalability of operations

Victor Eijkhout

Fall 2023
Justification

Parallel operations are supposed to be faster than their sequential
counterparts. In this section we explore how to quantify this, and we
will see examples where the same result can be computed with different
efficiencies.

2
Collectives as building blocks; complexity

3
Collectives
Gathering and spreading information:

• Every process has data, you want to bring it together;
• One process has data, you want to spread it around.

Root process: the one doing the collecting or disseminating.

Basic cases:

• Collect data: gather.
• Collect data and compute some overall value (sum, max): reduction.
• Send the same data to everyone: broadcast.
• Send individual data to each process: scatter.
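These four patterns correspond directly to MPI collectives. A minimal sketch (not from the slides; the root rank, buffer sizes, and data are illustrative):

// Minimal sketch of the four basic collectives in MPI.
#include <mpi.h>

void basic_collectives(MPI_Comm comm) {
  int rank, nprocs;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &nprocs);

  double mydata = (double)rank;   // one item per process
  double alldata[64];             // assumes nprocs <= 64 in this sketch
  double total, bcastval = 0;
  int root = 0;

  // Gather: every process has data, the root brings it together.
  MPI_Gather(&mydata, 1, MPI_DOUBLE, alldata, 1, MPI_DOUBLE, root, comm);

  // Reduction: collect and combine into one overall value (here: sum).
  MPI_Reduce(&mydata, &total, 1, MPI_DOUBLE, MPI_SUM, root, comm);

  // Broadcast: the root sends the same data to everyone.
  if (rank == root) bcastval = 3.14;
  MPI_Bcast(&bcastval, 1, MPI_DOUBLE, root, comm);

  // Scatter: the root sends each process its own individual piece.
  MPI_Scatter(alldata, 1, MPI_DOUBLE, &mydata, 1, MPI_DOUBLE, root, comm);
}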

4
5
Collective scenarios

How would you realize the following scenarios with collectives?

• Let each process compute a random number. You want to print the
  maximum of these numbers to your screen.
• Each process computes a random number again. Now you want to scale
  these numbers by their maximum.
• Let each process compute a random number. You want to print on what
  processor the maximum value is computed.
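One possible realization in MPI (a sketch with hypothetical variable names; the first scenario is a reduce, the second an allreduce, the third a reduce with MPI_MAXLOC):

// Sketch of the three scenarios.
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

void scenarios(MPI_Comm comm) {
  int rank;
  MPI_Comm_rank(comm, &rank);
  double myval = rand() / (double)RAND_MAX;   // the local random number
  double max;

  // 1. Print the maximum on one screen: a reduce with MPI_MAX to a root.
  MPI_Reduce(&myval, &max, 1, MPI_DOUBLE, MPI_MAX, 0, comm);
  if (rank == 0) printf("max = %e\n", max);

  // 2. Scale all numbers by their maximum: every process needs the
  //    maximum, so use an allreduce instead of a reduce.
  MPI_Allreduce(&myval, &max, 1, MPI_DOUBLE, MPI_MAX, comm);
  double scaled = myval / max;

  // 3. Find on what process the maximum lives: reduce a (value, rank)
  //    pair with the MPI_MAXLOC operator.
  struct { double val; int rank; } in = { myval, rank }, out;
  MPI_Reduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MAXLOC, 0, comm);
  if (rank == 0) printf("maximum %e computed on process %d\n", out.val, out.rank);

  (void)scaled;   // silence unused-variable warning in this sketch
}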

6
Simple model of parallel computation
• α: message latency
• β: time per word (inverse of bandwidth)
• γ: time per floating point operation

Send n items and do m operations:

cost = α + β · n + γ · m

Pure sends: no γ term;
pure computation: no α, β terms;
sometimes mixed: reduction.
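For instance, with illustrative values (not from the slides) α = 10⁻⁶ s and β = 10⁻⁹ s/word, a one-word message costs ≈ α = 1 µs, while a message of n = 10⁶ words costs α + β·n ≈ 10⁻³ s: latency dominates for short messages, bandwidth for long ones.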

7
Model for collectives
• One simultaneous send and receive per process:
• doubling of the number of active processors in each step,
• so collectives have an α log2 p cost component.

8
Broadcast
t =0 t =1 t =2
p0 x0 ↓, x1 ↓, x2 ↓, x3 ↓ x0 ↓, x1 ↓, x2 ↓, x3 ↓ x0 , x1 , x2 , x3
p1 x0 ↓, x1 ↓, x2 ↓, x3 ↓ x0 , x1 , x2 , x3
p2 x0 , x1 , x2 , x3
p3 x0 , x1 , x2 , x3
On t = 0, p0 sends to p1; on t = 1, p0 and p1 send to p2 and p3.

Optimal complexity:
⌈log2 p⌉α + nβ.
Actual complexity:
⌈log2 p⌉(α + nβ).
Good enough for short vectors.

9
Long vector broadcast
Start with a scatter:

t =0 t =1 t =2 t =3
p0 x0 ↓, x1 , x2 , x3 x0 , x1 ↓, x2 , x3 x0 , x1 , x2 ↓, x3 x0 , x1 , x2 , x3 ↓
p1 x1
p2 x2
p3 x3

takes p − 1 messages of size N/p, for a total time of

Tscatter(N, p) = (p − 1) α + (p − 1) · (N/p) · β.

10
Bucket brigade

11
Long vector broadcast
After the scatter do a bucket-allgather:

t =0 t =1 etcetera
p0 x0 ↓ x0 x3 ↓ x0 , x2 , x3
p1 x1 ↓ x0 ↓, x1 x0 , x1 , x3
p2 x2 ↓ x1 ↓, x2 x0 , x1 , x2
p3 x3 ↓ x2 ↓, x3 x1 , x2 , x3

Each partial message gets sent p − 1 times, so this stage also has a
complexity of

Tbucket(N, p) = (p − 1) α + (p − 1) · (N/p) · β.

Better if N large.
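A sketch of this two-stage long-vector broadcast in MPI (hypothetical function name; assumes N is divisible by the number of processes and that x has length N on every process):

// Long-vector broadcast composed as a scatter followed by an allgather.
#include <mpi.h>
#include <stdlib.h>

void long_broadcast(double *x, int N, int root, MPI_Comm comm) {
  int nprocs;
  MPI_Comm_size(comm, &nprocs);
  int nlocal = N / nprocs;
  double *piece = malloc(nlocal * sizeof(double));

  // Stage 1: the root scatters one piece of size N/p to each process.
  MPI_Scatter(x, nlocal, MPI_DOUBLE, piece, nlocal, MPI_DOUBLE, root, comm);

  // Stage 2: an allgather reassembles the full vector on every process;
  // a bucket-brigade (ring) implementation gives the (p − 1)(α + (N/p)β)
  // cost discussed above.
  MPI_Allgather(piece, nlocal, MPI_DOUBLE, x, nlocal, MPI_DOUBLE, comm);

  free(piece);
}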

12
Reduce
Optimal complexity:
⌈log2 p⌉ α + n β + ((p − 1)/p) n γ.
Spanning tree algorithm:

        t = 1                                      t = 2                                          t = 3
p0   x0^(0), x1^(0), x2^(0), x3^(0)            x0^(0:1), x1^(0:1), x2^(0:1), x3^(0:1)         x0^(0:3), x1^(0:3), x2^(0:3), x3^(0:3)
p1   x0^(1)↑, x1^(1)↑, x2^(1)↑, x3^(1)↑
p2   x0^(2), x1^(2), x2^(2), x3^(2)            x0^(2:3)↑, x1^(2:3)↑, x2^(2:3)↑, x3^(2:3)↑
p3   x0^(3)↑, x1^(3)↑, x2^(3)↑, x3^(3)↑

(Here xi^(j:k) denotes component i reduced over the contributions of processes j through k; an arrow marks a send.)

Running time:
⌈log2 p⌉ (α + n β + ((p − 1)/p) n γ).
Good enough for short vectors.

13
Allreduce

Allreduce ≡ Reduce+Broadcast

        t = 1                                      t = 2                                              t = 3
p0   x0^(0)↓, x1^(0)↓, x2^(0)↓, x3^(0)↓        x0^(0:1)↓↓, x1^(0:1)↓↓, x2^(0:1)↓↓, x3^(0:1)↓↓     x0^(0:3), x1^(0:3), x2^(0:3), x3^(0:3)
p1   x0^(1)↑, x1^(1)↑, x2^(1)↑, x3^(1)↑        x0^(0:1)↓↓, x1^(0:1)↓↓, x2^(0:1)↓↓, x3^(0:1)↓↓     x0^(0:3), x1^(0:3), x2^(0:3), x3^(0:3)
p2   x0^(2)↓, x1^(2)↓, x2^(2)↓, x3^(2)↓        x0^(2:3)↑↑, x1^(2:3)↑↑, x2^(2:3)↑↑, x3^(2:3)↑↑     x0^(0:3), x1^(0:3), x2^(0:3), x3^(0:3)
p3   x0^(3)↑, x1^(3)↑, x2^(3)↑, x3^(3)↑        x0^(2:3)↑↑, x1^(2:3)↑↑, x2^(2:3)↑↑, x3^(2:3)↑↑     x0^(0:3), x1^(0:3), x2^(0:3), x3^(0:3)

Same running time as regular reduce!

14
Allgather
Gather n elements: each processor owns n/p;
optimal running time

⌈log2 p⌉ α + ((p − 1)/p) n β.

t =1 t =2 t =3
p0 x0 ↓ x0 x1 ↓ x0 x1 x2 x3
p1 x1 ↑ x0 x1 ↓ x0 x1 x2 x3
p2 x2 ↓ x2 x3 ↑ x0 x1 x2 x3
p3 x3 ↑ x2 x3 ↑ x0 x1 x2 x3
Same time as gather, half of gather-and-broadcast.

15
Reduce-scatter

        t = 1                                      t = 2                              t = 3
p0   x0^(0), x1^(0), x2^(0)↓, x3^(0)↓          x0^(0:2:2), x1^(0:2:2)↓            x0^(0:3)
p1   x0^(1), x1^(1), x2^(1)↓, x3^(1)↓          x0^(1:3:2)↑, x1^(1:3:2)            x1^(0:3)
p2   x0^(2)↑, x1^(2)↑, x2^(2), x3^(2)          x2^(0:2:2), x3^(0:2:2)↓            x2^(0:3)
p3   x0^(3)↑, x1^(3)↑, x2^(3), x3^(3)          x2^(1:3:2)↑, x3^(1:3:2)            x3^(0:3)

(A superscript (i:j:2) denotes a partial reduction over processes i, i + 2, . . . , j.)

⌈log2 p⌉ α + ((p − 1)/p) n (β + γ).

16
Efficiency and scaling

17
Speedup
• Single processor time T1 , on p processors Tp
• speedup is Sp = T1 /Tp , with Sp ≤ p
• efficiency is Ep = Sp /p, 0 < Ep ≤ 1

Many caveats

• Is T1 based on the same algorithm? The parallel code?
• Sometimes superlinear speedup.
• Can the problem be run on a single processor?
• Can the problem be evenly divided?

18
Limits on speedup/efficiency

• Fs sequential fraction, Fp parallelizable fraction
• Fs + Fp = 1
• T1 = (Fs + Fp )T1 = Fs T1 + Fp T1
• Amdahl’s law: Tp = Fs T1 + Fp T1 /p
• p → ∞: Tp ↓ T1 Fs
• Speedup is limited by Sp ≤ 1/Fs ; efficiency is a decreasing
  function Ep ∼ 1/p.
• loglog plot: straight line with slope −1
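Example with illustrative numbers (not from the slides): if Fs = 0.05, the speedup can never exceed 1/Fs = 20; on p = 100 processors, Sp = 1/(0.05 + 0.95/100) ≈ 16.8, so Ep ≈ 0.17.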

19
Scaling

• Amdahl’s law: strong scaling
  same problem over an increasing number of processors
• Often more realistic: weak scaling
  increase problem size with the number of processors,
  for instance keeping memory per processor constant
• Weak scaling: Ep > c
• Example (below): dense linear algebra

20
Scalability analysis of the dense matrix-vector product

21
Parallel matrix-vector product; general
• Assume a division by block rows
• Every processor p has a set of row indices Ip

MVP on processor p:

∀ i ∈ Ip : yi = ∑j aij xj = ∑q ∑j∈Iq aij xj

22
Local and remote operations
Local and remote parts:

∀ i ∈ Ip : yi = ∑j∈Ip aij xj + ∑q≠p ∑j∈Iq aij xj

The local part (j ∈ Ip) can be executed right away; the terms with j ∈ Iq, q ≠ p, require communication.

23
How to deal with remote parts

• Very flexible: mix of working on local parts and receiving remote parts.
• More orchestrated:
  1. each process gets a full copy of the input vector (how?)
  2. then operates on the whole input
Compare?

(Are we making a big assumption here?)

24
Dense MVP

• Separate communication and computation:
• first an allgather,
• then the local matrix-vector product (a sketch follows below).
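A sketch of this 1D row-wise product in MPI (hypothetical names; assumes each process owns n/p consecutive rows of A and the matching entries of x and y, with n divisible by p):

// y = A x with A distributed by block rows.
#include <mpi.h>
#include <stdlib.h>

void mvp_1d(const double *Alocal, const double *xlocal, double *ylocal,
            int n, MPI_Comm comm) {
  int p;
  MPI_Comm_size(comm, &p);
  int nlocal = n / p;
  double *x = malloc(n * sizeof(double));

  // Communication: allgather so that the full input vector is available
  // on every process.
  MPI_Allgather(xlocal, nlocal, MPI_DOUBLE, x, nlocal, MPI_DOUBLE, comm);

  // Computation: product of the owned row block with the full x,
  // roughly 2 n^2 / p operations per process.
  for (int i = 0; i < nlocal; i++) {
    double s = 0.0;
    for (int j = 0; j < n; j++)
      s += Alocal[i * n + j] * x[j];
    ylocal[i] = s;
  }

  free(x);
}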

25
Cost computation 1.

Algorithm:

Step                                     Cost (lower bound)

Allgather xi so that x is available
on all nodes

Locally compute yi = Ai x                ≈ 2n²/p · γ

26
Allgather

Assume that data arrives over a binary tree:

• latency α log2 P
• transmission time, receiving n/P elements from P − 1 processors

27
Algorithm with cost:

Step                                     Cost (lower bound)

Allgather xi so that x is available      ⌈log2 (p)⌉ α + ((p − 1)/p) n β
on all nodes                             ≈ log2 (p) α + n β

Locally compute yi = Ai x                ≈ 2n²/p · γ

28
Parallel efficiency
Speedup:

Sp^1D-row (n) = T1 (n) / Tp^1D-row (n)
             = 2n²γ / ( 2n²γ/p + log2 (p) α + n β )
             = p / ( 1 + (p log2 (p))/(2n²) · α/γ + p/(2n) · β/γ )

Efficiency:

Ep^1D-row (n) = Sp^1D-row (n) / p
             = 1 / ( 1 + (p log2 (p))/(2n²) · α/γ + p/(2n) · β/γ ).

Strong scaling, weak scaling?

29
Optimistic scaling

Processors fixed, problem grows:

Ep^1D-row (n) = 1 / ( 1 + (p log2 (p))/(2n²) · α/γ + p/(2n) · β/γ )

Roughly Ep ∼ 1 − n⁻¹

30
Strong scaling

Problem fixed, p → ∞

Ep^1D-row (n) = 1 / ( 1 + (p log2 (p))/(2n²) · α/γ + p/(2n) · β/γ )

Roughly Ep ∼ p⁻¹

31
Weak scaling

Memory per processor fixed:
M = n²/p

Ep^1D-row (n) = 1 / ( 1 + (p log2 (p))/(2n²) · α/γ + p/(2n) · β/γ )
             = 1 / ( 1 + log2 (p)/(2M) · α/γ + √p/(2√M) · β/γ )

Does not scale: Ep ∼ 1/√p;
the problem is in the β term: too much communication.

32
Two-dimensional partitioning

(Figure: a 12 × 12 matrix with its input and output vectors, distributed over a 3 × 4 grid of processors; each processor owns a contiguous block of the matrix, and the vector elements x0 , . . . , x11 and y0 , . . . , y11 are spread one per processor.)

33
Two-dimensional partitioning

Processor grid p = r × c, assume r , c ≈ √p.

(Figure: the same 12 × 12 example; the ↑ arrows indicate how the pieces of the input vector in one processor column are brought together, so that every processor in that column can compute its contribution to the output.)

34
Key to the algorithm

• Consider block (i , j ):
• it needs to multiply by the x elements in block column j ;
• it produces part of the result of block row i .

35
Algorithm

• Collect xj on each processor pij by an allgather inside the
  processor columns.
• Each processor pij then computes yij = Aij xj .
• Gather the pieces yij in each processor row to form yi , and
  distribute this over the processor row: these steps combine into a
  reduce-scatter.
• This also sets things up for the next product with A or Aᵗ
  (a sketch in MPI follows below).
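A sketch of this algorithm in MPI (hypothetical helper; assumes a p = r × c grid with row-major ranks and evenly divisible block sizes, and uses MPI_Comm_split to form the row and column communicators):

// y = A x with A distributed over a 2D processor grid.
#include <mpi.h>
#include <stdlib.h>

void mvp_2d(const double *Aij, const double *xj_local, double *yi_local,
            int n, int r, int c, MPI_Comm comm) {
  int rank;
  MPI_Comm_rank(comm, &rank);
  int myrow = rank / c, mycol = rank % c;   // assumes row-major grid ranks

  // Split into a column communicator and a row communicator.
  MPI_Comm colcomm, rowcomm;
  MPI_Comm_split(comm, mycol, myrow, &colcomm);
  MPI_Comm_split(comm, myrow, mycol, &rowcomm);

  int ni = n / r, nj = n / c;               // the local block is ni x nj
  double *xj = malloc(nj * sizeof(double));
  double *ypartial = malloc(ni * sizeof(double));

  // 1. Allgather the pieces of x_j inside each processor column.
  MPI_Allgather(xj_local, nj / r, MPI_DOUBLE, xj, nj / r, MPI_DOUBLE, colcomm);

  // 2. Local product y_ij = A_ij x_j.
  for (int i = 0; i < ni; i++) {
    double s = 0.0;
    for (int j = 0; j < nj; j++)
      s += Aij[i * nj + j] * xj[j];
    ypartial[i] = s;
  }

  // 3. Reduce-scatter inside each processor row: sum the partial results
  //    and leave each process with its own ni/c entries of y.
  MPI_Reduce_scatter_block(ypartial, yi_local, ni / c, MPI_DOUBLE,
                           MPI_SUM, rowcomm);

  free(xj); free(ypartial);
  MPI_Comm_free(&colcomm); MPI_Comm_free(&rowcomm);
}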

36
Analysis 1.

Step                                     Cost (lower bound)

Allgather xi ’s within columns           ⌈log2 (r )⌉ α + ((r − 1)/p) n β
                                         ≈ log2 (r ) α + (n/c) β

Perform local matrix-vector multiply     ≈ 2n²/p · γ

Reduce-scatter yi ’s within rows

37
Reduce-scatter

Time:
⌈log2 p⌉ α + ((p − 1)/p) n (β + γ).

38
Step                                     Cost (lower bound)

Allgather xi ’s within columns           ⌈log2 (r )⌉ α + ((r − 1)/p) n β
                                         ≈ log2 (r ) α + (n/c) β

Perform local matrix-vector multiply     ≈ 2n²/p · γ

Reduce-scatter yi ’s within rows         ⌈log2 (c )⌉ α + ((c − 1)/p) n β + ((c − 1)/p) n γ
                                         ≈ log2 (c ) α + (n/r ) β + (n/r ) γ

39
Efficiency


Let r = c = √p, then

Ep^(√p×√p) (n) = 1 / ( 1 + (p log2 (p))/(2n²) · α/γ + √p/(2n) · (2β + γ)/γ )

40
Strong scaling

Same story as before for p → ∞:


Ep^(√p×√p) (n) = 1 / ( 1 + (p log2 (p))/(2n²) · α/γ + √p/(2n) · (2β + γ)/γ )  ∼  p⁻¹

No strong scaling.

41
Weak scaling

Constant memory per processor, M = n²/p:

Ep^(√p×√p) (n) = 1 / ( 1 + (p log2 (p))/(2n²) · α/γ + √p/(2n) · (2β + γ)/γ )
              = 1 / ( 1 + log2 (p)/(2M) · α/γ + 1/(2√M) · (2β + γ)/γ )

Weak scaling:
for p → ∞ this is ≈ 1/ log2 p:
only slowly decreasing.

42
LU factorizations

• Needs a cyclic distribution
• This is very hard to program, so:
• ScaLAPACK: 1990s product, not extensible, impossible interface
• Elemental: 2010s product, extensible, nice user interface (and it is
  way faster)

43
Boundary value problems

Consider in 1D:

    −u′′(x) = f(x, u, u′)    x ∈ [a, b]
    u(a) = ua ,  u(b) = ub

and in 2D:

    −uxx(x̄) − uyy(x̄) = f(x̄)    x̄ ∈ Ω = [0, 1]²
    u(x̄) = u0                    x̄ ∈ ∂Ω

44
Approximation of 2nd order derivatives
Taylor series (write h for δx):

u(x + h) = u(x) + u′(x) h + u′′(x) h²/2! + u′′′(x) h³/3! + u^(4)(x) h⁴/4! + u^(5)(x) h⁵/5! + ···

and

u(x − h) = u(x) − u′(x) h + u′′(x) h²/2! − u′′′(x) h³/3! + u^(4)(x) h⁴/4! − u^(5)(x) h⁵/5! + ···

Add these:

u(x + h) + u(x − h) = 2u(x) + u′′(x) h² + u^(4)(x) h⁴/12 + ···

so

u′′(x) = ( u(x + h) − 2u(x) + u(x − h) ) / h² − u^(4)(x) h²/12 + ···

Numerical scheme:

− ( u(x + h) − 2u(x) + u(x − h) ) / h² = f (x, u(x), u′(x))
(2nd order PDEs are very common!)

45
This leads to linear algebra
−uxx = f  →  ( 2u(x) − u(x + h) − u(x − h) ) / h² = f (x, u(x), u′(x))

Equally spaced points on [0, 1]: xk = kh where h = 1/(n + 1), then

−uk+1 + 2uk − uk−1 = h² f (xk , uk , u′k )    for k = 1, . . . , n

Written as a matrix equation:

    ⎛  2  −1          ⎞ ⎛ u1 ⎞   ⎛ h²f1 + u0 ⎞
    ⎜ −1   2  −1      ⎟ ⎜ u2 ⎟ = ⎜ h²f2      ⎟
    ⎝      ⋱   ⋱   ⋱  ⎠ ⎝  ⋮ ⎠   ⎝    ⋮      ⎠

46
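As a small illustrative check (not part of the slides): applying the three-point formula to u(x) = sin(πx), for which −u′′(x) = π² sin(πx), shows the second-order accuracy of the scheme.

// Check the O(h^2) error of the three-point second-difference formula.
#include <math.h>
#include <stdio.h>

int main(void) {
  double pi = 4.0 * atan(1.0), x = 0.3;
  for (int k = 1; k <= 4; k++) {
    double h = pow(10.0, -k);
    double approx = (2*sin(pi*x) - sin(pi*(x+h)) - sin(pi*(x-h))) / (h*h);
    double exact  = pi*pi*sin(pi*x);            // -u''(x) for u = sin(pi x)
    // The error decreases roughly by a factor 100 per step, until roundoff.
    printf("h = %8.1e   error = %10.3e\n", h, fabs(approx - exact));
  }
  return 0;
}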
Matrix properties

• Very sparse, banded
• Symmetric (only because 2nd order problem)
• Sign pattern: positive diagonal, nonpositive off-diagonal
(true for many second order methods)
• Positive definite (just like the continuous problem)
• Constant diagonals (from constant coefficients in the DE)

47
Sparse matrix in 2D case
The sparse matrices so far were tridiagonal: that happens only in the 1D case.

Two-dimensional: −uxx − uyy = f on unit square [0, 1]2

Difference equation:

4u (x , y ) − u (x + h, y ) − u (x − h, y ) − u (x , y + h) − u (x , y − h) = h2 f (x , y )

4uk − uk −1 − uk +1 − uk −n − uk +n = fk

Consider a graph where the {uk }k are the vertices
and (ui , uj ) is an edge iff aij ̸= 0.
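A sketch of how one row of this matrix can be generated (illustrative helper, not from the slides; unknowns on an n × n grid are numbered k = i·n + j, and neighbours that fall outside the grid are simply dropped, as happens at the boundary):

// Fill the nonzeros of row k of the five-point matrix on an n x n grid.
void fivepoint_row(int k, int n, int cols[5], double vals[5], int *nnz) {
  int i = k / n, j = k % n, m = 0;
  cols[m] = k;     vals[m++] = 4.0;                      // diagonal
  if (j > 0)     { cols[m] = k - 1; vals[m++] = -1.0; }  // west:  u_{k-1}
  if (j < n - 1) { cols[m] = k + 1; vals[m++] = -1.0; }  // east:  u_{k+1}
  if (i > 0)     { cols[m] = k - n; vals[m++] = -1.0; }  // south: u_{k-n}
  if (i < n - 1) { cols[m] = k + n; vals[m++] = -1.0; }  // north: u_{k+n}
  *nnz = m;
}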

48
The graph view of things
(Figure: the unknowns of the Poisson equation on a 2D grid, connected by the five-point stencil.)

This is a graph!
It is the (adjacency) graph of a sparse matrix.

49
Sparse matrix from 2D equation
Each row k has a 4 on the diagonal, −1 in columns k − 1 and k + 1 (except
where k is at the start or end of a grid line: the blocks marked 0/ in the
original figure), and −1 in columns k − n and k + n; all other entries are zero:

row k:     ( · · ·   −1   · · ·   −1    4   −1   · · ·   −1   · · · )
column:             k − n        k − 1  k  k + 1        k + n

50
Matrix properties
• Very sparse, banded
• Factorization takes less than the dense n² space and n³ work
• Symmetric (only because 2nd order problem)
• Sign pattern: positive diagonal, nonpositive off-diagonal
  (true for many second order methods)
• Positive definite (just like the continuous problem)
• Constant diagonals: only because of the constant coefficient
  differential equation
• Factorization: lower complexity than dense, recursion length less
  than N.

51
Realistic meshes

52
