Scalability of operations

Victor Eijkhout

Fall 2023
Justification

Parallel operations are supposed to be faster than their sequential
counterparts. In this section we explore how to quantify this, and we
will see examples where the same result can be computed with different
efficiencies.

2
Collectives as building blocks; complexity

3
Collectives
Gathering and spreading information:

• Every process has data, you want to bring it together;
• One process has data, you want to spread it around.

Root process: the one doing the collecting or disseminating.

Basic cases:

• Collect data: gather.
• Collect data and compute some overall value (sum, max): reduction.
• Send the same data to everyone: broadcast.
• Send individual data to each process: scatter.
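These four patterns correspond directly to MPI collectives. A minimal sketch (not from the slides; the root rank, buffer sizes, and data are illustrative):

// Minimal sketch of the four basic collectives in MPI.
#include <mpi.h>

void basic_collectives(MPI_Comm comm) {
  int rank, nprocs;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &nprocs);

  double mydata = (double)rank;   // one item per process
  double alldata[64];             // assumes nprocs <= 64 in this sketch
  double total, bcastval = 0;
  int root = 0;

  // Gather: every process has data, the root brings it together.
  MPI_Gather(&mydata, 1, MPI_DOUBLE, alldata, 1, MPI_DOUBLE, root, comm);

  // Reduction: collect and combine into one overall value (here: sum).
  MPI_Reduce(&mydata, &total, 1, MPI_DOUBLE, MPI_SUM, root, comm);

  // Broadcast: the root sends the same data to everyone.
  if (rank == root) bcastval = 3.14;
  MPI_Bcast(&bcastval, 1, MPI_DOUBLE, root, comm);

  // Scatter: the root sends each process its own individual piece.
  MPI_Scatter(alldata, 1, MPI_DOUBLE, &mydata, 1, MPI_DOUBLE, root, comm);
}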

4
5
Collective scenarios

How would you realize the following scenarios with collectives?

• Let each process compute a random number. You want to print the
  maximum of these numbers to your screen.
• Each process computes a random number again. Now you want to scale
  these numbers by their maximum.
• Let each process compute a random number. You want to print on what
  processor the maximum value is computed.
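One possible realization in MPI (a sketch with hypothetical variable names; the first scenario is a reduce, the second an allreduce, the third a reduce with MPI_MAXLOC):

// Sketch of the three scenarios.
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

void scenarios(MPI_Comm comm) {
  int rank;
  MPI_Comm_rank(comm, &rank);
  double myval = rand() / (double)RAND_MAX;   // the local random number
  double max;

  // 1. Print the maximum on one screen: a reduce with MPI_MAX to a root.
  MPI_Reduce(&myval, &max, 1, MPI_DOUBLE, MPI_MAX, 0, comm);
  if (rank == 0) printf("max = %e\n", max);

  // 2. Scale all numbers by their maximum: every process needs the
  //    maximum, so use an allreduce instead of a reduce.
  MPI_Allreduce(&myval, &max, 1, MPI_DOUBLE, MPI_MAX, comm);
  double scaled = myval / max;

  // 3. Find on what process the maximum lives: reduce a (value, rank)
  //    pair with the MPI_MAXLOC operator.
  struct { double val; int rank; } in = { myval, rank }, out;
  MPI_Reduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MAXLOC, 0, comm);
  if (rank == 0) printf("maximum %e computed on process %d\n", out.val, out.rank);

  (void)scaled;   // silence unused-variable warning in this sketch
}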

6
Simple model of parallel computation
• α: message latency
• β: time per word (inverse of bandwidth)
• γ: time per floating point operation

Send n items and do m operations:

cost = α + β · n + γ · m

Pure sends: no γ term;
pure computation: no α, β terms;
sometimes mixed: reduction.
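For instance, with illustrative values (not from the slides) α = 10⁻⁶ s and β = 10⁻⁹ s/word, a one-word message costs ≈ α = 1 µs, while a message of n = 10⁶ words costs α + β·n ≈ 10⁻³ s: latency dominates for short messages, bandwidth for long ones.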

7
Model for collectives
• One simultaneous send and receive per process:
• doubling of the number of active processors in each step,
• so collectives have an α log2 p cost component.

8
Broadcast
t =0 t =1 t =2
p0 x0 ↓, x1 ↓, x2 ↓, x3 ↓ x0 ↓, x1 ↓, x2 ↓, x3 ↓ x0 , x1 , x2 , x3
p1 x0 ↓, x1 ↓, x2 ↓, x3 ↓ x0 , x1 , x2 , x3
p2 x0 , x1 , x2 , x3
p3 x0 , x1 , x2 , x3
On t = 0, p0 sends to p1; on t = 1, p0 and p1 send to p2 and p3.

Optimal complexity:
⌈log2 p⌉α + nβ.
Actual complexity:
⌈log2 p⌉(α + nβ).
Good enough for short vectors.

9
Long vector broadcast
Start with a scatter:

t =0 t =1 t =2 t =3
p0 x0 ↓, x1 , x2 , x3 x0 , x1 ↓, x2 , x3 x0 , x1 , x2 ↓, x3 x0 , x1 , x2 , x3 ↓
p1 x1
p2 x2
p3 x3

takes p − 1 messages of size N/p, for a total time of

Tscatter(N, p) = (p − 1) α + (p − 1) · (N/p) · β.

10
Bucket brigade

11
Long vector broadcast
After the scatter do a bucket-allgather:

t =0 t =1 etcetera
p0 x0 ↓ x0 x3 ↓ x0 , x2 , x3
p1 x1 ↓ x0 ↓, x1 x0 , x1 , x3
p2 x2 ↓ x1 ↓, x2 x0 , x1 , x2
p3 x3 ↓ x2 ↓, x3 x1 , x2 , x3

Each partial message gets sent p − 1 times, so this stage also has a
complexity of

Tbucket(N, p) = (p − 1) α + (p − 1) · (N/p) · β.

Better if N large.
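A sketch of this two-stage long-vector broadcast in MPI (hypothetical function name; assumes N is divisible by the number of processes and that x has length N on every process):

// Long-vector broadcast composed as a scatter followed by an allgather.
#include <mpi.h>
#include <stdlib.h>

void long_broadcast(double *x, int N, int root, MPI_Comm comm) {
  int nprocs;
  MPI_Comm_size(comm, &nprocs);
  int nlocal = N / nprocs;
  double *piece = malloc(nlocal * sizeof(double));

  // Stage 1: the root scatters one piece of size N/p to each process.
  MPI_Scatter(x, nlocal, MPI_DOUBLE, piece, nlocal, MPI_DOUBLE, root, comm);

  // Stage 2: an allgather reassembles the full vector on every process;
  // a bucket-brigade (ring) implementation gives the (p − 1)(α + (N/p)β)
  // cost discussed above.
  MPI_Allgather(piece, nlocal, MPI_DOUBLE, x, nlocal, MPI_DOUBLE, comm);

  free(piece);
}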

12
Reduce
Optimal complexity:
⌈log2 p⌉ α + n β + ((p − 1)/p) n γ.
Spanning tree algorithm:

        t = 1                                      t = 2                                          t = 3
p0   x0^(0), x1^(0), x2^(0), x3^(0)            x0^(0:1), x1^(0:1), x2^(0:1), x3^(0:1)         x0^(0:3), x1^(0:3), x2^(0:3), x3^(0:3)
p1   x0^(1)↑, x1^(1)↑, x2^(1)↑, x3^(1)↑
p2   x0^(2), x1^(2), x2^(2), x3^(2)            x0^(2:3)↑, x1^(2:3)↑, x2^(2:3)↑, x3^(2:3)↑
p3   x0^(3)↑, x1^(3)↑, x2^(3)↑, x3^(3)↑

(Here xi^(j:k) denotes component i reduced over the contributions of processes j through k; an arrow marks a send.)

Running time:
⌈log2 p⌉ (α + n β + ((p − 1)/p) n γ).
Good enough for short vectors.

13
Allreduce

Allreduce ≡ Reduce+Broadcast

        t = 1                                      t = 2                                              t = 3
p0   x0^(0)↓, x1^(0)↓, x2^(0)↓, x3^(0)↓        x0^(0:1)↓↓, x1^(0:1)↓↓, x2^(0:1)↓↓, x3^(0:1)↓↓     x0^(0:3), x1^(0:3), x2^(0:3), x3^(0:3)
p1   x0^(1)↑, x1^(1)↑, x2^(1)↑, x3^(1)↑        x0^(0:1)↓↓, x1^(0:1)↓↓, x2^(0:1)↓↓, x3^(0:1)↓↓     x0^(0:3), x1^(0:3), x2^(0:3), x3^(0:3)
p2   x0^(2)↓, x1^(2)↓, x2^(2)↓, x3^(2)↓        x0^(2:3)↑↑, x1^(2:3)↑↑, x2^(2:3)↑↑, x3^(2:3)↑↑     x0^(0:3), x1^(0:3), x2^(0:3), x3^(0:3)
p3   x0^(3)↑, x1^(3)↑, x2^(3)↑, x3^(3)↑        x0^(2:3)↑↑, x1^(2:3)↑↑, x2^(2:3)↑↑, x3^(2:3)↑↑     x0^(0:3), x1^(0:3), x2^(0:3), x3^(0:3)

Same running time as regular reduce!

14
Allgather
Gather n elements: each processor owns n/p;
optimal running time

⌈log2 p⌉ α + ((p − 1)/p) n β.

t =1 t =2 t =3
p0 x0 ↓ x0 x1 ↓ x0 x1 x2 x3
p1 x1 ↑ x0 x1 ↓ x0 x1 x2 x3
p2 x2 ↓ x2 x3 ↑ x0 x1 x2 x3
p3 x3 ↑ x2 x3 ↑ x0 x1 x2 x3
Same time as gather, half of gather-and-broadcast.

15
Reduce-scatter

        t = 1                                      t = 2                              t = 3
p0   x0^(0), x1^(0), x2^(0)↓, x3^(0)↓          x0^(0:2:2), x1^(0:2:2)↓            x0^(0:3)
p1   x0^(1), x1^(1), x2^(1)↓, x3^(1)↓          x0^(1:3:2)↑, x1^(1:3:2)            x1^(0:3)
p2   x0^(2)↑, x1^(2)↑, x2^(2), x3^(2)          x2^(0:2:2), x3^(0:2:2)↓            x2^(0:3)
p3   x0^(3)↑, x1^(3)↑, x2^(3), x3^(3)          x2^(1:3:2)↑, x3^(1:3:2)            x3^(0:3)

(A superscript (i:j:2) denotes a partial reduction over processes i, i + 2, . . . , j.)

⌈log2 p⌉ α + ((p − 1)/p) n (β + γ).

16
Efficiency and scaling

17
Speedup
• Single processor time T1 , on p processors Tp
• speedup is Sp = T1 /Tp , with Sp ≤ p
• efficiency is Ep = Sp /p, 0 < Ep ≤ 1

Many caveats

• Is T1 based on the same algorithm? The parallel code?
• Sometimes superlinear speedup.
• Can the problem be run on a single processor?
• Can the problem be evenly divided?

18
Limits on speedup/efficiency

• Fs sequential fraction, Fp parallelizable fraction
• Fs + Fp = 1
• T1 = (Fs + Fp )T1 = Fs T1 + Fp T1
• Amdahl’s law: Tp = Fs T1 + Fp T1 /p
• p → ∞: Tp ↓ T1 Fs
• Speedup is limited by Sp ≤ 1/Fs ; efficiency is a decreasing
  function Ep ∼ 1/p.
• loglog plot: straight line with slope −1
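Example with illustrative numbers (not from the slides): if Fs = 0.05, the speedup can never exceed 1/Fs = 20; on p = 100 processors, Sp = 1/(0.05 + 0.95/100) ≈ 16.8, so Ep ≈ 0.17.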

19
Scaling

• Amdahl’s law: strong scaling
  same problem over an increasing number of processors
• Often more realistic: weak scaling
  increase problem size with the number of processors,
  for instance keeping memory per processor constant
• Weak scaling: Ep > c
• Example (below): dense linear algebra

20
Scalability analysis of the dense matrix-vector product

21
Parallel matrix-vector product; general
• Assume a division by block rows
• Every processor p has a set of row indices Ip

MVP on processor p:

∀ i ∈ Ip : yi = ∑j aij xj = ∑q ∑j∈Iq aij xj

22
Local and remote operations
Local and remote parts:

∀ i ∈ Ip : yi = ∑j∈Ip aij xj + ∑q≠p ∑j∈Iq aij xj

The local part (j ∈ Ip) can be executed right away; the terms with j ∈ Iq, q ≠ p, require communication.

23
How to deal with remote parts

• Very flexible: mix of working on local parts and receiving remote parts.
• More orchestrated:
  1. each process gets a full copy of the input vector (how?)
  2. then operates on the whole input
Compare?

(Are we making a big assumption here?)

24
Dense MVP

• Separate communication and computation:
• first an allgather,
• then the local matrix-vector product (a sketch follows below).
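A sketch of this 1D row-wise product in MPI (hypothetical names; assumes each process owns n/p consecutive rows of A and the matching entries of x and y, with n divisible by p):

// y = A x with A distributed by block rows.
#include <mpi.h>
#include <stdlib.h>

void mvp_1d(const double *Alocal, const double *xlocal, double *ylocal,
            int n, MPI_Comm comm) {
  int p;
  MPI_Comm_size(comm, &p);
  int nlocal = n / p;
  double *x = malloc(n * sizeof(double));

  // Communication: allgather so that the full input vector is available
  // on every process.
  MPI_Allgather(xlocal, nlocal, MPI_DOUBLE, x, nlocal, MPI_DOUBLE, comm);

  // Computation: product of the owned row block with the full x,
  // roughly 2 n^2 / p operations per process.
  for (int i = 0; i < nlocal; i++) {
    double s = 0.0;
    for (int j = 0; j < n; j++)
      s += Alocal[i * n + j] * x[j];
    ylocal[i] = s;
  }

  free(x);
}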

25
Cost computation 1.

Algorithm:

Step                                     Cost (lower bound)

Allgather xi so that x is available
on all nodes

Locally compute yi = Ai x                ≈ 2n²/p · γ

26
Allgather

Assume that data arrives over a binary tree:

• latency α log2 P
• transmission time, receiving n/P elements from P − 1 processors

27
Algorithm with cost:

Step                                     Cost (lower bound)

Allgather xi so that x is available      ⌈log2 (p)⌉ α + ((p − 1)/p) n β
on all nodes                             ≈ log2 (p) α + n β

Locally compute yi = Ai x                ≈ 2n²/p · γ

28
Parallel efficiency
Speedup:

Sp^1D-row (n) = T1 (n) / Tp^1D-row (n)
             = 2n²γ / ( 2n²γ/p + log2 (p) α + n β )
             = p / ( 1 + (p log2 (p))/(2n²) · α/γ + p/(2n) · β/γ )

Efficiency:

Ep^1D-row (n) = Sp^1D-row (n) / p
             = 1 / ( 1 + (p log2 (p))/(2n²) · α/γ + p/(2n) · β/γ ).

Strong scaling, weak scaling?

29
Optimistic scaling

Processors fixed, problem grows:

Ep^1D-row (n) = 1 / ( 1 + (p log2 (p))/(2n²) · α/γ + p/(2n) · β/γ )

Roughly Ep ∼ 1 − n⁻¹

30
Strong scaling

Problem fixed, p → ∞

Ep^1D-row (n) = 1 / ( 1 + (p log2 (p))/(2n²) · α/γ + p/(2n) · β/γ )

Roughly Ep ∼ p⁻¹

31
Weak scaling

Memory per processor fixed:
M = n²/p

Ep^1D-row (n) = 1 / ( 1 + (p log2 (p))/(2n²) · α/γ + p/(2n) · β/γ )
             = 1 / ( 1 + log2 (p)/(2M) · α/γ + √p/(2√M) · β/γ )

Does not scale: Ep ∼ 1/√p;
the problem is in the β term: too much communication.

32
Two-dimensional partitioning

(Figure: a 12 × 12 matrix with its input and output vectors, distributed over a 3 × 4 grid of processors; each processor owns a contiguous block of the matrix, and the vector elements x0 , . . . , x11 and y0 , . . . , y11 are spread one per processor.)

33
Two-dimensional partitioning

Processor grid p = r × c, assume r , c ≈ √p.

(Figure: the same 12 × 12 example; the ↑ arrows indicate how the pieces of the input vector in one processor column are brought together, so that every processor in that column can compute its contribution to the output.)

34
Key to the algorithm

• Consider block (i , j ):
• it needs to multiply by the x elements in block column j ;
• it produces part of the result of block row i .

35
Algorithm

• Collect xj on each processor pij by an allgather inside the
  processor columns.
• Each processor pij then computes yij = Aij xj .
• Gather the pieces yij in each processor row to form yi , and
  distribute this over the processor row: these steps combine into a
  reduce-scatter.
• This also sets things up for the next product with A or Aᵗ
  (a sketch in MPI follows below).
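A sketch of this algorithm in MPI (hypothetical helper; assumes a p = r × c grid with row-major ranks and evenly divisible block sizes, and uses MPI_Comm_split to form the row and column communicators):

// y = A x with A distributed over a 2D processor grid.
#include <mpi.h>
#include <stdlib.h>

void mvp_2d(const double *Aij, const double *xj_local, double *yi_local,
            int n, int r, int c, MPI_Comm comm) {
  int rank;
  MPI_Comm_rank(comm, &rank);
  int myrow = rank / c, mycol = rank % c;   // assumes row-major grid ranks

  // Split into a column communicator and a row communicator.
  MPI_Comm colcomm, rowcomm;
  MPI_Comm_split(comm, mycol, myrow, &colcomm);
  MPI_Comm_split(comm, myrow, mycol, &rowcomm);

  int ni = n / r, nj = n / c;               // the local block is ni x nj
  double *xj = malloc(nj * sizeof(double));
  double *ypartial = malloc(ni * sizeof(double));

  // 1. Allgather the pieces of x_j inside each processor column.
  MPI_Allgather(xj_local, nj / r, MPI_DOUBLE, xj, nj / r, MPI_DOUBLE, colcomm);

  // 2. Local product y_ij = A_ij x_j.
  for (int i = 0; i < ni; i++) {
    double s = 0.0;
    for (int j = 0; j < nj; j++)
      s += Aij[i * nj + j] * xj[j];
    ypartial[i] = s;
  }

  // 3. Reduce-scatter inside each processor row: sum the partial results
  //    and leave each process with its own ni/c entries of y.
  MPI_Reduce_scatter_block(ypartial, yi_local, ni / c, MPI_DOUBLE,
                           MPI_SUM, rowcomm);

  free(xj); free(ypartial);
  MPI_Comm_free(&colcomm); MPI_Comm_free(&rowcomm);
}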

36
Analysis 1.

Step                                     Cost (lower bound)

Allgather xi ’s within columns           ⌈log2 (r )⌉ α + ((r − 1)/p) n β
                                         ≈ log2 (r ) α + (n/c) β

Perform local matrix-vector multiply     ≈ 2n²/p · γ

Reduce-scatter yi ’s within rows

37
Reduce-scatter

Time:
⌈log2 p⌉ α + ((p − 1)/p) n (β + γ).

38
Step                                     Cost (lower bound)

Allgather xi ’s within columns           ⌈log2 (r )⌉ α + ((r − 1)/p) n β
                                         ≈ log2 (r ) α + (n/c) β

Perform local matrix-vector multiply     ≈ 2n²/p · γ

Reduce-scatter yi ’s within rows         ⌈log2 (c )⌉ α + ((c − 1)/p) n β + ((c − 1)/p) n γ
                                         ≈ log2 (c ) α + (n/r ) β + (n/r ) γ

39
Efficiency


Let r = c = √p, then

Ep^(√p×√p) (n) = 1 / ( 1 + (p log2 (p))/(2n²) · α/γ + √p/(2n) · (2β + γ)/γ )

40
Strong scaling

Same story as before for p → ∞:


Ep^(√p×√p) (n) = 1 / ( 1 + (p log2 (p))/(2n²) · α/γ + √p/(2n) · (2β + γ)/γ )  ∼  p⁻¹

No strong scaling.

41
Weak scaling

Constant memory per processor, M = n²/p:

Ep^(√p×√p) (n) = 1 / ( 1 + (p log2 (p))/(2n²) · α/γ + √p/(2n) · (2β + γ)/γ )
              = 1 / ( 1 + log2 (p)/(2M) · α/γ + 1/(2√M) · (2β + γ)/γ )

Weak scaling:
for p → ∞ this is ≈ 1/ log2 p:
only slowly decreasing.

42
LU factorizations

• Needs a cyclic distribution
• This is very hard to program, so:
• ScaLAPACK: 1990s product, not extensible, impossible interface
• Elemental: 2010s product, extensible, nice user interface (and it is
  way faster)

43
Boundary value problems

Consider in 1D:

    −u′′(x) = f(x, u, u′)    x ∈ [a, b]
    u(a) = ua ,  u(b) = ub

and in 2D:

    −uxx(x̄) − uyy(x̄) = f(x̄)    x̄ ∈ Ω = [0, 1]²
    u(x̄) = u0                    x̄ ∈ ∂Ω

44
Approximation of 2nd order derivatives
Taylor series (write h for δx):

u(x + h) = u(x) + u′(x) h + u′′(x) h²/2! + u′′′(x) h³/3! + u^(4)(x) h⁴/4! + u^(5)(x) h⁵/5! + ···

and

u(x − h) = u(x) − u′(x) h + u′′(x) h²/2! − u′′′(x) h³/3! + u^(4)(x) h⁴/4! − u^(5)(x) h⁵/5! + ···

Add these:

u(x + h) + u(x − h) = 2u(x) + u′′(x) h² + u^(4)(x) h⁴/12 + ···

so

u′′(x) = ( u(x + h) − 2u(x) + u(x − h) ) / h² − u^(4)(x) h²/12 + ···

Numerical scheme:

− ( u(x + h) − 2u(x) + u(x − h) ) / h² = f (x, u(x), u′(x))
(2nd order PDEs are very common!)

45
This leads to linear algebra
−uxx = f  →  ( 2u(x) − u(x + h) − u(x − h) ) / h² = f (x, u(x), u′(x))

Equally spaced points on [0, 1]: xk = kh where h = 1/(n + 1), then

−uk+1 + 2uk − uk−1 = h² f (xk , uk , u′k )    for k = 1, . . . , n

Written as a matrix equation:

    ⎛  2  −1          ⎞ ⎛ u1 ⎞   ⎛ h²f1 + u0 ⎞
    ⎜ −1   2  −1      ⎟ ⎜ u2 ⎟ = ⎜ h²f2      ⎟
    ⎝      ⋱   ⋱   ⋱  ⎠ ⎝  ⋮ ⎠   ⎝    ⋮      ⎠

46
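As a small illustrative check (not part of the slides): applying the three-point formula to u(x) = sin(πx), for which −u′′(x) = π² sin(πx), shows the second-order accuracy of the scheme.

// Check the O(h^2) error of the three-point second-difference formula.
#include <math.h>
#include <stdio.h>

int main(void) {
  double pi = 4.0 * atan(1.0), x = 0.3;
  for (int k = 1; k <= 4; k++) {
    double h = pow(10.0, -k);
    double approx = (2*sin(pi*x) - sin(pi*(x+h)) - sin(pi*(x-h))) / (h*h);
    double exact  = pi*pi*sin(pi*x);            // -u''(x) for u = sin(pi x)
    // The error decreases roughly by a factor 100 per step, until roundoff.
    printf("h = %8.1e   error = %10.3e\n", h, fabs(approx - exact));
  }
  return 0;
}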
Matrix properties

• Very sparse, banded
• Symmetric (only because 2nd order problem)
• Sign pattern: positive diagonal, nonpositive off-diagonal
(true for many second order methods)
• Positive definite (just like the continuous problem)
• Constant diagonals (from constant coefficients in the DE)

47
Sparse matrix in 2D case
The sparse matrices so far were tridiagonal: that happens only in the 1D case.

Two-dimensional: −uxx − uyy = f on unit square [0, 1]2

Difference equation:

4u (x , y ) − u (x + h, y ) − u (x − h, y ) − u (x , y + h) − u (x , y − h) = h2 f (x , y )

4uk − uk −1 − uk +1 − uk −n − uk +n = fk

Consider a graph where the {uk }k are the vertices
and (ui , uj ) is an edge iff aij ̸= 0.
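A sketch of how one row of this matrix can be generated (illustrative helper, not from the slides; unknowns on an n × n grid are numbered k = i·n + j, and neighbours that fall outside the grid are simply dropped, as happens at the boundary):

// Fill the nonzeros of row k of the five-point matrix on an n x n grid.
void fivepoint_row(int k, int n, int cols[5], double vals[5], int *nnz) {
  int i = k / n, j = k % n, m = 0;
  cols[m] = k;     vals[m++] = 4.0;                      // diagonal
  if (j > 0)     { cols[m] = k - 1; vals[m++] = -1.0; }  // west:  u_{k-1}
  if (j < n - 1) { cols[m] = k + 1; vals[m++] = -1.0; }  // east:  u_{k+1}
  if (i > 0)     { cols[m] = k - n; vals[m++] = -1.0; }  // south: u_{k-n}
  if (i < n - 1) { cols[m] = k + n; vals[m++] = -1.0; }  // north: u_{k+n}
  *nnz = m;
}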

48
The graph view of things
(Figure: the unknowns of the Poisson equation on a 2D grid, connected by the five-point stencil.)

This is a graph!
It is the (adjacency) graph of a sparse matrix.

49
Sparse matrix from 2D equation
Each row k has a 4 on the diagonal, −1 in columns k − 1 and k + 1 (except
where k is at the start or end of a grid line: the blocks marked 0/ in the
original figure), and −1 in columns k − n and k + n; all other entries are zero:

row k:     ( · · ·   −1   · · ·   −1    4   −1   · · ·   −1   · · · )
column:             k − n        k − 1  k  k + 1        k + n

50
Matrix properties
• Very sparse, banded
• Factorization takes less than the dense n² space and n³ work
• Symmetric (only because 2nd order problem)
• Sign pattern: positive diagonal, nonpositive off-diagonal
  (true for many second order methods)
• Positive definite (just like the continuous problem)
• Constant diagonals: only because of the constant coefficient
  differential equation
• Factorization: lower complexity than dense, recursion length less
  than N.

51
Realistic meshes

52
