Chapter 9 - Parallel Computation Problems
Problems
9.1 Numerical approach for dense matrices
Review
Matrix-Vector Multiplication
Compute: y = Ax
y, x are n×1 vectors
A is an n×n dense matrix
Serial complexity: W = O(n²)
We will consider:
1D & 2D partitioning
Row-wise 1D Partitioning
Row-wise 1D Partitioning
Analysis?
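The partitioning itself appears only as a figure in this extract. As a hedged illustration (not from the slides), a minimal C/MPI sketch of the row-wise 1D formulation could look as follows; the function name matvec_rowwise and its arguments are assumptions, and the number of processes p is assumed to divide n evenly:

#include <mpi.h>

/* y = A x with row-wise 1D partitioning: each of the p processes owns
   n/p consecutive rows of A and the matching n/p entries of x and y. */
void matvec_rowwise(float *Alocal,   /* (n/p) x n local rows of A */
                    float *xlocal,   /* n/p local entries of x    */
                    float *ylocal,   /* n/p local entries of y    */
                    float *xfull,    /* workspace of length n     */
                    int n, MPI_Comm comm)
{
    int p, i, j, nloc;
    MPI_Comm_size(comm, &p);
    nloc = n / p;
    /* every process needs the whole vector x for its rows: all-gather */
    MPI_Allgather(xlocal, nloc, MPI_FLOAT, xfull, nloc, MPI_FLOAT, comm);
    /* local computation: n/p dot products of length n */
    for (i = 0; i < nloc; i++) {
        ylocal[i] = 0.0f;
        for (j = 0; j < n; j++)
            ylocal[i] += Alocal[i*n + j] * xfull[j];
    }
}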
Block 2D Partitioning
Block 2D Partitioning
Each processor needs to have the portion of the x vector that corresponds to the set of columns that it stores.
Analysis?
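Neither "Analysis?" prompt is worked out in this extract. As a hedged sketch of the standard textbook analysis (assuming p processes, message startup time t_s, per-word transfer time t_w, and logarithmic-time collectives), the parallel run times are roughly

  T_P(1D, row-wise) ≈ n²/p + t_s log p + t_w n
  T_P(2D, block)    ≈ n²/p + t_s log p + t_w (n/√p) log p

so the 1D formulation stays cost-optimal only up to about p = O(n) processes, while the 2D formulation scales further, to roughly p = O(n²/log²n).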
1D vs 2D Formulation
Matrix-Matrix Multiplication
Compute: C = AB
A, B, & C are n×n dense matrices.
Serial complexity: W = O(n³)
We will consider:
2D & 3D partitioning
Simple 2D Algorithm
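The content of this slide is figure-only in this extract. As a hedged recap of the usual "simple" 2D formulation: on a √p × √p process grid, the process holding block C_{i,j} needs every block A_{i,k} in its block-row and every block B_{k,j} in its block-column (obtained by all-to-all broadcasts along rows and columns), and then computes

  C_{i,j} = Σ_{k=0..√p-1} A_{i,k} B_{k,j}

The drawback is the Θ(n²/√p) memory per process needed to hold the gathered blocks, which motivates Cannon's algorithm on the next slide.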
Cannon’s Algorithm
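This slide is also figure-only here. Below is a hedged C/MPI sketch of the usual formulation of Cannon's algorithm; it assumes a periodic √p × √p Cartesian communicator created elsewhere with MPI_Cart_create, and the helper matmul_add and all argument names are illustrative, not from the slides.

#include <mpi.h>

/* C += A*B on blk x blk row-major blocks (C is assumed pre-zeroed by the caller) */
static void matmul_add(float *A, float *B, float *C, int blk)
{
    int i, j, k;
    for (i = 0; i < blk; i++)
        for (k = 0; k < blk; k++)
            for (j = 0; j < blk; j++)
                C[i*blk + j] += A[i*blk + k] * B[k*blk + j];
}

/* Cannon's algorithm: each process owns one blk x blk block of A, B and C. */
void cannon(float *A, float *B, float *C, int blk, MPI_Comm grid)
{
    int dims[2], periods[2], coords[2];
    int src, dst, left, right, up, down, s, q;

    MPI_Cart_get(grid, 2, dims, periods, coords);
    q = dims[0];                                   /* q x q process grid */

    /* initial alignment: shift row i of A left by i, column j of B up by j */
    MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
    MPI_Sendrecv_replace(A, blk*blk, MPI_FLOAT, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
    MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
    MPI_Sendrecv_replace(B, blk*blk, MPI_FLOAT, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);

    /* neighbors for the q single-step shifts */
    MPI_Cart_shift(grid, 1, -1, &right, &left);    /* A moves one step left */
    MPI_Cart_shift(grid, 0, -1, &down, &up);       /* B moves one step up   */

    for (s = 0; s < q; s++) {
        matmul_add(A, B, C, blk);                  /* local multiply-accumulate */
        MPI_Sendrecv_replace(A, blk*blk, MPI_FLOAT, left, 0, right, 0, grid, MPI_STATUS_IGNORE);
        MPI_Sendrecv_replace(B, blk*blk, MPI_FLOAT, up, 0, down, 0, grid, MPI_STATUS_IGNORE);
    }
}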
Can we do better?
3D Algorithm—DNS Algorithm
Partitioning the intermediate data
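The figures for the DNS (Dekel-Nassimi-Sahni) algorithm are not in this extract. As a hedged recap of what "partitioning the intermediate data" means here: with up to p = n³ processes arranged as an n × n × n grid, process (i,j,k) owns the single intermediate product A_{i,k} B_{k,j}, and the products are then summed along the k dimension,

  C_{i,j} = Σ_{k=0..n-1} A_{i,k} B_{k,j},   giving   T_P = Θ(log n) with p = n³,

which is fast but not cost-optimal at p = n³; cost-optimality requires using roughly p = O(n³/log³n) processes.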
3D Algorithm—DNS Algorithm
Gaussian Elimination
Solve Ax = b
A is an n×n dense matrix.
x and b are dense vectors.
Serial complexity: W = O(n³)
There are two key steps in each iteration (see the serial sketch below):
Division step
Rank-1 update
We will consider:
1D & 2D partitioning, and introduce the notion of pipelining.
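As a hedged serial reference point (not from the slides), the two key steps of iteration k can be sketched in C as follows; gaussian_elimination and its arguments are assumptions, and no pivoting is done:

void gaussian_elimination(float *A, float *b, int n)
{
    int i, j, k;
    float pivot;
    for (k = 0; k < n; k++) {
        /* division step: normalize row k of A (and b) by the pivot */
        pivot = A[k*n + k];
        for (j = k + 1; j < n; j++)
            A[k*n + j] /= pivot;
        b[k] /= pivot;
        A[k*n + k] = 1.0f;

        /* rank-1 update: eliminate column k from all rows below row k */
        for (i = k + 1; i < n; i++) {
            for (j = k + 1; j < n; j++)
                A[i*n + j] -= A[i*n + k] * A[k*n + j];
            b[i] -= A[i*n + k] * b[k];
            A[i*n + k] = 0.0f;
        }
    }
}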
1D Partitioning
1D Pipelined Formulation
Existing Algorithm:
The next iteration starts only when the previous iteration has finished.
Key Idea:
The next iteration can start as soon as the rank-1 update involving the next row has finished.
Essentially, multiple iterations are performed simultaneously!
Cost-optimal with n processors
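As a quick, hedged consistency check of this claim: with p = n processes and pipelining, every process stays busy, and each of the n outer iterations contributes Θ(n) time per process, so

  T_P = Θ(n²),   p·T_P = n·Θ(n²) = Θ(n³) = Θ(W),

which is exactly what cost-optimality requires.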
1D Partitioning
2D Mapping
2D Pipelined
Cost-optimal with n² processors
9.2 Numerical approach for PDE problems
Mathematical model and algorithm
• Heat equation (PDE):
  ∂C/∂t = D ∇²C,   where   ∇²C = ∂²C/∂x² + ∂²C/∂y²
• Algorithm (explicit finite differences at grid point (i,j) with spacing dx, time level t_n):
• Initial input value: C^{t_0}_{i,j}
• At step n+1:
  ∇²C^{t_n}_{i,j} = FD^{t_n}_{i,j} = (C^{t_n}_{i+1,j} + C^{t_n}_{i-1,j} + C^{t_n}_{i,j+1} + C^{t_n}_{i,j-1} - 4 C^{t_n}_{i,j}) / dx²
  C^{t_{n+1}}_{i,j} = C^{t_n}_{i,j} + dt · D · FD^{t_n}_{i,j}
• Data dependency?
Data dependency
• Calculation at point (i,j) needs data from neighboring
points: (i-1,j), (i+1,j) , (i,j-1) , (i,j+1)
• This is data dependency
• Solution
• Shared memory system: Synchronization
• Distributed memory system: Communication and
Synchronization (Difficult, Optimization)
• Exercise:
• Write a OpenMP program to implement Heat Equations
problem
• Write a MPI program to implement Heat Equations
problem
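For the first exercise, a hedged OpenMP sketch (not from the slides) is given below; it reuses the serial finite-difference logic and the globals m, n, dx, dt, D and T that appear in the later slides, and the function name heat_openmp is an assumption:

#include <omp.h>

extern int m, n;            /* grid dimensions (globals, as in the serial code)     */
extern float dx, dt, D, T;  /* spacing, time step, diffusion coefficient, end time  */

void heat_openmp(float *C, float *dC)
{
    float t = 0.0f;
    int i, j;
    while (t <= T) {
        /* spatial discretization: each (i,j) only reads C and writes dC,
           so the rows can be distributed over the threads */
        #pragma omp parallel for private(j)
        for (i = 0; i < m; i++)
            for (j = 0; j < n; j++) {
                float c = C[i*n + j];
                float u = (i == 0)   ? c : C[(i-1)*n + j];
                float d = (i == m-1) ? c : C[(i+1)*n + j];
                float l = (j == 0)   ? c : C[i*n + j - 1];
                float r = (j == n-1) ? c : C[i*n + j + 1];
                dC[i*n + j] = (1/(dx*dx)) * (u + d + l + r - 4*c);
            }
        /* the implicit barrier at the end of the parallel for is the
           synchronization needed before C is overwritten */
        #pragma omp parallel for private(j)
        for (i = 0; i < m; i++)
            for (j = 0; j < n; j++)
                C[i*n + j] += dt * D * dC[i*n + j];
        t += dt;
    }
}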
Mathematical model and algorithm
• Notation (5-point stencil around grid point (i,j)):
  c: C_{i,j}
  u: C_{i-1,j}
  d: C_{i+1,j}
  l: C_{i,j-1}
  r: C_{i,j+1}
Implementation: Spatial Discretization (FD)
∇²C^{t_n}_{i,j} = FD^{t_n}_{i,j} = (C^{t_n}_{i+1,j} + C^{t_n}_{i-1,j} + C^{t_n}_{i,j+1} + C^{t_n}_{i,j-1} - 4 C^{t_n}_{i,j}) / dx²
void FD(float *C, float *dC) {   /* m, n, dx are globals */
  int i, j;
  float c, u, d, l, r;
  for ( i = 0 ; i < m ; i++ )
    for ( j = 0 ; j < n ; j++ )
    {
      c = *(C+i*n+j);
      /* Boundary condition: points outside the domain replicate the edge value */
      u = (i==0)   ? *(C+i*n+j) : *(C+(i-1)*n+j);
      d = (i==m-1) ? *(C+i*n+j) : *(C+(i+1)*n+j);
      l = (j==0)   ? *(C+i*n+j) : *(C+i*n+j-1);
      r = (j==n-1) ? *(C+i*n+j) : *(C+i*n+j+1);
      *(dC+i*n+j) = (1/(dx*dx))*(u+d+l+r-4*c);
    }
}
Implementation: Time Integration
C^{t_{n+1}}_{i,j} = C^{t_n}_{i,j} + dt · D · FD^{t_n}_{i,j}
while (t<=T)
{
  FD(C, dC);                    /* spatial discretization */
  for ( i = 0 ; i < m ; i++ )   /* explicit Euler update  */
    for ( j = 0 ; j < n ; j++ )
      *(C+i*n+j) = *(C+i*n+j) + dt*D*(*(dC+i*n+j));
  t=t+dt;
}
SPMD Parallel Algorithm (1)
(Figure: domain decomposition, the grid split into strips assigned to CPU0, CPU1 and CPU2.)
SPMD Parallel Algorithm (2)
SPMD Parallel Algorithm (3)
SPMD Parallel Algorithm (4)
• B2: Domain decomposition
• Many approaches are possible
• Different approaches have different efficiency
• The following uses a row-wise domain decomposition
• Given that the size of the domain is m×n
• Subdomain for each CPU: mc×n, where mc = m/NP and NP is the number of CPUs
(Figure: the m×n domain split row-wise into strips for CPU0, CPU1 and CPU2.)
SPMD Parallel Algorithm (5)
• B3: Distribute the input data from the root to all other CPUs
MPI_Scatter (C, mc*n, MPI_FLOAT,
             Cs, mc*n, MPI_FLOAT, 0,
             MPI_COMM_WORLD);
SPMD Parallel Algorithm (6)
• B5: Gather the output from all other CPUs to the root
MPI_Gather ( Cs, mc*n, MPI_FLOAT,
             C, mc*n, MPI_FLOAT, 0,
             MPI_COMM_WORLD);
SPMD Parallel Algorithm (7)
• B4: Computation, performed by every CPU on its own strip (see the assembled loop below):
  - B4.1: Communication
  - B4.2: Calculation
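As a hedged sketch of how the two sub-steps fit together in each CPU's time loop (the helper names exchange_Cu and exchange_Cd stand for the B4.1a/B4.1b code on the next slides and are not on the slides themselves):

while (t <= T) {
    exchange_Cu(Cs, Cu, rank, NP);   /* B4.1a: ghost row from the CPU above  */
    exchange_Cd(Cs, Cd, rank, NP);   /* B4.1b: ghost row from the CPU below  */
    FD(Cs, Cu, Cd, dCs, mc);         /* B4.2: local finite differences       */
    for (i = 0; i < mc; i++)         /* local time integration (D is already */
        for (j = 0; j < n; j++)      /* folded into this version of FD)      */
            Cs[i*n + j] += dt * dCs[i*n + j];
    t = t + dt;
}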
SPMD Parallel Algorithm (8)
- B4.1a): Communicate array Cu (the row just above each CPU's strip)
- B4.1b): Communicate array Cd (the row just below each CPU's strip)
(Figure: ghost rows Cu and Cd exchanged between CPU1 and its neighbors, e.g. row Cs_{0,j} of CPU2 becomes Cd of CPU1.)
SPMD Parallel Algorithm (9)
• B4.1a): Communicate array Cu
if (rank==0){
  /* no CPU above: Cu replicates the first local row (boundary condition) */
  for (j=0; j<n; j++) *(Cu+j) = *(Cs+0*n+j);
  /* send the last local row down to rank+1 */
  MPI_Send (Cs+(mc-1)*n, n, MPI_FLOAT, rank+1, rank, …);
} else if (rank==NP-1) {
  /* bottom CPU only receives the row above its strip */
  MPI_Recv (Cu, n, MPI_FLOAT, rank-1, rank-1, …);
} else {
  MPI_Send (Cs+(mc-1)*n, n, MPI_FLOAT, rank+1, rank,…);
  MPI_Recv (Cu, n, MPI_FLOAT, rank-1, rank-1, …);
}
(Figure: CPU1 receives row Cs_{mc-1,j} from CPU0 into Cu, and row Cs_{0,j} from CPU2 into Cd.)
SPMD Parallel Algorithm (10)
• B4.1b): Communicate array Cd
if (rank==NP-1){
  /* no CPU below: Cd replicates the last local row (boundary condition) */
  for (j=0; j<n; j++) *(Cd+j) = *(Cs+(mc-1)*n+j);
  /* send the first local row up to rank-1 */
  MPI_Send (Cs, n, MPI_FLOAT, rank-1, rank, …);
} else if (rank==0) {
  /* top CPU only receives the row below its strip */
  MPI_Recv (Cd, n, MPI_FLOAT, rank+1, rank+1, …);
} else {
  MPI_Send (Cs, n, MPI_FLOAT, rank-1, rank, …);
  MPI_Recv (Cd, n, MPI_FLOAT, rank+1, rank+1, …);
}
SPMD Parallel Algorithm (11)
• B4.2: Calculation
void FD(float *Cs, float *Cu, float *Cd, float *dCs, int ms) {   /* n, D, dx are globals */
  int i, j;
  float c, u, d, l, r;
  for ( i = 0 ; i < ms ; i++ )
    for ( j = 0 ; j < n ; j++ )
    {
      c = *(Cs+i*n+j);
      /* ghost rows Cu/Cd replace the rows above/below the local strip */
      u = (i==0)    ? *(Cu+j) : *(Cs+(i-1)*n+j);
      d = (i==ms-1) ? *(Cd+j) : *(Cs+(i+1)*n+j);
      l = (j==0)    ? *(Cs+i*n+j) : *(Cs+i*n+j-1);
      r = (j==n-1)  ? *(Cs+i*n+j) : *(Cs+i*n+j+1);
      /* note: D is folded into FD here, so the update step uses dt only */
      *(dCs+i*n+j) = (D/(dx*dx))*(u+d+l+r-4*c);
    }
}
Thank you for your attention!