
Chapter 9 - Parallel Computation Problems

9.1 Numerical approach for dense matrices
Review

Matrix-Vector Multiplication

Compute: y = Ax
x and y are n×1 vectors
A is an n×n dense matrix
Serial complexity: W = O(n²)
We will consider 1D & 2D partitioning.

Row-wise 1D Partitioning

How do we perform the operation?

Row-wise 1D Partitioning

Each processor needs to have the entire x vector.

Steps: an all-to-all broadcast of x, followed by local computations. (A sketch in MPI follows below.)

Analysis?
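
To make the two steps concrete, here is a minimal C/MPI sketch of the row-wise formulation. It assumes p divides n and that each process already holds its n/p rows of A and its n/p elements of x; the names (mat_vec_1d, local_A, ...) are illustrative rather than from the course code. MPI_Allgather realizes the all-to-all broadcast of x, after which each process computes its n/p entries of y locally.

#include <mpi.h>
#include <stdlib.h>

/* Row-wise 1D y = Ax: local_A is the (n/p) x n block of rows owned
   by this process, local_x its n/p elements of x, local_y its n/p
   results. */
void mat_vec_1d(const float *local_A, const float *local_x,
                float *local_y, int n, int p)
{
    int nloc = n / p;
    float *x = malloc(n * sizeof(float));

    /* All-to-all broadcast: every process assembles the full x. */
    MPI_Allgather(local_x, nloc, MPI_FLOAT,
                  x, nloc, MPI_FLOAT, MPI_COMM_WORLD);

    /* Local computation: n/p dot-products of length n. */
    for (int i = 0; i < nloc; i++) {
        local_y[i] = 0.0f;
        for (int j = 0; j < n; j++)
            local_y[i] += local_A[i*n + j] * x[j];
    }
    free(x);
}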

Block 2D Partitioning

How do we perform the operation?

Block 2D Partitioning
Each processor needs to have the portion of the x vector
that corresponds to the set of columns that it stores.

Analysis?
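
A possible C/MPI sketch of the 2D scheme, under the usual assumptions (a √p × √p grid, each process owning an nb × nb block of A with nb = n/√p, and the nb-long slice of x initially held by the diagonal process of each column); the function and variable names are illustrative. The slice of x is broadcast down each column, each process forms its partial product, and the partials are summed across each row.

#include <mpi.h>
#include <stdlib.h>

/* 2D block y = Ax on a sqp x sqp grid. (row, col) are this
   process's grid coordinates; ypart (length nb) ends up on the
   processes in column 0. */
void mat_vec_2d(const float *Ablk, float *xpart, float *ypart,
                int nb, int row, int col, MPI_Comm grid)
{
    MPI_Comm col_comm, row_comm;
    MPI_Comm_split(grid, col, row, &col_comm); /* same column */
    MPI_Comm_split(grid, row, col, &row_comm); /* same row    */

    /* 1. Broadcast the x slice down each column; within col_comm
          the ranks equal the row index, so the diagonal owner of
          column 'col' is rank 'col'. */
    MPI_Bcast(xpart, nb, MPI_FLOAT, col, col_comm);

    /* 2. Local block-vector product. */
    float *tmp = malloc(nb * sizeof(float));
    for (int i = 0; i < nb; i++) {
        tmp[i] = 0.0f;
        for (int j = 0; j < nb; j++)
            tmp[i] += Ablk[i*nb + j] * xpart[j];
    }

    /* 3. All-to-one reduction across each row; rank 0 of row_comm
          is the process in column 0. */
    MPI_Reduce(tmp, ypart, nb, MPI_FLOAT, MPI_SUM, 0, row_comm);

    free(tmp);
    MPI_Comm_free(&col_comm);
    MPI_Comm_free(&row_comm);
}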

1D vs 2D Formulation

Which one is better? As p grows, the 2D formulation communicates less per processor (roughly O((n/√p) log p) words versus O(n) for 1D) and stays cost-optimal for a larger number of processors, so it scales better.

Matrix-Matrix Multiplication

Compute: C = AB
A, B, & C are n×n dense matrices
Serial complexity: W = O(n³)
We will consider 2D & 3D partitioning.

Simple 2D Algorithm

Processors are arranged in a logical √p × √p 2D topology.
Each processor gets an (n/√p) × (n/√p) block of A, B, & C.
It is responsible for computing the entries of C that it has
been assigned.
Analysis?
How about the memory complexity? (Each processor ends up
gathering an entire block-row of A and block-column of B, i.e.
O(n²/√p) data, so total memory grows by a factor of √p over
the input.)

Cannon’s Algorithm

Memory-efficient variant of the simple algorithm.
Key idea:
Replace the traditional loop
  C_{i,j} = Σ_{k=0}^{√p-1} A_{i,k} · B_{k,j}
with the following loop:
  C_{i,j} = Σ_{k=0}^{√p-1} A_{i,(i+j+k) mod √p} · B_{(i+j+k) mod √p, j}
During each step, processors operate on different blocks of A and
B, so each processor stores only one block of each at a time. (A
sketch in MPI follows below.)
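
Below is a hedged C/MPI sketch of Cannon's algorithm. It assumes a periodic √p × √p Cartesian communicator (created elsewhere with MPI_Cart_create), with one nb × nb block of A, B, and C per process and C zero-initialized; all names are illustrative.

#include <mpi.h>

/* a, b, c: the nb x nb blocks owned by this process; c must be
   zero-initialized. grid is a periodic sqp x sqp Cartesian comm. */
void cannon(float *a, float *b, float *c, int nb, MPI_Comm grid)
{
    int dims[2], periods[2], coords[2];
    int left, right, up, down, src, dst;
    MPI_Cart_get(grid, 2, dims, periods, coords);
    int sqp = dims[0];

    /* Initial alignment: shift block row i of A left by i and
       block column j of B up by j. */
    MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
    MPI_Sendrecv_replace(a, nb*nb, MPI_FLOAT, dst, 0, src, 0,
                         grid, MPI_STATUS_IGNORE);
    MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
    MPI_Sendrecv_replace(b, nb*nb, MPI_FLOAT, dst, 0, src, 0,
                         grid, MPI_STATUS_IGNORE);

    MPI_Cart_shift(grid, 1, -1, &right, &left); /* A moves left */
    MPI_Cart_shift(grid, 0, -1, &down, &up);    /* B moves up   */

    for (int step = 0; step < sqp; step++) {
        /* Local multiply-accumulate: c += a * b. */
        for (int i = 0; i < nb; i++)
            for (int k = 0; k < nb; k++)
                for (int j = 0; j < nb; j++)
                    c[i*nb + j] += a[i*nb + k] * b[k*nb + j];

        /* Single-step circular shifts for the next iteration. */
        MPI_Sendrecv_replace(a, nb*nb, MPI_FLOAT, left, 0, right, 0,
                             grid, MPI_STATUS_IGNORE);
        MPI_Sendrecv_replace(b, nb*nb, MPI_FLOAT, up, 0, down, 0,
                             grid, MPI_STATUS_IGNORE);
    }
}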

Can we do better?

Can we use more than O(n²) processors?
So far the task corresponded to the dot-product of two vectors,
i.e., C_{i,j} = A_{i,*} · B_{*,j}
How about performing this dot-product in parallel?
What is the maximum concurrency that we can extract?

3D Algorithm—DNS Algorithm
Partitioning the intermediate data: the n³ scalar products
A_{i,k}·B_{k,j} are viewed as an n×n×n cube of tasks and mapped
onto a 3D processor topology; with up to n³ processors the partial
products are combined by a reduction along one dimension, giving
O(log n) time.

Gaussian Elimination
Solve Ax = b
A is an n×n dense matrix; x and b are dense vectors
Serial complexity: W = O(n³)
There are two key steps in each iteration:
  Division step
  Rank-1 update
We will consider 1D & 2D partitioning, and introduce the notion
of pipelining.

1D Partitioning

Assign n/p rows of A to each processor.
During the ith iteration:
  The divide operation is performed by the processor that stores row i.
  The result is broadcast to the rest of the processors.
  Each processor performs the rank-1 update for its local rows.
Analysis? (one element per processor)
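
As a concrete (non-pipelined) illustration of this slide, here is a minimal C/MPI sketch of Gaussian elimination with 1D block-row partitioning and no pivoting. It assumes p divides n; the names (gauss_1d, rows, piv) are illustrative.

#include <mpi.h>
#include <stdlib.h>

/* rows: the (n/p) x (n+1) augmented rows owned by this process
   under block mapping (global rows rank*nloc .. rank*nloc+nloc-1). */
void gauss_1d(float *rows, int n, int rank, int p)
{
    int nloc = n / p;
    float *piv = malloc((n + 1) * sizeof(float));

    for (int k = 0; k < n; k++) {
        int owner = k / nloc;                /* who stores row k */
        if (rank == owner) {
            float *r = rows + (k - owner*nloc) * (n + 1);
            for (int j = k + 1; j <= n; j++) /* division step */
                r[j] /= r[k];
            r[k] = 1.0f;
            for (int j = 0; j <= n; j++) piv[j] = r[j];
        }
        /* Broadcast the normalized row k to all processors. */
        MPI_Bcast(piv, n + 1, MPI_FLOAT, owner, MPI_COMM_WORLD);

        /* Rank-1 update of the local rows below row k. */
        for (int i = 0; i < nloc; i++) {
            if (rank*nloc + i > k) {
                float *r = rows + i * (n + 1);
                for (int j = k + 1; j <= n; j++)
                    r[j] -= r[k] * piv[j];
                r[k] = 0.0f;
            }
        }
    }
    free(piv);
}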

1D Pipelined Formulation

Existing algorithm:
  The next iteration starts only when the previous iteration has
  finished.
Key idea:
  The next iteration can start as soon as the rank-1 update
  involving the next row has finished.
  Essentially, multiple iterations are performed simultaneously!

Cost-optimal with n processors.

1D Partitioning

Is the block mapping a good idea? (No: with a block mapping,
processors owning the top rows fall idle as elimination proceeds;
a cyclic mapping of rows balances the load.)

2D Mapping

Each processor gets a 2D block of the matrix.
Steps:
  Broadcast of the "active" column along the rows.
  Divide step performed in parallel by the processors that own
  portions of the row.
  Broadcast along the columns.
  Rank-1 update.
Analysis?

2D Pipelined

Cost-optimal with n² processors.

9.2 Numerical approach for PDE problems

PARALLEL SOLUTION TO PDEs

CASE STUDY: THE HEAT EQUATION

Mathematical model and algorithm
• Heat equation (PDE):

    ∂C/∂t = D ∇²C
    ∇²C = ∂²C/∂x² + ∂²C/∂y²

• Algorithm
  • Initial input value: C_{i,j}^{t_0}
  • At step n+1, on a grid with spacing dx:

    ∇²C_{i,j}^{t_n} = FD_{i,j}^{t_n}
      = (C_{i+1,j}^{t_n} + C_{i-1,j}^{t_n} + C_{i,j+1}^{t_n} + C_{i,j-1}^{t_n} - 4 C_{i,j}^{t_n}) / dx²

    C_{i,j}^{t_{n+1}} = C_{i,j}^{t_n} + dt · D · FD_{i,j}^{t_n}

• Data dependency?
Data dependency
• Calculation at point (i,j) needs data from the neighboring
  points (i-1,j), (i+1,j), (i,j-1), (i,j+1)
• This is a data dependency
• Solution
  • Shared memory system: synchronization
  • Distributed memory system: communication and
    synchronization (difficult; room for optimization)
• Exercise (a minimal OpenMP starting point is sketched below):
  • Write an OpenMP program to implement the heat equation
    problem
  • Write an MPI program to implement the heat equation
    problem
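
As a starting point for the OpenMP exercise, the sketch below (illustrative, not the official solution) parallelizes the two loop nests of one time step; the implicit barrier between them provides the synchronization demanded by the data dependency.

#include <omp.h>

/* One time step on an m x n grid; D, dt, dx as in the model. */
void step(float *C, float *dC, int m, int n,
          float D, float dt, float dx)
{
    int i, j;
    #pragma omp parallel for private(j)
    for (i = 0; i < m; i++)
        for (j = 0; j < n; j++) {
            float c = C[i*n + j];
            float u = (i == 0)   ? c : C[(i-1)*n + j];
            float d = (i == m-1) ? c : C[(i+1)*n + j];
            float l = (j == 0)   ? c : C[i*n + j - 1];
            float r = (j == n-1) ? c : C[i*n + j + 1];
            dC[i*n + j] = (1.0f/(dx*dx)) * (u + d + l + r - 4*c);
        }
    /* Implicit barrier here: all of dC is complete before any
       thread updates C. */
    #pragma omp parallel for private(j)
    for (i = 0; i < m; i++)
        for (j = 0; j < n; j++)
            C[i*n + j] += dt * D * dC[i*n + j];
}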

Mathematical model and algorithm

• Notation (5-point stencil around the center point c at (i,j),
  with u above, d below, l to the left, r to the right):
  c: C_{i,j}
  u: C_{i-1,j}
  d: C_{i+1,j}
  l: C_{i,j-1}
  r: C_{i,j+1}

Implementation: Spatial Discretization (FD)
∇²C_{i,j}^{t_n} = FD_{i,j}^{t_n}
  = (C_{i+1,j}^{t_n} + C_{i-1,j}^{t_n} + C_{i,j+1}^{t_n} + C_{i,j-1}^{t_n} - 4 C_{i,j}^{t_n}) / dx²
/* m, n (grid dimensions) and dx (grid spacing) are globals. */
void FD(float *C, float *dC) {
  int i, j;
  float c, u, d, l, r;
  for ( i = 0 ; i < m ; i++ )
    for ( j = 0 ; j < n ; j++ )
    {
      c = *(C+i*n+j);
      /* Boundary condition: outside the domain, reuse the
         center value. */
      u = (i==0)   ? *(C+i*n+j) : *(C+(i-1)*n+j);
      d = (i==m-1) ? *(C+i*n+j) : *(C+(i+1)*n+j);
      l = (j==0)   ? *(C+i*n+j) : *(C+i*n+j-1);
      r = (j==n-1) ? *(C+i*n+j) : *(C+i*n+j+1);
      *(dC+i*n+j) = (1/(dx*dx))*(u+d+l+r-4*c);
    }
}
Implementation: Time Integration
C_{i,j}^{t_{n+1}} = C_{i,j}^{t_n} + dt · D · FD_{i,j}^{t_n}

while (t <= T)
{
  FD(C, dC);                      /* dC = FD = del^2 C */
  for ( i = 0 ; i < m ; i++ )
    for ( j = 0 ; j < n ; j++ )   /* note the factor D from the model */
      *(C+i*n+j) = *(C+i*n+j) + dt*D*(*(dC+i*n+j));
  t = t + dt;
}

SPMD Parallel Algorithm (1)

• SPMD: Single Program Multiple Data

[Figure: row-wise domain decomposition: the grid is split into
horizontal strips, one per CPU (CPU0, CPU1, CPU2).]

SPMD Parallel Algorithm (2)

• B1: Input data
  • Usually, the initial input data is at CPU 0 (Root)
• B2: Domain decomposition
• B3: Distribute input data from Root to all other CPUs
• B4: Computation (each CPU calculates on its subdomain)
• B5: Gather output from all other CPUs to Root

B3 and B5: Communication (Input and Output)

SPMD Parallel Algorithm (3)

• B1: Input data
  • Depends on the requirements of each problem
  • Different input results in different output

SPMD Parallel Algorithm (4)
• B2: Domain decomposition
  • Many approaches; different approaches have different efficiency
  • The following is a row-wise domain decomposition
  • Given a domain of size m×n, the subdomain for each CPU is
    mc×n, where mc = m/NP and NP is the number of CPUs

[Figure: the m×n domain split into row strips for CPU0, CPU1, CPU2.]

SPMD Parallel Algorithm (5)
• B3: Distribute input data from Root to all other CPUs
MPI_Scatter (C, mc*n, MPI_FLOAT,
Cs, mc*n, MPI_FLOAT, 0,
MPI_COMM_WORLD);

[Figure: C, initialized at Root, is scattered as mc×n strips Cs
to CPU0, CPU1, CPU2.]

SPMD Parallel Algorithm (6)
• B5: Gather output from all other CPUs to Root
MPI_Gather ( Cs, mc*n, MPI_FLOAT,
C, mc*n, MPI_FLOAT, 0,
MPI_COMM_WORLD);

[Figure: the strips Cs, computed at the CPUs, are gathered back
into C at Root (CPU0).]

SPMD Parallel Algorithm (7)
• B4: Computation
  - B4.1: Communication
  - B4.2: Calculation

SPMD Parallel Algorithm (8)

• B4.1: Communication
  - B4.1a): Communicate array Cu: each CPU receives the last row
    (Cs_{mc-1,j}) of the CPU above it
  - B4.1b): Communicate array Cd: each CPU receives the first row
    (Cs_{0,j}) of the CPU below it

[Figure: ghost-row exchange between CPU0, CPU1 and CPU2.]

SPMD Parallel Algorithm (9)
• B4.1a): Communicate array Cu

if (rank==0){
  /* Top CPU: no neighbor above, so Cu mirrors its own first row. */
  for (j=0; j<n; j++) *(Cu+j) = *(Cs+0*n+j);
  MPI_Send (Cs+(mc-1)*n, n, MPI_FLOAT, rank+1, rank, MPI_COMM_WORLD);
} else if (rank==NP-1) {
  /* Bottom CPU: only receives from above. */
  MPI_Recv (Cu, n, MPI_FLOAT, rank-1, rank-1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
} else {
  /* Middle CPUs: send the last local row down, receive Cu from above. */
  MPI_Send (Cs+(mc-1)*n, n, MPI_FLOAT, rank+1, rank, MPI_COMM_WORLD);
  MPI_Recv (Cu, n, MPI_FLOAT, rank-1, rank-1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

SPMD Parallel Algorithm (10)
• B4.1b): Communicate array Cd
if (rank==NP-1){
  /* Bottom CPU: no neighbor below, so Cd mirrors its own last row. */
  for (j=0; j<n; j++) *(Cd+j) = *(Cs+(mc-1)*n+j);
  MPI_Send (Cs, n, MPI_FLOAT, rank-1, rank, MPI_COMM_WORLD);
} else if (rank==0) {
  /* Top CPU: only receives from below. */
  MPI_Recv (Cd, n, MPI_FLOAT, rank+1, rank+1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
} else {
  /* Middle CPUs: send the first local row up, receive Cd from below. */
  MPI_Send (Cs, n, MPI_FLOAT, rank-1, rank, MPI_COMM_WORLD);
  MPI_Recv (Cd, n, MPI_FLOAT, rank+1, rank+1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

SPMD Parallel Algorithm (11)
• B4.2: Calculation
/* ms (= mc) is the number of local rows; n and dx are globals. */
void FD(float *Cs, float *Cu, float *Cd, float *dCs, int ms) {
  int i, j;
  float c, u, d, l, r;
  for ( i = 0 ; i < ms ; i++ )
    for ( j = 0 ; j < n ; j++ )
    {
      c = *(Cs+i*n+j);
      /* At the strip edges, the ghost rows Cu and Cd replace the
         rows owned by the neighboring CPUs. */
      u = (i==0)    ? *(Cu+j) : *(Cs+(i-1)*n+j);
      d = (i==ms-1) ? *(Cd+j) : *(Cs+(i+1)*n+j);
      l = (j==0)    ? *(Cs+i*n+j) : *(Cs+i*n+j-1);
      r = (j==n-1)  ? *(Cs+i*n+j) : *(Cs+i*n+j+1);
      /* 1/(dx*dx), as in the serial FD; D is applied in the
         dt*D*FD time update. */
      *(dCs+i*n+j) = (1/(dx*dx))*(u+d+l+r-4*c);
    }
}
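Putting B1-B5 together, a hedged sketch of the complete SPMD driver might look as follows. Here init_input() and exchange_halo() are hypothetical placeholders for the input initialization (B1) and the B4.1a/B4.1b code above, and m, n, dx, dt, D, T are assumed to be globals as before.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, NP, i;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &NP);

    int mc = m / NP;                               /* B2: rows per CPU */
    float *C   = (rank == 0) ? init_input() : NULL; /* B1 (hypothetical) */
    float *Cs  = malloc(mc * n * sizeof(float));
    float *dCs = malloc(mc * n * sizeof(float));
    float *Cu  = malloc(n * sizeof(float));
    float *Cd  = malloc(n * sizeof(float));

    /* B3: distribute the input from Root. */
    MPI_Scatter(C, mc*n, MPI_FLOAT, Cs, mc*n, MPI_FLOAT,
                0, MPI_COMM_WORLD);

    /* B4: computation loop. */
    for (float t = 0; t <= T; t += dt) {
        exchange_halo(Cs, Cu, Cd, mc, rank, NP); /* B4.1 (hypothetical name) */
        FD(Cs, Cu, Cd, dCs, mc);                 /* B4.2 */
        for (i = 0; i < mc*n; i++)               /* time update */
            Cs[i] += dt * D * dCs[i];
    }

    /* B5: gather the result back to Root. */
    MPI_Gather(Cs, mc*n, MPI_FLOAT, C, mc*n, MPI_FLOAT,
               0, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}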
Thank you for your attention!
