
Chapter 9 - Parallel Computation Problems

9.1 Numerical approach for dense matrices
Review

Matrix-Vector Multiplication

Compute: y = Ax
x and y are n×1 vectors
A is an n×n dense matrix
Serial complexity: W = O(n²)
We will consider 1D & 2D partitioning.

Row-wise 1D Partitioning

How do we perform the operation?

Row-wise 1D Partitioning

Each processor needs to have the entire x vector.

Steps: an all-to-all broadcast of x, followed by local computations. (A sketch in MPI follows below.)

Analysis?
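
To make the two steps concrete, here is a minimal C/MPI sketch of the row-wise formulation. It assumes p divides n and that each process already holds its n/p rows of A and its n/p elements of x; the names (mat_vec_1d, local_A, ...) are illustrative rather than from the course code. MPI_Allgather realizes the all-to-all broadcast of x, after which each process computes its n/p entries of y locally.

#include <mpi.h>
#include <stdlib.h>

/* Row-wise 1D y = Ax: local_A is the (n/p) x n block of rows owned
   by this process, local_x its n/p elements of x, local_y its n/p
   results. */
void mat_vec_1d(const float *local_A, const float *local_x,
                float *local_y, int n, int p)
{
    int nloc = n / p;
    float *x = malloc(n * sizeof(float));

    /* All-to-all broadcast: every process assembles the full x. */
    MPI_Allgather(local_x, nloc, MPI_FLOAT,
                  x, nloc, MPI_FLOAT, MPI_COMM_WORLD);

    /* Local computation: n/p dot-products of length n. */
    for (int i = 0; i < nloc; i++) {
        local_y[i] = 0.0f;
        for (int j = 0; j < n; j++)
            local_y[i] += local_A[i*n + j] * x[j];
    }
    free(x);
}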

Block 2D Partitioning

How do we perform the operation?

Block 2D Partitioning
Each processor needs to have the portion of the x vector
that corresponds to the set of columns that it stores.

Analysis?
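
A possible C/MPI sketch of the 2D scheme, under the usual assumptions (a √p × √p grid, each process owning an nb × nb block of A with nb = n/√p, and the nb-long slice of x initially held by the diagonal process of each column); the function and variable names are illustrative. The slice of x is broadcast down each column, each process forms its partial product, and the partials are summed across each row.

#include <mpi.h>
#include <stdlib.h>

/* 2D block y = Ax on a sqp x sqp grid. (row, col) are this
   process's grid coordinates; ypart (length nb) ends up on the
   processes in column 0. */
void mat_vec_2d(const float *Ablk, float *xpart, float *ypart,
                int nb, int row, int col, MPI_Comm grid)
{
    MPI_Comm col_comm, row_comm;
    MPI_Comm_split(grid, col, row, &col_comm); /* same column */
    MPI_Comm_split(grid, row, col, &row_comm); /* same row    */

    /* 1. Broadcast the x slice down each column; within col_comm
          the ranks equal the row index, so the diagonal owner of
          column 'col' is rank 'col'. */
    MPI_Bcast(xpart, nb, MPI_FLOAT, col, col_comm);

    /* 2. Local block-vector product. */
    float *tmp = malloc(nb * sizeof(float));
    for (int i = 0; i < nb; i++) {
        tmp[i] = 0.0f;
        for (int j = 0; j < nb; j++)
            tmp[i] += Ablk[i*nb + j] * xpart[j];
    }

    /* 3. All-to-one reduction across each row; rank 0 of row_comm
          is the process in column 0. */
    MPI_Reduce(tmp, ypart, nb, MPI_FLOAT, MPI_SUM, 0, row_comm);

    free(tmp);
    MPI_Comm_free(&col_comm);
    MPI_Comm_free(&row_comm);
}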

1D vs 2D Formulation

Which one is better? As p grows, the 2D formulation communicates less per processor (roughly O((n/√p) log p) words versus O(n) for 1D) and stays cost-optimal for a larger number of processors, so it scales better.

Matrix-Matrix Multiplication

Compute: C = AB
A, B, & C are n×n dense matrices
Serial complexity: W = O(n³)
We will consider 2D & 3D partitioning.

Simple 2D Algorithm

Processors are arranged in a logical √p × √p 2D topology.
Each processor gets an (n/√p) × (n/√p) block of A, B, & C.
It is responsible for computing the entries of C that it has
been assigned.
Analysis?
How about the memory complexity? (Each processor ends up
gathering an entire block-row of A and block-column of B, i.e.
O(n²/√p) data, so total memory grows by a factor of √p over
the input.)

Cannon’s Algorithm

Memory-efficient variant of the simple algorithm.
Key idea:
Replace the traditional loop
  C_{i,j} = Σ_{k=0}^{√p-1} A_{i,k} · B_{k,j}
with the following loop:
  C_{i,j} = Σ_{k=0}^{√p-1} A_{i,(i+j+k) mod √p} · B_{(i+j+k) mod √p, j}
During each step, processors operate on different blocks of A and
B, so each processor stores only one block of each at a time. (A
sketch in MPI follows below.)
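
Below is a hedged C/MPI sketch of Cannon's algorithm. It assumes a periodic √p × √p Cartesian communicator (created elsewhere with MPI_Cart_create), with one nb × nb block of A, B, and C per process and C zero-initialized; all names are illustrative.

#include <mpi.h>

/* a, b, c: the nb x nb blocks owned by this process; c must be
   zero-initialized. grid is a periodic sqp x sqp Cartesian comm. */
void cannon(float *a, float *b, float *c, int nb, MPI_Comm grid)
{
    int dims[2], periods[2], coords[2];
    int left, right, up, down, src, dst;
    MPI_Cart_get(grid, 2, dims, periods, coords);
    int sqp = dims[0];

    /* Initial alignment: shift block row i of A left by i and
       block column j of B up by j. */
    MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
    MPI_Sendrecv_replace(a, nb*nb, MPI_FLOAT, dst, 0, src, 0,
                         grid, MPI_STATUS_IGNORE);
    MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
    MPI_Sendrecv_replace(b, nb*nb, MPI_FLOAT, dst, 0, src, 0,
                         grid, MPI_STATUS_IGNORE);

    MPI_Cart_shift(grid, 1, -1, &right, &left); /* A moves left */
    MPI_Cart_shift(grid, 0, -1, &down, &up);    /* B moves up   */

    for (int step = 0; step < sqp; step++) {
        /* Local multiply-accumulate: c += a * b. */
        for (int i = 0; i < nb; i++)
            for (int k = 0; k < nb; k++)
                for (int j = 0; j < nb; j++)
                    c[i*nb + j] += a[i*nb + k] * b[k*nb + j];

        /* Single-step circular shifts for the next iteration. */
        MPI_Sendrecv_replace(a, nb*nb, MPI_FLOAT, left, 0, right, 0,
                             grid, MPI_STATUS_IGNORE);
        MPI_Sendrecv_replace(b, nb*nb, MPI_FLOAT, up, 0, down, 0,
                             grid, MPI_STATUS_IGNORE);
    }
}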

Can we do better?

Can we use more than O(n²) processors?
So far the task corresponded to the dot-product of two vectors,
i.e., C_{i,j} = A_{i,*} · B_{*,j}
How about performing this dot-product in parallel?
What is the maximum concurrency that we can extract?

3D Algorithm—DNS Algorithm
Partitioning the intermediate data: the n³ scalar products
A_{i,k}·B_{k,j} are viewed as an n×n×n cube of tasks and mapped
onto a 3D processor topology; with up to n³ processors the partial
products are combined by a reduction along one dimension, giving
O(log n) time.

Gaussian Elimination
Solve Ax = b
A is an n×n dense matrix; x and b are dense vectors
Serial complexity: W = O(n³)
There are two key steps in each iteration:
  Division step
  Rank-1 update
We will consider 1D & 2D partitioning, and introduce the notion
of pipelining.

1D Partitioning

Assign n/p rows of A to each processor.
During the ith iteration:
  The divide operation is performed by the processor that stores row i.
  The result is broadcast to the rest of the processors.
  Each processor performs the rank-1 update for its local rows.
Analysis? (one element per processor)
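
As a concrete (non-pipelined) illustration of this slide, here is a minimal C/MPI sketch of Gaussian elimination with 1D block-row partitioning and no pivoting. It assumes p divides n; the names (gauss_1d, rows, piv) are illustrative.

#include <mpi.h>
#include <stdlib.h>

/* rows: the (n/p) x (n+1) augmented rows owned by this process
   under block mapping (global rows rank*nloc .. rank*nloc+nloc-1). */
void gauss_1d(float *rows, int n, int rank, int p)
{
    int nloc = n / p;
    float *piv = malloc((n + 1) * sizeof(float));

    for (int k = 0; k < n; k++) {
        int owner = k / nloc;                /* who stores row k */
        if (rank == owner) {
            float *r = rows + (k - owner*nloc) * (n + 1);
            for (int j = k + 1; j <= n; j++) /* division step */
                r[j] /= r[k];
            r[k] = 1.0f;
            for (int j = 0; j <= n; j++) piv[j] = r[j];
        }
        /* Broadcast the normalized row k to all processors. */
        MPI_Bcast(piv, n + 1, MPI_FLOAT, owner, MPI_COMM_WORLD);

        /* Rank-1 update of the local rows below row k. */
        for (int i = 0; i < nloc; i++) {
            if (rank*nloc + i > k) {
                float *r = rows + i * (n + 1);
                for (int j = k + 1; j <= n; j++)
                    r[j] -= r[k] * piv[j];
                r[k] = 0.0f;
            }
        }
    }
    free(piv);
}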

1D Pipelined Formulation

Existing algorithm:
  The next iteration starts only when the previous iteration has
  finished.
Key idea:
  The next iteration can start as soon as the rank-1 update
  involving the next row has finished.
  Essentially, multiple iterations are performed simultaneously!

Cost-optimal with n processors.

1D Partitioning

Is the block mapping a good idea? (No: with a block mapping,
processors owning the top rows fall idle as elimination proceeds;
a cyclic mapping of rows balances the load.)

2D Mapping

Each processor gets a 2D block of the matrix.
Steps:
  Broadcast of the "active" column along the rows.
  Divide step performed in parallel by the processors that own
  portions of the row.
  Broadcast along the columns.
  Rank-1 update.
Analysis?

2D Pipelined

Cost-optimal with n² processors.

9.2 Numerical approach for PDE problems

PARALLEL SOLUTION TO PDEs

CASE STUDY: THE HEAT EQUATION

Mathematical model and algorithm
• Heat equation (PDE):

    ∂C/∂t = D ∇²C
    ∇²C = ∂²C/∂x² + ∂²C/∂y²

• Algorithm
  • Initial input value: C_{i,j}^{t_0}
  • At step n+1, on a grid with spacing dx:

    ∇²C_{i,j}^{t_n} = FD_{i,j}^{t_n}
      = (C_{i+1,j}^{t_n} + C_{i-1,j}^{t_n} + C_{i,j+1}^{t_n} + C_{i,j-1}^{t_n} - 4 C_{i,j}^{t_n}) / dx²

    C_{i,j}^{t_{n+1}} = C_{i,j}^{t_n} + dt · D · FD_{i,j}^{t_n}

• Data dependency?
Data dependency
• Calculation at point (i,j) needs data from the neighboring
  points (i-1,j), (i+1,j), (i,j-1), (i,j+1)
• This is a data dependency
• Solution
  • Shared memory system: synchronization
  • Distributed memory system: communication and
    synchronization (difficult; room for optimization)
• Exercise (a minimal OpenMP starting point is sketched below):
  • Write an OpenMP program to implement the heat equation
    problem
  • Write an MPI program to implement the heat equation
    problem
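
As a starting point for the OpenMP exercise, the sketch below (illustrative, not the official solution) parallelizes the two loop nests of one time step; the implicit barrier between them provides the synchronization demanded by the data dependency.

#include <omp.h>

/* One time step on an m x n grid; D, dt, dx as in the model. */
void step(float *C, float *dC, int m, int n,
          float D, float dt, float dx)
{
    int i, j;
    #pragma omp parallel for private(j)
    for (i = 0; i < m; i++)
        for (j = 0; j < n; j++) {
            float c = C[i*n + j];
            float u = (i == 0)   ? c : C[(i-1)*n + j];
            float d = (i == m-1) ? c : C[(i+1)*n + j];
            float l = (j == 0)   ? c : C[i*n + j - 1];
            float r = (j == n-1) ? c : C[i*n + j + 1];
            dC[i*n + j] = (1.0f/(dx*dx)) * (u + d + l + r - 4*c);
        }
    /* Implicit barrier here: all of dC is complete before any
       thread updates C. */
    #pragma omp parallel for private(j)
    for (i = 0; i < m; i++)
        for (j = 0; j < n; j++)
            C[i*n + j] += dt * D * dC[i*n + j];
}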

Mathematical model and algorithm

• Notation (5-point stencil around the center point c at (i,j),
  with u above, d below, l to the left, r to the right):
  c: C_{i,j}
  u: C_{i-1,j}
  d: C_{i+1,j}
  l: C_{i,j-1}
  r: C_{i,j+1}

Implementation: Spatial Discretization (FD)
∇²C_{i,j}^{t_n} = FD_{i,j}^{t_n}
  = (C_{i+1,j}^{t_n} + C_{i-1,j}^{t_n} + C_{i,j+1}^{t_n} + C_{i,j-1}^{t_n} - 4 C_{i,j}^{t_n}) / dx²
/* m, n (grid dimensions) and dx (grid spacing) are globals. */
void FD(float *C, float *dC) {
  int i, j;
  float c, u, d, l, r;
  for ( i = 0 ; i < m ; i++ )
    for ( j = 0 ; j < n ; j++ )
    {
      c = *(C+i*n+j);
      /* Boundary condition: outside the domain, reuse the
         center value. */
      u = (i==0)   ? *(C+i*n+j) : *(C+(i-1)*n+j);
      d = (i==m-1) ? *(C+i*n+j) : *(C+(i+1)*n+j);
      l = (j==0)   ? *(C+i*n+j) : *(C+i*n+j-1);
      r = (j==n-1) ? *(C+i*n+j) : *(C+i*n+j+1);
      *(dC+i*n+j) = (1/(dx*dx))*(u+d+l+r-4*c);
    }
}
Implementation: Time Integration
C_{i,j}^{t_{n+1}} = C_{i,j}^{t_n} + dt · D · FD_{i,j}^{t_n}

while (t <= T)
{
  FD(C, dC);                      /* dC = FD = del^2 C */
  for ( i = 0 ; i < m ; i++ )
    for ( j = 0 ; j < n ; j++ )   /* note the factor D from the model */
      *(C+i*n+j) = *(C+i*n+j) + dt*D*(*(dC+i*n+j));
  t = t + dt;
}

SPMD Parallel Algorithm (1)

• SPMD: Single Program Multiple Data

[Figure: row-wise domain decomposition: the grid is split into
horizontal strips, one per CPU (CPU0, CPU1, CPU2).]

SPMD Parallel Algorithm (2)

• B1: Input data
  • Usually, the initial input data is at CPU 0 (Root)
• B2: Domain decomposition
• B3: Distribute input data from Root to all other CPUs
• B4: Computation (each CPU calculates on its subdomain)
• B5: Gather output from all other CPUs to Root

B3 and B5: Communication (Input and Output)

SPMD Parallel Algorithm (3)

• B1: Input data
  • Depends on the requirements of each problem
  • Different input results in different output

SPMD Parallel Algorithm (4)
• B2: Domain decomposition
  • Many approaches; different approaches have different efficiency
  • The following is a row-wise domain decomposition
  • Given a domain of size m×n, the subdomain for each CPU is
    mc×n, where mc = m/NP and NP is the number of CPUs

[Figure: the m×n domain split into row strips for CPU0, CPU1, CPU2.]

SPMD Parallel Algorithm (5)
• B3: Distribute input data from Root to all other CPUs
MPI_Scatter (C, mc*n, MPI_FLOAT,
Cs, mc*n, MPI_FLOAT, 0,
MPI_COMM_WORLD);

[Figure: C, initialized at Root, is scattered as mc×n strips Cs
to CPU0, CPU1, CPU2.]

SPMD Parallel Algorithm (6)
• B5: Gather output from all other CPUs to Root
MPI_Gather ( Cs, mc*n, MPI_FLOAT,
C, mc*n, MPI_FLOAT, 0,
MPI_COMM_WORLD);

[Figure: the strips Cs, computed at the CPUs, are gathered back
into C at Root (CPU0).]

SPMD Parallel Algorithm (7)
• B4: Computation
  - B4.1: Communication
  - B4.2: Calculation

SPMD Parallel Algorithm (8)

• B4.1: Communication
  - B4.1a): Communicate array Cu: each CPU receives the last row
    (Cs_{mc-1,j}) of the CPU above it
  - B4.1b): Communicate array Cd: each CPU receives the first row
    (Cs_{0,j}) of the CPU below it

[Figure: ghost-row exchange between CPU0, CPU1 and CPU2.]

SPMD Parallel Algorithm (9)
• B4.1a): Communicate array Cu

if (rank==0){
  /* Top CPU: no neighbor above, so Cu mirrors its own first row. */
  for (j=0; j<n; j++) *(Cu+j) = *(Cs+0*n+j);
  MPI_Send (Cs+(mc-1)*n, n, MPI_FLOAT, rank+1, rank, MPI_COMM_WORLD);
} else if (rank==NP-1) {
  /* Bottom CPU: only receives from above. */
  MPI_Recv (Cu, n, MPI_FLOAT, rank-1, rank-1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
} else {
  /* Middle CPUs: send the last local row down, receive Cu from above. */
  MPI_Send (Cs+(mc-1)*n, n, MPI_FLOAT, rank+1, rank, MPI_COMM_WORLD);
  MPI_Recv (Cu, n, MPI_FLOAT, rank-1, rank-1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

SPMD Parallel Algorithm (10)
• B4.1b): Communicate array Cd
if (rank==NP-1){
  /* Bottom CPU: no neighbor below, so Cd mirrors its own last row. */
  for (j=0; j<n; j++) *(Cd+j) = *(Cs+(mc-1)*n+j);
  MPI_Send (Cs, n, MPI_FLOAT, rank-1, rank, MPI_COMM_WORLD);
} else if (rank==0) {
  /* Top CPU: only receives from below. */
  MPI_Recv (Cd, n, MPI_FLOAT, rank+1, rank+1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
} else {
  /* Middle CPUs: send the first local row up, receive Cd from below. */
  MPI_Send (Cs, n, MPI_FLOAT, rank-1, rank, MPI_COMM_WORLD);
  MPI_Recv (Cd, n, MPI_FLOAT, rank+1, rank+1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

SPMD Parallel Algorithm (11)
• B4.2: Calculation
/* ms (= mc) is the number of local rows; n and dx are globals. */
void FD(float *Cs, float *Cu, float *Cd, float *dCs, int ms) {
  int i, j;
  float c, u, d, l, r;
  for ( i = 0 ; i < ms ; i++ )
    for ( j = 0 ; j < n ; j++ )
    {
      c = *(Cs+i*n+j);
      /* At the strip edges, the ghost rows Cu and Cd replace the
         rows owned by the neighboring CPUs. */
      u = (i==0)    ? *(Cu+j) : *(Cs+(i-1)*n+j);
      d = (i==ms-1) ? *(Cd+j) : *(Cs+(i+1)*n+j);
      l = (j==0)    ? *(Cs+i*n+j) : *(Cs+i*n+j-1);
      r = (j==n-1)  ? *(Cs+i*n+j) : *(Cs+i*n+j+1);
      /* 1/(dx*dx), as in the serial FD; D is applied in the
         dt*D*FD time update. */
      *(dCs+i*n+j) = (1/(dx*dx))*(u+d+l+r-4*c);
    }
}
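Putting B1-B5 together, a hedged sketch of the complete SPMD driver might look as follows. Here init_input() and exchange_halo() are hypothetical placeholders for the input initialization (B1) and the B4.1a/B4.1b code above, and m, n, dx, dt, D, T are assumed to be globals as before.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, NP, i;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &NP);

    int mc = m / NP;                               /* B2: rows per CPU */
    float *C   = (rank == 0) ? init_input() : NULL; /* B1 (hypothetical) */
    float *Cs  = malloc(mc * n * sizeof(float));
    float *dCs = malloc(mc * n * sizeof(float));
    float *Cu  = malloc(n * sizeof(float));
    float *Cd  = malloc(n * sizeof(float));

    /* B3: distribute the input from Root. */
    MPI_Scatter(C, mc*n, MPI_FLOAT, Cs, mc*n, MPI_FLOAT,
                0, MPI_COMM_WORLD);

    /* B4: computation loop. */
    for (float t = 0; t <= T; t += dt) {
        exchange_halo(Cs, Cu, Cd, mc, rank, NP); /* B4.1 (hypothetical name) */
        FD(Cs, Cu, Cd, dCs, mc);                 /* B4.2 */
        for (i = 0; i < mc*n; i++)               /* time update */
            Cs[i] += dt * D * dCs[i];
    }

    /* B5: gather the result back to Root. */
    MPI_Gather(Cs, mc*n, MPI_FLOAT, C, mc*n, MPI_FLOAT,
               0, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}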
Thank you for your attention!
