
MPI: Collective Communications & Data Distributions

Dr. Mian M. Hamayun
[email protected]
http://seecs.nust.edu.pk/faculty/mianhamayun.html
Some material re-used from Mohamed Zahran (NYU)
Today’s Lecture
 Dealing with I/O
 The Trapezoidal Rule in MPI
 Reductions in MPI
 Collective vs. Point-to-Point Communication
 Data Distributions

Copyright © 2010, Elsevier Inc. All rights Reserved 1


SPMD
 Single-Program, Multiple-Data
 We compile one program.
 Process 0 does something different.
 Receives messages and prints them while the
other processes do the work.
 The if-else construct makes our program SPMD.

Copyright © 2010, Elsevier Inc. All rights Reserved 2


Dealing with I/O

In all MPI implementations, all processes in
MPI_COMM_WORLD have access to stdout
and stderr.

BUT... in most implementations there is no
scheduling of access to output devices!

Copyright © 2010, Elsevier Inc. All rights Reserved 3


Running with 6 Processes

unpredictable output!!
• Processes are competing for stdout
• Result: non-determinism!

Copyright © 2010, Elsevier Inc. All rights Reserved 4


How about Input?
 Most MPI implementations only allow
process 0 in MPI_COMM_WORLD
access to stdin.
 Process 0 must read the data and send it to
the other processes.

Copyright © 2010, Elsevier Inc. All rights Reserved 5


Function for reading user input

Copyright © 2010, Elsevier Inc. All rights Reserved 6
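
The code for this slide is not reproduced here. Below is a minimal sketch of such a function, assuming the trapezoidal-rule setting of the following slides (the function and parameter names are illustrative): process 0 reads a, b, and n and forwards them to the other processes with point-to-point messages.

#include <stdio.h>
#include <mpi.h>

/* Sketch: process 0 reads a, b, and n, then sends them to every other
 * process; the other processes receive them. */
void Get_input(int my_rank, int comm_sz, double* a_p, double* b_p, int* n_p) {
    if (my_rank == 0) {
        printf("Enter a, b, and n\n");
        scanf("%lf %lf %d", a_p, b_p, n_p);
        for (int dest = 1; dest < comm_sz; dest++) {
            MPI_Send(a_p, 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
            MPI_Send(b_p, 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
            MPI_Send(n_p, 1, MPI_INT,    dest, 0, MPI_COMM_WORLD);
        }
    } else {
        MPI_Recv(a_p, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(b_p, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(n_p, 1, MPI_INT,    0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}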


TRAPEZOIDAL RULE IN MPI

Copyright © 2010, Elsevier Inc. All rights Reserved 7


The Trapezoidal Rule

Copyright © 2010, Elsevier Inc. All rights Reserved 8


One trapezoid

Copyright © 2010, Elsevier Inc. All rights Reserved 9


The Trapezoidal Rule

Copyright © 2010, Elsevier Inc. All rights Reserved 10
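
The figures on these slides are not reproduced here. For reference, the standard trapezoidal rule they illustrate is (with h the width of each trapezoid):

h = (b - a) / n,   x_i = a + i*h   for i = 0, 1, ..., n

Area of one trapezoid  =  (h / 2) * [ f(x_i) + f(x_{i+1}) ]

Integral estimate  =  h * [ f(x_0)/2 + f(x_1) + f(x_2) + ... + f(x_{n-1}) + f(x_n)/2 ]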


Pseudo-code for a serial program

Copyright © 2010, Elsevier Inc. All rights Reserved 11
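
The pseudo-code itself is not reproduced here; a minimal serial sketch in C, assuming an example integrand f(x) = x*x and the formulas above, is:

#include <stdio.h>

double f(double x) { return x * x; }   /* example integrand (assumption) */

/* Serial trapezoidal rule: estimate the integral of f over [a, b] with n trapezoids. */
double Trap(double a, double b, int n) {
    double h = (b - a) / n;
    double approx = (f(a) + f(b)) / 2.0;   /* endpoints count half */
    for (int i = 1; i <= n - 1; i++)
        approx += f(a + i * h);            /* interior points count fully */
    return h * approx;
}

int main(void) {
    printf("Integral of x^2 over [0, 3] ~= %f\n", Trap(0.0, 3.0, 1024));
    return 0;
}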


Parallelizing the Trapezoidal Rule
1. Partition problem solution into tasks.
2. Identify communication channels between
tasks.
3. Aggregate tasks into composite tasks.
4. Map composite tasks to cores.

Copyright © 2010, Elsevier Inc. All rights Reserved 12


Tasks and communications for
Trapezoidal Rule

Copyright © 2010, Elsevier Inc. All rights Reserved 13


Parallel pseudo-code

Copyright © 2010, Elsevier Inc. All rights Reserved 14


First version (1)

Copyright © 2010, Elsevier Inc. All rights Reserved 15


First version (2)

Copyright © 2010, Elsevier Inc. All rights Reserved 16


First version (3)

Copyright © 2010, Elsevier Inc. All rights Reserved 17
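
The three code slides are not reproduced here. A self-contained sketch along the same lines (variable names such as local_a, local_b, and local_n are illustrative, and comm_sz is assumed to evenly divide n): each process integrates its own subinterval, and process 0 collects the partial results with point-to-point messages.

#include <stdio.h>
#include <mpi.h>

double f(double x) { return x * x; }   /* example integrand (assumption) */

/* Local trapezoidal rule over [left, right] with `count` trapezoids of width h. */
double Trap(double left, double right, int count, double h) {
    double approx = (f(left) + f(right)) / 2.0;
    for (int i = 1; i <= count - 1; i++)
        approx += f(left + i * h);
    return h * approx;
}

int main(void) {
    int my_rank, comm_sz, n = 1024, local_n;
    double a = 0.0, b = 3.0, h, local_a, local_b, local_int, total_int;

    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);

    h = (b - a) / n;                       /* width of every trapezoid      */
    local_n = n / comm_sz;                 /* trapezoids per process        */
    local_a = a + my_rank * local_n * h;   /* this process's subinterval    */
    local_b = local_a + local_n * h;
    local_int = Trap(local_a, local_b, local_n, h);

    if (my_rank != 0) {                    /* workers send partial results  */
        MPI_Send(&local_int, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    } else {                               /* process 0 collects and prints */
        total_int = local_int;
        for (int source = 1; source < comm_sz; source++) {
            MPI_Recv(&local_int, 1, MPI_DOUBLE, source, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            total_int += local_int;
        }
        printf("With n = %d trapezoids, the estimate of the integral from "
               "%f to %f is %.15e\n", n, a, b, total_int);
    }

    MPI_Finalize();
    return 0;
}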


COLLECTIVE
COMMUNICATION

Copyright © 2010, Elsevier Inc. All rights Reserved 18


The Global Sum ... Again!!
1. In the first phase:
(a) Process 1 sends to 0, 3 sends to 2, 5 sends to 4, and 7 sends to 6.
(b) Processes 0, 2, 4, and 6 add in the received values.
(c) Processes 2 and 6 send their new values to processes 0 and 4, respectively.
(d) Processes 0 and 4 add the received values into their new values.

2. (a) Process 4 sends its newest value to process 0.
(b) Process 0 adds the received value to its newest value.

Copyright © 2010, Elsevier Inc. All rights Reserved 19
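
As a concrete illustration of the scheme just described, here is a sketch (not the slides' own code; names such as Tree_sum and local_val are illustrative) that implements the tree-structured sum with point-to-point calls.

#include <mpi.h>

/* Tree-structured global sum of one double per process, as described above.
 * The result is meaningful only on process 0. */
double Tree_sum(double local_val, int my_rank, int comm_sz, MPI_Comm comm) {
    double my_sum = local_val, recv_val;

    for (int gap = 1; gap < comm_sz; gap *= 2) {
        if (my_rank % (2 * gap) == 0) {          /* receiver in this phase  */
            int partner = my_rank + gap;
            if (partner < comm_sz) {
                MPI_Recv(&recv_val, 1, MPI_DOUBLE, partner, 0, comm,
                         MPI_STATUS_IGNORE);
                my_sum += recv_val;
            }
        } else {                                 /* sender: pass result on  */
            MPI_Send(&my_sum, 1, MPI_DOUBLE, my_rank - gap, 0, comm);
            break;                               /* a sender is then done   */
        }
    }
    return my_sum;
}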


A tree-structured global sum

Copyright © 2010, Elsevier Inc. All rights Reserved 20


An alternative tree-structured global sum

Is this better, or is the previous one?

A: It depends on the underlying system!

Copyright © 2010, Elsevier Inc. All rights Reserved 21


Reduction
 Reducing a set of numbers into a smaller
set of numbers via a function.
 Example: reducing the group [1, 2, 3, 4, 5]
with the sum function → 15
 MPI provides a handy function that handles
almost all of the common reductions that a
programmer needs to do in a parallel
application.

Copyright © 2010, Elsevier Inc. All rights Reserved 22


Reduction Examples

Every process has an element

Every process has an array of elements


Copyright © 2010, Elsevier Inc. All rights Reserved 23
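
For instance (a sketch; the variable names and the buffer size are assumptions), the same call handles both cases: with count = 1 each process contributes a single value, and with count = n the reduction is applied element-wise.

#include <mpi.h>

void Reduce_examples(double x, double a[], int n, MPI_Comm comm) {
    double sum;           /* single result, significant only on process 0 */
    double totals[128];   /* element-wise results; assumes n <= 128 here  */

    /* Every process has one element: sum them onto process 0. */
    MPI_Reduce(&x, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, comm);

    /* Every process has an array: totals[i] on process 0 becomes the sum
     * of a[i] over all processes. */
    MPI_Reduce(a, totals, n, MPI_DOUBLE, MPI_SUM, 0, comm);
}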
MPI_Reduce
 The input and output buffers each have size sizeof(datatype) * count.
 The output buffer (output_data_p) is only relevant on dest_process.
 MPI_Reduce is called by all processes involved.


Copyright © 2010, Elsevier Inc. All rights Reserved 24
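
For reference, the prototype (written with the descriptive parameter names used on these slides; the MPI standard itself names them sendbuf, recvbuf, op, and root):

int MPI_Reduce(
    void*        input_data_p,   /* in:  each process's local data          */
    void*        output_data_p,  /* out: result, significant only on        */
                                 /*      dest_process                       */
    int          count,          /* number of elements in the buffers       */
    MPI_Datatype datatype,       /* type of the elements                    */
    MPI_Op       operator,       /* e.g. MPI_SUM, MPI_MAX, ...              */
    int          dest_process,   /* rank that receives the result           */
    MPI_Comm     comm);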
Predefined reduction operators in MPI

For MPI_MAXLOC / MPI_MINLOC, the "location" returned is the rank of the
process that owns the maximum / minimum value.


Copyright © 2010, Elsevier Inc. All rights Reserved 25
Collective vs. Point-to-Point Communications
 All the processes in the communicator must call
the same collective function.
 For example, a program that attempts to match a call
to MPI_Reduce on one process with a call to
MPI_Recv on another process is erroneous, and, in all
likelihood, the program will hang or crash.
 The arguments passed by each process to an
MPI collective communication must be
“compatible.”
 For example, if one process passes in 0 as the
dest_process and another passes in 1, then the
outcome of a call to MPI_Reduce is erroneous, and,
once again, the program is likely to hang or crash.

Copyright © 2010, Elsevier Inc. All rights Reserved 26


Collective vs. Point-to-Point Communications

 The output_data_p argument is only used
on dest_process.
 However, all of the processes still need to
pass in an actual argument corresponding
to output_data_p, even if it’s just NULL.
 All collective communication calls are
blocking.

Copyright © 2010, Elsevier Inc. All rights Reserved 27


Collective vs. Point-to-Point Communications

 Point-to-point communications are
matched on the basis of tags and
communicators.
 Collective communications don’t use tags.
 They’re matched solely on the basis of the
communicator and the order in which
they’re called.

Copyright © 2010, Elsevier Inc. All rights Reserved 28


Example (1)

Assume:
 all processes use the operator MPI_SUM

 and the destination is process 0

What will be the final values of b and d?

Copyright © 2010, Elsevier Inc. All rights Reserved 29
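
The code from the slide is not reproduced here. A reconstruction consistent with the discussion on the next slide (the initial values a = 1 and c = 2 on every process are assumptions) is:

#include <mpi.h>

void Example(int my_rank, MPI_Comm comm) {
    int a = 1, c = 2;    /* assumed initial values on every process     */
    int b = 0, d = 0;

    if (my_rank == 1) {  /* process 1 issues its reductions in the      */
                         /* opposite order from processes 0 and 2       */
        MPI_Reduce(&c, &d, 1, MPI_INT, MPI_SUM, 0, comm);
        MPI_Reduce(&a, &b, 1, MPI_INT, MPI_SUM, 0, comm);
    } else {
        MPI_Reduce(&a, &b, 1, MPI_INT, MPI_SUM, 0, comm);
        MPI_Reduce(&c, &d, 1, MPI_INT, MPI_SUM, 0, comm);
    }
    /* The resulting values of b and d on process 0 are worked out on
     * the next slide. */
}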


Example (2)
 At first glance, it might seem that after the two
calls to MPI_Reduce, the value of b will be 3,
and the value of d will be 6.
 However, the names of the memory locations
are irrelevant to the matching of the calls to
MPI_Reduce.
 The order of the calls determines the
matching, so the value stored in b will be
1 + 2 + 1 = 4, and the value stored in d will be
2 + 1 + 2 = 5.

Copyright © 2010, Elsevier Inc. All rights Reserved 30


Another Example
MPI_Reduce(&x, &x, 1, MPI_DOUBLE, MPI_SUM, 0, comm);

Passing the same buffer (&x) as both the input and the output (aliasing) is
illegal in MPI, and the result is unpredictable!

Copyright © 2010, Elsevier Inc. All rights Reserved 31
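
If the in-place effect is what is wanted, MPI (since MPI-2) provides MPI_IN_PLACE for this purpose; a sketch (function name illustrative):

#include <mpi.h>

double In_place_sum(double x, int my_rank, MPI_Comm comm) {
    if (my_rank == 0)   /* root: take its own contribution from x itself   */
        MPI_Reduce(MPI_IN_PLACE, &x, 1, MPI_DOUBLE, MPI_SUM, 0, comm);
    else                /* non-root: the output buffer is not significant  */
        MPI_Reduce(&x, NULL, 1, MPI_DOUBLE, MPI_SUM, 0, comm);
    return x;           /* the global sum on process 0; unchanged elsewhere */
}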


Global Sum: Update All

A global sum followed by distribution of the
result.

Copyright © 2010, Elsevier Inc. All rights Reserved 32


Global Sum: Exchange Partial Results

A butterfly-structured global sum.

Copyright © 2010, Elsevier Inc. All rights Reserved 33


MPI_Allreduce

 Useful in a situation in which all of the
processes need the result of a global sum
in order to complete some larger
computation.

Note: No destination argument is required!


Copyright © 2010, Elsevier Inc. All rights Reserved 34
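
For reference, the prototype (parameter names follow the slides' convention; the standard names them sendbuf, recvbuf, and op):

int MPI_Allreduce(
    void*        input_data_p,
    void*        output_data_p,  /* out: the result ends up on EVERY process */
    int          count,
    MPI_Datatype datatype,
    MPI_Op       operator,
    MPI_Comm     comm);          /* note: no dest_process argument           */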
Broadcast
 Data belonging to a single process is sent
to all of the processes in the
communicator.

All processes in the communicator must call MPI_Bcast()

Copyright © 2010, Elsevier Inc. All rights Reserved 35
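
For reference, the prototype (parameter names follow the slides' style; the standard names them buffer, count, datatype, root, and comm):

int MPI_Bcast(
    void*        data_p,       /* in/out: sent by source_proc, received by the rest */
    int          count,
    MPI_Datatype datatype,
    int          source_proc,  /* the process that owns the data                    */
    MPI_Comm     comm);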


A tree-structured broadcast.

Copyright © 2010, Elsevier Inc. All rights Reserved 36


A version of Get_input that uses MPI_Bcast

Copyright © 2010, Elsevier Inc. All rights Reserved 37
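
The code is not reproduced here; a minimal sketch of such a version (names as in the earlier Get_input sketch) is:

#include <stdio.h>
#include <mpi.h>

/* Process 0 reads a, b, and n; the three broadcasts deliver them everywhere. */
void Get_input(int my_rank, double* a_p, double* b_p, int* n_p) {
    if (my_rank == 0) {
        printf("Enter a, b, and n\n");
        scanf("%lf %lf %d", a_p, b_p, n_p);
    }
    MPI_Bcast(a_p, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(b_p, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(n_p, 1, MPI_INT, 0, MPI_COMM_WORLD);
}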


Collective vs. Point-to-Point – Summary

Collective: involves all processes in the communicator; matched by the
communicator and the order of the calls; no tags.
Point-to-Point: involves a pair of processes; matched on the basis of tags
and communicators.

Copyright © 2010, Elsevier Inc. All rights Reserved 38


Data distributions

Compute a vector sum – Serial Version

Copyright © 2010, Elsevier Inc. All rights Reserved 39
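
The serial code is not reproduced here; a minimal sketch is:

/* Serial vector sum: z[i] = x[i] + y[i] for an n-component vector. */
void Vector_sum(double x[], double y[], double z[], int n) {
    for (int i = 0; i < n; i++)
        z[i] = x[i] + y[i];
}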


Partitioning options
 Block partitioning
 Assign blocks of consecutive components to
each process.
 Cyclic partitioning
 Assign components in a round-robin fashion.
 Block-cyclic partitioning
 Use a cyclic distribution of blocks of
components.

Copyright © 2010, Elsevier Inc. All rights Reserved 40


Different partitions of a 12-component
vector among 3 processes

Copyright © 2010, Elsevier Inc. All rights Reserved 41
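
The figure is not reproduced here. As an illustration (a block size of 2 is assumed for the block-cyclic case), the 12 components would be assigned as follows, together with the corresponding owner computations:

/* n = 12 components, p = 3 processes:
 *   block:        P0: 0 1 2 3    P1: 4 5 6 7     P2: 8 9 10 11
 *   cyclic:       P0: 0 3 6 9    P1: 1 4 7 10    P2: 2 5 8 11
 *   block-cyclic: P0: 0 1 6 7    P1: 2 3 8 9     P2: 4 5 10 11   (block size 2)
 *
 * Owner of component i (assumes p divides n, and b is the block size): */
int block_owner(int i, int n, int p)        { return i / (n / p); }
int cyclic_owner(int i, int p)              { return i % p; }
int block_cyclic_owner(int i, int b, int p) { return (i / b) % p; }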


Parallel implementation of
vector addition

How will you distribute parts of x[] and y[] to the
processes?

Copyright © 2010, Elsevier Inc. All rights Reserved 42
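
One common answer, used on the following slides, is a block distribution. A sketch of the local computation, assuming each process already holds local_n = n / comm_sz components of each vector (names illustrative):

/* Each process adds its own blocks: local_z[i] = local_x[i] + local_y[i]. */
void Parallel_vector_sum(double local_x[], double local_y[],
                         double local_z[], int local_n) {
    for (int local_i = 0; local_i < local_n; local_i++)
        local_z[local_i] = local_x[local_i] + local_y[local_i];
}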


Scatter
 MPI_Scatter can be used in a function that
reads in an entire vector on process 0 but
only sends the needed components to
each of the other processes.

 send_count is the amount of data going to each process.
 All arguments are important for the source process (process 0 in our case).
 For all other processes, only recv_buf_p, recv_count, recv_type, src_proc,
and comm are important.
Copyright © 2010, Elsevier Inc. All rights Reserved 43
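
For reference, the prototype (argument names match those used on the slide; the standard names them sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm):

int MPI_Scatter(
    void*        send_buf_p,   /* in: all the data; significant only on src_proc */
    int          send_count,   /* amount of data going to EACH process           */
    MPI_Datatype send_type,
    void*        recv_buf_p,   /* out: this process's block                      */
    int          recv_count,
    MPI_Datatype recv_type,
    int          src_proc,     /* the process that owns the full vector          */
    MPI_Comm     comm);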
Reading and distributing a vector

Copyright © 2010, Elsevier Inc. All rights Reserved 44
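
The code is not reproduced here. A sketch of such a function (names illustrative; comm_sz is assumed to evenly divide n): process 0 reads the whole n-element vector and scatters local_n elements to every process.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

void Read_vector(double local_a[], int local_n, int n,
                 int my_rank, MPI_Comm comm) {
    double* a = NULL;
    if (my_rank == 0) {
        a = malloc(n * sizeof(double));
        printf("Enter the %d elements of the vector\n", n);
        for (int i = 0; i < n; i++) scanf("%lf", &a[i]);
        MPI_Scatter(a, local_n, MPI_DOUBLE, local_a, local_n, MPI_DOUBLE,
                    0, comm);
        free(a);
    } else {
        /* send_buf_p (a) is NULL here: it is ignored on non-source processes */
        MPI_Scatter(a, local_n, MPI_DOUBLE, local_a, local_n, MPI_DOUBLE,
                    0, comm);
    }
}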


Scatter

 send_buf_p
 is not used except by the sender.
 However, it must be defined or NULL on the other processes for the
code to be correct.
 Must have at least communicator size * send_count elements.
 All processes must call MPI_Scatter, not only the sender.
 send_count is the amount of data sent to each process.
 recv_buf_p must have at least send_count elements.
 MPI_Scatter uses a block distribution.
Copyright © 2010, Elsevier Inc. All rights Reserved 45
Scatter

Copyright © 2010, Elsevier Inc. All rights Reserved 46


Gather
 Collect all of the components of the vector
onto process 0, and then process 0 can
process all of the components.

All arguments are important for the destination process.


For all other processes, only send_buf_p, send_count, send_type,
dest_proc, and comm are important
Copyright © 2010, Elsevier Inc. All rights Reserved 47
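
For reference, the prototype (argument names match those used on the slide; the standard names them sendbuf, sendcount, ..., root):

int MPI_Gather(
    void*        send_buf_p,   /* in: this process's block                  */
    int          send_count,
    MPI_Datatype send_type,
    void*        recv_buf_p,   /* out: concatenated data; only on dest_proc */
    int          recv_count,   /* amount of data received from EACH process */
    MPI_Datatype recv_type,
    int          dest_proc,
    MPI_Comm     comm);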
Print a distributed vector (1)

Copyright © 2010, Elsevier Inc. All rights Reserved 48


Print a distributed vector (2)

Copyright © 2010, Elsevier Inc. All rights Reserved 49
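
The two code slides are not reproduced here. A sketch of such a function (names illustrative; comm_sz is assumed to evenly divide n): the local blocks are gathered onto process 0, which prints the whole vector.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

void Print_vector(double local_b[], int local_n, int n, const char title[],
                  int my_rank, MPI_Comm comm) {
    double* b = NULL;
    if (my_rank == 0) {
        b = malloc(n * sizeof(double));
        MPI_Gather(local_b, local_n, MPI_DOUBLE, b, local_n, MPI_DOUBLE,
                   0, comm);
        printf("%s\n", title);
        for (int i = 0; i < n; i++) printf("%f ", b[i]);
        printf("\n");
        free(b);
    } else {
        /* recv_buf_p (b) is NULL here: it is ignored on non-destination processes */
        MPI_Gather(local_b, local_n, MPI_DOUBLE, b, local_n, MPI_DOUBLE,
                   0, comm);
    }
}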


Allgather
 Concatenates the contents of each
process’ send_buf_p and stores this in
each process’ recv_buf_p.
 As usual, recv_count is the amount of data
being received from each process.

Copyright © 2010, Elsevier Inc. All rights Reserved 50
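
For reference, the prototype (argument names follow the slides' convention; the standard names them sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm):

int MPI_Allgather(
    void*        send_buf_p,   /* in: this process's block                  */
    int          send_count,
    MPI_Datatype send_type,
    void*        recv_buf_p,   /* out: concatenated result on EVERY process */
    int          recv_count,   /* amount of data received from each process */
    MPI_Datatype recv_type,
    MPI_Comm     comm);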


Matrix-vector multiplication

The i-th component of y is the dot product of the i-th row of A with x:

y_i = a_i0 * x_0 + a_i1 * x_1 + ... + a_i,n-1 * x_n-1

Copyright © 2010, Elsevier Inc. All rights Reserved 51


Matrix-vector multiplication

Serial pseudo-code

Copyright © 2010, Elsevier Inc. All rights Reserved 52


C style arrays

A two-dimensional C array is stored as a single one-dimensional array in
row-major order: for an array with n columns, element A[i][j] is stored at
offset i*n + j.

Copyright © 2010, Elsevier Inc. All rights Reserved 53


Serial matrix-vector multiplication

What if x[] is distributed among the different processes?


Copyright © 2010, Elsevier Inc. All rights Reserved 54
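
The serial code is not reproduced here; a minimal sketch, with the m-by-n matrix A stored as a one-dimensional row-major array as described above:

/* Serial y = A*x for an m-by-n matrix A stored as a 1-D row-major array. */
void Mat_vect_mult(double A[], double x[], double y[], int m, int n) {
    for (int i = 0; i < m; i++) {
        y[i] = 0.0;
        for (int j = 0; j < n; j++)
            y[i] += A[i * n + j] * x[j];   /* A[i][j] lives at offset i*n + j */
    }
}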
An MPI matrix-vector multiplication
function (1)

Copyright © 2010, Elsevier Inc. All rights Reserved 55


An MPI matrix-vector multiplication
function (2)

Copyright © 2010, Elsevier Inc. All rights Reserved 56
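
The two code slides are not reproduced here. A sketch in this style (names illustrative; comm_sz is assumed to divide both m and n): each process owns local_m rows of A and local_n components of x and y, the distributed x is first assembled on every process with MPI_Allgather, and each process then computes its own block of y.

#include <stdlib.h>
#include <mpi.h>

void Mat_vect_mult(double local_A[], double local_x[], double local_y[],
                   int local_m, int n, int local_n, MPI_Comm comm) {
    double* x = malloc(n * sizeof(double));

    /* Every process gets a full copy of the distributed vector x. */
    MPI_Allgather(local_x, local_n, MPI_DOUBLE, x, local_n, MPI_DOUBLE, comm);

    /* Each process multiplies its own rows of A by the full x. */
    for (int local_i = 0; local_i < local_m; local_i++) {
        local_y[local_i] = 0.0;
        for (int j = 0; j < n; j++)
            local_y[local_i] += local_A[local_i * n + j] * x[j];
    }
    free(x);
}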


Concluding Remarks (1)
 Most serial programs are deterministic: if
we run the same program with the same
input we’ll get the same output.
 Parallel programs often don’t possess this
property.
 Many parallel programs use the single-
program, multiple-data (SPMD) approach.
 A communicator is a collection of
processes that can send messages to
each other.
Copyright © 2010, Elsevier Inc. All rights Reserved 57
Concluding Remarks (2)
 Collective communications involve all the
processes in a communicator.
 When studying MPI, be careful of the
caveats (i.e., usage that leads to crashes,
nondeterministic behavior, etc.).
 In distributed memory systems,
communication is more expensive than
computation.

Copyright © 2010, Elsevier Inc. All rights Reserved 58


Concluding Remarks (3)
 Reducing the number of messages is a good
performance strategy!
 Collective vs point-to-point
 Distributing a fixed amount of data among
several messages is more expensive than
sending a single big message.

Copyright © 2010, Elsevier Inc. All rights Reserved 59
