CPD Computing
MODAI 1 / 70
Outline
1. Introduction
2. Concurrent Computing
   2.1 Threads
   2.2 Locks
   2.3 Liveness
   2.4 Isolation
   2.5 Dining Philosophers
3. Parallel Computing
   3.1 Task Parallelism
       Async/Finish vs. Fork/Join
       Computation Graphs
       Parallel Speedup
   3.2 Functional Parallelism
       Futures
       Streams
   3.3 Loop Parallelism
4. Distributed Computing
   4.1 Client-Server
       Unicast Communication (one-to-one)
       Broadcast/Multicast Communication (one-to-all/one-to-many)
       Remote Method Invocation
   4.2 Message Passing Interface
       Single Program Multiple Data model
       Point-to-Point Communication
       Collective Communication
       MPI and Threading
MODAI 2 / 70
Serial vs. Parallel computing
ref: https://hpc.llnl.gov/documentation/tutorials/introduction-parallel-computing-tutorial
MODAI 3 / 70
Single-core vs. Multi-core processors
- Single-core: runs on only one processor (CPU).
- Multi-core: runs on more than one processor (CPU).
ref: http://dx.doi.org/10.5121/ijesa.2015.5101
MODAI 4 / 70
Concurrency vs. Parallelism
- Concurrency: executing multiple tasks on the same core using overlapping or time slicing.
- Parallelism: executing multiple tasks on different cores.
ref: https://www.codeproject.com/Articles/1267757/Concurrency-vs-Parallelism
ref: https://devopedia.org/concurrency-vs-parallelism
MODAI 5 / 70
Parallel vs. Distributed system
- Parallel computing system: contains multiple processors that communicate with each other using a shared memory.
- Distributed computing system: contains multiple processors connected by a communication network.
ref: https://www.oreilly.com/library/view/distributed-computing-in/9781787126992/7478b64c-8de4-4db3-b473-66e1d1fcba77.xhtml
MODAI 6 / 70
Node vs. Cluster
ref: https://hpc.llnl.gov/documentation/tutorials/introduction-parallel-computing-tutorial
MODAI 7 / 70
Memory Architectures
- Shared memory: multiple processors can operate independently but share the same memory resources.
- Distributed memory: each processor has its own local memory and operates independently; when a processor needs data that resides in another processor's memory, it is usually the task of the programmer to explicitly define how and when the data is communicated over a communication network.
- Hybrid memory: combines both shared and distributed memory.
ref: https://hpc.llnl.gov/documentation/tutorials/introduction-parallel-computing-tutorial
MODAI 8 / 70
Communication Models
- Shared memory model: processes/tasks share a common address space (memory), which they read from and write to asynchronously.
- Distributed memory model: tasks exchange data through communication, by sending and receiving messages.
ref: https://hpc.llnl.gov/documentation/tutorials/introduction-parallel-computing-tutorial
MODAI 9 / 70
Processes vs. Threads
MODAI 10 / 70
Processes vs. Threads
- Lecture Summary (cont.):
• In summary:
• Processes are the basic unit of distribution in a cluster of nodes - we can distribute processes across multiple nodes of a high-performance computing (HPC) cluster, and even create multiple processes within a node.
• Threads are the basic unit of parallelism and concurrency - we can create multiple threads in a process that share resources such as memory and contribute to improved performance. However, it is not possible for two threads belonging to the same process to be scheduled on different nodes.
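• To make the thread bullet concrete, here is a minimal Java sketch (class and variable names are illustrative): two threads created inside one process share the same array in memory, and each sums one half of it.

public class SharedSumDemo {
    public static void main(String[] args) throws InterruptedException {
        int[] data = new int[1000];
        for (int i = 0; i < data.length; i++) data[i] = i;

        long[] partial = new long[2]; // shared result slots, one per thread

        // Both threads live in the same process, so they access 'data' and 'partial' directly.
        Thread lower = new Thread(() -> {
            for (int i = 0; i < data.length / 2; i++) partial[0] += data[i];
        });
        Thread upper = new Thread(() -> {
            for (int i = data.length / 2; i < data.length; i++) partial[1] += data[i];
        });

        lower.start(); upper.start();
        lower.join(); upper.join(); // wait for both threads to terminate

        System.out.println("sum = " + (partial[0] + partial[1]));
    }
}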
ref: https://jedyang.com/post/multithreading-in-python-pytorch-using-c++-extension
MODAI 11 / 70
Threads
MODAI 13 / 70
Structured Locks
- Video: c2w1/1.2 Structured Locks
- Lecture Summary:
• In this lecture, we learned about structured locks, and how they can be implemented
using synchronized statements and methods in Java.
• Structured locks can be used to enforce mutual exclusion and avoid data races, as
illustrated by the incr() method in the A.count example, and the insert() and remove()
methods in the Buffer example. A major benefit of structured locks is that their
acquire and release operations are implicit, since these operations are automatically
performed by the Java runtime environment when entering and exiting the scope of a
synchronized statement or method, even if an exception is thrown in the middle.
• We also learned about wait() and notify() operations that can be used to block and
resume threads that need to wait for specific conditions. For example, a producer thread
performing an insert() operation on a bounded buffer can call wait() when the buffer is
full, so that it is only unblocked when a consumer thread performing a remove()
operation calls notify(). Likewise, a consumer thread performing a remove() operation
on a bounded buffer can call wait() when the buffer is empty, so that it is only unblocked
when a producer thread performing an insert() operation calls notify().
• Structured locks are also referred to as intrinsic locks or monitors.
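• As a minimal sketch of the bounded-buffer example discussed above (class name, capacity, and element type are illustrative; notifyAll() is used instead of notify() so the sketch also behaves correctly with multiple producers and consumers):

import java.util.ArrayDeque;

public class Buffer {
    private final ArrayDeque<Object> items = new ArrayDeque<>();
    private final int capacity = 10;

    // Structured lock: the lock on 'this' is acquired and released implicitly
    // on entry to and exit from the synchronized method, even if an exception is thrown.
    public synchronized void insert(Object item) throws InterruptedException {
        while (items.size() == capacity) wait(); // buffer full: block the producer
        items.addLast(item);
        notifyAll(); // unblock consumers waiting in remove()
    }

    public synchronized Object remove() throws InterruptedException {
        while (items.isEmpty()) wait(); // buffer empty: block the consumer
        Object item = items.removeFirst();
        notifyAll(); // unblock producers waiting in insert()
        return item;
    }
}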
MODAI 15 / 70
Unstructured Locks
- Video: c2w1/1.3 Unstructured Locks
- Lecture Summary:
• In this lecture, we introduced unstructured locks (which can be obtained in Java by
creating instances of ReentrantLock()), and used three examples to demonstrate their
generality relative to structured locks:
• The first example showed how explicit lock() and unlock() operations on unstructured
locks can be used to support a hand-over-hand locking pattern that implements a
non-nested pairing of lock/unlock operations which cannot be achieved with synchronized
statements/methods.
• The second example showed how the tryLock() operations in unstructured locks can enable
a thread to check the availability of a lock, and thereby acquire it if it is available or do
something else if it is not.
• The third example illustrated the value of read-write locks (which can be obtained in Java
by creating instances of ReentrantReadWriteLock()), whereby multiple threads are
permitted to acquire a lock L in "read mode", L.readLock().lock(), but only one thread is
permitted to acquire the lock in "write mode", L.writeLock().lock().
• However, it is also important to remember that the generality and power of unstructured
locks are accompanied by an extra responsibility on the part of the programmer, e.g.,
ensuring that calls to unlock() are not forgotten, even in the presence of exceptions.
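• A short sketch of explicit lock()/unlock(), tryLock(), and read-write locks (class and field names are illustrative; the hand-over-hand pattern is omitted for brevity). Note that every lock()/unlock() pair is wrapped in try/finally so the unlock is not forgotten when an exception is thrown:

import java.util.concurrent.locks.ReentrantLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class UnstructuredLockDemo {
    private final ReentrantLock lock = new ReentrantLock();
    private final ReentrantReadWriteLock rw = new ReentrantReadWriteLock();
    private int counter = 0;

    // Explicit lock()/unlock() pairing.
    public void increment() {
        lock.lock();
        try { counter++; } finally { lock.unlock(); }
    }

    // tryLock(): acquire the lock only if it is available, otherwise do something else.
    public boolean tryIncrement() {
        if (!lock.tryLock()) return false; // lock not available; the caller can do other work
        try { counter++; return true; } finally { lock.unlock(); }
    }

    // Read-write lock: many readers may hold the read lock at the same time,
    // but the write lock is exclusive.
    public int read() {
        rw.readLock().lock();
        try { return counter; } finally { rw.readLock().unlock(); }
    }

    public void write(int value) {
        rw.writeLock().lock();
        try { counter = value; } finally { rw.writeLock().unlock(); }
    }
}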
MODAI 16 / 70
Liveness
MODAI 18 / 70
Isolation
MODAI 20 / 70
Dining Philosophers
MODAI 22 / 70
Async/Finish
- Video: c1w1/1.1 Task Creation and Termination (Async, Finish)
- Lecture Summary:
• In this lecture, we learned the concepts of task creation and task termination in parallel
programs, using array-sum as an illustrative example:
• We learned the async notation for task creation: "async 〈stmt1〉" causes the parent task
(i.e., the task executing the async statement) to create a new child task to execute the
body of the async, 〈stmt1〉, asynchronously (i.e., before, after, or in parallel) with the
remainder of the parent task.
• We also learned the finish notation for task termination: "finish 〈stmt2〉" causes the parent
task to execute 〈stmt2〉, and then wait until 〈stmt2〉 and all async tasks created within
〈stmt2〉 have completed.
• Async and finish constructs may be arbitrarily nested. The example studied in the lecture
can be abstracted by the following pseudocode:
finish {
  async S1;  // asynchronously compute sum of the lower half of the array
  S2;        // compute sum of the upper half of the array in parallel with S1
}
S3;          // combine the two partial sums after both S1 and S2 have finished
MODAI 25 / 70
Fork/Join
MODAI 26 / 70
Fork/Join
- Lecture Summary (cont.):
• A sketch of the Java code for the ASum class is as follows:
private static class ASum extends RecursiveAction {
  int[] A;      // input array
  int LO, HI;   // subrange
  int SUM;      // return value
  ...
  @Override
  protected void compute() {
    SUM = 0;
    for (int i = LO; i <= HI; i++) SUM += A[i];
  } // compute()
}
• FJ tasks are executed in a ForkJoinPool, which is a pool of Java threads. This pool
supports the invokeAll() method that combines both the fork and join operations by
executing a set of tasks in parallel, and waiting for their completion. For example,
ForkJoinTask.invokeAll(left,right) implicitly performs fork() operations on left and
right, followed by join() operations on both objects.
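• A possible recursive variant that uses invokeAll() as described above could look like the following sketch (the threshold value is illustrative):

import java.util.concurrent.ForkJoinTask;
import java.util.concurrent.RecursiveAction;

class ASumFJ extends RecursiveAction {
    static final int THRESHOLD = 1000; // illustrative cutoff below which we sum sequentially
    int[] A; int LO, HI; int SUM;

    ASumFJ(int[] A, int LO, int HI) { this.A = A; this.LO = LO; this.HI = HI; }

    @Override
    protected void compute() {
        if (HI - LO <= THRESHOLD) {
            for (int i = LO; i <= HI; i++) SUM += A[i]; // small subrange: sequential sum
        } else {
            int mid = (LO + HI) / 2;
            ASumFJ left  = new ASumFJ(A, LO, mid);
            ASumFJ right = new ASumFJ(A, mid + 1, HI);
            ForkJoinTask.invokeAll(left, right); // fork both halves, then join both
            SUM = left.SUM + right.SUM;          // combine the two partial sums
        }
    }
}
// Usage: new java.util.concurrent.ForkJoinPool().invoke(new ASumFJ(A, 0, A.length - 1));
// after invoke() returns, the root task's SUM field holds the total.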
MODAI 27 / 70
Computation Graphs
MODAI 29 / 70
Computation Graphs
MODAI 30 / 70
Parallel Speedup
- Video: c1w1/1.4 Multiprocessor Scheduling, Parallel Speedup
- Lecture Summary:
• In this lecture, we studied the possible executions of a Computation Graph (CG) on an
idealized parallel machine with P processors. It is idealized because all processors are
assumed to be identical, and the execution time of a node is assumed to be fixed,
regardless of which processor it executes on.
• A legal schedule is one that obeys the dependence constraints in the CG, such that for
every directed edge (A, B), the schedule guarantees that step B is only scheduled after
step A completes. Unless otherwise specified, we will restrict our attention in this course
to schedules that have no unforced idleness, i.e., schedules in which a processor is not
permitted to be idle if a CG node is available to be scheduled on it. Such schedules are
also referred to as "greedy" schedules.
• We defined TP as the execution time of a CG on P processors, and observed that:
• T∞ ≤ TP ≤ T1
• T1 = WORK(G)
• TP ≥ SPAN(G) (we also saw examples for which there could be different values of TP for
different schedules of the same CG on P processors)
• We then defined the parallel speedup for a given schedule of a CG on P processors as
Speedup(P) = T1/TP ≤ WORK(G)/SPAN(G).
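• As a quick numerical illustration (the numbers are made up): suppose WORK(G) = 100 and SPAN(G) = 20. Then

  T_1 = \mathrm{WORK}(G) = 100, \qquad T_P \ge \mathrm{SPAN}(G) = 20
  \quad\Rightarrow\quad \mathrm{Speedup}(P) = \frac{T_1}{T_P} \le \frac{100}{20} = 5 \quad \text{for every } P,

  so no schedule of this CG can achieve a speedup greater than 5, no matter how many processors are used.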
MODAI 32 / 70
Amdahl’s Law
MODAI 33 / 70
Tasks with Return Values
MODAI 36 / 70
In Java’s Fork/Join Framework
MODAI 37 / 70
Streams
MODAI 39 / 70
Streams
• This form of functional parallelism is a major convenience for the programmer, since they
do not need to worry about explicitly allocating intermediate collections (e.g., a collection
of all active students), or about ensuring that parallel accesses to data collections are
properly synchronized.
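• A minimal sketch of this style in Java (the Student record and its fields are hypothetical; any class with the same accessors would do):

import java.util.List;

record Student(String name, boolean active, double age) {}

public class StreamDemo {
    public static void main(String[] args) {
        List<Student> students = List.of(
                new Student("Ada", true, 20), new Student("Bob", false, 31), new Student("Eva", true, 24));

        // No explicit intermediate collection of active students is allocated, and no
        // explicit synchronization is needed -- the parallel stream handles both.
        double avgActiveAge = students.parallelStream()
                                      .filter(Student::active)
                                      .mapToDouble(Student::age)
                                      .average()
                                      .orElse(0.0);
        System.out.println("average age of active students = " + avgActiveAge);
    }
}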
MODAI 40 / 70
Parallel Loops
- Video: c1w3/3.1 Parallel Loops
- Lecture Summary:
• In this lecture, we learned different ways of expressing parallel loops.
• The most general way is to think of each iteration of a parallel loop as an async task,
with a finish construct encompassing all iterations. This approach can support general
cases such as parallelization of the following pointer-chasing while loop (in pseudocode):
finish {
  for (p = head; p != null; p = p.next) {
    async compute(p);
  }
}
• However, further efficiencies can be gained by paying attention to counted-for loops for
which the number of iterations is known on entry to the loop (before the loop executes its
first iteration). We then learned the forall notation for expressing parallel counted-for
loops, such as in the following vector addition statement (in pseudocode):
forall (i : [0:n-1]) {
  a[i] = b[i] + c[i];
}
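• In Java, the forall above can be approximated with a parallel stream over the index range, since each iteration is independent (a sketch; the array names follow the pseudocode and the element type is chosen arbitrarily):

import java.util.stream.IntStream;

public class VectorAdd {
    public static void main(String[] args) {
        int n = 1_000_000;
        double[] a = new double[n], b = new double[n], c = new double[n];

        // Each index i is an independent iteration, so the loop can safely run in parallel.
        IntStream.range(0, n).parallel().forEach(i -> a[i] = b[i] + c[i]);
    }
}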
MODAI 42 / 70
Parallel Loops
• In summary, streams are a convenient notation for parallel loops with at most one output
array, but the forall notation is more convenient for loops that create/update multiple
output arrays, as is the case in many scientific computations. For generality, we will use
the forall notation for parallel loops in the remainder of this module.
MODAI 43 / 70
Parallel Matrix Multiplication
MODAI 44 / 70
Parallel Matrix Multiplication
MODAI 45 / 70
Unicast Communication
MODAI 48 / 70
Unicast Communication
MODAI 49 / 70
Multithreaded Servers
MODAI 50 / 70
Multithreaded Servers
MODAI 51 / 70
Broadcast/Multicast Communication
MODAI 53 / 70
Remote Method Invocation
- Video: c3w2/2.3 Remote Method Invocation
- Lecture Summary:
• This lecture reviewed the concept of Remote Method Invocation (RMI), which extends the
notion of method invocation in a sequential program to a distributed programming setting.
• As an example, let us consider a scenario in which a thread running on JVM A wants to
invoke a method, foo(), on object x located on JVM B. This can be accomplished using
sockets and messages, but that approach would entail writing a lot of extra code for
encoding and decoding the method call, its arguments, and its return value. In contrast,
Java RMI provides a very convenient way to directly address this use case.
• To enable RMI:
• We run an RMI client on JVM A and an RMI server on JVM B.
• Further, JVM A is set up to contain a stub object or proxy object for remote object x
located on JVM B.
• When a stub method is invoked, it transparently initiates a connection with the remote
JVM containing the remote object, x, serializes and communicates the method parameters
to the remote JVM, receives the result of the method invocation, and deserializes the result
into object y (say) which is then passed on to the caller of method x.foo() as the result of
the RMI call.
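• A minimal client-side sketch of this scenario (the interface name, host name, port, and binding name are all illustrative; the corresponding server on JVM B would export an implementation of XService and bind it in its RMI registry):

import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;

// Remote interface implemented by object x on JVM B.
interface XService extends Remote {
    String foo(String arg) throws RemoteException;
}

public class RmiClient {
    public static void main(String[] args) throws Exception {
        // JVM A: obtain a stub (proxy object) for the remote object registered on JVM B.
        Registry registry = LocateRegistry.getRegistry("jvm-b-host", 1099);
        XService x = (XService) registry.lookup("x");

        // The stub serializes the argument, ships it to JVM B, invokes foo() there,
        // and deserializes the result (object y) returned to the caller.
        String y = x.foo("hello");
        System.out.println(y);
    }
}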
MODAI 55 / 70
Remote Method Invocation
MODAI 56 / 70
SPMD model
- Video: c3w3/3.1 Single Program Multiple Data (SPMD) model
- Lecture Summary:
• In this lecture, we studied the Single Program Multiple Data (SPMD) model, which can
enable the use of a cluster of distributed nodes as a single parallel computer. Each node
in such a cluster typically consists of a multicore processor, a local memory, and a network
interface card (NIC) that enables it to communicate with other nodes in the cluster.
• One of the biggest challenges that arises when trying to use the distributed nodes as a
single parallel computer is that of data distribution:
• In general, we would want to allocate large data structures that span multiple nodes in the
cluster; this logical view of data structures is often referred to as a global view.
• However, a typical physical implementation of this global view on a cluster is obtained by
distributing pieces of the global data structure across different nodes, so that each node has
a local view of the piece of the data structure allocated in its local memory.
• In many cases in practice, the programmer has to undertake the conceptual burden of
mapping back and forth between the logical global view and the physical local views.
• In this module, we will focus on a commonly used implementation of the SPMD model known as the Message Passing Interface (MPI).
MODAI 59 / 70
SPMD model
- Lecture Summary (cont.):
• When using MPI, you designate a fixed set of processes that will participate for the entire
lifetime of the global application:
• It is common for each node to execute one MPI process, but it is also possible to execute
more than one MPI process per multicore node so as to improve the utilization of processor
cores within the node.
• Each process starts executing its own copy of the MPI program, and starts by calling the
mpi.MPI_Init() method, where mpi is the instance of the MPI class used by the process.
• After that, each process can call the
mpi.MPI_Comm_size(mpi.MPI_COMM_WORLD) method to determine the total
number of processes participating in the MPI application, and the
MPI_Comm_rank(mpi.MPI_COMM_WORLD) method to determine the process’ own
rank within the range, [0:(S-1)], where S=MPI_Comm_size().
• In this lecture, we studied how:
• A global view XG, of array X can be implemented by S local arrays (one per process) of
size, XL.length=XG.length/S. For simplicity, assume that XG.length is a multiple of S.
• Then, if we logically want to set XG[i]:=i for all logical elements of XG, we can instead set
XL[i]:=L*R+i in each local array, where L=XL.length, and R=MPI_Comm_rank().
• Thus process 0’s copy of XL will contain logical elements XG[0...L-1], process 1’s copy of
XL will contain logical elements XG[L...2*L-1], and so on.
• Thus, we see that the SPMD approach is very different from client-server programming,
where each process can be executing a different program.
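• A sketch of the local-view initialization above, written as plain Java with the process count S and rank R passed in as parameters (in an actual MPI program they would come from MPI_Comm_size() and MPI_Comm_rank()):

public class LocalView {
    // Build this process's local piece XL of the global array XG, where logically
    // XG[i] = i and XG.length = n (assumed to be a multiple of S).
    static int[] localPiece(int n, int S, int R) {
        int L = n / S;            // XL.length = XG.length / S
        int[] XL = new int[L];
        for (int i = 0; i < L; i++) {
            XL[i] = L * R + i;    // global index of local element i on rank R
        }
        return XL;
    }
}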
MODAI 60 / 70
Point-to-Point Communication
- Video: c3w3/3.2 Point-to-Point Communication
- Lecture Summary:
• In this lecture, we studied how to perform point-to-point communication in MPI by
sending and receiving messages.
• In particular, we worked out the details for a simple scenario in which process 0 sends a
string, "ABCD", to process 1:
• Since MPI programs follow the SPMD model, we have to ensure that the same program
behaves differently on processes 0 and 1. This was achieved by using an if-then-else
statement that checks the value of the rank of the process that it is executing on. If the
rank is zero, we include the necessary code for calling MPI_Send(); otherwise, we include
the necessary code for calling MPI_Recv() (assuming that this simple program is only
executed with two processes).
• Both calls include a number of parameters. The MPI_Send() call specifies the substring to
be sent as a subarray by providing the string, offset, and data type, as well as the rank of
the receiver, and a tag to assist with matching send and receive calls (we used a tag value
of 99 in the lecture). The MPI_Recv() call (in the else part of the if-then-else statement)
includes a buffer in which to receive the message, along with the offset and data type, as
well as the rank of the sender and the tag.
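• The control structure described above is roughly the following (a Java-style sketch that follows the slide's notation; the MPI_Send()/MPI_Recv() calls are left as comments because their exact signatures depend on the MPI binding being used):

public class SendRecvSketch {
    // Every process executes this same code (SPMD); the rank selects the behaviour.
    static void run(int rank) {
        String message = "ABCD";
        int tag = 99;
        if (rank == 0) {
            // Sender: the substring is given as (buffer, offset, length, datatype),
            // plus the receiver's rank (1) and the tag (99), e.g.
            //   MPI_Send(message, 0, message.length(), CHAR, 1, tag);
        } else if (rank == 1) {
            // Receiver: a buffer to receive into, plus offset, length, datatype,
            // the sender's rank (0), and the matching tag (99), e.g.
            //   char[] buffer = new char[4];
            //   MPI_Recv(buffer, 0, buffer.length, CHAR, 0, tag);
        }
    }
}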
MODAI 62 / 70
Point-to-Point Communication
MODAI 63 / 70
Collective Communication
MODAI 65 / 70
Collective Communication
- Lecture Summary (cont.):
• Collective communications help exploit this efficiency by leveraging the fact that all MPI
processes execute the same program in an SPMD model. For a broadcast operation:
• All MPI processes execute an MPI_Bcast() API call with a specified root process that is
the source of the data to be broadcasted.
• A key property of collective operations is that each process must wait until all processes
reach the same collective operation, before the operation can be performed. This form of
waiting is referred to as a barrier.
• After the operation is completed, all processes can move past the implicit barrier in the
collective call. In the case of MPI_Bcast(), each process will have obtained a copy of the
value broadcasted by the root process.
ref: https://nyu-cds.github.io/python-mpi/05-collectives
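• Structurally, a broadcast therefore looks like the following sketch (Java-style, following the slide's notation; the MPI_Bcast() call is left as a comment because its exact signature depends on the MPI binding):

public class BroadcastSketch {
    // Every process executes the same collective call; rank ROOT supplies the data.
    static void run(int rank) {
        final int ROOT = 0;
        int[] value = new int[1];
        if (rank == ROOT) {
            value[0] = 42; // only the root holds the value before the call
        }
        // All processes must reach this collective call (implicit barrier), e.g.
        //   MPI_Bcast(value, 0, 1, INT, ROOT);
        // After it completes, value[0] == 42 on every process.
    }
}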
MODAI 67 / 70
MPI and Threading
MODAI 69 / 70
MPI and Threading
- Lecture Summary (cont.):
• One approach to enable multithreading in MPI applications is to create one MPI process
(rank) per node, which starts execution in a single thread that is referred to as a master
thread. This thread calls MPI_Init() and MPI_Finalize() for its rank, and creates a
number of worker threads to assist in the computation to be performed within its MPI
process. Further, all MPI calls are performed only by the master thread. This approach is
referred to as the MPI_THREAD_FUNNELED mode, since, even though there are
multiple threads, all MPI calls are "funneled" through the master thread.
• A second more general mode for MPI and multithreading is referred to as
MPI_THREAD_SERIALIZED; in this mode, multiple threads may make MPI calls but
must do so one at a time using appropriate concurrency constructs so that the calls are
"serialized".
• The most general mode is called MPI_THREAD_MULTIPLE because it allows
multiple threads to make MPI calls in parallel; though this mode offers more flexibility
than the other modes, it puts an additional burden on the MPI implementation which
usually gets reflected in larger overheads for MPI calls relative to the more restrictive
modes. Further, even the MPI_THREAD_MULTIPLE mode has some notable
restrictions, e.g., it is not permitted in this mode for two threads in the same process to
wait on the same MPI request related to a nonblocking communication.
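• A sketch of the MPI_THREAD_FUNNELED pattern described above, using plain Java worker threads; the MPI calls are left as comments since their exact form depends on the MPI binding, and the point is only that they all appear on the master thread:

import java.util.ArrayList;
import java.util.List;

public class FunneledSketch {
    public static void main(String[] args) throws InterruptedException {
        // Master thread only: initialize MPI for this rank, e.g.
        //   MPI_Init(args);

        // Worker threads perform local computation but make no MPI calls.
        List<Thread> workers = new ArrayList<>();
        for (int w = 0; w < 4; w++) {
            Thread t = new Thread(() -> {
                // ... compute on this rank's share of the data ...
            });
            workers.add(t);
            t.start();
        }
        for (Thread t : workers) t.join(); // master waits for all workers

        // All communication is "funneled" through the master thread,
        // which also shuts MPI down, e.g.
        //   MPI_Finalize();
    }
}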
MODAI 70 / 70
Thanks for your attention!
MODAI 71 / 70