
Concurrent, Parallel, and Distributed Computing

Trang Hồng Sơn

[email protected]

MODAI 1 / 70
Outline

1. Introduction
2. Concurrent Computing
   2.1 Threads
   2.2 Locks
   2.3 Liveness
   2.4 Isolation
   2.5 Dining Philosophers
3. Parallel Computing
   3.1 Task Parallelism
       Async/Finish vs. Fork/Join
       Computation Graphs
       Parallel Speedup
   3.2 Functional Parallelism
       Futures
       Streams
   3.3 Loop Parallelism
4. Distributed Computing
   4.1 Client-Server
       Unicast Communication (one-to-one)
       Broadcast/Multicast Communication (one-to-all/one-to-many)
       Remote Method Invocation
   4.2 Message Passing Interface
       Single Program Multiple Data model
       Point-to-Point Communication
       Collective Communication
       MPI and Threading

MODAI 2 / 70
Serial vs. Parallel computing

- Serial computing: instructions are executed sequentially one after another on a single
processor.
- Parallel computing: instructions from each part execute simultaneously on different
processors.

ref: https://hpc.llnl.gov/documentation/tutorials/introduction-parallel-computing-tutorial

MODAI 3 / 70
Single-core vs. Multi-core processor

- Single-core: run on only one processor (CPU).
- Multi-core: run on more than one processor (CPUs).

ref: http://dx.doi.org/10.5121/ijesa.2015.5101

MODAI 4 / 70
Concurrency vs. Parallelism

- Concurrency: executing multiple tasks on the same core using overlapping or time slicing.
- Parallelism: executing multiple tasks on different cores.

ref: https://www.codeproject.com/Articles/1267757/Concurrency-vs-Parallelism
ref: https://devopedia.org/concurrency-vs-parallelism

MODAI 5 / 70
Parallel vs. Distributed system

- Parallel computing system: contains multiple processors that communicate with each other
using a shared memory.
- Distributed computing system: contains multiple processors connected by a communication
network.

ref: https://www.oreilly.com/library/view/distributed-computing-in/9781787126992/7478b64c-8de4-4db3-b473-66e1d1fcba77.xhtml

MODAI 6 / 70
Node vs. Cluster

- Networks connect multiple stand-alone computers (nodes) to make larger parallel computer
clusters.

ref: https://hpc.llnl.gov/documentation/tutorials/introduction-parallel-computing-tutorial

MODAI 7 / 70
Memory Architectures

- Shared memory: multiple processors can operate independently but share the same memory
resources.
- Distributed memory: each processor has its own local memory and operates independently;
when a processor needs access to data in another processor, it is usually the task of the
programmer to explicitly define how and when data is communicated over a communication
network.
- Hybrid memory: both shared and distributed memory.

ref: https://hpc.llnl.gov/documentation/tutorials/introduction-parallel-computing-tutorial

MODAI 8 / 70
Communication Models

- Shared memory model: processes/tasks share a common address space (memory), which they
read from and write to asynchronously.
- Distributed memory model: tasks exchange data through communication, by sending and
receiving messages.

ref: https://hpc.llnl.gov/documentation/tutorials/introduction-parallel-computing-tutorial

MODAI 9 / 70
Processes vs. Threads

- Video: c3w4/4.1 Processes and Threads


- Lecture Summary:
• We introduced processes and threads, which serve as the fundamental building blocks of
distributed computing software. In the case of Java applications, a process corresponds to
a single Java Virtual Machine (JVM) instance, and threads are created within a JVM
instance.
• The advantages of creating multiple threads in a process include increased sharing of
memory and per-process resources by threads, improved responsiveness due to
multithreading, and improved performance since threads in the same process can
communicate with each other through a shared address space.
• The advantages of creating multiple processes in a node include improved responsiveness
(also) due to multiprocessing (e.g., when a JVM is paused during garbage collection),
improved scalability (going past the scalability limitations of multithreading), and
improved resilience to JVM failures within a node. However, processes can only
communicate with each other through message-passing and other communication
patterns for distributed computing.

MODAI 10 / 70
Processes vs. Threads
- Lecture Summary (cont.):
• In summary:
• Processes are the basic units of distribution in a cluster of nodes - we can distribute
processes across multiple nodes in a high-performance computing (HPC) cluster, and even
create multiple processes within a node.
• Threads are the basic unit of parallelism and concurrency - we can create multiple threads
in a process that can share resources like memory, and contribute to improved performance.
However, it is not possible for two threads belonging to the same process to be scheduled
on different nodes.

ref: https://jedyang.com/post/multithreading-in-python-pytorch-using-c++-extension

MODAI 11 / 70
Threads

- Video: c2w1/1.1 Threads


- Lecture Summary:
• We learned the concept of threads as lower-level building blocks for concurrent programs.
A unique aspect of Java compared to prior mainstream programming languages is that
Java included the notions of threads (as instances of the java.lang.Thread class) in its
language definition right from the start.
• When an instance of Thread is created (via a new operation), it does not start executing
right away; instead, it can only start executing when its start() method is invoked. The
statement or computation to be executed by the thread is specified as a parameter to the
constructor.
• The Thread class also includes a wait operation in the form of a join() method. If thread
t0 performs a t1.join() call, thread t0 will be forced to wait until thread t1 completes,
after which point it can safely access any values computed by thread t1. Since there is no
restriction on which thread can perform a join on which other thread, it is possible for a
programmer to erroneously create a deadlock cycle with join operations.
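• A minimal sketch of the start()/join() protocol described above (class and field names are
illustrative):

  public class JoinExample {
      static int result; // written by t1, read by the main thread after join()

      public static void main(String[] args) throws InterruptedException {
          Thread t1 = new Thread(() -> result = 42); // computation passed to the constructor
          t1.start();                 // t1 only begins executing here
          t1.join();                  // the calling thread waits until t1 completes
          System.out.println(result); // safe to read: join() guarantees t1 has finished
      }
  }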

MODAI 13 / 70
Structured Locks
- Video: c2w1/1.2 Structured Locks
- Lecture Summary:
• In this lecture, we learned about structured locks, and how they can be implemented
using synchronized statements and methods in Java.
• Structured locks can be used to enforce mutual exclusion and avoid data races, as
illustrated by the incr() method in the A.count example, and the insert() and remove()
methods in the Buffer example. A major benefit of structured locks is that their
acquire and release operations are implicit, since these operations are automatically
performed by the Java runtime environment when entering and exiting the scope of a
synchronized statement or method, even if an exception is thrown in the middle.
• We also learned about wait() and notify() operations that can be used to block and
resume threads that need to wait for specific conditions. For example, a producer thread
performing an insert() operation on a bounded buffer can call wait() when the buffer is
full, so that it is only unblocked when a consumer thread performing a remove()
operation calls notify(). Likewise, a consumer thread performing a remove() operation
on a bounded buffer can call wait() when the buffer is empty, so that it is only unblocked
when a producer thread performing an insert() operation calls notify().
• Structured locks are also referred to as intrinsic locks or monitors.
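• A minimal sketch of the bounded Buffer example above, using a structured (intrinsic) lock
via synchronized; the implementation details are illustrative rather than the lecture's exact
code:

  class Buffer {
      private final Object[] items;
      private int count = 0, head = 0, tail = 0;

      Buffer(int capacity) { items = new Object[capacity]; }

      synchronized void insert(Object x) throws InterruptedException {
          while (count == items.length) wait(); // buffer full: block until a remove() occurs
          items[tail] = x;
          tail = (tail + 1) % items.length;
          count++;
          notifyAll();                          // wake up any waiting consumers
      }

      synchronized Object remove() throws InterruptedException {
          while (count == 0) wait();            // buffer empty: block until an insert() occurs
          Object x = items[head];
          head = (head + 1) % items.length;
          count--;
          notifyAll();                          // wake up any waiting producers
          return x;
      }
  }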

MODAI 15 / 70
Unstructured Locks
- Video: c2w1/1.3 Unstructured Locks
- Lecture Summary:
• In this lecture, we introduced unstructured locks (which can be obtained in Java by
creating instances of ReentrantLock()), and used three examples to demonstrate their
generality relative to structured locks:
• The first example showed how explicit lock() and unlock() operations on unstructured
locks can be used to support a hand-over-hand locking pattern that implements a
non-nested pairing of lock/unlock operations which cannot be achieved with synchronized
statements/methods.
• The second example showed how the tryLock() operations in unstructured locks can enable
a thread to check the availability of a lock, and thereby acquire it if it is available or do
something else if it is not.
• The third example illustrated the value of read-write locks (which can be obtained in Java
by creating instances of ReentrantReadWriteLock()), whereby multiple threads are
permitted to acquire a lock L in "read mode", L.readLock().lock(), but only one thread is
permitted to acquire the lock in "write mode", L.writeLock().lock().
• However, it is also important to remember that the generality and power of unstructured
locks is accompanied by an extra responsibility on the part of the programmer, e.g.,
ensuring that calls to unlock() are not forgotten, even in the presence of exceptions.
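• A hedged sketch of the tryLock() and read-write lock patterns, together with the try/finally
idiom that ensures unlock() is never forgotten (field and method names are illustrative):

  import java.util.concurrent.locks.Lock;
  import java.util.concurrent.locks.ReentrantLock;
  import java.util.concurrent.locks.ReentrantReadWriteLock;

  class UnstructuredLockExamples {
      private final Lock lock = new ReentrantLock();
      private final ReentrantReadWriteLock rw = new ReentrantReadWriteLock();
      private int value = 0;

      void increment() {
          lock.lock();
          try { value++; }            // critical section
          finally { lock.unlock(); }  // released even if an exception is thrown
      }

      boolean tryIncrement() {
          if (lock.tryLock()) {       // acquire the lock only if it is available
              try { value++; return true; }
              finally { lock.unlock(); }
          }
          return false;               // otherwise do something else instead of blocking
      }

      int read() {
          rw.readLock().lock();       // many threads may hold the lock in read mode
          try { return value; }
          finally { rw.readLock().unlock(); }
      }

      void write(int v) {
          rw.writeLock().lock();      // only one thread may hold the lock in write mode
          try { value = v; }
          finally { rw.writeLock().unlock(); }
      }
  }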

MODAI 16 / 70
Liveness

- Video: c2w1/1.4 Liveness


- Lecture Summary:
• In this lecture, we studied three ways in which a parallel program may enter a state in
which it stops making forward progress. For sequential programs, an "infinite loop" is a
common way for a program to stop making forward progress, but there are other ways to
obtain an absence of progress in a parallel program:
• The first is deadlock, in which all threads are blocked indefinitely, thereby preventing any
forward progress.
• The second is livelock, in which all threads repeatedly perform an interaction that prevents
forward progress, e.g., an infinite "loop" of repeating lock acquire/release patterns.
• The third is starvation, in which at least one thread is prevented from making any forward
progress.
• The term "liveness" refers to a progress guarantee. The three progress guarantees that
correspond to the absence of the conditions listed above are deadlock freedom, livelock
freedom, and starvation freedom.

MODAI 18 / 70
Isolation

- Video: c2w2/2.1 Critical Sections


- Lecture Summary:
• In this lecture, we learned how critical sections and the isolated construct can help
concurrent threads manage their accesses to shared resources, at a higher level than just
using locks.
• When programming with threads, it is well known that the following situation is defined
to be a data race error - when two accesses on the same shared location can potentially
execute in parallel, with at least one access being a write. However, there are many cases in
practice when two tasks may legitimately need to perform concurrent accesses to shared
locations, as in the bank transfer example.
• With critical sections, two blocks of code that are marked as isolated, say A and B, are
guaranteed to be executed in mutual exclusion with A executing before B or vice versa.
With the use of isolated constructs, it is impossible for the bank transfer example to end
up in an inconsistent state because all the reads and writes for one isolated section must
complete before the start of another isolated construct.
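• Java has no built-in isolated construct; as a minimal (coarse-grained) sketch, the bank
transfer example can be approximated by guarding every transfer with one global lock object,
so that any two "isolated" sections execute in mutual exclusion (class and field names are
illustrative):

  class Bank {
      private static final Object GLOBAL_ISOLATION = new Object(); // plays the role of isolated { ... }

      static void transfer(int[] balance, int from, int to, int amount) {
          synchronized (GLOBAL_ISOLATION) {
              balance[from] -= amount; // all reads and writes of one section complete
              balance[to]   += amount; // before another section can start
          }
      }
  }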

MODAI 20 / 70
Dining Philosophers

- Video: c2w1/1.5 Dining Philosophers


- Lecture Summary:
• In the Dining Philosophers problem, there are five
threads, each of which models a "philosopher" that
repeatedly performs a sequence of actions which include
think, pick up chopsticks, eat, and put down chopsticks:
• First, we examined a solution to this problem using
structured locks, and demonstrated how this solution
could lead to a deadlock scenario (but not livelock).
• Second, we examined a solution using unstructured
locks with tryLock() and unlock() operations that
never block, and demonstrated how this solution could
lead to a livelock scenario (but not deadlock).
• Finally, we observed how a simple modification to the first solution with structured locks,
in which one philosopher picks up their right chopstick first and then their left, while the
others pick up their left chopstick first and then their right, can guarantee an absence of
deadlock.

ref: https://en.wikipedia.org/wiki/Dining_philosophers_problem
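• A hedged sketch of that deadlock-free variant, where exactly one philosopher acquires the
chopsticks in the opposite order (names and structure are illustrative):

  class Philosopher implements Runnable {
      private final Object left, right; // the two chopsticks, used as monitors
      private final boolean flipped;    // true for exactly one philosopher

      Philosopher(Object left, Object right, boolean flipped) {
          this.left = left; this.right = right; this.flipped = flipped;
      }

      public void run() {
          Object first  = flipped ? right : left;  // one philosopher reverses the order,
          Object second = flipped ? left  : right; // which breaks the circular wait
          while (true) {
              // think();
              synchronized (first) {
                  synchronized (second) {
                      // eat();
                  }
              }
          }
      }
  }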

MODAI 22 / 70
Async/Finish
- Video: c1w1/1.1 Task Creation and Termination (Async, Finish)
- Lecture Summary:
• In this lecture, we learned the concepts of task creation and task termination in parallel
programs, using array-sum as an illustrative example:
• We learned the async notation for task creation: "async 〈stmt1〉", causes the parent task
(i.e., the task executing the async statement) to create a new child task to execute the
body of the async, 〈stmt1〉, asynchronously (i.e., before, after, or in parallel) with the
remainder of the parent task.
• We also learned the finish notation for task termination: "finish 〈stmt2〉" causes the parent
task to execute 〈stmt2〉, and then wait until 〈stmt2〉 and all async tasks created within
〈stmt2〉 have completed.
• Async and finish constructs may be arbitrarily nested. The example studied in the lecture
can be abstracted by the following pseudocode:
finish {
  async S1; // asynchronously compute sum of the lower half of the array
  S2;       // compute sum of the upper half of the array in parallel with S1
}
S3;         // combine the two partial sums after both S1 and S2 have finished

MODAI 25 / 70
Fork/Join

- Video: c1w1/1.2 Tasks in Java’s Fork-Join Framework


- Lecture Summary:
• In this lecture, we learned how to implement the async and finish functionality using
Java’s standard Fork/Join (FJ) framework. In this framework, a task can be specified in
the compute() method of a user-defined class that extends the standard
RecursiveAction class in the FJ framework.
• In our ArraySum example, we created class ASum with fields A for the input array, LO
and HI for the subrange for which the sum is to be computed, and SUM for the result for
that subrange. For an instance of this user-defined class (e.g., L in the lecture), we
learned that the method call, L.fork(), creates a new task that executes L’s compute()
method. This implements the functionality of the async construct that we learned earlier.
The call to L.join() then waits until the computation created by L.fork() has completed.
Note that join() is a lower-level primitive than finish because join() waits for a specific
task, whereas finish implicitly waits for all tasks created in its scope. To implement the
finish construct using join() operations, you have to be sure to call join() on every task
created in the finish scope.

MODAI 26 / 70
Fork/Join
- Lecture Summary (cont.):
• A sketch of the Java code for the ASum class is as follows:
private static class ASum extends RecursiveAction {
    int[] A;    // input array
    int LO, HI; // subrange
    int SUM;    // return value
    ...
    @Override
    protected void compute() {
        SUM = 0;
        for (int i = LO; i <= HI; i++) SUM += A[i];
    } // compute()
}

• FJ tasks are executed in a ForkJoinPool, which is a pool of Java threads. This pool
supports the invokeAll() method that combines both the fork and join operations by
executing a set of tasks in parallel, and waiting for their completion. For example,
ForkJoinTask.invokeAll(left,right) implicitly performs fork() operations on left and
right, followed by join() operations on both objects.
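• A hedged sketch of how ASum might be used recursively with invokeAll(); the threshold and
some field details are illustrative, not the lecture's exact code:

  import java.util.concurrent.ForkJoinPool;
  import java.util.concurrent.RecursiveAction;

  class ASum extends RecursiveAction {
      static final int THRESHOLD = 1_000; // illustrative sequential cutoff
      final int[] A; final int LO, HI; long SUM;

      ASum(int[] A, int LO, int HI) { this.A = A; this.LO = LO; this.HI = HI; }

      @Override
      protected void compute() {
          if (HI - LO <= THRESHOLD) {
              for (int i = LO; i <= HI; i++) SUM += A[i]; // small range: sum sequentially
          } else {
              int mid = (LO + HI) / 2;
              ASum left = new ASum(A, LO, mid), right = new ASum(A, mid + 1, HI);
              invokeAll(left, right);     // fork both halves, then join both
              SUM = left.SUM + right.SUM; // safe: both subtasks have completed
          }
      }
  }

  // Usage: ASum task = new ASum(A, 0, A.length - 1);
  //        ForkJoinPool.commonPool().invoke(task);
  //        long total = task.SUM;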

MODAI 27 / 70
Computation Graphs

- Video: c1w1/1.3 Computation Graphs, Work, Span


- Lecture Summary:
• In this lecture, we learned about Computation Graphs (CGs), which model the execution
of a parallel program as a partially ordered set. Specifically, a CG consists of:
• A set of vertices or nodes, in which each node represents a step consisting of an arbitrary
sequential computation.
• A set of directed edges that represent ordering constraints among steps.
• For fork–join programs, it is useful to partition the edges into three cases:
1. Continue edges that capture sequencing of steps within a task.
2. Fork edges that connect a fork operation to the first step of child tasks.
3. Join edges that connect the last step of a task to all join operations on that task.
• CGs can be used to define data races, an important class of bugs in parallel programs. We
say that a data race occurs on location L in a computation graph, G, if there exist steps
S1 and S2 in G such that there is no path of directed edges from S1 to S2 or from S2 to
S1 in G, and both S1 and S2 read or write L (with at least one of the accesses being a
write, since two parallel reads do not pose a problem).

MODAI 29 / 70
Computation Graphs

- Lecture Summary (cont.):


• CGs can also be used to reason about the ideal parallelism of a parallel program as
follows:
• Define WORK(G) to be the sum of the execution times of all nodes in CG G,
• Define SPAN(G) to be the length of a longest path in G, when adding up the execution
times of all nodes in the path. The longest paths are known as critical paths, so SPAN also
represents the critical path length (CPL) of G.
• Given the above definitions of WORK and SPAN, we define the ideal parallelism of
Computation Graph G as the ratio, WORK(G)/SPAN(G).
• The ideal parallelism is an upper limit on the speedup factor that can be obtained from
parallel execution of nodes in computation graph G. Note that ideal parallelism is only a
function of the parallel program, and does not depend on the actual parallelism available
in a physical computer.
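• As a hypothetical worked example (numbers chosen for illustration): a computation graph
consisting of four parallel steps of 10 time units each, followed by one sequential combine
step of 4 time units, has WORK(G) = 4 × 10 + 4 = 44 and SPAN(G) = 10 + 4 = 14, so its ideal
parallelism is 44/14 ≈ 3.1.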

MODAI 30 / 70
Parallel Speedup
- Video: c1w1/1.4 Multiprocessor Scheduling, Parallel Speedup
- Lecture Summary:
• In this lecture, we studied the possible executions of a Computation Graph (CG) on an
idealized parallel machine with P processors. It is idealized because all processors are
assumed to be identical, and the execution time of a node is assumed to be fixed,
regardless of which processor it executes on.
• A legal schedule is one that obeys the dependence constraints in the CG, such that for
every directed edge (A, B), the schedule guarantees that step B is only scheduled after
step A completes. Unless otherwise specified, we will restrict our attention in this course
to schedules that have no unforced idleness, i.e., schedules in which a processor is not
permitted to be idle if a CG node is available to be scheduled on it. Such schedules are
also referred to as "greedy" schedules.
• We defined T_P as the execution time of a CG on P processors, and observed that:
  • T_∞ ≤ T_P ≤ T_1
  • T_1 = WORK(G)
  • T_P ≥ SPAN(G) (we also saw examples for which there could be different values of T_P for
    different schedules of the same CG on P processors)
• We then defined the parallel speedup for a given schedule of a CG on P processors as
  Speedup(P) = T_1/T_P ≤ WORK(G)/SPAN(G).
MODAI 32 / 70
Amdahl’s Law

- Video: c1w1/1.5 Amdahl’s Law


- Lecture Summary:
• In this lecture, we studied a simple observation made by Gene Amdahl in 1967: if q ≤ 1 is
the fraction of WORK in a parallel program that must be executed sequentially, then the
best speedup that can be obtained for that program for any number of processors, P, is
Speedup(P) ≤ 1/q.
• This observation follows directly from a lower bound on parallel execution time that you
are familiar with, namely T_P ≥ SPAN(G). If fraction q of WORK(G) is sequential, it
must be the case that SPAN(G) ≥ q × WORK(G). Therefore, Speedup(P) = T_1/T_P
must be ≤ WORK(G)/(q × WORK(G)) = 1/q since T_1 = WORK(G) for greedy
schedulers.
• Amdahl’s Law reminds us to watch out for sequential bottlenecks both when designing
parallel algorithms and when implementing programs on real machines. As an example, if
q = 10%, then Amdahl’s Law reminds us that the best possible speedup must be ≤ 10
(which equals 1/q), regardless of the number of processors available.

MODAI 33 / 70
Tasks with Return Values

- Video: c1w2/2.1 Futures: Tasks with Return Values


- Lecture Summary:
• In this lecture, we learned how to extend the concept of asynchronous tasks to future
tasks and future objects (also known as promise objects). Future tasks are tasks with
return values, and a future object is a "handle" for accessing a task’s return value. There
are two key operations that can be performed on a future object, A:
1. Assignment: object A can be assigned a reference to a future object returned by a task of
the form, future 〈task-with-return-value〉 (using pseudocode notation). The content of
the future object is constrained to be single assignment (similar to a final variable in Java),
and cannot be modified after the future task has returned.
2. Blocking read : the operation, A.get(), waits until the task associated with future object A
has completed, and then propagates the task’s return value as the value returned by
A.get(). Any statement, S, executed after A.get() can be assured that the task associated
with future object A must have completed before S starts execution.
• These operations are carefully defined to avoid the possibility of a race condition on a
task’s return value, which is why futures are well suited for functional parallelism.

MODAI 36 / 70
In Java’s Fork/Join Framework

- Video: c1w2/2.2 Futures in Java’s Fork-Join Framework


- Lecture Summary:
• In this lecture, we learned how to express future tasks in Java’s Fork/Join (FJ)
framework. Some key differences between future tasks and regular tasks in the FJ
framework are as follows:
1. A future task extends the RecursiveTask class in the FJ framework, instead of
RecursiveAction as in regular tasks.
2. The compute() method of a future task must have a non-void return type, whereas it has
a void return type for regular tasks.
3. A method call like left.join() waits for the task referred to by object left in both cases, but
also provides the task’s return value in the case of future tasks.
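• A hedged sketch of the array-sum example rewritten as a future task, i.e., a RecursiveTask
whose compute() returns the partial sum (the cutoff and names are illustrative):

  import java.util.concurrent.ForkJoinPool;
  import java.util.concurrent.RecursiveTask;

  class ASumFuture extends RecursiveTask<Long> {
      final int[] A; final int LO, HI;
      ASumFuture(int[] A, int LO, int HI) { this.A = A; this.LO = LO; this.HI = HI; }

      @Override
      protected Long compute() {
          if (HI - LO < 1_000) {                 // illustrative sequential cutoff
              long sum = 0;
              for (int i = LO; i <= HI; i++) sum += A[i];
              return sum;
          }
          int mid = (LO + HI) / 2;
          ASumFuture left  = new ASumFuture(A, LO, mid);
          ASumFuture right = new ASumFuture(A, mid + 1, HI);
          left.fork();                           // async: run the left half in parallel
          long rightSum = right.compute();       // compute the right half in this task
          return left.join() + rightSum;         // join() returns the left half's value
      }
  }

  // Usage: long total = ForkJoinPool.commonPool().invoke(new ASumFuture(A, 0, A.length - 1));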

MODAI 37 / 70
Streams

- Video: c1w2/2.4 Java Streams


- Lecture Summary:
• In this lecture we learned about Java streams, and how they provide a functional approach
to operating on collections of data.
• For example, the statement, "students.stream().forEach(s → System.out.println(s))", is a
succinct way of specifying an action to be performed on each element s in the collection,
students. An aggregate data query or data transformation can be specified by building a
stream pipeline consisting of a source (typically by invoking the .stream() method on a
data collection, a sequence of intermediate operations such as map() and filter(), and an
optional terminal operation such as forEach() or average(). As an example, the following
pipeline can be used to compute the average age of all active students using Java streams:
students.stream()
    .filter(s -> s.getStatus() == Student.ACTIVE)
    .mapToInt(a -> a.getAge())
    .average();

MODAI 39 / 70
Streams

- Lecture Summary (cont.):


• From the viewpoint of this course, an important benefit of using Java streams when
possible is that the pipeline can be made to execute in parallel by designating the source
to be a parallel stream, i.e., by simply replacing students.stream() in the above code by
students.parallelStream() (or Stream.of(students).parallel() when students is an array).
students.parallelStream()
    .filter(s -> s.getStatus() == Student.ACTIVE)
    .mapToInt(a -> a.getAge())
    .average();

• This form of functional parallelism is a major convenience for the programmer, since they
do not need to worry about explicitly allocating intermediate collections (e.g., a collection
of all active students), or about ensuring that parallel accesses to data collections are
properly synchronized.

MODAI 40 / 70
Parallel Loops
- Video: c1w3/3.1 Parallel Loops
- Lecture Summary:
• In this lecture, we learned different ways of expressing parallel loops.
• The most general way is to think of each iteration of a parallel loop as an async task,
with a finish construct encompassing all iterations. This approach can support general
cases such as parallelization of the following pointer-chasing while loop (in pseudocode):
finish {
  for (p = head; p != null; p = p.next) {
    async compute(p);
  }
}

• However, further efficiencies can be gained by paying attention to counted-for loops for
which the number of iterations is known on entry to the loop (before the loop executes its
first iteration). We then learned the forall notation for expressing parallel counted-for
loops, such as in the following vector addition statement (in pseudocode):
forall (i : [0:n-1]) {
  a[i] = b[i] + c[i];
}

MODAI 42 / 70
Parallel Loops

- Lecture Summary (cont.):


• We also discussed the fact that Java streams can be an elegant way of specifying parallel
loop computations that produce a single output array, e.g., by rewriting the vector
addition statement as follows:
a = IntStream.rangeClosed(0, N - 1).parallel().map(i -> b[i] + c[i]).toArray();

• In summary, streams are a convenient notation for parallel loops with at most one output
array, but the forall notation is more convenient for loops that create/update multiple
output arrays, as is the case in many scientific computations. For generality, we will use
the forall notation for parallel loops in the remainder of this module.

MODAI 43 / 70
Parallel Matrix Multiplication

- Video: c1w3/3.2 Parallel Matrix Multiplication


- Lecture Summary:
• In this lecture, we reminded ourselves of the formula for multiplying two n × n matrices, a
and b, to obtain a product matrix, c, of size n × n: c[i][j] = Σ_{k=0}^{n−1} a[i][k] × b[k][j]
• This formula can be easily translated to a simple sequential algorithm for matrix
multiplication as follows (with pseudocode for counted-for loops):
for (i : [0:n-1]) {
  for (j : [0:n-1]) {
    c[i][j] = 0;
    for (k : [0:n-1]) {
      c[i][j] = c[i][j] + (a[i][k] * b[k][j]);
    }
  }
}

MODAI 44 / 70
Parallel Matrix Multiplication

- Lecture Summary (cont.):


• The interesting question now is: which of the for-i, for-j and for-k loops can be converted
to forall loops, i.e., can be executed in parallel? Upon a close inspection, we can see that
it is safe to convert for-i and for-j into forall loops, but for-k must remain a sequential
loop to avoid data races:
forall (i : [0:n-1]) {
  forall (j : [0:n-1]) {
    c[i][j] = 0;
    for (k : [0:n-1]) {
      c[i][j] = c[i][j] + (a[i][k] * b[k][j]);
    }
  }
}

MODAI 45 / 70
Unicast Communication

- Video: c3w2/2.1 Introduction to Sockets


- Lecture Summary:
• In this lecture, we learned about client-server programming, and how two distributed Java
applications can communicate with each other using sockets. Since each application in
this scenario runs on a distinct Java Virtual Machine (JVM) process, we used the terms
"application", "JVM" and "process" interchangeably in the lecture.
• For JVM A and JVM B to communicate with each other, we assumed that JVM A plays
the "client" role and JVM B the "server" role:
• To establish the connection, the main thread in JVM B first creates a ServerSocket (called
socket, say), which is initialized with a designated port number (and, optionally, a local address).
• It then waits for client processes to connect to this socket by invoking the socket.accept()
method, which returns an object of type Socket (called s, say).
• The s.getInputStream() and s.getOutputStream() methods can be invoked on this
object to perform read and write operations via the socket, using the same APIs that you
use for file I/O via streams.

MODAI 48 / 70
Unicast Communication

- Lecture Summary (cont.):


• Once JVM B has set up a server socket, JVM A can connect to it as a client by creating
a Socket object with the appropriate parameters to identify JVM B’s server port. As in
the server case, the getInputStream() and getOutputStream() methods can be
invoked on this object to perform read and write operations.
• With this setup, JVM A and JVM B can communicate with each other by using read and
write operations, which get implemented as messages that flow across the network.
Client-server communication occurs at a lower level and scale than MapReduce, which
implicitly accomplishes communication among large numbers of processes. Hence,
client-server programming is typically used for building distributed applications with small
numbers of processes.
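• A minimal sketch of this unicast scenario; the host name "localhost" and port 8080 are
illustrative:

  import java.io.*;
  import java.net.ServerSocket;
  import java.net.Socket;

  class ServerJvmB { // the "server" role
      public static void main(String[] args) throws IOException {
          try (ServerSocket socket = new ServerSocket(8080);
               Socket s = socket.accept(); // wait for a client to connect
               BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream()));
               PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
              out.println("echo: " + in.readLine()); // read from the client, then write back
          }
      }
  }

  class ClientJvmA { // the "client" role
      public static void main(String[] args) throws IOException {
          try (Socket s = new Socket("localhost", 8080); // connect to JVM B's server port
               PrintWriter out = new PrintWriter(s.getOutputStream(), true);
               BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream()))) {
              out.println("hello");
              System.out.println(in.readLine()); // prints "echo: hello"
          }
      }
  }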

MODAI 49 / 70
Multithreaded Servers

- Video: c3w4/4.2 Multithreaded Servers


- Lecture Summary:
• In this lecture, we learned about multithreaded servers as an extension to the servers that
we studied in client-server programming.
• As a motivating example, we studied the timeline for a single request sent to a standard
sequential file server, which typically consists of four steps:
• A) accept the request,
• B) extract the necessary information from the request,
• C) read the file,
• D) send the file.
• In practice, step C) is usually the most time-consuming step in this sequence. However,
threads can be used to reduce this bottleneck. In particular, after step A) is performed, a
new thread can be created to process steps B), C) and D) for a given request. In this way,
it is possible to process multiple requests simultaneously because they are executing in
different threads.

MODAI 50 / 70
Multithreaded Servers

- Lecture Summary (cont.):


• One challenge of following this approach literally is that there is a significant overhead in
creating and starting a Java thread. However, since there is usually an upper bound on
the number of threads that can be efficiently utilized within a node (often limited by the
number of cores or hardware contexts), it is wasteful to create more threads than that
number. There are two approaches that are commonly taken to address this challenge in
Java applications:
• One is to use a thread pool, so that threads can be reused across multiple requests instead
of creating a new thread for each request.
• Another is to use lightweight tasking (e.g., as in Java’s ForkJoin framework), which executes
on a thread pool with a bounded number of threads and offers the advantage that the
overhead of task creation is significantly smaller than that of thread creation.
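• A hedged sketch of the thread-pool approach (the port number and pool size are illustrative):

  import java.io.IOException;
  import java.net.ServerSocket;
  import java.net.Socket;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;

  class MultithreadedFileServer {
      public static void main(String[] args) throws IOException {
          ExecutorService pool =
                  Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
          try (ServerSocket server = new ServerSocket(8080)) {
              while (true) {
                  Socket request = server.accept();   // step A) in the main thread
                  pool.submit(() -> handle(request)); // steps B), C), D) in a pooled thread
              }
          }
      }

      static void handle(Socket request) {
          // B) extract request info, C) read the file, D) send the file (omitted here)
          try { request.close(); } catch (IOException ignored) { }
      }
  }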

MODAI 51 / 70
Broadcast/Multicast Communication

- Video: c3w2/2.4 Multicast Sockets


- Lecture Summary:
• In this lecture, we learned about multicast sockets, which are a generalization of the
standard socket interface that we studied earlier:
• Standard sockets can be viewed as unicast communications, in which a message is sent
from a source to a single destination.
• Broadcast communications represent a simple extension to unicast, in which a message can
be sent efficiently to all nodes in the same local area network as the sender.
• In contrast, multicast sockets enable a sender to efficiently send the same message to a
specified set of receivers on the Internet. This capability can be very useful for a number of
applications, which include news feeds, video conferencing, and multi-player games. One
reason why a 1:n multicast socket is more efficient than 1:1 sockets is because Internet
routers have built-in support for the multicast capability.
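• A minimal sketch of a multicast receiver in Java; the group address 230.0.0.1 and port 4446
are illustrative, and a sender would transmit a DatagramPacket addressed to the same group and
port:

  import java.net.DatagramPacket;
  import java.net.InetAddress;
  import java.net.MulticastSocket;

  class MulticastReceiver {
      public static void main(String[] args) throws Exception {
          InetAddress group = InetAddress.getByName("230.0.0.1"); // multicast group address
          try (MulticastSocket socket = new MulticastSocket(4446)) {
              socket.joinGroup(group);                  // start receiving this group's traffic
              byte[] buf = new byte[1024];
              DatagramPacket packet = new DatagramPacket(buf, buf.length);
              socket.receive(packet);                   // one message sent to the group
              System.out.println(new String(packet.getData(), 0, packet.getLength()));
              socket.leaveGroup(group);
          }
      }
  }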

MODAI 53 / 70
Remote Method Invocation
- Video: c3w2/2.3 Remote Method Invocation
- Lecture Summary:
• This lecture reviewed the concept of Remote Method Invocation (RMI), which extends the
notion of method invocation in a sequential program to a distributed programming setting.
• As an example, let us consider a scenario in which a thread running on JVM A wants to
invoke a method, foo(), on object x located on JVM B. This can be accomplished using
sockets and messages, but that approach would entail writing a lot of extra code for
encoding and decoding the method call, its arguments, and its return value. In contrast,
Java RMI provides a very convenient way to directly address this use case.
• To enable RMI:
• We run an RMI client on JVM A and an RMI server on JVM B.
• Further, JVM A is set up to contain a stub object or proxy object for remote object x
located on JVM B.
• When a stub method is invoked, it transparently initiates a connection with the remote
JVM containing the remote object, x, serializes and communicates the method parameters
to the remote JVM, receives the result of the method invocation, and deserializes the result
into object y (say) which is then passed on to the caller of method x.foo() as the result of
the RMI call.

MODAI 55 / 70
Remote Method Invocation

- Lecture Summary (cont.):


• Thus, RMI takes care of a number of tedious details related to remote communication.
However, this convenience comes with a few setup requirements as well:
• First, objects x and y must be serializable, because their values need to be communicated
between JVMs A and B.
• Second, object x must be included in the RMI registry, so that it can be accessed through a
global name rather than a local object reference. The registry in turn assists in mapping
from global names to references to local stub objects.
• In summary, a key advantage of RMI is that, once this setup is in place, method invocations
across distributed processes can be implemented almost as simply as standard method
invocations.
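• A hedged sketch of this setup with Java RMI (the interface, names, and registry port are
illustrative):

  import java.rmi.Remote;
  import java.rmi.RemoteException;
  import java.rmi.registry.LocateRegistry;
  import java.rmi.registry.Registry;
  import java.rmi.server.UnicastRemoteObject;

  interface Foo extends Remote {
      String foo() throws RemoteException; // remote methods must declare RemoteException
  }

  class FooImpl implements Foo {
      public String foo() { return "result from JVM B"; }
  }

  class RmiServerJvmB {
      public static void main(String[] args) throws Exception {
          Foo x = new FooImpl();
          Foo stub = (Foo) UnicastRemoteObject.exportObject(x, 0); // create the stub for x
          Registry registry = LocateRegistry.createRegistry(1099); // RMI registry on JVM B
          registry.rebind("x", stub);                              // global name -> stub
      }
  }

  class RmiClientJvmA {
      public static void main(String[] args) throws Exception {
          Registry registry = LocateRegistry.getRegistry("localhost", 1099);
          Foo x = (Foo) registry.lookup("x"); // obtain the proxy/stub object for x
          System.out.println(x.foo());        // looks like an ordinary method invocation
      }
  }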

MODAI 56 / 70
SPMD model
- Video: c3w3/3.1 Single Program Multiple Data (SPMD) model
- Lecture Summary:
• In this lecture, we studied the Single Program Multiple Data (SPMD) model, which can
enable the use of a cluster of distributed nodes as a single parallel computer. Each node
in such a cluster typically consists of a multicore processor, a local memory, and a network
interface card (NIC) that enables it to communicate with other nodes in the cluster.
• One of the biggest challenges that arises when trying to use the distributed nodes as a
single parallel computer is that of data distribution:
• In general, we would want to allocate large data structures that span multiple nodes in the
cluster; this logical view of data structures is often referred to as a global view.
• However, a typical physical implementation of this global view on a cluster is obtained by
distributing pieces of the global data structure across different nodes, so that each node has
a local view of the piece of the data structure allocated in its local memory.
• In many cases in practice, the programmer has to undertake the conceptual burden of
mapping back and forth between the logical global view and the physical local views.
• In this module, we will focus on a commonly used implementation of the SPMD model,
that is referred to as the Message Passing Interface (MPI).

MODAI 59 / 70
SPMD model
- Lecture Summary (cont.):
• When using MPI, you designate a fixed set of processes that will participate for the entire
lifetime of the global application:
• It is common for each node to execute one MPI process, but it is also possible to execute
more than one MPI process per multicore node so as to improve the utilization of processor
cores within the node.
• Each process starts executing its own copy of the MPI program, and starts by calling the
mpi.MPI_Init() method, where mpi is the instance of the MPI class used by the process.
• After that, each process can call the
mpi.MPI_Comm_size(mpi.MPI_COMM_WORLD) method to determine the total
number of processes participating in the MPI application, and the
mpi.MPI_Comm_rank(mpi.MPI_COMM_WORLD) method to determine the process’s own
rank within the range, [0:(S-1)], where S=MPI_Comm_size().
• In this lecture, we studied how:
• A global view XG, of array X can be implemented by S local arrays (one per process) of
size, XL.length=XG.length/S. For simplicity, assume that XG.length is a multiple of S.
• Then, if we logically want to set XG[i]:=i for all logical elements of XG, we can instead set
XL[i]:=L*R+i in each local array, where L=XL.length, and R=MPI_Comm_rank().
• Thus process 0’s copy of XL will contain logical elements XG[0...L-1], process 1’s copy of
XL will contain logical elements XG[L...2*L-1], and so on.
• Thus, we see that the SPMD approach is very different from client-server programming,
where each process can be executing a different program.
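• A hedged sketch of this local-view initialization, written against Open MPI's Java bindings
(the lecture's mpi wrapper uses slightly different method names, so treat the API below as an
assumption):

  import mpi.MPI;

  class SpmdLocalView {
      public static void main(String[] args) throws Exception {
          MPI.Init(args);
          int S = MPI.COMM_WORLD.getSize();  // number of MPI processes
          int R = MPI.COMM_WORLD.getRank();  // this process's rank in [0:(S-1)]

          int globalLength = 1024;           // XG.length, assumed to be a multiple of S
          int L = globalLength / S;          // XL.length
          int[] XL = new int[L];
          for (int i = 0; i < L; i++) XL[i] = L * R + i; // logical XG[L*R+i] := L*R+i

          MPI.Finalize();
      }
  }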
MODAI 60 / 70
Point-to-Point Communication
- Video: c3w3/3.2 Point-to-Point Communication
- Lecture Summary:
• In this lecture, we studied how to perform point-to-point communication in MPI by
sending and receiving messages.
• In particular, we worked out the details for a simple scenario in which process 0 sends a
string, "ABCD", to process 1:
• Since MPI programs follow the SPMD model, we have to ensure that the same program
behaves differently on processes 0 and 1. This was achieved by using an if-then-else
statement that checks the value of the rank of the process that it is executing on. If the
rank is zero, we include the necessary code for calling MPI_Send(); otherwise, we include
the necessary code for calling MPI_Recv() (assuming that this simple program is only
executed with two processes).
• Both calls include a number of parameters. The MPI_Send() call specifies the substring to
be sent as a subarray by providing the string, offset, and data type, as well as the rank of
the receiver, and a tag to assist with matching send and receive calls (we used a tag value
of 99 in the lecture). The MPI_Recv() call (in the else part of the if-then-else statement)
includes a buffer in which to receive the message, along with the offset and data type, as
well as the rank of the sender and the tag.
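• A hedged sketch of this two-process exchange, again using Open MPI's Java bindings as an
assumed API (the lecture's wrapper method names differ slightly):

  import mpi.MPI;

  class SendRecvExample {
      public static void main(String[] args) throws Exception {
          MPI.Init(args);
          int rank = MPI.COMM_WORLD.getRank();
          int tag = 99;

          if (rank == 0) {
              char[] msg = "ABCD".toCharArray();
              MPI.COMM_WORLD.send(msg, msg.length, MPI.CHAR, 1, tag); // to rank 1
          } else if (rank == 1) {
              char[] buf = new char[4];
              MPI.COMM_WORLD.recv(buf, buf.length, MPI.CHAR, 0, tag); // from rank 0
              System.out.println(new String(buf));                    // prints "ABCD"
          }
          MPI.Finalize();
      }
  }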

MODAI 62 / 70
Point-to-Point Communication

- Lecture Summary (cont.):


• Each send/receive operation waits (or is blocked) until its dual operation is performed by
the other process. Once a pair of parallel and compatible MPI_Send() and
MPI_Recv() calls is matched, the actual communication is performed by the MPI
library. This approach to matching pairs of send/receive calls in SPMD programs is
referred to as two-sided communication.
• As indicated in the lecture, the current implementation of MPI only supports
communication of (sub)arrays of primitive data types. However, since we have already
learned how to serialize and deserialize objects into/from bytes, the same approach can be
used in MPI programs by communicating arrays of bytes.

MODAI 63 / 70
Collective Communication

- Video: c3w3/3.5 Collective Communication


- Lecture Summary:
• In this lecture, we studied collective communication, which can involve multiple processes,
in a manner that is both similar to, and more powerful than, multicast and
publish-subscribe operations:
• We first considered a simple broadcast operation in which rank R0 needs to send message
X to all other MPI processes in the application.
• One way to accomplish this is for R0 to send individual (point-to-point) messages to
processes R1, R2, ... one by one, but this approach will make R0 a sequential bottleneck
when there are (say) thousands of processes in the MPI application.
• Further, the interconnection network for the compute nodes is capable of implementing
such broadcast operations more efficiently than point-to-point messages.

MODAI 65 / 70
Collective Communication
- Lecture Summary (cont.):
• Collective communications help exploit this efficiency by leveraging the fact that all MPI
processes execute the same program in an SPMD model. For a broadcast operation:
• All MPI processes execute an MPI_Bcast() API call with a specified root process that is
the source of the data to be broadcasted.
• A key property of collective operations is that each process must wait until all processes
reach the same collective operation, before the operation can be performed. This form of
waiting is referred to as a barrier.
• After the operation is completed, all processes can move past the implicit barrier in the
collective call. In the case of MPI_Bcast(), each process will have obtained a copy of the
value broadcasted by the root process.
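• A hedged sketch of a broadcast, assuming Open MPI's Java bindings (the method names are an
assumption; the course wrapper may differ):

  import mpi.MPI;

  class BcastExample {
      public static void main(String[] args) throws Exception {
          MPI.Init(args);
          int root = 0;
          int[] X = new int[1];
          if (MPI.COMM_WORLD.getRank() == root) X[0] = 42; // only the root has the value
          MPI.COMM_WORLD.bcast(X, 1, MPI.INT, root);       // implicit barrier; all ranks now hold 42
          MPI.Finalize();
      }
  }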

ref: https://nyu-cds.github.io/python-mpi/05-collectives


MODAI 66 / 70
Collective Communication
- Lecture Summary (cont.):
• MPI supports a wide range of collective operations, which also includes reductions.
• The reduction example discussed in the lecture was one in which process R2 needs to
receive the sum of the values of element Y[0] in all the processes, and store it in element
Z[0] in process R2. To perform this computation:
• All processes will need to execute a Reduce() operation (with an implicit barrier).
• The parameters to this call include the input array (Y), a zero offset for the input value,
the output array (Z, which only needs to be allocated in process R2), a zero offset for the
output value, the number of array elements (1) involved in the reduction from each process,
the data type for the elements (MPI.INT), the operator for the reduction (MPI_SUM),
and the root process (R2) which receives the reduced value.
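• A hedged sketch of this reduction, assuming Open MPI's Java bindings (the exact parameter
list in the lecture's wrapper may differ):

  import mpi.MPI;

  class ReduceExample {
      public static void main(String[] args) throws Exception {
          MPI.Init(args);
          int root = 2;                             // R2 receives the reduced value
          int[] Y = { MPI.COMM_WORLD.getRank() };   // each rank contributes Y[0]
          int[] Z = new int[1];                     // only meaningful on the root
          MPI.COMM_WORLD.reduce(Y, Z, 1, MPI.INT, MPI.SUM, root);
          if (MPI.COMM_WORLD.getRank() == root)
              System.out.println("sum of Y[0] over all ranks = " + Z[0]);
          MPI.Finalize();
      }
  }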

ref: https://nyu-cds.github.io/python-mpi/05-collectives
MODAI 67 / 70
MPI and Threading

- Video: c3w4/4.3 MPI and Threading


- Lecture Summary:
• In this lecture, we learned how to extend MPI programs with threads.
• As we learned earlier in the lecture on Processes and Threads, it can be inefficient to
create one process per processor core in a multicore node since there is a lot of
unnecessary duplication of memory, resources, and overheads when doing so.
• This same issue arises for MPI programs in which each rank corresponds to a
single-threaded process by default.
• Thus, there are many motivations for creating multiple threads in an MPI process,
including the fact that threads can communicate with each other much more efficiently
using shared memory, compared with the message-passing that is used to communicate
among processes.

MODAI 69 / 70
MPI and Threading
- Lecture Summary (cont.):
• One approach to enable multithreading in MPI applications is to create one MPI process
(rank) per node, which starts execution in a single thread that is referred to as a master
thread. This thread calls MPI_Init() and MPI_Finalize() for its rank, and creates a
number of worker threads to assist in the computation to be performed within its MPI
process. Further, all MPI calls are performed only by the master thread. This approach is
referred to as the MPI_THREAD_FUNNELED mode, since, even though there are
multiple threads, all MPI calls are "funneled" through the master thread.
• A second more general mode for MPI and multithreading is referred to as
MPI_THREAD_SERIALIZED; in this mode, multiple threads may make MPI calls but
must do so one at a time using appropriate concurrency constructs so that the calls are
"serialized".
• The most general mode is called MPI_THREAD_MULTIPLE because it allows
multiple threads to make MPI calls in parallel; though this mode offers more flexibility
than the other modes, it puts an additional burden on the MPI implementation which
usually gets reflected in larger overheads for MPI calls relative to the more restrictive
modes. Further, even the MPI_THREAD_MULTIPLE mode has some notable
restrictions, e.g., it is not permitted in this mode for two threads in the same process to
wait on the same MPI request related to a nonblocking communication.
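• A hedged sketch of requesting the funneled mode, assuming Open MPI's Java bindings (the
constant and method names are an assumption):

  import mpi.MPI;

  class FunneledExample {
      public static void main(String[] args) throws Exception {
          int provided = MPI.InitThread(args, MPI.THREAD_FUNNELED); // ask for THREAD_FUNNELED
          if (provided < MPI.THREAD_FUNNELED)
              throw new IllegalStateException("MPI library does not provide THREAD_FUNNELED");

          // ... the master thread makes all MPI calls; worker threads only compute ...

          MPI.Finalize();
      }
  }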
MODAI 70 / 70
Thanks for your attention!

MODAI 71 / 70
References

Anuradhac and Arvindpdmn. “Concurrency vs Parallelism”. In: Devopedia. 2021. url: https://devopedia.org/concurrency-vs-parallelism.
Barney, Blaise. “Introduction to Parallel Computing Tutorial”. In: Livermore Computing. HPC@LLNL, 2023. url: https://hpc.llnl.gov/documentation/tutorials/introduction-parallel-computing-tutorial.
Koirala, Shivprasad. “Concurrency vs Parallelism”. In: Code Project. 2018. url: https://www.codeproject.com/Articles/1267757/Concurrency-vs-Parallelism.
Pattamsetti, Raja Malleswara Rao. Distributed Computing in Java 9. Packt Publishing, 2017.
Saini, Prerna, Ankit Bansal, and Abhishek Sharma. “Time Critical Multitasking for Multicore Microcontroller using XMOS Kit”. In: International Journal of Embedded Systems and Applications 5.1 (2015), pp. 1–18. doi: 10.5121/ijesa.2015.5101.
Sarkar, Vivek. “Parallel, Concurrent, and Distributed Programming in Java”. In: Rice University. 2017.

MODAI 72 / 70
