Lecture 4: Parallel Programming Models
Yasir Noman Khalid
Overview
• There are several parallel programming models in common use:
• Shared Memory (without threads)
• Threads
• Distributed Memory / Message Passing
• Data Parallel
• Hybrid
• Single Program Multiple Data (SPMD)
• Multiple Program Multiple Data (MPMD)
• Parallel programming models exist as an abstraction above hardware and memory
architectures.
• Although it might not seem apparent, these models are NOT specific to a particular type of
machine or memory architecture. In fact, any of these models can (theoretically) be
implemented on any underlying hardware. Examples are discussed in the next 2 slides.
SHARED memory model on a DISTRIBUTED
memory machine
• Machine memory is physically distributed across networked machines but appears to the user as a single shared-memory global address space. Generically, this approach is referred to as "virtual shared memory".
DISTRIBUTED memory model on a SHARED
memory machine
• Message Passing Interface (MPI) on the SGI Origin 2000. The SGI Origin 2000 employed the CC-NUMA type of shared memory architecture, where every task has direct access to the global address space spread across all machines. However, the ability to send and receive messages using MPI, as is commonly done over a network of distributed memory machines, was implemented and commonly used.
Which model to use?
• Often a combination of what is available and personal choice.
• There is no "best" model, although there certainly are better implementations of some
models over others.
Shared Memory Model (without threads)
• In this programming model, processes/tasks share a common address space, which they read
and write to asynchronously.
• Various mechanisms such as locks / semaphores are used to
• Control access to the shared memory,
• Resolve contentions and
• Prevent race conditions and deadlocks.
• Perhaps the simplest parallel programming model.
• An advantage of this model from the programmer's point of view is that the notion of data
"ownership" is lacking
• There is no need to specify explicitly the communication of data between tasks.
• All processes see and have equal access to shared memory.
• Program development can often be simplified.
• An important disadvantage in terms of performance is that it becomes more difficult to
understand and manage data locality:
• Keeping data local to the process that works on it conserves memory accesses, cache refreshes and bus
traffic that occurs when multiple processes use the same data.
• Unfortunately, controlling data locality is hard to understand and may be beyond the control of the average
user.
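• A minimal sketch of this model (illustrative, not from the original slides): two processes share a counter through POSIX shared memory and coordinate with a process-shared semaphore. The segment name /demo_shm, the counter layout and the loop counts are assumptions made for the example.

/* Minimal sketch: two processes share a counter through POSIX shared memory.
 * Compile with: cc shm_demo.c -o shm_demo  (may need -lrt and/or -pthread on some systems)
 */
#include <fcntl.h>
#include <semaphore.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

typedef struct {
    sem_t lock;     /* controls access to the shared counter */
    long  counter;  /* data both processes read and write asynchronously */
} shared_t;

int main(void) {
    /* Create a shared memory object and map it into this process. */
    int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, sizeof(shared_t));
    shared_t *shm = mmap(NULL, sizeof(shared_t),
                         PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    sem_init(&shm->lock, 1, 1);   /* second argument = 1: shared between processes */
    shm->counter = 0;

    if (fork() == 0) {            /* child: same mapping, same physical memory */
        for (int i = 0; i < 100000; i++) {
            sem_wait(&shm->lock);
            shm->counter++;       /* critical section: prevents a race condition */
            sem_post(&shm->lock);
        }
        return 0;
    }
    for (int i = 0; i < 100000; i++) {
        sem_wait(&shm->lock);
        shm->counter++;
        sem_post(&shm->lock);
    }
    wait(NULL);
    printf("counter = %ld\n", shm->counter);  /* 200000 with the lock in place */
    shm_unlink("/demo_shm");
    return 0;
}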
Threads Model
• This programming model is a type of shared memory programming.
• In the threads model of parallel programming, a single "heavy weight" process can have multiple "light weight",
concurrent execution paths.
• For example:
• The main program a.out is scheduled to run by the native operating system. a.out loads and acquires all the
necessary system and user resources to run. This is the "heavy weight" process.
• a.out performs some serial work, and then creates several tasks (threads) that can be scheduled and run by
the operating system concurrently.
• Each thread has local data, but also shares the entire resources of a.out. This saves the overhead associated
with replicating a program's resources for each thread ("light weight"). Each thread also benefits from a
global memory view because it shares the memory space of a.out.
• A thread's work may best be described as a subroutine within the main program. Any thread can execute any
subroutine at the same time as other threads.
• Threads communicate with each other through global memory (updating address locations). This requires
synchronization constructs to ensure that more than one thread is not updating the same global address at
any time.
• Threads can come and go, but a.out remains present to provide the necessary shared resources until the application has completed.
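• A minimal Pthreads sketch of the description above (illustrative, not from the original slides): the main program plays the role of a.out, creates several threads that share one global variable, and uses a mutex so that no two threads update the same address at the same time.

/* Sketch of the threads model: one process, several concurrent execution paths
 * sharing the process's global memory. Compile with: cc threads_demo.c -pthread
 */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

long shared_sum = 0;                               /* global memory shared by all threads */
pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg) {
    long id = (long)arg;                           /* thread-local data */
    long local = 0;
    for (int i = 0; i < 1000; i++)                 /* independent work on private data */
        local += id + i;

    pthread_mutex_lock(&sum_lock);                 /* synchronize the shared update */
    shared_sum += local;
    pthread_mutex_unlock(&sum_lock);
    return NULL;
}

int main(void) {                                   /* the "heavy weight" process */
    pthread_t tid[NTHREADS];

    for (long t = 0; t < NTHREADS; t++)            /* create "light weight" threads */
        pthread_create(&tid[t], NULL, worker, (void *)t);

    for (int t = 0; t < NTHREADS; t++)             /* threads come and go; main remains */
        pthread_join(tid[t], NULL);

    printf("shared_sum = %ld\n", shared_sum);
    return 0;
}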
Implementations: Threads Model
• From a programming perspective, threads implementations commonly comprise:
• A library of subroutines that are called from within parallel source code
• A set of compiler directives embedded in either serial or parallel source code
• In both cases, the programmer is responsible for determining the parallelism (although
compilers can sometimes help).
• Threaded implementations are not new in computing. Historically, hardware vendors have
implemented their own proprietary versions of threads. These implementations differed
substantially from each other making it difficult for programmers to develop portable threaded
applications.
• Unrelated standardization efforts have resulted in two very different implementations of
threads: POSIX Threads and OpenMP.
• POSIX Threads
• Specified by the IEEE POSIX 1003.1c standard (1995). C Language only.
• Part of Unix/Linux operating systems
• Library based
• Commonly referred to as Pthreads.
• Very explicit parallelism; requires significant programmer attention to detail.
• OpenMP
• Industry standard, jointly defined and endorsed by a group of major computer hardware and software
vendors, organizations and individuals.
• Compiler directive based
• Portable / multi-platform, including Unix and Windows platforms
• Available in C/C++ and Fortran implementations
• Can be very easy and simple to use - provides for "incremental parallelism". Can begin with serial code.
• Other threaded implementations are common, but not discussed here:
• Microsoft threads
• Java, Python threads
• CUDA threads for GPUs
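• For contrast with Pthreads, a short OpenMP sketch (illustrative, not from the original slides) shows the "incremental parallelism" idea: a serial loop becomes threaded by adding a single compiler directive.

/* OpenMP sketch: one directive parallelizes an otherwise serial loop.
 * Compile with an OpenMP-capable compiler, e.g.: cc omp_demo.c -fopenmp
 */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N], b[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++)            /* serial initialization */
        a[i] = (double)i;

    /* The directive turns the serial loop into a threaded one;
     * the reduction clause handles the shared update safely. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        b[i] = 2.0 * a[i];
        sum += b[i];
    }

    printf("sum = %f (threads available: %d)\n", sum, omp_get_max_threads());
    return 0;
}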
Distributed Memory/Message Passing Model
• This model demonstrates the following characteristics:
• A set of tasks that use their own local memory during computation. Multiple tasks can reside on the same
physical machine and/or across an arbitrary number of machines.
• Tasks exchange data through communications by sending and receiving messages.
• Data transfer usually requires cooperative operations to be performed by each process. For example, a send
operation must have a matching receive operation.
Implementations: Message Passing Model
• From a programming perspective, message passing implementations usually comprise a library of subroutines. Calls to these subroutines are embedded in source code. The programmer is responsible for determining all parallelism.
• Historically, a variety of message passing libraries have been available since the 1980s. These implementations differed substantially from each other, making it difficult for programmers to develop portable applications.
• In 1992, the MPI Forum was formed with the primary goal of establishing a standard interface for message passing implementations.
• MPI is the "de facto" industry standard for message passing, replacing virtually all other message passing implementations used for production work. MPI implementations exist for virtually all popular parallel computing platforms. Not all implementations include everything in MPI-1, MPI-2 or MPI-3.
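• A minimal MPI sketch of cooperative data transfer (illustrative, not from the original slides): a send on task 0 is matched by a receive on task 1.

/* MPI sketch: a send must be matched by a receive.
 * Typical build/run: mpicc mpi_demo.c -o mpi_demo && mpirun -np 2 ./mpi_demo
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* each task learns its identity */

    if (rank == 0) {
        value = 42;                          /* data lives in rank 0's local memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}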
Data Parallel Model
• May also be referred to as the Partitioned Global Address Space (PGAS) model.
• The data parallel model demonstrates the following characteristics:
• Address space is treated globally
• Most of the parallel work focuses on performing operations on a data set. The data set is typically organized
into a common structure, such as an array.
• A set of tasks works collectively on the same data structure; however, each task works on a different partition of that data structure.
• Tasks perform the same operation on their partition of work, for example, "add 4 to every array element".
• On shared memory architectures, all tasks may have access to the data structure through
global memory.
• On distributed memory architectures, the global data structure can be split up logically and/or
physically across tasks.
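• To make the partitioning concrete, the sketch below (illustrative, not from the original slides) block-partitions a global array across MPI tasks; each task applies the same operation, "add 4 to every array element", to its own partition. The array size and the even block split are assumptions made for the example.

/* Data parallel sketch: same operation, different partition of one global data set.
 * Typical build/run: mpicc dp_demo.c -o dp_demo && mpirun -np 4 ./dp_demo
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 16                                /* global array size (assumed to divide evenly) */

int main(int argc, char **argv) {
    int rank, ntasks;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    int chunk = N / ntasks;                 /* size of each task's partition */
    double *global = NULL;
    double *local = malloc(chunk * sizeof(double));

    if (rank == 0) {                        /* rank 0 holds the logical global array */
        global = malloc(N * sizeof(double));
        for (int i = 0; i < N; i++) global[i] = (double)i;
    }

    /* Split the global structure across tasks... */
    MPI_Scatter(global, chunk, MPI_DOUBLE, local, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    for (int i = 0; i < chunk; i++)         /* every task: "add 4 to every element" */
        local[i] += 4.0;

    /* ...and collect the partitions back into the global array. */
    MPI_Gather(local, chunk, MPI_DOUBLE, global, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int i = 0; i < N; i++) printf("%g ", global[i]);
        printf("\n");
        free(global);
    }
    free(local);
    MPI_Finalize();
    return 0;
}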
Implementations: Data Parallel Model
• Currently, there are several relatively popular, and sometimes developmental, parallel programming
implementations based on the Data Parallel / PGAS model.
• Coarray Fortran: a small set of extensions to Fortran 95 for SPMD parallel programming. Compiler
dependent.
• Unified Parallel C (UPC): an extension to the C programming language for SPMD parallel programming.
Compiler dependent.
• Global Arrays: provides a shared memory style programming environment in the context of distributed array
data structures. Public domain library with C and Fortran77 bindings.
• X10: a PGAS based parallel programming language being developed by IBM at the Thomas J. Watson
Research Center.
• Chapel: an open-source parallel programming language project being led by Cray.
Hybrid Model
• A hybrid model combines more than one of the previously described programming models.
• Currently, a common example of a hybrid model is the combination of the message passing
model (MPI) with the threads model (OpenMP).
• Threads perform computationally intensive kernels using local, on-node data
• Communication between processes on different nodes occurs over the network using MPI
• This hybrid model lends itself well to the most popular hardware environment of clustered
multi/many-core machines.
• Another similar and increasingly popular example of a hybrid model is using MPI with CPU-GPU
(Graphics Processing Unit) programming.
• MPI tasks run on CPUs using local memory and communicating with each other over a network.
• Computationally intensive kernels are off-loaded to GPUs on-node.
• Data exchange between node-local memory and GPUs uses CUDA (or something equivalent).
• Other hybrid models are common:
• MPI with Pthreads
• MPI with non-GPU accelerators, etc.
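• A short sketch of the MPI + OpenMP combination (illustrative, not from the original slides): MPI handles communication between processes, while an OpenMP parallel region performs the on-node computation within each process.

/* Hybrid sketch: MPI between processes, OpenMP threads within each process.
 * Typical build/run: mpicc hybrid_demo.c -fopenmp -o hybrid && mpirun -np 2 ./hybrid
 */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv) {
    int rank, provided;
    double local_sum = 0.0, global_sum = 0.0;

    /* Request an MPI library that tolerates threads inside each process. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Computationally intensive kernel: threads work on local, on-node data. */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < N; i++)
        local_sum += (double)i * (rank + 1);

    /* Communication between processes occurs over the network using MPI. */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %g (each rank may use up to %d threads)\n",
               global_sum, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}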
Single Program Multiple Data (SPMD)
• SPMD is a "high level" programming model that can be built upon any combination of the
previously mentioned parallel programming models.
• SINGLE PROGRAM: All tasks execute their copy of the same program simultaneously. This
program can be threads, message passing, data parallel or hybrid.
• MULTIPLE DATA: All tasks may use different data
• SPMD programs usually have the necessary logic programmed into them to allow different
tasks to branch or conditionally execute only those parts of the program they are designed to
execute. That is, tasks do not necessarily have to execute the entire program - perhaps only a
portion of it.
• The SPMD model, using message passing or hybrid programming, is probably the most used
parallel programming model for multi-node clusters.
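• A minimal SPMD sketch (illustrative, not from the original slides): every task runs its own copy of the same program, and rank-based branching selects which portion each task executes.

/* SPMD sketch: one program, every task executes its own copy,
 * and rank-based branching selects which portion each task runs.
 * Typical run: mpirun -np 4 ./spmd_demo
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, ntasks;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    if (rank == 0) {
        /* Only the coordinating portion of the program runs here. */
        printf("rank 0: coordinating %d tasks\n", ntasks);
    } else {
        /* The other tasks execute only the worker portion. */
        printf("rank %d: doing my share of the work\n", rank);
    }

    MPI_Barrier(MPI_COMM_WORLD);   /* all copies of the same program meet here */
    MPI_Finalize();
    return 0;
}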
Multiple Program Multiple Data (MPMD)
• Like SPMD, MPMD is a "high level" programming model that can be built upon any
combination of the previously mentioned parallel programming models.
• MULTIPLE PROGRAM: Tasks may execute different programs simultaneously. The programs can
be threads, message passing, data parallel or hybrid.
• MULTIPLE DATA: All tasks may use different data
• MPMD applications are not as common as SPMD applications but may be better suited for
certain types of problems, particularly those that lend themselves better to functional
decomposition than domain decomposition.