Chapter Four: Parallel Programming Models

 Most Common Parallel Programming Models (frameworks for the programmer)
 Shared Memory Model
 Distributed Model
 Message Passing Model
 Threads Model
 Data Parallel Model

 Programming shared-address space systems (mechanisms for data access and sharing)
 Cilk/Cilk Plus
 Mutual exclusion, locks, synchronization
 OpenMP
 Pthreads

 Performance Analysis and Tuning


Introduction
 Parallel programming models are frameworks for structuring programs to
leverage multiple processors or cores; the most common include shared
memory, message passing, and data parallelism. They are software architectures
and tools designed to make parallel code easier and more efficient to write.
 Parallel programming models are abstractions for executing tasks
concurrently on parallel hardware, while programming shared-address space
systems focuses on how processes/processors within a system can access and
share the same memory space.

 Parallel programming models are abstractions that simplify how


programmers think about and write code for parallel execution, offering
different ways to organize and manage tasks across multiple processors,
such as shared memory or message passing.
 A shared-address space system is a system where multiple processes or
processors can access the same memory locations, allowing for efficient data
sharing and communication.
 Discussion: what, if anything, do we communicate in sequential programs? Is there any communication at all?
Cont…
 Parallel programming models focus on how to decompose tasks, allocate
resources, and manage communication and synchronization between
different parts of a program that are running concurrently.
 Programming Shared-Address Space focuses on how processes can interact
and collaborate by directly accessing and modifying shared data in memory.

Reminders!!
SC vs PC vs PA vs Tuning
Shared memory programming model
 Parallel programming models exist as an abstraction above hardware and memory
architectures.
 Although it might not seem apparent, parallel programming models are NOT
specific to a particular type of machine or memory architecture. In fact, any of these
models can (theoretically) be implemented on any underlying hardware, although some
hardware parallel architectures are a better fit for particular problems than others.
 Shared memory model on a distributed memory machine: Kendall Square Research
(KSR) ALLCACHE approach.
• Machine memory was physically distributed across networked machines, but appeared to
the user as a single shared memory (global address space). Generically, this approach is
referred to as "virtual shared memory".
• Note: although KSR is no longer in business, there is no reason to suggest that a similar
implementation will not be made available by another vendor in the future.

• Message passing model on a shared memory machine: MPI on SGI Origin.


• The SGI Origin employed the CC-NUMA type of shared memory architecture, where
every task has direct access to global memory.
• However, the ability to send and receive messages with MPI, as is commonly done over a
network of distributed memory machines, is not only implemented but is very commonly
used.
Overview
 The KSR ALLCACHE approach, developed by Kendall Square Research
(KSR), implements a shared memory model on a distributed memory
machine, presenting a single, global address space to the user despite the
memory being physically distributed across networked machines, also
known as "virtual shared memory".

 Which model to use? This is often a combination of what is available


and personal choice. There is no "best" model, although there
certainly are better implementations of some models over others.
Overview
 In a traditional distributed memory system, each processor has its own private
memory, and communication between processors requires explicit message passing.
 The KSR ALLCACHE approach aims to overcome this limitation by creating the
illusion of a single, shared memory space, even though the memory is physically
distributed across multiple nodes.
 Key features of KSR:
 Virtual Shared Memory: the approach is often referred to as "virtual shared memory"
because it provides the benefits of a shared memory model (easy data sharing and
programming) without the need for a physically centralized memory.

 Hardware-Coherent Distributed Cache


 The KSR-1, a machine that implements the ALLCACHE approach, uses a
hardware-coherent distributed cache, meaning that the hardware ensures that all
processors see a consistent view of the shared memory, even when data is accessed
from different nodes.

 Hierarchical Memory System


 The KSR-1 has a hierarchical memory system, with local caches on each node and a
higher-level ring interconnecting clusters of nodes.
Overview
 Limitations of traditional distributed memory machines overcome by the ALLCACHE approach
 The ALLCACHE architecture, with its virtual shared memory (VSM) model,
provides a single, unified address space to all processors, making it easier for
programmers to write and manage parallel applications.

 Elimination of Inter-Processor Communication Overhead


 By treating all memory as a single, shared cache, the ALLCACHE architecture reduces the need for
explicit inter-processor communication, which can be complex and performance-intensive on
traditional distributed memory systems.

 Simplified Memory Management


 The ALLCACHE approach simplifies memory management by allowing the system to transparently
handle memory allocation, migration, and replication across the nodes, reducing the burden on the
programmer.

 Scalability
 The KSR-1, which implements the ALLCACHE architecture, is designed to be scalable, allowing it
to accommodate a large number of processors while maintaining a consistent memory model.
Overview
 Limitations overcome by the ALLCACHE approach (continued)
 Hardware Coherence
 The ALLCACHE architecture employs hardware-coherent distributed
caches, ensuring that all processors have a consistent view of the shared
memory, which simplifies synchronization and avoids race conditions.

 Sequential Consistency
 The KSR-1 provides a sequentially consistent memory and programming
model, meaning that the operations on shared memory appear to execute
in a consistent order, as if they were executed on a single processor.

 Software Transparency
 The ALLCACHE architecture aims to make the underlying distributed
memory nature of the system transparent to the programmer, allowing
them to focus on the application logic rather than the hardware details
Overview
 Message Passing Model on a Shared Memory Machine: MPI on SGI Origin
 The SGI Origin series was a pioneering line of high-performance shared-memory
multiprocessor systems that used the CC-NUMA (Cache-Coherent Non-Uniform
Memory Access) architecture. Despite being a shared-memory machine, the
Message Passing Interface (MPI) was frequently used for parallel programming on
these systems.
 SGI (Silicon Graphics) was a company known for its high-performance computing
systems, and CC-NUMA (Cache Coherent Non-Uniform Memory Access) is a
computer architecture in which memory is physically distributed among processors
but kept cache-coherent and addressable by all of them.
 SGI Origin’s Shared Memory Architecture (CC-NUMA)
 The SGI Origin employed a CC-NUMA architecture, meaning:
 Global Address Space: All processors could directly access any memory
location, but with non-uniform latency (accessing local memory was faster than
remote memory).
 Cache Coherence: Hardware ensured that cached copies of data remained
consistent across processors.
 Distributed Shared Memory: Memory was physically distributed across nodes
but appeared as a single shared address space to the programmer.
Overview
 Implications for MPI:
 Since all processes could directly read/write each other’s memory, MPI could
leverage shared memory for efficient message passing (e.g., avoiding network
overhead).
 However, MPI still enforced explicit communication (send/receive semantics) rather
than relying on direct memory access.
 Why Use MPI on a Shared Memory Machine?
 Even though the SGI Origin provided a shared-memory programming model (e.g.,
OpenMP), MPI was still widely used because:
 Portability: MPI programs could run on both shared-memory and
distributed-memory systems without modification.
 Scalability: MPI’s explicit communication model scaled better for large
problems compared to naive shared-memory approaches.
 Avoiding Memory Contention: Explicit message passing helped avoid
bottlenecks from uncontrolled memory access in NUMA systems.
 Legacy Code: Many existing HPC applications were already written in MPI.
Overview
 How MPI Was Implemented on SGI Origin
 MPI on the SGI Origin was optimized to take advantage of shared memory while
maintaining the standard message-passing semantics:
 Intra-node Communication: When MPI processes ran on the same physical
node, messages were passed via shared memory buffers instead of TCP/IP or
other network protocols.
 This reduced latency and improved bandwidth compared to distributed-
memory clusters.
 Inter-node Communication: If processes spanned multiple nodes (in a larger
Origin configuration), MPI would use the network interconnect (e.g.,
NUMALink) for message passing.
 Optimizations:
 Zero-copy transfers: MPI could avoid memory copies by directly mapping
buffers between processes.
 Eager vs. Rendezvous Protocols: Short messages were sent immediately
("eager"), while long messages used handshaking to avoid buffer overflows.
Which one is the "best" model of the two, or of the others?

 Which model to use is often a combination of what


is available and personal choice.
 There is no "best" model, although there certainly
are better implementations of some models over
others.
 The following sections describe each of the models
mentioned above, and also discuss some of their
actual implementations.
Shared Memory Model
 In the shared-memory programming model, cores share a common address space,
which they read and write asynchronously.
 Various mechanisms such as locks / semaphores are used to control access to the
shared memory, resolve contentions and to prevent race conditions and deadlocks.
 This is perhaps the simplest parallel programming model.
 An advantage of this model from the programmer's point of view is the absence
of data "ownership": different processes can directly access and modify the same
data in a common memory space without needing to explicitly communicate or
transfer data between them. This removes the need for separate ownership or control
over data, simplifying program development and communication.
 The disadvantage in terms of performance is that it becomes more difficult to
understand and manage data locality:
 Keeping data local to the process that works on it conserves memory accesses, cache refreshes and bus traffic that
occurs when multiple processes use the same data.
 Unfortunately, controlling data locality is hard to understand and may be beyond the control of the average user.
Shared Memory Model Example using Python
(Code listing not reproduced: one part of the shared-memory work may run on core 1 and another part on core 2.)
Shared Memory Pros and Cons
Advantages:
– Conceptually simple
– Usually minor modifications to existing code
– Often very portable to different architectures
– Shared memory allows processes to directly access the same
memory location, eliminating the need to copy data. This results in
significantly faster data exchange compared to message passing,
where data is copied between processes.
– Shared memory avoids the overhead associated with copying data
and inter-process communication mechanisms
Shared Memory Pros and Cons
Disadvantages
– Difficult to implement task-based parallelism, lack of flexibility
– Often does not scale well
– Requires a large amount of inherent data parallelism (e.g. large
arrays) to be effective
– Can be surprisingly difficult to get performance
– Synchronization Complexity: Ensuring data consistency and
preventing race conditions when multiple processes access shared
memory simultaneously is crucial. This requires careful use of
synchronization mechanisms like locks, semaphores, or monitors.
– As the number of processes accessing shared memory increases,
so does the potential for contention, which can lead to performance
degradation and scalability limitations.
Shared Memory Pros and Cons
Disadvantages
– Potential for Bottlenecks: Shared memory systems can become
performance bottlenecks if the memory bus is saturated by multiple
processes vying for access to the shared memory.

– Data Consistency Problems: If not managed carefully, updates to


shared data by one process may not be visible to others, leading to
data inconsistency and errors.

– Complexity and Risk of Errors: Shared memory programming can


be complex and introduce a higher risk of errors if synchronization
mechanisms are not implemented correctly.
Distributed Memory / Message Passing Model

 In the Distributed Memory / Message Passing Model, processors communicate by


explicitly exchanging messages rather than accessing a shared memory space. This
model is well-suited for distributed systems where processors may reside on
different machines connected by a network. Each processor operates independently
with its own local memory, exchanging data through messages.
 Each processor has its own private memory and operates independently, performing
calculations on its local data; in other words, the model is built around independent
processors.
 Communication between processors occurs through the explicit exchange of
messages, which may contain data or control information, so the communication
mechanism is explicit message exchange.
 This model is particularly well-suited for distributed systems where processors are
physically separated and communicate via a network.
 Message passing can be implemented using libraries and tools like MPI (Message
Passing Interface) or PVM (Parallel Virtual Machine).
Distributed Memory / Message Passing Model
 This model demonstrates the following characteristics:
 A set of tasks that use their own local memory during computation.
 Multiple tasks can reside on the same physical machine and/or across an
arbitrary number of machines.
 Tasks exchange data through communications by sending and receiving
messages.
 Data transfer usually requires cooperative operations to be performed by each
process. For example, a send operation must have a matching receive operation.
Distributed Memory Pros and Cons

 Advantages:
 Scalability: Distributed systems can easily accommodate increased computational
needs by adding more machines.
 Reliability/Fault Tolerance: A failure on one server doesn't necessarily affect others,
improving overall system reliability.

 Flexibility: It's easier to implement, install, and debug new services in a distributed
environment.
 Speed/Performance: By leveraging multiple processors, the system can achieve
faster processing speeds.

 Resource Optimization: Distributed systems can utilize resources more effectively,


potentially reducing costs.
Distributed Memory Pros and Cons

 Disadvantages:
 Increased Complexity: Distributed systems are more complex to design, develop,
and maintain compared to simpler architectures.
 Debugging Challenges: Troubleshooting problems in a distributed system can be
more difficult due to the distributed nature of the components.
 Data Communication Overhead: Data transfer between processors or servers can
add overhead and potentially slow down performance.
 Network Dependency: The system's performance and reliability are heavily
dependent on the underlying network infrastructure.
 Synchronization Challenges: Maintaining data consistency across multiple
processors or servers can be complex.
 Security Risks: Distributed systems can be more vulnerable to security breaches,
requiring robust security measures.
Data Parallel Model
 Data parallel is a parallel programming model where the same
operation or function is applied to different subsets of data in parallel.

 This model is suitable for applications that have a high degree of data
parallelism, where the data can be easily divided into independent or
homogeneous chunks, and the computation on each chunk is similar
or identical.

 Data parallel can also achieve high performance and efficiency, as the
data and the computation can be distributed and balanced across
multiple processors or machines, and the communication and
synchronization can be minimized or optimized.
Data Parallel Model
 However, data parallel also has some limitations, such as the difficulty
of handling irregular or dynamic data structures, the dependency or
interaction between different subsets of data, and the diversity or
variability of the computation on each subset of data.

 To use data parallel, programmers need to use specific languages or


tools, such as MATLAB, NumPy, or TensorFlow, which can support
data parallel operations and functions.
Data Parallel Model
 May also be referred to as the Partitioned Global Address Space (PGAS) model.
 It involves dividing a dataset and applying the same operations (or model) to different subsets
of that data simultaneously, using multiple processors.
 The data parallel model demonstrates the following characteristics:
 Address space is treated globally
 Most of the parallel work focuses on performing operations on a data set. The
data set is typically organized into a common structure, such as an array or
cube.
 A set of tasks work collectively on the same data structure, however, each task
works on a different partition of the same data structure.
 Tasks perform the same operation on their partition of work, for example, "add
4 to every array element" (see the sketch after this list).
 On shared memory architectures, all tasks may have access to the data structure through
global memory.
 On distributed memory architectures, the global data structure can be split up logically and/or
physically across tasks.
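A minimal sketch of the "add 4 to every array element" example above, written in C with an OpenMP worksharing loop (one of several possible data-parallel implementations; the array size is an arbitrary assumption):

#include <stdio.h>

int main(void) {
    int a[16];
    for (int i = 0; i < 16; i++) a[i] = i;   /* initialize the data set */

    /* Data parallelism: each task/thread applies the same operation
       ("add 4") to its own partition of the array. */
    #pragma omp parallel for
    for (int i = 0; i < 16; i++)
        a[i] += 4;

    for (int i = 0; i < 16; i++) printf("%d ", a[i]);
    printf("\n");
    return 0;
}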
Data Parallel Model

 Implementations:
Data Parallel Model

 Generally,
 The data parallel model focuses on parallelizing computations on
data structures.
 It involves dividing the data into segments and distributing them
among multiple processing units.
 Each unit operates on its portion of the data simultaneously,
performing the same computation on different elements.
 This model is commonly used in GPU programming.
 Implementation Example: CUDA is a popular framework for this style of GPU programming.
Data Parallel Model
 MapReduce and Hadoop are mostly examples of data
parallelism, but they can also be applied to task
parallelism.
 MapReduce Model: The MapReduce model is designed for processing large-
scale data sets across distributed systems.
 It divides the computation into two phases: the Map phase and the Reduce
phase.
 In the Map phase, data is processed in parallel by multiple mappers, generating
intermediate key-value pairs.
 In the Reduce phase, the intermediate results are combined by reducers to
produce the final output.
 Implementation Example: Apache Hadoop is a popular framework that
implements the MapReduce model.
Message Passing Model
 Message passing is a parallel programming model where multiple processes
or machines communicate by sending and receiving messages.

 Each process or machine has its own local memory and does not share any
data with others. This avoids the issues of data consistency and concurrency,
but it also introduces the overhead of message serialization, segmentation,
transmission, and coordination.
 Message passing is typically used for problems that involve coarse-grained
parallelism, where the work is divided into large and distinct units that can
be executed by specific processes or machines.
 Examples of message passing problems are distributed systems, web
services, or scientific simulations.
Message Passing Model
 Message serialization, transmission, and coordination refer to the processes of transforming
data into a portable format for storage or transmission, then sending it, and finally ensuring it
is received and understood correctly.
 This involves converting objects or data structures into a stream of bytes or characters,
transmitting it over a network or storing it, and then deserializing it back into the original
format.
 Serialization is the process of converting an object or data structure into a format that can be
easily transmitted or stored.
 Transmission: sending the serialized data over a network or to a storage medium.
 Coordination: ensuring that messages are received and handled correctly, including
aspects like reliability, ordering, and duplicate prevention; this is what guarantees
reliable and consistent communication.
 Mechanisms like message brokers, acknowledgment protocols, and duplicate detection
algorithms are used to ensure proper message handling.
Message Passing Model
 The message passing model demonstrates the following characteristics:
• A set of tasks that use their own local memory during computation.
• Multiple tasks can reside on the same physical machine as well across an
arbitrary number of machines.
• Tasks exchange data through communications by sending and receiving
messages.
• Data transfer usually requires cooperative operations to be performed by
each process.
• For example, a send operation must have a matching receive operation.
Message Passing Model Implementations: MPI
MPI_Send & Matching MPI_Recv Syntax
 General form: MPI_Send(buffer, count, datatype, destination, tag, communicator)
 C/C++ example, with matching receive syntax:
 MPI_Send(&a[1], 10, MPI_INT, Node1, 20, comm);
 MPI_Recv(&a[1], 10, MPI_INT, SourceRank, 20, comm, &status);
 from mpi4py import MPI
 import numpy as np
 comm = MPI.COMM_WORLD
 rank = comm.Get_rank()
 # Create a buffer large enough to send/receive from index 1 onward
 a = np.zeros(20, dtype='i')  # 'i' = 32-bit int
 if rank == 0:
     a[1:11] = np.arange(100, 110)                    # Fill a[1] to a[10] with values 100 to 109
     comm.Send([a[1:11], MPI.INT], dest=1, tag=20)    # Send 10 integers to rank 1
 elif rank == 1:
     comm.Recv([a[1:11], MPI.INT], source=0, tag=20)  # Matching receive from rank 0
     print("Processor 1 received:", a[1:11])
 Expected output: Processor 1 received: [100 101 102 103 104 105 106 107 108 109]
 Question: why does only rank 1 receive and print the data?
C/C++ equivalent of the Python send/receive example above

 #include <mpi.h>
 #include <iostream>
 int main(int argc, char* argv[]) {
 MPI_Init(&argc, &argv); // Initialize MPI
 int rank;
 MPI_Comm_rank(MPI_COMM_WORLD, &rank); // Get current process rank
 int a[20] = {0}; // Buffer to hold 20 integers
 if (rank == 0) {
 // Fill a[1] to a[10] with values 100 to 109
 for (int i = 1; i <= 10; ++i) {
 a[i] = 100 + (i - 1);
 }
 // Send 10 integers starting from a[1] to rank 1
 MPI_Send(&a[1], 10, MPI_INT, 1, 20, MPI_COMM_WORLD);
 std::cout << "Processor 0 sent values from a[1] to a[10]" << std::endl;
C/C++ equivalent of the Python send/receive example above (continued)

 } else if (rank == 1) {
 MPI_Recv(&a[1], 10, MPI_INT, 0, 20, MPI_COMM_WORLD,
MPI_STATUS_IGNORE);
 std::cout << "Processor 1 received: ";
 for (int i = 1; i <= 10; ++i) {
 std::cout << a[i] << " ";
 }
 std::cout << std::endl;
 }

 MPI_Finalize(); // Finalize MPI


 return 0;
 }
 Question: does MPI move data memory to memory, memory to cache, or into registers?
Message Passing Model Implementations: MPI
 Generally,
 Point-to-Point and Collective Communications: MPI offers both point-to-point (send/receive)
and collective communications (broadcast, gather, reduce). Point-to-point communications
can be tailored for fine-grained control, while collective operations, such as all-to-all
broadcasts, are highly optimized for scalability across many processes.
 MPI has a Blocking (Synchronous) Communication feature:
 Functions like MPI_Send and MPI_Recv are blocking by default. When a process calls
MPI_Send, it does not proceed until the send buffer can safely be reused, and similarly,
MPI_Recv blocks until the data has been received. This is very important for handling
synchronization; conceptually it resembles circuit-switched communication.
 MPI has Non-Blocking Communication Features also
 MPI also provides non-blocking variants (e.g., MPI_Isend, MPI_Irecv) that allow
processes to continue executing while waiting for communication to complete.
 Non-blocking communication helps overlap computation with communication,
improving performance by hiding latency. However, it requires careful synchronization
to avoid using data before it is fully received; conceptually this resembles packet-switched communication (see the sketch below).
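A minimal sketch of the non-blocking pattern just described, assuming exactly two ranks exchanging one integer each: MPI_Isend/MPI_Irecv return immediately, independent computation can proceed, and MPI_Waitall marks the point after which the buffers are safe to reuse.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int data = rank;        /* value this rank sends */
    int recv_val = -1;      /* value received from the partner rank */
    MPI_Request reqs[2];

    if (rank == 0) {
        MPI_Isend(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(&recv_val, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, &reqs[1]);
    } else if (rank == 1) {
        MPI_Isend(&data, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(&recv_val, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &reqs[1]);
    }

    /* ... independent computation can overlap with the communication here ... */

    if (rank < 2) {
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);  /* buffers are safe only after this */
        printf("Rank %d received %d\n", rank, recv_val);
    }
    MPI_Finalize();
    return 0;
}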
Message Passing Model Implementations: MPI
 Point-to-Point Communication
 Involves only two processes: one sender and one receiver.
 Blocking (MPI_Send, MPI_Recv) and Non-blocking (MPI_Isend, MPI_Irecv) variants exist.
 Used for customized data exchange between specific ranks.
 Example: Sending a message from Rank 0 to Rank 1
 int rank, data;
 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
 if (rank == 0) {
 data = 42;
 MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD); // Send to Rank 1
 }
 else if (rank == 1) {
 MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE); //
Receive from Rank 0
 printf("Rank 1 received: %d\n", data);
 }
Message Passing Model Implementations: MPI
 MPI, collective communication
 Involves functions (like MPI_Bcast, MPI_Scatter, MPI_Gather, MPI_Reduce, etc.)
 do not use MPI_Send and MPI_Recv.
 Instead, they have their own specialized syntax where all processes in a communicator must
participate with a single function call.
1. Broadcast (MPI_Bcast): broadcasts data from the root rank to all the others.
 int data;
 int root = 0; // Rank of the sender
 if (rank == root) {
 data = 42; // Only root initializes data
 }
 // All processes call MPI_Bcast
 MPI_Bcast(&data, 1, MPI_INT, root, MPI_COMM_WORLD);
 printf("Rank %d received: %d\n", rank, data);
Message Passing Model Implementations: MPI
 MPI, collective communication
 int rank, data;
 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
 if (rank == 0) {
 data = 42; // Only Rank 0 initializes data
 }
 MPI_Bcast(&data, 1, MPI_INT, 0, MPI_COMM_WORLD); // Broadcast from Rank 0
 printf("Rank %d received: %d\n", rank, data);
 Output:
 Rank 0 received: 42
 Rank 1 received: 42
 Rank 2 received: 42
 Rank 3 received: 42
 Question: why does core 0 (rank 0) also get and print its own data? Answer this from the perspective of the MPI design concept.
 What happens, from the MPI design perspective, if we replace the print statement with the code below so that only non-root ranks print? Which version is better?
 if (rank != 0) { // Only non-root ranks print
 printf("Rank %d received: %d\n", rank, data); }
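The other collectives listed above (MPI_Scatter, MPI_Gather, MPI_Reduce) follow the same design: every rank in the communicator makes the same call. A minimal MPI_Reduce sketch (an assumed example, not from the original slides), which sums one value from every rank onto rank 0:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int local = rank + 1;   /* each rank contributes its own value */
    int total = 0;

    /* Every rank calls MPI_Reduce; only the root (rank 0) receives the result. */
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Sum over %d ranks = %d\n", size, total);

    MPI_Finalize();
    return 0;
}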
Task Parallel Model

 Task parallel is a parallel programming model where different


operations or functions are executed on different data in parallel.
 This model is suitable for applications that have a high degree of task
parallelism, where the data can be organized into independent or
heterogeneous tasks, and the computation on each task is different or
variable.

 Task parallel can also achieve high flexibility and adaptability, as the
data and the computation can be assigned and scheduled according to
the availability and capability of the processors or machines, and the
communication and synchronization can be customized or
coordinated.
Task Parallel Model

 However, task parallel also has some challenges, such as the


complexity of defining and managing the tasks, the overhead and
latency of the communication and synchronization, and the
dependency or coordination between different tasks.

 To use task parallel, programmers need to use general-purpose


languages or frameworks, such as C++, Java, or OpenCL, which can
support task parallel creation and execution.
Task Parallel Model

 Generally,
 The task parallel model involves dividing a program into smaller tasks that can
be executed concurrently.
 Each task can be assigned to a separate thread or process, allowing independent
execution.
 This model is suitable for irregular or dynamic workloads, where tasks can have
varying execution times.
 Implementation Example: Intel Threading Building Blocks (TBB) is a task-
based parallelism library.
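As an illustration of the task parallel model, here is a minimal sketch using OpenMP tasks rather than Intel TBB (an assumed substitution, chosen so the example stays in C): each recursive call becomes an independent task that the runtime schedules onto available threads.

#include <stdio.h>

/* Recursive Fibonacci in which each branch may run as a separate task. */
long fib(int n) {
    if (n < 2) return n;
    long x, y;
    #pragma omp task shared(x)
    x = fib(n - 1);
    #pragma omp task shared(y)
    y = fib(n - 2);
    #pragma omp taskwait            /* wait for both child tasks */
    return x + y;
}

int main(void) {
    long result;
    #pragma omp parallel
    {
        #pragma omp single          /* one thread builds the task tree */
        result = fib(20);
    }
    printf("fib(20) = %ld\n", result);
    return 0;
}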
Hybrid Model
 Hybrid Model (MPI + OpenMP)
 The Hybrid Model combines MPI for inter-node communication and OpenMP for
intra-node parallelism, making it well-suited for modern multi-core, multi-node
systems.
 In the Hybrid model, MPI is used for communication between different nodes
(distributed memory), while OpenMP is used within each node to exploit shared
memory parallelism.

 Performance Characteristics
 Scalability: The Hybrid model generally offers better scalability than pure MPI,
especially on modern multi-core systems. MPI handles the communication
across nodes, while OpenMP takes advantage of the shared memory model
within each node. This reduces communication overhead and minimizes latency
between threads within a node.
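A minimal hybrid sketch (an assumed example): MPI_Init_thread requests thread support, OpenMP threads parallelize the work inside each MPI process, and MPI_Reduce combines the per-process results across processes/nodes.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int provided;
    /* Request thread support because OpenMP threads will coexist with MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local_sum = 0.0;
    /* Intra-node parallelism: OpenMP threads share this process's memory. */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < 1000000; i++)
        local_sum += 1.0 / (i + 1.0 + rank);

    /* Inter-process (inter-node) communication handled by MPI. */
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Global sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}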
Hybrid Model
 Performance Characteristics
 Communication Overhead: The Hybrid model benefits from reduced
communication overhead within nodes due to the shared memory.
 However, communication between nodes is still handled via MPI, which introduces
the same overhead as in a pure distributed memory model. The communication
between threads within a node (OpenMP) incurs minimal overhead since they share
the same memory space.

 Memory Access: Within a node, OpenMP allows for local memory access, which is
significantly faster than inter-node memory access in a pure MPI model, where
memory is distributed across different machines.
Hybrid Model
 Challenges
 Load Balancing: The Hybrid model may face challenges with load balancing,
especially when the workload is not evenly distributed across nodes and cores. Care
must be taken to ensure that both MPI and OpenMP components are balanced
effectively to avoid bottlenecks.

 Complexity: Programming in a Hybrid model requires knowledge of both


distributed memory (MPI) and shared memory (OpenMP), making the programming
model more complex than either approach individually.
 Hybrid MPI + OpenMP generally outperforms pure MPI in multi-core nodes
because it minimizes intra-node communication latency and takes advantage of
shared memory. However, scalability beyond a single machine or node can be
constrained by MPI's communication overhead.
 Pure MPI (distributed memory) works well in systems with a large number of nodes
but incurs significant communication overhead due to inter-node messaging.
Hybrid Model
 Comparison distributed memory model (MPI) with hybrid
 The Hybrid Model (MPI + OpenMP) optimizes memory usage and locality by keeping data
within the node and minimizing inter-node communication. This results in better cache
efficiency and faster data access within nodes.
 The Distributed Memory Model (MPI) is more prone to cache inefficiencies because data is
distributed across nodes, requiring more communication and remote memory access.
 The Hybrid Model combines fault tolerance strategies from both MPI and OpenMP, but
handling failures is more complex because of the integration between distributed and shared
memory systems. The key strategies include checkpointing, process replication, and local
thread recovery.
 Distributed Memory Model (MPI) relies heavily on checkpointing and process migration to
handle process failures, but it requires more extensive fault tolerance mechanisms at the
system level, particularly for large-scale systems.

 Generally, whether we operate in the Hybrid Model, the Distributed Memory Model

(MPI), or the Shared Memory Model for a given problem, they differ in terms of
performance, scalability, fault tolerance, and synchronization.
Hybrid Model
 A hybrid model combines more than one of the previously described programming
models.
 Currently, a common example of a hybrid model is the combination of the message
passing model (MPI) with the threads model (OpenMP).
 Threads perform computationally intensive kernels using local, on-node data
 Communications between processes on different nodes occurs over the network
using MPI
 This hybrid model lends itself well to the most popular (currently) hardware
environment of clustered multi/many-core machines.
 Another similar and increasingly popular example of a hybrid model is using MPI
with CPU-GPU (graphics processing unit) programming.
 MPI tasks run on CPUs using local memory and communicating with each other over a network.
 Computationally intensive kernels are off-loaded to GPUs on-node.
 Data exchange between node-local memory and GPUs uses CUDA (or something equivalent).

 Other hybrid models are common:


 MPI with Pthreads
 MPI with non-GPU accelerators
Hybrid Model
 Hierarchical parallel architectures are gaining more importance.
 Uniform programming interface is possible but might not be most efficient.
 MPI requires optimized library for intranode communication.
 OpenMP requires global address space implemented in software.
 Hybrid programming allows both levels of the hierarchy to be optimized:
 MPI across nodes
 OpenMP within a node
 In addition to increased programming complexity it requires additional
performance-relevant design decisions prior to coding.
Programming shared-address space systems

 Programming shared-address space systems:
Cilk/Cilk Plus
Mutual exclusion, locks, synchronization
OpenMP
Pthreads
Various mechanisms are employed to control memory access and sharing, and to
ensure data consistency among cores or threads.
Programming shared-address space systems

 In parallel computing, a shared-address space system refers to a system


where multiple processors or threads have access to a single, common
memory space.
 This model allows different processors or threads to read and write to the
same memory locations, making it easier to share data between them.

 In such systems, each processor or thread does not have its own private
memory; instead, all threads can access the same global memory.

 This allows threads to communicate and share data efficiently, but it also
introduces potential challenges related to concurrency and synchronization
(e.g., race conditions, memory consistency, etc.).
Programming shared-address space systems

Key Characteristics of Shared-Address Space Systems:


1. Global Memory: All threads have access to a global memory
space (e.g., RAM), and data is stored in this common memory
space.
2. Ease of Communication: Sharing data between threads is
straightforward because they can access the same memory
locations.
3. Synchronization: To avoid data races and inconsistencies,
synchronization mechanisms (such as locks, barriers, or atomic
operations) are needed to coordinate access to shared
memory.
Programming shared-address space systems

 Cilk and Cilk Plus are powerful programming models for parallel
programming in shared-address space systems.

 Cilk simplifies parallel programming by allowing easy management of


tasks, while Cilk Plus adds more features like improved vectorization and
better support for fine-grained parallelism.

 Both approaches rely on the shared-memory model, which is common in


multi-core systems, and are excellent for algorithms that can be broken
down into independent tasks.
Programming shared-address space systems

 Cilk Plus is an extension of C and C++ that supports shared memory


parallel programming with a simple and expressive syntax.
 It introduces three keywords: cilk_spawn , cilk_sync , and cilk_for .
 The cilk_spawn keyword allows you to create a child thread that executes a
function call in parallel with the parent thread.
 The cilk_sync keyword allows you to wait for all the child threads to
complete before proceeding.
 The cilk_for keyword allows you to execute a loop in parallel, with each
iteration assigned to a different thread.
 Cilk Plus also provides a runtime system that automatically schedules and
balances the threads, as well as a set of tools and libraries for performance
analysis and debugging.
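A minimal Cilk Plus sketch using the three keywords just described (this assumes a Cilk-enabled compiler; exact flags and availability vary by toolchain):

#include <cilk/cilk.h>
#include <stdio.h>

long fib(int n) {
    if (n < 2) return n;
    long x = cilk_spawn fib(n - 1);  /* child strand may run in parallel with the parent */
    long y = fib(n - 2);             /* parent continues with the other branch */
    cilk_sync;                       /* wait for the spawned child before using x */
    return x + y;
}

int main(void) {
    int a[100];
    /* cilk_for: loop iterations are distributed across worker threads. */
    cilk_for (int i = 0; i < 100; i++) {
        a[i] = i * i;
    }
    printf("fib(20) = %ld, a[99] = %d\n", fib(20), a[99]);
    return 0;
}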
Programming shared-address space systems

 Mutual Exclusion, Locks, and Synchronization


 In concurrent programming, when multiple threads or processes access shared
resources, ensuring that these resources are accessed in a safe and orderly manner is
crucial to prevent data races and inconsistencies.

 Mutual exclusion, locks, and synchronization mechanisms are used to ensure that
only one thread or process accesses a shared resource at a time or that threads
coordinate with each other.
Programming shared-address space systems

 1. Mutual Exclusion (Mutex):

 Mutual exclusion refers to a principle where only one thread (or process) is allowed
to execute a critical section of code at any given time. This ensures that shared data
or resources are not corrupted by simultaneous access.

 Critical Section: A portion of the code that accesses shared resources. Only one
thread can enter the critical section at a time.

 Mutex (Mutual Exclusion Lock): A synchronization primitive used to ensure


mutual exclusion.
 A thread must "lock" the mutex before entering the critical section and "unlock" it
when done. If a mutex is already locked, other threads trying to lock it must wait
until it is unlocked.
Programming shared-address space systems

 Locks and Synchronization:


 Locks: A lock is a mechanism used to control access to a shared resource in multi-threaded
programs.
 It can prevent data races by ensuring that only one thread can access a critical section at a
time. There are various types of locks (e.g., spinlocks, read-write locks), but the basic idea is
to ensure mutual exclusion for shared resources.
 Synchronization: Synchronization refers to coordinating the execution of threads so that they
do not interfere with each other inappropriately. This is necessary when threads need to
coordinate their execution or share data in a consistent and safe manner.
 Common synchronization mechanisms:
 Mutexes: Ensures only one thread at a time can access a shared resource.
 Semaphores: A counting semaphore allows a fixed number of threads to access a resource.
 Condition Variables: Used to signal between threads, allowing them to wait or wake up depending on certain
conditions.
 Barriers: Used to synchronize all threads at a certain point, making sure that all threads reach the barrier before
any of them can continue.
Programming shared-address space systems
 Types of Synchronization
 Barrier
 Usually implies that all tasks are involved
 Each task performs its work until it reaches the barrier. It then stops, or "blocks".
 When the last task reaches the barrier, all tasks are synchronized.
 What happens from here varies. Often, a serial section of work must be done. In
other cases, the tasks are automatically released to continue their work.
 Lock / semaphore
 Can involve any number of tasks
 Typically used to serialize (protect) access to global data or a section of code. Only
one task at a time may use (own) the lock / semaphore / flag.
 The first task to acquire the lock "sets" it. This task can then safely (serially) access
the protected data or code.
 Other tasks can attempt to acquire the lock but must wait until the task that owns the
lock releases it.
 Can be blocking or non-blocking.
Programming shared-address space systems

 OpenMP is a popular and portable framework for shared memory parallel


programming in C, C++, and Fortran.
 It consists of a set of compiler directives, library routines, and environment
variables that allow you to specify how to parallelize your code.

 You can use OpenMP to create and manage threads, distribute work among
them, synchronize their execution, and control their access to shared data.

 For example, you can use the #pragma omp parallel directive to create a
parallel region, where each thread executes the same code block.
 You can also use the #pragma omp for directive to split a loop iteration
among the threads, or the #pragma omp critical directive to protect a code
section from concurrent access by multiple threads.
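A minimal sketch combining the directives mentioned above (parallel, for, and critical); the array size and variable names are assumptions for illustration:

#include <stdio.h>

int main(void) {
    const int N = 1000;
    long sum = 0;

    #pragma omp parallel            /* create a team of threads */
    {
        long local = 0;
        #pragma omp for             /* split the loop iterations among the threads */
        for (int i = 0; i < N; i++)
            local += i;

        #pragma omp critical        /* one thread at a time updates the shared total */
        sum += local;
    }

    printf("sum = %ld (expected %d)\n", sum, (N - 1) * N / 2);
    return 0;
}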
Programming shared-address space systems

 OpenMP is a standard shared memory programming interface (1997)


 Directives for Fortran 77 and C/C++
 Fork-join model resulting in a global program
 It includes:
 Parallel loops
 Parallel sections
 Parallel regions
 Shared and private data
 Synchronization primitives

– barrier
– critical region
Programming shared-address space systems

 Atomic Operations (#pragma omp atomic):


 This directive ensures that a specific memory location is updated atomically. This is used
when you need to perform simple operations on shared variables (e.g., incrementing a
counter).
 #pragma omp atomic
 counter++;
 Locks (omp_lock_t):
 OpenMP provides explicit locks using omp_lock_t. You can initialize, set, and unset locks to
ensure mutual exclusion in critical regions.
 #include <omp.h>
 omp_lock_t lock;
 omp_init_lock(&lock); // Initialize the lock once
 omp_set_lock(&lock); // Acquire the lock before the critical section
 /* ... critical section ... */
 omp_unset_lock(&lock); // Release the lock
 omp_destroy_lock(&lock); // Destroy the lock after use
Programming shared-address space systems

 Barrier (#pragma omp barrier):


 A barrier synchronizes all threads in a parallel region. No thread can continue past the barrier
until all threads have reached it.
 #pragma omp barrier
 Ordered (#pragma omp ordered):
 Ensures that threads execute certain sections of code in a specific order, useful when the
order of operations matters.
 #pragma omp for ordered
 for (int i = 0; i < N; i++) {
 // parallel work for iteration i
 #pragma omp ordered
 {
 // code executed in loop iteration order, one thread at a time
 }
 }
Programming shared-address space systems

 Pthreads is a low-level and portable API for shared memory parallel


programming in C and C++.
 It allows you to create and manipulate threads, as well as mutexes,
condition variables, barriers, and other synchronization mechanisms.
 Unlike OpenMP, Pthreads gives you more control and flexibility over the
thread management and behavior, but also requires more coding and error
handling.
 For example, you can use the pthread_create function to create a thread and
pass it a pointer to a function that defines its task.
 You can also use the pthread_join function to wait for a thread to finish, or
the pthread_mutex_lock and pthread_mutex_unlock functions to lock and
unlock a mutex that protects a shared resource.
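A minimal Pthreads sketch using pthread_create, pthread_join, and a mutex as described above (the thread count and loop bounds are assumptions for illustration):

#include <pthread.h>
#include <stdio.h>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;                        /* unused */
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* enter the critical section */
        counter++;
        pthread_mutex_unlock(&lock);  /* leave the critical section */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);           /* wait for both threads to finish */
    pthread_join(t2, NULL);
    printf("counter = %ld (expected 200000)\n", counter);
    return 0;
}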
Programming shared-address space systems

 Summary of Key Concepts


 Mutual Exclusion: Prevents multiple threads from simultaneously executing critical sections
that modify shared resources, often using mutexes or locks.
 Synchronization: Coordinates the execution of threads to ensure proper sequencing and data
consistency. Common methods include barriers, condition variables, and semaphores.
 OpenMP:
 Provides high-level constructs for parallel programming, including critical sections, atomic
operations, locks, barriers, and ordered execution.
 Easier to use for parallelism compared to low-level thread management.
 Pthreads:
 A low-level threading model that provides fine-grained control over thread creation,
synchronization (mutexes, condition variables, semaphores), and communication.
 More flexible but requires more manual management than OpenMP.
 Both OpenMP and Pthreads offer ways to manage concurrency in parallel programs, but
OpenMP simplifies parallel programming using compiler directives, while Pthreads provides
direct control over threads and synchronization at the system level.
Designing Parallel Programs for High Performance
 Automatic vs. Manual Parallelization
 Designing and developing parallel programs has characteristically been a very manual
process. The programmer is typically responsible for both identifying and actually
implementing parallelism.
 Very often, manually developing parallel codes is a time consuming, complex, error-
prone and iterative process.
 For a number of years now, various tools have been available to assist the programmer
with converting serial programs into parallel programs. The most common type of tool
used to automatically parallelize a serial program is a parallelizing compiler or pre-
processor.
 A parallelizing compiler generally works in two different ways:
 Fully Automatic
 The compiler analyzes the source code and identifies opportunities for parallelism.
 The analysis includes identifying inhibitors to parallelism and possibly a cost weighting
on whether or not the parallelism would actually improve performance.
 Loops (do, for) are the most frequent target for automatic parallelization.
Designing Parallel Programs for High Performance
 Programmer Directed
 Using "compiler directives" or possibly compiler flags, the programmer explicitly tells
the compiler how to parallelize the code.
 May be able to be used in conjunction with some degree of automatic parallelization
also.
 The most common compiler generated parallelization is done using on-node shared
memory and threads (such as OpenMP).
 If you are beginning with an existing serial code and have time or budget constraints,
then automatic parallelization may be the answer. However, there are several important
caveats that apply to automatic parallelization:
 Wrong results may be produced
 Performance may actually degrade
 Much less flexible than manual parallelization
 Limited to a subset (mostly loops) of code
 May actually not parallelize code if the compiler analysis suggests there are
inhibitors or the code is too complex.
Performance Analysis and Tuning
 Performance analysis and tuning of parallel computations involve identifying
bottlenecks, optimizing code for parallel execution, and measuring the efficiency of
parallel algorithms and implementations using metrics like speedup and efficiency.
 Fortunately, there are a number of excellent tools for parallel program performance
analysis and tuning. To evaluate a parallel program's efficiency, we mainly use the
following metrics: speedup and efficiency.
 In parallel programming, speedup measures how much faster a task completes with
multiple processors compared to a single processor, while efficiency quantifies how
effectively those processors are utilized.
 Speedup is calculated as the ratio of the execution time of a serial program to the
execution time of its parallel counterpart. For example, if a serial program takes 10
seconds and its parallel version takes 2 seconds, the speedup is 10/2 = 5.
 Efficiency, on the other hand, is calculated as the speedup divided by the number of
processors used. Continuing the previous example, if the parallel version used 5
processors, the efficiency would be 5/5 = 1 (or 100%).
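In compact form (with $T(1)$ the serial run time and $T(p)$ the run time on $p$ processors, matching the worked example above):

$$S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p}, \qquad \text{e.g. } S = \frac{10\ \text{s}}{2\ \text{s}} = 5,\quad E = \frac{5}{5} = 1\ (100\%).$$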
Hybrid
 2 cores (99% efficiency):This suggests the algorithm is well-suited to parallelization
and gains nearly the maximum possible speedup with a small number of cores.

 4 cores (98% efficiency):The efficiency remains high, indicating that the gains from
adding more cores are still significant.

 8 cores (94% efficiency): The efficiency starts to drop, suggesting that the algorithm
is becoming more sensitive to overhead and contention as more cores are added.

 16 cores (81% efficiency):The efficiency drops significantly, indicating that the


overhead of managing the larger number of cores is outweighing the benefits of
parallelization. This suggests that the algorithm is not scaling well with a large
number of cores and may be limited by factors like communication or
synchronization.
 In essence, the results suggest that the algorithm's design might not be optimally
suited for scaling beyond a certain number of cores.
Examples

 Optimally, the speedup from parallelization would be linear: doubling the number of
processing elements should halve the run-time, and doubling it a second time
should halve the run-time again.
 However, very few parallel algorithms achieve optimal speedup.
 If you plan to run a larger parallel job, then please do a scaling study first, where
you run a medium-size example on 1,2,4,8,12 and 24 cores and then calculate the
speedup and the efficiency of the runs to determine if your code even scales up to
24 cores, or if the sweet spot corresponds to a lower number of cores.
 This will help you to get a higher throughput in terms of numbers of jobs, if you
can only use a limited number of cores.
 The scaling study indicates, that using 4 cores is the sweet spot in terms of parallel
efficiency and adding more cores even makes the job slower.
Speed up and Efficiency
 Speedup:
 It's the ratio of the sequential execution time (T(1)) to the parallel execution time (T(p)).
 Measures the performance improvement gained by using multiple processors.
 Ideal speedup is linear, meaning that if you double the number of processors, the execution time
should halve.
 However, real-world speedup is often less than ideal due to factors like overhead from
communication between processors and synchronization.
 Efficiency:
 It indicates how effectively the parallel resources are utilized.
 Indicates how well the available processors are being utilized.
 A high efficiency means that each processor is contributing significantly to the overall performance
improvement.
 Efficiency is calculated as: Efficiency = Speedup / Number of Processors.
 An efficiency of 1 (or 100%) means that all processors are fully utilized, and the speedup is
perfectly linear.
 In practice, efficiency is often less than 1 due to factors like communication overhead and idle
processors.
Important terms for correlations
 Run Time:
 It's the actual time it takes for the algorithm to complete execution.
 Parallel algorithms aim to reduce run time by distributing work across multiple cores.
 Number of Cores:
 It represents the number of processing units available for parallel execution.
 Increasing the number of cores can potentially increase speedup and improve efficiency, but it also
introduces overhead related to communication and synchronization between cores.
 Relationship among these factors:
 Speedup and Efficiency are dependent on Run Time and Number of Cores:
 If you can reduce the parallel run time (T(p)), while maintaining a decent number of cores, you'll likely achieve
higher speedup and efficiency.
 Higher Speedup doesn't always mean higher Efficiency:
 You can achieve high speedup by using a large number of cores, but if some cores are idle or spending more time
on communication than computation, the overall efficiency might be lower.
 Ideal Scenario:
 The goal is to achieve high speedup and efficiency by efficiently utilizing all available cores and minimizing
overhead associated with parallel execution.
Important terms for correlations
 Scaling Limitations: Adding more cores might not always translate to a proportional
increase in speed. There's a point of diminishing returns where the overhead of
coordinating multiple cores outweighs the benefits of parallelism.
 Overhead and Coordination: When a program is executed across multiple cores,
there's additional overhead involved in managing the threads, synchronizing their
work, and transferring data between them. This overhead can become significant
when the number of cores increases, especially for applications that are not naturally
parallel.
 Memory Contention: Multiple cores often need to access the same shared memory,
leading to contention for resources. This can slow down execution as cores have to
wait for their turn to access the memory.

 Data Transfer: When a program is split across multiple cores, it often needs to send
data back and forth between them. This data transfer can become a bottleneck,
especially if the data is large or the communication network is slow.
Important terms for correlations
 "Sweet Spot" for Cores:
 The optimal number of cores for a given application can vary. In many cases, using
4 cores or a similar number can be the sweet spot, where the benefits of parallelism
outweigh the overhead. However, this can also depend on the specific application
and hardware.

 Scaling Studies:
 Scaling studies are experiments that measure the performance of parallel code as
the number of cores or processors is increased. These studies can help identify the
point where adding more cores stops providing performance benefits or even causes
performance degradation.
Performance Analysis Tools
 Tools for Performance Analysis and Tuning:
 Profiling Tools: Help identify performance bottlenecks by measuring execution
time, resource utilization, and other relevant metrics.

 Performance Monitoring Tools: Provide real-time monitoring of performance


metrics during execution.

 Debugging Tools: Help identify and fix bugs that can impact performance.

 Automatic Parallelization Tools: Can automatically identify and parallelize portions


of code, but may require further tuning
Common Performance Bottlenecks & Tuning Strategies

Communication Overhead (MPI)


 Problem: Excessive message-passing latency/bandwidth limits.
 Solutions:
 Aggregate small messages into larger ones.
 Use non-blocking (async) communication (e.g., MPI_Isend, MPI_Irecv).
 Overlap computation & communication (hide latency).
 Optimize collective operations (e.g., replace MPI_Allreduce with tree-based
reductions).
Load Imbalance
 Problem: Some threads/processes finish earlier, causing idle time.
 Solutions:
 Dynamic scheduling (e.g., schedule(dynamic) in OpenMP); see the sketch after this list.
 Work stealing (some runtimes support this automatically).
 Better domain decomposition (e.g., space-filling curves for irregular workloads).
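A minimal sketch of the dynamic-scheduling fix referenced in the list above (the chunk size of 2 is an assumption; the sleep merely simulates iterations of uneven cost):

#include <omp.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* schedule(dynamic, 2): idle threads grab the next chunk of 2 iterations,
       so threads that finish early keep working instead of sitting idle. */
    #pragma omp parallel for schedule(dynamic, 2)
    for (int i = 0; i < 16; i++) {
        usleep((i % 4) * 1000);      /* simulate uneven per-iteration cost */
        printf("iteration %2d done by thread %d\n", i, omp_get_thread_num());
    }
    return 0;
}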
Common Performance Bottlenecks & Tuning Strategies

Synchronization Overhead (OpenMP/Threads)


 Problem: Excessive barriers, locks, or critical sections.
 Solutions:
 Reduce synchronization frequency (e.g., use nowait in OpenMP).
 Replace locks with atomic operations.
 Use task-based parallelism (OpenMP tasks, Intel TBB).
Memory Bottlenecks (NUMA, Cache Misses)
 Problem: Slow memory access due to poor data locality.
 Solutions:
 Data alignment & padding (avoid false sharing).
 NUMA-aware allocation (e.g., numactl in Linux).
 Blocking/tiling for cache reuse (see the sketch after this list).
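A minimal sketch of the blocking/tiling idea referenced above, applied to a matrix product; N and the tile size B are assumptions that would be tuned to the target cache:

#include <stdio.h>

#define N 512
#define B 64    /* tile size: chosen so a BxB block stays cache-resident */

static double A[N][N], Bm[N][N], C[N][N];

int main(void) {
    /* Tiled matrix multiply: operate on BxB sub-blocks for cache reuse. */
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)
                for (int i = ii; i < ii + B; i++)
                    for (int j = jj; j < jj + B; j++) {
                        double sum = C[i][j];
                        for (int k = kk; k < kk + B; k++)
                            sum += A[i][k] * Bm[k][j];
                        C[i][j] = sum;
                    }
    printf("C[0][0] = %f\n", C[0][0]);
    return 0;
}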
Common Performance Bottlenecks & Tuning Strategies

Tuning Methodology
 Baseline Measurement – Run the unoptimized version and collect metrics.
 Profiling – Identify bottlenecks using tools (e.g., VTune, gprof).
 Hypothesis – Predict the cause (e.g., "MPI communication is too slow").
 Optimization – Apply fixes (e.g., switch to non-blocking MPI).
 Validation – Re-run and compare performance.
 Iterate – Repeat until performance goals are met.
 Example: MPI Tuning on SGI Origin (CC-NUMA)
 Problem: MPI on shared memory still has overhead due to synchronization.
 Optimizations:
 Use shared-memory-aware MPI (e.g., MPI_Comm_split_type with
MPI_COMM_TYPE_SHARED); see the sketch after this list.
 Hybrid MPI+OpenMP – Reduce MPI processes per node, use threads for intra-node
parallelism.
 NUMA-aware data placement – Bind processes to cores near their memory.
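A minimal sketch of the shared-memory-aware MPI step referenced above: MPI_Comm_split_type with MPI_COMM_TYPE_SHARED groups ranks that share physical memory into per-node communicators (process-to-core binding itself is still done via the launcher or numactl):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Split MPI_COMM_WORLD so that ranks on the same node share a communicator. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int node_rank, node_size;
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);
    printf("world rank %d is local rank %d of %d on its node\n",
           world_rank, node_rank, node_size);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}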
Tuning and Optimization Techniques

 Optimization Goals
Execution Time
Get the answer as fast as possible.
Throughput
Get as many happy users as possible.
Utilization
 Get the best exploitation of the invested money
 Performance Analysis and Tuning is based on
 measurements of performance data
 evaluation of performance data
 code or policy changes based on analysis results
Tuning and Optimization Techniques
 Algorithm Optimization:
 Algorithm Selection: Choose algorithms that are well-suited for parallel execution.
 Data Partitioning: Divide the data in a way that minimizes communication overhead
and maximizes load balancing.
 Code Optimization:
 Reduce Communication: Minimize communication between processors by assigning
communicating tasks to the same process or using efficient communication protocols.
 Improve Load Balancing: Distribute the workload evenly among processors.
 Optimize Synchronization: Use efficient synchronization primitives and minimize the
amount of synchronization required.
 Hardware Optimization:
 Choose Appropriate Hardware: Select processors and interconnects that are well-suited
for parallel execution.
 Optimize Memory Access: Ensure efficient memory access patterns to avoid
bottlenecks.
Conclusion

 Performance tuning in parallel programming requires:


Accurate measurement (profiling/tracing tools) or
benchmarking
Identifying bottlenecks (communication, load imbalance,
memory issues).
Applying targeted optimizations (algorithmic, MPI/OpenMP
tuning, NUMA optimizations).
 Iterative refinement until performance goals are achieved.

 By systematically analyzing and tuning, parallel programs can achieve


near-optimal scalability and efficiency.
Thank you…!!!
