Chapter Four: Parallel Computing
Reminders!!
SC vs. PC vs. PA vs. Tuning
Shared memory programming model
Parallel programming models exist as an abstraction above hardware and memory
architectures.
Although it might not seem apparent, parallel programming models are NOT
specific to a particular type of machine or memory architecture; in fact, any of these
models can (theoretically) be implemented on any underlying hardware. Nevertheless,
some hardware architectures are a more natural fit for a given problem than others.
Shared memory model on a distributed memory machine: Kendall Square Research
(KSR) ALLCACHE approach.
• Machine memory was physically distributed across networked machines, but appeared to
the user as a single shared memory (global address space). Generically, this approach is
referred to as "virtual shared memory".
• Note: although KSR is no longer in business, there is no reason to suggest that a similar
implementation will not be made available by another vendor in the future.
Scalability
The KSR-1, which implements the ALLCACHE architecture, is designed to be scalable, allowing it
to accommodate a large number of processors while maintaining a consistent memory model.
Overview
Limitations………..
Hardware Coherence
The ALLCACHE architecture employs hardware-coherent distributed
caches, ensuring that all processors have a consistent view of the shared
memory, which simplifies synchronization and avoids race conditions.
Sequential Consistency
The KSR-1 provides a sequentially consistent memory and programming
model, meaning that the operations on shared memory appear to execute
in a consistent order, as if they were executed on a single processor.
Software Transparency
The ALLCACHE architecture aims to make the underlying distributed
memory nature of the system transparent to the programmer, allowing
them to focus on the application logic rather than the hardware details.
Overview
Message Passing Model on a Shared Memory Machine: MPI on SGI Origin
The SGI Origin series was a pioneering line of high-performance shared-memory
multiprocessor systems that used the CC-NUMA (Cache-Coherent Non-Uniform
Memory Access) architecture. Despite being a shared-memory machine, the
Message Passing Interface (MPI) was frequently used for parallel programming on
these systems.
SGI (Silicon Graphics) was a company known for its high-performance computing
systems, and CC-NUMA (Cache-Coherent Non-Uniform Memory Access) is the
shared-memory architecture on which the Origin systems were built.
SGI Origin’s Shared Memory Architecture (CC-NUMA)
The SGI Origin employed a CC-NUMA architecture, meaning:
Global Address Space: All processors could directly access any memory
location, but with non-uniform latency (accessing local memory was faster than
remote memory).
Cache Coherence: Hardware ensured that cached copies of data remained
consistent across processors.
Distributed Shared Memory: Memory was physically distributed across nodes
but appeared as a single shared address space to the programmer.
Overview
Implications for MPI:
Since all processes could directly read/write each other’s memory, MPI could
leverage shared memory for efficient message passing (e.g., avoiding network
overhead).
However, MPI still enforced explicit communication (send/receive semantics) rather
than relying on direct memory access.
Why Use MPI on a Shared-Memory Machine?
Even though the SGI Origin provided a shared-memory programming model (e.g.,
OpenMP), MPI was still widely used because:
Portability: MPI programs could run on both shared-memory and
distributed-memory systems without modification.
Scalability: MPI’s explicit communication model scaled better for large
problems compared to naive shared-memory approaches.
Avoiding Memory Contention: Explicit message passing helped avoid
bottlenecks from uncontrolled memory access in NUMA systems.
Legacy Code: Many existing HPC applications were already written in MPI.
Overview
How MPI Was Implemented on SGI Origin
MPI on the SGI Origin was optimized to take advantage of shared memory while
maintaining the standard message-passing semantics:
Intra-node Communication: When MPI processes ran on the same physical
node, messages were passed via shared memory buffers instead of TCP/IP or
other network protocols.
This reduced latency and improved bandwidth compared to distributed-memory clusters.
Inter-node Communication: If processes spanned multiple nodes (in a larger
Origin configuration), MPI would use the network interconnect (e.g.,
NUMALink) for message passing.
Optimizations:
Zero-copy transfers: MPI could avoid memory copies by directly mapping
buffers between processes.
Eager vs. Rendezvous Protocols: Short messages were sent immediately
("eager"), while long messages used handshaking to avoid buffer overflows.
Which model is "best": one of these two, or another model entirely?
Shared Memory Model Example Using Python
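A minimal sketch of the shared memory model in Python, using the standard-library
multiprocessing module (the choice of module is an illustrative assumption; the slide
does not name one). Two worker processes update different halves of the SAME shared
array in place, so no data is copied between them.

from multiprocessing import Process, Array

def square_slice(shared, start, stop):
    # Each worker writes directly into the shared buffer (no copies, no messages).
    for i in range(start, stop):
        shared[i] = shared[i] * shared[i]

if __name__ == "__main__":
    data = Array('i', range(8))                           # shared integer array [0, 1, ..., 7]
    p1 = Process(target=square_slice, args=(data, 0, 4))  # first worker: first half
    p2 = Process(target=square_slice, args=(data, 4, 8))  # second worker: second half
    p1.start(); p2.start()
    p1.join(); p2.join()
    print(list(data))                                     # [0, 1, 4, 9, 16, 25, 36, 49]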
Shared Memory Pros and Cons
Advantages:
– Conceptually simple
– Usually minor modifications to existing code
– Often very portable to different architectures
– Shared memory allows processes to directly access the same
memory location, eliminating the need to copy data. This results in
significantly faster data exchange compared to message passing,
where data is copied between processes.
– Shared memory avoids the overhead associated with copying data
and inter-process communication mechanisms
Shared Memory Pros and Cons
Disadvantages
– Difficult to implement task-based parallelism, lack of flexibility
– Often does not scale well
– Requires a large amount of inherent data parallelism (e.g. large
arrays) to be effective
– Can be surprisingly difficult to get performance
– Synchronization Complexity: Ensuring data consistency and
preventing race conditions when multiple processes access shared
memory simultaneously is crucial. This requires careful use of
synchronization mechanisms like locks, semaphores, or monitors.
– As the number of processes accessing shared memory increases,
so does the potential for contention, which can lead to performance
degradation and scalability limitations.
Shared Memory Pros and Cons
Disadvantages
– Potential for Bottlenecks: Shared memory systems can become
performance bottlenecks if the memory bus is saturated by multiple
processes vying for access to the shared memory.
Distributed Memory Pros and Cons
Advantages:
Scalability: Distributed systems can easily accommodate increased computational
needs by adding more machines.
Reliability/Fault Tolerance: A failure on one server doesn't necessarily affect others,
improving overall system reliability.
Flexibility: It's easier to implement, install, and debug new services in a distributed
environment.
Speed/Performance: By leveraging multiple processors, the system can achieve
faster processing speeds.
Disadvantages:
Increased Complexity: Distributed systems are more complex to design, develop,
and maintain compared to simpler architectures.
Debugging Challenges: Troubleshooting problems in a distributed system can be
more difficult due to the distributed nature of the components.
Data Communication Overhead: Data transfer between processors or servers can
add overhead and potentially slow down performance.
Network Dependency: The system's performance and reliability are heavily
dependent on the underlying network infrastructure.
Synchronization Challenges: Maintaining data consistency across multiple
processors or servers can be complex.
Security Risks: Distributed systems can be more vulnerable to security breaches,
requiring robust security measures.
Data Parallel Model
Data parallel is a parallel programming model where the same
operation or function is applied to different subsets of data in parallel.
This model is suitable for applications that have a high degree of data
parallelism, where the data can be easily divided into independent or
homogeneous chunks, and the computation on each chunk is similar
or identical.
Data parallel can also achieve high performance and efficiency, as the
data and the computation can be distributed and balanced across
multiple processors or machines, and the communication and
synchronization can be minimized or optimized.
Data Parallel Model
However, data parallel also has some limitations, such as the difficulty
of handling irregular or dynamic data structures, the dependency or
interaction between different subsets of data, and the diversity or
variability of the computation on each subset of data.
Implementations:
Data Parallel Model
Generally,
The data parallel model focuses on parallelizing computations on
data structures.
It involves dividing the data into segments and distributing them
among multiple processing units.
Each unit operates on its portion of the data simultaneously,
performing the same computation on different elements.
This model is commonly used in GPU programming.
Implementation Example: CUDA is a popular framework for data-parallel
programming on GPUs.
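CUDA itself targets GPUs from C/C++. As a rough CPU-side sketch of the same
data-parallel idea in Python (an illustrative assumption, not CUDA), the same
function is applied to different chunks of the data by a pool of worker processes:

from multiprocessing import Pool

def scale(chunk):
    # The SAME operation is applied to every chunk of the data.
    return [2 * x for x in chunk]

if __name__ == "__main__":
    data = list(range(16))
    chunks = [data[i:i + 4] for i in range(0, len(data), 4)]  # divide the data into segments
    with Pool(processes=4) as pool:
        results = pool.map(scale, chunks)                     # each worker handles one segment
    print([x for chunk in results for x in chunk])            # recombine the partial results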
Data Parallel Model
MapReduce and Hadoop are primarily examples of data parallelism, although
they can also be applied to task parallelism.
MapReduce Model: The MapReduce model is designed for processing large-
scale data sets across distributed systems.
It divides the computation into two phases: the Map phase and the Reduce
phase.
In the Map phase, data is processed in parallel by multiple mappers, generating
intermediate key-value pairs.
In the Reduce phase, the intermediate results are combined by reducers to
produce the final output.
Implementation Example: Apache Hadoop is a popular framework that
implements the MapReduce model.
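A minimal word-count sketch of the Map and Reduce phases in plain Python (an
illustrative assumption; Hadoop distributes the same two phases across a cluster):

from collections import Counter
from functools import reduce
from multiprocessing import Pool

def map_phase(line):
    # Mapper: emit intermediate (word, count) pairs for one input record.
    return Counter(line.split())

def reduce_phase(c1, c2):
    # Reducer: merge intermediate counts into the final result.
    return c1 + c2

if __name__ == "__main__":
    lines = ["hello hadoop", "hello mapreduce", "hadoop runs mapreduce"]
    with Pool(processes=2) as pool:
        intermediate = pool.map(map_phase, lines)           # Map phase, in parallel
    totals = reduce(reduce_phase, intermediate, Counter())  # Reduce phase
    print(dict(totals))   # {'hello': 2, 'hadoop': 2, 'mapreduce': 2, 'runs': 1}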
Message Passing Model
Message passing is a parallel programming model where multiple processes
or machines communicate by sending and receiving messages.
Each process or machine has its own local memory and does not share any
data with others. This avoids the issues of data consistency and concurrency,
but it also introduces the overhead of message serialization, segmentation,
transmission, and coordination.
Message passing is typically used for problems that involve coarse-grained
parallelism, where the work is divided into large and distinct units that can
be executed by specific processes or machines.
Examples of message passing problems are distributed systems, web
services, or scientific simulations.
Message Passing Model
Message serialization, transmission, and coordination refer to the processes of transforming
data into a portable format for storage or transmission, then sending it, and finally ensuring it
is received and understood correctly.
This involves converting objects or data structures into a stream of bytes or characters,
transmitting it over a network or storing it, and then deserializing it back into the original
format.
Serialization is the process of converting an object or data structure into a format that can be
easily transmitted or stored.
Transmission is sending the serialized data over a network or to a storage medium.
Coordination ensures reliable communication and proper handling of messages, including
aspects such as reliability, message ordering, and duplicate message prevention.
Mechanisms like message brokers, acknowledgment protocols, and duplicate detection
algorithms are used to ensure proper message handling.
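A minimal sketch of serialization and deserialization using Python's pickle module
(an illustrative assumption; real message-passing systems may use other formats or
MPI's internal packing):

import pickle

message = {"sender": 0, "tag": 20, "payload": [100, 101, 102]}

wire_bytes = pickle.dumps(message)     # serialization: object -> byte stream
# Transmission would send wire_bytes over a socket, an MPI call, or a message broker;
# here the bytes are simply handed to the receiving side in memory.
received = pickle.loads(wire_bytes)    # deserialization: byte stream -> original object

print(received == message)             # True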
Message Passing Model
The message passing model demonstrates the following characteristics:
• A set of tasks that use their own local memory during computation.
• Multiple tasks can reside on the same physical machine as well across an
arbitrary number of machines.
• Tasks exchange data through communications by sending and receiving
messages.
• Data transfer usually requires cooperative operations to be performed by
each process.
• For example, a send operation must have a matching receive operation.
Message Passing Model Implementations: MPI
MPI_Send & Matching MPI_Recv Syntax
MPI_Send(buffer, count, datatype, destination, tag, communicator)
MPI_Recv(buffer, count, datatype, source, tag, communicator, status)
Matching pair in C/C++ (rank 0 sends, rank 1 receives):
MPI_Send(&a[1], 10, MPI_INT, 1, 20, comm);
MPI_Recv(&a[1], 10, MPI_INT, 0, 20, comm, &status);
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Create a buffer large enough to send/receive from index 1 onward
a = np.zeros(20, dtype='i')  # 'i' = 32-bit int

if rank == 0:
    a[1:11] = np.arange(100, 110)                    # Fill a[1] to a[10] with values 100 to 109
    comm.Send([a[1:11], MPI.INT], dest=1, tag=20)    # Send 10 integers to rank 1
elif rank == 1:
    comm.Recv([a[1:11], MPI.INT], source=0, tag=20)  # Receive from rank 0
    print("Processor 1 received:", a[1:11])

Expected output: Processor 1 received: [100 101 102 103 104 105 106 107 108 109]
Only rank 1 prints the received values because the Recv and the print statement are
executed only inside the "elif rank == 1" branch; rank 0 only sends.
MPI_Send & Matching MPI_Recv Syntax: C/C++ equivalent of the Python
send and receive example above
#include <mpi.h>
#include <iostream>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);                  // Initialize MPI
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    // Get current process rank

    int a[20] = {0};                         // Buffer to hold 20 integers

    if (rank == 0) {
        // Fill a[1] to a[10] with values 100 to 109
        for (int i = 1; i <= 10; ++i) {
            a[i] = 100 + (i - 1);
        }
        // Send 10 integers starting from a[1] to rank 1
        MPI_Send(&a[1], 10, MPI_INT, 1, 20, MPI_COMM_WORLD);
        std::cout << "Processor 0 sent values from a[1] to a[10]" << std::endl;
    } else if (rank == 1) {
        // Matching receive: same count, datatype, and tag; source is rank 0
        MPI_Recv(&a[1], 10, MPI_INT, 0, 20, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::cout << "Processor 1 received: ";
        for (int i = 1; i <= 10; ++i) {
            std::cout << a[i] << " ";
        }
        std::cout << std::endl;
    }

    MPI_Finalize();                          // Clean up MPI before exiting
    return 0;
}
The task parallel model can also achieve high flexibility and adaptability, as the
work can be assigned and scheduled according to the availability and capability of
the processors or machines, and the communication and synchronization can be
customized and coordinated as needed.
Task Parallel Model
Generally,
The task parallel model involves dividing a program into smaller tasks that can
be executed concurrently.
Each task can be assigned to a separate thread or process, allowing independent
execution.
This model is suitable for irregular or dynamic workloads, where tasks can have
varying execution times.
Implementation Example: Intel Threading Building Blocks (TBB) is a task-
based parallelism library.
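TBB is a C++ library. As a rough Python analogue of task parallelism (an illustrative
assumption), two different, independent tasks are submitted to a thread pool and run
concurrently:

from concurrent.futures import ThreadPoolExecutor

def load_data():
    # Task 1: pretend to load some input data.
    return list(range(5))

def build_report(title):
    # Task 2: an unrelated piece of work with its own logic.
    return f"report: {title}"

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=2) as pool:
        f1 = pool.submit(load_data)            # the tasks differ, unlike data parallelism
        f2 = pool.submit(build_report, "Q1")
        print(f1.result(), f2.result())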
Hybrid Model
Hybrid Model (MPI + OpenMP)
The Hybrid Model combines MPI for inter-node communication and OpenMP for
intra-node parallelism, making it well-suited for modern multi-core, multi-node
systems.
In the Hybrid model, MPI is used for communication between different nodes
(distributed memory), while OpenMP is used within each node to exploit shared
memory parallelism.
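A structural sketch of the hybrid idea in Python (an illustrative assumption: mpi4py
stands in for MPI between ranks, and a thread pool stands in for OpenMP threads inside
each rank; note that CPython's GIL limits real thread speedups, unlike OpenMP). Run it
with, for example, mpiexec -n 4:

from mpi4py import MPI
from concurrent.futures import ThreadPoolExecutor

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

def partial_sum(chunk):
    return sum(chunk)

local = range(rank * 250, (rank + 1) * 250)    # each rank owns a disjoint slice of 0..999
pieces = [local[0:125], local[125:250]]        # split the local slice between 2 threads

with ThreadPoolExecutor(max_workers=2) as pool:
    local_total = sum(pool.map(partial_sum, pieces))         # intra-node ("OpenMP") part

global_total = comm.reduce(local_total, op=MPI.SUM, root=0)  # inter-node (MPI) part
if rank == 0:
    print("Global sum:", global_total)         # 499500 when run with 4 ranks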
Performance Characteristics
Scalability: The Hybrid model generally offers better scalability than pure MPI,
especially on modern multi-core systems. MPI handles the communication
across nodes, while OpenMP takes advantage of the shared memory model
within each node. This reduces communication overhead and minimizes latency
between threads within a node.
Hybrid Model
Performance Characteristics
Communication Overhead: The Hybrid model benefits from reduced
communication overhead within nodes due to the shared memory.
However, communication between nodes is still handled via MPI, which introduces
the same overhead as in a pure distributed memory model. The communication
between threads within a node (OpenMP) incurs minimal overhead since they share
the same memory space.
Memory Access: Within a node, OpenMP allows for local memory access, which is
significantly faster than inter-node memory access in a pure MPI model, where
memory is distributed across different machines.
Hybrid Model
Challenges
Load Balancing: The Hybrid model may face challenges with load balancing,
especially when the workload is not evenly distributed across nodes and cores. Care
must be taken to ensure that both MPI and OpenMP components are balanced
effectively to avoid bottlenecks.
In shared-address-space systems, each processor or thread does not have its own
private memory; instead, all threads can access the same global memory.
This allows threads to communicate and share data efficiently, but it also
introduces potential challenges related to concurrency and synchronization
(e.g., race conditions, memory consistency, etc.).
Programming shared-address space systems
Cilk and Cilk Plus are powerful programming models for parallel
programming in shared-address space systems.
Mutual exclusion, locks, and synchronization mechanisms are used to ensure that
only one thread or process accesses a shared resource at a time or that threads
coordinate with each other.
Programming shared-address space systems
Mutual exclusion refers to a principle where only one thread (or process) is allowed
to execute a critical section of code at any given time. This ensures that shared data
or resources are not corrupted by simultaneous access.
Critical Section: A portion of the code that accesses shared resources. Only one
thread can enter the critical section at a time.
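A minimal mutual-exclusion sketch in Python using threading.Lock (an illustrative
assumption; OpenMP's critical directive and Pthreads mutexes play the same role).
The "with lock:" block is the critical section, so only one thread updates the shared
counter at a time:

import threading

counter = 0
lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        with lock:          # critical section: one thread at a time
            counter += 1

threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)              # always 400000; without the lock, updates could be lost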
You can use OpenMP to create and manage threads, distribute work among
them, synchronize their execution, and control their access to shared data.
For example, you can use the #pragma omp parallel directive to create a
parallel region, where each thread executes the same code block.
You can also use the #pragma omp for directive to split a loop iteration
among the threads, or the #pragma omp critical directive to protect a code
section from concurrent access by multiple threads.
Programming shared-address space systems
– barrier
– critical region
Scaling Example: Parallel Efficiency at Different Core Counts
4 cores (98% efficiency):The efficiency remains high, indicating that the gains from
adding more cores are still significant.
8 cores (94% efficiency): The efficiency starts to drop, suggesting that the algorithm
is becoming more sensitive to overhead and contention as more cores are added.
Optimally, the speedup from parallelization would be linear: doubling the number of
processing elements should halve the run-time, and doubling it a second time
should again halve the run-time.
However, very few parallel algorithms achieve optimal speedup.
If you plan to run a larger parallel job, then please do a scaling study first, where
you run a medium-size example on 1,2,4,8,12 and 24 cores and then calculate the
speedup and the efficiency of the runs to determine if your code even scales up to
24 cores, or if the sweet spot corresponds to a lower number of cores.
This will help you to get a higher throughput in terms of numbers of jobs, if you
can only use a limited number of cores.
The scaling study indicates that using 4 cores is the sweet spot in terms of parallel
efficiency, and that adding more cores may even make the job slower.
Speed up and Efficiency
Speedup:
It's the ratio of the sequential execution time (T(1)) to the parallel execution time (T(p)).
Measures the performance improvement gained by using multiple processors.
Ideal speedup is linear, meaning that if you double the number of processors, the execution time
should halve.
However, real-world speedup is often less than ideal due to factors like overhead from
communication between processors and synchronization.
Efficiency:
Indicates how effectively the available parallel resources (processors) are being utilized.
A high efficiency means that each processor is contributing significantly to the overall performance
improvement.
Efficiency is calculated as: Efficiency = Speedup / Number of Processors.
An efficiency of 1 (or 100%) means that all processors are fully utilized, and the speedup is
perfectly linear.
In practice, efficiency is often less than 1 due to factors like communication overhead and idle
processors.
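A small sketch that applies these two formulas to hypothetical timings (the numbers are
made up, chosen to roughly reproduce the efficiencies quoted earlier: about 98% at 4
cores and 94% at 8 cores):

def speedup(t1, tp):
    # Speedup = T(1) / T(p)
    return t1 / tp

def efficiency(t1, tp, p):
    # Efficiency = Speedup / Number of processors
    return speedup(t1, tp) / p

timings = {1: 100.0, 2: 51.0, 4: 25.5, 8: 13.3}   # hypothetical run times in seconds
t1 = timings[1]
for p, tp in timings.items():
    print(f"{p} cores: speedup = {speedup(t1, tp):.2f}, efficiency = {efficiency(t1, tp, p):.0%}")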
Important Terms and Correlations
Run Time:
It's the actual time it takes for the algorithm to complete execution.
Parallel algorithms aim to reduce run time by distributing work across multiple cores.
Number of Cores:
It represents the number of processing units available for parallel execution.
Increasing the number of cores can potentially increase speedup and improve efficiency, but it also
introduces overhead related to communication and synchronization between cores.
Relationship among these factors:
Speedup and Efficiency are dependent on Run Time and Number of Cores:
If you can reduce the parallel run time (T(p)), while maintaining a decent number of cores, you'll likely achieve
higher speedup and efficiency.
Higher Speedup doesn't always mean higher Efficiency:
You can achieve high speedup by using a large number of cores, but if some cores are idle or spending more time
on communication than computation, the overall efficiency might be lower.
Ideal Scenario:
The goal is to achieve high speedup and efficiency by efficiently utilizing all available cores and minimizing
overhead associated with parallel execution.
Important Terms and Correlations
Scaling Limitations: Adding more cores might not always translate to a proportional
increase in speed. There's a point of diminishing returns where the overhead of
coordinating multiple cores outweighs the benefits of parallelism.
Overhead and Coordination: When a program is executed across multiple cores,
there's additional overhead involved in managing the threads, synchronizing their
work, and transferring data between them. This overhead can become significant
when the number of cores increases, especially for applications that are not naturally
parallel.
Memory Contention: Multiple cores often need to access the same shared memory,
leading to contention for resources. This can slow down execution as cores have to
wait for their turn to access the memory.
Data Transfer: When a program is split across multiple cores, it often needs to send
data back and forth between them. This data transfer can become a bottleneck,
especially if the data is large or the communication network is slow.
Important Terms and Correlations
"Sweet Spot" for Cores:
The optimal number of cores for a given application can vary. In many cases, using
4 cores or a similar number can be the sweet spot, where the benefits of parallelism
outweigh the overhead. However, this can also depend on the specific application
and hardware.
Scaling Studies:
Scaling studies are experiments that measure the performance of parallel code as
the number of cores or processors is increased. These studies can help identify the
point where adding more cores stops providing performance benefits or even causes
performance degradation.
Performance Analysis Tools
Tools for Performance Analysis and Tuning:
Profiling Tools: Help identify performance bottlenecks by measuring execution
time, resource utilization, and other relevant metrics (see the sketch after this list).
Debugging Tools: Help identify and fix bugs that can impact performance.
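A minimal profiling sketch using Python's built-in cProfile module (an illustrative
assumption; compiled HPC codes would typically use tools such as gprof or VTune):

import cProfile
import pstats

def slow_sum(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

def workload():
    for _ in range(50):
        slow_sum(10_000)

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Report the functions where the most time is spent (the likely bottlenecks).
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)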
Tuning Methodology
Baseline Measurement – Run the unoptimized version and collect metrics.
Profiling – Identify bottlenecks using tools (e.g., VTune, gprof).
Hypothesis – Predict the cause (e.g., "MPI communication is too slow").
Optimization – Apply fixes (e.g., switch to non-blocking MPI).
Validation – Re-run and compare performance.
Iterate – Repeat until performance goals are met.
Example: MPI Tuning on SGI Origin (CC-NUMA)
Problem: MPI on shared memory still has overhead due to synchronization.
Optimizations:
Use shared-memory-aware MPI (e.g., MPI_Comm_split_type with
MPI_COMM_TYPE_SHARED).
Hybrid MPI+OpenMP – Reduce MPI processes per node, use threads for intra-node
parallelism.
NUMA-aware data placement – Bind processes to cores near their memory.
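A hedged mpi4py sketch of the first optimization above: COMM_WORLD is split into one
sub-communicator per shared-memory node, the Python counterpart of
MPI_Comm_split_type with MPI_COMM_TYPE_SHARED:

from mpi4py import MPI

world = MPI.COMM_WORLD
rank = world.Get_rank()

# One sub-communicator per shared-memory node; messages inside it can take
# the shared-memory path instead of the network interconnect.
node_comm = world.Split_type(MPI.COMM_TYPE_SHARED)

print(f"World rank {rank} is local rank {node_comm.Get_rank()} "
      f"of {node_comm.Get_size()} ranks on this node")

node_comm.Free()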
Tuning and Optimization Techniques
Optimization Goals
Execution Time
Get the answer as fast as possible.
Throughput
Get as many happy users as possible.
Utilization
Get the best exploitation of the invested money.
Performance Analysis and Tuning is based on:
– measurements of performance data
– evaluation of performance data
– code or policy changes based on analysis results
Tuning and Optimization Techniques
Algorithm Optimization:
Algorithm Selection: Choose algorithms that are well-suited for parallel execution.
Data Partitioning: Divide the data in a way that minimizes communication overhead
and maximizes load balancing.
Code Optimization:
Reduce Communication: Minimize communication between processors by assigning
communicating tasks to the same process or using efficient communication protocols.
Improve Load Balancing: Distribute the workload evenly among processors.
Optimize Synchronization: Use efficient synchronization primitives and minimize the
amount of synchronization required.
Hardware Optimization:
Choose Appropriate Hardware: Select processors and interconnects that are well-suited
for parallel execution.
Optimize Memory Access: Ensure efficient memory access patterns to avoid
bottlenecks.
Conclusion