HPC Manual 2022-23
LAB MANUAL
FOR
B.E. COMPUTER (SEM – VIII)
LABORATORY PRACTICE-V
SUBJECT CODE: 410255
PR: 50 Marks
1. A] Design and implement Parallel Breadth First Search based on existing algorithms
using OpenMP. Use a Tree or an undirected graph for BFS.
B] Design and implement Parallel Depth First Search based on existing algorithms
using OpenMP. Use a Tree or an undirected graph for DFS.
2. A] Write a program to implement Parallel Bubble Sort. Use existing algorithms and
measure the performance of sequential and parallel algorithms.
B] Write a program to implement Parallel Merge Sort. Use existing algorithms and
measure the performance of sequential and parallel algorithms.
3. Implement Min, Max, Sum and Average operations using Parallel Reduction.
OPERATING SYSTEM: Latest 64-bit version of open-source Linux or its derivatives, or
64-bit Microsoft Windows 7/8 onwards.
Title: - Design and implement Parallel Breadth First Search based on existing algorithms using
OpenMP. Use a Tree or an undirected graph for BFS.
__________________________________________________________________________________________________________
Theory:
What is BFS?
BFS stands for Breadth-First Search. It is a graph traversal algorithm used to explore all the nodes of a
graph or tree systematically, starting from the root node or a specified starting point, and visiting all
the neighboring nodes at the current depth level before moving on to the next depth level.
The algorithm uses a queue data structure to keep track of the nodes that need to be visited, and
marks each visited node to avoid processing it again. The basic idea of the BFS algorithm is to visit all
the nodes at a given level before moving on to the next level, which ensures that all the nodes are
visited in breadth-first order.
BFS is commonly used in many applications, such as finding the shortest path between two nodes,
solving puzzles, and searching through a tree or graph.
Example of BFS
Now let’s take a look at the steps involved in traversing a graph by using Breadth-First Search:
Step 1: Initialize an empty Queue and mark all the nodes as not visited.
Step 2: Select a starting node (visiting a node) and insert it into the Queue.
Step 3: Provided that the Queue is not empty, extract the node from the Queue and insert its unvisited
child nodes (exploring a node) into the Queue.
Step 4: Repeat Step 3 until the Queue is empty.
Concept of OpenMP
● Parallel BFS (Breadth-First Search) is an algorithm used to explore all the nodes of a graph or
tree systematically in parallel. It is a popular parallel algorithm used for graph traversal in
distributed computing, shared-memory systems, and parallel clusters.
● The parallel BFS algorithm starts by selecting a root node or a specified starting point, and
then assigning it to a thread or processor in the system. Each thread maintains a local queue
of nodes to be visited and marks each visited node to avoid processing it again.
● The algorithm then proceeds in levels, where each level represents a set of nodes that are at a
certain distance from the root node. Each thread processes the nodes in its local queue at the
current level, and then exchanges the nodes that are adjacent to the current level with other
threads or processors. This is done to ensure that the nodes at the next level are visited by the
next iteration of the algorithm.
● The parallel BFS algorithm uses two phases: the computation phase and the communication
phase. In the computation phase, each thread processes the nodes in its local queue, while in
the communication phase, the threads exchange the nodes that are adjacent to the current
level with other threads or processors.
● The parallel BFS algorithm terminates when all nodes have been visited or when a specified
node has been found. The result of the algorithm is the set of visited nodes or the shortest
path from the root node to the target node.
● Parallel BFS can be implemented using different parallel programming models, such as
OpenMP, MPI, CUDA, and others. The performance of the algorithm depends on the number of
threads or processors used, the size of the graph, and the communication overhead between
the threads or processors.
Questions:
1. What is BFS?
2. What is OpenMP? What is its significance in parallel programming?
3. Write down applications of Parallel BFS
4. How can BFS be parallelized using OpenMP? Describe the parallel BFS algorithm using
OpenMP.
5. Write down the commands and directives used in OpenMP.
Title :- Design and implement Parallel Depth First Search based on existing algorithms using OpenMP.
Use a Tree or an undirected graph for DFS.
__________________________________________________________________________________________________________
THEORY:
What is DFS?
DFS stands for Depth-First Search. It is a popular graph traversal algorithm that starts at the root
node and explores as far as possible along each branch before backtracking; backtracking is done to
explore the next branch that has not yet been explored. Unlike BFS, DFS does not in general find the
shortest path between two vertices, but it can be used to find a path between two vertices or to
traverse a graph in a systematic way.
DFS can be implemented using either a recursive or an iterative approach. The recursive approach is
simpler to implement but can lead to a stack overflow error for very large graphs. The iterative
approach uses a stack to keep track of nodes to be explored and is preferred for larger graphs.
DFS can also be used to detect cycles in a graph. If a cycle exists in a graph, the DFS algorithm will
eventually reach a node that has already been visited, indicating that a cycle exists.
A standard DFS implementation puts each vertex of the graph into one of two categories:
1. Visited
2. Not Visited
The purpose of the algorithm is to mark each vertex as visited while avoiding cycles.
Concept of OpenMP
● Parallel Depth-First Search (DFS) is an algorithm that explores the depth of a graph structure
to search for nodes. In contrast to a serial DFS algorithm that explores nodes in a sequential
manner, parallel DFS algorithms explore nodes in a parallel manner, providing a significant
speedup in large graphs.
● Parallel DFS works by dividing the graph into smaller subgraphs that are explored
simultaneously. Each processor or thread is assigned a subgraph to explore, and they work
independently to explore the subgraph using the standard DFS algorithm. During the
exploration process, the nodes are marked as visited to avoid revisiting them.
● To explore the subgraph, the processors maintain a stack data structure that stores the nodes
in the order of exploration. The top node is picked and explored, and its adjacent nodes are
pushed onto the stack for further exploration. The stack is updated concurrently by the
processors as they explore their subgraphs.
● Parallel DFS can be implemented using several parallel programming models such as
OpenMP, MPI, and CUDA. In OpenMP, the #pragma omp parallel for directive is used to
distribute the work among multiple threads. By using this directive, each thread operates on a
different part of the graph, which increases the performance of the DFS algorithm.
Question
1. What is DFS?
2. Write a parallel Depth First Search (DFS) algorithm using OpenMP
3. What is the advantage of using parallel programming in DFS?
4. How can you parallelize a DFS algorithm using OpenMP?
5. What is a race condition in parallel programming, and how can it be avoided in OpenMP?
Title: Write a program to implement Parallel Bubble Sort. Use existing algorithms and measure the
performance of sequential and parallel algorithms.
_________________________________________________________________________________________________________
THEORY:
Bubble Sort is a simple sorting algorithm that works by repeatedly swapping adjacent elements if they
are in the wrong order. It is called "bubble" sort because the algorithm moves the larger elements
towards the end of the array in a manner that resembles the rising of bubbles in a liquid.
The time complexity of Bubble Sort is O(n^2), which makes it inefficient for large lists. However, it has
the advantage of being easy to understand and implement, and it is useful for educational purposes
and for sorting small datasets.
Bubble Sort has limited practical use in modern software development due to its inefficient time
complexity of O(n^2) which makes it unsuitable for sorting large datasets. However, Bubble Sort has
some advantages and use cases that make it a valuable algorithm to understand, such as:
1. Simplicity: Bubble Sort is one of the simplest sorting algorithms, and it is easy to understand
and implement. It can be used to introduce the concept of sorting to beginners and as a basis
for more complex sorting algorithms.
2. Educational purposes: Bubble Sort is often used in academic settings to teach the principles
of sorting algorithms and to help students understand how algorithms work.
3. Small datasets: For very small datasets, Bubble Sort can be an efficient sorting algorithm, as its
overhead is relatively low.
4. Partially sorted datasets: If a dataset is already partially sorted, Bubble Sort can be very
efficient. Since Bubble Sort only swaps adjacent elements that are in the wrong order, it has a
low number of operations for a partially sorted dataset.
5. Performance optimization: Although Bubble Sort itself is not suitable for sorting large
datasets, some of its techniques can be used in combination with other sorting algorithms to
optimize their performance. For example, Bubble Sort can be used to optimize the performance
of Insertion Sort by reducing the number of comparisons needed.
GES RHSCOEMSR, Computer Dept.
Concept of OpenMP
When measuring the performance of the parallel Bubble sort algorithm, you will need to
specify the number of threads to use. You can experiment with different numbers of
threads to find the optimal value for your system.
1. top: The top command provides a real-time view of system resource usage, including CPU
utilization and memory consumption. To use it, open a terminal window and type top. The
output will display a list of processes sorted by resource usage, with the most resource-intensive
processes at the top.
2. htop: htop is a more advanced version of top that provides additional features, such as
interactive process filtering and a color-coded display. To use it, open a terminal window and
type htop.
Conclusion- In this way we can implement Bubble Sort in a parallel way using OpenMP, and we also
learn how to measure the performance of serial and parallel algorithms.
Questions:-
Title: - Write a program to implement Parallel Merge Sort. Use existing algorithms and measure
the performance of sequential and parallel algorithms.
_________________________________________________________________________________________________________
THEORY:
Merge sort is a sorting algorithm that uses a divide-and-conquer approach to sort an array or a list of
elements. The algorithm works by recursively dividing the input array into two halves, sorting each
half, and then merging the sorted halves to produce a sorted output.
The merge sort algorithm has the following key aspects:
● The merging step is where the bulk of the work happens in merge sort. The algorithm compares
the first elements of each sorted half, selects the smaller element, and appends it to the output
array. This process continues until all elements from both halves have been appended to the
output array.
● The time complexity of merge sort is O(n log n), which makes it an efficient sorting algorithm
for large input arrays. However, merge sort also requires additional memory to store the output
array, which can make it less suitable for use with limited memory resources.
● In simple terms, we can say that the process of merge sort is to divide the array into two halves,
sort each half, and then merge the sorted halves back together. This process is repeated until
the entire array is sorted.
● One thing that you might wonder is what is the specialty of this algorithm. We already have a
number of sorting algorithms then why do we need this algorithm? One of the main advantages
of merge sort is that it has a time complexity of O(n log n), which means it can sort large arrays
relatively quickly. It is also a stable sort, which means that the order of elements with equal
values is preserved during the sort.
● Merge sort is a popular choice for sorting large datasets because it is relatively efficient and
easy to implement. It is often used in conjunction with other algorithms, such as quicksort, to
improve the overall performance of a sorting routine.
Now, let's see the working of the merge sort algorithm. To understand it, let's take an unsorted array
of eight elements; it is easier to understand merge sort via an example.
● First, divide the array into two halves of size 4.
● Now, again divide these two arrays into halves. As they are of size 4, divide them into new
arrays of size 2.
● Now, again divide these arrays to get the atomic values that cannot be further divided.
● Then merge the sorted sub-arrays pairwise, in the same order in which they were broken down.
● Finally, there is a final merging of the arrays. After the final merging, the array is completely
sorted.
Concept of OpenMP
● The parallel merge sort algorithm divides the array among threads, sorts each part concurrently
using the steps above, and then merges the sorted parts.
There are several metrics that can be used to measure the performance of sequential and
parallel merge sort algorithms:
1. Execution time: Execution time is the amount of time it takes for the algorithm to
complete its sorting operation. This metric can be used to compare the speed of
sequential and parallel merge sort algorithms.
2. Speedup: Speedup is the ratio of the execution time of the sequential merge sort
algorithm to the execution time of the parallel merge sort algorithm. A speedup of
greater than 1 indicates that the parallel algorithm is faster than the sequential
algorithm.
3. Efficiency: Efficiency is the ratio of the speedup to the number of processors or cores
used in the parallel algorithm. This metric can be used to determine how well the
parallel algorithm is utilizing the available resources.
4. Scalability: Scalability is the ability of the algorithm to maintain its performance as the
input size and number of processors or cores increase. A scalable algorithm will
maintain a consistent speedup and efficiency as more resources are added.
To measure the performance of sequential and parallel merge sort algorithms, you can perform
experiments on different input sizes and numbers of processors or cores. By measuring the
execution time, speedup, efficiency, and scalability of the algorithms under different conditions,
you can determine which algorithm is more efficient for different input sizes and hardware
configurations. Additionally, you can use profiling tools to analyze the performance of the
algorithms and identify areas for optimization
Conclusion- In this way we can implement Merge Sort in a parallel way using OpenMP, and we also
learn how to measure the performance of serial and parallel algorithms.
Questions:-
Title :- Implement Min, Max, Sum and Average operations using Parallel Reduction.
__________________________________________________________________________________________________________
THEORY:
Parallel Reduction:-
Here's a function-wise manual on how to understand and run the sample C++ program that
demonstrates how to implement Min, Max, Sum, and Average operations using parallel
reduction.
1. Min_Reduction function
The function takes in a vector of integers as input and finds the minimum value in the
vector using parallel reduction.
The OpenMP reduction clause is used with the "min" operator to find the minimum
value across all threads.
The minimum value found by each thread is reduced to the overall minimum value of
the entire array.
The final minimum value is printed to the console.
2. Max_Reduction function
The function takes in a vector of integers as input and finds the maximum value in the
vector using parallel reduction.
The OpenMP reduction clause is used with the "max" operator to find the maximum
value across all threads.
The maximum value found by each thread is reduced to the overall maximum value of
the entire array.
The final maximum value is printed to the console.
3. Sum_Reduction function
The function takes in a vector of integers as input and finds the sum of all the values in
the vector using parallel reduction.
The OpenMP reduction clause is used with the "+" operator to find the sum across all
threads.
The sum found by each thread is reduced to the overall sum of the entire array.
The final sum is printed to the console.
4. Average_Reduction function
The function takes in a vector of integers as input, computes the sum of all the values
using the OpenMP reduction clause with the "+" operator, and divides it by the number
of elements to obtain the average.
The final average value is printed to the console.
5. Main Function
a. The function initializes a vector of integers with some values.
b. The function calls the min_reduction, max_reduction,
sum_reduction, and average_reduction functions on the input vector to
find the corresponding values.
c. The final minimum, maximum, sum, and average values are printed to the console.
6. Compiling and running the program
Compile the program: You need to use a C++ compiler that supports OpenMP, such
as g++ or clang. Open a terminal and navigate to the directory where your program is
saved. Then, compile the program using the following command:
$ g++ -fopenmp program.cpp -o program
This command compiles your program and creates an executable file named
"program". The "-fopenmp" flag tells the compiler to enable OpenMP.
Run the program: To run the program, simply type the name of the executable file in
the terminal and press Enter:
$ ./program
Conclusion: We have implemented the Min, Max, Sum, and Average operations
using parallel reduction in C++ with OpenMP.
Questions:
1. What are the benefits of using parallel reduction for basic operations on large arrays?
2. How does OpenMP's "reduction" clause work in parallel reduction?
3. How do you set up a C++ program for parallel computation with OpenMP?
4. What are the performance characteristics of parallel reduction, and how do they
vary based on input size?
5. How can you modify the provided code example for more complex operations using
parallel reduction?
Title: Write a CUDA program for the addition of two large vectors.
__________________________________________________________________________________________________________
THEORY:
What is CUDA?
CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming
model developed by NVIDIA that allows developers to use the GPU for general-purpose computing.
Steps for Vector Addition using CUDA
1. Define the kernel: In this step, you need to write a CUDA kernel function that performs
the element-wise addition of the two vectors, with each thread computing one element of
the result. You can use the __global__ qualifier to define the kernel.
2. Allocate memory on the host: In this step, you need to allocate memory on the host for
the two vectors that you want to add and for the result vector. You can use the C malloc
function to allocate memory.
3. Initialize the vectors: In this step, you need to initialize the two vectors that you want to
add on the host. You can use a loop to fill the vectors with data.
4. Allocate memory on the device: In this step, you need to allocate memory on the device
for the two vectors that you want to add and for the result vector. You can use the
CUDA function cudaMalloc to allocate memory.
5. Copy the input vectors from host to device: In this step, you need to copy the two input
vectors from the host to the device memory. You can use the CUDA function
cudaMemcpy to copy the vectors.
6. Launch the kernel: In this step, you need to launch the CUDA kernel that performs the
vector addition on the device. You can use the <<<...>>> syntax to specify the number
of blocks and threads to use.
7. Copy the result vector from device to host: In this step, you need to copy the result
vector from the device memory to the host memory. You can use the CUDA function
cudaMemcpy to copy the result vector.
8. Free memory on the device: In this step, you need to free the memory that was
allocated on the device. You can use the CUDA function cudaFree to free the memory.
9. Free memory on the host: In this step, you need to free the memory that was allocated
on the host. You can use the C free function to free the memory.
Compiling the program with nvcc and running the resulting executable will execute the program
and perform the addition of the two large vectors.
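The steps above can be sketched as a minimal CUDA C program. This is an illustrative sketch with our own names (vec_add, the vector size), it requires an NVIDIA GPU and the nvcc compiler, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Step 1: kernel - each thread adds one pair of elements.
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;                     // one million elements
    size_t bytes = n * sizeof(float);

    // Steps 2-3: allocate and initialize host memory.
    float *h_a = (float*)malloc(bytes), *h_b = (float*)malloc(bytes),
          *h_c = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

    // Steps 4-5: allocate device memory and copy the inputs to the device.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Step 6: launch the kernel with enough blocks to cover all n elements.
    int threads = 256, blocks = (n + threads - 1) / threads;
    vec_add<<<blocks, threads>>>(d_a, d_b, d_c, n);

    // Step 7: copy the result back to the host.
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[10] = %f\n", h_c[10]);           // 10 + 20 = 30

    // Steps 8-9: free device and host memory.
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
}
```

Compile and run with, for example: nvcc vector_add.cu -o vector_add && ./vector_add.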
Title: Write a CUDA program for Matrix Multiplication.
__________________________________________________________________________________________________________
Theory:
What is CUDA?
CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming
model developed by NVIDIA.
Using CUDA, developers can exploit the massive parallelism and high computational power of
GPUs to accelerate computationally intensive tasks, such as matrix operations, image
processing, and deep learning. CUDA has become an important tool for scientific research and
is widely used in fields like physics, chemistry, biology, and engineering.
Steps for Matrix Multiplication using CUDA
Here are the steps for implementing matrix multiplication using CUDA C:
1. Matrix Initialization: The first step is to initialize the matrices that you want to
multiply. You can use standard C or CUDA functions to allocate memory for the
matrices and initialize their values. The matrices are usually represented as 2D arrays.
2. Memory Allocation: The next step is to allocate memory on the host and the device for
the matrices. You can use the standard C malloc function to allocate memory on the
host and the CUDA function cudaMalloc() to allocate memory on the device.
3. Data Transfer: The third step is to transfer data between the host and the device.
You can use the CUDA function cudaMemcpy() to transfer data from the host to
the device or vice versa.
4. Kernel Launch: The fourth step is to launch the CUDA kernel that will perform the
matrix multiplication on the device. You can use the <<<...>>> syntax to specify the
number of blocks and threads to use. Each thread in the kernel will compute one
element of the output matrix.
5. Device Synchronization: The fifth step is to synchronize the device to ensure
that all kernel executions have completed before proceeding. You can use the
CUDA function cudaDeviceSynchronize() to synchronize the device.
6. Data Retrieval: The sixth step is to retrieve the result of the computation from the
device to the host. You can use the CUDA function cudaMemcpy() to transfer data
from the device to the host.
7. Memory Deallocation: The final step is to deallocate the memory that was allocated on
the host and the device. You can use the C free function to deallocate memory on the
host and the CUDA function cudaFree() to deallocate memory on the device.
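The steps above can be sketched as a minimal CUDA C program. As with the vector-addition sketch, the names (matmul, N, the 16x16 block size) are our own illustrative choices, a GPU and nvcc are required, and error checking is omitted.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Kernel: each thread computes one element C[row][col] of the N x N product.
__global__ void matmul(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}

int main() {
    const int N = 512;
    size_t bytes = N * N * sizeof(float);

    // Steps 1-2: initialize matrices and allocate host memory.
    float *h_A = (float*)malloc(bytes), *h_B = (float*)malloc(bytes),
          *h_C = (float*)malloc(bytes);
    for (int i = 0; i < N * N; ++i) { h_A[i] = 1.0f; h_B[i] = 2.0f; }

    // Steps 2-3: allocate device memory and transfer the inputs.
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytes); cudaMalloc(&d_B, bytes); cudaMalloc(&d_C, bytes);
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    // Steps 4-5: launch the kernel on a 2D grid, then synchronize.
    dim3 threads(16, 16);
    dim3 blocks((N + 15) / 16, (N + 15) / 16);
    matmul<<<blocks, threads>>>(d_A, d_B, d_C, N);
    cudaDeviceSynchronize();

    // Step 6: retrieve the result; every element should equal 2 * N.
    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);
    printf("C[0] = %f (expected %f)\n", h_C[0], 2.0f * N);

    // Step 7: deallocate device and host memory.
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
}
```

With h_A all ones and h_B all twos, each output element is a dot product of length N with value 2N, which gives a quick correctness check without a reference CPU implementation.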
Questions:
1. What are the advantages of using CUDA to perform matrix multiplication compared to
using a CPU?