HPC Lab Manual

Faculty of Engineering & Technology

High Performance Computing (203105430)


B. Tech CSE 4th Year 7th Semester

PRACTICAL: 01

AIM: Study the facilities provided by Google Colab.

Theory: -
 Features of Google Colab:
 Free Access to GPUs and TPUs: Google Colab offers
free access to NVIDIA GPUs and Google TPUs, which
significantly accelerates machine learning tasks.
 Jupyter Notebook Interface: It uses a familiar Jupyter
notebook interface, which is highly popular among data
scientists and researchers.
 Easy Collaboration: Notebooks can be easily shared and
collaborated on, similar to Google Docs.
 Pre-installed Libraries: Many popular Python libraries
are pre-installed, making it convenient to start working on
projects without needing to install dependencies.
 Integration with Google Drive: It integrates seamlessly
with Google Drive, making it easy to save and load files.
 Code Snippets and Examples: Colab provides various
code snippets and examples to help users get started
quickly.
 Interactive Visualizations: Supports interactive
visualizations and can display charts and graphs directly
within the notebook.
 Markdown Support: Supports Markdown for rich text
formatting within notebooks.

 Explain in Detail the Use of GPU on Google Colab by Changing Runtime Type:
 Open Google Colab: Go to Google Colab and open a new
notebook or an existing one.
 Change Runtime Type:
 Click on Runtime in the menu bar.
 Select Change runtime type.
 In the dialog that appears, under Hardware
accelerator, choose GPU.
 Click Save.
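
Once the GPU runtime is active, you can verify it from a notebook cell. Below is a minimal sketch using the same %%writefile-and-nvcc workflow as the later practicals (running !nvidia-smi in a cell is a quicker alternative); the file name gpu_check.cu is illustrative:

%%writefile gpu_check.cu
#include <stdio.h>
#include <cuda_runtime.h>

// Lists the CUDA devices the Colab runtime exposes.
int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("CUDA devices visible: %d\n", count);
    for (int d = 0; d < count; d++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d: %s with %d multiprocessors\n",
               d, prop.name, prop.multiProcessorCount);
    }
    return 0;
}

Compile and run with !nvcc gpu_check.cu -o gpu_check followed by !./gpu_check.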

Code: -
Python Program:

if __name__ == "__main__":
    for i in range(1, 11):
        print("2 X", i, "=", 2 * i)
Output:

C Program:

%%writefile C_program.c
#include <stdio.h>

int main(){
    printf("Welcome to Google Colab...\n");
    return 0;
}

!nvcc C_program.c
!ls
!./a.out

Output:

PRACTICAL: 02

AIM: Demonstrate basic Linux Commands.

Theory: -
 ls
Description: Lists files and directories in the current directory.

 pwd
Description: Prints the current working directory.

 cd
Description: Changes the current directory.

 mkdir
Description: Creates a new directory.

 rm
Description: Removes files.

 rmdir
Description: Removes empty directories.

 mv
Description: Moves or renames files or directories.

 cat
Description: Concatenates and displays file content.

 head
Description: Displays the first few lines of a file.


 tail
Description: Displays the last few lines of a file.

 du -h
Description: Estimates file space usage; -h prints sizes in human-readable units (K, M, G).

 touch
Description: Creates an empty file or updates the access and
modification times of an existing file.

 echo
Description: Prints a line of text; with output redirection (echo "text" > file.txt) it creates a file containing that text.

 printf
Description: Prints formatted text; with output redirection it can write formatted text to a file.


 vi
Description: Opens a text editor to create or edit files.

 ps
Description: Displays a snapshot of the current processes.


 whoami
Description: Displays the current user’s name.

 uname
Description: Displays system information.

 date
Description: Displays the current date and time.


PRACTICAL: 03

AIM: Using Divide and Conquer Strategies, design a class for Concurrent Quick Sort using C++.

Theory: -
1. What is Divide and Conquer?
Divide and Conquer is a problem-solving technique that solves a problem by dividing it into subproblems, solving those individually, and then merging their results to obtain the solution to the original problem.

2. Advantages of Divide and Conquer:


1. Efficiency: Many divide and conquer algorithms, such as merge sort and quick sort on average, run in O(n log n) time, which is more efficient than quadratic alternatives for large datasets.
2. Simplicity: Divide and conquer algorithms are often easy
to understand and implement.
3. Parallelizability: Divide and conquer algorithms can be
easily parallelized, as each subproblem can be solved
independently.
4. Cache-friendliness: Divide and conquer algorithms tend to
have good cache performance, as they access data in a
predictable pattern.

3. Disadvantages of Divide and Conquer:


1. Recursion overhead: Divide and conquer algorithms use
recursion, which can lead to significant overhead in
terms of stack space and function calls.
2. Not suitable for all problems: Divide and conquer
algorithms are not suitable for all types of problems.
They are most effective for problems that can be
recursively divided into smaller subproblems.
3. Limited memory efficiency: Divide and conquer
algorithms can require a significant amount of memory,
as they create multiple copies of the input data.
4. Difficult to analyze: The time and space complexity of
divide and conquer algorithms can be difficult to
analyze, especially for complex problems.

4. What is Quick Sort?


Quick Sort is a sorting algorithm based on the Divide and
Conquer algorithm that picks an element as a pivot and
partitions the given array around the picked pivot by
placing the pivot in its correct position in the sorted array.

5. Algorithm for Quick Sort.


QUICKSORT(A, low, high)
    if low < high then
        pivotIndex = PARTITION(A, low, high)
        QUICKSORT(A, low, pivotIndex - 1)
        QUICKSORT(A, pivotIndex + 1, high)

PARTITION(A, low, high)
    pivot = A[low]
    i = low + 1
    j = high

    while true do
        while i <= high and A[i] <= pivot do
            i = i + 1
        end while
        while j >= low and A[j] > pivot do
            j = j - 1
        end while
        if i >= j then
            break
        end if
        SWAP(A[i], A[j])
    end while

    SWAP(A[low], A[j])
    return j


Code: -
#include <iostream>
#include <vector>
using namespace std;

int partition(vector<int>& arr, int low, int high) {
    int pivot = arr[high];
    int i = low - 1;
    for (int j = low; j < high; j++) {
        if (arr[j] < pivot) {
            i++;
            swap(arr[i], arr[j]);
        }
    }
    i++;
    swap(arr[i], arr[high]);
    return i;
}

void quickSort(vector<int>& arr, int low, int high) {
    if (low < high) {
        int pindex = partition(arr, low, high);
        quickSort(arr, low, pindex - 1);
        quickSort(arr, pindex + 1, high);
    }
}

int main() {
    int n;
    cout << "Enter number of elements: ";
    cin >> n;

    vector<int> arr(n);
    cout << "Enter elements:\n";
    for (int i = 0; i < n; i++) {
        cin >> arr[i];
    }

    quickSort(arr, 0, n - 1);

    cout << "Sorted array is: ";
    for (int i = 0; i < n; i++) {
        cout << arr[i] << " ";
    }
    cout << endl;

    return 0;
}

Output:
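
The code above sorts sequentially. Because the two recursive calls of quick sort operate on disjoint halves of the array, they can safely run in parallel, which is what the AIM's concurrency requirement points at. Below is a minimal sketch of such a class, assuming a C++11 or later compiler with thread support (compile with g++ -std=c++17 -pthread). The class name ConcurrentQuickSort and the depth limit are illustrative, not from the manual:

#include <algorithm>
#include <future>
#include <iostream>
#include <vector>
using namespace std;

class ConcurrentQuickSort {
public:
    void sort(vector<int>& arr) {
        sortRange(arr, 0, (int)arr.size() - 1, 0);
    }

private:
    static const int kMaxDepth = 3; // below this depth, recurse sequentially

    static int partition(vector<int>& arr, int low, int high) {
        int pivot = arr[high];
        int i = low - 1;
        for (int j = low; j < high; j++) {
            if (arr[j] < pivot) {
                i++;
                swap(arr[i], arr[j]);
            }
        }
        swap(arr[i + 1], arr[high]);
        return i + 1;
    }

    void sortRange(vector<int>& arr, int low, int high, int depth) {
        if (low >= high) return;
        int p = partition(arr, low, high);
        if (depth < kMaxDepth) {
            // The two halves touch disjoint index ranges, so no locking is needed.
            auto left = async(launch::async,
                              [&] { sortRange(arr, low, p - 1, depth + 1); });
            sortRange(arr, p + 1, high, depth + 1);
            left.get();
        } else {
            sortRange(arr, low, p - 1, depth + 1);
            sortRange(arr, p + 1, high, depth + 1);
        }
    }
};

int main() {
    vector<int> arr = {9, 3, 7, 1, 8, 2, 6, 5, 4};
    ConcurrentQuickSort sorter;
    sorter.sort(arr);
    for (int x : arr) cout << x << " ";
    cout << endl;
    return 0;
}

Capping the spawn depth bounds the number of worker threads at roughly 2^kMaxDepth while still demonstrating the parallelizability advantage discussed in the theory above.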


PRACTICAL: 04

AIM: Write a program on an unloaded cluster for several different numbers of nodes and record the time taken in each case. Draw a graph of execution time against the number of nodes.

Theory: -
HPC cluster: -
An HPC cluster, or high-performance computing cluster, is
a combination of specialized hardware, including a group
of large and powerful computers, and a distributed
processing software framework configured to handle
massive amounts of data at high speeds with parallel
performance and high availability.

Components of HPC cluster: -


 Compute hardware: -
Compute hardware includes servers, storage, and a
dedicated network. Typically, you will need to
provision at least three servers that function as
primary, worker, and client nodes. With such a
limited setup, you’ll need to invest in high-end
servers with ample processors and storage for more
compute capacity in each.
 Software: -
The software layer includes the tools you intend to
use to monitor, provision, and manage your HPC
cluster. Software stacks comprise libraries,
compilers, debuggers, and file systems as well to
execute cluster management functions.
 Facilities: -
To house your HPC cluster, you need actual physical
floor space to hold and support the weight of racks of
servers, which can include up to 72 blade-style
servers and five top-of-rack switches weighing in at
up to 1,800 pounds. You also must have enough
power to operate and cool the servers, which can
demand up to 43 kW.

How to build the HPC cluster: -


While building an HPC cluster is fairly straightforward, it
requires an organization to understand the level of
compute power needed on a daily basis to determine the
setup. You need to carefully assess questions such as: how
many servers are required; what software layer can handle
the workloads efficiently; where the cluster will be
housed; and what the system’s power and cooling
requirements are. Once these are decided, you can proceed
with building the cluster, following the steps listed below:


Build a compute node:
Configure a head node by installing tools for
monitoring and resource management as well as
high-speed interconnect drivers/software. Create a
shared cluster directory, capture an image of the
compute node, and clone the image out to the rest of
the cluster that will run the workloads.
Configure IP addresses:
For peak efficiency, HPC clusters contain a high-
speed interconnect network that uses a dedicated IP
subnet. As you connect worker nodes to the head
node, you will assign additional IP addresses for
each node.
Configure jobs as CMU user groups:
As workloads arrive in the queue, you will need a
script to dynamically create CMU user groups for
each currently running job.


Code: -
import time

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = [[1, 2], [1, 4], [1, 0],
     [4, 2], [4, 0], [4, 4],
     [4, 5], [0, 2], [5, 5]]

nodes = [1, 2, 3, 4, 5]
time_taken = []
for n in nodes:
    start_time = time.time()
    kmeans = KMeans(n_clusters=n)
    kmeans.fit(X)
    end_time = time.time()
    time_taken.append(end_time - start_time)

plt.plot(nodes, time_taken, marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('Time Taken (seconds)')
plt.title('Time Taken vs Number of Clusters')
plt.grid(True)
plt.show()

Output:


Code: -
%%writefile cluster.cu
#include <iostream>
#include <thrust/random.h>
#include <thrust/device_vector.h>
#include <chrono>

// Generates a random edge list: num_edges (src, dst) pairs appended to adj_list.
void generate_graph(int num_nodes, int num_edges,
                    thrust::device_vector<int> &adj_list) {
    thrust::default_random_engine rng;
    thrust::uniform_int_distribution<int> dist(0, num_nodes - 1);

    for (int i = 0; i < num_edges; i++) {
        int src = dist(rng);
        int dst = dist(rng);
        adj_list.push_back(src);
        adj_list.push_back(dst);
    }
}

int main() {
    // Generate graphs with increasing numbers of nodes and time each run.
    for (int num_nodes = 10; num_nodes <= 500; num_nodes += 50) {
        int num_edges = num_nodes * 5; // assuming 5 edges per node

        thrust::device_vector<int> adj_list;  // start empty...
        adj_list.reserve(num_edges * 2);      // ...and reserve space up front

        auto start = std::chrono::high_resolution_clock::now();
        generate_graph(num_nodes, num_edges, adj_list);
        auto end = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double> diff = end - start;

        std::cout << "Graph with " << num_nodes << " nodes generated in "
                  << diff.count() << " seconds" << std::endl;
    }

    return 0;
}

Output:


PRACTICAL: 05

AIM: Write a program to check task distribution using Gprof.

Theory: -
Introduction to Profiler:
Profilers, which are programs themselves, analyze target programs by collecting information about their execution. Based on how they collect this information and at what granularity, they are classified as event-based or statistical (sampling) profilers.

Introduction to Gprof profiler:


Gprof is a popular profiler used in software development
primarily for analyzing the execution time of a program
and identifying performance bottlenecks. It is typically
used with programs written in C, C++, or Fortran, and is
part of the GNU Compiler Collection (GCC).

Features of Gprof:
1. Profiling Execution Time:
Gprof measures how much time each function in a
program spends executing. It provides detailed statistics
about:
1. Total time spent in each function.
2. Number of times each function is called.
3. Time spent in each function as a percentage of the
total program execution time.

2. Call Graph Analysis:


Gprof generates a call graph that illustrates how
functions are called and where time is spent in the
program. This helps developers visualize the flow of
execution and identify chains of function calls that
contribute significantly to overall runtime.

3. Sampling profiler:
Gprof estimates where time is spent by sampling the program counter at regular intervals, and combines this with the call counts inserted by the -pg instrumentation. Sampling keeps runtime overhead low compared to profilers that trace every event.

4. Integration with GCC:
Gprof is seamlessly integrated with GCC, the GNU
Compiler Collection, which means it can be used
directly during compilation and linking processes. It
requires compiling the program with specific flags (-pg
for profiling and -g for debugging symbols) to enable
profiling data collection.

5. Output and analysis:


After profiling, Gprof generates output files containing
detailed statistics and the call graph. Developers can
analyze this information using various tools provided by
Gprof or external visualization tools to pinpoint
performance hotspots and optimize the code
accordingly.

Simple Code and Gprof Steps: -


1. Code:
%%writefile looping.cpp
#include <iostream>
using namespace std;

int main() {
for (int i = 1; i <= 5; ++i) {
cout << i << " ";
}
return 0;
}


2. Steps:
 !g++ -Wall -pg looping.cpp -o looping
 !ls

 ! ./looping

 !ls

 !gprof looping gmon.out > looping.txt


 !cat looping.txt


Complex Code and Gprof Steps: -


3. Code:
%%writefile hat.cpp
#include <iostream>
using namespace std;

int main()
{
int i;
for (i = 0; i < 5; i++) {
cout << i << " ";
}
cout << endl;
i = 0;
while (i < 5) {
cout << i << " ";
i++;
}
return 0;
}

4. Steps:
 !g++ -Wall -pg hat.cpp -o hat
 !ls

 ! ./hat

 !gprof hat gmon.out > hat.txt

 !cat hat.txt
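
Both examples above finish in microseconds, so their flat profiles are mostly empty. A sketch with two deliberately CPU-heavy functions (the file and function names are illustrative) gives gprof something to attribute:

%%writefile work.cpp
#include <iostream>
using namespace std;

volatile double sink; // keeps the compiler from optimizing the loops away

void heavy() { // should dominate the flat profile
    for (long i = 0; i < 300000000L; i++) sink += i * 0.5;
}

void light() { // roughly one tenth of the work
    for (long i = 0; i < 30000000L; i++) sink += i * 0.5;
}

int main() {
    heavy();
    light();
    cout << "done" << endl;
    return 0;
}

Compile, run, and report with the same sequence as before (!g++ -Wall -pg work.cpp -o work, ! ./work, !gprof work gmon.out > work.txt, !cat work.txt); the flat profile should now attribute most of the samples to heavy().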


PRACTICAL: 06

AIM: Use Intel V-Tune Performance Analyzer for Profiling.

Theory: -
What's New in Intel® VTune™ Profiler: -

 GPU Accelerators
 Stall Factor Information in GPU Profiling Results
When you run the GPU Compute/Media Hotspots analysis
to profile applications running on Intel® Data Center GPU
Max Series (code named Ponte Vecchio) devices, you can
now see the reasons for stalls in Xe Vector Engines
(XVEs), formerly known as Execution Units (EUs). Use
this information to better understand and resolve the stalls
in your busiest computing tasks. For more information, see
Analyze Xe Vector Engine (XVE) Stalls

 Metric Groups for Multiple GPUs


When you run the GPU Compute/Media Hotspots analysis
to profile an application executing on multiple Intel GPUs,
you can now see metric information grouped by Intel
microarchitecture family. See metrics for every GPU
architecture family in a new consolidated view. To learn
more, see Analysis Results for Multiple GPUs.

Application Performance Snapshot:


 Updated Metrics for Multiple GPUs
GPU metric information in the Application Performance Snapshot HTML reports has been enhanced to better represent data collected from multiple GPUs.

 Histograms in Metric Tooltips


The metric tooltips in Application Performance Snapshot
HTML reports were enhanced with histograms that clearly
visualize the distribution of metric values observed during
analysis.

Enrollment No.: 210303105168


CHAUDHARY UMANGKUMAR BALUBHAI
P a g e | 24
Div: 25(CSE)
Faculty of Engineering & Technology
High Performance Computing (203105430)
B. Tech CSE 4th Year 7th Semester

Intel® VTune™ Profiler is a powerful tool for analyzing and


optimizing the performance of applications. Here’s a detailed
overview of how to use it for profiling:
1. Introduction to Intel® VTune™ Profiler
Intel® VTune™ Profiler is a performance analysis tool
that helps developers identify and fix performance
bottlenecks in their applications. It supports a wide range
of programming languages, including C, C++, Fortran,
Python, and more. The tool can profile applications
running on CPUs, GPUs, and FPGAs, making it versatile
for various computing environments.

2. Installation and Setup


To get started with Intel® VTune™ Profiler, you need to
install it as part of the Intel® oneAPI Base Toolkit or as a
standalone version. Here are the steps:
Download and Install: Visit the Intel® VTune™
Profiler download page and follow the instructions to
download and install the tool.
System Setup: For CPU and GPU profiling, ensure
that the necessary drivers and libraries are installed.
For example, on Linux, you might need to install the
Intel® Metric Discovery API Library for GPU
analysis.
3. Profiling an application
Intel® VTune™ Profiler provides various analysis types to
profile different aspects of your application:

a. CPU Hotspots Analysis


This analysis helps identify the most time-consuming
functions in your application. To perform a CPU
Hotspots Analysis:
Launch VTune Profiler: Open the VTune Profiler
GUI or use the command line tool.
Select Analysis Type: Choose “CPU Hotspots” from
the list of available analysis types.


Run Analysis: Specify the application to profile and start the analysis. VTune Profiler will collect data on CPU usage and highlight the functions that consume the most CPU time.
b. Microarchitecture Exploration
This analysis provides insights into how well your
application utilizes the CPU’s microarchitecture. It
helps identify issues like cache misses, branch
mispredictions, and more:
Select Analysis Type: Choose “Microarchitecture
Exploration” from the analysis types.
Run Analysis: Start the analysis and review the
results to identify microarchitectural bottlenecks.
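
For scripted or remote runs, the same analyses can also be launched from the VTune command-line interface instead of the GUI. A minimal sketch (the application path and the r000hs result-directory name are illustrative):
 vtune -collect hotspots -- ./my_app
 vtune -report summary -r r000hs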
4. Analyzing Results
After running an analysis, VTune Profiler provides
detailed reports that help you understand the performance
characteristics of your application:
Hotspots View: This view shows the functions that
consume the most CPU time, allowing you to focus
on optimizing these areas.
Timeline View: This view provides a timeline of
your application’s execution, showing CPU and GPU
activity over time.
Call Stack View: This view shows the call stack for
each hotspot, helping you understand the context in
which performance issues occur.
5. Optimizing Your Application
Based on the analysis results, you can make targeted
optimizations to improve your application’s performance:
Algorithm Optimization: Refactor or replace
inefficient algorithms with more efficient ones.
Parallelization: Use threading or parallel
programming techniques to take advantage of multi-
core processors.
Memory Optimization: Optimize memory access
patterns to reduce cache misses and improve data
locality.


6. Advanced Features
Intel® VTune™ Profiler offers several advanced features
for in-depth performance analysis:
Threading Analysis: Analyze the efficiency of your
application’s threading implementation and identify
synchronization issues.
I/O Analysis: Profile I/O operations to identify file
and network access bottlenecks.
GPU Offload Analysis: Analyze GPU-accelerated
code's performance and identify optimization
opportunities.
7. Conclusion
Intel® VTune™ Profiler is a comprehensive tool for
performance analysis and optimization. Using its various
analysis types and features, you can gain deep insights into
your application’s performance and make informed
decisions to optimize it.


PRACTICAL: 07

AIM: Analyze the code using Nvidia-Profilers.

Theory: -
What is a profiler:
A profiler is a software tool used in programming and
software development to measure and analyze the runtime
behavior of a program. Its primary purpose is to identify
bottlenecks and areas of inefficiency in the code, helping
developers optimize performance.

What is nvprof, and what are its key features?


nvprof is a command-line profiling tool provided by NVIDIA
for use with CUDA applications. It allows developers to
collect and view a wide variety of performance data for
CUDA-enabled GPUs, helping them to identify
performance bottlenecks and optimize their applications.

Key Features:
1. Kernel and Memory Operations Profiling:
 Collects detailed information about the execution of
CUDA kernels and memory operations.
 Measures kernel execution time, memory transfer time,
and other performance metrics.

2. Event and Metric Collection:


 Captures a wide range of performance events and
metrics, including instructions executed, memory
transactions, cache hits and misses, and more.
 Allows developers to specify which events and metrics
to collect.

3. Timeline View:
 Provides a timeline view of kernel execution and
memory transfers, helping developers visualize the
sequence of operations and identify overlapping or
sequential execution patterns.

4. API Activity Tracing:


 Traces CUDA API calls, providing information about
the timing and duration of each call.
 Helps identify inefficiencies related to API usage.

5. CUDA Context and Stream Profiling:


 Profiles activity within different CUDA contexts and
streams, enabling developers to optimize concurrent
execution.

6. Output Formats:
 Supports multiple output formats, including text, CSV,
and SQL, allowing for flexible data analysis and
reporting.
 The collected data can be further analyzed using tools
like NVIDIA Visual Profiler (nvvp) or NVIDIA Nsight
Systems.

7. Device and Kernel Profiling:


 Profiles individual CUDA devices and kernels,
providing detailed information about their performance
characteristics.
 Helps in comparing the performance of different
devices and optimizing kernel performance.

8. Command-Line Interface:
 Operates through a command-line interface, making it
suitable for use in automated scripts and batch
processing.
 Offers a wide range of options and filters to customize
profiling sessions.

9. Profiling Overheads:
 Provides information about the overhead introduced by
profiling, helping developers understand and manage the impact on application performance.

10. Compatibility:
 Supports a wide range of CUDA-enabled GPUs and
works with CUDA applications written in C, C++,
Fortran, and other supported languages.

Steps of using nvprof:

 Step 1: Compile Your CUDA Program


!nvcc name.cu -o name

 Step 2: Run Your Program


! ./name

 Step 3: Run Your Program with nvprof


!nvprof ./name

CUDA Program to add vectors: -


Code:
%%writefile vector_add.cu
#include <stdio.h>
#include <cuda.h>
#include <cuda_runtime_api.h>

__global__ void AddVectorsCUDA(int *a, int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n) {
        c[index] = a[index] + b[index];
    }
}

int main() {
    int n = 5;
    int h_a[] = {1, 2, 3, 4, 5};
    int h_b[] = {6, 7, 8, 9, 10};
    int h_c[n];

    int *d_a, *d_b, *d_c;

    cudaMalloc((void **)&d_a, n * sizeof(int));
    cudaMalloc((void **)&d_b, n * sizeof(int));
    cudaMalloc((void **)&d_c, n * sizeof(int));

    cudaMemcpy(d_a, h_a, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, n * sizeof(int), cudaMemcpyHostToDevice);

    AddVectorsCUDA<<<1, n>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, n * sizeof(int), cudaMemcpyDeviceToHost);

    printf("Result of vector addition:\n");
    for (int i = 0; i < n; i++) {
        printf("%d ", h_c[i]);
    }
    printf("\n");

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    return 0;
}

Steps:
 !nvcc vector_add.cu -o vector_add

 ! ./vector_add

 !nvprof ./vector_add

CUDA Program to multiply matrices: -


Code:
%%writefile matrix_mul.cu
#include <stdio.h>
#include <cuda.h>
#include <cuda_runtime_api.h>

#define N 2 // Defining a small 2x2 matrix for simplicity

__global__ void MatrixMulCUDA(int *a, int *b, int *c) {
    int row = threadIdx.y;
    int col = threadIdx.x;

    int sum = 0;
    for (int k = 0; k < N; k++) {
        sum += a[row * N + k] * b[k * N + col];
    }
    c[row * N + col] = sum;
}

int main() {
    int h_a[N][N] = { {1, 2}, {3, 4} };
    int h_b[N][N] = { {5, 6}, {7, 8} };
    int h_c[N][N];

    int *d_a, *d_b, *d_c;

    cudaMalloc((void **)&d_a, N * N * sizeof(int));
    cudaMalloc((void **)&d_b, N * N * sizeof(int));
    cudaMalloc((void **)&d_c, N * N * sizeof(int));

    cudaMemcpy(d_a, h_a, N * N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, N * N * sizeof(int), cudaMemcpyHostToDevice);

    dim3 threadsPerBlock(N, N);
    MatrixMulCUDA<<<1, threadsPerBlock>>>(d_a, d_b, d_c);

    cudaMemcpy(h_c, d_c, N * N * sizeof(int), cudaMemcpyDeviceToHost);

    printf("Result of matrix multiplication:\n");
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            printf("%d ", h_c[i][j]);
        }
        printf("\n");
    }

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    return 0;
}

Steps:
 !nvcc matrix_mul.cu -o matrix_mul
 ! ./matrix_mul

 !nvprof ./matrix_mul


PRACTICAL: 08

AIM: Write a program to perform load distribution on GPU using CUDA.

Theory: -
Load distribution on the GPU using CUDA:
Performing load distribution on GPU using CUDA in
Google Colab involves several steps. CUDA is a parallel
computing platform and application programming
interface (API) developed by NVIDIA for general-purpose
GPU programming. Google Colab provides free access to
GPU resources, making it an excellent platform to
experiment with CUDA.

Here are the steps to create a simple CUDA program in Google Colab:
1. Access Google Colab
2. Create a New Notebook
3. Set GPU as the Runtime Type
4. Install Required Libraries:
If you need to install any libraries, you can do so using pip. For CUDA programming from Python, you might need the pycuda library: !pip install pycuda
5. Import Required Libraries:
Import the necessary Python libraries, including
pycuda and numpy for this example.
6. Write CUDA Kernel Code:
Write the CUDA kernel code in a code cell. This is
the part of the code that will run on the GPU.

Code: -
%%writefile load_distribute.cu
#include <iostream>

__global__ void workloadDistributionKernel(int *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        data[idx] = idx; // assign thread index to data array
    }
}

int main() {
    const int n = 23; // array size
    int data[n];

    // Allocate memory on the GPU
    int *data_gpu;
    cudaMalloc((void **)&data_gpu, n * sizeof(int));

    // Launch kernel: the ceiling division (n + blockSize - 1) / blockSize gives
    // 3 blocks of 8 threads = 24 threads; the if (idx < n) guard idles the extra one.
    int blockSize = 8;
    int numBlocks = (n + blockSize - 1) / blockSize;
    workloadDistributionKernel<<<numBlocks, blockSize>>>(data_gpu, n);

    // Copy result from device to host
    cudaMemcpy(data, data_gpu, n * sizeof(int), cudaMemcpyDeviceToHost);

    // Print result
    for (int i = 0; i < n; i++) {
        std::cout << "data[" << i << "] = " << data[i] << std::endl;
    }

    // Free memory
    cudaFree(data_gpu);

    return 0;
}


Output:
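
A common extension of this launch pattern (not part of the manual's program) is the grid-stride loop: a fixed launch shape can cover an array of any size, because each thread processes every (gridDim.x * blockDim.x)-th element instead of exactly one. A minimal sketch, reusing the n = 23 example:

%%writefile stride_distribute.cu
#include <stdio.h>

// Grid-stride loop: each thread strides through the array, so the launch
// shape no longer has to match the problem size.
__global__ void strideKernel(int *data, int n) {
    int stride = gridDim.x * blockDim.x;
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x; idx < n; idx += stride) {
        data[idx] = idx;
    }
}

int main() {
    const int n = 23;
    int data[n];
    int *data_gpu;
    cudaMalloc((void **)&data_gpu, n * sizeof(int));

    // 2 blocks of 8 threads = 16 threads; the stride loop still covers all 23 items.
    strideKernel<<<2, 8>>>(data_gpu, n);

    cudaMemcpy(data, data_gpu, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; i++) {
        printf("data[%d] = %d\n", i, data[i]);
    }
    cudaFree(data_gpu);
    return 0;
}

Compile and run as before: !nvcc stride_distribute.cu -o stride_distribute followed by !./stride_distribute.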


PRACTICAL: 09

AIM: Write a simple CUDA program to print “Hello World!”.

Theory: -
What is CUDA programming:
CUDA (Compute Unified Device Architecture) is a
parallel computing platform and application programming
interface (API) model created by NVIDIA. It allows
developers to use NVIDIA GPUs (graphics processing
units) for general-purpose processing (an approach known
as GPGPU, General-Purpose computing on Graphics
Processing Units).

Logical Architecture of GPU:

1. Grids:
Grid refers to the highest-level grouping of threads that are
scheduled for execution on the GPU device. It represents
the entire set of parallel work that needs to be processed by
the GPU.

2. Blocks:
A block is a group of threads that execute concurrently on an SM. Threads within the same block can cooperate with
each other through shared memory and synchronization
mechanisms.

3. Warps:
A warp is the smallest unit of execution in CUDA. It
consists of 32 consecutive threads that are executed in
lockstep on an SM. This means that all 32 threads within a
warp execute the same instruction at the same time.

4. Threads:
A thread is a basic unit of execution in CUDA (NVIDIA's
parallel computing platform). Threads are organized into
groups called thread blocks, and multiple thread blocks are
organized into a grid.

CUDA Program execution Flow:


Steps:
 Data copy from CPU to GPU.
 Execution on GPU.
 Data copy from GPU to CPU.

CUDA Program to print hello world: -


Code:
%%writefile ud.cu
#include "stdio.h"
__global__ void cuda_hello(){
    printf("Hello World!\n");
}
int main(){
    cuda_hello<<<1,5>>>();
    cudaDeviceSynchronize();
    return 0;
}

Steps:
 !nvcc ud.cu -o ud
 ! ./ud


CUDA Program to print hello world with different numbers of blocks and threads: -
Code:
%%writefile ud1.cu
#include "stdio.h"
__global__ void cuda_hello(){
    printf("Hello Engineer!\n");
}
int main(){
    cuda_hello<<<4,2>>>();
    cudaDeviceSynchronize();
    return 0;
}

Steps:
 !nvcc ud1.cu -o ud1
 ! ./ud1
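
To see exactly which block and thread produced each line, a small variant (a sketch in the same style; the file name ids.cu is illustrative) can print the built-in index variables:

%%writefile ids.cu
#include "stdio.h"
__global__ void whoami(){
    // Each thread reports its coordinates in the launch configuration.
    int global_id = blockIdx.x * blockDim.x + threadIdx.x;
    printf("block %d, thread %d -> global id %d\n",
           blockIdx.x, threadIdx.x, global_id);
}
int main(){
    whoami<<<4,2>>>(); // 4 blocks of 2 threads, matching the example above
    cudaDeviceSynchronize();
    return 0;
}

Run with !nvcc ids.cu -o ids and ! ./ids.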


PRACTICAL: 10

AIM: Write a CUDA program to add two arrays.

Theory: -
The CUDA programming model used here (grids, blocks, warps, threads, and the copy-to-GPU, execute, copy-back execution flow) is the same as described in Practical 09.

CUDA Program to add two numbers: -


Code:
%%writefile addNumbers.cu
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>

__global__ void addKernel(int* c, const int* a, const int* b) {
    *c = *a + *b;
}

void addWithCuda(int* c, const int* a, const int* b) {
    int* dev_a = nullptr;
    int* dev_b = nullptr;
    int* dev_c = nullptr;

    cudaMalloc((void**)&dev_c, sizeof(int));
    cudaMalloc((void**)&dev_a, sizeof(int));
    cudaMalloc((void**)&dev_b, sizeof(int));

    cudaMemcpy(dev_a, a, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, sizeof(int), cudaMemcpyHostToDevice);

    addKernel<<<1, 1>>>(dev_c, dev_a, dev_b);
    cudaDeviceSynchronize();
    cudaMemcpy(c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dev_c);
    cudaFree(dev_a);
    cudaFree(dev_b);
}

int main() {
    const int a = 5;
    const int b = 7;
    int c = 0;

    addWithCuda(&c, &a, &b);

    printf("%d + %d = %d\n", a, b, c);

    cudaDeviceReset();

    return 0;
}

Steps:
 !nvcc addNumbers.cu -o addNumbers
 !./addNumbers

CUDA Program to add two arrays: -


Code:
%%writefile addArrays.cu
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>

__global__ void addKernel(int* c, const int* a, const int* b, int size) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size) {
        c[i] = a[i] + b[i];
    }
}

void addWithCuda(int* c, const int* a, const int* b, int size) {
    int* dev_a = nullptr;
    int* dev_b = nullptr;
    int* dev_c = nullptr;

    cudaMalloc((void**)&dev_c, size * sizeof(int));
    cudaMalloc((void**)&dev_a, size * sizeof(int));
    cudaMalloc((void**)&dev_b, size * sizeof(int));

    cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);

    // Ceiling division so every element gets a thread even when size is not
    // a multiple of threadsPerBlock.
    int threadsPerBlock = 256;
    int blocksPerGrid = (size + threadsPerBlock - 1) / threadsPerBlock;
    addKernel<<<blocksPerGrid, threadsPerBlock>>>(dev_c, dev_a, dev_b, size);
    cudaDeviceSynchronize();
    cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dev_c);
    cudaFree(dev_a);
    cudaFree(dev_b);
}

int main() {
    const int arraySize = 5;
    const int a[arraySize] = { 11, 22, 43, 34, 55 };
    const int b[arraySize] = { 11, 22, 34, 43, 55 };
    int c[arraySize] = { 0 };

    addWithCuda(c, a, b, arraySize);

    printf("{11, 22, 43, 34, 55} + {11, 22, 34, 43, 55} = {%d, %d, %d, %d, %d}\n",
           c[0], c[1], c[2], c[3], c[4]);

    cudaDeviceReset();

    return 0;
}

Steps:
 !nvcc addArrays.cu -o addArrays
 !./addArrays
