HPC Lab Manual
PRACTICAL: 01
Theory: -
Features of Google Colab:
Free Access to GPUs and TPUs: Google Colab offers
free access to NVIDIA GPUs and Google TPUs, which
significantly accelerates machine learning tasks.
Jupyter Notebook Interface: It uses a familiar Jupyter
notebook interface, which is highly popular among data
scientists and researchers.
Easy Collaboration: Notebooks can be easily shared and
collaborated on, similar to Google Docs.
Pre-installed Libraries: Many popular Python libraries
are pre-installed, making it convenient to start working on
projects without needing to install dependencies.
Integration with Google Drive: It integrates seamlessly
with Google Drive, making it easy to save and load files (a
mount example follows this list).
Code Snippets and Examples: Colab provides various
code snippets and examples to help users get started
quickly.
Interactive Visualizations: Supports interactive
visualizations and can display charts and graphs directly
within the notebook.
Markdown Support: Supports Markdown for rich text
formatting within notebooks.
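As an example of the Drive integration noted above, a notebook can mount Google Drive with the standard google.colab helper (/content/drive is the conventional mount point):
from google.colab import drive
drive.mount('/content/drive')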
Code: -
Python Program:
if __name__ == "__main__":
    for i in range(1, 11):
        print("2 X", i, "=", 2 * i)
Output:
C Program:
%%writefile C_program.c
#include <stdio.h>
int main() {
    printf("Welcome to Google Colab...\n");
    return 0;
}
!nvcc C_program.c
!ls
!./a.out
Output:
PRACTICAL: 02
Theory: -
ls
Description: Lists files and directories in the current directory.
pwd
Description: Prints the current working directory.
cd
Description: Changes the current directory.
mkdir
Description: Creates a new directory.
rm
Description: Removes files.
rmdir
Description: Removes Directories.
mv
Description: Moves or renames files or directories.
cat
Description: Concatenates and displays file content.
head
Description: Displays the first few lines of a file.
tail
Description: Displays the last few lines of a file.
du -h
Description: Estimates file space usage.
touch
Description: Creates an empty file or updates the access and
modification times of an existing file.
echo
Description: Prints a line of text; with output redirection (>) it can create a file containing that text.
printf
Description: Prints formatted text; with redirection it can write formatted text to a file.
vi
Description: Opens a text editor to create or edit files.
ps
Description: Displays a snapshot of the current processes.
whoami
Description: Displays the current user’s name.
uname
Description: Displays system information.
date
Description: Displays the current date and time.
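A short example session combining several of these commands (the file and directory names are illustrative):
mkdir demo
cd demo
echo "hello" > notes.txt
cat notes.txt
rm notes.txt
cd ..
rmdir demo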
PRACTICAL: 03
Theory: -
1. What is Divide and Conquer?
Divide and Conquer is a problem-solving technique that
solves a problem by dividing it into subproblems, solving
those subproblems individually, and then combining their
results to obtain the solution to the original problem.
Quicksort follows this pattern: the partition step divides
the array around a pivot, and the two halves are then sorted
recursively. The partition step can be written as the
following pseudocode (Hoare-style, with pivot = A[low]):
while true do
    while i <= high and A[i] <= pivot do
        i = i + 1
    end while
    while j >= low and A[j] > pivot do
        j = j - 1
    end while
    if i >= j then
        break
    end if
    SWAP(A[i], A[j])
end while
SWAP(A[low], A[j])
return j
Code: -
#include <iostream>
#include <vector>
using namespace std;

// Hoare partition (pivot = A[low]), as in the pseudocode above
int partition(vector<int>& A, int low, int high) {
    int pivot = A[low], i = low + 1, j = high;
    while (true) {
        while (i <= high && A[i] <= pivot) i++;
        while (j >= low && A[j] > pivot) j--;
        if (i >= j) break;
        swap(A[i], A[j]);
    }
    swap(A[low], A[j]);
    return j;
}

void quickSort(vector<int>& A, int low, int high) {
    if (low >= high) return;
    int p = partition(A, low, high);   // pivot ends up at index p
    quickSort(A, low, p - 1);
    quickSort(A, p + 1, high);
}

int main() {
    int n;
    cout << "Enter number of elements: ";
    cin >> n;
    vector<int> arr(n);
    cout << "Enter elements:\n";
    for (int i = 0; i < n; i++) cin >> arr[i];
    quickSort(arr, 0, n - 1);
    cout << "Sorted array: ";
    for (int i = 0; i < n; i++) cout << arr[i] << " ";
    cout << endl;
    return 0;
}
Output:
PRACTICAL: 04
Theory: -
HPC cluster: -
An HPC cluster, or high-performance computing cluster, is
a combination of specialized hardware, including a group
of large and powerful computers, and a distributed
processing software framework configured to handle
massive amounts of data at high speeds with parallel
performance and high availability.
Power draw is correspondingly high; dense configurations
can demand up to 43 kW.
Code: -
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import time

# Sample 2-D data points
X = [[1, 2], [1, 4], [1, 0],
     [4, 2], [4, 0], [4, 4],
     [4, 5], [0, 2], [5, 5]]

nodes = [1, 2, 3, 4, 5]   # numbers of clusters to try
time_taken = []

for n in nodes:
    start_time = time.time()
    kmeans = KMeans(n_clusters=n, n_init=10)
    kmeans.fit(X)
    end_time = time.time()
    time_taken.append(end_time - start_time)

# Plot fitting time against the number of clusters
plt.plot(nodes, time_taken, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Time taken (s)')
plt.show()
Output:
Code: -
%%writefile cluster.cu
#include <iostream>
#include <thrust/random.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <chrono>

int main() {
    // Generate and time random graphs with different numbers of nodes
    for (int num_nodes = 10; num_nodes <= 500; num_nodes += 50) {
        int num_edges = num_nodes * 5; // assuming 5 edges per node

        auto start = std::chrono::high_resolution_clock::now();

        // Random edge list (two endpoints per edge), generated on the
        // host and copied to the GPU; the original listing omitted this body
        thrust::default_random_engine rng;
        thrust::uniform_int_distribution<int> dist(0, num_nodes - 1);
        thrust::host_vector<int> h_edges(2 * num_edges);
        thrust::generate(h_edges.begin(), h_edges.end(),
                         [&] { return dist(rng); });
        thrust::device_vector<int> d_edges = h_edges;

        auto end = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double> diff = end - start;

        std::cout << "Graph with " << num_nodes << " nodes generated in "
                  << diff.count() << " seconds" << std::endl;
    }
    return 0;
}
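Steps (compiling and running the file written above):
!nvcc cluster.cu -o cluster
!./cluster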
Output:
PRACTICAL: 05
Theory: -
Introduction to Profiler:
Profilers, which are programs themselves, analyze target
programs by collecting information on their execution.
Based on how they collect this information and on the
granularity of the collected data, they are classified as
event-based or statistical (sampling) profilers.
Features of Gprof:
1. Profiling Execution Time:
Gprof measures how much time each function in a
program spends executing. It provides detailed statistics
about:
1. Total time spent in each function.
2. Number of times each function is called.
3. Time spent in each function as a percentage of the
total program execution time.
3. Sampling profiler:
Gprof operates as a sampling profiler, meaning it
periodically samples the program counter while the program
runs and uses those samples to estimate, statistically, how
much time each function consumes.
4. Integration with GCC:
Gprof is seamlessly integrated with GCC, the GNU
Compiler Collection, which means it can be used
directly during compilation and linking processes. It
requires compiling the program with specific flags (-pg
for profiling and -g for debugging symbols) to enable
profiling data collection.
1. Program:
%%writefile looping.cpp
#include <iostream>
using namespace std;

int main() {
    for (int i = 1; i <= 5; ++i) {
        cout << i << " ";
    }
    return 0;
}
2. Steps:
!g++ -Wall -pg looping.cpp -o looping
!ls
!./looping
!ls
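The recorded profile can then be inspected with gprof, which reads the gmon.out file produced by the run:
!gprof looping gmon.out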
3. Program:
%%writefile looping.cpp
#include <iostream>
using namespace std;

int main()
{
    int i;
    for (i = 0; i < 5; i++) {
        cout << i << " ";
    }
    cout << endl;
    i = 0;
    while (i < 5) {
        cout << i << " ";
        i++;
    }
    return 0;
}
4. Steps:
!g++ -Wall -pg looping.cpp -o looping
!ls
!./looping
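As before, the profile report can be viewed with:
!gprof looping gmon.out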
PRACTICAL: 06
Theory: -
What's New in Intel® VTune™ Profiler: -
GPU Accelerators
Stall Factor Information in GPU Profiling Results
When you run the GPU Compute/Media Hotspots analysis
to profile applications running on Intel® Data Center GPU
Max Series (code named Ponte Vecchio) devices, you can
now see the reasons for stalls in Xe Vector Engines
(XVEs), formerly known as Execution Units (EUs). Use
this information to better understand and resolve the stalls
in your busiest computing tasks. For more information, see
Analyze Xe Vector Engine (XVE) Stalls.
6. Advanced Features
Intel® VTune™ Profiler offers several advanced features
for in-depth performance analysis:
Threading Analysis: Analyze the efficiency of your
application’s threading implementation and identify
synchronization issues.
I/O Analysis: Profile I/O operations to identify file
and network access bottlenecks.
GPU Offload Analysis: Analyze GPU-accelerated
code's performance and identify optimization
opportunities.
7. Conclusion
Intel® VTune™ Profiler is a comprehensive tool for
performance analysis and optimization. Using its various
analysis types and features, you can gain deep insights into
your application’s performance and make informed
decisions to optimize it.
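As a concrete illustration, a typical command-line hotspots collection and report looks like the following (the application name is a placeholder, and exact options can vary between VTune versions):
vtune -collect hotspots -result-dir r001 -- ./my_app
vtune -report summary -result-dir r001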
PRACTICAL: 07
Theory: -
What is a profiler:
A profiler is a software tool used in programming and
software development to measure and analyze the runtime
behavior of a program. Its primary purpose is to identify
bottlenecks and areas of inefficiency in the code, helping
developers optimize performance.
Key Features:
1. Kernel and Memory Operations Profiling:
Collects detailed information about the execution of
CUDA kernels and memory operations.
Measures kernel execution time, memory transfer time,
and other performance metrics.
3. Timeline View:
Provides a timeline view of kernel execution and
memory transfers, helping developers visualize the
sequence of operations and identify overlapping or
sequential execution patterns.
Enrollment No.: 210303105168
CHAUDHARY UMANGKUMAR BALUBHAI
P a g e | 28
Div: 25(CSE)
Faculty of Engineering & Technology
High Performance Computing (203105430)
B. Tech CSE 4th Year 7th Semester
6. Output Formats:
Supports multiple output formats, including text, CSV,
and SQL, allowing for flexible data analysis and
reporting.
The collected data can be further analyzed using tools
like NVIDIA Visual Profiler (nvvp) or NVIDIA Nsight
Systems.
8. Command-Line Interface:
Operates through a command-line interface, making it
suitable for use in automated scripts and batch
processing.
Offers a wide range of options and filters to customize
profiling sessions.
9. Profiling Overheads:
Provides information about the overhead introduced by
profiling, helping developers understand and manage it.
10. Compatibility:
Supports a wide range of CUDA-enabled GPUs and
works with CUDA applications written in C, C++,
Fortran, and other supported languages.
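For example, results can be written to a CSV log for later analysis using nvprof's standard options (shown here on the vector_add executable built below; option availability depends on the installed CUDA version):
!nvprof --csv --log-file profile.csv ./vector_add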
%%writefile vector_add.cu
#include <stdio.h>
#include <cuda.h>
#include <cuda_runtime_api.h>

// Kernel: each thread adds one pair of elements
__global__ void vector_add(int *a, int *b, int *c, int n) {
    int i = threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    int n = 5;
    // Sample inputs (illustrative values)
    int h_a[] = {1, 2, 3, 4, 5}, h_b[] = {10, 20, 30, 40, 50}, h_c[5];
    int *d_a, *d_b, *d_c;
    cudaMalloc((void**)&d_a, n * sizeof(int));
    cudaMalloc((void**)&d_b, n * sizeof(int));
    cudaMalloc((void**)&d_c, n * sizeof(int));
    cudaMemcpy(d_a, h_a, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, n * sizeof(int), cudaMemcpyHostToDevice);
    vector_add<<<1, n>>>(d_a, d_b, d_c, n);
    cudaMemcpy(h_c, d_c, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; i++) printf("%d ", h_c[i]);
    printf("\n");
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return 0;
}
Steps:
!nvcc vector_add.cu -o vector_add
!nvprof ./vector_add
%%writefile matrix_mul.cu
#include <stdio.h>
#include <cuda.h>
#include <cuda_runtime_api.h>

#define N 2

// Kernel: each thread computes one element of the product matrix
__global__ void matrix_mul(int *a, int *b, int *c) {
    int row = threadIdx.y, col = threadIdx.x;
    int sum = 0;
    for (int k = 0; k < N; k++)
        sum += a[row * N + k] * b[k * N + col];
    c[row * N + col] = sum;
}
int main() {
    int h_a[N][N] = { {1, 2}, {3, 4} };
    int h_b[N][N] = { {5, 6}, {7, 8} };
    int h_c[N][N];
    int *d_a, *d_b, *d_c;
    cudaMalloc((void**)&d_a, N * N * sizeof(int));
    cudaMalloc((void**)&d_b, N * N * sizeof(int));
    cudaMalloc((void**)&d_c, N * N * sizeof(int));
    cudaMemcpy(d_a, h_a, N * N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, N * N * sizeof(int), cudaMemcpyHostToDevice);
    dim3 threads(N, N);
    matrix_mul<<<1, threads>>>(d_a, d_b, d_c);
    cudaMemcpy(h_c, d_c, N * N * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) printf("%d ", h_c[i][j]);
        printf("\n");
    }
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return 0;
}
Steps:
!nvcc matrix_mul.cu -o matrix_mul
!./matrix_mul
!nvprof ./matrix_mul
PRACTICAL: 08
Theory: -
Load distribution on GPU using CUDA:
Performing load distribution on GPU using CUDA in
Google Colab involves several steps. CUDA is a parallel
computing platform and application programming
interface (API) developed by NVIDIA for general-purpose
GPU programming. Google Colab provides free access to
GPU resources, making it an excellent platform to
experiment with CUDA.
Code: -
%%writefile load_distribute.cu
#include <iostream>

// Kernel: one thread per element. The original kernel body is not shown
// in this listing, so as illustrative work each thread writes its global
// index into its element.
__global__ void workloadDistributionKernel(int *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] = idx;
}

int main() {
    const int n = 23; // array size
    int data[n];

    // Allocate device memory
    int *data_gpu;
    cudaMalloc((void**)&data_gpu, n * sizeof(int));

    // Launch kernel
    int blockSize = 8;
    int numBlocks = (n + blockSize - 1) / blockSize;
    workloadDistributionKernel<<<numBlocks, blockSize>>>(data_gpu, n);

    // Copy result back and print
    cudaMemcpy(data, data_gpu, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; i++) {
        std::cout << "data[" << i << "] = " << data[i] << std::endl;
    }

    // Free memory
    cudaFree(data_gpu);
    return 0;
}
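Steps (compiling and running the file written above):
!nvcc load_distribute.cu -o load_distribute
!./load_distribute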
Output:
PRACTICAL: 09
Theory: -
What is CUDA programming:
CUDA (Compute Unified Device Architecture) is a
parallel computing platform and application programming
interface (API) model created by NVIDIA. It allows
developers to use NVIDIA GPUs (graphics processing
units) for general-purpose processing (an approach known
as GPGPU, General-Purpose computing on Graphics
Processing Units).
1. Grids:
Grid refers to the highest-level grouping of threads that are
scheduled for execution on the GPU device. It represents
the entire set of parallel work that needs to be processed by
the GPU.
2. Blocks:
A block is a group of threads that executes on a single
streaming multiprocessor (SM). Threads within the same
block can cooperate through shared memory and synchronize
with one another; threads in different blocks cannot
directly do so.
3. Warps:
A warp is the smallest unit of execution in CUDA. It
consists of 32 consecutive threads that are executed in
lockstep on an SM. This means that all 32 threads within a
warp execute the same instruction at the same time.
4. Threads:
A thread is a basic unit of execution in CUDA (NVIDIA's
parallel computing platform). Threads are organized into
groups called thread blocks, and multiple thread blocks are
organized into a grid.
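Putting these levels together, each thread typically derives a unique global index from its block and thread IDs; the standard idiom inside a kernel is:
int idx = blockIdx.x * blockDim.x + threadIdx.x; // global thread index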
Steps:
Data copy from CPU to GPU.
Execution on GPU.
Data copy from GPU to CPU.
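The ud.cu program itself is not reproduced in this section; the following is a minimal sketch of the three steps above (the kernel and data are illustrative assumptions, not the original listing):
%%writefile ud.cu
#include <stdio.h>

// Illustrative kernel: double each element in place
__global__ void doubleElements(int *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2;
}

int main() {
    const int n = 8;
    int h[n];
    for (int i = 0; i < n; i++) h[i] = i;
    int *d;
    cudaMalloc((void**)&d, n * sizeof(int));
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice); // 1. CPU -> GPU
    doubleElements<<<1, n>>>(d, n);                            // 2. execute on GPU
    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost); // 3. GPU -> CPU
    for (int i = 0; i < n; i++) printf("%d ", h[i]);
    printf("\n");
    cudaFree(d);
    return 0;
}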
Steps:
!nvcc ud.cu -o ud
!./ud
Steps:
!nvcc ud.cu -o ud1
!./ud1
PRACTICAL: 10
Theory: -
What is CUDA programming:
CUDA (Compute Unified Device Architecture) is a
parallel computing platform and application programming
interface (API) model created by NVIDIA. It allows
developers to use NVIDIA GPUs (graphics processing
units) for general-purpose processing (an approach known
as GPGPU, General-Purpose computing on Graphics
Processing Units).
1. Grids:
Grid refers to the highest-level grouping of threads that are
scheduled for execution on the GPU device. It represents
the entire set of parallel work that needs to be processed by
the GPU.
2. Blocks:
As above, a block is a group of threads executing on a
single streaming multiprocessor (SM); threads in the same
block can share memory and synchronize.
3. Warps:
A warp is the smallest unit of execution in CUDA. It
consists of 32 consecutive threads that are executed in
lockstep on an SM. This means that all 32 threads within a
warp execute the same instruction at the same time.
4. Threads:
A thread is a basic unit of execution in CUDA (NVIDIA's
parallel computing platform). Threads are organized into
groups called thread blocks, and multiple thread blocks are
organized into a grid.
Steps:
Data copy from CPU to GPU.
Execution on GPU.
Data copy from GPU to CPU.
%%writefile addNumbers.cu
#include <stdio.h>

// Kernel: add two integers on the GPU (the kernel and wrapper
// definitions here are reconstructions; the original listing omitted them)
__global__ void addKernel(int *c, const int *a, const int *b) {
    *c = *a + *b;
}

void addNumbers(int *c, const int *a, const int *b) {
    int *dev_a, *dev_b, *dev_c;
    cudaMalloc((void**)&dev_c, sizeof(int));
    cudaMalloc((void**)&dev_a, sizeof(int));
    cudaMalloc((void**)&dev_b, sizeof(int));
    cudaMemcpy(dev_a, a, sizeof(int),
               cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, sizeof(int),
               cudaMemcpyHostToDevice);
    addKernel<<<1, 1>>>(dev_c, dev_a, dev_b);
    cudaMemcpy(c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev_c);
    cudaFree(dev_a);
    cudaFree(dev_b);
}

int main() {
    const int a = 5;
    const int b = 7;
    int c = 0;
    addNumbers(&c, &a, &b);
    printf("%d + %d = %d\n", a, b, c);
    cudaDeviceReset();
    return 0;
}
Steps:
!nvcc addNumbers.cu -o addNumbers
!./addNumbers
#include "device_launch_parameters.h"
#include <stdio.h>
cudaFree(dev_a);
cudaFree(dev_b);
}
int main() {
const int arraySize = 5;
const int a[arraySize] = { 11, 22, 43, 34, 55 };
const int b[arraySize] = { 11, 22, 34, 43, 55 };
int c[arraySize] = { 0 };
addWithCuda(c, a, b, arraySize);
printf("{11, 22, 43, 34, 55} + {11, 22, 34, 43, 55} =
{%d, %d, %d, %d, %d}\n", c[0], c[1], c[2], c[3], c[4]);
cudaDeviceReset();
return 0;
}
Steps:
!nvcc addArrays.cu -o addArrays
!./addArrays