HPC Lab Manual

Faculty of Engineering & Technology

High Performance Computing (203105430)


B. Tech CSE 4th Year 7th Semester

PRACTICAL: 01

AIM: Study the facilities provided by Google Colab.

Theory: -
 Features of Google Colab:
 Free Access to GPUs and TPUs: Google Colab offers
free access to NVIDIA GPUs and Google TPUs, which
significantly accelerates machine learning tasks.
 Jupyter Notebook Interface: It uses a familiar Jupyter
notebook interface, which is highly popular among data
scientists and researchers.
 Easy Collaboration: Notebooks can be easily shared and
collaborated on, similar to Google Docs.
 Pre-installed Libraries: Many popular Python libraries
are pre-installed, making it convenient to start working on
projects without needing to install dependencies.
 Integration with Google Drive: It integrates seamlessly
with Google Drive, making it easy to save and load files.
 Code Snippets and Examples: Colab provides various
code snippets and examples to help users get started
quickly.
 Interactive Visualizations: Supports interactive
visualizations and can display charts and graphs directly
within the notebook.
 Markdown Support: Supports Markdown for rich text
formatting within notebooks.

 Explain in Detail the Use of GPU on Google Colab by Changing Runtime Type:
 Open Google Colab: Go to Google Colab and open a new
notebook or an existing one.
 Change Runtime Type:
 Click on Runtime in the menu bar.
 Select Change runtime type.
 In the dialog that appears, under Hardware
accelerator, choose GPU.
 Click Save.
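
Once the GPU runtime is active, you can verify it from a notebook cell. Below is a minimal sketch using the same %%writefile-and-nvcc workflow as the later practicals (running !nvidia-smi in a cell is a quicker alternative); the file name gpu_check.cu is illustrative:

%%writefile gpu_check.cu
#include <stdio.h>
#include <cuda_runtime.h>

// Lists the CUDA devices the Colab runtime exposes.
int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("CUDA devices visible: %d\n", count);
    for (int d = 0; d < count; d++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d: %s with %d multiprocessors\n",
               d, prop.name, prop.multiProcessorCount);
    }
    return 0;
}

Compile and run with !nvcc gpu_check.cu -o gpu_check followed by !./gpu_check.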

Code: -
Python Program:

if __name__ == "__main__":
    for i in range(1, 11):
        print("2 X", i, "=", 2 * i)
Output:

C Program:

%%writefile C_program.c
#include <stdio.h>

int main(){
    printf("Welcome to Google Colab...\n");
    return 0;
}

!nvcc C_program.c
!ls
!./a.out

Output:

PRACTICAL: 02

AIM: Demonstrate basic Linux Commands.

Theory: -
 ls
Description: Lists files and directories in the current directory.

 pwd
Description: Prints the current working directory.

 cd
Description: Changes the current directory.

 mkdir
Description: Creates a new directory.

 rm
Description: Removes files.

 rmdir
Description: Removes empty directories.

 mv
Description: Moves or renames files or directories.

 cat
Description: Concatenates and displays file content.

 head
Description: Displays the first few lines of a file.


 tail
Description: Displays the last few lines of a file.

 du -h
Description: Estimates file space usage; -h prints sizes in human-readable units (K, M, G).

 touch
Description: Creates an empty file or updates the access and
modification times of an existing file.

 echo
Description: Prints a line of text; with output redirection (echo "text" > file.txt) it creates a file containing that text.

 printf
Description: Prints formatted text; with output redirection it can write formatted text to a file.


 vi
Description: Opens a text editor to create or edit files.

 ps
Description: Displays a snapshot of the current processes.


 whoami
Description: Displays the current user’s name.

 uname
Description: Displays system information.

 date
Description: Displays the current date and time.


PRACTICAL: 03

AIM: Using Divide and Conquer Strategies, design a class for Concurrent Quick Sort using C++.

Theory: -
1. What is Divide and Conquer?
Divide and Conquer is a problem-solving technique that solves a problem by dividing it into subproblems, solving those individually, and then merging their results to obtain the solution to the original problem.

2. Advantages of Divide and Conquer:


1. Efficiency: Many divide and conquer algorithms, such as merge sort and quick sort on average, run in O(n log n) time, which is more efficient than quadratic alternatives for large datasets.
2. Simplicity: Divide and conquer algorithms are often easy
to understand and implement.
3. Parallelizability: Divide and conquer algorithms can be
easily parallelized, as each subproblem can be solved
independently.
4. Cache-friendliness: Divide and conquer algorithms tend to
have good cache performance, as they access data in a
predictable pattern.

3. Disadvantages of Divide and Conquer:


1. Recursion overhead: Divide and conquer algorithms use
recursion, which can lead to significant overhead in
terms of stack space and function calls.
2. Not suitable for all problems: Divide and conquer
algorithms are not suitable for all types of problems.
They are most effective for problems that can be
recursively divided into smaller subproblems.
3. Limited memory efficiency: Divide and conquer
algorithms can require a significant amount of memory,
as they create multiple copies of the input data.
4. Difficult to analyze: The time and space complexity of
divide and conquer algorithms can be difficult to
analyze, especially for complex problems.

4. What is Quick Sort?


Quick Sort is a sorting algorithm based on the Divide and
Conquer algorithm that picks an element as a pivot and
partitions the given array around the picked pivot by
placing the pivot in its correct position in the sorted array.

5. Algorithm for Quick Sort.


QUICKSORT(A, low, high)
    if low < high then
        pivotIndex = PARTITION(A, low, high)
        QUICKSORT(A, low, pivotIndex - 1)
        QUICKSORT(A, pivotIndex + 1, high)

PARTITION(A, low, high)
    pivot = A[low]
    i = low + 1
    j = high

    while true do
        while i <= high and A[i] <= pivot do
            i = i + 1
        end while
        while j >= low and A[j] > pivot do
            j = j - 1
        end while
        if i >= j then
            break
        end if
        SWAP(A[i], A[j])
    end while

    SWAP(A[low], A[j])
    return j


Code: -
#include <iostream>
#include <vector>
using namespace std;

int partition(vector<int>& arr, int low, int high) {
    int pivot = arr[high];
    int i = low - 1;
    for (int j = low; j < high; j++) {
        if (arr[j] < pivot) {
            i++;
            swap(arr[i], arr[j]);
        }
    }
    i++;
    swap(arr[i], arr[high]);
    return i;
}

void quickSort(vector<int>& arr, int low, int high) {
    if (low < high) {
        int pindex = partition(arr, low, high);
        quickSort(arr, low, pindex - 1);
        quickSort(arr, pindex + 1, high);
    }
}

int main() {
    int n;
    cout << "Enter number of elements: ";
    cin >> n;

    vector<int> arr(n);
    cout << "Enter elements:\n";
    for (int i = 0; i < n; i++) {
        cin >> arr[i];
    }

    quickSort(arr, 0, n - 1);

    cout << "Sorted array is: ";
    for (int i = 0; i < n; i++) {
        cout << arr[i] << " ";
    }
    cout << endl;

    return 0;
}

Output:
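
The code above sorts sequentially. Because the two recursive calls of quick sort operate on disjoint halves of the array, they can safely run in parallel, which is what the AIM's concurrency requirement points at. Below is a minimal sketch of such a class, assuming a C++11 or later compiler with thread support (compile with g++ -std=c++17 -pthread). The class name ConcurrentQuickSort and the depth limit are illustrative, not from the manual:

#include <algorithm>
#include <future>
#include <iostream>
#include <vector>
using namespace std;

class ConcurrentQuickSort {
public:
    void sort(vector<int>& arr) {
        sortRange(arr, 0, (int)arr.size() - 1, 0);
    }

private:
    static const int kMaxDepth = 3; // below this depth, recurse sequentially

    static int partition(vector<int>& arr, int low, int high) {
        int pivot = arr[high];
        int i = low - 1;
        for (int j = low; j < high; j++) {
            if (arr[j] < pivot) {
                i++;
                swap(arr[i], arr[j]);
            }
        }
        swap(arr[i + 1], arr[high]);
        return i + 1;
    }

    void sortRange(vector<int>& arr, int low, int high, int depth) {
        if (low >= high) return;
        int p = partition(arr, low, high);
        if (depth < kMaxDepth) {
            // The two halves touch disjoint index ranges, so no locking is needed.
            auto left = async(launch::async,
                              [&] { sortRange(arr, low, p - 1, depth + 1); });
            sortRange(arr, p + 1, high, depth + 1);
            left.get();
        } else {
            sortRange(arr, low, p - 1, depth + 1);
            sortRange(arr, p + 1, high, depth + 1);
        }
    }
};

int main() {
    vector<int> arr = {9, 3, 7, 1, 8, 2, 6, 5, 4};
    ConcurrentQuickSort sorter;
    sorter.sort(arr);
    for (int x : arr) cout << x << " ";
    cout << endl;
    return 0;
}

Capping the spawn depth bounds the number of worker threads at roughly 2^kMaxDepth while still demonstrating the parallelizability advantage discussed in the theory above.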


PRACTICAL: 04

AIM: Write a program on an unloaded cluster for several different numbers of nodes and record the time taken in each case. Draw a graph of execution time against the number of nodes.

Theory: -
HPC cluster: -
An HPC cluster, or high-performance computing cluster, is
a combination of specialized hardware, including a group
of large and powerful computers, and a distributed
processing software framework configured to handle
massive amounts of data at high speeds with parallel
performance and high availability.

Components of HPC cluster: -


 Compute hardware: -
Compute hardware includes servers, storage, and a
dedicated network. Typically, you will need to
provision at least three servers that function as
primary, worker, and client nodes. With such a
limited setup, you’ll need to invest in high-end
servers with ample processors and storage for more
compute capacity in each.
 Software: -
The software layer includes the tools you intend to
use to monitor, provision, and manage your HPC
cluster. Software stacks comprise libraries,
compilers, debuggers, and file systems as well to
execute cluster management functions.
 Facilities: -
To house your HPC cluster, you need actual physical
floor space to hold and support the weight of racks of
servers, which can include up to 72 blade-style
servers and five top-of-rack switches weighing in at
up to 1,800 pounds. You also must have enough
power to operate and cool the servers, which can
demand up to 43 kW.

How to build the HPC cluster: -


While building an HPC cluster is fairly straightforward, it
requires an organization to understand the level of
compute power needed on a daily basis to determine the
setup. You need to carefully assess questions such as: how
many servers are required; what software layer can handle
the workloads efficiently; where the cluster will be
housed; and what the system’s power and cooling
requirements are. Once these are decided, you can proceed
with building the cluster, following the steps listed below:


Build a compute node:
Configure a head node by installing tools for
monitoring and resource management as well as
high-speed interconnect drivers/software. Create a
shared cluster directory, capture an image of the
compute node, and clone the image out to the rest of
the cluster that will run the workloads.
Configure IP addresses:
For peak efficiency, HPC clusters contain a high-
speed interconnect network that uses a dedicated IP
subnet. As you connect worker nodes to the head
node, you will assign additional IP addresses for
each node.
Configure jobs as CMU user groups:
As workloads arrive in the queue, you will need a
script to dynamically create CMU user groups for
each currently running job.


Code: -
import time

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = [[1, 2], [1, 4], [1, 0],
     [4, 2], [4, 0], [4, 4],
     [4, 5], [0, 2], [5, 5]]

nodes = [1, 2, 3, 4, 5]
time_taken = []
for n in nodes:
    start_time = time.time()
    kmeans = KMeans(n_clusters=n)
    kmeans.fit(X)
    end_time = time.time()
    time_taken.append(end_time - start_time)

plt.plot(nodes, time_taken, marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('Time Taken (seconds)')
plt.title('Time Taken vs Number of Clusters')
plt.grid(True)
plt.show()

Output:


Code: -
%%writefile cluster.cu
#include <iostream>
#include <thrust/random.h>
#include <thrust/device_vector.h>
#include <chrono>

// Generates a random edge list: num_edges (src, dst) pairs appended to adj_list.
void generate_graph(int num_nodes, int num_edges,
                    thrust::device_vector<int> &adj_list) {
    thrust::default_random_engine rng;
    thrust::uniform_int_distribution<int> dist(0, num_nodes - 1);

    for (int i = 0; i < num_edges; i++) {
        int src = dist(rng);
        int dst = dist(rng);
        adj_list.push_back(src);
        adj_list.push_back(dst);
    }
}

int main() {
    // Generate graphs with increasing numbers of nodes and time each run.
    for (int num_nodes = 10; num_nodes <= 500; num_nodes += 50) {
        int num_edges = num_nodes * 5; // assuming 5 edges per node

        thrust::device_vector<int> adj_list;  // start empty...
        adj_list.reserve(num_edges * 2);      // ...and reserve space up front

        auto start = std::chrono::high_resolution_clock::now();
        generate_graph(num_nodes, num_edges, adj_list);
        auto end = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double> diff = end - start;

        std::cout << "Graph with " << num_nodes << " nodes generated in "
                  << diff.count() << " seconds" << std::endl;
    }

    return 0;
}

Output:


PRACTICAL: 05

AIM: Write a program to check task distribution using Gprof.

Theory: -
Introduction to Profiler:
Profilers, which are programs themselves, analyze target programs by collecting information about their execution. Based on how they collect this information and at what granularity, they are classified as event-based or statistical (sampling) profilers.

Introduction to Gprof profiler:


Gprof is a popular profiler used in software development
primarily for analyzing the execution time of a program
and identifying performance bottlenecks. It is typically
used with programs written in C, C++, or Fortran, and is
part of the GNU Compiler Collection (GCC).

Features of Gprof:
1. Profiling Execution Time:
Gprof measures how much time each function in a
program spends executing. It provides detailed statistics
about:
1. Total time spent in each function.
2. Number of times each function is called.
3. Time spent in each function as a percentage of the
total program execution time.

2. Call Graph Analysis:


Gprof generates a call graph that illustrates how
functions are called and where time is spent in the
program. This helps developers visualize the flow of
execution and identify chains of function calls that
contribute significantly to overall runtime.

3. Sampling profiler:
Gprof estimates where time is spent by sampling the program counter at regular intervals, and combines this with the call counts inserted by the -pg instrumentation. Sampling keeps runtime overhead low compared to profilers that trace every event.

4. Integration with GCC:
Gprof is seamlessly integrated with GCC, the GNU
Compiler Collection, which means it can be used
directly during compilation and linking processes. It
requires compiling the program with specific flags (-pg
for profiling and -g for debugging symbols) to enable
profiling data collection.

5. Output and analysis:


After profiling, Gprof generates output files containing
detailed statistics and the call graph. Developers can
analyze this information using various tools provided by
Gprof or external visualization tools to pinpoint
performance hotspots and optimize the code
accordingly.

Simple Code and Gprof Steps: -


1. Code:
%%writefile looping.cpp
#include <iostream>
using namespace std;

int main() {
for (int i = 1; i <= 5; ++i) {
cout << i << " ";
}
return 0;
}


2. Steps:
 !g++ -Wall -pg looping.cpp -o looping
 !ls

 ! ./looping

 !ls

 !gprof looping gmon.out > looping.txt


 !cat looping.txt


Complex Code and Gprof Steps: -


3. Code:
%%writefile hat.cpp
#include <iostream>
using namespace std;

int main()
{
int i;
for (i = 0; i < 5; i++) {
cout << i << " ";
}
cout << endl;
i = 0;
while (i < 5) {
cout << i << " ";
i++;
}
return 0;
}

4. Steps:
 !g++ -Wall -pg hat.cpp -o hat
 !ls

 ! ./hat

 !gprof hat gmon.out > hat.txt

 !cat hat.txt
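
Both examples above finish in microseconds, so their flat profiles are mostly empty. A sketch with two deliberately CPU-heavy functions (the file and function names are illustrative) gives gprof something to attribute:

%%writefile work.cpp
#include <iostream>
using namespace std;

volatile double sink; // keeps the compiler from optimizing the loops away

void heavy() { // should dominate the flat profile
    for (long i = 0; i < 300000000L; i++) sink += i * 0.5;
}

void light() { // roughly one tenth of the work
    for (long i = 0; i < 30000000L; i++) sink += i * 0.5;
}

int main() {
    heavy();
    light();
    cout << "done" << endl;
    return 0;
}

Compile, run, and report with the same sequence as before (!g++ -Wall -pg work.cpp -o work, ! ./work, !gprof work gmon.out > work.txt, !cat work.txt); the flat profile should now attribute most of the samples to heavy().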


PRACTICAL: 06

AIM: Use Intel V-Tune Performance Analyzer for Profiling.

Theory: -
What's New in Intel® VTune™ Profiler: -

 GPU Accelerators
 Stall Factor Information in GPU Profiling Results
When you run the GPU Compute/Media Hotspots analysis
to profile applications running on Intel® Data Center GPU
Max Series (code named Ponte Vecchio) devices, you can
now see the reasons for stalls in Xe Vector Engines
(XVEs), formerly known as Execution Units (EUs). Use
this information to better understand and resolve the stalls
in your busiest computing tasks. For more information, see
Analyze Xe Vector Engine (XVE) Stalls

 Metric Groups for Multiple GPUs


When you run the GPU Compute/Media Hotspots analysis
to profile an application executing on multiple Intel GPUs,
you can now see metric information grouped by Intel
microarchitecture family. See metrics for every GPU
architecture family in a new consolidated view. To learn
more, see Analysis Results for Multiple GPUs.

Application Performance Snapshot:


 Updated Metrics for Multiple GPUs
GPU metric information in the Application Performance Snapshot HTML reports has been enhanced to better represent data collected from multiple GPUs.

 Histograms in Metric Tooltips


The metric tooltips in Application Performance Snapshot
HTML reports were enhanced with histograms that clearly
visualize the distribution of metric values observed during
analysis.

Enrollment No.: 210303105168


CHAUDHARY UMANGKUMAR BALUBHAI
P a g e | 24
Div: 25(CSE)
Faculty of Engineering & Technology
High Performance Computing (203105430)
B. Tech CSE 4th Year 7th Semester

Intel® VTune™ Profiler is a powerful tool for analyzing and


optimizing the performance of applications. Here’s a detailed
overview of how to use it for profiling:
1. Introduction to Intel® VTune™ Profiler
Intel® VTune™ Profiler is a performance analysis tool
that helps developers identify and fix performance
bottlenecks in their applications. It supports a wide range
of programming languages, including C, C++, Fortran,
Python, and more. The tool can profile applications
running on CPUs, GPUs, and FPGAs, making it versatile
for various computing environments.

2. Installation and Setup


To get started with Intel® VTune™ Profiler, you need to
install it as part of the Intel® oneAPI Base Toolkit or as a
standalone version. Here are the steps:
Download and Install: Visit the Intel® VTune™
Profiler download page and follow the instructions to
download and install the tool.
System Setup: For CPU and GPU profiling, ensure
that the necessary drivers and libraries are installed.
For example, on Linux, you might need to install the
Intel® Metric Discovery API Library for GPU
analysis.
3. Profiling an application
Intel® VTune™ Profiler provides various analysis types to
profile different aspects of your application:

a. CPU Hotspots Analysis


This analysis helps identify the most time-consuming
functions in your application. To perform a CPU
Hotspots Analysis:
Launch VTune Profiler: Open the VTune Profiler
GUI or use the command line tool.
Select Analysis Type: Choose “CPU Hotspots” from
the list of available analysis types.


Run Analysis: Specify the application to profile and start the analysis. VTune Profiler will collect data on CPU usage and highlight the functions that consume the most CPU time.
b. Microarchitecture Exploration
This analysis provides insights into how well your
application utilizes the CPU’s microarchitecture. It
helps identify issues like cache misses, branch
mispredictions, and more:
Select Analysis Type: Choose “Microarchitecture
Exploration” from the analysis types.
Run Analysis: Start the analysis and review the
results to identify microarchitectural bottlenecks.
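
For scripted or remote runs, the same analyses can also be launched from the VTune command-line interface instead of the GUI. A minimal sketch (the application path and the r000hs result-directory name are illustrative):
 vtune -collect hotspots -- ./my_app
 vtune -report summary -r r000hs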
4. Analyzing Results
After running an analysis, VTune Profiler provides
detailed reports that help you understand the performance
characteristics of your application:
Hotspots View: This view shows the functions that
consume the most CPU time, allowing you to focus
on optimizing these areas.
Timeline View: This view provides a timeline of
your application’s execution, showing CPU and GPU
activity over time.
Call Stack View: This view shows the call stack for
each hotspot, helping you understand the context in
which performance issues occur.
5. Optimizing Your Application
Based on the analysis results, you can make targeted
optimizations to improve your application’s performance:
Algorithm Optimization: Refactor or replace
inefficient algorithms with more efficient ones.
Parallelization: Use threading or parallel
programming techniques to take advantage of multi-
core processors.
Memory Optimization: Optimize memory access
patterns to reduce cache misses and improve data
locality.


6. Advanced Features
Intel® VTune™ Profiler offers several advanced features
for in-depth performance analysis:
Threading Analysis: Analyze the efficiency of your
application’s threading implementation and identify
synchronization issues.
I/O Analysis: Profile I/O operations to identify file
and network access bottlenecks.
GPU Offload Analysis: Analyze GPU-accelerated
code's performance and identify optimization
opportunities.
7. Conclusion
Intel® VTune™ Profiler is a comprehensive tool for
performance analysis and optimization. Using its various
analysis types and features, you can gain deep insights into
your application’s performance and make informed
decisions to optimize it.


PRACTICAL: 07

AIM: Analyze the code using Nvidia-Profilers.

Theory: -
What is a profiler:
A profiler is a software tool used in programming and
software development to measure and analyze the runtime
behavior of a program. Its primary purpose is to identify
bottlenecks and areas of inefficiency in the code, helping
developers optimize performance.

What is nvprof, and what are its key features?


nvprof is a command-line profiling tool provided by NVIDIA
for use with CUDA applications. It allows developers to
collect and view a wide variety of performance data for
CUDA-enabled GPUs, helping them to identify
performance bottlenecks and optimize their applications.

Key Features:
1. Kernel and Memory Operations Profiling:
 Collects detailed information about the execution of
CUDA kernels and memory operations.
 Measures kernel execution time, memory transfer time,
and other performance metrics.

2. Event and Metric Collection:


 Captures a wide range of performance events and
metrics, including instructions executed, memory
transactions, cache hits and misses, and more.
 Allows developers to specify which events and metrics
to collect.

3. Timeline View:
 Provides a timeline view of kernel execution and
memory transfers, helping developers visualize the
sequence of operations and identify overlapping or
sequential execution patterns.

4. API Activity Tracing:


 Traces CUDA API calls, providing information about
the timing and duration of each call.
 Helps identify inefficiencies related to API usage.

5. CUDA Context and Stream Profiling:


 Profiles activity within different CUDA contexts and
streams, enabling developers to optimize concurrent
execution.

6. Output Formats:
 Supports multiple output formats, including text, CSV,
and SQL, allowing for flexible data analysis and
reporting.
 The collected data can be further analyzed using tools
like NVIDIA Visual Profiler (nvvp) or NVIDIA Nsight
Systems.

7. Device and Kernel Profiling:


 Profiles individual CUDA devices and kernels,
providing detailed information about their performance
characteristics.
 Helps in comparing the performance of different
devices and optimizing kernel performance.

8. Command-Line Interface:
 Operates through a command-line interface, making it
suitable for use in automated scripts and batch
processing.
 Offers a wide range of options and filters to customize
profiling sessions.

9. Profiling Overheads:
 Provides information about the overhead introduced by
profiling, helping developers understand and manage the impact on application performance.

10. Compatibility:
 Supports a wide range of CUDA-enabled GPUs and
works with CUDA applications written in C, C++,
Fortran, and other supported languages.

Steps of using nvprof:

 Step 1: Compile Your CUDA Program


!nvcc name.cu -o name

 Step 2: Run Your Program


! ./name

 Step 3: Run Your Program with nvprof


!nvprof ./name

CUDA Program to add vectors: -


Code:
%%writefile vector_add.cu
#include <stdio.h>
#include <cuda.h>
#include <cuda_runtime_api.h>

__global__ void AddVectorsCUDA(int *a, int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n) {
        c[index] = a[index] + b[index];
    }
}

int main() {
    int n = 5;
    int h_a[] = {1, 2, 3, 4, 5};
    int h_b[] = {6, 7, 8, 9, 10};
    int h_c[n];

    int *d_a, *d_b, *d_c;

    cudaMalloc((void **)&d_a, n * sizeof(int));
    cudaMalloc((void **)&d_b, n * sizeof(int));
    cudaMalloc((void **)&d_c, n * sizeof(int));

    cudaMemcpy(d_a, h_a, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, n * sizeof(int), cudaMemcpyHostToDevice);

    AddVectorsCUDA<<<1, n>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, n * sizeof(int), cudaMemcpyDeviceToHost);

    printf("Result of vector addition:\n");
    for (int i = 0; i < n; i++) {
        printf("%d ", h_c[i]);
    }
    printf("\n");

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    return 0;
}

Steps:
 !nvcc vector_add.cu -o vector_add

 ! ./vector_add

 !nvprof ./vector_add

CUDA Program to multiply matrices: -


Code:
%%writefile matrix_mul.cu
#include <stdio.h>
#include <cuda.h>
#include <cuda_runtime_api.h>

#define N 2 // Defining a small 2x2 matrix for simplicity

__global__ void MatrixMulCUDA(int *a, int *b, int *c) {
    int row = threadIdx.y;
    int col = threadIdx.x;

    int sum = 0;
    for (int k = 0; k < N; k++) {
        sum += a[row * N + k] * b[k * N + col];
    }
    c[row * N + col] = sum;
}

int main() {
    int h_a[N][N] = { {1, 2}, {3, 4} };
    int h_b[N][N] = { {5, 6}, {7, 8} };
    int h_c[N][N];

    int *d_a, *d_b, *d_c;

    cudaMalloc((void **)&d_a, N * N * sizeof(int));
    cudaMalloc((void **)&d_b, N * N * sizeof(int));
    cudaMalloc((void **)&d_c, N * N * sizeof(int));

    cudaMemcpy(d_a, h_a, N * N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, N * N * sizeof(int), cudaMemcpyHostToDevice);

    dim3 threadsPerBlock(N, N);
    MatrixMulCUDA<<<1, threadsPerBlock>>>(d_a, d_b, d_c);

    cudaMemcpy(h_c, d_c, N * N * sizeof(int), cudaMemcpyDeviceToHost);

    printf("Result of matrix multiplication:\n");
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            printf("%d ", h_c[i][j]);
        }
        printf("\n");
    }

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    return 0;
}

Steps:
 !nvcc matrix_mul.cu -o matrix_mul
 ! ./matrix_mul

 !nvprof ./matrix_mul


PRACTICAL: 08

AIM: Write a program to perform load distribution on GPU using CUDA.

Theory: -
Load distribution on the GPU using CUDA:
Performing load distribution on GPU using CUDA in
Google Colab involves several steps. CUDA is a parallel
computing platform and application programming
interface (API) developed by NVIDIA for general-purpose
GPU programming. Google Colab provides free access to
GPU resources, making it an excellent platform to
experiment with CUDA.

Here are the steps to create a simple CUDA program in Google Colab:
1. Access Google Colab
2. Create a New Notebook
3. Set GPU as the Runtime Type
4. Install Required Libraries:
If you need to install any libraries, you can do so using pip. For CUDA programming from Python, you might need the pycuda library: !pip install pycuda
5. Import Required Libraries:
Import the necessary Python libraries, including
pycuda and numpy for this example.
6. Write CUDA Kernel Code:
Write the CUDA kernel code in a code cell. This is
the part of the code that will run on the GPU.

Code: -
%%writefile load_distribute.cu
#include <iostream>

__global__ void workloadDistributionKernel(int *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        data[idx] = idx; // assign thread index to data array
    }
}

int main() {
    const int n = 23; // array size
    int data[n];

    // Allocate memory on the GPU
    int *data_gpu;
    cudaMalloc((void **)&data_gpu, n * sizeof(int));

    // Launch kernel: the ceiling division (n + blockSize - 1) / blockSize gives
    // 3 blocks of 8 threads = 24 threads; the if (idx < n) guard idles the extra one.
    int blockSize = 8;
    int numBlocks = (n + blockSize - 1) / blockSize;
    workloadDistributionKernel<<<numBlocks, blockSize>>>(data_gpu, n);

    // Copy result from device to host
    cudaMemcpy(data, data_gpu, n * sizeof(int), cudaMemcpyDeviceToHost);

    // Print result
    for (int i = 0; i < n; i++) {
        std::cout << "data[" << i << "] = " << data[i] << std::endl;
    }

    // Free memory
    cudaFree(data_gpu);

    return 0;
}


Output:
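
A common extension of this launch pattern (not part of the manual's program) is the grid-stride loop: a fixed launch shape can cover an array of any size, because each thread processes every (gridDim.x * blockDim.x)-th element instead of exactly one. A minimal sketch, reusing the n = 23 example:

%%writefile stride_distribute.cu
#include <stdio.h>

// Grid-stride loop: each thread strides through the array, so the launch
// shape no longer has to match the problem size.
__global__ void strideKernel(int *data, int n) {
    int stride = gridDim.x * blockDim.x;
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x; idx < n; idx += stride) {
        data[idx] = idx;
    }
}

int main() {
    const int n = 23;
    int data[n];
    int *data_gpu;
    cudaMalloc((void **)&data_gpu, n * sizeof(int));

    // 2 blocks of 8 threads = 16 threads; the stride loop still covers all 23 items.
    strideKernel<<<2, 8>>>(data_gpu, n);

    cudaMemcpy(data, data_gpu, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; i++) {
        printf("data[%d] = %d\n", i, data[i]);
    }
    cudaFree(data_gpu);
    return 0;
}

Compile and run as before: !nvcc stride_distribute.cu -o stride_distribute followed by !./stride_distribute.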


PRACTICAL: 09

AIM: Write a simple CUDA program to print “Hello World!”.

Theory: -
What is CUDA programming:
CUDA (Compute Unified Device Architecture) is a
parallel computing platform and application programming
interface (API) model created by NVIDIA. It allows
developers to use NVIDIA GPUs (graphics processing
units) for general-purpose processing (an approach known
as GPGPU, General-Purpose computing on Graphics
Processing Units).

Logical Architecture of GPU:

1. Grids:
Grid refers to the highest-level grouping of threads that are
scheduled for execution on the GPU device. It represents
the entire set of parallel work that needs to be processed by
the GPU.

2. Blocks:
A block is a group of threads that execute concurrently on an SM. Threads within the same block can cooperate with
each other through shared memory and synchronization
mechanisms.

3. Warps:
A warp is the smallest unit of execution in CUDA. It
consists of 32 consecutive threads that are executed in
lockstep on an SM. This means that all 32 threads within a
warp execute the same instruction at the same time.

4. Threads:
A thread is a basic unit of execution in CUDA (NVIDIA's
parallel computing platform). Threads are organized into
groups called thread blocks, and multiple thread blocks are
organized into a grid.

CUDA Program execution Flow:


Steps:
 Data copy from CPU to GPU.
 Execution on GPU.
 Data copy from GPU to CPU.

CUDA Program to print hello world: -


Code:
%%writefile ud.cu
#include "stdio.h"
__global__ void cuda_hello(){
    printf("Hello World!\n");
}
int main(){
    cuda_hello<<<1,5>>>();
    cudaDeviceSynchronize();
    return 0;
}

Steps:
 !nvcc ud.cu -o ud
 ! ./ud


CUDA Program to print hello world with different numbers of blocks and threads: -
Code:
%%writefile ud1.cu
#include "stdio.h"
__global__ void cuda_hello(){
    printf("Hello Engineer!\n");
}
int main(){
    cuda_hello<<<4,2>>>();
    cudaDeviceSynchronize();
    return 0;
}

Steps:
 !nvcc ud1.cu -o ud1
 ! ./ud1
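
To see exactly which block and thread produced each line, a small variant (a sketch in the same style; the file name ids.cu is illustrative) can print the built-in index variables:

%%writefile ids.cu
#include "stdio.h"
__global__ void whoami(){
    // Each thread reports its coordinates in the launch configuration.
    int global_id = blockIdx.x * blockDim.x + threadIdx.x;
    printf("block %d, thread %d -> global id %d\n",
           blockIdx.x, threadIdx.x, global_id);
}
int main(){
    whoami<<<4,2>>>(); // 4 blocks of 2 threads, matching the example above
    cudaDeviceSynchronize();
    return 0;
}

Run with !nvcc ids.cu -o ids and ! ./ids.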


PRACTICAL: 10

AIM: Write a CUDA program to add two arrays.

Theory: -
The CUDA programming model used here (grids, blocks, warps, threads, and the copy-to-GPU, execute, copy-back execution flow) is the same as described in Practical 09.

CUDA Program to add two numbers: -


Code:
%%writefile addNumbers.cu
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>

__global__ void addKernel(int* c, const int* a, const int* b) {
    *c = *a + *b;
}

void addWithCuda(int* c, const int* a, const int* b) {
    int* dev_a = nullptr;
    int* dev_b = nullptr;
    int* dev_c = nullptr;

    cudaMalloc((void**)&dev_c, sizeof(int));
    cudaMalloc((void**)&dev_a, sizeof(int));
    cudaMalloc((void**)&dev_b, sizeof(int));

    cudaMemcpy(dev_a, a, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, sizeof(int), cudaMemcpyHostToDevice);

    addKernel<<<1, 1>>>(dev_c, dev_a, dev_b);
    cudaDeviceSynchronize();
    cudaMemcpy(c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dev_c);
    cudaFree(dev_a);
    cudaFree(dev_b);
}

int main() {
    const int a = 5;
    const int b = 7;
    int c = 0;

    addWithCuda(&c, &a, &b);

    printf("%d + %d = %d\n", a, b, c);

    cudaDeviceReset();

    return 0;
}

Steps:
 !nvcc addNumbers.cu -o addNumbers
 !./addNumbers

CUDA Program to add two arrays: -


Code:
%%writefile addArrays.cu
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>

__global__ void addKernel(int* c, const int* a, const int* b, int size) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size) {
        c[i] = a[i] + b[i];
    }
}

void addWithCuda(int* c, const int* a, const int* b, int size) {
    int* dev_a = nullptr;
    int* dev_b = nullptr;
    int* dev_c = nullptr;

    cudaMalloc((void**)&dev_c, size * sizeof(int));
    cudaMalloc((void**)&dev_a, size * sizeof(int));
    cudaMalloc((void**)&dev_b, size * sizeof(int));

    cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);

    // Ceiling division so every element gets a thread even when size is not
    // a multiple of threadsPerBlock.
    int threadsPerBlock = 256;
    int blocksPerGrid = (size + threadsPerBlock - 1) / threadsPerBlock;
    addKernel<<<blocksPerGrid, threadsPerBlock>>>(dev_c, dev_a, dev_b, size);
    cudaDeviceSynchronize();
    cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dev_c);
    cudaFree(dev_a);
    cudaFree(dev_b);
}

int main() {
    const int arraySize = 5;
    const int a[arraySize] = { 11, 22, 43, 34, 55 };
    const int b[arraySize] = { 11, 22, 34, 43, 55 };
    int c[arraySize] = { 0 };

    addWithCuda(c, a, b, arraySize);

    printf("{11, 22, 43, 34, 55} + {11, 22, 34, 43, 55} = {%d, %d, %d, %d, %d}\n",
           c[0], c[1], c[2], c[3], c[4]);

    cudaDeviceReset();

    return 0;
}

Steps:
 !nvcc addArrays.cu -o addArrays
 !./addArrays
