GPU Test Answer Bank
1. Warp Divergence
Warp divergence occurs when threads in a warp take different execution paths. This causes
threads to be serialized, leading to performance degradation.
Impact on Performance
Divergence reduces parallel efficiency, as threads executing different instructions must wait for
others to complete.
Example Use
Useful in shared memory operations where data consistency is required before further operations.
Effect
Reduces the number of memory transactions and improves overall memory bandwidth.
Shared Memory
Shared memory is faster but limited in size, and requires explicit management by the programmer.
Avoiding Conflicts
Aligning memory accesses or padding arrays can help in avoiding bank conflicts.
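For example, a shared-memory tile can be padded with one extra column so that column-wise reads fall into different banks. A minimal sketch, assuming a 32x32 thread block; the transpose-style kernel and its names are only illustrative:

__global__ void transpose_tile(const float *in, float *out, int n) {
    // 33 columns instead of 32: the extra column shifts each row by one bank,
    // so reading down a column no longer maps every thread to the same bank
    __shared__ float tile[32][33];
    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];
    __syncthreads();
    int tx = blockIdx.y * 32 + threadIdx.x;   // transposed coordinates
    int ty = blockIdx.x * 32 + threadIdx.y;
    if (tx < n && ty < n)
        out[ty * n + tx] = tile[threadIdx.x][threadIdx.y];   // column-wise, conflict-free read
}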
6. Occupancy in CUDA
Occupancy refers to the ratio of active warps on a Streaming Multiprocessor (SM) to the
maximum number of warps supported.
Influence on Performance
Higher occupancy can hide memory latency by ensuring more warps are ready for execution.
Caching Mechanism
Constant memory has a broadcast mechanism, while texture memory uses a texture cache for spatial
data.
8. Warp Size in CUDA
Warp size refers to the number of threads in a warp, typically 32. All threads in a warp
execute the same instruction at the same time.
Optimizing Performance
Designing algorithms to avoid divergence within a warp is key to improving performance.
Parallel Execution
Warps on an SM are scheduled and executed in a pipelined fashion to hide memory latency.
Example
Commonly used in matrix multiplication to optimize performance.
Trade-off
High register usage can reduce register spilling, but may also reduce the number of threads that can
run concurrently.
Application
Used for assigning work to different threads in parallel computing.
Advantage
Provides flexibility for handling multi-dimensional data efficiently.
Solution
Ensure the total number of threads equals or exceeds the number of operations.
Example
Data parallelism is often used in matrix operations, while task parallelism may be used in systems
with multiple CPU cores handling different tasks.
Role
These components work together to execute large numbers of threads in parallel, maximizing
throughput.
Application
CPUs are used for general-purpose tasks, and GPUs excel in parallel tasks like graphics rendering and
scientific computing.
It allows developers to add parallelism to their applications easily using compiler directives.
OpenMP supports C, C++, and Fortran, enabling code to run on multi-core processors
efficiently.
OpenMP uses directives such as #pragma omp parallel to create parallel regions in the code.
When a parallel region is encountered, multiple threads are spawned to execute the code
concurrently.
Threads share the same memory space, facilitating communication and data sharing among
them.
This model allows developers to parallelize existing sequential code with minimal changes.
The #pragma omp parallel directive defines a section of code that can be executed by
multiple threads simultaneously.
This directive is crucial for leveraging the capabilities of multi-core processors to reduce
execution time.
Alternatively, you can set the OMP_NUM_THREADS environment variable before running
the program.
Thread synchronization is essential for preventing race conditions when multiple threads
access shared resources.
It ensures that threads coordinate their operations, leading to consistent and correct results.
OpenMP provides mechanisms like critical sections (#pragma omp critical) and barriers
(#pragma omp barrier) for managing synchronization.
Proper synchronization is vital for the integrity of data in parallel applications.
6. How do #pragma omp critical and #pragma omp barrier differ?
#pragma omp critical: Restricts access to a block of code to one thread at a time, ensuring
safe updates to shared variables.
#pragma omp barrier: Forces all threads to wait until every thread reaches the barrier
before proceeding, synchronizing their execution.
Critical sections prevent data races, while barriers ensure that all threads are synchronized at
specific points in the program.
Both are essential for managing thread interactions and maintaining data consistency.
The #pragma omp for directive is used to distribute loop iterations among multiple threads
for parallel execution.
It enables efficient parallel processing of loops where iterations are independent,
significantly speeding up computations.
This directive simplifies the process of parallelizing loops, reducing the need for manual
thread management.
By leveraging #pragma omp for, developers can improve the performance of computationally
intensive tasks.
OpenMP distinguishes between shared and private variables to manage data access in
parallel regions.
Shared variables are accessible by all threads, allowing them to read and modify the same
instance.
Private variables are unique to each thread, ensuring that no other thread can access or
alter their values.
Proper management of variable scopes is critical for avoiding race conditions and ensuring
thread safety.
Dynamic scheduling allows iterations to be assigned to threads at run time, improving load
balancing across threads.
It enables idle threads to pick up new tasks, minimizing wait times and enhancing resource
utilization.
This scheduling method is beneficial for workloads where execution times of iterations are
unpredictable.
It helps maintain performance even in scenarios with irregular task distributions.
10. How does OpenMP achieve load balancing?
OpenMP achieves load balancing primarily through dynamic scheduling, which distributes
work evenly among threads.
When a thread completes its assigned tasks, it can request more work from a pool,
preventing idleness.
This approach ensures that all threads remain busy and work is processed efficiently.
Load balancing is crucial for maximizing the performance of parallel applications.
An OpenMP task is a unit of work that can be executed asynchronously by any thread in a
team.
Tasks allow for greater flexibility in parallel programming, especially for irregular workloads
that don’t fit neatly into loops.
By using tasks, developers can express parallelism for complex applications, such as those
with recursive calls or dynamic workflows.
Tasking enables more efficient use of available threads and resources by allowing
independent units of work to be processed as they become available.
Answer: A race hazard occurs when sections of the program "race" toward a critical point, such as a
memory read/write. Sometimes warp 0 may win the race and the result is correct. Other times warp
1 might get delayed and warp 3 hits the critical section first, producing the wrong answer. The major
problem with race hazards is they do not always occur. This makes debugging them and trying to
place a breakpoint on the error difficult. The second feature of race hazards is they are extremely
sensitive to timing disturbances. Thus, adding a breakpoint and single-stepping the code always
delays the thread being observed. This delay often changes the scheduling pattern of other warps,
meaning the particular conditions of the wrong answer may never occur.
Answer: Common mistakes made by programmers relating to synchronization include:
1. Using the default stream
2. Using synchronous memory copies
3. Not using pinned memory
4. Overuse of synchronization primitives
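A sketch of the corresponding good practice: pinned host memory plus asynchronous copies and a kernel launch in a non-default stream. The kernel name myKernel and the sizes N, blocks, and threads are placeholders, not part of the original answer:

__global__ void myKernel(float *data, int n);        // placeholder kernel, defined elsewhere

void launch_async(int N, int blocks, int threads) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    float *h_data, *d_data;
    size_t bytes = N * sizeof(float);
    cudaMallocHost((void **)&h_data, bytes);         // pinned memory enables true async copies
    cudaMalloc((void **)&d_data, bytes);

    // enqueue copy, kernel, and copy-back in the same non-default stream
    cudaMemcpyAsync(d_data, h_data, bytes, cudaMemcpyHostToDevice, stream);
    myKernel<<<blocks, threads, 0, stream>>>(d_data, N);
    cudaMemcpyAsync(h_data, d_data, bytes, cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);                   // block only when the results are needed

    cudaFree(d_data);
    cudaFreeHost(h_data);
    cudaStreamDestroy(stream);
}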
8 Marks
Program Steps
1. Memory Allocation: The first step is to allocate memory for the matrices on the device using
cudaMalloc(). The input matrices A and B, and the output matrix C, need to be stored on the
GPU.
2. Divide into Tiles: The matrices are divided into smaller tiles that fit into the fast shared
memory. Each thread block processes a tile of the matrix, performing part of the overall
multiplication.
3. Load Data into Shared Memory: Threads within a block load the tiles of matrices A and B
into shared memory. This reduces global memory accesses, which are much slower than
shared memory.
4. Synchronize Threads: Using __syncthreads() ensures that all threads have finished loading
data into shared memory before proceeding to the computation.
5. Perform Computation: Each thread computes part of the result by multiplying corresponding
elements of the tiles and summing the results.
6. Write Results: Once the multiplication is complete, the results are written back to the output
matrix in global memory.
7. Transfer Results to Host: The final result is transferred back to the host using cudaMemcpy().
The use of shared memory reduces the number of accesses to global memory, significantly improving
performance. By tiling the matrices, threads can reuse data stored in shared memory, minimizing the
latency associated with global memory.
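A minimal kernel sketch following these steps, assuming square N x N matrices and a tile width of 16; the macro TILE and the kernel name matMulTiled are illustrative:

#define TILE 16

__global__ void matMulTiled(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];   // tile of A in shared memory
    __shared__ float Bs[TILE][TILE];   // tile of B in shared memory

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < (N + TILE - 1) / TILE; t++) {
        // each thread loads one element of the current A and B tiles
        As[threadIdx.y][threadIdx.x] = (row < N && t * TILE + threadIdx.x < N)
                                       ? A[row * N + t * TILE + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (col < N && t * TILE + threadIdx.y < N)
                                       ? B[(t * TILE + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();               // wait until the whole tile is loaded

        for (int k = 0; k < TILE; k++)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();               // wait before the tiles are overwritten
    }

    if (row < N && col < N)
        C[row * N + col] = sum;        // write the result back to global memory
}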
2. Explain the CUDA Programming Model with Emphasis on Threads, Blocks, and Grids
The CUDA programming model is designed to take advantage of the parallel architecture of GPUs. In
this model, a problem is broken down into thousands of small tasks that can be executed
concurrently. CUDA organizes these tasks into a hierarchy of threads, blocks, and grids.
Threads
A thread is the smallest unit of execution in CUDA. Each thread performs a specific operation, such as
adding two elements of an array. Threads within a block can communicate with each other using
shared memory and can synchronize their execution using __syncthreads().
Thread Blocks
Threads are grouped into blocks. Each block consists of multiple threads that execute the same
kernel function. The size of the block (number of threads) is chosen based on the problem being
solved and the hardware limitations of the GPU. All threads within a block can access shared
memory, allowing them to share intermediate results and reduce global memory accesses.
Grids
A grid is a collection of thread blocks. Each block is executed independently, and blocks do not
communicate with each other. This structure allows the program to scale to handle very large
datasets, as each block works on a portion of the data.
Each thread has a unique index, calculated using threadIdx and blockIdx, which allows it to operate
on a specific piece of data. For example, in a matrix multiplication program, the index of a thread
determines which elements of the matrices it multiplies.
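A sketch of the usual index calculation for a one-dimensional problem; the kernel name and the array data are illustrative:

__global__ void scaleKernel(float *data, int n) {
    // global index: block offset plus the thread's position within its block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;   // each thread operates on its own element
}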
Scalability
The CUDA programming model is highly scalable. The same code can run efficiently on GPUs with
different architectures because the work is distributed across grids and blocks, allowing the program
to adapt to the number of available cores.
3. Describe CUDA's Memory Hierarchy and its Implications for Performance
CUDA's memory hierarchy is critical to the performance of GPU programs. The different types of
memory include:
Registers: The fastest type of memory, used for storing thread-specific data.
Shared Memory: Fast on-chip memory shared by all threads within a block.
Global Memory: Large but slow memory that is accessible by all threads across all blocks.
Constant Memory: A small, read-only memory that is cached and optimized for uniform
access across threads.
Texture Memory: Cached memory optimized for 2D spatial locality, often used in image
processing.
1. Minimizing Global Memory Access: Since global memory has high latency, minimizing the
number of accesses to it is essential for performance. Programs should aim to load data from
global memory once and reuse it multiple times within shared memory.
2. Using Shared Memory: Shared memory is much faster than global memory, but its size is
limited. Efficiently using shared memory can dramatically reduce the time spent accessing
data from global memory, especially in programs like matrix multiplication.
3. Register Usage and Occupancy: Each thread has access to a limited number of registers.
Using too many registers can reduce the number of threads that can run concurrently (lower
occupancy), while using too few may lead to register spilling, where variables are stored in
slower local memory.
Memory Coalescing
Memory coalescing is another important factor for performance. When threads access consecutive
memory addresses, their requests are combined into a single transaction. This reduces the number
of memory transactions and improves performance. Developers should write CUDA code to ensure
that memory accesses are coalesced whenever possible.
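A short sketch contrasting a coalesced and a strided access pattern; the kernel names are illustrative:

// Coalesced: consecutive threads read consecutive addresses, so one warp's
// loads combine into a few wide memory transactions.
__global__ void coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads read addresses far apart, forcing many
// separate memory transactions per warp.
__global__ void strided(const float *in, float *out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}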
In CUDA, threads are executed in groups of 32 called warps. All threads in a warp execute the same
instruction simultaneously. If all threads in a warp follow the same control path (no divergence), the
warp executes efficiently.
Warp Scheduling
Each Streaming Multiprocessor (SM) on the GPU contains multiple warp schedulers. The warp
scheduler selects which warps are executed at any given time. If one warp stalls due to a memory
operation or a conditional branch, the scheduler can switch to another ready warp, allowing the GPU
to remain fully utilized and reducing idle time.
Warp Divergence
Warp divergence occurs when threads within a warp take different execution paths due to
conditional statements (if, else). When this happens, the warp is split into multiple sub-warps, which
must be executed sequentially. This leads to reduced parallelism and lower performance. Developers
should aim to minimize warp divergence by structuring their code so that threads within a warp
follow the same execution path.
Hiding Latency
One of the key functions of warp scheduling is hiding memory latency. When a warp is waiting for
data from global memory, the scheduler can switch to another warp that is ready to execute. This
ensures that the SM remains busy, even when some warps are stalled.
5. Explain the Differences between CUDA Global Memory, Shared Memory, Constant Memory, and
Texture Memory
Global Memory
Size and Scope: Global memory is the largest memory space on the GPU, accessible by all
threads across all blocks. However, it has high latency, making it slower to access compared
to other memory types.
Usage: Global memory is typically used for storing large datasets that need to be shared
across multiple blocks, such as matrices or arrays. To minimize performance issues, data
should be accessed in a coalesced manner (consecutive threads accessing consecutive
memory addresses).
Shared Memory
Speed: Shared memory is an on-chip memory that is much faster than global memory. It is
shared among all threads within a block, allowing them to collaborate on computations.
Limitations: The size of shared memory is limited, so it must be used efficiently. It is best
suited for temporary storage of intermediate results, such as tiles of a matrix in matrix
multiplication.
Constant Memory
Usage: Ideal for storing constants, configuration data, or small lookup tables. If all threads
read the same value, the data is broadcast to all threads, making it very efficient.
Texture Memory
Spatial Access Optimization: Texture memory is cached and optimized for 2D spatial locality,
making it suitable for image processing tasks where threads access neighboring pixels.
Interpolation: In addition to being cached, texture memory supports interpolation, which
can be useful for certain types of applications, such as rendering or image transformations.
6. Describe the Execution Model of a CUDA Kernel and How Thread Blocks and Warps Are
Scheduled on the Hardware
A kernel is the function that runs on the GPU. When a kernel is launched, it is executed by thousands
of threads in parallel. Each thread runs the same code, but operates on different data elements.
Thread Blocks
Threads are grouped into blocks, which are scheduled on the Streaming Multiprocessors (SMs) of the
GPU. Each block is assigned to an SM for execution. Once a block is assigned to an SM, it remains
there until all of its threads have completed execution.
Warps
Within each block, threads are further grouped into warps of 32 threads. A warp is the smallest
execution unit in CUDA. All threads in a warp execute the same instruction simultaneously. The
threads in a warp are processed in lockstep, meaning they all follow the same execution path.
Resource Allocation
Each SM has a limited amount of resources, such as registers and shared memory. The number of
thread blocks that can run concurrently on an SM depends on the resource requirements of each
block. For example, if a kernel uses a large amount of shared memory or registers per block, fewer
blocks can be scheduled on the SM.
Program Overview
In element-wise array addition, each thread is responsible for adding one element from two input
arrays and storing the result in a third array. This is a simple, highly parallel operation, as each
addition is independent of the others.
Steps
1. Allocate Memory: Use cudaMalloc() to allocate memory for the input and output arrays on
the GPU.
2. Transfer Data to Device: Copy the input arrays from the host (CPU) to the device (GPU) using
cudaMemcpy().
3. Kernel Launch: Write a kernel function where each thread adds corresponding elements
from the two input arrays and stores the result in the output array.
4. Transfer Results to Host: After the kernel finishes, copy the output array back to the host
using cudaMemcpy().
5. Free Memory: Use cudaFree() to free the allocated memory on the GPU.
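A minimal kernel sketch for these steps; the array names and the launch configuration are illustrative:

__global__ void addArrays(const float *a, const float *b, float *c, int N) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < N) {
        c[index] = a[index] + b[index];   // each thread handles one element
    }
}

// launch with enough threads to cover all N elements, e.g.:
// addArrays<<<(N + 255) / 256, 256>>>(d_a, d_b, d_c, N);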
Program Overview
Vector addition is similar to element-wise array addition. Each thread is responsible for adding
corresponding elements from two input vectors and storing the result in a third vector.
Steps
1. Memory Allocation: Allocate memory for the vectors on the GPU using cudaMalloc().
2. Data Transfer: Copy the input vectors from the host to the device using cudaMemcpy().
3. Kernel Execution: Each thread adds one element from the two input vectors and stores the
result in the output vector.
4. Results Transfer: After the kernel has finished, copy the result back to the host using
cudaMemcpy().
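The kernel mirrors the array-addition sketch above; the vector names are illustrative:

__global__ void addVectors(const float *x, const float *y, float *z, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        z[i] = x[i] + y[i];   // one vector element per thread
    }
}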
9. Discuss Thread Synchronization in CUDA and Write a Program Demonstrating the Use of
__syncthreads()
In CUDA, threads within a block can communicate with each other using shared memory. However,
when multiple threads access shared memory simultaneously, synchronization is necessary to ensure
that all threads have completed their operations before proceeding to the next step. CUDA provides
the __syncthreads() function to synchronize threads within a block.
Importance of Synchronization
Without synchronization, race conditions may occur where some threads proceed with incomplete
or incorrect data. For example, in matrix multiplication, synchronization is required to ensure that all
threads have loaded the data into shared memory before they begin multiplying the matrix
elements.
In matrix addition, threads within a block may need to wait for each other to complete loading data
into shared memory before proceeding to the addition step. Using __syncthreads() ensures that all
threads have completed loading data before they begin computing the sum.
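A minimal sketch of such a kernel, staging the inputs through shared memory before the addition; the block size of 256 and the array names are illustrative:

__global__ void addWithSharedMemory(const float *a, const float *b, float *c, int N) {
    __shared__ float sA[256];
    __shared__ float sB[256];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        sA[threadIdx.x] = a[i];     // stage the inputs in shared memory
        sB[threadIdx.x] = b[i];
    }
    __syncthreads();                // wait until every thread has finished loading

    if (i < N) {
        c[i] = sA[threadIdx.x] + sB[threadIdx.x];
    }
}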
GPUs are designed for parallel processing with thousands of cores capable of executing tasks
concurrently. The main components of a GPU include Streaming Multiprocessors (SMs), warp
schedulers, CUDA cores, and a memory hierarchy.
Each GPU consists of several SMs, which are the primary units responsible for executing threads.
Each SM contains multiple CUDA cores that execute arithmetic and logic operations. SMs also
contain shared memory, warp schedulers, and other units that help manage the execution of
threads.
CUDA Cores
CUDA cores are the individual processing units within an SM that perform arithmetic operations.
Each SM contains a large number of CUDA cores, allowing it to execute many threads in parallel.
Warp Schedulers
Each SM contains multiple warp schedulers that manage the execution of warps (groups of 32
threads). The warp scheduler switches between active warps to hide memory latency and ensure
that the SM is fully utilized.
Memory Hierarchy
Shared Memory: Fast, block-specific memory shared among threads within a block.
Constant and Texture Memory: Specialized, cached memory types optimized for specific
access patterns.
Parallel Execution
GPUs can execute thousands of threads simultaneously by distributing them across multiple SMs.
Each SM processes multiple warps, and the warp scheduler ensures that the SM remains busy by
switching between warps when some warps are waiting for data.
1. Explain the Core Concepts of OpenMP with Examples
Introduction to OpenMP
OpenMP (Open Multi-Processing) is a widely used API for parallel programming in shared-memory
systems. It allows developers to write parallel code in C, C++, and Fortran by adding compiler
directives, library routines, and environment variables to existing code. OpenMP simplifies the
development of parallel applications by abstracting the complexities of thread management.
Core Concepts
1. Parallelism in OpenMP
OpenMP uses the #pragma omp parallel directive to specify regions of code that can be
executed by multiple threads. These threads share the same memory space, allowing
efficient communication between them. Example:
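A minimal version of this example, in which each thread prints its own ID:

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        // each thread of the team executes this block
        printf("Hello from thread %d\n", omp_get_thread_num());
    }
    return 0;
}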
This code will execute the printf statement in parallel, and each thread will print its ID.
2. Work Sharing
OpenMP includes work-sharing constructs like #pragma omp for, which distributes loop
iterations across threads. This parallelizes loops where iterations are independent of each
other, improving performance in computationally expensive loops.
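A minimal sketch of such a loop; the arrays a, b, c and the size n are illustrative:

void add_vectors(const double *a, const double *b, double *c, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];   // independent iterations are divided among the threads
    }
}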
In this example, each thread handles a portion of the loop iterations, speeding up the overall
computation.
3. Thread Management
OpenMP allows control over the number of threads with the omp_set_num_threads()
function or the OMP_NUM_THREADS environment variable. This provides flexibility in
managing resources based on the hardware configuration.
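For example (the thread count of 4 is arbitrary; setting OMP_NUM_THREADS=4 before running the program has the same effect):

#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_set_num_threads(4);        // request four threads for the next parallel region
    #pragma omp parallel
    {
        #pragma omp single         // print the team size only once
        printf("Team size: %d\n", omp_get_num_threads());
    }
    return 0;
}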
In OpenMP, variables can either be shared or private. Understanding how variables are handled is
crucial for writing correct parallel programs.
1. Shared Variables
Shared variables are accessible by all threads within a parallel region. By default, variables
declared outside of the parallel region are shared.
int sum = 0;
#pragma omp parallel
    sum += omp_get_thread_num();   // every thread updates the same shared variable sum
In this example, all threads modify the shared variable sum. However, this could lead to a race
condition without synchronization.
2. Private Variables
Private variables are local to each thread and are not shared among threads. This ensures
that each thread has its own copy of the variable, preventing race conditions.
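A minimal example; the variable name tid is illustrative:

#include <stdio.h>
#include <omp.h>

int main(void) {
    int tid;
    #pragma omp parallel private(tid)
    {
        // thread-private variable: each thread works on its own copy of tid
        tid = omp_get_thread_num();
        printf("Thread %d has its own tid\n", tid);
    }
    return 0;
}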
Synchronization Mechanisms
1. Critical Sections
A critical section (#pragma omp critical) allows only one thread at a time to execute the
enclosed block, protecting updates to shared variables.
#pragma omp critical
{
    sum += omp_get_thread_num();
}
2. Barriers
A barrier ensures that all threads reach a certain point in the program before any of them
proceed. It is used for synchronizing the progress of threads.
3. Atomic Operations
The #pragma omp atomic directive protects a single update to a shared variable, with lower
overhead than a critical section.
#pragma omp atomic
sum++;
4. Locks
OpenMP also provides explicit locks (omp_set_lock, omp_unset_lock) for more fine-grained
control over synchronization.
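A minimal sketch of explicit lock usage; the shared counter is illustrative:

#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_lock_t lock;
    int counter = 0;
    omp_init_lock(&lock);             // locks must be initialized before use

    #pragma omp parallel
    {
        omp_set_lock(&lock);          // only one thread at a time passes this point
        counter++;
        omp_unset_lock(&lock);        // release the lock for the next thread
    }

    omp_destroy_lock(&lock);
    printf("counter = %d\n", counter);
    return 0;
}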
Conclusion
OpenMP's data-sharing attributes and synchronization mechanisms like critical sections, barriers, and
atomic operations are essential tools for preventing race conditions and ensuring the correctness of
parallel programs.
In OpenMP, scheduling determines how loop iterations are distributed among threads. The two
primary types of scheduling are static and dynamic. The choice of scheduling strategy can
significantly affect the performance of a parallel application.
Static Scheduling
With static scheduling, iterations are divided into fixed-size chunks and assigned to threads before
execution begins. This type of scheduling is ideal when the workload is evenly distributed across
iterations.
Advantages: Low overhead because the chunks are predetermined. Ideal for loops where
each iteration takes roughly the same amount of time.
Disadvantages: Poor load balancing if the iterations vary significantly in execution time.
Dynamic Scheduling
In dynamic scheduling, iterations are assigned to threads dynamically during run time. When a
thread finishes its assigned work, it requests more iterations from the pool.
#pragma omp parallel for schedule(dynamic)
Advantages: Better load balancing because idle threads are assigned more work. Useful
when iterations have unpredictable workloads.
Disadvantages: Higher overhead due to the dynamic assignment of iterations during
execution.
Chunk Size
In both static and dynamic scheduling, the chunk size can be specified. A larger chunk size reduces
the overhead of assigning iterations but may lead to load imbalances. A smaller chunk size improves
load balancing but increases overhead.
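A minimal sketch, in which work() and n are placeholders for an iteration with unpredictable cost:

void work(int i);                     // placeholder: one iteration of varying cost

void process_all(int n) {
    #pragma omp parallel for schedule(dynamic, 10)
    for (int i = 0; i < n; i++) {
        work(i);                      // each thread takes 10 iterations at a time from the pool
    }
}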
This example uses dynamic scheduling with a chunk size of 10, meaning each thread will be assigned
10 iterations at a time.
Static Scheduling: Best for loops where each iteration has a uniform workload.
Dynamic Scheduling: Preferred for loops with irregular or unpredictable workloads, where
some iterations may take longer than others.
Conclusion
Choosing the right scheduling strategy depends on the nature of the loop workload. Static scheduling
is efficient for predictable workloads, while dynamic scheduling provides better load balancing for
uneven workloads.
4. Explain Tasking in OpenMP with Examples. Discuss Task Creation, Task Synchronization, and Task
Dependencies
Tasking in OpenMP allows for greater flexibility in parallel programming by letting the developer
define independent units of work (tasks) that can be executed asynchronously by different threads.
Tasking is particularly useful for irregular workloads where parallelizing loops is not straightforward.
Task Creation
Tasks in OpenMP are created using the #pragma omp task directive. Each task can be executed by
any available thread, and the execution order is not necessarily the same as the order of task
creation.
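A minimal sketch; do_work() is a placeholder, and the single construct is one common way to have just one thread create the task:

void do_work(void);                   // placeholder task body

void example(void) {
    #pragma omp parallel
    {
        #pragma omp single            // one thread creates the task...
        {
            #pragma omp task
            do_work();                // ...but any thread in the team may execute it
        }
    }
}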
In this example, a task is created inside the parallel region. Any thread in the team can execute this
task, independent of the others.
Task Synchronization
Task synchronization ensures that tasks are completed before a certain point in the program. The
#pragma omp taskwait directive is used to wait for all child tasks created by the current task to
complete.
This directive is used to synchronize tasks, ensuring that all previously launched tasks are finished
before the program proceeds.
Task Dependencies
OpenMP 4.0 introduced task dependencies to allow better control over the execution order of tasks.
Tasks can be defined with dependencies, ensuring that one task waits for the completion of another
before starting.
#pragma omp task depend(inout: x)   // this task waits for earlier tasks that write x
x = compute(x);
In this example, the depend clause ensures that the task does not start until all tasks that write to x
have completed.
A common example of tasking is the calculation of Fibonacci numbers, where each recursive call can
be treated as a separate task.
int fib(int n) {
    int x, y;
    if (n < 2) return n;
    #pragma omp task shared(x)
    x = fib(n-1);
    #pragma omp task shared(y)
    y = fib(n-2);
    #pragma omp taskwait
    return x + y;
}
direc ve ensures that the results of both tasks are computed before proceeding.
Conclusion
Tasking in OpenMP offers a flexible way to parallelize irregular workloads. Tasks can be created
asynchronously, synchronized using taskwait, and managed using task dependencies to ensure
correct execution.
1. Load Balancing
Definition: Load balancing involves distributing work evenly among available processing
units (threads or processors) to ensure that no single unit is overwhelmed while others are
idle.
Challenge: Uneven distribution of tasks can lead to some threads finishing early while others
continue processing, reducing overall performance.
Solution: Dynamic scheduling techniques can be used to assign tasks to threads at run time
based on their availability, ensuring a more uniform workload.
2. Data Dependencies
Definition: Data dependencies occur when threads rely on data produced by other threads,
creating a need for synchronization to prevent race conditions.
Challenge: If one thread modifies shared data that another thread is reading, it may lead to
inconsistent or incorrect results.
Solution: Use synchronization mechanisms like mutexes, semaphores, or barriers to ensure
that threads operate on data only when it is safe to do so.
3. Race Conditions
Definition: A race condition occurs when multiple threads attempt to read and write shared
data simultaneously, leading to unpredictable results.
Challenge: Without proper control, the final value of shared data can depend on the timing
of thread execution, causing bugs that are hard to reproduce and fix.
Solution: Implement critical sections or atomic operations to control access to shared
resources, ensuring that only one thread can modify data at a time.
4. Deadlocks
Definition: A deadlock is a situation where two or more threads are blocked indefinitely,
each waiting for resources held by the other.
Challenge: Deadlocks can lead to a complete halt in program execution, making it critical to
manage resource allocation carefully.
Solution: Use techniques like lock ordering, timeout mechanisms, or deadlock detection
algorithms to prevent or resolve deadlocks.
1. Synchronization Problems
Synchronization problems arise when multiple threads access shared resources or data structures
concurrently. The primary issues include:
Race Conditions: As described above, these occur when threads read and write shared data
simultaneously, resulting in inconsistent outcomes.
Deadlocks: As threads wait on each other to release resources, the program can become
stuck, requiring intervention to recover.
Livelocks: Threads may continuously change states in response to one another without
making progress, which can be as detrimental as deadlocks.
Solutions
Mutexes and Locks: Implement mutexes (mutual exclusions) to ensure that only one thread
can access a critical section of code at any time. This prevents race conditions.
pthread_mutex_lock(&mutex);
shared_data++;                 // placeholder update to shared data, protected by the mutex
pthread_mutex_unlock(&mutex);
Semaphores: Use semaphores to control access to shared resources. They can be used to
limit the number of threads accessing a resource simultaneously.
sem_wait(&sem);                // acquire a slot before touching the shared resource
// ... access the shared resource (placeholder) ...
sem_post(&sem);                // release the slot for the next thread
Barriers: Employ barriers to synchronize a group of threads at a specific point in the code. All
threads must reach the barrier before any can proceed, ensuring coordinated execution.
#pragma omp barrier
Lock-Free Programming: Design algorithms that do not require locks by using atomic
operations. This reduces the likelihood of deadlocks and increases performance in highly
concurrent systems.
2. Algorithmic Issues
Algorithmic issues pertain to the design and implementation of algorithms that can be efficiently
executed in parallel. Key issues include:
Decomposition: The problem must be divided into smaller subproblems that can be solved
independently by different threads. The granularity of this decomposition is crucial; too
coarse-grained may lead to underutilization of resources, while too fine-grained may
introduce overhead.
Matrix Multiplication: When implementing matrix multiplication in parallel, the problem can
be divided into smaller tasks where each thread computes a submatrix.
o Decomposition: Decide how to divide the matrix: by rows, columns, or blocks.
o Communication: Ensure that each thread accesses its assigned elements without
unnecessary communication overhead.
Sorting: When sorting an array in parallel:
o Decomposition: Split the array into smaller segments for sorting.
o Communication: Combine results efficiently while minimizing data transfer between
threads.