CUDA
By
Dr. Sandhya Parasnath Dubey and Dr. Manisha
Assistant Professor
Department of Data Science and Computer Applications - MIT,
Manipal Academy of Higher Education, Manipal-576104, Karnataka, India.
Email: [email protected]; [email protected]
Parallelism
• Writing a parallel program must always start by identifying the parallelism inherent in the algorithm at hand.
• Different variants of parallelism induce different methods of parallelization.
• Instruction Level Parallelism (ILP)
• ILP is the parallel or simultaneous execution of a sequence of instructions in a computer program.
• ILP is specifically about how many instructions a computer can execute at the same time; it measures the efficiency of this parallel execution.
• Caution:
Ø Data dependency: Data dependence means that one instruction is dependent on another. This can limit how much parallelism we can achieve.
Ø Name dependency: A name dependency occurs when two instructions use the same register or memory location, called a name, but there is no flow of data between them. This can also impact parallel execution.
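For illustration, a small plain-C sketch (the variable names are arbitrary): the first pair of statements forms a dependency chain, while the second pair is independent and can be issued in parallel by the hardware.

#include <stdio.h>

int demo(int x, int y)
{
    int a = x + y;   /* produces a */
    int b = a * 2;   /* data dependency: must wait for a to be computed */

    int c = x * 3;   /* independent of d */
    int d = y - 1;   /* independent of c: may issue in the same cycle (ILP) */
    return b + c + d;
}

int main(void) { printf("%d\n", demo(2, 3)); return 0; }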
Parallelism
• Data Level Parallelism
• Many problems in scientific computing involve processing of large quantities of data stored on a computer.
• If we can process different parts of the data simultaneously using multiple processors, this is called data level parallelism.
• This approach is widely used on MIMD (Multiple Instruction, Multiple Data) type computers, where each processor can execute a different instruction stream (i.e., different parts of the code) on different pieces of data.
• The same code is executed on all processors, with independent instruction pointers.
• Example: If you have a huge dataset that needs to be processed, instead of one processor handling all the data, multiple processors can each work on different sections of the data simultaneously, making the whole process much faster.
Parallelism
• Functional/Task Level Parallelism
• Sometimes the solution of a “big” numerical problem can be split into separate subtasks.
• A large problem can be broken down into smaller tasks, each of which can be executed independently; the subtasks work together through data exchange and synchronization.
• In this case, the subtasks execute completely different code on different data items, which is why functional parallelism is also called MPMD (Multiple Program Multiple Data).
• Functional parallelism has pros and cons:
• Cons: When different parts of the problem have different performance properties and hardware requirements, bottlenecks and load imbalance can easily arise. If one task takes longer than another, it can slow down the entire process.
• Pros: On the other hand, overlapping tasks that would otherwise be executed sequentially can accelerate execution considerably.
Introduction: Benefits of Using GPUs
• The Graphics Processing Unit (GPU) provides much higher instruction throughput and memory bandwidth than the CPU.
Ø For certain types of tasks, the GPU can process a massive amount of data and perform a large number of operations in parallel, far more than a CPU could.
• Many applications leverage these higher capabilities to run faster on the GPU than on the CPU.
• This difference in capabilities between the GPU and the CPU exists because they are designed with different goals:
• The CPU is designed to excel at executing a sequence of operations, called a thread, as fast as possible and can execute a few tens of these threads in parallel (e.g., running your operating system or basic software applications, memory management).
• The GPU is designed to excel at executing thousands of threads in parallel. This makes it extremely efficient at performing the same operation on large datasets, like rendering the pixels on your screen or processing large matrices in deep learning models.
• The GPU is specialized for highly parallel computations. Its architecture is different from the CPU because more transistors, the basic building blocks of the chip, are dedicated to data processing rather than data caching and flow control.
• While the CPU is a general-purpose processor that can do a bit of everything, the GPU is a highly specialized tool designed to perform a lot of similar operations very quickly and simultaneously. It is well-suited for tasks that can be broken down into thousands of parallel operations.
Introduction: Benefits of Using GPUs
• Devoting more transistors to data processing (e.g., floating-point computations).
• The GPU can hide memory access latencies with computation, instead of relying on large data caches and complex flow control.
• This approach contrasts with CPUs, which often rely on large data caches and complex flow control mechanisms to minimize the impact of memory access delays.
• In a CPU, long memory access latencies are a significant challenge and require expensive solutions like large caches and intricate flow control mechanisms, which use a lot of transistors.
• GPUs avoid these expensive latencies by focusing on continuous computation. When a thread running on a GPU needs to access data from memory, it might encounter a delay while waiting for that data to be fetched. Instead of stalling the entire GPU while waiting, the GPU can quickly "switch out" that thread and "switch in" another thread that is ready to continue computing, effectively keeping the processing pipeline full and efficient.
Introduction: CPU vs GPU
• The CPU has a few large, powerful cores. Each core has its own control unit. These cores are designed to handle a wide range of tasks quickly and efficiently, making CPUs good at general-purpose computing. CPUs have multiple levels of cache (L1, L2, and sometimes L3) that are close to the cores. Below the caches is the DRAM (main memory), which is used to store data and instructions that are not immediately needed but can be fetched into the cache as required.
• The GPU has a large number of smaller, simpler cores arranged in a highly parallel fashion. Each green square in the diagram represents a core. These cores are designed to perform the same operation on multiple pieces of data simultaneously.
• Like the CPU, the GPU also has DRAM for storing data and instructions. However, because the GPU's strength is in handling large amounts of data at once, it often accesses memory in large blocks, which is different from the CPU's approach.
A Scalable Programming Model
• Mainstream processor chips are now built as parallel systems.
• Multicore CPUs and manycore GPUs
• Multicore CPUs: These have multiple cores (like dual-core, quad-core, octa-core) that can execute multiple threads or tasks at the same time. Manycore GPUs: These contain hundreds or even thousands of smaller cores optimized for simultaneous processing.
• The challenge is to develop application software that transparently scales its parallelism to leverage the increasing number of processor cores.
• CUDA (Compute Unified Device Architecture), developed by NVIDIA, is a parallel computing platform and programming model specifically designed for GPUs.
• It provides a framework to write programs that can run on NVIDIA GPUs. It allows developers to write code that can be executed in parallel across many GPU cores.
A Scalable Programming Model
• To help developers achieve this scalable parallelism, CUDA introduces three main concepts:
o A hierarchy of thread groups: Each thread executes a copy of the kernel function. A kernel is a function that is executed on the GPU. When you call a kernel, it runs many times simultaneously. Each of these runs is called a "thread," and all these threads work together to complete a task more quickly. CUDA organizes threads into a hierarchy, which allows for scalable parallel execution:
Ø Threads: The smallest unit of execution.
Ø Thread Blocks: A group of threads that can cooperate among themselves by sharing data through shared memory and can synchronize their execution to coordinate with each other.
Ø Grid of Thread Blocks: Multiple thread blocks that can execute independently and in parallel across different cores of the GPU.
o Shared memories: CUDA provides different types of memory spaces, including shared memory, which is a fast, on-chip memory that is shared among threads within the same block. It is much faster than global memory (which is off-chip) and can be used for inter-thread communication within a block.
o Barrier synchronization: CUDA provides mechanisms to synchronize threads within a block. This ensures that all threads in a block have reached a certain point in the execution before any of them can proceed.
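A minimal sketch that uses all three concepts (the kernel name, the fixed block size of 256, and the in-block reversal are assumptions for illustration): threads of one block stage data in shared memory, meet at a barrier, and then each reads a value written by another thread.

// Assumes the kernel is launched with one block of at most 256 threads.
__global__ void reverseInBlock(const int *in, int *out, int n)
{
    __shared__ int tile[256];          // per-block shared memory (on-chip)
    int i = threadIdx.x;               // this thread's position in the block
    if (i < n)
        tile[i] = in[i];               // each thread loads one element
    __syncthreads();                   // barrier: wait until the whole tile is loaded
    if (i < n)
        out[i] = tile[n - 1 - i];      // read an element loaded by another thread
}

A launch such as reverseInBlock<<<1, 256>>>(d_in, d_out, 256); runs one block of 256 cooperating threads.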
A Scalable Programming Model
• CUDA guides the programmer to partition the problem into coarse sub-problems that can be solved independently in parallel by blocks of threads.
• Each sub-problem is further divided into smaller tasks that are solved cooperatively by all threads within a block.
• Each block of threads can be scheduled on any of the available multiprocessors within a GPU.
• Blocks may be executed in any order, concurrently (at the same time) or sequentially (one after another), depending on the GPU's resources.
• This flexible scheduling ensures that the same CUDA program can run on GPUs with different numbers of multiprocessors without modification.
• Only the runtime system needs to know the physical multiprocessor count.
A Scalable Programming Model
• The image demonstrates CUDA's flexibility in executing the same program on different hardware configurations.
• Whether a GPU has 2 or 4 SMs, the program adapts, and the runtime system manages how the blocks are scheduled across the available SMs.
• Scalability: CUDA's programming model is scalable. A program written for one GPU can run on another with more or fewer SMs without modification.
• The CUDA runtime system takes care of distributing the work (blocks) across the available SMs, optimizing performance based on the hardware configuration.
• Blocks in CUDA represent groups of threads that execute the same kernel function. These blocks are independent of each other and can be scheduled on any available SM.
Development environment
Development environment necessary for programming with NVIDIA GPUs using CUDA:
• Every NVIDIA GPU released since the 2006 GeForce 8800 GTX is equipped with CUDA capabilities.
Ø These GPUs are built on the CUDA architecture, which allows them to run CUDA programs.
• NVIDIA DEVICE DRIVER
Ø NVIDIA provides specialized system software, known as the device driver, which acts as an intermediary between your programs and the CUDA-enabled hardware.
Ø The device driver ensures that your applications can efficiently communicate with and utilize the GPU for computation.
• CUDA Development Toolkit
Ø CUDA applications typically involve computations on both the GPU and the CPU, so you need two compilers:
Ø GPU compiler (CUDA Toolkit): NVIDIA provides a specific compiler as part of the CUDA Toolkit that compiles code to run on the GPU.
Ø CPU compiler: The regular CPU code is compiled using a standard compiler like GCC (on Linux) or MSVC (on Windows).
Heterogeneous Computing
• Heterogeneous computing refers to the use of multiple types of processors (e.g., CPUs and GPUs) within the same system to perform computational tasks.
• Terminology:
• Host: The CPU and its memory (host memory)
• Host Memory: This is the system's main memory (RAM) that is accessible by the CPU. The CPU performs general-purpose processing and manages the overall operation of the system.
• Role in Heterogeneous Computing: In CUDA programming, the host is responsible for initiating and managing computational tasks, including preparing data and sending it to the GPU for processing.
• Device: The GPU and its memory (device memory)
• Device Memory: This is the GPU's own memory, separate from the CPU's memory. It is used to store data that the GPU will process and the results it produces.
• Role in Heterogeneous Computing: The device (GPU) handles the parallel processing of data. It executes computationally intensive tasks that have been offloaded by the host.
Heterogeneous Computing
Process flow (figure slides: input data is copied from host memory to device memory, the kernel is launched and executes on the GPU, and the results are copied back from device memory to host memory)
Hello world example
• This is a simple CUDA C program with two distinctions from plain C:
Ø A function named kernel() qualified with __global__
Ø A call to the function, embellished with <<<1,1>>>
• The __global__ qualifier before the function kernel indicates that this function will be compiled to run on the device, i.e., the GPU. The kernel function is the CUDA kernel.
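A minimal sketch of such a program, consistent with the description that follows (the source file name used with nvcc is an assumption):

#include <cstdio>

// __global__ marks kernel() as device code: it is compiled for and runs on the GPU.
__global__ void kernel(void)
{
    printf("Hello, CUDA World!\n");
}

int main(void)
{
    kernel<<<1, 1>>>();          // launch the kernel: 1 block of 1 thread
    cudaDeviceSynchronize();     // wait for the GPU to finish before main() returns
    return 0;
}

It can be compiled with, for example, nvcc hello.cu -o hello.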
• Kernel Function:
Ø The __global__ qualifier tells the CUDA compiler to generate device code for this function. When this function runs, it will execute on the GPU.
Ø Inside this kernel function, we use printf to print "Hello, CUDA World!" (the output appears on the host console).
• Main Function:
Ø The main function is executed on the host, which is the CPU in this context.
Ø kernel<<<1, 1>>>(); launches the kernel on the GPU with a grid size of 1 block and 1 thread per block.
Ø This is a minimal setup to execute the kernel.
Ø cudaDeviceSynchronize(); ensures that the host waits for the GPU to finish executing the kernel before proceeding. This helps in synchronizing the host and device.
• Compilation and Execution:
Ø When you compile this code using nvcc, the CUDA compiler, it will separate the device code and host code.
Ø The kernel() function is compiled by nvcc into code that runs on the GPU.
Ø The main() function is compiled by the host compiler, which is typically a standard C compiler.
Ø When you run the code, the kernel() function will be executed on the GPU, and you should see "Hello, CUDA World!" printed on the console.
Kernels
• In CUDA, we extend C/C++ with some additional features that allow us to write programs that can run on a GPU.
• CUDA C/C++ extends C/C++ by allowing the programmer to define C/C++ functions, called kernels.
• When called, kernels are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C/C++ functions.
• A kernel is defined using the __global__ declaration specifier.
• The number of CUDA threads that execute the kernel for a given kernel call is specified using a new <<< >>> execution configuration syntax.
• Each thread that executes the kernel is given a unique thread ID that is accessible within the kernel through built-in variables.
• This allows each thread to work on a different piece of data. In this case, each thread will add one element from the arrays a and b and store the result in c.
Kernels
• In a kernel launch such as add<<<1, N>>>(a, b, c), the <<<1, N>>> part is the execution configuration.
• It tells CUDA how many threads to launch.
• The first number, 1, is the number of thread blocks, and the second number, N, is the number of threads per block.
• In the example, we are launching N threads, all executing the same kernel function add in parallel. Each thread operates on different data based on its thread ID.
• This ID is crucial because it allows each thread to work on a different piece of data. In the add kernel, each thread uses its ID to determine which elements of the arrays a and b it should add together.
Kernels
• In this example, VecAdd<<<2, N>>> launches a kernel with a grid of 2 blocks, each block containing N threads.
• Each thread computes one element of the vector addition, effectively processing 2 * N elements.
• The total number of threads in the grid is 2 * N. Each thread is assigned a unique index that can be calculated using the block and thread indices. This index helps each thread process a different element or piece of data.
• Here, blockIdx.x gives the block index, blockDim.x gives the number of threads per block, and threadIdx.x gives the thread index within its block.
• Each block can be executed concurrently on different Streaming Multiprocessors (SMs) on the GPU. If there are enough SMs available, both blocks can be processed in parallel, improving overall performance.
Kernels
• The following sample code, using the built-in variable threadIdx, adds two vectors A and B of size N and stores the result into vector C.
• Each of the N threads that execute VecAdd() performs one pair-wise addition.
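A sketch of that sample (it follows the standard VecAdd example from the CUDA C Programming Guide; the host-side allocation and copies are omitted):

// Each of the N threads performs one pair-wise addition, selected by threadIdx.x.
__global__ void VecAdd(float *A, float *B, float *C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

// Host code: launch 1 block of N threads, assuming A, B, C are device pointers.
//   VecAdd<<<1, N>>>(A, B, C);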
Thread Hierarchy
• threadIdx is a built-in variable in CUDA that gives each thread a unique identifier within a thread block.
• It is a 3-component vector (threadIdx.x, threadIdx.y, threadIdx.z).
• It can represent a thread's position in a one-dimensional (1D), two-dimensional (2D), or three-dimensional (3D) thread block.
• Thread blocks are groups of threads that execute on the same streaming multiprocessor (SM) of the GPU.
• A thread block can be organized in 1D, 2D, or 3D.
• For example, for a block of size (Dx), the threads are organized in a 1D structure.
• For a block of size (Dx, Dy), threads are organized in a 2D structure.
• For a block of size (Dx, Dy, Dz), threads are organized in 3D.
• The index of a thread and its thread ID relate to each other in a straightforward way.
Thread Hierarchy
• The relationship between a thread's index and its thread ID depends on the dimensionality of the block:
• 1D Block: The thread ID is directly given by threadIdx.x.
• 2D Block: The thread ID is calculated as threadIdx.x + threadIdx.y * blockDim.x.
• 3D Block: The thread ID is calculated as threadIdx.x + threadIdx.y * blockDim.x + threadIdx.z * blockDim.x * blockDim.y.
• This gives each thread a unique ID within a block, which helps in dividing tasks among threads.
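A small sketch (the kernel name is assumed) that applies the 3D formula above inside a kernel; for a 1D or 2D block the unused terms are simply zero:

// Each thread computes its flattened ID within its block and records it.
__global__ void flatThreadId(int *out)
{
    int tid = threadIdx.x
            + threadIdx.y * blockDim.x
            + threadIdx.z * blockDim.x * blockDim.y;
    out[tid] = tid;                     // one slot per thread of the block
}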
Kernel Functions and Threading
• The number of threads per block and the number of blocks in a grid are specified using the special <<< >>> syntax when you launch a kernel.
• For example, <<<gridSize, blockSize>>>, where gridSize defines the number of blocks and blockSize defines the number of threads per block.
• Each block within the grid can be identified by a one-dimensional, two-dimensional, or three-dimensional unique index accessible within the kernel through the built-in blockIdx variable.
• blockIdx: Identifies the block within the grid.
• blockDim: Specifies the dimensions of the thread block.
Kernel Functions and Threading
• When host code launches a kernel, the CUDA runtime system generates a grid of threads.
• The grid is organized as an array of thread blocks. All blocks of a grid are of the same size.
• Each block can contain up to 1024 threads, with flexibility in distributing these threads across three dimensions.
• The number of threads in each block is specified by the host code when a kernel is launched.
• The same kernel can be launched with different numbers of threads at different parts of the host code.
• For a given grid of threads, the number of threads in a block is available in the blockDim variable.
• In general, the dimensions of thread blocks should be multiples of 32 for hardware efficiency reasons.
Kernel Functions and Threading
• In CUDA C, gridDim.x, gridDim.y, and gridDim.z must each be at least 1. On early devices every grid dimension was limited to 65,535; on current devices gridDim.x can be as large as 2^31 - 1, while gridDim.y and gridDim.z are still limited to 65,535.
CUDA Thread Organization
• The kernel launch can also be written as:
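A sketch of that launch, embedded in a small host function (the kernel name vecAddKernel comes from this slide; the pointer names and the stub host function are assumptions):

#include <math.h>

// Each thread adds one element; the guard keeps the extra threads of the
// last block (e.g., 24 of them when n = 1000) from touching memory.
__global__ void vecAddKernel(const float *A, const float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        C[i] = A[i] + B[i];
}

void vecAdd(const float *d_A, const float *d_B, float *d_C, int n)
{
    // Enough 256-thread blocks to cover all n elements: ceil(n / 256.0) blocks.
    vecAddKernel<<<(int)ceil(n / 256.0), 256>>>(d_A, d_B, d_C, n);
}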
• It allows the number of blocks to vary with the size of the vectors so that the grid will have enough threads to cover all vector elements.
• The value of the variable n at kernel launch time will determine the dimension of the grid.
• If n is equal to 1000, the grid will consist of 4 blocks, and the statement will launch 4 * 256 = 1024 threads. The first 1000 threads will perform addition on the 1000 vector elements and the remaining 24 will not.
• If n is equal to 4000, the grid will have 16 blocks. In each case, there will be enough threads to cover all the vector elements.
• Once vecAddKernel() is launched, the grid and block dimensions will remain the same until the entire grid finishes execution.
• CUDA C provides a special shortcut for launching kernels with 1D grids and blocks. Instead of using dim3 variables, one can use arithmetic expressions to specify the configuration of 1D grids and blocks.
• The CUDA C compiler simply takes the arithmetic expression as the x dimension and assumes that the y and z dimensions are 1.
CUDA Thread Organization
• In CUDA, a thread block is a group of threads that execute together on a single multiprocessor.
• In the provided example, we define a thread block of size 16 x 16, meaning each block contains 256 threads.
• A grid is a collection of thread blocks. In the example, we create a grid that has enough blocks to cover the entire matrix.
• The grid is sized so that each element in the matrices A, B, and C is handled by a separate thread.
• The kernel MatAdd is defined to perform matrix addition. Each thread in this kernel is responsible for computing one element of the output matrix C (a sketch of the kernel is given after this list).
• The kernel uses threadIdx and blockIdx to determine the indices i and j of the matrix element that a particular thread is responsible for.
• threadIdx.x and threadIdx.y give the thread's position within its block, while blockIdx.x and blockIdx.y give the block's position within the grid.
• These are combined with blockDim.x and blockDim.y, which provide the dimensions of the block, to calculate the exact position of the matrix element each thread will process.
• The kernel is launched with a grid and block configuration using the <<<>>> syntax. In this case, the grid is configured to have enough blocks such that each thread processes a unique element in the matrix.
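A sketch of the example described above (it follows the standard MatAdd sample from the CUDA C Programming Guide; the matrix size N and the omitted host-side setup are assumptions):

#define N 1024   // assumed matrix dimension

// Each thread computes one element C[i][j] of the output matrix.
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // index along x
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // index along y
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}

// Host-side launch (allocation and copies omitted):
//   dim3 threadsPerBlock(16, 16);                       // 16 x 16 = 256 threads
//   dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
//   MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);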
CUDA Thread Organization
• One of the most important concepts in CUDA is that thread blocks must execute independently. This means that any block can be scheduled on any multiprocessor at any time, without any dependencies on other blocks.
• This independence allows CUDA programs to scale with the number of cores on a GPU.
• As the number of cores increases, more blocks can be executed simultaneously, leading to better performance.
• Although blocks are independent of each other, threads within a block can cooperate. They can share data through shared memory, a special type of memory that is accessible only to threads within the same block.
• Threads within a block can also synchronize their execution using the __syncthreads() function.
• This is important when threads need to wait for each other to reach a certain point before proceeding, for example, to avoid race conditions when accessing shared memory.
• By organizing threads into blocks and grids, CUDA allows the same code to run efficiently on GPUs with different numbers of cores.
• The independence of thread blocks ensures that as more cores become available, the GPU can automatically distribute work among them, leading to improved performance without requiring changes to the code.
CUDA Thread Organization
• The grid can have higher dimensionality than its blocks and vice versa.
• The figure shows an example of a 2D grid of size (2, 2, 1) that consists of 3D blocks of size (4, 2, 2).
• The grid can be generated with the following host code:
dim3 dimGrid(2,2,1);
dim3 dimBlock(4,2,2);
KernelFunction<<<dimGrid, dimBlock>>>(. . .);
• The grid consists of 4 blocks organized into a 2 x 2 array.
• Each block is labeled with (blockIdx.y, blockIdx.x).
• For example, block(1,0) has blockIdx.y = 1 and blockIdx.x = 0.
• The ordering of the labels is such that the highest dimension comes first.
• This is the reverse of the ordering used in the configuration parameters, where the lowest dimension comes first.
Mapping Threads to Multidimensional Data
• The choice of 1D, 2D, or 3D thread organization is usually based on the nature of the data.
• For example, pictures are a 2D array of pixels.
• It is often convenient to use a 2D grid that consists of 2D blocks to process the pixels in a picture.
• Consider a 76 x 62 picture (76 pixels in the x direction and 62 in the y direction).
• Assume that we decided to use a 16 x 16 block, with 16 threads in the x direction and 16 threads in the y direction.
• We will need five blocks in the x direction and four blocks in the y direction, which results in 5 x 4 = 20 blocks.
• Note: We have 4 extra threads in the x direction and 2 extra threads in the y direction. That is, we will generate 80 x 64 threads to process 76 x 62 pixels.
• An if statement is needed to prevent the extra threads from taking effect. Analogously, we should expect that the picture processing kernel function will have if statements to test whether the thread indices threadIdx.x and threadIdx.y fall within the valid range of pixels (a sketch with such a test appears after the host launch code below).
Mapping Threads to Multidimensional Data
• Assume that the host code uses an integer variable n to track the number of pixels in the x direction, and another integer variable m to track the number of pixels in the y direction.
• We further assume that the input picture data has been copied to the device memory and can be accessed through a pointer variable d_Pin.
• The output picture has been allocated in the device memory and can be accessed through a pointer variable d_Pout.
• The following host code can be used to launch a 2D kernel to process the picture.
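A sketch of that host code, together with a kernel containing the boundary test discussed above (the kernel name pictureKernel and the per-pixel operation, scaling each pixel by 2.0, are assumptions in the style of the usual textbook example):

#include <math.h>

// Each thread processes one pixel; threads that fall outside the n x m
// picture (the extra threads) do nothing because of the if test.
__global__ void pictureKernel(float *d_Pin, float *d_Pout, int n, int m)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // x (column) index
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // y (row) index
    if (col < n && row < m)
        d_Pout[row * n + col] = 2.0f * d_Pin[row * n + col];
}

void processPicture(float *d_Pin, float *d_Pout, int n, int m)
{
    dim3 dimGrid((int)ceil(n / 16.0), (int)ceil(m / 16.0), 1);  // enough 16 x 16 blocks
    dim3 dimBlock(16, 16, 1);
    pictureKernel<<<dimGrid, dimBlock>>>(d_Pin, d_Pout, n, m);
}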
Mapping Threads to Multidimensional Data
• Ideally, we would like to access d_Pin as a 2D array where an element at row j and column i can be accessed as d_Pin[j][i].
• In CUDA C, programmers need to explicitly linearize, or flatten, a dynamically allocated 2D array into a 1D array.
• In reality, all multidimensional arrays in C are linearized due to the use of a flat memory space in modern computers.
• There are two ways to linearize a 2D array:
• Row-major layout: Place all elements of the same row into consecutive locations. This is the layout used by CUDA C.
• The rows are then placed one after another in the memory space.
• For a 4 x 4 matrix M, the 1D equivalent index for the element of M in row j and column i is j x 4 + i.
• The j x 4 term skips over all elements of the rows before row j, and the i term then selects the right element within the section for row j.
• For example, the 1D index for M(2,1) is 2 x 4 + 1 = 9; that is, M9 is the 1D equivalent of M2,1.
• Column-major layout: Place all elements of the same column into consecutive locations. The columns are then placed one after another in the memory space. This layout is used by FORTRAN compilers.
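A small sketch of this indexing (the helper name is an assumption; it can be used from both host and device code):

// Row-major linearization: skip j full rows of `width` elements, then move
// i columns into row j.
__host__ __device__ inline int rowMajorIndex(int j, int i, int width)
{
    return j * width + i;
}
// Example: for a 4 x 4 matrix, rowMajorIndex(2, 1, 4) == 9, so M[9] is M(2,1).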
Memory Hierarchy
CUDA threads may access data from multiple memory spaces during their execution.
1. Per-Thread Local Memory:
• Each thread has access to its own private local memory. This memory is very small and is used to store variables that are private to that thread.
• If a thread is working on a particular element of an array, it can store temporary variables here. This memory is relatively slow if used extensively, because it spills over to global memory when registers are full.
2. Per-Block Shared Memory:
• Each block of threads has access to shared memory, which is extremely fast and is local to the block. All threads within a block can share data through this memory.
• If multiple threads need to work on a common dataset, shared memory is ideal because it is much faster than accessing global memory repeatedly.
• It reduces accesses to global memory by storing frequently used data locally within the block.
3. Grid and Global Memory:
• Each grid consists of multiple blocks. All threads from all blocks can access the global memory, which is the largest but slowest form of memory in the CUDA architecture.
• This is where the main data resides. All threads, irrespective of which block they belong to, can access this memory.
• When you copy data from the host (CPU) to the device (GPU), it gets stored in global memory.
Figure: Different levels of memory available to CUDA
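A brief sketch tying these spaces to code (the kernel and variable names are assumptions, and the block size is assumed to be at most 256): the kernel parameters point into global memory, the tile array lives in per-block shared memory, and the index i lives in registers/per-thread local storage.

__device__ float d_scale = 2.0f;     // global memory: visible to every thread

__global__ void memorySpaces(const float *in, float *out, int n)   // in/out: global memory
{
    __shared__ float tile[256];      // per-block shared memory (on-chip, fast)
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // per-thread register/local variable
    if (i < n)
        tile[threadIdx.x] = in[i];                   // global -> shared
    __syncthreads();                                 // every thread of the block reaches this
    if (i < n)
        out[i] = tile[threadIdx.x] * d_scale;        // shared -> global, scaled
}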
4. Schedulers for warps: Each SM has warp schedulers responsible for dispatching instructions to different warps based on scheduling policies. These policies ensure that warps are issued instructions in an efficient and fair manner to maximize GPU utilization.
• Example: A mobile processor may execute an application slowly but at extremely low power consumption, and a desktop processor may execute the same application at a higher speed while consuming more power.
• Both execute exactly the same application program with no change to the code.
• The ability to execute the same application code on hardware with a different number of execution resources is referred to as transparent scalability, which reduces the burden on application developers and improves the usability of applications.
Assigning Resources to Blocks
• Once a kernel is launched, the CUDA runtime system generates the corresponding grid of threads.
• Threads are assigned to execution resources on a block-by-block basis.
• In the current generation of hardware, the execution resources are organized into streaming multiprocessors (SMs).
• Each device has a limit on the number of blocks that can be assigned to each SM.
• For example, a CUDA device may allow up to eight blocks to be assigned to each SM.
• In situations where there is an insufficient amount of any one or more types of resources needed for the simultaneous execution of eight blocks, the CUDA runtime automatically reduces the number of blocks assigned to each SM until their combined resource usage falls under the limit.
• With a limited number of SMs and a limited number of blocks that can be assigned to each SM, there is a limit on the number of blocks that can be actively executing in a CUDA device.
Passing Parameters
• We can pass parameters to a kernel as we would with any C function.
• We need to allocate memory to do anything useful on a device, such as returning values to the host.
• A kernel call looks and acts exactly like any function call in standard C.
• The runtime system takes care of any complexity introduced by the fact that these parameters need to get from the host to the device.
Passing Parameters
• Memory is allocated on the device using cudaMalloc().
• This call behaves very similarly to the standard C call malloc(), but it tells the CUDA runtime to allocate the memory on the device.
• The first argument is a pointer to the pointer you want to hold the address of the newly allocated memory.
• The second parameter is the size of the allocation you want to make.
• The HANDLE_ERROR() that surrounds these calls is a utility macro.
• It simply detects that the call has returned an error, prints the associated error message, and exits the application with an EXIT_FAILURE code.
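A sketch of such an allocation (HANDLE_ERROR is the utility macro described above; the minimal definition shown here is an assumption about how it might be written):

#include <cstdio>
#include <cstdlib>

// Minimal stand-in for the HANDLE_ERROR() utility macro: if a CUDA call fails,
// print the associated error message and exit with EXIT_FAILURE.
#define HANDLE_ERROR(call)                                              \
    do {                                                                \
        cudaError_t err = (call);                                       \
        if (err != cudaSuccess) {                                       \
            printf("%s in %s at line %d\n",                             \
                   cudaGetErrorString(err), __FILE__, __LINE__);        \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

int main(void)
{
    int *dev_c;
    // 1st argument: address of the pointer that will receive the device address.
    // 2nd argument: number of bytes to allocate on the device.
    HANDLE_ERROR(cudaMalloc((void **)&dev_c, sizeof(int)));
    cudaFree(dev_c);    // release the device allocation
    return 0;
}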
Passing Parameters
• It is the responsibility of the programmer not to dereference the pointer returned by cudaMalloc() from code that executes on the host.
• Host code may pass this pointer around, perform arithmetic on it, or even cast it to a different type, but it cannot use it to read or write memory.
Passing Parameters
• Restrictions on the usage of device pointers are as follows:
• You can pass pointers allocated with cudaMalloc() to functions that execute on the device.
• You can use pointers allocated with cudaMalloc() to read or write memory from code that executes on the device.
• You can pass pointers allocated with cudaMalloc() to functions that execute on the host.
• You cannot use pointers allocated with cudaMalloc() to read or write memory from code that executes on the host.
Passing Parameters
• We cannot use standard C's free() function to release memory we have allocated with cudaMalloc().
• To free memory we have allocated with cudaMalloc(), we need to use a call to cudaFree().
• The two most common methods for accessing device memory are:
• using device pointers from within device code, and
• using calls to cudaMemcpy() from host code.
• Host pointers can access memory from host code, and device pointers can access memory from device code.
Passing Parameters
• We can also access memory on a device through calls to cudaMemcpy() from host code.
• These calls behave exactly like standard C memcpy(), with an additional parameter to specify which of the source and destination pointers point to device memory.
• Passing cudaMemcpyDeviceToHost as the last parameter to cudaMemcpy() instructs the runtime that the source pointer is a device pointer and the destination pointer is a host pointer.
• cudaMemcpyHostToDevice would indicate the opposite situation, where the source data is on the host and the destination is an address on the device.
• Finally, we can even specify that both pointers are on the device by passing cudaMemcpyDeviceToDevice.
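Putting the pieces of this section together, a sketch in the style of the classic add() example (the kernel, the values 2 and 7, and the variable names are assumptions): parameters are passed to the kernel like a normal C call, the result is copied back with cudaMemcpy(), and the device allocation is released with cudaFree().

#include <cstdio>

// Runs on the device: stores the sum of the two parameters in device memory.
__global__ void add(int a, int b, int *c)
{
    *c = a + b;
}

int main(void)
{
    int c;
    int *dev_c;
    cudaMalloc((void **)&dev_c, sizeof(int));       // allocate device memory

    add<<<1, 1>>>(2, 7, dev_c);                     // parameters passed as in C

    // Copy the result from device memory back to host memory.
    cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);

    printf("2 + 7 = %d\n", c);
    cudaFree(dev_c);                                // cudaFree(), not free()
    return 0;
}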
Querying Devices