UNIT-5
CUDA Kernel Execution
Throughout this section the running example is the kernel launch performWork<<<2, 4>>>().
• GPUs do work in parallel.
• GPU work is done in a thread.
• Many threads run in parallel.
• A collection of threads is a block.
• There are many blocks.
• A collection of blocks is a grid.
• GPU functions are called kernels.
• Kernels are launched with an execution configuration.
• The execution configuration defines the number of blocks in the grid, as well as the number of threads in each block.
• Every block in the grid contains the same number of threads.
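As a concrete illustration, a minimal sketch of what the running example could look like in full (the empty kernel body is just a placeholder):

// A kernel: a function that runs on the GPU, marked with __global__.
__global__ void performWork()
{
    // The work done by each GPU thread would go here.
}

int main()
{
    // Execution configuration: <<<number of blocks, threads per block>>>
    performWork<<<2, 4>>>();     // a grid of 2 blocks, each containing 4 threads
    cudaDeviceSynchronize();     // wait for the asynchronous kernel launch to finish
    return 0;
}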
CUDA-Provided Thread Hierarchy Variables
• Inside kernel definitions, CUDA-provided variables describe the executing thread, block, and grid.
• gridDim.x is the number of blocks in the grid; for performWork<<<2, 4>>>() it is 2.
• blockIdx.x is the index of the current block within the grid; here it is 0 for the first block and 1 for the second.
• blockDim.x is the number of threads in a block; here it is 4, and all blocks in a grid contain the same number of threads.
• threadIdx.x is the index of the current thread within its block; here it ranges from 0 to 3 in each block.
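A minimal sketch of a kernel that simply prints these variables for each of its threads; the kernel name printThreadInfo is made up for illustration:

#include <cstdio>

__global__ void printThreadInfo()
{
    // Each of the 2 * 4 = 8 threads reports its own coordinates.
    printf("gridDim.x=%d blockIdx.x=%d blockDim.x=%d threadIdx.x=%d\n",
           gridDim.x, blockIdx.x, blockDim.x, threadIdx.x);
}

int main()
{
    printThreadInfo<<<2, 4>>>();   // same configuration as the running example
    cudaDeviceSynchronize();       // wait so that the device printf output appears
    return 0;
}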
Coordinating Parallel Threads
• Assume the data is in a 0-indexed vector; here it has 8 elements, with indices 0 through 7.
• Somehow, each thread must be mapped to work on an element in the vector.
• Recall that each thread has access to the size of its block via blockDim.x, to the index of its block within the grid via blockIdx.x, and to its own index within its block via threadIdx.x.
• Using these variables, the formula threadIdx.x + blockIdx.x * blockDim.x will map each thread to one element in the vector.
• For performWork<<<2, 4>>>() (blockDim.x = 4, blockIdx.x = 0 or 1, threadIdx.x = 0 to 3), the mapping works out as follows:

threadIdx.x   blockIdx.x   blockDim.x   dataIndex
0             0            4            0
1             0            4            1
2             0            4            2
3             0            4            3
0             1            4            4
1             1            4            5
2             1            4            6
3             1            4            7
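A minimal sketch of how this mapping is used inside a kernel; the kernel body (doubling each element) and the device pointer data are assumptions made for illustration:

__global__ void performWork(float *data)
{
    // Map this thread to exactly one element of the vector.
    int dataIndex = threadIdx.x + blockIdx.x * blockDim.x;

    data[dataIndex] *= 2.0f;   // example work: double the element
}

// Launched as performWork<<<2, 4>>>(d_data) with exactly 2 * 4 = 8 threads,
// one per element of the 8-element vector d_data.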
Grid Size Work Amount Mismatch
• In the previous scenarios, the number of threads in the grid matched the number of elements exactly.
• What if there are more threads than work to be done? Suppose the vector now has only N = 5 elements (indices 0 through 4), while performWork<<<2, 4>>>() still launches 8 threads.
• Attempting to access non-existent elements can result in a runtime error.
• Code must check that the dataIndex calculated by threadIdx.x + blockIdx.x * blockDim.x is less than N, the number of data elements.
• For the threads of the second block (blockIdx.x = 1), the check works out as follows:

threadIdx.x   blockIdx.x   blockDim.x   dataIndex   dataIndex < N (N = 5)   Can work
0             1            4            4           true                    yes
1             1            4            5           false                   no
2             1            4            6           false                   no
3             1            4            7           false                   no
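A minimal sketch, reusing the hypothetical performWork kernel from above, of how this guard is written; N is passed to the kernel as an argument:

__global__ void performWork(float *data, int N)
{
    int dataIndex = threadIdx.x + blockIdx.x * blockDim.x;

    // Threads whose dataIndex falls outside the data simply do nothing;
    // only in-range threads touch memory.
    if (dataIndex < N)
    {
        data[dataIndex] *= 2.0f;
    }
}

With N = 5, the threads mapped to indices 5, 6, and 7 skip the body of the if statement and do no work.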
UNIT-5
CUDA THREADS
CUDA THREAD ORGANIZATION
• All CUDA threads in a grid execute the same kernel function and they
rely on coordinates to distinguish themselves from each other and to
identify the appropriate portion of the data to process.
• These threads are organized into a two-level hierarchy:
a grid consists of one or more blocks
each block in turn consists of one or more threads
• All threads in a block share the same block index, which can be accessed
as the blockIdx variable in a kernel.
• Each thread also has a thread index, which can be accessed as the
threadIdx variable in a kernel.
• When a thread executes a kernel function, references to the blockIdx and threadIdx variables return the coordinates of the thread.
• The execution configuration parameters in a kernel launch statement specify the dimensions of the grid and the dimensions of each block.
Example 1:
• Host code can be used to launch the vecAddkernel() kernel function and generate a 1D grid that consists of 128 blocks, each of which consists of 32 threads. The total number of threads in the grid is 128 * 32 = 4,096.
• Note that dimGrid and dimBlock are host code variables defined by the programmer.
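A sketch of what this could look like; the kernel signature vecAddkernel(float *A, float *B, float *C, int n) and the device pointers d_A, d_B, d_C are assumptions for illustration:

// Kernel (signature assumed): each thread adds one pair of elements.
__global__ void vecAddkernel(float *A, float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

// Host code; d_A, d_B, d_C are device pointers assumed to be allocated and filled already.
void launchVecAdd(float *d_A, float *d_B, float *d_C, int n)
{
    dim3 dimGrid(128, 1, 1);    // 128 blocks in the x direction
    dim3 dimBlock(32, 1, 1);    // 32 threads per block
    // Total threads in the grid: 128 * 32 = 4,096
    vecAddkernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C, n);
}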
Example 2:
• In CUDA C, the allowed values of gridDim.x, gridDim.y, and gridDim.z
range from 1 to 65,536.
• All threads in a block share the same blockIdx.x, blockIdx.y, and
blockIdx.z values.
• Among all blocks, the blockIdx.x value ranges between 0 and
gridDim.x-1, the blockIdx.y value between 0 and gridDim.y-1, and the
blockIdx.z value between 0 and gridDim.z-1.
• The total size of a block is limited to 1,024 threads, with flexibility in distributing these threads across the three dimensions as long as the total number of threads does not exceed 1,024.
• For example, (512, 1, 1), (8, 16, 4), and (32, 16, 2) are all allowable blockDim values, but (32, 32, 2) is not allowable since the total number of threads would exceed 1,024.
• A grid can have higher dimensionality than its blocks, and vice versa.
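These limits can be illustrated with host-side dim3 declarations (the variable names are arbitrary):

dim3 blockA(512, 1, 1);   // 512 threads    -> allowed
dim3 blockB(8, 16, 4);    // 512 threads    -> allowed
dim3 blockC(32, 16, 2);   // 1,024 threads  -> allowed (exactly at the limit)
dim3 blockD(32, 32, 2);   // 2,048 threads  -> not allowed, exceeds 1,024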
Example 3:
MAPPING THREADS TO MULTIDIMENSIONAL DATA
• The choice of 1D, 2D, or 3D thread organizations is usually based on the nature of the data.
• Example: pictures are a 2D array of pixels. It is often convenient to use a 2D grid that consists of 2D blocks to process the pixels in a picture.
• For example, a 76 * 62 picture has 4,712 pixels. Using 16 * 16 blocks, a 2D grid of 5 * 4 = 20 blocks (80 * 64 threads in total) is used to process the picture.
• Host code uses integer variables n and m to track the number of pixels in the x and y directions, respectively.
• We further assume that the input picture data has been copied to the device memory and can be accessed through a pointer variable d_Pin.
• The output picture has been allocated in the device memory and can be accessed through a pointer variable d_Pout.
• The following host code can be used to launch a 2D kernel to process the picture:
dim3 dimGrid(ceil(n/16.0), ceil(m/16.0), 1);
dim3 dimBlock(16, 16, 1);
pictureKernel<<<dimGrid, dimBlock>>>(d_Pin, d_Pout, n, m);
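A sketch of the surrounding host code under these assumptions; h_Pin and h_Pout are hypothetical host buffers, and error checking is omitted:

#include <cmath>

__global__ void pictureKernel(float *d_Pin, float *d_Pout, int n, int m);  // sketched in a later example

void scalePicture(float *h_Pin, float *h_Pout, int n, int m)
{
    int size = n * m * sizeof(float);
    float *d_Pin, *d_Pout;

    cudaMalloc((void **)&d_Pin, size);
    cudaMalloc((void **)&d_Pout, size);
    cudaMemcpy(d_Pin, h_Pin, size, cudaMemcpyHostToDevice);   // copy the input picture to the device

    dim3 dimGrid(ceil(n / 16.0), ceil(m / 16.0), 1);   // enough 16 x 16 blocks to cover n x m pixels
    dim3 dimBlock(16, 16, 1);
    pictureKernel<<<dimGrid, dimBlock>>>(d_Pin, d_Pout, n, m);

    cudaMemcpy(h_Pout, d_Pout, size, cudaMemcpyDeviceToHost); // copy the result back to the host
    cudaFree(d_Pin);
    cudaFree(d_Pout);
}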
To process a 2,000 * 1,500 (3 M pixel) picture with this configuration, we will generate 11,750 blocks: 125 in the x direction and 94 in the y direction.
• There are at least two ways one can linearize a 2D array. One is to place all elements of the same row into consecutive locations.
• The rows are then placed one after another into the memory space. This arrangement, called row-major layout, is used by C compilers.
• Another way to linearize a 2D array is to place all elements of the same column into consecutive locations.
• The columns are then placed one after another into the memory space. This arrangement, called column-major layout, is used by FORTRAN compilers.
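A small sketch of the corresponding index arithmetic (the helper name rowMajorIndex is made up for illustration):

// Row-major linearization: element (row, col) of a 2D array that has
// numCols columns is stored at 1D offset row * numCols + col.
__host__ __device__ inline int rowMajorIndex(int row, int col, int numCols)
{
    return row * numCols + col;
}

// Column-major (FORTRAN-style) layout would instead use col * numRows + row.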
Example
Let's assume that the kernel will scale every pixel value in the picture by a factor of 2.0. The kernel code is conceptually quite simple: there are a total of blockDim.x * gridDim.x threads in the horizontal direction.
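A sketch of what such a kernel could look like, consistent with the 16 x 16 launch shown earlier; the out-of-range guard is explained by the area-by-area discussion below:

__global__ void pictureKernel(float *d_Pin, float *d_Pout, int n, int m)
{
    // 2D global coordinates of this thread.
    int Col = blockIdx.x * blockDim.x + threadIdx.x;   // 0 .. n-1 when in range
    int Row = blockIdx.y * blockDim.y + threadIdx.y;   // 0 .. m-1 when in range

    // Only threads that map to a real pixel do any work.
    if ((Col < n) && (Row < m))
    {
        // Row-major indexing: pixel (Row, Col) is stored at Row * n + Col.
        d_Pout[Row * n + Col] = 2.0f * d_Pin[Row * n + Col];
    }
}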
Example 4: Execution of pictureKernel()
Consider the earlier 76 * 62 picture processed with 16 * 16 blocks: the 5 * 4 grid of 20 blocks falls into four areas.
Area 1:
• Consists of the threads that belong to the 12 blocks covering the majority of pixels in the picture.
• Both the Col and Row values of these threads are within range.
• All these threads will pass the if statement test and process pixels in the dark-shaded area of the picture.
• That is, all 16 * 16 = 256 threads in each block will process pixels.
Area 2:
• The second area contains the threads that belong to the 3 blocks in the medium-shaded area covering the upper-right portion of the picture.
• Although the Row values of these threads are always within range, the Col values of some of them equal or exceed the n value (76).
• This is because the number of threads in the horizontal direction is always a multiple of the blockDim.x value chosen by the programmer (16 in this case).
• The smallest multiple of 16 needed to cover 76 pixels is 80.
• As a result, 12 threads in each row will find their Col values within range and will process pixels.
• On the other hand, 4 threads in each row will find their Col values out of range and thus fail the if statement condition.
• These threads will not process any pixels. Overall, 12 * 16 = 192 out of the 16 * 16 = 256 threads in each of these blocks will process pixels.
Area 3:
• The third area accounts for the 4 lower-left blocks covering the medium-shaded area of the picture.
• Although the Col values of these threads are always within range, the Row values of some of them equal or exceed the m value (62).
• This is because the number of threads in the vertical direction is always a multiple of the blockDim.y value chosen by the programmer (16 in this case).
• The smallest multiple of 16 needed to cover 62 is 64.
• As a result, 14 threads in each column will find their Row values within range and will process pixels.
• On the other hand, 2 threads in each column will fail the if statement (as in area 2) and will not process any pixels; 16 * 14 = 224 out of the 256 threads in each of these blocks will process pixels.
Area 4:
• The fourth area contains the threads that cover the lower-right light-shaded area of the picture.
• Similar to area 2, 4 threads in each of the top 14 rows will find their Col values out of range.
• Similar to area 3, the entire bottom two rows of this block will find their Row values out of range.
• So, only 14 * 12 = 168 of the 16 * 16 = 256 threads will be allowed to process pixels.
3D ARRAYS
• Similarly, 3D arrays are linearized by including another dimension: each "plane" of the array is placed one after another in memory.
• Assume that the programmer uses variables m and n to track the number of rows and columns in a 3D array.
• The programmer also needs to determine the values of blockDim.z and gridDim.z when launching the kernel.
• In the kernel, the array index will involve another global index:
int Plane = blockIdx.z * blockDim.z + threadIdx.z;
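A sketch of how the three global indices could be combined into one linearized offset, assuming n columns, m rows, and a plane-by-plane, row-major layout; the kernel name scale3DKernel and the numPlanes parameter are illustrative:

__global__ void scale3DKernel(float *d_Pin, float *d_Pout,
                              int n, int m, int numPlanes)
{
    int Col   = blockIdx.x * blockDim.x + threadIdx.x;   // 0 .. n-1
    int Row   = blockIdx.y * blockDim.y + threadIdx.y;   // 0 .. m-1
    int Plane = blockIdx.z * blockDim.z + threadIdx.z;   // 0 .. numPlanes-1

    if (Col < n && Row < m && Plane < numPlanes)
    {
        // Each plane holds m * n elements; within a plane, row-major indexing applies.
        int idx = Plane * (m * n) + Row * n + Col;
        d_Pout[idx] = 2.0f * d_Pin[idx];
    }
}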