
• Launching a CUDA kernel creates a grid of threads that all execute the kernel function.
• The kernel function specifies the C statements that are executed by each individual thread at runtime.
• Each thread uses a unique coordinate, or thread index, to identify the portion of the data structure to process.
• In a CUDA kernel function, gridDim, blockDim, blockIdx, and threadIdx are all built-in variables. Their values are preinitialized by the CUDA runtime system and can be referenced in the kernel function.

CUDA Kernel Execution

• GPUs do work in parallel.
• GPU work is done in a thread.
• Many threads run in parallel.
• A collection of threads is a block.
• There are many blocks.
• A collection of blocks is a grid.
• GPU functions are called kernels.
• Kernels are launched with an execution configuration, for example:

performWork<<<2, 4>>>()

• The execution configuration defines the number of blocks in the grid, as well as the number of threads in each block.
• Every block in the grid contains the same number of threads.
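As a concrete illustration, the sketch below declares and launches the performWork kernel with the execution configuration above (2 blocks of 4 threads). The kernel body is an assumption, since the slides show only the launch.

__global__ void performWork()
{
    // Work placed here runs on the GPU, once per thread in the grid.
}

int main()
{
    // Execution configuration: <<<number of blocks, threads per block>>>,
    // so 2 * 4 = 8 threads execute performWork() in parallel.
    performWork<<<2, 4>>>();

    // Kernel launches are asynchronous; wait for the GPU work to finish.
    cudaDeviceSynchronize();
    return 0;
}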
CUDA-Provided Thread Hierarchy Variables

• Inside kernel definitions, CUDA-provided variables describe the executing thread, block, and grid.
• For the launch performWork<<<2, 4>>>():
  - gridDim.x is the number of blocks in the grid, in this case 2.
  - blockIdx.x is the index of the current block within the grid, in this case 0 or 1.
  - blockDim.x is the number of threads in a block, in this case 4. All blocks in a grid contain the same number of threads.
  - threadIdx.x is the index of the thread within its block, in this case 0, 1, 2, or 3.
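A minimal sketch of a kernel that reads these built-in variables (the use of device-side printf is an illustrative assumption):

#include <cstdio>

// Each thread reports its position in the thread hierarchy using the
// built-in variables gridDim, blockDim, blockIdx, and threadIdx.
__global__ void performWork()
{
    printf("gridDim.x=%d blockDim.x=%d blockIdx.x=%d threadIdx.x=%d\n",
           gridDim.x, blockDim.x, blockIdx.x, threadIdx.x);
}

int main()
{
    performWork<<<2, 4>>>();   // prints 8 lines: blockIdx.x in {0,1}, threadIdx.x in {0..3}
    cudaDeviceSynchronize();   // wait for the kernel (and its printf output) to finish
    return 0;
}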
Coordinating Parallel Threads

• Assume the data is in a 0-indexed vector with 8 elements (indices 0 through 7), and the kernel is launched with performWork<<<2, 4>>>().
• Somehow, each thread must be mapped to work on an element in the vector.
• Recall that each thread has access to the size of its block via blockDim.x, the index of its block within the grid via blockIdx.x, and its own index within its block via threadIdx.x.
• Using these variables, the formula threadIdx.x + blockIdx.x * blockDim.x will map each thread to one element in the vector:

threadIdx.x  blockIdx.x  blockDim.x  dataIndex
0            0           4           0
1            0           4           1
2            0           4           2
3            0           4           3
0            1           4           4
1            1           4           5
2            1           4           6
3            1           4           7
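A minimal kernel sketch of this mapping (the parameter name and element type are assumptions): each thread computes its dataIndex and works on exactly one element.

// Each thread derives a unique global index from its block and thread
// coordinates, then operates on the corresponding vector element.
__global__ void performWork(float *data)
{
    int dataIndex = threadIdx.x + blockIdx.x * blockDim.x;
    data[dataIndex] *= 2.0f;   // illustrative work on "this thread's" element
}

// Host side: with 8 elements, performWork<<<2, 4>>>(d_data) supplies
// exactly one thread per element (2 blocks x 4 threads = 8 threads).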
Grid Size Work Amount Mismatch

• In the previous scenario, the number of threads in the grid matched the number of elements exactly.
• What if there are more threads than work to be done? For example, the launch performWork<<<2, 4>>>() creates 8 threads, but the vector has only 5 elements (indices 0 through 4).
• Attempting to access non-existent elements can result in a runtime error.
• Code must check that the dataIndex calculated by threadIdx.x + blockIdx.x * blockDim.x is less than N, the number of data elements:

threadIdx.x  blockIdx.x  blockDim.x  dataIndex  dataIndex < N (N = 5)  Can work
0            1           4           4          true                   yes
1            1           4           5          false                  no
2            1           4           6          false                  no
3            1           4           7          false                  no
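A minimal sketch with the boundary check added (parameter names are illustrative):

// Guard against "extra" threads when the grid is larger than the data:
// only threads whose dataIndex maps to a real element do any work.
__global__ void performWork(float *data, int N)
{
    int dataIndex = threadIdx.x + blockIdx.x * blockDim.x;
    if (dataIndex < N)
        data[dataIndex] *= 2.0f;
}

// Host side: 8 threads are launched for N = 5 elements; the if statement
// keeps threads with dataIndex 5, 6, and 7 from touching non-existent elements.
// performWork<<<2, 4>>>(d_data, 5);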
UNIT-5
CUDA THREADS

CUDA THREAD ORGANIZATION

• All CUDA threads in a grid execute the same kernel function; they rely on coordinates to distinguish themselves from each other and to identify the appropriate portion of the data to process.
• These threads are organized into a two-level hierarchy:
  - a grid consists of one or more blocks
  - each block in turn consists of one or more threads
• All threads in a block share the same block index, which can be accessed as the blockIdx variable in a kernel.
• Each thread also has a thread index, which can be accessed as the threadIdx variable in a kernel.
• When a thread executes a kernel function, references to the blockIdx and threadIdx variables return the coordinates of the thread.
• The execution configuration parameters in a kernel launch statement specify the dimensions of the grid and the dimensions of each block.
• These dimensions are available as the predefined built-in variables gridDim and blockDim in kernel functions.
• The exact organization of a grid is determined by the execution configuration parameters (within <<< >>>) of the kernel launch statement.
• Each such parameter is of type dim3, a C struct with three unsigned integer fields: x, y, and z. These three fields correspond to the three dimensions.
• For 1D or 2D grids and blocks, the unused dimension fields should be set to 1 for clarity.
Example 1
• The following host code launches the vecAddKernel() kernel function and generates a 1D grid that consists of 128 blocks, each of which consists of 32 threads. The total number of threads in the grid is 128 * 32 = 4,096.

dim3 dimGrid(128, 1, 1);
dim3 dimBlock(32, 1, 1);
vecAddKernel<<<dimGrid, dimBlock>>>(...);

• Note that dimGrid and dimBlock are host code variables defined by the programmer.
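The slides elide the kernel itself; a minimal sketch of what vecAddKernel() might look like for float vectors (the parameter list is an assumption) is:

// Sketch of a 1D vector-add kernel consistent with the launches in these
// examples: each thread adds one pair of elements.
__global__ void vecAddKernel(const float *A, const float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard for a partial last block
        C[i] = A[i] + B[i];
}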
Example 2

dim3 dimGrid(ceil(n/256.0), 1, 1);
dim3 dimBlock(256, 1, 1);
vecAddKernel<<<dimGrid, dimBlock>>>(...);

• Once vecAddKernel() is launched, the grid and block dimensions will remain the same until the entire grid finishes execution.
• This allows the number of blocks to vary with the size of the vectors so that the grid will have enough threads to cover all vector elements.
• The value of variable n at kernel launch time will determine the dimension of the grid.
• CUDA C provides a special shortcut for launching a kernel with 1D grids and blocks: instead of using dim3 variables, the execution configuration simply takes an arithmetic expression as the x dimension and assumes that the y and z dimensions are 1.

vecAddKernel<<<ceil(n/256.0), 256>>>(...);
• In CUDA C, the allowed values of gridDim.x, gridDim.y, and gridDim.z
range from 1 to 65,536.
• All threads in a block share the same blockIdx.x, blockIdx.y, and
blockIdx.z values.
• Among all blocks, the blockIdx.x value ranges between 0 and
gridDim.x-1, the blockIdx.y value between 0 and gridDim.y-1, and the
blockIdx.z value between 0 and gridDim.z-1.

• The total size of a block is limited to 1,024 threads, with
flexibility in distributing these elements into the three
dimensions as long as the total number of threads does not
exceed 1,024.

• For example, (512, 1, 1), (8, 16, 4), and (32, 16, 2) are all
allowable blockDim values, but (32, 32, 2) is not allowable
since the total number of threads would exceed 1,024.

• A grid can have higher dimensionality than its blocks and vice versa.
Example 3

dim3 dimGrid(2, 2, 1);
dim3 dimBlock(4, 2, 2);

• This configuration generates a grid of 2 * 2 * 1 = 4 blocks, each containing 4 * 2 * 2 = 16 threads, for a total of 64 threads.
MAPPING THREADS TO MULTIDIMENSIONAL DATA

• The choice of 1D, 2D, or 3D thread organization is usually based on the nature of the data.
• Ex: pictures are a 2D array of pixels. It is often convenient to use a 2D grid that consists of 2D blocks to process the pixels in a picture.
• Ex: to process a 76 * 62 picture (4,712 pixels) using a 2D grid of 16 * 16 blocks, we need 80 * 64 threads, i.e. 5 * 4 = 20 blocks.
• Host code uses integer variables n and m to track the number of pixels in the x and y directions, respectively.
• We further assume that the input picture data has been copied to the device memory and can be accessed through a pointer variable d_Pin.
• The output picture has been allocated in the device memory and can be accessed through a pointer variable d_Pout.
• The following host code can be used to launch a 2D kernel to process the picture:

dim3 dimGrid(ceil(n/16.0), ceil(m/16.0), 1);
dim3 dimBlock(16, 16, 1);
pictureKernel<<<dimGrid, dimBlock>>>(d_Pin, d_Pout, n, m);
• To process a 2,000 * 1,500 (3-million-pixel) picture with this configuration, we will generate 11,750 blocks: 125 in the x direction (ceil(2000/16.0)) and 94 in the y direction (ceil(1500/16.0)).
• Within the kernel function, references to the built-in variables gridDim.x, gridDim.y, blockDim.x, and blockDim.y will return 125, 94, 16, and 16, respectively.
MAPPING THREADS TO MULTIDIMENSIONAL DATA

• In reality, all multidimensional arrays in C are linearized. This is due to the use of a "flat" memory space in modern computers.
• In the case of statically allocated arrays, the compilers allow programmers to use higher-dimensional indexing. The compiler linearizes such arrays into an equivalent 1D array and translates the multidimensional indexing syntax into a 1D offset.
• In the case of dynamically allocated arrays, the current CUDA C compiler leaves the work of such translation to the programmers, due to the lack of dimensional information at compile time.
• There are at least two ways one can linearize a 2D array. One is to place all elements of the same row into consecutive locations; the rows are then placed one after another into the memory space. This arrangement, called row-major layout, is used by C compilers: element (Row, Col) of an array with Width columns is stored at 1D offset Row * Width + Col.
• Another way to linearize a 2D array is to place all elements of the same column into consecutive locations; the columns are then placed one after another into the memory space. This arrangement, called column-major layout, is used by FORTRAN compilers.
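Since the CUDA C compiler leaves this linearization to the programmer for dynamically allocated arrays, accesses are typically written with an explicit row-major offset. A small illustrative helper (the function name is an assumption) is shown below.

// Row-major access into a dynamically allocated 2D array that is stored as
// a flat 1D buffer with Width elements per row.
__host__ __device__ float getElement(const float *M, int Row, int Col, int Width)
{
    return M[Row * Width + Col];   // element (Row, Col) in row-major layout
}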
• Let's assume that the kernel will scale every pixel value in the picture by a factor of 2.0. The kernel code is conceptually quite simple: there are a total of blockDim.x * gridDim.x threads in the horizontal direction.
Example 4:

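The kernel code for this example appeared as a figure in the original slides; a minimal sketch consistent with the description above (scaling every pixel by 2.0, with the d_Pin, d_Pout, n, m parameters from the earlier launch) would be:

// Each thread computes the 2D coordinates of "its" pixel from the block and
// thread indices, checks them against the picture bounds, and scales the pixel.
__global__ void pictureKernel(float *d_Pin, float *d_Pout, int n, int m)
{
    int Col = blockIdx.x * blockDim.x + threadIdx.x;   // x (column) index
    int Row = blockIdx.y * blockDim.y + threadIdx.y;   // y (row) index

    // Only threads that map to a real pixel pass this test (see Areas 1-4 below).
    if (Col < n && Row < m)
        d_Pout[Row * n + Col] = 2.0f * d_Pin[Row * n + Col];
}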
Execution of pictureKernel()

Area 1:
• Consists of the threads that belong to the 12 blocks covering the majority of pixels in the picture.
• Both the Col and Row values of these threads are within range.
• All these threads will pass the if statement test and process pixels in the dark-shaded area of the picture.
• That is, all 16 * 16 = 256 threads in each block will process pixels.
Area 2:
• The second area contains the threads that belong to the 3 blocks in the medium-shaded area covering the upper right of the picture.
• Although the Row values of these threads are always within range, the Col values of some of them exceed the n value (76).
• This is because the number of threads in the horizontal direction is always a multiple of the blockDim.x value chosen by the programmer (16 in this case).
• The smallest multiple of 16 needed to cover 76 pixels is 80.
• As a result, 12 threads in each row will find their Col values within range and will process pixels.
• On the other hand, 4 threads in each row will find their Col values out of range and thus fail the if statement condition; these threads will not process any pixels.
• Overall, 12 * 16 = 192 out of the 16 * 16 = 256 threads will process pixels.
Area 3:
• The third area accounts for the 3 lower-left blocks covering the medium-shaded area of the picture.
• Although the Col values of these threads are always within range, the Row values of some of them exceed the m value (62).
• This is because the number of threads in the vertical direction is always a multiple of the blockDim.y value chosen by the programmer (16 in this case).
• The smallest multiple of 16 needed to cover 62 is 64.
• As a result, 14 threads in each column will find their Row values within range and will process pixels.
• On the other hand, 2 threads in each column will find their Row values out of range and, as in area 2, fail the if statement condition; they will not process any pixels.
• Overall, 16 * 14 = 224 out of the 256 threads will process pixels.
Area 4:
• The fourth area contains the threads that cover the lower-right light-shaded area of the picture.
• Similar to area 2, 4 threads in each of the top 14 rows will find their Col values out of range.
• Similar to area 3, the entire bottom two rows of this block will find their Row values out of range.
• So, only 14 * 12 = 168 of the 16 * 16 = 256 threads will be allowed to process pixels.
3D ARRAYS
• Similarly, 3D arrays are linearized by including another dimension: each "plane" of the array is placed one after another in the memory space.
• Assume that the programmer uses variables m and n to track the number of rows and columns in a 3D array.
• The programmer also needs to determine the values of blockDim.z and gridDim.z when launching the kernel.
• In the kernel, the array index will involve another global index:

int Plane = blockIdx.z * blockDim.z + threadIdx.z;
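A minimal sketch of how the three global indices might combine into a linearized 3D offset (the kernel name, element type, and the numPlanes parameter are assumptions):

// Each thread computes a (Plane, Row, Col) coordinate and linearizes it into
// a 1D offset, treating the 3D array as consecutive m x n planes.
__global__ void process3D(float *data, int numPlanes, int m, int n)
{
    int Col   = blockIdx.x * blockDim.x + threadIdx.x;
    int Row   = blockIdx.y * blockDim.y + threadIdx.y;
    int Plane = blockIdx.z * blockDim.z + threadIdx.z;

    // Guard against threads that fall outside the array in any dimension.
    if (Plane < numPlanes && Row < m && Col < n)
    {
        int idx = Plane * m * n + Row * n + Col;   // plane offset + row-major within the plane
        data[idx] *= 2.0f;                         // illustrative work on the element
    }
}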
