UNIT-5
CUDA Kernel Execution
Throughout this section the running example is the kernel launch performWork<<<2, 4>>>().
• GPUs do work in parallel.
• GPU work is done in a thread.
• Many threads run in parallel.
• A collection of threads is a block.
• There are many blocks.
• A collection of blocks is a grid.
• GPU functions are called kernels.
• Kernels are launched with an execution configuration.
• The execution configuration defines the number of blocks in the grid, as well as the number of threads in each block.
• Every block in the grid contains the same number of threads.
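As a concrete illustration, a minimal sketch of what the running example could look like in full (the empty kernel body is just a placeholder):

// A kernel: a function that runs on the GPU, marked with __global__.
__global__ void performWork()
{
    // The work done by each GPU thread would go here.
}

int main()
{
    // Execution configuration: <<<number of blocks, threads per block>>>
    performWork<<<2, 4>>>();     // a grid of 2 blocks, each containing 4 threads
    cudaDeviceSynchronize();     // wait for the asynchronous kernel launch to finish
    return 0;
}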
CUDA-Provided Thread Hierarchy Variables
• Inside kernel definitions, CUDA-provided variables describe the executing thread, block, and grid.
• gridDim.x is the number of blocks in the grid; for performWork<<<2, 4>>>() it is 2.
• blockIdx.x is the index of the current block within the grid; here it is 0 for the first block and 1 for the second.
• blockDim.x is the number of threads in a block; here it is 4, and all blocks in a grid contain the same number of threads.
• threadIdx.x is the index of the current thread within its block; here it ranges from 0 to 3 in each block.
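A minimal sketch of a kernel that simply prints these variables for each of its threads; the kernel name printThreadInfo is made up for illustration:

#include <cstdio>

__global__ void printThreadInfo()
{
    // Each of the 2 * 4 = 8 threads reports its own coordinates.
    printf("gridDim.x=%d blockIdx.x=%d blockDim.x=%d threadIdx.x=%d\n",
           gridDim.x, blockIdx.x, blockDim.x, threadIdx.x);
}

int main()
{
    printThreadInfo<<<2, 4>>>();   // same configuration as the running example
    cudaDeviceSynchronize();       // wait so that the device printf output appears
    return 0;
}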
Coordinating Parallel Threads
• Assume the data is in a 0-indexed vector; here it has 8 elements, with indices 0 through 7.
• Somehow, each thread must be mapped to work on an element in the vector.
• Recall that each thread has access to the size of its block via blockDim.x, to the index of its block within the grid via blockIdx.x, and to its own index within its block via threadIdx.x.
• Using these variables, the formula threadIdx.x + blockIdx.x * blockDim.x will map each thread to one element in the vector.
• For performWork<<<2, 4>>>() (blockDim.x = 4, blockIdx.x = 0 or 1, threadIdx.x = 0 to 3), the mapping works out as follows:

threadIdx.x   blockIdx.x   blockDim.x   dataIndex
0             0            4            0
1             0            4            1
2             0            4            2
3             0            4            3
0             1            4            4
1             1            4            5
2             1            4            6
3             1            4            7
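A minimal sketch of how this mapping is used inside a kernel; the kernel body (doubling each element) and the device pointer data are assumptions made for illustration:

__global__ void performWork(float *data)
{
    // Map this thread to exactly one element of the vector.
    int dataIndex = threadIdx.x + blockIdx.x * blockDim.x;

    data[dataIndex] *= 2.0f;   // example work: double the element
}

// Launched as performWork<<<2, 4>>>(d_data) with exactly 2 * 4 = 8 threads,
// one per element of the 8-element vector d_data.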
Grid Size Work Amount Mismatch
• In the previous scenarios, the number of threads in the grid matched the number of elements exactly.
• What if there are more threads than work to be done? Suppose the vector now has only N = 5 elements (indices 0 through 4), while performWork<<<2, 4>>>() still launches 8 threads.
• Attempting to access non-existent elements can result in a runtime error.
• Code must check that the dataIndex calculated by threadIdx.x + blockIdx.x * blockDim.x is less than N, the number of data elements.
• For the threads of the second block (blockIdx.x = 1), the check works out as follows:

threadIdx.x   blockIdx.x   blockDim.x   dataIndex   dataIndex < N (N = 5)   Can work
0             1            4            4           true                    yes
1             1            4            5           false                   no
2             1            4            6           false                   no
3             1            4            7           false                   no
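A minimal sketch, reusing the hypothetical performWork kernel from above, of how this guard is written; N is passed to the kernel as an argument:

__global__ void performWork(float *data, int N)
{
    int dataIndex = threadIdx.x + blockIdx.x * blockDim.x;

    // Threads whose dataIndex falls outside the data simply do nothing;
    // only in-range threads touch memory.
    if (dataIndex < N)
    {
        data[dataIndex] *= 2.0f;
    }
}

With N = 5, the threads mapped to indices 5, 6, and 7 skip the body of the if statement and do no work.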
UNIT-5
CUDA THREADS
CUDA THREAD ORGANIZATION
• All CUDA threads in a grid execute the same kernel function and they
rely on coordinates to distinguish themselves from each other and to
identify the appropriate portion of the data to process.
• These threads are organized into a two-level hierarchy:
a grid consists of one or more blocks
each block in turn consists of one or more threads
• All threads in a block share the same block index, which can be accessed
as the blockIdx variable in a kernel.
• Each thread also has a thread index, which can be accessed as the
threadIdx variable in a kernel.
• When a thread executes a kernel function, references to the blockIdx and threadIdx variables return the coordinates of the thread.
• The execution configuration parameters in a kernel launch statement specify the dimensions of the grid and the dimensions of each block.
Example 1:
• Host code can be used to launch the vecAddkernel() kernel function and generate a 1D grid that consists of 128 blocks, each of which consists of 32 threads. The total number of threads in the grid is 128 * 32 = 4,096.
• Note that dimGrid and dimBlock are host code variables defined by the programmer.
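A sketch of what this could look like; the kernel signature vecAddkernel(float *A, float *B, float *C, int n) and the device pointers d_A, d_B, d_C are assumptions for illustration:

// Kernel (signature assumed): each thread adds one pair of elements.
__global__ void vecAddkernel(float *A, float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

// Host code; d_A, d_B, d_C are device pointers assumed to be allocated and filled already.
void launchVecAdd(float *d_A, float *d_B, float *d_C, int n)
{
    dim3 dimGrid(128, 1, 1);    // 128 blocks in the x direction
    dim3 dimBlock(32, 1, 1);    // 32 threads per block
    // Total threads in the grid: 128 * 32 = 4,096
    vecAddkernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C, n);
}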
Example 2:
• In CUDA C, the allowed values of gridDim.x, gridDim.y, and gridDim.z
range from 1 to 65,536.
• All threads in a block share the same blockIdx.x, blockIdx.y, and
blockIdx.z values.
• Among all blocks, the blockIdx.x value ranges between 0 and
gridDim.x-1, the blockIdx.y value between 0 and gridDim.y-1, and the
blockIdx.z value between 0 and gridDim.z-1.
• The total size of a block is limited to 1,024 threads, with flexibility in distributing these threads across the three dimensions as long as the total number of threads does not exceed 1,024.
• For example, (512, 1, 1), (8, 16, 4), and (32, 16, 2) are all allowable blockDim values, but (32, 32, 2) is not allowable since the total number of threads would exceed 1,024.
• A grid can have higher dimensionality than its blocks, and vice versa.
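These limits can be illustrated with host-side dim3 declarations (the variable names are arbitrary):

dim3 blockA(512, 1, 1);   // 512 threads    -> allowed
dim3 blockB(8, 16, 4);    // 512 threads    -> allowed
dim3 blockC(32, 16, 2);   // 1,024 threads  -> allowed (exactly at the limit)
dim3 blockD(32, 32, 2);   // 2,048 threads  -> not allowed, exceeds 1,024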
Example 3:
MAPPING THREADS TO MULTIDIMENSIONAL DATA
• The choice of 1D, 2D, or 3D thread organizations is usually based on the nature of the data.
• Example: pictures are a 2D array of pixels. It is often convenient to use a 2D grid that consists of 2D blocks to process the pixels in a picture.
• For example, a 76 * 62 picture has 4,712 pixels. Using 16 * 16 blocks, a 2D grid of 5 * 4 = 20 blocks (80 * 64 threads in total) is used to process the picture.
• Host code uses integer variables n and m to track the number of pixels in the x and y directions, respectively.
• We further assume that the input picture data has been copied to the device memory and can be accessed through a pointer variable d_Pin.
• The output picture has been allocated in the device memory and can be accessed through a pointer variable d_Pout.
• The following host code can be used to launch a 2D kernel to process the picture:
dim3 dimGrid(ceil(n/16.0), ceil(m/16.0), 1);
dim3 dimBlock(16, 16, 1);
pictureKernel<<<dimGrid, dimBlock>>>(d_Pin, d_Pout, n, m);
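A sketch of the surrounding host code under these assumptions; h_Pin and h_Pout are hypothetical host buffers, and error checking is omitted:

#include <cmath>

__global__ void pictureKernel(float *d_Pin, float *d_Pout, int n, int m);  // sketched in a later example

void scalePicture(float *h_Pin, float *h_Pout, int n, int m)
{
    int size = n * m * sizeof(float);
    float *d_Pin, *d_Pout;

    cudaMalloc((void **)&d_Pin, size);
    cudaMalloc((void **)&d_Pout, size);
    cudaMemcpy(d_Pin, h_Pin, size, cudaMemcpyHostToDevice);   // copy the input picture to the device

    dim3 dimGrid(ceil(n / 16.0), ceil(m / 16.0), 1);   // enough 16 x 16 blocks to cover n x m pixels
    dim3 dimBlock(16, 16, 1);
    pictureKernel<<<dimGrid, dimBlock>>>(d_Pin, d_Pout, n, m);

    cudaMemcpy(h_Pout, d_Pout, size, cudaMemcpyDeviceToHost); // copy the result back to the host
    cudaFree(d_Pin);
    cudaFree(d_Pout);
}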
To process a 2,000 * 1,500 (3 M pixel) picture with this configuration, we will generate 11,750 blocks: 125 in the x direction and 94 in the y direction.
• There are at least two ways one can linearize a 2D array. One is to place all elements of the same row into consecutive locations.
• The rows are then placed one after another into the memory space. This arrangement, called row-major layout, is used by C compilers.
• Another way to linearize a 2D array is to place all elements of the same column into consecutive locations.
• The columns are then placed one after another into the memory space. This arrangement, called column-major layout, is used by FORTRAN compilers.
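A small sketch of the corresponding index arithmetic (the helper name rowMajorIndex is made up for illustration):

// Row-major linearization: element (row, col) of a 2D array that has
// numCols columns is stored at 1D offset row * numCols + col.
__host__ __device__ inline int rowMajorIndex(int row, int col, int numCols)
{
    return row * numCols + col;
}

// Column-major (FORTRAN-style) layout would instead use col * numRows + row.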
Example
Let's assume that the kernel will scale every pixel value in the picture by a factor of 2.0. The kernel code is conceptually quite simple: there are a total of blockDim.x * gridDim.x threads in the horizontal direction.
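A sketch of what such a kernel could look like, consistent with the 16 x 16 launch shown earlier; the out-of-range guard is explained by the area-by-area discussion below:

__global__ void pictureKernel(float *d_Pin, float *d_Pout, int n, int m)
{
    // 2D global coordinates of this thread.
    int Col = blockIdx.x * blockDim.x + threadIdx.x;   // 0 .. n-1 when in range
    int Row = blockIdx.y * blockDim.y + threadIdx.y;   // 0 .. m-1 when in range

    // Only threads that map to a real pixel do any work.
    if ((Col < n) && (Row < m))
    {
        // Row-major indexing: pixel (Row, Col) is stored at Row * n + Col.
        d_Pout[Row * n + Col] = 2.0f * d_Pin[Row * n + Col];
    }
}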
Example 4: Execution of pictureKernel()
Consider the earlier 76 * 62 picture processed with 16 * 16 blocks: the 5 * 4 grid of 20 blocks falls into four areas.
Area 1:
• Consists of the threads that belong to the 12 blocks covering the majority of pixels in the picture.
• Both the Col and Row values of these threads are within range.
• All these threads will pass the if statement test and process pixels in the dark-shaded area of the picture.
• That is, all 16 * 16 = 256 threads in each block will process pixels.
Area 2:
• The second area contains the threads that belong to the 3 blocks in the medium-shaded area covering the upper-right portion of the picture.
• Although the Row values of these threads are always within range, the Col values of some of them equal or exceed the n value (76).
• This is because the number of threads in the horizontal direction is always a multiple of the blockDim.x value chosen by the programmer (16 in this case).
• The smallest multiple of 16 needed to cover 76 pixels is 80.
• As a result, 12 threads in each row will find their Col values within range and will process pixels.
• On the other hand, 4 threads in each row will find their Col values out of range and thus fail the if statement condition.
• These threads will not process any pixels. Overall, 12 * 16 = 192 out of the 16 * 16 = 256 threads in each of these blocks will process pixels.
Area 3:
• The third area accounts for the 4 lower-left blocks covering the medium-shaded area of the picture.
• Although the Col values of these threads are always within range, the Row values of some of them equal or exceed the m value (62).
• This is because the number of threads in the vertical direction is always a multiple of the blockDim.y value chosen by the programmer (16 in this case).
• The smallest multiple of 16 needed to cover 62 is 64.
• As a result, 14 threads in each column will find their Row values within range and will process pixels.
• On the other hand, 2 threads in each column will fail the if statement (as in area 2) and will not process any pixels; 16 * 14 = 224 out of the 256 threads in each of these blocks will process pixels.
Area 4:
• The fourth area contains the threads that cover the lower-right light-shaded area of the picture.
• Similar to area 2, 4 threads in each of the top 14 rows will find their Col values out of range.
• Similar to area 3, the entire bottom two rows of this block will find their Row values out of range.
• So, only 14 * 12 = 168 of the 16 * 16 = 256 threads will be allowed to process pixels.
3D ARRAYS
• Similarly, 3D arrays are linearized by including another dimension: each "plane" of the array is placed one after another in memory.
• Assume that the programmer uses variables m and n to track the number of rows and columns in a 3D array.
• The programmer also needs to determine the values of blockDim.z and gridDim.z when launching the kernel.
• In the kernel, the array index will involve another global index:
int Plane = blockIdx.z * blockDim.z + threadIdx.z;
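A sketch of how the three global indices could be combined into one linearized offset, assuming n columns, m rows, and a plane-by-plane, row-major layout; the kernel name scale3DKernel and the numPlanes parameter are illustrative:

__global__ void scale3DKernel(float *d_Pin, float *d_Pout,
                              int n, int m, int numPlanes)
{
    int Col   = blockIdx.x * blockDim.x + threadIdx.x;   // 0 .. n-1
    int Row   = blockIdx.y * blockDim.y + threadIdx.y;   // 0 .. m-1
    int Plane = blockIdx.z * blockDim.z + threadIdx.z;   // 0 .. numPlanes-1

    if (Col < n && Row < m && Plane < numPlanes)
    {
        // Each plane holds m * n elements; within a plane, row-major indexing applies.
        int idx = Plane * (m * n) + Row * n + Col;
        d_Pout[idx] = 2.0f * d_Pin[idx];
    }
}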