CUDA 101
About Me
• Hello, I am Joyen, an upcoming SoC Design Engineer @Incore_Semiconductors
• Co-Founder, Hyprthrd
• GDSC Lead ’22
• Gaming, of course!
• Data centers
//Device Code
__global__ void hello_cuda(){
    printf("Hello from CUDA world \n");
}

//Host code
int main(){
    //kernel launch parameters
    hello_cuda<<<1,10>>>(); // async call
    printf("Hello from CPU \n");
    cudaDeviceSynchronize();
    cudaDeviceReset();
    return 0;
}

The full launch syntax is kernel<<<Dg, Db, Ns, S>>>(arguments):
• `Dg` is of type `dim3` and specifies the dimensions and size of the grid.
• `Db` is of type `dim3` and specifies the dimensions and size of each thread block.
• `Ns` is of type `size_t` and specifies the number of bytes of shared memory that is dynamically allocated per thread block for this call, in addition to statically allocated memory. `Ns` is an optional argument that defaults to 0.
• `S` is of type `cudaStream_t` and specifies the stream associated with this call. The stream must have been allocated in the same grid where the call is being made. `S` is an optional argument that defaults to the NULL stream. A sketch using all four parameters is shown below.
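For completeness, a small sketch with all four launch parameters in place; the stream created here with cudaStreamCreate and the zero shared-memory size are illustrative, not part of the original example:

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    dim3 grid(1);               // Dg: grid dimensions
    dim3 block(10);             // Db: block dimensions
    size_t shared_bytes = 0;    // Ns: dynamic shared memory per block (optional, defaults to 0)

    // S: the stream the launch is enqueued on (optional, defaults to the NULL stream)
    hello_cuda<<<grid, block, shared_bytes, stream>>>();

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);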
Grids and Blocks
• A grid is the collection of all the threads launched for a kernel; in the code above we launched 10 threads. A three-dimensional view of the grid can be visualised using the Cartesian coordinate system.
• Threads in a grid are organized into groups called thread blocks; these thread blocks allow the CUDA toolkit to synchronise and manage the workload without heavy performance penalties.
The kernel launch parameters tell the compiler how many blocks exist and how many threads there are per block.
Kernel Launch Parameters
• The kernel launch takes four parameters, but for now we will go ahead with two. As before, if we use an integer we can specify only one dimension; we use the dim3 data type to declare a three-dimensional variable.
• dim3 is a vector type whose components default to 1; the individual values are accessed as .x, .y and .z, as shown below.
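For instance (a minimal sketch; the block dimensions here are illustrative):

    dim3 block(4, 2);                                   // unspecified components default to 1, so block.z == 1
    printf("%u %u %u \n", block.x, block.y, block.z);   // components are unsigned ints; prints 4 2 1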
Example: Let's say we need to launch the hello_cuda kernel with a one-dimensional grid of 32 threads arranged into 8 thread blocks, each block having 4 threads in the x dimension. The arrangement of the grid would look like this, and can be written as:
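A possible launch configuration for this arrangement, reusing the hello_cuda kernel defined earlier:

    dim3 block(4);                    // 4 threads per block in x
    dim3 grid(8);                     // 8 blocks in x -> 8 * 4 = 32 threads in total
    hello_cuda<<<grid, block>>>();
    cudaDeviceSynchronize();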
Grids and Blocks
Example 2: Now try creating the configuration shown below: a 2-D grid with a total of 64 threads, arranged as 16 threads in the x dimension and 4 threads in the y dimension. Each thread block has 8 threads in the x dimension and 2 threads in the y dimension.
Grids and Blocks
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include "cuda_runtime.h"
#include "device_launch_parameters.h" //Device Code
#include <stdio.h> __global__ void hello_cuda(){
printf("Hello from CUDA world \n");
//Device Code }
__global__ void hello_cuda(){
printf("Hello from CUDA world \n"); //Host code
} int main(){
//kernel launch parameters
//Host code int nx, ny;
int main(){ nx = 16;
//kernel launch parameters ny = 4;
dim3 block(8,2,1);
dim3 grid(2,2,1); dim3 block(8,2,1);
dim3 grid(nx /block.x, ny/block.y);
hello_cuda<<<grid, block>>>();
cudaDeviceSynchronize(); hello_cuda<<<grid, block>>>();
cudaDeviceReset(); cudaDeviceSynchronize();
return 0; cudaDeviceReset();
} return 0;
}
Organization of Threads in a CUDA Program
The CUDA runtime initialises a unique threadIdx variable for each thread, depending on where that particular thread is located within its thread block. threadIdx is a dim3 type variable.
For a single 1-D thread block of 8 threads (A–H), each thread gets a unique threadIdx:

Thread Name   threadIdx.x   threadIdx.y   threadIdx.z
A             0             0             0
B             1             0             0
C             2             0             0
D             3             0             0
E             4             0             0
F             5             0             0
G             6             0             0
H             7             0             0
When the same 8 threads are arranged into two 1-D thread blocks of 4 threads each, threadIdx restarts from 0 in every block:

Thread Name   threadIdx.x   threadIdx.y   threadIdx.z
A             0             0             0
B             1             0             0
C             2             0             0
D             3             0             0
E             0             0             0
F             1             0             0
G             2             0             0
H             3             0             0
(Figure: threadIdx values for another multi-block arrangement of threads Q, R, S, T, U, V and X.)
For a 2-D arrangement of threads P, Q, R, S, T, U, X and Y across multiple blocks, the threadIdx values are:

Thread Name   threadIdx.x   threadIdx.y   threadIdx.z
P             0             0             0
Q             2             1             0
R             0             0             0
S             3             1             0
T             1             0             0
U             0             1             0
X             1             0             0
Y             1             1             0

Example 4: consider a 1-D grid with 4 thread blocks of 8 threads per block.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
//Device Code
__global__ void print_thread() {
printf("x: %d y: %d z: %d \n", threadIdx.x,
threadIdx.y, threadIdx.z);
Example-5: Lets code a CUDA program that is going to }
launch a 2-D grid which has 256 threads arranged into 2 //Host code
thread blocks in x-dimension and y-dimension. So we int main() {
will have 16 threads per block with 8 thread in x and
int nx, ny;
y-dimension.
nx = 16;
ny = 16;
cudaDeviceReset();
return 0;
}
The runtime also initialises a blockIdx variable for each thread, giving the index of the thread block that the thread belongs to within the grid.
For the eight threads A–H arranged into two 1-D blocks of four threads, the blockIdx values are:

Thread Name   blockIdx.x   blockIdx.y   blockIdx.z
A             0            0            0
B             0            0            0
C             0            0            0
D             0            0            0
E             1            0            0
F             1            0            0
G             1            0            0
H             1            0            0
(Figure: the corresponding blockIdx values for the arrangement of threads Q, R, S, T, U, V and X.)
For the 2-D arrangement of threads P, Q, R, S, T, U, X and Y, the blockIdx values are:

Thread Name   blockIdx.x   blockIdx.y   blockIdx.z
P             1            0            0
Q             1            0            0
R             0            1            0
S             0            1            0
T             1            1            0
U             1            1            0
X             0            0            0
Y             0            0            0
blockDim
• The blockDim variable holds the number of threads in each dimension of a thread block. Note that all thread blocks in a grid have the same block size, so this variable has the same value for every thread in the grid.
• blockDim is a dim3 type variable.

gridDim
• The gridDim variable holds the number of thread blocks in each dimension of a grid.
• gridDim is a dim3 type variable.
For example, with a thread block of 4 threads in x and 2 in y, blockDim.x is going to be 4 and blockDim.y is going to be 2.
The goal of this exercise is to use the same example we ran previously, but print the blockIdx, blockDim and gridDim variables and observe the output.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>

//Device Code
__global__ void print_details() {
    printf("blockIdx x: %d y: %d z: %d \nblockDim x: %d y: %d z: %d \ngridDim x: %d y: %d z: %d \n",
           blockIdx.x, blockIdx.y, blockIdx.z,
           blockDim.x, blockDim.y, blockDim.z,
           gridDim.x, gridDim.y, gridDim.z);
}

//Host code
int main() {
    int nx, ny;
    nx = 16;
    ny = 16;

    // launch configuration filled in to match the previous example: a 2x2 grid of 8x8 blocks
    dim3 block(8, 8);
    dim3 grid(nx / block.x, ny / block.y);

    print_details<<<grid, block>>>();
    cudaDeviceSynchronize();

    cudaDeviceReset();
    return 0;
}
Unique index calculation
• In a CUDA program it is very common to use the threadIdx, blockIdx and blockDim variable values to calculate array indices. It is important to remember why we use CUDA in the first place: because the loops we parallelise have no (or very few) dependencies between iterations.
• In the following section we are going to use these variables to access elements of an array transferred to the kernel, as sketched below.
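A hedged reconstruction of that setup, completed so it compiles: the kernel body, the sample values in h_data and the 8-element array size are assumptions, and the kernel here indexes the array with threadIdx.x alone.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>

//Device Code: each thread uses only its threadIdx.x as the array index (assumption)
__global__ void unique_idx_calc_threadIdx(int * input){
    int tid = threadIdx.x;
    printf("threadIdx.x: %d, value: %d \n", tid, input[tid]);
}

//Host code
int main(){
    const int array_size = 8;                                  // assumed size
    int h_data[array_size] = {23, 9, 4, 53, 65, 12, 1, 33};    // assumed sample values
    size_t array_byte_size = sizeof(int) * array_size;

    int * d_data;
    cudaMalloc((void**)&d_data, array_byte_size);
    cudaMemcpy(d_data, h_data, array_byte_size, cudaMemcpyHostToDevice);

    dim3 block(8);
    dim3 grid(1);
    unique_idx_calc_threadIdx<<<grid, block>>>(d_data);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    cudaDeviceReset();
    return 0;
}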
(Figure: threads A–H, arranged in two 1-D blocks, mapped to elements of the input array.)
• The previous offset won't work on its own: every block would return the same 8 elements from the x dimension.
• We therefore need a block offset along with the thread offset, as sketched below.
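A minimal sketch of that calculation (the kernel name is illustrative); the global index is the block offset blockIdx.x * blockDim.x plus the thread offset threadIdx.x:

//Device Code: block offset + thread offset = globally unique index
__global__ void unique_gid_calc(int * input){
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    printf("blockIdx.x: %d, threadIdx.x: %d, gid: %d, value: %d \n",
           blockIdx.x, threadIdx.x, gid, input[gid]);
}

// Launched, for example, as two blocks of 4 threads so every element of an
// 8-element array is touched exactly once:
//     dim3 block(4);
//     dim3 grid(2);
//     unique_gid_calc<<<grid, block>>>(d_data);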
CUDA provides device-side counterparts for the familiar C memory functions:

C          CUDA            Description
memset()   cudaMemset()    memset() sets values for a given memory location on the host; cudaMemset() performs the same operation on device memory.
free()     cudaFree()      free() reclaims the specified memory location on the host; cudaFree() reclaims device memory.
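A short sketch of the device-side counterparts in use; the buffer size here is an arbitrary choice:

    int * d_data;
    size_t bytes = 8 * sizeof(int);
    cudaMalloc((void**)&d_data, bytes);
    cudaMemset(d_data, 0, bytes);     // zero the device buffer, like memset() on the host
    // ... launch kernels that use d_data ...
    cudaFree(d_data);                 // reclaim the device allocation, like free() on the host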
Device Properties
There are more properties; refer to the datasheet.

Property             Explanation
major / minor        Major and minor revision numbers defining the device's compute capability
maxThreadsPerBlock   Maximum number of threads per block
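A small sketch of how these properties can be queried at runtime with cudaGetDeviceCount and cudaGetDeviceProperties:

#include "cuda_runtime.h"
#include <stdio.h>

int main(){
    int device_count = 0;
    cudaGetDeviceCount(&device_count);

    for (int dev = 0; dev < device_count; dev++){
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s\n", dev, prop.name);
        printf("  Compute capability (major.minor): %d.%d\n", prop.major, prop.minor);
        printf("  maxThreadsPerBlock: %d\n", prop.maxThreadsPerBlock);
    }
    return 0;
}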
• Speculation
• Processor technology
• Dual-issue processor
Intel’s Larrabee
• Larrabee is Intel’s code name for a future graphics
processing architecture based on the x86 architecture.
The first Larrabee chip is said to use dual-issue
cores derived from the original Pentium design, but
modified to include support for 64-bit x86 operations
and a new 512-bit vector-processing unit.
Intel’s Nehalem Architecture
The Programming Model (CUDA)
• Once compiled, kernels consist of many threads that execute the same program in parallel: one thread is like one iteration of a loop.
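To make the loop analogy concrete, a hedged vector-addition sketch (the kernel name and the bounds check are illustrative, not taken from the slides):

// CPU version: one loop iteration per element
void vec_add_cpu(const float *a, const float *b, float *c, int n){
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

// CUDA version: one thread per element; the loop disappears
__global__ void vec_add_gpu(const float *a, const float *b, float *c, int n){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                    // guard against extra threads in the last block
        c[i] = a[i] + b[i];
}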
The Streaming Multiprocessor
• Fermi’s streaming multiprocessors, shown in Figure 6, comprise 32 cores, each of which can perform floating-point and integer operations, along with 16 load-store units for memory operations, four special-function units, and 64 KB of local SRAM split between cache and local memory.
Conclusion…
Thank You!