CUDA 101
About Me
• Hello, I am Joyen, an upcoming SoC Design Engineer @Incore_Semiconductors
• Co-Founder, Hyprthrd
• GDSC Lead ’22

Below are some of my research interests:

• Computer Architecture
• SoC (System on Chip)
• HW/SW co-design
• Hardware Accelerators
But why GPU?
• AI, AI, AI, AI…
• Ever-growing demand for HPC
• Gaming, of course!
• Data centers
• And a lot more…
CPU vs GPU
CUDA Programming Flow

1. Initialization of data in the CPU context
2. Transfer of data from the CPU context to the GPU context
3. Kernel launch with the needed grid/block size
4. Transfer of results back to the CPU context from the GPU context
5. Reclaim the memory from both the CPU and GPU contexts
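The steps above map almost one-to-one onto CUDA runtime calls. Below is a minimal sketch of the full flow; the kernel name scale_array and the array size are made up purely for illustration.

#include "cuda_runtime.h"
#include <stdio.h>

// Hypothetical kernel used only to illustrate the flow.
__global__ void scale_array(int *data) {
    data[threadIdx.x] *= 2;
}

int main() {
    const int n = 8;
    int h_data[n] = {1, 2, 3, 4, 5, 6, 7, 8};    // 1. initialize data in the CPU context

    int *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(int));
    cudaMemcpy(d_data, h_data, n * sizeof(int),
               cudaMemcpyHostToDevice);           // 2. transfer data CPU -> GPU

    scale_array<<<1, n>>>(d_data);                // 3. kernel launch with grid/block size
    cudaDeviceSynchronize();

    cudaMemcpy(h_data, d_data, n * sizeof(int),
               cudaMemcpyDeviceToHost);           // 4. transfer results GPU -> CPU

    cudaFree(d_data);                             // 5. reclaim GPU memory (host array is on the stack)
    for (int i = 0; i < n; i++) printf("%d ", h_data[i]);
    printf("\n");
    return 0;
}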
Hello CUDA World!

Kernel launch syntax:

kernel_name<<<number_of_blocks, threads_per_block>>>();
kernel_name<<<Dg, Db, Ns, S>>>([kernel arguments]);

• `Dg` is of type `dim3` and specifies the dimensions and size of the grid.
• `Db` is of type `dim3` and specifies the dimensions and size of each thread block.
• `Ns` is of type `size_t` and specifies the number of bytes of shared memory that is dynamically allocated per thread block for this call, in addition to the statically allocated memory. `Ns` is an optional argument that defaults to 0.
• `S` is of type `cudaStream_t` and specifies the stream associated with this call. `S` is an optional argument that defaults to the NULL (default) stream.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>

// Device code
__global__ void hello_cuda() {
    printf("Hello from CUDA world \n");
}

// Host code
int main() {
    // Kernel launch: 1 block of 10 threads (asynchronous call)
    hello_cuda<<<1, 10>>>();
    printf("Hello from CPU \n");
    cudaDeviceSynchronize();
    cudaDeviceReset();
    return 0;
}
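To make `Ns` and `S` concrete, here is a small sketch of a launch that passes dynamic shared memory and a user-created stream; the kernel shared_demo is invented for illustration.

#include "cuda_runtime.h"
#include <stdio.h>

// Hypothetical kernel that uses dynamically allocated shared memory.
__global__ void shared_demo() {
    extern __shared__ int buffer[];          // sized by the Ns launch argument
    buffer[threadIdx.x] = threadIdx.x;
    __syncthreads();
    printf("buffer[%d] = %d\n", threadIdx.x, buffer[threadIdx.x]);
}

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    dim3 grid(1);                             // Dg: one block
    dim3 block(32);                           // Db: 32 threads per block
    size_t ns = block.x * sizeof(int);        // Ns: bytes of dynamic shared memory per block

    shared_demo<<<grid, block, ns, stream>>>();   // S: the stream for this launch

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaDeviceReset();
    return 0;
}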
Grids and Blocks
• A grid is the collection of all the threads launched for a kernel; in the code above we launched 10 threads. A three-dimensional view of the grid can be visualised using the Cartesian coordinate system.

• Threads in a grid are organized into groups called thread blocks; these thread blocks allow the CUDA toolkit to synchronise and manage the workload without heavy performance penalties.

The kernel launch parameters tell the runtime how many blocks exist and the number of threads per block.
Kernel Launch Parameters
• The kernel launch takes four parameters, but for now we will go ahead with two. As we did previously, plain integers can specify only one dimension; to declare a three-dimensional value we use the dim3 data type.

• dim3 is a vector type whose components default to 1. To access the individual values, use the syntax below.

dim3 variable_name(X, Y, Z);

variable_name.x
variable_name.y
variable_name.z

Limitations:
• For the block size, the maximum is 1024 threads in the x and y directions and 64 in the z direction, with X * Y * Z <= 1024.
• For the number of thread blocks in a grid, the maximum is 2^31 - 1 blocks in the x direction and 65535 blocks in each of the other dimensions.

Example: Let's say we need to launch the hello_cuda kernel with a one-dimensional grid of 32 threads arranged into 8 thread blocks, each block having 4 threads in the x dimension. This can be represented as shown in the sketch below.
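A minimal sketch of that launch configuration, reusing the hello_cuda kernel defined earlier (these lines slot into the body of main):

// 32 threads total: 8 thread blocks in x, 4 threads per block in x.
dim3 block(4, 1, 1);
dim3 grid(8, 1, 1);

hello_cuda<<<grid, block>>>();
cudaDeviceSynchronize();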
Grids and Blocks
Example 2: Now try creating the grid shown below: a 2-D grid with a total of 64 threads, arranged as 16 threads in the x dimension and 4 threads in the y dimension. Each thread block will have 8 threads in the x dimension and 2 threads in the y dimension.
Grids and Blocks

Hard-coded grid dimensions:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>

// Device code
__global__ void hello_cuda() {
    printf("Hello from CUDA world \n");
}

// Host code
int main() {
    // kernel launch parameters
    dim3 block(8, 2, 1);
    dim3 grid(2, 2, 1);

    hello_cuda<<<grid, block>>>();
    cudaDeviceSynchronize();
    cudaDeviceReset();
    return 0;
}

Grid dimensions derived from the problem size:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>

// Device code
__global__ void hello_cuda() {
    printf("Hello from CUDA world \n");
}

// Host code
int main() {
    // kernel launch parameters
    int nx = 16;
    int ny = 4;

    dim3 block(8, 2, 1);
    dim3 grid(nx / block.x, ny / block.y);

    hello_cuda<<<grid, block>>>();
    cudaDeviceSynchronize();
    cudaDeviceReset();
    return 0;
}
Organization of threads in a CUDA program

The CUDA runtime uniquely initialises the threadIdx variable for each thread, depending on where that particular thread is located in its thread block. threadIdx is a dim3-type variable.

Example 1: consider a 1-D grid with 1 thread block of 8 threads:

A B C D E F G H

Thread Name   threadIdx.x   threadIdx.y   threadIdx.z
A             0             0             0
B             1             0             0
C             2             0             0
D             3             0             0
E             4             0             0
F             5             0             0
G             6             0             0
H             7             0             0
Organization of threads in a CUDA program

Example 2: consider a 1-D grid with 2 thread blocks of 4 threads each:

A B C D    E F G H

Thread Name   threadIdx.x   threadIdx.y   threadIdx.z
A             0             0             0
B             1             0             0
C             2             0             0
D             3             0             0
E             0             0             0
F             1             0             0
G             2             0             0
H             3             0             0
Organization of threads in a CUDA program

Example 3: consider a grid with 4 thread blocks, with 4 threads per block, as shown below.

P Q R S
T U V X

Thread Name   threadIdx.x   threadIdx.y   threadIdx.z
P             0             0             0
Q             2             0             0
R             1             0             0
S             3             0             0
T             0             0             0
U             2             0             0
V             0             0             0
X             3             0             0
Organization of threads in a CUDA program

Example 4: consider a grid with 4 thread blocks, with 8 threads per block, as shown below.

X P
Y Q
R T
S U

Thread Name   threadIdx.x   threadIdx.y   threadIdx.z
P             0             0             0
Q             2             1             0
R             0             0             0
S             3             1             0
T             1             0             0
U             0             1             0
X             1             0             0
Y             1             1             0
Organization of threads in a CUDA program

Example 5: Let's write a CUDA program that launches a 2-D grid of 256 threads, arranged as 2 thread blocks in the x dimension and 2 in the y dimension. Each thread block therefore has 64 threads, with 8 threads in the x dimension and 8 in the y dimension.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>

// Device code
__global__ void print_thread() {
    printf("x: %d y: %d z: %d \n", threadIdx.x, threadIdx.y, threadIdx.z);
}

// Host code
int main() {
    int nx = 16;
    int ny = 16;

    // kernel launch parameters
    dim3 block(8, 8);
    dim3 grid(nx / block.x, ny / block.y);

    print_thread<<<grid, block>>>(); // async call

    printf("Hello from CPU \n");
    cudaDeviceSynchronize();
    cudaDeviceReset();
    return 0;
}
Organization of threads in a CUDA program

Example 2 (blockIdx values): consider a 1-D grid with 2 thread blocks of 4 threads each:

A B C D    E F G H

Thread Name   blockIdx.x   blockIdx.y   blockIdx.z
A             0            0            0
B             0            0            0
C             0            0            0
D             0            0            0
E             1            0            0
F             1            0            0
G             1            0            0
H             1            0            0
Organization of threads in a CUDA program

Example 3 (blockIdx values): consider a grid with 4 thread blocks, as shown below.

P Q R S
T U V X

Thread Name   blockIdx.x   blockIdx.y   blockIdx.z
P             0            0            0
Q             0            0            0
R             1            0            0
S             1            0            0
T             0            1            0
U             0            1            0
V             1            1            0
X             1            1            0
Organization of threads in a CUDA program

Example 4 (blockIdx values): consider a grid with 4 thread blocks, with 8 threads per block, as shown below.

X P
Y Q
R T
S U

Thread Name   blockIdx.x   blockIdx.y   blockIdx.z
P             1            0            0
Q             1            0            0
R             0            1            0
S             0            1            0
T             1            1            0
U             1            1            0
X             0            0            0
Y             0            0            0
Organization of threads in a CUDA program

blockDim
• The blockDim variable holds the number of threads in each dimension of a thread block. Note that all thread blocks in a grid have the same block size, so this variable has the same value for every thread in the grid.
• blockDim is a dim3-type variable.

gridDim
• The gridDim variable holds the number of thread blocks in each dimension of a grid.
• gridDim is a dim3-type variable.

For the grid pictured here, blockDim.x is 4 and blockDim.y is 2, while gridDim.x is 3 and gridDim.y is 2.
Organization of threads in a CUDA program

The goal of this exercise is to rerun the same example as before, but print blockIdx, blockDim and gridDim and observe the output.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>

// Device code
__global__ void print_details() {
    printf("blockIdx x: %d y: %d z: %d \nblockDim x: %d y: %d z: %d \ngridDim x: %d y: %d z: %d \n",
           blockIdx.x, blockIdx.y, blockIdx.z,
           blockDim.x, blockDim.y, blockDim.z,
           gridDim.x, gridDim.y, gridDim.z);
}

// Host code
int main() {
    int nx = 16;
    int ny = 16;

    dim3 block(8, 8);
    dim3 grid(nx / block.x, ny / block.y);

    print_details<<<grid, block>>>(); // async call

    printf("Hello from CPU \n");
    cudaDeviceSynchronize();
    cudaDeviceReset();
    return 0;
}
Unique index calculation

• In a CUDA program it is very common to use the threadIdx, blockIdx and blockDim variables to calculate array indices. It is important to remember why we use CUDA in the first place: the loop we are parallelising has few or no dependencies between iterations.
• In the following section we are going to use these variables to access elements of an array transferred to the kernel.

__global__ void unique_idx_calc_threadIdx(int *input) {
    int tid = threadIdx.x;
    printf("threadIdx.x : %d, value : %d \n", tid, input[tid]);
}

// Host code
int main() {
    int array_size = 8;
    int array_byte_size = sizeof(int) * array_size;
    int h_data[] = {31, 34, 41, 44, 23, 32, 34, 23};

    for (int i = 0; i < array_size; i++) {
        printf("%d ", h_data[i]);
    }
    printf("\n \n");

    int *d_data;
    cudaMalloc((void **)&d_data, array_byte_size);
    cudaMemcpy(d_data, h_data, array_byte_size, cudaMemcpyHostToDevice);

    dim3 block(8);
    dim3 grid(1);
    unique_idx_calc_threadIdx<<<grid, block>>>(d_data);
    cudaDeviceSynchronize();
    cudaDeviceReset();
    return 0;
}
Unique index calculation

With more than one thread block, threadIdx.x alone is no longer unique: it repeats in every block. A globally unique index (gid) is obtained by adding a block offset to the thread index:

gid = tid + offset
gid = tid + blockIdx.x * blockDim.x

A B C D    E F G H

Thread Name   threadIdx.x   blockIdx.x   blockDim.x
A             0             0            4
B             1             0            4
C             2             0            4
D             3             0            4
E             0             1            4
F             1             1            4
G             2             1            4
H             3             1            4
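A minimal sketch of a kernel that uses this global index, assuming the same 8-element d_data array from the previous example and a launch of two blocks of four threads:

__global__ void unique_gid_calc(int *input) {
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;   // block offset + thread offset
    printf("blockIdx.x : %d, threadIdx.x : %d, gid : %d, value : %d \n",
           blockIdx.x, tid, gid, input[gid]);
}

// Launched, for example, as:
//   dim3 block(4);
//   dim3 grid(2);
//   unique_gid_calc<<<grid, block>>>(d_data);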
Unique index calculation

• The previous offset won't work for a 2-D grid: every row of blocks would return the same 8 elements from the x dimension.
• We now need a row offset along with the block offset and the thread offset.

num_of_threads_in_row   = gridDim.x * blockDim.x
num_of_threads_in_block = blockDim.x

row_offset   = num_of_threads_in_row * blockIdx.y
block_offset = num_of_threads_in_block * blockIdx.x

gid = row_offset + block_offset + threadIdx.x
gid = gridDim.x * blockDim.x * blockIdx.y + blockIdx.x * blockDim.x + threadIdx.x
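A sketch of that calculation in kernel form, assuming a 1-D array laid out row by row and a 2-D grid of 1-D thread blocks:

__global__ void unique_gid_calc_2d(int *input) {
    int tid = threadIdx.x;
    int block_offset = blockIdx.x * blockDim.x;              // blocks to the left in this row
    int row_offset   = blockIdx.y * gridDim.x * blockDim.x;  // full rows of blocks above
    int gid = row_offset + block_offset + tid;
    printf("gid : %d, value : %d \n", gid, input[gid]);
}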


Memory transfer between host and device

[Figure: the host (CPU with its caches and DRAM) and the device (SMs with their caches and DRAM) have separate memories; data must be transferred explicitly between the two.]


Memory transfer between host and device

Syntax:

cudaMemcpy(destination ptr, source ptr, size in bytes, direction)

Direction           Keyword
Host to Device      cudaMemcpyHostToDevice
Device to Host      cudaMemcpyDeviceToHost
Device to Device    cudaMemcpyDeviceToDevice
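The host-to-device and device-to-host directions were shown in earlier examples; for completeness, here is a small sketch of a device-to-device copy between two GPU buffers, with the buffer names and size chosen only for illustration:

int *d_src, *d_dst;
size_t bytes = 256 * sizeof(int);

cudaMalloc((void **)&d_src, bytes);
cudaMalloc((void **)&d_dst, bytes);

// Copy within device memory: no round trip through the host.
cudaMemcpy(d_dst, d_src, bytes, cudaMemcpyDeviceToDevice);

cudaFree(d_src);
cudaFree(d_dst);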


Memory transfer between host and device

C         CUDA         Description
malloc    cudaMalloc   malloc() allocates memory on the host; cudaMalloc() allocates memory on the device
memset    cudaMemset   memset() sets values at a given memory location on the host; cudaMemset() performs the same operation on the device
free      cudaFree     free() reclaims the specified memory location on the host; cudaFree() reclaims it on the device
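A small sketch putting the three device-side calls together; the buffer size is arbitrary and chosen only for illustration:

#include "cuda_runtime.h"

int main() {
    const size_t bytes = 64 * sizeof(int);

    int *d_buf;
    cudaMalloc((void **)&d_buf, bytes);   // device-side counterpart of malloc()
    cudaMemset(d_buf, 0, bytes);          // device-side counterpart of memset()

    // ... launch kernels that use d_buf ...

    cudaFree(d_buf);                      // device-side counterpart of free()
    cudaDeviceReset();
    return 0;
}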
Device Properties
There are more properties; refer to the CUDA documentation.

Property             Explanation
name                 ASCII string identifying the device
major / minor        Major and minor revision numbers defining the device's compute capability
maxThreadsPerBlock   Maximum number of threads per block
totalGlobalMem       Total amount of global memory available on the device, in bytes
maxThreadsDim        Maximum size of each dimension of a block
maxGridSize          Maximum size of each dimension of a grid
warpSize             Warp size for the device
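These fields live in the cudaDeviceProp structure and can be queried at runtime; a minimal sketch:

#include "cuda_runtime.h"
#include <stdio.h>

int main() {
    int device_count = 0;
    cudaGetDeviceCount(&device_count);

    for (int dev = 0; dev < device_count; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);   // fill the property structure for this device

        printf("Device %d: %s\n", dev, prop.name);
        printf("  Compute capability : %d.%d\n", prop.major, prop.minor);
        printf("  Max threads/block  : %d\n", prop.maxThreadsPerBlock);
        printf("  Global memory      : %zu bytes\n", prop.totalGlobalMem);
        printf("  Max block dims     : %d x %d x %d\n",
               prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
        printf("  Max grid dims      : %d x %d x %d\n",
               prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
        printf("  Warp size          : %d\n", prop.warpSize);
    }
    return 0;
}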


GPU Architecture
CPU Computing—the Great Tradition
CPU performance is the product of many related
advances:
• Increased transistor density
• Increased transistor performance
• Wider data paths
• Pipelining
• Superscalar execution
• Speculative execution
• Caching
• Chip and system-level integration
CPUs are great…
• They’re easy to program
• Software developers can ignore most of the
complexity in modern CPUs; microarchitecture is
almost invisible, and compiler magic hides the
rest.
• Multicore chips have the same software
architecture as older multiprocessor systems: a
simple coherent memory model and a sea of
identical computing engines.
CPUs are great, but…
• CPU cores continue to be
optimized for single-threaded
performance at the expense of
parallel execution.
• This fact is most apparent
when one considers that
integer and floating-point
execution units occupy only a
tiny fraction of the die area
in a modern CPU.

[CPU Intel, Source: paper]


CPUs are great, but…
• With such a small part of the chip devoted to performing
direct calculations, it’s no surprise that CPUs are
relatively inefficient for high-performance computing
applications.

• Most of the circuitry on a CPU, and therefore most of the


heat it generates, is devoted to invisible complexity:
those caches, instruction decoders, branch predictors,
and other features that are not architecturally visible
but which enhance single-threaded performance.
But I can try!
The idea was to increase throughput via architectural innovations:

• Speculation
• Processor technology
• Dual-issue processors
Intel’s Larrabee
• Larrabee is Intel’s code name for a future graphics
processing architecture based on the x86 architecture.
The first Larrabee chip is said to use dual-issue
cores derived from the original Pentium design, but
modified to include support for 64-bit x86 operations
and a new 512-bit vector-processing unit.
Intel’s Nehalem Architecture

Nehalem is the most sophisticated


microarchitecture in any x86 processor. Its
features are like a laundry list of high
performance CPU design: four-wide
superscalar, out of order, speculative
execution, simultaneous multithreading,
multiple branch predictors, on-die power
gating, on-die memory controllers, large
caches, and multiple interprocessor
interconnects.
What we had so
far…
Understanding the problem is half the solution
The Wall
The History of the GPU
• There were attempts to build
chip-scale parallel processors
in the 1990s, but the limited
transistor budgets in those days
favored more sophisticated
single-core designs.
• The real path toward GPU computing began, not with GPUs, but with non-programmable 3D graphics accelerators.

[GeForce7000, Source: images]


The History of the GPU
NVIDIA’s GeForce 3 in 2001
introduced programmable pixel
shading to the
consumer market. The
programmability of this chip was
very limited, but later
GeForce products became more
flexible and faster, adding
separate programmable engines for
vertex and geometry shading. This
evolution culminated in the
GeForce 7800 [GeForce7000, Source:
images]
The History of the GPU
Introducing Fermi
• GPU computing isn’t meant to
replace CPU computing. Each
approach has advantages for
certain kinds of software. As
explained earlier, CPUs are
optimized for applications
where most of the work is
being done by a limited number
of threads, especially where
the threads exhibit high data
locality, a mix of different
operations, and a high
percentage of conditional
branches.
Introducing Fermi
• GPU design aims at the other end of the spectrum: applications
with multiple threads that are dominated by longer sequences
of computational instructions. Over the last few years, GPUs
have become much better at thread handling, data caching,
virtual memory management, flow control, and other CPU-like
features, but the distinction between computationally
intensive software and control-flow intensive software is
fundamental.

• At this level of abstraction, the GPU looks like a sea of


computational units with only a few support elements—an
illustration of the key GPU design goal, which is to maximize
floating-point throughput.
Introducing Fermi
The Programming Model (CUDA)
• The complexity of the Fermi architecture is managed by a multi-level
programming model that allows software developers to focus on algorithm
design rather than the details of how to map the algorithm to the
hardware, thus improving productivity. This is a concern that
conventional CPUs have yet to address because their structures are
simple and regular: a small number of cores presented as logical peers
on a virtual bus.

• The computational elements of algorithms are known as kernels (in both OpenCL and CUDA).

• An application or library function may consist of one or more kernels.

• Once compiled, kernels consist of many threads that execute the same program in parallel: one thread is like one iteration of a loop.
The Programming Model (CUDA)
The Programming Model (CUDA)

• Multiple threads are grouped into thread blocks. All of the threads in a thread block will run on a single SM. (On Fermi a thread block can contain up to 1,024 threads, and up to 1,536 threads can be resident on an SM at a time.)

• Thread blocks can coordinate the use of global shared memory among themselves but may execute in any order, concurrently or sequentially.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
The Programming Model (CUDA)

• At any one time, the entire Fermi device is dedicated to a


single application. As mentioned above, an application may
include multiple kernels. Fermi supports simultaneous
execution of multiple kernels from the same application, each
kernel being distributed to one or more SMs on the device.
This capability avoids the situation where a kernel is only
able to use part of the device and the rest goes unused.

• This switching is managed by the chip-level GigaThread


hardware thread scheduler, which manages 1,536 simultaneously
active threads for each streaming multiprocessor across 16
kernels.
The Streaming Multiprocessor

• Fermi’s streaming
multiprocessors, shown in Figure
6, comprise 32 cores, each of
which can perform floating-point
and integer operations, along
with 16 load-store units for
memory operations, four special-
function units, and 64K of local
SRAM split between cache and
local memory.
The Streaming Multiprocessor
Conclusion…
Thank You!
