CUDA 101
About Me
• Hello, I am Joyen, an upcoming SoC Design Engineer @Incore_Semiconductors
• Co-Founder, Hyprthrd
• GDSC Lead ’22

Below are some of my research interests:

• Computer Architecture
• SoC (System on Chip)
• HW/SW co-design
• Hardware Accelerators
But why GPU?
• AI, AI, AI, AI…
• Ever-growing demand for HPC
• Gaming, of course!
• Data centers
• And a lot more…
CPU vs GPU
CUDA Programming Flow

1. Initialization of data in the CPU context
2. Transfer of data from the CPU context to the GPU context
3. Kernel launch with the needed grid/block size
4. Transfer of results back to the CPU context from the GPU context
5. Reclaim the memory from both the CPU and GPU contexts
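The steps above map almost one-to-one onto CUDA runtime calls. Below is a minimal sketch of the full flow; the kernel name scale_array and the array size are made up purely for illustration.

#include "cuda_runtime.h"
#include <stdio.h>

// Hypothetical kernel used only to illustrate the flow.
__global__ void scale_array(int *data) {
    data[threadIdx.x] *= 2;
}

int main() {
    const int n = 8;
    int h_data[n] = {1, 2, 3, 4, 5, 6, 7, 8};    // 1. initialize data in the CPU context

    int *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(int));
    cudaMemcpy(d_data, h_data, n * sizeof(int),
               cudaMemcpyHostToDevice);           // 2. transfer data CPU -> GPU

    scale_array<<<1, n>>>(d_data);                // 3. kernel launch with grid/block size
    cudaDeviceSynchronize();

    cudaMemcpy(h_data, d_data, n * sizeof(int),
               cudaMemcpyDeviceToHost);           // 4. transfer results GPU -> CPU

    cudaFree(d_data);                             // 5. reclaim GPU memory (host array is on the stack)
    for (int i = 0; i < n; i++) printf("%d ", h_data[i]);
    printf("\n");
    return 0;
}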
Hello CUDA World!

Kernel launch syntax:

kernel_name<<<number_of_blocks, threads_per_block>>>();
kernel_name<<<Dg, Db, Ns, S>>>([kernel arguments]);

• `Dg` is of type `dim3` and specifies the dimensions and size of the grid.
• `Db` is of type `dim3` and specifies the dimensions and size of each thread block.
• `Ns` is of type `size_t` and specifies the number of bytes of shared memory that is dynamically allocated per thread block for this call, in addition to the statically allocated memory. `Ns` is an optional argument that defaults to 0.
• `S` is of type `cudaStream_t` and specifies the stream associated with this call. `S` is an optional argument that defaults to the NULL (default) stream.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>

// Device code
__global__ void hello_cuda() {
    printf("Hello from CUDA world \n");
}

// Host code
int main() {
    // Kernel launch: 1 block of 10 threads (asynchronous call)
    hello_cuda<<<1, 10>>>();
    printf("Hello from CPU \n");
    cudaDeviceSynchronize();
    cudaDeviceReset();
    return 0;
}
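To make `Ns` and `S` concrete, here is a small sketch of a launch that passes dynamic shared memory and a user-created stream; the kernel shared_demo is invented for illustration.

#include "cuda_runtime.h"
#include <stdio.h>

// Hypothetical kernel that uses dynamically allocated shared memory.
__global__ void shared_demo() {
    extern __shared__ int buffer[];          // sized by the Ns launch argument
    buffer[threadIdx.x] = threadIdx.x;
    __syncthreads();
    printf("buffer[%d] = %d\n", threadIdx.x, buffer[threadIdx.x]);
}

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    dim3 grid(1);                             // Dg: one block
    dim3 block(32);                           // Db: 32 threads per block
    size_t ns = block.x * sizeof(int);        // Ns: bytes of dynamic shared memory per block

    shared_demo<<<grid, block, ns, stream>>>();   // S: the stream for this launch

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaDeviceReset();
    return 0;
}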
Grids and Blocks
• A grid is the collection of all the threads launched for a kernel; in the code above we launched 10 threads. A three-dimensional view of the grid can be visualised using the Cartesian coordinate system.

• Threads in a grid are organized into groups called thread blocks; these thread blocks allow the CUDA toolkit to synchronise and manage the workload without heavy performance penalties.

The kernel launch parameters tell the runtime how many blocks exist and the number of threads per block.
Kernel Launch Parameters
• The kernel launch takes four parameters, but for now we will go ahead with two. As we did previously, plain integers can specify only one dimension; to declare a three-dimensional value we use the dim3 data type.

• dim3 is a vector type whose components default to 1. To access the individual values, use the syntax below.

dim3 variable_name(X, Y, Z);

variable_name.x
variable_name.y
variable_name.z

Limitations:
• For the block size, the maximum is 1024 threads in the x and y directions and 64 in the z direction, with X * Y * Z <= 1024.
• For the number of thread blocks in a grid, the maximum is 2^31 - 1 blocks in the x direction and 65535 blocks in each of the other dimensions.

Example: Let's say we need to launch the hello_cuda kernel with a one-dimensional grid of 32 threads arranged into 8 thread blocks, each block having 4 threads in the x dimension. This can be represented as shown in the sketch below.
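A minimal sketch of that launch configuration, reusing the hello_cuda kernel defined earlier (these lines slot into the body of main):

// 32 threads total: 8 thread blocks in x, 4 threads per block in x.
dim3 block(4, 1, 1);
dim3 grid(8, 1, 1);

hello_cuda<<<grid, block>>>();
cudaDeviceSynchronize();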
Grids and Blocks
Example 2: Now try creating the grid shown below: a 2-D grid with a total of 64 threads, arranged as 16 threads in the x dimension and 4 threads in the y dimension. Each thread block will have 8 threads in the x dimension and 2 threads in the y dimension.
Grids and Blocks

Hard-coded grid dimensions:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>

// Device code
__global__ void hello_cuda() {
    printf("Hello from CUDA world \n");
}

// Host code
int main() {
    // kernel launch parameters
    dim3 block(8, 2, 1);
    dim3 grid(2, 2, 1);

    hello_cuda<<<grid, block>>>();
    cudaDeviceSynchronize();
    cudaDeviceReset();
    return 0;
}

Grid dimensions derived from the problem size:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>

// Device code
__global__ void hello_cuda() {
    printf("Hello from CUDA world \n");
}

// Host code
int main() {
    // kernel launch parameters
    int nx = 16;
    int ny = 4;

    dim3 block(8, 2, 1);
    dim3 grid(nx / block.x, ny / block.y);

    hello_cuda<<<grid, block>>>();
    cudaDeviceSynchronize();
    cudaDeviceReset();
    return 0;
}
Organization of threads in a CUDA program

The CUDA runtime uniquely initialises the threadIdx variable for each thread, depending on where that particular thread is located in its thread block. threadIdx is a dim3-type variable.

Example 1: consider a 1-D grid with 1 thread block of 8 threads:

A B C D E F G H

Thread Name   threadIdx.x   threadIdx.y   threadIdx.z
A             0             0             0
B             1             0             0
C             2             0             0
D             3             0             0
E             4             0             0
F             5             0             0
G             6             0             0
H             7             0             0
Organization of threads in a CUDA program

Example 2: consider a 1-D grid with 2 thread blocks of 4 threads each:

A B C D    E F G H

Thread Name   threadIdx.x   threadIdx.y   threadIdx.z
A             0             0             0
B             1             0             0
C             2             0             0
D             3             0             0
E             0             0             0
F             1             0             0
G             2             0             0
H             3             0             0
Organization of threads in a CUDA program

Example 3: consider a grid with 4 thread blocks, with 4 threads per block, as shown below.

P Q R S
T U V X

Thread Name   threadIdx.x   threadIdx.y   threadIdx.z
P             0             0             0
Q             2             0             0
R             1             0             0
S             3             0             0
T             0             0             0
U             2             0             0
V             0             0             0
X             3             0             0
Organization of threads in a CUDA program

Example 4: consider a grid with 4 thread blocks, with 8 threads per block, as shown below.

X P
Y Q
R T
S U

Thread Name   threadIdx.x   threadIdx.y   threadIdx.z
P             0             0             0
Q             2             1             0
R             0             0             0
S             3             1             0
T             1             0             0
U             0             1             0
X             1             0             0
Y             1             1             0
Organization of threads in a CUDA program

Example 5: Let's write a CUDA program that launches a 2-D grid of 256 threads, arranged as 2 thread blocks in the x dimension and 2 in the y dimension. Each thread block therefore has 64 threads, with 8 threads in the x dimension and 8 in the y dimension.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>

// Device code
__global__ void print_thread() {
    printf("x: %d y: %d z: %d \n", threadIdx.x, threadIdx.y, threadIdx.z);
}

// Host code
int main() {
    int nx = 16;
    int ny = 16;

    // kernel launch parameters
    dim3 block(8, 8);
    dim3 grid(nx / block.x, ny / block.y);

    print_thread<<<grid, block>>>(); // async call

    printf("Hello from CPU \n");
    cudaDeviceSynchronize();
    cudaDeviceReset();
    return 0;
}
Organization of threads in a CUDA program

Example 2 (blockIdx values): consider a 1-D grid with 2 thread blocks of 4 threads each:

A B C D    E F G H

Thread Name   blockIdx.x   blockIdx.y   blockIdx.z
A             0            0            0
B             0            0            0
C             0            0            0
D             0            0            0
E             1            0            0
F             1            0            0
G             1            0            0
H             1            0            0
Organization of threads in a CUDA program

Example 3 (blockIdx values): consider a grid with 4 thread blocks, as shown below.

P Q R S
T U V X

Thread Name   blockIdx.x   blockIdx.y   blockIdx.z
P             0            0            0
Q             0            0            0
R             1            0            0
S             1            0            0
T             0            1            0
U             0            1            0
V             1            1            0
X             1            1            0
Organization of threads in a CUDA program

Example 4 (blockIdx values): consider a grid with 4 thread blocks, with 8 threads per block, as shown below.

X P
Y Q
R T
S U

Thread Name   blockIdx.x   blockIdx.y   blockIdx.z
P             1            0            0
Q             1            0            0
R             0            1            0
S             0            1            0
T             1            1            0
U             1            1            0
X             0            0            0
Y             0            0            0
Organization of threads in a CUDA program

blockDim
• The blockDim variable holds the number of threads in each dimension of a thread block. Note that all thread blocks in a grid have the same block size, so this variable has the same value for every thread in the grid.
• blockDim is a dim3-type variable.

gridDim
• The gridDim variable holds the number of thread blocks in each dimension of a grid.
• gridDim is a dim3-type variable.

For the grid pictured here, blockDim.x is 4 and blockDim.y is 2, while gridDim.x is 3 and gridDim.y is 2.
Organization of threads in a CUDA program

The goal of this exercise is to rerun the same example as before, but print blockIdx, blockDim and gridDim and observe the output.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>

// Device code
__global__ void print_details() {
    printf("blockIdx x: %d y: %d z: %d \nblockDim x: %d y: %d z: %d \ngridDim x: %d y: %d z: %d \n",
           blockIdx.x, blockIdx.y, blockIdx.z,
           blockDim.x, blockDim.y, blockDim.z,
           gridDim.x, gridDim.y, gridDim.z);
}

// Host code
int main() {
    int nx = 16;
    int ny = 16;

    dim3 block(8, 8);
    dim3 grid(nx / block.x, ny / block.y);

    print_details<<<grid, block>>>(); // async call

    printf("Hello from CPU \n");
    cudaDeviceSynchronize();
    cudaDeviceReset();
    return 0;
}
Unique index calculation

• In a CUDA program it is very common to use the threadIdx, blockIdx and blockDim variables to calculate array indices. It is important to remember why we use CUDA in the first place: the loop we are parallelising has few or no dependencies between iterations.
• In the following section we are going to use these variables to access elements of an array transferred to the kernel.

__global__ void unique_idx_calc_threadIdx(int *input) {
    int tid = threadIdx.x;
    printf("threadIdx.x : %d, value : %d \n", tid, input[tid]);
}

// Host code
int main() {
    int array_size = 8;
    int array_byte_size = sizeof(int) * array_size;
    int h_data[] = {31, 34, 41, 44, 23, 32, 34, 23};

    for (int i = 0; i < array_size; i++) {
        printf("%d ", h_data[i]);
    }
    printf("\n \n");

    int *d_data;
    cudaMalloc((void **)&d_data, array_byte_size);
    cudaMemcpy(d_data, h_data, array_byte_size, cudaMemcpyHostToDevice);

    dim3 block(8);
    dim3 grid(1);
    unique_idx_calc_threadIdx<<<grid, block>>>(d_data);
    cudaDeviceSynchronize();
    cudaDeviceReset();
    return 0;
}
Unique index calculation

With more than one thread block, threadIdx.x alone is no longer unique: it repeats in every block. A globally unique index (gid) is obtained by adding a block offset to the thread index:

gid = tid + offset
gid = tid + blockIdx.x * blockDim.x

A B C D    E F G H

Thread Name   threadIdx.x   blockIdx.x   blockDim.x
A             0             0            4
B             1             0            4
C             2             0            4
D             3             0            4
E             0             1            4
F             1             1            4
G             2             1            4
H             3             1            4
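A minimal sketch of a kernel that uses this global index, assuming the same 8-element d_data array from the previous example and a launch of two blocks of four threads:

__global__ void unique_gid_calc(int *input) {
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;   // block offset + thread offset
    printf("blockIdx.x : %d, threadIdx.x : %d, gid : %d, value : %d \n",
           blockIdx.x, tid, gid, input[gid]);
}

// Launched, for example, as:
//   dim3 block(4);
//   dim3 grid(2);
//   unique_gid_calc<<<grid, block>>>(d_data);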
Unique index calculation

• The previous offset won't work for a 2-D grid: every row of blocks would return the same 8 elements from the x dimension.
• We now need a row offset along with the block offset and the thread offset.

num_of_threads_in_row   = gridDim.x * blockDim.x
num_of_threads_in_block = blockDim.x

row_offset   = num_of_threads_in_row * blockIdx.y
block_offset = num_of_threads_in_block * blockIdx.x

gid = row_offset + block_offset + threadIdx.x
gid = gridDim.x * blockDim.x * blockIdx.y + blockIdx.x * blockDim.x + threadIdx.x
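A sketch of that calculation in kernel form, assuming a 1-D array laid out row by row and a 2-D grid of 1-D thread blocks:

__global__ void unique_gid_calc_2d(int *input) {
    int tid = threadIdx.x;
    int block_offset = blockIdx.x * blockDim.x;              // blocks to the left in this row
    int row_offset   = blockIdx.y * gridDim.x * blockDim.x;  // full rows of blocks above
    int gid = row_offset + block_offset + tid;
    printf("gid : %d, value : %d \n", gid, input[gid]);
}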


Memory transfer between host and device

[Figure: the host (CPU with its caches and DRAM) and the device (SMs with their caches and DRAM) have separate memories; data must be transferred explicitly between the two.]


Memory transfer between host and device

Syntax:

cudaMemcpy(destination ptr, source ptr, size in bytes, direction)

Direction           Keyword
Host to Device      cudaMemcpyHostToDevice
Device to Host      cudaMemcpyDeviceToHost
Device to Device    cudaMemcpyDeviceToDevice
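The host-to-device and device-to-host directions were shown in earlier examples; for completeness, here is a small sketch of a device-to-device copy between two GPU buffers, with the buffer names and size chosen only for illustration:

int *d_src, *d_dst;
size_t bytes = 256 * sizeof(int);

cudaMalloc((void **)&d_src, bytes);
cudaMalloc((void **)&d_dst, bytes);

// Copy within device memory: no round trip through the host.
cudaMemcpy(d_dst, d_src, bytes, cudaMemcpyDeviceToDevice);

cudaFree(d_src);
cudaFree(d_dst);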


Memory transfer between host and device

C         CUDA         Description
malloc    cudaMalloc   malloc() allocates memory on the host; cudaMalloc() allocates memory on the device
memset    cudaMemset   memset() sets values at a given memory location on the host; cudaMemset() performs the same operation on the device
free      cudaFree     free() reclaims the specified memory location on the host; cudaFree() reclaims it on the device
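A small sketch putting the three device-side calls together; the buffer size is arbitrary and chosen only for illustration:

#include "cuda_runtime.h"

int main() {
    const size_t bytes = 64 * sizeof(int);

    int *d_buf;
    cudaMalloc((void **)&d_buf, bytes);   // device-side counterpart of malloc()
    cudaMemset(d_buf, 0, bytes);          // device-side counterpart of memset()

    // ... launch kernels that use d_buf ...

    cudaFree(d_buf);                      // device-side counterpart of free()
    cudaDeviceReset();
    return 0;
}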
Device Properties
There are more properties; refer to the CUDA documentation.

Property             Explanation
name                 ASCII string identifying the device
major / minor        Major and minor revision numbers defining the device's compute capability
maxThreadsPerBlock   Maximum number of threads per block
totalGlobalMem       Total amount of global memory available on the device, in bytes
maxThreadsDim        Maximum size of each dimension of a block
maxGridSize          Maximum size of each dimension of a grid
warpSize             Warp size for the device
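These fields live in the cudaDeviceProp structure and can be queried at runtime; a minimal sketch:

#include "cuda_runtime.h"
#include <stdio.h>

int main() {
    int device_count = 0;
    cudaGetDeviceCount(&device_count);

    for (int dev = 0; dev < device_count; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);   // fill the property structure for this device

        printf("Device %d: %s\n", dev, prop.name);
        printf("  Compute capability : %d.%d\n", prop.major, prop.minor);
        printf("  Max threads/block  : %d\n", prop.maxThreadsPerBlock);
        printf("  Global memory      : %zu bytes\n", prop.totalGlobalMem);
        printf("  Max block dims     : %d x %d x %d\n",
               prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
        printf("  Max grid dims      : %d x %d x %d\n",
               prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
        printf("  Warp size          : %d\n", prop.warpSize);
    }
    return 0;
}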


GPU Architecture
CPU Computing—the Great Tradition
CPU performance is the product of many related
advances:
• Increased transistor density
• Increased transistor performance
• Wider data paths
• Pipelining
• Superscalar execution
• Speculative execution
• Caching
• Chip and system-level integration
CPUs are great…
• They’re easy to program
• Software developers can ignore most of the
complexity in modern CPUs; microarchitecture is
almost invisible, and compiler magic hides the
rest.
• Multicore chips have the same software
architecture as older multiprocessor systems: a
simple coherent memory model and a sea of
identical computing engines.
CPUs are great, but…
• CPU cores continue to be
optimized for single-threaded
performance at the expense of
parallel execution.
• This fact is most apparent
when one considers that
integer and floating-point
execution units occupy only a
tiny fraction of the die area
in a modern CPU.

[CPU Intel, Source: paper]


CPUs are great, but…
• With such a small part of the chip devoted to performing
direct calculations, it’s no surprise that CPUs are
relatively inefficient for high-performance computing
applications.

• Most of the circuitry on a CPU, and therefore most of the


heat it generates, is devoted to invisible complexity:
those caches, instruction decoders, branch predictors,
and other features that are not architecturally visible
but which enhance single-threaded performance.
But I can try!
The idea was to increase throughput via architectural innovations:

• Speculation
• Processor technology
• Dual-issue processors
Intel’s Larrabee
• Larrabee is Intel’s code name for a future graphics
processing architecture based on the x86 architecture.
The first Larrabee chip is said to use dual-issue
cores derived from the original Pentium design, but
modified to include support for 64-bit x86 operations
and a new 512-bit vector-processing unit.
Intel’s Nehalem Architecture

Nehalem is the most sophisticated


microarchitecture in any x86 processor. Its
features are like a laundry list of high
performance CPU design: four-wide
superscalar, out of order, speculative
execution, simultaneous multithreading,
multiple branch predictors, on-die power
gating, on-die memory controllers, large
caches, and multiple interprocessor
interconnects.
What we had so
far…
Understanding the problem is half the solution
The Wall
The History of the GPU
• There were attempts to build
chip-scale parallel processors
in the 1990s, but the limited
transistor budgets in those days
favored more sophisticated
single-core designs.
• The real path toward GPU computing began, not with GPUs, but with non-programmable 3D graphics accelerators.

[GeForce7000, Source: images]


The History of the GPU
NVIDIA’s GeForce 3 in 2001
introduced programmable pixel
shading to the
consumer market. The
programmability of this chip was
very limited, but later
GeForce products became more
flexible and faster, adding
separate programmable engines for
vertex and geometry shading. This
evolution culminated in the
GeForce 7800 [GeForce7000, Source:
images]
The History of the GPU
Introducing Fermi
• GPU computing isn’t meant to
replace CPU computing. Each
approach has advantages for
certain kinds of software. As
explained earlier, CPUs are
optimized for applications
where most of the work is
being done by a limited number
of threads, especially where
the threads exhibit high data
locality, a mix of different
operations, and a high
percentage of conditional
branches.
Introducing Fermi
• GPU design aims at the other end of the spectrum: applications
with multiple threads that are dominated by longer sequences
of computational instructions. Over the last few years, GPUs
have become much better at thread handling, data caching,
virtual memory management, flow control, and other CPU-like
features, but the distinction between computationally
intensive software and control-flow intensive software is
fundamental.

• At this level of abstraction, the GPU looks like a sea of


computational units with only a few support elements—an
illustration of the key GPU design goal, which is to maximize
floating-point throughput.
Introducing Fermi
The Programming Model (CUDA)
• The complexity of the Fermi architecture is managed by a multi-level
programming model that allows software developers to focus on algorithm
design rather than the details of how to map the algorithm to the
hardware, thus improving productivity. This is a concern that
conventional CPUs have yet to address because their structures are
simple and regular: a small number of cores presented as logical peers
on a virtual bus.

• The computational elements of algorithms are known as kernels (in both OpenCL and CUDA).

• An application or library function may consist of one or more kernels.

• Once compiled, kernels consist of many threads that execute the same program in parallel: one thread is like one iteration of a loop.
The Programming Model (CUDA)
The Programming Model (CUDA)

• Multiple threads are grouped into thread blocks. All of the threads in a thread block will run on a single SM. (On Fermi a thread block can contain up to 1,024 threads, and up to 1,536 threads can be resident on an SM at a time.)

• Thread blocks can coordinate the use of global shared memory among themselves but may execute in any order, concurrently or sequentially.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
The Programming Model (CUDA)

• At any one time, the entire Fermi device is dedicated to a


single application. As mentioned above, an application may
include multiple kernels. Fermi supports simultaneous
execution of multiple kernels from the same application, each
kernel being distributed to one or more SMs on the device.
This capability avoids the situation where a kernel is only
able to use part of the device and the rest goes unused.

• This switching is managed by the chip-level GigaThread


hardware thread scheduler, which manages 1,536 simultaneously
active threads for each streaming multiprocessor across 16
kernels.
The Streaming Multiprocessor

• Fermi’s streaming
multiprocessors, shown in Figure
6, comprise 32 cores, each of
which can perform floating-point
and integer operations, along
with 16 load-store units for
memory operations, four special-
function units, and 64K of local
SRAM split between cache and
local memory.
The Streaming Multiprocessor
Conclusion…
Thank You!
