Tutorial CUDA
Cyril Zeller
NVIDIA Developer Technology
Enter the GPU
GPU = Graphics Processing Unit
Chip in computer video cards, PlayStation 3, Xbox, etc.
Two major vendors: NVIDIA and ATI (now AMD)
Enter the GPU
GPUs are massively multithreaded manycore chips
NVIDIA Tesla products have up to 128 scalar processors
Over 12,000 concurrent threads in flight
Over 470 GFLOPS sustained performance
Users across science & engineering disciplines are achieving 100x or better speedups on GPUs
CS researchers can use GPUs as a research platform for manycore computing: arch, PL, numeric, …
Enter CUDA
CUDA is a scalable parallel programming model and a software environment for parallel computing
Minimal extensions to familiar C/C++ environment
Heterogeneous serial-parallel programming model
NVIDIA’s TESLA GPU architecture accelerates CUDA
Expose the computational horsepower of NVIDIA GPUs
Enable general-purpose GPU computing
CUDA also maps well to multicore CPUs!
CUDA Programming Model
Heterogeneous Programming
[Figure: execution alternates between serial code on the host and parallel kernels on the device: Serial Code (Host) → Parallel Kernel KernelA (args) (Device) → Serial Code (Host) → Parallel Kernel KernelB (args) (Device)]
CUDA = serial program with parallel kernels, all in C
Serial C code executes in a host thread (i.e. CPU thread)
Parallel kernel C code executes in many device threads across multiple processing elements (i.e. GPU threads)
Kernel = Many Concurrent Threads
One kernel is executed at a time on the device
Many threads execute each kernel
Each thread executes the same code…
… on different data based on its threadID
[Figure: eight threads, threadID 0–7, each running the same kernel body:]
…
float x = input[threadID];
float y = func(x);
output[threadID] = y;
…
CUDA threads might be:
Physical threads, as on NVIDIA GPUs, where GPU thread creation and context switching are essentially free
Or virtual threads, e.g. one CPU core might execute multiple CUDA threads
Hierarchy of Concurrent Threads
Threads are grouped into thread blocks
Kernel = grid of thread blocks
[Figure: Thread Block 0, Thread Block 1, …, Thread Block N - 1, each an array of threads with threadID 0–7, all running:]
…
float x = input[threadID];
float y = func(x);
output[threadID] = y;
…
By definition, threads in the same block may synchronize with barriers:
scratch[threadID] = begin[threadID];
__syncthreads();
int left = scratch[threadID - 1];
Threads wait at the barrier until all threads in the same block reach the barrier
Transparent Scalability
Thread blocks cannot synchronize
So they can run in any order, concurrently or sequentially
This independence gives scalability:
A kernel scales across any number of parallel cores
[Figure: the same 8-block kernel grid (Block 0 – Block 7) runs as four successive waves of two blocks on a 2-core device, or two successive waves of four blocks on a 4-core device]
Implicit barrier between dependent kernels
vec_minus<<<nblocks, blksize>>>(a, b, c);
vec_dot<<<nblocks, blksize>>>(c, c);
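For illustration, a minimal sketch of what a kernel like vec_minus might look like (hypothetical body; the slide only shows the launches):

__global__ void vec_minus(float *a, float *b, float *c)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    c[idx] = a[idx] - b[idx];   // element-wise difference, one thread per element
}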
Heterogeneous Memory Model
[Figure: Host memory, Device 0 memory, and Device 1 memory are separate address spaces; data moves between them with cudaMemcpy()]
Kernel Memory Access
Per-thread: Registers and Local Memory
• Registers: on-chip
• Local memory: off-chip, uncached
Per-block: Shared Memory
• On-chip, small
• Fast
Per-device: Global Memory
• Off-chip, large
• Uncached
• Persistent across kernel launches
• Kernel I/O
[Figure: over time, successive kernels (Kernel 0, Kernel 1, …) all read and write the same Global Memory]
Physical Memory Layout
“Local” memory resides in device DRAM
Use registers and shared memory to minimize local memory use
Host can read and write global memory but not shared memory
[Figure: the Host (CPU, chipset, DRAM) connects to the Device (a GPU with multiprocessors, each holding registers and shared memory); device DRAM holds local memory and global memory]
10-Series Architecture
240 thread processors execute kernel threads
30 multiprocessors, each contains
8 thread processors
One double-precision unit
Shared memory enables thread cooperation
[Figure: a multiprocessor with 8 thread processors, shared memory, and a double-precision unit]
Execution Model
Software → Hardware
Thread → Thread Processor: threads are executed by thread processors
Thread Block → Multiprocessor: thread blocks are executed on multiprocessors
Thread blocks do not migrate
Several concurrent thread blocks can reside on one multiprocessor, limited by multiprocessor resources (shared memory and register file)
Grid → Device: a kernel is launched as a grid of thread blocks
Only one kernel can execute on a device at one time
CUDA Programming Basics
Part I - Software Stack and Memory Management
Compiler
Any source file containing language extensions, like “<<< >>>”, must be compiled with nvcc
nvcc is a compiler driver
Invokes all the necessary tools and compilers like cudacc, g++, cl, ...
nvcc can output either:
C code (CPU code) that must then be compiled with the rest of the application using another tool
PTX or object code directly
An executable requires linking to:
Runtime library (cudart)
Core library (cuda)
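For example, typical nvcc invocations might look like this (the file and executable names are hypothetical; the flags are standard nvcc options):

nvcc -o app app.cu                # compile host + device code and link
nvcc -ptx app.cu                  # emit PTX only
nvcc -arch=sm_13 -o app app.cu    # target compute capability 1.3 hardware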
Compiling
[Figure: CPU/GPU source goes into NVCC, which splits it into CPU source and PTX code (virtual layer); a PTX-to-target compiler then turns the PTX into target code for a physical GPU (G80, …)]
GPU Memory Allocation / Release
Host (CPU) manages device (GPU) memory:
cudaMalloc(void **pointer, size_t nbytes)
cudaMemset(void *pointer, int value, size_t count)
cudaFree(void *pointer)
int n = 1024;
int nbytes = n*sizeof(int);
int *a_d = 0;
cudaMalloc( (void**)&a_d, nbytes );
cudaMemset( a_d, 0, nbytes );
cudaFree(a_d);
Data Copies
cudaMemcpy(void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction);
direction specifies locations (host or device) of src and dst
Blocks CPU thread: returns after the copy is complete
Doesn’t start copying until previous CUDA calls complete
enum cudaMemcpyKind:
cudaMemcpyHostToDevice
cudaMemcpyDeviceToHost
cudaMemcpyDeviceToDevice
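Not shown on the slide, but cudaMemcpy() returns a cudaError_t, so a copy can be checked like this (illustrative snippet; dst, src, and nbytes are placeholders):

cudaError_t err = cudaMemcpy(dst, src, nbytes, cudaMemcpyHostToDevice);
if (err != cudaSuccess)
    printf("copy failed: %s\n", cudaGetErrorString(err));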
Data Movement Example
#include <assert.h>
#include <stdlib.h>

int main(void)
{
    float *a_h, *b_h; // host data
    float *a_d, *b_d; // device data
    int N = 14, nBytes, i;
    nBytes = N*sizeof(float);
    a_h = (float *)malloc(nBytes);
    b_h = (float *)malloc(nBytes);
    cudaMalloc((void **) &a_d, nBytes);
    cudaMalloc((void **) &b_d, nBytes);
    for (i=0; i<N; i++) a_h[i] = 100.f + i;
    cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice);
    cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);
    for (i=0; i<N; i++) assert( a_h[i] == b_h[i] );
    free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d);
    return 0;
}
[Figure: successive slides animate the flow: a_h and b_h allocated in host memory, a_d and b_d in device memory; data moves host → device, device → device, then device → host, and the buffers are freed]
CUDA Programming Basics
Part II - Kernels
Thread Hierarchy
Threads launched for a parallel section are partitioned into thread blocks
Grid = all blocks for a given launch
Thread block is a group of threads that can:
Synchronize their execution
Communicate via shared memory
Executing Code on the GPU
Kernels are C functions with some restrictions
Cannot access host memory
Must have void return type
No variable number of arguments (“varargs”)
Not recursive
No static variables
Function arguments automatically copied from host to device
Function Qualifiers
Kernels designated by function qualifier:
__global__
Function called from host and executed on device
Must return void
Other CUDA function qualifiers:
__device__
Function called from device and run on device
Cannot be called from host code
__host__
Function called from host and executed on host (default)
__host__ and __device__ qualifiers can be combined to generate both CPU and GPU code
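A small sketch (hypothetical function names) showing the three qualifiers together:

__device__ float square(float x)          // callable from device code only
{
    return x * x;
}

__host__ __device__ float twice(float x)  // compiled for both CPU and GPU
{
    return 2.0f * x;
}

__global__ void apply(float *a)           // kernel: called from host, runs on device
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = twice(square(a[idx]));
}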
Launching Kernels
Modified C function call syntax:
kernel<<<dim3 dG, dim3 dB>>>(…)
Execution Configuration (“<<< >>>”)
dG - dimension and size of grid in blocks
Two-dimensional: x and y
Blocks launched in the grid: dG.x*dG.y
dB - dimension and size of blocks in threads:
Three-dimensional: x, y, and z
Threads per block: dB.x*dB.y*dB.z
Unspecified dim3 fields initialize to 1
More on Thread and Block IDs
Threads and blocks have IDs, so each thread can decide what data to work on
Block ID: 1D or 2D
Thread ID: 1D, 2D, or 3D
Simplifies memory addressing when processing multidimensional data:
Image processing
Solving PDEs on volumes
[Figure: the host launches Kernel 1 on Grid 1, a 3×2 array of blocks, and Kernel 2 on Grid 2; Block (1, 1) is expanded to show its 5×3 array of threads, Thread (0, 0) through Thread (4, 2)]
Execution Configuration Examples
kernel<<<32, 512>>>(...);

dim3 grid, block;
grid.x = 2; grid.y = 4;
block.x = 8; block.y = 16;
kernel<<<grid, block>>>(...);

Equivalent assignment using constructor functions:
dim3 grid(2, 4), block(8, 16);
kernel<<<grid, block>>>(...);
CUDA Built-in Device Variables
All __global__ and __device__ functions have access to these automatically defined variables
dim3 gridDim;
Dimensions of the grid in blocks (at most 2D)
dim3 blockDim;
Dimensions of the block in threads
dim3 blockIdx;
Block index within the grid
dim3 threadIdx;
Thread index within the block
Unique Thread IDs
Built-in variables are used to determine unique thread IDs
Map from local thread ID (threadIdx) to a global ID which can be used as an array index
[Figure: a grid of three blocks (blockIdx.x = 0, 1, 2), each with five threads (threadIdx.x = 0–4, blockDim.x = 5); blockIdx.x*blockDim.x + threadIdx.x yields the global IDs 0 through 14]
Minimal Kernels
__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = 7;
}
Output: 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = blockIdx.x;
}
Output: 0 0 0 0 0 1 1 1 1 1 2 2 2 2 2

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = threadIdx.x;
}
Output: 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4
Increment Array Example
CPU program:
void inc_cpu(int *a, int N)
{
    int idx;
    for (idx = 0; idx < N; idx++)
        a[idx] = a[idx] + 1;
}

void main()
{
    …
    inc_cpu(a, N);
    …
}

CUDA program:
__global__ void inc_gpu(int *a_d, int N)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if (idx < N)
        a_d[idx] = a_d[idx] + 1;
}

void main()
{
    …
    dim3 dimBlock(blocksize);
    dim3 dimGrid(ceil(N/(float)blocksize));
    inc_gpu<<<dimGrid, dimBlock>>>(a_d, N);
    …
}
Host Synchronization
All kernel launches are asynchronous
control returns to CPU immediately
kernel executes after all previous CUDA calls have completed
cudaMemcpy() is synchronous
control returns to CPU after copy completes
copy starts after all previous CUDA calls have completed
cudaThreadSynchronize()
blocks until all previous CUDA calls complete
Host Synchronization Example
…
// copy data from host to device
cudaMemcpy(a_d, a_h, numBytes, cudaMemcpyHostToDevice);
// execute the kernel
inc_gpu<<<ceil(N/(float)blocksize), blocksize>>>(a_d, N);
// run independent CPU code
run_cpu_stuff();
// copy data from device back to host
cudaMemcpy(a_h, a_d, numBytes, cudaMemcpyDeviceToHost);
…
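The copy back already synchronizes here; when no copy follows a kernel, the host can wait explicitly. A minimal sketch building on the example above:

inc_gpu<<<ceil(N/(float)blocksize), blocksize>>>(a_d, N);
cudaThreadSynchronize(); // block the host until the kernel has finished
// safe to time the kernel or read results via another API from here on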
Variable Qualifiers (GPU code)
__device__
Stored in global memory (large, high latency, no cache)
Allocated with cudaMalloc (__device__ qualifier implied)
Accessible by all threads
Lifetime: application
__shared__
Stored in on-chip shared memory (very low latency)
Specified by execution configuration or at compile time
Accessible by all threads in the same thread block
Lifetime: thread block
Unqualified variables:
Scalars and built-in vector types are stored in registers
Arrays may be in registers or local memory
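An illustrative sketch (hypothetical names; assumes blocks of at most 256 threads) of these qualifiers in GPU code:

__device__ float g_scale;             // global memory; lifetime of the application

__global__ void scale(float *a)
{
    __shared__ float s_tile[256];     // on-chip; one copy per block; lifetime of the block
    int idx = blockIdx.x*blockDim.x + threadIdx.x;   // unqualified scalar: a register
    s_tile[threadIdx.x] = a[idx];
    a[idx] = g_scale * s_tile[threadIdx.x];
}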
GPU Thread Synchronization
void __syncthreads();
Synchronizes all threads in a block
Generates barrier synchronization instruction
No thread can pass this barrier until all threads in the block reach it
Used to avoid RAW / WAR / WAW hazards when accessing shared memory
Allowed in conditional code only if the conditional is uniform across the entire thread block
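A sketch of the classic use case: each block reverses its slice of an array through shared memory (hypothetical kernel; assumes blockDim.x <= 256 and an array length that is a multiple of the block size):

__global__ void reverse_block(int *d)
{
    __shared__ int s[256];
    int t = threadIdx.x;
    int idx = blockIdx.x*blockDim.x + t;
    s[t] = d[idx];                   // each thread writes one shared element
    __syncthreads();                 // all writes to s[] must land before any read
    d[idx] = s[blockDim.x - 1 - t];  // now safely read another thread's element
}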
GPU Atomic Integer Operations
Requires hardware with compute capability >= 1.1
G80 = Compute capability 1.0
G84/G86/G92 = Compute capability 1.1
GT200 = Compute capability 1.3
Atomic operations on integers in global memory:
Associative operations on signed/unsigned ints
add, sub, min, max, ...
and, or, xor
Increment, decrement
Exchange, compare and swap
Atomic operations on integers in shared memory
Requires compute capability >= 1.2
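For example, a simple histogram in which many threads may hit the same bin; atomicAdd() on global memory makes the increment race-free (hypothetical kernel; assumes 64 bins, non-negative input values, and compute capability >= 1.1):

__global__ void histogram(int *data, int *bins, int N)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if (idx < N)
        atomicAdd(&bins[data[idx] % 64], 1); // atomic read-modify-write in global memory
}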
Ad

More Related Content

What's hot (20)

Introduction to parallel computing using CUDA
Introduction to parallel computing using CUDAIntroduction to parallel computing using CUDA
Introduction to parallel computing using CUDA
Martin Peniak
 
Introduction to GPU Programming
Introduction to GPU ProgrammingIntroduction to GPU Programming
Introduction to GPU Programming
Chakkrit (Kla) Tantithamthavorn
 
Parallel Computing on the GPU
Parallel Computing on the GPUParallel Computing on the GPU
Parallel Computing on the GPU
Tilani Gunawardena PhD(UNIBAS), BSc(Pera), FHEA(UK), CEng, MIESL
 
Tips and experience of DX12 Engine development .
Tips and experience of DX12 Engine development .Tips and experience of DX12 Engine development .
Tips and experience of DX12 Engine development .
YEONG-CHEON YOU
 
What multimodal foundation models cannot perceive
What multimodal foundation models cannot perceiveWhat multimodal foundation models cannot perceive
What multimodal foundation models cannot perceive
University of Amsterdam
 
A Peek into Google's Edge TPU
A Peek into Google's Edge TPUA Peek into Google's Edge TPU
A Peek into Google's Edge TPU
Koan-Sin Tan
 
Nvidia (History, GPU Architecture and New Pascal Architecture)
Nvidia (History, GPU Architecture and New Pascal Architecture)Nvidia (History, GPU Architecture and New Pascal Architecture)
Nvidia (History, GPU Architecture and New Pascal Architecture)
Saksham Tanwar
 
Intro To Convolutional Neural Networks
Intro To Convolutional Neural NetworksIntro To Convolutional Neural Networks
Intro To Convolutional Neural Networks
Mark Scully
 
BlackHat USA 2011 - Stefan Esser - iOS Kernel Exploitation
BlackHat USA 2011 - Stefan Esser - iOS Kernel ExploitationBlackHat USA 2011 - Stefan Esser - iOS Kernel Exploitation
BlackHat USA 2011 - Stefan Esser - iOS Kernel Exploitation
Stefan Esser
 
97 Things Every SRE Should Know
97 Things Every SRE Should Know97 Things Every SRE Should Know
97 Things Every SRE Should Know
Kapil Mohan
 
Introduction to MariaDB
Introduction to MariaDBIntroduction to MariaDB
Introduction to MariaDB
JongJin Lee
 
Cuda Architecture
Cuda ArchitectureCuda Architecture
Cuda Architecture
Piyush Mittal
 
Gpu Systems
Gpu SystemsGpu Systems
Gpu Systems
jpaugh
 
Grid Computing Systems and Resource Management
Grid Computing Systems and Resource ManagementGrid Computing Systems and Resource Management
Grid Computing Systems and Resource Management
Souparnika Patil
 
[Outdated] Secrets of Performance Tuning Java on Kubernetes
[Outdated] Secrets of Performance Tuning Java on Kubernetes[Outdated] Secrets of Performance Tuning Java on Kubernetes
[Outdated] Secrets of Performance Tuning Java on Kubernetes
Bruno Borges
 
Near Real-Time Netflix Recommendations Using Apache Spark Streaming with Nit...
 Near Real-Time Netflix Recommendations Using Apache Spark Streaming with Nit... Near Real-Time Netflix Recommendations Using Apache Spark Streaming with Nit...
Near Real-Time Netflix Recommendations Using Apache Spark Streaming with Nit...
Databricks
 
Lec04 gpu architecture
Lec04 gpu architectureLec04 gpu architecture
Lec04 gpu architecture
Taras Zakharchenko
 
Unite 2013 optimizing unity games for mobile platforms
Unite 2013 optimizing unity games for mobile platformsUnite 2013 optimizing unity games for mobile platforms
Unite 2013 optimizing unity games for mobile platforms
ナム-Nam Nguyễn
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and ML
inside-BigData.com
 
West Coast DevCon 2014: Game Programming in UE4 - Game Framework & Sample Pro...
West Coast DevCon 2014: Game Programming in UE4 - Game Framework & Sample Pro...West Coast DevCon 2014: Game Programming in UE4 - Game Framework & Sample Pro...
West Coast DevCon 2014: Game Programming in UE4 - Game Framework & Sample Pro...
Gerke Max Preussner
 
Introduction to parallel computing using CUDA
Introduction to parallel computing using CUDAIntroduction to parallel computing using CUDA
Introduction to parallel computing using CUDA
Martin Peniak
 
Tips and experience of DX12 Engine development .
Tips and experience of DX12 Engine development .Tips and experience of DX12 Engine development .
Tips and experience of DX12 Engine development .
YEONG-CHEON YOU
 
What multimodal foundation models cannot perceive
What multimodal foundation models cannot perceiveWhat multimodal foundation models cannot perceive
What multimodal foundation models cannot perceive
University of Amsterdam
 
A Peek into Google's Edge TPU
A Peek into Google's Edge TPUA Peek into Google's Edge TPU
A Peek into Google's Edge TPU
Koan-Sin Tan
 
Nvidia (History, GPU Architecture and New Pascal Architecture)
Nvidia (History, GPU Architecture and New Pascal Architecture)Nvidia (History, GPU Architecture and New Pascal Architecture)
Nvidia (History, GPU Architecture and New Pascal Architecture)
Saksham Tanwar
 
Intro To Convolutional Neural Networks
Intro To Convolutional Neural NetworksIntro To Convolutional Neural Networks
Intro To Convolutional Neural Networks
Mark Scully
 
BlackHat USA 2011 - Stefan Esser - iOS Kernel Exploitation
BlackHat USA 2011 - Stefan Esser - iOS Kernel ExploitationBlackHat USA 2011 - Stefan Esser - iOS Kernel Exploitation
BlackHat USA 2011 - Stefan Esser - iOS Kernel Exploitation
Stefan Esser
 
97 Things Every SRE Should Know
97 Things Every SRE Should Know97 Things Every SRE Should Know
97 Things Every SRE Should Know
Kapil Mohan
 
Introduction to MariaDB
Introduction to MariaDBIntroduction to MariaDB
Introduction to MariaDB
JongJin Lee
 
Gpu Systems
Gpu SystemsGpu Systems
Gpu Systems
jpaugh
 
Grid Computing Systems and Resource Management
Grid Computing Systems and Resource ManagementGrid Computing Systems and Resource Management
Grid Computing Systems and Resource Management
Souparnika Patil
 
[Outdated] Secrets of Performance Tuning Java on Kubernetes
[Outdated] Secrets of Performance Tuning Java on Kubernetes[Outdated] Secrets of Performance Tuning Java on Kubernetes
[Outdated] Secrets of Performance Tuning Java on Kubernetes
Bruno Borges
 
Near Real-Time Netflix Recommendations Using Apache Spark Streaming with Nit...
 Near Real-Time Netflix Recommendations Using Apache Spark Streaming with Nit... Near Real-Time Netflix Recommendations Using Apache Spark Streaming with Nit...
Near Real-Time Netflix Recommendations Using Apache Spark Streaming with Nit...
Databricks
 
Unite 2013 optimizing unity games for mobile platforms
Unite 2013 optimizing unity games for mobile platformsUnite 2013 optimizing unity games for mobile platforms
Unite 2013 optimizing unity games for mobile platforms
ナム-Nam Nguyễn
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and ML
inside-BigData.com
 
West Coast DevCon 2014: Game Programming in UE4 - Game Framework & Sample Pro...
West Coast DevCon 2014: Game Programming in UE4 - Game Framework & Sample Pro...West Coast DevCon 2014: Game Programming in UE4 - Game Framework & Sample Pro...
West Coast DevCon 2014: Game Programming in UE4 - Game Framework & Sample Pro...
Gerke Max Preussner
 

Viewers also liked (20)

Hypergraph Mining For Social Networks
Hypergraph Mining For Social NetworksHypergraph Mining For Social Networks
Hypergraph Mining For Social Networks
Giacomo Bergami
 
HPP Week 1 Summary
HPP Week 1 SummaryHPP Week 1 Summary
HPP Week 1 Summary
Pipat Methavanitpong
 
IEEE ITSS Nagoya Chapter NVIDIA
IEEE ITSS Nagoya Chapter NVIDIAIEEE ITSS Nagoya Chapter NVIDIA
IEEE ITSS Nagoya Chapter NVIDIA
Tak Izaki
 
NVidia CUDA Tutorial - June 15, 2009
NVidia CUDA Tutorial - June 15, 2009NVidia CUDA Tutorial - June 15, 2009
NVidia CUDA Tutorial - June 15, 2009
Randall Hand
 
Baidu World 2016 With NVIDIA CEO Jen-Hsun Huang
Baidu World 2016 With NVIDIA CEO Jen-Hsun HuangBaidu World 2016 With NVIDIA CEO Jen-Hsun Huang
Baidu World 2016 With NVIDIA CEO Jen-Hsun Huang
NVIDIA
 
NVIDIA Deep Learning.
NVIDIA Deep Learning. NVIDIA Deep Learning.
NVIDIA Deep Learning.
Skolkovo Robotics Center
 
Nvidia SC16: The Greatest Challenges Can't Wait
Nvidia SC16: The Greatest Challenges Can't WaitNvidia SC16: The Greatest Challenges Can't Wait
Nvidia SC16: The Greatest Challenges Can't Wait
inside-BigData.com
 
A Platform for Accelerating Machine Learning Applications
 A Platform for Accelerating Machine Learning Applications A Platform for Accelerating Machine Learning Applications
A Platform for Accelerating Machine Learning Applications
NVIDIA Taiwan
 
Enabling Artificial Intelligence - Alison B. Lowndes
Enabling Artificial Intelligence - Alison B. LowndesEnabling Artificial Intelligence - Alison B. Lowndes
Enabling Artificial Intelligence - Alison B. Lowndes
WithTheBest
 
Evolution of Supermicro GPU Server Solution
Evolution of Supermicro GPU Server SolutionEvolution of Supermicro GPU Server Solution
Evolution of Supermicro GPU Server Solution
NVIDIA Taiwan
 
2016 06 nvidia-isc_supercomputing_car_v02
2016 06 nvidia-isc_supercomputing_car_v022016 06 nvidia-isc_supercomputing_car_v02
2016 06 nvidia-isc_supercomputing_car_v02
Carlo Nardone
 
Introduction to multi gpu deep learning with DIGITS 2 - Mike Wang
Introduction to multi gpu deep learning with DIGITS 2 - Mike WangIntroduction to multi gpu deep learning with DIGITS 2 - Mike Wang
Introduction to multi gpu deep learning with DIGITS 2 - Mike Wang
PAPIs.io
 
NVIDIA DGX-1 超級電腦與人工智慧及深度學習
NVIDIA DGX-1 超級電腦與人工智慧及深度學習NVIDIA DGX-1 超級電腦與人工智慧及深度學習
NVIDIA DGX-1 超級電腦與人工智慧及深度學習
NVIDIA Taiwan
 
Nvidia Deep Learning Solutions - Alex Sabatier
Nvidia Deep Learning Solutions - Alex SabatierNvidia Deep Learning Solutions - Alex Sabatier
Nvidia Deep Learning Solutions - Alex Sabatier
Sri Ambati
 
GTC China 2016
GTC China 2016GTC China 2016
GTC China 2016
NVIDIA
 
NVIDIA CES 2016 Highlights
NVIDIA CES 2016 HighlightsNVIDIA CES 2016 Highlights
NVIDIA CES 2016 Highlights
NVIDIA
 
NVIDIA DGX-1 Community-Based Benchmark
NVIDIA DGX-1 Community-Based BenchmarkNVIDIA DGX-1 Community-Based Benchmark
NVIDIA DGX-1 Community-Based Benchmark
Enrico Busto
 
NVIDIA CES 2016 Press Conference
NVIDIA CES 2016 Press ConferenceNVIDIA CES 2016 Press Conference
NVIDIA CES 2016 Press Conference
NVIDIA
 
JETSON : AI at the EDGE
JETSON : AI at the EDGEJETSON : AI at the EDGE
JETSON : AI at the EDGE
Skolkovo Robotics Center
 
A Year of Innovation Using the DGX-1 AI Supercomputer
A Year of Innovation Using the DGX-1 AI SupercomputerA Year of Innovation Using the DGX-1 AI Supercomputer
A Year of Innovation Using the DGX-1 AI Supercomputer
NVIDIA
 
Hypergraph Mining For Social Networks
Hypergraph Mining For Social NetworksHypergraph Mining For Social Networks
Hypergraph Mining For Social Networks
Giacomo Bergami
 
IEEE ITSS Nagoya Chapter NVIDIA
IEEE ITSS Nagoya Chapter NVIDIAIEEE ITSS Nagoya Chapter NVIDIA
IEEE ITSS Nagoya Chapter NVIDIA
Tak Izaki
 
NVidia CUDA Tutorial - June 15, 2009
NVidia CUDA Tutorial - June 15, 2009NVidia CUDA Tutorial - June 15, 2009
NVidia CUDA Tutorial - June 15, 2009
Randall Hand
 
Baidu World 2016 With NVIDIA CEO Jen-Hsun Huang
Baidu World 2016 With NVIDIA CEO Jen-Hsun HuangBaidu World 2016 With NVIDIA CEO Jen-Hsun Huang
Baidu World 2016 With NVIDIA CEO Jen-Hsun Huang
NVIDIA
 
Nvidia SC16: The Greatest Challenges Can't Wait
Nvidia SC16: The Greatest Challenges Can't WaitNvidia SC16: The Greatest Challenges Can't Wait
Nvidia SC16: The Greatest Challenges Can't Wait
inside-BigData.com
 
A Platform for Accelerating Machine Learning Applications
 A Platform for Accelerating Machine Learning Applications A Platform for Accelerating Machine Learning Applications
A Platform for Accelerating Machine Learning Applications
NVIDIA Taiwan
 
Enabling Artificial Intelligence - Alison B. Lowndes
Enabling Artificial Intelligence - Alison B. LowndesEnabling Artificial Intelligence - Alison B. Lowndes
Enabling Artificial Intelligence - Alison B. Lowndes
WithTheBest
 
Evolution of Supermicro GPU Server Solution
Evolution of Supermicro GPU Server SolutionEvolution of Supermicro GPU Server Solution
Evolution of Supermicro GPU Server Solution
NVIDIA Taiwan
 
2016 06 nvidia-isc_supercomputing_car_v02
2016 06 nvidia-isc_supercomputing_car_v022016 06 nvidia-isc_supercomputing_car_v02
2016 06 nvidia-isc_supercomputing_car_v02
Carlo Nardone
 
Introduction to multi gpu deep learning with DIGITS 2 - Mike Wang
Introduction to multi gpu deep learning with DIGITS 2 - Mike WangIntroduction to multi gpu deep learning with DIGITS 2 - Mike Wang
Introduction to multi gpu deep learning with DIGITS 2 - Mike Wang
PAPIs.io
 
NVIDIA DGX-1 超級電腦與人工智慧及深度學習
NVIDIA DGX-1 超級電腦與人工智慧及深度學習NVIDIA DGX-1 超級電腦與人工智慧及深度學習
NVIDIA DGX-1 超級電腦與人工智慧及深度學習
NVIDIA Taiwan
 
Nvidia Deep Learning Solutions - Alex Sabatier
Nvidia Deep Learning Solutions - Alex SabatierNvidia Deep Learning Solutions - Alex Sabatier
Nvidia Deep Learning Solutions - Alex Sabatier
Sri Ambati
 
GTC China 2016
GTC China 2016GTC China 2016
GTC China 2016
NVIDIA
 
NVIDIA CES 2016 Highlights
NVIDIA CES 2016 HighlightsNVIDIA CES 2016 Highlights
NVIDIA CES 2016 Highlights
NVIDIA
 
NVIDIA DGX-1 Community-Based Benchmark
NVIDIA DGX-1 Community-Based BenchmarkNVIDIA DGX-1 Community-Based Benchmark
NVIDIA DGX-1 Community-Based Benchmark
Enrico Busto
 
NVIDIA CES 2016 Press Conference
NVIDIA CES 2016 Press ConferenceNVIDIA CES 2016 Press Conference
NVIDIA CES 2016 Press Conference
NVIDIA
 
A Year of Innovation Using the DGX-1 AI Supercomputer
A Year of Innovation Using the DGX-1 AI SupercomputerA Year of Innovation Using the DGX-1 AI Supercomputer
A Year of Innovation Using the DGX-1 AI Supercomputer
NVIDIA
 
Ad

Similar to Cuda introduction (20)

introduction to CUDA_C.pptx it is widely used
introduction to CUDA_C.pptx it is widely usedintroduction to CUDA_C.pptx it is widely used
introduction to CUDA_C.pptx it is widely used
Himanshu577858
 
Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...
Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...
Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...
mouhouioui
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh
 
Tema3_Introduction_to_CUDA_C.pdf
Tema3_Introduction_to_CUDA_C.pdfTema3_Introduction_to_CUDA_C.pdf
Tema3_Introduction_to_CUDA_C.pdf
pepe464163
 
Intro2 Cuda Moayad
Intro2 Cuda MoayadIntro2 Cuda Moayad
Intro2 Cuda Moayad
Moayadhn
 
NVIDIA cuda programming, open source and AI
NVIDIA cuda programming, open source and AINVIDIA cuda programming, open source and AI
NVIDIA cuda programming, open source and AI
Tae wook kang
 
Cuda 2011
Cuda 2011Cuda 2011
Cuda 2011
coolmirza143
 
Introduction to CUDA C: NVIDIA : Notes
Introduction to CUDA C: NVIDIA : NotesIntroduction to CUDA C: NVIDIA : Notes
Introduction to CUDA C: NVIDIA : Notes
Subhajit Sahu
 
Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)
Rob Gillen
 
002 - Introduction to CUDA Programming_1.ppt
002 - Introduction to CUDA Programming_1.ppt002 - Introduction to CUDA Programming_1.ppt
002 - Introduction to CUDA Programming_1.ppt
ceyifo9332
 
An Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptxAn Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptx
AnirudhGarg35
 
Cuda intro
Cuda introCuda intro
Cuda intro
Anshul Sharma
 
Introduction to cuda geek camp singapore 2011
Introduction to cuda   geek camp singapore 2011Introduction to cuda   geek camp singapore 2011
Introduction to cuda geek camp singapore 2011
Raymond Tay
 
Kato Mivule: An Overview of CUDA for High Performance Computing
Kato Mivule: An Overview of CUDA for High Performance ComputingKato Mivule: An Overview of CUDA for High Performance Computing
Kato Mivule: An Overview of CUDA for High Performance Computing
Kato Mivule
 
Cuda materials
Cuda materialsCuda materials
Cuda materials
Thiruselvan Subramanian
 
Introduction to Accelerators
Introduction to AcceleratorsIntroduction to Accelerators
Introduction to Accelerators
Dilum Bandara
 
Lecture 04
Lecture 04Lecture 04
Lecture 04
douglaslyon
 
introduction to CUDA_C.pptx it is widely used
introduction to CUDA_C.pptx it is widely usedintroduction to CUDA_C.pptx it is widely used
introduction to CUDA_C.pptx it is widely used
Himanshu577858
 
Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...
Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...
Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...
mouhouioui
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh
 
Tema3_Introduction_to_CUDA_C.pdf
Tema3_Introduction_to_CUDA_C.pdfTema3_Introduction_to_CUDA_C.pdf
Tema3_Introduction_to_CUDA_C.pdf
pepe464163
 
Intro2 Cuda Moayad
Intro2 Cuda MoayadIntro2 Cuda Moayad
Intro2 Cuda Moayad
Moayadhn
 
NVIDIA cuda programming, open source and AI
NVIDIA cuda programming, open source and AINVIDIA cuda programming, open source and AI
NVIDIA cuda programming, open source and AI
Tae wook kang
 
Introduction to CUDA C: NVIDIA : Notes
Introduction to CUDA C: NVIDIA : NotesIntroduction to CUDA C: NVIDIA : Notes
Introduction to CUDA C: NVIDIA : Notes
Subhajit Sahu
 
Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)
Rob Gillen
 
002 - Introduction to CUDA Programming_1.ppt
002 - Introduction to CUDA Programming_1.ppt002 - Introduction to CUDA Programming_1.ppt
002 - Introduction to CUDA Programming_1.ppt
ceyifo9332
 
An Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptxAn Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptx
AnirudhGarg35
 
Introduction to cuda geek camp singapore 2011
Introduction to cuda   geek camp singapore 2011Introduction to cuda   geek camp singapore 2011
Introduction to cuda geek camp singapore 2011
Raymond Tay
 
Kato Mivule: An Overview of CUDA for High Performance Computing
Kato Mivule: An Overview of CUDA for High Performance ComputingKato Mivule: An Overview of CUDA for High Performance Computing
Kato Mivule: An Overview of CUDA for High Performance Computing
Kato Mivule
 
Introduction to Accelerators
Introduction to AcceleratorsIntroduction to Accelerators
Introduction to Accelerators
Dilum Bandara
 
Ad

Recently uploaded (20)

apa-style-referencing-visual-guide-2025.pdf
apa-style-referencing-visual-guide-2025.pdfapa-style-referencing-visual-guide-2025.pdf
apa-style-referencing-visual-guide-2025.pdf
Ishika Ghosh
 
CBSE - Grade 8 - Science - Chemistry - Metals and Non Metals - Worksheet
CBSE - Grade 8 - Science - Chemistry - Metals and Non Metals - WorksheetCBSE - Grade 8 - Science - Chemistry - Metals and Non Metals - Worksheet
CBSE - Grade 8 - Science - Chemistry - Metals and Non Metals - Worksheet
Sritoma Majumder
 
K12 Tableau Tuesday - Algebra Equity and Access in Atlanta Public Schools
K12 Tableau Tuesday  - Algebra Equity and Access in Atlanta Public SchoolsK12 Tableau Tuesday  - Algebra Equity and Access in Atlanta Public Schools
K12 Tableau Tuesday - Algebra Equity and Access in Atlanta Public Schools
dogden2
 
How to Set warnings for invoicing specific customers in odoo
How to Set warnings for invoicing specific customers in odooHow to Set warnings for invoicing specific customers in odoo
How to Set warnings for invoicing specific customers in odoo
Celine George
 
Unit 6_Introduction_Phishing_Password Cracking.pdf
Unit 6_Introduction_Phishing_Password Cracking.pdfUnit 6_Introduction_Phishing_Password Cracking.pdf
Unit 6_Introduction_Phishing_Password Cracking.pdf
KanchanPatil34
 
2541William_McCollough_DigitalDetox.docx
2541William_McCollough_DigitalDetox.docx2541William_McCollough_DigitalDetox.docx
2541William_McCollough_DigitalDetox.docx
contactwilliamm2546
 
Odoo Inventory Rules and Routes v17 - Odoo Slides
Odoo Inventory Rules and Routes v17 - Odoo SlidesOdoo Inventory Rules and Routes v17 - Odoo Slides
Odoo Inventory Rules and Routes v17 - Odoo Slides
Celine George
 
The ever evoilving world of science /7th class science curiosity /samyans aca...
The ever evoilving world of science /7th class science curiosity /samyans aca...The ever evoilving world of science /7th class science curiosity /samyans aca...
The ever evoilving world of science /7th class science curiosity /samyans aca...
Sandeep Swamy
 
How to Customize Your Financial Reports & Tax Reports With Odoo 17 Accounting
How to Customize Your Financial Reports & Tax Reports With Odoo 17 AccountingHow to Customize Your Financial Reports & Tax Reports With Odoo 17 Accounting
How to Customize Your Financial Reports & Tax Reports With Odoo 17 Accounting
Celine George
 
Political History of Pala dynasty Pala Rulers NEP.pptx
Political History of Pala dynasty Pala Rulers NEP.pptxPolitical History of Pala dynasty Pala Rulers NEP.pptx
Political History of Pala dynasty Pala Rulers NEP.pptx
Arya Mahila P. G. College, Banaras Hindu University, Varanasi, India.
 
Introduction to Vibe Coding and Vibe Engineering
Introduction to Vibe Coding and Vibe EngineeringIntroduction to Vibe Coding and Vibe Engineering
Introduction to Vibe Coding and Vibe Engineering
Damian T. Gordon
 
Understanding P–N Junction Semiconductors: A Beginner’s Guide
Understanding P–N Junction Semiconductors: A Beginner’s GuideUnderstanding P–N Junction Semiconductors: A Beginner’s Guide
Understanding P–N Junction Semiconductors: A Beginner’s Guide
GS Virdi
 
Michelle Rumley & Mairéad Mooney, Boole Library, University College Cork. Tra...
Michelle Rumley & Mairéad Mooney, Boole Library, University College Cork. Tra...Michelle Rumley & Mairéad Mooney, Boole Library, University College Cork. Tra...
Michelle Rumley & Mairéad Mooney, Boole Library, University College Cork. Tra...
Library Association of Ireland
 
How to track Cost and Revenue using Analytic Accounts in odoo Accounting, App...
How to track Cost and Revenue using Analytic Accounts in odoo Accounting, App...How to track Cost and Revenue using Analytic Accounts in odoo Accounting, App...
How to track Cost and Revenue using Analytic Accounts in odoo Accounting, App...
Celine George
 
To study the nervous system of insect.pptx
To study the nervous system of insect.pptxTo study the nervous system of insect.pptx
To study the nervous system of insect.pptx
Arshad Shaikh
 
UNIT 3 NATIONAL HEALTH PROGRAMMEE. SOCIAL AND PREVENTIVE PHARMACY
UNIT 3 NATIONAL HEALTH PROGRAMMEE. SOCIAL AND PREVENTIVE PHARMACYUNIT 3 NATIONAL HEALTH PROGRAMMEE. SOCIAL AND PREVENTIVE PHARMACY
UNIT 3 NATIONAL HEALTH PROGRAMMEE. SOCIAL AND PREVENTIVE PHARMACY
DR.PRISCILLA MARY J
 
Handling Multiple Choice Responses: Fortune Effiong.pptx
Handling Multiple Choice Responses: Fortune Effiong.pptxHandling Multiple Choice Responses: Fortune Effiong.pptx
Handling Multiple Choice Responses: Fortune Effiong.pptx
AuthorAIDNationalRes
 
SPRING FESTIVITIES - UK AND USA -
SPRING FESTIVITIES - UK AND USA            -SPRING FESTIVITIES - UK AND USA            -
SPRING FESTIVITIES - UK AND USA -
Colégio Santa Teresinha
 
SCI BIZ TECH QUIZ (OPEN) PRELIMS XTASY 2025.pptx
SCI BIZ TECH QUIZ (OPEN) PRELIMS XTASY 2025.pptxSCI BIZ TECH QUIZ (OPEN) PRELIMS XTASY 2025.pptx
SCI BIZ TECH QUIZ (OPEN) PRELIMS XTASY 2025.pptx
Ronisha Das
 
New Microsoft PowerPoint Presentation.pptx
New Microsoft PowerPoint Presentation.pptxNew Microsoft PowerPoint Presentation.pptx
New Microsoft PowerPoint Presentation.pptx
milanasargsyan5
 
apa-style-referencing-visual-guide-2025.pdf
apa-style-referencing-visual-guide-2025.pdfapa-style-referencing-visual-guide-2025.pdf
apa-style-referencing-visual-guide-2025.pdf
Ishika Ghosh
 
CBSE - Grade 8 - Science - Chemistry - Metals and Non Metals - Worksheet
CBSE - Grade 8 - Science - Chemistry - Metals and Non Metals - WorksheetCBSE - Grade 8 - Science - Chemistry - Metals and Non Metals - Worksheet
CBSE - Grade 8 - Science - Chemistry - Metals and Non Metals - Worksheet
Sritoma Majumder
 
K12 Tableau Tuesday - Algebra Equity and Access in Atlanta Public Schools
K12 Tableau Tuesday  - Algebra Equity and Access in Atlanta Public SchoolsK12 Tableau Tuesday  - Algebra Equity and Access in Atlanta Public Schools
K12 Tableau Tuesday - Algebra Equity and Access in Atlanta Public Schools
dogden2
 
How to Set warnings for invoicing specific customers in odoo
How to Set warnings for invoicing specific customers in odooHow to Set warnings for invoicing specific customers in odoo
How to Set warnings for invoicing specific customers in odoo
Celine George
 
Unit 6_Introduction_Phishing_Password Cracking.pdf
Unit 6_Introduction_Phishing_Password Cracking.pdfUnit 6_Introduction_Phishing_Password Cracking.pdf
Unit 6_Introduction_Phishing_Password Cracking.pdf
KanchanPatil34
 
2541William_McCollough_DigitalDetox.docx
2541William_McCollough_DigitalDetox.docx2541William_McCollough_DigitalDetox.docx
2541William_McCollough_DigitalDetox.docx
contactwilliamm2546
 
Odoo Inventory Rules and Routes v17 - Odoo Slides
Odoo Inventory Rules and Routes v17 - Odoo SlidesOdoo Inventory Rules and Routes v17 - Odoo Slides
Odoo Inventory Rules and Routes v17 - Odoo Slides
Celine George
 
The ever evoilving world of science /7th class science curiosity /samyans aca...
The ever evoilving world of science /7th class science curiosity /samyans aca...The ever evoilving world of science /7th class science curiosity /samyans aca...
The ever evoilving world of science /7th class science curiosity /samyans aca...
Sandeep Swamy
 
How to Customize Your Financial Reports & Tax Reports With Odoo 17 Accounting
How to Customize Your Financial Reports & Tax Reports With Odoo 17 AccountingHow to Customize Your Financial Reports & Tax Reports With Odoo 17 Accounting
How to Customize Your Financial Reports & Tax Reports With Odoo 17 Accounting
Celine George
 
Introduction to Vibe Coding and Vibe Engineering
Introduction to Vibe Coding and Vibe EngineeringIntroduction to Vibe Coding and Vibe Engineering
Introduction to Vibe Coding and Vibe Engineering
Damian T. Gordon
 
Understanding P–N Junction Semiconductors: A Beginner’s Guide
Understanding P–N Junction Semiconductors: A Beginner’s GuideUnderstanding P–N Junction Semiconductors: A Beginner’s Guide
Understanding P–N Junction Semiconductors: A Beginner’s Guide
GS Virdi
 
Michelle Rumley & Mairéad Mooney, Boole Library, University College Cork. Tra...
Michelle Rumley & Mairéad Mooney, Boole Library, University College Cork. Tra...Michelle Rumley & Mairéad Mooney, Boole Library, University College Cork. Tra...
Michelle Rumley & Mairéad Mooney, Boole Library, University College Cork. Tra...
Library Association of Ireland
 
How to track Cost and Revenue using Analytic Accounts in odoo Accounting, App...
How to track Cost and Revenue using Analytic Accounts in odoo Accounting, App...How to track Cost and Revenue using Analytic Accounts in odoo Accounting, App...
How to track Cost and Revenue using Analytic Accounts in odoo Accounting, App...
Celine George
 
To study the nervous system of insect.pptx
To study the nervous system of insect.pptxTo study the nervous system of insect.pptx
To study the nervous system of insect.pptx
Arshad Shaikh
 
UNIT 3 NATIONAL HEALTH PROGRAMMEE. SOCIAL AND PREVENTIVE PHARMACY
UNIT 3 NATIONAL HEALTH PROGRAMMEE. SOCIAL AND PREVENTIVE PHARMACYUNIT 3 NATIONAL HEALTH PROGRAMMEE. SOCIAL AND PREVENTIVE PHARMACY
UNIT 3 NATIONAL HEALTH PROGRAMMEE. SOCIAL AND PREVENTIVE PHARMACY
DR.PRISCILLA MARY J
 
Handling Multiple Choice Responses: Fortune Effiong.pptx
Handling Multiple Choice Responses: Fortune Effiong.pptxHandling Multiple Choice Responses: Fortune Effiong.pptx
Handling Multiple Choice Responses: Fortune Effiong.pptx
AuthorAIDNationalRes
 
SCI BIZ TECH QUIZ (OPEN) PRELIMS XTASY 2025.pptx
SCI BIZ TECH QUIZ (OPEN) PRELIMS XTASY 2025.pptxSCI BIZ TECH QUIZ (OPEN) PRELIMS XTASY 2025.pptx
SCI BIZ TECH QUIZ (OPEN) PRELIMS XTASY 2025.pptx
Ronisha Das
 
New Microsoft PowerPoint Presentation.pptx
New Microsoft PowerPoint Presentation.pptxNew Microsoft PowerPoint Presentation.pptx
New Microsoft PowerPoint Presentation.pptx
milanasargsyan5
 

Cuda introduction

  • 1. Cyril Zeller NVIDIA Developer Technology Tutorial CUDA
  • 2. © NVIDIA Corporation 2008 Enter the GPU GPU = Graphics Processing Unit Chip in computer video cards, PlayStation 3, Xbox, etc. Two major vendors: NVIDIA and ATI (now AMD)
  • 3. © NVIDIA Corporation 2008 Enter the GPU GPUs are massively multithreaded manycore chips NVIDIA Tesla products have up to 128 scalar processors Over 12,000 concurrent threads in flight Over 470 GFLOPS sustained performance Users across science & engineering disciplines are achieving 100x or better speedups on GPUs CS researchers can use GPUs as a research platform for manycore computing: arch, PL, numeric, …
  • 4. © NVIDIA Corporation 2008 Enter CUDA CUDA is a scalable parallel programming model and a software environment for parallel computing Minimal extensions to familiar C/C++ environment Heterogeneous serial-parallel programming model NVIDIA’s TESLA GPU architecture accelerates CUDA Expose the computational horsepower of NVIDIA GPUs Enable general-purpose GPU computing CUDA also maps well to multicore CPUs!
  • 5. © NVIDIA Corporation 2008 CUDA Programming Model
  • 6. © NVIDIA Corporation 2008 Parallel Kernel KernelA (args); Parallel Kernel KernelB (args); Serial Code . . . . . . Serial Code Device Device Host Host Heterogeneous Programming CUDA = serial program with parallel kernels, all in C Serial C code executes in a host thread (i.e. CPU thread) Parallel kernel C code executes in many device threads across multiple processing elements (i.e. GPU threads)
  • 7. © NVIDIA Corporation 2008 Kernel = Many Concurrent Threads One kernel is executed at a time on the device Many threads execute each kernel Each thread executes the same code… … on different data based on its threadID 0 1 2 3 4 5 6 7 … float x = input[threadID]; float y = func(x); output[threadID] = y; … threadID CUDA threads might be Physical threads As on NVIDIA GPUs GPU thread creation and context switching are essentially free Or virtual threads E.g. 1 CPU core might execute multiple CUDA threads
  • 8. © NVIDIA Corporation 2008 Hierarchy of Concurrent Threads Threads are grouped into thread blocks Kernel = grid of thread blocks … float x = input[threadID]; float y = func(x); output[threadID] = y; … threadID Thread Block 0 … … float x = input[threadID]; float y = func(x); output[threadID] = y; … Thread Block 1 … float x = input[threadID]; float y = func(x); output[threadID] = y; … Thread Block N - 1 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 By definition, threads in the same block may synchronize with barriers scratch[threadID] = begin[threadID]; __syncthreads(); int left = scratch[threadID - 1]; Threads wait at the barrier until all threads in the same block reach the barrier
  • 9. © NVIDIA Corporation 2008 Transparent Scalability Thread blocks cannot synchronize So they can run in any order, concurrently or sequentially This independence gives scalability: A kernel scales across any number of parallel cores 2-Core Device Block 0 Block 1 Block 2 Block 3 Block 4 Block 5 Block 6 Block 7 Kernel grid Block 0 Block 1 Block 2 Block 3 Block 4 Block 5 Block 6 Block 7 4-Core Device Block 0 Block 1 Block 2 Block 3 Block 4 Block 5 Block 6 Block 7 Implicit barrier between dependent kernels vec_minus<<<nblocks, blksize>>>(a, b, c); vec_dot<<<nblocks, blksize>>>(c, c);
  • 10. © NVIDIA Corporation 2008 Heterogeneous Memory Model Device 0 memory Device 1 memory Host memory cudaMemcpy()
  • 11. Per-thread Per-block Per-device © NVIDIA Corporation 2009 Kernel Memory Access Thread Registers Local Memory Shared Memory Block ...Kernel 0 ...Kernel 1 Global Memory Time On-chip Off-chip, uncached • On-chip, small • Fast • Off-chip, large • Uncached • Persistent across kernel launches • Kernel I/O
  • 12. Multiprocessor © NVIDIA Corporation 2009 Physical Memory Layout “Local” memory resides in device DRAM Use registers and shared memory to minimize local memory use Host can read and write global memory but not shared memory Host CPU ChipsetDRAM Device DRAM Local Memory Global Memory GPU Multiprocessor Multiprocessor Registers Shared Memory
  • 13. © NVIDIA Corporation 2009 10-Series Architecture 240 thread processors execute kernel threads 30 multiprocessors, each contains 8 thread processors One double-precision unit Shared memory enables thread cooperation Thread Processors Multiprocessor Shared Memory Double
  • 14. © NVIDIA Corporation 2009 Execution Model Software Hardware Threads are executed by thread processors Thread Thread Processor Thread Block Multiprocessor Thread blocks are executed on multiprocessors Thread blocks do not migrate Several concurrent thread blocks can reside on one multiprocessor - limited by multiprocessor resources (shared memory and register file) ... Grid Device A kernel is launched as a grid of thread blocks Only one kernel can execute on a device at one time
  • 15. CUDA Programming Basics Part I - Software Stack and Memory Management
  • 16. © NVIDIA Corporation 2009 Compiler Any source file containing language extensions, like “<<< >>>”, must be compiled with nvcc nvcc is a compiler driver Invokes all the necessary tools and compilers like cudacc, g++, cl, ... nvcc can output either: C code (CPU code) That must then be compiled with the rest of the application using another tool PTX or object code directly An executable requires linking to: Runtime library (cudart) Core library (cuda)
  • 17. © NVIDIA Corporation 2009 Compiling NVCC CPU/GPU Source PTX to Target Compiler G80 … GPU Target code PTX Code Virtual Physical CPU Source
  • 18. © NVIDIA Corporation 2009 GPU Memory Allocation / Release Host (CPU) manages device (GPU) memory cudaMalloc(void **pointer, size_t nbytes) cudaMemset(void *pointer, int value, size_t count) cudaFree(void *pointer) int n = 1024; int nbytes = 1024*sizeof(int); int *a_d = 0; cudaMalloc( (void**)&a_d, nbytes ); cudaMemset( a_d, 0, nbytes); cudaFree(a_d);
  • 19. © NVIDIA Corporation 2009 Data Copies cudaMemcpy(void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction); direction specifies locations (host or device) of src and dst Blocks CPU thread: returns after the copy is complete Doesn’t start copying until previous CUDA calls complete enum cudaMemcpyKind cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice
  • 20. int main(void) { float *a_h, *b_h; // host data float *a_d, *b_d; // device data int N = 14, nBytes, i ; nBytes = N*sizeof(float); a_h = (float *)malloc(nBytes); b_h = (float *)malloc(nBytes); cudaMalloc((void **) &a_d, nBytes); cudaMalloc((void **) &b_d, nBytes); for (i=0, i<N; i++) a_h[i] = 100.f + i; cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice); cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice); cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost); for (i=0; i< N; i++) assert( a_h[i] == b_h[i] ); free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d); return 0; } © NVIDIA Corporation 2009 Data Movement Example Host Device
  • 30. © NVIDIA Corporation 2009 Thread Hierarchy
Threads launched for a parallel section are partitioned into thread blocks
Grid = all blocks for a given launch
A thread block is a group of threads that can:
Synchronize their execution
Communicate via shared memory
  • 31. © NVIDIA Corporation 2009 Executing Code on the GPU
Kernels are C functions with some restrictions:
Cannot access host memory
Must have void return type
No variable number of arguments ("varargs")
Not recursive
No static variables
Function arguments are automatically copied from host to device
  • 32. © NVIDIA Corporation 2009 Function Qualifiers
Kernels are designated by the function qualifier __global__:
Called from host and executed on device
Must return void
Other CUDA function qualifiers:
__device__: called from device and run on device; cannot be called from host code
__host__: called from host and executed on host (the default)
__host__ and __device__ can be combined to generate both CPU and GPU code
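A minimal sketch of the three qualifiers together (the function names square, plus_one, and apply are illustrative, not from the slides):

// Compiled for both CPU and GPU: callable from either side
__host__ __device__ float square(float x) { return x * x; }

// Callable only from device code
__device__ float plus_one(float x) { return x + 1.0f; }

// A kernel: launched from the host, executed on the device
__global__ void apply(float *a)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    a[idx] = plus_one(square(a[idx]));
}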
  • 33. © NVIDIA Corporation 2009 Launching Kernels
Modified C function call syntax: kernel<<<dim3 dG, dim3 dB>>>(…)
Execution configuration ("<<< >>>"):
dG - dimension and size of the grid in blocks
Two-dimensional: x and y
Blocks launched in the grid: dG.x * dG.y
dB - dimension and size of each block in threads
Three-dimensional: x, y, and z
Threads per block: dB.x * dB.y * dB.z
Unspecified dim3 fields initialize to 1
  • 34. © NVIDIA Corporation 2008 More on Thread and Block IDs
Threads and blocks have IDs, so each thread can decide what data to work on
Block ID: 1D or 2D
Thread ID: 1D, 2D, or 3D
Simplifies memory addressing when processing multidimensional data:
Image processing
Solving PDEs on volumes
[Figure: the host launches Kernel 1 on a 3x2 grid of blocks and Kernel 2 on a second grid; each block, e.g. Block (1, 1), is a 5x3 array of threads]
  • 35. © NVIDIA Corporation 2009 Execution Configuration Examples

kernel<<<32, 512>>>(...);

dim3 grid, block;
grid.x = 2; grid.y = 4;
block.x = 8; block.y = 16;
kernel<<<grid, block>>>(...);

Equivalent assignment using constructor functions:
dim3 grid(2, 4), block(8, 16);
kernel<<<grid, block>>>(...);
  • 36. © NVIDIA Corporation 2009 CUDA Built-in Device Variables
All __global__ and __device__ functions have access to these automatically defined variables:
dim3 gridDim; - dimensions of the grid in blocks (at most 2D)
dim3 blockDim; - dimensions of the block in threads
dim3 blockIdx; - block index within the grid
dim3 threadIdx; - thread index within the block
  • 37. © NVIDIA Corporation 2009 Unique Thread IDs
Built-in variables are used to determine unique thread IDs
Map from the local thread ID (threadIdx) to a global ID that can be used as an array index:
blockIdx.x * blockDim.x + threadIdx.x
[Figure: a grid of three blocks with blockDim.x = 5; threadIdx.x runs 0-4 within each block, and blockIdx.x * blockDim.x + threadIdx.x yields the global IDs 0-14]
  • 38. © NVIDIA Corporation 2009 Minimal Kernels

__global__ void kernel(int *a)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    a[idx] = 7;
}
Output: 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7

__global__ void kernel(int *a)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    a[idx] = blockIdx.x;
}
Output: 0 0 0 0 0 1 1 1 1 1 2 2 2 2 2

__global__ void kernel(int *a)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    a[idx] = threadIdx.x;
}
Output: 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4
  • 39. © NVIDIA Corporation 2009 Increment Array Example

CPU program:
void inc_cpu(int *a, int N)
{
    int idx;
    for (idx = 0; idx < N; idx++)
        a[idx] = a[idx] + 1;
}

int main()
{
    …
    inc_cpu(a, N);
    …
}

CUDA program:
__global__ void inc_gpu(int *a_d, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a_d[idx] = a_d[idx] + 1;
}

int main()
{
    …
    dim3 dimBlock(blocksize);
    dim3 dimGrid(ceil(N / (float)blocksize));
    inc_gpu<<<dimGrid, dimBlock>>>(a_d, N);
    …
}
  • 40. © NVIDIA Corporation 2009 Host Synchronization
All kernel launches are asynchronous:
Control returns to the CPU immediately
The kernel executes after all previous CUDA calls have completed
cudaMemcpy() is synchronous:
Control returns to the CPU after the copy completes
The copy starts after all previous CUDA calls have completed
cudaThreadSynchronize():
Blocks until all previous CUDA calls complete
  • 41. © NVIDIA Corporation 2009 Host Synchronization Example
…
// copy data from host to device
cudaMemcpy(a_d, a_h, numBytes, cudaMemcpyHostToDevice);

// execute the kernel
inc_gpu<<<ceil(N/(float)blocksize), blocksize>>>(a_d, N);

// run independent CPU code
run_cpu_stuff();

// copy data from device back to host
cudaMemcpy(a_h, a_d, numBytes, cudaMemcpyDeviceToHost);
…
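The example above overlaps CPU work with the kernel because the launch returns immediately; the final cudaMemcpy then waits for the kernel to finish before copying. When the host must wait for the device without issuing a copy, e.g. to stop a host-side timer, cudaThreadSynchronize() provides the barrier. A minimal sketch, reusing the names from the example above:

inc_gpu<<<dimGrid, dimBlock>>>(a_d, N); // returns immediately
cudaThreadSynchronize();                // block until the kernel has completed
// safe to read a host-side timer here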
  • 42. © NVIDIA Corporation 2009 Variable Qualifiers (GPU code)
__device__:
Stored in global memory (large, high latency, no cache)
Allocated with cudaMalloc (__device__ qualifier implied)
Accessible by all threads
Lifetime: application
__shared__:
Stored in on-chip shared memory (very low latency)
Size specified by the execution configuration or at compile time
Accessible by all threads in the same thread block
Lifetime: thread block
Unqualified variables:
Scalars and built-in vector types are stored in registers
Arrays may be in registers or local memory
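A minimal sketch of the qualifiers side by side (the names g_flag, tile, and copy_tile are illustrative, not from the slides):

__device__ int g_flag;                // global memory; lifetime = application

__global__ void copy_tile(float *out, const float *in)
{
    __shared__ float tile[256];       // on-chip; one copy per thread block
                                      // (assumes blockDim.x <= 256)
    float x = in[blockIdx.x * blockDim.x + threadIdx.x]; // unqualified scalar: a register
    tile[threadIdx.x] = x;
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = tile[threadIdx.x];
}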
  • 43. © NVIDIA Corporation 2009 GPU Thread Synchronization
void __syncthreads();
Synchronizes all threads in a block:
Generates a barrier synchronization instruction
No thread can pass this barrier until all threads in the block reach it
Used to avoid RAW / WAR / WAW hazards when accessing shared memory
Allowed in conditional code only if the conditional is uniform across the entire thread block
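A small sketch of __syncthreads() guarding a RAW hazard on shared memory: each thread reads an element written by a different thread, so all writes must complete before any read (the kernel name reverse_block is illustrative; a launch of <<<1, 256>>> is assumed):

__global__ void reverse_block(int *d_a)
{
    __shared__ int scratch[256];
    int t = threadIdx.x;
    scratch[t] = d_a[t];                    // each thread writes one element
    __syncthreads();                        // all writes visible before any read
    d_a[t] = scratch[blockDim.x - 1 - t];   // read an element written by another thread
}

Without the barrier, a thread could read scratch[blockDim.x - 1 - t] before the owning thread has written it.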
  • 44. © NVIDIA Corporation 2009 GPU Atomic Integer Operations
Requires hardware with compute capability >= 1.1:
G80 = compute capability 1.0
G84/G86/G92 = compute capability 1.1
GT200 = compute capability 1.3
Atomic operations on integers in global memory:
Associative operations on signed/unsigned ints: add, sub, min, max, and, or, xor
Increment, decrement
Exchange, compare and swap
Atomic operations on integers in shared memory require compute capability >= 1.2
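A brief sketch of an atomic update, here atomicAdd() building a histogram in global memory, where many threads may increment the same bin concurrently (the kernel and buffer names are illustrative, not from the slides):

__global__ void histogram(const int *d_data, int *d_bins, int n, int nbins)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)                                    // assumes d_data values are non-negative
        atomicAdd(&d_bins[d_data[idx] % nbins], 1); // one indivisible read-modify-write
}

A plain d_bins[...]++ would lose counts whenever two threads hit the same bin at once; the atomic serializes the conflicting updates instead.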