GPU History and CUDA Programming Basics

CUDA Programming Basics
Outline of CUDA Basics
[Figure: system layout — CPU connected through the chipset and PCIe to the GPU; the GPU contains multiple SMs, each with its own shared memory (SMEM), plus off-chip global memory]
Blocks of threads run on an SM

Thread: executes on a streaming processor, with its own registers and per-thread local memory
Threadblock: executes on a streaming multiprocessor; its threads share per-block shared memory (SMEM)
Whole grid runs on GPU: blocks are distributed across the SMs, and every thread can access per-device global memory
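To make the per-block shared memory concrete, here is a minimal kernel sketch, not from the original slides, that stages data through SMEM; the name and the reverse-within-a-block behavior are illustrative assumptions:

__global__ void reverseInBlock(int* d_out, const int* d_in)
{
    // Per-block shared memory: visible only to threads in this block
    __shared__ int smem[256];   // assumes blockDim.x == 256

    int gid = threadIdx.x + blockDim.x * blockIdx.x;
    smem[threadIdx.x] = d_in[gid];    // stage through SMEM
    __syncthreads();                  // wait for the whole block

    // Write back in reversed order within the block
    d_out[gid] = smem[blockDim.x - 1 - threadIdx.x];
}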
Thread Hierarchy

[Figure: kernels launch one after another as sequential kernels — Kernel 0, Kernel 1, … — and all of them read and write the same per-device global memory]
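A sketch of that ordering (the kernel names and d_x are illustrative assumptions): launches issued to the default stream execute one after another, so a later kernel sees an earlier kernel's writes to global memory.

__global__ void scaleBy2(float* x) { x[threadIdx.x] *= 2.0f; }
__global__ void addOne(float* x)   { x[threadIdx.x] += 1.0f; }

// Kernel 0, then Kernel 1: on the default stream these run
// sequentially, communicating through per-device global memory.
scaleBy2<<<1, 256>>>(d_x);
addOne<<<1, 256>>>(d_x);   // observes the doubled values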
Memory Model

[Figure: host memory alongside Device 0 memory and Device 1 memory; data moves between host and devices with cudaMemcpy()]
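As a sketch of the figure's topology: moving data from Device 0 memory to Device 1 memory classically stages through host memory with two cudaMemcpy() calls. The buffer names here are assumptions:

float *h_buf;             // staging buffer in host memory
float *d0_buf, *d1_buf;   // buffers allocated on device 0 and device 1
size_t nbytes = 1024 * sizeof(float);

// ... allocate h_buf with malloc(), and each device buffer with
// cudaMalloc() after selecting its device with cudaSetDevice() ...

cudaSetDevice(0);
cudaMemcpy(h_buf, d0_buf, nbytes, cudaMemcpyDeviceToHost);  // device 0 -> host
cudaSetDevice(1);
cudaMemcpy(d1_buf, h_buf, nbytes, cudaMemcpyHostToDevice);  // host -> device 1

Later runtimes added cudaMemcpyPeer() for direct device-to-device transfers, but the staged copy above matches the figure.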
Example: Vector Addition Kernel
Device Code
// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    // Run grid of N/256 blocks of 256 threads each
    // (assumes N is a multiple of 256 and that d_A, d_B, d_C
    // point to device memory allocated elsewhere)
    vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
}
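The slide shows only the fragments above; a self-contained host-side sketch of the same example follows. The initialization values, the i < n guard, and the rounded-up block count are additions for illustration, not part of the original slide.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void vecAdd(const float* A, const float* B, float* C, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n)                    // guard handles n not divisible by 256
        C[i] = A[i] + B[i];
}

int main()
{
    int n = 1 << 20;
    size_t nbytes = n * sizeof(float);

    float *h_A = (float*)malloc(nbytes);
    float *h_B = (float*)malloc(nbytes);
    float *h_C = (float*)malloc(nbytes);
    for (int i = 0; i < n; i++) { h_A[i] = 1.0f; h_B[i] = 2.0f; }

    float *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, nbytes);
    cudaMalloc((void**)&d_B, nbytes);
    cudaMalloc((void**)&d_C, nbytes);

    cudaMemcpy(d_A, h_A, nbytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, nbytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;   // round up
    vecAdd<<<blocks, threads>>>(d_A, d_B, d_C, n);

    cudaMemcpy(h_C, d_C, nbytes, cudaMemcpyDeviceToHost);
    printf("C[0] = %f\n", h_C[0]);              // expect 3.0

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}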
Memory Allocation / Release

int n = 1024;
int nbytes = n*sizeof(int);
int *d_a = 0;
cudaMalloc( (void**)&d_a, nbytes );
cudaMemset( d_a, 0, nbytes );
cudaFree( d_a );
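Each of these runtime calls returns a cudaError_t; the slide omits error handling, but a common checking pattern (an addition, using only standard runtime calls) is:

cudaError_t err = cudaMalloc( (void**)&d_a, nbytes );
if (err != cudaSuccess)
{
    printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
    return 1;
}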
Data Copies
int main()
{
    int dimx = 16;
    int num_bytes = dimx*sizeof(int);

    int *d_a = 0, *h_a = 0;   // host/device pointer declarations (lost in extraction)

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );

    free( h_a );
    cudaFree( d_a );
    return 0;
}
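The walkthrough allocates on both sides, but the copy itself did not survive extraction. A minimal round trip using the same buffers might look like the following; the cudaMemset and the print loop are assumptions:

cudaMemset( d_a, 0, num_bytes );                            // zero the device array
cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );  // blocking copy back
for (int i = 0; i < dimx; i++)
    printf("%d ", h_a[i]);                                  // expect all zeros

cudaMemcpy blocks the calling CPU thread until all bytes have been copied, and does not start copying until previous CUDA calls complete.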
Example: Shuffling Data
int main()
{
    // Run grid of N/256 blocks of 256 threads each
    shuffle<<<N/256, 256>>>(d_old, d_new, d_ind);
}
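The shuffle kernel itself did not survive extraction. A sketch consistent with the launch above, gathering d_old through the index array d_ind, is:

__global__ void shuffle(int* d_old, int* d_new, int* d_ind)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    d_new[i] = d_old[d_ind[i]];   // gather: each thread moves one element
}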
IDs and Dimensions

Threads: 3D IDs, unique within a block
Blocks: 2D IDs, unique within a grid
Built-in variables: threadIdx, blockIdx, blockDim, gridDim
[Figure: device running Grid 1, a grid of blocks, each block a 3D arrangement of threads]

Kernel with 2D Indexing

__global__ void kernel(int* a, int dimx, int dimy)
{
    // Only the final statement survived extraction; the index math
    // below is a reconstruction (assumption) consistent with the
    // dimx*dimy allocation in main() below.
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    int idx = iy * dimx + ix;

    a[idx] = a[idx] + 1;
}
int main()
{
    int dimx = 16;
    int dimy = 16;
    int num_bytes = dimx*dimy*sizeof(int);

    int *d_a = 0, *h_a = 0;   // declarations lost in extraction

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );

    // The launch did not survive extraction; a 2D launch matching
    // the kernel above would be (assumption):
    dim3 block(16, 16);
    dim3 grid(dimx / block.x, dimy / block.y);
    kernel<<<grid, block>>>(d_a, dimx, dimy);

    free( h_a );
    cudaFree( d_a );
    return 0;
}
Blocks must be independent

Any possible interleaving of blocks should be valid:
presumed to run to completion without pre-emption
can run in any order
can run concurrently OR sequentially
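Blocks may still coordinate, as long as correctness never depends on a particular schedule. Below is a sketch, not from the slides, of blocks claiming work through a shared queue pointer with atomics; the names and the doubling workload are assumptions:

__device__ int next_chunk = 0;   // shared queue pointer in global memory

__global__ void workQueue(float* data, int n)
{
    __shared__ int chunk;
    while (true)
    {
        if (threadIdx.x == 0)
            chunk = atomicAdd(&next_chunk, 1);   // claim the next chunk
        __syncthreads();
        if (chunk * blockDim.x >= n)
            return;                              // queue drained
        int i = chunk * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= 2.0f;                     // process claimed chunk
        __syncthreads();                         // before thread 0 re-claims
    }
}

Because the queue pointer only moves forward, each chunk is processed exactly once under any interleaving, concurrent or sequential. (next_chunk must be reset, e.g. with cudaMemcpyToSymbol, before each launch.)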
The Graphics Pipeline

[Figure: fixed-function pipeline — Vertex → Rasterize → Pixel → Test & Blend → Framebuffer]

Number of modes for each stage grew over time
Hard to optimize HW
Vertex & pixel processing became programmable, new stages added
Expanded to full ISA
Why GPUs scale so nicely

Workload and programming model provide lots of parallelism
Applications provide large groups of vertices at once
Vertices can be processed in parallel
Apply same transform to all vertices
Triangles contain many pixels
Pixels from a triangle can be processed in parallel
Apply same shader to all pixels
Very efficient hardware to hide serialization bottlenecks
With Moore’s Law…

[Figure: pipeline stages replicated — multiple Vertex, Raster, and Blend units — evolving into a unified pool of programmable processors that handles vertex work (Vrtx 0–2) and pixel work (Pixel 0–3) on the same hardware, for more efficiency and control]