Introduction To CUDA C
NVIDIA Corporation
What is CUDA?
• CUDA Architecture
– Expose GPU parallelism for general-purpose computing
– Retain performance
• CUDA C/C++
– Based on industry-standard C/C++
– Small set of extensions to enable heterogeneous programming
– Straightforward APIs to manage devices, memory etc.
© NVIDIA 2013
Prerequisites
• You (probably) need experience with C or C++
CONCEPTS
Heterogeneous Computing
Blocks
Threads
Indexing
Shared memory
__syncthreads()
Asynchronous operation
Handling errors
Managing devices
HELLO WORLD!
Heterogeneous Computing
Terminology:
– Host: the CPU and its memory (host memory)
– Device: the GPU and its memory (device memory)
Heterogeneous Computing
#include <iostream>
#include <algorithm>
using namespace std;

#define N 1024
#define RADIUS 3
#define BLOCK_SIZE 16

// parallel fn
__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }
    // Synchronize (ensure all the data is available)
    __syncthreads();
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];
    out[gindex] = result;
}

// serial code
int main(void) {
    int *in, *out;      // host copies
    int *d_in, *d_out;  // device copies
    int size = (N + 2 * RADIUS) * sizeof(int);
    in  = (int *)malloc(size); fill_n(in,  N + 2 * RADIUS, 1);
    out = (int *)malloc(size); fill_n(out, N + 2 * RADIUS, 1);
    cudaMalloc((void **)&d_in,  size);
    cudaMalloc((void **)&d_out, size);
    // Copy to device
    cudaMemcpy(d_in,  in,  size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_out, out, size, cudaMemcpyHostToDevice);
    // Launch kernel: parallel code
    stencil_1d<<<N/BLOCK_SIZE, BLOCK_SIZE>>>(d_in + RADIUS, d_out + RADIUS);
    // Copy result back: serial code
    cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);
    // Cleanup
    free(in); free(out);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
Simple Processing Flow
1. Copy input data from CPU memory to GPU memory (across the PCI bus)
2. Load the GPU program and execute, caching data on chip for performance
3. Copy the results from GPU memory back to CPU memory
Hello World!
#include <stdio.h>

int main(void) {
    printf("Hello World!\n");
    return 0;
}

Output:
$ nvcc hello_world.cu
$ a.out
Hello World!
$

• Standard C that runs on the host
• The NVIDIA compiler (nvcc) can be used to compile programs with no device code
Hello World! with Device Code
__global__ void mykernel(void) {
}
int main(void) {
mykernel<<<1,1>>>();
printf("Hello World!\n");
return 0;
}
Hello World! with Device Code
__global__ void mykernel(void) {
}

• The CUDA C/C++ keyword __global__ indicates a function that:
– Runs on the device
– Is called from host code
• nvcc separates source code into host and device components
– Device functions (e.g. mykernel()) are processed by the NVIDIA compiler
– Host functions (e.g. main()) are processed by the standard host compiler (gcc, cl.exe)
Hello World! with Device Code
mykernel<<<1,1>>>();
• Triple angle brackets mark a call from host code to device code
– Also called a “kernel launch”
– We’ll return to the parameters (1,1) in a moment
Hello World! with Device Code
__global__ void mykernel(void) {
}

int main(void) {
    mykernel<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}

Output:
$ nvcc hello.cu
$ a.out
Hello World!
$
Parallel Programming in CUDA C/C++
• But wait… GPU computing is about massive parallelism!
• We need a more interesting example…
• We'll start by adding two integers and build up to vector addition: c = a + b
Addition on the Device
• A simple kernel to add two integers
__global__ void add(int *a, int *b, int *c) {
*c = *a + *b;
}
Addition on the Device
• Note that we use pointers for the variables

__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;
}

• add() runs on the device, so a, b and c must point to device memory
• We need to allocate memory on the GPU
Memory Management
• Host and device memory are separate entities
– Device pointers point to GPU memory
  May be passed to/from host code
  May not be dereferenced in host code
– Host pointers point to CPU memory
  May be passed to/from device code
  May not be dereferenced in device code
• Simple CUDA API for handling device memory
– cudaMalloc(), cudaFree(), cudaMemcpy()
– Similar to the C equivalents malloc(), free(), memcpy()
Addition on the Device: add()
• Returning to our add() kernel
__global__ void add(int *a, int *b, int *c) {
*c = *a + *b;
}
Addition on the Device: main()
int main(void) {
    int a, b, c;          // host copies of a, b, c
    int *d_a, *d_b, *d_c; // device copies of a, b, c
    int size = sizeof(int);

    // Allocate space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Setup input values
    a = 2;
    b = 7;

    // Copy inputs to device
    cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU
    add<<<1,1>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
RUNNING IN PARALLEL
Moving to Parallel
• GPU computing is about massive parallelism
– So how do we run code in parallel on the device?

add<<< 1, 1 >>>();   // executes add() once
add<<< N, 1 >>>();   // instead: execute add() N times in parallel
Vector Addition on the Device
• With add() running in parallel we can do vector addition
• Terminology: each parallel invocation of add() is referred to as a block
– The set of blocks is referred to as a grid
– Each invocation can refer to its block index using blockIdx.x
Vector Addition on the Device: add()
• Returning to our parallelized add() kernel
__global__ void add(int *a, int *b, int *c) {
c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

• By using blockIdx.x to index into the array, each block handles a different element
Vector Addition on the Device: main()
#define N 512
int main(void) {
    int *a, *b, *c;       // host copies of a, b, c
    int *d_a, *d_b, *d_c; // device copies of a, b, c
    int size = N * sizeof(int);

    // Alloc space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Alloc space for host copies of a, b, c and setup input values
    a = (int *)malloc(size); random_ints(a, N);
    b = (int *)malloc(size); random_ints(b, N);
    c = (int *)malloc(size);

    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU with N blocks
    add<<<N,1>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
Review (1 of 2)
• Difference between host and device
– Host: CPU
– Device: GPU
INTRODUCING THREADS
CUDA Threads
• Terminology: a block can be split into parallel threads
• Let's change add() to use parallel threads instead of parallel blocks:

__global__ void add(int *a, int *b, int *c) {
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}
Vector Addition Using Threads: main()
// (allocation and input setup are unchanged from the previous main())

// Copy inputs to device
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

// Launch add() kernel on GPU with N threads
add<<<1,N>>>(d_a, d_b, d_c);

// Copy result back to host
cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

// Cleanup
free(a); free(b); free(c);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}
COMBINING THREADS AND BLOCKS
Combining Blocks and Threads
• We’ve seen parallel vector addition using:
– Many blocks with one thread each
– One block with many threads
• Let’s adapt vector addition to use both blocks and threads
Indexing Arrays: Example
• Which thread will operate on the red element?

0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7
(array elements 0 … 31 in four blocks of M = 8 threads; the red element is index 21)

With threadIdx.x = 5 and blockIdx.x = 2:

int index = threadIdx.x + blockIdx.x * M;
          = 5 + 2 * 8;
          = 21;
Addition with Blocks and Threads: main()
// (the kernel now computes index = threadIdx.x + blockIdx.x * blockDim.x)

// Copy inputs to device
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

// Launch add() kernel on GPU
add<<<N/THREADS_PER_BLOCK,THREADS_PER_BLOCK>>>(d_a, d_b, d_c);

// Copy result back to host
cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

// Cleanup
free(a); free(b); free(c);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}
Handling Arbitrary Vector Sizes
• Typical problems are not friendly multiples of blockDim.x
• Avoid accessing beyond the end of the arrays:

__global__ void add(int *a, int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)
        c[index] = a[index] + b[index];
}

• Update the kernel launch:
add<<<(N + M - 1) / M, M>>>(d_a, d_b, d_c, N);
Why Bother with Threads?
• Threads seem unnecessary
– They add a level of complexity
– What do we gain?
• Unlike parallel blocks, threads have mechanisms to:
– Communicate
– Synchronize

COOPERATING THREADS
1D Stencil
• Consider applying a 1D stencil to a 1D array of elements
– Each output element is the sum of input elements within a radius
• If radius is 3, then each output element is the sum of 7 input elements
Implementing Within a Block
• Each thread processes one output element
– blockDim.x elements per block
• Input elements are read several times
– With radius 3, each input element is read seven times
Sharing Data Between Threads
• Terminology: within a block, threads share data via shared memory
• Extremely fast on-chip memory, user-managed
• Declare using __shared__, allocated per block
• Data is not visible to threads in other blocks
Implementing With Shared Memory
• Cache data in shared memory
– Read (blockDim.x + 2 * radius) input elements from global memory to shared memory
– Compute blockDim.x output elements
– Write blockDim.x output elements to global memory
– Each block needs a halo of radius elements at each boundary
Stencil Kernel
// Apply the stencil
int result = 0;
for (int offset = -RADIUS ; offset <= RADIUS ; offset++)
    result += temp[lindex + offset];
Data Race!
• The stencil example will not work…
• Suppose thread 15 reads the halo before thread 0 has fetched it…

temp[lindex] = in[gindex];                               // Thread 15 stores at temp[18]
if (threadIdx.x < RADIUS) {                              // Skipped by thread 15 (threadIdx.x > RADIUS)
    temp[lindex - RADIUS] = in[gindex - RADIUS];
    temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE]; // Thread 0 stores the halo at temp[19]
}
int result = 0;
result += temp[lindex + 1];                              // Thread 15 loads from temp[19]
__syncthreads()
• void __syncthreads();
• Synchronizes all threads within a block
– Used to prevent RAW / WAR / WAW hazards
• All threads must reach the barrier
– In conditional code, the condition must be uniform across the block
Stencil Kernel
__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Synchronize (ensure all the data is available)
    __syncthreads();

    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS ; offset <= RADIUS ; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}
Review (1 of 2)
• Launching parallel threads
– Launch N blocks with M threads per block with
kernel<<<N,M>>>(…);
– Use blockIdx.x to access block index within grid
– Use threadIdx.x to access thread index within block
Review (2 of 2)
• Use __shared__ to declare a variable/array in shared memory
– Data is shared between threads in a block
– Not visible to threads in other blocks
• Use __syncthreads() as a barrier
– Use to prevent data hazards
MANAGING THE DEVICE
Coordinating Host & Device
• Kernel launches are asynchronous
– Control returns to the CPU immediately
• The CPU needs to synchronize before consuming the results
– cudaMemcpy(): blocks the CPU until the copy is complete; the copy begins once all preceding CUDA calls have completed
– cudaMemcpyAsync(): asynchronous, does not block the CPU
– cudaDeviceSynchronize(): blocks the CPU until all preceding CUDA calls have completed
Reporting Errors
• All CUDA API calls return an error code (cudaError_t)
– Error in the API call itself, OR
– Error in an earlier asynchronous operation (e.g. a kernel)
• Get the error code for the last error:
cudaError_t cudaGetLastError(void)
• Get a string to describe the error:
char *cudaGetErrorString(cudaError_t)

printf("%s\n", cudaGetErrorString(cudaGetLastError()));
Device Management
• Application can query and select GPUs
cudaGetDeviceCount(int *count)
cudaSetDevice(int device)
cudaGetDevice(int *device)
cudaGetDeviceProperties(cudaDeviceProp *prop, int device)
Compute Capability
• The compute capability of a device describes its architecture, e.g.
– Number of registers
– Sizes of memories
– Features & capabilities
IDs and Dimensions
• A kernel is launched as a grid of blocks of threads
– blockIdx and threadIdx are 3D
– We showed only one dimension (x)
• Built-in variables:
– threadIdx
– blockIdx
– blockDim
– gridDim
[Figure: Grid 1 is a 3x2 array of blocks, Block (0,0,0) through Block (2,1,0); Block (1,1,0) is expanded into a 5x3 array of threads, Thread (0,0,0) through Thread (4,2,0)]
Textures
• Read-only object
– Dedicated cache
• Dedicated filtering hardware
– (Linear, bilinear, trilinear)
• Addressable as 1D, 2D or 3D
[Figure: a 2D texture grid with sample coordinates (2.5, 0.5) and (1.0, 1.0)]