Intro To CUDA
CUDA C/C++
Tom Papatheodore
HPC User Support Specialist/Programmer
• ssh [email protected]
• ssh [email protected]
• cd $MEMBERWORK/trn001
• qstat -u username
• Check output file vec_add.o${JOB_ID}
– If you see __SUCCESS__, you have successfully run on the GPU
– If not, try again and/or ask for help
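The exercise assumes a PBS batch system (qstat above). A minimal batch script might look like the sketch below; the project name, walltime, and launcher line are assumptions for illustration, not the exercise's exact script:

#!/bin/bash
#PBS -A TRN001                # assumed project allocation (matches $MEMBERWORK/trn001)
#PBS -N vec_add               # job name, so output lands in vec_add.o${JOB_ID}
#PBS -l walltime=00:10:00     # assumed walltime
#PBS -l nodes=1               # one Titan/Chester node
cd $MEMBERWORK/trn001
aprun -n 1 ./vec_add          # aprun is the Cray job launcher on Titan/Chester (assumed)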
[Diagram: a Titan/Chester node – CPU (host) with several compute cores and 32 GB of memory; GPU (device) with 6 GB of memory]
• Heterogeneous Programming
– the program is separated into serial regions (run on the CPU) and parallel regions (run on the GPU)
[Diagram: element-wise vector addition – each element of A is added to the corresponding element of B to produce C, i.e. C[i] = A[i] + B[i]]
Vector Addition: walking through main() step by step

int main(){
    . . .
    // Allocate memory for arrays A, B, and C on the host (CPU)
    int *A = (int*)malloc(bytes);
    int *B = (int*)malloc(bytes);
    int *C = (int*)malloc(bytes);
    . . .
    // Allocate memory for arrays d_A, d_B, and d_C on the device (GPU)
    cudaMalloc(&d_A, bytes);
    cudaMalloc(&d_B, bytes);
    cudaMalloc(&d_C, bytes);
    . . .
    // Initialize the host arrays (inside a loop over i)
    A[i] = 1;
    B[i] = 2;
    C[i] = 0;
    . . .
    // Copy data from host to device with
    // cudaMemcpy( void *dst, const void *src, size_t count, cudaMemcpyKind kind )
    . . .
    // Launch the kernel on the device
    . . .
    // Copy the result array back from device to host
    . . .
    // Verify the result on the host: every element should be 1 + 2 = 3
    if(C[i] != 3)
    . . .
    // Free host memory
    free(A);
    free(B);
    free(C);
    . . .
    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}
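Putting the pieces together, a complete, runnable version of the program might look like the following sketch. The array length N and the launch parameters are assumptions chosen for illustration; the kernel itself is explained on the next slides.

#include <stdio.h>
#include <stdlib.h>

#define N 1048576   // assumed array length

// CUDA kernel: each thread adds one pair of elements (explained below)
__global__ void add_vectors(int *a, int *b, int *c){
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if(i < N) c[i] = a[i] + b[i];
}

int main(){
    size_t bytes = N * sizeof(int);

    // Allocate host memory
    int *A = (int*)malloc(bytes);
    int *B = (int*)malloc(bytes);
    int *C = (int*)malloc(bytes);

    // Allocate device memory
    int *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytes);
    cudaMalloc(&d_B, bytes);
    cudaMalloc(&d_C, bytes);

    // Initialize host arrays
    for(int i = 0; i < N; i++){
        A[i] = 1;
        B[i] = 2;
        C[i] = 0;
    }

    // Copy input data from host to device
    cudaMemcpy(d_A, A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, bytes, cudaMemcpyHostToDevice);

    // Launch the kernel with enough blocks to cover all N elements
    int thr_per_blk = 256;
    int blk_in_grid = (N + thr_per_blk - 1) / thr_per_blk;
    add_vectors<<< blk_in_grid, thr_per_blk >>>(d_A, d_B, d_C);

    // Copy the result back from device to host
    cudaMemcpy(C, d_C, bytes, cudaMemcpyDeviceToHost);

    // Verify the result
    for(int i = 0; i < N; i++){
        if(C[i] != 3){ printf("Error at element %d\n", i); return 1; }
    }
    printf("__SUCCESS__\n");

    // Free host and device memory
    free(A); free(B); free(C);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);

    return 0;
}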
[Diagram: in the vector addition program, the serial regions run on the CPU and the parallel region (the kernel) runs on the GPU]
__global__
Indicates the function is a CUDA kernel function – called by the host and executed on the device.

void
CUDA kernels must have a void return type.
[Diagram: the kernel is launched as a grid of blocks, where each block contains threads – here, 4 threads per block]
int i = blockDim.x * blockIdx.x + threadIdx.x;
blockDim
Gives the number of threads within each block (in the x-dimension in 1D case)
• E.g., 4 threads per block
blockIdx
Specifies which block the thread belongs to (within the grid of blocks)
• E.g., blocks 0-3 in a grid of 4 blocks

threadIdx
Specifies the thread's index within its own block
• E.g., threads 0-3 within each block of 4 threads

{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N) c[i] = a[i] + b[i];
}

[Diagram: with blockDim.x = 4, blockIdx.x = 0-3, and threadIdx.x = 0-3, each thread computes a unique global index i = 0-15]
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < N)
The number of threads in the grid might be larger than the number of elements in the array, so each thread checks that its index i is within bounds before accessing the arrays.
In general, a kernel launch specifies the number of blocks in the grid and the number of threads per block:
kernel<<< blk_in_grid, thr_per_blk >>>(arg1, arg2, …);
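For example, using the vector-addition names from earlier (the block size of 256 is an assumption for illustration), the block count can be computed with an integer ceiling:

int thr_per_blk = 256;                                   // threads per block (assumed)
int blk_in_grid = (N + thr_per_blk - 1) / thr_per_blk;   // ceiling of N / thr_per_blk
add_vectors<<< blk_in_grid, thr_per_blk >>>(d_A, d_B, d_C);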
• Change thr_per_blk back to a value <= 1024 and change the size of d_A
– e.g., cudaMalloc(&d_A, 8e9*bytes);
– This requests far more than the GPU's 6 GB of memory (the host has 32 GB), so the allocation fails – which motivates error checking.
Introduction to CUDA C/C++
CUDA Error Checking
API calls
Most CUDA API calls return a cudaError_t that can be captured and checked:
cudaError_t err = cudaMalloc(&d_A, 8e9*bytes);
// After a kernel launch, control returns to the host, so errors can occur at seemingly
// random points later in the code. Calling cudaDeviceSynchronize() catches these
// errors and allows you to check for them.
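A sketch of both checks, inserted into the vector-addition program above (cudaGetErrorString, cudaGetLastError, and cudaDeviceSynchronize are standard CUDA runtime calls; the variable names come from the earlier example):

// Check an API call
cudaError_t err = cudaMalloc(&d_A, bytes);
if(err != cudaSuccess){
    printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
}

// Check a kernel launch
add_vectors<<< blk_in_grid, thr_per_blk >>>(d_A, d_B, d_C);
err = cudaGetLastError();          // catches launch-configuration errors
if(err != cudaSuccess){
    printf("kernel launch failed: %s\n", cudaGetErrorString(err));
}
err = cudaDeviceSynchronize();     // catches errors raised during kernel execution
if(err != cudaSuccess){
    printf("kernel execution failed: %s\n", cudaGetErrorString(err));
}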
In general
dim3 is a C struct with member variables x, y, z.

dim3 threads_per_block( threads per block in x-dim,
                        threads per block in y-dim,
                        threads per block in z-dim );
Assume 4x4 blocks of threads…
[Diagram: an M x N matrix (elements A(row,col)) covered by 4x4 thread blocks]
dim3 threads_per_block( 4, 4, 1 );
dim3 blocks_in_grid( ceil( float(N) / threads_per_block.x ),
ceil( float(M) / threads_per_block.y ) , 1 );
mat_add<<< blocks_in_grid, threads_per_block >>>(d_a, d_b, d_c);
[Diagram: flattening the 2D array to 1D memory, row by row – A(0,0)…A(0,4), then A(1,0)…A(1,4), then A(2,0)…A(2,4), then A(3,0)…A(3,4) fill consecutive 1D indices 0-19, so with N columns element A(row,col) sits at 1D index row*N + col]
Assume 4x4 blocks of threads…
[Diagram: a matrix with M = 7 rows and N = 10 columns – the 3x2 grid of 4x4 blocks spans 12 columns and 8 rows of threads, so threads past the matrix edges must be masked off with an index check]
• NOTE: you cannot exceed 1024 threads per block (in total)
– threads_per_block( 16, 16, 1 ); → 256 threads ✔
– threads_per_block( 32, 32, 1 ); → 1024 threads ✔
– threads_per_block( 64, 64, 1 ); → 4096 threads ✘
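A 2D kernel tying these pieces together might look like the following sketch. The name mat_add matches the launch shown above; passing M and N as arguments is an assumption for illustration (the slides' version may define them elsewhere):

__global__ void mat_add(int *a, int *b, int *c, int M, int N){
    // Map this thread to one matrix element
    int col = blockDim.x * blockIdx.x + threadIdx.x;
    int row = blockDim.y * blockIdx.y + threadIdx.y;

    // Threads that fall outside the M x N matrix do nothing
    if(row < M && col < N){
        int i = row * N + col;   // row-major 1D index
        c[i] = a[i] + b[i];
    }
}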
[Diagram: matrix-vector multiplication, built up one row at a time –
row 0: A(0,0)*x0 + A(0,1)*x1 + A(0,2)*x2 + A(0,3)*x3 + A(0,4)*x4
row 1: A(1,0)*x0 + A(1,1)*x1 + A(1,2)*x2 + A(1,3)*x3 + A(1,4)*x4
…
row 3: A(3,0)*x0 + A(3,1)*x1 + A(3,2)*x2 + A(3,3)*x3 + A(3,4)*x4
so thread i computes the result for row i]
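A minimal kernel sketch for this pattern, assuming one thread per row and row-major storage (the names and the M, N parameters are illustrative, not the slides' exact code):

__global__ void mat_vec(const float *A, const float *x, float *y, int M, int N){
    int row = blockDim.x * blockIdx.x + threadIdx.x;
    if(row < M){
        float sum = 0.0f;
        // Multiply row 'row' of A element-wise with vector x and accumulate
        for(int j = 0; j < N; j++){
            sum += A[row * N + j] * x[j];
        }
        y[row] = sum;
    }
}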
• cudaDeviceProp
– C struct with many member variables
• Add output for memory clock frequency and memory bus width
– Google “cudaDeviceProp”, find the member variables and add print statements
BW = ((memory clock rate in kHz)*1e3) * ((memory bus width in bits)*2) * (1/8) * (1/1e9)  [GB/s]
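A sketch of what the added code might look like – cudaGetDeviceProperties is the runtime call, and memoryClockRate (in kHz) and memoryBusWidth (in bits) are the relevant cudaDeviceProp members:

#include <stdio.h>

int main(){
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    printf("Memory clock rate (kHz): %d\n", prop.memoryClockRate);
    printf("Memory bus width (bits): %d\n", prop.memoryBusWidth);

    // Theoretical peak bandwidth in GB/s, using the formula above
    double bw = (prop.memoryClockRate * 1e3) * (prop.memoryBusWidth * 2.0) / 8.0 / 1e9;
    printf("Peak memory bandwidth (GB/s): %f\n", bw);

    return 0;
}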
• Shared memory is available in a limited amount
– 49152 B (48 KB) per block
[Diagram: dot product of vectors x and y]
x · y = x0*y0 + x1*y1 + x2*y2 + x3*y3 + x4*y4 + x5*y5
But when thread 0 from each block tries to update the (global) res variable, thread 0 from another block might also be writing to it.
• Data race condition!
• Solution: Atomic Functions
To do so:
• Edit the kernel so that each block computes only THREADS_PER_BLOCK elements of the dot product (i.e. only a portion of the sum of products)
• Sum the results from each block into the global res variable using atomicAdd()

HINTS
• Each block's (local) thread 0 should compute that block's portion of the dot product.
– i.e. use threadIdx.x instead of the global thread id
Since shared memory is only shared among threads in the same block, each block computes only its portion of the dot product.

__global__ void dot_prod(int *a, int *b, int *res)
{
    __shared__ int products[THREADS_PER_BLOCK];
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    products[threadIdx.x] = a[id] * b[id];
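The kernel is cut off here; a possible completion following the hints above (the __syncthreads() barrier and the reduction loop are assumptions consistent with the exercise, and atomicAdd is the CUDA atomic named in the solution):

    // Wait until every thread in the block has written its product
    __syncthreads();

    // Thread 0 of each block sums this block's portion of the dot product...
    if(threadIdx.x == 0){
        int sum = 0;
        for(int j = 0; j < THREADS_PER_BLOCK; j++){
            sum += products[j];
        }
        // ...and adds it to the global result atomically,
        // avoiding the data race between blocks
        atomicAdd(res, sum);
    }
}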