Introduction To CUDA C
NVIDIA Corporation
© NVIDIA 2013
What is CUDA?
• CUDA Architecture
– Expose GPU parallelism for general-purpose computing
– Retain performance
• CUDA C/C++
– Based on industry-standard C/C++
– Small set of extensions to enable heterogeneous programming
– Straightforward APIs to manage devices, memory etc.
© NVIDIA 2013
Introduction to CUDA C/C++
• What will you learn in this session?
– Start from “Hello World!”
– Write and launch CUDA C/C++ kernels
– Manage GPU memory
– Manage communication and synchronization
© NVIDIA 2013
Prerequisites
• You (probably) need experience with C or C++
CONCEPTS: Heterogeneous Computing, Blocks, Threads, Indexing, Shared memory, __syncthreads(), Asynchronous operation, Handling errors, Managing devices
© NVIDIA 2013
HELLO WORLD!
CONCEPTS: Heterogeneous Computing, Blocks, Threads, Indexing, Shared memory, __syncthreads(), Asynchronous operation, Handling errors, Managing devices
Heterogeneous Computing
Terminology:
• Host: the CPU and its memory (host memory)
• Device: the GPU and its memory (device memory)
© NVIDIA 2013
Heterogeneous Computing
#include <iostream>
#include <algorithm>
using namespace std;

#define N          1024
#define RADIUS     3
#define BLOCK_SIZE 16

// parallel fn
__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Synchronize (ensure all the data is available)
    __syncthreads();

    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS ; offset <= RADIUS ; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}

void fill_ints(int *x, int n) {
    fill_n(x, n, 1);
}

// serial code
int main(void) {
    int *in, *out;          // host copies of a, b, c
    int *d_in, *d_out;      // device copies of a, b, c
    int size = (N + 2*RADIUS) * sizeof(int);

    // Alloc space for host copies and setup values
    in  = (int *)malloc(size); fill_ints(in,  N + 2*RADIUS);
    out = (int *)malloc(size); fill_ints(out, N + 2*RADIUS);

    // Alloc space for device copies
    cudaMalloc((void **)&d_in,  size);
    cudaMalloc((void **)&d_out, size);

    // Copy to device
    cudaMemcpy(d_in,  in,  size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_out, out, size, cudaMemcpyHostToDevice);

    // Launch stencil_1d() kernel on GPU (parallel code)
    stencil_1d<<<N/BLOCK_SIZE,BLOCK_SIZE>>>(d_in + RADIUS, d_out + RADIUS);

    // Copy result back to host (serial code)
    cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(in); free(out);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
© NVIDIA 2013
Simple Processing Flow
[Diagram: host and device connected by the PCI bus. Input data is copied from host memory to device memory, the GPU program is launched, and results are copied back to host memory.]
© NVIDIA 2013
Hello World!

#include <stdio.h>

int main(void) {
    printf("Hello World!\n");
    return 0;
}

• Standard C that runs on the host
• NVIDIA compiler (nvcc) can be used to compile programs with no device code

Output:
$ nvcc hello_world.cu
$ a.out
Hello World!
$
© NVIDIA 2013
Hello World! with Device Code
__global__ void mykernel(void) {
}
int main(void) {
mykernel<<<1,1>>>();
printf("Hello World!\n");
return 0;
}
© NVIDIA 2013
Hello World! with Device Code
__global__ void mykernel(void) {
}

• The CUDA C/C++ keyword __global__ indicates a function that runs on the device and is called from host code
• nvcc separates source code into host and device components; device functions are compiled by the NVIDIA compiler, host functions by the standard host compiler
© NVIDIA 2013
Hello World! with Device Code

mykernel<<<1,1>>>();

• Triple angle brackets mark a call from host code to device code, also called a "kernel launch"
• We will return to the parameters (1, 1) shortly
© NVIDIA 2013
Hello World! with Device Code

#include <stdio.h>

__global__ void mykernel(void) {
}

int main(void) {
    mykernel<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}

Output:
$ nvcc hello.cu
$ a.out
Hello World!
$

• mykernel() does nothing, somewhat anticlimactic!
© NVIDIA 2013
Parallel Programming in CUDA C/C++
• But wait… GPU computing is about massive parallelism!
[Diagram: element-wise addition of vectors a and b into vector c]
© NVIDIA 2013
Addition on the Device
• A simple kernel to add two integers
__global__ void add(int *a, int *b, int *c) {
*c = *a + *b;
}
© NVIDIA 2013
Addition on the Device
• Note that we use pointers for the variables
__global__ void add(int *a, int *b, int *c) {
*c = *a + *b;
}
© NVIDIA 2013
Memory Management
• Host and device memory are separate entities (a short sketch follows this list)
– Device pointers point to GPU memory
May be passed to/from host code
May not be dereferenced in host code
– Host pointers point to CPU memory
May be passed to/from device code
May not be dereferenced in device code
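The device-memory API mirrors the familiar C calls: cudaMalloc(), cudaMemcpy(), cudaFree() (they reappear in the Review and the examples below). A minimal sketch of allocating, copying to, and freeing device memory; the variable names and the value 42 are just illustrative:

int value = 42;                              // host data (example value)
int *d_value;                                // device pointer: points to GPU memory
cudaMalloc((void **)&d_value, sizeof(int));  // allocate space on the device
cudaMemcpy(d_value, &value, sizeof(int),
           cudaMemcpyHostToDevice);          // copy host -> device
// ... launch kernels that dereference d_value on the device ...
cudaFree(d_value);                           // release the device allocation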
© NVIDIA 2013
Addition on the Device: add()
• Returning to our add() kernel
__global__ void add(int *a, int *b, int *c) {
*c = *a + *b;
}
© NVIDIA 2013
Addition on the Device: main()
int main(void) {
int a, b, c; // host copies of a, b, c
int *d_a, *d_b, *d_c; // device copies of a, b, c
int size = sizeof(int);
© NVIDIA 2013
Addition on the Device: main()
// Copy inputs to device
cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);
// Cleanup
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}
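The two main() fragments above skip the steps that allocate device memory, launch the kernel, and copy the result back. A minimal end-to-end sketch, assuming the add() kernel shown earlier; the input values 2 and 7 are arbitrary:

#include <stdio.h>

__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;
}

int main(void) {
    int a = 2, b = 7, c;            // host copies of a, b, c (example values)
    int *d_a, *d_b, *d_c;           // device copies of a, b, c
    int size = sizeof(int);

    // Allocate space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Copy inputs to device
    cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU
    add<<<1,1>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);
    printf("%d + %d = %d\n", a, b, c);

    // Cleanup
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}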
© NVIDIA 2013
RUNNING IN PARALLEL
CONCEPTS: Heterogeneous Computing, Blocks, Threads, Indexing, Shared memory, __syncthreads(), Asynchronous operation, Handling errors, Managing devices
© NVIDIA 2013
Moving to Parallel
• GPU computing is about massive parallelism
– So how do we run code in parallel on the device?
add<<< 1, 1 >>>();
– Instead of executing add() once, execute it N times in parallel:
add<<< N, 1 >>>();
© NVIDIA 2013
Vector Addition on the Device
• With add() running in parallel we can do vector addition
– Terminology: each parallel invocation of add() is referred to as a block; each block can refer to its own index with blockIdx.x
© NVIDIA 2013
Vector Addition on the Device: add()
• Returning to our parallelized add() kernel
__global__ void add(int *a, int *b, int *c) {
c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}
© NVIDIA 2013
Vector Addition on the Device: main()
#define N 512
int main(void) {
int *a, *b, *c; // host copies of a, b, c
int *d_a, *d_b, *d_c; // device copies of a, b, c
int size = N * sizeof(int);
© NVIDIA 2013
Vector Addition on the Device: main()
// Copy inputs to device
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
// Cleanup
free(a); free(b); free(c);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}
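As above, the fragments omit allocation, input setup, the launch, and the copy-back. A minimal end-to-end sketch; random_ints() is a hypothetical helper introduced here to fill the input arrays:

#include <stdio.h>
#include <stdlib.h>
#define N 512

__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

// Hypothetical helper: fill an array with arbitrary values
void random_ints(int *x, int n) {
    for (int i = 0; i < n; i++)
        x[i] = rand() % 100;
}

int main(void) {
    int *a, *b, *c;                 // host copies of a, b, c
    int *d_a, *d_b, *d_c;           // device copies of a, b, c
    int size = N * sizeof(int);

    // Alloc space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Alloc space for host copies of a, b, c and set up input values
    a = (int *)malloc(size); random_ints(a, N);
    b = (int *)malloc(size); random_ints(b, N);
    c = (int *)malloc(size);

    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU with N blocks
    add<<<N,1>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}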
© NVIDIA 2013
Review (1 of 2)
• Difference between host and device
– Host: CPU
– Device: GPU
© NVIDIA 2013
Review (2 of 2)
• Basic device memory management
– cudaMalloc()
– cudaMemcpy()
– cudaFree()
© NVIDIA 2013
INTRODUCING THREADS
CONCEPTS: Heterogeneous Computing, Blocks, Threads, Indexing, Shared memory, __syncthreads(), Asynchronous operation, Handling errors, Managing devices
© NVIDIA 2013
CUDA Threads
• Terminology: a block can be split into parallel threads
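For example, the earlier add() kernel can use the thread index instead of the block index; launched as add<<<1,N>>>(...), all N additions run as threads of a single block. A minimal sketch:

__global__ void add(int *a, int *b, int *c) {
    // Each thread handles one element, selected by its thread index
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}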
© NVIDIA 2013
Vector Addition Using Threads: main()
// Copy inputs to device
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
// Cleanup
free(a); free(b); free(c);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}
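The only change from the block-parallel main() (not visible in the fragment above) is the launch configuration; a sketch:

// Launch add() kernel on GPU with one block of N threads
add<<<1,N>>>(d_a, d_b, d_c);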
© NVIDIA 2013
COMBINING THREADS AND BLOCKS
CONCEPTS: Heterogeneous Computing, Blocks, Threads, Indexing, Shared memory, __syncthreads(), Asynchronous operation, Handling errors, Managing devices
© NVIDIA 2013
Combining Blocks and Threads
• We’ve seen parallel vector addition using:
– Many blocks with one thread each
– One block with many threads
© NVIDIA 2013
Indexing Arrays with Blocks and Threads
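The indexing scheme combines the block and thread indices into a unique global index, index = threadIdx.x + blockIdx.x * blockDim.x. A sketch of add() using this indexing:

__global__ void add(int *a, int *b, int *c) {
    // Combine block and thread indices into one global array index
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    c[index] = a[index] + b[index];
}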
© NVIDIA 2013
Indexing Arrays: Example
• Which thread will operate on the red element?
[Figure: an array of 32 elements laid out as four blocks of M = 8 threads each; global indices 0–31 on top, per-block thread indices 0–7 below]
• With M = 8, the red element is handled by the thread with threadIdx.x = 5 in the block with blockIdx.x = 2:
  int index = threadIdx.x + blockIdx.x * M
            = 5 + 2 * 8
            = 21
© NVIDIA 2013
Addition with Blocks and Threads: main()
// Copy inputs to device
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
// Cleanup
free(a); free(b); free(c);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}
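With blocks and threads combined, the launch (omitted in the fragment above) divides the N elements among blocks; a sketch, assuming a THREADS_PER_BLOCK constant that divides N evenly:

// Launch add() kernel on GPU
add<<<N/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c);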
© NVIDIA 2013
Handling Arbitrary Vector Sizes
• Typical problems are not friendly multiples of blockDim.x
• Avoid accessing beyond the end of the arrays (a sketch follows)
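A common pattern, sketched below (assuming M threads per block): pass the element count n into the kernel, guard the access with a bounds check, and round the block count up.

__global__ void add(int *a, int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)                    // avoid accessing beyond the end of the arrays
        c[index] = a[index] + b[index];
}

// Round the block count up so every element is covered
add<<<(N + M - 1) / M, M>>>(d_a, d_b, d_c, N);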
© NVIDIA 2013
Why Bother with Threads?
• Threads seem unnecessary
– They add a level of complexity
– What do we gain?
– Unlike parallel blocks, threads can communicate and synchronize, as the next section shows
© NVIDIA 2013
COOPERATING THREADS
CONCEPTS: Heterogeneous Computing, Blocks, Threads, Indexing, Shared memory, __syncthreads(), Asynchronous operation, Handling errors, Managing devices
© NVIDIA 2013
1D Stencil
• Consider applying a 1D stencil to a 1D array of elements
– Each output element is the sum of input elements within a radius
– For example, with a radius of 3, each output element is the sum of 7 input elements
© NVIDIA 2013
Implementing Within a Block
• Each thread processes one output element
– blockDim.x elements per block
© NVIDIA 2013
Sharing Data Between Threads
• Terminology: within a block, threads share data via
shared memory
© NVIDIA 2013
Implementing With Shared Memory
• Cache data in shared memory
– Read (blockDim.x + 2 * radius) input elements from global
memory to shared memory
– Compute blockDim.x output elements
– Write blockDim.x output elements to global memory
© NVIDIA 2013
Stencil Kernel
// Apply the stencil
int result = 0;
for (int offset = -RADIUS ; offset <= RADIUS ; offset++)
result += temp[lindex + offset];
© NVIDIA 2013
Data Race!
• The stencil example will not work…
• Suppose thread 15 reads the halo before thread 0 has fetched it…
int result = 0;
result += temp[lindex + 1];   // thread 15: load from temp[19]
© NVIDIA 2013
__syncthreads()
• void __syncthreads();
– Synchronizes all threads within a block
– All threads must reach the barrier before any thread can proceed
© NVIDIA 2013
Stencil Kernel
__global__ void stencil_1d(int *in, int *out) {
__shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
int gindex = threadIdx.x + blockIdx.x * blockDim.x;
int lindex = threadIdx.x + RADIUS;
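    // (continuation of the kernel, matching the full listing earlier)
    // Read input elements into shared memory, including the halo of
    // RADIUS elements on each side of the block
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }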
© NVIDIA 2013
Stencil Kernel
// Synchronize (ensure all the data is available)
__syncthreads();

// Apply the stencil
int result = 0;
for (int offset = -RADIUS ; offset <= RADIUS ; offset++)
    result += temp[lindex + offset];
© NVIDIA 2013
Review (1 of 2)
• Launching parallel threads
– Launch N blocks with M threads per block with
kernel<<<N,M>>>(…);
– Use blockIdx.x to access block index within grid
– Use threadIdx.x to access thread index within block
© NVIDIA 2013
Review (2 of 2)
• Use __shared__ to declare a variable/array in
shared memory
– Data is shared between threads in a block
– Not visible to threads in other blocks
© NVIDIA 2013
MANAGING THE DEVICE
CONCEPTS: Heterogeneous Computing, Blocks, Threads, Indexing, Shared memory, __syncthreads(), Asynchronous operation, Handling errors, Managing devices
© NVIDIA 2013
Coordinating Host & Device
• Kernel launches are asynchronous
– Control returns to the CPU immediately
– The CPU must synchronize before consuming results, e.g. via a blocking cudaMemcpy() or cudaDeviceSynchronize() (a short sketch follows)
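A minimal sketch of host-side synchronization after an asynchronous launch, reusing the add() kernel from earlier:

add<<<N,1>>>(d_a, d_b, d_c);   // launch returns to the CPU immediately
cudaDeviceSynchronize();       // block the CPU until all preceding CUDA calls have completed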
© NVIDIA 2013
Reporting Errors
• All CUDA API calls return an error code (cudaError_t)
– Error in the API call itself
OR
– Error in an earlier asynchronous operation (e.g. kernel)
printf("%s\n", cudaGetErrorString(cudaGetLastError()));
© NVIDIA 2013
Device Management
• An application can query and select GPUs with the following calls (a short sketch follows the list)
cudaGetDeviceCount(int *count)
cudaSetDevice(int device)
cudaGetDevice(int *device)
cudaGetDeviceProperties(cudaDeviceProp *prop, int device)
• cudaMemcpy() vs cudaMemcpyAsync(),
cudaDeviceSynchronize()
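A minimal sketch using the calls listed above to enumerate the GPUs in a system (assumes <stdio.h>):

int count = 0;
cudaGetDeviceCount(&count);
for (int i = 0; i < count; i++) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, i);
    printf("Device %d: %s (compute capability %d.%d)\n",
           i, prop.name, prop.major, prop.minor);
}
cudaSetDevice(0);   // select the first device for subsequent CUDA calls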
© NVIDIA 2013
Compute Capability
• The compute capability of a device describes its architecture, e.g.
– Number of registers
– Sizes of memories
– Features & capabilities
Compute Capability | Selected Features (see CUDA C Programming Guide for complete list) | Tesla models
1.0 | Fundamental CUDA support | 870
1.3 | Double precision, improved memory accesses, atomics | 10-series
2.0 | Caches, fused multiply-add, 3D grids, surfaces, ECC, P2P, concurrent kernels/copies, function pointers, recursion | 20-series
[Figure: a 3D grid of thread blocks, e.g. Block (1,1,0)]
• Addressable as 1D, 2D or 3D
© NVIDIA 2013
Topics we skipped
• We skipped some details; you can learn more from:
– CUDA Programming Guide
– CUDA Zone – tools, training, webinars and more
developer.nvidia.com/cuda
© NVIDIA 2013