Introduction To CUDA C
NVIDIA Corporation
What is CUDA?
• CUDA Architecture
– Expose GPU parallelism for general-purpose computing
– Retain performance
• CUDA C/C++
– Based on industry-standard C/C++
– Small set of extensions to enable heterogeneous programming
– Straightforward APIs to manage devices, memory etc.
© NVIDIA 2013
Prerequisites
• You (probably) need experience with C or C++
CONCEPTS
Heterogeneous Computing
Blocks
Threads
Indexing
Shared memory
__syncthreads()
Asynchronous operation
Handling errors
Managing devices
HELLO WORLD!
Heterogeneous Computing
Terminology:
– Host: the CPU and its memory (host memory)
– Device: the GPU and its memory (device memory)
Heterogeneous Computing
#include <iostream>
#include <algorithm>
using namespace std;

#define N 1024
#define RADIUS 3
#define BLOCK_SIZE 16

// parallel fn
__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }
    // Synchronize (ensure all the data is available)
    __syncthreads();
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];
    out[gindex] = result;
}

// serial code
int main(void) {
    int *in, *out;      // host copies
    int *d_in, *d_out;  // device copies
    int size = (N + 2 * RADIUS) * sizeof(int);
    in  = (int *)malloc(size); fill_n(in,  N + 2 * RADIUS, 1);
    out = (int *)malloc(size); fill_n(out, N + 2 * RADIUS, 1);
    cudaMalloc((void **)&d_in,  size);
    cudaMalloc((void **)&d_out, size);
    // Copy to device
    cudaMemcpy(d_in,  in,  size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_out, out, size, cudaMemcpyHostToDevice);
    // Launch kernel: parallel code
    stencil_1d<<<N/BLOCK_SIZE, BLOCK_SIZE>>>(d_in + RADIUS, d_out + RADIUS);
    // Copy result back: serial code
    cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);
    // Cleanup
    free(in); free(out);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
Simple Processing Flow
1. Copy input data from CPU memory to GPU memory (across the PCI bus)
2. Load the GPU program and execute, caching data on chip for performance
3. Copy the results from GPU memory back to CPU memory
Hello World!
#include <stdio.h>

int main(void) {
    printf("Hello World!\n");
    return 0;
}

Output:
$ nvcc hello_world.cu
$ a.out
Hello World!
$

• Standard C that runs on the host
• The NVIDIA compiler (nvcc) can be used to compile programs with no device code
Hello World! with Device Code
__global__ void mykernel(void) {
}
int main(void) {
mykernel<<<1,1>>>();
printf("Hello World!\n");
return 0;
}
Hello World! with Device Code
__global__ void mykernel(void) {
}

• The CUDA C/C++ keyword __global__ indicates a function that:
– Runs on the device
– Is called from host code
• nvcc separates source code into host and device components
– Device functions (e.g. mykernel()) are processed by the NVIDIA compiler
– Host functions (e.g. main()) are processed by the standard host compiler (gcc, cl.exe)
Hello World! with Device Code
mykernel<<<1,1>>>();
• Triple angle brackets mark a call from host code to device code
– Also called a “kernel launch”
– We’ll return to the parameters (1,1) in a moment
Hello World! with Device Code
__global__ void mykernel(void) {
}

int main(void) {
    mykernel<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}

Output:
$ nvcc hello.cu
$ a.out
Hello World!
$
Parallel Programming in CUDA C/C++
• But wait… GPU computing is about massive parallelism!
• We need a more interesting example…
• We'll start by adding two integers and build up to vector addition: c = a + b
Addition on the Device
• A simple kernel to add two integers
__global__ void add(int *a, int *b, int *c) {
*c = *a + *b;
}
Addition on the Device
• Note that we use pointers for the variables

__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;
}

• add() runs on the device, so a, b and c must point to device memory
• We need to allocate memory on the GPU
Memory Management
• Host and device memory are separate entities
– Device pointers point to GPU memory
  May be passed to/from host code
  May not be dereferenced in host code
– Host pointers point to CPU memory
  May be passed to/from device code
  May not be dereferenced in device code
• Simple CUDA API for handling device memory
– cudaMalloc(), cudaFree(), cudaMemcpy()
– Similar to the C equivalents malloc(), free(), memcpy()
Addition on the Device: add()
• Returning to our add() kernel
__global__ void add(int *a, int *b, int *c) {
*c = *a + *b;
}
Addition on the Device: main()
int main(void) {
    int a, b, c;          // host copies of a, b, c
    int *d_a, *d_b, *d_c; // device copies of a, b, c
    int size = sizeof(int);

    // Allocate space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Setup input values
    a = 2;
    b = 7;

    // Copy inputs to device
    cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU
    add<<<1,1>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
RUNNING IN PARALLEL
Moving to Parallel
• GPU computing is about massive parallelism
– So how do we run code in parallel on the device?

add<<< 1, 1 >>>();   // executes add() once
add<<< N, 1 >>>();   // instead: execute add() N times in parallel
Vector Addition on the Device
• With add() running in parallel we can do vector addition
• Terminology: each parallel invocation of add() is referred to as a block
– The set of blocks is referred to as a grid
– Each invocation can refer to its block index using blockIdx.x
Vector Addition on the Device: add()
• Returning to our parallelized add() kernel
__global__ void add(int *a, int *b, int *c) {
c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

• By using blockIdx.x to index into the array, each block handles a different element
Vector Addition on the Device: main()
#define N 512
int main(void) {
    int *a, *b, *c;       // host copies of a, b, c
    int *d_a, *d_b, *d_c; // device copies of a, b, c
    int size = N * sizeof(int);

    // Alloc space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Alloc space for host copies of a, b, c and setup input values
    a = (int *)malloc(size); random_ints(a, N);
    b = (int *)malloc(size); random_ints(b, N);
    c = (int *)malloc(size);

    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU with N blocks
    add<<<N,1>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
Review (1 of 2)
• Difference between host and device
– Host: CPU
– Device: GPU
INTRODUCING THREADS
CUDA Threads
• Terminology: a block can be split into parallel threads
• Let's change add() to use parallel threads instead of parallel blocks:

__global__ void add(int *a, int *b, int *c) {
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}
Vector Addition Using Threads: main()
// (allocation and input setup are unchanged from the previous main())

// Copy inputs to device
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

// Launch add() kernel on GPU with N threads
add<<<1,N>>>(d_a, d_b, d_c);

// Copy result back to host
cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

// Cleanup
free(a); free(b); free(c);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}
COMBINING THREADS AND BLOCKS
Combining Blocks and Threads
• We’ve seen parallel vector addition using:
– Many blocks with one thread each
– One block with many threads
• Let’s adapt vector addition to use both blocks and threads
Indexing Arrays: Example
• Which thread will operate on the red element?

0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7
(array elements 0 … 31 in four blocks of M = 8 threads; the red element is index 21)

With threadIdx.x = 5 and blockIdx.x = 2:

int index = threadIdx.x + blockIdx.x * M;
          = 5 + 2 * 8;
          = 21;
Addition with Blocks and Threads: main()
// (the kernel now computes index = threadIdx.x + blockIdx.x * blockDim.x)

// Copy inputs to device
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

// Launch add() kernel on GPU
add<<<N/THREADS_PER_BLOCK,THREADS_PER_BLOCK>>>(d_a, d_b, d_c);

// Copy result back to host
cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

// Cleanup
free(a); free(b); free(c);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}
Handling Arbitrary Vector Sizes
• Typical problems are not friendly multiples of blockDim.x
• Avoid accessing beyond the end of the arrays:

__global__ void add(int *a, int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)
        c[index] = a[index] + b[index];
}

• Update the kernel launch:
add<<<(N + M - 1) / M, M>>>(d_a, d_b, d_c, N);
Why Bother with Threads?
• Threads seem unnecessary
– They add a level of complexity
– What do we gain?
• Unlike parallel blocks, threads have mechanisms to:
– Communicate
– Synchronize

COOPERATING THREADS
1D Stencil
• Consider applying a 1D stencil to a 1D array of elements
– Each output element is the sum of input elements within a radius
• If radius is 3, then each output element is the sum of 7 input elements
Implementing Within a Block
• Each thread processes one output element
– blockDim.x elements per block
• Input elements are read several times
– With radius 3, each input element is read seven times
Sharing Data Between Threads
• Terminology: within a block, threads share data via shared memory
• Extremely fast on-chip memory, user-managed
• Declare using __shared__, allocated per block
• Data is not visible to threads in other blocks
Implementing With Shared Memory
• Cache data in shared memory
– Read (blockDim.x + 2 * radius) input elements from global memory to shared memory
– Compute blockDim.x output elements
– Write blockDim.x output elements to global memory
– Each block needs a halo of radius elements at each boundary
Stencil Kernel
// Apply the stencil
int result = 0;
for (int offset = -RADIUS ; offset <= RADIUS ; offset++)
    result += temp[lindex + offset];
Data Race!
• The stencil example will not work…
• Suppose thread 15 reads the halo before thread 0 has fetched it…

temp[lindex] = in[gindex];                               // Thread 15 stores at temp[18]
if (threadIdx.x < RADIUS) {                              // Skipped by thread 15 (threadIdx.x > RADIUS)
    temp[lindex - RADIUS] = in[gindex - RADIUS];
    temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE]; // Thread 0 stores the halo at temp[19]
}
int result = 0;
result += temp[lindex + 1];                              // Thread 15 loads from temp[19]
__syncthreads()
• void __syncthreads();
• Synchronizes all threads within a block
– Used to prevent RAW / WAR / WAW hazards
• All threads must reach the barrier
– In conditional code, the condition must be uniform across the block
Stencil Kernel
__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Synchronize (ensure all the data is available)
    __syncthreads();

    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS ; offset <= RADIUS ; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}
Review (1 of 2)
• Launching parallel threads
– Launch N blocks with M threads per block with
kernel<<<N,M>>>(…);
– Use blockIdx.x to access block index within grid
– Use threadIdx.x to access thread index within block
Review (2 of 2)
• Use __shared__ to declare a variable/array in shared memory
– Data is shared between threads in a block
– Not visible to threads in other blocks
• Use __syncthreads() as a barrier
– Use to prevent data hazards
MANAGING THE DEVICE
Coordinating Host & Device
• Kernel launches are asynchronous
– Control returns to the CPU immediately
• The CPU needs to synchronize before consuming the results
– cudaMemcpy(): blocks the CPU until the copy is complete; the copy begins once all preceding CUDA calls have completed
– cudaMemcpyAsync(): asynchronous, does not block the CPU
– cudaDeviceSynchronize(): blocks the CPU until all preceding CUDA calls have completed
Reporting Errors
• All CUDA API calls return an error code (cudaError_t)
– Error in the API call itself, OR
– Error in an earlier asynchronous operation (e.g. a kernel)
• Get the error code for the last error:
cudaError_t cudaGetLastError(void)
• Get a string to describe the error:
char *cudaGetErrorString(cudaError_t)

printf("%s\n", cudaGetErrorString(cudaGetLastError()));
Device Management
• Application can query and select GPUs
cudaGetDeviceCount(int *count)
cudaSetDevice(int device)
cudaGetDevice(int *device)
cudaGetDeviceProperties(cudaDeviceProp *prop, int device)
Compute Capability
• The compute capability of a device describes its architecture, e.g.
– Number of registers
– Sizes of memories
– Features & capabilities
IDs and Dimensions
• A kernel is launched as a grid of blocks of threads
– blockIdx and threadIdx are 3D
– We showed only one dimension (x)
• Built-in variables:
– threadIdx
– blockIdx
– blockDim
– gridDim
[Figure: Grid 1 is a 3x2 array of blocks, Block (0,0,0) through Block (2,1,0); Block (1,1,0) is expanded into a 5x3 array of threads, Thread (0,0,0) through Thread (4,2,0)]
Textures
• Read-only object
– Dedicated cache
• Dedicated filtering hardware
– (Linear, bilinear, trilinear)
• Addressable as 1D, 2D or 3D
[Figure: a 2D texture grid with sample coordinates (2.5, 0.5) and (1.0, 1.0)]