Programming GPUs - Part 1: CUDA Programming Model

Prasanna Biswas
AI Software Solutions Engineer at Intel | Ex-Qualcomm | DL Models Optimization | Parallel Programming in…

January 12, 2025

In this part of the series, we explore the CUDA Programming Model and
break down the key steps to writing a CUDA program. CUDA, NVIDIA’s
parallel computing platform, enables developers to harness the power of
GPUs for high-performance computations.

To program a GPU using CUDA, there are five essential steps:

1. Allocate GPU memory

2. Copy data to GPU memory


3. Perform computation on the GPU

4. Copy data back to the CPU (host)


5. Deallocate GPU memory

Let’s dive into each step with examples.

1. Allocating GPU Memory


To allocate memory on the GPU, we use the cudaMalloc function. This
function allocates memory on the device (GPU) that can be accessed by
CUDA kernels.

Syntax:

cudaError_t cudaMalloc(void** devPtr, size_t size);

devPtr: Pointer to the allocated device memory.
size: Number of bytes to allocate.

Example:

float* d_array;
size_t size = 100 * sizeof(float);
cudaMalloc((void**)&d_array, size);

Here, d_array is a pointer to the GPU memory where an array of 100 floats
is allocated.
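cudaMalloc returns a cudaError_t, so it is worth checking the result before using the pointer. A minimal sketch (a fragment, assuming the same headers as the complete program later in this article):

float* d_array;
size_t size = 100 * sizeof(float);
cudaError_t err = cudaMalloc((void**)&d_array, size);
if (err != cudaSuccess) {
    // cudaGetErrorString turns the error code into a readable message
    std::cerr << "cudaMalloc failed: " << cudaGetErrorString(err) << std::endl;
}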

2. Copying Data Between Host and Device


CUDA provides the cudaMemcpy function to transfer data between host
memory (CPU) and device memory (GPU).

Syntax:

cudaError_t cudaMemcpy(void* dst, const void* src, size_t count, cudaMemcpyKind kind);

dst: Destination pointer.
src: Source pointer.
count: Number of bytes to copy.
kind: Direction of data transfer (cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost).

Example:
Copying Data from Host to Device:

float h_array[100]; // Host array
cudaMemcpy(d_array, h_array, size, cudaMemcpyHostToDevice);

Copying Data from Device to Host:

float h_result[100]; // Host array to store results
cudaMemcpy(h_result, d_array, size, cudaMemcpyDeviceToHost);

3. Deallocating GPU Memory


To free the allocated memory on the GPU, we use cudaFree.

Syntax:

cudaError_t cudaFree(void* devPtr);

Example:

cudaFree(d_array);

Always remember to free GPU memory after computation to prevent memory leaks.

4. Performing Computation on the GPU


CUDA uses kernels to perform computations on the GPU. A kernel is a
function declared with the __global__ keyword and executed on the GPU.
Kernels are launched using the syntax:

kernel_name<<<numBlocks, numThreadsPerBlock>>>(arguments);

numBlocks: Number of thread blocks in the grid.
numThreadsPerBlock: Number of threads per block.
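A common way to pick these values (and the one used in the complete program below) is to fix numThreadsPerBlock and derive numBlocks from the problem size with a ceiling division, so every element gets a thread:

int threadsPerBlock = 256;
int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock; // round up so all N elements are covered
kernel_name<<<numBlocks, threadsPerBlock>>>(arguments);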

Example: Vector Addition Kernel


Let’s implement vector addition using CUDA.

Kernel Code:

__global__ void vectorAdd(const float* A, const float* B, float* C, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x; // Calculate thread ID
    if (i < N) {
        C[i] = A[i] + B[i];
    }
}

Complete Program:

#include <cuda_runtime.h>
#include <iostream>

__global__ void vectorAdd(const float* A, const float* B, float* C, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        C[i] = A[i] + B[i];
    }
}

int main() {
    int N = 1000;
    size_t size = N * sizeof(float);

    // Allocate host memory
    float* h_A = new float[N];
    float* h_B = new float[N];
    float* h_C = new float[N];

    // Initialize host arrays
    for (int i = 0; i < N; i++) {
        h_A[i] = i;
        h_B[i] = i * 2;
    }

    // Allocate device memory
    float* d_A, * d_B, * d_C;
    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_B, size);
    cudaMalloc((void**)&d_C, size);

    // Copy data from host to device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Launch the kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Copy results from device to host
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Display some results
    for (int i = 0; i < 10; i++) {
        std::cout << "C[" << i << "] = " << h_C[i] << std::endl;
    }

    // Free memory
    delete[] h_A;
    delete[] h_B;
    delete[] h_C;
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    return 0;
}
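Since h_A[i] = i and h_B[i] = 2 * i, the program should print C[0] = 0, C[1] = 3, and so on up to C[9] = 27. Compilation with NVCC is covered in Part 3; in short, save the file with a .cu extension and build it with nvcc.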


What’s Next?
In this article, we introduced the CUDA Programming Model and
implemented a vector addition example. In future articles, we’ll explore:

CUDA Grids and Blocks


CUDA Compilation Process

Advanced Optimization Techniques

Stay tuned for more insights into GPU programming!

#GPU #CUDA #ParallelComputing #GPUProgramming #Programming #Nvidia #HighPerformanceComputing #TechInsights

Programming GPUs - Part 2: CUDA Memory Hierarchy

Prasanna Biswas
AI Software Solutions Engineer at Intel | Ex-Qualcomm | DL Models Optimization | Parallel Programming in…

January 14, 2025

In GPU programming, understanding the CUDA memory hierarchy and the structure of threads, blocks, and grids is essential. This article delves into how CUDA organizes computations and demonstrates how to index threads for efficient programming.

Courtesy: Nvidia CUDA Docs

Threads, Blocks, and Grids in CUDA


CUDA uses a hierarchical structure to organize threads for parallel
execution:

1. Grid: An array of blocks.

2. Block: A collection of threads.


3. Thread: The smallest unit of execution.

This hierarchy allows GPUs to handle a vast number of threads, achieving massive parallelism. Each thread executes the same function, known as the kernel, but operates on different data.

Kernel and Parallel Execution


A kernel is a GPU function that runs on multiple threads. All threads in a
grid execute the same kernel, but they process data independently. To
ensure each thread performs its task, CUDA provides mechanisms to
compute unique thread indices.

Courtesy: Microway CUDA guide

Indexing Threads in CUDA


CUDA exposes several built-in variables to identify threads within the grid
and blocks:

1. gridDim: Number of blocks in the grid.


2. blockIdx: Index of the block in the grid.

3. blockDim: Number of threads in a block.


4. threadIdx: Index of the thread in the block.

The global thread index can be calculated using:

int threadId = blockIdx.x * blockDim.x + threadIdx.x;

This computation uniquely identifies each thread, allowing it to access specific data in memory.

Code Snippet: Indexing Threads


Here’s a simple CUDA kernel to compute the global thread index:

__global__ void computeIndices() {
    int threadId = blockIdx.x * blockDim.x + threadIdx.x;
    printf("Thread ID: %d\n", threadId);
}

This kernel computes and prints the ID of each thread.
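To try it out, a minimal host-side launch might look like the sketch below (the launch configuration is chosen arbitrarily; cudaDeviceSynchronize, covered in Part 3, makes the program wait for the GPU so the printf output actually appears before exit):

#include <cstdio>

__global__ void computeIndices() {
    int threadId = blockIdx.x * blockDim.x + threadIdx.x;
    printf("Thread ID: %d\n", threadId);
}

int main() {
    computeIndices<<<2, 4>>>();  // 2 blocks of 4 threads each -> thread IDs 0..7
    cudaDeviceSynchronize();     // wait for the GPU before the program exits
    return 0;
}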

Key Components of CUDA Code


Two critical aspects of CUDA programming are:

1. Thread Index and Computation: Properly identifying which part of the data a thread handles.

2. Boundary Conditions: Ensuring threads do not access out-of-bound memory locations.

Revisiting Vector Addition with Boundary Conditions


Let’s rewrite the vector addition kernel from the previous article, including
boundary conditions:

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void vectorAdd(float *a, float *b, float *c, int n) {
    int threadId = blockIdx.x * blockDim.x + threadIdx.x;

    // Boundary condition to prevent out-of-bound memory access
    if (threadId < n) {
        c[threadId] = a[threadId] + b[threadId];
    }
}

int main() {
    int n = 1000; // Size of vectors
    size_t size = n * sizeof(float);

    // Host memory allocation
    float *h_a = (float *)malloc(size);
    float *h_b = (float *)malloc(size);
    float *h_c = (float *)malloc(size);

    // Initialize vectors
    for (int i = 0; i < n; i++) {
        h_a[i] = i * 1.0f;
        h_b[i] = i * 2.0f;
    }

    // Device memory allocation
    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Copy data from host to device
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);

    // Kernel launch
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n);

    // Copy result back to host
    cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);

    // Verify result
    for (int i = 0; i < n; i++) {
        printf("%f ", h_c[i]);
    }

    // Free memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    free(h_a);
    free(h_b);
    free(h_c);

    return 0;
}
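Because h_a[i] = i and h_b[i] = 2i, every element of h_c should equal 3i. Printing all 1000 values works for a quick check, but comparing each h_c[i] against the expected 3.0f * i inside the loop is a more reliable way to verify the result.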


Programming GPUs – Part 3: CUDA Code Compilation and Synchronization

Prasanna Biswas
AI Software Solutions Engineer at Intel | Ex-Qualcomm | DL Models Optimization | Parallel Programming in…

January 18, 2025

CUDA programming is an exciting journey, and understanding how CUDA code is compiled and synchronized is crucial for mastering GPU programming. In this article, we’ll cover the role of NVCC, the compilation process, CUDA-specific keywords, and how to handle asynchronous execution and synchronization in CUDA programs.

What is NVCC?
NVCC (NVIDIA CUDA Compiler) is the toolchain used to compile CUDA
programs. It processes both host (CPU) and device (GPU) code, ensuring
that your CUDA kernels run seamlessly on the GPU while the host code
operates on the CPU.

Steps to Install NVCC

1. Download the CUDA Toolkit from the NVIDIA website.
2. Follow the installation instructions for your operating system (Windows, Linux, or macOS).
3. Ensure the CUDA environment variables (PATH and LD_LIBRARY_PATH) are set, as shown in the example below.
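On Linux, for example, step 3 typically means adding lines like these to your shell profile (the paths below assume a default CUDA Toolkit install under /usr/local/cuda; adjust them for your version and OS):

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH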

How to Compile and Run a CUDA Program


Use the .cu extension for CUDA files.

Compile with:

nvcc -o output_file source_file.cu

Run the compiled binary:

./output_file

The CUDA Compilation Process


NVCC compiles CUDA code in two parts:

1. Host Code: Written in C++ and compiled with a host compiler (e.g.,
GCC, MSVC). The output is host assembly code (x86, ARM, etc.),
executed on the CPU.
2. Device Code: CUDA kernels are compiled into .ptx (virtual ISA) code.
At runtime, the GPU’s JIT (Just-In-Time) compiler translates .ptx into
device-specific assembly (e.g., SASS), which the GPU executes.

Courtesy: Medium blog by CisMine Ng
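If you want to look at the intermediate PTX yourself, nvcc can stop at that stage; a rough sketch (the file name is just an example):

nvcc -ptx vector_add.cu -o vector_add.ptx   # emit the virtual ISA (PTX) instead of a full binary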

CUDA Function Keywords


CUDA provides three keywords to define where and how functions are
executed:

1. __host__: Executed on the host (CPU).

2. __global__: Launched from the host but executed on the device (GPU). Used to define kernel functions.

3. __device__: Executed on the device (GPU). Can be called only from other __device__ or __global__ functions.

Example:

#include <cstdio>

__global__ void gpuKernel() {
    printf("Hello from GPU thread %d!\n", threadIdx.x);
}

int main() {
    gpuKernel<<<1, 10>>>();   // Launch kernel with 1 block of 10 threads
    cudaDeviceSynchronize();  // Synchronize CPU and GPU
    return 0;
}
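The example above only uses __global__. As a rough sketch of how the three qualifiers relate (the function names here are invented for illustration):

// __device__ helper: runs on the GPU and can be called only from GPU code
__device__ float square(float x) {
    return x * x;
}

// __global__ kernel: launched from the host, executed on the device
__global__ void squareAll(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = square((float)i);  // device-to-device call
    }
}

// __host__ function (also the default for unmarked functions): runs on the CPU
__host__ void launchSquareAll(float* d_out, int n) {
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    squareAll<<<blocks, threadsPerBlock>>>(d_out, n);
}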

Asynchronous Kernel Calls


In CUDA, kernel launches are asynchronous. This means that after a kernel
is invoked, the CPU does not wait for the GPU to finish execution—it
immediately moves to the next instruction.

Example:

__global__ void vectorAdd(int *a, int *b, int *c, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

int main() {
    const int n = 1024;
    int *a, *b, *c;       // Host pointers (unused here; host data is omitted to keep the example short)
    int *d_a, *d_b, *d_c; // Device pointers

    // Allocate device memory
    cudaMalloc((void**)&d_a, n * sizeof(int));
    cudaMalloc((void**)&d_b, n * sizeof(int));
    cudaMalloc((void**)&d_c, n * sizeof(int));

    // Launch kernel (asynchronous: control returns to the CPU immediately)
    vectorAdd<<<1, n>>>(d_a, d_b, d_c, n);
    printf("Kernel launched asynchronously.\n");

    // Synchronize to ensure the GPU has finished
    cudaDeviceSynchronize();

    // Free memory
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}

In this example, printf is executed immediately after the kernel launch, demonstrating asynchronous behavior.
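A practical side effect of this asynchrony is that a failed launch does not report an error at the <<<...>>> call itself. One common pattern (sketched here, not part of the original example) is to query the error state after the launch and again after synchronizing:

vectorAdd<<<1, n>>>(d_a, d_b, d_c, n);
cudaError_t launchErr = cudaGetLastError();      // errors detected at launch time (e.g., bad configuration)
cudaError_t syncErr = cudaDeviceSynchronize();   // errors that surface while the kernel runs
if (launchErr != cudaSuccess || syncErr != cudaSuccess) {
    printf("CUDA error: %s\n",
           cudaGetErrorString(launchErr != cudaSuccess ? launchErr : syncErr));
}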

Synchronizing with cudaDeviceSynchronize()

To ensure that all GPU operations have completed before proceeding, use
cudaDeviceSynchronize(). This function blocks the CPU until all preceding
GPU tasks are completed.

Syntax:

cudaError_t cudaDeviceSynchronize();

Usage:

vectorAdd<<<numBlocks, threadsPerBlock>>>(d_a, d_b, d_c, n);
cudaDeviceSynchronize(); // Wait for GPU to finish computation

Revised Vector Addition with Synchronization


Here’s the vector addition program from Part 2, keeping its boundary check and now adding an explicit cudaDeviceSynchronize() after the kernel launch:

__global__ void vectorAdd(int *a, int *b, int *c, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) { // Boundary condition
        c[idx] = a[idx] + b[idx];
    }
}

int main() {
    const int n = 1024;
    int *h_a, *h_b, *h_c; // Host arrays
    int *d_a, *d_b, *d_c; // Device arrays

    // Allocate host memory
    h_a = (int*)malloc(n * sizeof(int));
    h_b = (int*)malloc(n * sizeof(int));
    h_c = (int*)malloc(n * sizeof(int));

    // Initialize host arrays so the kernel operates on defined values
    for (int i = 0; i < n; i++) {
        h_a[i] = i;
        h_b[i] = 2 * i;
    }

    // Allocate device memory
    cudaMalloc((void**)&d_a, n * sizeof(int));
    cudaMalloc((void**)&d_b, n * sizeof(int));
    cudaMalloc((void**)&d_c, n * sizeof(int));

    // Copy data to device
    cudaMemcpy(d_a, h_a, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, n * sizeof(int), cudaMemcpyHostToDevice);

    // Launch kernel
    vectorAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

    cudaDeviceSynchronize(); // Synchronize CPU and GPU

    // Copy result back to host
    cudaMemcpy(h_c, d_c, n * sizeof(int), cudaMemcpyDeviceToHost);

    // Free memory
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);

    return 0;
}

Key Takeaways:

NVCC splits host and device code for efficient execution.

Asynchronous kernel launches improve performance but require careful synchronization.

cudaDeviceSynchronize() ensures GPU tasks are complete before continuing.

Proper boundary checks in CUDA kernels prevent memory access violations.
