Programming GPUs - Part 1: CUDA Programming Model
In this part of the series, we explore the CUDA Programming Model and
break down the key steps to writing a CUDA program. CUDA, NVIDIA’s
parallel computing platform, enables developers to harness the power of
GPUs for high-performance computations.
Allocating Device Memory (cudaMalloc)
Syntax:
cudaError_t cudaMalloc(void **devPtr, size_t size);
devPtr: Pointer to the allocated device memory.
size: Number of bytes to allocate.
Example:
float* d_array;
size_t size = 100 * sizeof(float);
cudaMalloc((void**)&d_array, size);
Here, d_array is a pointer to the GPU memory where an array of 100 floats
is allocated.
Copying Data Between Host and Device (cudaMemcpy)
Syntax:
cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, cudaMemcpyKind kind);
dst: Destination pointer.
src: Source pointer.
count: Number of bytes to copy.
kind: Direction of the transfer, e.g., cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost.
Copying Data from Host to Device:
Example (h_array is the corresponding array on the host):
cudaMemcpy(d_array, h_array, size, cudaMemcpyHostToDevice);
Freeing Device Memory (cudaFree)
Syntax:
cudaError_t cudaFree(void *devPtr);
Example:
cudaFree(d_array);
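Each of these runtime calls returns a cudaError_t, so it is good practice to check the result. A minimal sketch, continuing the cudaMalloc example above (cudaGetErrorString is part of the CUDA runtime API; <iostream> is assumed to be included):
cudaError_t err = cudaMalloc((void **)&d_array, size);
if (err != cudaSuccess) {
    std::cerr << "cudaMalloc failed: " << cudaGetErrorString(err) << std::endl;
}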
Launching a Kernel
Syntax:
kernel_name<<<numBlocks, numThreadsPerBlock>>>(arguments);
numBlocks: Number of thread blocks in the grid.
numThreadsPerBlock: Number of threads in each block.
Kernel Code:
__global__ void vectorAdd(const float *A, const float *B, float *C, int N) {
    // Each thread computes one element of the output vector
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        C[i] = A[i] + B[i];
    }
}
Complete Program:
#include <cuda_runtime.h>
#include <iostream>

__global__ void vectorAdd(const float *A, const float *B, float *C, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        C[i] = A[i] + B[i];
    }
}

int main() {
    int N = 1000;
    size_t size = N * sizeof(float);

    // Allocate and initialize host memory
    float *h_A = new float[N];
    float *h_B = new float[N];
    float *h_C = new float[N];
    for (int i = 0; i < N; i++) {
        h_A[i] = 1.0f * i;
        h_B[i] = 2.0f * i;
    }

    // Allocate device memory
    float *d_A, *d_B, *d_C;
    cudaMalloc((void **)&d_A, size);
    cudaMalloc((void **)&d_B, size);
    cudaMalloc((void **)&d_C, size);

    // Copy input vectors from host to device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Launch the kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Copy the result back to the host
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    std::cout << "h_C[0] = " << h_C[0] << std::endl;

    // Free memory
    delete[] h_A;
    delete[] h_B;
    delete[] h_C;
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    return 0;
}
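Assuming the program above is saved as vector_add.cu (a file name chosen here for illustration), it can be compiled and run with nvcc, NVIDIA's CUDA compiler (covered in Part 3):
nvcc vector_add.cu -o vector_add
./vector_add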
What’s Next?
In this article, we introduced the CUDA Programming Model and
implemented a vector addition example. In future articles, we'll explore the CUDA memory hierarchy (Part 2) and CUDA code compilation and synchronization (Part 3).
Programming GPUs - Part 2: CUDA Memory Hierarchy
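The complete example below allocates vectors in host memory with malloc, mirrors them in device global memory with cudaMalloc, moves data between the two with cudaMemcpy, and adds the vectors on the GPU.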
#include <cuda_runtime.h>
#include <stdio.h>
// Kernel: each thread computes one element of the output vector
__global__ void vectorAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

int main() {
    int n = 1000; // Size of vectors
    size_t size = n * sizeof(float);

    // Allocate host memory
    float *h_a = (float *)malloc(size);
    float *h_b = (float *)malloc(size);
    float *h_c = (float *)malloc(size);

    // Initialize vectors
    for (int i = 0; i < n; i++) {
        h_a[i] = i * 1.0f;
        h_b[i] = i * 2.0f;
    }
    // Allocate device memory
    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Copy input vectors from host to device
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);

    // Kernel launch
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n);
    // Copy the result back to the host
    cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);

    // Verify result
    for (int i = 0; i < n; i++) {
        printf("%f ", h_c[i]);
    }
    // Free memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    free(h_a);
    free(h_b);
    free(h_c);
    return 0;
}
Programming GPUs - Part 3: CUDA Code Compilation and Synchronization
What is NVCC?
NVCC (NVIDIA CUDA Compiler) is the toolchain used to compile CUDA
programs. It processes both host (CPU) and device (GPU) code, ensuring
that your CUDA kernels run seamlessly on the GPU while the host code
operates on the CPU.
Compile with (assuming the source file is named vector_add.cu):
nvcc vector_add.cu -o output_file
Run the resulting binary:
./output_file
1. Host Code: Written in C++ and compiled with a host compiler (e.g.,
GCC, MSVC). The output is host assembly code (x86, ARM, etc.),
executed on the CPU.
2. Device Code: CUDA kernels are compiled into .ptx (virtual ISA) code. At runtime, the GPU's JIT (Just-In-Time) compiler translates .ptx into device-specific assembly (e.g., SASS), which the GPU executes; the commands below show both stages.
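For example, the two stages can be seen from the command line (vector_add.cu is a placeholder source-file name; the flags shown are standard nvcc and cuobjdump options):
nvcc -ptx vector_add.cu -o vector_add.ptx    # compile the device code only, emitting the virtual ISA (.ptx)
nvcc -arch=sm_80 vector_add.cu -o vector_add # build an executable targeting a specific GPU architecture
cuobjdump -sass vector_add                   # inspect the device-specific SASS embedded in the executable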
Example:
#include <cstdio>
#include <cuda_runtime.h>
__global__ void gpuKernel() {
    printf("Hello from GPU thread %d!\n", threadIdx.x);
}

int main() {
    gpuKernel<<<1, 10>>>();  // Launch kernel: 1 block of 10 threads
    cudaDeviceSynchronize(); // Synchronize CPU and GPU
    return 0;
}
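A note on the synchronization call: device-side printf output is delivered to the host at synchronization points, so without cudaDeviceSynchronize() the program could exit before the GPU's messages appear.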
Kernel launches are asynchronous: control returns to the CPU immediately after the launch, while the GPU executes the kernel in the background.
Example:
int main() {
    const int n = 1024;
    int *a, *b, *c;       // Host pointers (setup omitted for brevity)
    int *d_a, *d_b, *d_c; // Device pointers

    // Allocate device memory
    cudaMalloc((void **)&d_a, n * sizeof(int));
    cudaMalloc((void **)&d_b, n * sizeof(int));
    cudaMalloc((void **)&d_c, n * sizeof(int));

    // Launch kernel: the call returns immediately, without waiting for the GPU
    vectorAdd<<<1, n>>>(d_a, d_b, d_c, n);

    // This line can execute while the kernel is still running on the GPU
    printf("Kernel launched asynchronously.\n");

    // Free memory
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
To ensure that all GPU operations have completed before proceeding, use
cudaDeviceSynchronize(). This function blocks the CPU until all preceding
GPU tasks are completed.
Syntax:
cudaError_t cudaDeviceSynchronize();
Usage:
int main() {
    const int n = 1024;
    int *h_a, *h_b, *h_c; // Host arrays
    int *d_a, *d_b, *d_c; // Device arrays

    // Allocate host and device memory (initialization and copies omitted for brevity)
    h_a = (int *)malloc(n * sizeof(int));
    h_b = (int *)malloc(n * sizeof(int));
    h_c = (int *)malloc(n * sizeof(int));
    cudaMalloc((void **)&d_a, n * sizeof(int));
    cudaMalloc((void **)&d_b, n * sizeof(int));
    cudaMalloc((void **)&d_c, n * sizeof(int));

    // Launch kernel
    vectorAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

    cudaDeviceSynchronize(); // Synchronize CPU and GPU: block until the kernel has finished

    // Free memory
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
Key Takeaways:
1. NVCC splits a CUDA program into host code, compiled by a standard C++ compiler, and device code, compiled to .ptx and JIT-translated into GPU-specific SASS at runtime.
2. Kernel launches are asynchronous: the CPU continues executing immediately after the launch.
3. Use cudaDeviceSynchronize() whenever the CPU must wait for all preceding GPU work to complete, for example before reading back results.