Class 10
dim3 gridDim;
Dimension of the grid in blocks: gridDim.x, gridDim.y, gridDim.z

dim3 blockDim;
Dimensions of the block in number of threads: blockDim.x, blockDim.y, blockDim.z

uint3 blockIdx;
Block index within the grid (starting with 0)

uint3 threadIdx;
Thread index within the block (starting with 0)

Note: any dim3 dimension not specified is initialized to 1
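As a quick illustration (a minimal sketch; the kernel name printIds is made up here), a kernel can read these built-in variables directly. Device-side printf requires compute capability 2.0 or later.

#include <cstdio>

__global__ void printIds()
{
    // gridDim/blockDim describe the launch configuration;
    // blockIdx/threadIdx locate this particular thread within it.
    printf("block %u of %u, thread %u of %u\n",
           blockIdx.x, gridDim.x, threadIdx.x, blockDim.x);
}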
Threads on GPU
Threads are organized in blocks; blocks are grouped into a grid; and threads are
executed in the kernel as a grid of blocks of threads, all computing the same function.

Each block is a 3D array of threads defined by the dimensions Dx, Dy, and Dz,
which you specify.

Each CUDA card has a maximum number of threads per block (512 for compute
capability 1.x; 1024 for compute capability 2.0 and later).

Each thread has a thread index, threadIdx: (x, y, z);
0 ≤ x < Dx, 0 ≤ y < Dy, 0 ≤ z < Dz, where Dx, Dy, Dz are the block dimensions;
Dx * Dy * Dz ≤ max threads per block

Each thread also has a thread id: threadId = x + y Dx + z Dx Dy
The threadId is like a 1D representation of an array in memory.

If you are working with 1D vectors, then Dy and Dz would be 1; then
threadIdx is (x, 0, 0) and threadId is x. Working with 2D arrays, Dz would be 1.
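For example (a sketch; the kernel name myKernel and the sizes are only illustrative), specifying all three block dimensions on the host:

dim3 threadsPerBlock(8, 8, 4);       // Dx = 8, Dy = 8, Dz = 4; 8*8*4 = 256 <= 1024
myKernel<<<1, threadsPerBlock>>>();  // one block of 256 threads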
threadId in different kinds of blocks
1D block: threadId = x
2D block: threadId = x + y Dx
3D block: threadId = x + y Dx + z Dx Dy
Max threads per block: 512 for compute capability 1.x; 1024 for compute capability 2.0 and later
Thread Index (threadIdx) and threadId
In 1-D: For one block, the unique threadId of the thread with index (x) is threadId = x,
or threadIdx.x = x. Maximum problem size: 1024 threads (one block).

In 2-D, with a block of size (Dx, Dy), the unique threadId of the
thread with index (x, y) is: threadId = x + y Dx

threadIdx.x = x; threadIdx.y = y

In 3-D, with a block of size (Dx, Dy, Dz), the unique threadId of the
thread with index (x, y, z) is: threadId = x + y Dx + z Dx Dy

threadIdx.x = x; threadIdx.y = y; threadIdx.z = z


Total number of threads = threads_per_block * number_of_blocks

Max number of threads_per_block = 1024 for compute capability 2.0+
Max dimensions of a thread block: (1024, 1024, 64), but max total threads is still 1024

Typical block sizes: (16, 16), (32, 32); the optimum size depends on the program
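A minimal sketch (the kernel name flatId is illustrative) of computing this flat id from the built-in indices inside a kernel:

__global__ void flatId()
{
    // threadId = x + y*Dx + z*Dx*Dy, with (Dx, Dy, Dz) = blockDim
    int threadId = threadIdx.x
                 + threadIdx.y * blockDim.x
                 + threadIdx.z * blockDim.x * blockDim.y;
    // threadId runs from 0 to blockDim.x*blockDim.y*blockDim.z - 1
}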
Grids of Blocks
When you have more data than the maximum number of threads per block,
handle the additional threads with more blocks.

A grid is a 1D (x), 2D (x, y), or 3D (x, y, z) array of blocks

(gridDim.x, gridDim.y, and gridDim.z)

Each block has a blockIdx, which is the index of the block within the grid

(blockIdx.x, blockIdx.y, blockIdx.z)


Remember, each thread has a threadIdx within the block; it is 3D:
threadIdx.x, threadIdx.y, and threadIdx.z
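Combining blockIdx, blockDim, and threadIdx gives each thread a unique global index across the whole grid; a minimal 1D sketch (the kernel name is illustrative):

__global__ void globalIndex1D()
{
    // Unique index across all blocks in a 1D grid of 1D blocks
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // i runs from 0 to gridDim.x * blockDim.x - 1
}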
Grid of Blocks
One block is too small to handle most GPU problems. Need a grid of blocks.
Blocks can be in 1-D, 2-D, or 3-D grids of thread blocks. All blocks are the same size.

The number of thread blocks usually depends on the number of threads needed for a
particular problem.

Example for a 1D grid of 2D blocks:

int main()
{
    const int N = 16;              // threads per block side (example value)
    int numBlocks = 16;            // 1D grid of 16 blocks
    dim3 threadsPerBlock(N, N);    // each block is N x N x 1 threads

    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
}

Each block is identified by the built-in variable blockIdx. The dimensions of the block are
given by the built-in blockDim variable (Dx, Dy, Dz). This is the same as threadsPerBlock.
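A kernel matching this launch might look like the following sketch. It assumes A, B, and C are device pointers to row-major matrices with numBlocks*N rows and N columns, so each block adds one N x N tile; that layout is an assumption for illustration, not stated on the slide.

__global__ void MatAdd(float *A, float *B, float *C)
{
    int col = threadIdx.x;                            // 0 .. blockDim.x - 1
    int row = blockIdx.x * blockDim.y + threadIdx.y;  // each block handles one tile of rows
    int idx = row * blockDim.x + col;                 // row-major element index
    C[idx] = A[idx] + B[idx];                         // matrices assumed already on the device
}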
Max grid dimensions (x, y, z): (65535, 65535, 65535) for compute capability < 3.0;
(2147483647, 65535, 65535) for compute capability 3.0 and later
2D Grid
dim3 GridDim(3, 2);   // 3 x 2 grid of blocks
dim3 BlockDim(4, 3);  // Dx = 4, Dy = 3; Dz defaults to 1

[Figure: a 2D grid of 2D blocks; in the kernel each thread's id is computed from its
thread index, e.g. threadId = 9 x 16 + 8 for the highlighted thread.]
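As a sketch of how a thread locates itself in such a 2D grid of 2D blocks (the kernel and variable names here are illustrative):

__global__ void whereAmI()
{
    // Global (column, row) position of this thread across the whole grid
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // 0 .. gridDim.x*blockDim.x - 1
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // 0 .. gridDim.y*blockDim.y - 1

    // Flattened id, numbering threads row by row across the whole grid
    int globalId = row * (gridDim.x * blockDim.x) + col;
}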
Class Problem:
Write a kernel to carry out the following: y[i] = a * x[i] + y[i];
saxpy(int n, float a, float *x, float *y)

#include <iostream>
using namespace std;

int main()
{
    float * x;    // host arrays
    float * y;
    float * d_x;  // device arrays
    float * d_y;
    int n = 1048576;

    x = new float[n];
    y = new float[n];

    // initialize x, y; a is set in the kernel call
    for (int i = 0; i < n; i++)
    {
        x[i] = (float)i;
        y[i] = (float)i;
    }

    cudaMalloc(&d_x, n*sizeof(float));
    cudaMalloc(&d_y, n*sizeof(float));
    cudaMemcpy(d_x, x, n*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, n*sizeof(float), cudaMemcpyHostToDevice);

    saxpy<<<4096,256>>>(n, 2.0, d_x, d_y);   // 4096 * 256 = 1048576
Answer: SAXPY kernel
__global__ void saxpy(int n, float a, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}
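If n were not an exact multiple of the block size, the usual pattern (a sketch, not part of the slide) is to round the number of blocks up and rely on the if (i < n) guard:

int blockSize = 256;
int numBlocks = (n + blockSize - 1) / blockSize;   // ceiling division
saxpy<<<numBlocks, blockSize>>>(n, 2.0f, d_x, d_y);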