Class 10
dim3 gridDim;
Dimension of the grid in blocks: gridDim.x, gridDim.y, gridDim.z

dim3 blockDim;
Dimensions of the block in number of threads: blockDim.x, blockDim.y, blockDim.z

uint3 blockIdx;
Block index within the grid (starting with 0)

uint3 threadIdx;
Thread index within the block (starting with 0)

Note: any dim3 dimension not specified is initialized to 1
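As a quick illustration (a minimal sketch; the kernel name printIds is made up here), a kernel can read these built-in variables directly. Device-side printf requires compute capability 2.0 or later.

#include <cstdio>

__global__ void printIds()
{
    // gridDim/blockDim describe the launch configuration;
    // blockIdx/threadIdx locate this particular thread within it.
    printf("block %u of %u, thread %u of %u\n",
           blockIdx.x, gridDim.x, threadIdx.x, blockDim.x);
}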
Threads on GPU
Threads are organized in blocks; blocks are grouped into a grid; and threads are
executed in the kernel as a grid of blocks of threads, all computing the same function.

Each block is a 3D array of threads defined by the dimensions Dx, Dy, and Dz,
which you specify.

Each CUDA card has a maximum number of threads per block (512 for compute
capability 1.x; 1024 for compute capability 2.0 and later).

Each thread has a thread index, threadIdx: (x, y, z);
0 ≤ x < Dx, 0 ≤ y < Dy, 0 ≤ z < Dz, where Dx, Dy, Dz are the block dimensions;
Dx * Dy * Dz ≤ max threads per block

Each thread also has a thread id: threadId = x + y Dx + z Dx Dy
The threadId is like a 1D representation of an array in memory.

If you are working with 1D vectors, then Dy and Dz would be 1; then
threadIdx is (x, 0, 0) and threadId is x. Working with 2D arrays, Dz would be 1.
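For example (a sketch; the kernel name myKernel and the sizes are only illustrative), specifying all three block dimensions on the host:

dim3 threadsPerBlock(8, 8, 4);       // Dx = 8, Dy = 8, Dz = 4; 8*8*4 = 256 <= 1024
myKernel<<<1, threadsPerBlock>>>();  // one block of 256 threads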
threadId in different kinds of blocks
1D block: threadId = x
2D block: threadId = x + y Dx
3D block: threadId = x + y Dx + z Dx Dy
Max threads per block: 512 for compute capability 1.x; 1024 for compute capability 2.0 and later
Thread Index (threadIdx) and threadId
In 1-D: For one block, the unique threadId of the thread with index (x) is threadId = x,
or threadIdx.x = x. Maximum problem size: 1024 threads (one block).

In 2-D, with a block of size (Dx, Dy), the unique threadId of the
thread with index (x, y) is: threadId = x + y Dx

threadIdx.x = x; threadIdx.y = y

In 3-D, with a block of size (Dx, Dy, Dz), the unique threadId of the
thread with index (x, y, z) is: threadId = x + y Dx + z Dx Dy

threadIdx.x = x; threadIdx.y = y; threadIdx.z = z


Total number of threads = threads_per_block * number_of_blocks

Max number of threads_per_block = 1024 for compute capability 2.0+
Max dimensions of a thread block: (1024, 1024, 64), but max total threads is still 1024

Typical block sizes: (16, 16), (32, 32); the optimum size depends on the program
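A minimal sketch (the kernel name flatId is illustrative) of computing this flat id from the built-in indices inside a kernel:

__global__ void flatId()
{
    // threadId = x + y*Dx + z*Dx*Dy, with (Dx, Dy, Dz) = blockDim
    int threadId = threadIdx.x
                 + threadIdx.y * blockDim.x
                 + threadIdx.z * blockDim.x * blockDim.y;
    // threadId runs from 0 to blockDim.x*blockDim.y*blockDim.z - 1
}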
Grids of Blocks
When you have more data than the maximum number of threads per block,
handle the additional threads with more blocks.

A grid is a 1D (x), 2D (x, y), or 3D (x, y, z) array of blocks

(gridDim.x, gridDim.y, and gridDim.z)

Each block has a blockIdx, which is the index of the block within the grid

(blockIdx.x, blockIdx.y, blockIdx.z)


Remember, each thread has a threadIdx within the block; it is 3D:
threadIdx.x, threadIdx.y, and threadIdx.z
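Combining blockIdx, blockDim, and threadIdx gives each thread a unique global index across the whole grid; a minimal 1D sketch (the kernel name is illustrative):

__global__ void globalIndex1D()
{
    // Unique index across all blocks in a 1D grid of 1D blocks
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // i runs from 0 to gridDim.x * blockDim.x - 1
}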
Grid of Blocks
One block is too small to handle most GPU problems. Need a grid of blocks.
Blocks can be in 1-D, 2-D, or 3-D grids of thread blocks. All blocks are the same size.

The number of thread blocks usually depends on the number of threads needed for a
particular problem.

Example for a 1D grid of 2D blocks:

int main()
{
    const int N = 16;              // threads per block side (example value)
    int numBlocks = 16;            // 1D grid of 16 blocks
    dim3 threadsPerBlock(N, N);    // each block is N x N x 1 threads

    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
}

Each block is identified by the built-in variable blockIdx. The dimensions of the block are
given by the built-in blockDim variable (Dx, Dy, Dz). This is the same as threadsPerBlock.
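A kernel matching this launch might look like the following sketch. It assumes A, B, and C are device pointers to row-major matrices with numBlocks*N rows and N columns, so each block adds one N x N tile; that layout is an assumption for illustration, not stated on the slide.

__global__ void MatAdd(float *A, float *B, float *C)
{
    int col = threadIdx.x;                            // 0 .. blockDim.x - 1
    int row = blockIdx.x * blockDim.y + threadIdx.y;  // each block handles one tile of rows
    int idx = row * blockDim.x + col;                 // row-major element index
    C[idx] = A[idx] + B[idx];                         // matrices assumed already on the device
}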
Max grid dimensions (x, y, z): (65535, 65535, 65535) for compute capability < 3.0;
(2147483647, 65535, 65535) for compute capability 3.0 and later
2D Grid
dim3 GridDim(3, 2);   // 3 x 2 grid of blocks
dim3 BlockDim(4, 3);  // Dx = 4, Dy = 3; Dz defaults to 1

[Figure: a 2D grid of 2D blocks; in the kernel each thread's id is computed from its
thread index, e.g. threadId = 9 x 16 + 8 for the highlighted thread.]
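As a sketch of how a thread locates itself in such a 2D grid of 2D blocks (the kernel and variable names here are illustrative):

__global__ void whereAmI()
{
    // Global (column, row) position of this thread across the whole grid
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // 0 .. gridDim.x*blockDim.x - 1
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // 0 .. gridDim.y*blockDim.y - 1

    // Flattened id, numbering threads row by row across the whole grid
    int globalId = row * (gridDim.x * blockDim.x) + col;
}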
Class Problem:
Write a kernel to carry out the following: y[i] = a * x[i] + y[i];
saxpy(int n, float a, float *x, float *y)

#include <iostream>
using namespace std;

int main()
{
    float * x;    // host arrays
    float * y;
    float * d_x;  // device arrays
    float * d_y;
    int n = 1048576;

    x = new float[n];
    y = new float[n];

    // initialize x, y; a is set in the kernel call
    for (int i = 0; i < n; i++)
    {
        x[i] = (float)i;
        y[i] = (float)i;
    }

    cudaMalloc(&d_x, n*sizeof(float));
    cudaMalloc(&d_y, n*sizeof(float));
    cudaMemcpy(d_x, x, n*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, n*sizeof(float), cudaMemcpyHostToDevice);

    saxpy<<<4096,256>>>(n, 2.0, d_x, d_y);   // 4096 * 256 = 1048576
Answer: SAXPY kernel
__global__ void saxpy(int n, float a, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}
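If n were not an exact multiple of the block size, the usual pattern (a sketch, not part of the slide) is to round the number of blocks up and rely on the if (i < n) guard:

int blockSize = 256;
int numBlocks = (n + blockSize - 1) / blockSize;   // ceiling division
saxpy<<<numBlocks, blockSize>>>(n, 2.0f, d_x, d_y);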