2023 CSC14120 Lecture04 CUDA Parallel Execution (P2)
2
The “reduction” task
Input: an array in of n numbers
Output: sum (or product, max, min, …) of these numbers
Example: n = 8, in = 1 9 5 1 6 4 7 2 → reduce (sum) = 35
3
Reduction algorithm
• Sequential sum reduction:
sum = 0;
for (int i = 0; i < N; i++) {
    sum += input[i];
}
4
Sequential implementation
int reduceOnHost(int *in, int n)
{
    int s = in[0];
    for (int i = 1; i < n; i++)
        s += in[i];
    return s;
}
in: 1 9 5 1 6 4 7 2
Running sum after time steps 1..7: 10, 15, 16, 22, 26, 33, 35
Time (# time steps): 7 = n-1 = O(n)
Work (# pluses): 7 = n-1 = O(n)
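A quick host-side check of reduceOnHost on this slide's example array (this small main is only an illustrative harness, not part of the lecture code; place it after the function above):

#include <stdio.h>
int main()
{
    int in[] = {1, 9, 5, 1, 6, 4, 7, 2}; // example array from this slide
    printf("sum = %d\n", reduceOnHost(in, 8)); // prints sum = 35
    return 0;
}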
5
Parallel implementation – idea
in: 1 9 5 1 6 4 7 2
Time step 1: 1+9, 5+1, 6+4, 7+2 → 10 6 10 9
Time step 2: 10+6, 10+9 → 16 19
Time step 3: 16+19 → 35
Time: ?
Work: ?
6
Parallel implementation – idea
8
Parallel Reduction in Real life
• Sports & competitions: max reduction
• Also used to process large input data sets (Google and Hadoop MapReduce frameworks)
• There is no required order of processing the elements of a data set (the operator is associative and commutative); see the quick check after this list
• Partition the data set into smaller chunks
• Have each thread process a chunk
• Use a reduction tree to summarize the results from each chunk into the final answer
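A quick check with the example array used throughout this lecture: because addition is associative and commutative, grouping the elements as (1+9) + (5+1) + (6+4) + (7+2) = 10 + 6 + 10 + 9 = 35 gives the same answer as summing them left to right.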
9
Parallel implementation – idea
in: 1 9 5 1 6 4 7 2
Time step 1: → 10 6 10 9
Time step 2: → 16 19
Time step 3: → 35
Need synchronization before each next step
But: in a kernel function, we can only synchronize threads in the same block
If n ≤ 2×block-size, we can use a kernel with one block
If n > 2×block-size, what should we do?
Time: 3 = log2(n) = O(log2 n)
Work: 7 = n-1 = O(n) = work of the sequential version
(Later, we will see tasks in which parallel implementations need to do more work than the sequential version)
10
A simple reduction kernel
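A minimal sketch of such a kernel, assuming a single-block launch and n ≤ 2×blockDim.x (the kernel name reduceOneBlockKernel is an assumption, not necessarily the lecture's exact code); it follows the tree pattern shown above:

__global__ void reduceOneBlockKernel(int *in, int *out, int n)
{
    int i = 2 * threadIdx.x; // each thread owns a pair of elements
    for (int stride = 1; stride <= blockDim.x; stride *= 2)
    {
        if (threadIdx.x % stride == 0 && i + stride < n)
            in[i] += in[i + stride];
        __syncthreads(); // wait for all partial sums before the next step
    }
    if (threadIdx.x == 0)
        *out = in[0]; // thread 0 writes the final sum
}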
12
Hierarchical reduction for bigger input
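Idea (detailed on the next slides): each block reduces its own chunk of 2×blockDim.x consecutive elements with a reduction tree, and the per-block results are then combined into the final answer (for example with atomicAdd, as in the kernel below).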
13
Parallel implementation – idea to reduce within each block
Consider a block of 4 threads
Data of previous block | 1 9 5 1 6 4 7 2 | Data of next block
Active threadIdx.x: 0 1 2 3 (stride 1) → 10 9 6 1 10 4 9 2
Active threadIdx.x: 0 2 (stride 2) → 16 9 6 1 19 4 9 2
Active threadIdx.x: 0 (stride 4) → 35 9 6 1 19 4 9 2
14
Hierarchical reduction for arbitrary input length
__global__ void reduceBlksKernel1(int* in, int* out, int n)
{
    int i = blockIdx.x * 2 * blockDim.x + 2 * threadIdx.x;
    for (int stride = 1; stride <= blockDim.x; stride *= 2)
    {
        if (threadIdx.x % stride == 0)
            if (i + stride < n)
                in[i] += in[i + stride];
        __syncthreads(); // Synchronize within each block
    }
    if (threadIdx.x == 0)
        atomicAdd(out, in[blockIdx.x * 2 * blockDim.x]);
}
Assume: 2×block-size = 2^k
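A host-side launch sketch for this kernel (a minimal example with assumed names and sizes: h_in is the host input array, n and blockSize are illustrative values; error checking omitted). Note that each block covers 2×blockDim.x elements and that *out must be zeroed before the kernel accumulates into it with atomicAdd:

int n = 1 << 24;                              // example input size (assumption)
int blockSize = 256;                          // example block size (assumption)
int gridSize = (n - 1) / (2 * blockSize) + 1; // each block reduces 2*blockSize elements
int *d_in, *d_out, sum;
cudaMalloc((void **)&d_in, n * sizeof(int));
cudaMalloc((void **)&d_out, sizeof(int));
cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice); // h_in: host input (assumed)
cudaMemset(d_out, 0, sizeof(int));            // result starts at 0 for atomicAdd
reduceBlksKernel1<<<gridSize, blockSize>>>(d_in, d_out, n);
cudaMemcpy(&sum, d_out, sizeof(int), cudaMemcpyDeviceToHost);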
15
In each block, how many diverged warps?
(not considering blocks at the edge)
• Stride = 1:
• All threads are "on"
• → No diverged warps
• Stride = 2:
• Only threads with threadIdx.x % 2 == 0 are "on"
• → All warps are diverged
• Stride = 4, 8, …, 32:
• All warps are diverged
• Stride = 64, 128, …:
• → The number of diverged warps decreases, down to 1
16
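A quick worked count for reduceBlksKernel1, assuming a 128-thread block (4 warps of 32 threads): at stride = 2 through 32, every warp still contains both "on" and "off" threads, so all 4 warps are diverged; at stride = 64, only threads 0 and 64 are "on", so 2 warps are diverged; at stride = 128, only thread 0 is "on", so 1 warp is diverged.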
Kernel function – 2nd version:
reduce warp divergence
• Idea: reduce the number of diverged warps in each step by rearranging the thread-to-data mapping so that the "on" threads are the first, adjacent threads of the block
• Example: consider a block of 128 threads
• Stride = 1: All 128 threads are “on”
• Stride = 2: First 64 threads are “on”, the rest are “off”
• Stride = 4: First 32 threads are “on”, the rest are “off”
• …
17
Kernel function - 2nd version:
reduce warp divergence
Consider a block of 4 threads
Data of previous block | 1 9 5 1 6 4 7 2 | Data of next block
"On" threadIdx.x: 0 1 2 3 (stride 1) → 10 9 6 1 10 4 9 2
"On" threadIdx.x: 0 1 (stride 2) → 16 9 6 1 19 4 9 2
"On" threadIdx.x: 0 (stride 4) → 35 9 6 1 19 4 9 2
18
Kernel function – 2nd version:
reduce warp divergence
__global__ void reduceOnDevice2(int *in, int *out, int n)
{
    int numElemsBeforeBlk = blockIdx.x * blockDim.x * 2;
    for (int stride = 1; stride <= blockDim.x; stride *= 2) {
        int i = numElemsBeforeBlk + threadIdx.x * 2 * stride; // "on" threads packed at the front (see figure above)
        if (threadIdx.x < blockDim.x / stride && i + stride < n)
            in[i] += in[i + stride];
        __syncthreads(); // Synchronize within each block
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = in[numElemsBeforeBlk]; // This block's partial sum
}
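Unlike version 1, this kernel writes one partial sum per block into out, so a second step is needed to combine the per-block results. A minimal host-side sketch under assumed names (gridSize blocks were launched, d_out is the device output array); alternatively, a reduction kernel could be launched again on out:

int *h_out = (int *)malloc(gridSize * sizeof(int));
cudaMemcpy(h_out, d_out, gridSize * sizeof(int), cudaMemcpyDeviceToHost);
int sum = 0;
for (int b = 0; b < gridSize; b++)
    sum += h_out[b]; // combine the per-block partial sums
free(h_out);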
19
Kernel function - 3rd version:
reduce warp divergence + ?
Consider a block of 4 threads
Data of previous block | 1 9 5 1 6 4 7 2 | Data of next block
Stride = 4, "on" threadIdx.x: 0 1 2 3 → 7 13 12 3 6 4 7 2
Stride = 2, "on" threadIdx.x: 0 1 → 19 16 12 3 6 4 7 2
Stride = 1, "on" threadIdx.x: 0 → 35 16 12 3 6 4 7 2
21
Kernel function - 3rd version:
reduce warp divergence + ?
Code: your next homework ;-)
22
Reference
• [1] Wen-mei W. Hwu, David B. Kirk, and Izzat El Hajj. Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann, 2022.
• [2] John Cheng, Max Grossman, and Ty McKercher. Professional CUDA C Programming. John Wiley & Sons, 2014.
23
THE END
24