
Parallel Programming

Parallel Execution in CUDA


(Part 2)

Phạm Trọng Nghĩa


[email protected]
Overview
Apply knowledge of parallel execution in CUDA to
write a fast CUDA program doing “reduction”
• The “reduction” task
• Sequential implementation
• Parallel implementation
• Kernel function – 1st version
• Kernel function – 2nd version: reduce warp divergence
• Kernel function – 3rd version: reduce warp divergence + …

The “reduction” task
Input: an array in of n numbers
Output: the sum (or product, max, min, …) of these numbers

Example (n = 8):
in: 1 9 5 1 6 4 7 2
Reduce – sum: 35
Reduction algorithm
• Sequential sum reduction:

sum = 0;
for (int i = 0; i < N; i++) {
    sum += input[i];
}

• General form of a sequential reduction:

acc = IDENTITY;
for (int i = 0; i < N; i++) {
    acc = Operator(acc, input[i]);
}
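For instance, a max reduction instantiates IDENTITY with the smallest
representable value and Operator with max. A minimal C sketch (the
function name maxReduce is our own illustration, not from the slides):

#include <limits.h>

// Sequential max reduction: IDENTITY = INT_MIN, Operator = max
int maxReduce(const int *input, int N) {
    int acc = INT_MIN; // identity element: x max INT_MIN == x
    for (int i = 0; i < N; i++)
        acc = (input[i] > acc) ? input[i] : acc;
    return acc;
}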

Sequential implementation

int reduceOnHost(int *in, int n)
{
    int s = in[0];
    for (int i = 1; i < n; i++)
        s += in[i];
    return s;
}

For in = 1 9 5 1 6 4 7 2, the running sum after each of the 7 time
steps (one “+” per step) is: 10, 15, 16, 22, 26, 33, 35.

Time (# time steps): 7 = n-1 = O(n)
Work (# pluses): 7 = n-1 = O(n)
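A minimal driver for reduceOnHost (our own sketch, checking it on the
example array above):

#include <stdio.h>

int main() {
    int in[] = {1, 9, 5, 1, 6, 4, 7, 2};
    printf("%d\n", reduceOnHost(in, 8)); // prints 35
    return 0;
}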
Parallel implementation – idea
Reduction tree for in = 1 9 5 1 6 4 7 2:
Time step 1 (4 additions in parallel): 10 6 10 9
Time step 2 (2 additions in parallel): 16 19
Time step 3 (1 addition): 35

Time: ?
Work: ?

Parallel implementation – idea

• For N input values, the reduction tree performs
  • N/2 + N/4 + N/8 + … + 1 = N - 1 operations (see the sum below)
  • In log2(N) steps – 1,000,000 input values take 20 steps
  • Assuming that we have enough execution resources
• Average parallelism: (N-1)/log2(N)
  • For N = 1,000,000, average parallelism is 50,000
  • However, the peak resource requirement is 500,000 execution units
  • This is not resource efficient
• This is a work-efficient parallel algorithm
  • The amount of work done is comparable to an efficient sequential
    algorithm
  • Many parallel algorithms are not work efficient
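The operation count is a geometric series; in LaTeX:

\sum_{k=1}^{\log_2 N} \frac{N}{2^k} = \frac{N}{2} + \frac{N}{4} + \dots + 1 = N - 1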
Reduction trees
• The order of performing the operations changes
  (sequential ➔ parallel)
  • The operator must be associative
  • Serial:
    ((((((3 max 1) max 7) max 0) max 4) max 1) max 6) max 3
  • Parallel:
    ((3 max 1) max (7 max 0)) max ((4 max 1) max (6 max 3))
• If we also rearrange the order of the operands, the operator must
  be commutative
Parallel Reduction in Real life
• Sports & competitions: max reduction
• Also used to process large input data sets (Google and Hadoop
  MapReduce frameworks)
  • There is no required order of processing elements in a data set
    (associative and commutative)
  • Partition the data set into smaller chunks
  • Have each thread process a chunk
  • Use a reduction tree to summarize the results from each chunk into
    the final answer
Parallel implementation – idea
Reduction tree for in = 1 9 5 1 6 4 7 2:
Time step 1 (4 additions in parallel): 10 6 10 9
Time step 2 (2 additions in parallel): 16 19
Time step 3 (1 addition): 35

Time: 3 = log2(n) = O(log2(n))
Work: 7 = n-1 = O(n) = work of the sequential version
(Later, we will see tasks in which parallel implementations need to
do more work than sequential ones)

We need synchronization before each next step. But in a kernel
function, we can only synchronize threads in the same block.
If n ≤ 2×block-size, we can use a kernel with one block.
If n > 2×block-size, what should we do?

A simple reduction kernel

__global__ void reduceBlksKernel0(int* in, int* out, int n) {
    // One block; thread threadIdx.x owns element 2 * threadIdx.x,
    // assuming n = 2 * blockDim.x
    int i = 2 * threadIdx.x;
    for (int stride = 1; stride <= blockDim.x; stride *= 2) {
        if (threadIdx.x % stride == 0)
            in[i] += in[i + stride];
        __syncthreads(); // wait for all additions of this step
    }
    if (threadIdx.x == 0)
        *out = in[0];
}
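A minimal host-side launch for this kernel (our own sketch; the names
d_in, d_out, h_out are assumptions, not from the slides):

int n = 8;
int h_in[] = {1, 9, 5, 1, 6, 4, 7, 2}, h_out;
int *d_in, *d_out;
cudaMalloc(&d_in, n * sizeof(int));
cudaMalloc(&d_out, sizeof(int));
cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);
reduceBlksKernel0<<<1, n / 2>>>(d_in, d_out, n); // one block, n/2 threads
cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
// h_out == 35
cudaFree(d_in); cudaFree(d_out);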

Hierarchical reduction for bigger input

Parallel implementation
– idea to reduce within each block
Consider a block of 4 threads working on 8 elements (data of the
previous block on the left, data of the next block on the right):

Start:                          1  9  5  1  6  4  7  2
Stride 1, threadIdx.x 0 1 2 3: 10  9  6  1 10  4  9  2
Stride 2, threadIdx.x 0 2:     16  9  6  1 19  4  9  2
Stride 4, threadIdx.x 0:       35  9  6  1 19  4  9  2
Hierarchical reduction for arbitrary
input length
__global__ void reduceBlksKernel1(int* in, int* out, int n) {
    // Each block reduces a chunk of 2 * blockDim.x elements
    int i = blockIdx.x * 2 * blockDim.x + 2 * threadIdx.x;
    for (int stride = 1; stride <= blockDim.x; stride *= 2) {
        if (threadIdx.x % stride == 0)
            if (i + stride < n)
                in[i] += in[i + stride];
        __syncthreads(); // Synchronize within each block
    }
    // Thread 0 adds this block's sum into the global result
    if (threadIdx.x == 0)
        atomicAdd(out, in[blockIdx.x * 2 * blockDim.x]);
}

Assume: 2×block-size = 2^k
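A possible host-side launch (a sketch under our own assumptions:
blockSize = 256, and d_out must be zeroed first because the kernel
accumulates into *out with atomicAdd):

int blockSize = 256; // so 2 × block-size = 512 = 2^9
int numBlocks = (n + 2 * blockSize - 1) / (2 * blockSize);
cudaMemset(d_out, 0, sizeof(int)); // atomicAdd accumulates into *out
reduceBlksKernel1<<<numBlocks, blockSize>>>(d_in, d_out, n);
cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);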
In each block, how many diverged warps?
(not considering blocks at the edge)

• Stride = 1:
  • All threads are “on”
  • → No diverged warp
• Stride = 2:
  • Only threads with threadIdx.x % 2 == 0 are “on”
  • → All warps are diverged
• Stride = 4, 8, …, 32:
  • All warps are diverged
• Stride = 64, 128, …:
  • # diverged warps decreases to 1
  • E.g., with 128 threads (4 warps): stride = 64 leaves only threads
    0 and 64 “on” (2 diverged warps); stride = 128 leaves only
    thread 0 (1 diverged warp)
Kernel function – 2nd version:
reduce warp divergence
• Idea: reduce the number of diverged warps in each step by
  rearranging the thread-to-data mapping so that the “on” threads are
  the first, adjacent threads
• Example: consider a block of 128 threads
  • Stride = 1: All 128 threads are “on”
  • Stride = 2: The first 64 threads are “on”, the rest are “off”
  • Stride = 4: The first 32 threads are “on”, the rest are “off”
  • …

Kernel function – 2nd version:
reduce warp divergence
Consider a block of 4 threads working on 8 elements (data of the
previous block on the left, data of the next block on the right):

Start:                          1  9  5  1  6  4  7  2
Stride 1, threadIdx.x 0 1 2 3: 10  9  6  1 10  4  9  2
Stride 2, threadIdx.x 0 1:     16  9  6  1 19  4  9  2
Stride 4, threadIdx.x 0:       35  9  6  1 19  4  9  2
Kernel function – 2nd version:
reduce warp divergence
__global__ void reduceOnDevice2(int *in, int *out, int n)
{
    int numElemsBeforeBlk = blockIdx.x * blockDim.x * 2;

    for (int stride = 1; stride < 2 * blockDim.x; stride *= 2)
    {
        int i = numElemsBeforeBlk + ...; // fill in the blank
        if (threadIdx.x ...)             // fill in the blank
            if (i + stride < n)
                in[i] += in[i + stride];
        __syncthreads(); // Synchronize within each block
    }

    if (threadIdx.x == 0)
        out[blockIdx.x] = in[numElemsBeforeBlk];
}

Kernel function – 2nd version:
reduce warp divergence
__global__ void reduceOnDevice2(int *in, int *out, int n)
{
    int numElemsBeforeBlk = blockIdx.x * blockDim.x * 2;

    for (int stride = 1; stride < 2 * blockDim.x; stride *= 2)
    {
        // Consecutive threads handle elements 2*stride apart, so the
        // “on” threads are always the first, adjacent ones
        int i = numElemsBeforeBlk + threadIdx.x * 2 * stride;
        if (threadIdx.x < blockDim.x / stride)
            if (i + stride < n)
                in[i] += in[i + stride];
        __syncthreads(); // Synchronize within each block
    }

    // Each block writes its partial sum to out[blockIdx.x]
    if (threadIdx.x == 0)
        out[blockIdx.x] = in[numElemsBeforeBlk];
}
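Unlike reduceBlksKernel1, this version leaves one partial sum per
block in out[], so a final pass is still needed. A minimal host-side
sketch (our own illustration, assuming numBlocks and d_out are set up
as in the earlier launch sketch):

int *h_partial = (int *)malloc(numBlocks * sizeof(int));
cudaMemcpy(h_partial, d_out, numBlocks * sizeof(int),
           cudaMemcpyDeviceToHost);
int total = 0;
for (int b = 0; b < numBlocks; b++)
    total += h_partial[b]; // sum the per-block partial results
free(h_partial);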

Kernel function – 3rd version:
reduce warp divergence + ?
Consider a block of 4 threads working on 8 elements (data of the
previous block on the left, data of the next block on the right):

Start:                          1  9  5  1  6  4  7  2
Stride 4, threadIdx.x 0 1 2 3:  7 13 12  3  6  4  7  2
Stride 2, threadIdx.x 0 1:     19 16 12  3  6  4  7  2
Stride 1, threadIdx.x 0:       35 16 12  3  6  4  7  2
Kernel function – 3rd version:
reduce warp divergence + ?
Code: your next homework ;-)

Reference
• [1] Wen-mei W. Hwu, David B. Kirk, and Izzat El Hajj. Programming
  Massively Parallel Processors: A Hands-on Approach. Morgan
  Kaufmann, 2022.
• [2] John Cheng, Max Grossman, and Ty McKercher. Professional CUDA
  C Programming. John Wiley & Sons, 2014.

THE END
