Chap9_CUDA Optimization
Optimization
On chip
Off chip (on-board)
B[i,j] = A[j,i]
__shared__ S[];
S[i,j] = A[i,j];
B[i,j] = S[j,i];
Bank conflict: S[j,i] is read column-wise
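The shared-memory transpose above can be sketched as a complete kernel. This is a minimal sketch, not the lecture's own code: the tile width `TILE = 32`, the kernel name `transposeShared`, and the host helper `transposeMatchesReference` are all assumptions made for illustration, and the matrix size `n` is assumed to be a multiple of `TILE`. On devices with 32 four-byte banks, the column-wise read of `S` in the last line of the kernel makes all 32 threads of a warp hit the same bank.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE = 32;  // assumed tile width; one warp handles one tile row

// B[i][j] = A[j][i] for an n x n row-major matrix (n a multiple of TILE).
__global__ void transposeShared(const float* A, float* B, int n) {
    __shared__ float S[TILE][TILE];

    int col = blockIdx.x * TILE + threadIdx.x;
    int row = blockIdx.y * TILE + threadIdx.y;
    S[threadIdx.y][threadIdx.x] = A[row * n + col];  // coalesced read of A, row-wise write to S
    __syncthreads();

    // Swap the block indices so the write to B is coalesced as well.
    int outCol = blockIdx.y * TILE + threadIdx.x;
    int outRow = blockIdx.x * TILE + threadIdx.y;
    B[outRow * n + outCol] = S[threadIdx.x][threadIdx.y];  // column-wise read of S: bank conflict
}

// Hypothetical helper: runs the kernel and compares against a CPU transpose.
bool transposeMatchesReference(int n) {
    std::vector<float> hA(n * n), hB(n * n, 0.0f);
    for (int i = 0; i < n * n; ++i) hA[i] = static_cast<float>(i);

    float *dA = nullptr, *dB = nullptr;
    size_t bytes = sizeof(float) * n * n;
    if (cudaMalloc(&dA, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dB, bytes) != cudaSuccess) { cudaFree(dA); return false; }
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid(n / TILE, n / TILE);
    transposeShared<<<grid, block>>>(dA, dB, n);
    bool ok = cudaMemcpy(hB.data(), dB, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dA);
    cudaFree(dB);

    for (int i = 0; ok && i < n; ++i)
        for (int j = 0; j < n; ++j)
            if (hB[i * n + j] != hA[j * n + i]) { ok = false; break; }
    return ok;
}
```

Calling `transposeMatchesReference(64)` from `main` checks the result on a 64 x 64 matrix. The kernel is correct as written; the bank conflict costs time, not correctness.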
NTHU LSA Lab 33
Example: No Bank Conflict
[Figure: two mappings of Thread0..Thread7 onto Bank0..Bank7. Left: linear addressing. Right: random 1:1 permutation. In both, every bank serves exactly one thread.]
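Both patterns in this example are conflict-free because each bank is hit by at most one thread. The sketch below models that on the host. It assumes 32 banks of 4-byte words and a 32-thread warp (the figure shows only 8 of each), and it assumes every thread reads a different word. The names `bankOf`, `conflictDegree`, and the three sample patterns are hypothetical.

```cuda
#include <cuda_runtime.h>

constexpr int NUM_BANKS = 32;  // assumption: 32 banks, each 4 bytes wide
constexpr int WARP_SIZE = 32;

// Bank holding the 4-byte word at the given shared-memory word index.
__host__ __device__ inline int bankOf(int wordIndex) { return wordIndex % NUM_BANKS; }

// Largest number of threads of one warp that land in the same bank,
// assuming all threads read distinct words. 1 means conflict-free.
template <typename IndexFn>
int conflictDegree(IndexFn wordIndexOfThread) {
    int hits[NUM_BANKS] = {0};
    int worst = 0;
    for (int t = 0; t < WARP_SIZE; ++t) {
        int b = bankOf(wordIndexOfThread(t));
        if (++hits[b] > worst) worst = hits[b];
    }
    return worst;
}

// Linear addressing: thread t reads word t.
int linearDegree() { return conflictDegree([](int t) { return t; }); }

// A 1:1 permutation: (5t + 3) mod 32 is a bijection because gcd(5, 32) = 1.
int permutationDegree() { return conflictDegree([](int t) { return (5 * t + 3) % WARP_SIZE; }); }

// Contrast case: stride 2 sends two threads to every even bank.
int strideTwoDegree() { return conflictDegree([](int t) { return 2 * t; }); }
```

Under these assumptions `linearDegree()` and `permutationDegree()` both return 1, while `strideTwoDegree()` returns 2, i.e. a 2-way conflict.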
[Figure: word indices 0 to 31 (first row) and 32 to 63 (second row)]
Memory padding
Add additional memory space to avoid bank conflicts
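A minimal sketch of padding, assuming a 32 x 32 tile and 32 banks of 4-byte words: one unused word is added to each row of the shared array, so consecutive rows start in different banks. The kernel `transposeTile`, the model function `columnReadDegree`, and the check `paddedTransposeIsCorrect` are hypothetical names introduced here.

```cuda
#include <cuda_runtime.h>

constexpr int TILE = 32;       // assumed tile width (one warp per tile row)
constexpr int NUM_BANKS = 32;  // assumption: 32 banks of 4-byte words

// Transposes one TILE x TILE matrix with a single thread block.
// PAD = 0 gives the conflicting layout, PAD = 1 the padded one.
template <int PAD>
__global__ void transposeTile(const float* in, float* out) {
    __shared__ float S[TILE][TILE + PAD];  // PAD extra, unused words per row
    int x = threadIdx.x, y = threadIdx.y;
    S[y][x] = in[y * TILE + x];
    __syncthreads();
    out[y * TILE + x] = S[x][y];  // a warp walks down one column of S
}

// Host-side model: how many of the 32 threads reading column 0 share a bank
// when each row of S is rowWidth words long.
int columnReadDegree(int rowWidth) {
    int hits[NUM_BANKS] = {0};
    int worst = 0;
    for (int t = 0; t < TILE; ++t) {
        int bank = (t * rowWidth) % NUM_BANKS;
        if (++hits[bank] > worst) worst = hits[bank];
    }
    return worst;
}

// Hypothetical check that the padded kernel still produces the transpose.
bool paddedTransposeIsCorrect() {
    float hIn[TILE * TILE], hOut[TILE * TILE];
    for (int i = 0; i < TILE * TILE; ++i) hIn[i] = static_cast<float>(i);

    float *dIn = nullptr, *dOut = nullptr;
    if (cudaMalloc(&dIn, sizeof(hIn)) != cudaSuccess) return false;
    if (cudaMalloc(&dOut, sizeof(hOut)) != cudaSuccess) { cudaFree(dIn); return false; }
    cudaMemcpy(dIn, hIn, sizeof(hIn), cudaMemcpyHostToDevice);

    transposeTile<1><<<1, dim3(TILE, TILE)>>>(dIn, dOut);
    bool ok = cudaMemcpy(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dIn);
    cudaFree(dOut);

    for (int i = 0; ok && i < TILE; ++i)
        for (int j = 0; j < TILE; ++j)
            if (hOut[i * TILE + j] != hIn[j * TILE + i]) { ok = false; break; }
    return ok;
}
```

With a row width of 32 words, thread t reads word 32t, which always falls in bank 0, so `columnReadDegree(32)` is 32. With a row width of 33, thread t reads word 33t, which falls in bank t mod 32, so `columnReadDegree(33)` is 1. The cost is 32 extra words of shared memory per tile.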
AN EXAMPLE OF CUDA
30x Speedup!
[Figure: labels T1, T1; active warps per step: 4 warps, 2 warps, 1 warp, 1 warp, 1 warp, 1 warp]