Multithreaded Architectures
Lecture 5: Performance Considerations
©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2016
Global Memory (DRAM) Bandwidth
[Figure: ideal vs. actual (reality) DRAM bandwidth]
DRAM Bank Organization
• Each core array has about O(1M) bits
• Each bit is stored in a tiny capacitor accessed through a single transistor
[Figure: Row Addr → Row Decoder → Memory Cell Core Array → Sense Amps → Column Latches → Column Mux (Column Addr); wide internal path, narrow off-chip data pin interface]
A very small (8x2 bit) DRAM Bank
[Figure: 8x2 core array with row decoder, sense amps, and output mux]
DRAM core arrays are slow.
• Reading from a cell in the core array is a very slow process
  – DDR: core speed = ½ interface speed
  – DDR2/GDDR3: core speed = ¼ interface speed
  – DDR3/GDDR4: core speed = ⅛ interface speed
  – … likely to be worse in the future
• About 1000 cells are connected to each vertical line
[Figure: row decode drives the core array; sense amps capture the selected row before the mux narrows it to the output width]

DRAM Bursting
[Figure: 8x2 bank — first part of the burst, read from the core array into the sense amps and column latches]
DRAM Bursting (cont.)
[Figure: 8x2 bank — second part of the burst, served from the column latches without another core-array access]
DRAM Bursting for the 8x4 Bank
[Figure: non-burst vs. burst timing — each access pays the address-decode and core-array access delay before delivering 2 bits at a time to the pins; bursting amortizes the core-array delay over several transfers]
Modern DRAM systems are designed to always be accessed in burst mode. Burst bytes are transferred, but they are discarded when accesses are not to sequential locations.
Multiple DRAM Banks
[Figure: Bank 0 and Bank 1, each with its own row decoder, sense amps, and mux]
Row-major layout of M: M0,0 M0,1 M0,2 M0,3 M1,0 M1,1 M1,2 M1,3 M2,0 M2,1 M2,2 M2,3 M3,0 M3,1 M3,2 M3,3
[Only the tail of the simple matrix multiplication kernel survives:]
    P[Row*Width+Col] = Pvalue;
  }
}
Two Access Patterns
[Figure: (a) Thread 1 and Thread 2 traverse d_M along rows; (b) they traverse d_N down columns; both matrices are WIDTH × WIDTH]
(a) M[Row*Width+k]      (b) N[k*Width+Col]
k is the loop counter in the inner-product loop of the kernel code.
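Only the final store of the simple (untiled) kernel survives in the fragment above. A minimal sketch of the whole kernel, consistent with that line and with the two access patterns just shown (the one-thread-per-output mapping and bounds check are assumptions, not the slide's exact code):

__global__ void SimpleMatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Each thread computes one element of Pd
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;

    if (Row < Width && Col < Width) {
        float Pvalue = 0;
        for (int k = 0; k < Width; ++k)
            // One thread walks along a row of Md and down a column of Nd
            Pvalue += Md[Row*Width + k] * Nd[k*Width + Col];
        Pd[Row*Width + Col] = Pvalue;
    }
}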
N accesses are coalesced.
[Figure: the access direction in the kernel code runs down the columns of N; in the row-major layout N0,0 N0,1 N0,2 N0,3 N1,0 N1,1 N1,2 N1,3 N2,0 N2,1 N2,2 N2,3 N3,0 N3,1 N3,2 N3,3, each load iteration touches consecutive elements of one row]
M accesses are not coalesced.
[Figure: in load iteration 0, threads T0–T3 read M0,0, M1,0, M2,0, M3,0 — elements that are Width apart in the row-major layout M0,0 M0,1 M0,2 M0,3 M1,0 … M3,3]
Use shared memory to enable coalescing in tiled matrix multiplication
[Figure: original access pattern — threads read d_M and d_N (WIDTH × WIDTH) directly from global memory; tiled access pattern — a tile is first copied into scratchpad (shared) memory, then the multiplication is performed with the scratchpad values]
Tiled Matrix Multiplication Kernel
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
1.  __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
2.  __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
    …
7.  float Pvalue = 0;
Tiled Matrix Multiplication Kernel (cont.)
    // Loop over the Md and Nd tiles required to compute the Pd element
8.  for (int m = 0; m < Width/TILE_WIDTH; ++m) {
[Figure: on-chip memories per SM — registers (R), constant cache (C$)/L1, shared memory]
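Only statements 1, 2, 7, and 8 of the tiled kernel survive on these slides. The sketch below fills in the missing statements in the standard way (assuming TILE_WIDTH is 16 and Width is a multiple of TILE_WIDTH); it is a reconstruction, not the original slide code:

#define TILE_WIDTH 16

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x, by = blockIdx.y;
    int tx = threadIdx.x, ty = threadIdx.y;
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;
    float Pvalue = 0;

    // Loop over the Md and Nd tiles required to compute the Pd element
    for (int m = 0; m < Width/TILE_WIDTH; ++m) {
        // Collaborative, coalesced loading of one Md tile and one Nd tile into shared memory
        Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
        Nds[ty][tx] = Nd[(m*TILE_WIDTH + ty)*Width + Col];
        __syncthreads();

        // Inner product over the tile, entirely out of shared memory
        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();
    }
    Pd[Row*Width + Col] = Pvalue;
}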
G80 Shared Memory
• Each SM has 16 KB of shared memory
  – 16 banks of 32-bit words
[Figure: SM block diagram — I$ L1, multithreaded instruction buffer, shared memory]
Parallel Memory Architecture
• In a parallel machine, many threads access memory
  – Therefore, memory is divided into banks
  – Essential to achieve high bandwidth
• Each bank can service one address per cycle
  – A memory can service as many simultaneous accesses as it has banks
[Figure: Bank 0 through Bank 7]
Bank Addressing Examples
• No bank conflicts: linear addressing, stride == 1
  [Figure: Thread 0–7 map to Bank 0–7 in order, one thread per bank]
• No bank conflicts: 1:1 permutation
  [Figure: Thread 0–7 map to Bank 0–7 in a permuted order, still one thread per bank]
Bank Addressing Examples
• 2-way bank conflicts: linear addressing, stride == 2
  [Figure: 16 threads land on only the even banks, two threads per bank]
• 8-way bank conflicts: linear addressing, stride == 8
  [Figure: 16 threads land on only two banks, x8 threads each]
How addresses map to banks on G80
• Each bank has a bandwidth of 32 bits per clock cycle
• Successive 32-bit words are assigned to successive banks
• G80 has 16 banks
  – So bank = (32-bit word address) mod 16
  – Same as the size of a half-warp
    • No bank conflicts between different half-warps, only within a single half-warp
Shared memory bank conflicts
• Shared memory is as fast as registers if there are no bank conflicts
• The fast case:
  – If all threads of a half-warp access different banks, there is no bank conflict
  – If all threads of a half-warp access the identical address, there is no bank conflict (broadcast)
• The slow case:
  – Bank conflict: multiple threads in the same half-warp access the same bank
  – The accesses must be serialized
  – Cost = max # of simultaneous accesses to a single bank
Linear Addressing
• Given:
      __shared__ float shared[256];
      float foo = shared[baseIndex + s * threadIdx.x];
[Figure: thread-to-bank mappings for s=1 and s=3 — Thread 0–15 onto Bank 0–15]
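A small host-side check (illustrative, assuming G80's 16 banks of 32-bit words) makes the stride rule concrete: a stride s is conflict-free exactly when it shares no factor with 16, i.e. when s is odd.

#include <stdio.h>

// Worst-case number of threads in a half-warp (16 threads) that
// collide on a single bank for a given stride s, assuming 16 banks.
int conflict_degree(int s)
{
    int hits[16] = {0};
    for (int tid = 0; tid < 16; ++tid)
        hits[(s * tid) % 16]++;        // bank = word index mod 16
    int worst = 0;
    for (int b = 0; b < 16; ++b)
        if (hits[b] > worst) worst = hits[b];
    return worst;                       // 1 = conflict-free, 2 = 2-way, ...
}

int main(void)
{
    for (int s = 1; s <= 8; ++s)
        printf("stride %d: %d-way\n", s, conflict_degree(s));
    return 0;
}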
Control Flow Instructions
• The main performance concern with branching is divergence
  – Threads within a single warp take different paths
  – Different execution paths are serialized in G80
    • The control paths taken by the threads in a warp are traversed one at a time until there are no more.
• A common case: avoid divergence when the branch condition is a function of thread ID
  – Example with divergence:
    • if (threadIdx.x > 2) { }
    • This creates two different control paths for threads in a block
    • Threads in the same warp follow different paths; threads 0, 1, and 2 follow a different path than the rest of the threads in the first warp
  – Example without divergence:
    • if (threadIdx.x / WARP_SIZE > 2) { }
    • This also creates two different control paths for threads in a block
    • But all threads in any given warp follow the same path
Parallel Reduction
• Given an array of values, “reduce” them to a single value in parallel
• Examples
  – sum reduction: sum of all values in the array
  – max reduction: maximum of all values in the array
A simple implementation
• Assume we have already loaded the array into __shared__ float partialSum[]
[Figure: reduction tree over the array elements; in iteration 1 adjacent pairs are added, and by iteration 3 the surviving partial sums cover elements 0..7 and 8..15]
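A sketch of that loading step (the input pointer and the BLOCK_SIZE constant are assumed names, not from the slides):

    __shared__ float partialSum[BLOCK_SIZE];   // BLOCK_SIZE == blockDim.x, known at compile time
    unsigned int t = threadIdx.x;
    unsigned int start = blockIdx.x * blockDim.x;
    partialSum[t] = input[start + t];          // each thread loads one element from global memory
    __syncthreads();                           // make the block's loads visible before reducing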
Some Observations
• In each iteration, two control flow paths are sequentially traversed for each warp
  – Threads that perform the addition and threads that do not
  – Threads that do not perform the addition may still cost extra cycles, depending on the implementation of divergence
• No more than half of the threads are executing at any time
  – All odd-index threads do no addition right from the beginning!
  – On average, fewer than ¼ of the threads are active across all warps over time.
  – After the 5th iteration, entire warps in each block perform no addition: poor resource utilization, but no divergence in the warps that do no addition.
    • This can go on for a while, up to 4 more iterations (512/32 = 16 = 2^4), where each iteration has only one active thread per active warp until all warps retire
Shortcomings of the implementation
• Assume we have already loaded the array into __shared__ float partialSum[]

    unsigned int t = threadIdx.x;
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
    {
        __syncthreads();
        if (t % (2*stride) == 0)
            partialSum[t] += partialSum[t+stride];
    }

BAD: divergence due to interleaved branch decisions
A better implementation
• Assume we have already loaded the array into __shared__ float partialSum[]
[Figure: in iteration 1, thread 0 adds element 0+16, …, thread 15 adds element 15+31, so the active threads stay contiguous]
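The code for this version is not on the surviving slides. A minimal sketch of the usual contiguous-stride formulation (assuming blockDim.x is a power of two), which keeps the active threads packed so that whole warps drop out together instead of diverging within a warp:

    unsigned int t = threadIdx.x;
    // Halve the stride each iteration; only threads t < stride do work.
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2)
    {
        __syncthreads();
        if (t < stride)
            partialSum[t] += partialSum[t + stride];
    }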
Registers, ILP and Instruction Mix
Programmer View of Register File
• There are 8192 registers in each SM in G80
  – This is an implementation decision, not part of CUDA
  – Registers are dynamically partitioned across all blocks assigned to the SM
  – Once assigned to a block, a register is NOT accessible by threads in other blocks
  – Each thread in the same block can only access registers assigned to itself
[Figure: register file partitioned among 4 blocks vs. 3 blocks]
Resource Allocation Example
• If each block has 16×16 threads and each thread uses 10 registers, how many threads can run on each SM?
  – Each block requires 10*256 = 2560 registers
  – 8192 = 3 * 2560 + change
  – So, three blocks can run on an SM as far as registers are concerned
• We want to optimize the performance of the computation done by each thread (the kernel code). The optimized kernel code increases the use of registers by 1.
  – Each block now requires 11*256 = 2816 registers
  – 8192 < 2816 * 3
  – Only two blocks can run on an SM: a 1/3 reduction of parallelism!
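The same register bookkeeping as a tiny helper (a sketch; 8192 is the G80 per-SM register count used above):

// Blocks that fit in an SM's register file, considering registers only.
int blocks_per_sm(int regs_per_sm, int threads_per_block, int regs_per_thread)
{
    return regs_per_sm / (threads_per_block * regs_per_thread);
}
// blocks_per_sm(8192, 256, 10) == 3;  blocks_per_sm(8192, 256, 11) == 2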
Resource Allocation Example (cont.)
[Figure: TB0 and TB1 are allocated; insufficient registers remain to allocate a third thread block (TB2), so 3 blocks no longer fit]

[The tiled kernel listing is shown again; only part of it survives:]
7.  float Pvalue = 0;
Tiled Matrix Multiplication Kernel (cont.)
    // Loop over the Md and Nd tiles required to compute the Pd element
8.  for (int m = 0; m < Width/TILE_WIDTH; ++m) {
        …
        __syncthreads();
    }
}
Prefetch
• Deposit blue tile from registers into shared memory
• Syncthreads
• Load orange tile into registers
• Compute blue tile
• Deposit orange tile into shared memory
• …
[Figure: block (bx, by) and thread (tx, ty) indices over Md, Pd, and the Pdsub tile; tiles are TILE_WIDTH × TILE_WIDTH within WIDTH × WIDTH matrices]
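A minimal sketch of the register-prefetch pattern the bullets describe (an assumed reconstruction, not the slide's code): each iteration deposits the tile already staged in registers into shared memory, issues the loads for the next tile into registers, and computes on the shared-memory tile. TILE_WIDTH is as defined in the earlier sketch.

__global__ void MatrixMulKernelPrefetch(float* Md, float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int tx = threadIdx.x, ty = threadIdx.y;
    int Row = blockIdx.y * TILE_WIDTH + ty;
    int Col = blockIdx.x * TILE_WIDTH + tx;
    float Pvalue = 0;

    // Prefetch the first (blue) tile into registers
    float fMd = Md[Row*Width + tx];
    float fNd = Nd[ty*Width + Col];

    for (int m = 0; m < Width/TILE_WIDTH; ++m) {
        // Deposit the prefetched tile from registers into shared memory
        Mds[ty][tx] = fMd;
        Nds[ty][tx] = fNd;
        __syncthreads();

        // Load the next (orange) tile into registers
        if (m + 1 < Width/TILE_WIDTH) {
            fMd = Md[Row*Width + (m+1)*TILE_WIDTH + tx];
            fNd = Nd[((m+1)*TILE_WIDTH + ty)*Width + Col];
        }

        // Compute on the blue tile now in shared memory
        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();
    }
    Pd[Row*Width + Col] = Pvalue;
}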
Instruction Mix Considerations
    for (int k = 0; k < TILE_WIDTH; ++k)
        Pvalue += Mds[ty][k] * Nds[k][tx];
• Each iteration spends instructions on the loop branch, counter update, and address arithmetic in addition to the two floating-point operations that do the real work.
Unrolling
[Two-column code comparison, flattened in extraction; both versions begin:]
    Ctemp = 0;
    for (...) {
        __shared__ float As[16][16];
        __shared__ float Bs[16][16];
        ...
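The unrolled right-hand column did not survive extraction. As an illustration (a sketch, not the slide's code), fully unrolling the 16-iteration inner loop removes the loop branch, counter update, and index arithmetic, leaving mostly floating-point instructions:

    // Rolled form: 16 iterations of branch + counter + address arithmetic + FP work
    for (int k = 0; k < 16; ++k)
        Ctemp += As[ty][k] * Bs[k][tx];

    // Fully unrolled, either by hand ...
    Ctemp += As[ty][0]  * Bs[0][tx];
    Ctemp += As[ty][1]  * Bs[1][tx];
    //   ... through ...
    Ctemp += As[ty][15] * Bs[15][tx];

    // ... or by asking the compiler to unroll:
    #pragma unroll
    for (int k = 0; k < 16; ++k)
        Ctemp += As[ty][k] * Bs[k][tx];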
Major Performance Detractors
• Long-latency operations
– Avoid stalls by executing other threads
• Stalls and bubbles in the pipeline
– Barrier synchronization
– Branch divergence
• Shared resource saturation
– Global memory bandwidth
– Shared memory capacity