Multithreaded Architectures
Lecture 5: Performance Considerations
©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2016
Global Memory (DRAM) Bandwidth
[Figure: ideal vs. actual (reality) DRAM bandwidth]
DRAM Bank Organization
• Each core array has about O(1M) bits
• Each bit is stored in a tiny capacitor accessed through a single transistor
[Figure: Row Addr → Row Decoder → Memory Cell Core Array → Sense Amps → Column Latches → Column Mux (Column Addr); wide internal path, narrow off-chip data pin interface]
A very small (8x2 bit) DRAM Bank
[Figure: 8x2 core array with row decoder, sense amps, and output mux]
DRAM core arrays are slow.
• Reading from a cell in the core array is a very slow process
  – DDR: core speed = ½ interface speed
  – DDR2/GDDR3: core speed = ¼ interface speed
  – DDR3/GDDR4: core speed = ⅛ interface speed
  – … likely to be worse in the future
• About 1000 cells are connected to each vertical line
[Figure: row decode drives the core array; sense amps capture the selected row before the mux narrows it to the output width]

DRAM Bursting
[Figure: 8x2 bank — first part of the burst, read from the core array into the sense amps and column latches]
DRAM Bursting (cont.)
[Figure: 8x2 bank — second part of the burst, served from the column latches without another core-array access]
DRAM Bursting for the 8x4 Bank
[Figure: non-burst vs. burst timing — each access pays the address-decode and core-array access delay before delivering 2 bits at a time to the pins; bursting amortizes the core-array delay over several transfers]
Modern DRAM systems are designed to always be accessed in burst mode. Burst bytes are transferred, but they are discarded when accesses are not to sequential locations.
Multiple DRAM Banks
[Figure: Bank 0 and Bank 1, each with its own row decoder, sense amps, and mux]
Row-major layout of M: M0,0 M0,1 M0,2 M0,3 M1,0 M1,1 M1,2 M1,3 M2,0 M2,1 M2,2 M2,3 M3,0 M3,1 M3,2 M3,3
[Only the tail of the simple matrix multiplication kernel survives:]
    P[Row*Width+Col] = Pvalue;
  }
}
Two Access Patterns
[Figure: (a) Thread 1 and Thread 2 traverse d_M along rows; (b) they traverse d_N down columns; both matrices are WIDTH × WIDTH]
(a) M[Row*Width+k]      (b) N[k*Width+Col]
k is the loop counter in the inner-product loop of the kernel code.
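Only the final store of the simple (untiled) kernel survives in the fragment above. A minimal sketch of the whole kernel, consistent with that line and with the two access patterns just shown (the one-thread-per-output mapping and bounds check are assumptions, not the slide's exact code):

__global__ void SimpleMatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Each thread computes one element of Pd
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;

    if (Row < Width && Col < Width) {
        float Pvalue = 0;
        for (int k = 0; k < Width; ++k)
            // One thread walks along a row of Md and down a column of Nd
            Pvalue += Md[Row*Width + k] * Nd[k*Width + Col];
        Pd[Row*Width + Col] = Pvalue;
    }
}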
N accesses are coalesced.
[Figure: the access direction in the kernel code runs down the columns of N; in the row-major layout N0,0 N0,1 N0,2 N0,3 N1,0 N1,1 N1,2 N1,3 N2,0 N2,1 N2,2 N2,3 N3,0 N3,1 N3,2 N3,3, each load iteration touches consecutive elements of one row]
M accesses are not coalesced.
[Figure: in load iteration 0, threads T0–T3 read M0,0, M1,0, M2,0, M3,0 — elements that are Width apart in the row-major layout M0,0 M0,1 M0,2 M0,3 M1,0 … M3,3]
Use shared memory to enable coalescing in tiled matrix multiplication
[Figure: original access pattern — threads read d_M and d_N (WIDTH × WIDTH) directly from global memory; tiled access pattern — a tile is first copied into scratchpad (shared) memory, then the multiplication is performed with the scratchpad values]
Tiled Matrix Multiplication Kernel
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
1.  __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
2.  __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
    …
7.  float Pvalue = 0;
Tiled Matrix Multiplication Kernel (cont.)
    // Loop over the Md and Nd tiles required to compute the Pd element
8.  for (int m = 0; m < Width/TILE_WIDTH; ++m) {
[Figure: on-chip memories per SM — registers (R), constant cache (C$)/L1, shared memory]
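Only statements 1, 2, 7, and 8 of the tiled kernel survive on these slides. The sketch below fills in the missing statements in the standard way (assuming TILE_WIDTH is 16 and Width is a multiple of TILE_WIDTH); it is a reconstruction, not the original slide code:

#define TILE_WIDTH 16

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x, by = blockIdx.y;
    int tx = threadIdx.x, ty = threadIdx.y;
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;
    float Pvalue = 0;

    // Loop over the Md and Nd tiles required to compute the Pd element
    for (int m = 0; m < Width/TILE_WIDTH; ++m) {
        // Collaborative, coalesced loading of one Md tile and one Nd tile into shared memory
        Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
        Nds[ty][tx] = Nd[(m*TILE_WIDTH + ty)*Width + Col];
        __syncthreads();

        // Inner product over the tile, entirely out of shared memory
        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();
    }
    Pd[Row*Width + Col] = Pvalue;
}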
G80 Shared Memory
• Each SM has 16 KB of shared memory
  – 16 banks of 32-bit words
[Figure: SM block diagram — I$ L1, multithreaded instruction buffer, shared memory]
Parallel Memory Architecture
• In a parallel machine, many threads access memory
  – Therefore, memory is divided into banks
  – Essential to achieve high bandwidth
• Each bank can service one address per cycle
  – A memory can service as many simultaneous accesses as it has banks
[Figure: Bank 0 through Bank 7]
Bank Addressing Examples
• No bank conflicts: linear addressing, stride == 1
  [Figure: Thread 0–7 map to Bank 0–7 in order, one thread per bank]
• No bank conflicts: 1:1 permutation
  [Figure: Thread 0–7 map to Bank 0–7 in a permuted order, still one thread per bank]
Bank Addressing Examples
• 2-way bank conflicts: linear addressing, stride == 2
  [Figure: 16 threads land on only the even banks, two threads per bank]
• 8-way bank conflicts: linear addressing, stride == 8
  [Figure: 16 threads land on only two banks, x8 threads each]
How addresses map to banks on G80
• Each bank has a bandwidth of 32 bits per clock cycle
• Successive 32-bit words are assigned to successive banks
• G80 has 16 banks
  – So bank = (32-bit word address) mod 16
  – Same as the size of a half-warp
    • No bank conflicts between different half-warps, only within a single half-warp
Shared memory bank conflicts
• Shared memory is as fast as registers if there are no bank conflicts
• The fast case:
  – If all threads of a half-warp access different banks, there is no bank conflict
  – If all threads of a half-warp access the identical address, there is no bank conflict (broadcast)
• The slow case:
  – Bank conflict: multiple threads in the same half-warp access the same bank
  – The accesses must be serialized
  – Cost = max # of simultaneous accesses to a single bank
Linear Addressing
• Given:
      __shared__ float shared[256];
      float foo = shared[baseIndex + s * threadIdx.x];
[Figure: thread-to-bank mappings for s=1 and s=3 — Thread 0–15 onto Bank 0–15]
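A small host-side check (illustrative, assuming G80's 16 banks of 32-bit words) makes the stride rule concrete: a stride s is conflict-free exactly when it shares no factor with 16, i.e. when s is odd.

#include <stdio.h>

// Worst-case number of threads in a half-warp (16 threads) that
// collide on a single bank for a given stride s, assuming 16 banks.
int conflict_degree(int s)
{
    int hits[16] = {0};
    for (int tid = 0; tid < 16; ++tid)
        hits[(s * tid) % 16]++;        // bank = word index mod 16
    int worst = 0;
    for (int b = 0; b < 16; ++b)
        if (hits[b] > worst) worst = hits[b];
    return worst;                       // 1 = conflict-free, 2 = 2-way, ...
}

int main(void)
{
    for (int s = 1; s <= 8; ++s)
        printf("stride %d: %d-way\n", s, conflict_degree(s));
    return 0;
}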
Control Flow Instructions
• The main performance concern with branching is divergence
  – Threads within a single warp take different paths
  – Different execution paths are serialized in G80
    • The control paths taken by the threads in a warp are traversed one at a time until there are no more.
• A common case: avoid divergence when the branch condition is a function of thread ID
  – Example with divergence:
    • if (threadIdx.x > 2) { }
    • This creates two different control paths for threads in a block
    • Threads in the same warp follow different paths; threads 0, 1, and 2 follow a different path than the rest of the threads in the first warp
  – Example without divergence:
    • if (threadIdx.x / WARP_SIZE > 2) { }
    • This also creates two different control paths for threads in a block
    • But all threads in any given warp follow the same path
Parallel Reduction
• Given an array of values, “reduce” them to a single value in parallel
• Examples
  – sum reduction: sum of all values in the array
  – max reduction: maximum of all values in the array
A simple implementation
• Assume we have already loaded the array into __shared__ float partialSum[]
[Figure: reduction tree over the array elements; in iteration 1 adjacent pairs are added, and by iteration 3 the surviving partial sums cover elements 0..7 and 8..15]
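A sketch of that loading step (the input pointer and the BLOCK_SIZE constant are assumed names, not from the slides):

    __shared__ float partialSum[BLOCK_SIZE];   // BLOCK_SIZE == blockDim.x, known at compile time
    unsigned int t = threadIdx.x;
    unsigned int start = blockIdx.x * blockDim.x;
    partialSum[t] = input[start + t];          // each thread loads one element from global memory
    __syncthreads();                           // make the block's loads visible before reducing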
Some Observations
• In each iteration, two control flow paths are sequentially traversed for each warp
  – Threads that perform the addition and threads that do not
  – Threads that do not perform the addition may still cost extra cycles, depending on the implementation of divergence
• No more than half of the threads are executing at any time
  – All odd-index threads do no addition right from the beginning!
  – On average, fewer than ¼ of the threads are active across all warps over time.
  – After the 5th iteration, entire warps in each block perform no addition: poor resource utilization, but no divergence in the warps that do no addition.
    • This can go on for a while, up to 4 more iterations (512/32 = 16 = 2^4), where each iteration has only one active thread per active warp until all warps retire
Shortcomings of the implementation
• Assume we have already loaded the array into __shared__ float partialSum[]

    unsigned int t = threadIdx.x;
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
    {
        __syncthreads();
        if (t % (2*stride) == 0)
            partialSum[t] += partialSum[t+stride];
    }

BAD: divergence due to interleaved branch decisions
A better implementation
• Assume we have already loaded the array into __shared__ float partialSum[]
[Figure: in iteration 1, thread 0 adds element 0+16, …, thread 15 adds element 15+31, so the active threads stay contiguous]
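The code for this version is not on the surviving slides. A minimal sketch of the usual contiguous-stride formulation (assuming blockDim.x is a power of two), which keeps the active threads packed so that whole warps drop out together instead of diverging within a warp:

    unsigned int t = threadIdx.x;
    // Halve the stride each iteration; only threads t < stride do work.
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2)
    {
        __syncthreads();
        if (t < stride)
            partialSum[t] += partialSum[t + stride];
    }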
Registers, ILP and Instruction Mix
Programmer View of Register File
• There are 8192 registers in each SM in G80
  – This is an implementation decision, not part of CUDA
  – Registers are dynamically partitioned across all blocks assigned to the SM
  – Once assigned to a block, a register is NOT accessible by threads in other blocks
  – Each thread in the same block can only access registers assigned to itself
[Figure: register file partitioned among 4 blocks vs. 3 blocks]
Resource Allocation Example
• If each block has 16×16 threads and each thread uses 10 registers, how many threads can run on each SM?
  – Each block requires 10*256 = 2560 registers
  – 8192 = 3 * 2560 + change
  – So, three blocks can run on an SM as far as registers are concerned
• We want to optimize the performance of the computation done by each thread (the kernel code). The optimized kernel code increases the use of registers by 1.
  – Each block now requires 11*256 = 2816 registers
  – 8192 < 2816 * 3
  – Only two blocks can run on an SM: a 1/3 reduction of parallelism!
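The same register bookkeeping as a tiny helper (a sketch; 8192 is the G80 per-SM register count used above):

// Blocks that fit in an SM's register file, considering registers only.
int blocks_per_sm(int regs_per_sm, int threads_per_block, int regs_per_thread)
{
    return regs_per_sm / (threads_per_block * regs_per_thread);
}
// blocks_per_sm(8192, 256, 10) == 3;  blocks_per_sm(8192, 256, 11) == 2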
Resource Allocation Example (cont.)
[Figure: TB0 and TB1 are allocated; insufficient registers remain to allocate a third thread block (TB2), so 3 blocks no longer fit]

[The tiled kernel listing is shown again; only part of it survives:]
7.  float Pvalue = 0;
Tiled Matrix Multiplication Kernel (cont.)
    // Loop over the Md and Nd tiles required to compute the Pd element
8.  for (int m = 0; m < Width/TILE_WIDTH; ++m) {
        …
        __syncthreads();
    }
}
Prefetch
• Deposit blue tile from registers into shared memory
• Syncthreads
• Load orange tile into registers
• Compute blue tile
• Deposit orange tile into shared memory
• …
[Figure: block (bx, by) and thread (tx, ty) indices over Md, Pd, and the Pdsub tile; tiles are TILE_WIDTH × TILE_WIDTH within WIDTH × WIDTH matrices]
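A minimal sketch of the register-prefetch pattern the bullets describe (an assumed reconstruction, not the slide's code): each iteration deposits the tile already staged in registers into shared memory, issues the loads for the next tile into registers, and computes on the shared-memory tile. TILE_WIDTH is as defined in the earlier sketch.

__global__ void MatrixMulKernelPrefetch(float* Md, float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int tx = threadIdx.x, ty = threadIdx.y;
    int Row = blockIdx.y * TILE_WIDTH + ty;
    int Col = blockIdx.x * TILE_WIDTH + tx;
    float Pvalue = 0;

    // Prefetch the first (blue) tile into registers
    float fMd = Md[Row*Width + tx];
    float fNd = Nd[ty*Width + Col];

    for (int m = 0; m < Width/TILE_WIDTH; ++m) {
        // Deposit the prefetched tile from registers into shared memory
        Mds[ty][tx] = fMd;
        Nds[ty][tx] = fNd;
        __syncthreads();

        // Load the next (orange) tile into registers
        if (m + 1 < Width/TILE_WIDTH) {
            fMd = Md[Row*Width + (m+1)*TILE_WIDTH + tx];
            fNd = Nd[((m+1)*TILE_WIDTH + ty)*Width + Col];
        }

        // Compute on the blue tile now in shared memory
        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();
    }
    Pd[Row*Width + Col] = Pvalue;
}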
Instruction Mix Considerations
    for (int k = 0; k < TILE_WIDTH; ++k)
        Pvalue += Mds[ty][k] * Nds[k][tx];
• Each iteration spends instructions on the loop branch, counter update, and address arithmetic in addition to the two floating-point operations that do the real work.
Unrolling
[Two-column code comparison, flattened in extraction; both versions begin:]
    Ctemp = 0;
    for (...) {
        __shared__ float As[16][16];
        __shared__ float Bs[16][16];
        ...
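The unrolled right-hand column did not survive extraction. As an illustration (a sketch, not the slide's code), fully unrolling the 16-iteration inner loop removes the loop branch, counter update, and index arithmetic, leaving mostly floating-point instructions:

    // Rolled form: 16 iterations of branch + counter + address arithmetic + FP work
    for (int k = 0; k < 16; ++k)
        Ctemp += As[ty][k] * Bs[k][tx];

    // Fully unrolled, either by hand ...
    Ctemp += As[ty][0]  * Bs[0][tx];
    Ctemp += As[ty][1]  * Bs[1][tx];
    //   ... through ...
    Ctemp += As[ty][15] * Bs[15][tx];

    // ... or by asking the compiler to unroll:
    #pragma unroll
    for (int k = 0; k < 16; ++k)
        Ctemp += As[ty][k] * Bs[k][tx];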
Major Performance Detractors
• Long-latency operations
– Avoid stalls by executing other threads
• Stalls and bubbles in the pipeline
– Barrier synchronization
– Branch divergence
• Shared resource saturation
– Global memory bandwidth
– Shared memory capacity