
CUDA on the NVIDIA Ampere GPU Architecture
Taking your algorithms to the next level of performance
Carter Edwards, May 19, 2020
CUDA 11.0 Enhancements
Leveraging the NVIDIA Ampere GPU Architecture

Asynchronously Copy Global → Shared Memory

Flexible Synchronization for Producer → Consumer and other Algorithms

Influence Residency of Data in L2 Cache

Warp Synchronous Reduction

2
Asynchronously
Copy Global → Shared Memory
3
__shared__ Memory
Many Algorithms' Key for Performance

Current use of shared memory:
• Time-stepping and global data iteration
• Copy global data to shared memory
• Compute on shared memory

    extern __shared__ int shbuf[];
    while ( an_algorithm_iterates ) {
      __syncthreads();
      for ( i = ... ) {
        shbuf[i] = gldata[i]; /* copy */
      }
      __syncthreads();
      /* compute on shbuf[] */
    }

Copy and compute phases are sequenced: the __syncthreads() sandwich.
Global → Shared memory has a journey.

4
Global → Shared Memory Journey
The journey may be longer than it appears

Anatomy of the copy:  shbuf[i] = gldata[i];

Journey through memory before GA100:
  GMEM → L2 → L1 → register → SMEM

Better: don't pass through registers along the way; using fewer registers can improve occupancy
  GMEM → L2 → L1 → SMEM

Even better: don't pass through the L1 cache along the way; let other data persist longer
  GMEM → L2 → SMEM

5
Copy and Compute Sequencing
Each iteration: first copy GMEM → SMEM, then compute on SMEM

[Diagram: sequential timeline: copy GMEM into SMEM, compute on SMEM, copy, compute, and so on; copy and compute never overlap]

Better to compute while copying for later iteration(s)

[Diagram: example two-stage pipelining of copy and compute: copies alternate between SMEM[0] and SMEM[1], so the copy for the next iteration overlaps the compute on the current buffer]

6
Async-Copy Pipeline
Begin with a Simple One-Stage Pipeline

Submit an asynchronous copy via the better journey:
• To dst in shared memory
• From src in global memory
• Data type is trivially copyable

    pipeline pipe;
    memcpy_async(dst, src, pipe);
    pipe.commit_and_wait();

A thread submits as many async-copies as needed, then waits for all of its submitted asynchronous copy operations to complete.

7
Update Implementation with Async-Copy
Begin with a Simple One-Stage Pipeline

Before GA100 GPU:

    extern __shared__ int shbuf[];
    while ( an_algorithm_iterates ) {
      __syncthreads();
      for ( i = ... ) {
        shbuf[i] = gldata[i];
      }
      __syncthreads();
      /* compute on shbuf[] */
    }

Now:

    extern __shared__ int shbuf[];
    pipeline pipe;
    while ( an_algorithm_iterates ) {
      __syncthreads();
      for ( i = ... ) {
        memcpy_async(shbuf[i], gldata[i], pipe);
      }
      pipe.commit_and_wait();
      __syncthreads();
      /* compute on shbuf[] */
    }

Still have the __syncthreads() sandwich
8
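
For reference, a minimal self-contained kernel sketch of the one-stage pattern above. It follows the pipeline / memcpy_async / commit_and_wait names used on these slides; the header and namespace (CUDA 11.0's experimental pipeline interface) as well as the kernel name and loop bounds are assumptions of this sketch, not part of the talk.

    // Sketch only, assuming CUDA 11.0's experimental pipeline interface.
    #include <cuda_pipeline.h>            // assumed header for the pipeline shown on the slides
    using namespace nvcuda::experimental; // assumed namespace of that experimental interface

    __global__ void one_stage(const int* gldata, int n, int iterations) // hypothetical kernel
    {
        extern __shared__ int shbuf[];    // n ints of dynamic shared memory
        pipeline pipe;
        for (int iter = 0; iter < iterations; ++iter) {
            __syncthreads();              // buffer is free to overwrite
            for (int i = threadIdx.x; i < n; i += blockDim.x) {
                memcpy_async(shbuf[i], gldata[i], pipe);  // GMEM -> SMEM without staging in registers
            }
            pipe.commit_and_wait();       // wait for this thread's submitted copies
            __syncthreads();              // all threads' copies are now visible
            /* compute on shbuf[] */
        }
    }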


Async-Copy Pipeline
Then Improve with the Even Better Journey through Memory

    memcpy_async(dst, src, pipe);

Better async-copy journey (bypasses registers):
• To shared memory
• From global memory
• Data type is trivially copyable
• Data size is a multiple of 4 bytes
• Data is aligned to 4 bytes

Even better async-copy journey (also bypasses L1):
• To shared memory
• From global memory
• Data type is trivially copyable
• Data size is a multiple of 16 bytes
• Data is aligned to 16 bytes

9
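
To hit the 16-byte requirements above, the copied element itself can be made 16 bytes wide. A small illustrative sketch, assuming float4 elements (16 bytes, 16-byte aligned) and the same experimental pipeline header/namespace as the previous sketch; the kernel and buffer names are hypothetical.

    // Sketch: 16-byte elements satisfy the size and alignment rules for the even better journey.
    #include <cuda_pipeline.h>
    using namespace nvcuda::experimental;

    __global__ void copy16(const float4* gldata, int n)   // n float4 (16-byte) elements
    {
        extern __shared__ float4 shbuf4[];                 // dynamic shared memory, 16-byte aligned
        pipeline pipe;
        for (int i = threadIdx.x; i < n; i += blockDim.x) {
            memcpy_async(shbuf4[i], gldata[i], pipe);      // one 16-byte transfer per element
        }
        pipe.commit_and_wait();
        __syncthreads();
        /* compute on shbuf4[] */
    }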
Multi-Stage Pipeline
Then Improve by Overlapping Copy and Compute

    pipeline pipe;
    memcpy_async( dst, src, pipe );
    pipe.commit();
    pipe.wait_prior<N>();

Submit async-copy for dst = src
• Submit as many as needed
• Submit into the new stage of the pipeline

pipe.commit(): commit new stage K, but do not wait for it now
• The pipeline is a sequence of stages { 0, 1, ..., K }

pipe.wait_prior<N>(): wait for the prior stage K-N in the pipeline of async-copy operations to complete

[Diagram: timeline of stage-0 through stage-4; stage K is committed while waiting only for stage K-N]

10
Update Algorithm with Multi-Stage Pipeline
(1 of 2) Similar Pattern: Async-Copy, Commit, Wait

One Stage:

    for (stage=0; stage < end; ++stage) {
      __syncthreads();
      for ( i = ... ) {
        memcpy_async(sh[i], gl[i], pipe);
      }
      pipe.commit_and_wait();
      __syncthreads();
      /* compute on sh[] */
    }

Multi-Stage:

    for (stage=next=0; stage < end; ++stage) {
      /* __syncthreads(); */
      for (; next < stage + nStage ; ++next) {
        s = next % nStage ;
        for ( i = ... ) {
          memcpy_async(sh[s][i], gl[i], pipe);
        }
        pipe.commit();
      }
      pipe.wait_prior< nStage-1 >();
      __syncthreads();
      /* compute on sh[stage % nStage][] */
    }

11
Update Algorithm with Multi-Stage Pipeline
(2 of 2) Declare, Fill, and Wait with N Stages of Buffers

Annotations on the multi-stage code:
• Removed the leading __syncthreads()
• Submit async-copies for later stages, recycling among nStage buffers
• Commit the next stage but do not wait
• Wait for the current stage of copies to complete
  - stage = next - (nStage-1); the prior stage relative to the most recent commit
• __syncthreads(): wait for all threads to complete their copies

    for (stage=next=0; stage < end; ++stage) {
      /* __syncthreads(); */
      for (; next < stage + nStage ; ++next) {
        s = next % nStage ;
        for ( i = ... ) {
          memcpy_async(sh[s][i], gl[i], pipe);
        }
        pipe.commit();
      }
      pipe.wait_prior< nStage-1 >();
      __syncthreads();
      /* compute on sh[stage % nStage][] */
    }

12
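
Putting the pieces together, here is a minimal double-buffered sketch (nStage = 2) of the loop above, under the same experimental pipeline assumptions as the earlier sketches. TILE, the kernel name, and the tail handling with wait_prior<0>() are choices made for this sketch, not taken from the talk.

    // Sketch only: double-buffered multi-stage pipelining of copy and compute.
    #include <cuda_pipeline.h>
    using namespace nvcuda::experimental;

    constexpr int TILE = 256;                        // elements copied per stage (hypothetical)

    __global__ void multi_stage(const int* gl, int nStages)  // gl holds nStages * TILE ints
    {
        __shared__ int sh[2][TILE];
        pipeline pipe;

        // Prime the pipeline: submit stage 0 before the main loop.
        for (int i = threadIdx.x; i < TILE; i += blockDim.x)
            memcpy_async(sh[0][i], gl[i], pipe);
        pipe.commit();

        for (int stage = 0; stage < nStages; ++stage) {
            if (stage + 1 < nStages) {
                // Submit the next stage's copies into the other buffer, then wait for
                // everything except that newest commit, i.e. for the current stage.
                int s = (stage + 1) % 2;
                for (int i = threadIdx.x; i < TILE; i += blockDim.x)
                    memcpy_async(sh[s][i], gl[(stage + 1) * TILE + i], pipe);
                pipe.commit();
                pipe.wait_prior<1>();
            } else {
                pipe.wait_prior<0>();                // last stage: wait for all outstanding copies
            }
            __syncthreads();                         // current buffer is complete for all threads
            /* compute on sh[stage % 2][] */
            __syncthreads();                         // buffer may now be overwritten by a later stage
        }
    }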
Pairs Well with Cooperative Groups
Collective Async-Copy of a Whole Array

    template<class GroupType, class T>
    size_t memcpy_async( GroupType & group, T * dstPtr, size_t dstCount,
                         const T * srcPtr, size_t srcCount,
                         pipeline & pipe );

• GroupType is an intra-block Cooperative Group
• Partitions the array range [0..dstCount-1] among the group's threads
• Submits async-copies for: dstPtr[0..srcCount-1] = srcPtr[0..srcCount-1];
• Zero-fills the left-over: dstPtr[srcCount..dstCount-1] = 0;
• Given suitably aligned data, will use the better async-copy journeys through memory

13
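
A short usage sketch of the collective overload declared above, letting the whole thread block cooperate on one copy. It mirrors the slide's signature; the headers, namespaces, and the kernel and buffer names are assumptions of this sketch.

    // Sketch: block-wide collective async-copy of a whole array, per the overload above.
    #include <cooperative_groups.h>
    #include <cuda_pipeline.h>
    namespace cg = cooperative_groups;
    using namespace nvcuda::experimental;

    __global__ void collective_copy(const int* gldata, size_t count)   // hypothetical kernel
    {
        extern __shared__ int shbuf[];            // at least `count` ints of dynamic shared memory
        cg::thread_block block = cg::this_thread_block();
        pipeline pipe;

        // The group partitions the range among its threads; dstCount == srcCount, so no zero-fill.
        memcpy_async(block, shbuf, count, gldata, count, pipe);
        pipe.commit_and_wait();
        block.sync();                             // make every thread's copies visible block-wide
        /* compute on shbuf[] */
    }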
Async-Copy Microbenchmark
14
Your Algorithm's Mileage May Vary
Microbenchmark Comparing Synchronous vs. Asynchronous Copy

    shbuf[i] = gldata[i];   vs.   memcpy_async(shbuf[i], gldata[i], pipe);

Simple one-stage pipeline
• Without computations
• Ample registers available

Results: consistently better performance; best when
• Data type is 16 bytes (to get the even better journey through memory)
• Thread block size is modest

Exception: a corner case where the traditional synchronous copy can perform better

15
Microbenchmark Performance Experiment
Microbenchmark Comparing Synchronous vs. Asynchronous Copy

    /* sync: conventional synchronous memory copy */
    for (size_t i = 0; i < copy_count; ++i) {
      shared[blockDim.x * i + threadIdx.x] = global[blockDim.x * i + threadIdx.x];
    }

    /* async: asynchronous memory copy */
    pipeline pipe;
    for (size_t i = 0; i < copy_count; ++i) {
      memcpy_async( shared[blockDim.x * i + threadIdx.x],
                    global[blockDim.x * i + threadIdx.x], pipe );
    }
    pipe.commit();
    pipe.wait_prior<0>();

16
Performance Experiment
Varied Thread Block Size and Sizeof Data Type

Thread block sizes : 128, 256, 512
Data type sizes : 4 and 16 bytes

    /* Conventional synchronous memory copy */
    for (size_t i = 0; i < copy_count; ++i) {
      shared[blockDim.x*i + threadIdx.x] = global[blockDim.x*i + threadIdx.x];
    }

Measure the clock cycles required to copy an array of N bytes
• Y-axis = clock cycles
• X-axis = bytes copied

A simple experiment with surprisingly complicated results; the next slides step through them.

[Chart: "Copy 4Byte Elements": clock cycles vs. bytes copied (0 to 20000) for sync-128, sync-256, sync-512, async-128, async-256, async-512]

17
Performance Experiment
(1 of 4) The Compiler Optimizes the Traditional Synchronous Copy

The compiler unrolls the synchronous copy loop, keeping up to four loads/stores "in flight".

[Charts: "Copy 4Byte Data Type" and "Copy 16Byte Data Type": clock cycles to copy vs. bytes copied for sync-128, annotated with the number of loads/stores in flight]

18
Performance Experiment
(2 of 4) Async-Copy is Faster

Async-copy does not need the compiler's unrolling and instruction-scheduling optimizations.

[Charts: "Copy 4Byte Data Type" and "Copy 16Byte Data Type": clock cycles vs. bytes copied; async-128 is faster than sync-128]

19
Performance Experiment
(3 of 4) Copying a 16-Byte Data Type is Faster

[Charts: "Copy 4Byte Data Type" vs. "Copy 16Byte Data Type": clock cycles vs. bytes copied for sync-128 and async-128; the 16-byte case is faster]

20
Performance Experiment
(4 of 4) Corner Case where Synchronous Copy Might Win

With a large number of threads (512+), many available registers, and ample opportunity to hide latency, the traditional synchronous copy can perform better.

[Charts: "Copy 4Byte Elements" and "Copy 16Byte Elements": clock cycles vs. bytes copied for sync-512 and async-512]

21
For more async-copy examples and a performance deep-dive, see:
S21819: Optimizing Applications for NVIDIA Ampere GPU Architecture
Thursday, May 21 at 10:15am Pacific

22
Synchronize
Producer → Consumer
and other Algorithms
23
Built-in Sync Functions (Barriers)
__syncthreads(), __syncwarp(), and siblings

These are the absolute best-performing barriers for
• Synchronizing a whole thread block
• Synchronizing a warp, or a subset of a warp

(CUDA Cooperative Groups also provides this_grid().sync().)

And now CUDA adds barriers to synchronize
• A multi-warp subset of a thread block
• The Producer → Consumer pattern
• Integrated synchronization of thread execution and asynchronous memory copy

24
Split Producer → Consumer Thread Block
a.k.a. Algorithms with Block Partitioning and Specialization

[Diagram: producer threads fill SMEM[0] and SMEM[1] in alternation while consumer threads compute on the other buffer]

Producer threads and consumer threads must coordinate:
• The consumer must wait until a buffer is ready for consumption
• The producer must wait until a buffer is available for production

Recommendation: keep a warp's threads together in the same partition

25
Producer → Consumer Synchronization
Need a New Type of Barrier for "One-Sided" Sync

Using two barriers per stage:
• Consumer → Producer sync : OK to fill SMEM buffer[ stage % nStage ]
• Producer → Consumer sync : done filling SMEM buffer[ stage % nStage ]

[Diagram: producer and consumer threads alternating over SMEM[0] and SMEM[1], with the two barriers coordinating each stage]

26
Arrive/Wait Barrier
Enabling Producer → Consumer Synchronization
cuda::barrier<...>

• An implementation of the ISO C++ arrive/wait barrier
• Flexibly synchronizes arbitrary groups of threads
• This presentation only considers threads within a CUDA thread block

First: introduce the arrive/wait barrier

Then: show it in the producer → consumer pattern

27
Arrive/Wait Barrier
Introductory Example: Replacing __syncthreads()

Before:

    __global__ void kernel()
    {
      while ( iterating ) {
        __syncthreads();
      }
    }

After (barrier object in shared memory):

    __global__ void kernel()
    {
      __shared__ cuda::barrier<...> bar;
      /* ... initialize bar ... */
      while ( iterating ) {
        bar.arrive_and_wait();
      }
    }

Note: __syncthreads() still has the best performance for synchronizing a whole thread block.

28
Split Arrive and then Wait
Thread Group Memory Ordering and Visibility

bar.arrive_and_wait() is equivalent to bar.wait( bar.arrive() );

• A thread's memory updates BEFORE arrive are visible to the thread group AFTER wait
• Memory updates BETWEEN arrive and wait should be local to this thread
• Put the BETWEEN time to good use, otherwise threads may just idle (a sketch follows this slide)

    __shared__ int x ;
    while ( iterating ) {
      if ( tid == 0 ) x = 42 ;
      /* BEFORE */
      auto token = bar.arrive();
      /* BETWEEN */
      bar.wait(token);
      /* AFTER */
      assert( x == 42 );
    }

29
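
One way to put the BETWEEN window to use: a minimal sketch, assuming libcu++'s cuda::barrier from <cuda/barrier> already initialized for the participating threads; the helper name and the thread-local work are hypothetical.

    // Sketch: overlap thread-local work with the split arrive/wait.
    #include <cuda/barrier>
    #include <cuda/std/utility>   // cuda::std::move

    __device__ int stage_sum(cuda::barrier<cuda::thread_scope_block>& bar,
                             int* sh, const int* local, int n)
    {
        if (threadIdx.x == 0) sh[0] = 42;     // shared update BEFORE arrive
        auto token = bar.arrive();            // signal arrival, do not block yet

        int acc = 0;                          // BETWEEN: keep to thread-local (register) work
        for (int i = 0; i < n; ++i) acc += local[i];

        bar.wait(cuda::std::move(token));     // AFTER: the group's prior shared updates are visible
        return acc + sh[0];
    }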
Introduce Async-Copy Operations
Thread Group Memory Ordering and Visibility

Transfer the pipeline's wait to the barrier:
• Threads submit async-copies BEFORE
• The pipeline arrives on the barrier: pipe.arrive_on(bar);
• The barrier wait combines the pipeline wait and the thread synchronization wait
• Copied data is visible to the thread group AFTER

    while ( iterating ) {
      memcpy_async(sh[i], gl[j], pipe);
      /* BEFORE */
      pipe.arrive_on(bar);
      auto token = bar.arrive();
      /* BETWEEN */
      bar.wait(token);
      /* AFTER */
      assert( sh[i] == gl[j] );
    }

30
Arrive/Wait Barrier Initialization
Enabling Arbitrary Subgroups of a Thread Block

"Bootstrap initialization":
• The barrier object starts out uninitialized
• Choose one thread to initialize it
• Initialize with the number of threads that will arrive and wait (participate)
• Synchronize the participating threads before they use the barrier

    __global__ void kernel()
    {
      __shared__ barrier<...> bar;
      /* ... to initialize bar: */
      if ( tid == 0 ) init( &bar, NumThreads );
      __syncthreads();
      /* barrier is ready for use */
    }

The barrier can be initialized to synchronize an arbitrary subset of a thread block.
Keep the threads of a warp together for best performance.

31
Producer → Consumer Async-Copy
Partition the Thread Block into Producer and Consumer Subsets

    __global__ void kernel() {
      __shared__ barrier<...> bar;
      if ( 0 == threadIdx.x ) init(&bar, blockDim.x);
      __syncthreads();
      if ( threadIdx.x < NumProducer ) producer(bar);
      else consumer(bar);
    }

Producer threads:

    /* fill shared memory buffer */
    memcpy_async(sh[i], gl[j], pipe);
    pipe.arrive_on(bar);
    bar.arrive();

Consumer threads:

    /* wait for fill of shared memory buffer */
    bar.arrive_and_wait();
    /* compute using buffer */

One-sided synchronization: the producer arrives and the consumer waits.

32
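
Stitching the three fragments above into one hedged sketch: one buffer and one "buffer is filled" barrier. cuda::barrier<cuda::thread_scope_block> is assumed as the concrete type behind the slide's barrier<...>; pipe.arrive_on(bar) is written exactly as on the slide; BUF_LEN and NumProducer are hypothetical values; the "OK to fill" barrier for recycling buffers is deliberately omitted, as on the slide.

    // Sketch only: producer threads fill the buffer, consumer threads wait on the barrier.
    #include <cuda/barrier>
    #include <cuda_pipeline.h>
    using namespace nvcuda::experimental;
    using block_barrier = cuda::barrier<cuda::thread_scope_block>;

    constexpr int BUF_LEN     = 256;   // hypothetical buffer length
    constexpr int NumProducer = 64;    // hypothetical producer partition size

    __device__ void producer(block_barrier& bar, int* sh, const int* gl)
    {
        pipeline pipe;
        for (int i = threadIdx.x; i < BUF_LEN; i += NumProducer)
            memcpy_async(sh[i], gl[i], pipe);   // fill the shared buffer
        pipe.arrive_on(bar);                    // the pipeline arrives once the copies complete
        bar.arrive();                           // producer arrives, does not wait
    }

    __device__ void consumer(block_barrier& bar, const int* sh)
    {
        bar.arrive_and_wait();                  // wait for "buffer is filled"
        /* compute using sh[] */
    }

    __global__ void kernel(const int* gl)
    {
        __shared__ int sh[BUF_LEN];
        __shared__ block_barrier bar;
        if (0 == threadIdx.x) init(&bar, blockDim.x);  // every thread participates
        __syncthreads();
        if (threadIdx.x < NumProducer) producer(bar, sh, gl);
        else                           consumer(bar, sh);
    }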
Producer → Consumer Pattern
We Did a Deep Dive into Some of the Details

Covered:
• Producer uses async-copy to fill the buffer
• Producer uses a barrier for "buffer is filled"
• Consumer waits on the barrier for "buffer is filled"

Exercise for the student:
• A barrier for "OK to fill buffer"
• Producer-internal or consumer-internal barriers

33
Influence Residency
of Data in L2 Cache
34
Cache Memory Hierarchy
Residency of Data in Cache Affects Performance

Global Memory → L2 Cache → L1 Cache

Algorithms tune for spatial and temporal locality to get cache residency, and thus performance.

Spatial locality: the algorithm's nearby threads access nearby global memory
  array[ threadIdx.x ]  →  adjacent threads access adjacent memory

Temporal locality: the algorithm's nearby instructions access nearby global memory
  array[ i ]
  array[ i + 1 ]  →  adjacent instructions access adjacent memory

35
Influence Residency of Data in L2 Cache
Intra-Kernel and Inter-Kernel Performance Benefits

[Diagram: a single kernel re-reading its data through L2 instead of global memory, and a producer kernel handing data to a consumer kernel through L2]

• Reduce intra-kernel trips to global memory
• Reduce producer → consumer inter-kernel trips to global memory

36
Access Policy to Influence L2 Residency
Select an Array in Global Memory to Persist in the L2 Cache

[Diagram: an array in global memory mapped into a persisting region of the L2 cache]

    cudaStreamAttrValue attr;
    attr.accessPolicyWindow.base_ptr  = /* beginning of array */ ;
    attr.accessPolicyWindow.num_bytes = /* number of bytes in array */ ;
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);

Set on a CUDA stream; applied to subsequent kernels launched in that stream.

37
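
For completeness, a host-side sketch that applies this access policy to a stream. The hitRatio and missProp fields are not shown on the slide but belong to cudaAccessPolicyWindow; the function name, array, and stream are hypothetical.

    // Host-side sketch: mark a device array as persisting in L2 for kernels on `stream`.
    #include <cuda_runtime.h>

    void set_persisting(cudaStream_t stream, void* d_array, size_t num_bytes)
    {
        cudaStreamAttrValue attr = {};
        attr.accessPolicyWindow.base_ptr  = d_array;                      // beginning of the array
        attr.accessPolicyWindow.num_bytes = num_bytes;                    // bytes covered by the window
        attr.accessPolicyWindow.hitRatio  = 1.0f;                         // fraction of accesses given hitProp
        attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting; // more likely to stay in L2
        attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;  // the rest is less likely to stay
        cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
        // Subsequent kernels launched into `stream` see this access policy.
    }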
Access Policy to Influence L2 Residency
Select an Array in Global Memory to Persist in the L2 Cache

    attr.accessPolicyWindow.hitProp = cudaAccessPropertyPersisting;

L2 access policies:
• Persisting : accessed memory is more likely to persist in the L2 cache
• Streaming : accessed memory is less likely to persist in the L2 cache
• Force to Normal : removes the Persisting policy (*will come back to this)

38
Persistence in L2 Cache
Data Can Persist "Long" After the Kernel Exits

[Diagram: data staying resident in L2 across a kernel, and across a producer kernel → consumer kernel handoff]

Power: improve intra-kernel and inter-kernel producer → consumer performance

Responsibility: avoid oversubscribing the persisting L2 cache capacity
• Concurrently executing kernels should only use their fair share of the persisting L2 cache
• Clean up when done; don't let unused data persist in the L2 cache

Eventually the hardware will automatically clean up.

39
Clean Up
Remove the Persisting Property When No Longer Needed

1) The consumer kernel's stream cleans up:

    attr.accessPolicyWindow.base_ptr  = /* beginning of array */ ;
    attr.accessPolicyWindow.num_bytes = /* number of bytes in array */ ;
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyNormal;
    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
    consumer_kernel<<<..., stream>>>(...);

OR

2) The host cleans up the whole L2 cache: cudaCtxResetPersistingL2Cache();

40
Set Aside L2 Cache for Persisting Accesses
L2 Cache: Persisting, Normal, and Streaming

Setting the persisting set-aside is a device-level operation:

    cudaGetDeviceProperties(&prop, device);
    cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, prop.l2CacheSize * 0.75);

Interoperability limitations:
• Multi-Process Service (MPS): can only be set at process start via the environment variable
  CUDA_DEVICE_DEFAULT_PERSISTING_L2_CACHE_PERCENTAGE_LIMIT=75 (75%)
• Multi-Instance GPU (MIG): disables this feature

41
Influence L2 Cache Residency
Microbenchmark
42
Microbenchmark Performance Experiment
Update Elements of Two Arrays in a Random Order

    while( iter < count ) {
      index = random(...);
      if ( threadId % 2 )
        regular[ index % rLen ] = regular[ index % rLen ] + regular[ index % rLen ];
      else
        persist[ index % pLen ] = persist[ index % pLen ] + persist[ index % pLen ];
    }

                Persisting global array      Regular global array
    Size        0.25 x L2 cache size         4 x L2 cache size
    Update      Even-id threads update       Odd-id threads update

43
Reducing Global Memory Traffic
Metric: Percentage of Peak Global Bandwidth Utilized

[Chart: % of peak bandwidth used vs. % of L2 set aside (0 to 75%). The persisting array is 25% of L2 capacity; it fully persists with a 30% L2 set-aside, giving roughly a 25% reduction in global memory traffic. Leftover set-aside is used normally.]

Your algorithm's mileage will vary
• Identify an array to persist that is more frequently used

44
For more L2 residency examples and a performance deep-dive, see:
S21819: Optimizing Applications for NVIDIA Ampere GPU Architecture
Thursday, May 21 at 10:15am Pacific

45
Warp Synchronous Reduction
46
New CUDA Warp Intrinsics

    int __reduce_op_sync(unsigned mask, int val);

Integer reduce ops { add, min, max } and bitwise reduce ops { and, or, xor }

Approximately 10x faster than the current best shuffle-based fan-in algorithm

[Diagram: Before: five warp-shuffle steps combining 32 lanes' values pairwise (1 → 2 → 4 → 8 → 16 → 32). Now: one hardware-accelerated collective produces the full 32-lane result in a single step.]

47
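
A minimal sketch of the intrinsic family above: a warp-wide integer sum on compute capability 8.0, with the five-step shuffle fan-in it replaces as a fallback. The helper name is hypothetical.

    // Sketch: warp-wide integer sum with the new intrinsic (requires compute capability 8.0).
    __device__ int warp_sum(int val)
    {
    #if __CUDA_ARCH__ >= 800
        return __reduce_add_sync(0xffffffff, val);       // one HW-accelerated collective
    #else
        for (int offset = 16; offset > 0; offset >>= 1)  // five shuffle steps: 16, 8, 4, 2, 1
            val += __shfl_down_sync(0xffffffff, val, offset);
        return __shfl_sync(0xffffffff, val, 0);          // broadcast lane 0's total to all lanes
    #endif
    }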
CUDA Cooperative Groups Collective
thread_block_tile and coalesced_group

    value = reduce( group, value, op );

• group is a thread_block_tile<N> or a coalesced_group
• op is a C++ function object: plus<T>, less<T>, greater<T>, bit_and<T>, bit_or<T>, bit_xor<T>
• When the data type T is a 32-bit integer, the new CUDA warp intrinsics are used
• Otherwise, the best five-step warp-shuffle is used, applying the operator

48
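
A short usage sketch of the Cooperative Groups collective with a 32-lane tile; the kernel shape, buffer names, and the atomicAdd accumulation are choices of this sketch, not from the talk.

    // Sketch: warp-tile integer sum via cooperative_groups::reduce (CUDA 11.0+).
    #include <cooperative_groups.h>
    #include <cooperative_groups/reduce.h>
    namespace cg = cooperative_groups;

    __global__ void tile_sum(const int* in, int* out, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        int val = (idx < n) ? in[idx] : 0;

        cg::thread_block_tile<32> tile = cg::tiled_partition<32>(cg::this_thread_block());
        int sum = cg::reduce(tile, val, cg::plus<int>());   // 32-bit int maps to the new warp intrinsics

        if (tile.thread_rank() == 0)
            atomicAdd(out, sum);                            // accumulate per-tile partial sums
    }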
