S21170: CUDA on NVIDIA Ampere GPU Architecture
Taking Your Algorithms to the Next Level of Performance
AMPERE MICROARCHITECTURE
Taking your algorithms to the next level of performance
Carter Edwards, May 19, 2020
CUDA 11.0 Enhancements
Leveraging NVIDIA Ampere GPU Microarchitecture
2
Asynchronously Copy Global → Shared Memory
3
__shared__ Memory
The Key to Performance for Many Algorithms
4
Global → Shared Memory Journey
The journey may be longer than it appears
Anatomy of a global → shared memory copy:
shbuf[i] = gldata[i];
Default path: GMEM → L2 → L1 → register → SMEM
Better: don't pass through registers along the way; using fewer registers can improve occupancy (GMEM → L2 → L1 → SMEM)
Even better: don't pass through the L1 cache either, letting other data persist in L1 longer (GMEM → L2 → SMEM)
5
Copy and Compute Sequencing
Each iteration: first copy GMEM → SMEM, then compute on SMEM
6
Async-Copy Pipeline
Begin with Simple One-Stage Pipeline
7
Update Implementation with Async-Copy
Begin with Simple One-Stage Pipeline
Before GA100 GPU:

    extern __shared__ int shbuf[];
    while ( an_algorithm_iterates ) {
      __syncthreads();
      for ( i = ... ) {
        shbuf[i] = gldata[i];
      }
      __syncthreads();
      /* compute on shbuf[] */
    }

Now, with async-copy:

    extern __shared__ int shbuf[];
    pipeline pipe;
    while ( an_algorithm_iterates ) {
      __syncthreads();
      for ( i = ... ) {
        memcpy_async(shbuf[i], gldata[i], pipe);
      }
      pipe.commit_and_wait();
      __syncthreads();
      /* compute on shbuf[] */
    }
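A concrete one-stage sketch using the CUDA 11 pipeline primitives from <cuda_pipeline.h>; the kernel signature, loop bounds, and tile length are illustrative assumptions, not the slide's exact code:

    #include <cuda_pipeline.h>   // __pipeline_memcpy_async / __pipeline_commit / __pipeline_wait_prior

    // Launch with tileLen * sizeof(int) bytes of dynamic shared memory.
    __global__ void one_stage(const int *gldata, int nIters, int tileLen)
    {
        extern __shared__ int shbuf[];                    // one tile of tileLen ints
        for (int it = 0; it < nIters; ++it) {
            __syncthreads();                              // previous compute done, shbuf reusable
            for (int i = threadIdx.x; i < tileLen; i += blockDim.x) {
                // 4-byte copy straight from global to shared, no register round trip
                __pipeline_memcpy_async(&shbuf[i], &gldata[it * tileLen + i], sizeof(int));
            }
            __pipeline_commit();                          // commit this thread's batch of copies
            __pipeline_wait_prior(0);                     // wait for the batch just committed
            __syncthreads();                              // all threads' copies have landed
            /* compute on shbuf[] */
        }
    }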
9
Multi-Stage Pipeline
Then Improve by Overlapping Copy and Compute
10
Update Algorithm with Multi-Stage Pipeline
(1 of 2) Similar Pattern: Async-Copy, Commit, Wait
One Stage:

    for (stage = 0; stage < end; ++stage) {
      __syncthreads();
      for ( i = ... ) {
        memcpy_async(sh[i], gl[i], pipe);
      }
      pipe.commit_and_wait();
      __syncthreads();
      /* compute on sh[] */
    }

Multi-Stage:

    for (stage = next = 0; stage < end; ++stage) {
      /* __syncthreads(); */
      for (; next < stage + nStage; ++next) {
        s = next % nStage;
        for ( i = ... ) {
          memcpy_async(sh[s][i], gl[i], pipe);
        }
        pipe.commit();
      }
      pipe.wait_prior< nStage-1 >();
      __syncthreads();
      /* compute on sh[stage % nStage][] */
    }
11
Update Algorithm with Multi-Stage Pipeline
(2 of 2) Declare, Fill, and Wait with N-Stages of Buffers
Multi-Stage, annotated:

    for (stage = next = 0; stage < end; ++stage) {
      /* __syncthreads(); */                  // removed the leading __syncthreads()
      for (; next < stage + nStage; ++next) { // submit async-copy for later stages,
        s = next % nStage;                    //   recycling among nStage buffers
        for ( i = ... ) {
          memcpy_async(sh[s][i], gl[i], pipe);
        }
        pipe.commit();                        // commit the next stage but do not wait
      }
      pipe.wait_prior< nStage-1 >();          // wait for the current stage's copies:
                                              //   stage == next - (nStage-1), the prior
                                              //   stage relative to the most recent commit
      __syncthreads();                        // wait for all threads to complete their copies
      /* compute on sh[stage % nStage][] */
    }
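A runnable multi-stage sketch under the same pipeline primitives; the stage count nStage, the kernel signature, and the trailing __syncthreads() that guards buffer reuse when threads read tiles copied by other threads are my assumptions:

    #include <cuda_pipeline.h>

    // Launch with nStage * tileLen * sizeof(int) bytes of dynamic shared memory.
    __global__ void multi_stage(const int *gldata, int end, int tileLen)
    {
        constexpr int nStage = 2;                               // number of in-flight buffers
        extern __shared__ int sh[];                             // nStage buffers of tileLen ints
        int next = 0;
        for (int stage = 0; stage < end; ++stage) {
            // Submit async copies up to nStage stages ahead, recycling the buffers.
            for (; next < stage + nStage; ++next) {
                int s = next % nStage;
                if (next < end) {                               // past the last tile: empty commit
                    for (int i = threadIdx.x; i < tileLen; i += blockDim.x) {
                        __pipeline_memcpy_async(&sh[s * tileLen + i],
                                                &gldata[next * tileLen + i], sizeof(int));
                    }
                }
                __pipeline_commit();                            // commit this stage, do not wait
            }
            __pipeline_wait_prior(nStage - 1);                  // current stage's copies complete
            __syncthreads();                                    // ...in every thread
            /* compute on &sh[(stage % nStage) * tileLen] */
            __syncthreads();                                    // guard the buffer before it is refilled
        }
    }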
12
Pairs Well with Cooperative Groups
Collective Async-Copy of a whole Array
    template< class GroupType, class T >
    size_t memcpy_async( GroupType & group,
                         T * dstPtr, size_t dstCount,
                         const T * srcPtr, size_t srcCount,
                         pipeline & pipe );
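A usage sketch following the collective signature above; the buffer names, tile length, and surrounding one-stage pipeline are illustrative assumptions (and it presumes #include <cooperative_groups.h> in the kernel's translation unit):

    // Whole thread block cooperatively copies tileLen elements, global -> shared,
    // through the async-copy pipeline.
    auto block = cooperative_groups::this_thread_block();
    memcpy_async( block, shbuf, tileLen, gldata, tileLen, pipe );
    pipe.commit_and_wait();
    __syncthreads();
    /* compute on shbuf[] */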
13
Async-Copy Microbenchmark
14
Your Algorithm’s Mileage May Vary
Microbenchmark Comparing Synchronous vs. Asynchronous Copy
shbuf[i]=gldata[i]; vs. memcpy_async(shbuf[i],gldata[i],pipe);
Simple one-stage pipeline
• Without computations
Except for a corner case where the traditional synchronous copy can perform better
15
Microbenchmark Performance Experiment
Microbenchmark Comparing Synchronous vs. Asynchronous Copy
16
Performance Experiment
Varied Thread Block Size and Data Type Size
A simple experiment with surprisingly complicated results; the next slides step through them.
[Chart: clock cycles vs. bytes copied for synchronous and asynchronous copy with thread block sizes of 128, 256, and 512]
17
Performance Experiment
(1 of 4) Compiler Optimizes Traditional Synchronous Copy
The compiler unrolls the synchronous copy loop, keeping up to four loads/stores "in flight".
[Charts: time to copy (clock cycles) vs. bytes copied for sync-128, comparing a 4-byte data type with a 16-byte data type]
18
Performance Experiment
(2 of 4) Async-Copy is Faster
[Charts: clock cycles vs. bytes copied; async-128 is faster than sync-128 for both the 4-byte and the 16-byte data type]
19
Performance Experiment
(3 of 4) Copying a 16-Byte Data Type is Faster
[Charts: clock cycles vs. bytes copied; copying 16-byte elements is faster than copying 4-byte elements for both sync-128 and async-128]
20
Performance Experiment
(4 of 4) Corner Case when Synchronous Copy might Win
With a large number of threads (512+), many available registers, and ample opportunity to hide latency, traditional synchronous copy can win.
[Charts: clock cycles vs. bytes copied for sync-512 and async-512, for 4-byte and 16-byte elements]
21
For more async-copy examples and a performance deep-dive, see:
S21819: Optimizing Applications for NVIDIA Ampere GPU Architecture
Thursday May 21 at 10:15am Pacific
22
Synchronize Producer → Consumer and Other Algorithms
23
Built-in Sync-Functions (barriers)
__syncthreads(), __syncwarp(), and siblings
24
Split Producer → Consumer Thread Block
a.k.a. Algorithms with Block Partitioning and Specialization
26
Arrive/Wait Barrier
Enabling Producer → Consumer Synchronization
cuda::barrier<...>
27
Arrive/Wait Barrier
Introductory Example: Replacing __syncthreads()
Before:

    __global__ void kernel()
    {
      while ( iterating ) {
        __syncthreads();
      }
    }

With an arrive/wait barrier object in shared memory:

    __global__ void kernel()
    {
      __shared__ cuda::barrier<...> bar;
      /* ... initialize bar ... */
      while ( iterating ) {
        bar.arrive_and_wait();
      }
    }
Note: __syncthreads() remains the best-performing way to synchronize a whole thread block
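A minimal runnable sketch of the pattern above, assuming the libcu++ block-scope barrier from CUDA 11.0 (<cuda/barrier>); the kernel shape and iteration count are illustrative:

    #include <cuda/barrier>

    __global__ void kernel(int nIters)
    {
        __shared__ cuda::barrier<cuda::thread_scope_block> bar;
        if (threadIdx.x == 0) {
            init(&bar, blockDim.x);      // expected arrival count = all threads in the block
        }
        __syncthreads();                 // make the initialized barrier visible to every thread

        for (int it = 0; it < nIters; ++it) {
            /* ... work ... */
            bar.arrive_and_wait();       // all threads arrive, then the phase completes
        }
    }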
28
Split Arrive and then Wait
Thread Group Memory Ordering and Visibility
bar.arrive_and_wait() is equivalent to bar.wait( bar.arrive() );

    __shared__ int x ;
    while ( iterating ) {
      if ( tid == 0 ) x = 42 ;
      /* BEFORE */
      auto token = bar.arrive();
      /* BETWEEN */
      bar.wait( std::move(token) );
      /* AFTER */
      assert( x == 42 );
    }

A thread's memory updates BEFORE arrive are visible to the thread group AFTER wait.
Memory updates BETWEEN arrive and wait should be local to this thread.
Put the BETWEEN time to good use; otherwise threads may simply idle.
29
Introduce Async-Copy Operations
Thread Group Memory Ordering & Visibility
30
Arrive/Wait Barrier Initialization
Enabling Arbitrary Subgroups of a Thread Block
Bootstrap Initialization:
• The barrier object starts uninitialized
• Choose one thread to initialize it

    __global__ void kernel()
    {
      __shared__ barrier<...> bar;
      /* ... to initialize bar: */
      if ( tid == 0 ) init( &bar, NumThreads );
      ...
    }

Producer threads use async-copy to fill the shared memory buffer:

    /* fill shared memory buffer */
    memcpy_async(sh[i],gl[j],pipe);
    pipe.arrive_on(bar);
    bar.arrive();

Consumer threads wait for the fill, then compute:

    /* wait for fill of shared memory buffer */
    bar.arrive_and_wait();
    /* compute using buffer */
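A hedged end-to-end sketch of this split; the warp split, the buffer size, and the use of the libcu++ cuda::memcpy_async overload that binds copy completion to a barrier are my assumptions, not the slide's exact code:

    #include <cuda/barrier>

    __global__ void producer_consumer(const int *gldata, int tileLen)
    {
        extern __shared__ int sh[];                        // tileLen ints
        __shared__ cuda::barrier<cuda::thread_scope_block> bar;
        if (threadIdx.x == 0) {
            init(&bar, blockDim.x);                        // every thread participates
        }
        __syncthreads();

        if (threadIdx.x < 32) {
            // Producer warp: one thread issues an async fill whose completion
            // is tracked by the barrier, then all producers arrive.
            if (threadIdx.x == 0) {
                cuda::memcpy_async(sh, gldata, sizeof(int) * tileLen, bar);
            }
            bar.arrive_and_wait();
        } else {
            // Consumer warps: the wait completes only after all arrivals
            // and the tracked async copy have finished.
            bar.arrive_and_wait();
            /* compute using sh[] */
        }
    }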
33
Influence Residency of Data in L2 Cache
34
Cache Memory Hierarchy
Residency of Data in Cache Affects Performance
Memory hierarchy: Global Memory → L2 Cache → L1 Cache
Algorithms tune for spatial and temporal locality to gain cache residency, and thus performance.
35
Influence Residency of Data in L2 Cache
Intra-Kernel and Inter-Kernel Performance Benefits
cudaStreamAttrValue attr ;
attr.accessPolicyWindow.hitProp = cudaAccessPropertyPersisting;
cudaStreamSetAttribute(stream,cudaStreamAttributeAccessPolicyWindow,&attr);
37
Access Policy to Influence L2 Residency
Select an Array in Global Memory to Persist in L2 Cache
[Diagram: an array in global memory is selected to persist in the L2 cache]

    attr.accessPolicyWindow.hitProp = cudaAccessPropertyPersisting ;

• Concurrently executing kernels only use their fair share of the persisting L2 cache
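A fuller hedged sketch of selecting the array: the access policy window also needs a base pointer and size; the array name, byte count, and the chosen hitRatio/missProp values below are illustrative assumptions:

    cudaStreamAttrValue attr = {};
    attr.accessPolicyWindow.base_ptr  = data;                         // start of the global-memory array
    attr.accessPolicyWindow.num_bytes = numBytes;                     // size of the window
    attr.accessPolicyWindow.hitRatio  = 1.0f;                         // fraction of accesses given hitProp
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting; // persist these lines in L2
    attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;  // remaining accesses stream through
    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);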
39
Clean Up
Remove the Persisting Property When No Longer Needed (two alternatives)
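The two standard CUDA 11 mechanisms, as a hedged sketch (my reading of the alternatives, not the slide's verbatim code; attr and stream are the objects from the earlier snippet):

    // Option 1: demote the window so its accesses are treated normally again
    attr.accessPolicyWindow.hitProp = cudaAccessPropertyNormal;
    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);

    // Option 2: reset all persisting L2 cache lines at once
    cudaCtxResetPersistingL2Cache();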
40
Set Aside L2 Cache for Persisting Accesses
L2 Cache: Persisting, Normal, and Streaming
cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize,prop.l2CacheSize*0.75);
Interoperability limitations
• With Multi-Process Service (MPS), the set-aside can only be configured at process start via an environment variable:
CUDA_DEVICE_DEFAULT_PERSISTING_L2_CACHE_PERCENTAGE_LIMIT=75 (75%)
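A hedged sketch of the set-aside call above, with the device-properties query that supplies l2CacheSize (the device index and the 75% fraction are illustrative):

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // Reserve ~75% of L2 for persisting accesses; the leftover stays available for normal use.
    cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize,
                       (size_t)(prop.l2CacheSize * 0.75));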
41
Influence L2 Cache Residency
Microbenchmark
42
Microbenchmark Performance Experiment
Update Elements of Array in a Random Order
43
Reducing Global Memory Traffic
Metric: Percentage of Peak Global Bandwidth Utilized
[Chart: percentage of peak global bandwidth utilized (0 to ~30%) vs. % of L2 set aside (0 to 75%)]
• Leftover set-aside is used normally
• Your algorithm's mileage will vary
• Identify an array to persist that is more frequently used
44
For more L2 residency examples and a performance deep-dive, see:
S21819: Optimizing Applications for NVIDIA Ampere GPU Architecture
Thursday May 21 at 10:15am Pacific
45
Warp Synchronous Reduction
46
New CUDA Warp Intrinsics
int __reduce_op_sync(unsigned mask, int val);
Integer reduce op { add, min, max } and bitwise reduce op { and, or, xor }
[Diagram: an add-reduction of the value 2 across all 32 lanes of a warp; the new intrinsic delivers the result 32 to every lane in one step, versus successive warp-shuffle doublings 2 → 4 → 8 → 16 → 32]
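A minimal sketch of the add variant (compute capability 8.0+); the kernel name and indexing are illustrative:

    __global__ void warp_sum(const int *in, int *out)
    {
        int v = in[blockIdx.x * blockDim.x + threadIdx.x];
        // All 32 lanes of the warp participate; every lane receives the warp-wide sum.
        int sum = __reduce_add_sync(0xffffffffu, v);
        if ((threadIdx.x & 31) == 0)
            out[(blockIdx.x * blockDim.x + threadIdx.x) >> 5] = sum;
    }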
47
CUDA Cooperative Group Collective
thread_block_tile and coalesced_group
value = reduce( group, value, op );
When the data type T is a 32-bit integer, the new CUDA warp intrinsics are used
Otherwise, the best five-step warp-shuffle sequence is used, applying the operator at each step
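A hedged sketch of the collective for a 32-wide tile (CUDA 11.0 cooperative groups); the kernel shape and the plus operator are illustrative:

    #include <cooperative_groups.h>
    #include <cooperative_groups/reduce.h>
    namespace cg = cooperative_groups;

    __global__ void tile_sum(const int *in, int *out)
    {
        cg::thread_block block = cg::this_thread_block();
        cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);

        int v = in[blockIdx.x * blockDim.x + block.thread_rank()];
        // For 32-bit integers and plus<int>, this maps to the new warp reduce intrinsic
        // on compute capability 8.0; otherwise it falls back to warp shuffles.
        int sum = cg::reduce(tile, v, cg::plus<int>());

        if (tile.thread_rank() == 0)
            out[blockIdx.x * tile.meta_group_size() + tile.meta_group_rank()] = sum;
    }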
48