
CUDA on the NVIDIA Ampere GPU Architecture
Taking your algorithms to the next level of performance
Carter Edwards, May 19, 2020
CUDA 11.0 Enhancements
Leveraging the NVIDIA Ampere GPU Architecture

Asynchronously Copy Global → Shared Memory

Flexible Synchronization for Producer → Consumer and other Algorithms

Influence Residency of Data in L2 Cache

Warp Synchronous Reduction

2
Asynchronously
Copy Global → Shared Memory
3
__shared__ Memory
Many Algorithms' Key for Performance

Current use of shared memory:
• Time-stepping and global data iteration
• Copy global data to shared memory
• Compute on shared memory

    extern __shared__ int shbuf[];
    while ( an_algorithm_iterates ) {
      __syncthreads();
      for ( i = ... ) {
        shbuf[i] = gldata[i]; /* copy */
      }
      __syncthreads();
      /* compute on shbuf[] */
    }

Copy and compute phases are sequenced: the __syncthreads() sandwich.
Global → Shared memory has a journey.

4
Global → Shared Memory Journey
The journey may be longer than it appears

Anatomy of the copy:  shbuf[i] = gldata[i];

Journey through memory before GA100:
  GMEM → L2 → L1 → register → SMEM

Better: don't pass through registers along the way; using fewer registers can improve occupancy
  GMEM → L2 → L1 → SMEM

Even better: don't pass through the L1 cache along the way; let other data persist longer
  GMEM → L2 → SMEM

5
Copy and Compute Sequencing
Each iteration: first copy GMEM → SMEM, then compute on SMEM

[Diagram: sequential timeline: copy GMEM into SMEM, compute on SMEM, copy, compute, and so on; copy and compute never overlap]

Better to compute while copying for later iteration(s)

[Diagram: example two-stage pipelining of copy and compute: copies alternate between SMEM[0] and SMEM[1], so the copy for the next iteration overlaps the compute on the current buffer]

6
Async-Copy Pipeline
Begin with a Simple One-Stage Pipeline

Submit an asynchronous copy via the better journey:
• To dst in shared memory
• From src in global memory
• Data type is trivially copyable

    pipeline pipe;
    memcpy_async(dst, src, pipe);
    pipe.commit_and_wait();

A thread submits as many async-copies as needed, then waits for all of its submitted asynchronous copy operations to complete.

7
Update Implementation with Async-Copy
Begin with a Simple One-Stage Pipeline

Before GA100 GPU:

    extern __shared__ int shbuf[];
    while ( an_algorithm_iterates ) {
      __syncthreads();
      for ( i = ... ) {
        shbuf[i] = gldata[i];
      }
      __syncthreads();
      /* compute on shbuf[] */
    }

Now:

    extern __shared__ int shbuf[];
    pipeline pipe;
    while ( an_algorithm_iterates ) {
      __syncthreads();
      for ( i = ... ) {
        memcpy_async(shbuf[i], gldata[i], pipe);
      }
      pipe.commit_and_wait();
      __syncthreads();
      /* compute on shbuf[] */
    }

Still have the __syncthreads() sandwich
8
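
For reference, a minimal self-contained kernel sketch of the one-stage pattern above. It follows the pipeline / memcpy_async / commit_and_wait names used on these slides; the header and namespace (CUDA 11.0's experimental pipeline interface) as well as the kernel name and loop bounds are assumptions of this sketch, not part of the talk.

    // Sketch only, assuming CUDA 11.0's experimental pipeline interface.
    #include <cuda_pipeline.h>            // assumed header for the pipeline shown on the slides
    using namespace nvcuda::experimental; // assumed namespace of that experimental interface

    __global__ void one_stage(const int* gldata, int n, int iterations) // hypothetical kernel
    {
        extern __shared__ int shbuf[];    // n ints of dynamic shared memory
        pipeline pipe;
        for (int iter = 0; iter < iterations; ++iter) {
            __syncthreads();              // buffer is free to overwrite
            for (int i = threadIdx.x; i < n; i += blockDim.x) {
                memcpy_async(shbuf[i], gldata[i], pipe);  // GMEM -> SMEM without staging in registers
            }
            pipe.commit_and_wait();       // wait for this thread's submitted copies
            __syncthreads();              // all threads' copies are now visible
            /* compute on shbuf[] */
        }
    }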


Async-Copy Pipeline
Then Improve with the Even Better Journey through Memory

    memcpy_async(dst, src, pipe);

Better async-copy journey (bypasses registers):
• To shared memory
• From global memory
• Data type is trivially copyable
• Data size is a multiple of 4 bytes
• Data is aligned to 4 bytes

Even better async-copy journey (also bypasses L1):
• To shared memory
• From global memory
• Data type is trivially copyable
• Data size is a multiple of 16 bytes
• Data is aligned to 16 bytes

9
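
To hit the 16-byte requirements above, the copied element itself can be made 16 bytes wide. A small illustrative sketch, assuming float4 elements (16 bytes, 16-byte aligned) and the same experimental pipeline header/namespace as the previous sketch; the kernel and buffer names are hypothetical.

    // Sketch: 16-byte elements satisfy the size and alignment rules for the even better journey.
    #include <cuda_pipeline.h>
    using namespace nvcuda::experimental;

    __global__ void copy16(const float4* gldata, int n)   // n float4 (16-byte) elements
    {
        extern __shared__ float4 shbuf4[];                 // dynamic shared memory, 16-byte aligned
        pipeline pipe;
        for (int i = threadIdx.x; i < n; i += blockDim.x) {
            memcpy_async(shbuf4[i], gldata[i], pipe);      // one 16-byte transfer per element
        }
        pipe.commit_and_wait();
        __syncthreads();
        /* compute on shbuf4[] */
    }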
Multi-Stage Pipeline
Then Improve by Overlapping Copy and Compute

    pipeline pipe;
    memcpy_async( dst, src, pipe );
    pipe.commit();
    pipe.wait_prior<N>();

Submit async-copy for dst = src
• Submit as many as needed
• Submit into the new stage of the pipeline

pipe.commit(): commit new stage K, but do not wait for it now
• The pipeline is a sequence of stages { 0, 1, ..., K }

pipe.wait_prior<N>(): wait for the prior stage K-N in the pipeline of async-copy operations to complete

[Diagram: timeline of stage-0 through stage-4; stage K is committed while waiting only for stage K-N]

10
Update Algorithm with Multi-Stage Pipeline
(1 of 2) Similar Pattern: Async-Copy, Commit, Wait

One Stage:

    for (stage=0; stage < end; ++stage) {
      __syncthreads();
      for ( i = ... ) {
        memcpy_async(sh[i], gl[i], pipe);
      }
      pipe.commit_and_wait();
      __syncthreads();
      /* compute on sh[] */
    }

Multi-Stage:

    for (stage=next=0; stage < end; ++stage) {
      /* __syncthreads(); */
      for (; next < stage + nStage ; ++next) {
        s = next % nStage ;
        for ( i = ... ) {
          memcpy_async(sh[s][i], gl[i], pipe);
        }
        pipe.commit();
      }
      pipe.wait_prior< nStage-1 >();
      __syncthreads();
      /* compute on sh[stage % nStage][] */
    }

11
Update Algorithm with Multi-Stage Pipeline
(2 of 2) Declare, Fill, and Wait with N Stages of Buffers

Annotations on the multi-stage code:
• Removed the leading __syncthreads()
• Submit async-copies for later stages, recycling among nStage buffers
• Commit the next stage but do not wait
• Wait for the current stage of copies to complete
  - stage = next - (nStage-1); the prior stage relative to the most recent commit
• __syncthreads(): wait for all threads to complete their copies

    for (stage=next=0; stage < end; ++stage) {
      /* __syncthreads(); */
      for (; next < stage + nStage ; ++next) {
        s = next % nStage ;
        for ( i = ... ) {
          memcpy_async(sh[s][i], gl[i], pipe);
        }
        pipe.commit();
      }
      pipe.wait_prior< nStage-1 >();
      __syncthreads();
      /* compute on sh[stage % nStage][] */
    }

12
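
Putting the pieces together, here is a minimal double-buffered sketch (nStage = 2) of the loop above, under the same experimental pipeline assumptions as the earlier sketches. TILE, the kernel name, and the tail handling with wait_prior<0>() are choices made for this sketch, not taken from the talk.

    // Sketch only: double-buffered multi-stage pipelining of copy and compute.
    #include <cuda_pipeline.h>
    using namespace nvcuda::experimental;

    constexpr int TILE = 256;                        // elements copied per stage (hypothetical)

    __global__ void multi_stage(const int* gl, int nStages)  // gl holds nStages * TILE ints
    {
        __shared__ int sh[2][TILE];
        pipeline pipe;

        // Prime the pipeline: submit stage 0 before the main loop.
        for (int i = threadIdx.x; i < TILE; i += blockDim.x)
            memcpy_async(sh[0][i], gl[i], pipe);
        pipe.commit();

        for (int stage = 0; stage < nStages; ++stage) {
            if (stage + 1 < nStages) {
                // Submit the next stage's copies into the other buffer, then wait for
                // everything except that newest commit, i.e. for the current stage.
                int s = (stage + 1) % 2;
                for (int i = threadIdx.x; i < TILE; i += blockDim.x)
                    memcpy_async(sh[s][i], gl[(stage + 1) * TILE + i], pipe);
                pipe.commit();
                pipe.wait_prior<1>();
            } else {
                pipe.wait_prior<0>();                // last stage: wait for all outstanding copies
            }
            __syncthreads();                         // current buffer is complete for all threads
            /* compute on sh[stage % 2][] */
            __syncthreads();                         // buffer may now be overwritten by a later stage
        }
    }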
Pairs Well with Cooperative Groups
Collective Async-Copy of a Whole Array

    template<class GroupType, class T>
    size_t memcpy_async( GroupType & group, T * dstPtr, size_t dstCount,
                         const T * srcPtr, size_t srcCount,
                         pipeline & pipe );

• GroupType is an intra-block Cooperative Group
• Partitions the array range [0..dstCount-1] among the group's threads
• Submits async-copies for: dstPtr[0..srcCount-1] = srcPtr[0..srcCount-1];
• Zero-fills the left-over: dstPtr[srcCount..dstCount-1] = 0;
• Given suitably aligned data, will use the better async-copy journeys through memory

13
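
A short usage sketch of the collective overload declared above, letting the whole thread block cooperate on one copy. It mirrors the slide's signature; the headers, namespaces, and the kernel and buffer names are assumptions of this sketch.

    // Sketch: block-wide collective async-copy of a whole array, per the overload above.
    #include <cooperative_groups.h>
    #include <cuda_pipeline.h>
    namespace cg = cooperative_groups;
    using namespace nvcuda::experimental;

    __global__ void collective_copy(const int* gldata, size_t count)   // hypothetical kernel
    {
        extern __shared__ int shbuf[];            // at least `count` ints of dynamic shared memory
        cg::thread_block block = cg::this_thread_block();
        pipeline pipe;

        // The group partitions the range among its threads; dstCount == srcCount, so no zero-fill.
        memcpy_async(block, shbuf, count, gldata, count, pipe);
        pipe.commit_and_wait();
        block.sync();                             // make every thread's copies visible block-wide
        /* compute on shbuf[] */
    }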
Async-Copy Microbenchmark
14
Your Algorithm's Mileage May Vary
Microbenchmark Comparing Synchronous vs. Asynchronous Copy

    shbuf[i] = gldata[i];   vs.   memcpy_async(shbuf[i], gldata[i], pipe);

Simple one-stage pipeline
• Without computations
• Ample registers available

Results: consistently better performance; best when
• Data type is 16 bytes (to get the even better journey through memory)
• Thread block size is modest

Exception: a corner case where the traditional synchronous copy can perform better

15
Microbenchmark Performance Experiment
Microbenchmark Comparing Synchronous vs. Asynchronous Copy

    /* sync: conventional synchronous memory copy */
    for (size_t i = 0; i < copy_count; ++i) {
      shared[blockDim.x * i + threadIdx.x] = global[blockDim.x * i + threadIdx.x];
    }

    /* async: asynchronous memory copy */
    pipeline pipe;
    for (size_t i = 0; i < copy_count; ++i) {
      memcpy_async( shared[blockDim.x * i + threadIdx.x],
                    global[blockDim.x * i + threadIdx.x], pipe );
    }
    pipe.commit();
    pipe.wait_prior<0>();

16
Performance Experiment
Varied Thread Block Size and Sizeof Data Type

Thread block sizes : 128, 256, 512
Data type sizes : 4 and 16 bytes

    /* Conventional synchronous memory copy */
    for (size_t i = 0; i < copy_count; ++i) {
      shared[blockDim.x*i + threadIdx.x] = global[blockDim.x*i + threadIdx.x];
    }

Measure the clock cycles required to copy an array of N bytes
• Y-axis = clock cycles
• X-axis = bytes copied

A simple experiment with surprisingly complicated results; the next slides step through them.

[Chart: "Copy 4Byte Elements": clock cycles vs. bytes copied (0 to 20000) for sync-128, sync-256, sync-512, async-128, async-256, async-512]

17
Performance Experiment
(1 of 4) The Compiler Optimizes the Traditional Synchronous Copy

The compiler unrolls the synchronous copy loop, keeping up to four loads/stores "in flight".

[Charts: "Copy 4Byte Data Type" and "Copy 16Byte Data Type": clock cycles to copy vs. bytes copied for sync-128, annotated with the number of loads/stores in flight]

18
Performance Experiment
(2 of 4) Async-Copy is Faster

Async-copy does not need the compiler's unrolling and instruction-scheduling optimizations.

[Charts: "Copy 4Byte Data Type" and "Copy 16Byte Data Type": clock cycles vs. bytes copied; async-128 is faster than sync-128]

19
Performance Experiment
(3 of 4) Copying a 16-Byte Data Type is Faster

[Charts: "Copy 4Byte Data Type" vs. "Copy 16Byte Data Type": clock cycles vs. bytes copied for sync-128 and async-128; the 16-byte case is faster]

20
Performance Experiment
(4 of 4) Corner Case where Synchronous Copy Might Win

With a large number of threads (512+), many available registers, and ample opportunity to hide latency, the traditional synchronous copy can perform better.

[Charts: "Copy 4Byte Elements" and "Copy 16Byte Elements": clock cycles vs. bytes copied for sync-512 and async-512]

21
For more async-copy examples and a performance deep-dive, see:
S21819: Optimizing Applications for NVIDIA Ampere GPU Architecture
Thursday, May 21 at 10:15am Pacific

22
Synchronize
Producer → Consumer
and other Algorithms
23
Built-in Sync Functions (Barriers)
__syncthreads(), __syncwarp(), and siblings

These are the absolute best-performing barriers for
• Synchronizing a whole thread block
• Synchronizing a warp, or a subset of a warp

(CUDA Cooperative Groups also provides this_grid().sync().)

And now CUDA adds barriers to synchronize
• A multi-warp subset of a thread block
• The Producer → Consumer pattern
• Integrated synchronization of thread execution and asynchronous memory copy

24
Split Producer → Consumer Thread Block
a.k.a. Algorithms with Block Partitioning and Specialization

[Diagram: producer threads fill SMEM[0] and SMEM[1] in alternation while consumer threads compute on the other buffer]

Producer threads and consumer threads must coordinate:
• The consumer must wait until a buffer is ready for consumption
• The producer must wait until a buffer is available for production

Recommendation: keep a warp's threads together in the same partition

25
Producer → Consumer Synchronization
Need a New Type of Barrier for "One-Sided" Sync

Using two barriers per stage:
• Consumer → Producer sync : OK to fill SMEM buffer[ stage % nStage ]
• Producer → Consumer sync : done filling SMEM buffer[ stage % nStage ]

[Diagram: producer and consumer threads alternating over SMEM[0] and SMEM[1], with the two barriers coordinating each stage]

26
Arrive/Wait Barrier
Enabling Producer → Consumer Synchronization
cuda::barrier<...>

• An implementation of the ISO C++ arrive/wait barrier
• Flexibly synchronizes arbitrary groups of threads
• This presentation only considers threads within a CUDA thread block

First: introduce the arrive/wait barrier

Then: show it in the producer → consumer pattern

27
Arrive/Wait Barrier
Introductory Example: Replacing __syncthreads()

Before:

    __global__ void kernel()
    {
      while ( iterating ) {
        __syncthreads();
      }
    }

After (barrier object in shared memory):

    __global__ void kernel()
    {
      __shared__ cuda::barrier<...> bar;
      /* ... initialize bar ... */
      while ( iterating ) {
        bar.arrive_and_wait();
      }
    }

Note: __syncthreads() still has the best performance for synchronizing a whole thread block.

28
Split Arrive and then Wait
Thread Group Memory Ordering and Visibility

bar.arrive_and_wait() is equivalent to bar.wait( bar.arrive() );

• A thread's memory updates BEFORE arrive are visible to the thread group AFTER wait
• Memory updates BETWEEN arrive and wait should be local to this thread
• Put the BETWEEN time to good use, otherwise threads may just idle (a sketch follows this slide)

    __shared__ int x ;
    while ( iterating ) {
      if ( tid == 0 ) x = 42 ;
      /* BEFORE */
      auto token = bar.arrive();
      /* BETWEEN */
      bar.wait(token);
      /* AFTER */
      assert( x == 42 );
    }

29
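
One way to put the BETWEEN window to use: a minimal sketch, assuming libcu++'s cuda::barrier from <cuda/barrier> already initialized for the participating threads; the helper name and the thread-local work are hypothetical.

    // Sketch: overlap thread-local work with the split arrive/wait.
    #include <cuda/barrier>
    #include <cuda/std/utility>   // cuda::std::move

    __device__ int stage_sum(cuda::barrier<cuda::thread_scope_block>& bar,
                             int* sh, const int* local, int n)
    {
        if (threadIdx.x == 0) sh[0] = 42;     // shared update BEFORE arrive
        auto token = bar.arrive();            // signal arrival, do not block yet

        int acc = 0;                          // BETWEEN: keep to thread-local (register) work
        for (int i = 0; i < n; ++i) acc += local[i];

        bar.wait(cuda::std::move(token));     // AFTER: the group's prior shared updates are visible
        return acc + sh[0];
    }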
Introduce Async-Copy Operations
Thread Group Memory Ordering and Visibility

Transfer the pipeline's wait to the barrier:
• Threads submit async-copies BEFORE
• The pipeline arrives on the barrier: pipe.arrive_on(bar);
• The barrier wait combines the pipeline wait and the thread synchronization wait
• Copied data is visible to the thread group AFTER

    while ( iterating ) {
      memcpy_async(sh[i], gl[j], pipe);
      /* BEFORE */
      pipe.arrive_on(bar);
      auto token = bar.arrive();
      /* BETWEEN */
      bar.wait(token);
      /* AFTER */
      assert( sh[i] == gl[j] );
    }

30
Arrive/Wait Barrier Initialization
Enabling Arbitrary Subgroups of a Thread Block

"Bootstrap initialization":
• The barrier object starts out uninitialized
• Choose one thread to initialize it
• Initialize with the number of threads that will arrive and wait (participate)
• Synchronize the participating threads before they use the barrier

    __global__ void kernel()
    {
      __shared__ barrier<...> bar;
      /* ... to initialize bar: */
      if ( tid == 0 ) init( &bar, NumThreads );
      __syncthreads();
      /* barrier is ready for use */
    }

The barrier can be initialized to synchronize an arbitrary subset of a thread block.
Keep the threads of a warp together for best performance.

31
Producer → Consumer Async-Copy
Partition the Thread Block into Producer and Consumer Subsets

    __global__ void kernel() {
      __shared__ barrier<...> bar;
      if ( 0 == threadIdx.x ) init(&bar, blockDim.x);
      __syncthreads();
      if ( threadIdx.x < NumProducer ) producer(bar);
      else consumer(bar);
    }

Producer threads:

    /* fill shared memory buffer */
    memcpy_async(sh[i], gl[j], pipe);
    pipe.arrive_on(bar);
    bar.arrive();

Consumer threads:

    /* wait for fill of shared memory buffer */
    bar.arrive_and_wait();
    /* compute using buffer */

One-sided synchronization: the producer arrives and the consumer waits.

32
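
Stitching the three fragments above into one hedged sketch: one buffer and one "buffer is filled" barrier. cuda::barrier<cuda::thread_scope_block> is assumed as the concrete type behind the slide's barrier<...>; pipe.arrive_on(bar) is written exactly as on the slide; BUF_LEN and NumProducer are hypothetical values; the "OK to fill" barrier for recycling buffers is deliberately omitted, as on the slide.

    // Sketch only: producer threads fill the buffer, consumer threads wait on the barrier.
    #include <cuda/barrier>
    #include <cuda_pipeline.h>
    using namespace nvcuda::experimental;
    using block_barrier = cuda::barrier<cuda::thread_scope_block>;

    constexpr int BUF_LEN     = 256;   // hypothetical buffer length
    constexpr int NumProducer = 64;    // hypothetical producer partition size

    __device__ void producer(block_barrier& bar, int* sh, const int* gl)
    {
        pipeline pipe;
        for (int i = threadIdx.x; i < BUF_LEN; i += NumProducer)
            memcpy_async(sh[i], gl[i], pipe);   // fill the shared buffer
        pipe.arrive_on(bar);                    // the pipeline arrives once the copies complete
        bar.arrive();                           // producer arrives, does not wait
    }

    __device__ void consumer(block_barrier& bar, const int* sh)
    {
        bar.arrive_and_wait();                  // wait for "buffer is filled"
        /* compute using sh[] */
    }

    __global__ void kernel(const int* gl)
    {
        __shared__ int sh[BUF_LEN];
        __shared__ block_barrier bar;
        if (0 == threadIdx.x) init(&bar, blockDim.x);  // every thread participates
        __syncthreads();
        if (threadIdx.x < NumProducer) producer(bar, sh, gl);
        else                           consumer(bar, sh);
    }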
Producer → Consumer Pattern
We Did a Deep Dive into Some of the Details

Covered:
• Producer uses async-copy to fill the buffer
• Producer uses a barrier for "buffer is filled"
• Consumer waits on the barrier for "buffer is filled"

Exercise for the student:
• A barrier for "OK to fill buffer"
• Producer-internal or consumer-internal barriers

33
Influence Residency
of Data in L2 Cache
34
Cache Memory Hierarchy
Residency of Data in Cache Affects Performance

Global Memory → L2 Cache → L1 Cache

Algorithms tune for spatial and temporal locality to get cache residency, and thus performance.

Spatial locality: the algorithm's nearby threads access nearby global memory
  array[ threadIdx.x ]  →  adjacent threads access adjacent memory

Temporal locality: the algorithm's nearby instructions access nearby global memory
  array[ i ]
  array[ i + 1 ]  →  adjacent instructions access adjacent memory

35
Influence Residency of Data in L2 Cache
Intra-Kernel and Inter-Kernel Performance Benefits

[Diagram: a single kernel re-reading its data through L2 instead of global memory, and a producer kernel handing data to a consumer kernel through L2]

• Reduce intra-kernel trips to global memory
• Reduce producer → consumer inter-kernel trips to global memory

36
Access Policy to Influence L2 Residency
Select an Array in Global Memory to Persist in the L2 Cache

[Diagram: an array in global memory mapped into a persisting region of the L2 cache]

    cudaStreamAttrValue attr;
    attr.accessPolicyWindow.base_ptr  = /* beginning of array */ ;
    attr.accessPolicyWindow.num_bytes = /* number of bytes in array */ ;
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);

Set on a CUDA stream; applied to subsequent kernels launched in that stream.

37
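
For completeness, a host-side sketch that applies this access policy to a stream. The hitRatio and missProp fields are not shown on the slide but belong to cudaAccessPolicyWindow; the function name, array, and stream are hypothetical.

    // Host-side sketch: mark a device array as persisting in L2 for kernels on `stream`.
    #include <cuda_runtime.h>

    void set_persisting(cudaStream_t stream, void* d_array, size_t num_bytes)
    {
        cudaStreamAttrValue attr = {};
        attr.accessPolicyWindow.base_ptr  = d_array;                      // beginning of the array
        attr.accessPolicyWindow.num_bytes = num_bytes;                    // bytes covered by the window
        attr.accessPolicyWindow.hitRatio  = 1.0f;                         // fraction of accesses given hitProp
        attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting; // more likely to stay in L2
        attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;  // the rest is less likely to stay
        cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
        // Subsequent kernels launched into `stream` see this access policy.
    }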
Access Policy to Influence L2 Residency
Select an Array in Global Memory to Persist in the L2 Cache

    attr.accessPolicyWindow.hitProp = cudaAccessPropertyPersisting;

L2 access policies:
• Persisting : accessed memory is more likely to persist in the L2 cache
• Streaming : accessed memory is less likely to persist in the L2 cache
• Force to Normal : removes the Persisting policy (*will come back to this)

38
Persistence in L2 Cache
Data Can Persist "Long" After the Kernel Exits

[Diagram: data staying resident in L2 across a kernel, and across a producer kernel → consumer kernel handoff]

Power: improve intra-kernel and inter-kernel producer → consumer performance

Responsibility: avoid oversubscribing the persisting L2 cache capacity
• Concurrently executing kernels should only use their fair share of the persisting L2 cache
• Clean up when done; don't let unused data persist in the L2 cache

Eventually the hardware will automatically clean up.

39
Clean Up
Remove the Persisting Property When No Longer Needed

1) The consumer kernel's stream cleans up:

    attr.accessPolicyWindow.base_ptr  = /* beginning of array */ ;
    attr.accessPolicyWindow.num_bytes = /* number of bytes in array */ ;
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyNormal;
    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
    consumer_kernel<<<..., stream>>>(...);

OR

2) The host cleans up the whole L2 cache: cudaCtxResetPersistingL2Cache();

40
Set Aside L2 Cache for Persisting Accesses
L2 Cache: Persisting, Normal, and Streaming

Setting the persisting set-aside is a device-level operation:

    cudaGetDeviceProperties(&prop, device);
    cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, prop.l2CacheSize * 0.75);

Interoperability limitations:
• Multi-Process Service (MPS): can only be set at process start via the environment variable
  CUDA_DEVICE_DEFAULT_PERSISTING_L2_CACHE_PERCENTAGE_LIMIT=75 (75%)
• Multi-Instance GPU (MIG): disables this feature

41
Influence L2 Cache Residency
Microbenchmark
42
Microbenchmark Performance Experiment
Update Elements of Two Arrays in a Random Order

    while( iter < count ) {
      index = random(...);
      if ( threadId % 2 )
        regular[ index % rLen ] = regular[ index % rLen ] + regular[ index % rLen ];
      else
        persist[ index % pLen ] = persist[ index % pLen ] + persist[ index % pLen ];
    }

                Persisting global array      Regular global array
    Size        0.25 x L2 cache size         4 x L2 cache size
    Update      Even-id threads update       Odd-id threads update

43
Reducing Global Memory Traffic
Metric: Percentage of Peak Global Bandwidth Utilized

[Chart: % of peak bandwidth used vs. % of L2 set aside (0 to 75%). The persisting array is 25% of L2 capacity; it fully persists with a 30% L2 set-aside, giving roughly a 25% reduction in global memory traffic. Leftover set-aside is used normally.]

Your algorithm's mileage will vary
• Identify an array to persist that is more frequently used

44
For more L2 residency examples and a performance deep-dive, see:
S21819: Optimizing Applications for NVIDIA Ampere GPU Architecture
Thursday, May 21 at 10:15am Pacific

45
Warp Synchronous Reduction
46
New CUDA Warp Intrinsics

    int __reduce_op_sync(unsigned mask, int val);

Integer reduce ops { add, min, max } and bitwise reduce ops { and, or, xor }

Approximately 10x faster than the current best shuffle-based fan-in algorithm

[Diagram: Before: five warp-shuffle steps combining 32 lanes' values pairwise (1 → 2 → 4 → 8 → 16 → 32). Now: one hardware-accelerated collective produces the full 32-lane result in a single step.]

47
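
A minimal sketch of the intrinsic family above: a warp-wide integer sum on compute capability 8.0, with the five-step shuffle fan-in it replaces as a fallback. The helper name is hypothetical.

    // Sketch: warp-wide integer sum with the new intrinsic (requires compute capability 8.0).
    __device__ int warp_sum(int val)
    {
    #if __CUDA_ARCH__ >= 800
        return __reduce_add_sync(0xffffffff, val);       // one HW-accelerated collective
    #else
        for (int offset = 16; offset > 0; offset >>= 1)  // five shuffle steps: 16, 8, 4, 2, 1
            val += __shfl_down_sync(0xffffffff, val, offset);
        return __shfl_sync(0xffffffff, val, 0);          // broadcast lane 0's total to all lanes
    #endif
    }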
CUDA Cooperative Groups Collective
thread_block_tile and coalesced_group

    value = reduce( group, value, op );

• group is a thread_block_tile<N> or a coalesced_group
• op is a C++ function object: plus<T>, less<T>, greater<T>, bit_and<T>, bit_or<T>, bit_xor<T>
• When the data type T is a 32-bit integer, the new CUDA warp intrinsics are used
• Otherwise, the best five-step warp-shuffle is used, applying the operator

48
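
A short usage sketch of the Cooperative Groups collective with a 32-lane tile; the kernel shape, buffer names, and the atomicAdd accumulation are choices of this sketch, not from the talk.

    // Sketch: warp-tile integer sum via cooperative_groups::reduce (CUDA 11.0+).
    #include <cooperative_groups.h>
    #include <cooperative_groups/reduce.h>
    namespace cg = cooperative_groups;

    __global__ void tile_sum(const int* in, int* out, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        int val = (idx < n) ? in[idx] : 0;

        cg::thread_block_tile<32> tile = cg::tiled_partition<32>(cg::this_thread_block());
        int sum = cg::reduce(tile, val, cg::plus<int>());   // 32-bit int maps to the new warp intrinsics

        if (tile.thread_rank() == 0)
            atomicAdd(out, sum);                            // accumulate per-tile partial sums
    }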
