Uncoalesced Global Accesses
SAMPLE
Chapter 1. Introduction
Chapter 2. Application
Chapter 3. Configuration
Chapter 4. Initial version of the kernel
Chapter 5. Updated version of the kernel
Chapter 6. Resources
www.nvidia.com
Uncoalesced Global Accesses Sample v2023.3.1 | ii
Chapter 1.
INTRODUCTION
This sample uses the Nsight Compute profiler to profile a memory-bound CUDA kernel
that performs a simple computation on an array of double3 elements in global memory.
The profiler is used to identify the memory accesses that are uncoalesced and
therefore result in inefficient DRAM accesses.
Chapter 2.
APPLICATION
The sample CUDA application adds a floating-point constant to an input array of
1,048,576 (1024*1024) double3 elements in global memory and writes the result to an
output array of double3 of the same size, also in global memory. double3 is a 24-byte
built-in vector type, a structure containing three double-precision floating-point values:
struct
{
double x, y, z;
};
Chapter 3.
CONFIGURATION
The profiling results included in this document were collected on the following
configuration:
‣ Target system: Linux (x86_64) with an NVIDIA RTX A4500 (Ampere GA102) GPU
‣ Nsight Compute version: 2023.3.1
The Nsight Compute UI screenshots in this document were taken by opening the profiling
reports on a Windows 10 system.
Chapter 4.
INITIAL VERSION OF THE KERNEL
The initial version of the sample code provides a naive implementation of the kernel,
which adds a floating-point constant to each element of an input array of double3.
The instruction a = d_in[index] in the kernel code causes consecutive threads in a warp
to access global memory locations that are 24 bytes apart. In the first step, all threads
request a load of d_in[index].x, as shown in the following diagram. In the second step,
all threads request a load of d_in[index].y, and in the third step, a load of d_in[index].z.
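The cost of the 24-byte stride can be estimated with a small host-side model. The sketch below is plain C++ rather than device code, and it rests on assumptions that are not stated in this document: a warp of 32 threads, 32-byte memory sectors, an array that starts on a sector boundary, and one 8-byte load per thread per step. The function name sectors_per_request is hypothetical and is not part of the sample or of Nsight Compute.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Assumed hardware parameters (not taken from the sample text).
constexpr std::size_t kWarpSize = 32;     // threads per warp
constexpr std::size_t kSectorBytes = 32;  // bytes per memory sector

// Counts the distinct sectors touched when every thread of one warp loads
// `access_bytes` bytes starting at base_offset + thread * stride_bytes.
// The array is assumed to start on a sector boundary (byte offset 0).
std::size_t sectors_per_request(std::size_t stride_bytes,
                                std::size_t base_offset,
                                std::size_t access_bytes) {
    assert(access_bytes > 0);
    std::set<std::size_t> sectors;
    for (std::size_t thread = 0; thread < kWarpSize; ++thread) {
        const std::size_t first = base_offset + thread * stride_bytes;
        const std::size_t last = first + access_bytes - 1;
        // Record every sector overlapped by this thread's access.
        for (std::size_t s = first / kSectorBytes; s <= last / kSectorBytes; ++s) {
            sectors.insert(s);
        }
    }
    return sectors.size();
}
```

Under these assumptions, sectors_per_request(24, 0, 8) models the load of the x component and evaluates to 24, and the y and z steps (base offsets 8 and 16) give the same count. A warp reading 32 contiguous doubles, sectors_per_request(8, 0, 8), touches only 8 sectors, so each step of the naive kernel requests roughly three times the data it uses.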
‣ Refer to the README distributed with the sample for instructions on building the application
‣ Run ncu-ui on the host system
‣ Use a local connection if the GPU is on the host system. If the GPU is on a remote
system, set up a remote connection to the target system
‣ Use the Profile activity to profile the sample application
‣ Choose the full section set
‣ Use defaults for all other options
‣ Set a report name and then click on Launch
Summary page
The Summary page lists the kernels profiled and provides some key metrics for each
profiled kernel. It also lists the performance opportunities and estimated speedup for
each. In this sample we have only one kernel launch.
The duration for this initial version of the kernel is 89.86 microseconds and this is used
as the baseline for further optimizations.
For this kernel, the Summary page shows a hint for Uncoalesced Global Accesses and
suggests checking the L2 Theoretical Sectors Global Excessive table for the primary
source locations. Click the Uncoalesced Global Accesses rule link to see more
context. The link opens the Source Counters section on the Details page.
Source page
The CUDA source and the SASS (GPU assembly) for the kernel are shown side by side.
When the Source page is opened from the Source Counters section, the Navigation metric
is filled in automatically to match, in this case L2 Theoretical Sectors Global
Excessive. The selected metric is indicated by the bold column header. The source line
at which the bottleneck occurs is highlighted.
The Source page shows uncoalesced global memory load accesses at line 55:
double3 a = d_in[index];
d_out[index] = a;
The Source page shows notifications as Source Markers in the left header of both the
source and SASS code. Hovering the mouse over a marker shows details for the specific
source line in a pop-up window.
Chapter 5.
UPDATED VERSION OF THE KERNEL
Given the uncoalesced accesses reported by the profiler, we analyze the global load
access pattern. Each thread executes three reads, one for each double value in a double3.
We can instead treat the double3 array as a double array and have each thread process
one double rather than one double3. With this change, the threads in a warp access
consecutive double values, and both loads and stores are coalesced.
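The change in indexing can be illustrated with a host-side sketch. The sample's actual kernel code is not reproduced in this document, so everything below is an assumption made for illustration: Double3 is a stand-in for the built-in double3, the two add_per_* functions model the work of a single thread in each version, and a plain loop replaces the kernel launch. The point is only that processing 3*N doubles gives the same output as processing N double3 elements.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Stand-in for CUDA's built-in double3 so the sketch compiles as plain C++.
struct Double3 { double x, y, z; };
static_assert(sizeof(Double3) == 3 * sizeof(double), "no padding expected");

// Work of "thread" `index` in the initial version: one Double3 per thread.
void add_per_double3(const Double3* in, Double3* out, std::size_t index, double c) {
    Double3 a = in[index];
    a.x += c;
    a.y += c;
    a.z += c;
    out[index] = a;
}

// Work of "thread" `index` in the updated version: one double per thread.
void add_per_double(const double* in, double* out, std::size_t index, double c) {
    out[index] = in[index] + c;
}

// Runs both versions on the same data and reports whether the outputs match.
bool versions_agree(std::size_t n, double c) {
    assert(n > 0);
    std::vector<Double3> in3(n), out3(n);
    for (std::size_t i = 0; i < n; ++i) {
        in3[i] = {3.0 * i, 3.0 * i + 1.0, 3.0 * i + 2.0};
    }
    // Flat copy of the same bytes, viewed as 3 * n doubles.
    std::vector<double> in1(3 * n), out1(3 * n);
    std::memcpy(in1.data(), in3.data(), n * sizeof(Double3));

    // Initial version: n "threads". Updated version: 3 * n "threads".
    for (std::size_t i = 0; i < n; ++i) add_per_double3(in3.data(), out3.data(), i, c);
    for (std::size_t i = 0; i < 3 * n; ++i) add_per_double(in1.data(), out1.data(), i, c);

    return std::memcmp(out1.data(), out3.data(), n * sizeof(Double3)) == 0;
}
```

In a real kernel, index would typically be derived from the block and thread indices, and the launch for the updated version would need to cover three times as many elements. Those details depend on the sample source and are not shown here.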
We can also confirm that the global memory accesses are now coalesced. In the L1/TEX
Cache metrics table under the Memory Workload Analysis section, the Sectors/Req metric
value is 8 for both global loads and global stores. In addition, the Source Counters
section no longer reports any Uncoalesced Global Accesses.
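The value of 8 matches a simple count, assuming 32 threads per warp, one 8-byte double per thread, and 32-byte sectors. These sizes are assumptions for illustration and are not stated in this document:

```latex
\frac{32\ \text{threads} \times 8\ \text{bytes per thread}}{32\ \text{bytes per sector}} = 8\ \text{sectors per request}
```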
To understand why the runtime did not decrease for the updated version of the kernel
on this device, we can navigate to the Memory Chart. The chart reveals that in the
initial version the caches were able to compensate for the poor memory access pattern:
in the updated version the L2 hit rate drops significantly and much less data is
transferred between the L1 and L2 caches. The chart also indicates that the actual
bottleneck is DRAM throughput, which is the metric closest to its peak. Because the
amount of data transferred between Device Memory and the L2 Cache did not change
significantly between the two versions of the kernel, the runtime did not change either.
Improving the memory access pattern as shown here, and thereby reducing the pressure
on the caches, can have a significant impact on the runtime, in particular for
non-streaming memory access patterns and on devices with smaller caches.
Chapter 6.
RESOURCES
Notice
ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS,
DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY,
"MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES,
EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE
MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF
NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR
PURPOSE.
Information furnished is believed to be accurate and reliable. However, NVIDIA
Corporation assumes no responsibility for the consequences of use of such
information or for any infringement of patents or other rights of third parties
that may result from its use. No license is granted by implication or otherwise
under any patent rights of NVIDIA Corporation. Specifications mentioned in this
publication are subject to change without notice. This publication supersedes and
replaces all other information previously supplied. NVIDIA Corporation products
are not authorized as critical components in life support devices or systems
without express written approval of NVIDIA Corporation.
Trademarks
NVIDIA and the NVIDIA logo are trademarks or registered trademarks of NVIDIA
Corporation in the U.S. and other countries. Other company and product names
may be trademarks of the respective companies with which they are associated.
Copyright
© 2022-2024 NVIDIA Corporation and affiliates. All rights reserved.
This product includes software developed by the Syncro Soft SRL (http://www.sync.ro/).