UNCOALESCED GLOBAL ACCESSES

SAMPLE

v2023.3.1 | September 2024

TABLE OF CONTENTS

Chapter 1. Introduction
Chapter 2. Application
Chapter 3. Configuration
Chapter 4. Initial version of the kernel
Chapter 5. Updated version of the kernel
Chapter 6. Resources

www.nvidia.com
Uncoalesced Global Accesses Sample v2023.3.1 | ii
Chapter 1.
INTRODUCTION

This sample profiles a memory-bound CUDA kernel which does a simple computation
on an array of double3 data type in global memory using the Nsight Compute profiler.
The profiler is used to analyze and identify the memory accesses which are uncoalesced
and result in inefficient DRAM accesses.

Global memory accesses on a GPU

Global memory resides in device memory, which is accessed via 32-, 64-, or
128-byte memory transactions.
When a warp executes an instruction that accesses global memory, it coalesces the
memory accesses of the threads within the warp into one or more of these memory
transactions depending on the size of the data accessed by each thread and the
distribution of the memory addresses across the threads. If global memory accesses
of the threads within a warp cannot be combined into the same memory transaction
then we refer to these as uncoalesced global memory accesses. In general, the more
transactions are necessary, the more unused bytes are transferred in addition to the bytes
accessed by the threads, reducing the instruction throughput accordingly. For example,
if a 32-byte memory transaction is generated for each thread's 4-byte access, throughput
is divided by 8.
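
The coalescing rule above can be illustrated with a small host-side model. This is only a sketch under stated assumptions: a warp of 32 threads, a 32-byte transaction granularity, and thread t accessing a fixed number of bytes at a fixed stride. The names countSegments and usefulFraction are hypothetical helpers, not part of CUDA or Nsight Compute, and real hardware may group accesses differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

constexpr int kWarpSize = 32;                 // threads per warp (assumed)
constexpr std::uint64_t kSegmentBytes = 32;   // smallest transaction size from the text

// Hypothetical model: number of distinct 32-byte segments touched when
// thread t accesses accessBytes bytes starting at baseAddr + t * strideBytes.
inline std::size_t countSegments(std::uint64_t baseAddr,
                                 std::uint64_t strideBytes,
                                 std::uint64_t accessBytes)
{
    assert(accessBytes > 0);
    std::set<std::uint64_t> segments;
    for (int t = 0; t < kWarpSize; ++t) {
        const std::uint64_t first = baseAddr + static_cast<std::uint64_t>(t) * strideBytes;
        const std::uint64_t last = first + accessBytes - 1;
        for (std::uint64_t s = first / kSegmentBytes; s <= last / kSegmentBytes; ++s) {
            segments.insert(s);
        }
    }
    return segments.size();
}

// Fraction of the transferred bytes that the threads actually requested.
inline double usefulFraction(std::uint64_t strideBytes, std::uint64_t accessBytes)
{
    const double requested = static_cast<double>(kWarpSize) * static_cast<double>(accessBytes);
    const double transferred =
        static_cast<double>(countSegments(0, strideBytes, accessBytes)) *
        static_cast<double>(kSegmentBytes);
    return requested / transferred;
}
```

In this model, 4-byte accesses at a 4-byte stride touch 4 segments and use every transferred byte, while the same accesses at a 32-byte stride touch 32 segments and use only one eighth of the transferred bytes, which matches the factor of 8 in the example above.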

Chapter 2.
APPLICATION

The sample CUDA application adds a floating point constant to an input array of
1,048,576 (1024*1024) double3 elements in global memory and generates an output array
of double3 in global memory of the same size. double3 is a 24-byte built-in vector type
which is a structure containing 3 double precision floating point values:

struct
{
    double x, y, z;
};
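
The sizes mentioned above can be checked with a short host-side sketch. Double3 is a hypothetical stand-in for the built-in double3 type so that the code compiles without the CUDA headers, and arrayBytes is an illustrative helper, not part of the sample.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for CUDA's built-in double3.
struct Double3 {
    double x, y, z;
};

static_assert(sizeof(double) == 8, "this sketch assumes 8-byte doubles");
static_assert(sizeof(Double3) == 24, "three doubles occupy 24 bytes");

// Element count stated in the text: 1024 * 1024 = 1,048,576.
constexpr std::size_t kNumElements = 1024 * 1024;

// Bytes occupied by an array of count Double3 values.
inline std::size_t arrayBytes(std::size_t count)
{
    assert(count <= SIZE_MAX / sizeof(Double3));  // guard against overflow
    return count * sizeof(Double3);
}
```

With these numbers, the input and output arrays each occupy 25,165,824 bytes (24 MiB).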

The uncoalescedGlobalAccesses sample is available with Nsight Compute under
<nsight-compute-install-directory>/extras/samples/uncoalescedGlobalAccesses.

Chapter 3.
CONFIGURATION

The profiling results included in this document were collected on the following
configuration:
‣ Target system: Linux (x86_64) with an NVIDIA RTX A4500 (Ampere GA102) GPU
‣ Nsight Compute version: 2023.3.1
The Nsight Compute UI screenshots in this document were taken by opening the profiling
reports on a Windows 10 system.

Chapter 4.
INITIAL VERSION OF THE KERNEL

The initial version of the sample code provides a naive implementation for the kernel
which adds a floating point constant to an input array of double3.

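__global__ void addConstDouble3(int numElements, double3 *d_in, double k,
                                double3 *d_out)
{
    // One thread per double3 element.
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < numElements)
    {
        double3 a = d_in[index];  // neighboring threads read 24 bytes apart
        a.x += k;
        a.y += k;
        a.z += k;
        d_out[index] = a;         // neighboring threads write 24 bytes apart
    }
}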

The instruction a = d_in[index] in the kernel code results in each thread in a warp
accessing global memory 24 bytes apart. In the first step all threads request a load
for d_in[index].x as shown in the following diagram. In the second step all threads
request a load for d_in[index].y, and in the third step a load for d_in[index].z.

The instruction d_out[index] = a; has a similar multistep store pattern.
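
The cost of this three-step pattern can be estimated with a host-side sketch. It assumes 32-byte sectors, a warp of 32 threads reading consecutive double3 elements, an array that starts on a 32-byte boundary, and one 8-byte load per member as described above. sectorsForMember and idealSectors are hypothetical helpers for illustration, not profiler output.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

constexpr int kWarpSize = 32;                  // threads per warp (assumed)
constexpr std::uint64_t kSectorBytes = 32;     // sector size (assumed)
constexpr std::uint64_t kDouble3Bytes = 24;    // size of one double3
constexpr std::uint64_t kDoubleBytes = 8;      // size of one member

// Sectors touched by one warp-wide load of the member at memberOffset
// (0 for x, 8 for y, 16 for z) when thread t reads element t.
inline std::size_t sectorsForMember(std::uint64_t memberOffset)
{
    assert(memberOffset + kDoubleBytes <= kDouble3Bytes);
    std::set<std::uint64_t> sectors;
    for (int t = 0; t < kWarpSize; ++t) {
        const std::uint64_t first =
            static_cast<std::uint64_t>(t) * kDouble3Bytes + memberOffset;
        const std::uint64_t last = first + kDoubleBytes - 1;
        // An 8-byte access spans at most two 32-byte sectors.
        sectors.insert(first / kSectorBytes);
        sectors.insert(last / kSectorBytes);
    }
    return sectors.size();
}

// Sectors needed if the same 32 doubles were stored back to back.
inline std::size_t idealSectors()
{
    return static_cast<std::size_t>(kWarpSize * kDoubleBytes / kSectorBytes);
}
```

Under these assumptions each of the three steps touches 24 sectors, whereas 32 contiguous doubles would fit in 8 sectors, so every request moves about three times as many sectors as necessary.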

Profile the initial version of the kernel

There are multiple ways to profile kernels with Nsight Compute. For full details see the
Nsight Compute Documentation. One example workflow to follow for this sample:

‣ Refer to the README distributed with the sample for instructions on building the application
‣ Run ncu-ui on the host system
‣ Use a local connection if the GPU is on the host system. If the GPU is on a remote
system, set up a remote connection to the target system
‣ Use the Profile activity to profile the sample application
‣ Choose the full section set
‣ Use defaults for all other options
‣ Set a report name and then click on Launch

Summary page
The Summary page lists the kernels profiled and provides some key metrics for each
profiled kernel. It also lists the performance opportunities and estimated speedup for
each. In this sample we have only one kernel launch.
The duration for this initial version of the kernel is 89.86 microseconds and this is used
as the baseline for further optimizations.

For this kernel the Summary page shows a hint for Uncoalesced Global Accesses and
suggests checking the L2 Theoretical Sectors Global Excessive table for the primary
source locations. Click on the Uncoalesced Global Accesses rule link to see more
context. This opens the Source Counters section on the Details page.

Details page - Source Counters section

The Source Counters section shows a hint for Uncoalesced Global Accesses.
It explains that the metric L2 Theoretical Sectors Global Excessive is the
indicator for uncoalesced accesses. The table for this metric lists the source lines with
the highest values. Click on the Apply Rules button at the top to apply rules so that
the hints are also shown at the source line level on the Source page. Click on one of the
source lines to view the kernel source at which the bottleneck occurs.

Source page
The CUDA source and SASS (GPU assembly) for the kernel are shown side by side.
When opening the Source page from Source Counters section, the Navigation metric
is automatically filled in to match, in this case L2 Theoretical Sectors Global
Excessive. You can see this by the bolding in the column header. The source line at
which the bottleneck occurs is highlighted.
It shows uncoalesced global memory load accesses at line #55:

double3 a = d_in[index];

It shows uncoalesced global memory store accesses at line #59:

d_out[index] = a;

The Source page shows notifications as Source Markers in the left header of both the
source and SASS code. Hovering the mouse over a marker shows details for the specific
source line in a pop-up window.

Chapter 5.
UPDATED VERSION OF THE KERNEL

Considering the uncoalesced accesses reported by the profiler, we analyze the global
load access pattern. Each thread executes 3 reads for the three double values in double3.
We can treat the double3 array as a double array and each thread can process one double
instead of one double3. An array of N double3 elements then contains 3N double elements,
so the launch must cover three times as many elements. With this change threads in a warp
access consecutive double values and both loads and stores are coalesced.

__global__ void addConstDouble(int numElements, double *d_in, double k,
                               double *d_out)
{
    // One thread per double; neighboring threads access consecutive doubles.
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < numElements)
    {
        d_out[index] = d_in[index] + k;
    }
}
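
The equivalence of the two indexing schemes can be checked on the host with a CPU reference. This is a sketch under assumptions, not code from the sample: the array is modeled as a flat std::vector<double> in x, y, z order, which matches the memory layout of a double3 array. All function names are hypothetical, and any block size passed to blocksNeeded is an arbitrary choice because the document does not state the launch configuration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for the initial kernel: one iteration per double3 element.
inline std::vector<double> addConstAsDouble3(const std::vector<double>& flat, double k)
{
    assert(flat.size() % 3 == 0);
    std::vector<double> out(flat.size());
    const std::size_t numElements = flat.size() / 3;
    for (std::size_t index = 0; index < numElements; ++index) {
        out[3 * index + 0] = flat[3 * index + 0] + k;  // a.x += k
        out[3 * index + 1] = flat[3 * index + 1] + k;  // a.y += k
        out[3 * index + 2] = flat[3 * index + 2] + k;  // a.z += k
    }
    return out;
}

// CPU reference for the updated kernel: one iteration per double.
inline std::vector<double> addConstAsDouble(const std::vector<double>& flat, double k)
{
    std::vector<double> out(flat.size());
    for (std::size_t index = 0; index < flat.size(); ++index) {
        out[index] = flat[index] + k;
    }
    return out;
}

// Number of double elements in an array of numDouble3 double3 elements.
inline std::size_t scalarElementCount(std::size_t numDouble3)
{
    return 3 * numDouble3;
}

// Blocks needed to cover numElements threads (ceiling division).
inline std::size_t blocksNeeded(std::size_t numElements, std::size_t threadsPerBlock)
{
    assert(threadsPerBlock > 0);
    return (numElements + threadsPerBlock - 1) / threadsPerBlock;
}
```

Both references produce identical output for the same input. For the 1,048,576 double3 elements of this sample, the updated kernel has to cover 3,145,728 double elements.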

Profile the updated kernel

After profiling the updated version, we can set a baseline to the initial version of the
kernel and compare the profiling results. The kernel duration decreased only slightly
on this GPU, from 89.86 microseconds to 89.73 microseconds. However, the Speed
Of Light section already reveals that the pressure on the L1/TEX and L2 Caches has
been significantly reduced.

We can also confirm that the global memory accesses are now coalesced. In the L1/TEX
Cache metrics table under the Memory Workload Analysis section, the Sectors/Req metric
value is 8 for both global loads and global stores. This is the minimum for a warp of
32 threads that each access an 8-byte double: 32 x 8 bytes = 256 bytes, or eight
32-byte sectors. The Source Counters section also no longer shows any Uncoalesced
Global Accesses.

To get a better understanding of why the runtime did not decrease for the updated
version of the kernel on this device, we can navigate to the Memory Chart. The chart
reveals that in the initial version the caches were able to compensate for the poor
memory access pattern. This is apparent from the significant drop in the L2 hit rate
and the much reduced data transfer between the L1 and L2 caches in the updated version
of the kernel. The chart also indicates that the actual bottleneck is the DRAM
throughput, which is closest to its peak. Because the amount of data transferred between
Device Memory and the L2 Cache did not significantly change between the two versions
of the kernel, the runtime did not change either.
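
A rough estimate of the effective memory bandwidth illustrates why the runtime barely changed. The sketch below assumes that the kernel moves the minimum possible amount of data, with every input byte read once and every output byte written once, and it uses the two durations reported above. The helper names are hypothetical, and the result is an estimate rather than a profiler measurement.

```cpp
#include <cassert>
#include <cstddef>

// Minimum bytes moved for numDouble3 elements of 24 bytes each:
// one read of the input array plus one write of the output array (assumed).
inline double minimumTrafficBytes(std::size_t numDouble3)
{
    return 2.0 * 24.0 * static_cast<double>(numDouble3);
}

// Effective bandwidth in GB/s (10^9 bytes per second).
inline double effectiveBandwidthGBps(double bytes, double durationMicroseconds)
{
    assert(durationMicroseconds > 0.0);
    return bytes / (durationMicroseconds * 1e-6) / 1e9;
}

// Ratio of the initial duration to the updated duration.
inline double speedup(double initialMicroseconds, double updatedMicroseconds)
{
    assert(updatedMicroseconds > 0.0);
    return initialMicroseconds / updatedMicroseconds;
}
```

Under this assumption both versions move 50,331,648 bytes and reach roughly 560 GB/s, and the speedup is well below one percent. This is consistent with a kernel whose runtime is limited by DRAM throughput rather than by the cache traffic that the updated version removes.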

Although the runtime did not improve in this case, improving the memory access pattern
as shown here, and thereby reducing the pressure on the caches, can have a significant
impact on the runtime in other situations, in particular with non-streaming memory
access patterns and on devices with smaller caches.

Chapter 6.
RESOURCES

‣ GPU Technology Conference 2021 talk S32089: Requests, Wavefronts, Sectors
  Metrics: Understanding and Optimizing Memory-Bound Kernels with Nsight
  Compute
‣ Nsight Compute Documentation

Notice
ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS,
DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY,
"MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES,
EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE
MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF
NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR
PURPOSE.
Information furnished is believed to be accurate and reliable. However, NVIDIA
Corporation assumes no responsibility for the consequences of use of such
information or for any infringement of patents or other rights of third parties
that may result from its use. No license is granted by implication or otherwise
under any patent rights of NVIDIA Corporation. Specifications mentioned in this
publication are subject to change without notice. This publication supersedes and
replaces all other information previously supplied. NVIDIA Corporation products
are not authorized as critical components in life support devices or systems
without express written approval of NVIDIA Corporation.

Trademarks
NVIDIA and the NVIDIA logo are trademarks or registered trademarks of NVIDIA
Corporation in the U.S. and other countries. Other company and product names
may be trademarks of the respective companies with which they are associated.

This product includes software developed by the Syncro Soft SRL (http://
www.sync.ro/).
