OpenCL Programming for the CUDA Architecture
Version 2.3
8/31/2009
In general, there are multiple ways of implementing a given algorithm in OpenCL and these
multiple implementations can have vastly different performance characteristics for a given
compute device architecture. This whitepaper is a summary of the main guidelines for
choosing the best implementation on NVIDIA GPUs. More details are provided in the
NVIDIA OpenCL Programming Guide [1] and NVIDIA OpenCL Best Practices
Guide [2].
Heterogeneous Computing
An OpenCL application executes across a collection of heterogeneous processors: typically,
a host CPU and one or more GPUs.
Applications should take advantage of such a configuration by making sure that all
processors are busy most of the time and by assigning to each processor the type of work it
does best. The next section goes over the key differences between the CPU and GPU
architectures and what type of workload is best suited to each.
Data must be transferred between processors (in the case of discrete GPUs) and for some
applications these data transfers can represent a significant fraction of the overall execution
time. Applications must therefore strive to either minimize them or hide them by
overlapping them with kernel execution. Data transfers can overlap with kernel execution
only for data allocated with the CL_MEM_ALLOC_HOST_PTR flag as detailed in the
Programming and Best Practices Guides. Data allocated this way also usually transfer at a
higher throughput across the PCIe bus. Hence, it is recommended that OpenCL applications
allocate data with this flag.
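As an illustration, a host-side allocation following this recommendation could look like the
following sketch (error handling is omitted, and the cl_context context, cl_command_queue
queue, and buffer size are assumed to exist already):

    // Allocate a buffer backed by pinned host memory so that transfers across
    // the PCIe bus run at higher throughput and can overlap with kernel execution.
    cl_int err;
    cl_mem buffer = clCreateBuffer(context,
                                   CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                                   size, NULL, &err);

    // Map the buffer to obtain a host pointer, fill it with input data,
    // then unmap it before enqueueing kernels that read the buffer.
    float* hostPtr = (float*)clEnqueueMapBuffer(queue, buffer, CL_TRUE, CL_MAP_WRITE,
                                                0, size, 0, NULL, NULL, &err);
    /* ... write the input data through hostPtr ... */
    clEnqueueUnmapMemObject(queue, buffer, hostPtr, 0, NULL, NULL);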
GPU Computing
In order to distribute a workload appropriately between the CPU and the GPU, it is
important to understand the differences between the two architectures.
Highly Multithreaded
One key difference is in how the two architectures address the issue of accessing off-chip
memory, a very expensive operation with hundreds of clock cycles of latency. CPUs devote a
lot of transistors to on-chip caches in order to reduce as much as possible the overall latency
caused by off-chip memory accesses. For applications with little or no parallelism, this
latency reduction is the best strategy. For applications with a high number of parallel
computations, another strategy is to ensure that the processor is always busy with some
computations while other computations are waiting on memory accesses or at
synchronization points. In other words, latency is “hidden” rather than reduced. This
latency-hiding strategy, adopted by GPUs, requires a large number of concurrently running
threads.
[Figure: Latency hiding. While some threads (T1, T2, T3, T4, …, Tn) wait on memory
accesses, other threads keep the processor busy with computation.]
In the kernel of Listing 2, each work-item computes one element of W. In some cases, giving
more work to each work-item and lowering the number of work-items accordingly yields
better performance, because the per-work-item startup overhead is amortized over more
work. For this example, it means adding a loop so that each work-item computes multiple
elements of W, as in Listing 3.
The number of elements computed by each work-item is then equal to height divided by
the total number of work-items (plus one more for some work-items if height is not a
multiple of the number of work-items).
An advantage of this last formulation is that it allows us to decouple the NDRange from the
input data size. In other words, the number of work-items can now be arbitrarily set, usually
to maximize performance.
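A minimal sketch of this formulation follows, assuming W = M * V with M stored in
row-major order as a height x width matrix; the kernel name and parameters are illustrative
rather than the exact code of Listing 3:

    // Each work-item loops over several rows of M, so the NDRange size can be
    // chosen independently of height (a grid-stride loop).
    __kernel void matVecMul(__global const float* M,   // height x width, row-major
                            __global const float* V,   // width elements
                            uint width,
                            uint height,
                            __global float* W)         // height elements
    {
        for (uint y = get_global_id(0); y < height; y += get_global_size(0)) {
            const __global float* row = M + y * width;
            float dot = 0.0f;
            for (uint x = 0; x < width; ++x)
                dot += row[x] * V[x];
            W[y] = dot;
        }
    }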
Memory Optimizations
Assuming that global memory latency is hidden by running enough work-items per
multiprocessor, the next optimization to focus on is maximizing the kernel’s overall memory
throughput. This is done by maximizing the use of high bandwidth memory (OpenCL local
and constant memory, Section 3.3 of the OpenCL specification) and by using the proper
memory access pattern for each memory type to achieve maximum throughput.
Global Memory Coalescing
Ensuring that global memory accesses are coalesced as often as possible is one of the most
important optimizations for the CUDA architecture, because it can affect performance by
up to an order of magnitude. The Programming and Best Practices Guides give a detailed
description of the criteria required for a global memory access to be coalesced and the
performance costs if one or several of these criteria are not met. It suffices to say here that
the more local and regular the memory accesses by a warp are, the fewer memory
transactions result; ideally they should be fully coalesced, i.e. translate into a single memory
transaction.
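For example, with a one-dimensional array of floats, consecutive work-items reading
consecutive elements produce coalesced accesses, whereas a strided pattern does not (an
illustrative fragment; input and stride are hypothetical names):

    // Coalesced: the work-items of a warp read consecutive 4-byte words.
    float a = input[get_global_id(0)];

    // Non-coalesced: consecutive work-items read words that are stride elements
    // apart, so one warp's reads are scattered over many memory segments.
    float b = input[get_global_id(0) * stride];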
In the kernel of Listing 3, each warp reads M in a non-coalesced way since the elements read
by a warp are not contiguous in memory as illustrated in Figure 2.
[Figure 2: In the kernel of Listing 3, the elements of M read by a warp in one memory access
(work-items 0, 1, 2, …, 31 of a work-group, indexed by local_id) are not contiguous in
memory, so the access is non-coalesced.]
Each work-item is now responsible for calculating part of the dot product of V and a row of
M and storing it to OpenCL local memory. The first work-item in the work-group does the
final sum to compute the dot product. A work-group barrier function call is necessary to
prevent the first work-item from reading OpenCL local memory before the other work-
items are done writing to it.
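A minimal sketch of this approach is shown below; it is illustrative rather than the exact code
of Listing 4, and the names and the loop that strides work-groups over rows are assumptions:

    // One work-group cooperates on each row of M: at every loop iteration the
    // work-items read consecutive elements of the row, so a warp's reads are
    // contiguous and can be coalesced.
    __kernel void matVecMulCoalesced(__global const float* M,
                                     __global const float* V,
                                     uint width,
                                     uint height,
                                     __global float* W,
                                     __local float* partialDotProduct)
    {
        for (uint y = get_group_id(0); y < height; y += get_num_groups(0)) {
            const __global float* row = M + y * width;

            // Each work-item accumulates a partial dot product over a strided
            // subset of the row and stores it in OpenCL local memory.
            float sum = 0.0f;
            for (uint x = get_local_id(0); x < width; x += get_local_size(0))
                sum += row[x] * V[x];
            partialDotProduct[get_local_id(0)] = sum;

            // Make all partial sums visible before the first work-item reads them.
            barrier(CLK_LOCAL_MEM_FENCE);

            // The first work-item of the work-group computes the final sum.
            if (get_local_id(0) == 0) {
                float dot = 0.0f;
                for (uint t = 0; t < get_local_size(0); ++t)
                    dot += partialDotProduct[t];
                W[y] = dot;
            }

            // Prevent partial sums from being overwritten in the next iteration
            // before the first work-item has finished reading them.
            barrier(CLK_LOCAL_MEM_FENCE);
        }
    }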
[Figure: With this approach, the elements of M read by a warp in one memory access are
contiguous, so the access is coalesced. Nwg denotes the number of work-groups (indexed by
group_id) and Nwi the number of work-items per work-group (indexed by local_id).]
[Figure: Parallel reduction of a 16-element array in OpenCL local memory. At each step the
stride doubles (1, 2, 4, 8), the number of active work-items (local_ids) halves, and each
active work-item adds the element one stride away, until the total of the array (41 in this
example) is left in the first element.]
[Figure: Mapping of the first reduction step (stride 1) onto the 16 OpenCL local memory
banks: the elements partialDotProduct[index] and partialDotProduct[index + stride] accessed
by local_ids 0 through 10 are shown together with the bank (0 to 15) each index falls into.]
[Figure: Parallel reduction of the same 16-element array with the stride starting at 8 and
halving at each step (8, 4, 2, 1): each active work-item adds the element one stride away,
leaving the total (41) in the first element.]
Instruction Optimizations
Scalar Architecture
The CUDA architecture is a scalar architecture. Therefore, there is no performance benefit
from using vector types and instructions; these should only be used for convenience. It is
also generally better to launch more work-items operating on scalar data than fewer
work-items operating on large vectors.
Trading Precision for Speed
The CUDA architecture offers ways to trade precision for speed:
- native_* and half_* mathematical functions have lower precision than their full-precision
counterparts, but can be orders of magnitude faster (see the sketch below).
- The -cl-mad-enable build option can provide large performance gains for kernels
with many multiply-add operations.
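As an illustration (a sketch, not taken from the guides; the actual speedup and the acceptable
loss of accuracy depend on the device and on the input range):

    // x and y are floats computed earlier in the kernel.
    // Full-precision operations.
    float r1 = sqrt(x);
    float r2 = x / y;

    // Lower-precision, potentially much faster alternatives.
    float r3 = native_sqrt(x);
    float r4 = native_divide(x, y);
    float r5 = half_sqrt(x);

The -cl-mad-enable option is passed as part of the build options string, for example in the
call to clBuildProgram.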
Avoiding Low-Throughput Instructions
OpenCL applications should minimize use of low-throughput instructions as documented in
the Programming Guide. Integer division and modulo, in particular, are expensive
instructions that should be avoided or replaced with bitwise operations whenever possible.
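When the divisor is a power of two known at compile time, the substitution is
straightforward (a sketch; i is assumed to be an unsigned integer):

    uint q = i >> 4;   // same result as i / 16 for unsigned i
    uint r = i & 15;   // same result as i % 16 for unsigned i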
Control Flow
Because of the SIMT execution model, threads within the same warp should follow the same
execution path as much as possible. Otherwise, the different execution paths are serialized,
increasing overall execution time accordingly. Too many divergent warps during a kernel
launch can drastically affect performance.
Note that threads in different warps are free to diverge without performance penalties.
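For example, a branch whose condition is uniform within each warp does not diverge,
whereas a branch on the parity of the work-item index splits every warp in two (a sketch
assuming a warp size of 32; doEven and doOdd are hypothetical functions):

    // No divergence: all work-items of any given warp take the same branch.
    if ((get_local_id(0) / 32) % 2 == 0)
        doEven();
    else
        doOdd();

    // Divergence: even and odd lanes of the same warp take different branches,
    // so the two paths are serialized within that warp.
    if (get_local_id(0) % 2 == 0)
        doEven();
    else
        doOdd();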
Warp Synchronization
Since threads within a warp are inherently synchronized, synchronization at work-group level
via a work-group barrier function call is only needed if there is communication between
threads from different warps.
A trivial case is when only one warp is active like in the last loop iterations of the parallel
reduction code of Listing 6, for example. The kernel could be rewritten by unrolling the loop
and removing the barrier function call for the last iterations. It could also be rewritten to
take advantage of warp synchronization as illustrated in Figure 7 and Listing 7. In a first step,
each warp reduces its own 64-element portion of the array independently of the others, so
there is no need to synchronize. The size of the array is equal to the size of the work-group,
which is less than or equal to 512 on NVIDIA GPUs. So, after the first step, a maximum of 8
elements remain to be reduced, which is done by a single warp. The performance gain
shown in Table 1 is modest, but we expect it to increase as our OpenCL implementation
improves.
[Figure 7: Two-step reduction exploiting warp synchronization. In the first step each warp
reduces its own 64-element portion of the array without any barrier, leaving one partial
result per warp (at local_ids 0, 32, 64, …); the remaining per-warp results are then reduced
by a single warp.]
// The first thread of each warp stores the result of the reduction
// at the beginning of partialDotProduct
// (id is the work-item's lane index within its warp)
if (id == 0)
    partialDotProduct[get_local_id(0) / WARP_SIZE] = warpResult;
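This fragment stores one result per warp at the beginning of partialDotProduct. A minimal
sketch of the warp-level first step that produces warpResult is shown below; it is illustrative
rather than the exact code of Listing 7, assumes a WARP_SIZE of 32 and a __local array
partialDotProduct that has already been filled and followed by a barrier, and relies on the
lockstep execution of a warp on the CUDA architecture (hence the volatile pointer, which
keeps the compiler from caching the intermediate values):

    #define WARP_SIZE 32

    unsigned int lid = get_local_id(0);
    float warpResult = 0.0f;

    // Only the first half of the work-items take part: each of their warps
    // reduces its own 64-element slice of partialDotProduct with no barrier.
    if (lid < get_local_size(0) / 2) {
        volatile __local float* p = partialDotProduct
            + 2 * WARP_SIZE * (lid / WARP_SIZE)   // start of this warp's slice
            + (lid & (WARP_SIZE - 1));            // lane index within the warp
        p[0] += p[32];
        p[0] += p[16];
        p[0] += p[8];
        p[0] += p[4];
        p[0] += p[2];
        p[0] += p[1];
        warpResult = p[0];
    }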
Table 1 (smaller is better)
Listing 2    135.9
Listing 3    135.5
Listing 4     99.7
Listing 5     31.4
Listing 6     23.9
Listing 7     21.7
References
[1] NVIDIA OpenCL Programming Guide, in the NVIDIA OpenCL SDK.
[2] NVIDIA OpenCL Best Practices Guide, in the NVIDIA OpenCL SDK.
[3] Scalable Parallel Programming with CUDA, ACM Queue, Vol. 6, No. 2 (March/April 2008),
© ACM, 2008. http://mags.acm.org/queue/20080304/?u1=texterity