Graphics Processing Unit (GPU) Programming Strategies and Trends in GPU Computing
1. Introduction

Graphics processing units (GPUs) have for well over a decade been used for general-purpose computation, called GPU computing or GPGPU [8]. When the first GPU programs were written, the GPU was used much like a calculator: it had a set of fixed operations that were exploited to achieve some desired result. As the GPU is designed to output a two-dimensional (2D) image from a three-dimensional (3D) virtual world, the operations it could perform were fundamentally linked with graphics, and the first GPU programs were expressed as operations on graphical primitives such as triangles. These programs were difficult to develop, debug, and optimize, and compiler bugs were frequently encountered. However, the proof-of-concept programs demonstrated that the use of GPUs could give dramatic speedups over central processing units (CPUs) for certain algorithms [20,4], and research on GPUs soon led to the development of higher-level third-party languages that abstracted away the graphics. These languages, however, were rapidly abandoned when the hardware vendors released dedicated non-graphics languages that enabled the use of GPUs for general-purpose computing (see Fig. 1).

The key to the success of GPU computing has partly been its massive performance when compared to the CPU: today, there is a performance gap of roughly seven times between the two when comparing theoretical peak bandwidth and gigaflops performance (see Fig. 2). This performance gap has its roots in physical per-core restraints and architectural differences between the two processors. The CPU is in essence a serial von Neumann processor, and is highly optimized to execute a series of operations in order. One of the major performance factors of CPUs has traditionally been the steadily increasing clock frequency: if you double the frequency, you double the performance, and there was a long-standing trend of exponential frequency growth. In the 2000s, however, this increase came to an abrupt stop as we hit the power wall [3]: because the power consumption of a CPU is proportional to the frequency cubed [4], the power density was approaching that of a nuclear reactor core [22]. Unable to cool such chips sufficiently, the trend of exponential frequency growth stopped at just below 4.0 GHz. Coupled with the memory wall and the instruction-level parallelism (ILP) wall, serial computing had reached its zenith in performance [3], and CPUs started increasing performance through multicore and vector instructions instead.

At the same time as CPUs hit the serial performance ceiling, GPUs were growing exponentially in performance due to massive parallelism.
Fig. 1. History of programming languages for GPU computing. When GPUs were first being used for non-graphics applications, one had to rewrite the application in terms of operations on graphical primitives using languages such as OpenGL or DirectX. As this was cumbersome and error-prone, several academic and third-party languages that abstracted away the graphics appeared. From 2007 onwards, however, vendors started releasing dedicated general-purpose languages for programming the GPU, such as AMD Close-to-Metal (CTM) and NVIDIA CUDA. Today, the predominant general-purpose languages are NVIDIA CUDA, DirectCompute, and OpenCL.
Fig. 2. Historical comparison of theoretical peak performance in terms of gigaflops and bandwidth for the fastest available NVIDIA GPUs and Intel CPUs. Today, the performance gap is roughly seven times for both metrics.
Fig. 4. Current Fermi-class GPU hardware. The full GPU, consisting of up to 16 streaming multiprocessors (SMs), is shown on the left; the right shows a single multiprocessor.
Much of the information in this article can be found in a variety of different sources, including books, documentation, manuals, conference presentations, and Internet fora. Getting an overview of all this information is an arduous exercise that requires a substantial effort. The aim of this article is therefore to give an overview of state-of-the-art programming techniques and profile-driven development, and to serve as a step-by-step guide for optimizing GPU codes. The rest of the article is sectioned as follows. First, we give a short overview of current GPU hardware in Section 2, followed by general programming strategies in Section 3. Then, we give a thorough overview of profile-driven development for GPUs in Section 4, and a short overview of available debugging tools in Section 5. Finally, we offer our view on current and future trends in Section 6, and conclude with a short summary in Section 7.

2. Fermi GPU architecture

The multicore CPU is composed of a handful of complex cores with large caches. The cores are optimized for single-threaded performance and can handle up to two hardware threads per core using hyper-threading. This means that a lot of transistor space is dedicated to complex instruction-level parallelism such as instruction pipelining, branch prediction, speculative execution, and out-of-order execution, leaving only a tiny fraction of the die area for integer and floating-point execution units. In contrast, a GPU is composed of hundreds of simpler cores that can handle thousands of concurrent hardware threads. GPUs are designed to maximize floating-point throughput, whereby most transistors within each core are dedicated to computation rather than complex instruction-level parallelism and large caches. The rest of this section gives a short overview of a modern GPU architecture.

Today's Fermi-based architecture [17] features up to 512 accelerator cores called CUDA cores (see Fig. 4a). Each CUDA core has a fully pipelined integer arithmetic logic unit (ALU) and a floating-point unit (FPU) that executes one integer or floating-point instruction per clock cycle. The CUDA cores are organized in 16 streaming multiprocessors, each with 32 CUDA cores (see Fig. 4b). Fermi also includes a coherent L2 cache of 768 KB that is shared across all 16 multiprocessors, and the GPU has a 384-bit GDDR5 DRAM memory interface supporting up to a total of 6 GB of on-board memory. A host interface connects the GPU to the CPU via the PCI Express bus. The GigaThread global scheduler distributes thread blocks to the multiprocessor thread schedulers (see Fig. 4a); this scheduler handles concurrent kernel execution (a kernel is a GPU program that typically executes in a data-parallel fashion) and out-of-order thread block execution.

Each multiprocessor has 16 load/store units, allowing source and destination addresses to be calculated for 16 threads per clock cycle. Special function units (SFUs) execute intrinsic instructions such as sine, cosine, square root, and interpolation, and each SFU executes one instruction per thread, per clock. The multiprocessor schedules threads in groups of 32 parallel threads called warps. Each multiprocessor features two warp schedulers and two instruction dispatch units, allowing two warps to be issued and executed concurrently: the Fermi dual warp scheduler selects two warps, and issues one instruction from each warp to a group of 16 CUDA cores, 16 load/store units, or 4 SFUs. The multiprocessor has 64 KB of on-chip memory that can be configured as 48 KB of shared memory with 16 KB of L1 cache, or as 16 KB of shared memory with 48 KB of L1 cache.

A traditional critique of GPUs has been their lack of IEEE-compliant floating-point operations and error-correcting code (ECC) memory. However, these shortcomings have been addressed by NVIDIA, and all of their recent GPUs offer fully IEEE-754-compliant single-precision and double-precision floating-point operations, in addition to ECC memory.

3. GPU programming strategies

Programming GPUs is unlike traditional CPU programming, because the hardware is dramatically different. It can often be a relatively simple task to get started with GPU programming and get speedups over existing CPU codes, but these first attempts at GPU computing are often suboptimal and do not utilize the hardware to a satisfactory degree. Achieving a scalable high-performance code that uses hardware resources efficiently is still a difficult task that can take months or years to master.
Fig. 5. The CUDA concept of a grid of blocks. Each block consists of a set of threads that can communicate and cooperate. Each thread uses its block index in combination
with its thread index to identify its position in the global grid.
Fig. 6. Bank conflicts and thread divergence. (left) Conflict-free column-wise access of shared memory. The example shows how padding the shared memory width to the number of banks plus one gives conflict-free access by columns (marked with circles for the first column). Without padding, all elements in the first column would belong to the same bank, and thus give an eight-way bank conflict. By padding the width by one, we ensure that the elements belong to different banks. Please note that the constructed example shows eight banks, whilst current hardware has 32 banks. (right) Branching on 32-wide SIMD GPU architectures. All threads perform the same computations, but the result is masked out for the dashed boxes.
3.1. Guidelines for latency hiding and thread performance

The GPU execution model is based around the concept of launching a kernel on a grid consisting of blocks (see Fig. 5). Each block again consists of a set of threads, and threads within the same block can synchronize and cooperate using fast shared memory. This maps to the hardware so that a block runs on a single multiprocessor, and one multiprocessor can execute multiple blocks in a time-sliced fashion. The grid and block dimensions can be one, two, or three dimensional, and they determine the number of threads that will be used. Each thread has a unique identifier within its block, and each block has a unique global identifier. These are combined to create a unique global identifier per thread.
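As a minimal sketch of this indexing scheme (the kernel, array, and launch parameters below are our own illustration, not code from the paper), a one-dimensional kernel typically combines the built-in block and thread indices as follows:

    // Each thread combines its block index and thread index to form a unique
    // global index into the data array.
    __global__ void scaleArray(float *data, float factor, int n)
    {
        int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;  // unique per thread
        if (globalIdx < n)            // out-of-bounds test for the last, partial block
            data[globalIdx] *= factor;
    }

    // Host-side launch with enough blocks to cover all n elements:
    // scaleArray<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);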
The massively threaded architecture of the GPU is used to hide memory latencies. Even though the GPU has a vastly superior memory bandwidth compared to CPUs, it still takes on the order of hundreds of clock cycles to start the fetch of a single element from main GPU memory. This latency is automatically hidden by the GPU through rapid switching between threads. Once a thread stalls on a memory fetch, the GPU instantly switches to the next available thread, in a fashion similar to hyper-threading [14] on Intel CPUs. This strategy, however, is most efficient when there are enough available threads to completely hide the memory latency, meaning that we need a lot of threads. As there is a maximum number of concurrent threads a GPU can support, we can calculate the percentage of this figure we are using. This number is referred to as the occupancy, and as a rule of thumb it is good to keep a relatively high occupancy. However, a higher occupancy does not necessarily equate to higher performance: once all memory latencies are hidden, a higher occupancy may actually degrade performance, as it also affects other performance metrics.

Hardware threads are available on Intel CPUs as hyper-threading, but a GPU thread operates quite differently from these CPU threads. One of the things that differs from traditional CPU programming is that the GPU executes instructions in a 32-way SIMD (single instruction multiple data) fashion, in which the same instruction is simultaneously executed for 32 different data elements, called a warp. This is illustrated in Fig. 6b, in which a branch is taken by only some of the threads within a warp. This means that all threads within the warp must execute both parts of the branch, which in the utmost consequence slows down the program by a factor of 32. Conversely, this does not affect performance when all threads in a warp take the same branch.

One technique used to avoid expensive branching within a kernel is to sort the elements according to the branch, and thus make sure that the threads within each warp all execute their code without branching. Another way of preventing branching is to perform the branch once on the CPU instead of for each warp on the GPU. This can be done, for example, using templates: by replacing the branch variable with a template variable, we can generate two kernels, one for the condition being true and one for it being false, and let the CPU perform the branch and from this select the correct kernel. The use of templates, however, is not particularly powerful in this example, as the overhead of running a simple coherent if statement in the kernel would be small. But when there are a lot of parameters, there can be a large performance gain from using template kernels [7,5]. Another prime example of the benefit of template kernels is the ability to specify different shared memory sizes at compile time, thus allowing the compiler to issue warnings for out-of-bounds access. Templates can also be used to perform compile-time loop unrolling, which has a great performance impact: by using a switch-case statement, with a separate kernel being launched for different for-loop sizes, the performance can be greatly improved.
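The sketch below illustrates the template technique described above; the kernel and variable names are our own illustration and not code from the paper:

    // The branch variable is replaced by a template parameter, so the compiler
    // generates two specialized kernels and the branch disappears from the GPU code.
    template<bool ADD_SOURCE_TERM>
    __global__ void updateCells(float *q, const float *source, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float value = q[i];
        if (ADD_SOURCE_TERM)          // resolved at compile time, no divergence
            value += source[i];
        q[i] = value;
    }

    // The CPU performs the branch once and selects the correct specialization:
    // if (useSourceTerm) updateCells<true><<<blocks, threads>>>(d_q, d_s, n);
    // else               updateCells<false><<<blocks, threads>>>(d_q, d_s, n);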
3.2. Memory guidelines

CPUs have struggled with the memory wall for a long time. The memory wall, in which transferring data to the processor is far more expensive than computing on that data, can also be a problem on GPUs. This means that many algorithms will often be memory bound, making memory optimizations important. The first lesson in memory optimization is to reuse data and keep it in the fastest available memory. For GPUs, there are three memory areas, listed in decreasing order by speed: registers, shared memory, and global memory.
Registers are the fastest memory units on a GPU, and each multiprocessor on the GPU has a large, but limited, register file which is divided amongst threads residing on that multiprocessor. Registers are private to each thread, and if the threads use more registers than are physically available, registers will also spill to the L1 cache and global memory. This means that, when you have a high number of threads, the number of registers available to each thread is very restricted, which is one of the reasons why a high occupancy may actually hurt performance. Thus, thread-level parallelism is not the only way of increasing performance: it is also possible to increase performance by decreasing the occupancy to increase the number of registers available per thread.
The second fastest memory type is shared memory, and this memory can be just as fast as registers if accessed properly. Shared memory is a very powerful tool in GPU computing, and the main difference between registers and shared memory is the ability for several threads to share data. Shared memory is accessible to all threads within one block, thus enabling cooperation. It can be thought of as a kind of programmable cache, or scratchpad, in which the programmer is responsible for placing often-used data there explicitly. However, as with caches, its size is limited (up to 48 KB), and this can often be a limitation on the number of threads per block. Shared memory is physically organized into 32 banks that serve one warp with data simultaneously. However, for full speed, each thread must access a distinct bank. Failure to do so leads to more memory requests, one for each bank conflict. A classical way to avoid bank conflicts is to use padding. In Fig. 6a, for example, we can avoid bank conflicts for column-wise access by padding the shared memory with an extra element, so that neighboring elements in the same column belong to different banks.
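As a brief sketch of this padding trick (our own example, assuming a square matrix whose width is a multiple of the 16×16 block size):

    // A 16x16 tile in shared memory, padded with one extra column. The padding
    // shifts each row by one bank, so that the column-wise reads below hit
    // different banks instead of all mapping to the same one.
    #define TILE_DIM 16

    __global__ void transposeTile(const float *in, float *out, int width)
    {
        __shared__ float tile[TILE_DIM][TILE_DIM + 1];   // +1 avoids bank conflicts

        int x = blockIdx.x * TILE_DIM + threadIdx.x;
        int y = blockIdx.y * TILE_DIM + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced row-wise read
        __syncthreads();

        int xt = blockIdx.y * TILE_DIM + threadIdx.x;
        int yt = blockIdx.x * TILE_DIM + threadIdx.y;
        out[yt * width + xt] = tile[threadIdx.x][threadIdx.y]; // column-wise, conflict free
    }

    // Launched with dim3 block(TILE_DIM, TILE_DIM) and a grid covering the matrix.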
The third, and slowest, type of memory on the GPU is the global memory, which is the main memory of the GPU. Even though it has an impressive bandwidth, it has a high latency, as discussed earlier. These latencies are preferably hidden by a large number of threads, but there are still large pitfalls. First of all, just as with CPUs, the GPU transfers full cache lines across the bus (called coalesced reads). As a rule of thumb, transferring a single element consumes the same bandwidth as transferring a full cache line. Thus, to achieve full memory bandwidth, we should program the kernel such that warps access continuous regions of memory. Furthermore, we want to use full cache lines, which is done by starting at a quad-word boundary (the start address of a cache line), and using full quad words (128 bytes) as the smallest unit. This address alignment is typically achieved by padding arrays. Alternatively, for non-cached loads, it is sufficient to align to word boundaries and transfer words (32 bytes). To fully occupy the memory bus, the GPU also uses memory parallelism, in which a large number of outstanding memory requests are used to occupy the bandwidth. This is both a reason for the high memory latency, and a reason for the high bandwidth utilization.

Fermi also has hardware L1 and L2 caches that work in a similar fashion to traditional CPU caches. The L2 cache size is fixed and shared between all multiprocessors on the GPU, whilst the L1 cache is per multiprocessor. The L1 cache can be configured to be either 16 or 48 KB, at the expense of shared memory. L1 caching of global loads can, on the other hand, be turned on or off at compile time, or by using inline PTX assembly instructions in the kernel. The benefit of turning off the L1 cache is that the GPU is then allowed to transfer smaller amounts of data than a full cache line, which will often improve performance for sparse and other random-access algorithms.

In addition to the L1 and L2 caches, the GPU also has dedicated caches that are related to traditional graphics functions. The constant-memory cache is one example, which in CUDA is typically used for arguments sent to a GPU kernel; it has its own dedicated cache tailored for broadcast, in which all threads in a block access the same data. The GPU also has a texture cache that can be used to accelerate reading global memory. However, the L1 cache has a higher bandwidth, so the texture cache is most useful if combined with texture functions such as linear interpolation between elements.
3.3. Further guidelines

The CPU and the GPU are different processors that operate asynchronously. This means that we can let the CPU and the GPU perform different tasks simultaneously, which is a key ingredient of heterogeneous computing: the efficient use of multiple different computational resources, in which each resource performs the tasks for which it is best suited. In the CUDA API, this is exposed as streams. Each stream is an in-order queue of operations that will be performed by the GPU, including memory transfers and kernel launches. A typical use case is that the CPU schedules a memory copy from the CPU to the GPU, a kernel launch, and a copy of the results from the GPU to the CPU. The CPU then continues to perform CPU-side calculations while the GPU processes its operations, and only synchronizes with the GPU when its results are needed. There is also support for independent streams, which can execute their operations simultaneously as long as they obey their own stream's order. Current GPUs support up to 16 concurrent kernel launches [18], which means that we can have both data parallelism, in terms of a computational grid of blocks, and task parallelism, in terms of different concurrent kernels. GPUs furthermore support overlapping memory copies between the CPU and the GPU with kernel execution. This means that we can simultaneously copy data from the CPU to the GPU, execute 16 different kernels, and copy data from the GPU back to the CPU, if all these operations are scheduled properly to different streams.

When transferring data between the CPU and the GPU over the PCI Express bus, it is beneficial to use so-called page-locked memory. This essentially prevents the operating system from paging the memory, meaning that the memory area is guaranteed to be contiguous and in physical RAM (not swapped out to disk, for example). However, page-locked memory is scarce, and is rapidly exhausted if used carelessly. A further optimization for page-locked memory is to use write-combining allocation. This disables CPU caching of a memory area that the CPU will only write to, and can increase the bandwidth utilization by up to 40% [18]. It should also be noted that enabling ECC memory will negatively affect both the bandwidth utilization and the available memory, as ECC requires extra bits for error control.
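The sketch below illustrates streams and page-locked memory as discussed above; the buffer sizes, names, and the dummy kernel are our own assumptions:

    #include <cuda_runtime.h>

    // Dummy kernel used only to illustrate the launch.
    __global__ void process(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    void asyncExample(int n)
    {
        float *h_data, *d_data;
        cudaStream_t stream;
        cudaStreamCreate(&stream);

        cudaMallocHost((void**)&h_data, n * sizeof(float));  // page-locked host memory
        cudaMalloc((void**)&d_data, n * sizeof(float));

        // Copy, kernel, and copy-back are queued in one in-order stream; the CPU
        // is free to do other work until the synchronization point below.
        cudaMemcpyAsync(d_data, h_data, n * sizeof(float),
                        cudaMemcpyHostToDevice, stream);
        process<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);
        cudaMemcpyAsync(h_data, d_data, n * sizeof(float),
                        cudaMemcpyDeviceToHost, stream);

        // ... CPU-side work can be performed here ...

        cudaStreamSynchronize(stream);   // wait only when the results are needed

        cudaFree(d_data);
        cudaFreeHost(h_data);
        cudaStreamDestroy(stream);
    }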
CUDA now also supports a unified address space, in which the physical location of a pointer is automatically determined. That is, data can be copied from the GPU to the CPU (or the other way round) without specifying the direction of the copy. While this might not seem like a great benefit at first, it greatly simplifies the code needed to copy data between CPU and GPU memories, and enables advanced memory accesses. The unified memory space is particularly powerful when combined with mapped memory. A mapped memory area is a continuous block of memory that is available directly from both the CPU and the GPU simultaneously, and when using mapped memory, data transfers between the CPU and the GPU are automatically executed asynchronously with kernel execution.

The most recent CUDA version has also become thread safe [18], so that one CPU thread can control multiple CUDA contexts (e.g., one for each physical GPU), and conversely multiple CPU threads can share control of one CUDA context. The unified memory model together with the new thread-safe context handling enables much faster transfers between multiple GPUs: the CPU thread can simply issue a direct GPU–GPU copy, bypassing a superfluous copy to CPU memory.
Fig. 7. Normalized run time of modified kernels which are used to identify bottlenecks: (top left) a well-balanced kernel, (top right) a latency-bound kernel, (bottom left)
a memory-bound kernel, and (bottom right) an arithmetic-bound kernel. ‘‘Total’’ refers to the total kernel time, whilst ‘‘Memory’’ refers to a kernel stripped of arithmetic
operations, and ‘‘Math’’ refers to a kernel stripped of memory operations. It is important to note that latencies are part of the measured run times for all kernel versions.
4. Profile-driven development

Many of the optimization techniques presented in this section are from the excellent presentations by Paulius Micikevicius [16,15]. A famous quote attributed to Donald Knuth is that ''premature optimization is the root of all evil'' [11], or, put another way: make sure that the code produces the correct results before trying to optimize it, and optimize only where it will make an impact. The first step in optimization is always to identify the major application bottlenecks, as performance will increase the most when these are removed. However, locating the bottleneck is hard enough on a CPU, and can be even more difficult on a GPU. Optimization should also be considered a cyclic process, meaning that, after having found and removed one bottleneck, we need to repeat the profiling process to find the next bottleneck in the application. This cyclic optimization can be repeated until the kernel operates close to the theoretical hardware limits or all optimization techniques have been exhausted.

To identify the performance bottleneck in a GPU application, it is important to choose the appropriate performance metrics, and then compare the measured performance to the theoretical peak performance. There are several bottlenecks one can encounter when programming GPUs. For a GPU kernel, there are three main bottlenecks: the kernel may be limited by instruction throughput, memory throughput, or latencies. However, it might also be that CPU–GPU communication is the bottleneck, or that application overheads dominate the run time.

4.1. Locating kernel bottlenecks

There are two main approaches to locating the performance bottleneck of a CUDA kernel, the first and most obvious being to use the CUDA profiler. The profiler is a program that samples different hardware counters, and the correct interpretation of these numbers is required to identify bottlenecks. The second option is to modify the source code, and compare the execution times of the differently modified kernels.

The profiler can be used to identify whether a kernel is limited by bandwidth or by arithmetic operations. This is done by simply looking at the instruction-to-byte ratio, or in other words finding out how many arithmetic operations your kernel performs per byte it reads. The ratio can be found by comparing the instructions issued counter (multiplied by the warp size, 32) to the sum of the global store transactions and L1 global load miss counters (both multiplied by the cache line size, 128 bytes), or directly through the instruction/byte counter. Then we compare this ratio to the theoretical ratio for the specific hardware the kernel is running on, which is available in the profiler as the ideal instruction/byte ratio counter.

The profiler does not always report accurate figures, because the number of load and store instructions may be lower than the actual number of memory transactions, depending on address patterns and individual transfer sizes. To get the most accurate figures, we can follow another strategy, which is to compare the run time of three modified versions of the kernel: the original kernel, one math version in which all memory loads and stores are removed, and one memory version in which all arithmetic operations are removed (see Fig. 7). If the math version is significantly faster than the original and memory kernels, we know that the kernel is memory bound, and conversely for arithmetic. This method has the added benefit of showing how well memory operations and arithmetic operations overlap.

To create the math kernel, we simply comment out all load operations, and move every store operation inside a conditional that will always evaluate to false. We do this to fool the compiler so that it does not optimize away the parts we want to profile, since the compiler will strip away all code not contributing to the final output to global memory. However, to make sure that the compiler does not move the computations inside the conditional as well, the result of the computations must also be used in the condition, as shown in Listing 1. Creating the memory kernel, on the other hand, is much simpler. Here, we can simply comment out all arithmetic operations, and instead add all data used by the kernel, and write out the sum as the result.

Note that, if control flow or addressing is dependent on data in memory, the method becomes less straightforward and requires special care. A further issue with modifying the source code is that the register count can change, which again can increase the occupancy and thereby invalidate the measured run time. This, however, can be solved by increasing the shared memory parameter in the launch configuration of the kernel, someKernel<<<grid_size, block_size, shared_mem_size, ...>>>(...), until the occupancy of the unmodified version is matched. The occupancy can easily be examined using the profiler or the CUDA Occupancy Calculator.
    __global__ void main(..., int flag) {
        float result = ...;
        if (1.0f == result * flag)
            output[i] = value;
    }

Listing 1: Compiler trick for the arithmetic-only kernel. By adding the kernel argument flag (which we always set to 0), we prevent the compiler from optimizing away the if-statement, and simultaneously disable the global store operation.
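For completeness, the corresponding memory-only modification could look roughly as follows (our own sketch, not a listing from the paper): the arithmetic is commented out, and the loaded values are simply summed and written out so that the memory traffic is preserved.

    __global__ void someKernel(const float *a, const float *b, float *output, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        // Original arithmetic commented out; only loads and the store remain, so
        // the kernel performs the same memory transactions without computation.
        // float result = expensiveComputation(a[i], b[i]);
        float result = a[i] + b[i];   // sum of all data used by the kernel
        output[i] = result;
    }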
If a kernel appears to be well balanced (i.e., neither memory nor arithmetic appears to be the bottleneck), we must still check whether or not our kernel operates close to the theoretical performance numbers, since it can suffer from latencies. These latencies are typically caused by problematic data dependencies or the inherent latencies of arithmetic operations. Thus, if your kernel is well balanced, but operates at only a fraction of the theoretical peak, it is probably bound by latencies. In this case, a reorganization of memory requests and arithmetic operations is often required. The goal should be to have many outstanding memory requests that can overlap with arithmetic operations.

4.2. Profiling and optimizing memory

Let us assume that we have identified the major bottleneck of the kernel to be memory transactions. The first and least time-consuming thing to try for memory-bound kernels is to experiment with the settings for caching and non-caching loads and the size of the L1 cache to find the best settings. This can have a large impact in cases of register spilling and for strided or scattered memory access patterns, and it requires no code changes. Outside these short experiments, however, there are two major factors that we want to examine, namely the access pattern and the number of concurrent memory requests.
To determine if the access pattern is the problem, we compare the number of memory instructions with the number of transferred bytes. For example, for global loads we should compare the number of bytes requested (gld_request multiplied by the bytes per request, 128 bytes) to the number of transferred bytes (the sum of l1_global_load_miss and l1_global_load_hit multiplied by the cache line size, 128 bytes). If the number of transferred bytes is far larger than the number of requested bytes, we have a clear indication that global memory loads have a problematic access pattern. In that case, we should try reorganizing the memory access pattern to better fit with the rules in Section 3. For global stores, the counters we should compare are gst_request and global_store_transactions.
If we are memory bound, but the access patterns are good, we might be suffering from having too few outstanding memory requests: according to Little's Law [13], we need (memory latency × bandwidth) bytes in flight to saturate the bus. To determine if the number of concurrent memory accesses is too low, we can compare the achieved memory throughput (glob_mem_read_throughput and glob_mem_write_throughput in the profiler) against the theoretical peak. If the hardware throughput is dramatically lower, the memory bus is not saturated, and we should try increasing the number of concurrent memory transactions by increasing the occupancy. This can be done through adjustment of block dimensions or reduction of register count, or we can modify the kernel to process more data elements per thread. A further optimization path is to move indexing calculations and memory transactions in an attempt to achieve better overlap of memory transactions and arithmetic operations.
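One common way of letting each thread process more data elements, and thereby keeping more memory traffic in flight per thread, is a grid-stride loop; the sketch below is our own illustration:

    // Each thread strides through the array with the total number of launched
    // threads, so a single thread issues several independent memory requests.
    __global__ void addArrays(const float *a, const float *b, float *c, int n)
    {
        int stride = gridDim.x * blockDim.x;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
            c[i] = a[i] + b[i];
    }

    // Launched with fewer blocks than ceil(n / blockDim.x), for example:
    // addArrays<<<128, 256>>>(d_a, d_b, d_c, n);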
An important note when it comes to optimization of memory on the GPU is that traditional CPU cache-blocking techniques typically do not work. This is because the GPU's L1 and L2 caches are not aimed at temporal reuse like CPU caches usually are, which means that attempts at cache blocking can even be counterproductive. The rule of thumb is therefore that, when optimizing for memory throughput on GPUs, do not think of caches at all. However, fetching data from textures can alleviate pressure on the memory system, since these fetches go through a different cache in smaller transactions. Nevertheless, the L1 cache is superior in performance, and only in rare cases will the texture cache increase performance [18].
4.3. Profiling and optimizing latencies and instruction throughput

If a kernel is bound by instruction throughput, there may be several underlying causes. Warp serialization (see Section 3.1) may be a major bottleneck, or we may similarly have that bank conflicts cause shared memory serialization. The third option is that we have data dependencies that inhibit performance.

Instruction serialization means that some threads in a warp ''replay'' the same instruction, as opposed to all threads issuing the same instruction only once (see Fig. 6b). The profiler can be used to determine the level of instruction serialization by comparing the instructions_executed counter to the instructions_issued counter, in which the difference is due to serialization. Note that even if there is a difference between instructions executed and instructions issued, this is only a problem if it constitutes a significant percentage.
One of the causes of instruction replays is divergent branches, identified by comparing the divergent_branch counter to the branch counter. We can also profile this by modifying the source code so that all threads take the same branch, and comparing the run times. The remedy is to remove as many branches as possible, for example by sorting the input data or splitting the kernel into two separate kernels (see Section 3.1).

Another cause of replays is bank conflicts, which can be the case if the l1_shared_bank_conflict counter is a significant percentage of the sum of the shared_loads and shared_stores counters. Another way of profiling bank conflicts is to modify the kernel source code by removing the bank conflicts. This is done by changing the indexing so that all accesses are either broadcasts (all threads access the same element) or conflict free (each thread uses the index threadIdx.y*blockDim.x+threadIdx.x). The shared memory variables also need to be declared as volatile to prevent the compiler from storing them in registers in the modified kernel. Padding is one way of removing these bank conflicts (see Section 3.2), and one can also try rearranging the shared memory layout (e.g., by storing by columns instead of by rows).

If we have ruled out the above causes, our kernel might be suffering from arithmetic operation latencies and data dependencies. We can find out if this is the case by comparing the kernel performance to hardware limits. This is done by examining the IPC (instructions/cycle) counter in the profiler, which gives the ratio of executed instructions per clock cycle. For compute capability 2.0, this figure should be close to 2, whilst for compute capability 2.1 it should approach 4 instructions per cycle. If the achieved instructions per cycle count is very low, this is a clear indication that there are data dependencies and arithmetic latencies that affect the performance. In this case, we can try storing intermediate calculations in separate registers to minimize arithmetic latencies due to register dependencies.
4.4. Further optimization parameters

Grid size and block size are important optimization parameters, and they are usually not easy to set. Both the grid size and the block size must be chosen according to the size and structure of the input data, but they must also be tuned to fit the GPU's architecture in order to yield high performance. Most importantly, we need enough total threads to keep the GPU fully occupied in order to hide memory and other latencies. Since each multiprocessor can execute up to eight blocks simultaneously, choosing blocks that are too small prevents a high occupancy. At the same time, we do not want blocks that are too large, since this may cause register spilling if the kernel uses a lot of registers. The number of threads per block should also, if possible, be a multiple of 32, since each multiprocessor executes full warps. When writing a kernel for the GPU, one also often encounters a situation in which the number of data elements is not a multiple of the block size. In this case, it is recommended to launch a grid larger than the number of elements and use an out-of-bounds test to discard the unnecessary computations.

In many cases, it is acceptable to trade accuracy for performance, either because we simply need a rough estimate, or because modeling or data errors shadow the inevitable floating-point rounding errors. The double-precision to single-precision cost ratio for GPUs is 2:1 (for the GeForce products this ratio is 8:1), which means that a double-precision operation takes twice as long as a single-precision operation (just as for CPUs). This makes it well worth investigating whether or not single precision is sufficiently accurate for the application. For many applications, double precision is required for calculating the results, but the results themselves can be stored in single precision without loss of accuracy. In these cases, all data transfers will execute twice as fast, while also occupying only half the space in memory. For other cases, single precision is sufficiently accurate for the arithmetic as well, meaning that we can also perform the floating-point operations twice as quickly. Remember also that all floating-point literals without the f suffix are represented using 64 bits of precision according to the C standard, and that, when one operand in an expression uses 64 bits, all operations must be performed using 64-bit precision. Some math functions can be compiled directly into faster, albeit less accurate, versions. This is enabled using the double-underscore version of the function, for example __sinf() instead of sinf(). By using the --use_fast_math compiler flag, all fast hardware math functions that are available will be used. It should also be noted that this compiler flag also treats denormalized numbers as zero, and that faster (but less accurate) approximations are used for divisions, reciprocals, and square roots.
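The following fragment illustrates the floating-point literal and fast-math points above (our own example):

    // Sketch of the single- versus double-precision pitfalls discussed above.
    __global__ void scale(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float a = x[i] * 0.5;    // 0.5 is a double literal: forces 64-bit arithmetic
        float b = x[i] * 0.5f;   // 0.5f keeps the operation in single precision
        x[i] = a + b;

        // Fast, less accurate intrinsics can be selected explicitly, for example
        // __sinf(x[i]) instead of sinf(x[i]), or globally with --use_fast_math.
    }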
Even if an application does none of its computing on the CPU, there still must be some CPU code for setting up and controlling CUDA contexts, launching kernels, and transferring data between the CPU and the GPU. In most cases, it is also desirable to have some computations performed on the CPU. There is almost always some serial code in an algorithm that cannot be parallelized and therefore will execute faster on the CPU. Another reason is that there is no point in letting the CPU idle while the GPU does computations: use both processors when possible.

There are considerable overheads connected with data transfers between the CPU and the GPU. To hide a memory transfer from the CPU to the GPU before a kernel is launched, one can use streams, issue the memory transfer asynchronously, and do work on the CPU while the data is being transferred to the GPU (see Section 3.3). Since data transfers between the CPU and the GPU go through the PCI Express bus both ways, these transfers will often be a bottleneck. By using streams and asynchronous memory transfers, and by trying to keep the data on the GPU as much as possible, this bottleneck can be reduced to a minimum.
4.5. Auto-tuning of GPU kernels

The above-mentioned performance guidelines are often conflicting, one example being that you want to optimize for occupancy to hide memory latencies while also wanting to increase the number of per-thread registers for more per-thread storage. The first of these criteria requires more threads per block, the second requires fewer threads per block, and it is not given which configuration will give the best performance. With the sheer number of conflicting optimization parameters, it rapidly becomes difficult to find out what to optimize. Experienced developers are somewhat guided by educated guesses together with a trial and error approach, but finding the global optimum is often too difficult to be performed manually.

Auto-tuning strategies have been known for a long time on CPUs, and are used to optimize performance using cache blocking and many other techniques. On GPUs, we can similarly create auto-tuning codes that execute a kernel for each set of optimization parameters, and select the best-performing configuration. However, the search space is large, and brute-force techniques are thus not a viable solution. Pruning of this search space is still an open research question, but several papers on the subject have been published (see, for example, [12,6]).

Even if auto-tuning is outside the scope of a project, preparing for it can still be an important part of profile-guided development. The first part of auto-tuning is often to use template arguments for different kernel parameters such as shared memory and block size. This gives you many different varieties of the same kernel, so you can easily switch between different implementations. A benefit of having many different implementations of the same kernel is that you can perform run-time auto-tuning of your code. Consider the following example. For a dam-break simulation on the GPU, you might have one kernel that is optimized for scenarios in which most of the domain is dry. However, as the dam breaks, water spreads throughout the domain, making this kernel inefficient. You then create a second kernel that is efficient when most of the domain contains water, and switch between these two kernels at run time. One strategy here is to perform a simple check of which kernel is the fastest after a given number of iterations, and use this kernel. Performing this check every hundredth time step, for example, adds a very small overhead to the total computational time, and ensures that you are using the most efficient kernel throughout the simulation.
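A minimal sketch of such run-time selection, using CUDA events for timing, could look as follows; the kernels, launcher functions, and tuning interval are our own hypothetical names:

    #include <cuda_runtime.h>

    // Time one simulation step performed by the given launcher function.
    float timeKernel(void (*launch)(cudaStream_t), cudaStream_t stream)
    {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start, stream);
        launch(stream);                      // runs one step with one kernel variant
        cudaEventRecord(stop, stream);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }

    // In the simulation loop (pseudo-usage): every TUNE_INTERVAL steps, time both
    // variants once and keep using the faster one.
    // if (step % TUNE_INTERVAL == 0)
    //     useDryKernel = timeKernel(launchDryOptimized, stream)
    //                  < timeKernel(launchWetOptimized, stream);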
4.6. Reporting performance

One of the key points made in early GPU papers was that one could obtain high speedups over the CPU for a variety of different algorithms. The tendency has since been to report ever-increasing speedups, and today papers report that their codes run anything from tens to hundreds and thousands of times faster than CPU ''equivalents''. However, when examining the theoretical performance of the architectures, the performance gap is roughly seven times between state-of-the-art CPUs and GPUs (see Fig. 2). Thus, reporting a speedup of hundreds of times or more holds no scientific value without further explanations supported by detailed benchmarks and profiling results.

The sad truth about many papers reporting dramatic speedup figures is that the speedup is misleading, at best. Often, a GPU code is compared to an inefficient CPU code, or a state-of-the-art desktop GPU is compared to a laptop CPU several years old. Claims of SSE and other optimizations of the CPU code are often made, giving the impression that the CPU code is efficient. However, for many implementations, this still might not be the case: if you optimize using SSE instructions, but the bottleneck is memory latency, your optimizations are worthless from a performance perspective.

Reporting performance is a difficult subject, and the correct way of reporting performance will vary from case to case. There is also a balance between the simplicity of a benchmark and its ease of interpretation. For example, the Top500 list [23], which rates the fastest supercomputers in the world, is often criticized for being too simplistic, as it gives a single rating (gigaflops) from a single benchmark (solving a linear system of equations). For GPUs, we want to see an end to the escalating and misleading speedup race, and rather see detailed profiling results. This will give a much better view of how well the algorithm exploits the hardware, both on the CPU and on the GPU. Furthermore, reporting how well your implementation utilizes the hardware, and what the bottlenecks are, will give insight into how well it will perform on similar and even future hardware. Another benefit is that it becomes transparent what the bottleneck of the algorithm is, meaning that it will become clear what hardware vendors and researchers should focus on to improve performance.
5. Debugging

As an ever-increasing portion of the C++ standard is supported by CUDA, and more advanced debugging tools emerge, debugging GPU codes becomes more and more like debugging CPU codes. Many CUDA programmers have encountered the ''unspecified launch failure'', which could be notoriously hard to debug: such errors were typically only found by either modification and experimentation, or by careful examination of the source code. Today, however, there are powerful CUDA debugging tools for all the most commonly used operating systems.

CUDA-GDB, available for Linux and Mac, can step through a kernel line by line at the granularity of a warp, e.g., identifying where an out-of-bounds memory access occurs, in a similar fashion to debugging a CPU program with GDB. In addition to stepping, CUDA-GDB also supports breakpoints, variable watches, and switching between blocks and threads. Other useful features include reports on the currently active CUDA threads on the GPU, reports on current hardware and memory utilization, and in-place substitution of changed code in a running CUDA application. The tool enables debugging on hardware in real time, and the only requirement for using CUDA-GDB is that the kernel is compiled with the -g -G flags. These flags make the compiler add debugging information to the executable, and make the executable spill all variables to memory.

On Microsoft Windows, Parallel Nsight is a plug-in for Microsoft Visual Studio which offers conditional breakpoints, assembly-level debugging, and memory checking directly in the Visual Studio IDE. It furthermore offers an excellent profiling tool, and is freely available to developers. However, debugging requires two distinct GPUs, one for display and one for running the actual code to be debugged.
6. Trends in GPU computing

GPUs have truly been a disruptive technology. Starting out as academic examples, they were shunned in scientific and high-performance communities: they were inflexible, inaccurate, and required a complete redesign of existing software. However, with time, there has been a low-end disruption, in which GPUs have slowly conquered a large portion of the high-performance computing segment, and three of the five fastest supercomputers are powered mainly by GPUs [23]. GPUs were never designed with high-performance computing in mind, but have nevertheless evolved into powerful processors that fit this market perfectly in less than ten years. The software and the hardware have developed in harmony, both due to the hardware vendors seeing new possibilities and due to researchers and industry identifying points of improvement. It is impossible to predict the future of GPUs, but by examining their history and current state, we might be able to identify some trends that are worth noting.

The success of GPUs has partly been due to their price: GPUs are ridiculously inexpensive in terms of performance per dollar. This again comes from the mass production and target market. Essentially, GPUs are inexpensive because NVIDIA and AMD sell a lot of GPUs to the entertainment market. After researchers started to exploit GPUs, these hardware vendors eventually developed functionality that is tailored for general-purpose computing and started selling GPUs intended for computing alone. Backed by the mass market, this means that the cost for the vendors to target this emerging market is very low, and the profits are high.

There are great similarities between the vector machines of the 1990s and today's use of GPUs. The interest in vector machines eventually died out, as the x86 market took over. Will the same happen to GPUs once we conquer the power wall? In the short term, the answer is no, for several reasons. First of all, conquering the power wall is not even on the horizon, meaning that, for the foreseeable future, parallelism will be the key ingredient that increases performance. Also, while vector machines were reserved for supercomputers alone, today's GPUs are available in everything from cell phones to supercomputers. A large number of software companies now use GPUs for computing in their products due to this mass market adoption, and this is one of the key reasons why we will have GPUs or GPU-like accelerator cores in the future. Just as x86 is difficult to replace, it is now becoming difficult to replace GPUs, as the software and hardware investments in this technology are large and increasing.

NVIDIA have just released their most recent architecture, called Kepler [19], which has several updates compared to the Fermi architecture described in this paper. First of all, it has changed the organization of the multiprocessors: whilst the Fermi architecture has up to 16 streaming multiprocessors, each with 32 CUDA cores, the new Kepler architecture has only 8 multiprocessors, but each with 192 CUDA cores. This totals 1536 CUDA cores for one chip, compared to 512 for the Fermi architecture. A second change is that the clock frequency has been decreased from roughly 1.5 GHz to just over 1 GHz. These two changes are essentially a continuation of the existing trend of increasing performance through more cores running at a decreased clock frequency. Combined with a new production process of 28 nm (compared to 40 nm for Fermi), the effect is that the new architecture has a lower power consumption, yet roughly doubles the gigaflops performance. The bandwidth to the L2 cache has been increased, giving a performance boost for applications that can utilize the L2 cache well, yet the main memory bandwidth is the same. NVIDIA has also announced that it plans to build high-performance processors with integrated CPU cores and GPU cores based on the low-power ARM architecture and the future Maxwell GPU architecture.

Competition between CPUs and GPUs is likely to be intense for conquering the high-performance and scientific community, and heterogeneous CPU–GPU systems are already on the market. The AMD Fusion architecture [1] incorporates multiple CPU and GPU cores into a single die, and Intel have released their Sandy Bridge architecture based on the same concept. Intel have also developed other highly parallel architectures, including the Larrabee [21], based on simple and power-efficient x86 cores, the Single-chip Cloud Computer (SCC) [10], and the 80-core tera-scale research chip Polaris [24]. These architectures have never been released as commercial products, but have culminated in the Knights Corner co-processor with up to 1 teraflops performance [9]. This co-processor resides on an adapter connected through the PCI Express bus, just as today's graphics adapters, but exposes a traditional C, C++, and Fortran programming environment, in which existing legacy code can simply be recompiled for the new architecture.

Today, we see that processors converge towards incorporating traditional CPU cores in addition to highly parallel accelerator cores on the same die, such as the aforementioned AMD Fusion and Intel Sandy Bridge. These architectures do not target the high-performance segment. However, NVIDIA have plans for the high-performance segment with their work on combining ARM CPU cores and Maxwell GPU cores on the same chip. We thus see it as likely that we will see a combination of CPU and GPU cores on the same chip, sharing the same memory space, in the near future. This will have a dramatic effect on the memory bottleneck that exists between these architectures today, and will open for tighter cooperation between fast serial execution and massive parallel execution to tackle Amdahl's law [2].
7. Summary

In this article, we have given an overview of hardware and traditional optimization techniques for the GPU. We have furthermore given a step-by-step guide to profile-driven development, in which bottlenecks and possible solutions are outlined. The focus is on state-of-the-art hardware with accompanying tools, and we have addressed the most prominent bottlenecks: memory, arithmetic, and latencies.
References

[1] Advanced Micro Devices, AMD Fusion family of APUs: enabling a superior, immersive PC experience, Technical report, 2010.
[2] G.M. Amdahl, Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2000, pp. 79–81 (Chapter 2).
[3] K. Asanovic, R. Bodik, B. Catanzaro, J. Gebis, P. Husbands, K. Keutzer, D. Patterson, W. Plishker, J. Shalf, S. Williams, K. Yelick, The landscape of parallel computing research: a view from Berkeley, Technical report, EECS Department, University of California, Berkeley, December 2006.
[4] A.R. Brodtkorb, C. Dyken, T.R. Hagen, J.M. Hjelmervik, O. Storaasli, State-of-the-art in heterogeneous computing, Scientific Programming 18 (1) (2010) 1–33.
[5] A.R. Brodtkorb, M.L. Sætra, M. Altinakar, Efficient shallow water simulations on GPUs: implementation, visualization, verification, and validation, Computers & Fluids 55 (0) (2012) 1–12.
[6] A. Davidson, J.D. Owens, Toward techniques for auto-tuning GPU algorithms, in: Proceedings of Para 2010: State of the Art in Scientific and Parallel Computing, 2010.
[7] M. Harris, NVIDIA GPU computing SDK 4.1: optimizing parallel reduction in CUDA, 2011.
[8] M. Harris, D. Göddeke, General-purpose computation on graphics hardware, http://gpgpu.org.
[9] Intel, Intel Many Integrated Core (Intel MIC) architecture: ISC'11 demos and performance description, Technical report, 2011.
[10] Intel Labs, The SCC platform overview, Technical report, Intel Corporation, 2010.
[11] D.E. Knuth, Structured programming with go to statements, Computing Surveys 6 (1974) 261–301.
[12] Y. Li, J. Dongarra, S. Tomov, A note on auto-tuning GEMM for GPUs, in: Proceedings of the 9th International Conference on Computational Science: Part I, 2009.
[13] J.D.C. Little, S.C. Graves, Building Intuition: Insights from Basic Operations Management Models and Principles, Springer, 2008, pp. 81–100 (Chapter 5).
[14] D.T. Marr, F. Binns, D.L. Hill, G. Hinton, D.A. Koufaty, J.A. Miller, M. Upton, Hyper-threading technology architecture and microarchitecture, Intel Technology Journal 6 (1) (2002) 1–12.
[15] P. Micikevicius, Analysis-driven performance optimization, [Conference presentation], 2010 GPU Technology Conference, Session 2012, 2010.
[16] P. Micikevicius, Fundamental performance optimizations for GPUs, [Conference presentation], 2010 GPU Technology Conference, Session 2011, 2010.
[17] NVIDIA, NVIDIA's next generation CUDA compute architecture: Fermi, 2010.
[18] NVIDIA, NVIDIA CUDA programming guide 4.1, 2011.
[19] NVIDIA, NVIDIA GeForce GTX 680, Technical report, NVIDIA Corporation, 2012.
[20] J. Owens, M. Houston, D. Luebke, S. Green, J. Stone, J. Phillips, GPU computing, Proceedings of the IEEE 96 (5) (2008) 879–899.
[21] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, P. Hanrahan, Larrabee: a many-core x86 architecture for visual computing, ACM Transactions on Graphics 27 (3) (2008) 18:1–18:15.
[22] G. Taylor, Energy efficient circuit design and the future of power delivery, [Conference presentation], Electrical Performance of Electronic Packaging and Systems, October 2009.
[23] Top 500 supercomputer sites, http://www.top500.org/, November 2011.
[24] S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, A. Singh, T. Jacob, S. Jain, V. Erraguntla, C. Roberts, Y. Hoskote, N. Borkar, S. Borkar, An 80-tile sub-100-W teraflops processor in 65-nm CMOS, IEEE Journal of Solid-State Circuits 43 (1) (2008) 29–41.

Dr. André R. Brodtkorb is a research scientist at SINTEF, a non-profit research organization in Norway with roughly 2000 researchers, where he works on GPU acceleration and algorithm design. He is also an associate professor at the Norwegian School of Information Technology, where he teaches several courses on graphics programming. His research interests include GPU and heterogeneous computing, simulation of partial differential equations (PDEs), and real-time graphics and visualization.

Trond R. Hagen has been head of the Heterogeneous Computing Group at SINTEF ICT since January 2009 and has an adjunct position at Narvik University College, where he teaches computer graphics, virtual reality, and animation. His research focus is on massively parallel computing, visual computing, and cloud computing on heterogeneous architectures, e.g., combinations of CPUs and GPUs. His research areas are computer graphics, visualization, geometric modeling, and simulation (conservation laws, smoothed particle hydrodynamics).

Martin L. Sætra is a Ph.D. fellow working with heterogeneous computing, and the GPU in particular. In his master's thesis, he was one of the first to utilize GPU clusters to numerically solve partial differential equations. He also holds a part-time position at the Norwegian Meteorological Institute as a developer, where he works with visualization tools, among other things.