AMD Accelerated Parallel Processing
OpenCL Programming Guide
June 2013
rev. 2.5
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo,
AMD Accelerated Parallel Processing, the AMD Accelerated Parallel Processing logo, ATI,
the ATI logo, Radeon, FireStream, FirePro, Catalyst, and combinations thereof are
trademarks of Advanced Micro Devices, Inc. Microsoft, Visual Studio, Windows, and Windows
Vista are registered trademarks of Microsoft Corporation in the U.S. and/or other
jurisdictions. Other names are for informational purposes only and may be trademarks of their
respective owners. OpenCL and the OpenCL logo are trademarks of Apple Inc. used by
permission by Khronos.
The contents of this document are provided in connection with Advanced Micro Devices,
Inc. (“AMD”) products. AMD makes no representations or warranties with respect to the
accuracy or completeness of the contents of this publication and reserves the right to
make changes to specifications and product descriptions at any time without notice. The
information contained herein may be of a preliminary or advance nature and is subject to
change without notice. No license, whether express, implied, arising by estoppel or
otherwise, to any intellectual property rights is granted by this publication. Except as set forth
in AMD’s Standard Terms and Conditions of Sale, AMD assumes no liability whatsoever,
and disclaims any express or implied warranty, relating to its products including, but not
limited to, the implied warranty of merchantability, fitness for a particular purpose, or
infringement of any intellectual property right.
AMD’s products are not designed, intended, authorized or warranted for use as
components in systems intended for surgical implant into the body, or in other applications
intended to support or sustain life, or in any other application in which the failure of AMD’s
product could create a situation where personal injury, death, or severe property or
environmental damage may occur. AMD reserves the right to discontinue or make changes to
its products at any time without notice.
Preface
Audience
This document is intended for programmers. It assumes prior experience in
writing code for CPUs and a basic understanding of threads (work-items). While
a basic understanding of GPU architectures is useful, this document does not
assume prior graphics knowledge. It further assumes an understanding of
chapters 1, 2, and 3 of the OpenCL Specification (for the latest version, see
http://www.khronos.org/registry/cl/ ).
Organization
This AMD Accelerated Parallel Processing document begins, in Chapter 1, with
an overview of: the AMD Accelerated Parallel Processing programming models,
OpenCL, the AMD Compute Abstraction Layer (CAL), the AMD APP Kernel
Analyzer, and the AMD APP Profiler. Chapter 2 discusses the compiling and
running of OpenCL programs. Chapter 3 describes using GNU debugger (GDB)
to debug OpenCL programs. Chapter 4 is a discussion of general performance
and optimization considerations when programming for AMD Accelerated Parallel
Processing devices. Chapter 5 details performance and optimization
considerations specifically for Southern Islands devices. Chapter 6 details
performance and optimization considerations for Evergreen and Northern Islands
devices. Appendix A describes the supported optional OpenCL extensions.
Appendix B details the installable client driver (ICD) for OpenCL. Appendix C
details the compute kernel and contrasts it with a pixel shader. Appendix D lists
the device parameters. Appendix E describes the OpenCL binary image format
(BIF). Appendix F describes the OpenVideo Decode API. Appendix G describes
the interoperability between OpenCL and OpenGL. The last section of this book
is a glossary of acronyms and terms, as well as an index.
Conventions
The following conventions are used in this document.
[1,2)   A range that includes the left-most value (in this case, 1) but excludes the right-most value (in this case, 2).
[1,2]   A range that includes both the left-most and right-most values (in this case, 1 and 2).
7:4   A bit range, from bit 7 to 4, inclusive. The high-order bit is shown first.
italicized word or phrase   The first use of a term or concept basic to the understanding of stream computing.
Related Documents
The OpenCL Specification, Version 1.1, Published by Khronos OpenCL
Working Group, Aaftab Munshi (ed.), 2010.
AMD, R600 Technology, R600 Instruction Set Architecture, Sunnyvale, CA,
est. pub. date 2007. This document includes the RV670 GPU instruction
details.
ISO/IEC 9899:TC2 - International Standard - Programming Languages - C
Kernighan, Brian W., and Ritchie, Dennis M., The C Programming Language,
Prentice-Hall, Inc., Upper Saddle River, NJ, 1978.
I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P.
Hanrahan, “Brook for GPUs: stream computing on graphics hardware,” ACM
Trans. Graph., vol. 23, no. 3, pp. 777–786, 2004.
AMD Compute Abstraction Layer (CAL) Intermediate Language (IL)
Reference Manual. Published by AMD.
Buck, Ian; Foley, Tim; Horn, Daniel; Sugerman, Jeremy; Hanrahan, Pat;
Houston, Mike; Fatahalian, Kayvon. “BrookGPU”
http://graphics.stanford.edu/projects/brookgpu/
Buck, Ian. “Brook Spec v0.2”. October 31, 2003.
http://merrimac.stanford.edu/brook/brookspec-05-20-03.pdf
OpenGL Programming Guide, at http://www.glprogramming.com/red/
Microsoft DirectX Reference Website, at http://msdn.microsoft.com/en-
us/directx
GPGPU: http://www.gpgpu.org, and Stanford BrookGPU discussion forum
http://www.gpgpu.org/forums/
Contact Information
URL: developer.amd.com/appsdk
Developing: developer.amd.com/
Forum: developer.amd.com/openclforum
Contents
Preface
Contents
Chapter 6 OpenCL Performance and Optimization for Evergreen and Northern Islands
Devices
6.1 Global Memory Optimization .......................................................................................................... 6-1
6.1.1 Two Memory Paths...........................................................................................................6-3
6.1.2 Channel Conflicts.............................................................................................................6-6
6.1.3 Float4 Or Float1..............................................................................................................6-11
6.1.4 Coalesced Writes ...........................................................................................................6-12
6.1.5 Alignment ........................................................................................................................6-14
6.1.6 Summary of Copy Performance ...................................................................................6-16
6.1.7 Hardware Variations.......................................................................................................6-16
Index
Figures
1.1 Generalized AMD GPU Compute Device Structure for Southern Islands Devices ...............1-2
1.2 AMD Radeon™ HD 79XX Device Partial Block Diagram ......................................................1-3
1.3 Generalized AMD GPU Compute Device Structure................................................................1-4
1.4 Simplified Block Diagram of an Evergreen-Family GPU ......................................................1-5
1.5 AMD Accelerated Parallel Processing Software Ecosystem ..................................................1-6
1.6 Simplified Mapping of OpenCL onto AMD Accelerated Parallel Processing for
Evergreen and Northern Island Devices .................................................................................1-8
1.7 Work-Item Grouping Into Work-Groups and Wavefronts ........................................................1-9
1.8 Interrelationship of Memory Domains for Southern Islands Devices ...................................1-12
1.9 Dataflow between Host and GPU .........................................................................................1-12
1.10 Simplified Execution Of Work-Items On A Single Stream Core ...........................................1-16
1.11 Stream Core Stall Due to Data Dependency ........................................................................1-17
1.12 OpenCL Programming Model ................................................................................................1-19
2.1 OpenCL Compiler Toolchain....................................................................................................2-1
2.2 Runtime Processing Structure .................................................................................................2-6
4.1 Timeline and API Trace View in Microsoft Visual Studio 2010 ..............................................4-3
4.2 Context Summary Page View in Microsoft Visual Studio 2010 ..............................................4-4
4.3 Warning(s) and Error(s) Page .................................................................................................4-5
4.4 Sample Session View in Microsoft Visual Studio 2010 ..........................................................4-6
4.5 Sample Kernel Occupancy Modeler Screen ...........................................................................4-7
4.6 AMD APP Kernel Analyzer......................................................................................................4-9
5.1 Memory System .......................................................................................................................5-2
5.2 Channel Remapping/Interleaving.............................................................................................5-5
5.3 Transformation to Staggered Offsets.......................................................................................5-8
5.4 One Example of a Tiled Layout Format ................................................................................5-28
5.5 Northern Islands Compute Unit Arrangement .......................................................................5-36
5.6 Southern Island Compute Unit Arrangement ........................................................................5-36
6.1 Memory System .......................................................................................................................6-2
6.2 FastPath (blue) vs CompletePath (red) Using float1 ..............................................................6-3
6.3 Transformation to Staggered Offsets.......................................................................................6-9
6.4 Two Kernels: One Using float4 (blue), the Other float1 (red) .............................................. 6-11
6.5 Effect of Varying Degrees of Coalescing - Coal (blue), NoCoal (red), Split (green) ..........6-13
6.6 Unaligned Access Using float1..............................................................................................6-15
6.7 Unmodified Loop....................................................................................................................6-43
6.8 Kernel Unrolled 4X.................................................................................................................6-44
6.9 Unrolled Loop with Stores Clustered.....................................................................................6-44
6.10 Unrolled Kernel Using float4 for Vectorization ......................................................................6-45
6.11 One Example of a Tiled Layout Format ................................................................................6-49
A.1 Peer-to-Peer Transfers Using the cl_amd_bus_addressable_memory Extension .............. A-14
C.1 Pixel Shader Matrix Transpose .............................................................................................. C-2
C.2 Compute Kernel Matrix Transpose......................................................................................... C-3
C.3 LDS Matrix Transpose ............................................................................................................ C-4
Chapter 1
OpenCL Architecture and AMD
Accelerated Parallel Processing
This chapter provides a general software and hardware overview of the AMD
Accelerated Parallel Processing implementation of the OpenCL standard. It
explains the memory structure and gives simple programming examples.
OpenCL's API also supports the concept of a task dispatch. This is equivalent to
executing a kernel on a compute device with a work-group and NDRange
containing a single work-item. Parallelism is expressed using vector data types
implemented by the device, enqueuing multiple tasks, and/or enqueuing native
kernels developed using a programming model orthogonal to OpenCL.
1.1.1 Synchronization
The two domains of synchronization in OpenCL are work-items in a single work-
group and command-queue(s) in a single context. Work-group barriers enable
synchronization of work-items in a work-group. Each work-item in a work-group
must first execute the barrier before executing any instruction beyond this barrier.
Either all of, or none of, the work-items in a work-group must encounter the
barrier. A barrier or mem_fence operation does not have global scope; it is
relevant only to the local work-group in which it operates.
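The following kernel is an illustrative sketch (it is not one of the samples in this
guide) showing a work-group barrier in use: each work-item writes one element into
local memory, and the barrier guarantees that every write in the work-group has
completed before any work-item reads a neighboring element.

__kernel void reverse_within_group( __global const float *in,
                                    __global float *out,
                                    __local float *tile )
{
    size_t lid = get_local_id(0);
    size_t gid = get_global_id(0);
    size_t last = get_local_size(0) - 1;

    tile[lid] = in[gid];
    barrier(CLK_LOCAL_MEM_FENCE); // all work-items in the group reach this point

    out[gid] = tile[last - lid]; // safe: the writes above are now visible
}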
Figure 1.1 Generalized AMD GPU Compute Device Structure for Southern Islands Devices
For AMD Radeon™ HD 79XX devices, each of the 32 CUs has one Scalar Unit
and four Vector Units, each of which contains an array of 16 PEs. Each PE
consists of one ALU. Figure 1.2 shows only two compute units of the array that
comprises the compute device of the AMD Radeon™ HD 7XXX family. The four
Vector Units use SIMD execution of a scalar instruction. This makes it possible
to run four separate instructions at once, but they are dynamically scheduled (as
opposed to those for the AMD Radeon™ HD 69XX devices, which are statically
scheduled).
In Figure 1.2, there are two command processors, which can process two
command queues concurrently. The Scalar Unit, Vector Unit, Level 1 data cache
(L1), and Local Data Share (LDS) are the components of one compute unit, of
which there are 32. The SC cache is the scalar unit data cache, and the Level
2 cache consists of instructions and data.
As noted, the AMD Radeon™ HD 79XX devices also have a scalar unit, and the
instruction stream contains both scalar and vector instructions. On each cycle, it
selects a scalar instruction and a vector instruction (as well as a memory
operation and a branch operation, if available); it issues one to the scalar unit,
the other to the vector unit; this takes four cycles to issue over the four vector
cores (the same four cycles over which the 16 units execute 64 work-items).
The number of compute units in an AMD GPU, and the way they are structured,
varies with the device family, as well as device designations within a family. Each
of these vector units possesses ALUs (processing elements). For devices in the
Northern Islands (AMD Radeon™ HD 69XX) and Southern Islands (AMD
Radeon™ HD 7XXX) families, these ALUs are arranged in four (in the Evergreen
family, there are five) SIMD arrays consisting of 16 processing elements each.
(See Section 1.3, “Hardware Overview for Evergreen and Northern Islands
Devices.”) Each of these arrays executes a single instruction across each lane
for each of a block of 16 work-items. That instruction is repeated over four cycles
to make the 64-element vector called a wavefront. On devices in the Southern
Island family, the four stream cores execute code from four different wavefronts.
AMD GPUs consist of multiple compute units. The number of them and the way
they are structured varies with the device family, as well as device designations
within a family. Each of these processing elements possesses ALUs. For devices
in the Northern Islands and Southern Islands families, these ALUs are arranged
in four (in the Evergreen family, there are five) processing elements with arrays
of 16 ALUs. Each of these arrays executes a single instruction across each lane
for each of a block of 16 work-items. That instruction is repeated over four cycles
to make the 64-element vector called a wavefront. On devices in the Southern
Island family, the four processing elements execute code from four different
wavefronts. On Northern Islands and Evergreen family devices, the four arrays
execute instructions from one wavefront, so that each work-item issues four (for
Northern Islands) or five (for Evergreen) instructions at once in a very-long-
instruction-word (VLIW) packet.
Figure 1.3 shows a simplified block diagram of a generalized AMD GPU compute
device.
Figure 1.3 Generalized AMD GPU Compute Device Structure
GPU compute devices comprise groups of compute units. Each compute unit
contains numerous processing elements, which are responsible for executing
kernels, each operating on an independent data stream. Processing elements are
the fundamental, programmable computational units that perform integer,
single-precision floating-point, double-precision floating-point, and transcendental
operations.
Figure 1.5 AMD Accelerated Parallel Processing Software Ecosystem
The AMD Accelerated Parallel Processing software stack provides end-users and
developers with a complete, flexible suite of tools to leverage the processing
power in AMD GPUs.
The latest generations of AMD GPUs use unified shader architectures capable
of running different kernel types interleaved on the same hardware.
Programmable GPU compute devices execute various user-developed programs,
known to graphics programmers as shaders and to compute programmers as
kernels. These GPU compute devices can execute non-graphics functions using
a data-parallel programming model that maps executions onto compute units. In
this programming model, known as AMD Accelerated Parallel Processing, arrays
of input data elements stored in memory are accessed by a number of compute
units.
Figure 1.4 Simplified Block Diagram of an Evergreen-Family GPU
Figure 1.7 Work-Item Grouping Into Work-Groups and Wavefronts
The size of wavefronts can differ on different GPU compute devices. For
example, some of the low-end and older GPUs, such as the AMD Radeon™ HD
54XX series graphics cards, have a wavefront size of 32 work-items. Higher-end
and newer AMD GPUs have a wavefront size of 64 work-items.
if(x)
{
. //items within these braces = A
.
.
}
else
{
. //items within these braces = B
.
.
}
The wavefront mask is set true for lanes (elements/items) in which x is true;
those lanes then execute A. The mask then is inverted, and B is executed.
Example 1: If two branches, A and B, take the same amount of time t to execute
over a wavefront, the total time of execution, if any work-item diverges, is 2t.
Loops execute in a similar fashion, where the wavefront occupies a compute unit
as long as there is at least one work-item in the wavefront still being processed.
Thus, the total execution time for the wavefront is determined by the work-item
with the longest execution time.
Figure 1.8 illustrates the interrelationship of the memories. (Note that the
referenced color buffer is a write-only output buffer in a pixel shader that has a
predetermined location based on the pixel location.)
Figure 1.8 Interrelationship of Memory Domains for Southern Islands Devices
Figure 1.9 illustrates the standard dataflow between host (CPU) and GPU.
Figure 1.9 Dataflow between Host and GPU
There are two ways to copy data from the host to the GPU compute device
memory: implicitly, by using clEnqueueMapBuffer and clEnqueueUnMapMemObject; or
explicitly, through clEnqueueReadBuffer and clEnqueueWriteBuffer (or
clEnqueueReadImage and clEnqueueWriteImage).
With proper memory transfer management and the use of system pinned
memory (host/CPU memory remapped to the PCIe memory space), copying
between host (CPU) memory and PCIe memory can be skipped.
Double copying lowers the overall system memory bandwidth. In GPU compute
device programming, pipelining and other techniques help reduce these
bottlenecks. See Chapter 4, Chapter 5, and Chapter 6 for more specifics about
optimization techniques.
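As a sketch of the explicit and implicit copy paths mentioned above (the queue,
buffer, host_ptr, and NBYTES names are placeholders, not identifiers from the
samples in this guide):

// Explicit copy: the runtime transfers NBYTES from host_ptr into the buffer.
clEnqueueWriteBuffer( queue, buffer, CL_TRUE /* blocking */,
                      0, NBYTES, host_ptr, 0, NULL, NULL );

// Implicit copy: map the buffer into the host address space, write through the
// returned pointer, then unmap so the device sees the updated contents.
void *p = clEnqueueMapBuffer( queue, buffer, CL_TRUE, CL_MAP_WRITE,
                              0, NBYTES, 0, NULL, NULL, NULL );
memcpy( p, host_ptr, NBYTES );
clEnqueueUnmapMemObject( queue, buffer, p, 0, NULL, NULL );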
Image reads are cached through the texture system (corresponding to the L2 and
L1 caches).
Transfers from the system to the GPU compute device are done either by the
command processor or by the DMA engine. The GPU compute device also can
read and write system memory directly from the compute unit through kernel
instructions over the PCIe bus.
Most commands to the GPU compute device are buffered in a command queue
on the host side. The command queue is sent to the GPU compute device, and
the commands are processed by it. There is no guarantee as to when commands
from the command queue are executed, only that they are executed in order.
Unless the GPU compute device is busy, commands are executed immediately.
DMA transfers can occur asynchronously. This means that a DMA transfer is
executed concurrently with other system or GPU compute operations when there
are no dependencies. However, data is not guaranteed to be ready until the DMA
engine signals that the event or transfer is completed. The application can query
the hardware for DMA event completion. If used carefully, DMA transfers are
another source of parallelization.
Southern Islands devices have two SDMA engines that can perform bidirectional
transfers over the PCIe bus when multiple queues are created in consecutive
order, since each SDMA engine is assigned to either the odd-numbered or the
even-numbered queues.
In some cases, the user might want to mask the visibility of the GPUs seen by
the OpenCL application. One example is to dedicate one GPU for regular
graphics operations and the other three (in a four-GPU system) for Compute. To
do that, set the GPU_DEVICE_ORDINAL environment parameter, which is a comma-
separated list variable:
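For example, to dedicate GPU 0 to graphics and expose only the remaining three
GPUs to OpenCL (the ordinal values here are illustrative):

Under Linux:    export GPU_DEVICE_ORDINAL=1,2,3
Under Windows:  set GPU_DEVICE_ORDINAL=1,2,3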
Another example is a system with eight GPUs, where two distinct OpenCL
applications are running at the same time. The administrator might want to set
GPU_DEVICE_ORDINAL to 0,1,2,3 for the first application, and 4,5,6,7 for the
second application; thus, partitioning the available GPUs so that both
applications can run at the same time.
Figure 1.10 Simplified Execution Of Work-Items On A Single Stream Core
At runtime, work-item T0 executes until cycle 20; at this time, a stall occurs due
to a memory fetch request. The scheduler then begins execution of the next
work-item, T1. Work-item T1 executes until it stalls or completes. New work-items
execute, and the process continues until the available number of active work-
items is reached. The scheduler then returns to the first work-item, T0.
If the data work-item T0 is waiting for has returned from memory, T0 continues
execution. In the example in Figure 1.10, the data is ready, so T0 continues.
Since there were enough work-items and processing element operations to cover
the long memory latencies, the stream core does not idle. This method of
memory latency hiding helps the GPU compute device achieve maximum
performance.
If none of T0 – T3 are runnable, the stream core waits (stalls) until one of T0 –
T3 is ready to execute. In the example shown in Figure 1.11, T0 is the first to
continue execution.
Figure 1.11 Stream Core Stall Due to Data Dependency
The causes for this situation are discussed in the following sections.
1.8 Terminology
A compute kernel is a specific type of kernel that is not part of the traditional
graphics pipeline. The compute kernel type can be used for graphics, but its
strength lies in using it for non-graphics fields such as physics, AI, modeling,
HPC, and various other computationally intensive applications.
In a compute kernel, the work-item spawn order is sequential. This means that
on a chip with N work-items per wavefront, the first N work-items go to wavefront
1, the second N work-items go to wavefront 2, etc. Thus, the work-item IDs for
wavefront K are in the range (K•N) to ((K+1)•N) - 1.
1. The LDS size is allocated per work-group. Each work-group specifies how
much of the LDS it requires. The hardware scheduler uses this information
to determine which work groups can share a compute unit.
2. Data can only be shared within work-items in a work-group.
3. Memory accesses outside of the work-group result in undefined behavior.
Figure 1.12 OpenCL Programming Model
The devices are capable of running data- and task-parallel work. A kernel can be
executed as a function of multi-dimensional domains of indices. Each element is
called a work-item; the total number of indices is defined as the global work-size.
The global work-size can be divided into sub-domains, called work-groups, and
individual work-items within a group can communicate through global or locally
shared memory. Work-items are synchronized through barrier or fence
operations. Figure 1.12 is a representation of the host/device architecture with a
single platform, consisting of a GPU and a CPU.
Many operations are performed with respect to a given context; there also are
many operations that are specific to a device. For example, program compilation
and kernel execution are done on a per-device basis. Performing work with a
device, such as executing kernels or moving data to and from the device’s local
memory, is done using a corresponding command queue. A command queue is
associated with a single device and a given context; all work for a specific device
is done through this interface. Note that while a single command queue can be
associated with only a single device, there is no limit to the number of command
queues that can point to the same device. For example, it is possible to have
one command queue for executing kernels and a command queue for managing
data transfers between the host and the device.
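A minimal sketch of that arrangement is shown below; it assumes that context and
device have already been obtained, as in the example programs later in this
chapter.

// Two queues on the same device: one for kernel execution, one for transfers.
cl_command_queue exec_queue = clCreateCommandQueue( context, device, 0, NULL );
cl_command_queue copy_queue = clCreateCommandQueue( context, device, 0, NULL );

// Commands in different queues may overlap; use events (or clFinish) to order
// a kernel behind the transfer it depends on.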
Most OpenCL programs follow the same pattern. Given a specific platform, select
a device or devices to create a context, allocate memory, create device-specific
command queues, and perform data transfers and computations. Generally, the
platform is the gateway to accessing specific devices; given these devices and
a corresponding context, the application is independent of the platform. Given a
context, the application can create command queues, programs, and kernels,
allocate memory buffers and images, and enqueue data transfers and kernel
executions. The simple program listed below as Example Code 1 follows this
pattern:
1. The host program must select a platform, which is an abstraction for a given
OpenCL implementation. Implementations by multiple vendors can coexist on
a host, and the sample uses the first one available.
2. A device id for a GPU device is requested. A CPU device could be requested
by using CL_DEVICE_TYPE_CPU instead. The device can be a physical device,
such as a given GPU, or an abstracted device, such as the collection of all
CPU cores on the host.
3. On the selected device, an OpenCL context is created. A context ties
together a device, memory buffers related to that device, OpenCL programs,
and command queues. Note that buffers related to a device can reside on
either the host or the device. Many OpenCL programs have only a single
context, program, and command queue.
4. Before an OpenCL kernel can be launched, its program source is compiled,
and a handle to the kernel is created.
5. A memory buffer is allocated in the context.
6. The kernel is launched. While it is necessary to specify the global work size,
OpenCL determines a good local work size for this device. Since the kernel
was launched asynchronously, clFinish() is used to wait for completion.
7. The data is mapped to the host for examination. Calling
clEnqueueMapBuffer ensures the visibility of the buffer on the host, which in
this case probably includes a physical transfer. Alternatively, we could use
clEnqueueReadBuffer(), which requires a pre-allocated host-side buffer.
Example Code 1 –
//
// Copyright (c) 2010 Advanced Micro Devices, Inc. All rights reserved.
//
#include <CL/cl.h>
#include <stdio.h>
cl_platform_id platform;
cl_device_id device;
// 6. Launch the kernel. Let OpenCL pick the local work size.
clEnqueueNDRangeKernel( queue,
kernel,
1,
NULL,
&global_work_size,
NULL, 0, NULL, NULL);
clFinish( queue );
cl_uint *ptr;
ptr = (cl_uint *) clEnqueueMapBuffer( queue,
buffer,
CL_TRUE,
CL_MAP_READ,
0,
NWITEMS * sizeof(cl_uint),
0, NULL, NULL, NULL );
int i;
return 0;
}
The code is written so that it performs very well on either CPU or GPU. The
number of threads launched depends on how many hardware processors are
available. Each thread walks the source buffer, using a device-optimal access
pattern selected at runtime. A multi-stage reduction using __local and __global
atomics produces the single result value.
Runtime Code –
1. The source memory buffer is allocated, and initialized with a random pattern.
Also, the actual min() value for this data set is serially computed, in order to
later verify the parallel result.
2. The compiler is instructed to dump the intermediate IL and ISA files for
further analysis.
3. The main section of the code, including device setup, CL data buffer creation,
and code compilation, is executed for each device, in this case for CPU and
GPU. Since the source memory buffer exists on the host, it is shared. All
other resources are device-specific.
4. The global work size is computed for each device. A simple heuristic is used
to ensure an optimal number of threads on each device. For the CPU, a
given CL implementation can translate one work-item per CL compute unit
into one thread per CPU core.
On the GPU, an initial multiple of the wavefront size is used, which is
adjusted to ensure even divisibility of the input data over all threads. The
value of 7 is a minimum value to keep all independent hardware units of the
compute units busy, and to provide a minimum amount of memory latency
hiding for a kernel with little ALU activity.
5. After the kernels are built, the code prints errors that occurred during kernel
compilation and linking.
6. The main loop is set up so that the measured timing reflects the actual kernel
performance. If a sufficiently large NLOOPS is chosen, effects from kernel
launch time and delayed buffer copies to the device by the CL runtime are
minimized. Note that while only a single clFinish() is executed at the end
of the timing run, the two kernels are always linked using an event to ensure
serial execution.
The bandwidth is expressed as “number of input bytes processed.” For high-
end graphics cards, the bandwidth of this algorithm is about an order of
magnitude higher than that of the CPU, due to the parallelized memory
subsystem of the graphics card.
7. The results then are checked against the comparison value. This also
establishes that the result is the same on both CPU and GPU, which can
serve as the first verification test for newly written kernel code.
8. Note the use of the debug buffer to obtain some runtime variables. Debug
buffers also can be used to create short execution traces for each thread,
assuming the device has enough memory.
9. You can use the Timer.cpp and Timer.h files from the TransferOverlap
sample, which is in the SDK samples.
Kernel Code –
10. The code uses four-component vectors (uint4) so the compiler can identify
concurrent execution paths as often as possible. On the GPU, this can be
used to further optimize memory accesses and distribution across ALUs. On
the CPU, it can be used to enable SSE-like execution.
11. The kernel sets up a memory access pattern based on the device. For the
CPU, the source buffer is chopped into continuous buffers: one per thread.
Each CPU thread serially walks through its buffer portion, which results in
good cache and prefetch behavior for each core.
On the GPU, each thread walks the source buffer using a stride of the total
number of threads. As many threads are executed in parallel, the result is a
maximally coalesced memory pattern requested from the memory back-end.
For example, if each compute unit has 16 physical processors, 16 uint4
requests are produced in parallel, per clock, for a total of 256 bytes per clock.
12. The kernel code uses a reduction consisting of three stages: __global to
__private, __private to local, which is flushed to __global, and finally
__global to __global. In the first loop, each thread walks __global
memory, and reduces all values into a min value in __private memory
(typically, a register). This is the bulk of the work, and is mainly bound by
__global memory bandwidth. The subsequent reduction stages are brief in
comparison. A sketch of this first stage is shown immediately after this list.
13. Next, all per-thread minimum values inside the work-group are reduced to a
__local value, using an atomic operation. Access to the __local value is
serialized; however, the number of these operations is very small compared
to the work of the previous reduction stage. The threads within a work-group
are synchronized through a local barrier(). The reduced min value is
stored in __global memory.
14. After all work-groups are finished, a second kernel reduces all work-group
values into a single value in __global memory, using an atomic operation.
This is a minor contributor to the overall runtime.
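The first reduction stage described in items 12 and 13 can be sketched as
follows. This is an illustration only; identifier names and details differ from
the actual sample kernel (which appears, abbreviated, in Example Code 3 below),
and atom_min on __local memory requires the corresponding atomics extension.

__kernel void minp_sketch( __global const uint4 *src,
                           __global uint *gmin,   // one slot per work-group
                           uint count,            // uint4 elements per work-item
                           uint stride )          // total number of work-items
{
    __local uint lmin;
    uint pmin = UINT_MAX;    // private minimum

    // Stride through __global memory, reducing into the private value.
    for( uint i = get_global_id(0), n = 0; n < count; i += stride, n++ )
    {
        uint4 v = src[i];
        pmin = min( pmin, min( min(v.x, v.y), min(v.z, v.w) ) );
    }

    // Reduce the per-work-item results to one __local value per work-group.
    if( get_local_id(0) == 0 )
        lmin = UINT_MAX;
    barrier( CLK_LOCAL_MEM_FENCE );
    (void) atom_min( &lmin, pmin );
    barrier( CLK_LOCAL_MEM_FENCE );

    // Flush the work-group minimum to __global memory for the second kernel.
    if( get_local_id(0) == 0 )
        gmin[ get_group_id(0) ] = lmin;
}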
Example Code 3 –
//
// Copyright (c) 2010 Advanced Micro Devices, Inc. All rights reserved.
//
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "Timer.h"
#define NDEVS 2
" \n"
" // Dump some debug information. \n"
" \n"
" if( get_global_id(0) == 0 ) \n"
" { \n"
" dbg[0] = get_num_groups(0); \n"
" dbg[1] = get_global_size(0); \n"
" dbg[2] = count; \n"
" dbg[3] = stride; \n"
" } \n"
"} \n"
" \n"
"// 13. Reduce work-group min values from __global to __global. \n"
" \n"
"__kernel void reduce( __global uint4 *src, \n"
" __global uint *gmin ) \n"
"{ \n"
" (void) atom_min( gmin, gmin[get_global_id(0)] ) ; \n"
"} \n";
cl_uint *src_ptr;
unsigned int num_src_items = 4096*4096;
time_t ltime;
time(&ltime);
// Get a platform.
cl_program program;
cl_kernel minp;
cl_kernel reduce;
cl_mem src_buf;
cl_mem dst_buf;
cl_mem dbg_buf;
cl_uint *dst_ptr,
*dbg_ptr;
clGetDeviceIDs( platform,
devs[dev],
1,
&device,
NULL);
cl_uint compute_units;
size_t global_work_size;
size_t local_work_size;
size_t num_groups;
clGetDeviceInfo( device,
CL_DEVICE_MAX_COMPUTE_UNITS,
sizeof(cl_uint),
&compute_units,
NULL);
local_work_size = ws;
}
queue = clCreateCommandQueue(context,
device,
0, NULL);
if(ret != CL_SUCCESS)
{
printf("clBuildProgram failed: %d\n", ret);
char buf[0x10000];
clGetProgramBuildInfo( program,
device,
CL_PROGRAM_BUILD_LOG,
0x10000,
buf,
NULL);
printf("\n%s\n", buf);
return(-1);
}
CPerfCounter t;
t.Reset();
t.Start();
cl_event ev;
int nloops = NLOOPS;
while(nloops--)
{
clEnqueueNDRangeKernel( queue,
minp,
1,
NULL,
&global_work_size,
&local_work_size,
0, NULL, &ev);
clEnqueueNDRangeKernel( queue,
reduce,
1,
NULL,
&num_groups,
NULL, 1, &ev, NULL);
}
clFinish( queue );
t.Stop();
printf("\n");
return 0;
}
Chapter 2
Building and Running OpenCL
Programs
The compiler tool-chain provides a common framework for both CPUs and
GPUs, sharing the front-end and some high-level compiler transformations. The
back-ends are optimized for the device type (CPU or GPU). Figure 2.1 is a high-
level diagram showing the general compilation path of applications using
OpenCL. Functions of an application that benefit from acceleration are re-written
in OpenCL and become the OpenCL source. The code calling these functions
is changed to use the OpenCL API. The rest of the application remains
unchanged. The kernels are compiled by the OpenCL compiler to either CPU
binaries or GPU binaries, depending on the target device.
Figure 2.1 OpenCL Compiler Toolchain
For CPU processing, the OpenCL runtime uses the LLVM AS to generate x86
binaries. The OpenCL runtime automatically determines the number of
processing elements, or cores, present in the CPU and distributes the OpenCL
kernel between them.
For GPU processing, the OpenCL runtime post-processes the incomplete AMD
IL from the OpenCL compiler and turns it into complete AMD IL. This adds
macros (from a macro database, similar to the built-in library) specific to the
GPU. The OpenCL Runtime layer then removes unneeded functions and passes
the complete IL to the Shader Compiler for compilation to GPU-specific binaries.
For GPU processing, the LLVM IR-to-AMD IL module receives LLVM IR and
generates optimized IL for a specific GPU type in an incomplete format, which is
passed to the OpenCL runtime, along with some metadata for the runtime layer
to finish processing.
1. Compile all the C++ files (Template.cpp), and get the object files.
For 32-bit object files on a 32-bit system, or 64-bit object files on 64-bit
system:
g++ -o Template.o -DAMD_OS_LINUX -c Template.cpp -I$AMDAPPSDKROOT/include
2. Link all the object files generated in the previous step to the OpenCL library
and create an executable.
For linking to a 64-bit library:
g++ -o Template Template.o -lOpenCL -L$AMDAPPSDKROOT/lib/x86_64
The following are linking options if the samples depend on the SDKUtil Library
(assuming the SDKUtil library is created in $AMDAPPSDKROOT/lib/x86_64 for 64-
bit libraries, or $AMDAPPSDKROOT/lib/x86 for 32-bit libraries).
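For example, a 64-bit link line might look like the following; the SDKUtil library
name and path are assumptions based on the default SDK layout:

g++ -o Template Template.o -lSDKUtil -lOpenCL -L$AMDAPPSDKROOT/lib/x86_64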
-I dir — Add the directory dir to the list of directories to be searched for
header files. When parsing #include directives, the OpenCL compiler
resolves relative paths using the current working directory of the application.
-D name — Predefine name as a macro, with definition = 1. For -
D name=definition, the contents of definition are tokenized and processed
as if they appeared during translation phase three in a #define directive.
In particular, the definition is truncated by embedded newline characters.
-D options are processed in the order they are given in the options argument
to clBuildProgram (see the sketch at the end of this list of options).
-g — This is an experimental feature that lets you use the GNU project
debugger, GDB, to debug kernels on x86 CPUs running Linux or
cygwin/minGW under Windows. For more details, see Chapter 3, “Debugging
OpenCL.” This option does not affect the default optimization of the OpenCL
code.
-O0 — Specifies to the compiler not to optimize. This is equivalent to the
OpenCL standard option -cl-opt-disable.
-f[no-]bin-source — Does [not] generate OpenCL source in the .source
section. For more information, see Appendix E, “OpenCL Binary Image
Format (BIF) v2.0.”
-f[no-]bin-llvmir — Does [not] generate LLVM IR in the .llvmir section.
For more information, see Appendix E, “OpenCL Binary Image Format (BIF)
v2.0.”
-f[no-]bin-amdil — Does [not] generate AMD IL in the .amdil section.
For more information, see Appendix E, “OpenCL Binary Image Format (BIF)
v2.0.”
-f[no-]bin-exe — Does [not] generate the executable (ISA) in .text
section. For more information, see Appendix E, “OpenCL Binary Image
Format (BIF) v2.0.”
-save-temps[=<prefix>] — This option dumps intermediate temporary
files, such as IL and ISA code, for each OpenCL kernel. If <prefix> is not
given, temporary files are saved in the default temporary directory (the
current directory for Linux, C:\Users\<user>\AppData\Local for Windows).
If <prefix> is given, those temporary files are saved with the given
<prefix>. If <prefix> is an absolute path prefix, such as
C:\your\work\dir\mydumpprefix, those temporaries are saved under
C:\your\work\dir, with mydumpprefix as prefix to all temporary names. For
example,
-save-temps
under the default directory
_temp_nn_xxx_yyy.il, _temp_nn_xxx_yyy.isa
-save-temps=aaa
under the default directory
aaa_nn_xxx_yyy.il, aaa_nn_xxx_yyy.isa
-save-temps=C:\you\dir\bbb
under C:\you\dir
bbb_nn_xxx_yyy.il, bbb_nn_xxx_yyy.isa
where xxx and yyy are the device name and kernel name for this build,
respectively, and nn is an internal number to identify a build to avoid
overriding temporary files. Note that this naming convention is subject to
change.
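The options described above are passed to clBuildProgram as a single string. The
sketch below assumes program and device were created earlier, and uses an
illustrative include path and macro name:

const char *options = "-I ./kernels -D USE_LDS=1 -save-temps";
cl_int err = clBuildProgram( program, 1, &device, options, NULL, NULL );
if( err != CL_SUCCESS )
{
    char log[4096];
    clGetProgramBuildInfo( program, device, CL_PROGRAM_BUILD_LOG,
                           sizeof(log), log, NULL );
    printf( "%s\n", log ); // print the build log, as in Example Code 3
}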
To avoid source changes, there are two environment variables that can be used
to change CL options at runtime.
As illustrated in Figure 2.2, the application can create multiple command queues
(some in libraries, for different components of the application, etc.). These
queues are muxed into one queue per device type. The figure shows command
queues 1 and 3 merged into one CPU device queue (blue arrows); command
queue 2 (and possibly others) are merged into the GPU device queue (red
arrow). The device queue then schedules work onto the multiple compute
resources present in the device. Here, K = kernel commands, M = memory
commands, and E = event commands.
Figure 2.2 Runtime Processing Structure
Chapter 3
Debugging OpenCL
http://developer.amd.com/tools/gDEBugger/Pages/default.aspx
After installing gDEBugger for Visual Studio, launch Visual Studio, and open the
solution to be worked on. In the Visual Studio menu bar, note the new
gDEBugger menu, which contains all the required controls.
Select a Visual C/C++ project, and set its debugging properties as normal. To
add a breakpoint, either select New gDEBugger Breakpoint from the gDEBugger
menu, or navigate to a kernel file used in the application and set a breakpoint on
the desired source line. Then, select the Launch OpenCL/OpenGL Debugging
from the gDEBugger menu to start debugging.
To start kernel debugging, you can choose one of several options; one of these
is to Step Into (F11) the appropriate clEnqueueNDRangeKernel function call.
Once the kernel starts executing, debug it like C/C++ code, stepping into, out of,
or over function calls in the kernel, setting source breakpoints, and inspecting the
locals, autos, watch, and call stack views.
To view OpenCL and OpenGL objects and their information, use the gDEBugger
Explorer and gDEBugger Properties view. Additional views and features provide
more detailed and advanced information on the OpenCL and OpenGL runtimes,
their states, and the objects created within them.
For further information and more detailed usage instructions, see the gDEBugger
User Guide:
http://developer.amd.com/tools/gDEBugger/webhelp/index.html
To build OpenCL kernels with debug information and without optimization, set one
of the following environment variables before running the application:
AMD_OCL_BUILD_OPTIONS_APPEND="-g -O0" or
AMD_OCL_BUILD_OPTIONS="-g -O0"
To set a breakpoint, use:
b [N | function | kernel_name]
where N is the line number in the source code, function is the function name,
and kernel_name is constructed as follows: if the name of the kernel is
bitonicSort, the kernel_name is __OpenCL_bitonicSort_kernel.
Note that if no breakpoint is set, the program does not stop until execution is
complete.
Also note that OpenCL kernel symbols are not visible in the debugger until the
kernel is loaded. A simple way to check for known OpenCL symbols is to set a
breakpoint in the host code at clEnqueueNDRangeKernel and to use the GDB info
functions command, as shown in the sample session below.
Unsorted Input
53 5 199 15 120 9 71 107 71 242 84 150 134 180 26 128 196 9 98 4 102 65
206 35 224 2 52 172 160 94 2 214 99 .....
File OCLm2oVFr.cl:
void __OpenCL_bitonicSort_kernel(uint *, const uint, const uint, const
uint, const uint);
Non-debugging symbols:
0x00007ffff23c2dc0 __OpenCL_bitonicSort_kernel@plt
0x00007ffff23c2f40 __OpenCL_bitonicSort_stub
(gdb) b __OpenCL_bitonicSort_kernel
Breakpoint 2 at 0x7ffff23c2de9: file OCLm2oVFr.cl, line 32.
(gdb) c
Continuing.
[Switching to Thread 0x7ffff2fcc700 (LWP 1895)]
Continuing.
3.2.4 Notes
1. To make a breakpoint in a working thread with some particular ID in
dimension N, one technique is to set a conditional breakpoint when the
get_global_id(N) == ID. To do this, use:
b [ N | function | kernel_name ] if (get_global_id(N)==ID)
where N can be 0, 1, or 2.
Chapter 4
OpenCL Performance and
Optimization
This section describes the major features of Profiler version 2.4. Because the
Profiler is still being developed, please see the documentation for the latest
features of the tool at the same URL provided above.
Plug-in for Microsoft Visual Studio 2008 or 2010 (recommended). This lets
you visualize and analyze the results in multiple ways.
Command-line utility tool for both Windows and Linux platforms. This is a
way to collect data for applications without source code access. The results
can be analyzed directly in the text format or visualized in the Visual Studio
plug-in.
Reveal the high-level structure of the application with the Timeline view. This
lets you investigate the number of OpenCL contexts and command queues
created, as well as the relationships of these items in the application. The
timeline shows the host code, kernel, and data transfer execution. See
Section 4.1.1.1, “Timeline View,” page 4-2.
Identify whether the application is bound by kernel execution or data transfer
time; find the top ten most expensive kernels and data transfers; find the API
hot spots (most frequently called or expensive API call) in the application with
the Summary Pages view. See Section 4.1.1.2, “Summary Pages View,”
page 4-4).
View and debug the input parameters and output results for all API calls in
the application with the API Trace view. See Section 4.1.1.3, “API Trace
View,” page 4-5.
An OpenCL Performance Marker (CLPerfMarker) library is also provided for
visualizing and analyzing non-OpenCL host code on the Timeline. Users can
instrument their code with calls to clBeginPerfMarkerAMD() and
clEndPerfMarkerAMD(). These calls are then used by the Profiler to
annotate the host-code timeline hierarchically. For more information, see the
CLPerfMarkerAMD.pdf in the CLPerfMarker/Doc subdirectory under the
Profiler installation directory, typically
$AMDAPPSDKROOT/Tools/AMDAPPProfiler-vx.x/.
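A sketch of such instrumentation is shown below. The header name and argument
lists are assumptions; consult the CLPerfMarkerAMD.pdf document mentioned above
for the exact signatures.

#include "CLPerfMarkerAMD.h" // header name assumed

void update_phase( cl_command_queue queue )
{
    clBeginPerfMarkerAMD( "Update phase", "MyApp" ); // open a named marker region

    // ... host-side work and clEnqueue* calls belonging to this phase ...

    clEndPerfMarkerAMD(); // close the region; nested regions form a hierarchy
}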
The timeline view (the top half of Figure 4.1) provides a visual representation of
the execution of the application.
Figure 4.1 Timeline and API Trace View in Microsoft Visual Studio 2010
Along the top of the timeline is the time grid, which shows the total elapsed time
of the application, in milliseconds. Timing begins when the first OpenCL call is
made by the application, and ends when the final OpenCL call is made. Directly
below the time grid, each host (OS) thread that made at least one OpenCL call
is listed. For each host thread, the OpenCL API calls are plotted along the time
grid, showing the start time and duration of each call. Below the host threads, an
OpenCL tree shows all contexts and queues created by the application, along
with data transfer operations and kernel execution operations for each queue.
The Timeline View can be navigated by zooming, panning, collapsing/expanding,
and selecting an interest region. From the Timeline View, we can also navigate
to the corresponding API call in the API Trace View and vice versa.
The Timeline View can be useful for debugging your OpenCL application. Some
examples are:
Easily confirm that the high-level structure of the algorithm is correct (the
number of queues and contexts created match your expectation).
Confirm that synchronization has been performed properly in the application.
For example, if kernel A execution is dependent on a buffer write or copy
and/or outputs from kernel B execution, then, if the synchronization has been
set up correctly, kernel A execution appears after the completion of the buffer
execution and kernel B execution in the timeline grid. It can be hard to find
this type of synchronization error using traditional debugging techniques.
Confirm that the kernel and data transfer execution from all the queues have
been performed efficiently. This is easily verified by ensuring that non-
dependent kernel and data transfer execution happens concurrently in the
timeline grid.
The Summary Pages View (Figure 4.2) shows the statistics of your OpenCL
application. It can provide a general idea of the location of the program's
bottlenecks. It also provides useful information such as the number of buffers and
number of images created on each context, most expensive kernel call, etc.
Figure 4.2 Context Summary Page View in Microsoft Visual Studio 2010
1. API Summary shows the useful statistics for all OpenCL API calls made in
the application for API hot spots identification.
2. Context Summary shows the timing information for all the kernel dispatches
and data transfers for each context. This permits identifying whether the
application is bound by the kernel execution or data transfer. If the
application is bound by the data transfers, this page permits finding the most
expensive data transfer type (read, write, copy, or map) in the application.
3. Kernel Summary lists all the kernels that are created in the application. If the
application is bound by the kernel execution, it is possible to find the device
causing the bottleneck. If the kernel execution on the GPU device is the
bottleneck, use the GPU performance counters (see Section 4.1.2,
“Collecting OpenCL GPU Kernel Performance Counters,” page 4-5) to
investigate the bottleneck inside the kernel.
4. Top 10 Data Transfer Summary shows the top ten most expensive individual
data transfers.
5. Top 10 Kernel Summary shows the top ten most expensive individual kernel
executions.
6. Warning(s) and Error(s) shows potential problems in your OpenCL
application.
The Warning(s) and Error(s) page (Figure 4.3) shows potential problems and
optimization hints in your OpenCL application, including unreleased OpenCL
resources, OpenCL API failures, non-optimized work size, non-optimized data
transfer, and excessive synchronization; it also provides suggestions to achieve
better performance. Clicking on a hyperlink takes you to the corresponding
OpenCL API that generated the message.
The API Trace View (the bottom half in Figure 4.1) lists all the OpenCL API calls
made by the application. Each host thread that makes at least one OpenCL call
is listed in a separate tab. Each tab contains a list of all the API calls made by
that particular thread. For each call, it shows the index of the call (representing
execution order), the name of the API function, a semicolon-delimited list of
parameters passed to the function, and the value returned by the function.
Double-clicking an item in the API Trace view displays and zooms into that API
call in the Host Thread row in the Timeline View. If stack trace is enabled while
collecting the API trace, and the application contains debug information, it is
possible to navigate from the API trace to source code.
This lets you analyze and debug the input parameters and output results for each
API call. For example, it is easy to check that all the API calls are returning
CL_SUCCESS, all the buffers are created with the correct flags, as well as to
identify redundant API calls. The API Trace shows additional information about
data transfers using clEnqueueMapBuffer/clEnqueueMapImage; this includes the
source, destination, and transfer type of the map operation. Some devices can
take advantage of zero copy to save on data transfer time.
After determining the most expensive kernel to be optimized using the trace data,
collect the GPU performance counters to drill down to the kernel execution on
the GPU devices. Using the performance counters, it is possible to:
Find the number of resources (VGPR, SGPR [if applicable], and Local
Memory size) allocated for the kernel. These resources affect the possible
number of in-flight wavefronts in the GPU (higher number is required to hide
the data latency). The Occupancy Modeler identifies the limiting factor for
achieving a higher count of in-flight wavefronts.
Identify the number of ALU, global, and local memory instructions executed
in the GPU.
Identify the number of bytes fetched from and written to the global memory.
View use of the SIMD engine and memory units in the system.
View the efficiency of the Shader Compiler to pack ALU instructions into the
VLIW instructions in AMD GPUs.
View the local memory (LDS) bank conflict.
The Session view (Figure 4.4) shows the resulting performance counters for a
Profiler session. The output data is recorded in a csv format.
The top part of the page shows four graphs (only three on non-GCN devices)
that provide a visual indication of how kernel resources affect the theoretical
number of in-flight wavefronts on a compute unit. The graph representing the
limiting resource has its title displayed in red text. If there is more than one
limiting resource, more than one graph can have a red title. In each graph, the
actual usage of the particular resource being graphed is highlighted with an
orange square. Hovering the mouse over a point in the graph causes a popup
hint to be displayed; this shows the current X and Y values at that location.
The first graph, titled Number of waves limited by Work-group size, shows
how the number of active wavefronts is affected by the size of the work-group
for the dispatched kernel. In the above screen shot, note that the highest number
of wavefronts is achieved when the work-group size is in the range of 64 to 128.
The second graph, titled Number of waves limited by VGPRs, shows how the
number of active wavefronts is affected by the number of vector GPRs used by
the dispatched kernel. In the above screen shot, note that as the number of
VGPRs used increases, the number of active wavefronts decreases in steps. Note
that this graph shows that more than 62 GPRs can be allocated, although 62 is
the maximum number, since the Shader Compiler assumes the work-group size
is 256 items by default (the largest possible work-group size). For the Shader
Compiler to allocate more than 62 GPRs, the kernel source code must be
marked with the reqd_work_group_size kernel attribute. This attribute can tell
the Shader Compiler that the kernel is to be launched with a work-group size
smaller than the maximum, allowing it to allocate more GPRs. Thus, for X-axis
values greater than 62, the GPR graph shows the theoretical number of
wavefronts that can be launched if the kernel specified a smaller work-group size
using that attribute.
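For example, a kernel that is always launched with 64 work-items per work-group
can declare that fact as follows (the kernel itself is illustrative):

__attribute__((reqd_work_group_size(64, 1, 1)))
__kernel void scale( __global float *data )
{
    data[ get_global_id(0) ] *= 2.0f;
}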
If running on an AMD Radeon HD 7XXX series (GCN) device, the third graph,
titled Number of waves limited by SGPR, shows how the number of active
wavefronts is affected by the number of scalar GPRs used by the dispatched
kernel. In the above screen shot, note that as the number of used SGPRs
increases, the number of active wavefronts decreases in steps.
The fourth graph, titled Number of waves limited by LDS, shows how the
number of active wavefronts is affected by the amount of LDS used by the
dispatched kernel. In the above screen shot, note that as the amount of LDS
usage increases, the number of active wavefronts decreases in steps.
Below the graphs is a table that provides information about the device, the
kernel, and the kernel occupancy. In the Kernel Occupancy section, note the
limits imposed by each kernel resource, as well as which resource is currently
limiting the number of waves for the kernel dispatch. This also displays the kernel
occupancy percentage.
The KernelAnalyzer can be installed as part of the AMD APP SDK installation,
or individually using its own installer package. The KernelAnalyzer package can
be downloaded from:
http://developer.amd.com/tools/AMDAPPKernelAnalyzer/Pages/default.aspx.
With the KernelAnalyzer, developers can:
– Compile and disassemble the OpenCL kernel for multiple Catalyst driver
  versions and GPU device targets.
– View the OpenCL kernel compilation error messages from the OpenCL
  runtime.
– View the AMD Intermediate Language (IL) code generated by the OpenCL
  runtime.
– View the ISA code generated by the AMD Shader Compiler. Typically,
  hard-core kernel optimizations are performed by analyzing the ISA code.
– View the statistics generated by analyzing the ISA code.
– View the general-purpose register (GPR) usage and spill registers allocated
  for the kernel.
The instruction set architecture (ISA) defines the instructions and formats
accessible to programmers and compilers for the AMD GPUs. The Northern
Islands-family ISA instructions and microcode are documented in the AMD
Northern Islands-Family ISA Instructions and Microcode.
Developers also can generate IL and ISA code from their OpenCL kernel by
setting the environment variable AMD_OCL_BUILD_OPTIONS_APPEND=-save-temps
(see Section 2.1.4, “AMD-Developed Supplemental Compiler Options,” page 2-
4).
The sample code below shows how to compute the kernel execution time (End-
Start):
cl_event myEvent;
cl_ulong startTime, endTime;
clGetEventProfilingInfo(myEvent, CL_PROFILING_COMMAND_START,
                        sizeof(cl_ulong), &startTime, NULL);
clGetEventProfilingInfo(myEvent, CL_PROFILING_COMMAND_END,
                        sizeof(cl_ulong), &endTime, NULL);
cl_ulong kernelExecTimeNs = endTime - startTime;
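For the timestamps to be valid, the command queue must be created with profiling
enabled, and the event must come from the enqueue call being measured. The
following is a minimal sketch of that setup; context, device, kernel, and the
NDRange size are placeholders, not part of the sample above.

cl_int err;
cl_command_queue queue = clCreateCommandQueue(context, device,
                                              CL_QUEUE_PROFILING_ENABLE, &err);
size_t globalSize = 1024 * 1024;
cl_event myEvent;
err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                             &globalSize, NULL, 0, NULL, &myEvent);
clWaitForEvents(1, &myEvent);  /* timestamps are defined once the command completes */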
The AMD APP Profiler also can record the execution time for a kernel
automatically. The Kernel Time metric reported in the Profiler output uses the
built-in OpenCL timing capability and reports the same result as the
kernelExecTimeNs calculation shown above.
Another interesting metric to track is the kernel launch time (Start – Queue). The
kernel launch time includes both the time spent in the user application (after
enqueuing the command, but before it is submitted to the device), as well as the
time spent in the runtime to launch the kernel. For CPU devices, the kernel
launch time is fast (tens of µs), but for discrete GPU devices it can be several
hundred µs. Enabling profiling on a command queue adds approximately 10 µs
to 40 µs of overhead to all clEnqueue calls. Much of the profiling overhead affects
the start time; thus, it is visible in the launch time. Be careful when interpreting
this metric. To reduce the launch overhead, the AMD OpenCL runtime combines
several command submissions into a batch. Commands submitted as batch
report similar start times and the same end time.
Measure the performance of your test with CPU counters. Do not use OCL profiling.
To determine if an application executes asynchronously, build a dependent
execution with OCL events. This is a "generic" solution; however, there is an
exception in which you can enable profiling and still have overlapped transfers.
DRMDMA engines do not support timestamps ("GPU counters"); to get OCL
profiling data, the runtime must synchronize the main command processor (CP)
with the DMA engine, and this disables overlap. Note, however, that Southern
Islands has two independent main CPs, and the runtime pairs them with the DMA
engines. So, the application can still execute kernels on one CP while the other
is synced with a DRMDMA engine for profiling; this lets you profile with APP or
OCL profiling.
The profiling timer resolution can be queried with:
clGetDeviceInfo(…, CL_DEVICE_PROFILING_TIMER_RESOLUTION, …);
AMD CPUs and GPUs report a timer resolution of 1 ns. AMD OpenCL devices
are required to correctly track time across changes in frequency and power
states. Also, the AMD OpenCL SDK uses the same time-domain for all devices
in the platform; thus, the profiling timestamps can be directly compared across
the CPU and GPU devices.
The sample code below can be used to read the current value of the OpenCL
timer clock. The clock is the same routine used by the AMD OpenCL runtime to
generate the profiling timestamps. This function is useful for correlating other
program events with the OpenCL profiling timestamps.
uint64_t
timeNanos()
{
#ifdef __linux__
    struct timespec tp;
    clock_gettime(CLOCK_MONOTONIC, &tp);
    return (unsigned long long) tp.tv_sec * (1000ULL * 1000ULL * 1000ULL) +
           (unsigned long long) tp.tv_nsec;
#else
    LARGE_INTEGER current;
    QueryPerformanceCounter(&current);
    // m_ticksPerSec is the counter frequency previously obtained from
    // QueryPerformanceFrequency().
    return (unsigned long long)((double)current.QuadPart / m_ticksPerSec * 1e9);
#endif
}
Normal CPU time-of-day routines can provide a rough measure of the elapsed
time of a GPU kernel. GPU kernel execution is non-blocking, that is, calls to
enqueue*Kernel return to the CPU before the work on the GPU is finished. For
an accurate time value, ensure that the GPU is finished. In OpenCL, you can
force the CPU to wait for the GPU to become idle by inserting calls to
clFinish() before and after the sequence you want to time; this increases the
timing accuracy of the CPU routines. The routine clFinish() blocks the CPU
until all previously enqueued OpenCL commands have finished.
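A minimal sketch of this bracketing technique follows; queue is a placeholder for
an existing command queue, and the Linux clock_gettime() timer is used for the
CPU-side measurement.

#include <CL/cl.h>
#include <time.h>

struct timespec t0, t1;
clFinish(queue);                      /* drain previously enqueued commands      */
clock_gettime(CLOCK_MONOTONIC, &t0);

/* ... enqueue the kernel(s) to be timed here ... */

clFinish(queue);                      /* wait until the timed GPU work completes */
clock_gettime(CLOCK_MONOTONIC, &t1);
double elapsedNs = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);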
For more information, see section 5.9, “Profiling Operations on Memory Objects
and Kernels,” of the OpenCL 1.0 Specification.
The effective bandwidth of a kernel can be computed as:
    Effective Bandwidth = (Br + Bw) / T
where:
    Br = total number of bytes read from global memory.
    Bw = total number of bytes written to global memory.
    T = time required to run the kernel, specified in nanoseconds.
If Br and Bw are specified in bytes, and T in ns, the resulting effective bandwidth
is measured in GB/s, which is appropriate for current CPUs and GPUs for which
the peak bandwidth range is 20-260 GB/s. Computing Br and Bw requires a
thorough understanding of the kernel algorithm; it also can be a highly effective
way to optimize performance. For illustration purposes, consider a simple matrix
addition: each element in the two source arrays is read once, added together,
then stored to a third array. Assuming 4-byte (float) elements, the effective
bandwidth for a 1024x1024 matrix addition is calculated from:
    Br = 2 x (1024 x 1024 x 4 bytes) = 8388608 bytes
    Bw = 1 x (1024 x 1024 x 4 bytes) = 4194304 bytes
If the elapsed time for this copy as reported by the profiling timers is 1000000 ns
(1 million ns, or .001 sec), the effective bandwidth is:
    (Br + Bw) / T = (8388608 + 4194304) bytes / 1000000 ns = 12.6 GB/s
The AMD APP Profiler can report the number of dynamic instructions per thread
that access global memory through the FetchInsts and WriteInsts counters. The
Fetch and Write reports average the per-thread counts; these can be fractions if
the threads diverge. The Profiler also reports the dimensions of the global
NDRange for the kernel in the GlobalWorkSize field. The total number of threads
can be determined by multiplying together the three components of the range. If
all (or most) global accesses are the same size, the counts from the Profiler and
the approximate size can be used to estimate Br and Bw:
    Br = FetchInsts x (number of work-items) x (access size in bytes)
    Bw = WriteInsts x (number of work-items) x (access size in bytes)
In this example, assume we know that all accesses in the kernel are four bytes;
then, the bandwidth can be calculated by substituting four bytes for the access
size in the two expressions above.
OpenCL uses memory objects to pass data to kernels. These can be either
buffers or images. Space for these is managed by the runtime, which uses
several types of memory, each with different performance characteristics. Each
type of memory is suitable for a different usage pattern. The following
subsections describe these memory types.
[Table: memory types and their access characteristics for CPU read/write, GPU
shader read/write, and GPU DMA read/write.]
Host memory and device memory in the table consist of one of the subtypes
given below.
This regular CPU memory can be accessed by the CPU at full memory bandwidth;
however, it is not directly accessible by the GPU. For the GPU to transfer host
memory to device memory (for example, as a parameter to
clEnqueueReadBuffer or clEnqueueWriteBuffer), it first must be pinned (see
section 4.5.1.2). Pinning takes time, so avoid it on paths where CPU overhead
must be kept low.
When host memory is copied to device memory, the OpenCL runtime uses the
following transfer methods.
– <=32 kB: For transfers from the host to device, the data is copied by the CPU
  to a runtime pinned host memory buffer, and the DMA engine transfers the
  data to device memory. The opposite is done for transfers from the device to
  the host.
– >32 kB and <=16 MB: The host memory physical pages containing the data
  are pinned, the GPU DMA engine is used, and the pages then are unpinned.
– >16 MB: The runtime pins host memory in stages of 16 MB blocks and
  transfers data to the device using the GPU DMA engine. Double buffering for
  pinning is used to overlap the pinning cost of each 16 MB block with the DMA
  transfer.
This is host memory that the operating system has bound to a fixed physical
address and that the operating system ensures is resident. The CPU can access
pinned host memory at full memory bandwidth. The runtime limits the total
amount of pinned host memory that can be used for memory objects. (See
Section 4.5.2, “Placement,” page 4-16, for information about pinning memory.)
If the runtime knows the data is in pinned host memory, it can be transferred to,
and from, device memory without requiring staging buffers or having to perform
pinning/unpinning on each transfer. This offers improved transfer performance.
Currently, the runtime recognizes data as being in pinned host memory only for
operation arguments that are memory objects it has allocated in pinned host
memory; for example, the buffer argument of
clEnqueueReadBuffer/clEnqueueWriteBuffer and the image argument of
clEnqueueReadImage/clEnqueueWriteImage. It does not detect that the ptr
arguments of these operations address pinned host memory, even if they are
the result of clEnqueueMapBuffer/clEnqueueMapImage on a memory object that
is in pinned host memory.
The runtime can make pinned host memory directly accessible from the GPU.
Like regular host memory, the CPU uses caching when accessing pinned host
memory. Thus, GPU accesses must use the CPU cache coherency protocol
when accessing. For discrete devices, the GPU access to this memory is through
the PCIe bus, which also limits bandwidth. For APU devices that do not have the
PCIe overhead, GPU access is significantly slower than accessing device-visible
host memory (see section 4.5.1.3), which does not use the cache coherency
protocol.
The runtime allocates a limited amount of pinned host memory that is accessible
by the GPU without using the CPU cache coherency protocol. This allows the
GPU to access the memory at a higher bandwidth than regular pinned host
memory.
APU devices have no device memory and use device-visible host memory for
their global device memory.
Discrete GPU devices have their own dedicated memory, which provides the
highest bandwidth for GPU access. The CPU cannot directly access device
memory on a discrete GPU (except for the host-visible device memory portion
described in section 4.5.1.5).
On an APU, the system memory is shared between the GPU and the CPU; it is
visible by either the CPU or the GPU at any given time. A significant benefit of
this is that buffers can be zero copied between the devices by using map/unmap
operations to logically move the buffer between the CPU and the GPU address
space. (Note that in the system BIOS at boot time, it is possible to allocate the
size of the frame buffer. This section of memory is divided into two parts, one of
which is invisible to the CPU. Thus, not all system memory supports zero copy.
See Table 4.2, specifically the Default row.) See Section 4.5.4, “Mapping,”
page 4-18, for more information on zero copy.
4.5.2 Placement
Every OpenCL memory object has a location that is defined by the flags passed
to clCreateBuffer/clCreateImage. A memory object can be located either on
a device, or (as of SDK 2.4) it can be located on the host and accessed directly
by all the devices. The Location column of Table 4.2 gives the memory type used
for each of the allocation flag values for different kinds of devices. When a device
kernel is executed, it accesses the contents of memory objects from this location.
The performance of these accesses is determined by the memory kind used.
An OpenCL context can have multiple devices, and a memory object that is
located on a device has a location on each device. To avoid over-allocating
device memory for memory objects that are never used on that device, space is
not allocated until first used on a device-by-device basis. For this reason, the first
use of a memory object after it is created can be slower than subsequent uses.
clCreateBuffer/clCreateImage flags argument: CL_MEM_USE_PERSISTENT_MEM_AMD
(when VM is enabled)
    Discrete GPU: Location = host-visible device memory
    APU:          Location = host-visible device memory
    CPU:          Location = host memory
    Map Mode = Zero copy; Map Location (for clEnqueueMapBuffer/
    clEnqueueMapImage/clEnqueueUnmapMemObject) = Use Location directly
    (a different memory area can be used on each map).
clCreateBuffer/clCreateImage flags argument: CL_MEM_USE_PERSISTENT_MEM_AMD
(when VM is not enabled)
    Same as default.
4.5.3.2 Using Both CPU and GPU Devices, or using an APU Device
Unlike GPUs, CPUs do not contain dedicated hardware (samplers) for accessing
images. Instead, image access is emulated in software. Thus, a developer may
prefer using buffers instead of images if no sampling operation is needed.
4.5.4 Mapping
The host application can use clEnqueueMapBuffer/clEnqueueMapImage to
obtain a pointer that can be used to access the memory object data. When
finished accessing, clEnqueueUnmapMemObject must be used to make the data
available to device kernel access. When a memory object is located on a device,
the data either can be transferred to, and from, the host, or (as of SDK 2.4) be
accessed directly from the host. Memory objects that are located on the host, or
located on the device but accessed directly by the host, are termed zero copy
memory objects. The data is never transferred, but is accessed directly by both
the host and device. Memory objects that are located on the device and
transferred to, and from, the device when mapped and unmapped are termed
copy memory objects. The Map Mode column of Table 4.2 specifies the transfer
mode used for each kind of memory object, and the Map Location column
indicates the kind of memory referenced by the pointer returned by the map
operations.
From Southern Islands on, devices support zero copy memory objects under
Linux; however, only images created with CL_MEM_USE_PERSISTENT_MEM_AMD can
be zero copy.
Zero copy host resident memory objects can boost performance when host
memory is accessed by the device in a sparse manner or when a large host
memory buffer is shared between multiple devices and the copies are too
expensive. When choosing this, the cost of the transfer must be greater than the
extra cost of the slower accesses.
Streaming writes by the host to zero copy device-resident memory objects are
about as fast as explicit transfers, so such objects can be a good choice when
the host does not need to read the memory object; this also avoids the host
making a separate copy of the data to transfer. Memory objects requiring partial updates between kernel
executions can also benefit. If the contents of the memory object must be read
by the host, use clEnqueueCopyBuffer to transfer the data to a separate
CL_MEM_ALLOC_HOST_PTR buffer.
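The following is a minimal sketch of that read-back path, assuming an existing
context and queue; devZeroCopyBuf and nBytes are placeholders.

#include <CL/cl.h>

cl_int err;
cl_mem hostBuf = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, nBytes, NULL, &err);
clEnqueueCopyBuffer(queue, devZeroCopyBuf, hostBuf, 0, 0, nBytes, 0, NULL, NULL);
void *p = clEnqueueMapBuffer(queue, hostBuf, CL_TRUE, CL_MAP_READ,
                             0, nBytes, 0, NULL, NULL, &err);
/* ... the host reads p at full memory bandwidth ... */
clEnqueueUnmapMemObject(queue, hostBuf, p, 0, NULL, NULL);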
For memory objects with copy map mode, the memory object location is on the
device, and it is transferred to, and from, the host when clEnqueueMapBuffer /
clEnqueueMapImage / clEnqueueUnmapMemObject are called. Table 4.3 shows
how the map_flags argument affects transfers. The runtime transfers only the
portion of the memory object requested in the offset and cb arguments. When
accessing only a portion of a memory object, only map that portion for improved
performance.
map_flags argument               Transfer on clEnqueueMapBuffer /   Transfer on
                                 clEnqueueMapImage                  clEnqueueUnmapMemObject
CL_MAP_READ                      Device to host, if the map         Host to device.
CL_MAP_WRITE                     location is not current.
CL_MAP_WRITE_INVALIDATE_REGION   None.                              Host to device.
When images are transferred, additional costs are involved because the image
must be converted to, and from, linear address mode for host access. The
runtime does this by executing kernels on the device.
Southern Islands and later devices support at least two hardware compute
queues. This allows an application to increase the throughput of small dispatches
by using two command queues for asynchronous submission and, possibly,
asynchronous execution.
The hardware compute queues are selected in the following order: first queue =
even OCL command queues, second queue = odd OCL queues.
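A minimal sketch of this pattern follows, assuming an existing context and device;
kernelA, kernelB, and the NDRange size are placeholders. Per the mapping
described above, two queues created back to back land on different hardware
compute queues, so independent kernels submitted to them can be processed
asynchronously.

#include <CL/cl.h>

cl_int err;
cl_command_queue q0 = clCreateCommandQueue(context, device, 0, &err);
cl_command_queue q1 = clCreateCommandQueue(context, device, 0, &err);

size_t gsz = 4096;
clEnqueueNDRangeKernel(q0, kernelA, 1, NULL, &gsz, NULL, 0, NULL, NULL);
clEnqueueNDRangeKernel(q1, kernelB, 1, NULL, &gsz, NULL, 0, NULL, NULL);
clFlush(q0);   /* submit both batches to the device */
clFlush(q1);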
4.6.1 Definitions
Deferred allocation — The CL runtime attempts to minimize resource
consumption by delaying buffer allocation until first use. As a side effect, the
first accesses to a buffer may be more expensive than subsequent accesses.
Peak interconnect bandwidth — As used in the text below, this is the transfer
bandwidth between host and device that is available under optimal conditions
at the application level. It is dependent on the type of interconnect, the
chipset, and the graphics chip. As an example, a high-performance PC with
a PCIe 3.0 16x bus and a GCN architecture (AMD Radeon HD 7XXX
series) graphics card has a nominal interconnect bandwidth of 16 GB/s.
Pinning — When a range of host memory is prepared for transfer to the GPU,
its pages are locked into system memory. This operation is called pinning; it
can impose a high cost, proportional to the size of the memory range. One
of the goals of optimizing data transfer is to use pre-pinned buffers whenever
possible. However, if pre-pinned buffers are used excessively, it can reduce
the available system memory and result in excessive swapping. Host side
zero copy buffers provide easy access to pre-pinned memory.
WC — Write Combine is a feature of the CPU write path to a select region
of the address space. Multiple adjacent writes are combined into cache lines
(for example, 64 bytes) before being sent to the external bus. This path
typically provides fast streamed writes, but slower scattered writes.
Depending on the chip set, scattered writes across a graphics interconnect
can be very slow. Also, some platforms require multi-core CPU writes to
saturate the WC path over an interconnect.
Uncached accesses — Host memory and I/O regions can be configured as
uncached. CPU read accesses are typically very slow; for example:
uncached CPU reads of graphics memory over an interconnect.
USWC — Host memory from the Uncached Speculative Write Combine heap
can be accessed by the GPU without causing CPU cache coherency traffic.
Due to the uncached WC access path, CPU streamed writes are fast, while
CPU reads are very slow. On APU devices, this memory provides the fastest
possible route for CPU writes followed by GPU reads.
4.6.2 Buffers
OpenCL buffers currently offer the widest variety of specialized buffer types and
optimized paths, as well as slightly higher transfer performance.
AMD APP SDK 2.4 on Windows 7 and Vista introduces a new feature called zero
copy buffers.
If a buffer is of the zero copy type, the runtime tries to leave its content in place,
unless the application explicitly triggers a transfer (for example, through
clEnqueueCopyBuffer()). Depending on its type, a zero copy buffer resides on
the host or the device. Independent of its location, it can be accessed directly by
the host CPU or a GPU device kernel, at a bandwidth determined by the
capabilities of the hardware interconnect.
Since not all possible read and write paths perform equally, check the application
scenarios below for recommended usage. To assess performance on a given
platform, use the BufferBandwidth sample.
If a given platform supports the zero copy feature, the following buffer types are
available:
– host-resident zero copy buffers, created with CL_MEM_ALLOC_HOST_PTR or
  CL_MEM_USE_HOST_PTR; and
– device-resident zero copy buffers, created with
  CL_MEM_USE_PERSISTENT_MEM_AMD.
Zero copy buffers work well on APU devices. SDK 2.5 introduced an optimization
that is of particular benefit on APUs. The runtime uses USWC memory for buffers
allocated as CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY. On APU systems,
this type of zero copy buffer can be written to by the CPU at very high data rates,
then handed over to the GPU at minimal cost for equally high GPU read-data
rates over the Radeon memory bus. This path provides the highest data transfer
rate for the CPU-to-GPU path. The use of multiple CPU cores may be necessary
to achieve peak write performance.
As this memory is not cacheable, CPU read operations are very slow. This type
of buffer also exists on discrete platforms, but transfer performance typically is
limited by PCIe bandwidth.
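A minimal sketch of this CPU-write/GPU-read path follows, assuming an existing
context, queue, and kernel; nBytes is a placeholder.

#include <CL/cl.h>
#include <string.h>

cl_int err;
cl_mem buf = clCreateBuffer(context,
                            CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY,
                            nBytes, NULL, &err);

/* The CPU fills the buffer through a map; the writes stream through the USWC
   path, so avoid reading this memory back from the CPU. */
void *p = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                             0, nBytes, 0, NULL, NULL, &err);
memset(p, 0, nBytes);
clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);

/* Hand the buffer to a GPU kernel for high-bandwidth reads. */
clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);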
Zero copy buffers can provide low latency for small transfers, depending on the
transfer path. For small buffers, the combined latency of map/CPU memory
access/unmap can be smaller than the corresponding DMA latency.
AMD APP SDK 2.5 introduces a new feature called pre-pinned buffers. This
feature is supported on Windows 7, Windows Vista, and Linux.
Pre-pinned buffers can be used with the following calls:
– clEnqueueRead/WriteBuffer
– clEnqueueRead/WriteImage
– clEnqueueRead/WriteBufferRect
Note that the CL image calls must use pre-pinned mapped buffers on the host
side, and not pre-pinned images.
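The following is a minimal sketch of one way to use the pre-pinned path: a
host-resident CL_MEM_ALLOC_HOST_PTR buffer is mapped once, and its mapped
pointer is then passed as the ptr argument of clEnqueueWriteBuffer. context,
queue, nBytes, and devBuf are placeholders.

#include <CL/cl.h>

cl_int err;
cl_mem pinned = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, nBytes, NULL, &err);
void *hostPtr = clEnqueueMapBuffer(queue, pinned, CL_TRUE,
                                   CL_MAP_READ | CL_MAP_WRITE,
                                   0, nBytes, 0, NULL, NULL, &err);

/* ... the application fills hostPtr with the data to transfer ... */

cl_mem devBuf = clCreateBuffer(context, CL_MEM_READ_WRITE, nBytes, NULL, &err);
err = clEnqueueWriteBuffer(queue, devBuf, CL_FALSE, 0, nBytes,
                           hostPtr, 0, NULL, NULL);  /* DMA from pinned memory */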
From an application point of view, two fundamental use cases exist, and they can
be linked to the various options, described below.
Note that the OpenCL runtime uses deferred allocation to maximize memory
resources. This means that a complete roundtrip chain, including data transfer
and kernel compute, might take one or two iterations to reach peak performance.
Option 4 - Direct host access to a zero copy device buffer (requires zero
copy support)
This option allows overlapping of data transfers and GPU compute. It is also
useful for sparse write updates under certain constraints.
a. A zero copy buffer on the device is created using the following command:
buf = clCreateBuffer ( .., CL_MEM_USE_PERSISTENT_MEM_AMD, .. )
This buffer can be directly accessed by the host CPU, using the
uncached WC path. This can take place at the same time the GPU
executes a compute kernel. A common double buffering scheme has the
kernel process data from one buffer while the CPU fills a second buffer.
See the TransferOverlap code sample.
A zero copy device buffer can also be used for sparse updates, such
as assembling sub-rows of a larger matrix into a smaller, contiguous
block for GPU processing. Due to the WC path, it is a good design
choice to try to align writes to the cache line size, and to pick the write
block size as large as possible.
b. Transfer from the host to the device.
1. ptr = clEnqueueMapBuffer( .., buf, .., CL_MAP_WRITE, .. )
This operation is low cost because the zero copy device buffer is
directly mapped into the host address space.
2. The application transfers data via memset( ptr ), memcpy( ptr,
srcptr ), or direct CPU writes.
The CPU writes directly across the interconnect into the zero copy
device buffer. Depending on the chipset, the bandwidth can be of
the same order of magnitude as the interconnect bandwidth,
although it typically is lower than peak.
3. clEnqueueUnmapMemObject( .., buf, ptr, .. )
As with the preceding map, this operation is low cost because the
buffer continues to reside on the device.
Option 5 - Direct GPU access to a zero copy host buffer (requires zero
copy support)
This option allows direct reads or writes of host memory by the GPU. A GPU
kernel can import data from the host without explicit transfer, and write data
directly back to host memory. An ideal use is to perform small I/Os straight
from the kernel, or to integrate the transfer latency directly into the kernel
execution time.
a. The application creates a zero copy host buffer.
buf = clCreateBuffer( .., CL_MEM_ALLOC_HOST_PTR, .. )
b. Next, the application modifies or reads the zero copy host buffer.
1. ptr = clEnqueueMapBuffer( .., buf, .., CL_MAP_READ |
CL_MAP_WRITE, .. )
The achievable bandwidth depends on the platform and chipset, but can
be of the same order of magnitude as the peak interconnect bandwidth.
For discrete graphics cards, it is important to note that the resulting GPU
kernel bandwidth is an order of magnitude lower than that of a kernel
accessing a regular device buffer located on the device.
d. Following kernel execution, the application can access data in the host
buffer in the same manner as described above.
Table 4.5 provides a comparison of the CPU and GPU performance
characteristics in an AMD A8-4555M “Trinity” APU (19 W, 21 GB/s memory bandwidth).
Each GPU wavefront has its own register state, which enables the fast single-
cycle switching between threads. Also, GPUs can be very efficient at
gather/scatter operations: each work-item can load from any arbitrary address,
and the registers are completely decoupled from the other threads. This is
substantially more flexible and higher-performing than a classic Vector ALU-style
architecture (such as SSE on the CPU), which typically requires that data be
accessed from contiguous and aligned memory locations. SSE supports
instructions that write parts of a register (for example, MOVLPS and MOVHPS, which
write the upper and lower halves, respectively, of an SSE register), but these
instructions generate additional microarchitecture dependencies and frequently
require additional pack instructions to format the data correctly.
In contrast, each GPU thread shares the same program counter with 63 other
threads in a wavefront. Divergent control-flow on a GPU can be quite expensive
and can lead to significant under-utilization of the GPU device. When control flow
substantially narrows the number of valid work-items in a wave-front, it can be
faster to use the CPU device.
CPUs also tend to provide significantly more on-chip cache than GPUs. In this
example, the CPU device contains 512 kB L2 cache/core plus a 6 MB L3 cache
that is shared among all cores, for a total of 8 MB of cache. In contrast, the GPU
device contains only 128 kB cache shared by the five compute units. The larger
CPU cache serves both to reduce the average memory latency and to reduce
memory bandwidth in cases where data can be re-used from the caches.
Finally, note the approximate 2X difference in kernel launch latency. The GPU
launch time includes both the latency through the software stack, as well as the
time to transfer the compiled kernel and associated arguments across the PCI-
express bus to the discrete GPU. Notably, the launch time does not include the
time to compile the kernel. The CPU can be the device-of-choice for small, quick-
running problems when the overhead to launch the work on the GPU outweighs
the potential speedup. Often, the work size is data-dependent, and the choice of
device can be data-dependent as well. For example, an image-processing
algorithm may run faster on the GPU if the images are large, but faster on the
CPU when the images are small.
For some algorithms, the advantages of the GPU (high computation throughput,
latency hiding) are offset by the advantages of the CPU (low latency, caches, fast
launch time), so that the performance on either device is similar. This case is
more common for mid-range GPUs and when running more mainstream
algorithms. If the CPU and the GPU deliver similar performance, the user can
get the benefit of either improved power efficiency (by running on the GPU) or
higher peak performance (by using both devices).
Usually, when the data size is small, it is faster to use the CPU because the start-
up time is quicker than on the GPU due to a smaller driver overhead and
avoiding the need to copy buffers from the host to the device.
makes these cases more common and more extreme. Here are some
suggestions for these situations.
– The scheduler should support sending different workload sizes to
different devices. GPUs typically prefer larger grain sizes, and higher-
performing GPUs prefer still larger grain sizes.
– The scheduler should be conservative about allocating work until after it
has examined how the work is being executed. In particular, it is
important to avoid the performance cliff that occurs when a slow device
is assigned an important long-running task. One technique is to use small
grain allocations at the beginning of the algorithm, then switch to larger
grain allocations when the device characteristics are well-known.
– As a special case of the above rule, when the devices are substantially
different in performance (perhaps 10X), load-balancing has only a small
potential performance upside, and the overhead of scheduling the load
probably eliminates the advantage. In the case where one device is far
faster than everything else in the system, use only the fast device.
– The scheduler must balance small grain sizes (which increase the
adaptiveness of the schedule and can efficiently use heterogeneous
devices) with larger grain sizes (which reduce scheduling overhead).
Note that the grain size must be large enough to efficiently use the GPU.
Asynchronous Launch
OpenCL devices are designed to be scheduled asynchronously from a
command-queue. The host application can enqueue multiple kernels, flush
the kernels so they begin executing on the device, then use the host core for
other work. The AMD OpenCL implementation uses a separate thread for
each command-queue, so work can be transparently scheduled to the GPU
in the background.
Avoid starving the high-performance GPU devices. This can occur if the
physical CPU core, which must re-fill the device queue, is itself being used
as a device. A simple approach to this problem is to dedicate a physical CPU
core for scheduling chores. The device fission extension (see Section A.7,
“cl_ext Extensions,” page A-4) can be used to reserve a core for scheduling.
For example, on a quad-core device, device fission can be used to create an
OpenCL device with only three cores.
Another approach is to schedule enough work to the device so that it can
tolerate latency in additional scheduling. Here, the scheduler maintains a
watermark of uncompleted work that has been sent to the device, and refills
the queue when it drops below the watermark. This effectively increases the
grain size, but can be very effective at reducing or eliminating device
starvation. Developers cannot directly query the list of commands in the
OpenCL command queues; however, it is possible to pass an event to each
clEnqueue call that can be queried, in order to determine the execution
status (in particular the command completion time); developers also can
maintain their own queue of outstanding requests.
AMD Southern Islands GPUs can execute multiple kernels simultaneously when
there are no dependencies.
For low-latency CPU response, it can be more efficient to use a dedicated spin
loop and not call clFinish(). Calling clFinish() indicates that the application
wants to wait for the GPU, putting the thread to sleep. For low latency, the
application should use clFlush(), followed by a loop to wait for the event to
complete. This is also true for blocking maps. The application should use non-
blocking maps followed by a loop waiting on the event. The following provides
sample code for this.
if (sleep)
{
    // this puts the host thread to sleep; useful if power is a consideration
    // or overhead is not a concern
    clFinish(cmd_queue_);
}
else
{
    // this keeps the host thread awake; useful if latency is a concern
    clFlush(cmd_queue_);
    error_ = clGetEventInfo(event, CL_EVENT_COMMAND_EXECUTION_STATUS,
                            sizeof(cl_int), &eventStatus, NULL);
    while (eventStatus > 0)
    {
        error_ = clGetEventInfo(event, CL_EVENT_COMMAND_EXECUTION_STATUS,
                                sizeof(cl_int), &eventStatus, NULL);
        // be nice to other threads; allow the scheduler to find other work if
        // possible. Choose your favorite way to yield, for example
        // SwitchToThread(), in place of Sleep(0).
        Sleep(0);
    }
}
Code optimized for the Tahiti device (the AMD Radeon™ HD 7970 GPU) typically
runs well across other members of the Southern Islands family.
CPUs and GPUs have very different performance characteristics, and some of
these impact how one writes an optimal kernel. Notable differences include:
The Vector ALU floating point resources in a CPU (SSE/AVX) require the use
of vectorized types (such as float4) to enable packed SSE code generation
and extract good performance from the Vector ALU hardware. The GPU
Vector ALU hardware is more flexible and can efficiently use the floating-
point hardware; however, code that can use float4 often generates high-quality
code for both the CPU and the AMD GPUs.
The AMD OpenCL CPU implementation runs work-items from the same
work-group back-to-back on the same physical CPU core. For optimally
coalesced memory patterns, a common access pattern for GPU-optimized
algorithms is for work-items in the same wavefront to access memory
locations from the same cache line. On a GPU, these work-items execute in
parallel and generate a coalesced access pattern. On a CPU, the first work-
item runs to completion (or until hitting a barrier) before switching to the next.
Generally, if the working set for the data used by a work-group fits in the CPU
caches, this access pattern can work efficiently: the first work-item brings a
line into the cache hierarchy, which the other work-items later hit. For large
working-sets that exceed the capacity of the cache hierarchy, this access
pattern does not work as efficiently; each work-item refetches cache lines
that were already brought in by earlier work-items but were evicted from the
cache hierarchy before being used. Note that AMD CPUs typically provide
512 kB to 2 MB of L2+L3 cache for each compute unit.
CPUs do not contain any hardware resources specifically designed to
accelerate local memory accesses. On a CPU, local memory is mapped to
the same cacheable DRAM used for global memory, and there is no
performance benefit from using the __local qualifier. The additional memory
operations to write to LDS, and the associated barrier operations can reduce
performance. One notable exception is when local memory is used to pack
values to avoid non-coalesced memory patterns.
CPU devices only support a small number of hardware threads, typically two
to eight. Small numbers of active work-group sizes reduce the CPU switching
overhead, although for larger kernels this is a second-order effect.
For a balanced solution that runs reasonably well on both devices, developers
are encouraged to write the algorithm using float4 vectorization. The GPU is
more sensitive to algorithm tuning; it also has higher peak performance potential.
Thus, one strategy is to target optimizations to the GPU and aim for reasonable
performance on the CPU. For peak performance on all devices, developers can
choose to use conditional compilation for key code loops in the kernel, or in some
cases even provide two separate kernels. Even with device-specific kernel
optimizations, the surrounding host code for allocating memory, launching
kernels, and interfacing with the rest of the program generally only needs to be
written once.
Chapter 5
OpenCL Performance and Optimiza-
tion for GCN Devices
The GPU consists of multiple compute units. Each compute unit (CU) contains
local (on-chip) memory, L1 cache, registers, and four SIMDs. Each SIMD
consists of 16 processing elements (PEs). Individual work-items execute on a
single processing element; one or more work-groups execute on a single
compute unit. On a GPU, hardware schedules groups of work-items, called
wavefronts, onto compute units; thus, work-items within a wavefront execute in
lock-step; the same instruction is executed on different data.
Since the L1 cache is 16 kB per compute unit, the total L1 cache size is
16 kB * (# of compute units). For the AMD Radeon™ HD 7970, this means a total
of 512 kB L1 cache. L1 bandwidth can be computed as:
    L1 Bandwidth = (# of compute units) x (4 address requests per cycle) x
                   (16 bytes per request) x (engine clock frequency)
If two memory access requests are directed to the same controller, the hardware
serializes the access. This is called a channel conflict. Similarly, if two memory
access requests go to the same memory bank, hardware serializes the access.
This is called a bank conflict. From a developer’s point of view, there is not much
difference between channel and bank conflicts. Often, a large power of two stride
results in a channel conflict. The size of the power of two stride that causes a
specific type of conflict depends on the chip. A stride that results in a channel
conflict on a machine with eight channels might result in a bank conflict on a
machine with four.
In this document, the term bank conflict is used to refer to either kind of conflict.
One solution is to rewrite the code to employ array transpositions between the
kernels. This allows all computations to be done at unit stride. Ensure that the
time required for the transposition is relatively small compared to the time to
perform the kernel calculation.
When the application has complete control of the access pattern and address
generation, the developer must arrange the data structures to minimize bank
conflicts. Accesses that differ in the lower bits can run in parallel; those that differ
only in the upper bits can be serialized.
For example, given a set of addresses in which the lower bits are all the same,
the memory requests all access the same bank on the same channel and are
processed serially.
Note that an increase of the address by 2048 results in a 1/3 probability the
same channel is hit; increasing the address by 256 results in a 1/6 probability
the same channel is hit, etc.
On AMD Radeon HD 78XX GPUs, the channel selection is determined by bits 10:8
of the byte address. For the AMD Radeon HD 77XX, the channel selection is
determined by bits 9:8 of the byte address. This means a linear burst switches channels every 256
bytes. Since the wavefront size is 64, channel conflicts are avoided if each work-
item in a wave reads a different address from a 64-word region. All AMD
Radeon HD 7XXX series GPUs have the same layout: channel ends at bit 8,
and the memory bank is to the left of the channel.
For AMD Radeon HD 77XX and 78XX GPUs, when calculating an address as
y*width+x, but reading a burst on a column (incrementing y), only one memory
channel of the system is used, since the width is likely a multiple of 256 words
= 2048 bytes. If the width is an odd multiple of 256B, then it cycles through all
channels.
One or more work-groups execute on each compute unit. On the AMD Radeon
HD 7000-series GPUs, work-groups are dispatched in a linear order, with x
changing most rapidly. For a single dimension, this is:
DispatchOrder = get_group_id(0)
This is row-major-ordering of the blocks in the index space. Once all compute
units are in use, additional work-groups are assigned to compute units as
needed. Work-groups retire in order, so active work-groups are contiguous.
An inefficient access pattern is if each wavefront accesses all the channels. This
is likely to happen if consecutive work-items access data that has a large power
of two strides.
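The following is an illustrative sketch of such a pattern (the kernel and the stride
parameter are assumptions, not code from this guide): with a large power-of-two
stride, consecutive work-items in a wavefront land on the same channel or bank
and the accesses serialize, whereas a unit stride spreads them across channels.

__kernel void strided_read(__global const float *in,
                           __global float *out,
                           int stride)
{
    int gid = get_global_id(0);
    // stride = 1: consecutive work-items read consecutive addresses (coalesced);
    // stride = 512 (a large power of two): consecutive work-items hit the same
    // channel/bank, and the requests are serialized.
    out[gid] = in[gid * stride];
}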
In the next example of a kernel for copying, the input and output buffers are
interpreted as though they were 2D, and the work-group size is organized as 2D.
By changing the width, the data type and the work-group dimensions, we get a
set of kernels out of this code.
Staggered offsets apply a coordinate transformation to the kernel so that the data
is processed in a different order. Unlike adding a column, this technique does not
use extra space. It is also relatively simple to add to existing code.
1. Generally, it is not a good idea to make the work-group size something other than an integer multiple
of the wavefront size, but that usually is less important than avoiding channel conflicts.
[Figure: staggered offsets. Work-groups of size k by k process a matrix of width
2N stored in row-major order. In the linear format, successive work-groups start
a power of two apart (0, 2N, 2*2N, ...); after the transform (offset format), they
start at K + 2N, 2K + 2N, ..., which are not a power of two apart.]
The global ID values reflect the order that the hardware initiates work-groups.
The values of get_group_id are in ascending launch order.
The hardware launch order is fixed, but it is possible to change the launch order,
as shown in the following example.
To transform the code, add the following four lines to the top of the kernel.
get_group_id_0 = get_group_id(0);
get_group_id_1 = (get_group_id(0) + get_group_id(1)) % get_local_size(0);
get_global_id_0 = get_group_id_0 * get_local_size(0) + get_local_id(0);
get_global_id_1 = get_group_id_1 * get_local_size(1) + get_local_id(1);
Then, change the global IDs and group IDs to the staggered form. The result is:
__kernel void
copy_float (
    __global const DATA_TYPE * A,
    __global DATA_TYPE * C)
{
    size_t get_group_id_0 = get_group_id(0);
    size_t get_group_id_1 = (get_group_id(0) + get_group_id(1)) %
                            get_local_size(0);
    size_t get_global_id_0 = get_group_id_0 * get_local_size(0) + get_local_id(0);
    size_t get_global_id_1 = get_group_id_1 * get_local_size(1) + get_local_id(1);
    // Copy one element; the buffer is viewed as 2D with row length WIDTH,
    // addressed as y*WIDTH + x (x = dimension 0, y = dimension 1).
    C[get_global_id_1 * WIDTH + get_global_id_0] =
        A[get_global_id_1 * WIDTH + get_global_id_0];
}
This does not happen on the read-only memories, such as constant buffers,
textures, or shader resource view (SRV); but it is possible on the read/write UAV
memory or OpenCL global memory.
From a hardware standpoint, reads from a fixed address have the same upper
bits, so they collide and are serialized. To read in a single value, read the value
in a single work-item, place it in local memory, and then use that location:
Avoid:
    temp = input[3];   // if input is from global space
Use:
    __local float lds_val;   // 'local' is a reserved word in OpenCL C, so use another name
    if (get_local_id(0) == 0) {
        lds_val = input[3];
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    temp = lds_val;
Each compute unit accesses the memory system in quarter-wavefront units. The
compute unit transfers a 32-bit address and one element-sized piece of data for
each work-item. This results in a total of 16 elements + 16 addresses per quarter-wavefront.
The amount of local memory available on a device can be queried with:
clGetDeviceInfo( …, CL_DEVICE_LOCAL_MEM_SIZE, … );
All AMD Southern Islands GPUs contain a 64 kB LDS for each compute unit;
although only 32 kB can be allocated per work-group. The LDS contains 32
banks; each bank is four bytes wide and 256 bytes deep; the bank address is
determined by bits 6:2 in the address. Appendix D, “Device Parameters” shows
how many LDS banks are present on the different AMD Southern Island devices.
As shown below, programmers must carefully control the bank bits to avoid bank
conflicts as much as possible. Bank conflicts are determined by what addresses
are accessed on each half wavefront boundary. Threads 0 through 31 are
checked for conflicts as are threads 32 through 63 within a wavefront.
In a single cycle, local memory can service a request for each bank (up to 32
accesses each cycle on the AMD Radeon HD 7970 GPU). For an AMD
Radeon HD 7970 GPU, this delivers a memory bandwidth of over 100 GB/s for
each compute unit, and more than 3.5 TB/s for the whole chip. This is more than
14X the global memory bandwidth. However, accesses that map to the same
bank are serialized and serviced on consecutive cycles. LDS operations do not
stall; however, the compiler inserts wait operations prior to issuing operations that
depend on the results. A wavefront that generated bank conflicts does not stall
implicitly, but may stall explicitly in the kernel if the compiler has inserted a wait
command for the outstanding memory access. The GPU reprocesses the
wavefront on subsequent cycles, enabling only the lanes receiving data, until all
the conflicting accesses complete. The bank with the most conflicting accesses
determines the latency for the wavefront to complete the local memory operation.
The worst case occurs when all 64 work-items map to the same bank, since each
access then is serviced at a rate of one per clock cycle; this case takes 64 cycles
to complete the local memory access for the wavefront. A program with a large
number of bank conflicts (as measured by the LDSBankConflict performance
counter in the AMD APP Profiler statistics) might benefit from using the constant
or image memory rather than LDS.
Thus, the key to effectively using the LDS is to control the access pattern, so that
accesses generated on the same cycle map to different banks in the LDS. One
notable exception is that accesses to the same address (even though they have
the same bits 6:2) can be broadcast to all requestors and do not generate a bank
conflict. The LDS hardware examines the requests generated over two cycles (32
work-items of execution) for bank conflicts. Ensure, as much as possible, that the
memory requests generated from a quarter-wavefront avoid bank conflicts by
using unique address bits 6:2. A simple sequential address pattern, where each
work-item reads a float2 value from LDS, generates a conflict-free access pattern
on the AMD Radeon HD 7XXX GPU. Note that a sequential access pattern,
where each work-item reads a float4 value from LDS, uses only half the banks
on each cycle on the AMD Radeon HD 7XXX GPU and delivers half the
performance of the float access pattern.
Each stream processor can generate up to two 4-byte LDS requests per cycle.
Byte and short reads consume four bytes of LDS bandwidth. Developers can use
the large register file: each compute unit has 256 kB of register space available
(8X the LDS size) and can provide up to twelve 4-byte values/cycle (6X the LDS
bandwidth). Registers do not offer the same indexing flexibility as does the LDS,
but for some algorithms this can be overcome with loop unrolling and explicit
addressing.
LDS reads require one ALU operation to initiate them. Each operation can initiate
two loads of up to four bytes each.
The AMD APP Profiler provides the LDSBankConflict performance counter to help
optimize local memory usage.
Local memory is declared with the __local qualifier; these declarations can appear
either in the parameters of the kernel or in the body of the kernel. The __local
syntax allocates a single block of memory, which is shared across all work-items
in the workgroup.
To write data into local memory, write it into an array allocated with __local. For
example:
localBuffer[i] = 5.0;
A typical access pattern is for each work-item to collaboratively write to the local
memory: each work-item writes a subsection, and as the work-items execute in
parallel they write the entire array. Combined with proper consideration for the
access pattern and bank alignment, these collaborative write approaches can
lead to highly efficient memory accessing.
The following example is a simple kernel section that collaboratively writes, then
reads from, local memory:
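A minimal sketch of this collaborative write/read pattern is shown below; the
work-group size of 64 and the names localBuffer, in, and out are illustrative
assumptions.

__kernel void lds_example(__global const float *in, __global float *out)
{
    __local float localBuffer[64];      // assumes a work-group of 64 work-items
    uint lid = get_local_id(0);
    uint gid = get_global_id(0);

    localBuffer[lid] = in[gid];         // collaborative write: one element per work-item
    barrier(CLK_LOCAL_MEM_FENCE);       // ensure the whole array is written

    // read an element written by a neighboring work-item
    out[gid] = localBuffer[(lid + 1) % 64];
}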
Note the host code cannot read from, or write to, local memory. Only the kernel
can access local memory.
The hardware can encode the following literal constants inline, without increasing
code size:
– 0
– integers 1 .. 64
– integers -1 .. -16
– 0.5, -0.5, 1.0, -1.0, 2.0, -2.0, 4.0, and -4.0 (single or double floats)
Any other literal constant increases the code size by at least 32 bits.
The AMD implementation of OpenCL provides three levels of performance for the
“constant” memory type.
3. Varying Index
More sophisticated addressing patterns, including the case where each work-
item accesses different indices, are not hardware accelerated and deliver the
same performance as a global memory read with the potential for cache hits.
To further improve the performance of the AMD OpenCL stack, two methods
allow users to take advantage of hardware constant buffers. These are:
The compiler tries to map private memory allocations to the pool of GPRs in the
GPU. In the event GPRs are not available, private memory is mapped to the
“scratch” region, which has the same performance as global memory.
Section 5.6.2, “Resource Limits on Active Wavefronts,” page 5-18, has more
information on register allocation and identifying when the compiler uses the
scratch region. GPRs provide the highest-bandwidth access of any hardware
resource. In addition to reading up to 12 bytes/cycle per processing element from
the register file, the hardware can access results produced in the previous cycle
without consuming any register file bandwidth.
Varying-indexed constants, which are cached only in L2, use the same path as
global memory access and are subject to the same bank and alignment
constraints described in Section 5.1, “Global Memory Optimization,” page 5-1.
The L1 and L2 read/write caches are constantly enabled. As of SDK 2.4, read
only buffers can be cached in L1 and L2.
The L1 cache can service up to four address requests per cycle, each delivering
up to 16 bytes. The bandwidth shown assumes an access size of 16 bytes;
smaller access sizes/requests result in a lower peak bandwidth for the L1 cache.
Using float4 with images increases the request size and can deliver higher L1
cache bandwidth.
Each memory channel on the GPU contains an L2 cache that can deliver up to
64 bytes/cycle. The AMD Radeon HD 7970 GPU has 12 memory channels;
thus, it can deliver up to 768 bytes/cycle; divided among 2048 stream cores, this
provides up to ~0.4 bytes/cycle for each stream core.
Global Memory bandwidth is limited by external pins, not internal bus bandwidth.
The AMD Radeon HD 7970 GPU supports up to 264 GB/s of memory
bandwidth, which is an average of 0.14 bytes/cycle for each stream core.
Note that Table 5.1 shows the performance for the AMD Radeon HD 7970
GPU. The “Size/Compute Unit” column and many of the bandwidths/processing
element apply to all Southern Islands-class GPUs; however, the “Size/GPU”
column and the bandwidths for varying-indexed constant, L2, and global memory
vary across different GPU devices. The resource capacities and peak bandwidth
for other AMD GPU devices can be found in Appendix D, “Device Parameters.”
The native data type for L1 is a four-vector of 32-bit words. On L1, fill and read
addressing are linked. It is important that L1 is initially filled from global memory
with a coalesced access pattern; once filled, random accesses come at no extra
processing cost.
Currently, the native format of LDS is a 32-bit word. The theoretical LDS peak
bandwidth is achieved when each thread operates on a two-vector of 32-bit
words (16 threads per clock operate on 32 banks). If an algorithm requires
coalesced 32-bit quantities, it maps well to LDS. The use of four-vectors or larger
can lead to bank conflicts, although the compiler can mitigate some of these.
From an application point of view, filling LDS from global memory, and reading
from it, are independent operations that can use independent addressing. Thus,
LDS can be used to explicitly convert a scattered access pattern to a coalesced
pattern for read and write to global memory. Or, by taking advantage of the LDS
read broadcast feature, LDS can be filled with a coalesced pattern from global
memory, followed by all threads iterating through the same LDS words
simultaneously.
LDS reuses the data already pulled into cache by other wavefronts. Sharing
across work-groups is not possible because OpenCL does not guarantee that
LDS is in a particular state at the beginning of work-group execution. L1 content,
on the other hand, is independent of work-group execution, so that successive
work-groups can share the content in the L1 cache of a given Vector ALU.
However, it currently is not possible to explicitly control L1 sharing across work-
groups.
The use of LDS is linked to GPR usage and wavefront-per-Vector ALU count.
Better sharing efficiency requires a larger work-group, so that more work-items
share the same LDS. Compiling kernels for larger work-groups typically results
in increased register use, so that fewer wavefronts can be scheduled
simultaneously per Vector ALU. This, in turn, reduces memory latency hiding.
Requesting larger amounts of LDS per work-group results in fewer wavefronts
per Vector ALU, with the same effect.
LDS typically involves the use of barriers, with a potential performance impact.
This is true even for read-only use cases, as LDS must be explicitly filled in from
global memory (after which a barrier is required before reads can commence).
Kernel execution time also plays a role in hiding memory latency: longer chains
of ALU instructions keep the functional units busy and effectively hide more
latency. To better understand this concept, consider a global memory access
which takes 400 cycles to execute. Assume the compute unit contains many
other wavefronts, each of which performs five ALU instructions before generating
another global memory reference. As discussed previously, the hardware
executes each instruction in the wavefront in four cycles; thus, all five instructions
occupy the ALU for 20 cycles. Note the compute unit interleaves two of these
wavefronts and executes the five instructions from both wavefronts (10 total
instructions) in 40 cycles. To fully hide the 400 cycles of latency, the compute
unit requires (400/40) = 10 pairs of wavefronts, or 20 total wavefronts. If the
wavefront contains 10 instructions rather than 5, the wavefront pair would
consume 80 cycles of latency, and only 10 wavefronts would be required to hide
the 400 cycles of latency.
Generally, it is not possible to predict how the compute unit schedules the
available wavefronts, and thus it is not useful to try to predict exactly which ALU
block executes when trying to hide latency. Instead, consider the overall ratio of
ALU operations to fetch operations – this metric is reported by the AMD APP
Profiler in the ALUFetchRatio counter. Each ALU operation keeps the compute
unit busy for four cycles, so you can roughly divide 500 cycles of latency by
(4*ALUFetchRatio) to determine how many wavefronts must be in-flight to hide
that latency. Additionally, a low value for the ALUBusy performance counter can
indicate that the compute unit is not providing enough wavefronts to keep the
execution resources in full use. (This counter also can be low if the kernel
exhausts the available DRAM bandwidth. In this case, generating more
wavefronts does not improve performance; it can reduce performance by creating
more contention.)
These limits are largely properties of the hardware and, thus, difficult for
developers to control directly. Fortunately, these are relatively generous limits.
Frequently, the register and LDS usage in the kernel determines the limit on the
number of active wavefronts/compute unit, and these can be controlled by the
developer.
Southern Islands registers are scalar, so each is 32-bits. Each wavefront can
have at most 256 registers (VGPRs). To compute the number of wavefronts per
CU, take (256/# registers)*4.
For example, a kernel that uses 120 registers (120x32-bit values) can run with
eight active wavefronts on each compute unit. Because of the global limits
described earlier, each compute unit is limited to 40 wavefronts; thus, kernels can
use up to 25 registers (25x32-bit values) without affecting the number of
wavefronts/compute unit.
The AMD APP Profiler displays the number of GPRs used by the kernel.
Alternatively, the AMD APP Profiler generates the ISA dump (described in
Section 4.3, “Analyzing Processor Kernels,” page 4-9), which then can be
searched for the string :NUM_GPRS.
The AMD APP KernelAnalyzer also shows the GPRs used by the kernel,
across a wide variety of GPU compilation targets.
The compiler generates spill code (shuffling values to, and from, memory) if it
cannot fit all the live values into registers. Spill code uses long-latency global
memory and can have a large impact on performance. Spilled registers can be
cached on Southern Islands devices, thus reducing the impact on performance.
The AMD APP Profiler reports the static number of register spills in the
ScratchReg field. Generally, it is a good idea to re-write the algorithm to use
fewer GPRs, or tune the work-group dimensions specified at launch time to
expose more registers/kernel to the compiler, in order to reduce the scratch
register usage to 0.
When the work-group size is known at compile time, specify it to the compiler with
the kernel attribute:
__attribute__((reqd_work_group_size(X, Y, Z)))
Section 6.7.2 of the OpenCL specification explains the attribute in more detail.
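As a brief sketch (the kernel name and dimensions are illustrative), declaring the
launch dimensions lets the compiler allocate registers for the actual work-group
size rather than for the 256-item default:

__attribute__((reqd_work_group_size(64, 1, 1)))
__kernel void my_kernel(__global float *data)
{
    // must be launched with a local work size of exactly 64x1x1
    data[get_global_id(0)] *= 2.0f;
}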
In addition to registers, shared memory can also serve to limit the active
wavefronts/compute unit. Each compute unit has 64 kB of LDS, which is shared
among all active work-groups. Note that the maximum allocation size is 32 kB.
LDS is allocated on a per-work-group granularity, so it is possible (and useful)
for multiple wavefronts to share the same local memory allocation. However,
large LDS allocations eventually limit the number of work-groups that can be
active. Table 5.2 provides more details about how LDS usage can impact the
wavefronts/compute unit.
LDS / Work-Group   Wavefronts/CU      Wavefronts/CU      Wavefronts/CU      Wavefronts/CU
                   (4 WF/Work-Group)1 (3 WF/Work-Group)  (2 WF/Work-Group)  (1 WF/Work-Group)
<=4K               40                 40                 32                 16
4.0K-4.2K          40                 40                 30                 15
4.2K-4.5K          40                 40                 28                 14
4.5K-4.9K          40                 39                 26                 13
4.9K-5.3K          40                 36                 24                 12
5.3K-5.8K          40                 33                 22                 11
5.8K-6.4K          40                 30                 20                 10
6.4K-7.1K          36                 27                 18                 9
7.1K-8.0K          32                 24                 16                 8
8.0K-9.1K          28                 21                 14                 7
9.1K-10.6K         24                 18                 12                 6
10.6K-12.8K        20                 15                 10                 5
12.8K-16.0K        16                 12                 8                  4
16.0K-21.3K        12                 9                  6                  3
21.3K-32.0K        8                  6                  4                  2
1. Four wavefronts per work-group (a 256-work-item work-group) is the maximum
supported by the AMD OpenCL SDK.
AMD provides the following tools to examine the amount of LDS used by the
kernel:
The AMD APP Profiler displays the LDS usage. See the LocalMem counter.
Alternatively, use the AMD APP Profiler to generate the ISA dump (described
in Section 4.3, “Analyzing Processor Kernels,” page 4-9), then search for the
string SQ_LDS_ALLOC:SIZE in the ISA dump. Note that the value is shown in
hexadecimal format.
OpenCL does not explicitly limit the number of work-groups that can be submitted
with a clEnqueueNDRangeKernel command. The hardware limits the available in-
flight threads, but the OpenCL SDK automatically partitions a large number of
work-groups into smaller pieces that the hardware can process. For some large
workloads, the amount of memory available to the GPU can be a limitation; the
problem might require so much memory capacity that the GPU cannot hold it all.
In these cases, the programmer must partition the workload into multiple
clEnqueueNDRangeKernel commands. The available device memory can be
obtained by querying clGetDeviceInfo.
OpenCL limits the number of work-items in each work-group. Call clGetDeviceInfo with
CL_DEVICE_MAX_WORK_GROUP_SIZE to determine the maximum number of
work-items per work-group supported by the hardware. Currently, AMD GPUs with SDK 2.1
return 256 as the maximum number of work-items per work-group. Note that the
number of work-items is the product of all work-group dimensions; for example,
a work-group with dimensions 32x16 requires 512 work-items, which is not
allowed with the current AMD OpenCL SDK.
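For reference, a minimal host-side sketch of this query (in C; error handling is omitted,
and device is assumed to be a valid cl_device_id obtained elsewhere) looks like this:

#include <CL/cl.h>
#include <stdio.h>

void print_max_work_group_size(cl_device_id device)
{
    size_t max_wg_size = 0;
    /* CL_DEVICE_MAX_WORK_GROUP_SIZE returns a size_t. */
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(max_wg_size), &max_wg_size, NULL);
    printf("Maximum work-items per work-group: %zu\n", max_wg_size);
}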
Work-items in the same work-group can share data through LDS memory and
also use high-speed local atomic operations. Thus, larger work-groups enable
more work-items to efficiently share data, which can reduce the amount of slower
global communication. However, larger work-groups reduce the number of global
work-groups, which, for small workloads, could result in idle compute units.
Generally, larger work-groups are better as long as the global range is big
enough to provide 1-2 Work-Groups for each compute unit in the system; for
small workloads it generally works best to reduce the work-group size in order to
avoid idle compute units. Note that it is possible to make the decision
dynamically, when the kernel is launched, based on the launch dimensions and
the target device characteristics.
The local NDRange can contain up to three dimensions, here labeled X, Y, and
Z. The X dimension is returned by get_local_id(0), Y is returned by
get_local_id(1), and Z is returned by get_local_id(2). The GPU hardware
schedules the kernels so that the X dimension moves fastest as the work-items
are packed into wavefronts. For example, the 128 threads in a 2D work-group of
dimension 32x4 (X=32 and Y=4) are packed into two wavefronts as follows
(notation shown in X,Y order).
WaveFront0:
0,0 1,0 2,0 3,0 4,0 5,0 6,0 7,0 8,0 9,0 10,0 11,0 12,0 13,0 14,0 15,0
16,0 17,0 18,0 19,0 20,0 21,0 22,0 23,0 24,0 25,0 26,0 27,0 28,0 29,0 30,0 31,0
0,1 1,1 2,1 3,1 4,1 5,1 6,1 7,1 8,1 9,1 10,1 11,1 12,1 13,1 14,1 15,1
16,1 17,1 18,1 19,1 20,1 21,1 22,1 23,1 24,1 25,1 26,1 27,1 28,1 29,1 30,1 31,1

WaveFront1:
0,2 1,2 2,2 3,2 4,2 5,2 6,2 7,2 8,2 9,2 10,2 11,2 12,2 13,2 14,2 15,2
16,2 17,2 18,2 19,2 20,2 21,2 22,2 23,2 24,2 25,2 26,2 27,2 28,2 29,2 30,2 31,2
0,3 1,3 2,3 3,3 4,3 5,3 6,3 7,3 8,3 9,3 10,3 11,3 12,3 13,3 14,3 15,3
16,3 17,3 18,3 19,3 20,3 21,3 22,3 23,3 24,3 25,3 26,3 27,3 28,3 29,3 30,3 31,3
The total number of work-items in the work-group is typically the most important
parameter to consider, in particular when optimizing to hide latency by increasing
wavefronts/compute unit. However, the choice of XYZ dimensions for the same
overall work-group size can have the following second-order effects.
Work-items in the same wavefront have the same program counter and
execute the same instruction on each cycle. The packing order can be
important if the kernel contains divergent branches. If possible, pack together
work-items that are likely to follow the same direction when control-flow is
encountered. For example, consider an image-processing kernel where each
work-item processes one pixel, and the control-flow depends on the color of
the pixel. It might be more likely that a square of 8x8 pixels is the same color
than a 64x1 strip; thus, the 8x8 would see less divergence and higher
performance.
When in doubt, a square 16x16 work-group size is a good start.
Select the work-group size to be a multiple of 64, so that the wavefronts are
fully populated.
Schedule at least four wavefronts per compute unit.
Latency hiding depends on both the number of wavefronts/compute unit, as
well as the execution time for each kernel. Generally, 8 to 32
wavefronts/compute unit is desirable, but this can vary significantly,
depending on the complexity of the kernel and the available memory
bandwidth. The AMD APP Profiler and associated performance counters can
help to select an optimal value.
Generally, the throughput and latency for 32-bit integer operations are the same
as for single-precision floating-point operations.
24-bit integer MULs and MADs have four times the throughput of 32-bit integer
multiplies. 24-bit signed and unsigned integers are natively supported on the
Southern Islands family of devices. The use of OpenCL built-in functions for
mul24 and mad24 is encouraged. Note that mul24 can be useful for array indexing
operations.
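For illustration, a small sketch of 24-bit indexing (the kernel and parameter names are
hypothetical; this is valid only when the operands are known to fit in 24 bits):

__kernel void copy_2d_indexed(__global const float *in,
                              __global float *out,
                              int width)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    // y*width + x computed on the faster 24-bit integer path.
    int idx = mad24(y, width, x);
    out[idx] = in[idx];
}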
Packed 16-bit and 8-bit operations are not natively supported; however, in cases
where it is known that no overflow will occur, some algorithms may be able to
effectively pack 2 to 4 values into the 32-bit registers natively supported by the
hardware.
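A minimal sketch of such packing (hypothetical kernel; assumes the two 16-bit inputs
cannot overflow):

__kernel void pack_two_shorts(__global const ushort *a,
                              __global const ushort *b,
                              __global uint *packed)
{
    size_t gid = get_global_id(0);
    // Two 16-bit values share one 32-bit register/memory word.
    packed[gid] = (uint)a[gid] | ((uint)b[gid] << 16);
}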
Table 5.3 shows the throughput for each stream processing core. To obtain the
peak throughput for the whole device, multiply the number of stream cores by the
engine clock (see Appendix D, “Device Parameters”). For example, according
to Table 5.3, a Tahiti device can perform one double-precision ADD
operation every two cycles in each stream core. An AMD Radeon HD 7970 GPU has
2048 stream cores and an engine clock of 925 MHz, so the entire GPU has a
throughput rate of (0.5 * 2048 * 925 MHz) = 947 GFLOPS for double-precision adds.
No unrolling example:
#pragma unroll 1
for (int i = 0; i < n; i++) {
...
}
Partial unrolling example:
#pragma unroll 4
for (int i = 0; i < 128; i++) {
...
}
Currently, the unroll pragma requires that both loop bounds be known at compile
time. If the unroll factor is not specified and the compiler can determine the loop
count, the compiler fully unrolls the loop (this is equivalent to specifying an unroll
factor equal to the number of iterations). If the unroll factor is not specified and the
compiler cannot determine the loop count, the compiler does no unrolling.
Linear – A linear layout format arranges the data linearly in memory such that
element addresses are sequential. This is the layout that is familiar to CPU
programmers. This format must be used for OpenCL buffers; it can be used
for images.
Tiled – A tiled layout format has a pre-defined sequence of element blocks
arranged in sequential memory addresses (see Figure 5.4 for a conceptual
illustration). A microtile consists of ABIJ; a macrotile consists of the top-left
16 squares for which the arrows are red. Only images can use this format.
Translating from user address space to the tiled arrangement is transparent
to the user. Tiled memory layouts provide an optimized memory access
pattern to make more efficient use of the RAM attached to the GPU compute
device. This can contribute to lower latency.
(Figure 5.4: one example of a tiled layout format, showing elements A through X both
in sequential order (A B C D E F G H / I J K L M N O P / Q R S T U V W X) and in the
tiled order (A B C D I J K L / Q R S T E F G H / M N O P U V W X).)
Memory access patterns in compute kernels are usually different from those in
the pixel shaders. Whereas the access pattern for pixel shaders is in a
hierarchical, space-filling curve pattern and is tuned for tiled memory
performance (generally for textures), the access pattern for a compute kernel is
linear across each row before moving to the next row in the global id space. This
has an effect on performance, since pixel shaders have implicit blocking, and
compute kernels do not. If accessing a tiled image, best performance is achieved
if the application tries to use workgroups with 16x16 (or 8x8) work-items.
if (A > B) {   // illustrative condition
C += D;
} else {
C -= D;
}
__kernel void
dynamic_ptr(__global int* a, __global int* b, __global int* c)   // illustrative signature
{
global int* d;          // base pointer is chosen at run time
size_t idx = get_global_id(0);
if (idx & 1) {
d = b;
} else {
d = c;
}
a[idx] = d[idx];
}
This is inefficient because the GPU compiler must know the base pointer that
every load comes from, and in this situation the compiler cannot determine
what ‘d’ points to. So, both b and c are assigned to the same GPU resource,
removing the ability to do certain optimizations.
If the algorithm allows changing the work-group size, it is possible to get
better performance by using larger work-groups (more work-items in each
work-group) because the workgroup creation overhead is reduced. On the
other hand, the OpenCL CPU runtime uses a task-stealing algorithm at the
work-group level, so when the kernel execution time differs because it
contains conditions and/or loops of varying number of iterations, it might be
better to increase the number of work-groups. This gives the runtime more
flexibility in scheduling work-groups to idle CPU cores. Experimentation
might be needed to reach optimal work-group size.
Since the AMD OpenCL runtime supports only in-order queuing, using
clFinish() on a queue and queuing a blocking command gives the same
result. The latter saves the overhead of another API command.
For example:
clEnqueueWriteBuffer(myCQ, buff, CL_FALSE, 0, buffSize, input, 0, NULL,
NULL);
clFinish(myCQ);
is equivalent, for the AMD OpenCL runtime, to:
clEnqueueWriteBuffer(myCQ, buff, CL_TRUE, 0, buffSize, input, 0, NULL,
NULL);
Study the local memory (LDS) optimizations. These greatly affect the GPU
performance. Note the difference in the organization of local memory on the
GPU as compared to the CPU cache. Local memory is shared by many
work-items (64 on Tahiti). This contrasts with a CPU cache that normally is
dedicated to a single work-item. GPU kernels run well when they
collaboratively load the shared memory.
GPUs have a large amount of raw compute horsepower, compared to
memory bandwidth and to “control flow” bandwidth. This leads to some high-
level differences in GPU programming strategy.
– A CPU-optimized algorithm may test branching conditions to minimize
the workload. On a GPU, it is frequently faster simply to execute the
workload.
– A CPU-optimized version can use memory to store and later load pre-
computed values. On a GPU, it frequently is faster to recompute values
rather than saving them in registers. Per-thread registers are a scarce
resource on the CPU; in contrast, GPUs have many available per-thread
register resources.
Use float4 and the OpenCL built-ins for vector types (vload, vstore, etc.).
These enable the AMD Accelerated Parallel Processing OpenCL
implementation to generate efficient, packed SSE instructions when running
on the CPU. Vectorization is an optimization that benefits both the AMD CPU
and GPU.
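As a small sketch (the kernel name and scale factor are illustrative), float4 with the
vector load/store built-ins looks like this:

__kernel void scale4(__global const float *in, __global float *out, float k)
{
    size_t gid = get_global_id(0);
    float4 v = vload4(gid, in);   // reads elements 4*gid .. 4*gid+3
    vstore4(k * v, gid, out);     // writes them back as one packed store
}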
The CPU contains a vector unit, which can be efficiently used if the developer is
writing the code using vector data types.
For architectures before Bulldozer, the instruction set is called SSE, and the
vector width is 128 bits. For Bulldozer, the instruction set is called AVX, for
which the vector width is increased to 256 bits.
Using four-wide vector types (int4, float4, etc.) is preferred, even with Bulldozer.
The CPU does not benefit much from local memory; sometimes it is detrimental
to performance. As local memory is emulated on the CPU by using the caches,
accessing local memory and global memory are the same speed, assuming the
information from the global memory is in the cache.
There is also hardware support for OpenCL built-in functions, such as rotate(), that
map to new hardware instructions. Similarly, a chain of multiply-accumulate
statements can be written as a composition of mad instructions, which use fused
multiply-add (FMA); a sketch follows.
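This sketch is illustrative only; the kernel and variable names are not taken from the
original listing:

__kernel void dot4_mad(__global const float4 *A,
                       __global const float4 *B,
                       __global float *out)
{
    size_t gid = get_global_id(0);
    float4 a = A[gid];
    float4 b = B[gid];
    float sum = 0.0f;
    // Each mad() maps to a fused multiply-add where the hardware supports it.
    sum = mad(a.x, b.x, sum);
    sum = mad(a.y, b.y, sum);
    sum = mad(a.z, b.z, sum);
    sum = mad(a.w, b.w, sum);
    out[gid] = sum;
}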
Simple conditional assignments such as:

if(x==1) r=0.5;
if(x==2) r=1.0;

can be rewritten as branch-free conditional assignments (one possible form is
sketched below). Note that if the body of the if statement contains an I/O operation,
the if statement cannot be eliminated.
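One possible branch-free form of the example above, sketched as a complete kernel
(the ternary operator is used here; the select() built-in is an equivalent alternative,
and the buffer names are illustrative):

__kernel void no_branch(__global const int *X, __global float *R)
{
    size_t gid = get_global_id(0);
    int   x = X[gid];
    float r = R[gid];
    // Conditional assignments compile to conditional moves, not branches.
    r = (x == 1) ? 0.5f : r;
    r = (x == 2) ? 1.0f : r;
    R[gid] = r;
}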
A conditional expression with many terms can compile into nested conditional
code due to the C-language requirement that expressions must short circuit. To
prevent this, move the expression out of the control flow statement. For example:
if(a&&b&&c&&d){…}

becomes

bool cond = a&&b&&c&&d;
if(cond){…}
The same applies to conditional expressions used in loop constructs (do, while,
for).
If the loop bounds are known, and the loop is small (less than 16 or 32
instructions), unrolling the loop usually increases performance.
for loops can generate more conditional code than equivalent do or while loops.
Experiment with these different loop types to find the one with best performance.
SI compute units are much different than those of previous chips. With previous
generations, a compute unit (Vector ALU) was VLIW in nature, so four (Cayman
GPUs) or five (all other Evergreen/Northern Islands GPUs) instructions could be
packed into a single ALU instruction slot (called a bundle). It was not always easy
to schedule instructions to fill all of these slots, so achieving peak ALU utilization
was a challenge.
With SI GPUs, the compute units are now scalar; however, there now are four
Vector ALUs per compute unit. Each Vector ALU requires at least one wavefront
scheduled to it to achieve peak ALU utilization.
Along with the four Vector ALUs within a compute unit, there is also a scalar unit.
The scalar unit is used to handle branching instructions, constant cache
accesses, and other operations that occur per wavefront. The advantage to
having a scalar unit for each compute unit is that there are no longer large
penalties for branching, aside from thread divergence.
The instruction set for SI is scalar, as are GPRs. Also, the instruction set is no
longer clause-based. There are two types of GPRs: scalar GPRs (SGPRs) and
vector GPRs (VGPRs). Each Vector ALU has its own SGPR and VGPR pool.
There are 512 SGPRs and 256 VGPRs per Vector ALU. VGPRs handle all vector
instructions (any instruction that is handled per thread, such as v_add_f32, a
floating point add). SGPRs are used for scalar instructions: any instruction that
is executed once per wavefront, such as a branch, a scalar ALU instruction, and
constant cache fetches. (SGPRs are also used for constants, all buffer/texture
definitions, and sampler definitions; some kernel arguments are stored, at least
temporarily, in SGPRs.) SGPR allocation is in increments of eight, and VGPR
allocation is in increments of four. These increments also represent the minimum
allocation size of these resources.
Typical vector instructions execute in four cycles; typical scalar ALU instructions
in one cycle. This allows each compute unit to execute one Vector ALU and one
scalar ALU instruction every four clocks (each compute unit is offset by one cycle
from the previous one).
All Southern Islands GPUs have double-precision support. For Tahiti (AMD
Radeon HD 79XX series), double precision adds run at one-half the single
precision add rate. Double-precision multiplies and MAD instructions run at one-
quarter the floating-point rate.
The double-precision rate of Pitcairn (AMD Radeon HD 78XX series) and Cape
Verde (AMD Radeon HD 77XX series) is one quarter that of Tahiti. This also
affects the performance of single-precision fused multiply-add (FMA).
L1 cache is still shared within a compute unit. The size has now increased to
16 kB per compute unit for all SI GPUs. The caches now are read/write, so
sharing data between work-items in a work-group (for example, when LDS does
not suffice) is much faster.
Since there are no more clauses in the SI instruction set architecture (ISA), the
compiler inserts “wait” commands to indicate that the compute unit needs the
results of a memory operation before proceeding. If the scalar unit determines
that a wait is required (the data is not yet ready), the Vector ALU can switch to
another wavefront. There are different types of wait commands, depending on
the memory access.
Notes –
If the output data is the same each time, then this is a false dependency because
there is no reason to stall concurrent execution of dispatches. To avoid stalls,
use multiple output buffers. The number of buffers required to get peak
performance depends on the kernel.
Table 5.4 compares the resource limits for Northern Islands and Southern Islands
GPUs.
Table 5.4 Resource Limits for Northern Islands and Southern Islands
                     VLIW Width   VGPRs            SGPRs   LDS Size   LDS Max Alloc   L1$/CU   L2$/Channel
Northern Islands     4            256 (128-bit)    -       32 kB      32 kB           8 kB     64 kB
Southern Islands     1            256 (32-bit)     512     64 kB      32 kB           16 kB    64 kB
Table 5.5 provides a simplified picture showing the Northern Islands compute unit
arrangement: the X, Y, Z, and W VLIW ALUs, the texture unit, and the LDS.
Table 5.6 provides a simplified picture showing the Southern Islands compute unit
arrangement.
Chapter 6
OpenCL Performance and
Optimization for Evergreen and
Northern Islands Devices
This chapter discusses performance and optimization when programming for
AMD Accelerated Parallel Processing GPU compute devices that are part of the
Evergreen and Northern Islands families, as well as CPUs and multiple devices.
Details specific to the Southern Islands family of GPUs are provided in
Chapter 5, “OpenCL Performance and Optimization for GCN Devices.”
The GPU consists of multiple compute units. Each compute unit contains 32 kB
local (on-chip) memory, L1 cache, registers, and 16 processing elements (PEs).
Each processing element contains a five-way (or four-way, depending on the
GPU type) VLIW processor. Individual work-items execute on a single processing
element; one or more work-groups execute on a single compute unit. On a GPU,
hardware schedules the work-items. On the ATI Radeon™ HD 5000 series of
GPUs, hardware schedules groups of work-items, called wavefronts, onto stream
cores; thus, work-items within a wavefront execute in lock-step; the same
instruction is executed on different data.
The L1 cache is 8 kB per compute unit. (For the ATI Radeon™ HD 5870 GPU,
this means 160 kB for the 20 compute units.) The L1 cache bandwidth on the
ATI Radeon™ HD 5870 GPU is approximately one terabyte per second:

L1 cache bandwidth = compute units * (4 fetches/cycle * 16 bytes/fetch) * engine clock
                   = 20 * 64 bytes * 850 MHz = 1088 GB/s

Multiple compute units share L2 caches. The L2 cache size on the ATI Radeon™
HD 5870 GPU is 512 kB:

L2 cache size = memory channels * 64 kB/channel = 8 * 64 kB = 512 kB

The bandwidth between L1 caches and the shared L2 cache is 435 GB/s:

L2-to-L1 bandwidth = memory channels * 64 bytes/cycle * engine clock
                   = 8 * 64 bytes * 850 MHz = 435 GB/s
(Figure 6.1: simplified memory hierarchy. Each compute unit (CU) contains 16
processing elements, an LDS, and an L1 cache; groups of CUs share write caches (WC)
and per-channel L2 caches, and each memory channel provides a FastPath and a
CompletePath (atomics) to memory.)
The ATI Radeon™ HD 5870 GPU has eight memory controllers (“Memory
Channel” in Figure 6.1). The memory controllers are connected to multiple banks
of memory. The memory is GDDR5, with a clock speed of 1200 MHz and a data
rate of 4800 Mb/pin. Each channel is 32 bits wide, so the peak bandwidth for the
ATI Radeon™ HD 5870 GPU is:

(8 memory channels) * (32 bits/channel) * (4800 Mb/s per pin) / (8 bits/byte) = 153.6 GB/s
If two memory access requests are directed to the same controller, the hardware
serializes the access. This is called a channel conflict. Similarly, if two memory
access requests go to the same memory bank, hardware serializes the access.
This is called a bank conflict. From a developer’s point of view, there is not much
difference between channel and bank conflicts. A large power of two stride
results in a channel conflict; a larger power of two stride results in a bank conflict.
The size of the power of two stride that causes a specific type of conflict depends
on the chip. A stride that results in a channel conflict on a machine with eight
channels might result in a bank conflict on a machine with four.
In this document, the term bank conflict is used to refer to either kind of conflict.
FastPath performs only basic operations, such as loads and stores (data
sizes must be a multiple of 32 bits). This often is faster and preferred when
there are no advanced operations.
CompletePath supports additional advanced operations, including atomics
and sub-32-bit (byte/short) data transfers.
(Figure: measured bandwidth, in MB/s, of the FastPath copy versus the CompletePath
copy across transfer sizes.)
The kernel code follows. Note that the atomic extension must be enabled under
OpenCL 1.0.
__kernel void
CopyFastPath(__global const float * input,
__global float * output)
{
int gid = get_global_id(0);
output[gid] = input[gid];
return ;
}
__kernel void
CopyComplete(__global const float * input, __global float* output)
{
int gid = get_global_id(0);
if (gid <0){
atom_add((__global int *) output,1);
}
output[gid] = input[gid];
return ;
}
Table 6.1 lists the effective bandwidth and ratio to maximum bandwidth.
Since the path selection is done automatically by the OpenCL compiler, your
kernel may be assigned to CompletePath. This section explains the strategy the
compiler uses, and how to find out what path was used.
The compiler is conservative when it selects memory paths. The compiler often
maps all user data into a single unordered access view (UAV),1 so a single
atomic operation (even one that is not executed) may force all loads and stores
to use CompletePath.
The effective bandwidth listing above shows two OpenCL kernels and the
associated performance. The first kernel uses the FastPath while the second
uses the CompletePath. The second kernel is forced to CompletePath because
in CopyComplete, the compiler noticed the use of an atomic.
There are two ways to find out which path is used. The first method uses the
AMD APP Profiler, which provides the following three performance counters for
this purpose:
1. UAVs allow compute shaders to store results in (or write results to) a buffer at any arbitrary location.
On DX11 hardware, UAVs can be created from buffers and textures. On DX10 hardware, UAVs can-
not be created from typed resources (textures). This is the same as a random access target (RAT).
1. FastPath counter: The total bytes written through the FastPath (no atomics,
32-bit types only).
2. CompletePath counter: The total bytes read and written through the
CompletePath (supports atomics and non-32-bit types).
3. PathUtilization counter: The percentage of bytes read and written through the
FastPath or CompletePath compared to the total number of bytes transferred
over the bus.
The second method is static and lets you determine the path by looking at a
machine-level ISA listing (using the AMD APP KernelAnalyzer in OpenCL).
...
TEX: ...
... VFETCH ...
... MEM_RAT_CACHELESS_STORE_RAW: ...
...
The vfetch instruction is a load type that, in graphics terms, is called a vertex
fetch (the group control TEX indicates that the load uses the L1 cache).
.. MEM_RAT_NOP_RTN_ACK: RAT(1)
.. WAIT_ACK: Outstanding_acks <= 0
.. TEX: ADDR(64) CNT(1)
.. VFETCH ...
.. MEM_RAT_STORE_RAW: RAT(1)
One solution is to rewrite the code to employ array transpositions between the
kernels. This allows all computations to be done at unit stride. Ensure that the
time required for the transposition is relatively small compared to the time to
perform the kernel calculation.
When the application has complete control of the access pattern and address
generation, the developer must arrange the data structures to minimize bank
conflicts. Accesses that differ in the lower bits can run in parallel; those that differ
only in the upper bits can be serialized.
For example, when consecutive work-items access data separated by a large
power-of-two stride (a sketch follows), the lower bits of every address are the same,
so the memory requests all access the same bank on the same channel and are
processed serially.
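A minimal sketch of such a conflicting pattern (the kernel is hypothetical; a 32 kB
stride, 8192 floats, keeps the low-order address bits identical on this GPU, so every
request lands on one bank of one channel, following the bit layout described below):

__kernel void strided_read(__global const float *in, __global float *out)
{
    size_t gid = get_global_id(0);
    // Consecutive work-items read addresses 32 kB apart: same channel, same bank.
    out[gid] = in[gid * 8192];
}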
On all ATI Radeon HD 5000-series GPUs, the lower eight bits select an
element within a channel.
The next set of bits select the channel. The number of channel bits varies,
since the number of channels is not the same on all parts. With eight
channels, three bits are used to select the channel; with two channels, a
single bit is used.
The next set of bits selects the memory bank. The number of bits used
depends on the number of memory banks.
The remaining bits are the rest of the address.
On the ATI Radeon HD 5870 GPU, the channel selection bits are bits 10:8 of the
byte address. This means a linear burst switches channels every 256 bytes.
Since the wavefront size is 64, channel conflicts are avoided if each work-item
in a wave reads a different address from a 64-word region. All ATI Radeon HD
5000 series GPUs have the same layout: the channel selection starts at bit 8, and
the memory bank selection bits sit above (to the left of) the channel bits.
Similarly, the bank selection bits on the ATI Radeon HD 5870 GPU are bits
14:11, so the bank switches every 2 kB. A linear burst of 32 kB cycles through
all banks and channels of the system. If accessing a 2D surface along a column,
with a y*width+x calculation, and the width is some multiple of 2 kB dwords (32
kB), then only 1 bank and 1 channel are accessed of the 16 banks and 8
channels available on this GPU.
All ATI Radeon HD 5000-series GPUs have an interleave of 256 bytes (64
dwords).
One or more work-groups execute on each compute unit. On the ATI Radeon
HD 5000-series GPUs, work-groups are dispatched in a linear order, with x
changing most rapidly. For a single dimension, this is:
DispatchOrder = get_group_id(0)
This is row-major-ordering of the blocks in the index space. Once all compute
units are in use, additional work-groups are assigned to compute units as
needed. Work-groups retire in order, so active work-groups are contiguous.
An inefficient access pattern is if each wavefront accesses all the channels. This
is likely to happen if consecutive work-items access data that has a large power
of two strides.
In the next example of a kernel for copying, the input and output buffers are
interpreted as though they were 2D, and the work-group size is organized as 2D.
By changing the width, the data type and the work-group dimensions, we get a
set of kernels out of this code.
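The full listing is not reproduced here; a minimal sketch of its shape follows
(DATA_TYPE matches the later listing in this section, while WIDTH and the default
values are assumed build-time macros, not necessarily the original names):

#ifndef DATA_TYPE
#define DATA_TYPE float      /* data type varied across the measured kernels */
#endif
#ifndef WIDTH
#define WIDTH 1024           /* assumed row width; varied across the measured kernels */
#endif

__kernel void copy_2d(__global const DATA_TYPE *A, __global DATA_TYPE *C)
{
    size_t x = get_global_id(0);
    size_t y = get_global_id(1);
    size_t idx = y * (size_t)WIDTH + x;   /* row-major addressing */
    C[idx] = A[idx];
}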
Table 6.2 shows how much the launch dimension can affect performance. It lists
each kernel’s effective bandwidth and ratio to maximum bandwidth.
Staggered offsets apply a coordinate transformation to the kernel so that the data
is processed in a different order. Unlike adding a column, this technique does not
use extra space. It is also relatively simple to add to existing code.
(Figure: Transformation to Staggered Offsets for a matrix in row-major order with a
work-group size of k by k. In the linear format, the start addresses of work-groups
0,0, 1,0, 2,0, ... are each a power of two (2N) apart; in the offset format after the
transform, they are K + 2N, 2K + 2N, ... apart, which is not a power of two.)
1. Generally, it is not a good idea to make the work-group size something other than an integer multiple
of the wavefront size, but that usually is less important than avoiding channel conflicts.
The global ID values reflect the order that the hardware initiates work-groups.
The values of get_group_id are in ascending launch order.
The hardware launch order is fixed, but it is possible to change the launch order,
as shown in the following example.
To transform the code, add the following four lines to the top of the kernel.
get_group_id_0 = get_group_id(0);
get_group_id_1 = (get_group_id(0) + get_group_id(1)) % get_local_size(0);
get_global_id_0 = get_group_id_0 * get_local_size(0) + get_local_id(0);
get_global_id_1 = get_group_id_1 * get_local_size(1) + get_local_id(1);
Then, change the global IDs and group IDs to the staggered form. The result is:
__kernel void
copy_float (
__global const DATA_TYPE * A,
__global DATA_TYPE * C)
{
size_t get_group_id_0 = get_group_id(0);
size_t get_group_id_1 = (get_group_id(0) + get_group_id(1)) %
get_local_size(0);
This does not happen on the read-only memories, such as constant buffers,
textures, or shader resource view (SRV); but it is possible on the read/write UAV
memory or OpenCL global memory.
From a hardware standpoint, reads from a fixed address have the same upper
bits, so they collide and are serialized. To read in a single value, read the value
in a single work-item, place it in local memory, and then use that location:
Avoid:
temp = input[3];                      // if input is from global space
Use:
__local float shared_val;             // one copy, shared by the work-group
if (get_local_id(0) == 0) {
    shared_val = input[3];            // a single work-item reads the value
}
barrier(CLK_LOCAL_MEM_FENCE);
temp = shared_val;                    // all work-items read the broadcast copy
The performance of these kernels can be seen in Figure 6.4. Change to float4
after eliminating the conflicts.
Figure 6.4 Two Kernels: One Using float4 (blue), the Other float1 (red)
(bandwidth in MB/s versus transfer size in bytes)
The following code example has two kernels, both of which can do a simple copy,
but Copy4 uses float4 data types.
__kernel void
Copy4(__global const float4 * input,
__global float4 * output)
{
int gid = get_global_id(0);
output[gid] = input[gid];
return;
}
__kernel void
Copy1(__global const float * input,
__global float * output)
{
int gid = get_global_id(0);
output[gid] = input[gid];
return;
}
Copying data as float4 gives the best result: 84% of absolute peak. It also speeds
up the 2D versions of the copy (see Table 6.3).
In coalesced writes, the compute unit transfers one 32-bit address and 16
element-sized pieces of data for each quarter-wavefront, for a total of 16
elements plus one address per quarter-wavefront. For coalesced writes, processing
a quarter-wavefront takes one cycle instead of two. While this is twice as fast, the
times are small compared to the rate at which the memory controller can handle the
data. See Figure 6.5.
The first kernel Copy1 maximizes coalesced writes: work-item k writes to address
k. The second kernel writes a shifted pattern: In each quarter-wavefront of 16
work-items, work-item k writes to address k-1, except the first work-item in each
quarter-wavefront writes to address k+16. There is not enough order here to
coalesce on some other vendor machines. Finally, the third kernel has work-item
k write to address k when k is even, and write address 63-k when k is odd. This
pattern never coalesces.
(Figure 6.5: bandwidth, in MB/s, versus transfer size in bytes for the amd,
amd-NoCoal, and amd-Split kernels.)
Table 6.4 lists the effective bandwidth and ratio to maximum bandwidth for each
kernel type.
6.1.5 Alignment
Figure 6.6 shows how the performance of a simple, unaligned access (float1) of
this kernel varies as the size of the offset varies. Each transfer was large (16 MB).
The performance gain from adjusting alignment is small, so generally this is not an
important consideration on AMD GPUs.
(Figure 6.6: bandwidth, in MB/s, versus offset, for offsets from 0 to 60.)
__kernel void
CopyAdd(global const float * input,
__global float * output,
const int offset)
{
int gid = get_global_id(0)+ offset;
output[gid] = input[gid];
return;
}
Table 6.5 lists the effective bandwidth and ratio to maximum bandwidth for each
kernel type.
1. Examine the code to ensure you are using FastPath, not CompletePath,
everywhere possible. Check carefully to see if you are minimizing the
number of kernels that use CompletePath operations. You might be able to
use textures, image-objects, or constant buffers to help.
2. Examine the data-set sizes and launch dimensions to see if you can
eliminate bank conflicts.
3. Try to use float4 instead of float1.
4. Try to change the access pattern to allow write coalescing. This is important
on some hardware platforms, but only of limited importance for AMD GPU
devices.
5. Finally, look at changing the access pattern to allow data alignment.
The amount of local memory available on a device can be queried with:
clGetDeviceInfo( …, CL_DEVICE_LOCAL_MEM_SIZE, … );
All AMD Evergreen GPUs contain a 32 kB LDS for each compute unit. On high-
end GPUs, the LDS contains 32 banks, each bank is four bytes wide and 256
entries deep; the bank address is determined by bits 6:2 in the address. On lower-
end GPUs, the LDS contains 16 banks, each bank is still 4 bytes wide, and the
bank used is determined by bits 5:2 in the address. Appendix D, “Device
Parameters,” shows how many LDS banks are present on the different AMD
Evergreen products. As shown below, programmers should carefully control the
bank bits to avoid bank conflicts as much as possible.
In a single cycle, local memory can service a request for each bank (up to 32
accesses each cycle on the ATI Radeon HD 5870 GPU). For an ATI Radeon
HD 5870 GPU, this delivers a memory bandwidth of over 100 GB/s for each
compute unit, and more than 2 TB/s for the whole chip. This is more than 14X
the global memory bandwidth. However, accesses that map to the same bank
are serialized and serviced on consecutive cycles. A wavefront that generates
bank conflicts stalls on the compute unit until all LDS accesses have completed.
The GPU reprocesses the wavefront on subsequent cycles, enabling only the
lanes receiving data, until all the conflicting accesses complete. The bank with
the most conflicting accesses determines the latency for the wavefront to
complete the local memory operation. The worst case occurs when all 64 work-
items map to the same bank, since each access then is serviced at a rate of one
per clock cycle; this case takes 64 cycles to complete the local memory access
for the wavefront. A program with a large number of bank conflicts (as measured
by the LDSBankConflict performance counter) might benefit from using the
constant or image memory rather than LDS.
Thus, the key to effectively using the local cache memory is to control the access
pattern so that accesses generated on the same cycle map to different banks in
the local memory. One notable exception is that accesses to the same address
(even though they have the same bits 6:2) can be broadcast to all requestors
and do not generate a bank conflict. The LDS hardware examines the requests
generated over two cycles (32 work-items of execution) for bank conflicts.
Ensure, as much as possible, that the memory requests generated from a
quarter-wavefront avoid bank conflicts by using unique address bits 6:2. A simple
sequential address pattern, where each work-item reads a float2 value from LDS,
generates a conflict-free access pattern on the ATI Radeon HD 5870 GPU.
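A minimal sketch of that sequential float2 pattern (hypothetical kernel; 64 work-items
per work-group assumed):

__kernel void lds_float2_pattern(__global const float2 *in, __global float2 *out)
{
    __local float2 lds[64];
    uint lid = get_local_id(0);
    lds[lid] = in[get_global_id(0)];      // collaborative fill, one float2 per work-item
    barrier(CLK_LOCAL_MEM_FENCE);
    // Sequential float2 reads: work-item i touches banks 2i and 2i+1 (mod 32),
    // so a quarter-wavefront uses 32 distinct banks with no conflicts.
    out[get_global_id(0)] = lds[lid];
}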
Note that a sequential access pattern, where each work-item reads a float4 value
from LDS, uses only half the banks on each cycle on the ATI Radeon HD 5870
GPU and delivers half the performance of the float access pattern.
Each stream processor can generate up to two 4-byte LDS requests per cycle.
Byte and short reads consume four bytes of LDS bandwidth. Since each stream
processor can execute five operations (or four, depending on the GPU type) in
the VLIW each cycle (typically requiring 10-15 input operands), two local memory
requests might not provide enough bandwidth to service the entire instruction.
Developers can use the large register file: each compute unit has 256 kB of
register space available (8X the LDS size) and can provide up to twelve 4-byte
values/cycle (6X the LDS bandwidth). Registers do not offer the same indexing
flexibility as does the LDS, but for some algorithms this can be overcome with
loop unrolling and explicit addressing.
LDS reads require one ALU operation to initiate them. Each operation can initiate
two loads of up to four bytes each.
The AMD APP Profiler provides the LDSBankConflict performance counter to help
optimize local memory usage.
These declarations can be either in the parameters to the kernel call or in the
body of the kernel. The __local syntax allocates a single block of memory, which
is shared across all work-items in the workgroup.
To write data into local memory, write it into an array allocated with __local. For
example:
localBuffer[i] = 5.0;
A typical access pattern is for each work-item to collaboratively write to the local
memory: each work-item writes a subsection, and as the work-items execute in
parallel they write the entire array. Combined with proper consideration for the
access pattern and bank alignment, these collaborative write approaches can
lead to highly efficient memory accessing. Local memory is consistent across
work-items only at a work-group barrier; thus, before reading the values written
collaboratively, the kernel must include a barrier() instruction.
The following example is a simple kernel section that collaboratively writes, then
reads from, local memory:
__kernel void localMemoryExample(__global float *In, __global float *Out)
{
__local float localBuffer[64];
uint tx = get_local_id(0), gx = get_global_id(0);
localBuffer[tx] = In[gx];          // collaborative write: each work-item fills one element
barrier(CLK_LOCAL_MEM_FENCE);      // all writes complete before any work-item reads
float f = localBuffer[tx];
for (uint i=tx+1; i<64; i++) {
f *= localBuffer[i];
}
Out[gx] = f;
}
Note the host code cannot read from, or write to, local memory. Only the kernel
can access local memory.
(see section 5.8 of the OpenCL Specification) that is less than the wavefront size.
Developers are strongly encouraged to include the barriers where appropriate,
and rely on the compiler to remove the barriers when possible, rather than
manually removing the barriers(). This technique results in more portable
code, including the ability to run kernels on CPU devices.
To further improve the performance of the AMD OpenCL stack, two methods
allow users to take advantage of hardware constant buffers. These are:
The compiler tries to map private memory allocations to the pool of GPRs in the
GPU. In the event GPRs are not available, private memory is mapped to the
“scratch” region, which has the same performance as global memory.
Section 6.6.2, “Resource Limits on Active Wavefronts,” page 6-24, has more
information on register allocation and identifying when the compiler uses the
scratch region. GPRs provide the highest-bandwidth access of any hardware
resource. In addition to reading up to 48 bytes/cycle from the register file, the
hardware can access results produced in the previous cycle (through the
Previous Vector/Previous Scalar register) without consuming any register file
bandwidth. GPRs have some restrictions about which register ports can be read
on each cycle; but generally, these are not exposed to the OpenCL programmer.
Varying-indexed constants use the same path as global memory access and are
subject to the same bank and alignment constraints described in Section 6.1,
“Global Memory Optimization,” page 6-1.
The L1 and L2 caches are currently only enabled for images and same-indexed
constants. As of SDK 2.4, read only buffers can be cached in L1 and L2. To
enable this, the developer must indicate to the compiler that the buffer is read
only and does not alias with other buffers. For example, use:

kernel void mykernel(global int const * restrict mypointerName)
The const indicates to the compiler that mypointerName is read only from the
kernel, and the restrict attribute indicates to the compiler that no other pointer
aliases with mypointerName.
The L1 cache can service up to four address requests per cycle, each delivering
up to 16 bytes. The bandwidth shown assumes an access size of 16 bytes;
smaller access sizes/requests result in a lower peak bandwidth for the L1 cache.
Using float4 with images increases the request size and can deliver higher L1
cache bandwidth.
Each memory channel on the GPU contains an L2 cache that can deliver up to
64 bytes/cycle. The ATI Radeon HD 5870 GPU has eight memory channels;
thus, it can deliver up to 512 bytes/cycle; divided among 320 stream cores, this
provides up to ~1.6 bytes/cycle for each stream core.
Global Memory bandwidth is limited by external pins, not internal bus bandwidth.
The ATI Radeon HD 5870 GPU supports up to 153 GB/s of memory bandwidth
which is an average of 0.6 bytes/cycle for each stream core.
Note that Table 6.6 shows the performance for the ATI Radeon HD 5870 GPU.
The “Size/Compute Unit” column and many of the bandwidths/processing
element apply to all Evergreen-class GPUs; however, the “Size/GPU” column
and the bandwidths for varying-indexed constant, L2, and global memory vary
across different GPU devices. The resource capacities and peak bandwidth for
other AMD GPU devices can be found in Appendix D, “Device Parameters.”
The native data type for L1 is a four-vector of 32-bit words. On L1, fill and read
addressing are linked. It is important that L1 is initially filled from global memory
with a coalesced access pattern; once filled, random accesses come at no extra
processing cost.
Currently, the native format of LDS is a 32-bit word. The theoretical LDS peak
bandwidth is achieved when each thread operates on a two-vector of 32-bit
words (16 threads per clock operate on 32 banks). If an algorithm requires
coalesced 32-bit quantities, it maps well to LDS. The use of four-vectors or larger
can lead to bank conflicts.
From an application point of view, filling LDS from global memory, and reading
from it, are independent operations that can use independent addressing. Thus,
LDS can be used to explicitly convert a scattered access pattern to a coalesced
pattern for read and write to global memory. Or, by taking advantage of the LDS
read broadcast feature, LDS can be filled with a coalesced pattern from global
memory, followed by all threads iterating through the same LDS words
simultaneously.
The use of LDS is linked to GPR usage and wavefront-per-Vector ALU count.
Better sharing efficiency requires a larger work-group, so that more work items
share the same LDS. Compiling kernels for larger work groups typically results
in increased register use, so that fewer wavefronts can be scheduled
simultaneously per Vector ALU. This, in turn, reduces memory latency hiding.
Requesting larger amounts of LDS per work-group results in fewer wavefronts
per Vector ALU, with the same effect.
LDS typically involves the use of barriers, with a potential performance impact.
This is true even for read-only use cases, as LDS must be explicitly filled in from
global memory (after which a barrier is required before reads can commence).
A wavefront that issues a global memory access is made idle until the memory
request completes. During this time, the compute unit can process other independent
wavefronts, if they are available.
Kernel execution time also plays a role in hiding memory latency: longer kernels
keep the functional units busy and effectively hide more latency. To better
understand this concept, consider a global memory access which takes 400
cycles to execute. Assume the compute unit contains many other wavefronts,
each of which performs five ALU instructions before generating another global
memory reference. As discussed previously, the hardware executes each
instruction in the wavefront in four cycles; thus, all five instructions occupy the
ALU for 20 cycles. Note the compute unit interleaves two of these wavefronts
and executes the five instructions from both wavefronts (10 total instructions) in
40 cycles. To fully hide the 400 cycles of latency, the compute unit requires
(400/40) = 10 pairs of wavefronts, or 20 total wavefronts. If the wavefront
contains 10 instructions rather than 5, the wavefront pair would consume 80
cycles of latency, and only 10 wavefronts would be required to hide the 400
cycles of latency.
Generally, it is not possible to predict how the compute unit schedules the
available wavefronts, and thus it is not useful to try to predict exactly which ALU
block executes when trying to hide latency. Instead, consider the overall ratio of
ALU operations to fetch operations – this metric is reported by the AMD APP
Profiler in the ALUFetchRatio counter. Each ALU operation keeps the compute
unit busy for four cycles, so you can roughly divide 500 cycles of latency by
(4*ALUFetchRatio) to determine how many wavefronts must be in-flight to hide
that latency. Additionally, a low value for the ALUBusy performance counter can
indicate that the compute unit is not providing enough wavefronts to keep the
execution resources in full use. (This counter also can be low if the kernel
exhausts the available DRAM bandwidth. In this case, generating more
wavefronts does not improve performance; it can reduce performance by creating
more contention.)
These limits are largely properties of the hardware and, thus, difficult for
developers to control directly. Fortunately, these are relatively generous limits.
Frequently, the register and LDS usage in the kernel determines the limit on the
number of active wavefronts/compute unit, and these can be controlled by the
developer.
Each compute unit provides 16384 GP registers, and each register contains
4x32-bit values (either single-precision floating point or a 32-bit integer). The total
register size is 256 kB of storage per compute unit. These registers are shared
among all active wavefronts on the compute unit; each kernel allocates only the
registers it needs from the shared pool. This is unlike a CPU, where each thread
is assigned a fixed set of architectural registers. However, using many registers
in a kernel depletes the shared pool and eventually causes the hardware to
throttle the maximum number of active wavefronts.
Table 6.7 shows how the number of registers used in the kernel impacts the
register-limited wavefronts/compute unit.
For example, a kernel that uses 30 registers (120x32-bit values) can run with
eight active wavefronts on each compute unit. Because of the global limits
described earlier, each compute unit is limited to 32 wavefronts; thus, kernels can
use up to seven registers (28 values) without affecting the number of
wavefronts/compute unit. Finally, note that in addition to the GPRs shown in the
table, each kernel has access to four clause temporary registers.
GP Registers Used by Kernel    Register-Limited Wavefronts/Compute Unit
0-1                            248
2                              124
3                              82
4                              62
5                              49
6                              41
7                              35
8                              31
9                              27
10                             24
11                             22
12                             20
13                             19
14                             17
15                             16
16                             15
17                             14
18-19                          13
20                             12
21-22                          11
23-24                          10
25-27                          9
28-31                          8
32-35                          7
36-41                          6
42-49                          5
50-62                          4
63-82                          3
83-124                         2
The AMD APP Profiler displays the number of GPRs used by the kernel.
Alternatively, the AMD APP Profiler generates the ISA dump (described in
Section 4.3, “Analyzing Processor Kernels,” page 4-9), which then can be
searched for the string :NUM_GPRS.
The AMD APP KernelAnalyzer also shows the number of GPRs used by the kernel,
across a wide variety of GPU compilation targets.
The compiler generates spill code (shuffling values to, and from, memory) if it
cannot fit all the live values into registers. Spill code uses long-latency global
memory and can have a large impact on performance. The AMD APP Profiler
reports the static number of register spills in the ScratchReg field. Generally, it
is a good idea to re-write the algorithm to use fewer GPRs, or tune the work-
group dimensions specified at launch time to expose more registers/kernel to the
compiler, in order to reduce the scratch register usage to 0.
For example, if the compiler allocates 70 registers for the work-item, Table 6.7
shows that only three wavefronts (192 work-items) are supported. If the user later
launches the kernel with a work-group size of four wavefronts (256 work-items),
the launch fails because the work-group requires 70*256=17920 registers, which
is more than the hardware allows. To prevent this from happening, the compiler
performs the register allocation with the conservative assumption that the kernel
is launched with the largest work-group size (256 work-items). The compiler
guarantees that the kernel does not use more than 62 registers (the maximum
number of registers which supports a work-group with four wave-fronts), and
generates low-performing register spill code, if necessary.
If the kernel is always launched with a known work-group size, declare that size with
the following attribute so the compiler can allocate registers for the actual launch
configuration instead of the 256-work-item maximum:
__attribute__((reqd_work_group_size(X, Y, Z)))
Section 6.7.2 of the OpenCL specification explains the attribute in more detail.
In addition to registers, shared memory can also serve to limit the active
wavefronts/compute unit. Each compute unit has 32 kB of LDS, which is shared
among all active work-groups. LDS is allocated on a per-work-group granularity,
so it is possible (and useful) for multiple wavefronts to share the same local
memory allocation. However, large LDS allocations eventually limit the number
of workgroups that can be active. Table 6.8 provides more details about how LDS
usage can impact the wavefronts/compute unit.
Local Memory      LDS-Limited Wavefronts/Compute Unit
/ Work-Group      4 WFs/WG¹    3 WFs/WG    2 WFs/WG    1 WF/WG
<=4K              32           24          16          8
4.0K-4.6K         28           21          14          7
4.6K-5.3K         24           18          12          6
5.3K-6.4K         20           15          10          5
6.4K-8.0K         16           12          8           4
8.0K-10.7K        12           9           6           3
10.7K-16.0K       8            6           4           2
16.0K-32.0K       4            3           2           1
1. Assumes each work-group uses four wavefronts (the maximum supported by the AMD
OpenCL SDK).
AMD provides the following tools to examine the amount of LDS used by the
kernel:
The AMD APP Profiler displays the LDS usage. See the LocalMem counter.
Alternatively, use the AMD APP Profiler to generate the ISA dump (described
in Section 4.3, “Analyzing Processor Kernels,” page 4-9), then search for the
string SQ_LDS_ALLOC:SIZE in the ISA dump. Note that the value is shown in
hexadecimal format.
OpenCL does not explicitly limit the number of work-groups that can be submitted
with a clEnqueueNDRangeKernel command. The hardware limits the available in-
flight threads, but the OpenCL SDK automatically partitions a large number of
work-groups into smaller pieces that the hardware can process. For some large
workloads, the amount of memory available to the GPU can be a limitation; the
problem might require so much memory capacity that the GPU cannot hold it all.
In these cases, the programmer must partition the workload into multiple
clEnqueueNDRangeKernel commands. The available device memory can be
obtained by querying clGetDeviceInfo.
OpenCL limits the number of work-items in each work-group. Call clGetDeviceInfo with
CL_DEVICE_MAX_WORK_GROUP_SIZE to determine the maximum number of
work-items per work-group supported by the hardware. Currently, AMD GPUs with SDK 2.1
return 256 as the maximum number of work-items per work-group. Note that the
number of work-items is the product of all work-group dimensions; for example,
a work-group with dimensions 32x16 requires 512 work-items, which is not
allowed with the current AMD OpenCL SDK.
Work-items in the same work-group can share data through LDS memory and
also use high-speed local atomic operations. Thus, larger work-groups enable
more work-items to efficiently share data, which can reduce the amount of slower
global communication. However, larger work-groups reduce the number of global
work-groups, which, for small workloads, could result in idle compute units.
Generally, larger work-groups are better as long as the global range is big
enough to provide 1-2 Work-Groups for each compute unit in the system; for
small workloads it generally works best to reduce the work-group size in order to
avoid idle compute units. Note that it is possible to make the decision
dynamically, when the kernel is launched, based on the launch dimensions and
the target device characteristics.
Often, work can be moved from the work-group into the kernel. For example, a
matrix multiply where each work-item computes a single element in the output
array can be written so that each work-item generates multiple elements. This
technique can be important for effectively using the processing elements
available in the five-wide (or four-wide, depending on the GPU type) VLIW
processing engine (see the ALUPacking performance counter reported by the
AMD APP Profiler). The mechanics of this technique often is as simple as adding
a for loop around the kernel, so that the kernel body is run multiple times inside
this loop, then adjusting the global work size to reduce the work-items. Typically,
the local work-group is unchanged, and the net effect of moving work into the
kernel is that each work-group does more effective processing, and fewer global
work-groups are required.
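As a rough sketch (the kernel is hypothetical; four elements per work-item, combined
at a separation of 16 as discussed next), the transformation can look like this:

__kernel void scaled_copy4(__global const float *in, __global float *out)
{
    size_t gid  = get_global_id(0);
    // Work-item gid handles four elements separated by 16, so each quarter-
    // wavefront still touches a dense, contiguous block on every iteration.
    size_t base = (gid / 16) * 64 + (gid % 16);
    for (int i = 0; i < 4; i++) {
        size_t idx = base + (size_t)i * 16;
        out[idx] = in[idx] * 2.0f;
    }
}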
When moving work to the kernel, often it is best to combine work-items that are
separated by 16 in the NDRange index space, rather than combining adjacent
work-items. Combining the work-items in this fashion preserves the memory
access patterns optimal for global and local memory accesses. For example,
consider a kernel where each work-item accesses one four-byte element in array A.
The resulting access pattern is:

Work-item:   0      1      2      3     ...
Cycle0:      A+0    A+1    A+2    A+3   ...

If, instead, each work-item is given four adjacent elements (work-item 0 handles A+0
through A+3, work-item 1 handles A+4 through A+7, and so on), the per-cycle pattern
becomes strided:

Work-item:   0      1      2      3     ...
Cycle0:      A+0    A+4    A+8    A+12  ...
Cycle1:      A+1    A+5    A+9    A+13  ...

This pattern shows that on the first cycle the access pattern contains “holes.”
Also, this pattern results in bank conflicts on the LDS. A better access pattern is
to combine four work-items so that the first work-item accesses array elements
A+0, A+16, A+32, and A+48. The resulting access pattern is:

Work-item:   0      1      2      3     ...
Cycle0:      A+0    A+1    A+2    A+3   ...
Cycle1:      A+16   A+17   A+18   A+19  ...
Increasing the processing done by the kernels can allow more processing to be
done on the fixed pool of local memory available to work-groups. For example,
consider a case where an algorithm requires 32x32 elements of shared memory.
If each work-item processes only one element, it requires 1024 work-items/work-
group, which exceeds the maximum limit. Instead, the kernel can be written so that
each work-item processes four elements, allowing a 256-work-item work-group to
cover the full 32x32 region.
The local NDRange can contain up to three dimensions, here labeled X, Y, and
Z. The X dimension is returned by get_local_id(0), Y is returned by
get_local_id(1), and Z is returned by get_local_id(2). The GPU hardware
schedules the kernels so that the X dimension moves fastest as the work-items
are packed into wavefronts. For example, the 128 threads in a 2D work-group of
dimension 32x4 (X=32 and Y=4) would be packed into two wavefronts as follows
(notation shown in X,Y order):
WaveFront0:
0,0 1,0 2,0 3,0 4,0 5,0 6,0 7,0 8,0 9,0 10,0 11,0 12,0 13,0 14,0 15,0
16,0 17,0 18,0 19,0 20,0 21,0 22,0 23,0 24,0 25,0 26,0 27,0 28,0 29,0 30,0 31,0
0,1 1,1 2,1 3,1 4,1 5,1 6,1 7,1 8,1 9,1 10,1 11,1 12,1 13,1 14,1 15,1
16,1 17,1 18,1 19,1 20,1 21,1 22,1 23,1 24,1 25,1 26,1 27,1 28,1 29,1 30,1 31,1

WaveFront1:
0,2 1,2 2,2 3,2 4,2 5,2 6,2 7,2 8,2 9,2 10,2 11,2 12,2 13,2 14,2 15,2
16,2 17,2 18,2 19,2 20,2 21,2 22,2 23,2 24,2 25,2 26,2 27,2 28,2 29,2 30,2 31,2
0,3 1,3 2,3 3,3 4,3 5,3 6,3 7,3 8,3 9,3 10,3 11,3 12,3 13,3 14,3 15,3
16,3 17,3 18,3 19,3 20,3 21,3 22,3 23,3 24,3 25,3 26,3 27,3 28,3 29,3 30,3 31,3
The total number of work-items in the work-group is typically the most important
parameter to consider, in particular when optimizing to hide latency by increasing
wavefronts/compute unit. However, the choice of XYZ dimensions for the same
overall work-group size can have the following second-order effects.
                              Evergreen              Evergreen
                              (Cypress, Juniper,     (Cedar)
                              Redwood)
Work-items/Wavefront          64                     32
Stream Cores / CU             16                     8
GP Registers / CU             16384                  8192
Local Memory Size             32K                    32K
Maximum Work-Group Size       256                    128
The difference in total register size can impact the compiled code and cause
register spill code for kernels that were tuned for other devices. One technique
that can be useful is to specify the required work-group size as 128 (half the
default of 256). In this case, the compiler has the same number of registers
available as for other devices and uses the same number of registers. The
developer must ensure that the kernel is launched with the reduced work size
(128) on Cedar-class devices.
Select the work-group size to be a multiple of 64, so that the wavefronts are
fully populated.
Always provide at least two wavefronts (128 work-items) per compute unit.
For an ATI Radeon HD 5870 GPU, this implies 40 wavefronts or 2560 work-
items. If necessary, reduce the work-group size (but not below 64 work-
items) to provide work-groups for all compute units in the system.
Latency hiding depends on both the number of wavefronts/compute unit, as
well as the execution time for each kernel. Generally, two to eight
wavefronts/compute unit is desirable, but this can vary significantly,
depending on the complexity of the kernel and the available memory
bandwidth. The AMD APP Profiler and associated performance counters can
help to select an optimal value.
higher clock rate (2800 MHz vs 750 MHz for this comparison), as well as the
operation latency; the CPU is optimized to perform an integer add in just one
cycle, while the GPU requires eight cycles. The CPU also has a latency-
optimized path to DRAM, while the GPU optimizes for bandwidth and relies on
many in-flight threads to hide the latency. The ATI Radeon HD 5670 GPU, for
example, supports more than 15,000 in-flight threads and can switch to a new
thread in a single cycle. The CPU supports only four hardware threads, and
thread-switching requires saving and restoring the CPU registers from memory.
The GPU requires many active threads to both keep the execution resources
busy, as well as provide enough threads to hide the long latency of cache
misses.
Each GPU thread has its own register state, which enables the fast single-cycle
switching between threads. Also, GPUs can be very efficient at gather/scatter
operations: each thread can load from any arbitrary address, and the registers
are completely decoupled from the other threads. This is substantially more
flexible and higher-performing than a classic Vector ALU-style architecture (such
as SSE on the CPU), which typically requires that data be accessed from
contiguous and aligned memory locations. SSE supports instructions that write
parts of a register (for example, MOVLPS and MOVHPS, which write the upper and
lower halves, respectively, of an SSE register), but these instructions generate
additional microarchitecture dependencies and frequently require additional pack
instructions to format the data correctly.
In contrast, each GPU thread shares the same program counter with 63 other
threads in a wavefront. Divergent control-flow on a GPU can be quite expensive
and can lead to significant under-utilization of the GPU device. When control flow
substantially narrows the number of valid work-items in a wave-front, it can be
faster to use the CPU device.
CPUs also tend to provide significantly more on-chip cache than GPUs. In this
example, the CPU device contains 512k L2 cache/core plus a 6 MB L3 cache
that is shared among all cores, for a total of 8 MB of cache. In contrast, the GPU
device contains only 128 kB of cache shared by the five compute units. The larger
CPU cache serves both to reduce the average memory latency and to reduce
memory bandwidth in cases where data can be re-used from the caches.
Finally, note the approximate 9X difference in kernel launch latency. The GPU
launch time includes both the latency through the software stack, as well as the
time to transfer the compiled kernel and associated arguments across the PCI-
express bus to the discrete GPU. Notably, the launch time does not include the
time to compile the kernel. The CPU can be the device-of-choice for small, quick-
running problems when the overhead to launch the work on the GPU outweighs
the potential speedup. Often, the work size is data-dependent, and the choice of
device can be data-dependent as well. For example, an image-processing
algorithm may run faster on the GPU if the images are large, but faster on the
CPU when the images are small.
Highly parallel workloads can run an order of magnitude faster on the GPU, and
at higher power efficiency. Serial or small parallel workloads (too small to
efficiently use the GPU resources) often run significantly faster on the CPU
devices. In some cases, the same algorithm can
exhibit both types of workload. A simple example is a reduction operation such
as a sum of all the elements in a large array. The beginning phases of the
operation can be performed in parallel and run much faster on the GPU. The end
of the operation requires summing together the partial sums that were computed
in parallel; eventually, the width becomes small enough so that the overhead to
parallelize outweighs the computation cost, and it makes sense to perform a
serial add. For these serial operations, the CPU can be significantly faster than
the GPU.
For some algorithms, the advantages of the GPU (high computation throughput,
latency hiding) are offset by the advantages of the CPU (low latency, caches, fast
launch time), so that the performance on either device is similar. This case is
more common for mid-range GPUs and when running more mainstream
algorithms. If the CPU and the GPU deliver similar performance, the user can
get the benefit of either improved power efficiency (by running on the GPU) or
higher peak performance (by using both devices).
Usually, when the data size is small, it is faster to use the CPU: start-up time is
quicker than on the GPU because driver overhead is smaller, and no buffers need
to be copied from the host to the device.
When work is split statically across multiple devices, the faster devices finish and
become idle while the whole system waits for the single, unexpectedly slow device.
Asynchronous Launch
OpenCL devices are designed to be scheduled asynchronously from a
command-queue. The host application can enqueue multiple kernels, flush
the kernels so they begin executing on the device, then use the host core for
other work. The AMD OpenCL implementation uses a separate thread for
each command-queue, so work can be transparently scheduled to the GPU
in the background.
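A host-side sketch of this pattern (NUM_TILES and do_host_side_work are hypothetical placeholders):

for (int i = 0; i < NUM_TILES; ++i)
    clEnqueueNDRangeKernel(cmd_queue_, kernel, 1, NULL, &global_size, &local_size, 0, NULL, NULL);
clFlush(cmd_queue_);       // submit the batch so the GPU starts; does not block
do_host_side_work();       // overlap other host work with GPU execution
clFinish(cmd_queue_);      // block only when the results are actually needed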
One situation that should be avoided is starving the high-performance GPU
devices. This can occur if the physical CPU core, which must re-fill the device
queue, is itself being used as a device. A simple approach to this problem is
to dedicate a physical CPU core for scheduling chores. The device fission
extension (see Section A.7, “cl_ext Extensions,” page A-4) can be used to
reserve a core for scheduling. For example, on a quad-core device, device
fission can be used to create an OpenCL device with only three cores.
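The following host-side sketch shows the idea using the OpenCL 1.2 core clCreateSubDevices entry point (the device fission extension exposes an equivalent clCreateSubDevicesEXT call): a three-core sub-device is carved out of a quad-core CPU, leaving one core free for scheduling chores.

cl_device_partition_property props[] = {
    CL_DEVICE_PARTITION_BY_COUNTS, 3,
    CL_DEVICE_PARTITION_BY_COUNTS_LIST_END, 0
};
cl_device_id three_core_device;
cl_uint num_created = 0;
clCreateSubDevices(cpu_device, props, 1, &three_core_device, &num_created);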
Another approach is to schedule enough work to the device so that it can
tolerate latency in additional scheduling. Here, the scheduler maintains a
watermark of uncompleted work that has been sent to the device, and refills
the queue when it drops below the watermark. This effectively increases the
grain size, but can be very effective at reducing or eliminating device
starvation. Developers cannot directly query the list of commands in the
OpenCL command queues; however, it is possible to pass an event to each
clEnqueue call that can be queried, in order to determine the execution
status (in particular the command completion time); developers also can
maintain their own queue of outstanding requests.
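A sketch of the watermark approach (WATERMARK, launch_one_kernel, and work_remains are hypothetical helpers):

#define WATERMARK 8
cl_event inflight[WATERMARK];
for (int i = 0; i < WATERMARK; ++i)              // fill the queue up to the watermark
    inflight[i] = launch_one_kernel(cmd_queue_);
clFlush(cmd_queue_);
for (int i = 0; work_remains(); i = (i + 1) % WATERMARK) {
    clWaitForEvents(1, &inflight[i]);            // wait on the oldest outstanding launch
    clReleaseEvent(inflight[i]);
    inflight[i] = launch_one_kernel(cmd_queue_); // refill to keep the device busy
    clFlush(cmd_queue_);
}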
For many algorithms, this technique can be effective enough at hiding latency
so that a core does not need to be reserved for scheduling. In particular,
algorithms where the work-load is largely known up-front often work well with
a deep queue and watermark. Algorithms in which work is dynamically
created may require a dedicated thread to provide low-latency scheduling.
Data Location
Discrete GPUs use dedicated high-bandwidth memory that exists in a
separate address space. Moving data between the device address space and
the host requires time-consuming transfers over a relatively slow PCI-
Express bus. Schedulers should be aware of this cost and, for example,
attempt to schedule work that consumes the result on the same device
producing it.
CPU and GPU devices share the same memory bandwidth, which results in
additional interactions between kernel executions.
Enqueuing several commands before flushing can enable the host CPU to batch
together the command submission, which can reduce launch overhead.
For low-latency CPU response, it can be more efficient to use a dedicated spin
loop and not call clFinish(). Calling clFinish() indicates that the application
wants to wait for the GPU, putting the thread to sleep. For low latency, the
application should use clFlush(), followed by a loop to wait for the event to
complete. This is also true for blocking maps. The application should use non-
blocking maps followed by a loop waiting on the event. The following provides
sample code for this.
if (sleep)
{
    // This puts the host thread to sleep; useful if power is a consideration
    // or overhead is not a concern.
    clFinish(cmd_queue_);
}
else
{
    // This keeps the host thread awake; useful if latency is a concern.
    clFlush(cmd_queue_);
    error_ = clGetEventInfo(event, CL_EVENT_COMMAND_EXECUTION_STATUS,
                            sizeof(cl_int), &eventStatus, NULL);
    while (eventStatus > 0)
    {
        error_ = clGetEventInfo(event, CL_EVENT_COMMAND_EXECUTION_STATUS,
                                sizeof(cl_int), &eventStatus, NULL);
        Sleep(0); // Be nice to other threads; let the scheduler find other work.
    }
}
Code optimized for the Cypress device (the ATI Radeon™ HD 5870 GPU)
typically runs well across other members of the Evergreen family. There are
some differences in cache size and LDS bandwidth that might impact some
kernels (see Appendix D, “Device Parameters”). The Cedar ASIC has a smaller
wavefront width and fewer registers (see Section 6.6.4, “Optimizing for Cedar,”
page 6-32, for optimization information specific to this device).
As described in Section 6.9, “Clause Boundaries,” page 6-46, CPUs and GPUs
have very different performance characteristics, and some of these impact how
one writes an optimal kernel. Notable differences include:
The Vector ALU floating point resources in a CPU (SSE) require the use of
vectorized types (float4) to enable packed SSE code generation and extract
good performance from the Vector ALU hardware. The GPU VLIW hardware
is more flexible and can efficiently use the floating-point hardware even
without the explicit use of float4. See Section 6.8.4, “VLIW and SSE
Packing,” page 6-43, for more information and examples; however, code that
can use float4 often generates high-quality code for both the CPU and AMD
GPUs.
The AMD OpenCL CPU implementation runs work-items from the same
work-group back-to-back on the same physical CPU core. For optimally
coalesced memory patterns, a common access pattern for GPU-optimized
algorithms is for work-items in the same wavefront to access memory
locations from the same cache line. On a GPU, these work-items execute in
parallel and generate a coalesced access pattern. On a CPU, the first work-
item runs to completion (or until hitting a barrier) before switching to the next.
Generally, if the working set for the data used by a work-group fits in the CPU
caches, this access pattern can work efficiently: the first work-item brings a
line into the cache hierarchy, which the other work-items later hit. For large
working-sets that exceed the capacity of the cache hierarchy, this access
pattern does not work as efficiently; each work-item refetches cache lines
that were already brought in by earlier work-items but were evicted from the
cache hierarchy before being used. Note that AMD CPUs typically provide
512k to 2 MB of L2+L3 cache for each compute unit.
CPUs do not contain any hardware resources specifically designed to
accelerate local memory accesses. On a CPU, local memory is mapped to
the same cacheable DRAM used for global memory, and there is no
performance benefit from using the __local qualifier. The additional memory
operations to write to LDS, and the associated barrier operations, can reduce performance.
For a balanced solution that runs reasonably well on both devices, developers
are encouraged to write the algorithm using float4 vectorization. The GPU is
more sensitive to algorithm tuning; it also has higher peak performance potential.
Thus, one strategy is to target optimizations to the GPU and aim for reasonable
performance on the CPU. For peak performance on all devices, developers can
choose to use conditional compilation for key code loops in the kernel, or in some
cases even provide two separate kernels. Even with device-specific kernel
optimizations, the surrounding host code for allocating memory, launching
kernels, and interfacing with the rest of the program generally only needs to be
written once.
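A sketch of the conditional-compilation approach, using the __GPU__ and __CPU__ predefined macros described in Appendix A (the specific tuning shown is illustrative):

kernel void scale(global float4 *data, float k, int n)
{
    size_t gid = get_global_id(0);
#ifdef __GPU__
    // GPU path: one float4 per work-item; rely on many wavefronts for latency hiding.
    if (gid < (size_t)n)
        data[gid] *= k;
#else
    // CPU path: coarser grain; each work-item handles four float4 elements.
    for (size_t i = gid * 4; i < gid * 4 + 4 && i < (size_t)n; ++i)
        data[i] *= k;
#endif
}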
Note that single-precision MAD operations have five times the throughput of
double-precision operations, and that double-precision is only supported on the AMD
Radeon™ HD69XX devices. The use of single-precision calculation is
encouraged, if that precision is acceptable. Single-precision data is also half the
size of double-precision, which requires less chip bandwidth and is not as
demanding on the cache structures.
Generally, the throughput and latency for 32-bit integer operations is the same
as for single-precision floating point operations.
24-bit integer MULs and MADs have five times the throughput of 32-bit integer
multiplies. 24-bit unsigned integers are natively supported only on the Evergreen
family of devices and later. Signed 24-bit integers are supported only on the
Northern Island family of devices and later. The use of OpenCL built-in functions
for mul24 and mad24 is encouraged. Note that mul24 can be useful for array
indexing operations.
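A short sketch of mad24 used for 2D array indexing, where the row and column indices are known to fit in 24 bits (kernel name is illustrative):

kernel void add_one(global const float *in, global float *out, int width)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    int idx = mad24(y, width, x);   // y * width + x at the faster 24-bit rate
    out[idx] = in[idx] + 1.0f;
}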
Packed 16-bit and 8-bit operations are not natively supported; however, in cases
where it is known that no overflow will occur, some algorithms may be able to
effectively pack 2 to 4 values into the 32-bit registers natively supported by the
hardware.
Table 6.10 shows the throughput for each stream processing core. To obtain the
peak throughput for the whole device, multiply the number of stream cores and
the engine clock (see Appendix D, “Device Parameters”). For example, according
to Table 6.10, a Cypress device can perform two double-precision ADD
operations/cycle in each stream core. From Appendix D, “Device Parameters,” an
ATI Radeon HD 5870 GPU has 320 Stream Cores and an engine clock of
850 MHz, so the entire GPU has a throughput rate of (2*320*850 MHz) = 544
GFlops for double-precision adds.
Figure 6.9 shows an example of an unrolled loop with clustered stores.
output[gid+i+0] = Velm0;
output[gid+i+1] = Velm1;
output[gid+i+2] = Velm2;
output[gid+i+3] = Velm3;
}
}
Unrolling the loop to expose the underlying parallelism typically allows the GPU
compiler to pack the instructions into the slots in the VLIW word. Unrolling by a
factor of at least five (or eight, to preserve power-of-two factors) typically delivers
the best results. Unrolling increases the number of required registers, so some
experimentation may be required.
The CPU back-end requires the use of vector types (float4) to vectorize and
generate packed SSE instructions. To vectorize the loop above, use float4 for the
array arguments. Obviously, this transformation is only valid in the case where
the array elements accessed on each loop iteration are adjacent in memory. The
explicit use of float4 can also improve the GPU performance, since it clearly
identifies contiguous 16-byte memory operations that can be more efficiently
coalesced.
Figure 6.10 is an example of an unrolled kernel that uses float4 for vectorization.
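A sketch in the spirit of that figure (variable and argument names are illustrative): the unrolled loop is rewritten with float4 so the CPU back-end emits packed SSE and the GPU sees contiguous 16-byte accesses.

kernel void sum_float4(global const float4 *inputA, global float4 *output, int iters)
{
    int gid = get_global_id(0) * iters;
    for (int i = 0; i < iters; ++i) {
        float4 Velm = inputA[gid + i] * 2.0f;
        output[gid + i] = Velm;
    }
}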
ALU and LDS access instructions are placed in the same clause. FETCH,
ALU/LDS, and STORE instructions are placed into separate clauses.
The GPU schedules a pair of wavefronts (referred to as the “even” and “odd”
wavefront). The even wavefront executes for four cycles (each cycle executes a
quarter-wavefront); then, the odd wavefront executes for four cycles. While the
odd wavefront is executing, the even wavefront accesses the register file and
prepares operands for execution. This fixed interleaving of two wavefronts allows
the hardware to efficiently hide the eight-cycle register-read latencies.
With the exception of the special treatment for even/odd wavefronts, the GPU
scheduler only switches wavefronts on clause boundaries. Latency within a
clause results in stalls on the hardware. For example, a wavefront that generates
an LDS bank conflict stalls on the compute unit until the LDS access completes;
the hardware does not try to hide this stall by switching to another available
wavefront.
The address calculations for FETCH and STORE instructions execute on the
same hardware in the compute unit as do the ALU clauses. The address
calculations for memory operations consume the same execution resources
that are used for floating-point computations.
The ISA dump shows the clause boundaries. See the example shown below.
For more information on clauses, see the AMD Evergreen-Family ISA Microcode
And Instructions (v1.0b) and the AMD R600/R700/Evergreen Assembly
Language Format documents.
No unrolling example:
#pragma unroll 1
for (int i = 0; i < n; i++) {
...
}
Partial unrolling example:
#pragma unroll 4
for (int i = 0; i < 128; i++) {
...
}
Currently, the unroll pragma requires that both loop bounds be known at compile
time. If the unroll factor is not specified and the compiler can determine the loop
count, the compiler fully unrolls the loop (the factor is effectively the number of
iterations). If the unroll factor is not specified and the compiler cannot determine
the loop count, the compiler does no unrolling.
Linear – A linear layout format arranges the data linearly in memory such
that element addresses are sequential. This is the layout that is familiar to
CPU programmers. This format must be used for OpenCL buffers; it can be
used for images.
Tiled – A tiled layout format has a pre-defined sequence of element blocks
arranged in sequential memory addresses (see Figure 6.11 for a conceptual
illustration). A microtile consists of ABIJ; a macrotile consists of the top-left
16 squares for which the arrows are red. Only images can use this format.
Translating from user address space to the tiled arrangement is transparent
to the user. Tiled memory layouts provide an optimized memory access
pattern to make more efficient use of the RAM attached to the GPU compute
device. This can contribute to lower latency.
[Figure 6.11: Example of a tiled layout format, showing the logical arrangement of elements A through X (three rows of eight) and the corresponding physical, tiled arrangement in memory.]
Memory access patterns in compute kernels are usually different from those in
the pixel shaders. Whereas the access pattern for pixel shaders is in a
hierarchical, space-filling curve pattern and is tuned for tiled memory
performance (generally for textures), the access pattern for a compute kernel is
linear across each row before moving to the next row in the global id space. This
has an effect on performance, since pixel shaders have implicit blocking, and
compute kernels do not. If accessing a tiled image, best performance is achieved
if the application tries to use workgroups as a simple blocking strategy.
Study the local memory (LDS) optimizations. These greatly affect the GPU
performance. Note the difference in the organization of local memory on the
GPU as compared to the CPU cache. Local memory is shared by many
work-items (64 on Cypress). This contrasts with a CPU cache that normally
is dedicated to a single work-item. GPU kernels run well when they
collaboratively load the shared memory.
GPUs have a large amount of raw compute horsepower, compared to
memory bandwidth and to “control flow” bandwidth. This leads to some high-
level differences in GPU programming strategy.
– A CPU-optimized algorithm may test branching conditions to minimize
the workload. On a GPU, it is frequently faster simply to execute the
workload.
– A CPU-optimized version can use memory to store and later load pre-
computed values. On a GPU, it frequently is faster to recompute values
rather than saving them in registers. Per-thread registers are a scarce
resource on the CPU; in contrast, GPUs have many available per-thread
register resources.
Use float4 and the OpenCL built-ins for vector types (vload, vstore, etc.).
These enable the AMD Accelerated Parallel Processing OpenCL
implementation to generate efficient, packed SSE instructions when running
on the CPU. Vectorization is an optimization that benefits both the AMD CPU
and GPU.
The CPU contains a vector unit, which can be efficiently used if the developer is
writing the code using vector data types.
For architectures before Bulldozer, the instruction set is called SSE, and the
vector width is 128 bits. For Bulldozer, the instruction set is called AVX, and the
vector width is increased to 256 bits.
Using four-wide vector types (int4, float4, etc.) is preferred, even with Bulldozer.
The CPU does not benefit much from local memory; sometimes using it is
detrimental to performance. Because local memory is emulated on the CPU using
the caches, accessing local memory and global memory have the same speed,
assuming the data from global memory is already in the cache.
There is also hardware support for OpenCL built-in functions that map to the new
hardware rotate instructions.
For example, an expression built from multiplies and adds can be written as a
composition of mad instructions, which use fused multiply-add (FMA):
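As a hedged illustration of composing mad operations (not the guide's own listing), a small polynomial evaluation can be expressed as nested mad() calls:

float poly(float x, float a, float b, float c)
{
    return mad(mad(a, x, b), x, c);   // (a*x + b)*x + c as two fused multiply-adds
}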
6.10.7.1 Clauses
The architecture for the 69XX series of GPUs is clause-based. A clause is similar
to a basic block: a sequence of instructions that executes without flow control or I/O.
The AMD APP KernelAnalyzer assembler listing lets you view clauses. Try the
optimizations listed here from inside the AMD APP KernelAnalyzer to see the
improvements in performance.
if(x==1) r=0.5;
if(x==2) r=1.0;
becomes
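One branch-free form this can take (a sketch, not necessarily the exact rewrite from the original listing) replaces the branches with conditional assignments that stay inside a single clause:

r = (x == 1) ? 0.5f : r;
r = (x == 2) ? 1.0f : r;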
Note that if the body of the if statement contains an I/O, the if statement cannot
be eliminated.
A conditional expression with many terms can compile into a number of clauses
due to the C-language requirement that expressions must short circuit. To
prevent this, move the expression out of the control flow statement. For example:
if(a&&b&&c&&d){…}
becomes
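A sketch of the hoisted form, which evaluates the combined condition once, outside the control-flow statement:

bool cond = a && b && c && d;   // evaluated up front, in straight-line code
if (cond) { ... }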
The same applies to conditional expressions used in loop constructs (do, while,
for).
If the loop bounds are known, and the loop is small (less than 16 or 32
instructions), unrolling the loop usually increases performance.
for loops can generate more clauses than equivalent do or while loops.
Experiment with these different loop types to find the one with the best performance.
The native hardware I/O transaction size is four words (float4, int4 types). Avoid
I/Os with smaller data, and rewrite the kernel to use the native size data. Kernel
performance increases, and only 25% as many work items need to be
dispatched.
Chapter 7
OpenCL Static C++
Programming Language
7.1 Overview
This extension defines the OpenCL Static C++ kernel language, which is a form
of the ISO/IEC C++ programming language specification [1]. This language
supports overloading and templates that can be resolved at compile time (hence
static), while restricting the use of language features that require dynamic or
runtime resolution. The language is also extended to support most of the features
described in Section 6 of the OpenCL spec: new data types (vectors, images,
samplers, etc.), OpenCL built-in functions, and more.
Note that support for templates and overloading greatly improves coding
efficiency: it allows developers to avoid replicating code where it is not necessary.
Using kernel template and kernel overloading requires support from the runtime
API as well. AMD provides a simple extension to clCreateKernel, which
enables the user to specify the desired kernel.
To support these cases, the following error codes were added; these can be
returned by clCreateKernel.
On the host side, the application creates the class and an equivalent memory
object with the same size (using the sizeof function). It then can use the class
methods to set or change values of the class members. When the class is ready,
the application uses a standard buffer API to move the class to the device (either
Unmap or Write), then sets the buffer object as the appropriate kernel argument
and enqueues the kernel for execution. When the kernel finishes the execution,
the application can map back (or read) the buffer object into the class and
continue working on it.
-x language — a build option that selects the source language for the compiler front end; for this extension, the language is clc++.
A class definition cannot contain any address space qualifier, either for members
or for methods:
class myClass{
public:
int myMethod1(){ return x;}
void __local myMethod2(){x = 0;}
private:
int x;
__local int y; // illegal
};
The class invocation inside a kernel, however, can be either in private or local
address space:
7.3.3 Namespaces
Namespaces are supported without change, as per [1].
7.3.4 Overloading
As defined in the C++ language specification, when two or more different
declarations are specified for a single name in the same scope, that name is said
to be overloaded. By extension, two declarations in the same scope that declare
the same name but with different types are called overloaded declarations. Only
kernel and function declarations can be overloaded, not object and type
declarations.
Also, the rules for well-formed programs as defined by Section 13 of the C++
language specification are lifted to apply to both kernel and function declarations.
The overloading resolution is per Section 13.1 of the C++ language specification,
but extended to account for vector types. The algorithm for “best viable function”,
Section 13.3.3 of the C++ language specification, is extended for vector types by
inducing a partial-ordering as a function of the partial-ordering of its elements.
Following the existing rules for vector types in the OpenCL 1.2 specification,
explicit conversion between vectors is not allowed. (This reduces the number of
possible overloaded functions with respect to vectors, but this is not expected to
be a particular burden to developers because explicit conversion can always be
applied at the point of function invocation.)
For overloaded kernels, the following syntax is used as part of the kernel name:
foo(type1,...,typen)
__attribute__((mangled_name(myMangledName)))
7.3.5 Templates
OpenCL C++ provides unrestricted support for C++ templates, as defined in
Section 14 of the C++ language specification. The arguments to templates are
extended to allow for all OpenCL base types, including vectors and pointers
qualified with OpenCL C address spaces (i.e. __global, __local, __private,
and __constant).
OpenCL C++ kernels (defined with __kernel) can be templated and can be
called from within an OpenCL C (C++) program or as an external entry point
(from the host).
For kernel templates, the following syntax is used as part of the kernel name
(assuming a kernel called foo):
foo<type1,...,typen>
foo<type1,...,typen>(typen+1,...,typem)
To support template kernels, the same mechanism for kernel overloading is used.
Use the following syntax:
__attribute__((mangled_name(myMangledName)))
7.3.6 Exceptions
Exceptions, as per Section 15 of the C++ language specification, are not
supported. The keywords try, catch, and throw are reserved, and the OpenCL
C++ compiler must produce a static compile time error if they are used in the
input program.
7.3.7 Libraries
Support for the general utilities library, as defined in Sections 20-21 of the C++
language specification, is not provided. The standard C++ libraries and STL
library are not supported.
7.4 Examples
7.4.1 Passing a Class from the Host to the Device and Back
The class definition must be the same in the host code and the device code,
except for the members' types in the case of vectors. If the class includes vector
data types, the definition must conform to the table that appears in Section 6.1.2
of the OpenCL Specification 1.2, Corresponding API type for OpenCL Language
types.
// Device (kernel-side) class definition:
class Test
{
public:
    void setX(int value);
private:
    int x;
};

// Host-side class definition (must match the device definition):
class Test
{
public:
    void setX(int value);
private:
    int x;
};

MyFunc()
{
    Test* tempClass = new Test();
    ... // Some OpenCL startup code – create context, queue, etc.
    cl_mem classObj = clCreateBuffer(context,
                                     CL_MEM_USE_HOST_PTR, sizeof(Test),
                                     tempClass, &err);
    clEnqueueMapBuffer(..., classObj, ...);
    tempClass->setX(10);
    clEnqueueUnmapBuffer(..., classObj, ...); // class is passed to the Device
    clEnqueueNDRange(..., fooKernel, ...);
    clEnqueueMapBuffer(..., classObj, ...);   // class is passed back to the Host
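The following sketch (illustrative; the exact original listing may differ) shows the kernel-side counterpart: a templated kernel with two explicit instantiations exported under mangled names, producing the testAddFloat4 and testAddInt8 entry points discussed below.

template <class T>
kernel void testAdd(global T *src1, global T *src2, global T *dst)
{
    int i = get_global_id(0);
    dst[i] = src1[i] + src2[i];
}

template __attribute__((mangled_name(testAddFloat4)))
kernel void testAdd(global float4 *src1, global float4 *src2, global float4 *dst);

template __attribute__((mangled_name(testAddInt8)))
kernel void testAdd(global int8 *src1, global int8 *src2, global int8 *dst);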
The names testAddFloat4 and testAddInt8 are the external names for the two
kernel instantiations. When calling clCreateKernel, passing one of these kernel
names leads to the correct overloaded kernel.
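For example, on the host (a minimal sketch):

cl_int err;
cl_kernel k = clCreateKernel(program, "testAddFloat4", &err);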
Appendix A
OpenCL Optional Extensions
The OpenCL extensions are associated with devices and can be queried for a
specific device. Extensions can also be queried for platforms, but in that case the
reported extensions are those supported by all devices in the platform.
The OpenCL Specification states that all API functions of the extension must
have names in the form of cl<FunctionName>KHR, cl<FunctionName>EXT, or
cl<FunctionName><VendorName>. All enumerated values must be in the form of
CL_<enum_name>_KHR, CL_<enum_name>_EXT, or
CL_<enum_name>_<VendorName>.
After the device list is retrieved, the extensions supported by each device can be
queried with function call clGetDeviceInfo() with parameter param_name being
set to enumerated value CL_DEVICE_EXTENSIONS.
The extensions are returned in a char string, with extension names separated by
a space. To see if an extension is present, search the string for a specified
substring.
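A host-side sketch of the query (the buffer size and the extension name searched for are illustrative):

char extensions[4096];
clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, sizeof(extensions), extensions, NULL);
if (strstr(extensions, "cl_amd_media_ops") != NULL) {
    // The extension is present on this device and can be enabled in kernels.
}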
The initial state of the compiler is set to ignore all extensions, as if it had been
explicitly set with the following directive:

#pragma OPENCL EXTENSION all : disable

This means that the extensions must be explicitly enabled to be used in kernel
programs.
Each extension that affects kernel code compilation must add a defined macro
with the name of the extension. This allows the kernel code to be compiled
differently, depending on whether the extension is supported and enabled, or not.
For example, for extension cl_khr_fp64 there should be a #define directive for
macro cl_khr_fp64, so that the following code can be preprocessed:
#ifdef cl_khr_fp64
// some code
#else
// some code
#endif
Calling clGetExtensionFunctionAddress returns the address of the extension
function specified by the FunctionName string. The returned value must be
appropriately cast to a function pointer type, specified in the extension spec and
header file.
A return value of NULL means that the specified function does not exist in the
CL implementation. A non-NULL return value does not guarantee that the
extension function actually exists – queries described in sec. 2 or 3 must be done
to ensure the extension is supported.
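For example (a sketch using the cl_khr_icd entry point; the typedef name is illustrative):

typedef cl_int (*pfn_clIcdGetPlatformIDsKHR)(cl_uint, cl_platform_id *, cl_uint *);
pfn_clIcdGetPlatformIDsKHR pfn = (pfn_clIcdGetPlatformIDsKHR)
    clGetExtensionFunctionAddress("clIcdGetPlatformIDsKHR");
if (pfn == NULL) {
    // The implementation does not expose this extension function.
}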
A.8.1 cl_amd_fp64
Before using double data types, double-precision floating point operators, and/or
double-precision floating point routines in OpenCL™ C kernels, include the
#pragma OPENCL EXTENSION cl_amd_fp64 : enable directive. See Table A.1
for a list of supported routines.
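A minimal kernel sketch using the pragma (kernel name is illustrative):

#pragma OPENCL EXTENSION cl_amd_fp64 : enable

kernel void daxpy(global double *y, global const double *x, double a)
{
    size_t i = get_global_id(0);
    y[i] = a * x[i] + y[i];
}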
A.8.2 cl_amd_vec3
This extension adds support for vectors with three elements: float3, short3,
char3, etc. This data type was added to OpenCL 1.1 as a core feature. For more
details, see section 6.1.2 in the OpenCL 1.1 or OpenCL 1.2 spec.
A.8.3 cl_amd_device_persistent_memory
This extension adds support for the new buffer and image creation flag
CL_MEM_USE_PERSISTENT_MEM_AMD. Buffers and images allocated with this flag
reside in host-visible device memory. This flag is mutually exclusive with the flags
CL_MEM_ALLOC_HOST_PTR and CL_MEM_USE_HOST_PTR.
A.8.4 cl_amd_device_attribute_query
This extension provides a means to query AMD-specific device attributes. To
enable this extension, include the #pragma OPENCL EXTENSION
cl_amd_device_attribute_query : enable directive. Once the extension is
enabled, and the clGetDeviceInfo parameter <param_name> is set to
CL_DEVICE_PROFILING_TIMER_OFFSET_AMD, the offset in nanoseconds between
an event timestamp and Epoch is returned.
A.8.4.1 cl_device_profiling_timer_offset_amd
This query enables the developer to get the offset between event timestamps in
nanoseconds. To use it, compile the kernels with the #pragma OPENCL
EXTENSION cl_amd_device_attribute_query : enable directive. For
kernels compiled with this pragma, calling clGetDeviceInfo with <param_name>
set to CL_DEVICE_PROFILING_TIMER_OFFSET_AMD returns the offset in
nanoseconds between event timestamps.
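A host-side sketch of the query (the result type is assumed here to be cl_ulong):

cl_ulong offset_ns = 0;
clGetDeviceInfo(device, CL_DEVICE_PROFILING_TIMER_OFFSET_AMD,
                sizeof(offset_ns), &offset_ns, NULL);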
A.8.4.2 cl_amd_device_topology
This query enables the developer to get a description of the topology used to
connect the device to the host. Currently, this query works only in Linux. Calling
clGetDeviceInfo with <param_name> set to CL_DEVICE_TOPOLOGY_AMD returns
the following union of structures.
typedef union
{
    struct { cl_uint type; cl_uint data[5]; } raw;
    struct { cl_uint type; cl_char unused[17];
             cl_char bus; cl_char device; cl_char function; } pcie;
} cl_device_topology_amd;
The type of the structure returned can be queried by reading the first unsigned
int of the returned data. The developer can use this type to cast the returned
union into the right structure type.
Currently, the only supported type in the structure above is PCIe (type value =
1). The information returned contains the PCI Bus/Device/Function of the device,
and is similar to the result of the lspci command in Linux. It enables the
developer to match between the OpenCL device ID and the physical PCI
connection of the card.
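A host-side sketch of the query (the PCIe type value of 1 follows the description above):

cl_device_topology_amd topology;
clGetDeviceInfo(device, CL_DEVICE_TOPOLOGY_AMD, sizeof(topology), &topology, NULL);
if (topology.raw.type == 1) {   // PCIe
    printf("PCI %02x:%02x.%x\n",
           (unsigned)topology.pcie.bus,
           (unsigned)topology.pcie.device,
           (unsigned)topology.pcie.function);
}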
A.8.4.3 cl_amd_device_board_name
This query enables the developer to get the name of the GPU board and model
of the specific device. Currently, this is only for GPU devices.
A.8.5 cl_amd_compile_options
This extension adds the following options, which are not part of the OpenCL
specification.
-g — This is an experimental feature that lets you use the GNU project
debugger, GDB, to debug kernels on x86 CPUs running Linux or
cygwin/minGW under Windows. For more details, see Chapter 3, “Debugging
OpenCL.” This option does not affect the default optimization of the OpenCL
code.
-O0 — Specifies to the compiler not to optimize. This is equivalent to the
OpenCL standard option -cl-opt-disable.
-f[no-]bin-source — Does [not] generate OpenCL source in the .source
section. For more information, see Appendix E, “OpenCL Binary Image
Format (BIF) v2.0.”
-f[no-]bin-llvmir — Does [not] generate LLVM IR in the .llvmir section.
For more information, see Appendix E, “OpenCL Binary Image Format (BIF)
v2.0.”
-f[no-]bin-amdil — Does [not] generate AMD IL in the .amdil section.
For more information, see Appendix E, “OpenCL Binary Image Format (BIF)
v2.0.”
-f[no-]bin-exe — Does [not] generate the executable (ISA) in .text section.
For more information, see Appendix E, “OpenCL Binary Image Format (BIF)
v2.0.”
To avoid source changes, there are two environment variables that can be used
to change CL options at runtime.
A.8.6 cl_amd_offline_devices
To generate binary images offline, it is necessary to access the compiler for every
device that the runtime supports, even if the device is currently not installed on
the system. When, during context creation, CL_CONTEXT_OFFLINE_DEVICES_AMD
is passed in the context properties, all supported devices, whether online or
offline, are reported and can be used to create OpenCL binary images.
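A host-side sketch of context creation with the offline-devices property:

cl_context_properties props[] = {
    CL_CONTEXT_PLATFORM, (cl_context_properties)platform,
    CL_CONTEXT_OFFLINE_DEVICES_AMD, (cl_context_properties)1,
    0
};
cl_context ctx = clCreateContextFromType(props, CL_DEVICE_TYPE_ALL,
                                         NULL, NULL, &status);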
A.8.7 cl_amd_event_callback
This extension provides the ability to register event callbacks for states other than
cl_complete. The full set of event states are allowed: cl_queued,
cl_submitted, and cl_running. This extension is enabled automatically and
does not need to be explicitly enabled through #pragma when using the SDK v2
of AMD Accelerated Parallel Processing.
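A host-side sketch registering a callback for the CL_SUBMITTED state (the callback body is illustrative):

void CL_CALLBACK on_submitted(cl_event e, cl_int exec_status, void *user_data)
{
    // For example, record a timestamp when the command is submitted.
}
...
clSetEventCallback(event, CL_SUBMITTED, on_submitted, NULL);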
A.8.8 cl_amd_popcnt
This extension introduces a “population count” function called popcnt. This
extension was taken into core OpenCL 1.2, and the function was renamed
popcount. The core 1.2 popcount function (documented in section 6.12.3 of the
OpenCL Specification) is identical to the AMD extension popcnt function.
A.8.9 cl_amd_media_ops
This extension adds the following built-in functions to the OpenCL language.
Note: For OpenCL scalar types, n = 1; for vector types, it is {2, 4, 8, or 16}.
Return value
((((uint)src[0]) & 0xFF) << 0) +
((((uint)src[1]) & 0xFF) << 8) +
((((uint)src[2]) & 0xFF) << 16) +
((((uint)src[3]) & 0xFF) << 24)
A.8.10 cl_amd_media_ops2
This extension adds further built-in functions to those of cl_amd_media_ops.
When enabled, it adds the following built-in functions to the OpenCL language.
Built-in Function: uintn amd_msad (uintn src0, uintn src1, uintn src2)
Description:
uchar4 src0u8 = as_uchar4(src0.s0);
uchar4 src1u8 = as_uchar4(src1.s0);
dst.s0 = src2.s0 +
((src1u8.s0 == 0) ? 0 : abs(src0u8.s0 - src1u8.s0)) +
((src1u8.s1 == 0) ? 0 : abs(src0u8.s1 - src1u8.s1)) +
((src1u8.s2 == 0) ? 0 : abs(src0u8.s2 - src1u8.s2)) +
((src1u8.s3 == 0) ? 0 : abs(src0u8.s3 - src1u8.s3));
A similar operation is applied to other components of the vectors.
Built-in Function: ulongn amd_qsad (ulongn src0, uintn src1, ulongn src2)
Description:
uchar8 src0u8 = as_uchar8(src0.s0);
ushort4 src2u16 = as_ushort4(src2.s0);
ushort4 dstu16;
dstu16.s0 = amd_sad(as_uint(src0u8.s0123), src1.s0, src2u16.s0);
dstu16.s1 = amd_sad(as_uint(src0u8.s1234), src1.s0, src2u16.s1);
dstu16.s2 = amd_sad(as_uint(src0u8.s2345), src1.s0, src2u16.s2);
dstu16.s3 = amd_sad(as_uint(src0u8.s3456), src1.s0, src2u16.s3);
dst.s0 = as_uint2(dstu16);
A similar operation is applied to other components of the vectors.
Built-in Function:
ulongn amd_mqsad (ulongn src0, uintn src1, ulongn src2)
Description:
uchar8 src0u8 = as_uchar8(src0.s0);
ushort4 src2u16 = as_ushort4(src2.s0);
ushort4 dstu16;
dstu16.s0 = amd_msad(as_uint(src0u8.s0123), src1.s0, src2u16.s0);
dstu16.s1 = amd_msad(as_uint(src0u8.s1234), src1.s0, src2u16.s1);
dstu16.s2 = amd_msad(as_uint(src0u8.s2345), src1.s0, src2u16.s2);
dstu16.s3 = amd_msad(as_uint(src0u8.s3456), src1.s0, src2u16.s3);
dst.s0 = as_uint2(dstu16);
A similar operation is applied to other components of the vectors.
Built-in Function: uintn amd_sadw (uintn src0, uintn src1, uintn src2)
Description:
ushort2 src0u16 = as_ushort2(src0.s0);
ushort2 src1u16 = as_ushort2(src1.s0);
dst.s0 = src2.s0 +
abs(src0u16.s0 - src1u16.s0) +
abs(src0u16.s1 - src1u16.s1);
A similar operation is applied to other components of the vectors.
Built-in Function: uintn amd_sadd (uintn src0, uintn src1, uintn src2)
Description:
Built-in Function: uintn amd_bfe (uintn src0, uintn src1, uintn src2)
Description:
NOTE: The >> operator represents a logical right shift.
offset = src1.s0 & 31;
width = src2.s0 & 31;
if (width == 0)
    dst.s0 = 0;
else if ((offset + width) < 32)
    dst.s0 = (src0.s0 << (32 - offset - width)) >> (32 - width);
else
    dst.s0 = src0.s0 >> offset;
A similar operation is applied to other components of the vectors.
Built-in Function: intn amd_bfe (intn src0, uintn src1, uintn src2)
Description:
NOTE: The >> operator represents an arithmetic right shift.
offset = src1.s0 & 31;
width = src2.s0 & 31;
if (width == 0)
    dst.s0 = 0;
else if ((offset + width) < 32)
    dst.s0 = (src0.s0 << (32 - offset - width)) >> (32 - width);
else
    dst.s0 = src0.s0 >> offset;
A similar operation is applied to other components of the vectors.
Built-in Function:
intn amd_median3 (intn src0, intn src1, intn src2)
uintn amd_median3 (uintn src0, uintn src1, uintn src2)
floatn amd_median3 (floatn src0, floatn src1, floatn src2)
Description:
Built-in Function:
intn amd_min3 (intn src0, intn src1, intn src2)
uintn amd_min3 (uintn src0, uintn src1, uintn src2)
floatn amd_min3 (floatn src0, floatn src1, floatn src2)
Description:
Returns min of src0, src1, and src2.
Built-in Function:
intn amd_max3 (intn src0, intn src1, intn src2)
uintn amd_max3 (uintn src0, uintn src1, uintn src2)
floatn amd_max3 (floatn src0, floatn src1, floatn src2)
Description:
Returns max of src0, src1, and src2.
A.8.11 cl_amd_printf
The OpenCL Specification 1.1 and 1.2 support the optional AMD extension
cl_amd_printf, which provides printf capabilities to OpenCL C programs. To use
this extension, an application first must include
#pragma OPENCL EXTENSION cl_amd_printf : enable.
Built-in function:
printf(__constant char * restrict format, …);
This function writes output to the stdout stream associated with the
host application. The format string closely matches the definition found in the
C99 standard; conversions introduced in the format string with % are supported
with C99-like semantics.
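A minimal kernel sketch (kernel name is illustrative):

#pragma OPENCL EXTENSION cl_amd_printf : enable

kernel void print_values(global const float *data)
{
    size_t gid = get_global_id(0);
    printf("work-item %u: %f\n", (unsigned)gid, data[gid]);
}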
A.8.12 cl_amd_predefined_macros
The following macros are predefined when compiling OpenCL™ C kernels.
These macros are defined automatically based on the device for which the code
is being compiled.
GPU devices:
__WinterPark__
__BeaverCreek__
__Turks__
__Caicos__
__Tahiti__
__Pitcairn__
__Capeverde__
__Cayman__
__Barts__
__Cypress__
__Juniper__
__Redwood__
__Cedar__
__ATI_RV770__
__ATI_RV730__
__ATI_RV710__
__Loveland__
__GPU__
CPU devices:
__CPU__
__X86__
__X86_64__
Note that __GPU__ or __CPU__ are predefined whenever a GPU or CPU device
is the compilation target.
return "X86-64CPU";
#elif defined(__CPU__)
return "GenericCPU";
#else
return "UnknownDevice";
#endif
}
kernel void test_pf(global int* a)
{
printf("Device Name: %s\n", getDeviceName());
}
A.8.13 cl_amd_bus_addressable_memory
This extension defines an API for peer-to-peer transfers between AMD GPUs
and other PCIe device, such as third-party SDI I/O devices. Peer-to-peer
transfers have extremely low latencies by not having to use the host’s main
memory or the CPU (see Figure A.1). This extension allows sharing a memory
allocated by the graphics driver to be used by other devices on the PCIe bus
(peer-to-peer transfers) by exposing a write-only bus address. It also allows
memory allocated on other PCIe devices (non-AMD GPU) to be directly
accessed by AMD GPUs. One possible use of this is for a video capture device
to directly write into the GPU memory using its DMA.This extension is supported
only on AMD FirePro™ professional graphics cards.
Table A.2 Extension Support for Older AMD GPUs and CPUs
Extension                          Juniper1   Redwood2   Cedar3   x86 CPU with SSE2 or later
cl_khr_*_atomics Yes Yes Yes Yes
cl_ext_atomic_counters_32 Yes Yes Yes No
cl_khr_gl_sharing Yes Yes Yes Yes
cl_khr_byte_addressable_store Yes Yes Yes Yes
cl_ext_device_fission No No No Yes
cl_amd_device_attribute_query Yes Yes Yes Yes
cl_khr_fp64 No No No Yes
cl_amd_fp644 No No No Yes
cl_amd_vec3 Yes Yes Yes Yes
Images Yes Yes Yes Yes5
cl_khr_d3d10_sharing Yes Yes Yes Yes
cl_amd_media_ops Yes Yes Yes Yes
cl_amd_media_ops2 Yes Yes Yes Yes
cl_amd_printf Yes Yes Yes Yes
cl_amd_popcnt Yes Yes Yes Yes
cl_khr_3d_image_writes Yes Yes Yes No
Platform Extensions
cl_khr_icd Yes Yes Yes Yes
cl_amd_event_callback Yes Yes Yes Yes
cl_amd_offline_devices Yes Yes Yes Yes
1. ATI Radeon™ HD 5700 series, AMD Mobility Radeon™ HD 5800 series, AMD FirePro™ V5800 series, AMD Mobility FirePro™ M7820.
2. ATI Radeon™ HD 5600 Series, ATI Radeon™ HD 5500 Series, AMD Mobility Radeon™ HD 5700 Series, AMD Mobility Radeon™ HD 5600 Series, AMD FirePro™ V4800 Series, AMD FirePro™ V3800 Series, AMD Mobility FirePro™ M5800
3. ATI Radeon™ HD 5400 Series, AMD Mobility Radeon™ HD 5400 Series
4. Available on all devices that have double-precision, including all Southern Island devices.
5. Environment variable CPU_IMAGE_SUPPORT must be set.
Appendix B
The OpenCL Installable Client Driver
(ICD)
The OpenCL Installable Client Driver (ICD) is part of the AMD Accelerated
Parallel Processing software stack. Code written prior to SDK v2.0 must be
changed to comply with OpenCL ICD requirements.
B.1 Overview
The ICD allows multiple OpenCL implementations to co-exist; also, it allows
applications to select between these implementations at runtime.
context = clCreateContextFromType(
0,
dType,
NULL,
NULL,
&status);
/*
* Have a look at the available platforms and pick either
* the AMD one if available or a reasonable default.
*/
cl_uint numPlatforms;
cl_platform_id platform = NULL;
status = clGetPlatformIDs(0, NULL, &numPlatforms);
if(!sampleCommon->checkVal(status,
CL_SUCCESS,
"clGetPlatformIDs failed."))
{
return SDK_FAILURE;
}
if (0 < numPlatforms)
{
cl_platform_id* platforms = new cl_platform_id[numPlatforms];
status = clGetPlatformIDs(numPlatforms, platforms, NULL);
if(!sampleCommon->checkVal(status,
CL_SUCCESS,
"clGetPlatformIDs failed."))
{
return SDK_FAILURE;
}
for (unsigned i = 0; i < numPlatforms; ++i)
{
char pbuf[100];
status = clGetPlatformInfo(platforms[i],
CL_PLATFORM_VENDOR,
sizeof(pbuf),
pbuf,
NULL);
if(!sampleCommon->checkVal(status,
CL_SUCCESS,
"clGetPlatformInfo failed."))
{
return SDK_FAILURE;
}
platform = platforms[i];
if (!strcmp(pbuf, "Advanced Micro Devices, Inc."))
{
break;
}
}
delete[] platforms;
}
/*
 * If we could find our platform, use it. Otherwise pass a NULL and
 * get whatever the implementation thinks we should be using.
 */
cl_context_properties cps[3] =
{
CL_CONTEXT_PLATFORM,
(cl_context_properties)platform,
0
};
/* Use NULL for backward compatibility */
cl_context_properties* cprops = (NULL == platform) ? NULL : cps;
context = clCreateContextFromType(
cprops,
dType,
NULL,
NULL,
&status);
NOTE: It is recommended that the host code look at the platform vendor string
when searching for the desired OpenCL platform, instead of using the platform
name string. The platform name string might change, whereas the platform
vendor string remains constant for a particular vendor’s implementation.
Appendix C
Compute Kernel
C.2 Indexing
A primary difference between a compute kernel and a pixel shader is the
indexing mode. In a pixel shader, indexing is done through the vWinCoord
register and is directly related to the output domain (frame buffer size and
geometry) specified by the user space program. This domain is usually in the
Euclidean space and specified by a pair of coordinates. In a compute kernel,
however, this changes: the indexing method is switched to a linear index between
one and three dimensions, as specified by the user. This gives the programmer
more flexibility when writing kernels.
Indexing is done through the vaTid register, which stands for absolute work-item
id. This value is linear: from 0 to N-1, where N is the number of work-items
requested by the user space program to be executed on the GPU compute
device. Two other indexing variables, vTid and vTgroupid, are derived from
settings in the kernel and vaTid.
In SDK 1.4 and later, new indexing variables are introduced for either 3D spawn
or 1D spawn. The 1D indexing variables are still valid, but replaced with
vAbsTidFlat, vThreadGrpIdFlat, and vTidInGrpFlat, respectively. The 3D versions
are vAbsTid, vThreadGrpId, and vTidInGrp. The 3D versions have their
respective positions in each dimension in the x, y, and z components. The w
component is not used. If the group size for a dimension is not specified, it is an
implicit 1. The 1D version has the dimension replicated over all components.
il_ps_2_0
dcl_input_position_interp(linear_noperspective) vWinCoord0.xy__
dcl_output_generic o0
dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
sample_resource(0)_sampler(0) o0, vWinCoord0.yx
end
Figure C.1 shows the performance results of using a pixel shader for this matrix
transpose.
[Figure C.1: Pixel shader (PS) matrix transpose performance, bandwidth in GB/s versus matrix size.]
il_cs_2_0
dcl_num_threads_per_group 64
dcl_cb cb0[1]
dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
umod r0.x, vAbsTidFlat.x, cb0[0].x
udiv r0.y, vAbsTidFlat.x, cb0[0].x
sample_resource(0)_sampler(0) r1, r0.yx
mov g[vAbsTidFlat.x], r1
end
Figure C.2 shows the performance results using a compute kernel for this matrix
transpose.
[Figure C.2: Compute kernel (CS) matrix transpose performance, bandwidth in GB/s versus matrix size.]
[Figure: LDS-based matrix transpose performance, bandwidth in GB/s versus matrix size.]
Appendix D
Device Parameters
On the following pages, Table D.2 through Table D.7 provide device-specific
information for AMD GPUs.
Table D.6 Parameters for 56xx, 57xx, 58xx, Eyefinity6, and 59xx Devices
Table D.7 Parameters for Exxx, Cxx, 54xx, and 55xx Devices
Product Name (ATI Radeon HD)   Zacate E-350   Zacate E-240   Ontario C-50   Ontario C-30   Cedar 5450   Redwood PRO2 5550   Redwood PRO 5570
Engine Speed (MHz) 492 500 276 277 650 550 650
Compute Resources
Compute Units 2 2 2 2 2 4 5
Stream Cores 16 16 16 16 16 64 80
Processing Elements 80 80 80 80 80 320 400
Peak Gflops 78.72 80 44.16 44.32 104 352 520
Cache and Register Sizes
# of Vector Registers/CU 8192 8192 8192 8192 8192 16384 16384
Size of Vector Registers/CU 128 kB 128 kB 128 kB 128 kB 128 kB 256 kB 256 kB
LDS Size/ CU 32 kB 32 kB 32 kB 32 kB 32k 32k 32k
LDS Banks / CU 16 16 16 16 16 16 16
Constant Cache / GPU 4 kB 4 kB 4 kB 4 kB 4k 16k 16k
Max Constants / CU 4 kB 4 kB 4 kB 4 kB 4k 8k 8k
L1 Cache Size / CU 8 kB 8 kB 8 kB 8 kB 8k 8k 8k
L2 Cache Size / GPU 64 kB 64 kB 64 kB 64 kB 64k 128k 128k
Peak GPU Bandwidths
Register Read (GB/s) 378 384 212 213 499 1690 2496
LDS Read (GB/s) 63 64 35 35 83 141 208
Constant Cache Read (GB/s) 126 128 71 71 166 563 832
L1 Read (GB/s) 63 64 35 35 83 141 208
L2 Read (GB/s) 63 64 35 35 83 141 166
Global Memory (GB/s) 9 9 9 9 13 26 29
Global Limits
Max Wavefronts / GPU 192 192 192 192 192 248 248
Max Wavefronts / CU (avg) 96.0 96.0 96.0 96.0 96.0 62.0 49.6
Max Work-Items / GPU 6144 6144 6144 6144 6144 15872 15872
Memory
Memory Channels 2 2 2 2 2 4 4
Memory Bus Width (bits) 64 64 64 64 64 128 128
Memory Type and Speed (MHz)   DDR3 533   DDR3 533   DDR3 533   DDR3 533   DDR3 800   DDR3 800   DDR3 900
Frame Buffer   Shared Memory   Shared Memory   Shared Memory   Shared Memory   1 GB / 512 MB   1 GB / 512 MB   1 GB / 512 MB
Appendix E
OpenCL Binary Image Format (BIF)
v2.0
E.1 Overview
OpenCL Binary Image Format (BIF) 2.0 is in the ELF format. BIF2.0 allows the
OpenCL binary to contain the OpenCL source program, the LLVM IR, and the
executable. The BIF defines the following special sections:
The BIF can have other special sections for debugging, etc. It also contains
several ELF special sections, such as:
By default, OpenCL generates a binary that has the LLVM IR and the executable for
the GPU (.llvmir, .amdil, and .text sections), as well as the LLVM IR and the
executable for the CPU (.llvmir and .text sections). The BIF binary always
contains a .comment section, which is a readable C string. The default behavior
can be changed with the BIF options described in Section E.2, “BIF Options,”
page E-3.
The LLVM IR enables recompilation from LLVM IR to the target. When a binary
is used to run on a device for which the original program was not generated and
the original device is feature-compatible with the current device, OpenCL
recompiles the LLVM IR to generate a new code for the device. Note that the
LLVM IR is only universal within devices that are feature-compatible in the same
device type, not across different device types. This means that the LLVM IR for
the CPU is not compatible with the LLVM IR for the GPU. The LLVM IR for a
GPU works only for GPU devices that have equivalent feature sets.
The fields not shown in Table E.1 are given values according to the ELF
Specification. The e_machine value is defined as one of the oclElfTargets
enumerants.
E.1.2 Bitness
The BIF can be either 32-bit ELF format or a 64-bit ELF format. For the GPU,
OpenCL generates a 32-bit BIF binary; it can read either 32-bit BIF or 64-bit BIF
binary. For the CPU, OpenCL generates and reads only 32-bit BIF binaries if the
host application is 32-bit (on either 32-bit OS or 64-bit OS). It generates and
reads only 64-bit BIF binary if the host application is 64-bit (on 64-bit OS).
By default, OpenCL generates the .llvmir section, .amdil section, and .text
section. The following are examples for using these options:
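As a hedged illustration (the exact examples may differ from the original listing), the section-selection flags described in Appendix A can be passed as build options:

clBuildProgram(program, 0, NULL, "-fno-bin-llvmir -fno-bin-amdil", NULL, NULL); // keep only the executable (.text)
clBuildProgram(program, 0, NULL, "-fno-bin-exe -fno-bin-amdil", NULL, NULL);    // keep only the LLVM IR (.llvmir)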
This binary can recompile for all the other devices of the same device type.
Appendix F
Open Decode API Tutorial
F.1 Overview
This section provides a basic tutorial for using the sample program for Open
Decode. The Open Decode API provides the ability to access the hardware for
fixed-function decoding using the AMD Unified Video Decoder block on the GPU
for decoding H.264 video.
The following is an introduction for the OpenCL programmer to start using UVD
hardware; it shows how to perform a decode using the Open Video Decode API.
Open Decode allows the decompression to take place on the GPU, where the
OpenCL buffers reside. This lets applications perform post-processing
operations on the decompressed data on the GPU prior to rendering the frames.
F.2 Initializing
The first step in using the Open Decode is to get the Device Info and capabilities
through OVDecodeGetDeviceInfo and OVDecodeGetDeviceCap.
OVDecodeGetDeviceInfo obtains the information about the device(s) and
initializes the UVD hardware and firmware. As a result of the call to
OVDecodeGetDeviceCaps, the deviceInfo data structure provides the supported
output format and the compression profile. The application then can verify that
these values support the requested decode. The following code snippet shows
the use of OVDecodeGetDeviceInfo and OVDecodeGetDeviceCap.
intptr_t properties[] =
{
CL_CONTEXT_PLATFORM, (cl_context_properties)platform,
0
};
ovdContext = clCreateContext(properties,
1,
&clDeviceID,
0,
0,
&err);
From the capabilities that you have confirmed above in step 1, you now specify
the decode profile (H.264) and the output format (NV12 Interleaved). The height
and width also are specified. This can be obtained by parsing the data from the
input stream.
session = OVDecodeCreateSession(
ovdContext,
ovDeviceID,
profile,
oFormat,
oWidth,
oHeight);
F.5 Decoding
Decode execution goes through OpenCL and starts the UVD decode function.
cl_cmd_queue = clCreateCommandQueue((cl_context)ovdContext,
clDeviceID,
0,
&err);
This section demonstrates how the frame info setup can be done with the
information read in from the video frame parameters.
output_surface = clCreateBuffer((cl_context)ovdContext,
CL_MEM_READ_WRITE,
host_ptr_size,
NULL,
&err);
The sample demonstrates how data can be read to provide Open Decode
with the information needed. Details can be obtained by reviewing the
sample routine ‘ReadPictureData’, which fills in the values needed to pass to
OVDecodePicture.
ReadPictureData(iFramesDecoded,
&picture_parameter,
&pic_parameter_2,
pic_parameter_2_size,
bitstream_data,
&bitstream_data_read_size,
bitstream_data_max_size,
slice_data_control,
slice_data_control_size);
OPEventHandle eventRunVideoProgram;
OVresult res = OVDecodePicture(session,
&picture_parameter,
&pic_parameter_2,
pic_parameter_2_size,
&bitstream_data,
bitstream_data_read_size,
slice_data_control,
slice_data_control_size,
output_surface,
num_event_in_wait_list,
NULL,
&eventRunVideoProgram,
0);
The final step is to release the resources: destroying the decode session releases the
driver session and all internal resources, and also sets the UVD clock to idle state.
The following code snippet shows how this is done.
err = clReleaseMemObject((cl_mem)output_surface);
err = clReleaseContext((cl_context)ovdContext);
Appendix G
OpenCL-OpenGL Interoperability
Using GLUT
1. Use glutInit to initialize the GLUT library and negotiate a session with the
windowing system. This function also processes the command line options,
depending on the windowing system.
2. Use wglGetCurrentContext to get the current rendering GL context
(HGLRC) of the calling thread.
3. Use wglGetCurrentDC to get the device context (HDC) that is associated
with the current OpenGL rendering context of the calling thread.
4. Use the clGetGLContextInfoKHR function (see Section 9.7 of the OpenCL
Specification 1.1) with the CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR parameter to
get the device ID of the CL device associated with the OpenGL context.
5. Use clCreateContext (See Section 4.3 of the OpenCL Specification 1.1) to
create the CL context (of type cl_context).
The following code snippet shows how to create an interoperability context
using GLUT on a single-GPU system.
glutInit(&argc, argv);
glutInitDisplayMode(GLUT_RGBA | GLUT_DOUBLE);
glutInitWindowSize(WINDOW_WIDTH, WINDOW_HEIGHT);
glutCreateWindow("OpenCL SimpleGL");
cl_context_properties cpsGL[] =
{
    CL_CONTEXT_PLATFORM, (cl_context_properties)platform,
    CL_WGL_HDC_KHR, (intptr_t) wglGetCurrentDC(),
    CL_GL_CONTEXT_KHR, (intptr_t) wglGetCurrentContext(),
    0
};
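The listing ends with the properties list; per steps 4 and 5, the flow completes by
querying the interoperable device and creating the CL context on it. A minimal
sketch (assuming a cl_int status variable):
cl_device_id interopDevice;
status = clGetGLContextInfoKHR(cpsGL,
                               CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR,
                               sizeof(cl_device_id),
                               &interopDevice,
                               NULL);
cl_context context = clCreateContext(cpsGL, 1, &interopDevice, NULL, NULL, &status);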
Using the Win32 Windowing API
1. Use CreateWindow for window creation and get the window handle (HWND).
2. Use GetDC to get a handle to the device context (HDC) for the client area of a
specific window, or for the entire screen. Alternatively, use the CreateDC
function to create a device context for the specified device.
3. Use ChoosePixelFormat to match an appropriate pixel format supported by
the device context to a given pixel format specification.
4. Use SetPixelFormat to set the pixel format of the specified device context
to the format specified.
5. Use wglCreateContext to create a new OpenGL rendering context from
device context (HDC).
6. Use wglMakeCurrent to bind the GL context created in the above step as
the current rendering context.
7. Use the clGetGLContextInfoKHR function (see Section 9.7 of the OpenCL
Specification 1.1) with the CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR parameter
to get the device ID of the CL device associated with the OpenGL context.
8. Use clCreateContext function (see Section 4.3 of the OpenCL Specification
1.1) to create the CL context (of type cl_context).
The following code snippet shows how to create an interoperability context using
the Win32 windowing API. (Users also can refer to the SimpleGL sample in the
AMD APP SDK samples.)
int pfmt;
PIXELFORMATDESCRIPTOR pfd;
// Clear the descriptor before filling it in.
ZeroMemory(&pfd, sizeof(PIXELFORMATDESCRIPTOR));
pfd.nSize = sizeof(PIXELFORMATDESCRIPTOR);
pfd.nVersion = 1;
pfd.dwFlags = PFD_DRAW_TO_WINDOW | PFD_SUPPORT_OPENGL | PFD_DOUBLEBUFFER;
pfd.iPixelType = PFD_TYPE_RGBA;
pfd.cColorBits = 24;
pfd.cRedBits = 8;
pfd.cRedShift = 0;
pfd.cGreenBits = 8;
pfd.cGreenShift = 0;
pfd.cBlueBits = 8;
pfd.cBlueShift = 0;
pfd.cAlphaBits = 8;
pfd.cAlphaShift = 0;
pfd.cAccumBits = 0;
pfd.cAccumRedBits = 0;
pfd.cAccumGreenBits = 0;
pfd.cAccumBlueBits = 0;
pfd.cAccumAlphaBits = 0;
pfd.cDepthBits = 24;
pfd.cStencilBits = 8;
pfd.cAuxBuffers = 0;
pfd.iLayerType = PFD_MAIN_PLANE;
pfd.bReserved = 0;
pfd.dwLayerMask = 0;
pfd.dwVisibleMask = 0;
pfd.dwDamageMask = 0;
WNDCLASS windowclass;
windowclass.style = CS_OWNDC;
windowclass.lpfnWndProc = WndProc;
windowclass.cbClsExtra = 0;
windowclass.cbWndExtra = 0;
windowclass.hInstance = NULL;
windowclass.hIcon = LoadIcon(NULL, IDI_APPLICATION);
windowclass.hCursor = LoadCursor(NULL, IDC_ARROW);
windowclass.hbrBackground = (HBRUSH)GetStockObject(BLACK_BRUSH);
windowclass.lpszMenuName = NULL;
windowclass.lpszClassName = reinterpret_cast<LPCSTR>("SimpleGL");
RegisterClass(&windowclass);
gHwnd = CreateWindow(reinterpret_cast<LPCSTR>("SimpleGL"),
reinterpret_cast<LPCSTR>("SimpleGL"),
WS_CAPTION | WS_POPUPWINDOW | WS_VISIBLE,
0,
0,
screenWidth,
screenHeight,
NULL,
NULL,
windowclass.hInstance,
NULL);
hDC = GetDC(gHwnd);
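// Steps 3 and 4 above select and set the pixel format between GetDC and
// wglCreateContext; a minimal sketch (pfmt and pfd are declared earlier):
pfmt = ChoosePixelFormat(hDC, &pfd);
SetPixelFormat(hDC, pfmt, &pfd);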
hRC = wglCreateContext(hDC);
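// Per step 6, bind the new GL context before querying the associated CL device:
wglMakeCurrent(hDC, hRC);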
cl_context_properties properties[] =
{
CL_CONTEXT_PLATFORM,
(cl_context_properties) platform,
CL_GL_CONTEXT_KHR, (cl_context_properties) hRC,
CL_WGL_HDC_KHR, (cl_context_properties) hDC,
0
};
status = clGetGLContextInfoKHR(properties,
CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR,
sizeof(cl_device_id),
&interopDevice,
NULL);
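Per step 8, the device just retrieved is used to create the CL context; a minimal
sketch (status is the cl_int used above):
cl_context context = clCreateContext(properties, 1, &interopDevice, NULL, NULL, &status);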
In a multi-GPU environment, enumerate the display devices and find the one
whose OpenGL context has an associated CL device:
1. Use EnumDisplayDevices to obtain information about the display devices in
the current session.
2. To query all display devices in the current session, call this function in a loop,
starting with DevNum set to 0 and incrementing DevNum until the function fails.
To select all display devices in the desktop, use only the display devices that
have the DISPLAY_DEVICE_ATTACHED_TO_DESKTOP flag in the
DISPLAY_DEVICE structure.
3. To get information on the display adapter, call EnumDisplayDevices with
lpDevice set to NULL. For example, DISPLAY_DEVICE.DeviceString
contains the adapter name.
4. Use EnumDisplaySettings to get DEVMODE. dmPosition.x and
dmPosition.y are used to get the x coordinate and y coordinate of the
current display.
5. Try to find the first OpenCL device (the winner) associated with the OpenGL
rendering context by using the looping technique described in step 2, above.
6. Inside the loop:
a. Create a window on a specific display by using the CreateWindow
function. This function returns the window handle (HWND).
b. Use GetDC to get a handle to the device context (HDC) for the client area of
a specific window, or for the entire screen. Alternatively, use the CreateDC
function to create a device context for the specified device.
c. Use ChoosePixelFormat to match an appropriate pixel format supported
by a device context to a given pixel format specification.
d. Use SetPixelFormat to set the pixel format of the specified device
context to the format specified.
e. Use wglCreateContext to create a new OpenGL rendering context from
device context (HDC).
f. Use wglMakeCurrent to bind the GL context created in the above step
as the current rendering context.
g. Use clGetGLContextInfoKHR (see Section 9.7 of the OpenCL
Specification 1.1) with the CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR
parameter, first to query how many CL devices are associated with the
GL context; if the number of devices is zero, go to the next display in the
loop. Otherwise, call clGetGLContextInfoKHR again with the same
parameter to get the device ID of the CL device associated with the
OpenGL context.
h. Use clCreateContext (See Section 4.3 of the OpenCL Specification
1.1) to create the CL context (of type cl_context).
The following code demonstrates how to use the Win32 windowing API for CL-GL
interoperability in a multi-GPU environment.
DISPLAY_DEVICE dispDevice;
DWORD deviceNum;
int xCoordinate = 0;
int yCoordinate = 0;
dispDevice.cb = sizeof(DISPLAY_DEVICE);
for (deviceNum = 0;
     EnumDisplayDevices(NULL, deviceNum, &dispDevice, 0);
     deviceNum++)
{
if (dispDevice.StateFlags &
DISPLAY_DEVICE_MIRRORING_DRIVER)
{
continue;
}
DEVMODE deviceMode;
EnumDisplaySettings(dispDevice.DeviceName,
ENUM_CURRENT_SETTINGS,
&deviceMode);
xCoordinate = deviceMode.dmPosition.x;
yCoordinate = deviceMode.dmPosition.y;
WNDCLASS windowclass;
windowclass.style = CS_OWNDC;
windowclass.lpfnWndProc = WndProc;
windowclass.cbClsExtra = 0;
windowclass.cbWndExtra = 0;
windowclass.hInstance = NULL;
windowclass.hIcon = LoadIcon(NULL, IDI_APPLICATION);
windowclass.hCursor = LoadCursor(NULL, IDC_ARROW);
windowclass.hbrBackground = (HBRUSH)GetStockObject(BLACK_BRUSH);
windowclass.lpszMenuName = NULL;
windowclass.lpszClassName = reinterpret_cast<LPCSTR>("SimpleGL");
RegisterClass(&windowclass);
gHwnd = CreateWindow(
reinterpret_cast<LPCSTR>("SimpleGL"),
reinterpret_cast<LPCSTR>(
"OpenGL Texture Renderer"),
WS_CAPTION | WS_POPUPWINDOW,
xCoordinate,
yCoordinate,
screenWidth,
screenHeight,
NULL,
NULL,
windowclass.hInstance,
NULL);
hDC = GetDC(gHwnd);
hRC = wglCreateContext(hDC);
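// As in the single-GPU listing, the pixel-format selection (ChoosePixelFormat/
// SetPixelFormat) and the wglMakeCurrent binding described in steps b-f are
// assumed to occur here; they are elided from this listing.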
cl_context_properties properties[] =
{
CL_CONTEXT_PLATFORM,
(cl_context_properties) platform,
CL_GL_CONTEXT_KHR,
(cl_context_properties) hRC,
CL_WGL_HDC_KHR,
(cl_context_properties) hDC,
0
};
if (!clGetGLContextInfoKHR)
{
clGetGLContextInfoKHR = (clGetGLContextInfoKHR_fn)
clGetExtensionFunctionAddress(
"clGetGLContextInfoKHR");
}
size_t deviceSize = 0;
status = clGetGLContextInfoKHR(properties,
CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR,
0,
NULL,
&deviceSize);
if (deviceSize == 0)
{
// no interopable CL device found, cleanup
wglMakeCurrent(NULL, NULL);
wglDeleteContext(hRC);
DeleteDC(hDC);
hDC = NULL;
hRC = NULL;
DestroyWindow(gHwnd);
// try the next display
continue;
}
ShowWindow(gHwnd, SW_SHOW);
//Found a winner
break;
}
cl_context_properties properties[] =
{
CL_CONTEXT_PLATFORM,
(cl_context_properties) platform,
CL_GL_CONTEXT_KHR,
(cl_context_properties) hRC,
CL_WGL_HDC_KHR,
(cl_context_properties) hDC,
0
};
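Per steps g and h, the winning display's context properties are then used to
retrieve the associated device and create the CL context; a minimal sketch
(interopDevice is assumed to be a cl_device_id, status a cl_int):
cl_device_id interopDevice;
status = clGetGLContextInfoKHR(properties,
                               CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR,
                               sizeof(cl_device_id),
                               &interopDevice,
                               NULL);
cl_context context = clCreateContext(properties, 1, &interopDevice, NULL, NULL, &status);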
G.1.3 Limitations
It is recommended not to use GLUT in a multi-GPU environment.
Using GLUT on Linux
1. Use glutInit to initialize the GLUT library and to negotiate a session with
the windowing system. This function also processes the command-line
options depending on the windowing system.
2. Use glXGetCurrentContext to get the current rendering context
(GLXContext).
3. Use glXGetCurrentDisplay to get the display (Display *) that is associated
with the current OpenGL rendering context of the calling thread.
4. Use clGetGLContextInfoKHR (see Section 9.7 of the OpenCL Specification
1.1) and the CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR parameter to get the
device ID of the CL device associated with the OpenGL context.
5. Use clCreateContext (see Section 4.3 of the OpenCL Specification 1.1) to
create the CL context (of type cl_context).
The following code snippet shows how to create an interoperability context using
GLUT in Linux.
glutInit(&argc, argv);
glutInitDisplayMode(GLUT_RGBA | GLUT_DOUBLE);
glutInitWindowSize(WINDOW_WIDTH, WINDOW_HEIGHT);
glutCreateWindow("OpenCL SimpleGL");
cl_context_properties cpsGL[] =
{
    CL_CONTEXT_PLATFORM,
    (cl_context_properties)platform,
    CL_GLX_DISPLAY_KHR,
    (intptr_t) glXGetCurrentDisplay(),
    CL_GL_CONTEXT_KHR,
    (intptr_t) glXGetCurrentContext(),
    0
};
status = clGetGLContextInfoKHR(cpsGL,
CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR,
sizeof(cl_device_id),
&interopDevice,
NULL);
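Per step 5, the device just retrieved is used to create the CL context. Once the
interoperability context exists, GL buffer objects can be shared with it; the second
line below is an illustrative assumption (vbo is an existing GL buffer object name),
not part of the original sample:
cl_context context = clCreateContext(cpsGL, 1, &interopDevice, NULL, NULL, &status);
cl_mem clBuffer = clCreateFromGLBuffer(context, CL_MEM_WRITE_ONLY, vbo, &status);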
Using the X Window System
1. Use XOpenDisplay to open a connection to the X server.
2. Use ScreenCount to get the number of screens on the display; the steps below
are then tried on each screen in turn until an interoperable CL device is found.
3. Use glXChooseFBConfig and glXChooseVisual to select a framebuffer
configuration and a visual that match the required attributes.
4. Use XCreateColormap to create a color map of the specified visual type for
the screen on which the specified window resides, and return the colormap
ID associated with it. Note that the specified window is used only to
determine the screen.
5. Use XCreateWindow to create an unmapped sub-window for a specified
parent window; it returns the window ID of the created window and causes the
X server to generate a CreateNotify event. The created window is placed on
top in the stacking order with respect to its siblings.
6. Use XMapWindow to map the window and all of its sub-windows that have had
map requests. Mapping a window that has an unmapped ancestor does not
display the window, but marks it as eligible for display when the ancestor
becomes mapped. Such a window is called unviewable. When all its
ancestors are mapped, the window becomes viewable and is visible on the
screen if it is not obscured by another window.
7. Use glXCreateContextAttribsARB to initialize the context to the initial state
defined by the OpenGL specification and return a handle to it. This handle
can be used to render to any GLX surface.
8. Use glXMakeCurrent to make argument 3 (the GLXContext) the current GLX
rendering context of the calling thread, replacing the previously current
context if there was one, and to attach the context to a GLX drawable,
either a window or a GLX pixmap.
9. Use clGetGLContextInfoKHR to get the OpenCL-OpenGL interoperability
device corresponding to the window created in step 5.
10. Use clCreateContext to create the context on the interoperable device
obtained in step 9.
The following code snippet shows how to create a CL-GL interoperability context
using the X Window system in Linux.
int nelements;
GLXFBConfig *fbc = glXChooseFBConfig(displayName,
DefaultScreen(displayName), 0, &nelements);
static int attributeList[] = { GLX_RGBA,
GLX_DOUBLEBUFFER,
GLX_RED_SIZE,
1,
GLX_GREEN_SIZE,
1,
GLX_BLUE_SIZE,
1,
None
};
XVisualInfo *vi = glXChooseVisual(displayName,
DefaultScreen(displayName),
attributeList);
XSetWindowAttributes swa;
swa.colormap = XCreateColormap(displayName,
RootWindow(displayName, vi->screen),
vi->visual,
AllocNone);
swa.border_pixel = 0;
swa.event_mask = StructureNotifyMask;
GLXCREATECONTEXTATTRIBSARBPROC glXCreateContextAttribsARB =
(GLXCREATECONTEXTATTRIBSARBPROC)
glXGetProcAddress((const
GLubyte*)"glXCreateContextAttribsARB");
int attribs[] = {
GLX_CONTEXT_MAJOR_VERSION_ARB, 3,
GLX_CONTEXT_MINOR_VERSION_ARB, 0,
0
};
// Create the GL context and make it current (steps 7 and 8).
GLXContext ctx = glXCreateContextAttribsARB(displayName, *fbc, 0, True, attribs);
glXMakeCurrent(displayName,
               win,
               ctx);
cl_context_properties cpsGL[] = {
CL_CONTEXT_PLATFORM,(cl_context_properties)platform,
CL_GLX_DISPLAY_KHR, (intptr_t) glXGetCurrentDisplay(),
CL_GL_CONTEXT_KHR, (intptr_t) gGlCtx, 0
};
status = clGetGLContextInfoKHR( cpsGL,
CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR,
sizeof(cl_device_id),
&interopDeviceId,
NULL);
displayName = XOpenDisplay(NULL);
int screenNumber = ScreenCount(displayName);
XCloseDisplay(displayName);
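// The sample then iterates over each of the screenNumber screens, reopening the
// display for the screen under test; the code below runs once per screen, and the
// trailing 'continue', 'break', and closing brace belong to that loop (the loop
// header itself is elided in this listing).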
win = XCreateWindow(displayName,
RootWindow(displayName, vi->screen),
10,
10,
width,
height,
0,
vi->depth,
InputOutput,
vi->visual,
CWBorderPixel|CWColormap|CWEventMask,
&swa);
XMapWindow (displayName, win);
int attribs[] = {
GLX_CONTEXT_MAJOR_VERSION_ARB, 3,
GLX_CONTEXT_MINOR_VERSION_ARB, 0,
0
};
gGlCtx = glXGetCurrentContext();
cl_context_properties cpsGL[] = {
CL_CONTEXT_PLATFORM, (cl_context_properties)platform,
CL_GLX_DISPLAY_KHR, (intptr_t) glXGetCurrentDisplay(),
CL_GL_CONTEXT_KHR, (intptr_t) gGlCtx, 0
};
size_t deviceSize = 0;
status = clGetGLContextInfoKHR(cpsGL,
CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR,
0,
NULL,
&deviceSize);
int numDevices = (deviceSize / sizeof(cl_device_id));
if(numDevices == 0)
{
glXDestroyContext(glXGetCurrentDisplay(), gGlCtx);
continue;
}
else
{
//Interoperable device found
std::cout<<"Interoperable device found "<<std::endl;
break;
}
}
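Per step 10, the interoperable device found in the loop is used to create the
context; a minimal sketch (status is the cl_int used above):
cl_context context = clCreateContext(cpsGL, 1, &interopDeviceId, NULL, NULL, &status);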