AMD Accelerated Parallel Processing: OpenCL Programming Guide
October 2014
rev1.0
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo,
AMD Accelerated Parallel Processing, the AMD Accelerated Parallel Processing logo, ATI,
the ATI logo, Radeon, FireStream, FirePro, Catalyst, and combinations thereof are trade-
marks of Advanced Micro Devices, Inc. Microsoft, Visual Studio, Windows, and Windows
Vista are registered trademarks of Microsoft Corporation in the U.S. and/or other jurisdic-
tions. Other names are for informational purposes only and may be trademarks of their
respective owners. OpenCL and the OpenCL logo are trademarks of Apple Inc. used by
permission by Khronos.
The contents of this document are provided in connection with Advanced Micro Devices,
Inc. (“AMD”) products. AMD makes no representations or warranties with respect to the
accuracy or completeness of the contents of this publication and reserves the right to
make changes to specifications and product descriptions at any time without notice. The
information contained herein may be of a preliminary or advance nature and is subject to
change without notice. No license, whether express, implied, arising by estoppel or other-
wise, to any intellectual property rights is granted by this publication. Except as set forth
in AMD’s Standard Terms and Conditions of Sale, AMD assumes no liability whatsoever,
and disclaims any express or implied warranty, relating to its products including, but not
limited to, the implied warranty of merchantability, fitness for a particular purpose, or
infringement of any intellectual property right.
AMD’s products are not designed, intended, authorized or warranted for use as compo-
nents in systems intended for surgical implant into the body, or in other applications
intended to support or sustain life, or in any other application in which the failure of AMD’s
product could create a situation where personal injury, death, or severe property or envi-
ronmental damage may occur. AMD reserves the right to discontinue or make changes to
its products at any time without notice.
AMD ACCELERATED PARALLEL PROCESSING
Preface
Audience
This document is intended for programmers. It assumes prior experience in
writing code for CPUs and a basic understanding of threads (work-items). While
a basic understanding of GPU architectures is useful, this document does not
assume prior graphics knowledge. It further assumes an understanding of
chapters 1, 2, and 3 of the OpenCL Specification (for the latest version, see
http://www.khronos.org/registry/cl/).
Organization
This AMD Accelerated Parallel Processing document begins, in Chapter 1, with
an overview of: the AMD Accelerated Parallel Processing programming models,
OpenCL, and the AMD Compute Abstraction Layer (CAL). Chapter 2 discusses
the AMD implementation of OpenCL. Chapter 3 discusses the compiling and
running of OpenCL programs. Chapter 4 describes using the AMD CodeXL GPU
Debugger and the GNU debugger (GDB) to debug OpenCL programs. Chapter 5
provides information about the extension that defines the OpenCL Static C++
kernel language, which is a form of the ISO/IEC Programming languages C++
specification. Chapter 6 discusses the features introduced in OpenCL 2.0. Appendix A
describes the supported optional OpenCL extensions. Appendix B details the
installable client driver (ICD) for OpenCL. Appendix C describes the OpenCL
binary image format (BIF). Appendix D provides a hardware overview of pre-GCN
devices. Appendix E describes the interoperability between OpenCL and OpenGL.
Appendix F describes the new and deprecated functions in OpenCL 2.0. The last
section of this book is an index.
Copyright © 2013 Advanced Micro Devices, Inc. All rights reserved.
Conventions
The following conventions are used in this document.
[1,2)  A range that includes the left-most value (in this case, 1) but excludes the right-most value (in this case, 2).
[1,2]  A range that includes both the left-most and right-most values (in this case, 1 and 2).
7:4  A bit range, from bit 7 to 4, inclusive. The high-order bit is shown first.
italicized word or phrase  The first use of a term or concept basic to the understanding of stream computing.
Related Documents
The OpenCL Specification, Version 1.1, Published by Khronos OpenCL
Working Group, Aaftab Munshi (ed.), 2010.
The OpenCL Specification, Version 2.0, Published by Khronos OpenCL
Working Group, Aaftab Munshi (ed.), 2013.
AMD, R600 Technology, R600 Instruction Set Architecture, Sunnyvale, CA,
est. pub. date 2007. This document includes the RV670 GPU instruction
details.
ISO/IEC 9899:TC2 - International Standard - Programming Languages - C
Kernighan, Brian W., and Ritchie, Dennis M., The C Programming Language,
Prentice-Hall, Inc., Upper Saddle River, NJ, 1978.
I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P.
Hanrahan, “Brook for GPUs: stream computing on graphics hardware,” ACM
Trans. Graph., vol. 23, no. 3, pp. 777–786, 2004.
AMD Compute Abstraction Layer (CAL) Intermediate Language (IL)
Reference Manual. Published by AMD.
Buck, Ian; Foley, Tim; Horn, Daniel; Sugerman, Jeremy; Hanrahan, Pat;
Houston, Mike; Fatahalian, Kayvon. “BrookGPU”
http://graphics.stanford.edu/projects/brookgpu/
Buck, Ian. “Brook Spec v0.2”. October 31, 2003.
http://merrimac.stanford.edu/brook/brookspec-05-20-03.pdf
OpenGL Programming Guide, at http://www.glprogramming.com/red/
Microsoft DirectX Reference Website, at http://msdn.microsoft.com/en-us/directx
Contact Information
URL: developer.amd.com/appsdk
Developing: developer.amd.com/
Forum: developer.amd.com/openclforum
Contents
Preface
Contents
Chapter 1
OpenCL Architecture and AMD Accelerated Parallel Processing
1.1 Terminology
1.2 OpenCL Overview
1.3 Programming Model
1.4 Synchronization
1.5 Memory Architecture and Access
1.5.1 Data Share Operations
1.5.2 Dataflow in Memory Hierarchy
1.5.3 Memory Access
1.5.4 Global Memory
1.5.5 Image Read/Write
1.6 Example Programs
1.6.1 First Example: Simple Buffer Write
1.6.2 Example: SAXPY Function
1.6.3 Example: Parallel Min() Function
Chapter 2
AMD Implementation
2.1 The AMD Accelerated Parallel Processing Implementation of OpenCL
2.1.1 Work-Item Processing
2.1.2 Work-Item Creation
2.1.3 Flow Control
2.2 Hardware Overview for GCN Devices
2.2.1 Key differences between pre-GCN and GCN devices
2.2.2 Key differences between Southern Islands, Sea Islands, and Volcanic Islands families
2.3 Communication Between Host and the GPU Compute Device
2.3.1 Processing API Calls: The Command Processor
2.3.2 DMA Transfers
2.3.3 Masking Visible Devices
2.4 Wavefront Scheduling
Chapter 3
Building and Running OpenCL Programs
3.1 Compiling the Program
3.1.1 Compiling on Windows
3.1.2 Compiling on Linux
3.1.3 Supported Standard OpenCL Compiler Options
3.1.4 AMD-Developed Supplemental Compiler Options
3.2 Running the Program
3.2.1 Running Code on Windows
3.2.2 Running Code on Linux
3.3 Calling Conventions
Chapter 4
Debugging OpenCL
4.1 AMD CodeXL GPU Debugger
4.2 Debugging CPU Kernels with GDB
4.2.1 Setting the Environment
4.2.2 Setting the Breakpoint in an OpenCL Kernel
4.2.3 Sample GDB Session
4.2.4 Notes
Chapter 5
OpenCL Static C++ Programming Language
5.1 Overview
5.1.1 Supported Features
5.1.2 Unsupported Features
5.1.3 Relations with ISO/IEC C++
5.2 Additions and Changes to Section 5 - The OpenCL C Runtime
5.2.1 Additions and Changes to Section 5.7.1 - Creating Kernel Objects
5.2.2 Passing Classes between Host and Device
5.3 Additions and Changes to Section 6 - The OpenCL C Programming Language
5.3.1 Building C++ Kernels
5.3.2 Classes and Derived Classes
5.3.3 Namespaces
5.3.4 Overloading
5.3.5 Templates
5.3.6 Exceptions
5.3.7 Libraries
5.3.8 Dynamic Operation
5.4 Examples
5.4.1 Passing a Class from the Host to the Device and Back
5.4.2 Kernel Overloading
5.4.3 Kernel Template
Chapter 6
OpenCL 2.0
6.1 Introduction
6.2 Shared Virtual Memory (SVM)
6.2.1 Overview
6.2.2 Usage
Coarse-grained memory
6.3 Generic Address Space
6.3.1 Overview
6.3.2 Usage
Generic example
AMD APP SDK example
6.4 Device-side enqueue
6.4.1 Overview
6.4.2 Usage
Iterate until convergence
Data-dependent refinement
Extracting Primes from an array by using device-side enqueue
Binary search using device enqueue (or kernel enqueue)
6.5 Atomics and synchronization
6.5.1 Overview
6.5.2 Usage
Atomic Loads/Stores
Atomic Compare and Exchange (CAS)
Atomic Fetch
6.6 Pipes
6.6.1 Overview
6.6.2 Functions for accessing pipes
6.6.3 Usage
6.7 Sub-groups
6.7.1 Overview
6.8 Program-scope global Variables
6.8.1 Overview
6.9 Image Enhancements
6.9.1 Overview
6.9.2 sRGB
6.9.3 Depth images
6.10 Non-uniform work group size
6.10.1 Overview
6.11 Portability considerations
6.11.1 Migrating from OpenCL 1.2 to OpenCL 2.0
6.11.2 Identifying implementation specifics
Appendix A
OpenCL Optional Extensions
A.1 Extension Name Convention
A.2 Querying Extensions for a Platform
A.3 Querying Extensions for a Device
A.4 Using Extensions in Kernel Programs
A.5 Getting Extension Function Pointers
A.6 List of Supported Extensions that are Khronos-Approved
A.7 cl_ext Extensions
A.8 AMD Vendor-Specific Extensions
A.8.1 cl_amd_fp64
A.8.2 cl_amd_vec3
A.8.3 cl_amd_device_persistent_memory
A.8.4 cl_amd_device_attribute_query
cl_device_profiling_timer_offset_amd
cl_amd_device_topology
cl_amd_device_board_name
A.8.5 cl_amd_compile_options
A.8.6 cl_amd_offline_devices
A.8.7 cl_amd_event_callback
A.8.8 cl_amd_popcnt
A.8.9 cl_amd_media_ops
A.8.10 cl_amd_media_ops2
A.8.11 cl_amd_printf
A.8.12 cl_amd_predefined_macros
A.8.13 cl_amd_bus_addressable_memory
A.9 Supported Functions for cl_amd_fp64 / cl_khr_fp64
A.10 Extension Support by Device
Appendix B
The OpenCL Installable Client Driver (ICD)
B.1 Overview
B.2 Using ICD
Appendix C
OpenCL Binary Image Format (BIF) v2.0
C.1 Overview
C.1.1 Executable and Linkable Format (ELF) Header
C.1.2 Bitness
C.2 BIF Options
Appendix D
Hardware Overview of Pre-GCN Devices
Appendix E
OpenCL-OpenGL Interoperability
E.1 Under Windows
E.1.1 Single GPU Environment
Creating CL Context from a GL Context
E.1.2 Multi-GPU Environment
Creating CL Context from a GL Context
E.1.3 Limitations
E.2 Linux Operating System
E.2.1 Single GPU Environment
Creating CL Context from a GL Context
E.2.2 Multi-GPU Configuration
Creating CL Context from a GL Context
E.3 Additional GL Formats Supported
Appendix F
New and deprecated functions in OpenCL 2.0
F.1 New built-in functions in OpenCL 2.0
F.1.1 Work Item Functions
F.1.2 Integer functions
F.1.3 Synchronization Functions
F.1.4 Address space qualifier functions
F.1.5 Atomic functions
F.1.6 Image Read and Write Functions
F.1.7 Work group functions
F.1.8 Pipe functions
F.1.9 Enqueueing Kernels
F.1.10 Sub-groups
F.2 Deprecated built-ins
F.3 New runtime APIs in OpenCL 2.0
F.3.1 New Types
F.3.2 New Macros
F.3.3 New API calls
F.4 Deprecated runtimes
Index
Chapter 1
OpenCL Architecture and AMD
Accelerated Parallel Processing
This chapter provides a general software and hardware overview of the AMD
Accelerated Parallel Processing implementation of the OpenCL standard. It
explains the memory structure and gives simple programming examples.
1.1 Terminology
compute kernel
To define a compute kernel, it is first necessary to define a kernel. A
kernel is a small unit of execution that performs a clearly defined function
and that can be executed in parallel. Such a kernel can be executed on
each element of an input stream (called an NDRange), or simply at each
point in an arbitrary index space. A kernel is analogous to, and on some
devices identical to, what graphics programmers call a shader program.
This kernel is not to be confused with an OS kernel, which controls
hardware. The most basic form of an NDRange is simply mapped over
input data and produces one output item for each input tuple.
Subsequent extensions of the basic model provide random-access
functionality, variable output counts, and reduction/accumulation
operations. Kernels are specified using the kernel keyword.
wavefronts and work-groups
Wavefronts and work-groups are two concepts relating to compute
kernels that provide data-parallel granularity. A wavefront executes a
number of work-items in lock step relative to each other. Sixteen work-
items are executed in parallel across the vector unit, and the whole
wavefront is covered over four clock cycles. It is the lowest level that flow
control can affect. This means that if two work-items inside of a
wavefront take divergent paths of flow control, all work-items in the
wavefront go through both paths of flow control.
OpenCL's API also supports the concept of a task dispatch. This is equivalent to
executing a kernel on a compute device with a work-group and NDRange
containing a single work-item. Parallelism is expressed using vector data types
implemented by the device, enqueuing multiple tasks, and/or enqueuing native
kernels developed using a programming model orthogonal to OpenCL.
[Figure 1.1: host/device architecture: a context containing a GPU and a CPU device, each with its own command queue and local memory, sharing global/constant memory.]
The devices are capable of running data- and task-parallel work. A kernel can be
executed as a function of multi-dimensional domains of indices. Each element is
called a work-item; the total number of indices is defined as the global work-size.
The global work-size can be divided into sub-domains, called work-groups, and
individual work-items within a group can communicate through global or locally
shared memory. Work-items are synchronized through barrier or fence
operations. Figure 1.1 is a representation of the host/device architecture with a
single platform, consisting of a GPU and a CPU.
Many operations are performed with respect to a given context; there also are
many operations that are specific to a device. For example, program compilation
and kernel execution are done on a per-device basis. Performing work with a
device, such as executing kernels or moving data to and from the device’s local
memory, is done using a corresponding command queue. A command queue is
associated with a single device and a given context; all work for a specific device
is done through this interface. Note that while a single command queue can be
associated with only a single device, there is no limit to the number of command
queues that can point to the same device. For example, it is possible to have
one command queue for executing kernels and a command queue for managing
data transfers between the host and the device.
Most OpenCL programs follow the same pattern. Given a specific platform, select
a device or devices to create a context, allocate memory, create device-specific
command queues, and perform data transfers and computations. Generally, the
platform is the gateway to accessing specific devices; given these devices and
a corresponding context, the application is independent of the platform. Given a
context, the application can:
1.4 Synchronization
The two domains of synchronization in OpenCL are work-items in a single work-
group and command-queue(s) in a single context. Work-group barriers enable
synchronization of work-items in a work-group. Each work-item in a work-group
must first execute the barrier before executing any instruction beyond this barrier.
Either all of, or none of, the work-items in a work-group must encounter the
barrier. A barrier or mem_fence operation does not have global scope; it is
relevant only to the local work-group on which it operates.
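The barrier semantics described above can be mimicked on the host with ordinary threads standing in for work-items in one work-group. This is a host-side analogy only, not OpenCL API code: no thread proceeds past the barrier until every thread in the group has reached it.

```python
# Host-side analogy: Python threads as work-items in one work-group.
# No thread passes the barrier until all GROUP_SIZE threads arrive.
import threading

GROUP_SIZE = 4
barrier = threading.Barrier(GROUP_SIZE)
order = []
lock = threading.Lock()

def work_item(wid):
    with lock:
        order.append(("before", wid))
    barrier.wait()                      # like barrier(CLK_LOCAL_MEM_FENCE)
    with lock:
        order.append(("after", wid))

threads = [threading.Thread(target=work_item, args=(i,))
           for i in range(GROUP_SIZE)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every "before" entry precedes every "after" entry.
befores = [i for i, (p, _) in enumerate(order) if p == "before"]
afters = [i for i, (p, _) in enumerate(order) if p == "after"]
```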
[Figure: Interrelationship of memory domains — within the compute device, each compute unit has private memory (register files) per processing element (ALU), plus local memory (LDS) and an L1 cache; global and constant memory reside in the compute device memory (VRAM), which connects to the host over PCIe, with DMA transfers.]
Figure 1.3 illustrates the standard dataflow between host (CPU) and GPU.
[Figure 1.3: Dataflow between host and GPU — data moves from host memory across the PCIe bus into global memory on the compute device, and from there into local and private memory.]
There are two ways to copy data from the host to the GPU compute device
memory:
With proper memory transfer management and the use of system pinned
memory (host/CPU memory remapped to the PCIe memory space), copying
between host (CPU) memory and PCIe memory can be skipped.
Double copying lowers the overall system memory bandwidth. In GPU compute
device programming, pipelining and other techniques help reduce these
bottlenecks. See the AMD OpenCL Optimization Reference Guide for more
specifics about optimization techniques.
Figure 1.4 shows the conceptual framework of the LDS integration into the
memory of AMD GPUs using OpenCL.
[Figure 1.4: High-level memory configuration — each work-group on the compute device has its own LDS; work-groups share global/constant memory in the frame buffer, which is connected to host memory.]
Physically located on-chip, directly next to the ALUs, the LDS is approximately
one order of magnitude faster than global memory (assuming no bank conflicts).
The high bandwidth of the LDS memory is achieved not only through its proximity
to the ALUs, but also through simultaneous access to its memory banks. Thus,
it is possible to concurrently execute 32 write or read instructions, each nominally
32-bits; extended instructions, read2/write2, can be 64-bits each. If, however,
more than one access attempt is made to the same bank at the same time, a
bank conflict occurs. In this case, for indexed and atomic operations, hardware
prevents the attempted concurrent accesses to the same bank by turning them
into serial accesses. This decreases the effective bandwidth of the LDS. For
maximum throughput (optimal efficiency), therefore, it is important to avoid bank
conflicts. A knowledge of request scheduling and address mapping is key to
achieving this.
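The bank-conflict behavior described above can be made concrete with a small model. This is a hypothetical mapping (32 banks of 32-bit words, bank = word address mod 32) consistent with the "32 accesses per cycle" figure in the text; the helper names are our own, and real hardware address mapping may differ.

```python
# Hypothetical LDS bank model: 32 banks, 32-bit (4-byte) words;
# bank = (byte_address // 4) % 32. Concurrent accesses that hit the
# same bank at different addresses serialize.
from collections import Counter

NUM_BANKS = 32
WORD_BYTES = 4

def bank(byte_address):
    return (byte_address // WORD_BYTES) % NUM_BANKS

def serialization_factor(byte_addresses):
    # Number of serialized cycles for one set of concurrent accesses:
    # the worst-case count of accesses landing in any single bank.
    counts = Counter(bank(a) for a in byte_addresses)
    return max(counts.values())

# Stride-1 word accesses: 32 requests hit 32 distinct banks -> 1 cycle.
conflict_free = serialization_factor([4 * i for i in range(32)])

# Stride-32 word accesses: all 32 requests hit bank 0 -> fully serialized.
worst_case = serialization_factor([4 * 32 * i for i in range(32)])
```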
[Figure 1.5: Memory dataflow within a compute unit — buffers flow between global memory (VRAM), work-item private memory, and the LDS (left side); images flow through the per-compute-unit texture L1 cache (right side).]
To load data into LDS from global memory, it is read from global memory and
placed into the work-item’s registers; then, a store is performed to LDS. Similarly,
to store data into global memory, data is read from LDS and placed into the work-
item’s registers, then placed into global memory. To make effective use of the
LDS, an algorithm must perform many operations on what is transferred between
global memory and LDS. It also is possible to load data from a memory buffer
directly into LDS, bypassing VGPRs.
LDS atomics are performed in the LDS hardware. (Thus, although ALUs are not
directly used for these operations, latency is incurred by the LDS executing this
function.) If the algorithm does not require write-to-read reuse (the data is read
only), it usually is better to use the image dataflow (see right side of Figure 1.5)
because of the cache hierarchy.
Actually, buffer reads may use L1 and L2. When caching is not used for a buffer,
reads from that buffer bypass L2. After a buffer read, the line is invalidated; then,
on the next read, it is read again (from the same wavefront or from a different
clause). After a buffer write, the changed parts of the cache line are written to
memory.
Buffers and images are written through the texture L2 cache, but this is flushed
immediately after an image write.
The data in private memory is first placed in registers. If more private memory is
used than can be placed in registers, or dynamic indexing is used on private
arrays, the overflow data is placed (spilled) into scratch memory. Scratch memory
is a private subset of global memory, so performance can be dramatically
degraded if spilling occurs.
Global memory can be in the high-speed GPU memory (VRAM) or in the host
memory, which is accessed by the PCIe bus. A work-item can access global
memory either as a buffer or a memory object. Buffer objects are generally read
and written directly by the work-items. Data is accessed through the L2 and L1
data caches on the GPU. This limited form of caching provides read coalescing
among work-items in a wavefront. Similarly, writes are executed through the
texture L2 cache.
Global atomic operations are executed through the texture L2 cache. Atomic
instructions that return a value to the kernel are handled similarly to fetch
instructions: the kernel must use S_WAITCNT to ensure the results have been
written to the destination GPR before using the data.
When using global memory, each work-item can write to an arbitrary location
within it. Global memory uses a linear layout. If consecutive addresses are written,
the compute unit issues a burst write for more efficient memory access. Only
read-only buffers, such as constants, are cached.
Image reads are cached through the texture system (corresponding to the L2 and
L1 caches).
1. The host program must select a platform, which is an abstraction for a given
OpenCL implementation. Implementations by multiple vendors can coexist on
a host, and the sample uses the first one available.
2. A device id for a GPU device is requested. A CPU device could be requested
by using CL_DEVICE_TYPE_CPU instead. The device can be a physical device,
such as a given GPU, or an abstracted device, such as the collection of all
CPU cores on the host.
Example Code 1 –
//
// Copyright (c) 2010 Advanced Micro Devices, Inc. All rights reserved.
//
#include <CL/cl.h>
#include <stdio.h>
cl_platform_id platform;
cl_device_id device;
// 6. Launch the kernel. Let OpenCL pick the local work size.
clEnqueueNDRangeKernel( queue,
kernel,
1,
NULL,
&global_work_size,
NULL, 0, NULL, NULL);
clFinish( queue );
cl_uint *ptr;
ptr = (cl_uint *) clEnqueueMapBuffer( queue,
buffer,
CL_TRUE,
CL_MAP_READ,
0,
NWITEMS * sizeof(cl_uint),
0, NULL, NULL, NULL );
int i;
return 0;
}
1. Enable error checking through the exception handling mechanism in the C++
bindings by using the following define.
#define __CL_ENABLE_EXCEPTIONS
This removes the need to error check after each OpenCL call. If there is an
error, the C++ bindings code throws an exception that is caught at the end of
the try block, where we can clean up the host memory allocations. In this
example, the C++ objects representing OpenCL resources (cl::Context,
cl::CommandQueue, etc.) are declared as automatic variables, so they do not
need to be released explicitly.
7. Create two buffers, corresponding to the X and Y vectors. Ensure the host-
side buffers, pX and pY, are allocated and initialized. The
CL_MEM_COPY_HOST_PTR flag instructs the runtime to copy over the
contents of the host pointer pX in order to initialize the buffer bufX. The bufX
buffer uses the CL_MEM_READ_ONLY flag, while bufY requires the
CL_MEM_READ_WRITE flag.
bufX = cl::Buffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
sizeof(cl_float) * length, pX);
8. Create a program object from the kernel source string, build the program for
our devices, and create a kernel object corresponding to the SAXPY kernel.
(At this point, it is possible to create multiple kernel objects if there are more
than one.)
cl::Program::Sources sources(1, std::make_pair(kernelStr.c_str(),
kernelStr.length()));
program = cl::Program(context, sources);
program.build(devices);
kernel = cl::Kernel(program, "saxpy");
9. Enqueue the kernel for execution on the device (GPU in our example).
Set each argument individually in separate kernel.setArg() calls. The
arguments do not need to be set again for subsequent kernel enqueue calls;
reset only those arguments that are to pass a new value to the kernel. Then,
enqueue the kernel to the command queue with the appropriate global and
local work sizes.
kernel.setArg(0, bufX);
kernel.setArg(1, bufY);
kernel.setArg(2, a);
queue.enqueueNDRangeKernel(kernel, cl::NDRange(),
cl::NDRange(length), cl::NDRange(64));
10. Read back the results from bufY to the host pointer pY. We will make this a
blocking call (using the CL_TRUE argument) since we do not want to proceed
before the kernel has finished execution and we have our results back.
queue.enqueueReadBuffer(bufY, CL_TRUE, 0, length * sizeof(cl_float),
pY);
11. Clean up the host resources (pX and pY). OpenCL resources are cleaned up
by the C++ bindings support code.
Example Code 2 –
#define __CL_ENABLE_EXCEPTIONS
#include <CL/cl.hpp>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <string>
#include <vector>
using std::cout;
using std::cerr;
using std::endl;
using std::string;
/////////////////////////////////////////////////////////////////
// Helper function to print vector elements
/////////////////////////////////////////////////////////////////
void printVector(const std::string arrayName,
const cl_float * arrayData,
const unsigned int length)
{
int numElementsToPrint = (256 < length) ? 256 : length;
cout << endl << arrayName << ":" << endl;
for(int i = 0; i < numElementsToPrint; ++i)
cout << arrayData[i] << " ";
cout << endl;
}
/////////////////////////////////////////////////////////////////
// Globals
/////////////////////////////////////////////////////////////////
int length = 256;
cl_float * pX = NULL;
cl_float * pY = NULL;
cl_float a = 2.f;
std::vector<cl::Platform> platforms;
cl::Context context;
std::vector<cl::Device> devices;
cl::CommandQueue queue;
cl::Program program;
cl::Kernel kernel;
cl::Buffer bufX;
cl::Buffer bufY;
/////////////////////////////////////////////////////////////////
// The saxpy kernel
/////////////////////////////////////////////////////////////////
string kernelStr =
"__kernel void saxpy(const __global float * x,\n"
" __global float * y,\n"
" const float a)\n"
"{\n"
" uint gid = get_global_id(0);\n"
" y[gid] = a* x[gid] + y[gid];\n"
"}\n";
/////////////////////////////////////////////////////////////////
// Allocate and initialize memory on the host
/////////////////////////////////////////////////////////////////
void initHost()
{
size_t sizeInBytes = length * sizeof(cl_float);
pX = (cl_float *) malloc(sizeInBytes);
if(pX == NULL)
throw(string("Error: Failed to allocate input memory on host\n"));
pY = (cl_float *) malloc(sizeInBytes);
if(pY == NULL)
throw(string("Error: Failed to allocate input memory on host\n"));
}
/////////////////////////////////////////////////////////////////
// Release host memory
/////////////////////////////////////////////////////////////////
void cleanupHost()
{
if(pX)
{
free(pX);
pX = NULL;
}
if(pY != NULL)
{
free(pY);
pY = NULL;
}
}
int
main(int argc, char * argv[])
{
try
{
/////////////////////////////////////////////////////////////////
// Allocate and initialize memory on the host
/////////////////////////////////////////////////////////////////
initHost();
/////////////////////////////////////////////////////////////////
// Find the platform
/////////////////////////////////////////////////////////////////
cl::Platform::get(&platforms);
std::vector<cl::Platform>::iterator iter;
for(iter = platforms.begin(); iter != platforms.end(); ++iter)
{
if(!strcmp((*iter).getInfo<CL_PLATFORM_VENDOR>().c_str(),
"Advanced Micro Devices, Inc."))
{
break;
} }
/////////////////////////////////////////////////////////////////
// Create an OpenCL context
/////////////////////////////////////////////////////////////////
cl_context_properties cps[3] = { CL_CONTEXT_PLATFORM,
(cl_context_properties)(*iter)(), 0 };
context = cl::Context(CL_DEVICE_TYPE_GPU, cps);
/////////////////////////////////////////////////////////////////
// Detect OpenCL devices
/////////////////////////////////////////////////////////////////
devices = context.getInfo<CL_CONTEXT_DEVICES>();
/////////////////////////////////////////////////////////////////
// Create an OpenCL command queue
/////////////////////////////////////////////////////////////////
queue = cl::CommandQueue(context, devices[0]);
/////////////////////////////////////////////////////////////////
// Create OpenCL memory buffers
/////////////////////////////////////////////////////////////////
bufX = cl::Buffer(context,
CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
sizeof(cl_float) * length,
pX);
bufY = cl::Buffer(context,
CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
sizeof(cl_float) * length,
pY);
/////////////////////////////////////////////////////////////////
// Load CL file, build CL program object, create CL kernel object
/////////////////////////////////////////////////////////////////
cl::Program::Sources sources(1, std::make_pair(kernelStr.c_str(),
kernelStr.length()));
program = cl::Program(context, sources);
program.build(devices);
kernel = cl::Kernel(program, "saxpy");
/////////////////////////////////////////////////////////////////
// Set the arguments that will be used for kernel execution
/////////////////////////////////////////////////////////////////
kernel.setArg(0, bufX);
kernel.setArg(1, bufY);
kernel.setArg(2, a);
/////////////////////////////////////////////////////////////////
// Enqueue the kernel to the queue
// with appropriate global and local work sizes
/////////////////////////////////////////////////////////////////
queue.enqueueNDRangeKernel(kernel, cl::NDRange(),
cl::NDRange(length), cl::NDRange(64));
/////////////////////////////////////////////////////////////////
// Enqueue blocking call to read back buffer Y
/////////////////////////////////////////////////////////////////
queue.enqueueReadBuffer(bufY, CL_TRUE, 0, length * sizeof(cl_float), pY);
printVector("Y", pY, length);
/////////////////////////////////////////////////////////////////
// Release host resources
/////////////////////////////////////////////////////////////////
cleanupHost();
}
catch (cl::Error err)
{
/////////////////////////////////////////////////////////////////
// Catch OpenCL errors and print log if it is a build error
/////////////////////////////////////////////////////////////////
cerr << "ERROR: " << err.what() << "(" << err.err() << ")" <<
endl;
if (err.err() == CL_BUILD_PROGRAM_FAILURE)
{
string str =
program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(devices[0]);
cout << "Program Info: " << str << endl;
}
cleanupHost();
}
catch(string msg)
{
cerr << "Exception caught in main(): " << msg << endl;
cleanupHost();
}
}
The code is written so that it performs very well on either CPU or GPU. The
number of threads launched depends on how many hardware processors are
available. Each thread walks the source buffer, using a device-optimal access
pattern selected at runtime. A multi-stage reduction using __local and __global
atomics produces the single result value.
Runtime Code –
1. The source memory buffer is allocated, and initialized with a random pattern.
Also, the actual min() value for this data set is serially computed, in order to
later verify the parallel result.
2. The compiler is instructed to dump the intermediate IL and ISA files for
further analysis.
3. The main section of the code, including device setup, CL data buffer creation,
and code compilation, is executed for each device, in this case for CPU and
GPU. Since the source memory buffer exists on the host, it is shared. All
other resources are device-specific.
4. The global work size is computed for each device. A simple heuristic is used
to ensure an optimal number of threads on each device. For the CPU, a
given CL implementation can translate one work-item per CL compute unit
into one thread per CPU core.
On the GPU, an initial multiple of the wavefront size is used, which is
adjusted to ensure even divisibility of the input data over all threads. The
value of 7 is a minimum value to keep all independent hardware units of the
compute units busy, and to provide a minimum amount of memory latency
hiding for a kernel with little ALU activity.
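The heuristic in step 4 can be sketched as follows. This is an illustrative reconstruction, not the sample's exact code: it assumes the input item count is a power of two (as with the sample's 4096*4096 data set), and the function and parameter names are our own.

```python
# Sketch of the global-work-size heuristic: start from a multiple of the
# wavefront size (compute_units * wavefront_size * 7), then grow in
# wavefront-size steps until the input divides evenly across all threads.
# Assumes num_vectors is a power of two, so a suitable size exists.
def pick_global_work_size(num_vectors, compute_units,
                          wavefront_size=64, multiplier=7):
    ws = compute_units * wavefront_size * multiplier
    while num_vectors % ws != 0:
        ws += wavefront_size
    return ws

# Example: 4096*4096 uints read as uint4 vectors on a 32-CU device.
num_vectors = (4096 * 4096) // 4
size = pick_global_work_size(num_vectors, 32)
```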
5. After the kernels are built, the code prints errors that occurred during kernel
compilation and linking.
6. The main loop is set up so that the measured timing reflects the actual kernel
performance. If a sufficiently large NLOOPS is chosen, effects from kernel
launch time and delayed buffer copies to the device by the CL runtime are
minimized. Note that while only a single clFinish() is executed at the end
of the timing run, the two kernels are always linked using an event to ensure
serial execution.
The bandwidth is expressed as “number of input bytes processed.” For high-
end graphics cards, the bandwidth of this algorithm is about an order of
magnitude higher than that of the CPU, due to the parallelized memory
subsystem of the graphics card.
7. The results then are checked against the comparison value. This also
establishes that the result is the same on both CPU and GPU, which can
serve as the first verification test for newly written kernel code.
8. Note the use of the debug buffer to obtain some runtime variables. Debug
buffers also can be used to create short execution traces for each thread,
assuming the device has enough memory.
9. You can use the Timer.cpp and Timer.h files from the TransferOverlap
sample, which is in the SDK samples.
Kernel Code –
10. The code uses four-component vectors (uint4) so the compiler can identify
concurrent execution paths as often as possible. On the GPU, this can be
used to further optimize memory accesses and distribution across ALUs. On
the CPU, it can be used to enable SSE-like execution.
11. The kernel sets up a memory access pattern based on the device. For the
CPU, the source buffer is chopped into continuous buffers: one per thread.
Each CPU thread serially walks through its buffer portion, which results in
good cache and prefetch behavior for each core.
On the GPU, each thread walks the source buffer using a stride of the total
number of threads. As many threads are executed in parallel, the result is a
maximally coalesced memory pattern requested from the memory back-end.
For example, if each compute unit has 16 physical processors, 16 uint4
requests are produced in parallel, per clock, for a total of 256 bytes per clock.
12. The kernel code uses a reduction consisting of three stages: __global to
__private, __private to local, which is flushed to __global, and finally
__global to __global. In the first loop, each thread walks __global
memory, and reduces all values into a min value in __private memory
(typically, a register). This is the bulk of the work, and is mainly bound by
__global memory bandwidth. The subsequent reduction stages are brief in
comparison.
13. Next, all per-thread minimum values inside the work-group are reduced to a
__local value, using an atomic operation. Access to the __local value is
serialized; however, the number of these operations is very small compared
to the work of the previous reduction stage. The threads within a work-group
are synchronized through a local barrier(). The reduced min value is
stored in __global memory.
14. After all work-groups are finished, a second kernel reduces all work-group
values into a single value in __global memory, using an atomic operation.
This is a minor contributor to the overall runtime.
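The three reduction stages in steps 12-14 can be modeled on the host with plain Python loops standing in for work-groups and work-items (this is a functional sketch of the dataflow, not the kernel code; it ignores the actual parallel/atomic execution).

```python
# Host-side model of the three-stage min reduction:
# __global -> __private, __private -> __local -> __global, then
# __global -> __global in a second "kernel".
def reduce_min(src, num_groups, group_size):
    INF = (1 << 32) - 1              # like (uint)-1 in the kernel
    num_threads = num_groups * group_size
    gmin = [INF] * num_groups        # per-work-group results in __global
    for g in range(num_groups):
        lmin = INF                   # __local value, one per work-group
        for l in range(group_size):
            tid = g * group_size + l
            pmin = INF               # __private per-work-item value
            # Stage 1: each work-item strides through __global memory.
            for idx in range(tid, len(src), num_threads):
                pmin = min(pmin, src[idx])
            # Stage 2: atomic min into __local.
            lmin = min(lmin, pmin)
        gmin[g] = lmin               # work-item 0 writes out to __global
    # Stage 3: second kernel reduces per-group values atomically.
    result = INF
    for v in gmin:
        result = min(result, v)
    return result
```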
Example Code 3 –
//
// Copyright (c) 2010 Advanced Micro Devices, Inc. All rights reserved.
//
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "Timer.h"
#define NDEVS 2
" for( int n=0; n < count; n++, idx += stride ) \n"
" { \n"
" pmin = min( pmin, src[idx].x ); \n"
" pmin = min( pmin, src[idx].y ); \n"
" pmin = min( pmin, src[idx].z ); \n"
" pmin = min( pmin, src[idx].w ); \n"
" } \n"
" \n"
" // 12. Reduce min values inside work-group. \n"
" \n"
" if( get_local_id(0) == 0 ) \n"
" lmin[0] = (uint) -1; \n"
" \n"
" barrier( CLK_LOCAL_MEM_FENCE ); \n"
" \n"
" (void) atom_min( lmin, pmin ); \n"
" \n"
" barrier( CLK_LOCAL_MEM_FENCE ); \n"
" \n"
" // Write out to __global. \n"
" \n"
" if( get_local_id(0) == 0 ) \n"
" gmin[ get_group_id(0) ] = lmin[0]; \n"
" \n"
" // Dump some debug information. \n"
" \n"
" if( get_global_id(0) == 0 ) \n"
" { \n"
" dbg[0] = get_num_groups(0); \n"
" dbg[1] = get_global_size(0); \n"
" dbg[2] = count; \n"
" dbg[3] = stride; \n"
" } \n"
"} \n"
" \n"
"// 13. Reduce work-group min values from __global to __global. \n"
" \n"
"__kernel void reduce( __global uint4 *src, \n"
" __global uint *gmin ) \n"
"{ \n"
" (void) atom_min( gmin, gmin[get_global_id(0)] ) ; \n"
"} \n";
cl_uint *src_ptr;
unsigned int num_src_items = 4096*4096;
time_t ltime;
time(&ltime);
// Get a platform.
cl_mem src_buf;
cl_mem dst_buf;
cl_mem dbg_buf;
cl_uint *dst_ptr,
*dbg_ptr;
clGetDeviceIDs( platform,
devs[dev],
1,
&device,
NULL);
cl_uint compute_units;
size_t global_work_size;
size_t local_work_size;
size_t num_groups;
clGetDeviceInfo( device,
CL_DEVICE_MAX_COMPUTE_UNITS,
sizeof(cl_uint),
&compute_units,
NULL);
cl_uint ws = 64;
local_work_size = ws;
}
queue = clCreateCommandQueue(context,
device,
0, NULL);
if(ret != CL_SUCCESS)
{
printf("clBuildProgram failed: %d\n", ret);
char buf[0x10000];
clGetProgramBuildInfo( program,
device,
CL_PROGRAM_BUILD_LOG,
0x10000,
buf,
NULL);
printf("\n%s\n", buf);
return(-1);
}
CPerfCounter t;
t.Reset();
t.Start();
cl_event ev;
int nloops = NLOOPS;
while(nloops--)
{
clEnqueueNDRangeKernel( queue,
minp,
1,
NULL,
&global_work_size,
&local_work_size,
0, NULL, &ev);
clEnqueueNDRangeKernel( queue,
reduce,
1,
NULL,
&num_groups,
NULL, 1, &ev, NULL);
}
clFinish( queue );
t.Stop();
printf("\n");
return 0;
}
Chapter 2
AMD Implementation
[Figure: AMD Accelerated Parallel Processing software stack — compute applications layered over the OpenCL runtime, which drives the GPU hardware.]
The AMD Accelerated Parallel Processing software stack provides end-users and
developers with a complete, flexible suite of tools to leverage the processing
power in AMD GPUs. AMD Accelerated Parallel Processing software embraces
open-systems, open-platform standards. The AMD Accelerated Parallel
Processing open platform strategy enables AMD technology partners to develop
and provide third-party development tools.
The latest generations of AMD GPUs use unified shader architectures capable
of running different kernel types interleaved on the same hardware.
Programmable GPU compute devices execute various user-developed programs,
known to graphics programmers as shaders and to compute programmers as
kernels. These GPU compute devices can execute non-graphics functions using
a data-parallel programming model that maps executions onto compute units.
Each compute unit contains one (pre-GCN devices) or more (GCN devices)
vector (SIMD) units. In this programming model, known as AMD Accelerated
Parallel Processing, arrays of input data elements stored in memory are
accessed by a number of compute units.
[Figure: Grid of work-items — the ND-range is divided into work-groups of work-items; within a GPU device, work-items map onto the processing elements of a compute unit.]
Note that in OpenCL 2.0, the work-groups are not required to divide evenly into
the NDRange.
[Figure: An ND-range (dimensions X, Y, Z) divided into work-groups; each work-group consists of work-items, which the hardware groups into wavefronts of a hardware-specific size.]
The size of wavefronts can differ on different GPU compute devices. For
example, some of the low-end and older GPUs, such as the AMD Radeon™ HD
54XX series graphics cards, have a wavefront size of 32 work-items. Higher-end
and newer AMD GPUs have a wavefront size of 64 work-items.
if(x)
{
. //items within these braces = A
.
.
}
else
{
. //items within these braces = B
.
.
}
The wavefront mask is set true for lanes (elements/items) in which x is true,
and A is executed. The mask then is inverted, and B is executed.
Example 1: If two branches, A and B, take the same amount of time t to execute
over a wavefront, the total time of execution, if any work-item diverges, is 2t.
Loops execute in a similar fashion, where the wavefront occupies a compute unit
as long as there is at least one work-item in the wavefront still being processed.
Thus, the total execution time for the wavefront is determined by the work-item
with the longest execution time.
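The cost behavior described above (divergent branches cost the sum of both sides; loops run until the slowest work-item finishes) can be captured in a simple timing model. The helper names and units are our own, for illustration only.

```python
# Toy cost model of lock-step execution within one wavefront.

def branch_time(time_a, time_b, takes_a):
    # takes_a[i]: whether work-item i takes branch A.
    # If any lane takes a side, all lanes step through it (masked).
    t = 0
    if any(takes_a):
        t += time_a
    if not all(takes_a):
        t += time_b
    return t

def loop_time(iterations_per_item, time_per_iteration):
    # The wavefront occupies the unit until its slowest work-item is done.
    return max(iterations_per_item) * time_per_iteration
```

For instance, if both sides of a branch cost t and even one of 64 work-items diverges, `branch_time` returns 2t, matching Example 1 above.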
Figure 2.4 Generalized AMD GPU Compute Device Structure for GCN
Devices
In GCN devices, each CU includes one Scalar Unit and four Vector (SIMD) units,
each of which contains an array of 16 processing elements (PEs). Each PE
contains one ALU. Each SIMD unit simultaneously executes a single operation
across 16 work items, but each can be working on a separate wavefront.
For example, for the AMD Radeon™ HD 79XX devices each of the 32 CUs has
one Scalar Unit and four Vector Units. Figure 2.5 shows only two compute
engines/command processors of the array that comprises the compute device of
the AMD Radeon™ HD 79XX family.
[Figure 2.5: AMD Radeon™ HD 79XX-series compute device — two asynchronous compute engines/command processors feed an array of compute units, each comprising one scalar unit, four vector units, an L1 cache, and an LDS; groups of compute units share scalar (SC) and instruction (I) caches, and all are backed by a read/write memory interface, a level 2 cache, and the GDDR5 memory system.]
In Figure 2.5, there are two command processors, which can process two
command queues concurrently. The Scalar Unit, Vector Unit, Level 1 data cache
(L1), and Local Data Share (LDS) are the components of one compute unit, of
which there are 32. The scalar (SC) cache is the scalar unit data cache, and the
Level 2 cache consists of instructions and data.
On GCN devices, the instruction stream contains both scalar and vector
instructions. On each cycle, it selects a scalar instruction and a vector instruction
(as well as a memory operation and a branch operation, if available); it issues
one to the scalar unit, the other to the vector unit; this takes four cycles to issue
over the four vector cores (the same four cycles over which the 16 units execute
64 work-items).
In GCN devices, the CUs are arranged in four vector unit arrays consisting of 16
processing elements each. Each of these arrays executes a single instruction
across each lane for each block of 16 work-items. That instruction is repeated
over four cycles to make the 64-element vector called a wavefront.
Thus, in GCN devices, the four vector units within a CU can operate on four
different wavefronts. If operations within a wavefront include dependencies,
independent operations from different wavefronts can be selected to be assigned
to a single vector unit to be executed in parallel every cycle.
2.2.2 Key differences between Southern Islands, Sea Islands, and Volcanic Islands families
The number of Asynchronous Compute Engines (ACEs) and CUs in an AMD
GPU, and the way they are structured, vary with the device family, as well as
with the device designations within a family.
The ACEs are responsible for managing the CUs and for scheduling and
resource allocation of the compute tasks (but not of the graphics shader tasks).
The ACEs operate independently; the greater the number of ACEs, the greater
is the performance. Each ACE fetches commands from cache or memory, and
creates task queues to be scheduled for execution on the CUs depending on
their priority and on the availability of resources.
Each ACE contains up to 8 queues and, together with the graphics command
processor, allows up to nine independent vector instructions to be executed per
clock cycle. Some of these queues are not available for use by OpenCL.
Devices in the Southern Islands families typically have two ACEs. Devices in the
Sea Islands and Volcanic Islands families contain between four and eight ACEs,
so they offer more performance. For example, the AMD Radeon™ R9 290X
devices in the VI family contain 8 ACEs and 44 CUs.
Communication and data transfers between the system and the GPU compute
device occur on the PCIe channel. AMD Accelerated Parallel Processing
graphics cards use PCIe 2.0 x16 (second generation, 16 lanes). Generation 1
x16 has a theoretical maximum throughput of 4 GBps in each direction.
Generation 2 x16 doubles the throughput to 8 GBps in each direction. Southern
Islands AMD GPUs support PCIe 3.0 with a theoretical peak performance of
16 GBps. Actual transfer performance is CPU and chipset dependent.
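The throughput figures above follow directly from the PCIe per-lane signaling rates and line encodings. The sketch below reproduces that arithmetic (spec values: Gen1 2.5 GT/s and Gen2 5 GT/s with 8b/10b encoding, Gen3 8 GT/s with 128b/130b encoding); the Gen3 result is about 15.75 GBps, which the text rounds to 16 GBps.

```python
# Theoretical one-direction PCIe x16 bandwidth in GB/s:
# per-lane GT/s * payload fraction (encoding) / 8 bits-per-byte * 16 lanes.
def pcie_x16_gbps(gt_per_s, encoded_bits, payload_bits):
    per_lane_gbytes = gt_per_s * (payload_bits / encoded_bits) / 8
    return 16 * per_lane_gbytes

gen1 = pcie_x16_gbps(2.5, 10, 8)     # 8b/10b encoding  -> 4 GB/s
gen2 = pcie_x16_gbps(5.0, 10, 8)     # 8b/10b encoding  -> 8 GB/s
gen3 = pcie_x16_gbps(8.0, 130, 128)  # 128b/130b encoding -> ~15.75 GB/s
```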
Transfers from the system to the GPU compute device are done either by the
command processor or by the DMA engine. The GPU compute device also can
read and write system memory directly from the compute unit through kernel
instructions over the PCIe bus.
Most commands to the GPU compute device are buffered in a command queue
on the host side. The queue of commands is sent to, and processed by, the GPU
compute device. There is no guarantee as to when commands from the
command queue are executed, only that they are executed in order.
Direct Memory Access (DMA) memory transfers can be executed separately from
the command queue using the DMA engine on the GPU compute device. DMA
calls are executed immediately; and the order of DMA calls and command queue
flushes is guaranteed.
DMA transfers can occur asynchronously. This means that a DMA transfer is
executed concurrently with other system or GPU compute operations when there
are no dependencies. However, data is not guaranteed to be ready until the DMA
engine signals that the event or transfer is completed. The application can use
OpenCL to query the hardware for DMA event completion. If used carefully, DMA
transfers are another source of parallelization.
Southern Islands devices have two DMA engines, which can perform bidirectional
transfers over the PCIe bus; queues created in consecutive order alternate
between the engines, since each DMA engine is assigned either the odd- or the
even-numbered queues.
In some cases, the user might want to mask the visibility of the GPUs seen by
the OpenCL application. One example is to dedicate one GPU for regular
graphics operations and the other three (in a four-GPU system) for compute. To
do that, set the GPU_DEVICE_ORDINAL environment variable, a comma-separated
list of the device indices the runtime exposes to the application.
Another example is a system with eight GPUs, where two distinct OpenCL
applications are running at the same time. The administrator might want to set
GPU_DEVICE_ORDINAL to 0,1,2,3 for the first application, and 4,5,6,7 for the
second application; thus, partitioning the available GPUs so that both
applications can run at the same time.
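For example (an illustrative Linux shell session; the variable name is the one given above, but the exact launch procedure depends on the application):

```shell
# Expose only devices 0-3 of an eight-GPU system to this process.
export GPU_DEVICE_ORDINAL=0,1,2,3
# The OpenCL runtime reads this comma-separated list when it
# enumerates GPU devices for the application.
echo "$GPU_DEVICE_ORDINAL"
```

The second application would be launched with GPU_DEVICE_ORDINAL=4,5,6,7 instead.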
[Figure 2.6: Simplified execution of wavefronts on a single compute unit.
Wavefronts W0 through W3 (called T0 through T3 in the text) alternate between
READY and STALL states across the 0 to 80 cycle range, so the unit always has
runnable work.]
At runtime, wavefront T0 executes until cycle 20; at this time, a stall occurs due
to a memory fetch request. The scheduler then begins execution of the next
wavefront, T1. Wavefront T1 executes until it stalls or completes. New wavefronts
execute, and the process continues until the available number of active
wavefronts is reached. The scheduler then returns to the first wavefront, T0.
If the data wavefront T0 is waiting for has returned from memory, T0 continues
execution. In the example in Figure 2.6, the data is ready, so T0 continues. Since
there were enough wavefronts and processing element operations to cover the
long memory latencies, the compute unit does not idle. This method of memory
latency hiding helps the GPU compute device achieve maximum performance.
If none of T0 – T3 are runnable, the compute unit waits (stalls) until one of T0 –
T3 is ready to execute. In the example shown in Figure 2.7, T0 is the first to
continue execution.
[Figure 2.7: Compute unit stall due to data dependency. Wavefronts W0 through
W3 are all stalled at once during part of the 0 to 80 cycle range, so the
compute unit idles until data returns; W0 (T0) is the first to resume.]
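The schedule in these figures can be mimicked with a toy CPU simulation. This is purely illustrative: the cycle counts and the round-robin policy below are assumptions for the sketch, not the actual hardware scheduler.

```c
#include <stdio.h>

#define NUM_WAVEFRONTS 4
#define RUN_CYCLES     20  /* cycles a wavefront runs before it issues a fetch */
#define MEM_LATENCY    60  /* cycles before the fetched data returns */

/* Round-robin over the wavefronts, switching whenever the current one
 * stalls on a memory fetch; returns the cycles spent doing useful work. */
int simulate(int total_cycles)
{
    int ready_at[NUM_WAVEFRONTS] = {0};  /* cycle at which each wavefront can run */
    int busy = 0, cycle = 0;
    while (cycle < total_cycles) {
        int scheduled = 0;
        for (int w = 0; w < NUM_WAVEFRONTS; w++) {
            if (ready_at[w] <= cycle) {
                busy  += RUN_CYCLES;                /* wavefront executes ...  */
                cycle += RUN_CYCLES;
                ready_at[w] = cycle + MEM_LATENCY;  /* ... then stalls on a fetch */
                scheduled = 1;
                break;
            }
        }
        if (!scheduled)
            cycle++;  /* all wavefronts stalled: the compute unit idles */
    }
    return busy;
}
```

With four wavefronts and a 60-cycle latency, simulate(80) reports the unit busy for all 80 cycles, matching Figure 2.6; with a smaller NUM_WAVEFRONTS, idle cycles appear, as in Figure 2.7.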
Chapter 3
Building and Running OpenCL
Programs
The compiler tool-chain provides a common framework for both CPUs and
GPUs, sharing the front-end and some high-level compiler transformations. The
back-ends are optimized for the device type (CPU or GPU). Figure 3.1 is a high-
level diagram showing the general compilation path of applications using
OpenCL. Functions of an application that benefit from acceleration are re-written
in OpenCL and become the OpenCL source. The code calling these functions
are changed to use the OpenCL API. The rest of the application remains
unchanged. The kernels are compiled by the OpenCL compiler to either CPU
binaries or GPU binaries, depending on the target device.
[Figure 3.1: The OpenCL compiler toolchain within the OpenCL runtime. OpenCL
source enters the compiler front-end, is linked with the built-in library as
LLVM IR, and passes through the LLVM optimizer; the resulting LLVM IR is then
lowered by the LLVM AS back-end for the CPU, or translated to AMD IL for the
GPU.]
For CPU processing, the OpenCL runtime uses the LLVM AS to generate x86
binaries. The OpenCL runtime automatically determines the number of
processing elements, or cores, present in the CPU and distributes the OpenCL
kernel between them.
For GPU processing, the OpenCL runtime post-processes the AMD IL from the
OpenCL compiler and turns it into complete AMD IL. This adds macros (from a
macro database, similar to the built-in library) specific to the GPU. The OpenCL
Runtime layer then removes unneeded functions and passes the complete IL to
the Shader compiler for compilation to GPU-specific binaries.
For GPU processing, the LLVM IR-to-AMD IL module receives LLVM IR and
generates optimized IL for a specific GPU type in an incomplete format, which is
passed to the OpenCL runtime, along with some metadata for the runtime layer
to finish processing.
1. Compile all the C++ files (Template.cpp), and get the object files.
For 32-bit object files on a 32-bit system, or 64-bit object files on 64-bit
system:
g++ -o Template.o -DAMD_OS_LINUX -c Template.cpp -I$AMDAPPSDKROOT/include
2. Link all the object files generated in the previous step to the OpenCL library
and create an executable.
For linking to a 64-bit library:
g++ -o Template Template.o -lOpenCL -L$AMDAPPSDKROOT/lib/x86_64
The following is an example of linking options used when the samples depend on
the SDKUtil Library (assuming the SDKUtil library is created in
$AMDAPPSDKROOT/lib/x86_64 for 64-bit libraries, or $AMDAPPSDKROOT/lib/x86 for
32-bit libraries):
g++ -o Template Template.o -lSDKUtil -lOpenCL -L$AMDAPPSDKROOT/lib/x86_64
-I dir — Add the directory dir to the list of directories to be searched for
header files. When parsing #include directives, the OpenCL compiler
resolves relative paths using the current working directory of the application.
-D name — Predefine name as a macro, with definition = 1. For
-D name=definition, the contents of definition are tokenized and processed as
if they appeared in a #define directive during translation phase 3. In
particular, the definition is truncated at any embedded newline characters.
-D options are processed in the order they are given in the options argument
to clBuildProgram.
-g — This is an experimental feature that lets you use the GNU project
debugger, GDB, to debug kernels on x86 CPUs running Linux or
cygwin/minGW under Windows. For more details, see Chapter 4, “Debugging
OpenCL.” This option does not affect the default optimization of the OpenCL
code.
-O0 — Specifies to the compiler not to optimize. This is equivalent to the
OpenCL standard option -cl-opt-disable.
-f[no-]bin-source — Does [not] generate OpenCL source in the .source
section. For more information, see Appendix C, “OpenCL Binary Image
Format (BIF) v2.0.”
-f[no-]bin-llvmir — Does [not] generate LLVM IR in the .llvmir section.
For more information, see Appendix C, “OpenCL Binary Image Format (BIF)
v2.0.”
-f[no-]bin-amdil — Does [not] generate AMD IL in the .amdil section.
For more information, see Appendix C, “OpenCL Binary Image Format (BIF)
v2.0.”
-f[no-]bin-exe — Does [not] generate the executable (ISA) in .text
section. For more information, see Appendix C, “OpenCL Binary Image
Format (BIF) v2.0.”
-save-temps[=<prefix>] — This option dumps intermediate temporary
files, such as IL and ISA code, for each OpenCL kernel. If <prefix> is not
given, temporary files are saved in the default temporary directory (the
current directory for Linux, C:\Users\<user>\AppData\Local for Windows).
If <prefix> is given, those temporary files are saved with the given
<prefix>. If <prefix> is an absolute path prefix, such as
C:\your\work\dir\mydumpprefix, those temporaries are saved under
C:\your\work\dir, with mydumpprefix as prefix to all temporary names. For
example:
-save-temps
saves _temp_nn_xxx_yyy.il and _temp_nn_xxx_yyy.isa under the default directory;
-save-temps=aaa
saves aaa_nn_xxx_yyy.il and aaa_nn_xxx_yyy.isa under the default directory;
-save-temps=C:\you\dir\bbb
saves bbb_nn_xxx_yyy.il and bbb_nn_xxx_yyy.isa under C:\you\dir.
where xxx and yyy are the device name and kernel name for this build,
respectively, and nn is an internal number to identify a build to avoid
overriding temporary files. Note that this naming convention is subject to
change.
To avoid source changes, two environment variables can be used to change CL
options at runtime.
As illustrated in Figure 3.2, the application can create multiple command queues
(some in libraries, for different components of the application, etc.). These
queues are muxed into one queue per device type. The figure shows command
queues 1 and 3 merged into one CPU device queue (blue arrows); command
queue 2 (and possibly others) are merged into the GPU device queue (red
arrow). The device queue then schedules work onto the multiple compute
resources present in the device. Here, K = kernel commands, M = memory
commands, and E = event commands.
[Figure 3.2: Runtime processing structure. Command queues 1, 2, and 3 in the
programming layer hold kernel (K), memory (M), and event (E) commands, such as
M1 K1 E1 K2 M2 K3 M3. Queues 1 and 3 are merged into the CPU device command
queue (for example, M11 K11 E11 K12 M12 M31 K32 M32); queue 2 is merged into
the GPU device command queue.]
Chapter 4
Debugging OpenCL
CodeXL enables the developer to:
- access the kernel execution directly from the API call that issues it,
- debug inside the kernel, and
- view all variable values across the different work-groups and work-items.
CodeXL is available from the AMD developer site:
http://developer.amd.com/tools-and-sdks/heterogeneous-computing/codexl/
After installing CodeXL, launch Visual Studio, then open the solution to be
worked on. In the Visual Studio menu bar, note the new CodeXL menu, which
contains all the required controls.
Select a Visual C/C++ project, and set its debugging properties as normal. To
add a breakpoint, either select New CodeXL Breakpoint from the CodeXL menu,
or navigate to a kernel file (*.cl) used in the application, and set a breakpoint
on the appropriate source line. Then, select the Launch OpenCL/OpenGL
Debugging from the CodeXL menu to start debugging.
To start kernel debugging, there are several options. One is to Step Into (F11)
the appropriate clEnqueueNDRangeKernel function call. Once the kernel starts
executing, debug it like C/C++ code, stepping into, out of, or over function calls
in the kernel, setting source breakpoints, and inspecting the locals, autos, watch,
and call stack views.
If you develop on Linux, or do not use Visual Studio, using the CodeXL stand-
alone application is very similar. After installing CodeXL, launch the CodeXL
stand-alone application from the installation folder. On the start page, select
"Create New Project," and use the browse button next to the "Executable Path"
field to select your application. Click the "Go" (F5) button, and follow the
instructions above to enter kernel debugging.
To enable kernel debugging with GDB, build the kernels with debug information
and without optimization by setting one of the following environment variables
before running the application:
AMD_OCL_BUILD_OPTIONS_APPEND="-g -O0" or
AMD_OCL_BUILD_OPTIONS="-g -O0"
To set a breakpoint, use:
b [N | function | kernel_name]
where N is the line number in the source code, function is the function name,
and kernel_name is constructed as follows: if the name of the kernel is
bitonicSort, the kernel_name is __OpenCL_bitonicSort_kernel.
Note that if no breakpoint is set, the program does not stop until execution is
complete.
Also note that OpenCL kernel symbols are not visible in the debugger until the
kernel is loaded. A simple way to check for known OpenCL symbols is to set a
breakpoint in the host code at clEnqueueNDRangeKernel, then use the GDB
info functions command to list them, as in the session below.
Unsorted Input
53 5 199 15 120 9 71 107 71 242 84 150 134 180 26 128 196 9 98 4 102 65
206 35 224 2 52 172 160 94 2 214 99 .....
File OCLm2oVFr.cl:
void __OpenCL_bitonicSort_kernel(uint *, const uint, const uint, const
uint, const uint);
Non-debugging symbols:
0x00007ffff23c2dc0 __OpenCL_bitonicSort_kernel@plt
0x00007ffff23c2f40 __OpenCL_bitonicSort_stub
(gdb) b __OpenCL_bitonicSort_kernel
Breakpoint 2 at 0x7ffff23c2de9: file OCLm2oVFr.cl, line 32.
(gdb) c
Continuing.
[Switching to Thread 0x7ffff2fcc700 (LWP 1895)]
Continuing.
4.2.4 Notes
1. To set a breakpoint in a work-item with a particular ID in dimension N,
one technique is to set a conditional breakpoint that triggers when
get_global_id(N) == ID. To do this, use:
b [ N | function | kernel_name ] if (get_global_id(N)==ID)
where N can be 0, 1, or 2.
Chapter 5
OpenCL Static C++ Programming
Language
5.1 Overview
This extension defines the OpenCL Static C++ kernel language, which is a form
of the ISO/IEC C++ programming language specification [1]. This language
supports overloading and templates, which can be resolved at compile time
(hence static), while restricting the use of language features that require
dynamic or run-time resolution. The language is also extended to support most
of the features described in Section 6 of the OpenCL spec: new data types
(vectors, images, samplers, etc.), OpenCL built-in functions, and more.
Note that support for templates and overloading greatly improves the
efficiency of writing code: it allows developers to avoid replicating code
when that is not necessary.
Using kernel template and kernel overloading requires support from the runtime
API as well. AMD provides a simple extension to clCreateKernel, which
enables the user to specify the desired kernel.
To support these cases, the following error codes were added; these can be
returned by clCreateKernel.
On the host side, the application creates the class and an equivalent memory
object with the same size (using the sizeof function). It then can use the class
methods to set or change values of the class members. When the class is ready,
the application uses a standard buffer API to move the class to the device (either
Unmap or Write), then sets the buffer object as the appropriate kernel
argument and enqueues the kernel for execution. When the kernel finishes
execution, the application can map (or read) the buffer object back into the
class and continue working on it.
To build a program written in OpenCL Static C++, select the language variant
by adding -x clc++ to the build options.
A class definition cannot contain any address space qualifier, either for
members or for methods:
class myClass{
public:
int myMethod1(){ return x;}
void __local myMethod2(){x = 0;} // illegal: address space qualifier on a method
private:
int x;
__local int y; // illegal: address space qualifier on a member
};
The class invocation inside a kernel, however, can be either in private or local
address space:
5.3.3 Namespaces
Namespaces are supported without change, as per [1].
5.3.4 Overloading
As defined in Section 13 of the C++ language specification, when two or more different
declarations are specified for a single name in the same scope, that name is said
to be overloaded. By extension, two declarations in the same scope that declare
the same name but with different types are called overloaded declarations. Only
kernel and function declarations can be overloaded, not object and type
declarations.
Also, the rules for well-formed programs as defined by Section 13 of the C++
language specification are lifted to apply to both kernel and function declarations.
The overloading resolution is per Section 13.1 of the C++ language specification,
but extended to account for vector types. The algorithm for “best viable function”,
Section 13.3.3 of the C++ language specification, is extended for vector types by
inducing a partial-ordering as a function of the partial-ordering of its elements.
Following the existing rules for vector types in the OpenCL 1.2 specification,
explicit conversion between vectors is not allowed. (This reduces the number of
possible overloaded functions with respect to vectors, but this is not expected to
be a particular burden to developers because explicit conversion can always be
applied at the point of function invocation.)
For overloaded kernels, the following syntax is used as part of the kernel name:
foo(type1,...,typen)
__attribute__((mangled_name(myMangledName)))
5.3.5 Templates
OpenCL C++ provides unrestricted support for C++ templates, as defined in
Section 14 of the C++ language specification. The arguments to templates are
extended to allow for all OpenCL base types, including vectors and pointers
qualified with OpenCL C address spaces (i.e. __global, __local, __private,
and __constant).
OpenCL C++ kernels (defined with __kernel) can be templated and can be
called from within an OpenCL C (C++) program or as an external entry point
(from the host).
For kernel templates, the following syntax is used as part of the kernel name
(assuming a kernel called foo):
foo<type1,...,typen>
foo<type1,...,typen>(typen+1,...,typem)
To support template kernels, the same mechanism for kernel overloading is used.
Use the following syntax:
__attribute__((mangled_name(myMangledName)))
5.3.6 Exceptions
Exceptions, as per Section 15 of the C++ language specification, are not
supported. The keywords try, catch, and throw are reserved, and the OpenCL
C++ compiler must produce a static compile time error if they are used in the
input program.
5.3.7 Libraries
Support for the general utilities library, as defined in Sections 20-21 of the C++
language specification, is not provided. The standard C++ libraries and STL
library are not supported.
5.4 Examples
5.4.1 Passing a Class from the Host to the Device and Back
The class definition must be identical in the host code and the device code,
except for the members' types in the case of vectors. If the class includes
vector data types, the definition must conform to the table of corresponding
API types for OpenCL language types in Section 6.1.2 of the OpenCL 1.2
specification.
// Device (kernel) code:
class Test
{
public:
void setX (int value);
private:
int x;
};
// Host code:
class Test
{
public:
void setX (int value);
private:
int x;
};
MyFunc ()
{
Test tempClass;
... // Some OpenCL startup code – create context, queue, etc.
cl_mem classObj = clCreateBuffer(context,
CL_MEM_USE_HOST_PTR, sizeof(Test),
&tempClass, &err);
clEnqueueMapBuffer(...,classObj,...);
tempClass.setX(10);
clEnqueueUnmapMemObject(...,classObj,...); // class is passed to the device
clEnqueueNDRangeKernel(..., fooKernel, ...);
clEnqueueMapBuffer(...,classObj,...); // class is passed back to the host
}
The names testAddFloat4 and testAddInt8 are the external names for the two
kernel instances. When calling clCreateKernel, passing one of these kernel
names leads to the correct overloaded kernel.
Chapter 6
OpenCL 2.0
6.1 Introduction
The OpenCL 2.0 specification is a significant evolution of OpenCL. It introduces
features that allow closer collaboration between the host and OpenCL devices,
such as Shared Virtual Memory (SVM) and device-side enqueue. Other features,
such as pipes, dynamic parallelism, and new image-related additions provide
effective ways of expressing heterogeneous programming constructs.
The following sections highlight the salient features of OpenCL 2.0 and provide
usage guidelines.
For guidelines on how to migrate from OpenCL 1.2 to OpenCL 2.0 and for
information about querying for image- and device-specific extensions, see
Portability considerations.
For a list of the new and deprecated functions, see Appendix F, “New and
deprecated functions in OpenCL 2.0.”
6.2.1 Overview
In OpenCL 1.2, the host and OpenCL devices do not share the same virtual
address space. Consequently, host memory, device memory, and communication
between the host and the OpenCL devices need to be explicitly specified and
managed. Buffers may need to be copied to the OpenCL device memory for
processing and copied back after processing, and the same memory cannot be
accessed simultaneously by the host and the OpenCL devices. To access
locations within a buffer (or regions within an image), the appropriate
offsets must be passed to and from the OpenCL devices.
In OpenCL 2.0, the host and OpenCL devices may share the same virtual
address space. Buffers need not be copied over between devices. When the host
and the OpenCL devices share the address space, communication between the
host and the devices can occur via shared memory. This simplifies programming
in heterogeneous contexts.
Support for SVM does not imply or require that the host and the OpenCL devices
in an OpenCL 2.0 compliant architecture share actual physical memory. The
OpenCL runtime manages the transfer of data between the host and the OpenCL
devices; the process is transparent to the programmer, who sees a unified
address space.
A caveat, however, concerns situations in which the host and the OpenCL
devices access the same region of memory at the same time. It would be highly
inefficient for the host and the OpenCL devices to have a consistent view of the
memory for each load/store from any device/host. In general, the memory model
of the language or architecture implementation determines how or when a
memory location written by one thread or agent is visible to another. The memory
model also determines to what extent the programmer can control the scope of
such accesses.
OpenCL 2.0 adopts the memory model defined in C++11, with some extensions.
The memory orders taken from C++11 are: "relaxed", "acquire", "release",
"acquire-release", and "sequentially consistent".
atomic_load/store
atomic_init
atomic_work_item_fence
atomic_exchange
atomic_compare_exchange
atomic_fetch_<op>, where <op> is "add", "sub", "xor", "and", or "or"
OpenCL 2.0 introduces the concept of "memory scope", which limits the extent
to which atomic operations are visible. For example:
"workgroup" scope means that the updates are to be visible only within the
work group
"device" scope means that the updates are to be visible only within the
device (across workgroups within the device)
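Because the model is taken from C++11, the release/acquire pairing can be illustrated with ordinary C++11 host code. This is a sketch of the memory-order concept only, not OpenCL C:

```cpp
#include <atomic>
#include <thread>

std::atomic<bool> flag{false};
int payload = 0;   // plain, non-atomic data

void producer()
{
    payload = 42;                                  // ordinary store
    flag.store(true, std::memory_order_release);   // publish the payload
}

void consumer()
{
    while (!flag.load(std::memory_order_acquire))  // pairs with the release store
        ;                                          // spin until published
    // here, payload == 42 is guaranteed by acquire/release ordering
}

// Run producer and consumer concurrently, as two agents would.
int run_pair()
{
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
    return payload;
}
```

In OpenCL C, the analogous store would be atomic_store_explicit(&flag, true, memory_order_release, memory_scope_device), where the memory scope argument selects the visibility scope described above.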
For coarse-grained SVM, the synchronization points are: the mapping or un-
mapping of the SVM memory and kernel launch or completion. This means
that any updates are visible only at the end of the kernel or at the point of
un-mapping the region of memory. Coarse-grained buffer memory has a fixed
virtual address for all devices it is allocated on. The physical memory is
allocated on Device Memory.
For fine-grained SVM, the synchronization points include, in addition to those
defined for coarse-grained SVM, the mapping or un-mapping of memory and
atomic operations. This means that updates are visible at the level of atomic
operations on the SVM buffer (for fine-grained buffer SVM, allocated with the
CL_MEM_SVM_ATOMICS flag) or the SVM system, i.e. anywhere in the SVM
(for fine-grained system SVM). Fine-grained buffer memory has the same
virtual address for all devices it is allocated on. The physical memory is
allocated on the Device-Visible Host Memory. If the fine grain buffer is
allocated with the CL_MEM_SVM_ATOMICS flag, the memory will be GPU-
CPU coherent.
The OpenCL 2.0 specification mandates coarse-grained SVM but not fine-
grained SVM.
For details, the developer is urged to read Section 3.3 of the OpenCL 2.0
specification.
6.2.2 Usage
In OpenCL 2.0, SVM buffers shared between the host and OpenCL devices are
created by calling clSVMAlloc (or malloc/new in the case of fine-grain system
support). The contents of such buffers may include pointers (into SVM buffers).
Pointer-based data structures are especially useful in heterogeneous
programming scenarios. A typical scenario is as follows:
1. Host creates the SVM buffer(s) with clSVMAlloc
2. Host maps the SVM buffer(s) for access, e.g. with a blocking call to
clEnqueueSVMMap
3. Host fills/updates the SVM buffer(s) with data structures, including
pointers
4. Host unmaps the SVM buffer(s), e.g. with clEnqueueSVMUnmap
5. Host enqueues processing kernels, passing SVM buffers to the kernels with
calls to clSetKernelArgSVMPointer and/or clSetKernelExecInfo
6. The OpenCL 2.0 device processes the structures in the SVM buffer(s),
including following/updating pointers.
Note that the map and unmap operations in Steps 2 and 4 may be eliminated if
the SVM buffers are created by using the CL_MEM_SVM_FINE_GRAIN_BUFFER
flag, which may not be supported on all devices.
Some applications do not require fine-grained atomics to ensure that the SVM is
consistent across devices after each read/write access. After the initial
map/creation of the buffer, the GPU or any other devices typically read from
memory. Even if the GPU or other devices write to memory, they may not require
a consistent view of the memory.
For example, while searching in parallel on a binary search tree (as in the Binary
Search tree sample presented later in this section), coarse-grain buffers are
usually sufficient. In general, coarse-grain buffers provide faster access
compared to fine grain buffers as the memory is not required to be consistent
across devices.
tmp_node = root;
while (1) {
    if (!tmp_node || tmp_node->value == key)
        break;   // found the key, or fell off a leaf
    tmp_node = (key < tmp_node->value) ? tmp_node->left : tmp_node->right;
}
found_nodes[init_id + i] = tmp_node;
In the above example, the binary search tree is created by the host in a
coarse-grain SVM buffer allocated with clSVMAlloc (with flags such as
CL_MEM_READ_ONLY); if the allocation fails, the application exits.
The “data” is the tree created by the host as a coarse-grain buffer and is passed
to the kernel as an input pointer.
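The per-work-item traversal amounts to an ordinary binary-search-tree walk. A plain-C sketch follows; the node layout and names here are hypothetical, chosen only to mirror the sample's intent:

```c
#include <stddef.h>

typedef struct node {
    int value;
    struct node *left, *right;
} node;

/* Return the node holding key, or NULL if the key is absent. On the
 * device, the tree would live in a coarse-grain SVM buffer, so the
 * host-built left/right pointers remain valid inside the kernel. */
node *search(node *root, int key)
{
    node *tmp_node = root;
    while (tmp_node != NULL && tmp_node->value != key)
        tmp_node = (key < tmp_node->value) ? tmp_node->left : tmp_node->right;
    return tmp_node;
}
```

Each work-item would run this loop for its own key; because the searches are read-only, no synchronization between work-items is needed.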
Tree (size in M) | CPU time (ms) | GPU (OpenCL 2.0) | GPU (OpenCL 1.2) *
1                | 23.46         | 5.17             | 3.22 (+1.92) (+49.58 +8.50)
The above table shows the performance of the 2.0 implementation over the 1.2
implementation.
As SVM was absent in OpenCL 1.2, the tree needed to be sent to the GPU
memory. In addition, as pointers were not pointing to the same tree as on the
host, the tree might have required to be transformed into arrays for GPU
compute. The values provided in parentheses indicate the extra time required for
such computes. Alternatively, offsets might have been required to be used
instead of pointers on both the host and the device at the cost of a few more
additions.
Finally, more than 5M nodes could not be allocated in 1.2, as the allowable
memory allocation was limited by the amount of memory that could be used on
the device. Overall, the 2.0 version exceeds the 1.2 version in both performance
and usability.
6.3.1 Overview
In OpenCL 1.2, all parameters in a function definition must have address
spaces associated with them (the default address space is the private address
space). This necessitates creating an explicit version of the function for
each desired address space.
OpenCL 2.0 introduces a new address space called the generic address space.
Data cannot be stored in the generic address space, but a pointer to this space
can reference data located in the private, local, or global address spaces. A
function with generic pointer arguments may be called with pointers to any
address space except the constant address space. Pointers that are declared
without pointing to a named address space point to the generic address space.
However, such pointers must be associated with a named address space before
they can be used. Functions may be written with arguments and return values
that point to the generic address space, improving readability and
programmability.
6.3.2 Usage
In OpenCL 1.2, the developer needed to write three functions for a pointer p that
can reference the local, private, or global address space:
As foo is a generic function, the compiler will accept calls to it with pointers to
any address space except the constant address space.
Note: The OpenCL 2.0 spec itself shows most built-in functions that accept
pointer arguments as accepting generic pointer arguments.
In the xxx APP SDK sample, addMul2d is a generic function that uses generic
address spaces for its operands. The function computes the convolution sum of
two vectors. Two kernels compute the convolution: one uses data in the global
address space (convolution2DUsingGlobal); the other uses the local
address space (sepiaToning2DUsingLocal). The use of a single function
improves the readability of the source.
A simplified sketch of addMul2d:
float4 addMul2d (uchar4 *src, float *filter, int2 filterDim, int width)
{
    int i, j;
    float4 sum = (float4)(0.0f);
    for (i = 0; i < filterDim.y; i++)
        for (j = 0; j < filterDim.x; j++)
            sum +=
(convert_float4(src[(i*width)+j]))*((float4)(filter[(i*filterDim.x)
+j]));
    return sum;
}
Note: The compiler tries to resolve the address space at compile time;
otherwise, the runtime decides whether the pointer references the local or the
global address space. For optimum performance, the code should make the
pointer reference easy for the compiler to detect by avoiding data-dependent
address space selection, so that costly run-time resolution is not required.
6.4.1 Overview
In OpenCL 1.2, a kernel cannot be enqueued from a currently running kernel.
Enqueuing a kernel requires returning control to the host.
OpenCL 2.0 introduces "clang blocks" and new built-in functions that allow
kernels to enqueue new work to device-side queues. In addition, the runtime API
call clCreateCommandQueue has been deprecated in favor of a new call,
clCreateCommandQueueWithProperties, which allows creating device-side
command queues. Device enqueue allows programmers to launch the (child)
kernels from the current (parent) kernel.
6.4.2 Usage
In OpenCL 1.2, the host side code to perform this might be structured as follows:
1. Enqueue kernel A
2. Enqueue kernel B
3. Enqueue kernel C
4. Enqueue kernel Check
5. Enqueue blocking map of Check result, e.g. with clEnqueueSVMMap
6. If the Check result is not "Converged", then:
   a. Enqueue unmap of Check result
   b. Go to Step 1
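The host-side round trip in the steps above can be sketched in plain C, with each enqueue replaced by a hypothetical stub; none of these names are real OpenCL calls:

```c
/* Stubs standing in for the enqueued kernels A, B, and C. */
void run_A(void) {}
void run_B(void) {}
void run_C(void) {}

int iterations = 0;

/* Stub for the Check kernel: converges after five iterations. */
int run_check(void)
{
    return ++iterations >= 5;  /* nonzero means "Converged" */
}

/* The OpenCL 1.2 pattern: the host must block on the Check result on
 * every iteration, which is exactly the round trip that device-side
 * enqueue removes in OpenCL 2.0. */
int drive_until_converged(void)
{
    int converged;
    do {
        run_A();
        run_B();
        run_C();
        converged = run_check();  /* would be a blocking map/read in 1.2 */
    } while (!converged);
    return iterations;
}
```

With device-side enqueue, this loop collapses to a single host enqueue of Check, and the device re-enqueues A, B, and C itself until convergence.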
However, with device-side enqueue in OpenCL 2.0, the Check kernel may be
altered to enqueue blocks that carry out A, B, C, and Check when it detects that
convergence has not been reached. This avoids a potentially costly interaction
with the host on each iteration. Also, a slight modification of Check might allow
the replacement of the entire loop above with a single host-side enqueue of the
Check kernel.
With OpenCL 1.2, this process would require a complex interaction between the
host and the OpenCL device. The device-side kernel would need to somehow
mark the sub-regions requiring further work, and the host side code would need
to scan all of the sub-regions looking for the marked ones and then enqueue a
kernel for each marked sub-region. This is made more difficult by the lack of
globally visible atomic operations in OpenCL 1.2.
However, with OpenCL 2.0, rather than just marking each interesting sub-region,
the kernel can instead launch a new sub-kernel to process each marked sub-
region. This significantly simplifies the code and improves efficiency due to the
elimination of the interactions with, and dependence on, the host.
The following figure explains the input and output expected from this sample:
Input:
12 31 47 64 19 27 49 81 99 11
Output:
31 47 19 11
As can be seen in the Figure above, the input is given as an array of positive
numbers and the expected output is an array of the primes in the input array.
This is a classic example of data-parallel processing, in which the processing for
each array element for checking whether the element is prime or not, can be
done in parallel with all others.
A kernel for this check can be as simple as:
__kernel void primes_kernel(__global int *in, __global int *out)
{
    int id = get_global_id(0);
    if (isPrime(in[id]))
        out[id] = in[id];
}
Assuming that the output array is initialized with zeroes before the kernel
executes, the output would be:
0 31 47 0 19 0 0 0 0 11
The output array can be written compactly by using atomics directly, but doing
so would essentially make the loop run sequentially. Extracting the non-zero
numbers otherwise requires atomics, or processing on the CPU through a
sequential loop; neither approach is efficient or scalable.
In OpenCL 2.0, the new workgroup built-ins can efficiently compute prefix sums
and perform reductions on various operations like additions, maximum and
minimum on a workgroup. Some such built-ins are:
work_group_scan_inclusive_<op>,
work_group_scan_exclusive_<op>, and work_group_reduce_<op>,
where <op> can be one of "add", "max", or "min". The "reduce" operation
collapses the array to a scalar (for example, the sum of all its elements),
while the prefix sum ("scan") yields an array of the same size, with each
element containing the prefix sum up to that index.
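On the CPU, the effect of these built-ins over a single workgroup can be emulated as follows (a sketch for the "add" operation only):

```c
/* work_group_scan_inclusive_add: out[i] = in[0] + ... + in[i] */
void scan_inclusive_add(const int *in, int *out, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++) {
        sum += in[i];
        out[i] = sum;
    }
}

/* work_group_scan_exclusive_add: out[i] = in[0] + ... + in[i-1] */
void scan_exclusive_add(const int *in, int *out, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++) {
        out[i] = sum;
        sum += in[i];
    }
}

/* work_group_reduce_add: sum of all elements */
int reduce_add(const int *in, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += in[i];
    return sum;
}
```

On the device, every work-item in the group calls the built-in with its own element and receives its own result; the sequential loops here only describe the values produced, not the parallel implementation.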
The following code illustrates how the workgroup built-ins in OpenCL 2.0 can be
used to efficiently extract the primes from an array of positive integers.
int isPrime(int number)
{
    int i;
    if (number < 2) return 0;
    for (i = 2; i * i <= number; i++)
        if (number % i == 0) return 0;
    return 1;
}
/***
* set_primes_kernel:
* this kernel fills the boolean primes array to reflect whether the
* corresponding entry in the input array is prime
***/
__kernel void set_primes_kernel(__global int *in, __global int *primes)
{
    int id = get_global_id(0);
    primes[id] = 0;
    if (isPrime(in[id]))
        primes[id] = 1;
}
/***
* get_primes_kernel:
* compacts the prime entries into outPrimes, using the prefix sums
* passed in through output
***/
__kernel void get_primes_kernel(__global int *in, __global int *output,
                                __global int *outPrimes)
{
    int id = get_global_id(0);
    int k = output[id] - 1;
    if (id == 0) {
        if (output[id] == 1)
            outPrimes[k] = in[id];
        return;
    }
    if (output[id-1] == k)
        outPrimes[k] = in[id];
}
/***
* group_scan_kernel:
* computes the inclusive prefix sum within each workgroup
***/
__kernel void group_scan_kernel(__global int *in, __global int *out)
{
    int in_data;
    int i = get_global_id(0);
    in_data = in[i];
    out[i] = work_group_scan_inclusive_add(in_data);
}
/***
* global_scan_kernel:
* merges the per-workgroup prefix sums stage by stage
***/
__kernel void global_scan_kernel(__global int *out /* ... */)
{
    unsigned int l;
    int add_elem;
    /* ... computation of lid, prev_el, and curr_el from the stage and
       the work-item indices elided ... */
    if (lid == 0)
        add_elem = out[prev_el];
    work_group_barrier(CLK_GLOBAL_MEM_FENCE|CLK_LOCAL_MEM_FENCE);
    add_elem = work_group_broadcast(add_elem, 0);
    out[curr_el] += add_elem;
}
The first kernel fills the "primes" Boolean array as in the OpenCL 1.2 example.
The second kernel runs "group_scan_kernel" on this boolean "prime" array.
However, the "group_scan_kernel" function works only at the workgroup level.
The next kernel, "global_scan_kernel", runs in multiple (log(n)) stages, where
"n" is the size of the input array. At each stage, it merges two consecutive
workgroups whose prefix sums are ready. After log(n) stages, the prefix sums
for the complete "out" array are computed; these give each prime's index in
the final primes array. During every stage, work_group_broadcast is used to
broadcast the prefix sum of the last element of the previous workgroup.
The following figures show the contents of the arrays at each stage.
Input:
12 31 47 64 19 27 49 81 99 11
(Assuming workgroup size is 4)
After set_primes_kernel, primes:
0 1 1 0 1 0 0 0 0 1
After group_scan_kernel, out:
0 1 2 2 1 1 1 1 0 1
After global_scan_kernel, out:
0 1 2 2 3 3 3 3 3 4
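The staged computation above can be checked with a small host-side sketch. The following Python program is an illustration only, not the SDK sample: the function names and the sequential stand-in for the log-stage merge are my own, and it reproduces the primes/scan/gather pipeline for the sample input with a workgroup size of 4.

```python
def is_prime(n):
    # Simple trial-division primality check (stands in for the isPrime helper).
    if n < 2:
        return 0
    return 0 if any(n % d == 0 for d in range(2, int(n ** 0.5) + 1)) else 1

def group_scan_inclusive_add(data, wg_size):
    # Inclusive prefix sum within each workgroup, like work_group_scan_inclusive_add.
    out = []
    for start in range(0, len(data), wg_size):
        acc = 0
        for v in data[start:start + wg_size]:
            acc += v
            out.append(acc)
    return out

def global_scan(scanned, wg_size):
    # Sequential stand-in for the log-stage global_scan_kernel: each group adds
    # the last prefix sum of the preceding, already-merged groups ("broadcast").
    out = list(scanned)
    for start in range(wg_size, len(out), wg_size):
        add_elem = out[start - 1]
        for i in range(start, min(start + wg_size, len(out))):
            out[i] += add_elem
    return out

inp = [12, 31, 47, 64, 19, 27, 49, 81, 99, 11]
primes = [is_prime(v) for v in inp]                        # set_primes_kernel
scanned = global_scan(group_scan_inclusive_add(primes, 4), 4)
out_primes = [0] * scanned[-1]
for i, v in enumerate(inp):                                # get_primes_kernel
    if primes[i]:
        out_primes[scanned[i] - 1] = v
print(out_primes)   # [31, 47, 19, 11]
```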
Finally, the device enqueue feature is employed: these kernels are enqueued in
sequence from the device instead of from the host, which improves
performance. Because these kernels must run in sequence, they cannot be
combined into one kernel.
outPrimes:
31 47 19 11
The performance of this OpenCL 2.0 version is compared with that of the
OpenCL 1.2 version of the same problem. Unlike OpenCL 2.0, OpenCL 1.2 does
not provide workgroup built-ins to perform the prefix sum efficiently. While the
primes can be computed in the same manner as in OpenCL 2.0, the group and
global scans contain redundant computations of the order of O(n) for each work
item, as illustrated in the following basic OpenCL 1.2 implementation:
kernel void get_primes_kernel(global int *in, global int *primes,
                              global int *outPrimes)   /* signature reconstructed */
{
    int id = get_global_id(0);
    int idx, i;
    if (primes[id]) {
        idx = 0;
        for (i = 0; i < id; i++)
            idx += primes[i];
        outPrimes[idx] = in[id];
    }
}
Thus, in OpenCL 1.2, each work item computes the prefix sum for its index and
stores the input prime into the outPrimes array. Clearly, more work is performed
for each work item, compared to OpenCL 2.0.
The following graph and table compare the performance of the OpenCL 2.0
version with that of the OpenCL 1.2 version.
As the size of the input array exceeds 4K, the difference in performance
between the two implementations becomes more apparent. In OpenCL 1.2, the
time taken is O(n) and the work done is O(n^2); in OpenCL 2.0, the time taken
is O(log(n)) and the work done is O(n), assuming the prefix sum at the group
level takes constant time.
The power of device enqueue is aptly illustrated in the example of binary search.
To make the problem interesting, multiple keys in a sorted array will be searched
for. The versions written for OpenCL 1.2 and 2.0 will also be compared with
respect to programmability and performance.
The OpenCL 1.2 version of the code that performs binary search is as follows:
output[i]=lBound;
The search for multiple keys is performed sequentially, while the sorted array is
divided into 256-element chunks. The NDRange is the size of the array divided
by the chunk size. Each work-item checks whether the key is present in its
range and, if the key is present, updates the output array.
The issue with this approach is that if the input array is very large, the number
of work-items (the NDRange) becomes very large, because the array is not
divided into smaller, more manageable chunks.
In OpenCL 2.0, the device enqueue feature offers clear advantages in binary
search performance.
for (key_count = 0; key_count < no_of_keys; key_count++)
        continue;
    else
*/
    subdivSize_for_keys[key_count] = subdivSize;
    outputArray[key_count].x = parent_globalids[key_count]
                               + tid * subdivSize_for_keys[key_count];
    parent_globalids[key_count] += subdivSize_for_keys[key_count] * tid;
    outputArray[key_count].w = 1;
    outputArray[key_count].y = subdivSize_for_keys[key_count];
    globalLowerIndex = tid * subdivSize_for_keys[key_count];
    continue;
    subdivSize_for_keys[key_count] =
        (globalUpperIndex - globalLowerIndex + 1) / global_threads;
    void (^binarySearch_device_enqueue_wrapper_blk)(void) =
        ^{ binarySearch_device_enqueue(outputArray,
                                       sortedArray,
                                       subdivSize_for_keys[key_count],
                                       globalLowerIndex,
                                       keys[key_count], global_threads,
                                       key_count, parent_globalids); };
    int err_ret = enqueue_kernel(def_q, CLK_ENQUEUE_FLAGS_NO_WAIT, ndrange1,
                                 binarySearch_device_enqueue_wrapper_blk);
    if (err_ret != 0) {
        outputArray[key_count].w = 2;
        outputArray[key_count].z = err_ret;
        return;
    }
In the OpenCL 2.0 version, each work-item checks, for each key, whether the
key falls in its search range. If it does, the work-item further divides the range
into chunks and enqueues the kernel for further processing.
The advantage is that when the input array is large, the OpenCL 2.0 version
divides it into 1024-element chunks. The chunk in which the given key falls is
found, and another kernel is enqueued that further divides that chunk into 1024-
element chunks, and so on. In OpenCL 1.2, because the whole array is taken
as the NDRange, a huge number of work-groups require processing.
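The hierarchical narrowing described above can be sketched on the host. The following Python function is hypothetical, not the SDK kernel: a chunk count of 4 is used for illustration in place of 1024, and the "re-enqueue" is a loop iteration rather than an enqueue_kernel call.

```python
def chunked_search(sorted_arr, key, chunks=4):
    # Emulates the device-enqueue strategy: split the current range into
    # `chunks` sub-ranges, keep the one whose endpoints bracket the key,
    # and "re-enqueue" on it until the range is trivially small.
    lo, hi = 0, len(sorted_arr)
    while hi - lo > chunks:
        step = (hi - lo + chunks - 1) // chunks
        for start in range(lo, hi, step):        # one "work-item" per chunk
            end = min(start + step, hi)
            if sorted_arr[start] <= key <= sorted_arr[end - 1]:
                lo, hi = start, end              # child kernel on this chunk
                break
        else:
            return -1                            # key falls outside every chunk
    for i in range(lo, hi):                      # final small range: linear check
        if sorted_arr[i] == key:
            return i
    return -1

print(chunked_search(list(range(0, 100, 2)), 34))   # 17
```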
The following figure shows how the OpenCL 2.0 version compares with the
OpenCL 1.2 version as the array grows beyond a certain size.
[Figure: "Performance comparison vs. OpenCL 1.2" -- Y-axis: kernel time
(milliseconds, 0 to 1400); X-axis: number of keys (1 to 2000) for sample sizes
of 100K and 10M; series: OpenCL 1.2 and OpenCL 2.0.]
The above figure shows the performance benefit of using OpenCL 2.0 over
OpenCL 1.2 for the same sample. In OpenCL 2.0, the reduced number of kernel
launches from the host allows superior performance; kernel enqueues are much
more efficient when done from the device.
6.5.1 Overview
In OpenCL 1.2, only work-items in the same workgroup can synchronize.
OpenCL 2.0 introduces a new and detailed memory model which allows
developers to reason about the effects of their code on memory, and in particular
understand whether atomic operations and fences used for synchronization
ensure the visibility of variables being used to communicate between threads. In
conjunction with the new memory model, OpenCL 2.0 adds a new set of atomic
built-in functions and fences derived from C++11 (although the set of types is
restricted), and also deprecates the 1.2 atomic built-in functions and fences.
6.5.2 Usage
The following examples, which illustrate the use of atomics, are part of the AMD
APP SDK.
This sample illustrates atomic loads and stores with the use of memory orders.
The host uses the C++11 compiler and the same memory model.
kernel void ldstore_kernel(global int *buffer,
                           global atomic_int *atomicBuffer)   /* signature and store reconstructed */
{
    int i = get_global_id(0);
    buffer[i] += i;
    atomic_store_explicit(&atomicBuffer[i], 100 + i, memory_order_release);
}
The kernel next stores (100+i), where i is the ID of the work-item, into
atomicBuffer[i]. The memory_order_release order ensures that the
updated copy reaches the CPU, which is waiting for it to report PASS for the
test.
After the atomic operation, the updates on fine-grain variables (such as buffer)
will also be available at the host. The CPU checks for the following to ensure that
the results are OK:
for (i = 0; i < N; i++)
    while (std::atomic_load_explicit((std::atomic<int> *)&atomicBuffer[i],
                                     std::memory_order_acquire) != (100 + i))
        ;
for (i = 0; i < N; i++)
    if (buffer[i] != (64 + i))
This sample illustrates the atomic CAS (compare-and-swap) operation, typically
used for "lock-free" programming, in which a critical section can be created
without using waiting mutexes or semaphores. The following kernel concurrently
inserts the IDs of the various work-items into the "list" array by using the atomic
CAS operation. The same loop also runs on the host and inserts the other N
work-items. In this way, 2*N numbers are inserted into this "list".
int head, i;
i = get_global_id(0) + 1;
head = list[0];
if (i != get_global_size(0)) {
    do {
        list[i] = head;
    } while (!atomic_compare_exchange_strong_explicit(
                 (global atomic_int *)&list[0], &head, i,
                 memory_order_release, memory_order_acquire,
                 memory_scope_all_svm_devices));
}
Note that there is no wait to enter the critical section; list[0] and head are
updated atomically. A similar loop runs on the CPU. Again, note that the
variables "list" and "head" must be in fine-grain SVM buffers.
memory_order_release and memory_scope_all_svm_devices are used to
ensure that the CPU gets the updates -- hence the name "platform atomics."
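The retry loop above can be mimicked on the host. The following Python sketch is an emulation, not the SDK sample: the AtomicInt class stands in for a fine-grain SVM atomic_int, and threads stand in for work-items. It performs the same CAS-based push and then verifies that every node landed in the list exactly once.

```python
import threading

class AtomicInt:
    # Host-side stand-in for an atomic_int: compare_exchange is the only
    # way the shared list head gets updated, as in the kernel above.
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def compare_exchange(self, expected, desired):
        # Returns (success, observed); on failure the caller retries with the
        # freshly observed value, like atomic_compare_exchange_strong.
        with self._lock:
            if self._value == expected:
                self._value = desired
                return True, expected
            return False, self._value

def insert(list_arr, head, i):
    # Lock-free push of node i onto the intrusive list rooted at head.
    observed = head.load()
    while True:
        list_arr[i] = observed              # link node i to the current head
        ok, observed = head.compare_exchange(observed, i)
        if ok:
            return

N = 64
list_arr = [0] * (2 * N + 1)
head = AtomicInt(0)                         # 0 is the end-of-list sentinel
threads = [threading.Thread(target=insert, args=(list_arr, head, i))
           for i in range(1, 2 * N + 1)]
for t in threads: t.start()
for t in threads: t.join()

# Walk the list: every node 1..2N must appear exactly once.
seen, node = set(), head.load()
while node != 0:
    seen.add(node)
    node = list_arr[node]
print(len(seen))   # 128
```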
This sample illustrates the atomic fetch operation, an RMW (read-modify-write)
operation. The following kernel computes the maximum of the N numbers in
array "A". The results of the intermediate comparisons are placed in a Boolean
matrix "B". After "B" is computed, each row (i) is examined; the row that
contains all 1s corresponds to the maximum (C[i]).
kernel void atomicMax(volatile global int *A, global int *B, global
                      int *C, global int *P)
{
    int i = get_global_id(0);
    int j = get_global_id(1);
    int N = *P, k;
    if (A[i] >= A[j]) B[i*N+j] = 1;   /* comparison reconstructed */
    else B[i*N+j] = 0;
    if (j == 0) {
        C[i] = 1;
        for (k = 0; k < N; k++)
            atomic_fetch_and((global atomic_int *)&C[i],
                             B[i*N+k]);   /* row check reconstructed */
    }
}
Similarly, another sample includes the following kernel, which increments a
counter 2*N times: N times in the kernel and another N times on the host:
kernel void counter_kernel(global atomic_uint *count)   /* signature reconstructed */
{
    atomic_fetch_add(count, 1u);   // (*count)++;
}
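A host-side sketch of the same idea, with a lock standing in for the atomic RMW (an emulation, not the SDK sample; the thread pair stands in for the kernel side and the host side):

```python
import threading

counter = 0
lock = threading.Lock()

def fetch_add(n):
    # Stand-in for atomic_fetch_add: each of the n increments is one RMW.
    global counter
    for _ in range(n):
        with lock:
            counter += 1

N = 1000
a = threading.Thread(target=fetch_add, args=(N,))   # "kernel" side
b = threading.Thread(target=fetch_add, args=(N,))   # "host" side
a.start(); b.start(); a.join(); b.join()
print(counter)   # 2000
```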
6.6 Pipes
6.6.1 Overview
OpenCL 2.0 introduces a new mechanism, pipes, for passing data between
kernels. A pipe is essentially a structured buffer containing space for some
number of "packets" of a single kernel-specified type, plus bookkeeping
information. Pipes are accessed via special read_pipe and write_pipe built-
in functions. A given kernel may either read from or write to a pipe, but not both.
Pipes are only "coherent" at the standard synchronization points; the result of
concurrent access to the same pipe is otherwise undefined.
Pipes are created on the host with a call to clCreatePipe, and may be passed
between kernels. Pipes may be particularly useful when combined with device-
side enqueue for dynamically constructing computational data-flow graphs.
There are two types of pipes: a read pipe, from which a number of packets can
be read; and a write pipe, to which a number of packets can be written.
Note: A pipe specified as read-only cannot be written into and a pipe specified
as write-only cannot be read from. A pipe cannot be read from and written into
at the same time.
The memory allocated in the above function can be passed to kernels as read-
only or write-only pipes.
Also, a set of built-in functions have been added to operate on the pipes.
6.6.3 Usage
The corresponding sample in the AMD APP SDK contains two kernels: a
"producer_kernel", which writes into the pipe, and a "consumer_kernel", which
reads from the pipe. In this example, the producer writes a sequence of random
numbers; the consumer reads these numbers and creates a histogram of them.
The producer kernel first reserves space in the pipe, writes to it, and commits
the write by invoking:
work_group_reserve_write_pipe(rng_pipe, szgr);
write_pipe(rng_pipe,rid,lid, &gfrn);
work_group_commit_write_pipe(rng_pipe, rid);
Similarly, the consumer kernel reserves the pipe for reading and reads from the
pipe by issuing:
work_group_reserve_read_pipe(rng_pipe, szgr);
read_pipe(rng_pipe,rid,lid, &rn);
work_group_commit_read_pipe(rng_pipe, rid);
The pipe thus provides a useful communication mechanism between two kernels.
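The producer/consumer pattern can be sketched on the host with a bounded queue standing in for the pipe. This is an emulation only: the reserve/commit semantics are replaced by a sentinel packet, and the names are my own.

```python
from queue import Queue
import random
import threading

def producer(pipe, count, seed=42):
    # Writes `count` random packets, then a None sentinel (the real kernels
    # use reserve/commit bookkeeping instead of a sentinel).
    rng = random.Random(seed)
    for _ in range(count):
        pipe.put(rng.randrange(256))
    pipe.put(None)

def consumer(pipe, histogram):
    # Reads packets until the sentinel and histograms them.
    while True:
        packet = pipe.get()
        if packet is None:
            break
        histogram[packet] += 1

pipe = Queue(maxsize=1024)        # bounded, like a pipe's packet capacity
histogram = [0] * 256
t1 = threading.Thread(target=producer, args=(pipe, 10000))
t2 = threading.Thread(target=consumer, args=(pipe, histogram))
t1.start(); t2.start(); t1.join(); t2.join()
print(sum(histogram))   # 10000
```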
6.7 Sub-groups
6.7.1 Overview
OpenCL 2.0 introduces a Khronos sub-group extension. Sub-groups are a logical
abstraction of the hardware SIMD execution model akin to wavefronts, warps, or
vectors and permit programming closer to the hardware in a vendor-independent
manner. This extension includes a set of cross-sub-group built-in functions that
match the set of the cross-work-group built-in functions specified in Section 6.4.
6.8.1 Overview
OpenCL 1.2 permits the declaration of only constant address space variables at
program scope.
OpenCL 2.0 permits the declaration of variables in the global address space at
program (i.e. outside function) scope. These variables have the lifetime of the
program in which they appear, and may be initialized. The host cannot directly
access program-scope variables; a kernel must be used to read/write their
contents from/to a buffer created on the host.
Program-scope global variables can save data across kernel executions. Using
program-scope variables can potentially eliminate the need to create buffers on
the host and pass them into each kernel for processing. However, there is a limit
to the size of such variables. The developer must ensure that the total size does
not exceed the value returned by the device info query:
CL_DEVICE_MAX_GLOBAL_VARIABLE_SIZE.
6.9.1 Overview
OpenCL 2.0 introduces significant enhancements for processing images.
A read_write access qualifier for images has been added. The qualifier allows
reading from and writing to certain types of images (verified against
clGetSupportedImageFormats by using the
CL_MEM_KERNEL_READ_AND_WRITE flag) in the same kernel, but reads must be
sampler-less. An atomic_work_item_fence with the
CLK_IMAGE_MEM_FENCE flag and the memory_scope_work_item memory
scope is required between reads and writes to the same image to ensure that
the writes are visible to subsequent reads. If multiple work-items are writing to
and reading from multiple locations in an image, a call to work_group_barrier
with the CLK_IMAGE_MEM_FENCE flag is required.
OpenCL 2.0 provides improved image support, especially support for sRGB
images and depth images.
6.9.2 sRGB
sRGB is a standard RGB color space that is widely used for monitors, printers,
digital cameras, and the Internet. Because most image-processing algorithms
operate on linear RGB values, processing sRGB images often requires
converting them to linear RGB.
OpenCL 2.0 provides a new feature for handling this conversion directly. Note
that only the combination of data type CL_UNORM_INT8 and channel order
CL_sRGBA is mandatory in OpenCL 2.0. The AMD implementations support this
combination. The remaining combinations are optional in OpenCL 2.0.
cl_image_format imageFormat;
imageFormat.image_channel_data_type = CL_UNORM_INT8;
imageFormat.image_channel_order = CL_sRGBA;
cl_mem image = clCreateImage(context,   /* call reconstructed */
    CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
    &imageFormat,
    &desc,        // cl_image_desc
    pSrcImage,    // a pointer to the image data
    &retErr);     // returned error code
A new sRGB image can also be created based on an existing RGB image object,
so that the kernel can implicitly convert the sRGB image data to RGB. This is
useful when the viewing pixels are sRGB but share the same data as the existing
RGB image.
After an sRGB image object has been created, the read_imagef call can be
used in the kernel to read it transparently; read_imagef automatically converts
the sRGB values into linear RGB, so converting sRGB to RGB explicitly in the
kernel is not necessary if the device supports OpenCL 2.0. Note that only
read_imagef can be used for reading sRGB image data, because only the
CL_UNORM_INT8 data type is supported with OpenCL 2.0.
The following is a kernel sample that illustrates how to read an sRGB image
object.
OpenCL 2.0 does not include writing sRGB images directly, but provides the
cl_khr_srgb_image_writes extension. The AMD implementations do not
support this extension as of this writing.
In order to write sRGB pixels in a kernel, explicit conversion from linear RGB to
sRGB must be implemented in the kernel.
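As a reference for such a kernel, the standard sRGB transfer functions can be expressed as follows. This is a host-side Python sketch of the per-channel math only; the function names are my own.

```python
def srgb_to_linear(c):
    # Standard sRGB electro-optical transfer function, per channel, c in [0, 1].
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

def linear_to_srgb(c):
    # Inverse transfer function: what a kernel must apply before writing sRGB.
    return c * 12.92 if c <= 0.0031308 else 1.055 * c ** (1 / 2.4) - 0.055

# Round-trip a CL_UNORM_INT8-style normalized value.
v = 200 / 255.0
assert abs(linear_to_srgb(srgb_to_linear(v)) - v) < 1e-9
```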
cl_image_format imageFormat;
imageFormat.image_channel_data_type = CL_UNORM_INT16;
imageFormat.image_channel_order = CL_DEPTH;
cl_mem image = clCreateImage(   /* call reconstructed */
    context,                    // a valid OpenCL context
    CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
    &imageFormat,
    &desc,                      // cl_image_desc
    pSrcImage,                  // a pointer to the image data
    &retErr);                   // returned error code
A depth image object can be read by using the read_imagef call in the kernel.
For write, write_imagef must be used. read_image(i|ui) and
write_image(i|ui) are not supported for depth images.
// Read the depth image object (input) based on sampler and offset,
// and save it (results)
6.10.1 Overview
Prior to OpenCL 2.0, each work-group size needed to divide evenly into the
corresponding global size. This requirement is relaxed in OpenCL 2.0; the final
work-group in each dimension is allowed to be smaller than all of the other work-
groups in the "uniform" part of the NDRange. This can reduce the effort required
to map problems onto NDRanges.
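For example, the per-dimension group sizes implied by a non-uniform NDRange can be computed as follows (a small illustrative sketch, not an OpenCL API):

```python
def workgroup_sizes(global_size, local_size):
    # With OpenCL 2.0 non-uniform work-groups, the final group in a dimension
    # may be smaller; every group no longer has to be exactly local_size.
    full, rem = divmod(global_size, local_size)
    return [local_size] * full + ([rem] if rem else [])

print(workgroup_sizes(1000, 256))   # [256, 256, 256, 232]
```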
OpenCL 2.0 includes changes in the runtime and the compiler. In the runtime,
some new functions (such as those for SVM) have been added. In the compiler,
the -cl-std=CL2.0 option is needed in order to compile OpenCL 2.0 kernels.
If a program uses the OpenCL 2.0 functions and a kernel is compiled with the
-cl-std=CL2.0 option, the program will not build or compile on OpenCL 1.2
platforms. If a program uses only OpenCL 1.2 functions and a kernel is
compiled without the -cl-std=CL2.0 option, the program should run on
OpenCL 2.0 platforms.
Appendix A
OpenCL Optional
Extensions
The OpenCL extensions are associated with devices and can be queried for a
specific device. Extensions can also be queried for platforms, in which case all
devices in the platform support those extensions.
The OpenCL Specification states that all API functions of the extension must
have names in the form of cl<FunctionName>KHR, cl<FunctionName>EXT, or
cl<FunctionName><VendorName>. All enumerated values must be in the form of
CL_<enum_name>_KHR, CL_<enum_name>_EXT, or
CL_<enum_name>_<VendorName>.
After the device list is retrieved, the extensions supported by each device can be
queried with function call clGetDeviceInfo() with parameter param_name being
set to enumerated value CL_DEVICE_EXTENSIONS.
The extensions are returned in a char string, with extension names separated by
a space. To see if an extension is present, search the string for a specified
substring.
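For example (an illustrative sketch; the helper name is my own), splitting the returned string on spaces avoids accidentally matching one extension name inside a longer one:

```python
def has_extension(extensions_string, name):
    # Token-wise check of the space-separated string returned for
    # CL_DEVICE_EXTENSIONS; safer than a plain substring search.
    return name in extensions_string.split()

exts = "cl_khr_fp64 cl_amd_media_ops cl_amd_printf"
print(has_extension(exts, "cl_amd_media_ops"))   # True
print(has_extension(exts, "cl_khr_fp16"))        # False
```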
The initial state of the compiler is set to ignore all extensions, as if it were
explicitly set with the following directive:
#pragma OPENCL EXTENSION all : disable
This means that the extensions must be explicitly enabled to be used in kernel
programs.
Each extension that affects kernel code compilation must add a defined macro
with the name of the extension. This allows the kernel code to be compiled
differently, depending on whether the extension is supported and enabled, or not.
For example, for extension cl_khr_fp64 there should be a #define directive for
macro cl_khr_fp64, so that the following code can be preprocessed:
#ifdef cl_khr_fp64
// some code
#else
// some code
#endif
The clGetExtensionFunctionAddress function returns the address of the
extension function specified by the FunctionName string. The returned value
must be appropriately cast to a function pointer type, as specified in the
extension spec and header file.
A return value of NULL means that the specified function does not exist in the
CL implementation. A non-NULL return value does not guarantee that the
extension function actually exists – queries described in sec. 2 or 3 must be done
to ensure the extension is supported.
A.8.1 cl_amd_fp64
Before using double data types, double-precision floating point operators, and/or
double-precision floating point routines in OpenCL™ C kernels, include the
#pragma OPENCL EXTENSION cl_amd_fp64 : enable directive. See Table A.1
for a list of supported routines.
A.8.2 cl_amd_vec3
This extension adds support for vectors with three elements: float3, short3,
char3, etc. This data type was added to OpenCL 1.1 as a core feature. For more
details, see section 6.1.2 in the OpenCL 1.1 or OpenCL 1.2 spec.
A.8.3 cl_amd_device_persistent_memory
This extension adds support for the new buffer and image creation flag
CL_MEM_USE_PERSISTENT_MEM_AMD. Buffers and images allocated with this flag
reside in host-visible device memory. This flag is mutually exclusive with the flags
CL_MEM_ALLOC_HOST_PTR and CL_MEM_USE_HOST_PTR.
A.8.4 cl_amd_device_attribute_query
This extension provides a means to query AMD-specific device attributes. To
enable this extension, include the #pragma OPENCL EXTENSION
cl_amd_device_attribute_query : enable directive. Once the extension is
enabled, and the clGetDeviceInfo parameter <param_name> is set to
CL_DEVICE_PROFILING_TIMER_OFFSET_AMD, the offset in nanoseconds between
an event timestamp and Epoch is returned.
A.8.4.1 cl_device_profiling_timer_offset_amd
This query enables the developer to get the offset between event timestamps
in nanoseconds. To use it, compile the kernels with the #pragma OPENCL
EXTENSION cl_amd_device_attribute_query : enable directive. For
kernels compiled with this pragma, calling clGetDeviceInfo with <param_name>
set to CL_DEVICE_PROFILING_TIMER_OFFSET_AMD returns the offset in
nanoseconds between event timestamps.
A.8.4.2 cl_amd_device_topology
This query enables the developer to get a description of the topology used to
connect the device to the host. Currently, this query works only in Linux. Calling
clGetDeviceInfo with <param_name> set to CL_DEVICE_TOPOLOGY_AMD returns
the following 32-byte union of structures:
typedef union
{
    struct { cl_uint type; cl_uint data[5]; } raw;
    struct { cl_uint type; cl_char unused[17];
             cl_char bus; cl_char device; cl_char function; } pcie;
} cl_device_topology_amd;
The type of the structure returned can be queried by reading the first unsigned
int of the returned data. The developer can use this type to cast the returned
union into the right structure type.
Currently, the only supported type in the structure above is PCIe (type value =
1). The information returned contains the PCI Bus/Device/Function of the device,
and is similar to the result of the lspci command in Linux. It enables the
developer to match between the OpenCL device ID and the physical PCI
connection of the card.
A.8.4.3 cl_amd_device_board_name
This query enables the developer to get the name of the GPU board and model
of the specific device. Currently, this is only for GPU devices.
A.8.5 cl_amd_compile_options
This extension adds the following options, which are not part of the OpenCL
specification.
-g — This is an experimental feature that lets you use the GNU project
debugger, GDB, to debug kernels on x86 CPUs running Linux or
cygwin/minGW under Windows. For more details, see Chapter 4, “Debugging
OpenCL.” This option does not affect the default optimization of the OpenCL
code.
-O0 — Specifies to the compiler not to optimize. This is equivalent to the
OpenCL standard option -cl-opt-disable.
-f[no-]bin-source — Does [not] generate OpenCL source in the .source
section. For more information, see Appendix C, “OpenCL Binary Image
Format (BIF) v2.0.”
-f[no-]bin-llvmir — Does [not] generate LLVM IR in the .llvmir section.
For more information, see Appendix C, “OpenCL Binary Image Format (BIF)
v2.0.”
-f[no-]bin-amdil — Does [not] generate AMD IL in the .amdil section.
For more information, see Appendix C, “OpenCL Binary Image Format (BIF)
v2.0.”
-f[no-]bin-exe — Does [not] generate the executable (ISA) in .text section.
For more information, see Appendix C, “OpenCL Binary Image Format (BIF)
v2.0.”
To avoid source changes, there are two environment variables that can be used
to change CL options during the runtime.
A.8.6 cl_amd_offline_devices
To generate binary images offline, it is necessary to access the compiler for every
device that the runtime supports, even if the device is currently not installed on
the system. When, during context creation, CL_CONTEXT_OFFLINE_DEVICES_AMD
is passed in the context properties, all supported devices, whether online or
offline, are reported and can be used to create OpenCL binary images.
A.8.7 cl_amd_event_callback
This extension provides the ability to register event callbacks for states other
than CL_COMPLETE. The full set of event states is allowed: CL_QUEUED,
CL_SUBMITTED, and CL_RUNNING. This extension is enabled automatically and
does not need to be explicitly enabled through #pragma when using the SDK v2
of AMD Accelerated Parallel Processing.
A.8.8 cl_amd_popcnt
This extension introduces a “population count” function called popcnt. This
extension was taken into core OpenCL 1.2, and the function was renamed
popcount. The core 1.2 popcount function (documented in section 6.12.3 of the
OpenCL Specification) is identical to the AMD extension popcnt function.
A.8.9 cl_amd_media_ops
This extension adds the following built-in functions to the OpenCL language.
Note: For OpenCL scalar types, n = 1; for vector types, it is {2, 4, 8, or 16}.
Return value
((((uint)src[0]) & 0xFF) << 0) +
((((uint)src[1]) & 0xFF) << 8) +
((((uint)src[2]) & 0xFF) << 16) +
((((uint)src[3]) & 0xFF) << 24)
A.8.10 cl_amd_media_ops2
This extension adds further built-in functions to those of cl_amd_media_ops.
When enabled, it adds the following built-in functions to the OpenCL language.
Built-in Function: uintn amd_msad (uintn src0, uintn src1, uintn src2)
Description:
uchar4 src0u8 = as_uchar4(src0.s0);
uchar4 src1u8 = as_uchar4(src1.s0);
dst.s0 = src2.s0 +
((src1u8.s0 == 0) ? 0 : abs(src0u8.s0 - src1u8.s0)) +
((src1u8.s1 == 0) ? 0 : abs(src0u8.s1 - src1u8.s1)) +
((src1u8.s2 == 0) ? 0 : abs(src0u8.s2 - src1u8.s2)) +
((src1u8.s3 == 0) ? 0 : abs(src0u8.s3 - src1u8.s3));
A similar operation is applied to other components of the vectors.
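The per-component pseudocode above can be emulated on the host as follows (an illustrative Python sketch operating on one 32-bit component; the function name is my own):

```python
def msad(src0, src1, src2):
    # Masked sum of absolute differences over the four bytes of src0/src1,
    # following the amd_msad description: byte positions where src1's byte
    # is 0 are skipped.
    total = src2
    for shift in (0, 8, 16, 24):
        a = (src0 >> shift) & 0xFF
        b = (src1 >> shift) & 0xFF
        if b != 0:
            total += abs(a - b)
    return total

# Bytes of src0: (1, 2, 3, 4); bytes of src1: (5, 0, 1, 4); byte 1 is masked out.
print(msad(0x04030201, 0x04010005, 10))   # 16
```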
Built-in Function: ulongn amd_qsad (ulongn src0, uintn src1, ulongn src2)
Description:
uchar8 src0u8 = as_uchar8(src0.s0);
ushort4 src2u16 = as_ushort4(src2.s0);
ushort4 dstu16;
dstu16.s0 = amd_sad(as_uint(src0u8.s0123), src1.s0, src2u16.s0);
dstu16.s1 = amd_sad(as_uint(src0u8.s1234), src1.s0, src2u16.s1);
dstu16.s2 = amd_sad(as_uint(src0u8.s2345), src1.s0, src2u16.s2);
dstu16.s3 = amd_sad(as_uint(src0u8.s3456), src1.s0, src2u16.s3);
dst.s0 = as_uint2(dstu16);
A similar operation is applied to other components of the vectors.
Built-in Function:
ulongn amd_mqsad (ulongn src0, uintn src1, ulongn src2)
Description:
uchar8 src0u8 = as_uchar8(src0.s0);
ushort4 src2u16 = as_ushort4(src2.s0);
ushort4 dstu16;
dstu16.s0 = amd_msad(as_uint(src0u8.s0123), src1.s0, src2u16.s0);
dstu16.s1 = amd_msad(as_uint(src0u8.s1234), src1.s0, src2u16.s1);
dstu16.s2 = amd_msad(as_uint(src0u8.s2345), src1.s0, src2u16.s2);
dstu16.s3 = amd_msad(as_uint(src0u8.s3456), src1.s0, src2u16.s3);
dst.s0 = as_uint2(dstu16);
A similar operation is applied to other components of the vectors.
Built-in Function: uintn amd_sadw (uintn src0, uintn src1, uintn src2)
Description:
ushort2 src0u16 = as_ushort2(src0.s0);
ushort2 src1u16 = as_ushort2(src1.s0);
dst.s0 = src2.s0 +
abs(src0u16.s0 - src1u16.s0) +
abs(src0u16.s1 - src1u16.s1);
A similar operation is applied to other components of the vectors.
Built-in Function: uintn amd_sadd (uintn src0, uintn src1, uintn src2)
Description:
Built-in Function: uintn amd_bfe (uintn src0, uintn src1, uintn src2)
Description:
NOTE: The >> operator represents a logical right shift.
offset = src1.s0 & 31;
width = src2.s0 & 31;
if (width == 0)
    dst.s0 = 0;
else if ((offset + width) < 32)
    dst.s0 = (src0.s0 << (32 - offset - width)) >> (32 - width);
else
    dst.s0 = src0.s0 >> offset;
A similar operation is applied to other components of the vectors.
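The pseudocode can be emulated on the host as follows (an illustrative Python sketch of the unsigned variant; masking to 32 bits stands in for the hardware's fixed register width):

```python
def bfe_u32(src0, src1, src2):
    # Unsigned bitfield extract following the amd_bfe pseudocode: extract
    # `width` bits of src0 starting at bit `offset`, using logical shifts.
    offset = src1 & 31
    width = src2 & 31
    if width == 0:
        return 0
    if offset + width < 32:
        return ((src0 << (32 - offset - width)) & 0xFFFFFFFF) >> (32 - width)
    return (src0 & 0xFFFFFFFF) >> offset

print(hex(bfe_u32(0xF0F0F0F0, 4, 8)))   # 0xf
```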
Built-in Function: intn amd_bfe (intn src0, uintn src1, uintn src2)
Description:
NOTE: The >> operator represents an arithmetic right shift.
offset = src1.s0 & 31;
width = src2.s0 & 31;
if (width == 0)
    dst.s0 = 0;
else if ((offset + width) < 32)
    dst.s0 = (src0.s0 << (32 - offset - width)) >> (32 - width);
else
    dst.s0 = src0.s0 >> offset;
A similar operation is applied to other components of the vectors.
Built-in Function:
intn amd_median3 (intn src0, intn src1, intn src2)
uintn amd_median3 (uintn src0, uintn src1, uintn src2)
floatn amd_median3 (floatn src0, floatn src1, floatn src2)
Description:
Built-in Function:
intn amd_min3 (intn src0, intn src1, intn src2)
uintn amd_min3 (uintn src0, uintn src1, uintn src2)
floatn amd_min3 (floatn src0, floatn src1, floatn src2)
Description:
Returns min of src0, src1, and src2.
Built-in Function:
intn amd_max3 (intn src0, intn src1, intn src2)
uintn amd_max3 (uintn src0, uintn src1, uintn src2)
floatn amd_max3 (floatn src0, floatn src1, floatn src2)
Description:
Returns max of src0, src1, and src2.
A.8.11 cl_amd_printf
The OpenCL Specification 1.1 and 1.2 support the optional AMD extension
cl_amd_printf, which provides printf capabilities to OpenCL C programs. To use
this extension, an application first must include
#pragma OPENCL EXTENSION cl_amd_printf : enable.
Built-in function:
printf(__constant char * restrict format, …);
This function writes output to the stdout stream associated with the
host application. The format string is a character sequence that:
The OpenCL C printf closely matches the definition found as part of the
C99 standard. Note that conversions introduced in the format string with
% are supported with the following guidelines:
A.8.12 cl_amd_predefined_macros
The following macros are predefined when compiling OpenCL™ C kernels.
These macros are defined automatically based on the device for which the code
is being compiled.
GPU devices:
__WinterPark__
__BeaverCreek__
__Turks__
__Caicos__
__Tahiti__
__Pitcairn__
__Capeverde__
__Cayman__
__Barts__
__Cypress__
__Juniper__
__Redwood__
__Cedar__
__ATI_RV770__
__ATI_RV730__
__ATI_RV710__
__Loveland__
__GPU__
CPU devices:
__CPU__
__X86__
__X86_64__
Note that __GPU__ or __CPU__ are predefined whenever a GPU or CPU device
is the compilation target.
const char* getDeviceName()
{
    /* GPU-specific branches elided in this excerpt */
#if defined(__X86_64__)
    return "X86-64CPU";
#elif defined(__CPU__)
    return "GenericCPU";
#else
    return "UnknownDevice";
#endif
}
kernel void test_pf(global int* a)
{
printf("Device Name: %s\n", getDeviceName());
}
A.8.13 cl_amd_bus_addressable_memory
This extension defines an API for peer-to-peer transfers between AMD GPUs
and other PCIe devices, such as third-party SDI I/O devices. Peer-to-peer
transfers have extremely low latencies because they do not use the host's main
memory or the CPU (see Figure A.1). This extension allows memory allocated
by the graphics driver to be shared with other devices on the PCIe bus (peer-
to-peer transfers) by exposing a write-only bus address. It also allows memory
allocated on other PCIe devices (non-AMD GPUs) to be accessed directly by
AMD GPUs. One possible use is for a video capture device to write directly into
GPU memory using its DMA. This extension is supported only on AMD
FirePro™ professional graphics cards.
guaranteed. Access to the counter is done only through the inc/dec built-in
functions; thus, no two work-items receive the same value in the case that a
given kernel only increments or decrements the counter. (Also see
http://www.khronos.org/registry/cl/extensions/ext/cl_ext_atomic_counters_32.txt.)
Table A.2 Extension Support for Older AMD GPUs and CPUs
x86 CPU
Extension Juniper1 Redwood2 Cedar3 with SSE2 or later
cl_khr_*_atomics Yes Yes Yes Yes
cl_ext_atomic_counters_32 Yes Yes Yes No
cl_khr_gl_sharing Yes Yes Yes Yes
cl_khr_byte_addressable_store Yes Yes Yes Yes
cl_ext_device_fission No No No Yes
cl_amd_device_attribute_query Yes Yes Yes Yes
cl_khr_fp64 No No No Yes
4
cl_amd_fp64 No No No Yes
cl_amd_vec3 Yes Yes Yes Yes
Images Yes Yes Yes Yes5
cl_khr_d3d10_sharing Yes Yes Yes Yes
cl_amd_media_ops Yes Yes Yes Yes
cl_amd_media_ops2 Yes Yes Yes Yes
cl_amd_printf Yes Yes Yes Yes
cl_amd_popcnt Yes Yes Yes Yes
cl_khr_3d_image_writes Yes Yes Yes No
Platform Extensions
cl_khr_icd Yes Yes Yes Yes
cl_amd_event_callback Yes Yes Yes Yes
cl_amd_offline_devices Yes Yes Yes Yes
1. ATI Radeon™ HD 5700 series, AMD Mobility Radeon™ HD 5800 series, AMD FirePro™ V5800 series,
AMD Mobility FirePro™ M7820.
2. ATI Radeon™ HD 5600 Series, ATI Radeon™ HD 5500 Series, AMD Mobility Radeon™ HD 5700
Series, AMD Mobility Radeon™ HD 5600 Series, AMD FirePro™ V4800 Series, AMD FirePro™ V3800
Series, AMD Mobility FirePro™ M5800
3. ATI Radeon™ HD 5400 Series, AMD Mobility Radeon™ HD 5400 Series
4. Available on all devices that have double-precision, including all Southern Islands devices.
5. Environment variable CPU_IMAGE_SUPPORT must be set.
Appendix B
The OpenCL Installable Client Driver
(ICD)
The OpenCL Installable Client Driver (ICD) is part of the AMD Accelerated
Parallel Processing software stack. Code written prior to SDK v2.0 must be
changed to comply with OpenCL ICD requirements.
B.1 Overview
The ICD allows multiple OpenCL implementations to co-exist; also, it allows
applications to select between these implementations at runtime.
context = clCreateContextFromType(
0,
dType,
NULL,
NULL,
&status);
/*
* Have a look at the available platforms and pick either
* the AMD one if available or a reasonable default.
*/
cl_uint numPlatforms;
cl_platform_id platform = NULL;
status = clGetPlatformIDs(0, NULL, &numPlatforms);
if(!sampleCommon->checkVal(status,
CL_SUCCESS,
"clGetPlatformIDs failed."))
{
return SDK_FAILURE;
}
if (0 < numPlatforms)
{
cl_platform_id* platforms = new cl_platform_id[numPlatforms];
status = clGetPlatformIDs(numPlatforms, platforms, NULL);
if(!sampleCommon->checkVal(status,
CL_SUCCESS,
"clGetPlatformIDs failed."))
{
return SDK_FAILURE;
}
for (unsigned i = 0; i < numPlatforms; ++i)
{
char pbuf[100];
status = clGetPlatformInfo(platforms[i],
CL_PLATFORM_VENDOR,
sizeof(pbuf),
pbuf,
NULL);
if(!sampleCommon->checkVal(status,
CL_SUCCESS,
"clGetPlatformInfo failed."))
{
return SDK_FAILURE;
}
platform = platforms[i];
if (!strcmp(pbuf, "Advanced Micro Devices, Inc."))
{
break;
}
}
delete[] platforms;
}
/*
 * If we could find our platform, use it. Otherwise pass a NULL and
 * get whatever the implementation thinks we should be using.
 */
cl_context_properties cps[3] =
{
CL_CONTEXT_PLATFORM,
(cl_context_properties)platform,
0
};
/* Use NULL for backward compatibility */
cl_context_properties* cprops = (NULL == platform) ? NULL : cps;
context = clCreateContextFromType(
cprops,
dType,
NULL,
NULL,
&status);
NOTE: It is recommended that the host code look at the platform vendor string
when searching for the desired OpenCL platform, instead of using the platform
name string. The platform name string might change, whereas the platform
vendor string remains constant for a particular vendor’s implementation.
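The vendor-string rule in the note above can be reduced to a small, testable selection routine. The sketch below is illustrative only: `PlatformDesc` and `pickPlatform` are hypothetical stand-ins for the result of querying CL_PLATFORM_VENDOR on each platform with clGetPlatformInfo; a real program would fill `vendor` from that call.

```cpp
#include <cstring>
#include <vector>

// Hypothetical record standing in for one OpenCL platform; `vendor` holds the
// string returned by clGetPlatformInfo(..., CL_PLATFORM_VENDOR, ...).
struct PlatformDesc {
    const char* vendor;
    int id;
};

// Pick the AMD platform by its vendor string, which stays constant across
// driver releases; fall back to the first platform when AMD is not present.
int pickPlatform(const std::vector<PlatformDesc>& platforms) {
    int chosen = platforms.empty() ? -1 : platforms[0].id;
    for (const PlatformDesc& p : platforms) {
        if (std::strcmp(p.vendor, "Advanced Micro Devices, Inc.") == 0) {
            chosen = p.id;
            break;
        }
    }
    return chosen;
}
```

Matching on the vendor string rather than CL_PLATFORM_NAME keeps the selection stable even when the marketing name of the platform changes between releases.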
Appendix C
OpenCL Binary Image Format (BIF)
v2.0
C.1 Overview
OpenCL Binary Image Format (BIF) 2.0 is in the ELF format. BIF2.0 allows the
OpenCL binary to contain the OpenCL source program, the LLVM IR, and the
executable. The BIF defines the following special sections:
.source — for storing the OpenCL source program;
.llvmir — for storing the OpenCL immediate representation (LLVM IR);
.amdil — for storing the AMD IL generated from the OpenCL source program;
.text — for storing the executable;
.comment — for storing the OpenCL version and the driver version that created the binary.
The BIF can have other special sections for debugging, etc. It also contains
several ELF special sections, such as .shstrtab, .strtab, .symtab, and .rodata
(which stores OpenCL runtime control data).
By default, OpenCL generates a binary that has LLVM IR and the executable for
the GPU (.llvmir, .amdil, and .text sections), as well as LLVM IR and the
executable for the CPU (.llvmir and .text sections). The BIF binary always
contains a .comment section, which is a readable C string. The default behavior
can be changed with the BIF options described in Section C.2, “BIF Options,”
page C-3.
The LLVM IR enables recompilation from LLVM IR to the target. When a binary
is used to run on a device for which the original program was not generated and
the original device is feature-compatible with the current device, OpenCL
recompiles the LLVM IR to generate new code for the device. Note that the
LLVM IR is only universal within devices that are feature-compatible in the same
device type, not across different device types. This means that the LLVM IR for
the CPU is not compatible with the LLVM IR for the GPU. The LLVM IR for a
GPU works only for GPU devices that have equivalent feature sets.
The fields not shown in Table C.1 are given values according to the ELF
Specification. The e_machine value is defined as one of the oclElfTargets
enumerants; the values for these are:
e_machine =
C.1.2 Bitness
The BIF can be in either 32-bit or 64-bit ELF format. For the GPU,
OpenCL generates a 32-bit BIF binary; it can read either a 32-bit or a 64-bit BIF
binary. For the CPU, OpenCL generates and reads only 32-bit BIF binaries if the
host application is 32-bit (on either 32-bit OS or 64-bit OS). It generates and
reads only 64-bit BIF binaries if the host application is 64-bit (on a 64-bit OS).
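Because a BIF binary is an ELF file, its bitness can be checked directly from the ELF identification bytes before handing it to the OpenCL runtime. The helper below is a hypothetical sketch, not part of the OpenCL API: `e_ident[4]` (EI_CLASS) is 1 for a 32-bit ELF and 2 for a 64-bit ELF.

```cpp
#include <cstddef>
#include <cstdint>

// Inspect the ELF identification bytes of a (BIF) binary buffer.
// Returns 32 or 64 for valid ELF files, or 0 for a non-ELF buffer.
int bifBitness(const uint8_t* buf, size_t size) {
    // ELF magic: 0x7F 'E' 'L' 'F', followed by the EI_CLASS byte.
    if (size < 5 || buf[0] != 0x7F || buf[1] != 'E' ||
        buf[2] != 'L' || buf[3] != 'F')
        return 0;
    if (buf[4] == 1) return 32;  // ELFCLASS32
    if (buf[4] == 2) return 64;  // ELFCLASS64
    return 0;
}
```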
By default, OpenCL generates the .llvmir section, .amdil section, and .text
section. The following are examples of using these options:
A binary that retains its LLVM IR can be recompiled for all the other devices of the same device type.
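As a hedged illustration of the BIF options, the helper below assembles a clBuildProgram options string; `bifOptions` is a hypothetical name, and the flag spellings follow the -f[no-]bin-source, -f[no-]bin-llvmir, -f[no-]bin-amdil, and -f[no-]bin-exe compiler options listed in this guide.

```cpp
#include <string>

// Assemble a clBuildProgram options string that selects which BIF sections
// (.source, .llvmir, .amdil, .text) are emitted into the binary.
std::string bifOptions(bool source, bool llvmir, bool amdil, bool exe) {
    std::string opts;
    opts += source ? "-fbin-source " : "-fno-bin-source ";
    opts += llvmir ? "-fbin-llvmir " : "-fno-bin-llvmir ";
    opts += amdil  ? "-fbin-amdil "  : "-fno-bin-amdil ";
    opts += exe    ? "-fbin-exe"     : "-fno-bin-exe";
    return opts;  // pass as the `options` argument of clBuildProgram
}
```

For example, keeping the LLVM IR while dropping the source section preserves the ability to recompile the binary for feature-compatible devices of the same type.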
Appendix D
Hardware overview of pre-GCN
devices
A general OpenCL device comprises compute units, each of which can have
multiple processing elements. A work-item (or SPMD kernel instance) executes
on a single processing element. The processing elements within a compute unit
can execute in lock-step using SIMD execution. Compute units, however,
execute independently (see Figure D.1).
AMD GPUs consist of multiple compute units. The number of compute units and
the way they are structured vary with the device family, as well as with device
designations within a family. Each compute unit possesses ALUs. For devices
in the Northern Islands and Southern Islands families, these ALUs are arranged
in four processing elements (in the Evergreen family, five), each an array
of 16 ALUs. Each of these arrays executes a single instruction across a block
of 16 work-items, one per lane. That instruction is repeated over four cycles
to make the 64-element vector called a wavefront. On Northern Islands and
Evergreen family devices, the PE arrays execute instructions from one wavefront,
so that each work-item issues four (for Northern Islands) or five (for Evergreen)
instructions at once in a very-long-instruction-word (VLIW) packet.
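The wavefront arithmetic above can be made concrete. The constants and helper below are illustrative, not part of any AMD API: a 16-lane PE array repeating each instruction over four cycles yields the 64-work-item wavefront.

```cpp
// 16 lanes per PE array, each instruction repeated over four cycles,
// gives the 64-element wavefront described above.
constexpr int kLanes = 16;
constexpr int kCyclesPerInstruction = 4;
constexpr int kWavefrontSize = kLanes * kCyclesPerInstruction;  // 64

// Number of wavefronts needed to cover one work-group (rounded up).
int wavefrontsPerGroup(int workGroupSize) {
    return (workGroupSize + kWavefrontSize - 1) / kWavefrontSize;
}
```

A 256-work-item work-group therefore occupies four wavefronts, and any work-group size that is not a multiple of 64 leaves some lanes of its final wavefront idle.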
Figure D.1 shows a simplified block diagram of a generalized AMD GPU compute
device.
[Figure D.1: Generalized GPU compute device block diagram — compute devices
composed of processing elements (arrays of ALUs) with instruction and control
flow logic, a branch execution unit, and general-purpose registers.]
GPU compute devices comprise groups of compute units. Each compute unit
contains numerous stream cores, which are responsible for executing kernels,
each operating on an independent data stream. Stream cores, in turn, contain
numerous processing elements, which are the fundamental, programmable ALUs
that perform integer, single-precision floating-point, double-precision
floating-point, and transcendental operations. All processing elements within
a compute unit execute the same instruction sequence in lock-step for
Evergreen and Northern Islands devices; different compute units can execute
different instructions.
Appendix E
OpenCL-OpenGL Interoperability
Using GLUT
1. Use glutInit to initialize the GLUT library and negotiate a session with the
windowing system. This function also processes the command line options,
depending on the windowing system.
2. Use wglGetCurrentContext to get the current rendering GL context
(HGLRC) of the calling thread.
3. Use wglGetCurrentDC to get the device context (HDC) that is associated
with the current OpenGL rendering context of the calling thread.
4. Use the clGetGLContextInfoKHR (See Section 9.7 of the OpenCL
Specification 1.1) function and the
CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR parameter to get the device ID of
the CL device associated with the OpenGL context.
5. Use clCreateContext (See Section 4.3 of the OpenCL Specification 1.1) to
create the CL context (of type cl_context).
The following code snippet shows you how to create an interoperability context
using GLUT on a single-GPU system.
glutInit(&argc, argv);
glutInitDisplayMode(GLUT_RGBA | GLUT_DOUBLE);
glutInitWindowSize(WINDOW_WIDTH, WINDOW_HEIGHT);
glutCreateWindow("OpenCL SimpleGL");
cl_context_properties cpsGL[] =
{CL_CONTEXT_PLATFORM,(cl_context_properties)platform,
CL_WGL_HDC_KHR, (intptr_t) wglGetCurrentDC(),
1. Use CreateWindow for window creation and get the device handle (HWND).
2. Use GetDC to get a handle to the device context for the client area of a
specific window or for the entire screen; alternatively, use the CreateDC
function to create a device context (HDC) for the specified device.
3. Use ChoosePixelFormat to match an appropriate pixel format supported by
a device context to a given pixel format specification.
4. Use SetPixelFormat to set the pixel format of the specified device context
to the format specified.
5. Use wglCreateContext to create a new OpenGL rendering context from
device context (HDC).
6. Use wglMakeCurrent to bind the GL context created in the above step as
the current rendering context.
7. Use clGetGLContextInfoKHR function (see Section 9.7 of the OpenCL
Specification 1.1) and parameter CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR
to get the device ID of the CL device associated with the OpenGL context.
8. Use clCreateContext function (see Section 4.3 of the OpenCL Specification
1.1) to create the CL context (of type cl_context).
The following code snippet shows how to create an interoperability context using
WIN32 API for windowing. (Users also can refer to the SimpleGL sample in the
AMD APP SDK samples.)
int pfmt;
PIXELFORMATDESCRIPTOR pfd;
pfd.nSize = sizeof(PIXELFORMATDESCRIPTOR);
pfd.nVersion = 1;
pfd.dwFlags = PFD_DRAW_TO_WINDOW |
PFD_SUPPORT_OPENGL | PFD_DOUBLEBUFFER ;
pfd.iPixelType = PFD_TYPE_RGBA;
pfd.cColorBits = 24;
pfd.cRedBits = 8;
pfd.cRedShift = 0;
pfd.cGreenBits = 8;
pfd.cGreenShift = 0;
pfd.cBlueBits = 8;
pfd.cBlueShift = 0;
pfd.cAlphaBits = 8;
pfd.cAlphaShift = 0;
pfd.cAccumBits = 0;
pfd.cAccumRedBits = 0;
pfd.cAccumGreenBits = 0;
pfd.cAccumBlueBits = 0;
pfd.cAccumAlphaBits = 0;
pfd.cDepthBits = 24;
pfd.cStencilBits = 8;
pfd.cAuxBuffers = 0;
pfd.iLayerType = PFD_MAIN_PLANE;
pfd.bReserved = 0;
pfd.dwLayerMask = 0;
pfd.dwVisibleMask = 0;
pfd.dwDamageMask = 0;
WNDCLASS windowclass;
windowclass.style = CS_OWNDC;
windowclass.lpfnWndProc = WndProc;
windowclass.cbClsExtra = 0;
windowclass.cbWndExtra = 0;
windowclass.hInstance = NULL;
windowclass.hIcon = LoadIcon(NULL, IDI_APPLICATION);
windowclass.hCursor = LoadCursor(NULL, IDC_ARROW);
windowclass.hbrBackground = (HBRUSH)GetStockObject(BLACK_BRUSH);
windowclass.lpszMenuName = NULL;
windowclass.lpszClassName = reinterpret_cast<LPCSTR>("SimpleGL");
RegisterClass(&windowclass);
gHwnd = CreateWindow(reinterpret_cast<LPCSTR>("SimpleGL"),
reinterpret_cast<LPCSTR>("SimpleGL"),
WS_CAPTION | WS_POPUPWINDOW | WS_VISIBLE,
0,
0,
screenWidth,
screenHeight,
NULL,
NULL,
windowclass.hInstance,
NULL);
hDC = GetDC(gHwnd);
hRC = wglCreateContext(hDC);
cl_context_properties properties[] =
{
CL_CONTEXT_PLATFORM,
(cl_context_properties) platform,
CL_GL_CONTEXT_KHR, (cl_context_properties) hRC,
CL_WGL_HDC_KHR, (cl_context_properties) hDC,
0
};
status = clGetGLContextInfoKHR(properties,
CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR,
sizeof(cl_device_id),
&interopDevice,
NULL);
2. To query all display devices in the current session, call EnumDisplayDevices
in a loop, starting with DevNum set to 0 and incrementing DevNum until the
function fails. To select all display devices in the desktop, use only the
display devices that have the DISPLAY_DEVICE_ATTACHED_TO_DESKTOP flag set in
the DISPLAY_DEVICE structure.
3. To get information on the display adapter, call EnumDisplayDevices with
lpDevice set to NULL. For example, DISPLAY_DEVICE.DeviceString
contains the adapter name.
4. Use EnumDisplaySettings to get DEVMODE. dmPosition.x and
dmPosition.y are used to get the x coordinate and y coordinate of the
current display.
5. Try to find the first OpenCL device (winner) associated with the OpenGL
rendering context by using the loop technique of step 2, above.
6. Inside the loop:
a. Create a window on a specific display by using the CreateWindow
function. This function returns the window handle (HWND).
b. Use GetDC to get a handle to the device context for the client area of a
specific window or for the entire screen; alternatively, use the CreateDC
function to create a device context (HDC) for the specified device.
c. Use ChoosePixelFormat to match an appropriate pixel format supported
by a device context to a given pixel format specification.
d. Use SetPixelFormat to set the pixel format of the specified device
context to the format specified.
e. Use wglCreateContext to create a new OpenGL rendering context from
device context (HDC).
f. Use wglMakeCurrent to bind the GL context created in the above step
as the current rendering context.
g. Use clGetGLContextInfoKHR (See Section 9.7 of the OpenCL
Specification 1.1) and CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR
parameter to get the number of GL associated devices for CL context
creation. If the number of devices is zero, go to the next display in the
loop. Otherwise, use clGetGLContextInfoKHR (See Section 9.7 of the
OpenCL Specification 1.1) and the
CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR parameter to get the device
ID of the CL device associated with the OpenGL context.
h. Use clCreateContext (See Section 4.3 of the OpenCL Specification
1.1) to create the CL context (of type cl_context).
The following code demonstrates how to use the Win32 windowing API for CL-GL
interoperability in a multi-GPU environment.
int xCoordinate = 0;
int yCoordinate = 0;
DISPLAY_DEVICE dispDevice;
dispDevice.cb = sizeof(DISPLAY_DEVICE);
// enumerate all display devices (see steps 2-3 above)
for (DWORD deviceNum = 0;
EnumDisplayDevices(NULL, deviceNum, &dispDevice, 0);
deviceNum++)
{
if (dispDevice.StateFlags &
DISPLAY_DEVICE_MIRRORING_DRIVER)
{
continue;
}
DEVMODE deviceMode;
EnumDisplaySettings(dispDevice.DeviceName,
ENUM_CURRENT_SETTINGS,
&deviceMode);
xCoordinate = deviceMode.dmPosition.x;
yCoordinate = deviceMode.dmPosition.y;
WNDCLASS windowclass;
windowclass.style = CS_OWNDC;
windowclass.lpfnWndProc = WndProc;
windowclass.cbClsExtra = 0;
windowclass.cbWndExtra = 0;
windowclass.hInstance = NULL;
windowclass.hIcon = LoadIcon(NULL, IDI_APPLICATION);
windowclass.hCursor = LoadCursor(NULL, IDC_ARROW);
windowclass.hbrBackground = (HBRUSH)GetStockObject(BLACK_BRUSH);
windowclass.lpszMenuName = NULL;
windowclass.lpszClassName = reinterpret_cast<LPCSTR>("SimpleGL");
RegisterClass(&windowclass);
gHwnd = CreateWindow(
reinterpret_cast<LPCSTR>("SimpleGL"),
reinterpret_cast<LPCSTR>(
"OpenGL Texture Renderer"),
WS_CAPTION | WS_POPUPWINDOW,
xCoordinate,
yCoordinate,
screenWidth,
screenHeight,
NULL,
NULL,
windowclass.hInstance,
NULL);
hDC = GetDC(gHwnd);
hRC = wglCreateContext(hDC);
cl_context_properties properties[] =
{
CL_CONTEXT_PLATFORM,
(cl_context_properties) platform,
CL_GL_CONTEXT_KHR,
(cl_context_properties) hRC,
CL_WGL_HDC_KHR,
(cl_context_properties) hDC,
0
};
if (!clGetGLContextInfoKHR)
{
clGetGLContextInfoKHR = (clGetGLContextInfoKHR_fn)
clGetExtensionFunctionAddress(
"clGetGLContextInfoKHR");
}
size_t deviceSize = 0;
status = clGetGLContextInfoKHR(properties,
CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR,
0,
NULL,
&deviceSize);
if (deviceSize == 0)
{
// no interoperable CL device found, cleanup
wglMakeCurrent(NULL, NULL);
wglDeleteContext(hRC);
DeleteDC(hDC);
hDC = NULL;
hRC = NULL;
DestroyWindow(gHwnd);
// try the next display
continue;
}
ShowWindow(gHwnd, SW_SHOW);
//Found a winner
break;
}
cl_context_properties properties[] =
{
CL_CONTEXT_PLATFORM,
(cl_context_properties) platform,
CL_GL_CONTEXT_KHR,
(cl_context_properties) hRC,
CL_WGL_HDC_KHR,
(cl_context_properties) hDC,
0
};
E.1.3 Limitations
It is recommended not to use GLUT in a multi-GPU environment.
Using GLUT
1. Use glutInit to initialize the GLUT library and to negotiate a session with
the windowing system. This function also processes the command-line
options depending on the windowing system.
2. Use glXGetCurrentContext to get the current rendering context
(GLXContext).
3. Use glXGetCurrentDisplay to get the display (Display *) that is associated
with the current OpenGL rendering context of the calling thread.
4. Use clGetGLContextInfoKHR (see Section 9.7 of the OpenCL Specification
1.1) and the CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR parameter to get the
device ID of the CL device associated with the OpenGL context.
5. Use clCreateContext (see Section 4.3 of the OpenCL Specification 1.1) to
create the CL context (of type cl_context).
The following code snippet shows how to create an interoperability context using
GLUT in Linux.
glutInit(&argc, argv);
glutInitDisplayMode(GLUT_RGBA | GLUT_DOUBLE);
glutInitWindowSize(WINDOW_WIDTH, WINDOW_HEIGHT);
glutCreateWindow("OpenCL SimpleGL");
cl_context_properties cpsGL[] =
{
CL_CONTEXT_PLATFORM,
(cl_context_properties)platform,
CL_GLX_DISPLAY_KHR,
(intptr_t) glXGetCurrentDisplay(),
CL_GL_CONTEXT_KHR,
(intptr_t) glXGetCurrentContext(),
0
};
status = clGetGLContextInfoKHR(cpsGL,
CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR,
sizeof(cl_device_id),
&interopDevice,
NULL);
4. Use XCreateColormap to create a color map of the specified visual type for
the screen on which the specified window resides; it returns the colormap
ID associated with it. Note that the specified window is used only to
determine the screen.
5. Use XCreateWindow to create an unmapped sub-window for a specified
parent window; it returns the window ID of the created window and causes the
X server to generate a CreateNotify event. The created window is placed on
top in the stacking order with respect to its siblings.
6. Use XMapWindow to map the window and all of its sub-windows that have had
map requests. Mapping a window that has an unmapped ancestor does not
display the window, but marks it as eligible for display when the ancestor
becomes mapped. Such a window is called unviewable. When all its
ancestors are mapped, the window becomes viewable and is visible on the
screen if it is not obscured by another window.
7. Use glXCreateContextAttribsARB to initialize the context to the initial state
defined by the OpenGL specification; it returns a handle to the context. This
handle can be used to render to any GLX surface.
8. Use glXMakeCurrent to make argument 3 (GLXContext) the current GLX
rendering context of the calling thread, replacing the previously current
context if there was one, and to attach argument 3 (GLXContext) to a GLX
drawable, either a window or a GLX pixmap.
9. Use clGetGLContextInfoKHR to get the OpenCL-OpenGL interoperability
device corresponding to the window created in step 5.
10. Use clCreateContext to create the context on the interoperable device
obtained in step 9.
The following code snippet shows how to create a CL-GL interoperability context
using the X Window system in Linux.
int nelements;
GLXFBConfig *fbc = glXChooseFBConfig(displayName,
DefaultScreen(displayName), 0, &nelements);
static int attributeList[] = { GLX_RGBA,
GLX_DOUBLEBUFFER,
GLX_RED_SIZE,
1,
GLX_GREEN_SIZE,
1,
GLX_BLUE_SIZE,
1,
None
};
XVisualInfo *vi = glXChooseVisual(displayName,
DefaultScreen(displayName),
attributeList);
XSetWindowAttributes swa;
swa.colormap = XCreateColormap(displayName,
RootWindow(displayName, vi->screen),
vi->visual,
AllocNone);
swa.border_pixel = 0;
swa.event_mask = StructureNotifyMask;
GLXCREATECONTEXTATTRIBSARBPROC glXCreateContextAttribsARB =
(GLXCREATECONTEXTATTRIBSARBPROC)
glXGetProcAddress((const
GLubyte*)"glXCreateContextAttribsARB");
int attribs[] = {
GLX_CONTEXT_MAJOR_VERSION_ARB, 3,
GLX_CONTEXT_MINOR_VERSION_ARB, 0,
0
};
GLXContext ctx = glXCreateContextAttribsARB(displayName, *fbc, 0, True, attribs);
glXMakeCurrent(displayName,
win,
ctx);
cl_context_properties cpsGL[] = {
CL_CONTEXT_PLATFORM,(cl_context_properties)platform,
CL_GLX_DISPLAY_KHR, (intptr_t) glXGetCurrentDisplay(),
CL_GL_CONTEXT_KHR, (intptr_t) gGlCtx, 0
};
status = clGetGLContextInfoKHR( cpsGL,
CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR,
sizeof(cl_device_id),
&interopDeviceId,
NULL);
displayName = XOpenDisplay(NULL);
int screenNumber = ScreenCount(displayName);
// keep the display connection open; it is used by XCreateWindow below
win = XCreateWindow(displayName,
RootWindow(displayName, vi->screen),
10,
10,
width,
height,
0,
vi->depth,
InputOutput,
vi->visual,
CWBorderPixel|CWColormap|CWEventMask,
&swa);
XMapWindow (displayName, win);
int attribs[] = {
GLX_CONTEXT_MAJOR_VERSION_ARB, 3,
GLX_CONTEXT_MINOR_VERSION_ARB, 0,
0
};
gGlCtx = glXGetCurrentContext();
cl_context_properties cpsGL[] = {
CL_CONTEXT_PLATFORM, (cl_context_properties)platform,
CL_GLX_DISPLAY_KHR, (intptr_t) glXGetCurrentDisplay(),
CL_GL_CONTEXT_KHR, (intptr_t) gGlCtx, 0
};
size_t deviceSize = 0;
status = clGetGLContextInfoKHR(cpsGL,
CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR,
0,
NULL,
&deviceSize);
int numDevices = (deviceSize / sizeof(cl_device_id));
if(numDevices == 0)
{
glXDestroyContext(glXGetCurrentDisplay(), gGlCtx);
continue;
}
else
{
//Interoperable device found
std::cout<<"Interoperable device found "<<std::endl;
break;
}
}
Appendix F
New and deprecated functions in
OpenCL 2.0
F.1.10 Sub-groups
get_sub_group_size — Get the size of the current sub-group
write_mem_fence
atomic_add
atomic_sub
atomic_xchg
atomic_inc
atomic_dec
atomic_cmpxchg
atomic_min
atomic_max
atomic_and
atomic_or
atomic_xor
CL_DEVICE_SVM_ATOMICS
CL_QUEUE_SIZE
CL_MEM_SVM_FINE_GRAIN_BUFFER
CL_MEM_SVM_ATOMICS
CL_sRGB
CL_sRGBx
CL_sRGBA
CL_sBGRA
CL_ABGR
CL_MEM_OBJECT_PIPE
CL_MEM_USES_SVM_POINTER
CL_PIPE_PACKET_SIZE
CL_PIPE_MAX_PACKETS
CL_SAMPLER_MIP_FILTER_MODE
CL_SAMPLER_LOD_MIN
CL_SAMPLER_LOD_MAX
CL_PROGRAM_BUILD_GLOBAL_VARIABLE_TOTAL_SIZE
CL_KERNEL_ARG_TYPE_PIPE
CL_KERNEL_EXEC_INFO_SVM_PTRS
CL_KERNEL_EXEC_INFO_SVM_FINE_GRAIN_SYSTEM
CL_COMMAND_SVM_FREE
CL_COMMAND_SVM_MEMCPY
CL_COMMAND_SVM_MEMFILL
CL_COMMAND_SVM_MAP
CL_COMMAND_SVM_UNMAP
CL_PROFILING_COMMAND_COMPLETE
Index
Symbols NDRange . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
address
_cdecl calling convention
1D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Linux . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
_global atomics. . . . . . . . . . . . . . . . . . . . . . . 19
normalized . . . . . . . . . . . . . . . . . . . . . . . . . 10
_local atomics . . . . . . . . . . . . . . . . . . . . . . . . 19
un-normalized. . . . . . . . . . . . . . . . . . . . . . . 10
_stdcall calling convention
allocating
Windows . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
images
.amdil
OpenCL . . . . . . . . . . . . . . . . . . . . . . . . . . 4
generating. . . . . . . . . . . . . . . . . . . . . . . . . . 3
memory
.comment
selecting a device . . . . . . . . . . . . . . . . . . 4
BIF binary . . . . . . . . . . . . . . . . . . . . . . . . . . 1
memory buffer
storing OpenCL and driver versions that cre-
OpenCL program model . . . . . . . . . . . . . 4
ated the binary . . . . . . . . . . . . . . . . . . . . 1
ALUs
.llvmir
arrangement of . . . . . . . . . . . . . . . . . . . . . . . 8
generating. . . . . . . . . . . . . . . . . . . . . . . . . . 3
AMD Accelerated Parallel Processing
storing OpenCL immediate representation
implementation of OpenCL . . . . . . . . . . . . . 1
(LLVM IR). . . . . . . . . . . . . . . . . . . . . . . . 1
open platform strategy . . . . . . . . . . . . . . . . . 1
.rodata
programming model . . . . . . . . . . . . . . . . . . . 2
storing OpenCL runtime control data. . . . . 1
relationship of components . . . . . . . . . . . . . 1
.shstrtab
software stack . . . . . . . . . . . . . . . . . . . . . . . 1
forming an ELF. . . . . . . . . . . . . . . . . . . . . . 1
Installable Client Driver (ICD) . . . . . . . . . 1
.source
AMD APP KernelAnalyzer . . . . . . . . . . . . . . . . 1
storing OpenCL source program . . . . . . . . 1
AMD Core Math Library (ACML) . . . . . . . . . . . 1
.strtab
AMD GPU
forming an ELF. . . . . . . . . . . . . . . . . . . . . . 1
number of compute units . . . . . . . . . . . . . . . 8
.symtab
AMD Radeon HD 68XX . . . . . . . . . . . . . . . . . 15
forming an ELF. . . . . . . . . . . . . . . . . . . . . . 1
AMD Radeon HD 69XX . . . . . . . . . . . . . . . . . 15
.text
AMD Radeon HD 75XX . . . . . . . . . . . . . . . . . 15
generating. . . . . . . . . . . . . . . . . . . . . . . . . . 3
AMD Radeon HD 77XX . . . . . . . . . . . . . . . . . 15
storing the executable . . . . . . . . . . . . . . . . 1
AMD Radeon HD 78XX . . . . . . . . . . . . . . . . . 15
Numerics AMD Radeon HD 79XX series. . . . . . . . . . . . 15
AMD Radeon HD 7XXX . . . . . . . . . . . . . . . . . . 2
1D address . . . . . . . . . . . . . . . . . . . . . . . . . . 10 AMD Radeon R9 290X . . . . . . . . . . . . . . . 4, 8
2D AMD supplemental compiler . . . . . . . . . . . . . . 6
address . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 -g option . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2D addresses AMD supplemental compiler option
reading and writing. . . . . . . . . . . . . . . . . . 10 -f[n-]bin-source . . . . . . . . . . . . . . . . . . . . . . . 6
A -f[no-]bin-amdil . . . . . . . . . . . . . . . . . . . . . . . 6
-f[no-]bin-exe . . . . . . . . . . . . . . . . . . . . . . . . 6
access -f[no-]bin-llvmir . . . . . . . . . . . . . . . . . . . . . . . 6
memory. . . . . . . . . . . . . . . . . . . . . . . . . . 5, 9 AMD vendor-specific extensions . . . . . . . . . . . 4
accumulation operations amd_bitalign
Index-2
Copyright © 2013 Advanced Micro Devices, Inc. All rights reserved.
AMD ACCELERATED P ARALLEL P ROCESSING
amd_bitalign . . . . . . . . . . . . . . . . . . . . . . . . 8 CL context
amd_bytealign . . . . . . . . . . . . . . . . . . . . . . 8 associate with GL context . . . . . . . . . . . . . 1
amd_lerp. . . . . . . . . . . . . . . . . . . . . . . . . . . 8 CL kernel function
amd_pack . . . . . . . . . . . . . . . . . . . . . . . . . . 7 breakpoint . . . . . . . . . . . . . . . . . . . . . . . . . . 3
amd_sad . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 CL options
amd_sad4 . . . . . . . . . . . . . . . . . . . . . . . . . . 8 change during runtime . . . . . . . . . . . . . . . . 5
amd_sadhi . . . . . . . . . . . . . . . . . . . . . . . . . 9 cl_amd_device_attribute_query extension
amd_unpack0 . . . . . . . . . . . . . . . . . . . . . . . 7 querying AMD-specific device attributes . . 5
amd_unpack1 . . . . . . . . . . . . . . . . . . . . . . . 7 cl_amd_event_callback extension
amd_unpack2 . . . . . . . . . . . . . . . . . . . . . . . 7 registering event callbacks for states . . . . 6
amd_unpack3 . . . . . . . . . . . . . . . . . . . . . . . 7 cl_amd_fp64 extension. . . . . . . . . . . . . . . . . . 4
built-in functions cl_amd_media_ops extension
for OpenCL language adding built-in functions to OpenCL language
cl_amd_media_ops . . . . . . . . . . . . . . 7, 9 7, 9
OpenCL C programs cl_amd_printf extension . . . . . . . . . . . . . . . . 12
cl_amd_printf . . . . . . . . . . . . . . . . . . . . 12 cl_ext extensions . . . . . . . . . . . . . . . . . . . . . . 4
variadic arguments . . . . . . . . . . . . . . . . . . 12 cl_khr_fp64
writing output to the stdout stream . . . . . 12 supported function . . . . . . . . . . . . . . . . . . 15
burst write . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 classes
passing between host and device . . . . . . . 3
C clBuildProgram
C front-end debugging OpenCL program . . . . . . . . . . . 2
compiler . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 clCreateKernel
C kernels C++extension . . . . . . . . . . . . . . . . . . . . . . . 2
predefined macros . . . . . . . . . . . . . . . . . . 13 clEnqueue commands . . . . . . . . . . . . . . . . . . 5
C program sample clEnqueueNDRangeKernel
OpenCL . . . . . . . . . . . . . . . . . . . . . . . . . . 10 setting breakpoint in the host code . . . . . . 3
C programming clGetPlatformIDs() function
OpenCL . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 available OpenCL implementations . . . . . . 1
C++ API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 clGetPlatformInfo() function
C++ bindings available OpenCL implementations . . . . . . 1
OpenCL programming . . . . . . . . . . . . . . . 14 querying supported extensions for OpenCL
C++ extension platform . . . . . . . . . . . . . . . . . . . . . . . . . 1
unsupported features . . . . . . . . . . . . . . . . . 2 C-like language
C++ files OpenCL . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
compiling. . . . . . . . . . . . . . . . . . . . . . . . . . . 3 code
C++ kermel language . . . . . . . . . . . . . . . . . iii, 1 basic programming steps . . . . . . . . . . . . . 10
C++ kernels ICD-compliant version . . . . . . . . . . . . . . 1, 2
building . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 parallel min() function. . . . . . . . . . . . . . . . 19
C++ templates . . . . . . . . . . . . . . . . . . . . . . . . 5 pre-ICD snippet . . . . . . . . . . . . . . . . . . . 1, 2
cache running
L1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 on Linux . . . . . . . . . . . . . . . . . . . . . . . . . 7
L2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 on Windows . . . . . . . . . . . . . . . . . . . . . . 6
texture system . . . . . . . . . . . . . . . . . . . . . 10 runtime steps . . . . . . . . . . . . . . . . . . . . . . 19
call error code requirements
OpenCL . . . . . . . . . . . . . . . . . . . . . . . . . . 16 Installable Client Driver (ICD) . . . . . . . . . . 1
calling convention CodeXL GPU Debugger . . . . . . . . . . . . . . . . . 1
Linux . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 command processor
Windows . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 transfer from system to GPU . . . . . . . . . . . 9
character extensions. . . . . . . . . . . . . . . . . . . . 1 command processors
searching for substrings . . . . . . . . . . . . . . . 2 concurrent processing of command queues 7
character sequence command queue . . . . . . . . . . . . . . . . . . . . . . . 4
format string . . . . . . . . . . . . . . . . . . . . . . . 12 associated with single device . . . . . . . . . . 4
Index-3
Copyright © 2013 Advanced Micro Devices, Inc. All rights reserved.
AMD ACCELERATED P ARALLEL P ROCESSING
barrier -f[no-]bin-llvmir . . . . . . . . . . . . . . . . . . . . . . 4
enforce ordering within a single queue . 4 -f[no-]bin-source . . . . . . . . . . . . . . . . . . . . . 4
creating device-specific. . . . . . . . . . . . . . . . 4 -g . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
elements -O0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
constants. . . . . . . . . . . . . . . . . . . . . . . . . 9 -save-temps . . . . . . . . . . . . . . . . . . . . . . . . 4
kernel execution calls . . . . . . . . . . . . . . . 9 compiling
kernels . . . . . . . . . . . . . . . . . . . . . . . . . . 9 an OpenCL application . . . . . . . . . . . . . . . . 2
transfers between device and host . . . . 9 C++ files . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
executing kernels . . . . . . . . . . . . . . . . . . . . 4 kernels. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
execution . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 on Linux
moving data . . . . . . . . . . . . . . . . . . . . . . . . 4 building 32-bit object files on a 64-bit sys-
no limit of the number pointing to the same tem . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
device . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 linking to a 32-bit library. . . . . . . . . . . . . 3
OpenCL . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 linking to a 64-bit library. . . . . . . . . . . . . 3
command queues . . . . . . . . . . . . . . . . . . . . . . 7 OpenCL on Linux . . . . . . . . . . . . . . . . . . . . 3
multiple . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 OpenCL on Windows . . . . . . . . . . . . . . . . . 2
command-queue barrier . . . 4
commands
  API
    three categories . . . 6
  buffer . . . 9
  clEnqueue . . . 5
  driver layer issuing . . . 9
  driver layer translating . . . 9
  event . . . 6
  GDB . . . 3
  kernel . . . 6
  memory . . . 6
  OpenCL API functions . . . 5
  queue . . . 9
communication and data transfers between system and GPU
  PCIe . . . 8
communication between the host (CPU) and the GPU . . . 8
compilation
  error
    kernel . . . 16
compile time
  resolving format string . . . 12
compiler
  LLVM framework . . . 2
  set to ignore all extensions . . . 2
  toolchain . . . 1
    back-end . . . 1
    OpenCL . . . 1
    sharing front-end . . . 1
    sharing high-level transformations . . . 1
  transformations . . . 1
  using standard C front-end . . . 2
compiler option
  -f[no-]bin-amdil . . . 4
  -f[no-]bin-exe . . . 4
compiling
  Intel C (C++) compiler . . . 2
  setting project properties . . . 2
  Visual Studio 2008 Professional Edition . . . 2
  the host program . . . 2
computation
  data-parallel model . . . 2
compute device structure
  GPU . . . 6, 2
compute kernel
  data-parallel granularity . . . 2
  definition . . . 1
  strengths
    computationally intensive applications . . . 1
  wavefronts . . . 2
  workgroups . . . 2
compute unit
  mapping . . . 2
  stream cores . . . 3
compute units
  290X devices . . . 6
  independent operation . . . 4
  number in AMD GPUs . . . 8
  structured in AMD GPUs . . . 8
constants
  caching . . . 10
  command queue elements . . . 9
constraints
  of the current LDS model . . . 2
context
  relationship
    sample code . . . 4
contexts
  associating CL and GL . . . 1
copying data
  implicit and explicit . . . 6
Index-4
Copyright © 2013 Advanced Micro Devices, Inc. All rights reserved.
AMD ACCELERATED PARALLEL PROCESSING
.text extension
  storing the executable . . . 1
format . . . 1
forming . . . 1
header fields . . . 2
special sections
  BIF . . . 1
enforce ordering
  between or within queues
    events . . . 5
  synchronizing a given event . . . 3
  within a single queue
    command-queue barrier . . . 4
engine
  DMA . . . 9
enqueuing
  commands in OpenCL . . . 5
  multiple tasks
    parallelism . . . 2
  native kernels
    parallelism . . . 2
environment variable
  AMD_OCL_BUILD_OPTIONS . . . 5
  AMD_OCL_BUILD_OPTIONS_APPEND . . . 5
  setting to avoid source changes . . . 2
event
  commands . . . 6
  enforces ordering
    between queues . . . 5
    within queues . . . 5
  synchronizing . . . 3
event commands . . . 6
events
  forced ordering between . . . 5
exceptions
  C++ . . . 6
executing
  branch . . . 4
  kernels . . . 2, 3
    using corresponding command queue . . . 4
  kernels for specific devices
    OpenCL programming model . . . 3
  loop . . . 5
  non-graphic function
    data-parallel programming model . . . 2
execution
  command queue . . . 9
  of a single instruction over all work-items . . . 2
  OpenCL applications . . . 6
  order
    barriers . . . 4
  single stream core . . . 10
explicit copying of data . . . 6
cl_amd_popcnt . . . 7
clCreateKernel . . . 2
extension function pointers . . . 3
extension functions
  NULL and non-NULL return values . . . 3
extension support by device
  for devices 1 . . . 15
  for devices 2 and CPUs . . . 16
extensions
  all . . . 2
  AMD vendor-specific . . . 4
  approved by Khronos Group . . . 1
  approved list from Khronos Group . . . 3
  character strings . . . 1
  cl_amd_device_attribute_query . . . 5
  cl_amd_event_callback
    registering event callbacks for states . . . 6
  cl_amd_fp64 . . . 4
  cl_amd_media_ops . . . 7, 9
  cl_amd_printf . . . 12
  cl_ext . . . 4
  compiler set to ignore . . . 2
  device fission . . . 4
  disabling . . . 2
  enabling . . . 2, 3
  FunctionName string . . . 3
  kernel code compilation
    adding defined macro . . . 3
  naming conventions . . . 1
  optional . . . 1
  provided by a specific vendor . . . 1
  provided collectively by multiple vendors . . . 1
  querying for a platform . . . 1
  querying in OpenCL . . . 1
  same name overrides . . . 2
  use in kernel programs . . . 2

F

-f[no-]bin-amdil
  AMD supplemental compiler option . . . 6
  compiler option . . . 4
-f[no-]bin-exe
  AMD supplemental compiler option . . . 6
  compiler option . . . 4
-f[no-]bin-llvmir
  AMD supplemental compiler option . . . 6
  compiler option . . . 4
-f[no-]bin-source
  AMD supplemental compiler option . . . 6
  compiler option . . . 4
fetch unit
interface
  OpenCL . . . 2
OpenCL . . . 1
  changing options . . . 6
  system functions . . . 5

S

sample code
  relationship between
    buffer(s) . . . 4
    command queue(s) . . . 4
    context(s) . . . 4
    device(s) . . . 4
    kernel(s) . . . 4
  relationship between context(s) . . . 4
-save-temps
  compiler option . . . 4
SAXPY function
  code sample . . . 16
SC cache . . . 7
scalar instructions . . . 7
scalar unit data cache
  SC cache . . . 7
scheduling
  GPU . . . 10
  work-items
    for execution . . . 3
    range . . . 2
scope
  global . . . 4
SDKUtil library . . . 3
  linking options . . . 3
  Linux . . . 3
set
  a breakpoint . . . 2
shader architecture
  unified . . . 2
shaders and kernels . . . 2
SIMD arrays
  processing elements . . . 8
simple buffer write
  code sample . . . 12
  example programs . . . 10
simple testing
  programming techniques
    parallel min function . . . 19
single device associated with command queue . . . 4
single stream core execution . . . 10
single-precision floating-point
  performing operations . . . 3, 4
software
  overview . . . 1
spawn order
  of work-item . . . 1
  sequential . . . 1
stalls
  memory fetch request . . . 11
static C++ kernel language . . . iii, 1
stdout stream
  writing output associated with the host application . . . 12
stream core
  compute units . . . 3
  executing kernels . . . 3
  idle . . . 4
  instruction sequence . . . 3
  processing elements . . . 3
  stall . . . 11
    due to data dependency . . . 12
stream kernel . . . 10
supplemental compiler options . . . 6
synchronization
  command-queue barrier . . . 4
  domains . . . 4
    command-queue . . . 4
    work-items . . . 4
  events . . . 4
  points
    in a kernel . . . 2
synchronizing
  a given event . . . 3
  event
    enforce the correct order of execution . . . 3
  through barrier operations
    work-items . . . 3
  through fence operations
    work-items . . . 3
syntax
  GCC option . . . 3
system
  pinned memory . . . 7

T

templates
  C++ . . . 5
  kernel, member, default argument, limited class, partial . . . 1
terminology . . . 1
texture system
  caching . . . 10
thread
  launching . . . 19
threading
  device-optimal access pattern . . . 19
throughput
  PCIe . . . 8
timing
  of simplified execution of work-items
    single stream core . . . 10
toolchain
  compiler . . . 1
transcendental
  core . . . 4
  performing operations . . . 3
transfer
  between device and host
    command queue elements . . . 9
  data
    select a device . . . 4
    to the optimizer . . . 2
  DMA . . . 9
  from system to GPU
    command processor . . . 9
    DMA engine . . . 9
  management
    memory . . . 7
  work-item to fetch unit . . . 9

U

unified shader architecture . . . 2
un-normalized addresses . . . 10

V

variable output counts
  NDRange . . . 1
variadic arguments
  use of in the built-in printf . . . 12
vector data types
  parallelism . . . 2
vector instructions . . . 7
vendor
  platform vendor string . . . 3
vendor name
  matching platform vendor string . . . 3
vendor-specific extensions
  AMD . . . 4
Very Long Instruction Word (VLIW)
  instruction . . . 4
  work-item . . . 4
Visual Studio 2008 Professional Edition . . . 4
  compiling OpenCL on Windows . . . 2
  developing application code . . . 4
VRAM
  global memory . . . 9

W

wavefront
  block of work-items . . . 4
  combining paths . . . 4
  concept relating to compute kernels . . . 2
  definition . . . 4, 8
  mask . . . 5
  masking . . . 4
  pipelining work-items on a stream core . . . 4
  relationship to work-group . . . 4
  relationship with work-groups . . . 4
  required number spawned by GPU . . . 4
  size . . . 4
  size for optimum hardware usage . . . 4
  size on AMD GPUs . . . 4
  total execution time . . . 5
  work-group . . . 2, 3
  work-item processing . . . 4
Windows
  calling convention . . . 7
  compiling
    Intel C (C++) compiler . . . 2
    OpenCL . . . 2
    Visual Studio 2008 Professional Edition . . . 2
  debugging
    OpenCL kernels . . . 4
  running code . . . 6
  settings for compiling OpenCL . . . 2
work-group
  allocation of LDS size . . . 2
  barriers . . . 4
  composed of wavefronts . . . 2
  concept relating to compute kernels . . . 2
  defined . . . 4
  dividing global work-size into sub-domains . . . 3
  dividing work-items . . . 3
  number of wavefronts in . . . 3
  performance . . . 2
  relationship to wavefront . . . 4
  relationship with wavefronts . . . 4
  specifying
    wavefronts . . . 3
  work-item
    reaching point (barrier) in the code . . . 2
work-item
  branch granularity . . . 4
  communicating
    through globally shared memory . . . 3
    through locally shared memory . . . 3
  creation . . . 4
  deactivation . . . 9
  dividing into work-groups . . . 3
  element . . . 3
  encountering barriers . . . 4
  executing
    the branch . . . 4
  kernel running on compute unit . . . 2
  mapping
    onto n-dimensional grid (ND-Range) . . . 3
    to stream cores . . . 2
  non-active . . . 4
  processing wavefront . . . 4
  reaching point (barrier) in the code . . . 2
  scheduling
    for execution . . . 3
    the range of . . . 2
  spawn order . . . 1
  synchronization
    through barrier operations . . . 3
    through fence operations . . . 3
  VLIW instruction . . . 4
work-items
  divergence in wavefront . . . 4
  pipelining on a stream core . . . 4

X

X Window system
  using for CL-GL interoperability . . . 8