3 Heterogeneous Computer Architectures
3.1 GPUs
3.1.1 Motivation
In recent years a very specific type of accelerator has been on the rise. Originating from
traditional computer graphics, where a programmable graphics pipeline has become
more and more ubiquitous, the massively parallel architecture of the Graphics Processing
Unit (GPU) is increasingly used for general-purpose parallel programming tasks. As such,
the triumph of General Purpose Computation on Graphics Processing Units (GPGPU)
has begun.
The origins of GPGPU lie in the (ab)use of shader programming languages like GLSL [1],
Cg [2] and HLSL [3], from which the first versions of CUDA [4] were derived. CUDA became
the de facto standard on NVIDIA-based accelerators.
Since then it has become apparent that the techniques developed for GPGPU can be applied to
a broader range of accelerators and multi-core architectures. As a result, the OpenCL
standard [5] was developed; it is standardized in collaboration with technical teams
from various companies.
In this lesson you will learn the basic principles that went into OpenCL and how to
use them efficiently on your devices. OpenCL defines basic concepts shared between
different accelerator devices:
• Platform Model
• Execution Model
• Memory Model
• Programming Model
These will be explained in the remaining sections. The lesson closes with a short
reference to the OpenCL C language. The complete OpenCL specification can be found
in [6].
3.1.2 Platform Model
The OpenCL Platform Model comprises the host, one or more compute devices, their
compute units and the processing elements (see Figure 3.1).
Figure 3.1: The OpenCL Platform Model (Image courtesy of Khronos Group)
Host
The host is responsible for coordinating the execution of the parallel code; it is
the machine on which the main C/C++ program runs. As such, it is responsible
for managing data (copies from and to device memory) and for starting and synchronizing
tasks. The host program usually runs on the CPU built into the computer.
Compute Device
The compute device is the accelerator onto which the task is offloaded. It
contains at least one Compute Unit and global memory. In the case of a GPU, the
compute device is the actual graphics card. When using CPUs with OpenCL, it is the
entire set of CPUs built into the machine.
Compute Units
A Compute Unit is a group of Processing Elements together with shared (local) memory,
which is used for synchronization and data exchange between the Processing Elements. On NVIDIA
graphics cards a Compute Unit corresponds to a (streaming) multiprocessor, and on regular CPUs
Compute Units are mapped to individual cores.
Processing Elements
The Processing Elements are the actual execution entities in an OpenCL context. Multiple
Processing Elements are driven in a Single Instruction Multiple Data (SIMD)
fashion. On GPUs a Processing Element corresponds to a shader core, and on CPUs it is represented by
a single SIMD execution unit.
3.1.3 Execution Model
Distribution of Work
A Work Group is a set of Work Items; each Work Item is a single element of the index
space to be processed. Every Work Item executes the same kernel code. A Compute Unit
executes as many Work Items in parallel as it has Processing Elements. This happens in
a SIMD fashion, which means that Processing Elements may idle once the control flow
diverges at branches. A kernel is programmed in terms of the globally defined index
space, which is divided into locally defined index spaces (the Work Groups), whose size
is given by the number of Work Items per Work Group.
To map between the two, the global Work Item ID is obtained, per dimension, as

global ID = Work Group ID ∗ Work Group size + local ID + offset

An example demonstrates this mapping: we choose an NDRange with 128x64 elements and
Work Groups of 16x16 Work Items, which gives 8x4 Work Groups. The global Work Item ID
(52, 22) lies in the Work Group with ID (3, 1), where the local ID within that Work
Group is (4, 6). Assuming an offset of (0, 0), the formula gives us:
(3 ∗ 16 + 4 + 0, 1 ∗ 16 + 6 + 0) = (52, 22)
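For illustration (a sketch; the kernel and variable names are not from the original text), these IDs are available inside a kernel through built-in functions, and the relation above can be verified directly:

__kernel void show_ids(__global int *out)
{
    size_t g = get_global_id(0);    /* global work item ID */
    size_t w = get_group_id(0);     /* work group ID */
    size_t l = get_local_id(0);     /* local ID within the work group */
    size_t S = get_local_size(0);   /* work group size */
    /* g == w * S + l (plus the global offset, if one was given) */
    out[g] = (int)(w * S + l);
}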
OpenCL uses contexts to manage the execution of a program. This is accomplished
by creating a context object in the host code (a minimal sketch follows the list).
A context comprises the following resources:
• Devices: The OpenCL devices being used
• Kernels: The OpenCL functions that will run on the devices
• Program Objects: The implementations of the kernels
• Memory Objects: Memory regions used by the host and the devices
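A minimal sketch of creating such a context in the host code (the variable device is assumed to hold a previously queried cl_device_id; see the device query sketch later in this lesson):

cl_int err;
cl_context ctx = clCreateContext(NULL,        /* default platform properties */
                                 1, &device,  /* one device */
                                 NULL, NULL,  /* no error callback */
                                 &err);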
To use a specific device, the host uses command queues. The host code adds the
commands to be executed to the queue of the device on which they should run. Possible
commands are:
• Kernel Execution: Execution of kernels on the Processing Elements of a device
• Memory Commands: Memory transfers between host and device
3.1.4 Memory Model
The OpenCL Memory Model copes with different memory hierarchies as well as with
different allocation policies and access-rights management.
Memory in OpenCL is divided into four regions; Figure 3.3 shows their respective
positions and the possible data flows.
Global Memory
This memory region is shared by all work items and can be both read and written.
Global memory is the destination of host-to-device data transfers.
Constant Memory
This region is written by the host and stays constant throughout the kernel execution;
the kernel therefore has read-only access to it. Usually this memory resides in
global memory; however, some OpenCL devices have constant caches, which reduces
latency in most use cases.
Local Memory
Local Memory is shared within a work group. This relatively small memory region is
closest to the work items and can be used for efficient communication and synchronization
between the work items of a work group.
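As a brief illustration (a hypothetical sketch, assuming a work group size of 64 that matches the tile size), local memory combined with a barrier lets the work items of a group exchange data safely:

__kernel void reverse_in_group(__global float *data)
{
    __local float tile[64];            /* assumes a work group size of 64 */
    size_t l = get_local_id(0);
    size_t g = get_global_id(0);
    tile[l] = data[g];                 /* each work item stores one element */
    barrier(CLK_LOCAL_MEM_FENCE);      /* wait until the whole group has written */
    data[g] = tile[get_local_size(0) - 1 - l];  /* read an element written by another work item */
}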
Private Memory
Private memory is visible only to a single work item and cannot be accessed by other
work items. Variables declared inside a kernel are private by default and are typically
mapped to registers.
Image Memory
Image Memory stems from the graphics-processing heritage of GPUs. It is read-only
on the device and can be used as a large global constant memory region. Its important
features are that it is cached and that various operations (such as interpolation and
clamping) are implemented in hardware, offering fast execution times.
Table 3.1 illustrates the different memory types, showing their respective access rights
and allocation strategies on the host and the device.
The host application allocates memory with the help of the OpenCL API. Data transfers
are enqueued into the command queue and can be blocking or non-blocking: a blocking
transfer only returns once the data transfer is complete, whereas a non-blocking transfer
requires additional synchronization.
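A brief sketch of both variants using clEnqueueWriteBuffer (queue, buffer, host_data and size are placeholders for objects created as shown later in this lesson):

/* Blocking write: returns only after the data has been transferred. */
clEnqueueWriteBuffer(queue, buffer, CL_TRUE, 0, size, host_data, 0, NULL, NULL);

/* Non-blocking write: returns immediately; the event is used for synchronization. */
cl_event ev;
clEnqueueWriteBuffer(queue, buffer, CL_FALSE, 0, size, host_data, 0, NULL, &ev);
clWaitForEvents(1, &ev);   /* explicit synchronization before host_data may be reused */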
3.1.5 Programming Model
The OpenCL Programming Model allows various OpenCL devices to be combined with the
host system into a single heterogeneous programming environment. It is divided
into three parts:
1. The OpenCL Platform Layer: Responsible for querying information about and the
capabilities of the system
2. The OpenCL Runtime Layer: Responsible for creating contexts and command
queues, managing memory and executing kernels
3. The OpenCL C Language: The language in which the OpenCL kernels
are written.
The OpenCL Platform and Runtime Layers are exposed via a C API, which is introduced
here. The OpenCL C language is an extension of the C standard.
Please note that we only briefly discuss the various functions; for a detailed
description, please refer to the OpenCL reference manual [6].
The OpenCL Platform Layer allows the host program to query the available OpenCL
devices and their properties, as well as to create contexts for the various devices.
Querying platform information
The function clGetPlatformIDs returns the available platforms via the platforms out
parameter. If num_entries is 0 and platforms is NULL, num_platforms receives the number
of available platforms; this count can be used to allocate an array of the correct size
before calling the function a second time.
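For reference, the signature and a minimal sketch of this two-call pattern (assuming <CL/cl.h> and <stdlib.h> are included; error checking omitted):

cl_int clGetPlatformIDs(cl_uint num_entries,
                        cl_platform_id *platforms,
                        cl_uint *num_platforms);

cl_uint num_platforms = 0;
clGetPlatformIDs(0, NULL, &num_platforms);          /* query the count only */
cl_platform_id *platforms = malloc(num_platforms * sizeof(cl_platform_id));
clGetPlatformIDs(num_platforms, platforms, NULL);   /* fetch the platform IDs */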
The function clGetPlatformInfo is used to query information about a specific platform.
The same technique as for clGetPlatformIDs can be used to retrieve the number of bytes
to allocate for param_value. Possible parameters for param_name are:
param_name              Return type   Description
CL_PLATFORM_NAME        char[]        The name of the platform
CL_PLATFORM_VENDOR      char[]        The platform vendor
CL_PLATFORM_VERSION     char[]        The version of the platform
CL_PLATFORM_EXTENSIONS  char[]        The supported extensions
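For reference, the signature of clGetPlatformInfo together with a short sketch querying the platform name (assuming <stdio.h>; the buffer size of 256 bytes is an arbitrary choice):

cl_int clGetPlatformInfo(cl_platform_id platform,
                         cl_platform_info param_name,
                         size_t param_value_size,
                         void *param_value,
                         size_t *param_value_size_ret);

char name[256];
clGetPlatformInfo(platforms[0], CL_PLATFORM_NAME, sizeof(name), name, NULL);
printf("Platform: %s\n", name);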
Querying devices of a platform
The devices of a platform are retrieved with the function clGetDeviceIDs (a signature
and usage sketch follow the list). In order to retrieve the number of available devices
of a platform, the same technique as described above is applied. cl_device_type can be
one of the following:
• CL_DEVICE_TYPE_ALL: All devices of the platform
• CL_DEVICE_TYPE_CPU: Only CPU devices
• CL_DEVICE_TYPE_GPU: Only GPU devices
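The signature, plus a minimal sketch reusing the two-call pattern and the platforms array from the sketch above:

cl_int clGetDeviceIDs(cl_platform_id platform,
                      cl_device_type device_type,
                      cl_uint num_entries,
                      cl_device_id *devices,
                      cl_uint *num_devices);

cl_uint num_devices = 0;
clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_GPU, 0, NULL, &num_devices);
cl_device_id *devices = malloc(num_devices * sizeof(cl_device_id));
clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_GPU, num_devices, devices, NULL);
cl_device_id device = devices[0];   /* used in the later sketches */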
cl_int clGetDeviceInfo(cl_device_id device,
cl_device_info param_name,
size_t param_value_size,
void *param_value,
size_t *param_value_size_ret);
The function clGetDeviceInfo is used to query information about a specific device. The
same technique as for clGetPlatformInfo can be used to retrieve the number of bytes to
allocate for param_value. Possible parameters for param_name include, among others,
CL_DEVICE_NAME, CL_DEVICE_TYPE, CL_DEVICE_MAX_COMPUTE_UNITS and CL_DEVICE_GLOBAL_MEM_SIZE.
The OpenCL Runtime Layer allows contexts to be manipulated at runtime and command
queues to be created; it is also used for creating program objects, executing kernels
and managing memory.
cl_command_queue clCreateCommandQueue(cl_context context,
cl_device_id device,
cl_command_queue_properties properties,
cl_int *errcode_ret);
A command queue is a handle for managing the execution of operations on a device. A
device can be associated with more than one command queue, which allows memory transfers
and computation to overlap. A command queue is only valid within a given context.
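A minimal usage sketch (ctx and device are the context and device from the earlier sketches; passing 0 for properties requests a default in-order queue):

cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);
/* A second queue on the same device could be used to overlap transfers and computation. */
cl_command_queue copy_queue = clCreateCommandQueue(ctx, device, 0, &err);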
Memory Management:
cl_mem clCreateBuffer(cl_context context,
cl_mem_flags flags,
size_t size,
void *host_ptr,
cl_int *errcode_ret);
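A short sketch allocating the buffers for the vector addition example used later in this lesson (the names and the size of 1024 elements are assumptions):

float host_a[1024], host_b[1024], host_c[1024];   /* input data would be filled in here */
cl_mem buf_a = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                              1024 * sizeof(float), host_a, &err);
cl_mem buf_b = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                              1024 * sizeof(float), host_b, &err);
cl_mem buf_c = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                              1024 * sizeof(float), NULL, &err);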
The OpenCL C Language
OpenCL kernels are written in OpenCL C, a C-based language that adds, among other things,
function qualifiers and built-in work-item functions; see the Khronos reference pages for details:
https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/functionQualifiers.html
https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/workItemFunctions.html
The kernel below performs an elementwise addition of the two vectors A and B and stores
the result in C. The kernel function is executed concurrently for each element of the
vectors. Each execution of the kernel is associated with a global ID and a local ID
(see Section 3.1.3), which can be retrieved via built-in functions. This allows each
work item to write its result to global memory without race conditions.
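A minimal sketch of such a kernel (the parameter names A, B and C follow the description above; the original listing is not reproduced here):

__kernel void vec_add(__global const float *A,
                      __global const float *B,
                      __global float *C)
{
    size_t i = get_global_id(0);   /* global index of this work item */
    C[i] = A[i] + B[i];            /* each work item adds exactly one element */
}

The kernel source is passed as a string to clCreateProgramWithSource and compiled at runtime: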
cl_program clCreateProgramWithSource(cl_context context,
cl_uint count,
const char **strings,
const size_t *lengths,
cl_int *errcode_ret);
clBuildProgram builds the given program for the passed devices. If this function
returns an error code, we have to check further what went wrong.
cl_int clGetProgramBuildInfo(cl_program program,
cl_device_id device,
cl_program_build_info param_name,
size_t param_value_size,
void *param_value,
size_t *param_value_size_ret);
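A minimal sketch of creating and building the program and, on failure, retrieving the build log (the source string is only indicated; <stdio.h> and <stdlib.h> are assumed to be included):

const char *kernel_src = "...";   /* OpenCL C source, e.g. the vec_add kernel above */
cl_program program = clCreateProgramWithSource(ctx, 1, &kernel_src, NULL, &err);
err = clBuildProgram(program, 1, &device, NULL, NULL, NULL);
if (err != CL_SUCCESS) {
    size_t log_size = 0;
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
    char *log = malloc(log_size);
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, log_size, log, NULL);
    fprintf(stderr, "Build log:\n%s\n", log);
    free(log);
}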
Once the kernel object has been created (e.g. with clCreateKernel), the arguments of the
kernel function have to be set via the clSetKernelArg function. The argument index starts at 0.
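A short sketch using the vec_add kernel and the buffers from the earlier sketches (the kernel name string is an assumption):

cl_kernel kernel = clCreateKernel(program, "vec_add", &err);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf_a);   /* argument indices start at 0 */
clSetKernelArg(kernel, 1, sizeof(cl_mem), &buf_b);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &buf_c);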
cl_int clEnqueueNDRangeKernel(cl_command_queue command_queue,
cl_kernel kernel,
cl_uint work_dim,
const size_t *global_work_offset,
const size_t *global_work_size,
const size_t *local_work_size,
cl_uint num_events_in_wait_list,
const cl_event *event_wait_list,
cl_event *event);
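A sketch launching the kernel over a one-dimensional range of 1024 work items in work groups of 64 and reading the result back with a blocking transfer (sizes match the buffer sketch above):

size_t global_size = 1024;   /* total number of work items, one per element */
size_t local_size  = 64;     /* work items per work group */
err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                             &global_size, &local_size, 0, NULL, NULL);
clEnqueueReadBuffer(queue, buf_c, CL_TRUE, 0, 1024 * sizeof(float), host_c,
                    0, NULL, NULL);   /* blocking read of the result */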
3.2 CPUs
3.2.1 Motivation
Although the last years have seen a myriad of emerging new architectures, ranging from
GPU accelerators to the Xeon Phi, Accelerated Processing Unit (APU) designs,4 and
various experimental platforms, the commodity x86 Central Processing Unit (CPU)
is still the main workhorse in traditional desktop computers, computing centers,
cluster computers, and supercomputers, and is therefore worth examining in more
detail in this course.
In this lesson, you will learn about the different types of parallelism that can be found
in today's CPU architectures. In addition, you will learn about tools, such as
OpenMP [?], that allow you to exploit these parallel capabilities in a comfortable
way.
4 The umbrella term APU is used for architectures that integrate CPU and GPU architectures
sharing a common memory, in most cases via a shared last-level cache. Naively put,
APUs can be thought of as a fusion of CPUs and GPUs into a single design.