
3 Heterogeneous Computer Architectures

3.1 GPUs

3.1.1 Motivation

In recent years a very specific type of accelerator has been on the rise. Originating from
traditional computer graphics, where a programmable graphics pipeline has become
more and more ubiquitous, the massively parallel architecture of a Graphics Processing
Unit (GPU) is increasingly used for general purpose parallel programming tasks. As such,
the triumph of General Purpose Computation on Graphics Processing Units (GPGPU)
has begun.
The origins of GPGPU lie in the (ab)use of shader programming languages like GLSL [1],
HLSL [2] and Cg [3], from which the first versions of CUDA [4] were derived. CUDA became
the de-facto standard on NVIDIA based accelerators.
Since then it has become apparent that the techniques developed for GPGPU can be applied to
a broader range of accelerators and multi-core architectures. As a result, the OpenCL
standard [5] was developed and is being standardized in collaboration with multiple
technical teams from various companies.
In this lesson you will learn the basic principles that went into OpenCL and how to
use them efficiently on your devices. OpenCL defines basic principles shared between
different accelerator devices:
• Platform Model
• Execution Model
• Memory Model
• Programming Model
These will be explained in the following sections. The lesson closes with a short
reference to the OpenCL C language. The complete OpenCL specification can be found
in [6].


3.1.2 The OpenCL Platform Model

The OpenCL Platform Model consists of a host and one or more compute devices, each
containing compute units with processing elements (see Figure 3.1).

Figure 3.1: The OpenCL Platform Model (Image courtesy of Khronos Group)

Host
The host is responsible for coordinating the execution of the parallel code. It is
the machine on which the main C/C++ program is running. As such, it is responsible
for managing data (copies to and from device memory) as well as for starting and
synchronizing tasks. The host program usually runs on the CPU built into the computer.
Compute Device
The compute device is the accelerator onto which the task is offloaded. It
contains at least one Compute Unit and global memory. In the case of a GPU, the
compute device is the actual graphics card. When using CPUs with OpenCL, it is the
entire set of CPUs built into the machine.
Compute Units
A Compute Unit is a conglomerate of Processing Elements together with shared memory,
which serves for synchronization between different Processing Elements. On NVIDIA
graphics cards, a Compute Unit corresponds to a Multiprocessor; on regular CPUs, the
Compute Units are mapped to single cores.
Processing Elements
The Processing Elements are the actual execution entities in an OpenCL context. Mul-
tiple Processing Elements are driven in a Single Instruction Multiple Data (SIMD)
fashion. On GPUs a Processing Element is a shader core and on CPUs it is represented by
a single SIMD execution unit.

3.1.3 The OpenCL Execution Model

The execution of an OpenCL program is divided into two parts:

1. The kernels, which are executed on one or more devices

2. The host code, orchestrating the execution of kernels on the various devices

The execution model then defines how the different kernels are executed on the
various devices. Most important are the distribution of work as well as the management
of contexts and command queues.

Distribution of Work

In OpenCL, the parallelization of tasks is managed by so-called NDRanges. An NDRange
provides an N-dimensional (where N is 1, 2 or 3) description of the problem space (or
index space). Figure 3.2 shows an example 2-dimensional space of size Gx ∗ Gy.
This decomposition can, for example, be used for a matrix operation. Each NDRange is
subsequently divided into single tiles, also known as Work Groups.

Figure 3.2: NDRange


A Work Group is a set of Work Items. Each Work Item is a single element of the
index space to be processed. Every Work Item executes the exact same code on a
Compute Unit; that is, a Compute Unit executes as many Work Items in parallel as it
has Processing Elements. This happens in a SIMD fashion, which means that Processing
Elements may idle once branches in the control flow diverge. A kernel is programmed
in terms of the globally defined index space, which is divided into the locally defined
index spaces determined by the Work Group size.
The following example demonstrates the mapping from Work Group IDs and sizes to the
global Work Item ID:

(wx, wy) = Work Group ID
(Sx, Sy) = Work Group Size
(sx, sy) = Local Work Item ID
(Fx, Fy) = Global Offset
(gx, gy) = Global Work Item ID

(gx, gy) = (wx ∗ Sx + sx + Fx, wy ∗ Sy + sy + Fy)

To further demonstrate this, we choose an NDRange with 128x64 elements. A work group
should have 16x16 Work Items; we therefore have 8x4 Work Groups. The global Work
Item ID (52, 22) is located in the Work Group with ID (3, 1), where the local ID within
that Work Group is (4, 6). Assuming an offset of (0, 0), our formula gives us:

(3 ∗ 16 + 4 + 0, 1 ∗ 16 + 6 + 0) = (52, 22)
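Inside a kernel, the same quantities are available through OpenCL C built-in work-item
functions (introduced in Section 3.1.5). The following minimal sketch (the kernel name
and output buffer are illustrative assumptions, not part of the original text) checks the
identity above for the first dimension:

__kernel void check_mapping(__global int *ok)
{
    size_t wx = get_group_id(0);       // work-group ID
    size_t Sx = get_local_size(0);     // work-group size
    size_t sx = get_local_id(0);       // local work-item ID
    size_t Fx = get_global_offset(0);  // global offset
    size_t gx = get_global_id(0);      // global work-item ID

    // gx == wx * Sx + sx + Fx holds for every work item
    ok[gx - Fx] = (wx * Sx + sx + Fx == gx);
}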

Context and Command Queues

OpenCL uses contexts to manage the execution of the program. This is accomplished
by creating a context object in the Host-Code. A context consists of the following
resources:
• Devices: The used OpenCL devices
• Kernels: OpenCL functions which will run on the devices
• Program Objects: The implementations of the kernels
• Memory Objects: Memory regions used by the Host and the devices
To use a specific device, the host uses Command Queues. The host code adds the
commands to be executed to the queue of the specified device. Possible commands
include:
• Kernel Execution: Execution of kernels on the Processing Elements of a device.
• Memory Commands: Memory transfer between host and device


• Synchronization: Synchronization between the concurrently executing control
flows of host and device

The execution of commands can be scheduled in-order or out-of-order. With in-order
execution, the commands are executed sequentially in the order in which they are
inserted into the queue. When an out-of-order command queue has been created, the
execution order is unspecified; the order of execution is then determined by specifying
data dependencies through event objects. This can be used to hide latencies, for example
to optimize data transfers. Event objects can also be used with in-order command queues
for synchronization. There can be more than one command queue per device, which can be
used to increase concurrency in the system and further hide latencies.
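As a brief, hedged sketch (assuming a command queue cq, a buffer buf, a kernel and the
sizes have been created elsewhere, using API functions introduced in Section 3.1.5), a data
dependency between a non-blocking transfer and a subsequent kernel launch can be
expressed with an event object:

cl_event write_done;
// Non-blocking write: the event signals completion of the transfer.
clEnqueueWriteBuffer(cq, buf, CL_FALSE, 0, bytes, host_ptr, 0, NULL, &write_done);
// The kernel launch may only start once the write has finished.
clEnqueueNDRangeKernel(cq, kernel, 1, NULL, &global_size, NULL, 1, &write_done, NULL);
clReleaseEvent(write_done);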

3.1.4 The OpenCL Memory Model

The OpenCL Memory Model is able to cope with different memory hierarchies and
provides different allocation policies and access rights management.
Memory in OpenCL is divided into four regions. Figure 3.3 shows their respective
positions and possible data flows.

Figure 3.3: OpenCL Memory Regions


Global Memory

This memory region is used by all work items as shared memory. It has read and write
access. Global memory is the destination of host-to-device data transfers.

Constant Memory

This region is written by the host and stays constant throughout the kernel execution.
This means the kernel has read-only access to this memory. Usually this memory ends
up in global memory; however, some OpenCL devices have constant caches, which
decreases latency in most use cases.

Local Memory

Local Memory is shared within a work group. This relatively small memory region is
closest to the work items and can be used for efficient synchronization between the
work items of a work group.
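A minimal kernel sketch (the kernel name and data layout are illustrative assumptions)
showing how local memory and a work-group barrier interact:

__kernel void reverse_tile(__global float *data, __local float *tile)
{
    size_t lid = get_local_id(0);
    size_t gid = get_global_id(0);
    size_t n   = get_local_size(0);

    // Stage one element per work item in fast local memory.
    tile[lid] = data[gid];
    // Wait until all work items of the group have written their element.
    barrier(CLK_LOCAL_MEM_FENCE);
    // Safely read a value written by another work item of the same group.
    data[gid] = tile[n - 1 - lid];
}

The size of the __local buffer would be set by the host via clSetKernelArg with a NULL
argument value (see Section 3.1.5).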

Private Memory

Private Memory represents the exclusive, non-shared memory of a specific work item. It
essentially consists of the registers of a processing element as well as an exclusive
memory region in the main memory of the device.

Image Memory

The origin of Image Memory lies in the graphics-processing heritage of GPUs. This
memory is read-only on the device and can be used as a big global constant memory
region. Its important features are that it is cached and that various operations (such as
interpolation and clamping) are implemented in hardware and therefore execute quickly.
Table 3.1 illustrates the different memory types with their respective access rights
and allocation strategies on the host and device.
The host application allocates memory with the help of the OpenCL API. Data transfers
are inserted into the command queue and can be blocking or non-blocking: a blocking
transfer function only returns when the data transfer is complete, whereas a non-blocking
transfer requires additional synchronization.
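In host code the distinction looks roughly as follows (a sketch only; cq, buf, bytes and
src are placeholder names, the functions are introduced in Section 3.1.5):

// Blocking write: returns only after the data has been copied to the device.
clEnqueueWriteBuffer(cq, buf, CL_TRUE, 0, bytes, src, 0, NULL, NULL);

// Non-blocking write: returns immediately, so explicit synchronization is required.
cl_event evt;
clEnqueueWriteBuffer(cq, buf, CL_FALSE, 0, bytes, src, 0, NULL, &evt);
/* ... other host work can overlap with the transfer here ... */
clWaitForEvents(1, &evt);
clReleaseEvent(evt);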


         Global              Image               Constant            Local               Private
Host     Dynamic Allocation  Dynamic Allocation  Dynamic Allocation  Dynamic Allocation  No Allocation
         Read/Write Access   Read/Write Access   Read/Write Access   No Access           No Access
Device   No Allocation       No Allocation       Static Allocation   Static Allocation   Static Allocation
         Read/Write Access   Read-Only Access    Read-Only Access    Read/Write Access   Read/Write Access

Table 3.1: Types of OpenCL Memory

3.1.5 The OpenCL Programming Model

The OpenCL Programming Model allows combining various OpenCL devices with the
host system into a single heterogeneous programming environment. It is divided
into three parts:
1. The OpenCL Platform Layer: responsible for querying information about and
capabilities of the system
2. The OpenCL Runtime Layer: responsible for creating contexts and command
queues, managing memory and executing kernels
3. The OpenCL C Language: the language in which the OpenCL kernels
are written.
The OpenCL platform and runtime layers are exposed via a C API, which is introduced
here. The OpenCL C language is a language extension to the C standard.
Please note that we only briefly discuss the various functions. For a detailed descrip-
tion, please refer to the OpenCL reference manual [6].

The OpenCL Platform Layer

The OpenCL Platform Layer allows the host program to query the available OpenCL
devices and their properties, as well as to create contexts for the various devices.
Querying platform information

cl_int clGetPlatformIDs(cl_uint num_entries,
                        cl_platform_id *platforms,
                        cl_uint *num_platforms);

The function clGetPlatformIDs uses platforms as an out parameter to return the
available platforms. If num_entries is 0 and platforms is NULL, num_platforms receives
the number of available platforms, which can then be used to allocate enough space for
the platform IDs before calling the function again.
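A short sketch of this two-call pattern (error handling omitted):

cl_uint num_platforms = 0;
// First call: only query how many platforms are available.
clGetPlatformIDs(0, NULL, &num_platforms);

// Second call: retrieve the actual platform IDs.
std::vector<cl_platform_id> platforms(num_platforms);
clGetPlatformIDs(num_platforms, platforms.data(), NULL);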


cl_int clGetPlatformInfo(cl_platform_id platform,
                         cl_platform_info param_name,
                         size_t param_value_size,
                         void *param_value,
                         size_t *param_value_size_ret);

The function clGetPlatformInfo is used to query information about a specific plat-
form. The same technique as for clGetPlatformIDs can be used to retrieve the number
of bytes to allocate for param_value. Possible parameters for param_name are:

param_name              Return Type  Description
CL_PLATFORM_NAME        char[]       The name of the platform
CL_PLATFORM_VENDOR      char[]       The platform vendor
CL_PLATFORM_VERSION     char[]       The version of the platform
CL_PLATFORM_EXTENSIONS  char[]       The supported extensions
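For example, the platform name could be queried like this (sketch; error handling omitted):

size_t len = 0;
clGetPlatformInfo(platform, CL_PLATFORM_NAME, 0, NULL, &len);
std::vector<char> name(len);
clGetPlatformInfo(platform, CL_PLATFORM_NAME, len, name.data(), NULL);
std::cout << "Platform: " << name.data() << "\n";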
Querying devices of a platform

cl_int clGetDeviceIDs(cl_platform_id platform,
                      cl_device_type device_type,
                      cl_uint num_entries,
                      cl_device_id *devices,
                      cl_uint *num_devices);

The devices of a platform are retrieved with the function clGetDeviceIDs. In order to
retrieve the number of available devices of a platform, the same technique as described
above can be applied. cl_device_type can be one of the following:
• CL_DEVICE_TYPE_ALL: All devices of that platform
• CL_DEVICE_TYPE_CPU: Only CPU devices
• CL_DEVICE_TYPE_GPU: Only GPU devices
cl_int clGetDeviceInfo(cl_device_id device,
cl_device_info param_name,
size_t param_value_size,
void *param_value,
size_t *param_value_size_ret);

The function clGetDeviceInfo is used to query information about a specific device.
The same technique as for clGetPlatformInfo can be used to retrieve the number of
bytes to allocate for param_value. Possible parameters for param_name are:


param_name                   Return Type  Description
CL_DEVICE_NAME               char[]       The name of the device
CL_DEVICE_VENDOR_ID          cl_uint      The vendor ID of the device
CL_DEVICE_MAX_COMPUTE_UNITS  cl_uint      Maximum available compute units on the device
CL_DEVICE_GLOBAL_MEM_SIZE    cl_ulong     Available device global memory in bytes
CL_DEVICE_LOCAL_MEM_SIZE     cl_ulong     Available device local memory per work group in bytes
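For instance, the number of compute units of a device could be queried as follows
(sketch; error handling omitted):

cl_uint compute_units = 0;
clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                sizeof(compute_units), &compute_units, NULL);
std::cout << "Compute units: " << compute_units << "\n";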
cl_context clCreateContext(const cl_context_properties *properties,
                           cl_uint num_devices,
                           const cl_device_id *devices,
                           void (CL_CALLBACK *pfn_notify)(const char *errinfo,
                                                          const void *private_info,
                                                          size_t cb,
                                                          void *user_data),
                           void *user_data,
                           cl_int *errcode_ret);

To create a context, the function clCreateContext is used. A context can be created
for one or more devices, which are passed via the devices pointer; the num_devices
parameter determines the number of devices this context uses. properties is a
NULL-terminated list of various context-specific properties. The remaining parameters
are used for error detection.

The OpenCL Runtime Layer

The OpenCL Runtime Layer allows the manipulation of contexts during the runtime of
the program and the creation of command queues. It also provides the functions to
create program objects, execute kernels and manage memory.
cl_command_queue clCreateCommandQueue(cl_context context,
cl_device_id device,
cl_command_queue_properties properties,
cl_int *errcode_ret);

Command queues are a handle for managing the execution of operations on a device. A
device can be associated with more than one command queue, which allows overlapping
memory transfers with computation. A command queue is only valid within a given
context.
Memory Management:
cl_mem clCreateBuffer(cl_context context,
cl_mem_flags flags,
size_t size,
void *host_ptr,
cl_int *errcode_ret);


A memory object is represented by the cl_mem type. clCreateBuffer is used to create a
memory object. size determines the size in bytes, and flags is a bitfield defining the
OpenCL device access rights (CL_MEM_READ_ONLY, CL_MEM_WRITE_ONLY, CL_MEM_READ_WRITE)
and allocation behaviour (e.g. CL_MEM_COPY_HOST_PTR); the flags can be combined with a
bitwise or. host_ptr can be NULL, but has to be valid in case the flag
CL_MEM_COPY_HOST_PTR is set, since then the host memory is directly copied into the
device memory.
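A sketch of creating a read-only device buffer that is initialized directly from host
memory at creation time (names are placeholders):

std::vector<float> host_data(1024, 1.0f);
cl_int err = CL_SUCCESS;
// CL_MEM_COPY_HOST_PTR copies host_data into the new buffer when it is created.
cl_mem buf = clCreateBuffer(context,
                            CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            host_data.size() * sizeof(float),
                            host_data.data(),
                            &err);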
To manually copy memory between host and device, the functions clEnqueueWriteBuffer
and clEnqueueReadBuffer can be used.
cl_int clEnqueueWriteBuffer(cl_command_queue command_queue,
cl_mem buffer,
cl_bool blocking_write,
size_t offset,
size_t size,
const void *ptr,
cl_uint num_events_in_wait_list,
const cl_event *event_wait_list,
cl_event *event);

cl_int clEnqueueReadBuffer(cl_command_queue command_queue,
                           cl_mem buffer,
                           cl_bool blocking_read,
                           size_t offset,
                           size_t size,
                           void *ptr,
                           cl_uint num_events_in_wait_list,
                           const cl_event *event_wait_list,
                           cl_event *event);

Creating Programs: Since OpenCL is a platform-independent framework, the kernels
to be used are compiled at runtime of the host program. This brings the advantage of
being able to use any device without recompiling the host code. The given OpenCL
program can then be optimized towards the accelerator in use.
An OpenCL program is written in the OpenCL C language. One of the main differ-
ences to plain C is the availability of various qualifiers², built-in functions for math
and vector operations, and built-in functions for work-item information³. For example,
the __kernel function qualifier denotes an entry point of an OpenCL function and the
__global variable qualifier is used to mark a pointer that lives in global memory.
__kernel void vec_add(__global float *A, __global float *B, __global float *C)
{
    int idx = get_global_id(0);
    C[idx] = A[idx] + B[idx];
}

2 https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/functionQualifiers.html
3 https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/workItemFunctions.html


This is an example kernel that performs an elementwise addition of the two
vectors A and B and stores the result in C. The kernel function is executed concurrently
for each element of the vectors. Each execution of the kernel is associated with a global
ID and a local ID (see Section 3.1.3), which can be retrieved via the built-in functions.
This allows us to correctly add the contents of global memory without race conditions.
cl_program clCreateProgramWithSource(cl_context context,
cl_uint count,
const char **strings,
const size_t *lengths,
cl_int *errcode_ret);

clCreateProgramWithSource creates an OpenCL program object. It is associated
with a context. The OpenCL source code is passed in via character arrays that have
been created a priori, as discussed above. After creating a program object, it has to be
compiled and linked.
cl_int clBuildProgram(cl_program program,
                      cl_uint num_devices,
                      const cl_device_id *devices,
                      const char *options,
                      void (CL_CALLBACK *pfn_notify)(cl_program program, void *user_data),
                      void *user_data);

clBuildProgram builds the given program for the passed devices. If this function
returns an error code, we have to check further what went wrong.
cl_int clGetProgramBuildInfo(cl_program program,
cl_device_id device,
cl_program_build_info param_name,
size_t param_value_size,
void *param_value,
size_t *param_value_size_ret);

clGetProgramBuildInfo can be used to obtain detailed build information. The
parameters are used in the usual way.

param_name               Return Type      Description
CL_PROGRAM_BUILD_STATUS  cl_build_status  The status of the build
CL_PROGRAM_BUILD_LOG     char[]           Detailed information about the build process, including errors
Executing Kernels: To use the kernels contained in a cl_program object, a kernel
object needs to be created, identified by the name of the kernel function as a string.
cl_kernel clCreateKernel(cl_program program,
const char *kernel_name,
cl_int *errcode_ret);


cl_int clSetKernelArg(cl_kernel kernel,
                      cl_uint arg_index,
                      size_t arg_size,
                      const void *arg_value);

Once the kernel has been created, the arguments to the kernel function have to be set
via the clSetKernelArg function. Argument indices start at 0.
cl_int clEnqueueNDRangeKernel(cl_command_queue command_queue,
cl_kernel kernel,
cl_uint work_dim,
const size_t *global_work_offset,
const size_t *global_work_size,
const size_t *local_work_size,
cl_uint num_events_in_wait_list,
const cl_event *event_wait_list,
cl_event *event);

The kernel is then executed by enqueueing it into the command_queue with
clEnqueueNDRangeKernel. The setup of the global work size and offset as well as
the local work size follows the same scheme as discussed in Section 3.1.3.
If local_work_size is NULL, the OpenCL implementation chooses a suitable
value.
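Tying this back to the 128x64 NDRange example from Section 3.1.3, a 2-dimensional
launch could look like this (sketch; error handling omitted):

size_t global_offset[2] = {0, 0};
size_t global_size[2]   = {128, 64};  // Gx x Gy work items in total
size_t local_size[2]    = {16, 16};   // one work group of 16x16 work items
clEnqueueNDRangeKernel(cq, kernel, 2, global_offset,
                       global_size, local_size, 0, NULL, NULL);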
Full Example:
The following source code gives a full example, with inline comments, of the
functionality discussed above.
// Include the OpenCL C-API definitions
#include <CL/cl.h>

// Include some utilities from the C++ standard library
#include <iostream>
#include <vector>

// Define the OpenCL program doing an elementwise vector addition in the form:
// C[i] = A[i] + B[i];
// float is used here; double would additionally require the cl_khr_fp64 extension.
const char *opencl_source =
    "__kernel void vec_add(__global float *A,          \n"
    "                      __global float *B,          \n"
    "                      __global float *C)          \n"
    "{                                                 \n"
    "    int i = get_global_id(0);                     \n"
    "    C[i] = A[i] + B[i];                           \n"
    "}                                                 \n";

int main()
{
    // variable used for error checking
    cl_int error = CL_SUCCESS;

    // Determine the number of available platforms:
    cl_uint num_platforms = 0;
    error = clGetPlatformIDs(0, NULL, &num_platforms);
    if(error != CL_SUCCESS || num_platforms == 0) { /* error handling ... */ }

    // Get the first OpenCL platform.
    cl_platform_id platform;
    error = clGetPlatformIDs(1, &platform, NULL);
    if(error != CL_SUCCESS) { /* error handling ... */ }

    // Determine the number of available devices on that platform:
    cl_uint num_devices = 0;
    error = clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 0, NULL, &num_devices);
    if(error != CL_SUCCESS || num_devices == 0) { /* error handling ... */ }

    // Get the first OpenCL device on that platform
    cl_device_id device;
    error = clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);
    if(error != CL_SUCCESS) { /* error handling ... */ }

    // Print out the device name:
    size_t name_len = 0;
    error = clGetDeviceInfo(device, CL_DEVICE_NAME, 0, NULL, &name_len);
    if(error != CL_SUCCESS || name_len == 0) { /* error handling ... */ }
    char *device_name = new char[name_len];
    error = clGetDeviceInfo(device, CL_DEVICE_NAME, name_len, device_name, NULL);
    if(error != CL_SUCCESS) { /* error handling ... */ }
    std::cout << "Using device " << device_name << "\n";
    delete[] device_name;

    // Create a context for our device
    cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, &error);
    if(error != CL_SUCCESS) { /* error handling ... */ }

    // Create a command queue for our context and device
    cl_command_queue cq = clCreateCommandQueue(context, device, 0, &error);
    if(error != CL_SUCCESS) { /* error handling ... */ }

    // Create the OpenCL program for our kernel; passing NULL as the lengths
    // argument treats the source string as NULL-terminated.
    cl_program program = clCreateProgramWithSource(context, 1, &opencl_source, NULL, &error);
    if(error != CL_SUCCESS) { /* error handling ... */ }

    // Build our program
    error = clBuildProgram(program, 1, &device, NULL, NULL, NULL);
    if(error != CL_SUCCESS)
    {
        // Retrieve the build log to see why our build failed:
        size_t log_len = 0;
        error = clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_len);
        if(error != CL_SUCCESS || log_len == 0) { /* error handling ... */ }
        char *build_log = new char[log_len];
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, log_len, build_log, NULL);
        std::cout << "Build failed:\n" << build_log << "\n";
        delete[] build_log;
    }

    // Create the kernel object
    cl_kernel kernel = clCreateKernel(program, "vec_add", &error);
    if(error != CL_SUCCESS) { /* error handling ... */ }

    // Create our host data structures; fillA() and fillB() are placeholders
    // to fill the two vectors with arbitrary data.
    std::vector<float> A = fillA();
    std::vector<float> B = fillB();
    std::vector<float> C(B.size());

    // Create our OpenCL memory objects (buffer sizes are given in bytes)
    cl_mem A_device = clCreateBuffer(context, CL_MEM_READ_ONLY, A.size() * sizeof(float), NULL, &error);
    if(error != CL_SUCCESS) { /* error handling ... */ }
    cl_mem B_device = clCreateBuffer(context, CL_MEM_READ_ONLY, B.size() * sizeof(float), NULL, &error);
    if(error != CL_SUCCESS) { /* error handling ... */ }
    cl_mem C_device = clCreateBuffer(context, CL_MEM_WRITE_ONLY, C.size() * sizeof(float), NULL, &error);
    if(error != CL_SUCCESS) { /* error handling ... */ }

    // Copy our data to the device (blocking writes)
    error = clEnqueueWriteBuffer(cq, A_device, CL_TRUE, 0, A.size() * sizeof(float), A.data(), 0, NULL, NULL);
    if(error != CL_SUCCESS) { /* error handling ... */ }
    error = clEnqueueWriteBuffer(cq, B_device, CL_TRUE, 0, B.size() * sizeof(float), B.data(), 0, NULL, NULL);
    if(error != CL_SUCCESS) { /* error handling ... */ }

    // Set the kernel arguments
    error = clSetKernelArg(kernel, 0, sizeof(cl_mem), &A_device);
    if(error != CL_SUCCESS) { /* error handling ... */ }
    error = clSetKernelArg(kernel, 1, sizeof(cl_mem), &B_device);
    if(error != CL_SUCCESS) { /* error handling ... */ }
    error = clSetKernelArg(kernel, 2, sizeof(cl_mem), &C_device);
    if(error != CL_SUCCESS) { /* error handling ... */ }

    // Determine our global work offset and size.
    size_t global_work_offset[] = {0};
    size_t global_work_size[] = {A.size()};

    // Enqueue the kernel execution
    error = clEnqueueNDRangeKernel(cq, kernel, 1, global_work_offset, global_work_size, NULL, 0, NULL, NULL);
    if(error != CL_SUCCESS) { /* error handling ... */ }

    // Wait until the OpenCL kernel execution is complete
    clFinish(cq);

    // Copy the result back into our vector (blocking read)
    error = clEnqueueReadBuffer(cq, C_device, CL_TRUE, 0, C.size() * sizeof(float), C.data(), 0, NULL, NULL);
    if(error != CL_SUCCESS) { /* error handling ... */ }

    // We can now use the computed result!
}

List of Abbreviations

GPU    Graphics Processing Unit
GPGPU  General Purpose Computation on Graphics Processing Units
SIMD   Single Instruction Multiple Data

Bibliography

[1] The Khronos Consortium, OpenGL Shading Language, 2014, https://www.opengl.org/registry/doc/GLSLangSpec.4.40.pdf
[2] Microsoft, High Level Shading Language for DirectX, 2014, https://msdn.microsoft.com/en-us/library/windows/desktop/bb509561(v=vs.85).aspx
[3] NVIDIA, C for Graphics, 2012, https://developer.nvidia.com/cg-toolkit
[4] NVIDIA, NVIDIA CUDA Zone, 2014, https://developer.nvidia.com/cuda-zone
[5] The Khronos Consortium, The OpenCL Specification, Version 1.2, 2014, https://www.khronos.org/registry/cl/specs/opencl-1.2.pdf#page=125
[6] The Khronos Consortium, OpenCL 1.2 Reference Pages, 2014, https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/

3.2 CPUs

3.2.1 Motivation

Although the last years have seen a myriad of emerging new architectures—ranging from
GPU accelerators to the Xeon Phi, Accelerated Processing Unit (APU) designs, 4 and
various experimental platforms—the commodity x86 Central Processing Unit (CPU)
is still the main workhorse used in traditional desktop computers, computing centers,
cluster computers, and supercomputers, and, as such, important to be examined in more
detail in this course.
In this lesson, you will learn about the different types of parallelism that can be found
in today's CPU architectures. In addition, you will learn about tools, such as
OpenMP [?], that allow you to exploit these parallel capabilities in a comfortable
way.

4
The common denominator APU is used for architectures that provide an integration of CPU and GPU
architectures that share a common memory, in most cases via a shared last-level cache. Naively put,
APUs can be thought of as a fusion of CPUs and GPUs into a single design.


