3 Heterogeneous Computer Architectures
3.1 GPUs
3.1.1 Motivation
In recent years a very specific type of accelerator has been on the rise. Originating from
traditional computer graphics, where a programmable graphics pipeline has become
more and more ubiquitous, the massively parallel architecture of the Graphics Processing
Unit (GPU) is increasingly used for general-purpose parallel programming tasks. As such,
the triumph of General Purpose Computation on Graphics Processing Units (GPGPU)
has begun.
The origins of GPGPU lie in the (ab)use of shader programming languages like GLSL [1],
Cg [2] and HLSL [3], from which the first versions of CUDA [4] were derived. CUDA became
the de facto standard on NVIDIA-based accelerators.
Since then it has become apparent that the techniques developed for GPGPU can be applied to
a broader range of accelerators and multi-core architectures. As a result, the OpenCL
standard [5] was developed; it is standardized in collaboration with technical teams
from various companies.
In this lesson you will learn the basic principles that went into OpenCL and how to
use them efficiently on your devices. OpenCL defines basic concepts shared between
different accelerator devices:
• Platform Model
• Execution Model
• Memory Model
• Programming Model
These will be explained in the remaining sections. The lesson closes with a short
reference to the OpenCL C language. The complete OpenCL specification can be found
in [6].
3.1.2 Platform Model
The OpenCL Platform Model comprises the host, one or more compute devices, their
compute units and the processing elements (see Figure 3.1).
Figure 3.1: The OpenCL Platform Model (Image courtesy of Khronos Group)
Host
The host is responsible for coordinating the execution of the parallel code; it is
the machine on which the main C/C++ program runs. As such, it is responsible
for managing data (copies from and to device memory) and for starting and synchronizing
tasks. The host program usually runs on the CPU built into the computer.
Compute Device
The compute device is the accelerator onto which the task is offloaded. It
contains at least one Compute Unit and global memory. In the case of a GPU, the
compute device is the actual graphics card. When using CPUs with OpenCL, it is the
entire set of CPUs built into the machine.
Compute Units
A Compute Unit is a group of Processing Elements together with shared (local) memory,
which is used for synchronization and data exchange between the Processing Elements. On NVIDIA
graphics cards a Compute Unit corresponds to a (streaming) multiprocessor, and on regular CPUs
Compute Units are mapped to individual cores.
Processing Elements
The Processing Elements are the actual execution entities in an OpenCL context. Multiple
Processing Elements are driven in a Single Instruction Multiple Data (SIMD)
fashion. On GPUs a Processing Element corresponds to a shader core, and on CPUs it is represented by
a single SIMD execution unit.
3.1.3 Execution Model
Distribution of Work
A Work Group is a set of Work Items; each Work Item is a single element of the index
space to be processed. Every Work Item executes the same kernel code. A Compute Unit
executes as many Work Items in parallel as it has Processing Elements. This happens in
a SIMD fashion, which means that Processing Elements may idle once the control flow
diverges at branches. A kernel is programmed in terms of the globally defined index
space, which is divided into locally defined index spaces (the Work Groups), whose size
is given by the number of Work Items per Work Group.
To map between the two, the global Work Item ID is obtained, per dimension, as

global ID = Work Group ID ∗ Work Group size + local ID + offset

An example demonstrates this mapping: we choose an NDRange with 128x64 elements and
Work Groups of 16x16 Work Items, which gives 8x4 Work Groups. The global Work Item ID
(52, 22) lies in the Work Group with ID (3, 1), where the local ID within that Work
Group is (4, 6). Assuming an offset of (0, 0), the formula gives us:
(3 ∗ 16 + 4 + 0, 1 ∗ 16 + 6 + 0) = (52, 22)
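For illustration (a sketch; the kernel and variable names are not from the original text), these IDs are available inside a kernel through built-in functions, and the relation above can be verified directly:

__kernel void show_ids(__global int *out)
{
    size_t g = get_global_id(0);    /* global work item ID */
    size_t w = get_group_id(0);     /* work group ID */
    size_t l = get_local_id(0);     /* local ID within the work group */
    size_t S = get_local_size(0);   /* work group size */
    /* g == w * S + l (plus the global offset, if one was given) */
    out[g] = (int)(w * S + l);
}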
OpenCL uses contexts to manage the execution of a program. This is accomplished
by creating a context object in the host code (a minimal sketch follows the list).
A context comprises the following resources:
• Devices: The OpenCL devices being used
• Kernels: The OpenCL functions that will run on the devices
• Program Objects: The implementations of the kernels
• Memory Objects: Memory regions used by the host and the devices
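A minimal sketch of creating such a context in the host code (the variable device is assumed to hold a previously queried cl_device_id; see the device query sketch later in this lesson):

cl_int err;
cl_context ctx = clCreateContext(NULL,        /* default platform properties */
                                 1, &device,  /* one device */
                                 NULL, NULL,  /* no error callback */
                                 &err);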
To use a specific device, the host uses command queues. The host code adds the
commands to be executed to the queue of the device on which they should run. Possible
commands are:
• Kernel Execution: Execution of kernels on the Processing Elements of a device
• Memory Commands: Memory transfers between host and device
3.1.4 Memory Model
The OpenCL Memory Model copes with different memory hierarchies as well as with
different allocation policies and access-rights management.
Memory in OpenCL is divided into four regions; Figure 3.3 shows their respective
positions and the possible data flows.
Global Memory
This memory region is shared by all work items and can be both read and written.
Global memory is the destination of host-to-device data transfers.
Constant Memory
This region is written by the host and stays constant throughout the kernel execution;
the kernel therefore has read-only access to it. Usually this memory resides in
global memory; however, some OpenCL devices have constant caches, which reduces
latency in most use cases.
Local Memory
Local Memory is shared within a work group. This relatively small memory region is
closest to the work items and can be used for efficient communication and synchronization
between the work items of a work group.
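As a brief illustration (a hypothetical sketch, assuming a work group size of 64 that matches the tile size), local memory combined with a barrier lets the work items of a group exchange data safely:

__kernel void reverse_in_group(__global float *data)
{
    __local float tile[64];            /* assumes a work group size of 64 */
    size_t l = get_local_id(0);
    size_t g = get_global_id(0);
    tile[l] = data[g];                 /* each work item stores one element */
    barrier(CLK_LOCAL_MEM_FENCE);      /* wait until the whole group has written */
    data[g] = tile[get_local_size(0) - 1 - l];  /* read an element written by another work item */
}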
Private Memory
Private memory is visible only to a single work item and cannot be accessed by other
work items. Variables declared inside a kernel are private by default and are typically
mapped to registers.
Image Memory
Image Memory stems from the graphics-processing heritage of GPUs. It is read-only
on the device and can be used as a large global constant memory region. Its important
features are that it is cached and that various operations (such as interpolation and
clamping) are implemented in hardware, offering fast execution times.
Table 3.1 illustrates the different memory types, showing their respective access rights
and allocation strategies on the host and the device.
The host application allocates memory with the help of the OpenCL API. Data transfers
are enqueued into the command queue and can be blocking or non-blocking: a blocking
transfer only returns once the data transfer is complete, whereas a non-blocking transfer
requires additional synchronization.
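A brief sketch of both variants using clEnqueueWriteBuffer (queue, buffer, host_data and size are placeholders for objects created as shown later in this lesson):

/* Blocking write: returns only after the data has been transferred. */
clEnqueueWriteBuffer(queue, buffer, CL_TRUE, 0, size, host_data, 0, NULL, NULL);

/* Non-blocking write: returns immediately; the event is used for synchronization. */
cl_event ev;
clEnqueueWriteBuffer(queue, buffer, CL_FALSE, 0, size, host_data, 0, NULL, &ev);
clWaitForEvents(1, &ev);   /* explicit synchronization before host_data may be reused */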
3.1.5 Programming Model
The OpenCL Programming Model allows various OpenCL devices to be combined with the
host system into a single heterogeneous programming environment. It is divided
into three parts:
1. The OpenCL Platform Layer: Responsible for querying information about and the
capabilities of the system
2. The OpenCL Runtime Layer: Responsible for creating contexts and command
queues, managing memory and executing kernels
3. The OpenCL C Language: The language in which the OpenCL kernels
are written.
The OpenCL Platform and Runtime Layers are exposed via a C API, which is introduced
here. The OpenCL C language is an extension of the C standard.
Please note that we only briefly discuss the various functions; for a detailed
description, please refer to the OpenCL reference manual [6].
The OpenCL Platform Layer allows the host program to query the available OpenCL
devices and their properties, as well as to create contexts for the various devices.
Querying platform information
The function clGetPlatformIDs returns the available platforms via the platforms out
parameter. If num_entries is 0 and platforms is NULL, num_platforms receives the number
of available platforms; this count can be used to allocate an array of the correct size
before calling the function a second time.
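For reference, the signature and a minimal sketch of this two-call pattern (assuming <CL/cl.h> and <stdlib.h> are included; error checking omitted):

cl_int clGetPlatformIDs(cl_uint num_entries,
                        cl_platform_id *platforms,
                        cl_uint *num_platforms);

cl_uint num_platforms = 0;
clGetPlatformIDs(0, NULL, &num_platforms);          /* query the count only */
cl_platform_id *platforms = malloc(num_platforms * sizeof(cl_platform_id));
clGetPlatformIDs(num_platforms, platforms, NULL);   /* fetch the platform IDs */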
The function clGetPlatformInfo is used to query information about a specific platform.
The same technique as for clGetPlatformIDs can be used to retrieve the number of bytes
to allocate for param_value. Possible parameters for param_name are:
param_name              Return type   Description
CL_PLATFORM_NAME        char[]        The name of the platform
CL_PLATFORM_VENDOR      char[]        The platform vendor
CL_PLATFORM_VERSION     char[]        The version of the platform
CL_PLATFORM_EXTENSIONS  char[]        The supported extensions
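For reference, the signature of clGetPlatformInfo together with a short sketch querying the platform name (assuming <stdio.h>; the buffer size of 256 bytes is an arbitrary choice):

cl_int clGetPlatformInfo(cl_platform_id platform,
                         cl_platform_info param_name,
                         size_t param_value_size,
                         void *param_value,
                         size_t *param_value_size_ret);

char name[256];
clGetPlatformInfo(platforms[0], CL_PLATFORM_NAME, sizeof(name), name, NULL);
printf("Platform: %s\n", name);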
Querying devices of a platform
The devices of a platform are retrieved with the function clGetDeviceIDs (a signature
and usage sketch follow the list). In order to retrieve the number of available devices
of a platform, the same technique as described above is applied. cl_device_type can be
one of the following:
• CL_DEVICE_TYPE_ALL: All devices of the platform
• CL_DEVICE_TYPE_CPU: Only CPU devices
• CL_DEVICE_TYPE_GPU: Only GPU devices
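The signature, plus a minimal sketch reusing the two-call pattern and the platforms array from the sketch above:

cl_int clGetDeviceIDs(cl_platform_id platform,
                      cl_device_type device_type,
                      cl_uint num_entries,
                      cl_device_id *devices,
                      cl_uint *num_devices);

cl_uint num_devices = 0;
clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_GPU, 0, NULL, &num_devices);
cl_device_id *devices = malloc(num_devices * sizeof(cl_device_id));
clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_GPU, num_devices, devices, NULL);
cl_device_id device = devices[0];   /* used in the later sketches */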
cl_int clGetDeviceInfo(cl_device_id device,
cl_device_info param_name,
size_t param_value_size,
void *param_value,
size_t *param_value_size_ret);
The function clGetDeviceInfo is used to query information about a specific device. The
same technique as for clGetPlatformInfo can be used to retrieve the number of bytes to
allocate for param_value. Possible parameters for param_name include, among others,
CL_DEVICE_NAME, CL_DEVICE_TYPE, CL_DEVICE_MAX_COMPUTE_UNITS and CL_DEVICE_GLOBAL_MEM_SIZE.
The OpenCL Runtime Layer allows contexts to be manipulated at runtime and command
queues to be created; it is also used for creating program objects, executing kernels
and managing memory.
cl_command_queue clCreateCommandQueue(cl_context context,
cl_device_id device,
cl_command_queue_properties properties,
cl_int *errcode_ret);
A command queue is a handle for managing the execution of operations on a device. A
device can be associated with more than one command queue, which allows memory transfers
and computation to overlap. A command queue is only valid within a given context.
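A minimal usage sketch (ctx and device are the context and device from the earlier sketches; passing 0 for properties requests a default in-order queue):

cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);
/* A second queue on the same device could be used to overlap transfers and computation. */
cl_command_queue copy_queue = clCreateCommandQueue(ctx, device, 0, &err);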
Memory Management:
cl_mem clCreateBuffer(cl_context context,
cl_mem_flags flags,
size_t size,
void *host_ptr,
cl_int *errcode_ret);
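A short sketch allocating the buffers for the vector addition example used later in this lesson (the names and the size of 1024 elements are assumptions):

float host_a[1024], host_b[1024], host_c[1024];   /* input data would be filled in here */
cl_mem buf_a = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                              1024 * sizeof(float), host_a, &err);
cl_mem buf_b = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                              1024 * sizeof(float), host_b, &err);
cl_mem buf_c = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                              1024 * sizeof(float), NULL, &err);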
The OpenCL C Language
OpenCL kernels are written in OpenCL C, a C-based language that adds, among other things,
function qualifiers and built-in work-item functions; see the Khronos reference pages for details:
https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/functionQualifiers.html
https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/workItemFunctions.html
The kernel below performs an elementwise addition of the two vectors A and B and stores
the result in C. The kernel function is executed concurrently for each element of the
vectors. Each execution of the kernel is associated with a global ID and a local ID
(see Section 3.1.3), which can be retrieved via built-in functions. This allows each
work item to write its result to global memory without race conditions.
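A minimal sketch of such a kernel (the parameter names A, B and C follow the description above; the original listing is not reproduced here):

__kernel void vec_add(__global const float *A,
                      __global const float *B,
                      __global float *C)
{
    size_t i = get_global_id(0);   /* global index of this work item */
    C[i] = A[i] + B[i];            /* each work item adds exactly one element */
}

The kernel source is passed as a string to clCreateProgramWithSource and compiled at runtime: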
cl_program clCreateProgramWithSource(cl_context context,
cl_uint count,
const char **strings,
const size_t *lengths,
cl_int *errcode_ret);
clBuildProgram builds the given program for the passed devices. If this function
returns an error code, we have to check further what went wrong.
cl_int clGetProgramBuildInfo(cl_program program,
cl_device_id device,
cl_program_build_info param_name,
size_t param_value_size,
void *param_value,
size_t *param_value_size_ret);
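A minimal sketch of creating and building the program and, on failure, retrieving the build log (the source string is only indicated; <stdio.h> and <stdlib.h> are assumed to be included):

const char *kernel_src = "...";   /* OpenCL C source, e.g. the vec_add kernel above */
cl_program program = clCreateProgramWithSource(ctx, 1, &kernel_src, NULL, &err);
err = clBuildProgram(program, 1, &device, NULL, NULL, NULL);
if (err != CL_SUCCESS) {
    size_t log_size = 0;
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
    char *log = malloc(log_size);
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, log_size, log, NULL);
    fprintf(stderr, "Build log:\n%s\n", log);
    free(log);
}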
Once the kernel object has been created (e.g. with clCreateKernel), the arguments of the
kernel function have to be set via the clSetKernelArg function. The argument index starts at 0.
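A short sketch using the vec_add kernel and the buffers from the earlier sketches (the kernel name string is an assumption):

cl_kernel kernel = clCreateKernel(program, "vec_add", &err);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf_a);   /* argument indices start at 0 */
clSetKernelArg(kernel, 1, sizeof(cl_mem), &buf_b);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &buf_c);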
cl_int clEnqueueNDRangeKernel(cl_command_queue command_queue,
cl_kernel kernel,
cl_uint work_dim,
const size_t *global_work_offset,
const size_t *global_work_size,
const size_t *local_work_size,
cl_uint num_events_in_wait_list,
const cl_event *event_wait_list,
cl_event *event);
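A sketch launching the kernel over a one-dimensional range of 1024 work items in work groups of 64 and reading the result back with a blocking transfer (sizes match the buffer sketch above):

size_t global_size = 1024;   /* total number of work items, one per element */
size_t local_size  = 64;     /* work items per work group */
err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                             &global_size, &local_size, 0, NULL, NULL);
clEnqueueReadBuffer(queue, buf_c, CL_TRUE, 0, 1024 * sizeof(float), host_c,
                    0, NULL, NULL);   /* blocking read of the result */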
3.2 CPUs
3.2.1 Motivation
Although the last years have seen a myriad of emerging new architectures, ranging from
GPU accelerators to the Xeon Phi, Accelerated Processing Unit (APU) designs,4 and
various experimental platforms, the commodity x86 Central Processing Unit (CPU)
is still the main workhorse in traditional desktop computers, computing centers,
cluster computers, and supercomputers, and is therefore worth examining in more
detail in this course.
In this lesson, you will learn about the different types of parallelism that can be found
in today's CPU architectures. In addition, you will learn about tools, such as
OpenMP [?], that allow you to exploit these parallel capabilities in a comfortable
way.
4 The umbrella term APU is used for architectures that integrate CPU and GPU architectures
sharing a common memory, in most cases via a shared last-level cache. Naively put,
APUs can be thought of as a fusion of CPUs and GPUs into a single design.