
OPENCL 2.0 FEATURES
BENJAMIN COQUELLE
MAY 2015
MAIN FEATURES

 Shared virtual memory
‒ Allows sharing complex structures between host and devices
 Pipes
 Nested parallelism
‒ Enqueue a kernel from a kernel
‒ Similar to CUDA dynamic parallelism (compute capability 3.5)
 Work group built-in functions (scan, reduce…)
 Generic address space
‒ Avoids duplicating code

2 | PRESENTATION TITLE | MAY 21, 2015 | CONFIDENTIAL


SHARED VIRTUAL MEMORY (SVM)

 clSVMAlloc – allocates a shared virtual memory buffer


‒ Specify size in bytes
‒ Specify usage information
‒ Optional alignment value

 SVM pointer can be shared by the host and OpenCL device


void* clSVMAlloc(cl_context ctx, cl_mem_flags flags, size_t size, unsigned int alignment)

 Examples
clSVMAlloc(ctx, CL_MEM_READ_WRITE, 1024 * sizeof(float), 0)
clSVMAlloc(ctx, CL_MEM_READ_ONLY, 1024 * 1024, sizeof(cl_float4))
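The alignment argument plays the same role as the alignment parameter of C11 aligned_alloc (passing 0 lets the runtime pick a default). A host-only sketch of the size/alignment contract; the helper names here are illustrative, not part of the OpenCL API:

```c
#include <stdlib.h>
#include <stdint.h>

/* Allocate 'count' floats with the requested alignment, the way an SVM
   allocation would be aligned. aligned_alloc requires the size to be a
   multiple of the alignment, so round it up first. */
static float *alloc_aligned_floats(size_t count, size_t alignment)
{
    size_t size = count * sizeof(float);
    size_t rounded = (size + alignment - 1) / alignment * alignment;
    return (float *)aligned_alloc(alignment, rounded);
}

/* Check that a pointer honors the requested alignment. */
static int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}
```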

 Free SVM buffers


- clEnqueueSVMFree, clSVMFree



SHARED VIRTUAL MEMORY (SVM)

 clSetKernelArgSVMPointer
‒ SVM pointers as kernel arguments
‒ A SVM pointer
‒ A SVM pointer + offset

// allocating SVM pointers


cl_float *src = (cl_float *)clSVMAlloc(ctx, CL_MEM_READ_ONLY, size, 0);
cl_float *dst = (cl_float *)clSVMAlloc(ctx, CL_MEM_READ_WRITE, size, 0);

// Passing SVM pointers as arguments


clSetKernelArgSVMPointer(vec_add_kernel, 0, src);
clSetKernelArgSVMPointer(vec_add_kernel, 1, dst);

// Passing SVM pointer + offset as arguments


clSetKernelArgSVMPointer(vec_add_kernel, 0, src + offset);
clSetKernelArgSVMPointer(vec_add_kernel, 1, dst + offset);



SHARED VIRTUAL MEMORY (SVM)

 clSetKernelExecInfo
‒ Passing SVM pointers in other SVM objects

// host side
typedef struct
{
    float *pB;
} my_info_t;

// allocating SVM pointers
my_info_t *pA = (my_info_t *)clSVMAlloc(ctx, CL_MEM_READ_ONLY, sizeof(my_info_t), 0);
pA->pB = (cl_float *)clSVMAlloc(ctx, CL_MEM_READ_WRITE, size, 0);

// Passing SVM pointers
clSetKernelArgSVMPointer(my_kernel, 0, pA);
clSetKernelExecInfo(my_kernel, CL_KERNEL_EXEC_INFO_SVM_PTRS, 1 * sizeof(void *), &pA->pB);

// device side
kernel void my_kernel(global my_info_t *pA, …)
{
    do_stuff(pA->pB, …);
}



SVM
BINARY TREE EXAMPLE

typedef struct nodeStruct


{
int value;
struct nodeStruct* left;
struct nodeStruct* right;
} node;

svmTreeBuf = clSVMAlloc(context,
CL_MEM_READ_WRITE,
numNodes*sizeof(node),
0);
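What makes the sample work is that the left/right pointers stay valid on both host and device, because every node lives inside the one shared allocation. A host-only sketch of the same idea, with the SVM buffer replaced by a plain array (the helper names are illustrative, not from the sample):

```c
#include <stdlib.h>

typedef struct nodeStruct {
    int value;
    struct nodeStruct *left;
    struct nodeStruct *right;
} node;

/* Build an ordered binary tree inside one contiguous pool of nodes,
   standing in for the buffer returned by clSVMAlloc. */
static node *build_tree(node *pool, const int *values, int numNodes)
{
    node *root = NULL;
    for (int i = 0; i < numNodes; i++) {
        node *n = &pool[i];
        n->value = values[i];
        n->left = n->right = NULL;
        if (!root) { root = n; continue; }
        node *cur = root;
        for (;;) {
            node **next = (values[i] < cur->value) ? &cur->left : &cur->right;
            if (*next == NULL) { *next = n; break; }
            cur = *next;
        }
    }
    return root;
}

/* The same pointer-chasing search a device kernel could run on the
   shared pointers without any translation. */
static int tree_contains(const node *root, int key)
{
    while (root) {
        if (root->value == key) return 1;
        root = (key < root->value) ? root->left : root->right;
    }
    return 0;
}
```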



SHARED VIRTUAL MEMORY (SVM)

 Three types of sharing


‒ Coarse-grained buffer sharing
‒ Fine-grained buffer sharing
‒ System sharing



SHARED VIRTUAL MEMORY (SVM)
COARSE & FINE-GRAINED BUFFER SHARING

 SVM buffers allocated using clSVMAlloc


 Coarse grained sharing
- Memory consistency only guaranteed at synchronization points
- Host still needs to use synchronization APIs to update data
- clEnqueueSVMMap / clEnqueueSVMUnmap or event callbacks
- Memory consistency is at a buffer level
- Allows sharing of pointers between host and OpenCL device
 Fine grained sharing
- No synchronization needed between host and OpenCL device
- Host and device can update data in buffer concurrently
- Memory consistency using C11 atomics and synchronization operations
- Optional Feature



SHARED VIRTUAL MEMORY (SVM)
SYSTEM SHARING

 Can directly use any pointer allocated on the host


‒ No OpenCL APIs needed to allocate SVM buffers. Just use malloc/new
 Both host and OpenCL device can update data using C11 atomics and synchronization functions
 Optional Feature



SHARED VIRTUAL MEMORY (SVM)
COARSE GRAIN BUFFER SVM VS CL1.2
OpenCL 2.0 coarse-grain SVM:

//by default the buffer is allocated as coarse grain
float* Buffer = (float*)clSVMAlloc(ctx, CL_MEM_READ_WRITE, 1024 * sizeof(float), 0);

//map and fill the buffer from host
status = clEnqueueSVMMap(commandQueue, CL_TRUE, CL_MAP_WRITE, Buffer, 1024*sizeof(float), 0, NULL, NULL);
for (int i=0; i<1024; i++)
    Buffer[i] = ….;
//data transfer will happen here
clEnqueueSVMUnmap(commandQueue, Buffer, 0, NULL, NULL);

// use your SVM buffer in your OpenCL kernel
clSetKernelArgSVMPointer(my_kernel, 0, Buffer);
clEnqueueNDRangeKernel(queue, my_kernel, …)

OpenCL 1.2:

//create device buffer
cl_mem DeviceBuffer = clCreateBuffer(ctx, CL_MEM_READ_WRITE, 1024*sizeof(float), NULL, &err);

//create host buffer
float* hostBuffer = new float[1024];
for (int i=0; i<1024; i++)
    hostBuffer[i] = ….;
//data transfer happens here
clEnqueueWriteBuffer(queue, DeviceBuffer, …, hostBuffer);

//use our device buffer on device
clSetKernelArg(my_kernel, 0, sizeof(cl_mem), &DeviceBuffer);
clEnqueueNDRangeKernel(queue, my_kernel, …)



SHARED VIRTUAL MEMORY (SVM)
FINE GRAIN BUFFER SVM VS CL1.2
OpenCL 2.0 fine-grain SVM:

//CL_MEM_SVM_FINE_GRAIN_BUFFER means host and device can
//concurrently access the buffer
float* Buffer = (float*)clSVMAlloc(ctx, CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
                                   1024 * sizeof(float), 0);

//fill the buffer from host
for (int i=0; i<1024; i++)
    Buffer[i] = ….;

// use your SVM buffer in your OpenCL kernel on device directly
clSetKernelArgSVMPointer(my_kernel, 0, Buffer);
clEnqueueNDRangeKernel(queue, my_kernel, …)

OpenCL 1.2:

//create device buffer
cl_mem DeviceBuffer = clCreateBuffer(ctx, CL_MEM_READ_WRITE, 1024*sizeof(float), NULL, &err);

//create host buffer
float* hostBuffer = new float[1024];
for (int i=0; i<1024; i++)
    hostBuffer[i] = ….;
//data transfer happens here
clEnqueueWriteBuffer(queue, DeviceBuffer, …, hostBuffer);

//use our device buffer on device
clSetKernelArg(my_kernel, 0, sizeof(cl_mem), &DeviceBuffer);
clEnqueueNDRangeKernel(queue, my_kernel, …)



SHARED VIRTUAL MEMORY (SVM)
FINE GRAIN SYSTEM

//no more OpenCL API needed to allocate data, simply use your favorite memory allocation function: new, malloc…
float* Buffer = (float*)malloc(1024*sizeof(float));

//fill the buffer from host
for (int i=0; i<1024; i++)
    Buffer[i] = ….;

// use your SVM buffer in your OpenCL kernel on device directly
clSetKernelArgSVMPointer(my_kernel, 0, Buffer);

clEnqueueNDRangeKernel(queue, my_kernel,…)



SHARED VIRTUAL MEMORY (SVM)

 https://www.khronos.org/registry/cl/specs/opencl-2.0-openclc.pdf
 http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/
‒ Samples : GlobalMemoryBandwidth, DeviceEnqueueBFS, SVMBinaryTreeSearch, RangeMinimumQuery,
SVMAtomicsBinaryTreeInsert (APU only), FineGrainSVM (APU only)



PIPES

 Act like a queue object (FIFO) between kernels.
 Pipe objects are created on the host…
‒ clCreatePipe(cl_context ctx, cl_mem_flags flags, cl_uint packet_size, cl_uint max_packets, cl_pipe_properties*, cl_int*)
 …but they cannot be read or written from the host
‒ The only valid memory flag for clCreatePipe is CL_MEM_HOST_NO_ACCESS
 Pipes can either be read_only or write_only within a kernel
 Pipes can only come from kernel/function arguments
‒ Pipes can’t be created locally in a function/kernel
 Pipes can only be used through built-in CL2.0 functions
‒ read_pipe(pipe p, reserve_id_t reserve_id, uint index, gentype *ptr): reads 1 packet from pipe p into ptr
‒ write_pipe(pipe p, reserve_id_t reserve_id, uint index, gentype *ptr): writes 1 packet from ptr into pipe p
 Pipes don’t define any ordering for read/write operations amongst all the running threads. It is up to
the developers to control this if needed
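The reserve/commit protocol can be pictured as an ordinary ring buffer of fixed-size packets. A host-side sketch of the semantics, purely for illustration (this is not the OpenCL pipe API):

```c
#include <string.h>

#define PIPE_CAPACITY 8

typedef struct {
    int packets[PIPE_CAPACITY];
    int head;  /* next packet to read */
    int tail;  /* next slot to write  */
    int count;
} fake_pipe;

/* Mimics reserve_write_pipe + write_pipe + commit_write_pipe:
   returns -1 when there is no room, like an invalid reservation. */
static int fake_pipe_write(fake_pipe *p, int packet)
{
    if (p->count == PIPE_CAPACITY) return -1;
    p->packets[p->tail] = packet;
    p->tail = (p->tail + 1) % PIPE_CAPACITY;
    p->count++;
    return 0;
}

/* Mimics reserve_read_pipe + read_pipe + commit_read_pipe. */
static int fake_pipe_read(fake_pipe *p, int *packet)
{
    if (p->count == 0) return -1;
    *packet = p->packets[p->head];
    p->head = (p->head + 1) % PIPE_CAPACITY;
    p->count--;
    return 0;
}
```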
PIPES

__kernel void pipeWrite(__global int *src, __write_only pipe int out_pipe)


{
int gid = get_global_id(0);
reserve_id_t res_id;
res_id = reserve_write_pipe (out_pipe, 1);

if( is_valid_reserve_id (res_id))


{
if( write_pipe (out_pipe, res_id, 0, &src[gid]) != 0)
{
return;
}
commit_write_pipe (out_pipe, res_id);
}
}



PIPES
REFERENCES

 https://www.khronos.org/registry/cl/specs/opencl-2.0-openclc.pdf
 http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/
‒ Samples simplePipe and DeviceEnqueueBFS



NESTED PARALLELISM

 In OpenCL 1.2 only the host can enqueue kernels


 Iterative algorithm example
‒ kernel A queues kernel B
‒ kernel B decides to queue kernel A again
 A very simple but extremely common nested parallelism example



NESTED PARALLELISM

 Allows a device to queue kernels to itself
‒ Allows work-items to queue kernels
 Uses a similar approach to how the host queues commands
‒ Queues and Events
‒ Event and Profiling functions



NESTED PARALLELISM

 Use clang Blocks to describe kernel to queue

kernel void my_func(global int *a, global int *b)


{

void (^my_block_A)(void) =
^
{
size_t id = get_global_id(0);
b[id] += a[id];
};

enqueue_kernel(get_default_queue(),
CLK_ENQUEUE_FLAGS_WAIT_KERNEL,
ndrange_1D(…),
my_block_A);
}
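Device-side enqueue behaves like a task queue in which a running task may push a follow-up task, much as my_block_A could enqueue another block. A host-only sketch of that pattern; the types and names here are hypothetical, not OpenCL constructs:

```c
#include <stddef.h>

#define MAX_TASKS 64

typedef struct task_queue task_queue;
typedef void (*task_fn)(task_queue *q, int arg);

struct task_queue {
    task_fn fns[MAX_TASKS];
    int args[MAX_TASKS];
    int head, tail;
};

/* Mimics enqueue_kernel: a task may call this from inside its own body. */
static int enqueue_task(task_queue *q, task_fn fn, int arg)
{
    if (q->tail == MAX_TASKS) return -1;
    q->fns[q->tail] = fn;
    q->args[q->tail] = arg;
    q->tail++;
    return 0;
}

/* Drain the queue; tasks enqueued while draining also run. */
static void run_all(task_queue *q)
{
    while (q->head < q->tail) {
        task_fn fn = q->fns[q->head];
        int arg = q->args[q->head];
        q->head++;
        fn(q, arg);
    }
}

static int g_sum;

/* A task that re-enqueues itself until arg reaches 0, the same shape as
   "kernel A decides to queue kernel A again" in the iterative example. */
static void countdown_task(task_queue *q, int arg)
{
    g_sum += arg;
    if (arg > 0)
        enqueue_task(q, countdown_task, arg - 1);
}
```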



NESTED PARALLELISM
2 APIs

int enqueue_kernel(queue_t queue,


kernel_enqueue_flags_t flags,
const ndrange_t ndrange,
void (^block)())

int enqueue_kernel(queue_t queue,


kernel_enqueue_flags_t flags,
const ndrange_t ndrange,
uint num_events_in_wait_list,
const clk_event_t *event_wait_list,
clk_event_t *event_ret,
void (^block)())



NESTED PARALLELISM
QUEUING KERNELS WITH POINTERS TO LOCAL ADDRESS SPACE AS ARGUMENTS

int enqueue_kernel(queue_t queue,


kernel_enqueue_flags_t flags,
const ndrange_t ndrange,
void (^block)(local void *, …), uint size0, …)

int enqueue_kernel(queue_t queue,


kernel_enqueue_flags_t flags,
const ndrange_t ndrange,
uint num_events_in_wait_list,
const clk_event_t *event_wait_list,
clk_event_t *event_ret,
void (^block)(local void *, …), uint size0, …)



NESTED PARALLELISM

void my_func_local_arg (global int *a, local int *lptr, …) { … }

kernel void my_func(global int *a, …)


{

uint local_mem_size = compute_local_mem_size(…);
enqueue_kernel(get_default_queue(),
CLK_ENQUEUE_FLAGS_WAIT_KERNEL,
ndrange_1D(…),
^(local int *p){my_func_local_arg(a, p, …);},
local_mem_size);
}



NESTED PARALLELISM

 Specify when a child kernel can begin execution (pick one)


‒ Don’t wait on parent
‒ Wait for kernel to finish execution
‒ Wait for work-group to finish execution
 A kernel’s execution status is complete
‒ when it has finished execution
‒ and all its child kernels have finished execution



NESTED PARALLELISM

 Other Commands
‒ Queue a marker
 Query Functions
‒ Get workgroup size for a block
 Event Functions
‒ Retain & Release events
‒ Create user event
‒ Set user event status
‒ Capture event profiling info
 Helper Functions
‒ Get default queue
‒ Return a 1D, 2D or 3D ND-range descriptor



NESTED PARALLELISM

 https://www.khronos.org/registry/cl/specs/opencl-2.0-openclc.pdf
 http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/
‒ Samples DeviceEnqueueBFS, ExtractPrimes, RegionGrowingSegmentation, BinarySearchDeviceSideEnqueue



WORK GROUP FUNCTION

 Scan
‒ work_group_scan_exclusive<op>
‒ work_group_scan_inclusive<op>
 Reduce
‒ work_group_reduce<op>
 Voting functions
‒ work_group_all
‒ work_group_any
 Broadcast
‒ work_group_broadcast



WORK GROUP FUNCTION
PREFIX SUM

__kernel void group_scan_kernel(__global float *in, __global float *out)


{
float in_data;
int i = get_global_id(0);
in_data = in[i];
out[i] = work_group_scan_inclusive_add(in_data);
}

 Once we have the scan for each work group, we need to add the last value of each group to every
element of the next group
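Serially, the two phases look like this: an inclusive scan inside each group, then a pass that adds the last value of the previous group onto the next one. The function names below are illustrative:

```c
#include <stddef.h>

/* Phase 1: inclusive scan within each work-group of size group_size,
   mirroring what work_group_scan_inclusive_add does per group. */
static void group_inclusive_scan(float *data, size_t n, size_t group_size)
{
    for (size_t start = 0; start < n; start += group_size) {
        size_t end = (start + group_size < n) ? start + group_size : n;
        for (size_t i = start + 1; i < end; i++)
            data[i] += data[i - 1];
    }
}

/* Phase 2: walk the groups in order and add the (already propagated)
   last value of the previous group to every element of the current one,
   mirroring the global pass across work-groups. */
static void propagate_group_sums(float *data, size_t n, size_t group_size)
{
    for (size_t start = group_size; start < n; start += group_size) {
        size_t end = (start + group_size < n) ? start + group_size : n;
        float carry = data[start - 1];
        for (size_t i = start; i < end; i++)
            data[i] += carry;
    }
}
```

After both passes the array holds the full inclusive prefix sum, because each group's carry already contains the totals of all groups before it.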



WORK GROUP FUNCTION
PREFIX SUM

 This operation needs to be repeated



WORK GROUP FUNCTION
PREFIX SUM

__kernel void global_scan_kernel(__global float *out, unsigned int stage)


{

/* find the element to be added */
l = (grid >> stage);
prev_gr = l*(vlen << 1) + vlen - 1;
prev_el = prev_gr*szgr + szgr - 1;
if (lid == 0)
add_elem = out[prev_el];

work_group_barrier(CLK_GLOBAL_MEM_FENCE|CLK_LOCAL_MEM_FENCE);
add_elem = work_group_broadcast(add_elem,0);

/* find the array to which the element to be added */


curr_gr = prev_gr + 1 + (grid % vlen);
curr_el = curr_gr*szgr + lid;
out[curr_el] += add_elem;
}



WORK GROUP FUNCTION

 https://www.khronos.org/registry/cl/specs/opencl-2.0-openclc.pdf
 http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/
‒ Samples DeviceEnqueueBFS, BuiltInScan, RegionGrowingSegmentation, ExtractPrimes



NESTED PARALLELISM + PIPES + WORK-GROUP FUNCTIONS
BREADTH FIRST SEARCH

 BFS is a strategy for searching a graph. It begins at the root node and inspects all the neighbouring
nodes; then, for each of those nodes, it inspects their neighbours, and so on.

[Figure: example tree with root 1, levels 2 3; 4 5 6 7; 8 9]

 The classic serial algorithm uses a queue (FIFO) to store the untreated nodes of the graph. Once a node
is visited, it is popped from the queue, and its neighbour nodes are added to the queue.

[Figure: successive queue states as nodes are visited: 1 → 2 3 → 3 4 5 → 4 5 6 7 → 5 6 7 …]
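The serial algorithm can be sketched with a plain array as the FIFO; the CSR inputs below mirror the d_rowPtr/d_colIndex arguments used later by the sample kernel. This host-only version is an illustration, not the sample's code:

```c
#define MAX_NODES 16
#define BFS_INF   -1

/* Serial BFS over a CSR graph: dist[i] becomes the level of node i.
   rowPtr has numNodes+1 entries; colIndex lists each node's neighbours. */
static void bfs_serial(const int *rowPtr, const int *colIndex,
                       int numNodes, int root, int *dist)
{
    int queue[MAX_NODES];
    int head = 0, tail = 0;
    for (int i = 0; i < numNodes; i++) dist[i] = BFS_INF;
    dist[root] = 0;
    queue[tail++] = root;
    while (head < tail) {
        int node = queue[head++];               /* pop the next untreated node */
        for (int e = rowPtr[node]; e < rowPtr[node + 1]; e++) {
            int child = colIndex[e];
            if (dist[child] == BFS_INF) {       /* not visited yet */
                dist[child] = dist[node] + 1;
                queue[tail++] = child;          /* push for the next level */
            }
        }
    }
}
```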



NESTED PARALLELISM + PIPES + WORK-GROUP FUNCTIONS
BREADTH FIRST SEARCH

 We will use 2 OpenCL pipe objects to simulate our queue


‒ Nodes of the current level (read pipe)
‒ Nodes of the next level (write pipe)
 We will parallelize the visit of a given level
‒ Each kernel launch will only work on a given level
‒ Each thread will treat one node
 We use the nested parallelism to enqueue a new kernel to work on the next level
[Figure: example tree (levels 1; 2 3; 4 5 6 7; 8 9) with the pipe states per level: 1 → 3 2 → 7 5 4 6 → 8 9]
NESTED PARALLELISM + PIPES + WORK-GROUP FUNCTIONS
READING CURRENT LEVEL, ONE NODE PER WORK-ITEM

__kernel
void deviceEnqueueBFSKernel(__global uint *d_rowPtr, __global uint *d_colIndex, __global uint *d_dist,
__read_only pipe uint d_vertexFrontier_inPipe,
__write_only pipe uint d_edgeFrontier_outPipe, uint parentNodeLevel )
{

atomic_store_explicit(&g_totalNeighborsCount,0,memory_order_seq_cst, memory_scope_device);
// read current level's vertices to be visited (/* reading from pipe */)
res_read_id = reserve_read_pipe(d_vertexFrontier_inPipe, 1);
if(is_valid_reserve_id(res_read_id))
{
if(read_pipe(d_vertexFrontier_inPipe, res_read_id, 0, &node) != 0)
{
return;
}
commit_read_pipe(d_vertexFrontier_inPipe, res_read_id);
}



NESTED PARALLELISM + PIPES + WORK-GROUP FUNCTIONS
WRITING CHILD NODE INTO THE SECOND PIPE

// we first checked whether node is visited and got the number of child
// expand these neighbours for the next level, only when it has not been visited (/* Writing into Pipe */)
for(int i = 0; i < numChildPerNode; i++)
{
childNode = getChildNode(d_colIndex, offset+i);
if(d_dist[childNode] == INFINITY)
{
res_write_id = reserve_write_pipe(d_edgeFrontier_outPipe, 1);
if(is_valid_reserve_id(res_write_id))
{
if(write_pipe(d_edgeFrontier_outPipe, res_write_id, 0, &childNode) != 0)
{
return;
}
commit_write_pipe(d_edgeFrontier_outPipe, res_write_id);
}
tmpNeighborsCount++;
}



NESTED PARALLELISM + PIPES + WORK-GROUP FUNCTIONS
COMPUTING THE NUMBER OF CHILD NODES AT THE NEXT LEVEL

//summing number of Neighbours within work group


wgCnt = work_group_reduce_add(tmpNeighborsCount);
}
//summing total number of Neighbours across all work-groups
if(lid == 0)
{
atomic_fetch_add_explicit(&g_totalNeighborsCount, wgCnt, memory_order_seq_cst, memory_scope_device);
}



NESTED PARALLELISM + PIPES + WORK-GROUP FUNCTIONS
RELAUNCH THE NEW KERNEL

if(gid == 0) //only one work item will enqueue a new kernel


{
globalThreads = 1;
currentLevel = d_dist[node];
queue_t q = get_default_queue();
ndrange_t ndrange1 = ndrange_1D(globalThreads);

void (^bfsDummy_device_enqueue_wrapper_blk)(void) = ^{deviceEnqueueDummyKernel(…


d_edgeFrontier_outPipe,
d_vertexFrontier_inPipe,
currentLevel );};
int err_ret = enqueue_kernel(q, CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange1, bfsDummy_device_enqueue_wrapper_blk);

if(err_ret != 0)
{
return;
}



NESTED PARALLELISM + PIPES + WORK-GROUP FUNCTIONS
LAUNCH MAIN KERNEL WITH THE NUMBER OF CHILD NODES

void deviceEnqueueDummyKernel(…)
{
uint globalThreads = atomic_load_explicit(&g_totalNeighborsCount, memory_order_seq_cst, memory_scope_device);

if(globalThreads == 0) // don't need to launch kernel if there is no child


return;

queue_t q = get_default_queue();
ndrange_t ndrange1 = ndrange_1D(globalThreads);

void (^bfs_device_enqueue_wrapper_blk)(void) = ^{ deviceEnqueueBFSKernel (d_rowPtr,


d_colIndex,
d_dist,
d_edgeFrontier_outPipe,
d_vertexFrontier_inPipe,
parentNodeLevel );};
int err_ret = enqueue_kernel (q, CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange1, bfs_device_enqueue_wrapper_blk);



GENERIC ADDRESS SPACE

 In OpenCL 1.2, function arguments that are a pointer to a type must declare the address space of the
memory region pointed to
 Many examples where developers want to use the same code but with pointers on different address
spaces
void
my_func (local int *ptr, …)
{
…
foo(ptr, …);
…
}

void
my_func (global int *ptr, …)
{
…
foo(ptr, …);
…
}

 Above example is not supported in OpenCL 1.2


 Results in developers having to duplicate code, which is prone to errors



GENERIC ADDRESS SPACE

 OpenCL 2.0 no longer requires an address space qualifier for arguments to a function that are a
pointer to a type
‒ Except for kernel functions
 Generic address space is assumed if no address space is specified
 Makes it really easy to write functions without having to worry about which address space
arguments point to

void
my_func_generic_pointer (int *ptr, …)
{
…
}

kernel void
foo(global int *g_ptr, local int *l_ptr, …)
{
…
my_func_generic_pointer (g_ptr, …);
my_func_generic_pointer (l_ptr, …);
}



GENERIC ADDRESS SPACE

 https://www.khronos.org/registry/cl/specs/opencl-2.0-openclc.pdf
 http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/
‒ Sample : SimpleGenericAddressSpace

