The pocl Kernel Compiler

Clay Chang

CPU versus GPU
CPU
• Sophisticated Control
• Branch Prediction
• Out-of-Order Execution
• Large Cache
GPU
• Little Control
• No or Limited Branch Prediction
• Simple Execution
• Small or no cache
• Lots of ALUs

Why OpenCL for CPU
 Multi-core CPUs are already out there
 E.g. MediaTek Tri-Cluster 10-core SoC
 Mobile GPU is already busy
 ~25% occupied by system UI in Android
 Not every program runs well on a GPU
 Heavy Branch Divergence
 OpenCL makes it easy to exploit multi-core and SIMD
 Imagine: writing pthread + SIMD in assembly or intrinsics

Running OpenCL Kernels on CPU
 One thread per work-item?
 Thousands of threads being created
 Context-switching problems
 How to synchronize threads?
 How about running one work-group on a CPU thread?

Related Work
 Twin peaks: a software platform for heterogeneous computing on
general-purpose and graphics processors.
 MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core
CPUs
 Clover (http://people.freedesktop.org/~steckdenis/clover)
 Shamrock (https://git.linaro.org/gpgpu/shamrock.git)

What is pocl
 POrtable Computing Language
 An efficient implementation of the OpenCL standard which can be easily
adapted for new targets
 http://github.com/pocl/pocl
 Main developer: Pekka Jääskeläinen from Tampere University of
Technology
 Supported architectures: CPU, tce, cellspu, HSA
 Current version: 0.11

The pocl Kernel Compiler
[Figure: OpenCL Kernel Source → Clang / LLVM, at clBuildProgram(…) → Single Work-item Kernel → pocl Kernel Compiler, at clEnqueueNDRangeKernel(…, local_size, …) → Transformed Kernel]

pocl Compilation Chain
1. Compile the kernel (OpenCL C) with Clang
2. Link with target-specific built-in functions, such as sin, cos, geom_distance, etc…
3. Work-group function generation / parallel work-item loop creation
4. Backend optimizations (auto-vectorization, …) and CodeGen

Work-group Function Generation
clEnqueueNDRangeKernel(…, group_size, …)
work_group_function() {
  for (int i = 0; i < work_group_size; i++) {  /* WI-loop */
    /* Kernel (single work-item) */
  }
}
What if there are barriers?
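As a minimal sketch of the WI-loop idea, the plain C below wraps a single work-item body in a loop over the local size. The kernel (a vector add) and the names vec_add_work_item and vec_add_work_group are hypothetical and chosen for illustration only; this is not the code pocl emits.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical single work-item kernel body: c[i] = a[i] + b[i].
 * In OpenCL C, local_id would come from get_local_id(0). */
static void vec_add_work_item(size_t local_id, const float *a,
                              const float *b, float *c)
{
    c[local_id] = a[local_id] + b[local_id];
}

/* Work-group function: the WI-loop runs the kernel body once per
 * work-item. local_size stands for the value passed at enqueue time. */
void vec_add_work_group(size_t local_size, const float *a,
                        const float *b, float *c)
{
    assert(a && b && c);
    for (size_t i = 0; i < local_size; i++) {   /* WI-loop */
        vec_add_work_item(i, a, b, c);
    }
}
```

Because the loop iterations are independent, a backend auto-vectorizer is free to turn the WI-loop into SIMD code.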

Semantics of barrier Synchronization
OpenCL 1.2 rev19 p.30:
“… the work-group barrier must be encountered by
all work-items of a work-group executing the kernel
or by none at all…”
Example that breaks this rule, since only work-items with an odd tid reach the barrier:
if (tid % 2) {
  …
  barrier();
  …
}

Kernel Without barriers
• A node in a CFG is a basic block (BB)
• BB: a branchless sequence of instructions
• A BB is executed as a unit, from the first instruction to the last
• An edge in a CFG represents a branch in the control flow
• Multiple exit BBs are allowed
• The pocl Kernel Compiler generates a WI-loop around the CFG

Types of Barrier
Unconditional barriers
 A barrier that dominates the exit node
Conditional barriers
 Barriers placed inside
 if – else
 for-loop (b-loop)

Kernel with unconditional barriers
 pocl Kernel Compiler creates WI-loops
before and after the barrier
 This forms an algorithm:
Algorithm 1: Parallel region formation when the kernel
does not contain conditional barriers.
Step 1: Ensure there is an implicit barrier at the entry and
the exit nodes of the kernel function and that there is
only one exit node in the kernel function. This is a safe
starting condition as it does not affect any execution
order restrictions.
Step 2: Perform a depth-first-search traversal of the kernel
CFG. Ignore the possible back edges to avoid infinite
loops and to include the loops of the kernel in the
parallel region.
Step 3: When encountering a barrier, create a parallel
region by calling CreateSubgraph for the previously
encountered barrier and the newly found barrier.
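The effect of an unconditional barrier can be sketched in plain C: the barrier splits the kernel body into two parallel regions, and each region gets its own WI-loop. The kernel below (reversing a block through local memory) and the names reverse_work_group and MAX_LOCAL_SIZE are assumptions made for this sketch, not pocl output.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOCAL_SIZE 64   /* assumed upper bound for this sketch */

/* Work-group function for a hypothetical kernel:
 *     tmp[i] = data[i];
 *     barrier(CLK_LOCAL_MEM_FENCE);
 *     data[i] = tmp[n - 1 - i];
 * The barrier separates two parallel regions. */
void reverse_work_group(size_t local_size, int *data)
{
    int tmp[MAX_LOCAL_SIZE];   /* stands in for __local memory */
    assert(local_size <= MAX_LOCAL_SIZE);

    /* Parallel region 1: WI-loop over the code before the barrier. */
    for (size_t i = 0; i < local_size; i++)
        tmp[i] = data[i];

    /* Barrier: all work-items have finished region 1 at this point. */

    /* Parallel region 2: WI-loop over the code after the barrier. */
    for (size_t i = 0; i < local_size; i++)
        data[i] = tmp[local_size - 1 - i];
}
```

Running the first loop to completion before the second one starts is what gives the barrier its meaning on a single CPU thread.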

A CFG with Two Conditional barriers
Algorithm 2: Tail duplication for parallel region formation
in the case of conditional barriers in the kernel.
Step 1: Perform a depth-first traversal of the CFG, starting
at the entry node.
Step 2: Each time a new, unprocessed conditional barrier
is found, use CreateSubgraph to produce a sub-CFG from
that barrier to the next exit node (duplicate the tail).
Step 3: Replicate the created sub-CFG using ReplicateCFG.
In order to reduce code duplication, merge the tails from
the same unconditional barrier paths. That is, replicate
the basic blocks only after the last barrier that is
unconditionally reachable from the one at hand.
Step 4: Start the algorithm at each of the found barrier
successors.

A CFG with Two Conditional barriers – After Tail Duplication
Easier for WI-loops creation!
[Figure: the CFG after tail duplication, with four barrier nodes and two branches marked "?"]

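“Peel” the First Loop Iteration
The first work-item decides which way each branch goes. Because a barrier is reached by all work-items or by none, the remaining work-items follow the same path.
No more ambiguous branches in WI-loops!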
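A rough C sketch of a conditional barrier after tail duplication and peeling follows. It assumes a hypothetical kernel in which the barrier sits inside an if whose condition is uniform across the work-group; the function name, the flag parameter and MAX_LOCAL_SIZE are illustrative, and the structure is simplified compared with what a real kernel compiler generates.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOCAL_SIZE 64   /* assumed upper bound for this sketch */

/* Hypothetical kernel:
 *     if (flag) { tmp[i] = 2 * data[i]; barrier(); data[i] = tmp[n-1-i]; }
 *     data[i] += 1;                     // tail shared by both paths
 */
void cond_barrier_work_group(size_t n, int flag, int *data)
{
    int tmp[MAX_LOCAL_SIZE];   /* stands in for __local memory */
    assert(n >= 1 && n <= MAX_LOCAL_SIZE);

    /* Peeled first iteration: work-item 0 evaluates the branch. */
    int reached_barrier = 0;
    if (flag) {
        tmp[0] = 2 * data[0];
        reached_barrier = 1;
    } else {
        data[0] += 1;
    }

    if (reached_barrier) {
        /* The remaining work-items take the same path to the barrier. */
        for (size_t i = 1; i < n; i++)
            tmp[i] = 2 * data[i];
        /* Barrier. The region after it includes a copy of the tail. */
        for (size_t i = 0; i < n; i++) {
            data[i] = tmp[n - 1 - i];
            data[i] += 1;              /* duplicated tail */
        }
    } else {
        /* Path without a barrier: only the tail is executed. */
        for (size_t i = 1; i < n; i++)
            data[i] += 1;
    }
}
```

Once the peeled iteration has picked the path, each remaining WI-loop contains straight-line code only, which is the point of the transformation.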

Barriers in Kernel Loops
Insert implicit barriers at:
1. The end of the loop pre-header block
2. Before the loop latch branch
3. After the PhiNode region of the loop header block
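With those implicit barriers in place, a loop that contains a barrier (a b-loop) can be run once for the whole work-group, with WI-loops nested inside it. The C sketch below assumes a hypothetical kernel that rotates a block left by one element per loop iteration; rotate_work_group, steps and MAX_LOCAL_SIZE are names invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOCAL_SIZE 64   /* assumed upper bound for this sketch */

/* Hypothetical kernel:
 *     for (int s = 0; s < steps; s++) {
 *         tmp[i] = data[(i + 1) % n];
 *         barrier();
 *         data[i] = tmp[i];
 *         barrier();
 *     }
 */
void rotate_work_group(size_t n, int steps, int *data)
{
    int tmp[MAX_LOCAL_SIZE];   /* stands in for __local memory */
    assert(n >= 1 && n <= MAX_LOCAL_SIZE);

    /* The b-loop runs once for the whole work-group; its trip count
     * is the same for every work-item. */
    for (int s = 0; s < steps; s++) {
        /* Parallel region between loop header and the barrier. */
        for (size_t i = 0; i < n; i++)
            tmp[i] = data[(i + 1) % n];

        /* Barrier inside the loop body. */

        /* Parallel region between the barrier and the loop latch. */
        for (size_t i = 0; i < n; i++)
            data[i] = tmp[i];

        /* Barrier before the latch branch. */
    }
}
```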

Horizontal Inner-Loop Parallelization
More parallelization after loop interchange
blockWidth unknown until runtime
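The loop interchange can be illustrated with two C functions that compute the same result. The kernel (each work-item sums one column of a block_width by n array) and both function names are assumptions for this sketch; block_width plays the role of the runtime-only blockWidth.

```c
#include <assert.h>
#include <stddef.h>

/* Before interchange: the WI-loop is outermost and the kernel's own
 * loop over block_width (known only at run time) is innermost. */
void sum_columns_wi_outer(size_t n, size_t block_width,
                          const float *in, float *out)
{
    assert(in && out);
    for (size_t wi = 0; wi < n; wi++) {
        out[wi] = 0.0f;
        for (size_t k = 0; k < block_width; k++)
            out[wi] += in[k * n + wi];
    }
}

/* After interchange: the kernel loop is outermost and the WI-loop is
 * innermost, so the inner loop walks consecutive work-items and
 * consecutive memory, which suits SIMD vectorization. */
void sum_columns_wi_inner(size_t n, size_t block_width,
                          const float *in, float *out)
{
    assert(in && out);
    for (size_t wi = 0; wi < n; wi++)
        out[wi] = 0.0f;
    for (size_t k = 0; k < block_width; k++)
        for (size_t wi = 0; wi < n; wi++)
            out[wi] += in[k * n + wi];
}
```

The interchange is only valid here because the trip count of the kernel loop is the same for every work-item.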

Handling of Kernel Variables
1. There will be two parallel regions
2. a's lifetime is confined to the first parallel region (it's a temporary variable)
3. B's lifetime spans both parallel regions, so it is stored in a context array
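The following C sketch shows one way to picture a context array: a private variable that is live across a barrier gets one slot per work-item, while a temporary that lives inside a single region stays a plain local. The kernel and the names context_work_group, b_ctx and MAX_LOCAL_SIZE are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOCAL_SIZE 64   /* assumed upper bound for this sketch */

/* Hypothetical kernel:
 *     int a = in[i] + 1;        // temporary, first region only
 *     int b = a * 2;            // used again after the barrier
 *     tmp[i] = a;
 *     barrier();
 *     out[i] = b + tmp[n - 1 - i];
 */
void context_work_group(size_t n, const int *in, int *out)
{
    int tmp[MAX_LOCAL_SIZE];     /* stands in for __local memory */
    int b_ctx[MAX_LOCAL_SIZE];   /* context array: one b per work-item */
    assert(n <= MAX_LOCAL_SIZE);

    /* Parallel region 1 */
    for (size_t i = 0; i < n; i++) {
        int a = in[i] + 1;       /* dies inside this region */
        b_ctx[i] = a * 2;        /* saved for the next region */
        tmp[i] = a;
    }

    /* Barrier */

    /* Parallel region 2 */
    for (size_t i = 0; i < n; i++)
        out[i] = b_ctx[i] + tmp[n - 1 - i];
}
```

Keeping only the cross-region variables in context arrays limits the extra memory traffic that the loop splitting introduces.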

References
 Pekka Jääskeläinen, Carlos Sánchez de La Lama, Erik Schnetter, Kalle Raiskila, Jarmo Takala, Heikki Berg: "pocl: A Performance-Portable OpenCL Implementation" in International Journal of Parallel Programming, Springer, August 2014.
 http://github.com/pocl/pocl
