The pocl Kernel Compiler
Clay Chang
CPU versus GPU
CPU
• Sophisticated Control
• Branch Prediction
• Out-of-Order Execution
• Large Cache
GPU
• Little Control
• No or Limited Branch Prediction
• Simple Execution
• Small or no cache
• Lots of ALUs
OpenCL as the Portable API
Why OpenCL for CPU
• Multi-core CPUs are out there
  • E.g. MediaTek Tri-Cluster 10-core SoC
• The mobile GPU is already busy
  • ~25% occupied by the system UI in Android
• Not every program runs well on a GPU
  • Heavy branch divergence
• OpenCL makes it easy to exploit multi-core and SIMD
  • Imagine: writing pthread + SIMD in assembly or intrinsics
Running OpenCL Kernels on CPU
• One thread per work-item?
  • Thousands of threads being created
  • Context-switching problems
  • How to synchronize threads?
• How about running one work-group on a CPU thread?
Related Work
• Twin Peaks: a software platform for heterogeneous computing on general-purpose and graphics processors
• MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core CPUs
• Clover (http://people.freedesktop.org/~steckdenis/clover)
• Shamrock (https://git.linaro.org/gpgpu/shamrock.git)
What is pocl
• POrtable Computing Language
• An efficient implementation of the OpenCL standard which can be easily adapted for new targets
• http://github.com/pocl/pocl
• Main developer: Pekka Jääskeläinen from Tampere University of Technology
• Supported architectures: CPU, tce, cellspu, HSA
• Current version: 0.11
Components in pocl
The pocl Kernel Compiler
[Figure: OpenCL kernel source → Clang / LLVM → single work-item kernel → pocl kernel compiler → transformed kernel. Labels: clBuildProgram(…), clEnqueueNDRangeKernel(…, local_size, …)]
pocl Compilation Chain
1. Compile the kernel (OpenCL C) with Clang
2. Link with target-specific built-in functions, such as sin, cos, geom_distance, etc.
3. Work-group function generation / parallel work-item loop creation
4. Backend optimizations (auto-vectorization, …) and code generation
Work-group Function Generation
clEnqueueNDRangeKernel(…, group_size, …)
work_group_function() {
    for (int i = 0; i < work_group_size; i++) {  /* WI-loop */
        /* Kernel (single work-item) */
    }
}
What if there are barriers?
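Setting barriers aside for the moment, the work-group function can be sketched in plain C. This is a minimal illustration, not pocl's generated code: `kernel_body`, `work_group_function`, and the explicit `local_id` parameter are hypothetical names standing in for the single work-item kernel and the generated wrapper.

```c
#include <stddef.h>

/* Hypothetical single work-item kernel: doubles one element.
   In OpenCL C the index would come from get_local_id(0). */
static void kernel_body(size_t local_id, const int *in, int *out)
{
    out[local_id] = 2 * in[local_id];
}

/* Sketch of a generated work-group function: the WI-loop runs every
   work-item of one work-group sequentially on a single CPU thread. */
void work_group_function(size_t work_group_size, const int *in, int *out)
{
    for (size_t i = 0; i < work_group_size; i++) {
        kernel_body(i, in, out);
    }
}
```

Because the loop trip count is the local size passed at enqueue time, one call handles a whole work-group without creating a thread per work-item.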
Semantics of barrier Synchronization
OpenCL 1.2 rev19 p.30:
“… the work-group barrier must be encountered by
all work-items of a work-group executing the kernel
or by none at all…”
Example that violates this rule (only work-items with an odd tid reach the barrier):
if (tid % 2) {
    …
    barrier();
    …
}
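The "all or none" rule can be made concrete with a small C check. This is only an illustration of the rule, not something pocl or the OpenCL runtime provides: `branch_predicate`, `odd_tid`, `always`, and `barrier_is_uniform` are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical predicate type: does work-item `tid` enter the branch
   that contains the barrier? */
typedef bool (*branch_predicate)(size_t tid);

/* The divergent condition from the slide: only odd work-items enter. */
bool odd_tid(size_t tid) { return (tid % 2) != 0; }

/* A uniform condition: every work-item enters. */
bool always(size_t tid) { (void)tid; return true; }

/* Returns true if the barrier is reached by all work-items or by none,
   which is what the quoted specification text requires. */
bool barrier_is_uniform(size_t work_group_size, branch_predicate enters)
{
    size_t reached = 0;
    for (size_t tid = 0; tid < work_group_size; tid++) {
        if (enters(tid)) {
            reached++;
        }
    }
    return reached == 0 || reached == work_group_size;
}
```

For a work-group of 8, `barrier_is_uniform(8, odd_tid)` is false, so a barrier guarded by `tid % 2` is not valid, while `barrier_is_uniform(8, always)` is true.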
Kernel Without barriers
• A node in a CFG is a basic block (BB)
  • BB: a branchless sequence of instructions
  • A BB is executed as an entity, from the first instruction to the last
• An edge in a CFG represents a branch in the control flow
  • Multiple exit BBs are allowed
• The pocl kernel compiler generates a WI-loop around the CFG
Types of Barrier
Unconditional barriers
• A barrier that dominates the exit node
Conditional barriers
• Barriers placed inside
  • if-else
  • for-loop (b-loop)
Kernel with unconditional barriers
• The pocl kernel compiler creates WI-loops before and after the barrier
• This forms an algorithm:
Algorithm 1: Parallel region formation when the kernel does not contain conditional barriers.
Step 1: Ensure there is an implicit barrier at the entry and the exit nodes of the kernel function and that there is only one exit node in the kernel function. This is a safe starting condition as it does not affect any execution order restrictions.
Step 2: Perform a depth-first-search traversal of the kernel CFG. Ignore the possible back edges to avoid infinite loops and to include the loops of the kernel in the parallel region.
Step 3: When encountering a barrier, create a parallel region by calling CreateSubgraph for the previously encountered barrier and the newly found barrier.
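The effect of Algorithm 1 on a kernel with one unconditional barrier can be sketched at source level. pocl performs this on the LLVM IR control flow graph; the C below only shows the resulting shape, and the kernel, `wg_two_regions`, and its parameters are hypothetical.

```c
#include <stddef.h>

/* Hypothetical kernel, in OpenCL C terms:
       tmp[lid] = in[lid] + 1;
       barrier(CLK_LOCAL_MEM_FENCE);
       out[lid] = tmp[(lid + 1) % n];
   The barrier splits the body into two parallel regions. */
void wg_two_regions(size_t n, const int *in, int *tmp, int *out)
{
    /* Parallel region 1: implicit entry barrier to the explicit barrier. */
    for (size_t lid = 0; lid < n; lid++) {
        tmp[lid] = in[lid] + 1;
    }
    /* The barrier becomes the boundary between the two WI-loops:
       every work-item has finished region 1 before region 2 starts. */
    /* Parallel region 2: explicit barrier to the implicit exit barrier. */
    for (size_t lid = 0; lid < n; lid++) {
        out[lid] = tmp[(lid + 1) % n];
    }
}
```

Reading a neighbour's `tmp` slot in the second loop is safe only because the first loop has completed for all work-items, which is the ordering the barrier demands.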
A CFG with Two Conditional barriers
Algorithm 2: Tail duplication for parallel region formation in the case of conditional barriers in the kernel.
Step 1: Perform a depth-first traversal of the CFG, starting at the entry node.
Step 2: Each time a new, unprocessed conditional barrier is found, use CreateSubgraph to produce a sub-CFG from that barrier to the next exit node (duplicate the tail).
Step 3: Replicate the created sub-CFG using ReplicateCFG. In order to reduce code duplication, merge the tails from the same unconditional barrier paths. That is, replicate the basic blocks only after the last barrier that is unconditionally reachable from the one at hand.
Step 4: Start the algorithm at each of the found barrier successors.
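A source-level sketch may help picture what tail duplication achieves. It assumes the simplest case, where the branch condition is a kernel argument and therefore the same for every work-item; pocl itself works on the CFG and does not require that. The kernel, `wg_conditional_barrier`, and its parameters are hypothetical.

```c
#include <stddef.h>

/* Hypothetical kernel, in OpenCL C terms:
       if (cond) {
           x[lid] += 1;
           barrier(CLK_LOCAL_MEM_FENCE);
           y[lid] = x[(lid + 1) % n];
       }
       out[lid] = x[lid] + y[lid];      // the tail
   After tail duplication each path to the exit has its own copy of the
   tail, so every parallel region has one entry and one exit. */
void wg_conditional_barrier(size_t n, int cond, int *x, int *y, int *out)
{
    if (cond) {
        /* Region: entry to the conditional barrier. */
        for (size_t lid = 0; lid < n; lid++) {
            x[lid] += 1;
        }
        /* Region: conditional barrier to exit, with a duplicated tail. */
        for (size_t lid = 0; lid < n; lid++) {
            y[lid] = x[(lid + 1) % n];
            out[lid] = x[lid] + y[lid];
        }
    } else {
        /* Region: entry to exit without a barrier, with its own tail. */
        for (size_t lid = 0; lid < n; lid++) {
            out[lid] = x[lid] + y[lid];
        }
    }
}
```

The statement `out[lid] = x[lid] + y[lid]` appears twice; that duplication is the price paid for WI-loops that never have to branch between regions.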
A CFG with Two Conditional barriers: After Tail Duplication
Easier for WI-loop creation!
"Peel" the First Loop Iteration
No more ambiguous branches in WI-loops!
Barriers in Kernel Loops
Insert implicit barriers:
1. At the end of the loop pre-header block
2. Before the loop latch branch
3. After the PhiNode region of the loop header block
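With those implicit barriers in place, the kernel loop (b-loop) stays outside and each stretch between barriers becomes its own WI-loop. The C sketch below shows that shape for a hypothetical kernel that rotates a buffer one position per step; `wg_barrier_loop` and its parameters are illustrative names, not pocl output.

```c
#include <stddef.h>

/* Hypothetical kernel, in OpenCL C terms:
       for (int k = 0; k < steps; k++) {
           tmp[lid] = data[lid];
           barrier(CLK_LOCAL_MEM_FENCE);
           data[lid] = tmp[(lid + 1) % n];
       }
   The loop counter can be shared by all work-items because the barrier
   rule forces every work-item to run the same number of iterations. */
void wg_barrier_loop(size_t n, int steps, int *data, int *tmp)
{
    for (int k = 0; k < steps; k++) {            /* b-loop, kept outside */
        /* Region: implicit barrier in the header to the explicit barrier. */
        for (size_t lid = 0; lid < n; lid++) {
            tmp[lid] = data[lid];
        }
        /* Region: explicit barrier to the implicit barrier at the latch. */
        for (size_t lid = 0; lid < n; lid++) {
            data[lid] = tmp[(lid + 1) % n];
        }
    }
}
```

The implicit barrier before the latch is what keeps a fast work-item from overwriting `tmp` for the next iteration while a slower one is still reading it.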
Horizontal Inner-Loop Parallelization
More parallelization after loop interchange
blockWidth unknown until runtime
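The loop interchange named on this slide can be pictured as follows. This is a sketch under assumptions: the kernel, the data layout, and the name `wg_column_sums` are hypothetical, and `block_width` plays the role of the run-time value blockWidth.

```c
#include <stddef.h>

/* Hypothetical kernel, in OpenCL C terms:
       int acc = 0;
       for (size_t k = 0; k < blockWidth; k++)
           acc += in[k * n + lid];
       out[lid] = acc;
   Wrapping the whole body in a WI-loop would leave the kernel loop, whose
   trip count is unknown until run time, as the innermost loop. After
   interchange the WI-loop is innermost: its trip count is the local
   size and consecutive work-items touch consecutive addresses. */
void wg_column_sums(size_t n, size_t block_width, const int *in, int *out)
{
    for (size_t lid = 0; lid < n; lid++) {
        out[lid] = 0;                             /* per-work-item accumulator */
    }
    for (size_t k = 0; k < block_width; k++) {    /* kernel loop, now outermost */
        for (size_t lid = 0; lid < n; lid++) {    /* WI-loop, now innermost */
            out[lid] += in[k * n + lid];
        }
    }
}
```

The inner loop has no dependence between iterations, which gives an auto-vectorizer a straightforward candidate.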
Handling of Kernel Variables
1. There will be two parallel regions
2. a's lifetime is only in the first parallel region (it's a temporary variable)
3. b's lifetime spans both parallel regions
Context Array
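A context array gives each work-item its own slot for a variable that must survive from one parallel region to the next. The sketch below assumes a hypothetical kernel with a temporary `a` and a longer-lived `b`; `MAX_WG_SIZE`, `wg_context_array`, and `b_ctx` are illustrative names, not pocl identifiers.

```c
#include <stddef.h>

#define MAX_WG_SIZE 64   /* assumed upper bound on the local size */

/* Hypothetical kernel, in OpenCL C terms:
       int a = in[lid] * 2;
       int b = a + 1;
       barrier(CLK_LOCAL_MEM_FENCE);
       out[lid] = b + in[(lid + 1) % n];
   `a` dies before the barrier, so it stays an ordinary scalar.
   `b` is live across the barrier, so it is expanded into an array. */
void wg_context_array(size_t n, const int *in, int *out)
{
    int b_ctx[MAX_WG_SIZE];                  /* context array for b */

    if (n > MAX_WG_SIZE) {
        return;                              /* sketch only: reject oversized groups */
    }
    /* Parallel region 1: compute a and b, save b per work-item. */
    for (size_t lid = 0; lid < n; lid++) {
        int a = in[lid] * 2;                 /* temporary, region-local */
        b_ctx[lid] = a + 1;
    }
    /* Parallel region 2: restore b from the context array. */
    for (size_t lid = 0; lid < n; lid++) {
        out[lid] = b_ctx[lid] + in[(lid + 1) % n];
    }
}
```

Only variables whose lifetime crosses a region boundary need a slot, which keeps the extra memory traffic small.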
References
• Pekka Jääskeläinen, Carlos Sánchez de La Lama, Erik Schnetter, Kalle Raiskila, Jarmo Takala, Heikki Berg: "pocl: A Performance-Portable OpenCL Implementation", International Journal of Parallel Programming, Springer, August 2014.
• http://github.com/pocl/pocl
The pocl Kernel Compiler

  • 1. The pocl Kernel Compiler Clay Chang
  • 2. CPU versus GPU • Sophiscated Control • Branch Prediction • Out-of-Order Execution • Large Cache • Little Control • No or Limited Branch Prediction • Simple Execution • Small or no cache • Lots of ALUs
  • 3. OpenCL as the Portable API
  • 4. Why OpenCL for CPU  Muiti-core CPU is out there  E.g. MediaTek Tri-Cluster 10 cores SoC  Mobile GPU is already busy  ~25% occupied by system UI in Android  Not every programs run good on GPU  Heavy Branch Divergence  OpenCL allows easily exploit multi-core and SIMD  Imagine: writing pthread + SIMD in assembly or intrinsics
  • 5. Running OpenCL Kernels on CPU  One thread per work-item?  Thousands of threads being created  Context-switching problems  How to synchronize threads?  How about running one work-group on a CPU thread?
  • 6. Related Works  Twin peaks: a software platform for heterogeneous computing on general-purpose and graphics processors.  MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core CPUs  Clover (http://people.freedesktop.org/~steckdenis/clover)  Shamrock (https://git.linaro.org/gpgpu/shamrock.git)
  • 7. What is to pocl  POrtable Computing Language  An efficient implementation of OpenCL standard which can be easily adapted for new targets  http://github.com/pocl/pocl  Main developer: Pekka Jääskeläinen from Tampere University of Technology  Supporting Architecture: CPU, tce, cellspu, HSA  Current version: 0.11
  • 9. The pocl Kernel Compiler (diagram) • clBuildProgram(…): OpenCL Kernel Source → Clang / LLVM → Single Work-item Kernel • clEnqueueNDRangeKernel(…, local_size, …): Single Work-item Kernel → pocl Kernel Compiler → Transformed Kernel
  • 10. pocl Compilation Chain • 1. Compile the kernel (OpenCL C) with Clang • 2. Link with target-specific built-in functions, such as sin, cos, geom_distance, etc. • 3. Work-group function generation / parallel work-item loop creation • 4. Backend optimizations (auto-vectorization, …) and code generation
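  • 11. Work-group Function Generation • The kernel (single work-item) becomes the body of a work-item loop (WI-loop): Work-group_function() { for (int i = 0; i < work-group_size; i++) { /* kernel (single work-item) */ } } • The work-group size comes from clEnqueueNDRangeKernel(…, group_size, …) • What if there are barriers?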
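The WI-loop on slide 11 can be sketched in plain C. This is only an illustration under stated assumptions: `scale_kernel_wi` and `scale_workgroup` are hypothetical names, the work-item index is passed as an argument instead of coming from `get_local_id(0)`, and the code is hand-written rather than anything pocl actually emits.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical single work-item kernel: out[i] = 2 * in[i].
 * In OpenCL C the index would come from get_local_id(0); here it is
 * passed explicitly so the sketch compiles as plain C. */
static void scale_kernel_wi(size_t local_id, const float *in, float *out)
{
    out[local_id] = 2.0f * in[local_id];
}

/* Work-group function: executes every work-item of one work-group,
 * one after another, on the calling CPU thread. This simple form is
 * only valid because the kernel contains no barrier. */
void scale_workgroup(size_t local_size, const float *in, float *out)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < local_size; i++) {   /* the WI-loop */
        scale_kernel_wi(i, in, out);
    }
}
```

Calling `scale_workgroup(4, in, out)` once per work-group replaces four separate work-item threads with a single loop, which is the loop a backend vectorizer can then target.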
  • 12. Semantics of barrier Synchronization • OpenCL 1.2 rev19 p.30: “… the work-group barrier must be encountered by all work-items of a work-group executing the kernel or by none at all…” • Counter-example (only the odd work-items reach the barrier): if (tid % 2) { … barrier(); … }
  • 13. Kernel Without barriers • A node in a CFG is a basic block (BB) • BB: a branchless sequence of instructions • A BB is executed as a unit, from the first instruction to the last • An edge in a CFG represents a branch in the control flow • Multiple exit BBs are allowed • The pocl kernel compiler generates a WI-loop around the whole CFG
  • 14. Types of Barrier • Unconditional barriers: a barrier that dominates the exit node • Conditional barriers: barriers placed inside an if-else or a for-loop (b-loop)
  • 15. Kernel with unconditional barriers • The pocl kernel compiler creates WI-loops before and after the barrier • This forms an algorithm: Algorithm 1: Parallel region formation when the kernel does not contain conditional barriers. • Step 1: Ensure there is an implicit barrier at the entry and the exit nodes of the kernel function and that there is only one exit node in the kernel function. This is a safe starting condition as it does not affect any execution order restrictions. • Step 2: Perform a depth-first-search traversal of the kernel CFG. Ignore the possible back edges to avoid infinite loops and to include the loops of the kernel in the parallel region. • Step 3: When encountering a barrier, create a parallel region by calling CreateSubgraph for the previously encountered barrier and the newly found barrier.
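The effect of slide 15 (one WI-loop per parallel region, split at the barrier) can be illustrated at source level. The sketch below is a hand-written plain-C analogue, not pocl output; `reverse_workgroup`, `MAX_LOCAL_SIZE`, and the reference kernel in the comment are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOCAL_SIZE 64  /* assumed upper bound for this sketch */

/* Reference kernel (OpenCL C), with n = get_local_size(0):
 *   tmp[lid] = in[lid];
 *   barrier(CLK_LOCAL_MEM_FENCE);
 *   out[lid] = tmp[n - 1 - lid];
 */
void reverse_workgroup(size_t local_size, const int *in, int *out)
{
    int tmp[MAX_LOCAL_SIZE];  /* stands in for __local memory */
    assert(local_size <= MAX_LOCAL_SIZE);

    /* Parallel region 1: code between the entry and the barrier. */
    for (size_t lid = 0; lid < local_size; lid++)
        tmp[lid] = in[lid];

    /* The barrier becomes the boundary between the two WI-loops:
     * every work-item has finished region 1 before any starts region 2. */

    /* Parallel region 2: code between the barrier and the exit. */
    for (size_t lid = 0; lid < local_size; lid++)
        out[lid] = tmp[local_size - 1 - lid];
}
```

A single WI-loop around the whole kernel would be wrong here, because work-item 0 would read `tmp[n - 1]` before the last work-item had written it.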
  • 16. A CFG with Two Conditional barriers • Algorithm 2: Tail duplication for parallel region formation in the case of conditional barriers in the kernel. • Step 1: Perform a depth-first traversal of the CFG, starting at the entry node. • Step 2: Each time a new, unprocessed conditional barrier is found, use CreateSubgraph to produce a sub-CFG from that barrier to the next exit node (duplicate the tail). • Step 3: Replicate the created sub-CFG using ReplicateCFG. In order to reduce code duplication, merge the tails from the same unconditional barrier paths. That is, replicate the basic blocks only after the last barrier that is unconditionally reachable from the one at hand. • Step 4: Start the algorithm at each of the found barrier successors.
  • 17. A CFG with Two Conditional barriers, After Tail Duplication • Easier WI-loop creation!
  • 18. “Peel” the First Loop Iteration • No more ambiguous branches in WI-loops!
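Slides 16 to 18 can be pictured with a source-level analogue. This is a hand-written sketch, not the CFG transformation pocl performs: the kernel in the comment, the name `rotate_workgroup`, and the assumption that the condition `rotate` is uniform across the work-group are all introduced for illustration. The tail (`out[lid] = x + 1`) is duplicated onto both paths, and the first work-item is peeled so that it decides which chain of parallel regions the rest of the group follows.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOCAL_SIZE 64  /* assumed upper bound for this sketch */

/* Reference kernel (OpenCL C); `rotate` must be uniform in the work-group:
 *   int x = in[lid];
 *   if (rotate) {
 *       tmp[lid] = x;
 *       barrier(CLK_LOCAL_MEM_FENCE);   // conditional barrier
 *       x = tmp[(lid + 1) % n];
 *   }
 *   out[lid] = x + 1;                   // the "tail"
 */
void rotate_workgroup(size_t n, int rotate, const int *in, int *out)
{
    int tmp[MAX_LOCAL_SIZE];  /* stands in for __local memory */
    assert(n > 0 && n <= MAX_LOCAL_SIZE);

    /* Peeled first iteration: work-item 0 evaluates the branch. */
    int x0 = in[0];
    if (rotate) {
        /* Work-item 0 is heading for the barrier, so all others follow. */
        tmp[0] = x0;
        for (size_t lid = 1; lid < n; lid++)
            tmp[lid] = in[lid];

        /* barrier: boundary between two parallel regions */

        /* Tail copy 1: the code after the barrier, including the tail. */
        for (size_t lid = 0; lid < n; lid++)
            out[lid] = tmp[(lid + 1) % n] + 1;
    } else {
        /* Tail copy 2: the barrier-free path with its own copy of the tail. */
        out[0] = x0 + 1;
        for (size_t lid = 1; lid < n; lid++)
            out[lid] = in[lid] + 1;
    }
}
```

Because each path now has its own copy of the tail, no WI-loop contains a branch that might or might not lead to a barrier.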
  • 19. Barriers in Kernel Loops • Insert implicit barriers at: 1. the end of the loop pre-header block 2. before the loop latch branch 3. after the PHI-node region of the loop header block
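A kernel loop that contains a barrier (a b-loop, slide 19) ends up as the outer loop, with a WI-loop for each parallel region inside it. The plain-C sketch below shows this for a tree reduction; the kernel in the comment, the name `reduce_workgroup`, and the power-of-two restriction are assumptions of the example, and the loop trip count must be the same for every work-item.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOCAL_SIZE 64  /* assumed upper bound for this sketch */

/* Reference kernel (OpenCL C), with n = get_local_size(0):
 *   tmp[lid] = in[lid];
 *   barrier(CLK_LOCAL_MEM_FENCE);
 *   for (s = n / 2; s > 0; s >>= 1) {
 *       if (lid < s) tmp[lid] += tmp[lid + s];
 *       barrier(CLK_LOCAL_MEM_FENCE);
 *   }
 *   // work-item 0 then holds the sum in tmp[0]
 */
int reduce_workgroup(size_t local_size, const int *in)
{
    int tmp[MAX_LOCAL_SIZE];  /* stands in for __local memory */
    assert(local_size > 0 && local_size <= MAX_LOCAL_SIZE);
    assert((local_size & (local_size - 1)) == 0);  /* power of two */

    /* Parallel region before the loop; it ends at the barrier at the
     * end of the loop pre-header. */
    for (size_t lid = 0; lid < local_size; lid++)
        tmp[lid] = in[lid];

    /* The kernel loop now runs once for the whole work-group. */
    for (size_t s = local_size / 2; s > 0; s >>= 1) {
        /* Parallel region inside the loop body, up to the barrier. */
        for (size_t lid = 0; lid < local_size; lid++) {
            if (lid < s)
                tmp[lid] += tmp[lid + s];
        }
        /* barrier before the loop latch branch */
    }
    return tmp[0];
}
```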
  • 20. Horizontal Inner-Loop Parallelization • More parallelization after loop interchange • blockWidth is unknown until runtime
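The loop interchange mentioned on slide 20 can be sketched as follows. Both functions are hypothetical plain-C analogues (`rowdot_wi_outer`, `rowdot_wi_inner`, and the row-times-vector kernel are assumptions of this example): the first keeps the WI-loop outside the kernel loop, the second moves the WI-loop inside so that the innermost loop runs across work-items with a trip count (the local size) that is known when the work-group function is generated, even though `block_width` is not.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOCAL_SIZE 64  /* assumed upper bound for this sketch */

/* Before interchange: WI-loop outside, kernel loop inside.
 * The inner trip count (block_width) is only known at run time. */
void rowdot_wi_outer(size_t local_size, size_t block_width,
                     const float *a, const float *b, float *out)
{
    for (size_t lid = 0; lid < local_size; lid++) {
        float acc = 0.0f;
        for (size_t k = 0; k < block_width; k++)
            acc += a[lid * block_width + k] * b[k];
        out[lid] = acc;
    }
}

/* After interchange: kernel loop outside, WI-loop inside.
 * The per-work-item accumulator needs one slot per work-item. */
void rowdot_wi_inner(size_t local_size, size_t block_width,
                     const float *a, const float *b, float *out)
{
    float acc[MAX_LOCAL_SIZE];
    assert(local_size <= MAX_LOCAL_SIZE);

    for (size_t lid = 0; lid < local_size; lid++)
        acc[lid] = 0.0f;

    for (size_t k = 0; k < block_width; k++) {
        /* Innermost loop runs across work-items ("horizontally"). */
        for (size_t lid = 0; lid < local_size; lid++)
            acc[lid] += a[lid * block_width + k] * b[k];
    }

    for (size_t lid = 0; lid < local_size; lid++)
        out[lid] = acc[lid];
}
```

Each work-item still accumulates in the same `k` order, so both versions compute the same values.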
  • 21. Handling of Kernel Variables • 1. There will be two parallel regions • 2. a's lifetime lies only within the first parallel region (it is a temporary variable) • 3. b's lifetime spans both parallel regions, so it is stored in a context array
  • 22. References • Pekka Jääskeläinen, Carlos Sánchez de La Lama, Erik Schnetter, Kalle Raiskila, Jarmo Takala, Heikki Berg: "pocl: A Performance-Portable OpenCL Implementation", International Journal of Parallel Programming, Springer, August 2014. • http://github.com/pocl/pocl
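The context array of slide 21 can be sketched in plain C. The kernel in the comment and the names `ctx_workgroup` and `b_ctx` are assumptions made for the example: `a` stays an ordinary scalar because it is dead before the barrier, while `b` gets one slot per work-item because its value must survive from the first WI-loop into the second.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOCAL_SIZE 64  /* assumed upper bound for this sketch */

/* Reference kernel (OpenCL C), with n = get_local_size(0):
 *   int a = in[lid] * 2;            // temporary, region 1 only
 *   int b = a + lid;                // live across the barrier
 *   tmp[lid] = a;
 *   barrier(CLK_LOCAL_MEM_FENCE);
 *   out[lid] = b + tmp[(lid + 1) % n];
 */
void ctx_workgroup(size_t n, const int *in, int *out)
{
    int tmp[MAX_LOCAL_SIZE];    /* stands in for __local memory */
    int b_ctx[MAX_LOCAL_SIZE];  /* context array: one b per work-item */
    assert(n > 0 && n <= MAX_LOCAL_SIZE);

    /* Parallel region 1 */
    for (size_t lid = 0; lid < n; lid++) {
        int a = in[lid] * 2;        /* plain scalar, reused per iteration */
        b_ctx[lid] = a + (int)lid;  /* saved for the second region */
        tmp[lid] = a;
    }

    /* barrier: boundary between the two parallel regions */

    /* Parallel region 2 */
    for (size_t lid = 0; lid < n; lid++)
        out[lid] = b_ctx[lid] + tmp[(lid + 1) % n];
}
```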

Editor's Notes

  • #18: A, B, D form a parallel region, and from B there is a branch into the middle of the work-item loop of another parallel region (ABEHI). If at least one work-item takes the branch after B that can lead to a barrier, the rest of the work-items must follow, so the first loop iteration is peeled.