oneAPI DPC++ Workshop
9th December 2020
Agenda
• Intel® oneAPI
  • Introduction
• DPC++
  • Introduction
  • DPC++ “Hello world”
  • Lab
• Intel® DPC++ Compatibility Tool
  • Introduction
  • Demo
Introduction to Intel® oneAPI
XPU Programming Challenges
• Growth in specialized workloads
• Variety of data-centric hardware required
• No common programming language or APIs
• Inconsistent tool support across platforms
• Each platform requires unique software investment
Application workloads need diverse hardware: scalar (CPU), vector (GPU), matrix, and spatial (FPGA) architectures, plus other accelerators (XPUs), programmed through languages & libraries and middleware/frameworks.
Introducing oneAPI
A unified programming model to simplify development across diverse architectures, both an industry initiative and an Intel product:
• Unified and simplified language and libraries for expressing parallelism
• Uncompromised native high-level language performance
• Based on industry standards and open specifications
• Interoperable with existing HPC programming models
oneAPI sits between the middleware/frameworks and the diverse XPU hardware (CPU, GPU, FPGA, other accelerators) that application workloads need.
Data Parallel C++
Subarnarekha Ghosal
Introduction
Intel® oneAPI DPC++ Overview
DPC++ builds on C++17, adds the latest available SYCL specification, and layers Intel extensions (SYCL Next) on top.
Intel® oneAPI DPC++ Overview
1. Data Parallel C++ is a high-level language designed to target heterogeneous architectures and take advantage of data parallelism.
2. Reuse code across CPUs and accelerators while performing custom tuning.
3. The open-source implementation on GitHub helps incorporate ideas from end users.
Before we start: Lambda Expressions

#include <algorithm>
#include <cmath>

void abssort(float* x, unsigned n) {
  std::sort(x, x + n,
    // Lambda expression: [capture clause](parameter list) { lambda body }
    [](float a, float b) {
      return std::abs(a) < std::abs(b);
    });
}

• A convenient way of defining an anonymous function object right at the location where it is invoked or passed as an argument to a function
• Lambda functions can be used to define kernels in SYCL
• The kernel lambda MUST capture by copy (i.e., [=])
DPC++ Program Flow
• The host queries for the available device(s).
• A queue dispatches kernels to a device and executes the commands on it.
• Command groups, built through the command group handler, control execution on the device.
• Kernel model: a kernel (lambda) is sent for execution; parallel_for executes it in parallel across the compute elements of the device.
• Buffers (BUF A, B, C) and accessors (ACC A and ACC B for reading, ACC C for writing) manage memory across host and device.
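As a minimal sketch of the device-query step (SYCL 1.2.1 API; the reported device names are system-specific):

#include <CL/sycl.hpp>
#include <iostream>
using namespace cl::sycl;

int main() {
  // List every device the SYCL runtime can see
  for (const auto& dev : device::get_devices()) {
    std::cout << dev.get_info<info::device::name>() << std::endl;
  }
  return 0;
}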
DPC++ “Hello world”
Step 1: Include the SYCL header and pull in its namespace
#include <CL/sycl.hpp>
using namespace cl::sycl;
Step 2: Wrap the host arrays in buffers so the runtime can manage data movement
buffer bufA (A, range(SIZE));
buffer bufB (B, range(SIZE));
buffer bufC (C, range(SIZE));
Step 3: Create a queue bound to a device selector
gpu_selector deviceSelector;
queue myQueue(deviceSelector);
• The device selector can be the default_selector, a cpu_selector, a gpu_selector, or intel::fpga_selector.
• If the device is not explicitly mentioned during the creation of the command queue, the runtime selects one for you.
• It is good practice to specify the selector to make sure the right device is chosen.
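To confirm which device the queue actually got, a sketch using the standard info query on the myQueue above:

std::cout << "Running on: "
          << myQueue.get_device().get_info<info::device::name>()
          << std::endl;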
Step 4: Submit a command group to the queue
myQueue.submit([&](handler& cgh) {
Step 5: Create accessors (A and B are read-only; C uses the default read-write access)
auto A = bufA.get_access(cgh, read_only);
auto B = bufB.get_access(cgh, read_only);
auto C = bufC.get_access(cgh);
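The two-argument get_access shown here is DPC++ shorthand; as a sketch (not from the slides), the strict SYCL 1.2.1 equivalent uses explicit access-mode templates:

auto A = bufA.get_access<access::mode::read>(cgh);
auto B = bufB.get_access<access::mode::read>(cgh);
auto C = bufC.get_access<access::mode::read_write>(cgh);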
Step 6: Launch the kernel
cgh.parallel_for<class vector_add>(N, [=](auto i) {
  C[i] = A[i] + B[i];
});
• Each iteration (work-item) has its own index id (i).
DPC++ “Hello World”: Vector Addition (Entire Code)

int main() {
  float A[N], B[N], C[N];
  {
    buffer bufA (A, range(N));
    buffer bufB (B, range(N));
    buffer bufC (C, range(N));
    queue myQueue;
    myQueue.submit([&](handler& cgh) {
      auto A = bufA.get_access(cgh, read_only);
      auto B = bufB.get_access(cgh, read_only);
      auto C = bufC.get_access(cgh);
      cgh.parallel_for<class vector_add>(N, [=](auto i) {
        C[i] = A[i] + B[i];
      });
    });
  }
  for (int i = 0; i < 5; i++) {
    std::cout << "C[" << i << "] = " << C[i] << std::endl;
  }
  return 0;
}
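To build and run with the oneAPI toolkit's dpcpp driver (the source file name vector_add.cpp is an assumption):

dpcpp vector_add.cpp -o vector_add
./vector_add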
Anatomy of a DPC++ Application
In the listing above, everything outside the command group (the array declarations, buffer creation, queue construction, and the final print loop) is host code. The body of myQueue.submit(...), i.e., the accessors and the parallel_for kernel, is accelerator device code.
DPC++ basics
• When the enclosing scope ends, the write buffer bufC goes out of scope: the kernel is guaranteed to have completed, and the host pointer has a consistent view of the output. Only then does the loop print C.
DPC++ demo session
Intel® oneAPI DPC++ Heterogeneous Platform
The CPU acts as the host; the devices it dispatches to can be the CPU itself, a GPU, an FPGA, or another accelerator.
For code samples on all these concepts, visit: https://github.com/oneapi-src/oneAPI-samples/
DPC++ Summary
• DPC++ is an open, standards-based programming model for heterogeneous platforms.
• It can target different accelerators from different vendors.
• It is a single-source programming model.
• The oneAPI specifications are publicly available: https://github.com/intel/llvm/tree/sycl/sycl/doc/extensions
Feedback and active participation are encouraged.
Intel® DPC++ Compatibility Tool
What is the Intel® DPC++ Compatibility Tool?
• It migrates a portion of existing CUDA code to the newly developed DPC++ language.
• Our experience has shown that coverage varies greatly, but on average about 80-90% of the CUDA code in applications can be migrated by this tool.
• Completing and verifying the final code is expected to be a manual process done by the developer.
https://software.intel.com/content/www/us/en/develop/documentation/get-started-with-intel-dpcpp-compatibility-tool/top.html
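A typical invocation sketch (the file vector_add.cu and the output directory are assumed names; dpct must be on PATH after sourcing the oneAPI environment):

dpct --in-root=. --out-root=dpct_out vector_add.cu

The migrated .dp.cpp files land in dpct_out, annotated with DPCT warning comments wherever manual attention is needed.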
DPCT* Demo session
Backup
DPC++ Deep Dive
Execution Flow
• A DPC++ application consists of host code, executed on the host (CPU), and device code, executed on devices (GPU, MIC, FPGA, …).
• The host submits command groups (synchronization commands, data-movement operations, and user-defined kernels) to command queues, which execute them on the devices.
• A device exposes compute units (CUs); each CU has local memory, each work-item has private memory, and global/constant memory is shared across the device. Host memory remains on the host.
Execution Flow (contd.): Execution of Kernel Instances
• A kernel instance = kernel object + nd_range + work-group decomposition.
• Kernel instances enqueued on the command queues enter a work-pool, from which they are dispatched to the compute units (CUs) of the device (GPU, FPGA, …).
Memory Model
Hardware Architecture
Memory Model
• Global memory: accessible to all work-items in all work-groups; reads and writes may be cached; persistent across kernel invocations.
• Constant memory: a region of global memory that remains constant during the execution of a kernel.
• Local memory: a memory region shared between work-items in a single work-group.
• Private memory: a region of memory private to a work-item. Variables defined in one work-item’s private memory are not visible to another work-item.
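As a sketch of how a kernel requests local memory (SYCL 1.2.1 local-accessor API; the work-group size of 64 and the queue name myQueue are assumptions):

myQueue.submit([&](handler& cgh) {
  // One 64-element tile per work-group, shared by its work-items
  accessor<float, 1, access::mode::read_write, access::target::local>
      tile(range<1>(64), cgh);
  cgh.parallel_for<class uses_local>(
      nd_range<1>(range<1>(1024), range<1>(64)), [=](nd_item<1> it) {
        // Stage a per-work-item value into local memory
        tile[it.get_local_id(0)] = static_cast<float>(it.get_global_id(0));
        // Synchronize the work-group before any work-item reads the tile
        it.barrier(access::fence_space::local_space);
      });
});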
DPC++ Device Memory Model
A device contains many work-groups. Each work-group has its own local memory and contains work-items, each with its own private memory; global memory and constant memory are shared by all work-groups on the device.
Unified Shared Memory
• The SYCL 1.2.1 specification offers buffers/accessors for tracking and managing memory transfers and guaranteeing data consistency across the host and DPC++ devices.
• Many HPC and enterprise applications, however, use pointers to manage data.
• The DPC++ extension for pointer-based programming is Unified Shared Memory (USM): device kernels can access the data using pointers.
Types of USM Allocation
• Device: explicit data movement between host and device.
• Host: data is sent over a bus, such as PCIe.
• Shared: data can migrate between host and device.
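A minimal vector-add sketch using shared allocations (assumes a DPC++ compiler with the USM extension; the unnamed kernel lambda may need the compiler's unnamed-lambda option):

#include <CL/sycl.hpp>
using namespace cl::sycl;

int main() {
  constexpr size_t N = 1024;
  queue q;
  // Shared allocations migrate between host and device automatically
  float* a = malloc_shared<float>(N, q);
  float* b = malloc_shared<float>(N, q);
  float* c = malloc_shared<float>(N, q);
  for (size_t i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 2.0f; }
  // The device kernel reads and writes through ordinary pointers
  q.parallel_for(range<1>(N), [=](id<1> i) { c[i] = a[i] + b[i]; }).wait();
  free(a, q); free(b, q); free(c, q);
  return 0;
}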
Kernel Model
Kernel Execution Model: Kernel Parallelism
• Multi-dimensional kernels
• ND-range
• Work-group
• Sub-group
• Work-item
Kernel Execution Model
• An explicit ND-range gives control, similar to programming models such as OpenCL, SYCL, and CUDA: the ND-range defines the global work size and its decomposition into work-groups of work-items.
nd_range & nd_item
Example: process every pixel in a 1920x1080 image.
• Each pixel needs processing, so the kernel is executed on each pixel (work-item).
• 1920 x 1080 ≈ 2M pixels = the global size.
• Not all 2M can run in parallel on the device; there are hardware resource limits.
• We have to split the work into smaller blocks of pixels = the local size (work-group).
• Either let the compiler determine the work-group size, or specify it using nd_range().
nd_range & nd_item: process every pixel in a 1920x1080 image

Let the compiler determine the work-group size:
h.parallel_for(range<2>(1920,1080), [=](id<2> item) {
  // CODE THAT RUNS ON DEVICE
});

Programmer specifies the work-group size (global size 1920x1080, local/work-group size 8x8):
h.parallel_for(nd_range<2>(range<2>(1920,1080), range<2>(8,8)),
  [=](nd_item<2> item) {
    // CODE THAT RUNS ON DEVICE
  });
nd_range & nd_item
Example: process every pixel in a 1920x1080 image. How do we choose the work-group size?
• 8x8 divides 1920x1080 equally (GOOD).
• 9x9 does not divide 1920x1080 equally; the runtime throws an invalid work-group size error.
• 10x10 divides 1920x1080 equally; it works, but a multiple of 8 usually gives better resource utilization.
• 24x24 divides 1920x1080 equally, but 24x24 = 576 work-items per group will fail at runtime if the device's maximum work-group size is 256.
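To check the limit on a given device before choosing a work-group size (a sketch using the standard info query; myQueue is assumed from the earlier examples):

auto dev = myQueue.get_device();
size_t maxWG = dev.get_info<info::device::max_work_group_size>();
std::cout << "Max work-group size: " << maxWG << std::endl;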