OpenCL Tutorial - Basics
Guillermo Marcus
Overview
About me
Dr. Guillermo Marcus
[email protected]
PhD in Computer Science from Heidelberg, 2011
Head of the Scientific Computing Research Group until March 2013
NVIDIA (OptiX Group) from May 2013
Taught the ZITI Master Lecture in GPU Computing between 2011-2013
OpenCL Overview
Standardized language to program accelerators
http://www.khronos.org/opencl
C-based: the APIs are C, and the GPU code is C or C-like
Compiles at runtime (see the sketch below)
Supported by multiple hardware vendors: NVIDIA, AMD, ARM, PowerVR, Altera
While code is portable, optimizations are not!
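The runtime compilation step, sketched with the standard C API (the context ctx and the device are assumed to have been created earlier; error checking omitted):

// Kernel source is kept as a plain C string and compiled at runtime for the target device
const char *src =
    "__kernel void vadd(__global const float *a, __global const float *b, __global float *c)"
    "{ int i = get_global_id(0); c[i] = a[i] + b[i]; }";

cl_int err;
cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
err = clBuildProgram(prog, 1, &device, "", NULL, NULL);   // compile + link for this device
cl_kernel vadd = clCreateKernel(prog, "vadd", &err);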
OpenCL Basics
Application Model
Execution Model
Memory Model
Application Model
Activities are driven by the host computer
Multiple platforms, multiple devices possible (see the enumeration sketch below)
IO is an important part of the model
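A minimal sketch of the host-driven setup, enumerating platforms and devices with the C API (error checking omitted):

cl_platform_id platforms[8];
cl_uint num_platforms;
clGetPlatformIDs(8, platforms, &num_platforms);              // all installed platforms

cl_device_id devices[8];
cl_uint num_devices;
clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_GPU,             // GPUs of the first platform
               8, devices, &num_devices);

cl_int err;
cl_context ctx = clCreateContext(NULL, num_devices, devices, NULL, NULL, &err);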
GPU Kernels
Starts a computation in the GPU
"Launches" (starts) a collection of threads
Requires code to execute AND a specification (how the threads are organized)
Can be blocking or non-blocking (see the launch sketch below)
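A sketch of such a launch; clEnqueueNDRangeKernel itself only enqueues the work (non-blocking), an explicit clFinish makes the host wait. The queue, kernel and buffer objects are assumed from earlier setup:

size_t global[1] = { 1024 };   // total number of work items
size_t local[1]  = { 64 };     // work items per work group

clSetKernelArg(vadd, 0, sizeof(cl_mem), &buf_a);
clSetKernelArg(vadd, 1, sizeof(cl_mem), &buf_b);
clSetKernelArg(vadd, 2, sizeof(cl_mem), &buf_c);

clEnqueueNDRangeKernel(queue, vadd, 1, NULL, global, local, 0, NULL, NULL);  // returns immediately
clFinish(queue);                                                             // block until done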
Execution Model
Work Items
Kernel code
"Serial" execution thread
Private variables
int a[N], b[N], c[N];
int i, tid;
tid = getThreadID();
for(i=tid; i<N; i+=4)
    c[i] = a[i] + b[i];
Work Groups
Synchronization inside the group
Data sharing inside the group
Program Grid
Collection of Work Groups
No synchronization
No data sharing
Work Items
A single thread in the GPU
They are normally executed as SIMD
Thread code is the same for all work items
Work items can have private variables
Have a unique ID inside the kernel
int a[N], b[N], c[N];
int i, tid;
tid = getThreadID();
for(i=tid; i<N; i+=4)
    c[i] = a[i] + b[i];
(Diagram: worker 1 ... worker 4)
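Written as an actual OpenCL C kernel, the same idea looks roughly like this; get_global_id/get_global_size replace the pseudocode getThreadID, and the stride becomes the total number of work items rather than a fixed 4:

__kernel void vadd(__global const int *a,
                   __global const int *b,
                   __global int *c,
                   const int n)
{
    int tid    = get_global_id(0);     // unique ID of this work item
    int stride = get_global_size(0);   // total number of work items

    for (int i = tid; i < n; i += stride)
        c[i] = a[i] + b[i];
}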
Work Groups
Work Groups are collections of Work Items
Items inside a Work Group ...
are executed in parallel*
share local data
have a local ID
can be organized as 1D, 2D, 3D* arrays
Work Groups ...
are independent of each other
have a unique ID inside the kernel (see the ID sketch below)
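A small sketch of how these IDs appear inside a kernel (kernel name and argument are illustrative):

__kernel void show_ids(__global int *out)
{
    size_t gid = get_global_id(0);     // unique ID over the whole grid
    size_t lid = get_local_id(0);      // local ID inside the work group
    size_t grp = get_group_id(0);      // unique ID of the work group
    size_t lsz = get_local_size(0);    // work items per work group

    out[gid] = (int)(grp * lsz + lid); // equals gid for a 1D launch
}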
Program Grid
Work Groups are organized as a 1D, 2D, 3D array
Between Work Groups there is ...
No communication
No data synchronization
In fact, often there is not even data coherency between work groups!
Memory Model
Hierarchical organization of areas: Host, Global, Local, Registers
Moving data between areas is expensive
Data coherency is not guaranteed at all times or across all areas
Every area has its own constraint set
Controlled by attributes in the code definition (see the qualifier sketch below)
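The attributes are the OpenCL address space qualifiers; a hypothetical kernel signature showing the main ones:

__kernel void example(__global   float *data,     // global memory, visible to all work items
                      __constant float *lut,      // constant memory, read-only, cached
                      __local    float *scratch)  // local memory, shared inside a work group
{
    __private float tmp = data[get_global_id(0)]; // private memory: registers per work item
    scratch[get_local_id(0)] = tmp * lut[0];
    barrier(CLK_LOCAL_MEM_FENCE);                 // make the shared write visible to the group
}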
Host Memory
Main memory of the host computer
Can move data only between the host and the GPU global memory
Transfer is always initiated by the host, can be synchronous or asynchronous (see the sketch below)
Bandwidth is limited by the PCIe links
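A sketch of a host-initiated transfer; the blocking flag selects synchronous vs. asynchronous behaviour (ctx, queue and host_data are assumptions from earlier setup):

cl_int err;
cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, N * sizeof(float), NULL, &err);

// Synchronous: returns only after the data has been copied
clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, N * sizeof(float), host_data, 0, NULL, NULL);

// Asynchronous: returns immediately, completion is signalled through an event
cl_event ev;
clEnqueueReadBuffer(queue, buf, CL_FALSE, 0, N * sizeof(float), host_data, 0, NULL, &ev);
clWaitForEvents(1, &ev);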
Global Memory
Main GPU memory, available to all threads
Biggest in size, up to several GBs
Huge bandwidth, but also huge latency (typically 400-800 cycles), not always cached
Performance is highly dependent on access patterns (see the sketch below)
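One common pattern concern is coalescing: consecutive work items should touch consecutive addresses. A sketch contrasting the two cases (kernel names are illustrative):

__kernel void copy_coalesced(__global const float *in, __global float *out)
{
    int i = get_global_id(0);
    out[i] = in[i];              // neighbouring work items read neighbouring elements
}

__kernel void copy_strided(__global const float *in, __global float *out, int stride)
{
    int i = get_global_id(0);
    out[i] = in[i * stride];     // large strides scatter accesses and waste bandwidth
}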
Local Memory
Available to all threads inside a Work Group
Limited in size (typical: 8KB-64KB)
Latency comparable to registers
Constrained by access rules (e.g. bank conflicts) that limit performance depending on access patterns
Used as scratchpad or cache of global memory (see the sketch below)
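A sketch of the scratchpad pattern: each work group stages a tile of global memory in local memory and synchronises with a barrier before reusing it (illustrative kernel; TILE is assumed to equal the work-group size):

#define TILE 64

__kernel void reverse_tiles(__global const float *in, __global float *out)
{
    __local float tile[TILE];

    int lid = get_local_id(0);
    int gid = get_global_id(0);

    tile[lid] = in[gid];                    // stage global data in local memory
    barrier(CLK_LOCAL_MEM_FENCE);           // wait until the whole group has written

    int base = get_group_id(0) * TILE;
    out[base + lid] = tile[TILE - 1 - lid]; // reuse data written by other work items
}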
GPU Registers
Private to every thread
Normally hidden: no direct access, optimized by the compiler
Fastest access, only constrained by the number of available registers
Some platforms may use more registers than others... it depends on the hardware architecture
Constant Memory
Read-only memory
Cached
Good for storing look-up tables and values that do not change (see the sketch below)
It is normally a small area of the global memory
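A sketch of a look-up table placed in constant memory (names are illustrative):

__kernel void apply_lut(__global const uchar *in,
                        __global float *out,
                        __constant float *lut)   // small, read-only, cached
{
    int i = get_global_id(0);
    out[i] = lut[in[i]];                         // all work items read the same table
}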
Private Memory
Unique to every Work Item
Normally mapped first to registers, then spilled to global memory when no free registers remain
Kernel Specification
Defines the number and distribution of threads inside the kernel. A GPU program can be launched with different specifications, creating different kernels. The distribution is given as global and local settings, which define the total number of threads and the number of threads per work group, respectively, as well as their organization.
// Create kernel specification (ND range)
int groups = VECT_SIZE/64 + ((VECT_SIZE % 64 == 0) ? 0 : 1);
NDRange global(64*groups);
NDRange local(64);
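With the C++ wrapper API these ranges are then passed to the launch call (the queue and kernel objects, and a using namespace cl, are assumed from earlier setup):

// Enqueue the kernel with the global/local specification above
queue.enqueueNDRangeKernel(kernel, NullRange, global, local);
queue.finish();   // wait for completion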