OpenCL Tutorial - Basics
Guillermo Marcus
Overview
About me
Dr. Guillermo Marcus
[email protected]
PhD in Computer Science from Heidelberg, 2011
Head of the Scientific Computing Research Group until March 2013
NVIDIA (OptiX Group) from May 2013
Taught the ZITI Master Lecture in GPU Computing between 2011-2013
OpenCL Overview
Standardized language to program accelerators
http://www.khronos.org/opencl
C-based: the APIs are C, and the GPU code is C or C-like
Compiles at runtime (see the sketch below)
Supported by multiple hardware vendors: NVIDIA, AMD, ARM, PowerVR, Altera
While code is portable, optimizations are not!
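The runtime compilation step, sketched with the standard C API (the context ctx and the device are assumed to have been created earlier; error checking omitted):

// Kernel source is kept as a plain C string and compiled at runtime for the target device
const char *src =
    "__kernel void vadd(__global const float *a, __global const float *b, __global float *c)"
    "{ int i = get_global_id(0); c[i] = a[i] + b[i]; }";

cl_int err;
cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
err = clBuildProgram(prog, 1, &device, "", NULL, NULL);   // compile + link for this device
cl_kernel vadd = clCreateKernel(prog, "vadd", &err);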
OpenCL Basics
Application Model
Execution Model
Memory Model
Application Model
Activities are driven by the host computer
Multiple platforms, multiple devices possible (see the enumeration sketch below)
IO is an important part of the model
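A minimal sketch of the host-driven setup, enumerating platforms and devices with the C API (error checking omitted):

cl_platform_id platforms[8];
cl_uint num_platforms;
clGetPlatformIDs(8, platforms, &num_platforms);              // all installed platforms

cl_device_id devices[8];
cl_uint num_devices;
clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_GPU,             // GPUs of the first platform
               8, devices, &num_devices);

cl_int err;
cl_context ctx = clCreateContext(NULL, num_devices, devices, NULL, NULL, &err);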
GPU Kernels
Starts a computation in the GPU
"Launches" (starts) a collection of threads
Requires code to execute AND a specification (how the threads are organized)
Can be blocking or non-blocking (see the launch sketch below)
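A sketch of such a launch; clEnqueueNDRangeKernel itself only enqueues the work (non-blocking), an explicit clFinish makes the host wait. The queue, kernel and buffer objects are assumed from earlier setup:

size_t global[1] = { 1024 };   // total number of work items
size_t local[1]  = { 64 };     // work items per work group

clSetKernelArg(vadd, 0, sizeof(cl_mem), &buf_a);
clSetKernelArg(vadd, 1, sizeof(cl_mem), &buf_b);
clSetKernelArg(vadd, 2, sizeof(cl_mem), &buf_c);

clEnqueueNDRangeKernel(queue, vadd, 1, NULL, global, local, 0, NULL, NULL);  // returns immediately
clFinish(queue);                                                             // block until done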
Execution Model
Work Items
Kernel code
"Serial" execution thread
Private variables
int a[N], b[N], c[N];
int i, tid;
tid = getThreadID();
for(i=tid; i<N; i+=4)
    c[i] = a[i] + b[i];
Work Groups
Synchronization inside the group
Data sharing inside the group
Program Grid
Collection of Work Groups
No synchronization
No data sharing
Work Items
A single thread in the GPU
They are normally executed as SIMD
Thread code is the same for all work items
Work items can have private variables
Have a unique ID inside the kernel
int a[N], b[N], c[N];
int i, tid;
tid = getThreadID();
for(i=tid; i<N; i+=4)
    c[i] = a[i] + b[i];
(Diagram: worker 1 ... worker 4)
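Written as an actual OpenCL C kernel, the same idea looks roughly like this; get_global_id/get_global_size replace the pseudocode getThreadID, and the stride becomes the total number of work items rather than a fixed 4:

__kernel void vadd(__global const int *a,
                   __global const int *b,
                   __global int *c,
                   const int n)
{
    int tid    = get_global_id(0);     // unique ID of this work item
    int stride = get_global_size(0);   // total number of work items

    for (int i = tid; i < n; i += stride)
        c[i] = a[i] + b[i];
}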
Work Groups
Work Groups are collections of Work Items
Items inside a Work Group ...
are executed in parallel*
share local data
have a local ID
can be organized as 1D, 2D, 3D* arrays
Work Groups ...
are independent of each other
have a unique ID inside the kernel (see the ID sketch below)
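A small sketch of how these IDs appear inside a kernel (kernel name and argument are illustrative):

__kernel void show_ids(__global int *out)
{
    size_t gid = get_global_id(0);     // unique ID over the whole grid
    size_t lid = get_local_id(0);      // local ID inside the work group
    size_t grp = get_group_id(0);      // unique ID of the work group
    size_t lsz = get_local_size(0);    // work items per work group

    out[gid] = (int)(grp * lsz + lid); // equals gid for a 1D launch
}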
Program Grid
Work Groups are organized as a 1D, 2D, 3D array
Between Work Groups there is ...
No communication
No data synchronization
In fact, often there is not even data coherency between work groups!
Memory Model
Hierarchical organization of areas: Host, Global, Local, Registers
Moving data between areas is expensive
Data coherency is not guaranteed at all times or across all areas
Every area has its own constraint set
Controlled by attributes in the code definition (see the qualifier sketch below)
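The attributes are the OpenCL address space qualifiers; a hypothetical kernel signature showing the main ones:

__kernel void example(__global   float *data,     // global memory, visible to all work items
                      __constant float *lut,      // constant memory, read-only, cached
                      __local    float *scratch)  // local memory, shared inside a work group
{
    __private float tmp = data[get_global_id(0)]; // private memory: registers per work item
    scratch[get_local_id(0)] = tmp * lut[0];
    barrier(CLK_LOCAL_MEM_FENCE);                 // make the shared write visible to the group
}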
Host Memory
Main memory of the host computer
Can move data only between the host and the GPU global memory
Transfer is always initiated by the host, can be synchronous or asynchronous (see the sketch below)
Bandwidth is limited by the PCIe links
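A sketch of a host-initiated transfer; the blocking flag selects synchronous vs. asynchronous behaviour (ctx, queue and host_data are assumptions from earlier setup):

cl_int err;
cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, N * sizeof(float), NULL, &err);

// Synchronous: returns only after the data has been copied
clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, N * sizeof(float), host_data, 0, NULL, NULL);

// Asynchronous: returns immediately, completion is signalled through an event
cl_event ev;
clEnqueueReadBuffer(queue, buf, CL_FALSE, 0, N * sizeof(float), host_data, 0, NULL, &ev);
clWaitForEvents(1, &ev);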
Global Memory
Main GPU memory, available to all threads
Biggest in size, up to several GBs
Huge bandwidth, but also huge latency (typically 400-800 cycles), not always cached
Performance is highly dependent on access patterns (see the sketch below)
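One common pattern concern is coalescing: consecutive work items should touch consecutive addresses. A sketch contrasting the two cases (kernel names are illustrative):

__kernel void copy_coalesced(__global const float *in, __global float *out)
{
    int i = get_global_id(0);
    out[i] = in[i];              // neighbouring work items read neighbouring elements
}

__kernel void copy_strided(__global const float *in, __global float *out, int stride)
{
    int i = get_global_id(0);
    out[i] = in[i * stride];     // large strides scatter accesses and waste bandwidth
}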
Local Memory
Available to all threads inside a Work Group
Limited in size (typical: 8KB-64KB)
Latency comparable to registers
Constrained by access rules (e.g. bank conflicts) that limit performance depending on access patterns
Used as scratchpad or cache of global memory (see the sketch below)
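A sketch of the scratchpad pattern: each work group stages a tile of global memory in local memory and synchronises with a barrier before reusing it (illustrative kernel; TILE is assumed to equal the work-group size):

#define TILE 64

__kernel void reverse_tiles(__global const float *in, __global float *out)
{
    __local float tile[TILE];

    int lid = get_local_id(0);
    int gid = get_global_id(0);

    tile[lid] = in[gid];                    // stage global data in local memory
    barrier(CLK_LOCAL_MEM_FENCE);           // wait until the whole group has written

    int base = get_group_id(0) * TILE;
    out[base + lid] = tile[TILE - 1 - lid]; // reuse data written by other work items
}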
GPU Registers
Private to every thread
Normally hidden: no direct access, optimized by the compiler
Fastest access, only constrained by the number of available registers
Some platforms may use more registers than others... it depends on the hardware architecture
Constant Memory
Read-only memory
Cached
Good for storing look-up tables and values that do not change (see the sketch below)
It is normally a small area of the global memory
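A sketch of a look-up table placed in constant memory (names are illustrative):

__kernel void apply_lut(__global const uchar *in,
                        __global float *out,
                        __constant float *lut)   // small, read-only, cached
{
    int i = get_global_id(0);
    out[i] = lut[in[i]];                         // all work items read the same table
}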
Private Memory
Unique to every Work Item
Normally mapped first to registers, then spilled to global memory when no free registers remain
Kernel Specification
Defines the number and distribution of threads inside the kernel. A GPU program can be launched with different specifications, creating different kernels. The distribution is given as global and local settings, which define the total number of threads and the number of threads per work group, respectively, as well as their organization.
// Create kernel specification (ND range)
int groups = VECT_SIZE/64 + ((VECT_SIZE % 64 == 0) ? 0 : 1);
NDRange global(64*groups);
NDRange local(64);
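With the C++ wrapper API these ranges are then passed to the launch call (the queue and kernel objects, and a using namespace cl, are assumed from earlier setup):

// Enqueue the kernel with the global/local specification above
queue.enqueueNDRangeKernel(kernel, NullRange, global, local);
queue.finish();   // wait for completion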