Automatic generation of platform architectures using OpenCL and FPGA roadmap

Automatic generation of platform architectures
using OpenCL
FPGA roadmap
Department of Electrical and
Computer Engineering
University of Thessaly
Volos, Greece
Nikolaos Bellas

What is an FPGA?
• A Field-Programmable Gate Array (FPGA) is the best-
known example of reconfigurable logic
• Hardware can be modified post chip fabrication
• Tailor the Hardware to the application
– Fixed-logic processors (CPUs/GPUs) can change only
their software (via programming)
• FPGAs can offer superior performance,
performance/power, or performance/cost compared to
CPUs and GPUs.

FPGA architecture
• A generic island-style
FPGA fabric
• Configurable Logic Blocks
(CLB) and Programmable
Switch Matrices (PSM)
• Bitstream configures
functionality of each CLB
and interconnection
between logic blocks

The Xilinx Slice
• Xilinx slice features
– LUTs
– MUXF5, MUXF6,
MUXF7, MUXF8
(only the F5 and
F6 MUX are shown
in this diagram)
– Carry Logic
– MULT_ANDs
– Sequential Elements
• Detailed Structure

Example 2-input LUT
• Lookup table: inputs a and b select one of four configuration bits, which drives out
a b | out (configuration 0 0 0 1) | out (configuration 1 0 0 1)
0 0 | 0 | 1
0 1 | 0 | 0
1 0 | 0 | 0
1 1 | 1 | 1
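
As an illustration of how configuration bits define a LUT's function, here is a minimal C sketch of a 2-input LUT. The function names and the bit ordering (bit index = a*2 + b) are assumptions made for this example, not a vendor format. The two configurations from the slide, 0 0 0 1 and 1 0 0 1, behave as AND and XNOR under this ordering.

```c
#include <assert.h>
#include <stdint.h>

/* Pack a truth-table column (outputs for ab = 00, 01, 10, 11) into
   four configuration bits. Bit ordering is an assumption of this sketch. */
static uint8_t lut2_config(unsigned o00, unsigned o01, unsigned o10, unsigned o11)
{
    return (uint8_t)((o00 & 1u) | ((o01 & 1u) << 1) |
                     ((o10 & 1u) << 2) | ((o11 & 1u) << 3));
}

/* Evaluate the LUT: the inputs form an index that selects one
   configuration bit, which becomes the output. */
static unsigned lut2_eval(uint8_t config, unsigned a, unsigned b)
{
    assert(a <= 1 && b <= 1);
    unsigned index = (a << 1) | b;
    return (config >> index) & 1u;
}
```

Changing only the configuration value changes the logic function, while the evaluation hardware (here, `lut2_eval`) stays the same.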

Modern FPGA architecture
Xilinx Virtex family
• Columns of on-chip SRAMs, hard IP cores (PPC 405), and
DSP slices (multiply-accumulate units)
FPGA discussion
• Advantages
– Potential for (near) optimal performance for a given
application
– Various forms of parallelism can be exploited
• Disadvantages
– Programmable mainly at the hardware level using
Hardware Description Languages (BUT, this can
change)
– Lower clock frequency (200-300 MHz) compared to
CPUs (~3 GHz) and GPUs (~1.5 GHz)

MATENVMED
Silicon OpenCL: Automatic
generation of platform
architectures using OpenCL

18-19/7/2013 MATENVMED Plenary Meeting
Introduction
• Automatic generation of hardware has been at the
research forefront for the last 10 years.
• Variety of High Level Programming Models:
C/C++, C-like Languages, MATLAB
• Obstacles:
– Parallelism Extraction for larger applications
– Extensive Compiler Transformations & Optimizations
• Parallel Programming Models to the Rescue:
– CUDA, OpenCL.

Motivation
• Parallel programming models are a natural fit for
reconfigurable platforms.
• A major shift of the computing industry toward
many-core computing systems.
• Reconfigurable fabrics bear a strong
resemblance to many-core systems.

Vision
• Provide the tools and methodology to enable the large
pool of software developers and domain experts, who do
not necessarily have expertise in hardware design, to
architect whole accelerator-based systems
– Borrowed from advances in massively parallel programming
models
[Figure: accelerator-based system with a CPU, a GPU, and an FPGA linked by PCI Express]

Silicon OpenCL
• Silicon OpenCL
(“SOpenCL”).
• A tool flow to convert
an unmodified
OpenCL application
into a SoC design with
HW/SW components.

Contribution
• Architectural Synthesis methodology:
– Code Transformations.
– Architectural Template.

OpenCL for Heterogeneous Systems
• OpenCL (Open Computing Language): a unified programming
model that aims to let a programmer write a portable program once
and deploy it on any heterogeneous system with CPUs and GPUs.
• Became an important industry standard after its release, owing to
substantial industry support.

OpenCL Platform Model
One host and one or more Compute Devices (CD)
Each CD consists of one or more Compute Units (CU)
Each CU is further divided into one or more Processing Elements (PE)
[Figure: the host runs the main program; the compute devices run the computation kernels]

OpenCL Kernel Execution Geometry
• OpenCL defines a geometric partitioning of the grid of computations
• The grid consists of an N-dimensional space of work-groups
• Each work-group consists of an N-dimensional space of work-items.
[Figure: a grid partitioned into work-groups, each containing work-items]
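
The relation between the grid, work-groups, and work-items can be sketched in C for a single dimension. The helper names are hypothetical. The sketch assumes a zero global offset and a global size that is a multiple of the work-group size.

```c
#include <assert.h>
#include <stddef.h>

/* Number of work-groups along one dimension of the grid.
   Assumes the global size is a multiple of the work-group size. */
static size_t num_groups(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}

/* Global index of a work-item along one dimension, assuming a zero
   global offset: work-group index times work-group size, plus the
   index of the work-item inside its work-group. */
static size_t global_id(size_t group_id, size_t local_size, size_t local_id)
{
    assert(local_id < local_size);
    return group_id * local_size + local_id;
}
```

For example, with 1024 work-items in work-groups of 64, there are 16 work-groups, and work-item 5 of work-group 2 has global index 133.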

OpenCL Simple Example
• OpenCL kernel describes the computation of a work-item
• Finest parallelism granularity
• e.g. add two integer vectors (N=1)
C code:
void add(int* a,
         int* b,
         int* c,
         int n) {
  /* n is the element count; sizeof(a) would only give the pointer size */
  for (int idx = 0; idx < n; idx++)
    c[idx] = a[idx] + b[idx];
}
OpenCL kernel code:
__kernel void vadd(
    __global int* a,
    __global int* b,
    __global int* c) {
  /* Run-time call, used to differentiate execution for each work-item */
  int idx = get_global_id(0);
  c[idx] = a[idx] + b[idx];
}

Why OpenCL as an HDL?
• OpenCL exposes parallelism at the finest
granularity
– Allows easy hardware generation at different levels of
granularity
– One accelerator per work-item, one accelerator per work-
group, one accelerator per multiple work-groups, etc.
• OpenCL exposes data communication
– Critical to transfer and stage data across platforms
• We target unmodified OpenCL to open
hardware design to software engineers
– No need for hardware/architectural expertise

SOpenCL Tool Flow

Granularity Management
• Optimal thread granularity depends on the hardware platform (CPU, GPU, FPGA)
• We select a hardware accelerator that processes one work-group per
invocation
– Smaller invocation overhead

Granularity Coarsening
[Figure: work-item threads merged into a work-group thread]

Serialization of Work Items
OpenCL code:
__kernel void vadd(…) {
  int idx = get_global_id(0);
}
C code:
__kernel void Vadd(…) {
  int idx, i, j, k;
  /* One loop per dimension serializes the work-items of a work-group */
  for (i = 0; i < get_local_size(2); i++)
    for (j = 0; j < get_local_size(1); j++)
      for (k = 0; k < get_local_size(0); k++)
      {
        idx = get_item_gid(0);
      }
}
The index for each iteration is computed as:
idx = (global_id2(0) + i) * Grid_Width * Grid_Height +
      (global_id1(0) + j) * Grid_Width +
      (global_id0(0) + k);
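
A self-contained C sketch of this serialization, using vector addition as the kernel body. The `work_group` struct and the function name are hypothetical and not part of SOpenCL; the index formula follows the one on the slide, with `base` playing the role of the work-group's first global coordinates.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical descriptor of one work-group: its size and the global
   coordinates of its first work-item in each dimension. */
typedef struct {
    size_t local_size[3];
    size_t base[3];
} work_group;

/* Execute vadd for every work-item of one work-group. The three loops
   replace the implicit parallel work-items of the OpenCL kernel. */
static void vadd_work_group(const work_group *wg,
                            size_t grid_width, size_t grid_height,
                            const int *a, const int *b, int *c)
{
    assert(wg && a && b && c);
    for (size_t i = 0; i < wg->local_size[2]; i++)
        for (size_t j = 0; j < wg->local_size[1]; j++)
            for (size_t k = 0; k < wg->local_size[0]; k++) {
                /* Linearized global index of this work-item */
                size_t idx = (wg->base[2] + i) * grid_width * grid_height
                           + (wg->base[1] + j) * grid_width
                           + (wg->base[0] + k);
                c[idx] = a[idx] + b[idx];
            }
}
```

Calling `vadd_work_group` once per work-group covers the whole grid, which mirrors the choice of one accelerator invocation per work-group.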

Architectural Synthesis
• Exploit available parallelism and application specific
features.
• Apply a series of transformations to generate customized
hardware accelerators.
• Uses the LLVM Compiler Infrastructure.
• Generates synthesizable Verilog and a test bench.

Verilog Generation: PE Architecture
• Predication
• Code slicing
• SMS modulo scheduling
• Verilog generation
• FU types, bitwidths, I/O bandwidth
• Feed data in order, write data in order
07/27/13 MATENVMED Kick-Off Meeting

Roadmap for FPGA
implementation

FPGA Implementation
• Our plan is to use the same code base
(e.g. OpenCL) to explore different
architectures
– OpenCL used for multicore CPU, GPU, FPGA
(SOpenCL)
• Fast exploration based on area,
performance and power requirements

FPGA Implementation
• Monte-Carlo simulations can exploit multi-level
parallelism of FPGAs
– Multiple MC simulations per point
– Multiple points simultaneously
– Double-precision trigonometric, log, addition, and
multiplication operations for each walk
– Double-precision FP operations are not an FPGA
strong point, but SOpenCL can still handle them.
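
The two levels of parallelism can be sketched in C as two nested loops whose iterations are independent. Everything here is an assumption for illustration: `noisy_walk` is a placeholder, not the project's walk computation (which would use the double-precision operations listed above), and the small linear congruential generator is chosen only to keep the sketch self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-walk function: one sample for a point and a seed. */
typedef double (*walk_fn)(double point, uint32_t *rng_state);

/* Small linear congruential generator; returns a value in [0, 1). */
static double next_uniform(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return (double)(*state >> 8) / 16777216.0;
}

/* Placeholder walk: the point value plus zero-mean noise. */
static double noisy_walk(double point, uint32_t *rng_state)
{
    return point + (next_uniform(rng_state) - 0.5);
}

/* Average n_walks samples for each point. Iterations of both loops are
   independent, so each level is a candidate for parallel hardware. */
static void mc_estimate(const double *points, double *estimates,
                        size_t n_points, size_t n_walks, walk_fn walk)
{
    assert(n_walks > 0);
    for (size_t p = 0; p < n_points; p++) {          /* multiple points */
        double sum = 0.0;
        for (size_t w = 0; w < n_walks; w++) {       /* multiple walks per point */
            uint32_t state = (uint32_t)(p * n_walks + w + 1); /* own seed per walk */
            sum += walk(points[p], &state);
        }
        estimates[p] = sum / (double)n_walks;
    }
}
```

Because each walk carries its own seed and no walk reads another's result, the loops could be unrolled or replicated across accelerators without changing the estimates, apart from the order of the floating-point summation.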
