
IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 35, NO. 10, OCTOBER 2016

A Survey and Evaluation of FPGA High-Level Synthesis Tools

Razvan Nane, Member, IEEE, Vlad-Mihai Sima, Christian Pilato, Member, IEEE, Jongsok Choi, Student Member, IEEE, Blair Fort, Student Member, IEEE, Andrew Canis, Student Member, IEEE, Yu Ting Chen, Student Member, IEEE, Hsuan Hsiao, Student Member, IEEE, Stephen Brown, Member, IEEE, Fabrizio Ferrandi, Member, IEEE, Jason Anderson, Member, IEEE, and Koen Bertels, Member, IEEE

Abstract—High-level synthesis (HLS) is increasingly popular for the design of high-performance and energy-efficient heterogeneous systems, shortening time-to-market and addressing today's system complexity. HLS allows designers to work at a higher level of abstraction by using a software program to specify the hardware functionality. Additionally, HLS is particularly interesting for designing field-programmable gate array circuits, where hardware implementations can be easily refined and replaced in the target device. Recent years have seen much activity in the HLS research community, with a plethora of HLS tool offerings, from both industry and academia. All these tools may have different input languages, perform different internal optimizations, and produce results of different quality, even for the very same input description. Hence, it is challenging to compare their performance and understand which is the best for the hardware to be implemented. We present a comprehensive analysis of recent HLS tools, as well as overview the areas of active interest in the HLS research community. We also present a first-published methodology to evaluate different HLS tools. We use our methodology to compare one commercial and three academic tools on a common set of C benchmarks, aiming at performing an in-depth evaluation in terms of performance and the use of resources.

Index Terms—BAMBU, comparison, DWARV, evaluation, field-programmable gate array (FPGA), high-level synthesis (HLS), LegUp, survey.

I. INTRODUCTION

CLOCK frequency scaling in processors stalled in the middle of the last decade, and in recent years, an alternative approach for high-throughput and energy-efficient processing is based on heterogeneity, where designers integrate software processors and application-specific customized hardware for acceleration, each tailored toward specific tasks [1]. Although specialized hardware has the potential to provide huge acceleration at a fraction of a processor's energy, the main drawback is related to its design. On one hand, describing these components in a hardware description language (HDL) (e.g., VHSIC hardware description language (VHDL) or Verilog) allows the designer to adopt existing tools for register transfer level (RTL) and logic synthesis into the target technology. On the other hand, this requires the designer to specify functionality at a low level of abstraction, where cycle-by-cycle behavior is completely specified. The use of such languages requires advanced hardware expertise, besides being cumbersome to develop in. This leads to longer development times that can critically impact the time-to-market.

An interesting solution to realize such heterogeneity and, at the same time, address the time-to-market problem is the combination of reconfigurable hardware architectures, such as field-programmable gate arrays (FPGAs), and high-level synthesis (HLS) tools [2]. FPGAs are integrated circuits that can be configured by the end user to implement digital circuits. Most FPGAs are also reconfigurable, allowing a relatively quick refinement and optimization of a hardware design with no additional manufacturing costs. The designer can modify the HDL description of the components and then use an FPGA vendor toolchain for the synthesis of the bitstream to configure the FPGA. HLS tools start from a software programmable high-level language (HLL) (e.g., C, C++, and SystemC) to automatically produce a circuit specification in HDL that performs the same function. HLS offers benefits to software engineers, enabling them to reap some of the speed and energy benefits of hardware, without actually having to build up hardware expertise. HLS also offers benefits to hardware engineers, by allowing them to design systems faster at a high level of abstraction and rapidly explore the design space. This is crucial in the design of complex systems [3] and especially suitable for FPGA design, where many alternative implementations can be easily generated, deployed onto the target device, and compared. Recent developments in the FPGA industry, such as Microsoft's application of FPGAs in Bing search acceleration [4] and the forthcoming acquisition of Altera by Intel, further underscore the need for using FPGAs as computing platforms with high-level software-amenable design methodologies. HLS has also been recently applied to a variety of applications (e.g., medical imaging, convolutional neural networks, and machine learning), with significant benefits in terms of performance and energy consumption [5]. Although HLS tools seem to efficiently mitigate the problem of creating the hardware description, automatically generating

Manuscript received May 20, 2015; revised August 6, 2015; accepted November 21, 2015. Date of publication December 30, 2015; date of current version September 7, 2016. This paper was recommended by Associate Editor Nagaraj NS.
R. Nane, V.-M. Sima, and K. Bertels are with the Delft University of Technology, Delft 2625 NW, The Netherlands (e-mail: [email protected]; [email protected]).
C. Pilato was with the Politecnico di Milano, Milan 20133, Italy, and he is now with Columbia University, New York, NY 10027 USA.
J. Choi, B. Fort, A. Canis, Y. T. Chen, H. Hsiao, S. Brown, and J. Anderson are with the University of Toronto, Toronto, ON M5S 3G4, Canada (e-mail: [email protected]).
F. Ferrandi is with the Politecnico di Milano, Milan 20133, Italy.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCAD.2015.2513673
0278-0070 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Fig. 1. Classification of high-level synthesis tools based on the input language.

hardware from software is not easy, and a wide range of different approaches have been developed. One approach is to adapt the HLL to a specific application domain (e.g., dataflow languages for describing streaming applications). These HLS tools can leverage dedicated optimizations or microarchitectural solutions for the specific domain. However, the algorithm designer, who is usually a software engineer, has to understand how to properly update the code. This approach is usually time-consuming and error-prone. For this reason, some HLS tools offer complete support for a standard HLL, such as C, giving complete freedom to the algorithm designer.

Understanding the current HLS research directions, the different HLS tools available, and their capabilities is a difficult challenge, and a thoughtful analysis covering all these aspects is lacking in the literature. For example, [6] was a small survey of existing HLS tools with a static comparison (on criteria such as the documentation available or the learning curve) of their features and user experience. However, the tools were not applied to benchmarks, nor were the results produced by the tools compared. Indeed, given an application to be implemented as a hardware accelerator, it is crucial to understand which HLS tool better fits the characteristics of the algorithm. For this reason, we believe that it is important to have a comprehensive survey of recent HLS tools and research directions, as well as a systematic method to evaluate different tools on the same benchmarks in order to analyze the results.

In this paper, we present a thorough analysis of HLS tools, current HLS research thrusts, as well as a detailed way to evaluate different state-of-the-art tools (both academic and commercial) on performance and resource usage. The three academic tools considered are the Delft workbench automated reconfigurable VHDL generator (DWARV) [7], BAMBU [8], and LegUp [9]—tools under active development in three institutions and whose developers are co-authoring this paper. The contributions of this paper are as follows.
1) A thorough evaluation of past and present HLS tools (Section II).
2) A description of HLS optimizations and problems wherein active research is underway (Section III).
3) The first-published comprehensive in-depth evaluation (Section IV) and discussion (Section V) of selected commercial and academic HLS tools in terms of performance and area metrics.

This analysis shows that industry and academia are closely progressing together toward efficient methods to automatically design application-specific customized hardware accelerators. However, several research challenges remain open.

II. OVERVIEW OF HIGH-LEVEL SYNTHESIS TOOLS

In this section, we present an overview of academic and commercial HLS tools. The presentation is organized according to a classification of the design input language, as shown in Fig. 1. We distinguish between two major categories, namely tools that accept domain-specific languages (DSLs) and tools that are based on general-purpose programmable languages (GPLs). DSLs are split into new languages invented specially for a particular tool-flow and GPL-based dialects, which are languages based on a GPL (e.g., C) extended with specific constructs to convey hardware information to the tool. Under each category, the corresponding tools are listed in green, red, or blue fonts, where green represents a tool that is in use, red implies the tool is abandoned, and blue implies N/A, meaning that no information is currently known about its status. Furthermore, the bullet type, defined in the figure's legend, denotes the target application domain for which the tool can be used. Finally, tool names that are underlined in the figure represent tools that also support SystemC as input. Using DSLs or SystemC raises challenges for the adoption of HLS by software developers. In this section, due to space limitations, we describe only the unique features of each tool. For general information (e.g., target application domain, support for floating/fixed-point (FP) arithmetic, and automatic testbench generation), the reader is referred to Table I. We first introduce the academic HLS tools evaluated in this study, before moving on to highlight features of other HLS tools available in the community (either commercial or academic).

A. Academic HLS Tools Evaluated in This Study

DWARV [7] is an academic HLS compiler developed at the Delft University of Technology. The tool is based on the CoSy commercial compiler infrastructure [10] developed by ACE. The characteristics of DWARV are directly related to the advantages of using CoSy, which are its modular and robust back-end, and that it is easily extensible with new optimizations.

BAMBU [8] is an academic HLS tool developed at the Politecnico di Milano and first released in 2012. BAMBU is able to generate different Pareto-optimal implementations to trade off latency and resource requirements, and to support hardware/software partitioning for designing complex heterogeneous platforms. Its modular organization allows the evaluation of new algorithms, architectures, or synthesis methods. BAMBU leverages the GNU compiler collection's many compiler-based optimizations and implements a novel memory architecture to efficiently support complex constructs of the C language (e.g., function calls, pointers, multidimensional arrays, and structs) [11]. It is also able to support different data types, including FP arithmetic, in order to generate optimized micro-architectures.

LegUp [9] is a research compiler developed at the University of Toronto, first released in 2011 and currently in its fourth public release. It accepts a C-language program as input and operates in one of two ways: 1) it synthesizes the entire C program to hardware or 2) it synthesizes the program to a hybrid system comprising a processor (a microprocessor without interlocked pipeline stages (MIPS) soft processor or ARM) and one or more hardware accelerators. In the latter flow, the user designates which C functions to implement as accelerators. Communication between the MIPS/ARM and accelerators is through Altera's memory-mapped on-chip bus interface (LegUp is designed specifically to target a variety of Altera FPGA families). For hardware synthesis, most of the C language is supported, with the exception of dynamic memory allocation and recursion. LegUp is built within the open-source low level virtual machine (LLVM) compiler framework [12]. Compared to other HLS tools, LegUp has several unique features. It supports Pthreads and OpenMP, where parallel software threads are automatically synthesized into parallel-operating hardware. Automated bitwidth reduction can be invoked to shrink datapath widths based on compile-time (static) variable range and bitmask analysis. Multi-cycle path analysis and register removal are also supported, wherein LegUp eliminates registers on some paths permitted more than a single cycle to complete, generating constraints for the back-end of the toolflow accordingly.

B. Other HLS Tools

CyberWorkBench [13] is a set of synthesis, verification, and simulation tools intended for system-level design. The tool input is the behavioral description language (BDL), i.e., a superset of the C language extended with constructs to express hardware concepts. Examples of such constructs are user-defined bitwidths for variables, synchronization, explicit clock boundary specification, and concurrency.

Bluespec Compiler (BSC) [14] is a tool that uses Bluespec System Verilog (BSV) as the design language. BSV is a high-level functional HDL based on Verilog and inspired by Haskell, where modules are implemented as a set of rules using Verilog syntax. The rules are called guarded atomic actions and express behavior in the form of concurrently cooperating finite state machines (FSMs) [15]. Using this language, and implicitly the BSC tool, requires developers with specific expertise.

PipeRench [16] was an early reconfigurable computing project. The PipeRench compiler was intended solely for generating reconfigurable pipelines in stream-based media applications. The source language is a dataflow intermediate language, which is a single-assignment language with C operators. The output of the tool is a bitstream to configure the reconfigurable pipeline in the target circuit.

HercuLeS [17] is a compiler that uses an N-address code intermediate representation, which is a new typed-assembly language created by a front-end available through GCC Gimple. The work deals only with complete applications targeting FPGAs.

CoDeveloper [18] is the HLS design environment provided by Impulse Accelerated Technologies. Impulse-C is based on a C-language subset to which it adds communicating sequential processes (CSP)-style extensions. These extensions are required for parallel programming of mixed processor and FPGA platforms. Because the basic principle of the CSP programming model consists of processes that have to be independently synchronized and streams for interprocess communication, the application domain is limited primarily to image processing and streaming applications.

DK Design Suite [19] uses Handel-C as the design language, which is based on a rich subset of the C language extended with hardware-specific language constructs. The user, however, needs to specify timing requirements and to describe the parallelization and synchronization segments in the code explicitly. In addition, the mapping of data to different memories has to be performed manually. Because of these language additions, the user needs advanced hardware knowledge.

Single-Assignment C (SA-C) [20] is a compiler that uses a C language variant in which variables can be set only once. This work provided the inspiration for the later Riverside optimizing compiler for configurable computing (ROCCC). The language introduces new syntactical constructs that require application rewriting.

The Garp [21] project's main goal was to accelerate loops of general-purpose software applications. It accepts C as input and generates hardware code for the loop.

The Napa-C [22] project was one of the first to consider high-level compilation for systems which contain both a microprocessor and reconfigurable logic. The NAPA C compiler, implemented in the Stanford University Intermediate Format and targeting National Semiconductor's NAPA1000 chip, performed semantic analysis of the pragma-annotated program and co-synthesized a conventional program executable for the processor and a configuration bitstream.

In eXCite [23], communication channels have to be inserted manually to describe the communication between the software and hardware. These channels can be streaming, blocking, or indexed (e.g., for handling arrays). Different types of

communication between the software and hardware parts (e.g., streaming and shared memory) are possible.

TABLE I
OVERVIEW OF HIGH-LEVEL SYNTHESIS TOOLS

The ROCCC [24] project focused mainly on the parallelization of heavy-compute-density applications having little control. This restricts its application domain to streaming applications, and it means that the input C is limited to a subset of the C language. For example, only perfectly nested loops with fixed stride, operating on integer arrays, are allowed.

Catapult-C [25] is a commercial HLS tool initially oriented toward the application-specific integrated circuit (ASIC) hardware developer; however, it now targets both FPGAs and ASICs. It offers flexibility in choosing the target technology and external libraries, setting the design clock frequency, and mapping function parameters to either register, random-access memory, ROM, or streaming interfaces.

C-to-Silicon (CtoS) [26], offered by Cadence, offers support for both control- and dataflow applications. Since it accepts SystemC as input, it is possible to accurately specify different interface types, from simple function array parameters to cycle-accurate transmission protocols.

SPARK [27] was targeted to multimedia and image processing applications along with control-intensive microprocessor functional blocks. The compiler-generated synthesizable VHDL can be mapped to both ASICs and FPGAs.

The C to Hardware Compiler [28] generates hardware to be offloaded onto an application-specific processor core, for which verification has to be done manually by loading and executing the generated design on an Altium Desktop NanoBoard NB2DSK01.

A distinct feature of the GAUT [29] project is that, besides the processing accelerator, it can generate both communication and memory units. A testbench is also automatically generated to apply stimuli to the design and to analyze the results for validation purposes. Fixed-point arithmetic is supported through the Mentor Graphics Algorithmic C class library.

Trident [30] is a compiler that is an offshoot of an earlier project called Sea Cucumber [31]. It generates VHDL-based accelerators for scientific applications operating on FP data, starting from a C-language program. Its strength is in allowing users to select FP operators from a variety of standard libraries, such as FPLibrary and Quixilica, or to import their own.

C2H [32] was an HLS tool offered by Altera Corporation since 2006. The tool is technology dependent, generating accelerators that can only communicate via an Altera Avalon bus with an Altera NIOS II configurable soft processor. Furthermore, using this tool required advanced hardware design knowledge in order to configure and connect the accelerators to the rest of the system—tasks performed in Altera's development environment.

Synphony C [33], formerly PICO [34], is an HLS tool for hardware DSP design offered by Synopsys. The tool can support both streaming and memory interfaces and allows performance-related optimizations to be fine-tuned (e.g., loop unrolling and loop pipelining). FP operations are not permitted, but the programmer can use fixed-point arithmetic. Comparison results published by BDTi [35] showed that performance and area metrics for Synphony-produced circuits are comparable with those obtained with AutoESL (the product that became Vivado HLS when acquired by Xilinx).

The goal of the MATCH [36] software system was to translate and map MATLAB code to heterogeneous computing platforms for signal and image processing applications. The MATCH technology was later transferred to a startup company, AccelChip [37], bought in 2006 by Xilinx but discontinued in 2010. The tool was one of the few on the market that started from a MATLAB input description to generate VHDL or Verilog. A key feature of the product was the automatic conversion of FP to fixed point.

The CHiMPS compiler [38] targets applications for high performance. The distinctive feature of CHiMPS is its many-cache, which is a hardware model that adapts the hundreds of small, independent FPGA memories to the specific memory needs of an application. This allows for many simultaneous memory operations per clock cycle to boost performance.

DEFACTO [39] is one of the early design environments that proposed hardware/software co-design solutions as an answer to increasing demands for computational power. DEFACTO is composed of a series of tools, such as a profiler, partitioner, and software and hardware compilers, to perform fast design space exploration (DSE) given a set of design constraints.

MaxCompiler [40] is a data-flow-specific HLS tool. The compiler accepts MaxJ, a Java-based language, as input and generates synthesizable code for the hardware data-flow engines provided by Maxeler's hardware platform.

The Kiwi [41] programming library and its associated synthesis system generate FPGA co-processors (in Verilog) from C# programs. Kiwi allows the programmer to use parallel constructs such as events, monitors, and threads, which are closer to hardware concepts than classical software constructs.

Sea Cucumber [31] is a Java-based compiler that generates electronic design interchange format netlists and adopts the standard Java thread model, augmented with a communication model based on CSP.

Cynthesizer [42], recently acquired by Cadence, includes formal verification between RTL and gates, power analysis, and several optimizations, such as support for FP operations with IEEE-754 single/double precision.

Vivado HLS [43], formerly AutoPilot [44], was developed initially by AutoESL until it was acquired by Xilinx in 2011. The new, improved product, which is based on LLVM, was released in early 2013, and includes a complete design environment with abundant features to fine-tune the generation process from HLL to HDL. C, C++, and SystemC are accepted as input, and hardware modules are generated in VHDL, Verilog, and SystemC. During the compilation process, it is possible to apply different optimizations, such as operation chaining, loop pipelining, and loop unrolling. Furthermore, different parameter mappings to memory can be specified. Streaming or shared memory type interfaces are both supported to simplify accelerator integration.

III. HIGH-LEVEL SYNTHESIS (HLS) OPTIMIZATIONS

HLS tools feature several optimizations to improve the performance of the accelerators. Some of them are borrowed from the compiler community, while others are specific to hardware design. In this section, we discuss these HLS optimizations, which are also current research trends for the HLS community.

A. Operation Chaining

Operation chaining is an optimization that performs operation scheduling within the target clock period. This requires the designer to "chain" two combinational operators together in a single cycle in a way that false paths are avoided [46]. Concretely, if two operations are dependent in the data-flow graph and they can both complete execution in a time smaller than the target clock period, then they can be scheduled in the same cycle; otherwise, at least two cycles are needed to finish execution, along with a register for the intermediate result. Generally, chaining reduces the number of cycles in the schedule, improving performance and reducing the global number of registers in the circuit. However, this is highly technology dependent and requires an accurate characterization of the resource library (see Section III-E).

B. Bitwidth Analysis and Optimization

Bitwidth optimization is a transformation that aims to reduce the number of bits required by datapath operators. This is a very important optimization because it impacts all nonfunctional requirements (e.g., performance, area, and power) of a design, without affecting its behavior. Differently from general-purpose processor compilers, which are designed to target a processor with a fixed-size datapath (usually 32 or 64 bits), a hardware compiler can exploit specialization by generating custom-size operators (i.e., functional units) and registers. As a direct consequence, we can select the minimal number of bits required for an operation and/or storage of the specific algorithm, which in turn leads to minimal space used for registers, and smaller functional units that translate into less area, less power, and shorter critical paths. However, this analysis cannot usually be completely automated, since it often requires specific knowledge of the algorithm and the input data sets.

C. Memory Space Allocation

FPGAs contain multiple memory banks in the form of distributed block RAMs (BRAMs) across the device. This allows the designer to partition and map software data structures onto dedicated BRAMs in order to implement fast memory accesses at low cost. As a result, the scheduler can perform multiple memory operations in one cycle once it is able to statically determine that they access different memories in the same cycle without contention. This feature is similar to the allocation of different memory spaces used in the embedded systems domain. Using multiple BRAMs increases the available parallelism. On the other hand, these memory elements have a limited number of memory ports, and the customization of memory accesses may require the creation of an efficient multi-bank architecture to avoid limiting the performance [47].

D. Loop Optimizations

Hardware acceleration is particularly important for algorithms with compute-intensive loops. Loop pipelining is a key performance optimization for loops implemented in hardware. This optimization exploits loop-level parallelism by allowing a loop iteration to start before the completion of its predecessor, provided that data dependencies are satisfied. The concept is related to software pipelining [48], which has been widely applied in very long instruction word processors. A key concept in loop pipelining is the initiation interval (II), which is the number of clock cycles between successive loop iterations. For high throughput, it is desirable that the II be as small as possible, ideally one, which implies that a new loop iteration is started every cycle. Achieving the minimum II for a given design can be impeded by two factors: 1) resource constraints and 2) loop-carried dependencies. Regarding resource constraints, consider a scenario where a loop body contains three load operations and one store operation. In this case, achieving an II less than two is impossible if the memory has two ports, since each loop iteration has four memory operations. For this reason, loop optimizations are frequently combined with a multi-bank architecture to fully exploit the parallelism [47]. With respect to loop-carried dependencies, if a loop iteration depends on a result computed in a prior iteration, that data dependency may restrict the ability to reduce the II, as it may be necessary to delay commencing an iteration until its dependent data has been computed.

For example, DWARV leverages CoSy to implement loop pipelining. The heuristic applied is based on swing modulo scheduling [49], which considers operation latencies between loop instructions to move conflicting instructions and reduce the II. However, due to the high availability of resources in FPGAs, the loop pipelining algorithm for hardware generation can be relaxed. This can be accomplished by fixing the II to a

[…] each operation (e.g., arithmetic or non-arithmetic), its operand types (e.g., integer and float), and its bit-width. At this stage, some operations may benefit from specific optimizations. For example, multiplications or divisions by a constant are typically transformed into operations that use only shifts and adds [54], [55] in order to improve area and timing. All these characteristics are then used during the module allocation phase, where the resulting operations are associated with functional units contained in the resource library [46]. This heavily impacts the use of resources and the timing of the resulting circuit. Hence, the proper composition of such a library and its characterization is crucial for efficient HLS.

The library of functional units can be quite rich and may contain several implementations for each single operation. On one hand, the library usually includes resources that are specific to the technology provider (e.g., the FPGA vendor). Some of these resources may leverage vendor-specific intrinsics or IP generators. The module allocation will exploit resources that have been explicitly tailored and optimized for the specific target. This is usually adopted by HLS tools that are specific to some FPGA vendors (e.g., [43]). The library may also contain resources that are expressed as templates in a standard HDL (i.e., Verilog or VHDL). These templates can be retargeted and customized based on characteristics of the target technology, as in FloPoCo [56]. In this case, the underlying logic synthesis tool can determine the best architecture to implement each function. For example, multipliers can be mapped either on dedicated DSP blocks or implemented with look-up tables (LUTs).

To perform aggressive optimizations, each component of the library needs to be annotated with information useful during the entire HLS process, such as resource occupation and latency for executing the operations. There are several approaches to library characterization. The first approach performs a rapid logic synthesis during the scheduling and binding of the operations to determine the most suitable candidate resources, like in Cadence's C-to-Silicon [57]. However,
desired value, i.e., based on a required design throughput, and this approach has a high cost in terms of computation time,
then generating enough hardware (e.g., registers and functional especially when the HLS is repeatedly performed for the
units) to accommodate the particular II. same target. An alternative approach is to precharacterize all
Recent research has focused on loop pipelining for nested resources in advance, as done in BAMBU [8]. The performance
loops. Consider, for example, a doubly nested loop whose out- estimation starts with a generic template of the functional unit,
ermost loop (with induction variable i) iterates 100 times, and which can be parametric with respect to bitwidths and pipeline
whose innermost one (with induction variable j) iterates up to i stages. Latency and resource occupation are then obtained by
times. The iteration space traversed by i and j can be viewed as synthesizing each configuration and storing the results in the
a polyhedron (in this case, a triangle) and analytically analyzed library. Mathematical models can be built on top of these
with the polyhedral model [50]. Applying loop transformations actual synthesis values [58], [59]. Additionally, this infor-
(e.g., exchanging the outer and inner loop) result in differ- mation can also be coupled with delays obtained after the
ent polyhedra and potentially different IIs. Polyhedral-based place-and-route phase. This may improve the maximum fre-
optimizations have been applied to synthesize memory archi- quency and the design latency and it makes the HLS results
tectures [51], improve throughput [52], and optimize resource more predictable [60].
usage [53].
F. Speculation and Code Motion
E. Hardware Resource Library Most HLS scheduling techniques can extract parallelism
In the process of HLS, in order to generate an efficient only within the same control region (i.e., the same CDFG basic
implementation that meets timing requirements while min- block). This can limit the performance of the resulting accel-
imizing the use of resources, it is essential to determine erator, especially in control-intensive designs. Speculation
how to implement each operation. Specifically, the front- is a code-motion technique that allows operations to be
end phase first inspects the given behavioral specification moved along their execution traces, possibly anticipating
and identifies operations characteristics, such as the type of them before the conditional constructs that control their

Authorized licensed use limited to: Nottingham Trent University. Downloaded on November 12,2024 at 10:26:02 UTC from IEEE Xplore. Restrictions apply.
NANE et al.: SURVEY AND EVALUATION OF FPGA HLS TOOLS 1597

execution [61]–[64]. A software compiler is less likely to use this technique since, in a sequential machine, it may delay the overall execution with computations that are unnecessary in certain cases. In hardware, however, speculated operations can often be executed in parallel with the rest of the operations. Their results will simply be maintained or discarded based on later-computed branch outcomes.

G. Exploiting Spatial Parallelism

A primary mechanism through which hardware may provide higher speed than a software implementation is by instantiating multiple hardware units that execute concurrently (spatial parallelism). HLS tools can extract fine-grained instruction-level parallelism by analyzing data dependencies, and loop-level parallelism via loop pipelining. It is nevertheless difficult to automatically extract large amounts of coarse-grained parallelism, as the challenges therein are akin to those faced by an auto-parallelizing software compiler. A question that arises, therefore, is how to specify hardware parallelism to an HLS tool whose input is a software programming language. With many HLS tools, a designer synthesizes an accelerator and then manually writes RTL that instantiates multiple instances of the synthesized core, steering input/output data to/from each, accordingly. However, this approach is error-prone and requires hardware expertise. An alternative approach is to support the synthesis of software parallelization paradigms. LegUp supports the HLS of pthreads and OpenMP [65], which are two standard ways of expressing parallelism in C programs, widely used by software engineers. The general idea is to synthesize parallel software threads into an equal number of parallel-operating hardware units. With pthreads, a user can express both task and data-level spatial parallelism. In the former, each hardware unit may be performing a different function, and in the latter, multiple hardware units perform the same function on different portions of an input data set. LegUp also supports the synthesis of two standard pthreads synchronization constructs: 1) mutexes and 2) barriers. With OpenMP, the authors have focused on supporting the aspects of the standard that target loop parallelization, e.g., an N-iteration loop with no loop-carried dependencies can be split into pieces that are executed in parallel by concurrently operating hardware units. An interesting aspect of LegUp is the support for nested parallelism: threads forking threads. Here, the threads initially forked within a program may themselves fork other threads, or contain OpenMP parallelization constructs. A limitation of the LegUp work is that the number of parallel hardware units instantiated must exactly match the number of software threads forked, since in hardware there is no support for context switching.

Altera has taken a different approach with their OpenCL SDK [66], which supports HLS of OpenCL programs. The OpenCL language is a variant of C and is heavily used for parallel programming of graphics processing units (GPUs). With OpenCL, one generally launches hundreds or thousands of threads that are relatively fine-grained, for example, each computing a vector dot product, or even an individual scalar multiplication. Altera synthesizes OpenCL into a deeply pipelined FPGA circuit that connects to an x86-based host processor over peripheral component interconnect express. The support for OpenCL HLS allows Altera to compete directly with GPU vendors, who have been gaining traction in the high-performance computing market.

H. If-Conversion

If-conversion [67] is a well-known software transformation that enables predicated execution, i.e., an instruction is executed only when its predicate or guard evaluates to true. The main objective of this transformation is to schedule in parallel instructions from disjoint execution paths created by selective statements (e.g., if statements). The goals are twofold. First, it increases the number of parallel operations. Second, it facilitates pipelining by removing control dependencies within the loop, which may shorten the loop body schedule. In software, this leads to a 34% performance improvement, on average [68]. However, if-conversion should be enabled only when the branches have a balanced number of cycles required to complete execution. When this is not the case, predicated execution incurs a slowdown in execution time because, if the shorter branch is taken, useless instructions belonging to the longer unselected branch will need to be checked before executing useful instructions can be resumed. Therefore, different algorithms have been proposed to decide when it is beneficial to apply if-conversion and when the typical conditional jump approach should be followed. For example, in [69], a generic model to select the fastest implementation for if-then-else statements is proposed. This selection is done according to the number of implicated if statements as well as the balance characteristics. The approach for selecting if-conversion on a case-by-case basis changes for hardware compilers generating an FPGA hardware circuit. This is because resources can be allocated as needed (subject to area or power constraints), and therefore, we can schedule branches in a manner that does not affect the branch-minimal schedule. Data and control instructions can be executed in parallel, and we can insert "jumps" to the end of the if-statement to short-cut the execution of (useless) longer branch instructions when a shorter path is taken. This was demonstrated in [70] (incorporated in DWARV), which proposed a lightweight if-conversion scheme adapted for hardware generation. Furthermore, the work showed that such a lightweight predicative scheme is beneficial for hardware compilers, with performance always at least as good as when no if-conversion is enabled.

IV. EVALUATION OF HIGH-LEVEL SYNTHESIS TOOLS

In this section, we define a common environment to evaluate four HLS tools, one commercial and three academic: DWARV, BAMBU, and LegUp. With LegUp, we target the fastest speedgrade of the Altera Stratix V family [71], while with the other tools we target the fastest speedgrade of the Xilinx Virtex-7 family [72]. Both Stratix V and Virtex-7 are 28 nm state-of-the-art high-performance FPGAs fabricated by TSMC. The primary combinational logic element in both architectures is a dual-output 6-input LUT.

Three metrics are used to evaluate circuit performance: maximum frequency (Fmax in MHz), cycle latency (i.e., the number of clock cycles needed for a benchmark to complete the computation), and wall-clock time (minimum clock period × cycle latency). Clock period (and the corresponding Fmax) is extracted from post-routing static timing analysis.

1598 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 35, NO. 10, OCTOBER 2016

Cycle latency is evaluated by simulating the resulting RTL circuits using ModelSim (discussed further below). We do not include in the evaluations the HLS tool execution times, as this time is negligible in comparison with the synthesis, mapping, placement, and routing time.

To evaluate area, we consider logic, DSP, and memory usage. For logic area, in Xilinx devices, we report the total number of fracturable 6-LUTs, each of which can be used to implement any single function of up to six variables, or any two functions that together use at most five distinct variables. For Altera, we report the total number of used adaptive logic modules (ALMs), each of which contains one fracturable 6-LUT that can implement any single function of up to six variables, any two four-variable functions, a five- and a three-variable function, and several other dual-function combinations. With respect to DSP usage, we consider the DSP units in Altera and Xilinx devices to be roughly equivalent (Xilinx devices contain hardened 25 × 18 multipliers, whereas Altera devices contain hardened 18 × 18 multipliers). For memory, we report the total number of dedicated blocks used (e.g., BRAMs), which are equivalent to 18 Kb in Virtex-7 and 20 Kb in Stratix V.

A. Potential Sources of Inaccuracy

Although we have endeavored to make the comparison between tools as fair as possible, we discuss potential sources of inaccuracy to better understand the results. First, the evaluated HLS tools are built within different compilers (e.g., BAMBU is built within GCC, LegUp within LLVM, and DWARV within CoSy), and target different FPGA devices. It is thus impossible to perfectly isolate variations in circuit area and speed attributable to the HLS tools versus other criteria. Each compiler framework has a different set of optimizations that execute before HLS, with potentially considerable impact on HLS results. Likewise, we expect that Altera's RTL and logic synthesis, placement, and routing are different than those within Xilinx's tool. Moreover, while the chosen Virtex-7 and Stratix V are fabricated in the same TSMC process, there are differences in the FPGA architecture itself. For example, as mentioned above, the fracturable 6-LUTs in Altera FPGAs are more flexible than the fracturable 6-LUTs in Xilinx FPGAs, owing to the Altera ALMs having more inputs. This will impact the final resource requirements for the accelerators. Finally, although we have selected the fastest speedgrade for each vendor's device, we cannot be sure whether the fraction of die binned in Xilinx's fastest speedgrade is the same as that for Altera, because this information is kept proprietary by the vendors.

Other differences relate to tool assumptions, e.g., about memory implementation. For each benchmark kernel, some data are kept local to the kernel (i.e., in BRAMs instantiated within the module), whereas other data are considered "global," kept outside the kernel and accessed via a memory controller. As an example, in LegUp, the data are considered local when, at compile time, they are proven to be accessed solely within the kernel (e.g., an array declared within the kernel itself and used as a scratch pad). The various tools evaluated do not necessarily make the same decisions regarding which data is kept local versus global. The performance and area numbers reported reflect the kernel itself and do not include the global memory. The rationale behind this decision is to focus the results on the HLS-generated portion of the circuit, rather than on the integration with the rest of the system.

TABLE II
BENCHMARK CHARACTERISTICS AND TARGET FREQUENCIES FOR OPTIMIZED FLOW (MHZ)

B. Benchmark Overview

The synthesized benchmark kernels are listed in Table II, where we mention in the second and third columns the application domain of the corresponding kernel, as well as its source. Most of the kernels have been extracted from the C-language CHStone benchmark suite [73], with the remainder being from DWARV and BAMBU. The selected functions originate from different application domains and are control-flow as well as data-flow dominated, as we aim at evaluating generic (nonapplication-specific) HLS tools.

An important aspect of the benchmarks used in this paper is that input and golden output vectors are available for each program. Hence, it is possible to "execute" each benchmark with the built-in input vectors, both in software and also in HLS-generated RTL using ModelSim. The RTL simulation permits extraction of the total cycle count, as well as enables functional correctness checking.

C. HLS Evaluation

We performed two sets of experiments to evaluate the compilers. In the first experiment, we executed each tool in a "push-button" manner using all of its default settings, which we refer to as standard-optimization. The first experiment thus represents what a user would see running the HLS tools "out of the box." We used the following default target frequencies: 250 MHz for BAMBU, 150 MHz for DWARV, and 200 MHz for LegUp. For the commercial tool, we decided to use a default frequency of 400 MHz. In the second experiment, we manually optimized the programs and constraints for the specific tools (by using compiler flags and code annotations to enable various optimizations) to generate performance-optimized implementations. Table III lists for each tool the optimizations enabled in this second experiment. As we do not have access to the source of the commercial tool, its list is based on observations made through the available options and on inspection of the generated code. The last four


columns of Table II show the HLS target frequencies used for the optimized experiment. It should be noted that there is no strict correlation between these and the actual post-place-and-route frequencies obtained after implementing the designs (shown in Tables IV and V), due to the vendor-provided back-end tools that perform the actual mapping, placing, and routing steps. This is explained by the inherently approximate timing models used in HLS. The target frequency used as input to the HLS tools should be regarded only as an indication of how much operation chaining can be performed. As a rule of thumb, in order to implement a design at some frequency, one should target a higher frequency in HLS.

TABLE III
OPTIMIZATIONS USED [LETTER IN () REFERS TO SUBSECTIONS IN SECTION III]. V: USED; X: UNUSED

TABLE IV
STANDARD-OPTIMIZATION PERFORMANCE RESULTS. Fmax IS REPORTED IN MHz, WALL-CLOCK IN µs

Table IV shows performance metrics (e.g., number of cycles, maximum frequency after place and route, and wall-clock time) obtained in the standard-optimization scenario, while Table V shows the same performance metrics obtained in the performance-optimized scenario. The error (ERR) entries denote errors that prevented us from obtaining complete results for the corresponding benchmarks (e.g., a compiler segmentation error). Observe that geometric mean data are included in the bottom rows. Two rows of geomean are shown: the first includes only those benchmarks for which all tools were successful; the second includes all benchmarks, and is shown for BAMBU and LegUp. In the standard-optimization results in Table IV, we see that the commercial tool is able to achieve the highest Fmax; BAMBU implementations have the lowest cycle latencies; and BAMBU and LegUp deliver roughly the same (and lowest) average wall-clock time. However, we also observe that no single tool delivers superior results for all benchmarks. For example, while DWARV does not provide the lowest wall-clock time on average, it produced the best results (among the academic tools) for several benchmarks, including aes_decrypt and bellmanford.

For the performance-optimized results in Table V, a key takeaway is that performance is drastically improved when the constraints and source code input to the HLS tools are tuned. For the commercial tool, geomean wall-clock time is reduced from 37.1 to 19.9 µs (1.9×) in the optimized results. For BAMBU, DWARV, and LegUp, the wall-clock time reductions in the optimized flow are 1.6×, 1.7×, and 2×, respectively, on average (comparing values in the GEOMEAN row of the table). It is interesting that, for all the tools, the average performance improvements in the optimized flow were roughly the same. From this, we conclude that one can expect a ∼1.6–2× performance improvement, on average, from tuning the code and constraints provided to HLS. We also observe that, from the performance angle, the academic tools are comparable to the commercial tool. BAMBU and LegUp, in particular, deliver superior wall-clock time to the commercial tool, on average.

For completeness, the area-related metrics are shown in Tables VI and VII for the standard and optimized flows, respectively. Comparisons between LegUp and the other tools are more difficult in this case, owing to architectural differences between Stratix V and Virtex-7. Among the flows that target Xilinx, the commercial HLS tool delivers considerably more compact implementations than the academic tools (much smaller LUT consumption), since we anticipate it implements more technology-oriented optimizations. For all flows (including LegUp), we observe that, in the performance-optimized flow, more resources are used to effectively improve performance.


TABLE V
PERFORMANCE-OPTIMIZED RESULTS. Fmax IS REPORTED IN MHz, WALL-CLOCK IN µs

TABLE VI
STANDARD-OPTIMIZATION AREA RESULTS

V. DISCUSSION FROM THE TOOL PERSPECTIVE

In this section, we describe the results for the academic HLS tools from a tool-specific viewpoint and highlight techniques used to improve performance in each tool.

A. Bambu

BAMBU leverages GCC to perform classical code optimizations, such as loop unrolling and constant propagation. To simplify the use of the tool for software designers, its interface has been designed such that the designer can use the same compilation flags and directives that would be given to GCC. In the standard-optimization case, the compiler optimization level passed to GCC is -O3, without any modifications to the source code of the benchmarks. In the performance-optimized study, the source code was modified only in the case of sobel, where we used the same version modified by the LegUp team. Loop unrolling was used for adpcm, matrix, and sha. On three benchmarks (gsm, matrix, and sobel), GCC vectorization produced a better wall-time, while function inlining was useful for gsm, dfadd, dfsin, aes encrypt, and decrypt.

BAMBU's front-end phase also implements operation transformations that are specific for HLS, e.g., by transforming multiplications and divisions, which are usually very expensive in hardware. BAMBU maps 64-bit divisions onto a C library function implementing the Newton–Raphson algorithm for integer division. This leads to a higher number of DSPs required by dfdiv and dfsin in the standard-optimization case. BAMBU also supports FP operations, since it interfaces with the FloPoCo library [56].

All functional units are precharacterized for multiple combinations of target devices, bit-widths, and pipeline stages. Hence, BAMBU implements a technology-aware scheduler to perform aggressive operation chaining and code motion.


TABLE VII
PERFORMANCE-OPTIMIZED AREA RESULTS

This reduces the total number of clock cycles, while respecting the given timing constraint. Trimming of the address bus was useful for bellmanford, matrix, satd, and sobel.

Finally, BAMBU adopts a novel architecture for memory accesses [11]. Specifically, BAMBU builds a hierarchical datapath directly connected to a dual-port BRAM whenever a local aggregate or a global scalar/aggregate data type is used by the kernel and whenever the accesses can be determined at compile time. In this case, multiple memory accesses can be performed in parallel. Otherwise, the memories are interconnected so that it is also possible to support dynamic resolution of the addresses. Indeed, the same memory infrastructure can be natively connected to external components (e.g., a local scratch-pad memory or cache) or directly to the bus to access off-chip memory. Finally, if the kernel has pointers as parameters, it assumes that the objects referred to are allocated on dual-port BRAMs.

The optimized results obtained for blowfish and jpeg are the same as those obtained in the first study, since we were not able to identify different options to improve the results.

B. DWARV

Since DWARV is based on CoSy [10], one of its main advantages is the flexibility to easily exploit standard and custom optimizations. The framework contains 255 transformation and optimization passes available in the form of stand-alone engines. For the standard-evaluation experiment, the most important optimizations that DWARV uses are if-conversion, operation chaining, multiple memories, and a simple (i.e., analysis based only on standard integer types) bit-width analysis. For the performance-optimized runs, pragmas were added to enable loop unrolling. However, not all framework optimizations are yet fully integrated in the HLS flow. One of the DWARV restrictions is that it does not support global variables. As a result, the CHStone benchmarks, which rely heavily on global variables, had to be rewritten to transform global variables into function parameters passed by reference. Besides the effort needed to rewrite code accessing global memory, some global optimizations across functions are not considered. Another limitation is a mismatch between the clock period targeted by operation chaining and the selection of IP cores in the target technology (e.g., for a divider unit), which are not (re)generated on request based on a target frequency. Operation chaining is set to a specific target frequency for each benchmark (as shown in Table II). However, this can differ significantly from what is achievable within the instantiated IP cores available in DWARV's IP library, as shown for example in the dfxxx kernels. DWARV targets mostly small and medium-size kernels. It thus generates a central FSM and always maps local arrays to distributed logic. This is a problem for large kernels such as the jpeg benchmark, which could not be mapped in the available area on the target platform. Another minor limitation is the transformation, in the compiler back-end, of switch constructs into if-else constructs. Generating lower-level switch constructs would improve the aes, mips, and jpeg kernels, which contain multiple switch statements.

C. LegUp

Several methods exist for optimizing LegUp-produced circuits: automatic LLVM compiler optimizations [12], user-defined directives for activating various hardware-specific features, and source code modifications. Since LegUp is built within LLVM, users can utilize LLVM optimization passes with minimal effort. In the context of hardware circuits, for the performance-optimized runs, function inlining and loop unrolling provided benefits across multiple benchmarks. Function inlining allows the hardware scheduler to exploit more instruction-level parallelism and simplify the FSM. Similarly, loop unrolling exposes more parallelism across loop iterations. The performance boost associated with inlining and unrolling generally comes at the cost of increased area.

LegUp also offers many hardware optimizations that users can activate by means of Tcl directives, such as activating loop pipelining or changing the target clock period. Loop pipelining allows consecutive iterations of a loop to begin execution before the previous iteration has completed, reducing the overall number of clock cycles. Longer clock periods permit more chaining, reducing cycle latency. If the


reduction in cycle latency does not exceed the amount by [8] C. Pilato and F. Ferrandi, “Bambu: A modular framework for the high
which the clock period lengthens, wall-clock time will be also level synthesis of memory-intensive applications,” in Proc. FPL, Porto,
Portugal, 2013, pp. 1–4.
improved. [9] A. Canis et al., “LegUp: High-level synthesis for FPGA-based proces-
Manual source code modifications can be made to assist sor/accelerator systems,” in Proc. ACM FPGA, Monterey, CA, USA,
L EG U P in inferring parallelism within the program. One 2011, pp. 33–36.
such modification is to convert single-threaded execution [10] ACE-Associated Compiler Experts. CoSy. [Online]. Available:
http://www.ace.nl, accessed Jan. 13, 2016.
to multithreaded execution using pthreads/OpenMP, whereby [11] C. Pilato, F. Ferrandi, and D. Sciuto, “A design methodology to imple-
L EG U P synthesizes the multiple parallel threads into parallel ment memory accesses in high-level synthesis,” in Proc. IEEE/ACM
hardware accelerators. This optimization was applied for all CODES+ISSS, Taipei, Taiwan, 2011, pp. 49–58.
[12] C. Lattner and V. Adve, “LLVM: A compilation framework for life-
of the df benchmarks. In the df benchmarks, a set of inputs long program analysis & transformation,” in Proc. IEEE/ACM CGO,
is applied to a kernel in a data-parallel fashion—there are no San Jose, CA, USA, 2004, pp. 75–88.
dependencies between the inputs. Such a situation is particu- [13] K. Wakabayashi and T. Okamoto, “C-based SoC design flow and EDA
larly desirable for L EG U P’s multithreading synthesis: multiple tools: An ASIC and system vendor perspective,” IEEE Trans. Comput.-
Aided Design Integr. Circuits Syst., vol. 19, no. 12, pp. 1507–1522,
identical hardware kernels are instantiated, each operating in Dec. 2000.
parallel on disjoint subsets of the input data. [14] BlueSpec Inc. High-Level Synthesis Tools. [Online]. Available:
http://bluespec.com/high-level-synthesis-tools.html, accessed
Jan. 13, 2016.
VI. CONCLUSION

To the authors' knowledge, this paper represents the first broad evaluation of several HLS tools. We presented an extensive survey and categorization of past and present hardware compilers. We then described the optimizations on which recent and ongoing research in the HLS community is focused. We experimentally evaluated three academic HLS tools, Bambu, DWARV, and LegUp, against a commercial tool. The methodology aims at providing a fair comparison of the tools, even if they are built within different compiler frameworks and target different FPGA families. The results show that each HLS tool can significantly improve performance with benchmark-specific optimizations and constraints. However, software engineers need to take into account that the optimizations necessary to realize high performance in hardware (e.g., enabling loop pipelining and removing control flow) differ significantly from software-oriented ones (e.g., data reorganization for cache locality).
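As a concrete sketch of the control-flow removal mentioned above (the function names are hypothetical, and the pragma shown in a comment is a Vivado HLS-style directive used purely for illustration), a data-dependent branch can be rewritten as straight-line, predicated arithmetic so that an HLS scheduler sees a single basic block and can pipeline the loop tightly:

```c
#include <stddef.h>

/* Branchy version: the if statement introduces control flow inside
 * the loop body, which some HLS schedulers pipeline poorly. */
int sum_above_branchy(const int *a, size_t n, int threshold)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] > threshold)
            sum += a[i];
    }
    return sum;
}

/* Predicated version: the condition becomes a select feeding the
 * adder, leaving one basic block that maps to a flat datapath. */
int sum_above_predicated(const int *a, size_t n, int threshold)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* #pragma HLS pipeline II=1  (Vivado HLS-style directive) */
        sum += (a[i] > threshold) ? a[i] : 0;
    }
    return sum;
}
```

Both functions compute the same result; the second merely exposes the loop body as pure dataflow, which is the kind of rewriting a software engineer rarely performs for a CPU target but which an HLS tool rewards.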
Overall, the performance results showed that academic and commercial HLS tools are not drastically far apart in terms of quality, and that no single tool produced the best results for all benchmarks. Nevertheless, it should be noted that the commercial compiler supports more features, allowing multiple input and output languages and the customization of the generated kernels in terms of interface types, memory bank usage, throughput, etc., while at the same time being more robust than the academic tools.
NANE et al.: SURVEY AND EVALUATION OF FPGA HLS TOOLS 1603

Razvan Nane (M'14) received the Ph.D. degree in computer engineering from the Delft University of Technology, Delft, The Netherlands, in 2014.
He is a Post-Doctoral Researcher with the Delft University of Technology. He is the main developer of the DWARV C-to-VHDL hardware compiler. His current research interests include high-level synthesis for reconfigurable architectures, hardware/software codesign methods for heterogeneous systems, and compilation and simulation techniques for emerging memristor-based in-memory computing high-performance architectures.

Vlad-Mihai Sima received the M.Sc. degree in computer science and engineering from Universitatea Politehnica, Bucharest, Romania, in 2006, and the Ph.D. degree in computer engineering from the Delft University of Technology (TU Delft), Delft, The Netherlands, in 2013.
He was a Researcher with TU Delft, involved in various research related to high-level synthesis for heterogeneous architectures, notably the DWARV C-to-VHDL compiler. He is currently the Head of Research with Bluebee, Delft, a startup company focusing on providing high-performance solutions for the genomics market. His current research interests include high-performance computing, high-level synthesis, and heterogeneous architectures.

Christian Pilato (S'08–M'12) received the Ph.D. degree in information technology from the Politecnico di Milano, Milan, Italy, in 2011.
He is a Post-Doctoral Research Scientist with Columbia University, New York, NY, USA. He was a Research Associate at the Politecnico di Milano in 2013. He was one of the developers of the Bambu HLS tool. He is actively involved in the technical program committees of several electronic design automation conferences, such as Design, Automation and Test in Europe and the International Conference on Field-Programmable Logic and Applications (FPL). His current research interests include high-level synthesis with emphasis on memory aspects and system-level design of heterogeneous architectures for energy-efficient high performance, along with issues related to physical design.
Dr. Pilato is a member of ACM.
Jongsok Choi (S'12) received the B.A.Sc. degree from the University of Waterloo, Waterloo, ON, Canada, and the M.A.Sc. degree from the University of Toronto (U of T), Toronto, ON, Canada. He is currently pursuing the Ph.D. degree with the Department of Electrical and Computer Engineering, U of T.
He was with Altera, Toronto; Qualcomm, San Diego, CA, USA; Marvell Semiconductor, Santa Clara, CA, USA; and STMicroelectronics, Markham, ON, Canada. He has been the developer of the system-level functionality within LegUp, including the support for Pthreads/OpenMP-driven hardware parallelization and the automatic generation of hybrid processor/accelerator systems.

Blair Fort (S'06) received the B.A.Sc. degree in engineering science and the M.A.Sc. degree in electrical and computer engineering from the University of Toronto (U of T), Toronto, ON, Canada, in 2004 and 2006, respectively. He is currently pursuing the Ph.D. degree in electrical and computer engineering with the U of T.
He has been a Technical Staff Member with Altera Corporation's Toronto Technology Centre, Toronto, since 2006.

Andrew Canis (S'06) received the B.A.Sc. degree from the University of Waterloo, Waterloo, ON, Canada, in 2008, and the Ph.D. degree from the University of Toronto, Toronto, ON, Canada, in 2015, both in computer engineering.
He was an Intern with Altera, Toronto; Sun Microsystems Laboratories, Menlo Park, CA, USA; and Oracle Laboratories, Menlo Park, CA, USA, where he researched circuit optimization algorithms. He is currently a Co-Founder and the Chief Executive Officer of LegUp Computing Inc., Toronto. His current research interests include high-level synthesis, reconfigurable computing, embedded system-on-chip design, and electronic design automation for FPGAs.
Dr. Canis was a recipient of the Natural Sciences and Engineering Research Council of Canada Alexander Graham Bell Canada Graduate Scholarship for both his M.A.Sc. and Ph.D. studies.

Yu Ting Chen (S'15) is currently pursuing the M.A.Sc. degree with the University of Toronto, Toronto, ON, Canada.
She is currently part of the LegUp high-level synthesis team. Her current research interests include improving memory performance and reducing the memory bottleneck in LegUp-generated hardware for parallel programs.

Hsuan Hsiao (S'15) received the B.A.Sc. degree in computer engineering from the University of Toronto, Toronto, ON, Canada, in 2014, where she is currently pursuing the M.A.Sc. degree with the Department of Electrical and Computer Engineering.
Her current research interests include high-level synthesis with reduced width datapaths.

Stephen Brown (M'90) received the B.Sc.Eng. degree from the University of New Brunswick, Fredericton, NB, Canada, and the M.A.Sc. and Ph.D. degrees in electrical engineering from the University of Toronto, Toronto, ON, Canada.
He is a Professor with the University of Toronto, Toronto, ON, Canada, and the Director of the Worldwide University Program with Altera, Toronto, ON, Canada. From 2000 to 2008, he was the Director of Research and Development with the Altera Toronto Technology Center, Toronto. He has authored over 70 scientific publications and co-authored three textbooks entitled Fundamentals of Digital Logic With VHDL Design (McGraw-Hill, 2004), Fundamentals of Digital Logic With Verilog Design (McGraw-Hill, 2002), and Field-Programmable Gate Arrays (Kluwer Academic, 1992). His current research interests include computer-aided design algorithms, field-programmable very large scale integration technology, and computer architecture.
Dr. Brown was a recipient of the Canadian Natural Sciences and Engineering Research Council's 1992 Doctoral Prize for the Best Doctoral Thesis in Canada, and a number of awards for excellence in teaching.

Fabrizio Ferrandi (M'95) received the Laurea (cum laude) degree in electronic engineering and the Ph.D. degree in information and automation engineering (computer engineering) from the Politecnico di Milano, Milan, Italy, in 1992 and 1997, respectively.
He joined the faculty of the Politecnico di Milano in 2002, where he is currently an Associate Professor with the Dipartimento di Elettronica, Informazione e Bioingegneria. He has published over 100 papers. He is one of the maintainers of the Bambu HLS tool. His current research interests include synthesis, verification, simulation, and testing of digital circuits and systems.
Prof. Ferrandi is a member of the IEEE Computer Society, the Test Technology Technical Committee, and the European Design and Automation Association.

Jason Anderson (S'96–M'05) received the B.Sc. degree in computer engineering from the University of Manitoba, Winnipeg, MB, Canada, and the M.A.Sc. and Ph.D. degrees in electrical and computer engineering from the University of Toronto (U of T), Toronto, ON, Canada.
He joined the Field-Programmable Gate Array (FPGA) Implementation Tools Group, Xilinx, Inc., San Jose, CA, USA, in 1997, where he was involved in placement, routing, and synthesis. He is currently an Associate Professor with the Department of Electrical and Computer Engineering, U of T, and holds the Jeffrey Skoll Endowed Chair in Software Engineering. He has authored over 70 papers in refereed conference proceedings and journals, and holds 27 issued U.S. patents. His current research interests include computer-aided design, architecture, and circuits for FPGAs.

Koen Bertels (M'05) received the Ph.D. degree in computer information systems from the University of Antwerp, Antwerp, Belgium.
He is a Professor and the Head of the Computer Engineering Laboratory with Delft University of Technology, Delft, The Netherlands. He is a Principal Investigator with the QuTech Research Center, researching quantum computing. He has co-authored over 30 journal papers and over 150 conference papers. His current research interests include heterogeneous multicore computing, investigating topics ranging from compiler technology and runtime support to architecture.
Dr. Bertels served as the General and Program Chair for various conferences, such as FPL, the Reconfigurable Architectures Workshop, and the International Symposium on Applied Reconfigurable Computing.