
3S-1

Design and Verification Using High-Level Synthesis


Andres Takach
[email protected]
Mentor Graphics Corporation
8005 S.W. Boeckman Rd
Wilsonville, OR 97070 USA

978-1-4673-9569-4/16/$31.00 ©2016 IEEE

Abstract - The adoption of HLS has been driven by the need to tackle growing verification costs in traditional RTL design flows. This paper presents an overview of design, optimization and verification using HLS. It also outlines some of the requirements for HLS design to fit into existing design and verification flows and ways in which such flows might be adapted as HLS is more widely deployed.

I. Introduction

High-level synthesis has been in use in industry for a number of years. Its adoption has been driven primarily by the advantages that raising the level of abstraction of design has on reducing the ever increasing costs of functional verification.

Adoption of HLS has been most widespread for the creation of new complex subsystems of SoC designs where time to market is an important consideration. Often such new subsystems are driven by new and evolving standards such as standards for wireless communication, image and video encoding. One of the recent publicly available successes of HLS is in the design of the VP9 video decoder [1]. Video decoders for the H.264 standard and its successor the HEVC [2] standard have also been designed using HLS. It is worth noting that HLS has enabled more aggressive schedules in incorporating more capabilities (e.g., profiles in video decoders) than would have been possible with manual RTL design methodologies. Users that have adopted it now view HLS as important to their competitiveness in the market. The success on such complex blocks is also driving HLS to be applied to blocks that have existing RTL IP, by writing synthesizable high-level descriptions for them. This is done to get better hardware for new technology nodes and to get the verification benefits of raising the level of abstraction.

High-level synthesis tools currently used in industry use a specification of the behavior or algorithm written in C++ or SystemC and generate RTL specifications that can be used in existing design flows. Throughout this paper, the term C++ specification will be used to cover both pure C++ and SystemC specifications. Section II provides an overview of important considerations when coding the C++ specification. Section III provides an overview of the micro-architectural high-level controls that are provided in HLS to generate an RTL specification that meets the design goals.

Traditional design and verification methodologies based on manual creation of the RTL have continued to evolve over the years and there is an expectation that HLS works within and supports existing RTL design methodologies. In addition there is an expectation that more tools and methodologies should be developed around the high-level specification in C++. Sections IV, V and VI cover some of the ongoing challenges and trends on the way HLS is getting deployed and how design methodologies may evolve around HLS.

While verification has been the main driver for adopting HLS, the need to design within certain power budgets is increasingly important. The highest power savings are obtained from the design decisions taken at the highest abstraction levels. Exploration of the design space can lead to significant power reductions as the best architecture is not usually evident. While an existing RTL IP block can be reused for a smaller technology node, the best results are obtained by tailoring the architecture to the target technology node.

A standard power optimization used for RTL is clock gating. During the HLS process, sequential analysis is done that enables it to reduce the conditions under which registers are enabled. By doing so, the switching activity of the datapath may also be reduced, and HLS generates the RTL so that downstream synthesis flows can readily perform clock gating to reduce power. Section VIII presents results obtained on power reduction with this low-power HLS feature.

II. Coding C++ for Synthesis

In general, it is best to start from the simplest and most compact code rather than code that has been optimized for software execution. It is also important to understand the hardware implications of different C++ constructs. There are two main areas to consider to create the C++ input that captures hardware intent:

• Perform numerical refinement if necessary. Floating point variables need to be replaced by bit-accurate fixed-point or floating-point variables tailored with the minimal bitwidth characteristics that preserve the numerical performance of the algorithm. For example, the AC Datatype package [3] provides integer, fixed-point, floating-point and complex datatypes to facilitate making the numerical refinement while keeping the abstraction level of the description.

• Consider how arrays are accessed and re-write the
description to capture the memory architecture that is required to minimize hardware. For example, image applications go over an image by operating on a one-dimensional or two-dimensional window whose center point slides over consecutive pixels. A window C++ class can encapsulate the buffering of array accesses for window operations. Array accesses are thus minimized, resulting in a much better hardware implementation than from a generic description that has not been written for that hardware intent.

Modelling of concurrent communicating processes (blocks) can be done in pure C++ using a Kahn process network (KPN) modelling style. This style is quite compact and retains a high level of abstraction. An alternative is to use the thread or method processes in SystemC. Modular IO interfaces provide an encapsulation of cycle and pin accurate IO SystemC interfaces for synthesis while also providing a transaction (TLM) view for simulation. Such interfaces provide a way to separate out the cycle accurate communication so that the rest of the behavior can retain its high level of abstraction.

III. Architectural Synthesis

Architectural choices have a great impact on the performance and area of a design. They also have a great impact on power, though often in less predictable ways than for area and performance. Exploring different choices and measuring their impact is greatly facilitated by HLS.

A. Hierarchy Boundaries

Hierarchy provides a way to both divide the design into more manageable pieces and to model concurrency in a more explicit way. A function that is called repeatedly can be extracted out in its own hierarchy. For example, in Catapult [4], such a block with a simple IO protocol is called a CCORE and may be either combinational or sequential, with or without state. It is possible to characterize the timing of such a block using the target ASIC or FPGA RTL synthesis tool to get an estimate that is consistent with the RTL tool’s optimizations. The use of such a hierarchy increases the capacity, reduces runtime and helps get better quality of results by making it easier for the designer to focus on the larger picture.

A methodology supported in RTL synthesis is bottom-up design: sub-blocks are synthesized and brought in as part of a larger design. This is a design flow that is also expected in HLS and has been recently productized [4]. A number of alternative implementations for a hierarchy block can be created with different interfaces and later used as part of a larger design.

B. Interface Synthesis

Interface synthesis converts the way the C++ function communicates with the outside world. During synthesis, the best interface can be selected to transfer the data at the rate that is required to meet a certain performance. For example, an array that is accessed sequentially in the behavior can be transferred four elements at a time to increase performance. In SystemC, interfaces are determined by ports on the module and alternative transfer widths need to be coded explicitly in the C++ under the control of template parameters or macro defines.

C. Variable/Array Mapping and Memory Architecture

Power, area and performance are highly dependent on the memory architecture. It is important to code the C++ with hardware intent to achieve the best results. Within the same C++, there are many choices that can be selected during synthesis to actually define the memory architecture of the design.

Interface or local variables or arrays may be mapped to memories or may be split into registers. Smaller arrays are typically mapped to registers while larger arrays are typically mapped to memories. The required read and write bandwidth of the memory depends on the algorithm and the performance requirements on the design.

D. Loop Unrolling

Partially or fully unrolling a loop exposes parallelism that exists across subsequent loop iterations. In some cases, partial unrolling may also be used in a coordinated way with memory mapping and interface synthesis to increase the effective bandwidth for data transfer. For example, unrolling may expose the possibility of accessing even and odd elements of an array as one memory word when the array is mapped to memory.

E. Loop Pipelining

Loop pipelining provides a way to increase the throughput of a loop (or decrease its overall latency) by initiating the (i+1)th iteration of the loop before the ith iteration has completed. Overlapping the execution of subsequent iterations of a loop exploits parallelism across loop iterations. In many cases loop pipelining may improve the resource utilization, thus increasing the performance/area metric of the design.

The pipeline initiation interval (II) is the number of cycles a pipeline stage takes to complete before initiating the execution for the next loop iteration. It is common to use II=1 for highest performance, but other II values are used depending on the application. For example, a loop may be partially unrolled so the new body has two copies of the body of the original loop and an II of two may be used. Partially unrolling a loop may help expose some optimization potential across loop iterations. For example, it could expose the possibility to merge two arithmetic operators into a more optimized implementation. It may also expose behavior that is specific to even and odd loop iterations and better optimize the copies of the loop body.

Loop pipelining can be applied at any loop level. The process from a SystemC process (e.g., SC_THREAD) or a
function body that is synthesized as a module is considered the top level loop for that process. That loop can be pipelined in which case the loop will ramp up (fill all its pipeline stages) and continue execution accepting new inputs and producing new outputs. Alternatively, one or more inner loops may be pipelined. The pipeline ramps up to fill all stages and ramps down as each stage for the last iteration of the loop is completed.

A loop that contains other loops is pipelined by first flattening the nested loop into a single loop. Synthesis automates this transformation so there is no need for a manual rewrite of the behavior.

F. Loop Merging

Two loops that are sequentially adjacent may be merged into a single loop that executes the same behavior. The merged loop can result in a hardware execution with less latency. It can also save hardware. For example, if the first loop fills up an array that is consumed by the second loop, the lifetime of that array will be reduced and may in fact go away altogether.

G. Technology Library

While the target technology node might be a given, there are cases where a new technology node was not seen to provide a good benefit for the cost and the design was re-optimized for an older technology. That exploration would not have been possible in a manual RTL creation flow. Within the same technology node, choices of low or high Vth can also be considered.

H. Clock Frequency

The choice of clock frequency can result in significant power savings. It is possible to change the clock frequency in conjunction with other architectural choices such as the pipelining initiation interval to get the same overall design throughput.

Hierarchy can be used to tailor the clock period to the specific data rate transfer of each block of the design. In DSP designs the input and output data rate for a block may be quite different. A good example is decimation. For a streaming design, blocks downstream and upstream from a block that have different input and output data rates could be clocked at different clock frequencies.

I. Scheduling

Scheduling transforms the untimed behavior (or partially timed in SystemC designs) into an architecture with a well defined cycle-by-cycle behavior. It takes into account required synthesis directives such as the clock frequency and the target technologies. In addition, it takes into account cycle and resource constraints that are either explicitly provided by the user or implied by interface synthesis directives, variable/array mapping directives and loop pipelining/unrolling directives. Scheduling selects among combinational, sequential and pipelined components that implement the operations in the algorithm.

IV. Verification in an HLS Design Flow

The source input to HLS is written in C++ in a far more abstract form than an RTL implementation. The high level of abstraction enables a designer to run orders of magnitude more vectors than on an RTL implementation. The net effect is significant savings in verification because functional issues are caught earlier in the design cycle.

In an HLS design flow, the HLS tool creates the RTL from the high-level C++ specification in a way that preserves bit-accurate behavior equivalence under a set of assumptions of how interfaces get mapped and scheduled. Either RTL simulation or sequential formal equivalence checking [6] can be used to verify that the expected correct-by-construction property is not invalidated by an issue in the HLS tool. Until formal equivalence becomes commonplace, a full simulation or emulation based verification is still required for the generated RTL. Nonetheless, there is a verification saving as compared to manually created RTL since many verification cycles to debug functional bugs in the RTL are avoided.

A. Validating the C++ Input Description

Validating the input description is needed to create the initial design that is correct and to verify that it remains correct after numerical refinement (e.g., float to fixed-point) and after any rewrites that are meant to make the input more suitable for HLS.

Simulation is the most commonly used testing methodology for the C++ specification. The verification is done at different levels: it may start with a block and/or with sets of blocks in a subsystem. For example, a video decoding subsystem can run many frames of video to verify that the decoding is working correctly. The speed of the simulation allows regression testing of the design with far fewer compute resources and in a far shorter time than what is required for an equivalent RTL design.

As the complexity of systems increases, it also becomes more difficult to have a high degree of confidence that the test vectors applied are sufficient to cover all the interesting scenarios. Generic code coverage is used as a metric in C++ for software development but lacks important features that have been developed for measuring coverage in RTL as described in Section IV.C.

Formal tools can be used to check properties of the source C++ to make sure there is no ill-defined behavior such as out-of-bound accesses of arrays.

B. Verifying the Generated RTL

Ideally verification of the design would be done fully in C++ and formal verification tools would prove that the RTL specification generated by HLS is functionally equivalent to the C++ design.

Sequential equivalence checking [6] has shown promise to formally verify that the generated RTL is functionally equivalent to the C++ specification. As it gets deployed for
block-level verification it promises to reduce some of the simulation-based verification requirements to more narrowly focus on testing the correctness of the integration of the blocks and subsystems in the SoC. Formal techniques will likely evolve to target aspects of integration such as correctness of protocols, deadlock-free operation etc.

Until formal techniques can address verification of the full SoC, traditional RTL verification methodologies will still be used to varying degrees to cover at least some verification aspects. Currently RTL verification methodologies typically include a mixture of simulation and emulation.

C. RTL Coverage Metrics

One of the challenges with verification is knowing if the input vectors that are used cover all the interesting scenarios and would therefore expose a functional issue in the design. Coverage metrics on the RTL are used as an indicator of the quality of the testbench. Most verification engineers expect that the set of vectors that achieve full structural and functional coverage in the C++ specification also result in the same coverage when applied to the RTL specification generated by HLS.

There are a number of challenges in coverage metrics that artificially lower the coverage obtained in the RTL specification:

• It is primarily intended to cover control conditions since fully covering arithmetic and datapath is extremely challenging. That means that the metric is sensitive to how functionality is expressed. For example, a <= 0 may deliver different coverage than the equivalent a==0 || a < 0, because in the second case hitting the case a==0 is required for full coverage, but it is not required for a <= 0. Optimizations during synthesis may inadvertently introduce points that are reported as not being covered.

• Combinational and sequential redundancies in the logic can lead to a reduction of coverage, even if the testbench could be fully exhaustive. Synthesis can take care of eliminating such redundancies, though in some cases it may need to understand don't care conditions that come from the environment. For example an input may be encoded in a 1-hot fashion and if synthesis does not account for that information then some logic redundancies may be produced by synthesis.

• A coverage hole for an RTL signal can be reported as many coverage holes on the fanout of that signal. More formal tools to relate derivative coverage holes would be useful. Identifying such unique holes would both improve the metric and facilitate finding how to enhance the vector set to cover it or to find out if a whole group of related coverage holes can be waived.

Some of the RTL verification methodologies are migrating to C++. Coverage on the C++ can be used to measure how well the testbench is exercising corner cases to see how well the source description is validated. The same input vectors can then be used to exercise the RTL. Traditional line coverage in C++ may not lead to full coverage as measured by RTL coverage tools:

• Coverage of the body of a function does not distinguish the coverage by different callers to that function. Synthesis, on the other hand, inlines functions and the coverage may be less for some call contexts.

• Loop unrolling in synthesis replicates the body of a loop and it may expose a coverage hole that did not exist in the loop body prior to loop unrolling. For example

    for (int i = 0; i < 2; i++) {
        wait();
        if (x & y)
            f(a);
    }

After full loop unrolling becomes

    wait();
    if (x & y)
        f(a);
    wait();
    if (x & y)
        f(a);

The input vectors for the second case need to be more extensive for covering the behavior for each replication of the expression x & y and of the inlined function f(a).

• Synthesis adds micro-architectural details to the design and that introduces control that needs additional vectors to cover. For example, while in a pure C++ specification the interface is a function call, the synthesized RTL can have interfaces that wait for data to become available. The additional logic for stalling the design while it is waiting for data needs to be explicitly exercised in the testbench.

One of the sources of sequential redundancies is control that is distributed between an explicit finite-state machine (FSM) and a shift register that indicates which stage has valid data in a pipeline. It is possible to have combinations of the FSM state and valid states that are not reachable. If combinational logic is built without taking into account the unreachable scenarios, then sequential redundancy will be present. A similar source of sequential redundancies results from merging of two loops. The combined loop exits when the behavior of each loop has been fully executed. The exit condition of the combined loop is a logical AND of the exit conditions of each of the merged loops. Such conditions are often correlated and could result in a sequential redundancy that will result in coverage holes for the AND gate.

As stated earlier, the way that the RTL is expressed has an impact on whether a condition is included as part of the coverage metric. For example, fixed-point datatypes provide rounding and saturation behavior:

    ac_fixed<24,24,true> y;
    ac_fixed<16,16,true,AC_TRN,AC_SAT> x = y;

will check for overflow and perform saturation accordingly. In order to check overflow it will check if any of the bits 16-23 of y is not identical to bit 15 of y (since the MSB of x is bit 15). A natural expectation for a testbench is to check whether the
condition for saturation is ever exercised. However, if the RTL specification for that check is written in terms of discrete logical gates, the coverage metrics will indicate holes unless the overflow is exercised by a difference in each of the bits.

Coverage of stalling logic can be improved by directing HLS to add an optional stalling pin to any block. This provides a way to directly force stalls to exercise behavior that may otherwise be hard to reach with a testbench.

D. Emulation

A methodology that has been adopted by some HLS users is to limit simulation based verification to integration testing and move verification to emulation earlier. Skipping a lot of simulation based functional verification of the RTL is a pragmatic approach given that there is a higher confidence with HLS that the design is functionally correct. This is a trend that will likely grow as more verification tools and methodologies are developed to target the verification of the C++ source description.

E. Prototyping

One of the benefits of an HLS design methodology is that the same C++ source can be easily retargeted to a new technology. This enables prototyping an ASIC design using FPGAs. For example a video encoding design could be evaluated using such a prototyping flow.

V. ECO Flows

In the traditional manual RTL creation methodology, ECOs are quite common. The main objective of an ECO methodology is to reduce the cost of implementing a functional change or a change to address a timing closure issue. For example, it is possible for functional verification to uncover a functional issue after place-and-route has been completed. Ideally, the change can be done with the least impact to all the work that has already been completed.

ECOs in RTL generated by HLS are quite rare relative to ECOs in manually generated RTL. Nonetheless, in order for HLS to fit into existing design flows, a similar ability to perform ECOs is generally expected by new users. In an HLS context, the change could be a functional change that needs to be done on the C++ design or a change in some synthesis directive. ECOs are facilitated by incremental synthesis flows that result in the least number of changes to the generated RTL. There are two challenges with incremental flows in HLS:

• Large designs can be done in HLS from an abstract compact description. The most effective methodology to bound changes is to divide the design using hierarchy, so any potential ECO change is isolated to a smaller subset of the design.

• A small change can be amplified if it is in code that is inlined more than once or is part of a body of a loop that is unrolled many times. As a general guideline, code that is inside of a loop that is not dependent on the loop iteration should be pre-computed before the loop. If the change happens to be required on that code, loop unrolling will not replicate the change in that code.

Incremental synthesis provides the least changes when the change is isolated within a single control step (combinational) and does not cross clock register boundaries.

VI. RTL Linting

RTL linting is designed to enforce certain styles to catch common mistakes that may lead to functional bugs or tool issues on that RTL. Unfortunately, the checks in some tools are fairly basic (pattern driven) and can raise many false alarms in RTL code that is well written. A lint-compliant rewrite of the RTL code can often seem more obfuscated and in fact be more likely to have an actual error than the original code. Lint rule decks used by different companies can be contradictory, making it impossible for HLS to generate the same RTL that passes all lint rule checks in use.

In HLS, the RTL code is machine generated and RTL linting is still done as part of the same policy that applies to manually generated RTL code even though there appears to be little value for such checking. It is likely that as HLS is further deployed, linting will be adapted to a smaller subset of lint rules that is done with more sophisticated/formal analysis.

VII. RTL Synthesis Flows

It is important for HLS to support the user’s synthesis flow. That support can be provided by component characterization flows with the target synthesis tool to obtain relevant area and delay estimates for that specific tool. Both ASIC and modern FPGA technologies and synthesis tools are supported. Memory and other complex IP can also be characterized so that HLS understands how they can be used as part of the design.

In addition to the generated RTL, the RTL synthesis scripts are produced for the target tool. Timing constraints are produced in standard format for the generated RTL.

VIII. Power Optimization

Using architectural exploration an HLS user can generate a number of designs and perform power analysis to find the design that consumes the least power. For example Catapult LP [4] has the analysis capability built-in. In addition it provides power optimization based on clock-gating. The optimizations strengthen the enables of registers by restricting the conditions under which registers are enabled:

• Extract register feedback path conditions and re-express the feedback by adding/strengthening the enable condition.

• Strengthen the enable of a register based on the enable conditions of registers that are in its input cone of logic. This is stability based enable strengthening. Additional 1-bit registers may be required to hold the enable conditions from a prior cycle.

• Strengthen the enable based on the conditional use of the
register in its fanout. This is an observability based [7][8][9] strengthening of enables that also reduces switching activity on the register and its fanout.

Table I shows the power savings achieved in industrial designs. The savings are highly dependent on the design. Designs that are pipelined and always active have less opportunity for saving power with clock gating. Designs that have communicating blocks where some blocks are stalled while waiting for data from other blocks can see more significant gains. Observability analysis delivers the most benefit when behavior is written as unconditional behavior whose result is used conditionally in a later cycle and that condition is available in an earlier cycle.

TABLE I
Estimated power savings with clock gating.

Design                    Base Power (μW)   LP Power (μW)   Savings (%)
Video Encoder 1                      2707            1338            51
FFT                                  5383            5007             7
Video Encoder 2                       966             805            17
Motion Estimation Block              1786            1630             9
Automotive                             42              37            12
JPEG                                 8935            8302             7
Image Scaler                        25021           12370            51

IX. Conclusions

This paper summarizes the current design and verification methodologies using HLS and how it fits into existing design flows. The high costs of verification in traditional manual RTL flows continue to be the main reason to move up design and verification to a higher level of abstraction. Verification is expected to be a major source of innovation as formal techniques and ways to complement them with simulation-based verification and a more rapid transition to emulation become mainstream. Some of the techniques developed for RTL such as assertions and coverage will be applied to the C++ specification.

Another driver for HLS is low-power design. As HLS facilitates reuse of the C++ IP, it enables targeting an existing C++ to a new technology node to get hardware more highly optimized for power. As results show, HLS optimizations that understand the sequential properties of the design are able to better optimize the design and deliver additional power saving.

References

[1] http://www.webmproject.org/hardware/vp9/
[2] J.-R. Ohm; W.-J. Han; T. Wiegand. “Overview of the high efficiency video coding (HEVC) standard”. IEEE Transactions on Circuits and Systems for Video Technology. Volume 22, Issue 12. Dec. 2012.
[3] Algorithmic C (AC) Datatypes. http://calypto.com/en/downloads#data-type
[4] Catapult Synthesis, Mentor Graphics Corporation, http://www.mentor.com/esl/catapult/
[5] P. Coussy and A. Takach, “Raising the abstraction level of hardware design,” IEEE Design & Test of Computers, vol. 26, no. 4, pp. 4–6, Jul.–Aug. 200.
[6] Anmol Mathur, Masahiro Fujita, Edmund Clarke, Pascal Urard, “Functional equivalence verification tools in high-level synthesis flows,” IEEE Design & Test of Computers, vol. 26, no. 4, pp. 88-95, July/August, 2009.
[7] Mitsuhisa Ohnishi, Akihisa Yamada, Hiroaki Noda, and Takashi Kambe. 1997. A method of redundant clocking detection and power reduction at RT level design. In Proceedings of the 1997 international symposium on Low power electronics and design (ISLPED '97). ACM.
[8] Babighian, Pietro, Luca Benini, and Enrico Macii. “A scalable algorithm for RTL insertion of gated clocks based on ODCs computation.” Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on 24.1 (2005): 29-42.
[9] Jason Cong, Bin Liu, Rupak Majumdar, and Zhiru Zhang. 2010. Behavior-level observability analysis for operation gating in low-power behavioral synthesis. ACM Trans. Des. Autom. Electron. Syst. 16, 1, Article 4 (November 2010).
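
Illustrative sketch for Section II (window class). The paper mentions that a window C++ class can encapsulate the buffering of array accesses so that each array element is read only once. The following is a minimal sketch of that idea in plain C++, not the author's code and not the API of any HLS library. The names `Window1D`, `shift_in` and `sum3`, the window width of three, and the choice to start with zeroed taps are all assumptions made for this example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 1-D sliding window (assumed name and interface).
// It keeps the last N samples in a small register-like array so the
// source array is read once per position instead of N times.
template <typename T, std::size_t N>
class Window1D {
public:
    // Shift the window by one position and insert the newest sample.
    void shift_in(T sample) {
        for (std::size_t i = 0; i + 1 < N; ++i) taps_[i] = taps_[i + 1];
        taps_[N - 1] = sample;
    }
    // Read tap i; tap N-1 is the newest sample.
    T operator[](std::size_t i) const {
        assert(i < N);
        return taps_[i];
    }
private:
    std::array<T, N> taps_{};  // taps start at zero (assumed edge policy)
};

// Causal 3-tap sum: out[i] = in[i-2] + in[i-1] + in[i], with samples
// before the start of the array treated as zero. 'reads' reports how
// many times the input array was accessed.
inline std::vector<int> sum3(const std::vector<int>& in, std::size_t& reads) {
    Window1D<int, 3> w;
    std::vector<int> out;
    out.reserve(in.size());
    reads = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        w.shift_in(in[i]);  // the only access to the input array
        ++reads;
        out.push_back(w[0] + w[1] + w[2]);
    }
    return out;
}
```

A generic description would index `in[i-2]`, `in[i-1]` and `in[i]` directly, which implies three array reads per output; with the window, the read count equals the array length.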

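
Illustrative sketch for Section III.E (loop pipelining). The effect of the initiation interval on loop latency can be approximated with a first-order cycle count. The model below is a common textbook approximation and an assumption of this example, not a formula given in the paper: it ignores stalls and memory conflicts, and the function names `pipelined_cycles` and `sequential_cycles` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// First-order estimate (assumed model): 'iterations' loop iterations,
// each needing 'depth' cycles from start to finish, with a new
// iteration started every 'ii' cycles.
inline std::uint64_t pipelined_cycles(std::uint64_t iterations,
                                      std::uint64_t depth,
                                      std::uint64_t ii) {
    assert(ii >= 1 && depth >= 1);
    if (iterations == 0) return 0;
    // The last iteration starts at (iterations - 1) * ii and needs
    // 'depth' more cycles for the pipeline to ramp down.
    return (iterations - 1) * ii + depth;
}

// Without pipelining, iterations execute back to back.
inline std::uint64_t sequential_cycles(std::uint64_t iterations,
                                       std::uint64_t depth) {
    return iterations * depth;
}
```

Under this model a 100-iteration loop with a 4-cycle body takes 400 cycles without pipelining and 103 cycles with II=1. If the loop is unrolled by two (50 iterations) and pipelined with II=2, and the body depth is assumed to stay at 4 cycles, the estimate is 102 cycles, which is why the two configurations can deliver similar throughput.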
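
Illustrative sketch for Section IV.C (saturation check). The overflow test described for assigning a 24-bit signed value to a 16-bit signed target compares bits 16-23 of the source with bit 15. The sketch below expresses that check with plain integers. It is not the AC datatype implementation; the function name `saturate24to16` and the optional `overflowed` output are assumptions made for this example.

```cpp
#include <cassert>
#include <cstdint>

// Saturate a 24-bit signed value (passed sign-extended in a 32-bit int)
// to 16 bits. Overflow is detected as in the text: some bit in
// positions 16..23 differs from bit 15.
inline std::int16_t saturate24to16(std::int32_t y, bool* overflowed = nullptr) {
    assert(y >= -(1 << 23) && y < (1 << 23));  // y must fit in 24 bits
    const std::uint32_t bits = static_cast<std::uint32_t>(y);
    const std::uint32_t upper = (bits >> 16) & 0xFFu;  // bits 16..23
    const bool bit15 = ((bits >> 15) & 1u) != 0;
    const std::uint32_t expected = bit15 ? 0xFFu : 0x00u;
    const bool ovf = (upper != expected);
    if (overflowed) *overflowed = ovf;
    if (!ovf) return static_cast<std::int16_t>(y);  // value fits as is
    // Clamp to the nearest representable 16-bit value.
    return static_cast<std::int16_t>(y < 0 ? INT16_MIN : INT16_MAX);
}
```

A testbench that only applies 32768 exercises the overflow condition once, through a single differing bit. If the comparison is built from discrete gates, structural coverage tools may still report holes until vectors make each of bits 16-23 differ from bit 15, which is the effect the section describes.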