fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCAD.2016.2597215, IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems
Abstract—The paper investigates the reduction of dynamic power for streaming applications yielded by asynchronous dataflow designs by using clock gating techniques. Streaming applications constitute a very broad class of computing algorithms in areas such as signal processing, digital media coding, cryptography, video analytics, network routing and packet processing, and many others. The paper introduces a set of techniques that, considering the dynamic streaming behavior of algorithms, can achieve power savings by selectively switching off parts of the circuits when they are temporarily inactive. The techniques, being independent of the semantics of the application, can be applied to any application and can be integrated into the synthesis stage of a high-level dataflow design flow. Experimental results of at-size applications synthesized on FPGA platforms demonstrate power reductions achievable with no loss in data throughput.

Endri Bezati, Simone Casale Brunet, and Marco Mattavelli are with the Laboratory SCI-STI-MM of École Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland e-mail: ([email protected])
Jorn W. Janneck is with the Department of Computer Science, Lund University, Sweden
Manuscript received April 19, 2014

I. INTRODUCTION

Power dissipation is currently the major limitation of silicon computing devices. Reducing power also has other beneficial effects: it implies less stringent needs for cooling, improved longevity, longer autonomy in the case of battery-operated devices and, obviously, lower power costs. For all these reasons, power also frequently affects the choice of the computing platform right at the outset. For example, Field-Programmable Gate Arrays (FPGAs) imply higher power dissipation per logic unit when compared to an equivalent Application-Specific Integrated Circuit (ASIC), but often compare favorably to conventional processors used for the same functional tasks.

For any silicon device, power dissipation can be partitioned into two components: a static and a dynamic component. Static power dissipation, also referred to as quiescent or standby power consumption, is the result of the leakage current of the transistors, and is also affected by the ambient temperature. By contrast, dynamic power dissipation is caused by transistors being switched and by losses of charges being moved along wires. Dynamic power dissipation increases linearly with clock frequency, due largely to the influence of parasitic capacitances. To counteract this effect, ASIC designers have employed clock gating (CG) techniques in the last twenty years [1], [2], [3].

Different strategies for optimizing power consumption on ASICs and FPGAs are discussed in Section II. The works discussed there describe the impact of a chosen technology for a given architecture, but do not describe how to reduce power at the design abstraction level. As a consequence, adding power controllers at the behavioral description design stage constitutes an additional task that has to be carried out with care to avoid introducing undesired application behaviors, and it might reduce the portability of the code (e.g., if the platform is changed during the development process). In addition, it is extremely difficult for High-Level Synthesis (HLS) approaches that are based on imperative Models of Computation (MoCs) [4] to apply power optimization solutions that can be yielded by automatic tools starting from the (imperative) behavioral description.

Conversely, dynamic dataflow [5], [6], [7] designs, such as the ones expressible using the formal RVC-CAL language, possess interesting properties that can be exploited for reducing the power consumption without affecting, by construction, the behavioral characteristics of the application. In RVC-CAL, every actor can concurrently execute processing tasks, executions might be disabled by blocking reads on inputs, and all communication among actors can occur only by means of order-preserving lossless queues. As a consequence, an actor may be stopped for a certain period if its processing tasks are idle or its output queues (buffers) are full, without impacting the overall throughput and semantic behavior of the design. In addition, the more dynamic behavior a given dataflow design exhibits, the more opportunities for power reduction it offers. This is not the case for synchronous dataflow designs, which always consume and produce a fixed amount of data tokens. Thus, a synchronous dataflow design always dissipates a constant amount of power, in contrast to an asynchronous dataflow design. From this perspective, consider the techniques that transform intrinsically dynamic algorithms into static versions, such as the ones implemented by static dataflow MoCs to derive analytically guaranteed bounds or for other analyzability purposes. In general, these transformations are done by introducing dummy tokens that guarantee constant rates. Thus, in terms of power optimization, such approaches are inefficient.

This paper is organized as follows: in Section II, previous works on clock gating are briefly introduced. Section III describes in detail the clock-gating strategy and how it is applied to a dataflow design. In Section IV, experimental results are presented, and conclusions are finally drawn in Section V.

II. RELATED WORK

Globally Asynchronous Locally Synchronous (GALS) based systems consist of several locally synchronous components which communicate with each other asynchronously.
0278-0070 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Works on GALS can be separated into three categories: partitioning, communication devices, and dedicated architectures. Dataflow design modeling, exploration, and optimization for GALS-based designs have been studied previously by several authors. Shen et al. [8] proposed a design and evaluation framework for modeling application-specific GALS-based dataflow architectures for cyclo-static applications, where system performance, e.g. throughput, is taken into account during optimization. Similarly, Wuu et al. [9] and Ghavami et al. [10] proposed a method for automatic synthesis of asynchronous digital systems. These two approaches were developed for [...] design objective was to optimize the mapping of an application [...]

[...] switching off its clock will not have an impact on design throughput. Even though RVC-CAL dataflow designs are used for the behavioral description, this clock gating strategy is more general and can be applied to any system that represents the execution of a process communicating through asynchronous FIFO buffers. The queues should be asynchronous to guarantee lossless communication when an actor is clock gated and when a design has differing input clock domains.

[Figure 1: Actor A placed between an input Queue and an output Queue, together with its Clock Enabler Circuit. Queue ports: CLK R, CLK W, F, AF. Actor port: CLK.]
[...] full; input AF, for almost full; and output EN, for enable. The AF input becomes active high when there is only one space left on its FIFO queue. Its finite state machine (FSM) has 5 states S = {INIT, SPACE, AFULL_DISABLE, FULL, AFULL_ENABLE}. The controller starts in the INIT state and maintains the EN output port at active high until F and AF become active low.

The active high EN is maintained during the SPACE state. As a queue becomes almost full, the state changes to AFULL_DISABLE. In this state, the EN output passes to active low. A conservative approach is taken in this state because the BUFGCE disables the output clock on the high-to-low edge. The clock enables entering the BUFGCE should be synchronized to the input clock. Once the queue becomes full, the controller maintains EN at active low. When a token is consumed from the queue, the controller passes to the AFULL_ENABLE state and activates the clock. Then, depending on whether the buffer becomes full or is no longer almost full, the state changes to either the FULL or the SPACE state.

Strategy: The user can choose a mapping configuration that indicates which actors should be clock gated. To do so, an attribute is given to each actor. If an actor has been selected for clock gating, the F and AF signals of all of its output FIFO queues are connected to a clock enabler controller. Output queues can be connected through a fanout or directly to a queue. In the first case, the controller results are connected to an AND logic port. This is a safe approach in the case that one of the queues in the fanout is full. In this case, the fanout should command the actor not to produce a token. In the second case, if an actor's outputs are connected directly to queues without a fanout, the results should be connected to an OR logic port, as the next actor may need to consume a certain number of tokens to output a token. Gating the actor as soon as one of these queues fills could otherwise lead the system to lock due to the unavailability of data. In the third case, if there is a combination of outputs with and without a fanout, then an n-input OR logic port is inserted. Figure 3 depicts these configurations.

A pseudo-template of the clock enabler circuit is given in Template 1. This template generates a Verilog file that takes into account the different cases described previously. These situations are detected and generated automatically as described in the "always" clause. A flip-flop (created by the always clause) is connected between the BUFGCE and the final OR or AND port. Thus, clock glitches are eliminated and the clock enabling is runt-free. The final output of the clock gating is a new clock that is connected to the actor, its fanouts, and its queues' write and read clocks (CLK W and CLK R, respectively), as visualized in Figure 1.

Template 1: Clock Enabler circuit module creation

    module clock_enabler
    Input : actor
    Input : enable
    Input : clk_in
    Input : reset
    Input : ∀P^out_almost_full
    Input : ∀P^out_full
    Output: clk_out
    for p in ∀P^out do
        wire ["sizeof(p.fanout)":0] "nameof(p)"_enable;
    reg clock_enable;
    wire buf_enable;
    for p in ∀P^out do
        for idx in sizeof(p.fanout) do
            controller c_"nameof(p)"_"idx"(
                .almost_full("nameof(p)"_almost_full["idx"]),
                .full("nameof(p)"_full["idx"]),
                .enable("nameof(p)"_enable["idx"]),
                .clk(clk_in),
                .reset(reset));
    always @(posedge clk_in) begin
        clock_enable <= for p in ∀P^out SEPARATOR "|" do
            if sizeof(p.fanout) > 1 then
                for idx in sizeof(p.fanout) SEPARATOR "&" do
                    "nameof(p)"_enable["idx"]
            else
                "nameof(p)"_enable
    end
    assign buf_enable = enable ? clock_enable : 1;
    BUFGCE clock_enabling (.I(clk_in), .CE(buf_enable), .O(clk_out));
    endmodule

IV. EXPERIMENTAL RESULTS

In this section, the power reduction gain of the aforementioned methodology is evaluated by applying it to a video decoder design. In [16], the reader can find a variety of RVC-CAL applications for dataflow programs. One of these applications is the Intra MPEG-4 Simple Profile (SP) decoder. Due to restrictions on the number of clock buffers in Xilinx FPGAs, the selected design was refactored to result in 32 actors.

Test Design: The Intra MPEG-4 SP description contains 32 actors and is a 4:2:0 decoder separated into 8 processing blocks: four components for luminance (Y) and two each for chrominance (U and V). The parser block includes the syntactical bitstream parser and the variable length decoding process; the Tex Y, U, and V blocks (for texture) implement the residual decoding (AC-DC prediction, inverse scanning, inverse quantization, and IDCT); and the MOT Y, U, and V blocks realize the motion compensation stage (framebuffer, interpolation, and residual error addition). Due to the nature of the experiments, the motion compensation blocks contain only the residual error addition actor. By using the TURNUS profiler, a close-to-minimum queue size [17] for each queue in the decoder is determined.

Experimental Flow: For the experimental evaluation, a Virtex-7 XC7VX485T-2 FPGA (VC707 Evaluation Kit) was used. The HDL code of the decoder was generated by Xronos and synthesized with the Xilinx XST synthesizer. Following synthesis, placing and routing were applied to produce a final netlist. This netlist was then simulated with ModelSim to extract the switching activity information (SAIF file) of the design. The Xilinx XPower Analyzer was then used to determine power consumption, using the design netlist, the design constraints, and the simulation activity SAIF file as inputs. All of the results given have a high confidence level, meaning that at least 97% of the design nets are found within the SAIF file. Table I shows the synthesis results of the Intra MPEG-4 SP decoder with and without clock gating. This example demonstrates that the clock-gated decoder uses more slices than the non-clock-gated one. Even [...]
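The controller FSM described in Section III can be illustrated with a small behavioral model. The following Python sketch is not the generated Verilog; it only mirrors the five named states and the EN output. It assumes that AFULL_DISABLE is entered when AF goes high, and the transitions the text does not spell out (for example, AFULL_DISABLE returning to SPACE when both F and AF are low, or SPACE going directly to FULL) are assumptions made for this illustration. The names `next_state` and `simulate` are hypothetical.

```python
from enum import Enum, auto


class State(Enum):
    INIT = auto()
    SPACE = auto()
    AFULL_DISABLE = auto()
    FULL = auto()
    AFULL_ENABLE = auto()


# States in which the controller drives EN active high (clock running).
EN_HIGH = frozenset({State.INIT, State.SPACE, State.AFULL_ENABLE})


def next_state(state, full, almost_full):
    """Return the next controller state for the sampled F and AF inputs."""
    if state is State.INIT:
        # Leave INIT only once both F and AF are low.
        return State.INIT if (full or almost_full) else State.SPACE
    if state is State.FULL:
        # A consumed token clears F, so the clock is re-enabled.
        return State.FULL if full else State.AFULL_ENABLE
    # Remaining states: SPACE, AFULL_DISABLE, AFULL_ENABLE.
    if full:
        return State.FULL
    if almost_full:
        # Assumption: AFULL_ENABLE holds while one space is left;
        # SPACE and AFULL_DISABLE keep (or start) gating the clock.
        if state is State.AFULL_ENABLE:
            return State.AFULL_ENABLE
        return State.AFULL_DISABLE
    # Assumption: with neither F nor AF set there is room again.
    return State.SPACE


def simulate(samples, state=State.INIT):
    """Apply (F, AF) samples in order; return (state name, EN) per step."""
    trace = []
    for full, almost_full in samples:
        state = next_state(state, full, almost_full)
        trace.append((state.name, state in EN_HIGH))
    return trace
```

Feeding the model a queue that fills, stays full, and then drains shows EN dropping in AFULL_DISABLE and FULL and rising again in AFULL_ENABLE.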
[Fig. 3: Clock Enabler Circuit configurations. Each controller has inputs AF, F, and CLK and output EN; the combined enable passes through a flip-flop and drives a BUFGCE. (a) Single output port with a fanout. (b) Two different output ports. (c) A single output port with a fanout and another output port.]
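The way the per-queue enables are combined in the three configurations can be sketched in a few lines of Python. This is only an illustration of the combination rule stated in the text and in Template 1 (AND across the branches of a fanout, OR across distinct output ports), not the generated hardware. The function names and the dictionary layout (port name mapped to a list of EN bits, one per fanout branch) are hypothetical.

```python
def port_enable(branch_enables):
    """AND of the controller EN bits of one output port (one bit per fanout branch)."""
    if not branch_enables:
        raise ValueError("an output port needs at least one queue")
    return all(branch_enables)


def clock_enable(ports):
    """OR across output ports of the per-port enables."""
    if not ports:
        raise ValueError("at least one output port is required")
    return any(port_enable(bits) for bits in ports.values())


def buf_enable(gating_selected, ports):
    """Mirror of 'enable ? clock_enable : 1': an ungated actor keeps its clock."""
    return clock_enable(ports) if gating_selected else True


# (a) one port with a fanout of two: a single full branch stops the clock
case_a = clock_enable({"out": [True, False]})
# (b) two ports without fanout: one port with space keeps the clock running
case_b = clock_enable({"a": [False], "b": [True]})
# (c) one fanout port plus one plain port
case_c = clock_enable({"a": [True, False], "b": [True]})
```

In the circuit, this combined value is registered by a flip-flop before reaching the BUFGCE; the sketch models only the combinational part.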
Clock Gating     Disabled (mW)   Enabled (mW)
Actors clocks    58              43
Clocks           94              80
Logic            25              24
Signals          42              41
Leakage          242             242
Total            403             387

TABLE II: Power consumption of the Intra MPEG-4 SP decoder when the clock gating is disabled/enabled.

[Fig. 4 plot: power versus Throttle %, from 0 to 90.]
Fig. 4: Power consumption of overall clocks, the signals, logic, and the total dynamic power consumption of the Intra MPEG 4 SP decoder when its output is throttled from 0% to 90%.
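The figures in Table II and the percentages quoted in Section IV-A can be cross-checked with simple arithmetic. The Python sketch below takes dynamic power to be total power minus leakage, and treats the "Actors clocks" row as a subset of "Clocks", since the totals only add up that way. Both are readings of the table rather than statements from the text. The 106 mW value is the throttled dynamic power quoted in Section IV-A, and all names are hypothetical.

```python
# Table II values in mW ("Actors clocks" is left out: it is part of "Clocks").
TABLE_II_MW = {
    "disabled": {"clocks": 94, "logic": 25, "signals": 42, "leakage": 242},
    "enabled": {"clocks": 80, "logic": 24, "signals": 41, "leakage": 242},
}
# Dynamic power of the clock-gated decoder at 90% throttling (Section IV-A).
THROTTLED_DYNAMIC_MW = 106


def total_power(breakdown):
    """Sum of all components of one Table II column."""
    return sum(breakdown.values())


def dynamic_power(breakdown):
    """Total power minus the static (leakage) component."""
    return total_power(breakdown) - breakdown["leakage"]


def reduction_percent(before, after):
    """Relative saving, in percent, when going from 'before' to 'after'."""
    return 100.0 * (before - after) / before


dyn_off = dynamic_power(TABLE_II_MW["disabled"])  # 161 mW
dyn_on = dynamic_power(TABLE_II_MW["enabled"])    # 145 mW
saving_vs_gated = reduction_percent(dyn_on, THROTTLED_DYNAMIC_MW)
saving_vs_ungated = reduction_percent(dyn_off, THROTTLED_DYNAMIC_MW)
```

Under these assumptions the two savings round to 27% and 34%, matching the values reported for the throttled decoder.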
A. Power saving efficiency over decoder throttling

As described in Table I, the maximum decoder throughput rate is 350 frames per second for a QCIF image (176x144 pixels). For the experiment, the decoder is throttled such that it decodes only 30 images per second for two resolutions, QCIF and CIF (352x288 pixels).

Figure 5b reports the power consumption and the activation rate for each actor's clock (found on the Intra MPEG-4 [...]

[...] when its output is throttled from 0% to 90%. As demonstrated, the total dynamic power consumption has decreased from 145 mW to 106 mW, a power reduction of 27%. Compared to the non-clock-gated decoder, the dynamic power has been reduced by 34%. Figure 5 reports the power consumption of each clock and its activation rate when throttled. From this graph, the data of 15 actors has been removed because their activation rate is more than 99%. All actor clock activation [...]
[Fig. 5 plots, one bar per actor clock: (a) Actors clock power consumption (y-axis: Power Consumption, mW, 0 to 4); (b) actor clock activation rate (y-axis: Activation Rate, %, up to 100).]

[...] way to recover power otherwise lost in "idle" cycles. As a result, this technique is particularly interesting in applications with dynamically varying performance requirements, when designing to a particular performance point is impossible, and when power consumption is deemed costly.

Further investigations into clock gating should consider more aggressive control logic, whereby control is given to each individual actor, allowing greater flexibility to actor inactivity. Furthermore, it will be necessary to develop tools [...] of clock domains for more efficient implementations. Lastly, additional considerations could be given to controlling clock speed and, possibly, voltage transitions.

REFERENCES