
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCAD.2016.2597215, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

JOURNAL OF LATEX CLASS FILES, VOL. 11, NO. 4, DECEMBER 2015 1

Clock-gating of streaming applications for energy efficient implementations on FPGAs
Endri Bezati, Simone Casale-Brunet, Member, IEEE
Marco Mattavelli, and Jorn W. Janneck, Member, IEEE

Abstract—The paper investigates the reduction of dynamic power that clock gating techniques yield for streaming applications implemented as asynchronous dataflow designs. Streaming applications constitute a very broad class of computing algorithms in areas such as signal processing, digital media coding, cryptography, video analytics, network routing and packet processing, and many others. The paper introduces a set of techniques that, considering the dynamic streaming behavior of algorithms, can achieve power savings by selectively switching off parts of the circuits when they are temporarily inactive. The techniques, being independent of the semantics of the application, can be applied to any application and can be integrated into the synthesis stage of a high-level dataflow design flow. Experimental results for at-size applications synthesized on FPGA platforms demonstrate the power reductions achievable with no loss in data throughput.

I. INTRODUCTION

Power dissipation is currently the major limitation of silicon computing devices. Reducing power also has other beneficial effects: it implies less stringent needs for cooling, improved longevity, longer autonomy in the case of battery operated devices and, obviously, lower power costs. For all these reasons, power also frequently affects the choice of the computing platform right at the outset. For example, Field-Programmable Gate Arrays (FPGAs) imply higher power dissipation per logic unit when compared to an equivalent Application-Specific Integrated Circuit (ASIC), but often compare favorably to conventional processors used for the same functional tasks.

For any silicon device, power dissipation can be partitioned into two components: a static and a dynamic component. Static power dissipation, also referred to as quiescent or standby power consumption, is the result of the leakage current of the transistors and is also affected by the ambient temperature. By contrast, dynamic power dissipation is caused by transistors being switched and by losses of charges being moved along wires. Dynamic power dissipation increases linearly with frequency, due largely to the influence of parasitic capacitances. To counteract this effect, ASIC designers have employed clock gating (CG) techniques in the last twenty years [1], [2], [3].

Different strategies for optimizing power consumption on ASICs and FPGAs are discussed in Section II. These works describe the impact of a chosen technology for a given architecture, but do not describe how to reduce power at the design abstraction level. As a consequence, adding power controllers at the behavioral description design stage constitutes an additional task that has to be carried out with care to avoid introducing undesired application behaviors, and it might reduce the portability of the code (i.e., when the platform is changed during the development process). In addition, it is extremely difficult for high-level synthesis (HLS) approaches that are based on imperative Models of Computation (MoCs) [4] to apply power optimization solutions that can be yielded by automatic tools starting from the (imperative) behavioral description.

Conversely, dynamic dataflow [5], [6], [7] designs, such as the ones expressible using the formal RVC-CAL language, possess interesting properties that can be exploited for reducing the power consumption without affecting, by construction, the behavioral characteristics of the application. In RVC-CAL, every actor can concurrently execute processing tasks, executions might be disabled by input blocking reads, and all communication among actors occurs only by means of order-preserving lossless queues. As a consequence, an actor may be stopped for a certain period if its processing tasks are idle or its output queues (buffers) are full, without impacting the overall throughput and semantic behavior of the design. In addition, the more dynamic the behavior of a given dataflow design, the greater the opportunities for power reduction. This is not the case for synchronous dataflow designs, which always consume and produce a fixed amount of data tokens. Thus, a synchronous dataflow design always dissipates a constant amount of power, in contrast to an asynchronous dataflow design. The same holds for techniques that transform intrinsically dynamic algorithms into static versions, such as the ones implemented on static dataflow MoCs to derive analytically guaranteed bounds or for other analyzability purposes. In general, these transformations are done by introducing dummy tokens that guarantee constant rates. Thus, in terms of power optimization, such approaches are inefficient.

This paper is organized as follows: in Section II previous works on clock gating are briefly introduced. Section III describes in detail the clock-gating strategy and how it is applied to a dataflow design. In Section IV experimental results are presented, and conclusions are finally drawn in Section V.
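As a reminder of why stopping a clock saves dynamic power, the linear dependence on frequency noted in the introduction can be written with the usual first-order CMOS switching model, P = alpha * C * V^2 * f. The sketch below is a textbook approximation and not the power model used in this paper; the function names, the `clock_activation_rate` parameter, and every numeric value are hypothetical and chosen only for illustration.

```python
def dynamic_power_w(activity, capacitance_f, voltage_v, frequency_hz):
    """First-order CMOS switching power in watts: P = alpha * C * V^2 * f."""
    return activity * capacitance_f * voltage_v ** 2 * frequency_hz


def gated_dynamic_power_w(activity, capacitance_f, voltage_v, frequency_hz,
                          clock_activation_rate):
    """Same model when the clock only runs for a fraction of the cycles.

    clock_activation_rate is the fraction of cycles (0.0 to 1.0) during which
    the gated clock toggles; it is an illustrative parameter.
    """
    if not 0.0 <= clock_activation_rate <= 1.0:
        raise ValueError("clock_activation_rate must be within [0, 1]")
    return clock_activation_rate * dynamic_power_w(
        activity, capacitance_f, voltage_v, frequency_hz)


# Hypothetical values: 1 nF switched capacitance, 1.0 V supply, 50 MHz clock.
always_on = dynamic_power_w(0.5, 1e-9, 1.0, 50e6)
mostly_idle = gated_dynamic_power_w(0.5, 1e-9, 1.0, 50e6,
                                    clock_activation_rate=0.1)

if __name__ == "__main__":
    print(f"ungated: {always_on * 1e3:.2f} mW")
    print(f"gated, clock running 10% of the time: {mostly_idle * 1e3:.2f} mW")
```

In this simplified view, an actor whose clock runs only a small fraction of the time contributes a correspondingly small share of clock-tree power, which is the effect the rest of the paper measures on a real design.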
Endri Bezati, Simone Casale-Brunet, and Marco Mattavelli are with the Laboratory SCI-STI-MM of École Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland, e-mail: ([email protected]).
Jorn W. Janneck is with the Department of Computer Science, Lund University, Sweden.
Manuscript received April 19, 2014.
0278-0070 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

II. RELATED WORK

Globally Asynchronous Locally Synchronous (GALS) based systems consist of several locally synchronous components which communicate with each other asynchronously.

Works on GALS can be separated into three categories: partitioning, communication devices, and dedicated architectures. Dataflow design modeling, exploration, and optimization for GALS-based designs have been studied previously by several authors. Suhaib et al. [8] proposed a design and evaluation framework for modeling application-specific GALS-based dataflow architectures for cyclo-static applications, where system performance, e.g. throughput, is taken into account during optimization. Similarly, Wuu et al. [9] and Ghavami et al. [10] proposed methods for the automatic synthesis of asynchronous digital systems. These two approaches were developed for fine-grained dataflow graphs, where actors are primitives or combinational functions. Related to our work, the authors in [11] proposed a multiple clock domain design methodology for reducing the power consumption of dataflow programs. Their design objective was to optimize the mapping of an application while still meeting design performance requirements. This optimization was achieved by assigning each clock domain an optimized clock frequency to reduce power consumption.

III. CLOCK-GATING STRATEGY

Current FPGA families support different clock gating strategies, and each manufacturer provides its own IP for managing these different approaches. The methodology described here is based on using primitives specific to Xilinx FPGA architectures. However, this methodology can be modified to support other FPGA vendors' primitives. The remainder of this section briefly describes how clock gating techniques are implemented on Xilinx FPGAs and how an automatic clock gating strategy is realized within Xronos HLS.

A. Profile guided buffer size

The execution of a dataflow program consists of a sequence of action firings. These firings can be correlated to each other in a graph-based representation using an approach called Execution Trace Graphing (ETG). The graph is a directed acyclic graph where each node represents an action firing, and a directed arc represents either a data or a control dependency between two different action firings. The effectiveness of analyzing a dataflow program using an ETG is demonstrated in [12]. Xronos provides profiling for each firing execution in clock cycles. This is achieved by retrieving the difference between the DONE and GO signals for each action firing during RTL simulation [13]. Timing information is added to the ETG for each firing and each dependency according to the corresponding time value, thus transforming the ETG into a weighted graph. A close-to-optimal buffer size configuration, in terms of execution throughput and buffer memory utilization, can be obtained through an iterative analysis of the algorithmic critical path evaluated using the weighted ETG. For a detailed description, the interested reader can refer to [14].

B. Coarse-grained clock gating strategy

When the output buffer of any actor is full, the clock of this actor should be turned off as the actor is idle. This is because switching off its clock will not have an impact on design throughput. Even though RVC-CAL dataflow designs are used for the behavioral description, this clock gating strategy is more general and can be applied to systems that represent the execution of a process that communicates with asynchronous FIFO buffers. The queues should be asynchronous for lossless communication when an actor is clock gated and a design has differing input clock domains.

Fig. 1: Clock gating methodology applied to Actor A. Actor A has two outputs, one of which has a fanout of two. The clock enabling circuit takes the Almost Full and Full signals of each queue and a clock from a clock domain, and as a result it activates or deactivates the clock of Actor A.

This strategy consists of adding a Clock Enabler circuit for activating the actor's clock. This circuit contains a controller for each output port queue of each actor, combinatorial logic for the configuration of the output ports, and a clock buffer (which enables the clock). A representation of an actor with a single output port being clock gated is illustrated in Figure 1. As depicted, queues are asynchronous. Queues have two input clocks: one for consuming tokens and one for producing them. Additionally, queues have two output ports: AF for almost full, and F for full. The actor's input clock is connected to the output of the Clock Enabler circuit. Finally, the BUFGCE clock buffer should be connected with a flip-flop for glitch-free clock gating [15].

The flip-flop will introduce a one-clock latency when the clock is switched off, but this additional clock cycle will not have an impact on actors that are on the critical path. Those actors are not clock gated because the TURNUS dimensioning of the FIFO queues is based on critical path analysis. Hence, this approach does not impact overall performance.

Fig. 2: State machine of the clock enabling controller. The controller has two inputs, F for full and AF for almost full, and one output, en, as the enable signal.

Clock enabling controller: The clock enabling controller is represented in Figure 2. The controller is implemented as a finite state machine having a clock; a reset; input F, for
full; input AF, for almost full; and output EN, for enable. The AF input becomes active high when there is only one space left in its FIFO queue. The finite state machine (FSM) has five states, S = {INIT, SPACE, AFULL_DISABLE, FULL, AFULL_ENABLE}. The controller starts in the INIT state and maintains the EN output port at active high until F and AF become active low.

The active high EN is maintained during the SPACE state. As a queue becomes almost full, the state changes to AFULL_DISABLE. In this state, the EN output goes to active low. A conservative approach is taken in this state because the BUFGCE disables the output clock on the high-to-low edge. The clock enable entering the BUFGCE should be synchronized to the input clock. Once the queue becomes full, the controller maintains EN at active low. When a token is consumed from the queue, the controller passes to the AFULL_ENABLE state and activates the clock. Then, depending on whether the buffer becomes full or almost full, the state changes to either the FULL or the SPACE state.

Strategy: The user can choose a mapping configuration that indicates which actors should be clock gated. To do so, an attribute is given to each actor. If an actor has been selected for clock gating, the F and AF signals of all of its output FIFO queues are connected to a clock enabler controller. Output queues can be connected through a fanout or directly to a queue.

Template 1: Clock Enabler circuit module creation
    module clock_enabler
    Input : actor
    Input : enable
    Input : clk_in
    Input : reset
    Input : ∀P^out_almost_full
    Input : ∀P^out_full
    Output: clk_out
    for p in ∀P^out do
        wire ["sizeof(p.fanout)":0] "nameof(p)"_enable;
    reg clock_enable;
    wire buf_enable;
    for p in ∀P^out do
        for idx in sizeof(p.fanout) do
            controller c_"nameof(p)"_"idx"(
                .almost_full("nameof(p)"_almost_full["idx"]),
                .full("nameof(p)"_full["idx"]),
                .enable("nameof(p)"_enable["idx"]),
                .clk(clk),
                .reset(reset));
    always @(posedge clk) begin
        clock_enable <= for p in ∀P^out SEPARATOR "|" do
            if sizeof(p.fanout) > 1 then
                for idx in sizeof(p.fanout) SEPARATOR "&" do
                    "nameof(p)"_enable["idx"]
            else
                "nameof(p)"_enable
    end
    assign buf_enable = en ? clock_enable : 1;
    BUFGCE clock_enabling (.I(clk), .CE(buf_enable), .O(clk_out));
    endmodule
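The controller instantiated in Template 1 can be modeled behaviorally to see how EN follows the queue flags. The following is a sketch in Python, not the generated HDL: the state names and the EN value in each state follow the text, but the exact arcs of Fig. 2 are not reproduced here, so the transitions below are assumptions (in particular the return from AFULL_DISABLE to SPACE when the consumer drains the queue before it fills up).

```python
from enum import Enum


class State(Enum):
    INIT = "INIT"
    SPACE = "SPACE"
    AFULL_DISABLE = "AFULL_DISABLE"
    FULL = "FULL"
    AFULL_ENABLE = "AFULL_ENABLE"


# EN output in each state, as described in the text.
ENABLE = {
    State.INIT: True,
    State.SPACE: True,
    State.AFULL_DISABLE: False,
    State.FULL: False,
    State.AFULL_ENABLE: True,
}


def next_state(state, full, almost_full):
    """Next controller state for queue flags F and AF (assumed arcs)."""
    if state is State.INIT:
        # Leave INIT once both flags are low.
        return State.SPACE if not (full or almost_full) else State.INIT
    if state is State.SPACE:
        # Conservative: stop the clock as soon as one slot is left.
        return State.AFULL_DISABLE if (full or almost_full) else State.SPACE
    if state is State.AFULL_DISABLE:
        if full:
            return State.FULL
        # Assumption: go back to SPACE if the queue drains before filling up.
        return State.AFULL_DISABLE if almost_full else State.SPACE
    if state is State.FULL:
        # A consumed token clears F and restarts the clock.
        return State.FULL if full else State.AFULL_ENABLE
    # Remaining case: State.AFULL_ENABLE.
    if full:
        return State.FULL
    return State.AFULL_ENABLE if almost_full else State.SPACE


def simulate(flags):
    """Run the controller over a sequence of (F, AF) pairs.

    Returns the (state name, EN) pair reached after each input.
    """
    state = State.INIT
    trace = []
    for full, almost_full in flags:
        state = next_state(state, full, almost_full)
        trace.append((state.name, ENABLE[state]))
    return trace


if __name__ == "__main__":
    # The queue fills up, stays full for one step, then drains again.
    stimulus = [(False, False), (False, True), (True, True),
                (True, True), (False, True), (False, False)]
    for name, enable in simulate(stimulus):
        print(f"{name:14s} EN={int(enable)}")
```

The model is deliberately a pure function of the current state and the two flags, which mirrors the controller's port list (F, AF, EN) and makes alternative arc choices easy to try.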
In the first case (an output connected through a fanout), the controller results are connected to an AND logic port. This is a safe approach in the case that one of the queues in the fanout is full: the fanout should then command the actor not to produce a token. In the latter case, if an actor's output is connected directly to a queue without a fanout, the result should be connected to an OR logic port, as the next actor may need to consume a certain number of tokens to output a token; otherwise, the system may lock due to the unavailability of data. In the third case, if there is a combination of outputs with or without a fanout, then an n-input OR logic port is inserted. Figure 3 depicts these configurations.

A pseudo-template of the Clock Enabler circuit is given in Template 1. This template generates a Verilog file that takes into account the different cases described previously. These situations are detected and generated automatically as described in the "always" clause. A flip-flop (created by the always clause) is connected between the BUFGCE and the final OR or AND port. Thus, clock glitches are eliminated and the clock enabling is runt free. The final output of the clock gating is a new clock that is connected to the actor, its fanouts, and its queues' write and read clocks (CLK W and CLK R, respectively), as visualized in Figure 1.

IV. EXPERIMENTAL RESULTS

In this section, the power reduction gain of the aforementioned methodology is evaluated by applying it to a video decoder design. In [16], the reader can find a variety of RVC-CAL applications for dataflow programs. One of these applications is the Intra MPEG-4 simple profile decoder. Due to restrictions on the number of clock buffers in Xilinx FPGAs, the design selected was refactored to result in 32 actors.

Test Design: The Intra MPEG-4 SP description contains 32 actors and is a 4:2:0 decoder which is separated into 8 processing blocks: four components for luminance (Y) and two each for chrominance (U and V). The parser block includes the syntactical bitstream parser and the variable length decoding process; the Tex Y, U, and V blocks (for texture) implement the residual decoding (AC-DC prediction, inverse scanning, inverse quantization, and IDCT); and the MOT Y, U, and V blocks realize the motion compensation stage (framebuffer, interpolation, and residual error addition). Due to the nature of the experiments, the motion compensation blocks contain only the residual error addition actor. By using the TURNUS profiler, a close to minimum queue size [17] for each queue in the decoder is determined.

Experimental Flow: For the experimental evaluation, a Virtex 7 XC7VX485T-2 FPGA (VC707 Evaluation Kit) was used. The HDL code of the decoder was generated by Xronos and synthesized with the Xilinx XST synthesizer. Following synthesis, placing and routing was applied to produce a final netlist. This netlist was then simulated with Modelsim to extract the switching activity information (SAIF file) of the design. The Xilinx XPower analyser was then used to determine power consumption, using the design netlist, the design constraints, and the simulation activity SAIF as inputs. All of the results given have a high confidence level, meaning that at least 97% of the design nets are found within the SAIF file. Table I shows the synthesis results of the Intra MPEG-4 simple profile decoder with and without clock gating. This example demonstrates that the clock gated decoder uses more slices than the non clock gated one.


[Figure: three Clock Enabler Circuit configurations, each built from Controllers (inputs AF, F, CLK; output EN), a flip-flop, and a BUFGCE: (a) single output port with a fanout; (b) two different output ports; (c) a single output port with a fanout and another output port.]

Fig. 3: Clock Enabler Circuit in three different configurations.
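Read together with Template 1, the configurations of Fig. 3 suggest a single combination rule: the controller enables of one fanout are ANDed, and the per-port results are ORed. The sketch below illustrates that reading only; `clock_enable` and `enable_expression` are hypothetical helper names, and the `<port>_enable[i]` signal naming merely imitates the template.

```python
def clock_enable(port_enables):
    """Combine controller EN outputs into the actor's clock enable.

    port_enables holds one inner list per output port, with one boolean per
    fanout branch (a port without a fanout has a single entry).
    """
    # AND inside a fanout, OR across output ports (our reading of Template 1).
    return any(all(branches) for branches in port_enables)


def enable_expression(fanout_by_port):
    """Render the same rule as a Verilog-style expression string."""
    terms = []
    for port, fanout in fanout_by_port.items():
        if fanout > 1:
            anded = " & ".join(f"{port}_enable[{i}]" for i in range(fanout))
            terms.append(f"({anded})")
        else:
            terms.append(f"{port}_enable")
    return " | ".join(terms)


if __name__ == "__main__":
    # (a) One port with a fanout of two: a single full branch stops the clock.
    print(clock_enable([[True, False]]))
    # (b) Two ports without fanout: one port with space keeps the clock running.
    print(clock_enable([[True], [False]]))
    # (c) A fanout port combined with a plain port.
    print(clock_enable([[True, False], [True]]))
    print(enable_expression({"out_a": 2, "out_b": 1}))
```

Keeping the boolean rule and the string rendering side by side makes it easy to check that a generated expression matches the intended behavior for a given port list.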

Logic Utilization   Non Clock Gated   Clock Gated   Available
Slices              9214              12776         607200
LUTs                21499             25126         303600
BUFGCTRLs           1                 32            32
BRAMs               7                 7             1030
DSPs                18                18            2850
Max Freq.           109               109           -

TABLE I: Synthesis results of the Intra MPEG-4 simple profile decoder synthesized for Virtex 7 XC7VX485T-2 FPGA, with and without clock gating.

Though this represents 28% more slices overall compared to the non clock gated decoder, the clock gating methodology requires only 15% more LUTs. A 50 MHz clock has been given as a synthesis constraint.

Table II depicts the power consumption of the decoder including the circuit of the clock gating methodology. Two test cases were considered: clock gating deactivated and clock gating activated, both when decoding at maximum throughput.

In Table II, the Actors clocks row reports only the power consumption of the clock nets of the actors. The Clocks row contains the actors' clock nets, the enabling of clock buffer nets, and the nominal 50 MHz clock net. As a result of clock gating, the actors' clocks consume 26% less power, but because the decoder is running at full speed, the activation rates of the Logic and Signals nets remain essentially unchanged, resulting in a total power decrease of 4% (16 mW less).

Clock Gating     Disabled (mW)   Enabled (mW)
Actors clocks    58              43
Clocks           94              80
Logic            25              24
Signals          42              41
Leakage          242             242
Total            403             387

TABLE II: Power consumption of the Intra MPEG-4 SP decoder when the clock gating is disabled/enabled.

A. Power saving efficiency over decoder throttling

As described in Table I, the maximum decoder throughput rate is 350 frames per second for a QCIF image (176x144 pixels). For the experiment, the decoder is throttled such that it decodes only 30 images per second for two resolutions, QCIF and CIF (384x288 pixels).

Figure 5b reports the power consumption and the activation rate for each actor's clock (found in the Intra MPEG-4 simple profile decoder). The activation rates of the actors' clocks demonstrate that some of them have an activation rate of less than 10%. As a result of these activation rates, the power consumption on clocks has fallen drastically, by 53.7% for the QCIF resolution and 47.6% for the CIF resolution. As for overall power consumption, the decoder consumes 59 mW less for the QCIF resolution and 54 mW less for the CIF resolution. Furthermore, out of 31 actors, 15 are almost always on. This means that for these 15 actors, the output buffers never fill up. Further improvements of this methodology could entail detecting which actors do not benefit from clock gating and eliminating the instantiation of unnecessary additional logic.

B. Power saving efficiency over bandwidth demand

In this experiment, the decoder was throttled from 0% to 90%, simulating a channel with differing consumption rates. This is an example of clock gating applied not specifically to video decoding applications, but to a general application. Figure 4 represents the power consumption of the decoder when its output is throttled from 0% to 90%. As demonstrated, the total dynamic power consumption has decreased from 145 mW to 106 mW, a power reduction of 27%. Compared to the non clock gated decoder, the dynamic power has been reduced by 34%. Figure 5 reports the power consumption of each clock and their activation rate when throttled. From this graph, the data of 15 actors has been removed due to their activation rate being more than 99%.

[Plot: Power Consumption (mW, 0 to 180) versus Throttle (%, 0 to 90); series: Total, Actors Clock, Clocks, Logic, Signal.]

Fig. 4: Power consumption of overall clocks, the signals, logic, and the total dynamic power consumption of the Intra MPEG-4 SP decoder when its output is throttled from 0% to 90%.
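The percentages quoted for Table II and for the throttling experiment can be re-derived from the reported values. The short check below does so under one assumption that the text does not state explicitly: that "dynamic power" means total power minus leakage. With that reading, the clock gated decoder at full speed comes out at the 145 mW quoted as the starting point of Fig. 4. The helper names are hypothetical.

```python
def reduction_percent(before_mw, after_mw):
    """Relative saving in percent when power drops from before_mw to after_mw."""
    return 100.0 * (before_mw - after_mw) / before_mw


# Table II, in mW: (clock gating disabled, clock gating enabled).
TABLE_II = {
    "Actors clocks": (58, 43),
    "Clocks": (94, 80),
    "Logic": (25, 24),
    "Signals": (42, 41),
    "Leakage": (242, 242),
    "Total": (403, 387),
}


def dynamic_power_mw(column):
    """Total minus leakage for column 0 (disabled) or column 1 (enabled)."""
    return TABLE_II["Total"][column] - TABLE_II["Leakage"][column]


actors_saving = reduction_percent(*TABLE_II["Actors clocks"])
total_saving = reduction_percent(*TABLE_II["Total"])
ungated_dynamic = dynamic_power_mw(0)  # non clock gated decoder
gated_dynamic = dynamic_power_mw(1)    # clock gated decoder, no throttling
throttled_dynamic = 106                # mW at 90% throttle, as quoted in the text

if __name__ == "__main__":
    print(f"actors clocks saving: {actors_saving:.1f}%")
    print(f"total saving: {total_saving:.1f}%")
    print(f"dynamic power, gated vs throttled: "
          f"{reduction_percent(gated_dynamic, throttled_dynamic):.1f}%")
    print(f"dynamic power, ungated vs throttled: "
          f"{reduction_percent(ungated_dynamic, throttled_dynamic):.1f}%")
```

Rounded to whole percentages, these values agree with the 26%, 4%, 27%, and 34% figures given in the text.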


All actor clock activation rates decreased while increasing throttle (apart from two cases, par_splitter_Qp_clk and tex_Y_DCR_addr_clk, where the power consumption increased slightly). The decoder used was YUV 420. When the throttle reaches 60%, the decoder throttles the luminance decoding, but the chrominance decoding remains active. This also occurred during a behavioral simulation in Modelsim.

[Bar charts, one bar per actor clock: (a) Actors clock power consumption (Power Consumption, mW, 0 to 4); (b) Actors clock activation rate (Activation Rate, %, 0 to 100).]

Fig. 5: Power consumption and activation rate of each clock gated actor clock of the MPEG-4 SP decoder. Median values were retrieved from an MPEG-4 reference QCIF input stimulus (video sequence).

V. CONCLUSION

This paper presents a clock-gating methodology applied to dataflow designs that can be automatically included in the synthesis stage of an HLS design flow. The application of the power saving technique is independent of the semantics of the application and does not need any additional step or effort during the "design" of the application at the dataflow program level. The clock gating logic is generated during the synthesis stage together with the synthesis of the computational kernels connected via FIFO queues constituting the dataflow network. Conceivably, these techniques could be extended to other dataflow Models of Computation.

Experimental results are very encouraging: savings in power dissipation have been achieved with a slight increase in control logic and without any reduction in throughput. Unsurprisingly, clock gating is attractive in situations where the design is not used to its full capacity. In these circumstances clock gating is a simple, automatic, and effective way to recover power otherwise lost in "idle" cycles. As a result, this technique is particularly interesting in applications with dynamically varying performance requirements, when designing to a particular performance point is impossible, and when power consumption is deemed costly.

Further investigations into clock gating should consider more aggressive control logic, whereby control is given to each individual actor, allowing greater flexibility to actor inactivity. Furthermore, it will be necessary to develop tools that partition complex applications onto the limited number of clock domains for more efficient implementations. Lastly, additional consideration could be given to controlling clock speed and, possibly, voltage transitions.

REFERENCES

[1] Massoud Pedram, "Power minimization in IC design: Principles and applications," ACM Trans. Des. Autom. Electron. Syst., vol. 1, no. 1, pp. 3–56, Jan. 1996.
[2] Qing Wu, M. Pedram, and Xunwei Wu, "Clock-gating and its application to low power design of sequential circuits," Circuits and Systems I: Fundamental Theory and Applications, IEEE Transactions on, vol. 47, no. 3, pp. 415–420, Mar 2000.
[3] G.E. Tellez, A. Farrahi, and M. Sarrafzadeh, "Activity-driven clock design for low power circuits," in Computer-Aided Design, 1995. ICCAD-95. Digest of Technical Papers., 1995 IEEE/ACM International Conference on, Nov 1995, pp. 62–65.
[4] E. Lee and A. Sangiovanni-Vincentelli, "Comparing models of computation," in Proceedings of the 1996 IEEE/ACM international conference on Computer-aided design. IEEE Computer Society, 1997, pp. 234–241.
[5] Gilles Kahn, "The Semantics of Simple Language for Parallel Programming," in IFIP Congress, 1974, pp. 471–475.
[6] Edward A. Lee and David G. Messerschmitt, "Static scheduling of synchronous data flow programs for digital signal processing," IEEE Trans. Comput., vol. 36, no. 1, pp. 24–35, 1987.
[7] E.A. Lee and T.M. Parks, "Dataflow process networks," Proceedings of the IEEE, vol. 83, no. 5, pp. 773–801, May 1995.
[8] Syed Suhaib, Deepak Mathaikutty, and Sandeep Shukla, "Dataflow architectures for GALS," Electronic Notes in Theoretical Computer Science, vol. 200, no. 1, pp. 33–50, 2008.
[9] Tzyh-Yung Wuu and Sarma B. K. Vrudhula, "Synthesis of asynchronous systems from data flow specification," Research Report ISI/RR-93-366, University of Southern California, Information Sciences Institute, Dec 1993.
[10] Behnam Ghavami and Hossein Pedram, "High performance asynchronous design flow using a novel static performance analysis method," Comput. Electr. Eng., vol. 35, no. 6, pp. 920–941, Nov. 2009.
[11] S.C. Brunet, E. Bezati, C. Alberti, M. Mattavelli, E. Amaldi, and J.W. Janneck, "Partitioning and optimization of high level stream applications for multi clock domain architectures," in Signal Processing Systems (SiPS), 2013 IEEE Workshop on, Oct 2013, pp. 177–182.
[12] Simone Casale-Brunet, Analysis and optimization of dynamic dataflow programs, Ph.D. thesis, STI, Lausanne, 2015.
[13] Endri Bezati, High-level synthesis of dataflow programs for heterogeneous platforms, Ph.D. thesis, STI, Lausanne, 2015.
[14] S. Casale-Brunet, M. Mattavelli, and J.W. Janneck, "Buffer optimization based on critical path analysis of a dataflow program design," in Circuits and Systems (ISCAS), 2013 IEEE International Symposium on, May 2013, pp. 1384–1387.
[15] Xilinx, Analysis of Power Savings from Intelligent Clock Gating, August 2012, XAPP790.
[16] "Open RVC-CAL Applications," 2014, http://github.com/orcc/orc-apps, [accessed 25-February-2014].
[17] M. Canale, S. Casale-Brunet, E. Bezati, M. Mattavelli, and J. Janneck, "Dataflow programs analysis and optimization using model predictive control techniques," Journal of Signal Processing Systems, pp. 1–11, 2015.
