
The paper discusses techniques for reducing dynamic power consumption in streaming applications on FPGAs through clock gating methods. It highlights the advantages of asynchronous dataflow designs, particularly RVC-CAL, in achieving power savings without compromising data throughput. Experimental results demonstrate the effectiveness of these techniques in optimizing power usage while maintaining performance.


This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCAD.2016.2597215, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

JOURNAL OF LATEX CLASS FILES, VOL. 11, NO. 4, DECEMBER 2015 1

Clock-gating of streaming applications for energy efficient implementations on FPGAs
Endri Bezati, Simone Casale-Brunet, Member, IEEE
Marco Mattavelli, and Jorn W. Janneck, Member, IEEE

Endri Bezati, Simone Casale-Brunet, and Marco Mattavelli are with the Laboratory SCI-STI-MM of École Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland, e-mail: ([email protected]).
Jorn W. Janneck is with the Department of Computer Science, Lund University, Sweden.
Manuscript received April 19, 2014.

Abstract: The paper investigates the reduction of dynamic power for streaming applications yielded by asynchronous dataflow designs by using clock gating techniques. Streaming applications constitute a very broad class of computing algorithms in areas such as signal processing, digital media coding, cryptography, video analytics, network routing and packet processing, and many others. The paper introduces a set of techniques that, considering the dynamic streaming behavior of algorithms, can achieve power savings by selectively switching off parts of the circuits when they are temporarily inactive. Because the techniques are independent of the semantics of the application, they can be applied to any application and can be integrated into the synthesis stage of a high-level dataflow design flow. Experimental results of at-size applications synthesized on FPGA platforms demonstrate power reductions achievable with no loss in data throughput.

I. INTRODUCTION

Power dissipation is currently the major limitation of silicon computing devices. Reducing power also has other beneficial effects: it implies less stringent cooling requirements, improved longevity, longer autonomy for battery-operated devices and, obviously, lower power costs. For all these reasons, power also frequently affects the choice of the computing platform right at the outset. For example, Field-Programmable Gate Arrays (FPGAs) imply higher power dissipation per logic unit than equivalent Application-Specific Integrated Circuits (ASICs), but often compare favorably to conventional processors used for the same functional tasks. For any silicon device, power dissipation can be partitioned into two components: a static and a dynamic component. Static power dissipation, also referred to as quiescent or standby power consumption, is the result of the leakage current of the transistors and is also affected by the ambient temperature. By contrast, dynamic power dissipation is caused by transistors being switched and by losses of charges being moved along wires. Dynamic power dissipation increases linearly with frequency, due largely to the influence of parasitic capacitances. To counteract this effect, ASIC designers have employed clock gating (CG) techniques over the last twenty years [1], [2], [3].

Different strategies for optimizing power consumption on ASICs and FPGAs are discussed in Section II. These papers describe the impact of a chosen technology for a given architecture, but do not describe how to reduce power at the design abstraction level. As a consequence, adding power controllers at the behavioral description design stage constitutes an additional task that has to be carried out with care to avoid introducing undesired application behaviors, and it might reduce the portability of the code (i.e., when the platform is changed during the development process). In addition, it is extremely difficult for high-level synthesis (HLS) approaches based on imperative Models of Computation (MoCs) [4] to have power optimizations applied by automatic tools starting from the (imperative) behavioral description. Conversely, dynamic dataflow [5], [6], [7] designs, such as those expressible in the formal RVC-CAL language, possess interesting properties that can be exploited to reduce power consumption without affecting, by construction, the behavioral characteristics of the application. In RVC-CAL, every actor can execute processing tasks concurrently, executions might be disabled by blocking reads on inputs, and all communication among actors occurs only by means of order-preserving lossless queues. As a consequence, an actor may be stopped for a certain period if its processing tasks are idle or its output queues (buffers) are full, without impacting the overall throughput or the semantic behavior of the design. In addition, the more dynamic behavior is present in a given dataflow design, the more opportunities for power reduction it offers. This is not the case for synchronous dataflow designs, which always consume and produce a fixed number of data tokens. Thus, unlike asynchronous dataflow designs, synchronous dataflow designs always dissipate a constant amount of power. From this perspective, consider the techniques that transform intrinsically dynamic algorithms into static versions, such as those implemented by static dataflow MoCs to derive guaranteed analytical bounds or for other analyzability purposes. In general, these transformations are done by introducing dummy tokens that guarantee constant rates. Thus, in terms of power optimization, such approaches are inefficient.

This paper is organized as follows: in Section II, previous works on clock-gating are briefly introduced. Section III describes in detail the clock-gating strategy and how it is applied to a dataflow design. In Section IV, experimental results are presented, and conclusions are finally drawn in Section V.

II. RELATED WORK

Globally Asynchronous Locally Synchronous (GALS) based systems consist of several locally synchronous components which communicate with each other asynchronously.

0278-0070 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Works on GALS can be separated into three categories: partitioning, communication devices, and dedicated architectures. Dataflow design modeling, exploration, and optimization for GALS-based designs have been studied previously by several authors. Shen et al. [8] proposed a design and evaluation framework for modeling application-specific GALS-based dataflow architectures for cyclo-static applications, where system performance, e.g. throughput, is taken into account during optimization. Similarly, Wuu et al. [9] and Ghavami et al. [10] proposed methods for the automatic synthesis of asynchronous digital systems. These two approaches were developed for fine-grained dataflow graphs, where actors are primitives or combinational functions. Related to our work, the authors of [11] proposed a multiple clock domain design methodology for reducing the power consumption of dataflow programs. Their design objective was to optimize the mapping of an application while still meeting design performance requirements. This optimization was achieved by assigning each clock domain an optimized clock frequency to reduce power consumption.

III. CLOCK-GATING STRATEGY

Current FPGA families support different clock gating strategies, and each manufacturer provides its own IP for managing these different approaches. The methodology described here is based on primitives specific to Xilinx FPGA architectures. However, it can be modified to support the primitives of other FPGA vendors. The remainder of this section briefly describes how clock gating techniques are implemented on Xilinx FPGAs and how an automatic clock gating strategy is realized within Xronos HLS.

A. Profile guided buffer size

The execution of a dataflow program consists of a sequence of action firings. These firings can be correlated to each other in a graph-based representation using an approach called Execution Trace Graphing (ETG). The graph is an acyclic directed graph where each node represents an action firing, and a directed arc represents either a data or a control dependency between two different action firings. The effectiveness of analyzing a dataflow program using an ETG is demonstrated in [12]. Xronos provides profiling of each firing execution in clock cycles. This is achieved by retrieving the difference between the DONE and GO signals for each action firing during RTL simulation [13]. Timing information is added to the ETG for each firing, and each dependency is weighted with the corresponding time value, thus transforming the ETG into a weighted graph. A close-to-optimal buffer size configuration, in terms of execution throughput and buffer memory utilization, can be obtained through an iterative analysis of the algorithmic critical path evaluated using the weighted ETG. For a detailed description, the interested reader can refer to [14].

B. Coarse-grained clock gating strategy

When the output buffer of any actor is full, the clock of this actor should be turned off, as the actor is idle. This is because switching off its clock will not have an impact on design throughput. Even though RVC-CAL dataflow designs are used for the behavioral description, this clock gating strategy is more general and can be applied to any system that represents the execution of processes communicating through asynchronous FIFO buffers. The queues should be asynchronous to guarantee lossless communication when an actor is clock gated and when a design has differing input clock domains.

Fig. 1: Clock gating methodology applied to Actor A. Actor A has two outputs, one of which has a fanout of two. The Clock Enabler circuit takes the Almost Full and Full signals of each queue and a clock from a clock domain and, as a result, activates or deactivates the clock of Actor A.

This strategy consists of adding a Clock Enabler circuit for activating the actor's clock. This circuit contains: a controller for each output port queue of each actor, combinatorial logic for the configuration of the output ports, and a clock buffer (which enables the clock). A representation of an actor with a single output port being clock gated is illustrated in Figure 1. As depicted, queues are asynchronous. Queues have two input clocks: one for consuming tokens and one for producing them. Additionally, queues have two output ports: AF for almost full, and F for full. The actor's input clock is connected to the output of the Clock Enabler circuit. Finally, the clock enable input of the BUFGCE clock buffer should be driven by a flip-flop for glitch-free clock gating [15].

The flip-flop introduces a one-clock latency when the clock is switched off, but this additional clock cycle does not have an impact on actors that are on the critical path. Those actors are not clock gated in practice, because the TURNUS dimensioning of the FIFO queues is based on critical path analysis. Hence, this approach does not impact overall performance.

Fig. 2: State machine of the clock enabling controller. The controller has two inputs, F for full and AF for almost full, and one output, en, as the enable signal.

Clock enabling controller: The clock enabling controller is represented in Figure 2. The controller is implemented as a finite state machine having a clock; a reset; input F, for


full; input AF, for almost full; and output EN, for enable. The AF input becomes active high when there is only one space left in its FIFO queue. Its finite state machine (FSM) has 5 states, S = {INIT, SPACE, AFULL_DISABLE, FULL, AFULL_ENABLE}. The controller starts in the INIT state and maintains the EN output port at active high until F and AF become active low.

The active high EN is maintained during the SPACE state. As a queue becomes full, the state changes to AFULL_DISABLE. In this state, the EN output goes to active low. A conservative approach is taken in this state because the BUFGCE disables the output clock on the high-to-low edge. The clock enable entering the BUFGCE should be synchronized to the input clock. Once the queue becomes full, the controller maintains EN at active low. When a token is consumed from the queue, the controller passes to the AFULL_ENABLE state and activates the clock. Then, depending on whether the buffer becomes full or almost full, the state changes to either the FULL or the SPACE state.

Strategy: The user can choose a mapping configuration that indicates which actors should be clock gated. To do so, an attribute is given to each actor. If an actor has been selected for clock gating, the F and AF signals of all of its output FIFO queues are connected to a clock enabler controller. Output queues can be connected through a fanout or directly to a queue. In the first case, the controller results are connected to an AND logic port. This is a safe approach in the case that one of the queues in the fanout is full. In this case, the fanout should command the actor not to produce a token. For the latter case, if an actor's output is connected directly to a queue without a fanout, the result should be connected to an OR logic port, as the next actor may need to consume a certain number of tokens to output a token. This may lead the system to lock due to the unavailability of data. In the third case, if there is a combination of outputs with and without a fanout, then an n-input OR logic port is inserted. Figure 3 depicts these configurations.

A pseudo-template of the Clock Enabler circuit is given in Template 1. This template generates a Verilog file that takes into account the different cases described previously. These situations are detected and generated automatically as described in the "always" clause. A flip-flop (created by the always clause) is connected between the BUFGCE and the final OR or AND port. Thus, clock glitches are eliminated and the clock enabling is free of runt pulses. The final output of the clock gating is a new clock that is connected to the actor, its fanouts, and its queues' write and read clocks (CLK W and CLK R, respectively), as visualized in Figure 1.

Template 1: Clock Enabler circuit module creation

    module clock_enabler
    Input : actor
    Input : enable
    Input : clk_in
    Input : reset
    Input : ∀ P^out almost_full
    Input : ∀ P^out full
    Output: clk_out
    for p in ∀ P^out do
        wire ["sizeof(p.fanout)":0] "nameof(p)"_enable;
    reg clock_enable;
    wire buf_enable;
    for p in ∀ P^out do
        for idx in sizeof(p.fanout) do
            controller c_"nameof(p)"_"idx"(
                .almost_full("nameof(p)"_almost_full["idx"]),
                .full("nameof(p)"_full["idx"]),
                .enable("nameof(p)"_enable["idx"]),
                .clk(clk),
                .reset(reset));
    always @(posedge clk) begin
        clock_enable <= for p in ∀ P^out SEPARATOR "|" do
            if sizeof(p.fanout) > 1 then
                for idx in sizeof(p.fanout) SEPARATOR "&" do
                    "nameof(p)"_enable["idx"]
            else
                "nameof(p)"_enable
    end
    assign buf_enable = en ? clock_enable : 1;
    BUFGCE clock_enabling (.I(clk), .CE(buf_enable), .O(clk_out));
    endmodule

IV. EXPERIMENTAL RESULTS

In this section, the power reduction gain of the aforementioned methodology is evaluated by applying it to a video decoder design. In [16], the reader can find a variety of RVC-CAL applications for dataflow programs. One of these applications is the Intra MPEG-4 simple profile decoder. Due to restrictions on the number of clock buffers in Xilinx FPGAs, the selected design was refactored to result in 32 actors.

Test Design: The Intra MPEG-4 SP description contains 32 actors. It is a 4:2:0 decoder separated into 8 processing blocks: four components for luminance (Y) and two each for chrominance (U and V). The parser block includes the syntactical bitstream parser and the variable length decoding process. The Tex Y, U, and V blocks (for texture) implement the residual decoding (AC-DC prediction, inverse scanning, inverse quantization, and IDCT), and the MOT Y, U, and V blocks realize the motion compensation stage (framebuffer, interpolation, and residual error addition). Due to the nature of the experiments, the motion compensation blocks contain only the residual error addition actor. By using the TURNUS profiler, a close-to-minimum queue size [17] is determined for each queue in the decoder.

Experimental Flow: For the experimental evaluation, a Virtex 7 XC7VX485T-2 FPGA (VC707 Evaluation Kit) was used. The HDL code of the decoder was generated by Xronos and synthesized with the Xilinx XST synthesizer. Following synthesis, placing and routing were applied to produce a final netlist. This netlist was then simulated with Modelsim to extract the switching activity information (SAIF file) of the design. The Xilinx XPower analyser was then used to determine power consumption, using the design netlist, the design constraints, and the simulation activity SAIF as inputs. All of the results given have a high confidence level, meaning that at least 97% of the design nets are found within the SAIF file. Table I shows the synthesis results of the Intra MPEG-4 simple profile decoder with and without clock gating. This example demonstrates that the clock gated decoder uses more slices than the non clock gated one. Even


though this represents 28% more slices overall compared to the non clock gated decoder, the clock gating methodology requires only 15% more LUTs. A 50 MHz clock has been given as a synthesis constraint.

Fig. 3: Clock Enabler Circuit in three different configurations: (a) single output port with a fanout; (b) two different output ports; (c) a single output port with a fanout and another output port.

TABLE I: Synthesis results of the Intra MPEG-4 simple profile decoder synthesized for Virtex 7 XC7VX485T-2 FPGA, with and without clock gating.

Logic Utilization   Non Clock Gated   Clock Gated   Available
Slices              9214              12776         607200
LUTs                21499             25126         303600
BUFGCTRLs           1                 32            32
BRAMs               7                 7             1030
DSPs                18                18            2850
Max Freq.           109               109           -

Table II depicts the power consumption of the decoder including the circuit of the clock gating methodology. Two test cases were considered: clock gating deactivated and clock gating activated, both when decoding at maximum throughput. In Table II, the Actors clocks row reports only the power consumption of the clock nets of the actors. The Clocks row contains the actors' clock nets, the enable nets of the clock buffers, and the nominal 50 MHz clock net. As a result of clock gating, the Actors clocks consume 26% less power, but because the decoder is running at full speed, the activation rate of the Logic and Signals nets remains nearly unchanged, resulting in a total power decrease of 4% (16 mW less).

TABLE II: Power consumption of the Intra MPEG-4 SP decoder when the clock gating is disabled/enabled.

Clock Gating    Disabled (mW)   Enabled (mW)
Actors clocks   58              43
Clocks          94              80
Logic           25              24
Signals         42              41
Leakage         242             242
Total           403             387

A. Power saving efficiency over decoder throttling

As described in Table I the maximum decoder throughput rate is 350 frames per second for a QCIF image (176x144 pixels). For the experiment, the decoder is throttled such that it decodes only 30 images per second for two resolutions, QCIF and CIF (384x288 pixels).

Figure 5b reports the power consumption and the activation rate for each actor's clock (found on the Intra MPEG-4 simple profile decoder). The activation rates of the actors' clocks show that some of them are active less than 10% of the time. As a result of these activation rates, the power consumption on clocks has fallen drastically, by 53.7% for the QCIF resolution and 47.6% for the CIF resolution. As for overall power consumption, the decoder consumes 59 mW less for the QCIF resolution and 54 mW less for the CIF resolution. Furthermore, out of 31 actors, 15 are almost always on. This means that for these 15 actors, their output buffers never fill up. Further improvements of this methodology could entail detecting which actors do not benefit from clock gating and eliminating the instantiation of unnecessary additional logic.

B. Power saving efficiency over bandwidth demand

In this experiment, the decoder was throttled from 0% to 90%, simulating a channel with differing consumption rates. This is an example of clock gating applied not specifically to video decoding applications, but to a general application. Figure 4 represents the power consumption of the decoder when its output is throttled from 0% to 90%. As demonstrated, the total dynamic power consumption has decreased from 145 mW to 106 mW, a power reduction of 27%. Compared to the non clock gated decoder, the dynamic power has been reduced by 34%. Figure 5 reports the power consumption of each clock and their activation rate when throttled. From this graph, the data of 15 actors has been removed due to their activation rate being more than 99%. All actor clock activation

Fig. 4: Power consumption of overall clocks, the signals, logic, and the total dynamic power consumption of the Intra MPEG-4 SP decoder when its output is throttled from 0% to 90%. (Plot: x-axis Throttle %, 0 to 90; y-axis Power Consumption mW, 0 to 180; series Total, Actors Clock, Clocks, Logic, Signal.)


rates decreased while increasing throttle (apart from two cases, par_splitter_Qp_clk and tex_Y_DCR_addr_clk, where the power consumption increased slightly). The decoder used was YUV 420. When the throttle reaches 60%, the decoder throttles the luminance decoding, but the chrominance decoding remains active. This also occurred during a behavioral simulation in Modelsim.

Fig. 5: Power consumption and activation rate of each clock gated actor clock of the MPEG-4 SP decoder. Median values were retrieved from an MPEG-4 reference QCIF input stimulus (video sequence). (a) Actors clock power consumption (y-axis: Power Consumption mW). (b) Actors clock activation rate (y-axis: Activation Rate %).

V. CONCLUSION

This paper presents a clock-gating methodology applied to dataflow designs that can be automatically included in the synthesis stage of an HLS design flow. The application of the power saving technique is independent of the semantics of the application and does not need any additional step or effort during the "design" of the application at the dataflow program level. The clock gating logic is generated during the synthesis stage together with the synthesis of the computational kernels connected via FIFO queues constituting the dataflow network. Conceivably, these techniques could be extended to other dataflow Models of Computation.

Experimental results are very encouraging: savings in power dissipation are achieved with a slight increase in control logic and without any reduction in throughput. Unsurprisingly, clock gating is attractive in situations where the design is not used to its full capacity. In these circumstances clock gating is a simple, automatic, and effective way to recover power otherwise lost in "idle" cycles. As a result, this technique is particularly interesting in applications with dynamically varying performance requirements, when designing to a particular performance point is impossible, and when power consumption is deemed costly.

Further investigations into clock gating should consider more aggressive control logic, whereby control is given to each individual actor, allowing greater flexibility with respect to actor inactivity. Furthermore, it will be necessary to develop tools that partition complex applications onto the limited number of clock domains for more efficient implementations. Lastly, additional consideration could be given to controlling clock speed and, possibly, voltage transitions.

REFERENCES

[1] Massoud Pedram, "Power minimization in IC design: Principles and applications," ACM Trans. Des. Autom. Electron. Syst., vol. 1, no. 1, pp. 3–56, Jan. 1996.
[2] Qing Wu, M. Pedram, and Xunwei Wu, "Clock-gating and its application to low power design of sequential circuits," Circuits and Systems I: Fundamental Theory and Applications, IEEE Transactions on, vol. 47, no. 3, pp. 415–420, Mar. 2000.
[3] G.E. Tellez, A. Farrahi, and M. Sarrafzadeh, "Activity-driven clock design for low power circuits," in Computer-Aided Design, 1995. ICCAD-95. Digest of Technical Papers., 1995 IEEE/ACM International Conference on, Nov. 1995, pp. 62–65.
[4] E. Lee and A. Sangiovanni-Vincentelli, "Comparing models of computation," in Proceedings of the 1996 IEEE/ACM international conference on Computer-aided design. IEEE Computer Society, 1997, pp. 234–241.
[5] Gilles Kahn, "The Semantics of Simple Language for Parallel Programming," in IFIP Congress, 1974, pp. 471–475.
[6] Edward A. Lee and David G. Messerschmitt, "Static scheduling of synchronous data flow programs for digital signal processing," IEEE Trans. Comput., vol. 36, no. 1, pp. 24–35, 1987.
[7] E.A. Lee and T.M. Parks, "Dataflow process networks," Proceedings of the IEEE, vol. 83, no. 5, pp. 773–801, May 1995.
[8] Syed Suhaib, Deepak Mathaikutty, and Sandeep Shukla, "Dataflow architectures for GALS," Electronic Notes in Theoretical Computer Science, vol. 200, no. 1, pp. 33–50, 2008.
[9] Tzyh-Yung Wuu and Sarma B. K. Vrudhula, "Synthesis of asynchronous systems from data flow specification," Research Report ISI/RR-93-366, University of Southern California, Information Sciences Institute, Dec. 1993.
[10] Behnam Ghavami and Hossein Pedram, "High performance asynchronous design flow using a novel static performance analysis method," Comput. Electr. Eng., vol. 35, no. 6, pp. 920–941, Nov. 2009.
[11] S.C. Brunet, E. Bezati, C. Alberti, M. Mattavelli, E. Amaldi, and J.W. Janneck, "Partitioning and optimization of high level stream applications for multi clock domain architectures," in Signal Processing Systems (SiPS), 2013 IEEE Workshop on, Oct. 2013, pp. 177–182.
[12] Simone Casale-Brunet, Analysis and optimization of dynamic dataflow programs, Ph.D. thesis, STI, Lausanne, 2015.
[13] Endri Bezati, High-level synthesis of dataflow programs for heterogeneous platforms, Ph.D. thesis, STI, Lausanne, 2015.
[14] S. Casale-Brunet, M. Mattavelli, and J.W. Janneck, "Buffer optimization based on critical path analysis of a dataflow program design," in Circuits and Systems (ISCAS), 2013 IEEE International Symposium on, May 2013, pp. 1384–1387.
[15] Xilinx, Analysis of Power Savings from Intelligent Clock Gating, August 2012, XAPP790.
[16] "Open RVC-CAL Applications," 2014, http://github.com/orcc/orc-apps, [accessed 25-February-2014].
[17] M. Canale, S. Casale-Brunet, E. Bezati, M. Mattavelli, and J. Janneck, "Dataflow programs analysis and optimization using model predictive control techniques," Journal of Signal Processing Systems, pp. 1–11, 2015.
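The clock enabling controller and the enable-combination rules of Section III-B can be sketched as a small behavioral model. This is an illustrative Python sketch only, not the generated Verilog: the names ClockEnableController, step, and combine_enables are hypothetical, and because the state diagram of Fig. 2 is not reproduced here, the exact transitions are assumptions read off the prose (EN is taken to be high in INIT, SPACE, and AFULL_ENABLE and low in AFULL_DISABLE and FULL, with F taking priority over AF). The one-cycle enable flip-flop and the BUFGCE itself are not modeled.

```python
from enum import Enum


class State(Enum):
    INIT = "INIT"
    SPACE = "SPACE"
    AFULL_DISABLE = "AFULL_DISABLE"
    FULL = "FULL"
    AFULL_ENABLE = "AFULL_ENABLE"


# Assumption: states in which the EN output is high (actor clock running).
EN_HIGH = {State.INIT, State.SPACE, State.AFULL_ENABLE}


class ClockEnableController:
    """Hypothetical model of one per-queue controller (inputs F, AF; output EN)."""

    def __init__(self):
        self.state = State.INIT

    @property
    def en(self):
        return self.state in EN_HIGH

    def step(self, full, almost_full):
        """Advance one clock cycle and return EN for the new state."""
        s = self.state
        if s is State.INIT:
            # EN stays high until both F and AF are low.
            if not full and not almost_full:
                self.state = State.SPACE
        elif s is State.SPACE:
            if full:
                self.state = State.FULL
            elif almost_full:
                # Conservative: stop the clock before the queue is full.
                self.state = State.AFULL_DISABLE
        elif s is State.AFULL_DISABLE:
            if full:
                self.state = State.FULL
            elif not almost_full:
                self.state = State.SPACE
        elif s is State.FULL:
            if not full:
                # A token was consumed: re-activate the clock.
                self.state = State.AFULL_ENABLE
        elif s is State.AFULL_ENABLE:
            if full:
                self.state = State.FULL
            elif not almost_full:
                self.state = State.SPACE
        return self.en


def combine_enables(ports, gating_enabled=True):
    """Combine controller outputs as in Template 1.

    `ports` holds one list per output port, with one EN value per fanout
    branch. Branches of a port are ANDed; the ports are then ORed.
    With gating switched off, the clock is always enabled.
    """
    if not gating_enabled or not ports:
        return True
    return any(all(branches) for branches in ports)


if __name__ == "__main__":
    ctrl = ClockEnableController()
    # (F, AF) samples: queue fills up, then drains again.
    for f, af in [(False, False), (False, True), (True, True),
                  (False, True), (False, False)]:
        print(f, af, ctrl.step(f, af), ctrl.state.name)
```

Under these assumptions the model shows the hysteresis described in the text: reaching almost-full while filling disables the clock, whereas reaching almost-full after a token is consumed from a full queue enables it again.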

No ratings yet
Dees ST-Microelectronics Stradale Primosole, Viale Andrea Dona Universita' Di Catania 1-95 121 CATANIA Italy 1-95 125 CATANIA Italy
4 pages
LPVD U1,2
No ratings yet
LPVD U1,2
34 pages
Imp - Powergating Fpga 2
No ratings yet
Imp - Powergating Fpga 2
14 pages
lpvd u1
No ratings yet
lpvd u1
21 pages
My Paper
No ratings yet
My Paper
11 pages
Dynamic Power Reduction WP
No ratings yet
Dynamic Power Reduction WP
6 pages
Advancements in VLSI Low-Power Design Strategies A
No ratings yet
Advancements in VLSI Low-Power Design Strategies A
7 pages
(2014 Transanction) Design_Flow_for_Flip-Flop_Grouping_in_Data-Driven_Clock_Gating
No ratings yet
(2014 Transanction) Design_Flow_for_Flip-Flop_Grouping_in_Data-Driven_Clock_Gating
8 pages
Vlsi Design2011 LP Tutorial Techpa
No ratings yet
Vlsi Design2011 LP Tutorial Techpa
127 pages
Power Reduction in Datapath Designs
No ratings yet
Power Reduction in Datapath Designs
10 pages
A2 Sahin Paper Icg
No ratings yet
A2 Sahin Paper Icg
19 pages
Factors Affecting Power Consumption in VLSI
No ratings yet
Factors Affecting Power Consumption in VLSI
44 pages
Low Power Register Design With Integration Clock Gating and Power Gating
No ratings yet
Low Power Register Design With Integration Clock Gating and Power Gating
6 pages
Power Analysis Methodology and Objectives for TI wireless platform PDF
No ratings yet
Power Analysis Methodology and Objectives for TI wireless platform PDF
19 pages
Reducing Execution Unit Leakage Power in Embedded Processors
No ratings yet
Reducing Execution Unit Leakage Power in Embedded Processors
11 pages
An_energy-aware_dynamic_scheduling_algorithm_for_hard_real-time_systems
No ratings yet
An_energy-aware_dynamic_scheduling_algorithm_for_hard_real-time_systems
4 pages
Efficient Design of 1
No ratings yet
Efficient Design of 1
7 pages
Low Power Implem
No ratings yet
Low Power Implem
19 pages
Energy Efficient CMOS Microprocessor Design
No ratings yet
Energy Efficient CMOS Microprocessor Design
10 pages
Analog Dialogue, Volume 47, Number 1: Analog Dialogue, #9
From Everand
Analog Dialogue, Volume 47, Number 1: Analog Dialogue, #9
Analog Dialogue
No ratings yet
Embedded Systems Programming with C: Writing Code for Microcontrollers
From Everand
Embedded Systems Programming with C: Writing Code for Microcontrollers
Larry Jones
No ratings yet
Mc68hc705p6a 783101
No ratings yet
Mc68hc705p6a 783101
99 pages
Comparison of I C Logic Families
No ratings yet
Comparison of I C Logic Families
2 pages
Tic Tac Toe Nabeegh Ahmed (19L-1098), Muhammad Hassan (19L-1011) Section 2B
100% (1)
Tic Tac Toe Nabeegh Ahmed (19L-1098), Muhammad Hassan (19L-1011) Section 2B
10 pages
TMS320C5x: By-D.Jenny Simpsolin
No ratings yet
TMS320C5x: By-D.Jenny Simpsolin
28 pages
University of Hawaii EE 361L Getting Started With Artix-7 Digilent Basys3 Board Lab 4.1
No ratings yet
University of Hawaii EE 361L Getting Started With Artix-7 Digilent Basys3 Board Lab 4.1
13 pages
Final PPT of CCSDS TC System
0% (1)
Final PPT of CCSDS TC System
38 pages
Embedded Testing
No ratings yet
Embedded Testing
225 pages
Encoders and Decoders, Multiplexer, Tri-State Inverter
No ratings yet
Encoders and Decoders, Multiplexer, Tri-State Inverter
25 pages
Defence-Force 6502 65816 Opcodes
No ratings yet
Defence-Force 6502 65816 Opcodes
15 pages
Verification
No ratings yet
Verification
11 pages
Tabela de Equivalências de EPROM: Número BOSCH
No ratings yet
Tabela de Equivalências de EPROM: Número BOSCH
3 pages
Asynchronous Sequential Circuits
No ratings yet
Asynchronous Sequential Circuits
35 pages
Ldom Troubleshooting
100% (1)
Ldom Troubleshooting
8 pages
Digital Electronics Synchronous Circuits:: Up-Down Counter
No ratings yet
Digital Electronics Synchronous Circuits:: Up-Down Counter
7 pages
Arhitecturi de Sisteme Incorporate Microcontrolere Si Sisteme Integrate
No ratings yet
Arhitecturi de Sisteme Incorporate Microcontrolere Si Sisteme Integrate
67 pages
DE LAB Question Bank
No ratings yet
DE LAB Question Bank
2 pages
DDC IP-CORE Datasheet
No ratings yet
DDC IP-CORE Datasheet
4 pages
2449 Ex4
No ratings yet
2449 Ex4
15 pages
Ram Oops
No ratings yet
Ram Oops
17 pages
Helloworld Adam Taylor Part 13
No ratings yet
Helloworld Adam Taylor Part 13
8 pages
What Is A Computer?: ITEC 1011 Introduction To Information Technologies
No ratings yet
What Is A Computer?: ITEC 1011 Introduction To Information Technologies
35 pages
WINSEM2023-24 BECE204L TH VL2023240505623 2024-01-05 Reference-Material-II
No ratings yet
WINSEM2023-24 BECE204L TH VL2023240505623 2024-01-05 Reference-Material-II
64 pages
Config MGR Setup
No ratings yet
Config MGR Setup
79 pages
Motherboard Chip Level Servicing Tutorials
No ratings yet
Motherboard Chip Level Servicing Tutorials
5 pages
Counter Ic With 2-Wire (I C-Bus) Interface
No ratings yet
Counter Ic With 2-Wire (I C-Bus) Interface
26 pages
Linear Integrated Circuits Lab Manual For Flip Flops and Logic Gates
No ratings yet
Linear Integrated Circuits Lab Manual For Flip Flops and Logic Gates
14 pages
Evolution of Intel Processors
No ratings yet
Evolution of Intel Processors
8 pages
Placa Base Msi rc410m
No ratings yet
Placa Base Msi rc410m
125 pages
PLC Training Ladder, RS232
No ratings yet
PLC Training Ladder, RS232
64 pages
