Power Reduction in Datapath Designs

Sreekanth Madgula Zia Khan


Desktop Products Group
Intel Corporation
1900 Prairie City Road
Folsom, California, USA

Abstract

As device counts and operating frequencies continue to increase, power consumption has become
a critical issue in today’s complex chips. Modern manufacturing techniques have made it
possible to put an entire system on a chip (SOC), further aggravating the power consumption
problem. For very high frequency designs, power can become a performance limiter. Saving
power is also important for mobile and hand-held products, where prolonging battery life is a
very important design goal. Many such products contain significant computation capability that is
implemented using high-performance datapath circuitry that consumes a significant amount of
power.

Clock gating can be used effectively to minimize power consumption in integrated circuits. For
control logic structures, clock gating can be done automatically using the RTL clock gating
feature of Synopsys’ Power Compiler tool, which works in conjunction with Design Compiler (DC) [1].
However, most computationally intensive designs also contain datapath circuits that are often
implemented using Module Compiler (MC) [2], a datapath synthesis tool, which lacked clock
gating capability until recently. We worked with the tool vendor to enhance MC so it can
automatically implement clock-gating logic in datapath circuits.

1.0 Introduction

The proliferation of mobile computers and hand-held devices has created a strong demand for
reducing power consumption in these devices to prolong their battery life. At the same time, the
increased performance and complexity of these devices are causing a rapid rise in power
consumption. Similarly, the performance of SOC designs may be limited by thermal constraints.
Therefore, cost-effective power reduction is an increasingly important design goal.

The capabilities and features of these devices are steadily increasing with each new generation.
This improvement is being achieved with increasingly elaborate designs that make use of
complex control and datapath circuits. A significant portion of these designs is implemented
using datapath-intensive logic involving arithmetic operations. These datapath-intensive designs
can be efficiently implemented using Module Compiler (MC), which enables high productivity
by allowing designers to quickly explore several different architectures with minimal changes to
the RTL description [2].

An effective technique for saving power is to identify sections of the chip that are inactive during
certain periods of operation and shut off the clock to those areas. A discussion on how to identify
inactive sections of a chip can be found in [1]. Power Compiler, working in conjunction with
Design Compiler (DC), automatically implements clock gating in a design [1]. However, this
capability is only available for designs that are synthesized using DC. MC, on the other hand,
directly maps RTL (written in MCL format) into gates without allowing clock gating. MCL
(Module Compiler Language) is a special language used by MC for modeling designs in RTL.
MCL’s syntax is very similar to Verilog, so designers familiar with Verilog may find it very easy
to learn MCL.

A significant portion of modern chips contains datapath circuits that may be synthesized using
MC. However, MC previously did not allow automatic insertion of clock gates in a datapath
design, which left a significant opportunity for power savings in datapath circuits unexploited.
Therefore, we worked with the tool vendor to enhance MC so that clock gating can be done
automatically in datapath circuits.

In this paper, we describe how the new Module Compiler clock gating feature (available in
version 2000.11) works. We describe our clock gating flow based on this feature and present our
results. We show that significant power savings can be realized using this methodology.

2.0 Power Saving Opportunities

The work reported in this paper was done on a graphics chip that consisted of datapath as well as
control logic circuits. The control logic section was coded in VHDL and synthesized using DC.
For clock gating in this section, we used Power Compiler, following the techniques reported in
[1]. The datapath logic section accounted for more than 50% of all the gates in the design. A
majority of the datapath circuits were coded in Module Compiler Language (MCL) so that we could
take advantage of MC’s superior optimization capabilities [2].

SNUG San Jose 2001 2 Power Reduction in Datapath Designs


We identified two types of circuit structures in our MCL-coded datapath designs that could
potentially provide power saving opportunities: large register banks and pipelines.

There are several large register banks in the design that hold the same data value for multiple
cycles until new data is requested with an active enable signal. During those cycles when the
previous data needs to be preserved, the data is simply reloaded through a feedback path. When
new data is to be captured, the feedback path is disabled and new data is loaded into the registers.
Figure 1(a) shows the implementation of this design structure. This reloading of previous data
results in significant power dissipation. By disabling the clock during cycles when previous data
is to be preserved, we can avoid the need to reload existing data, thus saving power. Figure 1(b)
shows a modified circuit that uses clock gating logic to disable the clock during reload cycles.

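[Figure 1(a): flip-flop clocked by Clk, with a multiplexer controlled by Select that either loads Data or feeds Out back to the input. Figure 1(b): flip-flop that loads Data directly, with Clk passing through a clock gate controlled by Select.]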

Figure 1: Clock gating logic implemented by Power Compiler
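
The equivalence of the two structures in Figure 1 can be illustrated with a small behavioral model. The following Python sketch is not part of the original flow; the function names, the 8-bit bank width, and the input sequences are assumptions chosen for illustration. It counts how many flop clock events each style incurs while producing the same outputs.

```python
"""Behavioral comparison of a load-enable register and a clock-gated register."""


def simulate_mux_feedback(data, select, width=8):
    """Figure 1(a) style: every flop is clocked each cycle; the mux reloads old data."""
    q = 0
    outputs = []
    flop_clock_events = 0
    for d, sel in zip(data, select):
        q = d if sel else q          # multiplexer chooses new data or the feedback path
        flop_clock_events += width   # all flops see the clock edge regardless of select
        outputs.append(q)
    return outputs, flop_clock_events


def simulate_clock_gated(data, select, width=8):
    """Figure 1(b) style: the clock reaches the flops only when select is active."""
    q = 0
    outputs = []
    flop_clock_events = 0
    for d, sel in zip(data, select):
        if sel:                      # the gated clock pulses only on load cycles
            q = d
            flop_clock_events += width
        outputs.append(q)
    return outputs, flop_clock_events


# Hypothetical stimulus: load once, then hold for two cycles, twice over.
data = [5, 7, 9, 11, 13, 15]
select = [1, 0, 0, 1, 0, 0]

mux_out, mux_events = simulate_mux_feedback(data, select)
gated_out, gated_events = simulate_clock_gated(data, select)

print(mux_out == gated_out)          # both register styles produce the same outputs
print(mux_events, gated_events)      # clock events seen by the flops in each style
```

The model captures only functional behavior and clock activity at the flops. It says nothing about the power of the clock gate itself or about glitching, which a real power estimate would have to include.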

For a practical use of the clock gating concept described above, consider the design shown in
Figure 2(a). The logic block on the right requires both inputs to be valid at the same time. Since
the pipeline has a latency of 3, In must be held for 3 cycles; the hold is controlled by the hold1
signal. This structure is called a synchronous load-enable register. It is functionally equivalent
to the multiplexer followed by a register described above and is typically coded using the
ensreg() function in MCL. The following code fragment shows how this register can be modeled in MCL.

Out = ensreg (In, hold1, 1);

The load-enable register bank holds its data until data on the pipeline unit below becomes valid
(in this case, 3 cycles). This circuit is a suitable candidate for clock gating. For clock gating, the
load-enable register is replaced by a regular register with a clock gate inserted, as shown in
Figure 2(b). Here the data is held for the required number of cycles by shutting off the clock to
the register.

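[Figure 2(a): ensreg register (input In, enable EN driven by hold1, output Out) in parallel with a pipeline of latency 3, both feeding a logic block. Figure 2(b): the same structure with the ensreg replaced by a regular register whose clock passes through a clock gate controlled by hold1.]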

Figure 2: Register bank structure suitable for clock gating

Another common structure used in datapath circuits is the pipelined architecture for
computational elements. Generally, data flows along a pipeline where it is modified at every stage,
and the results are then stored in the following registers. The key goals of this architecture are
throughput and latency. In a pipelined architecture, when a bottleneck (slow processing) stage
follows a high-throughput computation element such as a pipelined multiplier or adder, the output
of the high-throughput stage is delayed (or stalled) until the bottleneck stage is ready to process
the new data. In a design, this stalling operation is implemented by selectively reloading the data
of a stage until new data is available. This is shown in Figure 3(a), where a circular buffer
(implemented as a memory module) follows a pipelined multiplier and is loaded with the multiplier
output every cycle. The contents of this buffer are periodically read by a subsequent logic stage
(not shown in the diagram) at a slower rate. In this architecture, the buffer becomes full after a
certain number of cycles and cannot accept more data from the multiplier without overwriting its
contents. To avoid data corruption, the multiplier pipeline needs to be stalled (i.e., reloaded)
until the buffer is ready to accept new data. However, this reloading causes unnecessary power
consumption, which can be prevented by using clock gating to stall the output of the multiplier.
This is shown in Figure 3(b).

MC provides a specific directive, called pipestall, which can be used to model pipeline register
stages that may require stalling. By using this directive, the designer simply identifies a pipeline
register as a potential candidate for clock gating. During synthesis, the tool determines whether
the pipeline needs to be stalled and, if so, implements the clock gating logic. The details of
this and other directives can be found in [4,5].

As an example, a section of code using the pipestall directive for a pipelined multiplier is shown
below. Figure 3 illustrates a typical application of this directive.

directive (group = "multiplier", pipeline = "on", pipestall = "enable", clock = "CLOCK");


prod_temp = A * B ;
prod = ResolveLatency (prod_temp, 1);

[Figure 3(a): pipelined multiplier (inputs A and B) whose pipeline registers are stalled through their EN inputs by an ENABLE/PIPESTALL (HOLD) signal, feeding a memory that produces the FULL signal. Figure 3(b): the same pipeline with regular registers whose CLOCK passes through a clock gate controlled by the HOLD signal.]

Figure 3: Pipelined architecture suitable for clock gating

3.0 Design Flow For Clock Gating

The traditional approach for synthesizing a design is to use DC for all logic structures including
datapath and control logic. Most arithmetic functions are implemented using DesignWare
libraries. Power Compiler, in conjunction with DC, is used to incorporate clock gating in these
designs. If a high-performance datapath circuit is needed, it is synthesized separately by MC
(without clock gating) and the resulting netlist is then merged with the rest of the logic. The final
design is then assembled and further optimized in DC. This approach is shown in the flow diagram
in Figure 4(a). This methodology works fairly well for most designs where clock gating is only
done in DC-synthesized logic.

For control logic in our design, we used the traditional approach of using DC with Power
Compiler to implement clock gating, following the flow outlined in Figure 4(a). As mentioned
earlier, a significant portion of our design contained datapaths and was synthesized using MC. To
save power in datapath circuits, the MC tool was enhanced to enable automatic clock gating.
When synthesizing with this enhanced version of MC, the tool determines the presence of logic
suitable for clock gating as described in the previous section. The design is then divided into
structures that may be clock gated and the rest that cannot be clock gated. In gateable logic
structures, the synchronous load-enable flops are identified and left mapped to technology-
independent GTECH cells, i.e., these flops are not mapped to CMOS gates and are instead left
mapped to SEQGEN cells. The remaining logic is mapped to CMOS gates. This special
mapping is carried out when MC compilation is invoked with a clock gating option (-cg +).
These partially mapped db files are then read into Power Compiler, which replaces the SEQGEN
flops with regular flops and also inserts clock gates. Since this mapping happens after MC, the
netlist is not fully optimized. Therefore, an incremental compile in DC is needed to fully optimize
the design. This flow is shown in Figure 4(b).

[Figure 4(a): flow diagram. Control logic (VHDL/Verilog): HDL Compiler and Power Compiler (clock gating), then Design Compiler (gate-level mapping and optimization). Datapath (MCL): Module Compiler (gate-level mapping and optimization), then Design Compiler (incremental optimization). Both paths: Design Compiler (block assembly and top-level optimization), then netlist for layout.]

[Figure 4(b): flow diagram. Control logic (VHDL/Verilog): HDL Compiler and Power Compiler (clock gating), then Design Compiler (gate-level mapping and optimization). Datapath (MCL): Module Compiler with CG option (gate-level mapping and optimization), then Power Compiler (clock gate insertion), then Design Compiler (SEQGEN mapping and optimization). Both paths: Design Compiler (block assembly and top-level optimization), then netlist for layout.]

Figure 4: Synthesis Flow for (a) using clock gating in DC only, and (b) using clock gating in
DC and MC

It should be mentioned here that the new version of the Synopsys tools (version 2000.11) provides
the capability to call MC from within DC. Thus the flow shown in Figure 4 could be further
simplified by calling MC from within DC to compile the datapath circuits. This may
help make the synthesis scripts simpler. More details on this capability can be found in
[4,5].

We used the clock gating feature of MC on all datapath circuits in our chip. Our analysis
showed that nearly all flops in our designs were suitable candidates for clock gating. However,
there is a cost associated with clock gating a flop, i.e., the extra clock gating logic that replaces
a multiplexer. This cost is lower if one instance of clock gating logic can be used to gate multiple
flops, such as a register bank. So one has to trade off the amount of power savings against the
extra area due to clock gates. Our previous work has shown that the sweet spot is one clock gate
for 8 or more flops [1]. Therefore, in these experiments we restricted the tool
to use a clock gate only for controlling 8 or more flops.
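
The threshold rule above can be sketched as a simple grouping step. The following Python example is only an illustration of the trade-off, not the algorithm used by MC or Power Compiler; the function name `plan_clock_gates`, the flop names, and the enable names are hypothetical.

```python
from collections import defaultdict

MIN_BANK_WIDTH = 8  # threshold used in the paper: one clock gate per 8 or more flops


def plan_clock_gates(flop_enables, min_bank_width=MIN_BANK_WIDTH):
    """Group flops by enable signal and gate only groups that meet the threshold.

    flop_enables maps a flop name to the name of its load-enable signal.
    Returns (gated, ungated): gated maps an enable to the flops sharing one clock gate.
    """
    banks = defaultdict(list)
    for flop, enable in flop_enables.items():
        banks[enable].append(flop)

    gated, ungated = {}, []
    for enable, bank in banks.items():
        if len(bank) >= min_bank_width:
            gated[enable] = bank
        else:
            ungated.extend(bank)     # small groups keep their multiplexer feedback
    return gated, ungated


# Hypothetical design: a 16-bit bank on "hold1" and a 3-bit register on "stat_en".
flops = {f"data_reg[{i}]": "hold1" for i in range(16)}
flops.update({f"stat_reg[{i}]": "stat_en" for i in range(3)})

gated, ungated = plan_clock_gates(flops)
print(sorted(gated), len(gated["hold1"]), len(ungated))
```

In this sketch the 16-bit bank shares one clock gate, while the 3-bit register stays as a load-enable register because a dedicated clock gate would not pay for itself.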

4.0 Results

The work reported here was done on a graphics chip for personal computers. We chose a
representative set of five test cases covering a wide variety of datapath functionality in
the 3D graphics block. These blocks include texture blending, format converters and polygon
binning circuits, with gate counts ranging from 500 to 5000 cells. All these blocks were
implemented using MC.

To determine the effectiveness of our clock gating scheme, we calculated the power consumed by
the units with and without clock gating. For power estimation we used the report_power command
in DC. This command uses activity factors to estimate power. Accurate power estimation
requires a vector set that mimics the operation of the design in a real-life environment. Since our
design was targeted for use in a variety of applications, it was difficult to capture a
representative vector set. Therefore, empirical toggle rates were used for power estimation [3].
For the work reported in this paper, we estimated toggle rates by assuming that clock signals
toggle two times per cycle (once going high and once going low) and all other signals change 0.25
times per cycle. These toggle rates were applied to the primary inputs of the circuits, and activity
factors for all nodes were computed using probabilistic estimation.
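
To show how the assumed toggle rates enter a power estimate, the following Python sketch applies the standard dynamic switching power relation, 0.5 * C * Vdd^2 * (toggles per cycle) * f, to a single net. This is not the internal calculation of report_power, and it ignores internal cell power. The capacitance, supply voltage, and frequency values are hypothetical; only the two toggle rates come from the text above.

```python
CLOCK_TOGGLES_PER_CYCLE = 2.0    # from the paper: one rising and one falling edge
SIGNAL_TOGGLES_PER_CYCLE = 0.25  # from the paper: empirical rate for other signals


def net_switching_power(cap_farads, vdd_volts, freq_hz, toggles_per_cycle):
    """Dynamic switching power of one net: 0.5 * C * Vdd^2 * toggles/cycle * f."""
    return 0.5 * cap_farads * vdd_volts ** 2 * toggles_per_cycle * freq_hz


# Hypothetical values, chosen only for illustration.
CAP = 20e-15     # 20 fF load on each net
VDD = 1.8        # supply voltage in volts
FREQ = 200e6     # 200 MHz clock

clock_net = net_switching_power(CAP, VDD, FREQ, CLOCK_TOGGLES_PER_CYCLE)
data_net = net_switching_power(CAP, VDD, FREQ, SIGNAL_TOGGLES_PER_CYCLE)

print(f"clock net: {clock_net * 1e6:.2f} uW")
print(f"data net:  {data_net * 1e6:.2f} uW")
print(f"ratio:     {clock_net / data_net:.1f}")
```

For equal capacitance, a clock net under these assumptions dissipates eight times the switching power of a data net, which is why removing clock activity at the flops is so effective.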

The power estimates reported here include internal cell power and dynamic net switching power.
The graph in Figure 5 shows the power consumed by each design with and without clock gating.
The data show significant power savings in these designs when clock gating is used. The power
savings ranged from 35% to 57% for these test cases. Our analysis showed that all flops in these
test cases were clock gated. The variation in power savings arises because, even though all flops
in these designs are gated, the clock gate enables are not necessarily always active. How often
each enable is active is a characteristic of the functionality of each design.

[Figure 5: bar chart of power (mW), before and after clock gating, for Design1 through Design5.]

Figure 5: Power savings due to clock gating.

It should be noted that all results reported here are for pre-layout netlists that were synthesized
using a custom wire load model. Therefore, we have assumed an ideal clock with a specified
skew and have not performed clock gate sizing based on fanout. Our Clock Tree Synthesis tool
performs accurate sizing based on extracted parasitics and the actual placement of the sequential
cells. Therefore, in the actual design these clock gates are sized properly based on loading and
location.

Since we were concerned about the area cost of clock gating, we set the threshold for clock gating
to 8 bits. To make sure that the area cost was acceptable, we looked at the cell area of all designs.
The graph in Figure 6 shows the area of all designs with and without clock gating. This data shows
that there is no area penalty for clock gating; rather, there is an area reduction of about 5% to
10%. We believe this is because, during clock gating, MC replaces the multiplexers that are
connected to the inputs of all flops in the feedback loop of a register bank with a single clock gate.

[Figure 6: bar chart of cell area (area units), before and after clock gating, for Design1 through Design5.]

Figure 6: Area savings due to clock gating.

The design changes due to clock gating add more logic to the clock paths. This may
affect clock skew. We addressed this problem by accommodating the extra delay due to clock
gates in our total clock skew budgets and designing the clock tree accordingly. With this approach,
the timing results of the design are not affected by clock gating.

5.0 Limitations and Issues

During this work, we noticed certain limitations of the tool and reported them to the vendor. We
created workarounds for these limitations to continue our work. We believe the vendor has plans
to enhance the tools to solve these problems.

5.1 Multiple Clock Gates

For nets with high fanout, such as enable signals for a bank of registers, MC builds a buffer tree.
As a result, the actual enable signal may be split into multiple nets, with each segment of the
buffer tree being considered a separate net. This causes Power Compiler to treat them as different
and unique enable signals. It infers separate clock gates for
each segment of this buffer tree even though the entire register bank is connected to the same
common enable signal. A workaround for this issue is to use the buffer attribute to direct MC to use
only one buffer, i.e., not to build a buffer tree. Power Compiler can then create a single clock
gate for this net. As Power Compiler has the capability to balance the load on a clock gate, it may
duplicate the clock gates as needed to satisfy the fanout and loading requirements.
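
The effect of the buffer tree on enable identification can be illustrated with a toy netlist. The following Python sketch is not how Power Compiler identifies enables; the net names, the `buffer_driver` map, and both helper functions are hypothetical. It only shows why comparing enables by net name yields more clock gates than tracing each net back to its common source.

```python
def root_enable(net, buffer_driver):
    """Follow buffer inputs back from a net to the signal that originally drives it."""
    while net in buffer_driver:
        net = buffer_driver[net]
    return net


def count_clock_gates(flop_enable_net, buffer_driver=None):
    """Count distinct enables, optionally tracing each net back through buffers."""
    if buffer_driver is None:
        return len(set(flop_enable_net.values()))
    return len({root_enable(net, buffer_driver) for net in flop_enable_net.values()})


# Hypothetical netlist: "hold1" fans out through two buffers to a 16-flop bank.
buffer_driver = {"hold1_buf0": "hold1", "hold1_buf1": "hold1"}  # output net -> input net
flop_enable_net = {f"reg[{i}]": f"hold1_buf{i // 8}" for i in range(16)}

by_net_name = count_clock_gates(flop_enable_net)
by_root = count_clock_gates(flop_enable_net, buffer_driver)
print(by_net_name, by_root)
```

With a single buffer on the enable, as in the workaround described above, every flop sees the same net name and the two counts agree.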

5.2 Retiming

A very powerful capability of MC is its ability to retime a pipeline to balance the delays of each
stage by moving logic across flop boundaries. As clock gating absorbs logic (the multiplexer) in
the datapath, retiming after clock gating could further optimize the circuit by moving logic
between the pipeline stages. However, since DC has an inherent limitation in retiming clock-gated
circuits, Synopsys prohibits the use of this feature as part of DC optimization after MC. Thus,
once clock gating is done, no retiming is possible. MC retiming can still be performed before
clock gate insertion.

5.3 Scan-test Mode Usage

Many designs use scan design concepts to meet their design-for-test goals. MC enables this by
converting all flip-flops into their scan equivalents using the scan attribute. However, the clock
gating feature does not support the scan-test mode, so scan replacement must be turned off in MC
when clock gating is used. There is a simple workaround for this problem. During post-MC
optimization in DC, an incremental compile with the scan option can be performed for scan
replacement. One should be careful to make sure that a clock gating style with test mode control
is properly incorporated so that clock gating can be disabled when scan mode is in operation.

6.0 Conclusions

Datapath circuits constitute a significant portion of modern high-performance designs. Significant
power savings are possible by using clock gating techniques. With the addition of clock gating
capability in MC, it is now possible to implement this technique automatically in datapath
designs. Besides saving power, it has the additional benefit of reducing the area of the design.
Our experiments show power savings of 35% to 57% in datapath designs while reducing area
by 5% to 10%. Automatic clock gating is now well integrated in MC and DC, so only a minimal
change in design flow is needed to take advantage of it.

7.0 Acknowledgements

We would like to thank Vasu Madabushi, CAE for MC at Synopsys, and Troy Lombardi, our
Synopsys AC, for their support during this work. We would also like to express our appreciation
to Trinanjan Chatterjee, Manager, MC R&D, Synopsys, for his enthusiastic support of this
work.

8.0 References

1. Z. Khan & G. Mehta, “Automatic Clock Gating for Power Reduction,” SNUG, March 1999.
2. N. Cooray & Z. Khan, “High Performance Datapath Synthesis using Module Compiler™,” SNUG, March 2000.
3. Z. Khan, “Using Synthesis Techniques for Power Reduction,” SNUG, March 1998.
4. Module Compiler User Guide (2000.11).
5. Module Compiler Reference Manual (2000.11).
