IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 24, NO. 2, FEBRUARY 2016 421

High-Speed and Energy-Efficient Carry Skip Adder Operating Under a Wide Range of Supply Voltage Levels
Milad Bahadori, Mehdi Kamal, Ali Afzali-Kusha, Senior Member, IEEE, and Massoud Pedram, Fellow, IEEE

Abstract: In this paper, we present a carry skip adder (CSKA) structure that has a higher speed yet lower energy consumption compared with the conventional one. The speed enhancement is achieved by applying concatenation and incrementation schemes to improve the efficiency of the conventional CSKA (Conv-CSKA) structure. In addition, instead of utilizing multiplexer logic, the proposed structure makes use of AND-OR-Invert (AOI) and OR-AND-Invert (OAI) compound gates for the skip logic. The structure may be realized with both fixed stage size and variable stage size styles, wherein the latter further improves the speed and energy parameters of the adder. Finally, a hybrid variable latency extension of the proposed structure, which lowers the power consumption without considerably impacting the speed, is presented. This extension utilizes a modified parallel structure for increasing the slack time, and hence, enabling further voltage reduction. The proposed structures are assessed by comparing their speed, power, and energy parameters with those of other adders using a 45-nm static CMOS technology for a wide range of supply voltages. The results that are obtained using HSPICE simulations reveal, on average, 44% and 38% improvements in the delay and energy, respectively, compared with those of the Conv-CSKA. In addition, the power-delay product was the lowest among the structures considered in this paper, while its energy-delay product was almost the same as that of the Kogge-Stone parallel prefix adder with considerably smaller area and power consumption. Simulations on the proposed hybrid variable latency CSKA reveal reduction in the power consumption compared with the latest works in this field while having a reasonably high speed.

Index Terms: Carry skip adder (CSKA), energy efficient, high performance, hybrid variable latency adders, voltage scaling.

Manuscript received October 8, 2014; revised January 16, 2015; accepted February 13, 2015. Date of publication March 11, 2015; date of current version January 19, 2016. The work of M. Bahadori, M. Kamal, and A. Afzali-Kusha was supported by the Iranian National Science Foundation. The work of M. Pedram was supported by the U.S. National Science Foundation.
M. Bahadori, M. Kamal, and A. Afzali-Kusha are with the School of Electrical and Computer Engineering, University of Tehran, Tehran 19967-15433, Iran (e-mail: [email protected]; [email protected]; [email protected]).
M. Pedram is with the Department of Electrical Engineering, University of Southern California, Los Angeles, CA 90089 USA (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TVLSI.2015.2405133
1063-8210 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION

ADDERS are a key building block in arithmetic and logic units (ALUs) [1] and hence increasing their speed and reducing their power/energy consumption strongly affect the speed and power consumption of processors. There are many works on the subject of optimizing the speed and power of these units, which have been reported in [2]-[9]. Obviously, it is highly desirable to achieve higher speeds at low-power/energy consumptions, which is a challenge for the designers of general purpose processors.

One of the effective techniques to lower the power consumption of digital circuits is to reduce the supply voltage due to quadratic dependence of the switching energy on the voltage. Moreover, the subthreshold current, which is the main leakage component in OFF devices, has an exponential dependence on the supply voltage level through the drain-induced barrier lowering effect [10]. Depending on the amount of the supply voltage reduction, the operation of ON devices may reside in the superthreshold, near-threshold, or subthreshold regions. Working in the superthreshold region provides us with lower delay and higher switching and leakage powers compared with the near/subthreshold regions. In the subthreshold region, the logic gate delay and leakage power exhibit exponential dependences on the supply and threshold voltages. Moreover, these voltages are (potentially) subject to process and environmental variations in the nanoscale technologies. The variations increase uncertainties in the aforesaid performance parameters. In addition, the small subthreshold current causes a large delay for the circuits operating in the subthreshold region [10].

Recently, the near-threshold region has been considered as a region that provides a more desirable tradeoff point between delay and power dissipation compared with that of the subthreshold one, because it results in lower delay compared with the subthreshold region and significantly lowers switching and leakage powers compared with the superthreshold region. In addition, near-threshold operation, which uses supply voltage levels near the threshold voltage of transistors [11], suffers considerably less from the process and environmental variations compared with the subthreshold region.

The dependence of the power (and performance) on the supply voltage has been the motivation for design of circuits with the feature of dynamic voltage and frequency scaling. In these circuits, to reduce the energy consumption, the system may change the voltage (and frequency) of the circuit based on the workload requirement [12]. For these systems, the circuit should be able to operate under a wide range of supply voltage levels. Of course, achieving higher speeds at lower supply voltages for the computational blocks, with the adder as one of the main components, could be crucial in the design of high-speed, yet energy efficient, processors.

In addition to the knob of the supply voltage, one may choose between different adder structures/families for

optimizing power and speed. There are many adder families with different delays, power consumptions, and area usages. Examples include ripple carry adder (RCA), carry increment adder (CIA), carry skip adder (CSKA), carry select adder (CSLA), and parallel prefix adders (PPAs). The descriptions of each of these adder architectures along with their characteristics may be found in [1] and [13]. The RCA has the simplest structure with the smallest area and power consumption but with the worst critical path delay. In the CSLA, the speed, power consumption, and area usages are considerably larger than those of the RCA. The PPAs, which are also called carry look-ahead adders, exploit direct parallel prefix structures to generate the carry as fast as possible [14]. There are different types of the parallel prefix algorithms that lead to different PPA structures with different performances. As an example, the Kogge-Stone adder (KSA) [15] is one of the fastest structures but results in large power consumption and area usage. It should be noted that the structure complexities of PPAs are more than those of other adder schemes [13], [16].

The CSKA, which is an efficient adder in terms of power consumption and area usage, was introduced in [17]. The critical path delay of the CSKA is much smaller than the one in the RCA, whereas its area and power consumption are similar to those of the RCA. In addition, the power-delay product (PDP) of the CSKA is smaller than those of the CSLA and PPA structures [19]. In addition, due to the small number of transistors, the CSKA benefits from relatively short wiring lengths as well as a regular and simple layout [18]. The comparatively lower speed of this adder structure, however, limits its use for high-speed applications.

In this paper, given the attractive features of the CSKA structure, we have focused on reducing its delay by modifying its implementation based on the static CMOS logic. The concentration on the static CMOS originates from the desire to have a reliably operating circuit under a wide range of supply voltages in highly scaled technologies [10]. The proposed modification increases the speed considerably while maintaining the low area and power consumption features of the CSKA. In addition, an adjustment of the structure, based on the variable latency technique, which in turn lowers the power consumption without considerably impacting the CSKA speed, is also presented. To the best of our knowledge, no work concentrating on the design of CSKAs operating from the superthreshold region down to the near-threshold region, or on the design of (hybrid) variable latency CSKA structures, has been reported in the literature. Hence, the contributions of this paper can be summarized as follows.

1) Proposing a modified CSKA structure by combining the concatenation and the incrementation schemes to the conventional CSKA (Conv-CSKA) structure for enhancing the speed and energy efficiency of the adder. The modification provides us with the ability to use simpler carry skip logics based on the AOI/OAI compound gates instead of the multiplexer.
2) Providing a design strategy for constructing an efficient CSKA structure based on analytical expressions presented for the critical path delay.
3) Investigating the impact of voltage scaling on the efficiency of the proposed CSKA structure (from the nominal supply voltage to the near-threshold voltage).
4) Proposing a hybrid variable latency CSKA structure based on the extension of the suggested CSKA, by replacing some of the middle stages in its structure with a PPA, which is modified in this paper.

The rest of this paper is organized as follows. Section II discusses related work on modifying the CSKA structure for improving the speed as well as prior work that uses variable latency structures for increasing the efficiency of adders at low supply voltages. In Section III, the Conv-CSKA with fixed stage size (FSS) and variable stage size (VSS) is explained, while Section IV describes the proposed static CSKA structure. The hybrid variable latency CSKA structure is suggested in Section V. The results of comparing the characteristics of the proposed structures with those of other adders are discussed in Section VI. Finally, the conclusion is drawn in Section VII.

II. PRIOR WORK

Since the focus of this paper is on the CSKA structure, first the work related to this adder is reviewed and then the variable latency adder structures are discussed.

A. Modifying CSKAs for Improving Speed

The conventional structure of the CSKA consists of stages containing a chain of full adders (FAs) (RCA block) and a 2:1 multiplexer (carry skip logic). The RCA blocks are connected to each other through 2:1 multiplexers, which can be placed into one or more level structures [19]. The CSKA configuration (i.e., the number of the FAs per stage) has a great impact on the speed of this type of adder [23]. Many methods have been suggested for finding the optimum number of the FAs [18]-[26]. The techniques presented in [19]-[24] make use of VSSs to minimize the delay of adders based on a single-level carry skip logic. In [25], some methods to increase the speed of the multilevel CSKAs are proposed. The techniques, however, increase the area and power considerably and lead to a less regular layout. The design of a static CMOS CSKA where the stages of the CSKA have variable sizes was suggested in [18]. In addition, to lower the propagation delay of the adder, in each stage, the carry look-ahead logics were utilized. Again, it had a complex layout as well as large power consumption and area usage. In addition, the design approach, which was presented only for the 32-bit adder, was not general enough to be applied to structures with different bit lengths.

Alioto and Palumbo [19] propose a simple strategy for the design of a single-level CSKA. The method is based on the VSS technique where the near-optimal numbers of the FAs are determined based on the skip time (delay of the multiplexer), and the ripple time (the time required by a carry to ripple through a FA). The goal of this method is to decrease the critical path delay by considering a noninteger ratio of the skip time to the ripple time, contrary to most of the previous works, which considered an integer ratio [17], [20]. In all of the works reviewed so far, the focus was on the speed, while the power consumption and area usage of the CSKAs were
BAHADORI et al.: HIGH-SPEED AND ENERGY-EFFICIENT CSKA 423

Fig. 1. Conventional structure of the CSKA [19].

not considered. Even for the speed, the delay of skip logics, which are based on multiplexers and form a large part of the adder critical path delay [19], has not been reduced.

B. Improving Efficiency of Adders at Low Supply Voltages

To improve the performance of the adder structures at low supply voltage levels, some methods have been proposed in [27]-[36]. In [27]-[29], an adaptive clock stretching operation has been suggested. The method is based on the observation that the critical paths in adder units are rarely activated. Therefore, the slack time between the critical paths and the off-critical paths may be used to reduce the supply voltage. Notice that the voltage reduction must not increase the delays of the noncritical timing paths to become larger than the period of the clock, allowing us to keep the original clock frequency at a reduced supply voltage level. When the critical timing paths in the adder are activated, the structure uses two clock cycles to complete the operation. This way the power consumption reduces considerably at the cost of rather small throughput degradation. In [27], the efficiency of this method for reducing the power consumption of the RCA structure has been demonstrated.

The CSLA structure in [28] was enhanced to use adaptive clock stretching operation where the enhanced structure was called cascade CSLA (C2SLA). Compared with the common CSLA structure, C2SLA uses more and different sizes of RCA blocks. Since the slack time between the critical timing paths and the longest off-critical path was small, the supply voltage scaling, and hence, the power reduction were limited. Finally, using the hybrid structure to improve the effectiveness of the adaptive clock stretching operation has been investigated in [31] and [33]. In the proposed hybrid structure, the KSA has been used in the middle part of the C2SLA where this combination leads to the positive slack time increase. However, the C2SLA and its hybrid version are not good candidates for low-power ALUs. This statement originates from the fact that due to the logic duplication in this type of adders, the power consumption and also the PDP are still high even at low supply voltages [33].

III. CONVENTIONAL CARRY SKIP ADDER

The structure of an N-bit Conv-CSKA, which is based on blocks of the RCA (RCA blocks), is shown in Fig. 1. In addition to the chain of FAs in each stage, there is a carry skip logic. For an RCA that contains N cascaded FAs, the worst propagation delay of the summation of two N-bit numbers, A and B, belongs to the case where all the FAs are in the propagation mode. It means that the worst case delay belongs to the case where

$$P_i = A_i \oplus B_i = 1 \quad \text{for } i = 1, \ldots, N$$

where $P_i$ is the propagation signal related to $A_i$ and $B_i$. This shows that the delay of the RCA is linearly related to N [1]. In the case where a group of cascaded FAs are in the propagate mode, the carry output of the chain is equal to the carry input. In the CSKA, the carry skip logic detects this situation, and makes the carry ready for the next stage without waiting for the operation of the FA chain to be completed. The skip operation is performed using the gates and the multiplexer shown in the figure. Based on this explanation, the N FAs of the CSKA are grouped in Q stages. Each stage contains an RCA block with $M_j$ FAs ($j = 1, \ldots, Q$) and a skip logic. In each stage, the inputs of the multiplexer (skip logic) are the carry input of the stage and the carry output of its RCA block (FA chain). In addition, the product of the propagation signals (P) of the stage is used as the selector signal of the multiplexer.

The CSKA may be implemented using FSS and VSS where the highest speed may be obtained for the VSS structure [19], [22]. Here, the stage size is the same as the RCA block size. In Sections III-A and III-B, these two different implementations of the CSKA adder are described in more detail.

A. Fixed Stage Size CSKA

By assuming that each stage of the CSKA contains M FAs, there are $Q = N/M$ stages where, for the sake of simplicity, we assume Q is an integer. The input signals of the jth multiplexer are the carry output of the FA chain in the jth stage, denoted by $C_j^0$, and the carry output of the previous stage (carry input of the jth stage), denoted by $C_{j-1}^1$ (Fig. 1).

The critical path of the CSKA contains three parts: 1) the path of the FA chain of the first stage whose delay is equal to $M \times T_{\mathrm{CARRY}}$; 2) the path of the intermediate carry skip multiplexers whose delay is equal to $(Q - 1) \times T_{\mathrm{MUX}}$; and 3) the path of the FA chain in the last stage whose delay is equal to $(M - 1) \times T_{\mathrm{CARRY}} + T_{\mathrm{SUM}}$. Note that TCARRY,

TSUM, and TMUX are the propagation delays of the carry output of an FA, the sum output of an FA, and the output delay of a 2:1 multiplexer, respectively. Hence, the critical path delay of a FSS CSKA is formulated by

$$T_D = \left[M \times T_{\mathrm{CARRY}}\right] + \left[\left(\frac{N}{M} - 1\right) \times T_{\mathrm{MUX}}\right] + \left[(M - 1) \times T_{\mathrm{CARRY}} + T_{\mathrm{SUM}}\right]. \tag{1}$$

Based on (1), the optimum value of M ($M_{\mathrm{opt}}$) that leads to optimum propagation delay may be calculated as $(0.5N\alpha)^{1/2}$ where $\alpha$ is equal to $T_{\mathrm{MUX}}/T_{\mathrm{CARRY}}$. Therefore, the optimum propagation delay ($T_{D,\mathrm{opt}}$) is obtained from

$$T_{D,\mathrm{opt}} = 2\sqrt{2N\,T_{\mathrm{CARRY}}\,T_{\mathrm{MUX}}} + \left(T_{\mathrm{SUM}} - T_{\mathrm{CARRY}} - T_{\mathrm{MUX}}\right) = T_{\mathrm{SUM}} + \left(2\sqrt{2N\alpha} - 1 - \alpha\right) \times T_{\mathrm{CARRY}}. \tag{2}$$

Thus, the optimum delay of the FSS CSKA is almost proportional to the square root of the product of N and $\alpha$ [19].

B. Variable Stage Size CSKA

As mentioned before, by assigning variable sizes to the stages, the speed of the CSKA may be improved. The speed improvement in this type is achieved by lowering the delays of the first and third terms in (1). These delays are minimized by lowering the sizes of the first and last RCA blocks. For instance, the first RCA block size may be set to one, whereas the sizes of the following blocks may increase. To determine the rate of increase, let us express the propagation delay of $C_j^1$ ($t_j^1$) by

$$t_j^1 = \max\left(t_j^0,\; t_{j-1}^1\right) + T_{\mathrm{MUX}} \tag{3}$$

where $t_j^0$ ($t_{j-1}^1$) is the delay of calculating the $C_j^0$ ($C_{j-1}^1$) signal in the jth [(j − 1)th] stage. In a FSS CSKA, except in the first stage, $t_j^0$ is smaller than $t_{j-1}^1$. Hence, based on (3), $t_j^0$ may be increased from $t_1^0$ to $t_{j-1}^1$ without increasing the delay of the $C_j^1$ signal. This means that one could increase the size of the jth stage (i.e., $M_j$) without increasing the propagation delay of the CSKA. Therefore, increasing the size $M_j$ of the jth stage should be bounded by

$$t_j^0 \le t_{j-1}^1 = t_1^0 + (j - 1)\,T_{\mathrm{MUX}}. \tag{4}$$

Since the last RCA block size also should be minimized, the increase in the stage size may not be continued to the last RCA block. Thus, we justify the decrease in the RCA block sizes toward the last stage. First, note that based on Fig. 1, the output of the jth stage is, in the worst case, accessible after $t_{j-1}^1 + T_{\mathrm{SUM},j}$. Assuming that the pth stage has the maximum RCA block size, we wish to keep the delay of the outputs of the following stages equal to the delay of the output of the pth stage. To keep the same worst case delay for the critical path, we should reduce the size of the following RCA blocks. For example, when $i \ge p$, for the (i + 1)th stage, the output delay is $t_{i-1}^1 + T_{\mathrm{MUX}} + T_{\mathrm{SUM},i+1}$, where $T_{\mathrm{SUM},i+1}$ is the delay of the (i + 1)th RCA block for calculating all of its sum outputs when its carry input is ready. Therefore, the size of the (i + 1)th stage should be reduced to decrease $T_{\mathrm{SUM},i+1}$, preventing the increase in the worst case delay ($T_D$) of the adder. In other words, we eliminate the increase in the delay of the next stage due to the additional multiplexer by reducing the sum delay of the RCA block. This may be analytically expressed as

$$T_{\mathrm{SUM},i+1} \le T_{\mathrm{SUM},i} - T_{\mathrm{MUX}}; \quad \text{for } i \ge p. \tag{5}$$

The trend of decreasing the stage size should be continued until we produce the required number of adder bits. Note that, in this case, the size of the last RCA block may only be one (i.e., one FA). Hence, to reach the highest number of input bits under a constant propagation delay, both (4) and (5) should be satisfied. Having these constraints, we can minimize the delay of the CSKA for a given number of input bits to find the stage sizes for an optimal structure. In this optimal CSKA, the size of the first p stages is increased, while the size of the last (Q − p) stages is decreased. For this structure, the pth stage, which is called the nucleus of the adder, has the maximum size [24].

Now, let us find the constraints used for determining the optimum structure in this case. As mentioned before, when the jth stage is not in the propagate mode, the carry output of the stage is $C_j^0$. In this case, the maximum of $t_j^0$ is equal to $M_j \times T_{\mathrm{CARRY}}$. To satisfy (4), we increase the size of the first p stages up to the nucleus using [19]

$$M_j \le M_1 + (j - 1)\,\alpha; \quad \text{for } 1 \le j \le p. \tag{6}$$

In addition, the maximum of $T_{\mathrm{SUM},i}$ is equal to $(M_i - 1) \times T_{\mathrm{CARRY}} + T_{\mathrm{SUM}}$. To satisfy (5), the size of the last (Q − p) stages from the nucleus to the last stage should decrease based on [19]

$$M_i \le M_Q + (Q - i)\,\alpha; \quad \text{for } p \le i \le Q. \tag{7}$$

In the case where $\alpha$ is an integer value, the exact sizes of stages for the optimal structure can be determined. Subsequently, the optimal values of $M_1$, $M_Q$, and Q as well as the delay of the optimal CSKA may be calculated [19]. In the case where $\alpha$ is a noninteger value, one may realize only a near-optimal structure, as detailed in [19] and [21]. In this case, most of the time, by setting $M_1$ to 1 and using (6) and (7), the near-optimal structure is determined. It should be noted that, in practice, $\alpha$ is a noninteger whose value is smaller than one. This is the case that has been studied in [19], where the estimation of the near-optimal propagation delay of the CSKA is given by [19]

$$T_{D,\mathrm{opt}} = \left(2\left\lceil\frac{\alpha}{2}\right\rceil - 1\right) \times T_{\mathrm{CARRY}} + \left(2\left\lceil\sqrt{\frac{N}{\lceil\alpha/2\rceil}}\,\right\rceil - 1\right) \times T_{\mathrm{MUX}} + T_{\mathrm{SUM}}. \tag{8}$$

This equation may be written in a more general form by replacing $T_{\mathrm{MUX}}$ by $T_{\mathrm{SKIP}}$ to allow for other logic types instead of the multiplexer. For this form, $\alpha$ becomes equal to $T_{\mathrm{SKIP}}/T_{\mathrm{CARRY}}$. Finally, note that in real implementations, $T_{\mathrm{SKIP}} < T_{\mathrm{CARRY}}$, and hence, $\lceil\alpha/2\rceil$ becomes equal to one. Thus, (8) may be written as

$$T_{D,\mathrm{opt}} = T_{\mathrm{CARRY}} + \left(2\left\lceil\sqrt{N}\,\right\rceil - 1\right) \times T_{\mathrm{SKIP}} + T_{\mathrm{SUM}}. \tag{9}$$

Note that, as (9) reveals, a large portion of the critical path delay is due to the carry skip logics.
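As a numerical illustration of the fixed stage size analysis in Section III-A, the following minimal sketch evaluates the critical path delay in (1), the real-valued optimum stage size $(0.5N\alpha)^{1/2}$, and the closed-form optimum delay in (2). The function names and the normalized gate delays used in the example are hypothetical and chosen only for illustration; they are not values reported in this paper.

```python
import math


def fss_delay(n_bits, m, t_carry, t_sum, t_mux):
    """Critical path delay of an FSS CSKA, following Eq. (1).

    n_bits / m is the number of stages Q; the text assumes Q is an integer,
    but a real-valued m is accepted here so the optimum can be checked.
    """
    q = n_bits / m
    first_stage = m * t_carry                # FA chain of the first stage
    skip_path = (q - 1) * t_mux              # intermediate skip multiplexers
    last_stage = (m - 1) * t_carry + t_sum   # FA chain of the last stage
    return first_stage + skip_path + last_stage


def fss_optimum(n_bits, t_carry, t_sum, t_mux):
    """Real-valued optimum stage size and the optimum delay of Eq. (2)."""
    alpha = t_mux / t_carry
    m_opt = math.sqrt(0.5 * n_bits * alpha)
    t_opt = t_sum + (2.0 * math.sqrt(2.0 * n_bits * alpha) - 1.0 - alpha) * t_carry
    return m_opt, t_opt


def best_integer_stage_size(n_bits, t_carry, t_sum, t_mux):
    """Brute-force search over stage sizes that divide n_bits evenly."""
    divisors = [m for m in range(1, n_bits + 1) if n_bits % m == 0]
    return min(divisors, key=lambda m: fss_delay(n_bits, m, t_carry, t_sum, t_mux))


if __name__ == "__main__":
    # Hypothetical normalized delays (T_CARRY = 1); illustrative only.
    N, T_CARRY, T_SUM, T_MUX = 32, 1.0, 1.2, 0.8
    m_opt, t_opt = fss_optimum(N, T_CARRY, T_SUM, T_MUX)
    m_int = best_integer_stage_size(N, T_CARRY, T_SUM, T_MUX)
    print(f"real-valued M_opt = {m_opt:.3f}, T_D,opt = {t_opt:.3f}")
    print(f"best integer M = {m_int}, "
          f"T_D = {fss_delay(N, m_int, T_CARRY, T_SUM, T_MUX):.3f}")
```

Because a realizable stage size must be an integer, the delay of a practical FSS CSKA is never below the closed-form value; the brute-force search shows how close the nearest admissible stage size comes under the assumed delays.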

Fig. 2. Proposed CI-CSKA structure.

IV. PROPOSED CSKA STRUCTURE

Based on the discussion presented in Section III, it is concluded that by reducing the delay of the skip logic, one may lower the propagation delay of the CSKA significantly. Hence, in this paper, we present a modified CSKA structure that reduces this delay.

A. General Description of the Proposed Structure

The structure is based on combining the concatenation and the incrementation schemes [13] with the Conv-CSKA structure, and hence, is denoted by CI-CSKA. It provides us with the ability to use simpler carry skip logics. The logic replaces 2:1 multiplexers by AOI/OAI compound gates (Fig. 2). The gates, which consist of fewer transistors, have lower delay, area, and power consumption compared with those of the 2:1 multiplexer [37]. Note that, in this structure, as the carry propagates through the skip logics, it becomes complemented. Therefore, at the output of the skip logic of even stages, the complement of the carry is generated. The structure has a considerably lower propagation delay with a slightly smaller area compared with those of the conventional one. Note that while the power consumption of the AOI (or OAI) gate is smaller than that of the multiplexer, the power consumption of the proposed CI-CSKA is a little more than that of the conventional one. This is due to the increase in the number of the gates, which imposes a higher wiring capacitance (in the noncritical paths).

Now, we describe the internal structure of the proposed CI-CSKA shown in Fig. 2 in more detail. The adder contains two N-bit inputs, A and B, and Q stages. Each stage consists of an RCA block with the size of $M_j$ ($j = 1, \ldots, Q$). In this structure, the carry input of all the RCA blocks, except for the first block which is $C_i$, is zero (concatenation of the RCA blocks). Therefore, all the blocks execute their jobs simultaneously. In this structure, when the first block computes the summation of its corresponding input bits (i.e., $S_{M_1}, \ldots, S_1$), and $C_1$, the other blocks simultaneously compute the intermediate results [i.e., $\{Z_{K_j+M_j}, \ldots, Z_{K_j+2}, Z_{K_j+1}\}$ for $K_j = \sum_{r=1}^{j-1} M_r$ ($j = 2, \ldots, Q$)], and also the $C_j$ signals.

In the proposed structure, the first stage has only one block, which is RCA. The stages 2 to Q consist of two blocks of RCA and incrementation. The incrementation block uses the intermediate results generated by the RCA block and the carry output of the previous stage to calculate the final summation of the stage. The internal structure of the incrementation block, which contains a chain of half-adders (HAs), is shown in Fig. 3. In addition, note that, to reduce the delay considerably, for computing the carry output of the stage, the carry output of the incrementation block is not used. As shown in Fig. 2, the skip logic determines the carry output of the jth stage ($C_{O,j}$) based on the intermediate results of the jth stage and the carry output of the previous stage ($C_{O,j-1}$) as well as the carry output of the corresponding RCA block ($C_j$). When determining $C_{O,j}$, these cases may be encountered. When $C_j$ is equal to one, $C_{O,j}$ will be one. On the other hand, when $C_j$ is equal to zero, if the product of the intermediate results is one (zero), the value of $C_{O,j}$ will be the same as $C_{O,j-1}$ (zero).

Fig. 3. Internal structure of the jth incrementation block, $K_j = \sum_{r=1}^{j-1} M_r$ ($j = 2, \ldots, Q$).

The reason for using both AOI and OAI compound gates as the skip logics is the inverting functions of these gates in standard cell libraries. This way the need for an inverter gate, which increases the power consumption and delay, is eliminated. As shown in Fig. 2, if an AOI is used as the skip logic, the next skip logic should use an OAI gate. In addition, another point to mention is that the use of the proposed skipping structure in the Conv-CSKA structure increases the delay of the critical path considerably. This originates from the fact that, in the Conv-CSKA, the skip logic (AOI or OAI compound gates) is not able to bypass the zero carry input until the zero carry input propagates from the corresponding RCA block. To solve this problem, in the proposed structure, we have used an RCA block with a carry input of zero (using the concatenation approach). This way, since the RCA block of the stage does not need to wait for the carry output of the

previous stage, the output carries of the blocks are calculated in parallel.

B. Area and Delay of the Proposed Structure

As mentioned before, the use of the static AOI and OAI gates (six transistors) compared with the static 2:1 multiplexer (12 transistors) leads to decreases in the area usage and delay of the skip logic [37], [38]. In addition, except for the first RCA block, the carry input for all other blocks is zero, and hence, for these blocks, the first adder cell in the RCA chain is a HA. This means that (Q − 1) FAs in the conventional structure are replaced with the same number of HAs in the suggested structure, decreasing the area usage (Fig. 2). In addition, note that the proposed structure utilizes incrementation blocks that do not exist in the conventional one. These blocks, however, may be implemented with about the same logic gates (XOR and AND gates) as those used for generating the select signal of the multiplexer in the conventional structure. Therefore, the area usage of the proposed CI-CSKA structure is decreased compared with that of the conventional one.

The critical path of the proposed CI-CSKA structure, which contains three parts, is shown in Fig. 2. These parts include the chain of the FAs of the first stage, the path of the skip logics, and the incrementation block in the last stage. The delay of this path ($T_D$) may be expressed as

$$T_D = \left[M_1 \times T_{\mathrm{CARRY}}\right] + \left[(Q - 2) \times T_{\mathrm{SKIP}}\right] + \left[(M_Q - 1) \times T_{\mathrm{AND}} + T_{\mathrm{XOR}}\right] \tag{10}$$

where the three brackets correspond to the three parts mentioned above, respectively. Here, $T_{\mathrm{AND}}$ and $T_{\mathrm{XOR}}$ are the delays of the two-input static AND and XOR gates, respectively. Note that $\left[(M_j - 1) \times T_{\mathrm{AND}} + T_{\mathrm{XOR}}\right]$ is the critical path delay of the jth incrementation block ($T_{\mathrm{INC},j}$), which is shown in Fig. 3.

To calculate the delay of the skip logic, the average of the delays of the AOI and OAI gates, which are typically close to one another [35], is used. Thus, (10) may be modified to

$$T_D = \left[M_1 \times T_{\mathrm{CARRY}}\right] + \left[(Q - 2) \times \frac{T_{\mathrm{AOI}} + T_{\mathrm{OAI}}}{2}\right] + \left[(M_Q - 1) \times T_{\mathrm{AND}} + T_{\mathrm{XOR}}\right] \tag{11}$$

where $T_{\mathrm{AOI}}$ and $T_{\mathrm{OAI}}$ are the delays of the static AOI and OAI gates, respectively.

The comparison of (1) and (11) indicates that the delay of the proposed structure is smaller than that of the conventional one. The first reason is that the delay of the skip logic is considerably smaller than that of the conventional structure while the number of the stages is about the same in both structures. Second, since $T_{\mathrm{AND}}$ and $T_{\mathrm{XOR}}$ are smaller than $T_{\mathrm{CARRY}}$ and $T_{\mathrm{SUM}}$, the third additive term in (11) becomes smaller than the third term in (1) [37]. It should be noted that the delay reduction of the skip logic has the largest impact on the delay decrease of the whole structure.

C. Stage Sizes Consideration

FSS or VSS. Here, the stage size is the same as the RCA and incrementation blocks size. In the case of the FSS (FSS-CI-CSKA), there are $Q = N/M$ stages with the size of M. The optimum value of M, which may be obtained using (11), is given by

$$M_{\mathrm{opt}} = \sqrt{\frac{N\left(T_{\mathrm{AOI}} + T_{\mathrm{OAI}}\right)}{2\left(T_{\mathrm{CARRY}} + T_{\mathrm{AND}}\right)}}. \tag{12}$$

In the case of the VSS (VSS-CI-CSKA), the sizes of the stages, which are $M_1$ to $M_Q$, are obtained using a method similar to the one discussed in Section III-B. For this structure, the new value for $T_{\mathrm{SKIP}}$ should be used, and hence, $\alpha$ becomes $(T_{\mathrm{AOI}} + T_{\mathrm{OAI}})/(2\,T_{\mathrm{CARRY}})$. In particular, the following steps should be taken.

1) The size of the RCA block of the first stage is one.
2) From the second stage to the nucleus stage, the size of the jth stage is determined based on the delay of the product of the sum of its RCA block and the delay of the carry output of the (j − 1)th stage. Hence, based on the description given in Section III-B, the size of the RCA block of the jth stage should be as large as possible, while the delay of the product of its output sum should be smaller than the delay of the carry output of the (j − 1)th stage. Therefore, in this case, the sizes of the stages are either not changed or increased.
3) The increase in the size is continued until the summation of all the sizes up to this stage becomes larger than N/2. The last stage, which has the largest size, is considered as the nucleus (pth) stage. There are cases in which we should consider the stage right before this stage as the nucleus stage (Step 5).
4) Starting from the stage (p + 1) to the last stage, the size of the stage i is determined based on the delay of the incrementation block of the ith and (i − 1)th stages ($T_{\mathrm{INC},i}$ and $T_{\mathrm{INC},i-1}$, respectively), and the delay of the skip logic. In particular

$$T_{\mathrm{INC},i} \le T_{\mathrm{INC},i-1} - T_{\mathrm{SKIP},i-1}; \quad \text{for } i \ge p + 1. \tag{13}$$

In this case, the size of the last stage is one, and its RCA block contains a HA.
5) Finally, note that it is possible that the sum of all the stage sizes does not become equal to N. In the case where the sum is smaller than N by d bits, we should add another stage with the size of d. The stage is placed close to the stage with the same size. In the case where the sum is larger than N by d bits, the size of the stages should be revised (Step 3). For more details on how to revise the stage sizes, one may refer to [19].

Now, the procedure for determining the stage sizes is demonstrated for the 32-bit adder. It includes both the conventional and the proposed CI-CSKA structures. The number of stages and the corresponding size for each stage, which are given in Fig. 4, have been determined based on a 45-nm static CMOS technology [38]. The dashed and dotted lines in the plot indicate the rates of size increase and decrease. While the
Similar to the Conv-CSKA structure, the proposed CI-CSKA increase and decrease rates in the conventional structure are
structure may be implemented with either balanced, the decrease rate is more than the
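As a minimal sketch of how the delay model of (11) can be evaluated for a candidate list of stage sizes, the Python snippet below computes the three bracketed terms separately. The function names are hypothetical, and the gate delays used in the usage note are illustrative placeholders rather than values from the 45-nm library used in this paper.

```python
def incrementation_delay(m, t_and, t_xor):
    """T_INC of an m-bit incrementation block: (m - 1) * T_AND + T_XOR."""
    return (m - 1) * t_and + t_xor


def ci_cska_critical_path(stage_sizes, t_carry, t_aoi, t_oai, t_and, t_xor):
    """Critical path delay of a CI-CSKA following the three terms of (11).

    stage_sizes lists the stage sizes from LSB to MSB; all delays must use
    the same (arbitrary) time unit.
    """
    q = len(stage_sizes)
    if q < 2:
        raise ValueError("the delay model assumes at least two stages")
    # Skip logic delay taken as the average of the AOI and OAI gate delays.
    t_skip = (t_aoi + t_oai) / 2
    first_stage = stage_sizes[0] * t_carry          # FA chain of stage 1
    skip_chain = (q - 2) * t_skip                   # skip logics in between
    last_stage = incrementation_delay(stage_sizes[-1], t_and, t_xor)
    return first_stage + skip_chain + last_stage


# Assumed example: 32-bit FSS adder with eight 4-bit stages and made-up delays.
fss_32 = [4] * 8
delay = ci_cska_critical_path(fss_32, t_carry=20, t_aoi=10, t_oai=12,
                              t_and=10, t_xor=15)
```

With these placeholder delays the three terms are 80, 66, and 45 time units, so `delay` evaluates to 191. The same function accepts a VSS size list, although (11) only models the path through the first-stage FA chain, the skip logics, and the last incrementation block.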
The larger decrease rate in the proposed structure originates from the fact that, in the Conv-CSKA structure, both the increase and the decrease of the stage sizes are determined based on the RCA block delay [according to (4) and (5)], while in the proposed CI-CSKA structure, the increase is determined based on the RCA block delay and the decrease is determined based on the incrementation block delay [according to (13)]. The imbalanced rates may yield a larger nucleus stage and a smaller number of stages, leading to a smaller propagation delay.

Fig. 4. Sizes of the stages in the case of VSS for the proposed and conventional 32-bit CSKA structures in 45-nm static CMOS technology.

V. PROPOSED HYBRID VARIABLE LATENCY CSKA

In this section, first, the structure of a generic variable latency adder, which may be used with the voltage scaling relying on adaptive clock stretching, is described. Then, a hybrid variable latency CSKA structure based on the CI-CSKA structure described in Section IV is proposed.

A. Variable Latency Adders Relying on Adaptive Clock Stretching

The basic idea behind variable latency adders is that the critical paths of the adders are activated rarely [33]. Hence, the supply voltage may be scaled down without decreasing the clock frequency. If the critical paths are not activated, one clock period is enough for completing the operation. In the cases where the critical paths are activated, the structure allows two clock periods for finishing the operation. Hence, in this structure, the slack between the longest off-critical paths and the longest critical paths determines the maximum amount of the supply voltage scaling. Therefore, in the variable latency adders, a predictor block, which works based on the input pattern, is required for determining the activation of the critical paths [28].

The concepts of the variable latency adders, adaptive clock stretching, and also supply voltage scaling in an $N$-bit RCA adder may be explained using Fig. 5. The predictor block consists of some XOR and AND gates that determine the product of the propagate signals of the considered bit positions. Since the block has some area and power overheads, only a few middle bits are used to predict the activation of the critical paths, at the price of a decrease in the prediction accuracy [31], [33]. In Fig. 5, the input bits $(j+1)$th to $(j+m)$th are exploited to predict the propagation of the carry output of the $j$th stage (FA) to the carry output of the $(j+m)$th stage.
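The behavior of such a predictor can be sketched in a few lines of Python. This is only a behavioral illustration under the assumption that the predictor is the AND of the propagate signals ($A_i \oplus B_i$) over a window of $m$ bits starting at bit $j+1$; the function names are hypothetical and no gate-level timing is modeled.

```python
def propagate_bits(a, b, width):
    """Per-bit propagate signals P_i = A_i XOR B_i, least significant first."""
    x = a ^ b
    return [(x >> i) & 1 for i in range(width)]


def predict_critical(a, b, j, m, width=32):
    """True when the propagate signals of bits j+1 .. j+m (1-indexed) are all one."""
    if j < 0 or m < 1 or j + m > width:
        raise ValueError("prediction window must lie inside the operand width")
    p = propagate_bits(a, b, width)
    # 1-indexed positions j+1 .. j+m map to list indices j .. j+m-1.
    return all(p[i] for i in range(j, j + m))


def cycles_needed(a, b, j, m, width=32):
    """Two clock periods when the predictor fires, otherwise one."""
    return 2 if predict_critical(a, b, j, m, width) else 1
```

For uniformly random operands each propagate bit is one with probability 1/2, so a window of $m$ bits fires the predictor with probability $1/2^m$. For example, enumerating all 6-bit operand pairs with `j=2, m=3` gives 512 firing pairs out of 4096.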
Fig. 5. Generic structure of variable latency adders based on RCA.

For the configuration of Fig. 5, in which the predictor monitors the input bits $(j+1)$ to $(j+m)$, the carry propagation path from the first stage to the $N$th stage is the longest critical path [denoted by Long Latency Path (LLP)], while the carry propagation path from the first stage to the $(j+m)$th stage and the carry propagation path from the $(j+1)$th stage to the $N$th stage [denoted by Short Latency Path 1 (SLP1) and SLP2, respectively] are the longest off-critical paths. It should be noted that the paths that the predictor shows are (are not) active for a given set of inputs are considered as critical (off-critical) paths. Having the bits in the middle decreases the maximum of the off-critical paths [33]. The range of voltage scaling is determined by the slack time, which is defined as the delay difference between LLP and max(SLP1, SLP2). Since the activation probability of the critical paths is low ($<1/2^m$), the clock stretching has a negligible impact on the throughput (e.g., for a 32-bit adder, $m$ = 6–10 may be considered [33]). There are cases in which the predictor mispredicts the critical path activation. By increasing $m$, the number of mispredictions decreases at the price of increasing the longest off-critical path, and hence, limiting the range of the voltage scaling. Therefore, the predictor block size should be selected based on these tradeoffs.

B. Proposed Hybrid Variable Latency CSKA Structure

The basic idea behind using VSS CSKA structures was based on almost balancing the delays of paths such that the delay of the critical path is minimized compared with that of the FSS structure [21]. This deprives us of the opportunity of using the slack time for the supply voltage scaling. To provide the variable latency feature for the VSS CSKA structure, we replace some of the middle stages in our proposed structure with a PPA modified in this paper. It should be noted that since the Conv-CSKA structure has a lower speed than that of the proposed one, in this section, we do not consider the conventional structure. The proposed hybrid variable latency CSKA structure is shown in Fig. 6, where an $M_p$-bit modified PPA is used for the $p$th stage (nucleus stage). Since the nucleus stage, which has the largest size (and delay) among the stages, is present in both SLP1 and SLP2, replacing it by the PPA reduces the delay of the longest
off-critical paths. Thus, the use of the fast PPA helps increase the available slack time in the variable latency structure. It should be mentioned that since the input bits of the PPA block are used in the predictor block, this block becomes part of both SLP1 and SLP2.

Fig. 6. Structure of the proposed hybrid variable latency CSKA.

In the proposed hybrid structure, the prefix network of the Brent–Kung adder [39] is used for constructing the nucleus stage (Fig. 7). One of the advantages of this adder compared with other prefix adders is that, using forward paths, the longest carry is calculated sooner than the intermediate carries, which are computed by backward paths. In addition, the fan-out of this adder is less than that of other parallel adders, while the length of its wiring is smaller [14]. Finally, it has a simple and regular layout. The internal structure of the stage $p$, including the modified PPA and skip logic, is shown in Fig. 7. Note that, for this figure, the size of the PPA is assumed to be 8 (i.e., $M_p = 8$).

Fig. 7. Internal structure of the $p$th stage of the proposed hybrid variable latency CSKA. $M_p$ is equal to 8 and $K_p = \sum_{r=1}^{p-1} M_r$.

As shown in the figure, in the preprocessing level, the propagate signals ($P_i$) and generate signals ($G_i$) for the inputs are calculated. In the next level, using the Brent–Kung parallel prefix network, the longest carry (i.e., $G_{8:1}$) of the prefix network along with $P_{8:1}$, which is the product of all the propagate signals of the inputs, are calculated sooner than the other intermediate signals in this network. The signal $P_{8:1}$ is used in the skip logic to determine if the carry output of the previous stage (i.e., $C_{O,p-1}$) should be skipped or not. In addition, this signal is exploited as the predictor signal in the variable latency adder. It should be mentioned that all of these operations are performed in parallel with other stages. In the case where $P_{8:1}$ is one, $C_{O,p-1}$ should skip this stage, predicting that some critical paths are activated. On the other hand, when $P_{8:1}$ is zero, $C_{O,p}$ is equal to $G_{8:1}$. In addition, no critical path will be activated in this case.

After the parallel prefix network, the intermediate carries, which are functions of $C_{O,p-1}$ and intermediate signals, are computed (Fig. 7). Finally, in the postprocessing level, the output sums of this stage are calculated. It should be noted that this implementation is based on ideas similar to the concatenation and incrementation concepts used in the CI-CSKA discussed in Section IV. It should be noted that the end part of the SLP1 path, from $C_{O,p-1}$ to the final summation results of the PPA block, and the beginning part of the SLP2 paths, from the inputs of this block to $C_{O,p}$, belong to the PPA block (Fig. 7). In addition, similar to the proposed CI-CSKA structure, the first point of SLP1 is the first input bit of the first stage, and the last point of SLP2 is the last bit of the sum output of the incrementation block of the stage $Q$.

The steps for determining the sizes of the stages in the hybrid variable latency CSKA structure are similar to the ones discussed in Section IV. Since the PPA structure is more efficient when its size is equal to an integer power of two, we can select a larger size for the nucleus stage accordingly [14]. This implies that the third step discussed in that section is modified. The larger size (number of bits), compared with that of the nucleus stage in the original CI-CSKA structure, leads to a decrease in the number of stages as well as smaller delays for SLP1 and SLP2. Thus, the slack time increases further.

VI. RESULTS AND DISCUSSION

In this section, we assess the efficacies of the proposed structures by comparing their delays, powers, energies,
and areas with those of some other adders. All the adders considered here had the size of 32 bits and were designed and simulated using a 45-nm static CMOS technology [38]. The simulations were performed using HSPICE [40] at the room temperature of 25 °C. The nominal supply voltage of the technology was 1.1 V, and the threshold voltages of the nMOS and pMOS transistors were 0.677 and 0.622 V, respectively. It should be noted that, to extract the power consumption of the adders, 10 000 uniform random stimuli were injected to them. In addition, for each adder structure at each supply voltage level, the injection rate of the stimuli was chosen based on the maximum operating frequency of the structure. In Sections VI-A and VI-B, we first concentrate on studying the effectiveness of the proposed CI-CSKA structure and then investigate the efficiency of the proposed hybrid variable latency structure based on the CI-CSKA.

A. CSKA Structures With Fixed and Variable Stage Sizes

In this section, both proposed and Conv-CSKA structures with FSS and VSS are considered. The optimum size of the stages for the FSS was 4 in the proposed (CI-CSKA) and Conv-CSKA adders. The sizes of the stages in the case of VSS were the same as indicated in Fig. 4. The comparative study also included the RCA, CIA, square-root CSLA (SQRT-CSLA), and KSA. The results were obtained for a wide range of voltage levels from the nominal voltage (superthreshold) to the nMOS threshold voltage ($V_{TH,nMOS}$) (near threshold). The delays of the adders versus the supply voltage are plotted in Fig. 8. As the results show, the RCA (KSA) has the highest (lowest) delay due to its serial (parallel) structure under all the supply voltages. In addition, the smaller delay of SQRT-CSLA compared with that of CIA is due to the logic duplication. As was expected, the CSKA structures have significantly smaller delays compared with that of the RCA. In addition, their delays are less than that of the CIA. As is observed from this figure, compared with the Conv-CSKA, our proposed structures reduce the delays further such that in the case of VSS, the delay becomes even lower than that of SQRT-CSLA. For the supply voltages considered here, the delay reductions of the CI-CSKA compared with those of the Conv-CSKA in the case of the FSS (VSS) were in the range of 40%–42% (40%–44%). In addition, using the VSS scheme in the CI-CSKA (Conv-CSKA) provides us with delay reductions of 15%–17% (11%–14%). Finally, the results indicate that reducing the supply voltage from 1.1 V to the nMOS threshold voltage causes about a 12-fold increase in the delay for all the adders.

Fig. 8. Critical path delay of the adders versus the supply voltage.

Fig. 9. Power consumption of the adders versus the supply voltage.

The power consumptions of the adders versus the supply voltage are shown in Fig. 9. The results reveal that the smallest power consumption belongs to the RCA, while the KSA structure consumes the highest power owing to its parallel structure. The power consumption of the CIA is more than that of the RCA while it is smaller than that of the SQRT-CSLA. The reason for the high power of the SQRT-CSLA is its logic duplication. The power consumptions of the conventional and proposed CI-CSKA structures are slightly more than that of the CIA. The powers of these adders increase further using the VSS scheme where the number of stages is larger. As mentioned before, the power of the CI-CSKA structure is a little more than that of the conventional one. For example, the power of VSS-CI-CSKA is 5%–7% larger than that of the VSS-Conv-CSKA. It should be pointed out that while the delay of the VSS-CI-CSKA was smaller than the delay of the SQRT-CSLA, its power is also considerably smaller than that of the SQRT-CSLA. Finally, the results reveal, on average, a 32× reduction in the power consumption of the adders when scaling the supply voltage from 1.1 V to the nMOS threshold voltage.

Fig. 10 shows the PDP of the adders for different supply voltages. The proposed CI-CSKA has the best PDP compared with those of the other structures in the supply voltage range considered in this paper. The highest PDP (with 2.5× more than that of the CI-CSKA structure)
corresponds to SQRT-CSLA. After SQRT-CSLA, KSA has the highest PDP. The results show that the PDP of the proposed CI-CSKA structure is 35%–38% less than that of the Conv-CSKA structure. In addition, in both the conventional and the proposed structures, the PDPs of FSS and VSS are about the same.

Fig. 10. PDP of the adders versus the supply voltage.

The values of the energy–delay product (EDP) of the adders versus the supply voltage are plotted in Fig. 11. As the results reveal, the RCA has the largest EDP due to its lowest speed. The EDP of the proposed VSS-CI-CSKA is almost the same as that of the KSA structure. The lower value of the EDP for the proposed CI-CSKA originates from the smaller power consumption as well as the higher speed of the structure. Furthermore, the VSS-CI-CSKA has smaller area and power consumption compared with those of the KSA. Finally, to demonstrate the tradeoffs between the delay and the energy for each adder structure, the energy–delay Pareto-optimal curves are plotted in Fig. 12, which suggests the proposed VSS-CI-CSKA structure as the better adder.

Fig. 11. EDP of the adders versus the supply voltage.

Fig. 12. Energy–delay Pareto-optimal curves for different adders.

TABLE I
AREA USAGES AND NUMBER OF TRANSISTORS OF THE ADDERS

Fig. 13. Changes of delay, power, energy, area, and number of transistors for the proposed VSS-CI-CSKA structure compared with those of the VSS-Conv-CSKA structure in the case of 16-, 32-, and 64-bit length.

Table I reports the area usages and number of transistors for each adder structure. The RCA has the smallest area, while the KSA has the highest area. The next largest adder
is SQRT-CSLA. All four CSKA structures and the CIA have about the same area. In addition, as stated before, the proposed CI-CSKA structure slightly decreases the area compared with that of the conventional one. In addition, the number of transistors of the proposed CI-CSKA structure is smaller than that of the Conv-CSKA structure in both FSS and VSS styles. It should be noted that the lowest PDP (energy) and low area of the proposed CI-CSKA structure were the motivation behind extending the structure for variable latency applications.

Finally, to investigate the effect of bit length on the efficiency of the proposed CI-CSKA structure, we compare the changes [$(Value_{Conventional} - Value_{Proposed})/Value_{Conventional}$] of the delay, power, energy, and area of the CI-CSKA and Conv-CSKA structures for 16-, 32-, and 64-bit lengths. For the sake of space, we present the average results of different supply voltage levels. In addition, for the same reason, we limit the comparison to the VSS structures because the VSS-CI-CSKA is the more efficient structure among the considered CSKA structures. Furthermore, as mentioned before, the proposed hybrid variable latency CSKA is constructed based on the VSS-CI-CSKA. The results are presented in Fig. 13. The figure reveals that the delay reduction and energy saving slightly decrease and the power increase enlarges a bit with increasing the length. In addition, the increase in the bit length improves the area and number of transistors of the proposed VSS-CI-CSKA compared with those of the VSS-Conv-CSKA. In Section VI-B, we present the results for the variable latency adders.

B. Variable Latency Adders

In this part, the performance of the proposed hybrid variable latency CSKA structure is compared with those of some other variable latency adders, including RCA [27], C²SLA [29], and hybrid C²SLA [31], [33]. In the proposed 32-bit hybrid structure, an 8-bit modified PPA block was used in the nucleus stage (Fig. 7). The sizes of the stages from LSB to MSB were {1, 1, 1, 2, 2, 3, 3, 8, 3, 3, 2, 2, 1}, where the prediction was performed using the input bits of 14–21. In the case of the RCA, eight intermediate bits of 13–20 were exploited for the prediction block. The C²SLA is an extension of the SQRT-CSLA where the variable latency feature is achieved by increasing the number of stages as well as having different sizes for their RCA blocks. In the C²SLA, cascading was done by dividing the 32 bits into groups of {2, 2, 3, 4, 5, 2, 2, 3, 4, 5}, where the partial sum was computed in parallel for $C_i = 0$ as well as $C_i = 1$ using the RCA. Next, the multiplexers selected the appropriate sum based on the actual carry. In this structure, seven intermediate bits of 17–23 were used in the prediction block. In the hybrid C²SLA, nine intermediate results were calculated using KSA, where the details may be found in [33].

Fig. 14. (a) Ratio of the slack time to $D_{LLP}$, and the ratio of the slack time to $(D_{LLP})^2$ for the four studied variable latency adders. (b) Power consumption and PDP at the nominal and low $V_{DD}$ for the adders. (Nominal $V_{DD}$ is 1.1 V for all the structures.)

As a measure of the ability of a structure to use the variable latency feature for reducing the power consumption, one may use the ratio of the slack time to the delay of the adder (which is equal to the delay of the LLP, denoted by $D_{LLP}$). The ratios for the four adder structures are shown in Fig. 14(a). The figure also contains the ratio of the slack time to $(D_{LLP})^2$ to include the speed of the adder in the figure of merit for the efficacy of the structure in reducing the power using the variable latency scheme. Note that the details for the LLP, SLP1, and SLP2 in the C²SLA and hybrid C²SLA may be found in [31]. As the results show, the RCA can obtain the highest improvement using the adaptive clock stretching technique. This adder, however, has the worst delay among the four adder structures. The next highest improvement belongs to the proposed hybrid CSKA, whose delay is in the order of the other two adders. The observation indicates that the proposed hybrid CSKA may be considered as a fast adder structure for low-power applications. To further clarify this, the results for the power and PDP at both the nominal and the reduced supply voltages for each adder structure are plotted in Fig. 14(b). The amounts of power and energy savings are functions of the supply voltage reduction, which is determined by the slack time. Since the slack times are different for the structures, the amounts of the voltage reduction are different too. The power and PDP at the nominal voltage are for the corresponding baseline structure of each adder (no variable latency structure), while at the reduced voltage, the variable latency structure is considered.
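The bit partitions quoted in this subsection and the slack-based figures of merit can be checked with a short Python sketch. The helper names are hypothetical, and the path delays passed to `slack_ratios` in the usage note are made-up numbers, not measurements from Fig. 14.

```python
# Stage sizes (LSB to MSB) as quoted in the text.
HYBRID_CSKA_STAGES = [1, 1, 1, 2, 2, 3, 3, 8, 3, 3, 2, 2, 1]
C2SLA_GROUPS = [2, 2, 3, 4, 5, 2, 2, 3, 4, 5]


def bit_range_of_stage(stage_sizes, index):
    """1-indexed (first_bit, last_bit) covered by the stage at list position index."""
    first = sum(stage_sizes[:index]) + 1
    return first, first + stage_sizes[index] - 1


def slack_ratios(d_llp, d_slp1, d_slp2):
    """Return (slack / D_LLP, slack / D_LLP**2), with slack = D_LLP - max(SLP1, SLP2)."""
    slack = d_llp - max(d_slp1, d_slp2)
    return slack / d_llp, slack / d_llp ** 2


# The 8-bit nucleus stage is the largest entry of the hybrid partition.
nucleus_index = HYBRID_CSKA_STAGES.index(max(HYBRID_CSKA_STAGES))
nucleus_bits = bit_range_of_stage(HYBRID_CSKA_STAGES, nucleus_index)
```

Both partitions sum to 32 bits, and `nucleus_bits` evaluates to `(14, 21)`, which matches the input bits used for the prediction in the hybrid structure. With assumed delays of 100, 70, and 80 time units for the LLP, SLP1, and SLP2, `slack_ratios` returns `(0.2, 0.002)`.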
With the baseline structures taken at the nominal voltage and the variable latency structures at the reduced voltage, the results show that the highest power (energy) reduction of 29% belongs to the RCA structure, owing to its highest slack time. In this case, the supply voltage reduction was 0.2 V. In the case of the standard C²SLA, since the slack time was small, the voltage reduction was 0.05 V, which led to a power reduction of 6%. For the hybrid C²SLA, the slack time was higher than that of the standard C²SLA, and hence, a voltage reduction of 0.1 V became possible. This provided a higher power reduction (13%). Finally, the proposed hybrid CSKA had a larger slack compared with that of the hybrid C²SLA, and hence, a voltage reduction of 0.15 V was possible. This provided the structure with a power reduction of 23% (larger than those of the C²SLA structures). The very low delay of the hybrid variable latency CSKA along with its lower power consumption results in the minimum PDP for this structure. In addition, the higher PDP of the C²SLA structures is due to their high power consumptions.

VII. CONCLUSION

In this paper, a static CMOS CSKA structure called CI-CSKA was proposed, which exhibits a higher speed and lower energy consumption compared with those of the conventional one. The speed enhancement was achieved by modifying the structure through the concatenation and incrementation techniques. In addition, AOI and OAI compound gates were exploited for the carry skip logics. The efficiency of the proposed structure for both FSS and VSS was studied by comparing its power and delay with those of the Conv-CSKA, RCA, CIA, SQRT-CSLA, and KSA structures. The results revealed considerably lower PDP for the VSS implementation of the CI-CSKA structure over a wide range of voltage from super-threshold to near threshold. The results also suggested the CI-CSKA structure as a very good adder for the applications where both the speed and energy consumption are critical. In addition, a hybrid variable latency extension of the structure was proposed. It exploited a modified parallel adder structure at the middle stage for increasing the slack time, which provided us with the opportunity for lowering the energy consumption by reducing the supply voltage. The efficacy of this structure was compared with those of the variable latency RCA, C²SLA, and hybrid C²SLA structures. Again, the suggested structure showed the lowest delay and PDP, making it a better candidate for high-speed low-energy applications.

REFERENCES

[1] I. Koren, Computer Arithmetic Algorithms, 2nd ed. Natick, MA, USA: A K Peters, Ltd., 2002.
[2] R. Zlatanovici, S. Kao, and B. Nikolic, "Energy–delay optimization of 64-bit carry-lookahead adders with a 240 ps 90 nm CMOS design example," IEEE J. Solid-State Circuits, vol. 44, no. 2, pp. 569–583, Feb. 2009.
[3] S. K. Mathew, M. A. Anders, B. Bloechel, T. Nguyen, R. K. Krishnamurthy, and S. Borkar, "A 4-GHz 300-mW 64-bit integer execution ALU with dual supply voltages in 90-nm CMOS," IEEE J. Solid-State Circuits, vol. 40, no. 1, pp. 44–51, Jan. 2005.
[4] V. G. Oklobdzija, B. R. Zeydel, H. Q. Dao, S. Mathew, and R. Krishnamurthy, "Comparison of high-performance VLSI adders in the energy-delay space," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 13, no. 6, pp. 754–758, Jun. 2005.
[5] B. Ramkumar and H. M. Kittur, "Low-power and area-efficient carry select adder," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 2, pp. 371–375, Feb. 2012.
[6] M. Vratonjic, B. R. Zeydel, and V. G. Oklobdzija, "Low- and ultra low-power arithmetic units: Design and comparison," in Proc. IEEE Int. Conf. Comput. Design, VLSI Comput. Process. (ICCD), Oct. 2005, pp. 249–252.
[7] C. Nagendra, M. J. Irwin, and R. M. Owens, "Area-time-power tradeoffs in parallel adders," IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 43, no. 10, pp. 689–702, Oct. 1996.
[8] Y. He and C.-H. Chang, "A power-delay efficient hybrid carry-lookahead/carry-select based redundant binary to two's complement converter," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 55, no. 1, pp. 336–346, Feb. 2008.
[9] C.-H. Chang, J. Gu, and M. Zhang, "A review of 0.18-μm full adder performances for tree structured arithmetic circuits," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 13, no. 6, pp. 686–695, Jun. 2005.
[10] D. Markovic, C. C. Wang, L. P. Alarcon, T.-T. Liu, and J. M. Rabaey, "Ultralow-power design in near-threshold region," Proc. IEEE, vol. 98, no. 2, pp. 237–252, Feb. 2010.
[11] R. G. Dreslinski, M. Wieckowski, D. Blaauw, D. Sylvester, and T. Mudge, "Near-threshold computing: Reclaiming Moore's law through energy efficient integrated circuits," Proc. IEEE, vol. 98, no. 2, pp. 253–266, Feb. 2010.
[12] S. Jain et al., "A 280 mV-to-1.2 V wide-operating-range IA-32 processor in 32 nm CMOS," in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers (ISSCC), Feb. 2012, pp. 66–68.
[13] R. Zimmermann, "Binary adder architectures for cell-based VLSI and their synthesis," Ph.D. dissertation, Dept. Inf. Technol. Elect. Eng., Swiss Federal Inst. Technol. (ETH), Zürich, Switzerland, 1998.
[14] D. Harris, "A taxonomy of parallel prefix networks," in Proc. IEEE Conf. Rec. 37th Asilomar Conf. Signals, Syst., Comput., vol. 2, Nov. 2003, pp. 2213–2217.
[15] P. M. Kogge and H. S. Stone, "A parallel algorithm for the efficient solution of a general class of recurrence equations," IEEE Trans. Comput., vol. C-22, no. 8, pp. 786–793, Aug. 1973.
[16] V. G. Oklobdzija, B. R. Zeydel, H. Dao, S. Mathew, and R. Krishnamurthy, "Energy-delay estimation technique for high-performance microprocessor VLSI adders," in Proc. 16th IEEE Symp. Comput. Arithmetic, Jun. 2003, pp. 272–279.
[17] M. Lehman and N. Burla, "Skip techniques for high-speed carry-propagation in binary arithmetic units," IRE Trans. Electron. Comput., vol. EC-10, no. 4, pp. 691–698, Dec. 1961.
[18] K. Chirca et al., "A static low-power, high-performance 32-bit carry skip adder," in Proc. Euromicro Symp. Digit. Syst. Design (DSD), Aug./Sep. 2004, pp. 615–619.
[19] M. Alioto and G. Palumbo, "A simple strategy for optimized design of one-level carry-skip adders," IEEE Trans. Circuits Syst. I, Fundam. Theory Appl., vol. 50, no. 1, pp. 141–148, Jan. 2003.
[20] S. Majerski, "On determination of optimal distributions of carry skips in adders," IEEE Trans. Electron. Comput., vol. EC-16, no. 1, pp. 45–58, Feb. 1967.
[21] A. Guyot, B. Hochet, and J.-M. Muller, "A way to build efficient carry-skip adders," IEEE Trans. Comput., vol. C-36, no. 10, pp. 1144–1152, Oct. 1987.
[22] S. Turrini, "Optimal group distribution in carry-skip adders," in Proc. 9th IEEE Symp. Comput. Arithmetic, Sep. 1989, pp. 96–103.
[23] P. K. Chan, M. D. F. Schlag, C. D. Thomborson, and V. G. Oklobdzija, "Delay optimization of carry-skip adders and block carry-lookahead adders using multidimensional dynamic programming," IEEE Trans. Comput., vol. 41, no. 8, pp. 920–930, Aug. 1992.
[24] V. Kantabutra, "Designing optimum one-level carry-skip adders," IEEE Trans. Comput., vol. 42, no. 6, pp. 759–764, Jun. 1993.
[25] V. Kantabutra, "Accelerated two-level carry-skip adders - A type of very fast adders," IEEE Trans. Comput., vol. 42, no. 11, pp. 1389–1393, Nov. 1993.
[26] S. Jia et al., "Static CMOS implementation of logarithmic skip adder," in Proc. IEEE Conf. Electron Devices Solid-State Circuits, Dec. 2003, pp. 509–512.
[27] H. Suzuki, W. Jeong, and K. Roy, "Low power adder with adaptive supply voltage," in Proc. 21st Int. Conf. Comput. Design, Oct. 2003, pp. 103–106.
BAHADORI et al.: HIGH-SPEED AND ENERGY-EFFICIENT CSKA 433

[28] H. Suzuki, W. Jeong, and K. Roy, Low-power carry-select adder using


Mehdi Kamal received the B.Sc. degree from the
adaptive supply voltage based on input vector patterns, in Proc. Int.
Iran University of Science and Technology, Tehran,
Symp. Low Power Electron. Design (ISLPED), Aug. 2004, pp. 313318.
Iran, in 2005, the M.Sc. degree from the Sharif
[29] Y. Chen, H. Li, K. Roy, and C.-K. Koh, Cascaded carry-select adder
University of Technology, Tehran, in 2007, and the
(C2SA): A new structure for low-power CSA design, in Proc. Int. Symp. Low Power Electron. Design (ISLPED), Aug. 2005, pp. 115–118.
[30] Y. Chen, H. Li, J. Li, and C.-K. Koh, "Variable-latency adder (VL-adder): New arithmetic circuit design practice to overcome NBTI," in Proc. ACM/IEEE Int. Symp. Low Power Electron. Design (ISLPED), Aug. 2007, pp. 195–200.
[31] S. Ghosh and K. Roy, "Exploring high-speed low-power hybrid arithmetic units at scaled supply and adaptive clock-stretching," in Proc. Asia South Pacific Design Autom. Conf. (ASPDAC), Mar. 2008, pp. 635–640.
[32] Y. Chen et al., "Variable-latency adder (VL-adder) designs for low power and NBTI tolerance," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 11, pp. 1621–1624, Nov. 2010.
[33] S. Ghosh, D. Mohapatra, G. Karakonstantis, and K. Roy, "Voltage scalable high-speed robust hybrid arithmetic units using adaptive clocking," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 9, pp. 1301–1309, Sep. 2010.
[34] Y. Liu, Y. Sun, Y. Zhu, and H. Yang, "Design methodology of variable latency adders with multistage function speculation," in Proc. IEEE 11th Int. Symp. Quality Electron. Design (ISQED), Mar. 2010, pp. 824–830.
[35] Y.-S. Su, D.-C. Wang, S.-C. Chang, and M. Marek-Sadowska, "Performance optimization using variable-latency design style," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 19, no. 10, pp. 1874–1883, Oct. 2011.
[36] K. Du, P. Varman, and K. Mohanram, "High performance reliable variable latency carry select addition," in Proc. Design, Autom., Test Eur. Conf. Exhibit. (DATE), Mar. 2012, pp. 1257–1262.
[37] J. M. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Circuits: A Design Perspective, 2nd ed. Englewood Cliffs, NJ, USA: Prentice-Hall, 2003.
[38] NanGate 45 nm Open Cell Library. [Online]. Available: http://www.nangate.com/, accessed Dec. 2010.
[39] R. P. Brent and H. T. Kung, "A regular layout for parallel adders," IEEE Trans. Comput., vol. C-31, no. 3, pp. 260–264, Mar. 1982.
[40] Synopsys HSPICE. [Online]. Available: http://www.synopsys.com, accessed Sep. 2011.

Ph.D. degree from the University of Tehran, Tehran, in 2013, all in computer engineering.
He is currently a Research Associate with the Low-Power High-Performance Nanosystem Laboratory, School of Electrical and Computer Engineering, University of Tehran. His current research interests include reliability in nanoscale design, application-specific instruction set processor design, hardware/software co-design, and low-power design.

Ali Afzali-Kusha (SM'06) received the B.Sc. degree from the Sharif University of Technology, Tehran, Iran, in 1988, the M.Sc. degree from the University of Pittsburgh, Pittsburgh, PA, USA, in 1991, and the Ph.D. degree from the University of Michigan, Ann Arbor, MI, USA, in 1994, all in electrical engineering.
He was a Post-Doctoral Fellow with the University of Michigan from 1994 to 1995. He has been with the University of Tehran since 1995, where he is currently a Professor of the School of Electrical and Computer Engineering and the Director of the Low-Power High-Performance Nanosystems Laboratory. He was a Research Fellow with the University of Toronto, Toronto, ON, Canada, and the University of Waterloo, Waterloo, ON, Canada, in 1998 and 1999, respectively. His current research interests include low-power high-performance design methodologies from the physical design level to the system level for the nanoelectronics era.
Massoud Pedram (F'01) received the Ph.D. degree in electrical engineering and computer sciences from the University of California at Berkeley, Berkeley, CA, USA, in 1991.
He is currently the Stephen and Etta Varra Professor with the Ming Hsieh Department of Electrical Engineering, University of Southern California, Los Angeles, CA, USA. He holds 10 U.S. patents and has authored four books, 12 book chapters, and over 140 archival and 350 conference papers. His current research interests range from low-power electronics, energy-efficient processing, and cloud computing to photovoltaic cell power generation, energy storage, and power conversion, and from RT-level optimization of VLSI circuits to synthesis and physical design of quantum circuits.
Prof. Pedram and his students have received six conference and two IEEE Transactions Best Paper Awards for their research. He was a recipient of the 1996 Presidential Early Career Award for Scientists and Engineers, is an ACM Distinguished Scientist, and currently serves as the Editor-in-Chief of the ACM Transactions on Design Automation of Electronic Systems. He has served on the Technical Program Committee of a number of premier conferences in his field. He was the Founding Technical Program Co-Chair of the 1996 International Symposium on Low-Power Electronics and Design and the Technical Program Chair of the 2002 International Symposium on Physical Design.

Milad Bahadori received the M.Sc. degree in electrical and electronic engineering from the Sharif University of Technology, Tehran, Iran, in 2011. He is currently pursuing the Ph.D. degree with the University of Tehran, Tehran, Iran.
He was a Research Assistant with the Sharif Integrated Circuits Design Center, Sharif University of Technology, from 2009 to 2012, where he was involved in digital systems design. He joined the Low-Power High-Performance Nanosystems Laboratory, University of Tehran, in 2012, as a Research Assistant. His current research interests include low-power high-performance VLSI design, reliability in nanoscale design, near-threshold computing, high-performance low-power arithmetic circuits design, and cryptographic systems design.