PSVLSI417
optimizing power and speed. There are many adder families with different delays, power consumptions, and area usages. Examples include the ripple carry adder (RCA), carry increment adder (CIA), carry skip adder (CSKA), carry select adder (CSLA), and parallel prefix adders (PPAs). The descriptions of each of these adder architectures along with their characteristics may be found in [1] and [13]. The RCA has the simplest structure with the smallest area and power consumption but with the worst critical path delay. In the CSLA, the speed, power consumption, and area usage are considerably larger than those of the RCA. The PPAs, which are also called carry look-ahead adders, exploit direct parallel prefix structures to generate the carry as fast as possible [14]. There are different types of parallel prefix algorithms that lead to different PPA structures with different performances. As an example, the Kogge–Stone adder (KSA) [15] is one of the fastest structures but results in large power consumption and area usage. It should be noted that the structure complexities of PPAs are more than those of other adder schemes [13], [16].

The CSKA, which is an efficient adder in terms of power consumption and area usage, was introduced in [17]. The critical path delay of the CSKA is much smaller than the one in the RCA, whereas its area and power consumption are similar to those of the RCA. In addition, the power-delay product (PDP) of the CSKA is smaller than those of the CSLA and PPA structures [19]. In addition, due to the small number of transistors, the CSKA benefits from relatively short wiring lengths as well as a regular and simple layout [18]. The comparatively lower speed of this adder structure, however, limits its use for high-speed applications.

In this paper, given the attractive features of the CSKA structure, we have focused on reducing its delay by modifying its implementation based on the static CMOS logic. The concentration on the static CMOS originates from the desire to have a reliably operating circuit under a wide range of supply voltages in highly scaled technologies [10]. The proposed modification increases the speed considerably while maintaining the low area and power consumption features of the CSKA. In addition, an adjustment of the structure, based on the variable latency technique, which in turn lowers the power consumption without considerably impacting the CSKA speed, is also presented. To the best of our knowledge, no work concentrating on the design of CSKAs operating from the superthreshold region down to the near-threshold region, or on the design of (hybrid) variable latency CSKA structures, has been reported in the literature. Hence, the contributions of this paper can be summarized as follows.

1) Proposing a modified CSKA structure by combining the concatenation and the incrementation schemes to the conventional CSKA (Conv-CSKA) structure for enhancing the speed and energy efficiency of the adder. The modification provides us with the ability to use simpler carry skip logics based on the AOI/OAI compound gates instead of the multiplexer.
2) Providing a design strategy for constructing an efficient CSKA structure based on analytical expressions presented for the critical path delay.
3) Investigating the impact of voltage scaling on the efficiency of the proposed CSKA structure (from the nominal supply voltage to the near-threshold voltage).
4) Proposing a hybrid variable latency CSKA structure based on the extension of the suggested CSKA, by replacing some of the middle stages in its structure with a PPA, which is modified in this paper.

The rest of this paper is organized as follows. Section II discusses related work on modifying the CSKA structure for improving the speed as well as prior work that uses variable latency structures for increasing the efficiency of adders at low supply voltages. In Section III, the Conv-CSKA with fixed stage size (FSS) and variable stage size (VSS) is explained, while Section IV describes the proposed static CSKA structure. The hybrid variable latency CSKA structure is suggested in Section V. The results of comparing the characteristics of the proposed structures with those of other adders are discussed in Section VI. Finally, the conclusion is drawn in Section VII.

II. PRIOR WORK

Since the focus of this paper is on the CSKA structure, first the related work on this adder is reviewed and then the variable latency adder structures are discussed.

A. Modifying CSKAs for Improving Speed

The conventional structure of the CSKA consists of stages containing a chain of full adders (FAs) (RCA block) and a 2:1 multiplexer (carry skip logic). The RCA blocks are connected to each other through 2:1 multiplexers, which can be placed into one or more level structures [19]. The CSKA configuration (i.e., the number of the FAs per stage) has a great impact on the speed of this type of adder [23]. Many methods have been suggested for finding the optimum number of the FAs [18]–[26]. The techniques presented in [19]–[24] make use of VSSs to minimize the delay of adders based on a single-level carry skip logic. In [25], some methods to increase the speed of the multilevel CSKAs are proposed. The techniques, however, considerably increase the area and power and lead to a less regular layout. The design of a static CMOS CSKA where the stages of the CSKA have variable sizes was suggested in [18]. In addition, to lower the propagation delay of the adder, in each stage, the carry look-ahead logics were utilized. Again, it had a complex layout as well as large power consumption and area usage. In addition, the design approach, which was presented only for the 32-bit adder, was not general enough to be applied to structures with different bit lengths.

Alioto and Palumbo [19] propose a simple strategy for the design of a single-level CSKA. The method is based on the VSS technique where the near-optimal numbers of the FAs are determined based on the skip time (delay of the multiplexer), and the ripple time (the time required by a carry to ripple through an FA). The goal of this method is to decrease the critical path delay by considering a noninteger ratio of the skip time to the ripple time, contrary to most of the previous works, which considered an integer ratio [17], [20]. In all of the works reviewed so far, the focus was on the speed, while the power consumption and area usage of the CSKAs were
BAHADORI et al.: HIGH-SPEED AND ENERGY-EFFICIENT CSKA 423
424 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 24, NO. 2, FEBRUARY 2016

$T_{SUM}$, and $T_{MUX}$ are the propagation delays of the carry output of an FA, the sum output of an FA, and the output delay of a 2:1 multiplexer, respectively. Hence, the critical path delay of an FSS CSKA is formulated by

$$T_D = [M \times T_{CARRY}] + \left[\left(\frac{N}{M} - 1\right) \times T_{MUX}\right] + [(M - 1) \times T_{CARRY} + T_{SUM}]. \qquad (1)$$

Based on (1), the optimum value of $M$ ($M_{opt}$) that leads to optimum propagation delay may be calculated as $(0.5 N \alpha)^{1/2}$ where $\alpha$ is equal to $T_{MUX}/T_{CARRY}$. Therefore, the optimum propagation delay ($T_{D,opt}$) is obtained from

$$T_{D,opt} = 2\sqrt{2N\,T_{CARRY}\,T_{MUX}} + (T_{SUM} - T_{CARRY} - T_{MUX}) = T_{SUM} + \left(2\sqrt{2N\alpha} - 1 - \alpha\right) \times T_{CARRY}. \qquad (2)$$

Thus, the optimum delay of the FSS CSKA is almost proportional to the square root of the product of $N$ and $\alpha$ [19].

B. Variable Stage Size CSKA

As mentioned before, by assigning variable sizes to the stages, the speed of the CSKA may be improved. The speed improvement in this type is achieved by lowering the delays of the first and third terms in (1). These delays are minimized by lowering the sizes of the first and last RCA blocks. For instance, the first RCA block size may be set to one, whereas the sizes of the following blocks may increase. To determine the rate of increase, let us express the propagation delay of the $C^1_j$ ($t^1_j$) by

$$t^1_j = \max\left(t^0_{j-1},\, t^1_{j-1}\right) + T_{MUX} \qquad (3)$$

where $t^0_{j-1}$ ($t^1_{j-1}$) shows the calculating delay of the $C^0_{j-1}$ ($C^1_{j-1}$) signal in the $(j-1)$th stage. In an FSS CSKA, except in the first stage, $t^0_j$ is smaller than $t^1_j$. Hence, based on (3), the delay of $t^0_{j-1}$ may be increased from $t^0_1$ to $t^1_{j-1}$ without increasing the delay of the $C^1_j$ signal. This means that one could increase the size of the $(j-1)$th stage (i.e., $M_{j-1}$) without increasing the propagation delay of the CSKA. Therefore, increasing the size of $M_j$ for the $j$th stage should be bounded by

$$t^0_j \le t^1_j = t^0_1 + (j - 1)\,T_{MUX}. \qquad (4)$$

Since the last RCA block size also should be minimized, the increase in the stage size may not be continued to the last RCA block. Thus, we justify the decrease in the RCA block sizes toward the last stage. First, note that based on Fig. 1, the output of the $j$th stage is, in the worst case, accessible after $t^1_j + T_{SUM,j}$. Assuming that the $p$th stage has the maximum RCA block size, we wish to keep the delay of the outputs of the following stages to be equal to the delay of the output of the $p$th stage. To keep the same worst case delay for the critical path, we should reduce the size of the following RCA blocks. For example, when $i \ge p$, for the $(i+1)$th stage, the output delay is $t^1_i + T_{MUX} + T_{SUM,i+1}$ where $T_{SUM,i+1}$ is the delay of the $(i+1)$th RCA block for calculating all of its sum outputs when its carry input is ready. Therefore, the size of the $(i+1)$th stage should be reduced to decrease $T_{SUM,i+1}$, preventing the increase in the worst case delay ($T_D$) of the adder. In other words, we eliminate the increase in the delay of the next stage due to the additional multiplexer by reducing the sum delay of the RCA block. This may be analytically expressed as

$$T_{SUM,i+1} \le T_{SUM,i} - T_{MUX}; \quad \text{for } i \ge p. \qquad (5)$$

The trend of decreasing the stage size should be continued until we produce the required number of adder bits. Note that, in this case, the size of the last RCA block may only be one (i.e., one FA). Hence, to reach the highest number of input bits under a constant propagation delay, both (4) and (5) should be satisfied. Having these constraints, we can minimize the delay of the CSKA for a given number of input bits to find the stage sizes for an optimal structure. In this optimal CSKA, the size of the first $p$ stages is increased, while the size of the last $(Q - p)$ stages is decreased. For this structure, the $p$th stage, which is called the nucleus of the adder, has the maximum size [24].

Now, let us find the constraints used for determining the optimum structure in this case. As mentioned before, when the $j$th stage is not in the propagate mode, the carry output of the stage is $C^0_j$. In this case, the maximum of $t^0_j$ is equal to $M_j \times T_{CARRY}$. To satisfy (4), we increase the size of the first $p$ stages up to the nucleus using [19]

$$M_j \le M_1 + \alpha (j - 1); \quad \text{for } 1 \le j \le p. \qquad (6)$$

In addition, the maximum of $T_{SUM,i}$ is equal to $(M_i - 1) \times T_{CARRY} + T_{SUM}$. To satisfy (5), the size of the last $(Q - p)$ stages from the nucleus to the last stage should decrease based on [19]

$$M_i \le M_Q + \alpha (Q - i); \quad \text{for } p \le i \le Q. \qquad (7)$$

In the case where $\alpha$ is an integer value, the exact sizes of stages for the optimal structure can be determined. Subsequently, the optimal values of $M_1$, $M_Q$, and $Q$ as well as the delay of the optimal CSKA may be calculated [19]. In the case where $\alpha$ is a noninteger value, one may realize only a near-optimal structure, as detailed in [19] and [21]. In this case, most of the time, by setting $M_1$ to 1 and using (6) and (7), the near-optimal structure is determined. It should be noted that, in practice, $\alpha$ is a noninteger whose value is smaller than one. This is the case that has been studied in [19], where the estimation of the near-optimal propagation delay of the CSKA is given by [19]

$$T_{D,opt} = \left(2\left\lceil \frac{\alpha}{2} \right\rceil - 1\right) T_{CARRY} + \left(2\sqrt{\frac{N}{\lceil \alpha/2 \rceil}} - 1\right) T_{MUX} + T_{SUM}. \qquad (8)$$

This equation may be written in a more general form by replacing $T_{MUX}$ by $T_{SKIP}$ to allow for other logic types instead of the multiplexer. For this form, $\alpha$ becomes equal to $T_{SKIP}/T_{CARRY}$. Finally, note that in real implementations, $T_{SKIP} < T_{CARRY}$, and hence, $\lceil \alpha/2 \rceil$ becomes equal to one. Thus, (8) may be written as

$$T_{D,opt} = T_{CARRY} + \left(2\sqrt{N} - 1\right) T_{SKIP} + T_{SUM}. \qquad (9)$$

Note that (9) reveals that a large portion of the critical path delay is due to the carry skip logics.
previous stage, the output carries of the blocks are calculated in parallel.

B. Area and Delay of the Proposed Structure

As mentioned before, the use of the static AOI and OAI gates (six transistors) compared with the static 2:1 multiplexer (12 transistors), leads to decreases in the area usage and delay of the skip logic [37], [38]. In addition, except for the first RCA block, the carry input for all other blocks is zero, and hence, for these blocks, the first adder cell in the RCA chain is a HA. This means that $(Q - 1)$ FAs in the conventional structure are replaced with the same number of HAs in the suggested structure, decreasing the area usage (Fig. 2). In addition, note that the proposed structure utilizes incrementation blocks that do not exist in the conventional one. These blocks, however, may be implemented with about the same logic gates (XOR and AND gates) as those used for generating the select signal of the multiplexer in the conventional structure. Therefore, the area usage of the proposed CI-CSKA structure is decreased compared with that of the conventional one.

The critical path of the proposed CI-CSKA structure, which contains three parts, is shown in Fig. 2. These parts include the chain of the FAs of the first stage, the path of the skip logics, and the incrementation block in the last stage. The delay of this path ($T_D$) may be expressed as

$$T_D = [M_1 \times T_{CARRY}] + [(Q - 2) \times T_{SKIP}] + [(M_Q - 1) \times T_{AND} + T_{XOR}] \qquad (10)$$

where the three brackets correspond to the three parts mentioned above, respectively. Here, $T_{AND}$ and $T_{XOR}$ are the delays of the two-input static AND and XOR gates, respectively. Note that $[(M_j - 1) \times T_{AND} + T_{XOR}]$ shows the critical path delay of the $j$th incrementation block ($T_{INC,j}$), which is shown in Fig. 3.

To calculate the delay of the skip logic, the average of the delays of the AOI and OAI gates, which are typically close to one another [35], is used. Thus, (10) may be modified to

$$T_D = [M_1 \times T_{CARRY}] + \left[(Q - 2) \times \frac{T_{AOI} + T_{OAI}}{2}\right] + [(M_Q - 1) \times T_{AND} + T_{XOR}] \qquad (11)$$

where $T_{AOI}$ and $T_{OAI}$ are the delays of the static AOI and OAI gates, respectively.

The comparison of (1) and (11) indicates that the delay of the proposed structure is smaller than that of the conventional one. The first reason is that the delay of the skip logic is considerably smaller than that of the conventional structure while the number of the stages is about the same in both structures. Second, since $T_{AND}$ and $T_{XOR}$ are smaller than $T_{CARRY}$ and $T_{SUM}$, the third additive term in (11) becomes smaller than the third term in (1) [37]. It should be noted that the delay reduction of the skip logic has the largest impact on the delay decrease of the whole structure.

C. Stage Sizes Consideration

Similar to the Conv-CSKA structure, the proposed CI-CSKA structure may be implemented with either FSS or VSS. Here, the stage size is the same as the RCA and incrementation blocks size. In the case of the FSS (FSS-CI-CSKA), there are $Q = N/M$ stages with the size of $M$. The optimum value of $M$, which may be obtained using (11), is given by

$$M_{opt} = \sqrt{\frac{N (T_{AOI} + T_{OAI})}{2 (T_{CARRY} + T_{AND})}}. \qquad (12)$$

In the case of the VSS (VSS-CI-CSKA), the sizes of the stages, which are $M_1$ to $M_Q$, are obtained using a method similar to the one discussed in Section III-B. For this structure, the new value for $T_{SKIP}$ should be used, and hence, $\alpha$ becomes $(T_{AOI} + T_{OAI})/(2 T_{CARRY})$. In particular, the following steps should be taken.

1) The size of the RCA block of the first stage is one.
2) From the second stage to the nucleus stage, the size of the $j$th stage is determined based on the delay of the product of the sum of its RCA block and the delay of the carry output of the $(j-1)$th stage. Hence, based on the description given in Section III-B, the size of the RCA block of the $j$th stage should be as large as possible, while the delay of the product of its output sum should be smaller than the delay of the carry output of the $(j-1)$th stage. Therefore, in this case, the sizes of the stages are either not changed or increased.
3) The increase in the size is continued until the summation of all the sizes up to this stage becomes larger than $N/2$. The last stage, which has the largest size, is considered as the nucleus ($p$th) stage. There are cases in which we should consider the stage right before this stage as the nucleus stage (Step 5).
4) Starting from the stage $(p + 1)$ to the last stage, the size of the stage $i$ is determined based on the delay of the incrementation block of the $i$th and $(i-1)$th stages ($T_{INC,i}$ and $T_{INC,i-1}$, respectively), and the delay of the skip logic. In particular

$$T_{INC,i} \le T_{INC,i-1} - T_{SKIP,i-1}; \quad \text{for } i \ge p + 1. \qquad (13)$$

In this case, the size of the last stage is one, and its RCA block contains a HA.
5) Finally, note that it is possible that the sum of all the stage sizes does not become equal to $N$. In the case where the sum is smaller than $N$ by $d$ bits, we should add another stage with the size of $d$. The stage is placed close to the stage with the same size. In the case where the sum is larger than $N$ by $d$ bits, the size of the stages should be revised (Step 3). For more details on how to revise the stage sizes, one may refer to [19].

Now, the procedure for determining the stage sizes is demonstrated for the 32-bit adder. It includes both the conventional and the proposed CI-CSKA structures. The number of stages and the corresponding size for each stage, which are given in Fig. 4, have been determined based on a 45-nm static CMOS technology [38]. The dashed and dotted lines in the plot indicate the rates of size increase and decrease. While the increase and decrease rates in the conventional structure are balanced, the decrease rate is more than the
increase one in the case of the proposed structure. It originates from the fact that, in the Conv-CSKA structure, both of the stage size increase and decrease are determined based on the RCA block delay [according to (4) and (5)], while in the proposed CI-CSKA structure, the increase is determined based on the RCA block delay and the decrease is determined based on the incrementation block delay [according to (13)]. The imbalanced rates may yield a larger nucleus stage and a smaller number of stages, leading to a smaller propagation delay.

The concepts of the variable latency adders, adaptive clock stretching, and also supply voltage scaling in an N-bit RCA adder may be explained using Fig. 5. The predictor block consists of some XOR and AND gates that determine the product of the propagate signals of the considered bit positions. Since the block has some area and power overheads, only a few middle bits are used to predict the activation of the critical paths at the price of a decrease in prediction accuracy [31], [33]. In Fig. 5, the input bits ( j + 1)th–( j + m)th have been

Fig. 4. Sizes of the stages in the case of VSS for the proposed and conventional 32-bit CSKA structures in 45-nm static CMOS technology.
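The CI-CSKA delay model of Section IV can be explored with a short script. This is a minimal Python sketch under stated assumptions: the function names and the unit gate delays are hypothetical (they are not the 45-nm values behind Fig. 4), and the formulas are taken to be (11) for the critical path, (12) for the optimum fixed stage size, and T_INC,j = (M_j - 1)*T_AND + T_XOR for the incrementation block.

```python
import math


def inc_block_delay(m, t_and, t_xor):
    """Critical path of an incrementation block of size m: (m - 1) * T_AND + T_XOR."""
    return (m - 1) * t_and + t_xor


def ci_cska_delay(m1, mq, q, t_carry, t_aoi, t_oai, t_and, t_xor):
    """Critical path delay of the CI-CSKA, per (11).

    m1 and mq are the sizes of the first and last stages; q is the number of stages.
    """
    t_skip = (t_aoi + t_oai) / 2             # average AOI/OAI delay used as T_SKIP
    return m1 * t_carry + (q - 2) * t_skip + inc_block_delay(mq, t_and, t_xor)


def fss_ci_m_opt(n, t_carry, t_aoi, t_oai, t_and):
    """Optimum fixed stage size of the FSS-CI-CSKA, per (12)."""
    return math.sqrt(n * (t_aoi + t_oai) / (2 * (t_carry + t_and)))


def best_integer_stage_size(n, t_carry, t_aoi, t_oai, t_and, t_xor):
    """Brute-force check: the divisor of n that minimizes (11) for equal stage sizes."""
    divisors = [m for m in range(1, n + 1) if n % m == 0]
    return min(
        divisors,
        key=lambda m: ci_cska_delay(m, m, n // m, t_carry, t_aoi, t_oai, t_and, t_xor),
    )


# Hypothetical unit delays, for illustration only.
N = 32
T_CARRY = T_AOI = T_OAI = T_AND = T_XOR = 1.0
print("M_opt from (12):", fss_ci_m_opt(N, T_CARRY, T_AOI, T_OAI, T_AND))
print("Best divisor of N:", best_integer_stage_size(N, T_CARRY, T_AOI, T_OAI, T_AND, T_XOR))
```

The brute-force search is included only as a cross-check of the closed form; with measured gate delays, the real-valued optimum would be rounded to a nearby integer stage size.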
Fig. 8. Critical path delay of the adders versus the supply voltage.
Fig. 9. Power consumption of the adders versus the supply voltage.

and areas with those of some other adders. All the adders considered here had the size of 32 bits and were designed and simulated using a 45-nm static CMOS technology [38]. The simulations were performed using HSPICE [40] at the room temperature of 25 °C. The nominal supply voltage of the technology was 1.1 V, and the threshold voltages of the nMOS and pMOS transistors were 0.677 and 0.622 V, respectively. It should be noted that, to extract the power consumption of the adders, 10 000 uniform random stimuli were injected to them. In addition, for each adder structure in each supply voltage level, the injection rate of the stimuli was chosen based on the maximum operating frequency of the structure. In the following Section VI-A and Section VI-B, we first concentrate on studying the effectiveness of the proposed CI-CSKA structure and then investigate the efficiency of the proposed hybrid variable latency structure based on the CI-CSKA.

A. CSKA Structures With Fixed and Variable Stage Sizes

In this section, both proposed and Conv-CSKA structures with FSS and VSS are considered. The optimum size of the stages for the FSS was 4 in the proposed (CI-CSKA) and Conv-CSKA adders. The sizes of the stages in the case of VSS were the same, as indicated in Fig. 4. The comparative study also included the RCA, CIA, square-root CSLA (SQRT-CSLA), and KSA. The results were obtained for a wide range of voltage levels from the nominal voltage (superthreshold) to the nMOS threshold voltage ($V_{TH,nMOS}$) (near threshold). The delays of the adders versus the supply voltage are plotted in Fig. 8. As the results show, the RCA (KSA) has the highest (lowest) delay due to its serial (parallel) structure under all the supply voltages. In addition, the smaller delay of the SQRT-CSLA compared with that of the CIA is due to the logic duplication. In addition, as was expected, the CSKA structures have significantly smaller delays compared with that of the RCA. In addition, their delays are less than that of the CIA. As is observed from this figure, compared with the Conv-CSKA, our proposed structures reduce the delays further such that in the case of VSS, the delay becomes even lower than that of the SQRT-CSLA. For the supply voltages considered here, the delay reductions of the CI-CSKA compared with those of the Conv-CSKA in the case of the FSS (VSS) were in the range of 40%–42% (40%–44%). In addition, using the VSS scheme in the CI-CSKA (Conv-CSKA) provides us with delay reductions of 15%–17% (11%–14%). Finally, the results indicate that reducing the supply voltage from 1.1 V to the nMOS threshold voltage causes an about 12-fold increase in the delay for all the adders.

The power consumptions of the adders versus the supply voltage are shown in Fig. 9. The results reveal that the smallest power consumption belongs to the RCA, while the KSA structure consumes the highest power owing to its parallel structure. The power consumption of the CIA is more than that of the RCA while it is smaller than that of the SQRT-CSLA. The reason for the high power of the SQRT-CSLA is its logic duplication. The power consumptions of the conventional and proposed CI-CSKA structures are slightly more than that of the CIA. The powers of these adders increase further using the VSS scheme where the number of stages is larger. As mentioned before, the power of the CI-CSKA structure is a little more than that of the conventional one. For example, the power of the VSS-CI-CSKA is 5%–7% larger than that of the VSS-Conv-CSKA. It should be pointed out that while the delay of the VSS-CI-CSKA was smaller than the delay of the SQRT-CSLA, its power is also considerably smaller than that of the SQRT-CSLA. Finally, the results reveal, on average, a 32× reduction in the power consumption of the adders when scaling the supply voltage from 1.1 V to the nMOS threshold voltage.

Fig. 10 shows the PDP of the adders for different supply voltages. The proposed CI-CSKA has the best PDP compared with those of the other structures in the supply voltage range considered in this paper. The highest PDP (with 2.5× more than that of the CI-CSKA structure)
Fig. 10. PDP of the adders versus the supply voltage.
Fig. 12. Energy–delay Pareto-optimal curves for different adders.
TABLE I
AREA USAGES AND NUMBER OF TRANSISTORS OF THE ADDERS
Fig. 14. (a) Ratio of the slack time to DLLP, and the ratio of the slack time to (DLLP)^2 for the four studied variable latency adders. (b) Power consumption and PDP at the nominal and low VDD for the adders. (Nominal VDD is 1.1 V for all the structures.)
corresponding baseline structure of each adder (no variable latency structure) while at the reduced voltage, the variable latency structure is considered.

The results show that the highest power (energy) reduction of 29% belongs to the RCA structure, which has the highest slack time. In this case, the supply voltage reduction was 0.2 V. In the case of the standard C2SLA, since the slack time was small, the voltage reduction was 0.05 V, which led to a power reduction of 6%. For the hybrid C2SLA, the slack time was higher than that of the standard C2SLA and hence a voltage reduction of 0.1 V became possible. This provided a higher power reduction (13%). Finally, the proposed hybrid CSKA had a larger slack compared with that of the hybrid C2SLA, and hence, a voltage reduction of 0.15 V was possible. This provided the structure with a power reduction of 23% (larger than those of the C2SLA structures). The very low delay of the hybrid variable latency CSKA along with its lower power consumption results in the minimum PDP for this structure. In addition, the higher PDP of the C2SLA structures is due to their high power consumptions.

VII. CONCLUSION

In this paper, a static CMOS CSKA structure called CI-CSKA was proposed, which exhibits a higher speed and lower energy consumption compared with those of the conventional one. The speed enhancement was achieved by modifying the structure through the concatenation and incrementation techniques. In addition, AOI and OAI compound gates were exploited for the carry skip logics. The efficiency of the proposed structure for both FSS and VSS was studied by comparing its power and delay with those of the Conv-CSKA, RCA, CIA, SQRT-CSLA, and KSA structures. The results revealed considerably lower PDP for the VSS implementation of the CI-CSKA structure over a wide range of voltage from superthreshold to near threshold. The results also suggested the CI-CSKA structure as a very good adder for applications where both the speed and energy consumption are critical. In addition, a hybrid variable latency extension of the structure was proposed. It exploited a modified parallel adder structure at the middle stage for increasing the slack time, which provided us with the opportunity for lowering the energy consumption by reducing the supply voltage. The efficacy of this structure was compared with those of the variable latency RCA, C2SLA, and hybrid C2SLA structures. Again, the suggested structure showed the lowest delay and PDP, making it a better candidate for high-speed low-energy applications.

REFERENCES

[1] I. Koren, Computer Arithmetic Algorithms, 2nd ed. Natick, MA, USA: A K Peters, Ltd., 2002.
[2] R. Zlatanovici, S. Kao, and B. Nikolic, "Energy–delay optimization of 64-bit carry-lookahead adders with a 240 ps 90 nm CMOS design example," IEEE J. Solid-State Circuits, vol. 44, no. 2, pp. 569–583, Feb. 2009.
[3] S. K. Mathew, M. A. Anders, B. Bloechel, T. Nguyen, R. K. Krishnamurthy, and S. Borkar, "A 4-GHz 300-mW 64-bit integer execution ALU with dual supply voltages in 90-nm CMOS," IEEE J. Solid-State Circuits, vol. 40, no. 1, pp. 44–51, Jan. 2005.
[4] V. G. Oklobdzija, B. R. Zeydel, H. Q. Dao, S. Mathew, and R. Krishnamurthy, "Comparison of high-performance VLSI adders in the energy-delay space," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 13, no. 6, pp. 754–758, Jun. 2005.
[5] B. Ramkumar and H. M. Kittur, "Low-power and area-efficient carry select adder," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 2, pp. 371–375, Feb. 2012.
[6] M. Vratonjic, B. R. Zeydel, and V. G. Oklobdzija, "Low- and ultra low-power arithmetic units: Design and comparison," in Proc. IEEE Int. Conf. Comput. Design, VLSI Comput. Process. (ICCD), Oct. 2005, pp. 249–252.
[7] C. Nagendra, M. J. Irwin, and R. M. Owens, "Area-time-power tradeoffs in parallel adders," IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 43, no. 10, pp. 689–702, Oct. 1996.
[8] Y. He and C.-H. Chang, "A power-delay efficient hybrid carry-lookahead/carry-select based redundant binary to two's complement converter," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 55, no. 1, pp. 336–346, Feb. 2008.
[9] C.-H. Chang, J. Gu, and M. Zhang, "A review of 0.18 μm full adder performances for tree structured arithmetic circuits," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 13, no. 6, pp. 686–695, Jun. 2005.
[10] D. Markovic, C. C. Wang, L. P. Alarcon, T.-T. Liu, and J. M. Rabaey, "Ultralow-power design in near-threshold region," Proc. IEEE, vol. 98, no. 2, pp. 237–252, Feb. 2010.
[11] R. G. Dreslinski, M. Wieckowski, D. Blaauw, D. Sylvester, and T. Mudge, "Near-threshold computing: Reclaiming Moore's law through energy efficient integrated circuits," Proc. IEEE, vol. 98, no. 2, pp. 253–266, Feb. 2010.
[12] S. Jain et al., "A 280 mV-to-1.2 V wide-operating-range IA-32 processor in 32 nm CMOS," in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers (ISSCC), Feb. 2012, pp. 66–68.
[13] R. Zimmermann, "Binary adder architectures for cell-based VLSI and their synthesis," Ph.D. dissertation, Dept. Inf. Technol. Elect. Eng., Swiss Federal Inst. Technol. (ETH), Zürich, Switzerland, 1998.
[14] D. Harris, "A taxonomy of parallel prefix networks," in Proc. IEEE Conf. Rec. 37th Asilomar Conf. Signals, Syst., Comput., vol. 2, Nov. 2003, pp. 2213–2217.
[15] P. M. Kogge and H. S. Stone, "A parallel algorithm for the efficient solution of a general class of recurrence equations," IEEE Trans. Comput., vol. C-22, no. 8, pp. 786–793, Aug. 1973.
[16] V. G. Oklobdzija, B. R. Zeydel, H. Dao, S. Mathew, and R. Krishnamurthy, "Energy-delay estimation technique for high-performance microprocessor VLSI adders," in Proc. 16th IEEE Symp. Comput. Arithmetic, Jun. 2003, pp. 272–279.
[17] M. Lehman and N. Burla, "Skip techniques for high-speed carry-propagation in binary arithmetic units," IRE Trans. Electron. Comput., vol. EC-10, no. 4, pp. 691–698, Dec. 1961.
[18] K. Chirca et al., "A static low-power, high-performance 32-bit carry skip adder," in Proc. Euromicro Symp. Digit. Syst. Design (DSD), Aug./Sep. 2004, pp. 615–619.
[19] M. Alioto and G. Palumbo, "A simple strategy for optimized design of one-level carry-skip adders," IEEE Trans. Circuits Syst. I, Fundam. Theory Appl., vol. 50, no. 1, pp. 141–148, Jan. 2003.
[20] S. Majerski, "On determination of optimal distributions of carry skips in adders," IEEE Trans. Electron. Comput., vol. EC-16, no. 1, pp. 45–58, Feb. 1967.
[21] A. Guyot, B. Hochet, and J.-M. Muller, "A way to build efficient carry-skip adders," IEEE Trans. Comput., vol. C-36, no. 10, pp. 1144–1152, Oct. 1987.
[22] S. Turrini, "Optimal group distribution in carry-skip adders," in Proc. 9th IEEE Symp. Comput. Arithmetic, Sep. 1989, pp. 96–103.
[23] P. K. Chan, M. D. F. Schlag, C. D. Thomborson, and V. G. Oklobdzija, "Delay optimization of carry-skip adders and block carry-lookahead adders using multidimensional dynamic programming," IEEE Trans. Comput., vol. 41, no. 8, pp. 920–930, Aug. 1992.
[24] V. Kantabutra, "Designing optimum one-level carry-skip adders," IEEE Trans. Comput., vol. 42, no. 6, pp. 759–764, Jun. 1993.
[25] V. Kantabutra, "Accelerated two-level carry-skip adders - A type of very fast adders," IEEE Trans. Comput., vol. 42, no. 11, pp. 1389–1393, Nov. 1993.
[26] S. Jia et al., "Static CMOS implementation of logarithmic skip adder," in Proc. IEEE Conf. Electron Devices Solid-State Circuits, Dec. 2003, pp. 509–512.
[27] H. Suzuki, W. Jeong, and K. Roy, "Low power adder with adaptive supply voltage," in Proc. 21st Int. Conf. Comput. Design, Oct. 2003, pp. 103–106.