Ripple-Precharge TCAM (Network Search Engines - Low Power Solution)
All content following this page was uploaded by Mehrdad Nourani on 01 December 2014.
[...] exploiting the above fact, wherein the match line is charged only when there is an exact match in the first four bits of the TCAM word, thereby significantly reducing the number of transitions on the match line. The parasitics for simulation were extracted from the layout implemented for a 64 × 32 RP-TCAM architecture using 0.18 µm technology. This structure has 1.71% less area and 80% less power when compared to the conventional TCAM of equal storage size and functionality. Our RP-TCAM architecture has a search time of 1.86 ns.

I. INTRODUCTION

Content addressable memory (CAM) is a fully associative memory with a search time of only one clock cycle, unlike the traditional random access memory (RAM), which requires two or more clock cycles for a search operation. CAMs find a wide range of applications in cache memories, network search engines, telecommunication and cryptography. CAM is broadly classified into two types: binary CAM and ternary CAM (TCAM). Binary CAM is primarily used as an instruction or data cache, while ternary CAM, which has an additional “don’t-care” state, is mainly used for longest prefix matching tasks in network search engines. One of the major issues in TCAM, or in general any CAM, is its very high dynamic power consumption. The main reason behind this issue is the fully parallel nature of the search operation. This fully parallel search causes all the match lines in a TCAM block to charge in their precharge phase and allows all but one match line to discharge during their evaluation phase. The one match line which does not discharge during the evaluation phase indicates the match in the search operation.

A. Prior Work

Most of the previous work related to power reduction in CAM concentrated mainly on reducing the dynamic power due [...] concept of local search lines having low-swing receivers, which enables the global search lines to have minimum voltage swing. The low-swing receivers, which reside in the local search lines, then amplify this small voltage to obtain the desired voltage level. Hsiao et al. [3] set forth guidelines to design low-power CAM based on their power models. They also minimized the power consumption in their CAM by having two variations of NAND-type CAM bits in their structure and by carefully crafting their layouts [3]. Efthymiou et al. [4] proposed a mixed serial-parallel CAM for use in caches, which exploits the address patterns commonly found in application programs.

Another problem of TCAM, in addition to high power dissipation, is its low storage density, due to the high number of transistors per cell. Each TCAM cell requires 16 transistors (14 in some literature [5]), as opposed to 6 for SRAM or 2 for DRAM [6]. This has inspired some researchers to offer heuristics that optimize TCAM usage [7]. In state-of-the-art TCAM technology, any bit in a word can be masked independently. This flexibility comes at a cost. Each cell includes two SRAM (DRAM) bits to be able to store each of the three possible states of the cell, namely 0, 1, and don’t-care. In an earlier work, we offered an optimized TCAM cell that employs w + 1 RAM bits (instead of the conventional 2w bits) for a word of size w [8]. This structure, called prefix CAM (PCAM), employs about 22% fewer transistors than a conventional TCAM, for equal storage size and equal functionality. However, the power saving of PCAM compared to TCAM was not significant.

B. Main Contribution

A low-power ripple-precharge TCAM (RP-TCAM) architecture, utilized for the longest prefix matching task, is proposed which consumes about 80% less power compared to the conventional TCAM architecture. RP-TCAM has equal area and performance as that of a conventional TCAM. The key novelty in this paper is twofold. Firstly, the precharge voltage to
Proceedings of the 2005 International Conference on Computer Design (ICCD’05)
0-7695-2451-6/05 $20.00 © 2005 IEEE
Timing signal: used to control the sequencing and coordination of various operations (precharging, evaluating, and sensing the match lines)
=> reduces power consumption by simplifying the control and synchronization mechanisms. This can be achieved by optimizing the architecture to minimize the need for complex timing signals or by using
alternative techniques to achieve efficient operation without explicit timing control.
evaluate the parallel search operation is selectively and serially rippled through the first four most significant serial CAM bits. This idea exploits the fact that by just comparing the first four most significant bits of the TCAM word we can identify up to 80% of search mismatches. The second novelty in our proposed RP-TCAM architecture is that it does not have any timing signals. By eliminating the timing signals, we not only reduce the dynamic power, dissipated due to charging and discharging of such highly capacitive node, but also save energy consumed by their drivers.

[Figure: TCAM block diagram. Labels: Data I/O Interface; bit lines BL w-1 ... BL 0; Address Decoder; word lines WL 0 ... WL n-1; stored bits b w-1 ... b 0; match lines ML 0 ... ML n-1; Priority Encoder (Best Match); Output RAM (e.g. next hop).]

[Figure: cell schematic. Labels: BL, BLB, D, DB, CMP, CMPB, X.]
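As a software illustration of the ternary lookup that a TCAM performs in hardware, the following is a minimal Python sketch of longest prefix matching with value/mask pairs. The names `prefix_entry`, `build_table` and `tcam_lookup`, the example prefixes and the next-hop labels are assumptions made for illustration and do not come from the paper. Sorting entries by prefix length stands in for the priority encoder, and the sequential loop stands in for the parallel match lines.

```python
from typing import List, Optional, Tuple

WIDTH = 32  # word width in bits (IPv4 destination address)

# One ternary entry: (value, mask, next_hop). A mask bit of 1 means "care",
# a mask bit of 0 means "don't care".
Entry = Tuple[int, int, str]


def prefix_entry(prefix_bits: str, next_hop: str) -> Entry:
    """Build a ternary entry from a bit string such as '1100' (hypothetical helper)."""
    length = len(prefix_bits)
    value = int(prefix_bits, 2) << (WIDTH - length) if length else 0
    mask = ((1 << length) - 1) << (WIDTH - length)
    return value, mask, next_hop


def build_table(prefixes: List[Tuple[str, str]]) -> List[Entry]:
    """Order entries longest prefix first, so the first match is the best match."""
    entries = [prefix_entry(bits, hop) for bits, hop in prefixes]
    return sorted(entries, key=lambda e: bin(e[1]).count("1"), reverse=True)


def tcam_lookup(table: List[Entry], key: int) -> Optional[str]:
    """Return the next hop of the highest-priority matching entry, or None."""
    for value, mask, next_hop in table:  # hardware compares all rows in parallel
        if (key ^ value) & mask == 0:    # every "care" bit agrees with the key
            return next_hop
    return None


# Illustrative table: two nested prefixes and a default route.
table = build_table([("1100", "A"), ("11001010", "B"), ("", "default")])
print(tcam_lookup(table, 0xCA000001))  # key starts with 11001010
print(tcam_lookup(table, 0xC0000000))  # key starts with 1100 only
```

In this model every row is compared on every lookup, which mirrors the reason a conventional TCAM precharges every match line on every search.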
TABLE III
BEHAVIOR OF THE EVALUATION LOGIC IN RP-TCAM.

D    CMP    X    ML OUT
0    0      0    Vdd
0    1      1    0
1    0      1    Vdd
1    1      0    0

Figure 2. A conventional TCAM cell.
[Figure 2 labels: MD, MDB, MWL, MASK-BIT.]

[Figure: RP-TCAM word schematic. Labels: Vdd; X31 ... X0; ML; MDB27 ... MDB0; LA; MASK-BIT 27 ... MASK-BIT 0.]
[...] TCAM is replaced with a PMOS transistor in the new CAM. The source of the PMOS transistor is connected to ML IN and its drain is connected to the drain of the discharging NMOS transistor whose source is grounded. The node which connects the PMOS and the NMOS transistor is connected to the ML OUT line, which is in turn connected to the ML IN line of the next bit. Therefore, the PMOS transistors in series ripple Vdd only when there is an exact match in the most significant four bits of the RP-TCAM word. The additional NMOS transistor connected to the ML OUT node of the parallel PMOS transistors ripples a 0 whenever there is a mismatch. The match line in the parallel part of the RP-TCAM is thus selectively charged by the rippling Vdd only when there is a match in the first 4 most significant bits. The behavior of the evaluation logic for a ripple-precharge CAM cell is summarized in Table III.

B. Circuit Behavior

The novel RP-TCAM architecture consists of both ripple-precharge CAM bits and conventional TCAM bits as shown in Figure 4. The first four most significant bits of the RP-TCAM word are intentionally CAM bits and not TCAM bits. This architectural change can be validated from the fact that the most significant eight bits of the packet are never masked [12]. Further, the elimination of the mask bit from the first four most significant bits of the TCAM is justified by the results of packet profiling [13]. This is mainly due to the hierarchical structure of internet protocol (IP) address allocation in classless inter-domain routing (CIDR). Therefore, the first four most significant bits of the proposed 32-bit RP-TCAM word are CAM bits and the least significant twenty-eight bits are TCAM bits. The RP-TCAM architecture initially evaluates the first four most significant bits of the search key serially. If there is an exact match between the current data bit stored in the SRAM and the corresponding comparand bit sent through the CMP and CMPB lines, Vdd ripples through the current bit to evaluate the next bit. If there is a mismatch in any of the first four bits of the ripple-precharge CAM, a 0 is propagated to the next bit from the current mismatched bit. The match line in the second part is charged to Vdd only when there is a match in the first four CAM bits.

Whenever there is a mismatch in any one of the three most significant bits of the RP-TCAM, a 0 is propagated to the next bit as shown in Table III. If this 0 starts to ripple through the subsequent bits, i.e. considering the subsequent bits to have an exact match, we might end up having a Vt (threshold voltage) drop at the match line (ML). In order to avoid this and have a clear distinction between a match and a mismatch, an inverter followed by a discharge transistor is connected to ML as shown in Figure 4. The input of the inverter (LA) is connected to the ML OUT of the 29th bit of the RP-TCAM word. This is done in order to foresee a Vt rippling through the series PMOS transistors and avoid having a Vt drop in the match line for a mismatch.

The number of CAM bits through which Vdd is to be rippled is quite small for all practical packet traces and routing tables that we have tried so far. In what follows, we show one such simulation. The simulation results are obtained using the forwarding table taken from the AS1221 edge router on June 10, 2004, which has 168,178 active IPv4 route prefixes [14], and the packet trace from the main router of a national laboratory used also in [15]. The objective behind this simulation is to find out the percentage of prefix mismatches corresponding to the N most significant bits of the incoming packet. The simulation results shown in Figure 5 reflect the percentage of searches in which a mismatch is discovered in the first N most significant bits (100α%). N = 4 is a very good choice because with just the 4 most significant bits of the packet’s destination address we can determine 80% of the prefix mismatches. We did not choose N = 3 because the percentage of prefix mismatches was only 60% for 3 bits, compared to 80% for N = 4. On the other hand, choosing N > 4 is not advisable because we will have a performance penalty as the search time will increase.

C. Power Analysis

Dynamic power consumption is given by $P_{dyn} = \frac{1}{2} C_{load} \cdot f \cdot V_{dd}^2$, where $C_{load}$ is the load capacitance switched, $V_{dd}$ is the supply voltage and $f$ is the frequency of transitions on that particular node. In our application, for n-word memory modules we have:

$$P_{TCAM} = n \cdot P_{ML} = n \cdot \frac{1}{2} C_{ML} \cdot f_{TCAM} \cdot V_{dd}^2$$

$$P_{RP\text{-}TCAM} = n \cdot P_{ML} = n \cdot \frac{1}{2} C_{ML} \cdot f_{RP\text{-}TCAM} \cdot V_{dd}^2 \qquad (1)$$
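The transition frequency of the RP-TCAM match line depends on how often a mismatch is caught within the first N bits, which is what the profiling experiment measures. The following is a minimal Python sketch of such a profile, not the authors' simulator. The function names, the toy prefixes and the toy keys are hypothetical, and counting over all (key, entry) comparisons is an assumption about how the percentage 100α% is defined.

```python
from typing import List, Tuple

WIDTH = 32  # IPv4 destination address width in bits

# A prefix is (value, length); value is left-aligned in a 32-bit word.
Prefix = Tuple[int, int]


def mismatch_in_first_bits(prefix: Prefix, key: int, n_bits: int) -> bool:
    """True if key and prefix differ within the n_bits most significant bits."""
    value, length = prefix
    checked = min(n_bits, length)  # bits past the prefix length are don't-care
    if checked == 0:
        return False
    shift = WIDTH - checked
    return (key >> shift) != (value >> shift)


def early_mismatch_fraction(prefixes: List[Prefix], keys: List[int], n_bits: int) -> float:
    """Fraction of all (key, entry) comparisons that mismatch within n_bits."""
    comparisons = len(prefixes) * len(keys)
    early = sum(mismatch_in_first_bits(p, k, n_bits) for k in keys for p in prefixes)
    return early / comparisons


# Toy data (hypothetical; not the AS1221 table or the laboratory trace).
PREFIXES = [
    (0xC0A80000, 16),  # 192.168.0.0/16
    (0x0A000000, 8),   # 10.0.0.0/8
    (0xAC100000, 12),  # 172.16.0.0/12
    (0xC0000200, 24),  # 192.0.2.0/24
]
KEYS = [0xC0A80101, 0x0A010203]  # 192.168.1.1 and 10.1.2.3

for n in (1, 2, 4, 8):
    print(n, early_mismatch_fraction(PREFIXES, KEYS, n))
```

With a real forwarding table and packet trace, sweeping `n_bits` in this way would produce a curve of the same kind as the mismatch-versus-N plot; the toy numbers here carry no significance.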
[Figure 5 plot: y-axis "Mismatch for N bits (%)"; x-axis "Number of bits (N)".]

Figure 6. Signal waveform on the match line for RP-TCAM (top curve) and TCAM (bottom curve).
[Figure 6 plot: curves ML(RP) and ML(CONV); x-axis in nanoseconds.]

[Figure 7 plot: title "Average Power"; y-axis "Power (Watts)"; x-axis "Time (s)"; curve label RP-TCAM; marked point (95.179n, −0.0038481).]

Suppose α is the fraction of searches that lead to a mismatch found by checking the first 4 bits. Based on what we discussed so far, in the RP-TCAM architecture we expect this fraction to be large, which means that on average 100α% of searches experience $\frac{4}{32} f_{TCAM}$ transitions and the rest see all-bit transitions in a word. In other words:

$$f_{RP\text{-}TCAM} = \left( \alpha \cdot \frac{4}{32} + (1 - \alpha) \cdot \frac{32}{32} \right) f_{TCAM} = \left( 1 - \frac{7}{8}\alpha \right) f_{TCAM}$$

The total saving that the RP-TCAM architecture will achieve is:

$$\Delta P = \frac{P_{TCAM} - P_{RP\text{-}TCAM}}{P_{TCAM}} = \frac{f_{TCAM} - f_{RP\text{-}TCAM}}{f_{TCAM}} = \frac{7}{8}\alpha \qquad (2)$$
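As a worked instance of the saving estimate, the short Python sketch below evaluates the transition ratio and the relative power saving for a given α. The function names and the `ripple_bits` and `word_bits` parameters are assumptions introduced for illustration; the paper itself only treats the case of 4 ripple bits in a 32-bit word. With α = 0.8, the fraction reported for N = 4, the model predicts a saving of (7/8) · 0.8 = 0.70.

```python
def transition_ratio(alpha: float, ripple_bits: int = 4, word_bits: int = 32) -> float:
    """Return f_RP-TCAM / f_TCAM for a fraction alpha of early mismatches."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be a fraction between 0 and 1")
    # Early mismatches toggle only the ripple bits; the rest toggle the whole word.
    return alpha * ripple_bits / word_bits + (1.0 - alpha)


def power_saving(alpha: float, ripple_bits: int = 4, word_bits: int = 32) -> float:
    """Return the relative saving, 1 - f_RP-TCAM / f_TCAM."""
    return 1.0 - transition_ratio(alpha, ripple_bits, word_bits)


print(f"{power_saving(0.8):.3f}")  # prints 0.700
```

Since the capacitance, supply voltage and word count cancel in the ratio of the two power expressions, only the transition frequencies enter this estimate.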
A. Layout Implementation

The layouts of the proposed 32-bit RP-TCAM architecture as well as the conventional TCAM architecture were drawn using Cadence tools in a 0.18 µm digital CMOS process [16]. The supply voltage was fixed at Vdd = 1.8 V. The parasitics were extracted for both layouts and the circuit simulation was done using SPICE3 [17]. The resulting waveforms of the match line for the 32-bit RP-TCAM and TCAM words are shown in Figure 6. In the RP-TCAM architecture the match line (top curve) charges to Vdd only when there is an exact match between the TCAM word entry and the input comparand, unlike the conventional TCAM architecture, wherein the match line (bottom curve) precharges during every search operation and discharges during all but one search operation. The waveforms comparing the time-average power consumed by both architectures are shown in Figure 7. This is done by applying 50 keys (longest prefix searches) and using a transistor-level (SPICE) simulator [17].

Figure 7. Average power consumed by TCAM and RP-TCAM architectures.

The layouts for the 64 × 32 bit RP-TCAM block and the conventional TCAM block were drawn and are shown in Figure 8. There is no area increase in RP-TCAM. Table IV shows the comparison of time-average power, area and search time of the conventional TCAM and proposed RP-TCAM architectures. A negative number in the last column shows improvement in area and power. We can clearly see that, even for 50 searches, the average power consumed by the RP-TCAM architecture is 75.79% less than that of the conventional architecture. RP-TCAM has an area saving of 1.71% with 2.76% degradation in performance.

B. Comments on Design and Implementation Issues

The most significant four bits of the CAM are arranged in a manner that they occupy the area equivalent to two TCAM
cells. Therefore, we have an area saving of two TCAM bits when compared to the conventional TCAM architecture. The additional overhead of an inverter and an NMOS transistor to discharge the match line is negligible when compared to other very recent architectures [4], which have much higher area overhead. The ripple-precharge CAM bit has PMOS transistors in its evaluation logic, whereas the conventional TCAM has NMOS transistors. This makes the evaluation logic for that part of the CAM a little bigger than that of its TCAM counterpart. Overall, we have an area saving of 1.71% compared to the conventional TCAM architecture.

In Figure 6 we see that there is a small glitch in the match line of the proposed RP-TCAM architecture. The glitch is due to the corner case wherein the mismatch occurs in the least significant 28 bits and not in the most significant 4 bits of the TCAM. The effect of this mismatch is that both charging and evaluation of the match line of RP-TCAM take place at the same time. This effect is minimized by making the PMOS transistors of the serial TCAM weaker, i.e. by decreasing their sizes while not compromising on performance. Another important design constraint which we face is the leakage current during a mismatch in the first four most significant bits. During a mismatch, the output of the CAM cell (X) is Vdd − Vtn and not Vdd. This makes the PMOS transistor leak. This issue can be addressed either by having transmission gate logic in the evaluation transistors or by having high-Vt transistors.

Figure 8. Layouts of 64 × 32 RP-TCAM and TCAM architectures.

TABLE IV
COMPARISON OF 64 × 32 CONVENTIONAL AND PROPOSED TCAM ARCHITECTURE.

Comparison Metric     TCAM     RP-TCAM   Change [%]
Area (mm^2)           0.292    0.287     -1.71
Search Time (ns)      1.81     1.86      +2.76
Average Power (mW)    15.90    3.85      -75.79

V. CONCLUSION

This paper presents a novel, low-power TCAM architecture used for longest prefix matching tasks in network search engine applications. Our proposed RP-TCAM architecture implements the idea of rippling Vdd to selectively precharge the highly capacitive match line. The condition for precharging the match line is derived from the fact that more than 80% of mismatches can be identified by just comparing the first four bits of the prefixes. By exploiting this inherent property of the prefixes we designed our RP-TCAM. Both the conventional TCAM and the proposed RP-TCAM were implemented in 0.18 µm technology. The average power consumed by the proposed RP-TCAM is found to be 75.79% less than that of the conventional TCAM. The area of the RP-TCAM is found to be 1.71% less than that of the conventional TCAM for equal storage size and functionality. Our RP-TCAM architecture has a search time of 1.86 ns.

ACKNOWLEDGEMENTS

This work has been supported in part by the Cisco Systems Academic Research and Technology Initiative Award.

REFERENCES

[1] C.A. Zukowski and S.Y. Wang, “Use of selective precharge for low-power content-addressable memories,” In Proceedings of the IEEE International Symposium on Circuits and Systems, Vol. 3, pages 1788-1791, 1997.
[2] K. Pagiamtzis and A. Sheikholeslami, “Pipelined match-lines and hierarchical search-lines for low-power content-addressable memories,” In Proceedings of the IEEE Custom Integrated Circuits Conference, pages 383-386, 2003.
[3] I.Y.L. Hsiao, D.H. Wang, and C.W. Jen, “Power modeling and low-power design of content addressable memories,” In Proceedings of the IEEE International Symposium on Circuits and Systems, Vol. 4, pages 926-929, 2001.
[4] Aristides Efthymiou and Jim D. Garside, “A CAM with Mixed Serial-Parallel comparison for use in Low Energy Caches,” IEEE Transactions on Very Large Scale Integration Systems, Vol. 12, no. 3, Mar. 2004.
[5] I. Arsovski, T. Chandler, and A. Sheikholeslami, “A Ternary Content-Addressable Memory (TCAM) Based on 4T Static Storage and Including a Current-Race Sensing Scheme,” IEEE Journal of Solid-State Circuits, vol. 38, no. 1, January 2003.
[6] J. Rabaey, Digital Integrated Circuits, Prentice Hall, 1996.
[7] H. Liu, “Routing Table Compaction in Ternary CAM,” IEEE Micro, January, February 2002.
[8] M. Akhbarizadeh, M. Nourani, D. Vijayasarathi, P. Balsara, “PCAM: A Ternary CAM Optimized for Packet Forwarding Tasks,” IEEE International Conference on Computer Design, October 2004.
[9] Pei, T.-B., Zukowski, C. “Putting Routing Tables in Silicon,” IEEE Network Magazine, January 1992, 42-49.
[10] Integrated Device Technology, Inc., www.idt.com, 2005.
[11] Chris H. Kim and Kaushik Roy, “Dynamic Vt SRAM: A Leakage Tolerant Cache Memory for Low Voltage Microprocessors,” In Proceedings of ISLPED’02, August 2002.
[12] RFC1519, the Internet Engineering Task Force, www.ietf.org, 2005.
[13] Internet Performance Measurement and Analysis Project, Univ. of Michigan and Merit Network Inc., www.merit.edu, 2005.
[14] bgp.potaroo.net, “BGP Routing Table Analysis Reports,” 2004.
[15] T. Chiueh and P. Pradham, “Cache Memory Design for Internet Processors,” IEEE Micro, February 2000.
[16] Cadence Design Systems Inc., “Virtuoso Layout Editor Users Guide - Version 4.4.6,” June 2000.
[17] Texas Instruments Inc., “TI Spice3 User’s and Reference Manual - Version 1.6,” 1994.