
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 23, NO. 4, APRIL 2015

Algorithm and Architecture for a Low-Power Content-Addressable Memory Based on Sparse Clustered Networks

Hooman Jarollahi, Student Member, IEEE, Vincent Gripon, Naoya Onizawa, Member, IEEE, and Warren J. Gross, Senior Member, IEEE

Abstract— We propose a low-power content-addressable memory (CAM) employing a new algorithm for associativity between the input tag and the corresponding address of the output data. The proposed architecture is based on a recently developed sparse clustered network using binary connections that on average eliminates most of the parallel comparisons performed during a search. Therefore, the dynamic energy consumption of the proposed design is significantly lower than that of a conventional low-power CAM design. Given an input tag, the proposed architecture computes a few possibilities for the location of the matched tag and performs the comparisons on them to locate a single valid match. TSMC 65-nm CMOS technology was used for simulation purposes. Following a selection of design parameters, such as the number of CAM entries, the energy consumption and the search delay of the proposed design are 8% and 26% of those of the conventional NAND architecture, respectively, with a 10% area overhead. A design methodology based on the silicon area and power budgets, and on performance requirements, is discussed.

Index Terms— Associative memory, content-addressable memory (CAM), low-power computing, recurrent neural networks, sparse clustered networks (SCNs).

Manuscript received July 5, 2013; revised February 24, 2014; accepted April 8, 2014. Date of publication April 30, 2014; date of current version March 18, 2015. H. Jarollahi and W. J. Gross are with the Department of Electrical and Computer Engineering, McGill University, Montreal, QC H3A 2A7, Canada. V. Gripon is with the Department of Electronics, Télécom Bretagne, Brest 29238, France. N. Onizawa is with the Research Institute of Electrical Communication, Tohoku University, Sendai 980-8577, Japan. Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TVLSI.2014.2316733

I. INTRODUCTION

A CONTENT-addressable memory (CAM) is a type of memory that can be accessed using its contents rather than an explicit address. In order to access a particular entry in such memories, a search data word is compared against previously stored entries in parallel to find a match. Each stored entry is associated with a tag that is used in the comparison process. Once a search data word is applied to the input of a CAM, the matching data word is retrieved within a single clock cycle if it exists. This prominent feature makes CAM a promising candidate for applications where frequent and fast look-up operations are required, such as in translation look-aside buffers (TLBs) [1], [2], network routers [3], [4], database accelerators, image processing, parametric curve extraction [5], Hough transformation [6], Huffman coding/decoding [7], virus detection [8], Lempel–Ziv compression [9], and image coding [10].

Due to the frequent and parallel search operations, CAMs consume a significant amount of energy. CAM architectures typically use highly capacitive search lines (SLs), causing them not to be energy efficient when scaled. For example, this power inefficiency has constrained TLBs to no more than 512 entries in current processors. In the Hitachi SH-3 and StrongARM embedded processors, the fully associative TLBs consume about 15% and 17% of the total chip power, respectively [11]–[13]. Consequently, the main research objective has been to reduce the energy consumption without compromising the throughput. Energy-saving opportunities have been discovered by employing either circuit-level techniques [14], [15], architectural-level techniques [16], [17], or the codesign of the two [18], some of which have been surveyed in [19]. Although dynamic CMOS circuit techniques can result in low-power and low-cost CAMs, these designs can suffer from low noise margins, charge sharing, and other problems [16].

A new family of associative memories based on sparse clustered networks (SCNs) has recently been introduced [20], [21] and implemented using field-programmable gate arrays (FPGAs) [22]–[24]. Such memories make it possible to store many short messages instead of a few long ones, as in conventional Hopfield networks [25], with a significantly lower level of computational complexity. Furthermore, a significant improvement is achieved in terms of the number of information bits stored per memory bit (efficiency). In this paper, a variation of this approach and a corresponding architecture are introduced to construct a classifier that can be trained with the association between a small portion of the input tags and the corresponding addresses of the output data. The term CAM refers to binary CAM (BCAM) throughout this paper. Originally included in [26], preliminary results were introduced for an architecture with particular parameters conditioned on a uniform distribution of the input patterns. In this paper, an extended version is presented that elaborates the effect of the design's degrees of freedom, and the effect of nonuniformity of the input patterns, on energy consumption and performance.

The proposed architecture (SCN-CAM) consists of an SCN-based classifier coupled to a CAM array.

The CAM array is divided into several equally sized sub-blocks, which can be activated independently. For a previously trained network and a given input tag, the classifier uses only a small portion of the tag and predicts very few sub-blocks of the CAM to be activated. Once the sub-blocks are activated, the tag is compared against the few entries in them while keeping the rest deactivated, thus lowering the dynamic energy dissipation.

The rest of this paper is organized as follows. Section II describes the basic operation of the CAM. In Section III, some of the recent research works related to this area are summarized. In Section IV, the proposed associativity algorithm is introduced. Section V describes the hardware architecture, followed by Section VI with the simulation results. Circuit-level simulations throughout this paper are obtained using HSPICE and TSMC 65-nm CMOS technology. Finally, conclusions are drawn in Section VII.

II. CAM REVIEW

In a conventional CAM array, each entry consists of a tag that, if matched with the input, points to the location of a data word in a static random access memory (SRAM) block. The actual data of interest are stored in the SRAM, and a tag is simply a reference to them. Therefore, when it is required to search for the data in the SRAM, it suffices to search for the corresponding tag. Consequently, the tag may be shorter than the SRAM data and would require fewer bit comparisons.

An example of a typical CAM array, consisting of four entries having 4 bits each, is shown in Fig. 1. A search data register is used to store the input bits. The register applies the search data on the differential SLs, which are shared among the entries. Then, the search data are compared against all of the CAM entries. Each CAM word is attached to a common match line (ML) among its constituent bits, which indicates whether or not they match the input bits. Since the MLs are highly capacitive, a sense amplifier is typically considered for each ML to increase the performance of the search operation.

Fig. 1. Simple example of a 4 × 4 CAM array consisting of the CAM cells, MLs, sense amplifiers, and differential SLs.

As an example, in TLBs, the tag is the virtual page number (VPN), and the data are the corresponding physical page number (PPN). A virtual address generated by the CPU consists of the VPN and a page offset. The page offset is later used along with the PPN to form the physical address. Since most TLBs are fully associative, in order to find the corresponding PPN, a fully parallel search among the VPNs is conducted for every generated virtual address [2].

A BCAM cell is typically the integration of a six-transistor (6T) SRAM cell and comparator circuitry. The comparator circuitry is made of either an XNOR or an XOR structure, leading to a NAND-type or a NOR-type operation, respectively. The selection of the comparing structure depends on the performance and power requirements, as a NAND-type operation is slower and consumes less energy than a NOR type.

Fig. 2. Classical BCAM cell types. (a) 10T NOR. (b) 9T NAND.

The schematics of the two typical BCAM cell types are shown in Fig. 2. In a NAND-type CAM, the MLs are precharged high during the precharge phase. During the evaluation phase, in the case of a match, the corresponding ML is pulled down through a series of transistors [M5 in Fig. 2(b)], performing a logical NAND in the comparison process. In a NOR-type CAM [Fig. 2(a)], the MLs are also precharged high during the precharge phase. However, during the evaluation phase, all of the MLs are pulled down unless there is a matched entry such that the pull-down paths M3–M4 and M5–M6 are disabled. Therefore, a NOR-type CAM has higher switching activity than a NAND type, since there are typically more mismatched entries than matched ones. Although a NAND-type CAM has the advantage of lower energy consumption compared with its NOR-type counterpart, it has two drawbacks: 1) a quadratic delay dependence on the number of cells due to the serial pull-down path and 2) a low noise margin.
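To make the switching-activity argument concrete, the toy count below tallies how many match lines discharge per search in each cell type: in a NOR-type array, every mismatching ML is pulled down, whereas in a NAND-type array only a matching ML completes its serial pull-down path. This is an illustrative sketch we added (it is not from the paper) and it ignores precharge energy and bit-level effects.

```python
import random

def ml_discharges(entries, key, cam_type):
    # NOR-type: each mismatching word pulls its ML down -> roughly one event per mismatch.
    # NAND-type: only a fully matching word conducts through its serial NAND chain.
    matches = sum(e == key for e in entries)
    return matches if cam_type == "nand" else len(entries) - matches

random.seed(0)
entries = [random.getrandbits(4) for _ in range(16)]   # toy 16-entry, 4-bit CAM
key = entries[3]                                       # ensure at least one stored match
print("NOR-type ML discharges :", ml_discharges(entries, key, "nor"))   # most MLs switch
print("NAND-type ML discharges:", ml_discharges(entries, key, "nand"))  # only the match(es)
```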

III. RELATED WORK

Energy reduction of CAMs employing circuit-level techniques is mostly based on the following strategies: 1) reducing the SL energy consumption by disabling the precharge process of SLs when not necessary [18], [27]–[30] and 2) reducing the ML precharging, for example, by segmenting the ML, selectively precharging the first few segments, and then propagating the precharge process if and only if those first segments match [31]. This segmentation strategy increases the delay as the number of segments is increased. A hybrid-type CAM integrates the low-power feature of the NAND type with the high-performance NOR type [12]; similar to the selective-precharging method [31], the ML is segmented into two portions. The high-speed CAM designed in 32-nm CMOS [1] achieves a cycle time of 290 ps using a swapped CAM cell that reduces the search delay while requiring a larger CAM cell (11 transistors) than the conventional CAM cell [9 transistors (9T)] used in SCN-CAM. A high-performance AND-type match-line scheme is proposed in [32], where multiple fan-in AND gates are used for low switching activity, along with segmented-style match-line evaluation to reduce the energy consumption.

In the bank-selection architecture [33], [34], the CAM array is divided into B equally partitioned banks that are activated based on the value of log2(B) bits added to the search data word. These extra bits are decoded to determine which banks must be selected. This architecture was considered at first to reduce the silicon area by sharing the comparison circuitry between the blocks, but was later considered for power reduction as well. The drawback of this architecture is that the banks can overflow, since the length of the words remains the same for all the banks. For example, let us consider a 128k-entry CAM that incorporates 60-bit words and one additional bank-select bit such that two banks result with 64k entries each. Therefore, each bank can receive any of 2^60 possible words, causing an overflow probability that is higher than when the array is not banked. This overflow would require extra circuitry that reduces the power-saving opportunity since, as a result, multiple banks are activated concurrently [19].

The precomputation-based CAM (PB-CAM) architecture (also known as one's count) was introduced in [16]. PB-CAM divides the comparison process and the circuitry into two stages. First, it counts the number of ones in an input and then compares the result with those of the entries using an additional CAM circuit in which the number of ones in the CAM data has previously been stored. This activates a few MLs and deactivates the others. In the second stage, a modified CAM hierarchy is used, which has reduced complexity and has only one pull-down path instead of two compared with the conventional design. The modified architecture only considers 0 mismatches instead of a full comparison, since the 1s have already been compared. The number of comparisons can be reduced to M × log(N + 2) + (M × N)/(N + 1) bits, where M is the number of entries in the CAM and N is the number of bits per entry. In the proposed design, we demonstrate how it is possible to reduce the number of comparisons to only N bits. Furthermore, in PB-CAM, increasing the tag length increases the energy consumption and the delay, and also complicates the precomputation stage.

In the asynchronous architecture proposed in [15], as the CAM assigns consecutive search data matched in different word blocks, it operates based on the delay of matching its first few bits instead of its full length, as long as the consecutive subsearch words are different. However, the cycle time is drastically increased when the search-data patterns are correlated. For example, if we have correlations in the first 8 bits of the stored data, the cycle time is increased to 1.359 ns, which is 5.2 times that of the noncorrelated scenario. In the proposed design, the cycle time is independent of the correlation between the input patterns. Furthermore, the asynchronous architecture in [15] is more susceptible to process variations compared with its synchronous counterparts.

Fig. 3. Top-level block diagram of SCN-CAM. The CAM array is divided into M/ζ − 1 sub-blocks that can be independently activated for comparison. The compare-enable signals are generated by the SCN-based classifier.

IV. SCN-CAM ALGORITHM

As shown in Fig. 3, the proposed architecture (SCN-CAM) consists of an SCN-based classifier, which is connected to a special-purpose CAM array. The SCN-based classifier is first trained with the association between the tags and the address of the data to be later retrieved. The proposed CAM array is based on a typical architecture, but is divided into several sub-blocks that can be compare-enabled independently. Therefore, it is also possible to train the network with the association between the tag and each CAM sub-block if the number of desired sub-blocks is known. However, in this paper, we focus on a generic architecture that can be easily optimized for any number of CAM sub-blocks. Once an input tag is presented to the SCN-based classifier, it predicts which CAM sub-block(s) need to be compare-enabled and thus saves dynamic power by disabling the rest. Disabling a CAM sub-block avoids charging its highly capacitive SLs while applying the search data, and also turns the precharge path off for the MLs.

We show how it is possible, through an algorithmic reduction in hardware complexity, to reduce the number of comparisons to only one on average.


SCN-CAM uses only a portion of the actual tag to create or recover the association with the corresponding output. The operation of the CAM, on average, allows this reduction in the tag length. A large enough tag length permits SCN-CAM to always point to a single sub-block. However, the length of the reduced-length tag affects the hardware complexity of the SCN-based classifier. The length of the reduced-length tag is not dependent on the length of the original tag, but rather on the number of CAM entries.

Fig. 4. Representation of the proposed SCN-CAM consisting of M entries and a reduced-length tag of c × log2(l) bits.

A. SCN-Based Classifier

As shown in Fig. 4, an SCN-based classifier consists of two parts: 1) PI and 2) PII. The neurons in PI are binary, correspond to the input tags, and are grouped into c equally sized clusters with l neurons in each. Processing of an input tag in the SCN-based classifier occurs in either of two situations: training or decoding. In this paper, for either training or decoding purposes, the input tag is reduced in length to q bits and then divided into c equally sized partitions of length κ bits each. Each partition is then mapped to the index of a neuron in its corresponding cluster in PI, using a direct binary-to-integer mapping from the tag portion to the index of the neuron to be activated. Thus, l = 2^κ. If l is a given parameter, the number of clusters is calculated to be c = q/log2(l). Therefore, for simplicity in hardware implementation, we can choose q to be a multiple of κ. It is important to note that there are no connections between the neurons in PI. PII is a single cluster consisting of M neurons, which is equal to the number of entries in the CAM. Each neuron in PII, n_i', is connected to every neuron in PI via a connection whose binary value is w_(i,j)(i'), and is thus equal to either 0 or 1. The value of w_(i,j)(i') determines whether there exists an association between the jth neuron in the ith cluster in PI and the i'th neuron in PII.

1) Network Training: The binary values of the connections in the SCN-based classifier indicate associations of the input tags and the corresponding outputs. The connection values are set during the training process and are stored in a memory module such that they can later be used to retrieve the address of the target data. A connection has value 1 when there exists an association between the corresponding neuron in PI and a CAM entry, represented as a neuron in PII.

For example, let us assume c = 2 and q = 6. For a reduced-length input tag 101110 associated with the fourth entry in the CAM, we first split this input into two parts: 101 and 110. Then, each part is associated with a neuron in the corresponding cluster in PI: 5 for 101 and 6 for 110. Finally, the connections from these neurons toward the target neuron, 4, in PII are added. That is, w_(1,5)(4) and w_(2,6)(4) are set equal to 1.

2) Network Update: When an update is requested in SCN-CAM, retraining the entire SCN-based classifier with the entries is not required. The reason lies in the fact that the output neurons of PII are independent of each other. Therefore, by deleting the connections from a neuron in PII to the corresponding neurons in PI, a tag can be deleted. In other words, to delete an entry, c connections are removed, one for each cluster. Adding new connections to the same neuron in PII, but to different neurons in PI, adds a new entry to the SCN-based classifier. The new entry can therefore be added by adding new connections while keeping the previous connections for other entries in the network.

3) Tag Decoding: Once the SCN-based classifier has been trained, the ultimate goal after receiving the tag is to determine which neuron(s) in PII should be activated based on the given q bits of the tag. This process is called decoding, in which the connection values are recalled from the memory. The decoding process is divided into four steps.

1) An input tag is reduced in length to q bits and divided into c equally sized partitions. The q bits can be selected within the tag bits in such a way as to reduce the correlation.

2) Local Decoding (LD): A single neuron per cluster in PI is activated using a direct binary-to-integer mapping from the tag portion to the index of the neuron to be activated.

3) Global Decoding (GD): GD determines which neuron(s) in PII must be activated based on the results from LD and the stored connection values. If there exists at least one active connection from each cluster in PI toward a neuron in PII, that neuron is activated. GD can be expressed as

v_{n_{i'}} = \bigwedge_{i=1}^{c} \bigvee_{j=1}^{l} \left( w_{(i,j)(i')} \wedge v_{(i,j)} \right)    (1)

where \bigvee and \bigwedge represent logical OR and AND operations, respectively, v_{(i,j)} is the binary value of the jth neuron in the ith cluster in PI, and v_{n_{i'}} is the binary value of the i'th neuron in PII.

4) If more than one neuron is activated in PII, then the same number of word comparisons is required to detect the correct match. A single activated neuron means no further comparisons are required. Because we may not afford (in terms of silicon area) to implement one independently controlled CAM row per neuron, the neurons in PII are grouped into groups of ζ neurons. Each group of neurons generates a single activation signal to enable parallel comparison operations in its corresponding CAM sub-block. A logical OR operation is thus performed on the values of each group of neurons, resulting in the generation of M/ζ bits, which is also equal to the number of CAM sub-blocks.
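As a reading aid, the short Python sketch below mirrors the training, deletion, and global-decoding steps just described for a purely software model of the classifier. The parameters c, κ (kappa), l = 2^κ, and M, and the worked example with tag 101110 and entry 4, follow the text; the class and method names are our own illustrative choices, not part of the paper or its hardware.

```python
# Minimal software model of the SCN-based classifier (illustrative only).
# c clusters of l = 2**kappa binary neurons form P_I; M output neurons form P_II.
# w[i][j][i_] == 1 encodes a connection from neuron j of cluster i to output neuron i_.
class SCNClassifier:
    def __init__(self, c, kappa, M):
        self.c, self.kappa, self.l, self.M = c, kappa, 2 ** kappa, M
        self.w = [[[0] * M for _ in range(2 ** kappa)] for _ in range(c)]

    def _split(self, tag_bits):
        # Local decoding: direct binary-to-integer mapping of each kappa-bit subtag.
        return [int(tag_bits[i * self.kappa:(i + 1) * self.kappa], 2)
                for i in range(self.c)]

    def train(self, tag_bits, entry):
        # Add one connection per cluster toward the output neuron 'entry'.
        for i, j in enumerate(self._split(tag_bits)):
            self.w[i][j][entry] = 1

    def delete(self, tag_bits, entry):
        # Remove the c connections of one entry; other entries stay untouched.
        for i, j in enumerate(self._split(tag_bits)):
            self.w[i][j][entry] = 0

    def decode(self, tag_bits):
        # Global decoding, Eq. (1): an output neuron fires only if every cluster
        # has an active connection toward it from its single activated neuron.
        active = self._split(tag_bits)
        return [i_ for i_ in range(self.M)
                if all(self.w[i][j][i_] for i, j in enumerate(active))]

# Worked example from the text: c = 2, q = 6, tag 101110 activates neurons 5 and 6,
# and the tag is associated with CAM entry 4.
clf = SCNClassifier(c=2, kappa=3, M=8)
clf.train("101110", entry=4)
assert clf.decode("101110") == [4]
```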


Fig. 5. Relationship between the length of the truncated tag (q), the number of matched entries in SCN-CAM (λ), and the estimated matching probability (P(λ)) for M = 512.

B. Tag-Length Reduction

Given the input tags, the number of bits in the reduced-length tag, q, determines the number of possible ambiguities in PII. The generated ambiguities can be corrected with additional comparisons to find the exact match in the CAM. Therefore, no errors are produced in determining the matched result(s). On the other hand, no length reduction leads to the generation of no ambiguities, but to a higher level of hardware complexity in the SCN-based classifier, since more neurons are required.

C. Data Distribution

The number of ambiguities generated in PII is dependent on the correlation factor of the tag pattern, that is, the number of similar repeating bits in the subset of tags. A higher degree of similarity results in a higher number of ambiguous neurons. If the tag pattern is previously known, it is possible to select the reduced-length tag bits among those that have a lower likelihood of similarity. Otherwise, SCN-CAM still detects the valid match, but at a higher cost in power consumption.

Assuming a random distribution for the CAM entries, the expected number of possible matches is the actual match plus the expected number of ambiguities. The number of ambiguities given a random and uniformly distributed input pattern can be estimated using (2), where λ is a random variable representing the number of ambiguities and Pλ is the probability that exactly λ ambiguities occur using q bits of the tag. Therefore, λ follows a binomial distribution, as shown in (2). Software simulations on numerical data samples verify the validity of (2). Therefore, according to the binomial law, the expected value of λ, E(λ), and its variance, Var(λ), can be calculated as shown in (3) and (4), respectively. If q = log2(M), only one ambiguity is obtained on average, leading to the activation of one extra neuron in PII in addition to the actual match. Consequently, this value of q, and the corresponding λ, is used throughout this paper as a starting point to estimate the cycle time and the energy consumption of the proposed CAM design.

Fig. 6. Expected value of the number of required comparisons in SCN-CAM versus the number of bits in the reduced-length tag.

Fig. 5 shows simulation results on how it is possible to reduce the estimated number of required comparisons by increasing q. It is interesting to note that the number of clusters in PI does not affect the number of required comparisons

P_\lambda = \binom{M-1}{\lambda} \left(\frac{1}{2^q}\right)^{\lambda} \left(1 - \frac{1}{2^q}\right)^{M-1-\lambda}    (2)

E(\lambda) = \sum_{\lambda} \lambda \cdot P_\lambda = (M-1)/2^q    (3)

\mathrm{Var}(\lambda) = (M-1)(1/2^q)(1 - 1/2^q).    (4)

If the input pattern is correlated in the sense that certain bits repeat their values among all of the CAM entries, (3) can be modified to the following:

E_c(\lambda) = \frac{M-1}{2^{q-k}}    (5)

where k is the number of similar bits in q, and E_c is the expected value of the number of matched entries. We can consider more complex models, such as when the reduced tags are obtained using a Bernoulli distribution with parameter α. We then partition reduced tags depending on the number of ones, i, they contain. In such a case, we obtain

E_G(\lambda) = (M-1) \sum_{i=0}^{q} \binom{q}{i} \left(\alpha^i (1-\alpha)^{q-i}\right)^2.    (6)

In particular, we verify that for α = 1/2, E_G(λ) corresponds to the independent identically distributed uniform case. Fig. 6 shows simulation results based on one million random and uniformly distributed reduced-length tags and two different CAM sizes. It shows how the expected value of the number of possible matches (E(λ)) is decreased to only one by increasing the number of bits in the reduced-length tag, as shown in (2) and (3). The algorithm of SCN-CAM is similar to that of the precomputation-based CAM (PB-CAM) [16], [17]. A drawback of such methods, unlike SCN-CAM, is that as the length of the tags is increased, the cycle time and the circuit complexity of the precomputation stage are dramatically increased. Furthermore, we will show that, unlike the PB-CAMs, SCN-CAM can potentially narrow down the search procedure to only one comparison with a computational complexity that does not grow with the tag length.
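For readers who want to reproduce the trend of Figs. 5 and 6, the short sketch below evaluates (2), (3), and (6) numerically. The function names and the sanity checks are ours; only the formulas come from the text.

```python
from math import comb

def p_lambda(lam, M, q):
    # Eq. (2): probability of exactly 'lam' ambiguities among the other M - 1 entries.
    p = 1.0 / 2 ** q
    return comb(M - 1, lam) * p ** lam * (1 - p) ** (M - 1 - lam)

def expected_ambiguities(M, q):
    # Eq. (3): E(lambda) = (M - 1) / 2**q for uniformly distributed reduced-length tags.
    return (M - 1) / 2 ** q

def expected_ambiguities_bernoulli(M, q, alpha):
    # Eq. (6): reduced tags drawn i.i.d. Bernoulli(alpha); alpha = 1/2 recovers Eq. (3).
    return (M - 1) * sum(comb(q, i) * (alpha ** i * (1 - alpha) ** (q - i)) ** 2
                         for i in range(q + 1))

M, q = 512, 9   # q = log2(M): about one extra (ambiguous) neuron on average
print(expected_ambiguities(M, q))                 # ~0.998
print(expected_ambiguities_bernoulli(M, q, 0.5))  # same value for alpha = 1/2
print(sum(p_lambda(k, M, q) for k in range(M)))   # the pmf of Eq. (2) sums to ~1.0
```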


TABLE I
REFERENCE DESIGN PARAMETERS

V. CIRCUIT IMPLEMENTATION

A top-level block diagram of the implementation of SCN-CAM is shown in Fig. 3. It shows how the M-bit SCN-based classifier is connected to a custom-designed CAM array shown in Fig. 9, where an example pertaining to the operation of a 4-bit CAM is demonstrated. A 10-transistor (10T) NOR-type CAM cell with a NOR-type ML architecture was used. The conventional NAND- and NOR-type CAM architectures were also implemented for comparison purposes.

In order to implement a circuit that demonstrates the benefit of the proposed algorithm, a set of design points was selected among 15 different parameter sets with the common goal of discovering the minimum energy consumption per search, while keeping the silicon-area overhead and the cycle time reasonable. The optimum design parameters depend on the speed, energy consumption, and area requirements. If the area budget is limited, smaller values of ζ are preferred, at the cost of a higher number of comparisons and thus higher energy consumption. If the energy consumption is a critical design parameter, and the budget for the silicon area is more relaxed, a balance between a large enough q and a small ζ needs to be considered. A preferred set of design choices based on experimental simulations on a 512-entry CAM is summarized in Table I, where n is the number of cells attached to a local match line (LML). Since N = 128, the last segment of the LML is considered to include eight cells attached to it.

In SCN-CAM, we use the NOR-type CAM structure in order to take advantage of its better noise margin and lower latency compared with the NAND-type counterpart. We then show how it consumes lower energy per search than a conventional NAND-type architecture, one of the low-energy CAM architectures.

A. SCN-CAM: Architecture of SCN-Based Classifier

The SCN-based classifier in the SCN-CAM architecture generates the compare-enable signal(s) for the CAM sub-blocks attached to it. The architecture of the SCN-based classifier is shown in Fig. 7. It consists of c κ-to-l one-hot decoders, c SRAM modules of size l × M each, M c-input AND gates, M/ζ ζ-input OR gates, and M/ζ 2-input NAND gates. Each row of an SRAM module stores the connections from one tag to its corresponding output neuron. Each reduced-length tag of length q is thus divided into c subtags of κ bits each, where each subtag creates the row address of its SRAM module.

1) Training: During the training process, the SRAM modules store the connection values between the input tags and their corresponding outputs, which are later used in the decoding process. The training process of the SCN-based classifier in hardware is similar in principle to that of a conventional CAM. First, an input tag is reduced in length to q bits and segmented into c parts. Each segment is then presented to the corresponding one-hot decoder, as shown in Fig. 7, to determine the row address of the SRAM module corresponding to the segmented tag's cluster. The association between a tag and its corresponding output is written into the SRAM module by accessing one row per SRAM module (i.e., one neuron per cluster in PI) and writing a 1 into the ith bit of the SRAM row, corresponding to the ith neuron in PII. This process takes a single clock cycle per entry per SRAM module, M cycles in total if parallel writing in the SRAM modules is possible, and M × c cycles otherwise.

2) Decoding: The decoding process shown in (1) is implemented using the structure of the SRAM modules and the c-input AND gates. The input tag is first reduced in length to q bits and segmented into c equal-length parts. Each segment is presented to its corresponding one-hot decoder, as shown in Fig. 7, which determines which row of the SRAM is to be accessed. The location of this row corresponds to the index of an activated neuron in PI. In order to read the stored values in the SRAM, a sense amplifier is not required, since the bit lines are attached to only a few cells. Furthermore, a cell storing a 1 is the point of interest because it determines an active connection. Therefore, in order to reduce the switching activity of the successive logical stages, the complementary bit line (BL') is used to read the values of the SRAM cells, since it is pulled down during the read when a 1 is read. In each SRAM module, the accessed row is the only row that can contain the information leading to the activation of a neuron in PII, which inherently eliminates unnecessary w_(i,j)(i') ∧ v_(i,j) operations between the connection values and the neural values in (1). In other words, since there are never any ambiguities in PI, instead of calculating the neural values in PII by computing all of the possible logical AND and OR operations in (1), only those connections coming from the activated neurons in PI are used. This simplification is made possible by integrating the one-hot decoders and the SRAM modules in the configuration shown in Fig. 7. Since there exist c × l SRAM modules in the SCN, the learning process, or dynamically updating the SCN-based classifier's connections, only takes l clock cycles and is not dependent on M. Therefore, the delay required by the SCN-based classifier to update its connections is far less than what would be required to access and retrieve values from the processor that keeps track of the tags.

Fig. 7. Simplified schematic of the SCN-based classifier generating compare-enable signals for the CAM array.

3) Updating: Updating the network with a new entry follows the approach explained in Section IV-A2. First, to delete a previously trained entry, one row per SRAM block, corresponding to the entry that is being deleted, is read into registers. Then, the ith bit in each read row is converted to a 0 to remove the connections from the ith neuron in PII to the corresponding neuron in PI. Finally, the modified row is written back into the SRAM block. The deletion process thus takes two clock cycles. To update an entry after deletion, the new entry is added to the network using the same approach as in the training process.

B. SCN-CAM: CAM Architecture

In order to exploit the prominent feature of the SCN-based associative memory in the classification of the search data, a conventional CAM array is divided into a sufficient number of compare-enabled sub-blocks such that: 1) the number of sub-blocks is not so large as to expand the layout and complicate the interconnections and 2) the number of sub-blocks is not so small that the energy-saving opportunity offered by the SCN-based classifier cannot be exploited. Consequently, the neurons in PII are grouped and ORed, as shown in Fig. 7, to construct the compare-enable signal(s) for the CAM array. Even conventional CAM arrays need to be divided into multiple sub-blocks, since long bit lines and SLs can slow down the read, write, and search operations due to the presence of drain, gate, and wire capacitances.

The number of sub-blocks, β, is equal to M/ζ, where M is the total number of entries of the CAM and ζ is the number of CAM rows per sub-block in the hierarchical arrangement shown in Fig. 9, with schematic details in Fig. 10. The number of cells attached to an LML is denoted by n. The number of compare-enabled sub-blocks (ψ) can be estimated by multiplying the probability that a sub-block is enabled by the total number of sub-blocks

\psi = \left(1 - \left(1 - \frac{1}{\beta}\right)^{1 + E(\lambda)}\right) \cdot \beta.    (7)
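A one-line evaluation of (7) shows how quickly ψ approaches a single enabled sub-block as q grows. The numeric values below (M = 512, ζ = 32, and the swept q values) are our own illustrative choices, not figures taken from Table I.

```python
def expected_enabled_subblocks(M, q, zeta):
    # Eq. (7): psi = (1 - (1 - 1/beta)**(1 + E(lambda))) * beta, with beta = M / zeta
    beta = M / zeta
    e_lambda = (M - 1) / 2 ** q          # E(lambda) from Eq. (3)
    return (1 - (1 - 1 / beta) ** (1 + e_lambda)) * beta

for q in (6, 9, 12):                     # illustrative sweep for M = 512, zeta = 32
    print(q, round(expected_enabled_subblocks(512, q, 32), 2))
# psi falls toward one enabled sub-block as q increases, consistent with Fig. 8.
```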
Fig. 8. Number of activated sub-blocks in the CAM (ψ) versus the number of bits in the reduced-length tag (q).

Fig. 8 shows the number of activated sub-blocks for various values of M, while sweeping q. It shows how it is possible to reduce β to only one sub-block by increasing the value of q sufficiently, depending on the number of entries in the CAM array. Each sub-block has pass-gate devices attached to the SLs, which are controlled by the compare-enable signals from the SCN-based classifier. Furthermore, the precharging process of the MLs is also controlled by the SCN-based classifier. This way, the dynamic energy consumption due to both the charging of the SLs and the MLs can be controlled using the same compare-enable signal(s) generated by the SCN-based classifier. Since very few sub-blocks are activated using SCN-CAM, it is possible to exploit the low-latency feature of the NOR-type architecture instead of the NAND-type counterpart for the CAM part of SCN-CAM. Because the energy-saving opportunity is achieved through an architectural modification of the search procedure, SCN-CAM still significantly reduces the energy consumption compared with that of the conventional NAND-type architecture while taking advantage of the high-speed feature of NOR-type CAMs. For ultralow-power applications, a NAND-type CAM cell may also be used, at the cost of speed.


Fig. 9. Simplified array organization of the proposed CAM architecture showing an example when N = 4, the search data word is 0110, and En0 = 1. The sub-block compare-enable signals are generated by the SCN-based classifier.

Fig. 10. Simplified schematic view of the proposed CAM array. The compare-enable signals for the CAM sub-blocks are generated by the SCN-based classifier.

The total number of sub-blocks can be selected depending on the silicon-area availability, since each sub-block slightly increases the silicon area. If the input data word is not uniformly distributed, more sub-blocks will be activated during a search, consuming higher amounts of energy, while the accuracy of the final output is not affected (Fig. 14). Therefore, a false-negative output is never generated. However, since the full length of the tag is not used in SCN-CAM, it is possible to select the reduced-length tag bits depending on the application and according to a pattern that reduces the tag correlation.

Fig. 11. Estimated number of required transistors for SCN-CAM designed for 128 × 40 and 512 × 40 CAMs, compared with conventional low-power NAND-type CAMs of the same size, for various truncated tag bits (q).

VI. CIRCUIT EVALUATION

A complete circuit for SCN-CAM was implemented and simulated using HSPICE and TSMC 65-nm CMOS technology according to the parameters of Table I, including full dimensions of the CAM arrays, SRAM arrays, and logical gates, and parasitics extracted from the wires in the physical layout.

A wave-pipelining approach has been followed for the clk1 and clk2 signals in Fig. 7 to integrate the operation of the SCN-based classifier and the CAM sub-blocks. This approach is verified in terms of reliability and latency under worst-case process variations [slow–slow (SS) corner for latency and fast–fast corner for reliability]. Other methods, such as registered pipelining [35], are also possible, where M/ζ − 1 registers are placed after the OR gates in Fig. 7. This way, the frequency of operation is determined by taking the minimum reliable frequency between clk1 and clk2.

A. Energy Consumption Model

To investigate the energy consumption of SCN-CAM for various design parameters, such as q and c, and the effect of nonuniform input distributions, the energy consumption of SCN-CAM can be modeled as

E_{Total} = E_{SCN} + E_{CAM}
E_{SCN} = c \cdot E_{Dec} + M \cdot c \cdot E_{SRAMacc} + (l-1) \cdot M \cdot c \cdot E_{SRAMidle} + E_{\lambda} \cdot E_{AND_c} + \psi \cdot E_{OR_\zeta}
E_{CAM} = (\psi \cdot \zeta - 1) \cdot N \cdot E_{CAMmismatch} + N \cdot E_{CAMmatch} + (M/\zeta - \psi) \cdot \zeta \cdot N \cdot E_{CAMstat}    (8)

where the total energy consumption is divided into the energy consumption in the SCN-based classifier (E_SCN) and in the CAM sub-blocks (E_CAM). The SCN-based classifier's contribution to the energy consumption includes the decoders, E_Dec, the SRAMs (accessed, E_SRAMacc, and idle, E_SRAMidle), and the logical gates that perform the GD and generate the compare-enable signals for the CAM array. For every search operation, one row in each SRAM is accessed, and the rest of the rows are thus in the idle state, in which there is switching activity in their bit lines. Therefore, we have divided the energy consumption of the SRAMs into the corresponding states. Furthermore, the CAM portion of the energy model consists of match (E_CAMmatch) and mismatch (E_CAMmismatch) portions, whose values depend on the number of ambiguities discussed in Section IV-B. The static energy consumption of the idle CAMs (E_CAMstat) has also been included due to the presence of leakage current in advanced technologies. The estimation of the number of required transistors follows a similar model.
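To make the bookkeeping in (8) concrete, the sketch below sums the same terms with placeholder per-component energies. All numeric values are invented for illustration; only the structure of the expression follows the model.

```python
def total_energy(M, N, l, c, zeta, psi, e_lambda,
                 E_dec, E_sram_acc, E_sram_idle, E_and, E_or,
                 E_mismatch, E_match, E_static):
    # Eq. (8): energy split between the SCN-based classifier and the CAM sub-blocks.
    E_scn = (c * E_dec + M * c * E_sram_acc + (l - 1) * M * c * E_sram_idle
             + e_lambda * E_and + psi * E_or)
    E_cam = ((psi * zeta - 1) * N * E_mismatch + N * E_match
             + (M / zeta - psi) * zeta * N * E_static)
    return E_scn + E_cam

# Placeholder femtojoule-scale numbers, only to exercise the formula.
print(total_energy(M=512, N=128, l=8, c=3, zeta=32, psi=1.1, e_lambda=1.0,
                   E_dec=5.0, E_sram_acc=2.0, E_sram_idle=0.01, E_and=0.5, E_or=0.5,
                   E_mismatch=1.0, E_match=1.2, E_static=0.005))
```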


Fig. 12. Total estimated dynamic energy consumption per search for SCN-CAM, compared with the conventional low-power NAND-type CAM design, for various values of the reduced-length tag (q).

Fig. 13. Estimated energy consumption per bit per search of the proposed NOR architecture and the conventional NAND-type CAM for various word lengths (N).

Fig. 14. Total estimated energy consumption per search and match for SCN-CAM, compared with the conventional low-power NAND-type CAM design, for various truncated tag bits (q).

Fig. 15. Simulation results for SCN-CAM based on the reference design parameters in Table I.

B. Area Estimation and Simulation Results

Fig. 11 shows the estimated overhead of the number of transistors in SCN-CAM for various numbers of CAM entries in comparison with the conventional design. For the design selections in Table I, this overhead is only 3.4% compared with that of the conventional CAM. The silicon area of the SCN-based classifier can be estimated considering the area of the decoders, the SRAM arrays, the precharge devices, the read and write circuits, the interconnections, and the standard cells. Similarly, the area of the CAM array can be estimated by considering the gaps between the CAM sub-blocks, the pass-gate transistors, and the read and write circuits. The area overhead is estimated to be 10.1% higher than that of the conventional CAM design, for the design selections in Table I.

In the simulations for measuring the energy consumption and the cycle time (per bit per search), on average half of the data bits were assumed to mismatch in the case of a word mismatch. In Fig. 12, the relationship between the dynamic energy consumption of SCN-CAM and the tag length is depicted for various numbers of CAM entries in comparison with the conventional CAMs. The estimated energy consumption is obtained based on (6) and the values for energy consumption extracted using HSPICE simulations.


TABLE II
R ESULT C OMPARISONS

As the value of q is increased, the energy consumption is decreased as well, since the number of comparisons is reduced, but only up to the point at which the energy consumption of the SCN-based classifier itself would dominate that of the CAM array. The energy consumption of the SCN-based classifier is not dependent on the original tag length, but rather on the number of entries in the CAM array.

Fig. 13 shows the effect of the word length on the energy consumption in comparison with the conventional design. The original tag length (N) does not change the architecture or the energy consumption of the SCN-based classifier. Furthermore, due to the small size of the sub-blocks, the search power of SCN-CAM is much smaller than that of the conventional design. Consequently, as N is increased, the energy consumption per bit per search is decreased in SCN-CAM, while it stays constant in the conventional CAM. This also implies an advantage of SCN-CAM over PB-CAMs, such as those in [16] and [17], where longer tag lengths increase the energy consumption as well as the precomputation delay, because longer tags increase the complexity of the adders and the number of comparisons. Fig. 14 shows the effect of correlation in the entries of the CAM on the energy consumption for various lengths of the reduced-length tag. The expected value of the number of sub-blocks has been calculated according to (5). The correlation effect is applied to the inputs by creating similarities within their contents. For example, a 10% correlation means that 10% of the bit values are similar in all input patterns. It is therefore observed that larger correlation costs energy, although it does not affect the performance as it does in [15]. Fig. 15 shows simulation results for measuring the cycle time of SCN-CAM for the selected design parameters shown in Table I. It shows the worst-case cycle time, where the last input of the AND/OR gates is pulled up, under the SS corner. It also shows the wave-pipelining of the clk1 and clk2 signals shown in Fig. 7; clk2 is simply a delayed version of clk1.

The cycle time is measured by the maximum reliable frequency of operation in the worst-case (SS) scenario. Table II summarizes the comparisons of the cycle time and the energy consumption between SCN-CAM and a collection of other related work, including our own implementations of the conventional NAND-type and NOR-type CAMs. The technology-scaled versions (to 65-nm CMOS) of these results are evaluated according to the method described in [18]. Unless otherwise indicated, the reported results are based on simulations. The energy consumption of SCN-CAM is 4.08%, 8.02%, 3.59%, 66.7%, 53.89%, 30.5%, and 3.69% of that of the referenced NOR-based design, the referenced NAND-based design, [1], [12], [15], [16], and [32], respectively. On the other hand, the cycle time of SCN-CAM is 132%, 31.4%, 224%, 67.2%, 169%, 48.4%, and 107% of that of the referenced NOR, the referenced NAND, [1], [12], [15], [16], and [32], respectively. Although the energy consumption of the CAM presented in [15] is small compared with most of the others, its cycle time is significantly increased (5.2×) when the search-data patterns are correlated, whereas the cycle time of SCN-CAM remains unchanged in a similar situation.

The required silicon area of SCN-CAM is estimated to be 10.1% larger than that of the conventional NAND-type counterpart, mainly due to the existence of the gaps between the SRAM blocks of the SCN-based classifier. Consequently, the silicon area can be reduced if fewer sub-blocks are used, at the cost of energy consumption.

VII. CONCLUSION

In this paper, the algorithm and the architecture of a low-power CAM are introduced. The proposed architecture (SCN-CAM) employs a novel associativity mechanism based on a recently developed family of associative memories based on SCNs. SCN-CAM is suitable for low-power applications where frequent and parallel look-up operations are required. SCN-CAM employs an SCN-based classifier, which is connected to several independently compare-enabled CAM sub-blocks, some of which are enabled once a tag is presented to the SCN-based classifier. By using independent nodes in the output part of SCN-CAM's training network, simple and fast updates can be achieved without retraining the network entirely.


With optimized lengths of the reduced-length tags, SCN-CAM eliminates most of the comparison operations given a uniform distribution of the reduced-length inputs. Depending on the application, nonuniform inputs may result in higher power consumption, but this does not affect the accuracy of the final result. In other words, a few false positives may be generated by the SCN-based classifier, which are then filtered by the enabled CAM sub-blocks. Therefore, no false negatives are ever generated.

Conventional NAND-type and NOR-type architectures were also implemented in the same process technology to compare SCN-CAM against, along with other recently developed CAM architectures. It has been estimated that, for a case-study set of design parameters, the energy consumption and the cycle time of SCN-CAM are 8.02% and 28.6% of those of the conventional NAND-type architecture, respectively, with a 10.1% area overhead. Future work includes investigating sparse compression techniques for the matrix storing the connections in order to further reduce the area overhead.

REFERENCES

[1] A. Agarwal et al., "A 128×128 b high-speed wide-and match-line content addressable memory in 32 nm CMOS," in Proc. ESSCIRC, Sep. 2011, pp. 83–86.
[2] Y.-J. Chang and M.-F. Lan, "Two new techniques integrated for energy-efficient TLB design," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 15, no. 1, pp. 13–23, Jan. 2007.
[3] H. Chao, "Next generation routers," Proc. IEEE, vol. 90, no. 9, pp. 1518–1558, Sep. 2002.
[4] N.-F. Huang, W.-E. Chen, J.-Y. Luo, and J.-M. Chen, "Design of multi-field IPv6 packet classifiers using ternary CAMs," in Proc. IEEE Global Telecommun. Conf., vol. 3, 2001, pp. 1877–1881.
[5] M. Meribout, T. Ogura, and M. Nakanishi, "On using the CAM concept for parametric curve extraction," IEEE Trans. Image Process., vol. 9, no. 12, pp. 2126–2130, Dec. 2000.
[6] M. Nakanishi and T. Ogura, "A real-time CAM-based Hough transform algorithm and its performance evaluation," in Proc. 13th Int. Conf. Pattern Recognit., vol. 2, Aug. 1996, pp. 516–521.
[7] L.-Y. Liu, J.-F. Wang, R.-J. Wang, and J.-Y. Lee, "CAM-based VLSI architectures for dynamic Huffman coding," IEEE Trans. Consum. Electron., vol. 40, no. 3, pp. 282–289, Aug. 1994.
[8] C.-C. Wang, C.-J. Cheng, T.-F. Chen, and J.-S. Wang, "An adaptively dividable dual-port BiTCAM for virus-detection processors in mobile devices," IEEE J. Solid-State Circuits, vol. 44, no. 5, pp. 1571–1581, May 2009.
[9] B. Wei, R. Tarver, J.-S. Kim, and K. Ng, "A single chip Lempel–Ziv data compressor," in Proc. IEEE ISCAS, May 1993, pp. 1953–1955.
[10] S. Panchanathan and M. Goldberg, "A content-addressable memory architecture for image coding using vector quantization," IEEE Trans. Signal Process., vol. 39, no. 9, pp. 2066–2078, Sep. 1991.
[11] T. Juan, T. Lang, and J. Navarro, "Reducing TLB power requirements," in Proc. Int. Symp. Low Power Electron. Des., Aug. 1997, pp. 196–201.
[12] Y.-J. Chang and Y.-H. Liao, "Hybrid-type CAM design for both power and performance efficiency," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 8, pp. 965–974, Aug. 2008.
[13] Z. Lei, H. Xu, D. Ikebuchi, H. Amano, T. Sunata, and M. Namiki, "Reducing instruction TLB's leakage power consumption for embedded processors," in Proc. Int. Green Comput. Conf., Aug. 2010, pp. 477–484.
[14] S.-H. Yang, Y.-J. Huang, and J.-F. Li, "A low-power ternary content addressable memory with Pai-Sigma matchlines," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 10, pp. 1909–1913, Oct. 2012.
[15] N. Onizawa, S. Matsunaga, V. C. Gaudet, and T. Hanyu, "High-throughput low-energy content-addressable memory based on self-timed overlapped search mechanism," in Proc. Int. Symp. Asynchron. Circuits Syst., May 2012, pp. 41–48.
[16] C.-S. Lin, J.-C. Chang, and B.-D. Liu, "A low-power precomputation-based fully parallel content-addressable memory," IEEE J. Solid-State Circuits, vol. 38, no. 4, pp. 654–662, Apr. 2003.
[17] S.-J. Ruan, C.-Y. Wu, and J.-Y. Hsieh, "Low power design of precomputation-based content-addressable memory," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 3, pp. 331–335, Mar. 2008.
[18] P.-T. Huang and W. Hwang, "A 65 nm 0.165 fJ/bit/search 256 × 144 TCAM macro design for IPv6 lookup tables," IEEE J. Solid-State Circuits, vol. 46, no. 2, pp. 507–519, Feb. 2011.
[19] K. Pagiamtzis and A. Sheikholeslami, "Content-addressable memory (CAM) circuits and architectures: A tutorial and survey," IEEE J. Solid-State Circuits, vol. 41, no. 3, pp. 712–727, Mar. 2006.
[20] V. Gripon and C. Berrou, "Sparse neural networks with large learning diversity," IEEE Trans. Neural Netw., vol. 22, no. 7, pp. 1087–1096, Jul. 2011.
[21] V. Gripon and C. Berrou, "Nearly-optimal associative memories based on distributed constant weight codes," in Proc. ITA Workshop, Feb. 2012, pp. 269–273.
[22] H. Jarollahi, N. Onizawa, V. Gripon, and W. J. Gross, "Architecture and implementation of an associative memory using sparse clustered networks," in Proc. IEEE ISCAS, Seoul, South Korea, May 2012, pp. 2901–2904.
[23] H. Jarollahi, N. Onizawa, V. Gripon, and W. J. Gross, "Reduced-complexity binary-weight-coded associative memories," in Proc. IEEE ICASSP, May 2013, pp. 2523–2527.
[24] H. Jarollahi, N. Onizawa, and W. J. Gross, "Selective decoding in associative memories based on sparse-clustered networks," in Proc. IEEE Global Conf. Signal Inf. Process., Dec. 2013, pp. 1270–1273.
[25] J. J. Hopfield, "Neural networks and physical systems with emergent collective computational abilities," Proc. Nat. Acad. Sci. USA, vol. 79, no. 8, pp. 2554–2558, Apr. 1982.
[26] H. Jarollahi, V. Gripon, N. Onizawa, and W. J. Gross, "A low-power content-addressable memory based on clustered-sparse networks," in Proc. 24th IEEE Int. Conf. ASAP, Jun. 2013, pp. 305–308.
[27] H. Noda et al., "A cost-efficient high-performance dynamic TCAM with pipelined hierarchical searching and shift redundancy architecture," IEEE J. Solid-State Circuits, vol. 40, no. 1, pp. 245–253, Jan. 2005.
[28] K. Pagiamtzis and A. Sheikholeslami, "Pipelined match-lines and hierarchical search-lines for low-power content-addressable memories," in Proc. IEEE Custom Integr. Circuits Conf., Sep. 2003, pp. 383–386.
[29] K. Pagiamtzis and A. Sheikholeslami, "A low-power content-addressable memory (CAM) using pipelined hierarchical search scheme," IEEE J. Solid-State Circuits, vol. 39, no. 9, pp. 1512–1519, Sep. 2004.
[30] H. Noda et al., "A 143 MHz 1.1 W 4.5 Mb dynamic TCAM with hierarchical searching and shift redundancy architecture," in Proc. IEEE ISSCC, vol. 1, Feb. 2004, pp. 208–523.
[31] C. Zukowski and S.-Y. Wang, "Use of selective precharge for low-power on the match lines of content-addressable memories," in Proc. Int. Workshop Memory Technol., Des. Test., Aug. 1997, pp. 64–68.
[32] J.-S. Wang, H.-Y. Li, C.-C. Chen, and C. Yeh, "An AND-type match-line scheme for energy-efficient content addressable memories," in IEEE ISSCC Dig. Tech. Papers, vol. 1, Feb. 2005, pp. 464–610.
[33] M. Motomura, J. Toyoura, K. Hirata, H. Ooka, H. Yamada, and T. Enomoto, "A 1.2-million transistor, 33-MHz, 20-b dictionary search processor (DISP) ULSI with a 160-kb CAM," IEEE J. Solid-State Circuits, vol. 25, no. 5, pp. 1158–1165, Oct. 1990.
[34] K. Schultz and P. Gulak, "Fully parallel integrated CAM/RAM using preclassification to enable large capacities," IEEE J. Solid-State Circuits, vol. 31, no. 5, pp. 689–699, May 1996.
[35] K. Pagiamtzis and A. Sheikholeslami, "Pipelined match-lines and hierarchical search-lines for low-power content-addressable memories," in Proc. IEEE Custom Integr. Circuits Conf., Sep. 2003, pp. 383–386.

Hooman Jarollahi (S'09) received the B.A.Sc. and M.A.Sc. degrees in electronics engineering from Simon Fraser University, Burnaby, BC, Canada, in 2008 and 2010, respectively. He is currently pursuing the Ph.D. degree with the Department of Electrical and Computer Engineering, McGill University, Montreal, QC, Canada. He was a Visiting Scholar with the Research Institute of Electrical Communication, Tohoku University, Sendai, Japan, from 2012 to 2013. His current research interests include design and hardware implementation of energy-efficient and application-specific VLSI systems, such as associative memories and content-addressable memories. Mr. Jarollahi was a recipient of the Teledyne DALSA Award in 2010, for which he presented a patented architecture of a power- and area-efficient SRAM.


Vincent Gripon received the M.S. degree from École Normale Supérieure of Cachan, Cachan, France, and the Ph.D. degree from Télécom Bretagne, Brest, France. He is a Permanent Researcher with Institut Mines-Télécom, Télécom Bretagne. His intent is to propose models of neural networks inspired by information theory principles, what could be called informational neurosciences. He is also the Co-Creator and Organizer of an online programming contest named TaupIC, which targets the top French undergraduate students. His current research interests include information theory, neuroscience, and theoretical and applied computer science.

Naoya Onizawa (M'09) received the B.E., M.E., and D.E. degrees in electrical and communication engineering from Tohoku University, Sendai, Japan, in 2004, 2006, and 2009, respectively. He was a Post-Doctoral Fellow with Tohoku University from 2009 to 2011, the University of Waterloo, Waterloo, ON, Canada, in 2011, and McGill University, Montreal, QC, Canada, from 2011 to 2013. He is currently an Assistant Professor with the Frontier Research Institute for Interdisciplinary Sciences, Tohoku University. His current research interests include energy-efficient VLSI design based on asynchronous circuits and multiple-valued circuits, and their applications, such as LDPC decoders, associative memories, and networks-on-chip. Dr. Onizawa was a recipient of the Best Paper Award at the IEEE Computer Society Annual Symposium on VLSI in 2010.

Warren J. Gross (SM'10) received the B.A.Sc. degree in electrical engineering from the University of Waterloo, Waterloo, ON, Canada, and the M.A.Sc. and Ph.D. degrees from the University of Toronto, Toronto, ON, in 1996, 1999, and 2003, respectively. He is currently an Associate Professor with the Department of Electrical and Computer Engineering, McGill University, Montreal, QC, Canada. His current research interests include the design and implementation of signal processing systems and custom computer architectures. Dr. Gross is currently the Chair of the IEEE Signal Processing Society Technical Committee on Design and Implementation of Signal Processing Systems. He served as a Technical Program Co-Chair of the IEEE Workshop on Signal Processing Systems in 2012, and as the Chair of the IEEE ICC 2012 Workshop on Emerging Data Storage Technologies. He served as an Associate Editor of the IEEE TRANSACTIONS ON SIGNAL PROCESSING. He has served on the Program Committees of the IEEE Workshop on Signal Processing Systems, the IEEE Symposium on Field-Programmable Custom Computing Machines, and the International Conference on Field-Programmable Logic and Applications, and has served as the General Chair of the 6th Annual Analog Decoding Workshop. He is a licensed Professional Engineer in the Province of Ontario.
