High-Throughput Pattern Matching With
Abstract— In this paper, we propose a novel CMOS+MOLecular (CMOL) field-programmable gate array (FPGA) circuit architecture to perform massively parallel, high-throughput computations, which is especially useful for pattern matching tasks and multidimensional associative searches. In the new architecture, patterns are stored as resistive states of emerging nonvolatile memory nanodevices, while the analyzed data are streamed via the CMOS subsystem. The main improvements over prior work offered by the proposed circuits are increased nanodevice utilization and, as a result, substantially higher throughput, which is demonstrated by a detailed analysis of the implementation of a pattern matching task on the new architecture. For example, our estimates show that the proposed CMOL FPGA circuits based on the 22-nm CMOS technology and one crossbar layer with 22-nm nanowire half-pitch allow up to 12.5% average nanodevice utilization, i.e., the fraction of the devices turned to the high conductive state, as compared to a typical ∼0.1% of the original CMOL FPGA circuits. This in turn enables throughput close to 7.1 × 10^16 bits/s/cm^2 at ∼1 fJ/bit energy efficiency, for matching of ∼10^7 250-bit patterns stored locally on a 1 cm^2 chip. This throughput is at least two orders of magnitude higher than that of other state-of-the-art FPGA methods, and begins to approach ternary content-addressable memory-like performance at similar CMOS technology nodes. More generally, we argue that the proposed concept combines the versatility of reconfigurable architectures and the density of associative memories. It can be viewed as a very tight symbiotic integration of memory and logic functions for high-performance logic-in-memory computing.

Index Terms— CMOS+MOLecular (CMOL), field-programmable gate array (FPGA), hybrid circuits, logic-in-memory computing, memristor, pattern matching, ReRAM, resistive switching, ternary content-addressable memory (TCAM).

I. INTRODUCTION

RECONFIGURABLE circuits (RCs) are very efficient for information processing tasks [1], such as image and signal processing (e.g., filtering, edge detection, coding, and feature extraction [2], [3]), and string matching (e.g., in a context of network intrusion detection [4]–[7], deoxyribonucleic acid sequence matching [8], [9], database searching [10], and network packet routing [11], [12]). The common feature in these tasks is that they can be efficiently parallelized, and that the same basic operation is performed numerous times using one set of fixed data known in advance (which are allowed to change infrequently), such as a filter template in image processing or keyword in string matching, along with streaming input data. In this respect, RCs offer massively parallel, instance-specific computation customized for the needs of the particular application, and thus potentially offer real-time processing coupled with low power consumption.

However, even contemporary RCs cannot provide enough computational power for future demands. For example, in network intrusion detection applications, deep packet information of computer networks is compared against a known sequence of data representing a computer virus or other malicious content [4], [13]. In order to provide real-time protection, the search engine should perform many comparisons in parallel, and simultaneously allow for updating of virus signatures. Earlier RC implementations were adequate to ensure a few Gbit/s/cm^2-scale sustained throughput for ∼2000 100-B long patterns [4]. The throughput could be further significantly improved by employing dynamic reconfiguration and customized hardware, including dedicated ternary content-addressable memories (TCAMs) [5], [7], [13], [14]. However, even these techniques have limited benefits, largely due to excessive reconfiguration overhead for multicontext field-programmable gate arrays (FPGAs) [1], I/O limitations for dynamic reconfiguration, and/or the rigid, inefficient structure of content-addressable memories (CAMs), and thus may be insufficient for future needs. The deployment of faster 100-Gbit/s-scale data networks, as well as the continued increase in the number of patterns (e.g., the number of known viruses), makes real-time protection impossible even for the most advanced circuit implementations with CMOS technology.

The performance of RCs can be greatly improved using hybrid CMOS/nanoelectronic circuits [15]–[18]. One such example is CMOS+MOLecular (CMOL) FPGA [17], [19]–[23], where CMOL stands for CMOS+MOLecular-scale hybrid circuit, which was conceived to seize the density advantages of emerging technologies, such as nanoimprint lithography and monolithically integrated self-assembled nanodevices,

Manuscript received September 7, 2017; revised January 16, 2018; accepted February 17, 2018. This work was supported in part by RICARDO through NSF under Grant 1730309, in part by NSF under Grant 1563935, in part by CCF through NSF under Grant 1740352, and in part by AFOSR MURI under Grant FA9550-12-1-0038. (Corresponding author: Dmitri B. Strukov.)
A. Madhavan and D. B. Strukov are with the Department of Electrical and Computer Engineering, University of California, Santa Barbara, CA 93106 USA (e-mail: [email protected]; [email protected]).
T. Sherwood is with the Department of Computer Science, University of California, Santa Barbara, CA 93106 USA (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TVLSI.2018.2809644
1063-8210 © 2018 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
Fig. 2. Pattern matching with CAMs. (a) General idea. (b) Example of one row of SRAM-based CAM memory implemented in OR style [14].
(c) SRAM-based TCAM memory cell [14]. (d)–(i) TCAM cells based on nonconventional memory technologies. (d)-(g) TCAM cells based on flash
memory [48], hybrid CMOS/MRAM [49], CMOS/STT-RAM [50], and CMOS/memristors [51] technologies, respectively. (i) Memristor-based implementation.
Fig. 4. (a) Cartoon of CMOL circuits. (b) Functional partitioning of the CMOL circuit for pattern matching applications by the area-distributed interface (red and blue arrows), between the CMOS layer at the bottom and passive nanowire crossbar on top.

the sneak-path is naturally cut by suppressing reverse current for the devices with very asymmetric I–Vs with $V_T^{(+)} < V_R < V_T^{(-)}$.

For simplicity, in our analysis, we assume bipolar memristors described by an asymmetric idealized I–V curve with a precisely defined write voltage $V_W$, i.e., with no device-to-device and cycle-to-cycle variations.

C. CMOL Structure

In CMOL structures one [17], [25] or several [64]–[66] crossbar layers are vertically integrated on top of conventional CMOS circuits [Fig. 4(a)]. One of the key characteristics of the CMOL architecture is the ability to access (read or write) every crosspoint nanodevice from much sparser CMOS circuitry without sacrificing crossbar integration density. The crosspoint memristors can be programmed to either high or low resistive states, and together with the CMOS layer create a reconfigurable fabric that can perform information processing (i.e., pattern matching for the considered applications) and interconnect duties.

In particular, the CMOS layer is arranged as an array of “atomic” CMOS cells [Figs. 4(b) and 5], which are connected to the nanoscale crossbar circuit via an area-distributed interface. Each cell houses CMOS circuits that provide unique access to each of the cell’s two vias from the cell array periphery, and also CMOS circuitry specific to the implemented CMOL circuit. The nanowire crossbar is rotated with respect to the array of atomic cells underneath, and provides high fan-in and fan-out connectivity between them. For example, each (output) blue via connects to a certain quasi-horizontal nanowire, which in turn connects to multiple quasi-vertical nanowires through crosspoint devices. These quasi-vertical wires each connect to (input) red vias of other surrounding cells. The rotation of the crossbar naturally breaks nanowires into segments, and as a result, every nanowire segment is connected by only one via (Fig. 5). (Note that more readily manufacturable CMOL structures, with nanowires running strictly in vertical and horizontal directions, are also possible. For example, the effective rotation of the nanowire array can be implemented by adjusting positions of the cells’ vias [23] or by using zig-zag shaped nanowires [65].)

Selection of any crosspoint device, to perform a read or write operation, is implemented with a double-decoding scheme. The first level of decoding, implemented by the peripheral CMOS decoders and pass gates/transistors of the atomic CMOS cells, is used to select a pair of vias, one blue and one red, which connect to the corresponding mutually perpendicular nanowire segments that lead to the crosspoint device in question. The second level of decoding is implemented with a half-biasing approach, which utilizes memristor nonlinearities in switching kinetics and electron transport to enable unique access to the specific crosspoint device.

The angle of the crossbar rotation $\alpha$ depends upon several parameters such as cell complexity, CMOS process, and pitch of the nanowires. Specifically, assuming that $F_{\text{nano}}$ and $F_{\text{CMOS}}$ are the minimum half-pitch of the nanowire crossbar array and the feature size of CMOS circuitry, respectively, and that the side length of one atomic CMOS cell is $2\beta F_{\text{CMOS}}$, where $\beta$ is a parameter representing the cell’s size, the CMOL topology is described by the set of equations [17]

$$\tan(\alpha) = \frac{1}{r}, \qquad \sqrt{r^2 + 1} = \frac{\beta F_{\text{CMOS}}}{F_{\text{nano}}}. \tag{1}$$

It is also very convenient to characterize the CMOL architecture with the parameter $M_A$

$$M_A = \frac{L - 2F_{\text{nano}}}{2F_{\text{nano}}} - 1 = r^2 - 1 \tag{2}$$

that defines the number of atomic cells connected to one atomic cell and is equal to the number of crossings (memristors) on one nanowire segment. For example, Fig. 7(a) shows a CMOL structure for r = 6 with its connectivity domain highlighted, and, in particular, shows that the given atomic cell can be connected to any of the other $M_A$ = 36 atomic cells, including connection to itself, in its connectivity domain via the crossbar structure.

For a fixed complexity CMOS cell in a certain process, lowering the pitch of the nanowire allows higher density by increasing the relative angle of the crossbar with respect to the CMOS vias. This also explains how, while keeping nanowire half-pitch constant, maximum crossbar density can be preserved irrespective of the size of the atomic cell, while only affecting its connectivity. In addition, it is worth mentioning that the CMOL segmented crossbar structure is not only good for high fan-in fan-out computation, but also has
Fig. 8. General idea for pattern matching with 1-D streaming data. Here, we assume that adjacent blocks of pipeline data are shifted by one bit position and the block length matches the pattern length.

Fig. 11. General idea for mapping image processing tasks to the proposed CMOL FPGA architecture.

i.e., width of the unit cell array, so that the total number of patterns which can be matched by an array is roughly half of the total number of unit cells in the array.

B. 2-D Pattern Matching

The main difference for mapping of image processing tasks is that both streaming data and patterns are 2-D. For example, in automatic target recognition (ATR) systems a 128 × 128 array of 8-bit pixels is searched for potential targets which are described by 16 × 16 binary pixel templates [36]. The bottleneck operation in the ATR algorithm is the 1-bit correlation between the input image and a template, which has to be done for all possible relative offsets and for a rather large number of templates. (It should be noted that contemporary ATR systems work with much larger input and template image sizes and higher data rate, e.g., required for hyperspectral image processing [69].)

Naturally, a correlation operation which produces a multibit output value cannot be done with a single unit cell in digital CMOL FPGA circuits. On the other hand, a combined operation of correlation with thresholding, which is effectively an approximate pattern matching, is straightforward if the unit cell is configured to implement a linear threshold gate. Such an operation might be sufficient for eliminating bottleneck processing in ATR and other related image processing tasks.

Similar to the previously considered mapping scheme, the maximum utilization is achieved with a balanced number of white and green cells (Fig. 11). Let us assume that the connectivity domain is large enough that each unit cell performs matching between streaming data of the input image and the whole template. (The proposed scheme can be extended to the case when the domain is smaller, by performing matching operations for the parts of the template instead.) Let us also assume that a K × K 2-D image is pushed through a pipeline formed by green cells, e.g., from left to right, by one unit cell position each cycle. Performing correlation for one template, for all possible vertical offsets in one cycle, requires programming a column of K white cells to perform matching with the same template. Evaluation of all horizontal offsets would just require K cycles to push data (from left to right) past a particular column of white cells. The next column

V. PERFORMANCE MODELING

We have modeled the performance of a generic 1-D pattern matching task and found the optimal parameters to maximize the possible pattern matching throughput per unit area. Before discussing the details of an optimization procedure, let us first outline some common assumptions for the devices, diode–resistor logic and its performance modeling while focusing on the more promising dynamic logic counterpart of this architecture.

A. Nanowires

The resistance of a $2F_{\text{nano}}$-long nanowire segment is approximated using the Matthiessen formula [70], which accounts for the increase in the resistivity $\rho$ due to surface scattering effects in nanoscale wires, i.e.,

$$R_{\text{wire}} \approx \rho\,\frac{2F_{\text{nano}}}{A\,(F_{\text{nano}})^2} \approx \frac{2\rho_{\text{bulk}}}{A\,F_{\text{nano}}}\left(1 + \frac{\lambda}{F_{\text{nano}}}\right). \tag{4}$$

Here, $A$ is the relative thickness of the nanowires with respect to its width (i.e., the cross-sectional aspect ratio), while $\lambda$ is the mean free path of the electrons.

The capacitance of a $2F_{\text{nano}}$-long nanowire segment is approximated analytically by using the equation

$$C_{\text{wire}} \approx \varepsilon_0\varepsilon_1\frac{F_{\text{nano}}^2}{2d} + \varepsilon_0\varepsilon_2\frac{F_{\text{nano}}^2}{2d} + A\,\varepsilon_0\varepsilon_2\,4F_{\text{nano}} + A\,\varepsilon_0\varepsilon_2\,\frac{4F_{\text{nano}}}{\log\left(\frac{F_{\text{nano}}}{d} + 10\right)} \tag{5}$$

which was verified using COMSOL simulations. Here, $d$ is the thickness of the thin film, i.e., the distance between two mutually perpendicular sets of crossbar nanowires, $\varepsilon_0$ is vacuum permittivity, $\varepsilon_1$ and $\varepsilon_2$ are dielectric constants of the nanodevices and surrounding insulator, respectively, and the constant 10 was determined by fitting (5) to the numerical simulations. Note that on the right-hand side of (5), the first two terms crudely correspond to parallel plate capacitance, the third term is the interlayer side wall capacitance, and the last term is the side wall capacitance between crossbar nanowires in the same layer.

In our performance analysis, we assume copper crossbar nanowires with $A = 0.1$ and $\rho_{\text{bulk}} = 1.7 \times 10^{-8}\ \Omega\cdot\text{m}$. Using $\lambda = 40$ nm, (4) yields an accurate approximation for both grain and surface scattering as reported in the International Technology Roadmap for Semiconductors (ITRS) [71]. For capacitance estimates, we assume $\varepsilon_1 = 3.9$, $\varepsilon_2 = 2.5$, and $d = 5$ nm, which is representative of SiO2 memristive devices.
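
To make the resistance estimate concrete, the following sketch evaluates the right-hand side of (4) with the parameter values assumed above (A = 0.1, bulk resistivity 1.7 × 10^-8 Ω·m, mean free path 40 nm). The function name `wire_resistance` and the choice of 22 nm and 11 nm as example half-pitches are illustrative assumptions.

```python
def wire_resistance(f_nano: float,
                    aspect: float = 0.1,
                    rho_bulk: float = 1.7e-8,
                    mean_free_path: float = 40e-9) -> float:
    """Resistance (ohms) of a 2*f_nano-long nanowire segment, per Eq. (4).

    f_nano          nanowire half-pitch in meters
    aspect          thickness-to-width ratio A of the nanowire
    rho_bulk        bulk resistivity in ohm-meters
    mean_free_path  electron mean free path (lambda) in meters
    """
    return (2.0 * rho_bulk / (aspect * f_nano)) * (1.0 + mean_free_path / f_nano)


r_22nm = wire_resistance(22e-9)  # example half-pitch of 22 nm
r_11nm = wire_resistance(11e-9)  # example half-pitch of 11 nm
```

With these values a segment at 22-nm half-pitch comes out at roughly 44 Ω, and the resistance grows faster than 1/F_nano as the half-pitch shrinks because of the surface-scattering factor.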
B. Diode–Resistor Logic

Let us assume that at least a fraction $\gamma$ of the voltage applied from the CMOS cell is dropped across the nanodevice and at most a fraction $1 - \gamma$ is dropped on the nanowire and pass gate connecting outputs of the DFF and the corresponding vias. This can be satisfied by choosing $R_{\text{ON}}$ and $R_{\text{pass}}$ according to

$$R_{\text{ON}} \ge \frac{(1 - \chi)\,\gamma\,M\,(M R_{\text{wire}} + R_{\text{pass}})}{1 - \gamma} \tag{6}$$

where the factor $(1 - \chi)M$ is effectively the maximum fan out for each output of the DFF cell ( 3), i.e., the largest permitted fraction of the devices in the ON state on each of the two output nanowires.

For dynamic logic operation, the slowest (worst case) mismatch operation corresponds to charging via a single nanodevice in the ON state. For the considered asymmetric I–V characteristics, this charging time will always be faster than that of the match operation, and the worst case voltage margins are given by

$$\Delta V \approx \frac{V_R}{1 + 2M R_{\text{ON}}/R_{\text{OFF}}}. \tag{7}$$

The safe margins can in principle be calculated from noise and CMOS variations analysis [19]. Equation (7), however, shows that such analysis can be simplified by selecting large enough $R_{\text{OFF}}$ so that the margins are comparable with $V_R$. For example, requiring $R_{\text{OFF}}/R_{\text{ON}} > 2M$, which is a very reasonable assumption as we show below, results in $\Delta V \approx V_R/2$, and provides justification for neglecting leakages via OFF state devices in (6), as well as in delay and power estimates.

In our simulations, we assume $V_R = 1$ V and $\gamma = 0.9$.

$F_{\text{CMOS}}$ node. Note that using $R_{\text{pass}} = (R_{\text{pass}})_{\max}$ in (9) assumes that the area for the minimum-size transistor is $25F_{\text{CMOS}}^2$.

For the dynamic logic, the delay is estimated as

$$\tau \approx 2\,(2M C_{\text{wire}} + C_{\text{gate}})\,R_{\text{ON}} \tag{10}$$

where $2M C_{\text{wire}}$ is the capacitance of two nanowire segments, while $C_{\text{gate}}$ is the total capacitance of CMOS circuitry at the input of the DFF, including its gate capacitance and drain capacitances of the configuration and pull-down pass gates. The additional factor of 2 accounts for both precharging and evaluation phases, which is a rather conservative assumption given that precharging currents are not limited by the $R_{\text{ON}}$ value. Note that for all studied parameters $C_{\text{gate}}$ is typically much smaller than $2M C_{\text{wire}}$.

The average power per unit cell is dominated by the dynamic component, for which the upper bound is evaluated as

$$P_{\text{cell}} \approx (2 \times 2M C_{\text{wire}} + C_{\text{gate}})\,V_R^2/\tau. \tag{11}$$

Equation (11) implies an activity factor of 1 for the DFF’s input and output nanowires and input CMOS circuitry, i.e., their charging and discharging within a single clock cycle. This is quite a pessimistic assumption given that matching events can be assumed to be rare on average and hence outputs of all cells configured to perform pattern matching will not change. Still, as we show later, the total power, which should be less than the maximum allowable power density $p_{\max}$, i.e.,

$$P_{\text{cell}} \le p_{\max} A_{\text{cell}} \tag{12}$$
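
The relations (6), (7), (10), and (11) can be collected into a small calculator, sketched below. The values V_R = 1 V and γ = 0.9 are the ones assumed in the text, and χ = 0.5 is the value quoted in the Fig. 13 caption. The function names and every number in the example (M, R_wire, R_pass, C_wire, C_gate) are hypothetical placeholders chosen only to exercise the formulas, not results of the analysis.

```python
def min_r_on(m, r_wire, r_pass, chi=0.5, gamma=0.9):
    """Smallest R_ON (ohms) satisfying the voltage-division constraint of Eq. (6)."""
    return (1 - chi) * gamma * m * (m * r_wire + r_pass) / (1 - gamma)


def voltage_margin(m, r_on, r_off, v_r=1.0):
    """Worst-case voltage margin (volts) of Eq. (7)."""
    return v_r / (1 + 2 * m * r_on / r_off)


def delay(m, c_wire, c_gate, r_on):
    """Dynamic-logic delay (seconds) of Eq. (10), covering precharge and evaluation."""
    return 2 * (2 * m * c_wire + c_gate) * r_on


def cell_power(m, c_wire, c_gate, tau, v_r=1.0):
    """Upper bound on average dynamic power per unit cell (watts), Eq. (11)."""
    return (2 * 2 * m * c_wire + c_gate) * v_r ** 2 / tau


# Hypothetical example values, for illustration only.
M = 250          # crossings per nanowire segment
R_WIRE = 44.0    # ohms per 2*F_nano-long segment
R_PASS = 1.0e3   # ohms, pass-gate resistance
C_WIRE = 2e-18   # farads per 2*F_nano-long segment
C_GATE = 1e-16   # farads at the DFF input

r_on = min_r_on(M, R_WIRE, R_PASS)
r_off = 2 * M * r_on                  # boundary case R_OFF / R_ON = 2M
margin = voltage_margin(M, r_on, r_off)
tau = delay(M, C_WIRE, C_GATE, r_on)
p_cell = cell_power(M, C_WIRE, C_GATE, tau)
```

At the boundary R_OFF/R_ON = 2M the margin function returns V_R/2, matching the remark that follows (7). The product of `p_cell` and `tau` is the switched energy per cycle, (4M·C_wire + C_gate)·V_R^2, which is independent of R_ON.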
Fig. 13. Simulation results for the optimal value of Rpass , χ = 0.5, and specific values of FCMOS and Fnano [73]. (a) Connectivity factor of a unit cell.
(b) Optimal value of RON determined as a minimum resistance satisfying two constraints: proper operation of diode–resistor logic (6) and power density
budget (12). (c) Capacitance of a segment and the whole wire. (d) Energy per bit (which is the same for all values of FCMOS ). (e) Diode–resistor logic delay
(clock cycle time). (f) Aggregate pattern matching throughput per unit area. Note that for the shown case, RON is always determined by (6).
TABLE I
PERFORMANCE OF VARIOUS PATTERN MATCHING ARCHITECTURES. (# The numbers for both TCAM and FPGA implementations are rather optimistic. Only area of memory cells is taken into account for the former, while FPGA numbers are estimated based on the reported logic utilization.)
elements in existing hardware-based pattern matchers is limited by the 2-D chip area, they must be dynamically reconfigured to accommodate additional patterns that are beyond their storage capabilities and rely on OFF-chip storage. Dynamic reconfiguration is rather slow and very energy inefficient, thus we expect that the throughput for a fixed area for higher capacity pattern matching tasks will be considerably smaller than the ideal value. On the other hand, it should be possible to support larger bit capacity in the proposed circuits by integrating more crossbar layers, e.g., similar to 3-D CMOL circuits [64], [65], without a large penalty in throughput.

Even though our performance analysis is somewhat simplified, we believe that we accounted for the most important factors. For example, CMOS process variations, critical for diode–resistor logic operation, can be effectively dealt with by appropriate scaling of the CMOS transistors. Its additional overhead, as well as additional area due to bulky programming circuitry which might be required for memristors with large write voltages [74], should not change our simulation results much because of the already large pass gates/drive circuit transistors for the optimal cases. Though the dynamic logic is susceptible to capacitive coupling noise, it should be possible
to minimize its effect by balancing signal transitions. For example, the CMOL topology and application mapping can be further optimized to avoid Miller effects. Based on our experience with CMOL design [75] and the quite large area of CMOL cells in this paper, the area overhead of peripheral decoders and clock distribution network should also be insignificant. As is evident from the choice of the optimal values of RON [Fig. 13(b)], the dynamic power consumption of our circuits is always well below 200 W/cm^2. Our estimates show that the neglected dynamic power of the clock distribution network (especially considering relatively slow cycle times at optimal Fnano = FCMOS) and static power are also always limited to the sub-watt range.

Fig. 14. Maximum throughput as function of mapping scenario for several values of FCMOS (similar to the ones used in Fig. 13).

Perhaps the most critical challenge toward practical realization of the proposed circuits is that fabrication technology for the memristive devices, especially for their passive back-end-of-line integration, is in need of improvement. A particular concern is memristor nonidealities, such as current fluctuations due to drift in the memristor state and random telegraph noise, and variations in the switching threshold voltages. One possible strategy to deal with these issues is to identify defective devices during the test stage and avoid them during application mapping [19], [76]. On the other hand, variations in the ON and OFF resistance states would be less problematic due to, e.g., the possibility of fine-tuning memristor conductances using a simple tuning algorithm setup. Also, though the cycling endurance for many memristors is generally much lower than that of volatile memories, it should still be adequate for many FPGA applications, given that the memristor states are only switched during the reconfiguration stage and remain unchanged during logic operation.

In summary, in this paper, we proposed new CMOL FPGA circuits for high-throughput computation. The performance advantage of the novel circuits is mainly due to the very high density of nanoscale devices and the very tight and synergetic integration of memory and logic functions. The tight integration is enabled by the high communication bandwidth of the area-distributed interface between the nano and CMOS subsystems, while the synergy is due to flexible resource allocation that allows nanodevices to be used either as a TCAM cell or to implement programmable logic/interconnect. Though CMOL circuits are essential for getting high bandwidth between memory and logic subsystems, other stacking schemes with area-distributed connectivity, such as through-silicon via technology, and different memory devices, such as flash memory, might be suitable for the proposed concept. Understanding the prospects of

ACKNOWLEDGMENT

The authors would like to thank R. Brayton, A. Mishchenko, and K. K. Likharev for their useful discussions.

CONFLICTS OF INTEREST

The presented work is built upon our previous results reported in [30]–[32]. The material from these papers is included in Sections III and IV-A. However, Sections IV-B, V, and VI elaborate on the new results, which were never published before.

In particular, in this paper we present the following, for the first time.
1) A performance analysis of the proposed circuit. Our estimates also account for the sizing of CMOS circuits, which was generally neglected in all previous CMOL FPGA works but, as we show in this paper, is crucial for providing correct functionality.
2) An optimization procedure that considers architectural, topological and circuit level constraints to maximize the throughput of the proposed circuits.
3) An additional application case study.

REFERENCES

[1] S. Hauck and A. DeHon, Reconfigurable Computing: The Theory and Practice of FPGA-Based Computation, vol. 1. San Mateo, CA, USA: Morgan Kaufmann, 2010.
[2] Q. Gu, T. Takaki, and I. Ishii, “Fast FPGA-based multiobject feature extraction,” IEEE Trans. Circuits Syst. Video Technol., vol. 23, no. 1, pp. 30–45, Jan. 2013.
[3] C. Gentsos, C.-L. Sotiropoulou, S. Nikolaidis, and N. Vassiliadis, “Real-time canny edge detection parallel implementation for FPGAs,” in Proc. ICECS, Dec. 2010, pp. 499–502.
[4] Y. H. Cho and W. H. Mangione-Smith, “Deep network packet filter design for reconfigurable devices,” ACM Trans. Embedded Comput. Syst., vol. 7, no. 2, Feb. 2008, Art. no. 21.
[5] L. Tan and T. Sherwood, “Architectures for bit-split string scanning in intrusion detection,” IEEE Micro, vol. 26, no. 1, pp. 110–117, Jan. 2006.
[6] T. N. Thinh, T. T. Hieu, V. Q. Dung, and S. Kittitornkun, “A FPGA-based deep packet inspection engine for network intrusion detection system,” in Proc. 9th Int. Conf. Elect. Eng./Electron., Comput., Telecommun. Inf. Technol. (ECTI-CON), May 2012, pp. 1–4.
[7] H. Le and V. K. Prasanna, “A memory-efficient and modular approach for large-scale string pattern matching,” IEEE Trans. Comput., vol. 62, no. 5, pp. 844–857, May 2013.
[8] Y. Xin et al., “Parallel architecture for DNA sequence inexact matching with Burrows-Wheeler transform,” Microelectron. J., vol. 44, no. 8, pp. 670–682, Aug. 2013.
[9] D. Lavenier, G. Georges, and X. Liu, “A reconfigurable index FLASH memory tailored to seed-based genomic sequence comparison algorithms,” J. VLSI Signal Process., vol. 48, no. 3, pp. 255–269, 2007.
[10] Q. Zhang, R. D. Chamberlain, R. S. Indeck, B. M. West, and J. White, “Massively parallel data mining using reconfigurable hardware: Approximate string matching,” in Proc. 18th Int. Parallel Distrib. Process. Symp., Apr. 2004, p. 259.
[11] M. B. Anwer, M. Motiwala, M. bin Tariq, and N. Feamster, “SwitchBlade: A platform for rapid deployment of network protocols on programmable hardware,” ACM SIGCOMM Comput. Commu. Rev., vol. 40, no. 4, pp. 183–194, Oct. 2010.
[12] T. Sherwood, G. Varghese, and B. Calder, “A pipelined memory architecture for high throughput network processors,” in Proc. 30th Annu. Int. Symp. Comput. Archit., Jun. 2003, pp. 288–299.
[13] C. R. Meiners, J. Patel, E. Norige, E. Torng, and A. X. Liu, “Fast regular expression matching using small TCAMs for network intrusion detection and prevention systems,” in Proc. 19th USENIX conf. Secur., Aug. 2010, p. 8.
[14] K. Pagiamtzis and A. Sheikholeslami, “Content-addressable memory (CAM) circuits and architectures: A tutorial and survey,” IEEE J. Solid-State Circuits, vol. 41, no. 3, pp. 712–727, Mar. 2006.
[15] M. R. Stan, P. D. Franzon, S. C. Goldstein, J. C. Lach, and M. M. Ziegler, “Molecular electronics: From devices and interconnect to circuits and architecture,” Proc. IEEE, vol. 91, no. 11, pp. 1940–1957, Nov. 2003.
[16] A. DeHon, “Nanowire-based programmable architectures,” ACM J. Emerg. Technol. Comput. Syst., vol. 1, no. 2, pp. 109–162, Jul. 2005.
[17] K. K. Likharev and D. B. Strukov, “CMOL: Devices, circuits, and architectures,” in Introducing Molecular Electronics. New York, NY, USA: Springer-Verlag, 2006, pp. 447–477.
[18] X. Tang, P.-E. Gaillardon, and G. De Micheli, “A high-performance low-power near-Vt RRAM-based FPGA,” in Proc. Int. Conf. Field-Program. Technol. (FPT), Dec. 2014, pp. 207–214.
[19] D. B. Strukov and K. K. Likharev, “CMOL FPGA: A reconfigurable architecture for hybrid digital circuits with two-terminal nanodevices,” Nanotechnology, vol. 16, no. 6, p. 888, 2005.
[20] D. B. Strukov and K. K. Likharev, “A reconfigurable architecture for hybrid CMOS/nanodevice circuits,” in Proc. ACM/SIGDA 14th Int. Symp. FPGAs, 2006, pp. 131–140.
[21] D. B. Strukov and K. K. Likharev, “Reconfigurable hybrid CMOS/nanodevice circuits for image processing,” IEEE Trans. Nanotechnol., vol. 6, no. 6, pp. 696–710, Nov. 2007.
[22] Q. Xia et al., “Memristor-CMOS hybrid integrated circuits for reconfigurable logic,” Nano Lett., vol. 9, no. 10, pp. 3640–3645, 2009.
[23] D. Strukov and A. Mishchenko, “Monolithically stackable hybrid FPGA,” in Proc. Conf. Design, Autom. Test Eur., Mar. 2010, pp. 661–666.
[24] J. J. Yang, D. B. Strukov, and D. R. Stewart, “Memristive devices for computing,” Nature Nanotechnol., vol. 8, no. 1, pp. 13–24, 2013.
[25] K. K. Likharev, “Hybrid CMOS/nanoelectronic circuits: Opportunities and challenges,” J. Nanoelectron. Optoelectron., vol. 3, no. 3, pp. 203–230, 2008.
[26] A. M. Arafeh and S. M. Sait, “Cells reconfiguration around defects in CMOS/nanofabric circuits using simulated evolution heuristic,” in Proc. ISQED, Mar. 2015, pp. 581–588.
[27] W. N. N. Hung, C. Gao, X. Song, and D. Hammerstrom, “Defect-tolerant CMOL cell assignment via satisfiability,” IEEE Sensors J., vol. 8, no. 6, pp. 823–830, Jun. 2008.
[28] Z.-L. Pan, L. Chen, and G.-Z. Zhang, “Efficient design method for cell allocation in hybrid CMOS/nanodevices using a cultural algorithm with chaotic behavior,” Frontiers Phys., vol. 11, no. 2, p. 116201, Apr. 2016.
[29] S. M. Sait and A. M. Arafeh, “Cell assignment in hybrid CMOS/nanodevices architecture using Tabu search,” Appl. Intell., vol. 40, no. 1, pp. 1–12, Jan. 2014.
[30] D. B. Strukov, “Hybrid CMOS/nanodevice circuits with tightly integrated memory and logic functionality,” in Proc. Nanotechnol., vol. 11. 2011, pp. 9–12.
[31] F. Alibart, T. Sherwood, and D. B. Strukov, “Hybrid CMOS/nanodevice circuits for high throughput pattern matching applications,” in Proc.
[40] H. Kim and K.-I. Choi, “A pipelined non-deterministic finite automaton-based string matching scheme using merged state transitions in an FPGA,” PLoS ONE, vol. 11, no. 10, p. e0163535, 2016.
[41] H. J. Kim, “A failureless pipelined Aho-Corasick algorithm for FPGA-based parallel string matching engine,” in Information Science and Applications. Berlin, Germany: Springer, 2015, pp. 157–164.
[42] X. Wang and D. Pao, “Memory-based architecture for multicharacter Aho–Corasick string matching,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 26, no. 1, pp. 143–154, Jan. 2017.
[43] I. Kuon and J. Rose, “Measuring the gap between FPGAs and ASICs,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 26, no. 2, pp. 203–215, Feb. 2007.
[44] J. Li et al., “1Mb 0.41 μm2 2T-2R cell nonvolatile TCAM with two-bit encoding and clocked self-referenced sensing,” in Proc. Symp. VLSI Circuits (VLSIC), Jun. 2013, pp. C104–C105.
[45] S. Matsunaga et al., “Fully parallel 6T-2MTJ nonvolatile TCAM with single-transistor-based self match-line discharge control,” in Proc. Symp. VLSI Circuits (VLSIC), Jun. 2011, pp. 298–299.
[46] Q. Guo, X. Guo, Y. Bai, and E. Ipek, “A resistive TCAM accelerator for data-intensive computing,” in Proc. Micro, Dec. 2011, pp. 339–350.
[47] Q. Guo, X. Guo, R. Patel, E. Ipek, and E. G. Friedman, “AC-DIMM: Associative computing with STT-MRAM,” ACM SIGARCH Comput. Archit. News, vol. 41, no. 3, pp. 189–200, 2013.
[48] T. Hanyu, N. Kanagawa, and M. Kameyama, “Non-volatile one-transistor-cell multiple-valued cam with a digit-parallel-access scheme and its applications,” Comput. Elect. Eng., vol. 23, no. 6, pp. 407–414, 1997.
[49] S. Matsunaga et al., “Standby-power-free compact ternary content-addressable memory cell chip using magnetic tunnel junction devices,” Appl. Phys. Exp., vol. 2, no. 2, p. 023004, 2009.
[50] W. Xu, T. Zhang, and Y. Chen, “Design of spin-torque transfer magnetoresistive RAM and CAM/TCAM with high sensing and search speed,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 1, pp. 66–74, Jan. 2010.
[51] K. Eshraghian, K.-R. Cho, O. Kavehei, S.-K. Kang, D. Abbott, and S.-M. S. Kang, “Memristor MOS content addressable memory (MCAM): Hybrid architecture for future high performance search engines,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 19, no. 8, pp. 1407–1417, Aug. 2011.
[52] I. Arsovski, T. Chandler, and A. Sheikholeslami, “A ternary content-addressable memory (TCAM) based on 4T static storage and including a current-race sensing scheme,” IEEE J. Solid-State Circuits, vol. 38, no. 1, pp. 155–158, Jan. 2003.
[53] H. Noda et al., “A cost-efficient high-performance dynamic TCAM with pipelined hierarchical searching and shift redundancy architecture,” IEEE J. Solid-State Circuits, vol. 40, no. 1, pp. 245–253, Jan. 2005.
[54] T. Kusumoto, D. Ogawa, K. Dosaka, M. Miyama, and Y. Matsuda, “A charge recycling TCAM with Checkerboard Array arrangement for low power applications,” in Proc. IEEE Asian Solid-State Circuits Conf. (A-SSCC), Nov. 2008, pp. 253–256.
[55] Y.-D. Kim, H.-S. Ahn, S. Kim, and D.-K. Jeong, “A high-speed range-
NASA/ESA Conf. Adapt. Hardw. Syst. (AHS), Jun. 2011, pp. 279–286. matching TCAM for storage-efficient packet classification,” IEEE Trans.
[32] A. Madhavan and D. B. Strukov, “Mapping of image and network Circuits Syst. I, Reg. Papers, vol. 56, no. 6, pp. 1221–1230, Jun. 2009.
processing tasks on high-throughput CMOL FPGA circuits,” in Proc. [56] J.-S. Wang, H.-Y. Li, C.-C. Chen, and C. Yeh, “An AND-type match-
IEEE/IFIP 20th Int. Conf. VLSI Syst.-Chip (VLSI-SoC), Oct. 2012, line scheme for energy-efficient content addressable memories,” in IEEE
pp. 82–87. Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2005,
[33] B. Schmidt, Bioinformatics: High Performance Parallel Computer pp. 464–610.
Architectures. Boca Raton, FL, USA: CRC Press, 2010. [57] M. Imani, A. Rahimi, D. Kong, T. Rosing, and J. M. Rabaey, “Exploring
[34] Z. K. Baker and V. K. Prasanna, “Time and area efficient pattern hyperdimensional associative memory,” in Proc. HPCA, Feb. 2017,
matching on FPGAs,” in Proc. FPGA, 2004, pp. 223–232. pp. 445–456.
[35] R. Tessier and W. Burleson, “Reconfigurable computing for digital [58] I. Roy, A. Srivastava, M. Nourian, M. Becchi, and S. Aluru, “High
signal processing: A survey,” J. VLSI Signal Process., vol. 28, nos. 1–2, performance pattern matching using the automata processor,” in Proc.
pp. 7–27, May 2001. IEEE Int. Parallel Distrib. Process. Symp., May 2016, pp. 1123–1132.
[36] K.-N. Chia et al., “Configurable computing solutions for automatic target [59] K. Zhou, J. Wadden, J. J. Fox, K. Wang, D. E. Brown, and K. Skadron,
recognition,” in Proc. IEEE Symp. FPGAs Custom Comput. Mach., “Regular expression acceleration on the micron automata processor: Brill
Apr. 1996, pp. 70–79. tagging as a case study,” in Proc. IEEE Int. Conf. Big Data (Big Data),
[37] J. Yang, L. Jiang, Q. Tang, Q. Dai, and J. Tan, “PiDFA: A practical Oct./Nov. 2015, pp. 355–360.
multi-stride regular expression matching engine based On FPGA,” in [60] B. Govoreanu et al., “10 × 10nm2 Hf/HfOx crossbar resistive RAM with
Proc. IEEE Int. Conf. Commun. (ICC), May 2016, pp. 1–7. excellent performance, reliability and low-energy operation,” in IEDM
[38] Y.-H. Yang and V. Prasanna, “High-performance and compact architec- Tech. Dig., Dec. 2011, pp. 6–31.
ture for regular expression matching on FPGA,” IEEE Trans. Comput., [61] S. Pi, P. Lin, and Q. Xia, “Cross point arrays of 8 nm × 8 nm memristive
vol. 61, no. 7, pp. 1013–1025, Jul. 2012. devices fabricated with nanoimprint lithography,” J. Vac. Sci. Technol. B,
[39] N. L. Or, X. Wang, and D. Pao, “MEMORY-based hardware archi- Microelectron. Process. Phenom., vol. 31, no. 6, p. 06FA02, 2013.
tectures to detect ClamAV virus signatures with restricted regu- [62] D. B. Strukov and H. Kohlstedt, “Resistive switching phenomena in thin
lar expression features,” IEEE Trans. Comput., vol. 65, no. 4, films: Materials, devices, and applications,” MRS Bull., vol. 37, no. 2,
pp. 1225–1238, Apr. 2016. pp. 108–114, Feb. 2012.
[63] G. W. Burr et al., “Access devices for 3D crosspoint memory,” J. Vac. Sci. Technol. B, Microelectron. Process. Phenom., vol. 32, no. 4, p. 040802, 2014.
[64] D. B. Strukov and R. S. Williams, “Four-dimensional address topology for circuits with stacked multilayer crossbar arrays,” Proc. Nat. Acad. Sci. USA, vol. 106, no. 48, pp. 20155–20158, 2009.
[65] B. Chakrabarti et al., “A multiply-add engine with monolithically integrated 3D memristor crossbar/CMOS hybrid circuit,” Sci. Rep., vol. 7, Feb. 2017, Art. no. 42429.
[66] G. C. Adam, B. D. Hoskins, M. Prezioso, F. Merrikh-Bayat, B. Chakrabarti, and D. B. Strukov, “3-D memristor crossbars for analog and neuromorphic computing applications,” IEEE Trans. Electron Devices, vol. 64, no. 1, pp. 312–318, Jan. 2017.
[67] G. Ligang, F. Alibart, and D. B. Strukov, “Programmable CMOS/memristor threshold logic,” IEEE Trans. Nanotechnol., vol. 12, no. 2, pp. 115–119, Mar. 2013.
[68] D. Gavrilov, D. B. Strukov, and K. K. Likharev. (2017). “Capacity, fidelity, and noise tolerance of associative spatial-temporal memories based on memristive neuromorphic network.” [Online]. Available: https://arxiv.org/abs/1707.03855
[69] S. M. Chai, A. Gentile, W. E. Lugo-Beauchamp, J. Fonseca, J. L. Cruz-Rivera, and D. S. Wills, “Focal-plane processing architectures for real-time hyperspectral image processing,” Appl. Opt., vol. 39, no. 5, pp. 835–849, 2000.
[70] C. Kittel, Introduction to Solid State Physics. Hoboken, NJ, USA: Wiley, 2005.
[71] International Technology Roadmap for Semiconductors, Semiconductor Industry Association, 2013.
[72] V. Betz, J. Rose, and A. Marquardt, Architecture and CAD for Deep-Submicron FPGAs, vol. 497. New York, NY, USA: Springer, 2012.
[73] (2018). MATLAB Code for Optimal Throughput Calculation. [Online]. Available: https://www.ece.ucsb.edu/~strukov/papers/2018/pm/code.m
[74] X. Tang, G. Kim, P.-E. Gaillardon, and G. De Micheli, “A study on the programming structures for RRAM-based FPGA architectures,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 63, no. 4, pp. 503–516, Apr. 2016.
[75] M. Payvand et al., “A configurable CMOS memory platform for 3D-integrated memristors,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2015, pp. 1378–1381.
[76] D. B. Strukov and K. K. Likharev, “CMOL FPGA circuits,” in Proc. CDES, 2006, pp. 213–219.

Advait Madhavan (M’12) received the M.S. and Ph.D. degrees from the Electrical and Computer Engineering Department, University of California Santa Barbara, Santa Barbara, CA, USA, in 2013 and 2016, respectively.
He is currently a Postdoctoral Researcher at the National Institute of Standards and Technology, Gaithersburg, MD, USA. His current research interests include novel methods for information processing, ranging from conceptualization of high-level architectures, analog and digital circuits to integration with emerging technologies and chip designs.
Dr. Madhavan was a recipient of the Micro Top Pick Award in 2015.

Tim Sherwood (M’03–SM’14) is currently a Professor of Computer Science and the Associate Vice Chancellor for Research at the University of California Santa Barbara, Santa Barbara, CA, USA. He specializes in the development of processors exploiting novel technologies, provable properties, and hardware-accelerated algorithms.
Prof. Sherwood is a seven-time winner of the IEEE Micro Top Pick Award, an ACM Distinguished Scientist, winner of the UCSB Academic Senate Distinguished Teaching Award, and is the 2016 SIGARCH Maurice Wilkes Awardee “for contributions to novel program analysis advancing architectural modeling and security.”

Dmitri B. Strukov (M’02–SM’16) received the M.S. degree in applied physics and mathematics from the Moscow Institute of Physics and Technology, Dolgoprudny, Russia, in 1999 and the Ph.D. degree in electrical engineering from Stony Brook University, Stony Brook, NY, USA, in 2006.
He is currently a Professor of Electrical and Computer Engineering at the University of California Santa Barbara, Santa Barbara, CA, USA. His current research interests include different aspects of computations, in particular addressing questions on how to efficiently perform computation on various levels of abstraction.