This article has been accepted for inclusion in a future issue of this journal.

Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1

High-Throughput Pattern Matching With CMOL FPGA Circuits: Case for Logic-in-Memory Computing

Advait Madhavan, Member, IEEE, Tim Sherwood, Senior Member, IEEE, and Dmitri B. Strukov, Senior Member, IEEE

Abstract: In this paper, we propose a novel CMOS+MOLecular (CMOL) field-programmable gate array (FPGA) circuit architecture to perform massively parallel, high-throughput computations, which is especially useful for pattern matching tasks and multidimensional associative searches. In the new architecture, patterns are stored as resistive states of emerging nonvolatile memory nanodevices, while the analyzed data are streamed via CMOS subsystem. The main improvements over prior work offered by the proposed circuits are increased nanodevice utilization and, as a result, substantially higher throughput, which is demonstrated by a detailed analysis of the implementation of pattern matching task on the new architecture. For example, our estimates show that the proposed CMOL FPGA circuits based on the 22-nm CMOS technology and one crossbar layer with 22-nm nanowire half-pitch allow up to 12.5% average nanodevice utilization, i.e., the fraction of the devices turned to the high conductive state, as compared to a typical ∼0.1% of the original CMOL FPGA circuits. This in turn enables throughput close to 7.1 × 10¹⁶ bits/s/cm² at ∼1 fJ/bit energy efficiency, for matching of ∼10⁷ 250-bit patterns stored locally on a 1 cm² chip. These numbers are at least 2 orders of magnitude better throughput as compared to that of other state-of-the-art FPGA methods, and begin to approach ternary content-addressable memory-like performance at similar CMOS technology nodes. More generally, we argue that the proposed concept combines the versatility of reconfigurable architectures and density of the associative memories. It can be viewed as a very tight symbiotic integration of memory and logic functions for high-performance logic-in-memory computing.

Index Terms: CMOS+MOLecular (CMOL), field-programmable gate array (FPGA), hybrid circuits, logic-in-memory computing, memristor, pattern matching, ReRAM, resistive switching, ternary content-addressable memory (TCAM).

Manuscript received September 7, 2017; revised January 16, 2018; accepted February 17, 2018. This work was supported in part by RICARDO through NSF under Grant 1730309, in part by NSF under Grant 1563935, in part by CCF through NSF under Grant 1740352, and in part by AFOSR MURI under Grant FA9550-12-1-0038. (Corresponding author: Dmitri B. Strukov.)
A. Madhavan and D. B. Strukov are with the Department of Electrical and Computer Engineering, University of California, Santa Barbara, CA 93106 USA (e-mail: [email protected]; [email protected]).
T. Sherwood is with the Department of Computer Science, University of California, Santa Barbara, CA 93106 USA (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TVLSI.2018.2809644
1063-8210 © 2018 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION

RECONFIGURABLE circuits (RCs) are very efficient for information processing tasks [1], such as image and signal processing (e.g., filtering, edge detection, coding, and feature extraction [2], [3]), and string matching (e.g., in a context of network intrusion detection [4]–[7], deoxyribonucleic acid sequence matching [8], [9], database searching [10], and network packet routing [11], [12]). The common feature in these tasks is that they can be efficiently parallelized, and that the same basic operation is performed numerous times using one set of fixed data known in advance (which are allowed to change infrequently), such as a filter template in image processing or keyword in string matching, along with streaming input data. In this respect, RCs offer massively parallel, instance-specific computation customized for the needs of the particular application, and thus potentially offer real-time processing coupled with low-power consumption.

However, even contemporary RCs cannot provide enough computational power for future demands. For example, in network intrusion detection applications, deep packet information of computer networks is compared against a known sequence of data representing a computer virus or other malicious content [4], [13]. In order to provide real-time protection, the search engine should perform many comparisons in parallel, and simultaneously allow for updating of virus signatures. Earlier RC implementations were adequate to ensure a few Gbit/s/cm²-scale sustained throughput for ∼2000 100-B long patterns [4]. The throughput could be further significantly improved by employing dynamic reconfiguration and customized hardware, including dedicated ternary content-addressable memories (TCAMs) [5], [7], [13], [14]. However, even these techniques have limited benefits, largely due to excessive reconfiguration overhead for multicontext field-programmable gate arrays (FPGAs) [1], I/O limitations for dynamic reconfiguration, and/or rigid inefficient structure of content-addressable memories (CAMs), and thus may be insufficient for future needs. The deployment of faster 100-Gbit/s-scale data networks, as well as the continued increase in the number of patterns (e.g., the number of known viruses) makes real-time protection impossible even for the most advanced circuit implementations with CMOS technology.

The performance of RCs can be greatly improved using hybrid CMOS/nanoelectronic circuits [15]–[18]. One such example is CMOS+MOLecular (CMOL) FPGA [17], [19]–[23], where CMOL stands for CMOS+MOLecular scale hybrid circuit, which was conceived to seize the density advantages of emerging technologies, such as nanoimprint lithography and monolithically integrated self-assembled nanodevices,
and to combine it with the flexibility and versatility of CMOS technology. Most of the configuration overhead in CMOL FPGAs, including all configuration memory and some routing circuitry, is lifted above the CMOS plane. Logic gates are based on a combination of nonlinear nanoscale resistive switching devices (which are also called “memristors” [24]) and CMOS logic, which improves the aggregate density of the logic circuitry [20], [21], [25]. In other CMOL-FPGA-like concepts, nanoscale devices are only utilized for routing purposes [22], [23].

Though the density advantage is significant, the nanodevice utilization in the previously reported works on CMOL FPGAs [19], [21], [23], [26]–[30] is well below 1% due to the limited benefits of utilizing high fan-in gates. In this paper, we present a modification to the original logic structure [19], [20] and show that some information processing tasks are uniquely suited for the high fan-in gates of CMOL FPGA circuits. The TCAM-like cell architecture allows for a more efficient use of memristive devices resulting in much higher performance, while still being able to maintain the reconfigurability, hence blending the best of both worlds. The close proximity of the nanodevices to CMOS, by virtue of the vertical integration, allows for synergistic interaction between memory and computation, hence resulting in state-of-the-art performance.

Some of the architecture details and one application study have been reported earlier in [30]–[32]. The major contributions of this paper are as follows.
1) A performance analysis of the proposed circuit. Our estimates account for the sizing of CMOS circuits, which was generally neglected in previous CMOL FPGA work, though crucial for providing correct functionality for the considered circuits.
2) An optimization procedure that considers architectural, topological, and circuit-level constraints to maximize the throughput of the proposed circuits.
3) An additional application case study.

The rest of this paper is organized as follows. In Section II, we briefly review the background material on pattern matching, the considered resistive switching devices, and CMOL circuits. In Section III, we introduce modified CMOL FPGA architecture. Section IV discusses two applications mapped on a new architecture, while performance modeling results are provided in Section V. Finally, the results are discussed and summarized in Section VI.

II. BACKGROUND

A. Pattern Matching

At a high level, contemporary high-performance pattern matching approaches can be divided into two groups. The first approach makes use of the reconfigurable nature of FPGA, exploiting the fine-grain configurability of the devices to implement a dense pattern matching structure [1], [33]–[36]. For example, many FPGA schemes make use of the configurable interconnect to stream data through a series of basic pattern matching operations performed by lookup tables inside logic blocks (Fig. 1). Going a step further, the reconfigurable nature of the hardware can be exploited to optimize matching structures for the particular set of patterns/expressions being searched [37]–[39], e.g., through a technique analogous to common expression elimination [4], [36] or by constructing deterministic and nondeterministic finite-state automata for recognition [40]–[42]. The flexibility and bit-level configurability of FPGAs make them a natural platform for instance-specific highly parallel implementations in which both memory functions (i.e., storage of patterns) and logic operations are performed locally. On the other hand, reconfigurability comes at a high price, typically with ∼40× larger area and ∼3× longer delay as compared to custom circuit implementations [43].

Fig. 1. Example of pattern matching circuitry, which is designed to detect two patterns, “0011” and “1010” in data streaming along the chain of DFFs. The bit values alongside wires illustrate a specific example of the data values in the pipeline and resulting logic values, i.e., detection of “0011” pattern. Note that three logic gates (2 AND and 1 OR) can be realized with one 4:1 lookup table in the FPGA implementation.

The second approach is based on TCAMs [44]–[56], which allow bit-level comparisons of streaming data against stored patterns in massively parallel fashion [Fig. 2(a)]. The relatively dense structure of CAMs, which are roughly 2× sparser than conventional static random access memories (SRAMs), allows more patterns to be stored in the same unit of silicon as compared to FPGA approaches. The ternary aspect of the TCAM allows it to match do-not-care conditions as well. The downside of this approach is that the long memory lines used for matching must be charged and discharged on each and every search cycle, even when no matches are to be found.

The principle of operation of a conventional SRAM-based CAM is shown in Fig. 2(b). It consists of several CAM memory cells arranged along a match line. Each CAM cell has a dual-inverter memory element (which comprises 4 transistors), and 4 match and pull-down transistors. (The read/write circuitry of each memory cell, which is another 4 transistors, has been left out for clarity.) Once the data have been stored in the memory element, the search operation is initiated by precharging the match line to a logical “1.” The data to be searched are then presented along the search lines. Depending upon the data stored in the memory element, on a mismatch, there will be a clear discharge path from the match line to ground and on a match, there will be no discharge.

Fig. 2(c) shows an SRAM TCAM cell with the ability to store do-not-cares by splitting the memory cell into two, which can now both store zeros, thereby always keeping the discharge path off. Similar to the T/CAM cells shown in Fig. 2(b) and (c), there were many proposals of TCAM cell implementations with other memory technologies. For example, Fig. 2(d)–(g) shows TCAM cells based on flash
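The matching behavior described in this subsection can be illustrated with a short software model. The following Python sketch is only an illustration under stated assumptions, not the circuit of Fig. 1 or Fig. 2: a stored word is written as a string over '0', '1', and 'X' (do not care), the match line is modeled as staying charged when no cell mismatches, and the chain of DFFs is modeled as a sliding window over a bit string. The function names `tcam_row_matches` and `scan_stream` are hypothetical.

```python
from typing import List, Tuple


def tcam_row_matches(stored: str, query: str) -> bool:
    """Return True when the modeled match line stays charged (no mismatch)."""
    if len(stored) != len(query):
        raise ValueError("stored word and query must have equal length")
    # A cell opens a discharge path only if it stores a definite bit that
    # differs from the search bit; 'X' (do not care) never discharges the line.
    return all(s == "X" or s == q for s, q in zip(stored, query))


def scan_stream(stream: str, patterns: List[str]) -> List[Tuple[int, str]]:
    """Slide a window along the stream and report (offset, pattern) matches."""
    hits = []
    for pattern in patterns:
        width = len(pattern)
        for start in range(len(stream) - width + 1):
            if tcam_row_matches(pattern, stream[start:start + width]):
                hits.append((start, pattern))
    return sorted(hits)


if __name__ == "__main__":
    # The two patterns named in the Fig. 1 caption; the stream is made up.
    print(scan_stream("0100110", ["0011", "1010"]))
```

In hardware all window positions and all stored words are evaluated in parallel in one cycle; the nested loops here only enumerate the same comparisons sequentially.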
MADHAVAN et al.: HIGH-THROUGHPUT PATTERN MATCHING WITH CMOL FPGA CIRCUITS 3

memory [48], hybrid CMOS/magnetic random access memory (MRAM) [49], CMOS/spin torque transfer (STT)-RAM [50], and CMOS/memristor [51] circuits. In this paper, we consider the implementation of TCAM cell with a pair of two memristors [Fig. 2(i)] [30]–[32].

Fig. 2. Pattern matching with CAMs. (a) General idea. (b) Example of one row of SRAM-based CAM memory implemented in OR style [14]. (c) SRAM-based TCAM memory cell [14]. (d)–(i) TCAM cells based on nonconventional memory technologies. (d)–(g) TCAM cells based on flash memory [48], hybrid CMOS/MRAM [49], CMOS/STT-RAM [50], and CMOS/memristors [51] technologies, respectively. (i) Memristor-based implementation.

More recently, pattern-matching implementations were suggested based on hyper-dimensional memory [57] and Micron automata processor [58], [59]. In spite of algorithmic differences, the operation of hyper-dimensional memory circuit is somewhat similar to that of TCAM with the added functionality of measuring the sense current which is representative of the distance of the mismatched query pattern with the stored patterns. On the other hand, automata processor approach is more similar to FPGA computing in the ability to perform fine grain, massively parallel operations on a stream of input data. It is essentially a “sea-of-gates” fabric with Boolean logic gates and counters interconnected with a reconfigurable routing network, but is more catered toward implementations of high-throughput nondeterministic finite-state machines.

B. Resistive Switching Devices

Resistive switching devices [24] are a key ingredient of the proposed CMOL FPGA pattern matching circuits. (In this paper, we also use terms “crosspoint device,” “nanodevice,” or simply “device” to describe memristors.) In its simplest form, a memristor consists of three layers: top and bottom (metallic) electrodes, and a thin film of some insulating material (inset of Fig. 3), most typically transition metal oxide, which can undergo resistive switching [24]. Specifically, by applying a relatively large (“write”) voltage bias (V_W) across the electrodes of such a nanodevice, the thin film can be switched reversibly between high (“ON”) and low (“OFF”) conductive states, characterized by R_ON and R_OFF resistances, respectively. For properly engineered nanodevices, the conductive state can be retained indefinitely and probed without disturbing it by applying relatively small (“read”) voltage bias (≤ V_R). Because of an ionic memory mechanism [24], and a simple structure, which is conducive to aggressive lithographic and other patterning techniques, memristors have excellent density prospects. For example, several groups have recently shown metal oxide memristors with nanodevice area below 15 × 15 nm² [60], [61], which is defined by the overlap area of bottom and top electrodes.

Fig. 3. Idealized I–V for bipolar memristors. (Inset: Cartoon of a crosspoint device.)

Switching the bipolar nanodevice between high conductive and low conductive states is accomplished by applying write voltages of opposite polarity. For example, Fig. 3 shows hysteretic I–V curve for idealized bipolar nanodevice for which applying V ≥ +V_W across the device would switch it into the ON state (so-called set transition), while applying negative voltage V ≤ −V_W would switch it back (reset) to the OFF state.

The very high density of individual memristors can be sustained at the circuit level by employing passive crossbar structures, which consist of mutually perpendicular nanowires with nanodevices formed at their crosspoints. Crossbar integration imposes additional requirements for the memristors, such as the need for low forming voltage [24]. (The forming process is a one-time application of a relatively large voltage or current pulse to turn an as-fabricated “virgin” nanodevice to operational memristor.) Other major challenges of passively integrated crossbar circuits are state disturbances of half-selected devices during write operation, sneak-path currents during read operation, and the common problem of currents running via half/unselected nanodevices.

Currents via unselected and half-selected devices, which are much higher for the write operation because of larger voltage ranges, can lead to undesirable voltage drops across nanowires. One of the solutions to this problem is to utilize nanodevices with strongly nonlinear electron transport presented as (diode-like) nonlinear I–V with threshold voltage V_T for the current flow (Fig. 3), which suppresses any unwanted currents in the crossbar circuit [24], [62], [63]. For example,
the sneak-path is naturally cut by suppressing reverse current for the devices with very asymmetric I–Vs with $V_T^{(+)} \ll V_R < V_T^{(-)}$.

For simplicity, in our analysis, we assume bipolar memristors described by asymmetric idealized I–V curve with precisely defined write voltage V_W, i.e., with no device-to-device and cycle-to-cycle variations.

C. CMOL Structure

In CMOL structures one [17], [25] or several [64]–[66] crossbar layers are vertically integrated on top of conventional CMOS circuits [Fig. 4(a)]. One of the key characteristics of the CMOL architecture is the ability to access (read or write) every crosspoint nanodevice from much sparser CMOS circuitry without sacrificing crossbar integration density. The crosspoint memristors can be programmed to either high or low resistive states, and together with the CMOS layer create a reconfigurable fabric that can perform information processing (i.e., pattern matching for the considered applications) and interconnect duties.

Fig. 4. (a) Cartoon of CMOL circuits. (b) Functional partitioning of the CMOL circuit for pattern matching applications by the area-distributed interface (red and blue arrows), between the CMOS layer at the bottom and passive nanowire crossbar on top.

In particular, the CMOS layer is arranged as an array of “atomic” CMOS cells [Figs. 4(b) and 5], which are connected to the nanoscale crossbar circuit via an area-distributed interface. Each cell houses CMOS circuits that provide unique access to each of the cell’s two vias from the cell array periphery, and also CMOS circuitry specific to the implemented CMOL circuit. The nanowire crossbar is rotated with respect to the array of atomic cells underneath, and provides high fan-in and fan-out connectivity between them. For example, each (output) blue via connects to a certain quasi-horizontal nanowire, which in turn connects to multiple quasi-vertical nanowires through crosspoint devices. These quasi-vertical wires each connect to (input) red vias of other surrounding cells. The rotation of the crossbar naturally breaks nanowires into segments, and as a result, every nanowire segment is connected by only one via (Fig. 5). (Note that more readily manufacturable CMOL structures, with nanowires running strictly in vertical and horizontal directions, are also possible. For example, the effective rotation of the nanowire array can be implemented by adjusting positions of the cells’ vias [23] or by using zig-zag shaped nanowires [65].)

Fig. 5. Crossbar rotation in CMOL shown for specific value of r = 3.

Selection of any crosspoint device, to perform read or write operation, is implemented with double-decoding scheme. The first level of decoding, implemented by the peripheral CMOS decoders and pass gates/transistors of the atomic CMOS cells, is used to select a pair of vias, one blue and one red, which connect to the corresponding mutually perpendicular nanowire segments that lead to the crosspoint device in question. The second level of decoding is implemented with half-biasing approach, which utilizes memristor nonlinearities in switching kinetics and electron transport to enable unique access to the specific crosspoint device.

The angle of the crossbar rotation α depends upon several parameters such as cell complexity, CMOS process, and pitch of the nanowires. Specifically, assuming that F_nano and F_CMOS are the minimum half-pitch of the nanowire crossbar array and the feature size of CMOS circuitry, respectively, and that the side length of one atomic CMOS cell is 2βF_CMOS, where β is a parameter representing cell’s size, the CMOL topology is described by set of equations [17]

$$\tan(\alpha) = \frac{1}{r}, \qquad \sqrt{r^{2} + 1} = \frac{\beta F_{\mathrm{CMOS}}}{F_{\mathrm{nano}}}. \tag{1}$$

It is also very convenient to characterize CMOL architecture with parameter M_A

$$M_A = \frac{L - 2F_{\mathrm{nano}}}{2F_{\mathrm{nano}}} - 1 = r^{2} - 1 \tag{2}$$

that defines the number of atomic cells connected to one atomic cell and is equal to the number of crossings (memristors) on one nanowire segment. For example, Fig. 7(a) shows a CMOL structure for r = 6 with its connectivity domain highlighted, and, in particular, shows that the given atomic cell can be connected to any of the other M_A = 36 atomic cells, including connection to itself, in its connectivity domain via the crossbar structure.

For a fixed complexity CMOS cell in a certain process, lowering the pitch of the nanowire allows higher density by increasing the relative angle of the crossbar with respect to the CMOS vias. This also explains how, while keeping nanowire half-pitch constant, maximum crossbar density can be preserved irrespective of the size of the atomic cell, while only affecting its connectivity. In addition, it is worth mentioning that the CMOL segmented crossbar structure is not only good for high fan-in fan-out computation, but also has
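The topology relations (1) and (2) are easy to evaluate numerically. The Python sketch below is a minimal illustration, assuming the reading of (1) as tan(α) = 1/r and sqrt(r² + 1) = βF_CMOS/F_nano and of (2) as M_A = r² − 1; the segment length L is not given separately in the text, so it is back-computed here from (2) as L = 2F_nano(r² + 1). The function name `cmol_topology` and its dictionary keys are hypothetical.

```python
import math


def cmol_topology(r: int, f_nano_nm: float, f_cmos_nm: float) -> dict:
    """Evaluate Eqs. (1) and (2) for an integer topological parameter r."""
    if r < 1 or f_nano_nm <= 0 or f_cmos_nm <= 0:
        raise ValueError("r must be >= 1 and both feature sizes positive")
    alpha = math.atan(1.0 / r)                       # Eq. (1): tan(alpha) = 1/r
    beta = math.sqrt(r * r + 1) * f_nano_nm / f_cmos_nm  # Eq. (1), solved for beta
    cell_side = 2.0 * beta * f_cmos_nm               # atomic cell side, 2*beta*F_CMOS
    m_a = r * r - 1                                  # Eq. (2): crossings per segment
    # Assumption: L chosen so that (L - 2F)/(2F) - 1 equals r^2 - 1.
    segment_length = 2.0 * f_nano_nm * (r * r + 1)
    return {
        "alpha_rad": alpha,
        "alpha_deg": math.degrees(alpha),
        "beta": beta,
        "cell_side_nm": cell_side,
        "segment_length_nm": segment_length,
        "m_a": m_a,
    }


if __name__ == "__main__":
    # r = 3 as in Fig. 5; 22-nm CMOS and 22-nm nanowire half-pitch as in the abstract.
    for name, value in cmol_topology(3, 22.0, 22.0).items():
        print(name, value)
```

Solving (1) for β rather than for r reflects how the relation is typically used in such estimates: r is restricted to integers, so the cell size follows from the chosen r and the two feature sizes.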
a large amount of parallelism embedded in it. Two adjacent atomic cells share a considerable portion of their connectivity domains. These shared cells, in spite of sharing quasi-vertical nanowires, interact with the adjacent atomic cells through different memristors as their crosspoints lie on different quasi-horizontal lines.

III. PATTERN MATCHING CIRCUIT ARCHITECTURE

Fig. 6 shows the proposed CMOL FPGA circuits for pattern matching. The CMOL fabric is a uniform array of “unit” cells, each comprised of two atomic cells. The unit cell implements a CMOS D-flip-flop (DFF) connected via pass gates to cell’s vias. To improve voltage margins, we assume that each unit cell also hosts a Schmitt trigger, which is inserted between the cell’s input vias and the input of the DFF. Note that the DFFs’ inputs and outputs are connected to each other only via crossbar circuit and not via CMOS subsystem.

Fig. 6. Proposed CMOL FPGA for pattern matching. (a) Unit cell, which is comprised of two atomic cells, hosting CMOS DFF and pass gates. For clarity, Schmitt trigger is not shown. (b) Equivalent circuit of one unit cell with diode-like memristor having I–V characteristics shown in Fig. 3 suitable for (c) linear threshold logic, or (d) diode–resistor logic. (e) Fragment of CMOL fabric showing nanowires connected to one of the input nanowire of unit cell. (f) Example of unit cell operation. (f) Six inputs [out of 24 total in (e)]. (g) Equivalent gate representation of pattern matching operation for the specific pattern stored in memristors shown in (f).

Similar to the originally proposed circuits [19], [20], memristors at the nanowire crosspoints can be programmed to perform logic as well as interconnect functions. In order to configure the CMOL crossbar circuit to implement custom logic, first, the CMOS block is disabled in all cells by deasserting “enable” line [Fig. 6(a)]. This is equivalent to tristating the output of the CMOS cell such that applied write voltages do not short circuit the output of the DFF’s drivers. As a result, any crosspoint device in the crossbar structure can be programmed to the ON or OFF state by utilizing the double-decoding scheme of CMOL memory. Note that unlike the original concept discussed in [19], [20], here, we assume that each (red or blue) via is connected with pass gates to two select and two data CMOS lines [Fig. 6(a)]. Connecting each via to a pair of data lines allows different voltages to be applied independently, which is more desirable for a half-biasing scheme, while in the original concept some of the nanowires were always floated. Also, in principle, pass transistors for connecting data and select lines to vias can be utilized instead of pass gates; however, as our estimates below show, this does not help much in reducing the cell area.

After the programming stage, logic operations are implemented with diode–resistor logic formed by the ON-state nanodevices and CMOS pass transistors [20], while the signal restoration, inversion, and latching are performed by the CMOS subsystem [Fig. 6(b) and (f)]. Similar to conventional implementations, two flavors of diode–resistor logic, static and dynamic, are possible with only subtle modification of the underlying hardware (though with a different requirement of the nanodevices). In the static case, the cell’s input nanowires are pulled down to the ground via pass transistor. Within the course of a single operation, i.e., performed in one clock cycle, the final output voltage value is determined by the resistive divider formed by the diode–resistor logic. The dynamic case, on the other hand, is similar to TCAM-based implementations with the circuit operation divided into a precharge and an evaluate phase. In the precharge phase, the cell’s input nanowires behave like match lines [similar to Fig. 2(i)] which are precharged low using a pull-down transistor, while the output nanowires are decoupled from their corresponding DFF outputs by deasserting the pass gate inputs. In the evaluation phase, the output enable lines are asserted which enable the proper logic functionality of the dynamic cell by pulling the output voltage high in the case of a mismatch and leaving it low in the case of a match.

The specific logic functionality of each unit cell and its connectivity is governed by the state of memristors connected to its quasi-horizontal nanowire [Fig. 6(e)]. For instance, Fig. 6(f) shows a particular example of implementing function A’B, where signals A, B, C, and their complements are routed from the output of the surrounding unit cells. Both true and complementary values of the signal are available at the output of the DFF, so that each bit of a pattern is represented by 2 memristors. Fig. 6(g) shows the equivalent custom logic gate, which performs the exact pattern matching corresponding to the specific pattern stored in memristors [Fig. 6(f)]. It is worth mentioning that the state of the memristors remains unchanged during logic operation stage, because the voltage drop across memristors is always less than or equal to V_R. Also note that while Fig. 6(f) shows matching of two 3-bit patterns, i.e., one pattern formed by the state of flip-flop cells and another one by the state of memristors, the number of bits in a pattern that can be compared by one unit cell is typically much larger (>100) for practical values of topological parameter r.

The unit cell can be also configured to perform approximate pattern matching when analog properties of memristors are utilized to implement linear threshold gates [Fig. 6(c)] [67]. Such linear threshold gate can implement matching of two
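A small behavioral model may help make the two-memristors-per-bit storage concrete. The Python sketch below is a logical abstraction only, with no voltages or resistive dividers, and it rests on an assumed encoding that the text does not spell out: a stored "1" turns ON the memristor on the true signal line, a stored "0" turns ON the memristor on the complement line, and a do-not-care leaves both OFF. Under that assumption the pattern "01X" over signals (A, B, C) yields the function A'B mentioned for Fig. 6(f). The approximate variant simply counts mismatching bits against a threshold, in the spirit of a linear threshold gate. All names are hypothetical.

```python
ON, OFF = True, False

# Assumed encoding: (memristor on true line, memristor on complement line).
_ENCODING = {"1": (ON, OFF), "0": (OFF, ON), "X": (OFF, OFF)}


def program_pattern(pattern):
    """Translate a string over '0', '1', 'X' into memristor state pairs."""
    return [_ENCODING[symbol] for symbol in pattern]


def hamming_mismatches(cell, data_bits):
    """Count stored bits that disagree with the streamed data bits."""
    if len(cell) != len(data_bits):
        raise ValueError("pattern and data must have equal length")
    mismatches = 0
    for (true_on, comp_on), bit in zip(cell, data_bits):
        if true_on and bit == 0:   # true line selected but signal is low
            mismatches += 1
        if comp_on and bit == 1:   # complement line selected but signal is high
            mismatches += 1
    return mismatches


def exact_match(cell, data_bits):
    """Logical AND over every signal wired through an ON memristor."""
    return hamming_mismatches(cell, data_bits) == 0


def threshold_match(cell, data_bits, max_mismatches):
    """Approximate match: accept up to max_mismatches differing bits."""
    return hamming_mismatches(cell, data_bits) <= max_mismatches


if __name__ == "__main__":
    cell = program_pattern("01X")            # A'B with C ignored
    print(exact_match(cell, [0, 1, 1]))
    print(threshold_match(cell, [1, 1, 0], max_mismatches=1))
```

In the circuit the threshold would be set through analog resistance states rather than an integer argument; the integer is used here only to keep the model simple.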
patterns based on the Hamming distance between them and the specific threshold for Hamming distance can be programmed in the field by setting appropriate analog resistance states of the memristors [57], [68].

Fig. 7. (a) Top view of crossbar circuit and input connectivity domain for CMOL with r = 6, which is considered for application studies. (b) Equivalent sea-of-gate circuit architecture. (c) Functionality of one logic tile. In (a), the set of atomic cells bounded by red lines is a physically permitted domain (of atomic cells), while the somewhat smaller domain of unit cells surrounded by black lines is used in mapping examples. In (b), the central tile can be connected, i.e., directly pass output to or take inputs from, to any of the surrounding shaded tiles. In (c), AND gate should be replaced with linear threshold gate when analog properties of memristors are exploited.

Since a unit cell consists of two atomic cells and has two input crossbar nanowires, its connectivity domain, i.e., the domain of atomic cells to which the given one can be connected, is larger [Fig. 7(a)]. The largest unit cell connectivity domain M is achieved by having the least overlap between individual connectivity domains of two atomic cells comprising the unit cell, and in this case M ≈ M_A. For example, this could be implemented with the blue via having contacts at the edges of the crossbar nanowire segments as shown in Fig. 7(a), which is achieved by choosing appropriate relative position of vias inside the cell.

To simplify mapping of the applications to the proposed CMOL fabric and to provide a simple abstracted view of the logic and routing architecture, we will further use artificially smaller (than physically permissible) rectangular shaped domains, for example, a domain of 5 × 5 unit cells for topological parameter r = 6 [Fig. 7(a)]. Therefore, the proposed CMOL FPGA architecture can be thought of as an array of multifunctional unit cells [Fig. 7(b) and (c)]. Every unit cell can pass its outputs to or accept inputs from any of the unit cells in 5 × 5 connectivity domain, which is always centered with respect to a given unit cell [Fig. 7(b)]. Moreover, as discussed above, every unit cell can be configured to perform AND (or linear threshold functions) with normal or complemented outputs of unit cells in its connectivity domain. Naturally, due to De Morgan’s law and the presence of complementary output, Boolean OR functions can also be realized for every unit cell.

incoming data that needs to be searched for patterns, while the remaining unit cells will be used to process and store data indicating successful matches [Fig. 4(b)]. In this respect, the proposed architecture is similar to conventional TCAM circuits. Indeed, TCAM cells are implemented with differential pair of memristors [Fig. 2(i)], search lines are supplied from unit cells streaming data [Fig. 2(a)], while the pattern matching in a row of cells [Fig. 2(b)], and latching of the match result is implemented within a single unit cell dedicated for TCAM operation. The major advantage of the proposed architecture, however, is very high density of TCAM-like cells and flexible, FPGA-like allocation of unit cells for streaming and processing the data, which can be tailored for a particular application.

IV. APPLICATION MAPPING CASE STUDIES

A. 1-D Pattern Matching

Let us first consider 1-D stream of data pushed through a very deep pipeline, which is representative of network intrusion detection, bioinformatics, network routing, and various other string processing applications [1]. Conceptually, the idea of pattern matching for 1-D streaming data is simple (Fig. 8). To improve throughput it is natural to perform multiple pattern matching operations in parallel. Because of fan-out restrictions (i.e., limited connectivity domain) pattern matching is performed simultaneously with several (W) blocks of the pipeline data as shown in Fig. 8, with U operations done concurrently for each block. (Here, we assume that block length matches the length of pattern being searched.) With such a scheme, the total number of pattern matching operations performed in a given cycle is U × W and W cycles are needed to check a certain portion of the streaming data against all (U × W) programmed patterns. Therefore, it is natural to allocate (configure) some unit cells in the homogeneous array to perform streaming data by forming long pipelines, while others to implement pattern matching. For simplicity, we assume that the results of pattern matching operations are logically summed together so that the circuit generates a single bit on the output at every cycle. (A more sophisticated processing would be straightforward given universality of the unit cells and flexibility in mapping.)

Fig. 8. General idea for pattern matching with 1-D streaming data. Here, we assume that adjacent blocks of pipeline data are shifted by one bit position and the block length matches the pattern length.

Fig. 9 shows an example of the mapping where white, green, and blue flip-flops represent unit cells performing pattern matching, data streaming, and processing of pattern match-
As we will show next, the patterns will be stored as the ing results, respectively. More specifically, in this example,
state of memristors, some unit cells will be configured to store the streaming data are passed via two independent pipelines

Fig. 9. Example of pattern matching operations for the data streamed via green flip-flops (unit cells), with the first half of the data coming from the top pipeline and the second half from the bottom one. The white unit cells are programmed to match patterns “0111011111,” “1X10000X00,” “1011001111,” and “1100110110,” while the results of pattern matching operations are summed in a pipeline comprised by a chain of blue cells. Here, d denotes data bit stored by a unit cell and the index for d denotes the order of data in 1-D stream, while p denotes a pattern being matched by a given unit cell and pair of indexes for p represents specific pattern index within the block (out of total U patterns) and block id (out of total W blocks), respectively (see Fig. 8). Note that the summation is performed with AND gate based on De Morgan's law.

Fig. 10. Mapping of 1-D pattern matching task to CMOL FPGA with r = 6. (a) General mapping scheme. (b) Zoomed-in view showing mapping of the streaming data and unit cells performing pattern matching. (c) Scheme for getting logical summation operation of pattern matchings and data propagation in a pipeline. (d) Relative window for pattern matching operations in the data stream. Red arrows show schematically data movement/logical operation performed by each type of unit cell.
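The matching and summation described in the Fig. 9 caption can be mimicked in software. The following Python sketch is only an illustration under stated assumptions, not the authors' circuit or code: the function names (`matches`, `any_match`, `scan`) are hypothetical, the stream is modeled as a string of bits, and "X" is assumed to mark a don't-care position, as in a ternary CAM. The OR of the per-pattern results is built from AND and NOT only, mirroring the De Morgan construction mentioned in the caption.

```python
from typing import List

# Patterns quoted in the Fig. 9 caption; 'X' is assumed to be a don't-care bit.
PATTERNS = ["0111011111", "1X10000X00", "1011001111", "1100110110"]


def matches(pattern: str, window: str) -> bool:
    """Ternary match of one pattern against one equally long data window."""
    return len(pattern) == len(window) and all(
        p == "X" or p == w for p, w in zip(pattern, window)
    )


def any_match(patterns: List[str], window: str) -> bool:
    """Logical summation (OR) of the match results using only AND and NOT.

    De Morgan: OR(m_1, ..., m_n) == NOT(AND(NOT m_1, ..., NOT m_n)).
    """
    return not all(not matches(p, window) for p in patterns)


def scan(patterns: List[str], stream: str) -> List[int]:
    """Offsets of every window of the stream that matches at least one pattern."""
    n = len(patterns[0])
    return [
        i
        for i in range(len(stream) - n + 1)
        if any_match(patterns, stream[i:i + n])
    ]


if __name__ == "__main__":
    # The window starting at offset 2 is "1110000100", which fits "1X10000X00".
    print(scan(PATTERNS, "00111000010011"))  # prints [2]
```

In hardware all windows and patterns are evaluated concurrently by separate unit cells; the sequential loop here only reproduces the logical result.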
formed by green unit cells, and four exact pattern matching operations are performed with streaming data at each cycle by white unit cells. The results of pattern matching operations are summed in a pipeline formed by blue unit cells.

In general, the largest number of bits in a pattern that can be matched with one unit cell is (N_bit)_max = M − 1. This corresponds to the case when all unit cells in the connectivity domain of the given one are configured to stream data. In this case, only a fraction χ = χ_min = 1/M of the unit cells perform the pattern matching operation. At the other extreme, i.e., when all unit cells in the connectivity domain except for one are configured to perform pattern matching, χ = χ_max = (M − 1)/M and N_bit = (N_bit)_min = 1. More generally, the number of bits in a pattern which can be compared by one unit cell is

$$N_{\mathrm{bit}} = (1 - \chi)M. \tag{3}$$

The total number of bits in all patterns matched per cycle is proportional to χ N_bit = χ(1 − χ)M, which is a downward parabola in χ and therefore reaches its maximum at χ = 0.5, i.e., when half of the unit cells in the connectivity domain are configured to perform pattern matching and the other half to stream data. (Note that this conclusion assumes that the length of the pattern is not fixed but rather a parameter which is optimized. Also, the unit cells configured to process the results of pattern matching are neglected in this analysis, which is justified by their relatively small number, at least in our considered case.) A similar observation for balancing streaming and processing resources has been made when mapping network processing tasks on conventional FPGA circuits [34].

Fig. 10 shows an example of one such mapping. The streaming 1-D data are duplicated and pushed through several pipelines. To ensure that the connectivity domain of every white cell always contains unique and contiguous streaming data from the green cells, the relative position of data in every second pipeline is shifted by five positions [Figs. 9 and 10(b)] for the considered connectivity domain size (Fig. 7). The unit cells that should be allocated to duplicate the streaming data are not shown, though their overhead is negligible.

Because of the limited size of the connectivity domain, the logical summation from all of the pattern matching cells is done in several steps. As Figs. 9 and 10(c) show, at each cycle a particular blue cell latches the logical summation of the values from the two nearest white cells (which hold the results of pattern matches from the previous cycle) and from one blue cell to the left of the given one. The partial sum is fully pipelined and propagates along a row of blue cells at the rate of one cell position per cycle, i.e., as fast as the streaming data. Once partial sums from different rows are propagated to the right edge of the chip, they are summed up in a similar fashion to get one single value. This value represents the logical summation of all results of the comparisons performed within the array at a specific time window.

Finally, let us note that for the considered value of r = 6, each white cell performs N_bit = 10 bit-wide pattern matching. With such mapping, white cells in the same column perform matching within the same block of data at any given cycle [Fig. 10(b) and (d)]. Therefore, the number of pattern matching operations per block (U) is given by the number of white cells in a column, which is roughly equal to half of the total number of unit cells per column. The number of different blocks (W) is given by the number of unit cells in a row,

i.e., width of the unit cell array, so that the total number of patterns which can be matched by an array is roughly half of the total number of unit cells in the array.

Fig. 11. General idea for mapping image processing tasks to the proposed CMOL FPGA architecture.

B. 2-D Pattern Matching

The main difference for mapping of image processing tasks is that both streaming data and patterns are 2-D. For example, in automatic target recognition (ATR) systems a 128 × 128 array of 8-bit pixels is searched for potential targets which are described by 16 × 16 binary pixel templates [36]. The bottleneck operation in the ATR algorithm is the 1-bit correlation between the input image and a template, which has to be done for all possible relative offsets and for a rather large number of templates. (It should be noted that contemporary ATR systems work with much larger input and template image sizes and higher data rates, e.g., as required for hyperspectral image processing [69].)

Naturally, a correlation operation which produces a multibit output value cannot be done with a single unit cell in digital CMOL FPGA circuits. On the other hand, a combined operation of correlation with thresholding, which is effectively an approximate pattern matching, is straightforward if the unit cell is configured to implement a linear threshold gate. Such an operation might be sufficient for eliminating bottleneck processing in ATR and other related image processing tasks.

Similar to the previously considered mapping scheme, the maximum utilization is achieved with a balanced number of white and green cells (Fig. 11). Let us assume that the connectivity domain is large enough that each unit cell performs matching between streaming data of the input image and the whole template. (The proposed scheme can be extended to the case when the domain is smaller, by performing matching operations for parts of the template instead.) Let us also assume that a K × K 2-D image is pushed through a pipeline formed by green cells, e.g., from left to right, by one unit cell position each cycle. Performing correlation for one template for all possible vertical offsets in one cycle requires programming a column of K white cells to perform matching with the same template. Evaluation of all horizontal offsets would just require K cycles to push data (from left to right) past that column of white cells. The next column of white cells can be programmed to perform matching for a second template and so on. The results of the pattern matching operations might be summed all together, pushed to the bottom of the array, and then propagated to the right.

Many alternative implementations are also possible. For example, for faster processing (though sacrificing the total number of templates) the streaming data can be pushed by several unit cell positions in one cycle. In this case, more than one column of white cells should be allocated per template, but the number of cycles to check for all possible offsets is proportionally smaller.

V. PERFORMANCE MODELING

We have modeled the performance of a generic 1-D pattern matching task and found the optimal parameters to maximize the possible pattern matching throughput per unit area. Before discussing the details of the optimization procedure, let us first outline some common assumptions for the devices, diode–resistor logic, and its performance modeling, while focusing on the more promising dynamic logic counterpart of this architecture.

A. Nanowires

The resistance of a 2F_nano-long nanowire segment is approximated using Matthiessen's formula [70], which accounts for the increase in the resistivity ρ due to surface scattering effects in nanoscale wires, i.e.,

$$R_{\mathrm{wire}} \approx \rho\,\frac{2F_{\mathrm{nano}}}{A\,(F_{\mathrm{nano}})^2} \approx \frac{2\rho_{\mathrm{bulk}}}{A\,F_{\mathrm{nano}}}\left(1 + \frac{\lambda}{F_{\mathrm{nano}}}\right). \tag{4}$$

Here, A is the relative thickness of the nanowires with respect to their width (i.e., the cross-sectional aspect ratio), while λ is the mean free path of the electrons.

The capacitance of a 2F_nano-long nanowire segment is approximated analytically by using the equation

$$C_{\mathrm{wire}} \approx \varepsilon_0\varepsilon_1\frac{F_{\mathrm{nano}}^2}{2d} + \varepsilon_0\varepsilon_2\frac{F_{\mathrm{nano}}^2}{2d} + A\,\varepsilon_0\varepsilon_2\,4F_{\mathrm{nano}} + A\,\varepsilon_0\varepsilon_2\,\frac{4F_{\mathrm{nano}}}{\log\left(\frac{F_{\mathrm{nano}}}{d} + 10\right)} \tag{5}$$

which was verified using COMSOL simulations. Here, d is the thickness of the thin film, i.e., the distance between the two mutually perpendicular sets of crossbar nanowires, ε_0 is the vacuum permittivity, ε_1 and ε_2 are the dielectric constants of the nanodevices and the surrounding insulator, respectively, and the constant 10 was determined by fitting (5) to the numerical simulations. Note that on the right-hand side of (5), the first two terms crudely correspond to the parallel plate capacitance, the third term is the interlayer side wall capacitance, and the last term is the side wall capacitance between crossbar nanowires in the same layer.

In our performance analysis, we assume copper crossbar nanowires with A = 0.1 and ρ_bulk = 1.7 × 10⁻⁸ Ω·m. Using λ = 40 nm, (4) yields an accurate approximation for both grain and surface scattering as reported in the International Technology Roadmap for Semiconductors (ITRS) [71]. For capacitance estimates, we assume ε_1 = 3.9, ε_2 = 2.5, and d = 5 nm, which is representative of SiO2 memristive devices.
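As a quick numerical check of (4), the segment resistance can be evaluated directly. The Python sketch below is illustrative only: the function names are hypothetical, the default arguments are the values quoted in the text (A = 0.1, ρ_bulk = 1.7 × 10⁻⁸ Ω·m, λ = 40 nm), and the list of F_nano values in the usage loop is an arbitrary choice made for this example.

```python
def wire_resistivity(f_nano, rho_bulk=1.7e-8, mean_free_path=40e-9):
    """Size-dependent resistivity rho = rho_bulk * (1 + lambda / F_nano), in ohm*m."""
    return rho_bulk * (1.0 + mean_free_path / f_nano)


def segment_resistance(f_nano, aspect=0.1, rho_bulk=1.7e-8, mean_free_path=40e-9):
    """Resistance in ohms of a 2*F_nano-long segment, following (4).

    The wire is F_nano wide and aspect*F_nano thick, so its cross-section
    is aspect * F_nano**2.
    """
    length = 2.0 * f_nano
    cross_section = aspect * f_nano ** 2
    return wire_resistivity(f_nano, rho_bulk, mean_free_path) * length / cross_section


if __name__ == "__main__":
    # Illustrative half-pitch values in nanometers (not taken from the paper's sweeps).
    for f_nm in (22, 45, 90, 130):
        r = segment_resistance(f_nm * 1e-9)
        print(f"F_nano = {f_nm} nm: R_wire ~ {r:.1f} ohm")
```

With these assumptions a single segment at F_nano = 22 nm comes out at roughly 44 Ω, so the series resistance of a full nanowire grows with the number of segments it spans.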

B. Diode–Resistor Logic

Let us assume that at least a fraction γ of the voltage applied from the CMOS cell is dropped across the nanodevice and at most a fraction 1 − γ is dropped on the nanowire and the pass gate connecting the outputs of the DFF and the corresponding vias. This can be satisfied by choosing R_ON and R_pass according to

$$R_{\mathrm{ON}} \ge \frac{(1-\chi)\,\gamma M\,(M R_{\mathrm{wire}} + R_{\mathrm{pass}})}{1-\gamma} \tag{6}$$

where the factor (1 − χ)M is effectively the maximum fan-out for each output of the DFF cell (3), i.e., the largest permitted fraction of the devices in the ON state on each of the two output nanowires.

For dynamic logic operation, the slowest (worst case) mismatch operation corresponds to charging via a single nanodevice in the ON state. For the considered asymmetric I–V characteristics, this charging time will always be faster than that of the match operation, and the worst case voltage margins are given by

$$V \approx \frac{V_R}{1 + 2M R_{\mathrm{ON}}/R_{\mathrm{OFF}}}. \tag{7}$$

The safe margins can in principle be calculated from noise and CMOS variations analysis [19]. Equation (7), however, shows that such analysis can be simplified by selecting a large enough R_OFF so that the margins are comparable with V_R. For example, requiring R_OFF/R_ON > 2M, which is a very reasonable assumption as we show below, results in V ≈ V_R/2, and justifies neglecting leakage via OFF state devices in (6), as well as in the delay and power estimates.

In our simulations, we assume V_R = 1 V and γ = 0.9.

C. Area, Delay, and Power

According to Section III-C, the area of a unit cell is

$$A_{\mathrm{cell}} = 2(2\beta F_{\mathrm{CMOS}})^2 \approx 2M(2F_{\mathrm{nano}})^2. \tag{8}$$

To calculate β_min, the minimum cell area is estimated similarly to [72], i.e., by counting the number of transistors in the cell and modeling the area of each transistor according to its driving strength.

Specifically, out of the total 54 transistors in a cell, we assume that there are 22 minimum-size transistors, including those used for configuration circuits, which do not have to support high currents because of the diode-like asymmetric I–V characteristics of memristors. There are also 20 transistors that compose the DFF, which are sized according to 3-input and 2-input NAND gates, as well as a Schmitt trigger that is composed of 6 transistors, 2 of which are minimum sized, 2 of size 2, and 2 of size 4. On the other hand, we assume that the 4 transistors used in the output drivers and the 4 transistors of the two pass gates controlled by enable lines (Fig. 6) are scaled up to accommodate the current driving requirements of the memristor layer for proper operation of diode–resistor logic. For simplicity, all nonminimum-size transistors are scaled up equally and their area is estimated as

$$A_{\mathrm{tran}} \approx \left(0.5 + 0.5\,\frac{(R_{\mathrm{pass}})_{\max}}{R_{\mathrm{pass}}}\right) \times 25F_{\mathrm{CMOS}}^2 \tag{9}$$

where (R_pass)_max is the effective drain–source resistance of the minimum-size transistor at saturation for the specific F_CMOS node. Note that using R_pass = (R_pass)_max in (9) assumes that the area of the minimum-size transistor is 25F_CMOS².

For the dynamic logic, the delay is estimated as

$$\tau \approx 2(2MC_{\mathrm{wire}} + C_{\mathrm{gate}})R_{\mathrm{ON}} \tag{10}$$

where 2MC_wire is the capacitance of two nanowires, while C_gate is the total capacitance of the CMOS circuitry at the input of the DFF, including its gate capacitance and the drain capacitances of the configuration and pull-down pass gates. The additional factor of 2 accounts for both precharging and evaluation phases, which is a rather conservative assumption given that precharging currents are not limited by the R_ON value. Note that for all studied parameters C_gate is typically much smaller than 2MC_wire.

The average power per unit cell is dominated by the dynamic component, for which the upper bound is evaluated as

$$P_{\mathrm{cell}} \approx \frac{(2 \times 2MC_{\mathrm{wire}} + C_{\mathrm{gate}})V_R^2}{\tau}. \tag{11}$$

Equation (11) implies an activity factor of 1 for the DFF's input and output nanowires and input CMOS circuitry, i.e., their charging and discharging within a single clock cycle. This is quite a pessimistic assumption given that matching events can be assumed to be rare on average and hence the outputs of all cells configured to perform pattern matching will not change. Still, as we show later, the total power, which should be less than the maximum allowable power density p_max, i.e.,

$$P_{\mathrm{cell}} \le p_{\max} A_{\mathrm{cell}} \tag{12}$$

is rarely a limiting factor for performance.

In our simulations, we assume p_max = 200 W/cm², and (R_pass)_max = 27.3/13.3/6.6/4.6 kΩ and C_gate = 7.5/22.5/76.2/135 fF for the considered F_CMOS = 22/45/90/130 nm nodes, respectively, all typical values specified by ITRS [71].

D. Throughput and Energy Per Bit

Given the area of the chip A_chip, the total number of cells is N_cell = A_chip/A_cell, while the total number of patterns N_pattern that can be stored in memristors and compared in one cycle is

$$N_{\mathrm{pattern}} \approx \chi N_{\mathrm{cell}}. \tag{13}$$

The total number of pattern bits stored in a chip is, therefore,

$$N_{\mathrm{total}} = N_{\mathrm{bit}} N_{\mathrm{pattern}} \approx (1-\chi)\chi M \frac{A_{\mathrm{chip}}}{A_{\mathrm{cell}}} \approx \frac{(1-\chi)\chi A_{\mathrm{chip}}}{8F_{\mathrm{nano}}^2} \tag{14}$$

and the aggregate pattern matching throughput is

$$T = \frac{N_{\mathrm{total}}}{\tau}. \tag{15}$$

Another metric of interest is the energy consumed during pattern matching operations per single bit, which is simply

$$E_{\mathrm{bit}} = \frac{P_{\mathrm{cell}} N_{\mathrm{cell}}\,\tau}{N_{\mathrm{total}}}. \tag{16}$$
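The chain from (3) and (6) through (16) can be written down compactly. The Python sketch below is a minimal illustration under stated assumptions rather than the authors' simulation code: the `Params` container and all function names are hypothetical, and the wire and gate values a caller supplies (`r_wire`, `c_wire`, `r_pass`, `c_gate`, `m`) are placeholders that would normally come from (1), (2), (4), (5), and (9).

```python
from dataclasses import dataclass


@dataclass
class Params:
    """Illustrative inputs; names are assumptions made for this sketch."""
    f_nano: float        # nanowire half-pitch F_nano, m
    m: int               # connectivity domain size M, in unit cells
    chi: float           # fraction of unit cells performing pattern matching
    r_wire: float        # resistance of one 2*F_nano segment, ohm
    c_wire: float        # capacitance of one 2*F_nano segment, F
    r_pass: float        # pass-gate resistance, ohm
    c_gate: float        # CMOS input capacitance, F
    v_r: float = 1.0     # V_R assumed in the text
    gamma: float = 0.9   # gamma assumed in the text


def r_on_min(p):
    """Smallest R_ON allowed by the diode-resistor logic constraint (6)."""
    return (1 - p.chi) * p.gamma * p.m * (p.m * p.r_wire + p.r_pass) / (1 - p.gamma)


def delay(p, r_on):
    """Dynamic logic delay (10)."""
    return 2 * (2 * p.m * p.c_wire + p.c_gate) * r_on


def cell_power(p, r_on):
    """Upper bound on the dynamic power per unit cell (11)."""
    return (2 * 2 * p.m * p.c_wire + p.c_gate) * p.v_r ** 2 / delay(p, r_on)


def cell_area(p):
    """Unit cell area, right-hand side of (8)."""
    return 2 * p.m * (2 * p.f_nano) ** 2


def n_bits(p):
    """Bits per pattern (3)."""
    return (1 - p.chi) * p.m


def n_patterns(p, a_chip):
    """Patterns compared per cycle (13)."""
    return p.chi * a_chip / cell_area(p)


def total_bits(p, a_chip):
    """Total stored pattern bits (14)."""
    return n_bits(p) * n_patterns(p, a_chip)


def throughput(p, a_chip, r_on):
    """Aggregate pattern matching throughput (15), bits per second."""
    return total_bits(p, a_chip) / delay(p, r_on)


def energy_per_bit(p, a_chip, r_on):
    """Energy per matched bit (16), joules."""
    n_cell = a_chip / cell_area(p)
    return cell_power(p, r_on) * n_cell * delay(p, r_on) / total_bits(p, a_chip)
```

Two properties follow directly from the formulas in this form: `total_bits` depends on χ only through χ(1 − χ), so it peaks at χ = 0.5, and the product of (11) and (10) cancels R_ON, so `energy_per_bit` is set by the capacitances, V_R, M, and χ alone.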

Fig. 12. Optimal value of R_ON and the pattern matching throughput as a function of driving strength for a particular case F_CMOS = F_nano = 22 nm and χ = 0.5.

E. Optimization and Simulation Results

In general, our goal is to find the maximum possible pattern matching throughput per unit area, i.e., the largest T/A_chip, for a particular choice of mapping scenario χ and technology feature sizes F_CMOS and F_nano, by optimizing R_pass and R_ON. In particular, we use a brute-force approach, by first sweeping via a broad range of R_pass values, starting with its maximum possible value (R_pass)_max. Naturally, the technology parameters, mapping scenario, and R_pass impact the area of the cell and hence all dependent topological and application mapping parameters, i.e., r, β, β_min, A_cell, M_A, M, N_bit, N_pattern, N_total defined by (1)–(3), (13), and (14). We then sweep via a realistic (implementable) range of R_ON, which is constrained from below by proper operation of diode–resistor logic defined by (6) and (in very few cases) by the power budget, i.e., (11) and (12).

For fixed technology parameters and mapping scenario, the throughput peaks at a certain R_pass value and, for any specific R_pass, the best throughput is always obtained using the smallest permissible value of R_ON (Fig. 12). The detailed results of the optimization for χ = 0.5 and the impact of χ on throughput are shown in Figs. 13 and 14, respectively.

VI. DISCUSSION AND SUMMARY

To get intuition behind the optimization procedure, let us first note that for a fixed chip area, the total number of stored bits depends only on F_nano and χ [see (14) and (15)], so that the largest throughput is achieved by minimizing the delay τ. In turn, the delay, which depends on the product of R_ON and M, is minimized by tuning the value of R_pass. Indeed, for relatively large values of R_pass, the cell area weakly depends on pass transistor scaling, and decreasing R_pass results in smaller R_ON because of (6). However, when scaled pass transistors start dominating the cell area, a further decrease in R_pass is counterproductive because of the increase in M due to (1) and (2). As a result, there is a certain optimum value of R_pass corresponding to the smallest delay and largest throughput (Fig. 12). Interestingly, the optimal driving strength, i.e., (R_pass)_max/R_pass, is always close to ∼15 for all studied cases of F_CMOS, F_nano, and χ, at which the area of the pass gates and drive circuits is comparable to that of the remaining circuitry in a cell.

The results of the optimization, as shown in Fig. 13, show that though the throughput reaches its maximum value at around F_nano = F_CMOS, it remains relatively constant across changes in F_nano for fixed values of F_CMOS. This can be attributed to two simultaneous factors that have opposing effects on throughput. On one hand, increasing F_nano reduces the size of the connectivity domain, which in turn reduces the value of R_ON, since the pass gates have to support a smaller number of devices, and leads to faster operation. On the other hand, reducing the size of the connectivity domain reduces the number of bits matched and hence reduces the throughput. The simulation results also show that the throughput is roughly proportional to 1/F_CMOS. This is because for a given F_nano, the decrease in CMOS feature size leads to a proportionally smaller connectivity factor and, in turn, proportionally smaller optimal values of R_ON.

Another interesting result from the optimization procedure [as shown in Fig. 13(d)] is the independence of the energy per bit metric from F_CMOS. Intuitively, this means that E_bit is determined by the dynamic energy of a single, 2F_nano-long segment of the crossbar wire (match line).

Furthermore, though χ = 0.5 corresponds to the maximum number of stored pattern matching bits, as discussed in Section IV, Fig. 14 shows that the largest throughput is achieved with 1 − χ < 0.5. Indeed, the reduction of 1 − χ allows a smaller R_ON to be supported by the same size pass gates. This reduction in R_ON in turn allows for a reduction in τ that outweighs the loss in throughput resulting from the lower number of pattern matching bits (N_bit). This reduction in 1 − χ is also met with a commensurate increase in the number of pattern matching cells (larger χ), hence the throughput does not drop drastically. (Note that a further decrease in 1 − χ, beyond what is shown in Fig. 14, will eventually lead to a drop in the throughput due to the power density constraint.)

The considered values of R_ON and R_OFF are quite realistic for most cases, which indicates the practicality of manufacturing such circuits. For example, at χ = 0.5 and F_nano = F_CMOS, the optimal value of R_ON is always around 10 MΩ [Fig. 13(b)], while the corresponding ON/OFF ratio from (7) and Fig. 13(a) is R_OFF/R_ON ≥ 2 × 10³, which is not uncommon for memristors [24]. It should also be noted that while the considered nanowire aspect ratio A = 0.1 is a rather conservative choice, the throughput is not very sensitive to A and would actually slightly improve by considering even smaller A. This is because wire resistance is rarely a limiting factor in our optimization and A only impacts the fringe capacitance of the crossbar wires.

The maximum throughput for F_CMOS = F_nano = 22 nm and χ = 0.5 is close to 8 × 10¹⁶ bits/s/cm² for matching of ∼10⁷ 250-bit patterns, assuming practical power consumption density [Fig. 13(f)]. This number compares very favorably with the reported state-of-the-art FPGA performance (Table I). (Note that because the largest throughput in our circuit is always achieved at F_nano = F_CMOS, we report only one feature size for our work.) Our reported throughput vastly exceeds FPGA-only implementations and rivals state-of-the-art TCAM-based implementations such as [44] and [49]. We expect that algorithmic improvements, in particular the use of common subexpression elimination techniques, will increase the throughput of the proposed circuits even further.

Even more important is the fact that the proposed circuits could potentially offer much higher pattern capacity without any performance penalty. Because the number of storage

Fig. 13. Simulation results for the optimal value of R_pass, χ = 0.5, and specific values of F_CMOS and F_nano [73]. (a) Connectivity factor of a unit cell. (b) Optimal value of R_ON determined as a minimum resistance satisfying two constraints: proper operation of diode–resistor logic (6) and power density budget (12). (c) Capacitance of a segment and the whole wire. (d) Energy per bit (which is the same for all values of F_CMOS). (e) Diode–resistor logic delay (clock cycle time). (f) Aggregate pattern matching throughput per unit area. Note that for the shown case, R_ON is always determined by (6).

TABLE I
PERFORMANCE OF VARIOUS PATTERN MATCHING ARCHITECTURES. (# The numbers for both TCAM and FPGA implementations are rather optimistic. Only area of memory cells is taken into account for the former, while FPGA numbers are estimated based on the reported logic utilization.)

elements in existing hardware-based pattern matchers is limited by the 2-D chip area, they must be dynamically reconfigured to accommodate additional patterns that are beyond their storage capabilities and rely on OFF-chip storage. Dynamic reconfiguration is rather slow and very energy inefficient, thus we expect that the throughput for a fixed area for higher capacity pattern matching tasks will be considerably smaller than the ideal value. On the other hand, it should be possible to support larger bit capacity in the proposed circuits by integrating more crossbar layers, e.g., similar to 3-D CMOL circuits [64], [65], without a large penalty in throughput.

Even though our performance analysis is somewhat simplified, we believe that we accounted for the most important factors. For example, CMOS process variations, which are critical for diode–resistor logic operation, can be effectively dealt with by appropriate scaling of the CMOS transistors. The additional overhead of such scaling, as well as the additional area due to bulky programming circuitry which might be required for memristors with large write voltages [74], should not change our simulation results much because the pass gate/drive circuit transistors are already large for the optimal cases. Though the dynamic logic is susceptible to capacitive coupling noise, it should be possible

to minimize its effect by balancing signal transitions. For example, the CMOL topology and application mapping can be further optimized to avoid Miller effects. Based on our experience with CMOL design [75] and the quite large area of CMOL cells in this paper, the area overhead of peripheral decoders and the clock distribution network should also be insignificant. As is evident from the choice of the optimal values of R_ON [Fig. 13(b)], the dynamic power consumption of our circuits is always well below 200 W/cm². Our estimates show that the neglected dynamic power of the clock distribution network (especially considering the relatively slow cycle times at optimal F_nano = F_CMOS) and the static power are also always limited to the sub-watt range.

Fig. 14. Maximum throughput as function of mapping scenario for several values of F_CMOS (similar to the ones used in Fig. 13).

Perhaps the most critical challenge toward practical realization of the proposed circuits is that the fabrication technology for the memristive devices, especially for their passive back-end-of-line integration, is in need of improvement. A particular concern is memristor nonidealities, such as current fluctuations due to drift in the memristor state and random telegraph noise, and variations in the switching threshold voltages. One possible strategy to deal with these issues is to identify defective devices during the test stage and avoid them during application mapping [19], [76]. On the other hand, variations in the ON and OFF resistance states would be less problematic due to, e.g., the possibility of fine-tuning memristor conductances using a simple tuning algorithm. Also, though the cycling endurance of many memristors is generally much lower than that of volatile memories, it should still be adequate for many FPGA applications, given that the memristor states are only switched during the reconfiguration stage and remain unchanged during logic operation.

In summary, in this paper, we proposed new CMOL FPGA circuits for high-throughput computation. The performance advantage of the novel circuits is mainly due to the very high density of nanoscale devices and the very tight and synergetic integration of memory and logic functions. The tight integration is enabled by the high communication bandwidth of the area-distributed interface between the nano and CMOS subsystems, while the synergy is due to flexible resource allocation that allows nanodevices to be used either as a TCAM cell or to implement programmable logic/interconnect. Though CMOL circuits are essential for getting high bandwidth between memory and logic subsystems, other stacking schemes with area-distributed connectivity, such as through-silicon via technology, and different memory devices, such as flash memory, might be suitable for the proposed concept. Understanding the prospects of using such mature and readily available device and integration technologies is one of the important future research directions.

ACKNOWLEDGMENT

The authors would like to thank R. Brayton, A. Mishchenko, and K. K. Likharev for their useful discussions.

CONFLICTS OF INTEREST

The presented work is built upon our previous results reported in [30]–[32]. The material from these papers is included in Sections III and IV-A. However, Sections IV-B, V, and VI elaborate on the new results, which were never published before.

In particular, in this paper we present the following, for the first time.
1) A performance analysis of the proposed circuit. Our estimates also account for the sizing of CMOS circuits, which was generally neglected in all previous CMOL FPGA works but, as we show in this paper, is crucial for providing correct functionality.
2) An optimization procedure that considers architectural, topological, and circuit level constraints to maximize the throughput of the proposed circuits.
3) An additional application case study.

REFERENCES

[1] S. Hauck and A. DeHon, Reconfigurable Computing: The Theory and Practice of FPGA-Based Computation, vol. 1. San Mateo, CA, USA: Morgan Kaufmann, 2010.
[2] Q. Gu, T. Takaki, and I. Ishii, “Fast FPGA-based multiobject feature extraction,” IEEE Trans. Circuits Syst. Video Technol., vol. 23, no. 1, pp. 30–45, Jan. 2013.
[3] C. Gentsos, C.-L. Sotiropoulou, S. Nikolaidis, and N. Vassiliadis, “Real-time canny edge detection parallel implementation for FPGAs,” in Proc. ICECS, Dec. 2010, pp. 499–502.
[4] Y. H. Cho and W. H. Mangione-Smith, “Deep network packet filter design for reconfigurable devices,” ACM Trans. Embedded Comput. Syst., vol. 7, no. 2, Feb. 2008, Art. no. 21.
[5] L. Tan and T. Sherwood, “Architectures for bit-split string scanning in intrusion detection,” IEEE Micro, vol. 26, no. 1, pp. 110–117, Jan. 2006.
[6] T. N. Thinh, T. T. Hieu, V. Q. Dung, and S. Kittitornkun, “A FPGA-based deep packet inspection engine for network intrusion detection system,” in Proc. 9th Int. Conf. Elect. Eng./Electron., Comput., Telecommun. Inf. Technol. (ECTI-CON), May 2012, pp. 1–4.
[7] H. Le and V. K. Prasanna, “A memory-efficient and modular approach for large-scale string pattern matching,” IEEE Trans. Comput., vol. 62, no. 5, pp. 844–857, May 2013.
[8] Y. Xin et al., “Parallel architecture for DNA sequence inexact matching with Burrows-Wheeler transform,” Microelectron. J., vol. 44, no. 8, pp. 670–682, Aug. 2013.
[9] D. Lavenier, G. Georges, and X. Liu, “A reconfigurable index FLASH memory tailored to seed-based genomic sequence comparison algorithms,” J. VLSI Signal Process., vol. 48, no. 3, pp. 255–269, 2007.
[10] Q. Zhang, R. D. Chamberlain, R. S. Indeck, B. M. West, and J. White, “Massively parallel data mining using reconfigurable hardware: Approximate string matching,” in Proc. 18th Int. Parallel Distrib. Process. Symp., Apr. 2004, p. 259.
[11] M. B. Anwer, M. Motiwala, M. bin Tariq, and N. Feamster, “SwitchBlade: A platform for rapid deployment of network protocols on programmable hardware,” ACM SIGCOMM Comput. Commu. Rev., vol. 40, no. 4, pp. 183–194, Oct. 2010.
[12] T. Sherwood, G. Varghese, and B. Calder, “A pipelined memory architecture for high throughput network processors,” in Proc. 30th Annu. Int. Symp. Comput. Archit., Jun. 2003, pp. 288–299.
[13] C. R. Meiners, J. Patel, E. Norige, E. Torng, and A. X. Liu, “Fast regular expression matching using small TCAMs for network intrusion detection and prevention systems,” in Proc. 19th USENIX conf. Secur., Aug. 2010, p. 8.

[14] K. Pagiamtzis and A. Sheikholeslami, “Content-addressable memory (CAM) circuits and architectures: A tutorial and survey,” IEEE J. Solid-State Circuits, vol. 41, no. 3, pp. 712–727, Mar. 2006.
[15] M. R. Stan, P. D. Franzon, S. C. Goldstein, J. C. Lach, and M. M. Ziegler, “Molecular electronics: From devices and interconnect to circuits and architecture,” Proc. IEEE, vol. 91, no. 11, pp. 1940–1957, Nov. 2003.
[16] A. DeHon, “Nanowire-based programmable architectures,” ACM J. Emerg. Technol. Comput. Syst., vol. 1, no. 2, pp. 109–162, Jul. 2005.
[17] K. K. Likharev and D. B. Strukov, “CMOL: Devices, circuits, and architectures,” in Introducing Molecular Electronics. New York, NY, USA: Springer-Verlag, 2006, pp. 447–477.
[18] X. Tang, P.-E. Gaillardon, and G. De Micheli, “A high-performance low-power near-Vt RRAM-based FPGA,” in Proc. Int. Conf. Field-Program. Technol. (FPT), Dec. 2014, pp. 207–214.
[19] D. B. Strukov and K. K. Likharev, “CMOL FPGA: A reconfigurable architecture for hybrid digital circuits with two-terminal nanodevices,” Nanotechnology, vol. 16, no. 6, p. 888, 2005.
[20] D. B. Strukov and K. K. Likharev, “A reconfigurable architecture for hybrid CMOS/nanodevice circuits,” in Proc. ACM/SIGDA 14th Int. Symp. FPGAs, 2006, pp. 131–140.
[21] D. B. Strukov and K. K. Likharev, “Reconfigurable hybrid CMOS/nanodevice circuits for image processing,” IEEE Trans. Nanotechnol., vol. 6, no. 6, pp. 696–710, Nov. 2007.
[22] Q. Xia et al., “Memristor-CMOS hybrid integrated circuits for reconfigurable logic,” Nano Lett., vol. 9, no. 10, pp. 3640–3645, 2009.
[23] D. Strukov and A. Mishchenko, “Monolithically stackable hybrid FPGA,” in Proc. Conf. Design, Autom. Test Eur., Mar. 2010, pp. 661–666.
[24] J. J. Yang, D. B. Strukov, and D. R. Stewart, “Memristive devices for computing,” Nature Nanotechnol., vol. 8, no. 1, pp. 13–24, 2013.
[25] K. K. Likharev, “Hybrid CMOS/nanoelectronic circuits: Opportunities and challenges,” J. Nanoelectron. Optoelectron., vol. 3, no. 3, pp. 203–230, 2008.
[26] A. M. Arafeh and S. M. Sait, “Cells reconfiguration around defects in CMOS/nanofabric circuits using simulated evolution heuristic,” in Proc. ISQED, Mar. 2015, pp. 581–588.
[27] W. N. N. Hung, C. Gao, X. Song, and D. Hammerstrom, “Defect-tolerant CMOL cell assignment via satisfiability,” IEEE Sensors J., vol. 8, no. 6, pp. 823–830, Jun. 2008.
[28] Z.-L. Pan, L. Chen, and G.-Z. Zhang, “Efficient design method for cell allocation in hybrid CMOS/nanodevices using a cultural algorithm with chaotic behavior,” Frontiers Phys., vol. 11, no. 2, p. 116201, Apr. 2016.
[29] S. M. Sait and A. M. Arafeh, “Cell assignment in hybrid CMOS/nanodevices architecture using Tabu search,” Appl. Intell., vol. 40, no. 1, pp. 1–12, Jan. 2014.
[30] D. B. Strukov, “Hybrid CMOS/nanodevice circuits with tightly integrated memory and logic functionality,” in Proc. Nanotechnol., vol. 11. 2011, pp. 9–12.
[31] F. Alibart, T. Sherwood, and D. B. Strukov, “Hybrid CMOS/nanodevice circuits for high throughput pattern matching applications,” in Proc. NASA/ESA Conf. Adapt. Hardw. Syst. (AHS), Jun. 2011, pp. 279–286.
[32] A. Madhavan and D. B. Strukov, “Mapping of image and network processing tasks on high-throughput CMOL FPGA circuits,” in Proc. IEEE/IFIP 20th Int. Conf. VLSI Syst.-Chip (VLSI-SoC), Oct. 2012, pp. 82–87.
[33] B. Schmidt, Bioinformatics: High Performance Parallel Computer Architectures. Boca Raton, FL, USA: CRC Press, 2010.
[34] Z. K. Baker and V. K. Prasanna, “Time and area efficient pattern matching on FPGAs,” in Proc. FPGA, 2004, pp. 223–232.
[35] R. Tessier and W. Burleson, “Reconfigurable computing for digital signal processing: A survey,” J. VLSI Signal Process., vol. 28, nos. 1–2, pp. 7–27, May 2001.
[36] K.-N. Chia et al., “Configurable computing solutions for automatic target recognition,” in Proc. IEEE Symp. FPGAs Custom Comput. Mach., Apr. 1996, pp. 70–79.
[37] J. Yang, L. Jiang, Q. Tang, Q. Dai, and J. Tan, “PiDFA: A practical multi-stride regular expression matching engine based on FPGA,” in Proc. IEEE Int. Conf. Commun. (ICC), May 2016, pp. 1–7.
[38] Y.-H. Yang and V. Prasanna, “High-performance and compact architecture for regular expression matching on FPGA,” IEEE Trans. Comput., vol. 61, no. 7, pp. 1013–1025, Jul. 2012.
[39] N. L. Or, X. Wang, and D. Pao, “MEMORY-based hardware architectures to detect ClamAV virus signatures with restricted regular expression features,” IEEE Trans. Comput., vol. 65, no. 4, pp. 1225–1238, Apr. 2016.
[40] H. Kim and K.-I. Choi, “A pipelined non-deterministic finite automaton-based string matching scheme using merged state transitions in an FPGA,” PLoS ONE, vol. 11, no. 10, p. e0163535, 2016.
[41] H. J. Kim, “A failureless pipelined Aho-Corasick algorithm for FPGA-based parallel string matching engine,” in Information Science and Applications. Berlin, Germany: Springer, 2015, pp. 157–164.
[42] X. Wang and D. Pao, “Memory-based architecture for multicharacter Aho–Corasick string matching,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 26, no. 1, pp. 143–154, Jan. 2017.
[43] I. Kuon and J. Rose, “Measuring the gap between FPGAs and ASICs,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 26, no. 2, pp. 203–215, Feb. 2007.
[44] J. Li et al., “1Mb 0.41 μm² 2T-2R cell nonvolatile TCAM with two-bit encoding and clocked self-referenced sensing,” in Proc. Symp. VLSI Circuits (VLSIC), Jun. 2013, pp. C104–C105.
[45] S. Matsunaga et al., “Fully parallel 6T-2MTJ nonvolatile TCAM with single-transistor-based self match-line discharge control,” in Proc. Symp. VLSI Circuits (VLSIC), Jun. 2011, pp. 298–299.
[46] Q. Guo, X. Guo, Y. Bai, and E. Ipek, “A resistive TCAM accelerator for data-intensive computing,” in Proc. Micro, Dec. 2011, pp. 339–350.
[47] Q. Guo, X. Guo, R. Patel, E. Ipek, and E. G. Friedman, “AC-DIMM: Associative computing with STT-MRAM,” ACM SIGARCH Comput. Archit. News, vol. 41, no. 3, pp. 189–200, 2013.
[48] T. Hanyu, N. Kanagawa, and M. Kameyama, “Non-volatile one-transistor-cell multiple-valued CAM with a digit-parallel-access scheme and its applications,” Comput. Elect. Eng., vol. 23, no. 6, pp. 407–414, 1997.
[49] S. Matsunaga et al., “Standby-power-free compact ternary content-addressable memory cell chip using magnetic tunnel junction devices,” Appl. Phys. Exp., vol. 2, no. 2, p. 023004, 2009.
[50] W. Xu, T. Zhang, and Y. Chen, “Design of spin-torque transfer magnetoresistive RAM and CAM/TCAM with high sensing and search speed,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 1, pp. 66–74, Jan. 2010.
[51] K. Eshraghian, K.-R. Cho, O. Kavehei, S.-K. Kang, D. Abbott, and S.-M. S. Kang, “Memristor MOS content addressable memory (MCAM): Hybrid architecture for future high performance search engines,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 19, no. 8, pp. 1407–1417, Aug. 2011.
[52] I. Arsovski, T. Chandler, and A. Sheikholeslami, “A ternary content-addressable memory (TCAM) based on 4T static storage and including a current-race sensing scheme,” IEEE J. Solid-State Circuits, vol. 38, no. 1, pp. 155–158, Jan. 2003.
[53] H. Noda et al., “A cost-efficient high-performance dynamic TCAM with pipelined hierarchical searching and shift redundancy architecture,” IEEE J. Solid-State Circuits, vol. 40, no. 1, pp. 245–253, Jan. 2005.
[54] T. Kusumoto, D. Ogawa, K. Dosaka, M. Miyama, and Y. Matsuda, “A charge recycling TCAM with Checkerboard Array arrangement for low power applications,” in Proc. IEEE Asian Solid-State Circuits Conf. (A-SSCC), Nov. 2008, pp. 253–256.
[55] Y.-D. Kim, H.-S. Ahn, S. Kim, and D.-K. Jeong, “A high-speed range-matching TCAM for storage-efficient packet classification,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 56, no. 6, pp. 1221–1230, Jun. 2009.
[56] J.-S. Wang, H.-Y. Li, C.-C. Chen, and C. Yeh, “An AND-type match-line scheme for energy-efficient content addressable memories,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2005, pp. 464–610.
[57] M. Imani, A. Rahimi, D. Kong, T. Rosing, and J. M. Rabaey, “Exploring hyperdimensional associative memory,” in Proc. HPCA, Feb. 2017, pp. 445–456.
[58] I. Roy, A. Srivastava, M. Nourian, M. Becchi, and S. Aluru, “High performance pattern matching using the automata processor,” in Proc. IEEE Int. Parallel Distrib. Process. Symp., May 2016, pp. 1123–1132.
[59] K. Zhou, J. Wadden, J. J. Fox, K. Wang, D. E. Brown, and K. Skadron, “Regular expression acceleration on the Micron automata processor: Brill tagging as a case study,” in Proc. IEEE Int. Conf. Big Data (Big Data), Oct./Nov. 2015, pp. 355–360.
[60] B. Govoreanu et al., “10 × 10 nm² Hf/HfOx crossbar resistive RAM with excellent performance, reliability and low-energy operation,” in IEDM Tech. Dig., Dec. 2011, pp. 6–31.
[61] S. Pi, P. Lin, and Q. Xia, “Cross point arrays of 8 nm × 8 nm memristive devices fabricated with nanoimprint lithography,” J. Vac. Sci. Technol. B, Microelectron. Process. Phenom., vol. 31, no. 6, p. 06FA02, 2013.
[62] D. B. Strukov and H. Kohlstedt, “Resistive switching phenomena in thin films: Materials, devices, and applications,” MRS Bull., vol. 37, no. 2, pp. 108–114, Feb. 2012.

[63] G. W. Burr et al., “Access devices for 3D crosspoint memory,” J. Vac. Sci. Technol. B, Microelectron. Process. Phenom., vol. 32, no. 4, p. 040802, 2014.
[64] D. B. Strukov and R. S. Williams, “Four-dimensional address topology for circuits with stacked multilayer crossbar arrays,” Proc. Nat. Acad. Sci. USA, vol. 106, no. 48, pp. 20155–20158, 2009.
[65] B. Chakrabarti et al., “A multiply-add engine with monolithically integrated 3D memristor crossbar/CMOS hybrid circuit,” Sci. Rep., vol. 7, Feb. 2017, Art. no. 42429.
[66] G. C. Adam, B. D. Hoskins, M. Prezioso, F. Merrikh-Bayat, B. Chakrabarti, and D. B. Strukov, “3-D memristor crossbars for analog and neuromorphic computing applications,” IEEE Trans. Electron Devices, vol. 64, no. 1, pp. 312–318, Jan. 2017.
[67] G. Ligang, F. Alibart, and D. B. Strukov, “Programmable CMOS/memristor threshold logic,” IEEE Trans. Nanotechnol., vol. 12, no. 2, pp. 115–119, Mar. 2013.
[68] D. Gavrilov, D. B. Strukov, and K. K. Likharev. (2017). “Capacity, fidelity, and noise tolerance of associative spatial-temporal memories based on memristive neuromorphic network.” [Online]. Available: https://arxiv.org/abs/1707.03855
[69] S. M. Chai, A. Gentile, W. E. Lugo-Beauchamp, J. Fonseca, J. L. Cruz-Rivera, and D. S. Wills, “Focal-plane processing architectures for real-time hyperspectral image processing,” Appl. Opt., vol. 39, no. 5, pp. 835–849, 2000.
[70] C. Kittel, Introduction to Solid State Physics. Hoboken, NJ, USA: Wiley, 2005.
[71] International Technology Roadmap for Semiconductors, Semiconductor Industry Association, 2013.
[72] V. Betz, J. Rose, and A. Marquardt, Architecture and CAD for Deep-Submicron FPGAs, vol. 497. New York, NY, USA: Springer, 2012.
[73] (2018). MATLAB Code for Optimal Throughput Calculation. [Online]. Available: https://www.ece.ucsb.edu/~strukov/papers/2018/pm/code.m
[74] X. Tang, G. Kim, P.-E. Gaillardon, and G. De Micheli, “A study on the programming structures for RRAM-based FPGA architectures,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 63, no. 4, pp. 503–516, Apr. 2016.
[75] M. Payvand et al., “A configurable CMOS memory platform for 3D-integrated memristors,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2015, pp. 1378–1381.
[76] D. B. Strukov and K. K. Likharev, “CMOL FPGA circuits,” in Proc. CDES, 2006, pp. 213–219.

Advait Madhavan (M’12) received the M.S. and Ph.D. degrees from the Electrical and Computer Engineering Department, University of California Santa Barbara, Santa Barbara, CA, USA, in 2013 and 2016, respectively.
He is currently a Postdoctoral Researcher at the National Institute of Standards and Technology, Gaithersburg, MD, USA. His current research interests include novel methods for information processing, ranging from conceptualization of high-level architectures, analog and digital circuits to integration with emerging technologies and chip designs.
Dr. Madhavan was a recipient of the Micro Top Pick Award in 2015.

Tim Sherwood (M’03–SM’14) is currently a Professor of Computer Science and the Associate Vice Chancellor for Research at the University of California Santa Barbara, Santa Barbara, CA, USA. He specializes in the development of processors exploiting novel technologies, provable properties, and hardware-accelerated algorithms.
Prof. Sherwood is a seven-time winner of the IEEE Micro Top Pick Award, an ACM Distinguished Scientist, winner of the UCSB Academic Senate Distinguished Teaching Award, and is the 2016 SIGARCH Maurice Wilkes Awardee “for contributions to novel program analysis advancing architectural modeling and security.”

Dmitri B. Strukov (M’02–SM’16) received the M.S. degree in applied physics and mathematics from the Moscow Institute of Physics and Technology, Dolgoprudny, Russia, in 1999 and the Ph.D. degree in electrical engineering from Stony Brook University, Stony Brook, NY, USA, in 2006.
He is currently a Professor of Electrical and Computer Engineering at the University of California Santa Barbara, Santa Barbara, CA, USA. His current research interests include different aspects of computations, in particular addressing questions on how to efficiently perform computation on various levels of abstraction.
