
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 68, NO. 3, MARCH 2021

Time-Domain Computing in Memory Using Spintronics for Energy-Efficient Convolutional Neural Network

Yue Zhang, Senior Member, IEEE, Jinkai Wang, Graduate Student Member, IEEE, Chenyu Lian, Yining Bai, Guanda Wang, Graduate Student Member, IEEE, Zhizhong Zhang, Student Member, IEEE, Zhenyi Zheng, Graduate Student Member, IEEE, Lei Chen, Kun Zhang, Member, IEEE, Georgios Sirakoulis, Member, IEEE, and Youguang Zhang, Member, IEEE

Abstract—The data transfer bottleneck in Von Neumann architecture, owing to the separation between processor and memory, hinders the development of high-performance computing. The computing in memory (CIM) concept is widely considered as a promising solution for overcoming this issue. In this article, we present a time-domain CIM (TD-CIM) scheme using spintronics, which can be applied to construct energy-efficient convolutional neural networks (CNNs). Basic Boolean logic operations are implemented through recording the bit-line output at different moments. A multi-addend addition mechanism is then introduced based on the TD-CIM circuit, which can eliminate the cascaded full adders. To further optimize the compatibility of the TD-CIM circuit for CNN, we also propose a quantization method that transforms floating-point parameters of pre-trained CNN models into fixed-point parameters. Finally, we build a TD-CIM architecture integrating a highly reconfigurable array of field-free spin-orbit torque magnetic random access memory (SOT-MRAM) and evaluate its benefits for the quantized CNN. By performing digit recognition with the MNIST dataset, we find that the delay and energy are respectively reduced by 1.2-2.7 times and 2.4 × 10^3-1.1 × 10^4 times compared with STT-CIM and CRAM based on spintronic memory. Finally, the recognition accuracy can reach 98.65% and 91.11% on MNIST and CIFAR-10, respectively.

Index Terms—Computing in memory, time-domain, spintronics, digit recognition, convolutional neural networks.

Manuscript received November 1, 2020; revised January 20, 2021; accepted January 27, 2021. Date of publication February 3, 2021; date of current version February 23, 2021. This work was supported in part by the National Natural Science Foundation of China under Grant 61971024 and Grant 51901008, in part by the International Mobility Project under Grant B16001, and in part by the National Key Technology Program of China under Grant 2017ZX01032101. This article was recommended by Associate Editor S. Yin. (Corresponding author: Yue Zhang.)

Yue Zhang, Jinkai Wang, Chenyu Lian, Yining Bai, Guanda Wang, Zhizhong Zhang, Zhenyi Zheng, Lei Chen, Kun Zhang, and Youguang Zhang are with the MIIT Key Laboratory of Spintronics, School of Integrated Circuit Science and Engineering, Fert Beijing Institute, Beihang University, Beijing 100191, China, and also with the Nanoelectronics Science and Technology Center, Hefei Innovation Research Institute, Beihang University, Hefei 230013, China (e-mail: [email protected]).

Georgios Sirakoulis is with the Department of Electrical and Computer Engineering, Democritus University of Thrace, 67100 Xanthi, Greece.

Color versions of one or more figures in this article are available at https://doi.org/10.1109/TCSI.2021.3055830.

Digital Object Identifier 10.1109/TCSI.2021.3055830

Fig. 1. (a) Von Neumann architecture. (b) Computing in memory architecture.

I. INTRODUCTION

MACHINE learning (ML) has made great progress driven by the demand of burgeoning big-data-driven applications, such as artificial intelligence (AI), autonomous driving and internet of things (IoT) [1]. Among various algorithms of ML, the convolutional neural network (CNN) is one of the representative methods [2], [3], possessing extraordinary performance in cognitive and decision-making tasks [4]. However, with the increasing dataset scale and target complexity, CNN is facing the challenges of increasingly complex interconnections, more convolution computations and frequent data transfers. There are certain improvements in algorithms to overcome these challenges [5]–[7]. However, the Von Neumann bottleneck, owing to the limited data bandwidth between memory and processor in Von Neumann architecture, inherently constrains the execution performance of CNN, as shown in Fig. 1(a).

In order to address the above issues, computing in memory (CIM) architecture has been introduced, as depicted in Fig. 1(b). By exploiting the physical attributes of structures or devices, computations are performed in memory to achieve significant time and energy efficiency [8]–[10]. According to this concept, there have been many explorations based on static random access memory (SRAM), dynamic RAM (DRAM) and emerging non-volatile memory (NVM) technologies. For example, [11]–[17] proposed to use a sense amplifier (SA) or analog-to-digital converter (ADC) in SRAM and DRAM to distinguish the variational current or voltage generated by multiple activated bit-cells, thereby implementing logic operations. However, due to the increase of leakage current with the scaling down of CMOS devices, the processing of data-intensive applications produces considerable energy consumption in SRAM and DRAM.

Recent breakthroughs in several NVM techniques provide a potential way to realize near-zero leakage and static power consumption.


Fig. 2. TD-CIM circuit. (a) Spintronic bit-cell structure and switching principle of field-free SOT-MRAM. (b) Spintronic cell array. (c) TDC unit. (d) Principle of TDC unit. (e) Logic unit.

Among different NVMs, spintronic memories offer advantageous performance, especially in terms of the energy and time of write operations [18]–[22]. This reduces the energy of a CIM architecture that requires writing the logic results back to bit-cells, and various CIM architectures based on spintronic memories have been proposed. Reference [23] presented the spin transfer torque CIM (STT-CIM) architecture, which modifies the peripheral decision circuit to sense the effective resistance of the bit-line and can perform Boolean logic, arithmetic and complex vector operations. Using the physical attributes of the STT device, [24] proposed the computational RAM (CRAM) architecture to perform computations in the cell array, which generates logic outputs directly in STT devices. However, these CIM architectures only adopt the concept of the arithmetic logic unit (ALU) to carry out computations, but do not fully explore the inherent advantages of the memory array. For example, the addition operation, the fundamental unit in all arithmetic operations [25], [26], is normally implemented by cascading full adders. If the same mechanism is used in CIM architectures, a large amount of additional decoding operations and time sequence schedules are required, which greatly increases the computation complexity and degrades the performance in terms of delay and energy.

In this work, we propose a time-domain CIM (TD-CIM) scheme based on spintronic memory enabling simplification of arithmetic operations for energy-efficient CNN. The TD-CIM circuit is firstly proposed to execute NOR, NAND and XOR operations by converting the variation of bit-line voltage to the time domain. According to the characteristics of the output, we propose a multi-addend addition mechanism for implementing the addition operation of multiple 1-bit addends in a memory access. Furthermore, the addition of multiple N-bit addends with the TD-CIM circuit is realized and used for the multiplication. In order to improve the compatibility of the TD-CIM circuit for CNN, we propose a quantization method that transforms floating-point parameters of pre-trained CNN models into fixed-point parameters. Finally, a TD-CIM architecture with a highly reconfigurable array of spin-orbit torque magnetic RAM (SOT-MRAM) is built, and we evaluate its delay and energy by performing 2D convolution to recognize handwritten digit images from the MNIST dataset. Compared with the STT-CIM and CRAM architectures, the delay of the TD-CIM architecture is reduced by 2.7 times and 1.2 times, and the energy is decreased by 2.4 × 10^3 times and 1.1 × 10^4 times, respectively.

The remaining parts are organized as follows: Section II presents the TD-CIM circuit to implement Boolean logic. Multi-addend addition and efficient multiplication schemes based on the TD-CIM circuit are described in Section III. The quantization method of CNN and a TD-CIM architecture are illuminated in Section IV. Section V analyzes the reliability of the TD-CIM circuit and evaluates the performance of the TD-CIM architecture by performing 2D convolution for digit recognition. Conclusions are presented in Section VI.

II. TD-CIM CIRCUIT FOR BOOLEAN LOGIC

In CIM architecture, distinguishing the bit-line voltage is a common method to perform logic operations [12]. Its principle can be analyzed by an RC circuit model, in which the bit-line voltage (V_t) is expressed as

V_t = V_0 e^{-T_dis/(RC)} = V_0 e^{-T_dis/((R_PR + R_ER)C)}    (1)

where T_dis refers to the discharge time, V_0 is the initial voltage of the bit-line, and R and C are the resistance and capacitance on the discharge channel, respectively.


In the memory array, R includes the equivalent resistance (R_ER) of the activated bit-cells in parallel and the parasitic resistance (R_PR) on the discharge channel. As R_PR and C are always constant after the chip is designed, the bit-line voltage mainly depends on R_ER after a fixed T_dis. In this case, different configurations of activated bit-cells can thus be reflected by the bit-line voltage, and logic operations can be implemented through comparing the bit-line voltage with a reference voltage. However, the difference of bit-line voltages with different input configurations is normally slight [27], [28], hence the logic output detection requires accurate generation and distribution of the reference voltage for the SA or ADC. The more bit-cells are activated simultaneously, the more difficult the detection becomes.

To solve this problem, as shown in Fig. 2, we propose a TD-CIM circuit which is composed of a spintronic cell array, a time-domain conversion (TDC) unit and a logic unit. It is well known that SOT-MRAM provides advantageous write behavior compared with STT-MRAM [22], [29]. As a large amount of write operations are normally required in CIM architecture, applying SOT-MRAM can obviously improve the overall performance of the TD-CIM circuit. However, to achieve the deterministic switching of a magnetic tunnel junction (MTJ) with perpendicular magnetic anisotropy (PMA) in SOT-MRAM, an additional magnetic field has to be used, which is a major hurdle for its practical application. Recently, a field-free SOT-MRAM was proposed by combining the STT and SOT effects [30], [31]. As illustrated in Fig. 2(a), its write operation has three phases: (i) SOT current flows through the heavy metal to form in-plane magnetization in the free layer (FL) of the MTJ due to the spin-Hall effect (SHE); (ii) STT current is then injected to determine the MTJ's state; (iii) SOT current is removed, but STT current still remains until the magnetization relaxes to the perpendicular axis. If STT current flows from FL to pinned layer (PL), the MTJ state is set to '0' (low resistance, R_L), and the current with the opposite direction writes '1' (high resistance, R_H). In this way, due to the metastable state induced by the SOT current, the effect of the STT current is amplified to reduce the incubation delay of magnetization switching. Hence, this field-free SOT-MRAM provides fast switching speed as well as low energy, and we adopt it in the spintronic cell array (see Fig. 2(b)).

The TDC unit consists of an inverter and a buffer (see Fig. 2(c)). The inverter is connected to the bit-line (BL). For realizing the TDC function, BL is firstly pre-charged to the supply voltage (VDD). When the spintronic bit-cells are activated by the word-lines (WLs) and the source-line (SL) is connected to the ground through enabling the DS signal, the BL voltage (V_BL) starts to decrease. The output of the TDC unit (V_TDC) does not reverse until V_BL has decreased to the threshold voltage of the interior NMOS transistor (V_nth). The discharge time (T_dis) can thus be derived by the transformation of Eq. (1) as follows:

T_dis = -RC ln(V_nth/V_0) = -(R_PR + R_ER) C ln(V_nth/V_0)    (2)

As shown in Fig. 2(d), different configurations of activated bit-cells (via R_ER) are reflected to the discharge time (i.e., the reversal moment of V_TDC), which realizes the conversion from voltage difference (ΔV) to time difference (ΔT). For example, the four configurations of two activated bit-cells can be divided into three cases according to the value of R_ER: two activated bit-cells both store '0' (case "00"); one of the bit-cells stores '0' and the other stores '1' (case "01&10"); two activated bit-cells both store '1' (case "11"). Due to the different R_ER in these cases, the speeds of the V_BL drop are varied. The reversal moments of V_TDC in these three cases are thus different and form two intervals, i.e., ΔT1 and ΔT2. In ΔT1, V_TDC is a high voltage ('1') only in the case "00", implementing the NOR logic. Similarly, in ΔT2, V_TDC is a low voltage ('0') only in the case "11", implementing the NAND logic. Therefore, by choosing these moments to distinguish V_TDC, reconfigurable logic operations can be achieved. Here, we can utilize a series of DFFs to record V_TDC at these moments and buffers to enhance the drive capability (see Fig. 2(e)).

TABLE I. TRUTH TABLE FOR TYPICAL LOGIC FUNCTIONS BASED ON TD-CIM CIRCUIT

Table I exhibits the truth table for typical functions based on the TD-CIM circuit, in which the outputs of DFF0 and DFF1 (D0 and D1) evaluate the NOR and NAND logic operations of the data stored in the activated bit-cells. Furthermore, we design an XOR circuit in the logic unit consisting of a pull-up channel and two pull-down channels. When D0 and D1 are both '0' or both '1', the pull-up channel is closed and one of the pull-down channels is opened, by which the output of the buffer in the logic unit drops to '0'. Alternatively, when D0 and D1 are '0' and '1', respectively, the buffer outputs a high voltage because the pull-up channel is opened and both of the two pull-down channels are closed. Fig. 3 demonstrates the transient simulation results of the TD-CIM circuit based on the field-free SOT-MRAM. The signals CP0 and CP1 control DFF0 and DFF1 to record V_TDC at two moments (T1 and T2). Hence, D0 and D1 give the NOR logic and NAND logic outputs, respectively. Besides, the XOR logic can be realized based on D0 and D1. As the output of the XOR circuit should be detected after T2, at which both NOR and NAND logic operations are completed, the total delay of the TD-CIM circuit based on field-free SOT-MRAM to achieve XOR logic is about 1.2 ns. It is also noteworthy that the NOR, NAND and XOR logic operations are carried out through one memory access in the TD-CIM circuit.
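To make the time-domain sensing concrete, the following Python sketch models Eq. (2) behaviorally and shows how sampling V_TDC at two moments yields NOR, NAND and XOR, as in Table I. It is an illustration under assumed component values (V0, VNTH, RPR, C, RL and RH are placeholders, not the paper's measured parameters):

import math

V0, VNTH = 1.0, 0.45      # assumed pre-charge voltage and NMOS threshold (V)
RPR, C = 1e3, 100e-15     # assumed parasitic resistance (ohm) and BL capacitance (F)
RL, RH = 10e3, 20e3       # assumed MTJ low ('0') and high ('1') resistances (ohm)

def t_dis(cell_states):
    """Discharge time until V_BL crosses V_nth, per Eq. (2)."""
    g = sum(1.0 / (RL if s == 0 else RH) for s in cell_states)
    r_er = 1.0 / g                                 # parallel equivalent resistance
    return -(RPR + r_er) * C * math.log(VNTH / V0)

# The three cases of two activated cells reverse at distinct moments:
t00, t01, t11 = t_dis([0, 0]), t_dis([0, 1]), t_dis([1, 1])
assert t00 < t01 < t11

# Clock the DFFs inside the two intervals (CP0 within dT1, CP1 within dT2):
cp0, cp1 = (t00 + t01) / 2, (t01 + t11) / 2
for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    t = t_dis([a, b])
    d0 = 1 if t < cp0 else 0   # reversal already happened at CP0 -> NOR output
    d1 = 1 if t < cp1 else 0   # reversal already happened at CP1 -> NAND output
    print(a, b, "NOR:", d0, "NAND:", d1, "XOR:", d0 ^ d1)

Running the loop reproduces the truth table: D0 is '1' only for case "00", D1 is '0' only for case "11", and D0 XOR D1 gives the XOR output, all from a single bit-line discharge.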
III. MULTI-ADDEND ADDITION AND EFFICIENT MULTIPLICATION SCHEME BASED ON TD-CIM CIRCUIT

The addition is the basis for carrying out complex arithmetic operations. Normally, the addition of multiple addends is implemented by cascading full adders based on Boolean logic [32], [33]. However, in a CIM scheme, the cascade of full adders requires a series of decoding operations and time sequence schedules of memory, which greatly increases the computing complexity.


To address this issue, we propose a multi-addend addition scheme based on the TD-CIM circuit to simplify the arithmetic operations in memory. An efficient multiplication scheme is also brought out for the following investigation on CNN.

Fig. 3. Transient simulation results of TD-CIM circuit based on field-free SOT-MRAM. (a) BL and V_TDC in the case "11". (b) BL and V_TDC in the case "01&10". (c) BL and V_TDC in the case "00". (d) CP0 and CP1. (e) Outputs of XOR, D0 and D1 in the case "11". (f) Outputs of XOR, D0 and D1 in the case "01&10". (g) Outputs of XOR, D0 and D1 in the case "00".

Fig. 4. Carry principle for the addition of multiple addends. (a) Case of three addends. (b) Case of four addends.

A. Multi-Addend Addition Scheme Based on TD-CIM Circuit

Fig. 4 exemplifies the carry principle for the multi-addend addition operation, in which M_i (i = 0, 1, 2, ...) represents the order of magnitude in a binary addend and M_0 is the lowest. In the addition of three addends, as shown in Fig. 4(a), a carry is generated if the three 1-bit addends in M_0 contain at least two '1'. Four 1-bit addends are then added in M_1. In the ultimate case that these 1-bit addends are all '1', M_1 will directly generate a carry to M_3. Meanwhile, M_2 might generate a carry to M_3 as well. Therefore, in the addition of three addends, two additional bits are required for the computation in M_3. A similar mechanism is observed in the addition of four addends shown in Fig. 4(b), where three carries might be computed. This conclusion can be extended to the addition of n addends, where n-1 additional bits should be taken into account in each operation.

Fig. 5(a) illustrates the principle of the addition of n addends in the TD-CIM circuit, in which 2n-1 word-lines are activated simultaneously, including n 1-bit addends and n-1 carries in each column. In this case, there are 2^{2n-1} configurations of these activated bit-cells, which are classified into 2n cases according to the number of the datum '1' stored in the bit-cells. In order to distinguish these 2n cases, 2n-1 DFFs are used in the TD-CIM circuit to record the outputs at 2n-1 moments, as demonstrated in Fig. 5(b). The sum (S_i) and carry (C_i) in M_i can be expressed as

S_i = (D_0 XOR D_1) OR (D_2 XOR D_3) OR ... OR D_{2n-2}    (3)

C_{(i,1)} = D_1 XOR D_3
C_{(i,2)} = D_3 XOR D_5
...
C_{(i,n-2)} = D_{2n-5} XOR D_{2n-3}
C_{(i,n-1)} = D_{2n-3}    (4)

where C_{(i,p)} (p = 1, 2, ..., n-1) represents the carry calculated in M_i for M_{i+p}. As the 2n-1 bit-cells can be activated simultaneously in the TD-CIM circuit, S_i and C_i can be obtained in one memory access, which effectively reduces the computing complexity.

Fig. 6 shows a detailed operational process for an addition of three 8-bit addends (A, B and C) stored by row in the memory array. Assigning n = 3 to Eq. (4), two carries, i.e., C_{(i,1)} = D_1 XOR D_3 and C_{(i,2)} = D_3, are generated in M_i. In order to store these carries, two additional bit-cells are needed for each order of magnitude. Note that the additional bit-cells should be initialized to '0'. Hence, by activating the five word-lines connected to the bit-cells where the three 8-bit addends and two carries are stored, S_0, C_{(0,1)} and C_{(0,2)} are firstly obtained in M_0 by using the TD-CIM circuit to implement the addition of five 1-bit numbers (three 1-bit addends and two 1-bit carries) and then written into the corresponding bit-cells. To further decrease the operation time, these bit-cells can be selected in advance by the decoder, and S_0, C_{(0,1)} and C_{(0,2)} can be written at the same time because they are in different columns. A similar process will subsequently be carried out for the other orders of magnitude. Considering two overflow orders, it only takes 10 cycles to complete the addition of three 8-bit addends based on the TD-CIM circuit. Note that the number of cycles is only related to the number of bits, not the number of addends.
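As a sanity check of this mechanism, the following Python sketch simulates the n = 3 case column by column: each cycle reads five cells (three addend bits plus two previously written carries), decodes the thermometer-coded DFF outputs with Eqs. (3)-(4), and writes the sum and carries back. It is our reading of the scheme, not the authors' code; the D_j polarity (D_j = '1' when more than j cells store '1') is chosen here so that Eqs. (3)-(4) hold:

def add3_8bit(a, b, c, width=8):
    """Add three width-bit numbers in width+2 column cycles (10 for 8-bit)."""
    addends = [[(x >> i) & 1 for i in range(width + 2)] for x in (a, b, c)]
    carry1 = [0] * (width + 3)    # C_(i,1), consumed by column i+1
    carry2 = [0] * (width + 4)    # C_(i,2), consumed by column i+2
    out = 0
    for i in range(width + 2):    # 8 bit columns + 2 overflow columns
        cells = [addends[0][i], addends[1][i], addends[2][i],
                 carry1[i], carry2[i]]
        k = sum(cells)                               # number of '1's in the column
        D = [1 if k > j else 0 for j in range(5)]    # DFF outputs D0..D4
        s = (D[0] ^ D[1]) | (D[2] ^ D[3]) | D[4]     # Eq. (3) with n = 3
        carry1[i + 1] = D[1] ^ D[3]                  # Eq. (4): C_(i,1)
        carry2[i + 2] = D[3]                         # Eq. (4): C_(i,2)
        out |= s << i
    return out

assert all(add3_8bit(a, b, c) == a + b + c
           for a, b, c in [(255, 255, 255), (0, 0, 0), (170, 85, 204)])

The loop always runs 10 cycles for 8-bit operands, matching the cycle count stated above: the cost depends only on the bit width, not on the number of addends sharing a column.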


Fig. 5. (a) Principle of the addition of n addends based on TD-CIM circuit. (b) Schematic of multi-addend addition based on TD-CIM circuit.

Fig. 6. Diagram of addition in array for three 8-bit addends based on TD-CIM circuit.

Fig. 7. Efficient multiplication scheme based on TD-CIM circuit.

B. Efficient Multiplication Scheme

In digital integrated circuits, the multiplication is normally implemented through shift and addition operations. The aforementioned multi-addend addition can be applied to improve the performance of multiplications in memory. Fig. 7 exhibits the efficient multiplication scheme based on the TD-CIM circuit for two 4-bit numbers (A and B). During the shift operation, four 4-bit numbers are first generated by performing AND operations between A and B. Then, by utilizing write operations in memory, the shift operations of the four 4-bit numbers are realized. When implementing the addition of the four 4-bit numbers via the TD-CIM circuit, four 8-bit numbers (K, Y, Z and L) are constructed by filling the rows with the datum '0'. M_2, M_3, M_4 and M_5 might generate two carries (C_{(i,1)} and C_{(i,2)}), while only one carry is generated in M_1 and M_6. Note that the bit in M_0 of K directly serves as S_0, which does not require any computation. Here, in order to reduce the bit number of the addition in each order of magnitude, we respectively store C_{(i,2)} in M_{i+2} (i = 2, 3, 4, 5) of K, instead of requiring another row for storing C_{(i,2)} as in Fig. 6. Hence, only one row is added to store the carry C_{(i,1)}. According to the multi-addend addition scheme, this multiplication can be completed within 7 cycles, which is much more compact than the mainstream schemes based on full adders [24]. Therefore, the multiplication operation of two N-bit numbers takes 2N cycles in the TD-CIM scheme, including one cycle to shift the partial products and 2N-1 cycles of the addition operation.
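The flow can be mirrored in software. The sketch below is an illustration, not the authors' code: it generates the AND partial products, shifts them by write position, and sums them. For brevity it reuses add3_8bit from the previous sketch in two passes, whereas the in-array version folds the carries into row K to finish the whole addition in 7 cycles:

def tdcim_mul_4bit(a, b):
    """Multiply two 4-bit numbers the TD-CIM way: AND, shift, multi-addend add."""
    rows = [a if (b >> i) & 1 else 0 for i in range(4)]   # AND partial products
    shifted = [r << i for i, r in enumerate(rows)]        # shift via write position
    partial = add3_8bit(shifted[0], shifted[1], shifted[2])
    return add3_8bit(partial, shifted[3], 0)              # all values fit in 8 bits

assert all(tdcim_mul_4bit(a, b) == a * b for a in range(16) for b in range(16))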
IV. OPTIMIZATION OF CNN BASED ON TD-CIM CIRCUIT

CNN is commonly applied to analyze visual images and is composed of convolution layers, activation functions, pooling layers and fully-connected layers [4]. Particularly, the convolution layer is used for image sharpening, blurring and edge detection.


Its core computation is described as

O_{i,j} = Σ_{k=1}^{n} Σ_{l=1}^{n} I_{i+k,j+l} W_{k,l}    (5)

where W represents the matrix of the convolution kernel and I refers to the matrix of input pixels. It is obvious that the computation of convolution consists of multiplication and summation operations. Hence, based on the multi-addend addition and efficient multiplication schemes proposed above, the TD-CIM circuit can greatly improve the computation performance of convolution in memory.

Fig. 8. Convolution for an image with 4-bit per pixel by using a kernel of size 3 × 3 with weight of 4-bit in TD-CIM circuit.

Fig. 8 shows an example of a convolution operation for an image with 4-bit per pixel, in which a kernel of size 3 × 3 with 4-bit weights is computed by using the TD-CIM circuit. The efficient multiplication operations are firstly performed to generate nine 8-bit numbers. Then, the final result of the convolution is obtained through implementing the multi-addend addition for these numbers. Considering the computational accuracy of the TD-CIM circuit, the nine 8-bit numbers are divided into three computation blocks, instead of adding all of them at once. A 10-bit number is thus generated by each block. Finally, the three 10-bit numbers are added to obtain the convolution result, which should be an 11-bit number. Similar to the convolution implementation, it is also possible to realize the activation function, pooling layer and fully-connected layer by using the TD-CIM circuit.
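One output pixel of Eq. (5) under this schedule can be traced in a few lines. The sketch below is illustrative (not the authors' code) and mimics the blockwise flow: nine 4-bit by 4-bit products, three block sums of three products each, then one final addition:

def conv_pixel(patch3x3, kernel3x3):
    """One output of Eq. (5), summed in three blocks as in Fig. 8."""
    products = [p * w for p, w in zip(patch3x3, kernel3x3)]   # nine 8-bit numbers
    blocks = [sum(products[i:i + 3]) for i in (0, 3, 6)]      # three 10-bit sums
    return sum(blocks)                                        # 11-bit result

patch = [3, 15, 7, 0, 9, 12, 5, 8, 1]    # 4-bit pixels of one 3 x 3 window
kernel = [1, 2, 1, 2, 4, 2, 1, 2, 1]     # 4-bit kernel weights (assumed values)
print(conv_pixel(patch, kernel))         # equals the direct dot product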
A. Quantization Method of CNN

CNN requires high-precision floating-point computation according to the gradient descent learning algorithm, which normally increases the complexity of CIM-based CNN computation. For example, [34] demonstrates that both the time and energy of computations for floating-point numbers exceed those for fixed-point numbers by more than one order of magnitude under the same calculation condition. Therefore, the transformation of floating-point parameters into fixed-point parameters in CNN is beneficial for reducing the number of computations [35]–[40] and will improve the compatibility of the TD-CIM circuit for CNN. Reference [39] is a typical method that uses fixed-point parameters instead of high-precision floating-point parameters in the CNN algorithm. However, this quantization method necessitates multiple retraining processes to generate fixed-point parameters, which cannot be widely used due to its low efficiency. Here, we propose an alternative quantization method to optimize the generation process of fixed-point parameters.

When transforming the trained floating-point parameters into fixed-point parameters, we set the quantization process as an optimization task rather than directly obtaining the fixed-point parameters by training the floating-point parameters. Besides, CNN performs the recognition task by selecting the element with the largest probability in the result vector. Therefore, in the proposed quantization method, the fixed-point parameters are achieved in the pre-trained neural network models without changing the sequence of output probabilities. As shown in Algorithm 1, the quantization process of CNN can be described in three steps as follows:

Algorithm 1 Quantization Procedure of CNN
Input: Pre-trained model weight in 32-bit or 64-bit float, W
Output: Quantized model weight in N-bit integer, Wq
Step 1. Compute the range of W:
    lowbound = min(W);  upbound = max(W)
Step 2. Compute the scale of W:
    S = 2^N - 1;  L = abs(lowbound);  U = abs(upbound)
    if L >= U then
        Scale = S / L
    else
        Scale = S / U
Step 3. Quantize W into integers:
    Wq = Round(Scale * W)
return Wq

Step 1: The floating-point parameters in pre-trained models and the input data of the image are scaled to the largest range of N bits via linear transformation; for example, the weight range of -0.31 to +0.44 is scaled to -15 to +15, which can be represented by a 4-bit number. Since the activation function is the only non-order-preserving factor that can change the sequence of the result vector, the scaling method depends on the type of activation function in the CNN. Here, we use the ReLU function as the activation function because it is widely used in neural network models. In a CNN based on the ReLU function, the sequence of the final result will remain unchanged if the sign of the parameters is not changed. Therefore, the sign of the parameters must remain unchanged during the scaling operation.

Step 2: Step 1 doesn't consider the decrease of accuracy, which is inconsistent with reality (the accuracy drop is usually more than 2%). Here, we propose a mathematical model to solve this problem and obtain the smallest accuracy drop. Because the prediction result in CNN is given through the vector from the output layer, and the result vector will change after quantizing the original neural network, the closer the output vector of the quantized model is to that of the original model, the smaller the accuracy drop caused by quantization is.

Fig. 9. (a) Original image with 8-bit per pixel. (b) Converted image with 4-bit per pixel. (c) Convolution result.

Hence, we quantize the weight in the neural network to minimize the accuracy drop. It can be described as

min ||σ(w_q) - σ(w)||,  s.t. w_q = Q(w)    (6)

where w refers to the set of floating-point weight matrixes, w_q represents the set of weight matrixes scaled to N bits, σ is the computation process of the classical CNN with the ReLU function, and Q is the quantization function.

Step 3: The scaling operation compensates the influence of the activation function on quantization. Meanwhile, because the purpose of the pooling layer is to progressively reduce the spatial size of parameters and computations in the network, and its operation on each feature map is independent [41], the computation of the pooling layer doesn't affect the final sequence of output probabilities. Therefore, according to the associative law of calculation in the convolution layer and fully-connected layer, Eq. (6) is modified as

min ||σ_N(w_q - w)||,  s.t. w_q = Q(w)    (7)

where N represents the number of bits in a single pixel and a single kernel weight, and σ_N is the computation process of the classical CNN except for the pooling layer and activation function. Moreover, the difference between σ_N(w_q) and σ_N(w) is proportional to the difference between w_q and w. Therefore, the optimal solution of Q is rounding. Meanwhile, the difference between w_q and w is reduced as the size of N increases, which is beneficial to quantizing CNN.

This method implements quantization without retraining and encoding, which reduces the amount of calculation. Meanwhile, by scaling the weight appropriately to neutralize the non-linearity of ReLU, the accuracy drop produced by quantization without retraining can be minimized. Finally, the N-bit convolution operation is quantized into 4 bits, thereby reducing the complexity of CNN and enhancing the compatibility of the TD-CIM circuit for CNN. Fig. 9 demonstrates an example of image processing using the proposed quantization method. The original image with 8-bit per pixel is converted to an image with 4-bit per pixel. Fig. 9(c) shows the result of the convolution computation with a 4-bit kernel.
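A compact implementation of Algorithm 1 is easy to write. The sketch below is our NumPy rendering with an assumed 4-bit default, not the authors' released code:

import numpy as np

def quantize_weights(w, n_bits=4):
    """Algorithm 1: scale floats so the largest magnitude maps to 2^N - 1,
    then round; the sign is preserved (Step 1) and Q is rounding (Step 3)."""
    s = 2 ** n_bits - 1
    bound = max(abs(float(w.min())), abs(float(w.max())))   # max(L, U)
    scale = s / bound
    return np.round(scale * w).astype(np.int32), scale

# Example with the weight range quoted in Step 1 (-0.31 to +0.44):
w = np.array([-0.31, -0.05, 0.12, 0.44])
wq, scale = quantize_weights(w)
print(wq)   # integers within [-15, 15]; +0.44 maps to +15

Applied layer by layer to a pre-trained model, this reproduces the no-retraining flow described above.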
Fig. 10. TD-CIM architecture for quantized CNN.

B. TD-CIM Architecture for Quantized CNN

We then design a TD-CIM architecture using field-free SOT-MRAM to execute the quantized CNN, as shown in Fig. 10. It consists of three sub-arrays: a data array, a shift array and a summation array. The data array is specialized to store the original data of images and the convolution kernel. For the convolution computation of CNN, the shift array stores the shifted data in the multiplications, and the summation array stores the results of the multiplications.

In the data array, to enhance the parallelism of AND operations, the image data are stored by row and the kernel data are stored by column. By using the TD-CIM circuit located in each column, each bit of the pixel can carry out an AND operation with any bit of the kernel at the same time. When their results are transferred to the shift array, the shift operations are implemented by write operations. The addition part of the proposed efficient multiplication scheme is also performed in the shift array to get the results of the multiplication. At last, these results are transferred to the summation array to obtain the final convolution results by implementing the multi-addend addition operation based on the TD-CIM circuit.

However, owing to the structural limitation of the memory array, logic operations can only be carried out on the columns. By contrast, to increase the parallelism of write operations, the numbers generated by multiplication operations are stored on the rows, which prevents their multi-addend addition from being implemented. Therefore, we propose a highly reconfigurable array based on the field-free SOT-MRAM allowing logic operations on the rows. As shown in Fig. 11, a transistor is added in the bit-cell to construct three bit-lines (SL, BL, CBL) and three word-lines (WL, RWL, CWL). When performing logic operations on the rows, the TD-CIM circuit is connected to the CBL. Thanks to this reconfigurability enhancement, the quantized CNN can be implemented more efficiently by the TD-CIM architecture.

V. PERFORMANCE EVALUATION AND DISCUSSION

As reliability is crucial for implementing logic operations, we first analyze the reliability of the TD-CIM circuit. Hybrid CMOS/SOT-MRAM simulations are carried out by applying 28 nm CMOS process technology and the field-free SOT-MRAM compact model [31].


Fig. 11. Structure of the highly reconfigurable array based on SOT-MRAM.

TABLE II. KEY PARAMETERS OF THE FIELD-FREE SOT-MRAM

Fig. 12. Monte Carlo simulation results of V_TDC in different data cases when five bit-cells are activated simultaneously (1 V, TT, 25 °C).

Table II summarizes the key parameters of the field-free SOT-MRAM, which are dependent on physical models and experimental measurements [29], [30]. Then, by performing 2D convolutions in LeNet-5 to recognize the handwritten digit images from the MNIST dataset, we evaluate the performance of the TD-CIM architecture in terms of delay and energy, and compare it with those of the existing CIM architectures. Moreover, the recognition accuracy is given to prove the compatibility of the TD-CIM architecture for CNN.

A. Reliability Analysis of TD-CIM Circuit

Although the reliability of the TD-CIM circuit is improved by distinguishing the bit-line voltage in the time domain [42], it deteriorates with the increasing number of activated bit-cells. Assume that n bit-cells are activated in the TD-CIM circuit, in which x bit-cells store '1' and the others store '0'. From Eq. (2), T_dis(x,n-x) is written as

T_dis(x,n-x) = -(R_PR + (R_H R_L)/((n-x) R_L + x R_H)) C ln(V_nth/V_0)    (8)

Then, a bit-cell storing '1' is changed to '0', i.e., x-1 bit-cells store '1' and n-x+1 bit-cells store '0'. In this case, T_dis(x-1,n-x+1) is described as

T_dis(x-1,n-x+1) = -(R_PR + (R_H R_L)/((n-x+1) R_L + (x-1) R_H)) C ln(V_nth/V_0)    (9)

Hence, the ΔT between them can be expressed as

ΔT = T_dis(x,n-x) - T_dis(x-1,n-x+1)    (10)

It is well known that the tunnel magnetoresistance (TMR) ratio is defined as TMR = (R_H - R_L)/R_L, which reflects the difference between R_H and R_L. Therefore, Eq. (10) is rewritten as

ΔT = [(1 + TMR) TMR] / [(n + x·TMR)(n + x·TMR - TMR)] × R_L C ln(V_nth/V_0)    (11)

Here, TMR and R_L are determined by the MTJ device and can be regarded as constants after fabrication. Therefore, from Eq. (11), ΔT will gradually decrease as x reduces. When none of the bit-cells stores '1' (i.e., x = 0), ΔT is in the worst case (ΔT_worst). If ΔT_worst is smaller than the delay of the DFF, errors will occur in the TD-CIM circuit because the DFF cannot record V_TDC in time. To ensure the reliability of the TD-CIM circuit, ΔT_worst must be large enough.
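To get a feel for how the margin scales, the following sketch evaluates Eq. (11) numerically for the five-cell case used below. All device values are illustrative placeholders (the paper's actual parameters live in Table II), and the magnitude of ln(V_nth/V_0) is used so the margin prints as a positive time:

import math

RL, C = 10e3, 100e-15    # assumed MTJ low resistance (ohm) and BL capacitance (F)
TMR = 1.0                # assumed TMR ratio, (RH - RL) / RL
V0, VNTH = 1.0, 0.45     # assumed pre-charge voltage and sensing threshold (V)

def delta_t(n, x):
    """Sensing margin of Eq. (11) between the x and x-1 configurations."""
    num = (1 + TMR) * TMR
    den = (n + x * TMR) * (n + x * TMR - TMR)
    return num / den * RL * C * abs(math.log(VNTH / V0))

n = 5   # five activated cells: three addends plus two carries
for x in range(1, n + 1):
    print(f"x = {x}: margin = {delta_t(n, x) * 1e12:.1f} ps")
# The smallest margin over all configurations is dT_worst; it must exceed
# the DFF sampling delay for V_TDC to be recorded reliably.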
We carry out Monte Carlo simulations of 10^4 samples when five word-lines are activated simultaneously, i.e., there are five activated bit-cells on a column. Fig. 12 demonstrates the variations of ΔT among different data cases at 1 V, TT, 25 °C. Note that PVT variation also affects ΔT in addition to the number of activated bit-cells. Here, the process deviation of the MTJ resistance follows a Gaussian distribution with 5% variability [42], [43]. For the CMOS transistors, local and global variations are included by using the foundry statistical models [43].


Fig. 13. Computation accuracy of TD-CIM circuit for different data cases.

Simulation results show that ΔT_worst is about 60 ps, which is enough to record the state of V_TDC by using DFFs [44]. The computation accuracy of the addition of three 1-bit addends in the TD-CIM circuit is also evaluated for different data cases, as shown in Fig. 13. The lowest computation accuracy appears in the case of "00000" (∼97.9%), which is normally sufficient for the CNN algorithm. When the number of addends increases to four, the lowest computation accuracy drops to about 92.3%, which cannot satisfy the demand of the algorithm. Therefore, based on the current performance of CMOS and MTJ, the reliability of the TD-CIM circuit is sufficient to realize logic operations when no more than three addends are activated simultaneously. That is why we divide the nine 8-bit numbers into three computation blocks in Fig. 8. Moreover, it is believed that the reliability of the TD-CIM circuit can be enhanced to handle more activated bit-cells with the improvement of CMOS and MTJ performance in the future.

B. Evaluation of Delay and Energy

We then evaluate the delay and energy of the TD-CIM architecture through carrying out 2D convolutions of two 4-bit numbers in CNN. Here we divide the delay and energy into three parts according to the structure of the TD-CIM architecture: data array, shift array and summation array.

1) Data Array: As mentioned in Section IV-B, AND operations are firstly performed in the data array of the TD-CIM architecture. The peripheral circuit should be used to select the bit-cells storing the operational data and will also be used for every step of the following operations. Hence, the delay and energy of an AND operation come from the TD-CIM circuit (T_TD-CIM/AND and E_TD-CIM/AND) and the peripheral circuit (T_pre and E_pre), which can be described as

T_AND = T_TD-CIM/AND + T_pre    (12)
E_AND = E_TD-CIM/AND + E_pre    (13)

As the AND operations can be executed in each column simultaneously in the proposed highly reconfigurable field-free SOT-MRAM array, the total delay of the AND operations for a multiplication of two 4-bit numbers is T_AND. In addition, its total energy is equal to 16E_AND, as 16 AND operations are carried out.

2) Shift Array: The shift and addition operations are implemented in the shift array. On one hand, the delay and energy of the shift operations are mainly from the data write operations, which can be expressed as

T_shift = T_write/bit + T_pre    (14)
E_shift = E_write/bit + E_pre    (15)

where T_write/bit and E_write/bit represent the write time and energy of a field-free SOT-MRAM device. Because 4 numbers with 4 bits are generated after implementing the AND operations, the total delay and energy of the shift operations are 4T_shift and 16E_shift, respectively. Note that the write operations in different columns can be executed at the same time within a memory access.

On the other hand, the multi-addend addition in a column or row includes the computations of the TD-CIM circuit and the write operations for carries and sum. Hence, its delay and energy are described as

T_addition = T_TD-CIM/Add + T_write/bit + 2T_pre    (16)
E_addition = E_TD-CIM/Add + 2E_write/bit + 2E_pre    (17)

As elucidated in Section III-B, 7 cycles of additions are required to obtain the multiplication result of two 4-bit numbers. The total delay and energy of the addition operations in a multiplication are 7T_addition and 7E_addition, respectively.

3) Summation Array: The summation array is used to calculate the final convolution results by adding all the multiplication results. Since the size of the convolution kernel is 3 × 3, 9 numbers are generated by the multiplications in one convolution computation and then transferred to the summation array at the same time. The delay of this part is thus equal to T_shift. As each multiplication result of two 4-bit numbers has 8 bits, the energy for the data transfer is written as

E_trans = 72E_write/bit + E_pre    (18)

For the summation operation, these 9 numbers with 8 bits are divided into 3 computational blocks, as mentioned in Section IV. Therefore, the total delay and energy in the summation array are expressed as

T_summation = 21T_addition + T_shift    (19)
E_summation = 41E_addition + E_trans    (20)

In summary, the total delay and energy of a convolution computation in Eq. (5) for 4-bit pixels and a kernel of size 3 × 3 with 4-bit weights are expressed as

T_TOTAL = 9(T_AND + 4T_shift + 7T_addition) + T_summation    (21)
E_TOTAL = 9(16(E_AND + E_shift) + 7E_addition) + E_summation    (22)
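These closed-form expressions translate directly into a small cost model. The sketch below transcribes Eqs. (12)-(22); the per-operation numbers are invented placeholders standing in for the measured values of Table III:

T_AND_CIM, T_PRE, T_WRITE, T_ADD_CIM = 1.0, 0.5, 1.5, 1.2    # ns, assumed
E_AND_CIM, E_PRE, E_WRITE, E_ADD_CIM = 1.0, 0.5, 2.0, 1.5    # fJ, assumed

T_and = T_AND_CIM + T_PRE                      # Eq. (12)
E_and = E_AND_CIM + E_PRE                      # Eq. (13)
T_shift = T_WRITE + T_PRE                      # Eq. (14)
E_shift = E_WRITE + E_PRE                      # Eq. (15)
T_add = T_ADD_CIM + T_WRITE + 2 * T_PRE        # Eq. (16)
E_add = E_ADD_CIM + 2 * E_WRITE + 2 * E_PRE    # Eq. (17)
E_trans = 72 * E_WRITE + E_PRE                 # Eq. (18)
T_sum = 21 * T_add + T_shift                   # Eq. (19)
E_sum = 41 * E_add + E_trans                   # Eq. (20)
T_total = 9 * (T_and + 4 * T_shift + 7 * T_add) + T_sum       # Eq. (21)
E_total = 9 * (16 * (E_and + E_shift) + 7 * E_add) + E_sum    # Eq. (22)
print(f"one 3x3 convolution: {T_total:.1f} ns, {E_total:.1f} fJ")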


TABLE III. KEY DELAY AND ENERGY PARAMETERS IN TD-CIM ARCHITECTURE

TABLE IV. PERFORMANCE COMPARISON OF TD-CIM, STT-CIM AND CRAM ARCHITECTURES ON 2D CONVOLUTION AT 1 MB, 28 NM

Fig. 14. Performance comparison among TD-CIM, STT-CIM and CRAM architectures on digit recognition.

A model calculating the overall delay and energy of the CNN is developed by analyzing the computation type and number of these layers in LeNet-5. The activation function and pooling layer perform comparison operations, which are difficult to realize in CIM architecture. We thus use a conventional comparator in the TD-CIM, STT-CIM and CRAM architectures, by which their delay and energy in the activation function and pooling layer are the same. This model is utilized in the following discussions and comparisons.

C. Results and Discussions

To analyze the performance of the TD-CIM architecture and demonstrate its advantages, key parameters should firstly be determined. As T_write/bit, E_write/bit, T_pre and E_pre are related to the array size, we obtain them by modifying the NVSim memory simulator to adopt the proposed highly reconfigurable field-free SOT-MRAM array of 1 MB [45]. T_TD-CIM/AND and E_TD-CIM/AND are obtained by SPICE simulations of the TD-CIM circuit. We then get T_TD-CIM/Add and E_TD-CIM/Add in the case of five activated bit-cells, which is determined considering the reliability limitation. The values of these key parameters are listed in Table III, which includes different cases of data stored in the activated bit-cells.

According to the model and these key parameters, the delay and energy for a 2D convolution with a kernel of size 3 × 3 in the TD-CIM architecture can be obtained and are shown in Table IV. To demonstrate the performance advantage of the proposed TD-CIM architecture, we compare it with the STT-CIM [23] and CRAM [24] architectures. They are simulated under the same CMOS and MTJ technologies, which guarantees the fairness of the comparison. Compared with the STT-CIM and CRAM architectures, the delay of the TD-CIM architecture with multi-addend addition that executes the 2D convolution computations is reduced by 1.7 times and 0.4 times, and the energy is decreased by 1.9 × 10^3 times and 8.9 × 10^3 times, respectively. Similar results can be observed in Fig. 14, which displays the performance comparison of the overall digit recognition based on the quantized CNN in the TD-CIM, STT-CIM and CRAM architectures. Here, the delay and energy are respectively reduced by 1.2-2.7 times and 2.4 × 10^3-1.1 × 10^4 times compared with the STT-CIM and CRAM architectures. It is noteworthy that, as the logic operations in the CRAM architecture are directly implemented in the bit-cell array without using peripheral logic circuits, CRAM has a smaller area overhead than the other two architectures. The area overhead of the TD-CIM architecture is smaller than that of the STT-CIM architecture. In summary, the proposed TD-CIM architecture greatly improves the performance of CNN, especially in terms of energy.

Table V compares the proposed TD-CIM scheme with state-of-the-art CIM architectures published in recent years. According to the computation method, CIM architectures can be divided into two schemes, i.e., analog and digital. From this comparison table, [11] and [27] adopt the analog computation method, in which the multiply-accumulate (MAC) result is directly reflected on the bit-line, as the total bit-line discharge current is the sum of each activated bit-cell current during one bit-line discharging. The MAC result is then obtained by using an ADC to sense the analog bit-line voltage. Therefore, CIM architectures using the analog computation method show excellent energy efficiency at low precision, but suffer from large area overhead and limited functionality (add and multiply only) and algorithm support, as shown in Table V. References [12] and [13], using the digital computation method, can perform high-precision arithmetic operations but have poor energy efficiency. Reference [46] proposed a time-domain computation method to improve the energy efficiency and bit-precision scalability, where pulse width modulation (PWM) is used to map digital values into the time domain. However, the latency is very long due to its sequential operation, and its weight precision is limited to 1 bit [47]. The TD-CIM scheme performs logic operations in the time domain, but it is essentially a digital computation method because the arithmetic operations are realized by composing the logic operations.


TABLE V. COMPARISON WITH PREVIOUS WORKS

Compared with the conventional addition operation of two addends, the TD-CIM scheme can implement the addition operation of three addends during one bit-line discharging. Note that more addends can be added in one addition operation. Therefore, one addition operation in the TD-CIM scheme is equivalent to two addition operations in the conventional digital CIM scheme, which further improves the energy efficiency. Besides, in the shift and summation arrays of the TD-CIM architecture, one TD-CIM circuit is shared by eight columns, which saves area overhead. In summary, the TD-CIM scheme offers higher energy efficiency and lower area overhead than existing CIM architectures using the digital computation method.

Moreover, in terms of the recognition accuracy, although the weights in LeNet-5 are quantized from floating-point parameters to fixed-point parameters, it still achieves an accuracy of 99.57% in recognizing the handwritten digits from the MNIST dataset. Since it is difficult to know the specific distribution of data cases for the CNN computation process, we assume the total computation accuracy of the TD-CIM circuit is the mean of the accuracies shown in Fig. 13, i.e., 99.07%. Then, it is introduced as a parameter to the quantized LeNet-5. The result shows that the accuracy of the quantized LeNet-5 run in the TD-CIM scheme is 98.65%, less than the accuracy of 99.57% by 0.92%, but still higher than that in [46], i.e., 98.42%. Furthermore, we also extend our design to the CIFAR-10 dataset. We first use the pre-trained VGG11 model from the PyTorch model zoo, which achieves an accuracy of 93.78%. Then, the VGG11 model is quantized with Algorithm 1. The final validation accuracy is 91.97%, with an accuracy drop of 1.81% caused by the quantization. Similarly, the accuracy of the TD-CIM circuit is introduced to the quantized VGG11 model. Finally, the accuracy of the quantized VGG11 model run in the TD-CIM scheme is 91.11%, less than the accuracy of 93.78% by 2.67%. In summary, the compatibility of the TD-CIM architecture and the quantized CNN is well delivered.

VI. CONCLUSION

This article proposes a TD-CIM architecture using spintronics to optimize the delay and energy performance for CNN applications. The TD-CIM circuit converts the voltage difference on the bit-line to the time domain, which not only improves the sensing reliability but also allows the multi-addend addition to simplify the arithmetic. To further improve the compatibility of the TD-CIM circuit for CNN, we propose a quantization method without sharp accuracy dropping, which can also reduce the complexity of CNN. A TD-CIM architecture with a highly reconfigurable field-free SOT-MRAM array is constructed to realize the optimal performance of the quantized CNN. Finally, by recognizing the handwritten digits from the MNIST dataset, we find that both the delay and energy of the TD-CIM architecture are greatly reduced compared with the STT-CIM and CRAM architectures. In addition, the TD-CIM architecture has higher energy efficiency and lower area overhead than present CIM architectures using the digital computation method. Finally, accuracies of 98.65% and 91.11% are achieved in the TD-CIM architecture with 4-bit fixed-point parameters on MNIST and CIFAR-10, respectively, which demonstrates that the proposed quantization method of CNN is compatible with the TD-CIM architecture. This work has significance for further research on high-performance memory-oriented computing systems.

REFERENCES

[1] M. Kang, S. Lim, S. Gonugondla, and N. R. Shanbhag, "An in-memory VLSI architecture for convolutional neural networks," IEEE Trans. Emerg. Sel. Topics Circuits Syst., vol. 8, no. 3, pp. 494–505, Sep. 2018.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2012, pp. 1097–1105.
[3] D. Silver et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, Jan. 2016.
[4] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, pp. 436–444, May 2015.
[5] J. Cheng, J. Wu, C. Leng, Y. Wang, and Q. Hu, "Quantized CNN: A unified approach to accelerate and compress convolutional networks," IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 10, pp. 4730–4743, Oct. 2018.
[6] S. Han, J. Pool, J. Tran, and W. J. Dally, "Learning both weights and connections for efficient neural networks," in Proc. NIPS, Montréal, QC, Canada, 2015, pp. 1135–1143.
[7] J. Wang, J. Lin, and Z. Wang, "Efficient hardware architectures for deep convolutional neural network," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 65, no. 6, pp. 1941–1953, Jun. 2018.
[8] P. Chi et al., "PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory," in Proc. ACM/IEEE 43rd Annu. Int. Symp. Comput. Archit. (ISCA), Seoul, South Korea, Jun. 2016, pp. 27–39.

[9] S. Angizi, Z. He, N. Bagherzadeh, and D. Fan, “Design and evaluation [31] Z. Wang, W. Zhao, E. Deng, J.-O. Klein, and C. Chappert,
of a spintronic in-memory processing platform for nonvolatile data “Perpendicular-anisotropy magnetic tunnel junction switched by spin-
encryption,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., Hall-assisted spin-transfer torque,” J. Phys. D, Appl. Phys., vol. 48, no. 6,
vol. 37, no. 9, pp. 1788–1801, Sep. 2018. Jan. 2015, Art. no. 065001.
Yue Zhang (Senior Member, IEEE) received the B.S. degree in optoelectronics from the Huazhong University of Science and Technology, Wuhan, China, in 2009, and the M.S. and Ph.D. degrees in microelectronics from the University of Paris-Sud, France, in 2011 and 2014, respectively. He is currently an Associate Professor with Beihang University, China. His current research interests include emerging non-volatile memory technologies and hybrid low-power circuit designs.


Jinkai Wang (Graduate Student Member, IEEE) received the B.S. degree in physics and electronic engineering from Kaili University, Kaili, China, in 2015, and the M.S. degree in circuits and systems from Anhui University, Anhui, China, in 2018. He is currently pursuing the Ph.D. degree in physical electronics with Beihang University, China. His current research interests include high-performance hybrid circuits.

Chenyu Lian received the B.S. degree in software engineering from Beijing Jiaotong University, Beijing, China, in 2018. He is currently pursuing the M.S. degree in integrated circuits with Beihang University. His current research interests include efficient deep learning methods on hardware and in-memory computing.

Yining Bai received the B.S. degree in communication engineering from Beijing Jiaotong University, Beijing, China. She is currently pursuing the M.S. degree with Beihang University. Her current research interests include in-memory computing.

Guanda Wang (Graduate Student Member, IEEE) received the B.S. degree in communication engineering from the Beijing University of Posts and Telecommunications, Beijing, China. He is currently pursuing the Ph.D. degree with Beihang University. His current research interests include the simulation and analysis of MTJs and all-spin logic devices.

Zhizhong Zhang (Student Member, IEEE) received the B.S. degree from Beihang University, Beijing, China, where he is currently pursuing the Ph.D. degree in microelectronics. His current research interests include theoretical magnetism and micromagnetic simulation.

Zhenyi Zheng (Graduate Student Member, IEEE) received the B.S. and master's degrees from Beihang University, Beijing, China, in 2015 and 2018, respectively, where he is currently pursuing the Ph.D. degree. His current research interests include the spin-orbit torque effect and ferrimagnetic materials.

Lei Chen received the B.S. degree in electronic and information engineering from Anhui University, Hefei, China, in 2018. He is currently pursuing the Ph.D. degree in microelectronics and solid-state electronics with Beihang University, Beijing, China. His research interests include lateral spin valves and emerging non-volatile memory technologies.

Kun Zhang (Member, IEEE) received the B.S. and Ph.D. degrees in physics from Shandong University, Jinan, China, in 2012 and 2017, respectively. He is currently a Lecturer with Beihang University, Beijing, China. His current research interests include emerging non-volatile memory devices and in-memory computing applications.

Georgios Sirakoulis (Member, IEEE) received the M.Eng. (Diploma) and Ph.D. degrees in electrical and computer engineering (ECE) from the Democritus University of Thrace (DUTh), Greece, in 1996 and 2001, respectively. He has been a Tenured Associate Professor with the ECE Department, DUTh, since 2008. He has published more than 200 technical articles, guest-edited 11 special issues, co-edited five books, and coauthored 15 book chapters. He is the EUROPRACTICE representative for DUTh, and he has served as a member of the EU IDEAS Program. He has participated as a PI in more than 20 scientific programs and projects funded by the Greek Government and industry as well as the European Commission. His current research interests include emergent electronic circuits and systems, memristors, green and unconventional computing, cellular automata theory and applications, complex systems, bioinspired computation/biocomputation, modeling, and simulation.

Youguang Zhang (Member, IEEE) received the M.S. degree in mathematics from Peking University, Beijing, China, in 1987, and the Ph.D. degree in communication and electronic systems from Beihang University, Beijing, in 1990. He is currently a Professor with the School of Electronic and Information Engineering, Beihang University. His research interests include circuit and system co-design for emerging memory and computing systems.
