
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 68, NO. 3, MARCH 2021

Time-Domain Computing in Memory Using Spintronics for Energy-Efficient Convolutional Neural Network

Yue Zhang, Senior Member, IEEE, Jinkai Wang, Graduate Student Member, IEEE, Chenyu Lian, Yining Bai, Guanda Wang, Graduate Student Member, IEEE, Zhizhong Zhang, Student Member, IEEE, Zhenyi Zheng, Graduate Student Member, IEEE, Lei Chen, Kun Zhang, Member, IEEE, Georgios Sirakoulis, Member, IEEE, and Youguang Zhang, Member, IEEE

Abstract—The data transfer bottleneck in Von Neumann architecture, owing to the separation between processor and memory, hinders the development of high-performance computing. The computing in memory (CIM) concept is widely considered as a promising solution for overcoming this issue. In this article, we present a time-domain CIM (TD-CIM) scheme using spintronics, which can be applied to construct energy-efficient convolutional neural networks (CNNs). Basic Boolean logic operations are implemented through recording the bit-line output at different moments. A multi-addend addition mechanism is then introduced based on the TD-CIM circuit, which can eliminate the cascaded full adders. To further optimize the compatibility of the TD-CIM circuit for CNN, we also propose a quantization method that transforms floating-point parameters of pre-trained CNN models into fixed-point parameters. Finally, we build a TD-CIM architecture integrating a highly reconfigurable array of field-free spin-orbit torque magnetic random access memory (SOT-MRAM) and evaluate its benefits for the quantized CNN. By performing digit recognition with the MNIST dataset, we find that the delay and energy are respectively reduced by 1.2-2.7 times and 2.4 × 10^3-1.1 × 10^4 times compared with STT-CIM and CRAM based on spintronic memory. Finally, the recognition accuracy can reach 98.65% and 91.11% on MNIST and CIFAR-10, respectively.

Index Terms—Computing in memory, time-domain, spintronics, digit recognition, convolutional neural networks.

Manuscript received November 1, 2020; revised January 20, 2021; accepted January 27, 2021. Date of publication February 3, 2021; date of current version February 23, 2021. This work was supported in part by the National Natural Science Foundation of China under Grant 61971024 and Grant 51901008, in part by the International Mobility Project under Grant B16001, and in part by the National Key Technology Program of China under Grant 2017ZX01032101. This article was recommended by Associate Editor S. Yin. (Corresponding author: Yue Zhang.)

Yue Zhang, Jinkai Wang, Chenyu Lian, Yining Bai, Guanda Wang, Zhizhong Zhang, Zhenyi Zheng, Lei Chen, Kun Zhang, and Youguang Zhang are with the MIIT Key Laboratory of Spintronics, School of Integrated Circuit Science and Engineering, Fert Beijing Institute, Beihang University, Beijing 100191, China, and also with the Nanoelectronics Science and Technology Center, Hefei Innovation Research Institute, Beihang University, Hefei 230013, China (e-mail: [email protected]).

Georgios Sirakoulis is with the Department of Electrical and Computer Engineering, Democritus University of Thrace, 67100 Xanthi, Greece.

Color versions of one or more figures in this article are available at https://doi.org/10.1109/TCSI.2021.3055830.

Digital Object Identifier 10.1109/TCSI.2021.3055830

Fig. 1. (a) Von Neumann architecture. (b) Computing in memory architecture.

I. INTRODUCTION

MACHINE learning (ML) has made great progress driven by the demand of burgeoning big-data-driven applications, such as artificial intelligence (AI), autonomous driving and internet of things (IoT) [1]. Among various algorithms of ML, the convolutional neural network (CNN) is one of the representative methods [2], [3], possessing extraordinary performance in cognitive and decision-making tasks [4]. However, with the increasing dataset scale and target complexity, CNN is facing the challenges of increasingly complex interconnections, more convolution computations and frequent data transfers. There are certain improvements in algorithms to overcome these challenges [5]–[7]. However, the Von Neumann bottleneck, owing to the limited data bandwidth between memory and processor in Von Neumann architecture, inherently constrains the execution performance of CNN, as shown in Fig. 1(a).

In order to address the above issues, computing in memory (CIM) architecture has been introduced, as depicted in Fig. 1(b). By exploiting the physical attributes of structures or devices, computations are performed in memory to achieve significant time and energy efficiency [8]–[10]. According to this concept, there have been many explorations based on static random access memory (SRAM), dynamic RAM (DRAM) and emerging non-volatile memory (NVM) technologies. For example, [11]–[17] proposed to use a sense amplifier (SA) or analog-to-digital converter (ADC) in SRAM and DRAM to distinguish the variational current or voltage generated by multiple activated bit-cells, thereby implementing logic operations. However, due to the increase of leakage current with the scaling down of CMOS devices, the processing of data-intensive applications produces considerable energy consumption in SRAM and DRAM.

Recent breakthroughs in several NVM techniques provide a potential way to realize near-zero leakage and static power consumption.


Fig. 2. TD-CIM circuit. (a) Spintronic bit-cell structure and switching principle of field-free SOT-MRAM. (b) Spintronic cell array. (c) TDC unit. (d) Principle of TDC unit. (e) Logic unit.

Among different NVMs, spintronic memories offer advantageous performance, especially in terms of the energy and time of write operations [18]–[22]. This reduces the energy of a CIM architecture that requires writing the logic results back to bit-cells, and various CIM architectures based on spintronic memories have been proposed. Reference [23] presented the spin transfer torque CIM (STT-CIM) architecture, which modifies the peripheral decision circuit to sense the effective resistance of the bit-line and can perform Boolean logic, arithmetic and complex vector operations. Using the physical attributes of the STT device, [24] proposed the computational RAM (CRAM) architecture to perform computations in the cell array, which generates logic outputs directly in STT devices. However, these CIM architectures only adopt the concept of the arithmetic logic unit (ALU) to carry out computations, but do not fully explore the inherent advantages of the memory array. For example, the addition operation, the fundamental unit in all arithmetic operations [25], [26], is normally implemented by cascading full adders. If the same mechanism is used in CIM architectures, a large amount of additional decoding operations and time sequence schedules are required, which greatly increases the computation complexity and degrades the performance in terms of delay and energy.

In this work, we propose a time-domain CIM (TD-CIM) scheme based on spintronic memory enabling simplification of arithmetic operations for energy-efficient CNN. The TD-CIM circuit is firstly proposed to execute NOR, NAND and XOR operations by converting the variation of bit-line voltage to the time domain. According to the characteristics of the output, we propose a multi-addend addition mechanism for implementing the addition operation of multiple 1-bit addends in a memory access. Furthermore, the addition of multiple N-bit addends with the TD-CIM circuit is realized and used for the multiplication. In order to improve the compatibility of the TD-CIM circuit for CNN, we propose a quantization method that transforms floating-point parameters of pre-trained CNN models into fixed-point parameters. Finally, a TD-CIM architecture with a highly reconfigurable array of spin-orbit torque magnetic RAM (SOT-MRAM) is built, and we evaluate its delay and energy by performing 2D convolution to recognize handwritten digit images from the MNIST dataset. Compared with the STT-CIM and CRAM architectures, the delay of the TD-CIM architecture is reduced by 2.7 times and 1.2 times, and the energy is decreased by 2.4 × 10^3 times and 1.1 × 10^4 times, respectively.

The remaining parts are organized as follows: Section II presents the TD-CIM circuit to implement Boolean logic. Multi-addend addition and efficient multiplication schemes based on the TD-CIM circuit are described in Section III. The quantization method of CNN and a TD-CIM architecture are illuminated in Section IV. Section V analyzes the reliability of the TD-CIM circuit and evaluates the performance of the TD-CIM architecture by performing 2D convolution for digit recognition. Conclusions are presented in Section VI.

II. TD-CIM CIRCUIT FOR BOOLEAN LOGIC

In CIM architecture, distinguishing the bit-line voltage is a common method to perform logic operations [12]. Its principle can be analyzed by an RC circuit model, in which the bit-line voltage (V_t) is expressed as

V_t = V_0 e^{-T_dis/(RC)} = V_0 e^{-T_dis/((R_PR + R_ER)C)}    (1)

where T_dis refers to the discharge time, V_0 is the initial voltage of the bit-line, and R and C are the resistance and capacitance on the discharge channel, respectively.


In the memory array, R includes the equivalent resistance (R_ER) of the activated bit-cells in parallel and the parasitic resistance (R_PR) on the discharge channel. As R_PR and C are always constant after the chip is designed, the bit-line voltage mainly depends on R_ER after a fixed T_dis. In this case, different configurations of activated bit-cells can thus be reflected by the bit-line voltage, and logic operations can be implemented through comparing the bit-line voltage with a reference voltage. However, the difference of bit-line voltages with different input configurations is normally slight [27], [28], hence the logic output detection requires accurate generation and distribution of the reference voltage for the SA or ADC. The more bit-cells are activated simultaneously, the more difficult the detection becomes.

To solve this problem, as shown in Fig. 2, we propose a TD-CIM circuit which is composed of a spintronic cell array, a time-domain conversion (TDC) unit and a logic unit. It is well known that SOT-MRAM provides advantageous write behavior compared with STT-MRAM [22], [29]. As a large amount of write operations are normally required in CIM architecture, applying SOT-MRAM can obviously improve the overall performance of the TD-CIM circuit. However, to achieve the deterministic switching of a magnetic tunnel junction (MTJ) with perpendicular magnetic anisotropy (PMA) in SOT-MRAM, an additional magnetic field has to be used, which is a major hurdle for its practical application. Recently, a field-free SOT-MRAM was proposed by combining the STT and SOT effects [30], [31]. As illustrated in Fig. 2(a), its write operation has three phases: (i) SOT current flows through the heavy metal to form in-plane magnetization in the free layer (FL) of the MTJ due to the spin-Hall effect (SHE); (ii) STT current is then injected to determine the MTJ's state; (iii) SOT current is removed, but STT current still remains until the magnetization relaxes to the perpendicular axis. If STT current flows from FL to pinned layer (PL), the MTJ state is set to '0' (low resistance, R_L), and the current with the opposite direction writes '1' (high resistance, R_H). In this way, due to the metastable state induced by the SOT current, the effect of the STT current is amplified to reduce the incubation delay of magnetization switching. Hence, this field-free SOT-MRAM provides fast switching speed as well as low energy, and we adopt it in the spintronic cell array (see Fig. 2(b)).

The TDC unit consists of an inverter and a buffer (see Fig. 2(c)). The inverter is connected to the bit-line (BL). For realizing the TDC function, BL is firstly pre-charged to the supply voltage (VDD). When the spintronic bit-cells are activated by the word-lines (WLs) and the source-line (SL) is connected to the ground through enabling the DS signal, the BL voltage (V_BL) starts to decrease. The output of the TDC unit (V_TDC) does not reverse until V_BL has decreased to the threshold voltage of the interior NMOS transistor (V_nth). The discharge time (T_dis) can thus be derived by the transformation of Eq. (1) as follows:

T_dis = -RC ln(V_nth/V_0) = -(R_PR + R_ER) C ln(V_nth/V_0)    (2)

As shown in Fig. 2(d), different configurations of activated bit-cells (via R_ER) are reflected to the discharge time (i.e., the reversal moment of V_TDC), which realizes the conversion from voltage difference (ΔV) to time difference (ΔT). For example, the four configurations of two activated bit-cells can be divided into three cases according to the value of R_ER: two activated bit-cells both store '0' (case "00"); one of the bit-cells stores '0' and the other stores '1' (case "01&10"); two activated bit-cells both store '1' (case "11"). Due to the different R_ER in these cases, the speeds of the V_BL drop are varied. The reversal moments of V_TDC in these three cases are thus different and form two intervals, i.e., ΔT1 and ΔT2. In ΔT1, V_TDC is a high voltage ('1') only in the case "00", implementing the NOR logic. Similarly, in ΔT2, V_TDC is a low voltage ('0') only in the case "11", implementing the NAND logic. Therefore, by choosing these moments to distinguish V_TDC, reconfigurable logic operations can be achieved. Here, we can utilize a series of DFFs to record V_TDC at these moments and buffers to enhance the drive capability (see Fig. 2(e)).

TABLE I. TRUTH TABLE FOR TYPICAL LOGIC FUNCTIONS BASED ON TD-CIM CIRCUIT

Table I exhibits the truth table for typical functions based on the TD-CIM circuit, in which the outputs of DFF0 and DFF1 (D0 and D1) evaluate the NOR and NAND logic operations of the data stored in the activated bit-cells. Furthermore, we design an XOR circuit in the logic unit consisting of a pull-up channel and two pull-down channels. When D0 and D1 are both '0' or both '1', the pull-up channel is closed and one of the pull-down channels is opened, by which the output of the buffer in the logic unit drops to '0'. Alternatively, when D0 and D1 are '0' and '1', respectively, the buffer outputs a high voltage because the pull-up channel is opened and both of the two pull-down channels are closed. Fig. 3 demonstrates the transient simulation results of the TD-CIM circuit based on the field-free SOT-MRAM. The signals CP0 and CP1 control DFF0 and DFF1 to record V_TDC at two moments (T1 and T2). Hence, D0 and D1 give the NOR logic and NAND logic outputs, respectively. Besides, the XOR logic can be realized based on D0 and D1. As the output of the XOR circuit should be detected after T2, at which both NOR and NAND logic operations are completed, the total delay of the TD-CIM circuit based on field-free SOT-MRAM to achieve XOR logic is about 1.2 ns. It is also noteworthy that the NOR, NAND and XOR logic operations are carried out through one memory access in the TD-CIM circuit.
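To make the time-domain sensing concrete, the following Python sketch models Eq. (2) behaviorally and shows how sampling V_TDC at two moments yields NOR, NAND and XOR, as in Table I. It is an illustration under assumed component values (V0, VNTH, RPR, C, RL and RH are placeholders, not the paper's measured parameters):

import math

V0, VNTH = 1.0, 0.45      # assumed pre-charge voltage and NMOS threshold (V)
RPR, C = 1e3, 100e-15     # assumed parasitic resistance (ohm) and BL capacitance (F)
RL, RH = 10e3, 20e3       # assumed MTJ low ('0') and high ('1') resistances (ohm)

def t_dis(cell_states):
    """Discharge time until V_BL crosses V_nth, per Eq. (2)."""
    g = sum(1.0 / (RL if s == 0 else RH) for s in cell_states)
    r_er = 1.0 / g                                 # parallel equivalent resistance
    return -(RPR + r_er) * C * math.log(VNTH / V0)

# The three cases of two activated cells reverse at distinct moments:
t00, t01, t11 = t_dis([0, 0]), t_dis([0, 1]), t_dis([1, 1])
assert t00 < t01 < t11

# Clock the DFFs inside the two intervals (CP0 within dT1, CP1 within dT2):
cp0, cp1 = (t00 + t01) / 2, (t01 + t11) / 2
for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    t = t_dis([a, b])
    d0 = 1 if t < cp0 else 0   # reversal already happened at CP0 -> NOR output
    d1 = 1 if t < cp1 else 0   # reversal already happened at CP1 -> NAND output
    print(a, b, "NOR:", d0, "NAND:", d1, "XOR:", d0 ^ d1)

Running the loop reproduces the truth table: D0 is '1' only for case "00", D1 is '0' only for case "11", and D0 XOR D1 gives the XOR output, all from a single bit-line discharge.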
III. MULTI-ADDEND ADDITION AND EFFICIENT MULTIPLICATION SCHEME BASED ON TD-CIM CIRCUIT

The addition is the basis for carrying out complex arithmetic operations. Normally, the addition of multiple addends is implemented by cascading full adders based on Boolean logic [32], [33]. However, in a CIM scheme, the cascade of full adders requires a series of decoding operations and time sequence schedules of memory, which greatly increases the computing complexity.


To address this issue, we propose a multi-addend addition scheme based on the TD-CIM circuit to simplify the arithmetic operations in memory. An efficient multiplication scheme is also brought out for the following investigation on CNN.

Fig. 3. Transient simulation results of TD-CIM circuit based on field-free SOT-MRAM. (a) BL and V_TDC in the case "11". (b) BL and V_TDC in the case "01&10". (c) BL and V_TDC in the case "00". (d) CP0 and CP1. (e) Outputs of XOR, D0 and D1 in the case "11". (f) Outputs of XOR, D0 and D1 in the case "01&10". (g) Outputs of XOR, D0 and D1 in the case "00".

Fig. 4. Carry principle for the addition of multiple addends. (a) Case of three addends. (b) Case of four addends.

A. Multi-Addend Addition Scheme Based on TD-CIM Circuit

Fig. 4 exemplifies the carry principle for the multi-addend addition operation, in which M_i (i = 0, 1, 2, ...) represents the order of magnitude in a binary addend and M_0 is the lowest. In the addition of three addends, as shown in Fig. 4(a), a carry is generated if the three 1-bit addends in M_0 contain at least two '1'. Four 1-bit addends are then added in M_1. In the ultimate case that these 1-bit addends are all '1', M_1 will directly generate a carry to M_3. Meanwhile, M_2 might generate a carry to M_3 as well. Therefore, in the addition of three addends, two additional bits are required for the computation in M_3. A similar mechanism is observed in the addition of four addends shown in Fig. 4(b), where three carries might be computed. This conclusion can be extended to the addition of n addends, where n-1 additional bits should be taken into account in each operation.

Fig. 5(a) illustrates the principle of the addition of n addends in the TD-CIM circuit, in which 2n-1 word-lines are activated simultaneously, including n 1-bit addends and n-1 carries in each column. In this case, there are 2^{2n-1} configurations of these activated bit-cells, which are classified into 2n cases according to the number of the datum '1' stored in the bit-cells. In order to distinguish these 2n cases, 2n-1 DFFs are used in the TD-CIM circuit to record the outputs at 2n-1 moments, as demonstrated in Fig. 5(b). The sum (S_i) and carry (C_i) in M_i can be expressed as

S_i = (D_0 XOR D_1) OR (D_2 XOR D_3) OR ... OR D_{2n-2}    (3)

C_{(i,1)} = D_1 XOR D_3
C_{(i,2)} = D_3 XOR D_5
...
C_{(i,n-2)} = D_{2n-5} XOR D_{2n-3}
C_{(i,n-1)} = D_{2n-3}    (4)

where C_{(i,p)} (p = 1, 2, ..., n-1) represents the carry calculated in M_i for M_{i+p}. As the 2n-1 bit-cells can be activated simultaneously in the TD-CIM circuit, S_i and C_i can be obtained in one memory access, which effectively reduces the computing complexity.

Fig. 6 shows a detailed operational process for an addition of three 8-bit addends (A, B and C) stored by row in the memory array. Assigning n = 3 to Eq. (4), two carries, i.e., C_{(i,1)} = D_1 XOR D_3 and C_{(i,2)} = D_3, are generated in M_i. In order to store these carries, two additional bit-cells are needed for each order of magnitude. Note that the additional bit-cells should be initialized to '0'. Hence, by activating the five word-lines connected to the bit-cells where the three 8-bit addends and two carries are stored, S_0, C_{(0,1)} and C_{(0,2)} are firstly obtained in M_0 by using the TD-CIM circuit to implement the addition of five 1-bit numbers (three 1-bit addends and two 1-bit carries) and then written into the corresponding bit-cells. To further decrease the operation time, these bit-cells can be selected in advance by the decoder, and S_0, C_{(0,1)} and C_{(0,2)} can be written at the same time because they are in different columns. A similar process will subsequently be carried out for the other orders of magnitude. Considering two overflow orders, it only takes 10 cycles to complete the addition of three 8-bit addends based on the TD-CIM circuit. Note that the number of cycles is only related to the number of bits, not the number of addends.
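As a sanity check of this mechanism, the following Python sketch simulates the n = 3 case column by column: each cycle reads five cells (three addend bits plus two previously written carries), decodes the thermometer-coded DFF outputs with Eqs. (3)-(4), and writes the sum and carries back. It is our reading of the scheme, not the authors' code; the D_j polarity (D_j = '1' when more than j cells store '1') is chosen here so that Eqs. (3)-(4) hold:

def add3_8bit(a, b, c, width=8):
    """Add three width-bit numbers in width+2 column cycles (10 for 8-bit)."""
    addends = [[(x >> i) & 1 for i in range(width + 2)] for x in (a, b, c)]
    carry1 = [0] * (width + 3)    # C_(i,1), consumed by column i+1
    carry2 = [0] * (width + 4)    # C_(i,2), consumed by column i+2
    out = 0
    for i in range(width + 2):    # 8 bit columns + 2 overflow columns
        cells = [addends[0][i], addends[1][i], addends[2][i],
                 carry1[i], carry2[i]]
        k = sum(cells)                               # number of '1's in the column
        D = [1 if k > j else 0 for j in range(5)]    # DFF outputs D0..D4
        s = (D[0] ^ D[1]) | (D[2] ^ D[3]) | D[4]     # Eq. (3) with n = 3
        carry1[i + 1] = D[1] ^ D[3]                  # Eq. (4): C_(i,1)
        carry2[i + 2] = D[3]                         # Eq. (4): C_(i,2)
        out |= s << i
    return out

assert all(add3_8bit(a, b, c) == a + b + c
           for a, b, c in [(255, 255, 255), (0, 0, 0), (170, 85, 204)])

The loop always runs 10 cycles for 8-bit operands, matching the cycle count stated above: the cost depends only on the bit width, not on the number of addends sharing a column.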


Fig. 5. (a) Principle of the addition of n addends based on TD-CIM circuit. (b) Schematic of multi-addend addition based on TD-CIM circuit.

Fig. 6. Diagram of addition in array for three 8-bit addends based on TD-CIM circuit.

Fig. 7. Efficient multiplication scheme based on TD-CIM circuit.

B. Efficient Multiplication Scheme

In digital integrated circuits, the multiplication is normally implemented through shift and addition operations. The aforementioned multi-addend addition can be applied to improve the performance of multiplications in memory. Fig. 7 exhibits the efficient multiplication scheme based on the TD-CIM circuit for two 4-bit numbers (A and B). During the shift operation, four 4-bit numbers are first generated by performing AND operations between A and B. Then, by utilizing write operations in memory, the shift operations of the four 4-bit numbers are realized. When implementing the addition of the four 4-bit numbers via the TD-CIM circuit, four 8-bit numbers (K, Y, Z and L) are constructed by filling the rows with the datum '0'. M_2, M_3, M_4 and M_5 might generate two carries (C_{(i,1)} and C_{(i,2)}), while only one carry is generated in M_1 and M_6. Note that the bit in M_0 of K directly serves as S_0, which does not require any computation. Here, in order to reduce the bit number of the addition in each order of magnitude, we respectively store C_{(i,2)} in M_{i+2} (i = 2, 3, 4, 5) of K, instead of requiring another row for storing C_{(i,2)} as in Fig. 6. Hence, only one row is added to store the carry C_{(i,1)}. According to the multi-addend addition scheme, this multiplication can be completed within 7 cycles, which is much more compact than the mainstream schemes based on full adders [24]. Therefore, the multiplication operation of two N-bit numbers takes 2N cycles in the TD-CIM scheme, including one cycle to shift the partial products and 2N-1 cycles of the addition operation.
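The flow can be mirrored in software. The sketch below is an illustration, not the authors' code: it generates the AND partial products, shifts them by write position, and sums them. For brevity it reuses add3_8bit from the previous sketch in two passes, whereas the in-array version folds the carries into row K to finish the whole addition in 7 cycles:

def tdcim_mul_4bit(a, b):
    """Multiply two 4-bit numbers the TD-CIM way: AND, shift, multi-addend add."""
    rows = [a if (b >> i) & 1 else 0 for i in range(4)]   # AND partial products
    shifted = [r << i for i, r in enumerate(rows)]        # shift via write position
    partial = add3_8bit(shifted[0], shifted[1], shifted[2])
    return add3_8bit(partial, shifted[3], 0)              # all values fit in 8 bits

assert all(tdcim_mul_4bit(a, b) == a * b for a in range(16) for b in range(16))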
IV. OPTIMIZATION OF CNN BASED ON TD-CIM CIRCUIT

CNN is commonly applied to analyze visual images and is composed of convolution layers, activation functions, pooling layers and fully-connected layers [4]. Particularly, the convolution layer is used for image sharpening, blurring and edge detection.


Its core computation is described as

O_{i,j} = Σ_{k=1}^{n} Σ_{l=1}^{n} I_{i+k,j+l} W_{k,l}    (5)

where W represents the matrix of the convolution kernel and I refers to the matrix of input pixels. It is obvious that the computation of convolution consists of multiplication and summation operations. Hence, based on the multi-addend addition and efficient multiplication schemes proposed above, the TD-CIM circuit can greatly improve the computation performance of convolution in memory.

Fig. 8. Convolution for an image with 4-bit per pixel by using a kernel of size 3 × 3 with weight of 4-bit in TD-CIM circuit.

Fig. 8 shows an example of a convolution operation for an image with 4-bit per pixel, in which a kernel of size 3 × 3 with 4-bit weights is computed by using the TD-CIM circuit. The efficient multiplication operations are firstly performed to generate nine 8-bit numbers. Then, the final result of the convolution is obtained through implementing the multi-addend addition for these numbers. Considering the computational accuracy of the TD-CIM circuit, the nine 8-bit numbers are divided into three computation blocks, instead of adding all of them at once. A 10-bit number is thus generated by each block. Finally, the three 10-bit numbers are added to obtain the convolution result, which should be an 11-bit number. Similar to the convolution implementation, it is also possible to realize the activation function, pooling layer and fully-connected layer by using the TD-CIM circuit.
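One output pixel of Eq. (5) under this schedule can be traced in a few lines. The sketch below is illustrative (not the authors' code) and mimics the blockwise flow: nine 4-bit by 4-bit products, three block sums of three products each, then one final addition:

def conv_pixel(patch3x3, kernel3x3):
    """One output of Eq. (5), summed in three blocks as in Fig. 8."""
    products = [p * w for p, w in zip(patch3x3, kernel3x3)]   # nine 8-bit numbers
    blocks = [sum(products[i:i + 3]) for i in (0, 3, 6)]      # three 10-bit sums
    return sum(blocks)                                        # 11-bit result

patch = [3, 15, 7, 0, 9, 12, 5, 8, 1]    # 4-bit pixels of one 3 x 3 window
kernel = [1, 2, 1, 2, 4, 2, 1, 2, 1]     # 4-bit kernel weights (assumed values)
print(conv_pixel(patch, kernel))         # equals the direct dot product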
A. Quantization Method of CNN

CNN requires high-precision floating-point computation according to the gradient descent learning algorithm, which normally increases the complexity of CIM-based CNN computation. For example, [34] demonstrates that both the time and energy of computations for floating-point numbers exceed those for fixed-point numbers by more than one order of magnitude under the same calculation condition. Therefore, the transformation of floating-point parameters into fixed-point parameters in CNN is beneficial for reducing the number of computations [35]–[40] and will improve the compatibility of the TD-CIM circuit for CNN. Reference [39] is a typical method that uses fixed-point parameters instead of high-precision floating-point parameters in the CNN algorithm. However, this quantization method necessitates multiple retraining processes to generate fixed-point parameters, which cannot be widely used due to its low efficiency. Here, we propose an alternative quantization method to optimize the generation process of fixed-point parameters.

When transforming the trained floating-point parameters into fixed-point parameters, we set the quantization process as an optimization task rather than directly obtaining the fixed-point parameters by training the floating-point parameters. Besides, CNN performs the recognition task by selecting the element with the largest probability in the result vector. Therefore, in the proposed quantization method, the fixed-point parameters are achieved in the pre-trained neural network models without changing the sequence of output probabilities. As shown in Algorithm 1, the quantization process of CNN can be described in three steps as follows:

Algorithm 1 Quantization Procedure of CNN
Input: Pre-trained model weight in 32-bit or 64-bit float, W
Output: Quantized model weight in N-bit integer, Wq
Step 1. Compute the range of W:
    lowbound = min(W);  upbound = max(W)
Step 2. Compute the scale of W:
    S = 2^N - 1;  L = abs(lowbound);  U = abs(upbound)
    if L >= U then
        Scale = S / L
    else
        Scale = S / U
Step 3. Quantize W into integers:
    Wq = Round(Scale * W)
return Wq

Step 1: The floating-point parameters in pre-trained models and the input data of the image are scaled to the largest range of N bits via linear transformation; for example, the weight range of -0.31 to +0.44 is scaled to -15 to +15, which can be represented by a 4-bit number. Since the activation function is the only non-order-preserving factor that can change the sequence of the result vector, the scaling method depends on the type of activation function in the CNN. Here, we use the ReLU function as the activation function because it is widely used in neural network models. In a CNN based on the ReLU function, the sequence of the final result will remain unchanged if the sign of the parameters is not changed. Therefore, the sign of the parameters must remain unchanged during the scaling operation.

Step 2: Step 1 doesn't consider the decrease of accuracy, which is inconsistent with reality (the accuracy drop is usually more than 2%). Here, we propose a mathematical model to solve this problem and obtain the smallest accuracy drop. Because the prediction result in CNN is given through the vector from the output layer, and the result vector will change after quantizing the original neural network, the closer the output vector of the quantized model is to that of the original model, the smaller the accuracy drop caused by quantization is.

Fig. 9. (a) Original image with 8-bit per pixel. (b) Converted image with 4-bit per pixel. (c) Convolution result.

Hence, we quantize the weight in the neural network to minimize the accuracy drop. It can be described as

min ||σ(w_q) - σ(w)||,  s.t. w_q = Q(w)    (6)

where w refers to the set of floating-point weight matrixes, w_q represents the set of weight matrixes scaled to N bits, σ is the computation process of the classical CNN with the ReLU function, and Q is the quantization function.

Step 3: The scaling operation compensates the influence of the activation function on quantization. Meanwhile, because the purpose of the pooling layer is to progressively reduce the spatial size of parameters and computations in the network, and its operation on each feature map is independent [41], the computation of the pooling layer doesn't affect the final sequence of output probabilities. Therefore, according to the associative law of calculation in the convolution layer and fully-connected layer, Eq. (6) is modified as

min ||σ_N(w_q - w)||,  s.t. w_q = Q(w)    (7)

where N represents the number of bits in a single pixel and a single kernel weight, and σ_N is the computation process of the classical CNN except for the pooling layer and activation function. Moreover, the difference between σ_N(w_q) and σ_N(w) is proportional to the difference between w_q and w. Therefore, the optimal solution of Q is rounding. Meanwhile, the difference between w_q and w is reduced as the size of N increases, which is beneficial to quantizing CNN.

This method implements quantization without retraining and encoding, which reduces the amount of calculation. Meanwhile, by scaling the weight appropriately to neutralize the non-linearity of ReLU, the accuracy drop produced by quantization without retraining can be minimized. Finally, the N-bit convolution operation is quantized into 4 bits, thereby reducing the complexity of CNN and enhancing the compatibility of the TD-CIM circuit for CNN. Fig. 9 demonstrates an example of image processing using the proposed quantization method. The original image with 8-bit per pixel is converted to an image with 4-bit per pixel. Fig. 9(c) shows the result of the convolution computation with a 4-bit kernel.
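A compact implementation of Algorithm 1 is easy to write. The sketch below is our NumPy rendering with an assumed 4-bit default, not the authors' released code:

import numpy as np

def quantize_weights(w, n_bits=4):
    """Algorithm 1: scale floats so the largest magnitude maps to 2^N - 1,
    then round; the sign is preserved (Step 1) and Q is rounding (Step 3)."""
    s = 2 ** n_bits - 1
    bound = max(abs(float(w.min())), abs(float(w.max())))   # max(L, U)
    scale = s / bound
    return np.round(scale * w).astype(np.int32), scale

# Example with the weight range quoted in Step 1 (-0.31 to +0.44):
w = np.array([-0.31, -0.05, 0.12, 0.44])
wq, scale = quantize_weights(w)
print(wq)   # integers within [-15, 15]; +0.44 maps to +15

Applied layer by layer to a pre-trained model, this reproduces the no-retraining flow described above.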
Fig. 10. TD-CIM architecture for quantized CNN.

B. TD-CIM Architecture for Quantized CNN

We then design a TD-CIM architecture using field-free SOT-MRAM to execute the quantized CNN, as shown in Fig. 10. It consists of three sub-arrays: a data array, a shift array and a summation array. The data array is specialized to store the original data of images and the convolution kernel. For the convolution computation of CNN, the shift array stores the shifted data in the multiplications, and the summation array stores the results of the multiplications.

In the data array, to enhance the parallelism of AND operations, the image data are stored by row and the kernel data are stored by column. By using the TD-CIM circuit located in each column, each bit of the pixel can carry out an AND operation with any bit of the kernel at the same time. When their results are transferred to the shift array, the shift operations are implemented by write operations. The addition part of the proposed efficient multiplication scheme is also performed in the shift array to get the results of the multiplication. At last, these results are transferred to the summation array to obtain the final convolution results by implementing the multi-addend addition operation based on the TD-CIM circuit.

However, owing to the structural limitation of the memory array, logic operations can only be carried out on the columns. By contrast, to increase the parallelism of write operations, the numbers generated by multiplication operations are stored on the rows, which prevents their multi-addend addition from being implemented. Therefore, we propose a highly reconfigurable array based on the field-free SOT-MRAM allowing logic operations on the rows. As shown in Fig. 11, a transistor is added in the bit-cell to construct three bit-lines (SL, BL, CBL) and three word-lines (WL, RWL, CWL). When performing logic operations on the rows, the TD-CIM circuit is connected to the CBL. Thanks to this reconfigurability enhancement, the quantized CNN can be implemented more efficiently by the TD-CIM architecture.

V. PERFORMANCE EVALUATION AND DISCUSSION

As reliability is crucial for implementing logic operations, we first analyze the reliability of the TD-CIM circuit. Hybrid CMOS/SOT-MRAM simulations are carried out by applying 28 nm CMOS process technology and the field-free SOT-MRAM compact model [31].


Fig. 11. Structure of the highly reconfigurable array based on SOT-MRAM.

TABLE II. KEY PARAMETERS OF THE FIELD-FREE SOT-MRAM

Fig. 12. Monte Carlo simulation results of V_TDC in different data cases when five bit-cells are activated simultaneously (1 V, TT, 25 °C).

Table II summarizes the key parameters of the field-free SOT-MRAM, which are dependent on physical models and experimental measurements [29], [30]. Then, by performing 2D convolutions in LeNet-5 to recognize the handwritten digit images from the MNIST dataset, we evaluate the performance of the TD-CIM architecture in terms of delay and energy, and compare it with those of the existing CIM architectures. Moreover, the recognition accuracy is given to prove the compatibility of the TD-CIM architecture for CNN.

A. Reliability Analysis of TD-CIM Circuit

Although the reliability of the TD-CIM circuit is improved by distinguishing the bit-line voltage in the time domain [42], it deteriorates with the increasing number of activated bit-cells. Assume that n bit-cells are activated in the TD-CIM circuit, in which x bit-cells store '1' and the others store '0'. From Eq. (2), T_dis(x,n-x) is written as

T_dis(x,n-x) = -(R_PR + (R_H R_L)/((n-x) R_L + x R_H)) C ln(V_nth/V_0)    (8)

Then, a bit-cell storing '1' is changed to '0', i.e., x-1 bit-cells store '1' and n-x+1 bit-cells store '0'. In this case, T_dis(x-1,n-x+1) is described as

T_dis(x-1,n-x+1) = -(R_PR + (R_H R_L)/((n-x+1) R_L + (x-1) R_H)) C ln(V_nth/V_0)    (9)

Hence, the ΔT between them can be expressed as

ΔT = T_dis(x,n-x) - T_dis(x-1,n-x+1)    (10)

It is well known that the tunnel magnetoresistance (TMR) ratio is defined as TMR = (R_H - R_L)/R_L, which reflects the difference between R_H and R_L. Therefore, Eq. (10) is rewritten as

ΔT = [(1 + TMR) TMR] / [(n + x·TMR)(n + x·TMR - TMR)] × R_L C ln(V_nth/V_0)    (11)

Here, TMR and R_L are determined by the MTJ device and can be regarded as constants after fabrication. Therefore, from Eq. (11), ΔT will gradually decrease as x reduces. When none of the bit-cells stores '1' (i.e., x = 0), ΔT is in the worst case (ΔT_worst). If ΔT_worst is smaller than the delay of the DFF, errors will occur in the TD-CIM circuit because the DFF cannot record V_TDC in time. To ensure the reliability of the TD-CIM circuit, ΔT_worst must be large enough.
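To get a feel for how the margin scales, the following sketch evaluates Eq. (11) numerically for the five-cell case used below. All device values are illustrative placeholders (the paper's actual parameters live in Table II), and the magnitude of ln(V_nth/V_0) is used so the margin prints as a positive time:

import math

RL, C = 10e3, 100e-15    # assumed MTJ low resistance (ohm) and BL capacitance (F)
TMR = 1.0                # assumed TMR ratio, (RH - RL) / RL
V0, VNTH = 1.0, 0.45     # assumed pre-charge voltage and sensing threshold (V)

def delta_t(n, x):
    """Sensing margin of Eq. (11) between the x and x-1 configurations."""
    num = (1 + TMR) * TMR
    den = (n + x * TMR) * (n + x * TMR - TMR)
    return num / den * RL * C * abs(math.log(VNTH / V0))

n = 5   # five activated cells: three addends plus two carries
for x in range(1, n + 1):
    print(f"x = {x}: margin = {delta_t(n, x) * 1e12:.1f} ps")
# The smallest margin over all configurations is dT_worst; it must exceed
# the DFF sampling delay for V_TDC to be recorded reliably.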
We carry out Monte Carlo simulations of 10^4 samples when five word-lines are activated simultaneously, i.e., there are five activated bit-cells on a column. Fig. 12 demonstrates the variations of ΔT among different data cases at 1 V, TT, 25 °C. Note that PVT variation also affects ΔT in addition to the number of activated bit-cells. Here, the process deviation of the MTJ resistance follows a Gaussian distribution with 5% variability [42], [43]. For the CMOS transistors, local and global variations are included by using the foundry statistical models [43].


Fig. 13. Computation accuracy of TD-CIM circuit for different data cases.

Simulation results show that ΔT_worst is about 60 ps, which is enough to record the state of V_TDC by using DFFs [44]. The computation accuracy of the addition of three 1-bit addends in the TD-CIM circuit is also evaluated for different data cases, as shown in Fig. 13. The lowest computation accuracy appears in the case of "00000" (∼97.9%), which is normally sufficient for the CNN algorithm. When the number of addends increases to four, the lowest computation accuracy drops to about 92.3%, which cannot satisfy the demand of the algorithm. Therefore, based on the current performance of CMOS and MTJ, the reliability of the TD-CIM circuit is sufficient to realize logic operations when no more than three addends are activated simultaneously. That is why we divide the nine 8-bit numbers into three computation blocks in Fig. 8. Moreover, it is believed that the reliability of the TD-CIM circuit can be enhanced to handle more activated bit-cells with the improvement of CMOS and MTJ performance in the future.

B. Evaluation of Delay and Energy

We then evaluate the delay and energy of the TD-CIM architecture through carrying out 2D convolutions of two 4-bit numbers in CNN. Here we divide the delay and energy into three parts according to the structure of the TD-CIM architecture: data array, shift array and summation array.

1) Data Array: As mentioned in Section IV-B, AND operations are firstly performed in the data array of the TD-CIM architecture. The peripheral circuit should be used to select the bit-cells storing the operational data and will also be used for every step of the following operations. Hence, the delay and energy of an AND operation come from the TD-CIM circuit (T_TD-CIM/AND and E_TD-CIM/AND) and the peripheral circuit (T_pre and E_pre), which can be described as

T_AND = T_TD-CIM/AND + T_pre    (12)
E_AND = E_TD-CIM/AND + E_pre    (13)

As the AND operations can be executed in each column simultaneously in the proposed highly reconfigurable field-free SOT-MRAM array, the total delay of the AND operations for a multiplication of two 4-bit numbers is T_AND. In addition, its total energy is equal to 16E_AND, as 16 AND operations are carried out.

2) Shift Array: The shift and addition operations are implemented in the shift array. On one hand, the delay and energy of the shift operations are mainly from the data write operations, which can be expressed as

T_shift = T_write/bit + T_pre    (14)
E_shift = E_write/bit + E_pre    (15)

where T_write/bit and E_write/bit represent the write time and energy of a field-free SOT-MRAM device. Because 4 numbers with 4 bits are generated after implementing the AND operations, the total delay and energy of the shift operations are 4T_shift and 16E_shift, respectively. Note that the write operations in different columns can be executed at the same time within a memory access.

On the other hand, the multi-addend addition in a column or row includes the computations of the TD-CIM circuit and the write operations for carries and sum. Hence, its delay and energy are described as

T_addition = T_TD-CIM/Add + T_write/bit + 2T_pre    (16)
E_addition = E_TD-CIM/Add + 2E_write/bit + 2E_pre    (17)

As elucidated in Section III-B, 7 cycles of additions are required to obtain the multiplication result of two 4-bit numbers. The total delay and energy of the addition operations in a multiplication are 7T_addition and 7E_addition, respectively.

3) Summation Array: The summation array is used to calculate the final convolution results by adding all the multiplication results. Since the size of the convolution kernel is 3 × 3, 9 numbers are generated by the multiplications in one convolution computation and then transferred to the summation array at the same time. The delay of this part is thus equal to T_shift. As each multiplication result of two 4-bit numbers has 8 bits, the energy for the data transfer is written as

E_trans = 72E_write/bit + E_pre    (18)

For the summation operation, these 9 numbers with 8 bits are divided into 3 computational blocks, as mentioned in Section IV. Therefore, the total delay and energy in the summation array are expressed as

T_summation = 21T_addition + T_shift    (19)
E_summation = 41E_addition + E_trans    (20)

In summary, the total delay and energy of a convolution computation in Eq. (5) for 4-bit pixels and a kernel of size 3 × 3 with 4-bit weights are expressed as

T_TOTAL = 9(T_AND + 4T_shift + 7T_addition) + T_summation    (21)
E_TOTAL = 9(16(E_AND + E_shift) + 7E_addition) + E_summation    (22)
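These closed-form expressions translate directly into a small cost model. The sketch below transcribes Eqs. (12)-(22); the per-operation numbers are invented placeholders standing in for the measured values of Table III:

T_AND_CIM, T_PRE, T_WRITE, T_ADD_CIM = 1.0, 0.5, 1.5, 1.2    # ns, assumed
E_AND_CIM, E_PRE, E_WRITE, E_ADD_CIM = 1.0, 0.5, 2.0, 1.5    # fJ, assumed

T_and = T_AND_CIM + T_PRE                      # Eq. (12)
E_and = E_AND_CIM + E_PRE                      # Eq. (13)
T_shift = T_WRITE + T_PRE                      # Eq. (14)
E_shift = E_WRITE + E_PRE                      # Eq. (15)
T_add = T_ADD_CIM + T_WRITE + 2 * T_PRE        # Eq. (16)
E_add = E_ADD_CIM + 2 * E_WRITE + 2 * E_PRE    # Eq. (17)
E_trans = 72 * E_WRITE + E_PRE                 # Eq. (18)
T_sum = 21 * T_add + T_shift                   # Eq. (19)
E_sum = 41 * E_add + E_trans                   # Eq. (20)
T_total = 9 * (T_and + 4 * T_shift + 7 * T_add) + T_sum       # Eq. (21)
E_total = 9 * (16 * (E_and + E_shift) + 7 * E_add) + E_sum    # Eq. (22)
print(f"one 3x3 convolution: {T_total:.1f} ns, {E_total:.1f} fJ")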


TABLE III. KEY DELAY AND ENERGY PARAMETERS IN TD-CIM ARCHITECTURE

TABLE IV. PERFORMANCE COMPARISON OF TD-CIM, STT-CIM AND CRAM ARCHITECTURES ON 2D CONVOLUTION AT 1 MB, 28 NM

Fig. 14. Performance comparison among TD-CIM, STT-CIM and CRAM architectures on digit recognition.

A model calculating the overall delay and energy of the CNN is developed by analyzing the computation type and number of these layers in LeNet-5. The activation function and pooling layer perform comparison operations, which are difficult to realize in CIM architecture. We thus use a conventional comparator in the TD-CIM, STT-CIM and CRAM architectures, by which their delay and energy in the activation function and pooling layer are the same. This model is utilized in the following discussions and comparisons.

C. Results and Discussions

To analyze the performance of the TD-CIM architecture and demonstrate its advantages, key parameters should firstly be determined. As T_write/bit, E_write/bit, T_pre and E_pre are related to the array size, we obtain them by modifying the NVSim memory simulator to adopt the proposed highly reconfigurable field-free SOT-MRAM array of 1 MB [45]. T_TD-CIM/AND and E_TD-CIM/AND are obtained by SPICE simulations of the TD-CIM circuit. We then get T_TD-CIM/Add and E_TD-CIM/Add in the case of five activated bit-cells, which is determined considering the reliability limitation. The values of these key parameters are listed in Table III, which includes different cases of data stored in the activated bit-cells.

According to the model and these key parameters, the delay and energy for a 2D convolution with a kernel of size 3 × 3 in the TD-CIM architecture can be obtained and are shown in Table IV. To demonstrate the performance advantage of the proposed TD-CIM architecture, we compare it with the STT-CIM [23] and CRAM [24] architectures. They are simulated under the same CMOS and MTJ technologies, which guarantees the fairness of the comparison. Compared with the STT-CIM and CRAM architectures, the delay of the TD-CIM architecture with multi-addend addition that executes the 2D convolution computations is reduced by 1.7 times and 0.4 times, and the energy is decreased by 1.9 × 10^3 times and 8.9 × 10^3 times, respectively. Similar results can be observed in Fig. 14, which displays the performance comparison of the overall digit recognition based on the quantized CNN in the TD-CIM, STT-CIM and CRAM architectures. Here, the delay and energy are respectively reduced by 1.2-2.7 times and 2.4 × 10^3-1.1 × 10^4 times compared with the STT-CIM and CRAM architectures. It is noteworthy that, as the logic operations in the CRAM architecture are directly implemented in the bit-cell array without using peripheral logic circuits, CRAM has a smaller area overhead than the other two architectures. The area overhead of the TD-CIM architecture is smaller than that of the STT-CIM architecture. In summary, the proposed TD-CIM architecture greatly improves the performance of CNN, especially in terms of energy.

Table V compares the proposed TD-CIM scheme with state-of-the-art CIM architectures published in recent years. According to the computation method, CIM architectures can be divided into two schemes, i.e., analog and digital. From this comparison table, [11] and [27] adopt the analog computation method, in which the multiply-accumulate (MAC) result is directly reflected on the bit-line, as the total bit-line discharge current is the sum of each activated bit-cell current during one bit-line discharging. The MAC result is then obtained by using an ADC to sense the analog bit-line voltage. Therefore, CIM architectures using the analog computation method show excellent energy efficiency at low precision, but suffer from large area overhead and limited functionality (add and multiply only) and algorithm support, as shown in Table V. References [12] and [13], using the digital computation method, can perform high-precision arithmetic operations but have poor energy efficiency. Reference [46] proposed a time-domain computation method to improve the energy efficiency and bit-precision scalability, where pulse width modulation (PWM) is used to map digital values into the time domain. However, the latency is very long due to its sequential operation, and its weight precision is limited to 1 bit [47]. The TD-CIM scheme performs logic operations in the time domain, but it is essentially a digital computation method because the arithmetic operations are realized by composing the logic operations.


TABLE V. COMPARISON WITH PREVIOUS WORKS

Compared with the conventional addition operation of two addends, the TD-CIM scheme can implement the addition operation of three addends during one bit-line discharging. Note that more addends can be added in one addition operation. Therefore, one addition operation in the TD-CIM scheme is equivalent to two addition operations in the conventional digital CIM scheme, which further improves the energy efficiency. Besides, in the shift and summation arrays of the TD-CIM architecture, one TD-CIM circuit is shared by eight columns, which saves area overhead. In summary, the TD-CIM scheme offers higher energy efficiency and lower area overhead than existing CIM architectures using the digital computation method.

Moreover, in terms of the recognition accuracy, although the weights in LeNet-5 are quantized from floating-point parameters to fixed-point parameters, it still achieves an accuracy of 99.57% in recognizing the handwritten digits from the MNIST dataset. Since it is difficult to know the specific distribution of data cases for the CNN computation process, we assume the total computation accuracy of the TD-CIM circuit is the mean of the accuracies shown in Fig. 13, i.e., 99.07%. Then, it is introduced as a parameter to the quantized LeNet-5. The result shows that the accuracy of the quantized LeNet-5 run in the TD-CIM scheme is 98.65%, less than the accuracy of 99.57% by 0.92%, but still higher than that in [46], i.e., 98.42%. Furthermore, we also extend our design to the CIFAR-10 dataset. We first use the pre-trained VGG11 model from the PyTorch model zoo, which achieves an accuracy of 93.78%. Then, the VGG11 model is quantized with Algorithm 1. The final validation accuracy is 91.97%, with an accuracy drop of 1.81% caused by the quantization. Similarly, the accuracy of the TD-CIM circuit is introduced to the quantized VGG11 model. Finally, the accuracy of the quantized VGG11 model run in the TD-CIM scheme is 91.11%, less than the accuracy of 93.78% by 2.67%. In summary, the compatibility of the TD-CIM architecture and the quantized CNN is well delivered.

VI. CONCLUSION

This article proposes a TD-CIM architecture using spintronics to optimize the delay and energy performance for CNN applications. The TD-CIM circuit converts the voltage difference on the bit-line to the time domain, which not only improves the sensing reliability but also allows the multi-addend addition to simplify the arithmetic. To further improve the compatibility of the TD-CIM circuit for CNN, we propose a quantization method without sharp accuracy dropping, which can also reduce the complexity of CNN. A TD-CIM architecture with a highly reconfigurable field-free SOT-MRAM array is constructed to realize the optimal performance of the quantized CNN. Finally, by recognizing the handwritten digits from the MNIST dataset, we find that both the delay and energy of the TD-CIM architecture are greatly reduced compared with the STT-CIM and CRAM architectures. In addition, the TD-CIM architecture has higher energy efficiency and lower area overhead than present CIM architectures using the digital computation method. Finally, accuracies of 98.65% and 91.11% are achieved in the TD-CIM architecture with 4-bit fixed-point parameters on MNIST and CIFAR-10, respectively, which demonstrates that the proposed quantization method of CNN is compatible with the TD-CIM architecture. This work has significance for further research on high-performance memory-oriented computing systems.

REFERENCES

[1] M. Kang, S. Lim, S. Gonugondla, and N. R. Shanbhag, "An in-memory VLSI architecture for convolutional neural networks," IEEE Trans. Emerg. Sel. Topics Circuits Syst., vol. 8, no. 3, pp. 494–505, Sep. 2018.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2012, pp. 1097–1105.
[3] D. Silver et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, Jan. 2016.
[4] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, pp. 436–444, May 2015.
[5] J. Cheng, J. Wu, C. Leng, Y. Wang, and Q. Hu, "Quantized CNN: A unified approach to accelerate and compress convolutional networks," IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 10, pp. 4730–4743, Oct. 2018.
[6] S. Han, J. Pool, J. Tran, and W. J. Dally, "Learning both weights and connections for efficient neural networks," in Proc. NIPS, Montréal, QC, Canada, 2015, pp. 1135–1143.
[7] J. Wang, J. Lin, and Z. Wang, "Efficient hardware architectures for deep convolutional neural network," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 65, no. 6, pp. 1941–1953, Jun. 2018.
[8] P. Chi et al., "PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory," in Proc. ACM/IEEE 43rd Annu. Int. Symp. Comput. Archit. (ISCA), Seoul, South Korea, Jun. 2016, pp. 27–39.

[9] S. Angizi, Z. He, N. Bagherzadeh, and D. Fan, “Design and evaluation [31] Z. Wang, W. Zhao, E. Deng, J.-O. Klein, and C. Chappert,
of a spintronic in-memory processing platform for nonvolatile data “Perpendicular-anisotropy magnetic tunnel junction switched by spin-
encryption,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., Hall-assisted spin-transfer torque,” J. Phys. D, Appl. Phys., vol. 48, no. 6,
vol. 37, no. 9, pp. 1788–1801, Sep. 2018. Jan. 2015, Art. no. 065001.
Yue Zhang (Senior Member, IEEE) received the B.S. degree in optoelectronics from the Huazhong University of Science and Technology, Wuhan, China, in 2009, and the M.S. and Ph.D. degrees in microelectronics from the University of Paris-Sud, France, in 2011 and 2014, respectively. He is currently an Associate Professor with Beihang University, China. His current research interests include emerging non-volatile memory technologies and hybrid low-power circuit designs.


Jinkai Wang (Graduate Student Member, IEEE) received the B.S. degree in physics and electronic engineering from Kaili University, Kaili, China, in 2015, and the M.S. degree in circuits and systems from Anhui University, Anhui, China, in 2018. He is currently pursuing the Ph.D. degree in physical electronics with Beihang University, China. His current research interests include high-performance hybrid circuits.

Chenyu Lian received the B.S. degree in software engineering from Beijing Jiaotong University, Beijing, China, in 2018. He is currently pursuing the M.S. degree in integrated circuits with Beihang University. His current research interests include efficient deep learning methods on hardware and in-memory computing.

Yining Bai received the B.S. degree in communication engineering from Beijing Jiaotong University, Beijing, China. She is currently pursuing the M.S. degree with Beihang University. Her current research interests include in-memory computing.

Guanda Wang (Graduate Student Member, IEEE) received the B.S. degree in communication engineering from the Beijing University of Posts and Telecommunications, Beijing, China. He is currently pursuing the Ph.D. degree with Beihang University. His current research interests include the simulation and analysis of MTJs and all-spin logic devices.

Zhizhong Zhang (Student Member, IEEE) received the B.S. degree from Beihang University, Beijing, China, where he is currently pursuing the Ph.D. degree in microelectronics. His current research interests include theoretical magnetism and micromagnetic simulation.

Zhenyi Zheng (Graduate Student Member, IEEE) received the B.S. and master's degrees from Beihang University, Beijing, China, in 2015 and 2018, respectively, where he is currently pursuing the Ph.D. degree. His current research interests include the spin-orbit torque effect and ferrimagnetic materials.

Lei Chen received the B.S. degree in electronic and information engineering from Anhui University, Hefei, China, in 2018. He is currently pursuing the Ph.D. degree in microelectronics and solid-state electronics with Beihang University, Beijing, China. His research interests include lateral spin valves and emerging non-volatile memory technologies.

Kun Zhang (Member, IEEE) received the B.S. and Ph.D. degrees in physics from Shandong University, Jinan, China, in 2012 and 2017, respectively. He is currently a Lecturer with Beihang University, Beijing, China. His current research interests include emerging non-volatile memory devices and in-memory computing applications.

Georgios Sirakoulis (Member, IEEE) received the M.Eng. (Diploma) and Ph.D. degrees in electrical and computer engineering (ECE) from the Democritus University of Thrace (DUTh), Greece, in 1996 and 2001, respectively. He has been a Tenured Associate Professor with the ECE Department, DUTh, since 2008. He has published more than 200 technical articles, guest-edited 11 special issues, co-edited five books, and coauthored 15 book chapters. He is the EUROPRACTICE representative for DUTh, and he has served as a member of the EU IDEAS Program. He has participated as a PI in more than 20 scientific programs and projects funded by the Greek Government and industry as well as the European Commission. His current research interests include emergent electronic circuits and systems, memristors, green and unconventional computing, cellular automata theory and applications, complex systems, bioinspired computation/biocomputation, modeling, and simulation.

Youguang Zhang (Member, IEEE) received the M.S. degree in mathematics from Peking University, Beijing, China, in 1987, and the Ph.D. degree in communication and electronic systems from Beihang University, Beijing, in 1990. He is currently a Professor with the School of Electronic and Information Engineering, Beihang University. His research interests include circuit and system co-design for emerging memory and computing systems.
