Time-Domain Computing in Memory Using Spintronics for Energy-Efficient Convolutional Neural Network
Fig. 2. TD-CIM circuit. (a) Spintronic bit-cell structure and switching principle of field-free SOT-MRAM. (b) Spintronic cell array. (c) TDC unit. (d) Principle of TDC unit. (e) Logic unit.
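The principle of the TDC unit in Fig. 2(c)-(d) can be sketched behaviorally: a time-to-digital converter compares the bit-line discharge time against a chain of time references and outputs a digital code. The following minimal Python model is ours, not the paper's circuit; the thermometer coding and the reference spacing are assumptions.

```python
def tdc(t_dis, t_ref=1e-12, n_stages=8):
    """Behavioral time-to-digital converter: count how many reference
    delays (multiples of t_ref) elapse before the bit-line crosses the
    sense threshold, and emit the count as a thermometer code."""
    code = min(int(t_dis / t_ref), n_stages)
    return [1] * code + [0] * (n_stages - code)

print(tdc(3.4e-12))  # -> [1, 1, 1, 0, 0, 0, 0, 0]
```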
consumption. Among different NVMs, spintronic memories offer advantageous performance, especially in terms of the energy and time of write operations [18]–[22]. This reduces the energy of CIM architectures that require writing the logic results back to bit-cells, and various CIM architectures based on spintronic memories have been proposed. Reference [23] presented a spin transfer torque CIM (STT-CIM) architecture that modifies the peripheral decision circuit to sense the effective resistance of the bit-line, which can perform Boolean logic, arithmetic, and complex vector operations. Using the physical attributes of the STT device, [24] proposed the computational RAM (CRAM) architecture to perform computations in the cell array, generating logic outputs directly in the STT devices. However, these CIM architectures only adopt the concept of the arithmetic logic unit (ALU) to carry out computations and do not fully explore the inherent advantages of the memory array. For example, the addition operation, a fundamental unit in all arithmetic operations [25], [26], is normally implemented by cascading full adders. If the same mechanism is used in CIM architectures, a large number of additional decoding operations and time-sequence schedules are required, which greatly increases the computation complexity and degrades the performance in terms of delay and energy.

In this work, we propose a time-domain CIM (TD-CIM) scheme based on spintronic memory that simplifies arithmetic operations for energy-efficient CNNs. A TD-CIM circuit is first proposed to execute NOR, NAND, and XOR operations by converting the variation of the bit-line voltage to the time domain. According to the characteristics of the output, we propose a multi-addend addition mechanism that implements the addition of multiple 1-bit addends in a single memory access. Furthermore, the addition of multiple N-bit addends with the TD-CIM circuit is realized and used for the multiplication. To improve the compatibility of the TD-CIM circuit with CNNs, we propose a quantization method that transforms the floating-point parameters of pre-trained CNN models into fixed-point parameters. Finally, a TD-CIM architecture with a highly reconfigurable array of spin-orbit torque magnetic RAM (SOT-MRAM) is built, and we evaluate its delay and energy by performing 2D convolution to recognize handwritten digit images from the MNIST dataset. Compared with the STT-CIM and CRAM architectures, the delay of the TD-CIM architecture is reduced by 2.7 times and 1.2 times, and the energy is decreased by 2.4×10³ times and 1.1×10⁴ times, respectively.

The remainder of this paper is organized as follows: Section II presents the TD-CIM circuit that implements Boolean logic. Multi-addend addition and efficient multiplication schemes based on the TD-CIM circuit are described in Section III. The quantization method for CNNs and the TD-CIM architecture are elaborated in Section IV. Section V analyzes the reliability of the TD-CIM circuit and evaluates the performance of the TD-CIM architecture by performing 2D convolution for digit recognition. Conclusions are presented in Section VI.

II. TD-CIM CIRCUIT FOR BOOLEAN LOGIC

In CIM architectures, distinguishing the bit-line voltage is a common method to perform logic operations [12]. Its principle can be analyzed with an RC circuit model, in which the bit-line voltage V_t is expressed as

    V_t = V_0 e^{-T_dis/(RC)} = V_0 e^{-T_dis/((R_PR + R_ER)C)}    (1)

where T_dis refers to the discharge time, V_0 is the initial voltage of the bit-line, and R and C are the resistance and capacitance on the
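To make Eq. (1) concrete, the following sketch (a minimal behavioral model with illustrative device values, not the paper's netlist; it assumes logic '1' is stored as the low-resistance parallel state and that the activated cells discharge the bit-line in parallel) computes the discharge time for each input pair and recovers NOR, NAND, and XOR by comparing that time against time references:

```python
import math

# Illustrative values (assumptions for this sketch, not the paper's):
R_P, R_AP = 5e3, 10e3   # MTJ resistance: parallel ('1') / antiparallel ('0')
C_BL = 50e-15           # bit-line capacitance (F)
V0, V_REF = 1.0, 0.5    # initial bit-line voltage and sense threshold (V)

def discharge_time(bits):
    """Time for the bit-line to fall from V0 to V_REF when the selected
    cells discharge it together (Eq. (1): V_t = V0 * exp(-T_dis/(R*C)),
    with R the equivalent resistance of the activated cells)."""
    r_eq = 1.0 / sum(1.0 / (R_P if b else R_AP) for b in bits)
    return r_eq * C_BL * math.log(V0 / V_REF)

# Two operands A and B stored in two cells on the same bit-line.
times = {(a, b): discharge_time([a, b]) for a in (0, 1) for b in (0, 1)}
for (a, b), t in times.items():
    print(f"A={a} B={b}: T_dis = {t * 1e12:6.1f} ps")

# The three cases (two '0's, one '1', two '1's) give three distinct
# discharge times, so time references placed between them recover the
# Boolean results: NOR holds only for the slowest case, NAND for all
# but the fastest, and XOR for the middle band.
t_slow = (times[(0, 0)] + times[(0, 1)]) / 2
t_fast = (times[(1, 1)] + times[(0, 1)]) / 2
for (a, b), t in times.items():
    nor, nand = int(t > t_slow), int(t > t_fast)
    xor = int(t_fast < t < t_slow)
    print(f"A={a} B={b}: NOR={nor} NAND={nand} XOR={xor}")
```

Because the resistance cases map to well-separated discharge times, sensing in the time domain tolerates more voltage noise than directly comparing bit-line voltages, which is the motivation for the TD-CIM circuit.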
Fig. 5. (a) Principle of the addition of n addends based on TD-CIM circuit. (b) Schematic of multi-addend addition based on TD-CIM circuit.
Fig. 6. Diagram of addition in array for three 8-bit addends based on TD-CIM circuit.
Fig. 7. Efficient multiplication scheme based on TD-CIM circuit.
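Functionally, the multi-addend addition of Figs. 5 and 6 and the multiplication scheme of Fig. 7 can be modeled as follows (a behavioral Python sketch under our assumptions: the TDC output encodes the number of logic-1 cells on a bit-line, so several 1-bit addends are summed in one access; the function names and carry bookkeeping are ours):

```python
def multi_addend_add(bits):
    """Model of one TD-CIM access: all selected cells discharge the
    bit-line together, and the TDC converts the discharge time into
    the number of stored '1's, i.e. the sum of the 1-bit addends."""
    return sum(bits)  # one access, regardless of how many addends

def tdcim_multiply(x, k, n_bits=4):
    """Shift-and-add multiplication built from in-memory AND and
    multi-addend addition (cf. Fig. 7): partial products of x and the
    kernel bits k_i are shifted by the bit weight i and summed."""
    result = 0
    for i in range(n_bits):
        if (k >> i) & 1:       # AND of x with kernel bit k_i
            result += x << i   # shift by writes, sum by multi-addend adds
    return result

def add_three(a, b, c, n_bits=8):
    """Column-wise addition of three 8-bit addends (cf. Fig. 6): each
    bit column is summed in a single access, then carries resolved."""
    carry, total = 0, 0
    for i in range(n_bits + 2):        # extra columns absorb carries
        col = [(a >> i) & 1, (b >> i) & 1, (c >> i) & 1, carry & 1]
        s = multi_addend_add(col)      # one bit-line discharge
        total |= (s & 1) << i
        carry = (carry >> 1) + (s >> 1)
    return total

assert tdcim_multiply(9, 5) == 45
assert add_three(100, 200, 55) == 355
```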
Fig. 9. (a) Original image with 8 bits per pixel. (b) Converted image with 4 bits per pixel. (c) Convolution result.

and the original model is, the smaller the accuracy drop caused by quantization is. Hence, we quantize the weights in the neural network to minimize the accuracy drop. This can be described as

    min σ(w_q) − σ(w)    s.t. w_q = Q(w)    (6)

where w refers to the set of floating-point weight matrices, w_q represents the set of weight matrices scaled to N-bit, σ is the computation process of the classical CNN with the ReLU function, and Q is the quantization function.

Step 3: The scaling operation compensates the influence of the activation function on quantization. Meanwhile, because the purpose of the pooling layer is to progressively reduce the spatial size of the parameters and computations in the network, and its operation on each feature map is independent [41], the computation of the pooling layer does not affect the final ordering of the output probabilities. Therefore, according to the associative law of calculation in the convolution and fully connected layers, Eq. (6) is modified as

    min σ_N(w_q) − σ_N(w)    s.t. w_q = Q(w)    (7)

where N represents the number of bits in a single pixel and a single kernel weight, and σ_N is the computation process of the classical CNN excluding the pooling layer and activation function. Moreover, the difference between σ_N(w_q) and σ_N(w) is proportional to the difference between w_q and w. Therefore, the optimal solution of Q is rounding. Meanwhile, the difference between w_q and w is reduced as N increases, which is beneficial to quantizing the CNN.

This method implements quantization without retraining or encoding, which reduces the amount of calculation. Meanwhile, by scaling the weights appropriately to neutralize the non-linearity of ReLU, the accuracy drop produced by quantization without retraining can be minimized. Finally, the N-bit convolution operation is quantized into 4 bits, thereby reducing the complexity of the CNN and enhancing the compatibility of the TD-CIM circuit with CNNs. Fig. 9 demonstrates an example of image processing using the proposed quantization method. The original image with 8 bits per pixel is converted to an image with 4 bits per pixel. Fig. 9(c) shows the result of the convolution computation with a 4-bit kernel.

Fig. 10. TD-CIM architecture for quantized CNN.

B. TD-CIM Architecture for Quantized CNN

We then design a TD-CIM architecture using field-free SOT-MRAM to execute the quantized CNN, as shown in Fig. 10. It consists of three sub-arrays: the data array, the shift array, and the summation array. The data array is specialized to store the original image data and the convolution kernel. For the convolution computation of the CNN, the shift array stores the shifted data of the multiplications, and the summation array stores the results of the multiplications.

In the data array, to enhance the parallelism of AND operations, the image data are stored by row and the kernel data are stored by column. By using the TD-CIM circuit located in each column, each bit of a pixel can carry out an AND operation with any bit of the kernel at the same time. When their results are transferred to the shift array, the shift operations are implemented by write operations. The addition part of the proposed efficient multiplication scheme is also performed in the shift array to obtain the multiplication results. At last, these results are transferred to the summation array, where the final convolution results are obtained by the multi-addend addition operation based on the TD-CIM circuit.

However, owing to the structural limitation of the memory array, logic operations can only be carried out on the columns. By contrast, to increase the parallelism of the write operations, the numbers generated by the multiplication operations are stored on the rows, so their multi-addend addition cannot be implemented directly. Therefore, we propose a highly reconfigurable array based on the field-free SOT-MRAM that also allows logic operations on the rows. As shown in Fig. 11, a transistor is added to the bit-cell to construct three bit-lines (SL, BL, CBL) and three word-lines (WL, RWL, CWL). When performing logic operations on the rows, the TD-CIM circuit is connected to the CBL. Thanks to this reconfigurability enhancement, the quantized CNN can be implemented more efficiently by the TD-CIM architecture.

V. PERFORMANCE EVALUATION AND DISCUSSION

As reliability is crucial for implementing logic operations, we first analyze the reliability of the TD-CIM circuit. Hybrid
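As a minimal sketch of the quantization method of Eqs. (6) and (7) (our illustrative code, assuming a per-tensor scale and rounding as the quantizer Q; the paper's Algorithm 1 is not reproduced here and may differ in detail):

```python
import numpy as np

def quantize_weights(w, n_bits=4):
    """Quantize a floating-point weight tensor to signed N-bit fixed
    point: Q is rounding after scaling (the optimal Q per Eq. (7)),
    with no retraining or encoding. The per-tensor scale that maps the
    largest weight onto the full signed range is our assumption."""
    scale = (2 ** (n_bits - 1) - 1) / np.max(np.abs(w))
    w_q = np.clip(np.round(w * scale),
                  -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1)
    return w_q.astype(np.int8), scale

# ReLU is positively homogeneous (ReLU(s*x) = s*ReLU(x) for s > 0) and
# pooling acts independently on each feature map, so the scale factors
# can be compensated once at the network output without changing the
# ordering of the class scores.
w = np.random.randn(3, 3).astype(np.float32)   # an example 3x3 kernel
w_q, scale = quantize_weights(w)
print(w_q)           # integers in [-8, 7]
print(w_q / scale)   # fixed-point approximation of w
```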
TABLE V
COMPARISON WITH PREVIOUS WORKS

the logic operation. Compared with the conventional addition operation of two addends, the TD-CIM scheme can implement the addition of three addends during one bit-line discharge. Note that more addends can be added in one addition operation. Therefore, one addition operation in the TD-CIM scheme is equivalent to two addition operations in the conventional digital CIM scheme, which further improves the energy efficiency. Besides, in the shift and summation arrays of the TD-CIM architecture, one TD-CIM circuit is shared by eight columns, which saves area overhead. In summary, the TD-CIM scheme offers higher energy efficiency and lower area overhead than existing CIM architectures using the digital computation method.

Moreover, in terms of recognition accuracy, although the weights in LeNet-5 are quantized from floating-point to fixed-point parameters, it still achieves an accuracy of 99.57% in recognizing handwritten digits from the MNIST dataset. Since it is difficult to know the specific distribution of data cases in the CNN computation process, we assume the total computation accuracy of the TD-CIM circuit is the mean of the accuracies shown in Fig. 13, i.e., 99.07%. This accuracy is then introduced as a parameter into the quantized LeNet-5. Results show that the accuracy of the quantized LeNet-5 run in the TD-CIM scheme is 98.65%, lower than the accuracy of 99.57% by 0.92%, but still higher than that in [46], i.e., 98.42%.

Furthermore, we also extend our design to the CIFAR-10 dataset. We first use the pre-trained VGG11 model from the PyTorch model zoo, which achieves an accuracy of 93.78%. Then, the VGG11 model is quantized with Algorithm 1. The final validation accuracy is 91.97%, with an accuracy drop of 1.81% caused by the quantization. Similarly, the accuracy of the TD-CIM circuit is introduced into the quantized VGG11 model. Finally, the accuracy of the quantized VGG11 model run in the TD-CIM scheme is 91.11%, lower than the accuracy of 93.78% by 2.67%. In summary, the compatibility of the TD-CIM architecture with the quantized CNN is well demonstrated.

VI. CONCLUSION

This article proposes a TD-CIM architecture using spintronics to optimize the delay and energy performance for CNN applications. The TD-CIM circuit converts the voltage difference on the bit-line to the time domain, which not only improves the sensing reliability but also enables the multi-addend addition that simplifies the arithmetic. To further improve the compatibility of the TD-CIM circuit with CNNs, we propose a quantization method without a sharp accuracy drop, which also reduces the complexity of the CNN. A TD-CIM architecture with a highly reconfigurable field-free SOT-MRAM array is constructed to realize the optimal performance of the quantized CNN. By recognizing handwritten digits from the MNIST dataset, we find that both the delay and energy of the TD-CIM architecture are greatly reduced compared with the STT-CIM and CRAM architectures. In addition, the TD-CIM architecture has higher energy efficiency and lower area overhead than present CIM architectures using the digital computation method. Finally, accuracies of 98.65% and 91.11% are achieved in the TD-CIM architecture with 4-bit fixed-point parameters on MNIST and CIFAR-10, respectively, which demonstrates that the proposed quantization method for CNNs is compatible with the TD-CIM architecture. This work has significance for further research on high-performance memory-oriented computing systems.

REFERENCES

[1] M. Kang, S. Lim, S. Gonugondla, and N. R. Shanbhag, "An in-memory VLSI architecture for convolutional neural networks," IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 8, no. 3, pp. 494–505, Sep. 2018.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2012, pp. 1097–1105.
[3] D. Silver et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, Jan. 2016.
[4] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, pp. 436–444, May 2015.
[5] J. Cheng, J. Wu, C. Leng, Y. Wang, and Q. Hu, "Quantized CNN: A unified approach to accelerate and compress convolutional networks," IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 10, pp. 4730–4743, Oct. 2018.
[6] S. Han, J. Pool, J. Tran, and W. J. Dally, "Learning both weights and connections for efficient neural networks," in Proc. NIPS, Montréal, QC, Canada, 2015, pp. 1135–1143.
[7] J. Wang, J. Lin, and Z. Wang, "Efficient hardware architectures for deep convolutional neural network," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 65, no. 6, pp. 1941–1953, Jun. 2018.
[8] P. Chi et al., "PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory," in Proc. ACM/IEEE 43rd Annu. Int. Symp. Comput. Archit. (ISCA), Seoul, South Korea, Jun. 2016, pp. 27–39.
[9] S. Angizi, Z. He, N. Bagherzadeh, and D. Fan, "Design and evaluation of a spintronic in-memory processing platform for nonvolatile data encryption," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 37, no. 9, pp. 1788–1801, Sep. 2018.
[10] Y.-C. Chiu et al., "A 4-Kb 1-to-8-bit configurable 6T SRAM-based computation-in-memory unit-macro for CNN-based AI edge processors," IEEE J. Solid-State Circuits, vol. 55, no. 10, pp. 2790–2801, Oct. 2020.
[11] J. Yang et al., "Sandwich-RAM: An energy-efficient in-memory BWN architecture with pulse-width modulation," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, San Francisco, CA, USA, Feb. 2019, pp. 394–396.
[12] J. C. Wang et al., "A 28-nm compute SRAM with bit-serial logic/arithmetic operations for programmable in-memory vector computing," IEEE J. Solid-State Circuits, vol. 55, no. 1, pp. 76–86, Jan. 2020.
[13] K. Lee, J. Jeong, S. Cheon, W. Choi, and J. Park, "Bit parallel 6T SRAM in-memory computing with reconfigurable bit-precision," in Proc. 57th ACM/IEEE Design Automat. Conf. (DAC), San Francisco, CA, USA, Jul. 2020, Art. no. 20052792.
[14] Z. Jiang, S. Yin, J.-S. Seo, and M. Seok, "C3SRAM: An in-memory-computing SRAM macro based on robust capacitive coupling computing mechanism," IEEE J. Solid-State Circuits, vol. 55, no. 7, pp. 1888–1897, Jul. 2020.
[15] X. Si et al., "A twin-8T SRAM computation-in-memory unit-macro for multibit CNN-based AI edge processors," IEEE J. Solid-State Circuits, vol. 55, no. 1, pp. 189–202, Jan. 2020.
[16] A. Biswas and A. P. Chandrakasan, "CONV-SRAM: An energy-efficient SRAM with in-memory dot-product computation for low-power convolutional neural networks," IEEE J. Solid-State Circuits, vol. 54, no. 1, pp. 217–230, Jan. 2019.
[17] V. Seshadri et al., "Fast bulk bitwise AND and OR in DRAM," IEEE Comput. Archit. Lett., vol. 14, no. 2, pp. 127–131, Jul. 2015.
[18] G. D. Wang et al., "Compact modeling of perpendicular-magnetic-anisotropy double-barrier magnetic tunnel junction with enhanced thermal stability recording structure," IEEE Trans. Electron Devices, vol. 66, no. 5, pp. 2431–2436, May 2019.
[19] Z. H. Wang et al., "Proposal of toggle spin torques magnetic RAM for ultrafast computing," IEEE Electron Device Lett., vol. 40, no. 5, pp. 726–729, May 2019.
[20] R. De Rose et al., "A variation-aware timing modeling approach for write operation in hybrid CMOS/STT-MTJ circuits," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 65, no. 3, pp. 1086–1095, Mar. 2018.
[21] G. Wang et al., "Ultra-dense ring-shaped racetrack memory cache design," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 66, no. 1, pp. 215–225, Jan. 2019.
[22] Z. Y. Zheng et al., "Enhanced spin-orbit torque and multilevel current-induced switching in W/Co-Tb/Pt heterostructure," Phys. Rev. Appl., vol. 12, no. 4, Oct. 2019, Art. no. 044032.
[23] S. Jain, A. Ranjan, K. Roy, and A. Raghunathan, "Computing in memory with spin-transfer torque magnetic RAM," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 26, no. 3, pp. 470–483, Mar. 2018.
[24] M. Zabihi, Z. I. Chowdhury, Z. Zhao, U. R. Karpuzcu, J.-P. Wang, and S. S. Sapatnekar, "In-memory processing on the spintronic CRAM: From hardware design to application mapping," IEEE Trans. Comput., vol. 68, no. 8, pp. 1159–1173, Aug. 2019.
[25] H. Naseri and S. Timarchi, "Low-power and fast full adder by exploring new XOR and XNOR gates," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 26, no. 8, pp. 1481–1493, Aug. 2018.
[26] H. T. Bui, Y. Wang, and Y. Jiang, "Design and analysis of low-power 10-transistor full adders using novel XOR-XNOR gates," IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 49, no. 1, pp. 25–30, Jan. 2002.
[27] S. Zhang, K. Huang, and H. Shen, "A robust 8-bit non-volatile computing-in-memory core for low-power parallel MAC operations," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 6, pp. 1867–1880, Jun. 2020.
[28] J. K. Wang et al., "A self-matching complementary-reference sensing scheme for high-speed and reliable toggle spin torque MRAM," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 12, pp. 4247–4258, Dec. 2020, doi: 10.1109/TCSI.2020.3020137.
[29] Z. Y. Zheng et al., "Perpendicular magnetization switching by large spin-orbit torques from sputtered Bi2Te3," Chin. Phys. B, vol. 29, no. 7, Jul. 2020, Art. no. 078505.
[30] M. Wang et al., "Field-free switching of a perpendicular magnetic tunnel junction through the interplay of spin-orbit and spin-transfer torques," Nature Electron., vol. 1, no. 11, pp. 582–588, Nov. 2018.
[31] Z. Wang, W. Zhao, E. Deng, J.-O. Klein, and C. Chappert, "Perpendicular-anisotropy magnetic tunnel junction switched by spin-Hall-assisted spin-transfer torque," J. Phys. D, Appl. Phys., vol. 48, no. 6, Jan. 2015, Art. no. 065001.
[32] E. Deng et al., "Synchronous 8-bit non-volatile full-adder based on spin transfer torque magnetic tunnel junction," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 62, no. 7, pp. 1757–1765, Jul. 2015.
[33] E. E. Swartzlander, "Recent results in merged arithmetic," Proc. SPIE, vol. 3461, pp. 576–583, Oct. 1998.
[34] M. Horowitz, "Computing's energy problem (and what we can do about it)," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, San Francisco, CA, USA, Feb. 2014, pp. 10–14.
[35] E. Cai, D. Juan, D. Stamoulis, and D. Marculescu, "NeuralPower: Predict and deploy energy-efficient convolutional neural networks," in Proc. ACML, Seoul, South Korea, Nov. 2017, pp. 622–637.
[36] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks," in Proc. NIPS, Barcelona, Spain, 2016, pp. 4114–4122.
[37] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," in Proc. Eur. Conf. Comput. Vis. (ECCV), Amsterdam, The Netherlands, 2016, pp. 525–542.
[38] C. Zhu, S. Han, H. Mao, and W. J. Dally, "Trained ternary quantization," in Proc. ICLR, Toulon, France, 2017, pp. 1–10.
[39] S. Han, H. Z. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," in Proc. ICLR, San Juan, Puerto Rico, 2016, pp. 1–14.
[40] J. Qiu et al., "Going deeper with embedded FPGA platform for convolutional neural network," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, Los Angeles, CA, USA, Feb. 2016, pp. 26–35.
[41] A. Giusti, D. C. Ciresan, J. Masci, L. M. Gambardella, and J. Schmidhuber, "Fast image scanning with deep max-pooling convolutional neural networks," in Proc. IEEE Int. Conf. Image Process., Melbourne, VIC, Australia, Sep. 2013, pp. 4034–4038.
[42] Q.-K. Trinh, S. Ruocco, and M. Alioto, "Time-based sensing for reference-less and robust read in STT-MRAM memories," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 65, no. 10, pp. 3338–3348, Oct. 2018.
[43] Y. Zhou et al., "A self-timed voltage-mode sensing scheme with successive sensing and checking for STT-MRAM," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 5, pp. 1602–1614, May 2020.
[44] G. Scotti, D. Bellizia, A. Trifiletti, and G. Palumbo, "Design of low-voltage high-speed CML D-latches in nanometer CMOS technologies," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 12, pp. 3509–3520, Dec. 2017.
[45] X. Dong, C. Xu, Y. Xie, and N. P. Jouppi, "NVSim: A circuit-level performance, energy, and area model for emerging nonvolatile memory," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 31, no. 7, pp. 994–1007, Jul. 2012.
[46] A. Sayal, S. S. T. Nibhanupudi, S. Fathima, and J. P. Kulkarni, "A 12.08-TOPS/W all-digital time-domain CNN engine using bi-directional memory delay lines for energy efficient edge computing," IEEE J. Solid-State Circuits, vol. 55, no. 1, pp. 60–75, Jan. 2020.
[47] H. A. Maharmeh, N. J. Sarhan, C.-C. Hung, M. Ismail, and M. Alhawari, "Compute-in-time for deep neural network accelerators: Challenges and prospects," in Proc. IEEE 63rd Int. Midwest Symp. Circuits Syst. (MWSCAS), Springfield, MA, USA, Aug. 2020, pp. 990–993.

Yue Zhang (Senior Member, IEEE) received the B.S. degree in optoelectronics from the Huazhong University of Science and Technology, Wuhan, China, in 2009, and the M.S. and Ph.D. degrees in microelectronics from the University of Paris-Sud, France, in 2011 and 2014, respectively. He is currently an Associate Professor with Beihang University, China. His current research interests include emerging non-volatile memory technologies and hybrid low-power circuit designs.
Jinkai Wang (Graduate Student Member, IEEE) received the B.S. degree in physics and electronic engineering from Kaili University, Kaili, China, in 2015, and the M.S. degree in circuits and systems from Anhui University, Anhui, China, in 2018. He is currently pursuing the Ph.D. degree in physical electronics with Beihang University, China. His current research interest includes high-performance hybrid circuits.

Zhizhong Zhang (Student Member, IEEE) received the B.S. degree from Beihang University, Beijing, China, where he is currently pursuing the Ph.D. degree in microelectronics. His current research interests include theoretical magnetism and micromagnetic simulation.

Zhenyi Zheng (Graduate Student Member, IEEE) received the B.S. and master's degrees from Beihang University, Beijing, China, in 2015 and 2018, respectively, where he is currently pursuing the Ph.D. degree. His current research interests include the spin-orbit torque effect and ferrimagnetic materials.

Youguang Zhang (Member, IEEE) received the M.S. degree in mathematics from Peking University, Beijing, China, in 1987, and the Ph.D. degree in communication and electronic systems from Beihang University, Beijing, in 1990. He is currently a Professor with the School of Electronic and Information Engineering, Beihang University. His research interests include circuit and system co-design for emerging memory and computing systems.