
3198 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 70, NO. 8, AUGUST 2023

CREAM: Computing in ReRAM-Assisted Energy- and Area-Efficient SRAM for Reliable Neural Network Acceleration

Yanan Sun, Senior Member, IEEE, Dengfeng Wang, Student Member, IEEE, Liukai Xu, Yiming Chen, Student Member, IEEE, Zhi Li, Songyuan Liu, Weifeng He, Senior Member, IEEE, Yongpan Liu, Senior Member, IEEE, Huazhong Yang, Fellow, IEEE, and Xueqing Li, Senior Member, IEEE

Abstract— SRAM-based computing-in-memory (CIM) has been widely explored to accelerate neural networks (NNs). However, it is challenging to store all weights of many modern NNs due to the limited on-chip SRAM capacity. This bottleneck induces a large amount of off-chip DRAM accesses and impedes the improvement of performance and energy efficiency. This paper proposes a new approach of computing in resistive random-access memory (ReRAM)-assisted energy- and area-efficient SRAM (CREAM) for accelerating large-scale NNs while eliminating the DRAM access. The NN weights are all stored in high-density on-chip ReRAMs and restored to the proposed non-volatile SRAM (nvSRAM) CIM cells with array-level parallelism. Furthermore, to deal with the influence of ReRAM and CMOS variations, a novel layer-wise and bit-wise weight-configuration search algorithm is proposed by leveraging the different sensitivity of each layer in NN models. A data-aware weight-mapping method is also presented to efficiently map NN models to ReRAMs in CREAM for high computation parallelism. The experimental results show a 10.3× weight storage density over the standard 6T SRAM array. Evaluations of ResNet-18 and VGG-9 on the CIFAR-10/CIFAR-100 datasets show up to 3.47× and 1.70× energy efficiency over two baseline designs of SRAM-CIM and ReRAM-CIM, respectively, in addition to 15.6% higher accuracy than ReRAM-CIM under device variations.

Index Terms— ReRAM-assisted SRAM, compute-in-memory, neural network, high density, restore yield, weight mapping.

Fig. 1. SRAM-CIM trends. (a) NN inference energy breakdown. (b) Limited memory size in CIM. (c) Macro energy efficiency of recent SRAM-CIM [1], [2], [3], [4], [5], [6] and ReRAM-CIM [7], [8], [9], [10], [11] (data movement energy excluded).
I. INTRODUCTION

DEEP neural networks (DNNs) are getting deeper with more parameters for the widespread deployment of artificial intelligence in various applications, such as language processing and image classification tasks. Unfortunately, the complex DNNs exacerbate the memory-wall problem with tremendous data movement between the memory units and the multiply-and-accumulate (MAC) units in the traditional von Neumann architecture. CIM is an attractive solution to mitigate the memory-wall bottleneck by merging MAC operations into memory arrays with high computing parallelism.

As a mature candidate with high energy efficiency, SRAM-based CIM (SRAM-CIM) has been stimulating various designs in the current domain [1], [2], [3], [4], the charge domain [5], the digital domain [6], [12], and the time domain [13]. However, the capacity of on-chip SRAM-CIM macros is typically limited at KB–MB levels in the state-of-the-art works [1], [2], [3], [4], [5], [6], as summarized in Fig. 1(b). Such a capacity cannot hold all weights of typical modern NNs considering the high cost of SRAM for edge AI devices. As shown in Fig. 1(a), the off-chip DRAM access accounts for up to 87% of the energy consumption, which limits the energy efficiency and performance of SRAM-CIM. Apparently, sparsity-aware NN compression is useful for many

Manuscript received 9 October 2022; revised 2 March 2023 and 27 April 2023; accepted 1 May 2023. Date of publication 11 May 2023; date of current version 28 July 2023. This work was supported in part by NSFC under Grant 62174110, Grant U21B2030, Grant 92264204, and Grant 61934005; in part by the National Key R&D Program of China under Grant 2021YFA0717400, Grant 2019YFA0706100, and Grant 2018YFA0701500; in part by the Natural Science Foundation of Shanghai under Grant 23ZR1433200; and in part by the BirenTech Research. This article was recommended by Associate Editor M.-F. Chang. (Corresponding author: Xueqing Li.)

Yanan Sun, Dengfeng Wang, Liukai Xu, Zhi Li, Songyuan Liu, and Weifeng He are with the Department of Micro-Nano Electronics, Shanghai Jiao Tong University, Shanghai 200240, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]).

Yiming Chen, Yongpan Liu, Huazhong Yang, and Xueqing Li are with BNRist, Electronic Engineering Department, Tsinghua University, Beijing 100084, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).

Color versions of one or more figures in this article are available at https://doi.org/10.1109/TCSI.2023.3272874.
Digital Object Identifier 10.1109/TCSI.2023.3272874

1549-8328 © 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: Universitatsbibliothek Erlangen Nurnberg. Downloaded on December 03,2023 at 16:17:54 UTC from IEEE Xplore. Restrictions apply.
NNs [14], but the desire for no DRAM access and full on-chip weight storage of larger and general NNs needs more effort.

To enlarge the on-chip memory capacity, CIM based on emerging non-volatile memories (NVMs) such as ReRAMs [7], [8], [9], [10], [11], MRAMs [15], [16], and FeFETs [17] is being widely explored. Among them, the two-terminal ReRAM devices with a large on/off resistance ratio enable higher weight-storage density and larger on-chip capacity than SRAM arrays, as shown in Fig. 1(b). Like other NVMs, the non-volatile ReRAM-CIM can be powered off to save the leakage energy without losing the weights. Unfortunately, limited by the DC power consumption or complex peripherals, ReRAM-CIM still suffers from high inference energy compared with SRAM-CIM, even with recent advanced techniques [11], as shown in Fig. 1(c). Moreover, the stochastic filament-based switching in ReRAM causes significant process variations, which are challenging to handle to prevent severe NN accuracy loss [18]. Recent efforts have proposed the unary weight-coding method [19], the on-device and off-device co-design training algorithm [20], write-and-verify programming schemes [21], [22], etc. However, it remains a significant challenge to implement robust large-scale NNs with high energy and area efficiency for ReRAM-CIM.

In this work, we propose a novel non-volatile SRAM-CIM (nvSRAM-CIM) macro by combining non-volatile high-density ReRAM devices with robust and energy-efficient SRAM cells for large NNs. Furthermore, we propose a CIM architecture, namely CREAM, to exploit the nvSRAM-CIM macro with a novel layer-wise and bit-wise configuration search algorithm and a data-aware mapping method for NN inference. The contributions of this work include:

1) A proposed ReRAM-based nvSRAM-CIM macro and CREAM architecture that leverage grouped high-density ReRAM devices for NN weight storage and energy-efficient SRAM-CIM for NN inference for the first time, which provides a new CIM paradigm for energy-efficient NN acceleration;
2) Comprehensive circuit analysis of the macro structure and circuit parameters, including the number of ReRAM devices associated with an SRAM cell, a proposed grouped ReRAM organization, etc., for an optimized tradeoff between memory capacity and restore yield;
3) A custom architecture that exploits the characteristics of the proposed nvSRAM-CIM macro, including a variation-aware layer-wise and bit-wise configuration search algorithm for high-density storage and high-accuracy inference, and a data-aware weight-mapping method for low inference latency and high hardware utilization;
4) Evaluation of the proposed ReRAM-based nvSRAM-CIM macro and CREAM architecture, showing a higher memory density of 10.3× over the traditional SRAM array, and a higher energy efficiency of up to 3.47× over the traditional SRAM-CIM.

In the rest of this paper, Section II introduces the challenges of the conventional nvSRAM and CIM circuits. Section III presents nvSRAM-CIM and CREAM. Section IV discusses the impact of ReRAM group settings on the restore yield and the memory density. Section V presents the layer-wise and bit-wise configuration search algorithm and the data-aware weight-mapping method. Section VI presents the evaluation results. Finally, Section VII concludes this work.

Fig. 2. Previous 7T1R SLC-nvSRAM cell [26] that supports only bit-to-bit store/restore operations and no in-memory computing.

Fig. 3. Computing-in-memory architecture to accelerate MAC operations of NN: (a) ReRAM-CIM [19], and (b) SRAM-CIM [30].

II. CHALLENGES OF NVSRAM AND CIM

The challenges of the existing nvSRAM circuits and CIM circuits are described in this section, so as to understand the design space of the proposed ReRAM-based nvSRAM-CIM.

A. Previous Non-Volatile SRAM Cells

Many NVM-based nvSRAM designs have been explored for frequent-off and instant-on applications [23], [24], [25], [26], [27], [28], [29]. The adopted NVM technologies include the ferroelectric capacitor [23], MRAM [24], [25], ReRAM [26], [27], [28], FeFET [29], etc. Among these designs, the ReRAM-based nvSRAM features high density and a medium-high on/off ratio.

The ReRAM-based nvSRAM has been presented in recent works based on single-level cell (SLC) [26], [27] and multi-level cell (MLC) [28] ReRAM devices. The schematic of the previously published 7T1R SLC-nvSRAM cell [26] is shown in Fig. 2. SL and SWL denote the source line and the switching line, respectively. The two differential supply rails are labeled VDDQ and VDDQB. By moving the 1-bit logic information in the SRAM cell to the non-volatile SLC ReRAM device RQ, the previous 7T1R SLC-nvSRAM cell [26] can power off the leaky SRAM cell during the standby period to save energy. The logic state is recalled from RQ and restored into the SRAM cell when the supply is powered on. In [28], an energy-efficient MLC-nvSRAM circuit is presented to back up the 2-bit data of every two SRAM cells into a single four-level MLC ReRAM device. However, the previously published nvSRAM circuits [24], [25], [26], [27], [28] essentially support only bit-to-bit store/restore operations and no CIM functions, where the memory density of logic information is still limited. Inspired by previous works, we propose to adopt the nvSRAM circuits and enable in-memory MAC operations to perform the NN inference. Furthermore, we extend to the multi-ReRAM-one-SRAM nvSRAM to support more weights in the footprint of an SRAM cell for enhanced data storage density.

B. Previous CIM Based on SRAM and ReRAM Cells

In many existing CIM solutions, as shown in Fig. 3, the NN weights are quantized and stored in the CIM memory cells


of the macro, while the NN activations are usually voltage signals applied to the rows of the memory array. The product of weights and activations is generated by each cell, and the accumulation is performed on each column. Then, the MAC operation output of each column is sensed by an analog-to-digital converter (ADC) or other peripherals, and a shift & adder could be used to support MAC of multi-bit inputs and multi-bit weights before sending the final output to the output registers.

For ReRAM-CIM, the NN weights are represented by the conductance of the ReRAM devices [7], [8], [9], [10], [11], [18], [19], [20]. The footprint of a ReRAM device could be as small as 6F² in principle, which enables high-density weight storage in a ReRAM array. By applying inputs to the rows, the output current is summed up on each column by Kirchhoff's Current Law (KCL). However, the large direct current on the bitlines (BLs) and the peripheral circuits limit the energy efficiency. Furthermore, the immature ReRAM fabrication process induces significant resistance variations, which result in weight deviation and accuracy degradation of the NN inference. To alleviate this accuracy loss, several solutions have been presented in [19], [20], [21], and [22]. Variation-aware algorithms, e.g., unary coding, could mitigate the variation of synaptic weights [19]. The hardware-software co-design [20] could make the weights aware of hardware faults and enhance the weight error tolerance. The write-and-verify ReRAM programming method could reduce the ReRAM conductance deviation during write [21], [22]. However, the repeated writes on ReRAMs induce higher energy costs, and the accuracy loss cannot be fully recovered. Future works are needed to improve the accuracy of ReRAM-CIM.

SRAM-CIM has shown different characteristics. Plenty of works have contributed to the design of SRAM-CIM cells and peripheral circuits. Two major types of SRAM-CIM have gained a lot of attention: current-domain SRAM-CIM [1], [2], [3], [4] and charge-domain SRAM-CIM [5]. The current-domain SRAM-CIM macro benefits from a simplified circuit structure with relatively small area overhead. The currents on the BLs are sensed by an ADC as the result of the MAC operation. The charge-domain SRAM-CIM may generate more accurate and robust MAC results, as the resolution is dominated by the passive capacitor mismatches. Generally, as the on-chip SRAM resources are limited [1], [2], [3], [4], [5], [6], [14], [30], [31], [32], weight data reload is usually needed for large-scale NN inference, which becomes the bottleneck towards higher energy efficiency and/or performance.

In summary, there are challenges in both ReRAM-CIM and SRAM-CIM works. If their advantages could be enabled in a hybrid CIM fashion, the task-level energy efficiency and performance can be significantly improved.

III. PROPOSED NVSRAM-CIM CIRCUIT

This section presents the proposed ReRAM-based nvSRAM-CIM cell, subarray, and macro in CREAM, as illustrated in Fig. 4. The CIM macro includes multiple subarrays of CIM cells, supporting several operation modes: the store mode, the restore mode, the CIM mode, the stand-by mode, and the memory mode. With the proposed nvSRAM-CIM circuitry, the NN inference can be processed efficiently in four steps. Firstly, the NN weights are loaded to the SRAM cells (which only have limited storage capacity) in batches by the conventional SRAM write operations. Secondly, the multiple densely placed ReRAM groups attached to the top of each SRAM cell for boosting the data storage capacity are initialized with the weights in SRAM using the store mode. Thirdly, the weights of each layer in the ReRAM groups are sequentially recalled to the SRAM cells for MAC operations by the restore mode. Finally, to conduct the MAC operations, the store path is reused in the CIM mode. The CIM cell circuit, the five operation schemes, and the operation details of the CIM macro are described subsequently.

TABLE I
SIGNAL SETTINGS OF NVSRAM-CIM CIRCUITRY

A. Proposed nvSRAM-CIM Circuit

The proposed nvSRAM-CIM cell is shown in Fig. 4(b), consisting of a 6T SRAM cell (N1–N4 and P1–P2) with an additional control transistor N7, two differential discharge paths of N5 and N6–Rref for weight restore, a store/CIM path of N8–N10, and multiple high-density ReRAM groups separated by the group select transistors NG1–NGm. N7 is controlled by the control signal CTRL and shared in each memory column. The array-level-shared restore signal RSTR drives N5 and N6. N8 and N9 are controlled by the storage node Q and the row-wise-shared store signal STR, respectively. N10 is driven by the row-wise-shared reset signal RST, while its drain is connected to the computation bitline CBL. The group selection transistors (NG1–NGm) and the 6T SRAM are sized for achieving a compact layout area, while the switching voltages (with VSET = 0.55 V and VRESET = −0.52 V [43]) of the ReRAMs are tuned for ensuring sufficient drivability in this work.

Different from the conventional ReRAM-based nvSRAM designs, multiple densely placed ReRAM groups are attached to the top of each 6T SRAM cell for enhancing the data storage density, while occupying the same footprint size as the layout area of the SRAM cell and peripheral transistors. As shown in Fig. 4(b), each ReRAM group contains n ReRAMs RS1–RSn for data storage and one Rmask for masking the unselected ReRAMs. Rmask is fixed at the low-resistance state (LRS), which can set the parallelly unselected ReRAMs to an overall low resistance. The group select signals GRO_SEL for selecting a certain group for weight restore and the source lines SL connecting to the grouped ReRAMs are all shared at the array level. The signal settings for the store, restore, and CIM modes are listed in Table I, which will be revisited while introducing each mode. In this work, the resistances of the LRS and the high-resistance state (HRS) of the ReRAMs are selected to be 50 kΩ and 1 MΩ [46], respectively.

The ReRAMs used in CREAM can be selected as either forming-free devices [43] or those with suppressed forming voltages (VF) [44]. A VF of less than 2 V [44] can be


Fig. 4. Illustrations of (a) proposed CREAM architecture and (b) nvSRAM-CIM cell.
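To summarize the four-step flow of Section III behaviorally, the following minimal Python sketch models one column of nvSRAM-CIM cells. It is our own illustrative abstraction, not the authors' implementation: the class and function names are hypothetical, and bias voltages, timing, and analog behavior are deliberately omitted.

```python
# Behavioral toy model of an nvSRAM-CIM column in CREAM:
# each cell couples a 6T SRAM node Q with m groups of n storage ReRAMs.
# HRS encodes weight bit '0', LRS encodes '1'.
HRS, LRS = 0, 1

class NvCell:  # hypothetical name, for illustration only
    def __init__(self, m_groups, n_per_group):
        self.groups = [[HRS] * n_per_group for _ in range(m_groups)]
        self.q = 0  # SRAM storage node Q

    def store(self, g, j):
        # Store mode: reset the selected ReRAM to HRS, then SET it
        # to LRS only if Q holds '1' (N8 conducting).
        self.groups[g][j] = LRS if self.q == 1 else HRS

    def restore(self, g, j):
        # Restore mode: recall the selected ReRAM state into Q.
        self.q = self.groups[g][j]

def mac_column(cells, in_bits):
    # CIM mode on one column: the CBL voltage drop is proportional
    # to the number of rows where input AND restored weight are '1'.
    return sum(x & c.q for x, c in zip(in_bits, cells))
```

Clearing Q between store() and restore() mimics a power-off: the weights survive in the non-volatile groups and are recalled in array-level parallelism before computing.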

Fig. 5. The operations and waveforms of (a) store mode and (b) restore mode.

considered as the safe voltage range to guarantee the reliability of the core transistors and ReRAMs after the electroforming and stressed programming processes under 28 nm CMOS technology. Assuming that a forming process is needed, RSi can be formed by activating the forming path (RSi–NGi–N10) with SLi, GRO_SELi, and CBL biased to VF, VDD, and 0 V, respectively. Similarly, Rref can be formed by activating the forming path (Rref–N6–N4) with WL/RSTR, SLref, and BLB biased to VDD, VF, and 0 V, respectively.

B. Store Operation of Proposed nvSRAM-CIM Cell

In the store mode, each bit of the NN weights is preloaded from an SRAM cell to one of the grouped ReRAMs. Weights of certain NN layers are initially written into the SRAM cells of the nvSRAM-CIM arrays by conventional SRAM writes. We assume that Ri_j is connected to SLj and located in the ith group of an nvSRAM-CIM cell. Each bit of SRAM data is then stored into the selected Ri_j of each nvSRAM-CIM cell in array-level parallelism for the whole array using the store operation.

The store mode consists of two phases, with the operations and waveforms illustrated in Fig. 5(a). In store phase-1, STR is grounded while RST and CBL are at VDDH. CTRL is tied to VDD. By biasing GRO_SELi to VDDH and SLj to GND, all the selected Ri_j in the whole array are simultaneously initialized and reset to HRS, representing '0'. Meanwhile, all the unselected ReRAMs are kept intact by setting SLx at VDDL1, as listed in Table I.

In store phase-2, STR transitions from 0 V to VDD. GRO_SELi and SLj are set to VDD to activate the selected

Ri_j. SLx are maintained at VDDL1 to keep the unselected ReRAMs intact. NGi and N9 are turned on. Provided that Q in the SRAM-CIM cell to be stored is '1', N8 is turned on, and the selected Ri_j is then set to LRS (representing '1') along the activated set path as illustrated in Fig. 5(a). Alternatively, the selected Ri_j is maintained at HRS if Q stores '0', which cuts off N8. For the rest of the NN weights, similar store processes are carried out by repeating the above two store phases until all the NN weights are preloaded to the grouped ReRAMs.
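The two store phases above amount to an array-wide reset followed by a conditional SET. A minimal sketch, assuming a hypothetical per-cell dict mapping (group, index) to a ReRAM state (the function name and data layout are illustrative only, not from the paper):

```python
HRS, LRS = "HRS", "LRS"

def store_selected(q_bits, reram_cells, i, j):
    """Model of the two store phases for the selected ReRAM R(i, j).

    q_bits: one SRAM Q value per cell; reram_cells: one dict
    {(group, idx): state} per cell. Unselected devices are untouched,
    mimicking SLx being held at VDDL1."""
    for q, devices in zip(q_bits, reram_cells):
        devices[(i, j)] = HRS        # phase-1: array-wide reset to '0'
        if q == 1:
            devices[(i, j)] = LRS    # phase-2: conditional SET to '1'
    return reram_cells
```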

C. Restore Operation of Proposed nvSRAM-CIM Cell

In the restore mode, each bit of the preloaded weights in the grouped ReRAMs is recalled from the selected Ri_j and loaded into the corresponding 6T SRAM cell to which the selected Ri_j is attached. Thanks to the large capacity of the grouped ReRAMs, the off-chip memory access is eliminated. Besides, the restore operation is conducted in array-level parallelism with two phases, as shown in Fig. 5(b).

In restore phase-1, RSTR, GRO_SELi, and CTRL are grounded as listed in Table I. Both BL and BLB are set to VDD while VDDH is applied on the wordline WL to ensure that both storage nodes Q and QB are precharged to a high voltage level.

In restore phase-2, by setting SLj to VDDL2 and SLx to GND, a voltage divider is formed between the selected Ri_j and the parallelly connected Rmask and unselected ReRAMs. VDDL2 is set to 0.5 V with a suppressed voltage drop (VDDL2 − VY) across the selected ReRAM (well below the switching voltages), to avoid potential read-disturb of the ReRAMs. The voltage of the node Y is determined by the resistance ratio between the selected Ri_j and Req_unsel. As Rmask is fixed at LRS, the equivalent resistance (Req_unsel) of the parallelly connected Rmask and unselected ReRAMs remains relatively low regardless of the data pattern of the unselected ReRAMs. Provided that the selected Ri_j is at LRS (or HRS), the voltage at Y is relatively high (or low). Then, RSTR and GRO_SELi transition to VDDL1 while CTRL is maintained at GND. N5, N6, and NGi are turned on to create two differential discharge paths (Q–Y and QB–GND). If the voltage at Y is relatively high (or low), Q is discharged slower (or faster) than the reference discharge path of QB. Due to the different discharge rates of the two paths, a voltage difference is established between Q and QB. This difference is eventually amplified to a full voltage swing after CTRL transitions to VDD. To traverse the whole NN for inference, repeated array-level restore operations are needed.

Fig. 6. Operation of CIM mode.

Fig. 7. Worst-case restore scenarios of selected Rsel and unselected Runsel ReRAMs for restoring '1' and '0'.

Fig. 8. The voltage waveform of the node Y during restore of '1' and '0'.

D. CIM Operation of Proposed nvSRAM-CIM Cell

In the CIM mode, multi-bit MAC between 4-bit inputs and 8-bit weights is realized by serially processing multiplications of 1-bit inputs and 8-bit weights, where the results in the same output channel are presented as an accumulated voltage drop at CBL. The CIM operation is shown in Fig. 6.

In practice, CBL is precharged to VDD in the initial stage. STR is set to VDD while SLj and SLx are set to GND. WL, GRO_SEL, and RSTR are set to GND to cut off the paths from the storage nodes to BL, BLB, and CBL, as listed in Table I. Weights quantized to 8 bits are preliminarily stored in ReRAMs and then restored to the SRAM-CIM cells for CIM operations. The 4-bit input is applied as serial RST pulses with a period ΔT. Three n-type transistors, N8–N10, serve as a decoupled bit-wise product engine of input and weight, to avoid the potential read-disturb problem in 6T SRAM-CIM. The single-bit multiplication result is presented by the discharge current Ii produced by the product engine in the row. Only when the input and weight are both '1' are N8–N10 turned on to form the discharge current. Then, the discharge currents of each row are accumulated on CBL. With a fixed discharge time ΔT set to 100 ps, the voltage of CBL drops by ΔV, which presents the MAC value. This voltage drop is sensed by a 5-bit flash ADC shared by four columns. The 256 × 256-bit nvSRAM-CIM macro is set to activate a maximum of 32 rows for balanced linearity and signal margin of the MAC values on CBL.

E. CIM Macro Structure

As shown in Fig. 4(a), the CIM macro includes multiple subarrays, each consisting of a high-density nvSRAM-CIM array, an input decoder for serializing the multi-bit input, 5-bit flash ADCs for MAC result sensing, and shift-adders for multi-cycle accumulations. CREAM takes advantage of the high maturity of SRAM-CIM and the high storage density of ReRAMs simultaneously. Moreover, by storing weights in the ReRAM devices, the circuitry can be powered off to save energy when the accelerator is idle in the stand-by mode. Alternatively, in the memory mode, the performance of read and write operations with the proposed nvSRAM-CIM is similar to that of the standard 6T SRAM, by cutting off the corresponding store and restore paths.
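Functionally, the CIM mode described above computes a 4-bit-input × 8-bit-weight dot product through bit-serial popcounts and shift-add accumulation. The following Python sketch is our own idealized abstraction of that dataflow (it assumes unsigned operands and ignores analog non-idealities such as CBL nonlinearity and ADC saturation):

```python
def cim_mac(inputs_4b, weights_8b):
    # One output channel: 4-bit inputs are serialized into 1-bit pulses;
    # each of the 8 weight bits occupies its own CBL column.
    assert len(inputs_4b) <= 32      # macro activates at most 32 rows
    total = 0
    for ib in range(4):              # serial input bits, LSB first
        in_bits = [(x >> ib) & 1 for x in inputs_4b]
        for wb in range(8):          # one column per weight bit
            # Popcount of rows whose input bit AND weight bit are both
            # '1': the idealized stand-in for the ADC-sensed CBL drop.
            col = sum(xb & ((w >> wb) & 1)
                      for xb, w in zip(in_bits, weights_8b))
            total += col << (ib + wb)   # shift-adder accumulation
    return total
```

Under these assumptions the result equals the exact unsigned dot product, since each partial popcount is weighted by 2^(ib+wb); e.g., cim_mac([3, 5, 0], [10, 7, 255]) returns 3·10 + 5·7 = 65.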


IV. NVSRAM-CIM ROBUSTNESS ANALYSIS

The robustness of the nvSRAM-CIM cell in restoring the weights from ReRAMs to SRAM-CIM cells is influenced by the data pattern of the unselected ReRAMs, ReRAM and CMOS variations, the HRS-to-LRS resistance ratio, and the ReRAM group settings. The ReRAM group setting is represented as the number of ReRAM groups m and the number of ReRAMs n per group. This section analyzes the circuit robustness by characterizing the weight restore yield and explores the design space. 1000 Monte-Carlo simulations are performed. The filament gap parameter in the ReRAM device model [38] is varied following Gaussian distributions with different variations of 3σ/µ, with a typical value of 10% [45].

A. Influence of ReRAM Variations on Restore Operation

The robustness of the voltage-divider-type select scheme is influenced by the data pattern of the unselected ReRAMs in the selected group and the ReRAM variations. A simplified voltage divider in the restore mode is shown in Fig. 7, where SLj is set to VDDL2. When the ReRAM variations are not considered, typically, the parallel resistance of Rmask and the unselected ReRAMs can be expressed as Req_unsel:

1/Req_unsel = 1/Rmask + Σ_{k≠j} 1/Ri_k,   (1)

where Rmask is set to LRS for shielding the unselected ReRAMs. Depending on Req_unsel and the selected Ri_j, the voltage at Y is maintained at a certain voltage as:

VY = Req_unsel · VSL / (Req_unsel + Rsel) = VSL / (1 + Rsel/Req_unsel),   (2)

where VSL is the SL voltage of the selected ReRAM. From (2), VY is maintained at a relatively high (low) voltage when the selected ReRAM is in LRS (HRS).

VY can be influenced by Req_unsel, which varies with the data pattern of the unselected ReRAMs. The worst-case data pattern for restoring '0' ('1') occurs when all the unselected ReRAMs are in HRS (LRS), corresponding to the highest (lowest) Req_unsel. The worst cases of restoring '0' and '1' are denoted VY_w0 and VY_w1, respectively. Furthermore, when the ReRAM variations are considered, there are voltage fluctuations of VY_w0 and VY_w1 due to the varied resistances of Rsel and Req_unsel, as shown in Fig. 8. The deviation of VY makes the ReRAM states difficult to distinguish. For easily distinguishable states, the lower bound of VY_w1 should be larger than the upper bound of VY_w0. The voltage difference VDifference between the two worst-case data patterns should be sufficient, which is defined as:

VDifference = VY_w1 − VY_w0 = VSL/(1 + n) − VSL/(RH/RL + n) = VSL · (1/(1 + n) − 1/(RH/RL + n)).   (3)

From (3), VDifference increases with a smaller n, a higher HRS-to-LRS ratio RH/RL, and a higher source-line voltage VSL. A higher VDifference leads to enhanced robustness of the nvSRAM-CIM cell with a higher restore yield.

Fig. 9. Simulated Rref distributions assuming (m, n) = (3, 4) with (a) only ReRAM variations and (b) both ReRAM and CMOS variations.

Fig. 10. Restore yield of an nvSRAM-CIM cell considering both CMOS and ReRAM variations with (a) different levels of ReRAM variations (3σ/µ) and (b) different ReRAM settings (m, n) at 3σ/µ = 10%.

B. Design Space Exploration of Weight-Restore Yield and Storage Density Under Both ReRAM and CMOS Variations

The design space is explored for trading off between the restore yield and the storage density in this section, by considering various factors including the ReRAM resistance ratio RH/RL, CMOS and ReRAM variations, and the ReRAM group settings (m, n). The restore yield of HRS (LRS) is named HRS-Yield (LRS-Yield). Avg-Yield, which is the average of HRS-Yield and LRS-Yield, is used to characterize the restore yield of the CIM cell.

For a given (m, n) of (3, 4) with RH/RL fixed at 1 MΩ/50 kΩ, the Rref distributions for successfully restoring '0' and '1' under 1000 Monte-Carlo simulations are shown in Fig. 9. From Fig. 9(b), the distributions of Rref are wider when both ReRAM and CMOS variations are considered compared with Fig. 9(a). Such mismatches induced by ReRAM and CMOS variations can lead to deviations of the discharging rates between the two paths (Q–Y and QB–GND) and degrade the restore yield. To quantitatively compensate for the mismatch effect caused by ReRAM and CMOS variations, Rref is tuned for simultaneously maintaining high HRS-Yield and LRS-Yield. The optimal Rref is obtained by the bisection method with 1000 Monte-Carlo simulations for each candidate value. Note that, to program Rref to the target resistance level, Rref is initially reset to HRS and then set to the target resistance level by tuning the compliance current along BLB–N4–N6–SLref [28]. A similar method with in-situ iterative write (IWR)


TABLE II
Restore Yield With Different ReRAM On/Off Ratio and nvSRAM-CIM Cell Settings at ReRAM Variation (3σ/µ) of 10%

Fig. 11. Layout of an nvSRAM-CIM cell when (m, n) = (4, 8), with 4 groups and 8 ReRAMs in each group. The cell layout area is 2.33 µm² (0.65 µm × 3.59 µm).

Fig. 12. Impact of different ReRAM settings of (m, n) under the layout area restriction on (a) storage density and (b) restore yield at 3σ/µ = 10%.

Fig. 13. Optimized (m, n) search and data-aware weight-mapping method.

as reported in [48] can be employed for achieving the target optimal value of Rref.

The impact of RH/RL for given ReRAM settings of (m, n) is evaluated as listed in Table II. RH/RL cannot be too low, in order to yield a sufficient voltage difference (VDifference) between different data patterns, as shown in (3) in Section IV-A. With a fixed (m, n), a higher RH/RL leads to a higher Avg-Yield, with a larger voltage difference VDifference between the two '0' and '1' states to be restored.

The restore yield of an nvSRAM-CIM cell under both CMOS and ReRAM variations is evaluated at different levels of 3σ/µ variation of the filament gap in the ReRAMs, as shown in Fig. 10(a). The optimal Rref values for the different settings are also labeled. The ReRAM variations dominate and induce an apparent drop in restore yield when their 3σ/µ reaches 30%.

The impact of different ReRAM group settings (m, n) on the restore yield is evaluated at the corresponding optimal Rref values, as shown in Fig. 10(b), where m is varied from 4 to 6 at three different values (4, 6, and 8) of n. The restore yield decreases with increased n, as VDifference is reduced according to (3). Note that, since the different ReRAM groups are isolated by the group select transistors, the number of groups m in each nvSRAM-CIM cell has little impact on the restore yield, as shown in Fig. 10(b).

The layout of an nvSRAM-CIM cell is shown in Fig. 11, assuming that the ReRAM setting (m, n) is (4, 8) and the footprint of each ReRAM is 50 nm × 50 nm (equivalent to the VIA size in 28 nm). Note that the ReRAMs incur no area overhead to the nvSRAM-CIM cells, thanks to their back-end-of-line (BEOL) process compatibility with CMOS. The layout area increases with larger m due to more group select transistors. For a given m, the layout area of the nvSRAM-CIM cell remains unchanged for relatively small n values, while it increases once n exceeds a certain value. In this work, large n values for a given m that induce extra area overhead are not considered, in favor of high integration density. Under this layout restriction, the design space of all possible ReRAM settings in terms of storage density and restore yield is shown in Fig. 12: the storage density increases with either larger n or larger m, while the restore yield is approximately unchanged across different m when n is fixed but degrades with increased n. This reveals the trade-off between storage density and restore robustness, which is further discussed in the next section. A lower bound on the restore yield (assumed to be 80% Avg-Yield) and a density threshold (assumed to be 6 bits/µm²) are set to maintain high NN accuracy and storage density, as highlighted by the red lines in Fig. 12. Meanwhile, (m, n) settings with both relatively low restore yield and low storage density are not considered in this work.

V. LAYER-WISE AND BIT-WISE YIELD-AWARE CONFIGURATION SEARCH AND MAPPING METHOD

Considering the need for robust operation under the non-ideal factors described in Section IV, and to achieve maximized storage capacity of NN weights with sufficient NN accuracy and computing efficiency, this section presents (i) the proposed weight-restore yield-aware configuration search algorithm to optimize the (m, n) settings, and (ii) the mapping method from NN models to the ReRAMs in CREAM that enhances computing parallelism with low inference latency, as shown in Fig. 13.

A. Layer-Wise and Bit-Wise Yield-Aware Configuration Search

While a higher restore yield can reduce the deviations of weights and MAC values during inference, we also notice
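The threshold-based screening of the (m, n) design space described above can be sketched in a few lines. The per-setting density and yield numbers below are placeholders standing in for the Fig. 12 map, and the screening rule (drop a setting only when it is weak on both axes) is our reading of the red-line thresholds.

```python
# Hypothetical (m, n) candidates with (bits/um^2, Avg-Yield) values;
# placeholders, not the paper's measured design-space map.
candidates = {
    (4, 4): (6.9, 0.97),
    (4, 8): (13.7, 0.88),
    (5, 8): (5.5, 0.72),   # fails both thresholds: dropped
    (6, 3): (5.2, 0.99),   # low density but highest yield: kept
    (6, 8): (14.5, 0.84),  # dense enough: kept
}

DENSITY_TH, YIELD_TH = 6.0, 0.80   # red-line thresholds in Fig. 12

# Only settings that are weak on BOTH axes are excluded from the search
feasible = [mn for mn, (dens, avg_yield) in sorted(candidates.items())
            if dens >= DENSITY_TH or avg_yield >= YIELD_TH]
print(feasible)   # [(4, 4), (4, 8), (6, 3), (6, 8)]
```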

Fig. 14. Flowchart of proposed layer-wise and bit-wise configuration search algorithm.

Fig. 15. Comparison of CIM and restore operations between the traditional and proposed mapping methods for the mapping of layer x (Cx = 16, Mx = 128; Ri refers to the i-th ReRAMs in the nvSRAM-CIM cells).

that different layers have a different level of tolerance to the weight deviations [33], [34], [35], [36] for a given NN model. In addition, bits of weight with different significance in different layers may have diverse tolerances to the deviations of the hardware. To store the weights with high density while maintaining high accuracy, the ReRAM-group settings (m, n) of the different bits of weights in the different NN layers are optimized here.

Fig. 14 illustrates the flowchart of the proposed weight-configuration (CFG) search algorithm, which is also applicable to different NN models. The total available hardware resources (including the total number of nvSRAM-CIM subarrays and the corresponding ReRAM settings (m, n) in each subarray) are assumed to be determined by the CFG search results of the largest-scale NN model in this work. To ensure that the hardware resources are highly utilized for different NN models, the weight configuration for smaller-scale NNs is also adjusted by taking the total available hardware resources into consideration. The algorithm can be divided into three stages.

In the first stage, bit-wise weight-configuration sorting (BWS) is used to generate the lookup table of CFG candidates used in the later stages. Every two adjacent bits of the 8-bit weights share the same ReRAM group setting (m, n) for quick convergence. Each group setting of (m, n) is selected from the 7 candidates highlighted in Fig. 13. A total of 7⁴ = 2401 configurations with different ReRAM group settings (m, n) for every two bits of weights are traversed. For each configuration, the weight deviation (ΔW) is obtained by 10,000-sample Monte-Carlo simulations based on the corresponding restore yield of (m, n) for each bit, together with the storage density (Dw) of the weights, as shown in Fig. 14. Then, the weight configurations are pruned by removing the candidates with high ΔW but low Dw. The remaining weight configurations are sorted in an ascending order of ΔW and Dw, which forms a lookup table sent to the second stage.

The second stage is the layer-wise weight-configuration search (LWS), which searches for a preliminary weight configuration to enhance the weight-storage density while maintaining high accuracy. The weights within each layer are assumed to share the same weight configuration. The NN layers are first sorted in a descending order of sensitivity to the weight deviations, which is evaluated as the KL-divergence between the ideal distribution of the layer output without any deviations and the non-ideal distribution with weight deviations under device variations. The smaller the KL-divergence is, the less sensitive the NN layer is [37]. Meanwhile, an initial accuracy
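The BWS stage above can be sketched as follows. The per-setting deviation and density figures are placeholder assumptions (the paper obtains them from Monte-Carlo simulation), and the pruning rule keeps only configurations that are not dominated by a lower-deviation, higher-density alternative.

```python
from itertools import product

# Hypothetical (per-bit-pair weight deviation, bits/um^2 density) figures
# for the 7 candidate ReRAM group settings; placeholders only.
SETTINGS = {(4, 3): (0.010, 4.0), (4, 4): (0.020, 5.2),
            (4, 8): (0.060, 13.7), (6, 3): (0.005, 5.2),
            (6, 4): (0.015, 6.9), (6, 6): (0.040, 10.1),
            (6, 8): (0.080, 14.5)}

def bws_table():
    # Traverse all 7**4 = 2401 configurations: one setting per pair of
    # adjacent bits of an 8-bit weight, deviation weighted by significance.
    rows = []
    for cfg in product(SETTINGS, repeat=4):
        dw = sum(SETTINGS[s][0] * (4 ** i) for i, s in enumerate(cfg))
        dens = sum(SETTINGS[s][1] for s in cfg) / 4   # simplistic average
        rows.append((dw, dens, cfg))
    return rows

def prune(rows):
    # Drop candidates with high deviation but low density: keep a config
    # only if it is denser than every lower-deviation config before it.
    rows.sort(key=lambda r: (r[0], -r[1]))
    kept, best = [], -1.0
    for dw, dens, cfg in rows:
        if dens > best:
            kept.append((dw, dens, cfg))
            best = dens
    return kept

table = prune(bws_table())
print(len(SETTINGS) ** 4)   # 2401 configurations traversed
print(table[0][2])          # lowest-deviation entry: (6, 3) for every pair
```

The surviving list is, by construction, ascending in both ΔW and Dw, matching the lookup table handed to the LWS stage.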

Algorithm 1 Data-Aware Weight-Mapping Method
Input: CFG for each layer of the NN model; the number (Nk) of subarrays sharing the same ReRAM group setting (mk, nk), where k ≥ 1; quantization bits q of the weights
Output: Weight mapping scheme
1: for each layer l do
2:   Convert each bit of weight in layer l to a 2D matrix;
3:   Divide the weight matrix into bl blocks;
4: end for
5: for each (mk, nk) do
6:   Calculate the totally required blocks Bk_req of the NN according to the CFG;
7:   Calculate the remaining available blocks Bk_ra = 8 × mk × nk × Nk − Bk_req;
8: end for
9: for each layer l do
10:  Calculate the required blocks Bk_reql for layer l;
11:  Calculate the accelerable blocks Bk_accl = Bk_reql mod Nk;
12:  Calculate the duplication times DT = min(⌊Nk/Bk_accl⌋) − 1;
13:  if DT > 0 and Bk_ra > Bk_accl × DT do
14:    Duplicate the blocks of layer l DT times;
15:    Update Bk_ra = Bk_ra − Bk_accl × DT;
16: end for
17: Distribute all blocks of the NN evenly to the subarrays according to the CFG;
18: for each block bi of all NN layers do
19:   if there exist sufficiently unoccupied columns in the same row of the subarray with the mapped block bi−1 do
20:     Map the block bi onto the unoccupied columns;
21:   else do
22:     Map the block bi onto columns of new rows;
23: end for

loss threshold (Lth), within which an NN can be accommodated to the weight configurations with different restore yields, is set. The maximum allowed accuracy loss threshold (ΔACCth_l) of each layer is then set as the product of Lth and the parameter percentage of the current layer in the whole NN. The LWS starts with the most sensitive layer and is then performed layer by layer in a descending order of sensitivity. For layer l, the maximum allowed weight deviation (ΔWth_l) is determined such that ΔACCth_l is not exceeded. The weight configuration with ΔWl just below ΔWth_l is then selected from the CFG lookup table of the BWS as the weight configuration CFGl for layer l. Note that the corresponding maximum achievable storage density of weight, Dl, is also found. CFGl is then fixed for layer l in the LWS, and the weight deviations caused by the CFGs of the previous layers are also considered in the search of the later layers. By the end of the LWS, the final CFGs for each layer of the largest-scale NN are obtained, while the preliminary CFGs for the smaller NNs are also derived. The largest-scale NN is retrained with the final CFGs of each layer to enhance the accuracy. Then, the hardware resources of the total required subarrays with the corresponding ReRAM group settings (m, n) for all bits of weight in the largest-scale NN model are calculated, which also determines the total hardware resources available for smaller NN models.

In the third stage, the weight-configuration adjustment for smaller-scale NNs (WAS) is executed to enhance the hardware utilization of the nvSRAM-CIM subarrays with high restore yield while improving the NN accuracy, with regard to the total available hardware resources. The preliminary hardware requirements for the smaller-scale NNs derived by the LWS are compared with the total available hardware resources, to determine whether there are sufficient subarrays with certain ReRAM group settings for weight storage. The WAS also starts from the NN layer of the smaller-scale NNs with the highest sensitivity. If there are sufficient subarrays with the corresponding ReRAM settings to accommodate the weights of a certain layer, no CFG adjustment is required. Alternatively, given that the hardware resources may not be enough to accommodate the weights, the weight configurations of some bits of the layer then need to be adjusted by the WAS. To be specific, if there already exist nvSRAM-CIM subarrays with a higher restore yield, several bits of the layer are allocated to those subarrays to preserve high accuracy. If the number of subarrays with a higher restore yield is still not enough to accommodate the weights, the total accuracy loss threshold Lth is increased to store the NN weights. Lth is then fed back to the LWS. This process is repeated until the whole NN is stored in the nvSRAM-CIM subarrays. Finally, the smaller NN models are retrained with the final CFG results to adapt to the restore errors and enhance the accuracy.

B. Traditional Naïve Weight-Mapping Method

With the traditional naïve weight-mapping method, each bit of the weights is stored into the ReRAMs one by one in the subarrays of CREAM that share the same ReRAM group settings, similar to the traditional ReRAM-CIM. The mapping is performed layer by layer. For each layer, the convolutional layer with Cx input channels and Mx output channels is first converted into q weight matrices, each with Cx × k × k rows and Mx columns, where the kernel size is k × k and the NN weights are quantized to q bits. Each weight matrix stores one bit of the weights in the layer. Then, the weight matrices are divided into blocks. The size of each block equals the maximum activated part of an nvSRAM-CIM subarray. Considering the maximum of 32 activated rows in the 256 × 256-bit nvSRAM-CIM macros, the maximum size of each block is 32 × 256. Finally, each block of weights is mapped onto the unoccupied ReRAMs one by one in the nvSRAM-CIM subarrays. Different bits of a layer are mapped to different subarrays for parallel computation.

As shown in Fig. 15, taking layer x with 16 input channels and 128 output channels with 3 × 3 kernels as an example, each bit of weight is converted to a 144 × 128 weight matrix. For the k-th bit of weight w[k], the weight matrix is first divided into 5 blocks. With the naïve mapping, all 5 weight blocks are then mapped to the ReRAMs of a single subarray-1. Among them, 2 blocks are mapped to the remaining unoccupied rows of Ri, while 3 blocks are mapped to the new Ri+1. To perform the computation for w[k] in layer x, a total of 2 restore operations (for restoring Ri and Ri+1) and 5 CIM operations are required with the naïve weight-mapping method.
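The layer-to-matrix conversion and block division used by the naïve mapping can be checked with a few lines; the helper names are ours, and the numbers reproduce the Fig. 15 example.

```python
from math import ceil

def weight_matrix_shape(c_in, k, m_out):
    # One bit of a conv layer's weights forms a (C*k*k) x M matrix
    return c_in * k * k, m_out

def num_blocks(rows, cols, max_rows=32, max_cols=256):
    # Blocks are sized to the maximum activated region of a subarray:
    # 32 rows activated at once in a 256x256-bit macro
    return ceil(rows / max_rows) * ceil(cols / max_cols)

# Fig. 15 example: layer x with Cx = 16, 3x3 kernels, Mx = 128
rows, cols = weight_matrix_shape(16, 3, 128)
print((rows, cols))            # (144, 128)
print(num_blocks(rows, cols))  # 5 blocks -> 5 CIM operations per bit (naive)
```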

C. Proposed Data-Aware Weight-Mapping Method

To speed up the CIM operations with enhanced computation parallelism, the data-aware weight-mapping method is presented in this section to efficiently map the weights of different NN models to the ReRAMs in multiple nvSRAM-CIM subarrays. The details of the proposed mapping method are shown in Algorithm 1. Each NN layer is first converted into a weight matrix and divided into blocks in the same way as in the naïve weight-mapping. Then, the weight blocks are distributed and mapped to each subarray as evenly as possible, so that the MAC operations are carried out in parallel by different subarrays instead of being clustered into a single subarray. By distributing the weights evenly, CIM operations can be performed in parallel by multiple subarrays, and the NN inference latency can be reduced. Finally, the blocks of weight distributed to the subarrays are mapped onto the ReRAMs. When only part of the Ri in a subarray is occupied by the previous layers, the weights of the current layer can also be mapped to the unoccupied Ri in the same subarray, without wasting hardware resources. To map the NN weights compactly and minimize the restore times, blocks are first assigned to the empty space next to the last block in the subarray.

First, each NN layer is converted into a weight matrix and divided into blocks (lines 1–4). With the weight blocks, the number of required blocks for the whole NN with the corresponding (m, n) settings is calculated, and the remaining available blocks Bk_ra of (mk, nk) are also calculated by subtraction (lines 5–8).

Then, the weight duplication is performed in lines 9–16. When the number of weight blocks is not divisible by the number of subarrays, only some subarrays may perform CIM operations while the others are not computing. Meanwhile, it is a common case that the NN size does not exactly match the on-chip storage capacity, and some ReRAMs may not be occupied after the mapping of the whole NN. This makes it naturally practical to duplicate some weights to the idle ReRAMs and increase the overall computing throughput. First, the total required blocks Bk_reql of (mk, nk) for a certain layer are counted. Then, the number of accelerable blocks Bk_accl is calculated from the required blocks and the subarray number, since only part of the weights can be duplicated for the acceleration. The duplication time is determined as the floored quotient of the subarray number Nk and Bk_accl, minus one. If the duplication time is more than 1 and enough on-chip hardware resources exist, the blocks of the layer are duplicated and Bk_ra is updated. Otherwise, the layer is not duplicated and Bk_ra remains unchanged.

Finally, in lines 17–23, the weight blocks are distributed to the subarrays and mapped. The weight blocks are distributed to each subarray as evenly as possible, and all the weight blocks are mapped to the ReRAMs on the corresponding subarrays. To map the NN weights compactly, blocks are first assigned to the unoccupied columns in the same row of the subarray as the last mapped block. Otherwise, if not enough empty columns exist, the block is mapped onto columns of new rows. When all rows of Ri in a subarray are occupied, the weights of the current layer are mapped to the unoccupied Ri+1 in the same subarray.

As shown in Fig. 15, assume that there are 4 subarrays (subarray-1 to 4) sharing the same ReRAM setting (mk, nk) for each bit of weights in layer x. By adopting the proposed data-aware weight-mapping method, the total of 5 weight blocks of w[k] is distributed to the 4 subarrays according to the CFG. Among them, 4 blocks B(k, 1)–B(k, 4) of w[k] are mapped to subarray-1 to 4, respectively, and can be computed simultaneously. As there still exist sufficiently unoccupied columns in the same row as B(k, 1)–B(k, 4), the remaining block B(k, 5) is then duplicated 3 times and mapped to all 4 subarrays, which further enhances the computation parallelism. Consequently, it takes only 1 restore operation (simultaneously restoring Ri1–Ri4) and 1.25 CIM operations for w[k], as shown in Fig. 15. Compared to the naïve mapping, with only one subarray used at a time, the numbers of both restore and CIM operations are reduced with the proposed data-aware weight-mapping.

VI. RESULTS AND ANALYSIS

This section evaluates the proposed CIM macro and the CREAM architecture, and compares them with the CIM baselines on density, energy efficiency, accuracy, throughput, etc.

A. Memory Density of nvSRAM-CIM Cell

SPICE simulations in this section are performed at 25°C using the physics-based ReRAM Verilog-A model from [38] and a commercial 28nm CMOS technology. The supply voltage is 0.9V. The store/restore/CIM energy consumption (Estore, Erestore, ECIM), bits per cell, cell layout area (Acell), 256×256-bit array area (Aarray), and storage density of the nvSRAM-CIM cell are evaluated as listed in Table III. The ReRAM group setting (m, n) is assumed to be (4, 8). The standard 6T SRAM and the previous 7T1R nvSRAM [26] cells are also evaluated.

While the standard 6T SRAM and the previous 7T1R nvSRAM can only store 1 bit per cell, the proposed nvSRAM-CIM cell can store 32 bits of data in a single cell when (m, n) is (4, 8). This large density improvement is obtained with only 89.5% and 53.3% increases in store and restore energy consumption, respectively, as listed in Table III. In addition, the number of bits stored in each nvSRAM-CIM cell can be flexibly adjusted by the ReRAM setting (m, n), as presented in Section IV-B. In practice, the storage density is enhanced significantly, by 10.3× and 12.0× compared to the standard 6T SRAM and the previous 7T1R nvSRAM, respectively, and the density benefits can be further enhanced when adopting monolithic three-dimensional technology for the grouped ReRAMs.

B. Performance of CREAM Scheme

Two typical NN models, ResNet-18 and VGG-9, are adopted to evaluate CREAM. The accuracy of both NNs is tested on the CIFAR-10 and CIFAR-100 datasets. The evaluation parameters of the circuits are shown in Table IV. The NN weights are quantized to 8 bits, while the inputs of each layer are quantized to 4 bits. In the evaluated methods, each ADC is shared by 4 columns, and 32 rows are activated at the same time in a subarray to maintain high NN accuracy.
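The duplication arithmetic of lines 10–15 and the resulting operation count can be sketched as follows. The helpers are ours and simplify Algorithm 1 to the single-(mk, nk) case of the Fig. 15 example.

```python
from math import floor

def duplication_plan(n_blocks, n_sub):
    # Lines 10-15 of Algorithm 1 (sketch): how many leftover blocks to
    # duplicate, and how many times, so the last parallel step keeps
    # every subarray busy.
    acc = n_blocks % n_sub                        # Bk_acc: leftover blocks
    dt = floor(n_sub / acc) - 1 if acc else 0     # DT: duplication times
    return acc, dt

def effective_cim_ops(n_blocks, n_sub):
    acc, dt = duplication_plan(n_blocks, n_sub)
    full_rounds = n_blocks // n_sub
    # Each leftover block is shared by (DT + 1) subarrays in the last round
    return full_rounds + (acc / (dt + 1) if acc else 0)

# Fig. 15 example: 5 blocks of w[k] over 4 subarrays
print(duplication_plan(5, 4))     # (1, 3): duplicate the 5th block 3 times
print(effective_cim_ops(5, 4))    # 1.25 CIM operations (vs 5 with naive)
```

With the block count divisible by the subarray count, no duplication is planned and the CIM-operation count is simply the number of full parallel rounds.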

TABLE III
Performance Comparison Between Proposed nvSRAM-CIM Cell and Previous nvSRAM and SRAM Cells in 28nm CMOS Technology

TABLE IV
Design Specifications of Evaluated Circuits in 28nm CMOS

TABLE V
The Number of nvSRAM-CIM Subarrays With Different ReRAM Settings of (m, n) to Accommodate VGG-9 on CIFAR-100 Dataset

The following architectures are compared against CREAM:
• Baseline-1: the NN weights are stored in off-chip DRAM, and SRAM-CIM performs the MAC operations.
• Baseline-2: the NN weights are stored in on-chip ReRAM, and the MAC operations are performed in SRAM-CIM.
• Baseline-3: the weight storage and the current-domain MAC operations are both implemented on ReRAM crossbars.

As the parameter size of VGG-9 (14MB) is larger than that of ResNet-18 (11MB), the hardware setting is determined by the search result of VGG-9 in our evaluation. Provided that only the nvSRAM-CIM subarrays with the highest restore yield, with (m, n) of (6, 3), were employed for weight storage, 100 subarrays would be required to store all the weights of VGG-9. Alternatively, as listed in Table V, only 66 subarrays are used to accommodate VGG-9 by adopting the proposed layer-wise and bit-wise configuration search algorithm, which saves 34% of the subarrays in CREAM. For ResNet-18, the weight configuration can be adjusted to fit the hardware setting of subarrays in Table V. The required capacities of on-chip SRAM and ReRAM cells are about 4Mb and 120Mb, respectively, which are manufacturing-feasible according to [41] and [42].

The inference energy consumption of VGG-9 and ResNet-18 is compared in Fig. 16(a). With a larger parameter size and more computing operations per inference, VGG-9 consumes more inference energy than ResNet-18. The energy consumption of baseline-1 is the highest due to the required off-chip DRAM accesses. In baseline-2, storing the weights in on-chip ReRAM devices saves some of the energy consumed by weight reloading. Compared to baseline-1 to 3, CREAM consumes the lowest energy by employing the more energy-efficient SRAM-CIM while eliminating weight transfer from outside the arrays.

The evaluation results of the energy efficiency of CREAM and the baselines are shown in Fig. 16(b). The proposed CREAM achieves the highest energy efficiency on both VGG-9 and ResNet-18 among all the CIM designs. Thanks to the dramatically decreased energy consumption of data movement, a maximum of 3.47× energy efficiency is achieved on VGG-9 with CREAM over the traditional SRAM-CIM baseline-1, which requires off-chip DRAM accesses. Meanwhile, compared with baseline-2, whose weights are stored in on-chip ReRAM devices, the energy efficiency is enhanced by 1.85× on VGG-9 by employing CREAM, due to the eliminated off-array memory access. Furthermore, by employing energy-efficient SRAM-CIM, the energy efficiency of CREAM is 1.70× over the baseline-3 ReRAM-CIM. Besides, the improvement in energy efficiency varies with different NN models, which have different parameter sizes and numbers of computing operations. As shown in Fig. 16(b), CREAM achieves higher energy efficiency on VGG-9, which possesses a larger parameter size relative to its number of computing operations (14MB parameters and 3.95 × 10^10 operations, versus 11MB parameters and 3.51 × 10^10 operations for ResNet-18).

The detailed energy breakdown of the different CIM designs on VGG-9 is shown in Fig. 16(c). Due to off-array memory accesses, the energy consumption of data movement in baseline-1 and baseline-2 occupies significant portions of 71.2% and 46.0%, respectively. In contrast, the energy consumed by the data movement of weight restoring from the ReRAMs to the SRAM-CIM cells accounts for only 0.1% in the proposed CREAM, as shown in Fig. 16(c). Although no data movement is needed for the weight storage and computation in the ReRAM crossbars of baseline-3, the CREAM-enabled nvSRAM-CIM does not suffer from their DC power consumption or complex peripherals, and thus achieves 27.2% lower energy on CIM operations compared to the baseline-3 counterpart.

The inference accuracy of CREAM is also evaluated and compared with the ReRAM-CIM (baseline-3). As shown in Fig. 17(a), the accuracy of ReRAM-CIM drops quickly with increased ReRAM variations. As the 3σ/µ of the ReRAM variations increases to 30%, the accuracy of ReRAM-CIM is reduced by up to 22.3% and 22.7% for VGG-9 and ResNet-18 on the CIFAR-100 dataset, respectively. In contrast, the accuracy of the proposed CREAM remains relatively high across different levels of ReRAM variations. As shown in Fig. 17(a), when the 3σ/µ of ReRAM variations is 30%, the accuracy losses of CREAM for VGG-9 and ResNet-18 on the CIFAR-10 dataset are less than 7% (6.18% and 1.96% drops, respectively), while the accuracy losses on the CIFAR-100 dataset are less than 11% (10.9% and 7.29% drops). The robustness and reliability of the proposed nvSRAM-CIM macro and CREAM architecture are therefore preserved for different

Fig. 16. Evaluations of (a) energy consumption, (b) energy efficiency on VGG-9 and ResNet-18, and (c) detailed energy breakdown on VGG-9.

Fig. 17. Evaluations of (a) accuracy with different levels of ReRAM variations and (b) throughput of CREAM on VGG-9 and ResNet-18.

Fig. 18. Cycles of VGG-9 and ResNet-18 on the CIFAR-10 and CIFAR-100 datasets with CREAM.

NNs and datasets by tolerating the device variations during inference.

The throughput and cycles of CREAM with the naïve mapping and the proposed data-aware weight-mapping methods are compared next. As shown in Fig. 17(b), thanks to the parallel CIM operations enabled by mapping the NN weights to multiple subarrays, the throughput of CREAM with the data-aware weight-mapping method is enhanced by 70.7% on VGG-9 and 76.6% on ResNet-18 compared to the naïve mapping. Furthermore, the cycles of VGG-9 inference with the data-aware mapping method are reduced by 41.4% on both the CIFAR-10 and CIFAR-100 datasets, compared with the naïve mapping method, as shown in Fig. 18. The cycles of ResNet-18 inference on the CIFAR-10 and CIFAR-100 datasets are also reduced, by 43.4% and 44.8%, respectively, compared to the naïve mapping method.

VII. CONCLUSION

This paper has presented CREAM, an energy- and area-efficient computing methodology in ReRAM-assisted SRAM for reliable neural network acceleration. The proposed nvSRAM-CIM cell and the CREAM architecture take advantage of energy-efficient, reliable SRAM-CIM and high-density ReRAM devices for large-scale NN acceleration. A layer-wise and bit-wise configuration search is proposed to optimize the ReRAM group settings with enhanced weight-storage density and high NN accuracy despite ReRAM and CMOS variations. A data-aware mapping method is also presented to efficiently map the NN weights onto the CIM subarrays with high computation parallelism. The memory density of weights in the proposed nvSRAM-CIM array is enhanced by 10.3× compared with the standard 6T SRAM array. The power consumption of weight loading is significantly reduced with the array-level parallel weight restore mechanism, which eliminates off-array memory access. The energy efficiency of the CREAM scheme based on nvSRAM-CIM is enhanced by up to 3.47× compared with the traditional SRAM-CIM and 1.70× compared with a baseline ReRAM-CIM.

REFERENCES

[1] J. Zhang, Z. Wang, and N. Verma, "A machine-learning classifier implemented in a standard 6T SRAM array," in Proc. IEEE Symp. VLSI Circuits, Jun. 2016, pp. 1–2.
[2] S. K. Gonugondla, M. Kang, and N. R. Shanbhag, "A variation-tolerant in-memory machine learning classifier via on-chip training," IEEE J. Solid-State Circuits, vol. 53, no. 11, pp. 3163–3173, Nov. 2018.
[3] X. Si et al., "A twin-8T SRAM computation-in-memory macro for multiple-bit CNN-based machine learning," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2019, pp. 396–398.
[4] X. Si et al., "A 28 nm 64 Kb 6T SRAM computing-in-memory macro with 8 b MAC operation for AI edge chips," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2020, pp. 246–248.
[5] J.-W. Su et al., "A 28 nm 384 kb 6T-SRAM computation-in-memory macro with 8 b precision for AI edge chips," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2021, pp. 250–252.
[6] H. Fujiwara et al., "A 5-nm 254-TOPS/W 221-TOPS/mm2 fully-digital computing-in-memory macro supporting wide-range dynamic-voltage-frequency scaling and simultaneous MAC and write operations," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2022, pp. 1–3.
[7] R. Mochida et al., "A 4M synapses integrated analog ReRAM based 66.5 TOPS/W neural-network processor with cell current controlled writing and flexible network architecture," in Proc. IEEE Symp. VLSI Technol., Jun. 2018, pp. 175–176.
[8] C. Xue et al., "A 1 Mb multibit ReRAM computing-in-memory macro with 14.6 ns parallel MAC computing time for CNN based AI edge processors," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2019, pp. 388–390.
[9] C.-X. Xue et al., "A 22 nm 2 Mb ReRAM compute-in-memory macro with 121–28 TOPS/W for multibit MAC computing for tiny AI edge devices," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2020, pp. 244–246.
[10] C.-X. Xue et al., "A 22 nm 4 Mb 8 b-precision ReRAM computing-in-memory macro with 11.91 to 195.7 TOPS/W for tiny AI edge devices," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2021, pp. 245–247.
[11] J.-M. Hung et al., "An 8-Mb DC-current-free binary-to-8 b precision ReRAM nonvolatile computing-in-memory macro using time-space-readout with 1286.4–21.6 TOPS/W for edge-AI devices," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2022, pp. 1–3.
[12] Y.-D. Chih et al., "An 89 TOPS/W and 16.3 TOPS/mm2 all-digital SRAM-based full-precision compute-in-memory macro in 22 nm for machine-learning edge applications," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2021, pp. 252–254.
[13] P.-C. Wu et al., "A 28 nm 1 Mb time-domain computing-in-memory 6T-SRAM macro with a 6.6 ns latency, 1241 GOPS and 37.01 TOPS/W for 8 b-MAC operations for edge-AI devices," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2022, pp. 1–3.
[14] J. Yue et al., "STICKER-IM: A 65 nm computing-in-memory NN processor using block-wise sparsity optimization and inter/intra-macro data reuse," IEEE J. Solid-State Circuits, vol. 57, no. 8, pp. 2560–2573, Aug. 2022.
[15] P. Deaville, B. Zhang, and N. Verma, "A 22 nm 128-kb MRAM row/column-parallel in-memory computing macro with memory-resistance boosting and multi-column ADC readout," in Proc. IEEE Symp. VLSI Technol. Circuits, Jun. 2022, pp. 268–269.
[16] J. Wang et al., "Reconfigurable bit-serial operation using toggle
[25] S. Tripathi, S. Choudhary, and P. K. Misra, "A novel STT–SOT MTJ-based nonvolatile SRAM for power gating applications," IEEE Trans. Electron Devices, vol. 69, no. 3, pp. 1058–1064, Mar. 2022.
[26] A. Lee et al., "RRAM-based 7T1R nonvolatile SRAM with 2× reduction in store energy and 94× reduction in restore energy for frequent-off instant-on applications," in Proc. Symp. VLSI Technol., Jun. 2015, pp. 76–77.
[27] C. Peng et al., "Average 7T1R nonvolatile SRAM with R/W margin enhanced for low-power application," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 26, no. 3, pp. 584–588, Mar. 2018.
[28] Y. Sun et al., "Energy-efficient nonvolatile SRAM design based on resistive switching multi-level cells," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 66, no. 5, pp. 753–757, May 2019.
[29] X. Li et al., "Design of nonvolatile SRAM with ferroelectric FETs for energy-efficient backup and restore," IEEE Trans. Electron Devices, vol. 64, no. 7, pp. 3037–3040, Jul. 2017.
[30] X. Si et al., "A local computing cell and 6T SRAM-based computing-in-memory macro with 8-b MAC operation for edge AI chips," IEEE J. Solid-State Circuits, vol. 56, no. 9, pp. 2817–2831, Sep. 2021.
[31] Y.-C. Chiu et al., "A 4-Kb 1-to-8-bit configurable 6T SRAM-based computation-in-memory unit-macro for CNN-based AI edge processors," IEEE J. Solid-State Circuits, vol. 55, no. 10, pp. 2790–2801, Oct. 2020.
[32] Z. Jiang, S. Yin, J.-S. Seo, and M. Seok, "C3SRAM: An in-memory-computing SRAM macro based on robust capacitive coupling computing mechanism," IEEE J. Solid-State Circuits, vol. 55, no. 7, pp. 1888–1897, Jul. 2020.
[33] M. Ha, Y. Byun, S. Moon, Y. Lee, and S. Lee, "Layerwise buffer voltage scaling for energy-efficient convolutional neural network," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 40, no. 1, pp. 1–10, Jan. 2021.
[34] S. Ryu et al., "BitBlade: Energy-efficient variable bit-precision hardware accelerator for quantized neural networks," IEEE J. Solid-State Circuits, vol. 57, no. 6, pp. 1924–1935, Jun. 2022.
[35] Y. Long, E. Lee, D. Kim, and S. Mukhopadhyay, "Q-PIM: A genetic algorithm based flexible DNN quantization method and application to processing-in-memory platform," in Proc. 57th ACM/IEEE Design Autom. Conf. (DAC), Jul. 2020, pp. 1–6.
[36] J. H. Ko, D. Kim, T. Na, and S. Mukhopadhyay, "Design and analysis of a neural network inference engine based on adaptive weight compression," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 38, no. 1, pp. 109–121, Jan. 2019.
[37] F. Liu et al., "Improving neural network efficiency via post-training
SOT-MRAM for high-performance computing in memory architecture,” quantization with adaptive floating-point,” in Proc. IEEE/CVF Int. Conf.
IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 69, no. 11, pp. 1–11, Comput. Vis. (ICCV), Oct. 2021, pp. 5261–5270.
Jul. 2022. [38] P.-Y. Chen and S. Yu, “Compact modeling of RRAM devices and its
[17] G. Yin et al., “Enabling lower-power charge-domain nonvolatile in- applications in 1T1R and 1S1R array design,” IEEE Trans. Electron
memory computing with ferroelectric FETs,” IEEE Trans. Circuits Syst. Devices, vol. 62, no. 12, pp. 4022–4028, Dec. 2015.
II, Exp. Briefs, vol. 68, no. 7, pp. 2262–2266, Jul. 2021. [39] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, “TETRIS:
[18] Y. Long, X. She, and S. Mukhopadhyay, “Design of reliable DNN Scalable and efficient neural network acceleration with 3D memory,”
accelerator with un-reliable ReRAM,” in Proc. Design, Autom. Test Eur. SIGARCH Comput. Archit. News, vol. 45, no. 1, pp. 751–764, Apr. 2017.
Conf. Exhib. (DATE). Mar. 2019, pp. 1769–1774. [40] X. Dong, C. Xu, Y. Xie, and N. P. Jouppi, “NVSim: A circuit-level
[19] Y. Sun et al., “Unary coding and variation-aware optimal mapping performance, energy, and area model for emerging nonvolatile memory,”
scheme for reliable ReRAM-based neuromorphic computing,” IEEE IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 31, no. 7,
Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 40, no. 12, pp. 994–1007, Jul. 2012.
pp. 2495–2507, Dec. 2021. [41] B. Zhang et al., “PIMCA: A programmable in-memory computing
[20] Z. Song et al., “ITT-RNA: Imperfection tolerable training for RRAM- accelerator for energy-efficient DNN inference,” IEEE J. Solid-State
crossbar-based deep neural-network accelerator,” IEEE Trans. Comput.- Circuits, vol. 58, no. 5, pp. 1436–1449, May 2023.
Aided Design Integr. Circuits Syst., vol. 40, no. 1, pp. 129–142, [42] R. Fackenthal et al., “A 16 Gb ReRAM with 200 MB/s write and
Jan. 2021. 1 GB/s read in 27 nm technology,” in IEEE Int. Solid-State Circuits
[21] Y. Luo, X. Han, Z. Ye, H. Barnaby, J.-S. Seo, and S. Yu, “Array-level Conf. (ISSCC) Dig. Tech. Papers, Feb. 2014, pp. 338–339.
programming of 3-bit per cell resistive memory and its application for [43] H. Abbas, A. Ali, J. Li, T. T. T. Tun, and D. S. Ang, “Forming-free, self-
deep neural network inference,” IEEE Trans. Electron Devices, vol. 67, compliance WTe2 -based conductive bridge RAM with highly uniform
no. 11, pp. 4621–4625, Nov. 2020. multilevel switching for high-density memory,” IEEE Electron Device
[22] Z. Meng, Y. Sun, and W. Qian, “Write or not: Programming scheme Lett., vol. 44, no. 2, pp. 253–256, Feb. 2023.
optimization for RRAM-based neuromorphic computing,” in Proc. 59th [44] X. Xu et al., “40× retention improvement by eliminating resistance
ACM/IEEE Design Autom. Conf., Jul. 2022, pp. 985–990. relaxation with high temperature forming in 28 nm RRAM chip,” in
[23] Y.-C. Luo, S. Datta, and S. Yu, “Three-dimensional (3D) non-volatile IEDM Tech. Dig., Dec. 2018, pp. 20.1.1–20.1.4.
SRAM with IWO transistor and HZO ferroelectric capacitor,” in Proc. [45] X. Xue et al., “A 0.13 µm 8 Mb logic-based Cux Siy O ReRAM with self-
Int. Symp. VLSI Technol., Syst. Appl., Apr. 2021, pp. 1–2. adaptive operation for yield enhancement and power reduction,” IEEE
[24] J. Singh and B. Raj, “Design and investigation of 7T2M-NVSRAM J. Solid-State Circuits, vol. 48, no. 5, pp. 1315–1322, May 2013.
with enhanced stability and temperature impact on store/restore energy,” [46] M. Zhao et al., “Characterizing endurance degradation of incremental
IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 27, no. 6, switching in analog RRAM for neuromorphic systems,” in IEDM Tech.
pp. 1322–1328, Jun. 2019. Dig., Dec. 2018, pp. 20.2.1–20.2.4.

Authorized licensed use limited to: Universitatsbibliothek Erlangen Nurnberg. Downloaded on December 03,2023 at 16:17:54 UTC from IEEE Xplore. Restrictions apply.
SUN et al.: CREAM FOR RELIABLE NEURAL NETWORK ACCELERATION 3211

Songyuan Liu received the B.E. degree from the School of Electronic, Information and Electrical Engineering, Shanghai Jiao Tong University, in 2021. He is currently pursuing the M.E. degree with the Department of Micro-Nano Electronics, Shanghai Jiao Tong University. His research interests include system on chip and computing in memory.

Yanan Sun (Senior Member, IEEE) received the B.E. degree in microelectronics from Shanghai Jiao Tong University, Shanghai, China, in 2009, and the Ph.D. degree from The Hong Kong University of Science and Technology, Hong Kong, in 2015. She is currently an Associate Professor with the Department of Micro-Nano Electronics, Shanghai Jiao Tong University. Her research interests include energy-efficient VLSI circuits and system design with emerging nanotechnologies. She received the Best Paper Award Nomination in the 2020 IEEE Design, Automation, and Test in Europe Conference (DATE) and the Best Paper Award (First Place) in the 2014 IEEE International Conference on Microelectronics (ICM). She currently serves on the editorial board of the Microelectronics Journal.

Weifeng He (Senior Member, IEEE) received the B.S., M.S., and Ph.D. degrees in microelectronics and solid state electronics from the Harbin Institute of Technology, China, in 1999, 2001, and 2005, respectively. He is currently a Professor with the Department of Micro-Nano Electronics, Shanghai Jiao Tong University, China. His research interests include VLSI architecture for video signal processing, reconfigurable processor architecture, and low-power circuit designs.

Yongpan Liu (Senior Member, IEEE) received the B.S., M.S., and Ph.D. degrees from the Department of Electronic Engineering, Tsinghua University, Beijing, China, in 1999, 2002, and 2007, respectively. He was a Visiting Scholar with The Pennsylvania State University, State College, PA, USA, and The City University of Hong Kong, Hong Kong. He is currently a Professor with the Department of Electronic Engineering, Tsinghua University. He has published more than 200 peer-reviewed conference and journal articles and developed several fast sleep/wakeup nonvolatile processors using emerging memory and artificial intelligence accelerators using algorithm-architecture co-optimization. His research interests include energy-efficient circuits and systems for artificial intelligence, emerging memory devices, and IoT applications.

Dengfeng Wang (Student Member, IEEE) received the B.E. degree from the Department of Physics and Astronomy, Shanghai Jiao Tong University, Shanghai, China, in 2020. He is currently pursuing the Ph.D. degree with the Department of Micro-Nano Electronics, Shanghai Jiao Tong University. His research interests include computing-in-memory and monolithic three-dimensional circuit designs.

Liukai Xu received the B.E. degree from the Department of Micro-Nano Electronics, Shanghai Jiao Tong University, Shanghai, China, in 2021, where he is currently pursuing the M.E. degree. His current research interests include computing-in-memory and neural-network accelerator designs.

Huazhong Yang (Fellow, IEEE) received the B.S. degree in microelectronics and the M.S. and Ph.D. degrees in electronic engineering from Tsinghua University, Beijing, China, in 1989, 1993, and 1998, respectively. In 1993, he joined the Department of Electronic Engineering, Tsinghua University, where he has been a Full Professor since 1998. He has been in charge of several projects, including projects sponsored by the National Science and Technology Major Project, the 863 Program, NSFC, and several international research projects.

Yiming Chen (Student Member, IEEE) received the B.S. degree from the Department of Electronic Engineering, Tsinghua University, Beijing, China, in 2021, where he is currently pursuing the Ph.D. degree. His current research interests include computing-in-memory architecture and co-optimization on artificial intelligence.

Xueqing Li (Senior Member, IEEE) received the B.S. and Ph.D. degrees from the Department of Electronic Engineering, Tsinghua University, Beijing, China, in 2007 and 2013, respectively. From 2013 to 2017, he was a Post-Doctoral Researcher with the Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA. He joined the Department of Electronic Engineering, Tsinghua University, as an Assistant Professor, in 2018. He is currently an Associate Professor with the Department of Electronic Engineering, Tsinghua University. He has more than 100 publications and holds 20 China and U.S. patents. His research interests include high-performance data converter circuit designs, emerging memory, and memory-oriented computing. He is an Associate Editor of the IEEE Transactions on VLSI.

Zhi Li received the B.E. degree from the Department of Micro-Nano Electronics, Shanghai Jiao Tong University, Shanghai, China, in 2020, where he is currently pursuing the M.E. degree. His research interests include in-memory-computing software and hardware co-designs.
