CREAM: Computing in ReRAM-Assisted Energy- and Area-Efficient SRAM for Reliable Neural Network Acceleration
NNs [14], but the goal of eliminating DRAM access by storing the full weights of larger and more general NNs on-chip requires further effort.

To enlarge the on-chip memory capacity, CIM based on emerging non-volatile memory (NVM) such as ReRAMs [7], [8], [9], [10], [11], MRAMs [15], [16], and FeFETs [17] is being widely explored. Among them, the two-terminal ReRAM devices with a large on/off resistance ratio enable higher weight-storage density and larger on-chip capacity than SRAM arrays, as shown in Fig. 1(b). Like other NVMs, the non-volatile ReRAM-CIM can be powered off to save leakage energy without losing the weights. Unfortunately, limited by DC power consumption or complex peripherals, ReRAM-CIM still suffers from high inference energy compared with SRAM-CIM, even with recent advanced techniques [11], as shown in Fig. 1(c). Moreover, the stochastic filament-based switching in ReRAM causes significant process variations, which are challenging to handle and can lead to severe NN accuracy loss [18]. Recent efforts have
proposed unary weight-coding methods [19], on-device and off-device co-design training algorithms [20], write-and-verify programming schemes [21], [22], etc. However, it remains a significant challenge to implement robust large-scale NNs with high energy and area efficiency on ReRAM-CIM.

In this work, we propose a novel non-volatile SRAM-CIM (nvSRAM-CIM) macro that combines non-volatile, high-density ReRAM devices with robust and energy-efficient SRAM cells for large NNs. Furthermore, we propose a CIM architecture, namely CREAM, to exploit the nvSRAM-CIM macro with a novel layer-wise and bit-wise configuration search algorithm and a data-aware mapping method for NN inference. The contributions of this work include:

1) A ReRAM-based nvSRAM-CIM macro and the CREAM architecture, which for the first time leverage grouped high-density ReRAM devices for NN weight storage together with energy-efficient SRAM-CIM for NN inference, providing a new CIM paradigm for energy-efficient NN acceleration;
2) Comprehensive circuit analysis of the macro structure and circuit parameters, including the number of ReRAM devices associated with an SRAM cell and a proposed grouped ReRAM organization, for an optimized trade-off between memory capacity and restore yield;
3) A custom architecture that exploits the characteristics of the proposed nvSRAM-CIM macro, including a variation-aware layer-wise and bit-wise configuration search algorithm for high-density storage and high-accuracy inference, and a data-aware weight-mapping method for low inference latency and high hardware utilization;
4) Evaluation of the proposed ReRAM-based nvSRAM-CIM macro and CREAM architecture, showing 10.3× higher memory density than a traditional SRAM array and up to 3.47× higher energy efficiency than traditional SRAM-CIM.

In the rest of this paper, Section II introduces the challenges of the conventional nvSRAM and CIM circuits. Section III presents nvSRAM-CIM and CREAM. Section IV discusses the impact of the ReRAM group settings on the restore yield and the memory density. Section V presents the layer-wise and bit-wise configuration search algorithm and the data-aware weight-mapping method. Section VI presents the evaluation results. Finally, Section VII concludes this work.

II. CHALLENGES OF NVSRAM AND CIM

The challenges of the existing nvSRAM circuits and CIM circuits are described in this section, so as to delineate the design space of the proposed ReRAM-based nvSRAM-CIM.

A. Previous Non-Volatile SRAM Cells

Many NVM-based nvSRAM designs have been explored for frequent-off and instant-on applications [23], [24], [25], [26], [27], [28], [29]. The adopted NVM technologies include the ferroelectric capacitor [23], MRAM [24], [25], ReRAM [26], [27], [28], and FeFET [29]. Among these designs, the ReRAM-based nvSRAM features high density and a medium-to-high on/off ratio.

Fig. 2. Previous 7T1R SLC-nvSRAM cell [26] that supports only bit-to-bit store/restore operations and no in-memory computing.

ReRAM-based nvSRAMs have been presented in recent works based on single-level cell (SLC) [26], [27] and multi-level cell (MLC) [28] ReRAM devices. The schematic of the previously published 7T1R SLC-nvSRAM cell [26] is shown in Fig. 2. SL and SWL denote the source line and the switching line, respectively. The two differential supply rails are labeled VDDQ and VDDQB. By moving the 1-bit logic information in the SRAM cell to the non-volatile SLC ReRAM device RQ, the 7T1R SLC-nvSRAM cell [26] can power off the leaky SRAM cell during the standby period to save energy. The logic state is recalled from RQ and restored into the SRAM cell when the supply is powered on. In [28], an energy-efficient MLC-nvSRAM circuit is presented to back up the 2-bit data of every two SRAM cells into a single four-level MLC ReRAM device. However, the previously published nvSRAM circuits [24], [25], [26], [27], [28] essentially support only bit-to-bit store/restore operations and no CIM functions, so the memory density of logic information remains limited. Inspired by these previous works, we propose to adopt the nvSRAM circuit concept and enable in-memory MAC operations to perform NN inference. Furthermore, we extend it to a multi-ReRAM-one-SRAM nvSRAM that supports more weights within the footprint of one SRAM cell for enhanced data-storage density.

B. Previous CIM Based on SRAM and ReRAM Cells

Fig. 3. Computing-in-memory architecture to accelerate MAC operations of NN: (a) ReRAM-CIM [19], and (b) SRAM-CIM [30].

In many existing CIM solutions, as shown in Fig. 3, the NN weights are quantized and stored in the CIM memory cells
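The generic dataflow that Fig. 3 depicts — weights held stationary in the array, 1-bit input slices applied on the wordlines, and the bitline accumulating the products — can be illustrated with a toy numeric sketch. This is our own simplification for illustration, not the paper's circuit model; all sizes and values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

rows = 32                                # simultaneously activated wordlines
w_bits = rng.integers(0, 2, size=rows)   # one stored 1-bit weight slice
x_bits = rng.integers(0, 2, size=rows)   # one applied 1-bit input slice

# Each row with (input = 1, weight = 1) contributes one cell-current unit;
# the shared bitline accumulates them and an ADC digitizes the partial sum.
partial_mac = int(np.sum(w_bits & x_bits))

# Multi-bit MACs are rebuilt by shift-and-add over the bit-sliced partials.
print("1-bit partial MAC on this column:", partial_mac)
```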
Fig. 4. Illustrations of (a) proposed CREAM architecture and (b) nvSRAM-CIM cell.
Fig. 5. The operations and waveforms of (a) store mode and (b) restore mode.
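Before the circuit-level details, a minimal data-structure sketch may make the grouped-ReRAM organization of Fig. 4(b) concrete: one SRAM cell backed by m groups of n ReRAMs, so a single cell footprint can hold m×n non-volatile weight bits. The class and method names below are ours, not the paper's.

```python
class NvSramCimCell:
    def __init__(self, m=4, n=8):
        self.m, self.n = m, n
        # ReRAM states addressed as R_{i_j}: group i (via GRO_SEL_i),
        # device j on source line SL_j. 'HRS' encodes '0', 'LRS' encodes '1'.
        self.rram = [['HRS'] * n for _ in range(m)]
        self.sram = 0                       # the single volatile SRAM bit

    def store(self, i, j):
        # Back up the current SRAM bit into the selected device R_{i_j}.
        self.rram[i][j] = 'LRS' if self.sram else 'HRS'

    def restore(self, i, j):
        # Recall R_{i_j} into the SRAM latch before a CIM operation.
        self.sram = 1 if self.rram[i][j] == 'LRS' else 0
        return self.sram

cell = NvSramCimCell()                      # (m, n) = (4, 8)
cell.sram = 1
cell.store(i=2, j=5)
assert cell.restore(i=2, j=5) == 1          # 4 * 8 = 32 bits per SRAM footprint
```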
considered as the safe voltage range to guarantee the reliability of the core transistors and ReRAMs after the electroforming and stressed programming processes under 28nm CMOS technology. Assuming that a forming process is needed, Rsi can be formed by activating the forming path (Rsi–NGi–N10) with SLi, GRO_SELi, and CBL biased to VF, VDD, and 0V, respectively. Similarly, Rref can be formed by activating the forming path (Rref–N6–N4) with WL/RSTR, SLref, and BLB biased to VDD, VF, and 0V, respectively.

B. Store Operation of Proposed nvSRAM-CIM Cell

In the store mode, each bit of NN weights is preloaded from an SRAM cell to one of the grouped ReRAMs. Weights of certain NN layers are initially written into the SRAM cells of the nvSRAM-CIM arrays by conventional SRAM writes. We assume that the Ri_j are connected to SLj and located in the ith group of an nvSRAM-CIM cell. Each bit of SRAM data is then stored into the selected Ri_j of each nvSRAM-CIM cell with array-level parallelism for the whole array using the store operation.

The store mode consists of two phases, with the operations and waveforms illustrated in Fig. 5(a). In store phase-1, STR is grounded while RST and CBL are at VDDH, and CTRL is tied to VDD. By biasing GRO_SELi to VDDH and SLj to GND, all the selected Ri_j in the whole array are simultaneously initialized and reset to HRS, representing '0'. Meanwhile, all the unselected ReRAMs are kept intact by setting SLx to VDDL1, as listed in Table I.

In store phase-2, STR transitions from 0V to VDD. GRO_SELi and SLj are set to VDD to activate the selected
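A compact behavioral sketch of this two-phase store follows. It is our own simplification: the biasing is reduced to state updates, and phase-2 is modeled only as the conditional set implied by the store-mode description above.

```python
def store(array, i, j, sram_bits):
    # Phase 1 (STR = 0, RST/CBL = VDDH, GRO_SEL_i = VDDH, SL_j = GND):
    # every selected R_{i_j} in the array is reset to HRS ('0') in parallel,
    # while unselected ReRAMs stay intact (their SL_x held at VDDL1, Table I).
    for cell in array:
        cell[(i, j)] = 'HRS'
    # Phase 2 (STR -> VDD, GRO_SEL_i and SL_j = VDD): each cell conditionally
    # sets its selected device from its SRAM data -- only '1' bits become LRS.
    for cell, bit in zip(array, sram_bits):
        if bit:
            cell[(i, j)] = 'LRS'

array = [dict() for _ in range(4)]          # toy 4-cell array slice
store(array, i=0, j=3, sram_bits=[1, 0, 1, 1])
print([cell[(0, 3)] for cell in array])     # ['LRS', 'HRS', 'LRS', 'LRS']
```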
TABLE II
Restore Yield With Different ReRAM On/Off Ratios and nvSRAM-CIM Cell Settings at a ReRAM Variation (3σ/µ) of 10%
Fig. 12. Impact of different ReRAM settings (m, n) under the layout area restriction on (a) storage density and (b) restore yield at 3σ/µ = 10%.
as reported in [48] can be employed for achieving the target optimal value of Rref.

The impact of RH/RL for given ReRAM settings (m, n) is evaluated as listed in Table II. RH/RL cannot be too low, in order to yield a sufficient voltage difference (VDifference) between different data patterns, as shown in (3) in Section IV-A. With a fixed (m, n), a higher RH/RL leads to a higher Avg-Yield, owing to the larger voltage difference VDifference between the two different '0' and '1' states to be restored.

The restore yield of an nvSRAM-CIM cell under both CMOS and ReRAM variations is evaluated at different levels of 3σ/µ variation of the filament gap in the ReRAMs, as shown in Fig. 10(a). The optimal Rref values for the different settings are also labeled. The ReRAM variations dominate and induce an apparent drop in restore yield when the 3σ/µ of the ReRAM variations reaches 30%.

The impact of different ReRAM group settings (m, n) on the restore yield is evaluated at the corresponding optimal Rref values, as shown in Fig. 10(b), where m is varied from 4 to 6 at three different values (4, 6, and 8) of n. The restore yield decreases with increased n, as VDifference is reduced according to (3). Note that, since different ReRAM groups are isolated by the group select transistors, the number of groups m in each nvSRAM-CIM cell has little impact on the restore yield, as shown in Fig. 10(b).

The layout of an nvSRAM-CIM cell is shown in Fig. 11, assuming that the ReRAM setting (m, n) is (4, 8) and the footprint size of each ReRAM is 50nm×50nm (equivalent to the via size in 28nm). Note that the ReRAMs incur no area overhead to the nvSRAM-CIM cells, thanks to their backend-of-line (BEOL) process compatibility with CMOS. The layout area increases with larger m due to more group select transistors. For a given m, the layout area of the nvSRAM-CIM cell remains unchanged for relatively small n values but increases when n exceeds a certain value. In this work, large n values for a given m that induce extra area overhead are not considered, in favor of high integration density. Considering the above layout restriction, the design space of all the possible ReRAM settings, mapped in terms of storage density and restore yield, is shown in Fig. 12. The storage density increases with either larger n or larger m. Alternatively, the restore yield is approximately unchanged with different m when n is fixed; however, the restore yield degrades with increased n. This reveals the trade-off between storage density and restore robustness, which will be further discussed in the next section. Lower bounds on the restore yield (assumed to be 80% Avg-Yield) and on the density (assumed to be 6 bits/µm²) are set to maintain high NN accuracy and storage density, as highlighted by the red lines in Fig. 12. Meanwhile, (m, n) settings with both relatively low restore yield and low storage density are not considered in this work.

V. LAYER-WISE AND BIT-WISE YIELD-AWARE CONFIGURATION SEARCH AND MAPPING METHOD

Considering the need for robust operation under the non-ideal factors described in Section IV, and to achieve maximized storage capacity of NN weights with sufficient NN accuracy and computing efficiency, this section presents (i) the proposed weight-restore yield-aware configuration search algorithm to optimize the (m, n) settings, and (ii) the mapping method from NN models to the ReRAMs in CREAM to enhance the computing parallelism with low inference latency, as shown in Fig. 13.

Fig. 14. Flowchart of the proposed layer-wise and bit-wise configuration search algorithm.

Fig. 15. Comparison of CIM and restore operations between the traditional and proposed mapping methods for the mapping of layer x (Cx = 16, Mx = 128; Ri refers to the ith ReRAMs in nvSRAM-CIM cells).
A. Layer-Wise and Bit-Wise Yield-Aware Configuration Search

While a higher restore yield reduces the deviations of weights and MAC values during inference, we also notice that different layers have different levels of tolerance to the weight deviations [33], [34], [35], [36] for a given NN model. In addition, bits of weight with different significance in different layers may have diverse tolerances to the hardware deviations. To store the weights with high density while maintaining high accuracy, the ReRAM group settings (m, n) of the different bits of weights in different NN layers are optimized here.

Fig. 14 illustrates the flowchart of the proposed weight-configuration (CFG) search algorithm, which is also applicable to different NN models. The total available hardware resources (including the total number of nvSRAM-CIM subarrays and the corresponding ReRAM settings (m, n) in each subarray) are assumed to be determined by the CFG search results of the largest-scale NN model in this work. To ensure that the hardware resources are highly utilized for different NN models, the weight-configuration for smaller-scale NNs is also adjusted by taking the total available hardware resources into consideration. The search can be divided into three stages.

In the first stage, bit-wise weight-configuration sorting (BWS) is used to generate the lookup table of CFG candidates used in the latter stages. Every two adjacent bits of the 8-bit weights share the same ReRAM group setting (m, n) for quick convergence. Each group setting (m, n) is selected from the 7 candidates highlighted in Fig. 13, so a total of 7⁴ configurations with different ReRAM group settings (m, n) for every two bits of weights are traversed. For each configuration, the weight deviation (ΔW) is obtained by 10,000-sample Monte-Carlo simulations based on the corresponding restore yield of (m, n) for each bit, together with the storage density (Dw) of the weights, as shown in Fig. 14. Then, the weight-configurations are sorted and further pruned by removing the candidates with high ΔW but low Dw. The remaining weight-configurations are sorted in ascending order of ΔW and Dw, forming a lookup table that is sent to the second stage.

The second stage is the layer-wise weight-configuration search (LWS), which searches for a preliminary weight-configuration to enhance the weight-storage density while maintaining high accuracy. The weights within each layer are assumed to share the same weight-configuration. The NN layers are first sorted in descending order of their sensitivity to the weight deviations, which is evaluated as the KL-divergence between the ideal distribution of the layer output without any deviations and the non-ideal distribution with weight deviations under device variations. The smaller the KL-divergence, the less sensitive the NN layer is [37].
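As a minimal illustration of this sensitivity metric, the sketch below compares histogram estimates of a toy layer's ideal and deviated pre-activation distributions. It is our own construction: the Gaussian weight perturbation stands in for the restore-error statistics, and all shapes and values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_divergence(p, q, eps=1e-12):
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def layer_sensitivity(w, x, sigma_w, bins=64):
    """Histogram-based KL between ideal and deviated layer outputs."""
    y_ideal = x @ w
    y_dev = x @ (w + rng.normal(0.0, sigma_w, size=w.shape))  # weight deviation
    lo, hi = min(y_ideal.min(), y_dev.min()), max(y_ideal.max(), y_dev.max())
    p, _ = np.histogram(y_ideal, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(y_dev, bins=bins, range=(lo, hi), density=True)
    return kl_divergence(p / p.sum(), q / q.sum())

w = rng.normal(0, 1, size=(144, 128))         # toy layer (cf. layer x, Fig. 15)
x = rng.normal(0, 1, size=(256, 144))         # calibration activations
print(layer_sensitivity(w, x, sigma_w=0.05))  # larger KL -> more sensitive layer
```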
Meanwhile, an initial accuracy loss threshold (Lth), within which an NN can be accommodated to weight-configurations of different restore yields, is set. The maximum allowed accuracy loss threshold (ΔACCth_l) of each layer is then set as the product of Lth and the parameter percentage of the current layer in the whole NN. The LWS starts with the most sensitive layer and proceeds layer by layer in descending order of sensitivity. For layer l, the maximum allowed weight deviation (ΔWth_l) is determined such that ΔACCth_l is not exceeded. The weight-configuration with ΔWl just below ΔWth_l is then selected from the CFG lookup table of BWS as the weight-configuration CFGl for layer l. Note that the corresponding maximum achievable weight-storage density Dl is also found. CFGl is then fixed for layer l in LWS, and the weight deviations caused by the CFGs of the previous layers are also considered in the search of the latter layers. By the end of LWS, the final CFGs for each layer of the largest-scale NN are obtained, while the preliminary CFGs for the smaller NNs are also derived. The largest-scale NN is retrained with the final CFGs of each layer to enhance the accuracy. Then, the hardware resources of the total required subarrays with the corresponding ReRAM group settings (m, n) for all bits of weight in the largest-scale NN model are calculated, which also determines the total hardware resources available for smaller NN models.

In the third stage, the weight-configuration adjustment for smaller-scale NNs (WAS) is executed to enhance the hardware utilization of the nvSRAM-CIM subarrays with high restore yield while improving the NN accuracy, regarding the total available hardware resources. The preliminary hardware requirements of the smaller-scale NNs derived by LWS are compared with the total available hardware resources to determine whether there are sufficient subarrays with certain ReRAM group settings for weight storage. WAS also starts from the NN layer with the highest sensitivity of the smaller-scale NNs. If there are sufficient subarrays with the corresponding ReRAM settings to accommodate the weights of a certain layer, no CFG adjustment is required. If, alternatively, the hardware resources are not enough to accommodate the weights, the weight-configurations of some bits of the layer are adjusted by WAS. Specifically, if there already exist nvSRAM-CIM subarrays with a higher restore yield, several bits of the layer are allocated to those subarrays to preserve high accuracy. If the number of subarrays with a higher restore yield is still not enough to accommodate the weights, the total accuracy loss threshold Lth is increased to store the NN weights, and Lth is fed back to LWS. This process is repeated until the whole NN is stored in the nvSRAM-CIM subarrays. Finally, the smaller NN models are retrained with the final CFG results to adapt to the restore errors and enhance the accuracy.

Algorithm 1 Data-Aware Weight-Mapping Method
Input: CFG for each layer of the NN model; the number (Nk) of subarrays sharing the same ReRAM group settings (mk, nk), where k ≥ 1; quantization bits q of the weights
Output: Weight-mapping scheme
1: for each layer l do
2:   Convert each bit of weight in layer l to a 2D matrix;
3:   Divide the weight matrix into bl blocks;
4: end for
5: for each (mk, nk) do
6:   Calculate the total required blocks Bk_req of the NN according to CFG;
7:   Calculate the remaining available blocks Bk_ra = 8 × mk × nk × Nk − Bk_req;
8: end for
9: for each layer l do
10:   Calculate the required blocks Bk_reql for layer l;
11:   Calculate the accelerable blocks Bk_accl = Bk_reql mod Nk;
12:   Calculate the duplication times DT = min(⌊Nk / Bk_accl⌋) − 1;
13:   if DT > 0 and Bk_ra > Bk_accl × DT then
14:     Duplicate the blocks of layer l DT times;
15:     Update Bk_ra = Bk_ra − Bk_accl × DT;
16: end for
17: Distribute all blocks of the NN evenly to subarrays according to CFG;
18: for each block bi of all NN layers do
19:   if there exist sufficient unoccupied columns in the same row of the subarray as the mapped block bi−1 then
20:     Map the block bi onto the unoccupied columns;
21:   else
22:     Map the block bi onto columns of new rows;
23: end for

B. Traditional Naïve Weight-Mapping Method

With the traditional naïve weight-mapping method, each bit of weights is stored into the ReRAMs one by one in the subarrays of CREAM that share the same ReRAM group settings, similar to traditional ReRAM-CIM. This mapping is performed layer by layer. A convolutional layer with Cx input channels and Mx output channels is first converted into q weight matrices, each with Cx × k × k rows and Mx columns, where the kernel size is k × k and the NN weights are quantized to q bits. Each weight matrix stores one bit of the weights in the layer. Then, the weight matrices are divided into blocks. The size of each block equals the maximum activated part of an nvSRAM-CIM subarray. Considering the maximum of 32 activated rows in the 256 × 256-bit nvSRAM-CIM macros, the maximum size of each block is 32 × 256. Finally, each block of weights is mapped onto the unoccupied ReRAMs one by one in the nvSRAM-CIM subarrays. Different bits of a layer are mapped to different subarrays for parallel computation.

As shown in Fig. 15, taking layer x with 16 input channels, 128 output channels, and 3 × 3 kernels as an example, each bit of weight is converted to a 144 × 128 weight matrix. For the kth bit of weight w[k], the weight matrix is first divided into 5 blocks. With the naïve mapping, all 5 weight blocks are then mapped to ReRAMs on a single subarray-1. Among them, 2 blocks are mapped to the remaining unoccupied rows of Ri, while 3 blocks are mapped to the new Ri+1. To perform the computation for w[k] in layer x, a total of 2 restore operations (for restoring Ri and Ri+1) and 5 CIM operations are required with the naïve weight-mapping method.
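The block arithmetic of this example can be reproduced directly; the short sketch below (our illustration of the counting described above) computes the block count for layer x under the 32 × 256 block limit.

```python
import math

Cx, Mx, k, q = 16, 128, 3, 8          # layer x and 8-bit weight quantization
rows, cols = Cx * k * k, Mx           # per-bit weight matrix: 144 x 128

BLOCK_ROWS, BLOCK_COLS = 32, 256      # maximum activated part of a subarray
blocks = math.ceil(rows / BLOCK_ROWS) * math.ceil(cols / BLOCK_COLS)
print(blocks)                         # -> 5 blocks per weight bit
print(q * blocks)                     # -> 40 blocks over all 8 weight bits

# The naive mapping stacks all 5 blocks of w[k] in subarray-1: 2 on the
# unoccupied rows of R_i and 3 on R_{i+1}, hence 2 restore operations and
# 5 sequential CIM operations.
```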
C. Proposed Data-Aware Weight-Mapping Method

To speed up the CIM operations with enhanced computation parallelism, the data-aware weight-mapping method is presented in this section to efficiently map the weights of different NN models to the ReRAMs in multiple nvSRAM-CIM subarrays. The details are shown in Algorithm 1. Each NN layer is first converted into a weight matrix and divided into blocks in the same way as in the naïve weight-mapping. Then, the weight blocks are distributed and mapped to each subarray as evenly as possible so that the MAC operations are carried out by different subarrays in parallel, instead of being clustered into a single subarray. By distributing the weights evenly, CIM operations can be performed in parallel by multiple subarrays, and the NN inference latency can be reduced. Finally, the blocks of weights distributed to the subarrays are mapped onto ReRAMs. When only part of the Ri in a subarray are occupied by the previous layers, the weights of the current layer can also be mapped to the unoccupied Ri in the same subarray without wasting hardware resources. To map the NN weights compactly and minimize the restore times, blocks are first assigned to the empty space beside the last mapped block in the subarray.

Firstly, each NN layer is converted into a weight matrix and divided into blocks (lines 1–4). With the weight blocks, the number of blocks required for the whole NN with the corresponding (m, n) settings is calculated, and the remaining available blocks Bk_ra of each (mk, nk) are also calculated by subtraction (lines 5–8).

Then, the weight duplication is performed in lines 9–16. When the number of weight blocks is not divisible by the number of subarrays, only some subarrays may perform CIM operations while the others are not computing. Meanwhile, it is a common case that the NN size does not exactly match the on-chip storage capacity, and some ReRAMs may not be occupied after the mapping of the whole NN. This makes it naturally practical to duplicate some weights onto idle ReRAMs and increase the overall computing throughput. Firstly, the total required blocks Bk_reql of (mk, nk) for a certain layer are counted. Then, the number of accelerable blocks Bk_accl is calculated according to the required blocks and the subarray number, since only part of the weights can be duplicated for the acceleration. The duplication time is determined as the minimum of the floored quotient of the subarray number Nk and Bk_accl. If the duplication time is more than 1 and enough on-chip hardware resources exist, the blocks of the layer are duplicated and Bk_ra is updated. Otherwise, the layer is not duplicated and Bk_ra remains unchanged.

Finally, in lines 17–23, the weight blocks are distributed to subarrays and mapped. The weight blocks are distributed to each subarray as evenly as possible so that the MAC operations of different subarrays are carried out in parallel, instead of clustering into a single subarray. All the weight blocks are mapped to ReRAMs on the corresponding subarrays. To map the NN weights compactly, blocks are first assigned to the unoccupied columns in the same row of the subarray as the last mapped block. Otherwise, if not enough empty columns exist, the block is mapped onto columns of new rows. When all rows of Ri in a subarray are occupied, the weights of the current layer are mapped to the unoccupied Ri+1 in the same subarray.

As shown in Fig. 15, assume that there are 4 subarrays (subarray-1 to 4) sharing the same ReRAM setting (mk, nk) for each bit of weights in layer x. By adopting the proposed data-aware weight-mapping method, the total 5 weight blocks of w[k] are distributed to the 4 subarrays according to the CFG. Among them, 4 blocks B(k, 1)–B(k, 4) of w[k] are mapped to subarray-1 to 4, respectively, and can be computed simultaneously. As there still exist sufficient unoccupied columns in the same row as B(k, 1)–B(k, 4), the remaining block B(k, 5) is then duplicated 3 times and mapped to all 4 subarrays, which further enhances the computation parallelism. Consequently, it takes only 1 restore operation (simultaneously restoring Ri1–Ri4) and 1.25 CIM operations for w[k], as shown in Fig. 15. Compared to the naïve mapping with only one subarray used at a time, the numbers of both restore and CIM operations are reduced with the proposed data-aware weight-mapping.

VI. RESULTS AND ANALYSIS

This section evaluates the proposed CIM macro and the CREAM architecture, and compares them with CIM baselines on density, energy efficiency, accuracy, throughput, etc.

A. Memory Density of nvSRAM-CIM Cell

SPICE simulations in this section are performed at 25°C using the physics-based ReRAM Verilog-A model from [38] and a commercial 28nm CMOS technology. The supply voltage is 0.9V. The store/restore/CIM energy consumption (Estore, Erestore, ECIM), bits per cell, cell layout area (Acell), 256×256-bit array area (Aarray), and storage density of the nvSRAM-CIM cell are evaluated as listed in Table III. The ReRAM group setting (m, n) is assumed to be (4, 8). The standard 6T SRAM and the previous 7T1R nvSRAM [26] cells are also evaluated.

While the standard 6T SRAM and the previous 7T1R nvSRAM can only store 1 bit per cell, the proposed nvSRAM-CIM cell can store 32 bits of data in a single cell when (m, n) is (4, 8). This large density improvement is obtained with only 89.5% and 53.3% increases in store and restore energy consumption, respectively, as listed in Table III. In addition, the number of bits stored in each nvSRAM-CIM cell can be flexibly adjusted by the ReRAM setting (m, n), as presented in Section IV-B. In practice, the storage density is enhanced significantly, by 10.3× and 12.0× compared to the standard 6T SRAM and the previous 7T1R nvSRAM, respectively, and the density benefits can be further enhanced by adopting monolithic three-dimensional technology for the grouped ReRAMs.

B. Performance of CREAM Scheme

Two typical NN models, ResNet-18 and VGG-9, are adopted to evaluate CREAM. The accuracy of both NNs is tested on the CIFAR-10 and CIFAR-100 datasets. The evaluation parameters of the circuits are shown in Table IV. The NN weights are quantized to 8 bits, while the inputs of each layer are quantized to 4 bits. In the evaluated methods, each ADC is shared by 4 columns, and 32 rows are activated at the same time in a subarray to maintain high NN accuracy.
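As a numeric cross-check of the mapping methods compared in this evaluation, the duplication arithmetic of Algorithm 1 (lines 9–16) on the Fig. 15 example of Section V-C works out as sketched below. This is our illustration; the variable names mirror the pseudocode, and the fractional CIM count follows the paper's 5/4 = 1.25 accounting.

```python
Nk = 4                                    # subarrays sharing this (mk, nk)
B_req = 5                                 # weight blocks of w[k] (Fig. 15)
B_acc = B_req % Nk                        # accelerable blocks (line 11) -> 1
DT = (Nk // B_acc - 1) if B_acc else 0    # duplication times (line 12) -> 3

# B(k,1)..B(k,4) fill subarrays 1-4 in one parallel pass; B(k,5) is duplicated
# DT = 3 times so the leftover work is also spread across all four subarrays.
cim_ops = B_req / Nk                      # 5 / 4 = 1.25 equivalent CIM ops
restores = 1                              # R_i1-R_i4 restored simultaneously
print(DT, cim_ops, restores)              # -> 3 1.25 1
```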
Fig. 16. Evaluations of (a) energy consumption, (b) energy efficiency on VGG-9 and ResNet-18, and (c) detailed energy breakdown on VGG-9.
Fig. 17. Evaluations of (a) accuracy with different levels of ReRAM variations and (b) throughput of CREAM on VGG-9 and ResNet-18.
REFERENCES

[6] H. Fujiwara et al., "A 5-nm 254-TOPS/W 221-TOPS/mm2 fully-digital computing-in-memory macro supporting wide-range dynamic-voltage-frequency scaling and simultaneous MAC and write operations," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2022, pp. 1–3.
[7] R. Mochida et al., "A 4M synapses integrated analog ReRAM based 66.5 TOPS/W neural-network processor with cell current controlled writing and flexible network architecture," in Proc. IEEE Symp. VLSI Technol., Jun. 2018, pp. 175–176.
[8] C. Xue et al., "A 1 Mb multibit ReRAM computing-in-memory macro with 14.6 ns parallel MAC computing time for CNN based AI edge processors," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2019, pp. 388–390.
[9] C.-X. Xue et al., "A 22 nm 2 Mb ReRAM compute-in-memory macro with 121–28 TOPS/W for multibit MAC computing for tiny AI edge devices," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2020, pp. 244–246.
[10] C.-X. Xue et al., "A 22 nm 4 Mb 8 b-precision ReRAM computing-in-memory macro with 11.91 to 195.7 TOPS/W for tiny AI edge devices," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2021, pp. 245–247.
[11] J.-M. Hung et al., "An 8-Mb DC-current-free binary-to-8 b precision ReRAM nonvolatile computing-in-memory macro using time-space-readout with 1286.4–21.6 TOPS/W for edge-AI devices," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2022, pp. 1–3.
[12] Y.-D. Chih et al., "An 89 TOPS/W and 16.3 TOPS/mm2 all-digital SRAM-based full-precision compute-in memory macro in 22 nm for machine-learning edge applications," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2021, pp. 252–254.
[13] P.-C. Wu et al., "A 28 nm 1 Mb time-domain computing-in-memory 6T-SRAM macro with a 6.6 ns latency, 1241 GOPS and 37.01 TOPS/W for 8 b-MAC operations for edge-AI devices," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2022, pp. 1–3.
[14] J. Yue et al., "STICKER-IM: A 65 nm computing-in-memory NN processor using block-wise sparsity optimization and inter/intra-macro data reuse," IEEE J. Solid-State Circuits, vol. 57, no. 8, pp. 2560–2573, Aug. 2022.
[15] P. Deaville, B. Zhang, and N. Verma, "A 22 nm 128-kb MRAM row/column-parallel in-memory computing macro with memory-resistance boosting and multi-column ADC readout," in Proc. IEEE Symp. VLSI Technol. Circuits, Jun. 2022, pp. 268–269.
[16] J. Wang et al., "Reconfigurable bit-serial operation using toggle SOT-MRAM for high-performance computing in memory architecture," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 69, no. 11, pp. 1–11, Jul. 2022.
[17] G. Yin et al., "Enabling lower-power charge-domain nonvolatile in-memory computing with ferroelectric FETs," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 68, no. 7, pp. 2262–2266, Jul. 2021.
[18] Y. Long, X. She, and S. Mukhopadhyay, "Design of reliable DNN accelerator with un-reliable ReRAM," in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Mar. 2019, pp. 1769–1774.
[19] Y. Sun et al., "Unary coding and variation-aware optimal mapping scheme for reliable ReRAM-based neuromorphic computing," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 40, no. 12, pp. 2495–2507, Dec. 2021.
[20] Z. Song et al., "ITT-RNA: Imperfection tolerable training for RRAM-crossbar-based deep neural-network accelerator," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 40, no. 1, pp. 129–142, Jan. 2021.
[21] Y. Luo, X. Han, Z. Ye, H. Barnaby, J.-S. Seo, and S. Yu, "Array-level programming of 3-bit per cell resistive memory and its application for deep neural network inference," IEEE Trans. Electron Devices, vol. 67, no. 11, pp. 4621–4625, Nov. 2020.
[22] Z. Meng, Y. Sun, and W. Qian, "Write or not: Programming scheme optimization for RRAM-based neuromorphic computing," in Proc. 59th ACM/IEEE Design Autom. Conf. (DAC), Jul. 2022, pp. 985–990.
[23] Y.-C. Luo, S. Datta, and S. Yu, "Three-dimensional (3D) non-volatile SRAM with IWO transistor and HZO ferroelectric capacitor," in Proc. Int. Symp. VLSI Technol., Syst. Appl., Apr. 2021, pp. 1–2.
[24] J. Singh and B. Raj, "Design and investigation of 7T2M-NVSRAM with enhanced stability and temperature impact on store/restore energy," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 27, no. 6, pp. 1322–1328, Jun. 2019.
[25] S. Tripathi, S. Choudhary, and P. K. Misra, "A novel STT–SOT MTJ-based nonvolatile SRAM for power gating applications," IEEE Trans. Electron Devices, vol. 69, no. 3, pp. 1058–1064, Mar. 2022.
[26] A. Lee et al., "RRAM-based 7T1R nonvolatile SRAM with 2× reduction in store energy and 94× reduction in restore energy for frequent-off instant-on applications," in Proc. Symp. VLSI Technol., Jun. 2015, pp. 76–77.
[27] C. Peng et al., "Average 7T1R nonvolatile SRAM with R/W margin enhanced for low-power application," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 26, no. 3, pp. 584–588, Mar. 2018.
[28] Y. Sun et al., "Energy-efficient nonvolatile SRAM design based on resistive switching multi-level cells," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 66, no. 5, pp. 753–757, May 2019.
[29] X. Li et al., "Design of nonvolatile SRAM with ferroelectric FETs for energy-efficient backup and restore," IEEE Trans. Electron Devices, vol. 64, no. 7, pp. 3037–3040, Jul. 2017.
[30] X. Si et al., "A local computing cell and 6T SRAM-based computing-in-memory macro with 8-b MAC operation for edge AI chips," IEEE J. Solid-State Circuits, vol. 56, no. 9, pp. 2817–2831, Sep. 2021.
[31] Y.-C. Chiu et al., "A 4-Kb 1-to-8-bit configurable 6T SRAM-based computation-in-memory unit-macro for CNN-based AI edge processors," IEEE J. Solid-State Circuits, vol. 55, no. 10, pp. 2790–2801, Oct. 2020.
[32] Z. Jiang, S. Yin, J.-S. Seo, and M. Seok, "C3SRAM: An in-memory-computing SRAM macro based on robust capacitive coupling computing mechanism," IEEE J. Solid-State Circuits, vol. 55, no. 7, pp. 1888–1897, Jul. 2020.
[33] M. Ha, Y. Byun, S. Moon, Y. Lee, and S. Lee, "Layerwise buffer voltage scaling for energy-efficient convolutional neural network," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 40, no. 1, pp. 1–10, Jan. 2021.
[34] S. Ryu et al., "BitBlade: Energy-efficient variable bit-precision hardware accelerator for quantized neural networks," IEEE J. Solid-State Circuits, vol. 57, no. 6, pp. 1924–1935, Jun. 2022.
[35] Y. Long, E. Lee, D. Kim, and S. Mukhopadhyay, "Q-PIM: A genetic algorithm based flexible DNN quantization method and application to processing-in-memory platform," in Proc. 57th ACM/IEEE Design Autom. Conf. (DAC), Jul. 2020, pp. 1–6.
[36] J. H. Ko, D. Kim, T. Na, and S. Mukhopadhyay, "Design and analysis of a neural network inference engine based on adaptive weight compression," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 38, no. 1, pp. 109–121, Jan. 2019.
[37] F. Liu et al., "Improving neural network efficiency via post-training quantization with adaptive floating-point," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 5261–5270.
[38] P.-Y. Chen and S. Yu, "Compact modeling of RRAM devices and its applications in 1T1R and 1S1R array design," IEEE Trans. Electron Devices, vol. 62, no. 12, pp. 4022–4028, Dec. 2015.
[39] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, "TETRIS: Scalable and efficient neural network acceleration with 3D memory," SIGARCH Comput. Archit. News, vol. 45, no. 1, pp. 751–764, Apr. 2017.
[40] X. Dong, C. Xu, Y. Xie, and N. P. Jouppi, "NVSim: A circuit-level performance, energy, and area model for emerging nonvolatile memory," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 31, no. 7, pp. 994–1007, Jul. 2012.
[41] B. Zhang et al., "PIMCA: A programmable in-memory computing accelerator for energy-efficient DNN inference," IEEE J. Solid-State Circuits, vol. 58, no. 5, pp. 1436–1449, May 2023.
[42] R. Fackenthal et al., "A 16 Gb ReRAM with 200 MB/s write and 1 GB/s read in 27 nm technology," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2014, pp. 338–339.
[43] H. Abbas, A. Ali, J. Li, T. T. T. Tun, and D. S. Ang, "Forming-free, self-compliance WTe2-based conductive bridge RAM with highly uniform multilevel switching for high-density memory," IEEE Electron Device Lett., vol. 44, no. 2, pp. 253–256, Feb. 2023.
[44] X. Xu et al., "40× retention improvement by eliminating resistance relaxation with high temperature forming in 28 nm RRAM chip," in IEDM Tech. Dig., Dec. 2018, pp. 20.1.1–20.1.4.
[45] X. Xue et al., "A 0.13 µm 8 Mb logic-based CuxSiyO ReRAM with self-adaptive operation for yield enhancement and power reduction," IEEE J. Solid-State Circuits, vol. 48, no. 5, pp. 1315–1322, May 2013.
[46] M. Zhao et al., "Characterizing endurance degradation of incremental switching in analog RRAM for neuromorphic systems," in IEDM Tech. Dig., Dec. 2018, pp. 20.2.1–20.2.4.
[47] J. Yang et al., "A 28 nm 1.5 Mb embedded 1T2R RRAM with 14.8 Mb/mm2 using sneaking current suppression and compensation techniques," in Proc. IEEE Symp. VLSI Circuits, Jun. 2020, pp. 1–2.
[48] J.-H. Yoon, M. Chang, W.-S. Khwa, Y.-D. Chih, M.-F. Chang, and A. Raychowdhury, "A 40-nm, 64-kb, 56.67 TOPS/W voltage-sensing computing-in-memory/digital RRAM macro supporting iterative write with verification and online read-disturb detection," IEEE J. Solid-State Circuits, vol. 57, no. 1, pp. 68–79, Jan. 2022.

Songyuan Liu received the B.E. degree from the School of Electronic, Information and Electrical Engineering, Shanghai Jiao Tong University, in 2021. He is currently pursuing the M.E. degree with the Department of Micro-Nano Electronics, Shanghai Jiao Tong University. His research interests include system on chip and computing in memory.