A Logic-Compatible eDRAM Compute-In-Memory With Embedded ADCs for Processing Neural Networks

Abstract— A novel 4T2C ternary embedded DRAM (eDRAM) cell is proposed for computing a vector-matrix multiplication in the memory array. The proposed eDRAM-based compute-in-memory (CIM) architecture addresses the well-known Von Neumann bottleneck in the traditional computer architecture and improves both latency and energy in processing neural networks. The proposed ternary eDRAM cell takes a smaller area than prior SRAM-based bitcells using 6-12 transistors. Nevertheless, the compact eDRAM cell stores a ternary state (−1, 0, or +1), while the SRAM bitcells can only store a binary state. We also present a method to mitigate the compute accuracy degradation due to device mismatches and variations. Besides, we extend the eDRAM cell retention time to 200µs by adding a custom metal capacitor at the storage node. With the improved retention time, the overall energy consumption of the eDRAM macro, including a regular refresh operation, is lower than that of most prior SRAM-based CIM macros. A 128×128 ternary eDRAM macro computes a vector-matrix multiplication between a vector with 64 binary inputs and a matrix with 64×128 ternary weights. Hence, 128 outputs are generated in parallel. Note that both weight and input bit-precisions are programmable for supporting a wide range of edge computing applications with different performance requirements. The bit-precisions are readily tunable by assigning a variable number of eDRAM cells per weight or applying multiple pulses per input. An embedded column ADC based on replica cells sweeps the reference level for 2^N − 1 cycles and converts the analog accumulated bitline voltage to a 1-5bit digital output. A critical bitline accumulate operation is simulated (Monte-Carlo, 3K runs). It shows a standard deviation of 2.84%, which could degrade the classification accuracy on the MNIST dataset by 0.6% and on the CIFAR-10 dataset by 1.3% versus a baseline with no variation. The simulated energy is 1.81fJ/operation, and the energy efficiency is 552.5-17.8TOPS/W (for 1-5bit ADC) at 200MHz using 65nm technology.

Index Terms— Embedded DRAM, compute-in-memory, hardware accelerator, current-mode, vector-matrix multiplication, SRAM.

I. INTRODUCTION

THE Von Neumann architecture has been applied to most electronic devices since it was first introduced in 1945. The central concept of this architecture is the separation of memory from the central processing unit (CPU). In general, a traditional computer consists of three separate parts: an arithmetic logic unit (ALU), a control unit, and memory. A typical compute operation is performed in three stages as follows. First, both instructions and data stored in memory are transferred from memory to the ALU before the compute begins. Second, the fetched instructions and data are stored in temporary registers and used for computation in the ALU. Finally, the computed results are sent back to memory. In recent years, the fundamental limits of the conventional Von Neumann architecture have been brought into the spotlight. Memory access usually dominates the entire energy consumption of modern microprocessors, and the limited communication bandwidth limits the compute performance in both throughput and latency. Due to such limitations, the Von Neumann architecture is no longer the best choice, especially for processing artificial deep neural networks (DNNs) in resource-constrained mobile edge computing devices.

One of the alternative architectures that could significantly improve the performance in processing DNNs is a compute-in-memory (CIM) architecture. As shown in Fig. 1, the in-memory architecture enables the memory to compute essential functions by embedding them in its macro. As a result, we can minimize both energy consumption and compute latency by eliminating a large portion of energy-hungry data communications between the ALU and memory. Besides, massively parallel computations in the large memory array maximize throughput and fully utilize memory capacity. In conclusion, DNN processing in mobile devices becomes much faster and more energy-efficient by adopting the CIM architecture.

Fig. 2(a) illustrates a brain-inspired neuron which computes a dot-product between 'n' pairs of inputs and weights. The dot-product is followed by a nonlinear activation, which generates an output. A CIM macro with embedded brain-inspired neurons is shown in Fig. 2(b). The macro consists of

Manuscript received June 6, 2020; revised October 13, 2020; accepted November 1, 2020. Date of publication November 16, 2020; date of current version January 12, 2021. This work was supported by the Singapore government's Research, Innovation and Enterprise 2020 Plan, Advanced Manufacturing and Engineering Domain, under Grant A1687b0033. This article was recommended by Associate Editor M.-F. Chang. (Corresponding author: Bongjin Kim.)

Chengshuo Yu is with the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798, and also with the Institute of Microelectronics, A*STAR, Singapore 138634.

Taegeun Yoo, Hyunjoon Kim, Tony Tae-Hyoung Kim, and Bongjin Kim are with the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798 (e-mail: [email protected]).

Kevin Chai Tshun Chuan is with the Institute of Microelectronics, A*STAR, Singapore 138634.

Color versions of one or more of the figures in this article are available online at https://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCSI.2020.3036209

1549-8328 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.
668 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 68, NO. 2, FEBRUARY 2021
YU et al.: LOGIC-COMPATIBLE eDRAM COMPUTE-IN-MEMORY WITH EMBEDDED ADCs 669
TABLE I
COMPARISON OF EMBEDDED MEMORY CANDIDATES FOR COMPUTE-IN-MEMORY MACRO

The rest of the paper is organized as follows. Section II introduces the proposed ternary eDRAM cell with its basic operations. Section III presents the reconfigurability of the CIM macro based on the proposed eDRAM cells. The design challenges are discussed in Section IV. Section V introduces the embedded ADC and the offset-calibration approach. Section VI describes the overall architecture and simulation results, followed by a conclusion in Section VII.

II. eDRAM COMPUTE-IN-MEMORY MACRO

A. Proposed 4T2C Ternary eDRAM Cell

In this work, we propose an embedded DRAM (eDRAM) based CIM macro for the first time. Different styles of eDRAM bitcells have been introduced, as shown in Table I. Barth et al. [13] presented a 1T1C bitcell. The highest integration density and the low power consumption of the 1T1C structure make it a promising bitcell candidate for an eDRAM-based CIM. However, the compact 1T1C bitcell shares a bitline for reading/writing a storage capacitor, and hence suffers from a write disturbance issue. The dynamic nature of the eDRAM bitcell necessitates a regular refresh operation with associated energy consumption. Besides, it requires low-leakage access transistors and a deep trench capacitor, which is not available in a generic logic process. Chun et al. [14], [15] have proposed gain-cell-based eDRAM macros based on compact 3T [14], 2T [15], and 2T1C [16] bitcell structures using only a standard logic process. The logic-compatible eDRAM bitcells decouple read and write operations, and the manufacturing cost is lower than that of the 1T1C eDRAM [13]. The memory density is higher than 6T SRAM but lower than 1T1C eDRAM. A logic-compatible 2T eDRAM bitcell circuit comprising a PMOS and an NMOS transistor is shown in Fig. 3. The PMOS transistor is used for accessing the internal storage node through the write bitline (WBL) when its gate node (WWL) voltage is low. The NMOS transistor works as a storage capacitor as well as a read access transistor to read the stored value when the source (RWL) node voltage is low. In this work, we use a pair of 2T eDRAM bitcells as a ternary-weight compute-in-memory unit. Each eDRAM bitcell on the left and the right works as a current discharging unit, and contributes to a finite voltage drop in the read bitlines (RBLL/RBLR) when a negative short pulse is applied to RWL and the stored weight is high. Note that the eDRAM bitcell decouples read and write operations, and hence is free from the write disturbance issue. Besides, the proposed eDRAM cell stores a ternary state (−1, 0, or +1) using only four transistors, while a standard SRAM bitcell shares a read/write bitline and stores a binary state using six transistors.

B. Compute-In-Memory Multiplication

Fig. 4 describes a compute-in-memory dot-product operation using the proposed 4T2C ternary eDRAM cell. Before computing, the two read bitlines in the middle are pre-charged to a high voltage. Fig. 4(a), top, shows the circuit of the eDRAM cell. A single cell consists of four transistors and six control lines: two pairs of write and read bitlines, a write wordline, and a read wordline. The stored weights on the left/right bitcells are WL and WR. The table in Fig. 4(a) summarizes a default ternary-weight binary-input multiply operation in a cell. Here, note that the high or low voltage is denoted by 'H'/'L', while ternary state values are represented by '−1', '0', and '1'. A ternary weight is stored in an eDRAM cell as −1 (WL = H, WR = L), 0 (WL = L, WR = L), or +1 (WL = L, WR = H). A binary input is represented by a transient voltage at an RWL node. The input is '0' when RWL is 'H' and is '1' when a negative short pulse is applied to the RWL node. After a cycle of operation, the ternary-weight binary-input multiplication result is accumulated on a differential read bitline (RBLL/RBLR) as a voltage difference (−V, 0, or +V) based on weight and input combinations, as shown in Fig. 4(a).

Fig. 4(b) describes how the proposed 4T2C ternary eDRAM cell operates for each weight (W) and input (X) combination. When the RWL node is high (i.e., the input is 0), no current flows through RBLL/RBLR, and hence there is 'no change' in the voltage difference between the two read bitlines, as shown in all three diagrams in Fig. 4(b), left. When a negative pulse is applied to RWL, either RBLL or RBLR is discharged through an NMOS read access transistor on the left or right bitcell when the storage node voltage is 'H'. As a result, the voltage difference between RBLL and RBLR decreases or increases by 'V', as highlighted in Fig. 4(b), right. When both weight
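The ternary-weight binary-input multiply described above can be captured in a short behavioral model. This is a sketch under the paper's truth-table description, not the transistor-level behavior; the function names are illustrative, and the unit discharge step 'V' is normalized to 1.

```python
# Behavioral model of the 4T2C ternary eDRAM multiply.
# A cell stores a ternary weight via two storage nodes (left/right bitcell);
# a binary input X=1 is a negative RWL pulse that lets the 'H'-side bitcell
# discharge its read bitline by one unit step.

def encode_weight(w):
    """Map a ternary weight to (WL, WR) storage-node levels ('H'/'L')."""
    return {-1: ('H', 'L'), 0: ('L', 'L'), +1: ('L', 'H')}[w]

def cell_multiply(w, x):
    """Change in the differential bitline voltage (RBLL - RBLR), in unit
    steps, for one weight/input combination; equals w * x."""
    if x == 0:                       # RWL held high: no discharge path
        return 0
    wl, wr = encode_weight(w)
    d_rbll = -1 if wl == 'H' else 0  # left bitcell discharges RBLL
    d_rblr = -1 if wr == 'H' else 0  # right bitcell discharges RBLR
    return d_rbll - d_rblr           # -1, 0, or +1

def column_accumulate(weights, inputs):
    """Bitline accumulation over a column: sum of per-cell products."""
    return sum(cell_multiply(w, x) for w, x in zip(weights, inputs))
```

In one cycle, every activated cell on a column contributes its ±V (or 0) step to the shared differential bitline, which is how the dot-product emerges as an analog voltage difference.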
III. RECONFIGURABILITY
The weight and input bit-precisions of the proposed eDRAM macro can be reconfigured from ternary/binary to higher bit-precisions for processing DNNs with more stringent accuracy requirements. Fig. 6 describes how we reconfigure the input and weight bit-precisions using RWL input pulses and the number of eDRAM cells per weight. Fig. 6(a) shows the programmable input precision based on the number of short pulse cycles per input. A pulse train instead of a single pulse can be applied to an RWL node to increase the number of input levels. Note that the number of pulses corresponds to the number of input levels, and it can be increased as long as the accumulated voltage level does not exceed the limited operating range of the read bitlines. As for the implementation of multi-level weights, we can group multiple eDRAM cells to represent a weight. For instance, two or three ternary eDRAM cells are combined to work as a 5- or 7-level (−2 to 2 or −3 to 3) weight storage, as shown in Fig. 6(b). Note that each eDRAM cell adds two more levels, and the RWL node has to be shorted between cells.

Fig. 6. Reconfigurable (a) input and (b) weight precisions using multiple RWL pulses and ternary eDRAM cells to represent an input or a weight.
Fig. 7 shows an example of a reconfigured column with ternary eDRAM cells for a dot-product between 'n' pairs of multi-level inputs and weights. A group comprising four eDRAM cells represents a 9-level (−4 to 4) weight, and a shared RWL with three negative pulse cycles represents a 4-level (0 to 3) input. Fig. 8 shows a reconfigured column with 5-level weights and 3-level inputs, as well as the timing diagrams of essential signals and the simulated waveforms. The pre-charged RBL voltage (0.7 V) drops twice, once for each of the two negative short pulses, and the level of each voltage drop is proportional to the number of enabled eDRAM bitcells, which is swept from 0 to 64, as shown in Fig. 8(c). Note that the RBL voltage levels settled after two cycles of the bitline accumulate operation still do not exceed the predefined dynamic range (∼250mV), which ensures linearity.
Fig. 7. An example of a reconfigured column with 4T2C ternary eDRAM cells: 4-level inputs and 9-level weights.
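The precision reconfiguration in Section III can be sketched as follows: a multi-level weight is the sum of k grouped ternary cells sharing one RWL, and a multi-level input is the number of negative RWL pulses applied. The function name and the unit-step normalization are illustrative assumptions.

```python
# Sketch of multi-precision dot-product via cell grouping and pulse trains:
# k ternary cells per weight give a (2k+1)-level weight; p pulses per input
# give (p+1) input levels (0..p). Accumulation is in normalized unit steps.

def multilevel_dot(grouped_weights, pulse_counts, cells_per_weight):
    """grouped_weights: per-weight list of ternary cell states (-1/0/+1 each).
    pulse_counts: per-input number of negative RWL pulses (0..max)."""
    acc = 0
    for cells, pulses in zip(grouped_weights, pulse_counts):
        assert len(cells) == cells_per_weight
        w = sum(cells)      # grouped cells sum into one multi-level weight
        acc += w * pulses   # each pulse adds one unit discharge per 'H' cell
    return acc

# Example: two cells per weight -> 5-level weights (-2..+2); two pulses per
# input -> 3-level inputs (0..2), as in the Fig. 8 configuration.
print(multilevel_dot([[+1, +1], [-1, 0]], [2, 1], cells_per_weight=2))  # 2*2 + (-1)*1 = 3
```

The accumulated result stays linear only while the total discharge remains inside the read-bitline operating range, which is why the text bounds the pulse count and group size.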
IV. DESIGN CHALLENGES
One of the critical design challenges of the eDRAM macro is the short retention time of its dynamic memory cell, which has limited storage capacitance and substantial leakage currents. The situation becomes more severe in the eDRAM macro with 2T gain cells, since it relies on a small (<1fF) gate and parasitic capacitance, while a 1T1C structure utilizes a dedicated storage capacitor such as a 20fF deep trench capacitor [13]. Various approaches [17]–[22] have been introduced to make the retention time longer while maintaining a high integration density. Cho et al. [17] applied different refresh cycle times to different memory blocks based on the priority of data to optimize the refresh operation and the associated energy. Kazimirsky et al. [18] presented an algorithm that schedules the refresh operations to happen only during the unoccupied time. Tikekar et al. [19] proposed an on-demand power-up scheme that can minimize the excessive power from eDRAM access. Park et al. [20] and Choi et al. [21] used eDRAM as a temporary buffer.

The 2T eDRAM bitcell (i.e., a half-circuit of the proposed 4T2C ternary eDRAM cell) was originally developed [15] to increase its retention time by eliminating the pulling-down gate leakage of the NMOS access transistor, as shown in Fig. 9, left. Note that the NMOS transistor is off when the bitcell is not accessed, and hence its gate leakage is negligible. However, the retention time of the eDRAM bitcell in a CIM macro has to be redefined, since insufficient retention not only flips the stored data but also degrades the computation accuracy in the bitline accumulation. A unit discharging current is directly affected by the stored voltage in
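The retention-time trade-off discussed above follows a first-order relation: the storage node holds its value until leakage discharges it by the maximum tolerable droop, so t_ret ≈ C_storage · ΔV_max / I_leak. The leakage current and tolerable droop used below are illustrative assumptions, not values reported in the paper; only the capacitance contrast (<1fF gain cell vs. a larger added metal capacitor) comes from the text.

```python
# First-order retention-time estimate: t_ret = C * dV_max / I_leak.
# Retention scales linearly with storage capacitance, which is why adding
# a metal capacitor at the storage node extends it.

def retention_time(c_storage_f, dv_max_v, i_leak_a):
    """Retention time (s) of a storage cap discharged by a constant leakage."""
    return c_storage_f * dv_max_v / i_leak_a

I_LEAK = 50e-15   # 50 fA assumed off-state leakage (hypothetical)
DV_MAX = 0.1      # 100 mV tolerable storage-node droop (hypothetical)

t_gate_only = retention_time(0.5e-15, DV_MAX, I_LEAK)   # ~0.5 fF 2T gain cell
t_with_cap  = retention_time(100e-15, DV_MAX, I_LEAK)   # with added metal cap
```

Whatever the absolute leakage, the ratio of the two estimates equals the ratio of the capacitances, which is the lever the custom metal capacitor exploits.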
Fig. 9. Leakage current and retention time theory of eDRAM; the impact of retention time on computation.

Fig. 10. A retention time comparison with 10K Monte Carlo iterations. 'P' represents a write access PMOS and 'N' represents a read access NMOS.
Fig. 12. Retention time comparison with three different metal capacitor layouts: (a) schematic only, (b) covered with grounded M4-M7, (c) custom MOMCAP.

Fig. 14. Timing diagram of the proposed column ADC reference with two different dot-product results, 17 and 25.
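The embedded column ADC referenced in the Fig. 14 caption follows the behavior stated in the abstract: replica cells sweep the reference level for 2^N − 1 cycles, and the accumulated bitline voltage is compared against it each cycle. The sketch below is a behavioral model of that reference-sweep conversion; the voltage normalization (one LSB per reference step) is an illustrative assumption.

```python
# Behavioral sketch of the 2**N - 1 cycle reference-sweep conversion:
# each cycle the replica cells step the reference by one LSB and a
# comparator checks it against the accumulated bitline voltage.

def column_adc(v_accum, n_bits, v_lsb=1.0):
    """Convert v_accum to an n_bits code by sweeping 2**n_bits - 1 levels."""
    code = 0
    for step in range(1, 2 ** n_bits):  # 2^N - 1 compare cycles
        v_ref = step * v_lsb            # replica cells step the reference
        if v_accum >= v_ref:
            code += 1                   # count the levels the input exceeds
    return code

print(column_adc(17.0, 5))  # a 5-bit conversion of dot-product level 17 -> 17
```

The 1-5bit reconfigurability falls out naturally: a smaller N simply truncates the sweep to fewer cycles (1 cycle for 1 bit, 31 for 5 bits).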
Fig. 20. Size and density comparison of 16 kb CIM macro layout between 8T SRAM [6] and the proposed 4T2C eDRAM.
TABLE II
SUMMARY OF VGG-LIKE CNN MODEL
TABLE III
SUMMARY OF MNIST/CIFAR-10 CLASSIFICATION RESULTS
TABLE IV
PERFORMANCE COMPARISON WITH STATE-OF-THE-ART
Fig. 20 shows a comparison between the 8T SRAM [6] and the proposed 4T2C ternary eDRAM-based macro and cell layouts. The SRAM-based CIM macro performs a similar bitline accumulate operation using two extra transistors for decoupling read/write on top of the standard 6T SRAM bitcell. Hence, it can store a binary weight and is free from the retention time issue. However, the size of its bitcell is significantly larger than our 4T2C eDRAM cell. As shown in Fig. 19, the proposed ternary 4T2C eDRAM cell occupies only 0.32× the area of the SRAM bitcell. The eDRAM memory density is 0.866Mb/mm2, which is 2.97× higher than the 8T SRAM. Note that the layouts for both the 8T SRAM and the 4T2C eDRAM follow the same standard logic design rules.

Fig. 21 shows the bitline accumulation transfer characteristic based on 3K runs of Monte-Carlo simulation using a column of 128× eDRAM cells. For verifying both the linearity and the variation of the proposed eDRAM-based bitline accumulation, we swept dot-product results from −128 to +128 and measured the bitline voltage difference. Based on the simulation result, we achieved standard deviations of 13.7mV, 14.2mV, and 13.6mV, and mean accumulated bitline voltage differences of −121.9mV, 0V, and 121.4mV when the dot-product result is −64, 0, and +64, respectively.

The impact of process variations on the bitline accumulation has been assessed by measuring image classification accuracy using two different neural network configurations and datasets. The Vdiff std-dev/full-range of this design (i.e., 2.84%) is from the Monte-Carlo simulation with PVT and mismatch information, as shown in Fig. 21. To show the trend of accuracy with Vdiff std-dev/full-range, we sweep the Vdiff std-dev/full-range from 0 to 10.5%. In the testing phase, we introduce noise obeying a normal distribution with different variations in each layer between the convolution and batch normalization layers, so that the error/variation generated within the hardware presents itself as a random noise distribution at the output. The red curve of Fig. 22 shows the simulated classification accuracy versus the worst-case standard deviation (std-dev) of the bitline voltage difference (Vdiff) using a multi-layer perceptron (MLP) network with two hidden layers (784-256-64-10) and the MNIST dataset. The blue curve of Fig. 22 shows the simulated classification accuracy using a VGG-like convolutional neural network (CNN) with six convolution layers and three fully-connected (FC) layers with the CIFAR-10 dataset. These two curves show a similar decreasing trend with increasing variation. Fig. 23 shows the classification accuracy of binary and ternary weights with the swept variation (i.e., 0% to 10.5%). The simulated results clearly illustrate that the ternary weight provides better accuracy under all variation conditions. Table II summarizes detailed configurations of the simulated VGG-like CNN model. A binarized neural network [25] is used for training and testing the MLP and CNN models. The simulated MNIST classification accuracy initially degrades slower than the CIFAR-10 results when the std-dev is low, and then it decreases much quicker as the std-dev exceeds 4% of the full dynamic range. Based on the simulated worst-case variation (i.e., σ = 2.84%) in Fig. 20, the estimated classification accuracies are 96.78% for MNIST using the MLP and 82.8% for CIFAR-10 using the VGG-like CNN model. The results show 0.6% and 1.3% accuracy degradations, respectively, from the baseline results with no variation. Table III summarizes the simulated MNIST/CIFAR-10 classification results.

Table IV summarizes a performance comparison between the proposed 4T2C eDRAM macro and prior CIM macros based on 6-to-12T SRAM bitcells. Note that the proposed work realizes a compact DRAM-based CIM macro for the first time. Compared to the previous SRAM-based works, eDRAM offers unique advantages, including the inherently decoupled read/write ports and the high memory density.
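The variation-injection method described above can be sketched directly: during testing, Gaussian noise with a standard deviation equal to a chosen fraction of the bitline full range is added to each ideal dot-product output, modeling the hardware mismatch as random output noise. Network and dataset code is omitted; the 500 mV full-range figure is the one quoted in the conclusion, and the function name is an assumption.

```python
# Sketch of the variation-injection evaluation: add N(0, (f*FR)^2) noise
# to the ideal bitline voltages, where f is the swept std-dev/full-range
# fraction (0 to 10.5% in the paper) and FR is the bitline full range.

import numpy as np

FULL_RANGE_MV = 500.0

def inject_bitline_noise(dot_products_mv, std_frac, rng):
    """Add Gaussian noise with sigma = std_frac * full range (in mV)."""
    sigma = std_frac * FULL_RANGE_MV  # e.g., 0.0284 -> 14.2 mV
    return dot_products_mv + rng.normal(0.0, sigma, size=dot_products_mv.shape)

rng = np.random.default_rng(0)
ideal = np.array([121.4, 0.0, -121.9])  # mean levels reported for Fig. 21
noisy = inject_bitline_noise(ideal, 0.0284, rng)
```

Sweeping std_frac and re-evaluating classification accuracy reproduces the kind of accuracy-vs-variation curves shown in Fig. 22.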
Besides, the proposed eDRAM macro is highly reconfigurable in terms of bit-precisions, and it embeds an ADC per column-based neuron.

VII. CONCLUSION

In this paper, we presented a novel 4T2C ternary eDRAM cell for energy-efficient processing of DNNs. The proposed eDRAM cell consists of a pair of asymmetric 2T1C eDRAM bitcells with decoupled read and write ports. Hence, one of the critical issues (i.e., write disturbance) in analog compute-in-memory has been eliminated. The compact eDRAM cell stores a ternary weight and occupies 0.32× the area of (i.e., has 2.97× higher memory density than) the 8T SRAM-based bitcell, which can only store a binary state. The read wordline (RWL) is used as an input port, and it can be reconfigured by programming the number of short pulses in a pulse train. To address the short retention time issue of eDRAM, we added custom metal capacitors on the internal storage node without increasing the cell size. We embedded column ADCs [6] in each column of our eDRAM macro to address the ADC overhead issue, one of the critical challenges in analog CIM macro design. The proposed ADC performs a reconfigurable 1-5bit conversion, and it takes 1-31 conversion cycles (i.e., 2^N − 1 cycles for N-bit). A column of 128× eDRAM cells is composed of 64× cells for dot-product, 32× for ADC reference, and 32× for calibration. A Monte-Carlo simulation result demonstrates both high linearity and reasonable variation in the bitline accumulate operation. The simulated worst-case standard deviation is 14.2 mV, which is 2.84% of the full dynamic range of 500mV. The simulated MNIST classification using a three-layer MLP (784-256-64-10) architecture results in an estimated accuracy of 96.78%, which is 0.6% lower than the baseline with no process variation. The simulated CIFAR-10 classification using a VGG-like CNN model results in an estimated 82.8% accuracy, which is only 1.3% lower than the baseline accuracy. The proposed ternary eDRAM cell presents the smallest area (i.e., 1.08μm2) and the highest density among the published CIM bitcells using 65nm technology.

REFERENCES

[1] J. Zhang, Z. Wang, and N. Verma, "A machine-learning classifier implemented in a standard 6T SRAM array," in Proc. IEEE Symp. VLSI Circuits (VLSI-Circuits), Jun. 2016, pp. 1–2.
[2] H. Valavi, P. J. Ramadge, E. Nestler, and N. Verma, "A 64-tile 2.4-Mb in-memory-computing CNN accelerator employing charge-domain compute," IEEE J. Solid-State Circuits, vol. 54, no. 6, pp. 1789–1799, Jun. 2019.
[3] A. Biswas and A. P. Chandrakasan, "Conv-RAM: An energy-efficient SRAM with embedded convolution computation for low-power CNN-based machine learning applications," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2018, pp. 488–490.
[4] S. Yin, Z. Jiang, J.-S. Seo, and M. Seok, "XNOR-SRAM: In-memory computing SRAM macro for binary/ternary deep neural networks," IEEE J. Solid-State Circuits, vol. 55, no. 6, pp. 1733–1743, Jun. 2020.
[5] H. Kim, Q. Chen, and B. Kim, "A 16K SRAM-based mixed-signal in-memory computing macro featuring voltage-mode accumulator and row-by-row ADC," in Proc. IEEE Asian Solid-State Circuits Conf. (A-SSCC), Nov. 2019, pp. 35–36.
[6] C. Yu, T. Yoo, T. T.-H. Kim, K. C. T. Chuan, and B. Kim, "A 16K current-based 8T SRAM compute-in-memory macro with decoupled read/write and 1-5bit column ADC," in Proc. IEEE Custom Integr. Circuits Conf. (CICC), Mar. 2020, pp. 1–4.
[7] J. Kim et al., "Area-efficient and variation-tolerant in-memory BNN computing using 6T SRAM array," in Proc. Symp. VLSI Circuits (SOVC), Jun. 2019, pp. C118–C119.
[8] M. Kang, S. K. Gonugondla, A. Patil, and N. Shanbhag, "A 481 pJ/decision 3.4 M decision/s multifunctional deep in-memory inference processor using standard 6T SRAM array," Oct. 2016, arXiv:1610.07501. [Online]. Available: https://arxiv.org/abs/1610.07501
[9] M. Kang, S. K. Gonugondla, A. Patil, and N. R. Shanbhag, "A multi-functional in-memory inference processor using a standard 6T SRAM array," IEEE J. Solid-State Circuits, vol. 53, no. 2, pp. 642–655, Feb. 2018.
[10] J.-W. Su et al., "15.2 A 28 nm 64Kb inference-training two-way transpose multibit 6T SRAM compute-in-memory macro for AI edge chips," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2020, pp. 240–242.
[11] W.-H. Chen et al., "A 65 nm 1Mb nonvolatile computing-in-memory ReRAM macro with sub-16ns multiply-and-accumulate for binary DNN AI edge processors," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2018, pp. 494–496.
[12] C.-X. Xue et al., "24.1 A 1Mb multibit ReRAM computing-in-memory macro with 14.6ns parallel MAC computing time for CNN based AI edge processors," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2019, pp. 388–390.
[13] J. Barth et al., "A 500 MHz random cycle, 1.5 ns latency, SOI embedded DRAM macro featuring a three-transistor micro sense amplifier," IEEE J. Solid-State Circuits, vol. 43, no. 1, pp. 86–95, Jan. 2008.
[14] K. C. Chun, P. Jain, J. H. Lee, and C. H. Kim, "A 3T gain cell embedded DRAM utilizing preferential boosting for high density and low power on-die caches," IEEE J. Solid-State Circuits, vol. 46, no. 6, pp. 1495–1505, Jun. 2011.
[15] K. C. Chun, P. Jain, T.-H. Kim, and C. H. Kim, "A 667 MHz logic-compatible embedded DRAM featuring an asymmetric 2T gain cell for high speed on-die caches," IEEE J. Solid-State Circuits, vol. 47, no. 2, pp. 547–559, Feb. 2012.
[16] K. Chun, W. Zhang, P. Jain, and C. H. Kim, "A 700 MHz 2T1C embedded DRAM macro in a generic logic process with no boosted supplies," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2011, pp. 272–273.
[17] K. Cho, Y. Lee, Y. H. Oh, G.-C. Hwang, and J. W. Lee, "eDRAM-based tiered-reliability memory with applications to low-power frame buffers," in Proc. Int. Symp. Low Power Electron. Design (ISLPED), Aug. 2014, pp. 333–338.
[18] A. Kazimirsky and S. Wimer, "Opportunistic refreshing algorithm for eDRAM memories," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 63, no. 11, pp. 1921–1932, Nov. 2016.
[19] M. Tikekar, V. Sze, and A. P. Chandrakasan, "A fully integrated energy-efficient H.265/HEVC decoder with eDRAM for wearable devices," IEEE J. Solid-State Circuits, vol. 53, no. 8, pp. 2368–2377, Aug. 2018.
[20] Y. S. Park, D. Blaauw, D. Sylvester, and Z. Zhang, "Low-power high-throughput LDPC decoder using non-refresh embedded DRAM," IEEE J. Solid-State Circuits, vol. 49, no. 3, pp. 783–794, Mar. 2014.
[21] W. Choi, G. Kang, and J. Park, "A refresh-less eDRAM macro with embedded voltage reference and selective read for an area and power efficient Viterbi decoder," IEEE J. Solid-State Circuits, vol. 50, no. 10, pp. 2451–2462, Oct. 2015.
[22] F. Tu, W. Wu, S. Yin, L. Liu, and S. Wei, "RANA: Towards efficient neural acceleration with refresh-optimized embedded DRAM," in Proc. ACM/IEEE 45th Annu. Int. Symp. Comput. Archit. (ISCA), Jun. 2018, pp. 340–352.
[23] T. Yoo, H. Kim, Q. Chen, T. T.-H. Kim, and B. Kim, "A logic compatible 4T dual embedded DRAM array for in-memory computation of deep neural networks," in Proc. IEEE/ACM Int. Symp. Low Power Electron. Design (ISLPED), Jul. 2019, pp. 1–6.
[24] M. Ichihashi, H. Toda, Y. Itoh, and K. Ishibashi, "0.5 V asymmetric three-Tr. cell (ATC) DRAM using 90nm generic CMOS logic process," in Dig. Tech. Papers, Symp. VLSI Circuits, Jun. 2005, pp. 366–369.
[25] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks," in Proc. 30th Conf. Neural Inf. Process. Syst. (NIPS), 2016, pp. 4107–4115.
[26] K. C. Chun, W. Zhang, P. Jain, and C. H. Kim, "A 2T1C embedded DRAM macro with no boosted supplies featuring a 7T SRAM based repair and a cell storage monitor," IEEE J. Solid-State Circuits, vol. 47, no. 10, pp. 2517–2526, Oct. 2012.
[27] Y. Zha, E. Nowak, and J. Li, "Liquid silicon: A nonvolatile fully programmable processing-in-memory processor with monolithically integrated ReRAM," IEEE J. Solid-State Circuits, vol. 55, no. 4, pp. 908–919, Apr. 2020.
Chengshuo Yu (Graduate Student Member, IEEE) received the B.S. degree in electronic engineering from Feng Chia University, Taiwan, in 2019. He is currently pursuing the Ph.D. degree with the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore. His research interests include in-memory computing and time-domain hardware accelerators.

Minnesota in 2008, the DAC/ISSCC Student Design Contest Award in 2008, the Samsung Humantech Thesis Award in 2008, 2001, and 1999, and the ETRI Journal Paper of the Year Award in 2005. He was the Chair of the IEEE Solid-State Circuits Society Singapore Chapter. He has served on numerous conferences as a Committee Member. He also serves as an Associate Editor for the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, IEEE ACCESS, and the IEIE Journal of Semiconductor Technology and Science.