
eDRAM-OESP: A novel performance efficient in-embedded-DRAM-compute design for on-edge signal processing application

Abstract—In-Memory-Computing (IMC) architectures allow arithmetic and logical functionalities around the memory arrays to effectively use the memory bandwidth and avoid frequent data movement to the processor. As expected, the IMC architecture leads to high throughput and significant energy savings, primarily because less work is spent moving data from memory to the computing core. Embedded DRAM (eDRAM), composed of a 1-transistor, 1-capacitor (1T1C) bit cell with a logic block, enables computing with benefits in terms of power savings and high performance, favorable for embedded computing engines. This work proposes a novel in-eDRAM-compute design employing a 1T1C eDRAM cell with bit-serial computation that targets 3x throughput efficiency by arranging the operand bits in an interleaved manner. The interleaved eDRAM architecture enables reading the corresponding bits of multiple operands from the memory cells at the same time and writing the result back within the same activate window, thereby saving multiple precharge and activate cycles. Additionally, the interleaved architecture allows pipelined processing of the continuously arriving digitized signal. The computing block, in the form of a 1-bit adder with a multiplexer unit, is optimized for different hardware metrics such as delay, power, and the product of power and delay (PDP) so that the design can be adopted per the specifications. The eDRAM-based computing design is evaluated for a 1-bit adder and further characterized for 8-bit and 16-bit adders, multipliers, and 1-D convolution with varying filter sizes. The proposed design improves computing time by 31% for 16-bit addition and 30.6% for 8-bit addition over the existing state-of-the-art work. The bit-serial in-eDRAM-compute design achieved a best performance of 2.5 ms of computing time and 120 nJ of energy for a 1-D convolution operation. The in-eDRAM-compute design is a step towards embedded memory with convolutional neural network (CNN) compute capability for customized real-time edge inferencing applications.

Index Terms—In-memory-computing, embedded DRAM, Adder, Multiplier, Convolution, Signal filtering

I. INTRODUCTION

Traditional computing technologies lack the ability to process large-scale data at a fast rate due to the bottleneck of retrieving data from memory and supplying it to the processor [1]. Many engineering designs have attempted to pre-store data in caches and other memory blocks closer to the processor to produce results at an efficient rate. Memory hierarchies based on placement and accessibility to the processor have been created, and continuous attempts to redesign the architecture and evolve hybrid memory architectures have targeted energy-efficient computing systems [2]. However, design innovations around the traditional architectural plan have not delivered the required benefits [3]. Hence, novel design approaches have emerged, especially for catering to data computing demands at all levels.

In the past, bit-wise Boolean functions [4], arithmetic operations [5], dot-product computations [6], and binary neural networks [6], [7] have been realized in SRAM cells. Several approaches to compute-in-DRAM have also been reported in the recent past [8]-[12]. Traditionally, large DRAM banks are applicable only in data centers, and hence compute-in-DRAM designs are targeted toward data-center requirements rather than toward edge-inferencing designs, where memory availability is tight.

Embedded DRAM (eDRAM) is adopted for edge-inferencing applications, where the processor cores retrieve real-time data from eDRAM and generate the results [13]. Circuits in the form of 1T and 1C allow dense information to be stored in a compact space, besides benefiting from minimal leakage and power dissipation. Owing to these benefits, several hardware accelerators [14], [15] and high-performance servers [13], [16] have adopted eDRAM as on-chip memory units. However, a short retention time demands that eDRAM data be refreshed frequently, which dilutes the power gained from the compact sizing. Additionally, the error-tolerant nature of neural networks covers the retention failures and enables reliable in-memory compute inference [17].

This paper presents a unique proposition of tripling the throughput of bit-serial in-eDRAM-computing by arranging the operands in an interleaved configuration. Bit-serial computing is often neglected owing to its inefficient latency or clock-cycle count; however, it presents enormous benefits in terms of silicon design space and throughput, which makes it highly suitable for a customized on-edge signal processing system. Embedded systems are generally short on memory, which demands processing data or signals as they arrive in real time, without the luxury of storing and then processing. Hence, bit-serial in-eDRAM-compute is an ideal candidate to satisfy edge-inferencing needs. In the proposed architecture, the continuous stream of digitized signals is computed on arrival in the eDRAM memory cells, and hence the drawback of data retention is avoided. Modern real-time inferencing of signals involves compounded arithmetic operations, including convolution, and hence the proposed triple-throughput bit-serial in-eDRAM-computing is structured to perform the 1-D convolution operation on the signal with different kernel sizes. The paper establishes the transistor-level design and characterizes the results for various hardware parameters. Additionally, most in-memory-computing work does not optimize the design and retains the transistor sizing in the compute block at the smallest possible 2:1 ratio to achieve compact spacing. However, this approach results in inefficient computing with delayed output. Traditionally, memory designs are kept at the lowest possible transistor sizing primarily to maximize a large array of memory banks in a given space and to retain the data for the most part, with infrequent data access. Hence, the compute block, when integrated with the memory banks, has followed a similar sizing format, which has not met the efficiency requirements of computation. This paper optimizes the compute blocks using the particle-swarm-optimization (PSO) method and presents the computational time and power benefits over the existing 2:1 sizing. To the authors' knowledge, this is the first work on eDRAM with a focus on bit-serial computation. An optimized, throughput-efficient bit-serial in-eDRAM computing design is presented, with hardware characteristics for addition, multiplication, and 1-D convolution operations over various filter sizes. The proposed in-eDRAM-compute design is targeted toward efficient and reliable real-time signal processing applications.

II. PROPOSED DESIGN

A. Architecture

The traditional arrangement of bit-serial and bit-parallel data access in memory is shown in Figure 1, where A1, A2, A3 represent the digitized samples and F1, F2, F3 represent filter samples, each of a certain bit-width. The bit-serial mechanism reads an individual bit of an operand at a time by enabling the necessary bit-lines through the decoder units. Precharge (P) and Activate (A) phases are configured by the decoder units for reading the necessary bits. Depending on the bit-widths, multiple P-A cycles are required to read individual input and filter data. The bit-parallel mechanism reads all bits of an individual sample Ai in a single P-A phase, and subsequently all bits of the filter data Fi in the next P-A cycle, to perform the desired computation. Multiple pairs of P-A cycles are required to read the corresponding pairs of input Ai and filter Fi samples. Intuitively, bit-serial takes more P-A cycles to extract all the input and filter data. In contrast, bit-parallel performs the same in fewer P-A cycles, but at the cost of higher computational resources and lower throughput. Most logical operations that do not depend on the previous bit operation benefit from the bit-parallel mechanism. However, arithmetic operations such as addition, multiplication, and their extended versions must wait for the individual bit operations. Hence, bit-parallel does not offer the expected throughput benefits over a bit-serial mechanism for arithmetic operations.

Fig. 1. Traditional Bit-Serial and Bit-Parallel Architectures.

The proposed bit-serial method arranges input A and filter F in an interleaved manner so that the corresponding bit lines are precharged (P) simultaneously, as shown in Figure 2. The proposed interleaved arrangement not only facilitates the simultaneous reading of the corresponding bits of both operands (A and F) from the memory cells in one activate (A) phase, but also keeps the memory cell that will store the result precharged. Within the same activate (A) time window, the result is written to the precharged memory cell, thereby saving two precharge cycles. For a two-bit addition, the conventional bit-serial method employs a P-A-P-A-P-A sequence, where reading the corresponding bits of the two operands from the memory cells consumes two P-A time-frames and writing the result back to the memory cell requires one more P-A cycle. In the proposed bit-serial design, the operands are read from the memory cells, computed, and the result is written back to the memory cell in a single P-A cycle, where one A covers the activation of both read control and write control. The sequence is therefore compressed to P-A, offering 3X throughput.

Fig. 2. Proposed architecture of in-eDRAM-compute design. Compute block consisting of multiplexer units and Adder module is shown.

In the schematic, data from each line is fed to the computing cell (an adder in this case). The multiplexer units form the control unit of the compute block, selecting the intended operation. The row-decoder and column-decoder units are appropriately enabled to read the bits of the inputs from the same memory row (Ai, Fi), which are then passed to the compute unit, and finally the result is stored back in the memory cell. The row decoder is sequenced to enable bit levels from LSB to MSB of the supplied input operands. The compute block consists of an adder block and multiplexer units controlled by a separate decoder. The multiplexer select line is driven by the desired compute operation. In this case, both addition and multiplication are achieved by the adder design, where the multiplier operation is realized by executing the adder operation multiple times. The same bit-serial in-eDRAM-compute architecture, consisting of an adder unit, is employed to achieve a 1-D convolution operation for various filter sizes, as explained later.
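The cycle accounting above can be made concrete with a first-order model. The sketch below is illustrative and not taken from the paper: it simply counts precharge-activate (P-A) cycles per bit position, assuming the conventional layout needs one P-A to read each operand bit and one to write the result, while the interleaved layout folds read, compute, and write-back of a bit position into a single P-A window.

```python
def pa_cycles(width_bits: int, interleaved: bool = False) -> int:
    """First-order P-A cycle count for one bit-serial addition of two operands.

    Conventional layout: per bit position, one P-A to read the A bit, one P-A to
    read the F bit, and one P-A to write the sum bit back (3 per bit position).
    Interleaved layout (this work): the corresponding bits of A and F and the
    destination cell share one activate window, so a single P-A covers
    read-compute-write for that bit position.
    """
    per_bit_position = 1 if interleaved else 3
    return per_bit_position * width_bits

for w in (8, 16):
    conv = pa_cycles(w)
    prop = pa_cycles(w, interleaved=True)
    print(f"{w}-bit add: conventional {conv} P-A cycles, "
          f"interleaved {prop} P-A cycles ({conv // prop}x fewer)")
```

Under these assumptions the interleaved arrangement needs one third of the P-A cycles regardless of operand width, which is the 3X throughput figure quoted above.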
B. Optimized Design of Compute Logic

The in-eDRAM-compute cell is composed of conventional precharge and sense amplifier circuits along with the compute units. Figure 3 shows the transistor-level schematic of a single-bit compute block of the in-eDRAM-compute design, which includes a 28-transistor (T) full-adder and a 10T multiplexer unit. The multiplexer unit is included to select between the addition and multiplication operations, as depicted in the compute block of Figure 2. The convolution operation is established by the control unit setting up a sequence of addition and multiplication operations. The 28T design for the 1-bit full adder and the tristate design for the multiplexer unit are well-established designs generally available as standard cells; hence the same are adopted in the proposed eDRAM-compute unit. The widths of the 1-bit full-adder unit and the multiplexer unit are optimized separately for delay, power, and the product of power and delay (PDP), so that the in-eDRAM-compute can be employed per the design requirements. The transistor-level design of the in-eDRAM-compute structure for the 1-bit adder cell, including the multiplexer unit, was implemented in Cadence Virtuoso using a 45 nm technology PDK. The precharge and sense amplifier designs were fixed to minimum widths while adopting the relevant ratios to satisfy the read and write stability of the technology node employed. The PSO run was set up to establish optimized widths for the 28T adder and 10T multiplexer units. The optimized widths for minimum delay, minimum power, and minimum PDP were evolved using PSO to cater to the individual requirements of efficient performance, low power, and a combination of both, respectively. The optimum widths for the 28T adder and 10T multiplexer units are listed in Tables I and II, respectively; the tables show the widths for minimum delay, minimum power, and minimum PDP and compare them with the 2:1 sized transistors. The higher-order arithmetic operations were further designed and evaluated using an abstract-level tool, CACTI [18], configured with a DRAM model. CACTI is a popular memory simulator developed by HP Labs; it uses analytical models to evaluate performance parameters of higher-order memory and compute units, such as Vdd power and delay. The 8-bit addition, 8-bit multiplication, and 1-D convolution for 8-bit and 16-bit data were modeled in CACTI and characterized using the same 45 nm models.

Fig. 3. Schematic of 28T 1-bit full-adder and 10T multiplexer forming the compute unit of in-eDRAM-compute design.

TABLE I
PSO OPTIMISED TRANSISTOR SIZES OF 28T 1-BIT ADDER (ALL WIDTHS IN µm).

Transistor   2:1 design   Minimum Delay   Minimum Power   Minimum PDP
Q1           0.24         2.00            0.89            0.91
Q2           0.24         2.00            0.50            1.32
Q3           0.24         0.74            1.48            1.01
Q4           0.12         0.12            1.30            0.19
Q5           0.12         0.12            1.55            0.82
Q6           0.12         2.00            0.97            1.32
Q7           0.12         2.00            1.20            0.77
Q8           0.12         2.00            1.10            1.45
Q9           0.18         0.83            0.67            1.67
Q10          0.36         0.12            1.20            0.89
Q11          0.36         1.19            0.89            0.93
Q12          0.36         0.12            1.06            0.91
Q13          0.12         0.12            0.69            0.89
Q14          0.24         0.12            1.32            0.99
Q15          0.72         2.00            0.65            1.67
Q16          0.72         0.12            1.27            1.12
Q17          0.72         1.83            1.04            6.02
Q18          0.48         0.12            0.62            0.76
Q19          0.24         0.12            0.54            1.01
Q20          0.24         2.00            1.35            1.21
Q21          0.24         2.00            1.36            0.93
Q22          0.24         2.00            1.46            1.46
Q23          0.24         0.36            1.34            1.02
Q24          0.48         2.00            1.15            0.32
Q25          0.36         2.00            0.89            1.50
Q26          0.48         0.22            1.18            1.33
Q27          0.12         1.51            1.26            1.29
Q28          0.24         0.99            1.37            0.84

TABLE II
PSO OPTIMISED TRANSISTOR SIZES OF 10T MULTIPLEXER (ALL WIDTHS IN µm).

Transistor   2:1 design   Minimum Delay   Minimum Power   Minimum PDP
Q1           0.12         2.00            1.21            0.94
Q2           0.12         1.59            1.08            1.38
Q3           0.12         1.66            1.51            0.59
Q4           0.12         1.29            1.33            0.17
Q5           0.24         1.97            0.65            1.57
Q6           0.24         2.00            1.07            0.91
Q7           0.24         0.12            0.81            1.08
Q8           0.24         1.51            1.25            1.15
Q9           0.12         2.00            1.57            1.24
Q10          0.24         1.09            1.07            0.80
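To make the optimization flow tangible, the following sketch shows the shape of a PSO loop over the 38 transistor widths (28T adder plus 10T multiplexer). It is illustrative only: the bounds mirror the 0.12-2.00 µm range seen in Tables I and II, but the cost function is a stand-in placeholder, whereas the actual flow would evaluate each candidate width vector through circuit simulation (or CACTI for the higher-order blocks) to obtain simulated delay, power, or PDP. The hyperparameters (particle count, inertia, acceleration constants) are likewise assumptions, not values reported in the paper.

```python
import random

N_WIDTHS = 38               # 28T adder + 10T multiplexer widths being optimised
W_MIN, W_MAX = 0.12, 2.00   # width bounds in micrometres (range seen in Tables I and II)

def cost(widths, metric="pdp"):
    """Placeholder fitness. A real run would push the candidate widths into the
    circuit netlist and read back simulated delay/power; this toy version only
    illustrates the optimiser plumbing (wider -> faster but more power)."""
    delay = sum(1.0 / w for w in widths)
    power = sum(widths)
    return {"delay": delay, "power": power, "pdp": delay * power}[metric]

def pso(metric, n_particles=20, iters=200, inertia=0.7, c1=1.5, c2=1.5):
    # Initialise particle positions (width vectors) and velocities.
    pos = [[random.uniform(W_MIN, W_MAX) for _ in range(N_WIDTHS)]
           for _ in range(n_particles)]
    vel = [[0.0] * N_WIDTHS for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_cost = [cost(p, metric) for p in pos]
    g = min(range(n_particles), key=pbest_cost.__getitem__)
    gbest, gbest_cost = pbest[g][:], pbest_cost[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(N_WIDTHS):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Clamp each width to the allowed technology range.
                pos[i][d] = min(W_MAX, max(W_MIN, pos[i][d] + vel[i][d]))
            c = cost(pos[i], metric)
            if c < pbest_cost[i]:
                pbest[i], pbest_cost[i] = pos[i][:], c
                if c < gbest_cost:
                    gbest, gbest_cost = pos[i][:], c
    return gbest, gbest_cost

widths, best = pso("pdp")
print(f"best placeholder PDP cost: {best:.2f}")
```

Separate runs with metric set to "delay", "power", or "pdp" would produce the three design variants compared in the tables.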


Fig. 4. Hardware-metrics evaluated for PSO optimized design variants for in-eDRAM-compute design.

As shown in Figure 4, the proposed optimal designs offer an improvement of 30.7% in delay, 64.2% power savings, and a 98.2% benefit in power-delay-product (PDP) for the 1-bit full-adder operation. The 2:1 design performs poorly in all three parameters for a single-bit addition when compared to the other optimized designs. The optimized compute block in the in-eDRAM-compute architecture is therefore expected to offer continued benefits in performance, power, and PDP for the multiplication and convolution operations.

C. 1-D Convolution operation

Figure 5(a) shows the sequence of operations used to deduce a 1-D convolution for a given filter. A sampled input signal (A) of size n is convolved with a filter (F) of size k, as stated in the figure. An output of n samples is expected from the convolution operation, considering a sliding window of 1. The data bits representing the input and the kernel, of 8 bits each, provide the convolution output in 8 + k - 1 bits, where k refers to the filter size. The continuous arrival of the input signal overwrites the memory cells A(1-4) after the current computations, thereby utilizing the structured pipelined architecture: while P1 + P2 is being computed, a new A1 x F1 is likely to begin in the interim.

The adoption of the proposed bit-serial in-eDRAM-compute system for the 1-D convolution operation requires parallel stacked memory banks, as shown in the figure. The computation of A1 x F1, A2 x F2, and A3 x F3 to yield P1, P2, and P3 is performed simultaneously in one P-A cycle, and the products of these multiplication units are then sequentially added, resulting in P4 and R1. Hence, for a convolution operation with a filter of size 3, three such P-A cycles are required, where each P-A cycle is represented by two time-frames, as denoted in Figure 5(b). A larger filter size is likely to consume more time. A detailed timing diagram showing the read-and-compute and write operations of the proposed in-eDRAM-compute structure with a filter size of 3 is given in Figure 5(b). The figure depicts the six time-frames, where each time-frame is composed of a precharge, a read-and-compute from the eDRAM cells, and a write to the eDRAM cell.

Fig. 5. (a) A schematic representing the memory and compute block for performing 1-D convolution operation with a filter size of three, and (b) timing sequence describing the read-compute and write operations for performing the same.
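A functional reference for the scheduling just described may help. The sketch below is an operation-level model under stated assumptions: it forms the k products of one output sample as if they were computed across parallel banks in one P-A cycle and then accumulates them sequentially, and it uses valid-mode outputs (n - k + 1 samples) for simplicity, whereas the paper reports n output samples with a sliding window of 1. The per-bit serialization inside each multiply is not modelled.

```python
def conv1d_reference(signal, kernel):
    """Operation-level golden model of the in-eDRAM 1-D convolution: for each
    output sample, the k products A[i+j] * F[j] are formed (in the hardware,
    simultaneously across parallel banks) and then added sequentially."""
    n, k = len(signal), len(kernel)
    out = []
    for i in range(n - k + 1):
        products = [signal[i + j] * kernel[j] for j in range(k)]  # one P-A cycle across banks
        acc = 0
        for p in products:                                        # sequential accumulation
            acc += p
        out.append(acc)
    return out

def time_frames_per_output(filter_size):
    """Illustrative count following the timing description above: k P-A cycles
    per output sample, each spanning two time-frames (read-compute, then write)."""
    return 2 * filter_size

print(conv1d_reference([1, 2, 3, 4, 5], [1, 0, -1]))  # [-2, -2, -2]
print(time_frames_per_output(3))                      # 6 time-frames, as in Fig. 5(b)
```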
D. Comparison

The proposed design was compared with the state-of-the-art in-memory-computing design discussed in [19]. The proposed work differs from the design in [19] in its transistor-level optimized design of the compute block, including the multiplexer module. The multiplexer module for selecting the operations was either ignored in [19] or included in a control block that was not disclosed. Additionally, the interleaved approach of bit-serial access for computing is another major modification over the bit-serial design discussed in [19]. The bit-serial designs were evaluated for single addition and multiplication operations on 16-bit operands, and comparisons in terms of latency are shown in Figure 6. Our designs, referred to as PSO-Delay, PSO-Power, and PSO-PDP, show the results optimized for minimum delay, minimum power, and minimum PDP, respectively. The proposed PSO-Delay design offers 31% and 30.6% improvement in computational time for 16-bit addition and multiplication operations, respectively. Our PSO-PDP design also offers latency improvements of 6.5% and 6.1% over the state-of-the-art work for addition and multiplication operations, respectively. The other optimized designs also show a slight latency improvement over the state-of-the-art work.

Fig. 6. Improvement in latency of a single addition and multiplication operation over [19] for 16-bit operands, by optimising the compute logic (also compared with the bit-serial methodology).

III. RESULTS AND DISCUSSIONS

Fig. 7. (a, b) Latency and (c, d) energy metrics characterized for a single 8-bit addition and multiplication operation for the optimized and 2:1 transistor designs.

Figure 7 shows the computing time in ns and the energy consumed in nJ for an 8-bit addition and multiplication operation. As expected, one multiplication operation consumes about an order of magnitude more computing time and energy than an addition operation. Among the four design variants, the design optimized for minimum delay provides the lowest computing time, making it the performance-efficient design, whereas the design optimized for minimum PDP shows the least energy consumption across all design variants. The 2:1 design shows poor results in terms of energy and latency for both operations, as expected.

The optimized design for minimum power requires a computing time similar to the 2:1 design, but offers energy savings of more than 50% for one 8-bit addition operation and more than 66% for the multiplication operation. The design optimized for minimum PDP offers energy savings of more than 87% for both addition and multiplication operations compared to the 2:1 design approach, and similarly the design optimized for minimum delay improves the latency by at least 26% for addition and multiplication operations over the 2:1 design.

The 1-D convolution operation using the in-eDRAM-compute design was evaluated for various filter sizes to demonstrate usage across a broad frequency spectrum. A filter size of 240 and above is useful for high-frequency audio applications, while a filter size below 30 targets low-frequency EEG or EMG physiological signal processing. The optimized widths offer better latency and energy savings over the 2:1 designs consistently across filter sizes and irrespective of bit-width (8-bit or 16-bit). Among the design variants, an efficient 1-D convolution operation on 8-bit data with a 256-sized filter requires 2.5 ms of computing time, with a minimum energy cost of 120 nJ. A 16-bit convolution operation requires four times more computational time and energy for a similar filter size. The designs optimized for minimum PDP and minimum delay individually offer 37.50% energy savings and 26.47% delay improvement, respectively, for 8-bit 1-D convolution operations with a 256-sized filter over the 2:1-designed in-eDRAM-compute design. For the 16-bit convolution operation with the same 256-sized filter, 42.10% energy savings and 28% delay improvement were achieved by the designs optimized for minimum PDP and minimum delay, respectively. Hence, depending on the real-time signal processing application requirements, the design optimized for minimum delay serves best for a performance-efficient system, whereas the design optimized for minimum PDP serves best for a low-energy system design. It was also observed that the optimized in-eDRAM-compute design and the unoptimized 2:1 transistor design occupy a similar footprint for performing 1-D convolution. As expected, the 2:1-ratioed, minimum-sized transistor design gives a slight advantage in silicon area over the other designs; however, the optimized designs offer large throughput and energy benefits that fit the requirements. Additionally, convolution operations on 8-channel data with filter sizes of 256 and 16 were investigated on an Arduino micro-controller and compared with the proposed in-eDRAM-compute architecture. The proposed delay-optimized in-eDRAM-compute design exhibited 24.42X and 22.76X computation speed improvement for the 256- and 16-sized filters, respectively.
Fig. 8. Energy and Latency of 1-D Convolution operation versus different Filter sizes for (a, b) 8-bit, and (c, d) 16-bit data. All plots are in logscale.

IV. CONCLUSION

This paper introduces a throughput-efficient bit-serial in-eDRAM computing design. The PSO-optimized designs for three different metrics, namely delay, power, and PDP, were evaluated independently for addition, multiplication, and convolution operations. The proposed interleaved bit-serial in-eDRAM-compute design offers a 31% improvement in arithmetic computing over the existing state-of-the-art bit-serial in-memory-computing design [19]. The proposed PSO design optimized for minimum PDP exhibits low energy and improved throughput, which adequately fits the edge-computing requirements. The interleaved in-eDRAM-compute design offers energy benefits of 37.50% and 42.10%, and latency improvements of 26.47% and 28%, for 8-bit and 16-bit convolution operations with 256-sized filters. The best metrics of 120 nJ and 2.5 ms of computing energy and time, respectively, were characterized for running a 256-sized filter convolution operation, which adequately serves the need for real-time signal processing in on-edge inferencing systems.

REFERENCES

[1] C. Yu, T. Yoo, H. Kim, T. T.-H. Kim, K. C. T. Chuan, and B. Kim, "A logic-compatible edram compute-in-memory with embedded adcs for processing neural networks," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 68, no. 2, pp. 667-679, 2021.
[2] R. Nair, "Evolution of memory architecture," Proceedings of the IEEE, vol. 103, no. 8, pp. 1331-1345, 2015.
[3] R. Gauchi, M. Kooli, P. Vivet, J.-P. Noel, E. Beigné, S. Mitra, and H.-P. Charles, "Memory sizing of a scalable sram in-memory computing tile based architecture," in 2019 IFIP/IEEE 27th International Conference on Very Large Scale Integration (VLSI-SoC), 2019, pp. 166-171.
[4] A. Agrawal, A. Jaiswal, C. Lee, and K. Roy, "X-sram: Enabling in-memory boolean computations in cmos static random access memories," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 65, no. 12, pp. 4219-4232, 2018.
[5] C. Eckert, X. Wang, J. Wang, A. Subramaniyan, R. Iyer, D. Sylvester, D. Blaauw, and R. Das, "Neural cache: Bit-serial in-cache acceleration of deep neural networks," in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), 2018, pp. 383-396.
[6] Z. Jiang, S. Yin, M. Seok, and J.-s. Seo, "Xnor-sram: In-memory computing sram macro for binary/ternary deep neural networks," in 2018 IEEE Symposium on VLSI Technology, 2018, pp. 173-174.
[7] A. Biswas and A. P. Chandrakasan, "Conv-ram: An energy-efficient sram with embedded convolution computation for low-power cnn-based machine learning applications," in 2018 IEEE International Solid-State Circuits Conference (ISSCC), 2018, pp. 488-490.
[8] V. Seshadri, D. Lee, T. Mullins, H. Hassan, A. Boroumand, J. Kim, M. A. Kozuch, O. Mutlu, P. B. Gibbons, and T. C. Mowry, "Ambit: In-memory accelerator for bulk bitwise operations using commodity dram technology," in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-50), 2017, pp. 273-287. [Online]. Available: https://doi.org/10.1145/3123939.3124544
[9] V. Seshadri, Y. Kim, C. Fallin, D. Lee, R. Ausavarungnirun, G. Pekhimenko, Y. Luo, O. Mutlu, P. B. Gibbons, M. A. Kozuch, and T. C. Mowry, "Rowclone: Fast and energy-efficient in-dram bulk data copy and initialization," in 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2013, pp. 185-197.
[10] Q. Deng, L. Jiang, Y. Zhang, M. Zhang, and J. Yang, "Dracc: A dram based accelerator for accurate cnn inference," in Proceedings of the 55th Annual Design Automation Conference (DAC), 2018. [Online]. Available: https://doi.org/10.1145/3195970.3196029
[11] S. Li, D. Niu, K. T. Malladi, H. Zheng, B. Brennan, and Y. Xie, "Drisa: A dram-based reconfigurable in-situ accelerator," in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-50), 2017, pp. 288-301. [Online]. Available: https://doi.org/10.1145/3123939.3123977
[12] S. Li, A. O. Glova, X. Hu, P. Gu, D. Niu, K. T. Malladi, H. Zheng, B. Brennan, and Y. Xie, "Scope: A stochastic computing engine for dram-based in-situ accelerator," in 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2018, pp. 696-709.
[13] G. Fredeman, D. W. Plass, A. Mathews, J. Viraraghavan, K. Reyer, T. J. Knips, T. Miller, E. L. Gerhard, D. Kannambadi, C. Paone, D. Lee, D. J. Rainey, M. Sperling, M. Whalen, S. Burns, R. R. Tummuru, H. Ho, A. Cestero, N. Arnold, B. A. Khan, T. Kirihata, and S. S. Iyer, "A 14 nm 1.1 Mb embedded dram macro with 1 ns access," IEEE Journal of Solid-State Circuits, vol. 51, no. 1, pp. 230-239, 2016.
[14] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, "Dadiannao: A machine-learning supercomputer," in 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, 2014, pp. 609-622.
[15] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos, "Cnvlutin: Ineffectual-neuron-free deep neural network computing," in Proceedings of the 43rd International Symposium on Computer Architecture (ISCA), 2016, pp. 1-13. [Online]. Available: https://doi.org/10.1109/ISCA.2016.11
[16] F. Hamzaoglu, U. Arslan, N. Bisnik, S. Ghosh, M. B. Lal, N. Lindert, M. Meterelliyoz, R. B. Osborne, J. Park, S. Tomishima, Y. Wang, and K. Zhang, "A 1 Gb 2 GHz 128 GB/s bandwidth embedded dram in 22 nm tri-gate cmos technology," IEEE Journal of Solid-State Circuits, vol. 50, no. 1, pp. 150-157, 2015.
[17] F. Tu, W. Wu, S. Yin, L. Liu, and S. Wei, "Rana: Towards efficient neural acceleration with refresh-optimized embedded dram," in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), 2018, pp. 340-352.
[18] N. Muralimanohar, CACTI 6.0: A Tool to Model Large Caches, 2009. [Online]. Available: https://cir.nii.ac.jp/crid/1571417126351323392
[19] A. Parmar, K. Prasad, N. Rao, and J. Mckie, "An automated approach to compare bit serial and bit parallel in-memory computing for dnns," in 2022 IEEE International Symposium on Circuits and Systems (ISCAS), 2022.
