
Scalable Multi-bit Precision Compute-in-Memory on Embedded-DRAM

BY

SHRUTHI JAISIMHA
B.E., Visvesvaraya Technological University (VTU)

THESIS
Submitted in partial fulfillment of the requirements
for the degree of Master of Science in Electrical and Computer Engineering
in the Graduate College of the
University of Illinois at Chicago, 2021
Chicago, Illinois

Defense Committee:
Dr. Amit Ranjan Trivedi, Chair and Advisor
Dr. Wenjing Rao
Dr. Zhichun Zhu
DEDICATION

To my parents and brother for their constant love and support.

ACKNOWLEDGEMENT

It would not have been possible for me to complete my master's thesis without the
help and support of the people around me.
First and foremost, I would like to express my immense gratitude to my thesis ad-
visor, Prof. Amit Ranjan Trivedi for providing me with an opportunity to pursue
a thesis research under his guidance. I am extremely grateful and indebted to him
for his expert and sincere counseling and encouragement at every step during the
course of the thesis.
I am thankful to my committee members, Prof. Wenjing Rao and Prof. Zhichun
Zhu for agreeing to serve on my committee and provide invaluable advice.
I am thankful to my fellow student researchers at AEON Lab, Priyesh Shukla,
Shamma Nasrin and Ahish Shylendra for giving me beneficial advice and helping
me out whenever I approached them.
Lastly, I would like to express my sincere thanks to everyone who has directly or
indirectly helped and encouraged me towards the completion of my thesis.

SJ

TABLE OF CONTENTS

CHAPTER PAGE

1 BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 MobileNetV2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Multiplication Free Operator . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Analysis of In-Memory Architecture . . . . . . . . . . . . . . . . . . . 6

2 LITERATURE SURVEY . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1 CONV-SRAM: An Energy-Efficient SRAM With In-Memory Dot-
Product Computation for Low-Power Convolutional Neural Networks 8
2.2 In-Memory Computation of a Machine-Learning Classifier in a Stan-
dard 6T SRAM Array . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 eDRAM-Based Tiered-Reliability Memory with Applications to Low-
Power Frame Buffers . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4 A 667 MHz Logic-Compatible Embedded DRAM Featuring an Asym-
metric 2T Gain Cell for High Speed On-Die Caches . . . . . . . . . . 18

3 DATASETS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.1 CIFAR-10 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 CIFAR-100 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

4 CiM BASED eDRAM HARDWARE SETUP . . . . . . . . . . . . . 25


4.1 Proposed eDRAM bitcell . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.2 The Proposed CiM Framework for 64x15 eDRAM macro . . . . . . . 29
4.3 Operation Cycles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.4 4-bit Flash ADC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

5 RESULTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.1 Timing Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.2 Linearity Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.3 Process Variability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.4 Power Dissipation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.5 Energy Consumption Comparison . . . . . . . . . . . . . . . . . . . . 55

6 FUTURE SCOPE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

7 CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

VITA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

LIST OF TABLES
TABLE PAGE

I ANALYSIS OF CIM NETWORK . . . . . . . . . . . . . . . . . . . . . . . . 8

II SUPERCLASS AND CLASSES OF CIFAR100 DATASET . . . . . . . 26

III LINEARITY ACCURACY OUTPUT VALUES . . . . . . . . . . . . . . 47

IV OVERALL POWER CONSUMED IN CIM FRAMEWORK . . . . . . 55

V ENERGY COMPARISON WITH STATE OF THE ART DESIGNS . 57

LIST OF FIGURES
FIGURE PAGE

1 Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii

2 In-Memory Computing Architecture . . . . . . . . . . . . . . . . . . . . . xv

3 Conventional Multiplication Operation in CiM framework . . . . . . . xvi

1.1 Depth-wise Separable Convolution . . . . . . . . . . . . . . . . . . . . . . . 3

1.2 MobileNet V2 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2.1 Voltage domain conversions in conventional CiM architectures . . . . . 10

2.2 Basic MAC operation in a neuron . . . . . . . . . . . . . . . . . . . . . . 10

2.3 CiM Architecture of CONVSRAM . . . . . . . . . . . . . . . . . . . . . . 12

2.4 CiM Architecture of ML Classifier using 6T SRAM . . . . . . . . . . . . 14

2.5 Digital to Analog Converter . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.6 Conventional 1T1C DRAM Cell . . . . . . . . . . . . . . . . . . . . . . . 17

2.7 DRAM mapped into a crossbar memory array . . . . . . . . . . . . . . . 18

2.8 Conventional 6T SRAM cell . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.9 3T eDRAM gain cell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.10 Retention Time for eDRAM cells . . . . . . . . . . . . . . . . . . . . . . . 21

2.11 Conventional Leakage current in eDRAM cell . . . . . . . . . . . . . . . . 22

4.1 eDRAM bitcell for In-Memory Computing . . . . . . . . . . . . . . . . . 27

4.2 eDRAM bitcell for In-Memory Computing . . . . . . . . . . . . . . . . . 30

4.3 CiM framework using eDRAM cells for 64x15 macro . . . . . . . . . . . 31

4.4 Mapping of weights and inputs into eDRAM macro . . . . . . . . . . . 32

4.5 Timing Diagram of the control signals in the CiM framework . . . . . 35

4.6 4-bit FLASH ADC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

4.7 N-Type and P-Type Comparator . . . . . . . . . . . . . . . . . . . . . . . 38

4.8 The Overall Framework of the Proposed CiM Architecture . . . . . . . 39

5.1 Multiplication Operation in In-Memory System . . . . . . . . . . . . . . 42

5.2 Average/Accumulation Operation . . . . . . . . . . . . . . . . . . . . . . 43

5.3 Voltage levels of the Resistive Ladder . . . . . . . . . . . . . . . . . . . . . 44

5.4 Comparator and Encoder Outputs . . . . . . . . . . . . . . . . . . . . . . 45

5.5 The Timing Analysis of the entire CiM framework . . . . . . . . . . . . 46

5.6 Linearity Accuracy Plot for Inputs and SL Line Voltage . . . . . . . . . 48

5.7 Linearity Accuracy of 4-bit Flash ADC . . . . . . . . . . . . . . . . . . . 49

5.8 Histogram Plot for Sigma Variation of 15mV . . . . . . . . . . . . . . . . 51

5.9 Histogram Plot for Sigma Variation of 25mV . . . . . . . . . . . . . . . 52

5.10 Histogram Plot for Sigma Variation of 35mV . . . . . . . . . . . . . . . . 53

5.11 Histogram Plot for Sigma Variation of 35mV . . . . . . . . . . . . . . . . 54

5.12 Power Consumption in CiM Framework . . . . . . . . . . . . . . . . . . . 56

LIST OF ABBREVIATIONS

NN Neural Networks
AI Artificial Intelligence
ML Machine Learning
ANN Artificial Neural Networks
CNN Convolutional Neural Networks
DNN Deep Neural Networks
RNN Recurrent Neural Networks
MLP Multi Layer Perceptron
eDRAM Embedded Dynamic Random Access Memory
SRAM Static Random Access Memory
RRAM Resistive Random Access Memory
MAC Multiply and Accumulate Unit
MAV Multiplication and Average
ADC Analog-to-Digital Converter
DAC Digital-to-Analog Converter
CiM Compute in Memory
IMP In-Memory-Computing
MC Monte-Carlo
DWConv Depthwise Convolution
PE Processing Element
MUX Multiplexers
MF Multiplication Free Operator
SNR Signal-to-Noise Ratio

SUMMARY

Deep Neural Networks are hierarchical, layered organizations of neurons connected
to each other, inspired by the human brain. These neurons are segregated into
input, hidden, and output layers. Each neuron performs a multiply-and-accumulate
operation on the weights and input features. These operations require the data to be
fetched from memory into the processor, and a tremendous amount of energy is lost
in the traditional von Neumann architecture due to this data movement between the
processor and memory systems. Designing an energy-efficient network that operates
on the data while eliminating this data movement is a potential way to enhance
performance without sacrificing energy consumption. This can be realized by
performing the computationally expensive mathematical operations within the memory
cells, a concept called Compute-in-Memory (CiM) or In-Memory Computing (IMP). The
CiM framework can be exploited in deep learning algorithms, which leverage faster
computing power, efficient storage, and a growing volume and variety of data for
highly accurate predictions. There is a surge in on-edge intelligence applications,
which demand a considerable reduction in power consumption, compact designs, and
robustness to handle dynamic decision making.
Current CiM framework designs are complex, with mixed-signal peripherals that
require digital-to-analog converters at every input node. In the conventional CiM
architecture, the inputs are converted into the analog domain so that the
multiplication operation is realized simply as the total current flowing through the
bitline. This necessitates a DAC at every bitline to convert the digital voltages
into the analog domain. These parallel DACs are expensive in area and power, which
must be addressed to make CiM frameworks more energy efficient. In this work, the
use of these power-hungry devices is obviated by
co-designing the learning operators with an eDRAM-based CiM. The newly introduced
eDRAM framework for in-memory computing eliminates the use of DACs even for
multi-bit precision operation and, by adopting a new kind of operator called the
multiplication-free operator, does not demand high-precision ADCs for DNN processing.
eDRAM cells show significant advantages over SRAM cells, with compact CiM designs,
fewer transistors per memory cell, and lower leakage power, and are thus well suited
for storage and processing in memory.
The proposed eDRAM macro framework is divided into µ-arrays and µ-channels, where
each µ-array is solely responsible for storing one weight channel. The neural-network
weights are stored across columns in every µ-array, with each weight bit-plane
arranged across rows. The eDRAM bitcell has two additional transistors that adapt
these cells for CiM operations; they are connected to control signals that allow the
eDRAM cells to perform the bitwise multiplication operation. The weights of the
neural network are stored in the eDRAM cells and the inverted inputs are fed into the
cells through the control signals. Depending on the weight bit and the input bit, the
cells either charge or discharge, performing the required bitwise multiplication
without the need for DACs at every bitline. The bitwise multiplication is followed by
an averaging operation: using transmission-gate MUXes, the bitlines are summed and
then fed into ADCs to produce the corresponding digital output. 4-bit Flash ADCs are
designed to convert the analog sum output into the corresponding digital bits, and
these ADC outputs are shifted and added to produce the final output of the CiM
system. The power consumption of each of these blocks is noted; the Flash ADC
consumes the most, accounting for about 64% of the total power at 4-bit ADC precision.

Linearity accuracy is another check performed to thoroughly understand the working
of the CiM architecture. A linear trend is observed between the inputs and the SL
line voltage: as the input values increase, the SL voltage varies linearly. The
design is also subjected to process variability with sigma of 15 mV, 25 mV, and
35 mV; the two-sigma values stay well within bounds, and the histogram of the
threshold-voltage variation follows a Gaussian trend.

MOTIVATION

Deep Neural Networks are expanding into applications where, along with high
prediction precision, low power is a critical requirement. Since DNNs employ
millions to billions of parameters for classification or regression, deploying them
in low-power applications is challenging. With speed, accuracy, and power as the
crucial parameters of consideration, computing in memory is a potential solution to
this complex problem. Several in-memory architectures have been designed using
Static Random-Access Memory cells for binary-weighted neural networks, with area-
and power-hungry DACs arranged in parallel at every bitline. Time-domain DACs were
introduced to minimize the mixed-signal overheads; however, as the precision of the
input bits increases, they necessitate either complex scaling of the analog-domain
voltage or an exponential increase in operating time. The standard 6T SRAM cell is
also considerably larger than state-of-the-art 1-bit memory cells such as eDRAM
cells, limiting the achievable cache density. DACs and ADCs further aggravate the
high power requirements of DNNs as real-world problems grow in complexity. A
multiplication-free operator is therefore used to perform the computationally
expensive operations of DNNs without the use of DACs in the in-memory engine.

INTRODUCTION

Neural Networks

To the human eye, recognizing a set of digits seems relatively easy on the surface.
However, around 150 million neurons with billions of interconnections are involved.
These neurons reside in the visual cortex V1, which alone is not sufficient: a series
of cortices working progressively is needed to read the image. This apparently easy
process is extremely difficult to describe algorithmically so that a computer can
perform the same operation our brains do so effortlessly. Neural networks are a
better solution, as they take a set of samples from a database to train a model and
automatically learn the rules needed to make sense of the samples. As the number of
training samples increases, so does the accuracy of recognition. These training
samples can range from thousands to billions.[1]
The basic building block of neural networks is a type of artificial neuron called the
perceptron, which takes several inputs and produces a single binary output. To
determine the importance of each input in producing the output, every input is
multiplied by a weight (kernel). The output is then 0 or 1 depending on whether the
weighted sum of the inputs is below or above a threshold value. We can therefore
steer the network towards the correct output by manipulating the weights and the
threshold: the perceptron makes decisions based on the weighted sum of its inputs.
The neural network has perceptron layers stacked up to predict the output. The first
layer is called the input layer, which takes simple decisions based on the weighted
sum of the inputs. The layers following the input layer are called hidden layers;
they make decisions based on the inputs fed by the previous layer and produce outputs
that the next layer receives as its inputs. The output layer is the last layer, which
produces the prediction. The complexity of the decision-making increases gradually as
we progress through the layers. The threshold used to decide the weighted-sum output
is called the bias. Thus, if the sum of the products of kernel weights and inputs,
added to the bias, is less than zero, the output is zero, and one otherwise.

Figure 1: Neural Network


[2]

These stacked layers form a multilayer perceptron. The input and output layer designs
are relatively straightforward, while the hidden layers' connections can be
engineered to obtain the desired output. Generally, the output of one layer is the
input to the next layer, which is referred to as the feedforward technique. In
recurrent neural networks, however, there is a feedback loop: a set of neurons fires
for some time before becoming inactive, and during this phase they trigger new sets
of neurons, causing a cascade of firing neurons. The loops generated in these
networks do not affect the input instantaneously and thus cause no issues. [3]
Neural networks have various on-edge intelligence applications which demand an
energy-efficient architecture. To achieve this feat, operations on the inputs and
weights are performed within the memory, as discussed next.

In-memory Computing

There is a lot of analysis and research into how to perform the operations in the
core without spending much power on data transfer between the processor and the
memory. Improving performance at lower power consumption is possible without
'hitting the memory wall'. Different architectures can also be designed to reduce
the memory bandwidth requirement; however, we focus on addressing the issue by
performing the operations in the memory.
A multiply-and-accumulate operation in the traditional route would require the
kernels and activations to be moved to the cores to perform operations on them. This
movement demands a lot of unnecessary power. The in-memory computing approach
instead keeps the kernel weights stored in the memory units and applies the
activations to them in place to realize the multiply-and-accumulate operation.
In traditional in-memory computing the operations in the memory are performed in the
analog domain to reduce power consumption.[4] Digital-to-analog converters are used
to convert the digital inputs into the analog domain and drive the memory cells. The
outputs on the bitlines of the memory cells are analog in nature and are then
converted back into the digital domain using analog-to-digital converters.
The first memory cells designed to realize in-memory computing used Resistive
Random-Access Memories. These cells, however, are not widely used because of the
non-linear relationship involved in programming their resistances. Flash cells have
seen significant

Figure 2: In-Memory Computing Architecture


[5]

use in in-memory computing. They are primarily resistive, like RRAMs, and capacitive
in nature: a flash cell behaves as a transistor when fully on or off and as a
resistor when partially on. These cells can be used in edge computing, and if
sufficient arrays are stacked, they can store the weights to perform the
multiply-and-accumulate operation. As these cells are non-volatile, they do not need
to be reprogrammed over time. This eliminates the data movement from the memories to
the processor and essentially realizes computing-in-memory. This was followed by
Static Random-Access Memory units, which replaced the flash cells for in-memory
computation. In this implementation the values are maintained in the analog domain
before the memory computation and
then in the digital domain after the memory computation and bitline readout. Several
bitlines must be considered for a single operation, as each of them holds a digital
value. [6]
These bitlines can be grouped to perform several multiplications in parallel. For
n-bit operation, an n-bit input vector is considered and several multiplications are
computed successively. The bitline charges are stored on a capacitor and sent to an
analog-to-digital converter to transform them back into the digital domain.
Dynamic Random-Access Memories have also been explored for in-memory computing. Here,
computing is brought to the memory, and after the computation the result travels back
to the core.

Figure 3: Conventional Multiplication Operation in CiM framework


[6]

The CiM framework can be realized using different types of memory cells; this work
focuses on using embedded dynamic random-access memory cells to achieve a lower-power
and more compact architecture, which attracts many on-edge intelligence applications.
This alone is still not sufficient for a compact design, and thus dropout-based
Bayesian inference is employed. Dropout helps in reducing the network complexity, and
Bayesian inference preserves the performance of the architecture even when some
neurons are dropped during the training phase.

PROBLEM STATEMENT

In-memory computing has gained considerable attention due to the efficiency of
computing within the memory cells. Over time, the memory cells used have evolved from
Resistive Random-Access Memories to Static Random-Access Memories. However, there is
no systematic framework for how 1-bit memory cells could be used efficiently for
in-memory computing. Traditional CiM architectures need high-precision
analog-to-digital and digital-to-analog converters, which are area- and
power-inefficient, and this problem demands an innovative solution. The proposed
in-memory computing architecture is therefore based on state-of-the-art embedded
dynamic random-access memories and uses a new kind of operator, the
multiplication-free operator. This work is an effort to realize an energy-efficient
and compact CiM architecture that is DAC-free even for higher-precision bit
operations. A CiM implementation using eDRAM cells is desirable because SRAM-based
CiM frameworks have a higher cell area, which limits the memory footprint for on-edge
devices.

THESIS ORGANIZATION

This thesis comprises seven chapters.


Chapter 1: Background
Chapter 2: Literature Survey
Chapter 3: Datasets
Chapter 4: CiM based eDRAM Hardware Setup
Chapter 5: Results
Chapter 6: Future Scope
Chapter 7: Conclusion

Chapter 1
BACKGROUND

1.1 MobileNetV2

This neural network architecture was developed for edge-computing applications,
particularly for mobile and other resource-limited platforms. The aim of the network
is to reduce the overall computational effort while retaining the same accuracy. The
linear bottleneck and the inverted residual block are the heart of the innovation in
this architecture and help maintain its overall performance. This group of layers
performs three basic operations: expansion of the low-dimensional input to a higher
dimension, lightweight depth-wise convolution for filtering, and finally projection
back to a low-dimensional representation.
Depth-wise separable convolution:
Every convolutional layer of the network uses depth-wise separable convolution for
better efficiency. In this technique, the convolution is split into two layers: the
first layer performs a depth-wise convolution and the second layer computes a
point-wise convolution.
In the MobileNetV2 architecture, the point-wise convolution that reduces the number
of channels is referred to as the projection layer: the high-dimensional data is
projected into a tensor with lower dimensions. [7]
With the projection layer shrinking the number of channels, the following stage works
on a smaller number of channels. This is called the bottleneck layer, as the amount
of data passing through it is reduced by the bottleneck created by the projection
layer. The whole block is called a bottleneck residual block, where

Figure 1.1: Depth-wise Separable Convolution
[8]

the output of each block is a bottleneck.


The point-wise (1x1) convolution in the expansion layer increases the number of
channels in the data before it is fed into the depth-wise convolution layer. Thus,
the expansion layer generally has a higher number of output channels than input
channels.
The tensor fed into the expansion block and the tensor at its output are
low-dimensional; however, the operations performed within the block are
high-dimensional.

Figure 1.2: MobileNet V2 Architecture


[8]

The ReLU6 activation function and batch normalization are present in every layer. The
projection layer output, however, does not go through the activation function: since
this layer produces a low-dimensional output, applying a non-linear operation to it
was found to destroy useful information and reduce the overall accuracy of the neural
network.

1.2 Multiplication Free Operator

In in-memory computing units, the memory cells perform the product of the kernel
weights and input values. Computing the n-element product of weights w and inputs x
needs n DACs and one ADC, which implies that the number of DACs grows with the number
of product computations, causing higher area and power consumption. With increasing
bit precision of the input vectors, the DAC designs also become more complex.
The precision of the ADC depends on the input precision and on the number of cells
combined to form the sum of products. For an n-bit input x and L combined cells, the
required ADC precision is approximately n + log2(L) bits. This requirement becomes
more stringent as the input bit precision and the number of summed cells increase. [?]
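For instance, assuming 4-bit inputs (n = 4) and L = 64 cells summed along a bitline,
a conventional CiM readout would call for an ADC of roughly 4 + log2(64) = 10 bits;
these numbers are illustrative, not taken from this design.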
These complex design challenges of ADCs and DACs are addressed by using the
multiplication-free operator, which eliminates the use of DACs in the in-memory
engine. The new operator can be explained through the following concepts:
1. The multiplication-free operator removes the DAC units from the in-memory system,
as the high-precision multiplication of the input vectors and kernel weights is
eliminated. In the presented operator, the multiplication between the weights and
inputs is represented as:

x ⊕ w = Σ_i [ sign(x_i) · abs(w_i) + sign(w_i) · abs(x_i) ]        (1.1)

In the equation above, · is the element-wise multiplication operator between the
input vector and the kernel weights, + is the element-wise addition operation, and
Σ is the vector sum operator. The sign() operation considers only the sign bit of the
weights or inputs, and abs() represents the absolute value of the operands. The
operator inherently performs the correlation of sign(x) with abs(w) and of sign(w)
with abs(x), which avoids the need for a high-precision multiplication between the
operands and thus eliminates the use of DACs in the in-memory engine. A short
software sketch of this operator is given at the end of this section.
2. Equation 1.1 is reformulated to further reduce the overall dynamic power
consumption of the circuit. Using the step() function, only a single product port of
the eDRAM cells is processed, which limits the dynamic power consumption.

Σ sign(w) · x = 2 × Σ step(w) · x − Σ x        (1.2a)
Σ sign(x) · w = 2 × Σ step(x) · w − Σ w        (1.2b)

Here, the first terms of Equations 1.2a and 1.2b have low dynamic energy. Σx is the
residue term, which can be computed by summing along a row of dummy weights storing
all ones in the eDRAM cells. Σw is a weight statistic that can be evaluated before
the multiplication operation begins and reused during the evaluation. Based on this
reformulation of the multiplication-free operator, the product of two vectors
involves only sign and addition operations. Considering two vectors x and y in the
real-valued plane, their correlation under this operator takes the form of Equation
1.1; the operator can also be related to the l1 norm, since x ⊕ x = 2‖x‖₁. The
traditional neurons need
to be replaced with an affine transformation using the co-designed neural network
operator, φ(α(x ⊕ w) + b), where w stands for the weights, α is a scaling
coefficient, and b is the bias. No additional non-linear activation layer is
required, as the operator itself is non-linear. The operator can be easily scaled to
recurrent neural network (RNN), convolutional neural network (CNN), and multi-layer
perceptron (MLP) layers. The training of these neural networks is done through
back-propagation [?].
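As promised above, here is a minimal NumPy sketch of the operator in Equation 1.1 and
of its step()-based reformulation in Equations 1.2a/1.2b. The function names are
ours, and the two forms agree for vectors with no exact zero entries.

```python
import numpy as np

def mf_dot(x, w):
    # Equation 1.1: multiplication-free correlation of input vector x and weight vector w
    return np.sum(np.sign(x) * np.abs(w) + np.sign(w) * np.abs(x))

def mf_dot_step_form(x, w):
    # Equations 1.2a/1.2b: the same quantity written with step() terms plus residues,
    # which maps more directly onto the bit-plane operations of the eDRAM macro.
    step = lambda v: (v > 0).astype(float)
    term_a = 2.0 * np.sum(step(w) * np.abs(x)) - np.sum(np.abs(x))  # = sum(sign(w) * abs(x))
    term_b = 2.0 * np.sum(step(x) * np.abs(w)) - np.sum(np.abs(w))  # = sum(sign(x) * abs(w))
    return term_a + term_b
```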

1.3 Analysis of In-Memory Architecture

The tradeoffs generally seen in a 2D memory architecture are modified by in-memory
architectures:
Assumption: D refers to the number of bits of data needed for the computations. This
data is distributed across a D^(1/2) x D^(1/2) array.
Parameters of interest: Bandwidth, Latency, Energy, and SNR (typically for SRAM
cells).
Bandwidth:

• In the traditional architecture approach, data movement to and from memory is
limited by transfers through a single memory port every clock cycle. The total
bandwidth decreases as 1/D^(1/2), since the port can only grow as D^(1/2) while
D^(1/2) cycles are needed.

• In the in-memory computing engine, the massive parallelism available because the
data is readily in place for the computation means the bandwidth does not decrease.[9]

Latency:

• In the traditional approach, the capacitance across BL/BLB and the number of
accesses both increase by D^(1/2), causing the latency to increase by D.

• In the in-memory computing engine, even though the capacitance increases by
D^(1/2), the corresponding increase in the number of bitcells discharging in parallel
cancels it out and maintains a constant voltage swing across the BL/BLB lines.[9]

Energy:

• In the traditional approach, the energy increases by D^(3/2) due to the increase in
the number of accesses by D^(1/2), the number of bitlines by D^(1/2), and the BL/BLB
capacitance by D^(1/2).[9]

• In in-memory architectures, even though all the WLs are driven at once, the low
voltages on the WLs keep this energy below that of the traditional approach. The
BL/BLB discharge dominates with large voltage swings, but the number of accesses
remains constant, so the energy increases only by D.

SNR:

• In the traditional approach, SRAMs employ low voltage swings across BL/BLB that do
not depend on the number of bits D.

• In in-memory computing the SNR degrades: the absolute discharge on BL/BLB is
constant with D, while the differential discharge decreases with the BL/BLB
capacitance as 1/D^(1/2).[9]

Thus, the in-memory computing engine shifts all the trade-offs seen in the
traditional approach, benefiting bandwidth, energy, and latency at the cost of SNR.

Parameter     Traditional Architecture     In-Memory Architecture
Bandwidth     1/D^(1/2)                    1
Latency       D                            1
Energy        D^(3/2)                      D
SNR           1                            1/D^(1/2)

TABLE I: ANALYSIS OF CIM NETWORK
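To make the trends in Table I concrete, the short sketch below evaluates the relative
scaling factors for an illustrative problem size; D = 2^16 is an arbitrary value
chosen here, not a parameter of this design.

```python
# Illustrative check of the scaling trends in Table I (relative factors, not absolute units)
D = 2 ** 16                                  # bits of data involved in the computation
trends = {
    "bandwidth": (D ** -0.5, 1.0),           # (traditional, in-memory)
    "latency":   (float(D), 1.0),
    "energy":    (D ** 1.5, float(D)),
    "snr":       (1.0, D ** -0.5),
}
for name, (trad, cim) in trends.items():
    print(f"{name:10s} traditional={trad:.3g}  in-memory={cim:.3g}")
```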

Chapter 2
LITERATURE SURVEY

2.1 CONV-SRAM: An Energy-Efficient SRAM

With In-Memory Dot-Product Computation

for Low-Power Convolutional Neural

Networks

Realizing neural network operations with in-memory computation requires the
operations to be performed in both the analog and digital domains for
energy-efficient computation. The process is divided into three steps. First, the
inputs of the convolutional layers must be converted from the digital domain to
analog voltages using digital-to-analog converters (DACs). Second, these analog
voltages are multiplied with the 1-bit kernel weights, completing the first half of
the MAC operation; the dot products are then averaged over the total number of
columns, producing the analog convolution output. This step represents
Multiplication-and-Average (MAV). Lastly, an analog-to-digital converter (ADC) is
needed to convert the analog output voltage back into digital logic.

The procedure performed inside the in-memory engine is described below with the
equivalent mathematical formulation. In a neural network containing convolutional
layers, the output (Y) is the dot product of the kernel weights (w) and the input
feature maps (x), as illustrated in Figure 2.2.
Figure 2.1: Voltage domain conversions in conventional CiM architectures
[10]

Figure 2.2: Basic MAC operation in a neuron

When binary weights are considered, the above relation can be manipulated further as
shown in (2a), where ak corresponds to the coefficient of the k-th filter. The value
of ak can be represented using the number of elements added per dot product per clock
cycle, Mk, and N.
Equation (2b) can be further rewritten as (3), where YOUT is the effective
convolution output and XIN is the effective convolution input, which has a lower
precision. Here Mk, the scaling factor, is separated out because it can be applied
after the entire dot product is performed.

Equation (3) must then be evaluated in the analog domain while conserving the energy
spent on the computation.
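The equations referenced above (numbered (1)-(3) in the original paper) did not
survive extraction. The following is a hedged reconstruction from the surrounding
description only; the notation is ours and may differ from the paper's.

```latex
% (1) dot product of the k-th filter with the input feature map
Y_k = \sum_{i=1}^{N} w_{k,i}\, x_i
% (2a)/(2b) binary-weight form, with a_k the coefficient of the k-th filter
Y_k = a_k \sum_{i=1}^{N} \mathrm{sign}(w_{k,i})\, x_i
% (3) effective convolution output computed in the analog domain as a column average;
%     the scaling (a_k, M_k) is re-applied digitally after the dot product
Y_{\mathrm{OUT}} \propto \frac{1}{N} \sum_{i=1}^{N} \mathrm{sign}(w_{k,i})\, X_{\mathrm{IN},i}
```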

OVERALL ARCHITECTURE
The in-memory array is a crossbar architecture consisting of SRAM cells. The array is
divided into local micro-arrays, where each local array stores the binary values of
the weights. Depending on whether a value is -1 or +1, the 10T SRAM cell stores the
kernel weight as logic 0 or logic 1, respectively. Each local array performs the
Multiplication-and-Average operation and has a dedicated ADC to compute the digital
value of the corresponding analog voltage. This is shown in the figure below.

Choosing the bit precision of the ADCs and DACs is important, as the error rate
depends on the accuracy of the converter outputs. With increasing input precision,
the design complexity of the ADCs/DACs increases. How-
Figure 2.3: CiM Architecture of CONVSRAM
[10]

ever, with lower-precision input and output feature maps, the area and power cost of
the DAC/ADC circuits is considerably reduced. To compute the inner dot product of the
input features and kernel weights with I elements, a total of I DACs and one ADC are
needed. [10] These converter circuits are power- and area-hungry, especially as the
DACs are active most of the time, and, as mentioned, with increasing input bit-width
the precision demanded of the DACs also increases to maintain the accuracy of the MAC
operations.

2.2 In-Memory Computation of a

Machine-Learning Classifier in a Standard

6T SRAM Array

This paper attempts to reduce the overall energy consumption of data-driven inference
models, where memory accesses dominate and incur high energy dissipation. Developing
architectures that combine computing and memory, such as computing in memory, is a
potential solution. Here a 6T SRAM cell is used, and the computations are performed
within the memory cells to reduce the number of memory accesses.
The figure below shows the proposed 6T SRAM-based in-memory architecture. Based on
the periphery, there are two modes of operation.[9] First, SRAM mode, where read and
write operations are performed on digital data in the SRAM cells and a single
wordline is driven at a time. Second, Classify mode, where the wordlines of all the
cells are driven to analog voltages at the same time.
Here, the features in digital format are fed in as WL voltages, which requires a DAC
at every wordline to generate the analog voltages.
Word-Line DACs:
The figure below shows the word-line DAC (WLDAC) circuit with a 5-bit digital value
as its input. The digital inputs are converted into currents and then into the output
WL voltage, which is obtained by running the current through an upsized replica of
the bitcell. As indicated in the bitcell-replica circuit, MD,R is the driver
transistor, which receives the CLASS_EN voltage (VDD) when working in classify mode.
MA,R is a self-biased access transistor that can supply the required WL
Figure 2.4: CiM Architecture of Machine Learning Classifier using 6T SRAM
[9]

voltage corresponding to IDAC (the DAC current). To keep the BL/BLB discharge from
saturating, the WL amplitude is carefully designed by managing the BL/BLB
capacitance; it can be controlled by the upsizing ratio R and the biasing of the PMOS
source terminals.
The paper also makes observations about the impact of the WL voltage settling time.
This settling time is variable and depends on the self-biasing of the bitcell
replica, and the settling slows when low currents must drive a large WL capacitance.
To solve this problem, an offset current source is introduced in the DAC circuits.
The discharge of BL/BLB can cause errors in both SRAM mode and Classify mode. In SRAM
mode low voltage swings across BL/BLB are preferred, but in Classify mode high
voltage swings are preferred, since all the bitlines are summed during this phase and
a higher dynamic range should be maintained.
There could be two possible faults introduced due to the above-mentioned condition.

Figure 2.5: Digital to Analog Converter
[9]

First, due to the decreasing VDS across the transistors, the overall current could be
lower. Second, weak NMOS biasing could cause the cells to pull up instead of pulling
down.

2.3 eDRAM-Based Tiered-Reliability Memory

with Applications to Low-Power Frame

Buffers

Multi-core processors provide higher computing power because they exploit
micro-architecture-level parallelism while trying to limit the overall power
consumption, which has increased the number of cores per processor. To leverage this
parallelism and the appetite for large data, more embedded memory is required. In
modern systems-on-chip, the memories consume a large fraction of the chip. These
embedded memories are dominated by Static Random-Access Memory cells, yet they could
be replaced by embedded dynamic random-access memory cells, which would reduce the
area occupied, increase the density, and/or reduce the power consumption. eDRAMs have
replaced SRAM cells as a potential low-cost memory and have seen extensive use in
modern multi-core processors at the last level of cache, either integrated on the
same die or stitched into a different package. Due to their higher density and lower
leakage power, eDRAMs are well suited for caches and registers. They find
applications in caches, frame buffers in video drivers, and video processing [11].
There have been quite a few successful eDRAM cell designs, and the traditional 1T1C
DRAM cell has 4.5x higher bitcell density and 5x lower static power dissipation,
including the refresh power, than the standard 6T SRAM cell [12]. However, these
cells suffer from three major drawbacks: first, they require detailed and complex
capacitor fabrication; second, they need an ultra-low-leakage access transistor; and
third, the read operation is destructive due to charge sharing.

[11]
Figure 2.6: Conventional 1T1C DRAM Cell

Here are some of the factors to be considered when using eDRAMs for in-memory
computing.
Advantages: low leakage currents, non-ratioed logic, and small cell size. The write
margin is better in eDRAM cells than in SRAM cells, as there is no contention between
the access devices and a cross-coupled latch, and eDRAM cells are less sensitive to
process variations.
Disadvantages: short data retention times, due to the small capacitors and the
various leakage currents in the cells, which vary with PVT conditions. This requires
larger refresh power and results in poorer read performance.

One major drawback of embedded DRAMs is the need for periodic refreshing of the
cells. As in DRAMs, where the stored data must be continuously refreshed because the
read is destructive, eDRAMs also face this refresh-rate challenge. Frequent
refreshing causes higher power consumption and limits the access period available for
the cells to perform in-memory computation. If refreshing is not performed in a
timely manner, the storage node voltage can drop and cause memory errors. In DRAM
cells, the sub-threshold leakage current is handled by boosting the supplies, and the
loss due to the threshold voltage is compensated during the write operation. This
tradeoff is crucial and needs to be
Figure 2.7: DRAM mapped into a crossbar memory array
[11]

addressed.

Many works have tried to mitigate the refresh operation in eDRAM cells. A study in
[14] shows that the highest power consumption is due to refreshing, and the focus
there is on reducing this power. Considering applications that are inherently
error-resilient, the authors reduce the power consumed by extending the refresh
period of the gain cell beyond the worst-case point of failure; in data-mining
applications, many memory errors can be tolerated. Other techniques employ
error-correcting codes (ECC), which can extend the DRAM refresh period, and [15]
employs retention-aware DRAM training.
Figure 2.8: Conventional 6T SRAM cell
[13]

2.4 A 667 MHz Logic-Compatible Embedded

DRAM Featuring an Asymmetric 2T Gain

Cell for High Speed On-Die Caches

The basic read, write, and hold operations of the 3T eDRAM cell are explained in this
section.
PW is the write access transistor, PS is the storage transistor, and PR is the read
access transistor. During the write or write-back phase, the value on the WBL line is
written into the storage node through the PW transistor. The WWL signal is driven to
a negative voltage (logic 0), similar to a 1T1C cell, so that the voltage on the
Write-Bitline (WBL) can be written into the storage node. During the read phase, the
Read-Bitline (RBL), which is pre-discharged, is driven to logic 1 only when the
voltage stored at the gate of the PS transistor is logic 0; when the storage node
holds logic 1, RBL remains at its pre-discharged voltage.[17] To sense the voltage
stored in the cell, the RBL voltage

Figure 2.9: 3T eDRAM gain cell
[16]

can be compared with a reference voltage using a sense amplifier. During the hold
phase, PR and PW are turned off and the storage node is left floating. Sub-threshold,
gate, and junction leakage through the neighboring transistors changes the voltage on
the storage node. The figure below shows how the voltage at the storage node degrades
as time elapses.

Since the adjacent PMOS transistors around the storage node sit at higher voltages,
the retention time of logic 0 is much shorter than that of logic 1. If the PMOS
transistors are replaced by NMOS transistors, it instead becomes critical to hold
logic 1 for a longer time than logic 0. The retention time depends directly on the
leakage currents flowing through these transistors.
Due to PVT variations, each eDRAM cell exhibits a different retention time, so the
refresh rate of the eDRAM array is set by the cell with the shortest retention time.

Figure 2.10: Retention Time for eDRAM cells
[16]

As explained in the previous section, PW is the write access transistor, PR is the
read access transistor, and PS is the storage transistor. These eDRAM gain cells have
decoupled read and write operations, unlike SRAM cells.[18] The read access uses the
Read-Wordline (RWL) and Read-Bitline (RBL), and the write access uses the
Write-Wordline (WWL) and Write-Bitline (WBL).
In addition to the read and write phases there is a data retention phase, which
determines how long the data is held in the storage node.
Data retention phase: In this phase PW and PR are turned off, leaving the storage
node floating. The various junction, gate, and channel leakages degrade the value
stored in the node. As opposed to what we
Figure 2.11: Leakage current in eDRAM cell
[16]

saw in the previous section, here NMOS transistors are used, so the data retention
time (DRT) of logic 1 is shorter than that of logic 0. The DRT depends on the
accumulated leakage currents flowing through the storage node. The figure shows Monte
Carlo simulations in HSPICE of the data-retention-time variation, representing the
cell-to-cell DRT variation for a 1 Mb memory macro. [19]
The read reference bias voltage is set to 0.65 V, and the stored logic-1 voltage is
expected to be higher by 0.2 V to achieve the same margin as that obtained for a
stored logic 0. It is observed from the figure that the DRT for storing logic 1 lies
between 12.2 µs and 54.1 µs [16]. This is attributed to the high sub-threshold
leakage current of the NMOS storage transistor, whereas the logic-0 DRT has
relatively stable characteristics. Due to WWL coupling, the initial voltage level on
the storage node is lower than expected, which further reduces the DRT for a stored
logic 1.

Chapter 3
DATASETS

MobileNetV2 is trained using the CIFAR-10 and CIFAR-100 datasets.

3.1 CIFAR-10 Dataset

The CIFAR-10 dataset consists of 60000 images, each of 32x32 dimensions, in 10
classes with about 6000 images per class. The images are split into a training set of
50000 images and a testing set of 10000 images. [20]
The dataset is further divided into five training batches and one test batch, where
each batch consists of 10000 images. To create the test set, 1000 images are chosen
randomly from each class, making 10000 images in total. The remaining images form the
training set, arranged in random order; as an exception, some training batches may
contain more images from one class than from the others.
Here is the list of classes seen in CIFAR-10 dataset.
1. Airplane
2. Automobile
3. Bird
4. Cat
5. Deer
6. Dog

7. Frog
8. Horse
9. Ship
10. Truck
All of these classes are mutually exclusive, with no duplicated images between
classes.
The images are stored as a dictionary in the batch files:
1. Data – a 10000x3072 NumPy array. Each row encodes a 32x32 color image: the first
1024 entries hold the red channel values, the next 1024 the green channel values, and
the last 1024 the blue channel values.
2. Labels – a list of 10000 numbers in the range 0-9. The number at the nth position
is the label of the nth image in the data array.
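As a reference for how such a batch dictionary is read in practice, the sketch below
unpickles one CIFAR-10 batch file and reshapes the rows into per-channel image
planes; the file path is a placeholder.

```python
import pickle
import numpy as np

def load_cifar10_batch(path="cifar-10-batches-py/data_batch_1"):
    # Each batch file is a pickled dict with b'data' (10000x3072 uint8) and b'labels'.
    with open(path, "rb") as f:
        batch = pickle.load(f, encoding="bytes")
    # Rows are laid out as 1024 red, 1024 green, then 1024 blue values per 32x32 image.
    images = batch[b"data"].reshape(-1, 3, 32, 32)
    labels = np.array(batch[b"labels"])
    return images, labels
```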

3.2 CIFAR-100 Dataset

The CIFAR-100 dataset is similar to CIFAR-10 except that it has 100 classes instead
of 10, with 600 images per class. The training set consists of 500 images per class
and the testing set of 100 images per class, giving a total of 50000 training images
and 10000 testing images.[21] The 100 classes are grouped into 20 superclasses, and
each image carries two labels: a 'fine' label for the class it belongs to and a
'coarse' label for its superclass.
Here are some superclasses and classes of the CIFAR-100 dataset.

Superclass Class
Aquatic mammals Beaver, dolphin, otter, seal
Fish Flatfish, ray, shark, trout
Flowers Orchids, poppies, roses, tulips
Insects Bee, beetle, butterfly, caterpillar
Trees Maple, oak, palm, pine
People Baby, boy, girl, man, woman
Fruit and vegetables Apples, oranges, pears, sweet peppers

TABLE II: SUPERCLASS AND CLASSES OF CIFAR100 DATASET

Chapter 4
CiM BASED eDRAM
HARDWARE SETUP

4.1 Proposed eDRAM bitcell

The previous chapter described how conventional eDRAM cells function. The memory cell
must be modified to suit in-memory computing; the transformed cell can potentially
replace SRAM cells for in-memory computing thanks to its higher density and lower
power consumption.
The modified eDRAM cell is shown in the figure below.

Figure 4.1: eDRAM bitcell for In-Memory Computing

The eDRAM cell above, used to perform in-memory computing, is a potential replacement
for SRAM cells. Its read, write, and hold operations are similar to those of the
conventional eDRAM cell. M1 is the write transistor, which allows the data presented
on the Write-Bitline (WBL) to be stored in the storage node when the Write-Wordline
(WWL) is fired. M2 is the storage transistor, which holds the data in the storage
node; its source terminal is connected to the Column Line (CL), which facilitates
in-memory computing. M3 is the read transistor, which affects the Product Line (PL)
whenever the Row Line (RL) is fired.
The key change in the new design is that the source terminal of the storage
transistor M2, which would normally be connected to the VDD supply, is instead
connected to the Column Line to perform the in-memory computing operation with these
state-of-the-art 1-bit memory cells. To perform in-memory computing within the memory
array for the neural network layers, the weights and inputs needed for the
multiply-accumulate operation must be stored in, or fed to, the memory cells.
Alongside power and performance, robustness is also an important criterion. To make
the design robust to dynamic decisions, dropout-based Bayesian inference is
performed: Monte Carlo dropout is applied to the input features before they are
stored in the memory array, as shown in the figure below. Bayesian inference performs
statistical inference over hundreds to thousands of iterations. It is vital to have a
compact, energy-efficient, and yet robust CiM framework, which is achieved by
performing Monte Carlo dropout on the neurons before storing them in the memory
arrays; Bayesian inference is activated during the testing phase to preserve the
performance of the network despite the dropped neurons.
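A minimal software sketch of the Monte Carlo dropout step described above is given
below; `forward` is a hypothetical model function and the dropout probability is an
illustrative choice, not a value from this design.

```python
import numpy as np

def mc_dropout_predict(forward, x, passes=100, p=0.5, seed=0):
    # Monte Carlo dropout: keep dropout active at inference time, draw a fresh mask
    # on every pass, and average the resulting predictions (Bayesian-style inference).
    rng = np.random.default_rng(seed)
    outputs = []
    for _ in range(passes):
        mask = (rng.random(x.shape) > p).astype(x.dtype)
        outputs.append(forward(x * mask / (1.0 - p)))   # inverted-dropout scaling
    return np.mean(outputs, axis=0)
```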

The MAC operation y = wx + b needs the weights and inputs to be multiplied and then
added with the bias. Through the write transistor M1, the weights are written into
the storage node from the Write-Bitline (WBL) when the Write-Wordline (WWL) is high.
The inputs are applied through the Column Line (CL); here, the inputs pass through an
inverter before driving CL. Transistor M3, connected to the Row Line (RL) and Product
Line (PL), performs the multiplication. The PL line is precharged to VDD, and
depending on the levels of the input and the stored weight, the voltage on the
Product Line varies: if both the stored weight and the applied input are data '1',
the product line discharges to data '0'; if either the input or the weight is data
'0', PL remains high. This realizes the bitwise multiplication between the input and
the weight: whenever the input bit and the stored weight bit are both high, the PL
line discharges, otherwise it remains high.
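The behavior just described amounts to a 1-bit multiply (logical AND) read out as a
discharge on PL; the sketch below is a purely behavioral model of that rule, with the
precharge voltage as an illustrative parameter.

```python
def product_line_level(weight_bit, input_bit, v_pch=1.0):
    # PL is precharged to VDD and discharges only when both the stored weight bit
    # and the applied input bit are 1; otherwise it stays at the precharge level.
    return 0.0 if (weight_bit == 1 and input_bit == 1) else v_pch
```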

Figure 4.2: eDRAM bitcell for In-Memory Computing

The figure below shows the in-memory computing block diagram. In the proposed design,
the eDRAM-cell-based macro consists of µ-arrays and µ-channels. One µ-array is
assigned to each weight channel. In every µ-array, the DNN weights are stored
column-wise, with each bit-plane of the weights arranged in a row. Thus, for an
N-dimensional weight channel where each weight has m-bit precision, every µ-array
requires m rows and N columns of eDRAM cells.

4.2 The Proposed CiM Framework for 64x15

eDRAM macro

Figure 4.3: CiM framework using eDRAM cells for 64x15 macro

The eDRAM cell shown in the previous figure has decoupled read/write and product
operations, which allows the value to be held during the storage phase; the
decoupling reduces interference between the operations and limits the impact of
process variability.
Every µ-array is associated with a µ-channel. The µ-channels transmit the inputs to,
and the outputs from, the µ-arrays. There is flexibility in the number of µ-arrays
when a weight filter has more channels: the µ-channels attach the µ-arrays together,
allowing the inputs to be shared among them.
The loading of the input feature map is reduced by the following technique. As
mentioned, µ-arrays are merged to accommodate weights with more channels; when two
channels are merged, the top channel receives the data directly from the bottom
channel, which eliminates a separate input load for the top channel and reduces the
loading overhead.

Figure 4.4: Mapping of weights and inputs into eDRAM macro

The figure above shows the mapping of inputs and weights into the eDRAM macro, along
with the sequence of operations. For the first step of computing sign(x) · abs(w)
within w ⊕ x, sign(x) is loaded into the µ-channel and multiplied with the abs(w)
rows of the µ-array. For the last step, computing sign(w) · abs(x), the bit-planes of
the absolute value of the input vector x are loaded sequentially into the µ-channel
and multiplied with the sign(w) row of that µ-array. Here, sign(w) and sign(x) can be
replaced with step(w) and step(x), respectively, since a negative weight/input bit is
treated as '0' and a positive one as '1' in the step function.
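To make the bit-plane mapping concrete, the sketch below splits the absolute values
of one m-bit weight channel into the bit-plane rows that a µ-array would store; the
4-bit precision is an illustrative assumption.

```python
import numpy as np

def weight_bitplanes(w, m=4):
    # Returns m binary vectors, one per bit-plane (LSB first); each would occupy
    # one row of a mu-array, with the N weights of the channel spread across columns.
    w_abs = np.abs(np.asarray(w)).astype(np.uint32)
    return [((w_abs >> b) & 1).astype(np.uint8) for b in range(m)]
```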

4.3 Operation Cycles

In every µ-array, x ⊕ w is computed bit-plane by bit-plane. Evaluating step(x) ·
abs(w) for the ith weight bit-plane takes one instruction cycle. As explained
previously, the input values step(x) are inverted and fed onto the Column Lines (CL)
through the µ-channels. Initially, the SL lines are discharged to a low voltage by
asserting the Row Line Discharge signal, which ensures that any residual voltage
stored on SL is discharged to zero. This step is followed by precharging the PL
lines. Once the clock switches, the PL lines are disconnected from the tri-state
MUXes and left floating. Next, the Row Line (RL) corresponding to the ith bit-plane
of w is turned on. At this stage, depending on the weight bit wi,j of column j and
the input value xi, the product line PL either discharges or remains at its
precharged level.

Figure 4.5: Timing Diagram of the control signals in the proposed CiM framework

As observed in the previous section, whenever wi,j and xi are both high, PL
discharges. All the columns are then averaged over the sum line to generate the MAV
result for the input vector and weight bit-plane. The figure above shows the
instruction sequence to compute MAV, consisting of precharge, product, and average
stages. To minimize the leakage current, the eDRAM cells are kept in hold mode, and
an additional clock cycle is provided for the Product Line to discharge.
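A functional view of one such bit-plane cycle, and of the subsequent combination of
the per-plane results, is sketched below; the function names and the LSB-first plane
ordering are our assumptions.

```python
import numpy as np

def mav_step(input_bits, weight_plane):
    # One instruction cycle: bitwise product (AND) of the input bits with one weight
    # bit-plane, then averaging over the columns as the sum line does by charge sharing.
    return np.mean(input_bits & weight_plane)

def shift_and_add(adc_codes):
    # Combine the digitized per-bit-plane results by binary significance (LSB plane first),
    # mirroring the shift-and-add stage that follows the ADC.
    return sum(code << b for b, code in enumerate(adc_codes))
```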

4.4 4-bit Flash ADC

The output on the sum cells is analog, and these analog voltages need to be converted into digital values. In this work, a 4-bit flash ADC is used to produce the required 4-bit digital output. A flash ADC is also referred to as a parallel analog-to-digital converter. The figure below shows the simple 4-bit flash ADC used here.
A series of comparators is arranged such that each comparator has two inputs: the input signal and a reference voltage. The comparator outputs are fed into a priority encoder circuit, which produces the binary output. The reference voltage Vref drives a resistive ladder that generates a different voltage level at every node; these node voltages act as the reference voltages for the comparators. The sum line voltage is compared against these generated reference levels at all the comparators. When the sum line voltage is higher than a comparator's reference voltage, that comparator outputs logic 1, and logic 0 otherwise.
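A compact behavioral model of this conversion chain (resistive-ladder references, thermometer-coded comparator outputs, priority encoding) is sketched below; the 1 V reference and the equal-segment ladder are assumptions made for illustration and only approximate the actual node voltages shown later in Figure 5.3.

```python
def flash_adc_4bit(v_in, v_ref=1.0, n_comp=15):
    # 4-bit flash ADC sketch: 15 comparators against a resistive ladder,
    # thermometer code reduced to binary by a priority encoder.
    refs = [v_ref * k / (n_comp + 1) for k in range(1, n_comp + 1)]   # nodes nr1..nr15
    thermometer = [1 if v_in > r else 0 for r in refs]                # comparator outputs
    code = sum(thermometer)            # priority encoder: index of highest asserted comparator
    return thermometer, format(code, '04b')

therm, bits = flash_adc_4bit(0.815)
print(sum(therm), bits)   # 13 comparators at logic 1 -> 1101 with these assumed ladder levels
```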

Figure 4.6: 4-bit FLASH ADC

The comparator design used in this work is shown in the figure below. To attain rail-to-rail input operation, a cross-coupled comparator with both n-type and p-type stages is used. The n-type stage is built from NMOS transistors and receives its inputs on NMOS devices, while the p-type stage is built from PMOS transistors and receives its inputs on PMOS devices. The transistors of the two stages are cross-coupled. As the input voltage approaches 0 V, the PMOS devices, which are triggered by low voltage levels, dominate. Because of the cross-coupling, either the n-type or the p-type comparator instance can dominate while the other is overridden.

Figure 4.7: N-Type and P-Type Comparator

The priority encoder generates the 4-bit digital output based on the comparator outputs. The waveforms show how the encoder output changes with variations in the comparator outputs.
Since the comparators are arranged sequentially from the lowest to the highest reference level, a regular priority encoder suffices to realize the correct 4-bit flash ADC output.

Figure 4.8: The Overall Framework of the Proposed CiM Architecture using eDRAM
macro

Advantages of the proposed design:

The use of the multiplication-free operator inference framework with µ-arrays and µ-channels has the following advantages.

1. Adopting the multiplication-free operator eliminates the use of DACs while performing in-memory computing. Although in certain implementations the overhead of DACs can be avoided by using parallel channels, such frameworks do not find use in state-of-the-art neural network architectures like MobileNet. For thin convolutional layers, DAC-free frameworks are more efficient, as they provide fine-grained embedding of µ-channels.

2. The multiplication-free operator can be leveraged to perform bit-plane-wise operations, which reduces the demand for high-precision ADCs. Performing bit-plane-wise operations with the traditional multiply-accumulate operator requires O(n²) operating cycles for n-bit operands (e.g., 8 × 8 = 64 cycles at 8-bit precision), whereas the multiplication-free operator needs only about O(2n) cycles (e.g., 16 cycles) for the same bit-plane-wise operation.

Chapter 5
RESULTS

5.1 Timing Analysis

This section discusses the timing analysis of the proposed CiM framework using eDRAM cells. Four control signals are required to realize the operation of the architecture. At the beginning of every operation, the product line is precharged to VPCH and the sum line is discharged to zero.
All of the product lines are connected through tri-state MUXes. The waveform below shows the multiplication operation taking place in memory, with the outputs observed on the corresponding product ports. As discussed earlier, PL discharges only when both the input and the weight store data 1; assuming uniformly random bits, there is only a 25% chance for PL to discharge, and in all other cases it remains high.
At 1 ns, PCH goes low, activating the PFET and precharging PL to data 1. At 1.5 ns, RL0 is fired, allowing the product port to either discharge or remain high. As observed in the waveform below, PL0 discharges because both its input and weight store data 1. PL1, however, remains high: even though its weight stores data 1, its input stores data 0, so there is no path for the product port to discharge.

Figure 5.1: Multiplication Operation in In-Memory System

Before starting the multiplication process, it is important to discharge any arbitrary voltage stored on the sum line (SL). A discharge transistor connected to the sum line is activated by the RLD signal. The PCH and RLD operations can be overlapped because they are independent of each other. When the SUM0 signal goes high, the averaging operation takes place and is reflected on the sum line signal. This marks the end of the multiply-and-accumulate operation in the in-memory system.

Figure 5.2: Average/Accumulation Operation

The output from the sum line serves as the input voltage to the flash ADC. The reference voltage is divided into different levels at the nodes of the resistive ladder, and the comparators compare the sum line voltage against these levels. Each comparator outputs either 0 or 1 depending on whether its input voltage exceeds its reference voltage. The 15-bit comparator output is then encoded by the priority encoder into a 4-bit output. This step is followed by a shift-add operation, in which the output from the encoder is shifted to the right by 1 bit and then added to the previous result.
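For reference, a conventional bit-serial shift-and-add accumulator is sketched below; it assumes the bit-planes are processed MSB-first with a left-shift of the running result, which is a common formulation and only an approximation of the right-shift sequence described above, not the exact hardware step.

```python
def shift_add_accumulate(adc_codes):
    # Combine 4-bit ADC codes from successive bit-planes into one multi-bit
    # result (bit-serial accumulator sketch, MSB-first ordering assumed;
    # the actual design's shift direction and ordering may differ).
    acc = 0
    for code in adc_codes:          # adc_codes[0] = MSB bit-plane result
        acc = (acc << 1) + code     # double the previous result, add the new code
    return acc

# e.g. three bit-plane conversions producing codes 5, 3, 7 (hypothetical values)
print(shift_add_accumulate([5, 3, 7]))   # (5*4) + (3*2) + 7 = 33
```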

Figure 5.3: Voltage levels of the Resistive Ladder

This figure shows the comparator and encoder outputs for an SL voltage of 0.815 V. From the previous figure, the nodes from nr13 down to nr1 have voltage levels lower than the input voltage of 0.815 V. Thus, thirteen comparators produce an output value of 1 while the top two produce 0.
The comparator outputs are then fed into the encoder, which encodes the 15-bit comparator output into a 4-bit output.

Figure 5.4: Comparator and Encoder Outputs

This figure shows the working of the entire CiM framework with the control signals
and the outputs from every stage of the In-Memory network.

Figure 5.5: The Timing Analysis of the entire CiM framework

5.2 Linearity Accuracy

A linearity-accuracy analysis is performed on the CiM framework to assess the accuracy of the CiM engine. From the table and graph below, it is evident that the response is close to linear: as the number of active input bits decreases, fewer product lines discharge and the SL voltage increases almost linearly (a quick numerical check of this claim is sketched after Table III). Some abnormalities arise from the non-linear nature of the transistors. Even though there are non-linearities in the analog voltage levels, the ADC output bits are perfectly linear.

Inputs        SL (V)    ADC bits
15’h7FFF 0.04 0000
15’h7FFE 0.101 0001
15’h7FFC 0.15 0010
15’h7FF8 0.19 0011
15’h7FF0 0.27 0100
15’h7FE0 0.32 0101
15’h7FC0 0.38 0110
15’h7F80 0.43 0111
15’h7F00 0.52 1000
15’h7E00 0.56 1001
15’h7C00 0.62 1010
15’h7800 0.71 1011
15’h7000 0.75 1100
15’h6000 0.78 1101
15’h4000 0.89 1110
15’h0000 0.97 1111

TABLE III: LINEARITY ACCURACY OUTPUT VALUES
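As a quick numerical check of the linearity claim, the short script below fits a straight line to the Table III values; the use of NumPy's polyfit and the "number of 1 bits" interpretation of the input codes are assumptions of this sketch.

```python
import numpy as np

# SL voltages from Table III, ordered by the number of '1' input bits (15 down to 0)
ones = np.arange(15, -1, -1)
sl_v = np.array([0.04, 0.101, 0.15, 0.19, 0.27, 0.32, 0.38, 0.43,
                 0.52, 0.56, 0.62, 0.71, 0.75, 0.78, 0.89, 0.97])

# Least-squares straight-line fit: SL ~ slope*ones + intercept
slope, intercept = np.polyfit(ones, sl_v, 1)
residual = sl_v - (slope * ones + intercept)
print(f"slope = {slope:.4f} V/bit, intercept = {intercept:.3f} V")
print(f"max deviation from the fit = {np.max(np.abs(residual))*1000:.1f} mV")
```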

Figure 5.6: Linearity Accuracy Plot for Inputs and SL Line Voltage

Figure 5.7: Linearity Accuracy of 4-bit Flash ADC

5.3 Process Variability

The CiM network is subjected to process variation to understand the tolerance of the design. In this work, the design is evaluated at threshold-voltage sigma values of 15 mV, 25 mV, and 35 mV. With these three values, we observe how the design responds to process variability. 100 Monte Carlo simulations are run to observe the variance of the design.
For a 15 mV sigma, the histogram shows a Gaussian-like trend, indicating that the variation in the output SL voltage is concentrated around the mean value. A standard deviation of 0.0041 V with a mean of 0.3260 V is observed. The ADC step size is 62.5 mV, and two sigma corresponds to 0.0084 V (8.4 mV), which is well within this bound.

Figure 5.8: Histogram Plot for Sigma Variation of 15mV

Similar to the histogram for the 15 mV sigma Vth, the curve below also follows a Gaussian-like trend. For this sigma Vth, the standard deviation is about 0.0067 V with a mean of 0.3245 V. The two-sigma value is again well within the bounds for the 25 mV sigma Vth variation.

Figure 5.9: Histogram Plot for Sigma Variation of 25mV

For a sigma Vth value of 35 mV, the behavior follows a similar trend as for the other two variation values: the mean is about 0.3241 V and the standard deviation is 0.0115 V. The two-sigma value is again well within the bounds.

Figure 5.10: Histogram Plot for Sigma Variation of 35mV

This plot shows the output sigma versus the sigma Vth. The curve shows that as the process variation increases, the standard deviation of the SL voltage also increases. Ideally, we would expect the curve to be flat, but this is not the case due to non-linearities in the design.

Figure 5.11: Sigma vs Sigma Vth Plot

5.4 Power Dissipation

This section shows how power is consumed across the entire CiM framework. As can be observed from the table and the pie chart, the maximum power consumption occurs in the ADC operation, which is caused by the comparator circuits. The supply voltage is about 0.65 V for the digital circuits, about 0.85 V for the eDRAM macro, and about 1 V for the ADC circuits.

Overall Power Consumed (µW)      15.1515
Power of eDRAM macro (µW)         2.216
Power of ADC Circuits (µW)        9.77
Power of Digital Circuits (µW)    3.1655

TABLE IV: OVERALL POWER CONSUMED IN CIM FRAMEWORK

Figure 5.12: Power Consumption in CiM Framework

The pie chart in Figure 5.12 shows the power distribution for the 64x15 eDRAM macro. The maximum power is consumed by the 16 comparators in the ADCs; from Table IV, the ADC circuits account for roughly 64% of the total, compared with about 15% for the eDRAM macro and 21% for the digital circuits.

5.5 Energy Consumption Comparison

This section compares the energy consumption of the proposed design with different state-of-the-art in-memory computing designs. The design in this work consumes about 5.54 fJ/Op of energy for a 64x15 array.

Array Size     Memory Cell   Technology Node   Energy per Operation
64x15          eDRAM         16 nm             5.54 fJ/Op
64x16 [10]     SRAM          7 nm              24.81 fJ/Op
256x256 [22]   RRAM          45 nm             25.77 fJ/Op

TABLE V: ENERGY COMPARISON WITH OTHER STATE-OF-THE-ART DESIGNS

Chapter 6
FUTURE SCOPE

The proposed CiM framework has a compact design, as the 3T eDRAM cells provide higher memory density than existing in-memory cells. However, this alone is not sufficient to scale CiM to large neural networks. Even a simple MobileNet neural network requires millions of MAC operations to be performed within memory, and a small memory footprint by itself does not address the massive computational energy consumed. Thus, other techniques such as dropout-based Bayesian inference can be implemented on the in-memory cells to address the high energy expenditure. Bayesian inference typically requires about 100-1000 iterations, which makes such designs highly computationally expensive; the eDRAM-based CiM architecture could be leveraged to find a solution to this problem. For on-edge intelligence applications that demand low energy, small area, and high robustness for dynamic decision making, the eDRAM-based CiM architecture can be adopted.

Chapter 7
CONCLUSION

Compute-in-memory has been widely adopted in various applications as a potential solution to the energy constraints they present. Substantial energy is saved because CiM architectures avoid data movement between the processor and the memory arrays. Most designs focus on designing efficient memory cells; in this work, instead, an operator is adopted that can adapt to the memory's processing and physical constraints. The design uses compact eDRAM cells to perform 8-bit multiply-and-accumulate operations. With the help of the multiplication-free operator, the CiM is DAC-free even for multi-bit precision and does not require high-precision ADCs to realize a MAC operation. The operations performed within the memory arrays are analog in nature, and their output needs to be converted back into the digital domain using ADC circuits; in our design, a 4-bit flash ADC converts the analog signal into 4-bit outputs. This operation is followed by a shift-add stage, which shifts and adds the result from the previous cycle. The proposed compute-in-memory framework using eDRAM cells is energy- and area-efficient due to its low leakage power and compact design.

REFERENCES
[1] "Machine learning for beginners: An introduction to neural networks." [Online]. Available: https://towardsdatascience.com/machine-learning-for-beginners-an-introduction-to-neural-networks-d49f22d238f9
[2] "Using neural nets to recognize handwritten digits." [Online]. Available: http://neuralnetworksanddeeplearning.com/chap1.html
[3] "A beginner's guide to neural networks and deep learning." [Online]. Available: https://wiki.pathmind.com/neural-network
[4] "What is in-memory computation." [Online]. Available: https://hazelcast.com/glossary/in-memory-computation/
[5] A. Sebastian, M. Le Gallo, R. Khaddam-Aljameh, and E. Eleftheriou, "Memory devices and applications for in-memory computing," Nature Nanotechnology, 2020. [Online]. Available: https://doi.org/10.1038/s41565-020-0655-z
[6] "In-memory computing." [Online]. Available: https://www.eejournal.com/article/in-memory-computing/
[7] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," 2019.
[8] "MobileNet version 2." [Online]. Available: https://machinethink.net/blog/mobilenet-v2/
[9] J. Zhang, Z. Wang, and N. Verma, "In-memory computation of a machine-learning classifier in a standard 6T SRAM array," 2017.
[10] A. Biswas and A. P. Chandrakasan, "CONV-SRAM: An energy-efficient SRAM with in-memory dot-product computation for low-power convolutional neural networks," IEEE Journal of Solid-State Circuits, vol. 54, 2019. [Online]. Available: https://ieeexplore.ieee.org/document/8579538
[11] "eDRAM-based tiered-reliability memory with applications to low-power frame buffers," ISLPED, 2014. [Online]. Available: https://ieeexplore-ieee-org.proxy.cc.uic.edu/stamp/stamp.jsp?tp=&arnumber=7298279
[12] D. Somasekhar, Y. Ye, P. Aseron, S.-L. Lu, and M. Khellah, "2 GHz 2 Mb 2T gain-cell memory macro with 128 GB/s bandwidth in a 65 nm logic process," in IEEE International Solid-State Circuits Conference (ISSCC) Dig. Tech. Papers, 2008. [Online]. Available: https://ieeexplore.ieee.org/document/4523163
[13] S. Birla, R. Kumar, and M. Pattanaik, "Stability and leakage analysis of a novel PP based 9T SRAM cell using N curve at deep submicron technology for multimedia applications." [Online]. Available: https://www.researchgate.net/figure/Conventional-6T-SRAM-cell_fig1_220073701
[14] "Samsung analyst day 2013: Display trends." [Online]. Available: https://www.samsung.com/
[15] A. Carroll and G. Heiser, "An analysis of power consumption in a smartphone," Proc. of USENIX.
[16] K. C. Chun, P. Jain, T.-H. Kim, and C. H. Kim, "A 667 MHz logic-compatible embedded DRAM featuring an asymmetric 2T gain cell for high speed on-die caches," IEEE Journal of Solid-State Circuits, vol. 47, 2012.
[17] K. C. Chun, P. Jain, J. H. Lee, and C. H. Kim, "A 3T gain cell embedded DRAM utilizing preferential boosting for high density and low power on-die caches," IEEE Journal of Solid-State Circuits, vol. 46, 2011. [Online]. Available: http://people.ece.umn.edu/groups/VLSIresearch/papers/2011/JSSC11_DRAM.pdf
[18] M. Ichihashi, H. Toda, Y. Itoh, and K. Ishibashi, "0.5 V asymmetric three-Tr. cell (ATC) DRAM using 90 nm generic CMOS logic process," Proc. VLSI Circuits Symp., 2005.
[19] W. Luk, J. Cai, R. Dennard, et al., "A 3-transistor DRAM cell with gated diode for enhanced speed and retention time," Symposium on VLSI Circuits, 2007. [Online]. Available: https://ieeexplore.ieee.org/document/1705371
[20] "The CIFAR-10 dataset." [Online]. Available: https://www.cs.toronto.edu/~kriz/cifar.html
[21] "The CIFAR-100 dataset." [Online]. Available: https://www.cs.toronto.edu/~kriz/cifar.html
[22] S. Zhang, K. Huang, and H. Shen, "A robust 8-bit non-volatile computing-in-memory core for low-power parallel MAC operations," IEEE Transactions, 2020. [Online]. Available: https://ieeexplore.ieee.org/document/8998360
[23] R. Kalla and B. Sinharoy, "POWER7: IBM's next generation server processor," IEEE Hot Chips 21 Symposium, 2009. [Online]. Available: https://ieeexplore.ieee.org/document/7478381
[24] R. E. Matick and S. E. Schuster, "Logic-based eDRAM: Origins and rationale for use," IBM Journal of Research and Development, vol. 49, 2005. [Online]. Available: https://ieeexplore.ieee.org/document/5388855
[25] J. Barth, W. R. Reohr, P. Parries, G. Fredeman, et al., "A 500 MHz random cycle, 1.5 ns latency, SOI embedded DRAM macro featuring a three-transistor micro sense amplifier," IEEE Journal of Solid-State Circuits, vol. 43, 2009. [Online]. Available: https://ieeexplore.ieee.org/document/4443182
[26] S. Romanovsky, A. Katoch, A. Achyuthan, et al., "A 500 MHz random-access embedded 1 Mb DRAM macro in bulk CMOS," 2008. [Online]. Available: https://ieeexplore.ieee.org/document/4523161
[27] P. J. Kim, J. Barth, W. R. Reohr, et al., "A 1 Mb cache subsystem prototype with 1.8 ns embedded DRAMs in 45 nm SOI CMOS," 2009. [Online]. Available: https://ieeexplore.ieee.org/document/4804985
[28] "A priority-based 6T/8T hybrid SRAM architecture for aggressive voltage scaling in video applications," IEEE Transactions, 2011. [Online]. Available: https://ieeexplore.ieee.org/document/5686921
[29] Q. Dong et al., "A 351 TOPS/W and 372.4 GOPS compute-in-memory SRAM macro in 7nm FinFET CMOS for machine-learning applications," ISSCC, 2020. [Online]. Available: https://ieeexplore.ieee.org/document/9062985

VITA

SHRUTHI JAISIMHA

Education      M.S. Electrical and Computer Engineering, University of Illinois at Chicago, 2019 – 2021
               B.E. Electronics and Communication, Visvesvaraya Technological University, 2014 – 2018

Coursework     Introduction to VLSI Design
               Advanced Computer Architecture
               High Performance ICs and Systems
               HDL based System Design
               High Performance Processors and Systems
               Advanced VLSI Design
               Testing and Reliability of Digital Systems

Projects       Design & Verification of MIPS Single & Multi Cycle Processors
               MAC Datapath for Neural Networks
               Power and Clock Glitch in AES Implemented Hardware
               Memory Scheduling Algorithms on the USIMM Simulator
               Design of Energy Efficient and Size Reduced SRAM Cell
               Gabor Features for Single Sample Face Recognition

Experience     Graduate Technical Intern, Intel Corporation, 2020 – 2021
               Engineering Intern, Elite RF, 2020 – 2021
               Project Trainee Engineer, BRICOM Technologies, 2019
