DNN Architectures
Deep neural networks (DNNs) are widely adopted for computer vision due to their well-established performance attributes. DNN algorithms achieve high accuracy through powerful multilevel feature extraction, but this comes at the cost of a very large number of parameters, high computational complexity, and a large memory footprint. The availability of CPUs, GPUs, TPUs and NPUs with high-precision processing units and abundant memory has made it possible to execute computationally intensive DNNs in various real-world applications involving image classification, speech recognition, and natural language processing. Image recognition uses DNNs to automatically identify objects, people, places and actions in images, and is used for tasks such as labelling images with descriptive tags, searching for content in images, and guiding robots, autonomous vehicles, and driver-assistance systems.
DNNs use multiple layers of nodes to derive high-level functions from input data. The term DNN reflects the fact that the network is deep in its core layers, with more than one hidden layer; this depth allows the data to be transformed into progressively more discriminative representations and underpins the powerful multilevel feature extraction and high accuracy of these models. For vision-based applications, a DNN typically takes an image as input, extracts feature maps for various regions of interest (ROIs) in the frame, and localises them.
The various layers of a DNN perform the following operations in a hierarchical manner.
1. Convolution: This is a linear operation in which a kernel (also referred to as a filter) containing a set of learned weights is applied to the input feature maps to compute new feature maps, producing the convolved features. During this process, the kernel slides over the grid of the input feature map both vertically and horizontally, pixel by pixel, computing a weighted sum over each corresponding tile \cite{37}. The convolution of an input feature map $H$ with a filter $F$, written $J = H * F$, is represented as:
$$J(i, j) = \sum_{m}\sum_{n} H(i + m,\, j + n)\, F(m, n),$$
where the sums run over the kernel indices.
2. Activation layer: The activation layer applies a linear or nonlinear function so that important features are not lost. Common activation functions include the sigmoid function, the TanH function and the Rectified Linear Unit (ReLU) function. The main purpose of the activation layer is to determine the output of each node by applying a linear or nonlinear transformation; the output layer is computed using a softmax function together with a cross-entropy cost function. In the proposed work, we use the ReLU transformation as the activation function in order to introduce non-linearity into the model. The ReLU activation function is described as:
$$\mathrm{ReLU}(x) = \max(0, x)$$
3. Pooling: The convolved features are down-sampled by the DNN, reducing the spatial dimensions of the feature maps without discarding critical feature information.
4. Fully connected layer: The main functionality of the fully connected layer is to arrive at the output by combining the features extracted by the previous layers. The output vectors are the result of activation units, where each activation unit is connected to all the nodes of the previous layer in a feed-forward manner. For classification tasks, the fully connected layers are followed by a softmax layer that gives the probability of the input belonging to each class. The number of parameters at each fully connected layer M is obtained using Eq. 3. A minimal sketch of these four layer operations is given after this list.
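To make the four layer operations above concrete, the following is a minimal single-channel NumPy sketch; the function names and toy sizes are illustrative assumptions rather than the implementation used in this work, and the fully connected parameter count shown is the usual weights-plus-biases formula, which may or may not coincide with Eq. 3.

```python
import numpy as np

def conv2d(H, F, stride=1):
    """Valid convolution (cross-correlation) of input feature map H with kernel F."""
    kh, kw = F.shape
    oh = (H.shape[0] - kh) // stride + 1
    ow = (H.shape[1] - kw) // stride + 1
    J = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            tile = H[i * stride:i * stride + kh, j * stride:j * stride + kw]
            J[i, j] = np.sum(tile * F)          # weighted sum over the current tile
    return J

def relu(x):
    """ReLU activation: max(0, x) applied element-wise."""
    return np.maximum(0, x)

def max_pool(x, size=2, stride=2):
    """Down-sample a feature map by keeping the maximum of each window."""
    oh = (x.shape[0] - size) // stride + 1
    ow = (x.shape[1] - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = x[i * stride:i * stride + size, j * stride:j * stride + size].max()
    return out

def fc_params(n_in, n_out):
    """Weights-plus-biases parameter count of a fully connected layer."""
    return n_in * n_out + n_out

H = np.random.rand(6, 6)                        # toy single-channel input feature map
F = np.random.rand(3, 3)                        # toy 3x3 kernel
feat = max_pool(relu(conv2d(H, F)))             # convolve -> activate -> pool
print(feat.shape, fc_params(4096, 1000))        # (2, 2) 4097000
```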
Maintaining the accuracy and throughput of DNN architectures while addressing constraints on model size, power consumption, computational resource utilization, run-time memory bandwidth, and the number of computing operations remains a challenge for deploying these models on embedded platforms.
2.2.1 AlexNet
2.2.2 VGG-19
2.2.3 Resnet
2.2.4 Inception Net V3
2.2.5 Squeeze Net
2.2.6 MobileNet-V3
AlexNet Architecture
The AlexNet model was one of the most important pioneering steps in accelerating the development of DNN models for image classification, winning the ILSVRC challenge in 2012. The model consists of 8 layers with learnable parameters: 5 convolution layers and 3 fully connected layers, as shown in Figure 5 below. The ReLU activation function is used in the first 7 of these layers, and the last layer uses a softmax function to select the predicted class based on the class scores. The diagram below shows the AlexNet architecture trained on the ImageNet dataset.
Figure 5: AlexNet Architecture
The input image size used for the model is 227 × 227 × 3. The first convolution layer has 96 filters, each of size 11 × 11 with stride 4; its output feature map is 55 × 55 × 96. Next comes a max-pooling layer of size 3 × 3 with stride 2, which reduces the height and width of the features as we move forward across the layers; the resulting feature map is 27 × 27 × 96. The second convolution layer has 256 filters of size 5 × 5 with stride 1 and padding 2, giving an output of 27 × 27 × 256. A second max-pooling operation of size 3 × 3 with stride 2 then reduces the feature map to 13 × 13 × 256. The third convolution layer has 384 filters of size 3 × 3 with stride 1 and padding 1, producing an output of shape 13 × 13 × 384. The fourth convolution layer also has 384 filters of size 3 × 3 with stride 1 and padding 1, so the feature map size remains 13 × 13 × 384. The final convolution layer has 256 filters of size 3 × 3 with stride 1, padding 1 and ReLU activation; the resulting feature map is 13 × 13 × 256. Finally, a third max-pooling operation of size 3 × 3 with stride 2 produces the final feature map of shape 6 × 6 × 256.
The goal of these layer-by-layer convolutions is to extract as many features as possible, decreasing the height and width of the input while increasing the depth of the output feature maps. Finally, the fully connected layers, with a dropout rate of 0.5, make predictions at the final classifier layer over the 1000 ImageNet classes based on the extracted information.
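The layer stack described above can be written compactly in PyTorch; the following sketch mirrors the stated filter sizes, strides and paddings. The 4096-unit hidden fully connected layers follow the original AlexNet paper and are an assumption not stated explicitly above.

```python
import torch
import torch.nn as nn

# Sketch of the AlexNet layer stack described above (ImageNet, 1000 classes).
alexnet = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),              # 227x227x3 -> 55x55x96
    nn.MaxPool2d(kernel_size=3, stride=2),                              # -> 27x27x96
    nn.Conv2d(96, 256, kernel_size=5, stride=1, padding=2), nn.ReLU(),  # -> 27x27x256
    nn.MaxPool2d(kernel_size=3, stride=2),                              # -> 13x13x256
    nn.Conv2d(256, 384, kernel_size=3, stride=1, padding=1), nn.ReLU(), # -> 13x13x384
    nn.Conv2d(384, 384, kernel_size=3, stride=1, padding=1), nn.ReLU(), # -> 13x13x384
    nn.Conv2d(384, 256, kernel_size=3, stride=1, padding=1), nn.ReLU(), # -> 13x13x256
    nn.MaxPool2d(kernel_size=3, stride=2),                              # -> 6x6x256
    nn.Flatten(),
    nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
    nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),                                              # 1000 ImageNet classes
)

print(alexnet(torch.randn(1, 3, 227, 227)).shape)                       # torch.Size([1, 1000])
```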
The experimental results of the AlexNet architecture for image classification are as follows:
VGG-19 Architecture
In 2014, the Visual Geometry Group at Oxford proposed the VGG-19 DNN model, which was among the top performers in the ILSVRC image-classification challenge. The architecture has a total of 19 weight layers, comprising 16 convolution layers and 3 fully connected layers, together with 5 max-pool layers and a softmax layer, as shown in the figure below. The network takes a fixed-size 224 × 224 RGB image as input, i.e. a matrix of shape (224, 224, 3). The only pre-processing performed was subtracting the mean RGB value, computed over the whole training set, from each pixel. Kernels of size 3 × 3 with a stride of 1 pixel were used, enabling the network to capture information across the whole image.
Spatial padding was used to preserve the spatial resolution of the image. Max pooling was performed over 2 × 2 pixel windows with stride 2, followed by ReLU to introduce non-linearity, which improves classification performance and computational time compared with the tanh and sigmoid functions used in earlier models. Three fully connected layers follow: the first two have 4096 channels each, the third has 1000 channels for the 1000-way ILSVRC classification, and the final layer applies a softmax function.
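As a rough sketch of how this description translates into code, the block below stacks 3 × 3 convolutions followed by 2 × 2 max pooling in the VGG-19 configuration; the helper name vgg_block and the per-block channel counts (64 to 512, with 2, 2, 4, 4, 4 convolutions per block) are assumptions based on the original VGG paper rather than details stated above.

```python
import torch
import torch.nn as nn

def vgg_block(in_ch, out_ch, n_convs):
    """A VGG-style block: n_convs 3x3 convolutions (stride 1, padding 1) + 2x2 max pooling."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# VGG-19: 2 + 2 + 4 + 4 + 4 = 16 convolution layers over five blocks.
features = nn.Sequential(
    vgg_block(3, 64, 2), vgg_block(64, 128, 2),
    vgg_block(128, 256, 4), vgg_block(256, 512, 4), vgg_block(512, 512, 4),
)
classifier = nn.Sequential(
    nn.Flatten(),
    nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True),   # first two FC layers of size 4096
    nn.Linear(4096, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 1000),                                  # 1000-way ILSVRC classification
)

x = torch.randn(1, 3, 224, 224)
print(classifier(features(x)).shape)                        # torch.Size([1, 1000])
```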
The main use of this architecture is in the transfer-learning domain, where it serves as a pre-trained backbone for training other, smaller architectures. It can also be used for facial recognition tasks, and it is robust enough to be applied to other datasets with minor modifications.
The experimental results of the VGG-19 architecture for image classification are as follows:
ResNet Architecture
ResNet refers to residual networks. It was developed by Microsoft Research and won ILSVRC 2015. Two commonly used variants are a 34-layer network and a shallower 18-layer network that follows the same design. ResNet was proposed to solve the degradation problem observed in deep networks: as the depth of a network increases, its accuracy saturates and then degrades rapidly. To solve this problem, the authors introduced, with empirical evidence, skip connections that minimize the degradation problem to a large extent; these are called residual connections. This enabled them to build one of the deepest networks of its time without compromising the accuracy of the model. The design of the network extensively uses 3 × 3 filters, and down-sampling across the layers is achieved with stride 2. It uses a global average pooling layer and a 1000-neuron fully connected layer, and the final classifier layer uses softmax to make predictions. Each ResNet block is two layers deep, or three layers deep in the larger networks.
The residual connections are of two types, clearly distinguished by the authors as follows:
1. Identity shortcuts, used when the input and output dimensions are the same.
2. Projection shortcuts, used when the input and output dimensions differ.
Identity shortcuts are used throughout most of the network; projection shortcuts are used only when there is a difference in dimensions.
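Both shortcut types can be sketched as an illustrative PyTorch module; the class name ResidualBlock is arbitrary, and the batch-normalization placement follows the original ResNet paper rather than details stated above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Two-layer residual block with 3x3 convolutions: identity shortcut when dimensions
    match, 1x1 projection shortcut when the stride or channel count changes."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        if stride != 1 or in_ch != out_ch:
            # Projection shortcut: match spatial size and channels with a 1x1 convolution.
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch))
        else:
            self.shortcut = nn.Identity()                    # identity shortcut

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.shortcut(x))                # skip connection before final ReLU

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64, 64)(x).shape)                        # identity:   [1, 64, 56, 56]
print(ResidualBlock(64, 128, stride=2)(x).shape)             # projection: [1, 128, 28, 28]
```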
The experimental results of the ResNet architecture for image classification are as follows:
Inception Net V3 Architecture
The Inception family of architectures was developed by Google and is also known as “GoogLeNet”. It was proposed at ILSVRC 2014. The base model is a DNN 27 layers deep, consisting of 2 convolution layers, 4 max-pool layers, 9 inception modules, 1 average-pooling layer, 1 dropout layer, 1 linear layer and 1 softmax layer. The architecture makes use of sparsely connected networks, which help increase the depth and width of the network while reducing over-fitting and saving on the computational budget.
Figure 8: Inception architectures
The main idea behind the inception module is that multiple operations are placed at one layer and all of them are fired together. They work as a team to produce an ensemble of output features, which makes the model versatile for any incoming input. Inception-Net V3 additionally uses batch normalization in the auxiliary classifiers, the RMSProp optimizer, factorized 7 × 7 convolutions and label smoothing.
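The idea of several operations being fired together at one layer can be illustrated with a basic GoogLeNet-style inception module, where parallel 1 × 1, 3 × 3, 5 × 5 and pooling branches are concatenated along the channel dimension. Inception-Net V3 additionally factorizes the larger convolutions, which this simplified sketch does not show; the branch channel counts in the usage line are those of the first GoogLeNet inception block and are included only as an example.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Basic inception module: parallel 1x1, 3x3, 5x5 and pooling branches whose outputs
    are concatenated along the channel dimension."""
    def __init__(self, in_ch, c1, c3_reduce, c3, c5_reduce, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, c1, 1), nn.ReLU())
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, c3_reduce, 1), nn.ReLU(),
                                nn.Conv2d(c3_reduce, c3, 3, padding=1), nn.ReLU())
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, c5_reduce, 1), nn.ReLU(),
                                nn.Conv2d(c5_reduce, c5, 5, padding=2), nn.ReLU())
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU())

    def forward(self, x):
        # All branches are fired together and their feature maps are concatenated.
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

x = torch.randn(1, 192, 28, 28)
print(InceptionModule(192, 64, 96, 128, 16, 32, 32)(x).shape)   # [1, 256, 28, 28]
```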
The experimental results of the InceptionNet architecture for image classification are as follows:
SqueezeNet Architecture
The SqueezeNet architecture is a very lightweight model. This is achieved, first, by replacing most 3 × 3 filters with 1 × 1 filters, since a 1 × 1 filter has 9 times fewer parameters than a 3 × 3 filter. Second, squeeze layers are used to decrease the number of input channels fed to the 3 × 3 filters. Third, down-sampling is delayed so that the convolution layers operate on larger activation maps, which leads to higher accuracy. The design of SqueezeNet is therefore a carefully engineered process in which the first two strategies decrease the number of model parameters and the third increases the model accuracy.
Figure 9: SqueezeNet, SqueezeNet with simple bypass, and SqueezeNet with complex bypass
As evident from the diagram above, the key building block is the “fire module”, which embeds a squeeze convolution layer followed by an expand layer. Each module is controlled by three hyper-parameters: the number of 1 × 1 squeeze filters, the number of 1 × 1 expand filters and the number of 3 × 3 expand filters. The inspiration is taken from the Inception family, where neurons that work together as a team produce more generalized and versatile predictions, while the use of bypass connections is inspired by the ResNet shortcuts. The number of filters per fire module is gradually increased from the beginning to the end of the network, and max pooling with a stride of 2 is performed after the layers conv1, fire4, fire8 and conv10.
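A fire module can be sketched as follows; the hyper-parameter values in the usage line correspond to an early fire module of the original SqueezeNet and are given only for illustration.

```python
import torch
import torch.nn as nn

class FireModule(nn.Module):
    """SqueezeNet fire module: a 1x1 squeeze layer followed by parallel 1x1 and 3x3
    expand layers whose outputs are concatenated."""
    def __init__(self, in_ch, squeeze_ch, expand1x1_ch, expand3x3_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand1x1_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand3x3_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))          # fewer channels fed to the expand filters
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)

x = torch.randn(1, 96, 55, 55)
print(FireModule(96, 16, 64, 64)(x).shape)      # [1, 128, 55, 55]
```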
The experimental results of the SqueezeNet architecture for image classification are as follows:
MobileNet V3
The main objective of the MobileNet family of architectures is to make DNNs more suitable for use on embedded devices. MobileNet-V1 proposed the novel idea of using depthwise separable convolutions. MobileNet-V2 added the idea of using expansion layers in its blocks, the so-called “inverted residual blocks”, which improved performance considerably. MobileNet-V3 further enhanced its predecessors; the improvements were made using hardware-aware neural architecture search and the NetAdapt algorithm for layer-wise search.
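For context on the building block inherited from MobileNet-V1, the following is a minimal sketch of a depthwise separable convolution (a per-channel depthwise convolution followed by a 1 × 1 pointwise convolution); the class name and channel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """MobileNet-V1 style block: 3x3 depthwise convolution (one filter per input channel)
    followed by a 1x1 pointwise convolution that mixes channels."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn1, self.bn2 = nn.BatchNorm2d(in_ch), nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU6(inplace=True)

    def forward(self, x):
        x = self.act(self.bn1(self.depthwise(x)))
        return self.act(self.bn2(self.pointwise(x)))

x = torch.randn(1, 32, 112, 112)
print(DepthwiseSeparableConv(32, 64)(x).shape)   # [1, 64, 112, 112]
```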
Two major network design changes contributed significantly to the network design: layer removal and the hard-swish non-linearity. The layer removal was done manually, without using neural architecture search. In the last block, the 1 × 1 expansion layer taken from the MobileNet-V2 inverted residual unit is moved past the pooling layer, so that it operates on feature maps of size 1 × 1 instead of 7 × 7, making it more efficient in terms of computation and latency. Because the expansion layer now sits behind the pooling layer, the compression performed by the projection layer of the previous block is no longer needed, so that projection layer and the filtering layer of the previous bottleneck block can be removed. Another change is to use 16 filters in the initial 3 × 3 layer instead of 32, the default for the MobileNet family, which was empirically shown to work well.
The activation function used is the hard-swish non-linearity, which is defined by the authors as follows:
$$h\text{-}swish[x] = x \cdot \frac{\mathrm{ReLU6}(x + 3)}{6}$$
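A minimal sketch of this activation, assuming a PyTorch implementation (PyTorch also ships an equivalent nn.Hardswish module):

```python
import torch
import torch.nn.functional as F

def hard_swish(x):
    """h-swish(x) = x * ReLU6(x + 3) / 6, a piecewise-linear stand-in for x * sigmoid(x)."""
    return x * F.relu6(x + 3.0) / 6.0

x = torch.linspace(-4.0, 4.0, steps=9)
print(hard_swish(x))    # zero for x <= -3, approaches x for large positive x
```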
As hard swish does not make use of the sigmoid and uses ReLU6 instead, it largely saves the computational time that would be required to calculate the sigmoid. There are two variants of MobileNet-V3: large and small. The overall model architecture is as follows: