deep-learning-notes-all-units
The human brain consists of a very large number of neural cells (more than a billion) that process
information. Each cell works like a simple processor. Only the massive interaction between all cells and
their parallel processing makes the brain’s abilities possible. Figure 1 represents a human
biological nervous unit. The various parts of a biological neural network (BNN) are marked in Figure 1.
1.4.1.1 Supervised learning:
Every input pattern that is used to train the network is associated with an output pattern which is
the target or the desired pattern.
A teacher is assumed to be present during the training process: the network’s computed output is
compared with the correct expected output to determine the error. The error can then be used to
change the network parameters, which results in an improvement in performance.
1.4.1.2 Unsupervised learning:
In this learning method the target output is not presented to the network. It is as if there is no
teacher to present the desired patterns, and hence the system learns on its own by discovering and
adapting to structural features in the input patterns.
1.4.1.3 Reinforcement learning:
In this method a teacher, though available, does not present the expected answer but only
indicates whether the computed output is correct or incorrect. The information provided helps the
network in the learning process.
1.4.1.4 Hebbian learning:
This rule was proposed by Hebb and is based on correlative weight adjustment. It is the oldest
learning mechanism inspired by biology. Here the input-output pattern pairs $(x_i, y_i)$ are associated
by the weight matrix $W$, known as the correlation matrix.
It is computed as
$$W = \sum_{i=1}^{n} x_i y_i^{T} \qquad \text{eq(1)}$$
Here $y_i^{T}$ is the transpose of the associated output vector $y_i$. Numerous variants of the rule have
been proposed.
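As a minimal sketch of eq(1), the correlation matrix can be built from a small set of pattern pairs with NumPy. The bipolar patterns below are hypothetical and were chosen to be orthogonal so that recall is exact; the recall step (taking the sign of the product of the transposed matrix with an input) is a common illustration rather than something stated in these notes.

```python
import numpy as np

def hebbian_correlation_matrix(xs, ys):
    """Return W = sum_i x_i y_i^T for paired pattern vectors (rows of xs and ys)."""
    xs = np.asarray(xs, dtype=float)   # shape (n, d_in)
    ys = np.asarray(ys, dtype=float)   # shape (n, d_out)
    # xs.T @ ys is exactly the sum of the outer products x_i y_i^T.
    return xs.T @ ys

# Hypothetical bipolar pattern pairs; the two inputs are orthogonal.
X = np.array([[1, -1, 1, -1],
              [1, 1, -1, -1]])
Y = np.array([[1, -1],
              [-1, 1]])

W = hebbian_correlation_matrix(X, Y)

# Recall: present a stored input and take the sign of W^T x.
recalled = np.sign(W.T @ X[0])
```

Because the stored inputs are orthogonal, presenting the first input recovers the first output pattern.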
1.4.1.5 Gradient descent learning:
This is based on the minimization of the error E, defined in terms of the weights and the activation
function of the network. The activation function employed by the network must be
differentiable, because the weight update depends on the gradient of the error E.
Thus, if $\Delta w_{ij}$ is the weight update of the link connecting the $i$-th and $j$-th neurons of two
neighbouring layers, then $\Delta w_{ij}$ is defined as
$$\Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}} \qquad \text{eq(2)}$$
where $\eta$ is the learning rate parameter and $\partial E / \partial w_{ij}$ is the error gradient with reference to the
weight $w_{ij}$. The negative sign moves each weight against the gradient, so that E decreases.
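The update rule can be illustrated on a toy error surface. The sketch below assumes a simple quadratic error E(w) = 0.5 * ||w - w_star||^2, whose gradient is w - w_star; the error function, the target `w_star` and the helper name `gradient_descent` are illustrative choices, not part of the notes.

```python
import numpy as np

def gradient_descent(grad_fn, w0, eta=0.1, steps=100):
    """Repeatedly apply w <- w - eta * dE/dw."""
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        w = w - eta * grad_fn(w)
    return w

# Assumed toy error: E(w) = 0.5 * ||w - w_star||^2, so dE/dw = w - w_star.
w_star = np.array([2.0, -1.0])
w_final = gradient_descent(lambda w: w - w_star, w0=[0.0, 0.0])
```

With a learning rate of 0.1, each step shrinks the distance to the minimum by a factor of 0.9, so the weights approach `w_star` steadily.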
classification into two categories and then the general multiclass classification later. For
classification into only two categories, all we need is a single output neuron. Here we will use
bipolar neurons. The simplest architecture that could do the job consists of a layer of N input
neurons, an output layer with a single output neuron, and no hidden layers. This is the same
architecture as we saw before for Hebb learning. However, we will use a different transfer function
here for the output neurons, as given below in eq (7). Figure 7 represents a single layer perceptron network.
eq (7)
Equation 7 gives the bipolar activation function, which is the most common function used in
perceptron networks. The inputs arising from the problem space are collected by the sensors and
fed to the association units. Association units are responsible for associating the inputs based on
their similarities: each unit groups similar inputs, hence the name association unit. A single input
from each group is given to the summing unit. Weights are initially fixed at random and assigned to
these inputs. The net value is calculated using the expression
x = Σ wi ai – θ eq(8)
This value is given to the activation function unit to get the final output response. The actual
output is compared with the target (desired) output. If they are the same, training can stop;
otherwise there is an error and the weights have to be updated. The error is given as δ = b – s, where
b is the desired (target) output and s is the actual output of the machine. The weights are updated
based on the perceptron learning law given in equation 9.
The weight change is given as Δw = η δ ai, so the new weight is
Wi (new) = Wi (old) + Change in weight vector (Δw) eq(9)
1.5.2. Perceptron Algorithm
• Step 1: Initialize weights and bias. For simplicity, set weights and bias to zero. Set the learning
rate α in the range of zero to one.
• Step 2: While the stopping condition is false, do steps 3-7
• Step 3: For each training pair s:t, do steps 4-7
• Step 4: Set activations of input units xi = ai
• Step 5: Calculate the summing part value Net = Σ ai wi – θ
• Step 6: Compute the response y of the output unit based on the activation function
• Step 7: Update weights and bias if an error occurred for this pattern (if y is not equal to t)
wi(new) = wi(old) + α t xi, and b(new) = b(old) + α t
Else wi(new) = wi(old) and b(new) = b(old)
• Step 8: Test the stopping condition
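The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the bipolar step function is taken to switch at zero, the bias b plays the role of the threshold term, the stopping condition is "no weight changed during a full pass", and the bipolar AND gate is used as hypothetical training data.

```python
import numpy as np

def bipolar_step(net):
    """Bipolar activation: +1 if net is positive, otherwise -1 (threshold at 0 is assumed)."""
    return 1 if net > 0 else -1

def train_perceptron(inputs, targets, alpha=1.0, max_epochs=100):
    w = np.zeros(inputs.shape[1])        # Step 1: weights start at zero
    b = 0.0                              # Step 1: bias starts at zero
    for _ in range(max_epochs):          # Step 2: loop until the stopping condition holds
        changed = False
        for x, t in zip(inputs, targets):            # Step 3: each training pair
            y = bipolar_step(b + np.dot(w, x))       # Steps 5-6: net input and response
            if y != t:                               # Step 7: update only on error
                w = w + alpha * t * x
                b = b + alpha * t
                changed = True
        if not changed:                  # Step 8: stop when a full pass makes no change
            break
    return w, b

# Bipolar AND gate (hypothetical training data).
X_and = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]])
t_and = np.array([1, -1, -1, -1])

w, b = train_perceptron(X_and, t_and)
predictions = [bipolar_step(b + np.dot(w, x)) for x in X_and]
```

Because the AND gate is linearly separable, the loop terminates after a few passes and the learned weights classify all four patterns correctly.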
1.5.3. Limitations of single layer perceptrons:
• Uses only a binary activation function
• Can be used only for linear networks
• Since it uses supervised learning, an optimal solution is provided
• Training time is high
• Cannot solve linearly inseparable problems
1. Initialize the weights (Wi) and bias (b0) to small random values near zero.
2. Set the learning rate η (or α) in the range of 0 to 1.
3. Check the stop condition. If the stop condition is false, do steps 4 to 8.
4. For each training pair, do steps 5 to 8.
5. Set activations of input units: xi = si for i = 1 to N.
6. Calculate the output response:
yin = b0 + Σ xi wi
7. Apply the activation function. The activation function used is the bipolar sigmoidal or bipolar step function.
For multilayer networks, steps 6 and 7 are repeated based on the number of layers.
8. If the target is not equal to the actual output (Y), update the weights and bias
based on the perceptron learning law:
Wi (new) = Wi (old) + Change in weight vector
Change in weight vector = η ti xi
where η = learning rate
ti = target output of the ith unit
xi = ith input
b0 (new) = b0 (old) + Change in bias
Change in bias = η ti
Else Wi (new) = Wi (old)
b0 (new) = b0 (old)
9. Test for the stop condition.
Perceptrons are successful only on problems with a linearly separable solution space. Figure 9
represents both linearly separable and linearly inseparable problems. In particular, a perceptron
cannot handle tasks that are not linearly separable (known as linearly inseparable problems). Sets of
points in two-dimensional space are linearly separable if the sets can be separated by a straight
line. Generalizing, sets of points in n-dimensional space are linearly separable if they can be
separated by a hyperplane, as represented in Figure 9.
A single layer perceptron can be used for linear separation (example: AND gate), but it cannot be
used for non-linear, inseparable problems (example: XOR gate). Consider Figure 10.
Convex regions can be created by multiple decision lines arising from multilayer
networks. A single layer network cannot be used to solve an inseparable problem. Hence we go for
a multilayer network, thereby creating convex regions which solve the inseparable problem.
1.6.1 Convex Region:
Select any two points in a region and draw a straight line between them. If, for every such pair,
the points selected and the line joining them lie inside the region, then that region is known as
a convex region.
1.6.2. Types of convex regions
(a) Open Convex region (b) Closed Convex region
Figure 9 A: Circle - Closed convex region Figure 9 B: Triangle - Closed convex region
1.7. Logistic Regression
Logistic regression is a probabilistic model that classifies instances in terms of
probabilities. Because the classification is probabilistic, a natural method for optimizing the
parameters is to ensure that the predicted probability of the observed class for each training
instance is as large as possible. This goal is achieved by using maximum-likelihood
estimation to learn the parameters of the model. The likelihood of the training data is
defined as the product of the probabilities of the observed labels of each training instance. Clearly,
larger values of this objective function are better. By using the negative logarithm of this value, one
obtains a loss function in minimization form. Therefore, the output node uses the negative
log-likelihood as a loss function. This loss function replaces the squared error used in the Widrow-Hoff
method. The output layer can be formulated with the sigmoid activation function, which is very
common in neural network design.
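The negative log-likelihood loss described above can be written out directly. The sketch below is a minimal NumPy illustration, assuming binary labels in {0, 1}, a single sigmoid output, and plain batch gradient descent; the toy data, the learning rate and the function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_log_likelihood(w, b, X, y):
    """Sum over instances of -log(probability assigned to the observed label)."""
    p = sigmoid(X @ w + b)                       # predicted P(y = 1 | x)
    p_observed = np.where(y == 1, p, 1.0 - p)    # probability of the observed class
    return -np.sum(np.log(p_observed))

def fit_logistic(X, y, eta=0.1, steps=500):
    """Minimize the negative log-likelihood with batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)
        w = w - eta * (X.T @ (p - y))   # gradient of the loss with respect to w
        b = b - eta * np.sum(p - y)     # gradient of the loss with respect to b
    return w, b

# Hypothetical one-dimensional toy data.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

loss_before = negative_log_likelihood(np.zeros(1), 0.0, X, y)
w, b = fit_logistic(X, y)
loss_after = negative_log_likelihood(w, b, X, y)
```

With zero weights every instance receives probability 0.5, so the initial loss is 4 · ln 2; training lowers it by raising the probability of each observed label.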
Non-linear SVM: A non-linear SVM is used for non-linearly separable data. If a
dataset cannot be classified by using a straight line, the data is termed non-linear data
and the classifier used is called a non-linear SVM classifier.
Support Vectors:
The data points or vectors that are closest to the hyperplane and that affect the position of
the hyperplane are termed support vectors. Since these vectors support the hyperplane, they are
called support vectors.
UNIT II
INTRODUCTION TO DEEP LEARNING
The chain rule that underlies the back-propagation algorithm was invented in the
seventeenth century (Leibniz, 1676; L’Hôpital, 1696).
Beginning in the 1940s, function approximation techniques were used to motivate
machine learning models such as the perceptron.
The earliest models were based on linear models. Critics including Marvin Minsky
pointed out several of the flaws of the linear model family, such as its inability to learn
the XOR function, which led to a backlash against the entire neural network approach.
Efficient applications of the chain rule based on dynamic programming began to appear
in the 1960s and 1970s.
Werbos (1981) proposed applying chain rule techniques for training artificial neural
networks. The idea was finally developed in practice after being independently
rediscovered in different ways (LeCun, 1985; Parker, 1985; Rumelhart et al., 1986a).
Following the success of back-propagation, neural network research gained popularity
and reached a peak in the early 1990s. Afterwards, other machine learning techniques
became more popular until the modern deep learning renaissance that began in 2006.
The core ideas behind modern feedforward networks have not changed substantially
since the 1980s. The same back-propagation algorithm and the same approaches to
gradient descent are still in use.
Most of the improvement in neural network performance from 1986 to 2015 can be
attributed to two factors. First, larger datasets have reduced the degree to which statistical
generalization is a challenge for neural networks. Second, neural networks have become
much larger, because of more powerful computers and better software infrastructure. A
small number of algorithmic changes have also improved the performance of neural
networks noticeably. One of these algorithmic changes was the replacement of mean
squared error with the cross-entropy family of loss functions. Mean squared error was
popular in the 1980s and 1990s but was gradually replaced by cross-entropy losses and the
principle of maximum likelihood as ideas spread between the statistics community and the
machine learning community.
The other major algorithmic change that has greatly improved the performance of
feedforward networks was the replacement of sigmoid hidden units with piecewise linear
hidden units, such as rectified linear units. Rectification using the max{0, z} function was
introduced in early neural network models and dates back at least as far as the Cognitron
and Neo-Cognitron (Fukushima, 1975, 1980).
For small datasets, Jarrett et al. (2009) observed that using rectifying nonlinearities
is even more important than learning the weights of the hidden layers. Random weights are
sufficient to propagate useful information through a rectified linear network, enabling the
classifier layer at the top to learn how to map different feature vectors to class identities.
When more data is available, learning begins to extract enough useful knowledge to exceed
the performance of randomly chosen parameters. Glorot et al. (2011a) showed that learning
is far easier in deep rectified linear networks than in deep networks that have curvature or
two-sided saturation in their activation functions.
When the modern resurgence of deep learning began in 2006, feedforward networks
continued to have a bad reputation. From about 2006 to 2012, it was widely believed that
feedforward networks would not perform well unless they were assisted by other models,
such as probabilistic models. It is now known that with the right resources and
engineering practices, feedforward networks perform very well. Today, gradient-based
learning in feedforward networks is used as a tool to develop probabilistic models.
Feedforward networks continue to have unfulfilled potential. In the future, we expect they
will be applied to many more tasks, and that advances in optimization algorithms and model
design will improve their performance even further.
Yk = f(yink)
III. Backpropagation of Errors
Step 7: δk = (tk – Yk) f′(yink)
Step 8: δinj = Σk δk Vjk
IV. Updating of Weights & Biases
Step 9: Weight correction is Δwij = α δk Zj
Bias correction is Δwoj = α δk
New weight is
Wij(new) = Wij(old) + Δwij
Vjk(new) = Vjk(old) + ΔVjk
New bias is
Woj(new) = Woj(old) + Δwoj
Vok(new) = Vok(old) + ΔVok
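The error terms above can be checked numerically. The following sketch assumes a network with one hidden layer, sigmoid activations in both layers (so f′ = f(1 − f)) and a squared-error loss; the parameter names `w_ih`, `b_h`, `w_ho`, `b_o` are hypothetical and are used instead of the W/V symbols of the notes to avoid ambiguity. It compares the back-propagated gradients with finite-difference estimates.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, x):
    w_ih, b_h, w_ho, b_o = params
    z = sigmoid(b_h + x @ w_ih)    # hidden activations Z_j
    y = sigmoid(b_o + z @ w_ho)    # output activations Y_k
    return z, y

def loss(params, x, t):
    _, y = forward(params, x)
    return 0.5 * np.sum((t - y) ** 2)

def backprop_gradients(params, x, t):
    w_ih, b_h, w_ho, b_o = params
    z, y = forward(params, x)
    delta_k = (t - y) * y * (1.0 - y)        # (t_k - Y_k) f'(y_ink)
    delta_in_j = w_ho @ delta_k              # sum over k of delta_k times hidden-to-output weight
    delta_j = delta_in_j * z * (1.0 - z)     # multiply by f'(z_inj)
    # The loss gradient is the negative of the correction direction.
    return [-np.outer(x, delta_j), -delta_j, -np.outer(z, delta_k), -delta_k]

def numerical_gradient(params, x, t, index, eps=1e-6):
    """Central finite differences for one parameter array."""
    grad = np.zeros_like(params[index])
    for pos in np.ndindex(grad.shape):
        original = params[index][pos]
        params[index][pos] = original + eps
        up = loss(params, x, t)
        params[index][pos] = original - eps
        down = loss(params, x, t)
        params[index][pos] = original
        grad[pos] = (up - down) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
params = [rng.normal(size=(3, 4)), rng.normal(size=4),
          rng.normal(size=(4, 2)), rng.normal(size=2)]
x = rng.normal(size=3)
t = np.array([1.0, 0.0])

grads = backprop_gradients(params, x, t)
max_error = max(np.max(np.abs(grads[i] - numerical_gradient(params, x, t, i)))
                for i in range(4))
```

A weight update with learning rate α then subtracts α times each gradient, which is the same as adding α δ times the corresponding activation.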
2.2.5 Merits
• Has a smooth effect on weight correction
• Computing time is less if the weights are small
• 100 times faster than the perceptron model
• Has a systematic weight updating procedure
2.2.6. Demerits
• Learning phase requires intensive calculations
• Selection of number of Hidden layer neurons is an issue
• Selection of number of Hidden layers is also an issue
• Network gets trapped in Local Minima
• Temporal Instability
• Network Paralysis
• Training time is more for Complex problems
2.3 Regularization
A fundamental problem in machine learning is how to make an algorithm that
will perform well not just on the training data, but also on new inputs. Many
strategies used in machine learning are explicitly designed to reduce the test error,
possibly at the expense of increased training error. These strategies are known
collectively as regularization.
Definition: “Any modification we make to a learning algorithm that is intended to
reduce its generalization error but not its training error.”
In the context of deep learning, most regularization strategies are based on
regularizing estimators.
Regularization of an estimator works by trading increased bias for reduced
variance.
We can see that the addition of the weight decay term has modified the learning rule to
multiplicatively shrink the weight vector by a constant factor on each step, just before
performing the usual gradient update. This describes what happens in a single step.
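The multiplicative shrinkage can be made concrete with a small sketch. It assumes the standard L2 (weight decay) objective J(w) + (α/2)·||w||² and a learning rate `lr`; these symbols and the function names are assumptions, since the corresponding equations are not reproduced in these notes.

```python
import numpy as np

def step_with_weight_decay(w, grad_j, lr, alpha):
    """Shrink w by the constant factor (1 - lr * alpha), then apply the usual gradient update."""
    return (1.0 - lr * alpha) * w - lr * grad_j

def step_on_penalized_objective(w, grad_j, lr, alpha):
    """Plain gradient step on J(w) + (alpha / 2) * ||w||^2, whose gradient is grad_j + alpha * w."""
    return w - lr * (grad_j + alpha * w)

# Hypothetical weights and gradient of the unregularized objective J.
w = np.array([1.0, -2.0, 0.5])
grad_j = np.array([0.3, -0.1, 0.0])

step_a = step_with_weight_decay(w, grad_j, lr=0.1, alpha=0.01)
step_b = step_on_penalized_objective(w, grad_j, lr=0.1, alpha=0.01)
```

Both functions produce the same weights, which shows that a gradient step on the penalized objective is the same as shrinking the weights first and then taking the ordinary step.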
The approximation Ĵ is given by
The minimum of Ĵ occurs where its gradient ∇w Ĵ(w) = H(w − w∗) is equal to 0.
To study the effect of weight decay,
penalize the size of the model parameters. Another option is to use L1 regularization.
L1 weight decay controls the strength of the regularization by scaling the penalty Ω using a
positive hyperparameter α. Thus, the regularized objective function J̃(w; X, y) is given by
By inspecting equation 1, we can see immediately that the effect of L1 regularization is quite
different from that of L2 regularization. Specifically, we can see that the regularization
contribution to the gradient no longer scales linearly with each wi; instead it is a constant factor
with a sign equal to sign(wi).
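The difference between the two gradient contributions is easy to see numerically. The sketch below assumes the penalties α·||w||₁ for L1 and (α/2)·||w||² for L2; the example weights are hypothetical.

```python
import numpy as np

def l2_penalty_gradient(w, alpha):
    """Gradient of (alpha / 2) * ||w||^2: scales linearly with each weight."""
    return alpha * w

def l1_penalty_gradient(w, alpha):
    """(Sub)gradient of alpha * ||w||_1: constant magnitude alpha with the sign of each weight."""
    return alpha * np.sign(w)

w = np.array([-2.0, -0.1, 0.5, 4.0])
g2 = l2_penalty_gradient(w, alpha=0.1)
g1 = l1_penalty_gradient(w, alpha=0.1)
```

The L1 term pushes the small weight (-0.1) towards zero just as hard as the large weight (4.0), which is why L1 regularization tends to drive small weights exactly to zero.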
L1 regularization adds the absolute values of the weights as a penalty term in the cost function,
whereas L2 regularization adds the squared values of the weights.
L1 regularization can be helpful in feature selection by eliminating unimportant
features, whereas L2 regularization is not recommended for feature selection.
L1 does not have a closed-form solution, since it includes an absolute value, which is not
differentiable at zero, while L2 has a closed-form solution because it is the square of a weight.
Even though the input X was normalized, the outputs of later layers are no longer on the same scale.
As the data passes through multiple layers of the network, with (sigmoidal) activation functions
applied multiple times, this leads to an internal covariate shift in the data.
This motivates us to move towards Batch Normalization.
Normalization is the process of altering the input data to have a mean of zero and a standard
deviation of one.
2.4.1 Procedure to do Batch Normalization:
(1) Consider the batch input from layer h. For this layer we first calculate the mean of the hidden
activations.
(2) After calculating the mean, the next step is to calculate the standard deviation of the hidden
activations.
(3) Now we normalize the hidden activations using these mean and standard deviation values. To do
this, we subtract the mean from each input and divide the result by the sum of the standard
deviation and the smoothing term (ε).
(4) As the final stage, the re-scaling and offsetting of the input is performed. Here two components
of the BN algorithm are used, γ (gamma) and β (beta). These parameters are used for re-scaling (γ)
and shifting (β) the vector containing the values from the previous operations.
These two parameters are learnable. During the training of the neural
network, the optimal values of γ and β are obtained and used. Hence we get an accurate
normalization of each batch.
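The four steps can be sketched as a single forward pass in NumPy. This is a minimal illustration that follows the description above (dividing by the standard deviation plus ε); many implementations divide by the square root of the variance plus ε instead. The batch values and the function name are hypothetical, and γ and β are fixed here rather than learned.

```python
import numpy as np

def batch_norm_forward(h, gamma, beta, eps=1e-5):
    """Normalize a batch of hidden activations h with shape (batch, features)."""
    mean = h.mean(axis=0)                 # step 1: per-feature mean over the batch
    std = h.std(axis=0)                   # step 2: per-feature standard deviation
    h_norm = (h - mean) / (std + eps)     # step 3: normalize, eps avoids division by zero
    return gamma * h_norm + beta          # step 4: re-scale (gamma) and shift (beta)

# Hypothetical batch of 4 examples with 2 features on very different scales.
h = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])

out = batch_norm_forward(h, gamma=np.ones(2), beta=np.zeros(2))
```

With γ = 1 and β = 0 both features end up with mean zero and a standard deviation of approximately one, regardless of their original scale.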
No. | Shallow network | Deep network
1 | One hidden layer (or very few hidden layers) | Many hidden layers, with more neurons in each layer
2 | Takes input only as vectors | Can take raw data such as images and text as input
3 | Needs more parameters to achieve a good fit | Can fit functions better with fewer parameters than a shallow network
UNIT III
DIMENSIONALITY REDUCTION
Linear (PCA, LDA) and manifolds, metric learning - Auto encoders and dimensionality
reduction in networks - Introduction to Convnet - Architectures – AlexNet, VGG, Inception,
ResNet - Training a Convnet: weights initialization, batch normalization, hyperparameter optimization.
Figure 3A: PCA for Data Representation Figure 3B: PCA Dimension Reduction
If the variation in a data set is caused by some natural property, or is caused by random
experimental error, then we may expect it to be normally distributed. In this case we show
the nominal extent of the normal distribution by a hyper-ellipse (the two-dimensional
ellipse in the example). The hyper ellipse encloses data points that are thought of as
belonging to a class. It is drawn at a distance beyond which the probability of a point
belonging to the class is low, and can be thought of as a class boundary.
If the variation in the data is caused by some other relationship, then PCA gives us a
way of reducing the dimensionality of a data set. Consider two variables that are nearly
related linearly as shown in figure 3B. As in figure 3A the principal direction in which the
data varies is shown by the U axis, and the secondary direction by the V axis. However in
this case the V coordinates are all very close to zero. We may assume, for example, that
they are only non zero because of experimental noise. Thus in the U V axis system we can
represent the data set by one variable U and discard V. Thus we have reduced the
dimensionality of the problem by 1.
Computing the Principal Components
The vector x is called an eigenvector of A associated with the eigenvalue λ. Notice that
there is no unique solution for x in the above equation. It is a direction vector only and can
be scaled to any magnitude. To find a numerical solution for x we need to set one of its
elements to an arbitrary value, say 1, which gives us a set of simultaneous equations to
solve for the other elements. If there is no solution, we repeat the process with another
element. Ordinarily we normalize the final values so that x has length one, that is xT x = 1.
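In practice the eigenvectors are usually obtained with a numerical routine rather than by solving the simultaneous equations by hand. The sketch below is a minimal illustration, assuming the matrix A is the covariance matrix of the data and using hypothetical two-dimensional data in which the two variables are nearly linearly related; `np.linalg.eigh` returns unit-length eigenvectors of a symmetric matrix.

```python
import numpy as np

def principal_components(data):
    """Return eigenvalues (descending), matching eigenvectors as columns, and the centered data."""
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(eigenvalues)[::-1]             # re-order to descending
    return eigenvalues[order], eigenvectors[:, order], centered

# Hypothetical data: the second variable is roughly twice the first, plus small noise.
rng = np.random.default_rng(0)
u = rng.uniform(-5, 5, size=200)
data = np.column_stack([u, 2.0 * u + rng.normal(scale=0.1, size=200)])

eigenvalues, eigenvectors, centered = principal_components(data)
first = eigenvectors[:, 0]                       # principal direction (the U axis)
reduced = centered @ first                       # one coordinate per point instead of two
explained = eigenvalues[0] / eigenvalues.sum()   # share of variance kept by the U axis
```

Almost all of the variance lies along the first eigenvector, so the second coordinate can be discarded with very little loss of information.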
Suppose we have a 3 × 3 matrix A with eigenvectors x1, x2, x3, and eigenvalues λ1, λ2, λ3
so:
Ax1 = λ1x1 Ax2 = λ2x2 Ax3 = λ3x3
Putting the eigenvectors as the columns of a matrix gives:
4. Improves Visualization:
Linear Discriminant Analysis, as its name suggests, is a linear model for classification and
dimensionality reduction. It is most commonly used for feature extraction in pattern classification
problems.
3.4.1 Need for LDA:
Logistic regression performs well for binary classification but performs poorly in multi-class
classification problems with well-separated classes, whereas LDA handles these quite
efficiently.
LDA can also be used in data pre-processing to reduce the number of features, just as PCA does,
which reduces the computing cost significantly.
3.4.2. Limitations:
Linear decision boundaries may not effectively separate non-linearly separable classes.
More flexible boundaries are desired.
In cases where the number of features exceeds the number of observations, LDA might not
perform as desired. This is called the Small Sample Size (SSS) problem. Regularization is
required.
1. Simple prototype classifier: Distance to the class mean is used, it’s simple to interpret.
2. Decision boundary is linear: It’s simple to implement and the classification is robust.
3. Dimension reduction: It provides informative low-dimensional view on the data, which is
both useful for visualization and feature engineering.
Shortcomings of LDA:
1. Linear decision boundaries may not adequately separate the classes. Support for more
general boundaries is desired.
2. In a high-dimensional setting, LDA uses too many parameters. A regularized version of
LDA is desired.
3. Support for more complex prototype classification is desired.
3.5. Manifold Learning:
Manifold learning for dimensionality reduction has recently gained much attention for
assisting image processing tasks such as segmentation, registration, tracking,
recognition, and computational anatomy.
The drawbacks of PCA in handling dimensionality reduction for non-linear,
irregular and curved surfaces necessitated the development of more advanced
algorithms such as manifold learning.
There are different variants of manifold learning that address the problem of reducing
data dimensions and feature sets obtained from real-world problems, where the data lie on
uneven, irregular surfaces, by finding a sub-optimal data representation.
This kind of data representation works with data points drawn from a low-dimensional
manifold that is embedded in a high-dimensional space, in an attempt to
generalize linear frameworks like PCA.
Locally, a manifold looks like a flat and featureless space that behaves like Euclidean space.
Manifold learning problems are unsupervised: the algorithm learns the high-dimensional
structure of the data from the data itself, without the use of predetermined
classifications and without losing important information regarding the characteristics of
the original variables.
The goal of manifold-learning algorithms is to recover the original domain
structure, up to some scaling and rotation. The nonlinearity of these algorithms allows
them to reveal the domain structure even when the manifold is not linearly embedded.
Manifold learning algorithms are divided into two categories:
Global methods: Allow high-dimensional data to be mapped to a low-dimensional
space such that the global properties are preserved.
Examples include Multidimensional Scaling (MDS) and Isomap, covered in the
following sections.
Local methods: Allow high-dimensional data to be mapped to a low-dimensional space
such that local properties are preserved. Examples are Locally Linear Embedding
(LLE), Laplacian Eigenmaps (LE), Local Tangent Space Alignment (LTSA), and
Hessian Eigenmapping (HLLE).
Three popular manifold learning algorithms:
IsoMap (Isometric Mapping)
number of features in the data. Before feeding the data into the AutoEncoder, the data must
be scaled between 0 and 1 using MinMaxScaler, since we are going to use the sigmoid
activation function in the output layer, which outputs values between 0 and 1. When we are
using AutoEncoders for dimensionality reduction, we extract the bottleneck layer
and use it to reduce the dimensions. This process can be viewed as feature extraction.
The type of AutoEncoder that we’re using is a Deep AutoEncoder, where the encoder and
the decoder are symmetrical. Autoencoders don’t necessarily need a symmetrical
encoder and decoder; the encoder and decoder can be non-symmetrical as well.
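The scaling step can be sketched without any library beyond NumPy. The function below is an illustrative stand-in for a min-max scaler (it is not the scikit-learn class itself): each feature column is mapped to the range 0 to 1 so that it matches the output range of a sigmoid layer. The example feature matrix is hypothetical.

```python
import numpy as np

def min_max_scale(data):
    """Scale each column of data to the range [0, 1]."""
    col_min = data.min(axis=0)
    col_range = data.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0      # constant columns are mapped to 0 instead of dividing by zero
    return (data - col_min) / col_range

# Hypothetical feature matrix: 3 samples, 3 features on different scales.
features = np.array([[10.0, 200.0, 5.0],
                     [20.0, 400.0, 5.0],
                     [15.0, 300.0, 5.0]])

scaled = min_max_scale(features)
```

After scaling, every value lies between 0 and 1, so a sigmoid output layer is able to reproduce the inputs.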
Deep Autoencoder
Sparse Autoencoder
Undercomplete Autoencoder
Variational Autoencoder
LSTM Autoencoder
3.7. AlexNet:
The AlexNet model was proposed in 2012 in the research paper “ImageNet
Classification with Deep Convolutional Neural Networks” by Alex Krizhevsky and his
colleagues.
Then comes the fourth convolution operation, with 384 filters of size 3×3. The stride and
the padding are both 1. The output size remains unchanged at 13×13×384.
After this, we have the final convolution layer, with 256 filters of size 3×3. The
stride and padding are set to 1, and the activation function is ReLU. The resulting feature
map has shape 13×13×256.
Looking at the architecture as a whole, the number of filters increases as we go
deeper, so more features are extracted as we move deeper into the architecture.
The filter size also decreases, which means a decrease in the feature map shape.
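The statement that a 3×3 convolution with stride 1 and padding 1 leaves a 13×13 map unchanged follows from the usual output-size formula, floor((input + 2·padding − kernel) / stride) + 1. The helper below is a small illustrative sketch of that formula; the function name is an assumption.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

# Fourth and fifth AlexNet convolutions as described above: 13x13 input, 3x3 kernel, stride 1, padding 1.
size_after_conv = conv_output_size(13, kernel_size=3, stride=1, padding=1)
```

Here (13 + 2 − 3) / 1 + 1 = 13, so the spatial size stays at 13×13 and only the number of channels changes.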
3.8. VGG-16
The major shortcoming of AlexNet, too many hyper-parameters, was addressed by
VGG Net by replacing the large kernel-sized filters (11 and 5 in the first and second
convolution layers, respectively) with multiple 3×3 kernel-sized filters one after
another.
The architecture, developed by Simonyan and Zisserman, was the first runner-up of the
Visual Recognition Challenge of 2014.
The architecture consists of 3×3 convolutional filters and 2×2 max pooling layers with a
stride of 2.
Padding is kept “same” to preserve the dimension.
There are 16 layers in the network. The input image is in RGB format with
dimensions 224×224×3, followed by 5 pairs of convolution (filters: 64, 128,
256, 512, 512) and max pooling.
The output of these layers is fed into three fully connected layers and a softmax
function in the output layer.
In total there are 138 million parameters in VGG Net.
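The figure of roughly 138 million parameters can be reproduced by counting weights and biases layer by layer. The sketch below assumes the standard VGG-16 configuration, which is not fully spelled out in these notes: 2, 2, 3, 3 and 3 convolution layers in the five blocks, 2×2 pooling that halves the spatial size after each block, two fully connected layers of 4096 units and a 1000-class output.

```python
def conv_params(in_channels, out_channels, kernel=3):
    """Weights plus biases of one convolution layer."""
    return kernel * kernel * in_channels * out_channels + out_channels

def dense_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features

def vgg16_parameter_count(num_classes=1000):
    # (filters, number of conv layers) per block; the layer counts are an assumption.
    blocks = [(64, 2), (128, 2), (256, 3), (512, 3), (512, 3)]
    total, in_channels, size = 0, 3, 224
    for filters, n_layers in blocks:
        for _ in range(n_layers):
            total += conv_params(in_channels, filters)
            in_channels = filters
        size //= 2                        # 2x2 max pooling with stride 2 halves the map
    features = size * size * in_channels  # 7 * 7 * 512 after five pooling steps
    for units in (4096, 4096, num_classes):
        total += dense_params(features, units)
        features = units
    return total

total_params = vgg16_parameter_count()
```

Under these assumptions the count is 138,357,544, and most of it comes from the first fully connected layer.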
3.9 ResNet:
ResNet, the winner of the ILSVRC-2015 competition, is a deep network with over 100
layers. Residual networks (ResNets) are similar to VGG nets in their sequential
approach, but they also use “skip connections” and “batch normalization”, which help to train
deep layers without hampering performance. After VGG nets, as CNNs were getting
deeper, they were becoming hard to train because of the vanishing gradient problem, which makes
the derivative vanishingly small. As a result, the overall performance saturates or even degrades.
The idea of skip connections came from highway networks, where gated shortcut connections
were used.
3.10 Inception Net:
Figure 7: InceptionNet
The Inception network, also known as GoogLeNet, was proposed by developers at Google
in “Going Deeper with Convolutions” in 2014. The motivation for InceptionNet comes from
the fact that the salient parts of an image can vary greatly in
size. Because of this, selecting the right kernel size becomes extremely difficult: big kernels
are preferred for global features and small kernels for features that are locally located.
InceptionNet resolves this by stacking multiple kernels at the same level. Typically it uses
5×5, 3×3 and 1×1 filters in one go.
3.11. Hyperparameter Optimization:
Hyperparameter optimization in machine learning aims to find the
hyperparameters of a given machine learning algorithm that deliver the best performance as
measured on a validation set. Hyperparameters, in contrast to model parameters, are set by the
machine learning engineer before training. The number of trees in a random forest is a
hyperparameter, while the weights in a neural network are model parameters learned during
training. Hyperparameter optimization finds a combination of hyperparameters that returns
an optimal model, which reduces a predefined loss function and in turn increases the accuracy on given
independent data.
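As a minimal sketch of the idea, the example below runs a grid search over one hyperparameter and picks the value with the lowest validation loss. It uses ridge regression as a stand-in model because it can be fitted in closed form; the model, the synthetic data, the candidate grid and the function names are all illustrative assumptions.

```python
import numpy as np

def fit_ridge(X, y, alpha):
    """Closed-form ridge regression; alpha is the hyperparameter being tuned."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

def mean_squared_error(X, y, w):
    return float(np.mean((X @ w - y) ** 2))

def grid_search(X_train, y_train, X_val, y_val, alphas):
    """Fit one model per candidate value and score each on the validation set."""
    scores = {}
    for alpha in alphas:
        w = fit_ridge(X_train, y_train, alpha)   # model parameters are learned in training
        scores[alpha] = mean_squared_error(X_val, y_val, w)
    best_alpha = min(scores, key=scores.get)
    return best_alpha, scores

# Hypothetical synthetic regression data, split into training and validation parts.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + rng.normal(scale=0.5, size=60)
X_train, y_train, X_val, y_val = X[:40], y[:40], X[40:], y[40:]

alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
best_alpha, scores = grid_search(X_train, y_train, X_val, y_val, alphas)
```

The weights `w` are model parameters found by training, while `alpha` is fixed before each training run and chosen only by comparing validation scores.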
3.11.1 Hyperparameter Optimization methods
UNIT IV
DIMENSIONALITY REDUCTION
Optimization in deep learning– Non-convex optimization for deep networks- Stochastic
Optimization Generalization in neural networks- Spatial Transformer Networks- Recurrent
networks, LSTM Recurrent Neural Network Language Models- Word-Level RNNs & Deep
Reinforcement Learning - Computational & Artificial Neuroscience.
4.1 Optimization in Deep Learning:
In deep learning, the performance of the model is estimated with the help of a loss function.
This loss is used to train the network so that it performs better. Essentially, we try
to minimize the loss function: a lower loss means the model performs better. The process of
minimizing any mathematical function is called optimization.
Optimizers are algorithms or methods used to change the attributes of the neural network, such
as the weights and the learning rate, so that the loss is reduced. Optimizers solve optimization
problems by minimizing the function.
The goal of an optimizer is to minimize the objective function (the loss function computed on the
training data set). Put simply, optimization minimizes the training error.
4.1.1 Need for Optimization:
The presence of local minima reduces model performance
The presence of saddle points, which creates vanishing or exploding gradient issues
To select appropriate weight values and other associated model parameters
To minimize the loss value (training error)
Localisation Net:
Given an input feature map U with width W, height H and C channels, the localisation net
outputs θ, the parameters of the transformation Tθ. The transformation can be learnt as an affine transform.
Grid Generator:
Suppose we have a regular grid G, a set of points with target
coordinates (xt_i, yt_i). We apply the transformation Tθ to G, i.e. Tθ(G).
The result of Tθ(G) is a set of points with source coordinates (xs_i, ys_i).
These points have been altered based on the transformation parameters. The transformation can be
a translation, scaling, rotation, or a more generic warping, depending on how we set θ as
mentioned above.
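The grid generator can be sketched for the affine case. The example assumes θ is a 2×3 affine matrix applied to homogeneous target coordinates and that coordinates are normalized to the range −1 to 1; these conventions, the function name and the example matrices are assumptions, not details given in these notes.

```python
import numpy as np

def affine_grid(theta, height, width):
    """Apply a 2x3 affine matrix theta to every point of a regular target grid."""
    ys, xs = np.meshgrid(np.linspace(-1, 1, height),
                         np.linspace(-1, 1, width), indexing="ij")
    # Homogeneous target coordinates (x_t, y_t, 1) for all grid points, shape (3, H*W).
    target = np.stack([xs.ravel(), ys.ravel(), np.ones(height * width)])
    source = theta @ target                   # transformed coordinates, shape (2, H*W)
    return source.reshape(2, height, width)   # [0] holds x coordinates, [1] holds y coordinates

identity = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
zoom_in = np.array([[0.5, 0.0, 0.0],
                    [0.0, 0.5, 0.0]])

grid_identity = affine_grid(identity, height=3, width=3)
grid_zoom = affine_grid(zoom_in, height=3, width=3)
```

The identity matrix leaves the grid unchanged, while the second matrix maps the grid onto the central half of the input, which corresponds to zooming in on the centre.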
Sampler:
Based on the new set of coordinates produced by the grid generator, we generate a
transformed output feature map V. This V may be translated, scaled, rotated, warped,
or otherwise affinely or projectively transformed. Note that an STN can be applied
not only to the input image but also to intermediate feature maps.
An STN is a mechanism that rotates or scales an input image or a feature
map in order to focus on the target object and to remove rotational variance.
One of the most notable features of STNs is their modularity (the module can
be injected into any part of the model) and their ability to be trained with a single backprop
algorithm without modification of the initial model.
4.4.1. Advantages:
Helps in learning explicit spatial transformations like translation, rotation, scaling,
cropping, non-rigid deformations, etc. of features.
Can be used in any networks and at any layer and learnt in an end-to-end trainable
manner.
Provides improvement in the performance of existing models.
4.5. Recurrent Neural Networks:
RNNs are very powerful, because they combine two properties:
Distributed hidden state that allows them to store a lot of information about
the past efficiently.
Non-linear dynamics that allows them to update their hidden state in
complicated ways.
With enough neurons and time, RNNs can compute anything that can be computed
by your computer.
4.5.1. Need for RNN:
Normal Networks cannot handle sequential data
cell using logistic and linear units with multiplicative interactions. Information gets into the cell
whenever its “write” gate is on. The information stays in the cell so long as its “keep” gate
is on. Information can be read from the cell by turning on its “read” gate (refer to Figure 4.6,
shown below).
To preserve information for a long time in the activities of an RNN, we use a circuit
that implements an analog memory cell.
– A linear unit that has a self-link with a weight of 1 will maintain its state.
– Information is stored in the cell by activating its write gate.
– Information is retrieved by activating the read gate.
– We can backpropagate through this circuit because logistic units have nice
derivatives.
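The write, keep and read gates can be illustrated with a toy cell. This is a simplified sketch, not the exact circuit of Figure 4.6: the gates are passed in as plain numbers between 0 and 1 (in a trained network they would be outputs of logistic units), and the name memory_cell_step is hypothetical.

```python
def memory_cell_step(state, value, write, keep, read):
    """One time step of a gated memory cell; gates are numbers in [0, 1]."""
    # The self-link with weight 1 carries the old state forward, scaled by the keep gate.
    # The write gate scales how much of the new value enters the cell.
    state = keep * state + write * value
    # The read gate scales how much of the stored state is exposed.
    return state, read * state

state = 0.0
outputs = []
# (value, write, keep, read) per time step:
# store 1.7, hold it for two steps while ignoring the inputs, then read it out.
schedule = [(1.7, 1.0, 0.0, 0.0),
            (0.3, 0.0, 1.0, 0.0),
            (-2.0, 0.0, 1.0, 0.0),
            (0.0, 0.0, 1.0, 1.0)]
for value, write, keep, read in schedule:
    state, out = memory_cell_step(state, value, write, keep, read)
    outputs.append(out)
```

The value written at the first step survives the intermediate inputs unchanged and appears at the output only when the read gate is turned on.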
Step 2: Decide how much this unit adds to the current state
In the second layer, there are two parts. One is a sigmoid function, and the other is
a tanh function. The sigmoid function decides which values to let through (its output lies between 0
and 1). The tanh function gives a weighting to the values which are passed, deciding their level of
importance (-1 to 1).
Step 3: Decide what part of the current cell state makes it to the output
The third step is to decide what the output will be. First, we run a sigmoid layer,
which decides what parts of the cell state make it to the output. Then, we put the cell state
through tanh to push the values to be between -1 and 1 and multiply it by the output of the
sigmoid gate.
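Steps 2 and 3 can be put together in a small single-unit LSTM update. This is a minimal sketch using the standard LSTM equations; it also includes the forget gate of the standard formulation, which is not described in the steps above. The parameter layout (one (w_x, w_h, bias) triple per gate) and the function names are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a single-unit LSTM; p maps each gate name to (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, bias = p[name]
        return w_x * x + w_h * h_prev + bias

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # Step 2: which new values to let through (0 to 1)
    g = math.tanh(pre("candidate"))   # Step 2: candidate values, weighted in (-1, 1)
    c = f * c_prev + i * g            # updated cell state
    o = sigmoid(pre("output"))        # Step 3: which parts of the cell state reach the output
    h = o * math.tanh(c)              # Step 3: squash the cell state and gate it
    return h, c

# With all parameters zero every sigmoid gate is half open and the candidate is 0.
zero_params = {name: (0.0, 0.0, 0.0) for name in ("forget", "input", "candidate", "output")}
h, c = lstm_step(1.0, 0.0, 1.0, zero_params)

# A large forget bias and a very negative input bias make the cell hold its state.
hold_params = dict(zero_params, forget=(0.0, 0.0, 20.0), input=(0.0, 0.0, -20.0))
h_hold, c_hold = lstm_step(5.0, 0.0, 1.0, hold_params)
```

In the second call the cell state stays very close to its previous value of 1.0 regardless of the input, which is the "keep" behaviour of the memory cell.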
4.6.2. Applications of LSTM include:
• Robot control
• Time series prediction
• Speech recognition
• Rhythm learning
• Music composition
• Grammar learning
• Handwriting recognition
4.7. Computational and Artificial Neuro-Science:
Computational neuroscience is the field of study in which mathematical tools and theories
are used to investigate brain function.
The term “computational neuroscience” has two different definitions:
1. using a computer to study the brain
2. studying the brain as a computer
Computational and artificial neuroscience deals with understanding how
signals are transmitted through and from the human brain. A better understanding of how
decisions are made in the human brain by processing data or signals will help us in
developing intelligent algorithms or programs to solve complex problems. Hence, we need
to understand the basics of Biological Neural Networks (BNN).
4.7.1. The Biological Neurons:
The human brain consists of a very large number of neural cells, far more than a billion
(commonly estimated at around 86 billion), that process information. Each cell works like a
simple processor. Only the massive interaction between all cells and their parallel processing
makes the brain’s abilities possible. Figure 4.7 represents a human biological nervous unit,
with the various parts of the biological neural network (BNN) marked.
Terminal buttons of a neuron are the small knobs at the end of an axon that release
chemicals called neurotransmitters.
Information flow in a neural cell
The input/output and the propagation of information are shown below.
4.7.2. Artificial neuron model
An artificial neuron is a mathematical function conceived as a simple model of a
real (biological) neuron.
The McCulloch-Pitts Neuron
This is a simplified model of real neurons, known as a Threshold Logic Unit.
A set of input connections brings in activations from other neurons.
A processing unit sums the inputs, and then applies a non-linear activation function
(i.e. squashing/transfer/threshold function).
An output line transmits the result to other neurons.
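A threshold logic unit of this kind is easy to write down. The sketch below assumes the simplest form, where the unit outputs 1 when the weighted sum of its inputs reaches the threshold and 0 otherwise; the weights and thresholds for AND and OR are hand-picked for illustration, and the function names are hypothetical.

```python
def threshold_unit(inputs, weights, threshold):
    """Fire (1) when the weighted sum of inputs reaches the threshold, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0

def logical_and(a, b):
    # Both inputs must be active to reach a threshold of 2.
    return threshold_unit([a, b], [1, 1], threshold=2)

def logical_or(a, b):
    # A single active input is enough to reach a threshold of 1.
    return threshold_unit([a, b], [1, 1], threshold=1)

truth_table = [(a, b, logical_and(a, b), logical_or(a, b))
               for a in (0, 1) for b in (0, 1)]
```

The same unit computes different logic functions simply by changing the threshold, which is why weights and thresholds are treated as the adjustable parameters of a neuron.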
4.7.3. Basic Elements of ANN:
A neuron consists of three basic components: weights, thresholds and a single
activation function. An artificial neural network (ANN) model based on biological
neural systems is shown in Figure 4.8.
UNIT V
APPLICATIONS OF DEEP LEARNING
ImageNet - Detection - Audio WaveNet - Natural Language Processing Word2Vec - Joint
Detection BioInformatics - Face Recognition - Scene Understanding - Gathering Image Captions
5.1. ImageNet:
ImageNet is useful for many computer vision applications such as object recognition, image
classification and object localization. Prior to ImageNet, a researcher wrote one algorithm to
identify dogs, another to identify cats, and so on. After training with ImageNet, the same
algorithm could be used to identify different objects. The diversity and size of ImageNet meant
that a computer looked at and learned from many variations of the same object. These variations
could include camera angles, lighting conditions, and so on. Models built from such extensive
training were better at many computer vision tasks. ImageNet convinced researchers that
large datasets were important for algorithms and models to work well.
5.1.1. Technical details of ImageNet:
ImageNet did not define these subcategories on its own but derived these from
WordNet. WordNet is a database of English words linked together by semantic relationships.
Words of similar meaning are grouped together into a synonym set, simply called synset.
Hypernyms are synsets that are more general. Thus, "organism" is a hypernym of "plant".
Hyponyms are synsets that are more specific. Thus, "aquatic" is a hyponym of "plant". This
hierarchy makes it useful for computer vision tasks. If the model is not sure about a
subcategory, it can simply classify the image higher up the hierarchy, where the error
probability is lower. For example, if the model is unsure that it's looking at a rabbit,
it can simply classify it as a mammal.
While WordNet has 100K+ synsets, only the nouns have been considered by ImageNet.
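The idea of classifying higher up the hierarchy when unsure can be sketched as follows. Everything here is illustrative: the tiny hierarchy, the confidence threshold of 0.8 and the function names are assumptions, not ImageNet's actual synsets or any published procedure. A node's probability is taken to be the sum of the probabilities of the leaves beneath it.

```python
# Hypothetical toy hierarchy: child -> parent.
PARENT = {"rabbit": "mammal", "cat": "mammal", "mammal": "animal",
          "sparrow": "bird", "bird": "animal"}

def ancestors(node):
    """Return the chain of hypernyms from the node's parent up to the root."""
    chain = []
    while node in PARENT:
        node = PARENT[node]
        chain.append(node)
    return chain

def classify_with_fallback(leaf_probs, threshold=0.8):
    """Return the most specific label whose total probability reaches threshold."""
    totals = dict(leaf_probs)
    for leaf, p in leaf_probs.items():
        for node in ancestors(leaf):
            totals[node] = totals.get(node, 0.0) + p
    best_leaf = max(leaf_probs, key=leaf_probs.get)
    path = [best_leaf] + ancestors(best_leaf)
    for node in path:
        if totals[node] >= threshold:
            return node
    return path[-1]  # fall back to the root if nothing is confident enough

unsure = classify_with_fallback({"rabbit": 0.45, "cat": 0.40, "sparrow": 0.15})
confident = classify_with_fallback({"rabbit": 0.90, "cat": 0.05, "sparrow": 0.05})
```

In the first call the model cannot decide between rabbit and cat, so the label moves up to "mammal"; in the second call the leaf label "rabbit" is returned directly.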
Humans make mistakes and therefore we must have checks in place to overcome them.
Each human is given a task of 100 images. In each task, 6 "gold standard" images are placed
with known labels. At most 2 errors are allowed on these standard images, otherwise the task
has to be restarted.
In addition, the same image is labelled by three different humans. When there's
disagreement, such ambiguous images are resubmitted to another human with a tighter quality
threshold (only one error allowed on the standard images).
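The two checks described above (gold-standard images and agreement between labellers) can be expressed as a short sketch. The data layout (dictionaries from image id to label) and the function names are hypothetical; only the limits of 2 errors, and 1 error for the stricter pass, come from the text.

```python
def task_accepted(answers, gold_labels, max_errors=2):
    """Accept a labelling task if the errors on the gold-standard images stay within max_errors.

    answers and gold_labels map image id -> label; only gold image ids are checked.
    """
    errors = sum(1 for image_id, truth in gold_labels.items()
                 if answers.get(image_id) != truth)
    return errors <= max_errors

def needs_resubmission(labels):
    """True when the independent labels for one image do not all agree."""
    return len(set(labels)) > 1

# Six gold-standard images hidden in a task (ids and labels are made up).
gold = {"g1": "cat", "g2": "dog", "g3": "cat", "g4": "bird", "g5": "dog", "g6": "cat"}
two_wrong = dict(gold, g1="dog", g2="cat")
three_wrong = dict(two_wrong, g3="bird")

first_pass_ok = task_accepted(two_wrong, gold)                  # 2 errors allowed
strict_pass_ok = task_accepted(two_wrong, gold, max_errors=1)   # tighter threshold
```

An image labelled ["cat", "cat", "dog"] by three people would be flagged by needs_resubmission and sent to another labeller under the stricter threshold.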
For public access, ImageNet provides image thumbnails and URLs from where the original
images were downloaded. Researchers can use these URLs to download the original images.
However, those who wish to use the images for non-commercial or educational purposes can
create an account on ImageNet and request access. This will allow direct download of images
from ImageNet. This is useful when the original sources of images are no longer available.
The dataset can be explored via a browser-based user interface. Alternatively, there's also
an API. Researchers may want to read the API Documentation. This documentation also shares
how to download image features and bounding boxes.
Images are not uniformly distributed across subcategories. One research team that
considered 200 subcategories found that the top 11 had 50% of the images, followed
by a long tail.
When classifying people, ImageNet uses labels that are racist, misogynist and offensive.
People are treated as objects. Their photos have been used without their knowledge. About
5.8% of the labels are wrong. ImageNet lacks geodiversity. Most of the data represents North
America and Europe. China and India are represented in only 1% and 2.1% of the images
respectively. This implies that models trained on ImageNet will not work well when applied
to the developing world.
Another study from 2016 found that 30% of ImageNet's image URLs are broken. This amounts to
about 4.4 million lost annotations. Copyright laws prevent caching and redistribution of these
images by ImageNet itself.
5.2. WaveNet:
WaveNet is a deep generative model of raw audio waveforms. Its authors show that WaveNets
are able to generate speech which mimics any human voice and which sounds more natural
than the best existing text-to-speech systems, reducing the gap with human performance by
over 50%. Allowing people to converse with machines is a long-standing dream of
human-computer interaction. The ability of computers to understand natural speech has been
revolutionised in the last few years by the application of deep neural networks. However,
generating speech with computers, a process usually referred to as speech synthesis or
text-to-speech (TTS), is still largely based on so-called concatenative TTS, where a very
large database of short speech fragments is recorded from a single speaker and then
recombined to form complete utterances. This makes it difficult to modify the voice (for
example switching to a different speaker, or altering the emphasis or emotion of their speech)
without recording a whole new database.
This has led to a great demand for parametric TTS, where all the information required to
generate the data is stored in the parameters of the model, and the contents and characteristics
of the speech can be controlled via the inputs to the model. So far, however, parametric TTS
has tended to sound less natural than concatenative. Existing parametric models typically
generate audio signals by passing their outputs through signal processing algorithms known
as vocoders. WaveNet changes this paradigm by directly modelling the raw waveform of the
audio signal, one sample at a time. As well as yielding more natural-sounding speech, using
raw waveforms means that WaveNet can model any kind of audio, including music.
Typically, speech audio has a sampling rate of 22 kHz or 16 kHz. For a few seconds of speech,
this means more than 100K values for a single example, which is too much for the network
to consume. Hence, we need to restrict the size, preferably to around 8K. At the output, the
values are predicted in Q channels (e.g. Q = 256 or 65536), which are compared to the original
audio data compressed to Q distinct values. For that, μ-law quantization can be used:
it maps the values to the Q integer levels 0 to Q-1. The loss can be computed either by
cross-entropy or by a discretized logistic mixture.
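A minimal sketch of μ-law quantization, assuming amplitudes normalized to [-1, 1] and μ = Q - 1, using the standard companding formula f(x) = sign(x) ln(1 + μ|x|) / ln(1 + μ). The rounding convention and the function names are choices made here for illustration.

```python
import math

def mu_law_encode(x, q=256):
    """Map an amplitude x in [-1, 1] to an integer class in 0..q-1."""
    mu = q - 1
    # Compress: small amplitudes get finer resolution than large ones.
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    # Shift from [-1, 1] to [0, mu] and round to the nearest integer level.
    return int(math.floor((compressed + 1.0) / 2.0 * mu + 0.5))

def mu_law_decode(index, q=256):
    """Invert mu_law_encode up to quantization error."""
    mu = q - 1
    compressed = 2.0 * index / mu - 1.0
    return math.copysign(math.expm1(abs(compressed) * math.log1p(mu)) / mu, compressed)

levels = [mu_law_encode(x) for x in (-1.0, 0.0, 1.0)]
worst_error = max(abs(mu_law_decode(mu_law_encode(k / 100.0)) - k / 100.0)
                  for k in range(-100, 101))
```

Each audio sample becomes one of Q class indices, so the network output at every time step can be treated as a Q-way classification and trained with cross-entropy.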
And the element-wise addition of a skip connection and output of causal 1D results in
the residual
The above diagram (Figure 5.3) shows the phases or logical steps involved in natural
language processing.
5.4. Word2Vec:
Word embedding is one of the most popular representations of document vocabulary. It
is capable of capturing the context of a word in a document, semantic and syntactic similarity,
relation with other words, etc. What are word embeddings exactly? Loosely speaking, they
are vector representations of a particular word. Having said this, how do we generate
them? More importantly, how do they capture the context? Word2Vec is one of the most
popular techniques to learn word embeddings using a shallow neural network. It was developed
by Tomas Mikolov in 2013 at Google.
The purpose and usefulness of Word2vec is to group the vectors of similar words
together in vector space. That is, it detects similarities mathematically. Word2vec creates
vectors that are distributed numerical representations of word features, features such as the
context of individual words. It does so without human intervention.
Given enough data, usage and contexts, Word2vec can make highly accurate guesses
about a word’s meaning based on past appearances. Those guesses can be used to establish a
word’s association with other words (e.g. “man” is to “boy” what “woman” is to “girl”), or
cluster documents and classify them by topic. Those clusters can form the basis of search,
sentiment analysis and recommendations in such diverse fields as scientific research, legal
discovery, e-commerce and customer relationship management. When similarity is measured by the
cosine of the angle between two word vectors, no similarity corresponds to a 90 degree angle
(cosine 0), while total similarity of 1 corresponds to a 0 degree angle, i.e. complete overlap.
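The relationship between cosine similarity and angle can be checked numerically. The vectors below are made-up two-dimensional examples, not real Word2vec embeddings, and the function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def angle_degrees(u, v):
    # Clamp to [-1, 1] to guard against rounding error before calling acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine_similarity(u, v)))))

orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])   # unrelated directions
parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])     # same direction, different length
```

Note that cosine similarity ignores vector length: the two parallel vectors above have similarity 1 even though one is twice as long as the other.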
Word2vec is a two-layer neural net that processes text by “vectorizing” words. Its
input is a text corpus and its output is a set of vectors: feature vectors that represent words in
that corpus. While Word2vec is not a deep neural network, it turns text into a numerical form
that deep neural networks can understand.
Figure 5.4: Two models of Word2Vec (A- CBOW & B- Skip-Gram model)
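The difference between the two models lies in how training examples are formed from the text. As a rough sketch: CBOW predicts the center word from its surrounding context words, while skip-gram predicts each context word from the center word. The window size and the function names below are illustrative assumptions; the neural network training itself is not shown.

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: skip-gram predicts each context word from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """(context words, center) examples: CBOW predicts the center word from its context."""
    examples = []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        examples.append((context, center))
    return examples

sentence = ["the", "cat", "sat"]
sg = skipgram_pairs(sentence, window=1)
cb = cbow_examples(sentence, window=1)
```

Both functions walk the same sliding window over the corpus; only the direction of prediction differs.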
Finally, an activation function such as softmax or sigmoid is used to classify the outputs as
Normal and Abnormal.
5.5.1 Steps Involved:
• Provide the input image to the convolution layer
• Choose parameters and apply filters with strides, and padding if required. Perform convolution
on the image and apply ReLU activation to the resulting matrix.
• Perform pooling to reduce the dimensionality
• Add convolutional layers until satisfied
• Flatten the output and feed it into a fully connected layer (FC layer)
• Output the class using an activation function (logistic regression with a cost function)
to classify the images.
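The steps above can be traced in a small forward pass written in plain Python. This is a sketch under several assumptions: a single-channel image, one square filter, one convolution layer, a 2x2 max pool, and a sigmoid output whose two classes are called Normal and Abnormal. The weights are hand-picked rather than trained, and all function names are hypothetical.

```python
import math

def conv2d(image, kernel, stride=1, padding=0):
    """Slide a square kernel over a 2-D image, with optional zero padding."""
    if padding:
        width = len(image[0]) + 2 * padding
        side = [0.0] * padding
        blank = [[0.0] * width for _ in range(padding)]
        image = blank + [side + list(row) + side for row in image] + blank
    k = len(kernel)
    return [[sum(image[r + i][c + j] * kernel[i][j] for i in range(k) for j in range(k))
             for c in range(0, len(image[0]) - k + 1, stride)]
            for r in range(0, len(image) - k + 1, stride)]

def relu(fmap):
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool(fmap, size=2):
    """Keep the largest value in each non-overlapping size x size block."""
    return [[max(fmap[r + i][c + j] for i in range(size) for j in range(size))
             for c in range(0, len(fmap[0]) - size + 1, size)]
            for r in range(0, len(fmap) - size + 1, size)]

def classify(image, kernel, fc_weights, fc_bias):
    """Convolution -> ReLU -> pooling -> flatten -> fully connected -> sigmoid."""
    features = max_pool(relu(conv2d(image, kernel, stride=1, padding=1)))
    flat = [v for row in features for v in row]
    logit = sum(w * v for w, v in zip(fc_weights, flat)) + fc_bias
    prob = 1.0 / (1.0 + math.exp(-logit))
    return ("Abnormal" if prob >= 0.5 else "Normal"), prob

# Toy inputs: a bright 4x4 image, a dark one, and an averaging filter.
bright = [[1.0] * 4 for _ in range(4)]
dark = [[0.0] * 4 for _ in range(4)]
kernel = [[1.0 / 9.0] * 3 for _ in range(3)]
label_bright, p_bright = classify(bright, kernel, [1.0, 1.0, 1.0, 1.0], -2.0)
label_dark, p_dark = classify(dark, kernel, [1.0, 1.0, 1.0, 1.0], -2.0)
```

In practice the filter and fully connected weights are learned by minimizing a cost function, and more convolution and pooling layers are stacked before the flatten step.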
5.6. Other Applications:
Similarly, for other applications such as facial recognition and scene
matching, appropriate deep learning based algorithms such as AlexNet,
VGG, Inception and ResNet, and/or deep learning based LSTMs or RNNs, can be used. These
networks have to be explained with the necessary diagrams and appropriate explanations.
Reference Books:
1. B. Yegnanarayana, “Artificial Neural Networks” Prentice Hall Publications.
2. Simon Haykin, “Artificial Neural Networks”, Second Edition, Pearson Education.
3. Laurene Fausett, “Fundamentals of Neural Networks, Architectures, Algorithms and
Applications”, Prentice Hall publications.
4. Cosma Rohilla Shalizi, Advanced Data Analysis from an Elementary Point of View, 2015.
5. Deng & Yu, Deep Learning: Methods and Applications, Now Publishers, 2013.
6. Ian Goodfellow, Yoshua Bengio, Aaron Courville, Deep Learning, MIT Press, 2016.
7. Michael Nielsen, Neural Networks and Deep Learning, Determination Press, 2015.