Machine Learning
Lecture Notes
Syllabus
UNIT-I
• Introduction: Representation and Learning: Feature Vectors, Feature
Spaces, Feature Extraction and Feature Selection, Learning Problem
Formulation
• Types of Machine Learning Algorithms: Parametric and Non-parametric
Machine Learning Algorithms, Supervised, Unsupervised, Semi-
Supervised and Reinforced Learning.
• Preliminaries: Overfitting, Training, Testing, and Validation Sets, The
Confusion Matrix, Accuracy Metrics: Evaluation Measures: SSE, RMSE,
R2, confusion matrix, precision, recall, F-Score, Receiver Operator
Characteristic (ROC) Curve. Unbalanced Datasets. some basic statistics:
Averages, Variance and Covariance, The Gaussian, the bias-variance
tradeoff.
UNIT-II
• Supervised Algorithms: Regression: Linear Regression, Logistic
Regression, Linear Discriminant Analysis. Classification: Decision Tree,
Naïve Bayes, K-Nearest Neighbors, Support Vector Machines, evaluation
of classification: cross validation, hold out.
UNIT-III
• Ensemble Algorithms: Bagging, Random Forest, Boosting
• Unsupervised Learning: Cluster Analysis: Similarity Measures, categories
of clustering algorithms, k-means, Hierarchical, Expectation-
Maximization Algorithm, Fuzzy c-means algorithm
UNIT-IV
• Neural Networks: Multilayer Perceptron, Back-propagation algorithm,
Training strategies, Activation Functions, Gradient Descent for Machine
Learning, Radial basis functions, Hopfield network, Recurrent Neural
Networks.
• Deep learning: Introduction to deep learning, Convolutional Neural
Networks (CNN), CNN
• Architecture, pre-trained CNN (LeNet, AlexNet).
UNIT-V


• Reinforcement Learning: overview, example: getting lost, State and Action


Spaces, The Reward Function, Discounting, Action Selection, Policy,
Markov Decision Processes, Q-learning, uses of Reinforcement Learning.
• Applications of Machine Learning in various fields: Text Classification,
Image Classification, Speech Recognition.

TEXTBOOKS
1. Stephen Marsland, Machine Learning: An Algorithmic Perspective, Second Edition, Chapman & Hall/CRC Machine Learning & Pattern Recognition (2014).
2. Tom Mitchell, Machine Learning, McGraw-Hill Science/Engineering/Math (1997).
3. Kevin Murphy, Machine Learning: A Probabilistic Perspective, MIT Press (2012).
4. Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, Springer Series in Statistics (2009).
5. Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer (2006).
6. Uma N. Dulhare, Khaleel Ahmad, Khairol Amali Bin Ahmad, Machine Learning and Big Data: Concepts, Algorithms, Tools and Applications, Scrivener Publishing, Wiley (2020).

UNIT 4

Important Questions
1. Perceptron rule
2. Delta rule
3. Back propagation based neural network
4. Activation Functions
5. Gradient Descent for Machine Learning
6. Radial basis functions
7. Hopfield network
8. Recurrent Neural Networks.
9. Convolutional Neural Networks
10. Architecture, pre-trained CNN (LeNet, AlexNet).

1.Perceptron rule
A perceptron is a function that multiplies its input x by the learned weight coefficients wi to produce a weighted sum e, which is then non-linearly transformed to generate the output y = f(e).

It is an error-correcting rule for a single-layer feedforward network. It is

supervised in nature: it calculates the error between the desired and actual output, and the weights are adjusted only when an error is present. The perceptron learning rule states that the algorithm will automatically learn the optimal weight coefficients.

Weights are modified at each step according to the perceptron training rule, which
revises the weight $w_i$ associated with input $x_i$ according to the rule
$$w_i \leftarrow w_i + \Delta w_i, \qquad \Delta w_i = \eta\,(t - o)\,x_i$$
Here t is the target output for input $x_i$,
o is the output generated by the perceptron, and
$\eta$ is a positive constant called the learning rate.
The role of the learning rate is to moderate the degree to which weights are
changed at each step.
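
As a concrete illustration of this update rule, here is a minimal NumPy sketch (an illustrative addition, not part of the original notes) that trains a single perceptron with a step activation on the logical AND function; the data, learning rate, and epoch count are assumed values.

```python
import numpy as np

# Toy data: logical AND, with a bias input fixed to 1 in column 0
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])
t = np.array([0, 0, 0, 1])          # target outputs

w = np.zeros(3)                     # weights (including the bias weight)
eta = 0.1                           # learning rate

for epoch in range(20):
    for x_i, t_i in zip(X, t):
        o = 1 if np.dot(w, x_i) > 0 else 0   # step activation
        w += eta * (t_i - o) * x_i           # perceptron rule: w <- w + eta*(t - o)*x
print("learned weights:", w)
```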

2.Delta rule
The delta rule was developed by Bernard Widrow and Marcian Hoff and is a supervised learning rule. It is also known as the Least Mean Square (LMS) method, and it minimizes the error over all the training patterns.

Error is the difference between a single actual value and a single predicted
value.
• For a sample $x \in D$, the error is $e = y - \hat{y}$
• where y is the observed value and $\hat{y}$ is the value predicted by the model

Loss is the average error over training data.


• For $x \in D$, $\text{Loss } L = \dfrac{1}{N}\left(e_1 + e_2 + \dots + e_N\right)$
• Where $e_i$ is the error of the ith sample of the N given samples
• Risk is the average error over all data

To calculate the direction of steepest descent along the error surface


• It is the derivative of the loss L with respect to each component of the weight vector $\vec{w}$.
• This vector derivative is called the gradient of the loss L with respect to $\vec{w}$, written $\nabla L(\vec{w})$.
• $\nabla L(\vec{w}) = \left[\dfrac{\partial L}{\partial w_1}, \dfrac{\partial L}{\partial w_2}, \dots, \dfrac{\partial L}{\partial w_n}\right]$
• $\nabla L(\vec{w})$ is itself a vector (it has a direction), whose components are the partial derivatives of L with respect to each of the $w_i$.
• When interpreted as a vector in weight space, the gradient specifies the direction that produces the steepest increase in the loss L.
• The negative of this vector therefore gives the direction of steepest decrease.
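
The following short NumPy sketch (an illustrative addition, not from the original notes) applies the delta/LMS update $\Delta w_i = \eta (y - o) x_i$ to a linear unit on synthetic data; the data generation and hyper-parameters are assumptions.

```python
import numpy as np

# Delta (LMS) rule sketch for a linear unit o = w.x; data and settings are illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # 100 samples, 3 features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)   # noisy targets

w = np.zeros(3)
eta = 0.01
for epoch in range(50):
    for x_i, y_i in zip(X, y):
        o = w @ x_i                 # linear output (no threshold, unlike the perceptron)
        w += eta * (y_i - o) * x_i  # move w along the negative gradient of the squared error
print("estimated weights:", w)
```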

3.Back propagation based neural network


• An artificial neural network is a computing system modelled on the human brain and nervous system.
• Artificial neural networks (ANNs) are comprised of node layers,
containing an input layer, one or more hidden layers, and an output layer.

• Each node, or artificial neuron, connects to another and has an associated


weight and threshold.
• If the output of any individual node is above the specified threshold value,
that node is activated, sending data to the next layer of the network.
• Otherwise, no data is passed along to the next layer of the network.

Each neuron is composed of two units. The first unit adds the products of the weight
coefficients and the input signals. The second unit applies a nonlinear neuron transfer
(activation) function. y = f(e) is the output signal of this nonlinear element,
and is also the output signal of the neuron. The bias gives each neuron a firing threshold
that determines the value its input must reach before it activates. This
threshold is adjustable, so that we can change the value at which the neuron fires.
It is usually represented as θ.

• To train the neural network we need a training data set.

• The training data set consists of input signals (x1 and x2) assigned with a
corresponding target (desired output) z.
• Network training is an iterative process. In each iteration the weight
coefficients of the nodes are modified using new data from the training data set.
• The modification is calculated using the algorithm described below.
• Each training step starts with presenting the input signals from the training set.
• Determine the output signal values for each neuron in each network layer.
• Symbols $w_{(xm)n}$ represent the weights of the connections between network
input $x_m$ and neuron n in the input layer. Symbols $y_n$ represent the output signal
of neuron n, e.g.

$$y_1 = f_1\big(w_{(x1)1}\,x_1 + w_{(x2)1}\,x_2\big), \qquad y_2 = f_2\big(w_{(x1)2}\,x_1 + w_{(x2)2}\,x_2\big), \ \dots$$

Propagation of signals through the hidden layer: symbols $w_{mn}$ represent the weights
of the connections between the output of neuron m and the input of neuron n in the next
layer, e.g.

$$y_4 = f_4\big(w_{14}\,y_1 + w_{24}\,y_2 + w_{34}\,y_3\big)$$

In the next step of the algorithm the output signal of the network y is compared with the
desired output value (the target z), which is found in the training data set.
The difference is called the error signal δ of the output layer neuron.

Learning Algorithm using Backpropagation


• Training algorithm
• Backpropagation of the error
• Update of weights and biases
• Test the stopping condition

Parameters
• x = input training vector, x = (x1, x2, …, xn)
• y = target vector, y = (y1, y2, …, yn)
• δk = error at output unit k
• δj = error at hidden unit j
• α = learning rate
• v0j = bias of hidden unit j
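
To make the listed steps concrete, here is a minimal NumPy sketch of backpropagation for a single hidden layer with sigmoid units, trained on a small XOR-style example. The architecture, variable names (V, W, b_v, b_w), and hyper-parameters are illustrative assumptions, not the exact algorithm from the notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR-style toy data; architecture and hyper-parameters are illustrative
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(1)
V = rng.normal(scale=0.5, size=(2, 4))   # input -> hidden weights
b_v = np.zeros(4)                        # hidden biases (v0j)
W = rng.normal(scale=0.5, size=(4, 1))   # hidden -> output weights
b_w = np.zeros(1)                        # output bias
alpha = 0.5                              # learning rate

for epoch in range(5000):
    # forward pass
    z_h = sigmoid(X @ V + b_v)           # hidden activations
    y = sigmoid(z_h @ W + b_w)           # network output
    # backpropagate the error
    delta_k = (T - y) * y * (1 - y)              # error at the output units
    delta_j = (delta_k @ W.T) * z_h * (1 - z_h)  # error at the hidden units
    # update weights and biases
    W += alpha * z_h.T @ delta_k
    b_w += alpha * delta_k.sum(axis=0)
    V += alpha * X.T @ delta_j
    b_v += alpha * delta_j.sum(axis=0)

print("outputs after training:", y.ravel().round(2))
```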

4.Activation Functions
Each neuron is composed of two units. The first unit adds the products of the weight
coefficients and the input signals. The second unit applies a nonlinear neuron transfer
(activation) function. The purpose of the activation function is to introduce non-linearity
into the output of a neuron. y = f(e) is the output signal of this nonlinear element,
and is also the output signal of the neuron. The bias gives each neuron a firing threshold
that determines the value its input must reach before it activates. This
threshold is adjustable, so that we can change the value at which the neuron fires.
It is usually represented as θ.

A neural network without an activation function is essentially just a linear

regression model. The activation function performs the non-linear transformation of
the input, making the network capable of learning and performing more complex tasks.
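
A minimal sketch (an illustrative addition, not from the original notes) of three commonly used activation functions applied to the weighted sum e.

```python
import numpy as np

def sigmoid(e):
    return 1.0 / (1.0 + np.exp(-e))   # squashes to (0, 1)

def tanh(e):
    return np.tanh(e)                 # squashes to (-1, 1)

def relu(e):
    return np.maximum(0.0, e)         # passes positive values, zeroes out negatives

e = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (sigmoid, tanh, relu):
    print(f.__name__, f(e))
```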

5.Gradient Descent for Machine Learning


Delta rule
The delta rule was developed by Bernard Widrow and Marcian Hoff and is a supervised
learning rule. It is also known as the Least Mean Square (LMS) method, and it
minimizes the error over all the training patterns.

Error is the difference between a single actual value and a single predicted value.
• For a sample $x \in D$, the error is $e = y - \hat{y}$
• where y is the observed value and $\hat{y}$ is the value predicted by the model

Loss is the average error over training data.

• For $x \in D$, $\text{Loss } L = \dfrac{1}{N}\left(e_1 + e_2 + \dots + e_N\right)$
• Where $e_i$ is the error of the ith sample of the N given samples
• Risk is the average error over all data

To calculate the direction of steepest descent along the error surface


• It is the derivative of the loss L with respect to each component of the weight vector $\vec{w}$.
• This vector derivative is called the gradient of the loss L with respect to $\vec{w}$, written $\nabla L(\vec{w})$.
• $\nabla L(\vec{w}) = \left[\dfrac{\partial L}{\partial w_1}, \dfrac{\partial L}{\partial w_2}, \dots, \dfrac{\partial L}{\partial w_n}\right]$
• $\nabla L(\vec{w})$ is itself a vector (it has a direction), whose components are the partial derivatives of L with respect to each of the $w_i$.
• When interpreted as a vector in weight space, the gradient specifies the direction that produces the steepest increase in the loss L.
• The negative of this vector therefore gives the direction of steepest decrease.
• Optimizing the loss function $L(w)$:
   • Almost all NN models these days are trained with a variant of the gradient descent (GD) algorithm.
   • GD applies iterative refinement of the network parameters.
   • GD uses the opposite direction of the gradient of the loss with respect to the NN parameters, i.e. $\nabla L(w) = \left[\partial L / \partial w_i\right]$, for updating w.
Gradient Descent Algorithm
• Steps in the gradient descent algorithm:

1. Randomly initialize the model parameters, $w^0$.
2. Compute the gradient of the loss function at $w^0$: $\nabla L(w^0) = \left[\partial L(w)/\partial w_i\right]$.
3. Update the parameters as: $w^{\text{new}} = w^{\text{old}} - \alpha\, \nabla L(w^{\text{old}})$
   • where α is the learning rate.
4. Go to step 2 and repeat (until a terminating criterion is reached).

Gradient descent algorithm stops when a local minimum of the loss surface is
reached
• GD does not guarantee reaching a global minimum.
• However, empirical evidence suggests that GD works well for NNs.

[Figure: the loss surface L(w) plotted against the weights w.]
• Based on how much training data is used to compute each update, the gradient
descent learning algorithm can be categorized into:
   • Batch gradient descent
   • Stochastic gradient descent
   • Mini-batch gradient descent

Batch gradient descent


• Batch gradient descent (BGD) computes the error for each point in the
training set and updates the model only after evaluating all training examples.
• One complete pass over the training set is known as a training epoch.
• In simple words, we sum the gradient over all examples for each update.

Machine Learning Lecture Notes for CSE by Dr Diana Moses, MCET UNIT 4, Pg - 13

Downloaded by Mahareddy Ankithreddy ([email protected])


lOMoARcPSD|45565536

Advantages
• It produces less noise in comparison to the other gradient descent variants.
• It produces stable gradient descent convergence.
• It is computationally efficient, since all resources are used to process all training
samples together.

Stochastic gradient descent


• It runs one training example per iteration.
• It processes the training examples one at a time and updates the parameters
after every example.
• Only one training example is held at a time → easier to store in the allocated memory.
• Frequent updates → some loss of computational efficiency.
• Frequent updates → a noisy gradient.
• However, this noise can sometimes be helpful in finding the global minimum and
in escaping local minima.
Advantages
• It is easier to fit in the available memory.
• It is faster to compute than batch gradient descent.
• It is more efficient for large datasets.
• It is helpful in finding the global minimum.
Mini-Batch Gradient Descent
• A combination of batch gradient descent and stochastic gradient descent.
• It divides the training dataset into small batches → updates are performed per batch.
• Smaller batches → it maintains the computational efficiency of batch gradient
descent and the speed of stochastic gradient descent.
• Hence, we can achieve a gradient descent with higher computational efficiency
and a less noisy gradient.
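
The sketch below (an illustrative addition, not from the original notes) uses one generic update loop for a linear least-squares model and shows how the batch size alone distinguishes batch, stochastic, and mini-batch gradient descent; the model, data, and hyper-parameters are assumptions.

```python
import numpy as np

def gradient(w, Xb, yb):
    """Gradient of the mean squared error of a linear model on a (mini-)batch."""
    return 2.0 / len(Xb) * Xb.T @ (Xb @ w - yb)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

def gradient_descent(batch_size, alpha=0.05, epochs=100):
    w = np.zeros(3)
    for _ in range(epochs):
        idx = rng.permutation(len(X))            # shuffle once per epoch
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            w -= alpha * gradient(w, X[batch], y[batch])   # w_new = w_old - alpha * grad
    return w

print("batch GD     :", gradient_descent(batch_size=len(X)))   # whole dataset per update
print("stochastic GD:", gradient_descent(batch_size=1))        # one example per update
print("mini-batch GD:", gradient_descent(batch_size=32))       # small batches per update
```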

6.Radial basis functions


• A Radial Basis Function Network (RBFN) is a particular type of neural network.
• It has only 2 layers, the hidden layer and the output layer, excluding the
input layer.
• It differs from a multilayer perceptron in the way the hidden units
implement their computations.

• A Radial basis function is a function whose value depends only on the


Euclidean distance from the origin.

• Alternative forms of radial basis functions are defined as the Euclidean


distance from another point denoted C, called a center.

• Any function whose result depends only on the distance from the origin or from a
center c is called a radial basis function.
• The radial basis kernel is a kernel function used in machine learning
to find a non-linear classifier or regression line.
• A kernel function is used to transform an n-dimensional input into an
m-dimensional space, where m is much higher than n, and then to compute the dot
product in the higher-dimensional space efficiently.
• In the RBF model the hidden units provide a set of “functions” that
constitute an arbitrary “basis” for the input patterns when they are
expanded to the hidden space.
• The inspiration for the RBF model is based on Cover’s theorem (1965) on
the separability of patterns:
• “A complex pattern-classification problem cast in a high-dimensional
space nonlinearly is more likely to be linearly separable than in a low-
dimensional space”.
• The hidden layer provides a non-linear transformation of the input space
to the hidden space, which is assumed usually of high enough dimension.
• The output layer linearly combines the activations of the hidden layer
• Note: The RBF model owes its development to ideas of fitting hyper-surfaces
to data points in a high-dimensional space.

• The output hi of each hidden unit i is then computed by applying the basis
function G to the distance di between the input and the unit's center:
• hi = G(di, σi)
• The basis function is a curve (typically a Gaussian function) which has a
peak at zero distance and decreases as the distance from the center
increases.

• The transformation at hidden layer is nonlinear, whereas the


transformation at output layer is linear.
• The jth output is computed as
$$y_j = f_j(x) = w_{0j} + \sum_{i=1}^{J} w_{ij}\, \varphi\big(\lVert x - c_i \rVert\big), \qquad j = 1, 2, \dots$$
• An RBFN performs classification by measuring the input’s similarity to
examples from the training set.

• Each RBFN neuron stores a “prototype”, which is just one of the examples
from the training set.
• The basic idea of this model is that the entire feature vector space is
partitioned by Gaussian neural nodes, where each node generates a signal
corresponding to an input vector, and strength of the signal produced by
each neuron depends on the distance between its center and the input
vector.
• Also for inputs lying closer in Euclidian vector space, the output signals
that are generated must be similar.
• Here, c is the center of the neuron and φ(x) is the response of the neuron
corresponding to the input x.

Advantages
• Good Generalization
• Faster Training
• Only one hidden layer
• Strong tolerance to input noise
• Easy interpretation of the meaning or function of each node in the hidden
layer
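
A minimal sketch of an RBFN of the kind described above (an illustrative addition): it assumes Gaussian basis functions, centers fixed on a grid (in practice they are often chosen by clustering), and output weights fitted by linear least squares, since the output layer is linear; the data and σ are assumed values.

```python
import numpy as np

def gaussian_rbf(X, centers, sigma):
    # h_i = exp(-||x - c_i||^2 / (2*sigma^2)): peak at zero distance, decays with distance
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0])                               # target function to approximate

centers = np.linspace(-3, 3, 10).reshape(-1, 1)   # prototype centers (fixed on a grid here)
sigma = 0.7

H = gaussian_rbf(X, centers, sigma)               # nonlinear hidden-layer activations
H = np.hstack([np.ones((len(X), 1)), H])          # add a bias column (w_0j)
w, *_ = np.linalg.lstsq(H, y, rcond=None)         # linear output layer -> least squares fit

y_hat = H @ w
print("training MSE:", np.mean((y - y_hat) ** 2))
```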

7.Hopfield network
• A Hopfield network is a particular type of single-layered, fully connected
recurrent network of n neurons. Dr. John J. Hopfield invented it in 1982.
• These networks were introduced to collect and retrieve memory and store
various patterns.

• Also, auto-association and optimization of the task can be done using these
networks.
• Types
• Discrete Hopfield Network
• Binary (0/1)
• Bipolar (-1/1)
• Continuous Hopfield Network
• In this network, each node is fully (recurrently) connected to the other nodes.
• It behaves in a discrete manner, i.e. it gives finite distinct outputs:
   • Binary → ON (1) or OFF (0)
   • Bipolar → −1 or +1
• These outputs/states can be restored based on the input received from other
nodes.
• Unlike other neural networks, the output of the Hopfield network is finite.
• Also, the input and output sizes must be the same in these networks
• This model consists of neurons with one inverting and one non-inverting
output.
• The output of each neuron should be the input of other neurons but not the
input of self.
• Weight/connection strength is represented by $w_{ij}$.
• The weights are symmetric in nature and have the following properties:
   • $w_{ij} = w_{ji}$
   • $w_{ii} = 0$

• [x1, x2, …, xn] → input to the n given neurons.

• [y1, y2, …, yn] → output obtained from the n given neurons.
• wij → weight associated with the connection between the ith and the jth
neuron.
• For storing a set of input patterns S(p), p = 1 to P, where S(p) = (S1(p), …,
Si(p), …, Sn(p)), the weight matrix is given by:
• For binary patterns: $w_{ij} = \sum_{p=1}^{P} \big(2 S_i(p) - 1\big)\big(2 S_j(p) - 1\big)$ for $i \neq j$

• For bipolar patterns: $w_{ij} = \sum_{p=1}^{P} S_i(p)\, S_j(p)$ for $i \neq j$ (with $w_{ii} = 0$ in both cases)


• Discrete Hopfield Networks
Training Algorithm
1. Initialize the weights (wij) to store the patterns (using the storage rule above).
2. For each input vector x, perform steps 3-7.
3. Make the initial activations of the network equal to the input vector x: $y_i = x_i$.

4. For each unit yi, perform steps 5-7.

5. Calculate the total input of the network: $y_{in,i} = x_i + \sum_{j} y_j\, w_{ji}$.

6. Apply the activation over the total input to calculate the output:
   $y_i = 1$ if $y_{in,i} > \theta_i$; $y_i$ unchanged if $y_{in,i} = \theta_i$; $y_i = 0$ if $y_{in,i} < \theta_i$.

7. Feed the obtained output yi back to all other units, so that the activation
vector is updated.
8. Test the network for convergence.

Hopfield Networks Energy Function


• An energy function is defined as a function that is bounded and non-increasing
in the state of the system.
• The energy function Ef, also called a Lyapunov function, determines the
stability of the discrete Hopfield network.

Machine Learning Lecture Notes for CSE by Dr Diana Moses, MCET UNIT 4, Pg - 19

Downloaded by Mahareddy Ankithreddy ([email protected])


lOMoARcPSD|45565536

• $E_f = -\dfrac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} y_i\, y_j\, w_{ij} - \sum_{i=1}^{n} x_i\, y_i + \sum_{i=1}^{n} \theta_i\, y_i$
• Condition: in a stable network, whenever the state of a node changes, the
energy function will decrease.
Continuous Hopfield Network
• The Hopfield network consists of associative memory.
• This memory allows the system to retrieve the memory using an
incomplete portion.
• The network can restore the closest pattern using the data captured in
associative memory.
• This feature of Hopfield networks makes it a good candidate for pattern
recognition.
• The output of each node is $v_i = g(u_i)$, where
• vi = output of node i
• ui = internal activity of node i
• g can be any activation function

• The Hopfield networks have an energy function associated with them.


• It either diminishes or remains unchanged on update (feedback) after every
iteration.

• When the energy function reaches its minimum the network converges to
a stable configuration
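
A minimal sketch of a discrete bipolar Hopfield network (an illustrative addition): Hebbian storage of two patterns followed by asynchronous recall from a corrupted input. The patterns and the number of update sweeps are assumed values.

```python
import numpy as np

def train_hopfield(patterns):
    """Hebbian storage for bipolar (+1/-1) patterns: w_ij = sum_p s_i(p) s_j(p), w_ii = 0."""
    W = patterns.T @ patterns
    np.fill_diagonal(W, 0)          # no self-connections
    return W

def recall(W, state, sweeps=10):
    """Asynchronous updates: each unit takes the sign of its total input."""
    state = state.copy()
    for _ in range(sweeps):
        for i in np.random.permutation(len(state)):
            state[i] = 1 if W[i] @ state >= 0 else -1
    return state

patterns = np.array([[1, -1, 1, -1, 1, -1],
                     [1, 1, 1, -1, -1, -1]])
W = train_hopfield(patterns)

noisy = patterns[0].copy()
noisy[0] *= -1                      # flip one bit to corrupt the stored pattern
print("recovered:", recall(W, noisy))
print("original :", patterns[0])
```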

8.Recurrent Neural Networks


• Apple’s Siri and Google’s voice search both use Recurrent Neural
Networks (RNNs), which are the state-of-the-art method for sequential
data.
• It’s the first algorithm with an internal memory that remembers its previous
input, output, making it perfect for problems involving sequential data in
machine learning.
• type of Neural Network where the output from the previous step is fed as
input to the current step.

Machine Learning Lecture Notes for CSE by Dr Diana Moses, MCET UNIT 4, Pg - 20

Downloaded by Mahareddy Ankithreddy ([email protected])


lOMoARcPSD|45565536

• In traditional neural networks, all the inputs and outputs are independent
of each other, but to predict the next word of a sentence, there is a need to
remember the previous words.
• The most important feature of RNN is its Hidden state, which remembers
some information about a sequence.
• The state is also referred to as Memory State since it remembers the
previous input to the network.
• It uses the same parameters for each input as it performs the same task on
all the inputs or hidden layers to produce the output.
• This reduces the complexity of parameters, unlike other neural networks.

• An RNN uses the same building blocks as a feedforward network; however, differences
arise in the way information flows from input to output.
• Unlike feedforward NNs, which have different weight matrices for each layer, in an RNN
the weights remain the same across the network (they are shared across time steps).
• Input: x(t) is taken as the input to the network at time step t.
• Hidden state: h(t) represents a hidden state at time t and acts as “memory”
of the network.
• h(t) is calculated based on the current input and the previous time step’s
hidden state:
• The function f is taken to be a non-linear transformation such as tanh,
ReLU.
• h(t) = f(Ux(t) + Wh(t−1) + b)
• O(t) = Vh(t) + c
Recurrent Neural Networks Training
• Weights: the RNN has
   • input → hidden weight matrix U
   • hidden → hidden recurrent weight matrix W

   • hidden → output weight matrix V

• All these weights (U, V, W) are shared across time.
• Output: o(t) is the output of the network at time step t.
• Forward Pass
   • h(t) = f(Ux(t) + Wh(t−1) + b)
   • o(t) = Vh(t) + c
   • $\hat{y}(t) = \mathrm{softmax}\big(o(t)\big)$
• Softmax function converts a vector of K values into a set of probabilities
of K outcomes
• A recurrent neural network uses a backpropagation algorithm for training,
but backpropagation happens at every timestamp, which is why it is
commonly called backpropagation through time.
Vanishing Gradient
• The main issue with the basic recurrent layer is the vanishing gradient problem.
• Because of this, it is not very good at learning long-term correlations.
• The basic recurrent layer therefore does not handle long sequences very well.

Exploding Gradient
• The opposite problem can also occur: gradients can grow exponentially as they are
propagated back through many time steps, producing unstable, very large weight
updates (commonly controlled with gradient clipping).
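
A minimal sketch of the RNN forward pass described above (an illustrative addition), with the shared matrices U, W, V applied at every time step; the dimensions and random inputs are assumed values, and training with backpropagation through time is not shown.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                  # numerical stability
    e = np.exp(z)
    return e / e.sum()

# Dimensions and weights are illustrative; U, W, V are shared across all time steps.
rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 5, 2
U = rng.normal(scale=0.1, size=(n_hid, n_in))    # input  -> hidden
W = rng.normal(scale=0.1, size=(n_hid, n_hid))   # hidden -> hidden (recurrent)
V = rng.normal(scale=0.1, size=(n_out, n_hid))   # hidden -> output
b = np.zeros(n_hid)
c = np.zeros(n_out)

xs = rng.normal(size=(4, n_in))    # a sequence of 4 input vectors
h = np.zeros(n_hid)                # initial hidden state ("memory")
for t, x_t in enumerate(xs):
    h = np.tanh(U @ x_t + W @ h + b)     # h(t) = f(U x(t) + W h(t-1) + b)
    o = V @ h + c                        # o(t) = V h(t) + c
    y_hat = softmax(o)                   # y_hat(t) = softmax(o(t))
    print(f"t={t}, y_hat={y_hat.round(3)}")
```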

9.Convolutional Neural Networks


A convolutional neural network (CNN or convnet) is a class of artificial neural
network used in machine learning.
"A simple CNN is a sequence of layers, and every layer of a CNN transforms one
volume of activations to another through a differentiable function."
A CNN is a kind of network architecture for deep learning algorithms and is
specifically used for image recognition and tasks that involve the processing of
pixel data. A CNN is an extended version of the artificial neural network (ANN)
that is predominantly used to extract features from grid-like matrix datasets,
for example visual datasets such as images or videos, where spatial data patterns
play an extensive role.
For identifying and recognizing objects, CNNs are the network architecture of
choice. They are used in computer vision (CV) tasks and for applications where
object recognition is vital, such as self-driving cars and facial recognition.

CNN’s Basic Architecture


A CNN architecture consists of two key components:

• A convolution tool that separates and identifies the distinct features of an image
for analysis in a process known as Feature Extraction

• A fully connected layer that takes the output of the convolution process and
predicts the image’s class based on the features retrieved earlier.

The CNN is made up of three types of layers: convolutional layers, pooling layers,
and fully-connected (FC) layers.
Convolution Layers
This is the very first layer in the CNN that is responsible for the extraction of the
different features from the input images. The convolution mathematical operation
is done between the input image and a filter of a specific size MxM in this layer.

The Fully Connected Layer

The Fully Connected (FC) layer comprises the weights and biases together with
the neurons and is used to connect the neurons between two different layers. These
layers usually form the last part of a CNN architecture and are positioned before the
output layer.

Pooling layer
The pooling layer is responsible for reducing the spatial size of the convolved
feature. This significantly reduces the dimensions and hence the computing power
required to process the data.
There are two types of pooling:
1. average pooling
2. max pooling

A Pooling Layer is usually applied after a Convolutional Layer. This layer’s


major goal is to lower the size of the convolved feature map to reduce
computational expenses. This is accomplished by reducing the connections
between layers and operating independently on each feature map. There are
numerous sorts of Pooling operations, depending on the mechanism utilised.
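
A small NumPy sketch (an illustrative addition, not from the original notes) of 2X2 max and average pooling with stride 2 on a single feature map, showing how pooling halves each spatial dimension.

```python
import numpy as np

def pool2x2(feature_map, mode="max"):
    """2x2 pooling with stride 2 on a single-channel feature map (even-sized)."""
    h, w = feature_map.shape
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)   # group into 2x2 blocks
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

fm = np.array([[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 1, 5, 6],
               [2, 2, 7, 8]], dtype=float)
print(pool2x2(fm, "max"))       # [[4. 2.] [2. 8.]]
print(pool2x2(fm, "average"))   # [[2.5  1.  ] [1.25 6.5 ]]
```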

10. Pre-trained CNN (LeNet)


Pre-training a neural network refers to first training a model on one task or
dataset. Then using the parameters or model from this training to train another
model on a different task or dataset. This gives the model a head-start instead of
starting from scratch.
The most crucial aspect of pre-training neural networks is the task at hand.
Specifically, the task from which the model initially learns must be similar to the
task the model is used for in future. We can’t train a model in weather forecasting
and then, later on, use it for object detection.

Now pre-training a neural network entails four basic steps:


1. We have a machine learning model M and datasets A and B
2. Train M with dataset A
3. Before training the model on dataset B, initialize some of the parameters
of M with the parameters of the model trained on A
4. Train M on B

LeNet Architecture
The network has 5 layers with learnable parameters and is hence named LeNet-5. It
has three sets of convolution layers with a combination of average pooling. After
the convolution and average pooling layers, we have two fully connected layers.
At last, a softmax classifier assigns the images to their respective classes.

The input to this model is a 32 X 32 grayscale image hence the number of


channels is one.

We then apply the first convolution operation with a filter size of 5X5, and we have
6 such filters. The activation function used at this layer is tanh. As a result, we get
a feature map of size 28X28X6.
Output Size Of A Convolution Layer
output= ((Input-filter size)/ stride)+1
Also, the number of filters becomes the channel in the output feature map.
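
A small helper (an illustrative addition) that applies this formula, extended with an optional padding term for the padded layers discussed later, and verifies the 28X28 LeNet and 55X55 AlexNet feature-map sizes.

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """((input - filter + 2*padding) / stride) + 1; padding=0 reduces to the formula above."""
    return (input_size - filter_size + 2 * padding) // stride + 1

# LeNet-5 first convolution: 32x32 input, 5x5 filter, stride 1 -> 28x28 (x6 channels)
print(conv_output_size(32, 5, stride=1))      # 28
# AlexNet first convolution: 227x227 input, 11x11 filter, stride 4 -> 55x55 (x96 channels)
print(conv_output_size(227, 11, stride=4))    # 55
```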

Next, we apply the first pooling operation: average pooling reduces the size of the
feature map by half, to 14X14X6. Note that the number of channels is intact.

Next, we have a convolution layer with sixteen filters of size 5X5, and the
feature map becomes 10X10X16. The activation function is again tanh.
The output size is calculated in the same manner as before. After this, we again apply an
average pooling (subsampling) layer, which again reduces the size of the
feature map by half, i.e. to 5X5X16.

Then we have a final convolution layer of size 5X5 with 120 filters, as shown in
the image above, leaving a feature map of size 1X1X120, which when flattened gives
120 values.

After these convolution layers, we have a fully connected layer with 84 neurons
and the activation function used here is again tanh. At last, we have an output
layer with 10 neurons since the data have ten classes.

Here is the final architecture of the Lenet-5 model.

Summary
The network has
• 5 layers with learnable parameters.
• The input to the model is a grayscale image.
• It has 3 convolution layers, two average pooling layers, and two fully
connected layers with a softmax classifier.
• The number of trainable parameters is 60,000.
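
A PyTorch sketch of the LeNet-5 architecture as summarised above (tanh activations, average pooling, 120-84-10 fully connected head). This is an illustrative implementation, not part of the original notes: the module names are choices of this sketch, and the softmax is left to the training loss, as is conventional.

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """Sketch of LeNet-5 as described above: tanh activations and average pooling."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(),     # 32x32x1 -> 28x28x6
            nn.AvgPool2d(kernel_size=2, stride=2),         # -> 14x14x6
            nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),    # -> 10x10x16
            nn.AvgPool2d(kernel_size=2, stride=2),         # -> 5x5x16
            nn.Conv2d(16, 120, kernel_size=5), nn.Tanh(),  # -> 1x1x120
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(120, 84), nn.Tanh(),
            nn.Linear(84, num_classes),    # softmax is applied by the loss during training
        )

    def forward(self, x):
        return self.classifier(self.features(x))

x = torch.randn(1, 1, 32, 32)      # one 32x32 grayscale image
print(LeNet5()(x).shape)           # torch.Size([1, 10])
```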

Pre-trained CNN (AlexNet)


Pre-training a neural network refers to first training a model on one task or
dataset. Then using the parameters or model from this training to train another
model on a different task or dataset. This gives the model a head-start instead of
starting from scratch.
The most crucial aspect of pre-training neural networks is the task at hand.
Specifically, the task from which the model initially learns must be similar to the
task the model is used for in future. We can’t train a model in weather forecasting
and then, later on, use it for object detection.

Now pre-training a neural network entails four basic steps:


1. We have a machine learning model M and datasets A and B
2. Train M with dataset A
3. Before training the model on dataset B, initialize some of the parameters
of M with the parameters of the model trained on A
4. Train M on B

AlexNet
Alexnet won the Imagenet large-scale visual recognition challenge in 2012. The
model was proposed in 2012 in the research paper named Imagenet Classification
with Deep Convolution Neural Network by Alex Krizhevsky and his colleagues.
In this model, the depth of the network was increased in comparison to Lenet-5.

Alexnet Architecture
One thing to note here, since Alexnet is a deep architecture, the authors
introduced padding to prevent the size of the feature maps from reducing
drastically. The input to this model is the images of size 227X227X3.

Convolution and Maxpooling Layers


Then we apply the first convolution layer with 96 filters of size 11X11 with stride
4. The activation function used in this layer is relu. The output feature map is
55X55X96.

Output Size Of A Convolution Layer

output = ((input − filter size + 2 × padding) / stride) + 1
Also, the number of filters becomes the number of channels in the output feature map.

Next, we have the first Maxpooling layer, of size 3X3 and stride 2. Then we get
the resulting feature map with the size 27X27X96.

After this, we apply the second convolution operation. This time the filter size is
reduced to 5X5 and we have 256 such filters. The stride is 1 and padding 2. The
activation function used is again relu. Now the output size we get is 27X27X256.

Again we applied a max-pooling layer of size 3X3 with stride 2. The resulting
feature map is of shape 13X13X256.

Now we apply the third convolution operation with 384 filters of size 3X3 stride
1 and also padding 1. Again the activation function used is relu. The output
feature map is of shape 13X13X384.

Then we have the fourth convolution operation with 384 filters of size 3X3. The
stride along with the padding is 1. On top of that activation function used is relu.
Now the output size remains unchanged i.e 13X13X384.

After this, we have the final convolution layer of size 3X3 with 256 such filters.
The stride and padding are set to one also the activation function is relu. The
resulting feature map is of shape 13X13X256.

So if you look at the architecture till now, the number of filters is increasing as
we are going deeper. Hence it is extracting more features as we move deeper into
the architecture. Also, the filter size is reducing, which means the initial filter was

larger and as we go ahead the filter size is decreasing, resulting in a decrease in


the feature map shape.

Next, we apply the third max-pooling layer of size 3X3 and stride 2, resulting in
a feature map of shape 6X6X256.

After this, we have our first dropout layer. The drop-out rate is set to be 0.5.

Then we have the first fully connected layer with a relu activation function. The
size of the output is 4096. Next comes another dropout layer with the dropout rate
fixed at 0.5.

This is followed by a second fully connected layer with 4096 neurons and relu
activation.

Finally, we have the last fully connected layer, or output layer, with 1000 neurons,
since there are 1000 classes in the ImageNet data set. The activation function used at
this layer is softmax.

This is the architecture of the Alexnet model. It has a total of 62.3 million
learnable parameters.
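
A PyTorch sketch that mirrors the layer-by-layer walkthrough above. This is an illustrative addition, not part of the original notes: the use of nn.Sequential and the exact module choices are assumptions of this sketch, and the output layer has 1000 units for the 1000 ImageNet classes.

```python
import torch
import torch.nn as nn

# Sketch following the walkthrough above; dropout placement and 1000-way output as described.
alexnet = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),             # 227x227x3 -> 55x55x96
    nn.MaxPool2d(kernel_size=3, stride=2),                             # -> 27x27x96
    nn.Conv2d(96, 256, kernel_size=5, stride=1, padding=2), nn.ReLU(), # -> 27x27x256
    nn.MaxPool2d(kernel_size=3, stride=2),                             # -> 13x13x256
    nn.Conv2d(256, 384, kernel_size=3, stride=1, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, stride=1, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, stride=1, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),                             # -> 6x6x256
    nn.Flatten(),
    nn.Dropout(0.5), nn.Linear(6 * 6 * 256, 4096), nn.ReLU(),
    nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),                                             # 1000 ImageNet classes
)

x = torch.randn(1, 3, 227, 227)
print(alexnet(x).shape)            # torch.Size([1, 1000])
```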
