Machine Learning
Lecture Notes
Syllabus
UNIT-I
• Introduction: Representation and Learning: Feature Vectors, Feature
Spaces, Feature Extraction and Feature Selection, Learning Problem
Formulation
• Types of Machine Learning Algorithms: Parametric and Non-parametric
Machine Learning Algorithms, Supervised, Unsupervised, Semi-
Supervised and Reinforced Learning.
• Preliminaries: Overfitting, Training, Testing, and Validation Sets, The
Confusion Matrix, Accuracy Metrics: Evaluation Measures: SSE, RMSE,
R2, confusion matrix, precision, recall, F-Score, Receiver Operator
Characteristic (ROC) Curve. Unbalanced Datasets. some basic statistics:
Averages, Variance and Covariance, The Gaussian, the bias-variance
tradeoff.
UNIT-II
• Supervised Algorithms: Regression: Linear Regression, Logistic
Regression, Linear Discriminant Analysis. Classification: Decision Tree,
Naïve Bayes, K-Nearest Neighbors, Support Vector Machines, evaluation
of classification: cross validation, hold out.
UNIT-III
• Ensemble Algorithms: Bagging, Random Forest, Boosting
• Unsupervised Learning: Cluster Analysis: Similarity Measures, categories
of clustering algorithms, k-means, Hierarchical, Expectation-
Maximization Algorithm, Fuzzy c-means algorithm
UNIT-IV
• Neural Networks: Multilayer Perceptron, Back-propagation algorithm,
Training strategies, Activation Functions, Gradient Descent for Machine
Learning, Radial basis functions, Hopfield network, Recurrent Neural
Networks.
• Deep learning: Introduction to deep learning, Convolutional Neural
Networks (CNN), CNN Architecture, pre-trained CNN (LeNet, AlexNet).
UNIT-V
TEXTBOOKS
1. Stephen Marsland, Machine Learning: An Algorithmic Perspective, Second
Edition, Chapman & Hall/CRC Machine Learning & Pattern Recognition (2014).
2. Tom Mitchell, Machine Learning, McGraw-Hill Science/Engineering/Math (1997).
3. Kevin Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2012.
4. Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of
Statistical Learning: Data Mining, Inference, and Prediction, Second Edition,
Springer Series in Statistics (2009).
5. Christopher Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
6. Uma N. Dulhare, Khaleel Ahmad, Khairol Amali Bin Ahmad, Machine
Learning and Big Data: Concepts, Algorithms, Tools and Applications,
Scrivener Publishing, Wiley, 2020.
UNIT 4
Important Questions
1. Perceptron rule
2. Delta rule
3. Back propagation based neural network
4. Activation Functions
5. Gradient Descent for Machine Learning
6. Radial basis functions
7. Hopfield network
8. Recurrent Neural Networks.
9. Convolutional Neural Networks
10. Architecture, pre-trained CNN (LeNet, AlexNet).
1.Perceptron rule
A perceptron is a function that maps its input x, multiplied by the learned
weight coefficients wi, to a weighted sum e = Σi wi xi. This sum is then
non-linearly transformed to generate the output y = f(e).
Weights are modified at each step according to the perceptron training rule, which
revises the weight wi associated with input xi according to the rule
wi ← wi + Δwi
where Δwi = η (t − o) xi
Here t is the target output for the current training example,
o is the output generated by the perceptron, and
η is a positive constant called the learning rate.
The role of the learning rate is to moderate the degree to which weights are
changed at each step.
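As an illustration, the following is a minimal Python sketch of the perceptron training rule; the function name and the toy AND-gate data are illustrative, not part of the notes:

```python
import numpy as np

def train_perceptron(X, t, eta=0.1, epochs=20):
    """Train a single perceptron with the perceptron training rule.

    X   : (n_samples, n_features) input vectors
    t   : (n_samples,) target outputs (0 or 1)
    eta : learning rate (the positive constant that moderates updates)
    """
    w = np.zeros(X.shape[1])          # weights w_i
    b = 0.0                           # bias (threshold)
    for _ in range(epochs):
        for x_i, t_i in zip(X, t):
            o = 1 if np.dot(w, x_i) + b > 0 else 0   # perceptron output o
            # perceptron rule: w_i <- w_i + eta * (t - o) * x_i
            w += eta * (t_i - o) * x_i
            b += eta * (t_i - o)
    return w, b

# Toy example (AND gate), purely illustrative:
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, t)
```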
2.Delta rule
The delta rule was developed by Bernard Widrow and Marcian Hoff, and it is a
supervised learning rule. It is also known as the Least Mean Square (LMS)
method, and it minimizes the error over all the training patterns.
The error is the difference between a single actual value and a single predicted
value.
• For a sample x: Error e = y − ŷ
• where y is the observed value and ŷ is the value predicted by the model.
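A minimal sketch of the delta (LMS) rule for a single linear unit is shown below; the function name and the synthetic data are assumptions made for illustration:

```python
import numpy as np

def train_delta_rule(X, y, eta=0.01, epochs=100):
    """Least Mean Square (delta rule) training of a linear unit.

    Each update moves the weights against the gradient of the
    squared error for one sample, e = y - y_hat.
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat = np.dot(w, x_i)          # predicted value
            e = y_i - y_hat                 # error for this sample
            w += eta * e * x_i              # delta rule update
    return w

# Illustrative data: y = 2*x1 + 3*x2 plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, 3.0]) + 0.01 * rng.normal(size=100)
w = train_delta_rule(X, y)
```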
• ∇L(w) is itself a vector (it has a direction), whose components are the partial
derivatives of L with respect to each of the wi.
• When interpreted as a vector in weight space, the gradient specifies the
direction that produces the steepest increase in Loss L.
• The negative of this vector therefore gives the direction of steepest
decrease.
3.Back propagation based neural network
Each neuron is composed of two units. The first unit adds the products of the
weight coefficients and the input signals. The second unit applies a non-linear
transfer (activation) function; y = f(e) is the output signal of this non-linear
element, and signal y is also the output signal of the neuron. The bias gives
each neuron a firing threshold that determines the value its weighted input must
reach before the neuron fires (activates). This threshold is adjustable, so we
can change the value at which the neuron fires. It is usually represented as θ.
In the next step of the algorithm, the output signal of the network, y, is compared
with the desired output value (the target z), which is found in the training data set.
The difference is called the error signal δ of the output-layer neuron.
Parameters
• x = inputs training vector x=(x1,x2,…………xn).
• y = target vector y=(y1,y2……………yn).
• δk = error at output unit.
• δj = error at hidden layer.
• α = learning rate.
• V0j = bias of hidden unit j.
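Using the parameters listed above, a compact sketch of one back-propagation update for a network with a single hidden layer and sigmoid activations could look as follows; shapes and names are illustrative, and biases are omitted for brevity:

```python
import numpy as np

def sigmoid(e):
    return 1.0 / (1.0 + np.exp(-e))

def backprop_step(x, y, V, W, alpha=0.1):
    """One back-propagation update for a 1-hidden-layer network.

    x : input vector, shape (n_in,)
    y : target vector, shape (n_out,)
    V : hidden-layer weights, shape (n_hidden, n_in)
    W : output-layer weights, shape (n_out, n_hidden)
    alpha : learning rate
    """
    # Forward pass
    z = sigmoid(V @ x)            # hidden activations
    o = sigmoid(W @ z)            # network output

    # Error signals
    delta_k = (y - o) * o * (1 - o)            # error at the output units
    delta_j = (W.T @ delta_k) * z * (1 - z)    # error at the hidden layer

    # Weight updates
    W += alpha * np.outer(delta_k, z)
    V += alpha * np.outer(delta_j, x)
    return V, W
```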
4.Activation Functions
Each neuron is composed of two units. The first unit adds the products of the
weight coefficients and the input signals. The second unit applies a non-linear
transfer (activation) function; the purpose of the activation function is to
introduce non-linearity into the output of a neuron. y = f(e) is the output signal
of this non-linear element, and signal y is also the output signal of the neuron.
The bias gives each neuron a firing threshold that determines the value its
weighted input must reach before the neuron fires (activates). This threshold is
adjustable, so we can change the value at which the neuron fires. It is usually
represented as θ.
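For illustration, here are sketches of three widely used activation functions (tanh and relu also appear later in the LeNet and AlexNet discussions):

```python
import numpy as np

def sigmoid(e):
    """Logistic function: squashes e into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-e))

def tanh(e):
    """Hyperbolic tangent: squashes e into (-1, 1); used in LeNet-5."""
    return np.tanh(e)

def relu(e):
    """Rectified linear unit: max(0, e); used in AlexNet."""
    return np.maximum(0.0, e)

e = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(e), tanh(e), relu(e))
```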
5.Gradient Descent for Machine Learning
The error is the difference between a single actual value and a single predicted value.
• For a sample x: Error e = y − ŷ
• where y is the observed value and ŷ is the value predicted by the model.
• Loss L = (1/N) (e1 + e2 + ⋯ + eN)
• where ei is the error of the i-th of the N given samples.
• Risk is the average error over all the data.
• ∇L(w) is itself a vector (it has a direction), whose components are the partial
derivatives of L with respect to each of the wi.
• When interpreted as a vector in weight space, the gradient specifies the
direction that produces the steepest increase in Loss L.
• The negative of this vector therefore gives the direction of steepest
decrease.
• Optimizing the loss function L(w):
• Almost all NN models these days are trained with a variant of the
gradient descent (GD) algorithm.
• GD applies iterative refinement of the network parameters.
• GD uses the opposite direction of the gradient of the loss with
respect to the NN parameters, i.e. ∇L(w) = [∂L/∂wi], for
updating w.
Gradient Descent Algorithm
• Steps in the gradient descent algorithm:
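A minimal Python sketch of these steps is given below; the quadratic example loss and the function names are illustrative only:

```python
import numpy as np

def gradient_descent(loss_grad, w0, eta=0.1, n_iters=100, tol=1e-6):
    """Generic gradient descent.

    1. Initialise the weights w.
    2. Compute the gradient of the loss at w.
    3. Update w in the opposite direction of the gradient: w <- w - eta * grad.
    4. Repeat until the update is very small (a local minimum is reached).
    """
    w = np.asarray(w0, dtype=float)
    for _ in range(n_iters):
        grad = loss_grad(w)
        step = eta * grad
        w = w - step
        if np.linalg.norm(step) < tol:
            break
    return w

# Illustrative loss L(w) = ||w - 3||^2 with gradient 2*(w - 3); minimum at w = 3.
w_min = gradient_descent(lambda w: 2 * (w - 3.0), w0=[0.0, 0.0])
```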
Gradient descent algorithm stops when a local minimum of the loss surface is
reached.
• GD does not guarantee reaching a global minimum.
• However, empirical evidence suggests that GD works well for NNs.
(Figure: the loss L(w) plotted over the weight space w.)
• Based on how much of the training data is used to compute the error for each
weight update, the gradient descent learning algorithm can be categorized into
(see the sketch after this list):
• Batch gradient descent
• Stochastic gradient descent
• Mini-batch gradient descent
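The three variants differ only in how many training samples are used to compute the gradient for each update. The sketch below, for linear regression with a squared-error loss, is purely illustrative and covers all three cases, since batch_size = len(X) gives batch GD and batch_size = 1 gives stochastic GD:

```python
import numpy as np

def minibatch_gd(X, y, batch_size, eta=0.01, epochs=10):
    """Gradient descent for linear regression (MSE loss).

    batch_size = len(X) -> batch GD, batch_size = 1 -> stochastic GD,
    anything in between -> mini-batch GD.
    """
    w = np.zeros(X.shape[1])
    n = len(X)
    for _ in range(epochs):
        idx = np.random.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            grad = -2.0 / len(batch) * Xb.T @ (yb - Xb @ w)
            w -= eta * grad
    return w

# Illustrative data: y = 4*x1 - 2*x2 plus noise
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([4.0, -2.0]) + 0.1 * rng.normal(size=200)

w_batch = minibatch_gd(X, y, batch_size=len(X))   # batch gradient descent
w_sgd   = minibatch_gd(X, y, batch_size=1)        # stochastic gradient descent
w_mini  = minibatch_gd(X, y, batch_size=32)       # mini-batch gradient descent
```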
Advantages (of batch gradient descent)
• It produces less noise than the other gradient descent variants.
• It produces stable convergence.
• It is computationally efficient, since all resources are used to process all the
training samples together.
6.Radial basis functions
• In a radial basis function network (RBFN), each hidden unit i has a center, and
for an input x the distance di between x and that center is computed.
• The output hi of each hidden unit i is then computed by applying the basis
function to this distance:
• hi = G(di, σi)
• The basis function is a curve (typically a Gaussian function) which has a
peak at zero distance and decreases as the distance from the center
increases.
• Each RBFN neuron stores a “prototype”, which is just one of the examples
from the training set.
• The basic idea of this model is that the entire feature vector space is
partitioned by Gaussian neural nodes, where each node generates a signal
corresponding to an input vector, and strength of the signal produced by
each neuron depends on the distance between its center and the input
vector.
• Also, for inputs lying closer together in the Euclidean vector space, the output
signals that are generated must be similar.
• Here, μ denotes the center of the neuron and φ(x) the response of the neuron
corresponding to the input x.
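A minimal sketch of a Gaussian RBF neuron's response and of a hidden RBF layer is given below; the centers and widths are arbitrary illustrative values:

```python
import numpy as np

def rbf_response(x, center, sigma):
    """Gaussian basis function: peaks at zero distance from the center
    and decays as the input moves away from it."""
    d = np.linalg.norm(x - center)          # distance to the prototype/center
    return np.exp(-(d ** 2) / (2 * sigma ** 2))

def rbf_layer(x, centers, sigmas):
    """Outputs h_i = G(d_i, sigma_i) of all hidden RBF units."""
    return np.array([rbf_response(x, c, s) for c, s in zip(centers, sigmas)])

# Illustrative: two prototypes in a 2-D feature space
centers = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
sigmas = [1.0, 1.0]
print(rbf_layer(np.array([0.5, 0.5]), centers, sigmas))
```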
Advantages
• Good Generalization
• Faster Training
• Only one hidden layer
• Strong tolerance to input noise
• Easy interpretation of the meaning or function of each node in the hidden
layer
7.Hopfield network
• A Hopfield network is a particular type of single-layer network of 'n' fully
connected recurrent neurons. Dr. John J. Hopfield invented it in 1982.
• These networks were introduced to collect and retrieve memory and store
various patterns.
• Also, auto-association and optimization of the task can be done using these
networks.
• Types
• Discrete Hopfield Network
• Binary (0/1)
• Bipolar (-1/1)
• Continuous Hopfield Network
• In this network, each node is fully connected (recurrent) to the other nodes.
• It behaves in a discrete manner, i.e. it gives finite distinct outputs:
• Binary → ON (1) or OFF (0)
• Bipolar → −1 or +1
• These outputs/states can be restored based on the input received from other
nodes.
• Unlike other neural networks, the output of the Hopfield network is finite.
• Also, the input and output sizes must be the same in these networks
• This model consists of neurons with one inverting and one non-inverting
output.
• The output of each neuron should be the input of other neurons but not the
input of self.
• Weight/connection strength is represented by wij
• The weights are symmetric in nature and have the following properties.
• wij = wji
• wii = 0
• For binary patterns: wij = Σp (2xi(p) − 1)(2xj(p) − 1), for i ≠ j
• For bipolar patterns: wij = Σp xi(p) xj(p), for i ≠ j (the sums run over the
stored patterns p)
• Discrete Hopfield Networks
Training Algorithm
1. Initialize the weights wij to store the patterns (using the storage rule above).
2. For each input vector x, perform steps 3-7.
3. Make the initial activations of the network equal to the input vector x, i.e. yi = xi.
4. Perform steps 5-7 for each unit yi.
5. Compute the net input: y_in_i = xi + Σj yj wji.
6. Apply the activation function over the net input to calculate the output:
yi = 1 if y_in_i > θi; yi unchanged if y_in_i = θi; yi = 0 if y_in_i < θi.
7. Feed the obtained output yi back to all the other units, so the activation
vector is updated.
8. Test the network for convergence.
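A compact sketch of a discrete (bipolar) Hopfield network, combining the storage rule and the update steps above, is given below; the stored pattern and function names are illustrative:

```python
import numpy as np

def store_patterns(patterns):
    """Hebbian storage for bipolar patterns: w_ij = sum_p x_i(p) * x_j(p), w_ii = 0."""
    n = patterns.shape[1]
    W = np.zeros((n, n))
    for p in patterns:
        W += np.outer(p, p)
    np.fill_diagonal(W, 0)          # w_ii = 0
    return W

def recall(W, x, steps=10):
    """Asynchronous update: y_i <- sign(x_i + sum_j y_j * w_ji)."""
    y = x.copy()
    for _ in range(steps):
        for i in np.random.permutation(len(y)):
            net = x[i] + W[i] @ y   # net input to unit i
            y[i] = 1 if net >= 0 else -1
    return y

# Store one bipolar pattern and recall it from a corrupted version.
patterns = np.array([[1, -1, 1, -1, 1, -1]])
W = store_patterns(patterns)
noisy = np.array([1, -1, -1, -1, 1, -1])    # one flipped bit
print(recall(W, noisy))
```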
• Energy function: E = −(1/2) Σi Σj yi yj wij − Σi xi yi + Σi θi yi (sums over
i, j = 1 … n)
• Condition − In a stable network, whenever the state of a node changes, the
energy function decreases.
Continuous Hopfield Network
• The Hopfield network consists of associative memory.
• This memory allows the system to retrieve the memory using an
incomplete portion.
• The network can restore the closest pattern using the data captured in
associative memory.
• This feature of Hopfield networks makes it a good candidate for pattern
recognition.
• The output of each node is vi = g(ui), where:
• vi = output
• ui = internal activity of a node
• g can be any activation function
• When the energy function reaches its minimum, the network converges to
a stable configuration.
8.Recurrent Neural Networks
• In traditional neural networks, all the inputs and outputs are independent
of each other, but to predict the next word of a sentence there is a need to
remember the previous words.
• The most important feature of RNN is its Hidden state, which remembers
some information about a sequence.
• The state is also referred to as Memory State since it remembers the
previous input to the network.
• An RNN uses the same parameters for every input, as it performs the same
task on all the inputs and hidden states to produce the output.
• This parameter sharing reduces the number of parameters, unlike other
neural networks, as illustrated in the sketch below.
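A minimal sketch of a vanilla RNN cell, showing the hidden (memory) state and the reuse of the same parameters at every time step; shapes and names are illustrative:

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, W_hy, h0=None):
    """Unroll a vanilla RNN over a sequence.

    The same parameters (W_xh, W_hh, W_hy) are used at every time step;
    the hidden state h carries information about the previous inputs.
    """
    h = np.zeros(W_hh.shape[0]) if h0 is None else h0
    outputs = []
    for x_t in x_seq:
        h = np.tanh(W_xh @ x_t + W_hh @ h)   # update the memory state
        outputs.append(W_hy @ h)             # output at this time step
    return np.array(outputs), h

# Illustrative sizes: input dim 3, hidden dim 4, output dim 2
rng = np.random.default_rng(0)
W_xh, W_hh, W_hy = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(2, 4))
x_seq = rng.normal(size=(5, 3))              # a sequence of 5 inputs
y_seq, h_last = rnn_forward(x_seq, W_xh, W_hh, W_hy)
```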
Exploding Gradient
When an RNN is trained with back-propagation through time, the gradients can
grow exponentially as they are propagated back through many time steps. This is
known as the exploding gradient problem, and it makes the weight updates unstable.
9.Convolutional Neural Networks
A convolutional neural network (CNN) has two main parts:
• A convolution tool that separates and identifies the distinct features of an image
for analysis, in a process known as feature extraction.
• A fully connected layer that takes the output of the convolution process and
predicts the image's class based on the features extracted earlier.
The CNN is made up of three types of layers: convolutional layers, pooling layers,
and fully-connected (FC) layers.
Convolution Layers
This is the very first layer in the CNN and is responsible for extracting the
different features from the input images. In this layer, the convolution operation
is performed between the input image and a filter of a specific size M×M.
Pooling layer
The pooling layer is responsible for reducing the spatial size of the convolved
feature. This decreases the computational power required to process the data,
through a significant reduction in the dimensions.
There are two types of pooling:
1. average pooling
2. max pooling
LeNet Architecture
The network has 5 layers with learnable parameters and is hence named LeNet-5.
It has three convolution layers combined with average pooling. After the
convolution and average pooling layers, we have two fully connected layers.
Finally, a softmax classifier classifies the images into their respective classes.
The input to the model is a 32×32 grayscale image. We then apply the first
convolution operation with a filter size of 5×5, and we have 6 such filters. The
activation function used at this layer is tanh. As a result, we get a feature map
of size 28×28×6.
Output Size of a Convolution Layer
output = ((input size − filter size) / stride) + 1
Also, the number of filters becomes the number of channels in the output feature map.
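The formula can be wrapped in a small helper and checked against the first LeNet convolution (32 → 28 with a 5×5 filter and stride 1); the padding argument is an extension of the formula above, and the function name is illustrative:

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """output = ((input - filter + 2*padding) / stride) + 1
    (with padding = 0 this reduces to the formula given above)."""
    return (input_size - filter_size + 2 * padding) // stride + 1

# First LeNet convolution: 32x32 input, 5x5 filter, stride 1 -> 28x28
assert conv_output_size(32, 5, stride=1) == 28
```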
Next we apply the first pooling operation (average pooling), and the size of the
feature map is reduced by half, to 14×14×6. Note that the number of channels is intact.
Next, we have a convolution layer with sixteen filters of size 5×5. The feature
map changes to 10×10×16, and the activation function is again tanh.
The output size is calculated in the same manner. After this, we again apply an
average pooling (sub-sampling) layer, which again reduces the size of the
feature map by half, i.e. to 5×5×16.
Then we have the final convolution layer with 120 filters of size 5×5, leaving a
feature map of size 1×1×120. After flattening, the result is 120 values.
After these convolution layers, we have a fully connected layer with 84 neurons,
and the activation function used here is again tanh. Finally, we have an output
layer with 10 neurons, since the data has ten classes.
Summary
The network has:
• 5 layers with learnable parameters.
• The input to the model is a grayscale image.
• It has 3 convolution layers, two average pooling layers, and two fully
connected layers with a softmax classifier.
• The number of trainable parameters is approximately 60,000.
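For reference, here is a sketch of the LeNet-5 stack described above, assuming PyTorch is available; the layer sizes follow the walkthrough (tanh activations, average pooling, and the 120-84-10 fully connected head):

```python
import torch
import torch.nn as nn

# LeNet-5 sketch: 32x32x1 -> 28x28x6 -> 14x14x6 -> 10x10x16 -> 5x5x16
#                 -> 1x1x120 -> 84 -> 10
lenet5 = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(),      # C1: 6 filters, 5x5
    nn.AvgPool2d(kernel_size=2, stride=2),          # S2: average pooling
    nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),     # C3: 16 filters, 5x5
    nn.AvgPool2d(kernel_size=2, stride=2),          # S4: average pooling
    nn.Conv2d(16, 120, kernel_size=5), nn.Tanh(),   # C5: 120 filters, 5x5
    nn.Flatten(),
    nn.Linear(120, 84), nn.Tanh(),                  # F6: fully connected
    nn.Linear(84, 10),                              # output: 10 classes (softmax is applied in the loss)
)

x = torch.randn(1, 1, 32, 32)        # one grayscale 32x32 image
print(lenet5(x).shape)               # torch.Size([1, 10])
```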
AlexNet
AlexNet won the ImageNet Large Scale Visual Recognition Challenge in 2012. The
model was proposed in 2012 in the research paper "ImageNet Classification with
Deep Convolutional Neural Networks" by Alex Krizhevsky and his colleagues.
In this model, the depth of the network was increased in comparison to LeNet-5.
AlexNet Architecture
One thing to note here: since AlexNet is a deep architecture, the authors
introduced padding to prevent the size of the feature maps from shrinking
drastically. The input to this model is images of size 227×227×3.
The first convolution layer has 96 filters of size 11×11 with stride 4 and relu
activation, producing a feature map of size 55×55×96. Next, we have the first
max-pooling layer, of size 3×3 and stride 2, which gives a resulting feature map
of size 27×27×96.
After this, we apply the second convolution operation. This time the filter size is
reduced to 5×5 and we have 256 such filters. The stride is 1 and the padding is 2.
The activation function used is again relu. Now the output size we get is 27×27×256.
Again we apply a max-pooling layer of size 3×3 with stride 2. The resulting
feature map is of shape 13×13×256.
Now we apply the third convolution operation, with 384 filters of size 3×3, stride
1, and padding 1. Again the activation function used is relu. The output
feature map is of shape 13×13×384.
Then we have the fourth convolution operation, with 384 filters of size 3×3. The
stride and the padding are both 1, and the activation function used is relu.
The output size remains unchanged, i.e. 13×13×384.
After this, we have the final convolution layer, with 256 filters of size 3×3. The
stride and padding are set to one, and the activation function is relu. The
resulting feature map is of shape 13×13×256.
So if you look at the architecture till now, the number of filters increases as
we go deeper; hence the network extracts more features as we move deeper into
the architecture. Also, the filter size is reducing: the initial filters were
larger, and the filter size decreases as we go deeper.
Next, we apply the third max-pooling layer of size 3×3 and stride 2, resulting in
a feature map of shape 6×6×256.
After this, we have our first dropout layer. The drop-out rate is set to be 0.5.
Then we have the first fully connected layer with a relu activation function. The
size of the output is 4096. Next comes another dropout layer with the dropout rate
fixed at 0.5.
This is followed by a second fully connected layer with 4096 neurons and relu
activation.
Finally, we have the last fully connected layer, or output layer, with 1000 neurons,
since we have 1000 classes in the data set. The activation function used at this
layer is softmax.
This is the architecture of the Alexnet model. It has a total of 62.3 million
learnable parameters.
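The feature-map sizes quoted above can be checked with the output-size formula. In the short walk-through below, the first-layer settings (96 filters of size 11×11, stride 4) are the standard AlexNet values assumed here rather than stated explicitly in the notes:

```python
def out_size(size, k, stride, pad=0):
    """((input - filter + 2*padding) / stride) + 1"""
    return (size - k + 2 * pad) // stride + 1

size = 227                                        # 227x227x3 input
size = out_size(size, 11, 4)      # conv1: 96 filters 11x11, stride 4 -> 55x55x96
size = out_size(size, 3, 2)       # maxpool 3x3, stride 2            -> 27x27x96
size = out_size(size, 5, 1, 2)    # conv2: 256 filters 5x5, pad 2    -> 27x27x256
size = out_size(size, 3, 2)       # maxpool 3x3, stride 2            -> 13x13x256
size = out_size(size, 3, 1, 1)    # conv3: 384 filters 3x3, pad 1    -> 13x13x384
size = out_size(size, 3, 1, 1)    # conv4: 384 filters 3x3, pad 1    -> 13x13x384
size = out_size(size, 3, 1, 1)    # conv5: 256 filters 3x3, pad 1    -> 13x13x256
size = out_size(size, 3, 2)       # maxpool 3x3, stride 2            -> 6x6x256
print(size)                       # 6; the flattened 6*6*256 feeds the 4096-unit FC layers
```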