All that's left to find is the weights and the threshold value. Given the weights and the threshold value, the decision boundary is plotted using the AND function's aggregation equation:
x1·w1 + x2·w2 + w0 = 0
x1 + x2 − 1.5 = 0
Take any line that separates the black and the red points, then find its equation. You will then have the weights (the coefficients of x1 and x2) and the threshold value T (the constant in the line's equation).
x1·w1 + x2·w2 + w0 = 0
x2 = −(w1/w2)·x1 − (w0/w2)
x2 = (slope)·x1 + (bias)
With w1 = 1, w2 = 1, w0 = −1.5:
x1 + x2 − 1.5 = 0
Truth table for AND, with the {0, 1} encoding on the left and the bipolar {−1, +1} encoding on the right:

x1  x2  y           x1  x2  y
0   0   −1 (0)      −1  −1  −1
0   1   −1 (0)      −1  +1  −1
1   0   −1 (0)      +1  −1  −1
1   1   +1 (1)      +1  +1  +1

If the net sum ≥ 0, the output is 1 (or +1); otherwise the output is 0 (or −1).
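As a quick check, a minimal plain-Python sketch (the function name is my own) that evaluates this AND perceptron over the truth table:

```python
# Minimal sketch: AND gate as a perceptron with w1 = w2 = 1, w0 = -1.5.
def perceptron_and(x1, x2, w1=1.0, w2=1.0, w0=-1.5):
    net = x1 * w1 + x2 * w2 + w0      # aggregation: x1*w1 + x2*w2 + w0
    return 1 if net >= 0 else 0       # hard limiter: 1 if net sum >= 0, else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", perceptron_and(x1, x2))
# 0 0 -> 0, 0 1 -> 0, 1 0 -> 0, 1 1 -> 1
```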
Rosenblatt's Perceptron, built around the McCulloch-Pitts neuron model
• The linear combiner (adder node) computes the linear combination of the inputs applied to the synapses, with synaptic weights w1, w2, …, wn.
• The hard limiter then checks whether the resulting sum is positive or negative.
• If the input of the hard limiter node is positive, the output is +1; if the input is negative, the output is −1.
• Mathematically, the hard limiter input is the weighted sum v = x1·w1 + x2·w2 + … + xn·wn + w0.
Rosenblatt's Perceptron
• The objective of the perceptron is to classify a set of inputs into two classes, c1 and c2.
• For two input signals denoted by the variables x1 and x2, the decision boundary is a straight line of the form w1·x1 + w2·x2 + w0 = 0.
So, for a perceptron having synaptic weights w0, w1, and w2 of −2, 1/2, and 1/4 respectively, the linear decision boundary is:
(1/2)·x1 + (1/4)·x2 − 2 = 0
[Figure: the boundary x1 + x2 − 1.5 = 0 (the AND case), with the region W·X + b > 0 on one side and W·X + b < 0 on the other]
[Figure: the boundary x1 + x2 − 0.5 = 0, again separating the region W·X + b > 0 from W·X + b < 0]
Multilayer Perceptron
• A basic perceptron works very successfully for data sets that possess linearly separable patterns.
• However, Minsky and Papert showed in their 1969 work that, in practical situations, a basic perceptron is not able to learn to compute even a simple 2-bit XOR.
• So, let us understand the reason.
The XOR data is not linearly separable: only a curved decision boundary can separate the classes properly. To address this issue, the other option is to use two decision boundary lines in place of one. This is the philosophy used to design the multilayer perceptron model.
[Figure: one of the XOR decision lines, x1 − x2 − 0.5 = 0, with the regions > 0 and < 0 on either side]
Single-layer Perceptron Learning Rule
• When Rosenblatt introduced the perceptron, he also introduced the perceptron learning rule (the algorithm used to calculate the correct weights for a perceptron automatically); a sketch of the rule follows below.
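A minimal NumPy sketch of the rule (not Rosenblatt's original pseudocode; the AND data, learning rate, and epoch count are assumed for illustration):

```python
import numpy as np

# Sketch of the perceptron learning rule: w <- w + lr * (t - y) * x.
# Trained on AND data (assumed example); x includes a constant 1 for the bias.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
t = np.array([0, 0, 0, 1], dtype=float)      # AND targets
w = np.zeros(3)                              # weights incl. bias, start at zero
lr = 0.1

for epoch in range(20):
    for x_i, t_i in zip(X, t):
        y_i = 1.0 if x_i @ w >= 0 else 0.0   # hard-limiter output
        w += lr * (t_i - y_i) * x_i          # update only when the prediction is wrong

print(w, [(1.0 if x_i @ w >= 0 else 0.0) for x_i in X])   # weights and final outputs
```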
The only noticeable difference between Rosenblatt's model and the one above is the differentiability of the activation function. Since 1986, many different activation functions have been proposed; some of the most popular examples appear on the next slides.
Commonly used activation functions are chosen based on a few desirable properties, such as:
•Nonlinear — When the activation function is non-linear, then a two-layer
neural network can be proven to be a universal function approximator. The
identity activation function does not satisfy this property. When multiple layers
use the identity activation function, the entire network is equivalent to a
single-layer model.
•Range — When the range of the activation function is finite, gradient-based
training methods tend to be more stable, because pattern presentations
significantly affect only limited weights. When the range is infinite, training is
generally more efficient because pattern presentations significantly affect most
of the weights. In the latter case, smaller learning rates are typically necessary.
•Continuously differentiable — This property is desirable (ReLU is not
continuously differentiable and has some issues with gradient-based
optimization, but it is still possible) for enabling gradient-based optimization
methods. The binary step activation function is not differentiable at 0, and it
differentiates to 0 for all other values, so gradient-based methods can make no
progress with it.
•Monotonic — When the activation function is monotonic, the error surface
associated with a single-layer model is guaranteed to be convex.
Activation functions can be broadly divided into two types:
1. Linear activation function
2. Non-linear activation functions
Linear or Identity Activation Function
Equation: f(x) = x
Range: (−∞, ∞)
It does not help with the complexity of the typical data that is fed to neural networks.
• A linear function has limited power and a limited ability to handle complexity. It can be used for simple tasks where interpretability matters.
• The derivative of a linear function is a constant.
• If all layers of the neural network use a linear activation, the network is equivalent to a single-layer network: the output of the nth layer is just another linear function of the input, as the sketch below shows.
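A quick NumPy sketch (the shapes and random weights are made up) showing why stacked linear layers collapse into a single one:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))          # a made-up 4-dimensional input
W1 = rng.normal(size=(3, 4))       # layer 1: 4 -> 3, linear (identity activation)
W2 = rng.normal(size=(2, 3))       # layer 2: 3 -> 2, linear

two_layers = W2 @ (W1 @ x)         # forward pass through both linear layers
one_layer = (W2 @ W1) @ x          # a single layer with the merged weight matrix

print(np.allclose(two_layers, one_layer))   # True: the composition is one linear map
```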
Non-linear Activation Function
Sigmoid or Logistic Activation Function: the sigmoid function curve looks like an S-shape.
The main reason we use the sigmoid function is that its output lies in (0, 1). It is therefore especially used for models where we have to predict a probability as an output: since the probability of anything exists only in the range 0 to 1, sigmoid is the right choice.
• The function is monotonic and differentiable, so we can find the slope of the sigmoid curve at any point.
• The function's output is centred on 0.5 rather than zero; this lack of zero-centring reduces the efficiency of the weight updates.
• It gives rise to the problem of vanishing gradients: sigmoids saturate and kill gradients, slowing convergence.
• The softmax function is a more generalized logistic activation function, used for multiclass classification.
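A small NumPy sketch of the sigmoid and its derivative; the sample inputs are arbitrary, chosen to show how the gradient vanishes once the unit saturates:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)            # derivative peaks at 0.25 when z = 0

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid(z))                   # outputs live in (0, 1), centred on 0.5
print(sigmoid_grad(z))              # ~0 at |z| = 10: a saturated unit "kills" gradients
```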
Non-linear Activation Function
Tanh or Hyperbolic Tangent Activation Function: tanh is like the logistic sigmoid, but better. The range of the tanh function is (−1, 1), and tanh is also sigmoidal (S-shaped).
• The advantage is that negative inputs are mapped strongly negative and zero inputs are mapped near zero in the tanh graph.
• The function is monotonic and differentiable. Its output is zero-centred, so convergence is usually faster when the average of each input variable over the training set is close to zero.
• The function suffers from the vanishing gradient problem and is computationally expensive due to its exponential operation. It saturates and kills the gradients.
• The tanh function is mainly used for classification between two classes.
• Both tanh and logistic sigmoid activation functions are used in feed-forward nets.
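For comparison, a minimal sketch contrasting tanh's zero-centred output with the sigmoid's (inputs again arbitrary):

```python
import numpy as np

z = np.linspace(-5, 5, 101)          # arbitrary sample of pre-activations
sig = 1.0 / (1.0 + np.exp(-z))
th = np.tanh(z)

print(sig.mean(), th.mean())         # ~0.5 vs ~0.0: tanh is zero-centred
print((1 - th**2)[[0, 50, 100]])     # tanh'(z): ~0 at z = -5 and z = 5 (saturation), 1 at z = 0
```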
Non-linear Activation Function
ReLU (Rectified Linear Unit) Activation Function
ReLU is the most used activation function; it appears in almost all convolutional neural networks and deep learning models.
• The ReLU is half-rectified (from the bottom): f(z) is zero when z is less than zero, and f(z) equals z when z is greater than or equal to zero.
• The function is monotonic and continuous. It is not differentiable at 0, and its derivative for negative values is 0.
• ReLU overcomes the vanishing gradient problem and is computationally faster, but it suffers from the dying ReLU problem for negative values.
• The issue is that all negative values become zero immediately, which decreases the model's ability to fit or train from the data properly: any negative input to the ReLU activation function turns into zero immediately, so negative values are never mapped appropriately.
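A short NumPy sketch of ReLU and its gradient; the inputs are made up to show the "dying" behaviour for negative pre-activations:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)     # 1 for z > 0, 0 for z <= 0 (undefined at exactly 0)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(z))        # [0. 0. 0. 0.5 3.]: all negative inputs become zero immediately
print(relu_grad(z))   # [0. 0. 0. 1. 1.]: a unit stuck in z < 0 gets no gradient ("dying ReLU")
```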
Non-linear Activation Function
Leaky ReLU: an attempt to solve the dying ReLU problem, defined as f(z) = z for z > 0 and f(z) = a·z otherwise.
• The leak helps to increase the range of the ReLU function. Usually, the value of a is 0.01 or so.
• When a is not fixed but drawn at random, the function is called Randomized ReLU.
• Unlike exponential variants such as ELU, it involves no exponential operation, so it is computationally cheap.
• The range of the Leaky ReLU is (−∞, ∞).
• Both Leaky and Randomized ReLU functions are monotonic in nature, and so are their derivatives.
• The gradient for negative inputs is small (equal to a), so learning on that side can still be slow.
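A one-line Leaky ReLU sketch, with a = 0.01 as the usual assumed value:

```python
import numpy as np

def leaky_relu(z, a=0.01):
    return np.where(z > 0, z, a * z)   # small slope a keeps negative units alive

z = np.array([-3.0, -0.5, 0.5, 3.0])
print(leaky_relu(z))                   # [-0.03 -0.005 0.5 3.]: negatives shrink but never die to 0
```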
Non-linear Activation Function
Softmax
• A generalization of the sigmoid function that is very useful for handling multi-class classification problems.
• It can be described as a combination of multiple sigmoidal functions.
• It returns the probability of a data point belonging to each individual class. The sum of all the probability values is 1, so the output is a probability distribution.
• The softmax function is often used in the final layer of a neural network-based
classifier. Such networks are commonly trained under a log loss (or cross-entropy)
regime, giving a non-linear variant of multinomial logistic regression.
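A numerically stable softmax sketch (subtracting the maximum logit is a standard trick, not something from the slides):

```python
import numpy as np

def softmax(logits):
    shifted = logits - np.max(logits)       # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()                  # probabilities summing to 1

scores = np.array([2.0, 1.0, 0.1])          # made-up logits for a 3-class problem
p = softmax(scores)
print(p, p.sum())                           # e.g. [0.659 0.242 0.099], sums to 1.0
```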
When to use which activation functions
• Usually, if the output lies in (0, 1) or (−1, 1), then sigmoid or tanh can be used. On the other hand, to predict output values larger than 1, ReLU is commonly used.
• For a binary classifier, the sigmoid activation function should be used. When predicting a probability for a multiclass problem, the softmax activation function should be used in the last layer.
• Again, tanh and sigmoid usually do not work well in hidden layers; ReLU or Leaky ReLU should be used there. The Swish activation function is sometimes used when the number of hidden layers is high (close to 30).
• However, the choice of activation function mostly depends on the data, the problem at hand, and the range of the expected output.
Multi-class vs. Binary Classification
Multi-class Classification: One-vs-One (OvO) classification
• In OvO classification, instead of using a one-hot target vector that assigns a one to the target class and zeros to all other classes, we need to construct a method that allows for pairwise classification.
• Therefore, for K classes, we need to construct a target vector consisting of L = K(K−1)/2 values (one binary classifier per pair of classes).
• The output units in the deep neural network represent binary classifiers with outputs in the range [−1, 1] or [0, 1]. The sigmoid activation function is used for the L output units.
Example: given a dataset with 4 classes (A, B, C, D), the required number of output neurons is L = 4·3/2 = 6, forming 6 binary classifiers.
The multi-output binary cross-entropy loss J_OvO for an example is the binary cross-entropy summed over the L pairwise outputs:
J_OvO = − Σ_{l=1..L} [ y_l · log(ŷ_l) + (1 − y_l) · log(1 − ŷ_l) ]
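A small NumPy sketch of how such OvO targets and the pairwise loss might be assembled for K = 4 (the pair ordering, the masking convention, and the helper names are illustrative assumptions, not from the slides):

```python
import numpy as np
from itertools import combinations

K = 4
pairs = list(combinations(range(K), 2))      # 6 pairwise classifiers for K = 4
assert len(pairs) == K * (K - 1) // 2

def ovo_target(label):
    # One binary target per (a, b) pair: 1 if the example's class is a, 0 if it is b.
    # Pairs not involving the label are excluded from the loss via the mask.
    y = np.zeros(len(pairs))
    mask = np.zeros(len(pairs))
    for l, (a, b) in enumerate(pairs):
        if label == a:
            y[l], mask[l] = 1.0, 1.0
        elif label == b:
            y[l], mask[l] = 0.0, 1.0
    return y, mask

y, mask = ovo_target(2)                      # an example labelled class C (index 2)
y_hat = np.full(len(pairs), 0.5)             # made-up sigmoid outputs
bce = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
print((bce * mask).sum())                    # loss over the pairs that involve class 2
```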
Loss functions:
Loss functions are used to train a neural network. Given an input and a target, they calculate the loss, i.e., the difference between the network's output and the target variable. Loss functions fall under four major categories:
[Table: final activation and matching loss function for each problem type]
One-vs-All: K neurons are required in the output layer for K classes, each output acting as a binary classifier with its own loss function.
Summary of activation functions and loss functions, for the variants of the perceptron:
• Perceptron: ŷ = sign(W·X) (classification).
• ADALINE / LMS (the Widrow-Hoff method): ŷ = W·X, a linear output fitted by least mean squares.
• Logistic unit: ŷ = 1/(1 + exp(−W·X)) (classification), a smooth approximation of the desired goal.
• Related variants: the Fisher discriminant function and the hinge loss.
Multilayer Perceptron
Z_j = f( Σ_{i=1..m} V_ij · X_i ),    Y_k = f( Σ_{j=1..n} W_jk · Z_j )
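A NumPy sketch of this two-stage forward pass (the layer sizes and the choice of the logistic sigmoid for f are assumptions for illustration):

```python
import numpy as np

def f(a):
    return 1.0 / (1.0 + np.exp(-a))     # assumed activation: logistic sigmoid

m, n, K = 3, 4, 2                       # made-up layer sizes: inputs, hidden, outputs
rng = np.random.default_rng(0)
X = rng.normal(size=m)                  # input vector X_i
V = rng.normal(size=(m, n))             # input-to-hidden weights V_ij
W = rng.normal(size=(n, K))             # hidden-to-output weights W_jk

Z = f(X @ V)                            # Z_j = f(sum_i V_ij X_i)
Y = f(Z @ W)                            # Y_k = f(sum_j W_jk Z_j)
print(Z.shape, Y.shape)                 # (4,) (2,)
```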
Gradient Descent
[Figure: the error curve plotted against a model parameter, illustrating the descent toward the minimum]
Gradient Descent
3. Make sure to scale the data if its features are on very different scales. If we don't scale the data, the level curves (contours) will be narrower and taller, and gradient descent will take longer to converge.
Scale the data to have μ = 0 and σ = 1. The formula for scaling each example x is:
x′ = (x − μ) / σ
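That standardization in NumPy, on a made-up feature matrix:

```python
import numpy as np

X = np.array([[2000.0, 3.0], [1500.0, 2.0], [3200.0, 4.0]])  # made-up feature matrix
mu, sigma = X.mean(axis=0), X.std(axis=0)
X_scaled = (X - mu) / sigma            # each column now has mean 0 and std 1
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```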
Gradient Descent
4. On each iteration, take the partial derivative of the cost function J(w) with respect to each parameter (the gradient):
• Continue the process until the cost function converges, that is, until the error curve becomes flat and stops changing.
• In addition, on each iteration the step is taken in the direction of maximum change, since the gradient is perpendicular to the level curves at each point.
Gradient Descent
Batch gradient
• Batch gradient descent sums over all examples on each iteration when performing the updates to the parameters. Therefore, for each update, we have to sum over all examples:
w_new = w_old − α · ∇J(w_old)
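A minimal batch-gradient-descent sketch for least-squares linear regression (the data, learning rate, and iteration count are all made up):

```python
import numpy as np

# Batch gradient descent on J(w) = (1/2N) * sum((X w - t)^2).
rng = np.random.default_rng(1)
N = 50
X = np.column_stack([rng.normal(size=N), np.ones(N)])   # one feature plus a bias column
t = 3.0 * X[:, 0] + 0.5 + 0.1 * rng.normal(size=N)      # noisy line to recover

w = np.zeros(2)
alpha = 0.1                                             # made-up learning rate
for _ in range(200):
    grad = X.T @ (X @ w - t) / N                        # the gradient uses ALL examples
    w = w - alpha * grad                                # w_new = w_old - alpha * grad
print(w)                                                # ~ [3.0, 0.5]
```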
Multi layer Perceptron and Back Propagation
LTU: Linear Threshold Unit.
For the backpropagation algorithm to work, the step function is replaced with the logistic function (the step function is non-differentiable, and its gradient is zero on the flat segments). The ReLU function is another commonly used replacement.
Multi layer Perceptron and Back Propagation
Step 4: Adjust the weights to reduce the error (the gradient descent step).
Exercise
The given network is to be trained on the data for XOR.
For the input vector [0, 1], learning rate 0.25, and the logistic activation function, compute the weight corrections.
W_new = W_old + ΔW
Solution
Network layout: inputs x_k feed the hidden units z_i through weights v_ki, and the hidden units feed the outputs y_j through weights w_ij. Backpropagating the error E along the chain x_k → v_ki → z_i → w_ij → y_j gives the error terms and weight corrections:

δ_j = (t_j − y_j) · f′(y_net,j)            (output layer)
Δw_ij = η · δ_j · z_i
δ_i = f′(z_net,i) · Σ_j δ_j · w_ij          (hidden layer)
Δv_ki = η · δ_i · x_k
Multi layer Perceptron and Back Propagation
Solution
Let the hidden-to-output weights be w1 = −4.5, w2 = 5.3; the input-to-hidden weights v11 = −2.0, v12 = 4.3, v21 = 9.2, v22 = 8.8; the hidden biases b1 = 2.0, b2 = −0.1; and the output bias b3 = −0.8.
The forward pass gives z1 ≈ 0.9999, z2 ≈ 0.9998, and y ≈ 0.5, so the output error term is δ = (t − y)·f′(y_net) ≈ 0.1249. The hidden-layer corrections are then:

Δv11 = 0.25 · (0.9999)(1 − 0.9999) · 0 · [(0.1249)(−4.5)] = 0
Δv12 = 0.25 · (0.9998)(1 − 0.9998) · 0 · [(0.1249)(5.3)] = 0
Δv21 = 0.25 · (0.9999)(1 − 0.9999) · 1 · [(0.1249)(−4.5)] ≈ −0.000014
Δv22 = 0.25 · (0.9998)(1 − 0.9998) · 1 · [(0.1249)(5.3)] ≈ 0.000033
Δb1 = 0.25 · (0.9999)(1 − 0.9999) · 1 · [(0.1249)(−4.5)] ≈ −0.000014
Δb2 = 0.25 · (0.9998)(1 − 0.9998) · 1 · [(0.1249)(5.3)] ≈ 0.000033
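A NumPy check of this arithmetic, plugging in the rounded intermediate values above (z1 = 0.9999, z2 = 0.9998, δ = 0.1249); the variable layout is my own:

```python
import numpy as np

# Verification sketch using the rounded intermediate values from the forward
# pass: z1 = sigmoid(11.2) ~ 0.9999, z2 = sigmoid(8.7) ~ 0.9998, y ~ 0.5.
eta = 0.25
x = np.array([0.0, 1.0])                     # XOR input [0, 1]
w = np.array([-4.5, 5.3])                    # hidden-to-output weights
z = np.array([0.9999, 0.9998])               # hidden activations (rounded)
delta_out = 0.1249                           # (t - y) * f'(y_net), with t = 1

delta_hidden = z * (1 - z) * (delta_out * w) # hidden error terms
dv = eta * np.outer(x, delta_hidden)         # dv[k, i] = eta * x_k * delta_i
db = eta * delta_hidden                      # hidden bias corrections (bias input = 1)

print(dv)   # [[ 0.        0.      ]
            #  [-1.40e-05  3.31e-05]]  -> Delta v21 ~ -0.000014, Delta v22 ~ 0.000033
print(db)   # [-1.40e-05  3.31e-05]    -> Delta b1, Delta b2
```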
Multi layer Perceptron and Back Propagation
Let's assume we are training a network to differentiate between cats and dogs. We therefore only need two output neurons, one for each class. We feed a cat image into the network.
For now, imagine that each pixel of the image corresponds to one 'input' (we'll see later how we can improve on this for images). Here, the network assigns a probability of 62% that the image is a dog, and 38% that it's a cat. Ideally, we want it to say this image is 100% cat.
Multi layer Perceptron and Back Propagation
So, we go backwards through the network, nudging the weights and biases to increase the
chance that the network would classify this as a cat.
Multi layer Perceptron and Back Propagation
• Consider your input data X of shape N × D, where N is the number of samples in your input data (the batch size) and D the number of dimensions (in the previous example D is 2: house size and number of bedrooms).
• The first thing to do is to subtract the mean of the input data; this centres the data around zero (i.e., the average of each input variable over the training set should be zero).
• At prediction time, it is common to store this mean value and subtract it from each test example.
• After your data is centred around zero, you can give all features the same range by dividing X by its standard deviation (i.e., scale the input variables so that their covariances are about the same).
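A sketch of storing the training-set statistics and reusing them at prediction time (the data is made up):

```python
import numpy as np

X_train = np.array([[120.0, 3.0], [80.0, 2.0], [200.0, 4.0]])  # made-up house data
mu = X_train.mean(axis=0)              # statistics come from the TRAINING set only
sigma = X_train.std(axis=0)
X_train_n = (X_train - mu) / sigma

x_test = np.array([150.0, 3.0])
x_test_n = (x_test - mu) / sigma       # reuse the stored training mean and std
print(X_train_n.mean(axis=0), x_test_n)
```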
Multi layer Perceptron and Back Propagation
Weight initialization
• If you initialize your weights to zero, gradient descent will never converge: every unit computes the same output and receives the same update, so the symmetry is never broken.
• A better idea is to initialize your weights with values close to zero (but not zero), e.g., on the order of 0.01.
• A weight should be randomly drawn from a distribution (e.g., a uniform or normal distribution) with mean zero and unit standard deviation. The problem with this initialization is that the variance of the outputs grows with the number of inputs. To solve this issue, we can divide the random term by the square root of the number of inputs.
• Therefore, the initial weights are close to zero, with mean zero and standard deviation 1/√n, where n is the number of inputs to the unit.
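A sketch of that 1/√n-scaled initialization (the layer sizes are made up):

```python
import numpy as np

rng = np.random.default_rng(42)
n_in, n_out = 256, 128                      # made-up layer sizes

W = rng.normal(0.0, 1.0, size=(n_in, n_out)) / np.sqrt(n_in)  # std ~ 1/sqrt(n_in)

x = rng.normal(size=n_in)                   # unit-variance input
print(x.std(), (x @ W).std())               # the pre-activation variance stays ~1
```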
Multi layer Perceptron and Back Propagation
Using learning rates
Multi layer Perceptron Fine tuning neural network hyper-parameters
MNIST data
[Figure: sample MNIST digit images for the ten classes 0–9]
Multi layer Perceptron Fine tuning neural network hyper-parameters
XOR problem
• Knowing that just two lines are required to represent the decision boundary tells us that the first hidden layer needs two hidden neurons.
• So we have a single hidden layer with two hidden neurons.
• Each hidden neuron can be regarded as a linear classifier, represented as a line as in the figure.
• There will be two outputs, one from each classifier (i.e., hidden neuron). But we are to build a single classifier with one output representing the class label, not two classifiers, so an output neuron must combine the two; a worked sketch follows below.
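One concrete way to wire this up; the specific weights below are an assumed textbook-style choice, not taken from the slides. Hidden neuron h1 acts like OR (the line x1 + x2 − 0.5 = 0), hidden neuron h2 acts like AND (the line x1 + x2 − 1.5 = 0), and the output fires when h1 is on but h2 is off:

```python
# XOR from two hidden linear classifiers plus one output neuron (step activations).
def step(z):
    return 1 if z >= 0 else 0

def xor_mlp(x1, x2):
    h1 = step(x1 + x2 - 0.5)        # line 1: behaves like OR
    h2 = step(x1 + x2 - 1.5)        # line 2: behaves like AND
    return step(h1 - 2 * h2 - 0.5)  # on when h1 fires and h2 does not

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_mlp(x1, x2))   # 0, 1, 1, 0
```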
Multi layer Perceptron Fine tuning neural network hyper-parameters
Required architecture
Multi layer Perceptron Fine tuning neural network hyper-parameters
Example 2
The next step is to split the decision boundary into a set of lines, where each line will be modeled as a perceptron in the ANN. Before drawing the lines, the points at which the boundary changes direction should be marked, as shown in the figure.
Multi layer Perceptron Fine tuning neural network hyper-parameters
Example 2
Multi layer Perceptron Fine tuning neural network hyper-parameters
Final architecture
Techniques to prevent overfitting
1. Hold-out (data)
Rather than using all of our data for training, we can simply split our dataset into two sets: training and testing. A common split ratio is 80% for training and 20% for testing (see the sketch below). We train our model until it performs well not only on the training set but also on the testing set. This indicates good generalization capability, since the testing set represents unseen data that was not used for training. However, this approach requires a dataset large enough that sufficient data remains for training even after splitting.
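A minimal NumPy sketch of an 80/20 hold-out split (in practice one might use sklearn.model_selection.train_test_split; the data here is made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # made-up dataset: 100 samples, 5 features
y = rng.integers(0, 2, size=100)

idx = rng.permutation(len(X))          # shuffle before splitting
cut = int(0.8 * len(X))                # 80% train / 20% test
train_idx, test_idx = idx[:cut], idx[cut:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
print(X_train.shape, X_test.shape)     # (80, 5) (20, 5)
```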
2. Cross-validation (data)
We can split our dataset into k groups (k-fold cross-validation). We let one of the groups be the testing set (see the hold-out explanation) and the others the training set, and repeat this process until each individual group has been used as the testing set (i.e., k repeats); a sketch follows below. Unlike hold-out, cross-validation eventually uses all data for training, but it is also more computationally expensive than hold-out.
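A k-fold index sketch in NumPy (k = 5 is an assumed choice; sklearn.model_selection.KFold does the same job):

```python
import numpy as np

k, n = 5, 100                            # assumed: 5 folds over 100 samples
rng = np.random.default_rng(0)
idx = rng.permutation(n)
folds = np.array_split(idx, k)           # k roughly equal groups

for i in range(k):
    test_idx = folds[i]                                  # fold i is the test set
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    # train on train_idx, evaluate on test_idx, then average the k scores
    print(f"fold {i}: train={len(train_idx)}, test={len(test_idx)}")
```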
3. Data augmentation (data)
A larger dataset would reduce overfitting. If we cannot gather more data and are
constrained to the data we have in our current dataset, we can apply data
augmentation to artificially increase the size of our dataset. For example, if we are
training for an image classification task, we can perform various image
transformations to our image dataset (e.g., flipping, rotating, rescaling, shifting).
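A tiny example of augmentation by flipping and rotating (a real pipeline would likely use a library such as torchvision; the "image" here is a made-up array):

```python
import numpy as np

img = np.arange(9.0).reshape(3, 3)      # stand-in for a grayscale image
augmented = [
    img,
    np.fliplr(img),                     # horizontal flip
    np.flipud(img),                     # vertical flip
    np.rot90(img),                      # 90-degree rotation
]
print(len(augmented), "variants from one image")
```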
Techniques to prevent overfitting
4. Feature selection (data)
If we have only a limited number of training samples, each with a large number of features, we should select only the most important features for training, so that our model doesn't need to learn from so many features and eventually overfit. We can simply test different features, train individual models on them, and evaluate their generalization capabilities, or use one of the various widely used feature-selection methods.
5. L1 / L2 regularization (learning algorithm)
Regularization is a technique to constrain our network from learning a model that is too complex and may therefore overfit. In L1 or L2 regularization, we add a penalty term to the cost function to push the estimated coefficients towards zero (and keep them from taking extreme values). L2 regularization lets weights decay towards zero but rarely exactly to zero, while L1 regularization can drive weights exactly to zero. A sketch follows below.
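A sketch of adding an L2 penalty to a cost function (the lam value is a made-up hyperparameter; for L1, the penalty would be lam * np.abs(w).sum()):

```python
import numpy as np

def mse_with_l2(y_hat, t, w, lam=0.01):
    data_loss = np.mean((y_hat - t) ** 2)     # the ordinary cost
    penalty = lam * np.sum(w ** 2)            # L2 penalty pushes weights toward zero
    return data_loss + penalty

w = np.array([0.5, -2.0, 0.1])
print(mse_with_l2(np.array([0.9]), np.array([1.0]), w))
```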
Techniques to prevent overfitting
6. Remove layers / number of units per layer (model)
As mentioned under L1/L2 regularization, an over-complex model is more likely to overfit. Therefore, we can directly reduce the model's complexity by removing layers, shrinking the size of our model. We may further reduce complexity by decreasing the number of neurons in the fully-connected layers. We should aim for a model whose complexity sufficiently balances underfitting and overfitting for our task.
Techniques to prevent overfitting
7. Dropout (model)
By applying dropout, which is a form of regularization, to our layers, we ignore a
subset of units of our network with a set probability. Using dropout, we can reduce
interdependent learning among units, which may have led to overfitting. However,
with dropout, we would need more epochs for our model to converge.
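An inverted-dropout sketch (the keep probability 0.8 is an assumed value; at test time no mask is applied, thanks to the rescaling done at train time):

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=10)                 # activations of some hidden layer
keep_prob = 0.8                         # assumed: drop each unit with probability 0.2

mask = (rng.random(h.shape) < keep_prob) / keep_prob  # "inverted" scaling at train time
h_dropped = h * mask                    # a random subset of units is zeroed out
print(h_dropped)
```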
Techniques to prevent overfitting
8. Early stopping (model)
We can first train our model for an arbitrarily large number of epochs and plot the validation-loss curve (e.g., using hold-out). Once the validation loss begins to degrade (i.e., stops decreasing and begins increasing), we stop the training and save the current model. We can implement this either by monitoring the loss curve or by setting an early-stopping trigger, as in the skeleton below. The saved model is then the one that generalizes best across the training epochs.
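A runnable skeleton of the early-stopping trigger; the validation-loss curve here is made up, standing in for real per-epoch evaluations:

```python
# Skeleton of early stopping driven by a validation-loss sequence.
# val_losses is a made-up curve; in practice each value would come from
# evaluating the model on the validation set after one training epoch.
val_losses = [0.9, 0.7, 0.55, 0.5, 0.48, 0.49, 0.47, 0.50, 0.52, 0.55, 0.60]

patience = 3                      # assumed: stop after 3 epochs without improvement
best_loss, best_epoch, bad_epochs = float("inf"), -1, 0

for epoch, loss in enumerate(val_losses):
    if loss < best_loss:
        best_loss, best_epoch, bad_epochs = loss, epoch, 0   # save the model here
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break                 # the validation loss has stopped improving

print(f"stopped at epoch {epoch}; best was epoch {best_epoch} (loss {best_loss})")
```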