
What are Neural Networks?

 Models of the brain and nervous system
 Highly parallel
 Process information much more like the brain than a serial computer
 Learning
 Very simple principles
 Very complex behaviours
 Applications
 As powerful problem solvers
 As biological models
Biological Neural Nets
 Pigeons as art experts (Watanabe et al. 1995)

 Experiment:
 Pigeon in Skinner box
 Present paintings of two different artists (e.g. Monet / Picasso)
 Reward for pecking when presented a particular artist (e.g. Picasso)

 Pigeons were able to discriminate between paintings from the two artists with 95% accuracy (when presented with pictures they had been trained on)
 Discrimination was still 85% successful for previously unseen paintings of the artists

 Pigeons do not simply memorise the pictures
 They can extract and recognise patterns (the 'style')
 They generalise from the already seen to make predictions

 This is what neural networks (biological and artificial) are good at.
ANNs – The basics
ANNs incorporate the two fundamental components
of biological neural nets:

1. Neurones (nodes)
2. Synapses (weights)
Neurone vs. Node
What is an artificial neuron ?

 n 1

y  f  w0   wi xi 
 i 1 
y

w0

For Example
y = sign(…)
x1 x2 x3
Activation functions
Linear:

y = x

Logistic:

y = 1 / (1 + exp(−x))
Synapse vs. weight
Feed-forward neural network

Information flow is unidirectional:

Data is presented to the Input layer
Passed on to the Hidden layer
Passed on to the Output layer

Information processing is parallel
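
To make the layer-to-layer flow concrete, here is a minimal sketch of a single forward pass through a one-hidden-layer feed-forward network in NumPy; the layer sizes and random weights are illustrative assumptions, not values from the slides.

import numpy as np

def logistic(x):
    # Logistic activation: 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, W1, b1, W2, b2):
    h = logistic(W1 @ x + b1)   # input layer -> hidden layer
    y = logistic(W2 @ h + b2)   # hidden layer -> output layer
    return y

# Illustrative sizes: 3 inputs, 4 hidden units, 1 output
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)
print(forward(np.array([1.0, 0.0, 1.0]), W1, b1, W2, b2))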


Perceptron
• Rosenblatt (1957)
• Linear separation
• Inputs: vector of real values
• Outputs: 1 or 0

y = f(v), where v = w0 + w1 x1 + w2 x2
Decision boundary: w0 + w1 x1 + w2 x2 = 0

[Figure: two classes of points in the plane separated by the line w0 + w1 x1 + w2 x2 = 0, with y = 1 on one side and y = 0 on the other; below, a perceptron with inputs 1, x1, x2 and weights w0, w1, w2]
Perceptron Learning

How does a Perceptron learn the appropriate weights?
Perceptron Learning
I’m going to learn the category/class “good fruit” defined as
anything that is sweet.
Features:
Taste Sweet = 1, Not_Sweet = 0
Seeds Edible = 1, Not_Edible = 0
Skin Edible = 1, Not_Edible = 0

Output Classes:
Good_Fruit = 1
Not_Good_Fruit = 0
Perceptron Learning
Let's start with no knowledge: the weights are all empty (zero).

[Diagram: inputs Taste, Seeds, Skin; weights 0.0, 0.0, 0.0; the output node fires if ∑ > 0.4]
Perceptron Learning
 To train the perceptron, we will show it each example and have it categorize each one.

 Since it's starting with no knowledge, it is going to make mistakes. When it makes a mistake, we are going to adjust the weights to make that mistake less likely in the future.

 When we adjust the weights, we're going to take relatively small steps to be sure we don't over-correct and create new problems.
Perceptron Learning
Show it a banana:

[Diagram: inputs Taste = 1, Seeds = 1, Skin = 0; weights 0.0, 0.0, 0.0; the output node fires if ∑ > 0.4. Output: 0. Teacher: 1.]
Perceptron Learning
 In this case we have:
 (1 X 0) = 0
 + (1 X 0) = 0
 + (0 X 0) = 0
 It adds up to 0.0.
 Since that is less than the threshold (0.40), we responded “no.”
 Is that correct? No.
Perceptron Learning
• Since we got it wrong, we know we need to change the weights. We'll do that using the delta rule (delta for change).

∆w = learning rate X (teacher provided class – predicted class) X node input
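
As a minimal sketch (the function name is illustrative, not from the slides), the delta rule translates directly into code:

def delta_w(learning_rate, teacher, predicted, node_input):
    # Delta rule: the weight change is proportional to the error
    # (teacher class minus predicted class) and to the node's input.
    return learning_rate * (teacher - predicted) * node_input

# Banana example from the slides: rate 0.25, teacher 1, predicted 0, input 1
print(delta_w(0.25, 1, 0, 1))  # 0.25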
Perceptron Learning
• The three parts of that are:
• Learning rate: We set that ourselves. I want it to be large enough that learning happens in a reasonable amount of time, but small enough that I don't go too fast. I'm picking 0.25.
• (teacher provided class – predicted class): The teacher knows the correct answer (e.g., that a banana should be a good fruit). In this case, the teacher says 1, the output is 0, so (1 - 0) = 1.
• Input value: This is the input value to that input node. For the first node, it is 1.
Perceptron Learning
 To pull it together:
 Learning rate: 0.25.
 (teacher provided class – predicted class): 1.
 node input: 1.
 ∆w = 0.25 X 1 X 1 = 0.25.
 Since it's a ∆w, it's telling us how much to change the first weight. In this case, we're adding 0.25 to it.
Let's think about the delta rule
 (teacher provided class – predicted class):
 If we get the categorization right, it will be zero (the right answer minus itself), and we won't change any of the weights. As far as we know we have a good solution, so why would we change it?
 If we get the categorization wrong, it will be either -1 or +1.
 If we said "yes" when the answer was "no," we're too high on the weights, and we will get a value of -1, which will result in reducing the weights.
 If we said "no" when the answer was "yes," we're too low on the weights, and this will cause them to be increased.
Let's think about the delta rule
 Input value to the node:
 If the input value to an input node is 0, then it didn't participate in making the decision. In that case, it shouldn't be adjusted; multiplying by zero makes that happen.
 If the input value to an input node is 1, then it did participate, and we should change the weight (up or down as needed).
Perceptron Learning
 How do we change the weights for banana?

Feature   Learning rate   (teacher – predicted)   Input   ∆w
taste     0.25            1                       1       +0.25
seeds     0.25            1                       1       +0.25
skin      0.25            1                       0       0
Perceptron Learning
Adjusting the weights for banana, one at a time:

Weight 1 (taste): ∆w = .25 X (1 – 0) X 1 = 0.25, so the taste weight becomes 0.25.
Weight 2 (seeds): ∆w = .25 X (1 – 0) X 1 = 0.25, so the seeds weight becomes 0.25.
Weight 3 (skin): ∆w = .25 X (1 – 0) X 0 = 0.00, so the skin weight stays 0.0.

[Diagram: after the banana update the weights are Taste = 0.25, Seeds = 0.25, Skin = 0.0; the output node fires if ∑ > 0.4]
Perceptron Learning
 To continue training, we show it the next example and adjust the weights…

 We will keep cycling through the examples until we go all the way through one time without making any changes to the weights. At that point, the concept is learned.
Perceptron Learning
Show it a pear:

[Diagram: inputs Taste = 1, Seeds = 0, Skin = 1; weights Taste = 0.25, Seeds = 0.25, Skin = 0.0; ∑ = 0.25, not above 0.4. Output: 0. Teacher: 1.]
Perceptron Learning
 How do we change the weights for pear?

Feature   Learning rate   (teacher – predicted)   Input   ∆w
taste     0.25            1                       1       +0.25
seeds     0.25            1                       0       0
skin      0.25            1                       1       +0.25
Perceptron Learning
Adjusting the weights for pear, one at a time:

Weight 1 (taste): ∆w = .25 X (1 – 0) X 1 = 0.25, so the taste weight becomes 0.50.
Weight 2 (seeds): ∆w = .25 X (1 – 0) X 0 = 0.00, so the seeds weight stays 0.25.
Weight 3 (skin): ∆w = .25 X (1 – 0) X 1 = 0.25, so the skin weight becomes 0.25.
Perceptron Learning
Here it is with the final weights:

[Diagram: weights Taste = 0.50, Seeds = 0.25, Skin = 0.25; the output node fires if ∑ > 0.4]
Perceptron Learning
Show it a lemon:

[Diagram: inputs Taste = 0, Seeds = 0, Skin = 0; weights 0.50, 0.25, 0.25; ∑ = 0, not above 0.4. Output: 0. Teacher: 0.]
Perceptron Learning
 How do we change the weights for lemon?

Feature   Learning rate   (teacher – predicted)   Input   ∆w
taste     0.25            0                       0       0
seeds     0.25            0                       0       0
skin      0.25            0                       0       0
Perceptron Learning
Here it is with the adjusted weights:

[Diagram: weights unchanged at Taste = 0.50, Seeds = 0.25, Skin = 0.25; the output node fires if ∑ > 0.4]
Perceptron Learning
Show it a strawberry:

[Diagram: inputs Taste = 1, Seeds = 1, Skin = 1; weights 0.50, 0.25, 0.25; ∑ = 1.0 > 0.4. Output: 1. Teacher: 1.]
Perceptron Learning
 How do we change the weights for strawberry?

Feature   Learning rate   (teacher – predicted)   Input   ∆w
taste     0.25            0                       1       0
seeds     0.25            0                       1       0
skin      0.25            0                       1       0
Perceptron Learning
Here it is with the adjusted weights:

[Diagram: weights unchanged at Taste = 0.50, Seeds = 0.25, Skin = 0.25; the output node fires if ∑ > 0.4]

One more pass through all four examples produces no further weight changes, so the concept "good fruit" is learned. A compact implementation of this whole procedure is sketched below.
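
This is a minimal sketch of the training loop the slides walk through: threshold 0.4, learning rate 0.25, features (taste, seeds, skin), and the four fruit examples; the variable names are illustrative.

# Sketch of the perceptron training procedure from the slides.
examples = [
    ("banana",     (1, 1, 0), 1),
    ("pear",       (1, 0, 1), 1),
    ("lemon",      (0, 0, 0), 0),
    ("strawberry", (1, 1, 1), 1),
]

weights = [0.0, 0.0, 0.0]
rate, threshold = 0.25, 0.4

changed = True
while changed:                      # cycle until a full pass makes no change
    changed = False
    for name, inputs, teacher in examples:
        total = sum(w * x for w, x in zip(weights, inputs))
        predicted = 1 if total > threshold else 0   # fire if sum exceeds threshold
        for i, x in enumerate(inputs):              # delta rule on every weight
            dw = rate * (teacher - predicted) * x
            if dw != 0:
                weights[i] += dw
                changed = True

print(weights)  # converges to [0.5, 0.25, 0.25], as in the slides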
Perceptron: Threshold Function

• If we define s(.) as a threshold function…

[Slides: threshold function, output and activation function, error calculation; the corresponding equations are given in the chapter text below.]
[Figure 9.2: Flow of signals in a biological neuron]

Although neuron switching speeds are much slower than computer switching speeds, we are able to take complex decisions relatively quickly. Because of this, it is believed that the information processing capabilities of biological neural systems are a consequence of the ability of such systems to carry out a huge number of parallel processes distributed over many neurons. The developments in ANN systems are motivated by the desire to implement this kind of highly parallel computation using distributed representations.

9.3 Artificial neurons


Definition
An artificial neuron is a mathematical function conceived as a model of biological neurons. Artificial
neurons are elementary units in an artificial neural network. The artificial neuron receives one or
more inputs (representing excitatory postsynaptic potentials and inhibitory postsynaptic potentials
at neural dendrites) and sums them to produce an output. Each input is separately weighted, and the
sum is passed through a function known as an activation function or transfer function.

Schematic representation of an artificial neuron


The diagram shown in Figure 9.3 gives a schematic representation of a model of an artificial neuron. The notations in the diagram have the following meanings:

[Figure 9.3: Schematic representation of an artificial neuron: input nodes x0 = 1, x1, ..., xn with weights w0, w1, ..., wn feeding a summation node ∑_{i=0}^{n} wi xi, followed by the function f, producing the output y = f(∑_{i=0}^{n} wi xi)]

x1, x2, ..., xn : input signals
w1, w2, ..., wn : weights associated with the input signals

x0 : input signal taking the constant value 1
w0 : weight associated with x0 (called the bias)
∑ : indicates summation of input signals
f : function which produces the output
y : output signal

The function f can be expressed in the following form:

    y = f( ∑_{i=0}^{n} wi xi )        (9.1)

Remarks
The small circles in the schematic representation of the artificial neuron shown in Figure 9.3 are called the nodes of the neuron. The circles on the left side, which receive the values of x0, x1, ..., xn, are called the input nodes, and the circle on the right side, which outputs the value of y, is called the output node. The squares represent the processes taking place before the result is output. They need not be explicitly shown in the schematic representation. Figure 9.4 shows a simplified representation of an artificial neuron.

[Figure 9.4: Simplified representation of an artificial neuron: inputs x0 = 1, x1, ..., xn with weights w0, w1, ..., wn feeding directly into the output y = f(∑_{i=0}^{n} wi xi)]
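
A minimal sketch of Eq. (9.1) in NumPy; the input and weight values are illustrative assumptions.

import numpy as np

def artificial_neuron(x, w, f):
    # Eq. (9.1): y = f(sum over i of w_i * x_i), where x[0] = 1 carries the bias w[0]
    return f(np.dot(w, x))

x = np.array([1.0, 0.5, -0.2])    # x0 = 1 (constant bias input), x1, x2
w = np.array([0.1, 0.8, 0.3])     # w0 (bias), w1, w2
print(artificial_neuron(x, w, lambda v: 1.0 / (1.0 + np.exp(-v))))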

9.4 Activation function


9.4.1 Definition
In an artificial neural network, the function which takes the incoming signals as input and produces
the output signal is known as the activation function.

Remark
Eq.(9.1) represents the activation function of the ANN model shown in Figure 9.3.

9.4.2 Some simple activation functions


The following are some of the simple activation functions.

1. Threshold activation function


The threshold activation function is defined by


    f(x) = { 1,   if x > 0
           { −1,  if x ≤ 0

The graph of this function is shown in Figure 9.5.

[Figure 9.5: Threshold activation function]

2. Unit step functions


Sometimes, the threshold activation function is also defined as a unit step function in which case it
is called a unit-step activation function. This is defined as follows:


    f(x) = { 1,  if x ≥ 0
           { 0,  if x < 0

The graph of this function is shown in Figure 9.6.

[Figure 9.6: Unit step activation function]

3. Sigmoid activation function (logistic function)


One of the most commonly used activation functions is the sigmoid activation function. It is defined as follows:

    f(x) = 1 / (1 + e^(−x))

The graph of the function is shown in Figure 9.7.

[Figure 9.7: The sigmoid activation function]



4. Linear activation function


The linear activation function is defined by

    f(x) = mx + c.

This defines a straight line in the xy-plane.

[Figure 9.8: Linear activation function]

5. Piecewise (or, saturated) linear activation function


This is defined by

    f(x) = { 0,       if x < xmin
           { mx + c,  if xmin ≤ x ≤ xmax
           { 1,       if x > xmax

[Figure 9.9: Piecewise linear activation function]

6. Gaussian activation function


This is defined by

    f(x) = (1 / (σ√(2π))) e^(−(x−µ)² / (2σ²)).

[Figure 9.10: Gaussian activation function]



7. Hyperbolic tangential activation function


This is defined by

    f(x) = (e^x − e^(−x)) / (e^x + e^(−x)).

[Figure 9.11: Hyperbolic tangent activation function]
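
As a compact reference, here are sketches of the activation functions above in NumPy; the slope m, intercept c, saturation bounds, and Gaussian parameters are illustrative defaults, not values from the text.

import numpy as np

def threshold(x):
    return np.where(x > 0, 1, -1)                        # Figure 9.5

def unit_step(x):
    return np.where(x >= 0, 1, 0)                        # Figure 9.6

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))                      # Figure 9.7

def linear(x, m=1.0, c=0.0):
    return m * x + c                                     # Figure 9.8

def piecewise_linear(x, xmin=-1.0, xmax=1.0, m=0.5, c=0.5):
    # 0 below xmin, linear in between, saturated above xmax (Figure 9.9)
    return np.where(x < xmin, 0.0, np.where(x > xmax, 1.0, m * x + c))

def gaussian(x, mu=0.0, sigma=1.0):
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))  # Figure 9.11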

9.5 Perceptron
The perceptron is a special type of artificial neuron in which the activation function has a special form.

9.5.1 Definition
A perceptron is an artificial neuron in which the activation function is the threshold function.
Consider an artificial neuron having x1 , x2 , ⋯, xn as the input signals and w1 , w2 , ⋯, wn as the
associated weights. Let w0 be some constant. The neuron is called a perceptron if the output of the
neuron is given by the following function:


    o(x1, x2, ..., xn) = { 1,   if w0 + w1 x1 + ⋯ + wn xn > 0
                         { −1,  if w0 + w1 x1 + ⋯ + wn xn ≤ 0

Figure 9.12 shows the schematic representation of a perceptron.

[Figure 9.12: Schematic representation of a perceptron: inputs x0 = 1, x1, ..., xn with weights w0, w1, ..., wn; y = 1 if ∑_{i=0}^{n} wi xi > 0 and −1 otherwise]



Remarks
1. The quantity −w0 can be looked upon as a “threshold” that should be crossed by the weighted
sum w1 x1 + ⋯ + wn xn in order for the neuron to output a “1”.

9.5.2 Representations of boolean functions by perceptrons


In this section we examine whether simple boolean functions like x1 AND x2 can be represented by
perceptrons. To be consistent with the conventions in the definition of a perceptron we assume that
the values −1 and 1 represent the boolean constants “false” and “true” respectively.

9.5.3 Representation of x1 AND x2


Let x1 and x2 be two boolean variables. Then the boolean function x1 AND x2 is represented by Table 9.1. It can be easily verified that the perceptron shown in Figure 9.13 represents the function x1 AND x2.

x1    x2    x1 AND x2
−1    −1    −1
−1     1    −1
 1    −1    −1
 1     1     1

Table 9.1: The boolean function x1 AND x2

[Figure 9.13: Representation of x1 AND x2 by a perceptron: inputs x0 = 1, x1, x2 with weights w0 = −0.8, w1 = 0.5, w2 = 0.5; y = 1 if ∑ wi xi > 0 and −1 otherwise]

In the perceptron shown in Figure 9.13, the output is given by

    y = { 1,   if ∑_{i=0}^{2} wi xi > 0
        { −1,  otherwise

      = { 1,   if −0.8 + 0.5 x1 + 0.5 x2 > 0
        { −1,  otherwise

Representations of OR, NAND and NOR


The functions x1 OR x2 , x1 NAND x2 and x1 NOR x2 can also be represented by perceptrons. Table
9.2 shows the values to be assigned to the weights w0 , w1 , w2 for getting these boolean functions.
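
As a quick check, a sketch of the Figure 9.13 perceptron in Python reproduces Table 9.1; the function name is illustrative.

def perceptron(x1, x2, w0=-0.8, w1=0.5, w2=0.5):
    # Perceptron of Figure 9.13: output 1 if w0 + w1*x1 + w2*x2 > 0, else -1
    return 1 if w0 + w1 * x1 + w2 * x2 > 0 else -1

# Reproduces Table 9.1 (with -1/1 encoding "false"/"true")
for x1 in (-1, 1):
    for x2 in (-1, 1):
        print(x1, x2, perceptron(x1, x2))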
Practical issues in neural network training

The Problem of Overfitting

Overfitting is a significant concern in neural network training, as
these models are capable of learning intricate patterns within
the training data, sometimes to the extent of capturing noise or
outliers.


The problem of overfitting in neural networks is characterized
by the model performing well on the training data but failing to
generalize effectively to new, unseen data.
Causes of Overfitting in Neural Networks

Model Complexity:
 Complex neural network architectures with a large number of parameters may lead to
overfitting, especially when the available training data is limited.

Insufficient Data:
 If the size of the training dataset is small, neural networks might memorize specific
examples rather than learning the underlying patterns.

Inadequate Regularization:
 Insufficient use of regularization techniques, such as weight decay, dropout, or batch
normalization, can contribute to overfitting.

Noisy Features:
 Neural networks may overfit if they learn patterns from irrelevant or noisy features in the
training data.

Effects of Overfitting in Neural Networks


Accuracy Drop on Test Data:
 The model performs well on the training set but has lower accuracy on the
validation or test set.

Increased Variance:
 The model becomes sensitive to small fluctuations in the training data,
resulting in high variance.

Complex Decision Boundaries:
 Overfit neural networks tend to create complex decision boundaries that
closely fit the training data but do not generalize well.

Methods to reduce Overfitting in Neural
Networks

Regularization Techniques:
 Use L1 or L2 regularization to penalize large weights in the model.
 Apply dropout, a technique where random neurons are "dropped out" during training,
preventing over-reliance on specific neurons.

Early Stopping:
 Monitor the model's performance on a validation set during training and stop training
when the performance on the validation set starts degrading.

Data Augmentation:
 Increase the effective size of the training dataset by applying data augmentation
techniques, such as rotation, flipping, or scaling. This introduces diversity and helps the
model generalize better.

Simplifying Model Architecture:
 Reduce the number of layers or neurons in the network to simplify the model and prevent overfitting.

Cross-Validation:
 Use cross-validation to assess the model's performance on different subsets
of the data and detect overfitting.

Batch Normalization:
 Apply batch normalization to normalize the inputs to each layer, which can
help stabilize and speed up training.

Ensemble Methods:
 Combine predictions from multiple neural networks to reduce overfitting and
improve generalization.

Monitor Loss Curves:
 Visualize and monitor the training and validation loss curves to identify signs of overfitting, such as increasing validation loss. (A sketch combining several of these techniques follows this list.)
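
Here is a minimal PyTorch sketch combining dropout, L2 regularization (via weight decay), and early stopping on a validation loss; all sizes, hyperparameters, and the synthetic data are illustrative assumptions, not prescriptions from the text.

import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(256, 20)                    # synthetic data for illustration
y = torch.randint(0, 2, (256,))
train, val = (X[:200], y[:200]), (X[200:], y[200:])

model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(),
    nn.Dropout(p=0.5),                      # randomly drop neurons while training
    nn.Linear(64, 2),
)
# weight_decay adds an L2 penalty on the weights
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
loss_fn = nn.CrossEntropyLoss()

best_val, bad_epochs, patience = float("inf"), 0, 5
for epoch in range(200):
    model.train()
    optimizer.zero_grad()
    loss_fn(model(train[0]), train[1]).backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(val[0]), val[1]).item()
    if val_loss < best_val - 1e-4:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:          # early stopping: validation loss
            break                           # has stopped improving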

Addressing overfitting in neural networks is an ongoing process that involves
experimentation and careful tuning of hyperparameters.


The choice of specific strategies depends on the characteristics of the
dataset and the architecture of the neural network being used.


Regularization, early stopping, and data augmentation are commonly
employed techniques to enhance the generalization ability of neural
networks.

Vanishing Gradient Problem

In deep networks during backpropagation, as gradients are propagated
backward through layers, they can diminish to near-zero values.


This is often intensified by the use of activation functions with derivatives that
tend to be very small in certain regions (e.g., sigmoid or tanh functions).


Layers earlier in the network receive very small gradients, leading to
negligible updates to their weights.


These layers may essentially stop learning as their weights are no longer
being adjusted significantly.

Methods to reduce the vanishing gradient
problem

Activation Functions: Replace sigmoid or tanh activations with non-saturating activation functions such as rectified linear units (ReLU) or variants like Leaky ReLU to mitigate vanishing gradients.


Batch Normalization: Normalize inputs to each layer, helping stabilize and
propagate gradients.


Skip Connections: Implement skip connections or residual connections, as seen in architectures like ResNet, to create shortcut paths for gradient flow (see the sketch below).
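
A minimal PyTorch sketch of these ideas: a residual (skip-connection) block with batch normalization and a non-saturating ReLU; the dimensions are illustrative assumptions.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # Skip connection: the input is added back to the block's output,
    # giving gradients a shortcut path around the nonlinearity.
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, dim)
        self.norm = nn.BatchNorm1d(dim)   # batch normalization stabilizes gradients
        self.act = nn.ReLU()              # non-saturating activation

    def forward(self, x):
        return x + self.act(self.norm(self.fc(x)))

x = torch.randn(8, 32)
block = ResidualBlock(32)
print(block(x).shape)   # torch.Size([8, 32])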

Exploding Gradient Problem

Opposite to the vanishing gradient problem, exploding gradients occur when
gradients become extremely large during backpropagation.

This often happens when weights are initialized too large or when there is an
issue with the optimization process.


Gradients become so large that weight updates are excessively large, leading
to instability in the training process.

This can result in NaN (Not a Number) values, making the training process
diverge.

Methods to reduce the exploding gradient
problem

Weight Initialization: Use proper weight initialization techniques, such
as Xavier/Glorot initialization, to control the scale of weights.


Gradient Clipping: Clip gradients during training to prevent them from exceeding a certain threshold (see the sketch below).


Learning Rate Scheduling: Adjust learning rates dynamically during
training to prevent abrupt updates.
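
A hedged sketch combining the three mitigations in PyTorch; the model, data, and hyperparameters are illustrative assumptions.

import torch
import torch.nn as nn

model = nn.Linear(10, 1)
nn.init.xavier_uniform_(model.weight)            # Xavier/Glorot initialization
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

x, y = torch.randn(32, 10), torch.randn(32, 1)   # synthetic data for illustration
for epoch in range(30):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    # Clip gradients so their global norm never exceeds 1.0
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()                             # decay the learning rate over time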
Difficulties in Convergence
• Convergence difficulties in neural network training are common challenges that can hinder the model
from effectively learning from the training data. These difficulties may manifest in various ways during
the training process. Here are some common issues related to convergence in neural network training:
1. Slow Convergence
2. Noisy Training Curves
3. Plateauing or Diverging Loss
4. Vanishing or Exploding Gradients
5. Overfitting
6. Data Imbalance
7. Vanishing Learning Rate
• Addressing convergence difficulties often involves a combination of hyperparameter tuning, careful
preprocessing, and architectural adjustments. Experimenting with different configurations and
monitoring training progress can help diagnose and mitigate convergence issues in neural network
training.
Local Optima
• Local Optimum: A point in the solution space where the function has a lower value than its immediate neighbors, but not necessarily the absolute lowest value across the entire space.

• Local optima can occur in complex, non-convex optimization problems. In such problems, the objective function may have multiple peaks and valleys.

• Global Optimum: The absolute lowest point in the entire solution space.

• Challenges:
• If optimization algorithms get stuck in a local optimum, they might fail to find the global optimum, leading to suboptimal solutions.
• In high-dimensional spaces, it is challenging to explore the entire space thoroughly, making it more likely to converge to a local optimum.
Spurious Optima
• Spurious Optima: Points in the solution space where the gradient is zero, but the point is not a true optimum. These points might occur due to flat regions, saddle points, or other irregularities in the objective function.

• Spurious optima can arise in the presence of regions in the solution space where the gradient is very small or zero.
• In high-dimensional spaces, saddle points, where some dimensions have an increasing gradient and others have a decreasing gradient, can lead to spurious optima.

Challenges:

• Optimization algorithms relying solely on gradient information may get stuck in spurious optima, even though the point is not a true optimum.
• In deep learning, where optimization problems are often high-dimensional, dealing with spurious optima becomes a common challenge.
Strategies to Address Local and Spurious Optima
• Initialization Strategies
• Adaptive Learning Rates
• Stochasticity
• Exploration-Exploitation Trade-off
• Higher-Order Optimization
• Ensemble Methods
• Architecture Design
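
As a small illustration of two of these strategies, random-restart initialization and stochastic gradient descent with momentum, here is a sketch that minimizes an illustrative non-convex function; nothing here comes from the original slides.

import torch

def f(x):
    # Illustrative non-convex objective with several local minima
    return torch.sin(3 * x) + 0.1 * x ** 2

torch.manual_seed(0)
best_x, best_val = None, float("inf")
for restart in range(5):                    # initialization strategy: random restarts
    x = (torch.randn(1) * 3).requires_grad_()
    # momentum can help roll past shallow local minima
    opt = torch.optim.SGD([x], lr=0.05, momentum=0.9)
    for step in range(200):
        opt.zero_grad()
        f(x).backward()
        opt.step()
    final = f(x).item()
    if final < best_val:
        best_x, best_val = x.item(), final

print(best_x, best_val)                     # best of the five runs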
