ML Mod 2 full
Experiment:
Pigeon in a Skinner box
Presented with paintings by two different artists (e.g. Monet / Picasso)
Rewarded for pecking when presented with a particular artist (e.g. Picasso)
Pigeons were able to discriminate between paintings from the two artists with 95% accuracy (when presented with pictures they had been trained on)
1. Neurons (nodes)
2. Synapses (weights)
Neuron vs. node
What is an artificial neuron?
An artificial neuron computes a weighted sum of its inputs and applies an activation function f:
y = f(w0 + Σ_{i=1}^{n} wi xi)
For example, with inputs x1, x2, x3 and the sign function as the activation:
y = sign(w0 + w1 x1 + w2 x2 + w3 x3)
Activation functions
Linear:
y = x
Logistic:
y = 1 / (1 + exp(−x))
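These two activation functions can be sketched directly in Python (a minimal, framework-free sketch):

```python
import math

def linear(x):
    # Linear activation: output equals input
    return x

def logistic(x):
    # Logistic (sigmoid) activation: squashes x into the interval (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

print(linear(2.0))    # 2.0
print(logistic(0.0))  # 0.5
```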
Synapse vs. weight
Feed-forward neural network
Perceptron Learning
Output Classes:
Good_Fruit = 1
Not_Good_Fruit = 0
Perceptron Learning
Let's start with no knowledge: the weights on both inputs (Taste and Seeds) are initialized to 0.0, so the output is 0.0.
[Figure sequence: step-by-step perceptron weight updates. On each training example the output is compared with a teacher signal; whenever they disagree, the weights on the active inputs are adjusted in steps of 0.25. Over the steps shown, the weight on Taste grows from 0.0 to 0.25 to 0.50.]
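The slide sequence above can be sketched as code. This is a minimal sketch, not the slides' exact setup: the learning rate of 0.25 matches the weight steps shown, but the threshold of 0.4 and the particular training examples are assumptions made for illustration.

```python
def step(z, threshold=0.4):
    # Threshold activation (the threshold value 0.4 is an assumption)
    return 1 if z > threshold else 0

def train(samples, lr=0.25, epochs=10):
    w = [0.0, 0.0]  # start with no knowledge: (Taste, Seeds) weights at 0.0
    for _ in range(epochs):
        for x, teacher in samples:
            y = step(sum(wi * xi for wi, xi in zip(w, x)))
            error = teacher - y  # compare output with the teacher signal
            # Perceptron rule: adjust weights on active inputs only
            w = [wi + lr * error * xi for wi, xi in zip(w, x)]
    return w

# Hypothetical examples: a fruit is Good_Fruit when it tastes good;
# seeds alone do not make it good. Inputs are (Taste, Seeds).
samples = [((1, 1), 1), ((1, 0), 1), ((0, 1), 0), ((0, 0), 0)]
w = train(samples)
print(w)  # the Taste weight ends up at 0.50, as in the slides
```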
Although individual neurons switch far more slowly than computer switching elements, we are able to make complex decisions relatively quickly. Because of this, it is believed that the information processing capabilities of biological neural systems are a consequence of the ability of such systems to carry out a huge number of parallel processes distributed over many neurons. The developments in ANN systems are motivated by the desire to implement this kind of highly parallel computation using distributed representations.
[Figure 9.3: Schematic representation of an artificial neuron. The inputs x0 = 1, x1, x2, ..., xn, weighted by w0, w1, w2, ..., wn, feed a summation node that computes Σ_{i=0}^{n} wi xi; an activation function f is then applied, giving the output y = f(Σ_{i=0}^{n} wi xi).]
x1, x2, ..., xn : input signals
w1, w2, ..., wn : weights associated with the input signals
CHAPTER 9. NEURAL NETWORKS 113
Remarks
The small circles in the schematic representation of the artificial neuron shown in Figure 9.3 are called the nodes of the neuron. The circles on the left side, which receive the values of x0, x1, ..., xn, are called the input nodes, and the circle on the right side, which outputs the value of y, is called the output node. The squares represent the processes that take place before the result is output. They need not be explicitly shown in the schematic representation. Figure 9.4 shows a simplified representation of an artificial neuron.
[Figure 9.4: Simplified representation of an artificial neuron: inputs x0 = 1, x1, ..., xn with weights w0, w1, ..., wn, and output y = f(Σ_{i=0}^{n} wi xi).]
Remark
Eq. (9.1) represents the activation function of the ANN model shown in Figure 9.4.
[Figures: graphs of common activation functions plotted against x, including sign/step-type functions taking the values −1 and 1, and the linear function F(x) = mx + c.]
9.5 Perceptron
The perceptron is a special type of artificial neuron in which the activation function has a special form.
9.5.1 Definition
A perceptron is an artificial neuron in which the activation function is the threshold function.
Consider an artificial neuron having x1 , x2 , ⋯, xn as the input signals and w1 , w2 , ⋯, wn as the
associated weights. Let w0 be some constant. The neuron is called a perceptron if the output of the
neuron is given by the following function:
o(x1, x2, ..., xn) =  1  if w0 + w1 x1 + ⋯ + wn xn > 0
                     −1  if w0 + w1 x1 + ⋯ + wn xn ≤ 0
Figure 9.12 shows the schematic representation of a perceptron.
[Figure 9.12: Schematic representation of a perceptron. Inputs x0 = 1, x1, ..., xn with weights w0, w1, ..., wn feed a summation node computing Σ_{i=0}^{n} wi xi; the output is y = 1 if Σ_{i=0}^{n} wi xi > 0, and −1 otherwise.]
Remarks
1. The quantity −w0 can be looked upon as a “threshold” that should be crossed by the weighted
sum w1 x1 + ⋯ + wn xn in order for the neuron to output a “1”.
x1    x2    x1 AND x2
−1    −1    −1
−1     1    −1
 1    −1    −1
 1     1     1
A perceptron to compute x1 AND x2:
[Figure: a perceptron computing x1 AND x2, with inputs x0 = 1, x1, x2 and weights w0 = −0.8, w1 = 0.5, w2 = 0.5. The output is y = 1 if Σ_{i=0}^{2} wi xi > 0, and −1 otherwise.]
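The weights in this example can be checked mechanically against the truth table; a quick sketch using the values w0 = −0.8, w1 = w2 = 0.5 from above:

```python
def perceptron_and(x1, x2):
    # Weighted sum with bias input x0 = 1 and the weights from the example
    s = -0.8 + 0.5 * x1 + 0.5 * x2
    return 1 if s > 0 else -1

for x1 in (-1, 1):
    for x2 in (-1, 1):
        print(x1, x2, perceptron_and(x1, x2))
# Only the input (1, 1) yields +1, matching the AND truth table
```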
The problem of overfitting in neural networks is characterized
by the model performing well on the training data but failing to
generalize effectively to new, unseen data.
Causes of Overfitting in Neural Networks
Model Complexity:
Complex neural network architectures with a large number of parameters may lead to
overfitting, especially when the available training data is limited.
Insufficient Data:
If the size of the training dataset is small, neural networks might memorize specific
examples rather than learning the underlying patterns.
Inadequate Regularization:
Insufficient use of regularization techniques, such as weight decay, dropout, or batch
normalization, can contribute to overfitting.
Noisy Features:
Neural networks may overfit if they learn patterns from irrelevant or noisy features in the
training data.
Effects of Overfitting in Neural Networks
Accuracy Drop on Test Data:
The model performs well on the training set but has lower accuracy on the
validation or test set.
Increased Variance:
The model becomes sensitive to small fluctuations in the training data,
resulting in high variance.
Complex Decision Boundaries:
Overfit neural networks tend to create complex decision boundaries that
closely fit the training data but do not generalize well.
Methods to reduce Overfitting in Neural Networks
Regularization Techniques:
Use L1 or L2 regularization to penalize large weights in the model.
Apply dropout, a technique where random neurons are "dropped out" during training,
preventing over-reliance on specific neurons.
Early Stopping:
Monitor the model's performance on a validation set during training and stop training
when the performance on the validation set starts degrading.
Data Augmentation:
Increase the effective size of the training dataset by applying data augmentation
techniques, such as rotation, flipping, or scaling. This introduces diversity and helps the
model generalize better.
Simplifying Model Architecture:
Reduce the number of layers or neurons in the network to simplify the model and prevent overfitting.
Cross-Validation:
Use cross-validation to assess the model's performance on different subsets
of the data and detect overfitting.
Batch Normalization:
Apply batch normalization to normalize the inputs to each layer, which can
help stabilize and speed up training.
Ensemble Methods:
Combine predictions from multiple neural networks to reduce overfitting and
improve generalization.
Monitor Loss Curves:
Visualize and monitor the training and validation loss curves to identify signs
of overfitting, such as increasing validation loss.
Addressing overfitting in neural networks is an ongoing process that involves
experimentation and careful tuning of hyperparameters.
The choice of specific strategies depends on the characteristics of the
dataset and the architecture of the neural network being used.
Regularization, early stopping, and data augmentation are commonly
employed techniques to enhance the generalization ability of neural
networks.
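Early stopping, for instance, needs no framework support; a minimal sketch of the decision logic (the patience parameter and the loss values used below are illustrative assumptions):

```python
def early_stop_epoch(val_losses, patience=2):
    # Return the epoch with the best validation loss, stopping the scan
    # once the loss has failed to improve for `patience` epochs in a row.
    best_loss, best_epoch, bad_epochs = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # validation performance started degrading
    return best_epoch

# Validation loss falls, then rises: training should stop at epoch 2
print(early_stop_epoch([1.00, 0.80, 0.70, 0.75, 0.90, 1.10]))  # 2
```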
Vanishing Gradient Problem
In deep networks during backpropagation, as gradients are propagated
backward through layers, they can diminish to near-zero values.
This is often intensified by the use of activation functions with derivatives that
tend to be very small in certain regions (e.g., sigmoid or tanh functions).
Layers earlier in the network receive very small gradients, leading to
negligible updates to their weights.
These layers may essentially stop learning as their weights are no longer
being adjusted significantly.
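The effect is easy to quantify for the sigmoid, whose derivative never exceeds 0.25. A small sketch (the 10-layer depth is an arbitrary illustration):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # maximal value 0.25, attained at x = 0

# Even in the best case (every pre-activation exactly 0), the gradient
# factor contributed by 10 sigmoid layers shrinks geometrically:
factor = sigmoid_derivative(0.0) ** 10
print(factor)  # 0.25**10, roughly 9.5e-7
```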
Methods to reduce the vanishing gradient problem
Activation Functions: Replace sigmoid or tanh activations with non-saturating activation functions such as rectified linear units (ReLU), or variants like Leaky ReLU, to mitigate vanishing gradients.
Batch Normalization: Normalize inputs to each layer, helping stabilize and
propagate gradients.
Skip Connections: Implement skip connections or residual connections, as
seen in architectures like ResNet, to create shortcut paths for gradient flow.
Exploding Gradient Problem
Opposite to the vanishing gradient problem, exploding gradients occur when
gradients become extremely large during backpropagation.
This often happens when weights are initialized too large or when there is an
issue with the optimization process.
Gradients become so large that weight updates are excessively large, leading
to instability in the training process.
This can result in NaN (Not a Number) values, making the training process
diverge.
Methods to reduce the exploding gradient problem
Weight Initialization: Use proper weight initialization techniques, such
as Xavier/Glorot initialization, to control the scale of weights.
Gradient Clipping: Clip gradients during training to prevent them from
exceeding a certain threshold.
Learning Rate Scheduling: Adjust learning rates dynamically during
training to prevent abrupt updates.
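Gradient clipping by global norm, for example, can be sketched in a few lines (the max_norm of 1.0 is an illustrative choice):

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    # Rescale the whole gradient vector when its L2 norm exceeds max_norm,
    # preserving its direction while bounding the size of the update
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grads]
    return grads

print(clip_by_global_norm([3.0, 4.0]))  # norm 5.0 is rescaled down to norm 1.0
```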
Difficulties in Convergence
• Convergence difficulties in neural network training are common challenges that can hinder the model
from effectively learning from the training data. These difficulties may manifest in various ways during
the training process. Here are some common issues related to convergence in neural network training:
1. Slow Convergence
2. Noisy Training Curves
3. Plateauing or Diverging Loss
4. Vanishing or Exploding Gradients
5. Overfitting
6. Data Imbalance
7. Vanishing Learning Rate
• Addressing convergence difficulties often involves a combination of hyperparameter tuning, careful
preprocessing, and architectural adjustments. Experimenting with different configurations and
monitoring training progress can help diagnose and mitigate convergence issues in neural network
training.
Local Optima
• Local Optimum: A point in the solution space where the function has
a lower value than its immediate neighbors, but not necessarily the
absolute lowest value across the entire space.
• Challenges:
• If optimization algorithms get stuck in a local optimum, they might fail to find
the global optimum, leading to suboptimal solutions.
• In high-dimensional spaces, it is challenging to explore the entire space
thoroughly, making it more likely to converge to a local optimum.
Spurious Optima
• Spurious Optima: Points in the solution space where the gradient is
zero, but the point is not a true optimum. These points might occur due to
flat regions, saddle points, or other irregularities in the objective function.
• Spurious optima can arise in the presence of regions in the solution space where
the gradient is very small or zero.
• In high-dimensional spaces, saddle points, where some dimensions have an
increasing gradient and others have a decreasing gradient, can lead to spurious
optima.
Challenges: