
LecML-3: Neural Networks

The document outlines a course on Data Science, specifically focusing on Neural Networks, taught by Dr. Devesh Bhimsaria at IIT Roorkee. It covers the origins, architectures, learning processes, and common issues associated with neural networks, including problems like vanishing and exploding gradients, slow convergence, overfitting, and underfitting. Solutions to these challenges are also discussed, emphasizing the importance of proper initialization, choice of activation functions, and optimization techniques.


Data Science

DAI-101 Spring 2024-25

Dr. Devesh Bhimsaria


Office: F9, Old Building
Department of Biosciences and Bioengineering
Indian Institute of Technology–Roorkee
[email protected]
Neural Network



Slide credit: wiki
Neural Network
• Origins: Algorithms that try to mimic the brain.
• Very widely used in the 80s and early 90s; popularity diminished in the late 90s.
• Recent resurgence: State-of-the-art technique for many applications.
• Artificial neural networks are not nearly as complex or intricate as the actual brain structure.



Slide credit: Andrew Ng
Neural Network



Slide credit: Eric Eaton
Neuron Model: Logistic Unit

Do we really need an activation function??



Slide credit: Eric Eaton
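
The slide asks whether an activation function is really needed. As a hedged illustration (the weights, bias, and input below are made up, not taken from the slide), this minimal NumPy sketch implements a single logistic unit, ŷ = σ(wᵀx + b), and shows that without the nonlinearity a stack of layers collapses into a single linear map:

```python
import numpy as np

def sigmoid(z):
    # Logistic activation: squashes any real value into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def logistic_unit(x, w, b):
    # One neuron: weighted sum of the inputs followed by the sigmoid.
    return sigmoid(np.dot(w, x) + b)

# Hypothetical weights and input, purely for illustration.
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.8, 0.1, -0.4])
b = 0.2
print(logistic_unit(x, w, b))          # a value in (0, 1)

# Why the nonlinearity matters: two purely linear layers
# W2 @ (W1 @ x) equal one linear layer (W2 @ W1) @ x,
# so without an activation, depth adds no expressive power.
W1 = np.random.randn(4, 3)
W2 = np.random.randn(2, 4)
print(np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x))  # True
```
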
Neural Network



Feed-Forward Process



Slide credit: Eric Eaton
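
A minimal sketch of the feed-forward process for a fully connected network, assuming sigmoid activations and a hypothetical 3-4-2 architecture (layer sizes and parameters are illustrative, not taken from the slide's figure): each layer computes a⁽ˡ⁺¹⁾ = σ(W⁽ˡ⁾a⁽ˡ⁾ + b⁽ˡ⁾).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feed_forward(x, weights, biases):
    """Propagate an input through each layer in turn."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)   # z = W a + b, then the activation
    return a

# Hypothetical 3-4-2 network with random parameters.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases  = [np.zeros(4), np.zeros(2)]

x = np.array([1.0, 0.5, -0.5])
print(feed_forward(x, weights, biases))  # two output activations
```
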
Other Network Architectures



Slide credit: Eric Eaton
Multiple Output Units: One-vs-Rest



Neural Network Classification

Equivalent to the dimensions of 𝑥



Slide credit: Andrew Ng
Neural Network Examples
Representing Boolean Functions



Slide credit: Eric Eaton
Representing Boolean Functions



Slide credit: Eric Eaton
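
The standard example behind these slides is that a single logistic unit with large weights can act as a Boolean gate; the parameters below (bias −30 with weights 20, 20 for AND, bias −10 for OR) are the usual illustrative choice and may differ from the slide's figures.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gate(x1, x2, b, w1, w2):
    # A single neuron acting on two binary inputs.
    return sigmoid(b + w1 * x1 + w2 * x2)

for x1 in (0, 1):
    for x2 in (0, 1):
        and_out = logistic_gate(x1, x2, b=-30, w1=20, w2=20)  # ~ x1 AND x2
        or_out  = logistic_gate(x1, x2, b=-10, w1=20, w2=20)  # ~ x1 OR x2
        print(x1, x2, round(and_out, 3), round(or_out, 3))
```

XOR/XNOR, by contrast, is not linearly separable and cannot be computed by a single such unit, which is presumably why the next slide combines these units through a hidden layer.
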
Combining Representations to Create Non-Linear Functions



Slide credit: Eric Eaton
Layering Representations



Slide credit: Eric Eaton
Layering Representations



Slide credit: Eric Eaton
Neural Network Learning
Perceptron Learning Rule



Slide credit: Eric Eaton
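
The slide's figure is not reproduced here; as a hedged sketch of the perceptron learning rule, the update is w ← w + η (y − ŷ) x, applied whenever a prediction is wrong. The toy data set below is hypothetical.

```python
import numpy as np

def perceptron_train(X, y, lr=0.1, epochs=20):
    """Classic perceptron rule: w += lr * (y - y_hat) * x, with the bias folded in."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append a constant bias feature
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            y_hat = 1 if xi @ w >= 0 else 0     # threshold activation
            w += lr * (yi - y_hat) * xi         # update only on mistakes
    return w

# Hypothetical linearly separable toy data: class 1 roughly when x1 + x2 > 1.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 2]], dtype=float)
y = np.array([0, 0, 0, 1, 1])
w = perceptron_train(X, y)
print(w)   # learned weights (last entry is the bias)
```
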
Neural Network



Learning in NN: Backpropagation

• Using mean squared error (as in linear regression) is not ideal for logistic regression, because the resulting cost is non-convex and converges poorly.
• The logarithm in the cost comes from the maximum-likelihood criterion (a short numerical comparison follows this slide).


Slide credit: Eric Eaton
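
A hedged numerical sketch of the point above (variable names are mine, not the slide's): the cross-entropy cost that maximum likelihood yields for a sigmoid output, J = −(1/m) Σ [y log ŷ + (1 − y) log(1 − ŷ)], penalizes confident mistakes far more strongly than squared error, and, unlike MSE composed with a sigmoid, is convex for a single logistic unit.

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    # Negative log-likelihood of Bernoulli labels under predictions y_hat.
    y_hat = np.clip(y_hat, eps, 1 - eps)        # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

# Confidently wrong predictions are penalised much more by cross-entropy.
y     = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.01, 0.99, 0.9])   # the first two are badly wrong
print("cross-entropy:", cross_entropy(y, y_hat))   # ~3.1
print("mse          :", mse(y, y_hat))             # ~0.66
```
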
Cost Function



Slide credit: Eric Eaton
Forward Propagation

Do we really need an activation function??



Slide credit: Eric Eaton
Backpropagation Intuition



Slide credit: Eric Eaton
Backpropagation Intuition



Slide credit: Eric Eaton
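
Since the intuition slides are mostly figures, here is a minimal, hedged sketch of one backpropagation step for a single hidden layer with sigmoid units and cross-entropy loss (the layer sizes, names, and learning rate are my own choices, not the slides'): the output error δ₂ = ŷ − y is pushed back through the weights as δ₁ = (W₂ᵀ δ₂) ⊙ a₁ ⊙ (1 − a₁).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, W1, b1, W2, b2, lr=0.5):
    # Forward pass
    a1 = sigmoid(W1 @ x + b1)          # hidden activations
    y_hat = sigmoid(W2 @ a1 + b2)      # output activation

    # Backward pass (cross-entropy loss with a sigmoid output)
    delta2 = y_hat - y                           # error at the output layer
    delta1 = (W2.T @ delta2) * a1 * (1 - a1)     # error pushed back to the hidden layer

    # Gradient-descent updates (in place)
    W2 -= lr * np.outer(delta2, a1); b2 -= lr * delta2
    W1 -= lr * np.outer(delta1, x);  b1 -= lr * delta1
    return y_hat

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
x, y = np.array([0.2, 0.7]), np.array([1.0])

for _ in range(100):
    y_hat = backprop_step(x, y, W1, b1, W2, b2)
print(y_hat)   # moves toward the target of 1.0
```
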
Backpropagation Issues
• “Backprop is the cockroach of machine learning. It’s ugly, and annoying, but you just can’t get rid of it.” - Geoff Hinton
• Problems
• Vanishing Gradient Problem
• Exploding Gradient Problem
• Slow Convergence
• Getting Stuck in Local Minima
• Overfitting (High Variance)
• Underfitting (High Bias)



Vanishing Gradient Problem
What?
• In deep networks, gradients become very small as they propagate backward.
• Weight updates shrink, causing earlier layers to learn very slowly or not at all.
• This leads to slow convergence, or the network stalling entirely.

Why?
• Activation functions like sigmoid and tanh squash values into small ranges.
• Their derivatives are also small (<= 0.25 for sigmoid), leading to small gradients.
• This shrinks the gradient exponentially as it moves backward.

Solutions:
• Use ReLU (Rectified Linear Unit), which has a gradient of 1 for positive inputs (compared numerically with the sigmoid in the sketch after this slide).
• Use Batch Normalization to stabilize activations.
• Use better weight initialization: Xavier initialization (for sigmoid/tanh) keeps the variance of outputs similar across layers; He initialization (for ReLU) accounts for ReLU's non-zero mean and asymmetry.



Slide credit: Online
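
A small numerical sketch of the shrinkage described above (the depth of 20 is chosen arbitrarily): backpropagating through many sigmoid layers multiplies local derivatives of at most 0.25, while ReLU contributes a derivative of 1 on its active side.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

depth = 20
z = 0.0                                      # pre-activation where the sigmoid slope is largest
sig_grad  = sigmoid(z) * (1 - sigmoid(z))    # = 0.25, the sigmoid's maximum slope
relu_grad = 1.0                              # ReLU derivative for positive inputs

# Backpropagating through `depth` layers multiplies these local slopes together
# (weight factors are ignored here for simplicity).
print("sigmoid chain:", sig_grad ** depth)   # ~9.1e-13: effectively vanished
print("relu chain   :", relu_grad ** depth)  # 1.0: the signal is preserved
```
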
Exploding Gradient Problem
What?
• Gradients become too large and cause unstable training.
• Weight updates explode, leading to NaN or large numbers.

Why?
• Happens in very deep networks with large weight updates.
• Poor weight initialization or using large learning rates.

Solutions:
• Use Gradient Clipping (cap gradients at a threshold; a short sketch follows this slide).
• Use Xavier/He weight initialization.
• Reduce the learning rate.
Slide credit: Online
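
A minimal sketch of gradient clipping by global norm, one common way to implement the "cap gradients at a threshold" idea above (the threshold and gradient values are hypothetical):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm is <= max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

# Hypothetical exploding gradients for two parameter tensors.
grads = [np.array([300.0, -400.0]), np.array([[1200.0]])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
print("norm before:", norm_before)                                   # 1300.0
print("norm after :", np.sqrt(sum(np.sum(g ** 2) for g in clipped)))  # 5.0
```
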
Slow Convergence
What?
• Training takes too long due to small weight updates.
• Learning stagnates, especially in deep networks.

Why?


• Poor weight initialization.
• Poor choice of activation function (e.g., sigmoid in deep networks).
• Inefficient optimizer (e.g., using plain SGD instead of adaptive optimizers).
Note: Stochastic GD uses just one random data point (or mini-batch) per update instead of computing the gradient over the entire dataset.

Solutions:
• Use optimizers like Adam, RMSprop, or Momentum (a momentum-vs-plain-SGD sketch follows this slide).
• Use a learning rate scheduler (reduce the learning rate when needed).
• Use Batch Normalization to speed up training.



Slide credit: Online
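
As a hedged illustration of why momentum-style optimizers converge faster than plain SGD, the sketch below compares the two on a deliberately ill-conditioned quadratic (the function, learning rate, and β are made up for illustration; Adam and RMSprop add per-parameter adaptive scaling on top of similar ideas).

```python
import numpy as np

# Ill-conditioned quadratic: J(w) = 0.5 * (100*w1**2 + w2**2).
grad = lambda w: np.array([100.0 * w[0], 1.0 * w[1]])

def sgd(w, lr=0.009, steps=200):
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def sgd_momentum(w, lr=0.009, beta=0.9, steps=200):
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v + grad(w)   # accumulate a velocity
        w = w - lr * v
    return w

w0 = np.array([1.0, 1.0])
print("plain SGD     :", sgd(w0))           # the w2 coordinate has only decayed to ~0.16
print("SGD + momentum:", sgd_momentum(w0))  # both coordinates are now close to zero
```
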
Getting Stuck in Local Minima
Why?
• Non-convex loss functions (common in deep networks).
• Poor initialization of weights.
• Using SGD without momentum makes it harder to escape.

Solutions:
• Use optimizers like Adam, Momentum, or RMSprop, which
help escape local minima.
• Increase network capacity (more neurons/layers) to create
a smoother loss surface.
• Train longer or use learning rate annealing (manually reducing the learning rate over time).



Slide credit: Online
Overfitting (High Variance)
What?
• Performs well on training data but poorly on test data.

Why?
• Too many parameters compared to the training data.
• No regularization used (e.g., dropout, L2 regularization).
• Too many training epochs.

Solutions:
• Use Dropout (randomly disable neurons during training).
• Use L2 Regularization (Weight Decay) to prevent large weights; both techniques are sketched after this slide.
• Increase training data or use data augmentation.



Slide credit: Online
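
A hedged, framework-free sketch of the two regularizers named above (the keep probability and decay constant are arbitrary): inverted dropout zeroes a random subset of activations during training and rescales the rest, while L2 regularization (weight decay) adds (λ/2)‖w‖² to the cost, so every update also shrinks the weights toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(a, keep_prob=0.8, training=True):
    """Inverted dropout: zero a fraction of activations and rescale the survivors."""
    if not training:
        return a                      # no dropout at test time
    mask = rng.random(a.shape) < keep_prob
    return a * mask / keep_prob       # rescale so the expected activation is unchanged

def l2_weight_update(w, grad_w, lr=0.1, lam=0.01):
    """Gradient step on J + (lam/2)*||w||^2, i.e. the gradient plus lam*w (weight decay)."""
    return w - lr * (grad_w + lam * w)

a = np.ones(10)
print(dropout(a))                     # dropped entries become 0, survivors become 1/0.8 = 1.25

w, grad_w = np.array([2.0, -3.0]), np.array([0.5, 0.5])
print(l2_weight_update(w, grad_w))    # the lam*w term pulls each weight toward zero on top of the data gradient
```
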
Underfitting (High Bias)
What?
• The model is too simple and can’t capture the data’s complexity.
• Both training and test performance are poor.

Why?
• The network is too shallow.
• Not enough neurons in hidden layers.
• Regularization that is too strong, making the model too restrictive.

Solutions:
• Increase the number of layers or neurons in hidden layers.
• Reduce regularization strength (e.g., lower L2 penalty).
• Train for more epochs.



Slide credit: Online
Thank You
• Thanks to those who made their material available online. I have tried to acknowledge the person or website from which material was taken.
• All my slides/notes excluding third-party material are licensed by various authors including myself under https://creativecommons.org/licenses/by-nc/4.0/

