2. Neural Networks
The Perceptron
Multilayer Perceptron
Regression and Classification
Regularization
(Stochastic) Gradient Descent
Backpropagation
Curse of Dimensionality
The curse of dimensionality refers to a set of issues that arise when the dimensionality
of the data is high.
In linear regression and classification, one issue is that we usually need a number of
features that is exponential w.r.t. the number of dimensions of the input $\mathbf{x}$. As the number of
features grows, inverting the matrix $\boldsymbol{\Phi}^\top \boldsymbol{\Phi}$ in the closed-form solution becomes more costly. What's more, the number of
parameters may eventually become greater than the number of samples (leading to
overfitting).
Curse of Dimensionality
With tile coding (and RBFs) we need exponentially many features to cover the space evenly.
The same is true for other feature functions too.
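For instance (the numbers are chosen only for illustration), with $k$ tiles per dimension, covering a $d$-dimensional input space uniformly requires
$$\underbrace{k \times k \times \cdots \times k}_{d \text{ times}} = k^{d} \quad \text{features, e.g., } k = 10,\ d = 6 \ \Rightarrow\ 10^{6} \text{ features.}$$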
Curse of Dimensionality
One central idea in machine learning (and deep learning) is that the data lies on a low-
dimensional manifold. Neural networks are more powerful than linear models as they
automatically extract useful features from the data, largely alleviating the problem of
choosing good features for high-dimensional data.
Neural networks work with high-dimensional data as input, e.g., high-resolution images and
text. The price to pay is that they are more complex, don't have analytical solutions, and
have many more hyper-parameters (hyper-parameters: parameters that we need to choose
prior to model training). This makes it more difficult to select good models.
Perceptron
The perceptron is a simple nonlinear model used for classification,
$$\hat{y} = f(\mathbf{w}^\top \mathbf{x} + b),$$
where $\mathbf{w}$ is the vector of weights, $b$ is the bias, and $f$ is the activation function, i.e.,
$$f(a) = \begin{cases} +1 & \text{if } a \ge 0, \\ -1 & \text{otherwise.} \end{cases}$$
The weights are learned with the perceptron update rule
$$\mathbf{w} \leftarrow \mathbf{w} + \eta\,(y_i - \hat{y}_i)\,\mathbf{x}_i, \qquad b \leftarrow b + \eta\,(y_i - \hat{y}_i).$$
The algorithm above works similarly to a gradient ascent algorithm, where $\eta$ is the learning rate.
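As a minimal sketch of this update rule in NumPy (variable names, the $\{-1,+1\}$ label convention, and the toy dataset are illustrative assumptions):

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=100):
    """Perceptron rule: w <- w + lr * (y_i - y_hat_i) * x_i (updates only on mistakes)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat = 1.0 if w @ x_i + b >= 0 else -1.0   # step activation
            w += lr * (y_i - y_hat) * x_i
            b += lr * (y_i - y_hat)
    return w, b

# Toy linearly separable problem (logical AND with labels in {-1, +1})
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1., -1., -1., 1.])
w, b = train_perceptron(X, y)
print(np.where(X @ w + b >= 0, 1.0, -1.0))  # matches y
```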
XOR Problem
The perceptron can only separate the data with a line (a hyperplane in higher dimensions). However, the data may not be
linearly separable, and thus the desired function cannot be learned.
Multilayer Perceptron (MLP)
A solution to the XOR problem is to use two layers of perceptrons,
$$\hat{y} = f_2\big(\mathbf{W}_2\, f_1(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2\big),$$
where both $f_1$ and $f_2$ are activation functions. The idea is to have a non-linear layer that can
work with non-linearly separable data.
The non-linear layer is also called the hidden layer. The number of rows of $\mathbf{W}_1$
(or the number of elements of $\mathbf{b}_1$) determines the number of neurons of the hidden layer.
It is possible to show that a model with one linear layer followed by a non-linear activation
function (and a linear output layer) is a universal function approximator (i.e., it can
approximate any continuous function to arbitrarily high accuracy given enough hidden neurons).
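As a sketch of this idea, the hand-picked weights below (chosen for illustration, not taken from the lecture) let a tiny two-layer network with a step non-linearity compute XOR:

```python
import numpy as np

def step(a):
    """Heaviside step activation: 1 if a >= 0 else 0."""
    return (a >= 0).astype(float)

# Hidden layer: two neurons computing OR(x1, x2) and NAND(x1, x2)
W1 = np.array([[ 1.0,  1.0],    # OR:   x1 + x2 - 0.5 >= 0
               [-1.0, -1.0]])   # NAND: -x1 - x2 + 1.5 >= 0
b1 = np.array([-0.5, 1.5])

# Output layer: AND of the two hidden neurons gives XOR
W2 = np.array([[1.0, 1.0]])
b2 = np.array([-1.5])

def mlp(x):
    h = step(W1 @ x + b1)        # hidden (non-linear) layer
    return step(W2 @ h + b2)     # output layer

for x in [np.array([0., 0.]), np.array([0., 1.]),
          np.array([1., 0.]), np.array([1., 1.])]:
    print(x, "->", mlp(x))       # prints 0, 1, 1, 0 (XOR)
```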
For simplicity, we will refer to a neural network as a parametric function that depends on
the input $\mathbf{x}$ and on a set of parameters $\boldsymbol{\theta}$, i.e.,
$$\hat{y} = f(\mathbf{x}; \boldsymbol{\theta}),$$
composed of a stack of fully connected layers where each layer feeds only into the next.
More generic feedforward neural networks can break these assumptions. For example,
convolutional layers are not fully connected and residual neural networks have
connections between non-consecutive layers. In general, feedforward neural networks
may be arbitrarily complex. Their only constraint is that their computational graph is a directed acyclic graph.
Hence,
$$\boldsymbol{\theta}_{\mathrm{ML}} = \arg\min_{\boldsymbol{\theta}} \sum_{i=1}^{N} \big\lVert y_i - f(\mathbf{x}_i; \boldsymbol{\theta}) \big\rVert^{2}.$$
Here $\lVert \cdot \rVert$ is an L2 norm.
ML Regression with NN
In this case, the maximum likelihood problem does not have a closed-form solution. As we will see,
we will need to use gradient descent to minimize the (mean) squared error.
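A minimal sketch of the (mean) squared error that gradient descent will minimize (names and numbers are illustrative):

```python
import numpy as np

def mse_loss(y_pred, y_true):
    """Mean squared error between network predictions and regression targets."""
    return np.mean((y_pred - y_true) ** 2)

# Example: predictions from some network f(x; theta) vs. targets
y_pred = np.array([0.9, 2.1, 2.8])
y_true = np.array([1.0, 2.0, 3.0])
print(mse_loss(y_pred, y_true))  # 0.02
```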
ML Classification with NN
To perform classification with neural networks, we set the output activation function to be the sigmoid function for two
classes or the softmax function for more than two classes. The outputs of these functions can be
interpreted as the parameters of a Bernoulli or a categorical distribution over the output.
The maximum likelihood solution is the one that minimizes the cross entropy between the
model prediction $\hat{y}_{ik}$ and the target $y_{ik}$ for each sample $i$ and class $k$. Then the maximum
likelihood solution is
$$\boldsymbol{\theta}_{\mathrm{ML}} = \arg\min_{\boldsymbol{\theta}} \; -\sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log \hat{y}_{ik}.$$
Once again, this problem does not have a closed-form solution.
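A minimal sketch of the multi-class case with softmax outputs (function names and the toy logits are illustrative assumptions):

```python
import numpy as np

def softmax(a):
    """Softmax over the last axis, with the usual max-shift for numerical stability."""
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(y_prob, y_onehot, eps=1e-12):
    """Cross entropy -sum_k y_ik log(yhat_ik), averaged over samples
    (the ML objective sums over samples; the minimizer is the same)."""
    return -np.mean(np.sum(y_onehot * np.log(y_prob + eps), axis=-1))

# Example: 2 samples, 3 classes
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2,  3.0]])
targets = np.array([[1, 0, 0],
                    [0, 0, 1]], dtype=float)
print(cross_entropy(softmax(logits), targets))
```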
Regularization
We have seen that maximum a posteriori estimation introduces regularization, i.e.,
$$\boldsymbol{\theta}_{\mathrm{MAP}} = \arg\max_{\boldsymbol{\theta}} \sum_{i=1}^{N} \log p(y_i \mid \mathbf{x}_i, \boldsymbol{\theta}) + \log p(\boldsymbol{\theta}),$$
where the prior $p(\boldsymbol{\theta})$ determines the additional regularization term added to the loss function.
Regularization
Different priors will result in different regularization terms.
(Centered Diagonal) Gaussian:
$$p(\boldsymbol{\theta}) = \mathcal{N}(\boldsymbol{\theta} \mid \mathbf{0}, \sigma^{2}\mathbf{I}) \;\;\Rightarrow\;\; -\log p(\boldsymbol{\theta}) = \lambda \lVert \boldsymbol{\theta} \rVert_2^{2} + \text{const.}$$
(Centered Diagonal) Laplace:
$$p(\boldsymbol{\theta}) = \prod_j \mathrm{Laplace}(\theta_j \mid 0, b) \;\;\Rightarrow\;\; -\log p(\boldsymbol{\theta}) = \lambda \lVert \boldsymbol{\theta} \rVert_1 + \text{const.}$$
A Gaussian prior tends to make all weights small, while a Laplace prior tends to sparsify the
weights (many will be exactly zero).
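A minimal sketch of the two resulting penalty terms (the weight `lam` and the numbers are illustrative assumptions):

```python
import numpy as np

def l2_penalty(theta, lam):
    """Gaussian prior -> L2 (weight decay) regularization."""
    return lam * np.sum(theta ** 2)

def l1_penalty(theta, lam):
    """Laplace prior -> L1 regularization, encourages sparse weights."""
    return lam * np.sum(np.abs(theta))

theta = np.array([0.5, -2.0, 0.0, 1.5])
loss = 0.42  # e.g., an MSE or cross-entropy value
print(loss + l2_penalty(theta, lam=0.01))
print(loss + l1_penalty(theta, lam=0.01))
```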
Bayesian Neural Networks*
We have seen that neural networks can be used for maximum likelihood and maximum a
posteriori estimation. What about Bayesian estimation?
Bayesian Neural Networks are neural networks that use Bayesian inference to
estimate the uncertainty associated with their predictions.
The core idea is to sample the weights of the neural network given the data (i.e.,
$\boldsymbol{\theta} \sim p(\boldsymbol{\theta} \mid \mathcal{D})$) to obtain a distribution of estimators.
They are useful when dealing with small datasets or when the data is noisy or
incomplete.
Bayesian Neural Networks have been used in a variety of applications, including
image classification, speech recognition, and natural language processing.
Training Bayesian Neural Networks is non-trivial, and we will not study them here.
Gradient Descent
Gradient descent is an algorithm that numerically finds the minimum of a function by
repeatedly moving in the direction of steepest descent, i.e.,
$$\mathbf{x}_{t+1} = \mathbf{x}_t - \eta \,\nabla_{\mathbf{x}} f(\mathbf{x}_t),$$
where $\eta > 0$ is the learning rate (step size).
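A minimal sketch of the update rule on a simple quadratic (the function, step size, and iteration count are illustrative choices):

```python
import numpy as np

def grad_f(x):
    """Gradient of f(x) = ||x - 1||^2, whose minimum is at x = (1, 1)."""
    return 2.0 * (x - 1.0)

x = np.array([5.0, -3.0])    # initial point
eta = 0.1                    # learning rate
for _ in range(100):
    x = x - eta * grad_f(x)  # steepest-descent step
print(x)                     # close to [1., 1.]
```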
Jacobian Matrix
The Jacobian is a matrix of first-order partial derivatives of a vector-valued function.
For $f : \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian matrix at $\mathbf{x}$ is
$$\mathbf{J}_f(\mathbf{x}) = \frac{\partial f}{\partial \mathbf{x}} = \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f_m}{\partial x_1} & \cdots & \dfrac{\partial f_m}{\partial x_n} \end{bmatrix}.$$
Jacobian Matrix
Using the previous definition, if $f : \mathbb{R}^n \to \mathbb{R}$ (i.e., $m = 1$) then $\mathbf{J}_f(\mathbf{x})$ is a row vector, hence,
$$\mathbf{J}_f(\mathbf{x}) = \nabla_{\mathbf{x}} f(\mathbf{x})^\top.$$
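As a concrete illustrative example (not from the lecture), for
$$f(x, y) = \begin{pmatrix} x^{2} y \\ 5x + \sin y \end{pmatrix}, \qquad \mathbf{J}_f(x, y) = \begin{bmatrix} 2xy & x^{2} \\ 5 & \cos y \end{bmatrix},$$
a $2 \times 2$ matrix, since $f : \mathbb{R}^2 \to \mathbb{R}^2$.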
Important: The differential of $f$ w.r.t. $\mathbf{w}$ can be written both with explicit
parametrization, $\frac{\partial f(\mathbf{x}; \mathbf{w})}{\partial \mathbf{w}}$, and with the shorthand $\frac{\partial f}{\partial \mathbf{w}}$.
Gradient Descent
Usually, we need to minimize a loss function (e.g., MSE, cross entropy, ...) w.r.t. the neural
network's parameters, i.e.,
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \,\nabla_{\boldsymbol{\theta}} L(\boldsymbol{\theta}_t),$$
where $L(\boldsymbol{\theta}) = \frac{1}{N}\sum_{i=1}^{N} \ell\big(f(\mathbf{x}_i; \boldsymbol{\theta}), y_i\big)$.
Stochastic gradient descent (SGD) approximates this gradient using a single randomly selected sample per step, which is much cheaper when $N$ is large. SGD also helps escape local minima thanks to the high variance of its gradient estimates.
Question: Is stochastic gradient descent an unbiased gradient estimator?
Minibatch Stochastic Gradient Descent
Minibatch SGD is an iterative method for optimizing a function by randomly selecting a
few data points at a time and moving in the direction of steepest descent computed on those points:
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta\, \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \nabla_{\boldsymbol{\theta}}\, \ell\big(f(\mathbf{x}_i; \boldsymbol{\theta}_t), y_i\big),$$
where $\mathcal{B}$ is a randomly sampled minibatch of data points.
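A minimal sketch of a minibatch SGD loop (the least-squares example, `grad_fn` interface, and hyper-parameters are illustrative assumptions):

```python
import numpy as np

def minibatch_sgd(grad_fn, theta, X, y, lr=0.01, batch_size=32, epochs=100, seed=0):
    """grad_fn(theta, X_batch, y_batch) must return the gradient of the
    average loss over the minibatch."""
    rng = np.random.default_rng(seed)
    n = len(X)
    for _ in range(epochs):
        idx = rng.permutation(n)                    # reshuffle the data each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]   # indices of the minibatch
            theta = theta - lr * grad_fn(theta, X[batch], y[batch])
    return theta

# Example: linear least squares, gradient of mean((X theta - y)^2)
grad = lambda th, Xb, yb: 2.0 * Xb.T @ (Xb @ th - yb) / len(Xb)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true
print(minibatch_sgd(grad, np.zeros(3), X, y, lr=0.05))  # close to theta_true
```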
Backpropagation
Backpropagation is an algorithm for training feedforward neural networks using
gradient descent.
It computes the gradient of the loss function with respect to the network weights by
applying the chain rule from the output layer to the input layer.
It updates the network weights in the direction that minimizes the loss function using
the computed gradient.
It consists of two phases: a forward pass for computing the neural network's
prediction and a backward pass for computing the gradient.
Figure: An illustration of backpropagation. Source: https://en.wikipedia.org/wiki/Backpropagation
Derivation of Backpropagation
Let's set the notation first.
(1) we consider a single sample $(\mathbf{x}, y)$;
(5) the values before the activation functions (the pre-activations) are $\mathbf{a}_l = \mathbf{W}_l \mathbf{z}_{l-1}$ (bias omitted), with activations $\mathbf{z}_l = f_l(\mathbf{a}_l)$;
(6) to make the notation more general, we set $\mathbf{z}_0 = \mathbf{x}$.
Objective: We want to compute the gradient of the loss function $\ell$ w.r.t. the $j$-th row of the weight
matrix $\mathbf{W}_l$, which we indicate with the column vector $\mathbf{w}_{l,j}$, i.e., $\nabla_{\mathbf{w}_{l,j}} \ell$.
Derivation of Backpropagation
We continue to expand the terms using the chain rule until we reach layer $l$,
$$\frac{\partial \ell}{\partial \mathbf{w}_{l,j}} = \frac{\partial \ell}{\partial \mathbf{z}_L}\,\frac{\partial \mathbf{z}_L}{\partial \mathbf{a}_L}\,\frac{\partial \mathbf{a}_L}{\partial \mathbf{z}_{L-1}} \cdots \frac{\partial \mathbf{z}_l}{\partial \mathbf{a}_l}\,\frac{\partial \mathbf{a}_l}{\partial \mathbf{w}_{l,j}},$$
and we notice that only the $j$-th component of $\mathbf{a}_l$ depends on $\mathbf{w}_{l,j}$.
Derivation of Backpropagation
The reason for this can be seen in the following graphical illustration of $\mathbf{a}_l = \mathbf{W}_l \mathbf{z}_{l-1}$:
Derivation of Backpropagation
This means that $\frac{\partial \mathbf{a}_l}{\partial \mathbf{w}_{l,j}}$ is a matrix whose $j$-th row is $\mathbf{z}_{l-1}^\top$ and whose remaining rows are zero, i.e.,
$$\frac{\partial a_{l,k}}{\partial \mathbf{w}_{l,j}} = \begin{cases} \mathbf{z}_{l-1}^\top & \text{if } k = j, \\ \mathbf{0}^\top & \text{otherwise.} \end{cases}$$
E.g., for $j = 1$ only the first row of $\frac{\partial \mathbf{a}_l}{\partial \mathbf{w}_{l,1}}$ is non-zero.
Derivation of Backpropagation
Which, back in gradient notation, is
$$\nabla_{\mathbf{w}_{l,j}} \ell = \left( \frac{\partial \ell}{\partial \mathbf{z}_L}\,\frac{\partial \mathbf{z}_L}{\partial \mathbf{a}_L}\,\frac{\partial \mathbf{a}_L}{\partial \mathbf{z}_{L-1}} \cdots \frac{\partial \mathbf{z}_l}{\partial \mathbf{a}_l} \right)_{\!j} \mathbf{z}_{l-1}.$$
At this point, realize that many Jacobians are shared across different layers, i.e., the Jacobians of the layers above $l$ appear in the gradient of layer $l$ and in the gradients of all the layers below it.
Derivation of Backpropagation
It is also possible to see that all the sequences of Jacobians can be computed recursively. Defining the propagation error $\boldsymbol{\delta}_l^\top = \frac{\partial \ell}{\partial \mathbf{a}_l}$, we have
$$\boldsymbol{\delta}_L^\top = \frac{\partial \ell}{\partial \mathbf{z}_L}\,\frac{\partial \mathbf{z}_L}{\partial \mathbf{a}_L}$$
and
$$\boldsymbol{\delta}_l^\top = \boldsymbol{\delta}_{l+1}^\top\,\frac{\partial \mathbf{a}_{l+1}}{\partial \mathbf{z}_l}\,\frac{\partial \mathbf{z}_l}{\partial \mathbf{a}_l} = \boldsymbol{\delta}_{l+1}^\top\,\mathbf{W}_{l+1}\,\operatorname{diag}\!\big(f_l'(\mathbf{a}_l)\big).$$
The propagation error $\boldsymbol{\delta}_l$ quantifies the impact of the neurons of layer $l$ on the output.
Derivation of Backpropagation
We can now rewrite the gradients in terms of the propagation error. Notice that the components $\delta_{l,j} = \frac{\partial \ell}{\partial a_{l,j}}$
are scalars, so
$$\nabla_{\mathbf{w}_{l,j}} \ell = \delta_{l,j}\, \mathbf{z}_{l-1}.$$
E.g., for the last layer, $\nabla_{\mathbf{w}_{L,j}} \ell = \delta_{L,j}\, \mathbf{z}_{L-1}$.
Derivation of Backpropagation
Since $\boldsymbol{\delta}_l$ depends on $\boldsymbol{\delta}_{l+1}$, we can start the computation from the last layer and
propagate the error backwards, i.e.,
$$\nabla_{\mathbf{w}_{L,j}} \ell = \delta_{L,j}\, \mathbf{z}_{L-1} \qquad \text{(Gradient Computation)}$$
$$\boldsymbol{\delta}_l^\top = \boldsymbol{\delta}_{l+1}^\top\,\mathbf{W}_{l+1}\,\operatorname{diag}\!\big(f_l'(\mathbf{a}_l)\big) \qquad \text{(Error Propagation)}$$
$$\nabla_{\mathbf{w}_{l,j}} \ell = \delta_{l,j}\, \mathbf{z}_{l-1} \qquad \text{(Gradient Computation)}$$
Backpropagation Algorithm
(1) Compute the forward pass, i.e., compute $\mathbf{a}_l$ and $\mathbf{z}_l$ for every layer, propagating the input forward through
the network.
(3) For $l = L$ down to $1$ do
(1) For every row $j$, compute the gradient $\nabla_{\mathbf{w}_{l,j}} \ell = \delta_{l,j}\, \mathbf{z}_{l-1}$. Store the results.
(4) Compose back all the gradient terms and return them.
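A minimal sketch of the algorithm for a two-layer network with a sigmoid hidden layer, a linear output, and a squared-error loss; the architecture, names, and omission of biases are illustrative assumptions:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop(x, y, W1, W2):
    """Returns dL/dW1 and dL/dW2 for L = 0.5 * (y_hat - y)^2."""
    # Forward pass: store pre-activations a_l and activations z_l
    a1 = W1 @ x                                 # hidden pre-activation
    z1 = sigmoid(a1)                            # hidden activation
    y_hat = W2 @ z1                             # linear output layer

    # Backward pass: propagate the error delta_l from the output to the input
    delta2 = y_hat - y                          # dL/da2
    delta1 = (W2.T @ delta2) * z1 * (1 - z1)    # dL/da1 = W2^T delta2 * f'(a1)

    # Gradient computation: dL/dW_l = delta_l z_{l-1}^T (rows are delta_{l,j} z_{l-1})
    grad_W2 = np.outer(delta2, z1)
    grad_W1 = np.outer(delta1, x)
    return grad_W1, grad_W2

# Usage: one gradient-descent step on a single sample
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3)) * 0.1
W2 = rng.normal(size=(1, 4)) * 0.1
x = np.array([0.5, -1.0, 2.0])
y = np.array([1.0])
g1, g2 = backprop(x, y, W1, W2)
W1 -= 0.1 * g1
W2 -= 0.1 * g2
```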
When this problem (vanishing gradients) appears with ReLU activations, it is even more critical, since in their flat region ReLUs
have a gradient exactly equal to zero and therefore do not propagate gradients at all. We call this
problem the dying ReLU problem. A solution is to use the "Leaky ReLU" and a good random
initialization that avoids low-gradient regions.
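A minimal sketch contrasting the two activations and their gradients (the slope 0.01 is a common but arbitrary choice):

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def relu_grad(a):
    return (a > 0).astype(float)          # exactly zero for a <= 0 ("dying ReLU")

def leaky_relu(a, slope=0.01):
    return np.where(a > 0, a, slope * a)

def leaky_relu_grad(a, slope=0.01):
    return np.where(a > 0, 1.0, slope)    # small but non-zero gradient for a <= 0

a = np.array([-2.0, -0.1, 0.5, 3.0])
print(relu_grad(a))        # [0. 0. 1. 1.]  -> no gradient flows through negative inputs
print(leaky_relu_grad(a))  # [0.01 0.01 1. 1.]
```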