Neural Networks

The document provides an overview of neural networks, including their structure, training methods, and challenges such as the curse of dimensionality and issues like dying neurons and vanishing gradients. It discusses the perceptron and multilayer perceptron (MLP) as solutions to classification problems, along with techniques such as gradient descent and backpropagation for training. Additionally, it highlights the importance of activation functions and regularization in improving model performance.

Overview

1. Limitations of Linear Models

2. Neural Networks
The Perceptron
Multilayer Perceptron
Regression and Classification
Regularization
(Stochastic) Gradient Descent
Backpropagation

3. Issues of Neural Networks
Dying Neurons
Vanishing and Exploding Gradients

Curse of Dimensionality
The curse of dimensionality refers to a set of issues that arise when the dimensionality
of the data is high.
In linear regression and classification, one issue is that we usually need a number of
features that is exponential w.r.t. the number of dimensions of the input $\mathbf{x}$. As the number of
features grows, inverting the feature (Gram) matrix $\Phi^\top \Phi$ in the closed-form solution becomes more costly. What's more, the number of
parameters may eventually become greater than the number of samples (leading to
overfitting).

Curse of Dimensionality
With tile coding (and RBFs) we need exponentially many features to equally cover the
space.

The same is true for other feature functions too.
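
As a quick illustration of this exponential growth, the snippet below counts how many tile-coding features are needed to cover a $d$-dimensional input with a fixed number of bins per dimension (the numbers are illustrative, not from the slides):

```python
# Minimal sketch: feature count for uniform tile coding, assuming
# `bins_per_dim` tiles along each input dimension (illustrative numbers).
bins_per_dim = 10

for d in (1, 2, 4, 8):
    num_features = bins_per_dim ** d  # one feature per cell of the d-dimensional grid
    print(f"d={d}: {num_features:,} features")
# d=8 already needs 100,000,000 features to cover the space as densely as d=1.
```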

Curse of Dimensionality
One central idea in machine learning (and deep learning) is that the data lies on a low-
dimensional manifold. Neural networks are more powerful than linear models as they
automatically extract useful features from the data, largely alleviating the problem of
choosing good features for high-dimensional data.
Neural networks work with high-dimensional data as input, e.g., high-resolution images and
text. The price to pay is that they are more complex, don't have analytical solutions, and
have many more hyper-parameters (hyper-parameters: parameters that we need to choose
prior to model training). This makes it more difficult to select good models.

Perceptron
The perceptron is a simple nonlinear model used for classification,

$$\hat{y} = f(\mathbf{w}^\top \mathbf{x} + b),$$

where $\mathbf{w}$ is the vector of weights, $b$ is the bias, and $f$ is the activation function, i.e., the step function

$$f(a) = \begin{cases} 1 & \text{if } a \geq 0, \\ 0 & \text{otherwise.} \end{cases}$$

Training the Perceptron


Suppose you have a classification dataset of input-output pairs $\{(\mathbf{x}_i, y_i)\}_{i=1}^N$. The perceptron can
be trained using the following procedure:
(1) Initialize the weights $\mathbf{w}_0$ randomly (you can consider the bias as a special weight attached to a constant input of 1).
(2) Compute the model's prediction $\hat{y}_i = f(\mathbf{w}_t^\top \mathbf{x}_i)$ (the bias is omitted for the reason above).
(3) At time-step $t$, compute $\mathbf{w}_{t+1} = \mathbf{w}_t + \eta\, (y_i - \hat{y}_i)\, \mathbf{x}_i$. Repeat from step 2.

The algorithm above works similarly to a gradient ascent algorithm, where $\eta$ is the learning
rate.
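
A minimal NumPy sketch of this procedure, assuming a step activation, targets in {0, 1}, and the bias folded into the weights as described above (function names and the toy dataset are illustrative):

```python
import numpy as np

def step(a):
    """Step activation: 1 if a >= 0, else 0."""
    return (a >= 0).astype(float)

def train_perceptron(X, y, lr=0.1, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    X = np.hstack([X, np.ones((X.shape[0], 1))])  # bias as an extra constant input
    w = rng.normal(scale=0.1, size=X.shape[1])    # (1) random initialization
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat = step(w @ x_i)                 # (2) prediction
            w += lr * (y_i - y_hat) * x_i         # (3) perceptron update
    return w

# Example: a linearly separable toy problem (the OR function).
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 1., 1., 1.])
w = train_perceptron(X, y)
print(step(np.hstack([X, np.ones((4, 1))]) @ w))  # should print [0. 1. 1. 1.] once a separator is found
```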

XOR Problem
The perceptron can only separate the data using a line (a hyperplane, in higher dimensions). However, the data may not be
linearly separable, and thus, the desired function cannot be learned.

Multilayer Perceptron (MLP)
A solution to the XOR problem is to use two layers of perceptrons,

$$\hat{y} = f_2\!\left(\mathbf{w}_2^\top f_1(W_1 \mathbf{x} + \mathbf{b}_1) + b_2\right),$$

where both $f_1$ and $f_2$ are activation functions. The idea is to have a non-linear layer that can
work with non-linearly separable data.
The non-linear layer is also called the hidden layer. The number of rows of $W_1$
(or the number of elements of $\mathbf{b}_1$) determines the number of neurons of the hidden layer.

It is possible to show that a model with a single hidden layer with a non-linear activation
function (followed by a linear output layer) is a universal function approximator (i.e., it can approximate any continuous function with
arbitrarily high accuracy given enough neurons).

Multilayer Perceptron (MLP)


In general, we can design a system with $L$ layers,

$$
\begin{aligned}
\mathbf{h}_0 &= \mathbf{x} && \text{(Input of the neural network)} \\
\mathbf{a}_l &= W_l \mathbf{h}_{l-1} + \mathbf{b}_l && \text{(Affine transformation of layer } l\text{)} \\
\mathbf{h}_l &= f_l(\mathbf{a}_l) && \text{(Hidden layer } l\text{)} \\
\hat{\mathbf{y}} &= f_L(W_L \mathbf{h}_{L-1} + \mathbf{b}_L) && \text{(Output)}
\end{aligned}
$$

where $f_l$ are the activation functions of the hidden layers, $f_L$ is the activation function of
the last layer, and $W_l$, $\mathbf{b}_l$ are the weight matrices and bias vectors of the hidden layers.

For simplicity, we will refer to a neural network as a parametric function that depends on
the input $\mathbf{x}$ and on a set of parameters $\boldsymbol{\theta}$, i.e., $\hat{\mathbf{y}} = f(\mathbf{x}; \boldsymbol{\theta})$.
Generic Feedforward Neural Networks


MLPs are composed of multiple fully-connected layers. Each layer is only connected to the
next one.

More generic feedforward neural networks can break these assumptions. For example,
convolutional layers are not fully connected and residual neural networks have
connections between non-consecutive layers. In general, feedforward neural networks
may be arbitrarily complex. Their only constraint is that they are directed, acyclic graphs.

Common Activation Functions


Common activation functions are continuous and monotonic. Examples are
(1) Sigmoid function: $\sigma(a) = \dfrac{1}{1 + e^{-a}}$
(2) Hyperbolic tangent (Tanh): $\tanh(a) = \dfrac{e^{a} - e^{-a}}{e^{a} + e^{-a}}$
(3) Rectified Linear Unit (ReLU): $\mathrm{ReLU}(a) = \max(0, a)$
(4) Leaky ReLU: $\mathrm{LeakyReLU}(a) = \max(\alpha a, a)$, with a small slope $\alpha > 0$.
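
The four activations above, written out in NumPy (the Leaky ReLU slope of 0.01 is a common default, not a value fixed by the slides):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):
    return np.tanh(a)

def relu(a):
    return np.maximum(0.0, a)

def leaky_relu(a, alpha=0.01):
    return np.where(a > 0, a, alpha * a)

a = np.linspace(-3, 3, 7)
for f in (sigmoid, tanh, relu, leaky_relu):
    print(f.__name__, np.round(f(a), 3))
```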

ML Regression with Neural Networks


We can perform regression with the neural network above by setting $f_L$ equal to the identity
function (i.e., $f_L(a) = a$), and assuming Gaussian noise on the output.

Hence,

$$p(\mathbf{y} \mid \mathbf{x}; \boldsymbol{\theta}) = \mathcal{N}\!\left(\mathbf{y} \mid f(\mathbf{x}; \boldsymbol{\theta}), \sigma^2 I\right).$$

We know that the maximum likelihood solution is therefore

$$\boldsymbol{\theta}_{\mathrm{ML}} = \arg\min_{\boldsymbol{\theta}} \sum_{i=1}^{N} \left\| \mathbf{y}_i - f(\mathbf{x}_i; \boldsymbol{\theta}) \right\|_2^2.$$

Here $\|\cdot\|_2$ is an L2 norm.

ML Regression with NN
In this case, the maximum likelihood estimate does not have a closed-form solution. As we will see,
we will need to use gradient descent to minimize the (mean) squared error.
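
As a small numerical check of the claim above, a sketch (with made-up data) showing that the Gaussian negative log-likelihood and the squared error differ only by an additive constant, so minimizing one minimizes the other:

```python
import numpy as np

rng = np.random.default_rng(0)
y, y_hat, sigma = rng.normal(size=5), rng.normal(size=5), 1.0

# Gaussian negative log-likelihood of y given predictions y_hat
nll = 0.5 * np.sum((y - y_hat) ** 2) / sigma**2 + 0.5 * len(y) * np.log(2 * np.pi * sigma**2)
sse = np.sum((y - y_hat) ** 2)

print(nll, 0.5 * sse / sigma**2)  # equal up to the additive constant above
```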

ML Classification with NN
To perform classification with neural networks, we set $f_L$ to be the sigmoid function for two
classes or the softmax function for more than two classes. These functions can be
interpreted as the parameters of a Bernoulli or a categorical distribution for the output.
The maximum likelihood solution is the one that minimizes the cross entropy between the
model prediction $\hat{y}_{ik}$ and the target $y_{ik}$ for each sample $i$ and class $k$. Then the maximum
likelihood solution is

$$\boldsymbol{\theta}_{\mathrm{ML}} = \arg\min_{\boldsymbol{\theta}} \; -\sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log \hat{y}_{ik}.$$

Once again, this equation does not have a closed-form solution.
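
A sketch of the cross-entropy loss for the multi-class case, assuming a softmax output and one-hot targets (the small epsilon is only for numerical safety; the logits are made up):

```python
import numpy as np

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)   # shift for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(y_onehot, y_hat, eps=1e-12):
    """Mean over samples of -sum_k y_k log(y_hat_k)."""
    return -np.mean(np.sum(y_onehot * np.log(y_hat + eps), axis=-1))

logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
targets = np.array([[1, 0, 0], [0, 0, 1]], dtype=float)
print(cross_entropy(targets, softmax(logits)))
```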

Regularization
We have seen that maximum a posteriori estimation introduces regularization, i.e.,

$$\boldsymbol{\theta}_{\mathrm{MAP}} = \arg\min_{\boldsymbol{\theta}} \; \mathcal{L}(\boldsymbol{\theta}) - \log p(\boldsymbol{\theta}),$$

where the prior $p(\boldsymbol{\theta})$ determines the additional regularization term added to the loss function.

Regularization
Different priors will result in different regularization terms.
(Centered Diagonal) Gaussian: $-\log p(\boldsymbol{\theta}) = \frac{\lambda}{2} \|\boldsymbol{\theta}\|_2^2 + \text{const}$ (L2 regularization)

(Centered Diagonal) Laplace: $-\log p(\boldsymbol{\theta}) = \lambda \|\boldsymbol{\theta}\|_1 + \text{const}$ (L1 regularization)
A Gaussian prior tends to make all weights small, while a Laplace prior tends to sparsify the
weights (many will be zeros).
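
A sketch of how the two priors translate into penalty terms added to the loss (the regularization strength $\lambda$ is a hyper-parameter, not a value from the slides):

```python
import numpy as np

def l2_penalty(weights, lam=1e-3):
    """Centered Gaussian prior: lam/2 * sum of squared weights."""
    return 0.5 * lam * sum(np.sum(W ** 2) for W in weights)

def l1_penalty(weights, lam=1e-3):
    """Centered Laplace prior: lam * sum of absolute weights (encourages sparsity)."""
    return lam * sum(np.sum(np.abs(W)) for W in weights)

W = [np.array([[0.5, -2.0], [0.0, 1.0]])]
print(l2_penalty(W), l1_penalty(W))  # added to the data loss before taking gradients
```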

Bayesian Neural Networks*
We have seen that neural networks can be used for maximum likelihood and maximum a
posteriori estimation. What about Bayesian estimation?
Bayesian Neural Networks are neural networks that use Bayesian inference to
estimate the uncertainty associated with their predictions.
The core idea is to sample the weights of the neural network given the data (i.e., from the
posterior $p(\boldsymbol{\theta} \mid \mathcal{D})$) to obtain a distribution of estimators.
They are useful when dealing with small datasets or when the data is noisy or
incomplete.
Bayesian Neural Networks have been used in a variety of applications, including
image classification, speech recognition, and natural language processing.

Training Bayesian Neural Networks is non-trivial, and we will not study them.

Gradient Descent
Gradient descent is an algorithm that numerically finds the minimum of a function $g$ by
repeatedly moving in the direction of steepest descent, i.e.,

$$\mathbf{x}_{t+1} = \mathbf{x}_t - \eta\, \nabla_{\mathbf{x}}\, g(\mathbf{x}_t),$$

where $\eta$ is the learning rate.
Remark: Gradient descent is not guaranteed to find the global optimum.
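
A minimal sketch of the update rule above on a simple quadratic function (the function, step size, and iteration count are illustrative):

```python
import numpy as np

def g(x):
    return np.sum((x - 3.0) ** 2)      # minimum at x = [3, 3]

def grad_g(x):
    return 2.0 * (x - 3.0)

x, eta = np.zeros(2), 0.1
for t in range(100):
    x = x - eta * grad_g(x)            # move against the gradient
print(x, g(x))                          # close to [3, 3] and 0
```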

Jacobian Matrix
The Jacobian is a matrix of first-order partial derivatives of a vector-valued function.
For $f: \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian matrix at $\mathbf{x}$ is

$$\frac{\partial f}{\partial \mathbf{x}} = \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f_m}{\partial x_1} & \cdots & \dfrac{\partial f_m}{\partial x_n} \end{bmatrix}.$$

The Jacobian matrix approximates the change of $f$ near $\mathbf{x}$ as

$$f(\mathbf{x} + \boldsymbol{\delta}) \approx f(\mathbf{x}) + \frac{\partial f}{\partial \mathbf{x}}\, \boldsymbol{\delta}.$$

Jacobian Matrix
Using the previous definition, if $m = 1$ then $\frac{\partial f}{\partial \mathbf{x}}$ is a row vector, hence,

$$\nabla_{\mathbf{x}} f(\mathbf{x}) = \left(\frac{\partial f}{\partial \mathbf{x}}\right)^{\!\top}.$$

Let's revisit the chain rule. Suppose $\mathbf{y} = g(\mathbf{x})$ and $\mathbf{z} = h(\mathbf{y})$, then

$$\frac{\partial \mathbf{z}}{\partial \mathbf{x}} = \frac{\partial \mathbf{z}}{\partial \mathbf{y}}\, \frac{\partial \mathbf{y}}{\partial \mathbf{x}}.$$

Important: The differential of $\mathbf{z}$ w.r.t. $\mathbf{w}$ can be written both with explicit
parametrization:

$$\frac{\partial h(g(\mathbf{w}))}{\partial \mathbf{w}},$$

and implicit parametrization:

$$\frac{\partial \mathbf{z}}{\partial \mathbf{w}}.$$
From now on, we use this second, shorter notation.

Gradient Descent
Usually, we need to minimize a loss function (e.g., MSE, cross entropy, ...) w.r.t. the neural
network's parameters, i.e.,

$$\min_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}), \qquad \mathcal{L}(\boldsymbol{\theta}) = \frac{1}{N} \sum_{i=1}^{N} \ell\!\left(f(\mathbf{x}_i; \boldsymbol{\theta}), \mathbf{y}_i\right).$$

Therefore, we need to compute

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta\, \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}_t),$$

where $\nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}_t) = \frac{1}{N} \sum_{i=1}^{N} \nabla_{\boldsymbol{\theta}}\, \ell\!\left(f(\mathbf{x}_i; \boldsymbol{\theta}_t), \mathbf{y}_i\right)$.

Stochastic Gradient Descent


Gradient descent is expensive to compute. An idea to improve its efficiency is to estimate
the gradient by using only one data point. In this way, we can optimize the function faster.
Stochastic gradient descent (SGD) is an iterative method for optimizing a function by
randomly selecting one data point at a time and moving in the direction of the steepest
descent based on that point, i.e.,

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta\, \nabla_{\boldsymbol{\theta}}\, \ell\!\left(f(\mathbf{x}_i; \boldsymbol{\theta}_t), \mathbf{y}_i\right), \qquad i \sim \mathrm{Uniform}\{1, \dots, N\}.$$

SGD also helps escape local minima thanks to its high variance.
Question: Is stochastic gradient descent an unbiased gradient estimator?

Minibatch Stochastic Gradient Descent
Minibatch SGD is an iterative method for optimizing a function by randomly selecting a
few data points at a time and moving in the direction of the steepest descent based on the
result of the following equation for those points:

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta\, \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \nabla_{\boldsymbol{\theta}}\, \ell\!\left(f(\mathbf{x}_i; \boldsymbol{\theta}_t), \mathbf{y}_i\right),$$

where $\mathcal{B}$ is a set of points sampled at random and $|\mathcal{B}| = B \ll N$. By choosing different
batch sizes $B$, we can control the variance of this gradient estimator. A small batch size
produces a high variance. The high variance of (minibatch) SGD can also be compensated
with a low learning rate.
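
A sketch of minibatch SGD for linear least squares (a stand-in model, so the gradient has a simple closed form; the data, batch size, and learning rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 5
X = rng.normal(size=(N, d))
true_w = rng.normal(size=d)
y = X @ true_w + 0.1 * rng.normal(size=N)

w, eta, B = np.zeros(d), 0.05, 32
for step in range(2000):
    idx = rng.choice(N, size=B, replace=False)      # sample a random minibatch
    Xb, yb = X[idx], y[idx]
    grad = 2.0 / B * Xb.T @ (Xb @ w - yb)           # gradient of the minibatch MSE
    w -= eta * grad
print(np.max(np.abs(w - true_w)))                   # small: w approaches true_w
```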

Numerical, Symbolic, and Automatic Differentiation


How do we compute the gradients $\nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta})$?

Numerical differentiation is a method of approximating the derivative of a function
using finite differences based on the values of the function.
Symbolic differentiation is a method of finding the exact derivative of a function by
manipulating mathematical expressions using rules of calculus.
Automatic differentiation is a method of computing the exact derivative of a
function specified by a computer program using the chain rule.

                 Numerical   Symbolic   Automatic
Accuracy         Low         High       High
Speed            Fast        Slow       Fast
Memory           Low         High       Low
Implementation   Easy        Hard       Easy

Table: A comparison of different differentiation methods.
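
A small sketch contrasting numerical differentiation (finite differences, approximate) with the exact derivative that symbolic or automatic differentiation would return, for a scalar function (the test function, point, and step size are arbitrary):

```python
import numpy as np

def f(x):
    return np.sin(x) * x**2

def f_prime_exact(x):                    # what symbolic/automatic differentiation would return
    return np.cos(x) * x**2 + 2 * x * np.sin(x)

def f_prime_numerical(x, h=1e-5):        # central finite differences
    return (f(x + h) - f(x - h)) / (2 * h)

x0 = 1.3
print(f_prime_exact(x0), f_prime_numerical(x0))   # agree up to O(h^2) error
```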

Backpropagation
Backpropagation is an algorithm for training feedforward neural networks using
gradient descent.
It computes the gradient of the loss function with respect to the network weights by
applying the chain rule from the output layer to the input layer.
It updates the network weights in the direction that minimizes the loss function using
the computed gradient.
It consists of two phases: a forward pass for computing the neural network's
prediction and a backward pass for computing the gradient.
Figure: An illustration of backpropagation. Source: https://en.wikipedia.org/wiki/Backpropagation

Derivation of Backpropagation
Let's set the notation first.
(1) we consider a single sample $(\mathbf{x}, \mathbf{y})$;

(2) we do not consider biases;

(3) the output of the neural network is $\hat{\mathbf{y}} = \mathbf{h}_L$;
(4) the hidden layers are $\mathbf{h}_l = f_l(\mathbf{a}_l)$, where indices refer to layer numbers;

(5) the values before the activation functions are $\mathbf{a}_l = W_l \mathbf{h}_{l-1}$ (bias omitted);
(6) to make notation more general, we set $\mathbf{h}_0 = \mathbf{x}$.
Objective: We want to compute the gradient of the loss function $\mathcal{L}$ w.r.t. the $j$-th row of the weight
matrix $W_l$, which we indicate with the column vector $\mathbf{w}_{l,j}$.

Derivation of Backpropagation

Starting from

$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}_{l,j}} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}_L}\, \frac{\partial \mathbf{h}_L}{\partial \mathbf{w}_{l,j}},$$

we continue to expand the terms using the chain rule until we reach layer $l$, and we
notice that

$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}_{l,j}} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}_L}\, \frac{\partial \mathbf{h}_L}{\partial \mathbf{h}_{L-1}} \cdots \frac{\partial \mathbf{h}_{l+1}}{\partial \mathbf{h}_l}\, \frac{\partial \mathbf{h}_l}{\partial a_{l,j}}\, \frac{\partial a_{l,j}}{\partial \mathbf{w}_{l,j}}, \qquad \frac{\partial a_{l,j}}{\partial \mathbf{w}_{l,j}} = \mathbf{h}_{l-1}^\top,$$

since only the $j$-th pre-activation $a_{l,j} = \mathbf{w}_{l,j}^\top \mathbf{h}_{l-1}$ depends on $\mathbf{w}_{l,j}$.

Derivation of Backpropagation
The reason for this can be seen in a graphical illustration of the forward computation: $\mathbf{w}_{l,j}$ enters the computation only through the pre-activation $a_{l,j}$.

Derivation of Backpropagation
This means that

$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}_{l,j}} = \underbrace{\frac{\partial \mathcal{L}}{\partial \mathbf{h}_L}\, \frac{\partial \mathbf{h}_L}{\partial \mathbf{h}_{L-1}} \cdots \frac{\partial \mathbf{h}_{l+1}}{\partial \mathbf{h}_l}\, \frac{\partial \mathbf{h}_l}{\partial a_{l,j}}}_{\text{scalar}}\; \mathbf{h}_{l-1}^\top.$$

E.g., for the last layer ($l = L$),

$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}_{L,j}} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}_L}\, \frac{\partial \mathbf{h}_L}{\partial a_{L,j}}\, \mathbf{h}_{L-1}^\top.$$

Derivation of Backpropagation
Which, back in gradient notation, is

$$\nabla_{\mathbf{w}_{l,j}} \mathcal{L} = \left(\frac{\partial \mathcal{L}}{\partial \mathbf{h}_L}\, \frac{\partial \mathbf{h}_L}{\partial \mathbf{h}_{L-1}} \cdots \frac{\partial \mathbf{h}_{l+1}}{\partial \mathbf{h}_l}\, \frac{\partial \mathbf{h}_l}{\partial a_{l,j}}\right) \mathbf{h}_{l-1}.$$

At this point, realize that many Jacobians are shared across different layers, i.e., the prefix $\frac{\partial \mathcal{L}}{\partial \mathbf{h}_L}\, \frac{\partial \mathbf{h}_L}{\partial \mathbf{h}_{L-1}} \cdots \frac{\partial \mathbf{h}_{l+1}}{\partial \mathbf{h}_l}$ appears in the gradients of all layers $1, \dots, l$.

Derivation of Backpropagation
It is also possible to see that all the sequences of Jacobians can be computed recursively, i.e.,

$$\frac{\partial \mathcal{L}}{\partial \mathbf{h}_{L-1}} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}_L}\, \frac{\partial \mathbf{h}_L}{\partial \mathbf{h}_{L-1}},$$

and

$$\frac{\partial \mathcal{L}}{\partial \mathbf{h}_l} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}_{l+1}}\, \frac{\partial \mathbf{h}_{l+1}}{\partial \mathbf{h}_l}.$$

Now, we can call $\boldsymbol{\delta}_l = \left(\frac{\partial \mathcal{L}}{\partial \mathbf{h}_l}\right)^{\!\top}$ the propagation error of layer $l$, and

$$\boldsymbol{\delta}_l = \left(\frac{\partial \mathbf{h}_{l+1}}{\partial \mathbf{h}_l}\right)^{\!\top} \boldsymbol{\delta}_{l+1}.$$

The propagation error quantifies the impact of the neurons of layer $l$ on the output.

Derivation of Backpropagation
We can now rewrite the gradients in terms of the propagation error. Notice that the terms
$\boldsymbol{\delta}_l^\top\, \frac{\partial \mathbf{h}_l}{\partial a_{l,j}}$ are scalars:

$$\nabla_{\mathbf{w}_{l,j}} \mathcal{L} = \left(\boldsymbol{\delta}_l^\top\, \frac{\partial \mathbf{h}_l}{\partial a_{l,j}}\right) \mathbf{h}_{l-1}.$$

E.g., for an elementwise activation $f_l$,

$$\nabla_{\mathbf{w}_{l,j}} \mathcal{L} = \delta_{l,j}\, f_l'(a_{l,j})\, \mathbf{h}_{l-1}.$$

Derivation of Backpropagation
Since $\boldsymbol{\delta}_l$ depends on $\boldsymbol{\delta}_{l+1}$, we can start the computation from the last layer and
propagate the error backwards, i.e.,

$$
\begin{aligned}
\boldsymbol{\delta}_L &= \nabla_{\mathbf{h}_L} \mathcal{L} && \text{(Initialize Propagation Error)} \\
\nabla_{\mathbf{w}_{L,j}} \mathcal{L} &= \left(\boldsymbol{\delta}_L^\top\, \tfrac{\partial \mathbf{h}_L}{\partial a_{L,j}}\right) \mathbf{h}_{L-1} && \text{(Gradient Computation)} \\
\boldsymbol{\delta}_{L-1} &= \left(\tfrac{\partial \mathbf{h}_L}{\partial \mathbf{h}_{L-1}}\right)^{\!\top} \boldsymbol{\delta}_L && \text{(Error Propagation)} \\
\nabla_{\mathbf{w}_{L-1,j}} \mathcal{L} &= \left(\boldsymbol{\delta}_{L-1}^\top\, \tfrac{\partial \mathbf{h}_{L-1}}{\partial a_{L-1,j}}\right) \mathbf{h}_{L-2} && \text{(Gradient Computation)} \\
&\;\;\vdots && \\
\boldsymbol{\delta}_1 &= \left(\tfrac{\partial \mathbf{h}_2}{\partial \mathbf{h}_1}\right)^{\!\top} \boldsymbol{\delta}_2 && \text{(Error Propagation)} \\
\nabla_{\mathbf{w}_{1,j}} \mathcal{L} &= \left(\boldsymbol{\delta}_1^\top\, \tfrac{\partial \mathbf{h}_1}{\partial a_{1,j}}\right) \mathbf{x} && \text{(Gradient Computation)}
\end{aligned}
$$

Backpropagation Algorithm
(1) Compute the forward pass, i.e., compute $\mathbf{a}_l$ and $\mathbf{h}_l$ for all layers by propagating the input forward through
the network.

(2) Initialize the propagation error $\boldsymbol{\delta}_L = \nabla_{\mathbf{h}_L} \mathcal{L}$.

(3) For $l = L$ down to $1$ do

(1) For every row $j$, compute the gradient $\nabla_{\mathbf{w}_{l,j}} \mathcal{L} = \left(\boldsymbol{\delta}_l^\top\, \frac{\partial \mathbf{h}_l}{\partial a_{l,j}}\right) \mathbf{h}_{l-1}$. Store the results.

(2) Propagate the error $\boldsymbol{\delta}_{l-1} = \left(\frac{\partial \mathbf{h}_l}{\partial \mathbf{h}_{l-1}}\right)^{\!\top} \boldsymbol{\delta}_l$.

(4) Compose back all the gradient terms and return them (a code sketch of the full procedure follows below).
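
A compact NumPy sketch of this algorithm for an MLP with tanh hidden layers, an identity output, and a squared-error loss, following the forward/backward structure above (bias-free and single-sample, as in the derivation); this is an illustrative implementation under those assumptions, not the slides' exact code:

```python
import numpy as np

def backprop(x, y, Ws):
    """Return dL/dW_l for the loss L = 0.5 * ||h_L - y||^2."""
    # Forward pass: store pre-activations a_l and activations h_l.
    hs, as_ = [x], []
    for l, W in enumerate(Ws):
        a = W @ hs[-1]
        as_.append(a)
        hs.append(np.tanh(a) if l < len(Ws) - 1 else a)   # identity on the last layer
    # Initialize the propagation error at the output.
    delta = hs[-1] - y                                    # dL/da_L for identity output
    grads = [None] * len(Ws)
    # Backward pass: gradient computation, then error propagation.
    for l in reversed(range(len(Ws))):
        grads[l] = np.outer(delta, hs[l])                 # dL/dW_l = delta h_{l-1}^T
        if l > 0:
            delta = (Ws[l].T @ delta) * (1 - np.tanh(as_[l - 1]) ** 2)
    return grads

# Quick check against finite differences on one weight.
rng = np.random.default_rng(0)
Ws = [rng.normal(scale=0.5, size=(4, 3)), rng.normal(scale=0.5, size=(2, 4))]
x, y = rng.normal(size=3), rng.normal(size=2)
g = backprop(x, y, Ws)

eps = 1e-6
Wp = [W.copy() for W in Ws]
Wp[0][1, 2] += eps
loss = lambda W: 0.5 * np.sum((W[1] @ np.tanh(W[0] @ x) - y) ** 2)
print(g[0][1, 2], (loss(Wp) - loss(Ws)) / eps)            # should match closely
```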

Saturation and Dying ReLU


A good choice of the activation function is essential. Some activation functions have a
derivative close to zero on part of their domain, e.g., Sigmoid, Tanh, and ReLU. This can be a
problem: when the derivative gets close to zero, it becomes hard to train the neuron.
When the derivative of the sigmoid or tanh activation functions gets close to zero, we say that
the neuron saturates.

When this problem appears in ReLUs, it is even more critical, since in its flat region the ReLU
has a gradient exactly equal to zero and does not propagate gradients at all. We call this
problem dying ReLU. A solution is to use the Leaky ReLU together with a good random
initialization that avoids low-gradient regions.

Vanishing and Exploding Gradients


It can be seen that the gradient $\nabla_{\mathbf{w}_{l,j}} \mathcal{L}$ is a product of many terms. The number of terms increases with the
depth of the neural network for the layers close to the input.
The product of many terms causes extremely high variance, producing gradients that are
either very high or very low in magnitude. Exploding gradients can be tackled with gradient
clipping, which adds a bias to the gradient estimator to lower its variance.
Vanishing gradients (and exploding gradients) can also be dealt with via residual networks that
connect non-neighboring layers.
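
A sketch of gradient clipping by global norm, one common way to implement the technique mentioned above (the threshold is a hyper-parameter, not a value from the slides):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale all gradients jointly so their global L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads]

grads = [np.array([3.0, 4.0]), np.array([12.0])]      # global norm = 13
clipped = clip_by_global_norm(grads, max_norm=1.0)
print(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))  # ~1.0
```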

