2. Neural Networks
The Perceptron
Multilayer Perceptron
Regression and Classification
Regularization
(Stochastic) Gradient Descent
Backpropagation
Curse of Dimensionality
The curse of dimensionality refers to a set of issues that arise when the dimensionality
of the data is high.
In linear regression and classification, one issue is that we usually need a number of
features that is exponential w.r.t. the number of dimensions of the input $\mathbf{x}$. As the number of
features grows, inverting the matrix $\boldsymbol{\Phi}^\top \boldsymbol{\Phi}$ in the closed-form solution becomes more costly. What's more, the number of
parameters may eventually become greater than the number of samples (leading to
overfitting).
Curse of Dimensionality
With tile coding (and RBFs) we need exponentially many features to cover the space evenly.
The same is true for other feature functions too.
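For instance (the numbers are chosen only for illustration), with $k$ tiles per dimension, covering a $d$-dimensional input space uniformly requires
$$\underbrace{k \times k \times \cdots \times k}_{d \text{ times}} = k^{d} \quad \text{features, e.g., } k = 10,\ d = 6 \ \Rightarrow\ 10^{6} \text{ features.}$$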
Curse of Dimensionality
One central idea in machine learning (and deep learning) is that the data lies on a low-
dimensional manifold. Neural networks are more powerful than linear models as they
automatically extract useful features from the data, largely alleviating the problem of
choosing good features for high-dimensional data.
Neural networks work with high-dimensional data as input, e.g., high-resolution images and
text. The price to pay is that they are more complex, don't have analytical solutions, and
have many more hyper-parameters (hyper-parameters: parameters that we need to choose
prior to model training). This makes it more difficult to select good models.
Perceptron
The perceptron is a simple nonlinear model used for classification,
$$\hat{y} = f(\mathbf{w}^\top \mathbf{x} + b),$$
where $\mathbf{w}$ is the vector of weights, $b$ is the bias, and $f$ is the activation function, i.e.,
$$f(a) = \begin{cases} +1 & \text{if } a \ge 0, \\ -1 & \text{otherwise.} \end{cases}$$
The weights are learned with the perceptron update rule
$$\mathbf{w} \leftarrow \mathbf{w} + \eta\,(y_i - \hat{y}_i)\,\mathbf{x}_i, \qquad b \leftarrow b + \eta\,(y_i - \hat{y}_i).$$
The algorithm above works similarly to a gradient ascent algorithm, where $\eta$ is the learning rate.
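As a minimal sketch of this update rule in NumPy (variable names, the $\{-1,+1\}$ label convention, and the toy dataset are illustrative assumptions):

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=100):
    """Perceptron rule: w <- w + lr * (y_i - y_hat_i) * x_i (updates only on mistakes)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat = 1.0 if w @ x_i + b >= 0 else -1.0   # step activation
            w += lr * (y_i - y_hat) * x_i
            b += lr * (y_i - y_hat)
    return w, b

# Toy linearly separable problem (logical AND with labels in {-1, +1})
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1., -1., -1., 1.])
w, b = train_perceptron(X, y)
print(np.where(X @ w + b >= 0, 1.0, -1.0))  # matches y
```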
XOR Problem
The perceptron can only separate the data with a line (a hyperplane in higher dimensions). However, the data may not be
linearly separable, and thus the desired function cannot be learned.
Multilayer Perceptron (MLP)
A solution to the XOR problem is to use two layers of perceptrons,
$$\hat{y} = f_2\big(\mathbf{W}_2\, f_1(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2\big),$$
where both $f_1$ and $f_2$ are activation functions. The idea is to have a non-linear layer that can
work with non-linearly separable data.
The non-linear layer is also called the hidden layer. The number of rows of $\mathbf{W}_1$
(or the number of elements of $\mathbf{b}_1$) determines the number of neurons of the hidden layer.
It is possible to show that a model with one linear layer followed by a non-linear activation
function (and a linear output layer) is a universal function approximator (i.e., it can
approximate any continuous function to arbitrarily high accuracy given enough hidden neurons).
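As a sketch of this idea, the hand-picked weights below (chosen for illustration, not taken from the lecture) let a tiny two-layer network with a step non-linearity compute XOR:

```python
import numpy as np

def step(a):
    """Heaviside step activation: 1 if a >= 0 else 0."""
    return (a >= 0).astype(float)

# Hidden layer: two neurons computing OR(x1, x2) and NAND(x1, x2)
W1 = np.array([[ 1.0,  1.0],    # OR:   x1 + x2 - 0.5 >= 0
               [-1.0, -1.0]])   # NAND: -x1 - x2 + 1.5 >= 0
b1 = np.array([-0.5, 1.5])

# Output layer: AND of the two hidden neurons gives XOR
W2 = np.array([[1.0, 1.0]])
b2 = np.array([-1.5])

def mlp(x):
    h = step(W1 @ x + b1)        # hidden (non-linear) layer
    return step(W2 @ h + b2)     # output layer

for x in [np.array([0., 0.]), np.array([0., 1.]),
          np.array([1., 0.]), np.array([1., 1.])]:
    print(x, "->", mlp(x))       # prints 0, 1, 1, 0 (XOR)
```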
For simplicity, we will refer to a neural network as a parametric function that depends on
the input $\mathbf{x}$ and on a set of parameters $\boldsymbol{\theta}$, i.e.,
$$\hat{y} = f(\mathbf{x}; \boldsymbol{\theta}),$$
composed of a stack of fully connected layers where each layer feeds only into the next.
More generic feedforward neural networks can break these assumptions. For example,
convolutional layers are not fully connected and residual neural networks have
connections between non-consecutive layers. In general, feedforward neural networks
may be arbitrarily complex. Their only constraint is that their computational graph is a directed acyclic graph.
Hence,
$$\boldsymbol{\theta}_{\mathrm{ML}} = \arg\min_{\boldsymbol{\theta}} \sum_{i=1}^{N} \big\lVert y_i - f(\mathbf{x}_i; \boldsymbol{\theta}) \big\rVert^{2}.$$
Here $\lVert \cdot \rVert$ is an L2 norm.
ML Regression with NN
In this case, the maximum likelihood problem does not have a closed-form solution. As we will see,
we will need to use gradient descent to minimize the (mean) squared error.
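A minimal sketch of the (mean) squared error that gradient descent will minimize (names and numbers are illustrative):

```python
import numpy as np

def mse_loss(y_pred, y_true):
    """Mean squared error between network predictions and regression targets."""
    return np.mean((y_pred - y_true) ** 2)

# Example: predictions from some network f(x; theta) vs. targets
y_pred = np.array([0.9, 2.1, 2.8])
y_true = np.array([1.0, 2.0, 3.0])
print(mse_loss(y_pred, y_true))  # 0.02
```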
ML Classification with NN
To perform classification with neural networks, we set the output activation function to be the sigmoid function for two
classes or the softmax function for more than two classes. The outputs of these functions can be
interpreted as the parameters of a Bernoulli or a categorical distribution over the output.
The maximum likelihood solution is the one that minimizes the cross entropy between the
model prediction $\hat{y}_{ik}$ and the target $y_{ik}$ for each sample $i$ and class $k$. Then the maximum
likelihood solution is
$$\boldsymbol{\theta}_{\mathrm{ML}} = \arg\min_{\boldsymbol{\theta}} \; -\sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log \hat{y}_{ik}.$$
Once again, this problem does not have a closed-form solution.
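A minimal sketch of the multi-class case with softmax outputs (function names and the toy logits are illustrative assumptions):

```python
import numpy as np

def softmax(a):
    """Softmax over the last axis, with the usual max-shift for numerical stability."""
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(y_prob, y_onehot, eps=1e-12):
    """Cross entropy -sum_k y_ik log(yhat_ik), averaged over samples
    (the ML objective sums over samples; the minimizer is the same)."""
    return -np.mean(np.sum(y_onehot * np.log(y_prob + eps), axis=-1))

# Example: 2 samples, 3 classes
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2,  3.0]])
targets = np.array([[1, 0, 0],
                    [0, 0, 1]], dtype=float)
print(cross_entropy(softmax(logits), targets))
```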
Regularization
We have seen that maximum a posteriori estimation introduces regularization, i.e.,
$$\boldsymbol{\theta}_{\mathrm{MAP}} = \arg\max_{\boldsymbol{\theta}} \sum_{i=1}^{N} \log p(y_i \mid \mathbf{x}_i, \boldsymbol{\theta}) + \log p(\boldsymbol{\theta}),$$
where the prior $p(\boldsymbol{\theta})$ determines the additional regularization term added to the loss function.
Regularization
Different priors will result in different regularization terms.
(Centered Diagonal) Gaussian:
$$p(\boldsymbol{\theta}) = \mathcal{N}(\boldsymbol{\theta} \mid \mathbf{0}, \sigma^{2}\mathbf{I}) \;\;\Rightarrow\;\; -\log p(\boldsymbol{\theta}) = \lambda \lVert \boldsymbol{\theta} \rVert_2^{2} + \text{const.}$$
(Centered Diagonal) Laplace:
$$p(\boldsymbol{\theta}) = \prod_j \mathrm{Laplace}(\theta_j \mid 0, b) \;\;\Rightarrow\;\; -\log p(\boldsymbol{\theta}) = \lambda \lVert \boldsymbol{\theta} \rVert_1 + \text{const.}$$
A Gaussian prior tends to make all weights small, while a Laplace prior tends to sparsify the
weights (many will be exactly zero).
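A minimal sketch of the two resulting penalty terms (the weight `lam` and the numbers are illustrative assumptions):

```python
import numpy as np

def l2_penalty(theta, lam):
    """Gaussian prior -> L2 (weight decay) regularization."""
    return lam * np.sum(theta ** 2)

def l1_penalty(theta, lam):
    """Laplace prior -> L1 regularization, encourages sparse weights."""
    return lam * np.sum(np.abs(theta))

theta = np.array([0.5, -2.0, 0.0, 1.5])
loss = 0.42  # e.g., an MSE or cross-entropy value
print(loss + l2_penalty(theta, lam=0.01))
print(loss + l1_penalty(theta, lam=0.01))
```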
Bayesian Neural Networks*
We have seen that neural networks can be used for maximum likelihood and maximum a
posteriori estimation. What about Bayesian estimation?
Bayesian Neural Networks are neural networks that use Bayesian inference to
estimate the uncertainty associated with their predictions.
The core idea is to sample the weights of the neural network given the data (i.e.,
$\boldsymbol{\theta} \sim p(\boldsymbol{\theta} \mid \mathcal{D})$) to obtain a distribution of estimators.
They are useful when dealing with small datasets or when the data is noisy or
incomplete.
Bayesian Neural Networks have been used in a variety of applications, including
image classification, speech recognition, and natural language processing.
Training Bayesian Neural Networks is non-trivial, and we will not study them here.
Gradient Descent
Gradient descent is an algorithm that numerically finds the minimum of a function by
repeatedly moving in the direction of steepest descent, i.e.,
$$\mathbf{x}_{t+1} = \mathbf{x}_t - \eta \,\nabla_{\mathbf{x}} f(\mathbf{x}_t),$$
where $\eta > 0$ is the learning rate (step size).
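A minimal sketch of the update rule on a simple quadratic (the function, step size, and iteration count are illustrative choices):

```python
import numpy as np

def grad_f(x):
    """Gradient of f(x) = ||x - 1||^2, whose minimum is at x = (1, 1)."""
    return 2.0 * (x - 1.0)

x = np.array([5.0, -3.0])    # initial point
eta = 0.1                    # learning rate
for _ in range(100):
    x = x - eta * grad_f(x)  # steepest-descent step
print(x)                     # close to [1., 1.]
```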
Jacobian Matrix
The Jacobian is a matrix of first-order partial derivatives of a vector-valued function.
For $f : \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian matrix at $\mathbf{x}$ is
$$\mathbf{J}_f(\mathbf{x}) = \frac{\partial f}{\partial \mathbf{x}} = \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f_m}{\partial x_1} & \cdots & \dfrac{\partial f_m}{\partial x_n} \end{bmatrix}.$$
Jacobian Matrix
Using the previous definition, if $f : \mathbb{R}^n \to \mathbb{R}$ (i.e., $m = 1$) then $\mathbf{J}_f(\mathbf{x})$ is a row vector, hence,
$$\mathbf{J}_f(\mathbf{x}) = \nabla_{\mathbf{x}} f(\mathbf{x})^\top.$$
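As a concrete illustrative example (not from the lecture), for
$$f(x, y) = \begin{pmatrix} x^{2} y \\ 5x + \sin y \end{pmatrix}, \qquad \mathbf{J}_f(x, y) = \begin{bmatrix} 2xy & x^{2} \\ 5 & \cos y \end{bmatrix},$$
a $2 \times 2$ matrix, since $f : \mathbb{R}^2 \to \mathbb{R}^2$.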
Important: The differential of $f$ w.r.t. $\mathbf{w}$ can be written both with explicit
parametrization, $\frac{\partial f(\mathbf{x}; \mathbf{w})}{\partial \mathbf{w}}$, and with the shorthand $\frac{\partial f}{\partial \mathbf{w}}$.
Gradient Descent
Usually, we need to minimize a loss function (e.g., MSE, cross entropy, ...) w.r.t. the neural
network's parameters, i.e.,
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \,\nabla_{\boldsymbol{\theta}} L(\boldsymbol{\theta}_t),$$
where $L(\boldsymbol{\theta}) = \frac{1}{N}\sum_{i=1}^{N} \ell\big(f(\mathbf{x}_i; \boldsymbol{\theta}), y_i\big)$.
Stochastic gradient descent (SGD) approximates this gradient using a single randomly selected sample per step, which is much cheaper when $N$ is large. SGD also helps escape local minima thanks to the high variance of its gradient estimates.
Question: Is stochastic gradient descent an unbiased gradient estimator?
Minibatch Stochastic Gradient Descent
Minibatch SGD is an iterative method for optimizing a function by randomly selecting a
few data points at a time and moving in the direction of steepest descent computed on those points:
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta\, \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \nabla_{\boldsymbol{\theta}}\, \ell\big(f(\mathbf{x}_i; \boldsymbol{\theta}_t), y_i\big),$$
where $\mathcal{B}$ is a randomly sampled minibatch of data points.
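A minimal sketch of a minibatch SGD loop (the least-squares example, `grad_fn` interface, and hyper-parameters are illustrative assumptions):

```python
import numpy as np

def minibatch_sgd(grad_fn, theta, X, y, lr=0.01, batch_size=32, epochs=100, seed=0):
    """grad_fn(theta, X_batch, y_batch) must return the gradient of the
    average loss over the minibatch."""
    rng = np.random.default_rng(seed)
    n = len(X)
    for _ in range(epochs):
        idx = rng.permutation(n)                    # reshuffle the data each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]   # indices of the minibatch
            theta = theta - lr * grad_fn(theta, X[batch], y[batch])
    return theta

# Example: linear least squares, gradient of mean((X theta - y)^2)
grad = lambda th, Xb, yb: 2.0 * Xb.T @ (Xb @ th - yb) / len(Xb)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true
print(minibatch_sgd(grad, np.zeros(3), X, y, lr=0.05))  # close to theta_true
```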
Backpropagation
Backpropagation is an algorithm for training feedforward neural networks using
gradient descent.
It computes the gradient of the loss function with respect to the network weights by
applying the chain rule from the output layer to the input layer.
It updates the network weights in the direction that minimizes the loss function using
the computed gradient.
It consists of two phases: a forward pass for computing the neural network's
prediction and a backward pass for computing the gradient.
Figure: An illustration of backpropagation. Source: https://en.wikipedia.org/wiki/Backpropagation
Derivation of Backpropagation
Let's set the notation first.
(1) we consider a single sample $(\mathbf{x}, y)$;
(5) the values before the activation functions (the pre-activations) are $\mathbf{a}_l = \mathbf{W}_l \mathbf{z}_{l-1}$ (bias omitted), with activations $\mathbf{z}_l = f_l(\mathbf{a}_l)$;
(6) to make the notation more general, we set $\mathbf{z}_0 = \mathbf{x}$.
Objective: We want to compute the gradient of the loss function $\ell$ w.r.t. the $j$-th row of the weight
matrix $\mathbf{W}_l$, which we indicate with the column vector $\mathbf{w}_{l,j}$, i.e., $\nabla_{\mathbf{w}_{l,j}} \ell$.
Derivation of Backpropagation
We continue to expand the terms using the chain rule until we reach layer $l$,
$$\frac{\partial \ell}{\partial \mathbf{w}_{l,j}} = \frac{\partial \ell}{\partial \mathbf{z}_L}\,\frac{\partial \mathbf{z}_L}{\partial \mathbf{a}_L}\,\frac{\partial \mathbf{a}_L}{\partial \mathbf{z}_{L-1}} \cdots \frac{\partial \mathbf{z}_l}{\partial \mathbf{a}_l}\,\frac{\partial \mathbf{a}_l}{\partial \mathbf{w}_{l,j}},$$
and we notice that only the $j$-th component of $\mathbf{a}_l$ depends on $\mathbf{w}_{l,j}$.
Derivation of Backpropagation
The reason for this can be seen in the following graphical illustration of $\mathbf{a}_l = \mathbf{W}_l \mathbf{z}_{l-1}$:
Derivation of Backpropagation
This means that $\frac{\partial \mathbf{a}_l}{\partial \mathbf{w}_{l,j}}$ is a matrix whose $j$-th row is $\mathbf{z}_{l-1}^\top$ and whose remaining rows are zero, i.e.,
$$\frac{\partial a_{l,k}}{\partial \mathbf{w}_{l,j}} = \begin{cases} \mathbf{z}_{l-1}^\top & \text{if } k = j, \\ \mathbf{0}^\top & \text{otherwise.} \end{cases}$$
E.g., for $j = 1$ only the first row of $\frac{\partial \mathbf{a}_l}{\partial \mathbf{w}_{l,1}}$ is non-zero.
Derivation of Backpropagation
Which, back in gradient notation, is
$$\nabla_{\mathbf{w}_{l,j}} \ell = \left( \frac{\partial \ell}{\partial \mathbf{z}_L}\,\frac{\partial \mathbf{z}_L}{\partial \mathbf{a}_L}\,\frac{\partial \mathbf{a}_L}{\partial \mathbf{z}_{L-1}} \cdots \frac{\partial \mathbf{z}_l}{\partial \mathbf{a}_l} \right)_{\!j} \mathbf{z}_{l-1}.$$
At this point, realize that many Jacobians are shared across different layers, i.e., the Jacobians of the layers above $l$ appear in the gradient of layer $l$ and in the gradients of all the layers below it.
Derivation of Backpropagation
It is also possible to see that all the sequences of Jacobians can be computed recursively. Defining the propagation error $\boldsymbol{\delta}_l^\top = \frac{\partial \ell}{\partial \mathbf{a}_l}$, we have
$$\boldsymbol{\delta}_L^\top = \frac{\partial \ell}{\partial \mathbf{z}_L}\,\frac{\partial \mathbf{z}_L}{\partial \mathbf{a}_L}$$
and
$$\boldsymbol{\delta}_l^\top = \boldsymbol{\delta}_{l+1}^\top\,\frac{\partial \mathbf{a}_{l+1}}{\partial \mathbf{z}_l}\,\frac{\partial \mathbf{z}_l}{\partial \mathbf{a}_l} = \boldsymbol{\delta}_{l+1}^\top\,\mathbf{W}_{l+1}\,\operatorname{diag}\!\big(f_l'(\mathbf{a}_l)\big).$$
The propagation error $\boldsymbol{\delta}_l$ quantifies the impact of the neurons of layer $l$ on the output.
Derivation of Backpropagation
We can now rewrite the gradients in terms of the propagation error. Notice that the components $\delta_{l,j} = \frac{\partial \ell}{\partial a_{l,j}}$
are scalars, so
$$\nabla_{\mathbf{w}_{l,j}} \ell = \delta_{l,j}\, \mathbf{z}_{l-1}.$$
E.g., for the last layer, $\nabla_{\mathbf{w}_{L,j}} \ell = \delta_{L,j}\, \mathbf{z}_{L-1}$.
Derivation of Backpropagation
Since $\boldsymbol{\delta}_l$ depends on $\boldsymbol{\delta}_{l+1}$, we can start the computation from the last layer and
propagate the error backwards, i.e.,
$$\nabla_{\mathbf{w}_{L,j}} \ell = \delta_{L,j}\, \mathbf{z}_{L-1} \qquad \text{(Gradient Computation)}$$
$$\boldsymbol{\delta}_l^\top = \boldsymbol{\delta}_{l+1}^\top\,\mathbf{W}_{l+1}\,\operatorname{diag}\!\big(f_l'(\mathbf{a}_l)\big) \qquad \text{(Error Propagation)}$$
$$\nabla_{\mathbf{w}_{l,j}} \ell = \delta_{l,j}\, \mathbf{z}_{l-1} \qquad \text{(Gradient Computation)}$$
Backpropagation Algorithm
(1) Compute the forward pass, i.e., compute $\mathbf{a}_l$ and $\mathbf{z}_l$ for every layer, propagating the input forward through
the network.
(3) For $l = L$ down to $1$ do
(1) For every row $j$, compute the gradient $\nabla_{\mathbf{w}_{l,j}} \ell = \delta_{l,j}\, \mathbf{z}_{l-1}$. Store the results.
(4) Compose back all the gradient terms and return them.
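A minimal sketch of the algorithm for a two-layer network with a sigmoid hidden layer, a linear output, and a squared-error loss; the architecture, names, and omission of biases are illustrative assumptions:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop(x, y, W1, W2):
    """Returns dL/dW1 and dL/dW2 for L = 0.5 * (y_hat - y)^2."""
    # Forward pass: store pre-activations a_l and activations z_l
    a1 = W1 @ x                                 # hidden pre-activation
    z1 = sigmoid(a1)                            # hidden activation
    y_hat = W2 @ z1                             # linear output layer

    # Backward pass: propagate the error delta_l from the output to the input
    delta2 = y_hat - y                          # dL/da2
    delta1 = (W2.T @ delta2) * z1 * (1 - z1)    # dL/da1 = W2^T delta2 * f'(a1)

    # Gradient computation: dL/dW_l = delta_l z_{l-1}^T (rows are delta_{l,j} z_{l-1})
    grad_W2 = np.outer(delta2, z1)
    grad_W1 = np.outer(delta1, x)
    return grad_W1, grad_W2

# Usage: one gradient-descent step on a single sample
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3)) * 0.1
W2 = rng.normal(size=(1, 4)) * 0.1
x = np.array([0.5, -1.0, 2.0])
y = np.array([1.0])
g1, g2 = backprop(x, y, W1, W2)
W1 -= 0.1 * g1
W2 -= 0.1 * g2
```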
When this problem (vanishing gradients) appears with ReLU activations, it is even more critical, since in their flat region ReLUs
have a gradient exactly equal to zero and therefore do not propagate gradients at all. We call this
problem the dying ReLU problem. A solution is to use the "Leaky ReLU" and a good random
initialization that avoids low-gradient regions.
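A minimal sketch contrasting the two activations and their gradients (the slope 0.01 is a common but arbitrary choice):

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def relu_grad(a):
    return (a > 0).astype(float)          # exactly zero for a <= 0 ("dying ReLU")

def leaky_relu(a, slope=0.01):
    return np.where(a > 0, a, slope * a)

def leaky_relu_grad(a, slope=0.01):
    return np.where(a > 0, 1.0, slope)    # small but non-zero gradient for a <= 0

a = np.array([-2.0, -0.1, 0.5, 3.0])
print(relu_grad(a))        # [0. 0. 1. 1.]  -> no gradient flows through negative inputs
print(leaky_relu_grad(a))  # [0.01 0.01 1. 1.]
```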