Unit 1 Deep Learning
Linear algebra forms the backbone of many concepts in deep learning. Linear algebra is
a continuous form of mathematics and is applied throughout science and engineering
because it allows you to model natural phenomena and to compute them efficiently.
Because it is a form of continuous rather than discrete mathematics, many computer
scientists have little experience with it. Linear algebra is also central to almost
all areas of mathematics like geometry and functional analysis. Its concepts are a crucial
prerequisite for understanding the theory behind machine learning, especially if you are
working with deep learning algorithms.
In linear algebra, data is represented by linear equations, which are presented in the form
of matrices and vectors. Therefore, you are mostly dealing with matrices and vectors
rather than with scalars (we will cover these terms in the following section). When you
have the right libraries, like Numpy, at your disposal, you can compute complex matrix
multiplication very easily with just a few lines of code. (Note: this section ignores
concepts of linear algebra that are not important for machine learning.)
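For instance, a minimal NumPy sketch (the matrix shapes and values here are arbitrary illustrations, not from the text):

import numpy as np
A = np.random.rand(3, 4)    # a random 3x4 matrix
B = np.random.rand(4, 2)    # a random 4x2 matrix
C = A @ B                   # matrix multiplication in a single line
print(C.shape)              # (3, 2)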
You don’t need to understand linear algebra before getting started with machine learning,
but at some point, you may want to gain a better understanding of how the
different machine learning algorithms really work under the hood. This will help you to
make better decisions during a machine learning system’s development. So if you really
want to be a professional in this field, you will have to master the parts of linear algebra
that are important for machine learning.
1. SCALAR
A scalar is simply a single number.
2. VECTOR
In deep learning, data is often represented as vectors (arrays of numbers) and scalars
(single numbers). Vectors can represent various data types like images, text
embeddings, or numerical features.
A vector is an ordered array of numbers and can be in a row or a column. A vector has
just a single index, which can point to a specific value within the vector. For example,
V2 refers to the second value within the vector, which is -8 in the graphic above.
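A small NumPy sketch of such a vector (only the second value, -8, is taken from the example above; the other entries are made up). Note that NumPy indexing starts at 0, so V2 corresponds to index 1:

import numpy as np
v = np.array([5, -8, 3])   # a vector with three entries
print(v[1])                # the second value of the vector -> -8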
MATRIX
A matrix is an ordered 2D array of numbers and it has two indices. The first one points
to the row and the second one to the column. For example, M23 refers to the value in
the second row and the third column, which is 8 in the yellow graphic above. A matrix
can have an arbitrary number of rows and columns. Note that a vector is also a matrix, but
with only one row or one column.
The matrix in the example in the yellow graphic is a 2-by-3 matrix
(rows × columns). Below you can see another example of a matrix along with its
notation:
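A corresponding NumPy sketch (only the 8 in the second row, third column is taken from the example; the other entries are made up):

import numpy as np
M = np.array([[1, 3, 5],
              [4, 0, 8]])   # a 2-by-3 matrix
print(M.shape)              # (2, 3) -> rows x columns
print(M[1, 2])              # M23: second row, third column -> 8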
TENSOR
You can think of a tensor as an array of numbers, arranged on a regular grid, with a
variable number of axes. A third-order tensor, for instance, has three indices, where the
first one points to the row, the second to the column, and the third one to the axis. For
example, T232 points to the
second row, the third column, and the second axis. This refers to the value 0 in the right
tensor in the graphic below:
Tensor is the most general term for all of these concepts, because a tensor is a
multidimensional array and can be a vector or a matrix, depending on the number of
indices it has. For example, a first-order tensor is a vector (1 index), a second-order
tensor is a matrix (2 indices), and tensors with 3 or more indices are called
higher-order tensors.
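A quick NumPy illustration of how the number of axes distinguishes these objects (the values are placeholders):

import numpy as np
vector = np.array([1, 2, 3])                  # 1 axis  -> first-order tensor
matrix = np.array([[1, 2], [3, 4]])           # 2 axes  -> second-order tensor
tensor = np.zeros((2, 3, 2))                  # 3 axes  -> third-order tensor
print(vector.ndim, matrix.ndim, tensor.ndim)  # 1 2 3
print(tensor[1, 2, 1])                        # T232 in zero-based indexing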
1. MATRIX-SCALAR OPERATIONS
If you multiply, divide, subtract, or add a scalar and a matrix, the operation is applied to
every element of the matrix. The image below illustrates this perfectly for multiplication:
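A brief NumPy sketch of this behavior (the matrix entries are illustrative):

import numpy as np
M = np.array([[1, 2], [3, 4]])
print(M * 2)      # every element is multiplied by the scalar
print(M + 10)     # every element has the scalar added to it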
2. MATRIX-VECTOR MULTIPLICATION
Multiplying a matrix by a vector can be thought of as taking the dot product of each row
of the matrix with the vector. The output will be a vector that has the same number of
rows as the matrix. The image below shows how this works:
To better understand the concept, we will go through the calculation of the second image.
To get the first value of the resulting vector (16), we take the numbers of the vector we
want to multiply with the matrix (1 and 5), and multiply them with the numbers of the
first row of the matrix (1 and 3). This looks like this:
1*1 + 3*5 = 16
We do the same for the remaining rows of the matrix:
4*1 + 0*5 = 4
2*1 + 1*5 = 7
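The same calculation in NumPy, using the matrix rows and vector from the worked example above:

import numpy as np
M = np.array([[1, 3],
              [4, 0],
              [2, 1]])
v = np.array([1, 5])
print(M @ v)       # [16  4  7] -> one value per row of the matrix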
3. MATRIX-MATRIX MULTIPLICATION
Multiplying two matrices together isn’t that hard either if you know how to multiply a
matrix by a vector. Note that you can only multiply matrices together if the number of
the first matrix’s columns matches the number of the second matrix’s rows. The result
will be a matrix with the same number of rows as the first matrix and the same number
of columns as the second matrix. It works as follows:
You simply split the second matrix into column-vectors and multiply the first matrix
separately by each of these vectors. Then you put the results in a new matrix
(without adding them up!). The image below explains this step by step:
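A NumPy sketch of this column-by-column view (the matrices are illustrative examples, not those from the image):

import numpy as np
A = np.array([[1, 3],
              [4, 0]])
B = np.array([[1, 2],
              [5, 6]])
# multiply A by each column of B separately, then stack the results as columns
cols = [A @ B[:, j] for j in range(B.shape[1])]
print(np.column_stack(cols))   # identical to A @ B
print(A @ B)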
Matrix multiplication has several properties that allow us to bundle a lot of computation
into one matrix multiplication. We will discuss them one by one below. We will start by
explaining these concepts with Scalars and then with matrices because this will give you
a better understanding of the process.
1. NOT COMMUTATIVE
Scalar multiplication is commutative but matrix multiplication is not. This means that
when we are multiplying scalars, 7*3 is the same as 3*7. But when we multiply matrices
by each other, A*B isn’t the same as B*A.
2. ASSOCIATIVE
Scalar and matrix multiplication are both associative. This means that the scalar
multiplication 3 * (5 * 3) is the same as (3 * 5) * 3 and that the matrix multiplication
A(BC) is the same as (AB)C.
3. DISTRIBUTIVE
Scalar and matrix multiplication are also both distributive. This means that
3(5 + 3) is the same as 3*5 + 3*3 and that A(B+C) is the same as A*B + A*C.
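These three properties can be checked numerically with a short NumPy sketch (the matrices are arbitrary examples):

import numpy as np
A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [5, 2]])
C = np.array([[2, 0], [1, 3]])
print(np.array_equal(A @ B, B @ A))                # False: not commutative
print(np.array_equal(A @ (B @ C), (A @ B) @ C))    # True: associative
print(np.array_equal(A @ (B + C), A @ B + A @ C))  # True: distributive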
4. IDENTITY MATRIX
The identity matrix is a special kind of matrix but first, we need to define what an Identity
is. The number 1 is an identity because everything you multiply with 1 is equal to itself.
Therefore every matrix that is multiplied by an identity matrix is equal to itself. For
example, matrix A times its identity-matrix is equal to A.
You can spot an identity matrix by the fact that it has ones along its main diagonal and that
every other value is zero. It is also a “square matrix,” meaning that its number of rows
matches its number of columns.
We previously discussed that matrix multiplication is not commutative but there is one
exception, namely if we multiply a matrix by an identity matrix. Therefore, the following
equation is true: A*I = I*A = A
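In NumPy, np.eye builds an identity matrix, and the equation above can be verified directly (A is an arbitrary example):

import numpy as np
A = np.array([[2, 7], [1, 4]])
I = np.eye(2, dtype=int)
print(np.array_equal(A @ I, A) and np.array_equal(I @ A, A))   # True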
The matrix inverse and the matrix transpose are two special kinds of matrix properties.
Again, we will start by discussing how these properties relate to real numbers and then
how they relate to matrices.
1. INVERSE
First of all, what is an inverse? A number that is multiplied by its inverse is equal to 1.
Note that every number except 0 has an inverse. If you multiply a matrix by its inverse,
the result is its identity matrix. The example below shows what the inverse of scalars
looks like:
But not every matrix has an inverse. You can only compute the inverse of a matrix if it is a
“square matrix” and if it is invertible (not every square matrix is). Discussing exactly which
matrices have an inverse is unfortunately out of the scope of this section.
The image below shows a matrix multiplied by its inverse, which results in a 2-by-2
identity matrix.
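In NumPy, np.linalg.inv computes the inverse of an invertible square matrix (the matrix below is an arbitrary example):

import numpy as np
A = np.array([[4.0, 7.0], [2.0, 6.0]])
A_inv = np.linalg.inv(A)
print(A @ A_inv)   # approximately the 2-by-2 identity matrix (up to floating-point error)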
2. TRANSPOSE
And lastly, we will discuss the matrix transpose property. This is basically the mirror
image of a matrix, along a 45-degree axis. It is fairly simple to get the transpose of a
matrix: the first column of the matrix becomes the first row of its transpose, and the second
column becomes the second row. An m*n matrix is transformed into an n*m matrix. Also,
the element Aij of A is equal to the element Aji of the transpose. The image
below illustrates that:
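A short NumPy demonstration (the matrix entries are illustrative):

import numpy as np
A = np.array([[1, 2, 3],
              [4, 5, 6]])      # a 2x3 matrix
print(A.T)                     # its 3x2 transpose
print(A[0, 2] == A.T[2, 0])    # True: A[i, j] equals A.T[j, i]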
In deep learning, various types of errors can occur during model training and evaluation.
These errors affect the performance and accuracy of the model in different ways. Here
are some common types of errors:
1. Training Error: This error occurs during the training phase when the model
makes predictions on the training data. It measures the difference between the
predicted output and the actual target for the training examples. The goal of
training is to minimize this error by adjusting the model's parameters.
2. Validation Error: Validation error is calculated by evaluating the model's
performance on a separate dataset (validation set) that the model hasn't seen
during training. It helps in assessing how well the model generalizes to new,
unseen data. High validation error indicates poor generalization.
3. Test Error: Test error measures the performance of the trained model on a
completely independent dataset (test set). This set is kept entirely separate from
both the training and validation sets. Test error provides an estimate of the model's
performance in real-world scenarios.
4. Underfitting: Underfitting occurs when a model is too simple to capture the
complexity of the data, resulting in poor performance on both the training and
validation sets. The model fails to learn the patterns in the data.
5. Overfitting: Overfitting happens when a model learns to perform exceptionally
well on the training data but fails to generalize to new, unseen data. It occurs when
the model becomes too complex, capturing noise or specific details in the training
data that don't generalize well (illustrated in the sketch after this list).
6. Bias Error (or Constant Error): Bias error represents the inability of a model
to capture the true relationship between input and output. High bias often leads to
underfitting, indicating that the model is too simple to represent the underlying
patterns.
7. Variance Error: Variance error occurs due to the model's sensitivity to
fluctuations in the training data. High variance can lead to overfitting, indicating
that the model is too sensitive to the specific data points in the training set.
8. Data Leakage: Data leakage happens when information from the validation or
test set unintentionally influences the model during training. This leads to an
overly optimistic estimation of the model's performance.
9. Sampling Error: Sampling error occurs when the training, validation, or test
datasets are not representative of the entire population, leading to biased model
evaluations and inaccurate predictions.
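The following minimal sketch makes underfitting, overfitting, bias, and variance (items 4-7 above) concrete by fitting polynomials of different complexity to noisy data; the data and degrees are illustrative choices, not from these notes:

import numpy as np

rng = np.random.default_rng(42)
x_train = np.linspace(0, 1, 20)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, x_train.size)
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, x_test.size)

for degree in (1, 3, 15):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # degree 1 underfits (high bias: high train and test error), degree 15 overfits
    # (high variance: low train error, high test error), degree 3 is a reasonable balance
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")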
Bias-Variance Trade-Off
Striking a balance between accuracy and the ability to make predictions beyond the
training data in an ML model is called the bias-variance tradeoff.
What is bias?
Bias in machine learning refers to the difference between a model’s predictions and the
actual distribution of the value it tries to predict. Models with high bias oversimplify the
data distribution rule/function, resulting in high errors in both the training outcomes and
test data analysis results.
Bias is a systematic error that occurs due to incorrect assumptions in the machine
learning process, leading to the misrepresentation of data distribution.
The level of bias in a model is heavily influenced by the quality and quantity of training
data involved. Using insufficient data will result in flawed predictions. At the same time,
it can also result from the choice of an inappropriate model.
Bias is problematic because it indicates that your model does not accurately represent
the data it is trying to predict. However, in some situations, it may be acceptable to limit
your model to a specific area, such as a specialized medical model for women under 30,
or to add additional human control, such as calling an operator when the model is
uncertain. The primary challenge is that you may be unable to identify all of these
instances during the training and testing phase with complete certainty, which may result
in unforeseen difficulties. Therefore, you should be prepared to address such issues as
they arise.
What is variance?
Variance stands in contrast to bias; it measures how much the function a model learns
differs when it is trained on different sets of data. The most common approach to measuring
variance is by performing cross-validation experiments and looking at how the model
performs on different random splits of your training data.
A model with a high level of variance depends heavily on the training data and,
consequently, has a limited ability to generalize to new, unseen data. This can result
in excellent performance on training data but significantly higher error rates during
model verification on the test data. Nonlinear machine learning algorithms often have
high variance due to their high flexibility.
A complex model can learn complicated functions, which leads to higher variance.
However, if the model becomes too complex for the dataset, high variance can result in
overfitting. Low variance indicates a limited change in the target function in response to
changes in the training data, while high variance means a significant difference.
High variance typically shows up in the following ways:
Low testing accuracy. Despite high accuracy on training data, high-variance
models tend to perform poorly on test data.
Overfitting. A high-variance model often leads to overfitting as it becomes too
complex.
Overcomplexity. As researchers, we expect that increasing the complexity of a
model will result in improved performance on both training and testing data sets.
However, when a model becomes too complex and a simpler model may provide
the same level of accuracy, it’s better to choose the simpler one.
Model overcomplexity and model overfitting are related, but they are not necessarily the
same thing.
To put it simply, model overfitting occurs when a model is too complex for the
distribution it is trying to predict. For instance, attempting to predict non-linear data with
a linear model.
Model overcomplexity, on the other hand, happens when a model has too many
parameters or a structure that is too complex to be effectively trained on the provided
data. For instance, attempting to fit a neural network with 1.5 million parameters when
only 100 objects are available. Such a model may fit the data well and have little bias,
but it will likely have high variance as there is not enough data to capture the proper
patterns with so many parameters.
Variance indicates that your model is not reliable. Suppose minor changes in the input
data result in significant changes in the output. In that case, the model has not properly
extracted the underlying patterns, and its decision making cannot be trusted. While
having minor and predictable errors is acceptable, it’s a significant issue when the model
performs well for some objects but completely fails for others. One possible approach to
addressing this issue is to use a rule-based system in post-production, such as
implementing a rule like “do not sell anything with a margin greater than 50%.”
Now let’s take a look at the different combinations of bias and variance in machine
learning models and the results they provide.
A machine learning model with low bias and low variance is considered ideal, but this is
not often the case in machine learning practice, so we can speak of “reasonable bias” and
“reasonable variance.”
With high bias and low variance, predictions are consistent but inaccurate on average.
This happens when the model doesn’t learn well from the training data or has too few
parameters, leading to underfitting issues.
With both high bias and high variance, the predictions are both inconsistent and
inaccurate on average.
As we have learned, bias and variance are interdependent. In other words, lowering a
model’s bias leads to an increase in its variance and vice versa. This relationship between
bias and variance is known as the bias-variance tradeoff.
The balance between bias and variance can be adjusted in specific algorithms by
modifying parameters, as seen in the following examples:
For k-nearest neighbors, a low bias and high variance can be corrected by
increasing the value of k, which increases the bias and decreases the variance
(see the sketch after this list).
For support vector machines, a low bias and high variance can be altered by
decreasing the C parameter, which allows more margin violations and thereby
increases the bias but decreases the variance.
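A small sketch of the k-nearest neighbors point above (assuming scikit-learn is available; the data is synthetic and purely illustrative):

import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, size=200)   # noisy sine wave
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in (1, 5, 25, 100):
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, knn.predict(X_train))
    test_mse = mean_squared_error(y_test, knn.predict(X_test))
    # small k: low bias / high variance; large k: high bias / low variance
    print(f"k={k:3d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")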
Unfortunately, it’s impossible to determine the actual bias and variance error terms while
we are trying to predict the target function. However, bias and variance serve as useful
frameworks to understand the performance of machine learning algorithms in making
predictions.
Vector calculus and optimization are crucial branches of mathematics with widespread applications in
physics, engineering, computer science, and various other fields. Here's a brief review of key concepts
from each:
Vector Calculus:
1. Vectors and Vector Operations:
Vectors represent quantities with both magnitude and direction.
Vector operations include addition, subtraction, scalar multiplication, and dot product.
2. Vector Functions:
Vector-valued functions map real numbers to vectors.
Derivatives of vector functions involve partial derivatives for each component.
Example: for the vector-valued function r(t) = (cos t, sin t, t), differentiating each
component gives r'(t) = (-sin t, cos t, 1). (This particular function is an illustrative
choice, not from the original notes.)
3. Line Integrals:
Integrating a scalar or vector field along a curve.
4. Surface Integrals:
Integrating a scalar or vector field over a surface.
Optimization:
1. Gradient-Based Optimization:
Deep learning models are trained by iteratively adjusting their parameters to
reduce a loss function, typically with gradient descent or one of its variants.
Adaptive learning rate methods like Adam adjust the learning rate for each
parameter individually.
2. Loss Functions:
The objective function in deep learning is often a loss function, which
measures the difference between the predicted and true values.
Cross-entropy, mean squared error, and hinge loss are common loss
functions used in different types of neural network tasks.
3. Regularization:
Regularization terms, often based on L1 or L2 norms, are added to the
objective function to prevent overfitting.
They penalize large weights in the network, promoting a simpler model.
4. Hyperparameter Tuning:
The optimization of hyperparameters, such as learning rate and momentum,
is a crucial aspect of training deep neural networks.
Grid search, random search, and more advanced optimization techniques
are used for hyperparameter tuning.
5. Convexity Issues:
Deep learning problems are often non-convex, meaning they can have
multiple local minima.
Despite the lack of convexity, stochastic gradient descent variants often
perform well in practice.
6. Automatic Differentiation:
Automatic differentiation is a computational technique that automatically
calculates gradients of a computational graph.
It is essential for implementing backpropagation efficiently in deep learning
frameworks.
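As a small illustration (assuming PyTorch is available), automatic differentiation can compute the gradient of y = 4x², the same function we optimize by hand later in this unit:

import torch
x = torch.tensor(3.0, requires_grad=True)
y = 4 * x ** 2     # build the computational graph
y.backward()       # automatic differentiation via backpropagation
print(x.grad)      # tensor(24.) since dy/dx = 8x = 24 at x = 3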
1. Introduction to Optimization:
Optimization involves finding the best solution to a problem, often maximizing or
minimizing a function.
2. Local and Global Extrema:
Local extrema occur at critical points; global extrema are the absolute maximum or
minimum.
3. Optimization Techniques:
Gradient Descent: Iterative method to minimize a function by moving in the direction
of steepest descent.
Lagrange Multipliers: Technique for constrained optimization.
4. Convex Optimization:
Problems with convex objective functions and constraints have efficient algorithms and
unique solutions.
5. Optimization in Machine Learning:
Widely used in training models to minimize a loss function.
Regularization techniques to prevent overfitting.
6. Linear Programming:
Optimization of a linear objective function subject to linear equality and inequality
constraints.
Till now we have seen the parameters required for gradient descent. Now let us map the
parameters with the gradient descent algorithm and work on an example to better
understand gradient descent. Let us consider the parabolic equation y = 4x². By looking at
the equation we can identify that the parabolic function is minimum at x = 0, i.e. at x = 0,
y = 0. Therefore x = 0 is the local minimum of the parabolic function y = 4x². Now let us
see the algorithm for gradient descent and how we can obtain the local minimum by
applying gradient descent:
Algorithm for Gradient Descent
Steps should be made in proportion to the negative of the function gradient (move away
from the gradient) at the current point to find local minima. Gradient Ascent is the
procedure for approaching a local maximum of a function by taking steps proportional to
the positive of the gradient (moving towards the gradient).
repeat until convergence
{
w = w - (learning_rate * (dJ/dw))
b = b - (learning_rate * (dJ/db))
}
Step 1: Initializing all the necessary parameters and deriving the gradient function for the
parabolic equation 4x². The derivative of x² is 2x, so the derivative of the parabolic
equation 4x² will be 8x.
x0 = 3 (random initialization of x)
learning_rate = 0.01 (to determine the step size while moving towards the local minimum)
Iteration 1:
x1 = x0 - (learning_rate * gradient)
x1 = 3 - (0.01 * (8 * 3))
x1 = 3 - 0.24
x1 = 2.76
Iteration 2:
x2 = x1 - (learning_rate * gradient)
x2 = 2.76 - (0.01 * (8 * 2.76))
x2 = 2.76 - 0.2208
x2 = 2.5392
Iteration 3:
x3 = x2 - (learning_rate * gradient)
x3 = 2.5392 - (0.01 * (8 * 2.5392))
x3 = 2.5392 - 0.203136
x3 = 2.3360
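A minimal Python sketch of this loop; with x0 = 3 and learning_rate = 0.01 the first iterations reproduce the hand-computed values 2.76, 2.5392, and 2.3360 (the stopping threshold is an illustrative choice, discussed below):

x = 3.0                    # random initialization of x
learning_rate = 0.01       # step size
stopping_threshold = 1e-6

while True:
    gradient = 8 * x                     # derivative of y = 4x^2
    x_new = x - learning_rate * gradient
    if abs(x_new - x) < stopping_threshold:
        break
    x = x_new

print("x converged to", round(x, 4))     # close to the local minimum at x = 0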
From the above three iterations of gradient descent, we can notice that the value of x is
decreasing iteration by iteration and will slowly converge to 0 (the local minimum) by
running gradient descent for more iterations. Now you might have a question: for how many
iterations should we run gradient descent?
We can set a stopping threshold i.e. when the difference between the previous and the present
value of x becomes less than the stopping threshold we stop the iterations. When it comes to
the implementation of gradient descent for machine learning algorithms and deep learning
algorithms we try to minimize the cost function in the algorithms using gradient descent. Now
that we are clear with the gradient descent’s internal working, let us look into the python
implementation of gradient descent where we will be minimizing the cost function of the
linear regression algorithm and finding the best fit line. In our case, the parameters are as
mentioned below:
Prediction Function
The prediction function for the linear regression algorithm is a linear equation given by
y=wx+b.
prediction_function (y) = (w * x) + b
Here, x is the independent variable
y is the dependent variable
Cost Function
The cost function is used to calculate the loss based on the predictions made. In linear
regression, we use mean squared error to calculate the loss. Mean Squared Error is the
average of the squared differences between the actual and predicted values.
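One possible Python implementation of this procedure, putting the prediction function, cost function, and gradient descent updates together (the synthetic data and hyperparameters are illustrative assumptions, not from the notes):

import numpy as np

# synthetic data roughly following y = 3x + 2
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3 * x + 2 + rng.normal(0, 1, 100)

w, b = 0.0, 0.0
learning_rate = 0.01

for _ in range(5000):
    y_pred = w * x + b                 # prediction function y = wx + b
    error = y_pred - y
    cost = np.mean(error ** 2)         # mean squared error
    dJ_dw = 2 * np.mean(error * x)     # gradient of the cost w.r.t. w
    dJ_db = 2 * np.mean(error)         # gradient of the cost w.r.t. b
    w -= learning_rate * dJ_dw         # update rule from the algorithm above
    b -= learning_rate * dJ_db

print(f"best fit line: y = {w:.2f}x + {b:.2f}, final cost = {cost:.4f}")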
But how does this apply to neural networks and deep learning?
There are many types of loss functions, each of which is used in certain
roles. Instead of getting too caught up in which loss function is being used,
think of the training process this way:
1. We take a set of input data along with its target values
2. We pass the data through the network to obtain predictions
3. We compute the loss/cost function, which tells us how good/bad of a job
we did at making the correct prediction
4. We compute the gradient of the loss and update the network’s weights to
reduce it
We do this over and over again until our model is said to “converge” and is
able to make reliable, accurate predictions.
There are many types of gradient descent algorithms, but the types we’ll
be focusing on here today are:
1. Vanilla (batch) gradient descent
2. Stochastic Gradient Descent (SGD)
3. Mini-batch SGD
In the most basic form of gradient descent, which I like to call vanilla
gradient descent, we update the weights of the network only once per pass
over the entire dataset.
In vanilla gradient descent we only update the network’s weights once per
iteration, meaning that the network sees the entire dataset every time a
weight update is performed.
Furthermore, the larger your dataset gets, the more nuanced your
gradients can become, and if you’re only updating the weights once per
epoch then you’re going to be spending the majority of your time
computing predictions and not much time actually learning (which is
the goal of an optimization problem, right?)
Luckily, there are other variations of gradient descent that address this
problem.
Stochastic Gradient Descent (SGD)
Stochastic gradient descent takes the opposite approach: instead of waiting
for a full pass over the dataset, it performs a weight update for every single
data point.
Until convergence:
Randomly select a data point from our dataset
Make a prediction on it
Compute the loss and the gradient
Update the parameters of the network
That said, performing N weight updates per epoch (where N is equal to the
total number of data points in our dataset) is also a bit computationally
wasteful — we’ve now swung to the other side of the pendulum.
Mini-batch SGD
While SGD can converge faster for large datasets, we actually run into
another problem — we cannot leverage our vectorized libraries that make
training super fast (again, because we are only passing one data point at
a time through the network).
Mini-batch SGD is the compromise: instead of a single data point, each
weight update is computed on a small batch of data points, which lets us
exploit vectorized operations while still updating the weights many times
per epoch.
If you visualize each mini-batch directly then you’ll see a very noisy plot,
such as the following one:
Figure 4: Plotting the loss of every mini-batch can lead to a very noisy plot.
But when you average out the loss across all mini-batches the plot is
actually quite stable:
Figure 5: Averaging the mini-batch loss over the course of an entire epoch leads
to more stable-looking plots.
Note: Depending on what deep learning library you are using you may see
both types of plots.
When you hear deep learning practitioners talking about SGD they
are more than likely talking about mini-batch SGD.
SGD has a problem when navigating areas of the loss landscape that are
significantly steeper in one dimension than in others (which you’ll see
around local optima).
When this happens, it appears that SGD simply oscillates across the slopes of
the ravine instead of descending into areas of lower loss (and ideally higher accuracy).
Momentum addresses this by accumulating an exponentially weighted average of
past gradients, which damps the oscillations and speeds up progress in the
consistent downhill direction. Get used to seeing momentum when using SGD —
it is used in the majority of neural network experiments that apply SGD.
The problem with momentum is that once you develop a head of steam,
the train can easily become out of control and roll right over our local
minima and back up the hill again.
Nesterov acceleration accounts for this and helps us recognize when the
loss landscape starts sloping back up again.
Momentum-based Optimization:
The update rule for the Momentum-based Gradient Optimizer can be expressed as
v = beta * v + (1 - beta) * grad(x), followed by x = x - alpha * v, where beta is the
momentum coefficient and alpha is the learning rate, as in the following code:
# Objective function: f(x) = x^2 - 4x + 4, which is minimized at x = 2
def obj_func(x):
    return x * x - 4 * x + 4

# Gradient (derivative) of the objective function
def grad(x):
    return 2 * x - 4

# Starting point and hyperparameters (assumed values; not given in the original)
x = 0.0        # initial guess
alpha = 0.1    # learning rate
beta = 0.9     # momentum coefficient

# Number of iterations
iterations = 0
v = 0

while True:
    iterations += 1
    v = beta * v + (1 - beta) * grad(x)   # accumulate the velocity (momentum) term
    x_prev = x
    x = x - alpha * v                     # step in the direction of the velocity
    if x_prev == x:                       # stop once the update no longer changes x
        print("Done optimizing the objective function.")
        break

print("x =", x, "obj_func(x) =", obj_func(x), "iterations =", iterations)