Deep Learning Math Background
Abstract
We present a gentle introduction to elementary mathematical notation, with a focus on
communicating deep learning principles. This is a “math crash course” aimed at quickly
equipping scientists with an understanding of the building blocks used in many equations,
formulas, and algorithms that describe deep learning. While this short presentation cannot
replace solid mathematical knowledge that needs multiple courses and years to solidify,
our aim is to allow non-mathematical readers to overcome hurdles of reading texts that
also use such mathematical notation. We describe a few basic deep learning models using
mathematical notation before we unpack the meaning of the notation. In particular, this
text includes an informal introduction to summations, sets, functions, vectors, matrices,
gradients, and a few more objects that are often used to describe deep learning. While this
is a mathematical crash course, our presentation is kept in the context of deep learning
and machine learning models including the sigmoid model, the softmax model, and fully
connected feedforward deep neural networks. We also hint at basic mathematical objects
appearing in neural networks for images and text data.
1 Introduction
By now, deep learning has become an indispensable tool in science, permeating fields such
as astronomy, genomics, climate science, robotics, materials science, and medical science. Its
transformative impact extends beyond traditional boundaries, finding applications not only in
theoretical domains but also in practical, real-world scenarios such as deciphering the intricacies
of the human genome, predicting climate patterns, and enhancing surgical precision in neuro-
surgery. Deep learning serves as the linchpin for advancements in data analysis, image recog-
nition, natural language processing, and decision-making systems, catalyzing breakthroughs
that were once deemed unattainable. The past two decades have seen great advances in deep
learning. See [LeCun et al., 2015] for a brief overview.
There are multiple ways in which one can try and understand the workings of deep learning.
One popular way to describe deep learning is via a hands-on programming approach as in
[Howard and Gugger, 2020]. This is good for someone directly implementing deep learning with
a specific programming language such as Python. However, if one only seeks to understand
what is going on in the realm of deep learning, the use of programming can be too specific and
time consuming. Hence, a more reasonable approach is to understand underlying basic ideas,
often using mathematical notation. This is the approach that we adopted in our recent book,
The Mathematical Engineering of Deep Learning, [Liquet et al., 2024]. Other similar texts that
also require mathematical notation include Understanding Deep Learning [Prince, 2023] and
the more classic Deep Learning [Goodfellow et al., 2016], among others.
In all these texts, mathematical notation is very effective at pinpointing ideas, in a dense
and concise manner. The problem, however, is that many scientists using deep learning are not
always comfortable with such notation. Understanding mathematical notation requires prereq-
uisite knowledge. For example, for our book, we believe that having taken 3 to 4 university-level
mathematics courses is necessary for a thorough appreciation of the deep learning concepts that
we present. Naturally, one cannot obtain such knowledge and experience overnight. Nevertheless,
there is a spectrum between in-depth understanding and complete avoidance, and we believe that
through a short piece such as this one, the reader can gain a basic understanding of the
meaning of the notation, and as a consequence an entry point to deep learning as well.
Hence this work can be treated as a quick guide to mathematical notation in the context
of deep learning.
In pursuit of our goal of starting to understand mathematical notation, we are motivated by three
key models from the world of deep learning and machine learning, which we aptly call Model I,
Model II, and Model III. Model I is the sigmoid model, also known as the logistic model.
Model II is the softmax model, also known as the multinomial regression model. Model III is
the general fully connected feedforward deep neural network, also known as the feedforward
model, and sometimes called a multi-layer perceptron.
Each of these models operates on some input which we denote as x and creates an output
which we denote as ŷ. The input x can be a series of numbers, an image, a voice signal, or
essentially any structured input. The output ŷ is a probability between 0 and 1 for Model I.
It is a list (or vector) of such probabilities, summing up to 1 for Model II. And in the case of
Model III, it can be either a probability, a vector of probabilities, or any other type of output
that our application dictates.
In these cases, models are presented with an input feature vector x and return an output ŷ.
With the sigmoid model, the output is a probability value indicating the likelihood of a binary
event, serving as a foundational element in binary classification tasks. The softmax model, on
the other hand, extends this concept to multiple classes, assigning probabilities to each class for
multi-class classification problems. In the case of the general fully connected neural network,
its flexibility allows for intricate mappings between inputs and outputs, enabling the network
to learn complex relationships within the data. Understanding these fundamental models is a
crucial step towards unraveling the mathematical framework that underpins deep learning.
Deep learning is the area where models are created (training) and then used (produc-
tion/inference) for solving machine learning or artificial intelligence tasks. The types of models
and tasks are too numerous to cover in this presentation. Instead, let us focus on a simple
supervised learning task, encompassing both binary and multi-class classification. In a binary
classification scenario, a deep learning model is presented with inputs and aims to determine a
binary outcome. For instance, in a medical diagnosis context, the model might assess whether
an input image x is associated with a particular pathology or not. This may be presented in
the form of a probability ŷ.
Similarly, in multi-class classification, the model, still operating on such an input image x,
may determine which class from a finite collection best represents the input. In a medical
image classification scenario, particularly in the realm of diagnosing brain diseases, the deep
learning model is presented with an input brain scan and tasked with classifying it into one of
several possible conditions. For instance, the model may need to distinguish between various
types of brain tumors, such as glioblastoma, meningioma, and pituitary adenoma. The output
of the model is typically a list (or vector) of probabilities associated with each of these classes
(brain tumor types). The class with the highest probability is typically chosen as the model’s
prediction.
The remainder of this document is structured as follows. In Section 2 we see a mathematical
description of models I, II, and III. Our aim with the early introduction of the models is to
motivate the mathematical sections that follow where we unpack the notation presented for
these models. Hence a reader reading Section 2 should not be intimidated by the notation, but
rather try to embrace it as it is unpacked and described in the sections that follow. In Section 3
we review summation notation through the application of data standardization (mean and
variance computation). Most scientists would have seen and used such summation notation
before, hence this section is mostly a review. In Section 4 we present an elementary view of
sets, and functions. This is in no way an exhaustive description of set theory, but rather aims
to place the reader in the mindset of set notation, and especially the “in” symbol, ∈, that is
often used, as well as the notation R for the set of real numbers and the declaration of functions
using the arrow, →, from the input set to the output set. In Section 5 we outline the notion
of vectors, which in greatest simplicity can be viewed as lists of numbers. Ideas such as inner
products and norms are also introduced. In Section 6 we discuss matrices and in particular
matrix-vector multiplication. This operation is central in deep learning. We then comment on
additional aspects of vectors and matrices in Section 7. In Section 8 we overview the concept of
gradients, as these are the objects used when training deep learning models. We also see a basic
version of the gradient descent algorithm. Then in Section 9 we put a few of the mathematical
pieces together, shining more light on fully connected feedforward deep neural networks. In
Section 10 we briefly discuss mathematical ideas used in a few extension models, including
convolutional neural networks, recurrent neural networks, and the attention mechanism which
is used in many contemporary large language models. We finally conclude in Section 11.
date them, as it is expected that much of the notation of the equations may be new or forgotten
territory. In places where the reader requires more clarity, it is recommended that the reader
treat it as motivation for the sections below, where mathematical background is provided, and
the equations from this section are further explained.
Model I: The sigmoid model, also known as the logistic model. Using equations, this
model can be described as,

    ŷ = σSig(b0 + w⊤x),   with   σSig(z) = 1/(1 + e^{−z}).        (1)
Here the input to the model is a list of p numbers, represented all together as x, which we can
also call a vector, and denote as x ∈ Rp . The parameters of the model are the number (scalar)
b0 which is called the bias, also known as the intercept, and the vector w ∈ Rp , which is called
the vector of weights. The operation b0 + w⊤ x means adding the scalar b0 to the inner product
operation, w⊤x, which also computes to a scalar. We denote the outcome
of this operation as z = b0 + w⊤x. Note that this addition of a bias to an inner product is
sometimes called an affine transformation, and can also be written as,
    z = b0 + Σ_{i=1}^{p} wi xi,        (2)

where the summation Σ_{i=1}^{p} wi xi equals the inner product w⊤x.
Figure 1: The sigmoid model represented with neural network terminology as a shallow neural
network. Observe that the artificial neuron is composed of an affine transformation using z = b0 +w⊤ x
followed by a non-linear activation transformation σSig (z).
If the model did not have σSig( ), it would be a linear model, which we do not
cover here. But with it, σSig( ) denotes the sigmoid function, which takes any scalar input z and
produces a probability output ŷ = σSig(z). The reader can experiment with computing 1/(1 + e−z)
using a calculator or spreadsheet for several values of z, to see that it always yields numbers
that are between 0 and 1; for example, with z = 0 we have σSig(0) = 1/2. In a deep learning
framework, the function, σSig ( ), is a type of activation function and the form of (1) represents
an artificial neuron which we also present in Figure 1.
The sigmoid model can be viewed as the simplest non-linear neural network model. Outside
of the context of deep learning, the sigmoid model is very popular in statistics, and known
as logistic regression. This model is suitable for binary classification tasks where positive
samples are encoded via y = 1 and negative samples are encoded via y = 0. The output of the
model, ŷ, is a number in the continuous range [0, 1] indicating the probability that the feature
vector x matches a positive label. Hence, the higher the value of ŷ, the more likely it is that
the label associated with x is y = 1. A classifier can be constructed via a decision rule based
on a threshold τ , with the predicted output being,
    Ŷ = 0 (negative) if ŷ ≤ τ,   and   Ŷ = 1 (positive) if ŷ > τ.        (3)
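To make these equations concrete, here is a minimal Python (NumPy) sketch of Model I; the values of b0, w, x, and τ below are made up purely for illustration:

import numpy as np

def sigmoid(z):
    # The sigmoid function of (1): maps any real z to a value in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Made-up parameters and input with p = 3 features (illustrative values only)
b0 = -0.5                        # bias (intercept), a scalar
w = np.array([0.8, -1.2, 0.3])   # weight vector w
x = np.array([1.0, 0.5, 2.0])    # input feature vector x

z = b0 + w @ x        # the affine transformation of (2): b0 plus the inner product of w and x
y_hat = sigmoid(z)    # probability output of Model I, as in (1)

tau = 0.5             # threshold for the decision rule of (3)
Y_hat = 1 if y_hat > tau else 0
print(z, y_hat, Y_hat)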
Model II: The softmax model, also known as the multinomial regression model.
Using equations, this model can be described as,

    ŷ = SSoftmax(b + W x),   with   SSoftmax(z) = (1 / Σ_{i=1}^{K} e^{zi}) (e^{z1}, . . . , e^{zK}),        (4)

where z = b + W x is a vector in RK.
Here like the sigmoid model, the input, denoted x, is a vector of dimension p, and as before
we can denote this as x ∈ Rp . However the output, ŷ is no longer a single probability, as in
Model I, but is rather a vector of probabilities of length K, where the individual probabilities
sum up to 1. The parameters of the model are also different. The parameter b plays the role
of b0 from Model I and is no longer a scalar, but is rather a vector of dimension K, and we
denote this as b ∈ RK . It is called the bias vector. As for the weights, unlike Model I where the
weights were in a vector, here we have a matrix of weights, or weight matrix, W , of dimensions
K and p, where K is the number of matrix rows, and p is the number of matrix columns. This
can be denoted as W ∈ RK×p .
Like Model I in (1), here, the operation of the model in (4) denotes an intermediate variable
z. The difference is that z is a vector of dimension K, denoted z ∈ RK . Here to construct z we
add the vector b to the vector produced by the matrix-vector multiplication W x. This means
that for i = 1, . . . , K, each scalar, zi in the vector z ∈ RK is computed as,
    zi = bi + Σ_{j=1}^{p} wi,j xj,        (5)
where bi is the i-th scalar in the vector b and wi,j is the scalar in the matrix W at row i and
column j.
Just as Model I applies the (non-linear) function σSig( ) to the scalar z, with Model II
we apply the function SSoftmax( ) to the vector z. This function is called the softmax and it
[Figure 2 diagram: each neuron i computes zi = bi + Σ_{j=1}^{p} wi,j xj, and the softmax output layer produces ŷi = e^{zi} / Σ_{j=1}^{K} e^{zj} for i = 1, . . . , K.]
Figure 2: The softmax model as a shallow neural network model with output ŷ which is composed
of elements ŷ1 , . . . , ŷK , representing a probability vector.
converts a vector of dimension K to a different vector of the same dimension K, where all
elements of the output vector are probabilities, and they sum up to 1.
The form of (4) positions the softmax model as a shallow neural network similarly to the way
that the sigmoid model is a shallow neural network. The difference is that the softmax model has
vector outputs while the sigmoid model has a scalar output. Figure 2 illustrates the softmax
model as a neural network. Here each circle can again be viewed as an “artificial neuron”;
however, note that the softmax function affects all neurons together via the normalization in
the denominator of the softmax function SSoftmax( ). Hence the activation value of each neuron
is not independent of the activation values of the other neurons.
The softmax model, also known as the multinomial regression model, is the most popular
model for multi-class classification, where the goal is to provide a prediction Ŷ for the label in
{1, . . . , K} associated with the input feature vector x. The softmax model produces an output
vector, ŷ, which is a probability vector, and can be used to create a classifier by choosing the
class that has the highest probability. Namely,

    Ŷ = argmax_{k ∈ {1,...,K}} ŷk,        (6)

and this means to choose the index k from the set {1, . . . , K} for which the entry (probability)
ŷk is highest. This approach is called a maximum a posteriori probability (MAP) decision rule
since it simply chooses Ŷ as the class that is most probable. It is the most common decision
rule when using deep learning models for classification.
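As a companion to (4) – (6), here is a minimal Python (NumPy) sketch of Model II; the parameters W and b and the input x are made-up (untrained) values used purely for illustration:

import numpy as np

def softmax(z):
    # The softmax function of (4): exponentiate, then rescale so the entries sum to 1
    e = np.exp(z)
    return e / e.sum()

# Made-up parameters for p = 3 input features and K = 4 classes
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))    # weight matrix W with K = 4 rows and p = 3 columns
b = rng.normal(size=4)         # bias vector b of dimension K = 4
x = np.array([1.0, 0.5, 2.0])  # input feature vector x

z = b + W @ x                  # each z_i equals b_i plus a sum of w_{i,j} x_j, as in (5)
y_hat = softmax(z)             # vector of K probabilities summing to 1
Y_hat = np.argmax(y_hat) + 1   # MAP decision rule of (6), with classes labeled 1, ..., K
print(y_hat, y_hat.sum(), Y_hat)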
Model III: General fully connected neural networks, also known as feedforward
networks. Using equations, this model can be described as,

    ŷ = f[L]( f[L−1]( · · · f[2]( f[1](x) ) · · · ) ),        (7)

where each individual function f[ℓ]( ), operating on some input u, can be described as,

    f[ℓ](u) = S[ℓ]( b[ℓ] + W[ℓ] u ).        (8)
Here in (7) we see composition of functions, where first the function f [1] ( ) is applied to the
input x, and then the function f [2] ( ) is applied on the output of f [1] (x), and then f [3] ( ) is
applied on the output of f [2] (f [1] (x)), and so forth until the last function f [L] ( ) is applied on
what was computed prior. Each such function application is the operation of a layer in a deep
neural network.
Different types of neural networks have different types of layers, and in the simplest case
of feedforward fully connected neural networks, (8) describes the operation of layer ℓ. Observe
that (8) is somewhat similar to the left side of (4). In the case of (8), we have that S [ℓ] ( ) is
some vector activation function, while in the case of (4) there is only a single layer (and hence
ℓ is absent) and the activation function is the softmax function.
The parameters of each layer ℓ are also similar to Model II (which only has a single layer).
Here with Model III, each layer has a bias vector b[ℓ] and a weight matrix W [ℓ] . The dimensions
of these vectors and matrices are specific to the architecture of the network, and we fill in these
details in Section 9. At this point, let us appreciate that Model III is more complex. For
one thing, it has the rich set of parameters (b[1] , W [1] ), . . . , (b[L] , W [L] ) which can easily involve
hundreds of thousands, millions, or billions of individual scalar parameters. With so many
parameters organized in layers, it is generally much more expressive and can capture complex
relationships in data. Figure 3 presents a schematic of such a deep neural network.
Figure 3: A fully connected feedforward deep neural network with multiple hidden layers. The input
to the network is the vector x = (x1 , x2 , x3 , x4 ) and in this case the output is the scalar ŷ. For this
particular example the dimensions of W [1] are 4 × 4, W [2] is 3 × 4, W [3] is 5 × 3, and W [4] is 1 × 5.
The bias vectors are of dimension 4 for b[1] , 3 for b[2] , 5 for b[3] , and 1 (a scalar) for b[4] .
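For readers who like to see code, here is a minimal Python (NumPy) sketch of the forward pass of the network in Figure 3. The parameters are random (untrained) arrays with the dimensions listed in the caption, and the sigmoid is used for every activation purely for illustration; the actual choice of the activation functions S[ℓ] is discussed in Section 9.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)

# Random (untrained) parameters with the dimensions stated in the caption of Figure 3
Ws = [rng.normal(size=(4, 4)),   # W[1] is 4 x 4
      rng.normal(size=(3, 4)),   # W[2] is 3 x 4
      rng.normal(size=(5, 3)),   # W[3] is 5 x 3
      rng.normal(size=(1, 5))]   # W[4] is 1 x 5
bs = [rng.normal(size=4), rng.normal(size=3), rng.normal(size=5), rng.normal(size=1)]

x = np.array([0.2, -1.0, 0.5, 1.5])   # input vector x = (x1, x2, x3, x4)

# Composition of layers as in (7), where each layer applies (8)
u = x
for W, b in zip(Ws, bs):
    u = sigmoid(b + W @ u)   # f[l](u) = S[l](b[l] + W[l] u), here with a sigmoid activation
y_hat = u[0]                 # the final layer has a single output, so y_hat is a scalar
print(y_hat)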
Assume that our data is x(1), . . . , x(n) where each x(i) is some sample which is a list or vector
of length p. We can also denote x_j^{(i)} as the j-th element (or coordinate) of the i-th data sample.
Hence we can think of the data as an n by p table, or matrix, where each row is a sample, and
each column is a specific part of that sample, sometimes called a feature.
For instance, consider a dataset consisting of medical measurements for patients, where each
patient’s data includes their age, blood pressure, and cholesterol level. This means that p = 3.
As an example, this data can be arranged as a table with one row per patient and one column per
measurement; we refer to this data table as (9). For each feature j we can compute the sample
mean x̄j and the sample variance s²j via

    x̄j = (1/n) Σ_{i=1}^{n} x_j^{(i)},        s²j = (1/n) Σ_{i=1}^{n} ( x_j^{(i)} − x̄j )².        (10)

Further, the sample standard deviation is the square root of the sample variance and is denoted
via sj. Note that in statistics one sometimes divides by n − 1 instead of n for the sample
variance, but this is a detail we will skip here. Our purpose in presenting (10) is also that we
review summation notation (with Σ, i.e., Sigma).
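As a small illustration of these computations, the following Python (NumPy) sketch standardizes a made-up 4 × 3 data matrix; the numbers are invented purely for illustration:

import numpy as np

# Made-up data with n = 4 patients (rows) and p = 3 features (columns):
# age, blood pressure, cholesterol
X = np.array([[63.0, 135.0, 210.0],
              [48.0, 120.0, 180.0],
              [71.0, 150.0, 240.0],
              [55.0, 128.0, 195.0]])

n = X.shape[0]
x_bar = X.mean(axis=0)                   # sample mean of each feature
s2 = ((X - x_bar) ** 2).sum(axis=0) / n  # sample variance of each feature (dividing by n)
s = np.sqrt(s2)                          # sample standard deviation of each feature

X_standardized = (X - x_bar) / s         # each column now has mean 0 and variance 1
print(x_bar, s)
print(X_standardized)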
To review summation notation, consider an expression where we have some list of 4 numbers,
h(1), h(2), h(3), h(4) with h(1) = 2, h(2) = 4, h(3) = 0, and h(4) = 10. Then this arbitrary
expression,

    Σ_{i=1}^{4} h(i),        (11)

is just shorthand for h(1) + h(2) + h(3) + h(4). It thus equals 16 in this example. One can think
of the variable i as “running” from i = 1 all the way up to i = 4. We could have also used
j instead of i or any other variable name. One could have obviously had more complicated
expressions with summation notation, such as for example,
    Σ_{i=1}^{4} (h(i) + h(5−i) − 11)²,        (12)
implies summing up all of the square roots of the elements of the set B (as the reader may verify,
the result is approximately 31.77). Note that our previous way of using summation notation
can also be written in terms of sets. For example, the summation in (11) can be written as,
    Σ_{i∈{1,2,3,4}} h(i),        (15)
as this shows that the variable i runs on each element of the set {1, 2, 3, 4}.
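For readers who find code helpful, here is how these summations look in Python (a small illustrative sketch):

# The summations in (11), (12), and (15) written directly in Python
h = {1: 2, 2: 4, 3: 0, 4: 10}   # the values h(1), h(2), h(3), h(4) from the text

total_11 = sum(h[i] for i in range(1, 5))                         # equation (11): equals 16
total_12 = sum((h[i] + h[5 - i] - 11) ** 2 for i in range(1, 5))  # equation (12)
total_15 = sum(h[i] for i in {1, 2, 3, 4})                        # equation (15): i runs over a set
print(total_11, total_12, total_15)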
One very important set is the set of real numbers, denoted R. Unlike the example sets A,
B, or {1, 2, 3, 4} which only have a finite number of elements, the set R has every number on
the real number line and is hence not a finite set. We again can speak about elements of R. So
for example it is true that 7 ∈ R, and it is also true that hello ∉ R. When we consider the
parameters of Model I in (1), we can write b0 ∈ R to signify that b0 is a real number.
We can also denote the set of real numbers as (−∞, ∞) implying that it is the interval of all
numbers that are greater than −∞ and less than ∞, and this means all real numbers. There
are other sets that we can denote in a similar way, for example [−1, 1] is the set of all numbers
greater than or equal to −1 and less than or equal to 1. Another option is the set (0, ∞) which
means the set of all positive numbers (greater than 0). The set [0, 1] is the set of all real numbers
between 0 and 1 inclusive, and this is basically all numbers that can describe a probability. A
related set (0, 1) contains all the numbers between 0 and 1 but without the boundaries 0 and
1.
One can study and discuss sets much further, and in a more formal manner, or even in very
formal means that relate to mathematical logic. But our purpose here is simply to introduce
minimal notation. For this, we present a few more sets when dealing with vectors and matrices
in the sections below. But first, at this point, with our basic understanding of sets, we are
ready to discuss the notion of a function.
Put simply, a function is an operation that transforms elements from one set to elements
of another set. We can for example denote our function as f and think that it operates on
inputs from the set A = {7, 2.5, hello} and gives us outputs from the set of real numbers R.
Formally this can be denoted as,
f : A → R, (16)
and this notation tells us that all of the possible inputs to the function f ( ) are the elements
of A. It further says that the outputs must be elements of the real numbers R. This declaration
of the function via (16) says that for every u ∈ A we should have an answer of what the function
gives, and we denote this as f (u). Note that we sometimes write f (·) to indicate the function
where “·” just stands for the place where the argument u should appear.
A declaration such as (16) does not define how the function works. To do so, we must
be more explicit either with a formula, or an algorithm, or a lookup table. In this example,
since the input set has a small number of heterogeneous elements, let us specify the function
via a lookup table approach. In particular we can state, f (7) = 3.4, f (2.5) = −2.1, and
f (hello) = 20.3. This would then specify exactly what the function f ( ) does for every u ∈ A.
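For illustration, such a lookup-table function can be written in Python as a dictionary (a small sketch using the set A and the output values from the text):

# The lookup-table function f : A -> R specified above
A = {7, 2.5, "hello"}                      # the input set A
f = {7: 3.4, 2.5: -2.1, "hello": 20.3}     # f(7) = 3.4, f(2.5) = -2.1, f(hello) = 20.3

for u in A:
    print(u, "maps to", f[u])              # every element of A has a defined output in R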
In other cases, we can specify how the function works with a formula. This is often common
for functions f : R → R. An arbitrary example of such a function from R to R is,

    f(u) = 3 cos(e^{u−2}).        (17)

The reader will have seen plots previously where such a function, or others, are plotted, where
on the horizontal axis we plot u and on the vertical axis we plot 3 cos(e^{u−2}).
every u ∈ R we have a specific y = f (u), and the plot is a connection of the points (u, f (u))
on the plane, essentially for every u ∈ R (or realistically on some smaller set which defines
the domain of the plot). Also note that in this case the function (17) is composed of other
operations and functions such as the cosine function and the exponentiation function. Indeed
function composition is a very common operation where outputs of one function are used as
inputs of another. An example is with Model III as in (7) where we have L − 1 function
compositions and each function represents the operation of a layer of neural network.
In deep learning we use functions in multiple ways. One way is for constructing models
such as I – III. Another way is for specifying the whole model as a function. A third way is for
construction of loss functions. Let us now highlight such uses.
First in terms of construction of models, as we can see for Model I in (1), we define the
sigmoid function
σSig : R → [0, 1]. (18)
This function takes any real valued scalar as input, denoted as z in (1), and the output is a
number in the range [0, 1]. Note that for this sigmoid function we have a formula, 1/(1 + e−z ),
which exactly describes how to compute σSig (z). A schematic plot of this function is inside the
right circle in Figure 1.
Further, for Model II in (4) we have the softmax function, SSoftmax (z) as a building block.
This function can be declared as,
SSoftmax : RK → RK , (19)
since it takes K-dimensional vectors as inputs and returns K-dimensional vectors as outputs.
We describe it further in the next section after we discuss vectors. Similarly, for Model III,
we have activation functions for deep feedforward neural networks, S [ℓ] (·). We discuss such
activation functions in Section 9.
Now a different use of functions for the neural network models I, II, and III is that the whole
model is a function that converts some input x to an output ŷ. In terms of all three models,
I – III, the input x is a p-dimensional list of numbers, or vector, a notion further discussed in
the next section. For now, let us agree that this set is denoted as Rp . The output is either a
scalar value (an element of R or an element of [0, 1]) or a vector which is an element of RK (a
vector or list of K numbers). In particular, here are the functions described by these models.
For Model I, yielding a probability output, we can declare the function of the model as,

    f(b0,w) : Rp → [0, 1],        (20)

and the representation in (20) then tells us that inputs to the model are x ∈ Rp and outputs are
going to be ŷ ∈ [0, 1]. Now notice that we also decorate the function name f with a subscript
(b0 , w), and this subscript signifies the parameters of the model. In particular, when we train
a deep learning model such as this, we find suitable values of the parameters b0 and w for the
data. These parameters then exactly dictate the specifics of our model function f(b0 ,w) ( ). The
way the function actually works was already specified in (1). However, some specifics of that
presentation, such as the inner product w⊤ x, are explained in the next section. In Figure 4 we
see two example plots for two different instances of (20) where each time we plot ŷ = f(b0 ,w) (x).
In (a) we see a case with p = 1, and we also see data points for which the function was fit. In
(b) we see a case with p = 2, this time without the data points.
Figure 4: Probability output using Model I. (a) A p = 1 model with one feature. (b) A p = 2 model
based on two features.
For Model II, the function of the model can be specified as,
f(b,W ) : Rp → RK , (21)
and in particular the output in this case is a K-dimensional vector. The parameters of the
model are (b, W ), and more details about how the model operates are in the sequel.
For Model III, we leave the specification of the type of output open. In some cases it can
be a single probability value, so we may specify the output as a value in the set [0, 1]. In other
cases it can be a single real number, as one typically has in regression problems. Thus the
output can be specified as a value in R. We can also use Model III for multi-class classification
like Model II and then the output is a vector as with Model II, so the output is a value in RK
(also sometimes denoted Rq ). One way to write this is,
f(b[1] ,W [1] ),...,(b[L] ,W [L] ) : Rp → O, where O is [0, 1], or R, or Rq . (22)
In all these scenarios of Model III, the input is similar to the other two models, but for the
output type there are several options. Observe also the rich set of parameters that the model
has, namely, (b[1] , W [1] ), . . . , (b[L] , W [L] ).
Another use of functions in deep learning is loss functions. In general when we train a
model we are given a fixed dataset and wish to find the best set of parameters such that the
model fits the data. For example with Model II we would seek the best possible vector b and
matrix W that we can find to match the data. The way that this “bestness” is quantified is via
a function that we artificially construct for the model. Unlike the model functions (20), (21),
and (22) which operate on the input x, the loss function is a function of the parameters, and
the training data is fixed for this function and determines its shape.
Figure 5: An example plot of an hypothetical loss function (also known as loss landscape) when
there are two parameters. (a) A 3D surface plot where we see that the function has multiple valleys
(local minima). (b) The same function can be plotted using a contour plot, where each line along a
contour maintains the same value of the loss function (like a topographical map).
Now since each type of model has a different set of possible parameters, it is common to
just call all the parameters θ. In this case θ can signify (b0 , w) for Model I, (b, W ) for Model II,
and so forth. We further call the set of all possible parameters Θ (and the specific form of
this set depends on the model type that we use). The loss function can then be written as,
CData : Θ → R, (23)
where the subscript, “Data”, reminds us that the function’s actual form depends on the training
data that we have. Now in many cases, each θ ∈ Θ contains many parameters (numbers); this is
especially the case with Model III, where one can quickly get millions of parameters in a
deep neural network.
The act of learning the parameters, or training the model, is the act of finding some θ ∈ Θ
where CData (θ) is as low as possible or close to the lowest value. This is an optimization problem
where we try to minimize the loss CData ( ).
Now when the number of parameters in each θ is 1 or 2, it is possible to plot the loss
function and then visualize the minimization of the loss. Such plots are typically not done
for operational purposes but rather for pedagogic purposes. In particular, the case of θ being
composed of two numbers θ1 and θ2 is easy to plot, and such plots allow us to also think about
the techniques and challenges of minimizing loss in general. In Figure 5 we present a plot of
such an hypothetical loss function.
5 Vectors
Now that we understand the basics of sets and functions, let us advance to the concept of a
vector. As already seen in Models I – III, the input x to the model is a vector. This is a
list of numbers which encodes some form of input data. Vectors can also appear as outputs.
Specifically, in Model II, and in some variants of Model III, the output, ŷ is also a vector. For
example, with Model II and K = 4 we may have an output vector such as,
ŷ = [ 0.14 ]
    [ 0.02 ]
    [ 0.65 ] ,        (24)
    [ 0.19 ]
which in this case marks the probabilities of the four classes of classification (observe that in
this example the numbers are non-negative and sum up to 1). When we have a vector, we
often relate to the individual components via a subscript, so that for example ŷ1 = 0.14. In
this example vector, we see that ŷ3 = 0.65 is the highest probability, so if this is a classification
output, then we would choose Ŷ = 3 as our classification choice using (6).
The set of vectors of length or dimension n is denoted as Rn . So for example, for the vector ŷ
of (24) we can write ŷ ∈ R4 . Note that in different texts, vectors are denoted differently. For us,
let us also write the same vector as ŷ = (0.14, 0.02, 0.65, 0.19). The use of the round brackets,
( ), is sometimes associated with the term tuple, which is similar to a vector and for our purposes
may mean the same thing. Also note that the way we wrote the vector in (24) is as a column
vector (the vector is standing up). This way of writing is especially relevant when we do matrix-
vector multiplication in the next section. Similarly, one may write ŷ = [0.14 0.02 0.65 0.19]⊤ ,
where in this case we first wrote a row vector [0.14 0.02 0.65 0.19] and then applied transpose
to it using the ⊤ symbol in superscript. For vectors, such transposition simply converts them
from column to row, and vice-versa. More on this operation appears below in the context of
matrices.
For a data scientist and a machine learner, vectors are first and foremost considered as
lists of numbers. But in fields such as physics, or applied mathematics, vectors also typically
describe directions and magnitudes. This is very apparent in two dimensional spaces or three
dimensional spaces. When we discuss gradient vectors in Section 8 below, we also consider
vectors as directions in space. But for now first let us think of vectors as lists of numbers,
exactly playing the role of inputs, outputs, parameters, and intermediate computations in
models such as our example models I – III.
Note that single numbers, known as scalars, can be compared in the sense that one is greater
than the other (as long as they are real numbers and not complex numbers which we do not often
use in deep learning). For example we know that 2.4 is greater than −8.3, denoted −8.3 < 2.4.
If we treat all the scalar real numbers as the set R, then we say there is an order on R, because
for every u ∈ R and v ∈ R one can determine if u < v is true or false. With vectors, unlike
scalars, there is not such an obvious order.
While there is not an obvious universal way to order vectors, we can associate scalars
(numbers) with the distances between two vectors, as well as with the length of individual
vectors, and these scalars can then be used for ordering and other purposes. One of the most
basic ways to do this is the inner product which for a vector u ∈ Rn and a vector v ∈ Rn , is
written as u⊤ v or sometimes as u · v. The inner product is computed as,
    u⊤v = Σ_{i=1}^{n} ui vi = u1 v1 + u2 v2 + · · · + un vn.        (25)
Hence it is a summation of the products of the individual entries of the vectors. So for example,
as the reader may verify for u = (2, 0, −3) and v = (1, −12.5, 2), the inner product is u⊤ v =
−4. When the inner product is near 0 it means that the vectors are quite different, whereas
when the inner product is far from 0 (either positive or negative) it means that the vectors
are similar. When the inner product is exactly 0 we say that the vectors are orthogonal. In
a geometric representation this means the vectors are perpendicular. We do not dive into
such a representation of vectors, and instead the interested reader can visit a text such as
[Boyd and Vandenberghe, 2018] for further reading.
If we now return to Model I and revisit (1) and (2), then we can observe the inner product
w⊤ x in these equations. Here w is the vector of weight parameters of the model and x is the
model input.
One can also compute the norm of a vector u ∈ Rn , denoted as ∥u∥ and computed as the
square root of the inner product of the vector with itself (this is sometimes called the L2 norm
or the Euclidean norm). That is,
    ∥u∥ = √(u⊤u) = √( Σ_{i=1}^{n} ui² ).        (26)
For example for u = (2, 0, −3) as before, ∥u∥ = √13 ≈ 3.6. The norm of a vector is always
a non-negative number and is exactly 0 if and only if all the entries of the vector are 0, and
otherwise it is strictly positive. So while we cannot order vectors in a unique manner, we can
order vectors based on their norm which is a number that summarizes their magnitude.
Now that we have norm, we can also consider a scaled version of the inner product which is
called the cosine of the angle between two vectors. For u ∈ Rn and v ∈ Rn , as long as neither
of these vectors is all 0, this is computed as,
    Cosine of the angle between u and v = u⊤v / (∥u∥ ∥v∥) = ( Σ_{i=1}^{n} ui vi ) / ( √(Σ_{i=1}^{n} ui²) · √(Σ_{i=1}^{n} vi²) ).        (27)
So for example for the u and v examples above, the cosine of the angle is about −0.087. The
cosine of the angle is always between −1 and 1 and the closer it is to 0 the less similar the
vectors are in some sense. It is exactly 0 if the vectors are orthogonal.
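The reader can verify the example computations above with a short Python (NumPy) sketch such as the following:

import numpy as np

u = np.array([2.0, 0.0, -3.0])
v = np.array([1.0, -12.5, 2.0])

inner = u @ v                                    # inner product (25): equals -4
norm_u = np.linalg.norm(u)                       # norm (26): sqrt(13), about 3.6
cosine = inner / (norm_u * np.linalg.norm(v))    # cosine of the angle (27): about -0.087
print(inner, norm_u, cosine)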
All of the above definitions and computations of vectors are very common to use in deep
learning. One more thing that we do is arithmetic with vectors. The most basic operation is to
take a scalar (a single number) and multiply each element of the vector by this scalar. This is
called scalar multiplication (or sometimes scalar-vector multiplication). So for a scalar, α ∈ R
and a vector u ∈ Rn , the scalar multiplication α u is a new vector where at coordinate i it has
α ui . As an example, let us return to u = (2, 0, −3) from above, and say that α = −4, we have
that α u = (−8, 0, 12).
Let us connect this to the softmax function. If we consider the right side of (4) then we
can observe that this is in fact a case of scalar multiplication. In particular the expression
1/Σ_{i=1}^{K} e^{zi} is a scalar which multiplies the vector (e^{z1}, . . . , e^{zK}). In this case, given some input
z ∈ RK, the softmax function transforms z such that all entries are positive via the exponen-
tiation. It also ensures the sum of the entries is 1 via the scalar multiplication. Importantly,
the result of the softmax transforms an arbitrary vector of numbers to a vector of probabil-
ities, maintaining the same order. For example if as input to the softmax function we have
z = (0.03, −1.91, 1.56, 0.34), then it is already evident that the third entry z3 = 1.56 is the
maximal, but this is not quantified in terms of probabilities. Then after exponentiation we have
(e^{z1}, e^{z2}, e^{z3}, e^{z4}) = (1.03, 0.148, 4.807, 1.405). Now we can compute that 1/Σ_{i=1}^{K} e^{zi} = 0.1353,
and then by applying scalar multiplication of this value by (e^{z1}, e^{z2}, e^{z3}, e^{z4}), we arrive at ŷ as
in (24) (note that this is approximate due to rounding).
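The same arithmetic can be checked with a few lines of Python (NumPy):

import numpy as np

z = np.array([0.03, -1.91, 1.56, 0.34])
e = np.exp(z)             # exponentiation makes all entries positive
scale = 1.0 / e.sum()     # the scalar 1 / (sum of the exponentiated entries)
y_hat = scale * e         # scalar multiplication gives a probability vector
print(e, scale, y_hat, y_hat.sum())   # y_hat is approximately (0.14, 0.02, 0.65, 0.19)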
In addition to scalar multiplication we can also add two vectors of the same dimension.
This works by adding the individual matching coordinates. So for u ∈ Rn and v ∈ Rn , the
addition u + v is a new vector, where at coordinate i it has ui + vi . So again with u and v
as in the example above, we have that, u + v = (3, −12.5, −1). Now subtraction can also
be defined by scalar-multiplying the second vector by −1 and then adding. So for example
u − v = (1, 12.5, −5). Note that if we subtract two vectors that are equal then we get a vector
of all 0 values, called the zero vector.
Having vector subtraction allows us to use the vector norm to define the distance between
two vectors. In particular, given a vector u ∈ Rn and a vector v ∈ Rn , the distance (also known
as Euclidean distance) between the two vectors is ∥u − v∥. Hence we first subtract the vectors
and then compute the norm of the answer. That is,
    Distance between vectors u and v = ∥u − v∥ = √( Σ_{i=1}^{n} (ui − vi)² ).        (28)
This number is never negative, and the closer it is to 0 the closer that u is to v. For the example
vectors in R3 above, we have that ∥u − v∥ = 13.5. Note that sometimes we consider a similar
quantity without the square root. This is naturally denoted as ∥u − v∥2 , and it can sometimes
be called the squared error between the vectors. It can also be represented in terms of the
inner product of the difference u − v with itself,

    Squared error between u and v = ∥u − v∥² = (u − v)⊤(u − v) = Σ_{i=1}^{n} (ui − vi)².        (29)
It turns out that variants of the squared error as in (29) naturally play a role as loss functions
in deep learning (as well as classical statistics). In particular, one of the vectors, say u can play
the role of desired model output, often denoted y, whereas the other vector, v, is the predicted
model output, ŷ. In this case, ∥y − ŷ∥2 is a measure of how far the obtained output ŷ is from
what it should have been.
To make this more concrete, say we are using Model II for multi-class classification with
K = 4 classes. After the model is trained, one can then consider a test set of say 1000 samples
of inputs x(1) , . . . , x(1000) (each of these is a vector of some dimension p, e.g., p = 300). We
then apply the model on each of these vectors and get 1000 result vectors which we denote
as ŷ (1) , . . . , ŷ (1000) . Each of these results vectors look like the vector in (24) only generally has
different probabilities.
We now wish to compare the result vectors to what they would ideally be. For this, assume
that we also have outcomes which indicate for each observation if it is the first class, second
class, third class, or fourth class. One thing we can do with this is to create a set of vectors
called one-hot encoded vectors. If an observation is in the first class, the associated one-hot
encoded vector is (1, 0, 0, 0). If it is in the second class, the one-hot encoded vector is (0, 1, 0, 0).
And so forth. These vectors are also called the canonical unit vectors. Observe also that they
represent probability vectors, with the probabilities being degenerate in the sense that all the
probability mass is at one position while other positions have 0 probability. With this we
create the 1000 one-hot encoded vectors y (1) , . . . , y (1000) . We then define loss or error as
    Mean squared error between y and ŷ = (1/1000) Σ_{i=1}^{1000} ∥y(i) − ŷ(i)∥².        (30)
Here we are averaging the squared errors between each desired (one-hot encoded) y (i) and
obtained prediction ŷ (i) . If our model is perfect then this mean squared error will be 0, but
generally it is positive, yet the lower it is, the better our predictions.
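As an illustration of one-hot encoding and the mean squared error of (30), here is a small Python (NumPy) sketch with only 3 samples and made-up predicted probability vectors:

import numpy as np

labels = np.array([3, 1, 4])                  # true classes of the 3 samples (values in 1, ..., K = 4)
Y = np.eye(4)[labels - 1]                     # one-hot encoded vectors y(1), y(2), y(3) as rows

Y_hat = np.array([[0.14, 0.02, 0.65, 0.19],   # made-up predicted probability vectors
                  [0.70, 0.10, 0.10, 0.10],
                  [0.05, 0.05, 0.20, 0.70]])

mse = np.mean(np.sum((Y - Y_hat) ** 2, axis=1))   # average of the squared errors
print(mse)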
We note that in practice, for Model I and Model II we typically use a cross entropy loss
which differs from this simpler mean squared error. For details, see for example Chapter 3 of
[Liquet et al., 2024].
6 Matrices
Now that we have a basic handle on vectors, let us focus on matrices. In this short section,
we shall touch upon several uses of matrices within the context of deep learning. One use is
to organize stored data as a table. Another use that we touch on very lightly is for describing
covariances of variables. A third use, which is the most important for us, is linear transforma-
tions, and this is the role that the weight matrices (parameters), W in Model II, and W [ℓ] in
Model III, play. As with vectors, our exposition is only the tip of the iceberg and for a more
expanded introduction we recommend [Boyd and Vandenberghe, 2018] as a first reading.
A matrix is a list of numbers organized in rows and columns. For example this is the matrix
X with 4 rows and 3 columns,
X = [ 0.4    −1      4   ]
    [ 1.2     0    −0.5  ]
    [ 0      2.1     3   ]        (31)
    [ 5      2.1   −10   ]
We say that this is a 4 × 3 matrix and we can refer to each element as xi,j where i = 1, 2, 3, 4
denotes the index of the row and j = 1, 2, 3 is for the column. It is common to use capital
letters for matrices and then refer to individual elements with the lower case letters. For
example x3,2 = 2.1. One very basic use for matrices is to describe tabular data, similarly to
how we described data in Section 3. To match the description there we should notice that
x_j^{(i)} = xi,j (the observation or sample with index i, denoted via the superscript (i) in x_j^{(i)}, is the
row, and the feature j, denoted via the subscript j in x_j^{(i)}, is the column). We denote the set
of all matrices with m rows and n columns as Rm×n . Note that in the case of a data matrix
matching the data table in (9) we need a matrix in Rn×p because it has n rows (observations),
and p columns (features).
If the number of rows and the number of columns is equal, we say that the matrix is square.
The set of square matrices of dimension n is denoted as Rn×n . In addition to being square, if
all the elements xi,j where i ̸= j are 0, then we say the matrix is diagonal (it only has non-zero
entries on the diagonal which is all entries xi,i for i = 1, . . . , n). One very important type of
diagonal matrix is the n × n identity matrix, denoted I, which has 0 values everywhere except
on the diagonal where it has 1 values. For example this is the 3 × 3 identity matrix,
I = [ 1  0  0 ]
    [ 0  1  0 ] .        (32)
    [ 0  0  1 ]
Vectors can be viewed as special cases of matrices. For example consider the column vector, u
and the row vector v,
u = [ 2 ]
    [ 0 ] ,        v = [ 2  0  6 ] .        (33)
    [ 6 ]
Viewed as matrices we can say that u is a 3×1 matrix, and v is a 1×3 matrix (namely u ∈ R3×1
and v ∈ R1×3 ). We can also recall the transpose operator, ⊤, for vectors that converts a column
vector to a row vector with the same numbers, and vice-versa. Here, for the example values we
have for u and v in (33), we obviously see that u⊤ = v and v ⊤ = u.
Indeed for any matrix of dimension m × n, if we transpose the matrix we get a matrix of
dimension n × m where the (i, j)-th entry (row i and column j) of the transposed matrix, is the
(j, i)-th entry of the original matrix. For example, the transpose of the 4 × 3 matrix X from
(31) is the 3 × 4 matrix given by
X⊤ = [ 0.4    1.2    0      5   ]
     [ −1      0    2.1    2.1  ] .        (34)
     [  4    −0.5    3    −10   ]
When a matrix is square, applying the transpose to it does not change the dimensions.
Incidentally a square matrix A such that A = A⊤ is called a symmetric matrix, because any
entry ai,j on one side of the diagonal, equals the matching entry aj,i on the other side of the
diagonal. In statistics, data science, and some aspects of deep learning, an important place
where symmetric matrices arise is with variance and covariance descriptions of features under
study. In particular the covariance matrix (sometimes called the variance-covariance matrix),
is often denoted as Σ (not to be confused with the summation notation reviewed in Section 3),
and it captures the variances and covariances of the features under study. In particular the
(i, j)-th element of Σ is the covariance between feature i and feature j, and this equals the
(j, i)-th element as well. For the case where i = j, i.e., on the diagonal, this element is the
variance of feature i. We do not make use of such matrices further in this text, but refer the
reader to an elementary exposition such as in chapter 3 of [Nazarathy and Klok, 2021], or the
references therein.
One of the most important things that we can do with matrices is matrix multiplication.
For simplicity let us first take two square matrices, each in R3×3 .
A = [ 2  0  3 ]              B = [ 4  1  0 ]
    [ 0  1  1 ]    and           [ 1  0  0 ] .        (35)
    [ 2  0  0 ]                  [ 0  0  3 ]
Now, in this case, the product C = AB is a new 3 × 3 matrix, where the entry ci,j is the inner
product of the i-th row of A and the j-th column of B. For example at i = 2 and j = 3 we
have,
c2,3 = a2,1 b1,3 + a2,2 b2,3 + a2,3 b3,3 = 0 · 0 + 1 · 0 + 1 · 3 = 3.
In the same manner, to get all other 8 elements of the matrix C, we can do all other inner
products. As the reader can verify, the matrix C turns out to be,
C = AB = [ 8  2  9 ]
         [ 1  0  3 ] .        (36)
         [ 8  2  0 ]
Note that multiplication of scalars is commutative since for two scalars (numbers), a and b,
we always have that ab = ba. With matrices this is not the case. For example C̃ = BA yields
a different result to C = AB. As the reader may verify,
C̃ = BA = [ 8  1  13 ]
          [ 2  0   3 ]  ≠  C.        (37)
          [ 6  0   0 ]
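The reader can check these products with a short Python (NumPy) sketch:

import numpy as np

A = np.array([[2, 0, 3],
              [0, 1, 1],
              [2, 0, 0]])
B = np.array([[4, 1, 0],
              [1, 0, 0],
              [0, 0, 3]])

C = A @ B          # the product AB of (36)
C_tilde = B @ A    # the product BA of (37)
print(C)
print(C_tilde)
print(np.array_equal(C, C_tilde))   # False: matrix multiplication is not commutative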
Up to now we multiplied square matrices of the same dimension, but we can also, in certain
cases, multiply non-square matrices. The rules defined for matrix multiplication dictate that it
cannot be done for any two matrices, but only in certain cases. In particular take A ∈ Rm×n
and B ∈ Rn×r , then A has n columns and B has the same number, n, of rows. This means
that rows of A and columns of B are of the same dimension, n, and thus we can compute inner
products between rows of A and columns of B. Otherwise, if these dimensions do not match,
then matrix multiplication is not defined. So for example we can compute the product of a
4 × 3 matrix by a 3 × 7 matrix, but we cannot compute the product of a 4 × 3 matrix and a
4 × 7 matrix. Note that sometimes, depending on the dimension, we can compute a product
AB, but not the product BA, or vice versa.
We also mention that the identity matrix, such as the 3 × 3 example shown in (32) is special
in terms of multiplication. As the reader can verify, if we multiply either A or B from (35) by
I (from either side), then the result does not change. That is, AI = A, IA = A, BI = B, and
IB = B. This holds for any dimension of the identity matrix and any other matrix, where the
identity matrix and the other matrix can be multiplied (with matching dimensions). Hence I
in matrices acts like 1 in scalars (for any scalar a ∈ R, 1a = a and a1 = a).
For our purposes, an important form of matrix multiplication is when the second matrix
is actually a vector. In this case we call this matrix-vector multiplication. In particular take
W ∈ RK×p and take a column vector x ∈ Rp×1 (we could have just stated that x is an element
of Rp , but for purposes of matrix multiplication, we consider it as a matrix). This notation
matches the left side of (4) of Model II, where we multiply a K × p parameter matrix W by
the input vector x.
According to the rules of matrix multiplication, the multiplication W x yields a K × 1
matrix as a result, or simply a K dimensional vector. For example, here is a schematic of this
matrix-vector multiplication with K = 5 and p = 3,
[ w1,1  w1,2  w1,3 ]            [ w1,1 x1 + w1,2 x2 + w1,3 x3 ]
[ w2,1  w2,2  w2,3 ]   [ x1 ]   [ w2,1 x1 + w2,2 x2 + w2,3 x3 ]
[ w3,1  w3,2  w3,3 ]   [ x2 ] = [ w3,1 x1 + w3,2 x2 + w3,3 x3 ]        (38)
[ w4,1  w4,2  w4,3 ]   [ x3 ]   [ w4,1 x1 + w4,2 x2 + w4,3 x3 ]
[ w5,1  w5,2  w5,3 ]            [ w5,1 x1 + w5,2 x2 + w5,3 x3 ]

Here the left matrix is the parameter matrix W ∈ R5×3, the middle column is the input vector
x ∈ R3, and the right column is the output vector in R5.
As is evident, each entry of the output vector is the inner product between the associated
row of W and the input vector x.
We should also note that in (4) of Model II there is the bias vector b added to W x. This is
an addition of two vectors in RK . Thus we have the vector z = b + W x, represented as follows
(for an example with p = 3 and K = 5),
    [ b1 ]   [ w1,1 x1 + w1,2 x2 + w1,3 x3 ]
    [ b2 ]   [ w2,1 x1 + w2,2 x2 + w2,3 x3 ]
z = [ b3 ] + [ w3,1 x1 + w3,2 x2 + w3,3 x3 ] .        (39)
    [ b4 ]   [ w4,1 x1 + w4,2 x2 + w4,3 x3 ]
    [ b5 ]   [ w5,1 x1 + w5,2 x2 + w5,3 x3 ]
This representation of z in (39) exactly agrees with (5), which appeared earlier, before reviewing
vector and matrix operations. Note that (8) for Model III, defining the action of a layer as
S[ℓ](b[ℓ] + W[ℓ] u), can now also be understood as a similar operation to (38) and (39). We
provide further details in Section 9.
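A small Python (NumPy) sketch, with made-up values, illustrating (38) and (39) and checking them against the entry-by-entry formula (5):

import numpy as np

rng = np.random.default_rng(2)
K, p = 5, 3
W = rng.normal(size=(K, p))   # parameter matrix W with K rows and p columns
b = rng.normal(size=K)        # bias vector b of dimension K
x = rng.normal(size=p)        # input vector x of dimension p

z = b + W @ x                 # matrix-vector product plus bias, as in (38) and (39)

# The same computation written entry by entry, as in (5)
z_check = np.array([b[i] + sum(W[i, j] * x[j] for j in range(p)) for i in range(K)])
print(np.allclose(z, z_check))   # True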
8 Gradients
Before delving into gradients and their pivotal role in deep learning, let us briefly revisit a
fundamental concept from calculus, the derivative. At its core, the derivative provides us with
a measure of how a function f : R → R changes as its input varies. Represented by df/du, the
derivative signifies the slope of the tangent line to the function’s curve at a specific point u ∈ R.
The derivative can also be seen as a function, f′ : R → R, where f′(u) is the derivative at a
specific point u, namely f′(u) = df/du.
An estimate of the slope of the function (rise divided by run) at a specific point u is

    ( f(u + ∆) − f(u) ) / ∆,

where ∆ is a small positive number such that u and u + ∆ are two nearby points on R. We
can then treat the derivative of f ( ) at u as the limit of this slope as ∆ gets small. Formally
one can write this as,
    f′(u) = lim_{∆→0} ( f(u + ∆) − f(u) ) / ∆.
A deeper understanding of derivatives may require a review of basic calculus which we cannot
afford in this exposition. For this, we refer the reader to any basic calculus text, or online
resource. One interesting and enjoyable read which may help readers gain insight on this topic
is Burn Math Class: And Reinvent Mathematics for Yourself [Wilkes, 2016].
In deep learning we use derivatives for training. Consider first an hypothetical scenario
where we are training a model with a single parameter θ. Now, as denoted at the end of
Section 4, we have a loss function CData (θ), and we wish to minimize this function. For this we
can use the derivative dC/dθ to gain information about the slope of the function, and this gives
us an indication about the directions and the magnitudes that can be used in our optimization
procedure. Ultimately, with the aid of derivatives, we try to find the best θ for the loss. Note
that in this section we denote C as shorthand for CData .
Now, let us extend our perspective to a more complex scenario where our model has multiple
parameters, often denoted as θ = (θ1 , θ2 , ..., θd ), similarly to the presentation at the end of
Section 4. If we are in Model I then these d parameters are b0 and the p elements of the vector
w, so d = p + 1. If we are in Model II then those d parameters can be taken as the vector
b ∈ RK and the matrix W ∈ RK×p , so the vector θ is of dimension d = K + pK = K(p + 1).
If we are in Model III then θ is even more complex and constitutes the weights and biases in
all layers; details for Model III are in the next section. In any case, we treat all the individual
parameters in the vectors and matrices as one long vector θ with d elements.
We now generalize the notion of the derivative from one dimension to d dimensions using
the notion of a gradient. For a parameter point θ ∈ Rd , the gradient denoted as ∇C(θ) is
a vector in Rd which points at the direction of steepest ascent and has a magnitude (norm)
which captures how steep the function is in that direction. In fact, the gradient is composed of
individual partial derivatives, and ∇C(θ) = ( ∂C/∂θ1, ∂C/∂θ2, . . . , ∂C/∂θd ). Note that each
partial derivative ∂C/∂θj is the derivative of C(θ) with respect to θj assuming that all other
parameters are fixed.
Just like the derivative which can be viewed as a function, the gradient can also be viewed as
a function,
∇C : Rd → Rd . (40)
For loss functions C( ) with vector inputs of length 2 as illustrated in Figure 5, the gradient
can be drawn as an arrow on the plane. For loss functions with vector inputs of length 3,
the gradient is an arrow in 3 dimensional space. For loss functions of higher dimensions (it is
common in deep learning to have d in the order of millions or more), we as humans cannot
visualize the gradient, yet it describes the direction of movement of steepest ascent/increase.
Importantly, when we consider loss landscapes for deep learning, the gradient tells us the
slope of the terrain and points us in the direction where the loss increases most rapidly. Imagine
standing at a point on this landscape where the direction we would move to ascend the slope
most rapidly is precisely the direction of steepest increase. However, our goal is to descend into
the valleys, seeking the lowest points where the loss is minimized. To achieve this, we utilize
the negative of the gradient (scalar multiplication of the gradient by −1), as it points in the
direction of steepest decrease. In fact we further multiply the gradient by the scalar α, called
the learning rate, which controls how big of a step we take.
During the training of a deep learning model, our objective is to iteratively adjust the
model’s parameters in the opposite direction of the gradient. This process, known as gradient
descent, guides us downhill, helping us reduce our loss in the loss landscape. The key idea is to
use an update such as,
θ(t+1) = θ(t) − α∇C(θ(t) ), with α > 0, (41)
where θ(t) is the current parameter point and θ(t+1) is the next parameter point. We start at
some initial guess θ(0) then iterate via (41) to improve our parameters; namely to train our
model. The learning rate, α, specifies the size of the steps we take during the algorithm. The
learning rate is an example of an hyper-parameter in deep learning and it sometimes requires
tweaking for gradient descent to work well.
As an algorithm, using a sequence of steps which one may convert to code in a programming
language such as Python, R, or Julia, we may specify this basic gradient descent algorithm with
pseudo code. In particular these are the steps of the algorithm:

    Step 1: Initialize the parameters with an initial guess, θ ← θ(0).
    Step 2: Repeat steps 3 and 4:
    Step 3:   Compute the gradient ∇C(θ) at the current parameters.
    Step 4:   Update the parameters via θ ← θ − α∇C(θ).
    Step 5: Until a termination condition is satisfied, then return θ.
Note that the use of expressions such as a ← b in steps 1 and 4 means substituting the
value of the right hand side b into the left hand variable a. With an algorithm as specified
above, the parameters of the model θ are in memory at all times, and as the algorithm iterates
over steps 2 – 5, these parameters are updated each time step 4 is executed. Observe
that this step implements the update in (41). After the algorithm starts with an initial guess
θ(0) (step 1), each iteration of steps 2 – 5 yields a new update, where in the first iteration we
have θ(1) = θ(0) − α∇C(θ(0) ), in the second iteration we have θ(2) = θ(1) − α∇C(θ(1) ), and so
forth. Note that in this specification we do not go into the termination condition of step 5. See
Chapter 4 of [Liquet et al., 2024] for more details about this gradient descent algorithm and
its generalizations and variants, and in particular the most popular variant for deep learning
called ADAM, [Kingma and Ba, 2014].
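To make the algorithm concrete, here is a minimal Python sketch of gradient descent on a made-up two-parameter quadratic loss; for real deep learning models the gradient in step 3 would come from backpropagation rather than a hand-written formula:

import numpy as np

def C(theta):
    # A made-up loss landscape with d = 2 parameters, minimized at theta = (1, -0.5)
    return (theta[0] - 1.0) ** 2 + 2.0 * (theta[1] + 0.5) ** 2

def grad_C(theta):
    # Gradient of the loss above, written in closed form
    return np.array([2.0 * (theta[0] - 1.0), 4.0 * (theta[1] + 0.5)])

alpha = 0.1                      # learning rate
theta = np.array([3.0, 2.0])     # initial guess theta(0)

for t in range(100):             # iterate steps 2 - 5, applying the update (41) in step 4
    theta = theta - alpha * grad_C(theta)

print(theta, C(theta))           # theta approaches the minimizer (1, -0.5)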
Importantly, we should mention that step 3 computes the gradient in each iteration, and
hence this step can be computationally intensive for large deep learning models. Simple models
such as our Model I and Model II actually have closed form formulas for the gradient, but for
cases such as Model III, this is where the famous backpropagation algorithm is used. Namely,
the execution of step 3 is actually the result of running a whole other algorithm based on
ideas called backward mode automatic differentiation. Details of backpropagation are studied
in chapters 4 and 5 of [Liquet et al., 2024].
Figure 6: Illustration of gradient descent for 5 iterations, starting with θ(0) and getting near the
optimum θ∗ with θ(5). (a) A one-dimensional loss landscape C(θ) with θ ∈ R. (b) A two-dimensional
loss landscape C(θ1 , θ2 ) with θ = (θ1 , θ2 ) ∈ R2 .
As an illustration of gradient descent, see Figure 6. First in (a) we consider a model with
a single parameter θ ∈ R and the associated loss function C(θ) plotted with a minimum at
θ∗ . The gradient descent algorithm starts with an initial guess θ(0) and then updates to θ(1)
using θ(1) = θ(0) − α∇C(θ(0)). In this simple one dimensional case, the gradient at the point
θ(0) reduces to the derivative, C′(θ(0)), which is the slope of the tangent at the point θ(0). In our
example this derivative is negative, so by multiplying it by −α we move in the direction opposite
to the gradient (to the right). Using the same process, the next move is driven
by θ(2) = θ(1) − αC′(θ(1)). Note that in this case the slope of the tangent at the point θ(1) is
positive, so by multiplying by −α we move left, towards the minimum. In the figure we
show the iterates up to θ(5), which is not far from the minimal point θ∗.
In Figure 6 (b) we see a similar trajectory, only this time moving on the (θ1 , θ2 ) plane, and
the points are plotted on the surface as C(θ(i)). Here in this example, small steps move in the
direction of steepest descent, ultimately reaching a point close to the optimum θ∗ = (θ1∗ , θ2∗ ). The
reader should appreciate that with realistic problems, the number of parameters d can be in
the order of millions. Hence each gradient descent iteration is a step in such a high dimensional
space, which we cannot visualize.
9 Putting Bits Together Into a Deep Neural Network
Now we are ready to revisit Model III and fill in a few of the missing details. With deep neural
networks, like Model III, the versatility of the model allows us to in principle approximate any
function. In particular, say that in reality we have some unknown function f ∗ : Rp −→ Rq ,
which is only available to us via data points (x(i) , y (i) ) with x(i) ∈ Rp , y (i) ∈ Rq , and y (i) =
f ∗ (x(i) ). If it is a binary classification scenario then q = 1 and the output can be considered
as an element in [0, 1]. If it is a multi-class classification scenario then q = K and the output
can be considered as an element in RK . If it is a regression problem then again q = 1 and the
output can be any scalar value in R. Finally, in other applications we have vector output Rq
for some q > 1. In all of these cases, with Model III we can try and approximate f ∗ ( ) via the
model function (22).
Models I and II are shallow neural networks as they only involve a single layer. As such,
they are not able to approximate arbitrary functions very well. However, Model III, when
used with multiple layers, is a very rich model, and in principle, no matter what f ∗ ( ) we have
in reality, with enough training data and enough computation power for training (gradient
descent), we can obtain,
f_{(b[1] ,W [1] ),...,(b[L] ,W [L] )} (x) ≈ f ∗ (x). (42)
That is, in general, there exist some parameters (b[1] , W [1] ), . . . , (b[L] , W [L] ) that will enable our
model to approximate any function f ∗ ( ). See chapter 5 of [Liquet et al., 2024] for further
discussion about the expressivity of deep neural networks. Let us now dive into a few more
details of Model III.
Recall from (8) that every layer ℓ of the model is of the form f [ℓ] (u) = S [ℓ] (b[ℓ] + W [ℓ] u)
and the model consists of a composition of such layers via (7). We have already encountered
an affine operation similar to b[ℓ] + W [ℓ] u in the context of Model II where it was b + W x.
Like Model II we sometimes use a softmax for S [ℓ] ( ) in the last layer, ℓ = L, especially when
our goal is multi-class classification. Yet for inner layers, ℓ = 1, . . . , L − 1, we use a (scalar)
activation function σ : R → R, and then the structure of the function S [ℓ] ( ) is via element-wise
activations, S [ℓ] (z) = (σ(z1 ), . . . , σ(zN )), where z = (z1 , . . . , zN ). In general, σ( ) can be a
sigmoid function, as defined in (1), or it can be one of other common activation functions in
deep learning such as ReLU or Tanh; see chapter 5 of [Liquet et al., 2024] for details. Note
that in many texts one just writes, σ(b[ℓ] + W [ℓ] u) for the layer, implying that σ( ) is applied
element-wise.
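As a quick illustration of such an element-wise activation, the following short Python snippet (using NumPy, with an arbitrarily chosen example vector) applies the sigmoid function of (1) to each entry of a vector z:

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))   # the sigmoid of (1), applied element-wise

z = np.array([-1.0, 0.0, 2.0])                 # an arbitrary example vector
print(sigmoid(z))                              # approximately [0.269, 0.5, 0.881]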
We can now revisit Figure 3 which illustrates a small version of such a model with L = 4.
Observe that for this network, the input dimension is p = 4 and the output dimension is q = 1.
The function of each layer f [ℓ] (u) = S [ℓ] (b[ℓ] + W [ℓ] u) operates on the outputs of the previous
layer (or the input to the network in case ℓ = 1) and yields activation values (sometimes just
called neuron values) of layer ℓ. We denote these values as a[ℓ] and thus for ℓ = 1, . . . , L,

a[ℓ] = f [ℓ] (a[ℓ−1] ) = S [ℓ] (b[ℓ] + W [ℓ] a[ℓ−1] ), with a[0] = x, (43)

and the output of the network is the activation of the last layer, namely,

ŷ = a[L] . (44)
What remains to be specified are the dimensions of each of the activations in the network.
As the reader may inspect based on the number of neurons (blue nodes) per layer in Figure 3,
the three hidden layers have 4, 3, and 5 neurons respectively, while the output layer has a single
neuron. Unrolling the composition of the layers, we thus have that,
ŷ = S [4] (W [4] S [3] (W [3] S [2] (W [2] S [1] (W [1] x + b[1] ) + b[2] ) + b[3] ) + b[4] ). (45)
In fact, the reader may work out that the number of parameters is,

d = (4 × 4 + 4) + (3 × 4 + 3) + (5 × 3 + 5) + (1 × 5 + 1) = 61, (46)

where the four terms correspond to hidden layer 1, hidden layer 2, hidden layer 3, and the output layer, respectively.
Hence, when training such a model via gradient descent, each time we compute the gradient,
we obtain a gradient vector in R61 . More realistic models may have many more activations
(neurons), and sometimes more layers as well. Hence the number of parameters, d, easily
climbs to millions or more, and this is why training large deep neural networks may be very
expensive in terms of time, hardware, and power.
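As a sketch of how such a network is evaluated, the following Python code (using NumPy) implements the forward pass of the small network of Figure 3. The parameters are randomly initialized purely for illustration (in practice they are learned via gradient descent), and a sigmoid is used as the activation in every layer for simplicity.

import numpy as np

rng = np.random.default_rng(0)

# Layer dimensions of the small network: input p = 4, hidden layers with 4, 3, 5
# neurons, and a single output (q = 1).
dims = [4, 4, 3, 5, 1]

# Randomly initialized parameters, for illustration only.
Ws = [rng.normal(size=(dims[l + 1], dims[l])) for l in range(4)]
bs = [rng.normal(size=dims[l + 1]) for l in range(4)]

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def forward(x):
    a = x                                   # a^[0] = x
    for W, b in zip(Ws, bs):
        a = sigmoid(b + W @ a)              # a^[l] = S^[l](b^[l] + W^[l] a^[l-1])
    return a                                # y_hat = a^[L]

d = sum(W.size + b.size for W, b in zip(Ws, bs))
print(d)                                    # 61, matching (46)
print(forward(np.ones(4)))                  # a single output value in (0, 1)

The printed parameter count agrees with (46).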
While Model III is a fully connected network, convolutional neural networks (CNNs), which are
widely used for image data, are built around the convolution operation between an image and a
kernel of weights. As an example, suppose the input image x is represented as a matrix of
size 100 × 100 and the kernel W is a 3 × 3 matrix of weight parameters represented as,

      W  =  [ w1,1  w1,2  w1,3
              w2,1  w2,2  w2,3
              w3,1  w3,2  w3,3 ] . (47)
Note that the kernel matrix W is usually much smaller than the image x. We slide the kernel
over the image to perform the convolution operation. As in the case of Model III, the best
values for the weight parameters (wi,j entries in W ) are learned during the training process.
In the basic version of this example, the result z of the convolution operation is a matrix
of dimension (100 − 3 + 1) × (100 − 3 + 1) = 98 × 98 with the (i, j)-th element of z computed as
zi,j = Σ_{i′=0}^{2} Σ_{j′=0}^{2} xi+i′,j+j′ · wi′+1,j′+1 ,   for i, j ∈ {1, . . . , 98}. (48)
Here xi+i′ ,j+j ′ represents the pixel value at position (i + i′ , j + j ′ ) in the input image, and
wi′ +1,j ′ +1 represents the weight parameter at position (i′ + 1, j ′ + 1) in the kernel. For instance,
to compute the value z1,1 , we perform the calculation,

z1,1 = Σ_{i′=0}^{2} Σ_{j′=0}^{2} x1+i′,1+j′ · wi′+1,j′+1 . (49)
Note that z1,1 is the result of an inner product between the vectorized W and the vectorized
3 × 3 top-left submatrix of x.
Similarly, we compute all the elements of z by sliding the kernel over the image and perform-
ing such an inner product each time between the kernel W and the corresponding submatrix
of x. After performing the convolution operation between the image and the kernel, a convo-
lutional layer in a CNN applies an activation function, just like in Model III. This yields what
is sometimes called a feature map that highlights certain patterns or features present in the
image. Now just like in Model III, the feature map serves as input to subsequent layers in the
CNN for further processing and analysis.
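To illustrate the computation in (48), here is a short Python sketch (using NumPy) of this basic convolution operation. The function name and the example kernel are ours for illustration, and indices are zero-based as is usual in code, whereas (48) uses one-based indexing.

import numpy as np

def convolve2d_valid(x, W):
    """Slide the kernel W over the image x, taking an inner product at each position, as in (48)."""
    n, m = x.shape
    k, _ = W.shape
    z = np.zeros((n - k + 1, m - k + 1))
    for i in range(z.shape[0]):
        for j in range(z.shape[1]):
            # Inner product between the kernel and the k x k submatrix of x
            z[i, j] = np.sum(x[i:i + k, j:j + k] * W)
    return z

x = np.random.default_rng(1).normal(size=(100, 100))   # a 100 x 100 "image", random for illustration
W = np.ones((3, 3)) / 9.0                               # an illustrative 3 x 3 kernel
z = convolve2d_valid(x, W)
print(z.shape)                                          # (98, 98)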
It’s important to note that CNNs involve other operations, such as pooling, and various
architectural configurations including multiple channels (feature maps) per layer, skip connec-
tions, integration of fully connected layers, and others. See chapter 6 of [Liquet et al., 2024] for
more details. We also note that from an historical perspective, the work in [Krizhevsky et al., 2012]
was pivotal in highlighting the strength of deep learning, and convolutional neural networks in
particular.
While CNNs are excellent for tasks like image recognition, sequential data such as text often
requires a different approach. This is where sequence models like recurrent neural networks
(RNNs) and long short term memory (LSTM) models come into play. Unlike CNNs which
process the entire input at once, RNNs and LSTMs process the data sequentially one element
at a time. In doing so, these models maintain an internal state that captures information about
previous elements in the sequence.
One key challenge in using RNNs and LSTMs for natural language processing tasks is how
to represent words as numerical vectors such that these models can understand the data. This
is where word embedding becomes useful. Word embedding is a technique used to represent
words as vectors, where the vectors corresponding to similar words remain close to each other.
Via the vector representation of the words, similarity between any two words is measured by
the cosine of the angle between the corresponding two vectors using formula (27).
For example, the word king could be represented by the vector (0.41, 1.2, 3.4, −1.3) and
the word queen can be represented by a relatively similar vector such as (0.39, 1.1, 3.5, 1.6).
Then a completely different word such as mean might be represented by a vector such as
(−0.2, −3.2, 1.3, 0.8). One can now verify in this example that the cosine of the angle between
the vectors for king and queen is about 0.729, while the cosine of the angle between mean and
each of the other two words is lower: about −0.04 with king and 0.156 with queen.
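The reader may verify these numbers with a few lines of Python (using NumPy), where the helper cosine( ) computes the cosine of the angle between two vectors as in formula (27):

import numpy as np

def cosine(u, v):
    """Cosine of the angle between vectors u and v, as in (27)."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

king  = np.array([0.41, 1.2, 3.4, -1.3])
queen = np.array([0.39, 1.1, 3.5, 1.6])
mean  = np.array([-0.2, -3.2, 1.3, 0.8])

print(round(cosine(king, queen), 3))   # about 0.729
print(round(cosine(mean, king), 3))    # about -0.04
print(round(cosine(mean, queen), 3))   # about 0.156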
With such a word embedding approach, the typical way we process input text is to convert
each word (or sometimes a similar notion known as a token) to a vector. Hence the input
to a model is not just a single vector x, as is the case for Models I – III, but rather a
sequence of vectors. Then an RNN model or LSTM model processes this sequence one element at a time,
keeping an internal state and also resulting in an output sequence. This technique has proven
valuable for many language tasks including translation among others. For a description of how
classical models such as RNN models or LSTM models deal with such data, see chapter 7 of
[Liquet et al., 2024].
Recurrent neural networks, long short term memory models, and a few variants were the
main sequence models in deep learning up to recent times. However, in the last few years,
following the 2017 paper [Vaswani et al., 2017], a completely different approach, called trans-
former models, has emerged and is now the main tool used in contemporary large language
models. Transformers overcome limitations of RNNs and LSTM models, such as difficulty in
parallelization and difficulty in capturing long-range dependencies (even though LSTM models
are explicitly designed for enabling long range memory). Transformers address these limi-
tations primarily by leveraging an idea called the attention mechanism. Unlike RNNs
and LSTMs, which process inputs sequentially, transformers process all words simultaneously,
enabling highly efficient parallel computation. This parallelization makes transformers partic-
ularly well-suited for handling long sequences, such as those encountered in machine transla-
tion, document summarization tasks, and general interactions with large language models via
chat. While we leave a complete description of the transformers architecture to chapter 7 of
[Liquet et al., 2024], or other sources, let us see now how basics of the attention mechanism can
be described via inner products and the softmax function.
When processing each word, the attention mechanism allows the model to focus on relevant
parts of the input sequence. That is, by “attending” from each word to every other word,
we capture dependencies across the entire sequence. Imagine reading a long piece of text and
trying to summarize it. Instead of reading the entire text from start to finish every time, one
can generate a summary by focusing on the most relevant parts, or key words. This selective
focus is analogous to what the attention mechanism does. At a high level, we assign a weight
to each input word based on its relevance to the current word being processed. This weight
determines how much attention the model should pay to that input word when generating the
output associated with the current word.
Mathematically, to understand the attention mechanism, consider a sequence of words
x⟨1⟩ , . . . , x⟨T ⟩ where each x⟨t⟩ is a vector representing a word (or token) using our word em-
bedding scheme. Our goal is to compute the attention weights for a specific current word, x⟨t⟩ .
We begin by calculating a score (also called an alignment score) for all input words, x⟨j⟩ ,
based on their similarity to x⟨t⟩ . A basic form of such an alignment score uses the inner
product,
s(x⟨t⟩ , x⟨j⟩ ) = (x⟨t⟩ )⊤ x⟨j⟩ . (50)
These scores, computed for j = 1, . . . , T , are then passed through the softmax function,
Ssoftmax ( ), defined in (4), which squashes them into a probability vector (α1 , . . . , αT ). Namely
the attention weight of any input word j from the perspective of the current word t is,
αj = exp(s(x⟨t⟩ , x⟨j⟩ )) / Σ_{t′=1}^{T} exp(s(x⟨t⟩ , x⟨t′⟩ )) . (51)
The probability vector (α1 , . . . , αT ) represents how much attention each input word x⟨j⟩
should receive when we handle the current word x⟨t⟩ . Intuitively, due to the inner product
operation used in (50), input words that are more similar to the current word will have higher
attention weights, indicating that they are more relevant for generating the output. Conversely,
input words that are less relevant will have lower attention weights, meaning they contribute
less to the output generation process. By selectively attending to the most relevant parts of the
input sequence, the attention mechanism enables us to capture long-range dependencies and
learn complex patterns in the data. This yields a setup that is highly effective for a wide range
of natural language processing tasks, from language translation to text generation.
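As a final illustration, the following Python sketch (using NumPy, with a randomly generated toy sequence of embedding vectors) computes such attention weights via the inner product scores of (50) and the softmax of (51). This is only the basic, unparameterized form described above; practical transformer models additionally use learned transformations of the word vectors, which we do not show here.

import numpy as np

def attention_weights(X, t):
    """Attention weights for current word t over a sequence X of word vectors.

    X has shape (T, d): one embedding vector per word. Scores are inner
    products as in (50), passed through a softmax as in (51).
    """
    scores = X @ X[t]                          # s(x^<t>, x^<j>) for j = 1, ..., T
    scores = scores - np.max(scores)           # shift for numerical stability (does not change the result)
    exp_scores = np.exp(scores)
    return exp_scores / np.sum(exp_scores)     # the probability vector (alpha_1, ..., alpha_T)

X = np.random.default_rng(2).normal(size=(6, 4))   # toy sequence of T = 6 embedding vectors
alpha = attention_weights(X, t=2)
print(alpha, alpha.sum())                          # the weights sum to 1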
matrix inverses, and tensors. Finally we covered gradient vectors, the key component for
gradient descent learning (optimization). As we saw, this dense manifest of “basic mathematical
knowledge” can go a long way in helping to describe complex deep learning models, and we
revisited our Models I – III throughout, connecting the basic notation of these models to the
elementary mathematical principles.
In terms of models beyond our example models I – III, we briefly highlighted ideas of con-
volutions, word embedding, and the attention mechanism. Other aspects of deep learning that
we did not cover include generative models such as generative adversarial networks, variational
autoencoders, diffusion models, reinforcement learning, and graph neural networks. Admit-
tedly, some of these concepts may require more mathematical background than we provided
here. The reader may see chapter 8 of [Liquet et al., 2024] for an overview of this assortment
of topics.
References
[Boyd and Vandenberghe, 2018] Boyd, S. and Vandenberghe, L. (2018). Introduction to applied
linear algebra: Vectors, matrices, and least squares. Cambridge University Press.
[Goodfellow et al., 2016] Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning.
MIT Press.
[Howard and Gugger, 2020] Howard, J. and Gugger, S. (2020). Deep Learning for Coders with
fastai and PyTorch. O’Reilly Media.
[Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic
optimization. arXiv:1412.6980.
[Krizhevsky et al., 2012] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet
classification with deep convolutional neural networks. Advances in neural information processing systems.
[LeCun et al., 2015] LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature.
[Liquet et al., 2024] Liquet, B., Moka, S., and Nazarathy, Y. (2024). The Mathematical Engi-
neering of Deep Learning. CRC Press.
[Nazarathy and Klok, 2021] Nazarathy, Y. and Klok, H. (2021). Statistics with Julia. Springer.
[Prince, 2023] Prince, S. J. D. (2023). Understanding Deep Learning. MIT Press.
[Strang, 2019] Strang, G. (2019). Linear algebra and learning from data. Wellesley-Cambridge Press.
[Vaswani et al., 2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez,
A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need. In Advances in neural
information processing systems.
[Wilkes, 2016] Wilkes, J. (2016). Burn Math Class: And Reinvent Mathematics for Yourself.
Basic Books.