Deep Learning Math Background
Abstract
We present a gentle introduction to elementary mathematical notation, with a focus on
communicating deep learning principles. This is a “math crash course” aimed at quickly
equipping scientists with an understanding of the building blocks used in many equations,
formulas, and algorithms that describe deep learning. While this short presentation cannot
replace solid mathematical knowledge that needs multiple courses and years to solidify,
our aim is to allow non-mathematical readers to overcome hurdles of reading texts that
also use such mathematical notation. We describe a few basic deep learning models using
mathematical notation before we unpack the meaning of the notation. In particular, this
text includes an informal introduction to summations, sets, functions, vectors, matrices,
gradients, and a few more objects that are often used to describe deep learning. While this
is a mathematical crash course, our presentation is kept in the context of deep learning
and machine learning models including the sigmoid model, the softmax model, and fully
connected feedforward deep neural networks. We also hint at basic mathematical objects
appearing in neural networks for images and text data.
1 Introduction
By now, deep learning has become an indispensable tool in science, permeating fields such
as astronomy, genomics, climate science, robotics, materials science, and medical science. Its
transformative impact extends beyond traditional boundaries, finding applications not only in
theoretical domains but also in practical, real-world scenarios such as deciphering the intricacies
of the human genome, predicting climate patterns, and enhancing surgical precision in neuro-
surgery. Deep learning serves as the linchpin for advancements in data analysis, image recog-
nition, natural language processing, and decision-making systems, catalyzing breakthroughs
that were once deemed unattainable. The past two decades have seen great advances in deep
learning. See [LeCun et al., 2015] for a brief overview.
There are multiple ways in which one can try and understand the workings of deep learning.
One popular way to describe deep learning is via a hands-on programming approach as in
[Howard and Gugger, 2020]. This is good for someone directly implementing deep learning with
a specific programming language such as Python. However, if one only seeks to understand
what is going on in the realm of deep learning, the use of programming can be too specific and
time consuming. Hence, a more reasonable approach is to understand underlying basic ideas,
often using mathematical notation. This is the approach that we adopted in our recent book,
The Mathematical Engineering of Deep Learning, [Liquet et al., 2024]. Other similar texts that
also require mathematical notation include Understanding Deep Learning [Prince, 2023] and
the more classic Deep Learning [Goodfellow et al., 2016], among others.
In all these texts, mathematical notation is very effective at pinpointing ideas, in a dense
and concise manner. The problem, however, is that many scientists using deep learning are not
always comfortable with such notation. Understanding mathematical notation requires prereq-
uisite knowledge. For example, for our book, we believe that having taken 3 to 4 university-level
mathematics courses is necessary for a thorough appreciation of the deep learning concepts that
we present. Naturally, one cannot obtain such knowledge and experience overnight. Nevertheless,
there is a spectrum between in-depth understanding and complete avoidance, and we believe that
through a short piece such as this one, the reader can gain a basic understanding of the
meaning of the notation, and as a consequence an entry point to deep learning as well.
Hence this work can be treated as a quick guide to mathematical notation in the context
of deep learning.
In pursuit of our goal of starting to understand mathematical notation, we are motivated by three
key models from the world of deep learning and machine learning, which we aptly call Model I,
Model II, and Model III. Model I is the sigmoid model, also known as the logistic model.
Model II is the softmax model, also known as the multinomial regression model. Model III is
the general fully connected feedforward deep neural network, also known as the feedforward
model, and sometimes called a multi-layer perceptron.
Each of these models operates on some input which we denote as x and creates an output
which we denote as ŷ. The input x can be a series of numbers, an image, a voice signal, or
essentially any structured input. The output ŷ is a probability between 0 and 1 for Model I.
It is a list (or vector) of such probabilities, summing up to 1 for Model II. And in the case of
Model III, it can be either a probability, a vector of probabilities, or any other type of output
that our application dictates.
In these cases, models are presented with an input feature vector x and return an output ŷ.
With the sigmoid model, the output is a probability value indicating the likelihood of a binary
event, serving as a foundational element in binary classification tasks. The softmax model, on
the other hand, extends this concept to multiple classes, assigning probabilities to each class for
multi-class classification problems. In the case of the general fully connected neural network,
its flexibility allows for intricate mappings between inputs and outputs, enabling the network
to learn complex relationships within the data. Understanding these fundamental models is a
crucial step towards unraveling the mathematical framework that underpins deep learning.
Deep learning is the area where models are created (training) and then used (produc-
tion/inference) for solving machine learning or artificial intelligence tasks. The types of models
and tasks are too numerous to cover in this presentation. Instead, let us focus on a simple
supervised learning task, encompassing both binary and multi-class classification. In a binary
classification scenario, a deep learning model is presented with inputs and aims to determine a
binary outcome. For instance, in a medical diagnosis context, the model might assess whether
an input image x is associated with a particular pathology or not. This may be presented in
the form of a probability ŷ.
Similarly, in multi-class classification, the model, still operating on such an input image x,
may determine which class from a finite collection best represents the input. In a medical
image classification scenario, particularly in the realm of diagnosing brain diseases, the deep
learning model is presented with an input brain scan and tasked with classifying it into one of
several possible conditions. For instance, the model may need to distinguish between various
types of brain tumors, such as glioblastoma, meningioma, and pituitary adenoma. The output
of the model is typically a list (or vector) of probabilities associated with each of these classes
(brain tumor types). The class with the highest probability is typically chosen as the model’s
prediction.
The remainder of this document is structured as follows. In Section 2 we see a mathematical
description of models I, II, and III. Our aim with the early introduction of the models is to
motivate the mathematical sections that follow where we unpack the notation presented for
these models. Hence a reader reading Section 2 should not be intimidated by the notation, but
rather try to embrace it as it is unpacked and described in the sections that follow. In Section 3
we review summation notation through the application of data standardization (mean and
variance computation). Most scientists would have seen and used such summation notation
before, hence this section is mostly a review. In Section 4 we present an elementary view of
sets, and functions. This is in no way an exhaustive description of set theory, but rather aims
to place the reader in the mindset of set notation, and especially the “in” symbol, ∈, that is
often used, as well as the notation R for the set of real numbers and the declaration of functions
using the arrow, →, from the input set to the output set. In Section 5 we outline the notion
of vectors, which in greatest simplicity can be viewed as lists of numbers. Ideas such as inner
products and norms are also introduced. In Section 6 we discuss matrices and in particular
matrix-vector multiplication. This operation is central in deep learning. We then comment on
additional aspects of vectors and matrices in Section 7. In Section 8 we overview the concept of
gradients, as these are the objects used when training deep learning models. We also see a basic
version of the gradient descent algorithm. Then in Section 9 we put a few of the mathematical
pieces together, shining more light on fully connected feedforward deep neural networks. In
Section 10 we briefly discuss mathematical ideas used in a few extension models, including
convolutional neural networks, recurrent neural networks, and the attention mechanism which
is used in many contemporary large language models. We finally conclude in Section 11.
date them, as it is expected that much of the notation of the equations may be new or forgotten
territory. In places where the reader requires more clarity, it is recommended that the reader
treat it as motivation for the sections below, where mathematical background is provided, and
the equations from this section are further explained.
Model I: The sigmoid model, also known as the logistic model. Using equations, this
model can be described as,

    ŷ = σSig(b0 + w⊤x),   with   σSig(z) = 1/(1 + e^{−z}).        (1)
Here the input to the model is a list of p numbers, represented all together as x, which we can
also call a vector, and denote as x ∈ Rp . The parameters of the model are the number (scalar)
b0 which is called the bias, also known as the intercept, and the vector w ∈ Rp , which is called
the vector of weights. The operation b0 + w⊤ x means adding the scalar b0 to the inner product
operation, w⊤x, which also computes to a scalar. We denote the outcome
of this operation as z = b0 + w⊤x. Note that this addition of a bias to an inner product is
sometimes called an affine transformation, and can also be written as,
    z = b0 + Σ_{i=1}^{p} wi xi,        (2)

where the summation Σ_{i=1}^{p} wi xi equals the inner product w⊤x.
Figure 1: The sigmoid model represented with neural network terminology as a shallow neural
network. Observe that the artificial neuron is composed of an affine transformation using z = b0 +w⊤ x
followed by a non-linear activation transformation σSig (z).
If the model did not have σSig( ), it would be a linear model, which we do not
cover here. But with it, σSig( ) denotes the sigmoid function, which takes any scalar input z and
produces a probability output ŷ = σSig(z). The reader can experiment with computing 1/(1 + e−z)
using a calculator or spreadsheet for several values of z, to see that it always yields numbers
that are between 0 and 1; for example, with z = 0 we have σSig(0) = 1/2. In a deep learning
framework, the function, σSig ( ), is a type of activation function and the form of (1) represents
an artificial neuron which we also present in Figure 1.
The sigmoid model can be viewed as the simplest non-linear neural network model. Outside
of the context of deep learning, the sigmoid model is very popular in statistics, and known
as logistic regression. This model is suitable for binary classification tasks where positive
samples are encoded via y = 1 and negative samples are encoded via y = 0. The output of the
model, ŷ, is a number in the continuous range [0, 1] indicating the probability that the feature
vector x matches a positive label. Hence, the higher the value of ŷ, the more likely it is that
the label associated with x is y = 1. A classifier can be constructed via a decision rule based
on a threshold τ , with the predicted output being,
    Ŷ = 0 (negative) if ŷ ≤ τ,   and   Ŷ = 1 (positive) if ŷ > τ.        (3)
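To make these equations concrete, here is a minimal Python (NumPy) sketch of Model I; the values of b0, w, x, and τ below are made up purely for illustration:

import numpy as np

def sigmoid(z):
    # The sigmoid function of (1): maps any real z to a value in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Made-up parameters and input with p = 3 features (illustrative values only)
b0 = -0.5                        # bias (intercept), a scalar
w = np.array([0.8, -1.2, 0.3])   # weight vector w
x = np.array([1.0, 0.5, 2.0])    # input feature vector x

z = b0 + w @ x        # the affine transformation of (2): b0 plus the inner product of w and x
y_hat = sigmoid(z)    # probability output of Model I, as in (1)

tau = 0.5             # threshold for the decision rule of (3)
Y_hat = 1 if y_hat > tau else 0
print(z, y_hat, Y_hat)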
Model II: The softmax model, also known as the multinomial regression model.
Using equations, this model can be described as,

    ŷ = SSoftmax(b + W x),   with   SSoftmax(z) = (1 / Σ_{i=1}^{K} e^{zi}) (e^{z1}, . . . , e^{zK}),        (4)

where z = b + W x is a vector in RK.
Here like the sigmoid model, the input, denoted x, is a vector of dimension p, and as before
we can denote this as x ∈ Rp . However the output, ŷ is no longer a single probability, as in
Model I, but is rather a vector of probabilities of length K, where the individual probabilities
sum up to 1. The parameters of the model are also different. The parameter b plays the role
of b0 from Model I and is no longer a scalar, but is rather a vector of dimension K, and we
denote this as b ∈ RK . It is called the bias vector. As for the weights, unlike Model I where the
weights were in a vector, here we have a matrix of weights, or weight matrix, W , of dimensions
K and p, where K is the number of matrix rows, and p is the number of matrix columns. This
can be denoted as W ∈ RK×p .
Like Model I in (1), here, the operation of the model in (4) denotes an intermediate variable
z. The difference is that z is a vector of dimension K, denoted z ∈ RK . Here to construct z we
add the vector b to the vector produced by the matrix-vector multiplication W x. This means
that for i = 1, . . . , K, each scalar, zi in the vector z ∈ RK is computed as,
    zi = bi + Σ_{j=1}^{p} wi,j xj,        (5)
where bi is the i-th scalar in the vector b and wi,j is the scalar in the matrix W at row i and
column j.
Just as Model I applies the (non-linear) function σSig( ) to the scalar z, with Model II
we apply the function SSoftmax( ) to the vector z. This function is called the softmax and it
[Figure 2 diagram: each neuron i computes zi = bi + Σ_{j=1}^{p} wi,j xj, and the softmax output layer produces ŷi = e^{zi} / Σ_{j=1}^{K} e^{zj} for i = 1, . . . , K.]
Figure 2: The softmax model as a shallow neural network model with output ŷ which is composed
of elements ŷ1 , . . . , ŷK , representing a probability vector.
converts a vector of dimension K to a different vector of the same dimension K, where all
elements of the output vector are probabilities, and they sum up to 1.
The form of (4) positions the softmax model as a shallow neural network similarly to the way
that the sigmoid model is a shallow neural network. The difference is that the softmax model has
vector outputs while the sigmoid model has a scalar output. Figure 2 illustrates the softmax
model as a neural network. Here each circle can again be viewed as an “artificial neuron”;
however, note that the softmax function affects all neurons together via the normalization in
the denominator of the softmax function SSoftmax( ). Hence the activation value of each neuron
is not independent of the activation values of the other neurons.
The softmax model, also known as the multinomial regression model, is the most popular
model for multi-class classification, where the goal is to provide a prediction Ŷ for the label in
{1, . . . , K} associated with the input feature vector x. The softmax model produces an output
vector, ŷ, which is a probability vector, and can be used to create a classifier by choosing the
class that has the highest probability. Namely,

    Ŷ = argmax_{k ∈ {1,...,K}} ŷk,        (6)

and this means to choose the index k from the set {1, . . . , K} for which the entry (probability)
ŷk is highest. This approach is called a maximum a posteriori probability (MAP) decision rule
since it simply chooses Ŷ as the class that is most probable. It is the most common decision
rule when using deep learning models for classification.
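As a companion to (4) – (6), here is a minimal Python (NumPy) sketch of Model II; the parameters W and b and the input x are made-up (untrained) values used purely for illustration:

import numpy as np

def softmax(z):
    # The softmax function of (4): exponentiate, then rescale so the entries sum to 1
    e = np.exp(z)
    return e / e.sum()

# Made-up parameters for p = 3 input features and K = 4 classes
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))    # weight matrix W with K = 4 rows and p = 3 columns
b = rng.normal(size=4)         # bias vector b of dimension K = 4
x = np.array([1.0, 0.5, 2.0])  # input feature vector x

z = b + W @ x                  # each z_i equals b_i plus a sum of w_{i,j} x_j, as in (5)
y_hat = softmax(z)             # vector of K probabilities summing to 1
Y_hat = np.argmax(y_hat) + 1   # MAP decision rule of (6), with classes labeled 1, ..., K
print(y_hat, y_hat.sum(), Y_hat)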
Model III: General fully connected neural networks, also known as feedforward
networks. Using equations, this model can be described as,

    ŷ = f[L]( f[L−1]( · · · f[2]( f[1](x) ) · · · ) ),        (7)

where each individual function f[ℓ]( ), operating on some input u, can be described as,

    f[ℓ](u) = S[ℓ]( b[ℓ] + W[ℓ] u ).        (8)
Here in (7) we see composition of functions, where first the function f [1] ( ) is applied to the
input x, and then the function f [2] ( ) is applied on the output of f [1] (x), and then f [3] ( ) is
applied on the output of f [2] (f [1] (x)), and so forth until the last function f [L] ( ) is applied on
what was computed prior. Each such function application is the operation of a layer in a deep
neural network.
Different types of neural networks have different types of layers, and in the simplest case
of feedforward fully connected neural networks, (8) describes the operation of layer ℓ. Observe
that (8) is somewhat similar to the left side of (4). In the case of (8), we have that S [ℓ] ( ) is
some vector activation function, while in the case of (4) there is only a single layer (and hence
ℓ is absent) and the activation function is the softmax function.
The parameters of each layer ℓ are also similar to Model II (which only has a single layer).
Here with Model III, each layer has a bias vector b[ℓ] and a weight matrix W [ℓ] . The dimensions
of these vectors and matrices are specific to the architecture of the network, and we fill in these
details in Section 9. At this point, let us appreciate that Model III is more complex. For
one thing, it has the rich set of parameters (b[1] , W [1] ), . . . , (b[L] , W [L] ) which can easily involve
hundreds of thousands, millions, or billions of individual scalar parameters. With so many
parameters organized in layers, it is generally much more expressive and can capture complex
relationships in data. Figure 3 presents a schematic of such a deep neural network.
Figure 3: A fully connected feedforward deep neural network with multiple hidden layers. The input
to the network is the vector x = (x1 , x2 , x3 , x4 ) and in this case the output is the scalar ŷ. For this
particular example the dimensions of W [1] are 4 × 4, W [2] is 3 × 4, W [3] is 5 × 3, and W [4] is 1 × 5.
The bias vectors are of dimension 4 for b[1] , 3 for b[2] , 5 for b[3] , and 1 (a scalar) for b[4] .
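For readers who like to see code, here is a minimal Python (NumPy) sketch of the forward pass of the network in Figure 3. The parameters are random (untrained) arrays with the dimensions listed in the caption, and the sigmoid is used for every activation purely for illustration; the actual choice of the activation functions S[ℓ] is discussed in Section 9.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)

# Random (untrained) parameters with the dimensions stated in the caption of Figure 3
Ws = [rng.normal(size=(4, 4)),   # W[1] is 4 x 4
      rng.normal(size=(3, 4)),   # W[2] is 3 x 4
      rng.normal(size=(5, 3)),   # W[3] is 5 x 3
      rng.normal(size=(1, 5))]   # W[4] is 1 x 5
bs = [rng.normal(size=4), rng.normal(size=3), rng.normal(size=5), rng.normal(size=1)]

x = np.array([0.2, -1.0, 0.5, 1.5])   # input vector x = (x1, x2, x3, x4)

# Composition of layers as in (7), where each layer applies (8)
u = x
for W, b in zip(Ws, bs):
    u = sigmoid(b + W @ u)   # f[l](u) = S[l](b[l] + W[l] u), here with a sigmoid activation
y_hat = u[0]                 # the final layer has a single output, so y_hat is a scalar
print(y_hat)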
Assume that our data is x(1), . . . , x(n) where each x(i) is some sample which is a list or vector
of length p. We can also denote x_j^{(i)} as the j-th element (or coordinate) of the i-th data sample.
Hence we can think of the data as an n by p table, or matrix, where each row is a sample, and
each column is a specific part of that sample, sometimes called a feature.
For instance, consider a dataset consisting of medical measurements for patients, where each
patient’s data includes their age, blood pressure, and cholesterol level. This means that p = 3.
As an example, this data can be arranged as a table with one row per patient and one column per
measurement; we refer to this data table as (9). For each feature j we can compute the sample
mean x̄j and the sample variance s²j via

    x̄j = (1/n) Σ_{i=1}^{n} x_j^{(i)},        s²j = (1/n) Σ_{i=1}^{n} ( x_j^{(i)} − x̄j )².        (10)

Further, the sample standard deviation is the square root of the sample variance and is denoted
via sj. Note that in statistics one sometimes divides by n − 1 instead of n for the sample
variance, but this is a detail we will skip here. Our purpose in presenting (10) is also that we
review summation notation (with Σ, i.e., Sigma).
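As a small illustration of these computations, the following Python (NumPy) sketch standardizes a made-up 4 × 3 data matrix; the numbers are invented purely for illustration:

import numpy as np

# Made-up data with n = 4 patients (rows) and p = 3 features (columns):
# age, blood pressure, cholesterol
X = np.array([[63.0, 135.0, 210.0],
              [48.0, 120.0, 180.0],
              [71.0, 150.0, 240.0],
              [55.0, 128.0, 195.0]])

n = X.shape[0]
x_bar = X.mean(axis=0)                   # sample mean of each feature
s2 = ((X - x_bar) ** 2).sum(axis=0) / n  # sample variance of each feature (dividing by n)
s = np.sqrt(s2)                          # sample standard deviation of each feature

X_standardized = (X - x_bar) / s         # each column now has mean 0 and variance 1
print(x_bar, s)
print(X_standardized)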
To review summation notation, consider an expression where we have some list of 4 numbers,
h(1), h(2), h(3), h(4) with h(1) = 2, h(2) = 4, h(3) = 0, and h(4) = 10. Then this arbitrary
expression,

    Σ_{i=1}^{4} h(i),        (11)

is just shorthand for h(1) + h(2) + h(3) + h(4). It thus equals 16 in this example. One can think
of the variable i as “running” from i = 1 all the way up to i = 4. We could have also used
j instead of i or any other variable name. One could have obviously had more complicated
expressions with summation notation, such as for example,
    Σ_{i=1}^{4} (h(i) + h(5−i) − 11)²,        (12)
implies summing up all of the square roots of the elements of the set B (as the reader may verify,
the result is approximately 31.77). Note that our previous way of using summation notation
can also be written in terms of sets. For example, the summation in (11) can be written as,
    Σ_{i∈{1,2,3,4}} h(i),        (15)
as this shows that the variable i runs on each element of the set {1, 2, 3, 4}.
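For readers who find code helpful, here is how these summations look in Python (a small illustrative sketch):

# The summations in (11), (12), and (15) written directly in Python
h = {1: 2, 2: 4, 3: 0, 4: 10}   # the values h(1), h(2), h(3), h(4) from the text

total_11 = sum(h[i] for i in range(1, 5))                         # equation (11): equals 16
total_12 = sum((h[i] + h[5 - i] - 11) ** 2 for i in range(1, 5))  # equation (12)
total_15 = sum(h[i] for i in {1, 2, 3, 4})                        # equation (15): i runs over a set
print(total_11, total_12, total_15)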
One very important set is the set of real numbers, denoted R. Unlike the example sets A,
B, or {1, 2, 3, 4} which only have a finite number of elements, the set R has every number on
the real number line and is hence not a finite set. We again can speak about elements of R. So
for example it is true that 7 ∈ R, and it is also true that hello ∉ R. When we consider the
parameters of Model I in (1), we can write b0 ∈ R to signify that b0 is a real number.
We can also denote the set of real numbers as (−∞, ∞) implying that it is the interval of all
numbers that are greater than −∞ and less than ∞, and this means all real numbers. There
are other sets that we can denote in a similar way, for example [−1, 1] is the set of all numbers
greater than or equal to −1 and less than or equal to 1. Another option is the set (0, ∞) which
means the set of all positive numbers (greater than 0). The set [0, 1] is the set of all real numbers
between 0 and 1 inclusive, and this is basically all numbers that can describe a probability. A
related set (0, 1) contains all the numbers between 0 and 1 but without the boundaries 0 and
1.
One can study and discuss sets much further, and in a more formal manner, or even in very
formal means that relate to mathematical logic. But our purpose here is simply to introduce
minimal notation. For this, we present a few more sets when dealing with vectors and matrices
in the sections below. But first, at this point, with our basic understanding of sets, we are
ready to discuss the notion of a function.
Put simply, a function is an operation that transforms elements from one set to elements
of another set. We can for example denote our function as f and think that it operates on
inputs from the set A = {7, 2.5, hello} and gives us outputs from the set of real numbers R.
Formally this can be denoted as,
f : A → R, (16)
and this notation tells us that all of the possible inputs to the function f ( ) are the elements
of A. It further says that the outputs must be elements of the real numbers R. This declaration
of the function via (16) says that for every u ∈ A we should have an answer of what the function
gives, and we denote this as f (u). Note that we sometimes write f (·) to indicate the function
where “·” just stands for the place where the argument u should appear.
A declaration such as (16) does not define how the function works. To do so, we must
be more explicit either with a formula, or an algorithm, or a lookup table. In this example,
since the input set has a small number of heterogeneous elements, let us specify the function
via a lookup table approach. In particular we can state, f (7) = 3.4, f (2.5) = −2.1, and
f (hello) = 20.3. This would then specify exactly what the function f ( ) does for every u ∈ A.
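For illustration, such a lookup-table function can be written in Python as a dictionary (a small sketch using the set A and the output values from the text):

# The lookup-table function f : A -> R specified above
A = {7, 2.5, "hello"}                      # the input set A
f = {7: 3.4, 2.5: -2.1, "hello": 20.3}     # f(7) = 3.4, f(2.5) = -2.1, f(hello) = 20.3

for u in A:
    print(u, "maps to", f[u])              # every element of A has a defined output in R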
In other cases, we can specify how the function works with a formula. This is often common
for functions f : R → R. An arbitrary example of such a function from R to R is,

    f(u) = 3 cos(e^{u−2}).        (17)

The reader will have seen plots previously where such a function, or others, are plotted, where
on the horizontal axis we plot u and on the vertical axis we plot 3 cos(e^{u−2}).
every u ∈ R we have a specific y = f (u), and the plot is a connection of the points (u, f (u))
on the plane, essentially for every u ∈ R (or realistically on some smaller set which defines
the domain of the plot). Also note that in this case the function (17) is composed of other
operations and functions such as the cosine function and the exponentiation function. Indeed
function composition is a very common operation where outputs of one function are used as
inputs of another. An example is with Model III as in (7) where we have L − 1 function
compositions and each function represents the operation of a layer of neural network.
In deep learning we use functions in multiple ways. One way is for constructing models
such as I – III. Another way is for specifying the whole model as a function. A third way is for
construction of loss functions. Let us now highlight such uses.
First in terms of construction of models, as we can see for Model I in (1), we define the
sigmoid function
σSig : R → [0, 1]. (18)
This function takes any real valued scalar as input, denoted as z in (1), and the output is a
number in the range [0, 1]. Note that for this sigmoid function we have a formula, 1/(1 + e−z ),
which exactly describes how to compute σSig (z). A schematic plot of this function is inside the
right circle in Figure 1.
Further, for Model II in (4) we have the softmax function, SSoftmax (z) as a building block.
This function can be declared as,
SSoftmax : RK → RK , (19)
since it takes K-dimensional vectors as inputs and returns K-dimensional vectors as outputs.
We describe it further in the next section after we discuss vectors. Similarly, for Model III,
we have activation functions for deep feedforward neural networks, S [ℓ] (·). We discuss such
activation functions in Section 9.
Now a different use of functions for the neural network models I, II, and III is that the whole
model is a function that converts some input x to an output ŷ. In terms of all three models,
I – III, the input x is a p-dimensional list of numbers, or vector, a notion further discussed in
the next section. For now, let us agree that this set is denoted as Rp . The output is either a
scalar value (an element of R or an element of [0, 1]) or a vector which is an element of RK (a
vector or list of K numbers). In particular, here are the functions described by these models.
For Model I, yielding a probability output, we can declare the function of the model as,

    f(b0,w) : Rp → [0, 1],        (20)

and the representation in (20) then tells us that inputs to the model are x ∈ Rp and outputs are
going to be ŷ ∈ [0, 1]. Now notice that we also decorate the function name f with a subscript
(b0 , w), and this subscript signifies the parameters of the model. In particular, when we train
a deep learning model such as this, we find suitable values of the parameters b0 and w for the
data. These parameters then exactly dictate the specifics of our model function f(b0 ,w) ( ). The
way the function actually works was already specified in (1). However, some specifics of that
presentation, such as the inner product w⊤ x, are explained in the next section. In Figure 4 we
see two example plots for two different instances of (20) where each time we plot ŷ = f(b0 ,w) (x).
In (a) we see a case with p = 1, and we also see data points for which the function was fit. In
(b) we see a case with p = 2, this time without the data points.
Figure 4: Probability output using Model I. (a) A p = 1 model with one feature. (b) A p = 2 model
based on two features.
For Model II, the function of the model can be specified as,
f(b,W ) : Rp → RK , (21)
and in particular the output in this case is a K-dimensional vector. The parameters of the
model are (b, W ), and more details about how the model operates are in the sequel.
For Model III, we leave the specification of the type of output open. In some cases it can
be a single probability value, so we may specify the output as a value in the set [0, 1]. In other
cases it can be a single real number, as one typically has in regression problems. Thus the
output can be specified as a value in R. We can also use Model III for multi-class classification
like Model II and then the output is a vector as with Model II, so the output is a value in RK
(also sometimes denoted Rq ). One way to write this is,
f(b[1] ,W [1] ),...,(b[L] ,W [L] ) : Rp → O, where O is [0, 1], or R, or Rq . (22)
In all these scenarios of Model III, the input is similar to the other two models, but for the
output type there are several options. Observe also the rich set of parameters that the model
has, namely, (b[1] , W [1] ), . . . , (b[L] , W [L] ).
Another use of functions in deep learning is loss functions. In general when we train a
model we are given a fixed dataset and wish to find the best set of parameters such that the
model fits the data. For example with Model II we would seek the best possible vector b and
matrix W that we can find to match the data. The way that this “bestness” is quantified is via
a function that we artificially construct for the model. Unlike the model functions (20), (21),
and (22) which operate on the input x, the loss function is a function of the parameters, and
the training data is fixed for this function and determines its shape.
Figure 5: An example plot of an hypothetical loss function (also known as loss landscape) when
there are two parameters. (a) A 3D surface plot where we see that the function has multiple valleys
(local minima). (b) The same function can be plotted using a contour plot, where each line along a
contour maintains the same value of the loss function (like a topographical map).
Now since each type of model has a different set of possible parameters, it is common to
just call all the parameters θ. In this case θ can signify (b0 , w) for Model I, (b, W ) for Model II,
and so forth. We further call the set of all possible parameters Θ (and the specific form of
this set depends on the model type that we use). The loss function can then be written as,
CData : Θ → R, (23)
where the subscript, “Data”, reminds us that the function’s actual form depends on the training
data that we have. Now in many cases, each θ ∈ Θ contains many parameters (numbers); this is
especially the case with Model III, where one can quickly get millions of parameters in a
deep neural network.
The act of learning the parameters, or training the model, is the act of finding some θ ∈ Θ
where CData (θ) is as low as possible or close to the lowest value. This is an optimization problem
where we try to minimize the loss CData ( ).
Now when the number of parameters in each θ is 1 or 2, it is possible to plot the loss
function and then visualize the minimization of the loss. Such plots are typically not done
for operational purposes but rather for pedagogic purposes. In particular, the case of θ being
composed of two numbers θ1 and θ2 is easy to plot, and such plots allow us to also think about
the techniques and challenges of minimizing loss in general. In Figure 5 we present a plot of
such an hypothetical loss function.
5 Vectors
Now that we understand the basics of sets and functions, let us advance to the concept of a
vector. As already seen in Models I – III, the input x to the model is a vector. This is a
list of numbers which encodes some form of input data. Vectors can also appear as outputs.
Specifically, in Model II, and in some variants of Model III, the output, ŷ is also a vector. For
example, with Model II and K = 4 we may have an output vector such as,
ŷ = [ 0.14 ]
    [ 0.02 ]
    [ 0.65 ] ,        (24)
    [ 0.19 ]
which in this case marks the probabilities of the four classes of classification (observe that in
this example the numbers are non-negative and sum up to 1). When we have a vector, we
often relate to the individual components via a subscript, so that for example ŷ1 = 0.14. In
this example vector, we see that ŷ3 = 0.65 is the highest probability, so if this is a classification
output, then we would choose Ŷ = 3 as our classification choice using (6).
The set of vectors of length or dimension n is denoted as Rn . So for example, for the vector ŷ
of (24) we can write ŷ ∈ R4 . Note that in different texts, vectors are denoted differently. For us,
let us also write the same vector as ŷ = (0.14, 0.02, 0.65, 0.19). The use of the round brackets,
( ), is sometimes associated with the term tuple, which is similar to a vector and for our purposes
may mean the same thing. Also note that the way we wrote the vector in (24) is as a column
vector (the vector is standing up). This way of writing is especially relevant when we do matrix-
vector multiplication in the next section. Similarly, one may write ŷ = [0.14 0.02 0.65 0.19]⊤ ,
where in this case we first wrote a row vector [0.14 0.02 0.65 0.19] and then applied transpose
to it using the ⊤ symbol in superscript. For vectors, such transposition simply converts them
from column to row, and vice-versa. More on this operation appears below in the context of
matrices.
For a data scientist and a machine learner, vectors are first and foremost considered as
lists of numbers. But in fields such as physics, or applied mathematics, vectors also typically
describe directions and magnitudes. This is very apparent in two dimensional spaces or three
dimensional spaces. When we discuss gradient vectors in Section 8 below, we also consider
vectors as directions in space. But for now first let us think of vectors as lists of numbers,
exactly playing the role of inputs, outputs, parameters, and intermediate computations in
models such as our example models I – III.
Note that single numbers, known as scalars, can be compared in the sense that one is greater
than the other (as long as they are real numbers and not complex numbers which we do not often
use in deep learning). For example we know that 2.4 is greater than −8.3, denoted −8.3 < 2.4.
If we treat all the scalar real numbers as the set R, then we say there is an order on R, because
for every u ∈ R and v ∈ R one can determine if u < v is true or false. With vectors, unlike
scalars, there is not such an obvious order.
While there is not an obvious universal way to order vectors, we can associate scalars
(numbers) with the distances between two vectors, as well as with the length of individual
vectors, and these scalars can then be used for ordering and other purposes. One of the most
basic ways to do this is the inner product which for a vector u ∈ Rn and a vector v ∈ Rn , is
written as u⊤ v or sometimes as u · v. The inner product is computed as,
    u⊤v = Σ_{i=1}^{n} ui vi = u1 v1 + u2 v2 + · · · + un vn.        (25)
Hence it is a summation of the products of the individual entries of the vectors. So for example,
as the reader may verify for u = (2, 0, −3) and v = (1, −12.5, 2), the inner product is u⊤ v =
−4. When the inner product is near 0 it means that the vectors are quite different, whereas
when the inner product is far from 0 (either positive or negative) it means that the vectors
are similar. When the inner product is exactly 0 we say that the vectors are orthogonal. In
a geometric representation this means the vectors are perpendicular. We do not dive into
such a representation of vectors, and instead the interested reader can visit a text such as
[Boyd and Vandenberghe, 2018] for further reading.
If we now return to Model I and revisit (1) and (2), then we can observe the inner product
w⊤ x in these equations. Here w is the vector of weight parameters of the model and x is the
model input.
One can also compute the norm of a vector u ∈ Rn , denoted as ∥u∥ and computed as the
square root of the inner product of the vector with itself (this is sometimes called the L2 norm
or the Euclidean norm). That is,
    ∥u∥ = √(u⊤u) = √( Σ_{i=1}^{n} ui² ).        (26)
For example for u = (2, 0, −3) as before, ∥u∥ = √13 ≈ 3.6. The norm of a vector is always
a non-negative number and is exactly 0 if and only if all the entries of the vector are 0, and
otherwise it is strictly positive. So while we cannot order vectors in a unique manner, we can
order vectors based on their norm which is a number that summarizes their magnitude.
Now that we have norm, we can also consider a scaled version of the inner product which is
called the cosine of the angle between two vectors. For u ∈ Rn and v ∈ Rn , as long as neither
of these vectors is all 0, this is computed as,
    Cosine of the angle between u and v = u⊤v / (∥u∥ ∥v∥) = ( Σ_{i=1}^{n} ui vi ) / ( √(Σ_{i=1}^{n} ui²) · √(Σ_{i=1}^{n} vi²) ).        (27)
So for example for the u and v examples above, the cosine of the angle is about −0.087. The
cosine of the angle is always between −1 and 1 and the closer it is to 0 the less similar the
vectors are in some sense. It is exactly 0 if the vectors are orthogonal.
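The reader can verify the example computations above with a short Python (NumPy) sketch such as the following:

import numpy as np

u = np.array([2.0, 0.0, -3.0])
v = np.array([1.0, -12.5, 2.0])

inner = u @ v                                    # inner product (25): equals -4
norm_u = np.linalg.norm(u)                       # norm (26): sqrt(13), about 3.6
cosine = inner / (norm_u * np.linalg.norm(v))    # cosine of the angle (27): about -0.087
print(inner, norm_u, cosine)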
All of the above definitions and computations of vectors are very common to use in deep
learning. One more thing that we do is arithmetic with vectors. The most basic operation is to
take a scalar (a single number) and multiply each element of the vector by this scalar. This is
called scalar multiplication (or sometimes scalar-vector multiplication). So for a scalar, α ∈ R
and a vector u ∈ Rn , the scalar multiplication α u is a new vector where at coordinate i it has
α ui . As an example, let us return to u = (2, 0, −3) from above, and say that α = −4, we have
that α u = (−8, 0, 12).
Let us connect this to the softmax function. If we consider the right side of (4) then we
can observe that this is in fact a case of scalar multiplication. In particular the expression
1/Σ_{i=1}^{K} e^{zi} is a scalar which multiplies the vector (e^{z1}, . . . , e^{zK}). In this case, given some input
z ∈ RK, the softmax function transforms z such that all entries are positive via the exponen-
tiation. It also ensures the sum of the entries is 1 via the scalar multiplication. Importantly,
the result of the softmax transforms an arbitrary vector of numbers to a vector of probabil-
ities, maintaining the same order. For example if as input to the softmax function we have
z = (0.03, −1.91, 1.56, 0.34), then it is already evident that the third entry z3 = 1.56 is the
maximal, but this is not quantified in terms of probabilities. Then after exponentiation we have
(e^{z1}, e^{z2}, e^{z3}, e^{z4}) = (1.03, 0.148, 4.807, 1.405). Now we can compute that 1/Σ_{i=1}^{K} e^{zi} = 0.1353,
and then by applying scalar multiplication of this value by (e^{z1}, e^{z2}, e^{z3}, e^{z4}), we arrive at ŷ as
in (24) (note that this is approximate due to rounding).
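The same arithmetic can be checked with a few lines of Python (NumPy):

import numpy as np

z = np.array([0.03, -1.91, 1.56, 0.34])
e = np.exp(z)             # exponentiation makes all entries positive
scale = 1.0 / e.sum()     # the scalar 1 / (sum of the exponentiated entries)
y_hat = scale * e         # scalar multiplication gives a probability vector
print(e, scale, y_hat, y_hat.sum())   # y_hat is approximately (0.14, 0.02, 0.65, 0.19)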
In addition to scalar multiplication we can also add two vectors of the same dimension.
This works by adding the individual matching coordinates. So for u ∈ Rn and v ∈ Rn , the
addition u + v is a new vector, where at coordinate i it has ui + vi . So again with u and v
as in the example above, we have that, u + v = (3, −12.5, −1). Now subtraction can also
be defined by scalar-multiplying the second vector by −1 and then adding. So for example
u − v = (1, 12.5, −5). Note that if we subtract two vectors that are equal then we get a vector
of all 0 values, called the zero vector.
Having vector subtraction allows us to use the vector norm to define the distance between
two vectors. In particular, given a vector u ∈ Rn and a vector v ∈ Rn , the distance (also known
as Euclidean distance) between the two vectors is ∥u − v∥. Hence we first subtract the vectors
and then compute the norm of the answer. That is,
    Distance between vectors u and v = ∥u − v∥ = √( Σ_{i=1}^{n} (ui − vi)² ).        (28)
This number is never negative, and the closer it is to 0 the closer that u is to v. For the example
vectors in R3 above, we have that ∥u − v∥ = 13.5. Note that sometimes we consider a similar
quantity without the square root. This is naturally denoted as ∥u − v∥2 , and it can sometimes
be called the squared error between the vectors. It can also be represented in terms of the
inner product of the difference u − v with itself,

    Squared error between u and v = ∥u − v∥² = (u − v)⊤(u − v) = Σ_{i=1}^{n} (ui − vi)².        (29)
It turns out that variants of the squared error as in (29) naturally play a role as loss functions
in deep learning (as well as classical statistics). In particular, one of the vectors, say u can play
the role of desired model output, often denoted y, whereas the other vector, v, is the predicted
model output, ŷ. In this case, ∥y − ŷ∥2 is a measure of how far the obtained output ŷ is from
what it should have been.
To make this more concrete, say we are using Model II for multi-class classification with
K = 4 classes. After the model is trained, one can then consider a test set of say 1000 samples
of inputs x(1) , . . . , x(1000) (each of these is a vector of some dimension p, e.g., p = 300). We
then apply the model on each of these vectors and get 1000 result vectors which we denote
as ŷ (1) , . . . , ŷ (1000) . Each of these results vectors look like the vector in (24) only generally has
different probabilities.
We now wish to compare the result vectors to what they would ideally be. For this, assume
that we also have outcomes which indicate for each observation if it is the first class, second
class, third class, or fourth class. One thing we can do with this is to create a set of vectors
called one-hot encoded vectors. If an observation is in the first class, the associated one-hot
encoded vector is (1, 0, 0, 0). If it is in the second class, the one-hot encoded vector is (0, 1, 0, 0).
And so forth. These vectors are also called the canonical unit vectors. Observe also that they
represent probability vectors, with the probabilities being degenerate in the sense that all the
probability mass is at one position while other positions have 0 probability. With this we
create the 1000 one-hot encoded vectors y (1) , . . . , y (1000) . We then define loss or error as
    Mean squared error between y and ŷ = (1/1000) Σ_{i=1}^{1000} ∥y(i) − ŷ(i)∥².        (30)
Here we are averaging the squared errors between each desired (one-hot encoded) y (i) and
obtained prediction ŷ (i) . If our model is perfect then this mean squared error will be 0, but
generally it is positive, yet the lower it is, the better our predictions.
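As an illustration of one-hot encoding and the mean squared error of (30), here is a small Python (NumPy) sketch with only 3 samples and made-up predicted probability vectors:

import numpy as np

labels = np.array([3, 1, 4])                  # true classes of the 3 samples (values in 1, ..., K = 4)
Y = np.eye(4)[labels - 1]                     # one-hot encoded vectors y(1), y(2), y(3) as rows

Y_hat = np.array([[0.14, 0.02, 0.65, 0.19],   # made-up predicted probability vectors
                  [0.70, 0.10, 0.10, 0.10],
                  [0.05, 0.05, 0.20, 0.70]])

mse = np.mean(np.sum((Y - Y_hat) ** 2, axis=1))   # average of the squared errors
print(mse)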
We note that in practice, for Model I and Model II we typically use a cross entropy loss
which differs from this simpler mean squared error. For details, see for example Chapter 3 of
[Liquet et al., 2024].
6 Matrices
Now that we have a basic handle on vectors, let us focus on matrices. In this short section,
we shall touch upon several uses of matrices within the context of deep learning. One use is
to organize stored data as a table. Another use that we touch on very lightly is for describing
covariances of variables. A third use, which is the most important for us, is linear transforma-
tions, and this is the role that the weight matrices (parameters), W in Model II, and W [ℓ] in
Model III, play. As with vectors, our exposition is only the tip of the iceberg and for a more
expanded introduction we recommend [Boyd and Vandenberghe, 2018] as a first reading.
A matrix is a list of numbers organized in rows and columns. For example this is the matrix
X with 4 rows and 3 columns,
X = [ 0.4    −1      4   ]
    [ 1.2     0    −0.5  ]
    [ 0      2.1     3   ]        (31)
    [ 5      2.1   −10   ]
We say that this is a 4 × 3 matrix and we can refer to each element as xi,j where i = 1, 2, 3, 4
denotes the index of the row and j = 1, 2, 3 is for the column. It is common to use capital
letters for matrices and then refer to individual elements with the lower case letters. For
example x3,2 = 2.1. One very basic use for matrices is to describe tabular data, similarly to
how we described data in Section 3. To match the description there we should notice that
x_j^{(i)} = xi,j (the observation or sample with index i, denoted via the superscript (i) in x_j^{(i)}, is the
row, and the feature j, denoted via the subscript j in x_j^{(i)}, is the column). We denote the set
of all matrices with m rows and n columns as Rm×n . Note that in the case of a data matrix
matching the data table in (9) we need a matrix in Rn×p because it has n rows (observations),
and p columns (features).
If the number of rows and the number of columns is equal, we say that the matrix is square.
The set of square matrices of dimension n is denoted as Rn×n . In addition to being square, if
all the elements xi,j where i ̸= j are 0, then we say the matrix is diagonal (it only has non-zero
entries on the diagonal which is all entries xi,i for i = 1, . . . , n). One very important type of
diagonal matrix is the n × n identity matrix, denoted I, which has 0 values everywhere except
on the diagonal where it has 1 values. For example this is the 3 × 3 identity matrix,
I = [ 1  0  0 ]
    [ 0  1  0 ] .        (32)
    [ 0  0  1 ]
Vectors can be viewed as special cases of matrices. For example consider the column vector, u
and the row vector v,
u = [ 2 ]
    [ 0 ] ,        v = [ 2  0  6 ] .        (33)
    [ 6 ]
Viewed as matrices we can say that u is a 3×1 matrix, and v is a 1×3 matrix (namely u ∈ R3×1
and v ∈ R1×3 ). We can also recall the transpose operator, ⊤, for vectors that converts a column
vector to a row vector with the same numbers, and vice-versa. Here, for the example values we
have for u and v in (33), we obviously see that u⊤ = v and v ⊤ = u.
Indeed for any matrix of dimension m × n, if we transpose the matrix we get a matrix of
dimension n × m where the (i, j)-th entry (row i and column j) of the transposed matrix, is the
(j, i)-th entry of the original matrix. For example, the transpose of the 4 × 3 matrix X from
(31) is the 3 × 4 matrix given by
X⊤ = [ 0.4    1.2    0      5   ]
     [ −1      0    2.1    2.1  ] .        (34)
     [  4    −0.5    3    −10   ]
When a matrix is square, applying the transpose to it does not change the dimensions.
Incidentally a square matrix A such that A = A⊤ is called a symmetric matrix, because any
entry ai,j on one side of the diagonal, equals the matching entry aj,i on the other side of the
diagonal. In statistics, data science, and some aspects of deep learning, an important place
where symmetric matrices arise is with variance and covariance descriptions of features under
study. In particular the covariance matrix (sometimes called the variance-covariance matrix),
is often denoted as Σ (not to be confused with the summation notation reviewed in Section 3),
and it captures the variances and covariances of the features under study. In particular the
(i, j)-th element of Σ is the covariance between feature i and feature j, and this equals the
(j, i)-th element as well. For the case where i = j, i.e., on the diagonal, this element is the
variance of feature i. We do not make use of such matrices further in this text, but refer the
reader to an elementary exposition such as in chapter 3 of [Nazarathy and Klok, 2021], or the
references therein.
One of the most important things that we can do with matrices is matrix multiplication.
For simplicity let us first take two square matrices, each in R3×3 .
A = [ 2  0  3 ]              B = [ 4  1  0 ]
    [ 0  1  1 ]    and           [ 1  0  0 ] .        (35)
    [ 2  0  0 ]                  [ 0  0  3 ]
Now, in this case, the product C = AB is a new 3 × 3 matrix, where the entry ci,j is the inner
product of the i-th row of A and the j-th column of B. For example at i = 2 and j = 3 we
have,
c2,3 = a2,1 b1,3 + a2,2 b2,3 + a2,3 b3,3 = 0 · 0 + 1 · 0 + 1 · 3 = 3.
In the same manner, to get all other 8 elements of the matrix C, we can do all other inner
products. As the reader can verify, the matrix C turns out to be,
C = AB = [ 8  2  9 ]
         [ 1  0  3 ] .        (36)
         [ 8  2  0 ]
Note that multiplication of scalars is commutative since for two scalars (numbers), a and b,
we always have that ab = ba. With matrices this is not the case. For example C̃ = BA yields
a different result to C = AB. As the reader may verify,
C̃ = BA = [ 8  1  13 ]
          [ 2  0   3 ]  ≠  C.        (37)
          [ 6  0   0 ]
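The reader can check these products with a short Python (NumPy) sketch:

import numpy as np

A = np.array([[2, 0, 3],
              [0, 1, 1],
              [2, 0, 0]])
B = np.array([[4, 1, 0],
              [1, 0, 0],
              [0, 0, 3]])

C = A @ B          # the product AB of (36)
C_tilde = B @ A    # the product BA of (37)
print(C)
print(C_tilde)
print(np.array_equal(C, C_tilde))   # False: matrix multiplication is not commutative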
Up to now we multiplied square matrices of the same dimension, but we can also, in certain
cases, multiply non-square matrices. The rules defined for matrix multiplication dictate that it
cannot be done for any two matrices, but only in certain cases. In particular take A ∈ Rm×n
and B ∈ Rn×r , then A has n columns and B has the same number, n, of rows. This means
that rows of A and columns of B are of the same dimension, n, and thus we can compute inner
products between rows of A and columns of B. Otherwise, if these dimensions do not match,
then matrix multiplication is not defined. So for example we can compute the product of a
4 × 3 matrix by a 3 × 7 matrix, but we cannot compute the product of a 4 × 3 matrix and a
4 × 7 matrix. Note that sometimes, depending on the dimension, we can compute a product
AB, but not the product BA, or vice versa.
We also mention that the identity matrix, such as the 3 × 3 example shown in (32) is special
in terms of multiplication. As the reader can verify, if we multiply either A or B from (35) by
I (from either side), then the result does not change. That is, AI = A, IA = A, BI = B, and
IB = B. This holds for any dimension of the identity matrix and any other matrix, where the
identity matrix and the other matrix can be multiplied (with matching dimensions). Hence I
in matrices acts like 1 in scalars (for any scalar a ∈ R, 1a = a and a1 = a).
For our purposes, an important form of matrix multiplication is when the second matrix
is actually a vector. In this case we call this matrix-vector multiplication. In particular take
W ∈ RK×p and take a column vector x ∈ Rp×1 (we could have just stated that x is an element
of Rp , but for purposes of matrix multiplication, we consider it as a matrix). This notation
matches the left side of (4) of Model II, where we multiply a K × p parameter matrix W by
the input vector x.
According to the rules of matrix multiplication, the multiplication W x yields a K × 1
matrix as a result, or simply a K dimensional vector. For example, here is a schematic of this
matrix-vector multiplication with K = 5 and p = 3,
[ w1,1  w1,2  w1,3 ]            [ w1,1 x1 + w1,2 x2 + w1,3 x3 ]
[ w2,1  w2,2  w2,3 ]   [ x1 ]   [ w2,1 x1 + w2,2 x2 + w2,3 x3 ]
[ w3,1  w3,2  w3,3 ]   [ x2 ] = [ w3,1 x1 + w3,2 x2 + w3,3 x3 ]        (38)
[ w4,1  w4,2  w4,3 ]   [ x3 ]   [ w4,1 x1 + w4,2 x2 + w4,3 x3 ]
[ w5,1  w5,2  w5,3 ]            [ w5,1 x1 + w5,2 x2 + w5,3 x3 ]

Here the left matrix is the parameter matrix W ∈ R5×3, the middle column is the input vector
x ∈ R3, and the right column is the output vector in R5.
As is evident, each entry of the output vector is the inner product between the associated
row of W and the input vector x.
We should also note that in (4) of Model II there is the bias vector b added to W x. This is
an addition of two vectors in RK . Thus we have the vector z = b + W x, represented as follows
(for an example with p = 3 and K = 5),
    [ b1 ]   [ w1,1 x1 + w1,2 x2 + w1,3 x3 ]
    [ b2 ]   [ w2,1 x1 + w2,2 x2 + w2,3 x3 ]
z = [ b3 ] + [ w3,1 x1 + w3,2 x2 + w3,3 x3 ] .        (39)
    [ b4 ]   [ w4,1 x1 + w4,2 x2 + w4,3 x3 ]
    [ b5 ]   [ w5,1 x1 + w5,2 x2 + w5,3 x3 ]
This representation of z in (39) exactly agrees with (5), which appeared earlier, before reviewing
vector and matrix operations. Note that (8) for Model III, defining the action of a layer as
S[ℓ](b[ℓ] + W[ℓ] u), can now also be understood as a similar operation to (38) and (39). We
provide further details in Section 9.
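A small Python (NumPy) sketch, with made-up values, illustrating (38) and (39) and checking them against the entry-by-entry formula (5):

import numpy as np

rng = np.random.default_rng(2)
K, p = 5, 3
W = rng.normal(size=(K, p))   # parameter matrix W with K rows and p columns
b = rng.normal(size=K)        # bias vector b of dimension K
x = rng.normal(size=p)        # input vector x of dimension p

z = b + W @ x                 # matrix-vector product plus bias, as in (38) and (39)

# The same computation written entry by entry, as in (5)
z_check = np.array([b[i] + sum(W[i, j] * x[j] for j in range(p)) for i in range(K)])
print(np.allclose(z, z_check))   # True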
8 Gradients
Before delving into gradients and their pivotal role in deep learning, let us briefly revisit a
fundamental concept from calculus, the derivative. At its core, the derivative provides us with
a measure of how a function f : R → R changes as its input varies. Represented by df/du, the
derivative signifies the slope of the tangent line to the function’s curve at a specific point u ∈ R.
The derivative can also be seen as a function, f′ : R → R, where f′(u) is the derivative at a
specific point u, namely f′(u) = df/du.
An estimate of the slope of the function (rise divided by run) at a specific point u is

    ( f(u + ∆) − f(u) ) / ∆,

where ∆ is a small positive number such that u and u + ∆ are two nearby points on R. We
can then treat the derivative of f ( ) at u as the limit of this slope as ∆ gets small. Formally
one can write this as,
    f′(u) = lim_{∆→0} ( f(u + ∆) − f(u) ) / ∆.
A deeper understanding of derivatives may require a review of basic calculus which we cannot
afford in this exposition. For this, we refer the reader to any basic calculus text, or online
resource. One interesting and enjoyable read which may help readers gain insight on this topic
is Burn Math Class: And Reinvent Mathematics for Yourself [Wilkes, 2016].
In deep learning we use derivatives for training. Consider first an hypothetical scenario
where we are training a model with a single parameter θ. Now, as denoted at the end of
Section 4, we have a loss function CData (θ), and we wish to minimize this function. For this we
can use the derivative dC/dθ to gain information about the slope of the function, and this gives
us an indication about the directions and the magnitudes that can be used in our optimization
procedure. Ultimately, with the aid of derivatives, we try to find the best θ for the loss. Note
that in this section we denote C as shorthand for CData .
Now, let us extend our perspective to a more complex scenario where our model has multiple
parameters, often denoted as θ = (θ1 , θ2 , ..., θd ), similarly to the presentation at the end of
Section 4. If we are in Model I then these d parameters are b0 and the p elements of the vector
w, so d = p + 1. If we are in Model II then those d parameters can be taken as the vector
b ∈ RK and the matrix W ∈ RK×p , so the vector θ is of dimension d = K + pK = K(p + 1).
If we are in Model III then θ is even more complex and constitutes the weights and biases in
all layers; details for Model III are in the next section. In any case, we treat all the individual
parameters in the vectors and matrices as one long vector θ with d elements.
We now generalize the notion of the derivative from one dimension to d dimensions using
the notion of a gradient. For a parameter point θ ∈ Rd , the gradient denoted as ∇C(θ) is
a vector in Rd which points at the direction of steepest ascent and has a magnitude (norm)
which captures how steep the function is in that direction. In fact, the gradient is composed of
individual partial derivatives, and ∇C(θ) = ( ∂C/∂θ1, ∂C/∂θ2, . . . , ∂C/∂θd ). Note that each
partial derivative ∂C/∂θj is the derivative of C(θ) with respect to θj assuming that all other
parameters are fixed.
Just like the derivative which can be viewed as a function, the gradient can also be viewed as
a function,
∇C : Rd → Rd . (40)
For loss functions C( ) with vector inputs of length 2 as illustrated in Figure 5, the gradient
can be drawn as an arrow on the plane. For loss functions with vector inputs of length 3,
the gradient is an arrow in 3 dimensional space. For loss functions of higher dimensions (it is
common in deep learning to have d in the order of millions or more), we as humans cannot
visualize the gradient, yet it describes the direction of movement of steepest ascent/increase.
Importantly, when we consider loss landscapes for deep learning, the gradient tells us the
slope of the terrain and points us in the direction where the loss increases most rapidly. Imagine
standing at a point on this landscape where the direction we would move to ascend the slope
most rapidly is precisely the direction of steepest increase. However, our goal is to descend into
the valleys, seeking the lowest points where the loss is minimized. To achieve this, we utilize
the negative of the gradient (scalar multiplication of the gradient by −1), as it points in the
direction of steepest decrease. In fact we further multiply the gradient by the scalar α, called
the learning rate, which controls how big of a step we take.
During the training of a deep learning model, our objective is to iteratively adjust the
model’s parameters in the opposite direction of the gradient. This process, known as gradient
descent, guides us downhill, helping us reduce our loss in the loss landscape. The key idea is to
use an update such as,
θ(t+1) = θ(t) − α∇C(θ(t) ), with α > 0, (41)
where θ(t) is the current parameter point and θ(t+1) is the next parameter point. We start at
some initial guess θ(0) then iterate via (41) to improve our parameters; namely to train our
model. The learning rate, α, specifies the size of the steps we take during the algorithm. The
learning rate is an example of an hyper-parameter in deep learning and it sometimes requires
tweaking for gradient descent to work well.
As an algorithm, using a sequence of steps which one may convert to code in a programming
language such as Python, R, or Julia, we may specify this basic gradient descent algorithm with
pseudo code. In particular these are the steps of the algorithm:

    Step 1: Initialize the parameters with an initial guess, θ ← θ(0).
    Step 2: Repeat steps 3 and 4:
    Step 3:   Compute the gradient ∇C(θ) at the current parameters.
    Step 4:   Update the parameters via θ ← θ − α∇C(θ).
    Step 5: Until a termination condition is satisfied, then return θ.
Note that the use of expressions such as a ← b in steps 1 and 4 means substituting the
value of the right hand side b into the left hand variable a. With an algorithm as specified
above, the parameters of the model θ are in memory at all times, and as the algorithm iterates
over steps 2 – 5, these parameters are updated each time step 4 is executed. Observe
that this step implements the update in (41). After the algorithm starts with an initial guess
θ(0) (step 1), each iteration of steps 2 – 5 yields a new update, where in the first iteration we
have θ(1) = θ(0) − α∇C(θ(0) ), in the second iteration we have θ(2) = θ(1) − α∇C(θ(1) ), and so
forth. Note that in this specification we do not go into the termination condition of step 5. See
Chapter 4 of [Liquet et al., 2024] for more details about this gradient descent algorithm and
its generalizations and variants, and in particular the most popular variant for deep learning
called ADAM, [Kingma and Ba, 2014].
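To make the algorithm concrete, here is a minimal Python sketch of gradient descent on a made-up two-parameter quadratic loss; for real deep learning models the gradient in step 3 would come from backpropagation rather than a hand-written formula:

import numpy as np

def C(theta):
    # A made-up loss landscape with d = 2 parameters, minimized at theta = (1, -0.5)
    return (theta[0] - 1.0) ** 2 + 2.0 * (theta[1] + 0.5) ** 2

def grad_C(theta):
    # Gradient of the loss above, written in closed form
    return np.array([2.0 * (theta[0] - 1.0), 4.0 * (theta[1] + 0.5)])

alpha = 0.1                      # learning rate
theta = np.array([3.0, 2.0])     # initial guess theta(0)

for t in range(100):             # iterate steps 2 - 5, applying the update (41) in step 4
    theta = theta - alpha * grad_C(theta)

print(theta, C(theta))           # theta approaches the minimizer (1, -0.5)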
Importantly, we should mention that step 3 computes the gradient in each iteration, and
hence this step can be computationally intensive for large deep learning models. Simple models
such as our Model I and Model II actually have closed form formulas for the gradient, but for
cases such as Model III, this is where the famous backpropagation algorithm is used. Namely,
the execution of step 3 is actually the result of running a whole other algorithm based on
ideas called backward mode automatic differentiation. Details of backpropagation are studied
in chapters 4 and 5 of [Liquet et al., 2024].
Figure 6: Illustration of gradient descent for 5 iterations, starting with θ(0) and getting near the
optimum θ∗ with θ(5). (a) A one-dimensional loss landscape C(θ) with θ ∈ R. (b) A two-dimensional
loss landscape C(θ1 , θ2 ) with θ = (θ1 , θ2 ) ∈ R2 .
As an illustration of gradient descent, see Figure 6. First in (a) we consider a model with
a single parameter θ ∈ R and the associated loss function C(θ) plotted with a minimum at
θ∗ . The gradient descent algorithm starts with an initial guess θ(0) and then updates to θ(1)
using θ(1) = θ(0) − α∇C(θ(0)). In this simple one dimensional case, the gradient at the point
θ(0) reduces to the derivative, C′(θ(0)), which is the slope of the tangent at the point θ(0). In our
example this derivative is negative, so by multiplying it by −α we move in the direction opposite
to the gradient (to the right). Using the same process, the next move is driven
by θ(2) = θ(1) − αC′(θ(1)). Note that in this case the slope of the tangent at the point θ(1) is
positive, so by multiplying by −α we move left, towards the minimum. In the figure we
show the iterates up to θ(5), which is not far from the minimal point θ∗.
In Figure 6 (b) we see a similar trajectory, only this time moving on the (θ1 , θ2 ) plane, and
the points are plotted on the surface as C(θ(i)). Here in this example, small steps move in the
direction of steepest descent, ultimately reaching a point close to the optimum θ∗ = (θ1∗ , θ2∗ ). The
reader should appreciate that with realistic problems, the number of parameters d can be in
the order of millions. Hence each gradient descent iteration is a step in such a high dimensional
space, which we cannot visualize.
9 Putting Bits Together Into a Deep Neural Network
Now we are ready to revisit Model III and fill in a few of the missing details. With deep neural
networks, like Model III, the versatility of the model allows us to in principle approximate any
function. In particular, say that in reality we have some unknown function f ∗ : Rp −→ Rq ,
which is only available to us via data points (x(i) , y (i) ) with x(i) ∈ Rp , y (i) ∈ Rq , and y (i) =
f ∗ (x(i) ). If it is a binary classification scenario then q = 1 and the output can be considered
as an element in [0, 1]. If it is a multi-class classification scenario then q = K and the output
can be considered as an element in RK . If it is a regression problem then again q = 1 and the
output can be any scalar value in R. Finally, in other applications we have vector output Rq
for some q > 1. In all of these cases, with Model III we can try and approximate f ∗ ( ) via the
model function (22).
Models I and II are shallow neural networks as they only involve a single layer. As such,
they are not able to approximate arbitrary functions very well. However, Model III, when
used with multiple layers, is a very rich model, and in principle, no matter what f ∗ ( ) we have
in reality, with enough training data and enough computation power for training (gradient
descent), we can obtain,
f_{(b[1] ,W [1] ),...,(b[L] ,W [L] )} (x) ≈ f ∗ (x). (42)
That is, in general, there exist some parameters (b[1] , W [1] ), . . . , (b[L] , W [L] ) that will enable our
model to approximate any function f ∗ ( ). See chapter 5 of [Liquet et al., 2024] for further
discussion about the expressivity of deep neural networks. Let us now dive into a few more
details of Model III.
Recall from (8) that every layer ℓ of the model is of the form f [ℓ] (u) = S [ℓ] (b[ℓ] + W [ℓ] u)
and the model consists of a composition of such layers via (7). We have already encountered
an affine operation similar to b[ℓ] + W [ℓ] u in the context of Model II where it was b + W x.
Like Model II we sometimes use a softmax for S [ℓ] ( ) in the last layer, ℓ = L, especially when
our goal is multi-class classification. Yet for inner layers, ℓ = 1, . . . , L − 1, we use a (scalar)
activation function σ : R → R, and then the structure of the function S [ℓ] ( ) is via element-wise
activations, S [ℓ] (z) = (σ(z1 ), . . . , σ(zN )), where z = (z1 , . . . , zN ). In general, σ( ) can be a
sigmoid function, as defined in (1), or it can be one of other common activation functions in
deep learning such as ReLU or Tanh; see chapter 5 of [Liquet et al., 2024] for details. Note
that in many texts one just writes, σ(b[ℓ] + W [ℓ] u) for the layer, implying that σ( ) is applied
element-wise.
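As a quick illustration of such an element-wise activation, the following short Python snippet (using NumPy, with an arbitrarily chosen example vector) applies the sigmoid function of (1) to each entry of a vector z:

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))   # the sigmoid of (1), applied element-wise

z = np.array([-1.0, 0.0, 2.0])                 # an arbitrary example vector
print(sigmoid(z))                              # approximately [0.269, 0.5, 0.881]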
We can now revisit Figure 3 which illustrates a small version of such a model with L = 4.
Observe that for this network, the input dimension is p = 4 and the output dimension is q = 1.
The function of each layer f [ℓ] (u) = S [ℓ] (b[ℓ] + W [ℓ] u) operates on the outputs of the previous
layer (or the input to the network in case ℓ = 1) and yields activation values (sometimes just
called neuron values) of layer ℓ. We denote these values as a[ℓ] and thus for ℓ = 1, . . . , L,

a[ℓ] = f [ℓ] (a[ℓ−1] ) = S [ℓ] (b[ℓ] + W [ℓ] a[ℓ−1] ), with a[0] = x, (43)

and the output of the network is the activation of the last layer, namely,

ŷ = a[L] . (44)
What remains to be specified are the dimensions of each of the activations in the network.
As the reader may inspect based on the number of neurons (blue nodes) per layer in Figure 3,
the three hidden layers have 4, 3, and 5 neurons respectively, while the output layer has a single
neuron. Unrolling the composition of the layers, we thus have that,
ŷ = S [4] (W [4] S [3] (W [3] S [2] (W [2] S [1] (W [1] x + b[1] ) + b[2] ) + b[3] ) + b[4] ). (45)
In fact, the reader may work out that the number of parameters is,

d = (4 × 4 + 4) + (3 × 4 + 3) + (5 × 3 + 5) + (1 × 5 + 1) = 61, (46)

where the four terms correspond to hidden layer 1, hidden layer 2, hidden layer 3, and the output layer, respectively.
Hence, when training such a model via gradient descent, each time we compute the gradient,
we obtain a gradient vector in R61 . More realistic models may have many more activations
(neurons), and sometimes more layers as well. Hence the number of parameters, d, easily
climbs to millions or more, and this is why training large deep neural networks may be very
expensive in terms of time, hardware, and power.
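As a sketch of how such a network is evaluated, the following Python code (using NumPy) implements the forward pass of the small network of Figure 3. The parameters are randomly initialized purely for illustration (in practice they are learned via gradient descent), and a sigmoid is used as the activation in every layer for simplicity.

import numpy as np

rng = np.random.default_rng(0)

# Layer dimensions of the small network: input p = 4, hidden layers with 4, 3, 5
# neurons, and a single output (q = 1).
dims = [4, 4, 3, 5, 1]

# Randomly initialized parameters, for illustration only.
Ws = [rng.normal(size=(dims[l + 1], dims[l])) for l in range(4)]
bs = [rng.normal(size=dims[l + 1]) for l in range(4)]

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def forward(x):
    a = x                                   # a^[0] = x
    for W, b in zip(Ws, bs):
        a = sigmoid(b + W @ a)              # a^[l] = S^[l](b^[l] + W^[l] a^[l-1])
    return a                                # y_hat = a^[L]

d = sum(W.size + b.size for W, b in zip(Ws, bs))
print(d)                                    # 61, matching (46)
print(forward(np.ones(4)))                  # a single output value in (0, 1)

The printed parameter count agrees with (46).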
While Model III is a fully connected network, convolutional neural networks (CNNs), which are
widely used for image data, are built around the convolution operation between an image and a
kernel of weights. As an example, suppose the input image x is represented as a matrix of
size 100 × 100 and the kernel W is a 3 × 3 matrix of weight parameters represented as,

      W  =  [ w1,1  w1,2  w1,3
              w2,1  w2,2  w2,3
              w3,1  w3,2  w3,3 ] . (47)
Note that the kernel matrix W is usually much smaller than the image x. We slide the kernel
over the image to perform the convolution operation. As in the case of Model III, the best
values for the weight parameters (wi,j entries in W ) are learned during the training process.
In the basic version of this example, the result z of the convolution operation is a matrix
of dimension (100 − 3 + 1) × (100 − 3 + 1) = 98 × 98 with the (i, j)-th element of z computed as
zi,j = Σ_{i′=0}^{2} Σ_{j′=0}^{2} xi+i′,j+j′ · wi′+1,j′+1 ,   for i, j ∈ {1, . . . , 98}. (48)
Here xi+i′ ,j+j ′ represents the pixel value at position (i + i′ , j + j ′ ) in the input image, and
wi′ +1,j ′ +1 represents the weight parameter at position (i′ + 1, j ′ + 1) in the kernel. For instance,
to compute the value z1,1 , we perform the calculation,

z1,1 = Σ_{i′=0}^{2} Σ_{j′=0}^{2} x1+i′,1+j′ · wi′+1,j′+1 . (49)
Note that z1,1 is the result of an inner product between the vectorized W and the vectorized
3 × 3 top-left submatrix of x.
Similarly, we compute all the elements of z by sliding the kernel over the image and perform-
ing such an inner product each time between the kernel W and the corresponding submatrix
of x. After performing the convolution operation between the image and the kernel, a convo-
lutional layer in a CNN applies an activation function, just like in Model III. This yields what
is sometimes called a feature map that highlights certain patterns or features present in the
image. Now just like in Model III, the feature map serves as input to subsequent layers in the
CNN for further processing and analysis.
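To illustrate the computation in (48), here is a short Python sketch (using NumPy) of this basic convolution operation. The function name and the example kernel are ours for illustration, and indices are zero-based as is usual in code, whereas (48) uses one-based indexing.

import numpy as np

def convolve2d_valid(x, W):
    """Slide the kernel W over the image x, taking an inner product at each position, as in (48)."""
    n, m = x.shape
    k, _ = W.shape
    z = np.zeros((n - k + 1, m - k + 1))
    for i in range(z.shape[0]):
        for j in range(z.shape[1]):
            # Inner product between the kernel and the k x k submatrix of x
            z[i, j] = np.sum(x[i:i + k, j:j + k] * W)
    return z

x = np.random.default_rng(1).normal(size=(100, 100))   # a 100 x 100 "image", random for illustration
W = np.ones((3, 3)) / 9.0                               # an illustrative 3 x 3 kernel
z = convolve2d_valid(x, W)
print(z.shape)                                          # (98, 98)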
It’s important to note that CNNs involve other operations, such as pooling, and various
architectural configurations including multiple channels (feature maps) per layer, skip connec-
tions, integration of fully connected layers, and others. See chapter 6 of [Liquet et al., 2024] for
more details. We also note that from an historical perspective, the work in [Krizhevsky et al., 2012]
was pivotal in highlighting the strength of deep learning, and convolutional neural networks in
particular.
While CNNs are excellent for tasks like image recognition, sequential data such as text often
requires a different approach. This is where sequence models like recurrent neural networks
(RNNs) and long short term memory (LSTM) models come into play. Unlike CNNs which
process the entire input at once, RNNs and LSTMs process the data sequentially one element
at a time. In doing so, these models maintain an internal state that captures information about
previous elements in the sequence.
One key challenge in using RNNs and LSTMs for natural language processing tasks is how
to represent words as numerical vectors such that these models can understand the data. This
is where word embedding becomes useful. Word embedding is a technique used to represent
words as vectors, where the vectors corresponding to similar words remain close to each other.
Via the vector representation of the words, similarity between any two words is measured by
the cosine of the angle between the corresponding two vectors using formula (27).
For example, the word king could be represented by the vector (0.41, 1.2, 3.4, −1.3) and
the word queen can be represented by a relatively similar vector such as (0.39, 1.1, 3.5, 1.6).
Then a completely different word such as mean might be represented by a vector such as
(−0.2, −3.2, 1.3, 0.8). One can now verify in this example that the cosine of the angle between
the vectors for king and queen is about 0.729, while the cosine of the angle between mean and
each of the other two words is lower: about −0.04 with king and 0.156 with queen.
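The reader may verify these numbers with a few lines of Python (using NumPy), where the helper cosine( ) computes the cosine of the angle between two vectors as in formula (27):

import numpy as np

def cosine(u, v):
    """Cosine of the angle between vectors u and v, as in (27)."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

king  = np.array([0.41, 1.2, 3.4, -1.3])
queen = np.array([0.39, 1.1, 3.5, 1.6])
mean  = np.array([-0.2, -3.2, 1.3, 0.8])

print(round(cosine(king, queen), 3))   # about 0.729
print(round(cosine(mean, king), 3))    # about -0.04
print(round(cosine(mean, queen), 3))   # about 0.156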
With such a word embedding approach, the typical way we process input text is to convert
each word (or sometimes a similar notion known as a token) to a vector. Hence the input
to a model is not just a single vector x, as is the case for Models I – III, but rather a
sequence of vectors. Then an RNN model or LSTM model processes this sequence one element at a time,
keeping an internal state and also resulting in an output sequence. This technique has proven
valuable for many language tasks including translation among others. For a description of how
classical models such as RNN models or LSTM models deal with such data, see chapter 7 of
[Liquet et al., 2024].
Recurrent neural networks, long short term memory models, and a few variants were the
main sequence models in deep learning up to recent times. However, in the last few years,
following the 2017 paper [Vaswani et al., 2017], a completely different approach, called trans-
former models, has emerged and is now the main tool used in contemporary large language
models. Transformers overcome limitations of RNNs and LSTM models, such as difficulty in
parallelization and difficulty in capturing long-range dependencies (even though LSTM models
are explicitly designed for enabling long range memory). Transformers address these limi-
tations primarily by leveraging an idea called the attention mechanism. Unlike RNNs
and LSTMs, which process inputs sequentially, transformers process all words simultaneously,
enabling highly efficient parallel computation. This parallelization makes transformers partic-
ularly well-suited for handling long sequences, such as those encountered in machine transla-
tion, document summarization tasks, and general interactions with large language models via
chat. While we leave a complete description of the transformers architecture to chapter 7 of
[Liquet et al., 2024], or other sources, let us see now how basics of the attention mechanism can
be described via inner products and the softmax function.
When processing each word, the attention mechanism allows the model to focus on relevant
parts of the input sequence. That is, by “attending” from each word to every other word,
we capture dependencies across the entire sequence. Imagine reading a long piece of text and
trying to summarize it. Instead of reading the entire text from start to finish every time, one
can generate a summary by focusing on the most relevant parts, or key words. This selective
focus is analogous to what the attention mechanism does. At a high level, we assign a weight
to each input word based on its relevance to the current word being processed. This weight
determines how much attention the model should pay to that input word when generating the
output associated with the current word.
Mathematically, to understand the attention mechanism, consider a sequence of words
x⟨1⟩ , . . . , x⟨T ⟩ where each x⟨t⟩ is a vector representing a word (or token) using our word em-
bedding scheme. Our goal is to compute the attention weights for a specific current word, x⟨t⟩ .
We begin by calculating a score (also called an alignment score) for all input words, x⟨j⟩ ,
based on their similarity to x⟨t⟩ . A basic form of such an alignment score uses the inner
product,
s(x⟨t⟩ , x⟨j⟩ ) = (x⟨t⟩ )⊤ x⟨j⟩ . (50)
These scores, computed for j = 1, . . . , T , are then passed through the softmax function,
Ssoftmax ( ), defined in (4), which squashes them into a probability vector (α1 , . . . , αT ). Namely
the attention weight of any input word j from the perspective of the current word t is,
αj = exp(s(x⟨t⟩ , x⟨j⟩ )) / Σ_{t′=1}^{T} exp(s(x⟨t⟩ , x⟨t′⟩ )) . (51)
The probability vector (α1 , . . . , αT ) represents how much attention each input word x⟨j⟩
should receive when we handle the current word x⟨t⟩ . Intuitively, due to the inner product
operation used in (50), input words that are more similar to the current word will have higher
attention weights, indicating that they are more relevant for generating the output. Conversely,
input words that are less relevant will have lower attention weights, meaning they contribute
less to the output generation process. By selectively attending to the most relevant parts of the
input sequence, the attention mechanism enables us to capture long-range dependencies and
learn complex patterns in the data. This yields a setup that is highly effective for a wide range
of natural language processing tasks, from language translation to text generation.
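As a final illustration, the following Python sketch (using NumPy, with a randomly generated toy sequence of embedding vectors) computes such attention weights via the inner product scores of (50) and the softmax of (51). This is only the basic, unparameterized form described above; practical transformer models additionally use learned transformations of the word vectors, which we do not show here.

import numpy as np

def attention_weights(X, t):
    """Attention weights for current word t over a sequence X of word vectors.

    X has shape (T, d): one embedding vector per word. Scores are inner
    products as in (50), passed through a softmax as in (51).
    """
    scores = X @ X[t]                          # s(x^<t>, x^<j>) for j = 1, ..., T
    scores = scores - np.max(scores)           # shift for numerical stability (does not change the result)
    exp_scores = np.exp(scores)
    return exp_scores / np.sum(exp_scores)     # the probability vector (alpha_1, ..., alpha_T)

X = np.random.default_rng(2).normal(size=(6, 4))   # toy sequence of T = 6 embedding vectors
alpha = attention_weights(X, t=2)
print(alpha, alpha.sum())                          # the weights sum to 1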
matrix inverses, and tensors. Finally we covered gradient vectors, the key component for
gradient descent learning (optimization). As we saw, this dense manifest of “basic mathematical
knowledge” can go a long way in helping to describe complex deep learning models, and we
revisited our Models I – III throughout, connecting the basic notation of these models to the
elementary mathematical principles.
In terms of models beyond our example models I – III, we briefly highlighted ideas of con-
volutions, word embedding, and the attention mechanism. Other aspects of deep learning that
we did not cover include generative models such as generative adversarial networks, variational
autoencoders, diffusion models, reinforcement learning, and graph neural networks. Admit-
tedly, some of these concepts may require more mathematical background than we provided
here. The reader may see chapter 8 of [Liquet et al., 2024] for an overview of this assortment
of topics.
References
[Boyd and Vandenberghe, 2018] Boyd, S. and Vandenberghe, L. (2018). Introduction to applied
linear algebra: Vectors, matrices, and least squares. Cambridge University Press.
[Goodfellow et al., 2016] Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning.
MIT Press.
[Howard and Gugger, 2020] Howard, J. and Gugger, S. (2020). Deep Learning for Coders with
fastai and PyTorch. O’Reilly Media.
[Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic
optimization. arXiv:1412.6980.
[Krizhevsky et al., 2012] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet
classification with deep convolutional neural networks. Advances in neural information processing systems.
[LeCun et al., 2015] LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature.
[Liquet et al., 2024] Liquet, B., Moka, S., and Nazarathy, Y. (2024). The Mathematical Engi-
neering of Deep Learning. CRC Press.
[Nazarathy and Klok, 2021] Nazarathy, Y. and Klok, H. (2021). Statistics with Julia. Springer.
[Prince, 2023] Prince, S. J. D. (2023). Understanding Deep Learning. MIT Press.
[Strang, 2019] Strang, G. (2019). Linear algebra and learning from data. Wellesley-Cambridge Press.
[Vaswani et al., 2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez,
A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need. In Advances in neural
information processing systems.
[Wilkes, 2016] Wilkes, J. (2016). Burn Math Class: And Reinvent Mathematics for Yourself.
Basic Books.