Chapter 2

Neural Networks for Machine Learning
Artificial Neural Networks (ANNs) are a series of mathematical models inspired by the
structure and function of biological neural networks in the human brain. These mod-
els abstract the complex interconnections of neurons to construct artificial neurons and
establish connections between them according to specific topological structures, thereby
simulating the information processing capabilities of biological neural networks. In the
field of artificial intelligence, ANNs are often referred to as neural networks (NN) or
neural models.
Neural networks were initially introduced as a principal connectionism model, with
the Parallel Distributed Processing (PDP) model [McClelland et al., 1986] being the most
popular during the mid to late 1980s. The PDP model is characterized by three main fea-
tures: 1) Information representation is distributed across multiple units; 2) Memory and
knowledge are stored in the connections between units; 3) Learning of new knowledge
occurs through gradual changes in the connection strengths between units. These fea-
tures have greatly influenced the development of modern neural network architectures,
enabling them to effectively process and store information in a manner similar to the
human brain.
Connectionist neural networks exhibit a variety of network structures and learning
methods. Early models emphasized biological plausibility, aiming to closely mimic the
structure and function of biological neurons. However, later models shifted their focus
towards simulating specific cognitive abilities, such as object recognition and language
understanding. This transition was largely driven by the introduction of error backpropa-
gation, a powerful learning algorithm that significantly improved the learning capabilities
of neural networks. With the ability to learn from large-scale datasets and the availabil-
ity of enhanced computational capabilities, such as parallel processing, neural networks
have made remarkable breakthroughs in various machine learning tasks, particularly in
processing perceptual signals like speech and images.
This chapter focuses primarily on neural networks that learn through error backprop-
agation, treating them as a type of machine learning model. From a machine learning
perspective, neural networks are generally regarded as nonlinear models, with the basic
units being neurons equipped with nonlinear activation functions. The numerous connec-
tions between neurons, each associated with a weight, contribute to the highly nonlinear
nature of neural networks. These connection weights are the parameters that need to be
learned, which can be achieved within the framework of machine learning using gradi-
ent descent methods. By adjusting the weights iteratively based on the error between the
predicted and desired outputs, neural networks can learn to model complex relationships
and solve a wide range of tasks.
2.1 Neurons
An Artificial Neuron, or simply a Neuron, is the fundamental building block of a neural
network, designed to model the structure and characteristics of biological neurons. A
neuron receives a set of input signals, processes them, and produces an output signal.
Biologists discovered the intricate structure of biological neurons in the early 20th
century. A typical biological neuron consists of multiple dendrites, which receive in-
put signals from other neurons, and a single axon, which sends output signals to other
neurons. When the accumulated input signals received by a neuron exceed a certain
threshold, the neuron becomes excited and generates an electrical pulse called an action
potential. This action potential propagates along the axon and is transmitted to other
neurons through synaptic connections at the axon terminals.
Inspired by the biological neuron, psychologist McCulloch and mathematician Pitts
proposed a simplified neuron model in 1943, known as the MP neuron [McCulloch et
al., 1943]. The structure of neurons in modern neural networks closely resembles that
of the MP neuron, with the main difference lying in the choice of the activation function
f . While the MP neuron utilized a step function that produced binary outputs (0 or 1),
modern neurons typically employ continuous and differentiable activation functions to
enable gradient-based learning.
Mathematically, a neuron receives D inputs x1 , x2 , · · · , xD , which can be represented
as a vector x = [x1 ; x2 ; · · · ; xD ]. The neuron computes a weighted sum of these inputs,
called the net input or net activation, denoted as z ∈ R:
z = Σ_{d=1}^{D} w_d x_d + b = w⊤x + b,
where w = [w_1; w_2; · · · ; w_D] ∈ R^D is the weight vector, and b ∈ R is the bias term. The weights represent the strength and importance of each input connection to the neuron. Each element w_i in the weight vector corresponds to the weight associated with the i-th input. A larger absolute value of w_i indicates that the i-th input has a stronger influence on the neuron's output. The sign of w_i determines whether the input has an excitatory (positive) or inhibitory (negative) effect on the neuron's activation.
The bias term b represents the neuron’s activation threshold. It allows the neuron to
shift its activation function to the left or right, effectively adjusting the neuron’s sensitivity
to input signals. A positive bias makes the neuron more likely to fire, while a negative
bias makes it less likely to fire. The bias term can be thought of as an additional input to
the neuron, with a fixed value of 1, and its associated weight is the bias term itself.
Together, the weighted sum of inputs w⊤ x and the bias term b determine the neuron’s
net input or activation, which is then passed through the activation function to produce
the neuron’s output. By adjusting the weights and bias during the learning process,
the neuron can adapt to different patterns and learn to respond appropriately to input
signals.
The net input z is then passed through a nonlinear activation function f (·) to produce
the neuron’s output or activation value a:
a = f (z).
The choice of the activation function f (·) is crucial, as it introduces nonlinearity into
the neuron’s computation, enabling neural networks to model complex relationships and
learn intricate patterns in data. Figure 2.1 illustrates the structure of a typical artificial
neuron, highlighting the input signals, weights, bias, net input, activation function, and
output.

Figure 2.1: A typical artificial neuron. The neuron receives input signals x_1, x_2, · · · , x_D, which are multiplied by their corresponding weights w_1, w_2, · · · , w_D. The weighted sum of the inputs and the bias b form the net input z, which is then passed through an activation function f(·) to produce the neuron's output a.

By organizing multiple neurons into layered structures and connecting them appropriately, we can construct powerful neural networks capable of learning and solving
complex tasks. The following sections will delve deeper into the architecture and learning
mechanisms of feed forward neural networks, exploring how they can be trained using
gradient descent methods to minimize the error between predicted and desired outputs.
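As a concrete illustration, the following is a minimal NumPy sketch of a single neuron's computation; the input, weight, and bias values are purely illustrative choices.

import numpy as np

def neuron(x, w, b, f):
    """Single neuron: output a = f(w^T x + b)."""
    z = np.dot(w, x) + b          # net input: weighted sum of inputs plus bias
    return f(z)                   # activation value

logistic = lambda z: 1.0 / (1.0 + np.exp(-z))
x = np.array([0.5, -1.0, 2.0])    # D = 3 inputs (illustrative values)
w = np.array([0.8, 0.2, -0.5])    # connection weights (illustrative values)
a = neuron(x, w, b=0.1, f=logistic)   # z = -0.7, so a ≈ 0.33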
2.2 Activation Functions

The activation function plays an important role in enhancing the representational power and learning ability of the network. An activation function is generally expected to have the following properties:
1. Continuity and Differentiability (with the allowance for a few points of non-
differentiability): A differentiable activation function enables the direct application
of numerical optimization methods, such as gradient descent, for learning network
parameters. Continuity and differentiability are crucial for the smooth flow of gra-
dients during the backpropagation process, allowing the network to learn effec-
tively.
2. Simplicity of the Function and its Derivative: The simplicity of activation func-
tions and their derivatives facilitates higher computational efficiency within the
network. However, there is a trade-off between the simplicity of the activation
function and its ability to capture complex patterns. Striking a balance between
these two factors is essential for optimal network performance.
3. Appropriate Range for the Derivative: The derivative of the activation function
should fall within a suitable range, not too large or too small, to maintain training
efficiency and stability. If the derivative is too large, it can lead to exploding gra-
dients, causing the learning process to become unstable. On the other hand, if the
derivative is too small, it can result in vanishing gradients, making it difficult for
the network to learn and update its parameters effectively.
Several commonly used activation functions in neural networks are introduced below:
Tanh Function: The Tanh function is another type of sigmoid function, defined as

tanh(x) = (exp(x) − exp(−x)) / (exp(x) + exp(−x)).

The Tanh function can be considered a scaled and shifted version of the Logistic function σ(x), since tanh(x) = 2σ(2x) − 1, with a range of (−1, 1).
Figure 2.2: Logistic and Tanh functions. The Logistic function squashes the input to the
range (0, 1), while the Tanh function squashes the input to the range (−1, 1), providing
zero-centered outputs.
Taking the Logistic function σ(x) as an example, its derivative is σ′(x) = σ(x)(1 − σ(x)). The first-order Taylor expansion of the Logistic function around zero is

g_l(x) ≈ σ(0) + x σ′(0) = 0.25x + 0.5,

and clipping g_l(x) to the interval [0, 1] gives the Hard Logistic function:

hard-logistic(x) = { 1        if g_l(x) ≥ 1
                   { g_l(x)   if 0 < g_l(x) < 1
                   { 0        if g_l(x) ≤ 0
                 = max(min(g_l(x), 1), 0)
                 = max(min(0.25x + 0.5, 1), 0).
Similarly, the first-order Taylor expansion of the Tanh function around zero is g_t(x) ≈ tanh(0) + x tanh′(0) = x, giving

hard-tanh(x) = max(min(g_t(x), 1), −1) = max(min(x, 1), −1).
Figure 2.3 depicts Hard Sigmoid-type activation functions, highlighting the simpli-
fied computational models that approximate the traditional Logistic and Tanh functions
with reduced computational complexity, while preserving their essential characteristics.
The piecewise linear nature of these approximations makes them more computationally
efficient compared to their smooth counterparts.
Figure 2.3: Hard Logistic and Hard Tanh functions. These piecewise linear functions
approximate the Logistic and Tanh functions, respectively, reducing computational com-
plexity while preserving the essential characteristics of the original functions.
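These piecewise definitions translate directly into code; a minimal NumPy sketch:

import numpy as np

def hard_logistic(x):
    # Clip the first-order Taylor approximation g_l(x) = 0.25x + 0.5 to [0, 1].
    return np.clip(0.25 * x + 0.5, 0.0, 1.0)

def hard_tanh(x):
    # Clip the first-order Taylor approximation g_t(x) = x to [-1, 1].
    return np.clip(x, -1.0, 1.0)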
ReLU The Rectified Linear Unit (ReLU) is defined as ReLU(x) = max(0, x). Advantages: ReLU involves only a simple comparison and is therefore computationally efficient. Its derivative is 1 for all x > 0, which alleviates the vanishing gradient problem to some extent and accelerates the convergence of gradient descent.
Disadvantages: The output of ReLU is non-zero-centered, introducing bias shifts to
subsequent neural network layers and affecting the efficiency of gradient descent. More-
over, ReLU neurons are relatively easy to "die" during training, a phenomenon known as
the Dying ReLU Problem. If, after an inappropriate update, a ReLU neuron in the first
hidden layer is never activated across all training data, its gradient will always be 0, and
it will never activate in future training. This problem can also occur in other hidden
layers.
Leaky ReLU Leaky ReLU maintains a small gradient γ for inputs x < 0, allowing for a non-zero gradient when the neuron is not activated, thus avoiding the issue of never being activated [Maas et al., 2013]. Leaky ReLU is defined as

LeakyReLU(x) = { x    if x > 0
               { γx   if x ≤ 0,

where γ is a small constant, such as 0.01.
Parametric ReLU (PReLU) further generalizes the idea by introducing a learnable parameter γ_i, allowing different neurons to have different slopes for the negative part of their input. For the i-th neuron it is defined as

PReLU_i(x) = { x      if x > 0
             { γ_i x  if x ≤ 0,

where γ_i is learned jointly with the other network parameters.
ELU (Exponential Linear Unit) aims to bring the benefits of ReLU-like functions while attempting to make the mean activations closer to zero, which speeds up learning. It is defined as

ELU(x) = { x                if x > 0
         { γ(exp(x) − 1)    if x ≤ 0,

with γ ≥ 0 controlling the saturation level for negative inputs. By bringing the mean activations closer to zero, ELU helps to alleviate the bias shift problem and accelerates the learning process.
Figure 2.4: ReLU, Leaky ReLU, ELU, and Softplus functions. These functions provide var-
ious nonlinear transformations, each with its own advantages and characteristics, con-
tributing to the diversity and adaptability of neural network architectures.
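For reference, the functions in Figure 2.4 can be written compactly in NumPy; the γ defaults below are common illustrative choices, not prescribed values:

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, gamma=0.01):        # gamma: small fixed slope for x <= 0
    return np.where(x > 0, x, gamma * x)

def elu(x, gamma=1.0):                # gamma: saturation level for x <= 0
    return np.where(x > 0, x, gamma * (np.exp(x) - 1.0))

def softplus(x):
    return np.log1p(np.exp(x))        # smooth approximation to ReLU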
GELU (Gaussian Error Linear Unit) weights the input by the probability that a Gaussian random variable X is no greater than it:

GELU(x) = x · P(X ≤ x),

where P(X ≤ x) is the cumulative distribution function of X, typically taken as the standard normal distribution N(0, 1).

Swish The Swish function is a self-gated activation function defined as swish(x) = x σ(βx), where σ(·) is the Logistic function and β is a learnable parameter or a fixed hyperparameter.
Figure 2.5: Swish activation function for different values of β. The Swish function acts
as a self-gated activation function, interpolating between a linear function and the ReLU
function based on the value of β.
Maxout The Maxout unit differs from the previous activation functions in that it takes the entire vector x ∈ R^D as input and computes K net inputs

z_k = w_k⊤ x + b_k,   k ∈ {1, . . . , K},

outputting their maximum, maxout(x) = max_k z_k, enabling it to learn a convex non-linear mapping from input to output. The Maxout function approximates any convex function with piecewise linear segments and is non-differentiable at a finite set of points.
The motivation behind using a vector input instead of a scalar input in the Maxout
unit is to increase the flexibility and representation power of the activation function. By
taking the maximum over a set of linear transformations of the input, the Maxout unit
can learn to adapt its shape based on the data, effectively approximating complex convex
functions.
Figure 2.6 illustrates the structure and behavior of the Maxout unit.
Figure 2.6: Maxout unit. The Maxout unit takes a vector input and computes the maxi-
mum over a set of linear transformations, enabling it to learn a convex non-linear map-
ping from input to output.
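A minimal NumPy sketch of a Maxout unit; the dimensions D and K and the random parameters are illustrative:

import numpy as np

def maxout(x, W, b):
    """Maxout unit: z_k = w_k^T x + b_k, output is max_k z_k."""
    z = W @ x + b                 # K net inputs, one per linear component
    return np.max(z)              # maximum over the K components

rng = np.random.default_rng(0)
x = rng.normal(size=3)            # D = 3 input vector
W = rng.normal(size=(4, 3))       # K = 4 weight vectors w_k as rows
b = rng.normal(size=4)            # K = 4 biases b_k
y = maxout(x, W, b)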
In practice, the choice of activation function is often task-dependent and may require experimentation and empirical evaluation. Different activation functions may perform better in different scenarios, and it is common to use a combination of activation functions within a single network, depending on the specific requirements of each layer.
2.3 Network Structures

Neural networks can be organized into different structures according to the type of data they process and how information propagates through them.

Feedforward Networks Feedforward networks, such as multi-layer perceptrons (MLPs) and convolutional neural networks (CNNs), propagate information in a single direction, from the input layer through the hidden layers to the output layer. Each layer transforms the output of the previous layer, building on simple features to identify more complex patterns and objects. This hierarchical structure allows feedforward networks to effectively capture the spatial dependencies in the data and make accurate predictions.
Memory Networks Memory networks, such as recurrent neural networks (RNNs) and
long short-term memory (LSTM) networks, are designed to process sequential data, such
as time series or natural language. These networks incorporate feedback connections,
allowing information to persist across multiple time steps. The motivation behind mem-
ory networks is to capture the temporal dependencies in the data and learn to store and
update relevant information over time. In an RNN, the hidden state at each time step is
updated based on the current input and the previous hidden state, allowing the network
to maintain a running memory of the sequence. This memory mechanism enables RNNs
to capture long-term dependencies and make predictions based on the entire sequence
history. However, RNNs may struggle with very long sequences due to the vanishing
gradient problem.
To address this issue, LSTMs introduce a more complex memory cell with gating
mechanisms that regulate the flow of information over time. This allows LSTMs to
selectively remember or forget information, making them more effective at capturing
long-range dependencies in sequences.
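The recurrent update described above can be sketched in a few lines; the tanh nonlinearity and weight names here are illustrative assumptions, not a specific library API:

import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    # The new hidden state depends on the current input x_t
    # and the previous hidden state h_prev.
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)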
Graph Networks Graph networks, such as graph convolutional networks (GCNs) and
graph attention networks (GATs), are designed to process graph-structured data, where
entities are represented as nodes and their relationships are represented as edges. The
motivation behind graph networks is to learn node embeddings that capture the struc-
tural information of the graph and enable tasks such as node classification, link predic-
tion, or graph classification.
In a GCN, the node embeddings are updated by aggregating information from the
node’s neighbors, allowing the network to capture the local graph structure. This process
is repeated for multiple layers, with each layer capturing a larger neighborhood around
the nodes. By learning node embeddings that incorporate both node features and graph
structure, GCNs can effectively solve tasks such as semi-supervised node classification.
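A minimal sketch of one GCN-style aggregation step, assuming a pre-normalized adjacency matrix A_hat with self-loops is supplied; this follows the common formulation H′ = f(A_hat H W) rather than any particular library's API:

import numpy as np

def gcn_layer(A_hat, H, W):
    """One graph convolution: aggregate neighbor embeddings, then transform.
    A_hat: (N, N) normalized adjacency matrix with self-loops
    H:     (N, F_in) node embeddings
    W:     (F_in, F_out) learnable weight matrix
    """
    return np.maximum(0.0, A_hat @ H @ W)   # ReLU nonlinearity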
GATs extend the idea of GCNs by introducing an attention mechanism that allows
the network to assign different importance to different neighbors during the aggregation
process. This enables GATs to capture more complex and non-local dependencies in
the graph and has been shown to improve performance on various graph-based tasks. In
summary, the choice of neural network structure depends on the specific characteristics of
the data and the task at hand. Feedforward networks are well-suited for structured data,
memory networks for sequential data, and graph networks for graph-structured data.
By understanding how information propagates through each type of network and how
this relates to their suitability for different applications, researchers and practitioners can
make informed decisions when designing neural network architectures for their specific
problems.
2.4 Feedforward Neural Networks
Notations for Describing Feedforward Neural Networks The following table presents
the notations used to describe feedforward neural networks.
Notation                        Meaning
L                               The number of layers in the neural network
M_l                             The number of neurons in layer l
f_l(·)                          The activation function of neurons in layer l
W^(l) ∈ R^{M_l × M_{l−1}}       The weight matrix from layer l − 1 to layer l
b^(l) ∈ R^{M_l}                 The bias vector of layer l
z^(l) ∈ R^{M_l}                 The net input (pre-activation) of neurons in layer l
a^(l) ∈ R^{M_l}                 The output (post-activation) of neurons in layer l
Let a^(0) = x. The feedforward neural network propagates information by iterating the following formulas:

z^(l) = W^(l) a^(l−1) + b^(l),
a^(l) = f_l(z^(l)),

for l = 1, · · · , L, with the network output given by a^(L).
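These two formulas are all that is needed for the forward pass; a minimal NumPy sketch, with layer sizes and a shared tanh activation chosen purely for illustration:

import numpy as np

def forward(x, weights, biases, f):
    """Iterate z^(l) = W^(l) a^(l-1) + b^(l) and a^(l) = f(z^(l))."""
    a = x                                 # a^(0) = x
    for W, b in zip(weights, biases):
        z = W @ a + b                     # net input of layer l
        a = f(z)                          # output of layer l
    return a                              # a^(L), the network output

rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]  # a 3-4-2 network
biases = [np.zeros(4), np.zeros(2)]
y = forward(rng.normal(size=3), weights, biases, np.tanh)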
Theorem 2.1 (Universal Approximation Theorem, Cybenko, 1989; Hornik et al., 1989) Let φ(·) be a non-constant, bounded, and monotonically-increasing continuous function. Let I_D denote the D-dimensional unit hypercube [0, 1]^D and C(I_D) the space of continuous functions on I_D. For any given function f ∈ C(I_D) and any ε > 0, there exist an integer M, real numbers v_m, b_m ∈ R, and real vectors w_m ∈ R^D for m = 1, · · · , M, such that the function

F(x) = Σ_{m=1}^{M} v_m φ(w_m⊤ x + b_m)   (2.2)

satisfies |F(x) − f(x)| < ε for all x ∈ I_D.
The Universal Approximation Theorem holds due to the combination of hidden layers
and nonlinear activation functions in the network architecture. The hidden layers allow
the network to learn a hierarchical representation of the input data, with each layer
capturing increasingly abstract features. The nonlinear activation functions, such as the
sigmoid or ReLU function, introduce non-linearity into the network, enabling it to model
complex, non-linear relationships between the inputs and outputs. As the number of
hidden neurons increases, the MLP becomes more expressive and can approximate a
wider range of functions.
The theorem demonstrates the computational power of neural networks to approxi-
mate a given continuous function but does not specify how to find such a network or its
optimality. Additionally, in machine learning applications, the true mapping function is
unknown, generally learned through empirical risk minimization and regularization due
to neural networks’ susceptibility to overfitting on the training set.
Feature Extraction One important application of MLPs is feature extraction, where the
network learns to transform raw input data into a more informative and discriminative
representation. The characteristics of input samples significantly impact the performance
of classifiers. In supervised learning, good features can greatly enhance the classifier’s
effectiveness. To achieve superior classification results, it is often necessary to transform
the sample's original feature vector x ∈ R^D into a more effective feature vector φ(x) ∈ R^{D′}, a process known as feature extraction.
Multi-layer feedforward neural networks can be viewed as a non-linear composite function φ : R^D → R^{D′}, mapping the input x to the output φ(x). By training an MLP on
a large dataset with known labels, the network can learn to extract meaningful features
that capture the underlying structure of the data. These learned features can then be
used as input to other machine learning algorithms, such as support vector machines or
decision trees, to improve their performance.
Loss Functions The choice of loss function plays a crucial role in the learning process of
an MLP. The loss function quantifies the discrepancy between the network’s predictions
and the true labels, providing a measure of how well the network is performing. Different
loss functions are suited for different types of tasks:
• Mean Squared Error (MSE): Commonly used for regression problems, where the
goal is to predict continuous values.
• Cross-Entropy Loss: Often used for classification problems, where the goal is to
predict discrete class labels.
The choice of loss function can significantly impact the network’s learning dynamics
and its ability to generalize to new data.
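In PyTorch, for instance, both losses are available as built-in modules; the tensors below are illustrative:

import torch
import torch.nn as nn

# Regression: mean squared error between predictions and continuous targets.
mse = nn.MSELoss()
loss_reg = mse(torch.tensor([2.5, 0.0]), torch.tensor([3.0, -0.5]))

# Classification: cross-entropy between class logits and integer labels.
ce = nn.CrossEntropyLoss()
logits = torch.tensor([[1.2, -0.3, 0.8]])   # one sample, three classes
labels = torch.tensor([0])                  # index of the true class
loss_cls = ce(logits, labels)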
Integration with Classifiers Given a training sample (x, y), an MLP can be used to
map the input x to a transformed feature vector φ(x). This transformed feature vector
can then be used as input to a classifier g(·):
ŷ = g(φ(x); θ),
where g(·) is a linear or non-linear classifier, θ represents the parameters of the classifier,
and ŷ is the output of the classifier. In some cases, the classifier can be integrated into
the MLP as the last layer of the network:
• Binary Classification: For problems where y ∈ {0, 1}, a single neuron with a sigmoid activation function can be used in the last layer. The network's output a^(L) ∈ R can directly serve as the conditional probability of the positive class:

  p(y = 1 | x) = a^(L).
• Multi-Class Classification: For problems where y ∈ {1, . . . , C}, the last layer can consist of C neurons with a softmax activation function. The output of the last layer ŷ ∈ R^C represents the predicted conditional probabilities for each class:

  ŷ = softmax(z^(L)),

where z^(L) ∈ R^C is the net input of the neurons in the last layer.
By integrating the classifier into the MLP, the network can directly output the con-
ditional probabilities of different classes, effectively combining feature extraction and
classification into a single model.
In this case, the network is typically trained with the cross-entropy loss

L(y, ŷ) = −y⊤ log ŷ,

where y ∈ {0, 1}^C is the one-hot vector representation of the true label y, and ŷ is the predicted output of the network.
2.4.4 Dropout
Deep neural networks have diverse architectures that range from shallow to very deep,
aiming to generalize from given datasets. Nonetheless, in this quest, they may learn not
just the features but the statistical noise, leading to a phenomenon known as overfitting.
Overfitting enhances model performance on the training dataset at the expense of its pre-
dictive power on new data points. Traditionally, regularization techniques that penalize
the network weights have been used to tackle overfitting, but they often fall short when
dealing with large network architectures or limited data.
Figure 2.8: Dropout Neural Net Model showing a standard neural net and a thinned net
produced by applying dropout.
Figure 2.9: A unit’s behavior during training with dropout probability p, and its adjusted
behavior during inference.
In a standard network, the feed-forward operation of a layer can be described as

z^(l+1) = w^(l) y^(l) + b^(l),
y^(l+1) = f(z^(l+1)),

where f denotes the activation function, such as the logistic sigmoid f(x) = 1/(1 + exp(−x)).
With the introduction of dropout, the operation adapts to incorporate a random thinning
as illustrated in Figure 2.10.
During training, the presence of each unit is determined by a Bernoulli random vari-
able with probability p, leading to the modified feed-forward equations:
r^(l) ∼ Bernoulli(p),
ŷ^(l) = r^(l) ⊙ y^(l),
z^(l+1) = w^(l) ŷ^(l) + b^(l).
Here, r^(l) is a vector of independent Bernoulli random variables, and ŷ^(l) represents the thinned output of layer l, which then feeds into the subsequent layer. This process effectively simulates training a vast ensemble of networks with shared weights.
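A direct NumPy transcription of these equations (here p is the retain probability, as in the text above):

import numpy as np

def dropout_forward(y, p, rng):
    """Thin a layer's output during training: keep each unit with probability p."""
    r = rng.binomial(1, p, size=y.shape)   # r^(l) ~ Bernoulli(p)
    return r * y                           # y_hat^(l) = r^(l) ⊙ y^(l)

# At inference no units are dropped; the outgoing weights (or, equivalently,
# the outputs) are scaled by p instead, as illustrated in Figure 2.9.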
In PyTorch, for example, dropout is provided by the nn.Dropout module (note that PyTorch's p is the drop probability, and scaling is applied during training via inverted dropout):

import torch
import torch.nn as nn

dropout = nn.Dropout(p=0.5)              # each element is zeroed with probability 0.5
output = dropout(torch.randn(4, 10))     # active in training mode; identity in eval mode

Figure 2.10: Comparison of the basic operations of a standard network versus a network employing dropout.
2.5 Backpropagation Algorithm

Assume the use of stochastic gradient descent for neural network parameter learning. Given a sample (x, y), we input it into the neural network model to obtain the network output ŷ. Let the loss function be L(y, ŷ); parameter learning requires computing the derivative of the loss function with respect to each parameter. Without loss of generality, consider the partial derivatives for the parameters W^(l) and b^(l) at layer l. Computing ∂L(y, ŷ)/∂W^(l) directly involves matrix differentiation and is cumbersome, so we first calculate the partial derivative of L(y, ŷ) with respect to each element w_ij^(l) of the parameter matrix. According to the chain rule,

∂L(y, ŷ)/∂w_ij^(l) = (∂z^(l)/∂w_ij^(l)) · (∂L(y, ŷ)/∂z^(l)),
∂L(y, ŷ)/∂b^(l) = (∂z^(l)/∂b^(l)) · (∂L(y, ŷ)/∂z^(l)).

Thus, we only need to compute three partial derivatives: ∂z^(l)/∂w_ij^(l), ∂z^(l)/∂b^(l), and ∂L(y, ŷ)/∂z^(l).

(1) For the derivative ∂z^(l)/∂w_ij^(l), since z^(l) = W^(l) a^(l−1) + b^(l),
∂z^(l)/∂w_ij^(l) = [∂z_1^(l)/∂w_ij^(l), · · · , ∂z_i^(l)/∂w_ij^(l), · · · , ∂z_{M_l}^(l)/∂w_ij^(l)]   (2.7)
                 = [0, · · · , ∂(w_i:^(l) a^(l−1) + b_i^(l))/∂w_ij^(l), · · · , 0]   (2.8)
                 = [0, · · · , a_j^(l−1), · · · , 0]   (2.9)
                 ≜ 𝕀_i(a_j^(l−1)) ∈ R^{1×M_l},   (2.10)

where w_i:^(l) denotes the i-th row of W^(l), and 𝕀_i(a_j^(l−1)) denotes a row vector whose i-th element is a_j^(l−1) and whose remaining elements are 0.
This matrix differentiation uses the denominator layout, where a column vector’s
derivative with respect to a scalar is a row vector.
(2) For the derivative ∂z^(l)/∂b^(l), since the functional relationship between z^(l) and b^(l) is z^(l) = W^(l) a^(l−1) + b^(l), the derivative is

∂z^(l)/∂b^(l) = I_{M_l} ∈ R^{M_l × M_l},

where I_{M_l} is the M_l × M_l identity matrix.
(3) For the derivative ∂L(y, ŷ)/∂z^(l), it represents the influence of the neurons at layer l on the final loss and hence the sensitivity of the final loss to the neurons at layer l, often termed the error term for the neurons at layer l, denoted by δ^(l):

δ^(l) ≜ ∂L(y, ŷ)/∂z^(l) ∈ R^{M_l}.
The error term δ^(l) also indirectly reflects the degree of contribution of different neurons to the network's capability, thereby effectively addressing the Credit Assignment Problem (CAP).
Given z^(l+1) = W^(l+1) a^(l) + b^(l+1), we have

∂z^(l+1)/∂a^(l) = (W^(l+1))⊤ ∈ R^{M_l × M_{l+1}}.   (2.11)
Given a^(l) = f_l(z^(l)), where f_l(·) is applied element-wise, it follows that

∂a^(l)/∂z^(l) = ∂f_l(z^(l))/∂z^(l)   (2.12)
              = diag(f_l′(z^(l))) ∈ R^{M_l × M_l}.   (2.13)
Thus, according to the chain rule, the error term for layer l is

δ^(l) ≜ ∂L(y, ŷ)/∂z^(l)   (2.14)
      = (∂a^(l)/∂z^(l)) · (∂z^(l+1)/∂a^(l)) · (∂L(y, ŷ)/∂z^(l+1))   (2.15)
      = diag(f_l′(z^(l))) · (W^(l+1))⊤ · δ^(l+1)   (2.16)
      = f_l′(z^(l)) ⊙ ((W^(l+1))⊤ δ^(l+1)) ∈ R^{M_l},   (2.17)

where ⊙ denotes the element-wise (Hadamard) product. Equation (2.17) shows that the error term of layer l can be computed from the error term of layer l + 1; the error thus propagates backward from the output layer, which gives the backpropagation algorithm its name.
Combining the above results, the partial derivative of L(y, ŷ) with respect to each weight is

∂L(y, ŷ)/∂w_ij^(l) = 𝕀_i(a_j^(l−1)) δ^(l) = δ_i^(l) a_j^(l−1),

which can be written compactly as the gradient of L(y, ŷ) with respect to the weights W^(l) of layer l:

∂L(y, ŷ)/∂W^(l) = δ^(l) (a^(l−1))⊤ ∈ R^{M_l × M_{l−1}}.

Similarly, the gradient of L(y, ŷ) with respect to the bias b^(l) of layer l is

∂L(y, ŷ)/∂b^(l) = δ^(l) ∈ R^{M_l}.
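Putting the pieces together, the following minimal NumPy sketch performs one forward and backward pass for a network with tanh activations and a squared-error loss; the architecture and loss are illustrative assumptions, and only equation (2.17) and the two gradient formulas above are used:

import numpy as np

def backward(x, y, weights, biases):
    """Return the gradients of L = 0.5 * ||a^(L) - y||^2 for each W^(l), b^(l)."""
    # Forward pass, caching net inputs z^(l) and activations a^(l).
    a, activations, zs = x, [x], []
    for W, b in zip(weights, biases):
        z = W @ a + b
        a = np.tanh(z)
        zs.append(z)
        activations.append(a)
    # Error term of the output layer: delta^(L) = f'(z^(L)) ⊙ dL/da^(L).
    delta = (1.0 - np.tanh(zs[-1]) ** 2) * (activations[-1] - y)
    grads_W, grads_b = [], []
    for l in range(len(weights) - 1, -1, -1):
        grads_W.insert(0, np.outer(delta, activations[l]))  # delta^(l) (a^(l-1))^T
        grads_b.insert(0, delta)                            # dL/db^(l) = delta^(l)
        if l > 0:  # propagate the error term backward, equation (2.17)
            delta = (1.0 - np.tanh(zs[l - 1]) ** 2) * (weights[l].T @ delta)
    return grads_W, grads_b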
Having computed the error term for each layer, we can determine the gradient for each layer's parameters. Thus, the gradients of a feedforward neural network with respect to each parameter can be manually calculated using the chain rule and implemented in code. However, manual differentiation is not only tedious but also prone to errors, making the implementation of neural networks inefficient and fragile. This is particularly true for complex network architectures with a large number of parameters, where manual differentiation becomes impractical.

2.6 Automatic Gradient Computation
To address this issue, modern deep learning frameworks include functionalities for
automatic gradient computation. This allows researchers and practitioners to focus on
defining the network structure and implementing it in code, while the computation of
gradients is done automatically, significantly improving development efficiency and re-
ducing the likelihood of errors.
Methods for automatic gradient computation can be categorized into three main
types: Numerical Differentiation, Symbolic Differentiation, and Automatic Differentia-
tion.
Numerical Differentiation Numerical differentiation approximates derivatives directly from the definition of the derivative:

f′(x) = lim_{∆x→0} [f(x + ∆x) − f(x)] / ∆x.   (2.18)
To compute the derivative of f (x) at a point x, a small perturbation ∆x is added
to x, and the gradient is approximated using the formula above. Although straightfor-
ward to implement, numerical differentiation faces challenges in choosing an appropriate
value for ∆x. If ∆x is too small, it can introduce significant round-off errors due to the
limitations of floating-point arithmetic. Conversely, if ∆x is too large, it may increase
truncation error, leading to inaccurate derivative calculations.
In practice, to reduce truncation errors, the following formula is often used:
f′(x) = lim_{∆x→0} [f(x + ∆x) − f(x − ∆x)] / (2∆x).   (2.19)
This formula, known as the central difference formula, provides a more accurate
approximation of the derivative f ′ (x) compared to the forward or backward difference
methods. The central difference method calculates the slope of the secant line through
the points (x − ∆x, f (x − ∆x)) and (x + ∆x, f (x + ∆x)), effectively using information
from both sides of x. This symmetric approach reduces the error in the derivative approx-
imation, as the leading error terms cancel out. As a result, for a given ∆x, the central
difference method tends to offer a superior balance between truncation error and round-
off error, leading to a more reliable estimate of the derivative, especially when dealing
with functions that are smooth near the point x.
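A minimal gradient-check sketch using the central difference formula; the step size 1e-5 is a typical illustrative choice:

import numpy as np

def numerical_gradient(f, x, dx=1e-5):
    """Approximate each partial derivative via formula (2.19)."""
    grad = np.zeros_like(x)
    for i in range(x.size):                  # one perturbation per parameter
        e = np.zeros_like(x)
        e[i] = dx
        grad[i] = (f(x + e) - f(x - e)) / (2.0 * dx)
    return grad

# Check against a known gradient: for f(x) = ||x||^2, the gradient is 2x.
x = np.array([1.0, -2.0, 3.0])
approx = numerical_gradient(lambda v: np.sum(v ** 2), x)   # ≈ [2, -4, 6]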
The computational complexity of numerical differentiation is another drawback. For a function with N parameters, each parameter must be perturbed individually, and each function evaluation itself typically costs O(N) time, so a single gradient computation has a total complexity of O(N²). This makes numerical differentiation inefficient for functions with a large number of parameters, such as deep neural networks.
Symbolic Differentiation Symbolic differentiation computes derivatives through symbolic computation, avoiding round-off and truncation errors and offering precise solutions by manipulating symbols rather than numerical approximations. It transforms input expressions iteratively or recursively using predefined rules, stopping when no further transformation by these rules is possible.
Symbolic differentiation can compute the mathematical representation of gradients
at compile time, further optimizing the process through symbolic computation methods.
It boasts platform independence, allowing execution on both CPUs and GPUs. However,
it does come with limitations, such as long compilation times, especially for complex expressions or loops that require extensive unrolling.
Automatic Differentiation Automatic Differentiation (AD) computes derivatives by decomposing a program into basic operations with known derivatives and combining them via the chain rule. As an illustrative example, consider the composite function

f(x; w, b) = 1 / (1 + exp(−(wx + b))),   (2.20)

where x is the input scalar, and w and b are the weight and bias parameters, respectively.
First, we decompose the composite function f (x; w, b) into a series of basic operations
and form a computational graph. A computational graph is a graphical representation
of mathematical operations, where each non-leaf node represents a basic operation, and
each leaf node represents an input variable or constant.
The composite function f(x; w, b) consists of 6 basic functions h_i, for 1 ≤ i ≤ 6. The following table delineates each basic function along with its derivative, facilitating the computation of derivatives of f(x; w, b) with respect to w and b through predefined rules.

Basic function        Derivative
h_1 = w x             ∂h_1/∂w = x
h_2 = h_1 + b         ∂h_2/∂h_1 = 1,   ∂h_2/∂b = 1
h_3 = −h_2            ∂h_3/∂h_2 = −1
h_4 = exp(h_3)        ∂h_4/∂h_3 = exp(h_3)
h_5 = 1 + h_4         ∂h_5/∂h_4 = 1
h_6 = 1/h_5           ∂h_6/∂h_5 = −1/h_5²
To derive the entire composite function f (x; w, b) with respect to w and b, one multi-
plies the derivatives along the paths connecting f (x; w, b) to w and b within the compu-
tational graph:
∂f(x; w, b)/∂w = (∂f/∂h_6)(∂h_6/∂h_5)(∂h_5/∂h_4)(∂h_4/∂h_3)(∂h_3/∂h_2)(∂h_2/∂h_1)(∂h_1/∂w)   (2.21)
∂f(x; w, b)/∂b = (∂f/∂h_6)(∂h_6/∂h_5)(∂h_5/∂h_4)(∂h_4/∂h_3)(∂h_3/∂h_2)(∂h_2/∂b)   (2.22)
Considering ∂f(x; w, b)/∂w at x = 1, w = 0, b = 0, the derivative evaluates to 0.25. If multiple paths from the function to a parameter exist, the derivatives along each path are summed to obtain the final gradient.
Based on the sequence of derivative computations, automatic differentiation (AD) is
divided into two modes: forward mode and reverse mode.
• The forward mode computes gradients in the same direction as the computational graph, recursively applying the chain rule from inputs to outputs. For example, when calculating ∂f(x; w, b)/∂w at x = 1, w = 0, b = 0, the forward mode accumulates derivatives alongside the function evaluation and arrives at a derivative of 0.25 for f(x; w, b) with respect to w.

• Reverse mode, on the other hand, computes gradients in the opposite direction of the computational graph's flow, from the output back toward the inputs. For ∂f(x; w, b)/∂w under the same conditions, the reverse mode's sequence also leads to a derivative of 0.25.
Both forward and reverse modes are applications of the chain rule for gradient accu-
mulation. Reverse mode is equivalent to the backpropagation technique used for com-
puting gradients. For a general function f : RN → RM , the forward mode requires N
traversals for each input variable, whereas the reverse mode requires M traversals for
each output. When N > M , reverse mode is more efficient. In the context of feed-
forward neural network parameter learning, reverse mode is the most efficient method,
requiring only one pass.
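The example above can be traced by hand in code. The following sketch performs the forward sweep through h_1, …, h_6 and then a reverse sweep that propagates derivatives from the output back to w and b; at x = 1, w = 0, b = 0 it recovers ∂f/∂w = 0.25:

import math

def f_and_grads(x, w, b):
    # Forward sweep: evaluate the six basic operations.
    h1 = w * x
    h2 = h1 + b
    h3 = -h2
    h4 = math.exp(h3)
    h5 = 1.0 + h4
    h6 = 1.0 / h5
    # Reverse sweep: accumulate df/dh_i from the output back to the inputs.
    d_h6 = 1.0                      # df/dh6, since f = h6
    d_h5 = d_h6 * (-1.0 / h5 ** 2)  # dh6/dh5 = -1/h5^2
    d_h4 = d_h5 * 1.0               # dh5/dh4 = 1
    d_h3 = d_h4 * math.exp(h3)      # dh4/dh3 = exp(h3)
    d_h2 = d_h3 * (-1.0)            # dh3/dh2 = -1
    d_h1 = d_h2 * 1.0               # dh2/dh1 = 1
    return h6, d_h1 * x, d_h2       # f, df/dw, df/db

y, dw, db = f_and_grads(1.0, 0.0, 0.0)   # y = 0.5, dw = 0.25, db = 0.25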
2.7 Conclusion
In this chapter, we have explored the fundamental concepts and techniques behind neural
networks for machine learning. Building upon the basic principles of machine learning
discussed in the previous chapter, such as models, learning criteria, and optimization
algorithms, we have delved into the specific architecture and learning mechanisms of
neural networks.
We began by examining the basic building block of neural networks: the artificial
neuron. Inspired by the structure and function of biological neurons, artificial neurons
receive input signals, process them through weighted connections and an activation func-
tion, and produce an output signal. The choice of activation function is crucial, as it
introduces nonlinearity into the network, enabling it to learn complex patterns and rep-
resentations from the input data.
Next, we investigated various network structures, including feedforward networks,
memory networks, and graph networks. Each structure is designed to handle different
types of data and solve specific tasks effectively. Feedforward networks, such as multi-
layer perceptrons (MLPs), are well-suited for processing structured data and learning
hierarchical representations. Memory networks, like recurrent neural networks (RNNs)
and long short-term memory (LSTM) networks, are designed to process sequential data
and capture temporal dependencies. Graph networks, such as graph convolutional net-
works (GCNs) and graph attention networks (GATs), are tailored for processing graph-
structured data and learning node embeddings that capture the structural information of
the graph.
We then focused on the training process of feedforward neural networks, which in-
volves finding the optimal values of the network’s parameters (weights and biases) that
minimize a chosen loss function. The backpropagation algorithm, which efficiently com-
putes the gradients of the loss function with respect to the network’s parameters by re-
cursively applying the chain rule of differentiation, forms the cornerstone of parameter
learning in neural networks. We also discussed the Universal Approximation Theorem,
which highlights the expressive power of MLPs and their ability to approximate any con-
tinuous function, given sufficient hidden neurons and appropriate activation functions.
To mitigate overfitting, a common challenge in training deep neural networks, we
introduced the concept of dropout regularization. Dropout randomly drops units during
training, effectively preventing units from co-adapting too closely to the training data
and promoting the learning of more robust and generalized representations.
Finally, we explored the various methods for automatic gradient computation, which
is essential for efficient training of neural networks. Automatic differentiation, particu-
larly in reverse mode (backpropagation), offers the best balance of computational effi-
ciency, numerical stability, and ease of implementation, making it the preferred choice for training deep neural networks.