
Chapter 2

Neural Networks for Machine Learning

Artificial Neural Networks (ANNs) are a series of mathematical models inspired by the
structure and function of biological neural networks in the human brain. These mod-
els abstract the complex interconnections of neurons to construct artificial neurons and
establish connections between them according to specific topological structures, thereby
simulating the information processing capabilities of biological neural networks. In the
field of artificial intelligence, ANNs are often referred to as neural networks (NN) or
neural models.
Neural networks were initially introduced as a principal model of connectionism, with
the Parallel Distributed Processing (PDP) model [McClelland et al., 1986] being the most
popular during the mid to late 1980s. The PDP model is characterized by three main fea-
tures: 1) Information representation is distributed across multiple units; 2) Memory and
knowledge are stored in the connections between units; 3) Learning of new knowledge
occurs through gradual changes in the connection strengths between units. These fea-
tures have greatly influenced the development of modern neural network architectures,
enabling them to effectively process and store information in a manner similar to the
human brain.
Connectionist neural networks exhibit a variety of network structures and learning
methods. Early models emphasized biological plausibility, aiming to closely mimic the
structure and function of biological neurons. However, later models shifted their focus
towards simulating specific cognitive abilities, such as object recognition and language
understanding. This transition was largely driven by the introduction of error backpropa-
gation, a powerful learning algorithm that significantly improved the learning capabilities
of neural networks. With the ability to learn from large-scale datasets and the availabil-
ity of enhanced computational capabilities, such as parallel processing, neural networks
have made remarkable breakthroughs in various machine learning tasks, particularly in
processing perceptual signals like speech and images.
This chapter focuses primarily on neural networks that learn through error backprop-
agation, treating them as a type of machine learning model. From a machine learning
perspective, neural networks are generally regarded as nonlinear models, with the basic
units being neurons equipped with nonlinear activation functions. The numerous connec-
tions between neurons, each associated with a weight, contribute to the highly nonlinear
nature of neural networks. These connection weights are the parameters that need to be
learned, which can be achieved within the framework of machine learning using gradi-
ent descent methods. By adjusting the weights iteratively based on the error between the
predicted and desired outputs, neural networks can learn to model complex relationships
and solve a wide range of tasks.

2.1 Neurons
An Artificial Neuron, or simply a Neuron, is the fundamental building block of a neural
network, designed to model the structure and characteristics of biological neurons. A
neuron receives a set of input signals, processes them, and produces an output signal.
Biologists discovered the intricate structure of biological neurons in the early 20th
century. A typical biological neuron consists of multiple dendrites, which receive in-
put signals from other neurons, and a single axon, which sends output signals to other
neurons. When the accumulated input signals received by a neuron exceed a certain
threshold, the neuron becomes excited and generates an electrical pulse called an action
potential. This action potential propagates along the axon and is transmitted to other
neurons through synaptic connections at the axon terminals.
Inspired by the biological neuron, psychologist McCulloch and mathematician Pitts
proposed a simplified neuron model in 1943, known as the MP neuron [McCulloch and
Pitts, 1943]. The structure of neurons in modern neural networks closely resembles that
of the MP neuron, with the main difference lying in the choice of the activation function
f . While the MP neuron utilized a step function that produced binary outputs (0 or 1),
modern neurons typically employ continuous and differentiable activation functions to
enable gradient-based learning.
Mathematically, a neuron receives D inputs x1 , x2 , · · · , xD , which can be represented
as a vector x = [x1 ; x2 ; · · · ; xD ]. The neuron computes a weighted sum of these inputs,
called the net input or net activation, denoted as z ∈ R:
z = Σ_{d=1}^D wd xd + b = w⊤x + b,

where w = [w1 ; w2 ; · · · ; wD ] ∈ RD is the weight vector, and b ∈ R is the bias term. The
weights represent the strength and importance of each input connection to the neuron.
Each element wi in the weight vector corresponds to the weight associated with the i-th
input. A larger absolute value of wi indicates that the i-th input has a stronger influence
on the neuron’s output. The sign of wi determines whether the input has an excitatory
(positive) or inhibitory (negative) effect on the neuron’s activation.
The bias term b represents the neuron’s activation threshold. It allows the neuron to
shift its activation function to the left or right, effectively adjusting the neuron’s sensitivity
to input signals. A positive bias makes the neuron more likely to fire, while a negative
bias makes it less likely to fire. The bias term can be thought of as an additional input to
the neuron, with a fixed value of 1, and its associated weight is the bias term itself.
Together, the weighted sum of inputs w⊤ x and the bias term b determine the neuron’s
net input or activation, which is then passed through the activation function to produce
the neuron’s output. By adjusting the weights and bias during the learning process,
the neuron can adapt to different patterns and learn to respond appropriately to input
signals.
The net input z is then passed through a nonlinear activation function f (·) to produce
the neuron’s output or activation value a:

a = f (z).
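As a minimal illustrative sketch, the whole computation can be written in a few lines of NumPy; the logistic activation used here is just one possible choice of f:

import numpy as np

def neuron(x, w, b, f):
    # Compute a single neuron's output a = f(w^T x + b)
    z = np.dot(w, x) + b   # net input: weighted sum of the inputs plus the bias
    return f(z)            # activation value

# Example with D = 3 inputs and a logistic activation
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.2])
b = 0.05
a = neuron(x, w, b, lambda z: 1.0 / (1.0 + np.exp(-z)))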

The choice of the activation function f (·) is crucial, as it introduces nonlinearity into
the neuron’s computation, enabling neural networks to model complex relationships and
learn intricate patterns in data. Figure 2.1 illustrates the structure of a typical artificial
neuron, highlighting the input signals, weights, bias, net input, activation function, and
output.

Figure 2.1: A typical artificial neuron. The neuron receives input signals x1, x2, · · · , xD, which are multiplied by their corresponding weights w1, w2, · · · , wD. The weighted sum of the inputs and the bias b form the net input z, which is then passed through an activation function f(·) to produce the neuron's output a.

By organizing multiple neurons into layered structures and connecting them appropriately, we can construct powerful neural networks capable of learning and solving
complex tasks. The following sections will delve deeper into the architecture and learning
mechanisms of feedforward neural networks, exploring how they can be trained using
gradient descent methods to minimize the error between predicted and desired outputs.

2.2 Activation Functions


Activation functions are a crucial component of artificial neurons in neural networks, as
they introduce nonlinearity into the network, enabling it to learn complex patterns and
representations from the input data. The choice of activation function can significantly
impact the network’s performance and learning dynamics. To enhance the network’s
representation capability and learning ability, activation functions should exhibit the fol-
lowing properties:

1. Continuity and Differentiability (with the allowance for a few points of non-
differentiability): A differentiable activation function enables the direct application
of numerical optimization methods, such as gradient descent, for learning network
parameters. Continuity and differentiability are crucial for the smooth flow of gra-
dients during the backpropagation process, allowing the network to learn effec-
tively.

2. Simplicity of the Function and its Derivative: The simplicity of activation func-
tions and their derivatives facilitates higher computational efficiency within the
network. However, there is a trade-off between the simplicity of the activation
function and its ability to capture complex patterns. Striking a balance between
these two factors is essential for optimal network performance.

3. Appropriate Range for the Derivative: The derivative of the activation function
should fall within a suitable range, not too large or too small, to maintain training
efficiency and stability. If the derivative is too large, it can lead to exploding gra-
dients, causing the learning process to become unstable. On the other hand, if the
derivative is too small, it can result in vanishing gradients, making it difficult for
the network to learn and update its parameters effectively.

Several commonly used activation functions in neural networks are introduced below:

2.2.1 Sigmoid Functions


Sigmoid functions refer to a class of S-shaped curve functions known as saturation func-
tions. Commonly used sigmoid functions include the Logistic function and the Tanh
function.

• Saturation: For a function f(x), if as x → −∞ its derivative f′(x) → 0, it is called left-saturated. Similarly, if as x → +∞ its derivative f′(x) → 0, it is termed right-
saturated. A function is considered fully saturated if it meets both left and right
saturation criteria. Saturation can have significant implications on the gradient
flow during backpropagation, as it can lead to the vanishing gradient problem,
where the gradients become extremely small, making it difficult for the network to
learn and update its parameters effectively.

Logistic Function: The Logistic function is defined as


σ(x) = 1 / (1 + exp(−x)).
The Logistic function acts as a "squashing" function, compressing the input from the
real number domain to (0, 1). Near 0, the sigmoid function approximates a linear func-
tion; towards the extremes, it suppresses the input. Smaller inputs get closer to 0, and
larger inputs approach 1. This behavior is similar to that of biological neurons, which are excited by some inputs (output near 1) and inhibited by others (output near 0). Compared to the step activa-
tion function used by perceptrons, the Logistic function is continuous and differentiable,
making it mathematically more favorable.
Due to the Logistic function’s properties, neurons equipped with it exhibit two no-
table features: 1) Their output can directly be interpreted as a probability distribution,
enabling better integration with statistical learning models. 2) It acts as a soft gate,
controlling the amount of information passed by other neurons.
However, the Logistic function has some limitations. It suffers from the vanishing
gradient problem, where the gradients become extremely small for inputs far from zero,
making it difficult for the network to learn and update its parameters effectively. Addi-
tionally, the Logistic function always outputs values greater than 0, resulting in a non-
zero-centered output, which can cause a bias shift in the inputs of subsequent neural
layers, further slowing the convergence of gradient descent.
For more details, see Section 6.6.

Tanh Function: The Tanh function is another type of sigmoid function, defined as

tanh(x) = (exp(x) − exp(−x)) / (exp(x) + exp(−x)).

The Tanh function can be considered a scaled and shifted version of the Logistic function, with a range of (−1, 1):

tanh(x) = 2σ(2x) − 1.


Figure 2.2 shows the shapes of the Logistic and Tanh functions. The Tanh function outputs are zero-centered, unlike the Logistic function, which always outputs values greater than 0. Zero-centered outputs can help alleviate the bias shift problem encountered with the Logistic function, leading to faster convergence of gradient descent.

Figure 2.2: Logistic and Tanh functions. The Logistic function squashes the input to the range (0, 1), while the Tanh function squashes the input to the range (−1, 1), providing zero-centered outputs.
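As a quick numerical check of the definitions and of the identity above (a minimal sketch):

import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

# Verify that Tanh is a scaled and shifted Logistic function: tanh(x) = 2*sigma(2x) - 1
x = np.linspace(-5.0, 5.0, 101)
assert np.allclose(tanh(x), 2 * logistic(2 * x) - 1)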

2.2.2 Hard-Logistic and Hard-Tanh Functions


Both the Logistic and Tanh functions are types of Sigmoid functions characterized by
saturation, but they involve significant computational overhead due to their saturation
on both ends and approximate linearity near zero. As a result, these functions can be
approximated using piecewise functions, reducing the computational complexity while
preserving their essential characteristics.

Taking the Logistic function σ(x) as an example, its derivative is σ ′ (x) = σ(x)(1−σ(x)).
The first-order Taylor expansion of the Logistic function around zero is

gl(x) ≈ σ(0) + x · σ′(0) = 0.25x + 0.5.


Thus, the Logistic function can be approximated by a piecewise function known as
hard-logistic(x):



hard-logistic(x) = 1 if gl(x) ≥ 1; gl(x) if 0 < gl(x) < 1; 0 if gl(x) ≤ 0
= max(min(gl(x), 1), 0)
= max(min(0.25x + 0.5, 1), 0).

Similarly, the first-order Taylor expansion of the Tanh function around zero is

gt(x) ≈ tanh(0) + x · tanh′(0) = x.


The Hard-Tanh function is then defined as

hard-tanh(x) = max(min(gt(x), 1), −1) = max(min(x, 1), −1).
Figure 2.3 depicts Hard Sigmoid-type activation functions, highlighting the simpli-
fied computational models that approximate the traditional Logistic and Tanh functions
with reduced computational complexity, while preserving their essential characteristics.
The piecewise linear nature of these approximations makes them more computationally
efficient compared to their smooth counterparts.

Figure 2.3: Hard Logistic and Hard Tanh functions. These piecewise linear functions
approximate the Logistic and Tanh functions, respectively, reducing computational com-
plexity while preserving the essential characteristics of the original functions.
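Both piecewise approximations reduce to simple clipping, as the following minimal sketch shows (np.clip implements the nested max/min directly):

import numpy as np

def hard_logistic(x):
    # max(min(0.25x + 0.5, 1), 0)
    return np.clip(0.25 * x + 0.5, 0.0, 1.0)

def hard_tanh(x):
    # max(min(x, 1), -1)
    return np.clip(x, -1.0, 1.0)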

2.2.3 ReLU Function


The ReLU (Rectified Linear Unit) function, also known as the Rectifier function, is cur-
rently one of the most frequently used activation functions in deep neural networks.
Defined as a ramp function, ReLU is expressed as

ReLU(x) = max(0, x).


Advantages: Neurons employing ReLU require only addition, multiplication, and
comparison operations, making computations more efficient. ReLU is also considered
to have biological plausibility, such as unilateral inhibition and a broad excitation margin
(i.e., the level of excitation can be very high), mirroring the sparsity observed in biologi-
cal neural networks where only about 1% to 4% of neurons are active at any given time.
Unlike Sigmoid-type activation functions, which lead to dense activations in the neural
network, ReLU facilitates a good level of sparsity, with approximately 50% of neurons
being active.
In terms of optimization, compared to the saturation at both ends of Sigmoid-type
functions, the ReLU function saturates only on the left and has a derivative of 1 for
x > 0, which alleviates the vanishing gradient problem to some extent and accelerates
the convergence of gradient descent.
Disadvantages: The output of ReLU is non-zero-centered, introducing bias shifts to
subsequent neural network layers and affecting the efficiency of gradient descent. More-
over, ReLU neurons are relatively easy to "die" during training, a phenomenon known as
the Dying ReLU Problem. If, after an inappropriate update, a ReLU neuron in the first
hidden layer is never activated across all training data, its gradient will always be 0, and
it will never activate in future training. This problem can also occur in other hidden
layers.

Leaky ReLU Leaky ReLU maintains a small gradient γ for inputs x < 0, allowing for
a non-zero gradient when the neuron is not activated, thus avoiding the issue of never
being activated [Maas et al., 2013]. Leaky ReLU is defined as
LeakyReLU(x) = x if x > 0; γx if x ≤ 0
= max(0, x) + γ min(0, x),


where γ is a small constant slope for the negative part of the function, providing a
pathway for gradients during the backpropagation process even for negative input values,
thus addressing the Dying ReLU Problem.

Parametric ReLU (PReLU) further generalizes the idea by introducing a learnable pa-
rameter γi , allowing different neurons to have different slopes for the negative part of
their input. It is defined as

PReLUi (x) = max(0, x) + γi min(0, x),


where γi is the slope for x ≤ 0. Thus, PReLU adapts to the data, potentially improving
model flexibility and performance.

ELU (Exponential Linear Unit) aims to bring the benefits of ReLU-like functions while
attempting to make the mean activations closer to zero, which speeds up learning. It is
defined as
ELU(x) = x if x > 0; γ(exp(x) − 1) if x ≤ 0,
with γ ≥ 0 controlling the saturation level for negative inputs. By bringing the mean
activations closer to zero, ELU helps to alleviate the bias shift problem and accelerates
the learning process.

Softplus is a smooth approximation of the ReLU function, providing a differentiable and smooth curve that closely resembles the behavior of a ReLU. It is given by

Softplus(x) = log(1 + exp(x)).

The derivative of Softplus is the logistic function, linking it directly to traditional sigmoid functions.
Figure 2.4 showcases examples of the ReLU, Leaky ReLU, ELU, and Softplus func-
tions, illustrating the diversity and adaptability of activation functions in neural network
architectures. These functions play a crucial role in determining the nonlinear trans-
formation between layers, significantly impacting the network’s learning dynamics and
performance.

Figure 2.4: ReLU, Leaky ReLU, ELU, and Softplus functions. These functions provide var-
ious nonlinear transformations, each with its own advantages and characteristics, con-
tributing to the diversity and adaptability of neural network architectures.
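A minimal NumPy sketch of the ReLU family discussed above; here γ is a fixed negative-slope constant (making a per-neuron γi learnable yields PReLU):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, gamma=0.01):
    # max(0, x) + gamma * min(0, x)
    return np.maximum(0.0, x) + gamma * np.minimum(0.0, x)

def elu(x, gamma=1.0):
    # x for x > 0, gamma * (exp(x) - 1) for x <= 0
    return np.where(x > 0, x, gamma * (np.exp(x) - 1.0))

def softplus(x):
    # log(1 + exp(x)); np.logaddexp(0, x) is a numerically stable equivalent
    return np.logaddexp(0.0, x)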

2.2.4 Swish Function


The Swish function, proposed by Ramachandran et al. in 2017, is a self-gated activation
function given by
swish(x) = x · σ(βx),
where σ(·) is the logistic function, and β is a learnable parameter or a fixed hyperparam-
eter. The function σ(·) ranges between (0, 1), acting as a soft gating mechanism. When
σ(βx) is close to 1, the gate is "open," and the output approximates x itself; when σ(βx)
is near 0, the gate is "closed," leading to an output close to 0.
Figure 2.5 showcases the Swish function for different values of β.
For β = 0, Swish becomes the linear function x/2. At β = 1, it behaves approximately linearly for x > 0 and saturates for x < 0, displaying some non-monotonic behavior. As β → +∞, σ(βx) approaches a discrete 0-1 function, and Swish approximates the ReLU function, making it a nonlinear interpolation between a linear function and ReLU, controlled by β.

Figure 2.5: Swish activation function for different values of β. The Swish function acts as a self-gated activation function, interpolating between a linear function and the ReLU function based on the value of β.
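A minimal sketch of Swish for a fixed β, using the identity x · σ(βx) = x / (1 + exp(−βx)):

import numpy as np

def swish(x, beta=1.0):
    # sigma(beta * x) acts as a soft gate on x
    return x / (1.0 + np.exp(-beta * x))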

2.2.5 GELU Function


The Gaussian Error Linear Unit (GELU), introduced by Hendrycks et al. in 2016, func-
tions similarly to Swish but incorporates a gating mechanism based on the Gaussian
distribution. It is defined as

GELU(x) = x · P (X ≤ x),

where P(X ≤ x) is the cumulative distribution function (CDF) of the Gaussian distribution N(µ, σ²), typically with µ = 0 and σ = 1. The GELU function can be approximated using the Tanh function or the Logistic function, reducing to a special case of the Swish function when approximated with the latter.
The GELU function differs from the Swish function in its gating mechanism, which is
based on the Gaussian distribution rather than the Logistic function. This gating mech-
anism allows the GELU function to adapt its behavior based on the input, providing a
more flexible and data-dependent activation.
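A minimal sketch of GELU with µ = 0 and σ = 1. The exact form uses the error function; the tanh-based approximation shown alongside (with its standard constant 0.044715) is the one commonly used in practice, and SciPy is assumed to be available:

import numpy as np
from scipy.special import erf

def gelu(x):
    # Exact form: x * P(X <= x) with X ~ N(0, 1)
    return x * 0.5 * (1.0 + erf(x / np.sqrt(2.0)))

def gelu_tanh_approx(x):
    # Widely used tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))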

2.2.6 Maxout Unit


The Maxout unit, introduced by Goodfellow et al. in 2013, diverges from traditional
scalar-input activation functions like Sigmoid or ReLU by accepting a vector input from
the previous layer’s raw outputs, x = [x1 ; x2 ; · · · ; xD ].
Each Maxout unit has K weight vectors wk ∈ RD and biases bk for 1 ≤ k ≤ K,
producing K net inputs zk , 1 ≤ k ≤ K,

zk = wk⊤ x + bk,

with the Maxout non-linear function defined as

maxout(x) = max_{k∈[1,K]} zk,

enabling it to learn a convex non-linear mapping from input to output. The Maxout
function approximates any convex function with piecewise linear segments and is non-
differentiable at a finite set of points.
The motivation behind using a vector input instead of a scalar input in the Maxout
unit is to increase the flexibility and representation power of the activation function. By
taking the maximum over a set of linear transformations of the input, the Maxout unit
can learn to adapt its shape based on the data, effectively approximating complex convex
functions.
Figure 2.6 illustrates the structure and behavior of the Maxout unit.

Figure 2.6: Maxout unit. The Maxout unit takes a vector input and computes the maxi-
mum over a set of linear transformations, enabling it to learn a convex non-linear map-
ping from input to output.
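A minimal sketch of a single Maxout unit, stacking the K weight vectors into a K × D matrix so that all K net inputs are computed at once:

import numpy as np

def maxout(x, W, b):
    # W has shape (K, D) and b has shape (K,): z_k = w_k^T x + b_k for 1 <= k <= K
    z = W @ x + b
    return np.max(z)   # maximum over the K linear pieces

# Example: D = 4 inputs, K = 3 linear pieces
rng = np.random.default_rng(0)
x = rng.normal(size=4)
W = rng.normal(size=(3, 4))
b = rng.normal(size=3)
a = maxout(x, W, b)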

2.2.7 Choosing an Activation Function


The choice of activation function depends on various factors, such as the nature of the
problem, the network architecture, and the desired properties of the model. Here are
some general guidelines for selecting an activation function:
1. ReLU and its variants (Leaky ReLU, PReLU, ELU) are popular choices for deep
neural networks due to their simplicity, computational efficiency, and ability to alleviate
the vanishing gradient problem. They are particularly effective in convolutional neural
networks (CNNs) and deep feedforward networks.
2. Sigmoid functions (Logistic and Tanh) are useful when the desired output is proba-
bilistic or requires a squashing behavior. They are commonly used in the output layer for
binary classification problems or in recurrent neural networks (RNNs) for gating mecha-
nisms.
3. Swish and GELU functions provide a balance between the properties of ReLU and
sigmoid functions, offering a smooth and self-gated activation that can adapt to the data.
They have shown promising results in various deep learning architectures, particularly in
models with attention mechanisms.
4. Maxout units are a flexible option when the desired activation function shape is
unknown or complex. They can learn to adapt their shape based on the data, making
them suitable for a wide range of problems.
It is worth noting that the choice of activation function is not always straightforward
and may require experimentation and empirical evaluation. Different activation functions
may perform better in different scenarios, and it is common to use a combination of
activation functions within a single network, depending on the specific requirements of
each layer.

Summary Activation functions are a fundamental component of artificial neurons in neural networks, introducing nonlinearity and enabling the network to learn complex
patterns and representations. The choice of activation function significantly impacts the
network’s performance, learning dynamics, and computational efficiency.
This chapter has presented an overview of various activation functions, including sig-
moid functions (Logistic and Tanh), their piecewise linear approximations (Hard-Logistic
and Hard-Tanh), ReLU and its variants (Leaky ReLU, PReLU, ELU), Swish, GELU, and
Maxout units. Each activation function has its own properties, advantages, and limita-
tions, making them suitable for different types of problems and network architectures.
When selecting an activation function, it is essential to consider factors such as the
nature of the problem, the desired properties of the model, and the computational effi-
ciency. Experimentation and empirical evaluation are often necessary to determine the
most appropriate activation function for a given task.
As the field of deep learning continues to evolve, new activation functions may emerge,
offering improved performance and additional properties. Researchers and practitioners
should stay updated with the latest developments in activation functions and be open to
exploring novel approaches to further enhance the capabilities of neural networks.

2.3 Network Structure


The functionality of a biological neuron is relatively simple, whereas an artificial neuron
is an idealized and simplified implementation of the biological neuron, with even simpler
functions. To emulate the capabilities of the human brain, a single neuron is far from
sufficient; it requires the collaborative effort of many neurons to achieve complex func-
tions. Neurons that collaborate through certain connections or information transmission
methods form a network, known as a neural network.
Neural networks can be broadly categorized into three main structures: feedforward
networks, memory networks, and graph networks. Each structure is designed to handle
different types of data and solve specific tasks effectively.

Feedforward Networks Feedforward networks, such as fully connected networks and
convolutional neural networks (CNNs), are suitable for processing structured data, like
images or fixed-length sequences. In these networks, information flows unidirectionally
from the input layer to the output layer, passing through one or more hidden layers. The
key motivation behind feedforward networks is to learn hierarchical representations of
the input data, with each layer extracting increasingly abstract features.
For example, in a CNN designed for image classification, the early layers learn to
detect low-level features such as edges and textures, while the later layers combine these
features to identify more complex patterns and objects. This hierarchical structure allows
feedforward networks to effectively capture the spatial dependencies in the data and
make accurate predictions.

Memory Networks Memory networks, such as recurrent neural networks (RNNs) and
long short-term memory (LSTM) networks, are designed to process sequential data, such
as time series or natural language. These networks incorporate feedback connections,
allowing information to persist across multiple time steps. The motivation behind mem-
ory networks is to capture the temporal dependencies in the data and learn to store and
update relevant information over time. In an RNN, the hidden state at each time step is
updated based on the current input and the previous hidden state, allowing the network
to maintain a running memory of the sequence. This memory mechanism enables RNNs
to capture long-term dependencies and make predictions based on the entire sequence
history. However, RNNs may struggle with very long sequences due to the vanishing
gradient problem.
To address this issue, LSTMs introduce a more complex memory cell with gating
mechanisms that regulate the flow of information over time. This allows LSTMs to
selectively remember or forget information, making them more effective at capturing
long-range dependencies in sequences.

Graph Networks Graph networks, such as graph convolutional networks (GCNs) and
graph attention networks (GATs), are designed to process graph-structured data, where
entities are represented as nodes and their relationships are represented as edges. The
motivation behind graph networks is to learn node embeddings that capture the struc-
tural information of the graph and enable tasks such as node classification, link predic-
tion, or graph classification.
In a GCN, the node embeddings are updated by aggregating information from the
node’s neighbors, allowing the network to capture the local graph structure. This process
is repeated for multiple layers, with each layer capturing a larger neighborhood around
the nodes. By learning node embeddings that incorporate both node features and graph
structure, GCNs can effectively solve tasks such as semi-supervised node classification.
GATs extend the idea of GCNs by introducing an attention mechanism that allows
the network to assign different importance to different neighbors during the aggregation
process. This enables GATs to capture more complex and non-local dependencies in
the graph and has been shown to improve performance on various graph-based tasks. In
summary, the choice of neural network structure depends on the specific characteristics of
the data and the task at hand. Feedforward networks are well-suited for structured data,
memory networks for sequential data, and graph networks for graph-structured data.
By understanding how information propagates through each type of network and how
this relates to their suitability for different applications, researchers and practitioners can
make informed decisions when designing neural network architectures for their specific
problems.

2.4 Feedforward Neural Networks


Given a neuron, we can construct a network by treating each neuron as a node. Different
neural network models possess various network connection topologies. A straightforward
topology among these is the feedforward network. The Feedforward Neural Network
(FNN), also known as the multi-layer perceptron (MLP), is one of the most basic and
widely-used types of artificial neural networks. They are called "feedforward" because
information flows through the network in a unidirectional manner, from the input layer
to the output layer, without any feedback loops. MLPs consist of an input layer, one or
more hidden layers, and an output layer, with each layer fully connected to the next.
The primary purpose of feedforward neural networks is to approximate complex func-
tions and transform input features into a more suitable representation for the given task.
By learning a hierarchical representation of the input data, MLPs can capture intricate
patterns and relationships, making them effective for a wide range of applications, such
as classification, regression, and feature extraction.
In a feedforward neural network, neurons are assigned to different layers. Neurons in
each layer receive signals from the previous layer and produce outputs to the next layer.
The layer at the beginning is called the input layer, the final layer is known as the output
layer, and the layers in between are referred to as hidden layers. There is no feedback
in the entire network, and signals propagate unidirectionally from the input layer to the
output layer, which can be represented by a directed acyclic graph. Figure 2.7 presents a
typical multi-layer feedforward neural network.

Notations for Describing Feedforward Neural Networks The following table presents
the notations used to describe feedforward neural networks.

Notation: Meaning
L: the number of layers in the neural network
Ml: the number of neurons in layer l
fl(·): the activation function of neurons in layer l
W(l) ∈ RMl×Ml−1: the weight matrix from layer l − 1 to layer l
b(l) ∈ RMl: the bias vector of layer l
z(l) ∈ RMl: the net input (pre-activation) of neurons in layer l
a(l) ∈ RMl: the output (post-activation) of neurons in layer l

Let a(0) = x; the feedforward neural network propagates information by iterating the following formulas:

z(l) = W(l) a(l−1) + b(l),
a(l) = fl(z(l)).    (2.1)
First, the net activation z (l) for layer l is computed based on the activation values a(l−1)
from layer l − 1, and then passed through an activation function to obtain the activation
values for layer l. Thus, each neural layer can be considered an affine transformation
followed by a nonlinear transformation.
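Equation (2.1) translates directly into a loop over layers. A minimal NumPy sketch, assuming lists of weight matrices and bias vectors and a single activation function shared by all layers:

import numpy as np

def forward(x, weights, biases, f):
    # a^(0) = x; then z^(l) = W^(l) a^(l-1) + b^(l) and a^(l) = f(z^(l))
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b   # affine transformation
        a = f(z)        # element-wise nonlinear transformation
    return a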

Figure 2.7: Multi-layer feed-forward network.

2.4.1 Universal Approximation Theorem


One of the key theoretical foundations of feedforward neural networks is the Universal
Approximation Theorem, which states that an MLP with a single hidden layer containing
a finite number of neurons can approximate any continuous function on compact subsets
of RD , under mild assumptions on the activation function. This theorem highlights the
expressive power of MLPs and their ability to model complex relationships between inputs
and outputs.

Theorem 2.1 (Universal Approximation Theorem, Cybenko, 1989; Hornik et al., 1989)
Let φ(·) be a non-constant, bounded, and monotonically-increasing continuous function. For
any given function f in C(ID), the space of continuous functions on the D-dimensional unit hypercube ID = [0, 1]D, there exists an integer M, a set of real numbers vm, bm ∈ R, and real vectors wm ∈ RD for m = 1, · · · , M, such that the function

F(x) = Σ_{m=1}^M vm φ(wm⊤ x + bm)    (2.2)

can approximate f with |F (x) − f (x)| < ϵ for all x ∈ ID , where ϵ > 0 is an arbitrarily small
positive number.

The Universal Approximation Theorem holds due to the combination of hidden layers
and nonlinear activation functions in the network architecture. The hidden layers allow
the network to learn a hierarchical representation of the input data, with each layer
capturing increasingly abstract features. The nonlinear activation functions, such as the
sigmoid or ReLU function, introduce non-linearity into the network, enabling it to model
complex, non-linear relationships between the inputs and outputs. As the number of
hidden neurons increases, the MLP becomes more expressive and can approximate a
wider range of functions.
The theorem demonstrates the computational power of neural networks to approxi-
mate a given continuous function but does not specify how to find such a network or its
optimality. Additionally, in machine learning applications, the true mapping function is
unknown, generally learned through empirical risk minimization and regularization due
to neural networks’ susceptibility to overfitting on the training set.
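As an illustrative sketch of this expressive power (not a constructive procedure), a single hidden layer of the form in equation (2.2) can be fit to a simple continuous function with PyTorch; the width M, the target function, and the training settings below are arbitrary choices:

import torch
import torch.nn as nn

# F(x) = sum_m v_m * phi(w_m^T x + b_m), realized as one hidden sigmoid layer
M = 64
net = nn.Sequential(nn.Linear(1, M), nn.Sigmoid(), nn.Linear(M, 1, bias=False))

x = torch.linspace(0, 1, 200).unsqueeze(1)   # points in the unit interval
y = torch.sin(2 * torch.pi * x)              # a continuous target function

opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for step in range(2000):
    opt.zero_grad()
    loss = ((net(x) - y) ** 2).mean()        # mean squared approximation error
    loss.backward()
    opt.step()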

2.4.2 Application to Machine Learning


Feedforward neural networks, particularly multi-layer perceptrons (MLPs), have found
extensive application in various machine learning tasks, especially in supervised learning.
In this setting, an MLP can be trained to map input features to corresponding target
labels, effectively learning a function that can predict the output for new, unseen inputs.

Feature Extraction One important application of MLPs is feature extraction, where the
network learns to transform raw input data into a more informative and discriminative
representation. The characteristics of input samples significantly impact the performance
of classifiers. In supervised learning, good features can greatly enhance the classifier’s
effectiveness. To achieve superior classification results, it is often necessary to transform
the sample's original feature vector x ∈ RD into a more effective feature vector φ(x) ∈ RD′, a process known as feature extraction.
Multi-layer feedforward neural networks can be viewed as a non-linear composite function φ : RD → RD′, mapping the input x to the output φ(x). By training an MLP on
a large dataset with known labels, the network can learn to extract meaningful features
that capture the underlying structure of the data. These learned features can then be
used as input to other machine learning algorithms, such as support vector machines or
decision trees, to improve their performance.

Loss Functions The choice of loss function plays a crucial role in the learning process of
an MLP. The loss function quantifies the discrepancy between the network’s predictions
and the true labels, providing a measure of how well the network is performing. Different
loss functions are suited for different types of tasks:

• Mean Squared Error (MSE): Commonly used for regression problems, where the
goal is to predict continuous values.
• Cross-Entropy Loss: Often used for classification problems, where the goal is to
predict discrete class labels.

The choice of loss function can significantly impact the network’s learning dynamics
and its ability to generalize to new data.

Integration with Classifiers Given a training sample (x, y), an MLP can be used to
map the input x to a transformed feature vector φ(x). This transformed feature vector
can then be used as input to a classifier g(·):

ŷ = g(φ(x); θ),

where g(·) is a linear or non-linear classifier, θ represents the parameters of the classifier,
and ŷ is the output of the classifier. In some cases, the classifier can be integrated into
the MLP as the last layer of the network:

• Binary Classification: For problems where y ∈ {0, 1}, a single neuron with a sigmoid
activation function can be used in the last layer. The network’s output a(L) ∈ R can
directly serve as the conditional probability of the positive class:

p(y = 1|x) = a(L) .

• Multi-Class Classification: For problems where y ∈ {1, . . . , C}, the last layer can
consist of C neurons with a softmax activation function. The output of the last
layer ŷ ∈ RC represents the predicted conditional probabilities for each class:

ŷ = softmax(z(L) ),

where z(L) ∈ RC is the net input of the neurons in the last layer.

By integrating the classifier into the MLP, the network can directly output the con-
ditional probabilities of different classes, effectively combining feature extraction and
classification into a single model.
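A minimal PyTorch sketch of this integration; the layer sizes here are arbitrary, and in practice the softmax is usually folded into the loss function (e.g., nn.CrossEntropyLoss applied directly to the net input z(L)):

import torch
import torch.nn as nn

D, H, C = 20, 64, 5   # input dimension, hidden width, number of classes
mlp = nn.Sequential(
    nn.Linear(D, H), nn.ReLU(),   # feature extraction phi(x)
    nn.Linear(H, C),              # last layer producing the net input z^(L)
)

x = torch.randn(8, D)                  # a batch of 8 samples
probs = torch.softmax(mlp(x), dim=-1)  # predicted conditional probabilities per class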

2.4.3 Parameter Learning


Training a feedforward neural network involves finding the optimal values of the net-
work’s parameters (weights and biases) that minimize a chosen loss function. This opti-
mization process is typically performed using gradient-based methods, such as stochastic
gradient descent (SGD), where the gradients of the loss function with respect to the
parameters are computed and used to update the parameters iteratively.
For a given training sample (x, y), the cross-entropy loss function is commonly used:

L(y, ŷ) = −y⊤ log ŷ,

where y ∈ {0, 1}C is the one-hot vector representation of the true label y, and ŷ is the
predicted output of the network.

Given a training set D = {(x(n), y(n))}_{n=1}^N, the structured risk function on the dataset is defined as:

R(W, b) = (1/N) Σ_{n=1}^N L(y(n), ŷ(n)) + (λ/2) ∥W∥²F,
where W and b represent all the weight matrices and bias vectors in the network, respec-
tively. The second term is a regularization term used to prevent overfitting, with λ > 0
being a hyperparameter that controls the strength of regularization. The Frobenius norm
∥W∥²F is defined as:

∥W∥²F = Σ_{l=1}^L Σ_{i=1}^{Ml} Σ_{j=1}^{Ml−1} (wij(l))²,

where L is the number of layers, and Ml is the number of neurons in layer l.


The backpropagation algorithm is the cornerstone of parameter learning in feedfor-
ward neural networks. It is an efficient method for computing the gradients of the loss
function with respect to the network’s parameters by recursively applying the chain rule
of differentiation. The algorithm consists of two main stages:
• Forward pass: The input is propagated through the network to compute the output
and the corresponding loss.
• Backward pass: The gradients of the loss function with respect to the parameters
are computed and propagated back through the network.
During each iteration of the gradient descent method, the parameters W(l) and b(l) at
layer l are updated as follows:
W(l) ← W(l) − α((1/N) Σ_{n=1}^N ∂L(y(n), ŷ(n))/∂W(l) + λW(l)),    (2.3)
b(l) ← b(l) − α((1/N) Σ_{n=1}^N ∂L(y(n), ŷ(n))/∂b(l)),    (2.4)

where α is the learning rate.


The computational complexity of the backpropagation algorithm scales linearly with
the number of parameters in the network, making it efficient for training large-scale
feedforward neural networks. However, as the network becomes deeper (i.e., with more
hidden layers), the gradients can become increasingly small (vanishing gradient prob-
lem) or large (exploding gradient problem), which can slow down the learning process
and make it challenging to train very deep networks. Techniques such as careful weight
initialization, gradient clipping, and using architectures like residual networks can help
mitigate these issues.
In summary, parameter learning in feedforward neural networks is accomplished us-
ing gradient-based optimization methods, with the backpropagation algorithm serving as
an efficient way to compute the required gradients. By minimizing the chosen loss func-
tion, the network learns to adapt its parameters to fit the training data and generalize to
new, unseen samples.

2.4.4 Dropout
Deep neural networks have diverse architectures that range from shallow to very deep,
aiming to generalize from given datasets. Nonetheless, in this quest, they may learn not
just the features but the statistical noise, leading to a phenomenon known as overfitting.
Overfitting enhances model performance on the training dataset at the expense of its pre-
dictive power on new data points. Traditionally, regularization techniques that penalize
the network weights have been used to tackle overfitting, but they often fall short when
dealing with large network architectures or limited data.

Figure 2.8: Dropout Neural Net Model showing a standard neural net and a thinned net
produced by applying dropout.

Dropout as a Robust Regularization Technique A comprehensive solution to overfitting is to average predictions across all possible parameter configurations, but this
approach is infeasible due to computational constraints. Ensemble techniques like Ad-
aBoost, XGBoost, and Random Forest suggest a better path, where multiple models col-
lectively improve prediction robustness. However, these methods become unwieldy with
the increasing depth of neural networks. Dropout emerges as an efficient alternative,
inspired by the concept of ensemble learning.
Dropout regularizes the network by randomly dropping units, both input and hidden,
during the training phase. This stochastic thinning of the network induces a form of
ensemble learning within a single model architecture, effectively preventing units from
co-adapting too closely to the training data and ensuring that the network learns more
generalized and robust representations.

Figure 2.9: A unit’s behavior during training with dropout probability p, and its adjusted
behavior during inference.

Mathematical Formulation and Implications of Dropout The mathematical essence of dropout is elegantly simple yet profound. In a standard neural network without
dropout, the feed-forward operation can be described by the following equations for
any layer l and hidden unit i:

z(l+1) = w(l+1) y(l) + b(l+1),
y(l+1) = f(z(l+1)),

where f denotes the activation function, such as the logistic sigmoid f(x) = 1/(1 + exp(−x)).
With the introduction of dropout, the operation adapts to incorporate a random thinning
as illustrated in Figure 2.10.
During training, the presence of each unit is determined by a Bernoulli random vari-
able with probability p, leading to the modified feed-forward equations:

r(l) ∼ Bernoulli(p),
ŷ(l) = r(l) ⊙ y(l),
z(l+1) = w(l+1) ŷ(l) + b(l+1).

Here, r(l) is a vector of independent Bernoulli random variables, and ŷ(l) represents the
thinned output of layer l, which then feeds into the subsequent layer. This process effec-
tively simulates training a vast ensemble of networks with shared weights.
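A minimal NumPy sketch of this training-time thinning, with p as the retention probability as in the equations above; scaling the output by p at inference (equivalent to scaling the weights) preserves the training-time expectation:

import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(y, p, training=True):
    # y: a layer's output; p: probability that each unit is retained
    if training:
        r = rng.binomial(1, p, size=y.shape)   # r^(l) ~ Bernoulli(p)
        return r * y                           # thinned output y_hat^(l)
    return p * y   # inference: scale so the expected output matches training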

Dropout During Inference and PyTorch Implementation During inference, dropout is not applied, and the weights are adjusted to compensate for the dropout probability p,
maintaining the same expected output as during training. This adjustment is crucial for
ensuring the consistency of the model’s performance.
Implementing dropout in PyTorch is straightforward with the built-in Dropout layer, which handles the random deactivation of units during training. Note that PyTorch's argument p is the probability of dropping a unit rather than retaining it, and that PyTorch uses the inverted-dropout formulation: the surviving activations are scaled up by 1/(1 − p) during training, so no weight adjustment is needed at inference.

Figure 2.10: Comparison of the basic operations of a standard network versus a network employing dropout.

import torch.nn as nn

dropout = nn.Dropout(p=0.5)   # during training, each element is zeroed with probability 0.5
output = dropout(input)       # input: any tensor of activations

In conclusion, dropout is an ingenious regularization strategy that mitigates overfitting by promoting the learning of robust features and preventing the network from
relying too heavily on any particular unit. By integrating dropout, neural networks can
achieve better generalization, making them more effective for a wide array of tasks.

2.5 Backpropagation Algorithm

Assuming the use of stochastic gradient descent for neural network parameter learning,
given a sample (x, y), we feed it into the neural network model to obtain the network out-
put ŷ. Assuming the loss function is L(y, ŷ), parameter learning requires computing
the derivative of the loss function with respect to each parameter. For generality, com-
pute the partial derivatives for parameters W(l) and b(l) at layer l. The computation of ∂L(y, ŷ)/∂W(l), involving vector or matrix differentiation, is complex. Thus, we first calculate the partial derivatives of L(y, ŷ) with respect to each element of the parameter matrix, ∂L(y, ŷ)/∂wij(l). According to the chain rule,
∂L(y, ŷ)/∂wij(l) = (∂z(l)/∂wij(l)) (∂L(y, ŷ)/∂z(l)),    (2.5)
∂L(y, ŷ)/∂b(l) = (∂z(l)/∂b(l)) (∂L(y, ŷ)/∂z(l)).    (2.6)
The second term in both equations (2.5) and (2.6), the derivative of the objective function with respect to the net input z(l) at layer l, is known as the error term and needs to be computed only once. This reduces the computation to three partial derivatives: ∂z(l)/∂wij(l), ∂z(l)/∂b(l), and ∂L(y, ŷ)/∂z(l).
Now, let’s calculate these three derivatives:
(1) For the derivative ∂z(l)/∂wij(l): given z(l) = W(l) a(l−1) + b(l), the derivative is as follows:

∂z(l)/∂wij(l) = [∂z1(l)/∂wij(l), · · · , ∂zi(l)/∂wij(l), · · · , ∂zMl(l)/∂wij(l)]    (2.7)
= [0, · · · , ∂(wi(l) a(l−1) + bi(l))/∂wij(l), · · · , 0]    (2.8)
= [0, · · · , aj(l−1), · · · , 0]    (2.9)
≜ Ii(aj(l−1)) ∈ R1×Ml,    (2.10)

where Ii(aj(l−1)) denotes a row vector whose i-th element is aj(l−1) and whose remaining elements are all 0.

This matrix differentiation uses the denominator layout, where a column vector’s
derivative with respect to a scalar is a row vector.
(2) For the derivative ∂z(l)/∂b(l): since z(l) = W(l) a(l−1) + b(l), the derivative is

∂z(l)/∂b(l) = IMl ∈ RMl×Ml,

where IMl is the Ml × Ml identity matrix.
where I Ml is the Ml × Ml identity matrix.
(3) For the derivative ∂L(y, ŷ)/∂z(l): it represents the influence of the neurons at layer l on the final loss, and hence the sensitivity of the final loss to the neurons at layer l, often termed the error term for the neurons at layer l, denoted by δ(l):

δ(l) ≜ ∂L(y, ŷ)/∂z(l) ∈ RMl.
The error term δ(l) also indirectly reflects the degree of contribution of different neu-
rons to the network’s capability, thereby effectively addressing the Credit Assignment
Problem (CAP).
Given z(l+1) = W(l+1) a(l) + b(l+1), we have

∂z(l+1)/∂a(l) = (W(l+1))⊤ ∈ RMl×Ml+1.    (2.11)
Given a(l) = fl(z(l)), where fl(·) is applied element-wise, it follows that

∂a(l)/∂z(l) = ∂fl(z(l))/∂z(l)    (2.12)
= diag(fl′(z(l))) ∈ RMl×Ml.    (2.13)
Thus, according to the chain rule, the error term for layer l is

δ(l) ≜ ∂L(y, ŷ)/∂z(l)    (2.14)
= (∂a(l)/∂z(l)) · (∂z(l+1)/∂a(l)) · (∂L(y, ŷ)/∂z(l+1))    (2.15)
= diag(fl′(z(l))) · (W(l+1))⊤ · δ(l+1)    (2.16)
= fl′(z(l)) ⊙ ((W(l+1))⊤ δ(l+1)) ∈ RMl,    (2.17)

where ⊙ denotes the Hadamard product, implying element-wise multiplication.


It is evident from equation (2.17) that the error term for layer l can be derived
from the error term of layer l + 1, illustrating the essence of backpropagation. The back-
propagation algorithm essentially means that the error term (or sensitivity) of a neuron
in layer l is the weighted sum of the error terms of neurons in layer l + 1 it connects to,
subsequently multiplied by the gradient of the neuron’s activation function.
After computing the three derivatives mentioned, equation (2.5) can be expressed as

∂L(y, ŷ)/∂wij(l) = Ii(aj(l−1)) δ(l)
= δi(l) aj(l−1).

The gradient of L(y, ŷ) with respect to the weights W (l) of layer l is

∂L(y, ŷ)/∂W(l) = δ(l) (a(l−1))⊤ ∈ RMl×Ml−1.
Similarly, the gradient of L(y, ŷ) with respect to the biases b(l) of layer l is

∂L(y, ŷ)/∂b(l) = δ(l) ∈ RMl.
Having computed the error term for each layer, we can determine the gradient for
each layer’s parameters. Thus, the training process of a feedforward neural network
using the backpropagation algorithm can be summarized in three steps:


(1) Forward computation of the net input z (l) and activation a(l) for each layer up to
the last layer;
(2) Backward propagation of the error term δ(l) for each layer;
(3) Computation and update of each layer’s parameters based on the derivatives.

Algorithm 2 Stochastic Gradient Descent with Backpropagation

Require: Training set D = {(x(n), y(n))}_{n=1}^N, validation set V, learning rate α, regularization coefficient λ, number of layers L, number of neurons Ml for 1 ≤ l ≤ L.
1: Randomly initialize W, b.
2: repeat
3:   Randomly shuffle the samples in the training set D.
4:   for n = 1 to N do
5:     Select sample (x(n), y(n)) from D.
6:     Forward compute the net input z(l) and activation a(l) for each layer up to the last layer.
7:     Backward propagate the error term δ(l) for each layer using equation (2.17).
8:     for all l do
9:       Compute the gradient of each layer's weights: ∂L(y(n), ŷ(n))/∂W(l) = δ(l) (a(l−1))⊤.
10:      Compute the gradient of each layer's biases: ∂L(y(n), ŷ(n))/∂b(l) = δ(l).
11:      Update the weights: W(l) ← W(l) − α(δ(l) (a(l−1))⊤ + λW(l)).
12:      Update the biases: b(l) ← b(l) − α δ(l).
13:    end for
14:  end for
15: until the error rate of the neural network model on the validation set V no longer decreases.
Ensure: The final weights W and biases b.
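The three steps can also be sketched compactly in NumPy. The sketch below assumes logistic hidden activations and a softmax output trained with cross-entropy loss, for which the output-layer error term reduces to ŷ − y:

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(x, y, Ws, bs, alpha=0.1, lam=0.0):
    # (1) Forward: compute z^(l) and a^(l) for each layer
    activations = [x]
    for l, (W, b) in enumerate(zip(Ws, bs)):
        z = W @ activations[-1] + b
        if l < len(Ws) - 1:
            activations.append(logistic(z))      # hidden layer
        else:
            e = np.exp(z - z.max())
            activations.append(e / e.sum())      # softmax output layer
    # (2) Backward: propagate the error term delta^(l)
    delta = activations[-1] - y   # error term for softmax + cross-entropy
    for l in reversed(range(len(Ws))):
        grad_W = np.outer(delta, activations[l])        # delta^(l) (a^(l-1))^T
        grad_b = delta
        if l > 0:
            a = activations[l]
            delta = (a * (1 - a)) * (Ws[l].T @ delta)   # equation (2.17), f' = a(1 - a)
        # (3) Update this layer's parameters with the regularized gradient
        Ws[l] -= alpha * (grad_W + lam * Ws[l])
        bs[l] -= alpha * grad_b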

2.6 Automatic Gradient Computation


Neural network parameters are primarily optimized through gradient descent. Once the
loss function and the network structure are defined, the gradient of the loss function
with respect to each parameter can be manually calculated using the chain rule and
implemented in code. However, manual differentiation is not only tedious but also prone to errors, making the implementation of neural networks inefficient and unreliable. This is particularly true for complex network architectures with a large number of
parameters, where manual differentiation becomes impractical.
To address this issue, modern deep learning frameworks include functionalities for
automatic gradient computation. This allows researchers and practitioners to focus on
defining the network structure and implementing it in code, while the computation of
gradients is done automatically, significantly improving development efficiency and re-
ducing the likelihood of errors.
Methods for automatic gradient computation can be categorized into three main
types: Numerical Differentiation, Symbolic Differentiation, and Automatic Differentia-
tion.

2.6.1 Numerical Differentiation


Numerical Differentiation is a method that uses numerical techniques to compute the
derivative of a function f (x). The derivative of function f (x) at point x is defined as:

f′(x) = lim_{∆x→0} (f(x + ∆x) − f(x)) / ∆x.    (2.18)
To compute the derivative of f (x) at a point x, a small perturbation ∆x is added
to x, and the gradient is approximated using the formula above. Although straightfor-
ward to implement, numerical differentiation faces challenges in choosing an appropriate
value for ∆x. If ∆x is too small, it can introduce significant round-off errors due to the
limitations of floating-point arithmetic. Conversely, if ∆x is too large, it may increase
truncation error, leading to inaccurate derivative calculations.
In practice, to reduce truncation errors, the following formula is often used:

f′(x) ≈ (f(x + ∆x) − f(x − ∆x)) / (2∆x).    (2.19)
This formula, known as the central difference formula, provides a more accurate
approximation of the derivative f ′ (x) compared to the forward or backward difference
methods. The central difference method calculates the slope of the secant line through
the points (x − ∆x, f (x − ∆x)) and (x + ∆x, f (x + ∆x)), effectively using information
from both sides of x. This symmetric approach reduces the error in the derivative approx-
imation, as the leading error terms cancel out. As a result, for a given ∆x, the central
difference method tends to offer a superior balance between truncation error and round-
off error, leading to a more reliable estimate of the derivative, especially when dealing
with functions that are smooth near the point x.
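A minimal sketch comparing the one-sided and central differences on a function whose derivative is known; the test point and step size are arbitrary:

import numpy as np

f = np.sin
x0, dx = 1.0, 1e-5

forward = (f(x0 + dx) - f(x0)) / dx              # one-sided difference, equation (2.18)
central = (f(x0 + dx) - f(x0 - dx)) / (2 * dx)   # central difference, equation (2.19)

exact = np.cos(x0)
print(abs(forward - exact), abs(central - exact))  # the central estimate is far closer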
The computational complexity of numerical differentiation is another drawback. For a function with N parameters, each parameter must be perturbed individually, and each perturbed evaluation of the function itself costs O(N), resulting in a total complexity of O(N²) for a single gradient computation.
This makes numerical differentiation inefficient for functions with a large number of
parameters, such as deep neural networks.

2.6.2 Symbolic Differentiation


Symbolic Differentiation, also known as symbolic computation, is an automatic differenti-
ation technique that relies on symbolic manipulation of mathematical expressions. Unlike
numerical differentiation, symbolic differentiation handles expressions with variables as
symbols, without requiring specific numerical values. The process involves transforming
the input expression using predefined rules, such as the chain rule, product rule, and
quotient rule, until no further simplification is possible.
One of the main advantages of symbolic differentiation is that it provides exact deriva-
tives, avoiding the round-off and truncation errors associated with numerical differentia-
tion. Additionally, the derived expressions can be optimized using symbolic computation
techniques, leading to more efficient implementations.
For example, consider the process of differentiating and simplifying the mathematical expression f(x) = (x² + 3x + 2)·eˣ / (x + 1). We aim to find f′(x), the derivative of f(x) with respect to x, using symbolic differentiation.
This process can be implemented in Python using the SymPy library, a tool for sym-
bolic mathematics. The following code demonstrates how to perform the differentiation
and simplify the result:
import sympy as sp

# Define the symbol
x = sp.symbols('x')

# Define the function
f = (x**2 + 3*x + 2) * sp.exp(x) / (x + 1)

# Differentiate f with respect to x
f_prime = sp.diff(f, x)

# Simplify the derivative
f_prime_simplified = sp.simplify(f_prime)

# Display the original function, its derivative, and the simplified derivative
print("Original function: f(x) =", f)
print("Derivative: f'(x) =", f_prime)
print("Simplified derivative: f'(x) =", f_prime_simplified)

Running this code produces:

Original function: f(x) = (x**2 + 3*x + 2)*exp(x)/(x + 1)
Derivative: f'(x) = (x**2 + 3*x + 2)*exp(x)/(x + 1) + (2*x + 3)*exp(x)/(x + 1) - (x**2 + 3*x + 2)*exp(x)/(x + 1)**2
Simplified derivative: f'(x) = (x + 3)*exp(x)

Note that since (x² + 3x + 2)/(x + 1) = x + 2, the exact derivative indeed simplifies to (x + 3)eˣ.
The code snippet uses SymPy to differentiate the function f (x) with respect to x and
then simplifies the resulting expression. This showcases the power of symbolic differen-
tiation in providing exact, simplified forms of derivatives, which are crucial for further
analytical or numerical computations.
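One way such exact results feed back into numerical work is SymPy's lambdify, which compiles a symbolic expression into an ordinary numerical function. The following sketch (reusing the f_prime_simplified expression from the listing above) shows one such usage:

import sympy as sp

x = sp.symbols('x')
f = (x**2 + 3*x + 2) * sp.exp(x) / (x + 1)
f_prime_simplified = sp.simplify(sp.diff(f, x))   # (x + 3)*exp(x)

# Compile the exact symbolic derivative into a plain numerical function
f_prime_num = sp.lambdify(x, f_prime_simplified, 'math')
print(f_prime_num(0.0))   # (0 + 3)*exp(0) = 3.0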
Symbolic differentiation overcomes the limitations of numerical differentiation by

avoiding round-off and truncation errors, offering precise solutions by manipulating sym-
bols rather than numerical approximations. It transforms input expressions iteratively or
recursively using predefined rules, stopping when no further transformation by these
rules is possible.
Symbolic differentiation can compute the mathematical representation of gradients
at compile time, further optimizing the process through symbolic computation methods.
It boasts platform independence, allowing execution on both CPUs and GPUs. However,
it does come with its limitations:

1. Long compilation times, especially for complex expressions or loops that require extensive unrolling.

2. The need for a special language or syntax to represent mathematical expressions and declare variables.

3. Difficulty in debugging due to the abstract nature of symbolic representations.

2.6.3 Automatic Differentiation


Automatic Differentiation (AD) is a method that computes derivatives of a function, rep-
resented by a computer program, in an automated fashion. Unlike symbolic differentia-
tion, which operates on mathematical expressions, AD works directly with the code that
implements the function.
The core principle of AD is that any numerical computation can be decomposed into
a sequence of elementary operations, such as addition, subtraction, multiplication, and
division, as well as elementary functions like exponential, logarithm, and trigonometric
functions. By applying the chain rule to these elementary operations, AD can compute
the gradient of the overall function.
AD operates on a computational graph, which is a directed acyclic graph representing
the flow of data through the elementary operations. Each node in the graph represents
an operation, and the edges represent the flow of data between operations. By traversing
the graph and applying the chain rule, AD can compute the gradients of the output with
respect to the inputs.
AD can operate in two modes: forward mode and reverse mode. In forward mode,
the gradients are computed in the same order as the original function evaluation, while
in reverse mode, the gradients are computed in the opposite order. Reverse mode, also
known as backpropagation, is more commonly used in deep learning because it is more
efficient for functions with a large number of inputs and a small number of outputs, which
is typical for neural networks. The computational cost of reverse mode is proportional
to the number of outputs, making it well-suited for computing gradients of scalar-valued
functions, such as loss functions.
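As a minimal forward-mode sketch, assuming plain Python and no AD library, one can propagate (value, derivative) pairs, often called dual numbers, through each elementary operation; the Dual class and the helper names below are illustrative only:

import math

class Dual:
    """A (value, derivative) pair propagated through elementary operations."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot  # dot = derivative w.r.t. the seeded input

    def __add__(self, other):
        o = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__

    def __mul__(self, other):
        o = other if isinstance(other, Dual) else Dual(other)
        # product rule: (uv)' = u'v + uv'
        return Dual(self.val * o.val, self.dot * o.val + self.val * o.dot)
    __rmul__ = __mul__

    def __neg__(self):
        return Dual(-self.val, -self.dot)

def d_exp(d):    # exponential rule: (e^u)' = e^u * u'
    v = math.exp(d.val)
    return Dual(v, v * d.dot)

def d_recip(d):  # reciprocal rule: (1/u)' = -u'/u^2
    return Dual(1.0 / d.val, -d.dot / d.val ** 2)

def f(x, w, b):  # the logistic composite of Eq. (2.20) below
    return d_recip(d_exp(-(w * x + b)) + 1)

# Seed dot = 1 on w to obtain df/dw at x = 1, w = 0, b = 0
print(f(Dual(1.0), Dual(0.0, 1.0), Dual(0.0)).dot)   # 0.25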
For clarity, let’s illustrate the process of automatic differentiation using a common
composite function found in neural networks. Consider the composite function f (x; w, b)
defined as:
f(x; w, b) = 1 / (exp(−(wx + b)) + 1)    (2.20)

where x is the input scalar, and w and b are the weight and bias parameters, respectively.
First, we decompose the composite function f (x; w, b) into a series of basic operations
and form a computational graph. A computational graph is a graphical representation
of mathematical operations, where each non-leaf node represents a basic operation, and
each leaf node represents an input variable or constant.
The composite function f(x; w, b) consists of six basic functions h_i, for 1 ≤ i ≤ 6. The following table delineates each basic function along with its local derivatives, facilitating the computation of the derivatives of f(x; w, b) with respect to w and b through predefined rules.

Function          Local derivative(s) w.r.t. its inputs
h1 = x · w        ∂h1/∂w = x,   ∂h1/∂x = w
h2 = h1 + b       ∂h2/∂h1 = 1,  ∂h2/∂b = 1
h3 = −h2          ∂h3/∂h2 = −1
h4 = exp(h3)      ∂h4/∂h3 = exp(h3)
h5 = h4 + 1       ∂h5/∂h4 = 1
h6 = 1/h5         ∂h6/∂h5 = −1/h5²

To differentiate the entire composite function f(x; w, b) with respect to w and b, one multiplies the local derivatives along the paths connecting f(x; w, b) to w and to b within the computational graph:

∂f(x; w, b)/∂w = (∂f(x; w, b)/∂h6)(∂h6/∂h5)(∂h5/∂h4)(∂h4/∂h3)(∂h3/∂h2)(∂h2/∂h1)(∂h1/∂w)    (2.21)

∂f(x; w, b)/∂b = (∂f(x; w, b)/∂h6)(∂h6/∂h5)(∂h5/∂h4)(∂h4/∂h3)(∂h3/∂h2)(∂h2/∂b)    (2.22)

Evaluating ∂f(x; w, b)/∂w at x = 1, w = 0, b = 0, the derivative is calculated as 0.25. If multiple paths from the function to a parameter exist, the derivatives along each path are summed to obtain the final gradient.
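This backward accumulation can be verified with a few lines of plain Python (a sketch that follows the table above step by step, not a general AD implementation): run the forward pass, record each intermediate value h_i, then multiply the local derivatives from h6 back toward the inputs.

import math

# Forward pass: evaluate and record the six elementary steps
x, w, b = 1.0, 0.0, 0.0
h1 = x * w
h2 = h1 + b
h3 = -h2
h4 = math.exp(h3)
h5 = h4 + 1
h6 = 1 / h5                      # h6 = f(x; w, b) = 0.5

# Backward pass: accumulate df/dh_i using the local derivatives
d_h6 = 1.0                       # df/dh6
d_h5 = d_h6 * (-1 / h5 ** 2)     # dh6/dh5 = -1/h5^2
d_h4 = d_h5 * 1.0                # dh5/dh4 = 1
d_h3 = d_h4 * math.exp(h3)       # dh4/dh3 = exp(h3)
d_h2 = d_h3 * (-1.0)             # dh3/dh2 = -1
d_h1 = d_h2 * 1.0                # dh2/dh1 = 1
d_w = d_h1 * x                   # dh1/dw = x
d_b = d_h2 * 1.0                 # dh2/db = 1

print(d_w, d_b)                  # 0.25 0.25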
Based on the sequence of derivative computations, automatic differentiation (AD) is
divided into two modes: forward mode and reverse mode.

• The forward mode computes gradients in the same direction as the computational graph is evaluated, recursively applying the chain rule. For example, when calculating ∂f(x; w, b)/∂w at x = 1, w = 0, b = 0, the forward mode's cumulative calculation sequence yields a derivative of 0.25 for f(x; w, b) with respect to w.

• Reverse mode, on the other hand, computes gradients in the opposite direction of the computational graph's flow. For ∂f(x; w, b)/∂w under the same conditions, the reverse mode's sequence also leads to a derivative of 0.25.

Both forward and reverse modes are applications of the chain rule for gradient accumulation. Reverse mode is equivalent to the backpropagation technique used for computing gradients. For a general function f : Rᴺ → Rᴹ, the forward mode requires N traversals, one per input variable, whereas the reverse mode requires M traversals, one per output. When N > M, reverse mode is more efficient. In feedforward neural network parameter learning, where many parameters map to a single scalar loss, reverse mode is the most efficient method, requiring only one pass.
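In practice this single reverse pass is exactly what a framework's backward call performs. As a brief sketch (assuming PyTorch is available), the same gradient of Eq. (2.20) is obtained as follows:

import torch

x = torch.tensor(1.0)
w = torch.tensor(0.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

# The composite function of Eq. (2.20); the graph is recorded on the fly
f = 1 / (torch.exp(-(w * x + b)) + 1)
f.backward()                     # one reverse-mode traversal
print(w.grad, b.grad)            # tensor(0.2500) tensor(0.2500)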

Comparison of Automatic Gradient Computation Methods  The following table summarizes the strengths and weaknesses of the three main automatic gradient computation methods in terms of computational efficiency, numerical stability, and ease of implementation:

                          Numerical Diff.   Symbolic Diff.   Automatic Diff.
Computational Efficiency  Low               High             High
Numerical Stability       Low               High             High
Ease of Implementation    High              Low              Medium

Automatic differentiation, particularly in reverse mode, offers the best balance of computational efficiency, numerical stability, and ease of implementation, making it the preferred choice for most deep learning frameworks.

Static and Dynamic Computational Graphs  Computational graphs can be categorized into static and dynamic graphs. Static computational graphs, used by frameworks like Theano and TensorFlow 1.x, are defined during the compilation stage and cannot be modified at runtime. This allows for extensive optimizations and efficient execution but limits flexibility.
On the other hand, dynamic computational graphs, used by frameworks like PyTorch
and TensorFlow 2.x with eager execution, are constructed on the fly during runtime. This
provides greater flexibility and easier debugging but may have some overhead compared
to static graphs.
The choice between static and dynamic computational graphs depends on the specific
requirements of the task and the trade-off between performance and flexibility. Some
frameworks, like TensorFlow 2.x, offer a hybrid approach, allowing users to choose be-
tween static graphs (via @tf.function decoration) and dynamic graphs (eager execution)
as needed.
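As a sketch of this hybrid approach (assuming TensorFlow 2.x; loss and grad_w are illustrative names), the same computation can run eagerly or be traced into a static graph with @tf.function:

import tensorflow as tf

w = tf.Variable(0.0)

def loss(x, b):
    return 1.0 / (tf.exp(-(w * x + b)) + 1.0)

# Eager (dynamic) execution: operations run immediately, graph built on the fly
with tf.GradientTape() as tape:
    y = loss(tf.constant(1.0), tf.constant(0.0))
print(tape.gradient(y, w))       # tf.Tensor(0.25, ...)

# Static graph: the same Python function traced and compiled once
@tf.function
def grad_w(x, b):
    with tf.GradientTape() as tape:
        y = loss(x, b)
    return tape.gradient(y, w)

print(grad_w(tf.constant(1.0), tf.constant(0.0)))  # same value, compiled graph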

Symbolic Differentiation and Automatic Differentiation  Both symbolic differentiation and automatic differentiation use computational graphs and the chain rule to automate the computation of derivatives. The main difference lies in when the graph is constructed and when numerical values are substituted.
Symbolic differentiation constructs the computational graph during the compilation
stage, derives the derivative expressions symbolically, and optionally optimizes them.
The numerical values are substituted only during the execution stage. In contrast, au-
tomatic differentiation constructs the computational graph on the fly during execution,
with numerical values substituted immediately. The gradients are then computed using
either forward or reverse mode traversal of the graph.

In practice, automatic differentiation is more widely used in deep learning frameworks due to its efficiency and ease of implementation, while symbolic differentiation is reserved for specialized cases where the derived expressions need to be inspected or optimized.

2.7 Conclusion
In this chapter, we have explored the fundamental concepts and techniques behind neural
networks for machine learning. Building upon the basic principles of machine learning
discussed in the previous chapter, such as models, learning criteria, and optimization
algorithms, we have delved into the specific architecture and learning mechanisms of
neural networks.
We began by examining the basic building block of neural networks: the artificial
neuron. Inspired by the structure and function of biological neurons, artificial neurons
receive input signals, process them through weighted connections and an activation func-
tion, and produce an output signal. The choice of activation function is crucial, as it
introduces nonlinearity into the network, enabling it to learn complex patterns and rep-
resentations from the input data.
Next, we investigated various network structures, including feedforward networks,
memory networks, and graph networks. Each structure is designed to handle different
types of data and solve specific tasks effectively. Feedforward networks, such as multi-
layer perceptrons (MLPs), are well-suited for processing structured data and learning
hierarchical representations. Memory networks, like recurrent neural networks (RNNs)
and long short-term memory (LSTM) networks, are designed to process sequential data
and capture temporal dependencies. Graph networks, such as graph convolutional net-
works (GCNs) and graph attention networks (GATs), are tailored for processing graph-
structured data and learning node embeddings that capture the structural information of
the graph.
We then focused on the training process of feedforward neural networks, which in-
volves finding the optimal values of the network’s parameters (weights and biases) that
minimize a chosen loss function. The backpropagation algorithm, which efficiently com-
putes the gradients of the loss function with respect to the network’s parameters by re-
cursively applying the chain rule of differentiation, forms the cornerstone of parameter
learning in neural networks. We also discussed the Universal Approximation Theorem,
which highlights the expressive power of MLPs and their ability to approximate any con-
tinuous function, given sufficient hidden neurons and appropriate activation functions.
To mitigate overfitting, a common challenge in training deep neural networks, we
introduced the concept of dropout regularization. Dropout randomly drops units during
training, effectively preventing units from co-adapting too closely to the training data
and promoting the learning of more robust and generalized representations.
Finally, we explored the various methods for automatic gradient computation, which
is essential for efficient training of neural networks. Automatic differentiation, particu-
larly in reverse mode (backpropagation), offers the best balance of computational effi-
ciency, numerical stability, and ease of implementation, making it the preferred choice

for most deep learning frameworks.


Having established a solid theoretical foundation for neural networks, the next chap-
ter will focus on the practical implementation of machine learning using PyTorch. Through
hands-on examples and step-by-step guidance, you will learn how to build, train, and
evaluate neural networks using this powerful deep learning framework. By combining
the conceptual understanding gained from this chapter with the practical skills developed
in the next, you will be well-equipped to tackle a wide range of machine learning tasks
and develop state-of-the-art solutions using neural networks.
