Mathematical Foundations of Deep Learning
Sourangshu Ghosh*
Abstract
Deep learning, as a computational paradigm, fundamentally relies on the synergy of func-
tional approximation, optimization theory, and statistical learning. This work presents an
extremely rigorous mathematical framework that formalizes deep learning through the lens of
measurable function spaces, risk functionals, and approximation theory. We begin by defin-
ing the risk functional as a mapping between measurable function spaces, establishing its
structure via Frechet differentiability and variational principles. The hypothesis complexity
of neural networks is rigorously analyzed using VC-dimension theory for discrete hypotheses
and Rademacher complexity for continuous spaces, providing fundamental insights into gen-
eralization and overfitting.
A refined proof of the Universal Approximation Theorem is developed using convolution op-
erators and the Stone-Weierstrass theorem, demonstrating how neural networks approximate
arbitrary continuous functions on compact domains with quantifiable error bounds. The depth
vs. width trade-off is explored through capacity analysis, bounding the expressive power of
networks using Fourier analysis and Sobolev embeddings, with rigorous compactness argu-
ments via the Rellich-Kondrachov theorem.
We extend the theoretical framework to training dynamics, analyzing gradient flow and sta-
tionary points, the Hessian structure of optimization landscapes, and the Neural Tangent
Kernel (NTK) regime. Generalization bounds are established through PAC-Bayes formalism
and spectral regularization, connecting information-theoretic insights to neural network sta-
bility. The analysis further extends to advanced architectures, including convolutional and
recurrent networks, transformers, generative adversarial networks (GANs), and variational
autoencoders, emphasizing their function space properties and representational capabilities.
Finally, reinforcement learning is rigorously examined through deep Q-learning and policy
optimization, with applications spanning robotics and autonomous systems. The mathemat-
ical depth is reinforced by a comprehensive exploration of optimization techniques, covering
stochastic gradient descent (SGD), adaptive moment estimation (Adam), and spectral-based
regularization methods. The discussion culminates in a deep investigation of function space
embeddings, generalization error bounds, and the fundamental limits of deep learning models.
This work bridges deep learning’s theoretical underpinnings with modern advancements, of-
fering a mathematically precise and exhaustive exposition that is indispensable for researchers
aiming to rigorously understand and extend the frontiers of deep learning theory.
Contents
1 Mathematical Foundations 4
1.1 Problem Definition: Risk Functional as a Mapping Between Spaces . . . . 4
1.1.1 Measurable Function Spaces . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.2 Risk as a Functional . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2 Approximation Spaces for Neural Networks . . . . . . . . . . . . . . . . . . . 8
1.2.1 VC-dimension theory for discrete hypotheses . . . . . . . . . . . . . . 9
1.2.2 Rademacher complexity for continuous spaces . . . . . . . . . . . . . 11
1.2.3 Sobolev Embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.2.4 Rellich-Kondrachov Compactness Theorem . . . . . . . . . . . . . . . 15
7.3.2 Autonomous Vehicles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
7.4 Popular CNN Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
7.4.1 AlexNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
7.4.2 ResNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
7.4.3 VGG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
13 Appendix 137
13.1 Linear Algebra Essentials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
13.1.1 Matrices and Vector Spaces . . . . . . . . . . . . . . . . . . . . . . . . . 137
13.1.2 Vector Spaces and Linear Transformations . . . . . . . . . . . . . . . . 138
13.1.3 Eigenvalues and Eigenvectors . . . . . . . . . . . . . . . . . . . . . . . . 139
13.1.4 Singular Value Decomposition (SVD) . . . . . . . . . . . . . . . . . . . 139
13.2 Probability and Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
13.2.1 Probability Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
13.2.2 Bayes’ Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
13.2.3 Statistical Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
13.3 Optimization Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
13.3.1 Gradient Descent (GD) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
13.3.2 Stochastic Gradient Descent (SGD) . . . . . . . . . . . . . . . . . . . . 144
13.3.3 Second-Order Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
13.4 Matrix Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
13.4.1 Matrix Differentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
13.4.2 Tensor Differentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
14 Acknowledgments 149
1 Mathematical Foundations
Deep learning is a computational paradigm for solving high-dimensional function approximation
problems. At its core, it relies on the synergy of functional approximation, optimization theory, and statistical learning. The central task is to minimize the expected risk,
min_θ E_{(x,y)∼P} [ ℓ(f_θ(x), y) ],   (1)
where X is the input space, Y is the output space, P is the data distribution, ℓ(·, ·) is a loss function, and θ are the parameters of the neural network f_θ : X → Y. This task involves the composition of several
disciplines, each of which is explored in rigorous detail below.
1.1 Problem Definition: Risk Functional as a Mapping Between Spaces
1.1.1 Measurable Function Spaces
Literature Review: Rao et al. (2024) [1] investigated approximation theory within Lebesgue
measurable function spaces, providing an analysis of operator convergence. They also established
a theoretical framework for function approximation in Lebesgue spaces and provided a rigorous
study of symmetric properties in function spaces. Mukhopadhyay and Ray (2025) [2] provided a
comprehensive introduction to measurable function spaces, with a focus on Lp-spaces and their
completeness properties. They also established the fundamental role of Lp-spaces in measure the-
ory and discussed the relationship between continuous function spaces and measurable functions.
Szoldra (2024) [3] examined measurable function spaces in quantum mechanics, exploring the role
of measurable observables in ergodic theory. They connected functional analysis and measure the-
ory to quantum state evolution and provided a mathematical foundation for quantum machine
learning in function spaces. Lee (2025) [10] investigated metric space theory and functional analy-
sis in the context of measurable function spaces in AI models. He formalized how function spaces
can model self-referential structures in AI and provided a bridge between measure theory and gen-
erative models. Song et al. (2025) [4] discussed measurable function spaces in the context of
urban renewal and performance evaluation. They developed a rigorous evaluation metric using
measurable function spaces and explored function space properties in applied data science and
urban analytics. Mennaoui et al. (2025) [5] explored measurable function spaces in the theory
of evolution equations, a key concept in functional analysis. They established a rigorous study
of measurable operator functions and provided new insights into function spaces and their role in
solving differential equations. Pedroza (2024) [6] investigated domain stability in machine learning
models using function spaces. He established a formal mathematical relationship between function
smoothness and domain adaptation and used topological and measurable function spaces to analyze
stability conditions in learning models. Cerreia-Vioglio and Ok (2024) [7] developed a new integra-
tion theory for measurable set-valued functions. They introduced a generalization of integration
over Banach-valued functions and established weak compactness properties in measurable function
spaces. Averin (2024) [8] applied measurable function spaces to gravitational entropy theory. He
provided a rigorous proof of entropy bounds using function space formalism and connected measure
theory with relativistic field equations. Potter (2025) [9] investigated measurable function spaces
in the context of Fourier analysis and crystallographic structures. He established new results on
Fourier transforms of measurable functions and introduced a novel framework for studying function
spaces in invariant shift operators.
Measurable spaces are not merely abstract structures but are the backbone of measure theory,
probability, and integration. For example, the Borel σ-algebra B(R) on the real numbers R is the
smallest σ-algebra containing all open intervals (a, b) for a, b ∈ R. This σ-algebra is pivotal in
defining Lebesgue measure, where measurable sets generalize the classical notion of intervals to
include sets that are neither open nor closed. Moreover, the construction of a σ-algebra generated
by a collection of subsets C ⊆ 2^X, denoted σ(C), provides a minimal framework that includes C
and satisfies all σ-algebra properties, enabling the systematic extension of measurability to more
complex scenarios. For instance, starting with intervals in R, one can build the Borel σ-algebra, a
critical tool in modern analysis.
The structure of a measurable space allows the definition of a measure µ, a function µ : Σ → [0, ∞]
that assigns a non-negative value to each set in Σ, adhering to two key axioms: µ(∅) = 0 and countable additivity, which states that for any disjoint collection {A_n}_{n=1}^∞ ⊆ Σ, the measure of their union satisfies
µ( ∪_{n=1}^∞ A_n ) = ∑_{n=1}^∞ µ(A_n).
This property is indispensable in extending intuitive notions
of length, area, and volume to arbitrary measurable sets, paving the way for the Lebesgue integral.
A function f : X → R is then termed Σ-measurable if for every Borel set B ∈ B(R), the preimage
f −1 (B) belongs to Σ. This definition ensures that the function is compatible with the σ-algebra, a
necessity for defining integrals and expectation in probability theory.
In summary, measurable spaces represent a highly versatile and mathematically rigorous frame-
work, underpinning vast areas of analysis and probability. Their theoretical depth lies in their
ability to systematically handle infinite operations while maintaining closure, consistency, and flex-
ibility for defining measures, measurable functions, and integrals. As such, the rigorous study of
measurable spaces is indispensable for advancing modern mathematical theory, providing a bridge
between abstract set theory and applications in real-world phenomena.
1.1.2 Risk as a Functional
Let (X, Σ_X, µ_X) and (Y, Σ_Y, µ_Y) be measurable spaces. The true risk functional is defined as:
R(f) = ∫_{X×Y} ℓ(f(x), y) dP(x, y),   (2)
where:
• f belongs to a hypothesis space F ⊆ L^p(X, µ_X);
• ℓ : Y × Y → [0, ∞) is a measurable loss function;
• P is the joint probability distribution on X × Y.
Throughout, ⟨·, ·⟩ denotes the inner product in L²(X). In the field of risk management and decision
theory, the concept of a risk functional is a mathematical formalization that captures how risk
is quantified for a given outcome or state. A risk functional, denoted as R, acts as a map that
takes elements from a given space X (which represents the possible outcomes or states) and returns
a real-valued risk measure. This risk measure, R(x), expresses the degree of risk or the adverse
outcome associated with a particular element x ∈ X. The space X may vary depending on the
context—this could be a space of random variables, trajectories, or more complex function spaces.
The risk functional maps x to R, i.e., R : X → R, where each R(x) reflects the risk associated with
the specific realization x. One of the most foundational forms of risk functionals is based on the
expectation of a loss function L(x), where x ∈ X represents a random variable or state, and L(x)
quantifies the loss associated with that state. The risk functional can be expressed as an expected
loss, written mathematically as:
R(x) = E[L(x)] = ∫_X L(x) p(x) dx   (4)
where p(x) is the probability density function of the outcome x, and the integration is taken
over the entire space X. In this setup, L(x) can be any function that measures the severity or
unfavorable nature of the outcome x. In a financial context, L(x) could represent the loss function
for a portfolio, and p(x) would be the probability density function of the portfolio’s returns. In
many cases, a specific form of L(x) is used, such as L(x) = (x − µ)2 , where µ is the target or
expected value. This choice results in the risk functional representing the variance of the outcome
x, expressed as:
R(x) = ∫_X (x − µ)² p(x) dx   (5)
This formulation captures the variability or dispersion of outcomes around a mean value, a common
risk measure in applications like portfolio optimization or risk management. Additionally, another
widely used class of risk functionals arises from quantile-based risk measures, such as Value-
at-Risk (VaR), which focuses on the extreme tail behavior of the loss distribution. The VaR at
a confidence level α ∈ [0, 1] is defined as the smallest value l such that the probability of L(x)
exceeding l is no greater than 1 − α, i.e.,
P (L(x) ≤ l) ≥ α (6)
This defines a threshold beyond which the worst outcomes are expected to occur with probability
1 − α. Value-at-Risk provides a measure of the worst-case loss under normal circumstances, but it
does not provide information about the severity of losses exceeding this threshold. To address this
limitation, the Conditional Value-at-Risk (CVaR) is introduced, which measures the expected
loss given that the loss exceeds the VaR threshold. Mathematically, CVaR at the level α is given
by:
CVaRα (x) = E[L(x) | L(x) ≥ VaRα (x)] (7)
This conditional expectation provides a more detailed assessment of the potential extreme losses
beyond the VaR threshold. The CVaR is a more comprehensive measure, capturing the tail risk
and providing valuable information about the magnitude of extreme adverse events. In cases where
the space X represents trajectories or paths, such as in the context of continuous-time processes
or dynamical systems, the risk functional is often formulated in terms of integrals over time. For
example, consider x(t) as a trajectory in the function space C([0, T ], Rn ), the space of continuous
functions on the interval [0, T ]. The risk functional in this case might quantify the total deviation
of the trajectory from a reference or target trajectory over time. A typical example could be the
total squared deviation, written as:
R(x) = ∫₀^T ∥x(t) − x̄(t)∥² dt   (8)
where x̄(t) represents a reference trajectory and ∥ · ∥ is a norm, such as the Euclidean norm. This
risk functional quantifies the total deviation (or energy) of the trajectory from the target path
over the entire time interval, and is used in various applications such as control theory and optimal trajectory planning. A common choice for the norm ∥x(t)∥ might be the Euclidean norm, ∥x(t)∥² = ∑_{i=1}^n x_i(t)², where x_i(t) are the components of the trajectory x(t) in Rⁿ. In some cases, the space X of possible
outcomes may not be a finite-dimensional vector space, but instead a Banach space or a Hilbert
space, particularly when x represents a more complex object such as a function or a trajectory.
For example, the space C([0, T ], Rn ) is a Banach space, and the risk functional may involve the
evaluation of integrals over this function space. In such settings, the risk functional can take the
form:
R(x) = ∫₀^T ∥x(t)∥_p^p dt   (9)
where ∥ · ∥p is the p-norm, and p ≥ 1. For p = 2, this risk functional represents the total energy
of the trajectory, but other norms can be used to emphasize different types of risks. For instance,
the L∞ -norm would focus on the maximum deviation of the trajectory from the target path. The
concept of convexity plays a significant role in the theory of risk functionals. Convexity ensures
that the risk associated with a convex combination of two states x1 and x2 is less than or equal to
the weighted average of the risks of the individual states. Mathematically, for λ ∈ [0, 1], convexity
demands that:
R(λx1 + (1 − λ)x2 ) ≤ λR(x1 ) + (1 − λ)R(x2 ) (10)
This property reflects the diversification effect in risk management, where mixing several states or
outcomes generally leads to a reduction in overall risk. Convex risk functionals are particularly
important in portfolio theory, where they allow for risk minimization through diversification. For
example, if R(x) represents the variance of a portfolio’s returns, then the convexity property ensures
that combining different assets will result in a portfolio with lower overall risk than the risk of any
individual asset. Monotonicity is another important property for risk functionals, ensuring that
the risk increases as the outcome becomes more adverse. If x1 is worse than x2 according to some
partial order, we have:
R(x1 ) ≥ R(x2 ) (11)
Monotonicity ensures that the risk functional behaves in a way that aligns with intuitive notions
of risk: worse outcomes are associated with higher risk. In financial contexts, this is reflected in
the fact that losses increase the associated risk measure. Finally, in some applications, the risk
functional is derived from perturbation analysis to study how small changes in parameters affect
the overall risk. Consider x(ϵ) as a perturbed trajectory, where ϵ is a small parameter, and the
Fréchet derivative of the risk functional with respect to ϵ is given by:
(d/dϵ) R(x(ϵ)) |_{ϵ=0}   (12)
This derivative quantifies the sensitivity of the risk to perturbations in the system and is crucial
in the analysis of stability and robustness. Such analyses are essential in areas like stochastic
control and optimization, where it is important to understand how small changes in the model’s
parameters can influence the risk profile.
Thus, the risk functional is a powerful tool for quantifying and managing uncertainty, and its
formulation can be adapted to various settings, from random variables and stochastic processes to
continuous trajectories and dynamic systems. The risk functional provides a rigorous mathemat-
ical framework for assessing and minimizing risk in complex systems, and its flexibility makes it
applicable across a wide range of domains.
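To make these risk functionals concrete, the following minimal sketch estimates the expected loss (4), the variance-type functional (5), and the quantile-based measures VaR (6) and CVaR (7) by Monte Carlo. The simulated lognormal loss distribution and the level α = 0.95 are illustrative assumptions, not prescriptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated losses L(x); the lognormal distribution is purely illustrative.
losses = rng.lognormal(mean=0.0, sigma=0.8, size=100_000)

# Expected-loss risk functional R = E[L(x)], Eq. (4), via the sample mean.
expected_loss = losses.mean()

# Variance-type risk functional R = E[(L - mu)^2], Eq. (5).
variance_risk = losses.var()

# Value-at-Risk: smallest l with P(L <= l) >= alpha, Eq. (6).
alpha = 0.95
var_alpha = np.quantile(losses, alpha)

# Conditional Value-at-Risk: E[L | L >= VaR_alpha], Eq. (7).
cvar_alpha = losses[losses >= var_alpha].mean()

print(f"E[L] = {expected_loss:.3f}, Var = {variance_risk:.3f}, "
      f"VaR_0.95 = {var_alpha:.3f}, CVaR_0.95 = {cvar_alpha:.3f}")
```

As the convexity discussion above suggests, CVaR computed this way always dominates VaR, since it averages only over the tail beyond the VaR threshold.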
1.2 Approximation Spaces for Neural Networks
The complexity of a hypothesis space F is quantified by two standard measures: VC-dimension theory for discrete hypotheses (developed in Section 1.2.1 below) and the Rademacher complexity for continuous spaces:
R_N(F) = E_ϵ [ sup_{f∈F} (1/N) ∑_{i=1}^N ϵ_i f(x_i) ],   (14)
1.2.1 VC-dimension theory for discrete hypotheses
Literature Review: There are several articles that explore the VC-dimension theory for dis-
crete hypotheses very rigorously. N. Bousquet and S. Thomassé (2015) [18] explored in their paper
the VC-dimension in the context of graph theory, connecting it to structural properties such as the
Erdős-Pósa property. Yıldız and Alpaydin (2009) [19] in their article computed the VC-dimension
for decision tree hypothesis spaces, considering both discrete and continuous features. Zhang et.
al. (2012) [20] introduced a discretized VC-dimension to bridge real-valued and discrete hypothesis
spaces, offering new theoretical tools for complexity analysis. Riondato and Zdonik (2011) [21]
adapted VC-dimension theory to database systems, analyzing SQL query selectivity using a theo-
retical lens. Riggle and Sonderegger (2010) [22] investigated the VC-dimension in linguistic models,
focusing on grammar hypothesis spaces. Anderson (2023) [23] provided a comprehensive review
of VC-dimension in fuzzy systems, particularly in logic frameworks involving discrete structures.
Fox et al. (2021) [24] proved key conjectures for systems with bounded VC-dimension, offering
insights into combinatorial implications. Johnson (2021) [25] discusses binary representations and
VC-dimensions, with implications for discrete hypothesis modeling. Janzing (2018) [26] in his paper
focuses on hypothesis classes with low VC-dimension in causal inference frameworks. Hüllermeier
and Tehrani (2012) [27] in their paper explored the theoretical VC-dimension of Choquet integrals,
applied to discrete machine learning models. The book titled “Foundations of Machine Learning”
[28] by Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar offers a very good foundational
discussion on VC-dimension in the context of statistical learning. Another book titled “Learning
Theory: An Approximation Theory Viewpoint” by Felipe Cucker and Ding-Xuan Zhou [29] dis-
cusses the role of VC-dimension in approximation theory. Yet another book titled “Understanding
Machine Learning: From Theory to Algorithms” by Shai Shalev-Shwartz and Shai Ben-David [30]
contains detailed chapters on hypothesis spaces and VC-dimension.
For discrete hypotheses, the VC-dimension theory applies to a class of hypotheses that map a
set of input points to binary output labels (typically 0 or 1). The VC-dimension for a hypothesis
class refers to the largest set of data points that can be shattered by that class, where “shattering”
means that the hypothesis class can realize all possible labelings of these points.
We shall now discuss the Formal Mathematical Framework. Let X be a finite or infinite
set called the instance space, which represents the input space. Consider a hypothesis class H,
where each hypothesis h ∈ H is a function h : X → {0, 1}. The function h classifies each element
of X into one of two classes: 0 or 1. Given a subset S = {x1 , x2 , . . . , xk } ⊆ X, we say that H
shatters S if for every possible labeling ⃗y = (y1 , y2 , . . . , yk ) ∈ {0, 1}k , there exists some h ∈ H
such that for all i ∈ {1, 2, . . . , k}:
h(xi ) = yi (15)
In other words, a hypothesis class H shatters S if it can produce every possible binary labeling
on the set S. The VC-dimension VC(H) is defined as the size of the largest set S that can be
shattered by H:
VC(H) = sup{k | ∃S ⊆ X, |S| = k, S is shattered by H} (16)
If no set of points can be shattered, then the VC-dimension is 0. Two standard properties of the VC-dimension are monotonicity (if H₁ ⊆ H₂, then VC(H₁) ≤ VC(H₂)) and the finite-class bound (for finite H, VC(H) ≤ log₂ |H|, since shattering k points requires at least 2^k distinct hypotheses).
We shall now discuss the VC-Dimension and Generalization Bounds (VC Theorem). The VC generalization theorem (a uniform-convergence strengthening of Hoeffding's bound) provides a probabilistic guarantee on the relationship between the training error and the true error.
Specifically, it gives an upper bound on the probability that the generalization error exceeds the
empirical error (training error) by more than ϵ.
Let D be the distribution from which the training data is drawn, and let êrr(h) and err(h) represent the empirical error and true error of a hypothesis h ∈ H, respectively:
êrr(h) = (1/n) ∑_{i=1}^{n} 1{h(x_i) ≠ y_i}.   (18)
The VC theorem guarantees that, with probability at least 1 − δ over the draw of the sample, every h ∈ H simultaneously satisfies
|êrr(h) − err(h)| ≤ ϵ.   (20)
A sufficient sample size for this guarantee is n ≥ (C/ϵ²)(VC(H) + log(1/δ)), where C is a constant depending on the distribution. This bound emphasizes the importance of VC-dimension in controlling the complexity of the hypothesis class. A larger VC-dimension requires a larger sample size to avoid overfitting and ensure reliable generalization. Some detailed examples are:
• Linear classifiers in R²: hypotheses of the form h(x) = sign(⟨w, x⟩ + b), where w ∈ R² is the weight vector and b ∈ R is the bias term. The VC-dimension of linear classifiers in R² is 3. This can be rigorously shown by noting that any 3 points in general position in R² can be shattered by H: every possible binary labeling of the 3 points is achieved by some linear classifier. However, for 4 points in R² it is impossible to realize all binary labelings (e.g., the alternating labeling of the four vertices of a convex quadrilateral), meaning the VC-dimension is exactly 3.
• Polynomial classifiers of degree d in Rⁿ: hypotheses of the form h(x) = sign( ∑_{i₁+···+i_n ≤ d} α_{i₁,i₂,...,i_n} x₁^{i₁} ··· x_n^{i_n} ), where the α_{i₁,i₂,...,i_n} are coefficients and x = (x₁, x₂, ..., x_n). The VC-dimension of polynomial classifiers of degree d in Rⁿ grows as O(n^d), implying that the complexity of the hypothesis class increases rapidly with both the degree d and the dimension n of the input space.
Neural networks, depending on their architecture, can have very high VC-dimensions. In particular, the VC-dimension of a neural network with L layers, each containing N neurons, typically grows as O(N^L), i.e., polynomially in the width and exponentially in the depth. This result provides insight into the complexity of neural networks and their capacity to overfit data when the training sample size is insufficient.
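The shattering definition (15)-(16) can be probed computationally. The sketch below searches over random linear classifiers h(x) = sign(⟨w, x⟩ + b) for every binary labeling of two assumed point sets in R²; the random search is a heuristic stand-in for an exact feasibility test, so it certifies shattering when found but only suggests its failure.

```python
import numpy as np

def can_shatter(points, n_trials=200_000, seed=0):
    """Search for linear classifiers realizing every binary labeling of
    `points`. Random search can certify shattering but only suggests
    (does not prove) that shattering fails."""
    rng = np.random.default_rng(seed)
    k = len(points)
    realized = set()
    for _ in range(n_trials):
        w = rng.normal(size=2)
        b = rng.normal()
        realized.add(tuple((points @ w + b > 0).astype(int)))
        if len(realized) == 2 ** k:
            return True
    return False

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])              # general position
four = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])   # convex quadrilateral

print("3 points shattered:", can_shatter(three))  # True: VC-dim >= 3
print("4 points shattered:", can_shatter(four))   # False: the alternating labeling fails
```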
1.2.2 Rademacher complexity for continuous spaces
Literature Review: Among works analyzing Rademacher complexity in learning algorithms, Ma and Wang (2020) [38] investigated Rademacher complexity bounds in deep
residual networks. Bartlett and Mendelson (2002) [39] wrote a foundational paper on complexity
measures, providing fundamental theoretical insights into generalization bounds. Dzahini and Wild
(2024) [40] in their paper extended Rademacher-based complexity to stochastic optimization meth-
ods. McDonald and Shalizi (2011) [41] showed using sequential Rademacher complexities for I.I.D
process how to control the generalization error of time series models wherein past values of the
outcome are used to predict future values.
where ∥f∥∞ = ess sup_{x∈X} |f(x)| denotes the essential supremum. For rigor, F is assumed measurable in the sense that for every ϵ > 0, there exists a countable subset F_ϵ ⊆ F such that every f ∈ F lies within ϵ of F_ϵ in ∥·∥∞ (a pointwise-measurability condition that makes the suprema below measurable). Here E_σ denotes expectation over the Rademacher signs σ. The supremum can be interpreted as a functional dual norm in L∞(X, R), where F is the unit ball. Using the symmetrization technique, the Rademacher complexity controls the deviation of P_n[f] from D[f]:
E_S [ sup_{f∈F} ( D[f] − P_n[f] ) ] ≤ 2 R_n(F),
where
R_n(F) = E_S [ R̂_S(F) ].   (32)
This is derived by first symmetrizing the sample and then invoking Jensen’s inequality and the
independence of σ. There are complexity bounds that use covering numbers and metric entropy. Let ∥·∥∞ be the metric on F. The covering number N(ϵ, F, ∥·∥∞) is the smallest number of ∥·∥∞-balls of radius ϵ needed to cover F; its logarithm, the metric entropy, quantifies the effective size of F.
Regarding Dudley's entropy integral: for a bounded function class F (compact under ∥·∥∞),
R_n(F) ≤ inf_{α>0} [ 4α + (12/√n) ∫_α^∞ √( log N(ϵ, F, ∥·∥∞) ) dϵ ].   (34)
There is also a relation to Talagrand's concentration inequality, which provides tail bounds for the supremum of empirical processes:
P( sup_{f∈F} |P_n[f] − D[f]| > ϵ ) ≤ 2 exp( −nϵ² / (2∥f∥²∞) ),   (35)
reinforcing the link between Rn (F) and generalization. There are some Applications in Continuous
Function Classes. One example is the RKHS with Gaussian Kernel. For F as the unit ball of an
RKHS with kernel k(x, x′ ), the covering number satisfies:
log N(ϵ, F, ∥·∥∞) ∼ O(1/ϵ²),   (36)
yielding:
R_n(F) ∼ O(1/√n).   (37)
For F ⊆ H s (Rd ), the covering number depends on the smoothness s and dimension d:
R_n(F) ∼ O( n^{−s/d} ).   (38)
Rademacher complexity is deeply embedded in modern empirical process theory. Its intricate
relationship with measure-theoretic tools, symmetrization, and concentration inequalities provides
a robust theoretical foundation for understanding generalization in high-dimensional spaces.
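A minimal Monte Carlo sketch of the empirical Rademacher complexity (14) follows; both the finite class of threshold functions on [0, 1] and the fixed sample are illustrative assumptions chosen so that the supremum over F is a simple maximum.

```python
import numpy as np

rng = np.random.default_rng(1)

# Fixed sample S = {x_1, ..., x_n} on [0, 1] (illustrative).
n = 50
xs = rng.uniform(size=n)

# Finite class of thresholds f_t(x) = 1{x <= t}, a simple stand-in for F.
thresholds = np.linspace(0.0, 1.0, 101)
F = (xs[None, :] <= thresholds[:, None]).astype(float)  # shape (|F|, n)

# Monte Carlo estimate of E_sigma sup_{f in F} (1/n) sum_i sigma_i f(x_i).
n_mc = 5_000
total = 0.0
for _ in range(n_mc):
    sigma = rng.choice([-1.0, 1.0], size=n)
    total += np.max(F @ sigma) / n
print(f"empirical Rademacher complexity ~= {total / n_mc:.4f}")
```

The estimate decays on the order of √(log|F| / n) as the sample grows, consistent with the finite-class bounds discussed above.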
1.2.3 Sobolev Embeddings
Literature Review: The surrounding literature includes studies of function spaces relevant to ultraparabolic operators.
Sobolev embeddings of the form
W^{k,p}(X) ↪ C^m(X)   (39)
hold if k − d/p > m, ensuring f_θ ∈ C^∞(X) for smooth activations σ. For a function u ∈ L^p(Ω), its weak
derivative Dα u satisfies:
∫_Ω u(x) D^α ϕ(x) dx = (−1)^{|α|} ∫_Ω v(x) ϕ(x) dx   ∀ ϕ ∈ C_c^∞(Ω),   (40)
where v ∈ Lp (Ω) is the weak derivative. This definition extends the classical notion of differentiation
to functions that may not be pointwise differentiable. The Sobolev norm encapsulates both function
values and their derivatives:
∥u∥_{W^{k,p}(Ω)} = ( ∑_{|α|≤k} ∥D^α u∥^p_{L^p(Ω)} )^{1/p}.   (41)
Key properties:
• Semi-norm Dominance: The W k,p -norm is controlled by the seminorm |u|W k,p , ensuring
sensitivity to high-order derivatives.
Sobolev spaces W k,p (Ω) embed into Lq (Ω) or C m (Ω), depending on k, p, q, and n. These embeddings
govern the smoothness and integrability of u and its derivatives. There are several Advanced
Theorems on Sobolev Embeddings. They are as follows:
• If k > n/p, W k,p (Ω) ,→ C m,α (Ω) with m = ⌊k − n/p⌋ and α = k − n/p − m.
• If k = n/p, W k,p (Ω) ,→ Lq (Ω) for q < ∞.
• If k < n/p, W^{k,p}(Ω) ↪ L^q(Ω) where 1/q = 1/p − k/n.
The proof of the Sobolev embedding starts with a scaling analysis. Define u_λ(x) = u(λx). Then:
∥u_λ∥_{L^p(Ω)} = λ^{−n/p} ∥u∥_{L^p(λ^{−1}Ω)}.   (43)
For derivatives:
∥D^α u_λ∥_{L^p(Ω)} = λ^{|α|−n/p} ∥D^α u∥_{L^p(λ^{−1}Ω)}.   (44)
The scaling relation λ^{k−n/p} aligns with the Sobolev embedding condition k > n/p. For p = 2, Sobolev norms in Rⁿ are equivalent to decay rates of Fourier coefficients:
∥u∥_{W^{k,2}} ∼ ( ∫_{Rⁿ} |ξ|^{2k} |û(ξ)|² dξ )^{1/2}.   (45)
For k > n/p, Fourier decay implies uniform bounds, ensuring u ∈ C m,α . Interpolation spaces
bridge Lp and W k,p , providing finer embeddings. Duality: Sobolev embeddings are equivalent to
boundedness of adjoint operators in Lq . For −∆u = f , u ∈ W 2,p (Ω) ensures u ∈ C 0,α (Ω) if p > n/2.
Sobolev spaces govern variational problems in geometry, e.g., minimal surfaces and harmonic maps.
On Ω with fractal boundaries, trace theorems refine Sobolev embeddings.
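The scaling identities (43)-(44) can be checked numerically. The sketch below does so in one dimension for a Gaussian profile on R (so n = 1 and the domain rescaling is immaterial); the quadrature grid and test function are assumptions made for illustration.

```python
import numpy as np

# Check ||D^a u_lam||_{L^p} = lam^{|a| - n/p} ||D^a u||_{L^p} for
# u(x) = exp(-x^2) on R and u_lam(x) = u(lam * x).
p, lam = 2.0, 3.0
x = np.linspace(-20.0, 20.0, 400_001)
dx = x[1] - x[0]

def lp_norm(f):
    # Riemann-sum approximation of the L^p norm on the uniform grid.
    return (np.sum(np.abs(f) ** p) * dx) ** (1.0 / p)

u, du = np.exp(-x**2), -2 * x * np.exp(-x**2)
u_lam = np.exp(-(lam * x) ** 2)
du_lam = -2 * lam * (lam * x) * np.exp(-(lam * x) ** 2)  # chain rule

print(lp_norm(u_lam) / lp_norm(u), lam ** (0 - 1 / p))    # |alpha| = 0 case
print(lp_norm(du_lam) / lp_norm(du), lam ** (1 - 1 / p))  # |alpha| = 1 case
```

Both printed ratios match the predicted powers λ^{−1/p} and λ^{1−1/p} to quadrature accuracy.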
1.2.4 Rellich-Kondrachov Compactness Theorem
Literature Review: Lassoued (2026) [51] examined function spaces on the torus and their lack
of compactness, highlighting cases where the classical Rellich-Kondrachov result fails. He extended
compact embedding results to function spaces with periodic structures. He also discussed trace
theorems and regular function spaces in this new context. Chen et al. (2024) [52] extended the
Rellich-Kondrachov theorem to Hörmander vector fields, a class of differential operators that appear
in hypoelliptic PDEs. They established a degenerate compact embedding theorem, generalizing pre-
vious results in the field. They also provided applications to geometric inequalities, highlighting
the role of compact embeddings in PDE theory. Adams and Fournier (2003) [53] in their book
provided a complete proof of the Rellich-Kondrachov theorem, along with a discussion of compact
embeddings. They also covered function space theory, embedding theorems, and applications in
PDEs. Brezis (2010) [54] wrote a highly recommended resource for understanding Sobolev spaces
and their compactness properties. The book included applications to variational methods and weak
solutions of PDEs. Evans (2022) [55] in his classic PDE textbook includes a discussion of compact
Sobolev embeddings, their implications for weak convergence, and applications in variational meth-
ods. Maz’ya (2011) [56] provided a detailed treatment of Sobolev space theory, including compact
embedding theorems in various settings.
To rigorously state the theorem, we consider a bounded open domain Ω ⊂ Rn with a Lipschitz
boundary. For 1 ≤ p < n, the theorem asserts that the embedding
W^{1,p}(Ω) ↪ L^q(Ω)   (46)
is compact whenever q < np/(n − p). More precisely, this means that if {u_k} ⊂ W^{1,p}(Ω) is a bounded
sequence in the Sobolev norm, i.e., there exists a constant C > 0 such that
∥u_k∥_{W^{1,p}(Ω)} = ∥u_k∥_{L^p(Ω)} + ∥∇u_k∥_{L^p(Ω)} ≤ C,   (47)
then there exists a subsequence {u_{k_j}} and a function u ∈ L^q(Ω) such that
u_{k_j} ⇀ u weakly in W^{1,p}(Ω).   (48)
However, weak convergence alone does not imply compactness. To obtain strong convergence in L^q(Ω), we need additional arguments. This is accomplished using the Fréchet-Kolmogorov compactness criterion, which states that a bounded subset of L^q(Ω) is compact if and only if it is tight and uniformly equicontinuous. More formally, compactness follows if
1. The translates are uniformly controlled: sup_k ∥u_k(· + h) − u_k∥_{L^q} → 0 as |h| → 0.
2. The sequence u_k(x) does not escape to infinity in a way that prevents strong convergence.
To quantify this, we invoke the Sobolev-Poincaré inequality, which states that for p < n, there
exists a constant C such that
∥u − u_Ω∥_{L^q(Ω)} ≤ C ∥∇u∥_{L^p(Ω)},   u_Ω = (1/|Ω|) ∫_Ω u(x) dx.   (52)
Combining these estimates with the weak convergence of the subsequence, one obtains
∥u_k − u∥_{L^q(Ω)} → 0,   (55)
which establishes the strong convergence in Lq (Ω), completing the proof. The key insight is
that compactness arises because the gradients of uk provide control over the oscillations of uk ,
ensuring that the sequence cannot oscillate indefinitely without converging in norm. The crucial
role of Sobolev embeddings is to guarantee that even though W^{1,p}(Ω) does not embed compactly into itself, it does embed compactly into L^q(Ω) for q < np/(n − p). This embedding ensures that weak convergence in W^{1,p}(Ω) implies strong convergence in L^q(Ω), proving the theorem.
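As a small numerical companion to the proof, the sketch below verifies the Sobolev-Poincaré inequality (52) on Ω = (0, 1) with p = q = 2; the test function is an illustrative choice, and C = 1/π is the sharp one-dimensional Poincaré-Wirtinger constant.

```python
import numpy as np

# Verify ||u - u_Omega||_{L^2} <= C ||grad u||_{L^2} on Omega = (0, 1),
# with C = 1/pi and the test function u(x) = sin(2 pi x) (an assumption).
x = np.linspace(0.0, 1.0, 100_001)
dx = x[1] - x[0]

u = np.sin(2 * np.pi * x)
du = 2 * np.pi * np.cos(2 * np.pi * x)

u_mean = np.sum(u) * dx                            # u_Omega, since |Omega| = 1
lhs = np.sqrt(np.sum((u - u_mean) ** 2) * dx)      # ~ 1/sqrt(2)
rhs = (1 / np.pi) * np.sqrt(np.sum(du ** 2) * dx)  # ~ sqrt(2)

print(f"lhs = {lhs:.4f} <= rhs = {rhs:.4f}: {lhs <= rhs}")
```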
2 Universal Approximation Theorem: Refined Proof
The Universal Approximation Theorem (UAT) is a fundamental result in neural network theory,
stating that a feedforward neural network with a single hidden layer containing a finite number of
neurons can approximate any continuous function on a compact subset of Rn to any desired degree
of accuracy, provided that an appropriate activation function is used. This theorem has significant
implications in machine learning, function approximation, and deep learning architectures.
Literature Review: Hornik et al. (1989) [57] in their seminal paper rigorously proved that
multilayer feedforward neural networks with a single hidden layer and a sigmoid activation func-
tion can approximate any continuous function on a compact set. It extends prior results and lays
the foundation for the modern understanding of UAT. Cybenko (1989) [58] provided one of the first
rigorous proofs of the UAT using the sigmoid function as the activation function. He demon-
strated that a single hidden layer network can approximate any continuous function arbitrarily well.
Barron (1993) [59] extended UAT by quantifying the approximation error and analyzing the rate
of convergence. This work is crucial for understanding the practical efficiency of neural networks.
Pinkus (1999) [60] provided a comprehensive survey of UAT from the perspective of approximation
theory and also discussed conditions for approximation with different activation functions and the
theoretical limits of neural networks. Lu et al. (2017) [61] investigated how the width of neural
networks affects their approximation capability, challenging the notion that deeper networks are
always better. They also provided insights into trade-offs between depth and width. Hanin and
Sellke (2018) [62] extended UAT to ReLU activation functions, showing that deep ReLU networks
achieve universal approximation while maintaining minimal width constraints. García-Cervera et al. (2024) [63] extended the universal approximation theorem to set-valued functions and its ap-
plications to Deep Operator Networks (DeepONets), which are useful in control theory and PDE
modeling. Majee et al. (2024) [64] explored the universal approximation properties of deep neu-
ral networks for solving inverse problems using Markov Chain Monte Carlo (MCMC) techniques.
Toscano et al. (2024) [65] introduced Kurkova-Kolmogorov-Arnold Networks (KKANs), an ex-
tension of UAT incorporating Kolmogorov’s superposition theorem for improved approximation
capabilities. Son (2025) [66] established a new framework for operator learning based on the UAT,
providing a theoretical foundation for backpropagation-free deep networks.
2.1 Approximation via Convolution Operators
Let f : Rⁿ → R be a continuous function to be approximated, and let ϕ : Rⁿ → R be a convolution kernel. The kernel ϕ(x) is typically chosen to be smooth, compactly supported, and normalized such that
∫_{Rⁿ} ϕ(x) dx = 1.   (57)
To approximate f locally, we introduce a scaling parameter ϵ > 0 and define the scaled kernel ϕϵ (x)
as
ϕ_ϵ(x) = ϵ^{−n} ϕ(x/ϵ).   (58)
The factor ϵ−n ensures that ϕϵ (x) remains a probability density function, satisfying
∫_{Rⁿ} ϕ_ϵ(x) dx = ∫_{Rⁿ} ϕ(x) dx = 1.   (59)
The convolution of f with the scaled kernel ϕϵ is given by
(f ∗ ϕ_ϵ)(x) = ∫_{Rⁿ} f(y) ϕ_ϵ(x − y) dy.   (60)
Performing the change of variables z = (x − y)/ϵ, we have y = x − ϵz and dy = ϵⁿ dz. Substituting into the integral, we obtain
(f ∗ ϕ_ϵ)(x) = ∫_{Rⁿ} f(x − ϵz) ϕ(z) dz.   (61)
This representation shows that (f ∗ ϕϵ )(x) is a smoothed version of f (x), where the smoothing
is controlled by the parameter ϵ. As ϵ → 0, the kernel ϕϵ (x) becomes increasingly concentrated
around x, and we recover f (x) in the limit:
lim_{ϵ→0} (f ∗ ϕ_ϵ)(x) = f(x),   (62)
assuming f is continuous. This result can be rigorously proven using properties of the kernel
ϕ, such as its smoothness and compact support, and the dominated convergence theorem, which
ensures that the integral converges uniformly to f (x). Now, let us consider the role of convolution
operators in the approximation of f by neural networks. A single-layer feedforward neural network
is expressed as
f̂(x) = ∑_{i=1}^{M} c_i σ(w_iᵀ x + b_i),   (63)
where ci ∈ R are coefficients, wi ∈ Rn are weight vectors, bi ∈ R are biases, and σ : R → R is the
activation function. The activation function σ(wiT x + bi ) can be interpreted as a localized response
function, analogous to the kernel ϕ(x − y) in convolution. By drawing an analogy between the two,
we can write the neural network approximation as
f̂(x) ≈ ∑_{i=1}^{M} f(x_i) ϕ_ϵ(x − x_i) ∆x,   (64)
i.e., a Riemann-sum discretization of the convolution f ∗ ϕ_ϵ.
This result hinges on the ability of the activation function σ to generate a rich set of basis func-
tions. For example, if σ(x) = max(0, x) (ReLU), the network approximates f (x) by piecewise linear
functions. If σ(x) = 1/(1 + e^{−x}) (sigmoid), the network generates smooth approximations that resemble
logistic regression.
In this refined proof of the UAT, convolution operators provide a unifying framework for un-
derstanding the smoothing, localization, and discretization processes that underlie neural network
approximations. The interplay between ϕϵ (x), f ∗ ϕϵ (x), and fˆ(x) reveals the profound mathemat-
ical structure that connects classical approximation theory with modern machine learning. This
connection not only enhances our theoretical understanding of neural networks but also guides the
design of architectures and algorithms for practical applications.
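A one-dimensional sketch of the mollification argument (57)-(62) follows, using the standard bump-function kernel; the grid and the test function f(x) = |x| (continuous but not differentiable at 0) are illustrative assumptions.

```python
import numpy as np

def phi(z):
    """Standard bump-function mollifier: smooth, supported on [-1, 1]."""
    out = np.zeros_like(z)
    inside = np.abs(z) < 1
    out[inside] = np.exp(-1.0 / (1.0 - z[inside] ** 2))
    return out

x = np.linspace(-3.0, 3.0, 6001)
dx = x[1] - x[0]
f = np.abs(x)  # continuous target, not differentiable at 0

for eps in [1.0, 0.3, 0.1]:
    phi_eps = phi(x / eps) / eps             # scaled kernel, Eq. (58) with n = 1
    phi_eps /= np.sum(phi_eps) * dx          # normalize on the grid, Eq. (59)
    f_eps = np.convolve(f, phi_eps, mode="same") * dx  # (f * phi_eps)(x), Eq. (60)
    mask = np.abs(x) <= 2                    # avoid boundary truncation effects
    err = np.max(np.abs(f - f_eps)[mask])
    print(f"eps = {eps}: sup error on [-2, 2] = {err:.4f}")
```

The printed sup-norm error shrinks as ϵ → 0, exactly the uniform convergence asserted in (62).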
2.1.1 Stone-Weierstrass Application
Literature Review: Rudin (1976) [67] introduced the Weierstrass approximation theorem and
proved its generalization, the Stone-Weierstrass theorem. He also discussed the algebraic structure
of function spaces and how the theorem ensures the uniform approximation of continuous func-
tions by polynomials. He also presented examples and exercises related to compactness, uniform
convergence, and Banach algebra structures. Stein and Shakarchi (2005) [68] extended the Stone-
Weierstrass theorem into measure theory and functional analysis. They also proved the theorem in the context of Lebesgue integration, discussed how it applies to Hilbert spaces and orthogonal polynomials, and connected the theorem to Fourier analysis and spectral decomposition.
Conway (2019) [69] explored the Stone-Weierstrass theorem in the setting of Banach algebras and
C*-algebras. He also extended the theorem to non-commutative function algebras and discussed the
operator-theoretic implications of the theorem in Hilbert spaces. He also analyzed the theorem’s
application to spectral theory. Dieudonné (1981) [70] traced the historical development of func-
tional analysis, including the origins of the Stone-Weierstrass theorem and discussed contributions
by Karl Weierstrass and Marshall Stone. He also explored how the theorem influenced topological
vector spaces and operator theory and also included perspectives on the axiomatic development of
function approximation. Folland (1999) [71] discussed the Stone-Weierstrass theorem in depth with
applications to probability theory and ergodic theory and used the theorem to establish the density
of algebraic functions in measure spaces. He also connected the Stone-Weierstrass theorem to func-
tional approximation in Lp spaces. He also explored the interplay between the Stone-Weierstrass
theorem and the Hahn-Banach theorem. Sugiura (2024) [72] extended the Stone-Weierstrass theo-
rem to the study of reservoir computing in machine learning and proved that certain neural networks
can approximate functions uniformly under the assumptions of the theorem. He bridged classical
functional approximation with modern AI and deep learning. Liu et al. (2024) [73] investigated
the Stone-Weierstrass theorem in normed module settings and used category theory to general-
ize function approximation results. They also extended the theorem beyond real-valued functions
to structured mathematical objects. Martinez-Barreto (2025) [74] provided a modern formulation
of the theorem with rigorous proof and reviewed applications in operator algebras and topology.
He also discussed open problems related to function approximation. Chang and Wei (2024) [75]
used the Stone-Weierstrass theorem to derive new operator inequalities and applied the theorem to
functional analysis in quantum mechanics. Caballer et al. (2024) [76] investigated cases where the
Stone-Weierstrass theorem fails and provided counterexamples and refined conditions for uniform
approximation. Chen (2024) [77] extended the Stone-Weierstrass theorem to generalized function
spaces and introduced a new class of uniform topological algebras. Rafiei and Akbarzadeh-T (2024)
[78] used the Stone-Weierstrass theorem to analyze function approximation in fuzzy logic systems
and explored the applications in control systems and AI.
Consider the space C(X) of continuous real-valued functions on a compact Hausdorff space X, equipped with the supremum norm ∥f∥∞ = sup_{x∈X} |f(x)|. This supremum norm is critical in defining the proximity between continuous functions, as we seek
to approximate any function f ∈ C(X) by a function g from a subalgebra A ⊂ C(X). The Stone-
Weierstrass theorem guarantees that if the subalgebra A satisfies two essential properties—(1) it
contains the constant functions, and (2) it separates points—then the closure of A in the supremum
norm will be the entire space C(X). To formalize this, we define the point separation property
as follows: for every pair of distinct points x1 , x2 ∈ X, there exists a function h ∈ A such that
h(x1 ) ̸= h(x2 ). This condition ensures that functions from A are sufficiently “rich” to distinguish
between different points in X. Mathematically, this is expressed as:
∀ x₁, x₂ ∈ X with x₁ ≠ x₂, ∃ h ∈ A such that h(x₁) ≠ h(x₂).
Given these two properties, the Stone-Weierstrass theorem asserts that for any continuous function f ∈ C(X) and any ϵ > 0, there exists an element g ∈ A such that:
∥f − g∥∞ < ϵ.
This result ensures that any continuous function on a compact Hausdorff space can be approxi-
mated arbitrarily closely by functions from a sufficiently rich subalgebra. In the context of the
Universal Approximation Theorem (UAT), we seek to apply the Stone-Weierstrass theorem
to the approximation capabilities of neural networks. Let K ⊆ Rn be a compact subset, and let
f ∈ C(K) be a continuous function defined on this set. A feedforward neural network with a
non-linear activation function σ has the form:
f̂_θ(x) = ∑_{i=1}^{N} c_i σ(⟨w_i, x⟩ + b_i),   (70)
where ⟨w_i, x⟩ represents the inner product between the weight vector w_i and the input x, and b_i represents the bias term. The activation function σ is typically non-linear (such as the sigmoid or ReLU function), and the parameters θ = {c_i, w_i, b_i}_{i=1}^{N} are the output weights, inner weights, and biases of the network. The function f̂_θ(x) is a weighted sum of the non-linear activations applied to the affine transformations of x.
We now explore the connection between neural networks and the Stone-Weierstrass theorem. A
critical observation is that the set of functions defined by a neural network with non-linear activation
is a subalgebra of C(K) provided the activation function σ is sufficiently rich in its non-linearity.
This non-linearity ensures that the network can separate points in K, meaning that for any two
distinct points x1 , x2 ∈ K, there exists a network function fˆθ that takes distinct values at these
points. This satisfies the point separation condition required by the Stone-Weierstrass theorem.
To formalize this, consider two distinct points x₁, x₂ ∈ K. Since σ is non-linear, the function f̂_θ(x) with appropriately chosen weights and biases will satisfy:
f̂_θ(x₁) ≠ f̂_θ(x₂).   (71)
Thus, the algebra of neural network functions satisfies the point separation property. By applying
the Stone-Weierstrass theorem, we conclude that this algebra is dense in C(K), meaning that for
any continuous function f ∈ C(K) and any ϵ > 0, there exists a neural network function fˆθ such
that:
∥f − f̂_θ∥∞ = sup_{x∈K} |f(x) − f̂_θ(x)| < ϵ.   (72)
This rigorous result shows that neural networks with a non-linear activation function can approxi-
mate any continuous function on a compact set arbitrarily closely in the supremum norm, thereby
proving the Universal Approximation Theorem. To further explore this, consider the error term:
E(N) = inf_θ ∥f − f̂_θ∥∞.
For a given function f and a compact set K, this error term can be made arbitrarily small by
increasing the number of neurons in the hidden layer of the neural network. This increases the
capacity of the network, effectively enlarging the subalgebra of functions generated by the network,
thereby improving the approximation. As the number of neurons increases, the network’s ability to
approximate any function from C(K) becomes increasingly precise, which aligns with the conclusion
of the Stone-Weierstrass theorem that the network functions form a dense subalgebra in C(K).
Thus, the Universal Approximation Theorem, derived through the Stone-Weierstrass theorem,
rigorously proves that neural networks can approximate any continuous function on a compact set
to any desired degree of accuracy. The combination of the non-linearity of the activation function
and the architecture of the neural network guarantees that the network can generate a dense
subalgebra of continuous functions, ultimately allowing it to approximate any function from C(K).
This result not only formalizes the approximation power of neural networks but also provides a
deep theoretical foundation for understanding their capabilities as universal approximators.
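The density statement (72) can be illustrated empirically: the sketch below fits a one-hidden-layer network of the form (70) to a continuous target on K = [0, 1] and watches the sup-norm error shrink with the width N. Fixing random inner parameters and solving for the output weights by least squares is a training shortcut assumed here for simplicity; it is not part of the theorem.

```python
import numpy as np

rng = np.random.default_rng(0)

# Continuous target on the compact set K = [0, 1] (illustrative choice).
f = lambda x: np.sin(2 * np.pi * x) + 0.5 * np.cos(6 * np.pi * x)
x = np.linspace(0.0, 1.0, 2001)

def sup_error(n_neurons):
    """Fit f_hat(x) = sum_i c_i tanh(w_i x + b_i): random inner parameters,
    output weights c by least squares (an illustrative shortcut)."""
    w = rng.normal(scale=20.0, size=n_neurons)
    b = rng.uniform(-20.0, 20.0, size=n_neurons)
    design = np.tanh(np.outer(x, w) + b)
    c, *_ = np.linalg.lstsq(design, f(x), rcond=None)
    return np.max(np.abs(design @ c - f(x)))  # sup-norm error, cf. Eq. (72)

for n in [5, 20, 80, 320]:
    print(f"N = {n:>3}: ||f - f_hat||_inf ~= {sup_error(n):.4f}")
```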
The Kolmogorov-Arnold superposition theorem expresses any continuous multivariate function as
f(x₁, x₂, ..., x_n) = ∑_{q=0}^{2n} Φ_q ( ∑_{p=1}^{n} ψ_{pq}(x_p) ),   (74)
where the functions ψ_{pq}(x_p) encode the univariate projections of the input variables x_p, and the
outer functions Φq aggregate these projections into the final output. This decomposition highlights
a fundamental property of multivariate continuous functions: their expressiveness can be captured
through hierarchical compositions of simpler, univariate components.
Literature Review: There are some Classical References on the Kolmogorov-Arnold Superpo-
sition Theorem (KST). Kolmogorov (1957) [79] in his Foundational Paper on KST established that
any continuous function of several variables can be represented as a superposition of continuous
functions of a single variable and addition. This was groundbreaking because it provided a uni-
versal function decomposition method, independent of inner-product spaces. He proved that there
exist functions ϕq and ψq such that any function f (x1 , x2 , . . . , xn ) can be expressed as:
f(x₁, ..., x_n) = ∑_{q=1}^{2n+1} ϕ_q ( ∑_{p=1}^{n} ψ_{qp}(x_p) ),   (75)
where the ψqp are univariate functions. Kolmogorov provided a mathematical basis for approxima-
tion theory and neural networks, influencing modern machine learning architectures. Arnold (1963)
[80] refined Kolmogorov’s theorem by proving that one can restrict the superposition to functions
of at most two variables instead of one. Arnold’s formulation led to the Kolmogorov-Arnold
representation:
f(x₁, ..., x_n) = ∑_{q=1}^{2n+1} ϕ_q ( x_q + ∑_{p=1}^{n} ψ_{qp}(x_p) ),   (76)
making the theorem more suitable for practical computations. Arnold strengthened the expressiv-
ity of neural networks, inspiring alternative function representations in high-dimensional settings.
Lorentz (2008) [81] in his book discusses the significance of KST in approximation theory and con-
structive mathematics. He provided error estimates for approximating multivariate functions using
Kolmogorov-type decompositions. He showed how KST fits within Bernstein approximation the-
ory. He helped frame KST in the context of function approximation, bridging it to computational
applications. Pinkus (1999) [60] analyzed the role of KST in multilayer perceptrons (MLPs),
showing how it influences function expressibility in neural networks. He demonstrated that feed-
forward neural networks can approximate arbitrary functions using Kolmogorov superposition. He
also provided bounds on network depth and width required for universal approximation. He played
a crucial role in understanding the theoretical power of deep learning. There are several modern
Contributions in KST (2024–2025). Guilhoto and Perdikaris (2024) [82] explored how KST can
be reformulated using deep learning architectures. They proposed Kolmogorov-Arnold Networks
(KANs), a new type of neural network inspired by KST. They showed that KANs outperform
traditional feedforward networks in function approximation tasks. They also provided empirical
evidence of KAN efficiency in real-world datasets. They also introduced a new paradigm in ma-
chine learning, making function decomposition more interpretable. Alhafiz, M. R. et al. (2025) [83]
applied KST-based networks to turbulence modeling in fluid mechanics. They demonstrated how
KANs improve predictive accuracy for Navier-Stokes turbulence models. They showed a reduc-
tion in computational complexity compared to classical turbulence models. They also developed a
data-driven turbulence modeling framework leveraging KST. They advanced machine learning ap-
plications in computational fluid dynamics (CFD). Lorencin, I. et al. (2024) [84] used KST-inspired
neural networks for predicting propulsion system parameters in ships. They implemented KANs to
model hybrid ship propulsion (Combined Diesel-Electric and Gas - CODLAG) and demonstrated
a highly accurate prediction model for propulsion efficiency. They also provided a new benchmark
dataset for ship propulsion research. They extended KST applications to naval engineering & au-
tonomous systems.
In the context of neural networks, this result establishes the theoretical universality of function ap-
proximation. A neural network with a single hidden layer approximates a function f (x1 , x2 , . . . , xn )
by representing it as
f(x₁, x₂, ..., x_n) ≈ ∑_{i=1}^{W} a_i σ ( ∑_{j=1}^{n} w_{ij} x_j + b_i ),   (77)
where W is the width of the hidden layer, σ is a nonlinear activation function, wij are weights,
bi are biases, and ai are output weights. The expressive power of such shallow networks depends
critically on the width W , as the universal approximation theorem ensures that W → ∞ suffices
to approximate any continuous function arbitrarily well. However, for a fixed approximation error
ϵ > 0, the required width grows exponentially with the input dimension n, satisfying a lower bound
of
W ≥ C · ϵ^{−n},   (78)
where C depends on the function’s Lipschitz constant. This exponential dependence, sometimes
called the “curse of dimensionality,” underscores the inefficiency of shallow architectures in captur-
ing high-dimensional dependencies.
The advantage of depth becomes apparent when we consider deep neural networks, which uti-
lize hierarchical representations. A deep network with D layers and width W per layer constructs
a function as a composition of layer-wise transformations:
h^{(k)} = σ( W^{(k)} h^{(k−1)} + b^{(k)} ),   k = 1, ..., D,   (79)
where h^{(k)} denotes the output of the k-th layer, W^{(k)} is the weight matrix, b^{(k)} is the bias vector, and σ is the nonlinear activation. The final output of the network is then given by
f(x) = h^{(D)},  with  h^{(0)} = x.   (80)
The depth D of the network allows it to approximate hierarchical compositions of functions. For
example, if a target function f(x) has a compositional structure
f(x) = g_D(g_{D−1}(··· g_1(x) ···)),   (81)
where each g_i is a simple function, the depth D directly corresponds to the number of nested
transformations. This compositional hierarchy enables deep networks to approximate functions
efficiently, achieving a reduction in the required parameter count. The approximation error ϵ for a
deep network decreases polynomially with D, satisfying
ϵ ≤ O(1/D²),   (82)
which is exponentially more efficient than the error scaling for shallow networks. In light of the
Kolmogorov-Arnold theorem, the decomposition
f(x₁, x₂, ..., x_n) = ∑_{q=0}^{2n} Φ_q ( ∑_{p=1}^{n} ψ_{pq}(x_p) )   (83)
demonstrates how deep networks align naturally with the structure of multivariate functions. The
inner functions ψpq capture local dependencies, while the outer functions Φq aggregate these into a
global representation. This layered decomposition mirrors the depth-based structure of neural net-
works, where each layer learns a specific aspect of the function’s complexity. Finally, the parameter
count in a deep network with D layers and width W per layer is given by
P ≤ O(D · W²).   (84)
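The contrast between the bounds (78), (82), and (84) can be tabulated directly. In the sketch below all constants are set to illustrative values, since the true constants depend on the target function; the numbers are meant only to show the orders of magnitude at play.

```python
import math

# Compare the scaling laws (78), (82), (84) for input dimension n and
# target error eps; all constants are assumed to be 1 for illustration.
n, eps = 8, 0.01

shallow_width = eps ** (-n)          # W >= C * eps^{-n}, Eq. (78)
shallow_params = n * shallow_width   # ~ n input weights per neuron

depth = math.ceil(eps ** -0.5)       # eps <= O(1/D^2)  =>  D ~ eps^{-1/2}, Eq. (82)
width = 100                          # fixed per-layer width (assumption)
deep_params = depth * width ** 2     # P <= O(D * W^2), Eq. (84)

print(f"shallow: ~{shallow_params:.1e} parameters")
print(f"deep:    ~{deep_params:.1e} parameters (D = {depth}, W = {width})")
```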
Literature Review: A number of recent works connect Fourier analysis to representational capacity, including models that represent data efficiently while maintaining interpretability. Liu et al. (2024) [217] extended Graph Convo-
lutional Networks (GCNs) by integrating Fourier analysis and spectral wavelets to improve graph
expressivity. It bridges the gap between frequency-domain analysis and graph embeddings, making
GCNs more effective for complex data structures. Vlasic (2024) [218] presented a Fourier series-
inspired feature mapping technique to encode classical data into quantum circuits. It demonstrates
how Fourier coefficients can enhance the representational capacity of quantum models, leading
to better compression and generalization. Kim et al. (2024) [219] introduced Neural Fourier
Modelling (NFM), a novel approach to representing time-series data compactly while preserving
its expressivity. It outperforms traditional models like Short-Time Fourier Transform (STFT) in
retaining long-term dependencies. Xie et al. (2024) [220] explored how Fourier basis functions
can be used to enhance the expressivity of tensor networks while maintaining computational ef-
ficiency. It establishes trade-offs between expressivity and model complexity in machine learning
architectures. Liu et al. (2024) [221] integrated spectral modulation and Fourier transforms into
implicit neural representations for text-to-image synthesis. Fourier analysis improves global coher-
ence while preserving local expressivity in generative models. Zhang (2024) [222] demonstrated how
Fourier and Lock-in spectrum techniques can represent long-term variations in mechanical signals.
The Fourier-based decomposition allows for more expressive representations of mechanical failures
and degradation. Hamed and Lachiri (2024) [223] applied Fourier transformations to speech syn-
thesis models, improving their ability to transfer expressive content from text to speech. Fourier
series allows capturing prosody, rhythm, and tone variations effectively. Lehmann et al. (2024)
[224] integrated Fourier-based deep learning models for seismic activity prediction. It explores the
expressivity of Fourier Neural Operators (FNOs) in capturing wave propagations in different geo-
logical environments.
The Fourier analysis of expressivity in neural networks seeks to rigorously quantify how neural
architectures, characterized by their depth and width, can approximate functions through the de-
composition of those functions into their Fourier spectra. Consider a square-integrable function
f : Rd → R, for which the Fourier transform is defined as
f̂(ξ) = ∫_{R^d} f(x) e^{−i2πξ·x} dx   (86)
where ξ ∈ Rd represents the frequency. The inverse Fourier transform reconstructs the function as
f(x) = ∫_{R^d} f̂(ξ) e^{i2πξ·x} dξ   (87)
The magnitude |fˆ(ξ)| reflects the energy contribution of the frequency ξ to f . Neural networks
approximate f by capturing its Fourier spectrum, but the architecture fundamentally governs how
efficiently this approximation can be achieved, especially in the presence of high-frequency compo-
nents.
For shallow networks with one hidden layer and a finite number of neurons, the universal ap-
proximation theorem establishes that
f(x) ≈ ∑_{i=1}^{n} a_i ϕ(w_i · x + b_i)   (88)
where ϕ is the activation function, wi ∈ Rd are weights, bi ∈ R are biases, and ai ∈ R are
coefficients. The Fourier transform of this representation can be expressed as
f̂(ξ) ≈ ∑_{i=1}^{n} a_i ϕ̂(ξ) e^{−i2πξ·b_i}   (89)
where ϕ̂(ξ) denotes the Fourier transform of the activation function. For smooth activation func-
tions like sigmoid or tanh, ϕ̂(ξ) decays exponentially as ∥ξ∥ → ∞, limiting the network’s ability
to approximate functions with high-frequency content unless the width n is exceedingly large.
Specifically, the Fourier coefficients decay as
|f̂(ξ)| = O( e^{−β∥ξ∥} ),   (90)
where β > 0 depends on the smoothness of ϕ. This restriction implies that shallow networks
are biased toward low-frequency functions unless their width scales exponentially with the input
dimension d. Deep networks, on the other hand, leverage their hierarchical structure to overcome
these limitations. A deep network with L layers recursively composes functions, producing an
output of the form
f(x) = ϕ_L( W^{(L)} ϕ_{L−1}( ··· ϕ_1( W^{(1)} x + b^{(1)} ) ··· ) + b^{(L)} ),   (91)
where ϕ_l is the activation function at layer l, W^{(l)} are weight matrices, and b^{(l)} are bias vectors.
The Fourier transform of this composition can be analyzed iteratively. If h(l) = ϕl (W(l) h(l−1) +b(l) )
represents the output of the l-th layer, then
ĥ^{(l)}(ξ) = ϕ̂_l(ξ) ∗ [W^{(l)} h^{(l−1)}]^∧(ξ),   (92)
where ∗ denotes convolution and ϕ̂l is the Fourier transform of the activation function. The recursive
application of this convolution amplifies high-frequency components, enabling deep networks to
approximate functions whose Fourier spectra exhibit polynomial decay. Specifically, the Fourier
coefficients of a deep network decay as
|f̂(ξ)| = O( ∥ξ∥^{−α} ),   (93)
where α depends on the activation function. This is in stark contrast to the exponential decay
observed in shallow networks.
The activation function plays a pivotal role in shaping the Fourier spectrum of neural networks. For
example, the rectified linear unit (ReLU) ϕ(x) = max(0, x) introduces significant high-frequency
components into the network. The Fourier transform of the ReLU activation is given by
ϕ̂(ξ) = 1/(2πiξ)    (94)
which decays more slowly than the Fourier transforms of smooth activations. Consequently, ReLU-
based networks are particularly effective at approximating functions with oscillatory behavior. To
illustrate, consider the function
f (x) = sin(2πξ · x) (95)
A shallow network requires an exponentially large number of neurons to approximate f when ∥ξ∥ is
large, but a deep network can achieve the same approximation with polynomially fewer parameters
by leveraging its hierarchical structure. The expressivity of deep networks can be further quantified
by considering their ability to approximate bandlimited functions, i.e., functions f whose Fourier
spectra are supported on ∥ξ∥ ≤ ωmax . For a shallow network with width n, the required number
of neurons scales as
n ∼ (ω_max)^d    (96)
where d is the input dimension. In contrast, for a deep network with depth L, the width scales as
n ∼ (ω_max)^{d/L},    (97)
reflecting the exponential efficiency of depth in distributing the approximation of frequency components across layers. For example, if f(x) = cos(2πξ · x) with ∥ξ∥ = ω_max, a deep network requires
significantly fewer parameters than a shallow network to approximate f to the same accuracy.
In summary, the Fourier analysis of expressivity rigorously demonstrates the superiority of deep
networks over shallow ones in approximating complex functions. Depth introduces a hierarchi-
cal compositional structure that enables the efficient representation of high-frequency components,
while width provides a rich basis for approximating the function’s Fourier spectrum. Together,
these properties explain the remarkable capacity of deep neural networks to approximate functions
with intricate spectral structures, offering a mathematically rigorous foundation for understanding
their expressivity.
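As a rough numerical illustration of this spectral picture (not a result from the text), one can sample a random one-hidden-layer network as in eq. (88) and compare the empirical decay of its Fourier spectrum under a smooth activation versus ReLU; the widths, grid, and random seed below are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def shallow_net(x, act, n_hidden=200):
    """Random one-hidden-layer network a^T act(w x + b), cf. eq. (88)."""
    w = rng.normal(size=n_hidden)
    b = rng.normal(size=n_hidden)
    a = rng.normal(size=n_hidden) / n_hidden
    return act(np.outer(x, w) + b) @ a

x = np.linspace(-8, 8, 4096, endpoint=False)
for name, act in [("tanh", np.tanh), ("relu", lambda z: np.maximum(0.0, z))]:
    f = shallow_net(x, act)
    spec = np.abs(np.fft.rfft(f - f.mean()))
    # Ratios of mid/high-frequency to low-frequency magnitude: the smooth
    # activation's spectrum should fall off much faster (cf. eqs. 89 and 94).
    print(name, spec[50] / spec[5], spec[400] / spec[5])
```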
The training dynamics of a deep network can be analyzed through gradient flow, the continuous-time limit of gradient descent, under which the evolution of the parameters θ(t) is expressed as
dθ(t)/dt = −∇θ L(θ(t)).    (98)
The loss function, typically of the form
L(θ) = (1/2n) Σ_{i=1}^{n} ∥f(x_i; θ) − y_i∥²,    (99)
measures the discrepancy between the network’s predicted outputs f (xi ; θ) and the true labels yi .
At stationary points of the flow, the condition
∇θ L(θ∗ ) = 0 (100)
holds, indicating that the gradient vanishes. To classify these stationary points, the Hessian matrix H = ∇²_θ L(θ) is examined. For eigenvalues {λ_i} of H, the nature of the stationary point is determined: λ_i > 0 for all i corresponds to a local minimum, λ_i < 0 for all i to a local maximum, and mixed signs indicate a saddle point. Under gradient flow dθ(t)/dt = −∇θ L(θ(t)), the trajectory converges to critical points:
lim_{t→∞} ∥∇θ L(θ(t))∥ = 0.    (101)
The gradient flow also governs the temporal evolution of the network’s predictions f (x; θ(t)). A
Taylor series expansion of f (x; θ) about an initial parameter θ0 gives:
f(x; θ) = f(x; θ₀) + J_f(x; θ₀)(θ − θ₀) + (1/2)(θ − θ₀)ᵀ H_f(x; θ₀)(θ − θ₀) + O(∥θ − θ₀∥³),    (102)
where Jf (x; θ0 ) = ∇θ f (x; θ0 ) is the Jacobian and Hf (x; θ0 ) is the Hessian of f (x; θ) with respect to
θ. In the NTK (neural tangent kernel) regime, higher-order terms are negligible due to the large parameterization of the network, and the linear approximation suffices:
f(x; θ) ≈ f(x; θ₀) + J_f(x; θ₀)(θ − θ₀).    (103)
Under gradient flow, the time derivative of the network’s predictions is given by:
df(x; θ(t))/dt = J_f(x; θ(t)) dθ(t)/dt.    (104)
Substituting the parameter dynamics dθ(t)/dt = −∇θ L(θ(t)) = −Σ_{i=1}^{n} (f(x_i; θ(t)) − y_i) J_f(x_i; θ(t)), this becomes:
df(x; θ(t))/dt = −Σ_{i=1}^{n} J_f(x; θ(t)) J_f(x_i; θ(t))ᵀ (f(x_i; θ(t)) − y_i).    (105)
Defining the NTK as K(x, x′ ; θ) = Jf (x; θ)Jf (x′ ; θ)⊤ , and assuming constancy of the NTK during
training (K(x, x′ ; θ) ≈ K0 (x, x′ )), the evolution equation simplifies to:
df(x; θ(t))/dt = −Σ_{i=1}^{n} K₀(x, x_i)(f(x_i; θ(t)) − y_i).    (106)
Rewriting in matrix form, let f (t) = [f (x1 ; θ(t)), . . . , f (xn ; θ(t))]⊤ and y = [y1 , . . . , yn ]⊤ . The NTK
matrix K0 ∈ Rn×n evaluated at initialization defines the system:
df(t)/dt = −K₀(f(t) − y).    (107)
The solution to this linear system is:
f(t) = y + e^{−K₀ t}(f(0) − y).    (108)
As t → ∞, the predictions converge to the labels: f(t) → y, implying zero training error. The
eigenvalues of K0 determine the rates of convergence. Diagonalizing K0 as K0 = QΛQ⊤ , where Q
is orthogonal and Λ = diag(λ1 , . . . , λn ), the dynamics in the eigenbasis are:
df̃(t)/dt = −Λ(f̃(t) − ỹ),    (109)
with f̃(t) = Qᵀ f(t) and ỹ = Qᵀ y. Solving, we obtain:
f̃_i(t) = ỹ_i + e^{−λ_i t}(f̃_i(0) − ỹ_i).    (110)
Each mode decays exponentially with a rate proportional to the eigenvalue λ_i. Modes with larger λ_i converge faster, while smaller eigenvalues slow convergence.
The NTK framework thus rigorously explains the linearization of training dynamics in overparam-
eterized neural networks. This linear behavior ensures that the optimization trajectory remains
within a convex region of the parameter space, leading to both convergence and generalization. By
leveraging the constancy of the NTK, the complexity of nonlinear neural networks is reduced to an
analytically tractable framework that aligns closely with empirical observations.
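The diagonalized dynamics of eqs. (107)–(110) can be reproduced directly in numpy; the following sketch builds a synthetic positive-definite NTK matrix (an arbitrary assumption, not a kernel computed from a real network) and confirms that the residual decays mode by mode.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic positive-definite NTK matrix K0 and targets y; solve
# df/dt = -K0 (f - y) in the eigenbasis, as in eqs. (107)-(110).
n = 8
A = rng.normal(size=(n, n))
K0 = A @ A.T + 0.1 * np.eye(n)          # PSD kernel "at initialization"
y = rng.normal(size=n)                  # labels
f0 = np.zeros(n)                        # predictions at t = 0

lam, Q = np.linalg.eigh(K0)             # K0 = Q diag(lam) Q^T

def f_at(t):
    decay = np.exp(-lam * t)            # per-mode rate set by each eigenvalue
    return y + Q @ (decay * (Q.T @ (f0 - y)))

for t in [0.0, 0.1, 1.0, 10.0]:
    print(t, np.linalg.norm(f_at(t) - y))   # residual shrinks monotonically
```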
The behavior of the loss function around a specific parameter value θ0 can be rigorously analyzed
using a second-order Taylor expansion. This expansion is given by:
L(θ) = L(θ₀) + (θ − θ₀)ᵀ ∇θ L(θ₀) + (1/2)(θ − θ₀)ᵀ H(θ₀)(θ − θ₀) + O(∥θ − θ₀∥³).    (112)
Here, the term (θ − θ₀)ᵀ ∇θ L(θ₀) represents the linear variation of the loss, while the quadratic term (1/2)(θ − θ₀)ᵀ H(θ₀)(θ − θ₀) describes the curvature effects. The eigenvalues of H(θ₀) dictate the nature of the critical point θ₀. Specifically, if all λ_i > 0, θ₀ is a local minimum; if all λ_i < 0, it is a local maximum; and if the eigenvalues have mixed signs, θ₀ is a saddle point. The leading-order approximation to the change in the loss function, ΔL ≈ (1/2) δθᵀ H(θ₀) δθ, highlights the dependence on the eigenstructure of H(θ₀). In the context of gradient descent, parameter updates follow the
iterative scheme:
θ(t+1) = θ(t) − η∇θ L(θ(t) ), (113)
where η is the learning rate. Substituting the Taylor expansion of ∇θ L(θ^{(t)}) around θ₀ gives:
θ^{(t+1)} = θ^{(t)} − η [∇θ L(θ₀) + H(θ₀)(θ^{(t)} − θ₀)] + O(∥θ^{(t)} − θ₀∥²).    (114)
To analyze this update rigorously, we project θ^{(t)} − θ₀ onto the eigenbasis of H(θ₀), expressing it as:
as:
θ^{(t)} − θ₀ = Σ_{i=1}^{d} c_i^{(t)} v_i,    (115)
where c_i^{(t)} = v_iᵀ(θ^{(t)} − θ₀). Substituting this expansion into the gradient descent update rule yields:
c_i^{(t+1)} = c_i^{(t)} − η [v_iᵀ ∇θ L(θ₀) + λ_i c_i^{(t)}].    (116)
The convergence of this iterative scheme is governed by the condition |1−ηλi | < 1, which constrains
the learning rate η relative to the spectrum of H(θ0 ). For eigenvalues λi with large magnitudes,
excessively large learning rates η can cause oscillatory or divergent updates.
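A minimal sketch of this stability condition, assuming a purely quadratic loss so that the eigenbasis recursion of eq. (116) is exact (and taking ∇θ L(θ₀) = 0 for simplicity), follows; the eigenvalues and learning rates are illustrative assumptions.

```python
import numpy as np

# Gradient descent on L(c) = 0.5 * sum(lam_i * c_i^2): each eigencoordinate
# obeys c <- (1 - eta*lam) * c, stable iff |1 - eta*lam| < 1, i.e. eta*lam < 2.
lam = np.array([0.1, 1.0, 10.0])        # Hessian eigenvalues

eta = 0.19                              # eta * 10.0 = 1.9 < 2 -> stable
c = np.ones(3)
for _ in range(100):
    c = (1 - eta * lam) * c
print(c)                                # slowest decay along lam = 0.1

eta = 0.21                              # eta * 10.0 = 2.1 > 2 -> divergence
c = np.ones(3)
for _ in range(100):
    c = (1 - eta * lam) * c
print(c)                                # coordinate for lam = 10 blows up
```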
In the Neural Tangent Kernel (NTK) regime, the evolution of a neural network during train-
ing can be approximated by a linearization of the network output around the initialization. Let
fθ(x) denote the output of the network for input x. Linearizing fθ(x) around θ₀ gives:
fθ(x) ≈ fθ₀(x) + ∇θ fθ₀(x)ᵀ (θ − θ₀).    (117)
The spectral properties of the Hessian play a pivotal role in the generalization properties of neural
networks. Empirical studies reveal that the eigenvalue spectrum of H(θ) often exhibits a “bulk-and-spike” structure, with a dense bulk of eigenvalues near zero and a few large outliers. The bulk
corresponds to flat directions in the loss landscape, which contribute to the robustness and gener-
alization of the model, while the spikes represent sharp directions associated with overfitting. This
spectral structure can be analyzed using random matrix theory, where the density of eigenvalues
ρ(λ) is modeled by distributions such as the Marchenko-Pastur law:
ρ(λ) = (1/(2πλq)) √((λ₊ − λ)(λ − λ₋)),    (121)
where λ± = (1 ± √q)² are the spectral bounds and q = d/n is the ratio of the number of parameters to
the number of data points. This rigorous analysis links the Hessian structure to both the optimiza-
tion dynamics and the generalization performance of neural networks, providing a comprehensive
mathematical understanding of the training process. For a squared-error loss, the Hessian H(θ) satisfies the Gauss-Newton decomposition:
H(θ) = (1/n) Σ_{i=1}^{n} [ J_f(x_i; θ)ᵀ J_f(x_i; θ) + (f(x_i; θ) − y_i) ∇²_θ f(x_i; θ) ].    (122)
For overparameterized networks, H(θ) is nearly degenerate, implying the existence of flat minima.
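A quick numerical sanity check of the Marchenko-Pastur law in eq. (121) can be sketched with numpy, using a Gaussian random matrix as a stand-in for the Hessian bulk; the dimensions and tolerance below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Empirical eigenvalue support of a Wishart matrix X X^T (X with i.i.d.
# N(0, 1/n) entries) versus the Marchenko-Pastur bounds of eq. (121).
d, n = 500, 2000                        # "parameters" vs. "data points"
q = d / n
X = rng.normal(size=(d, n)) / np.sqrt(n)
eigs = np.linalg.eigvalsh(X @ X.T)

lam_minus, lam_plus = (1 - np.sqrt(q))**2, (1 + np.sqrt(q))**2
print("empirical support:", eigs.min(), eigs.max())
print("MP bounds:        ", lam_minus, lam_plus)
print("fraction inside:  ", np.mean((eigs > lam_minus - 0.05) &
                                    (eigs < lam_plus + 0.05)))
```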
The goal of training is to minimize a loss function L(θ), defined over a dataset {(x_i, y_i)}_{i=1}^{n}, where x_i ∈ ℝ^d and y_i ∈ ℝ^m represent the input-target pairs. The evolution of the parameters during training is governed by the gradient flow equation dθ/dt = −∇θ L(θ), where ∇θ L(θ) is the gradient of the loss
function with respect to the parameters. To analyze the dynamics of the network outputs, we first
consider the time derivative of fθ (x). Using the chain rule, this is expressed as:
∂fθ(x)/∂t = ∇θ fθ(x)ᵀ dθ/dt.    (123)
Substituting dθ/dt = −∇θ L(θ), we have:
∂fθ(x)/∂t = −∇θ fθ(x)ᵀ ∇θ L(θ).    (124)
The gradient of the loss function, L(θ), can be expressed explicitly in terms of the training data.
For a generic loss function over the dataset, this takes the form:
L(θ) = (1/n) Σ_{i=1}^{n} ℓ(fθ(x_i), y_i),    (125)
where ℓ(fθ (xi ), yi ) represents the loss for the i-th data point. The gradient of the loss with respect
to the parameters is therefore given by:
∇θ L(θ) = (1/n) Σ_{i=1}^{n} ∇θ fθ(x_i) ∇_{fθ(x_i)} ℓ(fθ(x_i), y_i).    (126)
To introduce the Neural Tangent Kernel (NTK), we define it as the Gram matrix of the Jacobians
of the network output with respect to the parameters:
Θ(x, x′ ; θ) = ∇θ fθ (x)⊤ ∇θ fθ (x′ ). (128)
Using this definition, the time evolution of the output becomes:
∂fθ(x)/∂t = −(1/n) Σ_{i=1}^{n} Θ(x, x_i; θ) ∇_{fθ(x_i)} ℓ(fθ(x_i), y_i).    (129)
In the overparameterized regime, where the number of parameters p is significantly larger than the
number of training data points n, it has been empirically and theoretically observed that the NTK
Θ(x, x′ ; θ) remains nearly constant during training. Specifically, Θ(x, x′ ; θ) ≈ Θ(x, x′ ; θ0 ), where
θ0 represents the parameters at initialization. This constancy significantly simplifies the analysis
of the network’s training dynamics. To see this, consider the solution to the differential equation
governing the output dynamics. Let F (t) ∈ Rn×m represent the matrix of network outputs for all
training inputs, where the i-th row corresponds to fθ (xi ). The dynamics can be expressed in matrix
form as:
∂F(t)/∂t = −(1/n) Θ(θ₀) ∇_F L(F),    (130)
where Θ(θ0 ) ∈ Rn×n is the NTK matrix evaluated at initialization, and ∇F L(F ) is the gradient of
the loss with respect to the output matrix F . For the special case of a mean squared error loss,
L(F) = (1/(2n)) ∥F − Y∥²_F, where Y ∈ ℝ^{n×m} is the matrix of target outputs, the gradient simplifies to:
∇_F L(F) = (1/n)(F − Y).    (131)
Substituting this into the dynamics, we obtain:
∂F(t)/∂t = −(1/n²) Θ(θ₀)(F(t) − Y).    (132)
The solution to this differential equation is:
F(t) = Y + e^{−Θ(θ₀) t / n²}(F(0) − Y),    (133)
where F (0) represents the initial outputs of the network. As t → ∞, the exponential term vanishes,
and the network outputs converge to the targets Y , provided that Θ(θ0 ) is positive definite. The rate
of convergence is determined by the eigenvalues of Θ(θ0 ), with smaller eigenvalues corresponding
to slower convergence along the associated eigenvectors. To understand the stationary points of
this system, we note that these occur when ∂F(t)/∂t = 0. From the dynamics, this implies:
Θ(θ₀)(F − Y) = 0.    (134)
If Θ(θ₀) is invertible, this yields F = Y, indicating that the network exactly interpolates the training data at the stationary point. However, if Θ(θ₀) is not full-rank, the stationary points form a
subspace of solutions satisfying (I − Π)(F − Y ) = 0, where Π is the projection operator onto the
column space of Θ(θ0 ).
The NTK framework provides a mathematically rigorous lens to analyze training dynamics, eluci-
dating the interplay between parameter evolution, kernel properties, and loss convergence in neural
networks. By linearizing the training dynamics through the NTK, we achieve a deep understand-
ing of how overparameterized networks evolve under gradient flow and how they reach stationary
points, revealing their capacity to interpolate data with remarkable precision.
This linear approximation transforms the nonlinear dynamics of f into a simpler, linearized form. To analyze training, we introduce the Jacobian matrix J ∈ ℝ^{N×P}, where J_{ij} = ∂f(x_i; θ₀)/∂θ_j. The vector of outputs f(t) ∈ ℝ^N, aggregating predictions over the dataset, evolves under gradient descent through the kernel Θ = JJᵀ ∈ ℝ^{N×N}. As P → ∞, the NTK converges to a deterministic matrix that remains nearly constant during
training. Substituting the linearized form of f (t) into the gradient descent update equation gives
f(t + 1) = f(t) − (η/N) Θ (f(t) − y),    (139)
where y ∈ ℝ^N is the vector of true labels. Defining the residual r(t) = f(t) − y, the dynamics of training reduce to
r(t + 1) = (I − (η/N) Θ) r(t).    (140)
The eigendecomposition Θ = QΛQ⊤ , with orthogonal Q and diagonal Λ = diag(λ1 , . . . , λN ), allows
us to analyze the decay of residuals in the eigenbasis of Θ:
r̃(t + 1) = (I − (η/N) Λ) r̃(t),    (141)
where r̃(t) = Q⊤ r(t). Each component decays as
r̃_i(t) = (1 − ηλ_i/N)^t r̃_i(0).    (142)
For small η, the training dynamics are approximately continuous, governed by
dr(t)/dt = −(1/N) Θ r(t),    (143)
leading to the solution
r(t) = exp(−Θt/N) r(0).    (144)
The NTK for specific architectures, such as fully connected ReLU networks, can be derived using
layerwise covariance matrices. Let Σ(l) (x, x′ ) denote the covariance between pre-activations at layer
l. The recurrence relation for Σ(l) is
Σ^{(l)}(x, x′) = (1/2π) ∥z^{(l−1)}(x)∥ ∥z^{(l−1)}(x′)∥ (sin θ + (π − θ) cos θ),    (145)
where θ = cos⁻¹( Σ^{(l−1)}(x, x′) / √( Σ^{(l−1)}(x, x) Σ^{(l−1)}(x′, x′) ) ). The NTK, a sum over contributions from all layers,
quantifies how parameter updates propagate through the network.
In the infinite-width limit, the NTK framework predicts generalization properties, as the kernel
matrix Θ governs both training and test-time behavior. The NTK connects neural networks to
classical kernel methods, offering a bridge between deep learning and well-established theoretical
tools in approximation theory. This regime’s deterministic and analytical tractability enables pre-
cise characterizations of network performance, convergence rates, and robustness to initialization
and learning rate variations.
Earlier work explored exponential and Gibbs priors for learning, improving PAC-Bayesian bounds for supervised classification. Germain et al. (2009) [94] applied PAC-Bayes theory to linear classifiers, including SVMs
and logistic regression. They demonstrated that PAC-Bayesian generalization bounds are tighter
than classical Vapnik-Chervonenkis (VC) dimension bounds. Seeger (2002) [95] extended PAC-
Bayes bounds to Gaussian Process models, proving tight generalization guarantees for Bayesian
classifiers. He laid the groundwork for probabilistic kernel methods. Alquier et al. (2006) [96]
connected variational inference and PAC-Bayes bounds, proving that variational approximations
can preserve generalization guarantees of PAC-Bayesian bounds. Dziugaite and Roy (2017) [97]
gave one of the first applications of PAC-Bayes to deep learning. They derived nonvacuous gener-
alization bounds for stochastic neural networks, bridging theory and practice. Rivasplata et al.
(2020) [98] provided novel PAC-Bayes bounds that improve over existing guarantees, making PAC-
Bayesian bounds more practical for modern ML applications. Lever et al. (2013) [99] explored
data-dependent priors in PAC-Bayes theory, showing that adaptive priors lead to tighter gener-
alization bounds. Rivasplata et al. (2018) [100] introduced instance-dependent priors, improv-
ing personalized learning and making PAC-Bayesian methods more useful for real-world machine
learning problems. Lindemann et al. (2024) [101] integrated PAC-Bayes theory with conformal
prediction to improve formal verification in control systems, demonstrating PAC-Bayes’ relevance
to safety-critical applications.
At the core of the PAC-Bayes formalism lies the ambition to rigorously quantify the generalization
ability of hypotheses h ∈ H based on their performance on a finite dataset S ∼ Dm , where D
represents the underlying, and typically unknown, data distribution. The PAC framework, which
was originally designed to provide high-confidence guarantees on the true risk
R(h) = E_{(x,y)∼D}[ℓ(h(x), y)],    (146)
relies on the empirical risk
R̂(h, S) = (1/m) Σ_{i=1}^{m} ℓ(h(x_i), y_i),    (147)
which serves as a computable proxy. The key question addressed by PAC-Bayes is: How does R̂(h, S) relate to R(h), and how can we bound the deviation probabilistically? For a distribution Q over H, these risks are generalized as:
R(Q) = E_{h∼Q}[R(h)],   R̂(Q, S) = E_{h∼Q}[R̂(h, S)].    (148)
This generalization is pivotal because it allows the analysis to transcend individual hypotheses and
consider probabilistic ensembles, where Q(h) represents a posterior belief over the hypothesis space
conditioned on the observed data. We now need to discuss how Prior and Posterior Distributions
encode knowledge and complexity. The prior P is a fixed distribution over H that reflects pre-data
assumptions about the plausibility of hypotheses. Crucially, P must be independent of S to avoid
biasing the bounds. The posterior Q, however, is data-dependent and typically chosen to minimize
a combination of empirical risk and complexity. This choice is guided by the PAC-Bayes inequality,
which regularizes Q via its Kullback-Leibler (KL) divergence from P :
KL(Q∥P) = ∫_H Q(h) log( Q(h)/P(h) ) dh.    (149)
The KL divergence quantifies the informational cost of updating P to Q, serving as a penalty term
that discourages overly complex posteriors. This regularization is critical in preventing overfitting,
ensuring that Q achieves a balance between data fidelity and model simplicity.
Let’s derive the PAC-Bayes Inequality: Probabilistic and Information-Theoretic Foundations. The
derivation of the PAC-Bayes inequality hinges on a combination of probabilistic tools and information-
theoretic arguments. A central step involves applying a change of measure from P to Q, leveraging
the identity:
Q(h)
Eh∼Q [f (h)] = Eh∼P f (h) . (150)
P (h)
This allows the incorporation of Q into bounds that originally apply to fixed h. By analyzing
the moment-generating function of deviations between R̂(h, S) and R(h), and applying Hoeffding’s
inequality to the empirical loss, we arrive at the following bound for any Q and P , with probability
at least 1 − δ:
R(Q) ≤ R̂(Q, S) + √( (KL(Q∥P) + log(1/δ)) / (2m) ).    (151)
The generalization bound is therefore given by:
L(f) − L_emp(f) ≤ √( (KL(Q∥P) + log(1/δ)) / (2N) ),    (152)
where KL(Q∥P ) quantifies the divergence between the posterior Q and prior P . This bound is
remarkable because it explicitly ties the true risk R(Q) to the empirical risk R̂(Q, S), the KL
divergence, and the sample size m. The PAC-Bayes bound encapsulates three competing forces: the empirical risk R̂(Q, S), the complexity penalty KL(Q∥P), and the confidence term √(log(1/δ)/(2m)). This interplay reflects a fundamental trade-off in learning, illustrated numerically after this list:
1. Empirical Risk: R̂(Q, S) captures how well the posterior Q fits the training data.
2. Complexity Penalty: KL(Q∥P) penalizes posteriors that deviate strongly from the prior, discouraging overly complex solutions.
3. Confidence Term: √(log(1/δ)/(2m)) shrinks as the sample size m grows, reflecting reduced statistical uncertainty.
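As a hedged numerical sketch of the bound in eq. (151), assume Q and P are diagonal Gaussians over hypothesis weights (so the KL term of eq. (149) has a closed form) and plug in hypothetical values for the empirical risk, sample size, and confidence level; none of these numbers come from the text.

```python
import numpy as np

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL(Q || P) for diagonal Gaussians over weights, cf. eq. (149)."""
    return 0.5 * np.sum(np.log(var_p / var_q)
                        + (var_q + (mu_q - mu_p)**2) / var_p - 1.0)

# Hypothetical setup: 100-dim hypothesis space, prior N(0, I).
m, delta = 10_000, 0.05                     # sample size, confidence level
emp_risk = 0.08                             # assumed empirical risk of Q
mu_p, var_p = np.zeros(100), np.ones(100)
mu_q, var_q = 0.3 * np.ones(100), 0.5 * np.ones(100)

kl = kl_diag_gaussians(mu_q, var_q, mu_p, var_p)
bound = emp_risk + np.sqrt((kl + np.log(1.0 / delta)) / (2 * m))  # eq. (151)
print(kl, bound)       # larger KL or smaller m loosens the guarantee
```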
There are several extensions and advanced applications of the PAC-Bayes formalism. While the classi-
cal PAC-Bayes framework assumes i.i.d. data, recent advancements have generalized the theory to
handle structured data, such as in time-series and graph-based learning. Furthermore, alternative
divergence measures, like Rényi divergence or Wasserstein distance, have been explored to accom-
modate scenarios where KL divergence may be inappropriate. In practical settings, PAC-Bayes
bounds have been instrumental in analyzing neural networks, Bayesian ensembles, and stochas-
tic processes, offering theoretical guarantees even in high-dimensional, non-convex optimization
landscapes.
Literature Review: Jin et al. (2025) [102] introduced a novel confusional spectral regularization
technique to improve fairness in machine learning models. The study focuses on the spectral norm
of the robust confusion matrix and proposes a method to control spectral properties, ensuring more
robust and unbiased learning. It provides insights into how regularization can mitigate biases in
classification tasks. Ye et al. (2025) [103] applied spectral clustering with regularization to detect
small clusters in complex networks. The work enhances spectral clustering techniques by integrat-
ing regularization methods, allowing improved performance in anomaly detection and community
detection tasks. The approach significantly improves robustness in highly noisy data environments.
Bhattacharjee and Bharadwaj (2025) [104] explored how spectral domain representations can ben-
efit from autoencoder-based feature extraction combined with stochastic regularization techniques.
The authors propose a Symmetric Autoencoder (SymAE) that enables better generalization of
spectral features, particularly useful in high-dimensional data and deep learning applications. Wu et al. (2025) [105] applied spectral regularization to geophysical data processing, specifically for
high-resolution velocity spectrum analysis. The approach enhances the resolution of velocity esti-
mation in seismic imaging by using hyperbolic Radon transform regularization, demonstrating how
spectral regularization can benefit applications beyond traditional ML. Ortega et al. (2025) [106]
applied Tikhonov regularization to atmospheric spectral analysis, optimizing gas retrieval strategies
in high-resolution spectroscopic observations. The work significantly improves methane (CH4) and
nitrous oxide (N2O) detection accuracy by reducing noise in spectral measurements, showcasing
the impact of spectral regularization in remote sensing and environmental monitoring. Kazmi et al. (2025) [107] proposed a spectral regularization-based federated learning model to improve ro-
bustness in cybersecurity threat detection. The model addresses the issue of non-IID data in SDN
(Software Defined Networks) by utilizing spectral norm-based regularization within deep learning
architectures. Zhao et al. (2025) [108] introduced a regularized deep spectral clustering method,
which enhances feature selection and clustering robustness. The authors utilize projected adap-
tive feature selection combined with spectral graph regularization, improving clustering accuracy
and interpretability in high-dimensional datasets. Saranya and Menaka (2025) [109] integrated
spectral regularization with quantum-based machine learning to analyze EEG signals for Autism
Spectrum Disorder (ASD) detection. The proposed method improves spatial filtering and feature
extraction using wavelet-based regularization, leading to more reliable EEG pattern recognition.
Dhalbisoi et al. (2024) [110] developed a Regularized Zero-Forcing (RZF) method for spectral
efficiency optimization in beyond 5G networks. The authors demonstrate that spectral regulariza-
tion techniques can significantly improve signal-to-noise ratios in wireless communication systems,
optimizing data transmission in massive MIMO architectures. Wei et al. (2025) [111] explored the
use of spectral regularization in medical imaging, particularly in 3D near-infrared spectral tomogra-
phy. The proposed model integrates regularized convolutional neural networks (CNNs) to improve
tissue imaging resolution and accuracy, demonstrating an application of spectral regularization in
biomedical engineering.
Let us define a target function f (x), where x ∈ Rd , and its Fourier transform fˆ(ξ) as
f̂(ξ) = ∫_{ℝ^d} f(x) e^{−i2πξ·x} dx    (153)
This transform breaks down f (x) into frequency components indexed by ξ. In the context of deep
learning, we seek to approximate f (x) with a neural network output fNN (x; θ), where θ represents
the set of trainable parameters. The loss function to be minimized is typically the mean squared
error:
L(θ) = ∫_{ℝ^d} |f(x) − f_NN(x; θ)|² dx    (154)
We can equivalently express this loss in the Fourier domain, leveraging Parseval’s theorem:
L(θ) = ∫_{ℝ^d} |f̂(ξ) − f̂_NN(ξ; θ)|² dξ    (155)
The parameters are updated via gradient descent,
θ_{t+1} = θ_t − η ∇θ L(θ_t),    (156)
where η is the learning rate. The gradient of the loss function with respect to θ is
∇θ L(θ) = 2 ∫_{ℝ^d} ( f̂_NN(ξ; θ) − f̂(ξ) ) ∇θ f̂_NN(ξ; θ) dξ    (157)
At the core of this gradient descent process lies the behavior of the gradient ∇θ fˆNN (ξ; θ) with respect
to the frequency components ξ. For neural networks, particularly those with ReLU activations, the
gradients of the output with respect to the parameters are expected to decay for high-frequency
components. This can be approximated as
R(ξ) ∼ 1/(1 + ∥ξ∥²)    (158)
which implies that the neural network is inherently more sensitive to low-frequency components of
the target function during early iterations of training. This spectral decay is a direct consequence
of the structure of the network’s activations, which are more sensitive to low-frequency features due
to their smoother, lower-order terms. To understand the role of the neural tangent kernel (NTK),
which governs the linearized dynamics of the neural network, we define the NTK as
Θ(x, x′; θ) = Σ_{i=1}^{P} (∂f_NN(x; θ)/∂θ_i)(∂f_NN(x′; θ)/∂θ_i)    (159)
The NTK essentially describes how the output of the network changes with respect to its param-
eters. The evolution of the network’s output during training can be approximated by the solution
to a linear system governed by the NTK. The output of the network at time t is given by
f_NN(x; t) = Σ_k c_k (1 − e^{−ηλ_k t}) ϕ_k(x)    (160)
where {λk } are the eigenvalues of Θ, and {ϕk (x)} are the corresponding eigenfunctions. The
eigenvalues λ_k determine the speed of convergence for each frequency mode, with low-frequency modes (large λ_k) converging more quickly than high-frequency ones (small λ_k): the error in mode k decays on a time scale
τ_k ∼ 1/(ηλ_k).    (161)
This differential learning rate for frequency components leads to the spectral regularization phe-
nomenon, where the network learns the low-frequency components of the function first, and the
high-frequency modes only begin to adapt once the low-frequency ones have been approximated
with sufficient accuracy. In a more formal setting, the spectral bias can also be understood in terms
of Sobolev spaces. A neural network function fNN can be seen as a function in a Sobolev space
W m,2 , where the norm of a function f in this space is defined as
∥f∥²_{W^{m,2}} = ∫_{ℝ^d} (1 + ∥ξ∥²)^m |f̂(ξ)|² dξ    (162)
When training a neural network, the optimization process implicitly regularizes the higher-order
Sobolev norms, meaning that the network will initially approximate the target function in terms
of lower-order derivatives (which correspond to low-frequency modes). This can be expressed by
introducing a regularization term in the loss function:
L_reg(θ) = L(θ) + λ ∥f_NN∥²_{W^{m,2}},    (163)
where λ is a regularization parameter that controls the trade-off between data fidelity and smoothness in the approximation.
Thus, spectral regularization emerges as a consequence of the network’s architecture, the nature of
gradient descent optimization, and the inherent smoothness of the functions that neural networks
are capable of learning. The mathematical structure of the NTK and the regularization properties
of the Sobolev spaces provide a rigorous framework for understanding why neural networks prior-
itize the learning of low-frequency modes, reinforcing the idea that neural networks are implicitly
biased toward smooth, low-frequency approximations at the beginning of training. This insight
has profound implications for the generalization behavior of neural networks and their capacity to
approximate complex functions.
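A small numpy sketch can make the Sobolev-norm perspective of eq. (162) tangible: approximating ∥f∥_{W^{m,2}} via the FFT shows how adding high-frequency content inflates the higher-order norms. The grid and test functions below are illustrative assumptions.

```python
import numpy as np

# Approximate the Sobolev norm of eq. (162) for 1-d functions via FFT.
n, L = 4096, 10.0
x = np.linspace(-L, L, n, endpoint=False)
dx = x[1] - x[0]
xi = np.fft.fftfreq(n, d=dx)

def sobolev_norm_sq(f, m):
    f_hat = np.fft.fft(f) * dx           # Riemann-sum Fourier transform
    dxi = 1.0 / (n * dx)                 # frequency-grid spacing
    return np.sum((1 + xi**2)**m * np.abs(f_hat)**2) * dxi

smooth = np.exp(-x**2)                   # low-frequency content only
wiggly = np.exp(-x**2) * np.sin(40 * x)  # adds high-frequency content
for m in [0, 1, 2]:
    # The wiggly function's norm grows much faster with the order m.
    print(m, sobolev_norm_sq(smooth, m), sobolev_norm_sq(wiggly, m))
```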
The perceptron computes a weighted sum of its input vector ⃗x = [x₁, x₂, . . . , x_n]ᵀ ∈ ℝⁿ with a corresponding weight vector ⃗w = [w₁, w₂, . . . , w_n]ᵀ ∈ ℝⁿ, augmented by a bias term b ∈ ℝ. This can be expressed as
z = ⃗wᵀ⃗x + b = Σ_{i=1}^{n} w_i x_i + b.    (164)
To determine the output, this value is passed through the step activation function, defined mathe-
matically as
ϕ(z) = { 1 if z ≥ 0;  0 if z < 0 }.    (165)
Thus, the perceptron’s decision-making process can be expressed as
y = ϕ(⃗wᵀ⃗x + b),    (166)
where y ∈ {0, 1}. The equation ⃗wᵀ⃗x + b = 0 defines a hyperplane in ℝⁿ, which acts as the decision boundary. For any input ⃗x, the classification is determined by the sign of ⃗wᵀ⃗x + b: specifically, y = 1 if ⃗wᵀ⃗x + b ≥ 0 and y = 0 otherwise. Geometrically, this classification corresponds to partitioning the
input space into two distinct half-spaces. To train the perceptron, a supervised learning algorithm
adjusts the weights ⃗w and the bias b iteratively using labeled training data {(⃗x_i, y_i)}_{i=1}^{m}, where y_i represents the ground truth. When the predicted output y_pred = ϕ(⃗wᵀ⃗x_i + b) differs from y_i, the
weight vector and bias are updated according to the rule
⃗w ← ⃗w + η(y_i − y_pred)⃗x_i,    (167)
and
b ← b + η(yi − ypred ), (168)
where η > 0 is the learning rate. Each individual weight w_j is updated as
w_j ← w_j + η(y_i − y_pred) x_{i,j}.    (169)
For a linearly separable dataset, the Perceptron Convergence Theorem asserts that the algorithm
will converge to a solution after a finite number of updates. Specifically, the number of updates is
bounded by
R²/γ²,    (170)
where R = maxi ∥⃗xi ∥ is the maximum norm of the input vectors, and γ is the minimum margin,
defined as
γ = min_i [ y_i(⃗wᵀ⃗x_i + b) / ∥⃗w∥ ].    (171)
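The perceptron updates of eqs. (167)–(169) translate directly into code; a minimal sketch on a synthetic, linearly separable dataset follows, where the data generator, margin filter, and learning rate are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# Build a separable toy problem with a guaranteed margin, then run the
# perceptron rule of eqs. (167)-(168); convergence is finite (R^2 / gamma^2).
X = rng.normal(size=(500, 2))
scores = X @ np.array([2.0, -1.0]) + 0.5
keep = np.abs(scores) > 0.3              # enforce a positive margin
X, y = X[keep], (scores[keep] > 0).astype(int)

w, b, eta = np.zeros(2), 0.0, 1.0
for epoch in range(100):
    errors = 0
    for xi_, yi_ in zip(X, y):
        y_pred = int(w @ xi_ + b >= 0)   # step activation, eq. (165)
        if y_pred != yi_:
            w += eta * (yi_ - y_pred) * xi_   # eq. (167)
            b += eta * (yi_ - y_pred)         # eq. (168)
            errors += 1
    if errors == 0:
        print("converged at epoch", epoch, w, b)
        break
```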
The limitations of the perceptron, particularly its inability to solve linearly inseparable problems
such as the XOR problem, necessitate the extension to artificial neurons with non-linear activation
functions. A popular choice is the sigmoid activation function
ϕ(z) = 1/(1 + e^{−z}),    (172)
which maps z ∈ ℝ to the continuous interval (0, 1). The derivative of the sigmoid function, essential for gradient-based optimization, is
ϕ′(z) = ϕ(z)(1 − ϕ(z)).    (173)
Another widely used activation function is the hyperbolic tangent tanh(z), defined as
tanh(z) = (e^z − e^{−z})/(e^z + e^{−z}),    (174)
with derivative
tanh′ (z) = 1 − tanh2 (z). (175)
ReLU, or Rectified Linear Unit, is defined as
ϕ(z) = max(0, z), (176)
with derivative
ϕ′(z) = { 1 if z > 0;  0 if z ≤ 0 }.    (177)
These non-linear activations enable the network to approximate non-linear decision boundaries,
a capability absent in the perceptron. Artificial neurons form the building blocks of multi-layer
perceptrons (MLPs), where neurons are organized into layers. For an L-layer network, the input ⃗x
is transformed layer by layer. At layer l, the output is
⃗z^{(l)} = ϕ^{(l)}(W^{(l)} ⃗z^{(l−1)} + ⃗b^{(l)}),    (178)
where W^{(l)} ∈ ℝ^{n_l × n_{l−1}} is the weight matrix, ⃗b^{(l)} ∈ ℝ^{n_l} is the bias vector, and ϕ^{(l)} is the activation
function. The network’s output is
⃗ŷ = ϕ^{(L)}(W^{(L)} ⃗z^{(L−1)} + ⃗b^{(L)}).    (179)
The Universal Approximation Theorem guarantees that MLPs with sufficient neurons and non-
linear activations can approximate any continuous function f : Rn → Rm to arbitrary precision.
Formally, for any ϵ > 0, there exists an MLP g(⃗x) such that
∥f (⃗x) − g(⃗x)∥∞ < ϵ (180)
for all ⃗x ∈ Rn . Training an MLP minimizes a loss function L that quantifies the error between
predicted outputs ⃗yˆ and ground truth labels ⃗y . For regression, the mean squared error is
L = (1/m) Σ_{i=1}^{m} ∥⃗ŷ_i − ⃗y_i∥²,    (181)
m i=1
and for classification, the cross-entropy loss is
L = −(1/m) Σ_{i=1}^{m} [ ⃗y_iᵀ log ⃗ŷ_i + (1 − ⃗y_i)ᵀ log(1 − ⃗ŷ_i) ].    (182)
Optimization uses stochastic gradient descent (SGD), updating parameters Θ = {W^{(l)}, ⃗b^{(l)}}_{l=1}^{L} as
Θ ← Θ − η∇_Θ L.    (183)
Gradients are computed via backpropagation:
∂L/∂W^{(l)} = δ^{(l)} ⃗z^{(l−1)ᵀ},    (184)
∂L/∂⃗b^{(l)} = δ^{(l)},    (185)
where δ^{(l)} denotes the error signal at layer l.
Artificial neurons and their extensions have thus provided the foundation for modern deep learn-
ing. Their mathematical underpinnings and computational frameworks are instrumental in solving
a wide array of problems, from classification and regression to complex decision-making. The in-
terplay of linear algebra, calculus, and optimization theory in their formulation ensures that these
networks are both theoretically robust and practically powerful.
5.2 Feedforward Neural Networks
Feedforward neural networks (FNNs) are mathematical constructs designed to approximate ar-
bitrary mappings f : Rn → Rm by composing affine transformations and nonlinear activation
functions. At their core, these networks consist of L layers, where each layer k transforms its input
⃗ak−1 ∈ Rmk−1 into an output ⃗ak ∈ Rmk via the operation
⃗ak = fk (Wk⃗ak−1 + ⃗bk ). (186)
Here, Wk ∈ Rmk ×mk−1 represents the weight matrix, ⃗bk ∈ Rmk is the bias vector, and fk : Rmk →
Rmk is a component-wise activation function. Formally, if we denote the input layer as ⃗a0 = ⃗x, the
final output of the network, ⃗y ∈ Rm , is given by ⃗aL = fL (WL⃗aL−1 + ⃗bL ). Each transformation in
this sequence can be described as ⃗zk = Wk⃗ak−1 + ⃗bk , followed by the activation ⃗ak = fk (⃗zk ). The
affine transformation ⃗zk = Wk⃗ak−1 + ⃗bk encapsulates the linear combination of inputs with weights
Wk and the addition of biases ⃗bk . For any two layers k and k + 1, the overall transformation can
be represented by
⃗zk+1 = Wk+1 (Wk⃗ak−1 + ⃗bk ) + ⃗bk+1 . (187)
Expanding this, we have
⃗zk+1 = Wk+1 Wk⃗ak−1 + Wk+1⃗bk + ⃗bk+1 . (188)
Without the nonlinearity introduced by fk , the network reduces to a single affine transformation
⃗y = W ⃗x + ⃗b, (189)
where W = WL WL−1 · · · W1 and
⃗b = WL WL−1 · · · W2⃗b1 + · · · + ⃗bL . (190)
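This collapse is easy to verify numerically; the following sketch (with arbitrary layer sizes and random weights) checks that a three-layer purely affine network agrees with the single affine map of eqs. (189)–(190).

```python
import numpy as np

rng = np.random.default_rng(5)

# Without nonlinearities, stacked layers collapse to one affine map:
# W = W3 W2 W1 and a single effective bias (eqs. 189-190).
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
W2, b2 = rng.normal(size=(4, 5)), rng.normal(size=4)
W3, b3 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)
deep = W3 @ (W2 @ (W1 @ x + b1) + b2) + b3       # layer-by-layer evaluation

W = W3 @ W2 @ W1                                 # collapsed weight matrix
b = W3 @ W2 @ b1 + W3 @ b2 + b3                  # collapsed bias vector
flat = W @ x + b
print(np.allclose(deep, flat))                   # True
```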
Thus, the incorporation of nonlinear activation functions is critical, as it enables the network to
approximate non-linear mappings. Activation functions fk are applied element-wise to the pre-
activation vector ⃗zk . The choice of activation significantly affects the network’s behavior and
training. For example, the sigmoid activation f(x) = 1/(1 + e^{−x}) compresses inputs into the range (0, 1) and has a derivative given by
f′(x) = f(x)(1 − f(x)).    (191)
The hyperbolic tangent activation f(x) = tanh(x) = (e^x − e^{−x})/(e^x + e^{−x}) maps inputs to (−1, 1) with derivative f′(x) = 1 − tanh²(x).
These derivatives are essential for gradient-based optimization. The objective of training a feedfor-
ward neural network is to minimize a loss function L, which measures the discrepancy between the
predicted outputs ⃗yi and the true targets ⃗ti over a dataset {(⃗xi , ⃗ti )}N
i=1 . For regression problems,
the mean squared error (MSE) is often used, given by
L = (1/N) Σ_{i=1}^{N} ∥⃗y_i − ⃗t_i∥² = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{m} (y_{i,j} − t_{i,j})².    (194)
For classification problems, the cross-entropy loss is typically used,
L = −(1/N) Σ_{i=1}^{N} Σ_{j=1}^{m} t_{i,j} log y_{i,j},    (195)
where t_{i,j} represents the one-hot encoded labels. The gradient of L with respect to the network
parameters is computed using backpropagation, which applies the chain rule iteratively to propagate
errors from the output layer to the input layer. During backpropagation, the error signal at the
output layer is computed as
δ_L = ∂L/∂⃗z_L = ∇_{⃗y} L ⊙ f_L′(⃗z_L),    (196)
where ⊙ denotes the Hadamard product. For hidden layers, the error signal propagates backward
as
δ_k = (W_{k+1}ᵀ δ_{k+1}) ⊙ f_k′(⃗z_k).    (197)
The gradients of the loss with respect to the weights and biases are then given by
∂L/∂W_k = δ_k ⃗a_{k−1}ᵀ,   ∂L/∂⃗b_k = δ_k.    (198)
These gradients are used to update the parameters through optimization algorithms like stochastic
gradient descent (SGD), where
W_k ← W_k − η ∂L/∂W_k,   ⃗b_k ← ⃗b_k − η ∂L/∂⃗b_k,    (199)
with η > 0 as the learning rate. The universal approximation theorem rigorously establishes that
a feedforward neural network with at least one hidden layer and sufficiently many neurons can
approximate any continuous function f : Rn → Rm on a compact domain D ⊂ Rn . Specifically,
for any ϵ > 0, there exists a network fˆ such that ∥f (⃗x) − fˆ(⃗x)∥ < ϵ for all ⃗x ∈ D. This expressive
capability arises because the composition of affine transformations and nonlinear activations allows
the network to approximate highly complex functions by partitioning the input space into regions
and assigning different functional behaviors to each.
5.3 Activation Functions
For a single artificial neuron with input vector x ∈ ℝⁿ, weight vector w, and bias b, the net input is
z = wᵀx + b,    (200)
where wᵀx represents the dot product of the weight vector and the input vector. The activation
function σ(z) is then applied to this net input to obtain the output of the neuron a:
a = σ(z) = σ( Σ_{i=1}^{n} w_i x_i + b ).    (201)
The activation function introduces a non-linearity into the neuron’s response, which is a crucial
aspect of neural networks because, without it, the network would only be able to perform linear
transformations of the input data, limiting its ability to approximate complex, real-world func-
tions. The non-linearity introduced by σ(z) is fundamental because it enables the network to
capture intricate relationships between the input and output, making neural networks capable of
solving problems that require hierarchical feature extraction, such as image classification, time-
series forecasting, and language modeling. The importance of non-linearity is most clearly evident
when considering the mathematical formulation of a multi-layer neural network. For a feed-forward
neural network with L layers, the output ŷ of the network is given by the composition of successive
affine transformations and activation functions. Let x denote the input vector, Wk and bk be the
weight matrix and bias vector for the k-th layer, and σk be the activation function for the k-th
layer. The output of the network is:
ŷ = σ_L(W_L σ_{L−1}(· · · σ₁(W₁x + b₁) · · ·) + b_L).    (202)
If σ(z) were a linear function, say σ(z) = c·z for some constant c, the composition of such functions
would still result in a linear function. Specifically, if each σk were linear, the overall network function
would simplify to a single linear transformation:
ŷ = c1 · x + c2 , (203)
where c1 and c2 are constants dependent on the parameters of the network. In this case, the
network would have no greater expressive power than a simple linear regression model, regardless
of the number of layers. Thus, the non-linearity introduced by activation functions allows neural
networks to approximate any continuous function, as guaranteed by the universal approximation
theorem. This theorem states that a feed-forward neural network with at least one hidden layer
and a sufficiently large number of neurons can approximate any continuous function f (x), provided
the activation function is non-linear and the network has enough capacity.
Next, consider the mathematical properties that the activation function σ(z) must possess. First,
it must be differentiable to allow the use of gradient-based optimization methods like backpropaga-
tion for training. Backpropagation relies on the chain rule of calculus to compute the gradients of
the loss function L with respect to the parameters (weights and biases) of the network. Suppose
L = L(ŷ, y) is the loss function, where ŷ is the predicted output of the network and y is the true
label. During training, we compute the gradient of L with respect to the weights using the chain
rule. Let ak = σk (zk ) represent the output of the activation function at layer k, where zk is the
input to the activation function. The gradient of the loss with respect to the weights at layer k is
given by:
∂L/∂W_k = (∂L/∂a_k) · (∂a_k/∂z_k) · (∂z_k/∂W_k).    (204)
The term ∂a_k/∂z_k is the derivative of the activation function, which must exist and be well-defined
for gradient-based optimization to work effectively. If the activation function is not differentiable,
the backpropagation algorithm cannot compute the gradients, preventing the training process from
proceeding.
Now consider the specific forms of activation functions commonly used in practice. The sigmoid
activation function is one of the most well-known, defined as:
σ(z) = 1/(1 + e^{−z}).    (205)
Its derivative is:
σ ′ (z) = σ(z)(1 − σ(z)), (206)
which can be derived by applying the chain rule to the expression for σ(z). Although sigmoid
is differentiable and smooth, it suffers from the vanishing gradient problem, especially for large
positive or negative values of z. Specifically, as z → ∞, σ ′ (z) → 0, and similarly as z → −∞,
σ ′ (z) → 0. This results in very small gradients during backpropagation, making it difficult for
the network to learn when the input values become extreme. To mitigate the vanishing gradient
problem, the hyperbolic tangent (tanh) function is often used as an alternative. It is defined
as:
tanh(z) = (e^z − e^{−z})/(e^z + e^{−z}),    (207)
with derivative:
tanh′ (z) = 1 − tanh2 (z). (208)
The tanh function outputs values in the range (−1, 1), which helps to center the data around
zero. While the tanh function overcomes some of the vanishing gradient issues associated with the
sigmoid function, it still suffers from the problem for large |z|, where the gradients approach zero.
The Rectified Linear Unit (ReLU) is another commonly used activation function. It is defined
as:
ReLU(z) = max(0, z), (209)
with derivative:
ReLU′(z) = { 1 if z > 0;  0 if z ≤ 0 }.    (210)
ReLU is particularly advantageous because it is computationally efficient, as it only requires a
comparison to zero. Moreover, for positive values of z, the derivative is constant and equal to 1,
which helps avoid the vanishing gradient problem. However, ReLU can suffer from the dying ReLU
problem, where neurons output zero for all inputs if the weights are initialized poorly or if the
learning rate is too high, leading to inactive neurons that do not contribute to the learning process.
To address the dying ReLU problem, the Leaky ReLU activation function is introduced, defined
as:
Leaky ReLU(z) = { z if z > 0;  αz if z ≤ 0 },    (211)
where α is a small constant, typically chosen to be 0.01. The derivative of the Leaky ReLU is:
Leaky ReLU′(z) = { 1 if z > 0;  α if z ≤ 0 }.    (212)
Leaky ReLU ensures that neurons do not become entirely inactive by allowing a small, non-zero
gradient for negative values of z. Finally, for classification tasks, particularly when there are
multiple classes, the Softmax activation function is often used in the output layer of the neural
network. The Softmax function is defined as:
Softmax(z_i) = e^{z_i} / Σ_{j=1}^{n} e^{z_j},    (213)
where zi is the input to the i-th neuron in the output layer, and the denominator ensures that the
outputs sum to 1, making them interpretable as probabilities. The Softmax function is typically
used in multi-class classification problems, where the network must predict one class out of several
possible categories.
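For reference, the activations and derivatives discussed above, eqs. (205)–(213), admit direct numpy implementations; the following is a minimal sketch rather than a production library, and the test inputs are arbitrary.

```python
import numpy as np

# Activation functions and their derivatives, eqs. (205)-(213).
def sigmoid(z):        return 1.0 / (1.0 + np.exp(-z))
def d_sigmoid(z):      s = sigmoid(z); return s * (1.0 - s)
def tanh(z):           return np.tanh(z)
def d_tanh(z):         return 1.0 - np.tanh(z)**2
def relu(z):           return np.maximum(0.0, z)
def d_relu(z):         return (z > 0).astype(float)
def leaky_relu(z, alpha=0.01):    return np.where(z > 0, z, alpha * z)
def d_leaky_relu(z, alpha=0.01):  return np.where(z > 0, 1.0, alpha)
def softmax(z):
    e = np.exp(z - z.max())        # shift for numerical stability
    return e / e.sum()

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z), d_sigmoid(z))
print(relu(z), d_relu(z))
print(softmax(z), softmax(z).sum())   # probabilities summing to 1
```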
In summary, activation functions are a vital component of neural networks, enabling them to learn
intricate patterns in data, allowing for the successful application of neural networks to diverse tasks.
Different activation functions—such as sigmoid, tanh, ReLU, Leaky ReLU, and Softmax—each of-
fer distinct advantages and limitations, and their choice significantly impacts the performance and
training dynamics of the neural network.
5.4 Loss Functions
In neural networks, the loss function is a crucial mathematical tool that quantifies the difference
between the predicted output of the model and the true output or target. Let xi be the input vector
and yi the corresponding target vector for the i-th training example. The network, parameterized
by weights W, generates a prediction denoted as ŷi = f (xi ; W), where f (xi ; W) represents the
model’s output. The objective of training the neural network is to minimize the discrepancy between
the predicted output ŷi and the true label yi across all training examples, effectively learning the
mapping function from inputs to outputs. A typical objective function is the average loss over a
dataset of N samples:
L(W) = (1/N) Σ_{i=1}^{N} L(y_i, ŷ_i)    (214)
where L(yi , ŷi ) represents the loss function that computes the error between the true output yi
and the predicted output ŷi for each data point. To minimize this objective function, optimization
algorithms such as gradient descent are used. The general update rule for the weights W is given
by:
W ← W − η∇W L(W) (215)
where η is the learning rate, and ∇W L(W) is the gradient of the loss function with respect to
the weights. The gradient is computed using backpropagation, which applies the chain rule
of calculus to propagate the error backward through the network, updating the parameters to
minimize the loss. For this, we use the partial derivatives of the loss with respect to each layer’s
weights and biases, ensuring the error is distributed appropriately across all layers. For regression
tasks, the Mean Squared Error (MSE) loss is frequently used. This loss function quantifies
the error as the average squared difference between the predicted and true values. The MSE for a
dataset of N examples is given by:
L_MSE = (1/N) Σ_{i=1}^{N} (y_i − ŷ_i)²    (216)
where ŷi = f (xi ; W) is the network’s predicted output for the i-th input xi . The gradient of the
MSE with respect to the network’s output ŷi is:
∂L_MSE/∂ŷ_i = 2(ŷ_i − y_i)    (217)
This gradient guides the weight update in the direction that minimizes the squared error, leading
to a better fit of the model to the training data. For classification tasks, the cross-entropy
loss is often employed, as it is particularly well-suited to tasks where the output is a probability
distribution over multiple classes. In the binary classification case, where the target label yi is
either 0 or 1, the binary cross-entropy loss function is defined as:
L_CE = −(1/N) Σ_{i=1}^{N} [ y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i) ]    (218)
where ŷi = f (xi ; W) is the predicted probability that the i-th sample belongs to the positive class
(i.e., class 1). For multiclass classification, where the target label yi is a one-hot encoded vector
representing the true class, the general form of the cross-entropy loss is:
L_CE = −(1/N) Σ_{i=1}^{N} Σ_{c=1}^{C} y_{i,c} log(ŷ_{i,c})    (219)
where C is the number of classes, and ŷi,c = f (xi ; W) is the predicted probability that the i-th
sample belongs to class c. The gradient of the cross-entropy loss with respect to the predicted
probabilities ŷ_i is:
∂L_CE/∂ŷ_{i,c} = ŷ_{i,c} − y_{i,c}    (220)
This gradient facilitates the weight update by adjusting the model’s parameters to reduce the dif-
ference between the predicted probabilities and the actual class labels.
In neural network training, the optimization process often involves regularization techniques to
prevent overfitting, especially in cases with high-dimensional data or deep networks. L2 reg-
ularization (also known as Ridge regression) is one common approach, which penalizes large
weights by adding a term proportional to the squared L2 norm of the weights to the loss function.
The regularized loss function becomes:
L_reg = L_MSE + λ Σ_{j=1}^{n} W_j²    (221)
where λ is the regularization strength, and Wj represents the parameters of the network. The
gradient of the regularized loss with respect to the weights is:
∂L_reg/∂W_j = ∂L_MSE/∂W_j + 2λW_j    (222)
This additional term discourages large values of the weights, reducing the complexity of the model
and helping it generalize better to unseen data. Another form of regularization is L1 regulariza-
tion (or Lasso regression), which promotes sparsity in the model by adding the L1 norm of the
weights to the loss function. The L1 regularized loss function is:
L_reg = L_MSE + λ Σ_{j=1}^{n} |W_j|    (223)
The gradient of this regularized loss function with respect to the weights is:
∂L_reg/∂W_j = ∂L_MSE/∂W_j + λ sign(W_j)    (224)
where sign(Wj ) is the sign function, which returns 1 for positive values of Wj , −1 for negative
values, and 0 for Wj = 0. L1 regularization encourages the model to select only a small subset of
features by forcing many of the weights to exactly zero, thus simplifying the model and promoting
interpretability. The optimization process for neural networks can be viewed as solving a non-
convex optimization problem, given the highly non-linear activation functions and the deep
architectures typically used. In this context, stochastic gradient descent (SGD) is commonly
employed to perform the optimization by updating the weights based on the gradient computed
from a random mini-batch of the data. The update rule for SGD can be expressed as:
W ← W − η ∇_W L_batch,    (225)
where ∇_W L_batch is the gradient of the loss function computed over the mini-batch, and η is the
learning rate. Due to the non-convexity of the objective function, SGD tends to converge to a local
minimum or a saddle point, rather than the global minimum, especially in deep neural networks
with many layers.
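The loss and regularization gradients of eqs. (216)–(224) can be collected into a few helper functions; the following is a minimal numpy sketch with made-up inputs, not a library interface.

```python
import numpy as np

# MSE, binary cross-entropy, and L2/L1 regularization gradients,
# following eqs. (216)-(224).
def mse(y, y_hat):          return np.mean((y - y_hat)**2)
def grad_mse(y, y_hat):     return 2.0 * (y_hat - y)             # eq. (217)

def bce(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def l2_penalty(W, lam):     return lam * np.sum(W**2)            # eq. (221)
def grad_l2(W, lam):        return 2.0 * lam * W                 # eq. (222)
def l1_penalty(W, lam):     return lam * np.sum(np.abs(W))       # eq. (223)
def grad_l1(W, lam):        return lam * np.sign(W)              # eq. (224)

y     = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.7])
W     = np.array([0.5, -1.5, 0.0])
print(mse(y, y_hat), bce(y, y_hat))
print(grad_l2(W, 0.01), grad_l1(W, 0.01))  # L1 drives weights toward zero
```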
In summary, the loss function plays a central role in guiding the optimization process in neu-
ral network training by quantifying the error between the predicted and true outputs. Different
loss functions are employed depending on the nature of the problem, with MSE being common for
regression and cross-entropy used for classification. Regularization techniques such as L2 and L1
regularization are incorporated to prevent overfitting and ensure better generalization. Through
optimization algorithms like gradient descent, the neural network parameters are iteratively up-
dated based on the gradients of the loss function, with the ultimate goal of minimizing the loss
across all training examples.
The forward pass through the network consists of computing the activations at each layer. For the l-th layer, the pre-activation z^{(l)} is calculated as:
z^{(l)} = W^{(l)} a^{(l−1)} + b^{(l)},    (229)
where a(l−1) is the activation from the previous layer and W(l) is the weight matrix connecting the
(l − 1)-th layer to the l-th layer. The output of the layer, i.e., the activation a^{(l)}, is computed by applying the activation function σ^{(l)} element-wise to z^{(l)}:
a^{(l)} = σ^{(l)}(z^{(l)}).    (230)
The final output of the network is given by the activation a(L) at the last layer, which is the
predicted output ŷ (i) :
ŷ (i) = a(L) . (231)
The backpropagation algorithm computes the gradient of the loss function J(θ) with respect to
each parameter (weights and biases). First, we compute the error at the output layer. Let δ (L)
represent the error at layer L. This is computed by taking the derivative of the loss function with
respect to the activations at the output layer:
δ^{(L)} = ∂L/∂a^{(L)} ⊙ σ^{(L)′}(z^{(L)}),    (232)
where ⊙ denotes element-wise multiplication, and σ^{(L)′}(z^{(L)}) is the derivative of the activation
function applied element-wise to z (L) . For squared error loss, the derivative with respect to the
activations is:
∂L/∂a^{(L)} = ŷ^{(i)} − y^{(i)},    (233)
so the error term at the output layer is:
δ^{(L)} = (ŷ^{(i)} − y^{(i)}) ⊙ σ^{(L)′}(z^{(L)}).    (234)
To propagate the error backward through the network, we compute the errors at the hidden layers.
For each hidden layer l = L − 1, L − 2, . . . , 1, the error δ (l) is calculated by the chain rule:
δ^{(l)} = (W^{(l+1)ᵀ} δ^{(l+1)}) ⊙ σ^{(l)′}(z^{(l)}),    (235)
where W(l+1)T ∈ Rnl+1 ×nl is the transpose of the weight matrix connecting layer l to layer l+1. This
equation uses the fact that the error at layer l depends on the error at the next layer, modulated
by the weights, and the derivative of the activation function at layer l. Once the errors δ (l) are
computed for all layers, we can compute the gradients of the loss function with respect to the
parameters (weights and biases). The gradient of the loss with respect to the weights W(l) is:
∂J(θ)/∂W^{(l)} = (1/N) Σ_{i=1}^{N} δ^{(l)} (a^{(l−1)})ᵀ    (236)
The gradient of the loss with respect to the biases b(l) is:
∂J(θ)/∂b^{(l)} = (1/N) Σ_{i=1}^{N} δ^{(l)}    (237)
After computing these gradients, we update the parameters using an optimization algorithm such
as gradient descent. The weight update rule is:
W^{(l)} ← W^{(l)} − η ∂J(θ)/∂W^{(l)},    (238)
and the bias update rule is:
b^{(l)} ← b^{(l)} − η ∂J(θ)/∂b^{(l)},    (239)
where η is the learning rate controlling the step size in the gradient descent update. This process of
forward pass, backpropagation, and parameter update is repeated over multiple epochs, with each
epoch consisting of a forward pass, a backward pass, and a parameter update, until the network
converges to a local minimum of the loss function.
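Putting the forward pass, error propagation, and parameter updates of eqs. (229)–(239) together, the following self-contained numpy sketch trains a tiny two-layer sigmoid network on random data; the architecture, learning rate, and data are illustrative assumptions, not the text's prescription.

```python
import numpy as np

rng = np.random.default_rng(6)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: 16 samples, 3 features, 2 targets inside the sigmoid's range.
X = rng.normal(size=(16, 3))
Y = rng.uniform(0.2, 0.8, size=(16, 2))
W1, b1 = rng.normal(size=(4, 3)) * 0.1, np.zeros((4, 1))
W2, b2 = rng.normal(size=(2, 4)) * 0.1, np.zeros((2, 1))
eta, N = 0.5, X.shape[0]

for step in range(200):
    A0 = X.T                                   # (features, samples)
    Z1 = W1 @ A0 + b1; A1 = sigmoid(Z1)        # eqs. (229)-(230)
    Z2 = W2 @ A1 + b2; A2 = sigmoid(Z2)
    delta2 = (A2 - Y.T) * A2 * (1 - A2)        # output error, eq. (234)
    delta1 = (W2.T @ delta2) * A1 * (1 - A1)   # hidden error, eq. (235)
    W2 -= eta * (delta2 @ A1.T) / N            # eqs. (236), (238)
    b2 -= eta * delta2.mean(axis=1, keepdims=True)
    W1 -= eta * (delta1 @ A0.T) / N
    b1 -= eta * delta1.mean(axis=1, keepdims=True)

print(0.5 * np.mean(np.sum((A2 - Y.T)**2, axis=0)))  # loss after training
```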
At each step of backpropagation, the chain rule is applied recursively to propagate the error
backward through the network, adjusting each weight and bias to minimize the total loss. The
derivative of the activation function σ^{(l)′}(z^{(l)}) is critical, as it dictates how the error is modulated
at each layer. Depending on the choice of activation function (e.g., ReLU, sigmoid, or tanh), the
derivative will take different forms, and this choice has a direct impact on the learning dynamics
and convergence rate of the network. Thus, backpropagation serves as the computational back-
bone of neural network training. By calculating the gradients of the loss function with respect to
the network parameters through efficient error propagation, backpropagation allows the network
to adjust its parameters iteratively, gradually minimizing the error and improving its performance
across tasks. This process is mathematically rigorous, utilizing fundamental principles of calculus
and optimization, ensuring that the neural network learns effectively from its training data.
Training a neural network can be formulated as minimizing the empirical loss
L(θ) = (1/N) Σ_{i=1}^{N} ℓ(θ; x_i, y_i),    (240)
where (x_i, y_i) are the input-output pairs in the training dataset of size N, and ℓ(θ; x_i, y_i) is the sample-specific loss. The minimization problem is solved iteratively, starting from an initial guess θ^{(0)} and updating according to the rule
θ^{(k+1)} = θ^{(k)} − η ∇θ L(θ^{(k)}),    (241)
where η > 0 is the learning rate, and ∇θ L(θ) is the gradient of the loss with respect to θ. The gradient, computed via backpropagation, follows the chain rule and propagates through the network’s
layers to adjust weights and biases optimally. In a feedforward neural network with L layers, the
computations proceed as follows. The input to layer l is
z^{(l)} = W^{(l)} a^{(l−1)} + b^{(l)},    (242)
where W^{(l)} ∈ ℝ^{n_l×n_{l−1}} and b^{(l)} ∈ ℝ^{n_l} are the weight matrix and bias vector for the layer, respectively, and a^{(l−1)} is the activation vector from the previous layer. The output is then
a^{(l)} = f^{(l)}(z^{(l)}),    (243)
where f^{(l)} is the activation function. Backpropagation begins with the computation of the error at
the output layer,
δ^{(L)} = ∂ℓ/∂a^{(L)} ⊙ f′^{(L)}(z^{(L)}),    (244)
where f ′(L) (·) is the derivative of the activation function. For hidden layers, the error propagates
recursively as
δ (l) = (W (l+1) )⊤ δ (l+1) ⊙ f ′(l) (z (l) ). (245)
The gradients for weight and bias updates are then computed as
∂L/∂W^{(l)} = δ^{(l)} (a^{(l−1)})ᵀ    (246)
and
∂L/∂b^{(l)} = δ^{(l)},    (247)
respectively. The dynamics of gradient descent are deeply influenced by the curvature of the loss
surface, encapsulated by the Hessian matrix
H(θ) = ∇²_θ L(θ).    (248)
For a small step size η, the change in the loss function can be approximated as
ΔL ≈ −η∥∇θ L(θ)∥² + (η²/2)(∇θ L(θ))ᵀ H(θ) ∇θ L(θ).    (249)
This reveals that convergence is determined not only by the gradient magnitude but also by the
curvature of the loss surface along the gradient direction. The eigenvalues λ1 , λ2 , . . . , λd of H(θ)
dictate the local geometry, with large condition numbers κ = λ_max/λ_min slowing convergence due to
ill-conditioning. Stochastic gradient descent (SGD) modifies the standard gradient descent by
computing updates based on a single data sample (x_i, y_i), leading to
θ^{(k+1)} = θ^{(k)} − η ∇θ ℓ(θ^{(k)}; x_i, y_i).    (250)
While SGD introduces variance into the updates, this stochasticity helps escape saddle points
characterized by zero gradient but mixed curvature. To balance computational efficiency and
stability, mini-batch SGD computes gradients over a randomly selected subset B ⊂ {1, . . . , N } of
size |B|, yielding
∇θ L_B(θ) = (1/|B|) Σ_{i∈B} ∇θ ℓ(θ; x_i, y_i).    (251)
Momentum methods enhance convergence by incorporating a memory of past gradients. The
velocity term
v (k+1) = γv (k) + η∇θ L(θ) (252)
accumulates gradient information, and the parameter update is
θ^{(k+1)} = θ^{(k)} − v^{(k+1)}.    (253)
Analyzing momentum in the eigenspace of H(θ), with H = QΛQᵀ, reveals that the effective step
size in each eigendirection is
η_{eff,i} = η/(1 − γλ_i),    (254)
showing that momentum accelerates convergence in low-curvature directions while damping oscil-
lations in high-curvature directions. Adaptive gradient methods, such as AdaGrad, RMSProp, and
Adam, refine learning rates for individual parameters. In AdaGrad, the adaptive learning rate is
η_i^{(k+1)} = η / √( G_{ii}^{(k+1)} + ϵ ),    (255)
where
G_{ii}^{(k+1)} = G_{ii}^{(k)} + (∇_{θ_i} L(θ))².    (256)
RMSProp modifies this with an exponentially weighted average
G_{ii}^{(k+1)} = β G_{ii}^{(k)} + (1 − β)(∇_{θ_i} L(θ))².    (257)
Adam combines RMSProp with momentum, where the first and second moments are
m^{(k+1)} = β₁ m^{(k)} + (1 − β₁) ∇θ L(θ)    (258)
and
v^{(k+1)} = β₂ v^{(k)} + (1 − β₂)(∇θ L(θ))².    (259)
Bias corrections yield
m̂^{(k+1)} = m^{(k+1)}/(1 − β₁^k),   v̂^{(k+1)} = v^{(k+1)}/(1 − β₂^k).    (260)
The final parameter update is
θ^{(k+1)} = θ^{(k)} − η m̂^{(k+1)} / ( √(v̂^{(k+1)}) + ϵ ).    (261)
In conclusion, gradient descent and its variants provide a rich framework for optimizing neural
network parameters. While standard gradient descent offers a basic approach, advanced methods
like momentum and adaptive gradients significantly enhance convergence by tailoring updates to
the landscape of the loss surface and the dynamics of training.
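The update rules of eqs. (252)–(261) can be written as small stateless helpers; the following numpy sketch (function names and hyperparameters are assumptions, not a standard API) applies Adam to a toy quadratic as a smoke test.

```python
import numpy as np

# Single-step update rules for momentum, AdaGrad, RMSProp, and Adam,
# following eqs. (252)-(261).
def momentum_step(theta, v, grad, eta=0.1, gamma=0.9):
    v = gamma * v + eta * grad                         # eq. (252)
    return theta - v, v                                # eq. (253)

def adagrad_step(theta, G, grad, eta=0.1, eps=1e-8):
    G = G + grad**2                                    # eq. (256)
    return theta - eta * grad / np.sqrt(G + eps), G    # eq. (255)

def rmsprop_step(theta, G, grad, eta=0.01, beta=0.9, eps=1e-8):
    G = beta * G + (1 - beta) * grad**2                # eq. (257)
    return theta - eta * grad / np.sqrt(G + eps), G

def adam_step(theta, m, v, grad, k, eta=0.01, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad                       # eq. (258)
    v = b2 * v + (1 - b2) * grad**2                    # eq. (259)
    m_hat, v_hat = m / (1 - b1**k), v / (1 - b2**k)    # eq. (260)
    return theta - eta * m_hat / (np.sqrt(v_hat) + eps), m, v  # eq. (261)

# Smoke test: minimize f(theta) = 0.5 ||theta||^2, whose gradient is theta.
theta, m, v = np.array([5.0, -3.0]), np.zeros(2), np.zeros(2)
for k in range(1, 501):
    theta, m, v = adam_step(theta, m, v, theta, k, eta=0.05)
print(theta)   # approaches the minimizer at the origin
```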
The Stochastic Gradient Descent (SGD) optimizer is an iterative method designed to mini-
mize an objective function f (w) by updating a parameter vector w in the direction of the negative
gradient. The fundamental optimization problem can be expressed as
min_{w∈ℝ^d} f(w),    (262)
where
f(w) = (1/N) Σ_{i=1}^{N} ℓ(w; x_i, y_i)    (263)
represents the empirical risk, constructed from a dataset {(x_i, y_i)}_{i=1}^{N}. Here, ℓ(w; x_i, y_i) denotes the loss function, w ∈ ℝ^d is the parameter vector, N is the dataset size, and f(w) approximates the true population risk
E_{x,y}[ℓ(w; x, y)].    (264)
Standard gradient descent involves the update rule
w^{(t+1)} = w^{(t)} − η ∇f(w^{(t)}),    (265)
where
∇f(w) = (1/N) Σ_{i=1}^{N} ∇ℓ(w; x_i, y_i)    (266)
is the full gradient. However, for large-scale datasets, the computation of ∇f(w) becomes com-
putationally prohibitive, motivating the adoption of stochastic approximations. The stochastic
approximation relies on the idea of estimating the gradient ∇f (w) using a single data point or a
small batch of data points. Denoting the random index sampled at iteration t as it , the stochastic
gradient can be written as
∇̂f(w^{(t)}) = ∇ℓ(w^{(t)}; x_{i_t}, y_{i_t}).    (267)
Consequently, the update rule becomes
w^{(t+1)} = w^{(t)} − η ∇̂f(w^{(t)}).    (268)
Alternatively, the gradient can be estimated over a randomly sampled mini-batch B_t of size m:
∇̂f(w^{(t)}) = (1/m) Σ_{i∈B_t} ∇ℓ(w^{(t)}; x_i, y_i).    (269)
An important property of ∇f
b (w) is its unbiasedness:
E[∇f
b (w)] = ∇f (w). (270)
Its second moment is assumed to be bounded,
E[‖∇̂f(w)‖²] ≤ ‖∇f(w)‖² + σ², (271)
where
σ² = E[‖∇̂f(w) − ∇f(w)‖²] (272)
is the variance of the gradients. To analyze the convergence properties of SGD, we assume f(w)
to be L-smooth, meaning
‖∇f(w₁) − ∇f(w₂)‖ ≤ L ‖w₁ − w₂‖, (273)
and f(w) to be bounded below by f* = inf_w f(w). Using a Taylor expansion, we can write
f(w^(t+1)) ≤ f(w^(t)) − η ∇f(w^(t))^⊤ ∇̂f(w^(t)) + (η²L/2) ‖∇̂f(w^(t))‖². (274)
Taking expectations yields
E[f(w^(t+1))] ≤ E[f(w^(t))] − η(1 − ηL/2) E[‖∇f(w^(t))‖²] + (η²Lσ²)/2, (275)
showing that the convergence rate depends on the interplay between the learning rate η, the smoothness
constant L, and the gradient variance σ². For η small enough, the dominant term in convergence
is −(η/2) E[‖∇f(w^(t))‖²], leading to a monotonic decrease in f(w^(t)). In the strongly convex case,
where f (w) satisfies
f(w₁) ≥ f(w₂) + ∇f(w₂)^⊤ (w₁ − w₂) + (μ/2) ‖w₁ − w₂‖² (276)
for μ > 0, SGD achieves linear convergence up to a noise floor. Specifically,
E[‖w^(t) − w*‖²] ≤ (1 − ημ)^t ‖w^(0) − w*‖² + ησ²/(2μ). (277)
For non-convex functions, where ∇2 f (w) can have both positive and negative eigenvalues, SGD
may converge to a local minimizer or saddle point. Stochasticity plays a pivotal role in escaping
strict saddle points ws where ∇f (ws ) = 0 but λmin (∇2 f (ws )) < 0.
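The following NumPy sketch illustrates eqs. (268)-(270) on a synthetic least-squares problem: the mini-batch gradient is an unbiased estimator of the full gradient, and the iterates approach the minimizer up to a noise floor as in eq. (277). The data, batch size, and step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 5
X = rng.normal(size=(N, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=N)

def minibatch_grad(w, m=32):
    # Unbiased estimator, eq. (269): sample a batch B_t uniformly at random.
    idx = rng.choice(N, size=m, replace=False)
    return X[idx].T @ (X[idx] @ w - y[idx]) / m

w = np.zeros(d)
eta = 0.05
for t in range(2000):
    w -= eta * minibatch_grad(w)     # stochastic update, eq. (268)

print("distance to w*:", np.linalg.norm(w - w_true))  # small, up to the noise floor
```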
The Adaptive Moment Estimation (Adam) optimizer can be considered a sophisticated, hybrid
optimization algorithm combining elements of momentum-based methods and adaptive learning
rate techniques, which is why it has become a cornerstone in the optimization of complex ma-
chine learning models, particularly those used in deep learning. Adam’s formulation is centered
on computing and using both the first and second moments (i.e., the mean and the variance) of
the gradient with respect to the loss function at each parameter update. This process effectively
adapts the learning rate for each parameter, based on its respective gradient’s statistical properties.
The moment-based adjustments provide robustness against issues such as poor conditioning of the
objective function and gradient noise, which are prevalent in large-scale optimization problems.
Mathematically, at each iteration t, the Adam optimizer updates the parameter vector θt ∈ Rn ,
where n is the number of parameters of the model, based on the gradient gt , which is the gradient
of the objective function with respect to θt , i.e., gt = ∇θ f (θt ). In its essence, Adam computes
two distinct quantities: the first moment estimate mt and the second moment estimate vt , which
are recursive moving averages of the gradients and the squared gradients, respectively. The first
moment estimate m_t is given by
m_t = β₁ m_{t−1} + (1 − β₁) g_t, (278)
where β₁ ∈ [0, 1) is the decay rate for the first moment. This recurrence equation represents a
weighted moving average of the gradients, which is intended to capture the directional momentum
in the optimization process. By incorporating the first moment, Adam accumulates information
about the historical gradients, which helps mitigate oscillations and stabilizes the convergence
direction. The term (1 − β1 ) ensures that the most recent gradient gt receives a more significant
weight in the computation of mt . Similarly, the second moment estimate vt , which represents the
exponentially decaying average of the squared gradients, is updated as
v_t = β₂ v_{t−1} + (1 − β₂) g_t², (279)
where β₂ ∈ [0, 1) is the decay rate for the second moment. This moving average of squared gradi-
ents captures the variance of the gradient at each iteration. The second moment vt thus acts as an
estimate of the curvature of the objective function, which allows the optimizer to adjust the step
size for each parameter accordingly. Specifically, large values of vt correspond to parameters that
experience high gradient variance, signaling a need for smaller updates to prevent overshooting,
while smaller values of vt correspond to parameters with low gradient variance, where larger up-
dates are appropriate. This mechanism is akin to automatically tuning the learning rate for each
parameter based on the local geometry of the loss function. At initialization, both mt and vt are
typically set to zero. This initialization introduces a bias toward zero, particularly at the initial
time steps, causing the estimates of the moments to be somewhat underrepresented in the early
iterations. To correct for this bias, bias correction terms are introduced. The bias-corrected first
moment m̂t is given by
m̂_t = m_t / (1 − β₁^t), (280)
and the bias-corrected second moment v̂_t is given by
v̂_t = v_t / (1 − β₂^t). (281)
The purpose of these corrections is to offset the initial tendency of mt and vt to underestimate the
true values due to their initialization at zero. As the iteration progresses, the bias correction terms
become less significant, and the estimates of the moments converge to their true values, allowing
for more accurate parameter updates. The actual update rule for the parameters θt is determined
by using the bias-corrected first and second moment estimates m̂t and v̂t , respectively. The update
equation is given by
θ_{t+1} = θ_t − η m̂_t / (√(v̂_t) + ϵ), (282)
where η is the global learning rate, and ϵ is a small constant (typically 10⁻⁸) added to the denominator
for numerical stability. This update rule incorporates both the momentum (through m̂_t) and
the adaptive learning rate (through √(v̂_t)). The factor √(v̂_t) + ϵ is particularly crucial as it ensures that
parameters with large gradient variance (i.e., those with large values in v_t) receive smaller updates,
whereas parameters with smaller gradient variance (i.e., those with small values in v_t) receive larger
updates, thus preventing divergence in high-variance regions.
The learning rate adjustment in Adam is dynamic in nature, as it is controlled by the second
moment estimate v̂t , which means that Adam has a per-parameter learning rate for each param-
eter. For each parameter, the learning rate is inversely proportional to the square root of its
corresponding second moment estimate v̂t , leading to adaptive learning rates. This is what enables
Adam to operate effectively in highly non-convex optimization landscapes, as it reduces the learn-
ing rate in directions where the gradient exhibits high variance, thus stabilizing the updates, and
increases the learning rate where the gradient variance is low, speeding up convergence. In the
case where Adam is applied to convex objective functions, convergence can be analyzed mathemat-
ically. Under standard assumptions, such as bounded gradients and a decreasing learning rate, the
convergence of Adam can be shown by proving that
Σ_{t=1}^∞ η_t² < ∞ and Σ_{t=1}^∞ η_t = ∞, (283)
where ηt is the learning rate at time step t. The first condition ensures that the learning rate decays
sufficiently rapidly to guarantee convergence, while the second ensures that the learning rate does
not decay too quickly, allowing for continual updates as the algorithm progresses. However, Adam is
not without its limitations. One notable issue arises from the fact that the second moment estimate
vt may decay too quickly, causing overly aggressive updates in regions where the gradient variance
is relatively low. To address this, the AMSGrad variant was introduced. AMSGrad modifies the
second moment update rule, replacing v̂_t with
v̂_t = max(v̂_{t−1}, v_t), (284)
thereby ensuring that v̂t never decreases, which helps prevent the optimizer from making overly
large updates in situations where the second moment estimate may be miscalculated. By forcing v̂t
to increase or remain constant, AMSGrad reduces the chance of large, destabilizing parameter up-
dates, thereby improving the stability and convergence of the optimizer, particularly in difficult or
ill-conditioned optimization problems. Additionally, further extensions of Adam, such as AdaBelief,
introduce additional modifications to the second moment estimate by introducing a belief-based
mechanism to correct the moment estimates. Specifically, AdaBelief estimates the second moment
v̂t in a way that adjusts based on the belief in the direction of the gradient, offering further stability
in cases where gradients may be sparse or noisy. These innovations underscore the flexibility of
Adam and its variants in optimizing complex loss functions across a range of machine learning tasks.
Ultimately, the Adam optimizer stands as a highly sophisticated, mathematically rigorous opti-
mization algorithm, effectively combining momentum and adaptive learning rates. By using both
the first and second moments of the gradient, Adam dynamically adjusts the parameter updates,
providing a robust and efficient optimization framework for non-convex, high-dimensional objective
functions. The use of bias correction, coupled with the adaptive nature of the optimizer, allows it
to operate effectively across a wide range of problem settings, making it a go-to method for many
machine learning and deep learning applications. The mathematical rigor behind Adam ensures
that it remains a highly stable and efficient optimization technique, capable of overcoming many
of the challenges posed by large-scale and noisy gradient information in machine learning models.
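A compact NumPy sketch of the Adam recursions (278)-(282), with an optional flag implementing the AMSGrad modification of eq. (284), is given below; the quadratic test objective and the hyperparameter values are illustrative assumptions rather than prescriptions.

```python
import numpy as np

def adam(grad, theta0, eta=0.001, beta1=0.9, beta2=0.999,
         eps=1e-8, steps=5000, amsgrad=False):
    theta = theta0.astype(float)
    m = np.zeros_like(theta)      # first moment estimate, eq. (278)
    v = np.zeros_like(theta)      # second moment estimate, eq. (279)
    v_max = np.zeros_like(theta)  # running maximum for AMSGrad, eq. (284)
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g           # eq. (278)
        v = beta2 * v + (1 - beta2) * g**2        # eq. (279)
        m_hat = m / (1 - beta1**t)                # bias correction, eq. (280)
        v_hat = v / (1 - beta2**t)                # bias correction, eq. (281)
        if amsgrad:
            v_max = np.maximum(v_max, v_hat)      # eq. (284): v_hat never decreases
            v_hat = v_max
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)  # eq. (282)
    return theta

# Illustrative use: minimize f(x) = 0.5 x^T diag(100, 1) x from (1, 1).
H = np.diag([100.0, 1.0])
print(adam(lambda x: H @ x, np.array([1.0, 1.0])))  # approaches the minimizer at 0
```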
Literature Review: Liu and Ma (2024) [156] investigated loss oscillations observed in
adaptive optimizers, including RMSProp. It explains how RMSProp’s exponential moving average
mechanism contributes to this phenomenon and proposes a novel perspective on tuning hyperpa-
rameters to mitigate oscillations. Li (2024) [157] explored the fundamental theoretical properties
of adaptive optimizers, with a special focus on RMSProp. It rigorously examines the interplay
between smoothness conditions and the adaptive nature of RMSProp, showing how it balances
stability and convergence speed. Heredia (2024) [158] presented a new mathematical framework
for analyzing RMSProp using integro-differential equations. The model provides deeper theoretical
insights into how RMSProp updates gradients differently from AdaGrad and Adam, particularly
in terms of gradient smoothing. Ye (2024) [159] discussed how preconditioning methods, including
RMSProp, enhance gradient descent optimization. It explains why RMSProp’s adaptive learn-
ing rate is beneficial in high-dimensional settings and provides a theoretical justification for its
effectiveness in regularized optimization problems. Compagnoni et al. (2024) [160] employed
stochastic differential equations (SDEs) to model the behavior of RMSProp and other adaptive
optimizers. It provides new theoretical insights into how noise affects the optimization process
and how RMSProp adapts to different gradient landscapes. Yao et al. (2024) [161] presented
a system response curve analysis of first-order optimization methods, including RMSProp. The
authors develop a dynamic equation for RMSProp that explains its stability and effectiveness in
deep learning tasks. Wen and Lei (2024) [162] explored an alternative optimization framework that
integrates RMSProp-style updates with an ADMM approach. It provides theoretical guarantees
for the convergence of RMSProp in non-convex optimization problems. Hannibal et al. (2024)
[163] critiqued the convergence properties of popular optimizers, including RMSProp. It rigorously
proves that in certain settings, RMSProp may not lead to a global minimum, emphasizing the im-
portance of hyperparameter tuning. Yang (2025) [164] extended the theoretical understanding of
adaptive optimizers like RMSProp by analyzing the impact of bias in stochastic gradient updates.
It provides a rigorous mathematical treatment of how bias affects convergence.
The Root Mean Squared Propagation (RMSProp) optimizer is a sophisticated variant of the gradi-
ent descent algorithm that adapts the learning rate for each parameter in a non-linear, non-convex
optimization problem. The fundamental issue with standard gradient descent lies in the constant
learning rate η, which fails to account for the varying magnitudes of the gradients in different di-
rections of the parameter space. This lack of adaptation can cause inefficient optimization, where
large gradients may lead to overshooting and small gradients lead to slow convergence. RMSProp
addresses this problem by dynamically adjusting the learning rate based on the historical gradient
magnitudes, offering a more tailored and efficient approach. Consider the objective function f (θ),
where θ ∈ Rn is the vector of parameters that we aim to optimize. Let ∇f (θ) denote the gradient
of f (θ) with respect to θ, which is a vector of partial derivatives:
∇f(θ) = (∂f(θ)/∂θ₁, ∂f(θ)/∂θ₂, . . . , ∂f(θ)/∂θₙ)^⊤. (285)
In traditional gradient descent, the update rule for θ is:
θt+1 = θt − η∇f (θt ), (286)
where η is the learning rate, a scalar constant. However, this approach does not account for the fact
that the gradient magnitudes may differ significantly along different directions in the parameter
space, especially in high-dimensional, non-convex functions. The RMSProp optimizer introduces a
solution by adapting the learning rate for each parameter in proportion to the magnitude of the
historical gradients. The key modification in RMSProp is the introduction of a running average
of the squared gradients for each parameter θ_i, denoted as E[g²]_{i,t}, which captures the cumulative
magnitude of the gradients over time. The update rule for E[g²]_{i,t} is given by the exponential
moving average formula:
E[g²]_{i,t} = β E[g²]_{i,t−1} + (1 − β) g²_{i,t}, (287)
where g_{i,t} = ∂f(θ_t)/∂θ_i is the gradient of the objective function with respect to the parameter θ_i at
time step t, and β is the decay factor, typically set close to 1 (e.g., β = 0.9). This recurrence
relation allows the gradient history to influence the current update while exponentially forgetting
older gradient information. The value of β determines the memory of the squared gradients, where
higher values of β give more weight to past gradients. The update for θi in RMSProp is then given
by:
θ_{i,t+1} = θ_{i,t} − (η / √(E[g²]_{i,t} + ϵ)) g_{i,t}, (288)
where ϵ is a small positive constant (typically ϵ = 10⁻⁸) introduced to avoid division by zero and
ensure numerical stability. The term 1/√(E[g²]_{i,t} + ϵ) dynamically adjusts the learning rate for each
parameter based on the magnitude of the squared gradient history. This adjustment allows RMSProp
to take larger steps in directions where gradients have historically been small, and smaller steps
in directions where gradients have been large, leading to a more stable and efficient optimization
process.
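As a concrete illustration of eqs. (287)-(288), the sketch below applies per-parameter RMSProp updates in NumPy; the test objective and the values of β, η, and ϵ are illustrative assumptions.

```python
import numpy as np

def rmsprop(grad, theta0, eta=0.01, beta=0.9, eps=1e-8, steps=2000):
    theta = theta0.astype(float)
    Eg2 = np.zeros_like(theta)                        # running average E[g^2]_{i,t}
    for t in range(steps):
        g = grad(theta)
        Eg2 = beta * Eg2 + (1 - beta) * g**2          # eq. (287)
        theta = theta - eta * g / np.sqrt(Eg2 + eps)  # eq. (288)
    return theta

# The per-parameter normalization equalizes progress across directions whose
# curvatures differ by two orders of magnitude.
H = np.diag([100.0, 1.0])
print(rmsprop(lambda x: H @ x, np.array([1.0, 1.0])))
```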
Mathematically, the key advantage of RMSProp over traditional gradient descent lies in its ability
to adapt the learning rate according to the local geometry of the objective function. In regions
where the objective function is steep (large gradients), RMSProp reduces the effective learning
rate by dividing by √(E[g²]_{i,t}), mitigating the risk of overshooting. Conversely, in flatter regions
with smaller gradients, RMSProp increases the learning rate, allowing for faster convergence. This
self-adjusting mechanism is crucial in high-dimensional optimization tasks, where the gradients
along different directions can vary greatly in magnitude, as is often the case in deep learning tasks
involving neural networks. The exponential moving average of squared gradients used in RMSProp
is analogous to a form of local normalization, where each parameter is scaled by the inverse of the
running average of its gradient squared. This normalization ensures that the optimizer does not
become overly sensitive to gradients in any particular direction, thus stabilizing the optimization
process. In more formal terms, if the objective function f (θ) exhibits sharp curvatures along cer-
tain directions, RMSProp mitigates the effects of such curvatures by scaling down the step size
along those directions. This scaling behavior can be interpreted as a form of gradient re-weighting,
where the influence of each parameter’s gradient is modulated by its historical behavior, making
the optimizer more robust to ill-conditioned optimization problems. The introduction of ϵ ensures
that the denominator never becomes zero, even in the case where the squared gradient history for
a parameter θi becomes extremely small. This is crucial for maintaining the numerical stability of
the algorithm, particularly in scenarios where gradients may vanish or grow exceedingly small over
many iterations, as seen in certain deep learning applications, such as training very deep neural
networks. By providing a small non-zero lower bound to the learning rate, ϵ ensures that the up-
dates remain smooth and predictable.
RMSProp’s performance is heavily influenced by the choice of β, which controls the trade-off
between long-term history and recent gradient information. When β is close to 1, the optimizer
relies more heavily on the historical gradients, which is useful for capturing long-term trends in the
optimization landscape. On the other hand, smaller values of β allow the optimizer to be more re-
sponsive to recent gradient changes, which can be beneficial in highly non-stationary environments
or rapidly changing optimization landscapes. In the context of deep learning, RMSProp is particu-
larly effective for optimizing objective functions with complex, high-dimensional parameter spaces,
such as those encountered in training deep neural networks. The non-convexity of such objective
functions often leads to a gradient that can vary significantly in magnitude across different layers of
the network. RMSProp helps to balance the updates across these layers by adjusting the learning
rate based on the historical gradients, ensuring that all layers receive appropriate updates without
being dominated by large gradients from any single layer. This adaptability helps in preventing
gradient explosions or vanishing gradients, which are common issues in deep learning optimiza-
tion. In summary, RMSProp provides a robust and efficient optimization technique by adapting
the learning rate based on the historical squared gradients of each parameter. The exponential
decay of the squared gradient history allows RMSProp to strike a balance between stability and
adaptability, preventing overshooting and promoting faster convergence in non-convex optimization
problems. The introduction of ϵ ensures numerical stability, and the parameter β offers flexibility
in controlling the influence of past gradients. This makes RMSProp particularly well-suited for
high-dimensional optimization tasks, especially in deep learning applications, where the parameter
space is vast, and gradient magnitudes can differ significantly across dimensions. By effectively
normalizing the gradients and dynamically adjusting the learning rates, RMSProp significantly
enhances the efficiency and stability of gradient-based optimization methods.
Overfitting in neural networks is a critical issue where the model learns to excessively fit the
training data, capturing not just the true underlying patterns but also the noise and anomalies
present in the data. This leads to poor generalization to unseen data, resulting in a model that has
a low training error but a high test error. Mathematically, consider a dataset D = {(x_i, y_i)}_{i=1}^N,
where x_i ∈ R^d represents the input feature vector for each data point, and y_i ∈ R represents the
corresponding target value. The goal is to fit a neural network model f (x; w) parameterized by
weights w ∈ RM , where M denotes the number of parameters in the model. The model’s objective
is to minimize the empirical risk, given by the mean squared error between the predicted values
and the true target values:
R̂(w) = (1/N) Σ_{i=1}^N L(f(x_i; w), y_i) (289)
where L denotes the loss function, typically the squared error L(ŷi , yi ) = (ŷi − yi )2 . In this frame-
work, the neural network tries to minimize the empirical risk on the training set. However, the
true goal is to minimize the expected risk R(w), which reflects the model’s performance on the
true distribution P(x, y) of the data. This expected risk is given by:
R(w) = E_{(x,y)∼P}[L(f(x; w), y)]. (290)
Overfitting occurs when the model minimizes R̂(w) to an excessively small value, but R(w) remains
large, indicating that the model has fit the noise in the training data, rather than capturing the true
data distribution. This discrepancy arises from an overly complex model that learns to memorize
the training data rather than generalizing across different inputs. A fundamental insight into the
overfitting phenomenon comes from the bias-variance decomposition of the generalization error.
The total error in a model's prediction f̂(x) of the true target function g(x) can be decomposed
as:
E = E[(g(x) − f̂(x))²] = Bias²(f̂(x)) + Var(f̂(x)) + σ², (291)
where Bias²(f̂(x)) represents the squared difference between the expected model prediction and
the true function, Var(f̂(x)) is the variance of the model's predictions across different training sets,
and σ² is the irreducible error due to the intrinsic noise in the data. In the context of overfitting,
the model typically exhibits low bias (as it fits the training data very well) but high variance (as
it is highly sensitive to the fluctuations in the training data). Therefore, regularization techniques
aim to reduce the variance of the model while maintaining its ability to capture the true underlying
relationships in the data, thereby improving generalization. One of the most popular methods to
mitigate overfitting is L2 regularization (also known as weight decay), which adds a penalty term
to the loss function based on the squared magnitude of the weights. The regularized loss function
is given by:
R̂_reg(w) = R̂(w) + λ‖w‖₂² = R̂(w) + λ Σ_{j=1}^M w_j² (292)
where λ is a positive constant controlling the strength of the regularization. The gradient of the
regularized loss function with respect to the weights is:
∇_w R̂_reg(w) = ∇_w R̂(w) + 2λw. (293)
The term 2λw introduces weight shrinkage, which discourages the model from fitting excessively
large weights, thus preventing overfitting by reducing the model’s complexity. This regularization
approach is a direct way to control the model’s capacity by penalizing large weight values, leading
to a simpler model that generalizes better. In contrast, L1 regularization adds a penalty based
on the absolute values of the weights:
R̂_reg(w) = R̂(w) + λ‖w‖₁ = R̂(w) + λ Σ_{j=1}^M |w_j| (294)
The gradient of the L1 regularized loss function is:
∇_w R̂_reg(w) = ∇_w R̂(w) + λ sgn(w), (295)
where sgn(w) denotes the element-wise sign function. L1 regularization has a unique property of
inducing sparsity in the weights, meaning it drives many of the weights to exactly zero, effectively
selecting a subset of the most important features. This feature selection mechanism is particularly
useful in high-dimensional settings, where many input features may be irrelevant. A more advanced
regularization technique is dropout, which randomly deactivates a fraction of neurons during
training. Let hi represent the activation of the i-th neuron in a given layer. During training,
dropout produces a binary mask m_i sampled from a Bernoulli distribution with success probability
p, i.e., m_i ∼ Bernoulli(p), such that:
h_i^drop = (1/p) m_i ⊙ h_i, (296)
where ⊙ denotes element-wise multiplication. The factor 1/p ensures that the expected value
of the activations remains unchanged during training. Dropout effectively forces the network to
learn redundant representations, reducing its reliance on specific neurons and promoting better
generalization. By training an ensemble of subnetworks with shared weights, dropout helps to
prevent the network from memorizing the training data, thus reducing overfitting. Early stopping
is another technique to prevent overfitting, which involves halting the training process when the
validation error starts to increase. The model is trained on the training set, but its performance is
evaluated on a separate validation set. If the validation error Rval (t) increases after several epochs,
training is stopped to prevent further overfitting. Mathematically, the stopping criterion is:
t* = arg min_t R_val(t), (297)
where t∗ represents the epoch at which the validation error reaches its minimum. This technique
avoids the risk of continuing to fit the training data beyond the point where the model starts to lose
its ability to generalize. Data augmentation artificially enlarges the training dataset by applying
transformations to the original data. Let T = {T1 , T2 , . . . , TK } represent a set of transformations
(such as rotations, scaling, and translations). For each training example (x_i, y_i), the augmented
dataset D′ consists of K new examples:
D′ = {(T_k(x_i), y_i) : k = 1, . . . , K}. (298)
These transformations create new, varied examples, which help the model generalize better by
preventing it from fitting too closely to the original, potentially noisy data. Data augmentation is
particularly beneficial in domains like image processing, where transformations like rotations and
flips do not change the underlying label but provide additional examples to learn from. Batch
normalization normalizes the activations of each mini-batch to reduce internal covariate shift and
stabilize the learning process. Given a mini-batch B = {h_i}_{i=1}^m with activations h_i, the mean and
variance of the activations across the mini-batch are computed as:
μ_B = (1/m) Σ_{i=1}^m h_i,  σ_B² = (1/m) Σ_{i=1}^m (h_i − μ_B)². (299)
The activations are then normalized as
ĥ_i = (h_i − μ_B) / √(σ_B² + ϵ), (300)
which stabilizes the distribution of inputs to subsequent layers and adds a mild stochastic regularization,
preventing the model from getting stuck in sharp, narrow minima in the loss landscape.
In conclusion, overfitting is a significant challenge in training neural networks, and its prevention
requires a combination of techniques aimed at controlling model complexity, improving generaliza-
tion, and reducing sensitivity to noise in the training data. Regularization methods such as L2
and L1 regularization, dropout, and early stopping, combined with strategies like data augmenta-
tion and batch normalization, are fundamental to improving the performance of neural networks
on unseen data and ensuring that they do not overfit the training set. The mathematical formu-
lations and optimization strategies outlined here provide a detailed and rigorous framework for
understanding and mitigating overfitting in machine learning models.
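To tie eqs. (293) and (297) together, here is a minimal sketch of a training loop with L2 weight decay and patience-based early stopping on a held-out validation set; the synthetic data, λ, patience, and step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
w_true = rng.normal(size=10)
y = X @ w_true + 0.5 * rng.normal(size=200)
Xtr, ytr, Xval, yval = X[:150], y[:150], X[150:], y[150:]

def mse(w, A, b):
    return np.mean((A @ w - b) ** 2)

w = np.zeros(10)
eta, lam, patience = 0.05, 0.01, 20
best_val, best_w, since_best = np.inf, w.copy(), 0
for t in range(5000):
    # Gradient of the regularized empirical risk: the 2*lam*w term is the
    # weight shrinkage of eq. (293).
    g = 2 * Xtr.T @ (Xtr @ w - ytr) / len(ytr) + 2 * lam * w
    w -= eta * g
    val = mse(w, Xval, yval)
    if val < best_val:
        best_val, best_w, since_best = val, w.copy(), 0
    else:
        since_best += 1
        if since_best >= patience:   # stop near t* = argmin_t R_val(t), eq. (297)
            break

print("validation MSE at t*:", best_val)
```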
6.3.1 Dropout
Dropout, a regularization technique in neural networks, is designed to address overfitting, a situ-
ation where a model performs well on training data but fails to generalize to unseen data. The
general problem of overfitting in machine learning arises when a model becomes excessively com-
plex, with a high number of parameters, and learns to model noise in the data rather than the
true underlying patterns. This can result in poor generalization performance on new, unseen data.
In the context of neural networks, the solution often involves regularization techniques to penalize
complexity and prevent the model from memorizing the data. Dropout, introduced by Geoffrey
Hinton et al., represents a unique and powerful method to regularize neural networks by introducing
stochasticity during the training process, which forces the model to generalize better and prevents
overfitting. To understand the mathematics behind dropout, let fθ (x) represent the output of a
neural network for input x with parameters θ. The goal during training is to minimize a loss func-
tion that measures the discrepancy between the predicted output and the true target y. Without
any regularization, the objective is to minimize the empirical loss:
L_empirical(θ) = (1/N) Σ_{i=1}^N L(f_θ(x_i), y_i) (301)
where L(fθ (xi ), yi ) is the loss function (e.g., cross-entropy or mean squared error), and N is the
number of data samples. A model trained to minimize this loss function without regularization will
likely overfit to the training data, capturing the noise rather than the underlying distribution of
the data. Dropout addresses this by randomly “dropping out” a fraction of the network’s neurons
during each training iteration, which is mathematically represented by modifying the activations
of neurons.
Let us consider a feedforward neural network with a set of activations ai for the neurons in the i-th
layer, which is computed as ai = f (W xi + bi ), where W represents the weight matrix, xi the input
to the neuron, and bi the bias. During training with dropout, for each neuron, a random Bernoulli
variable ri is introduced, where:
ri ∼ Bernoulli(p) (302)
with probability p representing the retention probability (i.e., the probability that a neuron is kept
active), and 1 − p representing the probability that a neuron is “dropped” (set to zero). The
activation of the i-th neuron is then modified as follows:
a′i = ri · ai = ri · f (W xi + bi ) (303)
where ri is a random binary mask for each neuron. During each forward pass, different neurons are
randomly dropped out, and the network is effectively training on a different subnetwork, forcing
the network to learn a more robust set of features that do not depend on any particular neuron.
In this way, dropout acts as a form of ensemble learning, as each forward pass corresponds to a
60
different realization of the network.
The mathematical expectation of the loss function with respect to the dropout mask r can be
written as:
E_r[L_dropout(θ, r)] = (1/N) Σ_{i=1}^N E_r[L(f_θ(x_i, r), y_i)] (304)
where fθ (xi , r) is the output of the network with the dropout mask r. Since the dropout mask
is random, the loss is an expectation over all possible configurations of dropout masks. This
randomness induces an implicit ensemble effect, where the model is trained not just on a single set of
parameters θ, but effectively on a distribution of models, each corresponding to a different dropout
configuration. The model is, therefore, regularized because the network is forced to generalize
across these different subnetworks, and overfitting to the training data is prevented. One way to
gain deeper insight into dropout is to consider its connection with Bayesian inference. In the context
of deep learning, dropout can be viewed as an approximation to Bayesian posterior inference. In
Bayesian terms, we seek the posterior distribution of the network’s parameters θ, given the data
D, which can be written as:
p(θ|D) = p(D|θ) p(θ) / p(D), (305)
where p(D|θ) is the likelihood of the data given the parameters, p(θ) is the prior distribution over the
parameters, and p(D) is the marginal likelihood of the data. Dropout approximates this posterior
by averaging over the outputs of many different subnetworks, each corresponding to a different
dropout configuration. This interpretation is formalized by observing that each forward pass with
a different dropout mask corresponds to a different realization of the model, and averaging over
all dropout masks gives an approximation to the Bayesian posterior. Thus, the expected output of
the network, given the data x, under dropout is:
E_r[f_θ(x)] ≈ (1/M) Σ_{i=1}^M f_θ(x, r_i), (306)
where ri is a dropout mask drawn from the Bernoulli distribution and M is the number of Monte
Carlo samples of dropout configurations. This expectation can be interpreted as a form of ensemble
averaging, where each individual forward pass corresponds to a different model sampled from the
posterior.
Dropout is also highly effective because it controls the bias-variance tradeoff. The bias-variance
tradeoff is a fundamental concept in statistical learning, where increasing model complexity reduces
bias but increases variance, and vice versa. A highly complex model tends to have low bias but
high variance, meaning it fits the training data very well but fails to generalize to new data. Regu-
larization techniques, such as dropout, seek to reduce variance without increasing bias excessively.
Dropout achieves this by introducing stochasticity in the learning process. By randomly deacti-
vating neurons during training, the model is forced to learn robust features that do not depend on
the presence of specific neurons. In mathematical terms, the variance of the model’s output can be
expressed as:
Var(f_θ(x)) = E_r[(f_θ(x))²] − (E_r[f_θ(x)])² (307)
By averaging over multiple dropout configurations, the variance is reduced, leading to better gener-
alization performance. Although dropout introduces some bias by reducing the network’s capacity
(since fewer neurons are available at each step), the variance reduction outweighs the bias increase,
resulting in improved generalization. Another key mathematical aspect of dropout is its relation-
ship with stochastic gradient descent (SGD). In the standard SGD framework, the parameters θ are
updated using the gradient of the loss with respect to the parameters. In the case of dropout, the
61
gradient is computed based on a stochastic subnetwork at each training iteration, which introduces
an element of randomness into the optimization process. The parameter update rule with dropout
can be written as:
θ_{t+1} = θ_t − η ∇_θ E_r[L_dropout(θ, r)], (308)
where η is the learning rate, and ∇θ is the gradient of the loss with respect to the model parame-
ters. The expectation is taken over all possible dropout configurations, which means that at each
step, the gradient update is based on a different realization of the model. This stochasticity helps
the optimization process by preventing the model from getting stuck in local minima, improving
convergence towards global minima, and enhancing generalization. Finally, it is important to note
that dropout has a close connection with low-rank approximations. During each forward pass with
dropout, certain neurons are effectively removed, which reduces the rank of the weight matrix,
as some rows or columns of the matrix are set to zero. This stochastic reduction in rank forces
the network to learn lower-dimensional representations of the data, effectively performing low-rank
regularization. This aspect of dropout can be formalized by observing that each dropout mask
corresponds to a sparse matrix, and the network is effectively learning a low-rank approximation
of the data distribution. By doing so, dropout prevents the network from learning overly complex
representations that could overfit the data, leading to improved generalization.
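The following sketch implements inverted dropout as in eqs. (302)-(303), with the 1/p rescaling of eq. (296) applied at training time so that activations keep their expected value; the layer shape and retention probability are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def dropout_forward(a, p=0.8, train=True):
    """Inverted dropout on activations a with retention probability p."""
    if not train:
        return a                              # test time: use the full network
    r = rng.binomial(1, p, size=a.shape)      # mask r_i ~ Bernoulli(p), eq. (302)
    return (r * a) / p                        # a'_i = r_i * a_i, rescaled, eq. (296)

# Illustrative use on one hidden layer's ReLU activations.
a = np.maximum(0, rng.normal(size=(4, 8)))
a_drop = dropout_forward(a, p=0.8)
print(np.mean(a), np.mean(a_drop))            # equal in expectation over masks
```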
Given a set of n observations {(x_i, y_i)}_{i=1}^n, where each x_i ∈ R^p is a feature vector and y_i ∈ R
is the corresponding target value, the task is to find a parameter vector θ ∈ R^p that minimizes
the loss function. In standard linear regression, the objective is to minimize the mean squared
error (MSE), defined as:
L(θ) = (1/n) Σ_{i=1}^n (y_i − x_i^⊤ θ)² = (1/n) ‖Xθ − y‖², (309)
where X ∈ R^{n×p} is the design matrix, with rows x_i^⊤, and y ∈ R^n is the vector of target values.
The solution to this problem, without any regularization, is given by the ordinary least squares
(OLS) solution:
θ̂_OLS = (X^⊤ X)^{−1} X^⊤ y. (310)
This formulation, however, can lead to overfitting when p is large or when XT X is nearly singular.
In such cases, regularization is used to modify the loss function, adding a penalty term R(θ) to
the objective function that discourages large values for the parameters θi . The regularized loss
function is given by:
Lregularized (θ) = L(θ) + λR(θ) (311)
where λ is a regularization parameter that controls the strength of the penalty. The term
R(θ) penalizes the complexity of the model by imposing constraints on the magnitude of the
coefficients. Let us explore two widely used forms of regularization: L1 regularization (Lasso)
and L2 regularization (Ridge). L1 regularization involves adding the ℓ1 -norm of the parameter
vector θ as the penalty term:
R_L1(θ) = Σ_{i=1}^p |θ_i|. (312)
This formulation promotes sparsity in the parameter vector θ, causing many coefficients to be-
come exactly zero, effectively performing feature selection. In high-dimensional settings where
many features are irrelevant, L1 regularization helps reduce the model complexity by forcing ir-
relevant features to be excluded from the model. The effect of the L1 penalty can be understood
geometrically by noting that the constraint region defined by the ℓ1 -norm is a diamond-shaped
region in p-dimensional space. When solving this optimization problem, the coefficients often lie
on the boundary of this diamond, leading to a sparse solution with many coefficients being exactly
zero. Mathematically, the soft-thresholding solution that arises from solving the L1 regularized
optimization problem is given (in the orthonormal-design case) by
θ̂_j = S_λ(θ̂_j^OLS), (313)
where S_λ(z) = sgn(z) max(|z| − λ, 0) (314)
is the soft-thresholding operator.
This soft-thresholding property drives coefficients to zero when their magnitude is less than λ,
resulting in a sparse solution. L2 regularization, on the other hand, uses the ℓ2 -norm of the
parameter vector θ as the penalty term:
R_L2(θ) = Σ_{i=1}^p θ_i². (315)
i=1
This penalty term does not force any coefficients to be exactly zero but rather shrinks the coeffi-
cients towards zero, effectively reducing their magnitudes. The L2 regularization helps stabilize the
solution when there is multicollinearity in the features by reducing the impact of highly correlated
features. The optimization problem with L2 regularization,
min_θ (1/n) ‖Xθ − y‖² + λ Σ_{i=1}^p θ_i², (316)
leads to a ridge regression solution, which is given by the following expression:
θ̂_ridge = (X^⊤ X + λI)^{−1} X^⊤ y, (317)
where I is the identity matrix. The L2 penalty introduces a circular or spherical constraint in
the parameter space, resulting in a solution where all coefficients are reduced in magnitude, but
none are eliminated. The Elastic Net regularization is a hybrid technique that combines both L1
and L2 regularization. The regularized loss function for Elastic Net is given by:
L_ElasticNet(θ) = (1/n) ‖Xθ − y‖² + λ₁ Σ_{i=1}^p |θ_i| + λ₂ Σ_{i=1}^p θ_i². (318)
In this case, λ1 and λ2 control the strength of the L1 and L2 penalties, respectively. The Elas-
tic Net regularization is particularly useful when dealing with datasets where many features are
correlated, as it combines the sparsity-inducing property of L1 regularization with the stability-
enhancing property of L2 regularization. The Elastic Net has been shown to outperform L1 and
L2 regularization in some cases, particularly when there are groups of correlated features. The
optimization problem can be solved using coordinate descent or proximal gradient methods,
which efficiently handle the mixed penalties. The choice of regularization parameter λ is critical
in controlling the bias-variance tradeoff. A small value of λ leads to a low-penalty model that
is more prone to overfitting, while a large value of λ forces the coefficients to shrink towards zero,
potentially leading to underfitting. Thus, it is important to select an optimal value for λ to strike
a balance between bias and variance. This can be achieved by using cross-validation techniques,
where the model is trained on a subset of the data, and the performance is evaluated on the re-
maining data.
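A short sketch of eqs. (310), (313)-(314), and (317) follows: the closed-form OLS and ridge estimators, plus coordinate-wise soft-thresholding under an (assumed) orthonormal design; the data and λ values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 100, 5
X, _ = np.linalg.qr(rng.normal(size=(n, p)))   # orthonormal columns: X^T X = I
theta_true = np.array([2.0, -1.5, 0.0, 0.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=n)

theta_ols = np.linalg.solve(X.T @ X, X.T @ y)  # eq. (310)
lam = 0.3
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)  # eq. (317)
# Soft-thresholding S_lambda applied to the OLS coefficients, eqs. (313)-(314).
theta_lasso = np.sign(theta_ols) * np.maximum(np.abs(theta_ols) - lam, 0.0)

print("OLS:  ", np.round(theta_ols, 3))    # no shrinkage
print("Ridge:", np.round(theta_ridge, 3))  # all coefficients shrunk, none zeroed
print("Lasso:", np.round(theta_lasso, 3))  # small coefficients set exactly to zero
```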
Razavi-Termeh et al. (2025) [144] explored the role of geospatial artificial intelligence (GeoAI)
in mapping flood-prone areas, leveraging metaheuristic algorithms for hyperparameter tuning. It
offers insights into machine learning applications in environmental science. Kiran and Ozyildirim
(2022) [145] proposed a distributed variable-length genetic algorithm to optimize hyperparameters
in reinforcement learning (RL), improving training efficiency and robustness. Unlike traditional
deep RL, which lacks extensive tuning due to complexity, their approach systematically enhances
performance across various RL tasks, outperforming Bayesian methods. Their results show that more
generations yield optimal, computationally efficient solutions, advancing RL for real-world applications.
Formally, hyperparameter tuning is a bilevel optimization problem:
h* = arg min_{h∈H} L_val(θ*(h); h), where θ*(h) = arg min_θ L_train(θ; h). (319)
Here, H denotes the hyperparameter space, which is often high-dimensional, non-convex, and
computationally expensive to traverse. The training loss function Ltrain (θ; h) is typically represented
as an empirical risk computed over the training dataset {(x_i, y_i)}_{i=1}^N:
L_train(θ; h) = (1/N) Σ_{i=1}^N ℓ(f(x_i; θ, h), y_i), (320)
where f (xi ; θ, h) is the neural network output given the input xi , parameters θ, and hyperparameters
h, and ℓ(a, b) is the loss function quantifying the discrepancy between prediction a and ground truth
b. For classification tasks, ℓ often takes the form of cross-entropy loss:
ℓ(a, b) = − Σ_{k=1}^C b_k log a_k, (321)
k=1
where C is the number of classes, and ak and bk are the predicted and true probabilities for the
k-th class, respectively. Central to the training process is the optimization of θ via gradient-based
methods such as stochastic gradient descent (SGD). The parameter updates are governed by:
θ^(t+1) = θ^(t) − η ∇_θ L_train(θ^(t); h), (322)
where η > 0 is the learning rate, a critical hyperparameter controlling the step size. The stability
and convergence of SGD depend on η, which must satisfy:
0 < η < 2 / λ_max(H), (323)
where λ_max(H) is the largest eigenvalue of the Hessian matrix H = ∇²_θ L_train(θ; h). This condition
ensures that the gradient descent steps do not overshoot the minimum. To analyze convergence
behavior, the loss function Ltrain (θ; h) near a critical point θ∗ can be approximated via a second-
order Taylor expansion:
L_train(θ; h) ≈ L_train(θ*; h) + (1/2) (θ − θ*)^⊤ H (θ − θ*), (324)
where H is the Hessian matrix of second derivatives. The eigenvalues of H reveal the local cur-
vature of the loss surface, with positive eigenvalues indicating directions of convexity and negative
eigenvalues corresponding to saddle points. Regularization is often introduced to improve gener-
alization by penalizing large parameter values. For L2 regularization, the modified training loss
is:
L_train^reg(θ; h) = L_train(θ; h) + (λ/2) ‖θ‖₂², (325)
where λ > 0 is the regularization coefficient. The gradient of the regularized loss becomes:
∇_θ L_train^reg(θ; h) = ∇_θ L_train(θ; h) + λθ. (326)
Another key hyperparameter is the weight initialization strategy, which affects the scale of activa-
tions and gradients throughout the network. For a layer with nin inputs, He initialization samples
weights from:
w_ij ∼ N(0, 2/n_in), (327)
to ensure that the variance of activations remains stable as data propagate through layers. The
activation function g(z) also plays a crucial role. The Rectified Linear Unit (ReLU), defined as
g(z) = max(0, z), introduces sparsity and mitigates vanishing gradients. However, it suffers from
the "dying neuron" problem, as its derivative g′(z) is zero for z ≤ 0. The search for optimal hyper-
parameters can be approached using grid search, random search, or more advanced methods like
Bayesian optimization. In Bayesian optimization, a surrogate model p(Lval (h)), often a Gaussian
Process (GP), is constructed to approximate the validation loss. The acquisition function a(h), such
as Expected Improvement (EI), guides the exploration of H by balancing exploitation of regions
with low predicted loss and exploration of uncertain regions:
a_EI(h) = E[max(0, L_val,min − L_val(h))], (328)
where L_val,min is the best observed validation loss. Hyperparameter tuning is computationally
intensive due to the high dimensionality of H and the nested nature of the optimization problem.
Early stopping, a widely used strategy, halts training when the improvement in validation loss falls
below a threshold:
|L_val^(t+1) − L_val^(t)| / L_val^(t) < ϵ, (329)
where ϵ > 0 is a small constant. Advanced techniques like Hyperband leverage multi-fidelity op-
timization, allocating resources dynamically to promising hyperparameter configurations based on
partial training evaluations.
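As a sketch of the acquisition step behind eq. (328), the closed-form Expected Improvement under a Gaussian surrogate posterior (a standard identity, stated here as an assumption about the surrogate) can be computed as follows; μ(h) and σ(h) would come from a fitted Gaussian Process.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, l_val_min):
    """Closed-form EI for minimization, eq. (328), assuming the surrogate posterior
    L_val(h) ~ N(mu, sigma^2) at each candidate h."""
    sigma = np.maximum(sigma, 1e-12)          # guard against zero predictive variance
    z = (l_val_min - mu) / sigma
    return (l_val_min - mu) * norm.cdf(z) + sigma * norm.pdf(z)

# Illustrative candidates: surrogate means and uncertainties for three h values.
mu = np.array([0.30, 0.25, 0.28])             # predicted validation losses
sigma = np.array([0.01, 0.05, 0.10])          # predictive standard deviations
print(expected_improvement(mu, sigma, l_val_min=0.27))  # pick argmax as the next h
```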
Nandi et al. (2025) [399] examined the use of Grid Search for deep learning hyperparameter tuning
in baby cry sound recognition systems. The authors present a novel pipeline that systematically
selects the best hyperparameters for neural networks, improving both precision and recall in sound
classification. Sianga et al. (2025) [400] applied Grid Search and Randomized Search to optimize
machine learning models predicting cardiovascular disease risk. The study finds that Grid Search
consistently outperforms randomized methods in accuracy, highlighting its effectiveness in medical
diagnostic models. Li et al. (2025) [401] applied Stratified 5-fold cross-validation combined with
Grid Search to fine-tune Extreme Gradient Boosting (XGBoost) models in predicting post-surgical
complications. The results suggest that hyperparameter tuning significantly improves predictive
performance, with Grid Search leading to the best model stability and interpretability. Lázaro
et al. (2025) [402] implemented Grid Search and Bayesian Optimization to optimize K-Nearest
Neighbors (KNN) and Decision Trees for incident classification in aviation safety. The research
underscores how different hyperparameter tuning methods affect the generalization of machine
learning models in NLP-based accident reports. Li et al. (2025) [403] proposed RAINER, an en-
semble learning model that integrates Grid Search for optimal hyperparameter tuning. The study
demonstrates how parameter optimization enhances the predictive capabilities of rainfall models,
making Grid Search an essential step in climate modeling. Khurshid et al. (2025) [404] compared
Bayesian Optimization with Grid Search for hyperparameter tuning in diabetes prediction models.
The study finds that while Bayesian methods are computationally faster, Grid Search delivers more
precise hyperparameter selection, especially for models with structured medical data. Kanwar et
al. (2025) [405] applied Grid Search for tuning Random Forest classifiers in landslide susceptibility
mapping. The study demonstrates that fine-tuned models improve the identification of high-risk
zones, reducing false positives in predictive landslide models. Fadil et al. (2025) [406] evaluated
the role of Grid Search and Random Search in hyperparameter tuning for XGBoost regression
models in corrosion prediction. The authors find that Grid Search-based models achieve higher R²
scores, making them ideal for complex chemical modeling applications.
Grid search is a highly structured and exhaustive method for hyperparameter tuning in machine
learning, where a predetermined grid of hyperparameter values is systematically explored. The goal
is to identify the set of hyperparameters h = (h_1, h_2, . . . , h_p) that yields the optimal performance
metric for a given machine learning model. Let p represent the total number of hyperparameters
to be tuned, and for each hyperparameter h_i, let the candidate set be H_i = {h_i^1, h_i^2, . . . , h_i^{m_i}},
where mi is the number of candidate values for hi . The hyperparameter search space is then the
Cartesian product of all candidate sets:
S = H1 × H2 × · · · × Hp . (330)
Thus, the total number of configurations to be evaluated is:
|S| = ∏_{i=1}^p m_i. (331)
For example, if we have two hyperparameters h1 and h2 with 3 possible values each, the total
number of combinations to explore is 9. This search space grows exponentially as the number
of hyperparameters increases, posing a significant computational challenge. Grid search involves
iterating over all configurations in S, evaluating the model’s performance for each configuration.
Let us define the performance metric M(h, D_train, D_val), which quantifies the model's performance
for a given hyperparameter configuration h, where D_train and D_val are the training and validation
datasets, respectively. This metric might represent accuracy, error rate, F1-score, or any other
relevant criterion, depending on the problem at hand. The hyperparameters are then tuned by
maximizing or minimizing M across the search space:
⃗h∗ = arg max M (⃗h, Dtrain , Dval ), (332)
⃗h∈S
or in the case of a minimization problem:
h* = arg min_{h∈S} M(h, D_train, D_val). (333)
For each hyperparameter combination, the model is trained on Dtrain and evaluated on Dval . The
process requires the repeated evaluation of the model over all |S| configurations, each yielding a
performance metric. To mitigate overfitting and ensure the reliability of the performance metric,
S Dtrain is partitioned into
cross-validation is frequently used. In k-fold cross-validation, the dataset
(j)
k disjoint subsets D1 , D2 , . . . , Dk . The model is trained on Dtrain = i̸=j Di and validated on Dj .
For each fold j, we compute the performance metric:
(j)
Mj (⃗h) = M (⃗h, Dtrain , Dj ). (334)
The overall cross-validation performance for a hyperparameter configuration ⃗h is the average of the
k individual fold performances:
k
1X
M (⃗h) = Mj (⃗h). (335)
k j=1
Thus, the grid search with cross-validation aims to find the optimal hyperparameters by maximizing
or minimizing the average performance across all folds. The computational complexity of grid search
is a key consideration. If we denote C as the cost of training and evaluating the model for a single
configuration, the total cost for grid search is:
O(∏_{i=1}^p m_i · k · C), (336)
where k represents the number of folds in cross-validation. This results in an exponential increase in
the total computation time as the number of hyperparameters p and the number of candidate values
mi increase. For large search spaces, grid search can become computationally expensive, making it
infeasible for high-dimensional hyperparameter optimization problems. To illustrate with a specific
example, consider two hyperparameters h₁ and h₂ with the following sets of candidate values:
H₁ = {0.01, 0.1, 1.0}, H₂ = {0.1, 1.0, 10.0}. (337)
The search space is then
S = H₁ × H₂ = {(0.01, 0.1), (0.01, 1.0), (0.01, 10.0), (0.1, 0.1), . . . , (1.0, 10.0)}. (338)
There are 9 configurations to evaluate. For each configuration, assume we perform 3-fold cross-
validation; for the configuration (h₁, h₂) = (0.1, 1.0), the three fold metrics might be
M₁(0.1, 1.0) = 0.85, M₂(0.1, 1.0) = 0.87, M₃(0.1, 1.0) = 0.86, (339)
giving the cross-validated average
M̄(0.1, 1.0) = (0.85 + 0.87 + 0.86)/3 = 0.86. (340)
This process is repeated for all 9 combinations of h₁ and h₂. Grid search, while exhaustive and
deterministic, can fail to efficiently explore the hyperparameter space, especially when the num-
ber of hyperparameters is large. The search is confined to a discrete grid and cannot interpolate
between points to capture optimal configurations that may lie between grid values. Furthermore,
because grid search evaluates each configuration independently, it can be computationally expen-
sive for high-dimensional spaces, as the number of configurations grows exponentially with p and mi .
In conclusion, grid search is a methodologically rigorous and systematic approach to hyperpa-
rameter optimization, ensuring that all predefined configurations are evaluated exhaustively. How-
ever, its computational cost increases exponentially with the number of hyperparameters and their
respective candidate values, which can limit its applicability for large-scale problems. As a re-
sult, more advanced techniques such as random search, Bayesian optimization, or evolutionary
algorithms are often used for hyperparameter tuning when the computational budget is limited.
Despite these challenges, grid search remains a powerful tool for demonstrating the principles of
hyperparameter tuning and is well-suited for problems with relatively small search spaces.
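The sketch below enumerates the Cartesian product S of eq. (330) and scores each configuration with k-fold cross-validation as in eqs. (334)-(335); the two candidate grids, the ridge model, and the data are illustrative assumptions standing in for a real training pipeline.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
X = rng.normal(size=(120, 4))
y = X @ rng.normal(size=4) + 0.3 * rng.normal(size=120)

def ridge_fit(A, b, lam):
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ b)

def cv_score(lam, scale, k=3):
    """Average validation MSE over k folds, eqs. (334)-(335)."""
    folds = np.array_split(np.arange(len(y)), k)
    scores = []
    for j in range(k):
        val = folds[j]
        tr = np.setdiff1d(np.arange(len(y)), val)
        w_hat = ridge_fit(scale * X[tr], y[tr], lam)
        scores.append(np.mean((scale * X[val] @ w_hat - y[val]) ** 2))
    return np.mean(scores)

# Cartesian product S = H1 x H2, eq. (330); 3 x 3 = 9 configurations, eq. (331).
H1 = [0.01, 0.1, 1.0]   # candidate regularization strengths
H2 = [0.5, 1.0, 2.0]    # candidate feature scalings (second illustrative hyperparameter)
best = min(product(H1, H2), key=lambda h: cv_score(*h))
print("best (lam, scale):", best)
```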
In machine learning, hyperparameter tuning is the process of selecting the best configuration of
hyperparameters h = (h1 , h2 , . . . , hd ), where each hi represents the i-th hyperparameter. The hy-
perparameters h control key aspects of model learning, such as the learning rate, regularization
strength, or the architecture of the neural network. These hyperparameters are not directly opti-
mized through the learning process itself but are instead set before training begins. Given a set
of hyperparameters, the model performance is evaluated by computing a loss function L(h), which
typically represents the error on a validation set, and possibly regularization terms to mitigate
overfitting. The objective is to minimize this loss function to find the optimal set of hyperparameters:
h* = arg min_h L(h), (341)
where L(h) is the loss function that quantifies how well the model generalizes to unseen data. The
minimization of this function is often subject to constraints on the range or type of values that
each h_i can take, forming a constrained optimization problem:
h* = arg min_{h∈H} L(h), (342)
where H represents the feasible hyperparameter space. Hyperparameter tuning is typically carried
out by selecting a search method that explores this space efficiently, with the goal of finding the
global or local optimum of the loss function.
One such search method is random search, which is a straightforward yet effective approach
to exploring the hyperparameter space. Instead of exhaustively searching over a grid of val-
ues for each hyperparameter (as in grid search), random search samples hyperparameters ht =
(ht,1 , ht,2 , . . . , ht,d ) from a predefined distribution for each hyperparameter hi . For each iteration
t, the hyperparameters are independently sampled from probability distributions Di associated
with each hyperparameter hi , where the probability distribution might be continuous or discrete.
Specifically, for continuous hyperparameters, ht,i is drawn from a uniform or normal distribution
over an interval Hi = [ai , bi ]:
ht,i ∼ U(ai , bi ), ht,i ∈ Hi , (343)
where U(ai , bi ) denotes the uniform distribution between ai and bi . For discrete hyperparameters,
h_{t,i} is sampled from a discrete set of values H_i = {h_i^1, h_i^2, . . . , h_i^{N_i}}, with each value equally probable:
h_{t,i} ∼ D_i, P(h_{t,i} = h_i^j) = 1/N_i, j = 1, . . . , N_i, (344)
where D_i denotes the discrete distribution over the set {h_i^1, h_i^2, . . . , h_i^{N_i}}. Thus, each hyperpa-
rameter is selected independently from its corresponding distribution. After selecting a new set of
hyperparameters ht , the model is trained with this configuration, and its performance is evaluated
by computing the loss function L(ht ). The process is repeated for T iterations, generating a se-
quence of hyperparameter configurations h1 , h2 , . . . , hT , and for each configuration, the associated
loss function values L(h1 ), L(h2 ), . . . , L(hT ) are computed. The optimal set of hyperparameters h∗
is then selected as the one that minimizes the loss:
h* = arg min_{1≤t≤T} L(h_t). (345)
Thus, random search performs an approximate optimization of the hyperparameter space, where
the computational cost per iteration is C (the time to evaluate the model’s performance for a
given set of hyperparameters), and the total computational cost is O(T · C). This makes random
search a computationally feasible approach, especially when T is moderate. The computational
efficiency of random search can be compared to that of grid search, which exhaustively searches the
hyperparameter space by discretizing each hyperparameter h_i into a set of values h_i^1, h_i^2, . . . , h_i^{n_i},
where n_i is the number of values for the i-th hyperparameter. The total number of grid search
configurations is given by:
N_grid = ∏_{i=1}^d n_i, (346)
and the computational cost of grid search is O(Ngrid ·C), which grows exponentially with the number
of hyperparameters d. In this sense, grid search can become prohibitively expensive when the
dimensionality d of the hyperparameter space is large. Random search, on the other hand, requires
only T evaluations, and since each evaluation is independent of the others, the computational
cost grows linearly with T , making it more efficient when d is large. The probabilistic nature of
random search further enhances its efficiency. Suppose that only a subset of hyperparameters,
say k, significantly influences the model’s performance. Let S be the subspace of H consisting of
hyperparameter configurations that produce low loss values, and let the complementary space H\S
correspond to configurations that are unlikely to achieve low loss. In this case, the task becomes one
of searching within the subspace S, rather than the entire space H. The random search method is
well-suited to such problems, as it can probabilistically focus on the relevant subspace by drawing
hyperparameter values from distributions Di that prioritize areas of the hyperparameter space
with low loss. More formally, the probability of selecting a hyperparameter set ht from the relevant
subspace S is given by:
P(h_t ∈ S) = ∏_{i=1}^d P(h_{t,i} ∈ S_i), (347)
where Si is the relevant region for the i-th hyperparameter, and P (ht,i ∈ Si ) is the probability that
the i-th hyperparameter lies within the relevant region. As the number of iterations T increases, the
probability that at least one sampled configuration falls in S increases as well, approaching
1 as T → ∞:
P(∃ t ≤ T : h_t ∈ S) = 1 − (1 − P₀)^T, (348)
where P0 is the probability of sampling a hyperparameter set from the relevant subspace in one
iteration. Thus, random search tends to explore the subspace of low-loss configurations, improving
the chances of finding an optimal or near-optimal configuration as T increases.
The exploration behavior of random search contrasts with that of grid search, which, despite its
systematic nature, may fail to efficiently explore sparsely populated regions of the hyperparam-
eter space. When the hyperparameter space is high-dimensional, the grid search must evaluate
exponentially many configurations, regardless of the relevance of the hyperparameters. This leads
to inefficiencies when only a small fraction of hyperparameters significantly contribute to the loss
function. Random search, by sampling independently and uniformly across the entire space, is
not subject to this curse of dimensionality and can more effectively locate regions that matter for
model performance. Mathematically, random search has an additional advantage when the hyper-
parameters exhibit smooth or continuous relationships with the loss function. In this case, random search can probe the space probabilistically, discovering trends in the loss surface that grid search, due to its fixed grid structure, may miss. Furthermore, random search is capable of finding the optimum
even when the loss function is non-convex, provided that the space is explored adequately. This
becomes particularly relevant in the presence of highly irregular loss surfaces, as random search
has the potential to escape local minima more effectively than grid search, which is constrained by
its fixed sampling grid.
In conclusion, random search is a highly efficient and scalable approach for hyperparameter op-
timization in machine learning. By sampling hyperparameters from predefined probability dis-
tributions and evaluating the associated loss function, random search provides a computationally
feasible method for high-dimensional hyperparameter spaces, outperforming grid search in many
cases. Its probabilistic nature allows it to focus on relevant regions of the hyperparameter space,
making it particularly advantageous when only a subset of hyperparameters significantly impacts
the model’s performance. As the number of iterations T increases, random search becomes more
likely to converge to the optimal configuration, making it a powerful tool for hyperparameter tuning
in complex models.
6.4.3 Bayesian Optimization
Literature Review: Chang et. al. (2025) [414] applied Bayesian Optimization (BO) for hy-
perparameter tuning in machine learning models used for predicting landslide displacement. It
explores the impact of BO in optimizing Support Vector Machines (SVM), Long Short-Term Mem-
ory (LSTM), and Gated Recurrent Units (GRU), demonstrating how Bayesian techniques improve
model accuracy and convergence rates. Cihan (2025) [415] used Bayesian Optimization to fine-
tune XGBoost, LightGBM, Elastic Net, and Adaptive Boosting models for predicting biomass
gasification output. The study finds that Bayesian Optimization outperforms Grid and Random
Search in reducing computational overhead while improving predictive accuracy. Makomere et. al.
(2025) [416] integrated Bayesian Optimization for hyperparameter tuning in deep learning-based
industrial process modeling. The study provides insights into how BO improves model generaliza-
tion and reduces prediction errors in chemical process monitoring. Bakır (2025) [417] introduced
TuneDroid, an automated Bayesian Optimization-based framework for hyperparameter tuning of
Convolutional Neural Networks (CNNs) used in cybersecurity. The results suggest that Bayesian
Optimization accelerates model training while improving malware detection accuracy. Khurshid et.
al. (2025) [404] compared Bayesian Optimization and Random Search for tuning hyperparameters
in XGBoost-based diabetes prediction models. It concludes that Bayesian Optimization provides
a superior trade-off between speed and accuracy compared to traditional search methods. Liu et.
al. (2025) [418] explored Bayesian Optimization’s ability to fine-tune deep learning models for
predicting acoustic performance in engineering systems. The authors demonstrate how Bayesian
methods improve prediction accuracy while reducing computational costs. Balcan et. al. (2025)
[411] provided a rigorous analysis of the sample complexity required for Bayesian Optimization in
deep learning. The findings show that Bayesian Optimization requires fewer samples to converge
to optimal solutions compared to other hyperparameter tuning techniques. Ma et. al. (2025) [419]
integrated Bayesian Optimization with Support Vector Machines (SVMs) for anomaly detection
in high-speed machining. They find that Bayesian Optimization allows more effective exploration
of hyperparameter spaces, leading to improved model reliability. Bouzaidi et. al. (2025) [420]
explored the impact of Bayesian Optimization on CNN-based models for image classification. It
demonstrates how Bayesian techniques outperform traditional methods like Grid Search in transfer
learning scenarios. Mustapha et. al. (2025) [421] integrated Bayesian Optimization for tuning
a hybrid deep learning framework combining Convolutional Neural Networks (CNNs) and Vision
Transformers (ViTs) for pneumonia detection. The results confirm that Bayesian Optimization
enhances the efficiency of multi-model architectures in medical imaging.
Bayesian Optimization treats the unknown objective function f as a sample from a Gaussian Process (GP) prior,
f(x) ∼ GP(m(x), k(x, x′)), (349)
specified by a mean function m(x) and a covariance function k(x, x′). Given a
set of observed points {x1 , x2 , . . . , xn }, the corresponding function values {f (x1 ), f (x2 ), . . . , f (xn )}
are assumed to be jointly distributed as:
[f(x_1), f(x_2), . . . , f(x_n)]^⊤ ∼ N(m, K), (350)
where m = [m(x1 ), m(x2 ), . . . , m(xn )]⊤ is the mean vector and K is the covariance matrix whose
entries are defined by a covariance (or kernel) function k(x, x′ ), which encodes assumptions about
the smoothness and periodicity of the objective function. The kernel function plays a crucial role
in determining the properties of the Gaussian Process. A commonly used kernel is the Squared
Exponential (SE) kernel, which is defined as:
k(x, x′) = σ_f^2 exp(−∥x − x′∥^2 / (2ℓ^2)), (351)
where σ_f^2 is the variance, which scales the function values, and ℓ is the length scale, which controls
the smoothness of the function by dictating how quickly the function values can change with respect
to the inputs. Once the Gaussian Process has been specified, Bayesian Optimization proceeds by
updating the posterior distribution over the objective function after each new evaluation. Given a
set of n observed pairs {(xi , yi )}ni=1 , where yi = f (xi )+ϵi and ϵi ∼ N (0, σ 2 ) represents observational
noise, we update the posterior of the GP to reflect the observed data. The posterior mean µ(x∗ )
and variance σ 2 (x∗ ) at a new point x∗ are given by the following equations:
µ(x∗) = k_∗^⊤ (K + σ^2 I)^{−1} y, (352)
σ^2(x∗) = k(x∗, x∗) − k_∗^⊤ (K + σ^2 I)^{−1} k_∗, (353)
where k∗ is the vector of covariances between the test point x∗ and the observed points x1 , x2 , . . . , xn ,
and K is the covariance matrix of the observed points. The updated mean µ(x∗ ) provides the
model’s best guess for the value of the function at x∗ , and σ 2 (x∗ ) quantifies the uncertainty asso-
ciated with this estimate.
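The posterior update of Eqs. (352)-(353) can be sketched in a few lines of numpy. This is an illustrative implementation under the SE-kernel assumption of Eq. (351), with the observation noise σ^2 added to the diagonal of K; the function names are illustrative.

import numpy as np

def se_kernel(A, B, sigma_f=1.0, ell=1.0):
    # squared-exponential kernel of Eq. (351); A, B have shape (n, d) and (m, d)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sigma_f ** 2 * np.exp(-d2 / (2 * ell ** 2))

def gp_posterior(X, y, X_star, noise=1e-6):
    # posterior mean and variance of Eqs. (352)-(353), noise on the diagonal of K
    K = se_kernel(X, X) + noise * np.eye(len(X))
    K_star = se_kernel(X, X_star)            # column j is k_* for test point j
    mu = K_star.T @ np.linalg.solve(K, y)    # Eq. (352)
    v = np.linalg.solve(K, K_star)
    var = se_kernel(X_star, X_star).diagonal() - np.sum(K_star * v, axis=0)  # Eq. (353)
    return mu, var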
In Bayesian Optimization, the central objective is to select the next hyperparameter setting x∗
to evaluate in such a way that the number of function evaluations is minimized while still making
progress toward the global optimum. This is achieved by optimizing an acquisition function. The
acquisition function α(x) represents a trade-off between exploiting regions of the input space where
the objective function is expected to be low and exploring regions where the model’s uncertainty is
high. Several acquisition functions have been proposed, including Expected Improvement (EI),
Probability of Improvement (PI), and Upper Confidence Bound (UCB). The Expected
Improvement (EI) acquisition function is one of the most widely used and is defined as:
EI(x) = (f_best − µ(x)) Φ((f_best − µ(x))/σ(x)) + σ(x) ϕ((f_best − µ(x))/σ(x)), (354)
where fbest is the best observed value of the objective function, Φ(·) and ϕ(·) are the cumulative
distribution and probability density functions of the standard normal distribution, respectively,
and σ(x) is the standard deviation at x. The first term measures the potential for improvement,
weighted by the probability of achieving that improvement, and the second term reflects the uncer-
tainty at x, encouraging exploration in uncertain regions. The acquisition function is maximized
at each iteration to select the next point x∗ :
x∗ = arg max_x α(x). (355)
An alternative acquisition function is the Probability of Improvement (PI), which is simpler and directly measures the probability that the objective function at x will improve upon the current best value:
PI(x) = Φ((f_best − µ(x))/σ(x)), (356)
Another common acquisition function is the Upper Confidence Bound (UCB), which balances
exploration and exploitation by selecting the point with the highest upper confidence bound:
UCB(x) = µ(x) + κσ(x) (357)
where κ is a hyperparameter that controls the trade-off between exploration (κ large) and exploita-
tion (κ small). After selecting x∗ , the function is evaluated at this point, and the observed value
y∗ = f (x∗ ) is used to update the posterior distribution of the Gaussian Process. This process is
repeated iteratively, and each new observation refines the model’s understanding of the objective
function, guiding the search for the optimal x∗ . One of the primary advantages of Bayesian Opti-
mization is its ability to efficiently optimize expensive-to-evaluate functions by focusing the search
on the most promising regions of the input space. However, as the number of observations in-
creases, the computational complexity of maintaining the Gaussian Process model grows cubically
with respect to the number of points, due to the need to invert the covariance matrix K. This
cubic complexity, O(n3 ), can be prohibitive for large datasets. To mitigate this, techniques such as
sparse Gaussian Processes have been developed, which approximate the full covariance matrix
by using a smaller set of inducing points, thus reducing the computational cost while maintaining
the flexibility of the Gaussian Process model.
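A minimal sketch of one Bayesian Optimization step, reusing the gp_posterior helper sketched above, is given below; it scores a finite candidate set with the Expected Improvement of Eq. (354) and returns the maximizer as in Eq. (355). The minimization convention of the text is assumed, and the helper names are illustrative.

import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    # EI of Eq. (354); sigma is clipped to avoid division by zero
    sigma = np.maximum(sigma, 1e-12)
    z = (f_best - mu) / sigma
    return (f_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def bo_step(X_obs, y_obs, candidates):
    # select the next evaluation point x_* = argmax EI(x) over the candidate set
    mu, var = gp_posterior(X_obs, y_obs, candidates)
    ei = expected_improvement(mu, np.sqrt(np.maximum(var, 0.0)), y_obs.min())
    return candidates[np.argmax(ei)]

In practice the acquisition function is itself optimized with a continuous optimizer rather than over a fixed grid; the grid is used here only to keep the sketch short.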
layers detect patterns, textures, and objects. Liu et.al. (2021) [152] introduced Vision Transform-
ers (ViTs) that outperform CNNs in some vision tasks. This paper discusses the limitations of
CNNs and how transformers can be hybridized with CNN architectures. Lin et.al. (2013) [153]
introduced the 1x1 convolution, which improved feature learning efficiency. This concept became
a key component of modern CNN architectures such as ResNet and MobileNet. Rumelhart et.
al. (1986) [154] formalized backpropagation, the training method used for CNNs. Without this
discovery, CNNs and deep learning would not exist today.
At the core of a CNN is the convolutional layer, where the input image I is convolved with a
set of filters or kernels K ∈ Rfh ×fw ×C , where fh and fw are the height and width of the filter,
respectively. The filter K slides across the input image I, and the result of this convolution is a
set of feature maps that are indicative of certain local patterns in the image. The element-wise
convolution at location (i, j) of the feature map is given by:
(I ∗ K)_{i,j} = Σ_{p=1}^{f_h} Σ_{q=1}^{f_w} Σ_{r=1}^{C} I_{i+p−1, j+q−1, r} · K_{p,q,r}, (358)
where Ii+p−1,j+q−1,r denotes the value of the r-th channel of the input image at position (i + p −
1, j + q − 1), and Kp,q,r is the corresponding filter value at (p, q, r). This operation is done for
each location (i, j) of the output feature map. The resulting feature map F has spatial dimensions
H ′ × W ′ , where:
H′ = ⌊(H + 2p − f_h)/s⌋ + 1,  W′ = ⌊(W + 2p − f_w)/s⌋ + 1, (359)
where p is the padding, and s is the stride of the filter during its sliding motion. The convolution
operation provides a translation-invariant representation of the input image, as each filter detects
patterns across the entire image. After this convolution, a non-linear activation function, typically
the Rectified Linear Unit (ReLU), is applied to introduce non-linearity into the network and
ensure it can model complex patterns. The ReLU activation function operates element-wise and is
given by:
ReLU(x) = max(0, x) (360)
Thus, for each feature map F, the output after ReLU is:
F^+_{i,j,k} = max(0, F_{i,j,k}). (361)
This ensures that negative values in the feature map are discarded, which helps with the sparse
representation of activations, mitigating the vanishing gradient problem in deeper layers. In CNNs,
pooling operations follow the convolution and activation layers. Pooling serves to reduce the spa-
tial dimensions of the feature maps, thus decreasing computational complexity and making the
representation more invariant to translations. Max pooling, which is the most common form,
selects the maximum value within a specified window size p_h × p_w. Given an input feature map F ∈ R^{H′×W′×K}, max pooling operates as follows:
P_{i,j,k} = max_{(u,v) ∈ p_h × p_w} F_{(i−1)p_h + u, (j−1)p_w + v, k}, (362)
where P is the pooled feature map. This pooling operation effectively reduces the spatial dimensions of each feature map, resulting in an output P ∈ R^{H″×W″×K}, where:
H″ = H′/p_h,  W″ = W′/p_w. (363)
Max pooling introduces an element of robustness by capturing only the strongest features within
the local regions, discarding irrelevant information, and ensuring that the network is invariant to
small translations. The CNN architecture typically contains multiple convolutional layers followed
by pooling layers. After these operations, the feature maps are flattened into a one-dimensional
vector and passed into one or more fully connected (dense) layers. A fully connected layer
computes a linear transformation of the form:
z(l) = W(l) a(l−1) + b(l) (364)
where a(l−1) is the input to the layer, W(l) is the weight matrix, and b(l) is the bias vector. The
output of this linear transformation is then passed through a non-linear activation function, such as
ReLU or softmax for classification tasks. For classification, the softmax function is often applied
to convert the output into a probability distribution:
y_i = exp(z_i) / Σ_{j=1}^{C} exp(z_j), (365)
where C is the number of output classes, and yi is the probability of the i-th class. The softmax
function ensures that the output probabilities sum to 1, providing a valid classification output.
The CNN is trained using backpropagation, which computes the gradients of the loss function
L with respect to the network’s parameters (i.e., weights and biases). Backpropagation uses the
chain rule to propagate the error gradients through each layer. The gradients with respect to the
convolutional filters K are computed by:
∂L/∂K = ∂L/∂F ∗ I, (366)
where ∗ denotes the convolution operation. Similarly, the gradients for the fully connected layers
are computed by:
∂L/∂W^{(l)} = (∂L/∂z^{(l)}) (a^{(l−1)})^⊤. (367)
Once the gradients are computed, the weights are updated using an optimization algorithm like
gradient descent:
W^{(l)} ← W^{(l)} − η ∂L/∂W^{(l)}, (368)
where η is the learning rate. This optimization ensures that the network’s parameters are adjusted
in the direction of the negative gradient, minimizing the loss function and thereby improving the
performance of the CNN. Regularization techniques are commonly applied to avoid overfitting.
Dropout, for instance, randomly deactivates a subset of neurons during training, preventing the
network from becoming too reliant on any specific feature and promoting better generalization.
The dropout operation at a given layer l with dropout rate p is defined as:
a(l) ∼ Dropout(a(l) , p) (369)
where the activations a^{(l)} are randomly set to zero with probability p, and the remaining activations are scaled by 1/(1 − p). Another regularization technique is batch normalization, which normalizes
the inputs of each layer to have zero mean and unit variance, thus improving training speed and
stability. Mathematically, batch normalization is defined as:
x̂ = (x − µ_B)/σ_B,  y = γ x̂ + β, (370)
where µB and σB are the mean and standard deviation of the batch, and γ and β are learned
scaling and shifting parameters.
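To make Eqs. (358)-(363) concrete, the following numpy sketch implements a single convolution-ReLU-max-pooling stage; it is an illustrative reference implementation (one filter, valid padding unless pad is set), not an optimized one, and all names are illustrative.

import numpy as np

def conv2d(I, K, stride=1, pad=0):
    # Eq. (358): convolution of an H x W x C image with one f_h x f_w x C filter
    if pad:
        I = np.pad(I, ((pad, pad), (pad, pad), (0, 0)))
    H, W, _ = I.shape
    fh, fw, _ = K.shape
    Hp, Wp = (H - fh) // stride + 1, (W - fw) // stride + 1   # Eq. (359)
    F = np.zeros((Hp, Wp))
    for i in range(Hp):
        for j in range(Wp):
            F[i, j] = np.sum(I[i*stride:i*stride+fh, j*stride:j*stride+fw, :] * K)
    return F

def relu(F):
    # Eq. (360), applied element-wise
    return np.maximum(0.0, F)

def max_pool(F, ph=2, pw=2):
    # Eqs. (362)-(363): non-overlapping max pooling over p_h x p_w windows
    H, W = F.shape
    F = F[:H - H % ph, :W - W % pw]          # crop so windows tile exactly
    return F.reshape(H // ph, ph, W // pw, pw).max(axis=(1, 3))

# example: one feature map from a random RGB image and a random 3x3x3 filter
fmap = max_pool(relu(conv2d(np.random.rand(32, 32, 3), np.random.randn(3, 3, 3))))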
The process of image classification in Convolutional Neural Networks (CNNs) involves a sophisti-
cated interplay of linear algebra, calculus, probability theory, and optimization. The primary goal
is to map a high-dimensional input image to a specific class label. Let I ∈ RH×W ×C represent the
input image, where H, W , and C are the height, width, and number of channels (usually 3 for
RGB images) of the image, respectively. Each pixel of the image can be represented as I(i, j, c),
which denotes the intensity of the c-th channel at pixel position (i, j). The objective of the CNN
is to transform this raw input image into a label, typically one of M classes, using a hierarchical
feature extraction process that includes convolutions, nonlinearities, pooling, and fully connected
layers.
The convolution operation is central to CNNs and forms the basis for the feature extraction process.
Let K ∈ R^{k×k×C×F} be a bank of F filters (or kernels), each with spatial dimensions k × k and C channels, where k is typically a small odd integer, such as 3 or 5. The filters are convolved with the input image I to produce a feature map S ∈ R^{(H−k+1)×(W−k+1)×F}, where F is the number of filters used in the
convolution. For a given spatial position (i, j) in the feature map, the convolution operation is
defined as:
S_{i,j,f} = Σ_{m=0}^{k−1} Σ_{n=0}^{k−1} Σ_{c=0}^{C−1} I(i + m, j + n, c) · K_{m,n,c,f}, (371)
where Si,j,f represents the value at position (i, j) in the feature map corresponding to the f -th filter.
This operation computes a weighted sum of pixel values in the receptive field of size k × k × C
around pixel (i, j), where the weights are given by the filter values. The result is a new feature map
that captures local patterns such as edges or textures in the image. This local feature extraction
is performed for each position (i, j) across the entire image, producing a set of feature maps for
each filter. To introduce non-linearity into the network and allow it to model complex functions,
the feature map S is passed through a non-linear activation function, typically the Rectified Linear
Unit (ReLU), which is defined element-wise as:
ReLU(x) = max(0, x). (372)
This activation function outputs 0 for negative values and passes positive values unchanged, en-
suring that the network can learn complex, non-linear relationships. The output of the activation
function for the feature map is denoted as S+ , where each element of S+ is computed as:
S^+_{i,j,f} = max(0, S_{i,j,f}), (373)
This element-wise operation enhances the network’s ability to capture and represent complex pat-
terns, thereby aiding in the learning process. After the convolution and activation, the feature map
is downsampled using a pooling operation. The most common form of pooling is max pooling,
which selects the maximum value in a local region of the feature map. Given a pooling window of
size p × p and stride s, the max pooling operation for the feature map S+ is given by:
P_{i,j,f} = max_{(u,v) ∈ p×p} S^+_{i+u, j+v, f}, (374)
where P represents the pooled feature map. This operation reduces the spatial dimensions of the
feature map by a factor of p, while preserving the most important features in each region. Pooling
serves several purposes, including dimensionality reduction, translation invariance, and noise re-
duction. It also helps prevent overfitting by limiting the number of parameters and computations
in the network.
Once the feature maps are obtained through convolution, activation, and pooling, they are flat-
tened into a one-dimensional vector F ∈ RN , where N is the total number of elements in the
pooled feature map. The flattened vector F is then fed into one or more fully connected layers.
These layers perform linear transformations of the input, which are essentially weighted sums of
the inputs, followed by the addition of a bias term. The output of a fully connected layer can be
expressed as:
O=W·F+b (375)
where W ∈ RM ×N is the weight matrix, b ∈ RM is the bias vector, and O ∈ RM is the raw output
or logit for each of the M classes. The fully connected layer computes a set of logits for the classes
based on the learned features from the convolutional and pooling layers. To convert the logits into
class probabilities, a softmax function is applied. The softmax function is a generalization of the
logistic function to multiple classes and transforms the logits into a probability distribution. The
probability of class k is given by:
P(y = k | O) = e^{O_k} / Σ_{j=1}^{M} e^{O_j}, (376)
where Ok is the logit corresponding to class k, and the denominator ensures that the sum of
probabilities across all classes equals 1. The class label with the highest probability is selected as
the final prediction:
y = arg max P (y = k | O) (377)
k
The prediction is made based on the computed class probabilities, and the network aims to minimize
the discrepancy between the predicted probabilities and the true labels during training. To optimize
the network’s parameters, we minimize a loss function that measures the difference between
the predicted probabilities and the actual labels. The cross-entropy loss is commonly used in
classification tasks and is defined as:
L = − Σ_{k=1}^{M} y_k log P(y = k | O), (378)
k=1
where yk is the true label in one-hot encoding, and P (y = k | O) is the predicted probability for
class k. The goal of training is to minimize this loss function, which corresponds to maximizing
the likelihood of the correct class under the predicted probability distribution.
The optimization of the network parameters is performed using gradient descent and its variants,
such as stochastic gradient descent (SGD), which iteratively updates the parameters based on the
gradients of the loss function. The gradients are computed using backpropagation, a method
that applies the chain rule of calculus to compute the partial derivatives of the loss with respect
to each parameter. For a fully connected layer, the gradient of the loss with respect to the weights
W is given by:
∇_W L = (∂L/∂O) · (∂O/∂W) = δ · F^⊤, (379)
where δ = ∂L/∂O is the error term (also known as the delta) for the logits, and F^⊤ is the transpose of
the flattened feature vector. The parameters are updated using the following rule:
W ← W − η∇W L (380)
where η is the learning rate, controlling the step size of the updates. This process is repeated
for each batch of training data until the network converges to a set of parameters that minimize
the loss function. Through this complex and iterative process, CNNs are able to learn to classify
images by automatically extracting hierarchical features from raw input data. The combination of
convolution, activation, pooling, and fully connected layers enables the network to learn increasingly
abstract and high-level representations of the input image, ultimately achieving high accuracy in
image classification tasks.
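The classification head of Eqs. (375)-(380) can be sketched directly. The code below performs one SGD step on a fully connected layer; it uses the standard identity δ = p − y for the gradient of the softmax cross-entropy with respect to the logits, a step the text leaves implicit, and all names are illustrative.

import numpy as np

def softmax(O):
    # Eq. (376), with the usual max-subtraction for numerical stability
    e = np.exp(O - O.max())
    return e / e.sum()

def sgd_step(W, b, F, y_onehot, eta=0.1):
    O = W @ F + b                                  # logits, Eq. (375)
    p = softmax(O)
    loss = -np.sum(y_onehot * np.log(p + 1e-12))   # cross-entropy, Eq. (378)
    delta = p - y_onehot                           # dL/dO for softmax + cross-entropy
    W -= eta * np.outer(delta, F)                  # Eqs. (379)-(380)
    b -= eta * delta
    return W, b, loss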
7.2.2 Object Detection
Literature Review: Naseer and Jalal (2025) [195] presented a multimodal deep learning frame-
work that integrates RGB-D images for enhanced semantic scene classification. The study leverages
a Convolutional Neural Network (CNN)-based object detection model to extract and process fea-
tures from RGB and depth images, aiming to improve scene recognition accuracy in cluttered and
complex environments. By incorporating multimodal inputs, the model effectively addresses the
challenges associated with occlusions and background noise, which are common issues in tradi-
tional object detection frameworks. The researchers demonstrate how CNNs, when combined with
depth-aware semantic information, can significantly enhance object localization and classification
performance. Through extensive evaluations, they validate that their framework outperforms con-
ventional single-stream CNNs in various real-world scenarios, making a compelling case for RGB-D
integration in deep learning-based object detection systems. Wang and Wang (2025) [196] build
upon the Faster R-CNN object detection framework, introducing a novel improvement that signif-
icantly enhances detection accuracy in highly dynamic and complex environments. The study pro-
poses an optimized anchor box generation mechanism, which allows the network to efficiently detect
objects of varying scales and aspect ratios, particularly those that are small or heavily occluded. By
incorporating a refined region proposal network (RPN), the authors mitigate localization errors and
reduce false-positive detections. The paper also explores the impact of feature pyramid networks
(FPNs) in hierarchical feature extraction, demonstrating their effectiveness in improving the detec-
tion of fine-grained details. The authors conduct an extensive empirical evaluation, comparing their
improved Faster R-CNN model against existing object detection architectures, proving its superior
performance in terms of precision and recall, particularly for applications involving customized icon
generation and user interface design. Ramana et. al. (2025) [197] introduced a Deep Convolutional
Graph Neural Network (DCGNN) that integrates Spectral Pyramid Pooling (SPP) and fused key-
point generation to significantly improve 3D object detection performance. The study employs
ResNet-50 as the backbone CNN architecture and enhances its feature extraction capability by
introducing multi-scale spectral feature aggregation. Through the integration of graph neural net-
works (GNNs), the model can effectively capture spatial relationships between object components,
leading to highly accurate 3D bounding box predictions. The proposed methodology is rigorously
evaluated on multiple benchmark datasets, demonstrating its superior ability to handle occlusion,
scale variation, and viewpoint changes. Additionally, the paper presents a novel fusion strategy
that combines keypoint-based object representation with spectral domain feature embeddings, al-
lowing the model to achieve unparalleled robustness in automated 3D object detection tasks. Shin
et. al. (2025) [198] explores the application of deep learning-based object detection in the field
of microfluidics and droplet-based bioengineering. The authors utilize YOLOv10n, an advanced
CNN-based object detection framework, to develop an automated system for tracking and catego-
rizing double emulsion droplets in high-throughput experimental setups. By fine-tuning the YOLO
architecture, the study achieves remarkable improvements in detection sensitivity and classification
accuracy, enabling real-time identification of droplet morphology, phase separation dynamics, and
stability characteristics. The researchers further introduce an adaptive feature refinement strategy,
wherein the CNN model continuously learns from real-time experimental variations, allowing for
automated calibration and correction of droplet misclassification. The paper also demonstrates
the practical implications of this AI-driven approach in drug delivery systems, encapsulation tech-
nologies, and synthetic biology applications. Taca et. al. (2025) [199] provided a comprehensive
comparative analysis of multiple CNN-based object detection architectures applied to aphid clas-
sification in large-scale agricultural datasets. The researchers evaluate the performance of YOLO,
SSD, Faster R-CNN, and EfficientDet, analyzing their trade-offs in terms of accuracy, inference
speed, and computational efficiency. Through an extensive experimental setup involving 48,000
annotated images, the authors demonstrate that certain CNN models excel in specific detection
scenarios, such as YOLO for real-time aphid localization and Faster R-CNN for high-precision clas-
sification. Furthermore, the paper introduces an innovative hybrid ensemble strategy, combining
the strengths of multiple CNN architectures to achieve optimal detection performance. The au-
thors validate their findings on real-world agricultural environments, emphasizing the importance
of deep learning-driven pest detection in sustainable farming practices. Ulaş et. al. (2025) [200]
explored the application of CNN-based object detection in the domain of astronomical time-series
analysis, specifically targeting oscillation-like patterns in eclipsing binary light curves. The study
systematically evaluates multiple state-of-the-art object detection models, including YOLO, Faster
R-CNN, and SSD, to determine their effectiveness in identifying transient light fluctuations that
indicate oscillatory behavior in celestial bodies. One of the key contributions of this paper is the
introduction of a customized pre-processing pipeline that optimizes raw observational data by re-
moving noise and enhancing feature visibility using wavelet-based signal decomposition techniques.
The researchers further implement a hybrid detection mechanism, integrating CNN-based spatial
feature extraction with recurrent neural networks (RNNs) to capture both spatial and temporal
dependencies within light curve datasets. Extensive validation on large-scale astronomical datasets
demonstrates that this approach significantly outperforms traditional statistical methods in detect-
ing oscillatory behavior, paving the way for AI-driven automation in astrophysical event classifica-
tion. Valensi et. al. (2025) [201] present an advanced semi-supervised deep learning framework for
pleural line detection and segmentation in lung ultrasound (LUS) imaging, leveraging the power of
foundation models and CNN-based object detection architectures. The study highlights the short-
comings of conventional fully supervised learning in medical imaging, where annotated datasets
are limited and labor-intensive to create. To overcome this challenge, the researchers incorporate a
semi-supervised learning strategy, utilizing self-training techniques combined with pseudo-labeling
to improve model generalization. The framework employs YOLOv8-based object detection, specif-
ically optimized for medical feature localization, which significantly enhances detection accuracy
in cases of low-contrast and high-noise ultrasound images. Furthermore, the study integrates a
multi-scale feature extraction strategy, combining convolutional layers with attention mechanisms
to ensure precise identification of pleural lines across different imaging conditions. Experimen-
tal results demonstrate that this hybrid approach achieves a substantial increase in segmentation
accuracy, particularly in detecting subtle abnormalities linked to pneumothorax and pleural effu-
sion, making it a highly valuable tool in clinical diagnostic applications. Arulalan et. al. (2025)
[202] proposed an optimized object detection pipeline that integrates a novel convolutional neural
network (CNN) architecture, BS2ResNet, with bidirectional LSTM (LTK-Bi-LSTM) for improved
spatiotemporal object recognition. Unlike conventional CNN-based object detectors, which focus
solely on static spatial features, this study introduces a hybrid deep learning framework that cap-
tures both spatial and temporal dependencies. The proposed BS2ResNet model enhances feature
extraction by utilizing bottleneck squeeze-and-excitation blocks, which selectively emphasize impor-
tant spatial information while suppressing redundant feature maps. Additionally, the integration of
LTK-Bi-LSTM layers allows the model to effectively capture temporal correlations, making it highly
robust for detecting moving objects in dynamic environments. This approach is validated on mul-
tiple benchmark datasets, including autonomous driving and video surveillance datasets, where it
demonstrates superior performance in handling occlusions, rapid motion, and low-light conditions.
The findings indicate that combining deep convolutional networks with sequence-based modeling
significantly improves object detection accuracy in complex real-world scenarios, offering critical
advancements for applications in intelligent transportation, security, and real-time monitoring. Zhu
et. al. (2025) [203] investigated a novel adversarial attack strategy targeting CNN-based object
detection models, with a specific focus on binary image segmentation tasks such as salient object
detection and camouflage object detection. The paper introduces a high-transferability adversarial
attack framework, which generates adversarial perturbations capable of fooling a wide range of
deep learning models, including YOLO, Mask R-CNN, and U-Net-based segmentation networks.
The researchers employ adversarial example augmentation, where synthetic adversarial patterns
are iteratively refined through gradient-based optimization techniques, ensuring that the adversar-
ial attacks remain effective across different architectures and datasets. A particularly important
contribution is the introduction of a dual-stage attack pipeline, wherein the model first learns to
generate localized, high-impact adversarial noise and then optimizes for cross-model generalization.
Extensive experiments demonstrate that this approach significantly degrades detection performance
across multiple state-of-the-art models, revealing critical vulnerabilities in current CNN-based ob-
ject detectors. This research provides valuable insights into deep learning security and underscores
the urgent need for robust adversarial defense mechanisms in high-stakes applications such as au-
tonomous systems, medical imaging, and biometric security. Guo et. al. (2025) [204] introduced
a deep learning-based agricultural monitoring system, utilizing CNNs for agronomic entity detec-
tion and attribute extraction. The research highlights the limitations of traditional rule-based and
manual annotation systems in agricultural monitoring, which are prone to errors and inefficiencies.
By leveraging CNN-based object detection models, the proposed system enables real-time crop
analysis, accurately identifying key agronomic attributes such as plant height, leaf structure, and
disease symptoms. A significant innovation in this study is the incorporation of inter-layer feature
fusion, wherein multi-scale convolutional features are integrated across different network depths to
improve detection robustness under varying lighting and environmental conditions. Additionally,
the authors employ a hybrid feature selection mechanism, combining spatial attention networks
with spectral domain feature extraction, which enhances the model’s ability to distinguish between
healthy and diseased crops with high precision. The research is validated through rigorous field
trials, demonstrating that CNN-based agronomic monitoring can significantly enhance crop yield
predictions, reduce human labor in precision agriculture, and optimize resource allocation in farm-
ing operations.
In the mathematical framework, let the input image be represented by a matrix I ∈ RH×W ×C ,
where H, W , and C are the height, width, and number of channels (typically 3 for RGB images).
Convolution operations in a CNN serve as the fundamental building blocks to extract spatial hier-
archies of features. The convolution operation involves the application of a kernel K ∈ Rm×n×C to
the input image, where m and n are the spatial dimensions of the kernel, and C is the number of
input channels. The convolution operation is performed by sliding the kernel over the image and
computing the element-wise multiplication between the kernel and the image patch, yielding the
following equation for the feature map O(x, y):
O(x, y) = Σ_{i=0}^{m−1} Σ_{j=0}^{n−1} Σ_{c=0}^{C−1} I(x + i, y + j, c) · K(i, j, c), (381)
Here, O(x, y) represents the feature map at the location (x, y), which is generated by applying the
kernel K. The sum is taken over the spatial extent of the kernel as it slides over the image. This
convolutional operation helps the network capture local patterns in the input image, such as edges,
corners, and textures, which are crucial for identifying objects. Once the convolution is performed,
a non-linear activation function such as the Rectified Linear Unit (ReLU) is applied to introduce
non-linearity into the system. The ReLU activation function is given by:
ReLU(x) = max(0, x). (382)
This activation function helps the network model complex non-linear relationships between fea-
tures and is computationally efficient. The application of ReLU ensures that the network can learn
complex decision boundaries that are necessary for tasks like object detection.
In CNN-based object detection, the goal is to predict the class of an object and localize its position
via a bounding box. The bounding box is parametrized by four coordinates: (x, y) for the center
of the box, and w, h for the width and height. The task can be viewed as a twofold problem: (1)
classify the object and (2) predict the bounding box that best encodes the object’s spatial posi-
tion. Mathematically, this requires the network to output both class probabilities and bounding
box coordinates for each object within the image. The classification task is typically performed
using a softmax function, which converts the network’s raw output logits zi for each class i into
probabilities P (yi |r). The softmax function is defined as:
P(y_i | r) = exp(z_i) / Σ_{j=1}^{k} exp(z_j), (383)
where k is the number of possible classes, zi is the raw score for class i, and P (yi |r) is the probability
that the region r belongs to class yi . This function ensures that the predicted scores are valid
probabilities that sum to one, which allows the network to make a probabilistic decision regarding
the class of the object in each region. Simultaneously, the network must also predict the four
parameters of the bounding box for each object. The network’s predicted bounding box parameters
are typically denoted as B̂ = (x̂, ŷ, ŵ, ĥ), while the ground truth bounding box is denoted by
B = (x, y, w, h). The error between the predicted and true bounding boxes is quantified using a
loss function, with the smooth L1 loss being a commonly used metric for bounding box regression.
The smooth L1 loss for each parameter of the bounding box is defined as:
L_bbox = Σ_{i=1}^{4} SmoothL1(B_i − B̂_i), (384)
This loss function is used to reduce the impact of large errors, thereby making the training process
more robust. The goal is to minimize this loss during the training phase to improve the network’s
ability to predict both the class and the bounding box of objects.
For training, a combined loss function is used that combines both the classification loss and the
bounding box regression loss. Here the smooth L1 penalty of Eq. (384) is given by
SmoothL1(x) = 0.5x^2 if |x| < 1, and |x| − 0.5 otherwise. (385)
The total loss function can be written as:
L_total = L_cls + λ L_bbox, (386)
where Lcls is the classification loss, typically computed using the cross-entropy between the pre-
dicted probabilities and the ground truth labels. The cross-entropy loss for classification is given
by:
L_cls = − Σ_{i=1}^{k} y_i log(ŷ_i), (387)
where yi is the true label, and ŷi is the predicted probability for class i. The total objective
function for training is therefore a weighted sum of the classification and bounding box regression
losses, and the network is optimized to minimize this combined loss function. Object detection
architectures like Region-based CNNs (R-CNNs) take a two-stage approach where the task is broken
into generating region proposals and classifying these regions. Region Proposal Networks (RPNs)
are employed to generate candidate regions r1 , r2 , . . . , rn , which are then passed through the network
to obtain their feature representations. The bounding box refinement and classification for each
proposal are then performed by a fully connected layer. The loss function for R-CNNs combines
both classification and bounding box regression losses for each proposal, and the objective is to
minimize:
LR-CNN = Lcls + Lbbox (388)
Another popular architecture, YOLO (You Only Look Once), frames object detection as a single
regression task. The image is divided into a grid of S × S cells, where each cell predicts the class
probabilities and bounding box parameters. The output vector for each cell consists of:
[x, y, w, h, c, P_1, P_2, . . . , P_k], (389)
where (x, y) are the coordinates of the bounding box center, w and h are the dimensions of the box,
c is the confidence score, and P1 , P2 , . . . , Pk are the class probabilities. The total loss for YOLO
combines the classification loss, bounding box regression loss, and confidence loss, which can be
written as:
LYOLO = Lcls + Lbbox + Lconf (390)
where Lcls is the classification loss, Lbbox is the bounding box regression loss, and Lconf is the
confidence loss, which penalizes predictions with low confidence. This approach allows YOLO to
make object detection predictions in a single pass through the network, enabling faster inference.
The Single Shot Multibox Detector (SSD) improves on YOLO by generating bounding boxes at
multiple feature scales, which allows for detecting objects of varying sizes. The loss function for
SSD is similar to that of YOLO, comprising the classification loss and bounding box localization
loss, given by:
LSSD = Lcls + Lloc (391)
where Lcls is the classification loss, and Lloc is the smooth L1 loss for bounding box regression.
This multi-scale approach enhances the network’s ability to detect objects at different levels of
resolution, improving its robustness to objects of different sizes.
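The detection objectives above can be sketched as follows; the smooth L1 penalty uses the standard definition recalled in Eq. (385), the weighting λ is an illustrative parameter, and the function names are illustrative.

import numpy as np

def smooth_l1(x):
    # SmoothL1(x) = 0.5 x^2 for |x| < 1, |x| - 0.5 otherwise, Eq. (385)
    a = np.abs(x)
    return np.where(a < 1.0, 0.5 * a**2, a - 0.5)

def detection_loss(probs, labels_onehot, boxes_pred, boxes_true, lam=1.0):
    # combined objective L = L_cls + lambda * L_bbox, cf. Eqs. (384)-(388)
    l_cls = -np.sum(labels_onehot * np.log(probs + 1e-12))   # cross-entropy, Eq. (387)
    l_bbox = np.sum(smooth_l1(boxes_true - boxes_pred))      # Eq. (384)
    return l_cls + lam * l_bbox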
contrast agents. It highlights the ability of deep learning models to extract diagnostic informa-
tion from non-contrast images, reducing the need for invasive procedures. Nguyen et al. (2025)
[209] presented a multi-view tumor region-adapted synthesis model for mammograms using CNNs.
The approach enhances breast cancer detection by using 3D spatial feature extraction techniques,
improving tumor localization and classification. Chen et. al. (2025) [210] explored CNN-based de-
noising for medical images using a penalized least squares (PLS) approach. The study applies deep
learning for noise reduction in MRI scans, leading to improved clarity in low-signal-to-noise ratio
(SNR) images. Pradhan et al. (2025) [211] discussed CNN-based diabetic retinopathy detection.
It introduces an Atrous Residual U-Net architecture, enhancing image segmentation performance
for early-stage diagnosis of retinal diseases. Örenç et al. (2025) [212] evaluated ensemble CNN
models for adenoid hypertrophy detection in X-ray images. It demonstrates transfer learning and
feature fusion techniques, which improve CNN-based medical diagnostics. Jiang et al. (2025) [213]
introduced a cross-modal attention network for MRI image denoising, particularly effective when
some imaging modalities are missing. It highlights cross-domain knowledge transfer using CNNs.
Al-Haidri et. al. (2025) [214] developed a CNN-based framework for automatic myocardial fibrosis
segmentation in cardiac MRI scans. It emphasizes quantitative feature extraction techniques that
enhance precision in cardiac diagnostics.
Convolutional Neural Networks (CNNs) have become an indispensable tool in the field of med-
ical imaging, driven by their ability to automatically learn spatial hierarchies of features directly
from image data without the need for handcrafted feature extraction. The convolutional layers in
CNNs are designed to exploit the spatial structure of the input data, making them particularly
well-suited for tasks in medical imaging, where spatial relationships in images often encode critical
diagnostic information. The fundamental building block of CNNs, the convolution operation, is
mathematically expressed as
S(i, j) = Σ_{m=−k}^{k} Σ_{n=−k}^{k} I(i + m, j + n) · K(m, n), (392)
where S(i, j) represents the value of the output feature map at position (i, j), I(i, j) is the input
image, K(m, n) is the convolutional kernel (a learnable weight matrix), and k denotes the kernel
radius (for example, k = 1 for a 3 × 3 kernel). This equation fundamentally captures how local
patterns, such as edges, textures, and more complex features, are extracted by sliding the kernel
across the image. The convolution operation is performed for each channel of a multi-channel input
(e.g., RGB images or multi-modal medical images), and the results are summed across channels,
leading to multi-dimensional feature maps. For a 3D input tensor, the convolution extends to
include depth:
S(i, j, d′) = Σ_{d=1}^{D} Σ_{m=−k}^{k} Σ_{n=−k}^{k} I(i + m, j + n, d) · K(m, n, d), (393)
where D is the depth of the input tensor, and d′ is the depth index of the output feature map. CNNs
incorporate nonlinear activation functions after convolutional layers to introduce nonlinearity into
the model, allowing it to learn complex mappings. A commonly used activation function is the
Rectified Linear Unit (ReLU), mathematically defined as
f (x) = max(0, x). (394)
This function ensures sparsity in the activations, which is advantageous for computational efficiency
and generalization. More advanced activation functions, such as parametric ReLU (PReLU), extend
this concept by allowing learnable parameters for the negative slope:
f(x) = x if x > 0, and f(x) = ax if x ≤ 0, (395)
where a is a learnable parameter. Pooling layers are employed in CNNs to downsample the spatial
dimensions of feature maps, thereby reducing computational complexity and the risk of overfitting.
Max pooling is defined mathematically as
P(i, j) = max_{(m,n) ∈ R} S(i + m, j + n), (396)
where R is the pooling region (e.g., 2 × 2). Average pooling computes the mean value instead:
P(i, j) = (1/|R|) Σ_{(m,n) ∈ R} S(i + m, j + n). (397)
In medical imaging, CNNs are widely used for image classification tasks such as detecting abnor-
malities (e.g., tumors, fractures, or lesions). Consider a classification problem where the input is a
mammogram image, and the output is a binary label y ∈ {0, 1}, indicating benign or malignant.
The CNN model outputs a probability score ŷ, computed as
ŷ = σ(z) = 1 / (1 + e^{−z}), (398)
where z is the output of the final layer before the sigmoid activation. The binary cross-entropy loss
function is then used to train the model:
L = −(1/N) Σ_{i=1}^{N} [y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i)]. (399)
N i=1
For image segmentation tasks, where the goal is to assign a label to each pixel, architectures such
as U-Net are commonly used. U-Net employs an encoder-decoder structure, where the encoder ex-
tracts features through a series of convolutional and pooling layers, and the decoder reconstructs the
image through upsampling and concatenation operations. The objective function for segmentation
is often the Dice coefficient loss, defined as
L_Dice = 1 − (2 Σ_i p_i g_i) / (Σ_i p_i + Σ_i g_i), (400)
where pi and gi are the predicted and ground truth values for pixel i, respectively. In the context of
image reconstruction, such as in magnetic resonance imaging (MRI), CNNs are used to reconstruct
high-quality images from undersampled k-space data. The reconstruction problem is formulated as
minimizing the difference between the reconstructed image Ipred and the ground truth Itrue , often
using the ℓ2 -norm:
Lreconstruction = ∥Ipred − Itrue ∥22 . (401)
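Both of these objectives reduce to a few lines of numpy. The sketch below implements the binary cross-entropy of Eq. (399) and the Dice loss of Eq. (400); the small ε terms are implementation details for numerical stability, not part of the text, and the names are illustrative.

import numpy as np

def bce_loss(y_hat, y):
    # Eq. (399), with predictions clipped away from 0 and 1
    y_hat = np.clip(y_hat, 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def dice_loss(p, g, eps=1e-7):
    # Eq. (400); eps guards against empty masks
    return 1.0 - (2.0 * np.sum(p * g) + eps) / (np.sum(p) + np.sum(g) + eps)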
Generative adversarial networks (GANs) have also been applied to medical imaging, particularly
for enhancing image resolution or synthesizing realistic images from noisy inputs. A GAN consists
of a generator G and a discriminator D, where G learns to generate images G(z) from latent noise
z, and D distinguishes between real and fake images. The loss functions for G and D are given by
L_D = −E_{x ∼ p_data}[log D(x)] − E_{z ∼ p_z}[log(1 − D(G(z)))], (402)
L_G = −E_{z ∼ p_z}[log D(G(z))], (403)
where F is the input feature map, W1 and W2 are learnable weight matrices, and b1 and b2 are
biases. Despite their success, CNNs in medical imaging face challenges, including data scarcity
and interpretability. Transfer learning addresses data scarcity by fine-tuning pre-trained models
on small medical datasets. Techniques such as Grad-CAM provide interpretability by visualiz-
ing regions that influence the network’s predictions. Mathematically, Grad-CAM computes the
importance of a feature map A^k for a class c as
α_k^c = (1/Z) Σ_{i,j} ∂y^c / ∂A^k_{i,j}, (405)
where y^c is the score for class c and Z is a normalization constant. The class activation map is then obtained as
L^c_{Grad-CAM} = ReLU( Σ_k α_k^c A^k ). (406)
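A minimal Grad-CAM sketch, assuming the gradients ∂y^c/∂A^k have already been obtained by backpropagation (for example from an autodiff framework), is given below; the function name and array layout are illustrative.

import numpy as np

def grad_cam(feature_maps, grads):
    # feature_maps, grads: arrays of shape (K, H, W); grads[k] holds dy^c/dA^k
    alphas = grads.mean(axis=(1, 2))                  # Eq. (405), with Z = H * W
    cam = np.tensordot(alphas, feature_maps, axes=1)  # sum_k alpha_k^c A^k
    return np.maximum(cam, 0.0)                       # ReLU, Eq. (406)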
In summary, CNNs have transformed medical imaging by enabling automated and highly accu-
rate analysis of complex medical images. Their applications span disease detection, segmentation,
reconstruction, and multi-modal imaging, with continued advancements addressing challenges in
data efficiency and interpretability. Their mathematical foundations and computational frameworks
provide a robust basis for future innovations in this critical field.
7.3.2 Autonomous Vehicles
Literature Review:
[331] applied CNN-based multimodal sensor fusion to autonomous vehicles and UAVs for real-time navigation. They provided a theoretical analysis of CNN feature-fusion mechanisms for real-time perception and developed mask region-based CNNs (Mask R-CNNs) for enhanced object recognition in autonomous navigation. Mirindi et. al. (2025) [332] investigated the role of CNNs and AI in smart autonomous transportation. They offered a theoretical discussion of the Unified Theory of AI Adoption in autonomous driving and introduced hybrid Recurrent Neural Network (RNN) and CNN architectures for vehicle trajectory prediction.
At its mathematical core, the CNN rests on the convolution operation, defined in the continuous domain as
s(t) = (x ∗ w)(t) = ∫ x(τ) w(t − τ) dτ, (407)
where x(τ) represents the input, w(t − τ) is the filter or kernel, and s(t) is the output feature. In
the discrete domain, especially for image processing, this operation can be written as:
S(i, j) = Σ_{m=−k}^{k} Σ_{n=−k}^{k} X(i + m, j + n) · W(m, n), (408)
where X(i, j) denotes the pixel intensity at coordinate (i, j) of the input image, and W (m, n)
represents the convolutional kernel values. This operation enables the detection of local patterns
such as edges, corners, or textures, which are then aggregated across layers to recognize complex
features like shapes and objects. In the context of autonomous vehicles, CNNs process sensor data
from cameras, LiDAR, and radar to identify critical features such as other vehicles, pedestrians,
road signs, and lane boundaries. For object detection, CNN-based architectures such as YOLO
(You Only Look Once) and Faster R-CNN employ a backbone network like ResNet, which uses
successive convolutional layers to extract hierarchical features from the input image. The object
detection task involves two primary outputs: bounding box coordinates and object class probabil-
ities. Mathematically, bounding box regression is modeled as a multi-task learning problem. The
loss function for bounding box regression is often formulated as:
L_reg = Σ_{i=1}^{N} Σ_{j ∈ {x,y,w,h}} SmoothL1(t_{ij} − t̂_{ij}), (409)
where tij and t̂ij are the ground-truth and predicted bounding box parameters (e.g., center coordi-
nates x, y and dimensions w, h). Simultaneously, the classification loss, typically cross-entropy, is
computed as:
L_cls = − Σ_{i=1}^{N} Σ_{c=1}^{C} y_{i,c} log(p_{i,c}), (410)
where yi,c is a binary indicator for whether the object at index i belongs to class c, and pi,c is the
predicted probability. The total loss function is a weighted combination:
L_total = L_cls + λ L_reg. (411)
Semantic segmentation, another critical task, requires pixel-level classification to assign a label
(e.g., road, vehicle, pedestrian) to each pixel in an image. Fully Convolutional Networks (FCNs)
or U-Net architectures are commonly used for this purpose. These architectures utilize an encoder-
decoder structure where the encoder extracts spatial features, and the decoder reconstructs the
spatial resolution to generate pixel-wise predictions. The loss function for semantic segmentation
is a sum over all pixels and classes, given as:
L = − Σ_{i=1}^{N} Σ_{c=1}^{C} y_{i,c} log(p_{i,c}), (412)
where yi,c is the ground-truth binary label for pixel i and class c, and pi,c is the predicted prob-
ability. Advanced architectures also employ skip connections to preserve high-resolution spatial
information, enabling sharper segmentation boundaries.
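As a short sketch, the pixel-wise objective of Eq. (412) is simply a sum of per-pixel cross-entropies over one-hot labels; the array layout below is an illustrative assumption.

import numpy as np

def pixelwise_ce(p, y):
    # Eq. (412): p and y have shape (num_pixels, C); y is one-hot
    return -np.sum(y * np.log(p + 1e-12))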
Depth estimation is essential for autonomous vehicles to understand the 3D structure of their
surroundings. CNNs are used to predict depth maps from monocular images or stereo pairs. The
depth estimation process is modeled as a regression problem, where the loss function is designed to
minimize the difference between the predicted depth d̂_i and the ground-truth depth d_i. A commonly
used loss function for this task is the scale-invariant loss:
L_scale-inv = (1/n) Σ_{i=1}^{n} (log d_i − log d̂_i)^2 − (1/n^2) ( Σ_{i=1}^{n} (log d_i − log d̂_i) )^2. (413)
This loss ensures that the relative depth differences are minimized, which is critical for accurate
3D reconstruction. Lane detection, another critical application, uses CNNs to detect road lanes
and boundaries. The task often involves predicting the lane markings as polynomial curves. CNNs
process the input image to extract lane features, and post-processing involves fitting a curve, such
as:
y = ax2 + bx + c, (414)
where a, b, c are the coefficients predicted by the network. The fitting process minimizes an error
function, typically the sum of squared differences between the detected lane points and the curve:
E = Σ_{i=1}^{N} (y_i − (a x_i^2 + b x_i + c))^2. (415)
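Both the scale-invariant depth loss of Eq. (413) and the lane-fit objective of Eqs. (414)-(415) have short closed-form implementations; np.polyfit solves the least-squares problem of Eq. (415) exactly, and the function names are illustrative.

import numpy as np

def scale_invariant_loss(d_pred, d_true):
    # Eq. (413), computed on log-depths
    g = np.log(d_true) - np.log(d_pred)
    return np.mean(g**2) - np.mean(g)**2

def fit_lane(xs, ys):
    # least-squares fit of y = a x^2 + b x + c, minimising E in Eq. (415)
    a, b, c = np.polyfit(xs, ys, deg=2)
    return a, b, c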
In autonomous vehicles, these CNN tasks are integrated into an end-to-end pipeline. The input
data from cameras, LiDAR, and radar is first processed using CNNs to extract features relevant to
the vehicle’s perception. The outputs, including object detections, semantic maps, depth maps, and
lane boundaries, are then passed to the planning module, which computes the vehicle’s trajectory.
For instance, detected objects provide information about obstacles, while lane boundaries guide
path planning algorithms. The planning process involves solving optimization problems where
the objective function incorporates constraints from the CNN outputs. For example, a trajectory
optimization problem may minimize a cost function:
J = ∫_0^T ( w_1 ẋ^2 + w_2 ẏ^2 + w_3 c(t) ) dt, (416)
where ẋ and ẏ are the lateral and longitudinal velocities, and c(t) is a collision penalty based on
object detections.
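A discretized version of the cost in Eq. (416) can be sketched as below; the weights and the collision-penalty samples are illustrative assumptions, and velocities are approximated by finite differences.

import numpy as np

def trajectory_cost(x, y, c, dt, w=(1.0, 1.0, 10.0)):
    # discretised Eq. (416); x, y are positions sampled every dt seconds,
    # c holds the collision-penalty samples c(t)
    xdot, ydot = np.gradient(x, dt), np.gradient(y, dt)
    integrand = w[0] * xdot**2 + w[1] * ydot**2 + w[2] * c
    return np.trapz(integrand, dx=dt)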
In conclusion, CNNs provide the computational framework for perception tasks in autonomous
vehicles, enabling real-time interpretation of complex sensory data. By leveraging mathematical
principles of convolution, loss optimization, and hierarchical feature extraction, CNNs transform
raw sensor data into actionable insights, paving the way for safe and efficient autonomous naviga-
tion.
7.4 Popular CNN Architectures
Literature Review: Choudhury et. al. (2024) [333] presented a comparative theoretical study
of CNN architectures, including AlexNet, VGG, and ResNet, for satellite-based aircraft identifica-
tion. They analyzed the architectural differences and learning strategies used in VGG, AlexNet, and
ResNet and theoretically explained how VGG’s depth, AlexNet’s feature extraction, and ResNet’s
residual learning contribute to CNN advancements. Almubarok and Rosiani (2024) [334] discussed
the computational efficiency of CNN architectures, particularly focusing on AlexNet, VGG, and
ResNet in comparison to MobileNetV2. They established theoretical efficiency trade-offs between
depth, parameter count, and accuracy in AlexNet, VGG, and ResNet and highlighted ResNet’s
advantage in optimization due to skip connections, compared to AlexNet and VGG’s traditional
deep structures. Ding (2024) [335] explored CNN architectures (AlexNet, VGG, and ResNet) for
medical image classification, particularly in Traditional Chinese Medicine (TCM). He introduced
ResNet-101 with Squeeze-and-Excitation (SE) blocks, expanding theoretical understanding of deep
feature representations in CNNs and discussed VGG’s weight-sharing strategy and AlexNet’s lay-
ered feature extraction, improving classification accuracy. He et. al. (2015) [336] introduced
Residual Learning, demonstrating how deep CNNs benefit from identity mappings to tackle van-
ishing gradients. They formulated the mathematical justification of residual blocks in deep networks
and established the theoretical backbone of ResNet's identity mapping for deep optimization. Si-
monyan and Zisserman (2014) [148] presented the VGG architecture, which demonstrates how
depth improvement enhances feature extraction. They developed the theoretical formulation of
increasing CNN depth and its impact on feature hierarchies and provided an analytical framework
for receptive field expansion in deep CNNs. Krizhevsky et. al. (2012) [337] introduced AlexNet,
the first CNN model to achieve state-of-the-art performance in ImageNet classification. They intro-
duced ReLU activation as a breakthrough in CNN training and established dropout regularization
theory, preventing overfitting in deep networks. Sultana et. al. (2019) [338] compared the feature
extraction strategies of AlexNet, VGG, and ResNet for object recognition. They gave a theoretical explanation of hierarchical feature learning in CNN architectures and examined VGG's use of small
convolutional filters and how it impacts feature map depth. Sattler et. al. (2019) [339] investi-
gated the fundamental limitations of CNN architectures such as AlexNet, VGG, and ResNet. They
established formal constraints on convolutional filters in CNNs and developed a theoretical model
for CNN generalization error in classification tasks.
7.4.1 AlexNet
The AlexNet Convolutional Neural Network (CNN) is a deep learning model that operates
on raw pixel values to perform image classification. Given an input image, represented as a 3D
tensor I0 ∈ RH×W ×C , where H is the height, W is the width, and C represents the number of input
channels (typically C = 3 for RGB images), the network performs a series of operations, such as
convolutions, activation functions, pooling, and fully connected layers, to transform this input into
a final output vector y ∈ RK , where K is the number of output classes. The objective of AlexNet
is to minimize a loss function that measures the discrepancy between the predicted output and the
true label, typically using the cross-entropy loss function.
At the heart of AlexNet’s architecture are the convolutional layers, which are designed to
learn local patterns in the image by convolving a set of filters over the input image. Specifi-
cally, the first convolutional layer performs a convolution of the input image $I_0$ with a set of filters $W_1^{(k)} \in \mathbb{R}^{F_1 \times F_1 \times C}$, where $F_1$ is the size of the filter and $C$ is the number of channels in the input. The convolution operation for a given filter $W_1^{(k)}$ and input image $I_0$ at position $(i, j)$ is defined as:
$$Y_1^{(k)}(i, j) = \sum_{u=1}^{F_1} \sum_{v=1}^{F_1} \sum_{c=1}^{C} W_1^{(k)}(u, v, c) \cdot I_0(i + u - 1, j + v - 1, c) + b_1^{(k)} \tag{417}$$
where $b_1^{(k)}$ is the bias term for the $k$-th filter, and the output of this convolution is a feature map $Y_1^{(k)}(i, j)$ that captures the response of the filter at each spatial location $(i, j)$. The result of this convolution operation is a set of feature maps $Y_1^{(k)} \in \mathbb{R}^{H' \times W'}$, where the dimensions of the output are $H' = H - F_1 + 1$ and $W' = W - F_1 + 1$ if no padding is applied. Subsequent to the convolutional operation, the output feature maps $Y_1^{(k)}$ are passed through a ReLU (Rectified Linear Unit)
activation function, which introduces non-linearity into the network. The ReLU function is defined
as:
ReLU(z) = max(0, z) (418)
This function transforms negative values in the feature map $Y_1^{(k)}$ into zero, while leaving positive values unchanged, thus allowing the network to model complex, non-linear patterns in the data. The output of the ReLU activation function is denoted by $A_1^{(k)}(i, j) = \mathrm{ReLU}(Y_1^{(k)}(i, j))$. Following
the activation function, a max-pooling operation is performed to downsample the feature maps
and reduce their spatial dimensions. Given a pooling window of size P × P , the max-pooling
operation computes the maximum value in each window, which is mathematically expressed as:
$$Y_1^{\text{pool}}(i, j) = \max\left\{ A_1^{(k)}(i', j') : (i', j') \in \text{pooling window} \right\} \tag{419}$$
where $A_1^{(k)}$ is the feature map after ReLU, and the resulting pooled output $Y_1^{\text{pool}}(i, j)$ has reduced spatial dimensions, typically $H'' = H'/P$ and $W'' = W'/P$. This operation helps retain the most
important features while discarding irrelevant spatial details, which makes the network more robust
to small translations in the input image. The convolutional and pooling operations are repeated
across multiple layers, with each layer learning progressively more complex patterns from the input
data. In the second convolutional layer, for example, we convolve the feature maps from the first
layer $A_1^{(k)}$ with a new set of filters $W_2^{(k)} \in \mathbb{R}^{F_2 \times F_2 \times K_1}$, where $K_1$ is the number of feature maps produced by the first convolutional layer. The convolution for the second layer is expressed as:
$$Y_2^{(k)}(i, j) = \sum_{u=1}^{F_2} \sum_{v=1}^{F_2} \sum_{c=1}^{K_1} W_2^{(k)}(u, v, c) \cdot A_1^{(c)}(i + u - 1, j + v - 1) + b_2^{(k)} \tag{420}$$
This process is iterated for each subsequent convolutional layer, where each new set of filters learns
higher-level features, such as edges, textures, and object parts. The activation maps produced
by each convolutional layer are passed through the ReLU activation function, and max-pooling is
applied after each convolutional layer to reduce the spatial dimensions.
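To make the convolution of Eq. (417) concrete, the following is a minimal NumPy sketch with explicit loops; the input size, filter size, and bias value are illustrative assumptions, not AlexNet's actual configuration.

```python
# A minimal sketch of the valid convolution in Eq. (417), written with explicit
# loops for clarity. Shapes: I0 is H x W x C, W1 is F1 x F1 x C (one filter).
import numpy as np

def conv2d_valid(I0: np.ndarray, W1: np.ndarray, b1: float) -> np.ndarray:
    """Y1[i, j] = sum_{u,v,c} W1[u,v,c] * I0[i+u, j+v, c] + b1 (no padding)."""
    H, W, C = I0.shape
    F1 = W1.shape[0]
    H_out, W_out = H - F1 + 1, W - F1 + 1          # H' = H - F1 + 1, W' = W - F1 + 1
    Y1 = np.empty((H_out, W_out))
    for i in range(H_out):
        for j in range(W_out):
            Y1[i, j] = np.sum(W1 * I0[i:i + F1, j:j + F1, :]) + b1
    return Y1

rng = np.random.default_rng(0)
I0 = rng.standard_normal((8, 8, 3))                # toy 8x8 RGB input (assumption)
W1 = rng.standard_normal((3, 3, 3))                # one 3x3x3 filter (assumption)
A1 = np.maximum(0.0, conv2d_valid(I0, W1, 0.1))    # ReLU activation, Eq. (418)
print(A1.shape)                                    # (6, 6)
```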
After the last convolutional layer, the feature maps are flattened into a 1D vector af ∈ RN , where
N is the total number of activations across all channels and spatial dimensions. This flattened
vector is then passed to fully connected (FC) layers for classification. Each fully connected
layer performs a linear transformation, followed by a non-linear activation. The output of the i-th
neuron in the fully connected layer is given by:
$$z_i = \sum_{j=1}^{N} W_{ij} \cdot a_f(j) + b_i \tag{421}$$
where Wij is the weight connecting neuron j in the previous layer to neuron i in the current layer,
and bi is the bias term. The output of the fully connected layer is a vector of class scores z ∈ RK ,
which represents the unnormalized log-probabilities of the input image belonging to each class. To
convert these scores into a valid probability distribution, the softmax function is applied:
$$\sigma(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \tag{422}$$
The softmax function ensures that the output values are in the range [0, 1] and sum to 1, thus
representing a probability distribution over the K classes. The final output of the network is a
probability vector ŷ ∈ RK , where each element ŷi corresponds to the predicted probability that
the input image belongs to class i. To train the AlexNet model, the network minimizes the cross-
entropy loss function between the predicted probabilities ŷ and the true labels y, which is given
by:
$$L = -\sum_{i=1}^{K} y_i \log(\hat{y}_i) \tag{423}$$
where yi is the true label (1 if the image belongs to class i, 0 otherwise), and ŷi is the predicted
probability for class i. The goal of training is to adjust the weights W and biases b in the network
to minimize this loss. The parameters of the network are updated using gradient descent. To
compute the gradients, the backpropagation algorithm is used. The gradient of the loss with
respect to the weights W in a fully connected layer is given by:
$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial W} \tag{424}$$
where $\frac{\partial L}{\partial z}$ is the gradient of the loss with respect to the output of the layer, and $\frac{\partial z}{\partial W}$ is the gradient
of the output with respect to the weights. These gradients are then used to update the weights
using the gradient descent update rule:
$$W \leftarrow W - \eta \cdot \frac{\partial L}{\partial W} \tag{425}$$
where η is the learning rate. This process is repeated iteratively for each layer of the network.
Regularization techniques such as dropout are often applied to prevent overfitting during training.
Dropout involves randomly setting a fraction of the activations to zero during each training step,
which helps prevent the network from relying too heavily on any one feature and encourages the
model to learn more robust features. Once trained, the AlexNet model can be used to classify
new images by passing them through the network and selecting the class with the highest proba-
bility. The combination of convolutional layers, ReLU activations, pooling, fully connected layers,
and softmax activation makes AlexNet a powerful and efficient architecture for image classification
tasks.
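As a hedged illustration of the training pipeline just described (convolution, ReLU, pooling, dropout-regularized fully connected layers, cross-entropy loss, and the SGD update of Eq. (425)), here is a condensed PyTorch sketch; the layer sizes and input resolution are toy values and do not reproduce the original AlexNet configuration.

```python
# A condensed, hypothetical AlexNet-style training step: conv -> ReLU -> max-pool
# blocks, flatten, dropout-regularized FC layer, cross-entropy loss (Eq. 423),
# and an SGD update (Eq. 425). All sizes are illustrative assumptions.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),   # conv/ReLU/pool
    nn.Conv2d(16, 32, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Dropout(p=0.5),                         # dropout regularization
    nn.Linear(32 * 6 * 6, 10),                 # class scores z in R^K, K = 10
)
opt = torch.optim.SGD(model.parameters(), lr=0.01)   # W <- W - eta * dL/dW
loss_fn = nn.CrossEntropyLoss()                      # softmax + cross-entropy

x = torch.randn(4, 3, 32, 32)          # toy batch of RGB images (assumption)
y = torch.randint(0, 10, (4,))         # true class labels
loss = loss_fn(model(x), y)
opt.zero_grad()
loss.backward()                        # backpropagation computes dL/dW
opt.step()
```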
7.4.2 ResNet
At the heart of the ResNet architecture lies the notion of residual learning: instead of learning the direct transformation y = f(x; W), the network learns the residual function F(x; W), i.e., the difference between the output and the input. The network output y can therefore be expressed as:
y = F(x; W) + x (426)
This formulation represents the core difference from traditional neural networks, where the model learns a mapping directly from the input x to the output y. The identity shortcut connection provides a powerful mechanism: the network need only learn the residual, and if the optimal underlying transformation is the identity, the residual can be driven to zero so that y = x, which eases optimization. This reduces the challenge of training deeper networks, where deep layers often lead to vanishing gradients, because the gradient can propagate directly through these shortcuts, bypassing intermediate layers.
Let’s formalize this residual learning. Let the input to the residual block be xl and the output
yl . In a conventional neural network, the transformation from input to output at the l-th layer
would be:
yl = F(xl ; Wl ) (427)
where F represents the function learned by the layer, parameterized by Wl . In contrast, for ResNet,
the output is the sum of the learned residual function F(xl ; Wl ) and the input xl itself, yielding:
yl = F(xl ; Wl ) + xl (428)
This addition of the identity shortcut connection enables the network to bypass layers if needed,
facilitating the learning process and addressing the vanishing gradient issue. To formalize the
optimization problem, we define the residual learning objective as the minimization of the loss
function L with respect to the parameters Wl :
$$L = \sum_{i=1}^{N} L_i(y_i, t_i) \tag{429}$$
where N is the number of training samples, ti are the target outputs, and Li is the loss for the i-th
sample. The training process involves adjusting the parameters Wl via gradient descent, which
in turn requires the gradients of the loss function with respect to the network parameters. The
gradient of L with respect to Wl can be expressed as:
$$\frac{\partial L}{\partial W_l} = \sum_{i=1}^{N} \frac{\partial L_i}{\partial y_i} \cdot \frac{\partial y_i}{\partial W_l} \tag{430}$$
Since the residual block adds the input directly to the output, the derivative of the output with
respect to the weights Wl is given by:
$$\frac{\partial y_l}{\partial W_l} = \frac{\partial \mathcal{F}(x_l; W_l)}{\partial W_l} \tag{431}$$
Now, let’s explore how this addition of the residual connection directly influences the backpropa-
gation process. In a traditional feedforward network, the backpropagated gradients for each layer
depend solely on the output of the preceding layer. However, in a residual network, the gradient
flow is enhanced because the identity mapping xl is directly passed to the subsequent layer. This
ensures that the gradients will not be lost as the network deepens, a phenomenon that becomes
critical in very deep networks. The gradient with respect to the loss L at layer l is:
$$\frac{\partial L}{\partial x_l} = \frac{\partial L}{\partial y_l} \cdot \frac{\partial y_l}{\partial x_l} \tag{432}$$
Since yl = F(xl ; Wl ) + xl , the derivative of yl with respect to xl is:
$$\frac{\partial y_l}{\partial x_l} = I + \frac{\partial \mathcal{F}(x_l; W_l)}{\partial x_l} \tag{433}$$
where $I$ is the identity matrix. This ensures that the gradient $\frac{\partial L}{\partial x_l}$ can propagate more easily
through the network, as it is now augmented by the identity matrix term. Thus, this term helps
preserve the gradient’s magnitude during backpropagation, solving the vanishing gradient problem
that typically arises in deep networks. Furthermore, to ensure that the dimensions of the input
and output of a residual block match, especially when the number of channels changes, ResNet
introduces projection shortcuts. These are used when the dimensionalities of xl and yl do not align,
typically through a 1 × 1 convolution. The projection shortcut modifies the residual block’s output
to be:
yl = F(xl ; Wl ) + Wx · xl (434)
where Wx is a convolutional filter, and F(xl ; Wl ) is the residual transformation. The introduction
of the 1×1 convolution ensures that the input xl is mapped to the appropriate dimensionality, while
still benefiting from the residual learning framework. The ResNet architecture can be extended by
stacking multiple residual blocks. For a network with L layers, the output after passing through
the entire network can be written recursively as:
$$y^{(L)} = \mathcal{F}\big(y^{(L-1)}; W^{(L)}\big) + y^{(L-1)}, \tag{435}$$
where $y^{(L-1)}$ is the output after L − 1 layers. The recursive nature of this formula ensures that
the network’s output is built layer by layer, with each layer contributing a transformation rela-
tive to the input passed to it. Mathematically, the gradient of the loss function with respect to
the parameters in deep residual networks can be expressed recursively, where each layer’s gradient
involves contributions from the identity shortcut connection. This facilitates the training of very
deep networks by maintaining a stable and consistent flow of gradients during the backpropagation
process.
Thus, the Residual Neural Network (ResNet) significantly improves the trainability of deep neural
networks by introducing residual learning, allowing the network to focus on learning the difference
between the input and output rather than the entire transformation. This approach, combined
with identity shortcut connections and projection shortcuts for dimensionality matching, ensures
that gradients flow effectively through the network, even in very deep architectures. The resulting
ResNet architecture has been proven to enable the training of networks with hundreds of layers,
yielding impressive performance on a wide range of tasks, from image classification to semantic
segmentation, while mitigating issues such as vanishing gradients. Through its recursive structure
and rigorous mathematical formulation, ResNet has become a foundational architecture in modern
deep learning.
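A minimal PyTorch sketch of a residual block implementing y = F(x; W) + x (Eq. 426) with an optional 1×1 projection shortcut (Eq. 434) may clarify the structure; channel counts are illustrative, and batch normalization, which the published ResNet block also uses, is omitted for brevity.

```python
# A minimal sketch of a residual block: y = F(x; W) + x (Eq. 426), with a
# 1x1 projection shortcut W_x when channel counts differ (Eq. 434).
# Illustrative simplification only; real ResNet blocks include BatchNorm.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.f = nn.Sequential(                       # residual function F(x; W)
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
        )
        # identity shortcut if shapes match, else 1x1 projection W_x
        self.shortcut = (nn.Identity() if in_ch == out_ch
                         else nn.Conv2d(in_ch, out_ch, 1))

    def forward(self, x):
        return torch.relu(self.f(x) + self.shortcut(x))   # y = F(x) + shortcut(x)

x = torch.randn(1, 16, 8, 8)
print(ResidualBlock(16, 32)(x).shape)   # torch.Size([1, 32, 8, 8])
```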
7.4.3 VGG
The Visual Geometry Group (VGG) Convolutional Neural Network (CNN), introduced
by Simonyan and Zisserman in 2014, presents a detailed exploration of the effect of depth on the
performance of deep neural networks, specifically within the context of computer vision tasks such
as image classification. The VGG architecture is grounded in the hypothesis that deeper networks,
when constructed with small, consistent convolutional kernels, are more capable of capturing hi-
erarchical patterns in data, particularly in the domain of visual recognition. In contrast to other
CNN architectures, VGG prioritizes the usage of small 3 × 3 convolution filters (with a stride of 1)
stacked in increasing depth, rather than relying on larger filters (e.g., 5 × 5 or 7 × 7), thus offering
computational benefits without sacrificing representational power. This design choice inherently
encourages sparse local receptive fields, which ensures a richer learning capacity when extended
across deeper layers.
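The design rule just described, stacks of small 3×3 convolutions followed by 2×2 max-pooling, can be sketched as follows; the channel counts and depths are illustrative toy values, not the published VGG configurations.

```python
# An illustrative sketch of the VGG design rule: stacks of 3x3 convolutions
# (stride 1, padding 1) followed by 2x2 max-pooling with stride 2, so depth
# grows while each filter stays small. Channel counts are assumptions.
import torch
import torch.nn as nn

def vgg_block(in_ch: int, out_ch: int, n_convs: int) -> nn.Sequential:
    layers = []
    for _ in range(n_convs):
        layers += [nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU()]
        in_ch = out_ch
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))   # halves H and W
    return nn.Sequential(*layers)

features = nn.Sequential(
    vgg_block(3, 16, n_convs=2),    # two stacked 3x3 convs ~ one 5x5 receptive field
    vgg_block(16, 32, n_convs=2),
)
x = torch.randn(1, 3, 32, 32)
print(features(x).shape)            # torch.Size([1, 32, 8, 8])
```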
Let I ∈ RH×W ×C represent an input image of height H, width W , and C channels, where the
channels correspond to different color representations (e.g., RGB for C = 3). For the convolution
operation applied at a particular layer k, the output feature map O(k) can be computed by con-
volving the input I with a set of kernels K (k) corresponding to the k-th layer. The convolution for
each spatial location i, j can be described as:
$$O_{i,j}^{(k)} = \sum_{u=1}^{k_h} \sum_{v=1}^{k_w} \sum_{c'=1}^{C_{\text{in}}} K_{u,v,c',c}^{(k)} \, I_{i+u,\, j+v,\, c'} + b_c^{(k)} \tag{436}$$
where $O_{i,j}^{(k)}$ is the output value at location $(i, j)$ of the feature map for the $k$-th filter, $K_{u,v,c',c}^{(k)}$ is the $(u, v)$-th spatial element of the $c'$-to-$c$ filter in layer $k$, and $b_c^{(k)}$ represents the bias term for the
output channel c. The convolutional layer’s kernel K (k) is typically initialized with small values
and learned during training, while the bias b(k) is added to shift the activation of the neuron. A key
aspect of the VGG architecture is that these convolution layers are consistently followed by non-
linear ReLU (Rectified Linear Unit) activation functions, which introduce local non-linearity
to the model. The ReLU function is mathematically defined as:
ReLU(x) = max(0, x) (437)
This transformation is applied element-wise, ensuring that negative values are mapped to zero,
which, as an effect, activates only positive feature responses. The non-linearity introduced by
ReLU aids the network in learning complex patterns and overcoming issues such as vanishing
gradients that often arise in deeper networks. In VGG, the network is constructed by stacking
these convolutional layers with ReLU activations. Each convolution layer is followed by max-
pooling operations, typically with 2 × 2 filters and a stride of 2. Max-pooling reduces the spatial
dimensions of the feature maps and extracts the most significant features from each region of the
image. The max-pooling operation is mathematically expressed as:
$$O_{i,j} = \max_{(u,v) \in P} I_{i+u,\, j+v} \tag{438}$$
where P is the pooling window, and Oi,j is the pooled value at position (i, j). The pooling oper-
ation performs downsampling, ensuring translation invariance while retaining the most prominent
features. The effect of this pooling operation is to reduce computational complexity, lower the
number of parameters, and make the network invariant to small translations and distortions in
the input image. The architecture of VGG typically culminates in a series of fully connected
(FC) layers after several convolutional and pooling layers have extracted relevant features from
the input image. Let the output of the final convolutional layer, after flattening, be denoted as
X ∈ Rd , where d represents the dimensionality of the feature vector obtained by flattening the last
convolutional feature map. The fully connected layers then transform this vector into the output,
as expressed by:
O = WX + b (439)
where $W \in \mathbb{R}^{d' \times d}$ is the weight matrix of the fully connected layer, $b \in \mathbb{R}^{d'}$ is the bias vector, and $O \in \mathbb{R}^{d'}$ is the output vector. The output vector O represents the unnormalized scores for each
of the d′ possible classes in a classification task. This is typically followed by the application of a
softmax function to convert these raw scores into a probability distribution:
$$\sigma(o_i) = \frac{e^{o_i}}{\sum_{j=1}^{d'} e^{o_j}} \tag{440}$$
where oi is the score for class i, and the softmax function ensures that the outputs are positive
and sum to one, facilitating their interpretation as class probabilities. This softmax function is
a crucial step in multi-class classification tasks as it normalizes the output into a probabilistic
format. During the training phase, the model minimizes the cross-entropy loss between the
predicted probabilities and the actual class labels, often represented as one-hot encoded vectors.
The cross-entropy loss is given by:
$$L = -\sum_{i=1}^{d'} y_i \log(p_i) \tag{441}$$
where yi is the true label for class i in one-hot encoded form, and pi is the predicted probability
for class i. This loss function is the appropriate objective for classification tasks, as it measures
the difference between the true and predicted probability distributions. The optimization of the
parameters in the VGG network is carried out using stochastic gradient descent (SGD) or its
variants. The weight update rule in gradient descent is:
W ← W − η∇W L (442)
where η is the learning rate, and ∇W L is the gradient of the loss with respect to the weights.
The gradient is computed through backpropagation, applying the chain rule of derivatives to
propagate errors backward through the network, updating the weights at each layer based on the
contribution of each parameter to the final output error.
A key advantage of the VGG architecture lies in its use of smaller, deeper layers compared to
previous networks like AlexNet, which used larger convolution filters. By using multiple small
kernels (such as 3 × 3), the VGG network can create richer representations without exponentially
increasing the number of parameters. The depth of the network, achieved by stacking these small
convolution filters, enables the model to extract increasingly abstract and hierarchical features from
the raw pixel data. Despite its success, VGG’s computational demands are relatively high due to
the large number of parameters, especially in the fully connected layers. The fully connected lay-
ers, which connect every neuron in one layer to every neuron in the next, account for a significant
portion of the model’s total parameters. To mitigate this limitation, later architectures, such as
ResNet, introduced skip connections, which allow gradients to flow more efficiently through the
network, thus enabling even deeper architectures without incurring the same computational costs.
Nevertheless, the VGG network set an important precedent in the design of deep convolutional net-
works, demonstrating the power of deep architectures and the effectiveness of small convolutional
filters. The model’s simplicity and straightforward design have influenced subsequent architectures,
reinforcing the notion that deeper models, when carefully constructed, can achieve exceptional per-
formance on complex tasks like image classification, despite the challenges posed by computational
cost and model complexity.
Literature Review: RNNs have long been applied to sequential tasks such as speech and handwriting recognition. Bengio et. al. (1994) [269] mathematically proved why RNNs struggle to learn long-term dependencies, identifying the root causes of the vanishing and
exploding gradient problems, setting the stage for future architectures like LSTMs. Bhattamishra
et. al. (2020) [270] rigorously compared the theoretical capabilities of RNNs and Transformers.
The authors analyze expressiveness, memory retention, and training efficiency, providing insights
into why Transformers are increasingly replacing RNNs in NLP. Siegelmann (1993) [271] provided a
rigorous theoretical treatment of RNNs, analyzing their convergence properties, stability conditions,
and computational complexity. It discusses mathematical frameworks for understanding RNN
generalization and optimization challenges.
Formally, given an input sequence $\{x_1, \ldots, x_T\}$ with $x_t \in \mathbb{R}^n$, the hidden state $h_t \in \mathbb{R}^m$ is updated via the recurrence
$$h_t = f_h(W_{xh} x_t + W_{hh} h_{t-1} + b_h), \tag{443}$$
where $W_{xh} \in \mathbb{R}^{m \times n}$ is the input-to-hidden weight matrix, $W_{hh} \in \mathbb{R}^{m \times m}$ is the hidden-to-hidden weight matrix, $b_h \in \mathbb{R}^m$ is the bias vector, and $f_h$ is a non-linear activation function, typically
$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \tag{444}$$
or the rectified linear unit ReLU(x) = max(0, x). The recursive nature of this update equation
ensures that ht inherently encodes information about the sequence {x1 , x2 , . . . , xt }, allowing the
network to maintain a dynamic representation of past inputs. The output yt ∈ Ro at time t is
computed as:
yt = fy (Why ht + by ) , (445)
where Why ∈ Ro×m is the hidden-to-output weight matrix, by ∈ Ro is the output bias, and fy is
an activation function such as the softmax function:
$$f_y(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{o} e^{z_j}} \tag{446}$$
for classification tasks. Expanding the recurrence relation iteratively, the hidden state at time $t$ can be expressed as a nested composition
$$h_t = f_h\big(W_{xh} x_t + W_{hh}\, f_h(W_{xh} x_{t-1} + W_{hh} h_{t-2} + b_h) + b_h\big), \tag{447}$$
unrolled down to the initial state $h_0$. During training, the total loss $L = \sum_{t=1}^{T} \ell_t(y_t, y_t^{\text{true}})$ is computed through backpropagation through time (BPTT). The gradient of L with respect to
Whh , for instance, is given by:
$$\frac{\partial L}{\partial W_{hh}} = \sum_{t=1}^{T} \sum_{k=1}^{t} \frac{\partial \ell_t}{\partial h_t} \left( \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} \right) \frac{\partial h_k}{\partial W_{hh}}, \tag{450}$$
where $\prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}}$ represents the chain of derivatives from time step $k$ to $t$. Unlike feedforward
neural networks, where each input is processed independently, RNNs maintain a hidden state ht
that acts as a dynamic memory, evolving recursively as the input sequence progresses. Formally,
given an input sequence {x1 , x2 , . . . , xT }, where xt ∈ Rn represents the input vector at time t, the
hidden state ht ∈ Rm is updated via the recurrence relation:
$$h_t = f_h(W_{xh} x_t + W_{hh} h_{t-1} + b_h), \tag{451}$$
where $W_{xh} \in \mathbb{R}^{m \times n}$, $W_{hh} \in \mathbb{R}^{m \times m}$, and $b_h \in \mathbb{R}^m$ are learnable parameters, and $f_h$ is a nonlinear
activation function such as tanh or ReLU. The recursive structure inherently allows the hidden
state ht to encode the entire history of the sequence up to time t. The output yt ∈ Ro at each time
step is computed as:
yt = fy (Why ht + by ), (452)
where Why ∈ Ro×m and by ∈ Ro are additional learnable parameters, and fy is an optional output
activation function, such as the softmax function for classification. To elucidate the recursive
dynamics, we can expand ht explicitly in terms of the initial hidden state h0 and all previous
inputs {x1 , . . . , xt }:
$$h_t = f_h\big(W_{xh} x_t + W_{hh}\, f_h\big(W_{xh} x_{t-1} + W_{hh}\, f_h(\cdots f_h(W_{xh} x_1 + W_{hh} h_0 + b_h) \cdots) + b_h\big) + b_h\big). \tag{453}$$
This nested structure highlights the temporal dependencies and the potential challenges in training,
such as the vanishing and exploding gradient problems. During training, the loss function L, which
aggregates the discrepancies between the predicted outputs yt and the ground truth yttrue , is typically
defined as:
$$L = \sum_{t=1}^{T} \ell(y_t, y_t^{\text{true}}), \tag{454}$$
where ℓ is a task-specific loss function, such as the mean squared error (MSE)
$$\ell(y, y^{\text{true}}) = \frac{1}{2} \|y - y^{\text{true}}\|^2 \tag{455}$$
for regression or the cross-entropy loss for classification. To optimize L, gradient-based methods
are employed, requiring the computation of derivatives of L with respect to all parameters, such as
Wxh , Whh , and bh . Using backpropagation through time (BPTT), the gradient of L with respect
to Whh is expressed as:
$$\frac{\partial L}{\partial W_{hh}} = \sum_{t=1}^{T} \sum_{k=1}^{t} \frac{\partial \ell_t}{\partial h_t} \left( \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} \right) \frac{\partial h_k}{\partial W_{hh}}. \tag{456}$$
Here,
$$\prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} \tag{457}$$
is the product of Jacobian matrices that encode the influence of $h_k$ on $h_t$. The Jacobian $\frac{\partial h_j}{\partial h_{j-1}}$ itself
is given by:
$$\frac{\partial h_j}{\partial h_{j-1}} = W_{hh} \odot f_h'(a_j), \tag{458}$$
where
aj = Wxh xj + Whh hj−1 + bh , (459)
and fh′ (aj ) denotes the elementwise derivative of the activation function. The repeated multipli-
cation of these Jacobians can lead to exponential growth or decay of the gradients, depending on
the spectral radius ρ(Whh ). Specifically, if ρ(Whh ) > 1, gradients explode, whereas if ρ(Whh ) < 1,
gradients vanish, severely hampering the training process for long sequences. To address these
issues, modifications such as Long Short-Term Memory (LSTM) networks and Gated Recurrent
Units (GRUs) introduce gating mechanisms that explicitly regulate the flow of information. In
LSTMs, the cell state ct , governed by additive dynamics, prevents vanishing gradients. The cell
state is updated as:
ct = ft ⊙ ct−1 + it ⊙ tanh(Wc xt + Uc ht−1 + bc ), (460)
where ft is the forget gate, it is the input gate, and Uc , Wc , and bc are learnable parameters.
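As a concrete illustration of the recurrence in Eqs. (451)-(452), the following NumPy sketch unrolls a vanilla RNN over a toy sequence; all dimensions and weight scales are illustrative assumptions.

```python
# A minimal sketch of the vanilla RNN recurrence
# h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)  and  y_t = W_hy h_t + b_y,
# unrolled over a toy sequence. Dimensions n, m, o follow the text.
import numpy as np

rng = np.random.default_rng(0)
n, m, o, T = 4, 8, 3, 5                       # input, hidden, output dims; length
Wxh = rng.standard_normal((m, n)) * 0.1       # toy initialization (assumption)
Whh = rng.standard_normal((m, m)) * 0.1
Why = rng.standard_normal((o, m)) * 0.1
bh, by = np.zeros(m), np.zeros(o)

xs = rng.standard_normal((T, n))
h = np.zeros(m)                               # initial hidden state h_0
for t in range(T):
    h = np.tanh(Wxh @ xs[t] + Whh @ h + bh)   # recurrence: h_t encodes x_1..x_t
    y = Why @ h + by                          # output at time t
    print(t, np.round(y, 3))
```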
8.2 Sequence Modeling and Long Short-Term Memory (LSTM) and GRUs
Literature Review: Potter and Egon (2024) [387] provided an extensive study of RNNs and
their enhancements (LSTM and GRU) for time-series forecasting. The authors conduct an empir-
ical comparison between these architectures and analyze their effectiveness in capturing long-term
dependencies in sequential data. The study concludes that GRUs are computationally efficient
but slightly less expressive than LSTMs, whereas standard RNNs suffer from vanishing gradients.
Yatkin et. al. (2025) [388] introduced a topological perspective to RNNs, including LSTM and
GRU, to address inconsistencies in real-world applications. The authors propose stability-enhancing
mechanisms to improve RNN performance in finance and climate modeling. Their results show that
topologically-optimized GRUs outperform traditional LSTMs in maintaining memory over long se-
quences. Saifullah (2024) [389] applied LSTM and GRU networks to biomedical image classification
(chicken egg fertility detection). The paper demonstrates that GRU’s simpler architecture leads to
faster convergence while LSTMs achieve slightly higher accuracy due to better memory retention.
The results highlight domain-specific strengths of LSTM vs. GRU, particularly in handling sparse
feature representations. Alonso (2024) [390] rigorously explored the mathematical foundations of
RNNs, LSTMs, and GRUs. The author provides a deep analysis of gating mechanisms, vanishing
gradient solutions, and optimization techniques that improve sequence modeling. A theoretical
comparison is drawn between hidden state dynamics in GRUs vs. LSTMs, supporting their appli-
cation in NLP and time-series forecasting. Tu et. al. (2024) [391] in a medical AI study evaluates
LSTMs and GRUs for predicting patient physiological metrics during sedation. The authors find
that LSTMs retain more long-term dependencies in time-series medical data, making them suit-
able for patient monitoring, while GRUs are preferable for real-time predictions due to their lower
computational overhead. Zuo et. al. (2025) [392] applied hybrid GRUs for predicting customer
movements in stores using real-time location tracking. The authors propose a modified GRU-
LSTM hybrid model that achieves state-of-the-art accuracy in trajectory prediction. The study
demonstrates that GRUs alone may lack fine-grained memory retention, but a hybrid approach
improves forecasting ability. Lima et. al. (2025) [393] developed an industrial AI application that
demonstrated the efficiency of GRUs in process optimization. The study finds that GRUs out-
perform LSTMs in real-time predictive control of steel slab heating, showcasing their efficiency in
applications where faster computations are required. Khan et. al. (2025) [394] integrated LSTMs
with statistical ARIMA models to improve wind power forecasting. They demonstrate that hybrid
LSTM-ARIMA models outperform standalone RNNs in handling weather-related sequential data,
which is highly volatile. Guo and Feng (2024) [395] in an environmental AI study proposed a
temporal attention-enhanced LSTM model to predict greenhouse climate variables. The research
introduces a novel position-aware LSTM architecture that improves multi-step forecasting, which
is critical for precision agriculture. Abdelhamid (2024) [396] explored IoT-based energy forecasting
using deep RNN architectures, including LSTM and GRU. The study concludes that GRUs provide
faster inference speeds but LSTMs capture more accurate long-range dependencies, making them
more reliable for complex forecasting.
Sequence modeling in Recurrent Neural Networks (RNNs) represents a powerful framework for
capturing temporal dependencies in sequential data, enabling the learning of both short-term and
long-term patterns. The primary characteristic of RNNs lies in their recurrent architecture, where
the hidden state ht at time step t is updated as a function of both the current input xt and the
hidden state at the previous time step ht−1 . Mathematically, this recurrent relationship can be
expressed as:
ht = f (Wh ht−1 + Wx xt + bh ) (461)
Here, Wh and Wx are weight matrices corresponding to the previous hidden state ht−1 and the
current input xt , respectively, while bh is a bias term. The function f (·) is a non-linear activation
function, typically chosen as the hyperbolic tangent tanh or rectified linear unit (ReLU). The output
yt at each time step is derived from the hidden state ht through a linear transformation, followed
by a non-linear activation, yielding:
yt = g(Wy ht + by ) (462)
where Wy is the weight matrix connecting the hidden state to the output space, and by is the
associated bias term. The function g(·) is generally a softmax activation for classification tasks or
a linear activation for regression problems. The key feature of this structure is the interdependence
of the hidden state across time steps, allowing the model to capture the history of past inputs and
produce predictions that incorporate temporal context. Training an RNN involves minimizing a
loss function L, which quantifies the discrepancy between the predicted outputs yt and the true
outputs yttrue across all time steps. A common loss function used in classification tasks is the cross-
entropy loss, while regression tasks often utilize mean squared error. To optimize the parameters
of the network, the model employs Backpropagation Through Time (BPTT), a variant of the
standard backpropagation algorithm adapted for sequential data. The primary challenge in BPTT
arises from the recurrent nature of the network, where the hidden state at each time step depends
on the previous hidden state. The gradient of the loss function with respect to the hidden state at
time step t is computed recursively, reflecting the temporal structure of the model. The chain rule
is applied to compute the gradient of the loss with respect to the hidden state:
$$\frac{\partial L}{\partial h_t} = \frac{\partial L}{\partial y_t} \cdot \frac{\partial y_t}{\partial h_t} + \sum_{t'=t+1}^{T} \frac{\partial L}{\partial h_{t'}} \cdot \frac{\partial h_{t'}}{\partial h_t} \tag{463}$$
Here, $\frac{\partial L}{\partial y_t}$ is the gradient of the loss with respect to the output, and $\frac{\partial y_t}{\partial h_t}$ represents the Jacobian of
the output with respect to the hidden state. The second term in this expression corresponds to the
accumulated gradients propagated from future time steps, incorporating the temporal dependencies
across the entire sequence. This recursive gradient calculation allows for updating the weights and
biases of the RNN, adjusting them to minimize the total error across the sequence. The gradients
of the loss function with respect to the parameters of the network, such as Wh , Wx , and Wy , are
computed using the chain rule. For example, the gradient of the loss with respect to Wx is:
$$\frac{\partial L}{\partial W_x} = \sum_{t=1}^{T} \frac{\partial L}{\partial h_t} \cdot x_t^{\top} \tag{464}$$
This captures the contribution of each input to the overall error at all time steps, ensuring that the
model learns the correct relationships between inputs and hidden states. Similarly, the gradients
with respect to Wh and bh account for the recurrence in the hidden state, enabling the model
to adjust its internal parameters in response to the sequential nature of the data. Despite their
theoretical elegance, RNNs face significant practical challenges during training, primarily due to
the vanishing gradients problem. This issue arises when the gradients propagate through
many time steps, causing them to decay exponentially, especially when using activation functions
like tanh. As a result, the influence of distant time steps diminishes, making it difficult for the
network to learn long-term dependencies. The mathematical manifestation of this problem is seen
in the norm of the Jacobian matrices associated with the hidden state updates. If the spectral
radius of the weight matrices Wh is close to or greater than 1, the gradients can either vanish or
explode, leading to unstable training dynamics. To mitigate this issue, several solutions have been
proposed, including the use of Long Short-Term Memory (LSTM) networks and Gated Recurrent
Units (GRUs), which introduce gating mechanisms to better control the flow of information through
the network. LSTMs, for example, incorporate a memory cell Ct , which allows the network to store
information over long periods of time. The update rules for the LSTM are governed by three gates:
the forget gate ft , the input gate it , and the output gate ot , which control how much of the previous
memory and new information to retain. The equations governing the LSTM are:
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f), \quad i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i), \quad o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o),$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tanh(W_C x_t + U_C h_{t-1} + b_C), \qquad h_t = o_t \odot \tanh(C_t),$$
where $\sigma$ denotes the logistic sigmoid and $\odot$ elementwise multiplication.
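The following NumPy sketch implements one step of these gate equations; the weight shapes and toy initialization are assumptions made purely for illustration.

```python
# A sketch of one LSTM step implementing the gate equations above.
# sigma is the logistic sigmoid; all weights are toy random values (assumption).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, m = 4, 6                                     # input and hidden sizes
W = {g: rng.standard_normal((m, n)) * 0.1 for g in "fioc"}
U = {g: rng.standard_normal((m, m)) * 0.1 for g in "fioc"}
b = {g: np.zeros(m) for g in "fioc"}

def lstm_step(x, h_prev, C_prev):
    f = sigmoid(W["f"] @ x + U["f"] @ h_prev + b["f"])      # forget gate
    i = sigmoid(W["i"] @ x + U["i"] @ h_prev + b["i"])      # input gate
    o = sigmoid(W["o"] @ x + U["o"] @ h_prev + b["o"])      # output gate
    C_tilde = np.tanh(W["c"] @ x + U["c"] @ h_prev + b["c"])
    C = f * C_prev + i * C_tilde                            # additive cell update
    h = o * np.tanh(C)
    return h, C

h, C = np.zeros(m), np.zeros(m)
h, C = lstm_step(rng.standard_normal(n), h, C)
print(np.round(h, 3))
```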
In summary, sequence modeling in RNNs involves a series of recurrent updates to the hidden
state, driven by both the current input and the previous hidden state, and is trained via backprop-
agation through time. The introduction of specialized gating mechanisms in LSTMs and GRUs
alleviates issues such as vanishing gradients, enabling the networks to learn and maintain long-term
dependencies. Through these advanced architectures, RNNs can effectively model complex tem-
poral relationships, making them powerful tools for tasks such as time-series prediction, natural
language processing, and sequence generation.
8.3 Applications in Natural Language Processing
Literature Review: Yang et. al. (2020) [377] explored the effectiveness of deep learning mod-
els, including RNNs, for sentiment analysis in e-commerce platforms. It emphasizes how RNN
architectures, including LSTMs and GRUs, outperform traditional NLP techniques by capturing
sequential dependencies in customer reviews. The study provides empirical evidence demonstrat-
ing the superior accuracy of RNNs in analyzing consumer sentiment. Manikandan et. al. (2025)
[378] investigated how RNNs can improve spam detection in email filtering. By leveraging re-
current structures, the study demonstrates how RNNs effectively identify patterns in email text
that indicate spam or phishing attempts. It also compares RNN-based models with other ML
approaches, highlighting the robustness of RNNs in handling contextual word sequences. Isiaka
et. al. (2025) [379] examined AI technologies, particularly deep learning models, for predictive
healthcare applications. It highlights how RNNs can analyze patient records and medical reports
using NLP techniques. The study shows that RNN-based NLP models enhance medical diagnos-
tics and decision-making by extracting meaningful insights from unstructured text data. Petrov
et. al. (2025) [380] discussed the role of RNNs in emotion classification from textual data, an
essential NLP task. The paper evaluates various RNN-based architectures, including BiLSTMs, to
enhance the accuracy of emotion recognition in social media texts and chatbot responses. Liang
(2025) [381] focused on the application of RNNs in educational settings, specifically for automated
grading and feedback generation. The study presents an RNN-based NLP system capable of ana-
lyzing student responses, providing real-time assessments, and generating contextual feedback. Jin
(2025) [382] explored how RNNs optimize text generation tasks related to pharmaceutical edu-
cation. It demonstrates how NLP-powered RNN models generate high-quality textual summaries
from medical literature, ensuring accurate knowledge dissemination in the pharmaceutical indus-
try. McNicholas et. al. (2025) [383] investigated how RNNs facilitate clinical decision-making in
critical care by extracting insights from unstructured medical text. The research highlights how
RNN-based NLP models enhance patient care by predicting outcomes based on clinical notes and
physician reports. Abbas and Khammas (2024) [384] introduced an RNN-based NLP technique
for detecting malware in IoT networks. The study illustrates how RNN classifiers process logs
and textual patterns to identify malicious software, making RNNs crucial in cybersecurity appli-
cations. Kalonia and Upadhyay (2025) [385] applied RNNs to software fault prediction using NLP
techniques. It shows how recurrent networks analyze bug reports and software documentation to
predict potential failures in software applications, aiding developers in proactive debugging. Han
et. al. (2025) [386] discussed RNN applications in conversational AI, focusing on chatbots and
virtual assistants. The study presents an RNN-driven NLP model for improving dialogue manage-
ment and user interactions, significantly enhancing the responsiveness of AI-powered chat systems.
Recurrent Neural Networks (RNNs) are deep learning architectures that are explicitly designed
to handle sequential data, a key feature that makes them indispensable for applications in Natural
Language Processing (NLP). The mathematical foundation of RNNs lies in their ability to process
sequences of inputs, x1 , x2 , . . . , xT , where T denotes the length of the sequence. At each time step
t, the network updates its hidden state, ht , using both the current input xt and the previous hidden
state ht−1 . This recursive relationship is represented mathematically by the following equation:
$$h_t = \sigma(W_h h_{t-1} + W_x x_t + b) \tag{475}$$
Here, σ is a nonlinear activation function such as the hyperbolic tangent (tanh) or the rectified
linear unit (ReLU), Wh is the weight matrix associated with the previous hidden state ht−1 , Wx
is the weight matrix associated with the current input xt , and b is a bias term. The nonlinearity
introduced by σ allows the network to learn complex relationships between the input and the
output. The output yt at each time step is computed as:
yt = Wy ht + c (476)
where Wy is the weight matrix corresponding to the output and c is the bias term for the output.
The output yt is then used to compute the predicted probability distribution over possible outputs
at each time step, typically through a softmax function for classification tasks:
$$\hat{y}_t = \mathrm{softmax}(y_t) \tag{477}$$
In NLP tasks such as language modeling, the objective is to predict the next word in a sequence
given all previous words. The RNN is trained to estimate the conditional probability distribution
P (wt |w1 , w2 , . . . , wt−1 ) of the next word wt based on the previous words. The full likelihood of the
sequence w1 , w2 , . . . , wT can be written as:
$$P(w_1, w_2, \ldots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, w_2, \ldots, w_{t-1}) \tag{478}$$
For an RNN, this conditional probability is modeled by recursively updating the hidden state and
generating a probability distribution for each word. At each time step, the probability of the next
word is computed as:
P (wt |ht−1 ) = softmax(Wy ht + c) (479)
The network is trained by minimizing the negative log-likelihood of the true word sequence:
$$L = -\sum_{t=1}^{T} \log P(w_t \mid h_{t-1}) \tag{480}$$
This loss function guides the optimization of the weight matrices Wh , Wx , and Wy to maximize the
likelihood of the correct word sequences. As the network learns from large datasets, it develops the
ability to predict words based on the context provided by previous words in the sequence. A key
extension of RNNs in NLP is machine translation, where one sequence of words in one language
is mapped to another sequence in a target language. This is typically modeled using sequence-to-
sequence (Seq2Seq) architectures, which consist of two RNNs: the encoder and the decoder. The
encoder RNN processes the input sequence x1 , x2 , . . . , xT , updating its hidden state at each time
step:
$$h_t^{\text{enc}} = \sigma(W_h^{\text{enc}} h_{t-1}^{\text{enc}} + W_x^{\text{enc}} x_t + b^{\text{enc}}) \tag{481}$$
The final hidden state $h_T^{\text{enc}}$ of the encoder is passed to the decoder as its initial hidden state. The decoder RNN generates the target sequence $y_1, y_2, \ldots, y_T$ by updating its hidden state at each time step, using both the previous hidden state $h_{t-1}^{\text{dec}}$ and the previous output $y_{t-1}$:
$$h_t^{\text{dec}} = \sigma(W_h^{\text{dec}} h_{t-1}^{\text{dec}} + W_x^{\text{dec}} y_{t-1} + b^{\text{dec}}) \tag{482}$$
The decoder produces a probability distribution over the target vocabulary at each time step:
$$P(y_t \mid h_t^{\text{dec}}) = \mathrm{softmax}(W_y^{\text{dec}} h_t^{\text{dec}} + c^{\text{dec}}) \tag{483}$$
The training of the Seq2Seq model is based on minimizing the cross-entropy loss function:
$$L = -\sum_{t=1}^{T} \log P(y_t \mid h_t^{\text{dec}}) \tag{484}$$
This ensures that the network learns to map input sequences to output sequences. By training
on a large corpus of paired sentences, the Seq2Seq model learns to translate sentences from one
language to another, with the encoder capturing the context of the input sentence and the decoder
generating the translated sentence.
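A highly simplified PyTorch sketch of this encoder-decoder pattern (Eqs. 481-484) follows; the GRU cells, vocabulary sizes, and dimensions are illustrative stand-ins for the plain RNN cells in the equations.

```python
# A simplified Seq2Seq sketch: a GRU encoder compresses the source sequence into
# a final hidden state, which initializes a GRU decoder. Sizes are assumptions.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab=100, tgt_vocab=120, emb=32, hidden=64):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)   # logits -> softmax over vocabulary

    def forward(self, src, tgt_in):
        _, h_enc = self.encoder(self.src_emb(src))       # h_T^enc summarizes source
        dec_out, _ = self.decoder(self.tgt_emb(tgt_in), h_enc)
        return self.out(dec_out)                         # P(y_t | h_t^dec) via softmax

model = Seq2Seq()
src = torch.randint(0, 100, (2, 7))       # batch of source token ids
tgt_in = torch.randint(0, 120, (2, 6))    # decoder inputs (shifted targets)
tgt_out = torch.randint(0, 120, (2, 6))   # gold next tokens
logits = model(src, tgt_in)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 120), tgt_out.reshape(-1))
loss.backward()                           # minimizes the loss of Eq. (484)
```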
RNNs are also effective in sentiment analysis, a task where the goal is to classify the sentiment
of a sentence (positive, negative, or neutral). Given a sequence of words x1 , x2 , . . . , xT , the RNN
processes each word sequentially, updating its hidden state:
ht = σ(Wh ht−1 + Wx xt + b) (485)
After processing the entire sentence, the final hidden state hT is used to classify the sentiment. The
output is obtained by applying a softmax function to the final hidden state:
y = softmax(Wy hT + c) (486)
where Wy is the weight matrix associated with the output layer. The network is trained to minimize
the cross-entropy loss:
$$L = -\log P(y \mid h_T) \tag{487}$$
This allows the RNN to classify the overall sentiment of the sentence by learning the relationships
between words and sentiment labels. Sentiment analysis is useful for applications such as customer
feedback analysis, social media monitoring, and opinion mining. In Named Entity Recognition
(NER), RNNs are used to identify and classify named entities, such as people, locations, and
organizations, in a text. The RNN processes each word xt in the sequence, updating its hidden
state at each time step:
ht = σ(Wh ht−1 + Wx xt + b) (488)
The output at each time step is a probability distribution over possible entity labels:
P (yt |ht ) = softmax(Wy ht + c) (489)
The network is trained to minimize the cross-entropy loss:
$$L = -\sum_{t=1}^{T} \log P(y_t \mid h_t) \tag{490}$$
By learning to classify each word with the appropriate entity label, the RNN can perform infor-
mation extraction tasks, such as identifying people, organizations, and locations in text. This is
crucial for applications such as document categorization, knowledge graph construction, and ques-
tion answering. In speech recognition, RNNs are used to transcribe spoken language into text.
The input to the RNN consists of a sequence of acoustic features, such as Mel-frequency cepstral
coefficients (MFCCs), which are extracted from the audio signal. At each time step t, the RNN
updates its hidden state:
ht = σ(Wh ht−1 + Wx xt + b) (491)
The output at each time step is a probability distribution over phonemes or words:
P (wt |ht ) = softmax(Wy ht + c) (492)
The network is trained by minimizing the negative log-likelihood:
$$L = -\sum_{t=1}^{T} \log P(w_t \mid h_t) \tag{493}$$
By learning the mapping between acoustic features and corresponding words or phonemes, the
RNN can transcribe speech into text, which is fundamental for applications such as voice assis-
tants, transcription services, and speech-to-text systems.
In summary, RNNs are powerful tools for processing sequential data in NLP tasks such as ma-
chine translation, sentiment analysis, named entity recognition, and speech recognition. Their
ability to process input sequences in a time-dependent manner allows them to capture long-range
dependencies, making them well-suited for complex tasks in NLP and beyond. However, chal-
lenges such as the vanishing and exploding gradient problems necessitate the use of more advanced
architectures, like LSTMs and GRUs, to enhance their performance in real-world applications.
9 Advanced Architectures
9.1 Transformers and Attention Mechanisms
Literature Review: Vaswani et. al. [340] introduced the Transformer architecture, replacing
recurrent models with a fully attention-based framework for sequence processing. They formulated
the self-attention mechanism, mathematically defining query-key-value (QKV) transformations.
They proved scalability advantages over RNNs, showing that self-attention requires only O(1) sequential operations per layer (versus O(n) for recurrence), and introduced multi-head attention, enabling contextualized embeddings. Nannepagu et. al. [341] explored
hybrid AI architectures integrating Transformers with deep reinforcement learning (DQN). They
developed a theoretical framework for transformer-augmented reinforcement learning and discussed
how self-attention refines feature representations for financial time-series prediction. Rose et. al.
[342] investigated Vision Transformers (ViTs) for cybersecurity applications, examining attention-
based anomaly detection. They theoretically compared self-attention with CNN feature extraction
and proposed a new loss function for attention weight refinement in cybersecurity detection mod-
els. Buehler [343] explored the theoretical interplay between Graph Neural Networks (GNNs) and
Transformer architectures. They developed isomorphic self-attention, which preserves graph topo-
logical information and introduced graph-structured positional embeddings within Transformer
attention. Tabibpour and Madanizadeh [344] investigated Set Transformers as a theoretical exten-
sion of Transformers for high-dimensional dynamic systems and introduced permutation-invariant
self-attention mechanisms to replace standard Transformers in decision-making tasks and theoreti-
cally formalized attention mechanisms for non-sequential data. Kim et. al. (2024) [310] developed
a Transformer-based anomaly detection framework for video surveillance. They formalized a new
spatio-temporal self-attention mechanism to detect anomalies in videos and extended standard
Transformer architectures to handle high-dimensional video data. Li and Dong [345] examined
Transformer-based attention mechanisms for wireless communication networks. They introduced
hybrid spatial and temporal attention layers for large-scale MIMO channel estimation and pro-
vided a rigorous mathematical proof of attention-based signal recovery. Asefa and Assabie [346]
investigated language-specific adaptations of Transformer-based translation models. They intro-
duced attention mechanism regularization for low-resource language translation and analyzed the
impact of different positional encoding strategies on translation quality. Liao and Chen [347] ap-
plied transformer architectures to deepfake detection, analyzing self-attention mechanisms for facial
feature analysis. They theoretically compared CNNs and ViTs for forgery detection and introduced
attention-head dropout to improve robustness against adversarial attacks. Jiang et. al. [348] pro-
posed a novel Transformer-based approach for medical imaging reconstruction. They introduced
Spatial and Channel-wise Transformer (SCFormer) for enhanced attention-based feature aggrega-
tion and theoretically extended contrastive learning to Transformer encoders.
The Transformer model is an advanced neural network architecture fundamentally defined by the
self-attention mechanism, which enables global context-aware computations on sequential data.
The model processes an input sequence represented by
X ∈ Rn×dmodel , (494)
where n denotes the sequence length and dmodel the embedding dimensionality. Each token in
this sequence is projected into three learned spaces—queries Q, keys K, and values V—using the
trainable matrices WQ , WK , and WV , such that
$$Q = XW^Q, \qquad K = XW^K, \qquad V = XW^V, \tag{495}$$
where $W^Q, W^K, W^V \in \mathbb{R}^{d_{\text{model}} \times d_k}$, with $d_k$ being the dimensionality of queries and keys. The
pairwise similarity between tokens is determined by the dot product QK⊤ , scaled by the factor
$\frac{1}{\sqrt{d_k}}$ to ensure numerical stability, yielding the raw attention scores:
$$S = \frac{QK^{\top}}{\sqrt{d_k}}, \tag{496}$$
where S ∈ Rn×n . These scores are normalized using the softmax function, producing the attention
weights A, where
$$A_{ij} = \frac{\exp(S_{ij})}{\sum_{k=1}^{n} \exp(S_{ik})}, \tag{497}$$
ensuring $\sum_{j=1}^{n} A_{ij} = 1$. The output of the attention mechanism is computed as a weighted sum of the values:
Z = AV, (498)
where Z ∈ Rn×dv , with dv being the dimensionality of the value vectors. This process can be
expressed compactly as
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V. \tag{499}$$
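The scaled dot-product attention of Eqs. (496)-(499) can be written directly in NumPy; the dimensions below are illustrative assumptions.

```python
# A sketch of single-head scaled dot-product attention, Eq. (499):
# Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.
import numpy as np

def softmax(S, axis=-1):
    S = S - S.max(axis=axis, keepdims=True)       # subtract max for stability
    e = np.exp(S)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d_k)                    # raw scores, Eq. (496)
    A = softmax(S)                                # attention weights, Eq. (497)
    return A @ V                                  # weighted sum of values, Eq. (498)

rng = np.random.default_rng(0)
n, d_model, d_k = 5, 16, 8                        # toy sizes (assumptions)
X = rng.standard_normal((n, d_model))
WQ, WK, WV = (rng.standard_normal((d_model, d_k)) for _ in range(3))
Z = attention(X @ WQ, X @ WK, X @ WV)             # Q = XW^Q, K = XW^K, V = XW^V
print(Z.shape)                                    # (5, 8)
```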
Multi-head attention extends this mechanism by splitting Q, K, V into h distinct heads, each
operating in its subspace. For the i-th head:
$$\text{head}_i = \text{Attention}(Q_i, K_i, V_i), \tag{500}$$
where $Q_i = XW_i^Q$, $K_i = XW_i^K$, $V_i = XW_i^V$. The outputs of all heads are concatenated and linearly transformed:
$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W^O, \tag{501}$$
where WO ∈ Rhdv ×dmodel . This architecture enables the model to capture multiple types of rela-
tionships simultaneously. Positional encodings are added to the input embeddings X to preserve
sequence order. These encodings P ∈ Rn×dmodel are defined as:
$$P_{(pos,\, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad P_{(pos,\, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \tag{502}$$
ensuring unique representations for each position pos and dimension index i.
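A small NumPy sketch of Eq. (502), assuming an even d_model, follows.

```python
# A sketch of the sinusoidal positional encodings of Eq. (502).
import numpy as np

def positional_encoding(n: int, d_model: int) -> np.ndarray:
    """Return P of shape (n, d_model); d_model is assumed even."""
    P = np.zeros((n, d_model))
    pos = np.arange(n)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angle = pos / (10000.0 ** (2 * i / d_model))
    P[:, 0::2] = np.sin(angle)      # even dimensions: sine
    P[:, 1::2] = np.cos(angle)      # odd dimensions: cosine
    return P

print(positional_encoding(4, 8).round(2))
```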
The feedforward network (FFN) applies two dense layers with an intermediate ReLU activation:
$$\text{FFN}(x) = \max(0,\, xW_1 + b_1)\, W_2 + b_2, \tag{503}$$
where $W_1 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$, $W_2 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$, and $d_{\text{ff}} > d_{\text{model}}$. Residual connections and layer normalization are applied throughout to stabilize training, with each sublayer's output given by
$$\text{LayerNorm}(x + \text{Sublayer}(x)). \tag{504}$$
For autoregressive decoding, the output sequence probability factorizes as $\prod_t P(y_t \mid y_{<t}, x)$, where $P(y_t \mid y_{<t}, x)$ is modeled using the softmax over the logits $z_t W_{\text{out}} + b_{\text{out}}$, with parame-
ters Wout , bout . The Transformer achieves a complexity of O(n2 dk ) per attention layer due to the
computation of QK⊤ , yet its parallelization capabilities render it more efficient than recurrent net-
works. This mathematical formalism, coupled with innovations like sparse attention and dynamic
programming, has solidified the Transformer as the cornerstone of modern sequence modeling tasks.
While this quadratic complexity poses challenges for very long sequences, it allows for greater par-
allelization compared to RNNs, which require O(n) sequential steps. Furthermore, the memory
complexity of O(n2 ) for storing attention weights can be mitigated using sparse approximations or
hierarchical attention structures. The Transformer architecture’s flexibility and effectiveness stem
from its ability to handle diverse tasks by appropriately modifying its components. For example,
in Vision Transformers (ViTs), the input sequence is formed by flattening image patches, and the
positional encodings capture spatial relationships. In contrast, in sequence-to-sequence tasks like
translation, the cross-attention mechanism enables the decoder to focus on relevant parts of the
encoder’s output.
In conclusion, the Transformer represents a paradigm shift in neural network design, replacing
recurrence with attention and enabling unprecedented scalability and performance. The rigorous
mathematical foundation of attention mechanisms, combined with the architectural innovations
of multi-head attention, positional encoding, and feedforward layers, underpins its success across
domains.
9.2 Generative Adversarial Networks (GANs)
A Generative Adversarial Network (GAN) is built from two competing neural networks, the generator G and the discriminator D. These networks are parametrized by weights
θG ∈ ΘG and θD ∈ ΘD , and their interaction is mathematically formulated as a two-player zero-
sum game. The generator G : Rd → Rn maps latent variables z ∼ pz (z), where pz is a prior
probability distribution (commonly uniform or Gaussian), to a synthetic data sample x̂ = G(z).
The discriminator D : Rn → [0, 1] assigns a probability score D(x) indicating whether x originates
from the true data distribution pdata (x) or the generated distribution pg (x), implicitly defined as
the pushforward measure of pz under G, i.e., pg = G# pz . The optimization problem governing
GANs is expressed as
$$\min_{G} \max_{D} V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))], \tag{506}$$
where E denotes the expectation operator. This objective seeks to maximize the discriminator’s
ability to distinguish between real and generated samples while simultaneously minimizing the
generator’s ability to produce samples distinguishable from real data. For a fixed generator G, the
optimal discriminator D∗ is obtained by maximizing V (D, G), yielding
$$D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}. \tag{507}$$
This expression is equivalent to minimizing the Jensen-Shannon (JS) divergence between pdata and
pg , defined as
$$\text{JS}(p_{\text{data}} \,\|\, p_g) = \frac{1}{2}\,\text{KL}(p_{\text{data}} \,\|\, M) + \frac{1}{2}\,\text{KL}(p_g \,\|\, M), \tag{509}$$
where $M = \frac{1}{2}(p_{\text{data}} + p_g)$ and $\text{KL}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)}\, dx$ is the Kullback-Leibler divergence. At
the Nash equilibrium, $p_g = p_{\text{data}}$, the JS divergence vanishes, and $D^*(x) = \frac{1}{2}$ for all x. The gradient
updates during training are derived using stochastic gradient descent. For the discriminator, the
gradients are given by
∇θD V (D, G) = Ex∼pdata [∇θD log D(x)] + Ez∼pz [∇θD log(1 − D(G(z)))] . (510)
Training Generative Adversarial Networks (GANs) involves iterative updates to the parameters
θD of the discriminator and θG of the generator. The discriminator’s parameters are updated
via gradient ascent to maximize the value function V (D, G), while the generator’s parameters are
updated via gradient descent to minimize the same value function. Denoting the gradients of D
and G with respect to their parameters as ∇θD and ∇θG , the updates are given by:
$$\theta_D \leftarrow \theta_D + \eta_D \nabla_{\theta_D}\big[\mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]\big] \tag{511}$$
and
θG ← θG − ηG ∇θG [Ez∼pz [log (1 − D(G(z)))]] . (512)
In practice, to address issues of vanishing gradients, an alternative loss function for the generator
is often used, defined as:
−Ez∼pz [log D(G(z))] . (513)
This modification ensures stronger gradient signals when the discriminator is performing well,
effectively improving the generator’s training dynamics. For the generator, the gradients in the
original formulation are expressed as
$$\nabla_{\theta_G}\, \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))], \tag{514}$$
but due to vanishing gradients when D(G(z)) is near 0, the non-saturating generator loss is pre-
ferred:
LG = −Ez∼pz [log D(G(z))]. (515)
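A schematic PyTorch sketch of one alternating update with the non-saturating loss of Eq. (515) follows; the toy MLPs and the synthetic stand-in for real data are illustrative assumptions.

```python
# A schematic sketch of alternating GAN updates with the non-saturating
# generator loss L_G = -E[log D(G(z))] (Eq. 515). Networks are toy MLPs.
import torch
import torch.nn as nn

d, n = 8, 2                                         # latent and data dimensions
G = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, n))
D = nn.Sequential(nn.Linear(n, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
eps = 1e-8                                          # avoids log(0)

x_real = torch.randn(64, n) * 0.5 + 1.0             # stand-in for p_data samples
z = torch.randn(64, d)                              # z ~ p_z (standard Gaussian)

# Discriminator ascent on E[log D(x)] + E[log(1 - D(G(z)))]  (Eq. 510)
loss_D = -(torch.log(D(x_real) + eps).mean()
           + torch.log(1 - D(G(z).detach()) + eps).mean())
opt_D.zero_grad(); loss_D.backward(); opt_D.step()

# Generator descent on the non-saturating loss -E[log D(G(z))]  (Eq. 515)
loss_G = -torch.log(D(G(z)) + eps).mean()
opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```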
The convergence of GANs is inherently linked to the properties of D∗ (x) and the alignment of
pg with pdata . However, mode collapse and training instability are frequently observed due to the
non-convex nature of the objective functions. Wasserstein GANs (WGANs) address these issues
by replacing the JS divergence with the Wasserstein-1 distance, defined as
$$W_1(p_{\text{data}}, p_g) = \inf_{\gamma \in \Pi(p_{\text{data}},\, p_g)} \mathbb{E}_{(x, y) \sim \gamma}\big[\|x - y\|\big], \tag{516}$$
where $\Pi(p_{\text{data}}, p_g)$ is the set of all couplings of $p_{\text{data}}$ and $p_g$. Using Kantorovich-Rubinstein duality, the Wasserstein distance is reformulated as
$$W_1(p_{\text{data}}, p_g) = \sup_{\|f\|_L \leq 1} \mathbb{E}_{x \sim p_{\text{data}}}[f(x)] - \mathbb{E}_{x \sim p_g}[f(x)], \tag{517}$$
where f is a 1-Lipschitz function. To enforce the Lipschitz constraint, gradient penalties are ap-
plied, ensuring that ∥∇f (x)∥ ≤ 1.
The mathematical framework of GANs integrates elements from game theory, optimization, and
information geometry. Their training involves solving a high-dimensional non-convex game, where
theoretical guarantees for convergence are challenging due to saddle points and complex interac-
tions between G and D. Nevertheless, GANs represent a mathematically elegant paradigm for
generative modeling, with ongoing research extending their theoretical and practical capabilities.
9.3 Autoencoders and Variational Autoencoders
Literature Review: Earlier theoretical work in this line has drawn connections between VAEs and kernel-based models. Sharma et. al. (2024) [312] explored practical applications of
Autoencoders in network intrusion detection. They established Autoencoders as robust feature
extractors for anomaly detection and provided a formal study of Autoencoder latent space repre-
sentations.
The optimization procedure seeks to minimize the reconstruction error over the dataset D, assuming
a distribution p(x) over the input data x. The objective is to learn the optimal parameters θe and
θd , by solving the following optimization problem:
$$\min_{\theta_e,\, \theta_d}\ \mathbb{E}_{x \sim p(x)}\Big[\big\| x - g_{\theta_d}(f_{\theta_e}(x)) \big\|^2\Big], \tag{519}$$
where $f_{\theta_e}$ denotes the encoder and $g_{\theta_d}$ the decoder. This formulation drives the encoder-decoder architecture towards learning a latent representation
that preserves key features of the input data, allowing it to be efficiently reconstructed. The solution
to this problem is typically pursued via stochastic gradient descent (SGD), where gradients
of the loss with respect to the model parameters are computed and backpropagated through the
network. In contrast to the deterministic autoencoder, the Variational Autoencoder (VAE)
introduces a probabilistic model to better capture the distribution of the latent variables. A VAE
models the data generation process using a latent variable z ∈ Rl , and aims to maximize the
likelihood of observing the data x by integrating over all possible latent variables. Specifically, we
have the joint distribution:
p(x, z) = p(x|z)p(z), (520)
where p(x|z) is the likelihood of the data given the latent variables, and p(z) is the prior distribution
of the latent variables, typically chosen to be a standard Gaussian N (z; 0, I). The prior assumption
that p(z) = N (0, I) simplifies the modeling, as it imposes no particular structure on the latent
space, which allows for flexible modeling of the data distribution. The encoder in a VAE outputs
a distribution qθe (z|x) over the latent variables, typically modeled as a multivariate Gaussian with
mean $\mu_{\theta_e}(x)$ and variance $\sigma_{\theta_e}^2(x)$, i.e. $q_{\theta_e}(z \mid x) = \mathcal{N}(z;\, \mu_{\theta_e}(x),\, \sigma_{\theta_e}^2(x) I)$. The decoder generates the
likelihood of the data x given the latent variable z, expressed as pθd (x|z), which typically takes the
form of a Gaussian distribution for continuous data. A central challenge in VAE training is the
marginal likelihood p(x), which represents the probability of the observed data. This marginal
likelihood is intractable due to the integral over the latent variables:
$$p(x) = \int p_{\theta_d}(x \mid z)\, p(z)\, dz. \tag{521}$$
To address this, VAE training employs variational inference, which approximates the true pos-
terior p(z|x) with a variational distribution qθe (z|x). The goal is to optimize the Evidence Lower
Bound (ELBO), which is a lower bound on the log-likelihood log p(x). The ELBO is derived
using Jensen’s inequality:
log p(x) ≥ Eqθe (z|x) [log pθd (x|z)] − KL (qθe (z|x) || p(z)) , (522)
where the first term is the expected log-likelihood of the data given the latent variables, and
the second term is the Kullback-Leibler (KL) divergence between the approximate posterior
qθe (z|x) and the prior p(z). The KL divergence acts as a regularizer, penalizing deviations from
the prior distribution. The ELBO can then be written as:
LVAE (x) = Eqθe (z|x) [log pθd (x|z)] − KL (qθe (z|x) || p(z)) . (523)
This formulation balances two competing objectives: maximizing the likelihood of reconstructing
x from z, and minimizing the divergence between the posterior qθe (z|x) and the prior p(z). In
order to perform optimization, we need to compute the gradient of the ELBO with respect to the
parameters θe and θd . However, since sampling from the distribution qθe (z|x) is non-differentiable,
the reparameterization trick is applied. This trick allows us to reparameterize the latent variable
z as:
z = µθe (x) + σθe (x) · ϵ, (524)
where ϵ ∼ N (0, I) is a standard Gaussian noise vector. This enables the backpropagation of
gradients through the latent space and allows the optimization process to proceed via stochastic
gradient descent. In practice, the Monte Carlo method is used to estimate the expectation
in the ELBO. This involves drawing K samples zk from the variational posterior qθe (z|x) and
approximating the expectation as:
L̂VAE (x) = (1/K) ∑_{k=1}^{K} [ log pθd (x|zk ) + log p(zk ) − log qθe (zk |x) ] . (525)
This approximation allows for efficient optimization, even when the latent space is high-dimensional
and the exact expectation is computationally prohibitive. Thus, the training process of a VAE
involves the following steps: first, the encoder produces a distribution qθe (z|x) for each input x;
then, latent variables z are sampled from this distribution; finally, the decoder reconstructs the
data x̂ from the latent variable z. The network is trained to maximize the ELBO, which effectively
balances the reconstruction loss and the KL divergence term.
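As a hedged illustration of these steps, below is a minimal PyTorch-style VAE sketch: the encoder outputs µθe (x) and log σ²θe (x), the reparameterization (524) makes sampling differentiable, and the loss is the negative ELBO (523) with an analytic Gaussian KL term. The layer sizes and the Bernoulli decoder are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Minimal sketch: Gaussian encoder q(z|x), Bernoulli decoder p(x|z)."""
    def __init__(self, x_dim=784, h_dim=256, z_dim=16):  # hypothetical sizes
        super().__init__()
        self.enc = nn.Linear(x_dim, h_dim)
        self.mu, self.logvar = nn.Linear(h_dim, z_dim), nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        eps = torch.randn_like(mu)               # eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * eps   # reparameterization trick (524)
        return self.dec(z), mu, logvar

def elbo_loss(x_hat, x, mu, logvar):
    # Negative ELBO: reconstruction term + analytic KL(q || N(0, I)), cf. (523);
    # assumes x takes values in [0, 1].
    rec = F.binary_cross_entropy_with_logits(x_hat, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl
```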
In this rigorous exploration, we have presented the mathematical foundations of both autoencoders
and variational autoencoders. The core distinction between the two lies in the introduction of a
probabilistic framework in the VAE, which leverages variational inference to optimize a tractable
lower bound on the marginal likelihood. Through this process, the VAE learns to generate data by
sampling from the latent space and reconstructing the input, while maintaining a well-structured
latent distribution through regularization by the KL divergence term. The optimization framework
for VAEs is grounded in variational inference and the reparameterization trick, enabling gradient-
based optimization techniques to efficiently train deep generative models.
10 Reinforcement Learning
Literature Review: Sutton and Barto (2018, 2021) [272, 273] wrote a definitive textbook on
reinforcement learning. It covers the fundamental concepts, including Markov decision processes
(MDPs), temporal difference learning, policy gradient methods, and function approximation. The
second edition expands on deep reinforcement learning, covering advanced algorithms like DDPG,
A3C, and PPO. Bertsekas and Tsitsiklis (1996) [274] laid the theoretical foundation for reinforce-
ment learning by introducing neuro-dynamic programming, an extension of dynamic programming
methods for decision-making under uncertainty. It rigorously covers approximate dynamic pro-
gramming, policy iteration, and value function approximation. Kakade (2003) [275] in his thesis
formalized the sample complexity of RL, providing theoretical guarantees for how much data is
required for an agent to learn optimal policies. It introduces the PAC-RL (Probably Approxi-
mately Correct RL) framework, which has significantly influenced how RL algorithms are evaluated.
Szepesvári (2010) [276] presented a rigorous yet concise overview of reinforcement learning algo-
rithms, including value iteration, Q-learning, SARSA, function approximation, and policy gradient
methods. It provides deep theoretical insights into convergence proofs and performance bounds.
Haarnoja et al. (2018) [277] introduced Soft Actor-Critic (SAC), an off-policy deep reinforce-
ment learning algorithm that maximizes expected reward and entropy simultaneously. It provides
a strong theoretical framework for handling exploration-exploitation trade-offs in high-dimensional
continuous action spaces. Mnih et al. (2015) [278] introduced Deep Q-Networks (DQN), demon-
strating how deep learning can be combined with Q-learning to achieve human-level performance
in Atari games. The authors address key challenges in reinforcement learning, including function
approximation and stability improvements. Konda and Tsitsiklis (2003) [279] provided a rigorous
theoretical analysis of Actor-Critic methods, which combine policy-based and value-based learning.
It formally establishes convergence proofs for actor-critic algorithms and introduces the natural
gradient method for policy improvement. Levine (2018) [280] introduced a probabilistic inference
framework for reinforcement learning, linking RL to Bayesian inference. It provides a theoreti-
cal foundation for maximum entropy reinforcement learning, explaining why entropy-regularized
objectives lead to better exploration and stability. Mannor et al. (2022) [281] gave one of the most rigorous mathematical treatments of reinforcement learning theory. It covers several topics: PAC guarantees for RL algorithms, complexity bounds for exploration, connections between RL and control theory, and convergence rates of popular RL methods. Borkar (2008) [282] rigorously an-
alyzed stochastic approximation methods, which form the theoretical backbone of RL algorithms
like TD-learning, Q-learning, and policy gradient methods. Borkar provides a dynamical systems
perspective to convergence analysis, offering deep mathematical insights.
A policy may be deterministic, at = π(st ), where π(st ) is the action chosen in state st , or stochastic, in which case the policy assigns a
probability distribution over actions for each state st . The goal of reinforcement learning is to find
an optimal policy π ∗ (st ), which maximizes the expected return (cumulative reward) from any initial
state. The optimal policy is defined as:
"∞ #
X
π ∗ (st ) = arg max E γ k rt+k | st , (528)
π
k=0
where γ is the discount factor that determines the weight of future rewards, and E[·] represents
the expectation under the policy π. The optimal policy can be derived from the optimal action-
value function Q∗ (st , at ), which we define in the next section. The state st ∈ S describes the
current situation of the agent at time t, encapsulating all relevant information that influences the
agent’s decision-making process. The state space S may be either discrete or continuous. The state
transitions are governed by a probability distribution P (st+1 |st , at ), which represents the probability
of moving from state st to state st+1 given action at . These transitions satisfy the Markov property,
meaning the future state depends only on the current state and action, not the history of previous
states or actions:
P (st+1 |st , at , st−1 , at−1 , . . . , s0 , a0 ) = P (st+1 |st , at ) ∀st , st+1 ∈ S, at ∈ A. (529)
Additionally, the transition probabilities satisfy the normalization condition:
∑_{st+1 ∈S} P (st+1 |st , at ) = 1 ∀st , at . (530)
The state distribution ρt (st ) represents the probability of the agent being in state st at time t. The
state distribution evolves over time according to the transition probabilities:
ρt+k (st+k ) = ∑_{st ∈S} P (st+k |st , at ) ρt (st ), (531)
where ρt (st ) is the initial distribution at time t, and ρt+k (st+k ) is the distribution at time t + k. An
action at taken at time t by the agent in state st leads to a transition to state st+1 and results in a
reward rt . The agent aims to select actions that maximize its long-term reward. The action-value
function Q(st , at ) quantifies the expected cumulative reward from taking action at in state st and
following the optimal policy thereafter. It is defined as:
"∞ #
X
Q(st , at ) = E γ k rt+k | st , at . (532)
k=0
The optimal action-value function Q∗ (st , at ) satisfies the Bellman Optimality Equation:
Q∗ (st , at ) = R(st , at ) + γ ∑_{st+1 ∈S} P (st+1 |st , at ) max_{at+1} Q∗ (st+1 , at+1 ). (533)
This recursive equation provides the foundation for dynamic programming methods such as value
iteration and policy iteration. The optimal policy π ∗ (st ) is derived by choosing the action that
maximizes the action-value function:
The optimal value function V ∗ (st ), representing the expected return from state st under the optimal
policy, is given by:
V ∗ (st ) = max Q∗ (st , at ). (535)
at ∈A
The reward rt at time t is a scalar value that represents the immediate benefit (or cost) the agent
receives after taking action at in state st . It is a function R(st , at ) mapping state-action pairs to
real numbers:
rt = R(st , at ). (537)
The agent’s objective is to maximize the cumulative reward, which is given by the total return from
time t:

Gt = ∑_{k=0}^{∞} γ^k rt+k . (538)
The agent seeks to find a policy π that maximizes the expected return. The Bellman equation for
the expected return is:
V π (st ) = R(st , π(st )) + γ ∑_{st+1 ∈S} P (st+1 |st , π(st )) V π (st+1 ). (539)
This recursive relation helps in solving for the optimal value function. An RL problem is typically modeled as a Markov Decision Process (MDP), which is defined as the tuple:

(S, A, P, R, γ), (540)

where:
• S is the state space,
• A is the action space,
• P (st+1 |st , at ) is the state transition kernel,
• R(st , at ) is the reward function, and
• γ ∈ [0, 1) is the discount factor.
bridges reinforcement learning with stochastic gradient descent (SGD) optimization. Masood et al. (2025) [364] merged Deep Q-learning with Game Theory (GT) to optimize energy efficiency
in smart agriculture. It proposes a mathematical model for dynamic energy allocation, proving
the existence of Nash equilibria in Q-learning-based decision-making environments. Patrick (2024)
[365] bridged economic modeling with Deep Q-learning. It formulates dynamic pricing strategies
using deep reinforcement learning (DRL) and provides mathematical proofs on how RL adapts
to economic shocks. Mimouni and Avrachenkov (2025) [366] introduced a novel Deep Q-learning
algorithm that incorporates the Whittle index, a key concept in optimal stopping problems. It
proves convergence bounds and applies the model to email recommender systems, demonstrating
improved performance over traditional Q-learning methods.
Deep Q-Learning (DQL) is an advanced reinforcement learning (RL) technique where the goal
is to approximate the optimal action-value function Q∗ (s, a) through the use of deep neural net-
works. In traditional Q-learning, the action-value function Q(s, a) maps a state-action pair to the
expected return or cumulative discounted reward from that state-action pair, under the assumption
of following an optimal policy. Formally, the Q-function is defined as:
"∞ #
X
Q(s, a) = E γ t rt | s0 = s, a0 = a (541)
t=0
where γ ∈ [0, 1] is the discount factor, which determines the weight of future rewards relative to
immediate rewards, and rt is the reward received at time step t. The optimal Q-function Q∗ (s, a)
satisfies the Bellman optimality equation:
Q∗ (s, a) = E [ rt + γ max_{a′} Q∗ (st+1 , a′ ) | s0 = s, a0 = a ] (542)
where st+1 is the next state after taking action a in state s, and the maximization term represents
the optimal future expected reward. This equation represents the recursive structure of the optimal
action-value function, where each Q-value is updated based on the reward obtained in the current
step and the maximum future reward expected from the next state. The goal is to learn the optimal
Q-function through iterative updates, typically using the Temporal Difference (TD) method. In
Deep Q-Learning, the Q-function is approximated by a deep neural network, as directly storing
Q-values for every state-action pair is computationally infeasible for large state and action spaces.
Let the approximated Q-function be Qθ (s, a), where θ denotes the parameters (weights and biases)
of the neural network that approximates the action-value function. The deep Q-network (DQN)
aims to learn Qθ (s, a) such that it closely approximates Q∗ (s, a) over time. The update of the
Q-function follows the TD error principle, where the goal is to minimize the difference between the
current Q-values and the target Q-values derived from the Bellman equation. The loss function for
training the DQN is given by:
L(θ) = E_{(st ,at ,rt ,st+1 )∼D} [ (yt − Qθ (st , at ))² ] (543)
where D denotes the experience replay buffer containing previous transitions (st , at , rt , st+1 ). The
target yt for the Q-values is defined as:
yt = rt + γ max_{a′} Qθ− (st+1 , a′ ) (544)
Here, θ− represents the parameters of the target network, which is a slowly updated copy of the
online network parameters θ. The target network Qθ− is used to generate stable targets for the
Q-value updates, and its parameters are updated periodically by copying the parameters from the
online network θ after every T steps. The idea behind this is to stabilize the training by preventing
rapid changes in the Q-values due to feedback loops from the Q-network’s predictions. The update
rule for the network parameters θ follows the gradient descent method and is expressed as:
∇θ L(θ) = E(st ,at ,rt ,st+1 )∼D [(yt − Qθ (st , at )) ∇θ Qθ (st , at )] (545)
where ∇θ Qθ (st , at ) is the gradient of the Q-function with respect to the parameters θ, which is
computed using backpropagation through the neural network. This gradient is used to update the
parameters of the Q-network to minimize the loss function. In reinforcement learning, the agent
must balance exploration (trying new actions) and exploitation (selecting actions that maximize
the reward). This is often handled by using an epsilon-greedy policy, where the agent selects a
random action with probability ϵ and the action with the highest Q-value with probability 1 − ϵ.
The epsilon value is decayed over time to ensure that, as the agent learns, it shifts from exploration
to more exploitation. The epsilon-greedy action selection rule is given by:
at = { random action, with probability ϵ ; arg max_a Qθ (st , a), with probability 1 − ϵ } (546)
This policy encourages the agent to explore different actions at the beginning of training and
gradually exploit the learned Q-values as training progresses. The decay of ϵ typically follows an
annealing schedule to balance exploration and exploitation effectively. A critical component in
stabilizing training in Deep Q-Learning is the use of experience replay. In standard Q-learning,
the updates are based on consecutive transitions, which can lead to high correlations between
consecutive data points. This correlation can slow down learning or even lead to instability. Ex-
perience replay addresses this issue by storing a buffer of past experiences and sampling random
mini-batches from this buffer during training. This breaks the correlation between consecutive
samples and results in more stable and efficient updates. Mathematically, the loss function for
training the network involves random sampling of transitions (st , at , rt , st+1 ) from the experience
replay buffer D, and the update to the Q-values is computed using the Bellman error based on the
sampled experiences:
L(θ) = E_{(st ,at ,rt ,st+1 )∼D} [ ( rt + γ max_{a′} Qθ− (st+1 , a′ ) − Qθ (st , at ) )² ] (547)
This method ensures that the Q-values are updated in a way that is less sensitive to the order in
which experiences are observed, promoting more stable learning dynamics.
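The pieces above fit into a short, hedged PyTorch-style sketch of one DQN update with experience replay and a target network; the `QNet` architecture, the buffer layout (tensor-valued transitions), and all hyperparameters are illustrative assumptions.

```python
import random
from collections import deque
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Hypothetical small MLP approximating Q_theta(s, a)."""
    def __init__(self, n_states, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_states, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_actions))
    def forward(self, s):
        return self.net(s)

buffer = deque(maxlen=10_000)   # experience replay buffer D
gamma = 0.99

def dqn_update(q, q_target, optimizer, batch_size=32):
    """One gradient step on the TD loss (543); each stored transition is a
    tuple of tensors (s, a, r, s2, done)."""
    batch = random.sample(list(buffer), batch_size)
    s, a, r, s2, done = (torch.stack(x) for x in zip(*batch))
    with torch.no_grad():                                  # target y_t, eq. (544)
        y = r + gamma * (1 - done) * q_target(s2).max(dim=1).values
    q_sa = q(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_sa, y)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

def select_action(q, s, eps, n_actions):
    """Epsilon-greedy behavior policy, eq. (546)."""
    if random.random() < eps:
        return random.randrange(n_actions)
    return int(q(s.unsqueeze(0)).argmax(dim=1))
```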
Despite its success, the DQL algorithm can still suffer from certain issues such as overestimation
bias and instability due to the maximization step in the Bellman equation. Overestimation bias
occurs because the maximization operation maxa′ Qθ− (st+1 , a′ ) tends to overestimate the true value,
as the Q-values are updated based on the same Q-network. To address this, Double Q-learning was
introduced, which uses two separate Q-networks for action selection and value estimation, reducing
overestimation bias. In Double Q-learning, the target Q-value is computed using the following
equation:
yt = rt + γ Qθ− ( st+1 , arg max_{a′} Qθ (st+1 , a′ ) ) (548)
This approach helps to mitigate the overestimation problem by decoupling the action selection
from the Q-value estimation process. The value of arg max is taken from the online network Qθ ,
while the Q-value for the next state is estimated using the target network Qθ− . Another extension
to improve the DQL framework is Dueling Q-Learning, which decomposes the Q-function into two
separate components: the state value function Vθ (s) and the advantage function Aθ (s, a). The
Q-function is then expressed as:

Qθ (s, a) = Vθ (s) + ( Aθ (s, a) − (1/|A|) ∑_{a′} Aθ (s, a′ ) ) . (549)

This decomposition allows the agent to learn the value of a state Vθ (s) independently of the specific
actions, thus reducing the number of parameters needed for learning. This is particularly beneficial
in environments where many actions have similar expected rewards, as it enables the agent to focus
on identifying the value of states rather than overfitting to individual actions.
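Both extensions amount to small code changes. The following hedged sketch, reusing the conventions of the previous DQN sketch with illustrative sizes, shows the Double DQN target and one common mean-subtracted dueling head.

```python
import torch
import torch.nn as nn

def double_dqn_target(q, q_target, r, s2, done, gamma=0.99):
    """Double Q-learning target, eq. (548): the online network selects
    the action, the target network evaluates it."""
    with torch.no_grad():
        a_star = q(s2).argmax(dim=1, keepdim=True)
        return r + gamma * (1 - done) * q_target(s2).gather(1, a_star).squeeze(1)

class DuelingQNet(nn.Module):
    """Dueling decomposition, eq. (549), in the mean-subtracted form."""
    def __init__(self, n_states, n_actions, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(n_states, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)         # V_theta(s)
        self.adv = nn.Linear(hidden, n_actions)   # A_theta(s, a)
    def forward(self, s):
        h = self.body(s)
        a = self.adv(h)
        return self.value(h) + a - a.mean(dim=1, keepdim=True)
```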
In conclusion, Deep Q-Learning is an advanced reinforcement learning method that utilizes deep
neural networks to approximate the optimal Q-function, enabling agents to handle large state and
action spaces. The mathematical formulation of DQL involves minimizing the loss function based
on the temporal difference error, utilizing experience replay to stabilize learning, and using target
networks to prevent instability. Extensions such as Double Q-learning and Dueling Q-learning
further improve the performance and stability of the algorithm. Despite its remarkable successes,
Deep Q-Learning still faces challenges such as overestimation bias and instability, which have been
addressed with innovative modifications to the original algorithm.
Reinforcement Learning (RL) is a subfield of machine learning where an agent learns to make
decisions by interacting with an environment. The goal of the agent is to maximize a cumulative
reward signal over time by taking actions that affect its environment. The RL framework is for-
mally represented by a Markov Decision Process (MDP), which is defined by a 5-tuple (S, A, P, r, γ),
where:
• S is the state space, which represents all possible states the agent can be in.
• A is the action space, which represents all possible actions the agent can take.
• P (s′ |s, a) is the state transition probability, which defines the probability of transitioning
from state s to state s′ under action a.
• r(s, a) is the reward function, which defines the immediate reward received after taking action
a in state s.
• γ ∈ [0, 1) is the discount factor, which determines the importance of future rewards.
The objective in RL is for the agent to learn a policy π : S → A that maximizes its expected return
(the cumulative discounted reward), which is mathematically expressed as:
"∞ #
X
J(π) = Eπ γ t r(st , at ) , (550)
t=0
where st denotes the state at time t, and at = π(st ) is the action taken according to the policy π.
The expectation is taken over the agent’s interaction with the environment, under the policy π.
The agent seeks to maximize this expected return by choosing actions that yield the most reward
over time. The optimal value function V ∗ (s) is defined as the maximum expected return that can be obtained starting from state s, and is governed by the Bellman optimality equation:

V ∗ (s) = max_{a∈A} E [ r(s, a) + γ V ∗ (s′ ) ] , (551)
where s′ is the next state, and the expectation is taken with respect to the transition dynamics
P (s′ |s, a). The action-value function Q∗ (s, a) represents the maximum expected return from taking
action a in state s, and then following the optimal policy. It satisfies the Bellman optimality
equation for Q∗ (s, a):

Q∗ (s, a) = E [ r(s, a) + γ max_{a′} Q∗ (s′ , a′ ) ] , (552)

where a′ is the next action to be taken, and the expectation is again over the state transition
probabilities. These Bellman equations form the basis of many RL algorithms, which iteratively
approximate the value functions to learn an optimal policy. To solve these equations, one of
the most widely used methods is Q-learning, an off-policy, model-free RL algorithm. Q-learning
iteratively updates the action-value function Q(s, a) according to the following rule:
Q(st , at ) ← Q(st , at ) + α [ r(st , at ) + γ max_{a′} Q(st+1 , a′ ) − Q(st , at ) ] , (553)
where α is the learning rate that controls the step size of updates, and γ is the discount factor.
The key idea behind Q-learning is that the agent learns the optimal action-value function Q∗ (s, a)
without needing a model of the environment. The agent improves its action-value estimates over
time by interacting with the environment and receiving feedback (rewards). The iterative nature of
this update ensures convergence to the optimal Q∗ , under the condition that all state-action pairs
are visited infinitely often and α is decayed appropriately.
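In code, the update (553) is essentially a one-liner. Here is a minimal NumPy sketch under an assumed discrete environment interface; the `reset`/`step` signature and hyperparameters are placeholders.

```python
import numpy as np

def tabular_q_learning(env, n_states, n_actions,
                       episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """Tabular Q-learning with the update rule (553); `env` is an assumed
    discrete environment with reset() -> s and step(a) -> (s2, r, done)."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy exploration
            a = np.random.randint(n_actions) if np.random.rand() < eps \
                else int(Q[s].argmax())
            s2, r, done = env.step(a)
            # TD update, eq. (553)
            Q[s, a] += alpha * (r + gamma * (1 - done) * Q[s2].max() - Q[s, a])
            s = s2
    return Q
```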
Policy Gradient methods, in contrast, directly optimize the policy πθ , which is parameterized by a vector θ. These methods are useful
in high-dimensional or continuous action spaces where action-value methods may struggle. The
objective in policy gradient methods is to maximize the expected return, J(πθ ), which is given by:
"∞ #
X
J(πθ ) = Est ,at ∼πθ γ t r(st , at ) . (554)
t=0
The policy is updated using the gradient ascent method, and the gradient of the expected return
with respect to θ is computed as:
∇θ J(πθ ) = Est ,at ∼πθ [∇θ log πθ (at |st )Q(st , at )] , (555)
where Q(st , at ) is the action-value function, and ∇θ log πθ (at |st ) is the score function, representing
the sensitivity of the policy’s likelihood to the policy parameters. By following this gradient, the
policy parameters θ are updated to improve the agent’s performance. This method, known as
REINFORCE, is particularly effective when the action space is large or continuous, and the policy
needs to be parameterized with complex models, such as deep neural networks.
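A hedged sketch of the resulting update follows; it replaces Q(st , at ) in (555) with sampled discounted returns, as the classic Monte Carlo variant of REINFORCE does, and the episode format (lists of per-step log-probabilities and rewards) is a hypothetical convention.

```python
import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """Surrogate loss whose gradient is the REINFORCE estimator (555),
    with Monte Carlo returns G_t standing in for Q(s_t, a_t).
    log_probs: list of log pi_theta(a_t|s_t) tensors for one episode;
    rewards: list of scalar rewards r_t."""
    G, returns = 0.0, []
    for r in reversed(rewards):          # discounted return G_t, cf. (538)
        G = r + gamma * G
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)))
    # normalization as a standard variance-reduction trick
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    return -(torch.stack(log_probs) * returns).sum()
```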
In both Q-learning and policy gradient methods, exploration and exploitation are essential concepts. Exploration refers
to trying new actions that have not been sufficiently tested, whereas exploitation involves choosing
actions that are known to yield high rewards. The epsilon-greedy strategy is a common way to
balance exploration and exploitation, where with probability ϵ, the agent chooses a random action,
and with probability 1 − ϵ, it chooses the action with the highest expected reward. As the agent
learns, ϵ is typically decayed over time to reduce exploration and focus more on exploiting the
learned policy. In more complex environments, Boltzmann exploration or entropy regularization
techniques are used to maintain a controlled amount of randomness in the policy to encourage
exploration. In multi-agent games, RL takes on additional complexity. When multiple agents
interact, the environment is no longer static, as each agent’s actions affect the others. In this
context, RL can be used to find optimal strategies through game theory. A fundamental concept
here is the Nash equilibrium, where no agent can improve its payoff by changing its strategy,
assuming all other agents’ strategies remain fixed. In mathematical terms, for two agents i and j,
a Nash equilibrium (πi∗ , πj∗ ) satisfies:

ri (πi∗ , πj∗ ) ≥ ri (πi , πj∗ ) for all policies πi , (556)

where ri (πi , πj ) is the payoff function of agent i when playing policy πi against agent j’s policy
πj . Finding Nash equilibria in multi-agent RL is a complex and computationally challenging task,
requiring the agents to learn in a non-stationary environment where the other agents’ strategies
are also changing over time. In the context of robotics, RL is used to solve high-dimensional
control tasks, such as motion planning and trajectory optimization. The robot’s state space is often
represented by vectors of its position, velocity, and other physical parameters, while the action space
consists of control inputs, such as joint torques or linear velocities. In this setting, RL algorithms
learn to map states to actions that optimize the robot’s performance in a task-specific way, such as
minimizing energy consumption or completing a task in the least time. The dynamics of the robot
are often modeled by differential equations:

ẋ(t) = f (x(t), u(t)), (557)

where x(t) is the state vector at time t, and u(t) is the control input. Through RL, the robot
learns to optimize the control policy u(t) to maximize a reward function, typically involving a
combination of task success and efficiency. Deep RL, specifically, allows for the representation of
highly complex control policies using neural networks, enabling robots to tackle tasks that require
high-dimensional sensory input and decision-making, such as object manipulation or autonomous
navigation.
In games, RL has revolutionized the field by enabling agents to learn complex strategies in en-
vironments where hand-crafted features or simple tabular representations are insufficient. A key
challenge in Deep Reinforcement Learning (DRL) is stabilizing the training process, as neural
networks are prone to issues such as overfitting, exploding gradients, and vanishing gradients. Tech-
niques such as experience replay and target networks are used to mitigate these challenges, ensuring
stable and efficient learning. Thus, Reinforcement Learning, with its theoretical underpinnings in
MDPs, Bellman equations, and policy optimization methods, provides a mathematically rich and
deeply rigorous approach to solving sequential decision-making problems. Its application to fields
such as games and robotics not only illustrates its versatility but also pushes the boundaries of
machine learning into real-world, high-complexity scenarios.
11 Natural Language Processing (NLP)
Literature Review: Jurafsky and Martin (2023) [225] wrote a book that is a cornerstone of NLP theory, covering fundamental concepts like syntax, semantics, and discourse analysis, alongside deep
learning approaches to NLP. The book integrates linguistic theory with probabilistic and neural
methodologies, making it an essential resource for students and researchers alike. It thoroughly
explains sequence labeling, parsing, transformers, and BERT models. Manning and Schütze (1999) [226] wrote a foundational text in NLP, particularly for probabilistic models. It covers hidden
Markov models (HMMs), n-gram language models, and expectation-maximization (EM), concepts
that still underpin modern transformer-based NLP models. It also introduces latent semantic
analysis (LSA), a precursor to modern word embeddings. Liu and Zhang (2018) [227] presented
a detailed exploration of deep learning-based NLP, including word embeddings, recurrent neural
networks (RNNs), LSTMs, GRUs, and transformers. It introduces the mathematical foundations
of neural networks, making it a bridge between classical NLP and deep learning. Allen (1994)
[228] wrote a seminal book in NLP, focusing on symbolic and rule-based approaches. It provides
detailed coverage of semantic parsing, discourse modeling, and knowledge representation. While it
predates deep learning, it forms a strong theoretical foundation for logical and linguistic approaches
to NLP. Koehn (2009) [231] wrote a definitive work on statistical NLP, particularly machine
translation techniques like phrase-based translation, alignment models, and decoder algorithms. It
remains relevant even as neural translation models (e.g., Transformer-based systems) dominate.
We now mention some of the recent works in Natural Language Processing (NLP). Hempelmann
[230] explored how linguistic theories of humor can be incorporated into Large Language Models
(LLMs). It discusses the integration of formal humor theories into neural models and whether
LLMs can be used to test linguistic hypotheses. Eisenstein (2020) [232] wrote a modern NLP text-
book that bridges theory and practice. It covers both probabilistic and deep learning approaches,
including dependency parsing, sequence-to-sequence models, and attention mechanisms. Unlike
many texts, it also discusses ethics and bias in NLP models. Otter et al. (2018) [233] provided
a comprehensive review of neural architectures in NLP, covering CNNs, RNNs, attention mecha-
nisms, and reinforcement learning for NLP. It discusses both theoretical implications and empirical
advancements, making it an essential reference for deep learning in language tasks. The Oxford
Handbook of Computational Linguistics (2022) [234] provides a comprehensive collection of essays
covering the entire field of NLP and computational linguistics, including morphology, syntax, se-
mantics, discourse processing, and deep learning applications. It presents theoretical debates and
practical applications across different NLP domains. Li et al. (2025) [229] introduced an advanced
multi-head attention mechanism that combines explorative factor analysis with NLP models. It
enhances our understanding of how transformers encode syntactic and semantic relationships.
[239] applies text classification to detect deception in online product reviews. It integrates cognitive
appraisal theory and NLP-based text mining to distinguish fake vs. genuine reviews. Kumar et al.
(2025) [240] focused on medical text classification, demonstrating how NLP techniques can be ap-
plied to diagnose diseases using electronic health records (EHRs) and patient symptoms extracted
from text data. Yin (2024) [241] provided a deep dive into aspect-based sentiment analysis (ABSA),
discussing challenges in fine-grained text classification. It introduces new BERT-based techniques
to improve aspect-level sentiment classification accuracy. Raghavan (2024) [242] examines personal-
ity classification using text data. It evaluates the performance of NLP-based personality prediction
models and compares lexicon-based, deep learning, and transformer-based approaches. Semeraro
et al. (2025) [243] introduced EmoAtlas, a tool that merges psychological lexicons, artificial in-
telligence, and network science to perform emotion classification in textual data. It compares its
accuracy with BERT and ChatGPT. Cai and Liu (2024) [244] provides a practical approach to text
classification in discourse analysis. It explores Python-based techniques for analyzing therapy talk
and sentiment classification in conversational texts.
Text classification is a fundamental problem in machine learning and natural language process-
ing (NLP), where the goal is to assign predefined categories to a given text based on its content.
This process involves several steps, including text preprocessing, feature extraction, model train-
ing, and evaluation. In this answer, we will explore these steps with a focus on the underlying
mathematical principles and models used in text classification. The first step in text classification
is preprocessing the raw text data. This typically involves the following operations:
• Tokenization: Breaking the text into words or tokens.
• Stopword Removal: Removing common words (such as “and”, “the”, etc.) that do not
carry significant meaning.
• Stemming and Lemmatization: Reducing words to their base or root form, e.g., “running” becomes “run”.
• Lowercasing: Converting all words to lowercase to ensure consistency.
• Punctuation Removal: Removing punctuation marks.
These operations result in a cleaned and standardized text, ready for feature extraction. Once the
text is preprocessed, the next step is to convert the text into numerical representations that can
be fed into machine learning models. The most common methods for feature extraction include:
1. Bag-of-Words (BoW) model
2. Term Frequency-Inverse Document Frequency (TF-IDF)
In the first method (Bag-of-Words (BoW) model), each document is represented as a vector
where each dimension corresponds to a unique word in the corpus. The value of each dimension is
the frequency of the word in the document. If we have a corpus of N documents and a vocabulary
of M words, the document i can be represented as a vector xi ∈ RM , where:

xij = f (wj , di ), j = 1, . . . , M, (558)

where f (wj , di ) is the frequency of the word wj in the document di . The BoW model captures only
the frequency of terms within the document and disregards their order. While simple and com-
putationally efficient, this model does not capture the syntactic or semantic relationships between
words in the document.
A more sophisticated and improved representation can be obtained through Term Frequency-
Inverse Document Frequency (TF-IDF), which scales the raw frequency of words by their
relative importance in the corpus. TF-IDF weights each word by combining its frequency within a document with the
rarity of the word across all documents. The term frequency (TF) of a word w in document d is
defined as:
TF(w, d) = count(w, d) / (total number of words in d) (559)
The inverse document frequency (IDF) is given by:
IDF(w) = log ( N / DF(w) ) (560)
where N is the total number of documents and DF(w) is the number of documents containing the
word w. The TF-IDF score is the product of these two:
TF-IDF(w, d) = TF(w, d) · IDF(w) (561)
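A small pure-Python sketch of (559)–(561) follows; the tokenizer is omitted and the toy corpus is a placeholder.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """corpus: list of documents, each a list of tokens.
    Returns one {word: tf-idf} dict per document, per eqs. (559)-(561)."""
    N = len(corpus)
    df = Counter(w for doc in corpus for w in set(doc))      # DF(w)
    scores = []
    for doc in corpus:
        counts, total = Counter(doc), len(doc)
        scores.append({w: (c / total) * math.log(N / df[w])  # TF * IDF
                       for w, c in counts.items()})
    return scores

docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["cat", "and", "dog"]]
print(tf_idf(docs)[0])   # 'sat' (rare) scores higher than 'the' (common)
```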
There are several machine learning models that can be used for text classification, ranging from
simpler models to more complex ones. A common approach to text classification is to use a linear
model such as logistic regression or linear support vector machines (SVM). Given a feature vector
xi for document i, the prediction of the class label yi can be made as:
ŷi = σ(wT xi + b) (562)
where σ is the sigmoid function for binary classification, and w and b are the weight vector and bias
term, respectively. The model parameters w and b are learned by minimizing a loss function, such
as the binary cross-entropy loss. More complex models, such as Neural Networks (NN), involve
deeper mathematical formulations. In a typical feedforward neural network, the goal is to learn
a set of parameters that map an input vector xi to an output label yi . The network consists of
multiple layers of interconnected neurons, each of which applies a non-linear transformation to the
input. Given an input vector xi , the output of the network is computed as:
h_i^{(l)} = σ(W^{(l)} h_i^{(l−1)} + b^{(l)} ) (563)
where h_i^{(l)} is the activation of layer l, σ is the activation function (e.g., ReLU, sigmoid, or tanh),
W(l) is the weight matrix, and b(l) is the bias term for layer l. The input to the network is passed
through several hidden layers before producing the final classification output. The output layer
typically applies a softmax function to obtain a probability distribution over the possible classes:
P (yc |xi ) = exp(Wc^T hi + bc ) / ∑_{c′} exp(Wc′^T hi + bc′ ) (564)
where Wc and bc are the weights and bias for class c, and hi is the output of the last hidden layer.
The network is trained by minimizing a cross-entropy loss function:
L(W, b) = − ∑_{c=1}^{C} yi,c log P (yc |xi ) (565)
where yi,c is the one-hot encoded label for class c, and the goal is to minimize the difference be-
tween the predicted probability distribution and the true class distribution. Throughout the entire
process, optimization plays a crucial role in fine-tuning model parameters to minimize classification
errors. Common optimization techniques include stochastic gradient descent (SGD) and its variants,
such as Adam and RMSProp, which update model parameters iteratively based on the gradient of
the loss function with respect to the parameters. Given the loss function L(θ) parameterized by θ,
the gradient of the loss with respect to a parameter θi is computed as:
∂L(θ)/∂θi (566)
The parameter update rule for gradient descent is then:
θi ← θi − η ∂L(θ)/∂θi (567)
where η is the learning rate. For each iteration, this update rule adjusts the model parameters in
the direction of the negative gradient, ultimately converging to a set of parameters that minimizes
the classification error.
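To connect (562)–(567), here is a minimal NumPy sketch of training a softmax text classifier by full-batch gradient descent; the feature matrix (e.g., TF-IDF vectors), the integer label encoding, and the hyperparameters are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)      # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_softmax(X, Y, n_classes, lr=0.1, epochs=200):
    """X: (n_docs, n_features) feature matrix; Y: (n_docs,) integer labels."""
    n, d = X.shape
    W = np.zeros((d, n_classes)); b = np.zeros(n_classes)
    Y1 = np.eye(n_classes)[Y]                 # one-hot labels, cf. (565)
    for _ in range(epochs):
        P = softmax(X @ W + b)                # class probabilities, eq. (564)
        G = (P - Y1) / n                      # gradient of cross-entropy loss
        W -= lr * (X.T @ G)                   # update rule, eq. (567)
        b -= lr * G.sum(axis=0)
    return W, b
```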
In summary, text classification is an advanced and multifaceted problem that requires a deep
understanding of various mathematical principles, including linear algebra, probability theory, opti-
mization, and functional analysis. The entire process, from text preprocessing to feature extraction,
model training, and evaluation, involves the application of rigorous mathematical techniques that
enable the effective classification of text into meaningful categories. Each of these steps, whether
simple or complex, plays an integral role in transforming raw text data into actionable insights
using mathematically sophisticated models and algorithms.
Machine Translation (MT) in Natural Language Processing (NLP) is a highly intricate compu-
tational task that requires converting text from one language (source language) to another (target
language) by using statistical, rule-based, and deep learning models, often underpinned by proba-
bilistic and neural network-based frameworks. The goal is to determine the most probable target
sequence T = {t1 , t2 , . . . , tN } from the given source sequence S = {s1 , s2 , . . . , sT }, by modeling the
conditional probability P (T | S). The optimal translation is typically defined by:
T ∗ = arg max P (T | S) (568)
T
This involves estimating the probability of T given S, with the assumption that the translation
can be described probabilistically. In the most fundamental form of statistical machine translation
(SMT), this probability is often modeled through a series of translation models that decompose the
translation process into manageable components. The conditional probability P (T | S) in SMT
can be factorized using Bayes’ theorem:
P (T | S) = P (S, T ) / P (S) = P (S | T ) P (T ) / P (S) (569)
Given this decomposition, the core of early SMT models, such as IBM models, sought to model the
joint probability P (S, T ) over source and target language pairs. Specifically, in word-based models
like IBM Model 1, the task reduces to estimating the probability of translating each word in the
source language S to its corresponding word in the target language T . The joint probability can
be written as:
P (S, T ) = ∏_{i=1}^{T} ∏_{j=1}^{N} t(si | tj ) (570)
where t(si | tj ) is the probability of translating word si in the source sentence to word tj in the
target sentence. The estimation of these probabilities, t(si | tj ), is typically achieved by analyzing
parallel corpora through various techniques such as Expectation-Maximization (EM), which allows
the unsupervised learning of these translation probabilities from large amounts of bilingual text
data. The EM algorithm iterates between computing the expected alignments of words in the source
and target languages and refining the model parameters accordingly. The word-based translation
models, however, do not take into account the structure of the language, which often leads to
suboptimal translations, especially in languages with significantly different syntactic structures.
The challenges stem from the word order differences and idiomatic expressions that cannot be
captured through a simple word-to-word mapping. To overcome these limitations, IBM Model 2
introduced the concept of word alignments, where an additional hidden variable A is introduced,
representing a possible alignment between words in the source and target sentences. This can be
expressed as:
P (S, T, A) = ∏_{i=1}^{T} ∏_{j=1}^{N} t(si | tj ) a(si | tj ) (571)
where a(si | tj ) denotes the alignment probability between word si in the source language and word
tj in the target language. By optimizing these alignment probabilities, SMT systems improve the
quality of translations by better modeling the relationship between the source and target sentences.
Estimating a(si | tj ), however, requires computationally expensive algorithms, which can be han-
dled by methods like EM for iterative refinement.
A more sophisticated approach was introduced with sequence-to-sequence (Seq2Seq) models, which
significantly improved the translation process by leveraging deep learning techniques. The core of
Seq2Seq is the encoder-decoder framework, where an encoder processes the entire source sentence
and encodes it into a context vector, and a decoder generates the target sequence. In this approach,
the translation probability is formulated as:
P (T | S) = P (t1 | S) ∏_{i=2}^{N} P (ti | t<i , S) (572)
where t<i denotes the previously generated target words, capturing the sequential nature of trans-
lation. The key advantage of the Seq2Seq model is its ability to model entire sentences at once,
providing a richer, more flexible representation of both the source and target sequences compared to
word-based models. The encoder, typically implemented using Recurrent Neural Networks (RNNs)
or more advanced variants such as Long Short-Term Memory (LSTM) or Gated Recurrent Unit
(GRU) networks, encodes the source sequence S into hidden states. The hidden state at time step
t is computed recursively, based on the input xt (the source word representation at time step t)
and the previous hidden state ht−1 :
ht = f (ht−1 , xt ) (573)
where f represents the update function, which is often parameterized as a non-linear function, such
as a sigmoid or tanh. This recursion generates a sequence of hidden states {h1 , h2 , . . . , hT }, each
encoding the relevant information of the source sentence. In this model, the decoder generates the
target sequence one token at a time by conditioning on the previous tokens t<i and the context
vector c, which is typically the last hidden state from the encoder. The conditional probability of
generating the next target word is given by:

P (ti | t<i , S) = softmax(W ht ), (574)

where W is a learned weight matrix, and ht is the hidden state of the decoder at time step t.
The softmax function converts the output of the network into a probability distribution over the
vocabulary, and the word with the highest probability is chosen as the next target word.
A significant improvement to Seq2Seq was introduced through the attention mechanism. This
allows the decoder to dynamically focus on different parts of the source sentence during transla-
tion, instead of relying on a single fixed-length context vector. The attention mechanism computes
a set of attention weights αt,i for each source word, which are used to compute a weighted sum of
the encoder’s hidden states to form a dynamic context vector ct . The attention weight αt,i for time
step t in the decoder and source word i is calculated as:
αt,i = exp(et,i ) / ∑_{k=1}^{T} exp(et,k ) (575)
where et,i = score(ht , hi ) is a learned scoring function, which can be modeled, for instance, in the additive form:

et,i = va^⊤ tanh(Wa [ht ; hi ]), (576)

with learned parameters va and Wa .
This attention mechanism allows the model to adaptively focus on relevant parts of the source
sentence while generating each word in the target sentence, thus overcoming the limitations of fixed-
length context vectors in long sentences. Training a machine translation model typically involves
optimizing a loss function that quantifies the difference between the predicted target sequence and
the true target sequence. The most common loss function is the negative log-likelihood:
L(θ) = − ∑_{i=1}^{N} log P (ti | t<i , S; θ) (577)
where θ represents the parameters of the model. The parameters of the neural network are up-
dated using gradient-based optimization techniques, such as stochastic gradient descent (SGD) or
Adam, with the gradient of the loss function with respect to each parameter being computed via
backpropagation. In backpropagation, the gradient is computed by recursively applying the chain
rule through the layers of the network. For a parameter θ, the gradient is given by:
∂L(θ)/∂θ = (∂L(θ)/∂y) · (∂y/∂θ) (578)
Translation quality is then assessed using automatic metrics such as BLEU (Bilingual Evaluation Understudy), which measures the n-
gram overlap between the machine-generated translation and human references. The BLEU score
for an n-gram of length n is computed as:
BLEU(T, R) = exp ( ∑_{n=1}^{N} wn log pn (T, R) ) (579)
where pn (T, R) is the precision of n-grams between the target translation T and reference R, and
wn is the weight assigned to each n-gram length. Despite advancements, machine translation still
faces challenges, such as handling rare or out-of-vocabulary words, idiomatic expressions, and the
alignment of complex syntactic structures across languages. Approaches such as transfer learning,
unsupervised learning, and domain adaptation are being explored to address these issues and
improve the robustness and accuracy of MT systems.
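The n-gram precisions inside (579) can be sketched compactly. This toy Python version assumes a single tokenized reference, uses uniform weights wn = 1/N, and omits BLEU's brevity penalty for simplicity.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, N=4):
    """Toy BLEU: geometric mean of clipped n-gram precisions, eq. (579),
    with uniform weights w_n = 1/N and no brevity penalty."""
    log_p = 0.0
    for n in range(1, N + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        clipped = sum(min(c, ref[g]) for g, c in cand.items())  # clip by reference counts
        total = max(1, sum(cand.values()))
        log_p += (1.0 / N) * math.log(max(clipped / total, 1e-9))
    return math.exp(log_p)
```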
Chatbots and Conversational AI have evolved as some of the most sophisticated applications of
Natural Language Processing (NLP), a subfield of artificial intelligence that strives to enable ma-
chines to understand, generate, and interact in human language. At the core of conversational
AI is the ability to generate meaningful, contextually appropriate responses in a coherent and flu-
ent manner. This challenge is deeply rooted in both the complexities of natural language itself
and the mathematical models that attempt to approximate human understanding. This intricate
task involves processing language at different levels: syntactic (structure), semantic (meaning),
and pragmatic (context). These systems employ probabilistic and algebraic techniques to handle
language complexities and employ statistical models, deep neural networks, and optimization algo-
rithms to generate, understand, and respond to language.
In mathematical terms, conversational AI can be seen as a sequence of transformations from one
set of words or symbols (the input) to another (the output). The first mathematical aspect is lan-
guage modeling, which is crucial for predicting the likelihood of word sequences. The probability
distribution of a sequence of words w1 , w2 , . . . , wn is generally computed using the chain rule of
probability:
P (w1 , w2 , . . . , wn ) = ∏_{i=1}^{n} P (wi |w1 , w2 , . . . , wi−1 ) (580)
where P (wi |w1 , w2 , . . . , wi−1 ) models the conditional probability of the word wi given all the pre-
ceding words. This is a central concept in language generation tasks. In traditional n-gram models,
this conditional probability is estimated by considering only a fixed number of previous words. The
bigram model, for instance, assumes that the probability of a word depends only on the previous
word, leading to:
P (wi |w1 , w2 , . . . , wi−1 ) ≈ P (wi |wi−1 ) (581)
However, more advanced conversational AI systems, such as those based on recurrent neural net-
works (RNNs), attempt to model dependencies over much longer sequences. RNNs, in particular,
process the input sequence w1 , w2 , . . . , wn recursively by maintaining a hidden state ht that captures
the context up to time t. The hidden state is computed by:

ht = σ(Wh ht−1 + Wx xt + b), (582)
where σ is a non-linear activation function (e.g., tanh or sigmoid ), Wh , Wx are weight matrices,
and b is a bias term. While RNNs provide a mechanism to capture sequential dependencies, they
suffer from the vanishing gradient problem, particularly for long sequences. To address this issue,
Long Short-Term Memory (LSTM) units and Gated Recurrent Units (GRUs) were introduced, with
special gating mechanisms that help mitigate the loss of information over long time horizons. These
networks introduce memory cells and gates, which regulate the flow of information in the network.
For instance, the LSTM memory cell is governed by gating equations of the form:

ft = σ(Wf [ht−1 , xt ] + bf ), it = σ(Wi [ht−1 , xt ] + bi ), ot = σ(Wo [ht−1 , xt ] + bo ), (583)

ct = ft ⊙ ct−1 + it ⊙ tanh(Wc [ht−1 , xt ] + bc ), ht = ot ⊙ tanh(ct ). (584)

Transformers go further and dispense with recurrence altogether, relying instead on the self-attention mechanism:

Attention(Q, K, V ) = softmax( QK^⊤ / √dk ) V (585)
where dk is the dimension of the key vectors. This operation allows the model to attend to all parts
of the input sequence simultaneously, enabling better handling of long-range dependencies and
improving computational efficiency by processing sequences in parallel. Unlike RNNs, transformers
do not process tokens in a fixed order but instead utilize positional encoding to inject sequence
order information. The positional encoding for position i and dimension 2k is given by:
P E(i, 2k) = sin( i / 10000^{2k/d} ), P E(i, 2k + 1) = cos( i / 10000^{2k/d} ) (586)
where d is the embedding dimension and k is the index for the dimension of the positional encoding.
This approach allows transformers to handle longer sequences more efficiently than RNNs and
LSTMs, and is the basis for models like BERT, GPT, and other state-of-the-art conversational
models. Semantic understanding in conversational AI involves translating sentences into formal
representations that can be manipulated by the system. A well-known approach for capturing
meaning is compositional semantics, which treats the meaning of a sentence as a function of the
meanings of its parts. For this, lambda calculus is often employed to represent the meaning of
sentences as functions that operate on their arguments. For example, the sentence ”John saw the
car” can be represented as a lambda expression:
where see(x, y) is a predicate representing the action of seeing, and λx quantifies over the sub-
ject of the action. This allows for the compositional building of complex meanings from simpler
components. Dialogue management is another critical aspect of conversational AI systems. This
is the process of maintaining coherence and context over the course of a conversation. It involves
understanding the user’s input in light of prior dialogue history and generating a response that is
contextually relevant. To model the dialogue state, Markov Decision Processes (MDPs) are com-
monly employed. In this context, the dialogue state is represented as a set of possible states, with
actions being transitions between these states. The goal is to select actions (responses) that maxi-
mize cumulative rewards, which, in this case, corresponds to maintaining a coherent and engaging
conversation. The value function V (s) at state s can be computed using the Bellman equation:
" #
X
V (s) = max R(s, a) + γ P (s′ |s, a)V (s′ ) (588)
a
s′
where R(s, a) is the immediate reward for taking action a from state s, γ is the discount factor, and
P (s′ |s, a) represents the transition probability to the next state s′ given action a. By solving this
equation, the system can determine the optimal policy for responding to user inputs in a way that
maximizes long-term conversational quality. Once the dialogue state is updated, the next step in
conversational AI is to generate a response. This is typically achieved using sequence-to-sequence
models, in which the input sequence (e.g., the user’s query) is processed by an encoder to produce a
fixed-size context vector, and a decoder generates the output sequence (e.g., the chatbot’s response).
The basic structure of these models can be expressed as:
yt = Decoder(yt−1 , ht ) (589)
where yt represents the token generated at time t, and ht is the hidden state passed from the
encoder. Attention mechanisms are incorporated into this framework to allow the decoder to
focus on different parts of the input sequence at each step, improving the quality of the generated
response. Training conversational models requires optimizing parameters through backpropagation
and gradient descent. The loss function, typically cross-entropy loss, is minimized to update the
model’s parameters:
L(θ) = − ∑_{i=1}^{N} yi log(ŷi ) (590)
where ŷi is the predicted probability for the correct token yi , and N is the length of the sequence.
The parameters θ are updated iteratively through gradient descent, adjusting the weights to mini-
mize the error.
In summary, chatbots and conversational AI systems are grounded in a rich mathematical frame-
work involving statistics, linear algebra, optimization, and neural networks. Each step, from lan-
guage modeling to dialogue management, relies on carefully constructed mathematical foundations
that drive the ability of machines to interact intelligently and meaningfully with humans. Through
advancements in deep learning and optimization techniques, conversational AI continues to push
the boundaries of what machines can understand and generate in natural language, leading to more
sophisticated, human-like interactions.
12.1 TensorFlow
TensorFlow operates primarily on tensors, which are multi-dimensional arrays generalizing scalars,
vectors, and matrices. For instance, a scalar is a rank-0 tensor, a vector is a rank-1 tensor, a matrix
is a rank-2 tensor, and tensors of higher ranks represent multi-dimensional arrays. These tensors
can be written mathematically as:
T ∈ Rd1 ×d2 ×···×dn (591)
where d1 , d2 , . . . , dn represent the dimensions of the tensor. TensorFlow leverages efficient tensor
operations that allow the manipulation of large-scale data in a computationally optimized manner.
These operations are the foundation of all the transformations and calculations within TensorFlow
models. For example, the dot product of two vectors ⃗a and ⃗b is a scalar:
⃗a · ⃗b = ∑_{i=1}^{n} ai bi (592)
Similarly, for matrices, operations like matrix multiplication A · B are highly optimized, taking
advantage of batch processing and parallelism on devices such as GPUs and TPUs. TensorFlow’s
underlying libraries, such as Eigen, employ these parallel strategies to optimize memory usage and
reduce computation time. The heart of TensorFlow’s efficiency lies in its computation graph, which
represents the relationships between different operations. The computation graph is a directed
acyclic graph (DAG) where nodes represent computational operations, and the edges represent the
flow of data (tensors). Each operation in the graph is a function, f , that maps a set of inputs to
an output tensor:
y = f (x1 , x2 , . . . , xn ) (593)
The graph is built by users or automatically by TensorFlow, where the nodes represent operations
such as addition, multiplication, or more complex transformations. Once the computation graph is
defined, TensorFlow optimizes the graph by reordering computations, applying algebraic transfor-
mations, or parallelizing independent subgraphs. The graph is executed either in a dynamic manner
(eager execution) or after optimization (static graph execution), depending on the user’s preference.
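The two execution modes, and the autodiff machinery discussed next, are easy to see in a short, hedged sketch; the toy function `f` is an illustrative stand-in, while `tf.function` and `tf.GradientTape` are TensorFlow's standard graph-tracing and automatic-differentiation entry points.

```python
import tensorflow as tf

@tf.function          # traced into a static computation graph (DAG) on first call
def f(x1, x2):
    return tf.reduce_sum(x1 * x2) + tf.sin(x2)   # nodes: mul, reduce_sum, sin, add

x1 = tf.constant([1.0, 2.0])
x2 = tf.constant([3.0, 4.0])

y = f(x1, x2)         # executes the optimized graph

with tf.GradientTape() as tape:   # reverse-mode automatic differentiation
    tape.watch(x2)                # track a plain tensor (non-Variable)
    y = f(x1, x2)
grads = tape.gradient(y, x2)      # dy/dx2 via the chain rule, cf. (594)
```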
Automatic differentiation is another key feature of TensorFlow, and it relies on the chain rule of
differentiation to compute gradients. The gradient of a scalar-valued function f (x1 , x2 , . . . , xn ) with
respect to an input tensor xi is computed as:
∂f /∂xi = ∑_{j=1}^{n} (∂f /∂yj ) (∂yj /∂xi ) (594)
where yj represents intermediate variables computed during the forward pass of the network. In
the context of a neural network, this chain rule is used to propagate errors backward from the
output to the input layers during the backpropagation process, where the objective is to update
the network’s weights to minimize the loss function L. Consider a neural network with a simple
architecture, consisting of an input layer, one hidden layer, and an output layer. Let X represent
the input tensor, W1 and b1 the weights and biases of the hidden layer, and W2 and b2 the weights
and biases of the output layer. The forward pass can be written as:
h = σ(W1 X + b1 ) (595)
ŷ = W2 h + b2 (596)
where σ is the activation function, such as the ReLU function σ(x) = max(0, x), and ŷ is the
predicted output. The objective in training a model is to minimize a loss function L(ŷ, y), where
y represents the true labels. The loss function can take different forms, such as the mean squared
error for regression tasks:
L(ŷ, y) = (1/N) ∑_{i=1}^{N} (yi − ŷi )² (597)
or the cross-entropy loss for classification tasks:
L(ŷ, y) = − ∑_{i=1}^{C} yi log(ŷi ) (598)
where C is the number of classes, and ŷi is the predicted probability of class i under the softmax
function. The optimization of this loss function requires the computation of the gradients of L with
respect to the model parameters W1 , b1 , W2 , b2 . This is achieved through backpropagation, which
applies the chain rule iteratively through the layers of the network. To perform optimization,
TensorFlow employs algorithms like Gradient Descent (GD). The basic gradient descent update
rule for parameters θ is:
θt+1 = θt − η∇θ L(θ) (599)
where η is the learning rate, and ∇θ L(θ) represents the gradient of the loss function with respect
to the model parameters θ. Variants of gradient descent, such as Stochastic Gradient Descent
(SGD), update the parameters using a subset (mini-batch) of the training data rather than the
entire dataset:

θt+1 = θt − η (1/m) ∑_{i=1}^{m} ∇θ L(θ; xi , yi ) (600)
where m is the batch size, and (xi , yi ) are the data points in the mini-batch. More sophisticated
optimizers like Adam (Adaptive Moment Estimation) use both momentum (first moment) and
scaling (second moment) to adapt the learning rate for each parameter. The update rule for Adam
is:
mt = β1 mt−1 + (1 − β1 ) ∇θ L(θ) (601)
vt = β2 vt−1 + (1 − β2 ) (∇θ L(θ))² (602)
m̂t = mt / (1 − β1^t ), v̂t = vt / (1 − β2^t ) (603)
θt+1 = θt − η m̂t / ( √v̂t + ϵ ) (604)
where β1 and β2 are the exponential decay rates, and ϵ is a small constant to prevent division by
zero. The inclusion of both the first and second moments allows Adam to adaptively adjust the
learning rate, speeding up convergence.
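For concreteness, here is a minimal NumPy sketch of the update (601)–(604) applied to the toy objective f(θ) = θ²; the hyperparameter values are the commonly used defaults and are purely illustrative.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3,
              beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad            # first moment, eq. (601)
    v = beta2 * v + (1 - beta2) * grad**2         # second moment, eq. (602)
    m_hat = m / (1 - beta1**t)                    # bias correction, eq. (603)
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)   # update, eq. (604)
    return theta, m, v

theta = np.array([5.0]); m = v = np.zeros(1)
for t in range(1, 2001):                          # minimize f(theta) = theta^2
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.05)
print(theta)                                      # approaches 0
```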
In addition to standard optimization methods, TensorFlow supports distributed computing, enabling model training across multiple devices, such as GPUs and
TPUs. In a distributed setting, the model’s parameters are split across different workers, each
handling a portion of the data. The gradients computed by each worker are averaged, and the
global parameters are updated:
θt+1 = θt − η (1/N) Σ_{i=1}^{N} ∇θ Li(θ) (605)
where Li (θ) is the loss computed on the i-th device, and N is the total number of devices. Tensor-
Flow’s efficient parallelism ensures that large-scale data processing tasks can be carried out with
high computational throughput, thus speeding up model training on large datasets.
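The pieces above fit together in a short training loop. The sketch below, with hypothetical random data and layer sizes, implements the forward pass of equations (595)-(596), the mean squared error of (597), gradient computation via the chain rule of (594) using tf.GradientTape, and the Adam updates of (601)-(604) through tf.keras.optimizers.Adam:

```python
import tensorflow as tf

# Hypothetical toy data: 64 samples, 4 features, scalar regression target.
X = tf.random.normal((64, 4))
y = tf.random.normal((64, 1))

W1 = tf.Variable(tf.random.normal((4, 8)) * 0.1)
b1 = tf.Variable(tf.zeros((8,)))
W2 = tf.Variable(tf.random.normal((8, 1)) * 0.1)
b2 = tf.Variable(tf.zeros((1,)))
params = [W1, b1, W2, b2]

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-2)  # eqs. (601)-(604)

for step in range(100):
    with tf.GradientTape() as tape:
        h = tf.nn.relu(tf.matmul(X, W1) + b1)    # forward pass, eq. (595)
        y_hat = tf.matmul(h, W2) + b2            # eq. (596)
        loss = tf.reduce_mean((y - y_hat) ** 2)  # MSE loss, eq. (597)
    grads = tape.gradient(loss, params)          # reverse-mode chain rule, eq. (594)
    optimizer.apply_gradients(zip(grads, params))
```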
TensorFlow also facilitates model deployment on different platforms. TensorFlow Lite enables
model inference on mobile devices by converting trained models into optimized, smaller formats.
This process involves quantization, which reduces the precision of the weights and activations,
thereby reducing memory consumption and computation time. The conversion process aims to
balance model accuracy and performance, ensuring that deep learning models can run efficiently
on resource-constrained devices like smartphones and IoT devices. For web applications, TensorFlow
offers TensorFlow.js, which allows users to run machine learning models directly in the browser,
leveraging the computational power of the client-side GPU or CPU. This is particularly useful for
real-time interactions where low-latency predictions are required without sending data to a server.
Moreover, TensorFlow provides an ecosystem that extends beyond basic machine learning tasks.
For instance, TensorFlow Extended (TFX) supports the deployment of machine learning models
in production environments, automating the steps from model training to deployment. Tensor-
Flow Probability supports probabilistic modeling and uncertainty estimation, which are critical in
domains such as reinforcement learning and Bayesian inference.
12.2 PyTorch
Literature Review: Galaxy Yanshi Team of Beihang University [293] examined the use of Py-
Torch as a deep learning framework for real-time astronaut facial recognition in space stations.
It explores the Bayesian coding theory within PyTorch models and its significance in optimizing
neural network architectures. It provides a theoretical exploration of probability distributions in
PyTorch models, demonstrating how deep learning can be used in constrained computational envi-
ronments. Tabel (2024) [294] extended PyTorch to Spiking Neural Networks (SNNs), a biologically
inspired neural network type. It details a new theoretical approach for learning spike timings us-
ing PyTorch’s computational graph. The paper bridges neuromorphic computing and PyTorch’s
automatic differentiation, expanding the theory behind temporal deep learning. Naderi et al.
(2024) [295] introduced a hybrid physics-based deep learning framework that integrates discrete
element modeling (DEM) with PyTorch-based networks. It demonstrates how physical simula-
tion problems can be formulated as deep learning models in PyTorch, providing new insights into
neural solvers for scientific computing. Polaka (2024) [296] evaluated reinforcement learning (RL)
theories within PyTorch, exploring the mathematical rigor of RL frameworks in safe AI applica-
tions. The author provided a strong theoretical foundation for understanding deep reinforcement
learning (DeepRL) in PyTorch, emphasizing how state-of-the-art RL theories are embedded in
the framework. Erdogan et al. (2024) [297] explored the theoretical framework for reducing
stochastic communication overheads in large-scale recommendation systems built using PyTorch.
It introduced an optimized gradient synchronization method that can enhance PyTorch-based deep
learning models for distributed computing. Liao et al. (2024) [298] extended the Iterative Partial
Diffusion Model (IPDM) framework, implemented in PyTorch, for medical image processing and
advanced the theory of deep generative models in PyTorch, specifically in diffusion-based learning
techniques. Sekhavat et al. (2024) [299] examined the theoretical intersection between deep learn-
ing in PyTorch and artificial intelligence creativity, referencing Nietzschean philosophical concepts.
The author also explored how PyTorch enables neural creativity and provides a rigorous theoretical
model for computational aesthetics. Cai et al. (2025) [300] developed a new theoretical framework
for explainability in neural networks using Shapley values, implemented in PyTorch and enhanced
the mathematical rigor of explainable AI (XAI) using PyTorch’s autograd system to analyze feature
importance. Na (2024) [301] proposed a novel ensemble learning theory using PyTorch, specifically
in weakly supervised learning (WSL). The paper extends Bayesian learning models in PyTorch for
handling sparse labeled data, addressing critical gaps in WSL. Khajah (2024) [302] combined item
response theory (IRT) and Bayesian knowledge tracing (BKT) using PyTorch to model generaliz-
able skill discovery. This study presents a rigorous statistical theory for adaptive learning systems
using PyTorch’s probabilistic programming capabilities.
The dynamic computation graph in PyTorch forms the core of its ability to perform efficient
and flexible machine learning tasks, especially deep learning models. To understand the underly-
ing mathematical and computational principles, we must explore how the graph operates, what it
represents, and how it changes during the execution of a machine learning program. Unlike the
static computation graphs employed in frameworks like TensorFlow (pre-Eager execution mode),
PyTorch constructs the computation graph dynamically, as the operations are performed in the
forward pass. This allows PyTorch to adapt to various input sizes, model structures, and control
flows that can change during execution. This adaptability is essential in enabling PyTorch to han-
dle models like recurrent neural networks (RNNs), which operate on sequences of varying lengths,
or models that incorporate conditionals in their computation steps.
The computation graph itself can be mathematically represented as a directed acyclic graph
(DAG), where the nodes represent operations and intermediate results, while the edges represent
the flow of data between these nodes. Each operation (e.g., addition, multiplication, or non-linear
activation) is applied to tensors, and the outputs of these operations are used as inputs for subse-
quent operations. The central feature of PyTorch’s dynamic computation graph is its construction
at runtime. For instance, when a tensor A is created, it might be involved in a series of operations
that eventually lead to the calculation of a loss function L. As each operation is executed, PyTorch
constructs an edge from the node representing the input tensor A to the node representing the
output tensor B. Mathematically, the transformation between these tensors can be described by:
B = f (A; θ) (606)
where f represents the transformation function (which could be a linear or nonlinear operation),
and θ represents the parameters involved in this transformation (e.g., weights or biases in the
case of neural networks). The construction of the dynamic graph allows PyTorch to deal with
variable-length sequences, which are common in tasks such as time-series prediction, nat-
ural language processing (NLP), and speech recognition. The length of the sequence can
change depending on the input data, and thus, the number of iterations or layers required in the
computation will also vary. In a recurrent neural network (RNN), for example, the hidden
state ht at each time step t is a function of the previous hidden state ht−1 and the input at the
current time step xt. This can be described mathematically as:
ht = f(Wh ht−1 + Wx xt + b) (607)
where f is typically a non-linear activation function (e.g., a hyperbolic tangent or a sigmoid), and
Wh , Wx , b represent the weight matrices and bias vector, respectively. This equation encapsulates
the recursive nature of RNNs, where each output depends on the previous output and the current
input. In a static computation graph, the number of operations for each sequence would need to
be predefined, leading to inefficiency when sequences of different lengths are processed. However,
in PyTorch, the computation graph is created dynamically for each sequence, which allows for the
efficient handling of varying-length sequences and avoids redundant computation.
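A minimal sketch of this behavior (the dimensions and data are hypothetical) shows the same PyTorch code building graphs of different depths for sequences of different lengths, following the recurrence of equation (607):

```python
import torch

W_h = torch.randn(16, 16, requires_grad=True)
W_x = torch.randn(8, 16, requires_grad=True)
b = torch.zeros(16, requires_grad=True)

def rnn_forward(xs):
    # The graph is rebuilt on every call; its depth equals len(xs), so
    # sequences of different lengths need no padding and no re-tracing.
    h = torch.zeros(16)
    for x_t in xs:
        h = torch.tanh(h @ W_h + x_t @ W_x + b)  # recurrence, eq. (607)
    return h

short_seq = [torch.randn(8) for _ in range(3)]
long_seq = [torch.randn(8) for _ in range(11)]
loss = rnn_forward(short_seq).sum() + rnn_forward(long_seq).sum()
loss.backward()   # gradients flow through both dynamically built graphs
```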
The key to PyTorch’s efficiency lies in automatic differentiation, which is managed by its au-
tograd system. When a tensor A has the property requires_grad=True, PyTorch starts tracking
all operations performed on it. Suppose that the tensor A is involved in a sequence of operations
to compute a scalar loss L. For example, if the loss is a function of Y, the output tensor, which is
computed through multiple layers, the objective is to find the gradient of L with respect to A. This
requires the computation of the Jacobian matrix, which represents the gradient of each component
of Y with respect to each component of A. Using the chain rule of differentiation, the gradient of
the loss with respect to A is given by:
∂L/∂A = Σ_i (∂L/∂Yi)·(∂Yi/∂A) (608)
This is an application of the multivariable chain rule, where ∂L/∂Yi represents the gradient of the loss with respect to the output tensor at the i-th component, and ∂Yi/∂A is the Jacobian matrix for the
transformation from A to Y. This computation is achieved by backpropagating the gradients
through the computation graph that PyTorch builds dynamically. Every operation node in the
graph has an associated gradient, which is propagated backward through the graph as we move
from the loss back to the input parameters. For example, if Y = A · B, the gradient of the loss
with respect to A would be:
∂L/∂A = (∂L/∂Y) · B^T (609)
Similarly, the gradient with respect to B would be:
∂L/∂B = A^T · (∂L/∂Y) (610)
This shows how the gradients are passed backward through the computation graph, utilizing the
stored operations at each node to calculate the required derivatives. The advantage of this dy-
namic construction of the graph is that it does not require the entire graph to be constructed
beforehand, as in the static graph approach. Instead, the graph is dynamically updated as op-
erations are executed, making it both more memory-efficient and computationally efficient.
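The identities (609) and (610) can be checked numerically in a few lines. The sketch below uses hypothetical random shapes and a sum loss, so that ∂L/∂Y is a matrix of ones:

```python
import torch

A = torch.randn(3, 4, requires_grad=True)
B = torch.randn(4, 5, requires_grad=True)

Y = A @ B          # tracked operation, eq. (606) with f = matmul
L = Y.sum()        # scalar loss
L.backward()       # reverse-mode chain rule, eq. (608)

dL_dY = torch.ones_like(Y)                  # since L = sum(Y)
print(torch.allclose(A.grad, dL_dY @ B.T))  # eq. (609): True
print(torch.allclose(B.grad, A.T @ dL_dY))  # eq. (610): True
```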
An important feature of PyTorch’s dynamic graph is its ability to handle conditionals within the
computation. Consider a case where we have different branches in the computation based on a
conditional statement. In a static graph, such conditionals would require the entire graph to be
predetermined, including all possible branches. In contrast, PyTorch constructs the relevant part
of the graph depending on the input data, effectively enabling a branching computation. For
instance, suppose that we have a decision-making process in a neural network model, where the
output depends on whether an input tensor exceeds a threshold xi > t:
yi = { A · xi + b  if xi > t;   C · xi + d  otherwise } (611)
In a static graph, we would have to design two separate branches and potentially deal with the
computational cost of unused branches. In PyTorch’s dynamic graph, only the relevant branch is
executed, and the graph is updated accordingly to reflect the necessary operations. The mem-
ory efficiency in PyTorch’s dynamic graph construction is particularly evident when handling
large models and training on large datasets. When building models like deep neural networks
(DNNs), the operations performed on each tensor during both the forward and backward passes
are recorded in the computation graph. This allows for efficient reuse of intermediate results, and
only the necessary memory is allocated for each tensor during the graph’s construction. This stands
in contrast to static computation graphs, where the full graph needs to be defined and memory
allocated up front, potentially leading to unnecessary memory consumption.
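The following sketch (with hypothetical parameters A, b, C, d and threshold t) illustrates the branching computation of equation (611): only the branch actually taken enters the graph, so the parameters of the untaken branch receive no gradient.

```python
import torch

A = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(0.5)
C = torch.tensor(-1.0, requires_grad=True)
d = torch.tensor(0.0)
t = 1.0

def piecewise(x):
    # Ordinary Python control flow; only the executed branch is recorded.
    if x.item() > t:
        return A * x + b   # first case of eq. (611)
    return C * x + d       # second case of eq. (611)

y = piecewise(torch.tensor(3.0))
y.backward()
print(A.grad, C.grad)  # tensor(3.) None -- untaken branch got no gradient
```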
To summarize, the dynamic computation graph in PyTorch is a powerful tool that allows
for flexible model building and efficient computation. By constructing the graph incrementally
during the execution of the forward pass, PyTorch is able to dynamically adjust to the input size,
control flow, and variable-length sequences, leading to more efficient use of memory and computa-
tional resources. The autograd system enables automatic differentiation, applying the chain
rule of calculus to compute gradients with respect to all model parameters. This flexibility is a key
reason why PyTorch has gained popularity for deep learning research and production, as it com-
bines high performance with flexibility and transparency, allowing researchers and engineers
to experiment with dynamic architectures and complex control flows without sacrificing efficiency.
12.3 JAX
Literature Review: Li et al. (2024) [313] introduced JAX-based differentiable density func-
tional theory (DFT), enabling end-to-end differentiability in materials science simulations. This
paper extends machine learning theory into quantum chemistry by leveraging JAX’s automatic dif-
ferentiation and parallelization capabilities for efficient optimization of density functional models.
Bieberich and Li (2024) [314] explored quantum machine learning (QML) using JAX and Diffrax
to solve neural differential equations efficiently. They developed a new theoretical model for quan-
tum neural ODEs and discussed how JAX facilitates efficient GPU-based quantum simulations.
Dagréou et al. (2024) [315] analyzed the efficiency of Hessian-vector product (HVP) computation
in JAX and PyTorch for deep learning. They established a mathematical foundation for computing
second-order derivatives in deep learning and optimization, showcasing JAX’s superior automatic
differentiation. Lohoff and Neftci (2025) [316] developed a deep reinforcement learning (DRL)
model that optimizes JAX’s autograd engine for scientific computing. They demonstrated how
reinforcement learning improves computational efficiency in JAX through a theoretical framework
that eliminates redundant computations in deep learning. Legrand et al. (2024) [317] introduced
a JAX and Rust-based deep learning library for predictive coding networks (PCNs). They explored
theoretical extensions of neural networks beyond traditional backpropagation, providing a formal-
ized framework for hierarchical generative models. Alzás and Radev (2024) [318] used JAX to create
differentiable models for nuclear reactions, demonstrating its power in high-energy physics simu-
lations. They established a new differentiable framework for theoretical physics, utilizing JAX’s
gradient-based optimization to improve nuclear physics modeling. Edenhofer et al. (2024) [319]
developed a Gaussian Process and Variational Inference framework in JAX, extending traditional
Bayesian methods. They bridged statistical physics and deep learning, formulating a theoretical
link between Gaussian processes and deep neural networks using JAX. Chan et al. (2024) [320]
proposed a JAX-based quantum machine learning framework for long-tailed X-ray classification.
They introduced a novel quantum transfer learning technique within JAX, demonstrating its ad-
vantages over classical deep learning models in medical imaging. Ye et al. (2025) [321] used JAX
to model electron transfer kinetics, bridging deep learning and density functional theory (DFT).
They developed a new theoretical framework for modeling charge transfer reactions, leveraging
JAX’s high-performance computation for quantum chemistry applications. Khan et al. (2024)
[322] extended NODEs using JAX’s efficient autodiff capabilities for high-dimensional dynamical
systems. They established a rigorous mathematical framework for extending NODEs to stochastic
and chaotic systems, leveraging JAX’s high-speed parallelization.
JAX’s automatic differentiation is central to its ability to compute gradients, Jacobians, Hes-
sians, and other derivatives efficiently. For many applications, the function of interest involves
computing gradients with respect to model parameters in optimization and machine learning tasks.
Automatic differentiation allows for the efficient computation of these gradients using the reverse-
mode differentiation technique. Let us consider a function f : Rn → Rm , and suppose we wish to
compute the gradient of the scalar-valued output with respect to each input variable. The gradient
of f , denoted as ∇x f , is a vector of partial derivatives:
∇x f(x) = (∂f1/∂x1, ∂f1/∂x2, . . . , ∂f1/∂xn, . . . , ∂fm/∂xn) (612)
where f = (f1 , f2 , . . . , fm ) represents a vector of m scalar outputs, and x = (x1 , x2 , . . . , xn ) repre-
sents the input vector. Reverse-mode differentiation computes this gradient by applying the chain
rule in reverse order. If f is composed of several intermediate functions, say f = g ◦ h, where
g : Rm → Rp and h : Rn → Rm , the gradient of f with respect to x is computed recursively by
applying the chain rule:
∇x f(x) = (∂g/∂h) · (∂h/∂x1, ∂h/∂x2, . . . , ∂h/∂xn). (613)
This recursive application of the chain rule ensures that each intermediate gradient computation is
propagated backward through the function’s layers, reducing the number of required passes com-
pared to forward-mode differentiation. This technique becomes particularly beneficial for functions
where the number of outputs m is much smaller than the number of inputs n, as it minimizes the
computational complexity. In the context of JAX, automatic differentiation is utilized through func-
tions like jax.grad, which can be applied to scalar-valued functions to return their gradients with
respect to vector-valued inputs. To compute higher-order derivatives, such as the Hessian matrix,
JAX allows for the computation of second- and higher-order derivatives using similar principles.
The Hessian matrix H of a scalar function f(x) is given by the matrix of second derivatives:
Hij = ∂²f / (∂xi ∂xj), (614)
which is computed by applying the chain rule once again. The second-order derivatives can be computed efficiently by differentiating the gradient once more, and this process can be extended to higher-order derivatives by continuing the recursive application of the chain rule. A central
concept in JAX’s approach to high-performance computing is JIT (just-in-time) compilation,
which provides substantial performance gains by compiling Python functions into optimized ma-
chine code tailored to the underlying hardware architecture. JIT compilation in JAX is built on the
foundation of the XLA (Accelerated Linear Algebra) compiler. XLA optimizes the execution
of tensor operations by fusing multiple operations into a single kernel, thereby reducing the overhead
associated with launching individual computation kernels. This technique is particularly effective
for matrix multiplications, convolutions, and other tensor operations commonly found in machine
learning tasks. For example, consider a simple sequence of operations f = Op1 (Op2 (. . . (Opn (x)))),
where Opi represents different mathematical operations applied to the input tensor x. Without
optimization, each operation would typically be executed separately, introducing significant over-
head. JAX’s JIT compiler, however, recognizes this sequence and applies a fusion transformation,
resulting in a single composite operation:
Optimized(f (x)) = Fused Op(x), (615)
where Fused Op represents a highly optimized version of the original sequence of operations. This
optimization minimizes the number of kernel launches and reduces memory access overhead, which
in turn accelerates the computation. The JIT compiler analyzes the computational graph of the
function and identifies opportunities to combine operations into a more efficient form, ultimately
speeding up the computation on hardware accelerators such as GPUs or TPUs.
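A compact sketch (the test function is a hypothetical stand-in) ties these pieces together: jax.grad and jax.hessian realize equations (612) and (614), while jax.jit hands the traced computation to XLA for the kind of operator fusion described by equation (615):

```python
import jax
import jax.numpy as jnp

def f(x):
    # A chain of elementwise operations Op1(Op2(...)) that XLA can fuse
    # into a single kernel, cf. eq. (615).
    return jnp.sum(jnp.tanh(x) ** 2 + 0.5 * x)

x = jnp.array([1.0, 2.0])
print(jax.grad(f)(x))        # reverse-mode gradient, eq. (612)
print(jax.hessian(f)(x))     # matrix of second derivatives, eq. (614)

f_fast = jax.jit(f)                      # traced and compiled on first call
print(f_fast(jnp.ones((1024, 1024))))    # later calls reuse the fused kernel
```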
The vectorization capability provided by JAX through the jax.vmap operator is another essen-
tial optimization for high-performance computing. This feature automatically vectorizes functions
across batches of data, allowing the same operation to be applied in parallel across multiple data
points. Mathematically, for a function f : Rn → Rm and a batch of inputs X ∈ RB×n , the vectorized
function can be expressed as:
Y = vmap(f )(X), (616)
where B is the batch size and Y is the matrix in RB×m , containing the results of applying f to
each row of X. The mathematical operation applied by JAX is the same as applying f to each
individual row Xi , but with the benefit that the entire batch is processed in parallel, exploiting
the available hardware resources efficiently. The ability to parallelize computations across
multiple devices is one of JAX’s strongest features, and it is enabled through the jax.pmap
operator. This operator allows for the parallel execution of functions across different devices, such
as multiple GPUs or TPUs. Suppose we have a function f : Rn → Rm and a batch of inputs
X = (X1 , X2 , . . . , Xp ), distributed across p devices. The parallelized execution of the function can
be written as:
Y = pmap(f )(X), (617)
where each device independently computes its portion of the computation f (Xi ), and the results
are gathered into the final output Y. This capability is essential for large-scale distributed training
of machine learning models, where the model’s parameters and data must be distributed across
multiple devices to ensure efficient training. The parallelization effectively reduces computation
time, as each device operates on a distinct subset of the data and model parameters. GPU/TPU
acceleration is another crucial aspect of JAX’s performance, and it is facilitated by libraries like
cuBLAS for GPUs, which are specifically designed to optimize matrix operations. The primary
operation used in many numerical computing tasks is matrix multiplication, and JAX optimizes
this by leveraging hardware-accelerated implementations of these operations. Consider the matrix
multiplication of two matrices A and B, where A ∈ Rn×m and B ∈ Rm×p , resulting in a matrix
C ∈ Rn×p :
C = A × B. (618)
Using cuBLAS or a similar library, JAX can execute this operation on a GPU, utilizing the massive
parallel processing power of the hardware to perform the multiplication efficiently. This operation
can be further optimized by considering the specific memory hierarchies of GPUs, where large
matrix multiplications are broken down into smaller tiles that fit into the GPU’s high-speed mem-
ory. This technique minimizes memory bandwidth constraints, accelerating the computation. In
addition to these core operations, JAX allows for the definition of custom gradients using the
jax.custom_jvp decorator, which enables users to specify the Jacobian-vector products (JVPs)
manually for more efficient gradient computation. This feature is especially useful in machine
learning applications, where certain operations might have custom gradients that cannot be com-
puted automatically. For instance, in a non-trivial activation function such as the softmax, the
custom gradient function might be provided explicitly for efficiency:
∂softmax(x)/∂x = diag(softmax(x)) − softmax(x) · softmax(x)^T. (619)
Thus, JAX allows for both flexibility and performance, enabling scientific computing applications
that require both efficiency and the ability to define complex, custom derivatives.
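As a closing sketch, the snippet below shows jax.vmap batching per equation (616) and a hand-written derivative rule attached through jax.custom_jvp; the softplus function is a hypothetical stand-in for the softmax example of equation (619):

```python
import jax
import jax.numpy as jnp

# Batching with vmap, eq. (616): f is mapped over the leading axis of X.
f = lambda x: jnp.tanh(x)
X = jnp.ones((32, 8))
print(jax.vmap(f)(X).shape)   # (32, 8)

# Custom derivative via jax.custom_jvp: we supply the exact JVP of
# softplus(x) = log(1 + exp(x)) instead of letting autodiff derive it.
@jax.custom_jvp
def softplus(x):
    return jnp.log1p(jnp.exp(x))

@softplus.defjvp
def softplus_jvp(primals, tangents):
    (x,), (dx,) = primals, tangents
    return softplus(x), jax.nn.sigmoid(x) * dx  # d/dx log(1+e^x) = sigmoid(x)

print(jax.grad(lambda x: jnp.sum(softplus(x)))(jnp.array([0.0, 1.0])))
```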
13 Appendix
13.1 Linear Algebra Essentials
13.1.1 Matrices and Vector Spaces
Definition of a Matrix: A matrix A is a rectangular array of numbers (or elements from a field
F), arranged in rows and columns:
A = [a11 a12 ··· a1n ; a21 a22 ··· a2n ; … ; am1 am2 ··· amn] ∈ F^{m×n} (620)
where aij denotes the entry of A at the i-th row and j-th column. A square matrix is one where
m = n. A matrix is diagonal if all off-diagonal entries are zero. For matrices A ∈ Fm×n and
B ∈ Fm×n the following are the matrix operations:
• Addition: Defined entrywise:
(A + B)ij = Aij + Bij (621)
• Scalar Multiplication: For α ∈ F,
(αA)ij = α · Aij (622)
• Matrix Multiplication: If A ∈ Fm×p and B ∈ Fp×n , then the product C = AB is given by:
Cij = Σ_{k=1}^{p} Aik Bkj (623)
This is only defined when the number of columns of A equals the number of rows of B.
• Transpose: The transpose of A, denoted A^T, satisfies:
(A^T)ij = Aji (624)
• Determinant: For a square matrix A ∈ F^{n×n}, the determinant can be defined by the Laplace expansion along the first row:
det(A) = Σ_{j=1}^{n} (−1)^{1+j} a1j det(A1j) (625)
where A1j is the (n − 1) × (n − 1) submatrix obtained by removing the first row and j-th column.
• Inverse: A square matrix A is invertible if there exists A−1 such that:
AA−1 = A−1 A = I (626)
where I is the identity matrix.
13.1.2 Vector Spaces and Linear Transformations
A basis of a vector space V over a field F is a linearly independent set of vectors that spans V. The dimension of V, denoted dim(V), is the number of basis vectors. Linear Transformations:
A function T : V → W is linear if:
T (αv + βw) = αT (v) + βT (w) (629)
The matrix representation of T is the matrix A such that:
T (x) = Ax (630)
13.1.3 Eigenvalues and Eigenvectors
Definition: For a square matrix A ∈ F^{n×n}, an eigenvalue λ and eigenvector v ≠ 0 satisfy:
Av = λv (631)
The eigenvalues are the roots of the characteristic equation
det(A − λI) = 0, (632)
which gives an n-th degree polynomial in λ. The set of all solutions v to (A − λI)v = 0 is the eigenspace associated with λ. Every matrix A ∈ F^{m×n} also admits a singular value decomposition (SVD)
A = UΣV^T (633)
where U and V have orthonormal columns and Σ is diagonal with nonnegative singular values.
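Both decompositions are available in standard numerical libraries; the following minimal check uses a hypothetical 2×2 symmetric matrix:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
lam, V = np.linalg.eig(A)      # eigenpairs solving Av = lambda v, eq. (631)
print(np.sort(lam))            # [1., 3.]

U, S, Vt = np.linalg.svd(A)    # A = U diag(S) V^T, eq. (633)
print(np.allclose(U @ np.diag(S) @ Vt, A))   # True
```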
13.2 Probability and Statistics Essentials
Discrete Probability Distributions: For a discrete random variable X, which takes values from a countable set S, the probability mass function (PMF) is defined as p(x) = P(X = x) for x ∈ S, with p(x) ≥ 0 and Σ_{x∈S} p(x) = 1.
An example of a discrete probability distribution is the binomial distribution, which describes
the number of successes in a fixed number of independent Bernoulli trials. The PMF for the
binomial distribution is:
P(X = k) = (n choose k) p^k (1 − p)^{n−k}, k = 0, 1, . . . , n (638)
Where n is the number of trials, p is the probability of success on each trial, and k is the number
of successes.
Let A and B be two events. Then, Bayes’ theorem gives the conditional probability of A given B:
P(A|B) = P(B|A) P(A) / P(B) (642)
where P (A|B) is the posterior probability of A given B, P (B|A) is the likelihood, the probability
of observing B given A, P (A) is the prior probability of A, P (B) is the marginal likelihood of B,
computed as:
P(B) = Σ_i P(B|Ai) P(Ai) (643)
This allows one to update beliefs about a hypothesis A based on observed evidence B. Let us
consider a diagnostic test for a disease. Let A be the event that a person has the disease and B
be the event that the test is positive. We are interested in the probability that a person has the
disease given that the test is positive, i.e., P (A|B). By Bayes’ theorem, we have:
P(A|B) = P(B|A) P(A) / P(B) (645)
where P (B|A) is the probability of a positive test result given that the person has the disease
(sensitivity), P (A) is the prior probability of having the disease, P (B) is the total probability of a
positive test result.
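Plugging hypothetical numbers into equations (643) and (645) makes the computation concrete; with 1% prevalence, 99% sensitivity, and a 5% false-positive rate, most positive tests are still false alarms:

```python
# Hypothetical diagnostic-test numbers (illustrative only).
p_disease = 0.01             # prior P(A)
p_pos_given_disease = 0.99   # sensitivity P(B|A)
p_pos_given_healthy = 0.05   # false-positive rate

# Marginal likelihood of a positive test, eq. (643).
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior probability of disease given a positive test, eq. (645).
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))   # 0.167
```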
Each of these measures provides different insights into the characteristics of a dataset or a prob-
ability distribution. There are several Measures of Central Tendency. Given a probability space
(Ω, F, P ) and a random variable X : Ω → R, the expectation (or mean) is defined as:
E[X] = ∫_Ω X(ω) dP(ω) (646)
The mode is defined as the point xm that maximizes the probability density function: xm = arg max_x f(x).
If Q1 and Q3 denote the first and third quartiles of a dataset (where Q1 is the 25th percentile and
Q3 is the 75th percentile), then the interquartile range is:
IQR = Q3 − Q1 (655)
The covariance of X and Y is defined as Cov(X, Y) = E[(X − E[X])(Y − E[Y])]. Expanding:
Cov(X, Y) = E[XY] − E[X]E[Y] (659)
The Pearson correlation coefficient is defined as:
ρ(X, Y) = Cov(X, Y) / (σX σY) (660)
where σX and σY are the standard deviations of X and Y, respectively. A key information-theoretic measure is entropy: the entropy of a discrete probability distribution p(x) is given by:
H(X) = − Σ_x p(x) log p(x) (661)
For continuous distributions with density f(x), the differential entropy is:
h(X) = − ∫_{−∞}^{∞} f(x) log f(x) dx (662)
A closely related quantity is the mutual information
I(X; Y) = H(Y) − H(Y | X), (663)
which measures how much knowing X reduces uncertainty about Y. Statistical Measures satisfy
Linearity and Invariance i.e.
• Expectation is linear:
E[aX + bY ] = aE[X] + bE[Y ] (664)
For the Convergence and Asymptotic Behavior, The law of large numbers ensures that empirical
means converge to the expected value, while the central limit theorem states that sums of i.i.d.
random variables converge in distribution to a normal distribution.
The mean or expected value of a random variable X, denoted by E[X], represents the aver-
age value of X. For a discrete random variable:
E[X] = Σ_{xi∈S} xi p(xi) (666)
The variance of a random variable X, denoted by Var(X), measures the spread or dispersion of the distribution. For a discrete random variable:
Var(X) = E[(X − E[X])²] = Σ_{xi∈S} (xi − E[X])² p(xi)
The standard deviation is the square root of the variance and provides a measure of the spread
of the distribution in the same units as the random variable:
SD(X) = √Var(X) (670)
The skewness of a random variable X quantifies the asymmetry of the probability distribution.
It is defined as:
Skew(X) = E[(X − E[X])³] / (Var(X))^{3/2} (671)
A positive skew indicates that the distribution has a long tail on the right, while a negative skew
indicates a long tail on the left. The kurtosis of a random variable X measures the "tailedness"
of the distribution, i.e., how much of the probability mass is concentrated in the tails. It is defined
as:
Kurt(X) = E[(X − E[X])⁴] / (Var(X))² (672)
A distribution with high kurtosis has heavy tails, and one with low kurtosis has light tails compared
to a normal distribution.
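These moments are easy to estimate empirically. The sketch below uses SciPy and a hypothetical Exp(1) sample, whose theoretical skewness is 2 and whose kurtosis is 9 in the non-excess convention of equation (672):

```python
import numpy as np
from scipy import stats

x = np.random.default_rng(0).exponential(size=100_000)  # right-skewed sample
print(x.mean(), x.var())                 # both ~1.0 for Exp(1)
print(stats.skew(x))                     # ~2.0, eq. (671)
print(stats.kurtosis(x, fisher=False))   # ~9.0, eq. (672)
```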
13.3 Optimization Techniques
13.3.1 Gradient Descent (GD)
Gradient Descent is an iterative optimization algorithm used to minimize a differentiable function.
The goal is to find the point where the function achieves its minimum value. Mathematically, it
can be formulated as follows. Given a differentiable objective function f : Rn → R, the gradient
descent update rule is:
xk+1 = xk − η∇f (xk ) (673)
where:
• xk ∈ Rn is the current point in the n-dimensional space (iteration index k),
• ∇f (xk ) is the gradient of the objective function at xk ,
• η is the learning rate (step size).
To analyze the convergence of gradient descent, we assume f is convex and differentiable with
a Lipschitz continuous gradient. That is, there exists a constant L > 0 such that:
∥∇f (x) − ∇f (y)∥ ≤ L∥x − y∥, ∀x, y ∈ Rn . (674)
This property ensures the gradient of f does not change too rapidly, which allows us to bound the convergence rate. If, in addition, f is strongly convex with parameter μ > 0, the function value decreases geometrically at each iteration:
f(xk+1) − f(x∗) ≤ (1 − ημ)(f(xk) − f(x∗)), (675)
where x∗ is the global minimum. Thus, we have the following convergence rate:
f(xk) − f(x∗) ≤ (1 − ημ)^k (f(x0) − f(x∗)). (676)
For this to converge, we require 0 < ημ < 1, which holds for any step size η ≤ 1/L (since μ ≤ L). Hence, the step size η must be chosen carefully to ensure convergence; under convexity alone, without strong convexity, gradient descent with η ≤ 1/L still converges, but at the slower sublinear rate f(xk) − f(x∗) = O(1/k).
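A minimal sketch of the update rule (673), on a hypothetical least-squares objective whose smoothness constant is known in closed form:

```python
import numpy as np

# Gradient descent, eq. (673), on f(x) = ||Ax - b||^2, whose gradient is
# 2 A^T (Ax - b) and whose Lipschitz constant is L = 2 ||A||_2^2.
rng = np.random.default_rng(0)
A, b = rng.normal(size=(10, 3)), rng.normal(size=10)

grad_f = lambda x: 2 * A.T @ (A @ x - b)
eta = 1.0 / (2 * np.linalg.norm(A, 2) ** 2)   # step size eta = 1/L

x = np.zeros(3)
for _ in range(1000):
    x -= eta * grad_f(x)                      # eq. (673)

print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0], atol=1e-6))
```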
Let the objective function be the sum of individual functions fi (x) corresponding to each data
point:
f(x) = (1/m) Σ_{i=1}^{m} fi(x), (677)
where m is the number of data points. In Stochastic Gradient Descent, the update rule becomes:
xk+1 = xk − η∇fik (xk ), (678)
where ik is a randomly chosen index at the k-th iteration, and ∇fik (x) is the gradient of the function
fik(x) corresponding to that randomly selected data point. The stochastic gradient is an unbiased estimator of the full gradient:
E_{ik}[∇fik(xk)] = ∇f(xk). (679)
Given that the gradient is stochastic, the convergence analysis of SGD is more complex. Assuming
that each fi is convex and differentiable, and using the strong convexity assumption (i.e., there exists a constant m > 0 such that f satisfies the inequality):
f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (m/2)∥x − y∥², ∀x, y ∈ Rn, (680)
SGD converges to the optimal solution at a rate of:
E[f(xk) − f(x∗)] ≤ C/k, (681)
where C is a constant depending on the step size η, the variance of the stochastic gradients, and
the strong convexity constant m. This slower convergence rate is due to the inherent noise in the
gradient estimates. Variance reduction techniques such as mini-batch SGD (using multiple data
points per iteration) or Momentum (accumulating past gradients) are often employed to improve
convergence speed and stability.
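The sketch below (again on hypothetical least-squares data) implements the update (678) with a single randomly drawn sample per step; the iterates approach the least-squares solution but keep fluctuating at a noise floor set by the step size:

```python
import numpy as np

# SGD, eq. (678): one per-sample gradient per step, on
# f(x) = (1/m) sum_i (a_i^T x - b_i)^2.
rng = np.random.default_rng(1)
A, b = rng.normal(size=(200, 3)), rng.normal(size=200)

x, eta = np.zeros(3), 0.01
for step in range(20_000):
    i = rng.integers(len(b))                  # random index i_k
    x -= eta * 2 * (A[i] @ x - b[i]) * A[i]   # gradient of f_i, eq. (678)

print(x)   # hovers near the least-squares solution
```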
Second-order methods typically have faster convergence rates compared to gradient descent, par-
ticularly when the function f has well-conditioned curvature. However, computing the Hessian is
computationally expensive, which limits the scalability of these methods. Newton’s method is a
widely used second-order optimization technique that uses both the gradient and the Hessian. The
update rule is given by:
xk+1 = xk − ηH−1 (xk )∇f (xk ). (683)
Newton’s method converges quadratically near the optimal point under the assumption that the
objective function is twice continuously differentiable and the Hessian is positive definite. More
formally, if xk is sufficiently close to the optimal point x∗ , then the error ∥xk − x∗ ∥ decreases
quadratically:
∥xk+1 − x∗ ∥ ≤ C∥xk − x∗ ∥2 , (684)
where C is a constant depending on the condition number of the Hessian.
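A scalar sketch of the update (683), with η = 1 and a hypothetical quartic objective, makes the quadratic error decay of (684) visible after only a few iterations:

```python
# Newton's method, eq. (683), on f(x) = x^4 + x^2 (minimizer x* = 0).
f_grad = lambda x: 4 * x**3 + 2 * x
f_hess = lambda x: 12 * x**2 + 2

x = 2.0
for _ in range(10):
    x = x - f_grad(x) / f_hess(x)   # eq. (683) with eta = 1
print(x)   # ~0; near x* the error roughly squares each step, eq. (684)
```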
Since directly computing the Hessian is expensive, quasi-Newton methods aim to approximate
the inverse Hessian at each iteration. One of the most popular quasi-Newton methods is the Broy-
den–Fletcher–Goldfarb–Shanno (BFGS) method, which maintains an approximation to the
inverse Hessian, updating it at each iteration. To summarize the methods discussed above:
• Gradient Descent (GD): An optimization algorithm that updates the parameter vector in
the direction opposite to the gradient of the objective function. Convergence is guaranteed
under convexity assumptions with an appropriately chosen step size.
• Stochastic Gradient Descent (SGD): A variant of GD that uses a random subset of
the data to estimate the gradient at each iteration. While faster and less computationally
intensive, its convergence is slower and more noisy, requiring variance reduction techniques
for efficient training.
• Second-Order Methods: These methods use the Hessian (second derivatives of the ob-
jective function) to accelerate convergence, often exhibiting quadratic convergence near the
optimum. However, the computational cost of calculating the Hessian restricts their practical
use. Quasi-Newton methods, such as BFGS, approximate the Hessian to improve efficiency.
Each of these methods has its advantages and trade-offs, with gradient-based methods being widely
used due to their simplicity and efficiency, and second-order methods providing faster convergence
but at higher computational costs.
13.4 Matrix Calculus
13.4.1 Matrix Differentiation
Consider a matrix A of size m × n, where A = [aij ]. For the purposes of differentiation, we will
focus on functions f (A) that map matrices to scalars or other matrices. We aim to compute the
derivative of f (A) with respect to A. Let f (A) be a scalar function of the matrix A. The derivative
of this scalar function with respect to A is defined as:
[∂f(A)/∂A]ij = ∂f(A)/∂aij (685)
This is a matrix where the (i, j)-th entry is the partial derivative of the scalar function with respect
to the element aij . Let us take an example of Differentiating the Frobenius Norm. Consider the
Frobenius norm of a matrix A, defined as:
∥A∥F = √( Σ_{i=1}^{m} Σ_{j=1}^{n} a²ij ) (686)
To compute the derivative of ∥A∥F with respect to A, we apply the chain rule to the square root of the sum of squares, which gives:
∂∥A∥F/∂A = A / ∥A∥F (687)
Further standard differentiation rules include:
• Matrix trace: For a matrix A, the derivative of the trace Tr(A) with respect to A is the
identity matrix:
∂Tr(A)/∂A = I (688)
• Matrix product: Let A and B be matrices, and consider the product f (A) = AB. The
derivative of this product with respect to A is:
∂(AB)/∂A = B^T (689)
• Matrix inverse: For an invertible matrix A,
∂(A^{-1})/∂A = −A^{-1} (∂A/∂A) A^{-1} (690)
Let T be a tensor, represented by the array of components Ti1 ,i2 ,...,ik , where the indices i1 , i2 , . . . , ik
are the dimensions of the tensor. Let f (T) be a scalar-valued function that depends on the tensor
T. The derivative of this function with respect to the tensor components Ti1 ,...,ik is given by:
∂f (T)
= Jacobian of f (T) with respect to Ti1 ,...,ik (691)
∂Ti1 ,...,ik
For example, consider a function of a second-order tensor, f (T), where T is a matrix. The dif-
ferentiation rule follows similar principles as matrix differentiation. The Jacobian is computed for
each tensor component in the same fashion, based on the partial derivatives with respect to the
individual tensor components.
Consider a second-order tensor T, and let us compute the derivative of the Frobenius norm of T:
∥T∥F = √( Σ_{i1,...,ik} T²_{i1,...,ik} ) (692)
By the same chain-rule argument as in the matrix case, ∂∥T∥F/∂T_{i1,...,ik} = T_{i1,...,ik} / ∥T∥F.
We start with the Differentiation of Scalar-Valued Functions with Matrix Arguments. Let f :
Rm×n → R be a scalar function of a matrix X. The differential df of f is defined by:
df = lim_{∥H∥→0} [f(X + H) − f(X)] / ∥H∥ (696)
where H is an infinitesimal perturbation. The total derivative of f is given by:
df = tr( (∂f/∂X)^T dX ). (697)
Definition of the Matrix Gradient: The gradient DX f (or Jacobian) is the unique matrix satis-
fying:
df = tr( D_X^T dX ). (698)
This ensures that differentiation is dual to the Frobenius inner product ⟨A, B⟩ = tr(AT B), giving
a Hilbert space structure. Let’s start with the example of Quadratic Form Differentiation. Let
f (X) = tr(XT AX). Expanding in a small perturbation H:
f (X + H) = tr((X + H)T A(X + H)). (699)
Expanding and isolating linear terms:
df = tr(HT AX) + tr(XT AH). (700)
Using the cyclic property of the trace:
df = tr(HT (AX + AT X)). (701)
Thus, the derivative is:
∂f/∂X = AX + A^T X. (702)
If A is symmetric (AT = A), this simplifies to:
∂f/∂X = 2AX. (703)
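This result can be verified numerically with automatic differentiation; the sketch below uses a hypothetical random A and X together with PyTorch's autograd:

```python
import torch

# Numerical check of eq. (702): d tr(X^T A X) / dX = A X + A^T X.
torch.manual_seed(0)
A = torch.randn(4, 4)
X = torch.randn(4, 3, requires_grad=True)

f = torch.trace(X.T @ A @ X)
f.backward()
print(torch.allclose(X.grad, A @ X + A.T @ X, atol=1e-5))   # True
```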
Regarding the Differentiation of Matrix-Valued Functions. Consider a differentiable function F :
Rm×n → Rp×q . The Fréchet derivative DX F is a fourth-order tensor satisfying:
dF = DX F : dX. (704)
Regarding the Differentiation of the Matrix Inverse, for F(X) = X−1 we use the identity:
d(XX−1 ) = 0 ⇒ dXX−1 + XdX−1 = 0. (705)
Solving for dX−1 :
dX−1 = −X−1 (dX)X−1 . (706)
Thus, the derivative is the negative bilinear operator:
DX (X−1 ) = −(X−1 ⊗ X−1 ). (707)
where ⊗ denotes the Kronecker product. For Differentiation of Tensor-Valued Functions. We need
to have a differentiable tensor function F : Rm×n×p → Ra×b×c , the Fréchet derivative shall be a
higher-order tensor DX F satisfying:
dF = DX F : dX . (708)
Let’s do a Differentiation of Tensor Contraction. If f (X ) = X : A, where X , A are second-order
tensors, then:
∂(X : A)/∂X = A. (709)
For a fourth-order tensor C, if f (X ) = C : X , then:
∂(C : X)/∂X = C. (710)
Differentiation can be also done in Non-Euclidean Spaces. For a manifold M, differentiation is
defined via tangent spaces TX M, with the covariant derivative ∇X satisfying the Levi-Civita
connection:
∇X Y = lim_{ϵ→0} [ Proj_{T_{X+ϵH} M}(Y(X + ϵH)) − Y(X) ] / ϵ. (711)
We can do differentiation using Variational Principles also. If f (X) is an energy functional, the
differentiation that follows from Gateaux derivatives is:
δf = lim_{ϵ→0} [f(X + ϵH) − f(X)] / ϵ. (712)
For functionals, differentiation uses Euler-Lagrange equations:
(d/dt) ∫_Ω L(X, ∇X) dV = 0. (713)
14 Acknowledgments
The author acknowledges the contributions of researchers whose foundational work has shaped our understanding of Deep Learning.
References
[1] Rao, N., Farid, M., and Raiz, M. (2024). Symmetric Properties of λ-Szász Operators Coupled
with Generalized Beta Functions and Approximation Theory. Symmetry, 16(12), 1703.
[2] Mukhopadhyay, S.N., Ray, S. (2025). Function Spaces. In: Measure and Integration. University
Texts in the Mathematical Sciences. Springer, Singapore.
[3] Szoldra, T. (2024). Ergodicity breaking in quantum systems: from exact time evolution to
machine learning (Doctoral dissertation).
[4] Song, W. X., Chen, H., Cui, C., Liu, Y. F., Tong, D., Guo, F., ... and Xiao, C. W. (2025). Theoretical, methodological, and implementation considerations for establishing a sustainable urban renewal model. Journal of Natural Resources, 40(1), 20-38.
[5] El Mennaoui, O., Kharou, Y., and Laasri, H. (2025). Evolution families in the framework of
maximal regularity. Evolution Equations and Control Theory, 0-0.
[6] Pedroza, G. (2024). On the Conditions for Domain Stability for Machine Learning: a Mathe-
matical Approach. arXiv preprint arXiv:2412.00464.
[7] Cerreia-Vioglio, S., and Ok, E. A. (2024). Abstract integration of set-valued functions. Journal
of Mathematical Analysis and Applications, 129169.
[8] Averin, A. (2024). Formulation and Proof of the Gravitational Entropy Bound. arXiv preprint
arXiv:2412.02470.
[9] Potter, T. (2025). Subspaces of L2 (Rn ) Invariant Under Crystallographic Shifts. arXiv e-prints,
arXiv-2501.
[11] Wang, R., Cai, L., Wu, Q., and Niyato, D. (2025). Service Function Chain Deployment with
Intrinsic Dynamic Defense Capability. IEEE Transactions on Mobile Computing.
[12] Duim, J. L., and Mesquita, D. P. (2025). Artificial Intelligence Value Alignment via Inverse
Reinforcement Learning. Proceeding Series of the Brazilian Society of Computational and
Applied Mathematics, 11(1), 1-2.
[13] Khayat, M., Barka, E., Serhani, M. A., Sallabi, F., Shuaib, K., and Khater, H. M. (2025).
Empowering Security Operation Center with Artificial Intelligence and Machine Learning–A
Systematic Literature Review. IEEE Access.
[14] Agrawal, R. (2025). 46 Detection of melanoma using DenseNet-based adaptive weighted loss
function. Emerging Trends in Computer Science and Its Application, 283.
[15] Hailemichael, H., and Ayalew, B. Adaptive and Safe Fast Charging of Lithium-Ion Batteries
Via Hybrid Model Learning and Control Barrier Functions. Available at SSRN 5110597.
[16] Nguyen, E., Xiao, J., Fan, Z., and Ruan, D. Contrast-free Full Intracranial Vessel Geometry
Estimation from MRI with Metric Learning based Inference. In Medical Imaging with Deep
Learning.
[17] Luo, Z., Bi, Y., Yang, X., Li, Y., Wang, S., and Ye, Q. A Novel Machine Vision-Based Collision
Risk Warning Method for Unsignalized Intersections on Arterial Roads. Frontiers in Physics,
13, 1527956.
[18] Bousquet, N., Thomassé, S. (2015). VC-dimension and Erdős–Pósa property. Discrete Math-
ematics, 338(12), 2302-2317.
[19] Aslan, O., Yildiz, O. T., Alpaydin, E. (2009, September). Calculating the VC-dimension of
decision trees. In 2009 24th International Symposium on Computer and Information Sciences
(pp. 193-198). IEEE.
[20] Zhang, C., Bian, W., Tao, D., Lin, W. (2012). Discretized-Vapnik-Chervonenkis dimension
for analyzing complexity of real function classes. IEEE transactions on neural networks and
learning systems, 23(9), 1461-1472.
[21] Riondato, M., Akdere, M., Çetintemel, U., Zdonik, S. B., Upfal, E. (2011). The VC-dimension
of SQL queries and selectivity estimation through sampling. In Machine Learning and Knowl-
edge Discovery in Databases: European Conference, ECML PKDD 2011, Athens, Greece,
September 5-9, 2011, Proceedings, Part II 22 (pp. 661-676). Springer Berlin Heidelberg.
[22] Bane, M., Riggle, J., Sonderegger, M. (2010). The VC dimension of constraint-based grammars.
Lingua, 120(5), 1194-1208.
[23] Anderson, A. (2023). Fuzzy VC Combinatorics and Distality in Continuous Logic. arXiv
preprint arXiv:2310.04393.
[24] Fox, J., Pach, J., Suk, A. (2021). Bounded VC-dimension implies the Schur-Erdős conjecture.
Combinatorica, 41(6), 803-813.
[25] Johnson, H. R. (2021). Binary strings of finite VC dimension. arXiv preprint arXiv:2101.06490.
[26] Janzing, D. (2018). Merging joint distributions via causal model classes with low VC dimension.
arXiv preprint arXiv:1804.03206.
[27] Hüllermeier, E., Fallah Tehrani, A. (2012, July). On the vc-dimension of the choquet integral.
In International Conference on Information Processing and Management of Uncertainty in
Knowledge-Based Systems (pp. 42-50). Berlin, Heidelberg: Springer Berlin Heidelberg.
[28] Mohri, M. (2018). Foundations of machine learning.
[29] Cucker, F., Zhou, D. X. (2007). Learning theory: an approximation theory viewpoint (Vol.
24). Cambridge University Press.
[30] Shalev-Shwartz, S., Ben-David, S. (2014). Understanding machine learning: From theory to
algorithms. Cambridge university press.
[31] Truong, L. V. (2022). On rademacher complexity-based generalization bounds for deep learn-
ing. arXiv preprint arXiv:2208.04284.
[32] Gnecco, G., and Sanguineti, M. (2008). Approximation error bounds via Rademacher com-
plexity. Applied Mathematical Sciences, 2, 153-176.
[33] Astashkin, S. V. (2010). Rademacher functions in symmetric spaces. Journal of Mathematical
Sciences, 169(6), 725-886.
[34] Ying, Y., and Campbell, C. (2010). Rademacher chaos complexities for learning the kernel problem. Neural Computation, 22(11), 2858-2886.
[35] Zhu, J., Gibson, B., and Rogers, T. T. (2009). Human rademacher complexity. Advances in
neural information processing systems, 22.
[36] Astashkin, S. V. (2020). The Rademacher system in function spaces. Basel: Birkhäuser.
[37] Sachs, S., van Erven, T., Hodgkinson, L., Khanna, R., and Şimşekli, U. (2023, July). Gener-
alization Guarantees via Algorithm-dependent Rademacher Complexity. In The Thirty Sixth
Annual Conference on Learning Theory (pp. 4863-4880). PMLR.
[38] Ma and Wang (2020). Rademacher complexity and the generalization error of residual net-
works. Communications in Mathematical Sciences, 18(6), 1755-1774.
[39] Bartlett, P. L., and Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk
bounds and structural results. Journal of Machine Learning Research, 3(Nov), 463-482.
[40] Bartlett, P. L., and Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk
bounds and structural results. Journal of Machine Learning Research, 3(Nov), 463-482.
[41] McDonald, D. J., and Shalizi, C. R. (2011). Rademacher complexity of stationary sequences.
arXiv preprint arXiv:1106.0730.
[43] Giang, T. H., Tri, N. M., and Tuan, D. A. (2024). On some Sobolev and Pólya-Szegő type
inequalities with weights and applications. arXiv preprint arXiv:2412.15490.
[44] Ruiz, P. A., and Fragkiadaki, V. (2024). Fractional Sobolev embeddings and algebra property:
A dyadic view. arXiv preprint arXiv:2412.12051.
[45] Bilalov, B., Mamedov, E., Sezer, Y., and Nasibova, N. (2025). Compactness in Banach function
spaces: Poincaré and Friedrichs inequalities. Rendiconti del Circolo Matematico di Palermo
Series 2, 74(1), 68.
[46] Cheng, M., and Shao, K. (2025). Ground states of the inhomogeneous nonlinear fractional
Schrödinger-Poisson equations. Complex Variables and Elliptic Equations, 1-17.
[47] Wei, J., and Zhang, L. (2025). Ground State Solutions of Nehari-Pohozaev Type for
Schrödinger-Poisson Equation with Zero-Mass and Weighted Hardy Sobolev Subcritical Ex-
ponent. The Journal of Geometric Analysis, 35(2), 48.
[48] Zhang, X., and Qi, W. (2025). Multiplicity result on a class of nonhomogeneous quasilinear
elliptic system with small perturbations in RN . arXiv preprint arXiv:2501.01602.
[49] Xiao, J., and Yue, C. (2025). A Trace Principle for Fractional Laplacian with an Application
to Image Processing. La Matematica, 1-26.
[50] Pesce, A., and Portaro, S. (2025). Fractional Sobolev spaces related to an ultraparabolic op-
erator. arXiv preprint arXiv:2501.05898.
[52] Chen, H., Chen, H. G., and Li, J. N. (2024). Sharp embedding results and geometric inequalities for Hörmander vector fields. arXiv preprint arXiv:2404.19393.
[54] Brezis, H., and Brézis, H. (2011). Functional analysis, Sobolev spaces and partial differential
equations (Vol. 2, No. 3, p. 5). New York: Springer.
[55] Evans, L. C. (2022). Partial differential equations (Vol. 19). American Mathematical Society.
[56] Maz'ya, V. G. (2011). Sobolev Spaces: With Applications to Elliptic Partial Differential Equations. Springer.
[57] Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are
universal approximators. Neural networks, 2(5), 359-366.
[59] Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal func-
tion. IEEE Transactions on Information theory, 39(3), 930-945.
[60] Pinkus, A. (1999). Approximation theory of the MLP model in neural networks. Acta numerica,
8, 143-195.
[61] Lu, Z., Pu, H., Wang, F., Hu, Z., and Wang, L. (2017). The expressive power of neural
networks: A view from the width. Advances in neural information processing systems, 30.
[62] Hanin, B., and Sellke, M. (2017). Approximating continuous functions by relu nets of minimal
width. arXiv preprint arXiv:1710.11278.
[63] García-Cervera, C. J., Kessler, M., Pedregal, P., and Periago, F. Universal approximation of
set-valued maps and DeepONet approximation of the controllability map.
[64] Majee, S., Abhishek, A., Strauss, T., and Khan, T. (2024). MCMC-Net: Accelerating
Markov Chain Monte Carlo with Neural Networks for Inverse Problems. arXiv preprint
arXiv:2412.16883.
[65] Toscano, J. D., Wang, L. L., and Karniadakis, G. E. (2024). KKANs: Kurkova-Kolmogorov-
Arnold Networks and Their Learning Dynamics. arXiv preprint arXiv:2412.16738.
[67] Rudin, W. (1964). Principles of mathematical analysis (Vol. 3). New York: McGraw-hill.
[68] Stein, E. M., and Shakarchi, R. (2009). Real analysis: measure theory, integration, and Hilbert
spaces. Princeton University Press.
[71] Folland, G. B. (1999). Real analysis: modern techniques and their applications (Vol. 40). John
Wiley and Sons.
[72] Sugiura, S. (2024). On the Universality of Reservoir Computing for Uniform Approximation.
[73] Liu, Y., Liu, S., Huang, Z., and Zhou, P. Normed modules and the categorification of integrations, series expansions, and differentiations.
[75] Chang, S. Y., and Wei, Y. (2024). Generalized Choi–Davis–Jensen’s Operator Inequalities and
Their Applications. Symmetry, 16(9), 1176.
[76] Caballer, M., Dantas, S., and Rodríguez-Vidanes, D. L. (2024). Searching for linear structures
in the failure of the Stone-Weierstrass theorem. arXiv preprint arXiv:2405.06453.
[77] Chen, D. (2024). The Machado–Bishop theorem in the uniform topology. Journal of Approxi-
mation Theory, 304, 106085.
[78] Rafiei, H., and Akbarzadeh-T, M. R. (2024). Hedge-embedded Linguistic Fuzzy Neural Net-
works for Systems Identification and Control. IEEE Transactions on Artificial Intelligence.
[81] Lorentz, G. G. (1966). Approximation of functions, athena series. Selected Topics in Mathe-
matics.
[82] Guilhoto, L. F., and Perdikaris, P. (2024). Deep learning alternatives of the Kolmogorov su-
perposition theorem. arXiv preprint arXiv:2410.01990.
[83] Alhafiz, M. R., Zakaria, K., Dung, D. V., Palar, P. S., Dwianto, Y. B., and Zuhal, L. R. (2025).
Kolmogorov-Arnold Networks for Data-Driven Turbulence Modeling. In AIAA SCITECH 2025
Forum (p. 2047).
[84] Lorencin, I., Mrzljak, V., Poljak, I., and Etinger, D. (2024, September). Prediction of CODLAG
Propulsion System Parameters Using Kolmogorov-Arnold Network. In 2024 IEEE 22nd Jubilee
International Symposium on Intelligent Systems and Informatics (SISY) (pp. 173-178). IEEE.
[85] Trevisan, D., Cassara, P., Agazzi, A., and Scardera, S. NTK Analysis of Knowledge Distilla-
tion.
[86] Bonfanti, A., Bruno, G., and Cipriani, C. (2024). The Challenges of the Nonlinear Regime for
Physics-Informed Neural Networks. arXiv preprint arXiv:2402.03864.
[87] Jacot, A., Gabriel, F., and Hongler, C. (2018). Neural tangent kernel: Convergence and gen-
eralization in neural networks. Advances in neural information processing systems, 31.
[88] Lee, J., Xiao, L., Schoenholz, S., Bahri, Y., Novak, R., Sohl-Dickstein, J., and Pennington,
J. (2019). Wide neural networks of any depth evolve as linear models under gradient descent.
Advances in neural information processing systems, 32.
[89] Yang, G., and Hu, E. J. (2020). Feature learning in infinite-width neural networks. arXiv
preprint arXiv:2011.14522.
[90] Xiang, L., Dudziak, L., Abdelfattah, M. S., Chau, T., Lane, N. D., and Wen, H. (2021). Zero-
Cost Operation Scoring in Differentiable Architecture Search. arXiv preprint arXiv:2106.06799.
[91] Lee, J., Xiao, L., Schoenholz, S., Bahri, Y., Novak, R., Sohl-Dickstein, J., and Pennington,
J. (2019). Wide neural networks of any depth evolve as linear models under gradient descent.
Advances in neural information processing systems, 32.
[92] McAllester, D. A. (1999, July). PAC-Bayesian model averaging. In Proceedings of the twelfth
annual conference on Computational learning theory (pp. 164-170).
[93] Catoni, O. (2007). PAC-Bayesian supervised classification: the thermodynamics of statistical
learning. arXiv preprint arXiv:0712.0248.
[94] Germain, P., Lacasse, A., Laviolette, F., and Marchand, M. (2009, June). PAC-Bayesian
learning of linear classifiers. In Proceedings of the 26th Annual International Conference on
Machine Learning (pp. 353-360).
[95] Seeger, M. (2002). PAC-Bayesian generalisation error bounds for Gaussian process classifica-
tion. Journal of machine learning research, 3(Oct), 233-269.
[96] Alquier, P., Ridgway, J., and Chopin, N. (2016). On the properties of variational approxima-
tions of Gibbs posteriors. Journal of Machine Learning Research, 17(236), 1-41.
[97] Dziugaite, G. K., and Roy, D. M. (2017). Computing nonvacuous generalization bounds for
deep (stochastic) neural networks with many more parameters than training data. arXiv
preprint arXiv:1703.11008.
[98] Rivasplata, O., Kuzborskij, I., Szepesvári, C., and Shawe-Taylor, J. (2020). PAC-Bayes analysis
beyond the usual bounds. Advances in Neural Information Processing Systems, 33, 16833-
16845.
[99] Lever, G., Laviolette, F., and Shawe-Taylor, J. (2013). Tighter PAC-Bayes bounds through
distribution-dependent priors. Theoretical Computer Science, 473, 4-28.
[100] Rivasplata, O., Parrado-Hernández, E., Shawe-Taylor, J. S., Sun, S., and Szepesvári, C.
(2018). PAC-Bayes bounds for stable algorithms with instance-dependent priors. Advances in
Neural Information Processing Systems, 31.
[101] Lindemann, L., Zhao, Y., Yu, X., Pappas, G. J., and Deshmukh, J. V. (2024). Formal
verification and control with conformal prediction. arXiv preprint arXiv:2409.00536.
[102] Jin, G., Wu, S., Liu, J., Huang, T., and Mu, R. (2025). Enhancing Robust Fairness via
Confusional Spectral Regularization. arXiv preprint arXiv:2501.13273.
[103] Ye, F., Xiao, J., Ma, W., Jin, S., and Yang, Y. (2025). Detecting small clusters in the
stochastic block model. Statistical Papers, 66(2), 37.
[104] Bhattacharjee, A., and Bharadwaj, P. (2025). Coherent Spectral Feature Extraction Using
Symmetric Autoencoders. IEEE Journal of Selected Topics in Applied Earth Observations and
Remote Sensing.
[105] Wu, Q., Hu, B., Liu, C. et al. (2025). Velocity Analysis Using High-resolution Hyperbolic
Radon Transform with Lq1 − Lq2 Regularization. Pure Appl. Geophys.
[106] Ortega, I., Hannigan, J. W., Baier, B. C., McKain, K., and Smale, D. (2025). Advancing CH4 and N2O retrieval strategies for NDACC/IRWG high-resolution direct-sun FTIR observations. EGUsphere, 2025, 1-32.
[107] Kazmi, S. H. A., Hassan, R., Qamar, F., Nisar, K., and Al-Betar, M. A. (2025). Federated
Conditional Variational Auto Encoders for Cyber Threat Intelligence: Tackling Non-IID Data
in SDN Environments. IEEE Access.
[108] Zhao, Y., Bi, Z., Zhu, P., Yuan, A., and Li, X. (2025). Deep Spectral Clustering with Projected
Adaptive Feature Selection. IEEE Transactions on Geoscience and Remote Sensing.
[109] Saranya, S., and Menaka, R. (2025). A Quantum-Based Machine Learning Approach for
Autism Detection using Common Spatial Patterns of EEG Signals. IEEE Access.
[110] Dhalbisoi, S., Mohapatra, A., and Rout, A. (2024, March). Design of Cell-Free Massive
MIMO for Beyond 5G Systems with MMSE and RZF Processing. In International Conference
on Machine Learning, IoT and Big Data (pp. 263-273). Singapore: Springer Nature Singapore.
[111] Wei, C., Li, Z., Hu, T., Zhao, M., Sun, Z., Jia, K., ... and Jiang, S. (2025). Model-based
convolution neural network for 3D Near-infrared spectral tomography. IEEE Transactions on
Medical Imaging.
[113] Haykin, S. (2009). Neural networks and learning machines, 3/E. Pearson Education India.
[115] Bishop, C. M., and Nasrabadi, N. M. (2006). Pattern recognition and machine learning (Vol.
4, No. 4, p. 738). New York: springer.
[116] Poggio, T., and Smale, S. (2003). The mathematics of learning: Dealing with data. Notices
of the AMS, 50(5), 537-544.
[117] LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. nature, 521(7553), 436-444.
[118] Tishby, N., and Zaslavsky, N. (2015, April). Deep learning and the information bottleneck
principle. In 2015 ieee information theory workshop (itw) (pp. 1-5). IEEE.
[119] Sorrenson, P. (2025). Free-Form Flows: Generative Models for Scientific Applications (Doc-
toral dissertation).
[120] Liu, W., and Shi, X. (2025). An Enhanced Neural Network Forecasting System for the July
Precipitation over the Middle-Lower Reaches of the Yangtze River.
[121] Das, P., Mondal, D., Islam, M. A., Al Mohotadi, M. A., and Roy, P. C. (2025). Analyti-
cal Finite-Integral-Transform and Gradient-Enhanced Machine Learning Approach for Ther-
moelastic Analysis of FGM Spherical Structures with Arbitrary Properties. Theoretical and
Applied Mechanics Letters, 100576.
[122] Zhang, R. (2025). Physics-informed Parallel Neural Networks for the Identification of Con-
tinuous Structural Systems.
[123] Ali, S., and Hussain, A. (2025). A neuro-intelligent heuristic approach for performance pre-
diction of triangular fuzzy flow system. Proceedings of the Institution of Mechanical Engineers,
Part N: Journal of Nanomaterials, Nanoengineering and Nanosystems, 23977914241310569.
[124] Li, S. (2025). Scalable, generalizable, and offline methods for imperfect-information extensive-
form games.
[125] Hu, T., Jin, B., and Wang, F. (2025). An Iterative Deep Ritz Method for Monotone Elliptic
Problems. Journal of Computational Physics, 113791.
[126] Chen, P., Zhang, A., Zhang, S., Dong, T., Zeng, X., Chen, S., ... and Zhou, Q. (2025).
Maritime near-miss prediction framework and model interpretation analysis method based on
Transformer neural network model with multi-task classification variables. Reliability Engi-
neering and System Safety, 110845.
[127] Sun, G., Liu, Z., Gan, L., Su, H., Li, T., Zhao, W., and Sun, B. (2025). SpikeNAS-Bench:
Benchmarking NAS Algorithms for Spiking Neural Network Architecture. IEEE Transactions
on Artificial Intelligence.
[128] Zhang, Z., Wang, X., Shen, J., Zhang, M., Yang, S., Zhao, W., ... and Wang, J. (2025).
Unfixed Bias Iterator: A New Iterative Format. IEEE Access.
[129] Rosa, G. J. (2010). The Elements of Statistical Learning: Data Mining, Inference, and Pre-
diction by Hastie, T., Tibshirani, R., and Friedman, J.
[131] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014).
Dropout: a simple way to prevent neural networks from overfitting. The journal of machine
learning research, 15(1), 1929-1958.
[132] Zou, H., and Hastie, T. (2005). Regularization and variable selection via the elastic net.
Journal of the Royal Statistical Society Series B: Statistical Methodology, 67(2), 301-320.
[133] Vapnik, V. (2013). The nature of statistical learning theory. Springer Science & Business
Media.
[134] Ng, A. Y. (2004, July). Feature selection, L1 vs. L2 regularization, and rotational invariance.
In Proceedings of the twenty-first international conference on Machine learning (p. 78).
[135] Li, T. (2025). Optimization of Clinical Trial Strategies for Anti-HER2 Drugs Based on
Bayesian Optimization and Deep Learning.
[136] Yasuda, M., and Sekimoto, K. (2024). Gaussian-discrete restricted Boltzmann machine with
sparse-regularized hidden layer. Behaviormetrika, 1-19.
[137] Luo, X., Cruz, W. C., Zhang, X.-L., and Xiao, H. (2023). Hyper-parameter optimization
for improving the performance of localization in an iterative ensemble smoother. Geoenergy
Science and Engineering, 231(Part B), 212404.
[138] Alrayes, F.S., Maray, M., Alshuhail, A. et al. (2025) Privacy-preserving approach for IoT
networks using statistical learning with optimization algorithm on high-dimensional big data
environment. Sci Rep 15, 3338. https://doi.org/10.1038/s41598-025-87454-1
[139] Cho, H., Kim, Y., Lee, E., Choi, D., Lee, Y., and Rhee, W. (2020). Basic enhancement
strategies when using Bayesian optimization for hyperparameter tuning of deep neural net-
works. IEEE Access, 8, 52588-52608.
[141] Abdel-salam, M., Elhoseny, M. and El-hasnony, I.M. Intelligent and Secure Evolved Frame-
work for Vaccine Supply Chain Management Using Machine Learning and Blockchain. SN
COMPUT. SCI. 6, 121 (2025). https://doi.org/10.1007/s42979-024-03609-3
[142] Vali, M. H. (2025). Vector quantization in deep neural networks for speech and image pro-
cessing.
[144] Razavi-Termeh, S. V., Sadeghi-Niaraki, A., Ali, F., and Choi, S. M. (2025). Improving flood-
prone areas mapping using geospatial artificial intelligence (GeoAI): A non-parametric algo-
rithm enhanced by math-based metaheuristic algorithms. Journal of Environmental Manage-
ment, 375, 124238.
[145] Kiran, M., and Ozyildirim, M. (2022). Hyperparameter tuning for deep reinforcement learning
applications. arXiv preprint arXiv:2201.11182.
[146] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep
convolutional neural networks. Advances in neural information processing systems, 25.
[147] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2017). ImageNet classification with deep
convolutional neural networks. Communications of the ACM, 60(6), 84-90.
[148] Simonyan, K., and Zisserman, A. (2014). Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556.
[149] He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition.
In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-
778).
[150] Cohen, T., and Welling, M. (2016, June). Group equivariant convolutional networks. In In-
ternational conference on machine learning (pp. 2990-2999). PMLR.
[151] Zeiler, M. D., and Fergus, R. (2014). Visualizing and understanding convolutional networks.
In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September
6-12, 2014, Proceedings, Part I 13 (pp. 818-833). Springer International Publishing.
[152] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., ... and Guo, B. (2021). Swin transformer:
Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF in-
ternational conference on computer vision (pp. 10012-10022).
[154] Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by
back-propagating errors. Nature, 323(6088), 533-536.
[155] Bensaid, B., Poëtte, G., and Turpault, R. (2024). Convergence of the Iterates for Mo-
mentum and RMSProp for Local Smooth Functions: Adaptation is the Key. arXiv preprint
arXiv:2407.15471.
[156] Liu, Q., and Ma, W. (2024). The Epochal Sawtooth Effect: Unveiling Training Loss Oscilla-
tions in Adam and Other Optimizers. arXiv preprint arXiv:2410.10056.
[157] Li, H. (2024). Smoothness and Adaptivity in Nonlinear Optimization for Machine Learning
Applications (Doctoral dissertation, Massachusetts Institute of Technology).
[158] Heredia, C. (2024). Modeling AdaGrad, RMSProp, and Adam with Integro-Differential Equa-
tions. arXiv preprint arXiv:2411.09734.
[159] Ye, Q. (2024). Preconditioning for Accelerated Gradient Descent Optimization and Regular-
ization. arXiv preprint arXiv:2410.00232.
[160] Compagnoni, E. M., Liu, T., Islamov, R., Proske, F. N., Orvieto, A., and Lucchi, A. (2024).
Adaptive Methods through the Lens of SDEs: Theoretical Insights on the Role of Noise. arXiv
preprint arXiv:2411.15958.
[161] Yao, B., Zhang, Q., Feng, R., and Wang, X. (2024). System response curve based first-order
optimization algorithms for cyber-physical-social intelligence. Concurrency and Computation:
Practice and Experience, 36(21), e8197.
[162] Wen, X., and Lei, Y. (2024, June). A Fast ADMM Framework for Training Deep Neural
Networks Without Gradients. In 2024 International Joint Conference on Neural Networks
(IJCNN) (pp. 1-8). IEEE.
[163] Hannibal, S., Jentzen, A., and Thang, D. M. (2024). Non-convergence to global minimizers
in data driven supervised deep learning: Adam and stochastic gradient descent optimization
provably fail to converge to global minimizers in the training of deep neural networks with
ReLU activation. arXiv preprint arXiv:2410.10533.
[164] Yang, Z. (2025). Adaptive Biased Stochastic Optimization. IEEE Transactions on Pattern
Analysis and Machine Intelligence.
[165] Kingma, D. P., and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv
preprint arXiv:1412.6980.
[166] Reddi, S. J., Kale, S., and Kumar, S. (2019). On the convergence of adam and beyond. arXiv
preprint arXiv:1904.09237.
[167] Jin, L., Nong, H., Chen, L., and Su, Z. (2024). A Method for Enhancing Generalization of
Adam by Multiple Integrations. arXiv preprint arXiv:2412.12473.
[168] Adly, A. M. (2024). EXAdam: The Power of Adaptive Cross-Moments. arXiv preprint
arXiv:2412.20302.
[169] Liu, Y., Cao, Y., and Lin, J. Convergence Analysis of the ADAM Algorithm for Linear Inverse
Problems.
[170] Yang, Z. (2025). Adaptive Biased Stochastic Optimization. IEEE Transactions on Pattern
Analysis and Machine Intelligence.
[171] Park, K., and Lee, S. (2024). SMMF: Square-Matricized Momentum Factorization for
Memory-Efficient Optimization. arXiv preprint arXiv:2412.08894.
[172] Mahjoubi, M. A., Lamrani, D., Saleh, S., Moutaouakil, W., Ouhmida, A., Hamida, S., ... and
Raihani, A. (2025). Optimizing ResNet50 Performance Using Stochastic Gradient Descent on
MRI Images for Alzheimer’s Disease Classification. Intelligence-Based Medicine, 100219.
[173] Seini, A. B., and Adam, I. O. (2024). Human-AI Collaboration for Adaptive Working and
Learning Outcomes: An Activity Theory Perspective.
[175] Lauand, C. K., and Meyn, S. (2025). Markovian Foundations for Quasi-Stochastic Approxi-
mation. SIAM Journal on Control and Optimization, 63(1), 402-430.
[176] Maranjyan, A., Tyurin, A., and Richtárik, P. (2025). Ringmaster ASGD: The First Asyn-
chronous SGD with Optimal Time Complexity. arXiv preprint arXiv:2501.16168.
[177] Gao, Z., and Gündüz, D. (2025). Graph Neural Networks over the Air for Decentralized Tasks
in Wireless Networks. IEEE Transactions on Signal Processing.
[178] Yoon, T., Choudhury, S., and Loizou, N. (2025). Multiplayer Federated Learning: Reaching
Equilibrium with Less Communication. arXiv preprint arXiv:2501.08263.
[179] Verma, K., and Maiti, A. (2025). Sine and cosine based learning rate for gradient descent
method. Applied Intelligence, 55(5), 352.
[180] Borowski, M., and Miasojedow, B. (2025). Convergence of projected stochastic approximation
algorithm. arXiv e-prints, arXiv-2501.
[181] Dong, K., Chen, S., Dan, Y., Zhang, L., Li, X., Liang, W., ... and Sun, Y. (2025). A new
perspective on brain stimulation interventions: Optimal stochastic tracking control of brain
network dynamics. arXiv preprint arXiv:2501.08567.
[182] Jiang, Y., Kang, H., Liu, J., and Xu, D. (2025). On the Convergence of Decentralized Stochas-
tic Gradient Descent with Biased Gradients. IEEE Transactions on Signal Processing.
[183] Sonobe, N., Momozaki, T., and Nakagawa, T. (2025). Sampling from Density power
divergence-based Generalized posterior distribution via Stochastic optimization. arXiv preprint
arXiv:2501.07790.
[184] Zhang, X., and Jia, G. (2025). Convergence of Policy Gradient for Stochastic Linear Quadratic
Optimal Control Problems in Infinite Horizon. Journal of Mathematical Analysis and Appli-
cations, 129264.
[185] Thiriveedhi, A., Ghanta, S., Biswas, S., and Pradhan, A. K. (2025). ALL-Net: integrating
CNN and explainable-AI for enhanced diagnosis and interpretation of acute lymphoblastic
leukemia. PeerJ Computer Science, 11, e2600.
[186] Ramos-Briceño, D. A., Flammia-D’Aleo, A., Fernández-López, G., Carrión-Nessi, F. S., and
Forero-Peña, D. A. (2025). Deep learning-based malaria parasite detection: convolutional
neural networks model for accurate species identification of Plasmodium falciparum and Plas-
modium vivax. Scientific Reports, 15(1), 3746.
[187] Espino-Salinas, C. H., Luna-García, H., Cepeda-Argüelles, A., Trejo-Vázquez, K., Flores-
Chaires, L. A., Mercado Reyna, J., ... and Villalba-Condori, K. O. (2025). Convolutional
Neural Network for Depression and Schizophrenia Detection. Diagnostics, 15(3), 319.
[188] Ran, T., Huang, W., Qin, X., Xie, X., Deng, Y., Pan, Y., ... and Zou, D. (2025). Liquid-
based cytological diagnosis of pancreatic neuroendocrine tumors using hyperspectral imaging
and deep learning. EngMedicine, 2(1), 100059.
[189] Araujo, B. V. S., Rodrigues, G. A., de Oliveira, J. H. P., Xavier, G. V. R., Lebre, U., Cordeiro,
C., ... and Ferreira, T. V. (2025). Monitoring ZnO surge arresters using convolutional neural
networks and image processing techniques combined with signal alignment. Measurement,
116889.
[190] Sari, I. P., Elvitaria, L., and Rudiansyah, R. (2025). Data-driven approach for batik pattern
classification using convolutional neural network (CNN). Jurnal Mandiri IT, 13(3), 323-331.
[191] Wang, D., An, K., Mo, Y., Zhang, H., Guo, W., and Wang, B. Cf-Wiad: Consistency Fusion
with Weighted Instance and Adaptive Distribution for Enhanced Semi-Supervised Skin Lesion
Classification. Available at SSRN 5109182.
[192] Cai, P., Zhang, Y., He, H., Lei, Z., and Gao, S. (2025). DFNet: A Differential Feature-
Incorporated Residual Network for Image Recognition. Journal of Bionic Engineering, 1-14.
[193] Vishwakarma, A. K., and Deshmukh, M. (2025). CNNM-FDI: Novel Convolutional Neural
Network Model for Fire Detection in Images. IETE Journal of Research, 1-14.
[194] Ranjan, P., Kaushal, A., Girdhar, A., and Kumar, R. (2025). Revolutionizing hyperspec-
tral image classification for limited labeled data: unifying autoencoder-enhanced GANs with
convolutional neural networks and zero-shot learning. Earth Science Informatics, 18(2), 1-26.
[195] Naseer, A., and Jalal, A. Multimodal Deep Learning Framework for Enhanced Semantic
Scene Classification Using RGB-D Images.
[196] Wang, Z., and Wang, J. (2025). Personalized Icon Design Model Based on Improved Faster-
RCNN. Systems and Soft Computing, 200193.
[197] Ramana, R., Vasudevan, V., and Murugan, B. S. (2025). Spectral Pyramid Pooling and
Fused Keypoint Generation in ResNet-50 for Robust 3D Object Detection. IETE Journal of
Research, 1-13.
[198] Shin, S., Land, O., Seider, W., Lee, J., and Lee, D. (2025). Artificial Intelligence-Empowered
Automated Double Emulsion Droplet Library Generation.
[199] Taca, B. S., Lau, D., and Rieder, R. (2025). A comparative study between deep learning
approaches for aphid classification. IEEE Latin America Transactions, 23(3), 198-204.
[200] Ulaş, B., Szklenár, T., and Szabó, R. (2025). Detection of Oscillation-like Patterns in Eclipsing
Binary Light Curves using Neural Network-based Object Detection Algorithms. arXiv preprint
arXiv:2501.17538.
[201] Valensi, D., Lupu, L., Adam, D., and Topilsky, Y. Semi-Supervised Learning, Foundation
Models and Image Processing for Pleural Line Detection and Segmentation in Lung Ultra-
sound.
[202] V, A., V, P., and Kumar, D. (2025). An effective object detection via BS2ResNet and LTK-
Bi-LSTM. Multimedia Tools and Applications. https://doi.org/10.1007/s11042-024-20433-2
[203] Zhu, X., Chen, W., and Jiang, Q. (2025). High-transferability black-box attack of binary
image segmentation via adversarial example augmentation. Displays, 102957.
[204] Guo, X., Zhu, Y., Li, S., Wu, S., and Liu, S. (2025). Research and Implementation of Agro-
nomic Entity and Attribute Extraction Based on Target Localization. Agronomy, 15(2), 354.
[205] Yousif, M., Jassam, N. M., Salim, A., Bardan, H. A., Mutlak, A. F., Sallibi, A. D., and
Ataalla, A. F. Melanoma Skin Cancer Detection Using Deep Learning Methods and Binary
GWO Algorithm.
[206] Rahman, S. I. U., Abbas, N., Ali, S., Salman, M., Alkhayat, A., Khan, J., ... and Gu,
Y. H. (2025). Deep Learning and Artificial Intelligence-Driven Advanced Methods for Acute
Lymphoblastic Leukemia Identification and Classification: A Systematic Review. Comput
Model Eng Sci, 142(2).
[207] Pratap Joshi, K., Gowda, V. B., Bidare Divakarachari, P., Siddappa Parameshwarappa, P.,
and Patra, R. K. (2025). VSA-GCNN: Attention Guided Graph Neural Networks for Brain
Tumor Segmentation and Classification. Big Data and Cognitive Computing, 9(2), 29.
[208] Ng, B., Eyre, K., and Chetrit, M. (2025). Prediction of ischemic cardiomyopathy using a
deep neural network with non-contrast cine cardiac magnetic resonance images. Journal of
Cardiovascular Magnetic Resonance, 27.
[209] Nguyen, H. T., Lam, T. B., Truong, T. T. N., Duong, T. D., and Dinh, V. Q. Mv-Trams:
An Efficient Tumor Region-Adapted Mammography Synthesis Under Multi-View Diagnosis.
Available at SSRN 5109180.
[210] Chen, W., Xu, T., and Zhou, W. (2025). Task-based Regularization in Penalized Least-
Squares for Binary Signal Detection Tasks in Medical Image Denoising. arXiv preprint
arXiv:2501.18418.
[211] Pradhan, P. D., Talmale, G., and Wazalwar, S. Deep dive into precision (DDiP): Unleashing
advanced deep learning approaches in diabetic retinopathy research for enhanced detection
and classification of retinal abnormalities. In Recent Advances in Sciences, Engineering, Infor-
mation Technology and Management (pp. 518-530). CRC Press.
[212] Örenç, S., Acar, E., Özerdem, M. S., Şahin, S., and Kaya, A. (2025). Automatic Identifica-
tion of Adenoid Hypertrophy via Ensemble Deep Learning Models Employing X-ray Adenoid
Images. Journal of Imaging Informatics in Medicine, 1-15.
[213] Jiang, M., Wang, S., Chan, K. H., Sun, Y., Xu, Y., Zhang, Z., ... and Tan, T. (2025).
Multimodal Cross Global Learnable Attention Network for MR images denoising with arbitrary
modal missing. Computerized Medical Imaging and Graphics, 102497.
[214] Al-Haidri, W., Levchuk, A., Zotov, N., Belousova, K., Ryzhkov, A., Fokin, V., ... and Brui,
E. (2025). Quantitative analysis of myocardial fibrosis using a deep learning-based framework
applied to the 17-Segment model. Biomedical Signal Processing and Control, 105, 107555.
[215] Osorio, S. L. J., Ruiz, M. A. R., Mendez-Vazquez, A., and Rodriguez-Tello, E. (2024). Fourier
Series Guided Design of Quantum Convolutional Neural Networks for Enhanced Time Series
Forecasting. arXiv preprint arXiv:2404.15377.
[216] Umeano, C., and Kyriienko, O. (2024). Ground state-based quantum feature maps. arXiv
preprint arXiv:2404.07174.
[217] Liu, N., He, X., Laurent, T., Di Giovanni, F., Bronstein, M. M., and Bresson, X. (2024).
Advancing Graph Convolutional Networks via General Spectral Wavelets. arXiv preprint
arXiv:2405.13806.
[218] Vlasic, A. (2024). Quantum Circuits, Feature Maps, and Expanded Pseudo-Entropy: A Cat-
egorical Theoretic Analysis of Encoding Real-World Data into a Quantum Computer. arXiv
preprint arXiv:2410.22084.
[219] Kim, M., Hioka, Y., and Witbrock, M. (2024). Neural Fourier Modelling: A Highly Compact
Approach to Time-Series Analysis. arXiv preprint arXiv:2410.04703.
[220] Xie, Y., Daigavane, A., Kotak, M., and Smidt, T. (2024). The price of freedom: Exploring
tradeoffs between expressivity and computational efficiency in equivariant tensor products.
In ICML 2024 Workshop on Geometry-grounded Representation Learning and Generative
Modeling.
[221] Liu, G., Wei, Z., Zhang, H., Wang, R., Yuan, A., Liu, C., ... and Cao, G. (2024, April).
Extending Implicit Neural Representations for Text-to-Image Generation. In ICASSP 2024-
2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
(pp. 3650-3654). IEEE.
[222] Zhang, M. (2024). Lock-in spectrum: a tool for representing long-term evolution of bearing
fault in the time–frequency domain using vibration signal. Sensor Review, 44(5), 598-610.
[223] Hamed, M., and Lachiri, Z. (2024, July). Expressivity Transfer In Transformer-Based Text-
To-Speech Synthesis. In 2024 IEEE 7th International Conference on Advanced Technologies,
Signal and Image Processing (ATSIP) (Vol. 1, pp. 443-448). IEEE.
[224] Lehmann, F., Gatti, F., Bertin, M., Grenié, D., and Clouteau, D. (2024). Uncertainty prop-
agation from crustal geologies to rock-site ground motion with a Fourier Neural Operator.
European Journal of Environmental and Civil Engineering, 28(13), 3088-3105.
[226] Manning, C., and Schütze, H. (1999). Foundations of statistical natural language processing.
MIT Press.
[227] Liu, Y., and Zhang, M. (2018). Neural network methods for natural language processing.
[228] Allen, J. (1988). Natural language understanding. Benjamin-Cummings Publishing Co., Inc.
[229] Li, Z., Zhao, Y., Zhang, X., Han, H., and Huang, C. (2025). Word embedding factor based
multi-head attention. Artificial Intelligence Review, 58(4), 1-21.
[230] Hempelmann, C. F., Rayz, J., Dong, T., and Miller, T. (2025, January). Proceedings of
the 1st Workshop on Computational Humor (CHum). In Proceedings of the 1st Workshop on
Computational Humor (CHum).
[232] Eisenstein, J. (2019). Introduction to natural language processing. The MIT Press.
[233] Otter, D. W., Medina, J. R., and Kalita, J. K. (2020). A survey of the usages of deep learning
for natural language processing. IEEE transactions on neural networks and learning systems,
32(2), 604-624.
[234] Mitkov, R. (Ed.). (2022). The Oxford handbook of computational linguistics. Oxford
University Press.
[235] Liu, X., Tao, Z., Jiang, T., Chang, H., Ma, Y., and Huang, X. (2024). ToDA: Target-oriented
Diffusion Attacker against Recommendation System. arXiv preprint arXiv:2401.12578.
[236] Çekik, R. (2025). Effective Text Classification Through Supervised Rough Set-Based Term
Weighting. Symmetry, 17(1), 90.
[237] Zhu, H., Xia, J., Liu, R., and Deng, B. (2025). SPIRIT: Structural Entropy Guided Prefix
Tuning for Hierarchical Text Classification. Entropy, 27(2), 128.
[238] Matrane, Y., Benabbou, F., and Ellaky, Z. (2024). Enhancing Moroccan Dialect Sentiment
Analysis through Optimized Preprocessing and transfer learning Techniques. IEEE Access.
[239] Moqbel, M., and Jain, A. (2025). Mining the truth: A text mining approach to understand-
ing perceived deceptive counterfeits and online ratings. Journal of Retailing and Consumer
Services, 84, 104149.
[240] Kumar, V., Iqbal, M. I., and Rathore, R. (2025). Natural Language Processing (NLP) in
Disease Detection—A Discussion of How NLP Techniques Can Be Used to Analyze and Clas-
sify Medical Text Data for Disease Diagnosis. AI in Disease Detection: Advancements and
Applications, 53-75.
[241] Yin, S. (2024). The Current State and Challenges of Aspect-Based Sentiment Analysis. Ap-
plied and Computational Engineering, 114, 25-31.
[242] Raghavan, M. (2024). Are you who AI says you are? Exploring the role of Natural Language
Processing algorithms for “predicting” personality traits from text (Doctoral dissertation, Uni-
versity of South Florida).
[243] Semeraro, A., Vilella, S., Improta, R., De Duro, E. S., Mohammad, S. M., Ruffo, G., and
Stella, M. (2025). EmoAtlas: An emotional network analyzer of texts that merges psychological
lexicons, artificial intelligence, and network science. Behavior Research Methods, 57(2), 77.
[244] Cai, F., and Liu, X. Data Analytics for Discourse Analysis with Python: The Case of Therapy
Talk, by Dennis Tay. New York: Routledge, 2024. ISBN: 9781032419015 (HB: USD 41.24),
xiii + 182 pages. Natural Language Processing, 1-4.
[245] Wu, Y., et al. (2016). Google's neural machine translation system: Bridging the gap between
human and machine translation. arXiv preprint arXiv:1609.08144.
[246] Hettiarachchi, H., Ranasinghe, T., Rayson, P., Mitkov, R., Gaber, M., Premasiri, D., ... and
Uyangodage, L. (2024). Overview of the First Workshop on Language Models for Low-Resource
Languages (LoResLM 2025). arXiv preprint arXiv:2412.16365.
[247] Das, B. R., and Sahoo, R. (2024). Word Alignment in Statistical Machine Translation: Issues
and Challenges. Nov Joun of Appl Sci Res, 1 (6), 01-03.
[249] Uçkan, T., and Kurt, E. Word Embeddings in NLP. Pioneer and Innovative Studies in
Computer Sciences and Engineering, 58.
[250] Pastor, G. C., Monti, J., Mitkov, R., and Hidalgo-Ternero, C. M. (2024). Recent Advances
in Multiword Units in Machine Translation and Translation Technology.
[253] Yang, M. (2025). Adaptive Recognition of English Translation Errors Based on Improved
Machine Learning Methods. International Journal of High Speed Electronics and Systems,
2540236.
[254] Linnemann, G. A., and Reimann, L. E. (2024). Artificial Intelligence as a New Field of
Activity for Applied Social Psychology–A Reasoning for Broadening the Scope.
[255] Merkel, S., and Schorr, S. OPP: Application Fields and Innovative Technologies.
[256] Kushwaha, N. S., and Singh, P. (2022). Artificial Intelligence based Chatbot: A Case Study.
Journal of Management and Service Science (JMSS), 2(1), 1-13.
[257] Macedo, P., Madeira, R. N., Santos, P. A., Mota, P., Alves, B., and Pereira, C. M. (2024). A
Conversational Agent for Empowering People with Parkinson’s Disease in Exercising Through
Motivation and Support. Applied Sciences, 15(1), 223.
[258] Gupta, R., Nair, K., Mishra, M., Ibrahim, B., and Bhardwaj, S. (2024). Adoption and impacts
of generative artificial intelligence: Theoretical underpinnings and research agenda. Interna-
tional Journal of Information Management Data Insights, 4(1), 100232.
[259] Foroughi, B., Iranmanesh, M., Yadegaridehkordi, E., Wen, J., Ghobakhloo, M., Senali, M.
G., and Annamalai, N. (2025). Factors Affecting the Use of ChatGPT for Obtaining Shopping
Information. International Journal of Consumer Studies, 49(1), e70008.
[261] Pavlović, N., and Savić, M. (2024). The Impact of the ChatGPT Platform on Consumer
Experience in Digital Marketing and User Satisfaction. Theoretical and Practical Research in
Economic Fields, 15(3), 636-646.
[262] Mannava, V., Mitrevski, A., and Plöger, P. G. (2024, August). Exploring the Suitability of
Conversational AI for Child-Robot Interaction. In 2024 33rd IEEE International Conference
on Robot and Human Interactive Communication (ROMAN) (pp. 1821-1827). IEEE.
[263] Sherstinova, T., Mikhaylovskiy, N., Kolpashchikova, E., and Kruglikova, V. (2024, April).
Bridging Gaps in Russian Language Processing: AI and Everyday Conversations. In 2024
35th Conference of Open Innovations Association (FRUCT) (pp. 665-674). IEEE.
[264] Lipton, Z. C. (2015). A Critical Review of Recurrent Neural Networks for Sequence Learning.
arXiv preprint arXiv:1506.00019.
[265] Pascanu, R. (2013). On the difficulty of training recurrent neural networks. arXiv preprint
arXiv:1211.5063.
[266] Jaeger, H. (2001). The “echo state” approach to analysing and training recurrent neural
networks-with an erratum note. Bonn, Germany: German National Research Center for Infor-
mation Technology GMD Technical Report, 148(34), 13.
[268] Kawakami, K. (2008). Supervised sequence labelling with recurrent neural networks (Doctoral
dissertation).
[269] Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gra-
dient descent is difficult. IEEE transactions on neural networks, 5(2), 157-166.
[270] Bhattamishra, S., Patel, A., and Goyal, N. (2020). On the computational power of trans-
formers and its implications in sequence modeling. arXiv preprint arXiv:2006.09286.
[272] Sutton, R. S. (2018). Reinforcement learning: An introduction. A Bradford Book.
[273] Barto, A. G. (2021). Reinforcement Learning: An Introduction, by Richard S. Sutton. SIAM
Review, 6(2), 423.
[274] Bertsekas, D. P. (1996). Neuro-dynamic programming. Athena Scientific.
[275] Kakade, S. M. (2003). On the sample complexity of reinforcement learning. University of
London, University College London (United Kingdom).
[276] Szepesvári, C. (2022). Algorithms for reinforcement learning. Springer Nature.
[277] Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018, July). Soft actor-critic: Off-
policy maximum entropy deep reinforcement learning with a stochastic actor. In International
conference on machine learning (pp. 1861-1870). PMLR.
[278] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., ... and Has-
sabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540),
529-533.
[279] Konda, V., and Tsitsiklis, J. (1999). Actor-critic algorithms. Advances in neural information
processing systems, 12.
[280] Levine, S. (2018). Reinforcement learning and control as probabilistic inference: Tutorial and
review. arXiv preprint arXiv:1805.00909.
[281] Mannor, S., Mansour, Y., and Tamar, A. (2022). Reinforcement Learning: Foundations.
Online manuscript.
[282] Borkar, V. S. (2008). Stochastic approximation: a dynamical systems viewpoint (Vol. 9).
Cambridge: Cambridge University Press.
[283] Takhsha, A. R., Rastgarpour, M., and Naderi, M. (2025). A Feature-Level Ensemble Model
for COVID-19 Identification in CXR Images using Choquet Integral and Differential Evolution
Optimization. arXiv preprint arXiv:2501.08241.
[284] Singh, P., and Raman, B. (2025). Graph Neural Networks: Extending Deep Learning to
Graphs. In Deep Learning Through the Prism of Tensors (pp. 423-482). Singapore: Springer
Nature Singapore.
[285] Yao, L., Shi, Q., Yang, Z., Shao, S., and Hariri, S. (2024). Development of an Edge Resilient
ML Ensemble to Tolerate ICS Adversarial Attacks. arXiv preprint arXiv:2409.18244.
[286] Chen, K., Bi, Z., Niu, Q., Liu, J., Peng, B., Zhang, S., ... and Feng, P. (2024). Deep learning
and machine learning, advancing big data analytics and management: Tensorflow pretrained
models. arXiv preprint arXiv:2409.13566.
[287] Dumić, E. (2024). Learning neural network design with TensorFlow and Keras. In ICERI2024
Proceedings (pp. 10689-10696). IATED.
[288] Bajaj, K., Bordoloi, D., Tripathy, R., Mohapatra, S. K., Sarangi, P. K., and Sharma, P.
(2024, September). Convolutional Neural Network Based on TensorFlow for the Recognition
of Handwritten Digits in the Odia. In 2024 International Conference on Advances in Computing
Research on Science Engineering and Technology (ACROSET) (pp. 1-5). IEEE.
[289] Abbass, A. M., and Fyath, R. S. (2024). Enhanced approach for artificial neural network-
based optical fiber channel modeling: Geometric constellation shaping WDM system as a case
study. Journal of Applied Research and Technology, 22(6), 768-780.
[290] Prabha, D., Subramanian, R. S., Dinesh, M. G., and Girija, P. (2024). Sustainable Farming
Through AI-Enabled Precision Agriculture. In Artificial Intelligence for Precision Agriculture
(pp. 159-182). Auerbach Publications.
[293] Team, G. Y. Bifang: A New Free-Flying Cubic Robot for Space Station.
[295] Naderi, S., Chen, B., Yang, T., Xiang, J., Heaney, C. E., Latham, J. P., ... and Pain, C.
C. (2024). A discrete element solution method embedded within a Neural Network. Powder
Technology, 448, 120258.
[296] Polaka, S. K. R. (2024). Verifica delle reti neurali per l'apprendimento rinforzato sicuro
[Verification of neural networks for safe reinforcement learning].
[297] Erdogan, L. E., Kanakagiri, V. A. R., Keutzer, K., and Dong, Z. (2024). Stochastic Commu-
nication Avoidance for Recommendation Systems. arXiv preprint arXiv:2411.01611.
[298] Liao, F., Tang, Y., Du, Q., Wang, J., Li, M., and Zheng, J. (2024). Domain Progressive
Low-dose CT Imaging using Iterative Partial Diffusion Model. IEEE Transactions on Medical
Imaging.
[299] Sekhavat, Y. (2024). Looking for creative basis of artificial intelligence art in the midst of
order and chaos based on Nietzsche’s theories. Theoretical Principles of Visual Arts.
[300] Cai, H., Yang, Y., Tang, Y., Sun, Z., and Zhang, W. (2025). Shapley value-based class
activation mapping for improved explainability in neural networks. The Visual Computer,
1-19.
[301] Na, W. (2024). Rach-Space: Novel Ensemble Learning Method With Applications in Weakly
Supervised Learning (Master’s thesis, Tufts University).
[302] Khajah, M. M. (2024). Supercharging BKT with Multidimensional Generalizable IRT and
Skill Discovery. Journal of Educational Data Mining, 16(1), 233-278.
[303] Zhang, Y., Duan, Z., Huang, Y., and Zhu, F. (2024). Theoretical Bound-Guided Hierarchical
VAE for Neural Image Codecs. arXiv preprint arXiv:2403.18535.
[304] Wang, L., and Huang, W. (2025). On the convergence analysis of over-parameterized varia-
tional autoencoders: a neural tangent kernel perspective. Machine Learning, 114(1), 15.
[305] Li, C. N., Liang, H. P., Zhao, B. Q., Wei, S. H., and Zhang, X. (2024). Machine learning
assisted crystal structure prediction made simple. Journal of Materials Informatics, 4(3), N/A.
[306] Huang, Y. (2024). Research Advanced in Image Generation Based on Diffusion Probability
Model. Highlights in Science, Engineering and Technology, 85, 452-456.
[307] Chenebuah, E. T. (2024). Artificial Intelligence Simulation and Design of Energy Materials
with Targeted Properties (Doctoral dissertation, Université d’Ottawa— University of Ottawa).
[308] Furth, N., Imel, A., and Zawodzinski, T. A. (2024, November). Graph Encoders for Redox
Potentials and Solubility Predictions. In Electrochemical Society Meeting Abstracts PRiME
2024 (No. 3, pp. 344-344). The Electrochemical Society, Inc.
[309] Gong, J., Deng, Z., Xie, H., Qiu, Z., Zhao, Z., and Tang, B. Z. (2025). Deciphering Design
of Aggregation-Induced Emission Materials by Data Interpretation. Advanced Science, 12(3),
2411345.
[310] Kim, H., Lee, C. H., and Hong, C. (2024, July). VATMAN: Video Anomaly Transformer for
Monitoring Accidents and Nefariousness. In 2024 IEEE International Conference on Advanced
Video and Signal Based Surveillance (AVSS) (pp. 1-7). IEEE.
[311] Albert, S. W., Doostan, A., and Schaub, H. (2024). Dimensionality Reduction for Onboard
Modeling of Uncertain Atmospheres. Journal of Spacecraft and Rockets, 1-13.
[312] Sharma, D. K., Hota, H. S., and Rababaah, A. R. (2024). Machine Learning for Real World
Applications (Doctoral dissertation, Department of Computer Science and Engineering, Indian
Institute of Technology Patna).
[313] Li, T., Shi, Z., Dale, S. G., Vignale, G., and Lin, M. Jrystal: A JAX-based Differentiable
Density Functional Theory Framework for Materials.
[314] Bieberich, S., Li, P., Ngai, J., Patel, K., Vogt, R., Ranade, P., ... and Stafford, S. (2024).
Conducting Quantum Machine Learning Through The Lens of Solving Neural Differential
Equations On A Theoretical Fault Tolerant Quantum Computer: Calibration and Bench-
marking.
[315] Dagréou, M., Ablin, P., Vaiter, S., and Moreau, T. (2024). How to compute Hessian-vector
products?. In The Third Blogpost Track at ICLR 2024.
[316] Lohoff, J., and Neftci, E. (2024). Optimizing Automatic Differentiation with Deep Reinforce-
ment Learning. arXiv preprint arXiv:2406.05027.
[317] Legrand, N., Weber, L., Waade, P. T., Daugaard, A. H. M., Khodadadi, M., Mikuš, N.,
and Mathys, C. (2024). pyhgf: A neural network library for predictive coding. arXiv preprint
arXiv:2410.09206.
[318] Alzás, P. B., and Radev, R. (2024). Differentiable nuclear deexcitation simulation for low
energy neutrino physics. arXiv preprint arXiv:2404.00180.
[319] Edenhofer, G., Frank, P., Roth, J., Leike, R. H., Guerdi, M., Scheel-Platz, L. I., ... and
Enßlin, T. A. (2024). Re-envisioning numerical information field theory (NIFTy.re): A library
for Gaussian processes and variational inference. arXiv preprint arXiv:2402.16683.
[320] Chan, S., Kulkarni, P., Paul, H. Y., and Parekh, V. S. (2024, September). Expanding the
Horizon: Enabling Hybrid Quantum Transfer Learning for Long-Tailed Chest X-Ray Clas-
sification. In 2024 IEEE International Conference on Quantum Computing and Engineering
(QCE) (Vol. 1, pp. 572-582). IEEE.
[321] Ye, H., Hu, Z., Yin, R., Boyko, T. D., Liu, Y., Li, Y., ... and Li, Y. (2025). Electron transfer
at birnessite/organic compound interfaces: mechanism, regulation, and two-stage kinetic dis-
crepancy in structural rearrangement and decomposition. Geochimica et Cosmochimica Acta,
388, 253-267.
[322] Khan, M., Ludl, A. A., Bankier, S., Björkegren, J. L., and Michoel, T. (2024). Prediction
of causal genes at GWAS loci with pleiotropic gene regulatory effects using sets of correlated
instrumental variables. PLoS genetics, 20(11), e1011473.
[323] Ojala, K., and Zhou, C. (2024). Determination of outdoor object distances from monocular
thermal images.
[324] Popordanoska, T., and Blaschko, M. (2024). Advancing Calibration in Deep Learning: The-
ory, Methods, and Applications.
[325] Alfieri, A., Cortes, J. M. P., Pastore, E., Castiglione, C., and Rey, G. M. Z. A Deep Q-Network
Approach to Job Shop Scheduling with Transport Resources.
[326] Zanardelli, R. (2025). Statistical learning methods for decision-making, with applications in
Industry 4.0.
[327] Norouzi, M., Hosseini, S. H., Khoshnevisan, M., and Moshiri, B. (2025). Applications of pre-
trained CNN models and data fusion techniques in Unity3D for connected vehicles. Applied
Intelligence, 55(6), 390.
[328] Wang, R., Yang, T., Liang, C., Wang, M., and Ci, Y. (2025). Reliable Autonomous Driving
Environment Perception: Uncertainty Quantification of Semantic Segmentation. Journal of
Transportation Engineering, Part A: Systems, 151(3), 04024117.
[329] Xia, Q., Chen, P., Xu, G., Sun, H., Li, L., and Yu, G. (2024). Adaptive Path-Tracking Con-
troller Embedded With Reinforcement Learning and Preview Model for Autonomous Driving.
IEEE Transactions on Vehicular Technology.
[330] Liu, Q., Tang, Y., Li, X., Yang, F., Wang, K., and Li, Z. (2024). MV-STGHAT: Multi-View
Spatial-Temporal Graph Hybrid Attention Network for Decision-Making of Connected and
Autonomous Vehicles. IEEE Transactions on Vehicular Technology.
[331] Chakraborty, D., and Deka, B. (2025). Deep Learning-based Selective Feature Fusion for
Litchi Fruit Detection using Multimodal UAV Sensor Measurements. IEEE Transactions on
Artificial Intelligence.
[332] Mirindi, D., Khang, A., and Mirindi, F. (2025). Artificial Intelligence (AI) and Automation
for Driving Green Transportation Systems: A Comprehensive Review. Driving Green Trans-
portation System Through Artificial Intelligence and Automation: Approaches, Technologies
and Applications, 1-19.
[333] Choudhury, B., Rajakumar, K., Badhale, A. A., Roy, A., Sahoo, R., and Margret, I. N. (2024,
June). Comparative Analysis of Advanced Models for Satellite-Based Aircraft Identification.
In 2024 International Conference on Smart Systems for Electrical, Electronics, Communication
and Computer Engineering (ICSSEECC) (pp. 483-488). IEEE.
[334] Almubarok, W., Rosiani, U. D., and Asmara, R. A. (2024, November). MobileNetV2 Pruning
for Improved Efficiency in Catfish Classification on Resource-Limited Devices. In 2024 IEEE
10th Information Technology International Seminar (ITIS) (pp. 271-277). IEEE.
[335] Ding, Q. (2024, February). Classification Techniques of Tongue Manifestation Based on Deep
Learning. In 2024 IEEE 3rd International Conference on Electrical Engineering, Big Data and
Algorithms (EEBDA) (pp. 802-810). IEEE.
[336] He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition.
In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-
778).
[337] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep
convolutional neural networks. Advances in neural information processing systems, 25.
[338] Sultana, F., Sufian, A., and Dutta, P. (2018, November). Advancements in image classification
using convolutional neural network. In 2018 Fourth International Conference on Research in
Computational Intelligence and Communication Networks (ICRCICN) (pp. 122-129). IEEE.
[339] Sattler, T., Zhou, Q., Pollefeys, M., and Leal-Taixe, L. (2019). Understanding the limitations
of cnn-based absolute camera pose regression. In Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition (pp. 3302-3312).
[340] Vaswani, A. (2017). Attention is all you need. Advances in Neural Information Processing
Systems.
[341] Nannepagu, M., Babu, D. B., and Madhuri, C. B. Leveraging Hybrid AI Models: DQN,
Prophet, BERT, ART-NN, and Transformer-Based Approaches for Advanced Stock Market
Forecasting.
[342] De Rose, L., Andresini, G., Appice, A., and Malerba, D. (2024). VINCENT: Cyber-threat
detection through vision transformers and knowledge distillation. Computers and Security,
103926.
[343] Buehler, M. J. (2025). Graph-Aware Isomorphic Attention for Adaptive Dynamics in Trans-
formers. arXiv preprint arXiv:2501.02393.
[344] Tabibpour, S. A., and Madanizadeh, S. A. (2024). Solving High-Dimensional Dynamic Pro-
gramming Using Set Transformer. Available at SSRN 5040295.
[345] Li, S., and Dong, P. (2024, October). Mixed Attention Transformer Enhanced Channel Esti-
mation for Extremely Large-Scale MIMO Systems. In 2024 16th International Conference on
Wireless Communications and Signal Processing (WCSP) (pp. 394-399). IEEE.
[346] Asefa, S. H., and Assabie, Y. (2024). Transformer-Based Amharic-to-English Machine Trans-
lation with Character Embedding and Combined Regularization Techniques. IEEE Access.
[347] Liao, M., and Chen, M. (2024, November). A new deepfake detection method by vision
transformers. In International Conference on Algorithms, High Performance Computing, and
Artificial Intelligence (AHPCAI 2024) (Vol. 13403, pp. 953-957). SPIE.
[348] Jiang, L., Cui, J., Xu, Y., Deng, X., Wu, X., Zhou, J., and Wang, Y. (2024, August). SC-
Former: Spatial and Channel-wise Transformer with Contrastive Learning for High-Quality
PET Image Reconstruction. In 2024 IEEE International Conference on Cybernetics and In-
telligent Systems (CIS) and IEEE International Conference on Robotics, Automation and
Mechatronics (RAM) (pp. 26-31). IEEE.
[349] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., ... and
Bengio, Y. (2014). Generative adversarial nets. Advances in neural information processing
systems, 27.
[350] Chappidi, J., and Sundaram, D. M. (2024). Dual Q-Learning with Graph Neural Networks:
A Novel Approach to Animal Detection in Challenging Ecosystems. Journal of Theoretical
and Applied Information Technology, 102(23).
[351] Joni, R. (2024). Delving into Deep Learning: Illuminating Techniques and Visual Clarity for
Image Analysis (No. 12808). EasyChair.
[352] Kalaiarasi, G., Sudharani, B., Jonnalagadda, S. C., Battula, H. V., and Sanagala, B. (2024,
July). A Comprehensive Survey of Image Steganography. In 2024 2nd International Conference
on Sustainable Computing and Smart Systems (ICSCSS) (pp. 1225-1230). IEEE.
[353] Arjmandi-Tash, A. M., Mansourian, A., Rahsepar, F. R., and Abdi, Y. (2024). Predicting
Photodetector Responsivity through Machine Learning. Advanced Theory and Simulations,
2301219.
[354] Gao, Y. (2024). Neural networks meet applied mathematics: GANs, PINNs, and transformers.
HKU Theses Online (HKUTO).
[355] Hisama, K., Ishikawa, A., Aspera, S. M., and Koyama, M. (2024). Theoretical Catalyst
Screening of Multielement Alloy Catalysts for Ammonia Synthesis Using Machine Learning
Potential and Generative Artificial Intelligence. The Journal of Physical Chemistry C, 128(44),
18750-18758.
[356] Wang, M., and Zhang, Y. (2024). Image Segmentation in Complex Backgrounds using an Im-
proved Generative Adversarial Network. International Journal of Advanced Computer Science
and Applications, 15(5).
[357] Alonso, N. I., and Arias, F. (2025). The Mathematics of Q-Learning and the Hamilton-
Jacobi-Bellman Equation (January 5, 2025).
[358] Lu, C., Shi, L., Chen, Z., Wu, C., and Wierman, A. (2024). Overcoming the Curse of Di-
mensionality in Reinforcement Learning Through Approximate Factorization. arXiv preprint
arXiv:2411.07591.
[359] Humayoo, M. (2024). Time-Scale Separation in Q-Learning: Extending TD(Δ) for Action-
Value Function Decomposition. arXiv preprint arXiv:2411.14019.
[360] Jia, L., Qi, N., Su, Z., Chu, F., Fang, S., Wong, K. K., and Chae, C. B. (2024). Game theory
and reinforcement learning for anti-jamming defense in wireless communications: Current
research, challenges, and solutions. IEEE Communications Surveys and Tutorials.
[361] Chai, J., Chen, E., and Fan, J. (2025). Deep Transfer Q-Learning for Offline Non-Stationary
Reinforcement Learning. arXiv preprint arXiv:2501.04870.
[362] Yao, J., and Gong, X. (2024, October). Communication-Efficient and Resilient Distributed
Deep Reinforcement Learning for Multi-Agent Systems. In 2024 IEEE International Conference
on Unmanned Systems (ICUS) (pp. 1521-1526). IEEE.
[363] Liu, Y., Yang, T., Tian, L., and Pei, J. (2025). SGD-TripleQNet: An Integrated Deep Rein-
forcement Learning Model for Vehicle Lane-Change Decision. Mathematics, 13(2), 235.
[364] Masood, F., Ahmad, J., Al Mazroa, A., Alasbali, N., Alazeb, A., and Alshehri, M. S. (2025).
Multi IRS-Aided Low-Carbon Power Management for Green Communication in 6G Smart
Agriculture Using Deep Game Theory. Computational Intelligence, 41(1), e70022.
[366] El Mimouni, I., and Avrachenkov, K. (2025, January). Deep Q-Learning with Whittle Index
for Contextual Restless Bandits: Application to Email Recommender Systems. In Northern
Lights Deep Learning Conference 2025.
[367] Shefin, R. S., Rahman, M. A., Le, T., and Alqahtani, S. (2024). xSRL: Safety-Aware
Explainable Reinforcement Learning–Safety as a Product of Explainability. arXiv preprint
arXiv:2412.19311.
[368] Khlifi, A., Othmani, M., and Kherallah, M. (2025). A Novel Approach to Autonomous Driving
Using DDQN-Based Deep Reinforcement Learning.
[369] Kuczkowski, D. (2024). Energy efficient multi-objective reinforcement learning algorithm for
traffic simulation.
[370] Krauss, R., Zielasko, J., and Drechsler, R. Large-Scale Evolutionary Optimization of Artificial
Neural Networks Using Adaptive Mutations.
[371] Ahamed, M. S., Pey, J. J. J., Samarakoon, S. B. P., Muthugala, M. V. J., and Elara, M. R.
(2025). Reinforcement Learning for Reconfigurable Robotic Soccer. IEEE Access.
[372] Elmquist, A., Serban, R., and Negrut, D. (2024). A methodology to quantify simulation-vs-
reality differences in images for autonomous robots. IEEE Sensors Journal.
[373] Kobanda, A., Portelas, R., Maillard, O. A., and Denoyer, L. (2024). Hierarchical Subspaces
of Policies for Continual Offline Reinforcement Learning. arXiv preprint arXiv:2412.14865.
[374] Xu, J., Xie, G., Zhang, Z., Hou, X., Zhang, S., Ren, Y., and Niyato, D. (2025). UPEGSim:
An RL-Enabled Simulator for Unmanned Underwater Vehicles Dedicated in the Underwater
Pursuit-Evasion Game. IEEE Internet of Things Journal, 12(3), 2334-2346.
[375] Patadiya, K., Jain, R., Moteriya, J., Palaniappan, D., Kumar, P., and Premavathi, T. (2024,
December). Application of Deep Learning to Generate Auto Player Mode in Car Based Game.
In 2024 IEEE 16th International Conference on Computational Intelligence and Communica-
tion Networks (CICN) (pp. 233-237). IEEE.
[376] Janjua, J. I., Kousar, S., Khan, A., Ihsan, A., Abbas, T., and Saeed, A. Q. (2024, Decem-
ber). Enhancing Scalability in Reinforcement Learning for Open Spaces. In 2024 International
Conference on Decision Aid Sciences and Applications (DASA) (pp. 1-8). IEEE.
[377] Yang, L., Li, Y., Wang, J., and Sherratt, R. S. (2020). Sentiment analysis for E-commerce
product reviews in Chinese based on sentiment lexicon and deep learning. IEEE Access, 8,
23522-23530.
[378] Manikandan, C., Kumar, P. S., Nikitha, N., Sanjana, P. G., and Dileep, Y. Filtering Emails
Using Natural Language Processing.
[379] Isiaka, S. O., Babatunde, R. S., and Isiaka, R. M. Exploring Artificial Intelligence (AI)
Technologies in Predictive Medicine: A Systematic Review.
[380] Petrov, A., Zhao, D., Smith, J., Volkov, S., Wang, J., and Ivanov, D. Deep Learning Ap-
proaches for Emotional State Classification in Textual Data.
[381] Liang, M. (2025). Leveraging natural language processing for automated assessment and feed-
back production in virtual education settings. Journal of Computational Methods in Sciences
and Engineering, 14727978251314556.
[382] Jin, L. (2025). Research on Optimization Strategies of Artificial Intelligence Algorithms for
the Integration and Dissemination of Pharmaceutical Science Popularization Knowledge. Sci-
entific Journal of Technology, 7(1), 45-55.
[383] McNicholas, B. A., Madden, M. G., and Laffey, J. G. (2025). Natural language processing in
critical care: opportunities, challenges, and future directions. Intensive Care Medicine, 1-5.
[384] Abd Al Abbas, M., and Khammas, B. M. (2024). Efficient IoT Malware Detection Technique
Using Recurrent Neural Network. Iraqi Journal of Information and Communication Technol-
ogy, 7(3), 29-42.
[385] Kalonia, S., and Upadhyay, A. (2025). Deep learning-based approach to predict software
faults. In Artificial Intelligence and Machine Learning Applications for Sustainable Develop-
ment (pp. 326-348). CRC Press.
[386] Han, S. C., Weld, H., Li, Y., Lee, J., and Poon, J. Natural Language Understanding in
Conversational AI with Deep Learning.
[387] Potter, K., and Egon, A. Recurrent Neural Networks (RNNs) for Time Series Forecasting.
[388] Yatkin, M. A., Kõrgesaar, M., and Işlak, Ü. (2025). A Topological Approach to Enhancing
Consistency in Machine Learning via Recurrent Neural Networks. Applied Sciences, 15(2),
933.
[389] Saifullah, S. (2024). Comparative Analysis of LSTM and GRU Models for Chicken Egg Fer-
tility Classification using Deep Learning.
[391] Tu, Z., Jeffries, S. D., Morse, J., and Hemmerling, T. M. (2024). Comparison of time-series
models for predicting physiological metrics under sedation. Journal of Clinical Monitoring and
Computing, 1-11.
[392] Zuo, Y., Jiang, J., and Yada, K. (2025). Application of hybrid gate recurrent unit for in-store
trajectory prediction based on indoor location system. Scientific Reports, 15(1), 1055.
[393] Lima, R., Scardua, L. A., and De Almeida, G. M. (2024). Predicting Temperatures Inside a
Steel Slab Reheating Furnace Using Neural Networks. Authorea Preprints.
[394] Khan, S., Muhammad, Y., Jadoon, I., Awan, S. E., and Raja, M. A. Z. (2025). Leveraging
LSTM-SMI and ARIMA architecture for robust wind power plant forecasting. Applied Soft
Computing, 112765.
[395] Guo, Z., and Feng, L. (2024). Multi-step prediction of greenhouse temperature and humidity
based on temporal position attention LSTM. Stochastic Environmental Research and Risk
Assessment, 1-28.
[396] Abdelhamid, N. M., Khechekhouche, A., Mostefa, K., Brahim, L., and Talal, G. (2024).
Deep-RNN based model for short-time forecasting photovoltaic power generation using IoT.
Studies in Engineering and Exact Sciences, 5(2), e11461-e11461.
[397] Rohman, F. N., and Farikhin, B. S. Hyperparameter Tuning of Random Forest Algorithm
for Diabetes Classification.
[398] Rahman, M. Utilizing Machine Learning Techniques for Early Brain Tumor Detection.
[399] Nandi, A., Singh, H., Majumdar, A., Shaw, A., and Maiti, A. Optimizing Baby Sound
Recognition using Deep Learning through Class Balancing and Model Tuning.
[400] Sianga, B. E., Mbago, M. C., and Msengwa, A. S. (2025). Predicting the Prevalence of
Cardiovascular Diseases Using Machine Learning Algorithms. Intelligence-Based Medicine,
100199.
[401] Li, L., Hu, Y., Yang, Z., Luo, Z., Wang, J., Wang, W., ... and Zhang, Z. (2025). Exploring the
assessment of post-cardiac valve surgery pulmonary complication risks through the integration
of wearable continuous physiological and clinical data. BMC Medical Informatics and Decision
Making, 25(1), 1-11.
[402] Lázaro, F. L., Madeira, T., Melicio, R., Valério, D., and Santos, L. F. (2025). Identifying Hu-
man Factors in Aviation Accidents with Natural Language Processing and Machine Learning
Models. Aerospace, 12(2), 106.
[403] Li, Z., Zhong, J., Wang, H., Xu, J., Li, Y., You, J., ... and Dev, S. (2025). RAINER: A
Robust Ensemble Learning Grid Search-Tuned Framework for Rainfall Patterns Prediction.
arXiv preprint arXiv:2501.16900.
[404] Khurshid, M. R., Manzoor, S., Sadiq, T., Hussain, L., Khan, M. S., and Dutta, A. K. (2025).
Unveiling diabetes onset: Optimized XGBoost with Bayesian optimization for enhanced pre-
diction. PloS one, 20(1), e0310218.
[405] Kanwar, M., Pokharel, B., and Lim, S. (2025). A new random forest method for landslide
susceptibility mapping using hyperparameter optimization and grid search techniques. Inter-
national Journal of Environmental Science and Technology, 1-16.
[406] Fadil, M., Akrom, M., and Herowati, W. (2025). Utilization of Machine Learning for Pre-
dicting Corrosion Inhibition by Quinoxaline Compounds. Journal of Applied Informatics and
Computing, 9(1), 173-177.
[407] Emmanuel, J., Isewon, I., and Oyelade, J. (2025). An Optimized Deep-Forest Algorithm
Using a Modified Differential Evolution Optimization Algorithm: A Case of Host-Pathogen
Protein-Protein Interaction Prediction. Computational and Structural Biotechnology Journal.
[408] Gaurav, A., Gupta, B. B., Attar, R. W., Alhomoud, A., Arya, V., and Chui, K. T. (2025).
Driver identification in advanced transportation systems using osprey and salp swarm opti-
mized random forest model. Scientific Reports, 15(1), 2453.
[409] Ning, C., Ouyang, H., Xiao, J., Wu, D., Sun, Z., Liu, B., ... and Huang, G. (2025). Develop-
ment and validation of an explainable machine learning model for mortality prediction among
patients with infected pancreatic necrosis. eClinicalMedicine, 80.
[410] Muñoz, V., Ballester, C., Copaci, D., Moreno, L., and Blanco, D. (2025). Accelerating hy-
perparameter optimization with a secretary. Neurocomputing, 129455.
[411] Balcan, M. F., Nguyen, A. T., and Sharma, D. (2025). Sample complexity of data-driven
tuning of model hyperparameters in neural networks with structured parameter-dependent
dual function. arXiv preprint arXiv:2501.13734.
[412] Azimi, H., Kalhor, E. G., Nabavi, S. R., Behbahani, M., and Vardini, M. T. (2025). Data-
based modeling for prediction of supercapacitor capacity: Integrated machine learning and
metaheuristic algorithms. Journal of the Taiwan Institute of Chemical Engineers, 170, 105996.
[413] Shibina, V., and Thasleema, T. M. (2025). Voice feature-based diagnosis of Parkinson’s dis-
ease using nature inspired squirrel search algorithm with ensemble learning classifiers. Iran
Journal of Computer Science, 1-25.
[414] Chang, F., Dong, S., Yin, H., Ye, X., Wu, Z., Zhang, W., and Zhu, H. (2025). 3D displacement
time series prediction of a north-facing reservoir landslide powered by InSAR and machine
learning. Journal of Rock Mechanics and Geotechnical Engineering.
[415] Cihan, P. (2025). Bayesian Hyperparameter Optimization of Machine Learning Models for
Predicting Biomass Gasification Gases. Applied Sciences, 15(3), 1018.
[416] Makomere, R., Rutto, H., Alugongo, A., Koech, L., Suter, E., and Kohitlhetse, I. (2025).
Enhanced dry SO2 capture estimation using Python-driven computational frameworks with
hyperparameter tuning and data augmentation. Unconventional Resources, 100145.
[417] Bakır, H. (2025). A new method for tuning the CNN pre-trained models as a feature extractor
for malware detection. Pattern Analysis and Applications, 28(1), 26.
[418] Liu, Y., Yin, H., and Li, Q. (2025). Sound absorption performance prediction of multi-
dimensional Helmholtz resonators based on deep learning and hyperparameter optimization.
Physica Scripta.
[419] Ma, Z., Zhao, M., Dai, X., and Chen, Y. (2025). Anomaly detection for high-speed machining
using hybrid regularized support vector data description. Robotics and Computer-Integrated
Manufacturing, 94, 102962.
[420] El-Bouzaidi, Y. E. I., Hibbi, F. Z., and Abdoun, O. (2025). Optimizing Convolutional Neural
Network Impact of Hyperparameter Tuning and Transfer Learning. In Innovations in Opti-
mization and Machine Learning (pp. 301-326). IGI Global Scientific Publishing.
[421] Mustapha, B., Zhou, Y., Shan, C., and Xiao, Z. (2025). Enhanced Pneumonia Detection in
Chest X-Rays Using Hybrid Convolutional and Vision Transformer Networks. Current Medical
Imaging, e15734056326685.