CodPy - A Python Library For Numerical, ML, and Stats
January 2024
1
Laboratoire Jacques-Louis Lions, Sorbonne Université and Centre National de la Recherche Scientifique,
4 Place Jussieu, 75258 Paris, France. Email: [email protected]
2
MPG-Partners, 136 Boulevard Haussmann, 75008 Paris, France.
Email: [email protected], [email protected].
This is a draft of a monograph in preparation.
Contents
1 Introduction 4
1.1 Main objective 4
1.2 Outline of this monograph 4
1.3 References 6
4 Kernel-based operators 41
4.1 Introduction 41
4.2 Discrete differential operators 41
4.3 A clustering algorithm 47
4.4 Bibliography 51
Chapter 1
Introduction
1.3 References
There is a vast literature available on kernel methods and reproducing kernel Hilbert spaces
which we do not attempt to review here. Our focus is on providing a practical framework for the
application of such methods. However, for the reader interested in a comprehensive review of the
theory we refer to several textbooks and research articles such as Berlinet and Thomas-Agnan [3]
and Fasshauer [11],[12],[13].
Our kernel-based meshfree algorithms presented in Chapters 3 to 5 are based on the research
papers [30],[31],[32],[33],[34]. Earlier versions of this material can also be found in unpublished
notes [35]–[40].
For additional information on meshfree methods in fluid dynamics and material science, the reader
is referred to the following works: [2],[4],[16],[18],[23], [41],[43],[46],[49],[52],[56],[64].
Chapter 2
Overview of methods of machine learning
2.1 A framework for machine learning
Throughout this chapter, a machine learning method m is represented by a predictor of the form
f_z = P_m(X, Y = [ ], Z = X, f(X)).
Using standard Python notation, the empty brackets indicate that the variables Y and Z represent
optional input data.
The subscript m is introduced to specify the choice of the method. On the one hand, each
method relies on a set of external parameters, or hyperparameters, which should be specified
before training. On the other hand, fine-tuning these external parameters can be challenging and
error-prone. As a matter of fact, some strategies in the literature even propose using a machine
learning approach to determine these parameters. When selecting a method, it is crucial to consider
performance indicators before tuning the hyperparameters.
Let us specify our notation, in which X, Y, and Z can be regarded as matrices (of various
dimensions).
• The input data X, Y, Z, f (X) are as follows.
– The (non-optional) parameter X ∈ RNx ,D is called the training set. This is a matrix
where each row represents a data sample of a distribution X and each column represents
a certain feature. The parameter D denotes the total number of features in the dataset.
– The variable f (X) ∈ RNx ,Df is called the training set values. These are the target
values or labels associated with each sample in the training set. The parameter Df is
the dimensionality of the target values. There is an important distinction to be made
here:
∗ Deterministic case, if f(X) is considered as a continuous function of X. This
book details kernel methods for this case in the following two chapters.
∗ Stochastic case, if f(X) ≡ E(f | X) is considered as a random variable conditioned
on X. Kernel methods for this case are discussed in Section 5.3.2.
– The variable Z ∈ RNz ,D is the test set. This is a separate set of data samples used to
evaluate the model performance on unseen data. If Z is not explicitly provided, it is
assumed that Z = X (that is, the test set is then the same as the training set).
– The variable Y ∈ R^{Ny,D} is called the internal parameter set.¹ This set is crucial for
defining the predictor P_m.
• The output data are as follows.
– Supervised learning: In this approach, the model is trained using known input-output
pairs. The goal is to learn a function that can make predictions for new, unseen inputs.
Specifically, given the input function values f (X) the relationship is expressed as
f_Z = P_m(X, Y = [ ], Z = X, f(X)) ≃ f(Z), (2.1.1)
where f_Z represents the predicted values and each f_z ∈ R^{Nz,D} is termed a prediction.
We distinguish between two cases.
∗ feed-backward machine. If the input data Y is not provided (i.e. left empty),
then the prediction mechanism described by (2.1.1) falls under the category of
feed-backward machines. In this scenario, the method internally determines this set
and computes the prediction fz .
∗ feed-forward machine. Conversely, if Y is explicitly specified as input data, then
the prediction mechanism from (2.1.1) is called a feed-forward machine. In this
case, the method makes use of the set of internal parameters in order to compute
the prediction f_z.
– Unsupervised learning: In this approach, the model is trained without explicit labels
or target values. Instead, the goal is to discover underlying patterns or structures in the
data. Specifically, the relationship is expressed as
f_z = P_m(X, Y = [ ], Z), (2.1.2)
where the output values f_z ∈ R^{Nz,D} are called clusters in the context of the so-called
clustering methods (which will be elaborated upon later).
Many other machine learning methods can be described with the notation above. For instance,
consider two methods denoted by m1 and m2 . Their composition can be defined and describes a
feed-backward machine, which is analogous to the notion of semi-supervised learning in the
literature (and also encompasses feed-backward learning machines). Specifically, we write
f_z = P_{m_1}(X, P_{m_2}(X, f(X)), Z, f(X)). (2.1.3)
Here, the term “semi-supervised learning” denotes a learning paradigm where the training dataset
comprises both labeled and unlabeled samples. The primary objective is to leverage the unlabeled
samples to enhance the model performance on the labeled ones. On the other hand, “feedback
learning machines” refer to a specific class of models, in which the output is recursively fed back as
input, aiming to refine prediction accuracy via iterations.
We summarize our main notation in Table 2.1. The dimensions of the input data, that is, the
integers D, Nx , Ny , Nz , Df , are also treated as input parameters. The fundamental distinction
between supervised and unsupervised learning lies in the nature of the input data: supervised
learning relies on input data for both the features and their associated labels, whereas unsupervised
learning only requires input data for the features. We will delve deeper into this distinction in
subsequent sections of this chapter.
1 In the context of neural networks, this might also be referred to as the weight set.
Table 2.1: Main notation.
X: training set, size (Nx, D)
Y: parameter set, size (Ny, D)
Z: test set, size (Nz, D)
f(X): training values, size (Nx, Df)
fz: predictions, size (Nz, Df)
Moreover, from any machine learning method m we can also compute the gradient of a real-valued
function f = f(x_1, . . . , x_D) by
(∇f)_Z = (∇_Z P_m)(X, Y = [ ], Z = X, f(X) = [ ]) ∼ ∇f(Z), (2.1.4)
where the gradient is denoted by ∇ = (∂_{x_1}, . . . , ∂_{x_D}). When (2.1.4) holds, we say that m is a
differentiable learning machine.
Supervised learning is a technique used to predict or extrapolate the values of a given function
on a new set of inputs. In other words, it involves training a model on historical observations of
the function X and its corresponding outputs, and then using the trained model to predict the
output values on a new set of inputs Z.
When considering the terminology of supervised learning, a method is said to be multi-class or
multi-output if the function f is vector-valued, meaning Df ≥ 1 in our notation. It is important to
note that while it is possible to combine learning machines to produce multi-class methods, this
often comes with a significant computational cost.
Additionally, the input function f can be classified as discrete, continuous, or mixed. A
discrete function has a finite (or countable) number of unique values, which are referred to as labels.
These labels can always be mapped to an integer range [1, . . . , #(Ran(f))], where #(E) represents
the number of elements or cardinality of a set. A continuous function has an infinite number of
possible values, while a mixed function contains both discrete and continuous data.
In our presentation, we distinguish between the following aspects of the subject.
• Typical families of methods: linear models, support vector machines, neural networks, . . .
As an example, we demonstrate the use of visualization tools with the Iris flower dataset. The
Iris dataset was introduced by the British statistician, eugenicist, and biologist Ronald Fisher in
his 1936 paper “The use of multiple measurements in taxonomic problems”. It consists of 150
samples of Iris flowers, with 50 samples from each of three species: Iris setosa, Iris virginica, and
Iris versicolor. Each sample has four features: the length and width of the sepals and petals,
measured in centimeters.
Non-parametric density estimation. The density of the input data is estimated using a
kernel density estimate (KDE). We assume that (x_1, x_2, . . . , x_{N_X}) are independent and identically
distributed samples drawn from a univariate distribution with unknown density f. Our goal is to
estimate the shape of this function f at any given point x, and the kernel density estimator is
given by
f̂_h(x) = (1/N_X) Σ_{i=1}^{N_X} k_h(x − x_i) = (1/(N_X h)) Σ_{i=1}^{N_X} k((x − x_i)/h),
where k is a kernel (say any non-negative function, at this stage) and h > 0 is a smoothing
parameter called the bandwidth.
KDE is a popular method for estimating the probability density function of a random variable. A
key factor in obtaining an accurate density estimate is the choice of the kernel and the smoothing
bandwidth. The kernel function determines the shape of the estimated density, while the bandwidth
controls the amount of smoothing applied to the data. An appropriate bandwidth for kernel density
estimation strikes a balance between over-smoothing, which can obscure important features of the
underlying distribution, and under-smoothing, which can result in a noisy estimate that does not
accurately capture the true shape of the data. Common kernel functions used in KDE include
uniform, triangular, biweight, triweight, Epanechnikov, normal, and others.
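As an illustration of the estimator above, here is a minimal NumPy sketch of a one-dimensional KDE with a Gaussian kernel; the function name, the synthetic data, and the rule-of-thumb bandwidth are illustrative choices and are not part of CodPy.

import numpy as np

def gaussian_kde_1d(samples, grid, h):
    # f_h(x) = (1 / (N * h)) * sum_i k((x - x_i) / h), with a Gaussian kernel k.
    diffs = (grid[:, None] - samples[None, :]) / h            # shape (len(grid), N)
    k = np.exp(-0.5 * diffs ** 2) / np.sqrt(2.0 * np.pi)      # kernel values
    return k.sum(axis=1) / (samples.size * h)

# Illustrative usage: Silverman's rule of thumb for the bandwidth h.
x = np.random.normal(size=500)
h = 1.06 * x.std() * x.size ** (-1.0 / 5.0)
grid = np.linspace(-4.0, 4.0, 200)
density = gaussian_kde_1d(x, grid, h)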
Scatter plot. A scatter plot is a way to visualize data by displaying it as a collection of points.
Each point represents a single observation in the dataset, with the value of one variable plotted on
the horizontal axis and the value of another variable plotted on the vertical axis. This allows us to
see the relationship between the two variables and identify any patterns or trends in the data.
[Figure: per-feature plots of the Iris dataset — sepal length (cm), sepal width (cm), petal length (cm), petal width (cm).]
Heat map. The correlation matrix of n random variables x_1, . . . , x_n is the n × n matrix whose (i, j)
entry is corr(x_i, x_j). Thus the diagonal entries are all identically unity.
Summary plot. The summary plot is a visualization tool that displays multiple plots in a grid
format. It is used to visualize the relationship between different features of a dataset. In this
plot, the density of each feature is displayed on the diagonal. The kernel density estimate plot is
displayed on the lower diagonal, which shows the estimated probability density function of the
data. The scatter plot is displayed on the upper diagonal, which shows the relationship between
two features by plotting them against each other. Overall, the summary plot provides a quick and
intuitive way to explore the relationship between different features of a dataset.
[Figure: correlation matrix heat map.]
Popular measures of the similarity between two probability distributions are the so-called f-divergences, which can be classified as follows. Let f : (0, ∞) → R
be a convex function with f (1) = 0. Let P and Q be two probability distributions on a discrete
measurable space (X , F). If P is absolutely continuous with respect to Q, then the f -divergence is
defined as
D_f(P ∥ Q) = E_Q[ f(dP/dQ) ] = Σ_x Q(x) f( dP(x) / dQ(x) ).
For instance, the Hellinger distance is given by
H(P, Q) = (1/√2) ∥ √dP − √dQ ∥_2 .
5 A. Müller, “Integral probability metrics and their generating classes of functions”, Advances in Applied Probability.
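To make the definition concrete, the following sketch evaluates a discrete f-divergence; the function names are illustrative, and the choice f(t) = t log t recovers the Kullback-Leibler divergence.

import numpy as np

def f_divergence(p, q, f):
    # D_f(P||Q) = sum_x Q(x) f(dP(x)/dQ(x)); assumes P << Q (q > 0 wherever p > 0).
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = q > 0
    return float(np.sum(q[mask] * f(p[mask] / q[mask])))

def f_kl(t):
    # f(t) = t log t, with the convention 0 log 0 = 0.
    t = np.asarray(t, float)
    out = np.zeros_like(t)
    out[t > 0] = t[t > 0] * np.log(t[t > 0])
    return out

kl_divergence = f_divergence([0.2, 0.5, 0.3], [0.25, 0.25, 0.5], f_kl)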
[Figure: summary plot (pair plot) of the four Iris features — sepal length (cm), sepal width (cm), petal length (cm), petal width (cm) — with densities on the diagonal, pairwise scatter plots, and a correlation annotation of 0.95.]
When the ground-truth values f(Z) are available, a natural performance indicator is the relative error
∥f_z − f(Z)∥_{ℓ^p} / ( ∥f_z∥_{ℓ^p} + ∥f(Z)∥_{ℓ^p} ), 1 ≤ p ≤ ∞. (2.3.3)
This produces an indicator with values ranging between 0 and 1, where smaller values indicate
better performance. It can be interpreted as a percentage of error. In finance, this concept is
sometimes referred to as the “basis point indicator”.
Cross validation scores. The cross validation score involves randomly selecting a subset of the
training set as the test set, and then calculating a score or RMSE type error analysis for each run.
This process is repeated multiple times with different randomly selected test sets, and the results
are averaged to give an estimate of the model performance on unseen data. For more information,
see the dedicated page on the scikit-learn website.
A confusion matrix is a performance evaluation tool for supervised machine learning algorithms
that are used for classification tasks. It is a matrix representation of the number of predicted and
actual labels for each class in the data. The matrix has dimensions equal to the number of classes
in the data, with rows representing the actual classes and columns representing the predicted
classes. The diagonal elements of the matrix represent the number of correct predictions for each
class, while off-diagonal elements represent incorrect predictions.
For example, consider a binary classification problem where we are trying to predict whether an
email is spam or not. The confusion matrix for this problem would have two rows and two columns,
with one row and column for spam and the other for non-spam. The diagonal elements of the
matrix would represent the number of correctly classified spam and non-spam emails, while the
off-diagonal elements would represent the number of misclassified emails. Its common form is thus
a 2 × 2 table recording the numbers of true positives, false positives, false negatives, and true negatives.
The confusion matrix can be used to compute various performance measures for the classification
algorithm, such as accuracy, precision, recall, and F1 score. These measures are calculated based
6 See the scikit-learn metrics module: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
on the number of true positives, false positives, true negatives, and false negatives in the matrix.
Other performance indicators such as Rand Index and Fowlkes-Mallows scores can also be derived
from the confusion matrix.
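For the binary spam example above, the confusion matrix and the derived scores can be obtained with scikit-learn; the label vectors below are made up purely for illustration.

from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

# 1 = spam, 0 = non-spam (illustrative labels only).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)   # rows: actual classes, columns: predicted classes
tn, fp, fn, tp = cm.ravel()             # binary layout: [[TN, FP], [FN, TP]]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("F1 score :", f1_score(y_true, y_pred))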
Norm of output. If no ground truth values are known, the quality of the prediction f_z depends
on a priori error estimates or error bounds. Such estimates exist only for kernel methods (to
the best of the authors' knowledge) and are described in the next chapter. They use the norm of
functions, which has proven to be a useful indicator in applications.
ROC curves. The receiver operating characteristic (ROC) is a graphical representation of a
binary classifier performance as its discrimination threshold is varied. Originally developed for
military radar operators in 1941, the ROC curve plots the true positive rate (TPR) against the
false positive rate (FPR) as the threshold is adjusted. These metrics are summarized in the
following table:
Precision (P RE) is another useful metric for evaluating binary classifiers. It measures the fraction
of correct positive predictions among all positive predictions, and is calculated as:
PRE = TP / (TP + FP).
For multi-class models, we can use micro-averaging or macro-averaging to combine precision scores
across classes. Micro-averaging calculates precision from the total numbers of true positives, true
negatives, false positives, and false negatives of a k-class model:
PRE_micro = (TP_1 + . . . + TP_k) / (TP_1 + . . . + TP_k + FP_1 + . . . + FP_k).
Macro-averaging, on the other hand, averages the per-class precision scores:
PRE_macro = (PRE_1 + . . . + PRE_k) / k.
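The two averaging strategies correspond to the average argument of scikit-learn's precision_score; a small sketch with made-up labels for a 3-class problem:

from sklearn.metrics import precision_score

y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 2, 0, 2]

# Micro-averaging pools the TP/FP counts of all classes before dividing.
print(precision_score(y_true, y_pred, average="micro"))
# Macro-averaging computes PRE_k for each class and takes their unweighted mean.
print(precision_score(y_true, y_pred, average="macro"))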
Inertia. The inertia functional is defined as the sum of the squared distances between each point
in X and its assigned centroid in Y, that is, I(Y) = Σ_{i ≤ N_x} min_{j ≤ N_y} ∥x^i − y^j∥².
We emphasize that this functional need not be convex, even if the distance measure is convex.
The k-means algorithm computes the cluster centers y by minimizing the inertia functional, where
y is referred to as the set of centroids.
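A minimal NumPy sketch of the inertia functional just described (the function name is illustrative):

import numpy as np

def inertia(X, Y):
    # Sum over the points of X of the squared distance to the closest centroid in Y.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)   # (Nx, Ny) squared distances
    return float(d2.min(axis=1).sum())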
Kolmogorov-Smirnov test. In order to illustrate our claims, we will use three statistical
indicators that measure different types of distances between two distributions X and Y . The first
two tests are based on one-dimensional cumulative distribution functions and are performed on
each axis separately. The third test is based on the discrepancy error.
The Kolmogorov–Smirnov test is a one-dimensional statistical test that involves the computation of
the supremum norm of the difference between the empirical cumulative distribution functions of
two distributions X and Y:
∥cdf(X) − cdf(Y)∥_{ℓ^∞} ≤ c_N / √N,
where cdf(X) denotes the empirical cumulative distribution function of a distribution X, and c_N
is a threshold corresponding to a confidence level, a classical choice being the constant c_N
corresponding to a 95% confidence that both distributions coincide. For multidimensional distributions,
this test can be performed on each axis independently, validating similarity between the marginals,
but not between the full distributions. Nevertheless, it is a very popular test that we use all along this book.
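The axis-by-axis test can be run with scipy.stats.ks_2samp; a minimal sketch for two multidimensional samples (variable and function names are illustrative):

import numpy as np
from scipy.stats import ks_2samp

def ks_per_axis(X, Y):
    # Two-sample Kolmogorov-Smirnov test applied to each coordinate separately;
    # only the marginal distributions are compared, not the joint distribution.
    return [ks_2samp(X[:, d], Y[:, d]) for d in range(X.shape[1])]

results = ks_per_axis(np.random.normal(size=(500, 2)),
                      np.random.normal(size=(400, 2)))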
D Nx Ny Nz
2 2500 2500 2500
2 1600 1600 1600
2 900 900 900
2 400 400 400
2 2500 2500 2500
2 1600 1600 1600
2 900 900 900
2 400 400 400
A comparison between methods. We compared CodPy periodic kernels with other machine
learning models, including SciPy RBF kernel regression, support vector regression (SVR), decision
tree (DT), AdaBoost, XGBoost, and random forest (RF) from the scikit-learn library, and a TensorFlow
neural network (NN) model. For the kernel-based methods, the only external parameter is the choice of
kernel, which will be discussed later in this monograph. For SVR, we used the RBF kernel. For DT,
we set the maximum depth to 10. For RF and XGBoost, we set the number of estimators to 10
and 5, respectively, and the maximum depth to 5. For the feed-forward NN, we used 50 epochs
with a batch size of 16 and the Adam optimization algorithm with mean squared error as the loss
function. The NN was composed of two hidden layers (64 cells each), one input layer (8 cells), and
one output layer (1 cell), with the sequence of activation functions ReLU - ReLU - ReLU - Linear.
All other hyperparameters in the models were left at their default values in scikit-learn, SciPy, and TensorFlow.
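For reference, the feed-forward network described above could be written in Keras roughly as follows; this is a sketch of the stated architecture (assuming one-dimensional inputs, as in this test), not the exact benchmark script, and the training-data names are placeholders.

import tensorflow as tf

# Input layer of 8 cells, two hidden layers of 64 cells, one linear output cell,
# trained with Adam on a mean-squared-error loss (50 epochs, batch size 16).
model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="linear"),
])
model.compile(optimizer="adam", loss="mse")
# model.fit(x_train, fx_train, epochs=50, batch_size=16)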
In Figure 2.9, we can observe the extrapolation performance of each method. It is evident that the
periodic kernel-based method outperforms the other methods in the extrapolation range between
[−1.5, −1] and [1, 1.5]. This finding is also supported by Figure 2.10, which shows the RMSE error
for different sample sizes Nx .
It is important to note that the choice of method does not affect the function norms and the
discrepancy errors. Although the periodic kernel-based method performs better in this example,
our goal is not to establish its superiority. Instead, we aim to present a benchmark methodology,
especially when extrapolating test set data that are far from the training set.
Figure 2.9: Periodic kernel: CodPy, RBF kernel: SciPy, SVR: Scikit, Neural Network: TensorFlow,
Decision tree: Scikit, Adaboost: Scikit, XGBoost, Random Forest: Scikit
[Figure 2.10: scores, discrepancy errors, and execution times for each method (AdaBoost, Decision tree, RForest, SVM, Tensorflow, XGboost, codpy extrapolation, scipy prediction) as functions of the number of training points.]
A comparison between methods. We compare the performance of two models for function
extrapolation: CodPy periodic Gaussian kernel and SciPy RBF kernel. We assess their accuracy
on the first two scenarios defined in Table 2.3 and present the results in the first two graphs of
Figure 2.12, which show the RBF kernel predictions. The last two graphs in the figure show the
periodic Gaussian kernel predictions.
Figure 2.12: RBF (first and second) and periodic Gaussian kernel (third and fourth)
2.4.4 Clustering
Description. We briefly overview here our methodology (which will be fully described in the next
chapter). Specifically, we proceed as follows.
• Demonstrate the prediction function Pm for some methods in the context of supervised
learning. Compute some performance indicators and present a toy benchmark using these
indicators.
• To generate data, we use a multimodal and multivariate Gaussian distribution with a
covariance matrix Σ = σId . The goal is to identify the modes of the distribution using a
clustering method.
We will generate distributions with a predetermined number of modes, which will enable us to test
validation scores on this toy example.
A comparison between methods. In this section, we evaluate and compare the performance
of the CodPy clustering by MMD minimization with the Scikit-learn implementation of the k-means
algorithm, in order to identify the modes of a multimodal and multivariate Gaussian distribution. We
generate distributions with different numbers of modes (ranging from 2 to 6) and test validation scores
on this toy example.
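A sketch of the scikit-learn side of this comparison, sampling a multimodal Gaussian distribution and fitting k-means; the sampling parameters are illustrative, and the CodPy MMD-based counterpart is described in Chapter 4.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_modes, sigma, n_per_mode = 4, 1.0, 100

# Multimodal, multivariate Gaussian: covariance sigma * Id around random centers.
centers = rng.uniform(-5.0, 10.0, size=(n_modes, 2))
X = np.vstack([c + sigma * rng.standard_normal((n_per_mode, 2)) for c in centers])

km = KMeans(n_clusters=n_modes, n_init=10, random_state=0).fit(X)
print("inertia:", km.inertia_)                  # sum of squared distances to closest centroid
labels, centroids = km.labels_, km.cluster_centers_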
Figure 2.14 displays the computed clusters using a k-means algorithm (top row) and the MMD
minimization (bottom row) for two different scenarios. The four confusion matrices in the figure
correspond to the two clustering methods for each scenario.
[Figure 2.14: computed clusters for the k-means algorithm (top row) and the MMD minimization (bottom row) in two scenarios, with the corresponding confusion matrices for k-means and CodPy.]
We evaluate the performance of various methods using performance indicators, as shown in Figure
2.15. To assess the performance of the algorithms, we use inertia as the metric since it is a common
measure of clustering quality. The MMD error indicates the degree to which two samples are the
same, and it is computed at different sample sizes. The results of this test are summarized in Table
2.6 in the appendix to this chapter.
Overall, our aim is to offer a thorough comparison of the two clustering methods. This will enable
readers to make informed decisions about which method is best suited for different scenarios.
[Figure 2.15: scores, inertia, discrepancy errors, and execution times for the CodPy and k-means clustering methods, as functions of the number of clusters Ny.]
2.5 Bibliography
XGBoost7 is a computationally efficient implementation of the original gradient boost algorithm
and is commonly used for large-scale data sets with complex features. TensorFlow8 is a popular
library for building and training neural networks, often used for image and speech recognition.
PyTorch9 is another popular library for building and training neural networks, known for its
dynamic computational graph and ease of use. Scikit-learn10 offers a comprehensive set of models
for linear, SVM, and feature selection methods, making it a popular choice for general machine
learning tasks. TensorFlow Probability11 is a recent addition to the TensorFlow library and focuses
on probabilistic modeling and Bayesian inference.
7 See this dedicated page for a description of the XGBoost project.
8 See this dedicated page for a description of TensorFlow neural networks.
9 See this dedicated page for a description of PyTorch neural networks.
10 See this dedicated page for a description of the Scikit-learn library.
11 See this dedicated page for a description of the TensorFlow Probability library.
2.6 Appendix to Chapter 2
Results concerning the clustering methods. In this test, we evaluate and compare the
performance of two different clustering methods, the CodPy clustering by MMD minimization and the
Scikit-learn implementation of the k-means algorithm.
Chapter 3
Basic notions about reproducing kernels
Let us point out immediately that, throughout this chapter, we will illustrate our notions for the
dimensions given in the tables for extrapolation and for interpolation, and with a choice of function
consisting of the sum of a periodic function and a direction-wise increasing function, given by
f(x) = f(x_1, . . . , x_D) = Π_{d=1,...,D} cos(4π x_d) + Σ_{d=1,...,D} x_d,   x ∈ R^D. (3.1.2)
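For reference, a NumPy implementation of the test function (3.1.2); the function name is illustrative.

import numpy as np

def test_function(x):
    # f(x) = prod_d cos(4*pi*x_d) + sum_d x_d, for x of shape (N, D).
    x = np.atleast_2d(x)
    return np.prod(np.cos(4.0 * np.pi * x), axis=1) + np.sum(x, axis=1)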
D Nx Ny Nz
2 576 576 576
This numerical example will be useful in order to point out certain features enjoyed by the prediction
(Z, fZ ), and compare it with the training set (X, f (X)).
Furthermore, we propose to introduce an additional variable denoted by Y , and we distinguish
between several cases of interest. Throughout we use the notation Y ∈ RNy ,D and fY ∈ RNy ,Df ,
which is consistent with our notation X ∈ RNx ,D , Z ∈ RNz ,D while f (X) ∈ RNx ,Df and fZ ∈
RNz ,Df .
• The choice Y = X (that is, N_y = N_x) corresponds to data extrapolation (as will be explained later).
• The choice N_y << N_x corresponds to data interpolation (as will also be explained later).
D Nx Ny Nz
2 576 32 576
Hence, Figure 3.1 shows results obtained for a typical problem of machine learning. In the following
discussion, we often focus on the choice made in the first test. The left-hand plots show the
(variable, value) training set (X, fX ), while the right-hand plot shows the (variable, value) test set
(Z, fZ ). The middle plots show the (variable, value) parameter set (Y, fY ). The crucial role played
by the additional variable Y will be discussed later on: basically, it helps not only for the overall
accuracy of the algorithm, but also for its overall computational cost.
Keeping in mind the above illustrative example, we now proceed with the definition and basic
properties of kernels and maps of interest.
3. Test Set: Now, after training our model, we want to test its accuracy. To that aim, consider
a new set of images that the system has never considered before. This is our test set Z. If
we have Nz such test images, each represented in D dimensions, then Z ∈ RNz ,D .
4. Test Values: Our goal is to predict the labels (or identifiers) for each image in our test set.
These predicted labels are our test values fZ . For each test image z n , we want to predict a
label f_z^n. The collection of test images and their predicted labels is then denoted by (Z, f_Z).
Figure 3.1: Examples of (training, parameter, test) sets for three different Y
In this facial recognition context, the training set is a collection of known faces with their associated
names (or identification numbers, etc.). The test set is a collection of new faces, and our goal is to
predict their names based on what our system learned from the training set.
Given a kernel k : R^D × R^D → R and two collections of points X = (x^1, . . . , x^{N_x}) and
Y = (y^1, . . . , y^{N_y}) in R^D, the associated kernel matrix K(X, Y) ∈ R^{N_x, N_y} is defined by
K(X, Y) = ( k(x^i, y^j) )_{1 ≤ i ≤ N_x, 1 ≤ j ≤ N_y}. (3.2.1)
We say that k is a positive kernel if, for any collection of distinct points X ∈ RNx ,D and for any
collection c1 , ..., cNx ∈ RNx that is not identically vanishing, we have
Σ_{1 ≤ i,j ≤ N_x} c_i c_j k(x^i, x^j) > 0. (3.2.2)
Kernel k(x, y):
1. Dot product: k(x, y) = x^T y
2. ReLU: k(x, y) = max(x − y, 0)
3. Gaussian: k(x, y) = exp(−π |x − y|²)
4. Periodic Gaussian: k(x, y) = Π_d θ₃(x_d − y_d)
5. Matern: k(x, y) = exp(−|x − y|)
6. Matern tensorial: k(x, y) = Π_d exp(−|x_d − y_d|)
7. Matern periodic: k(x, y) = Π_d (1 + exp(1)) / ( exp(|x_d − y_d|) + exp(1 − |x_d − y_d|) )
8. Multiquadric: k(x, y) = ( 1 + |x − y|²/c² )^{1/2}
9. Multiquadric tensorial: k(x, y) = Π_d ( 1 + (x_d − y_d)²/c² )^{1/2}
10. Sinc square tensorial: k(x, y) = Π_d ( sin(π(x_d − y_d)) / (π(x_d − y_d)) )²
11. Sinc tensorial: k(x, y) = Π_d sin(π(x_d − y_d)) / (π(x_d − y_d))
12. Tensor: k(x, y) = Π_d max(1 − |x_d − y_d|, 0)
13. Truncated: k(x, y) = max(1 − |x − y|, 0)
14. Truncated periodic
Here is a brief list of applications in which certain kernels are especially useful.
• The ReLU kernel or rectified linear unit kernel yields the maximum value between the
difference of two given inputs and 0. This kernel is commonly used as an activation function
in neural networks, which are widely used for image recognition, natural language processing,
and related applications.
• The Gaussian kernel assigns higher weights to points that are closer to the center, making it
useful for tasks such as image recognition, where we want to assign higher weights to pixels
that are closer together. It is also commonly used in algorithms of clustering or dimensionality
reduction.
• The multiquadric kernel and their associated tensor versions are based on radial basis functions
and are very useful for smoothing and interpolation of scattered data. They are commonly
used in weather forecasting, seismic analysis, and computer graphics.
• The Sinc kernel and Sinc square kernel in tensorial form are used in signal processing and
image analysis. They model quite accurately some features, such as the periodicity in signals
or images. They are commonly used in applications such as speech recognition, image
denoising, and pattern recognition.
Furthermore, we emphasize that a scaling of such basic kernels is usually required in order to
properly handle the input data. This is precisely the purpose of the transformation maps, discussed
later on.
Examples. A mapping S : R^D → R^P and a function g : R → R being given, we construct a new
kernel by setting
k(x, y) = g(⟨S(x), S(y)⟩_{R^P}), x, y ∈ R^D,
in which g is called the activation function and ⟨·, ·⟩ denotes the standard scalar product. In
particular, this includes the scalar product between successive powers of the coordinate functions
x_d and y_d. One such standard choice (c being a constant) is actually non-symmetric, hence does not
directly fit in our framework, but it is included in our library since it provides a useful and very standard choice.
Consider next the so-called tensornorm kernel (described below) with the relevant parameters
specified in Section 3.2.1. Then we can compute its associated kernel matrix by using our function
denoted by op.Knm in CodPy.
[Figure 3.2: one-dimensional plots of the available kernels, including invquadratictensor, maternnorm, maternper, materntensor, multiquadricnorm, multiquadricper, multiquadrictensor, scalar_product, sincardsquaretensor, sincardtensor, tensornorm, and truncatednorm; axes in x-units and f(x)-units.]
Typical values for this matrix are presented in Table 3.4, which
includes the first four rows and columns.
Table 3.4: First four rows and columns of the kernel matrix K(X, Y )
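Independently of the CodPy routine op.Knm mentioned above, the kernel matrix (3.2.1) can be assembled directly; a NumPy sketch for the Gaussian kernel of Section 3.2.1 (function names are illustrative):

import numpy as np

def gaussian_kernel_matrix(X, Y):
    # K(X, Y)[i, j] = exp(-pi * |x^i - y^j|^2), with X of shape (Nx, D), Y of shape (Ny, D).
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-np.pi * sq_dists)

X = np.random.uniform(-1.0, 1.0, size=(5, 2))
Y = np.random.uniform(-1.0, 1.0, size=(3, 2))
K = gaussian_kernel_matrix(X, Y)   # shape (5, 3)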
Inverse of a kernel matrix. The inverse of a kernel matrix, K(X, Y)^{-1}, is computed in two ways,
depending on whether X = Y or X ≠ Y: when X = Y, the inverse is computed by a direct, possibly
Tikhonov-regularized, inversion of the square matrix K(X, X), while otherwise a least-squares
pseudo-inverse is used (a sketch of both regimes is given after the list below).
Table 3.5: First four rows and columns of an inverted kernel matrix K(X, Y)^{-1}
Observe that, in the following instances, the product matrix K(X, Y )K(X, Y )−1 in Table 3.5 may
not coincide with the identity matrix.
• If Nx ̸= Ny .
• If ϵ > 0, the Tikhonov regularization parameter is used to adjust the solution for better
stability. While the user can choose ϵ = 0, in certain cases this will lead to performance
issues. For example, if the kernel is not unconditionally positive definite, the CodPy library
may raise an exception, and switch from the standard matrix inversion method to an adapted
method for non-invertible matrices, which can be computationally costly.
• If the choice of the kernel happens to lead to a matrix K(X, X)K(X, X)−1 that does not
have full rank, for instance when we use a linear regression kernel (cf. Section 3.4), the matrix
becomes a projection on the null space of K(X, X).
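The following sketch mirrors the two inversion regimes discussed above, with a Tikhonov-regularized solve for square matrices and a least-squares pseudo-inverse otherwise; it is an illustration of the idea, not the CodPy implementation.

import numpy as np

def kernel_solve(K, b, eps=1e-8):
    # Solve K a = b: Tikhonov regularization (K + eps * I) in the square case,
    # least-squares pseudo-inverse when K is rectangular.
    K = np.asarray(K, dtype=float)
    if K.shape[0] == K.shape[1]:
        return np.linalg.solve(K + eps * np.eye(K.shape[0]), b)
    return np.linalg.lstsq(K, b, rcond=None)[0]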
Distance matrices. Distance matrices provide a very useful tool in order to evaluate the accuracy
of a computation. To any positive kernel k : R^D × R^D → R, we associate the distance function
d_k(x, y) defined (for x, y ∈ R^D) by
d_k(x, y)² = k(x, x) − 2 k(x, y) + k(y, y).
For positive kernels, d_k(·, ·) is continuous, non-negative, and satisfies the condition d_k(x, x) = 0
(for all relevant x).
For a collection of points X = (x^1, . . . , x^{N_x}) and Y = (y^1, . . . , y^{N_y}) in R^D, we define the associated
distance matrix D(X, Y) ∈ R^{N_x, N_y} by
D(X, Y) = ( d_k(x^i, y^j) )_{1 ≤ i ≤ N_x, 1 ≤ j ≤ N_y}. (3.2.5)
Distance matrices are crucial in a myriad of applications, particularly in addressing clustering and
classification challenges.
Table 3.6 shows the first four columns of the kernel-based distance matrix D(X, Y ). As expected,
the diagonal values are all vanishing.
Table 3.6: First four rows and columns of a kernel-based distance matrix D(X, Y )
3.2.2 Maps
A map is a function that transforms data from one space to another. When dealing with kernels,
we use maps in order to transform our input data in a way that makes it easier for our kernel
function to capture the underlying patterns or structures. Mappings, often denoted by S, take
input from RT and generate an output in RD , where T and D, by definition, are the dimensions of
the input and output spaces, respectively. We distinguish between the following maps.
• rescaling maps correspond to the choice T = D and are used in order to fit data X, Y, Z to
the range associated with a given kernel.
• dimension-reduction maps correspond to the choice T ≤ D.
• dimension-increasing maps correspond to the choice T ≥ D, and are useful when adding
information to the training set is required. Such a transformation might be loosely called a
kernel trick.
The list of rescaling maps available in our framework can be found in Table 3.7.
Maps and formulas:
1. Scale to standard deviation: S(X) = (x − μ)/σ, with σ² = (1/N_x) Σ_{n<N_x} (x^n − μ)² and μ = (1/N_x) Σ_{n<N_x} x^n.
2. Scale to erf: S(X) = erf(x), where erf is the standard error function.
3. Scale to erfinv: S(X) = erf⁻¹(x), where erf⁻¹ is the inverse of erf.
4. Scale to mean distance: S(X) = x/√α, α = Σ_{i,k≤N_x} |x^i − x^k|² / N_x².
5. Scale to min distance: S(X) = x/√α, α = (1/N_x) Σ_{i≤N_x} min_{k≠i} |x^i − x^k|².
6. Scale to unit cube: S(X) = (x − min_n x^n + 0.5/N_x)/α, α = max_n x^n − min_n x^n.
Applying a map S is equivalent to replacing a kernel k(x, y) by the kernel k(S(x), S(y)). For
instance, the use of the “scale-to-min distance map” is usually a good choice for Gaussian kernels,
as it scales all points to the average minimum distance. As an example, we can transform the given
Gaussian kernel using such a map. Note that the Gaussian setter function, by construction, uses
the default map set_min_distance_map. We refer the reader to a later discussion of all optional
parameters.
kernel_setters.set_gaussian_kernel(polynomial_order: int = 0,
                                   regularization: float = 1e-8,
                                   set_map = map_setters.set_min_distance_map)
Finally, in Figure 3.3 we illustrate the action of maps on our kernels. Here, we should compare
the two-dimensional results generated with maps to the one-dimensional results generated without
maps, given earlier in Figure 3.2.
More generally, a functional space denoted by H_k could also be defined, at least formally (or by
applying a further completion argument which we are not going to elaborate upon here), by
H_k = { Σ_m a_m k(·, x^m) : a_m ∈ R, x^m ∈ R^D },
which consists of all linear combinations of the functions k(x, ·) and is endowed with the scalar
product
⟨k(·, x), k(·, y)⟩_{H_k} = k(x, y), x, y ∈ R^D. (3.2.8)
In every finite dimensional subspace H_k^X ⊂ H_k, according to the expression of the scalar product
we can write
⟨k(·, x^i), k(·, x^j)⟩_{H_k^X} = k(x^i, X) K(X, X)^{-1} k(X, x^j) = k(x^i, x^j), i, j = 1, ..., N_x. (3.2.9)
The norm of a function f in the space H_k depends upon the choice of the kernel k. A reasonable
approximation of this norm can be induced by the kernel matrix K and is given by the expression
∥f∥²_{H_k} ∼ f(X)^T K(X, X)^{-1} f(X).
Of course, this norm could be computed after a rescaling of the kernel based on a map. Finally, we
point out that the norm can be computed in CodPy by using a dedicated function.
[Figure 3.3: the kernels of Figure 3.2, rescaled by the transformation maps (two-dimensional plots).]
Importantly, the projection operator P_k is linear in terms of both the input and the output data. Hence,
while keeping the set Y to a reasonable size, we can consider large sets of data as input or output.
Furthermore, choosing a well-adapted set Y is often a major source of optimization. We are going
to use this idea intensively in several applications. For instance, the kernel clustering method
(which we will describe later on) aims at minimizing the error implied by our learning machine with
respect to the set Y = P_k(X, Z). This technique also connects with the idea of sharp discrepancy
sequences to be defined later on. We refer to this step as a learning process, since it is exactly
the counterpart of the weight set for the neural network approach. This construction amounts to
defining a feed-backward machine, analogous to (3.3.1), by
f_z = P_k(X, P_k(X, Z), Z) f(X).
for any vector-valued function f : R^D → R^{D_f}. Observe that this formula is computationally
realistic and can be systematically applied in order to check the validity of a given kernel machine.
Moreover, it can also be combined with any other type of error measure. We also emphasize the
following error formula:
∥f(Z) − f_z∥_{ℓ²(N_z)^{D_f}} ≤ ( d_k(X, Y) + d_k(Y, Z) ) ∥f∥_{H_k}, (3.3.7)
which we refer to as the discrepancy functional. This distance is also known in the literature as the
maximum mean discrepancy (MMD) (first introduced in [14]). It is a rather natural quantity, and
we expect that the accuracy of an extrapolation diminishes when the extrapolation set Z becomes
very different from the sampling set X. This distance is defined by
d_k(X, Y)² = (1/N_x²) Σ_{n,m=1}^{N_x} k(x^n, x^m) + (1/N_y²) Σ_{n,m=1}^{N_y} k(y^n, y^m) − (2/(N_x N_y)) Σ_{n=1}^{N_x} Σ_{m=1}^{N_y} k(x^n, y^m). (3.3.8)
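A direct NumPy evaluation of the squared discrepancy (3.3.8) is straightforward; the Gaussian kernel and the function names below are illustrative choices.

import numpy as np

def gaussian_k(A, B):
    # Kernel matrix k(a, b) = exp(-pi * |a - b|^2).
    return np.exp(-np.pi * ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1))

def mmd_squared(X, Y, k=gaussian_k):
    # d_k(X, Y)^2 = mean k(x, x') + mean k(y, y') - 2 * mean k(x, y), as in (3.3.8).
    return float(k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean())

X = np.random.uniform(-1.0, 1.0, size=(200, 2))
Y = np.random.normal(size=(150, 2))
print(mmd_squared(X, Y))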
In this discussion, we are given two kernels denoted by k_i(x, y) : R^D × R^D → R (with i = 1, 2), and
their corresponding matrices are denoted by K_1 and K_2. According to (3.3.1), we introduce the
two projection operators P_{k_1}(X, Y, Z) and P_{k_2}(X, Y, Z).
In order to work with multiple kernels, in CodPy we provide two Python functions, referred to as
basic setters and getters:
get_kernel_ptr() and set_kernel_ptr(kernel_ptr).
The former allows us to recover a kernel that was previously input in our library, while the latter
enables us to incorporate the choice of a new kernel into our framework.
where ◦ denotes the Hadamard product of two matrices. The functional space generated by k_1 · k_2
is
H_k = { Σ_{1≤m≤N_x} a_m k_1(·, x^m) k_2(·, x^m) }. (3.4.5)
where k(x, y) = φ1 ∗ φ2 (x − y) is the convolution of the two kernels.
Hence, this doubles up the coefficients (4.2.1). We define its inverse matrix by concatenation:
K^{-1}(X, Y) = ( K_1(X, Y)^{-1}, K_2(X, Y)^{-1} (I_{N_x} − π_1(X, Y)) ) ∈ R^{2N_y, N_x}. (3.4.10)
The first kernel then amounts to a classical polynomial regression, which enables an exact matching of the
moments of a distribution, and any remaining error can be effectively handled by the second kernel k_2.
Importantly, this combination of kernels provides a powerful framework for modeling and capturing complex
relationships between variables.
Figure 3.4: A ground truth value (first), Gaussian (second) and Matern kernels (third) with mean
distance map
Composition of maps. Within our framework, we frequently employ maps to preprocess input
data prior to the computation based on kernel functions or using model fitting. Each map, with its
unique features, can be combined with other maps in order to craft more robust transformations.
As an illustrative example, we have constructed a composite map (termed a Swiss-knife map) for
Gaussian kernels, which implements multiple operations on the data.
Our composite map starts by rescaling all data points to fit within a unit hypercube. Next, the map
applies the transformation S(X) = erf⁻¹(2X − 1), based on the inverse of the standard error function.
This particular transformation is commonly employed to map data points toward a standard normal
distribution, since this has been found to enhance the performance of many machine learning algorithms.
The final step in the composite map process involves the application of the average min distance
map, scaling all points by the average distance for a Gaussian kernel. This map is particularly
efficient for Gaussian kernels; however, it may not be ideally suited for other types of kernels.
The implementation of this composite map in Python is performed in the following manner:
map_setters.set_min_distance_map(**kwargs)
pipe_map_setters.pipe_erfinv_map()
pipe_map_setters.pipe_unitcube_map()
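The same three steps can be written explicitly outside CodPy; the sketch below (with illustrative names and a small clipping safeguard) rescales to the unit cube, applies the inverse error function, and divides by the average minimum distance of Table 3.7.

import numpy as np
from scipy.special import erfinv

def swiss_knife_map(X, clip=1e-9):
    X = np.asarray(X, dtype=float)
    # 1. Rescale each feature to (0, 1), staying away from the endpoints.
    U = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    U = np.clip(U, clip, 1.0 - clip)
    # 2. Map towards a standard normal shape via the inverse error function.
    G = erfinv(2.0 * U - 1.0)
    # 3. Scale by the average minimum pairwise distance (map 5 of Table 3.7).
    d2 = ((G[:, None, :] - G[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)
    alpha = d2.min(axis=1).mean()
    return G / np.sqrt(alpha)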
• First, we can choose Y = X, which corresponds to the extrapolation case and typically
produces the highest accuracy; cf. Section 3.3.2.
• Alternatively, we can randomly select a subset for Y from X, which trades accuracy for
execution time and is better suited for larger training sets.
• Last, we can select Y to be a sharp discrepancy sequence associated with X, as described in
Section 4.3. This provides the best possible accuracy, but requires the use of a time-consuming
numerical algorithm.
To illustrate the impact of different kernels and maps on our learning machine, we consider a
one-dimensional test and compare the predictions achieved by using various kernels.
[Figure: one-dimensional predictions for the linear/periodic kernel, the periodic kernel, the Matern kernel, and the linear regressor kernel, all without maps; axes in x-units and f(x)-units.]
3.5.3 References
The topic of RKHS methods and kernel regressions has undergone extensive research over the past
decades, resulting in a vast body of literature. In our brief list of references provided at the end of
this monograph, we have included a selection of key works.
One notable resource offering a comprehensive introduction to the topic is the monograph by Hastie
et al. [20], which gives fundamental material on statistical learning, including the notions of data
mining, inference, and prediction. This book provides valuable insights into the field. In addition,
the textbook by Berlinet and Thomas-Agnan [3] is an excellent source of material on the use of
reproducing kernels in probability, statistics and related areas.
Another significant contribution to the subject can be found in the work of Smola and collaborators,
which also offers substantial material on the topic. We also point out here the work of Rosipal and
Trejo, which introduces a dimension-reduction technique for least-square models and provides a
valuable perspective on the subject.
For further references, the reader should refer to the bibliography at the end of this monograph.
Chapter 4
Kernel-based operators
4.1 Introduction
We now define and study classes of operators constructed from a reproducing kernel. We start
with interpolation and extrapolation operators, which are of central interest in machine learning as
well as for applications to partial differential equations (PDEs). Next, we introduce a distance-type
measure induced by a kernel, which is referred to as the kernel discrepancy or the maximum mean
discrepancy. This measure is crucial for stating error estimates and designing effective clustering
methods, as we will explain in forthcoming chapters. An important tool in the present chapter
is provided by kernel-based discrete differential operators, such as the gradient and divergence
operators. Such discrete operators will be shown to be useful in various circumstances, especially
for the modeling of physical phenomena described by PDEs.
where δn,m denotes the Kronecker delta symbol (that is, 1 if n = m and 0 otherwise). Figure 4.1
illustrates this notion with an example of four partition functions.
in which we have ∇_z k(Z, Y) ∈ R^{D, N_x, N_y}. To compute the gradient of a vector-valued function f, we apply this operator componentwise.
Figure 4.2: The first two graphs correspond to the first dimension (original on the left-hand,
computed on the right-hand). The next two graphs correspond to the second dimension (original
on the left-hand, computed on the right-hand).
The operator ∇_k^T, by definition, is consistent with the divergence operator and reads
⟨∇_k(X, Y, Z) f(X), g(Z)⟩ = ⟨f(X), ∇_k(X, Y, Z)^T g(Z)⟩.
To compute the operator ∇_k^T, we start with the definition of the gradient operator (4.2.4) and
define, for any f(X) ∈ R^{N_x, D_f} and g(Z) ∈ R^{D, N_z, D_f},
⟨(∇_z K)(Z, Y) K(X, Y)^{-1} f_x, g_z⟩ = ⟨f_x, K(X, Y)^{-T} (∇_z K)(Z, Y)^T g_z⟩.
Figure 4.3: Comparison of the outer product of the gradient to Laplace operator
This operator is used in various applications. In particular, the Laplacian arises when solving PDE
boundary value problems (e.g. Poisson, Helmholtz), and it is involved in many time evolution
problems involving diffusion or propagation, such as the heat or wave equations, or stochastic
martingale processes.
Δ_k^{-1}(X, Y) = (Δ_k(X, Y))^{-1} ∈ R^{N_x, N_x}. (4.2.7)
A two-dimensional example. To illustrate the use of this operator, Figure 4.4 compares the original
function f(X) with the result of applying the inverse Laplace operator to Δ_k(X, Y) f(X), that is,
Δ_k(X, Y)^{-1} Δ_k(X, Y) f(X). This latter operator acts as a projection operator and is therefore stable.
Figure 4.4: Comparison between the original function and the product of the Laplace operator and its inverse
In Figure 4.5, we compute the operator Δ_k Δ_k^{-1} f(X) to check that the pseudo-inverse
commutes, i.e., applying the Laplacian operator and its pseudo-inverse in either order produces the
same result. This property is crucial in many applications of the inverse Laplace operator.
Figure 4.5: Comparison between original function and the product of the inverse of the Laplace
operator and the Laplace operator
∇_k^{-1} = Δ_k^{-1} ∇_k^T ∈ R^{N_x, D N_z}. (4.2.8)
It can be interpreted as a matrix, computed by first considering ∇_k(X, Y, Z) ∈ R^{D, N_z, N_x},
downcasting it to a matrix in R^{D N_z, N_x}, and then performing a least-squares inversion. This
operator acts on any v_z ∈ R^{D, N_z, D_{v_z}} and produces a matrix
∇_k^{-1}(X, Y, Z) v_z ∈ R^{N_x, D_{v_z}}, v_z ∈ R^{D, N_z, D_{v_z}},
coincides with, or at least is a good approximation of, f(X). Figure 4.7 tests the extrapolation
operator (∇_k)^{-1}(Z, Y, Z) (∇_k(X, Y, Z) f(X)).
Figure 4.6: Comparison between the original function and the product of the gradient operator and its
inverse
Figure 4.7: Comparison between the original function and the product of the inverse of the gradient
operator and the gradient operator
A two-dimensional example. We compute ∇_k(X, Y, Z)^T (∇_k^T(X, Y, Z))^{-1} = Δ_k(X, Y, Z) Δ_k(X, Y, Z)^{-1}.
Thus, the following computation should give results comparable to those obtained in our study of
the inverse Laplace operator in Section 4.2.6.
L_k(X, Y)^⊥ = ∇_k(X, Y) Δ_k(X, Y)^{-1} ∇_k^T(X, Y, Z) = ∇_k(X, Y, Z) ∇_k(X, Y, Z)^{-1}.
Figure 4.8: Comparison between the product of the divergence operator and its inverse and the
product of Laplace operator and its inverse
This operator acts on any vector field f (Z) ∈ RD,Nz ,Df , and produces a three-argument object by
performing a matrix multiplication after applying the input vector field:
By using the Leray-orthogonal operator, we can perform an orthogonal decomposition of any vector
field into its divergence-free and curl-free components, which is the key to understanding some
important structure of fluid flows.
In Figure 4.9, we compare the action of this operator on a vector field f (Z) with the original
function (∇f )(Z).
Figure 4.9: Comparing f(z) and the transpose of the Leray operator on each direction
where Id is the identity matrix. This operator allows us to decompose any field as an orthogonal
sum of two components: one part belongs to the range of the Leray operator, and one part is
orthogonal to it:
vz = Lk (X, Y, Z)vz + Lk (X, Y, Z)⊥ vz , < Lk (X, Y, Z)vz , Lk (X, Y, Z)⊥ vz >D,Nz ,Dv = 0.
This decomposition is consistent with the Helmholtz-Hodge decomposition, which represents any
vector field as an orthogonal sum of a gradient and a divergence-free vector:
v = ∇h + ζ, ∇ · ζ = 0, h = ∆−1 ∇ · v.
4.3. A CLUSTERING ALGORITHM 47
From a numerical perspective, we can use a similar decomposition to compute the Helmholtz-Hodge
decomposition. Specifically, we can decompose a vector field into a gradient component and a
divergence-free component by using the Leray operator, namely
Figure 4.10: Comparing f(z) and the Leray operator in each direction
Assuming that this latter problem is well-posed and that the distance functional is convex (this is a
formal argument, since most existing distances are not convex), the cluster set Y = {y^1, . . . , y^{N_y}}
can be computed. Once computed, the index function σ(w, Y) = arg min_{j=1,...,N_Y} d(w, y^j) can be
defined, as for (2.3.4). This function can be extended naturally to define a map:
which acts on the indices of the test set Z. This allows for a comparison of the prediction to a
given, user-desired partition of f (Z), if needed.
Note that the function σ(Z, Y ) is surjective (that is, onto), meaning that multiple points in Z
can be assigned to the same cluster in Y . Therefore, we can define its injective inverse (that is
one-to-one on its image), σ(Z, Y )−1 (n), which describes the points in Z that are assigned to cluster
y n in Y . This construction defines cells denoted as C n = σ(RD , y n )−1 (n), which provide us with a
partition of unity for the space RD .
It is worth noting that, in the context of supervised clustering methods, the training set and its
values X and f (X), along with the index map σ(X, Y ) ∈ [1, . . . , Nx ]Ny defined above, can be used
to make predictions on the test set Z. Specifically, we can define a prediction for a point z ∈ Z as
f_z = f( X^{σ(Y,X)} )^{σ(z,Y)}, (4.3.3)
Here, Σ denotes the set of all maps from [1, . . . , N_y] to [1, . . . , N_x], and any solution
Y = X^σ is referred to as the sharp discrepancy sequence. This minimization
problem is investigated further in Section 4.3.5.
– For some kernels, after the discrete minimization step described above, a simple gradient
descent algorithm is used to obtain a more accurate approximation of (4.3.1). The
algorithm starts with X σ as the initial state and iteratively updates the position of each
point to improve the overall solution. This approach can provide a refined and more
precise solution to the original minimization problem.
• The supervised clustering algorithm involves computing the projection operator (3.3.1), that
maps the test set Z to the closest point in the weight set Y (i.e., the sharp discrepancy
sequence). This results in a prediction fz for each point in the test set. We implement the
projection operator using the Python function (3.3.5): fz = Pk (X, Y, Z)f (X).
• The problem (4.3.4) is at the heart of the algorithm and can be solved using the function:
• To compute the index associations (4.3.2), i.e., the function σ_{d_k}(X, Y), use
alg.distance_labelling(X, Y, . . .); a NumPy sketch of this labelling step is given below.
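Independently of alg.distance_labelling, the index map is simply a row-wise argmin over a distance matrix; a NumPy sketch with the Euclidean distance used for illustration:

import numpy as np

def distance_labelling(Z, Y):
    # sigma(z, Y) = argmin_j d(z, y^j): assign each point of Z to its closest point in Y.
    d2 = ((Z[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)   # (Nz, Ny) squared distances
    return d2.argmin(axis=1)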
[Figure: discrepancy errors and inertia as functions of the number of clusters Ny.]
Consider the functional y ↦ d_k(X, y), where y is randomly generated on the unit cube. This functional
represents the minimum distance to be achieved if one were to consider a single cluster.
An example of smooth kernels: Gaussian. We begin our analysis of the discrepancy functional
by examining the Gaussian kernel family, constructed from the kernel k(x, y) = exp(−π|x − y|²),
which generates functional spaces made of smooth functions.
In Figure 4.12, we show the function y 7→ dk (x, y) in blue color. Additionally, we display the
function dk (x, xn ), n = 1 . . . Nx in Figure 4.12 to demonstrate that this functional is smooth but
neither convex nor concave. Notably, the minimum of this functional is achieved by a point that is
not part of the original distribution X.
For a two-dimensional example, we refer to Figure 4.13 (left-hand) for a display of this functional.
An example of Lipschitz continuous kernels: RELU. Let us now consider a kernel that
generates a functional space with less regularity. The RELU kernel k(x, y) = max(x − y, 0)
essentially generates the space of functions with bounded variation.
As shown in Figure 4.12 (middle), the function y ↦ d_k(x, y) is only piecewise differentiable.
Hence, in some cases, the functional d_k(x, y) might have an infinite number of minimizers (if
a “flat” segment occurs), but a minimum is attained on the set X. Figure 4.13 (middle) displays
the two-dimensional example.
An example of continuous kernels: Matern. The Matern family generates a space of continuous
functions and is defined by the kernel k(x, y) = exp(−|x − y|).
In Figure 4.12, we observe that the function y 7→ dk (x, y) has concave regions almost everywhere,
making it difficult to find a global minimum using a gradient descent algorithm. Figure 4.13-right
displays a two-dimensional example of this functional.
Figure 4.12: Distance functional for the Gaussian, the Matern and the RELU kernels (1D)
Figure 4.13: Distance functional for the Gaussian, the Matern and the RELU kernels (2D)
4.4 Bibliography
The topic of RKHS methods and kernel regressions has been extensively studied and there is a
vast literature on the subject. As mentioned earlier, we provide a list of references at the end of
this monograph. In particular, see the references already indicated at the end of Chapter 3.
Chapter 5
Permutations and optimal transport
5.1 A brief overview of optimal transport
We say that L transports dX into dY, and write $L_\# dX = dY$, called a push-forward. To provide a
specific example, in the discrete case, a push-forward map is any map satisfying $L(X) = Y^\sigma = \{y^{\sigma(n)}\}_{n=1}^{N}$, where $\sigma : \{1, \ldots, N\} \mapsto \{1, \ldots, N\}$ is any permutation.
There exist infinitely many push-forward maps between two distributions Y and X. A common
way to select a reasonable one is to introduce a cost function, a positive, scalar-valued function
c(x, y). The Monge problem then consists of finding a mapping $x \mapsto L(x)$ that minimizes the
transportation cost from dY to dX, i.e.,
$$L = \arg\inf_{L : L_\# dY = dX} \int_{\mathcal{X}} c\big(x, L(x)\big)\, dX, \qquad (5.1.5)$$
where Σ is the set of all permutations, and Tr represents the trace of the matrix C. We now
introduce a problem closely related to the Monge problem (5.1.5), called the discrete Kantorovitch
problem
$$\bar\gamma = \arg\inf_{\gamma \in \Gamma} C(X, Y) \cdot \gamma, \qquad (5.1.7)$$
where $\varphi : X \mapsto \mathbb{R}$, $\psi : Z \mapsto \mathbb{R}$ are discrete functions. As stated in [6], the three discrete
problems above are equivalent. The discrete Monge problem (5.1.5) is also known as the linear
sum assignment problem (LSAP), and was solved in the 1950s by an algorithm due to H.W.
Kuhn; it is also known as the Hungarian method1 .
1 This algorithm is nowadays often credited to an 1890 posthumous paper by Jacobi.
In the continuous case, any transport map $L_\# dX = dY$ can be polar-factorized under suitable
conditions on $\mathcal{X}, \mathcal{Y}$, namely the sets must be bounded and convex:
$$L(\cdot) = \bar{L} \circ T(\cdot), \qquad T_\# dX = dX. \qquad (5.1.9)$$
Here, $\bar{L}$ is the unique solution to the Monge problem (5.1.5), and is the gradient of a c-convex
potential, $S(X) = \exp_x\big(-\nabla h(X)\big)$, where $\exp_x$ is the standard notation for the exponential
map (used in Riemannian geometry). A scalar function is said to be c-convex if $h^{cc} = h$, where
$h^c(Z) = \inf_x \big(c(X, Z) - h(X)\big)$ is called the infimal c-convolution. Standard convexity coincides
with c-convexity for convex cost functions such as the Euclidean norm, in which case the following
polar factorization holds: $S(X) = (\nabla h) \circ T(X)$ with a convex h. These results go back to [7]
(convex distance case) and [26] (general Riemannian distance) in the continuous setting.
We now describe the main connection between these results and learning machines (3.3.1). Indeed,
consider the cost function defined as C(X, Z) = MK (dXX , dYY ), defined in (3.3.8). With these
notations, finding the map T appearing in the right-hand side of the polar factorization (5.1.9)
consists in finding the permutation (5.1.6).
Considering a learning machine (3.3.1), this permutation defines the encoder (of X with Y) as
$$x \mapsto L(x) = P_k(X, X, x)\, Y^\sigma. \qquad (5.1.10)$$
The inverse mapping is computed as
$$y \mapsto L^{-1}(y) = P_k(Y^\sigma, Y^\sigma, y)\, X. \qquad (5.1.11)$$
Note that, in the context of this paragraph, $D_X = D_Y = D$, and the polar factorization of this
map is defined through the equations
$$L(z) = (\nabla_k h) \circ T(z),$$
that is, we can estimate $h(\cdot) = \nabla_k^{-1} L(\cdot)$ and the polar factorization of $L$ and $L^{-1}$.
In a discrete setting, given a kernel k, the problem (5.1.12) reduces to determining a permutation
that satisfies
$$\bar\sigma = \arg\inf_{\sigma \in \Sigma} \big\|\nabla_k y^{\sigma}(x)\big\|^2_{\ell^2} = \arg\inf_{\sigma \in \Sigma} \big\langle \Delta_k,\, y^{\sigma(x)} y^{\sigma(x),T} \big\rangle. \qquad (5.1.13)$$
• A positive kernel k(x, y), defined through the input variable set_codpy_kernel.
• An optional parameter distance with the following possible values:
– “norm1”: sorting is done according to the Manhattan distance $d(x, y) = |x - y|_1$.
– “norm2”: sorting is done according to the Euclidean distance $d(x, y) = |x - y|_2$.
– “normifty”: sorting is done according to the Chebyshev distance $d(x, y) = |x - y|_\infty$.
– If the distance parameter is not provided, the function defaults to the
kernel-induced distance $d_k(x, y)$, as defined at (3.3.8).
This function returns:
• Two distributions $X^\sigma, Y^\sigma$, each having length $N_y$. If $N_x > N_y$, then $Y^\sigma = Y$. In the case
$N_y > N_x$, the function leaves the original distribution X unchanged.
• A permutation σ, represented as a vector $i \mapsto \sigma_i$, $0 \le i \le \min(N_x, N_y)$.
7.100759
1.465813
In the next step, we compute the permutation σ. The Python interface for this function is simply
σ = lsap(M ).
1 3 2 0
Using this permutation for the matrix’s rows, we derive M σ = M [σ] and calculate the new cost
after ordering, i.e., T r(M σ ). We verify that the LSAP algorithm has indeed reduced the total cost.
0.6943549
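For readers without access to CodPy's lsap, here is a minimal sketch of the same step using SciPy's linear_sum_assignment; the toy matrix and the row-permutation convention below are our own and only illustrate the reduction of the trace.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    M = np.random.default_rng(1).uniform(size=(4, 4))   # toy cost matrix
    row_ind, col_ind = linear_sum_assignment(M)         # optimal assignment
    sigma = np.argsort(col_ind)                         # row permutation: row sigma[i] is matched to column i
    M_sigma = M[sigma]                                  # reorder the rows, as in M^sigma = M[sigma]
    print(np.trace(M), "->", np.trace(M_sigma))         # the reordered trace is never larger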
A quantitative illustration. First, we demonstrate the results obtained from our ordering
algorithm on a simple example. We generate two random variables $X \in \mathbb{R}^{4,5}$, $Y \in \mathbb{R}^{4,5}$, such that
$X \sim \mathcal{N}(\mu, I_5)$ and $Y \sim \mathrm{Unif}([0, 1]^{4,5})$ with $\mu = [5, \ldots, 5]$. The first is generated by a multivariate
Gaussian distribution centered at µ, and the second by a uniform distribution supported within
the unit cube.
Table 5.5 displays the distance matrix $D_k$ induced by the Matern kernel k, and the transportation
cost is the trace of this matrix, i.e., $\mathrm{Trace}(D_k)$.
1 3 2 0
Next, we employ the ordering algorithm and calculate the cost after ordering.
Finally, we output the distance matrix again after ordering in Table 5.8, along with the permutation
σ in Table 5.9. We can verify that the sum of the diagonal elements, i.e., the total cost, has
decreased.
5.2. PERMUTATION ALGORITHMS 57
2 3 1 0
7.097425
A qualitative illustration. The best illustration of this algorithm can be done in the two-
dimensional case. Initially, we consider a Euclidean distance function d(x, y) = |x − y|2 , where the
algorithm corresponds to a classical rearrangement, i.e., the one corresponding to the Wasserstein
distance.
To demonstrate this behavior, let us generate a bimodal-type distribution $X \in \mathbb{R}^{N_x, D}$ and a random
uniform distribution $Y \in [0, 1]^{N_y, D}$.
For a convex distance, this algorithm is characterized by an ordering where characteristic lines do
not intersect each other, as plotted in Figure 5.1, which displays the edges xi 7→ y i , before and
after the ordering algorithm.
[Figure 5.1: edges $x^i \mapsto y^i$ before (left) and after (right) the ordering algorithm.]
However, kernel-based distances may result in different permutations. This is because kernels
define distances that might not be Euclidean. For instance, the kernel selected above defines a
distance equivalent to $d(x, y) = \prod_d |x_d - y_d|$, and leads to an ordering in which some characteristics
may cross.
LSAP extensions - Different input sizes. Next, we describe some extensions of the LSAP
algorithms used in our library. A straightforward extension of the LSAP problem is applicable
when the input sets are of different sizes, specifically Ny ≤ Nx . Figure 5.2 illustrates the behavior
of our LSAP algorithm in this setting.
For example, the encoder functional (5.1.13) corresponds to the functional $L(C) = \langle \Delta_k, C \rangle$,
whereas the sharp discrepancy sequence minimization corresponds to $L(C) = d_k(X, X^\sigma)$.
This algorithm relies on the fact that any permutation σ can be decomposed as a combination of
elementary transpositions of two elements, making it particularly useful when evaluating the effect
on $L(C^\sigma)$ of a transposition of two elements σ[i], σ[j] is faster than re-evaluating $L(C^\sigma)$ from
scratch. Hence, we introduce a permutation gain function s(i, j, σ). A typical example of such a
function is the one corresponding to the LSAP problem, with
$s_{\mathrm{LSAP}}(i, j, \sigma) = C(i, \sigma[j]) + C(j, \sigma[i]) - C(i, \sigma[i]) - C(j, \sigma[j])$.
The algorithm can be considered a discrete descent algorithm. For symmetric problems, i.e.,
problems satisfying s(i, j, σ) = s(j, i, σ), it can be written as follows:
    def swap_descent(s, N):
        # Discrete 2-swap descent, symmetric case (the function name is ours);
        # s(i, j, sigma) is the gain of swapping sigma[i] and sigma[j].
        permutation = list(range(N))
        flag = True
        while flag:
            flag = False
            for i in range(N):
                for j in range(i + 1, N):
                    if s(i, j, permutation) < 0:
                        permutation[i], permutation[j] = permutation[j], permutation[i]
                        flag = True
        return permutation
Non-symmetric problems can be treated by modifying the loop as follows: for i in [1, N],
for j != i. While these algorithms typically yield sub-optimal solutions, they are robust and
converge within a finite time, usually within a few steps. They are particularly useful for assisting
other global methods or for providing a first solution. Another utility is their ability to find a
local minimum that is close to the original ordering, thereby maintaining a certain relation to the
original data sequence.
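As a usage sketch of the descent routine written above, here is the LSAP gain function applied to a random toy cost matrix; the cost convention (assignment cost $\sum_i C[i, \sigma[i]]$) is our assumption.

    import numpy as np

    C = np.random.default_rng(0).uniform(size=(30, 30))    # toy cost matrix

    def s_lsap(i, j, sigma):
        # Gain of swapping sigma[i] and sigma[j] for the assignment cost sum_i C[i, sigma[i]].
        return (C[i, sigma[j]] + C[j, sigma[i]]) - (C[i, sigma[i]] + C[j, sigma[j]])

    sigma = swap_descent(s_lsap, N=30)                      # local minimum of the assignment cost
    print(sum(C[i, sigma[i]] for i in range(30)))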
In the rest of this section, we design some useful algorithms based on generative models.
Here $Y \in \mathbb{R}^{N_Y, D_Y}$ is mandatory, and the other inputs are optional:
• If X is not provided, then two input numbers, namely NX , DX , are used to define X ∈ RNX ,DX
as a variate of a uniform distribution on the unit cube [0, 1]DX .
• As X ∈ RNX ,DX is now either provided or computed, we can define the encoder/decoder
(5.1.1)-(5.1.2). The LSAP approach (5.1.6) is chosen if DX = DY , otherwise the parametric
one (5.1.13).
• If Z ∈ RNZ ,DX is not provided, then two input numbers, namely N, DX , are used to define
Z ∈ RN,DX as a variate of a uniform distribution on the unit cube [0, 1]DX .
• As Z ∈ RNZ ,DX is now either provided or computed, this function outputs the decoding
function L(X, Y )(z).
In summary, the function aims to output NZ values in RNZ ,DY , representing a variate of a
distribution that shares close statistical properties with the discrete distribution Y and is somehow
explained by an exogenous random variable X.
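For illustration, a heavily simplified sketch of such a sampler: the latent sample is matched to Y through an LSAP step, and the kernel projection $P_k$ is replaced by a crude nearest-latent lookup; all names and the decoder choice are our own assumptions, not the CodPy implementation.

    import numpy as np
    from scipy.optimize import linear_sum_assignment
    from scipy.spatial.distance import cdist

    def sample_like(Y, N_Z, seed=0):
        # Match a uniform latent sample to Y with an LSAP step, then generate new samples
        # by a nearest-latent lookup (a crude stand-in for the kernel projection P_k).
        rng = np.random.default_rng(seed)
        N_Y, D_Y = Y.shape
        X = rng.uniform(size=(N_Y, D_Y))                   # latent training sample on the unit cube
        _, cols = linear_sum_assignment(cdist(X, Y, "sqeuclidean"))
        Y_sigma = Y[cols]                                  # Y reordered to match the latent points
        Z = rng.uniform(size=(N_Z, D_Y))                   # unseen latent points
        nearest = cdist(Z, X).argmin(axis=1)
        return Y_sigma[nearest]

    samples = sample_like(np.random.default_rng(1).normal(size=(500, 2)), N_Z=1000)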
We now give several illustrations of this Python function.
One-dimensional illustrations
Let us consider two one-dimensional distributions: a bi-modal Gaussian and a bi-modal Student's
t-distribution. The experiment compares the true distribution $X \in \mathbb{R}^{1000,1}$ and a computed
distribution $Y \in \mathbb{R}^{1000,1}$ obtained using a sampling function.
Figure 5.3 compares kernel density estimates and histograms of the original sample and the
distribution generated using a sampling function; the first plot is for the Gaussian and the second for the
t-distribution.
Figure 5.3: Histograms of Bi-modal Gaussian vs sampled (left) and Student’s t distribution vs
sampled (right)
Tables 5.13 and 5.14 in the Appendix show that the sampling algorithm generates samples that are
very close in terms of skewness, kurtosis, KL divergence, and MMD.
Two-dimensional illustrations
In this example, we consider two circles with different centers, as illustrated in the first graph
below. The second graph shows the representation in the latent space, the third graph displays the
reconstruction, and the fourth graph demonstrates the decoder (cf. (5.1.2)) on randomly selected
latent data.
We repeat this experiment with random circles for a bimodal Gaussian distribution with modes
centered at −5 and 5. The first graph shows the original distribution, the second one is the
representation of the distribution in 1-dimensional latent space, the third graph is the reconstruction
of the original bimodal distribution, and the fourth graph is the reconstruction on unseen latent
variables.
We observe a perfect reconstruction using latent training data, and some aberrations on unseen
latent variables.
Next, we repeat the experiment for a two-dimensional case. Figure 5.6 compares the distributions
of $X \in \mathbb{R}^{1000\times 2}$ and $Y \in \mathbb{R}^{1000\times 2}$ (original and computed distribution), with the first scatter plot
comparing to a Gaussian, the second to a t-distribution, and the third and fourth scatter plots showing
a bimodal Gaussian and a bimodal t-distribution, respectively, with Nx = Ny = 1000.
Table 5.13 in the Appendix to this chapter presents the first four moments of the true and sampled
distributions. The sampling algorithm cannot capture the fourth moment for a heavy-tailed
unimodal distribution, for which we chose df = 3 degrees of freedom for the t-distribution.
Figure 5.6: 2D Gaussian vs sampled (left) and 2D Student’s t distribution vs sampled (center) and
2D bimodal Gaussian vs sampled (right)
However, it can capture the third and fourth moments of light- and heavy-tailed distributions, but Figure 5.6
shows that there are some samples between the two modes.
The next two plots display a bimodal Gaussian distribution in dimension D = 15 and resampled
random variables using the optimal transport and parametric representation algorithm, respectively.
$$\mathbb{Y} \,|\, \mathbb{X}. \qquad (5.3.2)$$
Suppose we know a variate of the joint variable $\mathbb{Z} = (\mathbb{X}, \mathbb{Y})$, given as $Z = (z^n)_{n=1,\ldots,N}$, $z^n = (x^n, y^n)$. Consider
another distribution $\epsilon = (\epsilon^n)_{n=1,\ldots,N}$, for instance a uniform one, and define the encoding map
$$\mathbb{Y} \,|\, \mathbb{X} = x \sim L(x, \epsilon). \qquad (5.3.3)$$
This approach, by defining a continuous, invertible mapping from the latent distribution $(\mathbb{X}, \epsilon)$
to the target distribution $(\mathbb{X}, \mathbb{Y})$, can be helpful in a number of situations, and serves purposes
beyond estimating conditional distributions.
We can nevertheless benchmark its results against alternative methods for computing conditional
distributions; we briefly describe two of them below, which we use to benchmark our
generative algorithm.
The first one is the Nadaraya-Watson kernel regression, introduced in 1964 in [45]. This algorithm
applies to any conditional probability according to the following formula:
$$p(y|x) \sim \frac{\sum_{i=1}^{N} K(x, x^i)\, K(y, y^i)}{\sum_{i=1}^{N} K(x, x^i)}. \qquad (5.3.4)$$
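For illustration, a minimal sketch of the estimator (5.3.4) on synthetic data, with a Gaussian kernel and a hand-picked bandwidth; both choices are our assumptions.

    import numpy as np

    def nadaraya_watson(x, y_grid, X, Y, K):
        # Estimator (5.3.4): p(y | x) ~ sum_i K(x, x_i) K(y, y_i) / sum_i K(x, x_i).
        wx = np.array([K(x, xi) for xi in X])
        wy = np.array([[K(y, yi) for yi in Y] for y in y_grid])
        return (wy @ wx) / wx.sum()

    gauss = lambda a, b, h=0.3: np.exp(-((a - b) ** 2) / (2 * h ** 2))
    rng = np.random.default_rng(0)
    X = rng.normal(size=200)
    Y = np.sin(X) + 0.1 * rng.normal(size=200)
    y_grid = np.linspace(-2.0, 2.0, 100)
    density = nadaraya_watson(0.5, y_grid, X, Y, gauss)    # conditional density of Y given X = 0.5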
$$p(y|x) \sim \sum_{k=1}^{N} \pi_k(x, \omega)\, \mathcal{N}\big(y \,\big|\, \mu_k(x, \omega), \sigma_k(x, \omega)\big), \qquad (5.3.5)$$
where ω denotes the weights of the networks. We used the TensorFlow Probability framework, where
the weights are calibrated by minimizing the negative log-likelihood loss for a given distribution.
Example: Log-normal distributions. We illustrate our approach with a one-dimensional,
nonlinear combination of variates. Consider two independent distributions $\mathbb{X}, \mathbb{Y}$ having normal
distributions $\mathcal{N}(\mu_x, \sigma_x)$ and $\mathcal{N}(\mu_y, \sigma_y)$, with $(\mu_x, \mu_y) = (0, 0)$, $(\sigma_x, \sigma_y) = (1, 0.1)$, and consider the
following distribution:
$$\mathbb{Z} := \big(\exp(\mathbb{X}),\ \exp(\mathbb{X})\exp(\mathbb{Y})\big).$$
In Figure 5.7-(i), we plot in red a variate of the joint distribution $(\mathbb{X}, \mathbb{Y})$ of size N = 1000. We
condition upon x = 0, and we plot in blue a sample of the conditioned variate $\mathbb{Y} \,|\, \mathbb{X} = 0$. This
blue distribution serves as our reference target distribution.
Figure 5.7-(ii) shows, in blue, the density produced by the conditioned random variable algorithm against
the reference target distribution, where the estimator is the Nadaraya-Watson one (5.3.4). We
used in this benchmark a particular kernel, called the inverse quadratic kernel, corresponding to a
Cauchy distribution and defined as $K(x, y) = \frac{1}{\pi(1 + |x - y|^2)}$. This kernel is used together with a scaling
map $S(x) = \frac{x}{h}$, h being the bandwidth, which has been set manually by trial and error.
where
• $X \in \mathbb{R}^{N_x, D_X}$ is any set of points generated by an i.i.d. sample of $X(t_1)$, where $t_1$ is any time.
• $Y \in \mathbb{R}^{N_Y, D_Y}$ is any set of points generated by an i.i.d. sample of $X(t_2)$, at any time $t_2 > t_1$.
• $f(Y) \in \mathbb{R}^{N_Y, D_f}$ is an optional function.
The output is a matrix $f_{Z|X}$, representing the conditional expectation
$$f_{Z|X} \sim \mathbb{E}^{X(t_2)}\big(f(\cdot)\,\big|\,X(t_1)\big) \in \mathbb{R}^{N_x, D_f} =:_{\text{not.}} f(Z|X). \qquad (5.4.2)$$
• If f(Z) is left empty, the output $f_{Z|X} \in \mathbb{R}^{N_z, N_x}$ is a matrix representing a convergent
approximation of the stochastic matrix $\mathbb{E}^{X(t_1)}(Z|X)$.
• If $f(Z) \in \mathbb{R}^{N_z, D_f}$ is not empty, $f_{Z|X} \in \mathbb{R}^{N_z, D_f}$ is a matrix representing the conditional
expectation $f(Z|X) = \mathbb{E}^{X(t_1)}(f(Z)|X)$.
Let us focus on the discrete case from now on. Consider X = (x1 , . . . , xNX ), Y = (y 1 , . . . , y NY )
and denote $dX_X = \frac{1}{N_X}\sum_{n=1}^{N_X}\delta_{x^n}$, $dY_Y = \frac{1}{N_Y}\sum_{n=1}^{N_Y}\delta_{y^n}$, $X + Y = (x^n + y^m)_{n,m=1}^{N_X, N_Y}$. Then
$$dX_X * dY_Y = \frac{1}{N_X N_Y}\sum_{n,m=1}^{N_X, N_Y} \delta_{x^n + y^m}.$$
Observe that $dX_X * dY_Y$ is a distribution having $N_X \times N_Y$ elements. Since we want to map it
to a distribution having $N_X$ elements, a possibility to solve this problem is to consider the
clustering approach (4.3.4), that is,
$$\inf_{Z \in \mathbb{R}^{N_X, D}} D_k\big(X_X + Y_Y,\, Z_Z\big).$$
Then consider the map defined as $S_\# dX_X = dZ_Z$, defined at (5.1.10). However, this approach is
computationally costly, and generative methods allow one to design better-performing algorithms,
as follows.
Consider any two independent latent variables $\epsilon_x, \epsilon_y$ for $\mathbb{X}, \mathbb{Y}$, for instance uniform laws, and
define the two encoders
$$X = L_x(\epsilon_x), \quad Y = L_y(\epsilon_y). \qquad (5.4.3)$$
$$X + Y = L_x(\epsilon_x) + L_y(\epsilon_y).$$
We illustrate this approach with a simple example: consider two independent normal distributions
$dX = \mathcal{N}(\mu_x, \sigma_x)$ and $dY = \mathcal{N}(\mu_y, \sigma_y)$, and consider the sum
and dZ can be used as a reference distribution for benchmarks. We consider the generative approach
(5.4.3), taking as latent variable $(\epsilon_x, \epsilon_y)$ the uniform distribution over the unit square $[0, 1]^2$. The
result is plotted in Figure 5.9: the first subplot shows two variates of the distributions dX, dY, while
the second subplot 5.9-(ii) represents the reference distribution dZ together with
the result of the generative approach (5.4.3).
Table 5.11 displays statistical tests to compare the generated distribution Lx (ϵx ) + Ly (ϵy ) against
the reference distribution dZ.
Mean: -0.026 (-0.62)
Variance: 0.2 (0.19)
Skewness: 1.9 (2.1)
Kurtosis: -0.18 (0.18)
KS test: 1.8e-08 (0.05)
15D encoders. Table 5.14 reports the skewness and kurtosis of $X \in \mathbb{R}^{500\times 15}$ and
$Y \in \mathbb{R}^{500\times 15}$ for the Gaussian and Student's t bi-modal distributions from Section 5.3.1.
This table summarizes statistics for the second numerical experiment in Section 5.3.2, with a
conditioned variable $\mathbb{Y} \,|\, \mathbb{X} = 2$.
5.6 Bibliography
Many implementations of LSAP are available with a Python interface. For example, in SciPy, the
optimization and root finding module3 allows one to solve LSAP using a Hungarian-type algorithm,
even when the cost matrix is unbalanced. The Python library Lapjv4 allows one to solve LSAP using the
Jonker-Volgenant algorithm5 . The Sinkhorn algorithm6 ,7 is (heuristically) fast for the Kantorovich
problem and solves LSAP efficiently, but the matrix produced by the Sinkhorn algorithm is not always
a permutation matrix. It is implemented, for certain settings, in the POT library8 .
Chapter 6
Application to partial differential equations
6.1 Introduction
We now explore how kernel methods can be applied to solve partial differential equations (PDEs),
and we demonstrate here that the approach we propose offers some advantages over traditional
numerical methods for PDEs.
• Meshless methods. Kernel methods allow for meshless (sometimes called meshfree)
formulations to be used. Unlike traditional finite difference or finite element methods,
meshless methods do not require a predefined mesh, nor to compute connections between
nodes of the grid points. Instead, they use a set of nodes or particles to represent the domain.
This makes them particularly useful for modeling complex geometric domains.
• Particle methods. Kernel methods can be used in the context of particle methods in fluid
dynamics, which are Lagrangian methods involving the tracking of the motion of particles.
Kernel methods are well-suited for these types of problems because they can easily handle
general meshes and boundaries.
• Boundary conditions. Kernel methods allow one to express complex boundary
conditions, which can be of Dirichlet or Neumann type, or even of more complex mixed type,
expressed on a set of points. They can also encompass free boundary conditions for
particle methods, as well as fixed meshes.
We are going to provide several illustrations of the flexibility of this approach. The price to pay
with meshless methods is the computational time, which is greater than that of more traditional
methods such as finite difference, finite element, or finite volume schemes. The reason is that kernel
methods usually produce dense matrices, whereas more classical methods on structured grids, due
to their localization properties, typically lead to sparse matrices, a property that matrix solvers can
exploit.
In this chapter, we initiate our discussion with some of the technical details pertinent to the
discretization of partial differential equations via kernel methods. Building on this material, we
then present a series of examples, commencing with static models and progressing to a
spectrum of time evolution equations. Our primary goal is to showcase the efficacy and broad
applicability of meshfree methods, in the context of both structured and unstructured meshes.
6.2 Kernel approximation techniques
$$z \mapsto \nabla_k(X, z)^T = K(X, X)^{-T}\,\big(\nabla_z K(z, X)\big)^T \in \mathbb{R}^{N_x, D}, \qquad (6.2.1)$$
This operator acts on any (sufficiently regular) vector-field function ϕ(X) ∈ RNx ,D , as the Frobenius
scalar product ∇k (X, z)T · ϕ(X), to compute an approximation of the divergence of the vector-field
ϕ. In particular, one can estimate this operator on all points of the set X. We compute that, for
any scalar field φ, this operator acts as
where $\nabla_k(X, X) \in \mathbb{R}^{N_X, D, N_x}$ is now a three-tensor, $\phi(X) \in \mathbb{R}^{N_X, D}$ is a matrix, $\varphi(X) \in \mathbb{R}^{N_X}$ is a
scalar field, and · here means a contraction over the first two indices. So we can rewrite this latter
formula as
$$\big\langle \varphi(X), \nabla_k(X, X)^T \cdot \phi(X) \big\rangle = \big\langle \nabla_k(X, X)\varphi(X), \phi(X) \big\rangle = \big\langle \phi(\cdot)\, dX, \nabla_k \varphi(\cdot) \big\rangle_{\mathcal{D}', \mathcal{D}}.$$
Here, the right-hand side of the equation above denotes the duality pairing on distributions. In
particular, assume that the discretized operator $(\nabla_k \varphi)(X)$ is consistent with $(\nabla \varphi)(X)$ at the set of
points X for any function $\varphi \in \mathcal{H}_k^X$, the kernel space induced by k. Then our operator
$\nabla_k^T$ is consistent with the operator
So one should pay attention to the fact that the operator $\nabla_k^T$, that is, the transpose of the gradient
operator $\nabla_k$, is not consistent with the divergence operator $\nabla \cdot \phi$, but with the weighted operator
$-\nabla \cdot (\phi\, dX)$. If the “true” divergence operator is needed, it can be built straightforwardly from the
operator $\nabla_k$. In the same way, the operator $\Delta_k$ introduced in (4.2.6) is not consistent with the
“genuine” Laplace operator $\Delta = \sum_{i=1}^{D} \partial_i^2$, but is instead consistent with the weighted operator
$$\Delta_k \varphi \simeq -\nabla \cdot (\nabla \varphi\, dX).$$
$$\delta_t u(t^n) = \frac{u(t^{n+1}) - u(t^n)}{t^{n+1} - t^n} = A\big(\theta u(t^{n+1}) + (1 - \theta) u(t^n)\big) = A u^\theta(t^n).$$
A formal solution of this scheme is given by $u(t^{n+1}) = B(A, \theta, \tau^n)\, u(t^n)$, where B is the generator
of the scheme, defined as
$$B(A, \theta, \tau^n) = \big(I - \tau^n \theta A\big)^{-1}\big(I + \tau^n (1 - \theta) A\big). \qquad (6.2.3)$$
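As an illustration, here is a minimal NumPy sketch of the generator (6.2.3); the matrix A, the time step, and the initial condition below are purely illustrative.

    import numpy as np

    def theta_generator(A, theta, tau):
        # B(A, theta, tau) = (I - tau*theta*A)^{-1} (I + tau*(1 - theta)*A), cf. (6.2.3).
        I = np.eye(A.shape[0])
        return np.linalg.solve(I - tau * theta * A, I + tau * (1.0 - theta) * A)

    # Example: a stiff heat-like system u' = A u advanced with the Crank-Nicolson choice theta = 1/2.
    N = 50
    A = (-2.0 * np.eye(N) + np.eye(N, k=1) + np.eye(N, k=-1)) * (N + 1) ** 2
    u = np.sin(np.pi * np.linspace(0.0, 1.0, N))
    B = theta_generator(A, theta=0.5, tau=1e-3)
    for _ in range(100):
        u = B @ u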
where ∇· represents the divergence, and f (t, u, . . .) ∈ RDx ,Du is a matrix field. For instance,
f (t, u, . . .) ≡ v(t, x)u corresponds to a transport equation, while f (t, u, . . .) = ∇x u leads to the
heat equation ∂t u = ∆u. Hamilton-Jacobi equations are thus applicable to hyperbolic-diffusive
models. Consider a scalar-valued, entropy function U = U (u), and denote the entropy variable
v(u) = ∇u U (u). We assume the existence of a vector-valued map v 7→ g(v) and a scalar-valued
function v 7→ G(v), allowing the equations (6.2.4) to be written with an entropy dissipation term:
The entropy dissipation must also be understood in a weak sense. In particular, (6.2.5) implies the
bound
$$\frac{d}{dt} \int_{\mathbb{R}^{D_x}} U\big(u(t, x)\big)\, dx \le 0$$
for any solution to (6.2.4)-(6.2.5). In turn, this implies the Lp -stability of a solution (if available),
provided the entropy function U is convex.
To approximate such a system numerically, we consider a positive definite kernel k, a time grid
$\ldots < t^n < t^{n+1} < \ldots$, a space grid $X = (x^1, \ldots, x^{N_x}) \in \mathbb{R}^{N_x, D_x}$, and we denote by $\tau^n = t^{n+1} - t^n$,
by $u_i^n \sim u(t^n, x_i)$ the discrete solution, and by $\delta_t U^n = \frac{U^{n+1} - U^n}{\tau^n}$. The strategy for building entropy
dissipative schemes involves first the choice of a (q+1)-time-level interpolation $u^*(u^q, \ldots, u^0)$ which
should satisfy:
• Consistency with the identity (u∗ (u, .., u) = u).
$$\delta_t U^{*,n} = \frac{U^{*,n+1} - U^{*,n}}{\tau^n} = v^{*,n+1/2} \cdot \delta_t u^{*,n}$$
and is consistent with the entropy variable: $v^{*,n+1/2}(u, \ldots, u) = v(u)$.
The system is then approximated by the fully discrete numerical scheme displayed now, where
un+1 is the unknown:
$$\delta_t u^{*,n} = \frac{u^{*,n+1} - u^{*,n}}{\tau^n} = -\nabla_k \cdot g(v^{*,n+1/2}). \qquad (6.2.7)$$
These schemes can be fully implicit or explicit with respect to the unknown $u^{n+1}$, depending on the
choice of the entropy variable. They are entropy stable in the following sense: set $E^{*,n} = \sum_{i=1}^{N_x} U(u_i^n)$ and compute
$$\delta_t E^{*,n} = \sum_i \nabla_k \cdot G\big(v_i^{*,n+1/2}\big) = \big\langle G(v^{*,n+1/2}), \nabla_k 1 \big\rangle_{\ell^2}.$$
If we consider a kernel defining a divergence operator that satisfies $\nabla_k 1 \equiv 0$, then the numerical
scheme (6.2.7) is stable, as it enjoys the property $E^{*,n+1} \le E^{*,n}$. For instance, consider the linear
equation (3.2.4), the scheme $\delta_t u^n = A v^{*,n+1/2}$ with $v^{*,n+1/2} = \frac{u^{n+1} + u^n}{2}$, and the entropy function
$U(u) = u^2$. We can directly compute that $\delta_t U(u^n) = v^{*,n+1/2} A v^{*,n+1/2} \le 0$. The Crank-Nicolson
choice θ = 1/2 corresponds to a two-time-level entropy scheme, which is second-order accurate in
time.
$$\Delta u = f, \qquad \mathrm{supp}\, u \subset \Omega, \qquad u_{\partial\Omega} = 0,$$
where f is sufficiently regular and Ω is a sufficiently regular domain. Consider the weak formulation of
this equation, that is, for functions φ supported in Ω,
$$\langle \Delta u, \varphi \rangle_{\mathcal{D}', \mathcal{D}} = - \langle \nabla u, \nabla \varphi \rangle_{\mathcal{D}', \mathcal{D}} = \int_{\mathbb{R}^D} (f\varphi)(x)\, dx,$$
leading to the equation $(\Delta_k u)(X) = f(X)$, $\Delta_k$ being defined in (4.2.6). A solution to this equation
is computed as $u = (\Delta_k)^{-1} f$, defined in (4.2.7).
Figure 6.1 displays a regular mesh for the domain $\Omega = [0, 1]^2$, where f is plotted on the left-hand
side and the solution u on the right-hand side.
Kernel methods facilitate the use of unstructured meshes, enabling the description of more complex
geometries. Figure 6.2 shows an unstructured mesh generated by a bimodal Gaussian, with f
plotted on the left and the solution u on the right.
Here, $L : \mathcal{H}_k(\Omega) \mapsto L^2(\Omega)$ is a linear operator that serves as a penalty term. A formal solution is
given by:
$$G + \epsilon L^T L\, G = F$$
$$z \mapsto G(z) = K(X, z)\,\big(K(X, X) + \epsilon\, L_k^T L_k(X, X)\big)^{-1} F(X)$$
To compute this function, input $R = \epsilon L_k^T L_k$ into the pseudo-inverse formula (3.2.3).
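For illustration, a minimal one-dimensional sketch of this regularized projection, where we simply take $R = \epsilon I$ as a ridge-type stand-in for $\epsilon L_k^T L_k$; the Gaussian kernel, the bandwidth h, and this simplification are our assumptions.

    import numpy as np

    def regularized_projection(X, F, Z, eps=0.1, h=0.2):
        # z -> K(X, z) (K(X, X) + R)^{-1} F(X), with R = eps * I as a ridge stand-in for eps * L_k^T L_k.
        K = lambda A, B: np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * h ** 2))
        coeffs = np.linalg.solve(K(X, X) + eps * np.eye(len(X)), F)
        return K(Z, X) @ coeffs

    # Denoising-style usage: a noisy 1D signal F on X, evaluated on a finer grid Z.
    rng = np.random.default_rng(0)
    X = np.linspace(-1.0, 1.0, 100)
    F = np.cos(4 * np.pi * X) + X + 0.1 * rng.normal(size=X.size)
    Z = np.linspace(-1.0, 1.0, 400)
    G = regularized_projection(X, F, Z)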
As an example, consider the denoising procedure, which aims to solve:
In this case, $L_k = \nabla_k$, the discrete gradient operator defined at (4.2.4), and $L_k^T L_k = \nabla_k^T \nabla_k$ is an
approximation of the operator ∆ (namely $\Delta_k$). Figure 6.3 demonstrates the results of this
regularization procedure. The noisy signal (left image) is given by $F_\eta(x) = F(x) + \eta$, where
$\eta := \mathcal{N}(0, \epsilon)$ is a white Gaussian noise with $\epsilon = 0.1$, and
$f(x) = f(x_1, \ldots, x_D) = \prod_{d=1,\ldots,D} \cos(4\pi x_d) + \sum_{d=1,\ldots,D} x_d$ is an example function
(the cosine function defined in (3.1.2)). The regularized solution is plotted on the right.
This approach can easily be adapted to more complex geometries, as demonstrated in Figure
6.5, which shows the heat equation on an irregular mesh generated by a bimodal Gaussian process.
$$\big\langle \mu_0,\ (\nabla\varphi)\circ y \cdot \partial_t y \big\rangle_{\mathcal{D}', \mathcal{D}} = \big\langle \mu_0,\ (\nabla y)^{-1} \cdot \nabla(\nabla\varphi)\circ y \big\rangle_{\mathcal{D}', \mathcal{D}} \quad \text{for all } \varphi \in C(\mathbb{R}^D),$$
which is equivalent to
$$\big\langle (\nabla\varphi)\circ y,\ \mu_0\, \partial_t y \big\rangle_{\mathcal{D}', \mathcal{D}} = -\big\langle \nabla \cdot (\nabla y)^{-1} \mu_0,\ (\nabla\varphi)\circ y \big\rangle_{\mathcal{D}', \mathcal{D}}, \quad \text{for all } \varphi \in C(\mathbb{R}^D).$$
This motivates us to formulate the following (formal) evolution scheme for the map y:
$$\partial_t y = -\nabla \cdot \big(\nabla \cdot \nabla\big)^{-1} \nabla y = -\nabla \cdot \Delta^{-1} \nabla y, \qquad y(0, x) = x\,\mu_0(x). \qquad (6.4.2)$$
On the one hand, this equation corresponds to a diffusive equation having a bad sign. On the
other hand, the operator $\nabla \cdot \Delta_x^{-1} \nabla$ is a projection operator, hence is bounded. Considering a
positive definite kernel k and an initial condition $\mu_0 \equiv \delta_X$, $X \in \mathbb{R}^{N,D}$, this amounts to considering the
semi-discrete scheme for $t \mapsto Y(t) \in \mathbb{R}^{N,D}$:
$$\frac{d}{dt} Y = \nabla_k \cdot (\nabla_k Y)^{-1} = \nabla_k \cdot \Delta_k^{-1} \nabla_k Y, \qquad Y(0, x) = X, \qquad (6.4.3)$$
where the divergence, gradient, and Laplacian operators $\nabla_k\cdot$, $\nabla_k$, $\Delta_k$ are defined at (4.2.5)-(4.2.4).
Observe that at time t = 0, the scheme (6.4.2) reduces formally to $\partial_t y = \nabla \cdot I_D$, where $I_D$ is
the identity matrix. This last formulation has to be understood in a weak sense, this operator
acting on sufficiently regular functions φ as $\langle \nabla \cdot I_D, \varphi\mu_0 \rangle_{\mathcal{D}', \mathcal{D}} = -\int I_D \cdot \nabla(\varphi\mu_0)$, and is not trivial.
In particular, picking a kernel satisfying $(\nabla_k y) = I_D$ reduces the semi-discrete scheme to
$\frac{d}{dt} Y = \nabla_k \cdot I_D$. The evolution scheme (6.4.2) is theoretically a stable scheme, due to the following
energy estimate:
$$\frac{d}{dt}\|Y\|^2_{\ell^2} = 2\,\big\langle Y, \nabla_k \cdot (\nabla_k Y)^{-1} \big\rangle_{\mathcal{D}', \mathcal{D}} = 2\,\big\langle \nabla_k Y, (\nabla_k Y)^{-1} \big\rangle_{\mathcal{D}', \mathcal{D}} = 2D.$$
However, take care that the operator appearing in (6.4.3) is negative definite, hence a strong C.F.L.
condition is needed. We took here the C.F.L. condition $\tau^n = \min_{i \ne j} \|Y^j(t^n) - Y^i(t^n)\|^2_{\ell^2}$.
Figure 6.6 shows our results with this numerical scheme. The left-hand picture shows the initial
condition, taken as a two-dimensional variate of a standard normal law. The middle picture
displays the evolution at time t = 1; observe that the variate appears to be more regular.
The right-hand picture is a standard rescaling of the latter to unit variance. Indeed, the right-hand
plot approximates a sharp discrepancy sequence of the normal law, having strong convergence
properties for Monte Carlo sampling. These normal law samples can be obtained with the CodPy
function
function
get_normals(N, D, · · · )
convergence properties, as for the Heston model (see [31]). The convergence rate of such a variate
is of order
$$\Big| \int_{\mathbb{R}^D} \varphi\, d\mu - \frac{1}{N}\sum_i \varphi(x^i) \Big| \le \frac{O(1)}{N^2}$$
for any sufficiently regular function φ. This should be compared to a naive Monte-Carlo variate,
converging at the statistical rate $O(1)/\sqrt{N}$.
where $f = (f_d(u))_{1\le d\le D} : \mathbb{R} \mapsto \mathbb{R}^D$ is a given flux and $\nabla \cdot f(u) = \sum_{1\le d\le D} \partial_{x_d} f_d(u)$ denotes
its divergence, with $x = (x_d)_{1\le d\le D}$. A Lagrangian method corresponds to determining a solution
via the method of characteristics. In the context of conservation laws, the characteristic
method determines u, y formally as (see (5.1.4) for a definition of the push-forward)
Provided $u_0$ is sufficiently regular, the transport function $y = y(t, x)$ defines an invertible map
for small times t, and the equation (6.4.5) defines a unique solution to (6.4.4). However, we can
show that $y(t, \cdot)$ is no longer one-to-one for large enough times, for instance if $u_0$ is compactly
supported. Nevertheless, $y(t, \cdot)_\# u_0(\cdot)$ still defines a formal solution to (6.4.4), called the energy
conservative solution, which is highly oscillatory, as can be seen in Figures 6.7-6.8 (middle), taking
as flux $f(u) = (-u^2, \cdots)$. The vanishing viscosity method allows one to select another, more
physically relevant solution, called the entropy dissipative solution. It consists in solving, in the
limiting case $\epsilon \to 0$, the following viscous version of (6.4.5):
$$\partial_t u^\epsilon + \nabla \cdot f(u^\epsilon) = \epsilon \Delta u^\epsilon.$$
For any $\epsilon > 0$, the solution $u^\epsilon$ satisfies in a strong sense the entropy dissipation property
$\partial_t U(u^\epsilon) + \nabla \cdot F(u^\epsilon) \le 0$, for any convex entropy / entropy-flux pair U, F. In the limiting case $\epsilon \to 0$,
this entropy dissipation holds in a weak sense. The CHA-algorithm allows an explicit computation
of this vanishing viscosity solution, as
and $h^+(t, \cdot)$ is the convex hull of h. Figure 6.7 illustrates this computation for the one-dimensional
Burgers equation
$$\partial_t u + \frac{1}{2}\partial_x u^2 = 0,$$
while Figure 6.8 illustrates the two-dimensional case $\partial_t u + \frac{1}{2}\nabla \cdot (u^2, u^2) = 0$. The left-hand figure
is the initial condition at time zero, the middle plot represents the conservative solution
at time 1, and the entropy solution is plotted on the right.
Figure 6.9: A cubic function, exact AAD first order and second order derivatives
The same benchmark can be used in any dimension; we plot the two-dimensional test in Figure 6.11.
[Figure 6.11: exact gradient, CodPy gradient, and two PyTorch AAD gradient runs (pytorch grad-1, pytorch grad-2) for the two-dimensional test.]
• Two runs of AAD computations lead to two different results (pytorch grad-1 and grad-2): NNs
do not define deterministic differential learning machines, due to the stochastic descent
algorithm (here the Adam optimizer).
• Differential neural networks tend to be less accurate than a kernel-based gradient operator.
6.6 Appendix: discrete high-order approximations
More precisely, consider a sufficiently regular function f, known at q distinct points $f(x^k)$, $x^1 < \ldots < x^q$,
and a differential operator $P^\alpha(\partial) = \sum_{i=0}^{q-1} p_i^\alpha\, \partial^i$. For any function f, we want to approximate
$(P^\alpha(\partial) f)(y) \simeq \sum_{k=1}^{q} \beta_y^k f(x^k)$ at some points y. To this aim, consider the Taylor formula
$$f(x^k) = f(y) + (x^k - y)\,\partial f(y) + \cdots = \sum_{i=0}^{q-1} \frac{(x^k - y)^i}{i!}\, (\partial^i f)(y), \qquad k = 1, \ldots, q,$$
with the conventions $0! = 1$, $\partial^0 f = f$. Multiplying each line by $\beta_y^k$ and summing leads to
$$\sum_{k=1}^{q} \beta_y^k f(x^k) = \sum_{i=0}^{q-1} (\partial^i f)(y) \sum_{k=1}^{q} \beta_y^k \frac{(x^k - y)^i}{i!}.$$
Hence, to obtain a q-point accurate formula for $P^\alpha(\partial)$, we solve the following Vandermonde-type
system:
$$\sum_{k=1}^{q} \beta_y^k (x^k - y)^i = i!\, p_i^\alpha, \qquad i = 0, \ldots, q - 1. \qquad (6.6.1)$$
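For illustration, a short sketch solving (6.6.1) numerically for the weights $\beta_y^k$; the function name fd_weights is ours.

    import numpy as np
    from math import factorial

    def fd_weights(x, y, p):
        # Solve (6.6.1): sum_k beta_k (x_k - y)^i = i! p_i, i = 0..q-1, for the weights beta.
        q = len(x)
        V = np.array([[(xk - y) ** i for xk in x] for i in range(q)])
        rhs = np.array([factorial(i) * p[i] for i in range(q)])
        return np.linalg.solve(V, rhs)

    # First-derivative weights (p = [0, 1, 0]) at y = 0 on the stencil {-1, 0, 1}:
    # this recovers the classical central difference [-1/2, 0, 1/2].
    beta = fd_weights(x=[-1.0, 0.0, 1.0], y=0.0, p=[0.0, 1.0, 0.0])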
Conversely, suppose a formula $(Pf)(y^i) = \sum_{k=1}^{q} \beta_{y^i}^k f(x^{i-k})$ is given for distinct points $y^1 < \ldots < y^{N_y}$.
To recover $(Pf)(x^i)$, $i = q, \ldots, N_x$, we solve the following linear system:
$$(Pf)(x^i) = \frac{(Pf)(y^i) - \sum_{k=0}^{q-1} \beta_{y^i}^k f(x^k)}{\beta_{y^i}^k}, \qquad i = q, \ldots, N_x.$$
Chapter 7
Application to supervised machine learning
D Nx Ny Nz
-1 505 505 -1
-1 456 456 -1
-1 408 408 -1
-1 359 359 -1
-1 311 311 -1
-1 262 262 -1
-1 214 214 -1
-1 165 165 -1
-1 117 117 -1
-1 68 68 -1
The first plot in Figure 7.1 compares the methods in terms of scores, while the second and third
plots provide the discrepancy errors and execution time for the different scenarios defined in Table
7.1.
Interpretation of the results.
• First of all, observe that our RKHS-based method CodPy lab extra, namely the extrapolation
method, provides both the best scores and the worst execution time.
• If we compare the discrepancy error to 1, the result matches the scores of the method CodPy
lab extra. This indicates that the discrepancy error is an appropriate indicator.
• Another kernel method, CodPy lab proj, namely the projection method, is a more balanced
method.
• Both kernel methods are performed here with a standard kernel, namely the Gaussian one,
which is the only parameter of the kernel methods. We emphasize that with kernel engineering
we can easily improve these results. We do not present these improved kernel methods, as
our purpose is to provide a benchmark against standard methods.
Observe that function norms and MMD errors are not method-dependent. Clearly, for this example,
a periodic kernel-based method outperforms the other two. However, it is not our goal to
illustrate an overall advantage of a particular method, but a benchmark methodology, particularly
in the context of extrapolating test set data far from the training set data.
[Figure 7.1: scores, discrepancy errors, and execution time for the different scenarios, as functions of the training set size.]
the label function $f(Z) \in \mathbb{R}^{N_z, D_f}$. Data are recovered from Y. LeCun's MNIST home page; see this
dedicated page for a description of the MNIST database. We test here different values of the
integer $N_x$.
For instance, the following plot shows an image of a hand-written digit, namely the first image $x^1$,
as well as many other digits:
Comparison between methods. We consider here different machine learning models in order to
classify MNIST digits: a support vector classifier (SVC), a decision tree classifier (DT), an AdaBoost
classifier, and a random forest classifier (RF) from the scikit-learn library, as well as TensorFlow's neural
network (NN) model.
For the feed-forward NN we chose 10 epochs with a batch size of 16, the Adam optimization
algorithm, and sparse categorical cross-entropy as the loss function. The network is composed of a
128-unit layer and a 10-unit output layer, with a RELU activation function. All the remaining
hyperparameters in the models are taken to be their default values in scikit-learn or TensorFlow. On the
other hand, we straightforwardly apply our projection operator (3.3.1) with the kernel defined
by a composition of the Gaussian kernel with a mean distance map, where the training set is
$X \in \mathbb{R}^{N_x, 784}$, and $Y \in \mathbb{R}^{N_y, 784} \subset X$ is randomly chosen.
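For illustration, a minimal sketch of a projection-style classifier of this kind, written as a generic Gaussian kernel ridge classifier rather than the CodPy operator itself; the helper name, the small regularization term, and the median-distance bandwidth are our assumptions.

    import numpy as np

    def kernel_classifier(X, labels, Z, h=None, reg=1e-8):
        # One-hot encode the labels, regress them with a Gaussian kernel, and predict the argmax class.
        if h is None:
            h = np.median(np.linalg.norm(X[:, None] - X[None, :], axis=-1))   # mean-distance-style scaling
        K = lambda A, B: np.exp(-np.linalg.norm(A[:, None] - B[None, :], axis=-1) ** 2 / (2 * h ** 2))
        F = np.eye(int(labels.max()) + 1)[labels]                             # one-hot training values f(X)
        coeffs = np.linalg.solve(K(X, X) + reg * np.eye(len(X)), F)
        return K(Z, X).dot(coeffs).argmax(axis=1)

    # Tiny synthetic usage; on MNIST one would pass flattened 784-dimensional rows.
    rng = np.random.default_rng(0)
    X, labels = rng.normal(size=(64, 784)), rng.integers(0, 10, size=64)
    predictions = kernel_classifier(X, labels, rng.normal(size=(16, 784)))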
D Nx Ny Nz
784 32 8 10000
784 64 16 10000
784 128 32 10000
784 256 64 10000
Scores are computed using the formula (2.3.1), a scalar in the interval (0, 1), which measures the
proportion of correctly predicted images.
Conf. Mat.: [10 × 10 confusion matrix heatmap for one of the classifiers on the MNIST test set.]
Figure 7.3 compares the methods in terms of scores, MMD errors, and execution time.
Interpretation of these results.
• First of all, observe that the kernel method CodPy class. extra is a multiple-input/multiple-
output classifier, which is basically an extrapolation method. It provides both the
best scores and the worst execution time.
• By computing 1 minus the discrepancy error, we match the scores of the method CodPy
class. extra. This indicates that the discrepancy error is a relevant indicator here.
• Another RKHS-based method, namely CodPy class. proj, allows us to reduce the computa-
tional complexity of the extrapolation by using a projection of the input data to lower
dimensions. It is a more balanced method with respect to accuracy vs. complexity.
• Both kernel methods use a standard Gaussian kernel, which is the only parameter in the kernel
methods. We emphasize that with kernel engineering we can easily improve these results.
We do not present these improved kernel methods, as our purpose is to benchmark against standard
methods.
Observe that function norms and discrepancy errors are not method-dependent. Clearly, for this
example, a periodic kernel-based method outperforms the other two. However, our goal is not to
demonstrate the superiority of a particular method, but to illustrate a benchmark methodology, particularly
in the context of extrapolating test set data far from the training set.
Figure 7.3: Scores, discrepancy errors and execution time for the MNIST classification problem. The
graph illustrates the performance indicators for different sizes of the training set.
7.4 Reconstruction problems: learning from sub-sampled signals in tomography
This image database consists of a set of high-resolution (512, 512) images, approximately 30 images
for each of 82 patients. The training set is built on the first 81 patients; the 82nd patient
is used for the test set. We first transform the training set database to produce our data. For each
image in the training set (2470 images) we proceed as follows:
• We perform a “high” resolution (256, 256) Radon transform 3 , called a sinogram 4 . A
sinogram is quite similar to a Fourier transform of the original image, generating sinusoids.
• We perform a “low” resolution (8, 256) Radon transform.
• We reconstruct the original image from the high-resolution sinogram to simulate high-resolution
SPECT images from these data. The reconstruction algorithm consists in computing an
inverse Radon transform 5 .
An example of training set construction is presented in Figure 7.4. On the left is the image reconstructed
from the “high resolution” sinogram (middle); the low-resolution sinogram is plotted on the right.
Figure 7.4: high resolution sinogram (middle), low resolution (right), reconstructed image (left)
The test then consists in reconstructing all images of the 82nd patient using low-resolution
sinograms.
A comparison between methods. We present here the test resulting from a benchmark of a
kernel-based method against the SART algorithm6 .
Following our notations (Section 2.1), we introduce:
• The training set $x \in \mathbb{R}^{2473, 2304}$, consisting of 2473 sinograms of resolution (8, 256),
namely all low-resolution sinograms of the first 81 patients, plus the first one of the
82nd patient. This last image is added to check an important feature of these problems: the
learning machine must be able to retrieve an example it has already been given.
• The test set $z \in \mathbb{R}^{29, 2304}$, consisting of 29 sinograms of the 82nd patient, of resolution
(8, 256).
• The training values set $f_x \in \mathbb{R}^{2473, 65536}$, consisting of the 2473 images in “high resolution”.
• The ground truth values $f(Z) \in \mathbb{R}^{29, 65536}$, consisting of 29 images in “high resolution”.
• The first line, named exact, simply outputs the original figures, leading to zero error.
• The second one, named SART, reconstructs the figures using the SART algorithm with
sub-sampled data.
3 An introduction to the Radon transform can be found on this Wikipedia page.
4 We used the standard Radon transform from scikit-image, available at this URL.
5 We used a SART algorithm, with 3 iterations, for the reconstruction, available at this URL.
6 We did not succeed in finding competitive parameters for other methods.
• The third one, named CodPy, reconstructs the figures from the sub-sampled data with the
kernel extrapolation method (3.3.2).
Figure 7.5 plots the first 8 images, presenting the original image on the left, the SART reconstruction
in the middle, and our algorithm's reconstruction on the right. One can check visually that the kernel
method reconstructs the original image better. It would be erroneous to conclude that this reconstruction
process performs better than the SART algorithm; that is not our claim here. We simply illustrate
the capacity of our algorithm to recognize existing patterns: indeed, note that the first image
is perfectly reconstructed, as it is part of the training set. This property emphasizes that such
methods are well suited to pattern recognition problems, as automated tools to support professional
diagnosis.
Figure 7.5: Example of reconstruction original (left), sub-sampled SART (middle), kernel extrapo-
lation (right)
7.5 Appendix
Tables 7.3 and 7.4 indicate performance indicators for the Boston housing prices and MNIST
datasets.
Chapter 8
Application to unsupervised machine learning
• For k-means algorithm, the distance is called the inertia; see (2.3.5).
• For kernel-based algorithms, the distance is the kernel discrepancy or MMD; see (3.3.8).
Importantly, if the distance functional d(X, Y ) is not convex, then a solution to (4.3.1) might
not be unique. For instance, a k-means algorithm usually produces different clusters at different
execution runs.
Comparison between methods. First we use scikit's k-means implementation, which simply
partitions the input data $X \in \mathbb{R}^{N_x, D}$ into $N_y$ sets so as to minimize the within-cluster
sum of squares, referred to as the “inertia”. The inertia is the sum of squared distances of all
points to the centroid $Y \in \mathbb{R}^{N_y, D}$ of their cluster. The k-means algorithm starts with a group of randomly
initialized centroids and then performs iterative calculations to optimize the positions of the centroids
until they stabilize, or until a prescribed number of iterations is reached.
Second, we apply CodPy's MMD minimization-based algorithm described in (4.3.1), using the
distance $d_k(x, y)$ induced by a Gaussian kernel $k(x, y) = \exp(-|x - y|^2)$.
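For illustration, the two indicators can be computed side by side for scikit-learn's k-means centroids; the MMD estimator below is a generic Gaussian-kernel version and not CodPy's implementation, and the bandwidth h is an assumption.

    import numpy as np
    from sklearn.cluster import KMeans

    def gaussian_mmd(X, Y, h=1.0):
        # Squared MMD between the empirical distributions of X and Y with a Gaussian kernel.
        k = lambda A, B: np.exp(-np.sum((A[:, None] - B[None, :]) ** 2, axis=-1) / h ** 2)
        return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 2))
    km = KMeans(n_clusters=20, n_init=10, random_state=0).fit(X)
    print("inertia:", km.inertia_)
    print("discrepancy (MMD):", gaussian_mmd(X, km.cluster_centers_))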
D Nx Ny Nz
-1 1000 128 1000
-1 1000 256 1000
Figure 8.1: Scikit (the first row) and CodPy (second row) clusters interpreted as images
The result of the k-means algorithm is $N_y$ clusters in D = 784 dimensions, i.e., $Y \in \mathbb{R}^{N_y, D}$. Note that
the cluster centroids are themselves 784-dimensional points, and can be interpreted
as the “typical” digit within the cluster. Figure 8.1 plots some examples of computed clusters,
interpreted as images. As can be seen, they are perfectly recognizable.
Finally, we show another benchmark plot, displaying the computed performance indicators of scikit's
k-means and CodPy's MMD minimization-based algorithm in terms of MMD, inertia, accuracy
scores (when applicable), and execution time, using the scenarios in Table 2.6. The higher the scores,
and the lower the inertia and MMD, the better.
[Benchmark plot: scores, discrepancy errors, inertia, and execution time for k-means and CodPy, as functions of Ny.]
The scores are quite high compared to supervised methods for a similar training set size; see the
results in Chapter 7. MMD-based minimization has an inertia indicator comparable to
k-means, which is surprising since k-means algorithms are based on inertia minimization. Moreover,
the scores seem to indicate that the MMD distance is a more reliable criterion than inertia for this
pattern recognition problem.
Comparison between methods. The result of k-means and of CodPy's sharp discrepancy
algorithm is $N_y$ clusters in D dimensions. Notice that the cluster centroids are themselves D-
dimensional points.
We visualize in Figure 8.2 the clusters and corresponding centroids of scikit's k-means and of CodPy's
sharp discrepancy algorithm, for 20 clusters.
[Figure 8.2: clusters and centroids in the (pca1, pca2) plane for scikit's k-means (left) and CodPy's sharp discrepancy algorithm (right), 20 clusters.]
Finally, we present a benchmark plot, displaying the computed performance indicators of scikit’s
k-means and CodPy’s sharp discrepancy algorithms using scenarios from Table 2.6.
1 The German credit risk dataset is described in the kaggle page link
Next we visualize the clusters and corresponding centroids of scikit's k-means implementation and
2 The credit card marketing strategy dataset is detailed on this dedicated kaggle page.
CodPy’s sharp discrepancy algorithm, where we vary the number of clusters Ny from 2 to 4.
Finally, we illustrate a benchmark plot displaying the computed performance indicators of scikit's
k-means and CodPy's sharp discrepancy algorithms.
D Nx Ny Nz
-1 500 15 1000
-1 500 30 1000
-1 500 45 1000
-1 500 60 1000
-1 500 75 1000
-1 500 90 1000
3 You can find more details on this use case by following this Kaggle page link.
Figure 8.3 illustrates confusion matrices for the last scenario of each approach.
Figure 8.3: confusion matrix for CodPy
Finally, we illustrate a benchmark plot that shows the performance of scikit's k-means and
CodPy's sharp discrepancy algorithms in terms of discrepancy errors, inertia, accuracy scores
(when applicable), and execution time.
Comparison between methods. The table with the list of stocks shows that k-means clustering
and MMD minimization both group stocks into coherent groups. Finally, we illustrate a benchmark
plot that shows the performance of scikit's k-means and CodPy's sharp discrepancy algorithms in
terms of discrepancy errors, inertia, accuracy scores (when applicable), and execution time.
8.7 Appendix
Table 8.4: Performance indicators for MNIST dataset
Table 8.6: Performance indicators for credit card marketing database (continued)
where $l^j \in \mathbb{R}^{D_X}$ is the latent variable attached to the picture $y^j$. Note that this matching algorithm
in latent space leads to a quite efficient pattern recognition method.
Observe also that, as the dimension of the latent variable increases, the generated images tend to
be more blurry. This is a dimensional effect: as the dimension increases, the distance between
our training set latent variables and a random sample also tends to increase, and random samples
statistically move away from the training set. We somehow trade off variety for accuracy while tuning the
dimension parameter D of the latent space.
Figure 9.2 shows this effect with a 40-dimensional latent space, showing an example of
reconstruction, see (5.1.3). Starting from the left-hand image, the middle image corresponds to its
reconstruction, while the right-hand image is the closest image in the training set in the sense of
(9.1.1). This militates in favor of pattern recognition algorithms using high-dimensional latent spaces,
as both pictures are quite close in expression, and the reconstruction shares similarities with both
pictures.
Figure 9.1: Original (right) and generated (left) images of CelebA dataset
Figure 9.2: Original (left), reconstruction (middle) and closest pic (right) of the CelebA dataset
Figure 9.3
Note that the previous picture plots the cdf of each sampled marginal, but does not give information
on the full distributions. In Figure 9.4, we plot, for one of our models, a grid of figures, with the cdfs
on the diagonal and the bi-marginal distributions on the off-diagonal items.
Statistics on the marginals can be found in Table 9.1.
Figure 9.4
Note that statistical tests are hardly passed in this example, as the reference distribution is chosen arbitrarily and contains too few data. Never-
theless, with very few data, these algorithms can infer quite convincing conditional distributions.
Here, we separate the malignant class into two halves, having 106 elements each. The first half is used with
the benign class as the training set. The methodology is the following: we learn from a distribution
having 463 entries, then resample 500 examples of the four features for the malignant class, and
compare the generated distribution to the second malignant half. Figure 9.5 presents, as in the iris
case, the cdfs on the diagonal, with the bi-marginal distributions on the off-diagonal items.
The marginal statistics are available in Table 9.2. We noticed that the results are quite sensitive to
the kernel used, and some kernel engineering might be necessary, depending mainly on the distributions.
For instance, a Cauchy kernel is quite well adapted to heavy-tailed distributions. Here, we used a
RELU-type kernel to produce these results.
These tests indicate that the sampled distribution is quite close to the reference one, although
Kolmogorov-Smirnov tests are hardly passed.
Figure 9.5
Note that {1, 2, 3} are labels in this problem, and should not be ordered. Hence we rely on one-hot
encoding to transform these labels into unordered ones, conditioning instead on the
three-dimensional labels {1, 0, 0}, {0, 1, 0}, {0, 0, 1}.
Given a one-hot encoded label $x^i$, i = 1, 2, 3, we generate samples with two conditioning algorithms:
• The kernel generative conditioned method (5.3.3), with a latent space taken as Y,
hence estimating the conditional probabilities $p(\mathbb{Y}|\mathbb{X} = x^i)$.
• The Nadaraya-Watson algorithm (5.3.4), with a latent space taken as Y, hence sampling
the conditional distributions $\mathbb{Y}|\mathbb{X} = x^i$.
Doing so, we resample the original distribution, and we test the capability of the Nadaraya-Watson
algorithm to properly identify the conditioned distribution, as well as this choice of latent variable
for the kernel generative method.
Figure 9.6: Original Circle 2 (left), Sampled NW 2 (middle), Sampled QU 2 (right)
As observed for simpler cases, the Nadaraya-Watson estimation and the generative conditioned
method (5.3.3) infer close conditional probabilities when they both use the same kernel and latent
space, and the produced figures look quite similar.
• The kernel generative conditioned method (5.3.3), with a latent space taken as a
uniform distribution in dimension 2.
• The mixture distribution method (5.3.5).
For each label from 0 to 9, we use these algorithms to produce ten different samples, and the
results are depicted in Figure 9.7.
Figure 9.7: samples produced by the NadarayaWatsonRejectionConditioner, NormalConditioner, and UniformConditioner.
Figure 9.8: Removing hat and glasses from CelebA dataset pictures
For this exercise, the size of the latent space is quite important: if it is too large, the resulting pictures look quite close to the original image, still wearing hat and glasses; if it is too small, the glasses and hat disappear, but the resulting pictures look blurry and similar to each other. We tuned this parameter manually, by trial and error, to produce this figure. The result is mixed: some of the resulting pictures are indeed without glasses and hat, and these attributes have faded in all pictures, but in some cases the faces are hardly recognizable. However, the purpose of this illustration is not to show state-of-the-art image generation, but to illustrate what can be learnt from a small dataset. It also illustrates the difficulty of working with few examples; our main motivation for considering a small dataset is to keep the computation time within ten seconds of CPU time on a standard laptop, from data loading to image and figure generation.
Chapter 10
Application to mathematical finance
We collect in this chapter a number of useful applications of machine learning tools that are relevant for mathematical finance. The presentation is structured into two parts. The first part is dedicated to time series modeling and prediction, where we adopt an economic standpoint: starting from a historical data set consisting of one or several time series observations, we propose a framework capable of defining a variety of stochastic processes matching these observations, which we can then use for forecasts. The second part focuses on pricing functions, which are computationally costly, time-dependent functions defined on stochastic processes. Here, we show that a classical supervised machine learning setting can be used to learn these functions. Once learned, they can be evaluated accurately and efficiently, accurately enough to compute derivatives of the pricing function. The resulting framework can then be used in a real-time setting, as a support for computing more sophisticated metrics used for risk management or investment strategies.
10.1 Free time series modeling
Figure 10.1: price charts for Apple (AAPL), Amazon (AMZN), and Google (GOOGL).
where:
• ϵ ∈ R^{Nϵ,Dϵ,Tϵ}, with possibly different sizes, that is, Nϵ, Dϵ, Tϵ can differ from N_X, D_X, T_X, is considered as a white noise, called latent, observed from the historical dataset by applying the map F to the time series, as ϵ = F(X).
• F : R^{N_X,D_X,T_X} → R^{Nϵ,Dϵ,Tϵ} is a continuous map, supposed invertible, and we denote
$$X = F^{-1}(\epsilon). \qquad (10.1.3)$$
Observe that this framework allows one to combine simpler maps together. For instance, suppose that we consider two different models, involving two maps F_1, F_2, with Im(F_2) ⊂ Supp(F_1); then F := F_1 ◦ F_2 provides another model, with F^{-1} = F_2^{-1} ◦ F_1^{-1}.
In particular, consider any given invertible map F, a given time series X, and observe a noise ϵ = F(X). One can always compose it with the encoder map, see (5.1.1), transforming this noise into another one, ϵ̃ = L(ϵ). Or, if we believe that an exogenous distribution Y is causal for the noise ϵ, one can use the conditioning map (5.3.2) to retrieve ϵ̃ = L(ϵ, Y).
The strategy followed in this section consists of the following steps (a minimal sketch is given after the list):
• First, observe ϵ from the data by applying (10.1.2) to the historical observations X. Consider that the ϵ^{n,k} are N_X × T_X variates of a white noise ϵ.
• Generate new samples ϵ̃ of the latent variable.
• Use the inverse formula, computing X̃ = F^{-1}(ϵ̃). This amounts to sampling new trajectories according to the given model (10.1.2).
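A minimal sketch of these three steps, with a log-return map standing in for F and a naive Gaussian fit standing in for the generative sampler of Chapter 5 (this is our simplification, not the CodPy implementation):

import numpy as np

def observe_noise(X):
    # Step 1: apply F (here, log-returns) to a historical path X of shape (T, D).
    return np.diff(np.log(X), axis=0)

def resample_noise(eps, n_paths, rng):
    # Step 2: generate new variates of the latent noise (naive Gaussian stand-in).
    mean, cov = eps.mean(axis=0), np.cov(eps, rowvar=False)
    return rng.multivariate_normal(mean, cov, size=(n_paths, eps.shape[0]))

def invert_map(X0, eps_tilde):
    # Step 3: apply F^{-1}, cumulating the log-returns and exponentiating.
    return np.exp(np.log(X0)[None, None, :] + np.cumsum(eps_tilde, axis=1))

rng = np.random.default_rng(0)
X = np.exp(np.cumsum(rng.normal(0.0, 0.01, size=(500, 3)), axis=0))   # toy "historical" data
paths = invert_map(X[0], resample_noise(observe_noise(X), n_paths=10, rng=rng))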
The purpose of this approach is to allow for various applications, as follows:
• Benchmarking strategies. Picking t^*_k = t_k, this corresponds to re-sampling the original signal X on the same time lattice. This allows one to draw several simulated trajectories and to compare them with the original one using various performance indicators.
• Monte-Carlo forecast simulations. The idea is quite similar to the previous application, but for future times t^* = [t_{N_X} < t^*_0 < . . .].
• Forward calibration. This case corresponds to a perturbation of the previous case, expressed as a constrained minimization problem of the form inf_Y d(X, Y) subject to E(P(X^{·,*k})) = c_p, where d is a distance, P a vector-valued function, and c_p a real-valued vector.
• PDE pricers, that is, multidimensional trees capable of computing forward prices or sensitivities by solving backward Kolmogorov equations.
We claim that the framework (10.1.2) is quite universal, in that most of the known quantitative models for time series analysis fit into it. Such models are usually built on top of known processes, such as Brownian motions. We can instead consider them as built upon an unknown random variable ϵ, which is observed from historical data and reproduced by generative methods. We can then reinterpret these models as random walk processes. This allows us to better model the short-term dynamics of stochastic processes. Moreover, machine learning proposes new calibration methods. Finally, this framework allows us to define new quantitative models, as will be illustrated later in this section.
$$X^{k+1} = X^k + \epsilon^k. \qquad (10.1.4)$$
A random walk process fits the framework (10.1.2) with the difference map
$$\epsilon = \delta_0(X) := \big(X^{k+1} - X^k\big)_{k=0,\dots}, \qquad X = \delta_0^{-1}(\epsilon) := \Big(X^0 + \sum_{l=0}^{k-1} \epsilon^l\Big)_{k=0,\dots}$$
In particular, provided the ϵ^k are retrieved as variates of a centered random variable ϵ, the central limit theorem states that X^k/√k → N(0, σ), a normal law with zero mean and variance matrix σ = var(ϵ) ∈ R^{D_X,D_X}, as k → ∞, in a distributional sense.
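As a small sketch (ours, not the CodPy implementation), the difference map and its inverse can be checked to round-trip on a toy path:

import numpy as np

def delta0(X):
    # Difference map: eps^k = X^{k+1} - X^k.
    return np.diff(X, axis=0)

def delta0_inv(X0, eps):
    # Inverse map: X^k = X^0 + sum_{l<k} eps^l, keeping the initial value X^0.
    return np.concatenate([X0[None, :], X0[None, :] + np.cumsum(eps, axis=0)], axis=0)

rng = np.random.default_rng(1)
X = np.cumsum(rng.normal(size=(200, 2)), axis=0)       # a toy random walk in dimension 2
assert np.allclose(delta0_inv(X[0], delta0(X)), X)     # the round trip reproduces the path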
Observe also that a Brownian motion W_t fits the framework (10.1.2) with F defined as
$$\delta_{\sqrt{t}}(W_t) = \big(\delta^k_{\sqrt{t}}(W_t)\big)_{k=0,\dots}, \qquad \delta^k_{\sqrt{t}}(W_t) := \frac{W_{t_{k+1}} - W_{t_k}}{\sqrt{t_{k+1} - t_k}}.$$
The Euler scheme (10.1.7) provides the explicit form of (10.1.2) as an integral-type operator, summarized by the expression X = X^0 Exp ◦ Σ ◦ (1/2)(ϵ).
From the historical data set of Figure 10.1, we compute the log-return random variable ϵ appearing in (10.1.7), illustrated in the left part of Figure 10.2 on its first two components (AMZN, AAPL). We can use the encoder setting (5.1.1) to map this noise to any known, latent distribution, for instance a uniform distribution. We then generate another variate of the latent distribution and use the inverse map, the decoder (5.1.2), to simulate a variate of the observed noise ϵ, plotted on the right of Figure 10.2.
It is crucial to test whether the generated distribution is statistically close to the original, historical one. Table 10.2 computes various statistical indicators, such as the first four moments and Kolmogorov-Smirnov tests, to challenge the generative method.
Table 10.2: statistical indicators for the generated noise, per component.

             0                   1                   2
Mean         0.0012 (0.00054)    -3e-05 (0.00032)    0.00091 (0.00067)
Variance     -0.066 (-0.09)      -0.44 (0.029)       -0.09 (-0.19)
Skewness     0.0004 (0.00034)    0.0005 (0.00041)    0.00033 (0.00026)
Kurtosis     2 (0.57)            6.7 (2)             1.4 (0.62)
KS test      0.48 (0.05)         0.93 (0.05)         0.31 (0.05)
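Indicators of this kind can be reproduced with scipy; the following is a minimal sketch on toy arrays (the names are ours, and this is not the CodPy reporting code):

import numpy as np
from scipy import stats

def marginal_report(generated, historical):
    # First four moments of the generated sample and a two-sample KS p-value, per component.
    rows = []
    for j in range(generated.shape[1]):
        g, h = generated[:, j], historical[:, j]
        rows.append({"mean": g.mean(), "variance": g.var(),
                     "skewness": stats.skew(g), "kurtosis": stats.kurtosis(g),
                     "KS p-value": stats.ks_2samp(g, h).pvalue})
    return rows

rng = np.random.default_rng(2)
historical = rng.normal(0.0, 0.02, size=(500, 3))      # toy "historical" log-returns
generated = rng.normal(0.0, 0.02, size=(500, 3))       # toy "generated" log-returns
for row in marginal_report(generated, historical):
    print(row)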
where η is a white noise generated from the known random variable. Ten examples of resampling are plotted in Figure 10.3.
Figure 10.3: Ten examples of generated paths with the free Euler scheme
$$X^k = \mu + \sum_{i=1}^{p} a_i X^{k-i} + \sum_{i=1}^{q} b_i X_{k-i}, \qquad (10.1.8)$$
where the X_{k-i} are white noise, that is, random variables satisfying E(X_i) = 0, E((X_i)^2) = σ^2, Cov(X_i, X_j) = 0 for i ≠ j, and µ is the mean of the process. The ARMA literature proposes several methods to calibrate the coefficients a_i, b_i, σ, such as linear regressions, nonlinear least squares, or maximum likelihood methods. Thus we suppose in the sequel that the coefficients a_1, . . . , a_p, b_1, . . . , b_q are given.
In the context of free models, we no longer suppose that the X_k are white noise random variables, and we can straightforwardly generalize to the multidimensional case.
The expression (10.1.8) gives straightforwardly the map (10.1.2). To compute the inverse map (10.1.3), we use the following relation, see [8],
$$X_k = \mu + \sum_{j=0}^{\infty} \pi_j X^{k-j},$$
with the convention a_0 = −1, a_i = 0 for i > p, and b_j = 0 for j > q. We introduced the backshift operator B(π^k) = π^{k−1} and ϕ(B)(π_j) = Σ_{k=1}^{min(p,q)} b_j π_{j−k}. Considering a range of values where this operator is invertible, we denote its inverse ϕ^{-1}(B).
For the numerics, we consider the autoregressive model of order p, denoted AR(p), which is the ARMA(p, 1) model. The map (10.1.2) is here ϕ(B)(X^k) = ϵ^k, and its inverse is X^k = ϕ^{-1}(B)(ϵ^k). Figure 10.4 shows ten generated trajectories with this AR(p) model (a simulation sketch is given below).
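A minimal sketch of this AR(p) map and of its inverse, written as a plain recursion with illustrative coefficients (not the CodPy implementation):

import numpy as np

def ar_noise(X, a, mu=0.0):
    # Forward map: eps_k = X_k - mu - sum_i a_i X_{k-i}, for k >= p.
    p = len(a)
    return np.array([X[k] - mu - sum(a[i] * X[k - 1 - i] for i in range(p))
                     for k in range(p, len(X))])

def ar_path(X_init, eps, a, mu=0.0):
    # Inverse map: rebuild X_k = mu + sum_i a_i X_{k-i} + eps_k from p initial values.
    X = list(X_init)
    for e in eps:
        X.append(mu + sum(a[i] * X[-1 - i] for i in range(len(a))) + e)
    return np.array(X)

a = [0.6, 0.3]                                         # illustrative AR(2) coefficients
rng = np.random.default_rng(3)
eps = rng.normal(0.0, 0.1, size=500)
X = ar_path([0.0, 0.0], eps, a)                        # simulate an AR(2) path
assert np.allclose(ar_noise(X, a), eps)                # the forward map recovers the noise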
Figure 10.4: Ten examples of generated paths with the ARMA(p,1) Model
with a variance that depends on its past values. The GARCH(p, q) model is defined as follows:
$$X^k = \mu + \sigma^k Z^k, \qquad (\sigma^k)^2 = \alpha_0 + \sum_{i=1}^{p} \alpha_i (X^{k-i})^2 + \sum_{i=1}^{q} \beta_i (\sigma^{k-i})^2.$$
Here, µ is the mean, σ^k is a stochastic variance process, and Z^k is a white noise process. The parameters α_i and β_i are the GARCH parameters.
We can express the variance process (σ^k)^2 in terms of the backshift operator B:
$$(1 - \beta(B))\,(\sigma^k)^2 = \alpha_0 + \alpha(B)(X^k)^2,$$
where α(B) = Σ_{i=1}^p α_i B^i and β(B) = Σ_{i=1}^q β_i B^i. Setting φ(B) := α_0 + α(B) and θ(B) := 1 − β(B), we obtain
$$\sigma^k = \sqrt{\theta^{-1}(B)\,\varphi(B)\,(X^k)^2} =: \sqrt{\pi(B)(X^k)^2}.$$
From here, assuming this operator is invertible, we can obtain the white noise process:
$$Z^k = G(X^k) = \frac{X^k - \mu}{\sqrt{\pi(B)(X^k)^2}}.$$
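A minimal GARCH(1,1) sketch in the spirit of the above recursion, with illustrative parameter values (this is not the CodPy implementation; the noise is recovered by running the same variance recursion forward):

import numpy as np

def garch_simulate(Z, mu=0.0, alpha0=1e-5, alpha1=0.1, beta1=0.85, sigma0=0.01):
    # Simulate X_k = mu + sigma_k Z_k, (sigma_k)^2 = alpha0 + alpha1 (X_{k-1})^2 + beta1 (sigma_{k-1})^2.
    X, sig2 = [], sigma0 ** 2
    for z in Z:
        x = mu + np.sqrt(sig2) * z
        X.append(x)
        sig2 = alpha0 + alpha1 * x ** 2 + beta1 * sig2
    return np.array(X)

def garch_noise(X, mu=0.0, alpha0=1e-5, alpha1=0.1, beta1=0.85, sigma0=0.01):
    # Recover Z_k = (X_k - mu) / sigma_k by running the same variance recursion.
    Z, sig2 = [], sigma0 ** 2
    for x in X:
        Z.append((x - mu) / np.sqrt(sig2))
        sig2 = alpha0 + alpha1 * x ** 2 + beta1 * sig2
    return np.array(Z)

rng = np.random.default_rng(4)
Z = rng.normal(size=1000)
X = garch_simulate(Z)
assert np.allclose(garch_noise(X), Z)                  # the map X -> Z inverts the simulation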
Figure 10.5: Ten examples of generated paths with the GARCH(1,1) Model
where δ(i, j) = 1 if i = j and 0 otherwise (the Kronecker delta). This interpolation corresponds to a model of a time series that is not only determined by causal effects (the positive indices i appearing in (10.1.9)), but also includes market anticipation effects (the negative indices i appearing in (10.1.9)).
Figure 10.6 shows an example of resampling of our historical dataset using this Lagrange interpolation with p = 10 and the map F^{-1} := X_0 Exp ◦ L^{-(10)} ◦ Σ_0 ◦ L^{-1}(η).
Figure 10.6: Ten examples of generated paths with Lagrange interpolation
where
• η_Y is a white noise, that is, an independent random variable.
• G(Y) ∈ R^{Dϵ} is a smooth function. If G is unknown, the denoising procedure (6.3.1) provides a way to calibrate it from historical observations.
For instance, we can elaborate on the model (10.1.9), defining the map F := η_Y ◦ δ_0 ◦ L_{2p} ◦ Log, where Y = X^* ◦ Log(X). The whole model can then be summarized as follows:
$$\ln X^{*,k+1} = \ln X^{*,k} + G(\ln X^{*,k}) + \epsilon^k.$$
This particular conditioning map is designed primarily to capture models following a stochastic differential equation such as the Vasicek model, of the form δr_t = F(r_t)δt + dW_{δt}.
Applying this model produces the resampling of our historical dataset plotted in Figure 10.7. Note that G is calibrated to the historical data using the algorithm (6.3.1), with ϵ = 10^{-3}, X = {X^{*,k}}_{k=0,...} and F = {ϵ^{*,k}}_{k=0,...}.
Figure 10.7
Numerically, we approximate this conditioned distribution by the map (5.3.2). The map composition L ◦ ∆ ◦ Log defines the following scheme:
$$\ln X^{k+1} = \ln X^k + \epsilon^k \mid \ln X^k. \qquad (10.1.13)$$
This scheme produces the resampling of our historical dataset plotted in Figure 10.8.
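A minimal sketch of the scheme (10.1.13), with a Gaussian-kernel Nadaraya-Watson resampler standing in for the conditioning map (5.3.2) (this is our simplification, in dimension one, with an arbitrary bandwidth):

import numpy as np

def conditional_resample(level, levels_hist, returns_hist, rng, bandwidth=0.05):
    # Draw a log-return conditionally on the current log-level, by resampling the
    # historical returns with Gaussian kernel weights centred at `level`.
    w = np.exp(-0.5 * ((levels_hist - level) / bandwidth) ** 2)
    return returns_hist[rng.choice(len(returns_hist), p=w / w.sum())]

def simulate(x0, levels_hist, returns_hist, n_steps, rng):
    # Iterate ln X^{k+1} = ln X^k + eps^k | ln X^k.
    log_x = [np.log(x0)]
    for _ in range(n_steps):
        log_x.append(log_x[-1] + conditional_resample(log_x[-1], levels_hist, returns_hist, rng))
    return np.exp(np.array(log_x))

rng = np.random.default_rng(5)
prices = np.exp(np.cumsum(rng.normal(0.0, 0.01, size=500)))   # toy one-dimensional history
log_p = np.log(prices)
path = simulate(prices[-1], log_p[:-1], np.diff(log_p), n_steps=250, rng=rng)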
Figure 10.8: Ten examples of generated paths with the conditioning model
The scheme (10.1.13) is expected to capture (weakly) stationary stochastic processes, such as CIR (Cox-Ingersoll-Ross) processes. Observe that (10.3.8) also allows for data augmentation, that is, adding extra information to the original dataset. For instance, consider the following map
$$\sigma(X) = \big(\sigma^k(X)\big)_{0 \le k}, \qquad \sigma^k(X) := \mathrm{Tr}(\mathrm{covar})\big(X^{k-q}, \cdots, X^{k+q}\big),$$
where q is a given integer and Tr(covar) stands for the trace of the covariance matrix. Any distribution ϵ can then be conditioned on this variance. In particular, consider the following scheme:
$$\ln X^{k+1} = \ln X^k + \epsilon_X \mid \sigma^k, \qquad \sigma^{k+1} = \sigma^k + \epsilon_\sigma \mid \sigma^k, \qquad (10.1.14)$$
where ε = (ε_X, ε_σ) are the corresponding noise components; this scheme produced Figure 10.9.
The model (10.1.14) is expected to capture stochastic-volatility-type processes (such as Heston, GARCH, etc.).
Figure 10.9: Ten examples of generated paths with the stochastic volatility model
The comparison with a closed formula is the last test; the tests are carried out in several stages, the aim being to better understand these models and to provide a methodology to design and tune them.
The methodology proceeds as follows:
• Setting: We choose a known stochastic process model (here the Heston one) and select the associated parameters. Then we generate a path that will be used as the historical dataset (a simulation sketch is given after this list).
• Calibration: Starting from this path, we pick a free time series model and calibrate it to the historical dataset. We also calibrate the parameters of the stochastic process to match this trajectory (the Heston model is defined through a set of eight parameters).
• Reproduction: We ensure that the generated model can reproduce the initial process. This step is crucial for the generative framework (10.1.2), in order to check that the map is invertible.
• Distribution: Also specific to our generative framework, we check that the distributions of the noise ϵ (see (10.1.2)) computed from the historical data and from the generative model are consistent, using graphical and statistical tests.
• Trajectories: We regenerate trajectories with these new parameters using the same library as for the initial trajectory, and compare them with those produced by the method derived from the generative model.
• Pricing: We consider a function given by the payoff of an option and evaluate its expectation by performing a naive Monte Carlo method on both the known process and the generative one, comparing them to a closed formula whenever possible.
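For the Setting step, a minimal Euler-type discretization of the Heston dynamics can serve to produce the reference path; the parameter values below are illustrative and the variance is truncated at zero (this generic sketch is not the library used to produce the figures):

import numpy as np

def heston_paths(x0, v0, mu, kappa, theta, sigma, rho, T, n_steps, n_paths, rng):
    # Euler scheme for dX = mu X dt + sqrt(v) X dW1, dv = kappa (theta - v) dt + sigma sqrt(v) dW2,
    # with corr(dW1, dW2) = rho and the variance truncated at zero.
    dt = T / n_steps
    X = np.full(n_paths, float(x0))
    v = np.full(n_paths, float(v0))
    out = [X.copy()]
    for _ in range(n_steps):
        z1 = rng.normal(size=n_paths)
        z2 = rho * z1 + np.sqrt(1.0 - rho ** 2) * rng.normal(size=n_paths)
        vp = np.maximum(v, 0.0)
        X = X * np.exp((mu - 0.5 * vp) * dt + np.sqrt(vp * dt) * z1)
        v = v + kappa * (theta - vp) * dt + sigma * np.sqrt(vp * dt) * z2
        out.append(X.copy())
    return np.array(out)                               # shape (n_steps + 1, n_paths)

rng = np.random.default_rng(6)
# these illustrative values satisfy the Feller condition 2*kappa*theta > sigma**2
paths = heston_paths(x0=50.0, v0=0.04, mu=0.0, kappa=1.5, theta=0.04,
                     sigma=0.3, rho=-0.7, T=2.0, n_steps=500, n_paths=1000, rng=rng)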
In the following sections we apply this methodology to three different methods for a Heston process. The first is a calibrated Heston process, and the two others are generative models from our framework, namely the log-diff one (10.1.5).
With a given set of Heston parameters µ, κ, θ, ρ, X_0, ν_0, satisfying the Feller condition 2κθ > σ^2, we generate one path, represented in bold red in Figure 10.12. Observing this path, we calibrate µ = ln(X_T)/ln(X_0) and regenerate several paths, pictured in Figure 10.12-i). These paths will serve later on as benchmarks for our models.
10.2.2 Reproducibility
First of all, we check that the generated model can reproduce the initial process, since the map can be inverted.
Here we compare 1000 trajectories generated, on the left, by a Heston SDE with approximated parameters, and, on the right, what the generative model has reproduced from the initial input trajectory. In both graphs, the initial trajectory we wish to reproduce is shown in red.
With the initial SDE we create a vanilla option, in this case a European call with strike K given by the last value of the initial sample and maturity t = T, i.e. the end of the process. We calculate the price by performing a Monte Carlo on the trajectories of the two methods, and compare it with the closed formula (a minimal sketch of this comparison is given below).
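A minimal sketch of the Monte-Carlo leg of this comparison, reusing paths such as those of the Heston sketch above (the Heston closed formula, which requires a Fourier-type integration, is not reproduced here):

import numpy as np

def mc_call_price(paths, strike, discount=1.0):
    # Monte-Carlo estimate of a European call: discounted mean of the terminal payoff.
    payoff = np.maximum(paths[-1] - strike, 0.0)
    price = discount * payoff.mean()
    stderr = discount * payoff.std(ddof=1) / np.sqrt(payoff.size)
    return price, stderr

# with `paths` of shape (n_steps + 1, n_paths) and the strike set to the last value
# of the reference path, as in the text:
# price, stderr = mc_call_price(paths, strike=paths[-1, 0])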
Figure: trajectories produced by the Heston generator (panels Ref:Heston gen.).
Table 10.4: Heston call prices
where Q is the standard notation for the risk-neutral measure. We distinguish between the function V and its expectation, using the overline notation V̄.
Observe that the previous sections allow us to consider Monte-Carlo methods to estimate (10.3.1). However, for a number of applications, one needs to compute not only a single value, that is, the price (which is V(0, T, X_0) in the above setting), but the whole fair value surface (s, y) → V(s, T, y) (for 0 ≤ s ≤ T and y ∈ Im(X_s)). This latter observation is important in an operational context, since all standard risk measures, such as measures of internal or regulatory nature or optimal investment strategies, can be determined from the knowledge of this surface.
In such a context, Monte-Carlo methods are intractable, so we propose an alternative strategy in this section.
$$\Pi^{l,k} := \big(\pi^{l,k}_{n,m}\big)_{n,m=0}^{N_{X^*}}, \qquad \pi^{l,k}_{n,m} := \mathbb{E}\big(X^{*,n,k} \mid X^{*,m,l}\big), \quad l = 1, \ldots, \ l < k. \qquad (10.3.2)$$
A way to estimate this conditional distribution is to generate numerous trajectories and to use the
conditioning map (5.3.2). However, this approach is computationally intensive, and we propose an
alternative approach in this section.
$$\partial_t \mu - L\mu = 0, \qquad \mu(s, \cdot) = \delta_y, \qquad (10.3.4)$$
which is a convection-diffusion equation. Moreover, the initial data is the Dirac mass δ_y at some point y, while the partial differential operator is
$$L\mu := \nabla \cdot (G\mu) + \nabla^2 \cdot (A\mu), \qquad A := \frac{1}{2}\,\sigma\sigma^T. \qquad (10.3.5)$$
Here, ∇ denotes the gradient operator, ∇· the divergence operator, and ∇² := (∂_i ∂_j)_{1≤i,j≤D} the Hessian operator. We write A · B for the scalar product associated with the Frobenius norm of matrices. We emphasize that weak solutions to (10.3.4), defined in the sense of distributions, must be considered, since the initial data is a Dirac mass.
The (vector-valued) dual of the Fokker-Planck equation is the Kolmogorov equation, also known in mathematical finance as the Black-Scholes equation. This equation determines the unknown vector-valued function P = P(t, x) as a solution, for t ≤ s, to
$$\partial_t P - L^* P = 0, \qquad L^* P := -G \cdot \nabla P + A \cdot \nabla^2 P. \qquad (10.3.6)$$
By the Feynman-Kac theorem, a solution to the Kolmogorov equation (10.3.6) can be interpreted as a time average of an expectation function. Hence our strategy is to solve the Kolmogorov equation (10.3.6) instead of using a Monte-Carlo method. It also allows us to take into account sophisticated strategies based on derivatives, or American exercising.
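For reference, in the standard convention where ∂_t u + G·∇u + A·∇²u = 0 with terminal data u(T, ·) = g (which may differ from the operator convention written above), the Feynman-Kac representation reads
$$u(t, x) = \mathbb{E}\big(g(X_T) \mid X_t = x\big), \qquad 0 \le t \le T.$$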
with ∇· denoting the divergence operator. The map ϵ → B(ϵ) somewhat smooths out the noise ϵ, at the expense of increasing its dimensionality. The resulting matrix field is then conditioned on an external variable, for instance X, as described in the previous section. We summarize this in the following formula:
$$B(\epsilon)^k = \mathbb{E}\big(\nabla^{-1}\!\cdot(\epsilon^k) \mid X^k\big) \in \mathbb{R}^{D_X, D_X}. \qquad (10.3.8)$$
Figure 10.13: Hundred examples of generated paths with conditioned covariance map
where ⟨X · x⟩ are the basket values, X being the weights, and K is the option's strike. We represent this payoff in a two-dimensional plot, with the basket values on the horizontal axis, in the left-hand plot of Figure 10.14.
We attach a pricing function to this payoff, that is, a vector-valued function (t, x) → P(t, x) ∈ R^{D_P}. We represent this pricing function in the right-hand plot of Figure 10.14, again against the basket values. The pricing function here is a simple Black-Scholes formula, hence hypothesizing that the basket values are log-normal.
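A minimal sketch of such a pricer, applying the standard Black-Scholes call formula to the basket value (illustrative weights, spots, and volatility; this is not the pricing function used to produce the figure):

import numpy as np
from scipy.stats import norm

def black_scholes_call(basket, strike, tau, sigma, rate=0.0):
    # Black-Scholes price of a European call on the (log-normal) basket value.
    basket = np.asarray(basket, dtype=float)
    tau = max(tau, 1e-12)                              # guard against tau = 0 at maturity
    d1 = (np.log(basket / strike) + (rate + 0.5 * sigma ** 2) * tau) / (sigma * np.sqrt(tau))
    d2 = d1 - sigma * np.sqrt(tau)
    return basket * norm.cdf(d1) - strike * np.exp(-rate * tau) * norm.cdf(d2)

weights = np.array([0.4, 0.3, 0.3])                    # illustrative basket weights
spots = np.array([150.0, 120.0, 100.0])                # illustrative spot values
basket = weights @ spots
price = black_scholes_call(basket, strike=basket, tau=1.0, sigma=0.2)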
Figure 10.14: payoff values (left) and pricer values (right), plotted against basket values (% K) and time.
X, f(X). According to (3.3.7), the interpolation error committed by the projection operator P_k, defined on a training set X, is driven at any point z by the quantity D_k(z, X). We plot in Figure 10.15 the isocontours of this error function for two distinct training sets (blue dots). In these figures, the test set is plotted in red and corresponds to simulated, intraday market values, produced synthetically for this experiment using the sampler function.
• (right) X is generated as VaR scenarios for three dates t_0 − 1, t_0, t_0 + 1, with a horizon of H = 10 days. VaR (Value at Risk) means here producing synthetic data at time t_0 + H, corresponding to what is referred to as *historical* VaR.
• (left) X is the historical data set.
The test set is generated as VaR scenarios with a 5-day horizon.
Figure 10.15: isocontours of the interpolation error over time and basket values, for the historical training set (left) and the VaR-scenario training set (right), with the training set and test set points overlaid.
This figure motivates the choice of a VaR-type scenario dataset as the training set (right-hand plot in Figure 10.15), in order to minimize the interpolation error. Note that using the historical data set might still be of interest if only historical data are available.
Observe finally that there are three sets of red points in Figure 10.15-(a), as we considered VaR scenarios at three different times t_0 − 1, t_0, t_0 + 1, because we are interested in approximating time derivatives for risk management, such as the theta ∂_t P.
We plot the results of two methods to extrapolate the pricer function on the test set Z (codpy = kernel prediction, Taylor = second-order Taylor approximation) in Figure 10.16. We also plot the reference price (exact = reference price). We compare with a Taylor formula, which is widely used in an operational context.
Figure 10.16: option values (USD) against basket values (% K) for the exact, Taylor, and codpy predictions.
We can also compute Greeks, using the operator (∇_k P)_Z defined in (4.2.4). Here too, we plot the results of two methods to extrapolate the gradient of the pricer function on the test set Z (codpy = kernel prediction, Taylor = second-order Taylor approximation) in Figure 10.17. We also plot the reference Greeks (exact = reference Greeks). This figure thus shows (∇_k P)_Z = ((∂_t P)_Z, (∂_{x_0} P)_Z, . . . , (∂_{x_D} P)_Z), that is, D + 1 plots.
Note that the raw deltas computed with this method present spurious oscillations, because our training set is obtained as an i.i.d. variate; thus we used the denoising procedure (6.3.1) to smooth them out.
Figure 10.17: Theta and deltas (AAPL, AMZN, GOOGL) against basket values (% K) for the exact, Codpy, and Taylor predictions.
Bibliography
[1] A. Antonov, M. Konikov, and M. Spector, The free boundary SABR: natural extension to negative rates, unpublished report, January 2015, available at https://ssrn.com/abstract=2557046.
[2] I. Babuska, U. Banerjee, and J.E. Osborn, Survey of mesh-less and generalized finite
element methods: a unified approach, Acta Numer. 12 (2003), 1–125.
[3] A. Berlinet and C. Thomas-Agnan, Reproducing kernel Hilbert spaces in probability and
statistics, Springer US, Kluwer Academic Publishers, 2004.
[4] M.A. Bessa, J.T. Foster, T. Belytschko, and W.K. Liu, A mesh-free unification: reproducing kernel peridynamics, Comput. Mech. 53 (2014), 1251–1264.
[5] A. Brace, D. Gatarek, and M. Musiela, The market model of interest rate dynamics, Math. Finance 7 (1997), 127–154.
[6] H. Brezis, Remarques sur le problème de Monge–Kantorovich dans le cas discret, Comptes
Rendus Math. 356 (2018), 207–213.
[7] Y. Brenier, Polar factorization and monotone rearrangement of vector-valued functions,
Comm. Pure Applied Math. 44 (1991), 375–417.
[8] P.J. Brockwell and R.A. Davis, Time series: theory and methods, Springer Series in Statistics, 2006.
[9] H. Buehler, Volatility and dividends: volatility modeling with cash dividends and simple
credit risk, February 2010, available at: https://ssrn.com/abstract=1141877.
[10] F. Eckerli and J. Osterrieder, Generative adversarial networks in finance: an overview,
Comput. Methods Appl. Mech. Engrg.(2021).
[11] G.E. Fasshauer, Mesh-free methods, in “Handbook of Theoretical and Computational
Nanotechnology”, Vol. 2, 2006.
[12] G.E. Fasshauer, Mesh-free approximation methods with Matlab, Interdisciplinary Math.
Sciences, Vol. 6, World Scientific Publishing Co. Pte. Ltd., Hackensack, NJ, 2007.
[13] G.E. Fasshauer, Positive definite kernels: past, present and future, unpublished report,
available at http://www.math.iit.edu/∼fass/PDKernels.pdf.
[14] A. Gretton, K.M. Borgwardt, M. Rasch, B. Schölkopf, and A.J. Smola, A kernel
method for the two sample problems, Proc. 19th Int. Conf. on Neural Information Processing
Systems, 2006, pp. 513–520.
[15] B. Schölkopf, R. Herbrich, and A.J. Smola, A generalized representer theorem, in Computational learning theory, Springer Verlag, 2001, pp. 416–426.
[16] F.C. Günther and W.K. Liu, Implementation of boundary conditions for meshless methods,
Comput. Methods Appl. Mech. Engrg. 163 (1998), 205–230.
[37] P.G. LeFloch, J.-M. Mercier, and S. Miryusupov, CodPy: a kernel-based reordering
algorithm, January 2021, available at ssrn.com/abstract=3770557.
[38] P.G. LeFloch, J.-M. Mercier, and S. Miryusupov, CodPy: RKHS-based polar factor-
ization and sampling algorithm, in preparation.
[39] P.G. LeFloch, J.-M. Mercier, and S. Miryusupov, CodPy: RKHS-based algorithms and conditional expectations, in preparation.
[40] P.G. LeFloch, J.-M. Mercier, and S. Miryusupov, CodPy: Support Vector Machines
(SVM) for (reverse) stress tests in finance, in preparation.
[41] S.F. Li and W.K. Liu, Mesh-free particle methods, Springer Verlag, Berlin, 2004.
[42] G.R. Liu, Mesh-free methods: moving beyond the finite element method, CRC Press, Boca
Raton, FL, 2003.
[43] G.R. Liu, An overview on mesh-free methods for computational solid mechanics, Int. J.
Comp. Methods 13 (2016), 1630001.
[44] J.-M. Mercier and S. Miryusupov, Hedging strategies for net interest income and economic values of equity, unpublished report, Sept. 2019, available at: https://ssrn.com/abstract=3454813.
[45] E.A. Nadaraya, On estimating regression, Theory Probab. Appl. 9 (1964), 141–142. doi:10.1137/1109020
[46] Y. Nakano, Convergence of mesh-free collocation methods for fully nonlinear parabolic
equations, Numer. Math. 136 (2017), 703–723.
[47] F. Narcowich, J. Ward, and H. Wendland, Sobolev bounds on functions with scattered
zeros, with applications to radial basis function surface fitting, Math. of Comput. 74 (2005),
743–763.
[48] H. Niederreiter, Random number generation and quasi-Monte Carlo methods, CBMS-NSF
Regional Conf. Series in Applied Math., Soc. Industr. Applied Math., 1992.
[49] H.S. Oh, C. Davis, and J.W. Jeong, Mesh-free particle methods for thin plates, Comput.
Methods Appl. Mech. Engrg. 209/212 (2012), 156–171.
[50] R. Opfer, Multiscale kernels, Adv. Comput. Math. 25 (2006), 357–380.
[51] R. Rosipal and L.J. Trejo, Kernel partial least squares regression in reproducing kernel
Hilbert space, J. Machine Learning Res. 2 (2001), 97–123.
[52] R. Salehi and M. Dehghan, A moving least square reproducing polynomial mesh-less
method, Appl. Numer. Math. 69 (2013), 34–58.
[53] M. Sathyapriya and V. Thiagarasu, A cluster-based approach for credit card fraud
detection system using Hmm with the implementation of big data technology, Unpublished
report 2019.
[54] R. Sinkhorn and P. Knopp, Concerning nonnegative matrices and doubly stochastic
matrices, Pacific J. Math. 21 (1967), 343–348.
[55] B.K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Scholkopf, and G.R. Lanck-
riet, Hilbert space embeddings and metrics on probability measures, J. Mach. Learn. Res. 11
(2010), 1517–1561.
[56] J. Sirignano and K. Spiliopoulos, DGM: a deep learning algorithm for solving partial
differential equations, J. Comput. Phys. 375 (2018), 1339–1364.
[57] I.M. Sobol, Distribution of points in a cube and approximate evaluation of integrals, U.S.S.R
Comput. Maths. Math. Phys. 7 (1967), 86–112.
[58] A. Smola, A. Gretton, L. Le Song, and B. Scholkopf, A Hilbert space embedding for
distributions, IFIP Working Conference on Database Semantics, 2009.
[59] P. Traccucci, L. Dumontier, G. Garchery, and B. Jacot, A triptych approach for
reverse stress testing of complex Portfolios, unpublished report, available at ArXiv:1906.11186
[60] R.S. Varga, Matrix iterative analysis, Springer Verlag, 2000.
[61] C. Villani, Optimal transport, old and new, Springer Verlag, 2009.
[62] H. Wendland, Sobolev-type error estimates for interpolation by radial basis functions, in
“Surface fitting and multiresolution methods” (Chamonix-Mont-Blanc, 1996), Vanderbilt Univ.
Press, Nashville, TN, 1997, pp. 337–344.
[63] H. Wendland, Scattered data approximation, Cambridge Monograph, Applied Comput.
Math., Cambridge Univ., 2005.
[64] J.X. Zhou and M.E. Li, Solving phase field equations using a mesh-less method, Comm.
Numer. Methods Engrg. 22 (2006), 1109–1115.
[65] B. Zwicknagl, Power series kernels, Constructive Approx. 29 (2008), 61–84.