SSRN 4300756
Abstract
We propose a highly optimized latent factor representation of the yield curve
obtained by training a variational autoencoder (VAE) to curve data from multiple
currencies. A curious byproduct of such training is a “world map of latent space”
where neighbors have similar curve shapes, and distant lands have disparate curve
shapes. The proposed VAE-based mapping offers a high degree of parsimony, in some cases matching the accuracy of classical methods that use one more state variable.
In the second part of the paper, we describe four types of autoencoder market
models (AEMM) in Q- and P-measure. Each autoencoder-based model starts from
a popular classical model and replaces its state variables with autoencoder latent
variables. This replacement leads to a greater similarity between the curves generated
by the model and historically observed curves, a desirable feature in both Q- and
P-measure. By aggressively eliminating invalid curve shapes from its latent space, VAE prevents them from appearing within the model, without the intricate constraints on the stochastic process that classical models use for the same purpose. This makes VAE-based models more robust and simplifies their calibration.
We conclude by discussing potential applications of the new models and of the VAE-based latent factor representation on which they are built.
Contents
1 Introduction 3
3 Models in Q-Measure 22
3.1 Forward Rate Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1.1 HJM and LMM Models . . . . . . . . . . . . . . . . . . . . . 23
3.1.2 AFNS and FHJM Models . . . . . . . . . . . . . . . . . . . . 24
3.1.3 Forward Rate AEMM . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Multi-Factor Short Rate Models . . . . . . . . . . . . . . . . . . . . . 26
3.2.1 Multi-Factor Hull-White Model . . . . . . . . . . . . . . . . . 27
3.2.2 Multi-Factor Short Rate AEMM . . . . . . . . . . . . . . . . . 28
4 Models in P-measure 32
4.1 Autoregressive Models . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.1.1 Dynamic Nelson-Siegel Model . . . . . . . . . . . . . . . . . . 32
4.1.2 Autoregressive AEMM . . . . . . . . . . . . . . . . . . . . . . 34
4.2 Dual-Measure Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2.1 Risk Premium Estimation . . . . . . . . . . . . . . . . . . . . 35
4.2.2 HSW and BDL Models . . . . . . . . . . . . . . . . . . . . . . 36
4.2.3 Dual-Measure AEMM . . . . . . . . . . . . . . . . . . . . . . 37
4.3 Pricing Under P-Measure . . . . . . . . . . . . . . . . . . . . . . . . . 39
5 Conclusion 40
6 Acknowledgments 41
1. Introduction
Properties of a term structure interest rate model are to a large extent determined by
the choice of its state variables. Having too many negatively affects performance and
causes parameter estimation issues. Having too few or choosing them poorly causes
the model to miss certain risks. This is why dimension reduction, i.e., decreasing the
number of state variables with the least possible loss of accuracy, is of paramount
importance.
In one of the early successes of using machine learning for dimension reduction,
Kondratyev [1] demonstrated that feedforward neural networks outperform classical
regression techniques such as PCA in describing the evolution of interest rate curve
shapes in P-measure. Bergeron et al. [2] and Buehler et al. [3] used VAE to reduce
the dimension of the volatility surface. In this paper, we derive a highly optimized
VAE-based representation of the yield curve and propose a new category of interest
rate models in Q- and P-measure that produce VAE-generated curve shapes to which
minimal corrections are applied to keep the model arbitrage-free.
Dimension reduction is a compression algorithm, not unlike those used to com-
press an image. The maximum possible degree of compression depends on the uni-
verse of images the algorithm is designed for. Because JPEG and similar general-
purpose image compression algorithms impose no restrictions on what the image can
depict, they have a moderate rate of compression (around 10× for JPEG). For these
general-purpose algorithms, dimensions of the compressed data (i.e., bits of the com-
pressed file) are local, in the sense that each of them encodes information from a
group of nearby pixels.
Variational autoencoders (VAE) are machine learning algorithms that provide
a fundamentally different type of compression. The rate of compression they can
achieve is multiple orders of magnitude higher than general-purpose compression
algorithms. Such a tremendous performance gain can only be achieved by training
the algorithm to compress a specific type of image, such as the image of a human
face. In the process of aggressively eliminating implausible combinations of pixels in
pursuit of better compression, something quite remarkable happens – dimensions of
the compressed image acquire meaning.
When using VAE to encode images of a human face, dimensions of the compressed
data (the latent variables, named after the Latin word “lateo” which means “hidden”)
become associated with realistic changes to the image of a human face, such as adding
a smile or changing hair color. This happens for the simple reason that the only
combinations of pixels not eliminated by training on a large library of human face
images are those that correspond to realistic faces. In machine learning, this “feature
extraction” effect is frequently a more important objective than compression itself.
The latent factors obtained in this manner are global because they can affect pixels
that may be far away from each other in the image (e.g., a dimension that encodes
hair color).
Let us now contemplate what a similar approach can do for the interest rate term
structure models. The first thing to consider is what we should be compressing. In
order to build a VAE-based counterpart to a stochastic volatility term structure
model such as SABR-LMM [4], we would need to compress both the yield curve
and the volatility surface into a single latent space (throughout the paper, we will
be using the term “volatility surface” generically to describe volatility surface or
volatility cube). For deterministic volatility term structure models, the volatility
surface is a function of the yield curve, and accordingly the yield curve is the only
thing we need to compress. This is the model type we will focus on in this paper.
VAE-based stochastic volatility models will be described in a separate publication.
Continuing with the image analogy, a smoothing spline fit to the yield curve is
similar to JPEG in the sense that its dimensions and the structure it imposes are both
local. On the other hand, the Nelson-Siegel [5] basis is similar to VAE in the sense
that its dimensions (roughly corresponding to the level, slope, and convexity) and
the structure it imposes (for example, not permitting a curve to have both minimum
and maximum at the same time) are both global. Having global dimensions leads to
a higher degree of compression for the Nelson-Siegel basis compared to the smoothing
spline.
Is aggressive dimension reduction necessary for term structure interest rate mod-
els, and can machine learning help find a more effective way to achieve it than the
Nelson-Siegel basis? We answer both questions in the affirmative. Many of the interest rate models popular with practitioners, including multi-factor short rate models [6, 7], the Cheyette model [8], and others, are Markovian in a small number of
state variables, usually between two and four. Considering the aggressive dimension
reduction that must occur when an extraordinary variety of historical yield curve
shapes is compressed into a small number of state variables, a sophisticated com-
pression algorithm is clearly required. And yet, the majority of classical models use
an exogenously specified SDE or factor basis whose selection is driven by criteria
unrelated to optimal compression.
In this paper, we describe several models in Q- and P-measure that start from a
popular classical model specification and replace the classical model’s state variables
by autoencoder latent variables. This replacement leads to greater similarity between
curves generated by the model and historically observed curves, a desirable feature
in both Q- and P-measure. We call the new model category “autoencoder market models”, or AEMM.
Aggressively eliminating infeasible curve shapes by VAE training creates state
variables that represent only valid curves. This prevents AEMM from generating
unrealistic curve shapes without using intricate constraints on curve dynamics.
The improvement in accuracy of mapping historical curve observations to model
state variables brought about by the use of autoencoders can be measured from first principles, providing a rigorous way to compare the proposed machine learn-
ing approach with its classical counterparts. Our results indicate that the use of
autoencoders leads to a significant and measurable improvement in the accuracy of
representing complex curve shapes compared to classical methods with the same
number of state variables. In turn, this makes AEMM perform better compared to
the corresponding classical models.
The rest of the paper is organized as follows. We describe the machine learning architecture and present the results of using autoencoders to compress LIBOR and
OIS swap curves in Chapter 2. After that, we describe how to convert four distinct
classical model types to AEMM by switching from classical state variables to VAE
latent variables. In each case, the objective of such conversion is to make curve
shapes generated by the model closer to the prevailing historical curve shapes for the
model’s currency than would have been possible with the original classical model.
The conversion of Q-measure models is discussed in Chapter 3 and P-measure models
in Chapter 4. The paper concludes with Chapter 5 where we summarize key results.
2.2. Classical Nelson-Siegel Basis
Before building an autoencoder-based curve representation, we will briefly review our
classical benchmark. The Nelson-Siegel basis is specified in terms of the continuously
compounded yield R(t, t + τ ) of a zero coupon bond observed at time t for time-to-
maturity τ = T − t (“zero rate”). The zero rate is the average of the instantaneous
forward rate f(t, T′) over time interval t < T′ < t + τ:

$$R(t, t+\tau) = \frac{1}{\tau}\int_t^{t+\tau} f(t, T')\,dT' \qquad (1)$$
The canonical form of the Nelson-Siegel basis is given by:
$$R(t, t+\tau) = \beta_1 + \beta_2\,\frac{1 - e^{-\lambda\tau}}{\lambda\tau} + \beta_3\left(\frac{1 - e^{-\lambda\tau}}{\lambda\tau} - e^{-\lambda\tau}\right) \qquad (2)$$
where the three latent factors β1,2,3 have a simple interpretation as the parallel shift, slope, and curvature of the instantaneous forward curve.
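As an illustration, the basis (2) can be evaluated in a few lines of Python; the sketch below is ours, and the factor values and λ are illustrative rather than estimates from this paper:

```python
import numpy as np

def nelson_siegel_zero_rate(tau, beta1, beta2, beta3, lam):
    """Zero rate R(t, t+tau) from the canonical Nelson-Siegel basis (2)."""
    x = lam * tau
    slope_loading = (1.0 - np.exp(-x)) / x
    curvature_loading = slope_loading - np.exp(-x)
    return beta1 + beta2 * slope_loading + beta3 * curvature_loading

# Illustrative factors: 4% level, -1% slope, 0.5% curvature, decay 0.4/year
tau = np.array([2.0, 3.0, 5.0, 10.0, 15.0, 20.0, 30.0])
print(nelson_siegel_zero_rate(tau, 0.04, -0.01, 0.005, 0.4))
```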
To gain insight into the origins of the peculiar linear-exponential form of the
Nelson-Siegel basis, it is highly illuminating to review the justification for choosing
this form presented by its authors in [5]. In their paper, Nelson and Siegel de-
scribed two simple parametric forms for the instantaneous forward rate f (t, T ). One
of these forms was swiftly rejected due to having too many parameters, while the
other became the canonical Nelson-Siegel formula (2) after being converted from the
instantaneous forward rate f (t, T ) to the zero rate R(t, t + τ ).
The first, subsequently rejected, form represented the instantaneous forward rate f(t, T) as the sum of a constant and two mean-reverting terms, each with its own rate of decay λi:

$$f(t, t+\tau) = \beta_1 + \beta_2 e^{-\lambda_2\tau} + \beta_3 e^{-\lambda_3\tau} \qquad (3)$$
When a similar functional form with two exponents is encountered in the context of
the two-factor Hull-White (HW2F) model [6], the typical calibration makes the two rates
of decay λ2,3 very different, one representing slow and the other fast reversion to the
mean. Nelson and Siegel however did exactly the opposite and proposed to make
the difference between the two parameters λ2,3 infinitesimally small. Using Taylor
expansion of (3) in dλ = λ2 − λ3 and omitting higher order terms, they arrived at
the linear-exponential form of their basis:
Substituting this form into (1), we obtain the canonical expression (2) for the zero
rate.
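To make the omitted expansion step explicit (the relabeling of coefficients below is ours, but the calculation is elementary): writing λ = λ2 and λ3 = λ2 − dλ,

$$\beta_2 e^{-\lambda_2\tau} + \beta_3 e^{-\lambda_3\tau} = e^{-\lambda\tau}\left(\beta_2 + \beta_3 e^{d\lambda\,\tau}\right) \approx (\beta_2 + \beta_3)\,e^{-\lambda\tau} + \frac{\beta_3\,d\lambda}{\lambda}\,\lambda\tau\,e^{-\lambda\tau}$$

Relabeling β2 + β3 → β2 and β3 dλ/λ → β3 (held finite as dλ → 0) yields (4).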
An important takeaway from this brief review of the origins of the Nelson-Siegel
formula is that its distinctive linear-exponential form is somewhat of an accident,
resulting from an attempt to reduce the number of parameters after starting from a
more conventional exponential form and omitting higher order terms in the Taylor
expansion. While the Nelson-Siegel basis achieves a higher degree of parsimony in
representing curve shapes compared to local representations such as the smoothing
spline, it is clear that achieving the maximum degree of compression was not the
explicit objective of its selection. As a result, large areas of the three-dimensional
latent space of the Nelson-Siegel basis correspond to unrealistic curves. We will aim
to increase the degree of parsimony by using VAE to aggressively eliminate infeasible
curve shapes, creating a more compact latent space where historical data is densely
packed.
Figure 1: Schematic architecture and training loop of (a) the conventional autoencoder, (b) VAE, and (c) CVAE. The encoder compresses N historical swap rates Sn into K latent variables zk; in VAE and CVAE, the encoder outputs 2K distribution parameters (µk, σk) and the sampler draws zk = µk + σk ω with P[ω] = N(0, 1); the decoder produces N reconstructed swap rates Sn′. The training loop minimizes (1/N) Σn L2(Sn, Sn′) for the conventional autoencoder, with the added penalty β Σk KLD(µk, σk) for VAE and CVAE; CVAE appends C currency bits yc to the encoder and decoder inputs.
$$D_{VAE} = D_{L2} + \beta\,D_{KLD} = \frac{1}{N}\sum_{n=1}^{N} L_2(\mathbf{S}_n, \mathbf{S}'_n) + \beta\sum_{k=1}^{K} KLD(\mu_k, \sigma_k)$$

where Sn are historical swap rates and Sn′ their reconstruction. We will be using boldface to indicate vectors throughout the paper. The two components of DVAE are shown schematically in Figure 2.
Figure 2: The two components of DVAE: the L2 reconstruction loss between a historical swap rate observation and its decoded reconstruction (left, swap rate vs. maturity), and the KLD loss DKLD between the encoded distribution N(µ, σ) and the standard normal N(0, 1) in latent space (right).
The reconstruction loss DL2 acts as a repelling force that pushes historical curve
shapes away from each other in latent space to ensure each can be accurately decoded
without mixing. This causes the latent space of most conventional autoencoders to
contain large gaps where no historical data is encoded. In the absence of a penalty
term that would discourage the appearance of such gaps, they reduce reconstruction
loss by creating physical separation between distinct yield curve shapes as shown in
Figure 3(a). When a point inside one of these latent space gaps is decoded, it will
likely produce an unrealistic curve shape because there are no historical data points
nearby.
In VAE, the KLD loss DKLD pulls (µk , σk ) toward (0, 1) as shown in Figure 3(b).
However, if each observation were to be encoded into the same point (0, 1), the
decoder would be unable to distinguish between them and L2 reconstruction loss
would increase dramatically. With small enough β, the mild attractive force exerted
by KLD loss makes areas of latent space where distinct curve shapes are encoded
move closer to each other and “touch”, but do not overlap. This is the mechanism
by which VAE eliminates gaps in latent space and generates a continuous and well-
regularized mapping between the input space and the latent space, as shown in
Figure 3(b).
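A minimal PyTorch sketch of this objective; the layer sizes and activation are placeholder choices rather than the configuration of Tables 1–3, while the value β = 10⁻⁷ follows Section 2.4.1:

```python
import torch
import torch.nn as nn

class CurveVAE(nn.Module):
    """Minimal VAE: N swap rates -> K latent variables -> N reconstructed rates."""
    def __init__(self, n_rates=7, n_latent=2, n_hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_rates, n_hidden), nn.ReLU(),
                                     nn.Linear(n_hidden, 2 * n_latent))  # (mu, logvar)
        self.decoder = nn.Sequential(nn.Linear(n_latent, n_hidden), nn.ReLU(),
                                     nn.Linear(n_hidden, n_rates))

    def forward(self, s):
        mu, logvar = self.encoder(s).chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # z_k = mu_k + sigma_k * omega
        return self.decoder(z), mu, logvar

def vae_loss(s, s_rec, mu, logvar, beta=1e-7):
    """L2 reconstruction loss plus beta-weighted KLD from N(mu, sigma) to N(0, 1)."""
    rec = ((s - s_rec) ** 2).mean()
    kld = -0.5 * (1.0 + logvar - mu ** 2 - logvar.exp()).sum(dim=-1).mean()
    return rec + beta * kld
```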
After training, the sampling step is eliminated because the randomness it adds
is no longer needed. This is usually done by discarding σk and using µk to obtain a
deterministic mapping. Doing so is not always optimal because the center µ(S) of
the distribution that minimizes the sum of L2 and KLD loss during training is not
necessarily the point in latent space that minimizes L2 loss for encoding a specific
input vector Sn, regardless of whether or not such a vector comes from an observation
in the training dataset.
A post-processing step described in [11] increases the accuracy of VAE mapping
by performing gradient descent minimizing L2 loss starting from the center µ(S) of
the distribution produced by the encoder. This additional step must be implemented
in a way that ensures continuity and regularity of the mapping from the input space
to the latent space.
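A sketch of this refinement step, assuming a differentiable decoder; the step count and learning rate are illustrative, and `refine_latent` is our hypothetical name rather than notation from [11]:

```python
import torch

def refine_latent(decoder, s_target, z_init, steps=50, lr=0.01):
    """Gradient descent on L2 reconstruction loss in latent space, starting from mu(S)."""
    z = z_init.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((decoder(z) - s_target) ** 2).mean()
        loss.backward()
        opt.step()
    return z.detach()
```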
If the currency were encoded as a single integer label, the network would assume that currencies with ordinal numbers next to each other are similar, causing bias. Such bias can be avoided if each of C currency labels is converted into a binary
sequence of length C where the bit that corresponds to the observation’s currency
is set to 1 and all other bits to 0. This approach, called “one-hot encoding”, is
the standard way to avoid label ordering bias in classifiers. With one-hot encoding,
the total dimension of input space is N + C, with N axes for the swap rates Sn
and C additional axes for the currency bits yc that have two possible values of 0 or
1 as shown in Figure 4(b). The dimensions of CVAE encoder input are shown in
Figure 5(a) and the dimensions of the decoder input in Figure 5(b).
Figure 4: Input space of CVAE: N swap rate axes Sn augmented with C one-hot currency axes yc taking values 0 or 1 (panels (a) and (b)).
Figure 5: Dimensions of the encoder input (a) and decoder input (b) for
CVAE.
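Assembling the N + C encoder input is straightforward; a minimal sketch using the currency list of Section 2.4.1:

```python
import numpy as np

CURRENCIES = ["AUD", "CAD", "CHF", "CZK", "DKK", "EUR", "GBP",
              "JPY", "MXN", "NOK", "SEK", "USD", "ZAR"]

def cvae_encoder_input(swap_rates, ccy):
    """Concatenate N swap rates with C one-hot currency bits."""
    one_hot = np.zeros(len(CURRENCIES))
    one_hot[CURRENCIES.index(ccy)] = 1.0
    return np.concatenate([swap_rates, one_hot])
```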
become less dependent on interpolation between currencies and in the limit of infinite
time series length will closely match the output of VAE trained to single-currency
data.
Figure 6: Latent space geometry for (a) multi-currency CVAE and (b) multi-
currency VAE. Dashed lines schematically show the historical data envelope
for each currency.
Figure 7: Shared latent space of multi-currency VAE, where the historical data envelope of the modelled currency is surrounded by the envelopes of other currencies.
Each of the K-dimensional currency hyperplanes in CVAE has its own historical data envelope shown by dashed lines in Figure 6(a).
To be used in model construction, any mapping from the yield curve to the
model’s state variables must be capable of extrapolation outside the historical data
envelope for the currency being modelled in order to describe curve shapes that have
not occurred for that currency in the past. CVAE performs such extrapolation for
each K-dimensional hyperplane separately, guided by data in other hyperplanes.
An alternative and potentially more effective way to perform extrapolation out-
side the currency’s historical data envelope is to encode the swap rates from all cur-
rencies into a shared latent space, as shown in Figure 6(b). This can be accomplished
by going back to the non-conditional VAE architecture described in Figure 1(b) and
training it to multi-currency data. When the stochastic process reaches an area
where no data is encoded for the currency being modelled, at first no extrapolation
would be required as the mapping will be guided by historical data from adjacent
currencies as shown in Figure 7. Extrapolation would only kick in when the stochas-
tic process reaches even more distant areas of latent space where no historical data
for any currency is encoded, a rare occurrence in a properly calibrated model.
2.4. Results
2.4.1. Data and Configuration
The training dataset consists of daily observations of 2y, 3y, 5y, 10y, 15y, 20y, and
30y LIBOR and OIS swap rates for AUD, CAD, CHF, CZK, DKK, EUR, GBP,
JPY, MXN, NOK, SEK, USD, and ZAR. Historical data is visualized in Figure 8.
Only those observation dates where each swap maturity was present were included.
Because our objective is to build an autoencoder for the curve shapes rather than
the distribution of their shocks, the improvement from using observations more frequent than monthly (e.g., daily) would not be significant, as prevailing curve shapes
are well sampled even at the monthly frequency. As a practical consideration, data
vendors provide longer historical time series for monthly observations compared to
daily observations. We found that results obtained using daily observations are
indeed quite similar given the same time series length.
As Diebold and Li noted in [15], using unequally spaced liquid swap rates to
compute reconstruction loss provides a natural way to overweight shorter maturities
where the curve has more structure. Keeping this in mind, we assigned equal weight
to all swap rates in the L2 reconstruction loss term. We do not estimate the LIBOR-
OIS spread, expecting VAE to learn how to represent both types of curves directly
from the data.
Neural network configuration for the single-currency VAE is described in Table 1, for the multi-currency VAE in Table 2, and for the multi-currency CVAE in Table 3. Before the input swap rates are passed to the encoder, they are mapped by a linear transformation from the interval between Sn = −5% and Sn = 25% to the (0, 1) interval. The distribution N(µk, σk) is encoded using mean and logvar. We used the value of β = 10⁻⁷ for all three architectures (single- and multi-currency VAE and multi-currency CVAE).
Figure 8: Historical LIBOR and OIS swap rates for all currencies.
Figure 9: Distribution of in-sample root-mean-square error (RMSE) of swap
rate reconstruction for single-currency VAE, multi-currency CVAE, and
multi-currency VAE with two latent dimensions across all currencies, ma-
turities, and observation dates. The vertical axis is probability density in
arbitrary units and the horizontal axis is swap rate RMSE.
Figure 11: Distribution of in-sample root-mean-square error (RMSE) of swap
rate reconstruction by currency across all maturities and observation dates.
The vertical axis is probability density in arbitrary units and the horizontal
axis is swap rate RMSE.
Figure 13: World map of latent space obtained using VAE trained to multi-
currency data. Panel (a) shows latent space mapping of daily historical swap
rate observations for all currencies. Each ellipse in panel (b) encloses two
standard deviations of data along each principal axis for one currency.
Figure 14: Historical (left) vs. reconstructed (right) swap rates for JPY,
USD, and ZAR (from top to bottom) using VAE trained to multi-currency
data.
Figure 15: The ellipse in latent space that encloses two standard deviations
of data (left) along each principal semi-axis vs. curves obtained by moving
around its perimeter (right) for JPY, USD, and ZAR (from top to bottom).
Each marker on the ellipse perimeter in the left panel corresponds to a
single curve of matching color in the right panel. Blue markers show the
latent space mapping of daily historical swap rate observations for the same
currency and gray markers for all other currencies.
The choice of the number of state variables should be made in light of the purpose for which the model is used, and balanced against the greater challenges of parameter estimation for models with more factors. Having a highly
parsimonious, densely packed latent space means that residual risks that a VAE-
based model does not capture will be smaller compared to a classical model with the
same number of state variables.
If having two state variables is still deemed insufficient even when the greater parsimony of VAE-based models is taken into account, our approach can be used for a higher-dimensional VAE with minimal changes. Based on the evidence presented here, we conjecture that VAE-based models will continue to compress more effectively than classical models as the number of state variables increases.
In the rest of the paper, we use multi-currency VAE with two latent dimensions unless otherwise noted.
Figure 11 shows the distribution of in-sample swap rate reconstruction RMSE by currency across all maturities and observation dates. We found that multi-
currency VAE performs well for all currencies in the dataset, with moderate varia-
tion in accuracy between currencies. The error is somewhat larger in currencies with
shorter time series, but even in those currencies it rarely exceeds 20bp. Figure 12
shows in-sample vs. out-of-sample distribution of root-mean-square error (RMSE)
of swap rate reconstruction across all currencies, maturities and observation dates.
In-sample results were obtained by using the entire dataset for both training and
measurement. Out-of-sample results were obtained by using the data for 2011 and
prior years for training, and subsequent years for measurement without overlap. The
decrease in accuracy for out-of-sample results is surprisingly minor despite a much
shorter training period. There is no indication of overfitting.
3. Models in Q-Measure
When interest rate term structure models were first introduced, their primary pur-
pose was pricing derivatives for the front office where performance benefits of an
analytical solution were key to model adoption. To admit an analytical solution, the
model must assume highly stylized behaviors such as a specific form of volatility,
constant speed of mean reversion, and others. Practitioners are well aware that fi-
nancial markets follow these behaviors at best approximately, but are willing to tolerate model error for the performance gain of an analytical solution.
When a model is used for pricing a single derivative instrument with calibration
to its natural hedges, the model’s role is akin to interpolation, and the impact of
approximations made in pursuit of an analytical solution is reduced. A well-known
example where interpolation works perfectly is pricing a multi-callable instrument
when the model is calibrated to European options with a similar underlying. There
are however many calculations, including XVA among others, where the instrument
or portfolio to be priced has little in common with the calibration instruments.
For these calculations, the dictum “imply from market prices what you can (really)
hedge, and estimate econometrically what you cannot” by Rebonato et al. [16] creates
motivation for replacing stylized behaviors that admit an analytical solution by those
derived from the historical data. Using VAE-based state variables provides a way to
do so.
where V (t) and V (T ) are the contingent claim prices at time t and T respectively,
and EQ [·|X(t)] is the expectation calculated over the model’s Q-measure probability
density conditional on the state variable vector X(t).
A model that specifies a perfectly reasonable probability distribution of X(t)
but not one specifically selected to satisfy the constraints (11) will exhibit arbitrage.
These constraints are universal in nature, and we will not be changing them when
converting classical models to AEMM. Our goal will be to convert AEMM into a
form where these constraints can be applied the same way as they are applied for
the corresponding classical model.
In forward rate models described by (6), the no-arbitrage constraints (11) can only be satisfied if the drift µn(t, F) is completely determined by the volatility σn(t, F):

$$\mu_n(t, F) = \sigma_n(t, F)\sum_{n'} \frac{\delta_{n'}\,\rho_{nn'}\,\sigma_{n'}(t, F)}{1 + \delta_{n'} F_{n'}} \qquad (12)$$
This makes the correlation ρnn′ the sole mechanism of controlling the movement of forward rates Fn relative to each other.
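A direct implementation of (12) is a one-liner in vectorized form; the sketch below assumes the summation runs over all forwards as written in (12), leaving measure-specific summation limits aside:

```python
import numpy as np

def lmm_drift(sigma, F, delta, rho):
    """No-arbitrage drift mu_n(t, F) of (12), given volatilities sigma_n,
    forwards F_n, accrual fractions delta_n, and correlation matrix rho."""
    w = delta * sigma / (1.0 + delta * F)  # delta_n' sigma_n' / (1 + delta_n' F_n')
    return sigma * (rho @ w)
```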
Forward rate models described by (6) use the correlation ρnn′ to produce realistic
curve shapes by making rates for nearby time intervals more correlated, and rates
for distant time intervals less correlated. This restricts the movement of nearby rates
relative to each other in order to maintain the continuity of the curve. As a result, the
probability of producing unrealistic curve shapes such as those with multiple local
extrema is suppressed, but not completely eliminated. The drawback of this indirect
approach is that the control over curve shapes it offers becomes increasingly tenuous
over long time horizons, with a higher probability of unrealistic curve samples being
generated.
Things improve somewhat if the rank of the correlation matrix is reduced by using a volatility basis (also called a volatility kernel in some publications) with a small number of stochastic drivers K = 2–4 instead of the N drivers in the original model:

$$\sigma_n(t, F)\,dw_n = \sum_k \sigma_k(t, T_n, F)\,dw_k \qquad (13)$$
where σk (t, Tn , F ) is the shock to Fn (t) at time t from stochastic driver dwk . However,
there is still no guarantee that an exogenously specified parametric basis or one
obtained by performing PCA on curve shocks will produce accurate curve shapes
over long time horizons and across all interest rate levels. This is what we will aim
to improve by converting classical forward rate models to AEMM.
AFNS/FHJM on the other hand begin by fitting the curve to the Nelson-Siegel
parametric form or its extensions, and derive the volatility basis from transformations
of that fit. Despite this apparent difference, the two model families are closely related.
A general framework developed in [24] clarifies the connection between them and can
be used to specify several important types of classical forward rate models, including
Cheyette, as well as the original AFNS as one of its special cases.
At present, there is a lack of consistency between state variables of the popular
P-measure models, many of which use the Nelson-Siegel basis and its extensions,
and state variables of the popular Q-measure models that use a variety of other
specifications. Multiple authors including [24, 25, 26, 27] describe the ability to
share state variables between Q- and P-measure models as highly beneficial. The
ability to share Nelson-Siegel state variables between P- and Q-measure has been
cited as motivation in the development of both AFNS and FHJM. We will retain
the ability to construct Q- and P-measure models with shared state variables after
replacing the Nelson-Siegel latent factors with VAE latent variables in AEMM.
Figure 16: The volatility basis σ̂k(τ, F) in forward rate AEMM is the partial derivative of F̂(τ, z) with respect to the latent space shift dzk.
In order to make model-generated curve shapes match historical curve shapes bet-
ter, we propose to derive a dynamic volatility basis σ̂k (τ, F ) from the VAE-generated
family of curve shapes F̂ (τ, z) as shown in Figure 16:
$$\hat\sigma_k(\tau, F) = \left.\frac{\partial \hat F(\tau, z)}{\partial z_k}\right|_{z = z(F)} \qquad (14)$$
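Numerically, (14) can be evaluated by automatic differentiation of the decoder or by a central finite difference; a sketch of the latter, assuming `decoder` maps a latent vector z to the forward curve on a fixed maturity grid:

```python
import numpy as np

def aemm_vol_basis(decoder, z, h=1e-4):
    """Finite-difference estimate of sigma_k(tau, F) = dF(tau, z)/dz_k at z = z(F)."""
    basis = []
    for k in range(len(z)):
        dz = np.zeros_like(z)
        dz[k] = h
        basis.append((decoder(z + dz) - decoder(z - dz)) / (2.0 * h))
    return np.stack(basis)  # shape (K, number of maturities)
```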
The expression (14) specifies the basis up to the maximum swap rate maturity, in our case τ = 30y. Common sense and a rigorous argument in [28] require that any extrapolation we use beyond that horizon eventually converges to zero, so that the impact of a random shock at any finite time t is no longer felt at infinite maturity:

$$\lim_{\tau\to\infty}\hat\sigma_k(\tau, F) = 0$$
Figure 17: Stochastic evolution of forward rate AEMM in latent space, where shocks to the curve from stochastic drivers dwk (denoted by A) alternate with adjustments in latent space due to the drift (denoted by B).
3.2.1. Multi-Factor Hull-White Model
The two-factor Hull-White (HW2F) model [6] describes the short rate r(t) via two correlated mean-reverting factors:

$$dr = (\theta(t) + u(t) - a_r r)\,dt + \sigma_r\,dw_r, \qquad du = -a_u u\,dt + \sigma_u\,dw_u$$

where ar,u are the rates of mean reversion and σr,u are the normal volatilities of r(t) and u(t) respectively, and the stochastic drivers dwr,u are multivariate normal with correlation ρru = ⟨dwr dwu⟩. The dependence of θ(t) on time is set such that the model matches the initial yield curve for every maturity, making it arbitrage-free.
We define the short rate r(t) as the interest rate for borrowing over an infinitesimal
time period between t and t + dt. While an alternative definition of r(t) using a
small but finite investment period is helpful for certain numerical calculations, we
will not use it here. The HW2F model is Markovian, a property we will preserve
when converting it to short rate AEMM.
The HW2F model can be specified via an alternative, symmetric set of equations
described by Brigo and Mercurio [7] and known as G2++. The symmetric specifi-
cation is especially convenient for adding more than two factors to obtain G(K)++
model, where K is the number of factors. The G(K)++ model represents the short
rate as the sum of K correlated mean reverting stochastic variables xk and a deter-
ministic time-dependent shift φ(t):
$$r(t) = \phi(t) + \sum_k x_k(t), \qquad dx_k = -a_k x_k\,dt + \sigma_k(t)\,dw_k \qquad (19)$$
where ak are the rates of mean reversion, σk are the normal volatilities, and the stochastic drivers dwk are multivariate normal with correlation ρkk′ = ⟨dwk dwk′⟩. The G(K)++
model is made arbitrage-free by calculating the time dependence of deterministic shift
φ(t) that makes the model match the initial yield curve for every maturity. When
volatility σk (t) is normal, the state variables x = (xk ) can be set to zero at t = 0
without loss of generality.
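A minimal Euler scheme for the state variables in (19), with constant volatilities for simplicity; the mean reversion speeds, volatilities, and correlation below are illustrative:

```python
import numpy as np

def simulate_gkpp(a, sigma, corr, dt=1.0 / 252, n_steps=2520, seed=0):
    """Euler simulation of dx_k = -a_k x_k dt + sigma_k dw_k with correlated drivers."""
    rng = np.random.default_rng(seed)
    chol = np.linalg.cholesky(corr)
    x = np.zeros(len(a))  # x_k(0) = 0 without loss of generality
    path = [x.copy()]
    for _ in range(n_steps):
        dw = (chol @ rng.standard_normal(len(a))) * np.sqrt(dt)
        x = x - a * x * dt + sigma * dw
        path.append(x.copy())
    return np.array(path)

# Two factors: slow and fast mean reversion (illustrative values)
path = simulate_gkpp(a=np.array([0.03, 0.5]), sigma=np.array([0.008, 0.012]),
                     corr=np.array([[1.0, -0.7], [-0.7, 1.0]]))
```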
The numeraire asset of this model is the money market account accruing contin-
uously compounded interest at the short rate r(t). The stochastic discount factor
(SDF) is given by:

$$SDF(t, T) = \exp\left(-\int_t^T r(t')\,dt'\right) \qquad (20)$$
The discount factor DF(t, T, x(t)) for maturity T seen at time t is a deterministic function of the model's state variable vector x(t):

$$DF(t, T, x(t)) = E_Q\left[\exp\left(-\int_t^T r(t')\,dt'\right)\bigg|\;x(t)\right] \qquad (21)$$
Let f(t, T, x(t)) be the instantaneous forward rate seen at time t for investing over the infinitesimal time interval (T, T + dT). Its limit for T → t is the short rate:

$$r(t) = \lim_{T\to t} f(t, T, x(t)) \qquad (22)$$
The relationship between the discount factor DF(t, T, x(t)) and the instantaneous forward rate f(t, T, x(t)), each a deterministic function of the model's state variable vector x(t), is given by:

$$DF(t, T, x(t)) = \exp\left(-\int_t^T f(t, T', x(t))\,dT'\right) \qquad (23)$$
The instantaneous forward rate f(t, T) in G(K)++ has the following form:

$$f(t, T) = f(0, T) + \sum_k x_k(t)\,e^{-a_k(T-t)} + f_{conv}(t, T, x(t)) \qquad (24)$$
where fconv(t, T, x(t)) ∼ O(σ²) is the convexity adjustment. In what follows, we will occasionally omit the last argument x(t) to simplify notation.
The curve shape given by (24) is defined relative to the initial curve shape, and
is therefore unable to capture the difference between how the curve moves for low
rates vs. high rates. We will be able to reflect this difference in AEMM, at the same
time adding currency-specific aspects of curve dynamics.
plays the same role as the rate of mean reversion in short rate models. It follows that
what we previously accomplished for forward rate models by changing the volatility
basis we can accomplish for short rate models by changing the drift.
The multi-factor short rate AEMM is given by:

$$r(t) = \phi(t) + \sum_k x_k(t), \qquad dx_k = \mu_k(t, x_1 \ldots x_k)\,dt + \sigma_k(t)\,dw_k \qquad (25)$$
Because drift is now non-linear, initial values of xk cannot be set to zero and are
instead selected to minimize the forward rate residual φ(t). We will calibrate the
new drift term µk (t, x1 . . . xk ) that replaced constant mean reversion −ak xk of the
classical model to match the VAE-generated family of curve shapes, with minimal
corrections to keep the model arbitrage-free.
Notation µk(t, x1 . . . xk) for the drift of state variable xk in (25) means it depends on xk and all of the preceding state variables xk′ where k′ < k, but not the subsequent state variables where k′ > k. This constraint helps achieve a hierarchical model
definition where lower factors are responsible for a greater share of overall drift and
variance than higher factors. This definition of drift brings short rate AEMM closer
to the original specification of the HW2F model where one state variable is reverting
to the other.
The calibration target for µk(t, x1 . . . xk) is f̂k(τ, z1 . . . zk), obtained by decomposing the VAE-generated instantaneous forward rate into a sum of components, where the k-th component depends on latent space dimensions z1 . . . zk:

$$\hat f(\tau, z) = \sum_{k=1}^{K} \hat f_k(\tau, z_1 \ldots z_k) \qquad (26)$$
The target curves for calibrating drift of the first state variable are shown in Fig-
ure 18(a) and second state variable in Figure 18(b). The first state variable of AEMM
captures a significant share of curve variation. The contribution of each added factor
is progressively smaller, providing a convenient hierarchical model specification.
There are two drift factorizations we can consider. The first is time-homogeneous
factorization µk (x1 . . . xk ) that matches the VAE-generated family of curves on aver-
age for all time horizons. The second is time-of-maturity factorization µk (t, x1 . . . xk )
that matches the VAE-generated family of curves exactly for a single time horizon
which we will once again choose to be t = 0 like we did for the forward rate models,
with O(σ 2 ) error for other time horizons.
The option of time-to-maturity factorization we previously used for forward rate
models is not available here because all short rate models, including the classical
ones, use the same time t to advance along the time axis and the maturity axis and
cannot shift one relative to the other. Its closest analog for short rate models is
the time-homogeneous factorization µk (x1 . . . xk ). This factorization is equivalent to
making the overall mean reversion speed of the short rate dependent on the level
of interest rates but not explicitly on time. Arguments supporting the existence of
such dependence in P-measure have been presented in [30, 31]. Unless miraculously
canceled out by the risk premium, the same dependence should exist in Q-measure
as well. Fitting time-homogeneous drift to VAE-generated curve shapes provides a
new way to estimate its level in Q-measure.
Figure 18: VAE-generated target curves for calibrating drift of (a) the first
state variable vs. (b) both state variables for JPY, USD, and ZAR (from
top to bottom).
The fitting procedure must ensure that the resulting drift is mean-reverting to
a single equilibrium point in state space x in order to comply with the constraint
that “long forward rates can never change” [28], which is another way of saying that
the impact of a random shock at any finite time t must no longer be felt at infinite
maturity T → ∞.
The time-of-maturity factorization requires finding µk (t, x1 . . . xk ) that reproduces
the entire set of VAE-generated forward rates fˆ(τ, z) at model origin, or any other
single time horizon, exactly for any maturity τ and latent vector z. Note that such a fit will be exact everywhere in latent space only if the mapping between z and each
forward rate is monotonic, which we found to be the case for the multi-currency VAE
we developed. A non-monotonic mapping may cause local deviations from the exact
fit, but not a global one. The t → ∞ extrapolation of drift must be mean-reverting
to a single equilibrium point in state space x.
The time-of-maturity factorization has a remarkable property. While it can only
be made exact across all maturities T for a single time horizon t which we chose
to be t = 0, its deviation from the exact fit at other time horizons is caused solely
by O(σ 2 ) convexity effects. This means curve shapes produced by the model stay
close to VAE-generated curve shapes across all maturities T until relatively long
time horizons t. Any attempt to make the agreement exact across all maturities is
thwarted by convexity adjustments that keep the model arbitrage-free [29]. However
even an approximate agreement with O(σ 2 ) accuracy is still highly beneficial.
Figure 19: The drift µk(·, x1 . . . xk) fitted in state space coordinates (x1 . . . xk−1, t) to the VAE-generated target forward rate components f̂k(t, z) (panels (a) and (b)).
structure of volatility, a universally accepted market practice for this model type.
One may even argue that if the head of the initial forward curve must be snipped
off to satisfy a martingale property that is not model-specific, the head of any term
structure of drift should be snipped off as well. For the reader not convinced by this
argument, time-homogeneous calibration, although approximate rather than exact
even at t = 0, eliminates any such concerns as it has no term structure to snip.
4. Models in P-measure
Interest rate models in P-measure have important practical applications across the
financial industry. At short time horizons, they are used for market and liquidity
risk. At long time horizons, they are used for economic forecasting, macro investing, insurance reserve requirements, and limit management.
For time horizons measured in days, calculations can be performed by sampling
directly from the historical time series of risk factor returns. This model-free ap-
proach is the foundation of historical value-at-risk (HVaR) and expected shortfall
(ES) methodologies. For longer time horizons, the number of independent returns in
the historical time series is insufficient for direct sampling. In this case, a P-measure
model is required.
The first P-measure models to be developed used the same type of SDE as Q-
measure models, but with time-homogeneous parameters. After enjoying a period of
considerable popularity in the 1990s, the SDE-based approach fell out of favor as practitioners increasingly turned to latent factor models based on the Nelson-Siegel
basis and its extensions. This model category will be the focus of our review of
P-measure models.
The DNS model first fits the curve to the Nelson-Siegel basis (2) or its extension, and then estimates the parameters of a separate first-order univariate linear autoregressive AR(1) process [32, 33] for each of its three latent factors β1,2,3(t). The DNS model treats the rate of decay λ in the Nelson-Siegel basis as time-independent and specified a priori; Diebold and Li set it so that the curvature factor loading peaks at the maturity of 30 months [15]. As with any other model based on an autoregressive process, the DNS model is time-homogeneous by construction.
Models for the swap curve require the absence of arbitrage at origin (t = 0). The
DNS model uses a separate stochastic process for the deviations of the curve from
the Nelson-Siegel form, setting its initial value such that the curve at origin has no
arbitrage. In what follows, we will omit this part of the model for the sake of brevity
as AEMM uses an identical approach.
The first-order univariate linear autoregressive AR(1) model represents each of
the three Nelson-Siegel latent factors βk (t + h) at the risk horizon t + h as the sum
of the deterministic factor forecast and white noise representing factor innovations
(i.e., volatility). The forecast for each of the three Nelson-Siegel latent factors β1,2,3
at time t + h depends only on the time-t value of the same latent factor, but not the other two:

$$\beta_k(t+h) = (1 - \phi_k)\,\theta_k + \phi_k\,\beta_k(t) + \epsilon_k \qquad (27)$$
where θk is the equilibrium level of the factor (i.e., the target of mean reversion), φk is the decay multiplier in autoregression, and the white noise εk for factor βk is a serially uncorrelated random variable with time-homogeneous probability density P[εk] and the mean of zero:

$$\int \epsilon_k\,P[\epsilon_k]\,d\epsilon_k = 0 \qquad (28)$$
Because εk has zero mean, the P-measure expectation is the deterministic part of (27):

$$E[\beta_k(t+h)\,|\,\beta_k(t)] = (1 - \phi_k)\,\theta_k + \phi_k\,\beta_k(t) \qquad (29)$$
For h → 0, the multiplier φk → 1 and the forecast for βk(t + h) is its preceding value βk(t). For h → ∞, the multiplier φk → 0, the initial state is forgotten, and the forecast is the equilibrium level θk. To make the comparison of φk estimated at different time horizons more intuitive, we can think of the average mean reversion speed ak over the estimation horizon h defined as:

$$\phi_k = \exp(-a_k h) \qquad (30)$$
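A sketch of the recursion (27) and the forecast (29)–(30); Gaussian innovations are used here as one common choice, although, as noted below, (27) does not require Gaussianity:

```python
import numpy as np

def ar1_forecast(beta_t, theta, a, h):
    """E[beta_k(t+h) | beta_k(t)] of (29), with phi_k = exp(-a_k h) as in (30)."""
    phi = np.exp(-a * h)
    return (1.0 - phi) * theta + phi * beta_t

def ar1_step(beta_t, theta, a, h, eps_std, rng):
    """One AR(1) step (27) with (illustrative) Gaussian innovations."""
    return ar1_forecast(beta_t, theta, a, h) + eps_std * rng.standard_normal()
```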
The DNS model and its extensions are often called “forecasting” models. This
name reflects their original use for forecasting yield curve trends, where the objective
is to estimate mean future rates (i.e., the forecast), or equivalently the interest rate
drift. When forecasting is the objective, the probability distribution of εk is only of
interest to the extent it helps estimate forecast accuracy, but not in its own right.
Following their initial development for yield curve forecasting, dynamic latent
factor models were applied to the calculation of risk at long time horizons measured
in years and decades. This calculation is required for PFE-based limit management
and, unlike market risk, cannot be performed by sampling from the time series of
returns due to the long risk horizon. When the DNS model is used for calculating
PFE, both the forecast and the probability distribution of εk must be estimated. Unlike for the Ornstein–Uhlenbeck process to which the AR(1) model is frequently compared, the probability distribution of εk in AR(1) is not necessarily Gaussian. A
comprehensive review of DNS estimation methodologies can be found in [34].
4.1.2. Autoregressive AEMM
Constructing autoregressive AEMM is largely similar to constructing DNS, except
VAE-based curve representation is used in place of the Nelson-Siegel basis, and VAE
latent vector z is used in place of the Nelson-Siegel latent factors β1,2,3 . Minor compli-
cations that arise as a result of replacing a linear basis by the non-linear VAE-based
representation are easily resolved using readily available multidimensional optimiza-
tion packages, such as those found in most machine learning libraries.
The encoder component of VAE performs the job of converting the initial curve
shape to the initial latent vector. The autoregressive model is then inserted between
the encoder and the decoder and calibrated to produce the desired probability dis-
tribution in latent space. The decoder converts that distribution to future curve
shapes. In a Monte Carlo setting, this is done by simulating paths in latent space
first, and then converting each path to Monte Carlo samples of future curves using
the decoder. In settings other than Monte Carlo, regression techniques can be used
to estimate probability density in latent space and convert it to probability density
in input space (i.e., the probability of a given curve shape).
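A sketch of this Monte Carlo pipeline; `encoder`, `decoder`, and `latent_step` stand in for the trained VAE components and the calibrated autoregressive step, and are assumptions of this sketch rather than names used elsewhere in the paper:

```python
import numpy as np

def simulate_curves(encoder, decoder, latent_step, s0, n_paths, n_steps, rng):
    """Monte Carlo samples of future curves generated via the latent space."""
    z0 = encoder(s0)  # initial curve shape -> initial latent vector
    curves = np.empty((n_paths, n_steps, len(s0)))
    for p in range(n_paths):
        z = z0.copy()
        for t in range(n_steps):
            z = latent_step(z, rng)    # autoregressive step in latent space
            curves[p, t] = decoder(z)  # latent vector -> future curve sample
    return curves
```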
Our results indicate that VAE training to multi-currency data produces highly
efficient compression that can represent curve shapes using two latent dimensions
with similar accuracy to that of the Nelson-Siegel basis with three latent factors.
The resulting reduction in latent space dimension from three to two has the potential
to significantly improve the quality of parameter estimation for both forecasting
and risk applications. It also reopens the possibility of using a more sophisticated
autoregressive model that was ruled out in [15] and subsequent publications due to
the high number of model parameters when three or more latent variables are used.
Reduction in the number of latent variables from three to two thanks to the use
of VAE in turn reduces the number of parameters, making it possible to consider
alternatives to the univariate AR(1) model. Among the classical stochastic model
alternatives are the linear vector autoregressive VAR(1) model [35] and Jones model
[36, 37]. Each of these models improves one aspect of AR(1).
The AR(1) model used in DNS holds that the forecast for each latent factor
depends only on its own prior value, and this dependence is linear. In the case of
VAR(1) model, the forecast becomes multivariate but remains linear. In the case of
the Jones model, it remains univariate but becomes nonlinear. Other options include
Bayesian networks that were previously applied to stochastic factor models in [38]
and generative machine learning used for the same purpose in [31].
A dual-measure model consists of two constituent models, one in Q-measure and
the other in P-measure, that share the same set of state variables. The two models
need not use the same type of SDE, as long as the differences in model construction
between Q- and P-measure do not cause their state variables to differ. If both models
use the AEMM approach, the state variables are latent variables of VAE on both
sides, and this condition is always satisfied.
We will call the two constituent Q- and P-measure models “sides” of the dual
measure model. The risk premium incorporated into prices causes additional drift
in the Q-measure side relative to P-measure side. During model construction, the
Q-measure side is calibrated to market-implied data in the usual way. After that,
excess drift due to the risk premium is estimated and subtracted from the drift of
the Q-measure side to arrive at the calibration of the P-measure side.
Estimation of P-measure drift is a long-standing problem in interest rate research, which has so far eluded a comprehensive and universally accepted solution. The
premise of the dual-measure calibration approach is that it is often easier to esti-
mate the risk premium than estimate the drift itself. We will describe the principles
of risk premium estimation in the following Subsection 4.2.1. We will then review
two classical dual-measure models in Subsection 4.2.2 and introduce dual-measure
AEMM in Subsection 4.2.3.
average is taken over all historical observations ti in a sufficiently long observation
period. The excess drift can be estimated from the difference between the historical averages of the Q-measure instantaneous forward rate and the P-measure forecast of the short rate for the same time offset τ, conditional on the initial state X.
This approach to risk-premium calibration avoids making the risk premium explic-
itly depend on model time t while producing accurate estimates of its dependence
on time-to-maturity τ .
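A sketch of such an estimator under a stationarity assumption, with averages over observation dates ti replacing the conditional expectations; `fwd[i, j]` holds the instantaneous forward rate f(ti, ti + τj) and `short[i]` the realized short rate at ti, observed on a uniform grid:

```python
import numpy as np

def excess_drift(fwd, short, lags):
    """Psi(tau_j): average forward rate minus average realized short rate
    lagged by tau_j; lags are positive integers in observation steps."""
    psi = np.empty(len(lags))
    for j, lag in enumerate(lags):
        psi[j] = np.mean(fwd[:-lag, j] - short[lag:])
    return psi
```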
Figure 20: Estimation of P-measure drift in latent space using (a) autore-
gressive model vs. (b) dual-measure model. Label A shows the estimated
P-measure drift, label B estimated Q-measure drift, and label C estimated
excess drift due to the risk premium.
estimating either the average difference Ψ(τ ) between the forward rate with time-
to-maturity τ and P-measure forecast for τ -lagged short rate, or the market price of
risk driving that difference. By mapping all historical curve shapes to a shared two-
dimensional latent space, AEMM creates the opportunity for a new way to perform
this estimation.
4.3. Pricing Under P-Measure
A P-measure model will rarely specify every market quote required by its user. For
example, a model used for portfolio risk must generate all pricing model inputs for
every instrument in the portfolio. Most P-measure models are unable to do so. This
problem is particularly acute for P-measure models that simulate the short rate, as
these models are unable to calculate risk of anything other than overnight deposits.
The models that simulate the entire yield curve but not the volatility surface do
better as they are able to price linear instruments, but not interest rate options. To
use a P-measure model for risk calculations, practitioners must be able to generate
all of the market quotes the model does not specify, but portfolio pricing requires.
Q-measure models sidestep this problem because they can price underlying in-
struments for the market quotes they do not specify directly. For example, finite-term
rates can be obtained by pricing zero coupon bonds, and volatilities by pricing caps
and swaptions.
While this approach is not applicable to a standalone P-measure model, it can be
used if a P-measure model is paired with a Q-measure model that shares the same
state variables. The dual-measure models discussed in the preceding Section 4.2 are
already set up this way. This approach can also be used by pairing an autoregressive
P-measure model described in Section 4.1 with any Q-measure model that shares
the same state variables. The resulting two-model setup is similar to a dual-measure
model, except the two models are calibrated independently rather than together.
In this two-model setup, the role of the P-measure model is to calculate the
real world probability distribution of the state variables at the risk horizon h. The
calculation of market quotes and trade pricing as a function of these state variables
becomes the responsibility of Q-measure model.
Figure 22 illustrates how this approach is applied to the Monte Carlo calculation
of portfolio risk. The Monte Carlo paths are generated using the P-measure model
before the risk horizon and continued using the Q-measure model after the risk
horizon. The dependence of trade prices on state variables at the risk horizon is
calculated by backward induction from t > h and therefore use only the Q-measure
segment of the paths. The probability distribution of these state variables, on the
other hand, is calculated for t < h and therefore uses only the P-measure segment
of the paths.
Because it is not practical to generate a new set of Q-measure paths for each risk
horizon h, the ending state variables of the P-measure segment and the starting state
variables of the Q-measure segment will not match exactly. Regression can be used
to assign values from one set of state variables to the other. With only two state
variables in our VAE, this regression will be more accurate compared to models with
more state variables.
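One simple realization of this regression, sketched with an illustrative polynomial feature map in the two latent state variables; trade values fitted on the Q-measure starting states are evaluated at the P-measure endpoints:

```python
import numpy as np

def assign_values(x_q, v_q, x_p, degree=2):
    """Fit V(X) on Q-measure starting states x_q, evaluate at P-measure endpoints x_p."""
    def feats(x):  # polynomial features in two state variables
        return np.column_stack([x[:, 0] ** i * x[:, 1] ** j
                                for i in range(degree + 1)
                                for j in range(degree + 1 - i)])
    coef, *_ = np.linalg.lstsq(feats(x_q), v_q, rcond=None)
    return feats(x_p) @ coef
```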
One complication that sometimes arises in using the same set of paths for every
horizon is that at long horizons, the difference between Q- and P-measure drift causes
the two sets of paths to have maximum density in different areas of state space,
resulting in a low density of regression inputs in the area of state space where most
regression outputs are located. Several ways to address this problem were described
in prior literature. Stein [27] described a procedure where switching between the
measures is used to generate a single set of paths from which both Q- and P-measure
expectations can be computed, and Sokol [30] proposed a “path injection” approach
Figure 22: Two-model setup with P-measure model responsible for gener-
ating samples of state variable vector X at risk horizon h and Q-measure
model responsible for pricing the portfolio for each X.
where paths are dynamically added to Q-measure Monte Carlo simulation in order
to increase the density of coverage in the area of state space relevant to P-measure
calculations.
5. Conclusion
The invention of VAE revolutionized many areas of machine learning, from image
processing to natural language processing. We believe that it holds the same promise
for interest rate modelling. In this paper, we describe how four categories of classical
interest rate models (two in Q-measure and two in P-measure) can be modified
to use VAE for dimension reduction. We propose to call this new model category
autoencoder market models (AEMM).
We emphasize that our selection of classical examples for the four model categories
discussed in this paper should not be taken as an endorsement of any particular
model or category. The purpose of these examples is merely to illustrate how to
turn a classical model into its AEMM counterpart. Other classical models can be
converted to AEMM by similar means.
Our decision to use autoencoders only for the part of the model that affects curve
shapes while leaving other aspects of its classical specification untouched was driven
by the desire to maintain continuity with the established market practice in interest
rate modelling. In Q-measure, switching to AEMM involves using a special form
of volatility basis for forward rate models and a special form of drift for short rate
models. In P-measure, model construction involves replacing the Nelson-Siegel latent
factors with VAE latent variables and otherwise proceeds in the usual way. It is our
hope that this conservative use of machine learning, which improves a single aspect of
the model specification in a transparent and explainable way without making radical
changes to the model, will facilitate AEMM adoption by practitioners.
With AEMM, machine learning is only used to create the mapping between yield
curve shapes and model state variables. The training process relies on decades of
historical data and can be run periodically (e.g., quarterly), with its results carefully
examined before they are used in production. Periodic training is acceptable in this
case because machine learning is used in AEMM to replace those aspects of classical
models that are usually specified a priori, such as the choice of Nelson-Siegel basis
to represent the curve.
The accuracy of the resulting mapping can be evaluated and compared to classical
methods such as Nelson-Siegel, or to previous versions of the VAE, using a rigorous
process based on measuring curve reconstruction error over the historical dataset.
Mapping quality can be examined in detail and even visualized by generating curves
from a large number of sample points in latent space. The ability to carefully exam-
ine and backtest the quality of machine learning results before using the model in
production follows the principles of trustworthy ML. We hope it will be considered
during regulatory review of the new model category.
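As a minimal sketch of this evaluation process (the encode/decode interfaces below are hypothetical names; the paper does not prescribe an API), the reconstruction error and the latent-space visualization could be computed as follows:

```python
# Sketch only: `encode` and `decode` stand for a trained VAE's encoder
# (curve -> latent point) and decoder (latent point -> curve).
import numpy as np

def reconstruction_rmse(curves, encode, decode):
    """RMSE of decode(encode(curve)) vs. the observed curve, averaged
    over a historical dataset with one yield curve per row."""
    recon = decode(encode(curves))
    return float(np.sqrt(np.mean((recon - curves) ** 2)))

def latent_space_curves(decode, lo=-3.0, hi=3.0, n=25):
    """Generate curves from a regular grid of points in the 2-dim latent
    space, e.g. to visualize the 'world map' of curve shapes."""
    z1, z2 = np.meshgrid(np.linspace(lo, hi, n), np.linspace(lo, hi, n))
    z = np.column_stack([z1.ravel(), z2.ravel()])
    return decode(z)  # one generated curve per latent grid point
```

The same reconstruction metric can be applied to a Nelson-Siegel fit of the same dataset for a like-for-like comparison between the classical and VAE-based mappings.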
6. Acknowledgments
The principles of training on multi-currency data with one-hot encoding of the cur-
rency label have been developed jointly with Oleksiy Kondratyev in a different setting
as part of prior research on generative P-measure models. The code used to generate
the numerical results was created in collaboration with Svitlana Doroshenko and An-
drew Samodurov. The author is grateful to Andrei Lyashenko for many insights and a
detailed review of an early version of this paper that led to many improvements, to
Oleksiy Kondratyev for an exchange of ideas during our prior collaboration, and to
Leif Andersen, Marco Bianchetti, Vladimir Chorniy, Igor Halperin, John Hull, Peter
Jaeckel, Gordon Lee, Alexander Lipton, Fabio Mercurio, Vladimir Piterbarg, Michael
Pykhtin, members of the quant research team at CompatibL, and many others for
illuminating discussions. The author alone is responsible for any errors. No conflicts
of interest are reported.
References
[1] A. Kondratyev, “Learning curve dynamics with artificial neural networks,” Risk,
vol. 31, June 2018.
[4] P. Hagan and A. Lesniewski, "LIBOR Market Model with SABR Style Stochastic
Volatility." Working Paper, https://doi.org/10.13140/RG.2.2.22622.89924, 2006.
[7] D. Brigo and F. Mercurio, Interest Rate Models: Theory and Practice - with
Smile, Inflation and Credit. Springer Verlag, 2006.
[10] L. E. Svensson, “Estimating Forward Interest Rates with the Extended Nelson
& Siegel Method,” Sveriges Riksbank Quarterly Review, vol. 3, no. 1, pp. 13–26,
1995.
[13] K. Sohn, H. Lee, and X. Yan, “Learning structured output representation us-
ing deep conditional generative models,” in Advances in Neural Information
Processing Systems, vol. 28, Curran Associates, Inc., 2015.
[15] F. X. Diebold and C. Li, “Forecasting the Term Structure of Government Bond
Yields,” Journal of Econometrics, vol. 130, no. 2, pp. 337–364, 2006.
[17] D. Heath, R. Jarrow, and A. Morton, “Bond Pricing and the Term Structure of
Interest Rates: A New Methodology for Contingent Claims Valuation,” Econo-
metrica, vol. 60, no. 1, pp. 77–105, 1992.
[18] A. Brace, D. Gatarek, and M. Musiela, “The Market Model of Interest Rate
Dynamics,” Mathematical Finance, vol. 7, no. 2, pp. 127–155, 1997.
[19] F. Jamshidian, "LIBOR and swap market models and measures," Finance and
Stochastics, vol. 1, pp. 293–330, 1997.
[21] L. Andersen and J. Andreasen, "Volatility skews and extensions of the LIBOR
market model," Applied Mathematical Finance, vol. 7, no. 1, pp. 1–32, 2000.
[24] A. Lyashenko and Y. Goncharov, "Bridging P-Q Modeling Divide with Factor
HJM Modeling Framework." Working Paper, SSRN https://www.ssrn.com/abstract=3995533, 2021.
[25] J. Hull, A. Sokol, and A. White, "Short rate joint measure models," Risk,
October 2014, pp. 59–63.
[27] H. Stein, "Two measures for the price of one," Risk, March 2015.
[28] P. H. Dybvig, J. E. Ingersoll, Jr., and S. A. Ross, “Long Forward and Zero-
Coupon Rates Can Never Fall,” The Journal of Business, vol. 69, no. 1, p. 1,
1996.
[30] A. Sokol, Long-term Portfolio Simulation: For XVA, Limits, Liquidity and Reg-
ulatory Capital. Risk Books, 2014.
[31] O. Kondratyev and A. Sokol, “Machine Learning for Long Risk Horizons: Mar-
ket Generator Models.” RiskLive conference presentation and to be published,
2000.
[33] E. Slutzky, “The summation of random causes as the source of cyclic processes,”
Econometrica, vol. 5, no. 2, pp. 105–146, 1937.
[35] C. A. Sims, “Macroeconomics and reality,” Econometrica, vol. 48, no. 1, pp. 1–
48, 1980.
[38] V. Chorniy and A. Greenberg, “Bayesian Networks and Stochastic Factor Mod-
els.” Working Paper, SSRN https://ssrn.com/abstract=2688324, 2015.
[41] R. Ahmad and P. Wilmott, "The Market Price of Interest Rate Risk: Measuring
and Modeling Fear and Greed in Fixed Income Markets," Wilmott, January 2007,
pp. 64–70.