Nesting Classical Actuarial Models into Neural Networks
Prepared for:
Fachgruppe “Data Science”
Swiss Association of Actuaries SAV
Abstract
Neural network modeling often suffers from the deficiency that it does not improve classical
statistical regression models in a systematic way. In this tutorial we exemplify the proposal of
[17]. We embed a classical generalized linear model into a neural network architecture, and
we let this nested network approach explore model structure not captured by the classical
generalized linear model. In addition, if the generalized linear model is already close to opti-
mal, then its maximum likelihood estimator can be used as initialization of the fitting
algorithm of the neural network. This saves computational time because we start the fitting
algorithm at a reasonable parameter value. As a by-product of our derivations, we present
embedding layers and representation learning, which often provide a more efficient treatment
of categorical features within neural networks than dummy and one-hot encoding.
Keywords. neural networks, architecture, car insurance, generalized linear models, embed-
ding, nesting, embedding layers, one-hot encoding, dummy coding, representation learning,
claims frequency, Poisson regression model, machine learning, deep learning.
1 The data and revisiting generalized linear models
1.1 French motor third-party liability insurance data
We revisit the data freMTPL2freq which is included in the R package CASdatasets, see Char-
pentier [3].1 This data comprises a French MTPL insurance portfolio with corresponding claim
counts observed within one accounting year. This data has already been illustrated and studied
in the previous two tutorials of Noll et al. [10] and Ferrario et al. [4]. Listing 1 provides a short
summary of the data.
A detailed descriptive analysis of this data is provided in the tutorial of Noll et al. [10]. The
analysis in that reference also includes a (minor) data cleaning of the original data, which we use
but do not further discuss in the present manuscript.2
We assume that the claim counts N_i of the individual policies i are independent and Poisson
distributed with means λ(x_i)v_i, for the given volumes v_i > 0 (time Exposure in years, line 7 in
Listing 1) and a given claims frequency function x_i ↦ λ(x_i), where x_i describes the feature
information of policy i; see Assumptions 2.1 in Noll et al. [10] and lines 8-16 in Listing 1. All
policies have been active within one accounting year, and the volumes are considered pro-rata
temporis, v_i ∈ (0, 1], for the corresponding time exposures.
1 CASdatasets website http://cas.uqam.ca; the data is described on page 55 of the reference manual [2].
2 The R code is available from https://github.com/JSchelldorfer/ActuarialDataScience
Task. The main problem to be solved is to find a regression function λ(·) that appropriately
describes the data and that generalizes to similar data which has not been seen yet. Note that
the task of finding an appropriate regression function λ : X → R+ also includes the definition
of the feature space X, which typically varies over different modeling approaches.
• Area: we choose a continuous (log-linear) feature component for the Area code and we
therefore map {A, . . . , F} ↦ {1, . . . , 6};
• VehPower: we build 6 categorical classes by merging the vehicle power groups greater than
or equal to 9 (6 labels in total);
• VehAge: we build 3 categorical classes [0, 1), [1, 10], (10, ∞);
• DrivAge: we build 7 categorical classes [18, 21), [21, 26), [26, 31), [31, 41), [41, 51), [51, 71),
[71, ∞);
Together with the remaining feature components BonusMalus, VehBrand, VehGas, Density and
Region, this results in the feature space
X ⊂ [1, 6] × {0, 1}^5 × {0, 1}^2 × {0, 1}^6 × [50, 150] × {0, 1}^10 × {0, 1} × [0, 11] × {0, 1}^21.   (1.2)
That is, we have a q_0 = 1 + 5 + 2 + 6 + 1 + 10 + 1 + 1 + 21 = 48 dimensional feature space X. The
feature components in {0, 1}^k add up either to 0 or 1 (dummy coding); this side constraint is the
reason for using the symbol "⊂" in formula (1.2). We also refer to Listing 3 in Noll et al. [10] for
more details. Based on this feature pre-processing we set up a first GLM.
Model Assumptions 1.1 (Model GLM1) Choose feature space X as in (1.2) and define the
regression function λ : X → R+ by
x ↦ log λ(x) := β_0 + Σ_{l=1}^{q_0} β_l x_l = ⟨β, x⟩,   (1.3)
with regression parameter β = (β_0, β_1, . . . , β_{q_0}) ∈ R^{q_0+1}.
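Model GLM1 can be fitted with the standard R function glm. The following is a minimal sketch, assuming the pre-processed learning data learn with the GLM coding used in Listing 2 and a categorized driver's age column DrivAgeCat (this column name is an assumption); the last lines compute the in-sample Poisson deviance loss in the units of Table 1.

library(stats)
# Model GLM1: Poisson GLM with log-link and log(Exposure) as offset
d.glm1 <- glm(ClaimNb ~ AreaGLM + VehPowerGLM + VehAgeGLM + DrivAgeCat + BonusMalusGLM
              + VehBrand + VehGas + DensityGLM + Region,
              family = poisson(), data = learn, offset = log(Exposure))
summary(d.glm1)

# in-sample Poisson deviance loss in units of 10^(-2), cf. Table 1
mu <- fitted(d.glm1)
100 * 2 * mean(mu - learn$ClaimNb + log((learn$ClaimNb / mu)^learn$ClaimNb))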
Table 1: run times, numbers of model parameters, in-sample and out-of-sample losses (units are
in 10^{-2}), and average estimated frequency on the test data T (the empirically observed value is
10.41%, see Table 3 in [10]).
As Table 1 shows, Model GLM1 leads to a substantial improvement over the model with constant
frequency parameter λ (estimated by MLE). The last column of Table 1 gives the estimated
frequency on the test data set T, the empirically observed value being 10.41%, see Table 3 in [10].
3 Recall that minimizing the deviance loss is equivalent to maximizing the log-likelihood.
4 The run time of the corresponding R function glm on a personal laptop (Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz, 16GB RAM) to find the MLE β̂ is roughly 20 seconds; the optimization method used is iteratively reweighted least squares (IWLS).
Figure 1: Model GLM1: estimated frequencies w.r.t. the categorized (continuous) feature com-
ponents VehPower, VehAge and DrivAge (the corresponding reference group is normalized to the
overall frequency of 10%, illustrated by the dotted line).
In Figure 1 we present the resulting estimated frequencies of the (categorized) feature compo-
nents VehPower, VehAge and DrivAge of Model GLM1 (the corresponding reference group is
normalized to the overall frequency of 10%, illustrated by the dotted line). Note that these
feature components are continuous in nature, but we have turned them into categorical ones for
modeling purposes (as mentioned above). Having so much data, we can further explore these
categorical feature components by trying to replace them with ordinal ones, assuming an
appropriate continuous functional form that still fits into the GLM framework.5
As an example we show how to bring DrivAge into a continuous functional form. We therefore
modify the feature space X from (1.2) and the regression function λ from (1.3). We replace the
7 categorical age classes by the following continuous function
DrivAge ↦ β_l DrivAge + β_{l+1} log(DrivAge) + Σ_{j=2}^{4} β_{l+j} (DrivAge)^j,   (1.6)
with regression parameters β_l, . . . , β_{l+4}. Thus, we replace the 7 categorical classes (involving 6
regression parameters from dummy coding) by the above continuous functional form having 5 re-
gression parameters. The remaining parts of the regression function in (1.3) are kept unchanged,
and we call this new model Model GLM2.
On lines 1-4 of Listing 2 we specify Model GLM2 in detail: it shows the specific functional
form for DrivAge and keeps all other terms unchanged; in particular, the two categorized
(continuous) variables VehPower and VehAge are kept as in Model GLM1. On lines 14-18 of
Listing 2 we provide the resulting MLEs of this continuous implementation (1.6) of the feature
component DrivAge. We observe that all terms in the chosen functional form for DrivAge are
significant.
The resulting out-of-sample performance on the test data T of this second model fitted to
the learning data D is given in Table 1. We observe a slight improvement in (out-of-sample)
predictive power (generalization loss). Henceforth, we prefer this latter model over the former
one. Note that this transformation has reduced the number of estimated parameters by 1 from
q0 + 1 = 49 to 48. This Model GLM2 will be the benchmark for all subsequent considerations.
5 We could also consider generalized additive models (GAMs), but we refrain from doing so for the moment.
Listing 2: continuous coding of DrivAge: MLE results of Model GLM2
 1 glm(formula = ClaimNb ~ AreaGLM + VehPowerGLM + VehAgeGLM + BonusMalusGLM +
 2       VehBrand + VehGas + DensityGLM + Region +
 3       DrivAge + log(DrivAge) + I(DrivAge^2) + I(DrivAge^3) + I(DrivAge^4),
 4     family = poisson(), data = learn, offset = log(Exposure))
 5
 6 Coefficients:
 7                 Estimate  Std. Error  z value  Pr(>|z|)
 8 (Intercept)    6.793e+01   5.227e+00   12.996   < 2e-16 ***
 9 AreaGLM        8.047e-03   1.703e-02    0.472   0.63662
10 VehPowerGLM5   2.008e-01   1.925e-02   10.427   < 2e-16 ***
11 VehPowerGLM6   2.301e-01   1.916e-02   12.011   < 2e-16 ***
12 .              .
13 .              .
14 DrivAge        3.421e+00   2.916e-01   11.733   < 2e-16 ***
15 log(DrivAge)  -4.131e+01   3.155e+00  -13.095   < 2e-16 ***
16 I(DrivAge^2)  -4.889e-02   4.775e-03  -10.239   < 2e-16 ***
17 I(DrivAge^3)   3.873e-04   4.400e-05    8.803   < 2e-16 ***
18 I(DrivAge^4)  -1.222e-06   1.633e-07   -7.485  7.17e-14 ***
19 ---
20 Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
21
22 (Dispersion parameter for poisson family taken to be 1)
23
24     Null deviance: 200974  on 610211  degrees of freedom
25 Residual deviance: 190732  on 610164  degrees of freedom
Figure 2: comparison of the estimated frequencies in Models GLM1 and GLM2 for feature
component DrivAge (normalized to age 46).
In Figure 2 we compare the resulting estimated frequencies of the two modeling approaches for
DrivAge. The continuous Model GLM2 for driver’s age looks similar to the categorical labeling,
but it provides a smooth transition between the age classes compared to Model GLM1, and it
leads to a substantially higher estimate for drivers of ages 18-19. In conclusion, we have no
reservations about using this continuous version.
We could proceed in a similar way for VehPower and VehAge. In order not to overload this
tutorial, we refrain from doing so, and we choose Model GLM2 as the benchmark model for all
our subsequent derivations.
Conclusion. We choose Model GLM2 as our benchmark model. This model is illustrated in
Listing 2, and it has 48 parameters to be estimated. One weakness of this model is that it does
not explore interactions between feature components beyond multiplications. This and other
points are going to be challenged in the following sections.
• q_0 is the dimension of the feature space X with input neurons z^(0) = x ∈ X (input layer).
A general network architecture with K hidden layers for our Poisson regression problem is then
obtained by adding an output layer as follows
λ : X → R+,   x ↦ λ(x) = exp⟨w^(K+1), (z^(K) ∘ · · · ∘ z^(1))(x)⟩.   (2.3)
That is, we map the neurons z^(K) of the last hidden layer to the output layer R+ using the
exponential activation function and weights w^(K+1) (including an intercept). This network
architecture has depth K and receives a network parameter θ ∈ R^r of dimension
r = Σ_{k=1}^{K+1} q_k (1 + q_{k-1}), collecting all network weights w^(k), k = 1, . . . , K + 1, where
we set q_{K+1} = 1. Three examples
with K = 3 hidden layers are given in Figure 3. The example in the middle of Figure 3 has an
input layer of dimension q0 = 9 and q1 = 20, q2 = 15 and q3 = 10 hidden neurons which results
in a network parameter θ of dimension r = 686.
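The parameter counts quoted here and in Figure 3 can be verified directly from the formula r = Σ_{k=1}^{K+1} q_k(1 + q_{k-1}); a minimal sketch in R (the function name is illustrative):

# number of network parameters r = sum_{k=1}^{K+1} q_k (1 + q_{k-1}), with q_{K+1} = 1
network.dim <- function(q0, hidden = c(20, 15, 10)) {
  q <- c(q0, hidden, 1)                 # (q_0, q_1, ..., q_K, q_{K+1})
  sum(q[-1] * (1 + q[-length(q)]))
}
network.dim(q0 = 40)   # one-hot encoding (Figure 3, lhs):    1306
network.dim(q0 =  9)   # d = 1 embeddings (Figure 3, middle):  686
network.dim(q0 = 11)   # d = 2 embeddings (Figure 3, rhs):     726

The totals of 719 and 792 trainable parameters reported in Tables 2-5 for the embedding networks are consistent with additionally counting the (11 + 22)·d embedding weights on top of these feed-forward weights.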
The network in Figure 3 (middle) shows for each feature component one single neuron in the
input layer (blue, green and magenta colors), thus, an input layer of dimension q0 = 9. However,
we have two categorical feature components VehBrand and Region with more than 2 different
categorical labels. One-hot encoding requires that these two components receive 11 and 22 input
neurons, respectively. Thus, one-hot encoding implies that the input layer has dimension q0 = 40
(if we assume that all other feature components need one single input neuron). This results in
Figure 3: networks with K = 3 hidden layers having q_1 = 20, q_2 = 15 and q_3 = 10 hidden neurons
in the three hidden layers; the input layer has dimensions q_0 = 40 (lhs), q_0 = 9 (middle) and
q_0 = 11 (rhs), resulting in network parameter dimensions r = 1'306, 686 and 726, respectively.
dimension r = 1'306 for the network parameter θ. This is exactly the network illustrated in
Figure 3 (lhs), with one-hot encoding for VehBrand in green color and one-hot encoding for
Region in magenta color. Brute-force network calibration then simply fits this model using a
version of the gradient descent algorithm; this is exactly what has been demonstrated in our
previous tutorial [4].
We should ask ourselves whether the brute-force implementation of categorical feature compo-
nents using one-hot encoding is optimal, since it seems to introduce an excessive number of
network parameters (in our case r = 1'306). There is a second reason why one-hot encoding
seems to be sub-optimal for our purposes. In general, we would like to identify (cluster) labels
that are similar for the regression modeling problem. This is not the case with one-hot encod-
ing. If we consider, for instance, the 11 vehicle brands B = {B1, B10, . . . , B6}, one-hot encoding
assigns a different unit vector x_VehBrand ∈ R^11 to each VehBrand ∈ B. For two different brands
VehBrand_1 ≠ VehBrand_2 ∈ B we always receive ‖x_VehBrand_1 − x_VehBrand_2‖ = √2, thus, the (Eu-
clidean) distance between all vehicle brands is the same under one-hot encoding. In the next
section we present embedding layers, which aim at embedding categorical feature components
into low-dimensional Euclidean spaces, clustering labels that are more similar for the regression
modeling problem.
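The constant pairwise distance under one-hot encoding can be checked directly:

# Euclidean distance between the one-hot vectors of two different vehicle brands
x.B1  <- diag(11)[1, ]                 # one-hot vector of the first brand label
x.B10 <- diag(11)[2, ]                 # one-hot vector of the second brand label
sqrt(sum((x.B1 - x.B10)^2))            # 1.414214 = sqrt(2), the same for any pair of labels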
We exemplify the construction of an embedding layer on the categorical feature component
VehBrand. For an embedding layer, we need to choose an embedding dimension d ∈ N (hyper-
parameter). The embedding is then defined by an embedding mapping
e : B → R^d,   VehBrand ↦ e_VehBrand := e(VehBrand).   (2.4)
Thus, we allocate to every label VehBrand ∈ B a d-dimensional vector e_VehBrand ∈ R^d. This is
called an embedding of B into R^d, and the embedding weights e_VehBrand are learned during the
model calibration, which is called representation learning.
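In the R interface to Keras such an embedding is implemented by layer_embedding, which maps integer-coded labels to d-dimensional trainable weight vectors. A minimal sketch for VehBrand follows (the full architecture is given in Listing 4; the integer coding of the 11 brand labels as 0, . . . , 10 is assumed):

library(keras)
d <- 2                                             # embedding dimension (hyper-parameter)
VehBrand <- layer_input(shape = c(1), dtype = 'int32', name = 'VehBrand')
BrEmb <- VehBrand %>%
  layer_embedding(input_dim = 11, output_dim = d,  # 11 brand labels mapped to R^d
                  input_length = 1, name = 'BrEmb') %>%
  layer_flatten(name = 'Br_flat')                  # drop the length-1 sequence dimension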
(Figure: illustration of a 2-dimensional embedding of the vehicle brands B, plotted in the two embedding dimensions.)
2.3 Embedding layer example
We compare brute-force one-hot encoding and embedding layers on an explicit example. We
choose a network of depth K = 3, having hidden neurons (q_1, q_2, q_3) = (20, 15, 10), and using
one-hot encoding for the feature components VehBrand and Region; this network is illustrated
in Figure 3 (lhs). As described above, this results in r = 1'306 network parameters to be
calibrated. In order to fit this model we use the R interface to Keras.6 The code is provided in
Listing 3. Network on lines 4-9 of Listing 3 defines a network of depth K = 3 having neurons
(q_1, q_2, q_3) = (20, 15, 10) and the hyperbolic tangent activation function. This network produces a
one-dimensional output (line 8 of Listing 3). It is initialized such that we start in the MLE of
the homogeneous model (constant frequency parameter, line 9 of Listing 3). On lines 11-13 we
add the non-trainable offset log(Exposure), and on line 16 we specify the nadam optimizer, see
Section 8.5 in Goodfellow et al. [6], and the Poisson deviance loss as objective function.
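Listing 3 is not reproduced in this excerpt; the following is a minimal sketch of a comparable one-hot architecture in the style of Listing 4, assuming a pre-processed q_0 = 40 dimensional design matrix input Design and the homogeneous MLE lambda.hom:

library(keras)
q0 <- 40   # one-hot encoded feature dimension
Design <- layer_input(shape = c(q0), dtype = 'float32', name = 'Design')
LogVol <- layer_input(shape = c(1),  dtype = 'float32', name = 'LogVol')

Network <- Design %>%
  layer_dense(units = 20, activation = 'tanh', name = 'hidden1') %>%
  layer_dense(units = 15, activation = 'tanh', name = 'hidden2') %>%
  layer_dense(units = 10, activation = 'tanh', name = 'hidden3') %>%
  layer_dense(units = 1, activation = 'linear', name = 'Network',
              # start in the homogeneous model: zero weights, intercept log(lambda.hom)
              weights = list(array(0, dim = c(10, 1)), array(log(lambda.hom), dim = c(1))))

Response <- list(Network, LogVol) %>% layer_add(name = 'Add') %>%
  layer_dense(units = 1, activation = k_exp, name = 'Response', trainable = FALSE,
              weights = list(array(1, dim = c(1, 1)), array(0, dim = c(1))))

model <- keras_model(inputs = c(Design, LogVol), outputs = c(Response))
# 'poisson' loss equals the Poisson deviance loss up to terms not depending on the prediction
model %>% compile(optimizer = 'nadam', loss = 'poisson')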
Figure 5: performance of the gradient descent algorithm; the blue graphs show the training
losses and the red graphs the validation losses: (lhs) one-hot encoding for categorical feature
components; (middle) 1-dimensional embeddings of categorical feature components; (rhs) 2-
dimensional embeddings of categorical feature components.
6 Keras is a user-friendly API to TensorFlow, see https://tensorflow.rstudio.com/keras/
We use the same feature pre-processing as in Section 2 of [4],7 and we run the gradient descent
algorithm for 500 epochs on the learning data set D on mini-batches of size 10'000 policies.
To track over-fitting we split the learning data at the ratio of 9:1 into a training data set and
a validation data set. In Figure 5 (lhs) we plot the decrease of training loss (blue color) and
validation loss (red color), respectively, over 500 epochs. We see that after roughly 250 epochs
we may exercise early stopping.
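In Keras the 500 epochs, the mini-batches of 10'000 policies and the 9:1 split can be specified directly in the fitting call; a minimal sketch continuing the one-hot architecture above (the design matrix name Xlearn is illustrative):

# 500 training epochs on mini-batches of 10'000 policies; 9:1 training/validation split
fit.history <- model %>% fit(
  x = list(Design = Xlearn, LogVol = as.matrix(log(learn$Exposure))),
  y = as.matrix(learn$ClaimNb),
  epochs = 500, batch_size = 10000,
  validation_split = 0.1,       # keras uses the last 10% of the supplied data for validation
  verbose = 0)
plot(fit.history)               # training vs. validation loss decrease, as in Figure 5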
Table 2: run times, numbers of model parameters, in-sample and out-of-sample losses (units are in
10^{-2}) of Model GLM2 and of the (q_1, q_2, q_3) = (20, 15, 10) network with one-hot encoding and
with embedding layers of dimensions d = 1, 2 for the categorical feature components VehBrand and Region.
The resulting losses of this network after 250 epochs on the entire learning data are given in
Table 2 on row “Network One-Hot”. We note that we obtain a clearly better model than Model
GLM2 in terms of Poisson deviance losses (at the price of more run time). Fine-tuning of the
network architecture and the gradient descent algorithm could further improve this model. For
the time being we stay with the current network architecture and its calibration, because we
would like to see whether we get an improvement using embedding layers for categorical feature
components.
The code for designing the network architecture with embedding layers for the categorical ex-
planatory variables VehBrand and Region is given in Listing 4, with the first line defining the
dimension d of the embedding layers (we use the same embedding dimension for both categorical
feature components). The network results of these architectures with d = 1 and d = 2, respec-
tively, are provided in Table 2, and Figure 5 gives the convergence behaviors on training and
validation sets (being a 9:1 partition of the learning data D). In view of Figure 5 (middle and
rhs) we use 700 epochs and 600 epochs for d = 1 and d = 2, respectively. The latter model has
more parameters which also provides more degrees of freedom to the gradient descent method.
This seems to slightly accelerate the fitting behavior.
On the one hand we observe that embedding layers provide a slower rate of convergence and
longer run times than one-hot encoding of categorical variables. We suppose that this is caused
by the fact that an embedding layer adds an additional layer to the network, see green and
magenta arrows in Figure 3 (middle, rhs). Therefore, the back-propagation method for network
calibration needs to be performed over 4 hidden layers for embedding layer coding compared to
3 hidden layers in one-hot encoding of categorical feature components.
On the other hand, the fitted models with embedding layers clearly outperform the model with
one-hot encoding in terms of the out-of-sample loss, if the former models are trained sufficiently
long. What is more worrying is that the calibration of the network models is very unstable
7 The corresponding R code is available from https://github.com/JSchelldorfer/ActuarialDataScience
Listing 4: network of depth 3 with embeddings for categorical features
 1 d <- 1   # dimension of the embedding layers
 2 Design   <- layer_input(shape = c(7), dtype = 'float32', name = 'Design')
 3 VehBrand <- layer_input(shape = c(1), dtype = 'int32',   name = 'VehBrand')
 4 Region   <- layer_input(shape = c(1), dtype = 'int32',   name = 'Region')
 5 LogVol   <- layer_input(shape = c(1), dtype = 'float32', name = 'LogVol')
 6
 7 BrEmb = VehBrand %>%
 8   layer_embedding(input_dim = 11, output_dim = d, input_length = 1, name = 'BrEmb') %>%
 9   layer_flatten(name = 'Br_flat')
10
11 ReEmb = Region %>%
12   layer_embedding(input_dim = 22, output_dim = d, input_length = 1, name = 'ReEmb') %>%
13   layer_flatten(name = 'Re_flat')
14
15 Network = list(Design, BrEmb, ReEmb) %>% layer_concatenate(name = 'concate') %>%
16   layer_dense(units = 20, activation = 'tanh', name = 'hidden1') %>%
17   layer_dense(units = 15, activation = 'tanh', name = 'hidden2') %>%
18   layer_dense(units = 10, activation = 'tanh', name = 'hidden3') %>%
19   layer_dense(units = 1, activation = 'linear', name = 'Network',
20     weights = list(array(0, dim = c(10, 1)), array(log(lambda.hom), dim = c(1))))
21
22 Response = list(Network, LogVol) %>% layer_add(name = 'Add') %>%
23   layer_dense(units = 1, activation = k_exp, name = 'Response', trainable = FALSE,
24     weights = list(array(1, dim = c(1, 1)), array(0, dim = c(1))))
25
26 model <- keras_model(inputs = c(Design, VehBrand, Region, LogVol), outputs = c(Response))
(in the choice of the initial value of the gradient descent algorithm). This results in fluctuating
average frequencies, see last column in Table 2. In fact, these numbers (not being part of the
objective function during model calibration) fluctuate quite a bit, which is a major issue for
insurance pricing.
Figure 6: (lhs) histogram of exposures per VehBrand, (middle) observed frequency per VehBrand,
(rhs) resulting weights in the embedding layer for d = 2.
The embedding layers have another advantage, namely that we can graphically illustrate the findings
of the network (at least if d is small). This is very useful in NLP as it allows us to explore similar
words graphically in 2 or 3 dimensions, after some further dimension reduction techniques have
been applied, see [1, 14, 12]. In Figures 6 (rhs) and 7 (rhs) we illustrate the resulting embedding
weights for d = 2, see also (2.4). We observe clustering in both categorical variables, which
Figure 7: (lhs) histogram of exposures per Region, (middle) observed frequency per Region,
(rhs) resulting weights in the embedding layer for d = 2.
indicates that some labels could be merged. For VehBrand we observe that car brand B12 is
different from all other car brands, B10 and B11 seem to have similarities, and the remaining
car brands cluster. For Region the result is more diverse; in fact, Figure 7 (rhs) suggests that
a 1-dimensional embedding is not sufficient. This is in contrast to Figure 6 (rhs), where we
have high co-linearity between the two dimensions. Figures 6 (lhs, middle) and 7 (lhs, middle)
are taken from Figures 8 and 11 of [10]; they show the observed marginal frequencies and the
underlying volumes.
Conclusions. The networks improve the GLM results in terms of out-of-sample losses because
we have not invested sufficient effort in finding the optimal GLM with respect to feature
engineering and potential interactions. From the analysis in this section we prefer embedding
layers over one-hot encoding for categorical feature components, however, at the price of longer
run times. Besides potentially improving the out-of-sample performance of the network, embedding
layers allow us to visually identify relationships between the different levels of categorical
inputs. The downsides of networks are that the calibrations lead to volatile average frequency
estimates and bias fluctuations. This is also going to be studied in the next section.
Model Assumptions 3.1 (CANN approach: part I) Choose a feature space X ⊂ R^{q_0} and
define the regression function λ : X → R+ by
x ↦ log λ(x) = ⟨β, x⟩ + ⟨w^(K+1), (z^(K) ∘ · · · ∘ z^(1))(x)⟩,   (3.1)
where the first term on the right-hand side of (3.1) is the regression function from Model As-
sumptions 1.1 with parameter vector β, and the second term is the regression function from (2.3)
with network parameter θ. Assume N_i ∼ Poi(λ(x_i)v_i), independently for all i ≥ 1.
Figure 8: CANN approach illustrating in orange color the classical GLM in the skip connection
added to a network of depth K = 3 with (q1 , q2 , q3 ) = (20, 15, 10) hidden neurons.
The CANN approach of Model Assumptions 3.1 is illustrated in Figure 8. The skip connection
in orange color contains the GLM (note that for the moment we neglect that categorical feature
components may use a different encoding for the GLM and the network parts).
• Formula (3.1) combines our previous two models; in particular, it embeds the GLM into
a network architecture by packing it into a so-called skip connection that directly links
the input layer to the output layer, see the orange arrow in Figure 8. Skip connections are
used in deep networks because they have good calibration properties, potentially avoiding
the vanishing gradient problem, see He et al. [7] and Huang et al. [8]. We use the skip
connection for a different purpose here.
• The two models are combined in the output layer by a (simple) addition. This addition
makes one of the intercepts β_0 and w_0^(K+1) superfluous. Therefore, we typically fix one of
the intercepts, in most cases β_0, and we only train the other intercept, say, w_0^(K+1), in the
new network parameter ϑ = (β, θ) of regression function (3.1).
• Regression function (3.1) requires that the GLM and the network model are defined on
the same feature space X. This may require that we merge the feature spaces of the GLM
and of the network approach; the two parts of regression function (3.1) need not both
consider all components of that merged feature space, for instance, when the GLM considers
a component in a dummy coding representation and the network part considers the same
component in a continuous coding fashion.
Initialization 3.2 (CANN approach: part II) Assume that Model Assumptions 3.1 hold
and that β̂ denotes the MLE for β under Model Assumptions 1.1. Initialize regression function
(3.1) as follows: set for the network parameter ϑ = (β, θ) the initial value
ϑ_0 = (β̂, θ_0)   with output layer weight w^(K+1) ≡ 0 in θ_0.   (3.2)
Note that initialization (3.2) exactly provides the MLE prediction of the GLM part of the CANN
model (3.1), i.e. it minimizes the Poisson deviance loss under Model Assumptions 1.1. If we start
the gradient descent algorithm for fitting the CANN model (3.1) in this initial value ϑ0 , and if
we use the Poisson deviance loss as objective function, then the algorithm explores the network
architecture for additional model structure that is not present in the GLM and which lowers the
initial Poisson deviance loss related to the (initial) network parameter ϑ0 . In this way we obtain
an improvement of the GLM by network features. This provides a more systematic way of using
network architectures to improve the GLM. We will highlight this with several examples.
In the following we consider the CANN regression function
x ↦ λ(x) = exp{⟨β̂, x⟩ + ⟨w^(K+1), (z^(K) ∘ · · · ∘ z^(1))(x)⟩},   (3.3)
where β̂ is the MLE of β. There are two different ways of applying the gradient descent algorithm
to (3.3): (1) we train the entire network parameter ϑ = (β, θ); (2) we declare the GLM part β̂
to be non-trainable and only train the second term in ϑ = (β̂, θ). In the latter case, the optimal
GLM always remains in the CANN regression function and it is modified by the network part.
In the former case, the optimal GLM is modified by interacting with the network part.
A variant of (3.3), in the case where we declare β̂ to be non-trainable, is to introduce a trainable
(credibility) weight α ∈ [0, 1] and to define a new regression model
x ↦ λ(x) = exp{α⟨β̂, x⟩ + (1 − α)⟨w^(K+1), (z^(K) ∘ · · · ∘ z^(1))(x)⟩}.   (3.4)
If we train this model, then we learn a credibility weight α at which the GLM is considered in
the CANN approach.
An extension of the CANN approach also allows us to learn across multiple insurance portfolios.
Assume we have J insurance portfolios, all living on the same feature space X, and with β̂_j
denoting the MLE of portfolio j = 1, . . . , J in the GLM. Let χ ∈ {1, . . . , J} be a categorical
variable denoting which portfolio we consider. We define the regression function
(x, χ) ↦ λ(x, χ) = exp{ Σ_{j=1}^{J} ⟨β̂_j, x⟩ 1_{χ=j} + ⟨w^(K+1), (z^(K) ∘ · · · ∘ z^(1))(x)⟩ }.
In this case, the neural network part allows us to learn across portfolios because it describes the
interaction between the portfolios. This approach has been considered in Gabrielli et al. [5].
We present a first example that implements the CANN approach (3.3), where we declare the
MLE β̂ of Model GLM2 to be non-trainable and where we use d = 1 for the embedding layers
of the network part. We call this first example Model CANN0. The R script of this architecture
is given in Listing 9 in the appendix; we comment on it in more detail. This model has
686 trainable network parameters, which are exactly the network weights shown in Figure 3
(middle), i.e. these are the weights θ that come from the neural network part. In addition, it has
58 non-trainable network parameters: these are the 48 GLM parameters in β̂ of Model GLM2,
which we declare to be non-trainable (see lines 14, 20, 26, 32, 38 and 39 of Listing 9), as well
as 10 further non-trainable parameters, where 4 stem from the embedding identification
(one-hot versus dummy coding for VehPower, VehAge, VehBrand and Region), 4 come from
concatenating the GLM network (line 38 of Listing 9), and 2 are from blending the GLM with
the network part (line 50 of Listing 9).
On lines 1-8 of Listing 9 we define the input variables: ContGLM ∈ R^4 collects the four ordinal
variables BonusMalus, VehGas, Density, Area; on lines 2-3 there are the GLM-categorized
variables VehPowerGLM and VehAgeGLM; on lines 4-5 the categorical variables VehBrand and
Region; line 6 collects all DrivAge related variables from line 3 of Listing 2, i.e. DrivAgeGLM ∈ R^5
has 5 continuous components, see also (1.6); line 7 collects the continuous variables VehPower,
VehAge and DrivAge. Thus, the latter three variables enter the CANN model twice in a different
form for the GLM part and for the network part in (3.3). Moreover, we pre-process all feature
components that enter the network part of the architecture with the MinMaxScaler (for the
MinMaxScaler we refer to Section 2.2 of Ferrario et al. [4]). Finally, line 8 defines the offset
log(Exposure).
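The MinMaxScaler itself is a one-line transformation; a minimal sketch, assuming (as in [4]) that each continuous component is scaled to the interval [-1, 1] (function and column names are illustrative):

# MinMaxScaler: map a continuous feature component to the interval [-1, 1]
PreProcess.Continuous <- function(x) { 2 * (x - min(x)) / (max(x) - min(x)) - 1 }
learn$VehPowerNN <- PreProcess.Continuous(learn$VehPower)   # illustrative column name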
On lines 11-33 we define the embedding layers for the 4 categorical variables VehPowerGLM,
VehAgeGLM, VehBrand and Region using the MLE β̂ as non-trainable weights. On lines 35-39
these categorical variables are concatenated with the continuous ones of the GLM part, again
using β̂ as non-trainable weights; this provides the non-trainable GLMNetwork. On lines 41-46
we define the (q_1, q_2, q_3) = (20, 15, 10) network architecture. This considers the 9-dimensional
variable consisting of ContGLM ∈ R^4 and ContNN ∈ R^3, as well as the two one-
are exactly the GLM parameters (and they are declared to be non-trainable). Therefore, this
part exactly corresponds to Figure 3 (middle) with 686 trainable weights illustrated by the black
arrows. We blend the two models on lines 48-50 also including the offset log(Exposure) from
the underlying volumes.
Having the R script of Listing 9 we are ready to calibrate Model CANN0. Before doing so, we
mention that this code is a bit cumbersome for the task we try to achieve. In the case of the
Poisson distribution we can substantially simplify the CANN implementation. From (3.3) we
see that if the GLM part is non-trainable with MLE β, b then we can merge this term with the
given volumes vi . Thus, we consider a network function
D E
x 7→ log λNN (x) = w(K+1) , z (K) ◦ · · · ◦ z (1) (x) , (3.5)
This leads to a (much) simpler representation of the CANN model; in particular, we can replace
Listing 9 of the appendix by Listing 5 (which is almost identical to Listing 4).
We observe that this code has become much simpler, and in fact the calibration in Keras also runs
faster. Line 4 of Listing 5 specifies the offsets log(v_i^GLM), and lines 14-19 define the network
part λ_NN. This network is based on embedding layers for the categorical feature components
VehBrand and Region, which may have (general) dimension d. Note that in Listing 5 the
embedding weights for VehBrand and Region are trainable; if we choose embedding dimension
d = 1, initialize these embedding layers with the corresponding parts of β̂ and declare these two
embedding layers to be non-trainable, then we exactly recover Model CANN0 from the previous
section.
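Recovering Model CANN0 from Listing 5 thus only requires initializing the two embedding layers at the corresponding GLM coefficients and freezing them; a minimal sketch for VehBrand, where beta.VehBrand is assumed to be the length-11 vector of Model GLM2 coefficients for the brand labels (with 0 for the reference label; the name is illustrative):

d <- 1   # embedding dimension 1: one (fixed) GLM coefficient per brand label
BrEmb <- VehBrand %>%
  layer_embedding(input_dim = 11, output_dim = d, input_length = 1, name = 'BrEmb',
                  weights = list(matrix(beta.VehBrand, nrow = 11, ncol = d)),
                  trainable = FALSE) %>%          # keep the GLM part of VehBrand fixed
  layer_flatten(name = 'Br_flat')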
Remark. We would like to emphasize that (3.6) is by no means restricted to the GLM. In fact,
we can choose any regression model for the skip connection (volume adjustments using working
weights similar to (3.6)); for instance, we can replace the GLM prediction by a generalized
additive model (GAM) prediction in the working weight definition. This is exactly the idea
behind the Poisson boosting machine as it has been presented in Section 5.2 of our tutorial [10].
We choose two different versions of our CANN approach (3.5)-(3.6): the first one has embedding
dimension d = 1 and the second one has embedding dimension d = 2 for the categorical feature
components VehBrand and Region, see also lines 7 and 11 in Listing 5. Both versions use Model
GLM2 in the skip connection (3.6). We call these two models Models CANN1 and CANN2. In Figure
9 we show the convergence statistics of the gradient descent algorithm where, again, we
split the learning data 9:1 into a training set (blue) and a validation set (red). The left-hand
side shows the calibration for embedding layers with d = 1 (Model CANN1), and the right-hand
Figure 9: performance of the gradient descent algorithm; the blue graphs show the training
losses and the red graphs the validation losses: (lhs) Model CANN1; (rhs) Model CANN2.
side uses embedding layers with d = 2 (Model CANN2). We observe over-fitting after roughly
200 gradient descent steps. Note that this is much faster than in the network models of Figure
5, the reason being that the MLE of Model GLM2 provides a reasonable initial value for the
gradient descent algorithm.
Table 3: run times, numbers of model parameters, in-sample and out-of-sample losses (units are in
10^{-2}) of Model GLM2, of the (q_1, q_2, q_3) = (20, 15, 10) networks with embedding layers of
dimensions d = 1, 2 for the categorical feature components VehBrand and Region, and of Models
CANN1 and CANN2 with embedding dimensions d = 1, 2, respectively.
The results are presented in Table 3. We observe that the performances of all network models
with embedding layers are comparable in terms of out-of-sample losses, which range from 31.45327 to
31.50647 (last 4 rows of Table 3). More remarkable is that the GLM2 skip connection adds some
stability to the average frequency (last column in Table 3); note that the empirically observed
frequency on the test data T is 10.41%.
It may seem a bit disappointing that the CANN approach does not lead to a clear improvement over
the classical network approach in terms of out-of-sample losses. The main issue in the current
set-up is that Model GLM2 is not sufficiently good for the CANN approach to benefit
from a very good initial model. In fact, we are penalized here for not having invested sufficient
effort in building a good GLM. However, the CANN approach will allow us to explicitly analyze
the weaknesses of Model GLM2. This is what we are going to do next.
3.4 Analyzing the GLM marginals
The aim of the subsequent sections is to analyze the modeling of the regression function λ(·)
in a more modular way. We therefore start with Model GLM2 given in Listing 2. This model
considers a log-linear structure
x ↦ log λ(x) = β_0 + Σ_{l=1}^{q_0} β_l x_l = ⟨β, x⟩,
with continuous feature components BonusMalus, Density, Area and DrivAge, categorized fea-
ture components VehPower and VehAge, and categorical feature components VehBrand, VehGas,
Region. The goal is to see whether the functional form used to model the continuous and the
categorized feature components is sufficiently good. We therefore change the modeling of one
feature component at a time, keeping the modeling of the other components fixed. For instance,
we choose VehPower and we consider
x ↦ log λ(x) = ⟨β, x⟩ + ⟨w^(2), z^(1)(VehPower)⟩,   (3.7)
where the last term reflects a network of depth 1 (having q_1 hidden neurons) applied to the
feature component VehPower, only. Note that we use a slight abuse of notation in (3.7) because
VehPower enters feature x also as a categorical variable with 6 labels for the GLM approach.
For the explicit implementation of (3.7) we again use approach (3.6) with working weights.
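A minimal Keras sketch of the marginal adjustment (3.7) for VehPower follows, assuming the MinMaxScaled continuous VehPower as single input and the logged working weights log(v_i^GLM) as offset; the zero initialization of the output weights corresponds to starting the gradient descent algorithm in Model GLM2:

library(keras)
VP     <- layer_input(shape = c(1), dtype = 'float32', name = 'VP')       # scaled VehPower
LogVol <- layer_input(shape = c(1), dtype = 'float32', name = 'LogVol')   # log(v_i^GLM)

NetMargin <- VP %>%
  layer_dense(units = 7, activation = 'tanh', name = 'hidden1') %>%
  layer_dense(units = 1, activation = 'linear', name = 'NetMargin',
              weights = list(array(0, dim = c(7, 1)), array(0, dim = c(1))))  # start in GLM2

Response <- list(NetMargin, LogVol) %>% layer_add(name = 'Add') %>%
  layer_dense(units = 1, activation = k_exp, name = 'Response', trainable = FALSE,
              weights = list(array(1, dim = c(1, 1)), array(0, dim = c(1))))

model.margin <- keras_model(inputs = c(VP, LogVol), outputs = c(Response))
model.margin %>% compile(optimizer = 'nadam', loss = 'poisson')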
Table 4: in-sample and out-of-sample losses (units are in 10^{-2}) of Model GLM2, compared to
a marginal network adjustment according to (3.7).
In Table 4 we present the results, where we consider one continuous feature component in the
form (3.7) at a time, and where we choose a single hidden layer network with q_1 = 7 hidden
neurons.8 From Table 4 we see that the marginal modeling in Model GLM2 is quite good; the
only two feature components that may be modeled in a better way are VehAge, i.e. the three age
classes [0, 1), [1, 10] and (10, ∞) should be refined, and BonusMalus, where a log-linear functional
form is not fully appropriate. We would like to highlight that the variable BonusMalus needs
more gradient descent steps, i.e. a later early stopping point. It seems that Model GLM2 sits
in a rather "strong saddle point" for the variable BonusMalus, which is difficult to leave for the
gradient descent algorithm.
8 In many cases one hidden layer is sufficient for modeling one-dimensional functions; for multivariate functionals, deep networks show better fitting performance because they can model interactions more easily.
In the last column of Table 4 we have added the out-of-sample losses obtained from
generalized additive model (GAM) predictions. GAMs are obtained by replacing the last term
in (3.7) by a natural cubic spline, that is, we set
x ↦ log λ(x) = ⟨β, x⟩ + ns_2(VehPower),
where the first term on the right-hand side is the part originating from Model GLM2 and
ns_2 : R → R denotes a natural cubic spline. For GAMs we refer to Wood [15], Ohlsson-
Johansson [11] and Chapter 3 in Wüthrich-Buser [16]. We would like to emphasize that this
GAM for the marginals can be fitted very efficiently, i.e. in less than 1s. But efficient fitting requires
that we aggregate the data for each label (marginally we only have few labels), exploiting that
aggregation leads to sufficient statistics under our Poisson model assumptions; see Section 3.1.2
in [16] and line 4 in Listing 6. The GAM fit is performed on lines 6-7 of Listing 6, where VolGLM
specifies the working weights v_i^GLM, see (3.6). The prediction can then (again) be done on
individual policies. Unfortunately, the GAM cannot be applied to the feature component Area
because it has only 6 different labels. The out-of-sample results of the neural network approach
and of the GAM approach are in line, which can be interpreted as a "proof of concept" that
these two methods work.
In Figure 10 we provide the resulting marginal regression functions from approach (3.7) which
exactly correspond to the results of Table 4. These plots confirm our findings, namely, that the
modeling of VehAge in Model GLM2 can be improved (top-right), the log-linear assumption for
BonusMalus is not fully appropriate (bottom-middle), and the other (marginal) adjustments do
not lead to visible improvements.
Conclusion. The marginal modeling used in Model GLM2 can be (slightly) improved, but this
does not explain the big differences between Model GLM2 and the neural network models of
Table 3. Therefore, the major weakness of Model GLM2 compared to the neural network models
must come from missing interactions, noting that in Model GLM2 all interactions between the
feature components are of multiplicative type. This deficiency is explored next.
Base Model GAM1. For all our subsequent derivations we enhance Model GLM2 by im-
proving the marginal modeling of the feature components VehAge and BonusMalus using a joint
GAM adjustment. That is, we consider the regression function
Figure 10: comparison of the marginals in Model GLM2 and the marginal network adjustment
according to (3.7).

$$ x \mapsto \log \lambda_{\mathrm{GAM}}(x) = \langle \beta, x \rangle + \mathrm{ns}_{21}(\mathrm{VehAge}) + \mathrm{ns}_{22}(\mathrm{BonusMalus}), $$
where the first term on the right-hand side (the scalar product) is the part originating from Model
GLM2, and $\mathrm{ns}_{21} : \mathbb{R} \to \mathbb{R}$ and $\mathrm{ns}_{22} : \mathbb{R} \to \mathbb{R}$ are two natural cubic splines enhancing Model
GLM2 by GAM features. We fit these two natural cubic splines simultaneously using the GAM
framework, and we call this improvement Model GAM1. The corresponding code is given in
Listing 7. On line 4 we compress the data w.r.t. the two selected feature components VehAge
and BonusMalus. On lines 7-8 we fit the natural cubic splines for these two variables using the
logged working weights $\log(v_i^{\mathrm{GLM}})$ as offsets, see also (3.6).
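A minimal sketch of such a joint fit, in the spirit of Listing 7 but not verbatim: we compress the data with respect to (VehAge, BonusMalus), use the logged GLM working weights as offset, and fit two natural cubic splines via splines::ns within glm (a stand-in for the authors' GAM call); the data frame and column names (dat, ClaimNb, VolGLM) as well as the spline degrees of freedom are illustrative assumptions.

library(splines)

# compress w.r.t. the two selected components (sufficient statistics under the Poisson model)
agg2 <- aggregate(cbind(ClaimNb, VolGLM) ~ VehAge + BonusMalus, data = dat, FUN = sum)

# joint fit of the two natural cubic splines with offset log(v_i^GLM); df's are illustrative
fit.GAM1 <- glm(ClaimNb ~ ns(VehAge, df = 7) + ns(BonusMalus, df = 9) + offset(log(VolGLM)),
                family = poisson(), data = agg2)

# working weights v_i^GAM = v_i * lambdaGAM(x_i) for the subsequent boosting steps,
# evaluated (again) on the individual policies
dat$VolGAM <- predict(fit.GAM1, newdata = dat, type = "response")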
In Table 5 we present the results. We see the expected improvement in out-of-sample loss
from 32.14902 (Model GLM2) to 32.07597 (Model GAM1). However, there is still a big gap
compared to the neural network approaches. Note that Model GAM1 is still based on multiplicative
interactions, which we are going to challenge next.
model                  epochs   run time   # param.   in-sample loss   out-of-sample loss   average frequency
homogeneous model           –       0.1s          1         32.93518             33.86149              10.02%
Model GLM2                  –        17s         48         31.25674             32.14902              10.01%
Model GAM1                  –         1s       63.2†        31.14450             32.07597              10.01%
Network Emb(d = 1)        700       419s        719         30.24464             31.50647               9.90%
Network Emb(d = 2)        600       365s        792         30.16513             31.45327               9.70%
CANN1 Emb(d = 1)          200       115s        719         30.39966             31.50136              10.02%
CANN2 Emb(d = 2)          200       117s        792         30.47557             31.56555              10.34%

Table 5: run time, number of model parameters, in-sample and out-of-sample losses (units
are in 10^{-2}) of Models GLM2, GAM1, the (q1, q2, q3) = (20, 15, 10) network with embedding layers
of dimension d = 1, 2 for the categorical feature components VehBrand and Region, and Models
CANN1 and CANN2 with embeddings d = 1, 2, respectively; † the number of parameters for
Model GAM1 comprises the 48 GLM parameters plus the effective degrees of freedom of the
GAM splines, being 6.6 + 8.6 = 15.2.
where $x \mapsto \hat{\lambda}_{\mathrm{GAM}}(x)$ is the regression function obtained from the base Model GAM1. Thus, the
2IA regression model (3.10) challenges the GAM-improved Model GLM2 by allowing for pair-
wise interactions between DrivAge and BonusMalus. This can be interpreted as a boosting step.
For this boosting step we choose a neural network of depth K = 3 having (q1, q2, q3) = (20, 15, 10)
hidden neurons. Categorical feature components are modeled with two-dimensional embedding
layers (2.4). We fit these pair-wise boosting improvements over 1’000 gradient descent epochs
on batches of size 10’000 policies.
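The following keras sketch illustrates how one such pair-wise boosting step can be set up, here for the pair (DrivAge, BonusMalus); it is not the authors' implementation (which additionally uses embedding layers for categorical pair members), and the object names (XX.pair, dat$VolGAM) are illustrative. The base Model GAM1 enters only through the offset log(v_i^GAM), and the output layer is initialized at zero so that training starts exactly in Model GAM1.

library(keras)

# inputs: the (scaled) pair of continuous features and the logged GAM1 working weights
PairNN    <- layer_input(shape = c(2), dtype = 'float32', name = 'PairNN')
LogVolGAM <- layer_input(shape = c(1), dtype = 'float32', name = 'LogVolGAM')

# depth 3 boosting network with (20, 15, 10) hidden neurons; output initialized at zero
BoostNN <- PairNN %>%
  layer_dense(units = 20, activation = 'tanh', name = 'hidden1') %>%
  layer_dense(units = 15, activation = 'tanh', name = 'hidden2') %>%
  layer_dense(units = 10, activation = 'tanh', name = 'hidden3') %>%
  layer_dense(units = 1, activation = 'linear', name = 'BoostNN',
              weights = list(array(0, dim = c(10, 1)), array(0, dim = c(1))))

# add the offset log(v_i^GAM) and apply the exponential output activation
Response <- list(BoostNN, LogVolGAM) %>% layer_add() %>%
  layer_dense(units = 1, activation = k_exp, name = 'Response', trainable = FALSE,
              weights = list(array(1, dim = c(1, 1)), array(0, dim = c(1))))

model.2IA <- keras_model(inputs = c(PairNN, LogVolGAM), outputs = c(Response))
model.2IA %>% compile(loss = 'poisson', optimizer = 'nadam')

# 1'000 epochs on batches of 10'000 policies; the out-of-sample loss of Figure 11 is
# tracked on the test set via validation_data (not shown)
model.2IA %>% fit(list(as.matrix(XX.pair), as.matrix(log(dat$VolGAM))),
                  as.matrix(dat$ClaimNb), epochs = 1000, batch_size = 10000, verbose = 0)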
The pair-wise results are illustrated in Figure 11. The rows provide the components Area,
VehPower, VehAge, DrivAge, BonusMalus, VehBrand, VehGas and Density (in this order) and
the columns provide the components VehPower, VehAge, DrivAge, BonusMalus, VehBrand,
VehGas, Density and Region (in this order). Black and blue graphs show the out-of-sample
losses over the 1’000 epochs, and the orange dotted line shows the out-of-sample loss of Model
GAM1. Note that the scale on the y-axis is the same in all plots. In blue color we iden-
tify the pairs that lead to a major decrease in loss. These are the pairs (VehPower, VehAge),
(VehPower, VehBrand), (VehAge, VehBrand), (VehAge, VehGas) and (DrivAge, BonusMalus). Thus,
between these pairs we observe major (non-multiplicative) interactions that should be inte-
grated into the model. The advantage of approach (3.10) is that we do not need to specify the
explicit form of these (missing) interactions; this is in contrast to the GLM and GAM approaches.
Figure 11: exploring pair-wise interactions: out-of-sample losses over 1’000 gradient descent
epochs for all pairs of feature components; the orange dotted line shows Model GAM1 (the scale
on the y-axis is identical in all plots).
Interaction-improved GAM1 model. This leads us to the following interaction improvement
of Model GAM1. We consider the regression function
$$ x \mapsto \log \lambda_{\mathrm{GAM+}}(x) = \Big\langle w_1^{(4)},\, \big(z_1^{(3)} \circ z_1^{(2)} \circ z_1^{(1)}\big)(\mathrm{VehPower}, \mathrm{VehAge}, \mathrm{VehBrand}, \mathrm{VehGas}) \Big\rangle + \Big\langle w_2^{(4)},\, \big(z_2^{(3)} \circ z_2^{(2)} \circ z_2^{(1)}\big)(\mathrm{DrivAge}, \mathrm{BonusMalus}) \Big\rangle, \qquad (3.11) $$
where we consider two parallel deep neural networks of depth K = 3 for the two component
vectors (VehPower, VehAge, VehBrand, VehGas) and (DrivAge, BonusMalus). Moreover, we set
$$ N_i \overset{\mathrm{ind.}}{\sim} \mathrm{Poi}\big(\lambda_{\mathrm{GAM+}}(x_i)\, v_i^{\mathrm{GAM}}\big), \qquad \text{with working weights } v_i^{\mathrm{GAM}} = v_i\, \hat{\lambda}_{\mathrm{GAM}}(x_i). $$
In (3.11) we define two parallel neural networks that only interact in the last step, where we
combine them by adding up the two terms. The reason for this choice is that we did not observe
major interactions between the components of the two parallel networks in Figure 11.
We fit Model GAM+ given in (3.11) based on two parallel neural networks of depth K = 3,
both having (q1, q2, q3) = (20, 15, 10) hidden neurons. The R code is given in Listing 8. We run
the gradient descent algorithm over 400 epochs on batches of size 10’000 policies. The results
are presented in Table 6.
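A sketch of the architecture (3.11), in the spirit of Listing 8 but not verbatim: two parallel depth-3 networks whose zero-initialized outputs are added to the offset log(v_i^GAM) before the exponential output activation. For brevity the two feature groups are assumed to be pre-processed into numeric matrices (in the paper the categorical components, e.g. VehBrand, are handled by embedding layers); the input shapes and object names are illustrative.

library(keras)

# group inputs for the two parallel networks and the logged GAM1 working weights
Group1    <- layer_input(shape = c(4), dtype = 'float32', name = 'Group1')  # VehPower, VehAge, VehBrand, VehGas
Group2    <- layer_input(shape = c(2), dtype = 'float32', name = 'Group2')  # DrivAge, BonusMalus
LogVolGAM <- layer_input(shape = c(1), dtype = 'float32', name = 'LogVolGAM')

# first parallel network of depth 3, output initialized at zero
NN1 <- Group1 %>%
  layer_dense(units = 20, activation = 'tanh') %>%
  layer_dense(units = 15, activation = 'tanh') %>%
  layer_dense(units = 10, activation = 'tanh') %>%
  layer_dense(units = 1, activation = 'linear',
              weights = list(array(0, dim = c(10, 1)), array(0, dim = c(1))))

# second parallel network of depth 3, output initialized at zero
NN2 <- Group2 %>%
  layer_dense(units = 20, activation = 'tanh') %>%
  layer_dense(units = 15, activation = 'tanh') %>%
  layer_dense(units = 10, activation = 'tanh') %>%
  layer_dense(units = 1, activation = 'linear',
              weights = list(array(0, dim = c(10, 1)), array(0, dim = c(1))))

# the two networks only interact through this final addition, together with the offset
Response <- list(NN1, NN2, LogVolGAM) %>% layer_add() %>%
  layer_dense(units = 1, activation = k_exp, trainable = FALSE,
              weights = list(array(1, dim = c(1, 1)), array(0, dim = c(1))))

model.GAMplus <- keras_model(inputs = c(Group1, Group2, LogVolGAM), outputs = c(Response))
model.GAMplus %>% compile(loss = 'poisson', optimizer = 'nadam')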
We observe excellent fitting results of Model GAM+ compared to the other neural network
models, see Table 6. This illustrates that in (3.11) we capture the main interaction terms.
In fact, from Figure 11 we see that the second interaction term in (3.11), based on the vari-
ables (DrivAge, BonusMalus), accounts for a decrease of the out-of-sample loss from 32.07597
(Model GAM1) to roughly 31.97, see the plot (DrivAge, BonusMalus) in Figure 11. Therefore, the
first interaction term in (3.11) must account for the residual decrease of the out-of-sample loss
from roughly 31.97 to 31.49574. This closes our example.
4 Conclusions
We have started our case study from a classical generalized linear model (GLM) for predicting
claims frequencies. In doing so, we have assumed a log-linear functional form for the regression
function, which leads to a multiplicative tariff structure in the feature components. Categorical
model                  epochs   run time   # param.   in-sample loss   out-of-sample loss   average frequency
homogeneous model           –       0.1s          1         32.93518             33.86149              10.02%
Model GLM2                  –        17s         48         31.25674             32.14902              10.01%
Model GAM1                  –         1s       63.2†        31.14450             32.07597              10.01%
Model GAM+                400       278s     1’174‡         30.54186             31.49574              10.33%
Network Emb(d = 1)        700       419s        719         30.24464             31.50647               9.90%
Network Emb(d = 2)        600       365s        792         30.16513             31.45327               9.70%
CANN1 Emb(d = 1)          200       115s        719         30.39966             31.50136              10.02%
CANN2 Emb(d = 2)          200       117s        792         30.47557             31.56555              10.34%

Table 6: run time, number of model parameters, in-sample and out-of-sample losses (units
are in 10^{-2}) of Models GLM2, GAM1, GAM+, the (q1, q2, q3) = (20, 15, 10) network with embedding
layers of dimension d = 1, 2 for the categorical feature components VehBrand and Region, and
Models CANN1 and CANN2 with embeddings d = 1, 2, respectively; † the number of parameters
for Model GAM1 comprises the 48 GLM parameters plus the effective degrees of freedom of the
GAM splines, being 6.6 + 8.6 = 15.2; ‡ only accounts for the network parameters in (3.11) and
not for the parameters which have been used to obtain the working weights $v_i^{\mathrm{GAM}}$ from Model
GAM1.
References
[1] Bengio, Y., Schwenk, H., Senécal, J.-S., Morin, F., Gauvain, J.-L. (2006). Neural probabilistic
language models. In: Innovations in Machine Learning. Studies in Fuzziness and Soft Computing,
Vol. 194. Springer, 137-186.
[2] CASdatasets Package Vignette (2016). Reference Manual, May 28, 2016. Version 1.0-6. Available
from http://cas.uqam.ca.
[3] Charpentier, A. (2015). Computational Actuarial Science with R. CRC Press.
[4] Ferrario, A., Noll, A., Wüthrich, M.V. (2018). Insights from inside neural networks. SSRN
Manuscript ID 3226852. Version November 14, 2018.
[5] Gabrielli, A., Richman, R., Wüthrich, M.V. (2018). Neural network embedding of the over-dispersed
Poisson reserving model. SSRN Manuscript ID 3288454. Version November 21, 2018.
[6] Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep Learning. MIT Press,
http://www.deeplearningbook.org
[7] He, K., Zhang, X., Ren, S., Sun, J. (2015). Deep residual learning for image recognition. CoRR,
abs/1512.03385.
[8] Huang, G., Liu, Z., Weinberger, K.Q. (2016). Densely connected convolutional networks. CoRR,
abs/1608.06993.
[9] Marra, G., Wood, S.N. (2011). Practical variable selection for generalized additive models. Com-
putational Statistics and Data Analysis 55, 2372-2387.
[10] Noll, A., Salzmann, R., Wüthrich, M.V. (2018). Case study: French motor third-party liability
claims. SSRN Manuscript ID 3164764. Version November 8, 2018.
[11] Ohlsson, E., Johansson, B. (2010). Non-Life Insurance Pricing with Generalized Linear Models.
Springer.
[12] Richman, R. (2018). AI in actuarial science. SSRN Manuscript ID 3218082. Version August 20,
2018.
[13] Richman, R., Wüthrich, M.V. (2018). A neural network extension of the Lee-Carter model to
multiple populations. SSRN Manuscript ID 3270877. Version October 22, 2018.
[14] Sahlgren, M. (2015). A brief history of word embeddings.
https://www.linkedin.com/pulse/brief-history-word-embeddings-some-clarifications-magnus-sahlgren/
[15] Wood, S.N. (2017). Generalized Additive Models: An Introduction with R. 2nd edition. Chapman
and Hall/CRC.
[16] Wüthrich, M.V., Buser, C. (2016). Data Analytics for Non-Life Insurance Pricing. SSRN
Manuscript ID 2870308. Version October 24, 2017.
[17] Wüthrich, M.V., Merz, M. (2019). Editorial: Yes, we CANN! ASTIN Bulletin 49/1.
Listing 9: Model CANN1 architecture
1  ContGLM     <- layer_input(shape = c(4), dtype = 'float32', name = 'ContGLM')
2  VehPowerGLM <- layer_input(shape = c(1), dtype = 'int32',   name = 'VehPowerGLM')
3  VehAgeGLM   <- layer_input(shape = c(1), dtype = 'int32',   name = 'VehAgeGLM')
4  VehBrand    <- layer_input(shape = c(1), dtype = 'int32',   name = 'VehBrand')
5  Region      <- layer_input(shape = c(1), dtype = 'int32',   name = 'Region')
6  DrivAgeGLM  <- layer_input(shape = c(5), dtype = 'float32', name = 'DrivAgeGLM')
7  ContNN      <- layer_input(shape = c(3), dtype = 'float32', name = 'ContNN')
8  LogExposure <- layer_input(shape = c(1), dtype = 'float32', name = 'LogExposure')
9  x.input <- c(ContGLM, DrivAgeGLM, VehPowerGLM, VehAgeGLM, VehBrand, Region, ContNN, LogExposure)
10 #
11 VehPowerGLM_embed = VehPowerGLM %>%
12   layer_embedding(input_dim = length(beta.VehPower), output_dim = 1, trainable = FALSE,
13                   input_length = 1, name = 'VehPowerGLM_embed',
14                   weights = list(array(beta.VehPower, dim = c(length(beta.VehPower), 1)))) %>%
15   layer_flatten(name = 'VehPowerGLM_flat')
16
17 VehAgeGLM_embed = VehAgeGLM %>%
18   layer_embedding(input_dim = length(beta.VehAge), output_dim = 1, trainable = FALSE,
19                   input_length = 1, name = 'VehAgeGLM_embed',
20                   weights = list(array(beta.VehAge, dim = c(length(beta.VehAge), 1)))) %>%
21   layer_flatten(name = 'VehAgeGLM_flat')
22
23 VehBrand_embed = VehBrand %>%
24   layer_embedding(input_dim = length(beta.VehBrand), output_dim = 1, trainable = FALSE,
25                   input_length = 1, name = 'VehBrand_embed',
26                   weights = list(array(beta.VehBrand, dim = c(length(beta.VehBrand), 1)))) %>%
27   layer_flatten(name = 'VehBrand_flat')
28
29 Region_embed = Region %>%
30   layer_embedding(input_dim = length(beta.Region), output_dim = 1, trainable = FALSE,
31                   input_length = 1, name = 'Region_embed',
32                   weights = list(array(beta.Region, dim = c(length(beta.Region), 1)))) %>%
33   layer_flatten(name = 'Region_flat')
34 #
35 GLMNetwork = list(ContGLM, DrivAgeGLM, VehPowerGLM_embed, VehAgeGLM_embed,
36                   VehBrand_embed, Region_embed) %>% layer_concatenate() %>%
37   layer_dense(units = 1, activation = 'linear', name = 'GLMNetwork', trainable = FALSE,
38               weights = list(array(c(beta.continuous, beta.DrivAge, rep(1, 4)), dim = c(13, 1)),
39                              array(beta.0, dim = c(1))))
40 #
41 NNetwork = list(ContGLM, ContNN, VehBrand_embed, Region_embed) %>% layer_concatenate() %>%
42   layer_dense(units = 20, activation = 'tanh', name = 'hidden1') %>%
43   layer_dense(units = 15, activation = 'tanh', name = 'hidden2') %>%
44   layer_dense(units = 10, activation = 'tanh', name = 'hidden3') %>%
45   layer_dense(units = 1, activation = 'linear', name = 'NNetwork',
46               weights = list(array(0, dim = c(10, 1)), array(0, dim = c(1))))
47 #
48 CANNoutput = list(GLMNetwork, NNetwork, LogExposure) %>% layer_add() %>%
49   layer_dense(units = 1, activation = k_exp, name = 'CANNoutput', trainable = FALSE,
50               weights = list(array(c(1), dim = c(1, 1)), array(0, dim = c(1))))
51 #
52 model <- keras_model(inputs = x.input, outputs = c(CANNoutput))