Nesting Classical Actuarial Models into Neural Networks

Jürg Schelldorfer∗ Mario V. Wüthrich†

Prepared for:
Fachgruppe “Data Science”
Swiss Association of Actuaries SAV

Version of January 22, 2019

Abstract
Neural network modeling often suffers the deficiency of not using a systematic way of improving classical statistical regression models. In this tutorial we exemplify the proposal of [17]. We embed a classical generalized linear model into a neural network architecture, and we let this nested network approach explore model structure not captured by the classical generalized linear model. In addition, if the generalized linear model is already close to optimal, then the maximum likelihood estimator of the generalized linear model can be used as initialization of the fitting algorithm of the neural network. This saves computational time because we start the fitting algorithm at a reasonable parameter value. As a by-product of our derivations, we present embedding layers and representation learning, which often provide a more efficient treatment of categorical features within neural networks than dummy and one-hot encoding.

Keywords. neural networks, architecture, car insurance, generalized linear models, embed-
ding, nesting, embedding layers, one-hot encoding, dummy coding, representation learning,
claims frequency, Poisson regression model, machine learning, deep learning.

0 Introduction and overview


This data analytics tutorial has been written for the working group “Data Science” of the Swiss
Association of Actuaries SAV, see
https://www.actuarialdatascience.org
The main purpose of this tutorial is to provide a systematic approach to improving classical actuarial regression models using the toolbox of neural networks. We follow the CANN proposal [17], which stands for Combined Actuarial Neural Network approach. The CANN approach proposes nesting a classical parametric regression model into a neural network architecture so that we can benefit from both worlds simultaneously. This tutorial follows up on the two previous ones of Noll et al. [10] and Ferrario et al. [4]; in particular, we further develop the same numerical example of the French motor third-party liability (MTPL) insurance data set included in the R package CASdatasets, see Charpentier [3].

∗ Swiss Re, Juerg [email protected]

† RiskLab, Department of Mathematics, ETH Zurich, [email protected]

1 The data and revisiting generalized linear models
1.1 French motor third-party liability insurance data
We revisit the data freMTPL2freq which is included in the R package CASdatasets, see Charpentier [3].¹ This data comprises a French MTPL insurance portfolio with corresponding claim counts observed within one accounting year. This data has already been illustrated and studied in the previous two tutorials of Noll et al. [10] and Ferrario et al. [4]. Listing 1 provides a short summary of the data.

Listing 1: output of command str(freMTPL2freq)


 1 > str(freMTPL2freq)
 2 'data.frame':   678013 obs. of  12 variables:
 3  $ IDpol     : num  1 3 5 10 11 13 15 17 18 21 ...
 4  $ ClaimNb   : num [1:678013(1d)] 1 1 1 1 1 1 1 1 1 1 ...
 5   ..- attr(*, "dimnames")=List of 1
 6   .. ..$ : chr  "139" "414" "463" "975" ...
 7  $ Exposure  : num  0.1 0.77 0.75 0.09 0.84 0.52 0.45 0.27 0.71 0.15 ...
 8  $ Area      : Factor w/ 6 levels "A","B","C","D",..: 4 4 2 2 2 5 5 3 3 2 ...
 9  $ VehPower  : int  5 5 6 7 7 6 6 7 7 7 ...
10  $ VehAge    : int  0 0 2 0 0 2 2 0 0 0 ...
11  $ DrivAge   : int  55 55 52 46 46 38 38 33 33 41 ...
12  $ BonusMalus: int  50 50 50 50 50 50 50 68 68 50 ...
13  $ VehBrand  : Factor w/ 11 levels "B1","B10","B11",..: 4 4 4 4 4 4 4 4 4 4 ...
14  $ VehGas    : Factor w/ 2 levels "Diesel","Regular": 2 2 1 1 1 2 2 1 1 1 ...
15  $ Density   : int  1217 1217 54 76 76 3003 3003 137 137 60 ...
16  $ Region    : Factor w/ 22 levels "R11","R21","R22",..: 18 18 3 15 15 8 8 20 20 12 ...

A detailed descriptive analysis of this data is provided in the tutorial of Noll et al. [10]. The
analysis in that reference also includes a (minor) data cleaning part on the original data which
is used but not further discussed in the present manuscript.2

1.2 Poisson claims frequency modeling


One conclusion of the tutorial of Ferrario et al. [4] has been that the volume (time Exposure in yearly units) needs a careful treatment in regression modeling; in particular, it may enter the regression function in a non-linear fashion. In order to keep the present analysis of this tutorial simple we will neglect this finding, and we will focus on the generalized linear model (GLM) as introduced in Noll et al. [10].
Our data set comprises 678'013 insurance policies i for which we assume that the numbers of claims N_i (ClaimNb on line 4 in Listing 1) are independent and Poisson distributed with

$$N_i \overset{\text{ind.}}{\sim} \text{Poi}\left(\lambda(x_i)\, v_i\right), \qquad (1.1)$$

for the given volumes v_i > 0 (time Exposure in years on line 7 in Listing 1) and a given claims frequency function x_i ↦ λ(x_i), where x_i describes the feature information of policy i, see Assumptions 2.1 in Noll et al. [10] and lines 8-16 in Listing 1. All policies have been active within one accounting year, and the volumes are considered pro-rata temporis, v_i ∈ (0, 1], for the corresponding time exposures.
¹ CASdatasets website http://cas.uqam.ca; the data is described on page 55 of the reference manual [2].
² The R code is available from https://github.com/JSchelldorfer/ActuarialDataScience

Task. The main problem to be solved is to find a regression function λ(·) such that it appropriately describes the data, and such that it generalizes to similar data which has not been seen yet. Note that the task of finding an appropriate regression function λ : X → R₊ also includes the definition of the feature space X, which typically varies over different modeling approaches.

1.3 Warming-up exercise in generalized linear modeling


In Section 3 of Noll et al. [10] we have presented a GLM approach as a first possible regression model to estimate the unknown regression function λ(·). In the present section we are going to refine that GLM. The starting point of the modeling exercise is to perform a thorough descriptive analysis to better understand the data; we do not repeat this here but refer to [10, 4]. In a second step the data needs to be pre-processed; in particular, feature pre-processing is done. We use the same feature pre-processing as in Section 3.1 of Noll et al. [10]:

• Area: we choose a continuous (log-linear) feature component for the Area code and we therefore map {A, ..., F} → {1, ..., 6};

• VehPower: we build 6 categorical classes by merging the vehicle power groups greater than or equal to 9 (6 labels in total);

• VehAge: we build 3 categorical classes [0, 1), [1, 10], (10, ∞);

• DrivAge: we build 7 categorical classes [18, 21), [21, 26), [26, 31), [31, 41), [41, 51), [51, 71), [71, ∞);

• BonusMalus: continuous log-linear feature component (we cap at value 150);

• VehBrand: categorical feature component (11 labels in total);

• VehGas: binary feature component;

• Density: the log-density is chosen as continuous log-linear feature component;

• Region: categorical feature component (22 labels in total).

Thus, we consider 3 continuous feature components (Area, BonusMalus, log-Density), 1 binary feature component (VehGas) and 5 categorical feature components (VehPower, VehAge, DrivAge, VehBrand, Region). The latter two are categorical by nature; the former three are continuous, but their functional forms are far from being log-linear, therefore we group them categorically; we also refer to Marra–Wood [9] for smooth variable selection. The categorical classes for VehPower, VehAge and DrivAge have been built based on expert opinion only. This expert opinion has tried to achieve homogeneity within class labels, and every class label should receive a sufficient volume (of observations), we refer to Sections 1 and 3 of Noll et al. [10]. Categorical features are implemented by dummy coding, see Section 2.2 in Ferrario et al. [4], and the resulting feature space X is given by

$$\mathcal{X} \subset [1,6] \times \{0,1\}^5 \times \{0,1\}^2 \times \{0,1\}^6 \times [50,150] \times \{0,1\}^{10} \times \{0,1\} \times [0,11] \times \{0,1\}^{21}. \qquad (1.2)$$

3
Electronic copy available at: https://ssrn.com/abstract=3320525
That is, we have a q₀ = 1 + 5 + 2 + 6 + 1 + 10 + 1 + 1 + 21 = 48 dimensional feature space X, and the feature components in {0,1}^k add up to either 0 or 1 (dummy coding); this side constraint is the reason for using the symbol "⊂" in formula (1.2), and we also refer to Listing 3 in Noll et al. [10] for more details. Based on this feature pre-processing we set up a first GLM.
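The feature pre-processing described above can be sketched in R as follows. This is a minimal sketch and not the tutorial's original code: the data frame dat and the *GLM column names are our own choices, and we assume that freMTPL2freq has been loaded from CASdatasets and that VehPower takes integer values 4, 5, ... as in that data set.

dat <- freMTPL2freq
dat$AreaGLM       <- as.integer(dat$Area)                 # {A,...,F} -> {1,...,6}
dat$VehPowerGLM   <- as.factor(pmin(dat$VehPower, 9))     # merge all powers >= 9 (6 labels)
dat$VehAgeGLM     <- cut(dat$VehAge, breaks = c(-Inf, 0, 10, Inf),
                         labels = c("[0,1)", "[1,10]", "(10,Inf)"))
dat$DrivAgeGLM    <- cut(dat$DrivAge, breaks = c(17, 20, 25, 30, 40, 50, 70, Inf),
                         labels = c("[18,21)", "[21,26)", "[26,31)", "[31,41)",
                                    "[41,51)", "[51,71)", "[71,Inf)"))
dat$BonusMalusGLM <- pmin(dat$BonusMalus, 150)            # cap at 150
dat$DensityGLM    <- log(dat$Density)                     # log-density as continuous component
# VehBrand, VehGas and Region are kept as factors; glm() applies dummy coding automatically.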
Model Assumptions 1.1 (Model GLM1) Choose feature space X as in (1.2) and define the regression function λ : X → R₊ by

$$x \mapsto \log \lambda(x) \overset{\text{def.}}{=} \beta_0 + \sum_{l=1}^{q_0} \beta_l x_l = \langle \beta, x\rangle, \qquad (1.3)$$

for parameter vector β = (β₀, ..., β_{q₀})′ ∈ R^{q₀+1}. Assume for i ≥ 1

$$N_i \overset{\text{ind.}}{\sim} \text{Poi}\left(\lambda(x_i)\, v_i\right).$$
We split our data into a learning data set D and a test data set T. We use exactly the same partition as in Listing 2 of Noll et al. [10]. Then, we fit Model GLM1 with maximum likelihood estimation (MLE) on the learning data set D by minimizing the corresponding in-sample Poisson deviance loss (objective function)³

$$\beta \mapsto \mathcal{L}(\mathcal{D}, \lambda) = \frac{1}{n} \sum_{i=1}^{n} 2 N_i \left[ \frac{\lambda(x_i) v_i}{N_i} - 1 - \log\left(\frac{\lambda(x_i) v_i}{N_i}\right)\right], \qquad (1.4)$$

for β-dependent parametric regression function λ(·) = λ_β(·), and where the summation runs over all policies 1 ≤ i ≤ n = 610'212 in the learning data set D. Denote the resulting MLE⁴ by β̂. This provides the estimated regression function λ̂(·) = λ_β̂(·). The quality of this model is assessed by the out-of-sample Poisson deviance loss (generalization loss) on the test data set T given by

$$\mathcal{L}(\mathcal{T}, \hat\lambda) = \frac{1}{n_T} \sum_{t=1}^{n_T} 2 N_t \left[ \frac{\hat\lambda(x_t) v_t}{N_t} - 1 - \log\left(\frac{\hat\lambda(x_t) v_t}{N_t}\right)\right], \qquad (1.5)$$

where the summation runs over all policies 1 ≤ t ≤ n_T = 67'801 in the test data set T.
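For concreteness, the deviance losses (1.4) and (1.5) can be evaluated with a small helper function. The sketch below is our own illustration (not part of the tutorial's listings); it assumes d.glm1 is the fitted glm object of Model GLM1, learn and test are the two data sets, and the offset was specified inside the glm formula via offset(log(Exposure)) so that predict() includes it on new data.

# Poisson deviance loss in units of 10^(-2); the convention 0*log(0) = 0 is
# handled by (pred/obs)^obs, which evaluates to 1 in R whenever obs = 0.
PoissonDeviance <- function(pred, obs) {
  100 * 2 * (sum(pred) - sum(obs) - sum(log((pred / obs)^obs))) / length(pred)
}

learn$fitGLM1 <- fitted(d.glm1)                                      # lambda(x_i) * v_i on the learning data
test$fitGLM1  <- predict(d.glm1, newdata = test, type = "response")  # same on the test data
PoissonDeviance(learn$fitGLM1, learn$ClaimNb)                        # in-sample loss (1.4)
PoissonDeviance(test$fitGLM1,  test$ClaimNb)                         # out-of-sample loss (1.5)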
This provides the numerical results of Table 1, see also Table 5 in Noll et al. [10]. We observe that this Model GLM1 leads to a substantial improvement over the model with constant frequency parameter λ (estimated by MLE). The last column of Table 1 gives the estimated frequency on the test data set T, the empirically observed value being 10.41%, see Table 3 in [10].

                                   run time   # param.   in-sample loss   out-of-sample loss   average frequency
homogeneous model (λ ≡ constant)     0.1s          1        32.93518          33.86149              10.02%
Model GLM1                            20s         49        31.26738          32.17123              10.01%
Model GLM2                            17s         48        31.25674          32.14902              10.01%

Table 1: run time, number of model parameters, in-sample and out-of-sample losses (units are in 10⁻²), average estimated frequency on T (the empirically observed value is 10.41%, see Table 3 in [10]).
³ Recall that minimizing the deviance loss is equivalent to maximizing the log-likelihood.
⁴ The run time of the corresponding R function glm on a personal laptop with an Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz and 16GB RAM to find the MLE β̂ is roughly 20 seconds; the optimization method used is iteratively weighted least squares (IWLS).

Figure 1: Model GLM1: estimated frequencies w.r.t. the categorized (continuous) feature components VehPower, VehAge and DrivAge (the corresponding reference group is normalized to the overall frequency of 10%, illustrated by the dotted line).

In Figure 1 we present the resulting estimated frequencies of the (categorized) feature components VehPower, VehAge and DrivAge of Model GLM1 (the corresponding reference group is normalized to the overall frequency of 10%, illustrated by the dotted line). Note that these feature components are continuous in nature, but we have turned them into categorical ones for modeling purposes (as mentioned above). Having so much data, we can further explore these categorical feature components by trying to replace them by ordinal ones assuming an appropriate continuous functional form, still fitting into the GLM framework.⁵
As an example we show how to bring DrivAge into a continuous functional form. We therefore modify the feature space X from (1.2) and the regression function λ from (1.3). We replace the 7 categorical age classes by the following continuous function

$$\text{DrivAge} \mapsto \beta_l\, \text{DrivAge} + \beta_{l+1} \log(\text{DrivAge}) + \sum_{j=2}^{4} \beta_{l+j}\, (\text{DrivAge})^j, \qquad (1.6)$$

with regression parameters β_l, ..., β_{l+4}. Thus, we replace the 7 categorical classes (involving 6 regression parameters from dummy coding) by the above continuous functional form having 5 regression parameters. The remaining parts of the regression function in (1.3) are kept unchanged, and we call this new model Model GLM2.
On lines 1-4 of Listing 2 we specify Model GLM2 in detail: it shows the specific functional form for DrivAge, and it keeps all other terms unchanged; in particular, the two categorized (continuous) variables VehPower and VehAge are kept as in Model GLM1. On lines 14-18 of Listing 2 we provide the resulting MLEs of this continuous implementation (1.6) of the feature component DrivAge. We observe that all terms in the chosen functional form for DrivAge are significant.
The resulting out-of-sample performance on the test data T of this second model fitted to
the learning data D is given in Table 1. We observe a slight improvement in (out-of-sample)
predictive power (generalization loss). Henceforth, we prefer this latter model over the former
one. Note that this transformation has reduced the number of estimated parameters by 1 from
q0 + 1 = 49 to 48. This Model GLM2 will be the benchmark for all subsequent considerations.
⁵ We could also consider generalized additive models (GAMs), but we refrain from doing so for the moment.

Listing 2: continuous coding of DrivAge: MLE results of Model GLM2

 1 glm(formula = ClaimNb ~ AreaGLM + VehPowerGLM + VehAgeGLM + BonusMalusGLM +
 2     VehBrand + VehGas + DensityGLM + Region +
 3     DrivAge + log(DrivAge) + I(DrivAge^2) + I(DrivAge^3) + I(DrivAge^4),
 4     family = poisson(), data = learn, offset = log(Exposure))
 5
 6 Coefficients:
 7                  Estimate  Std. Error  z value  Pr(>|z|)
 8 (Intercept)     6.793e+01   5.227e+00   12.996   < 2e-16 ***
 9 AreaGLM         8.047e-03   1.703e-02    0.472   0.63662
10 VehPowerGLM5    2.008e-01   1.925e-02   10.427   < 2e-16 ***
11 VehPowerGLM6    2.301e-01   1.916e-02   12.011   < 2e-16 ***
12 .               .
13 .               .
14 DrivAge         3.421e+00   2.916e-01   11.733   < 2e-16 ***
15 log(DrivAge)   -4.131e+01   3.155e+00  -13.095   < 2e-16 ***
16 I(DrivAge^2)   -4.889e-02   4.775e-03  -10.239   < 2e-16 ***
17 I(DrivAge^3)    3.873e-04   4.400e-05    8.803   < 2e-16 ***
18 I(DrivAge^4)   -1.222e-06   1.633e-07   -7.485  7.17e-14 ***
19 ---
20 Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
21
22 (Dispersion parameter for poisson family taken to be 1)
23
24     Null deviance: 200974  on 610211  degrees of freedom
25 Residual deviance: 190732  on 610164  degrees of freedom

Figure 2: comparison of the estimated frequencies in Models GLM1 and GLM2 for the feature component DrivAge (normalized to age 46).

In Figure 2 we compare the resulting estimated frequencies of the two modeling approaches for DrivAge. The continuous Model GLM2 for driver's age looks similar to the categorical labeling, but it provides a smooth transition between the age classes compared to Model GLM1, and it leads to a substantially higher estimate for drivers of ages 18-19. Concluding, we do not have any reservations about using this continuous version.
We could proceed in a similar way for VehPower and VehAge. In order not to overload this tutorial, we refrain from doing so, and we choose Model GLM2 as the benchmark model for all our subsequent derivations.

Conclusion. We choose Model GLM2 as our benchmark model. This model is illustrated in
Listing 2, and it has 48 parameters to be estimated. One weakness of this model is that it does
not explore interactions between feature components beyond multiplications. This and other
points are going to be challenged in the following sections.

2 Embedding layers in neural networks


2.1 Definition of a neural network
We start by defining a generic feed-forward neural network, subsequently abbreviated as network. We recall Ferrario et al. [4]. Choose k ≥ 1 and hyperparameters q_{k−1}, q_k ∈ ℕ. A network layer is a mapping

$$z^{(k)} : \mathbb{R}^{q_{k-1}} \to \mathbb{R}^{q_k}, \qquad z \mapsto z^{(k)}(z) = \left(z_1^{(k)}(z), \dots, z_{q_k}^{(k)}(z)\right)', \qquad (2.1)$$

with q_k hidden neurons in the k-th hidden layer given by

$$z_j^{(k)}(z) = \phi\left(w_{j,0}^{(k)} + \sum_{l=1}^{q_{k-1}} w_{j,l}^{(k)} z_l\right) \overset{\text{def.}}{=} \phi\langle w_j^{(k)}, z\rangle, \qquad \text{for } j = 1, \dots, q_k, \qquad (2.2)$$

with weights $w^{(k)} = (w_1^{(k)}, \dots, w_{q_k}^{(k)})' = (w_{1,0}^{(k)}, \dots, w_{q_k, q_{k-1}}^{(k)})' \in \mathbb{R}^{q_k(1+q_{k-1})}$ and activation function φ : ℝ → ℝ. We emphasize two points from [10, 4]:

• q₀ is the dimension of the feature space X with input neurons z^(0) = x ∈ X (input layer).

• φ : ℝ → ℝ is a (non-linear) activation function. In the sequel we choose the hyperbolic tangent activation function φ(x) = tanh(x).

A general network architecture with K hidden layers for our Poisson regression problem is then obtained by adding an output layer as follows

$$\lambda : \mathcal{X} \to \mathbb{R}_+, \qquad x \mapsto \lambda(x) = \exp\left\{\left\langle w^{(K+1)}, \left(z^{(K)} \circ \cdots \circ z^{(1)}\right)(x)\right\rangle\right\}. \qquad (2.3)$$

That is, we map the neurons z^(K) of the last hidden layer to the output layer R₊ using the exponential activation function and weights w^(K+1) (including an intercept). This network architecture has depth K and receives a network parameter θ ∈ R^r of dimension $r = \sum_{k=1}^{K+1} q_k(1 + q_{k-1})$, collecting all network weights w^(k), k = 1, ..., K+1, where we set q_{K+1} = 1. Three examples with K = 3 hidden layers are given in Figure 3. The example in the middle of Figure 3 has an input layer of dimension q₀ = 9 and q₁ = 20, q₂ = 15 and q₃ = 10 hidden neurons, which results in a network parameter θ of dimension r = 686.
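As a quick sanity check of this parameter count, the dimension r = Σ_{k=1}^{K+1} q_k (1 + q_{k−1}) can be evaluated for the three architectures of Figure 3 as follows (our own illustration, not part of the tutorial's code):

# network parameter dimension for hidden neurons (q1, q2, q3) = (20, 15, 10) and q4 = 1 (output)
network.dim <- function(q0, q = c(20, 15, 10, 1)) {
  sum(q * (1 + c(q0, q[-length(q)])))    # sum over q_k * (1 + q_{k-1}), k = 1, ..., K+1
}
network.dim(q0 = 40)   # one-hot encoding of VehBrand and Region:           1306
network.dim(q0 = 9)    # 1-dimensional embeddings (embedding weights extra):  686
network.dim(q0 = 11)   # 2-dimensional embeddings (embedding weights extra):  726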

The network in Figure 3 (middle) shows one single neuron in the input layer for each feature component (blue, green and magenta colors), thus an input layer of dimension q₀ = 9. However, we have two categorical feature components, VehBrand and Region, with more than 2 different categorical labels. One-hot encoding requires that these two components receive 11 and 22 input neurons, respectively. Thus, one-hot encoding implies that the input layer has dimension q₀ = 40 (if we assume that all other feature components need one single input neuron). This results in dimension r = 1'306 for the network parameter θ.
Figure 3: networks with K = 3 hidden layers having q₁ = 20, q₂ = 15 and q₃ = 10 hidden neurons in the three hidden layers; the input layer has dimensions q₀ = 40 (lhs), q₀ = 9 (middle) and q₀ = 11 (rhs), resulting in network parameter dimensions r = 1'306, 686 and 726, respectively.

This is exactly the network illustrated in Figure 3 (lhs), with one-hot encoding for VehBrand in green color and one-hot encoding for Region in magenta color. Brute-force network calibration then simply fits this model using a version of the gradient descent algorithm; this is exactly what has been demonstrated in our previous tutorial [4].
We should ask ourselves whether the brute-force implementation of categorical feature components using one-hot encoding is optimal, since it seems to introduce an excessive number of network parameters (in our case r = 1'306). There is a second reason why one-hot encoding seems to be sub-optimal for our purposes. In general, we would like to identify (cluster) labels that are similar for the regression modeling problem. This is not the case with one-hot encoding. If we consider, for instance, the 11 vehicle brands B = {B1, B10, ..., B6}, one-hot encoding assigns a different unit vector x_VehBrand ∈ R^11 to each VehBrand ∈ B. For two different brands VehBrand₁ ≠ VehBrand₂ ∈ B we always receive ‖x_VehBrand₁ − x_VehBrand₂‖ = √2, thus the (Euclidean) distance between any two vehicle brands is the same under one-hot encoding. In the next section we present embedding layers, which aim at embedding categorical feature components into low-dimensional Euclidean spaces, clustering labels that are more similar for the regression modeling problem.

2.2 Embedding layers for categorical feature components


Recently, Richman [12] has proposed to use embedding layers for categorical feature components, and he has noted that this can lead to better results compared to one-hot encoding, see Table 5 in [12]. Embedding layers are very common in natural language processing (NLP); we refer to Bengio et al. [1], Sahlgren [14] and Section 3 in Richman [12] for an overview of embedding layers. In NLP, embedding layers are used to represent words by numerical coordinates in a low-dimensional space. This approach has two advantages. First, the dimension is reduced (compared to one-hot encoding with a large sparse matrix). Second, similarities between words can be examined, providing additional insights compared to one-hot encoding. See also Richman [12] for the rationales behind embedding layers.

We exemplify the construction of an embedding layer on the categorical feature component VehBrand. For an embedding layer, we need to choose an embedding dimension d ∈ ℕ (hyperparameter). The embedding is then defined by an embedding mapping

$$e : \mathcal{B} \to \mathbb{R}^d, \qquad \text{VehBrand} \mapsto e_{\text{VehBrand}} \overset{\text{def.}}{=} e(\text{VehBrand}). \qquad (2.4)$$

Thus, we allocate to every label VehBrand ∈ B a d-dimensional vector e_VehBrand ∈ R^d. This is called an embedding of B into R^d, and the embedding weights e_VehBrand are learned during the model calibration, which is called representation learning.

Figure 4: schematic illustration of a two-dimensional embedding of VehBrand ∈ B.

In Figure 4 we illustrate a two-dimensional embedding of B. It shows for every vehicle brand a two-dimensional representation, i.e.

$$\text{B1} \mapsto e_{\text{B1}} \in \mathbb{R}^2, \quad \text{B10} \mapsto e_{\text{B10}} \in \mathbb{R}^2, \quad \dots, \quad \text{B6} \mapsto e_{\text{B6}} \in \mathbb{R}^2.$$

The schematic illustration of Figure 4 has the interpretation that, for instance, vehicle brand B12 is rather different from all other vehicle brands, and vehicle brands B1 and B3 have similarities, illustrated by a small Euclidean distance between these two vehicle brands.
If we embed the two categorical feature components VehBrand and Region into embedding layers of dimension d = 1 each, then these embeddings use 11 + 22 = 33 embedding weights. For embeddings of dimension d = 2 each, we receive 11·2 + 22·2 = 66 embedding weights.
Having one-dimensional embeddings provides the network in Figure 3 (middle), with embedding weight e_VehBrand ∈ R¹ in green color and embedding weight e_Region ∈ R¹ in magenta color. This model results in 33 + 686 = 719 parameters to be learned, thus substantially less than the 1'306 parameters from one-hot encoding. On the other hand, it adds an additional layer for the embeddings, which may slow down calibration.
In Figure 3 (rhs) we show the resulting network with two-dimensional embeddings e_VehBrand ∈ R² and e_Region ∈ R², resulting in a parameter of dimension 66 + 726 = 792.

2.3 Embedding layer example
We compare brute-force one-hot encoding and embedding layers on an explicit example. We choose a network of depth K = 3, having hidden neurons (q₁, q₂, q₃) = (20, 15, 10), and using one-hot encoding for the feature components VehBrand and Region; this network is illustrated in Figure 3 (lhs). As described above, this results in r = 1'306 network parameters to be calibrated. In order to fit this model we use the R interface to Keras.⁶ The code is provided in Listing 3.

Listing 3: network of depth 3 with one-hot coding for categorical features

 1 Design <- layer_input(shape = c(40), dtype = 'float32', name = 'Design')
 2 LogVol <- layer_input(shape = c(1),  dtype = 'float32', name = 'LogVol')
 3
 4 Network = Design %>%
 5   layer_dense(units = 20, activation = 'tanh', name = 'hidden1') %>%
 6   layer_dense(units = 15, activation = 'tanh', name = 'hidden2') %>%
 7   layer_dense(units = 10, activation = 'tanh', name = 'hidden3') %>%
 8   layer_dense(units = 1, activation = 'linear', name = 'Network',
 9     weights = list(array(0, dim = c(10, 1)), array(log(lambda.hom), dim = c(1))))
10
11 Response = list(Network, LogVol) %>% layer_add(name = 'Add') %>%
12   layer_dense(units = 1, activation = k_exp, name = 'Response', trainable = FALSE,
13     weights = list(array(1, dim = c(1, 1)), array(0, dim = c(1))))
14
15 model <- keras_model(inputs = c(Design, LogVol), outputs = c(Response))
16 model %>% compile(optimizer = optimizer_nadam(), loss = 'poisson')

Network on lines 4-9 of Listing 3 defines a network of depth K = 3 having neurons (q₁, q₂, q₃) = (20, 15, 10) and the hyperbolic tangent activation function. This network produces a one-dimensional output (line 8 of Listing 3). It is initialized such that we start at the MLE of the homogeneous model (constant frequency parameter, line 9 of Listing 3). On lines 11-13 we add the non-trainable offset log(Exposure), and on line 16 we specify the nadam optimizer, see Section 8.5 in Goodfellow et al. [6], and the Poisson deviance loss as objective function.

Figure 5: performance of the gradient descent algorithm, the blue graphs show the training losses and the red graphs the validation losses: (lhs) one-hot encoding for categorical feature components; (middle) 1-dimensional embeddings of categorical feature components; (rhs) 2-dimensional embeddings of categorical feature components.
⁶ Keras is a user-friendly API to TensorFlow, see https://tensorflow.rstudio.com/keras/

We use the same feature pre-processing as in Section 2 of [4]⁷, and we run the gradient descent algorithm for 500 epochs on the learning data set D with mini-batches of size 10'000 policies. To track over-fitting we split the learning data in the ratio 9:1 into a training data set and a validation data set. In Figure 5 (lhs) we plot the decrease of the training loss (blue color) and the validation loss (red color), respectively, over 500 epochs. We see that after roughly 250 epochs we may exercise early stopping.
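The corresponding fitting call is sketched below; it is our own illustration with hypothetical object names Xlearn, Ylearn and LogVol.learn for the pre-processed design matrix, the observed claim counts and the log-exposures, and model is the one compiled in Listing 3. Note that validation_split in Keras uses the last 10% of the supplied data as validation set, which may differ from the exact partition used in the tutorial.

fit <- model %>% fit(
  x = list(Xlearn, LogVol.learn),   # inputs 'Design' and 'LogVol' of Listing 3
  y = Ylearn,                       # observed claim counts ClaimNb
  epochs = 500,
  batch_size = 10000,               # mini-batches of 10'000 policies
  validation_split = 0.1,           # 9:1 split into training and validation data
  verbose = 0
)
plot(fit)                           # training and validation losses as in Figure 5 (lhs)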

                      epochs   run time   # param.   in-sample loss   out-of-sample loss   average frequency
homogeneous model          –       0.1s          1        32.93518          33.86149              10.02%
Model GLM2                 –        17s         48        31.25674          32.14902              10.01%
Network One-Hot          250       152s      1'306        30.26768          31.67343              10.19%
Network Emb(d = 1)       700       419s        719        30.24464          31.50647               9.90%
Network Emb(d = 2)       600       365s        792        30.16513          31.45327               9.70%

Table 2: epochs, run time, number of model parameters, in-sample and out-of-sample losses (units are in 10⁻²) of Model GLM2 and of the (q₁, q₂, q₃) = (20, 15, 10) networks with one-hot encoding and with embedding layers of dimensions d = 1, 2 for the categorical feature components VehBrand and Region.

The resulting losses of this network after 250 epochs on the entire learning data are given in Table 2 on the row "Network One-Hot". We note that we obtain a clearly better model than Model GLM2 in terms of Poisson deviance losses (at the price of more run time). Fine-tuning of the network architecture and the gradient descent algorithm could further improve this model. For the time being we stay with the current network architecture and its calibration, because we would like to see whether we get an improvement using embedding layers for categorical feature components.
The code for designing the network architecture with embedding layers for the categorical ex-
planatory variables VehBrand and Region is given in Listing 4, with the first line defining the
dimension d of the embedding layers (we use the same embedding dimension for both categorical
feature components). The network results of these architectures with d = 1 and d = 2, respec-
tively, are provided in Table 2, and Figure 5 gives the convergence behaviors on training and
validation sets (being a 9:1 partition of the learning data D). In view of Figure 5 (middle and
rhs) we use 700 epochs and 600 epochs for d = 1 and d = 2, respectively. The latter model has
more parameters which also provides more degrees of freedom to the gradient descent method.
This seems to slightly accelerate the fitting behavior.
On the one hand, we observe that embedding layers provide a slower rate of convergence and longer run times than one-hot encoding of categorical variables. We suppose that this is caused by the fact that an embedding layer adds an additional layer to the network, see the green and magenta arrows in Figure 3 (middle, rhs). Therefore, the back-propagation method for network calibration needs to be performed over 4 hidden layers for embedding layer coding, compared to 3 hidden layers in one-hot encoding of categorical feature components.
On the other hand, the fitted models with embedding layers clearly outperform the model with one-hot encoding in terms of the out-of-sample loss, if the former models are trained sufficiently long. What is more worrying is that the calibration of the network models is very unstable (in the choice of the initial value of the gradient descent algorithm).
⁷ The corresponding R code is available from https://github.com/JSchelldorfer/ActuarialDataScience

Listing 4: network of depth 3 with embeddings for categorical features

 1 d <- 1    # dimension of the embedding layers
 2 Design   <- layer_input(shape = c(7), dtype = 'float32', name = 'Design')
 3 VehBrand <- layer_input(shape = c(1), dtype = 'int32',   name = 'VehBrand')
 4 Region   <- layer_input(shape = c(1), dtype = 'int32',   name = 'Region')
 5 LogVol   <- layer_input(shape = c(1), dtype = 'float32', name = 'LogVol')
 6
 7 BrEmb = VehBrand %>%
 8   layer_embedding(input_dim = 11, output_dim = d, input_length = 1, name = 'BrEmb') %>%
 9   layer_flatten(name = 'Br_flat')
10
11 ReEmb = Region %>%
12   layer_embedding(input_dim = 22, output_dim = d, input_length = 1, name = 'ReEmb') %>%
13   layer_flatten(name = 'Re_flat')
14
15 Network = list(Design, BrEmb, ReEmb) %>% layer_concatenate(name = 'concate') %>%
16   layer_dense(units = 20, activation = 'tanh', name = 'hidden1') %>%
17   layer_dense(units = 15, activation = 'tanh', name = 'hidden2') %>%
18   layer_dense(units = 10, activation = 'tanh', name = 'hidden3') %>%
19   layer_dense(units = 1, activation = 'linear', name = 'Network',
20     weights = list(array(0, dim = c(10, 1)), array(log(lambda.hom), dim = c(1))))
21
22 Response = list(Network, LogVol) %>% layer_add(name = 'Add') %>%
23   layer_dense(units = 1, activation = k_exp, name = 'Response', trainable = FALSE,
24     weights = list(array(1, dim = c(1, 1)), array(0, dim = c(1))))
25
26 model <- keras_model(inputs = c(Design, VehBrand, Region, LogVol), outputs = c(Response))

This results in fluctuating average frequencies, see the last column in Table 2. In fact, these numbers (not being part of the objective function during model calibration) fluctuate quite a bit, which is a major issue for insurance pricing.

Figure 6: (lhs) histogram of exposures per VehBrand, (middle) observed frequency per VehBrand, (rhs) resulting weights in the embedding layer for d = 2.

The embedding layers have another advantage, namely, we can graphically illustrate the findings of the network (at least if d is small). This is very useful in NLP as it allows us to explore similar words graphically in 2 or 3 dimensions, after some further dimension reduction techniques have been applied, see [1, 14, 12]. In Figures 6 (rhs) and 7 (rhs) we illustrate the resulting embedding weights for d = 2, see also (2.4). We observe clustering in both categorical labels, which indicates that some labels could be merged.

Figure 7: (lhs) histogram of exposures per Region, (middle) observed frequency per Region, (rhs) resulting weights in the embedding layer for d = 2.

For VehBrand we observe that car brand B12 is different from all other car brands, B10 and B11 seem to have similarities, and the remaining car brands cluster. For Region the result is more diverse; in fact, Figure 7 (rhs) suggests that a 1-dimensional embedding is not sufficient. This is in contrast to Figure 6 (rhs), where we have high co-linearity between the two dimensions. Figures 6 (lhs, middle) and 7 (lhs, middle) are taken from Figures 8 and 11 of [10]; they show the observed marginal frequencies and the underlying volumes.
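The embedding weights displayed in Figures 6 (rhs) and 7 (rhs) can be read off the fitted Keras model directly. Below is a small sketch of our own, assuming a model fitted with Listing 4 and d = 2, and assuming that the integer encoding of the labels follows the factor levels of freMTPL2freq.

# extract the learned embedding weights (11 x d and 22 x d matrices)
emb.Br <- get_weights(get_layer(model, 'BrEmb'))[[1]]
emb.Re <- get_weights(get_layer(model, 'ReEmb'))[[1]]
# scatter plot of the two embedding dimensions per vehicle brand, as in Figure 6 (rhs)
plot(emb.Br[, 1], emb.Br[, 2], type = "n", xlab = "dimension 1", ylab = "dimension 2",
     main = "2-dimensional embedding of VehBrand")
text(emb.Br[, 1], emb.Br[, 2], labels = levels(freMTPL2freq$VehBrand))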

Conclusions. The networks improve the GLM results in terms of out-of-sample losses because we have not invested sufficient effort in finding the optimal GLM with respect to feature engineering and potential interactions. From the analysis in this section we prefer embedding layers over one-hot encoding for categorical feature components, however, at the price of longer run times. Besides potentially improving the out-of-sample performance of the network, embedding layers allow us to visually identify relationships between the different levels of categorical inputs. The downsides of networks are that calibrations lead to volatile average frequency estimates and bias fluctuations. This is also going to be studied in the next section.

3 Combined actuarial neural network approach


3.1 Nesting the actuarial model into a network architecture
In this section we combine the classical GLM and the network. This approach can be seen as rather universal because it applies to many other parametric regression problems; for another example see [5]. The idea is to nest the GLM into a network architecture.

Model Assumptions 3.1 (CANN approach: part I) Choose a feature space X ⊂ R^{q₀} and define the regression function λ : X → R₊ by

$$x \mapsto \log \lambda(x) = \langle \beta, x\rangle + \left\langle w^{(K+1)}, \left(z^{(K)} \circ \cdots \circ z^{(1)}\right)(x)\right\rangle, \qquad (3.1)$$

where the first term on the right-hand side of (3.1) is the regression function from Model Assumptions 1.1 with parameter vector β, and the second term is the regression function from (2.3) with network parameter θ. Assume $N_i \overset{\text{ind.}}{\sim} \text{Poi}(\lambda(x_i)\, v_i)$ for all i ≥ 1.

Figure 8: CANN approach illustrating in orange color the classical GLM in the skip connection added to a network of depth K = 3 with (q₁, q₂, q₃) = (20, 15, 10) hidden neurons.

The CANN approach of Model Assumptions 3.1 is illustrated in Figure 8. The skip connection
in orange color contains the GLM (note that for the moment we neglect that categorical feature
components may use a different encoding for the GLM and the network parts).

We provide some remarks.

• Formula (3.1) combines our previous two models; in particular, it embeds the GLM into a network architecture by packing it into a so-called skip connection that directly links the input layer to the output layer, see the orange arrow in Figure 8. Skip connections are used in deep networks because they have good calibration properties, potentially avoiding the vanishing gradient problem, see He et al. [7] and Huang et al. [8]. We use the skip connection for a different purpose here.

• The two models are combined in the output layer by a (simple) addition. This addition makes one of the intercepts β₀ and w₀^(K+1) superfluous. Therefore, we typically fix one of the intercepts, in most cases β₀, and we only train the other intercept, say w₀^(K+1), in the new network parameter ϑ = (β, θ) of regression function (3.1).

• Regression function (3.1) requires that the GLM and the network model are defined on the same feature space X. This may require that we merge the feature spaces of the GLM and of the network approach; moreover, not both parts of the regression function (3.1) need to consider all components of that merged feature space, for instance, when the GLM considers a component in a dummy coding representation and the network part considers the same component in a continuous coding fashion.

The second important ingredient is the following idea.

Initialization 3.2 (CANN approach: part II) Assume that Model Assumptions 3.1 hold and that β̂ denotes the MLE for β under Model Assumptions 1.1. Initialize regression function (3.1) as follows: set for network parameter ϑ = (β, θ) the initial value

$$\vartheta_0 = (\hat\beta, \theta_0) \qquad \text{with output layer weight } w^{(K+1)} \equiv 0 \text{ in } \theta_0. \qquad (3.2)$$

Note that initialization (3.2) exactly provides the MLE prediction of the GLM part of the CANN
model (3.1), i.e. it minimizes the Poisson deviance loss under Model Assumptions 1.1. If we start
the gradient descent algorithm for fitting the CANN model (3.1) in this initial value ϑ0 , and if
we use the Poisson deviance loss as objective function, then the algorithm explores the network
architecture for additional model structure that is not present in the GLM and which lowers the
initial Poisson deviance loss related to the (initial) network parameter ϑ0 . In this way we obtain
an improvement of the GLM by network features. This provides a more systematic way of using
network architectures to improve the GLM. We will highlight this with several examples.

3.2 Variants of the CANN approach


Before providing explicit examples we would like to briefly discuss some variants of the CANN approach (3.1)-(3.2). The CANN approach starts the gradient descent algorithm in the regression model

$$x \mapsto \lambda(x) = \exp\left\{\langle \hat\beta, x\rangle + \left\langle w^{(K+1)}, \left(z^{(K)} \circ \cdots \circ z^{(1)}\right)(x)\right\rangle\right\}, \qquad (3.3)$$

where β̂ is the MLE of β. There are two different ways of applying the gradient descent algorithm to (3.3): (1) we train the entire network parameter ϑ = (β, θ); (2) we declare the GLM part β̂ to be non-trainable and only train the second term in ϑ = (β̂, θ). In the latter case, the optimal GLM always remains in the CANN regression function and it is modified by the network part. In the former case, the optimal GLM is modified by interacting with the network part.
A variant of (3.3), in the case where we declare β̂ to be non-trainable, is to introduce a trainable (credibility) weight α ∈ [0, 1] and to define the new regression model

$$x \mapsto \lambda(x) = \exp\left\{\alpha \langle \hat\beta, x\rangle + (1-\alpha)\left\langle w^{(K+1)}, \left(z^{(K)} \circ \cdots \circ z^{(1)}\right)(x)\right\rangle\right\}. \qquad (3.4)$$

If we train this model, then we learn a credibility weight α at which the GLM is considered in the CANN approach.
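A possible Keras sketch of the blending (3.4) is given below; it is not part of the tutorial's listings. It uses the identity α a + (1 − α) b = b + α (a − b), so that the single blending weight α can be realized by one dense layer without bias. Here LogLambdaGLM is an assumed additional input column containing the pre-computed GLM linear predictor ⟨β̂, x⟩, while Network, LogVol and the remaining inputs are as in Listing 4. The weight learned this way is unconstrained; restricting α to [0, 1] would require an additional kernel constraint.

LogLambdaGLM <- layer_input(shape = c(1), dtype = 'float32', name = 'LogLambdaGLM')

Diff  <- list(LogLambdaGLM, Network) %>% layer_subtract(name = 'Diff')        # a - b
Alpha <- Diff %>% layer_dense(units = 1, activation = 'linear', use_bias = FALSE,
           name = 'Alpha', weights = list(array(1, dim = c(1, 1))))           # alpha * (a - b), start at alpha = 1
Blend <- list(Network, Alpha) %>% layer_add(name = 'Blend')                   # b + alpha*(a - b) = alpha*a + (1-alpha)*b

Response <- list(Blend, LogVol) %>% layer_add(name = 'Add') %>%
  layer_dense(units = 1, activation = k_exp, name = 'Response', trainable = FALSE,
    weights = list(array(1, dim = c(1, 1)), array(0, dim = c(1))))

model <- keras_model(inputs = c(Design, VehBrand, Region, LogLambdaGLM, LogVol),
                     outputs = c(Response))

Initializing the blending weight at α = 1 lets the gradient descent algorithm start from the pure GLM prediction, in the spirit of the CANN initialization below; the model is then compiled and fitted exactly as before.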
An extension of the CANN approach also allows us to learn across multiple insurance portfolios. Assume we have J insurance portfolios, all living on the same feature space X and with β̂_j denoting the MLE of portfolio j = 1, ..., J in the GLM. Let χ ∈ {1, ..., J} be a categorical variable denoting which portfolio we consider. We define the regression function

$$(x, \chi) \mapsto \lambda(x, \chi) = \exp\left\{\sum_{j=1}^{J} \langle \hat\beta_j, x\rangle\, \mathbb{1}_{\{\chi = j\}} + \left\langle w^{(K+1)}, \left(z^{(K)} \circ \cdots \circ z^{(1)}\right)(x)\right\rangle\right\}.$$

In this case, the neural network part allows us to learn across portfolios because it describes the interaction between the portfolios. This approach has been considered in Gabrielli et al. [5].

3.3 A CANN example


3.3.1 Generic CANN implementation

We present a first example that implements the CANN approach (3.3), where we declare the MLE β̂ of Model GLM2 to be non-trainable and where we use d = 1 for the embedding layers of the network part. We call this first example Model CANN0. The R script of this architecture is given in Listing 9 in the appendix. We comment on this in more detail. This model has 686 trainable network parameters, which are exactly the network weights shown in Figure 3 (middle), i.e. these are the weights θ that come from the neural network part. Then, it has 58 non-trainable network parameters: these are the 48 GLM parameters in β̂ of Model GLM2 which we choose as non-trainable (see lines 14, 20, 26, 32, 38 and 39 of Listing 9), as well as 10 non-trainable parameters, where 4 non-trainable parameters stem from the embedding identification (one-hot versus dummy coding for VehPower, VehAge, VehBrand and Region), 4 non-trainable parameters come from concatenating the GLM network (line 38 of Listing 9), and 2 non-trainable parameters are from blending the GLM with the network part (line 50 of Listing 9).
On lines 1-8 of Listing 9 we define the input variables: ContGLM ∈ R⁴ collects the four ordinal variables BonusMalus, VehGas, Density, Area; on lines 2-3 there are the GLM-categorized variables VehPowerGLM and VehAgeGLM; on lines 4-5 the categorical variables VehBrand and Region; line 6 collects all DrivAge related variables from line 3 of Listing 2, i.e. DrivAgeGLM ∈ R⁵ has 5 continuous components, see also (1.6); line 7 collects the continuous variables VehPower, VehAge and DrivAge. Thus, the latter three variables enter the CANN model twice, in a different form for the GLM part and for the network part in (3.3). Moreover, we pre-process all feature components that enter the network part of the architecture with the MinMaxScaler (for the MinMaxScaler we refer to Section 2.2 of Ferrario et al. [4]). Finally, line 8 defines the offset log(Exposure).
On lines 11-33 we define the embedding layers for the 4 categorical variables VehPowerGLM, VehAgeGLM, VehBrand and Region using the MLE β̂ as non-trainable weights. On lines 35-39 these categorical variables are concatenated with the continuous ones of the GLM part, again using β̂ as non-trainable weights. This provides the non-trainable GLMNetwork, see lines 35-39. On lines 41-46 we define the (q₁, q₂, q₃) = (20, 15, 10) network architecture. This considers the 9-dimensional input consisting of ContGLM ∈ R⁴ and ContNN ∈ R³, as well as the two one-dimensional (d = 1) embedding weights for VehBrand and Region. These embedding weights are exactly the GLM parameters (and they are declared to be non-trainable). Therefore, this part exactly corresponds to Figure 3 (middle) with 686 trainable weights illustrated by the black arrows. We blend the two models on lines 48-50, also including the offset log(Exposure) from the underlying volumes.

3.3.2 Poisson CANN implementation

Having the R script of Listing 9 we are ready to calibrate Model CANN0. Before doing so, we mention that this code is a bit cumbersome for the task we try to achieve. In the case of the Poisson distribution we can substantially simplify the CANN implementation. From (3.3) we see that if the GLM part is non-trainable with MLE β̂, then we can merge this term with the given volumes v_i. Thus, we consider a network function

$$x \mapsto \log \lambda_{\text{NN}}(x) = \left\langle w^{(K+1)}, \left(z^{(K)} \circ \cdots \circ z^{(1)}\right)(x)\right\rangle, \qquad (3.5)$$

and we assume that

$$N_i \overset{\text{ind.}}{\sim} \text{Poi}\left(\lambda_{\text{NN}}(x_i)\, v_i^{\text{GLM}}\right), \qquad \text{with working weights } v_i^{\text{GLM}} = v_i \exp\langle \hat\beta, x_i\rangle. \qquad (3.6)$$

This leads to a (much) simpler representation of the CANN model, in particular, we can replace
Listing 9 of the appendix by Listing 5 (which is almost identical to Listing 4).

Listing 5: Model CANN1/2 with embedding layers with dimensions d = 1, 2

 1 Design    <- layer_input(shape = c(7), dtype = 'float32', name = 'Design')
 2 VehBrand  <- layer_input(shape = c(1), dtype = 'int32',   name = 'VehBrand')
 3 Region    <- layer_input(shape = c(1), dtype = 'int32',   name = 'Region')
 4 LogVolGLM <- layer_input(shape = c(1), dtype = 'float32', name = 'LogVol')
 5
 6 BrEmb = VehBrand %>%
 7   layer_embedding(input_dim = 11, output_dim = d, input_length = 1, name = 'BrEmb') %>%
 8   layer_flatten(name = 'Br_flat')
 9
10 ReEmb = Region %>%
11   layer_embedding(input_dim = 22, output_dim = d, input_length = 1, name = 'ReEmb') %>%
12   layer_flatten(name = 'Re_flat')
13
14 Network = list(Design, BrEmb, ReEmb) %>% layer_concatenate(name = 'concate') %>%
15   layer_dense(units = 20, activation = 'tanh', name = 'hidden1') %>%
16   layer_dense(units = 15, activation = 'tanh', name = 'hidden2') %>%
17   layer_dense(units = 10, activation = 'tanh', name = 'hidden3') %>%
18   layer_dense(units = 1, activation = 'linear', name = 'Network',
19     weights = list(array(0, dim = c(10, 1)), array(log(lambda.hom), dim = c(1))))
20
21 Response = list(Network, LogVolGLM) %>% layer_add(name = 'Add') %>%
22   layer_dense(units = 1, activation = k_exp, name = 'Response', trainable = FALSE,
23     weights = list(array(1, dim = c(1, 1)), array(0, dim = c(1))))
24
25 model <- keras_model(inputs = c(Design, VehBrand, Region, LogVolGLM), outputs = c(Response))

We observe that this code has become much simpler and, in fact, the calibration in Keras also runs
faster. Line 4 of Listing 5 specifies the offset log(v_i^GLM) and lines 14-19 define the network
part λ^NN. This network is based on embedding layers for the categorical feature components
VehBrand and Region which may have (general) embedding dimension d. Note that in Listing 5 the
embedding weights for VehBrand and Region are trainable. If we choose embedding dimension
d = 1, initialize these embedding layers with the corresponding parts of β̂ and declare these two
embedding layers to be non-trainable, then we exactly recover Model CANN0 from the previous
section.
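Before the calibration of Listing 5 can be started, the offset log(v_i^GLM) of (3.6) has to be pre-computed from Model GLM2 and fed into the input LogVol. The following lines are a minimal sketch of this step and of the subsequent gradient descent calibration; the object d.glm2, the pre-processed input matrices XX.design, XX.brand, XX.region and the optimizer choice are assumptions, not taken from the tutorial code, and we assume that the offset log(Exposure) is part of the glm() formula so that predict(·, type = "response") returns the expected claim counts v_i exp⟨β̂, x_i⟩.

# working weights v_i^GLM of (3.6): expected claim counts under the fitted Model GLM2
learn$VolGLM <- predict(d.glm2, newdata = learn, type = "response")
learn$LogVolGLM <- log(learn$VolGLM)

# compile with the keras Poisson loss and fit with a 9:1 training/validation split
model %>% compile(optimizer = 'nadam', loss = 'poisson')
fit <- model %>% fit(
  x = list(Design = XX.design, VehBrand = XX.brand, Region = XX.region, LogVol = learn$LogVolGLM),
  y = learn$ClaimNb,
  epochs = 200, batch_size = 10000, validation_split = 0.1, verbose = 0)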

Remark. We would like to emphasize that (3.6) is by no means restricted to the GLM. In fact,
we can choose any regression model for the skip connection (volume adjustments using working
weights similar to (3.6)); for instance, we can replace the GLM prediction by a generalized
additive model (GAM) prediction in the working weight definition. This is exactly the idea
behind the Poisson boosting machine as it has been presented in Section 5.2 of our tutorial [10].

We choose two different versions of our CANN approach (3.5)-(3.6): the first one has embedding
dimension d = 1 and the second one embedding dimension d = 2 for the categorical feature
components VehBrand and Region, see also lines 7 and 11 in Listing 5. Both versions use Model
GLM2 in the skip connection (3.6); we call these two models CANN1 and CANN2. In Figure
9 we show the convergence statistics of the gradient descent algorithm where, again, we
split the learning data 9:1 into a training set (blue) and a validation set (red); the left-hand
side shows the calibration for embedding layers with d = 1 (Model CANN1) and the right-hand
side for d = 2 (Model CANN2).

Figure 9: performance of the gradient descent algorithm, the blue graphs show the training
losses and the red graphs the validation losses: (lhs) Model CANN1; (rhs) Model CANN2.

We observe over-fitting after roughly 200 gradient descent steps. Note that this is much faster
than in the network models of Figure 5, the reason being that the MLE of Model GLM2 provides
a reasonable initial value for the gradient descent algorithm.
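The in-sample and out-of-sample figures reported in Tables 3-6 below are average Poisson deviance losses in units of 10⁻². A minimal sketch of such a loss function (an assumption about the exact implementation, not taken from the tutorial code) reads as follows, where pred contains the fitted expected claim counts λ(x_i)v_i and obs the observed claim counts N_i.

# average Poisson deviance loss in units of 10^(-2); the R convention 0^0 = 1
# takes care of policies with obs = 0
Poisson.Deviance <- function(pred, obs) {
  200 * (sum(pred) - sum(obs) + sum(log((obs / pred)^obs))) / length(pred)
}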

                      epochs   run time   # param.   in-sample loss   out-of-sample loss   average frequency
homogeneous model                  0.1s          1         32.93518             33.86149              10.02%
Model GLM2                          17s         48         31.25674             32.14902              10.01%
Network Emb(d = 1)       700       419s        719         30.24464             31.50647               9.90%
Network Emb(d = 2)       600       365s        792         30.16513             31.45327               9.70%
CANN1 Emb(d = 1)         200       115s        719         30.39966             31.50136              10.02%
CANN2 Emb(d = 2)         200       117s        792         30.47557             31.56555              10.34%

Table 3: run time, number of model parameters, in-sample and out-of-sample losses (units are in
10⁻²) of Model GLM2, the (q1, q2, q3) = (20, 15, 10) networks with embedding layers of dimension
d = 1, 2 for the categorical feature components VehBrand and Region, and Models CANN1 and
CANN2 with embedding dimensions d = 1, 2, respectively.

The results are presented in Table 3. We observe that the performances of all network models
with embedding layers are comparable in terms of out-of-sample losses, which range from 31.45327
to 31.50647 (last 4 rows of Table 3). More remarkable is that the GLM2 skip connection adds some
stability to the average frequency (last column of Table 3); note that the empirically observed
frequency on the test data T is 10.41%.
It may seem a bit disappointing that the CANN approach does not lead to a clear improvement over
the classical network approach in terms of out-of-sample losses. The main issue in the current
set-up is that Model GLM2 is not sufficiently good for the CANN approach to benefit from a very
good initial model. In fact, we are penalized here for not having invested sufficient effort in
building a good GLM. However, the CANN approach will allow us to explicitly analyze the
weaknesses of Model GLM2. This is what we are going to do next.

3.4 Analyzing the GLM marginals
The aim of the subsequent sections is to analyze the modeling of the regression function λ(·)
in a more modular way. We therefore start with Model GLM2 given in Listing 2. This model
considers a log-linear structure
x ↦ log λ(x) = β_0 + ∑_{l=1}^{q_0} β_l x_l = ⟨β, x⟩,

with continuous feature components BonusMalus, Density, Area and DrivAge, categorized fea-
ture components VehPower and VehAge, and categorical feature components VehBrand, VehGas,
Region. The goal is to see whether the functional form used to model the continuous and the
categorized feature components is sufficiently good. We therefore change the modeling of one
feature component at a time, keeping the modeling of the other components fixed. For instance,
we choose VehPower and we consider
x ↦ log λ(x) = ⟨β, x⟩ + ⟨w^(2), z^(1)(VehPower)⟩,        (3.7)

where the last term reflects a network of depth 1 (having q1 hidden neurons) applied to the
feature component VehPower, only. Note that we use a slight abuse of notation in (3.7) because
VehPower enters feature x also as a categorical variable with 6 labels for the GLM approach.
For the explicit implementation of (3.7) we again use approach (3.6) with working weights.
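Such a marginal adjustment can be implemented analogously to Listing 5: a minimal sketch (an assumption, not the tutorial's own code) replaces the full design matrix by the single MinMax-scaled component under consideration and uses the working-weight offset log(v_i^GLM) of (3.6); the layer names below are hypothetical.

VehPowerNN <- layer_input(shape = c(1), dtype = 'float32', name = 'VehPowerNN')
LogVolGLM  <- layer_input(shape = c(1), dtype = 'float32', name = 'LogVolGLM')

# one hidden layer with q1 = 7 neurons; the zero initialization of the output layer
# lets the gradient descent algorithm start exactly in Model GLM2
MargNet = VehPowerNN %>%
  layer_dense(units = 7, activation = 'tanh', name = 'hidden1') %>%
  layer_dense(units = 1, activation = 'linear', name = 'MargNet',
              weights = list(array(0, dim = c(7, 1)), array(0, dim = c(1))))

Response = list(MargNet, LogVolGLM) %>% layer_add(name = 'Add') %>%
  layer_dense(units = 1, activation = k_exp, name = 'Response', trainable = FALSE,
              weights = list(array(1, dim = c(1, 1)), array(0, dim = c(1))))

model.marg <- keras_model(inputs = c(VehPowerNN, LogVolGLM), outputs = c(Response))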

                    epochs   run time   # train. param.   in-sample NN   out-of-sample NN   out-of-sample GAM
homogeneous model                                     1       32.93518           33.86149
Model GLM2                        17s                48       31.25674           32.14902
Area                   200        54s                22       31.25684           32.14768                   –
VehPower               200        54s                22       31.25626           32.14965            32.15306
VehAge                 200        54s                22       31.23750           32.12474            32.12724
DrivAge                200        54s                22       31.25681           32.14764            32.14138
BonusMalus             500       130s                22       31.19411           32.10286            32.09712
Density                200        54s                22       31.25679           32.14813            32.14945

Table 4: in-sample and out-of-sample losses (units are in 10⁻²) of Model GLM2, compared to
a marginal network adjustment according to (3.7).

In Table 4 we present the results, where we consider one continuous feature component in the
form (3.7) at a time, and where we choose a single hidden layer network with q1 = 7 hidden
neurons.⁸ From Table 4 we see that the marginal modeling in Model GLM2 is quite good; the
only two feature components that may be modeled in a better way are VehAge, i.e. the three age
classes [0, 1), [1, 10] and (10, ∞) should be refined, and BonusMalus, where a log-linear functional
form is not fully appropriate. We would like to highlight that the variable BonusMalus needs
more gradient descent steps, i.e. a later early stopping point. It seems that Model GLM2 sits
in a rather “strong saddle point” for the variable BonusMalus which is difficult to leave for the
gradient descent algorithm.

⁸ In many cases one hidden layer is sufficient for modeling one-dimensional functions; for multivariate functionals,
deep networks show better fitting performance because they can model interactions more easily.

In the last column of Table 4 we have added the out-of-sample losses obtained from
generalized additive model (GAM) predictions. GAMs are obtained by replacing the last term
in (3.7) by a natural cubic spline, that is, we set

x ↦ log λ(x) = ⟨β, x⟩ + ns₂(VehPower),        (3.8)

where the first term on the right-hand side is the part originating from Model GLM2 and
ns₂ : ℝ → ℝ denotes a natural cubic spline. For GAMs we refer to Wood [15], Ohlsson-
Johansson [11] and Chapter 3 in Wüthrich-Buser [16]. We would like to emphasize that this
GAM for the marginals can be fitted very efficiently, i.e. in less than 1s. However, efficient fitting
requires that we aggregate the data for each label (marginally we only have a few labels), using the
fact that aggregation leads to sufficient statistics under our Poisson model assumptions, see Section
3.1.2 in [16] and line 4 in Listing 6. The GAM fit is performed on lines 6-7 of Listing 6, where
VolGLM specifies the working weights v_i^GLM, see (3.6). The prediction can then (again) be done
on individual policies. Unfortunately, the GAM approach cannot be applied to the feature component
Area because it has only 6 different labels. The out-of-sample results of the neural network
approach and of the GAM approach are in line, which can be interpreted as a “proof of concept”
that these two methods work.

Listing 6: marginal GAM fitting of VehPower


1   library(plyr)
2   library(mgcv)
3   # data compression of the learning data set
4   learn.GAM <- ddply(learn, .(VehPower), summarize, VolGLM = sum(VolGLM), ClaimNb = sum(ClaimNb))
5   # GAM fitting
6   d.gam <- gam(ClaimNb ~ s(VehPower, bs = "cr"), data = learn.GAM, method = "GCV.Cp",
7                offset = log(VolGLM), family = poisson)
8   summary(d.gam)

In Figure 10 we provide the resulting marginal regression functions from approach (3.7) which
exactly correspond to the results of Table 4. These plots confirm our findings, namely, that the
modeling of VehAge in Model GLM2 can be improved (top-right), the log-linear assumption for
BonusMalus is not fully appropriate (bottom-middle), and the other (marginal) adjustments do
not lead to visible improvements.

Conclusion. The marginal modeling used in Model GLM2 can be (slightly) improved, but this
does not explain the big differences between Model GLM2 and the neural network models of
Table 3. Therefore, the major weakness of Model GLM2 compared to the neural network models
must come from missing interactions. Note that in Model GLM2 all interactions between the
feature components are of multiplicative type. This deficiency is going to be explored next.

Base Model GAM1. For all our subsequent derivations we enhance Model GLM2 by im-
proving the marginal modeling of the feature components VehAge and BonusMalus using a joint
GAM adjustment. That is, we consider the regression function

x ↦ log λ^GAM(x) = ⟨β, x⟩ + ns₂₁(VehAge) + ns₂₂(BonusMalus),        (3.9)

[Figure 10: six panels “GLM versus marginal NN” comparing the marginal frequencies of Model GLM2 (dots) with the marginal network adjustment as functions of Area, VehPower, VehAge, DrivAge, BonusMalus and Density.]

Figure 10: comparison of the marginals in Model GLM2 and the marginal network adjustment
according to (3.7).

where the first term on the right-hand side (scalar product) is the part originating from Model
GLM2, and ns₂₁ : ℝ → ℝ and ns₂₂ : ℝ → ℝ are two natural cubic splines enhancing Model
GLM2 by GAM features. We fit these two natural cubic splines simultaneously using the GAM
framework and we call this improvement Model GAM1. The corresponding code is given in
Listing 7. On line 4 we compress the data w.r.t. the two selected feature components VehAge
and BonusMalus. On lines 7-8 we fit the natural cubic splines for these two variables using the
logged working weights log(v_i^GLM) as offsets, see also (3.6).

Listing 7: Model GAM1 with VehAge and BonusMalus improvements


1   library(plyr)
2   library(mgcv)
3   # data compression of the learning data set
4   learn.GAM <- ddply(learn, .(VehAge, BonusMalus), summarize, VolGLM = sum(VolGLM),
5                      ClaimNb = sum(ClaimNb))
6   # Model GAM fitting
7   d.gam <- gam(ClaimNb ~ s(VehAge, bs = "cr") + s(BonusMalus, bs = "cr"), data = learn.GAM,
8                method = "GCV.Cp", offset = log(VolGLM), family = poisson)

In Table 5 we present the results. We see the expected improvement in out-of-sample loss
from 32.14902 (Model GLM2) to 32.07597 (Model GAM1). However, there is still a big gap
compared to the neural network approaches. Note that Model GAM1 is based on multiplicative
interactions that we are going to challenge next.

                      epochs   run time   # param.   in-sample loss   out-of-sample loss   average frequency
homogeneous model                  0.1s          1         32.93518             33.86149              10.02%
Model GLM2                          17s         48         31.25674             32.14902              10.01%
Model GAM1                           1s      63.2†         31.14450             32.07597              10.01%
Network Emb(d = 1)       700       419s        719         30.24464             31.50647               9.90%
Network Emb(d = 2)       600       365s        792         30.16513             31.45327               9.70%
CANN1 Emb(d = 1)         200       115s        719         30.39966             31.50136              10.02%
CANN2 Emb(d = 2)         200       117s        792         30.47557             31.56555              10.34%

Table 5: run time, number of model parameters, in-sample and out-of-sample losses (units
are in 10⁻²) of Models GLM2 and GAM1, the (q1, q2, q3) = (20, 15, 10) networks with embedding
layers of dimension d = 1, 2 for the categorical feature components VehBrand and Region, and
Models CANN1 and CANN2 with embedding dimensions d = 1, 2, respectively; † the number of
parameters for Model GAM1 considers the 48 GLM parameters plus the effective degrees of
freedom of the GAM splines being 6.6 + 8.6 = 15.2.

3.5 Analyzing missing pair-wise interactions


Completely analogously to (3.7) we may explore missing interactions in the models considered
above. As base model we choose Model GAM1 here, see Table 5. In analogy to (3.5)-(3.6), we
may analyze, say, a missing (non-multiplicative) interaction between DrivAge and BonusMalus.
We therefore define the following bivariate interaction (2IA) model

x ↦ log λ^2IA(x) = ⟨w^(4), (z^(3) ◦ z^(2) ◦ z^(1))(DrivAge, BonusMalus)⟩,        (3.10)

and we assume that

N_i ∼ Poi(λ^2IA(x_i) v_i^GAM)   (independently in i),   with working weights   v_i^GAM = v_i λ̂^GAM(x_i),

where x ↦ λ̂^GAM(x) is the regression function obtained from the base Model GAM1. Thus, the
2IA regression model (3.10) challenges the GAM-improved GLM2 model by allowing for pair-
wise interactions between DrivAge and BonusMalus. This can be interpreted as a boosting step.
For this boosting step we choose a neural network of depth K = 3 having (q1 , q2 , q3 ) = (20, 15, 10)
hidden neurons. Categorical feature components are modeled with two-dimensional embedding
layers (2.4). We fit these pair-wise boosting improvements over 1’000 gradient descent epochs
on batches of size 10’000 policies.
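For the implementation, the working weights v_i^GAM have to be pre-computed from the fitted Model GAM1 of Listing 7 on the individual policies. A minimal sketch of this step reads as follows; it assumes that the offset supplied via the offset argument of gam() is ignored by predict(), so that the link-scale prediction only contains the GAM intercept and the two fitted splines.

# working weights v_i^GAM = v_i * lambda-hat^GAM(x_i) on the individual policies
learn$VolGAM <- learn$VolGLM * exp(predict(d.gam, newdata = learn, type = "link"))
learn$LogVolGAM <- log(learn$VolGAM)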
The pair-wise results are illustrated in Figure 11. The rows provide the components Area,
VehPower, VehAge, DrivAge, BonusMalus, VehBrand, VehGas and Density (in this order), and
the columns provide the components VehPower, VehAge, DrivAge, BonusMalus, VehBrand,
VehGas, Density and Region (in this order). Black and blue graphs show the out-of-sample
losses over the 1'000 epochs, and the orange dotted line shows the out-of-sample loss of Model
GAM1. Note that the scale on the y-axis is the same in all plots. In blue color we identify
the pairs that lead to a major decrease in loss. These are the pairs (VehPower, VehAge),
(VehPower, VehBrand), (VehAge, VehBrand), (VehAge, VehGas) and (DrivAge, BonusMalus). Thus,
between these pairs we observe major (non-multiplicative) interactions that should be inte-
grated into the model. The advantage of approach (3.10) is that we do not need to specify the
explicit form of these (missing) interactions; this is in contrast to the GLM and GAM approaches.


Figure 11: exploring pair-wise interactions: out-of-sample losses over 1'000 gradient descent
epochs for all pairs of feature components; the orange dotted line shows the out-of-sample loss
of Model GAM1 (the scale on the y-axis is identical in all plots).

Interaction improved GAM1 model. This leads us to the following interaction improvements
of Model GAM1. We consider the regression function

x ↦ log λ^GAM+(x) = ⟨w_1^(4), (z_1^(3) ◦ z_1^(2) ◦ z_1^(1))(VehPower, VehAge, VehBrand, VehGas)⟩
                  + ⟨w_2^(4), (z_2^(3) ◦ z_2^(2) ◦ z_2^(1))(DrivAge, BonusMalus)⟩,        (3.11)

where we consider two parallel deep neural networks of depth K = 3 for the two component
vectors (VehPower, VehAge, VehBrand, VehGas) and (DrivAge, BonusMalus). Moreover, we set

N_i ∼ Poi(λ^GAM+(x_i) v_i^GAM)   (independently in i),   with working weights   v_i^GAM = v_i λ̂^GAM(x_i).


In (3.11) we define two parallel neural networks that only interact in the last step, where we
combine them by adding up the two terms. The reason for this choice is that in Figure 11 we
did not observe major interactions between the components of the two parallel networks.

Listing 8: R code of Model GAM+


d <- 2
Cont1 <- layer_input(shape = c(3), dtype = 'float32', name = 'Cont1')
VehBrand <- layer_input(shape = c(1), dtype = 'int32', name = 'Cat1')
Cont2 <- layer_input(shape = c(2), dtype = 'float32', name = 'Cont2')
LogVol <- layer_input(shape = c(1), dtype = 'float32', name = 'LogVol')
x.input <- c(Cont1, VehBrand, Cont2, LogVol)

BrEmb = VehBrand %>%
  layer_embedding(input_dim = 11, output_dim = d, input_length = 1, name = 'BrEmb') %>%
  layer_flatten(name = 'Br_flat')

Network1 = list(Cont1, BrEmb) %>% layer_concatenate(name = 'concate') %>%
  layer_dense(units = 20, activation = 'tanh', name = 'hidden1') %>%
  layer_dense(units = 15, activation = 'tanh', name = 'hidden2') %>%
  layer_dense(units = 10, activation = 'tanh', name = 'hidden3') %>%
  layer_dense(units = 1, activation = 'linear', name = 'Network1',
              weights = list(array(0, dim = c(10, 1)), array(0, dim = c(1))))

Network2 = Cont2 %>%
  layer_dense(units = 20, activation = 'tanh', name = 'hidden4') %>%
  layer_dense(units = 15, activation = 'tanh', name = 'hidden5') %>%
  layer_dense(units = 10, activation = 'tanh', name = 'hidden6') %>%
  layer_dense(units = 1, activation = 'linear', name = 'Network2',
              weights = list(array(0, dim = c(10, 1)), array(0, dim = c(1))))

Response = list(Network1, Network2, LogVol) %>% layer_add(name = 'Add') %>%
  layer_dense(units = 1, activation = k_exp, name = 'Response', trainable = FALSE,
              weights = list(array(1, dim = c(1, 1)), array(0, dim = c(1))))

model <- keras_model(inputs = x.input, outputs = c(Response))

We fit Model GAM+ given in (3.11) based on two parallel neural networks of depth K = 3,
both having (q1, q2, q3) = (20, 15, 10) hidden neurons. The R code is given in Listing 8, and we run
the gradient descent algorithm over 400 epochs on batches of size 10'000 policies; a minimal
sketch of this calibration step is given below.
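This sketch assumes pre-processed input matrices Cont1.X (VehPower, VehAge, VehGas), Br.X (the VehBrand labels) and Cont2.X (DrivAge, BonusMalus), as well as the offsets log(v_i^GAM) computed above; these names and the optimizer choice are assumptions, not taken from the tutorial code.

# compile with the keras Poisson loss and fit Model GAM+ with a 9:1 training/validation split
model %>% compile(optimizer = 'nadam', loss = 'poisson')
fit <- model %>% fit(
  x = list(Cont1 = Cont1.X, Cat1 = Br.X, Cont2 = Cont2.X, LogVol = learn$LogVolGAM),
  y = learn$ClaimNb,
  epochs = 400, batch_size = 10000, validation_split = 0.1, verbose = 0)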
The results are presented in Table 6. We observe excellent fitting results for Model GAM+
compared to the other neural network models. This illustrates that in (3.11) we capture the
main interaction terms. In fact, from the plot (DrivAge, BonusMalus) in Figure 11 we see that
the second interaction term in (3.11), based on the variables (DrivAge, BonusMalus), accounts
for a decrease in out-of-sample loss from 32.07597 (Model GAM1) to roughly 31.97. Therefore,
the first interaction term in (3.11) must account for the residual decrease in out-of-sample loss
from roughly 31.97 to 31.49574. This closes our example.

4 Conclusions
We have started our case study from a classical generalized linear model (GLM) for predicting
claims frequencies. Therefore, we have assumed a log-linear functional form for the regression
function, which leads to a multiplicative tariff structure in the feature components. Categorical
variables have been considered using dummy coding.
                      epochs   run time   # param.   in-sample loss   out-of-sample loss   average frequency
homogeneous model                  0.1s          1         32.93518             33.86149              10.02%
Model GLM2                          17s         48         31.25674             32.14902              10.01%
Model GAM1                           1s      63.2†         31.14450             32.07597              10.01%
Model GAM+               400       278s     1'174‡         30.54186             31.49574              10.33%
Network Emb(d = 1)       700       419s        719         30.24464             31.50647               9.90%
Network Emb(d = 2)       600       365s        792         30.16513             31.45327               9.70%
CANN1 Emb(d = 1)         200       115s        719         30.39966             31.50136              10.02%
CANN2 Emb(d = 2)         200       117s        792         30.47557             31.56555              10.34%

Table 6: run time, number of model parameters, in-sample and out-of-sample losses (units
are in 10⁻²) of Models GLM2, GAM1 and GAM+, the (q1, q2, q3) = (20, 15, 10) networks with
embedding layers of dimension d = 1, 2 for the categorical feature components VehBrand and
Region, and Models CANN1 and CANN2 with embedding dimensions d = 1, 2, respectively;
† the number of parameters for Model GAM1 considers the 48 GLM parameters plus the effective
degrees of freedom of the GAM splines being 6.6 + 8.6 = 15.2; ‡ only accounts for the network
parameters in (3.11) and not for the parameters which have been used to obtain the working
weights v_i^GAM from Model GAM1.



In a second step, this GLM has been enhanced by improving the log-linear functional forms in
the regression function, if necessary. This has been done within the framework of generalized
additive models (GAMs) using natural cubic splines.
In a third step, the resulting model of the previous two steps has been embedded (nested) into a
bigger model which additionally considers neural network features. This embedding results in the
combined actuarial neural network (CANN) approach. Thereby the neural network architecture
is used to boost the GAM-improved GLM.
During this third step, we have also learned that embedding layers can lead to a more efficient
treatment of categorical variables compared to dummy coding and one-hot encoding.
In the last step, we have used the CANN approach to systematically find missing interactions
in the GAM-improved GLM regression function. These missing interactions are of non-
multiplicative type because the GAM-GLM approach considers multiplicative interactions in
the regression function. In our analysis we have been able to explicitly find all these missing
interactions. These steps lead to a systematic use of neural networks; in particular, they are
used to identify weaknesses of existing regression models.

References
[1] Bengio, Y., Schwenk, H., Senécal, J.-S., Morin, F., Gauvain, J.-L. (2006). Neural probabilistic
language models. In: Innovations in Machine Learning. Studies in Fuzziness and Soft Computing,
Vol. 194. Springer, 137-186.
[2] CASdatasets Package Vignette (2016). Reference Manual, May 28, 2016. Version 1.0-6. Available
from http://cas.uqam.ca.
[3] Charpentier, A. (2015). Computational Actuarial Science with R. CRC Press.

[4] Ferrario, A., Noll, A., Wüthrich, M.V. (2018). Insights from inside neural networks. SSRN
Manuscript ID 3226852. Version November 14, 2018.
[5] Gabrielli, A., Richman, R., Wüthrich, M.V. (2018). Neural network embedding of the over-dispersed
Poisson reserving model. SSRN Manuscript ID 3288454. Version November 21, 2018.
[6] Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep Learning. MIT Press.
http://www.deeplearningbook.org
[7] He, K., Zhang, X., Ren, S., Sun, J. (2015). Deep residual learning for image recognition. CoRR,
abs/1512.03385.
[8] Huang, G., Liu, Z., Weinberger, K.Q. (2016). Densely connected convolutional networks. CoRR,
abs/1608.06993.
[9] Marra, G., Wood, S.N., (2011). Practical variable selection for generalized additive models. Com-
putational Statistics and Data Analysis 55, 2372-2387.
[10] Noll, A., Salzmann, R., Wüthrich, M.V. (2018). Case study: French motor third-party liability
claims. SSRN Manuscript ID 3164764. Version November 8, 2018.
[11] Ohlsson, E., Johansson, B. (2010). Non-Life Insurance Pricing with Generalized Linear Models.
Springer.
[12] Richman, R. (2018). AI in actuarial science. SSRN Manuscript, ID 3218082, Version August 20,
2018.
[13] Richman, R., Wüthrich, M.V. (2018). A neural network extension of the Lee-Carter model to
multiple populations. SSRN Manuscript, ID 3270877, Version October 22, 2018.
[14] Sahlgren, M. (2015). A brief history of word embeddings.
https://www.linkedin.com/pulse/brief-history-word-embeddings-some-clarifications-magnus-sahlgren/
[15] Wood, S.N. (2017). Generalized Additive Models: An Introduction with R. 2nd edition. Chapman
and Hall/CRC.
[16] Wüthrich, M.V., Buser, C. (2016). Data Analytics for Non-Life Insurance Pricing. SSRN
Manuscript ID 2870308. Version October 24, 2017.
[17] Wüthrich, M.V., Merz, M. (2019). Editorial: Yes, we CANN! ASTIN Bulletin 49/1.

Listing 9: Model CANN1 architecture
 1   ContGLM <- layer_input(shape = c(4), dtype = 'float32', name = 'ContGLM')
 2   VehPowerGLM <- layer_input(shape = c(1), dtype = 'int32', name = 'VehPowerGLM')
 3   VehAgeGLM <- layer_input(shape = c(1), dtype = 'int32', name = 'VehAgeGLM')
 4   VehBrand <- layer_input(shape = c(1), dtype = 'int32', name = 'VehBrand')
 5   Region <- layer_input(shape = c(1), dtype = 'int32', name = 'Region')
 6   DrivAgeGLM <- layer_input(shape = c(5), dtype = 'float32', name = 'DrivAgeGLM')
 7   ContNN <- layer_input(shape = c(3), dtype = 'float32', name = 'ContNN')
 8   LogExposure <- layer_input(shape = c(1), dtype = 'float32', name = 'LogExposure')
 9   x.input <- c(ContGLM, DrivAgeGLM, VehPowerGLM, VehAgeGLM, VehBrand, Region, ContNN, LogExposure)
10   #
11   VehPowerGLM_embed = VehPowerGLM %>%
12     layer_embedding(input_dim = length(beta.VehPower), output_dim = 1, trainable = FALSE,
13                     input_length = 1, name = 'VehPowerGLM_embed',
14                     weights = list(array(beta.VehPower, dim = c(length(beta.VehPower), 1)))) %>%
15     layer_flatten(name = 'VehPowerGLM_flat')
16
17   VehAgeGLM_embed = VehAgeGLM %>%
18     layer_embedding(input_dim = length(beta.VehAge), output_dim = 1, trainable = FALSE,
19                     input_length = 1, name = 'VehAgeGLM_embed',
20                     weights = list(array(beta.VehAge, dim = c(length(beta.VehAge), 1)))) %>%
21     layer_flatten(name = 'VehAgeGLM_flat')
22
23   VehBrand_embed = VehBrand %>%
24     layer_embedding(input_dim = length(beta.VehBrand), output_dim = 1, trainable = FALSE,
25                     input_length = 1, name = 'VehBrand_embed',
26                     weights = list(array(beta.VehBrand, dim = c(length(beta.VehBrand), 1)))) %>%
27     layer_flatten(name = 'VehBrand_flat')
28
29   Region_embed = Region %>%
30     layer_embedding(input_dim = length(beta.Region), output_dim = 1, trainable = FALSE,
31                     input_length = 1, name = 'Region_embed',
32                     weights = list(array(beta.Region, dim = c(length(beta.Region), 1)))) %>%
33     layer_flatten(name = 'Region_flat')
34   #
35   GLMNetwork = list(ContGLM, DrivAgeGLM, VehPowerGLM_embed, VehAgeGLM_embed,
36                     VehBrand_embed, Region_embed) %>% layer_concatenate() %>%
37     layer_dense(units = 1, activation = 'linear', name = 'GLMNetwork', trainable = FALSE,
38                 weights = list(array(c(beta.continuous, beta.DrivAge, rep(1, 4)), dim = c(13, 1)),
39                                array(beta.0, dim = c(1))))
40   #
41   NNetwork = list(ContGLM, ContNN, VehBrand_embed, Region_embed) %>% layer_concatenate() %>%
42     layer_dense(units = 20, activation = 'tanh', name = 'hidden1') %>%
43     layer_dense(units = 15, activation = 'tanh', name = 'hidden2') %>%
44     layer_dense(units = 10, activation = 'tanh', name = 'hidden3') %>%
45     layer_dense(units = 1, activation = 'linear', name = 'NNetwork',
46                 weights = list(array(0, dim = c(10, 1)), array(0, dim = c(1))))
47   #
48   CANNoutput = list(GLMNetwork, NNetwork, LogExposure) %>% layer_add() %>%
49     layer_dense(units = 1, activation = k_exp, name = 'CANNoutput', trainable = FALSE,
50                 weights = list(array(c(1), dim = c(1, 1)), array(0, dim = c(1))))
51   #
52   model <- keras_model(inputs = x.input, outputs = c(CANNoutput))
