
Generalized Linear Models

Mario V. Wüthrich
RiskLab, ETH Zurich

“Deep Learning with Actuarial Applications in R”


Swiss Association of Actuaries SAA/SAV, Zurich
October 14/15, 2021
Programme SAV Block Course

• Refresher: Generalized Linear Models (THU 9:00-10:30)

• Feed-Forward Neural Networks (THU 13:00-15:00)

• Discrimination-Free Insurance Pricing (THU 17:15-17:45)

• LocalGLMnet (FRI 9:00-10:30)

• Convolutional Neural Networks (FRI 13:00-14:30)

• Wrap Up (FRI 16:00-16:30)

1
Contents: Generalized Linear Models

• Starting with data

• Exponential dispersion family (EDF)

• Generalized linear models (GLMs)

• Maximum likelihood estimation (MLE)

• Canonical link and the balance property

• Covariate pre-processing / feature engineering

• Parameter selection

2
• Starting with Data

3
Car Insurance Claims Frequency Data

'data.frame': 678013 obs. of 12 variables:
 $ IDpol     : num 1 3 5 10 11 13 15 17 18 21 ...
 $ ClaimNb   : num 1 1 1 1 1 1 1 1 1 1 ...
 $ Exposure  : num 0.1 0.77 0.75 0.09 0.84 0.52 0.45 0.27 0.71 0.15 ...
 $ Area      : Factor w/ 6 levels "A","B","C","D",..: 4 4 2 2 2 5 5 3 3 2 ...
 $ VehPower  : int 5 5 6 7 7 6 6 7 7 7 ...
 $ VehAge    : int 0 0 2 0 0 2 2 0 0 0 ...
 $ DrivAge   : int 55 55 52 46 46 38 38 33 33 41 ...
 $ BonusMalus: int 50 50 50 50 50 50 50 68 68 50 ...
 $ VehBrand  : Factor w/ 11 levels "B1","B10","B11",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ VehGas    : Factor w/ 2 levels "Diesel","Regular": 2 2 1 1 1 2 2 1 1 1 ...
 $ Density   : int 1217 1217 54 76 76 3003 3003 137 137 60 ...
 $ Region    : Factor w/ 22 levels "R11","R21","R22",..: 18 18 3 15 15 8 8 20 20 12 ...

• 3 categorical covariates, 1 binary covariate and 5 continuous covariates

• Goal: Find systematic effects to explain/predict claim counts ClaimNb.

4
Exposures and Claims

[Figures: histogram of the exposures (678013 policies), boxplot of the exposures (678013 policies), and histogram of the claim numbers; x-axes: exposures and number of claims.]

• Most exposures are between 0 and 1 year.

• Exposures bigger than 1 are considered to be a data error and are capped at 1.

• Most insurance policies do not suffer any claim (class imbalance problem).

5
Continuous Covariates: Age of Driver

[Figures: total volumes (exposure) per driver's age group and observed frequency per driver's age group, for ages 18-88.]

• Systematic effects of continuous covariates are not necessarily monotone.

6
Categorical Covariates: French Region

[Figures: total volumes (exposure) and observed frequencies per regional group R11-R93; MTPL portfolio vs. French population.]

7
Covariates: Dependence

[Figures: boxplot of BonusMalus vs. DrivAge and boxplot of log-Density vs. Area code A-F.]

• These covariates show strong dependence/collinearity.

8
Goal: Regression Modeling

• Denote by $x_i$ the covariates of insurance policy $1 \le i \le n$.

• Goal: Find a regression function $\mu$,

$$ x_i \mapsto \mu(x_i), $$

such that for all insurance policies $1 \le i \le n$ we have

$$ \mathbb{E}[N_i] = \mu(x_i)\, v_i, $$

where $N_i$ denotes the number of claims and $v_i > 0$ is the time exposure of insurance policy $1 \le i \le n$ (pro-rata temporis).

• $\mu$ extracts the systematic effects from the information $x_i$ to explain $N_i$.

9
• Exponential Dispersion Family (EDF)

10
Exponential Dispersion Family (EDF)

• Sir Fisher (1934), Barndorff-Nielsen (2014), Jørgensen (1986, 1987).

• Exponential dispersion family (EDF) gives a unified notational framework of a


large family of distribution functions.

• The parametrization of this family is chosen such that it is particularly suitable for
maximum likelihood estimation (MLE).

• The EDF is the base statistical model for generalized linear modeling (GLM) and
for neural network regressions.

• Examples: Gaussian, Poisson, gamma, binomial, categorical, Tweedie’s, inverse


Gaussian models.

• Remark: This first chapter on GLMs gives us the basic understanding and tools
for neural network regression modeling.
11
Exponential Dispersion Family (EDF)

• Assume $(Y_i)_i$ are independent with density

$$ Y_i \sim f(y; \theta_i, v_i/\varphi) = \exp\left( \frac{y\theta_i - \kappa(\theta_i)}{\varphi/v_i} + a(y; v_i/\varphi) \right), $$

with

  $v_i > 0$ — (known) exposure of risk $i$,
  $\varphi > 0$ — dispersion parameter,
  $\theta_i \in \Theta$ — canonical parameter of risk $i$ in the effective domain $\Theta$,
  $\kappa: \Theta \to \mathbb{R}$ — cumulant function (type of distribution),
  $a(\cdot\,;\cdot)$ — normalization, not depending on the canonical parameter $\theta_i$.

12
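The course code is in R; as a language-agnostic illustration (not part of the course material), the following Python sketch verifies that the Poisson distribution is a member of the EDF: with cumulant $\kappa(\theta) = \exp(\theta)$, $v = \varphi = 1$ and normalization $a(y) = -\log(y!)$, the EDF density reduces to the familiar Poisson probability mass function.

```python
import math

def edf_poisson(y, theta):
    # EDF density with kappa(theta) = exp(theta), v = phi = 1,
    # and normalization a(y) = -log(y!)
    kappa = math.exp(theta)
    a = -math.lgamma(y + 1)  # = log(1 / y!)
    return math.exp(y * theta - kappa + a)

def poisson_pmf(y, mu):
    # classical Poisson probability mass function with mean mu
    return math.exp(-mu) * mu**y / math.factorial(y)

theta = 0.7                  # canonical parameter, so mu = exp(theta)
mu = math.exp(theta)
for y in range(6):
    assert abs(edf_poisson(y, theta) - poisson_pmf(y, mu)) < 1e-12
```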
Cumulant Function

• Assume $(Y_i)_i$ are independent with density

$$ Y_i \sim f(y; \theta_i, v_i/\varphi) = \exp\left( \frac{y\theta_i - \kappa(\theta_i)}{\varphi/v_i} + a(y; v_i/\varphi) \right). $$

• The cumulant function $\kappa : \Theta \to \mathbb{R}$ is convex and smooth in the interior of $\Theta$.

• Examples:

$$ \kappa(\theta) = \begin{cases} \theta^2/2 & \text{Gauss}, \\ \exp(\theta) & \text{Poisson}, \\ -\log(-\theta) & \text{gamma}, \\ \log(1 + e^\theta) & \text{Bernoulli/binomial}, \\ -(-2\theta)^{1/2} & \text{inverse Gaussian}, \\ \big((1-p)\theta\big)^{\frac{2-p}{1-p}} / (2-p) & \text{Tweedie with } p > 1,\ p \neq 2. \end{cases} $$
13
Mean and Variance Function

• The mean is given by

$$ \mu_i = \mathbb{E}[Y_i] = \kappa'(\theta_i). $$

• The variance is given by

$$ \mathrm{Var}(Y_i) = \frac{\varphi}{v_i}\, \kappa''(\theta_i) = \frac{\varphi}{v_i}\, V(\mu_i) > 0, $$

where $\mu \mapsto V(\mu) = \kappa''\big((\kappa')^{-1}(\mu)\big)$ is the so-called variance function.

• Examples:

$$ V(\mu) = \begin{cases} 1 & \text{Gauss}, \\ \mu & \text{Poisson}, \\ \mu^2 & \text{gamma}, \\ \mu^3 & \text{inverse Gaussian}, \\ \mu^p & \text{Tweedie with } p \ge 1. \end{cases} $$

14
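A small numerical sanity check (Python sketch, illustrative only): differentiating the Poisson cumulant $\kappa(\theta) = \exp(\theta)$ by finite differences recovers the mean $\mu = \kappa'(\theta)$ and the Poisson variance function $V(\mu) = \kappa''(\theta) = \mu$.

```python
import math

def kappa(theta):
    return math.exp(theta)  # Poisson cumulant function

h = 1e-5
theta = 0.3
# central finite differences for kappa' and kappa''
mu = (kappa(theta + h) - kappa(theta - h)) / (2 * h)                 # ~ kappa'(theta)
V = (kappa(theta + h) - 2 * kappa(theta) + kappa(theta - h)) / h**2  # ~ kappa''(theta)

assert abs(mu - math.exp(theta)) < 1e-6  # mean: kappa'(theta) = exp(theta)
assert abs(V - mu) < 1e-4                # Poisson variance function: V(mu) = mu
```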
Maximum Likelihood Estimation (MLE)

• MLE in the homogeneous $\theta$ case: the log-likelihood of independent observations $(Y_i)_{i=1}^n$ is

$$ \ell_Y(\theta) = \log \prod_{i=1}^n f(Y_i; \theta, v_i/\varphi) = \sum_{i=1}^n \frac{Y_i\theta - \kappa(\theta)}{\varphi/v_i} + a(Y_i; v_i/\varphi). $$

• This provides the score equation

$$ \frac{\partial}{\partial\theta}\, \ell_Y(\theta) = \sum_{i=1}^n \frac{v_i}{\varphi}\left[ Y_i - \kappa'(\theta) \right] = 0, $$

and MLE $\widehat{\theta}$

$$ \widehat{\theta} = (\kappa')^{-1}\left( \frac{\sum_{i=1}^n v_i Y_i}{\sum_{i=1}^n v_i} \right). $$

• MLE is straightforward within the EDF!


15
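The homogeneous MLE has this closed form for any EDF member; a Python sketch on made-up claim frequency data (illustrative values only) for the Poisson case, where $(\kappa')^{-1} = \log$:

```python
import math

# made-up observations: Y_i are claim frequencies, v_i exposures
Y = [0.0, 2.0, 0.0, 1.25, 0.0]
v = [1.0, 0.5, 0.8, 0.8, 0.9]

# MLE of the canonical parameter: theta_hat = (kappa')^{-1}(weighted mean),
# i.e. theta_hat = log(sum v_i Y_i / sum v_i) in the Poisson case
weighted_mean = sum(vi * yi for vi, yi in zip(v, Y)) / sum(v)
theta_hat = math.log(weighted_mean)

# the fitted mean kappa'(theta_hat) = exp(theta_hat) reproduces the
# exposure-weighted average, so fitted total = observed total
fitted_total = sum(v) * math.exp(theta_hat)
observed_total = sum(vi * yi for vi, yi in zip(v, Y))
assert abs(fitted_total - observed_total) < 1e-12
```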
Canonical Link and Unbiasedness

• Canonical link $h(\cdot) = (\kappa')^{-1}(\cdot)$:

$$ \mu = \mathbb{E}[Y] = \kappa'(\theta) \qquad \text{or} \qquad h(\mu) = h(\mathbb{E}[Y]) = \theta. $$

• This provides for the MLE

$$ \widehat{\theta} = (\kappa')^{-1}\left( \frac{\sum_{i=1}^n v_i Y_i}{\sum_{i=1}^n v_i} \right) = h\left( \frac{\sum_{i=1}^n v_i Y_i}{\sum_{i=1}^n v_i} \right). $$

The latter gives a sufficient statistic.

• Unbiasedness of the estimated means in the homogeneous case:

$$ \mathbb{E}\big[\widehat{\mathbb{E}}[Y]\big] = \mathbb{E}\big[\kappa'(\widehat{\theta}\,)\big] = \kappa'(\theta). $$

⇒ Unbiasedness emphasizes that we receive the right price level in pricing.

16
• Generalized Linear Models (GLMs)

17
Generalized Linear Models (GLMs)

• Nelder–Wedderburn (1972) and McCullagh–Nelder (1983).

• Assume we have heterogeneity between $(Y_i)_{i=1}^n$ which manifests in systematic effects modeled through covariates/features $x_i \in \mathbb{R}^q$.

• Assume for link function choice $g$ and regression parameter $\beta \in \mathbb{R}^{q+1}$

$$ x_i \mapsto g(\mu_i) = g(\mathbb{E}[Y_i]) = g(\kappa'(\theta_i)) = \beta_0 + \sum_{j=1}^q \beta_j x_{i,j}. $$

This gives a GLM with link function $g$. Parameter $\beta_0$ is called intercept/bias.

• Link $g$ should be monotone and smooth.

• The choice $g = h = (\kappa')^{-1}$ is called the canonical link.

18
Design Matrix

• Assume for link function choice $g$ and regression parameter $\beta \in \mathbb{R}^{q+1}$

$$ x_i \mapsto g(\mu_i) = g(\mathbb{E}[Y_i]) = \langle \beta, x_i \rangle = \beta_0 + \sum_{j=1}^q \beta_j x_{i,j}. $$

• The design matrix is

$$ X = (x_1, \ldots, x_n)^\top = \begin{pmatrix} 1 & x_{1,1} & \cdots & x_{1,q} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n,1} & \cdots & x_{n,q} \end{pmatrix} \in \mathbb{R}^{n\times(q+1)}. $$

• The design matrix $X$ is assumed to have full rank $q+1 \le n$.

• The full rank property is important for uniqueness of the MLE of $\beta$.

19
Maximum Likelihood Estimation of GLMs

• The log-likelihood of independent observations $(Y_i)_{i=1}^n$ is given by

$$ \beta \mapsto \ell_Y(\beta) = \sum_{i=1}^n \frac{Y_i h(\mu_i) - \kappa(h(\mu_i))}{\varphi/v_i} + a(Y_i; v_i/\varphi), $$

with mean $\mu_i = \mu_i(\beta) = g^{-1}\langle \beta, x_i \rangle$ and canonical parameter $\theta_i = h(\mu_i)$.

• This provides the score equations for the MLE

$$ \nabla_\beta\, \ell_Y(\beta) = 0. $$

• The score equations are solved numerically with Fisher's scoring method or the iteratively re-weighted least squares (IRLS) algorithm.

20
MLE and Deviance Loss Functions

• The log-likelihood of independent observations $(Y_i)_{i=1}^n$ is given by

$$ \ell_Y(\beta) = \sum_{i=1}^n \frac{Y_i h(\mu_i) - \kappa(h(\mu_i))}{\varphi/v_i} + a(Y_i; v_i/\varphi), $$

with mean $\mu_i = \mu_i(\beta) = g^{-1}\langle \beta, x_i \rangle$.

• Maximizing log-likelihoods is equivalent to minimizing deviance losses

$$ D^*(Y, \beta) = 2\left[ \ell_Y(Y) - \ell_Y(\beta) \right] = \sum_{i=1}^n 2\,\frac{v_i}{\varphi}\left[ Y_i h(Y_i) - \kappa(h(Y_i)) - Y_i h(\mu_i) + \kappa(h(\mu_i)) \right] \ \ge\ 0. $$

• The deviance loss of the Gaussian model is the square loss function; other examples of the EDF have deviance losses different from square losses.
21
Examples of Deviance Loss Functions
• Gaussian case:

$$ D^*(Y, \beta) = \sum_{i=1}^n \frac{v_i}{\varphi}\,(Y_i - \mu_i)^2 \ \ge\ 0. $$

• Gamma case:

$$ D^*(Y, \beta) = \sum_{i=1}^n 2\,\frac{v_i}{\varphi}\left[ \frac{Y_i}{\mu_i} - 1 + \log\frac{\mu_i}{Y_i} \right] \ \ge\ 0. $$

• Inverse Gaussian case:

$$ D^*(Y, \beta) = \sum_{i=1}^n \frac{v_i}{\varphi}\, \frac{(Y_i - \mu_i)^2}{\mu_i^2\, Y_i} \ \ge\ 0. $$

• Poisson case:

$$ D^*(Y, \beta) = \sum_{i=1}^n 2\,\frac{v_i}{\varphi}\left[ \mu_i - Y_i - Y_i \log\frac{\mu_i}{Y_i} \right] \ \ge\ 0. $$
22
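The Poisson deviance loss above can be sketched in a few lines of Python (illustrative only; the usual convention $0 \cdot \log 0 = 0$ handles observations $Y_i = 0$):

```python
import math

def poisson_deviance(Y, mu, v, phi=1.0):
    """Poisson deviance: sum_i 2 (v_i/phi) [mu_i - Y_i - Y_i log(mu_i / Y_i)]."""
    total = 0.0
    for yi, mi, vi in zip(Y, mu, v):
        term = mi - yi
        if yi > 0:
            term -= yi * math.log(mi / yi)  # convention: 0 * log(0) = 0
        total += 2.0 * vi / phi * term
    return total

Y = [0.0, 1.0, 2.0]
v = [1.0, 1.0, 1.0]
assert poisson_deviance(Y, Y, v) == 0.0              # zero loss at the saturated model
assert poisson_deviance(Y, [1.0, 1.0, 1.0], v) > 0.0  # any other mean gives positive loss
```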
Balance Property under Canonical Link

• Under the canonical link $g = h = (\kappa')^{-1}$ we have the balance property for the MLE:

$$ \sum_{i=1}^n v_i\, \widehat{\mathbb{E}}[Y_i] = \sum_{i=1}^n v_i\, \kappa'\langle \widehat{\beta}, x_i \rangle = \sum_{i=1}^n v_i Y_i. $$

⇒ The estimated model mean over the entire portfolio is unbiased.

• If one does not work with the canonical link, one should correct the bias in $\widehat{\beta}_0$.

23
• Feature Engineering / Covariate Pre-Processing

24
Feature Engineering
• Assume a monotone link function choice $g$:

$$ x_i \mapsto \mu_i = \mathbb{E}[Y_i] = g^{-1}\langle \beta, x_i \rangle = g^{-1}\left( \beta_0 + \sum_{j=1}^q \beta_j x_{i,j} \right). $$

[Figures: observed frequency per car brand group and observed frequency per driver's age group (ages 18-88).]

• What about categorical covariates and non-monotone covariates?

• What about different interactions?


25
One-Hot Encoding of Categorical Covariates

B1 7→ e1 = 1 0 0 0 0 0 0 0 0 0 0
B10 7→ e2 = 0 1 0 0 0 0 0 0 0 0 0
B11 7→ e3 = 0 0 1 0 0 0 0 0 0 0 0
B12 7→ e4 = 0 0 0 1 0 0 0 0 0 0 0
B13 7→ e5 = 0 0 0 0 1 0 0 0 0 0 0
B14 7→ e6 = 0 0 0 0 0 1 0 0 0 0 0
B2 7→ e7 = 0 0 0 0 0 0 1 0 0 0 0
B3 7→ e8 = 0 0 0 0 0 0 0 1 0 0 0
B4 7→ e9 = 0 0 0 0 0 0 0 0 1 0 0
B5 7→ e10 = 0 0 0 0 0 0 0 0 0 1 0
B6 7→ e11 = 0 0 0 0 0 0 0 0 0 0 1

• One-hot encoding for the 11 car brands: brand $\mapsto e_j \in \mathbb{R}^{11}$.

• One-hot encoding does not lead to full rank design matrices X, because we have
a redundancy.

26
Dummy Coding of Categorical Covariates

B1 0 0 0 0 0 0 0 0 0 0
B10 1 0 0 0 0 0 0 0 0 0
B11 0 1 0 0 0 0 0 0 0 0
B12 0 0 1 0 0 0 0 0 0 0
B13 0 0 0 1 0 0 0 0 0 0
B14 0 0 0 0 1 0 0 0 0 0
B2 0 0 0 0 0 1 0 0 0 0
B3 0 0 0 0 0 0 1 0 0 0
B4 0 0 0 0 0 0 0 1 0 0
B5 0 0 0 0 0 0 0 0 1 0
B6 0 0 0 0 0 0 0 0 0 1

• Declare one label as reference level and drop the corresponding column.

• Dummy coding for the 11 car brands: brand $\mapsto x_j \in \mathbb{R}^{10}$.

• Dummy coding leads to full rank design matrices X.

• There are other full rank codings like Helmert’s contrast coding.
27
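In R, dummy coding is produced automatically by `model.matrix()` for factors; as a transparent illustration, the mapping above can be sketched in Python (level names follow the slide):

```python
def dummy_code(level, levels, reference):
    """Map a categorical level to a dummy vector of length len(levels) - 1,
    dropping the column that corresponds to the reference level."""
    non_ref = [l for l in levels if l != reference]
    return [1 if level == l else 0 for l in non_ref]

brands = ["B1", "B10", "B11", "B12", "B13", "B14", "B2", "B3", "B4", "B5", "B6"]

assert dummy_code("B1", brands, "B1") == [0] * 10       # reference level: all zeros
assert dummy_code("B10", brands, "B1") == [1] + [0] * 9
assert sum(dummy_code("B6", brands, "B1")) == 1         # every other level: exactly one 1
```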
Pre-Processing of Continuous Covariates (1/2)
[Figure: observed log-frequency per age of driver (18-88), with the age classes 1: 18-20, 2: 21-25, 3: 26-30, 4: 31-40, 5: 41-50, 6: 51-60, 7: 61-70, 8: 71-90.]

• Continuous features need feature engineering, too, to bring them into the right functional form for a GLM. Assume we have the log-link for $g$:

$$ x \mapsto \log\left(\mathbb{E}[Y]\right) = \langle \beta, x \rangle = \beta_0 + \sum_{j=1}^q \beta_j x_j. $$

• We build homogeneous categorical classes, and then apply dummy coding.


28
Pre-Processing of Continuous Covariates (2/2)
• Categorical coding of continuous covariates has some disadvantages.

• By changing continuous features to categorical dummies we lose adjacency


relationships between neighboring classes.

• The number of parameters can grow very large if we have many classes.

• Balance property holds true on every categorical level.


Caution: if we have very rare categorical levels this will lead to over-fitting; and it
will also lead to high correlations with the intercept β0.

• One may also consider other functional forms for continuous covariates, e.g.,

$$ \text{age} \mapsto \beta_1\,\text{age} + \beta_2\,\text{age}^2 + \beta_3 \log(\text{age}). $$

• Similarly, we can model interactions between covariate components:

$$ (\text{age}, \text{weight}) \mapsto \beta_1\,\text{age} + \beta_2\,\text{weight} + \beta_3\,\text{age}/\text{weight}. $$

29
• Variable Selection

30
Variable Selection: Likelihood Ratio Test (LRT)

• Null hypothesis $H_0$: $\beta_1 = \ldots = \beta_p = 0$ for given $1 \le p \le q$.

• Likelihood ratio test (LRT). Calculate the test statistic (nested models)

$$ \chi^2_Y = D^*(Y, \widehat{\beta}_{H_0}) - D^*(Y, \widehat{\beta}_{\text{full}}) \ \ge\ 0. $$

Under $H_0$, the test statistic $\chi^2_Y$ is approximately $\chi^2$-distributed with $p$ degrees of freedom.

31
Variable Selection: Wald Test

• Null hypothesis $H_0$: $\beta_p = (\beta_1, \ldots, \beta_p)^\top = 0$ for given $1 \le p \le q$.

• Wald test. Choose the matrix $I_p$ such that $I_p \beta_{\text{full}} = \beta_p$. Consider the Wald statistic

$$ W = (I_p \widehat{\beta}_{\text{full}} - 0)^\top \left( I_p\, \mathcal{I}(\widehat{\beta}_{\text{full}})^{-1} I_p^\top \right)^{-1} (I_p \widehat{\beta}_{\text{full}} - 0). $$

Under $H_0$, the test statistic $W$ is approximately $\chi^2$-distributed with $p$ degrees of freedom.

• $\mathcal{I}(\widehat{\beta}_{\text{full}})$ is Fisher's information matrix; the above test is based on asymptotic normality of the MLE $\widehat{\beta}_{\text{full}}$.

• The model only needs to be fitted once.

32
Model Selection: AIC

• Akaike's information criterion (AIC) is useful for non-nested models:

$$ \mathrm{AIC} = -2\,\ell_Y(\widehat{\beta}\,) + 2\dim(\widehat{\beta}\,). $$

• Models do not need to be nested.

• Models can have different distributions.

• AIC considers all terms of the log-likelihood (also normalizing constants).

• Models need to be estimated with MLE.

• Different models need to consider the same data on the same scale (e.g., log-normal vs. gamma).

33
Example: Poisson Frequency GLM
Call:
glm(formula = claims ~ powerCAT + area + log(dens) + gas + ageCAT +
    acCAT + brand + ct, family = poisson(), data = dat, offset =

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.1373  -0.3820  -0.2838  -0.1624   4.3856

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.903e+00  4.699e-02 -40.509  < 2e-16 ***
powerCAT2    2.681e-01  2.121e-02  12.637  < 2e-16 ***
.
.
powerCAT9   -1.044e-01  4.708e-02  -2.218 0.026564 *
area         4.333e-02  1.927e-02   2.248 0.024561 *
log(dens)    3.224e-02  1.432e-02   2.251 0.024385 *
gasRegular   6.868e-02  1.339e-02   5.129 2.92e-07 ***
.
.
ctZG        -8.123e-02  4.638e-02  -1.751 0.079900 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 145532  on 499999  degrees of freedom
Residual deviance: 140641  on 499943  degrees of freedom
AIC: 191132
35
Forward Parameter Selection: ANOVA

Analysis of Deviance Table

Model: poisson, link: log

Response: claims

Terms added sequentially (first to last)

          Df Deviance Resid. Df Resid. Dev
NULL                     499999     145532
acCAT      3  2927.32   499996     142605
ageCAT     7   850.00   499989     141755
ct        25   363.29   499964     141392
brand     10   124.37   499954     141267
powerCAT   8   315.48   499946     140952
gas        1    50.53   499945     140901
area       1   255.20   499944     140646
log(dens)  1     5.07   499943     140641

Pay attention: the order in which the covariates are included matters.


36
Backward Parameter Reduction: Drop1

Single term deletions

Model:
claims ~ acCAT + ageCAT + ct + brand + powerCAT + gas + area + log(dens)

          Df Deviance    AIC     LRT  Pr(>Chi)
<none>        140641  191132
acCAT      3  142942  193426 2300.61 < 2.2e-16 ***
ageCAT     7  141485  191962  843.91 < 2.2e-16 ***
ct        25  140966  191406  324.86 < 2.2e-16 ***
brand     10  140791  191261  149.70 < 2.2e-16 ***
powerCAT   8  140969  191443  327.68 < 2.2e-16 ***
gas        1  140667  191156   26.32 2.891e-07 ***
area       1  140646  191135    5.06   0.02453 *
log(dens)  1  140646  191135    5.07   0.02434 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We should keep the full model according to AIC and according to the LRT at the 5% significance level.

37
• Car Insurance Frequency Example

38
Example: Poisson Frequency Model (1/2)

• The Poisson model has dispersion $\varphi = 1$.

• The Poisson model has cumulant function

$$ \theta \mapsto \kappa(\theta) = \exp(\theta). $$

• Mean and variance within the EDF are given by

$$ \mu_i = \mathbb{E}[Y_i] = \kappa'(\theta_i) = \exp(\theta_i), \qquad \mathrm{Var}(Y_i) = \frac{\varphi}{v_i}\,\kappa''(\theta_i) = \frac{1}{v_i}\exp(\theta_i) = \mu_i\,\frac{1}{v_i}. $$

⇒ $N_i = v_i Y_i$ has a Poisson distribution with mean $v_i \mu_i$.

39
Example: Poisson Frequency Model (2/2)

• Mean of the Poisson model for $N_i = v_i Y_i$:

$$ v_i\mu_i = \mathbb{E}[N_i] = v_i\,\kappa'(\theta_i) = v_i \exp(\theta_i) = \exp(\log v_i + \theta_i). $$

The term $\log v_i$ is called offset.

• The Poisson GLM with canonical link $g = h = \log$ is given by

$$ x_i \mapsto \log\left(\mathbb{E}[N_i]\right) = \log v_i + \langle \beta, x_i \rangle = \log v_i + \beta_0 + \sum_{j=1}^q \beta_j x_{i,j}. $$

                     run time   # param. q+1       AIC   in-sample loss   out-of-sample loss
  homogeneous model         –              1   263'143           32.935               33.861
  Model GLM1              20s             49   253'062           31.267               32.171

Losses are in units of 10^{-2}.

40
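The course fits GLM1 with R's `glm()`; as a minimal, self-contained illustration of a Poisson GLM with log-offset, the following Python sketch fits a single-covariate model on made-up toy data by plain gradient ascent on the log-likelihood (rather than IRLS), and checks the balance property under the canonical log-link:

```python
import math

# toy data: claim counts N_i, exposures v_i, one binary covariate x_i
N = [1, 1, 0, 2, 1, 0]
v = [0.5, 1.0, 0.8, 1.0, 0.9, 0.3]
x = [0.0, 1.0, 0.0, 1.0, 1.0, 0.0]

b0, b1 = 0.0, 0.0                          # intercept and slope
for _ in range(5000):                      # gradient ascent on the Poisson log-likelihood
    g0 = g1 = 0.0
    for ni, vi, xi in zip(N, v, x):
        mu = vi * math.exp(b0 + b1 * xi)   # E[N_i] = exp(log v_i + <beta, x_i>)
        g0 += ni - mu                      # score w.r.t. b0
        g1 += (ni - mu) * xi               # score w.r.t. b1
    b0 += 0.05 * g0
    b1 += 0.05 * g1

# balance property under the canonical log-link: fitted total = observed total
fitted = sum(vi * math.exp(b0 + b1 * xi) for vi, xi in zip(v, x))
assert abs(fitted - sum(N)) < 1e-8
```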
Further Points

• To prevent over-fitting, regularization can be used.

• Ridge regression is based on an L2-penalization and generally shrinks the regression parameter components in β (excluding the intercept β0).

• LASSO (least absolute shrinkage and selection operator) regression is based on an L1-penalization and can set regression parameter components exactly to zero.

• LASSO has difficulties with collinearity in covariate components; therefore, sometimes an elastic net regularization is used, which combines ridge and LASSO.

• Regularization has a Bayesian interpretation.

• Generalized additive models (GAMs) allow for more flexibility than GLMs in marginal covariate component modeling, but they often suffer from computational complexity.

41
References
• Barndorff-Nielsen (2014). Information and Exponential Families: In Statistical Theory. Wiley
• Charpentier (2015). Computational Actuarial Science with R. CRC Press.
• Efron, Hastie (2016). Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. Cambridge UP.
• Fahrmeir, Tutz (1994). Multivariate Statistical Modelling Based on Generalized Linear Models. Springer.
• Fisher (1934). Two new properties of mathematical likelihood. Proceedings of the Royal Society A 144, 285-307.
• Hastie, Tibshirani, Friedman (2009). The Elements of Statistical Learning. Springer.
• Jørgensen (1986). Some properties of exponential dispersion models. Scandinavian Journal of Statistics 13/3,
187-197.
• Jørgensen (1987). Exponential dispersion models. Journal of the Royal Statistical Society. Series B (Methodological)
49/2, 127-145.
• Jørgensen (1997). The Theory of Dispersion Models. Chapman & Hall.
• Lehmann (1983). Theory of Point Estimation. Wiley.
• Lorentzen, Mayer (2020). Peeking into the black box: an actuarial case study for interpretable machine learning.
SSRN 3595944.
• McCullagh, Nelder (1983). Generalized Linear Models. Chapman & Hall.
• Nelder, Wedderburn (1972). Generalized linear models. Journal of the Royal Statistical Society. Series A (General)
135/3, 370-384.
• Noll, Salzmann, Wüthrich (2018). Case study: French motor third-party liability claims. SSRN 3164764.
• Ohlsson, Johansson (2010). Non-Life Insurance Pricing with Generalized Linear Models. Springer.
• Wüthrich, Buser (2016). Data Analytics for Non-Life Insurance Pricing. SSRN 2870308, Version September 10, 2020.
• Wüthrich, Merz (2021). Statistical Foundations of Actuarial Learning and its Applications. SSRN 3822407.

42
Feed-Forward Neural Networks

Mario V. Wüthrich
RiskLab, ETH Zurich

“Deep Learning with Actuarial Applications in R”


Swiss Association of Actuaries SAA/SAV, Zurich
October 14/15, 2021

1
Contents: Feed-Forward Neural Network

• The statistical modeling cycle

• Generic feed-forward neural networks (FNNs)

• Universality theorems

• Gradient descent methods for model fitting

• Generalization loss and cross-validation

• Embedding layers

2
• The Statistical Modeling Cycle

3
The Statistical Modeling Cycle

(1) data collection, data cleaning and data pre-processing (> 80% of total time)

(2) selection of model class (data or algorithmic modeling culture, Breiman 2001)

(3) choice of objective function

(4) 'solving' a (non-convex) optimization problem

(5) model validation and variable selection

(6) possibly go back to (1)

'solving' involves:
  ⋆ choice of algorithm
  ⋆ choice of stopping criterion, step size, etc.
  ⋆ choice of seed (starting value)
4
• Generic Feed-Forward Neural Networks (FNNs)

5
Neural Network Architectures
• Neural networks can be understood as an approximation framework.

• Here: neural networks generalize GLMs.

• There are different types of neural networks:

  ⋆ Feed-forward neural network (FNN): Information propagates in one direction from input to output.
  ⋆ Recurrent neural network (RNN): An extension of FNNs that allows for time series modeling (it can capture temporal or causal structures).
  ⋆ Convolutional neural network (CNN): A type of network suited to modeling temporal and spatial structure, e.g., in image recognition.

• FNNs have stacked hidden layers. If there is exactly one hidden layer, we call the
network shallow; if there are multiple hidden layers, we call the network deep.

• There are many special neural network architectures such as generative adversarial networks (GANs), bottleneck auto-encoder (AE) networks, etc.
6
Shallow and Deep Fully-Connected FNNs

[Figure: a shallow and a deep fully-connected FNN with inputs age and ac, and output claims.]

These two examples are fully-connected FNNs.

Information is processed from the input (in blue) to the output (in red).
7
Representation Learning
• A GLM with link $g$ has the following structure:

$$ x \mapsto \mu(x) = \mathbb{E}[Y] = g^{-1}\langle \beta, x \rangle. $$

⇒ This requires manual feature engineering to bring $x$ into the right form.

• Networks perform automated feature engineering.

• A layer is given by a mapping

$$ z^{(m)}: \mathbb{R}^{q_{m-1}} \to \mathbb{R}^{q_m}. $$

⇒ Each layer presents a new representation of the covariates.

• In general, compose layers

$$ x \mapsto z^{(d:1)}(x) \stackrel{\text{def.}}{=} \left( z^{(d)} \circ \cdots \circ z^{(1)} \right)(x) \in \mathbb{R}^{q_d}. $$
8
Fully-Connected FNN Layer

• Choose dimensions $q_{m-1}, q_m \in \mathbb{N}$ and an activation function $\phi: \mathbb{R} \to \mathbb{R}$.

• A (hidden) FNN layer is a mapping

$$ z^{(m)}: \mathbb{R}^{q_{m-1}} \to \mathbb{R}^{q_m}, \qquad x \mapsto z^{(m)}(x) = \left( z_1^{(m)}(x), \ldots, z_{q_m}^{(m)}(x) \right)^\top, $$

with (hidden) neurons given by, for $1 \le j \le q_m$,

$$ z_j^{(m)}(x) = \phi\left( w_{j,0}^{(m)} + \sum_{l=1}^{q_{m-1}} w_{j,l}^{(m)} x_l \right) \stackrel{\text{def.}}{=} \phi\langle w_j^{(m)}, x \rangle, $$

for given network weights (parameters) $w_j^{(m)} \in \mathbb{R}^{q_{m-1}+1}$.

• Every neuron $z_j^{(m)}(x)$ describes a GLM w.r.t. feature $x \in \mathbb{R}^{q_{m-1}}$ and activation $\phi$. The resulting function (called ridge function) reflects a compression of information.
9
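A single FNN layer as defined above can be sketched in plain Python (illustrative input and weights, not from the course):

```python
import math

def fnn_layer(x, W, phi):
    """One fully-connected layer: z_j(x) = phi(w_{j,0} + sum_l w_{j,l} x_l).
    W is a list of q_m weight vectors, each of length q_{m-1} + 1 (bias first)."""
    return [phi(w[0] + sum(wl * xl for wl, xl in zip(w[1:], x))) for w in W]

x = [0.5, -1.0]                           # input in R^2 (q_{m-1} = 2)
W = [[0.1, 0.4, -0.3],                    # three neurons (q_m = 3), bias first
     [0.0, 1.0, 1.0],
     [-0.2, 0.5, 0.5]]

z = fnn_layer(x, W, math.tanh)
assert len(z) == 3                            # layer output lives in R^3
assert all(-1.0 < zj < 1.0 for zj in z)       # tanh maps into (-1, 1)
assert abs(z[1] - math.tanh(0.5 - 1.0)) < 1e-12
```

Deep networks then compose such layers, z^(d) ∘ ... ∘ z^(1), exactly as in the layer-composition formula above.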
Shallow and Deep Fully-Connected FNNs

[Figure: a shallow and a deep fully-connected FNN with inputs age and ac, and output claims.]

These two examples are fully-connected FNNs.

Information is processed from the input (in blue) to the output (in red).
10
Activation Function

• The activation function $\phi : \mathbb{R} \to \mathbb{R}$ is an inverse link function $\phi = g^{-1}$.

• Since we would like to approximate non-linear regression functions, activation functions should be non-linear, too.

• The most popular choices of activation functions are:

  sigmoid/logistic function      $\phi(x) = (1 + e^{-x})^{-1} \in (0,1)$          $\phi' = \phi(1-\phi)$
  hyperbolic tangent function    $\phi(x) = \tanh(x) \in (-1,1)$                  $\phi' = 1 - \phi^2$
  exponential function           $\phi(x) = \exp(x) \in (0,\infty)$               $\phi' = \phi$
  step function                  $\phi(x) = 1_{\{x \ge 0\}} \in \{0,1\}$          not differentiable in 0
  rectified linear unit (ReLU)   $\phi(x) = x\,1_{\{x \ge 0\}} \in [0,\infty)$    not differentiable in 0

• We mainly use the hyperbolic tangent (with the following relationship to the sigmoid):

$$ x \mapsto \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = 2\left(1 + e^{-2x}\right)^{-1} - 1 = 2\,\mathrm{sigmoid}(2x) - 1. $$

11
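The stated tanh-sigmoid relationship and the two derivative identities can be checked numerically (Python sketch, illustrative only):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# check tanh(x) = 2 * sigmoid(2x) - 1, plus the derivative identities
# phi' = phi(1 - phi) for the sigmoid and phi' = 1 - phi^2 for tanh,
# both via central finite differences
h = 1e-6
for x in [-2.0, -0.5, 0.0, 1.0, 3.0]:
    assert abs(math.tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0)) < 1e-12
    ds = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
    assert abs(ds - sigmoid(x) * (1.0 - sigmoid(x))) < 1e-6
    dt = (math.tanh(x + h) - math.tanh(x - h)) / (2 * h)
    assert abs(dt - (1.0 - math.tanh(x) ** 2)) < 1e-6
```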
Sigmoid Activation Function φ(x) = (1 + e−x)−1
[Figure: the sigmoid function $x \mapsto \phi(wx)$ on $(-10, 10)$ for the weights $w \in \{1/4, 1, 4\}$.]

• Sigmoid activation $x \mapsto \phi(wx)$ for weights $w \in \{1/4, 1, 4\}$ and $x \in (-10, 10)$:

  ⋆ "deactivated" for small values $x$, i.e. $\phi(wx) \approx 0$ for $x$ small,
  ⋆ "activated" for big values $x$, i.e. $\phi(wx) \approx 1$ for $x$ large.

12
Fully-Connected FNN Architecture

[Figure: a fully-connected FNN with inputs age and ac, and output claims.]

• Choose the depth of the network $d \in \mathbb{N}$ and define the FNN layer composition

$$ x \mapsto z^{(d:1)}(x) \stackrel{\text{def.}}{=} \left( z^{(d)} \circ \cdots \circ z^{(1)} \right)(x) \in \mathbb{R}^{q_d}, $$

with $q_0 = q$ for $x \in \mathbb{R}^q$.

• Define the output layer with link function $g$ by

$$ x_i \mapsto \mu_i = \mathbb{E}[Y_i] = g^{-1}\left\langle \beta, z^{(d:1)}(x_i) \right\rangle. $$
13
FNN Architecture: Interpretations

[Figure: a fully-connected FNN with inputs age and ac, and output claims.]

• Network mapping

$$ x_i \mapsto \mu_i = \mathbb{E}[Y_i] = g^{-1}\left\langle \beta, z^{(d:1)}(x_i) \right\rangle. $$

• The mapping $x_i \mapsto z_i = z^{(d:1)}(x_i)$ should be understood as feature engineering or representation learning.

• The linear activation function $\phi(x) = x$ provides a GLM (a composition of linear functions is a linear function). Thus, a GLM is a special case of a FNN.

• For depth $d = 0$ we recover a GLM, too.

14
• Universality Theorems

15
Universality Theorems for FNNs

[Figure: a fully-connected FNN with inputs age and ac, and output claims.]

• Cybenko (1989) and Hornik et al. (1989): Any compactly supported continuous
function can be approximated arbitrarily well (in sup- or L2-norm) by shallow
FNNs with sigmoid activation if allowing for arbitrarily many hidden neurons (q1).

• Leshno et al. (1993): The universality theorem for shallow FNNs holds if and only
if the activation function φ is non-polynomial.

• Grohs et al. (2019): Shallow FNNs with ReLU activation functions provide
polynomial approximation rates, deep FNNs provide exponential rates.

16
Simple Example Supporting Deep FNNs

• Consider a 2-dimensional example $\mu: [0,1]^2 \to \mathbb{R}_+$,

$$ x \mapsto \mu(x) = 1 + 1_{\{x_2 \ge 1/2\}} + 1_{\{x_1 \ge 1/2,\; x_2 \ge 1/2\}} \in \{1, 2, 3\}. $$

• Choose the step function activation $\phi(x) = 1_{\{x \ge 0\}}$.

[Figure: the regression function on the unit square in the feature components $X_1, X_2$, taking values 1, 2, 3.]

• A FNN of depth $d = 2$ with $q_1 = q_2 = 2$,

$$ \left\langle \beta, z^{(2:1)}(x) \right\rangle = \left\langle \beta, (z^{(2)} \circ z^{(1)})(x) \right\rangle, $$

can perfectly approximate the function $\mu$.

• Deep FNNs allow for more complex interactions of covariates through compositions of layers/functions: width allows for superposition, and depth allows for composition.

17
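The depth-2 step-activation network from the slide can be written out explicitly; the following Python sketch uses one possible weight choice (chosen here for illustration, the slide does not fix the weights) and reproduces $\mu(x) = 1 + 1_{\{x_2 \ge 1/2\}} + 1_{\{x_1 \ge 1/2,\, x_2 \ge 1/2\}}$ exactly:

```python
def step(t):
    # step activation phi(t) = 1_{t >= 0}
    return 1.0 if t >= 0 else 0.0

def mu_net(x1, x2):
    # layer 1 (q1 = 2): indicator neurons z1 = 1{x1 >= 1/2}, z2 = 1{x2 >= 1/2}
    z1 = step(x1 - 0.5)
    z2 = step(x2 - 0.5)
    # layer 2 (q2 = 2): a = 1{x2 >= 1/2}, b = 1{x1 >= 1/2 and x2 >= 1/2}
    a = step(z2 - 0.5)
    b = step(z1 + z2 - 1.5)
    # output layer with beta = (beta_0, beta_1, beta_2) = (1, 1, 1)
    return 1.0 + a + b

def mu_true(x1, x2):
    return 1.0 + (x2 >= 0.5) + (x1 >= 0.5 and x2 >= 0.5)

for x1 in [0.0, 0.25, 0.5, 0.75, 1.0]:
    for x2 in [0.0, 0.25, 0.5, 0.75, 1.0]:
        assert mu_net(x1, x2) == mu_true(x1, x2)
```

The interaction term is what genuinely needs the second layer: neuron b is a composition of the two first-layer indicators.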
Shallow Neural Networks

[Figures: shallow FNN fits with q1 = 4, 8 and 64 hidden neurons on $(x_1, x_2) \in [-0.5, 0.5]^2$.]

18
• Gradient Descent Methods for Model Fitting

19
Deviance Loss Function
• The FNN mapping

$$ x_i \mapsto \mu_i = \mathbb{E}[Y_i] = g^{-1}\left\langle \beta, z^{(d:1)}(x_i) \right\rangle $$

has network parameter

$$ \vartheta = \left( w_1^{(1)}, \ldots, w_{q_d}^{(d)}, \beta \right) \in \mathbb{R}^r, $$

of dimension $r = \sum_{m=1}^d q_m (q_{m-1} + 1) + (q_d + 1)$.

• The deviance loss function under independent observations $(Y_i)_{i=1}^n$ is

$$ \vartheta \mapsto D^*(Y, \vartheta) = 2\left[ \ell_Y(Y) - \ell_Y(\vartheta) \right] = \sum_{i=1}^n 2\,\frac{v_i}{\varphi}\left[ Y_i h(Y_i) - \kappa(h(Y_i)) - Y_i h(\mu_i) + \kappa(h(\mu_i)) \right] \ \ge\ 0. $$

• Minimizing the deviance loss $D^*(Y, \vartheta)$ in the network parameter $\vartheta$ provides the MLE $\widehat{\vartheta}$.

20
Plain Vanilla Gradient Descent Method (1/2)

• Gradient descent methods (GDMs) stepwise iteratively improve the network parameter $\vartheta$ by moving in the direction of the maximal (local) decrease of $D^*(Y, \vartheta)$.

• 1st order Taylor expansion of the deviance loss in the network parameter $\vartheta$:

$$ D^*(Y, \widetilde{\vartheta}) = D^*(Y, \vartheta) + \nabla_\vartheta D^*(Y, \vartheta)^\top \left( \widetilde{\vartheta} - \vartheta \right) + o\left( \| \widetilde{\vartheta} - \vartheta \| \right), $$

for $\|\widetilde{\vartheta} - \vartheta\| \to 0$ (we suppose differentiability).

• Calculate the corresponding gradient

$$ \nabla_\vartheta D^*(Y, \vartheta) = \sum_{i=1}^n 2\,\frac{v_i}{\varphi}\left[ \mu_i - Y_i \right] \nabla_\vartheta h(\mu_i). $$

• Back-propagation (Rumelhart et al. 1986) is an efficient way to calculate $\nabla_\vartheta h(\mu_i)$.

21
Plain Vanilla Gradient Descent Method (2/2)

• The negative gradient $-\nabla_\vartheta D^*(Y, \vartheta)$ gives the direction for $\vartheta$ of the maximal local decrease in deviance loss.

• For a given learning rate $\varrho_{t+1} > 0$, the gradient descent algorithm updates the network parameter $\vartheta^{(t)}$ iteratively by (adapted locally optimal)

$$ \vartheta^{(t)} \mapsto \vartheta^{(t+1)} = \vartheta^{(t)} - \varrho_{t+1} \nabla_\vartheta D^*(Y, \vartheta^{(t)}). $$

• This update provides the new (in-sample) deviance loss, for $\varrho_{t+1} \to 0$,

$$ D^*(Y, \vartheta^{(t+1)}) = D^*(Y, \vartheta^{(t)}) - \varrho_{t+1} \left\| \nabla_\vartheta D^*(Y, \vartheta^{(t)}) \right\|^2 + o(\varrho_{t+1}). $$

• Using a tempered learning rate $(\varrho_t)_{t\ge1}$, the network parameter $\vartheta^{(t)}$ converges to a local minimum of $D^*(Y, \cdot)$ for $t \to \infty$.
22
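A plain vanilla gradient descent can be illustrated on the simplest possible case, the homogeneous Poisson model with a single parameter $\vartheta = \theta$ (Python sketch, toy data; here the closed-form MLE from the GLM chapter serves as the reference solution):

```python
import math

# toy data: claim frequencies Y_i with exposures v_i, dispersion phi = 1
Y = [0.0, 1.0, 0.5, 2.0]
v = [1.0, 1.0, 2.0, 0.5]

def grad_deviance(theta):
    # d/dtheta D*(Y, theta) = sum_i 2 v_i [exp(theta) - Y_i] for the
    # homogeneous Poisson model with mu = exp(theta) and h = log
    return sum(2.0 * vi * (math.exp(theta) - yi) for yi, vi in zip(Y, v))

theta = 0.0
rho = 0.05                              # constant learning rate
for _ in range(2000):                   # plain vanilla gradient descent steps
    theta -= rho * grad_deviance(theta)

# the minimum is the closed-form MLE theta_hat = log(sum v_i Y_i / sum v_i)
theta_hat = math.log(sum(vi * yi for yi, vi in zip(Y, v)) / sum(v))
assert abs(theta - theta_hat) < 1e-8
```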
Over-Fitting in Complex FNNs

[Figure: observations together with the homogeneous model and two neural network fits $x \mapsto \mu(x)$.]

• Convergence to a local minimum of D∗(Y , ·) typically means over-fitting.

• Apply early stopping:


? Partition data at random into training data U and validation data V.
? Fit ϑ on U (in-sample) and track over-fitting on V (out-of-sample).
? The “best” model obviously is non-unique when we use early stopping.
23
Early Stopping of Gradient Descent Algorithm
[Figure: training loss and validation loss (deviance losses) of the stochastic gradient descent algorithm over 1000 training epochs.]

• Convergence to a local minimum of D∗(Y , ·) typically means over-fitting.

• Apply early stopping:


? Partition data at random into training data U and validation data V.
? Fit ϑ on U (in-sample) and track over-fitting on V (out-of-sample).
? The “best” model obviously is non-unique when we use early stopping.
24
Use of Data

D — entire data,
L — learning data (in-sample),
T — test data (out-of-sample),
U — training data,
V — validation data.

25
Computational Issues and Stochastic Gradient
• Gradient descent steps

$$ \vartheta^{(t)} \mapsto \vartheta^{(t+1)} = \vartheta^{(t)} - \varrho_{t+1} \nabla_\vartheta D^*(Y, \vartheta^{(t)}) $$

involve high-dimensional matrix multiplications

$$ \nabla_\vartheta D^*(Y, \vartheta) = \sum_{i=1}^n 2\,\frac{v_i}{\varphi}\left[ \mu_i - Y_i \right] \nabla_\vartheta h(\mu_i), $$

which are computationally expensive if the size of the training data $\mathcal{U}$ is large.

• Partition the training data $\mathcal{U}$ at random into mini-batches $\mathcal{U}_k$ of a given size, and use one mini-batch $\mathcal{U}_k$ at a time for the gradient descent steps. This is called the stochastic gradient descent (SGD) algorithm.

• Running through all mini-batches $(\mathcal{U}_k)_k$ once is called a training epoch.

• Using the entire training data in each GDM step is called steepest gradient descent.
26
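The mini-batch partitioning can be sketched as follows (Python, illustrative; batch size and seed are arbitrary choices):

```python
import random

def make_minibatches(indices, batch_size, seed=42):
    """Randomly partition training indices into mini-batches of (at most) batch_size."""
    idx = list(indices)
    random.Random(seed).shuffle(idx)
    return [idx[k:k + batch_size] for k in range(0, len(idx), batch_size)]

n = 10
batches = make_minibatches(range(n), batch_size=3)

# one pass over all mini-batches is one training epoch:
# every observation appears in exactly one batch
assert sorted(i for batch in batches for i in batch) == list(range(n))
assert len(batches) == 4   # batch sizes 3, 3, 3, 1
```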
Size of Mini-Batches

• Partition the training data U at random into mini-batches U1, . . . , UK, and use for each gradient descent step one mini-batch Uk at a time

    ∇ϑ D∗(Uk, ϑ) = 2 ∑_{i∈Uk} [µi − Yi] ∇ϑ h(µi).

• Size of mini-batches for Poisson frequencies? For an expected frequency µ(x) = 5% and a mini-batch size of v = 2000 policies, a two-standard-deviation interval for the empirical batch frequency is

    [ µ(x) − 2√(µ(x)/v), µ(x) + 2√(µ(x)/v) ] = [ 5% − 2√(5%/2000), 5% + 2√(5%/2000) ] = [4%, 6%].

Note for the Poisson case E[N] = Var(N) = µ(x)v.
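The interval can be verified numerically; a quick sketch using the slide's values µ(x) = 5% and v = 2000:

```python
import math

mu, v = 0.05, 2000                       # expected frequency and mini-batch size
# Poisson: E[N] = Var(N) = mu*v, so the empirical batch frequency N/v
# has standard deviation sqrt(mu/v); a two-standard-deviation interval:
half_width = 2 * math.sqrt(mu / v)
interval = (mu - half_width, mu + half_width)   # ≈ (0.04, 0.06), i.e. [4%, 6%]
```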

27
Momentum-Based Gradient Descent Methods

• Plain vanilla GDMs use 1st order Taylor expansions.

• To improve convergence rates we could use 2nd order Taylor expansions.

• 2nd order Taylor expansions involve calculations of Hessians.

• This is computationally not feasible.

• Replace Hessians by momentum methods (inspired by physics/mechanics).

• Choose a momentum coefficient ν ∈ [0, 1) and set initial speed v(0) = 0 ∈ Rr .


Replace the plain vanilla GDM update by

    v(t) ↦ v(t+1) = ν v(t) − ϱ_{t+1} ∇ϑ D∗(Y, ϑ(t)),

    ϑ(t) ↦ ϑ(t+1) = ϑ(t) + v(t+1).
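A minimal sketch of this momentum update on a toy quadratic objective; the objective, learning rate and step count are hypothetical, not the slides' network loss.

```python
import numpy as np

def momentum_gd(grad_fn, theta0, lr=0.1, nu=0.9, steps=300):
    """Momentum update: v <- nu*v - lr*grad(theta), theta <- theta + v."""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)                  # initial speed v(0) = 0
    for _ in range(steps):
        v = nu * v - lr * grad_fn(theta)
        theta = theta + v
    return theta

# minimize f(theta) = ||theta - 1||^2 with gradient 2*(theta - 1)
theta_hat = momentum_gd(lambda t: 2.0 * (t - 1.0), np.zeros(3))
```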

28
Predefined Gradient Descent Methods

• ’rmsprop’ chooses learning rates that differ across directions by considering directional gradient sizes (’rmsprop’ stands for root mean square propagation);

• ’adam’ stands for adaptive moment estimation; similar to ’rmsprop’ it searches for directionally optimal learning rates based on the momentum induced by past gradients, measured by an L2-norm;

• ’nadam’ is the Nesterov (2007) accelerated version of ’adam’, avoiding zig-zag behavior.

• For more details we refer to Chapter 8 of Goodfellow et al. (2016) and Section 7.2.3 in Wüthrich–Merz (2021).
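A sketch of the standard textbook form of the ’adam’ update; the toy objective and learning rate below are hypothetical choices for illustration.

```python
import numpy as np

def adam_step(theta, g, state, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One 'adam' update: directional learning rates from gradient moments."""
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * g            # 1st moment (momentum)
    state["v"] = b2 * state["v"] + (1 - b2) * g ** 2       # 2nd moment (rmsprop part)
    m_hat = state["m"] / (1 - b1 ** state["t"])            # bias corrections
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps)

# minimize the toy objective f(theta) = ||theta||^2 with gradient 2*theta
theta = np.array([5.0, -3.0])
state = {"t": 0, "m": np.zeros(2), "v": np.zeros(2)}
for _ in range(5000):
    theta = adam_step(theta, 2.0 * theta, state, lr=0.01)
```

Dividing the first-moment estimate by the root of the second moment is what gives each direction its own effective learning rate.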

29
• Generalization Loss and Cross-Validation

30
Empirical Generalization Loss
Typically, for neural network modeling one considers 3 disjoint sets of data.

• Training data U: is used to fit the network parameter ϑ.

• Validation data V: is used to track in-sample over-fitting (early stopping).

• Test data T : is used to study out-of-sample generalization loss.

Assume that ϑ̂_{U,V} is the estimated network parameter based on U and V. The test data T is given by (Yt, xt, vt)_{t=1}^{T}. We have the (out-of-sample) generalization loss (GL)

    D∗(Y, ϑ̂_{U,V}) = 2 ∑_{t=1}^{T} (vt/ϕ) [ Yt h(Yt) − κ(h(Yt)) − Yt h(µ̂t^{U,V}) + κ(h(µ̂t^{U,V})) ].

• This is an empirical generalization loss based on T , mimicking the portfolio distribution.

31
K-Fold Cross-Validation Loss
• If one cannot afford to partition the data D into 3 disjoint sets training data U,
validation data V and test data T , one has to use the data more efficiently.

• K-fold cross-validation aims at doing so.

• Partition the entire data at random into K subsets D1, . . . , DK of roughly equal size.

• Denote by ϑb(−Dk ) the estimated network parameter based on all data except Dk .

• The K-fold cross-validation loss is given by

    D^CV = (1/K) ∑_{k=1}^{K} ∑_{t∈Dk} 2 (vt/ϕ) [ Yt h(Yt) − κ(h(Yt)) − Yt h(µ̂t^{(−Dk)}) + κ(h(µ̂t^{(−Dk)})) ].

• This mimics K times an out-of-sample generalization loss on Dk, respectively.

• In neural network modeling K-fold cross-validation is computationally too costly.
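A sketch of the K-fold cross-validation loss for the simplest case, a homogeneous Poisson model fitted by its MLE; the toy data and K = 5 are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.poisson(0.1, size=500)            # toy Poisson claim counts

def poisson_deviance(y, mu):
    # 2 * sum( y*log(y/mu) - y + mu ), with the convention y*log(y/mu) := 0 for y = 0
    term = np.where(y > 0, y * np.log(np.maximum(y, 1) / mu), 0.0)
    return 2.0 * np.sum(term - y + mu)

K = 5
folds = rng.permutation(len(Y)) % K       # random partition D_1, ..., D_K
cv_loss = 0.0
for k in range(K):
    train, test = Y[folds != k], Y[folds == k]
    mu_hat = train.mean()                 # MLE fitted without fold D_k
    cv_loss += poisson_deviance(test, mu_hat)
cv_loss /= K                              # average over the K folds
```

Each fold D_k is scored with a model that never saw D_k, which is exactly the out-of-sample mimicry described above; replacing the homogeneous MLE by a network refit makes clear why this is K times as costly.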


32
• Car Insurance Frequency Example

33
Car Insurance Claims Frequency Data

1 ’ data . frame ’: 678013 obs . of 12 variables :


2 $ IDpol : num 1 3 5 10 11 13 15 17 18 21 ...
3 $ ClaimNb : num 1 1 1 1 1 1 1 1 1 1 ...
4 $ Exposure : num 0.1 0.77 0.75 0.09 0.84 0.52 0.45 0.27 0.71 0.15 ...
5 $ Area : Factor w / 6 levels " A " ," B " ," C " ," D " ,..: 4 4 2 2 2 5 5 3 3 2 ...
6 $ VehPower : int 5 5 6 7 7 6 6 7 7 7 ...
7 $ VehAge : int 0 0 2 0 0 2 2 0 0 0 ...
8 $ DrivAge : int 55 55 52 46 46 38 38 33 33 41 ...
9 $ BonusMalus : int 50 50 50 50 50 50 50 68 68 50 ...
10 $ VehBrand : Factor w / 11 levels " B1 " ," B10 " ," B11 " ,..: 4 4 4 4 4 4 4 4 4 4 ...
11 $ VehGas : Factor w / 2 levels " Diesel " ," Regular ": 2 2 1 1 1 2 2 1 1 1 ...
12 $ Density : int 1217 1217 54 76 76 3003 3003 137 137 60 ...
13 $ Region : Factor w / 22 levels " R11 " ," R21 " ," R22 " ,..: 18 18 3 15 15 8 8 20 20 12

• 3 categorical covariates, 1 binary covariate and 5 continuous covariates

• Goal: Find systematic effects to explain/predict claim counts.

34
Feature Engineering
• Categorical features: use either dummy coding or one-hot encoding.
PS: We come back to this choice below.

• Also continuous features need pre-processing. All feature components should live on a similar scale so that the GDM can be applied efficiently.

• Often, the MinMaxScaler is used

    xi,l ↦ x^{MM}_{i,l} = 2 (xi,l − x_l^−)/(x_l^+ − x_l^−) − 1 ∈ [−1, 1],

where x_l^− and x_l^+ are the minimum and maximum of the domain of xi,l.

• Successful application of MinMaxScaler pre-processing requires that the feature


distribution is not “too skewed”, otherwise pre-processing should be performed
with a scaler that accounts for skewness (like the log function).

• Standardization with empirical mean and standard deviation is possible, too.
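A sketch of the MinMaxScaler formula; the toy ages below are hypothetical values.

```python
import numpy as np

def minmax_scale(x, lo=None, hi=None):
    """MinMaxScaler: x -> 2*(x - lo)/(hi - lo) - 1, mapping [lo, hi] to [-1, 1]."""
    lo = np.min(x) if lo is None else lo
    hi = np.max(x) if hi is None else hi
    return 2.0 * (x - lo) / (hi - lo) - 1.0

age = np.array([18.0, 30.0, 55.0, 90.0])
scaled = minmax_scale(age)                # endpoints map to -1 and +1
```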


35
Deep FNN Coding in R keras

1 library ( keras )
2
3 q0 <- 12 # dimension of input x
4 q1 <- 20
5 q2 <- 15
6 q3 <- 10
7
8 Design <- layer_input ( shape = c ( q0 ) , dtype = ’ float32 ’ , name = ’ Design ’)
9
10 Network = Design %>%
11 layer_dense ( units = q1 , activation = ’ tanh ’ , name = ’ hidden1 ’) %>%
12 layer_dense ( units = q2 , activation = ’ tanh ’ , name = ’ hidden2 ’) %>%
13 layer_dense ( units = q3 , activation = ’ tanh ’ , name = ’ hidden3 ’) %>%
14 layer_dense ( units =1 , activation = ’ exponential ’ , name = ’ Network ’)
15
16 model <- keras_model ( inputs = c ( Design ) , outputs = c ( Network ))
17 model %>% compile ( optimizer = optimizer_nadam () , loss = ’ poisson ’)
18
19 summary ( model )

36
Deep FNN with (q1, q2, q3) = (20, 15, 10)

1 Layer ( type ) Output Shape Param #


2 ================================================================================
3 Design ( InputLayer ) ( None , 12) 0
4 ________________________________________________________________________________
5 hidden1 ( Dense ) ( None , 20) 260
6 ________________________________________________________________________________
7 hidden2 ( Dense ) ( None , 15) 315
8 ________________________________________________________________________________
9 hidden3 ( Dense ) ( None , 10) 160
10 ________________________________________________________________________________
11 Network ( Dense ) ( None , 1) 11
12 ================================================================================
13 Total params : 746
14 Trainable params : 746
15 Non - trainable params : 0

37
Poisson FNN Regression with Offset
• Poisson regression with offset and canonical link g = h = log, set Ni = viYi

    vi µi = E[Ni] = vi κ′(θi) = vi exp(θi) = exp(log vi + θi).

• The Poisson FNN regression is given by, set µi = µ(xi),

    xi ↦ log(E[Ni]) = log(vi µ(xi)) = log vi + ⟨β, z^{(d:1)}(xi)⟩.

• The Poisson deviance loss function is given by

    D∗(N, ϑ) = 2 ∑_{i=1}^{n} Ni [ vi µ(xi)/Ni − 1 − log( vi µ(xi)/Ni ) ] ≥ 0,

where the i-th term is set equal to 2 vi µ(xi) for Ni = 0.

• In keras the terms independent of ϑ are dropped in the deviance losses.
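To see this last point: the Poisson deviance and keras' ’poisson’ loss differ only by a factor and by terms that do not involve µ, so both are minimized by the same model. A numerical sketch with hypothetical values:

```python
import numpy as np

def poisson_deviance(N, mu):
    # 2 * sum_i N_i * [ mu_i/N_i - 1 - log(mu_i/N_i) ], i-th term = 2*mu_i if N_i = 0
    safe_N = np.maximum(N, 1.0)
    return np.sum(np.where(N > 0,
                           2.0 * N * (mu / safe_N - 1.0 - np.log(mu / safe_N)),
                           2.0 * mu))

def keras_poisson_loss(N, mu):
    # keras' 'poisson' loss: mean(mu - N*log(mu)); terms free of mu are dropped
    return np.mean(mu - N * np.log(mu))

N, mu = np.array([0.0, 1.0, 2.0]), np.array([0.5, 1.2, 1.8])
# identity: deviance = 2*n*keras_loss + 2*sum(N*log(N) - N), last term free of mu
const = 2.0 * np.sum(np.where(N > 0, N * np.log(np.maximum(N, 1.0)) - N, 0.0))
```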


38
Deep FNN Coding in R keras with Offset

1 q0 <- 12 # dimension of input x


2 q1 <- 20
3 q2 <- 15
4 q3 <- 10
5 lambda . hom <- 0.05 # initialization of network output ( homogeneous model )
6
7 Design <- layer_input ( shape = c ( q0 ) , dtype = ’ float32 ’ , name = ’ Design ’)
8 LogVol <- layer_input ( shape = c (1) , dtype = ’ float32 ’ , name = ’ LogVol ’)
9
10 Network = Design %>%
11 layer_dense ( units = q1 , activation = ’ tanh ’ , name = ’ hidden1 ’) %>%
12 layer_dense ( units = q2 , activation = ’ tanh ’ , name = ’ hidden2 ’) %>%
13 layer_dense ( units = q3 , activation = ’ tanh ’ , name = ’ hidden3 ’) %>%
14 layer_dense ( units =1 , activation = ’ linear ’ , name = ’ Network ’ ,
15 weights = list ( array (0 , dim = c (10 ,1)) , array ( log ( lambda . hom ) , dim = c (1))))
16
17 Response = list ( Network , LogVol ) %>% layer_add ( name = ’ Add ’) %>%
18 layer_dense ( units =1 , activation = ’ exponential ’ , name = ’ Response ’ ,
19 trainable = FALSE , weights = list ( array (1 , dim = c (1 ,1)) , array (0 , dim = c (1))))
20
21 model <- keras_model ( inputs = c ( Design , LogVol ) , outputs = c ( Response ))
22 model %>% compile ( optimizer = optimizer_nadam () , loss = ’ poisson ’)

39
Deep FNN with (q1, q2, q3) = (20, 15, 10) with Offset

1 Layer ( type ) Output Shape Param # Connected to


2 ================================================================================
3 Design ( InputLayer ) ( None , 12) 0
4 ________________________________________________________________________________
5 hidden1 ( Dense ) ( None , 20) 260 Design [0][0]
6 ________________________________________________________________________________
7 hidden2 ( Dense ) ( None , 15) 315 hidden1 [0][0]
8 ________________________________________________________________________________
9 hidden3 ( Dense ) ( None , 10) 160 hidden2 [0][0]
10 ________________________________________________________________________________
11 Network ( Dense ) ( None , 1) 11 hidden3 [0][0]
12 ________________________________________________________________________________
13 LogVol ( InputLayer ) ( None , 1) 0
14 ________________________________________________________________________________
15 Add ( Add ) ( None , 1) 0 Network [0][0]
16 LogVol [0][0]
17 ________________________________________________________________________________
18 Response ( Dense ) ( None , 1) 2 Add [0][0]
19 ================================================================================
20 Total params : 748
21 Trainable params : 746
22 Non - trainable params : 2

40
Application to French MTPL Data

[Figure: deep FNN architecture for the French MTPL data — one-hot encoded covariates (Area, VehPower, VehAge, DrivAge, BonusMalus, VehBrand B1–B6, VehGas, Density, Region R11–R94) feeding the hidden layers towards the ClaimNb output]

Input dimension is q0 = 40 (one-hot encoding), this provides r = 1’306 network parameters.

41
Results of Deep FNN Model

                     epochs   run time   # param.   in-sample    out-of-sample   average
                                                    loss 10−2    loss 10−2       frequency
homogeneous model         –          –          1      32.935         33.861       10.02%
Model GLM1                –        20s         49      31.267         32.171       10.02%
Deep FNN model          250       152s      1’306      30.268         31.673       10.19%

• Network fitting needs quite some run time.

• We perform early stopping.

• The best validation loss model can be retrieved with a callback, see next slide.

• We see a substantial improvement in out-of-sample loss on test data T .

• Balance property fails to hold.

• Remark: AIC is not a sensible model selection criterion for FNNs (early stopping).
42
Callbacks in Gradient Descent Methods
1 path0 <- "./ name0 "
2 CB <- callback_model_checkpoint ( path0 , monitor = " val_loss " ,
3 verbose = 0 , save_best_only = TRUE ,
4 save_weights_only = TRUE )
5
6 model %>% fit ( list ( X . learn , LogVol . learn ) , Y . learn ,
7 validation_split = 0.1 , batch_size = 10000 , epochs = 500 ,
8 verbose = 0 , callbacks = CB )
9
10 load_model_weights_hdf5 ( model , path0 )

[Figure: gradient descent algorithm — training loss (in-sample) and validation loss (out-of-sample) deviance losses (≈ 30.0–32.5) over 500 epochs]

43
• Embedding Layers for Categorical Variables

44
Categorical Variables and Dummy/One-Hot Encoding
B1 0 0 0 0 0 0 0 0 0 0
B10 1 0 0 0 0 0 0 0 0 0
B11 0 1 0 0 0 0 0 0 0 0
B12 0 0 1 0 0 0 0 0 0 0
B13 0 0 0 1 0 0 0 0 0 0
B14 0 0 0 0 1 0 0 0 0 0 each row is in R10
B2 0 0 0 0 0 1 0 0 0 0
B3 0 0 0 0 0 0 1 0 0 0
B4 0 0 0 0 0 0 0 1 0 0
B5 0 0 0 0 0 0 0 0 1 0
B6 0 0 0 0 0 0 0 0 0 1

B1 7→ e1 1 0 0 0 0 0 0 0 0 0 0
B10 7→ e2 0 1 0 0 0 0 0 0 0 0 0
B11 7→ e3 0 0 1 0 0 0 0 0 0 0 0
B12 7→ e4 0 0 0 1 0 0 0 0 0 0 0
B13 7→ e5 0 0 0 0 1 0 0 0 0 0 0
B14 7→ e6 0 0 0 0 0 1 0 0 0 0 0 each row is in R11
B2 7→ e7 0 0 0 0 0 0 1 0 0 0 0
B3 7→ e8 0 0 0 0 0 0 0 1 0 0 0
B4 7→ e9 0 0 0 0 0 0 0 0 1 0 0
B5 7→ e10 0 0 0 0 0 0 0 0 0 1 0
B6 7→ e11 0 0 0 0 0 0 0 0 0 0 1

45
Embeddings for Categorical Variables

• One-hot encoding uses as many dimensions as there are labels (mapping to unit vectors in the Euclidean space).

• All labels have the same distance from each other.

• From Natural Language Processing (NLP) we have learned that there are “better” codings in the sense that we should try to map to low-dimensional Euclidean spaces R^b, and similar labels (w.r.t. the regression task) should have some proximity.

• Choose b ∈ N and consider an embedding mapping (representation)

    e : {B1, . . . , B6} → R^b,    brand ↦ e(brand) := e_brand.

• e_brand ∈ R^b are called embeddings, and optimal embeddings for the regression task can be learned during GDM training. This amounts to adding an additional (embedding) layer to the FNN.
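An embedding layer is just a trainable lookup table; a minimal sketch with random initial weights, which GDM training would then update (the labels are the data's VehBrand levels, the initialization is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)

labels = ["B1", "B10", "B11", "B12", "B13", "B14", "B2", "B3", "B4", "B5", "B6"]
b = 2                                       # embedding dimension
# trainable embedding weights, one row e_brand per label (here: random init)
E = rng.normal(size=(len(labels), b))

idx = {lab: i for i, lab in enumerate(labels)}
e_B12 = E[idx["B12"]]                       # embedding lookup: B12 -> e_B12 in R^2
```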
46
Deep FNN using Embedding Layers (1/2)

[Figure: three network architectures towards the ClaimNb output — (left) one-hot encoding, (middle) b = 1-dimensional embeddings, (right) b = 2-dimensional embeddings for VehBrand and Region]

• Embedding weights are learned during network training (gradient descent).


47
Deep FNN using Embedding Layers (2/2)

1 Design <- layer_input ( shape = c (7) , dtype = ’ float32 ’ , name = ’ Design ’)


2 VehBrand <- layer_input ( shape = c (1) , dtype = ’ int32 ’ , name = ’ VehBrand ’)
3 Region <- layer_input ( shape = c (1) , dtype = ’ int32 ’ , name = ’ Region ’)
4 LogVol <- layer_input ( shape = c (1) , dtype = ’ float32 ’ , name = ’ LogVol ’)
5
6 BrEmb = VehBrand %>%
7 layer_embedding ( input_dim =11 , output_dim =2 , input_length =1 , name = ’ BrEmb ’) %>%
8 layer_flatten ( name = ’ Br_flat ’)
9 ReEmb = Region %>%
10 layer_embedding ( input_dim =22 , output_dim =2 , input_length =1 , name = ’ ReEmb ’) %>%
11 layer_flatten ( name = ’ Re_flat ’)
12
13 Network = list ( Design , BrEmb , ReEmb ) %>% layer_concatenate ( name = ’ concate ’) %>%
14 layer_dense ( units =20 , activation = ’ tanh ’ , name = ’ hidden1 ’) %>%
15 layer_dense ( units =15 , activation = ’ tanh ’ , name = ’ hidden2 ’) %>%
16 layer_dense ( units =10 , activation = ’ tanh ’ , name = ’ hidden3 ’) %>%
17 layer_dense ( units =1 , activation = ’ linear ’ , name = ’ Network ’ ,
18 weights = list ( array (0 , dim = c (10 ,1)) , array ( log ( lambda . hom ) , dim = c (1))))
19
20 Response = list ( Network , LogVol ) %>% layer_add ( name = ’ Add ’) %>%
21 layer_dense ( units =1 , activation = ’ exponential ’ , name = ’ Response ’ , trainable = FALSE ,
22 weights = list ( array (1 , dim = c (1 ,1)) , array (0 , dim = c (1))))
23
24 model <- keras_model ( inputs = c ( Design , VehBrand , Region , LogVol ) , outputs = c ( Response ))

48
Results of Deep FNN Model with Embeddings

                         epochs   run time   # param.   in-sample    out-of-sample   average
                                                        loss 10−2    loss 10−2       frequency
homogeneous model             –          –          1      32.935         33.861       10.02%
Model GLM1                    –        20s         49      31.268         32.171       10.02%
Deep FNN One-Hot            250       152s      1’306      30.268         31.673       10.19%
Deep FNN Emb(b = 1)         700       419s        719      30.245         31.506        9.90%
Deep FNN Emb(b = 2)         600       365s        792      30.165         31.453        9.70%

• Network fitting needs quite some run time.

• We perform early stopping using a callback.

• We see a substantial improvement in out-of-sample loss on test data T .

• Balance property fails to hold.

• Remark: AIC is not a sensible model selection criterion for FNNs (early stopping).
49
Learned Two-Dimensional Embeddings

[Figure: learned two-dimensional embeddings (dimension 1 vs. dimension 2) of VehBrand (left) and Region (right)]

Two-dimensional embeddings can be nicely plotted and interpreted.

50
Special Purpose Layers and Other Features

• Drop-out layers. A method to prevent individual neurons from being over-trained to a certain task is to introduce so-called drop-out layers. A drop-out layer, say, after ’hidden2’ of the above listing would remove during a gradient descent step at random any of the 15 neurons in that layer with a given drop-out probability p ∈ (0, 1), independently of the other neurons. This random removal implies that the composite of the remaining neurons needs to cover the dropped-out neurons sufficiently well. Therefore, a single neuron cannot be over-trained to a certain task because it may need to play several different roles at the same time. Drop-out can be interpreted in terms of ridge regression, see Section 18.6 in Efron–Hastie (2016).

• Normalization layers. Feature activations z^{(m:1)}(x) are scaled back to be centered and have unit variance (similar to the MinMaxScaler).

• Skip connections. Certain layers are skipped in the network architecture, this is
going to be used in the LocalGLMnet chapter.

51
References
• Breiman (2001). Statistical modeling: the two cultures. Statistical Science 16/3, 199-215.
• Cybenko (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and
Systems 2, 303-314.
• Efron (2020). Prediction, estimation, and attribution. Journal American Statistical Association 115/539 , 636-655.
• Efron, Hastie (2016). Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. Cambridge UP.
• Ferrario, Noll, Wüthrich (2018). Insights from inside neural networks. SSRN 3226852.
• Goodfellow, Bengio, Courville (2016). Deep Learning. MIT Press.
• Grohs, Perekrestenko, Elbrächter, Bölcskei (2019). Deep neural network approximation theory. IEEE Transactions on
Information Theory.
• Hastie, Tibshirani, Friedman (2009). The Elements of Statistical Learning. Springer.
• Hornik, Stinchcombe, White (1989). Multilayer feedforward networks are universal approximators. Neural Networks
2, 359-366.
• Kingma, Ba (2014). Adam: A method for stochastic optimization. arXiv:1412.6980.
• Leshno, Lin, Pinkus, Schocken (1993). Multilayer feedforward networks with a nonpolynomial activation function can
approximate any function. Neural Networks 6/6, 861-867.
• Nesterov (2007). Gradient methods for minimizing composite objective function. Technical Report 76, Center for
Operations Research and Econometrics (CORE), Catholic University of Louvain.
• Noll, Salzmann, Wüthrich (2018). Case study: French motor third-party liability claims. SSRN 3164764.
• Richman (2020a/b). AI in actuarial science – a review of recent advances – part 1/2. Annals of Actuarial Science.
• Rumelhart, Hinton, Williams (1986). Learning representations by back-propagating errors. Nature 323/6088, 533-536.
• Schelldorfer, Wüthrich (2019). Nesting classical actuarial models into neural networks. SSRN 3320525.
• Shmueli (2010). To explain or to predict? Statistical Science 25/3, 289-310.
• Wüthrich, Buser (2016). Data Analytics for Non-Life Insurance Pricing. SSRN 2870308, Version September 10, 2020.
• Wüthrich, Merz (2021). Statistical Foundations of Actuarial Learning and its Applications. SSRN 3822407.
52
Discrimination-Free Insurance Pricing

Mario V. Wüthrich
RiskLab, ETH Zurich

“Deep Learning with Actuarial Applications in R”


Swiss Association of Actuaries SAA/SAV, Zurich
October 14/15, 2021
Programme SAV Block Course

• Refresher: Generalized Linear Models (THU 9:00-10:30)

• Feed-Forward Neural Networks (THU 13:00-15:00)

• Discrimination-Free Insurance Pricing (THU 17:15-17:45)

• LocalGLMnet (FRI 9:00-10:30)

• Convolutional Neural Networks (FRI 13:00-14:30)

• Wrap Up (FRI 16:00-16:30)

1
Contents: Discrimination-Free Insurance Pricing

• Direct discrimination

• Indirect discrimination

• Unawareness price

• Discrimination-free price

2
• Direct Discrimination

3
Best-Estimate Pricing

• Basic pricing setup is given by


⋆ Y denotes the claim costs;
⋆ X denotes non-discriminatory covariates;
⋆ D denotes discriminatory covariates.

• Develop regression model for Y using covariates X and D as explanatory variables.

• This motivates best-estimate price for Y

µ(X, D) = E [Y | X, D] .

• The best-estimate price


⋆ uses maximal available information X and D;
⋆ minimizes prediction uncertainty (in an L2-sense);
⋆ is discriminatory w.r.t. D.
4
Best-Estimate Price: Example

[Figure: best-estimate prices µ(X, D) (price, 80–120) against age X (20–100), for females and males]

• Best-estimate prices µ(X, D) using all available information


⋆ with non-discriminatory age information X;
⋆ with discriminatory gender information D.

5
Best-Estimate Price: Direct Discrimination

[Figure: best-estimate prices µ(X, D) against age X for females and males]

• Article 2(a):1 “direct discrimination: where one person is treated less favourably,
on grounds of sex...”

• Intuitive guess for discrimination-free price?


1
COUNCIL DIRECTIVE 2004/113/EC of 13 December 2004, Official Journal of the European Union L 373/37
6
Best-Estimate Price: Direct Discrimination

[Figure: best-estimate prices against age X for females and males, together with the intuitive discrimination-free price]

• Article 2(a): “direct discrimination: where one person is treated less favourably,
on grounds of sex...”

• Intuitive guess for discrimination-free price.

7
• Unawareness Price and Indirect Discrimination

8
Unawareness Price

• Direct discrimination can be avoided by dropping discriminatory information D.

• This provides unawareness price for Y

µ(X) = E [Y | X] .

• The unawareness price


⋆ uses maximal available non-discriminatory information X;
⋆ minimizes prediction uncertainty (in an L2-sense w.r.t. X);
⋆ is the best approximation to the best-estimate price µ(X, D);
⋆ avoids direct discrimination.

9
Unawareness Price: Example

[Figure: unawareness price µ(X) against age X, together with the best-estimate prices for females and males and the intuitive discrimination-free price]

• What goes “wrong” here?

10
Unawareness Price: Example

[Figure: unawareness price µ(X) against age X, together with the best-estimate prices for females and males and the intuitive discrimination-free price]

• What goes “wrong” here?


[Figure: population distribution P[D = female | X = age] (0.0–1.0) against age X, with the population average]

11
What Goes “Wrong” with the Unawareness Price?

• The unawareness price can be expressed as (tower property)

    µ(X) = E[ µ(X, D) | X ] = ∫ µ(X, D = d) dP(D = d | X).

• This shows that we infer D from X in the unawareness price.

• Article 2(b):2 “indirect discrimination: where an apparently neutral provision...


would put persons of one sex at a particular disadvantage compared with persons
of the other sex, unless that provision... is objectively justified...”

2
COUNCIL DIRECTIVE 2004/113/EC of 13 December 2004, Official Journal of the European Union L 373/37
12
• Discrimination-Free Price

13
Discrimination-Free Pricing

• The unawareness price can be expressed as (tower property)

    µ(X) = E[ µ(X, D) | X ] = ∫ µ(X, D = d) dP(D = d | X).

• We need to “break the structure” that allows to infer D from X.

• This motivates the discrimination-free price

    µ∗(X) = ∫ µ(X, D = d) dP∗(D = d),

for some choice P∗ (there are infinitely many).
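A numerical sketch of the two integrals; all prices and probabilities below are hypothetical toy values, not the slides' example.

```python
# best-estimate prices mu(X, D = d) at one fixed age X, for two gender labels d
mu_best = {"female": 110.0, "male": 90.0}

p_cond = {"female": 0.8, "male": 0.2}   # P(D = d | X): inferable from X
p_star = {"female": 0.5, "male": 0.5}   # P*(D = d): e.g. population distribution

# unawareness price integrates over P(D|X) -> indirect discrimination
mu_unaware = sum(mu_best[d] * p_cond[d] for d in mu_best)   # = 106.0
# discrimination-free price integrates over P*(D), breaking the X -> D link
mu_free = sum(mu_best[d] * p_star[d] for d in mu_best)      # = 100.0
```

The unawareness price tilts towards the gender that is more likely at this age, while the discrimination-free price weights the genders independently of X.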

14
Discrimination-Free Price: Example

[Figure: discrimination-free price µ∗(X) against age X, together with the best-estimate prices for females and males, the intuitive discrimination-free price and the unawareness price]

• For the population distribution P∗(D) = P(D).

[Figure: population distribution P[D = female | X = age] against age X, with the population average]

15
Concluding Remarks

• We need to collect discriminatory information D, otherwise we cannot calculate


discrimination-free prices, i.e. just knowledge of X is not good enough.

• Lindholm et al. (2020) give a mathematical definition of (in-)direct discrimination.

• For any given problem there are infinitely many choices of P∗, and hence there are infinitely many discrimination-free prices.

• Discrimination-free prices need to be made unbiased.

• Discrimination-free prices sacrifice predictive power relative to unawareness prices.

• Discrimination-free prices can be motivated by “do-operators” in causal statistics


(confounders), see Pearl et al. (2016).

• Discrimination-free prices have same structure as partial dependence plots (PDPs),


see Zhao–Hastie (2019) and Lorentzen–Mayer (2020).
16
• Definition of discrimination-free prices is independent of any model.

• Discrimination-free prices may induce unwanted economic side effects like adverse
selection.

• Indirect discrimination can be explained by the fact that non-discriminatory


covariates are used to predict discriminatory ones. The better information we
have, the more accurately this can be done.

• We did not discuss fairness nor which variables are discriminatory (ethnicity, etc.).

17
References
• Chen, Guillén, Vigna (2018). Solvency requirement in a unisex mortality model. ASTIN Bulletin 48/3, 1219-1243.
• Chen, Vigna (2017). A unisex stochastic mortality model to comply with EU Gender Directive. Insurance: Mathematics
and Economics 73, 124-136.
• Frees, Huang (2020). The discriminating (pricing) actuary. SSRN 3592475.
• Lindholm, Richman, Tsanakas, Wüthrich (2020). Discrimination-free insurance pricing. SSRN 3520676. To appear in
ASTIN Bulletin 2022.
• Lorentzen, Mayer (2020). Peeking into the black box: an actuarial case study for interpretable machine learning.
SSRN 3595944.
• Pearl, Glymour, Jewell (2016). Causal Inference in Statistics: A Primer. Wiley.
• Zhao, Hastie (2019). Causal interpretations of black-box models. Journal of Business & Economic Statistics.

18
LocalGLMnet and more

Mario V. Wüthrich
RiskLab, ETH Zurich

“Deep Learning with Actuarial Applications in R”


Swiss Association of Actuaries SAA/SAV, Zurich
October 14/15, 2021
Programme SAV Block Course

• Refresher: Generalized Linear Models (THU 9:00-10:30)

• Feed-Forward Neural Networks (THU 13:00-15:00)

• Discrimination-Free Insurance Pricing (THU 17:15-17:45)

• LocalGLMnet (FRI 9:00-10:30)

• Convolutional Neural Networks (FRI 13:00-14:30)

• Wrap Up (FRI 16:00-16:30)

1
Contents: LocalGLMnet and more

• Balance property for neural networks

• Multiplicity of equally good FNN models

• The nagging predictor

• LocalGLMnet: interpretable deep learning

2
• Balance Property for Neural Networks

3
Balance Property for FNN Models

                              epochs   run time   # param.   in-sample      out-of-sample   average
                                                             loss in 10−2   loss in 10−2    frequency
homogeneous model                  –          –          1        32.935          33.861      10.02%
Model GLM1                         –        20s         49        31.267          32.171      10.02%
Deep FNN One-Hot                 250       152s      1’306        30.268          31.673      10.19%
Deep FNN Emb(b = 1)              700       419s        719        30.245          31.506       9.90%
Deep FNN Emb(b = 2)              600       365s        792        30.165          31.453       9.70%
Deep FNN Emb(b = 2) seed 1         –       365s        792        30.411          31.503       9.90%
Deep FNN Emb(b = 2) seed 2         –       365s        792        30.352          31.418      10.23%
Deep FNN Emb(b = 2) seed 3         –       365s        792        30.315          31.500       9.61%

• Balance property fails to hold for FNN regression models.

• The reason is early stopping which prevents from being in a critical point of the
deviance loss D∗(Y , ·) (under canonical link choice).

4
Critical Points of Deviance Loss Function

[Figure: loss function (view 2)]

Under the canonical link choice, the balance property is fulfilled in the critical points.

5
Failure of Balance Property of FNNs

[Figure: balance property over 30 SGD calibrations — box plot of the average frequencies, ranging from about 0.096 to 0.102]

The failure of the balance property is significant.

6
Seeds and Randomness Involved in SGD

1 layer_dense ( units = q1 , activation = " tanh " ,
2 kernel_initializer = initializer_glorot_uniform () ,
3 bias_initializer = ’ zeros ’) %>%
4 layer_dropout ( rate =0.05)
5
6 model %>% fit (X , Y , validation_split = 0.2 , batch_size = 10000 , epochs = 500)

[Figure: loss function (view 2), and a toy network with inputs age and ac and response claims]

7
Representation Learning: Additional GLM Step

[Figure: toy network with inputs age and ac and response claims]

• Network mapping with link g

    xi ↦ g(µ̂i) = g(Ê[Yi]) = ⟨β̂, ẑ^{(d:1)}(xi)⟩,

with GDM-fitted network parameter ϑ̂ = (ŵ, β̂) ∈ R^r.

• The mapping xi ↦ ẑi = ẑ^{(d:1)}(xi) should be understood as a learned representation.

• Idea (with canonical link choice g = h): consider a GLM with the new covariates ẑi.

• If the design matrix Ẑ ∈ R^{n×(qd+1)} has full rank qd + 1 ≤ n, we have a unique MLE β̂^{MLE}, and the balance property will be fulfilled under the canonical link choice.
8
Rectifying the Balance Property

[Figure: toy network with inputs age and ac and response claims]

• Network mapping with link g

    xi ↦ g(µ̂i) = g(Ê[Yi]) = ⟨β̂, ẑ^{(d:1)}(xi)⟩,

with GDM-fitted network parameter ϑ̂ = (ŵ, β̂) ∈ R^r.

• Choose β̂^{MLE} for the design matrix Ẑ ∈ R^{n×(qd+1)}

    xi ↦ h(µ̂i) = h(Ê[Yi]) = ⟨β̂^{MLE}, ẑ^{(d:1)}(xi)⟩,

for the canonical link h.

9
R Code for Implementing the Balance Property

1 z . layer <- keras_model ( inputs = model$input ,
2 outputs = get_layer ( model , ’ hidden3 ’) $output )
3
4 learn [ , c (" z1 " ," z2 " ," z3 " ," z4 " ," z5 " ," z6 " ," z7 " ," z8 " ," z9 " ," z10 ")] <-
5 data . frame ( z . layer %>% predict ( list ( Design , VehBrand , Region , LogVol )))
6
7 glm ( ClaimNb ~ z1 + z2 + z3 + z4 + z5 + z6 + z7 + z8 + z9 + z10 ,
8 data = learn , offset = log ( Exposure ) , family = poisson ())

• The R code considers the Poisson model with canonical link g = h = log.

• If we do not have the canonical link, we still need to adjust the intercept β̂0^{MLE}.

• The additional GLM step may lead to over-fitting; if the size qd of the last hidden FNN layer is not too large, there won’t be over-fitting.

10
Balance Property for FNN Models

                              epochs   run time   # param.   in-sample      out-of-sample   average
                                                             loss in 10−2   loss in 10−2    frequency
homogeneous model                  –          –          1        32.935          33.861      10.02%
Model GLM1                         –        20s         49        31.267          32.171      10.02%
Deep FNN One-Hot                 250       152s      1’306        30.268          31.673      10.19%
Deep FNN Emb(b = 1)              700       419s        719        30.245          31.506       9.90%
Deep FNN Emb(b = 2)              600       365s        792        30.165          31.453       9.70%
Deep FNN Emb(b = 2) seed 1         –       365s        792        30.411          31.503       9.90%
Deep FNN Emb(b = 2) seed 2         –       365s        792        30.352          31.418      10.23%
Deep FNN Emb(b = 2) seed 3         –       365s        792        30.315          31.500       9.61%
Reg. FNN Emb(b = 2) seed 1         –        +7s        792        30.408          31.488      10.02%
Reg. FNN Emb(b = 2) seed 2         –        +7s        792        30.346          31.418      10.02%
Reg. FNN Emb(b = 2) seed 3         –        +7s        792        30.303          31.462      10.02%

11
Balance Property for FNN Models

[Figure: box plots of in-sample losses (left, ≈ 30.25–30.45) and out-of-sample losses (right, ≈ 31.35–31.65) over 30 SGD calibrations, comparing plain SGD with the balance-property corrected version]
12
• The “Best” FNN Regression Model

13
Multiplicity of Equally Good FNN Models
[Figure: illustration of the loss surface, "loss function (view 2)".]

• Many network parameters ϑ produce the same loss figure for a given objective
function ϑ 7→ D∗(Y , ϑ), i.e. they are “equally good” (on portfolio level).

• The chosen network solution will depend on the initial seed of the algorithm.

• This is very troublesome for insurance pricing!


14
Scatter Plot: In-Sample vs. Out-of-Sample Losses

[Figure: scatter plot of in-sample losses (x-axis, ≈30.1–30.5) against out-of-sample losses (y-axis, ≈31.4–31.7) for 400 SGD calibrations, with a cubic spline fit.]

This example is taken from Richman–Wüthrich (2020) and the in-sample losses are
smaller than in the table above because we used different data cleaning.
15
• The Nagging Predictor

16
The Nagging Predictor

• Breiman (1996) uses bootstrap aggregating = bagging to reduce noise in predictors.

• Aggregate over different network predictors of different calibrations j ≥ 1

μ̄_i^(M) = (1/M) Σ_{j=1}^M μ̂_i^(j) = (1/M) Σ_{j=1}^M μ_{ϑ̂^(j)}(x_i).

• Theorem (generalization loss). Under suitable assumptions we have for deviance loss D*(Y_i, ·)

E[ D*(Y_i, μ̄_i^(M)) ] ≥ E[ D*(Y_i, μ̄_i^(M+1)) ] ≥ E[ D*(Y_i, μ_i) ],

and there is also a corresponding asymptotic normality result (CLT).
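The first inequality rests on the convexity of the deviance loss in the prediction: by Jensen's inequality, the loss of the averaged predictor is at most the average of the individual losses. A small numpy sketch of this pointwise fact, with hypothetical simulated predictors in place of fitted networks:

```python
import numpy as np

def poisson_deviance(y, mu):
    """Unit Poisson deviance 2*(mu - y + y*log(y/mu)), with 0*log(0) := 0."""
    ratio = np.where(y > 0, y / mu, 1.0)
    return 2.0 * (mu - y + np.where(y > 0, y * np.log(ratio), 0.0))

rng = np.random.default_rng(1)
n, M = 10_000, 20
y = rng.poisson(0.1, size=n)                          # observed claim counts
# M "calibrations": noisy versions of the true frequency 0.1 (hypothetical)
mus = 0.1 * np.exp(rng.normal(0.0, 0.3, size=(M, n)))

avg_of_losses = np.mean([poisson_deviance(y, m).mean() for m in mus])
loss_of_avg = poisson_deviance(y, mus.mean(axis=0)).mean()  # nagging predictor

# the deviance is convex in mu, so averaging predictors cannot hurt
assert loss_of_avg <= avg_of_losses
```

This convexity argument explains why the out-of-sample loss of the nagging predictor decreases monotonically in M in the table below.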

17
Nagging Predictor: Car Insurance Example
[Figure: out-of-sample losses (≈31.30–31.55) of the nagging predictor for index M = 1, …, 100, with a 1 standard deviation band.]

• After M = 20 iterations: the out-of-sample loss on the portfolio has converged.

• After M = 40 iterations: confidence bounds are narrow.


18
Car Insurance Frequency Poisson Example

                            epochs   run time   # param.   in-sample      out-of-sample   average
                                                           loss in 10^-2  loss in 10^-2   frequency
homogeneous model                –          –          1         32.935          33.861      10.02%
Model GLM1                       –        20s         49         31.267          32.171      10.02%
Deep FNN One-Hot               250       152s      1'306         30.268          31.673      10.19%
Deep FNN Emb(b = 1)            700       419s        719         30.245          31.506       9.90%
Deep FNN Emb(b = 2)            600       365s        792         30.165          31.453       9.70%
Deep FNN Emb(b = 2) seed 1       –       365s        792         30.411          31.503       9.90%
Deep FNN Emb(b = 2) seed 2       –       365s        792         30.352          31.418      10.23%
Deep FNN Emb(b = 2) seed 3       –       365s        792         30.315          31.500       9.61%
Reg. FNN Emb(b = 2) seed 1       –        +7s        792         30.408          31.488      10.02%
Reg. FNN Emb(b = 2) seed 2       –        +7s        792         30.346          31.418      10.02%
Reg. FNN Emb(b = 2) seed 3       –        +7s        792         30.303          31.462      10.02%
Nagging predictor (M = 400)      –          –          –         30.060          31.272      10.02%

The nagging predictor leads to a clear model improvement.

19
Stability on Individual Policy Level

• Coefficient of variation over the M network calibrations, on individual policy i:

ĈoV_i = σ̂_i / μ̄_i^(M) = sqrt( (1/(M−1)) Σ_{j=1}^M ( μ̂_i^(j) − μ̄_i^(M) )² ) / μ̄_i^(M).

• Individual policy level: average over M = 400 networks to get a coefficient of
variation (CoV) of 1/√400 = 5%.
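The per-policy coefficient of variation is computed directly from the M prediction vectors. A hypothetical numpy sketch, with simulated lognormal predictions in place of the fitted networks:

```python
import numpy as np

rng = np.random.default_rng(2)
M, n = 400, 1000
# hypothetical predictions of M independently fitted networks for n policies
mus = 0.1 * np.exp(rng.normal(0.0, 0.2, size=(M, n)))

mu_bar = mus.mean(axis=0)                 # nagging predictor per policy
sigma_hat = mus.std(axis=0, ddof=1)       # empirical std with 1/(M-1)
cov_hat = sigma_hat / mu_bar              # estimated CoV per policy

# averaging M predictors shrinks the Monte Carlo error of mu_bar
# by 1/sqrt(M); for M = 400 this is a factor of 1/20 = 5%
cov_of_mean = cov_hat / np.sqrt(M)
assert 0.15 < cov_hat.mean() < 0.25       # lognormal(0, 0.2) has CoV ~ 0.20
```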
20
Fit Meta Model to Nagging Predictor

21
• LocalGLMnet: interpretable deep learning

22
Explainability of Deep FNN Predictors

• Network predictors are criticized for not being explainable (black box).

• There are many tools to make network predictors explainable a posteriori:


? partial dependence plots (PDPs)
? accumulated local effects (ALEs)
? local interpretable model-agnostic explanations (LIME)
? SHapley Additive exPlanation (SHAP)
? marginal attribution by conditioning on quantiles (MACQ)

• Network predictors do not allow for variable selection.

• LocalGLMnet is an architecture that is explainable and allows for variable selection.

23
LocalGLMnet Architecture
• A GLM has the following regression structure for parameter β ∈ R^q

g(μ(x)) = β0 + ⟨β, x⟩ = β0 + Σ_{j=1}^q βj xj.

• Idea: estimate the regression parameter β = β(x) with a FNN.

• Define regression attentions through a deep FNN

β : R^q → R^q,  x ↦ β(x) = z^(d:1)(x) = (z^(d) ∘ · · · ∘ z^(1))(x).

• The LocalGLMnet is defined by the additive decomposition

g(μ(x)) = β0 + ⟨β(x), x⟩ = β0 + Σ_{j=1}^q βj(x) xj.
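A minimal numpy sketch of this forward pass, with a hypothetical two-layer tanh network playing the role of the attention mapping β(·) and a log link g:

```python
import numpy as np

rng = np.random.default_rng(3)
q, h = 5, 8                               # covariate / hidden dimensions (hypothetical)
W1, b1 = 0.1 * rng.normal(size=(h, q)), np.zeros(h)
W2, b2 = 0.1 * rng.normal(size=(q, h)), np.zeros(q)
beta0 = -2.0                              # intercept

def attention(x):
    """FNN regression attention beta: R^q -> R^q."""
    return W2 @ np.tanh(W1 @ x + b1) + b2

def localglmnet(x):
    """mu(x) = g^{-1}(beta0 + <beta(x), x>) with log link g."""
    return np.exp(beta0 + attention(x) @ x)

x = rng.normal(size=q)
mu = localglmnet(x)
assert mu > 0 and np.isfinite(mu)
```

The skip connection ⟨β(x), x⟩ is exactly what the `layer_dot` call realizes in the keras implementation below.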

24
Interpretation of LocalGLMnet Architecture

• The LocalGLMnet is defined by the additive decomposition

g(μ(x)) = β0 + ⟨β(x), x⟩ = β0 + Σ_{j=1}^q βj(x) xj.

• We consider the different cases of regression attentions βj(x):

? If βj(x) ≡ βj: GLM term βj xj.
? If βj(x) ≡ 0: drop term xj.
? If βj(x) = βj(xj): no interactions with covariates xj′ for j′ ≠ j.
? Test for interactions: calculate and analyze the gradients

∇βj(x) = ( ∂βj(x)/∂x1, …, ∂βj(x)/∂xq )⊤ ∈ R^q.

? Careful, there is no identifiability: e.g., βj(x) = xj′/xj gives βj(x) xj = xj′.

25
Implementation of LocalGLMnet

 1 Design = layer_input(shape = c(q0), dtype = 'float32', name = 'Design')
 2 #
 3 Attention = Design %>%
 4   layer_dense(units = q1, activation = 'tanh', name = 'hidden1') %>%
 5   layer_dense(units = q2, activation = 'tanh', name = 'hidden2') %>%
 6   layer_dense(units = q3, activation = 'tanh', name = 'hidden3') %>%
 7   layer_dense(units = q0, activation = 'linear', name = 'Attention')
 8
 9 Response = list(Design, Attention) %>% layer_dot(axes = 1) %>%
10   layer_dense(units = 1, activation = 'linear', name = 'Response')
11 #
12 model <- keras_model(inputs = c(Design), outputs = c(Response))

Line 10 is needed to bring in the intercept β0; there are other ways to do so,
see also the next slide for an explanation.

26
Remarks and Preparation of Example

• There are other ways of implementing the intercept. Current solution:

g(μ(x)) = α0 + α1 Σ_{j=1}^q βj(x) xj.

• Categorical variables should use either:

? one-hot encoding, or
? dummy coding with normalization to zero mean and unit variance.

These codings allow for interactions across all levels.

• We normalize all covariates to zero mean and unit variance: comparability!

• Fitting is done completely analogously to a FNN with SGD. The resulting networks
are competitive with classical FNNs.
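The normalization step is done per column; a minimal numpy sketch (illustrative covariates, not the course data):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.exponential(2.0, size=(1000, 3))          # raw covariates (hypothetical)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)      # zero mean, unit variance

assert np.allclose(X_std.mean(axis=0), 0.0, atol=1e-12)
assert np.allclose(X_std.std(axis=0), 1.0)
```

After this standardization the regression attentions βj(x) live on a common scale and can be compared across covariates.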

27
Example: Regression Attentions LocalGLMnet
[Figure: fitted LocalGLMnet regression attentions β̂j(x) in six panels: Area Code, Bonus-Malus Level, Density (top row) and Driver's Age, Vehicle Age, Vehicle Power (bottom row). Each panel plots beta(x) against the covariate, together with the zero line and the 0.1% significance level.]
28
Confidence Bounds for Variable Selection

• Add (two) purely random covariates (with different distributions)

x_{i,q+1} ~ i.i.d. U[−√3, √3]  and  x_{i,q+2} ~ i.i.d. N(0, 1).

These covariates are standardized.

• Consider the extended regression function for x⁺ = (x1, …, xq, xq+1, xq+2)⊤ ∈ R^(q+2)

g(μ(x⁺)) = β0⁺ + ⟨β⁺(x⁺), x⁺⟩ = β0⁺ + Σ_{j=1}^{q+2} βj⁺(x⁺) xj.

• The magnitudes of β̂⁺_{q+1}(x⁺) and β̂⁺_{q+2}(x⁺) determine the size of insignificant components.

29
Confidence Bounds for Variable Selection

• The magnitudes of β̂⁺_{q+1}(x⁺) and β̂⁺_{q+2}(x⁺) determine the size of insignificant components.

• Determine the empirical mean and standard deviation for j = q+1, q+2

b̄_j = (1/n) Σ_{i=1}^n β̂_j⁺(x_i⁺)  and  ŝ_j = sqrt( (1/(n−1)) Σ_{i=1}^n ( β̂_j⁺(x_i⁺) − b̄_j )² ).

• The null hypothesis H0 : βj(x) = 0 for component j on significance level α ∈ (0, 1/2)
can be rejected if the coverage ratio of the following interval

I_α = [ q_N(α/2) · ŝ_{q+1}, q_N(1 − α/2) · ŝ_{q+1} ]

is substantially smaller than 1 − α, where q_N(p) denotes the standard Gaussian
quantile at level p ∈ (0, 1).
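The coverage-ratio test can be sketched with simulated attentions: calibrate I_α on a purely random component and compare the coverage of a signal component against it. All numbers below are hypothetical.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(5)
n, alpha = 10_000, 0.001
# attentions of the purely random covariate: noise around zero (hypothetical)
beta_noise = rng.normal(0.0, 0.05, size=n)
# attentions of a genuinely informative covariate (hypothetical)
beta_signal = 0.3 + rng.normal(0.0, 0.05, size=n)

s_hat = beta_noise.std(ddof=1)
lo = NormalDist().inv_cdf(alpha / 2) * s_hat       # interval I_alpha
hi = NormalDist().inv_cdf(1 - alpha / 2) * s_hat

def coverage(b):
    return np.mean((b >= lo) & (b <= hi))

# the noise component is covered at roughly 1 - alpha;
# the signal component falls far outside, so H0 is rejected for it
assert coverage(beta_noise) > 0.99
assert coverage(beta_signal) < 0.05
```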

30
Variable Selection with Cyan Confidence Bounds
[Figure: LocalGLMnet regression attentions β̂j(x) for Density, Vehicle Age and Vehicle Gas; each panel shows beta(x), the zero line, the 0.1% significance level, and the cyan confidence bounds obtained from the purely random covariates.]
● ● ●● ●
●●●●●●● ● ●
● ●●● ● ●●●● ●●● ●● ●
●●●●●● ● ●
●● ● ● ●
● ● ● ● ● ● ● ● ● ● ● ●
● ● ●● ● ●● ●● ● ●●●● ● ●● ● ●● ●●● ● ●● ●●
● ● ●●● ●
●● ●● ● ● ●● ● ●●● ● ● ● ● ● ●
● ● ●
● ● ●
● ●
● ● ● ● ● ●
● ● ● ●
● ●
● ●
● ● ● ● ● ● ● ●● ●● ●●● ●● ● ●● ● ● ●●●● ● ●● ●● ● ●●●
● ● ●●●● ●●● ● ● ● ● ●● ●● ● ● ●●●●● ●
● ●● ● ●
● ●
● ●

● ●
● ●
● ●
● ●

● ●
● ●



● ●
● ●
● ● ●
● ●
● ●



● ● ● ●
● ● ●
● ●●● ● ●●● ● ● ●● ● ● ● ● ●●●● ● ● ●● ● ● ●● ●● ● ● ● ●
● ●
● ●
● ●
● ●
● ● ● ● ●
● ● ● ●

●●
● ● ● ● ●● ● ● ● ● ●● ●● ●●
● ●● ● ● ● ● ● ● ● ● ● ●● ● ●●●
● ●● ●● ●
● ● ●










● ●




● ●











● ●







● ●





● ●






● ●
● ●

●● ●● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
●● ● ●●
● ●● ●● ●
●● ● ●
● ● ●●
● ●●●● ● ●● ● ●
● ● ●
● ●
● ●
● ●
● ●
● ●
● ●
● ●
● ● ●
● ●
● ●
● ● ● ●
● ● ● ●
● ● ●●● ● ● ●● ● ●● ● ● ●● ● ● ●● ● ●
● ●
● ●
● ●
● ●
● ●

● ●
● ●
● ●
● ●
● ●
● ● ● ●


● ●
● ● ● ● ●
● ● ● ● ●●● ●● ● ● ● ●
●●● ● ●

● ●
● ●

● ●

● ●
● ●
● ● ●
● ●
● ● ●
● ●
● ●
● ●
● ●
● ●
● ●
● ● ● ● ● ●
●● ● ● ●● ●● ● ● ● ● ● ● ● ●

● ●
● ●
















● ● ●

● ●



● ● ●

● ●
● ●
● ●





● ●
● ● ● ● ●● ● ● ● ● ●
● ●
● ●
● ●
● ●
● ●
● ● ●
● ●
● ●
● ●
● ● ● ● ● ●
● ●
● ●
● ● ● ● ●
● ●
● ● ●
● ● ●
● ●
● ● ●
● ● ● ●
● ●
● ● ● ● ● ● ● ●
● ● ● ● ● ● ● ● ● ● ● ●
● ● ● ● ● ●
● ● ● ● ● ● ● ● ●



● ●








● ●




● ●
● ●

● ●

● ●















● ●
● ● ●
● ●
● ●
● ● ● ●



● ●

● ●
● ●




● ●




● ●

● ●




● ●




● ●

● ●



● ●



● ●

● ● ● ● ●
● ●
● ●
● ●
● ●
● ●
● ●

● ●
● ● ●
● ●
● ●
● ● ● ●
● ● ● ● ●
● ● ● ● ●
● ●
● ●

● ●
● ● ●

● ●
● ● ●
● ●
● ●
● ●
● ● ● ● ●
● ● ● ●

● ●
● ●
● ●
● ● ● ●
● ●
● ● ●
● ● ● ● ● ●
● ● ● ●
● ● ● ●
● ● ● ●
● ● ●
● ●

● ●
● ● ●
● ● ●
● ●
● ●
● ● ● ● ● ● ●

● ●
● ● ●
● ●
● ● ● ●
● ●
● ●
● ●
● ● ●
● ● ●
● ● ●
● ● ● ●
● ● ●
● ● ●
● ●
● ● ● ● ●
● ●
● ● ● ●
● ●
● ●
● ●
● ● ●

● ● ●
● ●
● ● ● ● ●
● ● ● ● ●
● ● ● ●

● ● ● ● ●

● ● ●
● ● ●
● ● ●

● ●
● ● ●

● ●
● ●
● ● ●

● ●
● ●
● ● ●
● ● ●


● ● ● ●
● ● ● ● ● ● ●
● ● ● ● ●
● ●
● ●
● ●
● ●
● ●
● ● ●
● ● ●
● ● ● ●
● ●
● ● ●
● ●
● ●
● ● ●
● ●
● ● ●
● ● ● ●
● ● ●
● ●
● ●
● ● ● ●
● ●
● ● ● ● ● ●
● ●
● ● ● ● ●
● ● ● ● ●
● ● ● ● ● ●
● ●
● ●
● ●
● ● ●
● ●
● ● ● ●
● ● ●
● ●
● ● ● ●
● ● ● ● ● ● ● ● ●
● ● ●
● ● ●
● ●
● ● ●
● ● ●
● ● ● ●
● ●
● ●
● ●
● ●
● ●
● ●
● ● ● ● ● ● ●
● ● ● ● ●
● ● ● ● ●
● ● ●
● ●
● ● ● ● ● ●
● ● ●
● ●
● ●
● ● ● ●
● ●

● ● ● ●
● ● ●
● ●
● ●
● ● ● ●
● ●
● ● ● ●
● ● ● ● ● ● ● ●
● ● ●
● ● ●
● ●
● ●
● ● ●
● ●
● ● ● ●
● ● ● ● ● ● ● ● ● ●
● ●

● ● ●
● ● ● ● ● ● ● ● ●
● ● ●
● ● ● ●
● ● ●
● ● ●
● ●
● ● ● ●
● ●

● ● ●
● ●
● ● ●

● ● ● ● ● ●
● ● ● ● ● ●
● ●
● ● ● ●
● ● ● ●
● ● ● ● ●
● ● ● ●

● ● ● ● ●
● ● ●
● ● ● ● ● ●
● ● ● ● ● ● ● ●
● ●
● ● ● ●
● ● ● ●
● ● ● ●
● ●
● ●
−0.5

−0.5

−0.5

● beta(x) ● beta(x) ● beta(x)


zero line zero line zero line
0.1% significance level 0.1% significance level 0.1% significance level

2 4 6 8 10 0 5 10 15 20 Diesel Regular
Density Vehicle Age Vehicle Gas

[Figure: regression attentions β̂j(x) for Vehicle Power, RandU and RandN; each panel shows beta(x), the zero line and the 0.1% significance level.]

31
Covariate Contributions β̂j(x) xj

[Figure: feature contributions β̂j(x) xj for Bonus-Malus Level, Density and Driver's Age; each panel shows beta(x), the zero line and a spline fit.]

[Figure: feature contributions β̂j(x) xj for Vehicle Age and Vehicle Gas; each panel shows beta(x), the zero line and (for Vehicle Age) a spline fit.]

32
Interactions between Covariate Components

[Figure: interaction strengths of the feature components Driver's Age and Vehicle Age with Vehicle Age, Driver's Age, Bonus-Malus Level, Vehicle Gas and Density.]

\[
\nabla \beta_j(x) = \left( \frac{\partial}{\partial x_1}\,\beta_j(x), \ldots, \frac{\partial}{\partial x_q}\,\beta_j(x) \right)^{\top} \in \mathbb{R}^q.
\]

33
Calculation of Gradients in keras

j <- 1  # select the feature component
#
beta.j <- Attention %>% layer_lambda(function(x) x[, j])
model.j <- keras_model(inputs = c(Design), outputs = c(beta.j))
#
grad <- beta.j %>% layer_lambda(function(x) k_gradients(model.j$outputs,
                                                        model.j$inputs))
model.grad <- keras_model(inputs = c(Design), outputs = c(grad))
#
grad.beta <- data.frame(model.grad %>% predict(as.matrix(XX)))

In different TensorFlow/keras versions this may be slightly different.
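Whatever the framework version, the gradients returned by `k_gradients` can be sanity-checked by central finite differences. A minimal NumPy sketch; the quadratic `beta_j` below is a hypothetical stand-in for a fitted attention component, not the fitted model:

```python
import numpy as np

def beta_j(x):
    # hypothetical stand-in for a fitted attention component beta_j(x)
    return 0.5 * x[0] ** 2 - x[1] * x[2]

def numeric_gradient(f, x, h=1e-5):
    """Central finite-difference approximation of the gradient of f at x."""
    grad = np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x)
        e[k] = h
        grad[k] = (f(x + e) - f(x - e)) / (2.0 * h)
    return grad

x0 = np.array([1.0, 2.0, 3.0])
grad = numeric_gradient(beta_j, x0)
# analytic gradient at x0: (x0[0], -x0[2], -x0[1]) = (1, -3, -2)
```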

34
Variable/Term Importance

[Figure: variable importance bar chart (scale 0.0–0.7) for Bonus-Malus, Driver's Age, Density, Vehicle Age, Vehicle Gas, Vehicle Power, Area Code, RandN, RandU.]

\[
\mathrm{VI}_j = \frac{1}{n} \sum_{i=1}^{n} \left| \widehat{\beta}_j(x_i) \right|.
\]
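With the fitted attentions collected in an n × q matrix (rows = policies i, columns = covariate components j), the variable importances are just column means of absolute values. A small sketch with made-up numbers:

```python
import numpy as np

# made-up fitted attentions beta_hat_j(x_i); rows = policies i, columns = covariates j
beta_hat = np.array([
    [ 0.8, -0.1, 0.0],
    [-0.6,  0.2, 0.0],
    [ 0.7,  0.1, 0.0],
])

# VI_j = (1/n) * sum_i |beta_hat_j(x_i)|
VI = np.abs(beta_hat).mean(axis=0)
# the first component dominates, the last is negligible (like RandU/RandN above)
```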

35
Categorical Covariate Components

[Figure: feature contributions for the categorical covariates Vehicle Brand (levels B1–B14) and French Regions (levels R11–R94), with zero line.]

LASSO regularization within LocalGLMnets: see Richman–Wüthrich (2021b).

36
References
• Breiman (1996). Bagging predictors. Machine Learning 24, 123-140.
• Efron, Hastie (2016). Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. Cambridge UP.
• Ferrario, Noll, Wüthrich (2018). Insights from inside neural networks. SSRN 3226852.
• Goodfellow, Bengio, Courville (2016). Deep Learning. MIT Press.
• Hastie, Tibshirani, Friedman (2009). The Elements of Statistical Learning. Springer.
• Lorentzen, Mayer (2020). Peeking into the black box: an actuarial case study for interpretable machine learning.
SSRN 3595944.
• Noll, Salzmann, Wüthrich (2018). Case study: French motor third-party liability claims. SSRN 3164764.
• Richman (2020a/b). AI in actuarial science – a review of recent advances – part 1/2. Annals of Actuarial Science.
• Richman, Wüthrich (2020). Nagging predictors. Risks 8/3, 83.
• Richman, Wüthrich (2021a). LocalGLMnet: interpretable deep learning for tabular data. SSRN 3892015.
• Richman, Wüthrich (2021b). LASSO regularization within the LocalGLMnet architecture. SSRN 3927187.
• Schelldorfer, Wüthrich (2019). Nesting classical actuarial models into neural networks. SSRN 3320525.
• Schelldorfer, Wüthrich (2021). LocalGLMnet: a deep learning architecture for actuaries. SSRN 3900350.
• Wüthrich (2020). Bias regularization in neural network models for general insurance pricing. European Actuarial
Journal 10/1, 179-202.
• Wüthrich, Buser (2016). Data Analytics for Non-Life Insurance Pricing. SSRN 2870308, Version September 10, 2020.
• Wüthrich, Merz (2019). Editorial: Yes, we CANN! ASTIN Bulletin 49/1, 1-3.
• Wüthrich, Merz (2021). Statistical Foundations of Actuarial Learning and its Applications. SSRN 3822407.

37
Convolutional Neural Networks

Mario V. Wüthrich
RiskLab, ETH Zurich

“Deep Learning with Actuarial Applications in R”


Swiss Association of Actuaries SAA/SAV, Zurich
October 14/15, 2021
Contents: Convolutional Neural Networks

• Spatial and temporal data

• Convolutional neural networks (CNNs)

• Special purpose tools for CNNs

• CNN examples

2
• Spatial and Temporal Data

3
Spatial Objects
[Figure: Swiss Female raw log-mortality rates, ages x = 0, . . . , 92 vs. calendar years t = 1950, . . . , 2016]

• Spatial objects are tensors Z ∈ RI×J×q of order/mode 3, i.e. 3-dimensional arrays.

• The first two components of Z give the location (i, j) in the picture.

• The 3rd component of Z gives the signals in location (i, j). This 3rd component
is called channels. Black-white pictures have q = 1 channel (gray scale), color
pictures have q = 3 channels (RGB channels for red-green-blue).
4
Temporal and Time Series Objects
[Figure: driver 238, trip number 1 — speed, change in angle and acceleration time series over 180 seconds]

• Temporal objects are matrices Z ∈ RT ×q which are tensors of order/mode 2.

• The 1st component of Z gives the location t in the time series.

• The 2nd component of Z gives the signals in location t having q channels.

5
Using FNNs for Time Series Processing

• Assume we have time series information Z = x0:T = (x0, . . . , xT )> ∈ R(T +1)×q
to predict response YT +1

x0:T 7→ µT +1(x0:T ) = E[YT +1|x0:T ] = E[YT +1|x0, . . . , xT ].

• In principle, we could choose a FNN architecture and set


µT +1(x0:T ) = g−1⟨β, z (d:1)(x0:T )⟩.

• This is not recommended:


? Whenever we collect a new observation xT +1 we need to extend the input
dimension of the FNN and re-calibrate it.
? The FNN does not recognize any temporal (topological) structure.
? The latter also applies to spatial objects.

6
RNNs, CNNs and Attention Layers

• There are 3 different ways in network modeling to deal with topological data.

• Recurrent neural networks (RNNs) process information recursively to preserve time


series structure. RNNs are most suitable to predict the next response YT +1 based
on past information x0:T .

• Convolutional neural networks (CNNs) extract local structure from topological


objects preserving the topology. This is done by moving small windows, say, across
the picture and trying to identify specific structure in this window.
? This is similar to rolling windows in financial time series estimation.
? CNNs act more locally, whereas RNNs and FNNs act more globally.

• Attention layers move across the time series and try to pay attention to special
features in the time series, like giving more or less credibility to them. This is
similar to the regression attentions β(x) in LocalGLMnets.

7
• Convolutional Neural Networks (CNNs)

8
Functioning of CNNs
[Figure: speed / change in angle / acceleration time series of drivers 57, 206 and 238, trip number 1]

• Choose a window (called filter), say, of size b × q = 10 × 3.

• b is called filter size, kernel size or band width; q is the number of channels.

• Move with this filter across the time series (in time direction t) and try to spot
specific structure with this filter (in the rolling window).

• The way of finding structure is with a convolution operation ∗.


9
CNNs: More Formally

• Start from an input tensor x ∈ RI×J×q0 of order 3.

• Choose filter sizes (b1, b2, q0)> ∈ N3 with b1 < I and b2 < J.

• A CNN operation is a mapping

z k : RI×J×q0 → R(I−b1+1)×(J−b2+1)
x 7→ z k (x) = (zk;i,j (x))1≤i≤I−b1+1;1≤j≤J−b2+1,

having, for activation function φ : R → R,


 
zk;i,j (x) = φ( wk + Σb1l1=1 Σb2l2=1 Σq0l3=1 wk;l1,l2,l3 xi+l1−1, j+l2−1, l3 ),

for given intercept wk ∈ R and filter weights W k = (wk;l1,l2,l3 )l1,l2,l3 .
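A minimal base-R sketch of this single-filter operation (toy dimensions; not the keras implementation):

```r
# One CNN filter: z_{k;i,j}(x) = phi(w0 + sum over the b1 x b2 x q0 window),
# for an input tensor x of dimension I x J x q0.
conv_filter <- function(x, W, w0 = 0, phi = tanh) {
  b <- dim(W); d <- dim(x)
  z <- matrix(0, d[1] - b[1] + 1, d[2] - b[2] + 1)
  for (i in 1:nrow(z)) for (j in 1:ncol(z))
    z[i, j] <- phi(w0 + sum(W * x[i:(i + b[1] - 1), j:(j + b[2] - 1), , drop = FALSE]))
  z
}
x <- array(1, dim = c(4, 4, 1))       # toy input, all ones
W <- array(1 / 4, dim = c(2, 2, 1))   # one 2 x 2 filter with one channel
conv_filter(x, W)                     # 3 x 3 output, all entries tanh(1)
```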

10
CNNs: Convolution Operation

• Choose the corner (i, j, 1) of the tensor as base point. CNN operation considers

(i, j, 1) + [0 : b1 − 1] × [0 : b2 − 1] × [0 : q0 − 1],

with filter weights W k .

• In fact, we perform a sort of convolution which motivates compact notation

z k : RI×J×q0 → R(I−b1+1)×(J−b2+1)
x 7→ z k (x) = φ (W k ∗ x).

• This convolution operation ∗ reflects one filter with filter weights W k . We can
now choose multiple filters (similar to neurons in FNNs):
? This explains the meaning of the lower index k (which plays the role of different
neurons 1 ≤ k ≤ q1 in FNNs).

11
CNNs: Multiple Filters

• Choose q1 ∈ N filters, each having filter weights W k , 1 ≤ k ≤ q1.

• A CNN layer is a mapping

z CNN : RI×J×q0 → R(I−b1+1)×(J−b2+1)×q1


x 7→ z CNN(x) = (z 1(x), . . . , z q1 (x)),

with filters z k (x) = φ (W k ∗ x).

• Thus, the (spatial) tensor


x ∈ RI×J×q0 ,
with q0 channels is mapped to a (spatial) tensor

z CNN(x) ∈ R(I−b1+1)×(J−b2+1)×q1 ,

with q1 filters.
12
Properties of CNNs

• The convolution operation ∗ respects the local structure.

• FNNs extract global structure, CNNs extract local structure.

• Formally, the global scalar product z k (x) = φ⟨wk , x⟩ of FNNs is replaced by a
local convolution z k (x) = φ(W k ∗ x) for CNNs.

• CNNs have translation invariance properties, see Wiatowski–Bölcskei (2018).

• CNNs generally use less parameters than FNNs and RNNs, because filter weights
are re-used/re-located.

13
• Special Purpose Tools for CNNs

14
CNNs: Padding with Zeros

• A CNN layer reduces the spatial size of the output tensor

z CNN : RI×J×q0 → R(I−b1+1)×(J−b2+1)×q1


x 7→ z CNN(x) = (z 1(x), . . . , z q1 (x)).

• If this is an undesired feature, padding with zeros can be applied at all edges to
obtain

z CNN : RI×J×q0 → RI×J×q1


x 7→ z CNN(x) = (z 1(x), . . . , z q1 (x)).

• Remark that padding does not add any additional parameters, but it is only used
to reshape the output tensor.

15
CNNs: Stride

• Strides are used to skip part of the input tensor x to reduce the size of the output.
This may be useful if the input tensor is a very high resolution image.

• Choose stride parameters s1 and s2. Consider modified convolution


Σl1 Σl2 Σl3 wk;l1,l2,l3 xs1(i−1)+l1, s2(j−1)+l2, l3 .

• This considers windows

(s1(i − 1), s2(j − 1), 1) + [1 : b1] × [1 : b2] × [0 : q0 − 1].

16
CNNs: Dilation

• Dilation is similar to stride, though, different in that it enlarges the filter sizes
instead of skipping certain positions in the input tensor.

• Choose dilation parameters e1 and e2. Consider modified convolution


Σl1 Σl2 Σl3 wk;l1,l2,l3 xi+e1(l1−1), j+e2(l2−1), l3 .

• This considers

(i, j, 1) + e1 [0 : b1 − 1] × e2 [0 : b2 − 1] × [0 : q0 − 1].

17
CNNs: Max-Pooling Layers
• Pooling layers help to reduce the sizes of the tensors.

• We choose fixed window sizes b1 and b2 and strides s1 = b1 and s2 = b2; this
gives a partition (disjoint windows).

• The max-pooling layer then considers


z Max : RI×J×q0 → RI′×J′×q0
x 7→ z Max(x) = MaxPool(x),

with I′ = ⌊I/b1⌋ and J′ = ⌊J/b2⌋ (cropping last columns by default), and where
the convolution operation ∗ is replaced by a max operation (modulo channels).

• This extracts the maximums from the (spatially disjoint) windows

[b1(i − 1) + 1 : b1i] × [b2(j − 1) + 1 : b2j] × [k],

for each channel 1 ≤ k ≤ q0, individually.
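A minimal base-R sketch of this pooling operation on one channel (toy values, not the keras implementation):

```r
# Max-pooling with window sizes b1 x b2 and strides s1 = b1, s2 = b2
# (one channel; each channel of a tensor is pooled in the same way).
max_pool <- function(x, b1, b2) {
  z <- matrix(0, floor(nrow(x) / b1), floor(ncol(x) / b2))
  for (i in 1:nrow(z)) for (j in 1:ncol(z))
    z[i, j] <- max(x[(b1 * (i - 1) + 1):(b1 * i), (b2 * (j - 1) + 1):(b2 * j)])
  z
}
x <- matrix(1:16, 4, 4)   # column-major 4 x 4 toy input
max_pool(x, 2, 2)         # 2 x 2 matrix of block maxima
```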


18
CNNs: Flatten Layers

• A flatten layer is used to reshape a tensor to a vector.

• Consider mapping

z flatten : RI×J×q0 → Rq1
x 7→ z flatten(x) = (x1,1,1, . . . , xI,J,q0 )>,

with q1 = I · J · q0.

• The flattened object z flatten(x) can serve as input to a FNN.

• We have already met this operator with embedding layers for categorical features.
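A one-line base-R sketch of the reshaping (note that R's `as.vector` uses column-major order while keras flattens row-major; only a consistent choice matters):

```r
# Flatten layer: reshape a tensor of dimension I x J x q0 into a
# vector of length q1 = I * J * q0.
x <- array(1:24, dim = c(2, 3, 4))   # toy tensor with I = 2, J = 3, q0 = 4
z <- as.vector(x)
length(z)   # 24 = 2 * 3 * 4
```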

19
CNNs: Example (1/2)

library(keras)
#
shape <- c(180, 50, 3)
#
model <- keras_model_sequential()
model %>%
  layer_conv_2d(filters = 10, kernel_size = c(11, 6), activation = 'tanh',
                input_shape = shape) %>%
  layer_max_pooling_2d(pool_size = c(10, 5)) %>%
  layer_conv_2d(filters = 5, kernel_size = c(6, 4), activation = 'tanh') %>%
  layer_max_pooling_2d(pool_size = c(3, 2)) %>%
  layer_flatten()

CNN1 Max1 CNN2 Max2 flatten


180×50×3 7→ 170×45×10 7→ 17×9×10 7→ 12×6×5 7→ 4×3×5 7→ 60.
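The size chain above can be verified by hand: a 'valid' convolution shrinks each spatial dimension by (kernel size − 1), and max-pooling divides it by the pool size (with flooring). A small sketch:

```r
# Output-size arithmetic for 'valid' convolutions and max-pooling.
conv_dim <- function(d, k) d - k + 1
pool_dim <- function(d, p) floor(d / p)
d <- c(180, 50)
d <- conv_dim(d, c(11, 6))   # 170 45
d <- pool_dim(d, c(10, 5))   # 17 9
d <- conv_dim(d, c(6, 4))    # 12 6
d <- pool_dim(d, c(3, 2))    # 4 3
prod(d) * 5                  # 60 neurons after flattening the 5 filters
```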

20
CNNs: Example (2/2)

Layer (type)                     Output Shape          Param #
=======================================================================
conv2d_1 (Conv2D)                (None, 170, 45, 10)   1990
_______________________________________________________________________
max_pooling2d_1 (MaxPooling2D)   (None, 17, 9, 10)     0
_______________________________________________________________________
conv2d_2 (Conv2D)                (None, 12, 6, 5)      1205
_______________________________________________________________________
max_pooling2d_2 (MaxPooling2D)   (None, 4, 3, 5)       0
_______________________________________________________________________
flatten_1 (Flatten)              (None, 60)            0
=======================================================================
Total params: 3,195
Trainable params: 3,195
Non-trainable params: 0

CNN1 Max1 CNN2 Max2 flatten


180×50×3 7→ 170×45×10 7→ 17×9×10 7→ 12×6×5 7→ 4×3×5 7→ 60.

21
• Time Series Example: Telematics Data

22
What is Telematics Car Driving Data?
• GPS location data second by second, speed, acceleration, braking, intensity of left
and right turns, engine revolutions,

• vehicle sensors and cameras,

• time stamp (day time, rush hour, night, etc.), total distances at different times,

• road type, traffic conditions, weather conditions, etc.,

• traffic rules (e.g. speeding), driving and health conditions, etc.

Volume of telematics car driving data, back of envelope calculation:


• 100KB of telematics data per driver and per day.

• This amounts to 40MB of data per driver per year.

• A small portfolio of 100’000 drivers results every year in 4TB data.


23
Illustration of GPS Location Data of Selected Driver

[Figure: GPS coordinates of individual trips (x/y coordinates in km, ±20 km) and time spent in the speed buckets [0], (0,5], (5,20], (20,50], (50,80], (80,130] km/h]

Remark that the idling phase is comparably large; typically, we truncate the idling phase in our analysis.

24
Speed, Acceleration/Braking and Direction

[Figure: driver 238, trip number 1 — acceleration (m/s²) / change in direction (|sin|/s) / speed (km/h) over 180 seconds]

• We have 3 channels:
? Speed v is concatenated so that v ∈ [2, 50]km/h.
? Acceleration a is censored at ±3m/s2 because of scarcity of data and data
error. Extreme acceleration +6m/s2, extreme deceleration −8m/s2.
? Change of direction ∆ is censored at 1/2.
25
Choose 3 Selected Drivers

[Figure: speed / change in angle / acceleration time series of drivers 57, 206 and 238 (trip number 1, 180 seconds each)]

• Consider 3 selected drivers, called drivers 57, 206 and 238.

• Question: Can we correctly allocate individual trips to the right drivers?

• Assume that of each trip we have 180 seconds of driving experience (at random
chosen from the entire trip and pre-processed as described above).

26
Classification with CNNs

• We choose a CNN because we would like to find similar structure in telematics


time series data to discriminate the 3 different drivers.

• Consider (speed-acceleration-change in angle) time series, for t = 1, . . . , T = 180,

(vs,t, as,t, ∆s,t)> ∈ [2, 50] km/h × [−3, 3] m/s² × [0, 1/2],

where s = 1, . . . , S labels the individual trips of the considered drivers.

• Define 3-dimensional time series feature (covariate with 3 channels)

xs = ( (vs,1, as,1, ∆s,1)>, . . . , (vs,180, as,180, ∆s,180)> )> ∈ R180×3,

with categorical response Ys ∈ {57, 206, 238} indicating the drivers.

27
Logistic Regression for Classification

• Multinomial logistic regression uses linear predictors on the canonical scale


x 7→ pLogistic(x) = softmax⟨B, z flatten(x)⟩ ∈ (0, 1)3,

with regression parameters B ∈ R180·3×3:


? we need to pre-process time series feature x ∈ R180×3 for suitable shape (flatten
to vector) and for suitable functional form (not done here);
? the scalar product is understood column-wise in B;
? the softmax function is (here) for j = 1, 2, 3 given by

softmax⟨B, z⟩j = exp⟨bj , z⟩ / Σ3k=1 exp⟨bk , z⟩ ∈ (0, 1),

where bj is the j-th column of B.
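The softmax rule above can be sketched in a few lines of base R (toy dimensions: 2 instead of 180·3, and hypothetical weights B):

```r
# softmax<B, z>_j = exp<b_j, z> / sum_k exp<b_k, z>,
# where b_j is the j-th column of B.
softmax_prob <- function(B, z) {
  s <- exp(as.vector(t(B) %*% z))   # exp<b_j, z> for all j
  s / sum(s)
}
B <- matrix(c(0.1, -0.2, 0.3,
              0.0,  0.5, -0.1), nrow = 2, byrow = TRUE)   # toy weights
p <- softmax_prob(B, z = c(1, 2))
p        # three class probabilities
sum(p)   # 1
```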

28
CNNs for Logistic Regression

• Multinomial logistic regression uses linear predictors on the canonical scale


x 7→ pLogistic(x) = softmax⟨B, z flatten(x)⟩ ∈ (0, 1)3,

with regression parameters B ∈ R180·3×3.

• We choose a CNN architecture of depth d ∈ N for multinomial logistic regression


x 7→ pCNN(x) = softmax⟨B, (z (d) ◦ · · · ◦ z (1))(x)⟩ ∈ (0, 1)3,

with layers z (m) being described on the next slide.

29
R Code for CNN Architecture on Time Series Data

model <- keras_model_sequential()

model %>%
  layer_conv_1d(filters = 12, kernel_size = 5, activation = 'tanh', input_shape = c(180, 3)) %>%
  layer_max_pooling_1d(pool_size = 3) %>%
  layer_conv_1d(filters = 10, kernel_size = 5, activation = 'tanh') %>%
  layer_max_pooling_1d(pool_size = 3) %>%
  layer_conv_1d(filters = 8, kernel_size = 5, activation = 'tanh') %>%
  layer_global_max_pooling_1d() %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 3, activation = 'softmax')

CNN1 Max1 CNN2 Max2 CNN3 GlobMax FNN


180 × 3 7→ 176 × 12 7→ 58 × 12 7→ 54 × 10 7→ 18 × 10 7→ 14 × 8 7→ 8 7→ 3.

30
Explicit CNN Architecture and Network Parameters

Layer (type)                       Output Shape     Param #
==============================================================
conv1d_1 (Conv1D)                  (None, 176, 12)  192
______________________________________________________________
max_pooling1d_1 (MaxPooling1D)     (None, 58, 12)   0
______________________________________________________________
conv1d_2 (Conv1D)                  (None, 54, 10)   610
______________________________________________________________
max_pooling1d_2 (MaxPooling1D)     (None, 18, 10)   0
______________________________________________________________
conv1d_3 (Conv1D)                  (None, 14, 8)    408
______________________________________________________________
global_max_pooling1d_1 (Global     (None, 8)        0
______________________________________________________________
dropout_1 (Dropout)                (None, 8)        0
______________________________________________________________
dense_1 (Dense)                    (None, 3)        27
==============================================================
Total params: 1,237
Trainable params: 1,237
Non-trainable params: 0

31
Gradient Descent Fitting

[Figure: gradient descent training and validation curves — deviance loss and accuracy over 500 epochs]

• Total data: 521+131 individual trips L and T .

• We use 521 trips L to learn the network.

• We out-of-sample predict on 131 trips T .

• Split 521 trips 8:2 for train/validation U and V.

• Callback retrieves best network.

• Gradient descent fitting takes 40 sec.

• The plot shows deviance losses and accuracy/misclassification rate.

32
Out-of-sample Results

• Out-of-sample confusion matrix on T (131 trips):

true labels
driver 57 driver 206 driver 238
predicted label 57 33 4 0
predicted label 206 8 38 6
predicted label 238 1 5 36
% correct 78.6% 80.9% 85.7%

• This excellent prediction is based on “minimal information”:


? only 180 seconds per trip;
? only very few trips to fit the network (521);
? not optimal data quality;
? not much network fine-tuning has been done.
33
Other Triples of Drivers

true labels
driver 300 driver 301 driver 302
predicted label 300 61 1 3
predicted label 301 5 42 11
predicted label 302 8 11 25
% correct 82.4% 77.8% 65.8%

true labels
driver 100 driver 150 driver 200
predicted label 100 43 12 2
predicted label 150 5 64 5
predicted label 200 4 2 51
% correct 82.7% 82.1% 87.9%

34
What’s Next?

• Do you have any privacy concerns?

• Can we use this data to identify different driving styles?

• How much telematics data is needed to characterize a given driver?

• Can this data be made useful to improve driving behavior and style?

35
• Spatial Example: Digits Recognition

36
Modified National Institute of Standards and
Technology (MNIST) Data Set

• We have black-white pictures, i.e., 1 channel.

• Spatial objects are represented by tensors x ∈ [0, 1]28×28×1 of order/mode 3.

• We have a classification problem with categorical response Y ∈ {0, . . . , 9}.

• The data basis contains n = 70'000 images.
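A toy sketch of this tensor representation (random pixel values for illustration; with the keras package the actual images are available via `dataset_mnist()`):

```r
# A black-white picture as an order-3 tensor with one channel and
# entries scaled to [0, 1] (random toy pixels instead of a real digit).
img <- matrix(sample(0:255, 28 * 28, replace = TRUE), 28, 28)
x <- array(img / 255, dim = c(28, 28, 1))
dim(x)     # 28 28 1
range(x)   # within [0, 1]
```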

37
R Code for CNN Architecture on Spatial Data

model <- keras_model_sequential()

model %>%

  layer_conv_2d(filters = 10, kernel_size = c(3,3), padding = "valid",
                activation = "linear", input_shape = c(28,28,1)) %>%
  layer_batch_normalization() %>%
  layer_activation(activation = "relu") %>%
  layer_max_pooling_2d(pool_size = c(2,2), strides = c(2,2), padding = "valid") %>%

  layer_conv_2d(filters = 20, kernel_size = c(3,3), padding = "valid") %>%
  layer_batch_normalization() %>%
  layer_activation(activation = "relu") %>%
  layer_max_pooling_2d(pool_size = c(2,2), strides = c(1,1), padding = "valid") %>%

  layer_conv_2d(filters = 40, kernel_size = c(3,3), padding = "valid") %>%
  layer_batch_normalization() %>%
  layer_activation(activation = "relu") %>%
  layer_max_pooling_2d(pool_size = c(2,2), strides = c(2,2), padding = "valid") %>%

  layer_flatten() %>%
  layer_dense(units = 10, activation = 'softmax')

38
Explicit CNN Architecture and Network Parameters
Layer (type)                             Output Shape         Param #
=====================================================================================
conv2d_8 (Conv2D)                        (None, 26, 26, 10)   100
_____________________________________________________________________________________
batch_normalization_7 (BatchNormalizat   (None, 26, 26, 10)   40
_____________________________________________________________________________________
activation_7 (Activation)                (None, 26, 26, 10)   0
_____________________________________________________________________________________
max_pooling2d_7 (MaxPooling2D)           (None, 13, 13, 10)   0
_____________________________________________________________________________________
conv2d_9 (Conv2D)                        (None, 11, 11, 20)   1820
_____________________________________________________________________________________
batch_normalization_8 (BatchNormalizat   (None, 11, 11, 20)   80
_____________________________________________________________________________________
activation_8 (Activation)                (None, 11, 11, 20)   0
_____________________________________________________________________________________
max_pooling2d_8 (MaxPooling2D)           (None, 10, 10, 20)   0
_____________________________________________________________________________________
conv2d_10 (Conv2D)                       (None, 8, 8, 40)     7240
_____________________________________________________________________________________
batch_normalization_9 (BatchNormalizat   (None, 8, 8, 40)     160
_____________________________________________________________________________________
activation_9 (Activation)                (None, 8, 8, 40)     0
_____________________________________________________________________________________
max_pooling2d_9 (MaxPooling2D)           (None, 4, 4, 40)     0
_____________________________________________________________________________________
flatten_3 (Flatten)                      (None, 640)          0
_____________________________________________________________________________________
dense_3 (Dense)                          (None, 10)           6410
=====================================================================================
Total params: 15,850
Trainable params: 15,710
Non-trainable params: 140 (1/2 of batch normalizations)

40
Result: Confusion Matrix

41
Shift Invariance

42
Rotation Invariance

43
Scale Invariance

44
References
• Efron, Hastie (2016). Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. Cambridge UP.
• Gao, Wüthrich (2019). Convolutional neural network classification of telematics car driving data. Risks 7/1, article 6.
• Goodfellow, Bengio, Courville (2016). Deep Learning. MIT Press.
• Hastie, Tibshirani, Friedman (2009). The Elements of Statistical Learning. Springer.
• Meier, Wüthrich (2020). Convolutional neural network case studies: (1) anomalies in mortality rates (2) image
recognition. SSRN 3656210.
• Perla, Richman, Scognamiglio, Wüthrich (2021). Time-series forecasting of mortality rates using deep learning.
Scandinavian Actuarial Journal 2021/7, 572-598.
• Wiatowski, Bölcskei (2018). A mathematical theory of deep convolutional neural networks for feature extraction.
IEEE Transactions on Information Theory 64/3, 1845-1866.
• Wüthrich, Merz (2021). Statistical Foundations of Actuarial Learning and its Applications. SSRN 3822407.

45
Recurrent Neural Networks

Mario V. Wüthrich
RiskLab, ETH Zurich

Block Course “Deep Learning with Actuarial Applications in R”


Swiss Association of Actuaries, Zurich
October 15/16, 2020
Programme SAV Block Course

• Refresher: Generalized Linear Models (THU 9:00-10:00)

• Feed-Forward Neural Networks (THU 12:30-14:00)

• Combined Actuarial Neural Network Models (THU 16:30-17:45)

• Recurrent Neural Networks (FRI 10:30-12:00)

• Discrimination-Free Insurance Pricing (FRI 14:30-15:00)

• Unsupervised Learning Methods (FRI 15:30-16:30)

1
Contents: Recurrent Neural Networks

• Lee–Carter (LC) model

• Recurrent neural networks (RNNs)

• Long short-term memory (LSTM) networks

• Gated recurrent unit (GRU) networks

• Recurrent neural networks (RNNs) vs. convolutional neural networks (CNNs)

2
• Lee–Carter (LC) Model and Time-Series

3
Human Mortality Database (HMD)

Classes 'data.table' and 'data.frame': 13400 obs. of 7 variables:
 $ Gender       : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 1 1 1 1 1 ...
 $ Year         : int 1950 1950 1950 1950 1950 1950 1950 1950 1950 1950 ...
 $ Age          : int 0 1 2 3 4 5 6 7 8 9 ...
 $ Country      : chr "CHE" "CHE" "CHE" "CHE" ...
 $ imputed_flag : chr "FALSE" "FALSE" "FALSE" "FALSE" ...
 $ mx           : num 0.02729 0.00305 0.00167 0.00123 0.00101 ...
 $ logmx        : num -3.6 -5.79 -6.39 -6.7 -6.9 ...

[Figure: Swiss Female and Swiss Male raw log-mortality rates, ages x = 0, . . . , 92 vs. calendar years t = 1950, . . . , 2016]

4
Human Mortality Database (HMD)

• Aim: Forecast mortality rates m(i)x,t for ages x, calendar years t and populations i.

• Data available for ages 0 ≤ x ≤ 99 and calendar years 1950 ≤ t ≤ 2016 of 38


countries and 2 genders, i.e., i = (r, g) ∈ I = R × {female, male}.

• Learning data D = {1950 ≤ t ≤ 1999}; test data T = {2000 ≤ t ≤ 2016}.

[Figure: Swiss Female and Swiss Male raw log-mortality rates, ages x = 0, . . . , 92 vs. calendar years t = 1950, . . . , 2016; learning period D shaded against test period T ]

5
Lee–Carter (LC) Model (1992)

• Expected log-mortality rate is modeled by a regression function

(x, t, i) 7→ log(m(i)x,t) = a(i)x + b(i)x k(i)t ,

? a(i)x average force of mortality at age x in population i;
? k(i)t mortality trend in calendar year t of population i;
? b(i)x mortality trend broken down to ages x of population i.
? The inputs (x, i) and (t, i) are treated as categorical variables.
? We have log-link, but not a GLM.

• 2-stage estimation and forecasting procedure, for each population i individually:


1. Estimate a(i)x , k(i)t and b(i)x with singular value decomposition (SVD).
2. Forecast by extrapolating the estimated time series (k̂(i)t )t0≤t≤t1 to years t > t1.

6
Lee–Carter 2-Stage Forecasting
• Center the observed log-mortality rates log(M(i)x,t)

L(i)x,t = log(M(i)x,t) − (1/|D|) Σs∈D log(M(i)x,s).

• Find optimal parameter values with SVD (see also PCA chapter)

arg min(b(i)x)x, (k(i)t)t Σt,x ( L(i)x,t − b(i)x k(i)t )²,

under side constraint for identifiability Σx b̂(i)x = 1; and Σt∈D k̂(i)t = 0.

• Extrapolate the time series (k̂(i)t )t∈D using a random walk with drift.
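A minimal sketch of the SVD step on a toy rank-1 matrix, with the identification Σx bx = 1 imposed by rescaling (hypothetical values for L):

```r
# Rank-1 SVD approximation L ~ b k^T with the normalization sum(b) = 1.
L <- outer(c(0.5, 0.3, 0.2), c(2, 0, -2))   # toy centered log-mortality matrix
s <- svd(L)
scale <- sum(s$u[, 1])
b <- s$u[, 1] / scale                # enforce sum(b) = 1
k <- s$d[1] * s$v[, 1] * scale
max(abs(L - outer(b, k)))   # ~ 0, since the toy L is exactly of rank 1
sum(b)                      # 1
```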

• A random walk with drift often works surprisingly well.
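The extrapolation step is elementary: under a random walk with drift, the h-step forecast is the last value plus h times the average observed increment (toy values for k̂t below):

```r
# Step 2: extrapolate the estimated k_t by a random walk with drift,
# k_{T+h} = k_T + h * (average observed increment).
k <- c(20, 14, 10, 3, -2, -5)          # hypothetical estimated trend k_t
drift <- mean(diff(k))                 # = -5 here
forecast <- tail(k, 1) + drift * (1:3) # 3-year extrapolation
forecast   # -10 -15 -20
```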


7
Lee–Carter Forecast for Switzerland
[Figure: estimated and extrapolated mortality trend processes k̂t for Swiss females and males, calendar years 1950–2016]

                        in-sample MSE          out-of-sample MSE
                        female      male       female      male
LC model with SVD       3.7573     8.8110      0.6045     1.8152

8
• Recurrent Neural Networks (RNNs)

9
Recap: Feed-Forward Neural Networks (FNNs)

[Figure: feed-forward neural network diagram with input covariates age, ac, claims]

• Deep FNN mapping

x 7→ µ = E[Y ] = g−1⟨β, z (d:1)(x)⟩.

• Goal: Use time series input x = (x1, . . . , xT ) to predict output Y .

• The input of this FNN grows whenever we have a new observation xt ∈ Rτ0 .

• This FNN does not respect time series (causality) structure.

10
Plain-Vanilla Recurrent Neural Network (RNN)
• Define a recursive structure using a single RNN layer (upper index(1))

z (1) : Rτ0 × Rτ1 → Rτ1 ,   (xt, z t−1) 7→ z t = z (1)(xt, z t−1).

• The RNN layer is given by

z t = z (1)(xt, z t−1) = ( φ(⟨w(1)1 , xt⟩ + ⟨u(1)1 , z t−1⟩), . . . , φ(⟨w(1)τ1 , xt⟩ + ⟨u(1)τ1 , z t−1⟩) )>,

where the individual neurons 1 ≤ j ≤ τ1 are modeled by

φ(⟨w(1)j , xt⟩ + ⟨u(1)j , z t−1⟩) = φ( w(1)j,0 + Στ0l=1 w(1)j,l xt,l + Στ1l=1 u(1)j,l zt−1,l ).

• This RNN has one hidden layer with upper index(1) that is visited T times.
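The recursion above can be sketched in a few lines of base R (toy dimensions τ0 = 2, τ1 = 3; intercepts omitted for brevity):

```r
# Plain-vanilla RNN recursion z_t = phi(W x_t + U z_{t-1}), with the same
# weights W and U re-used in every t-loop.
rnn_forward <- function(X, W, U, phi = tanh) {
  z <- rep(0, nrow(U))                       # initialize z_0 = 0
  for (t in 1:nrow(X)) z <- phi(W %*% X[t, ] + U %*% z)
  as.vector(z)                               # z_T summarizes x_1, ..., x_T
}
set.seed(1)
X <- matrix(rnorm(10), nrow = 5, ncol = 2)   # toy time series x_1, ..., x_5
W <- matrix(0.1, 3, 2)                       # toy shared weights
U <- matrix(0.1, 3, 3)
rnn_forward(X, W, U)                         # tau1 = 3 activations in (-1, 1)
```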
11
Remarks on RNNs

• Lower index t in z t = z (1) (xt, z t−1) is time and upper index(1) is the hidden layer.

• This gives time series structure:

··· 7→ z t = z (1) (xt, z t−1) 7→ z t+1 = z (1) (xt+1, z t) 7→ ···

• Network weights (w(1)1 , . . . , w(1)τ1 )> ∈ Rτ1×(τ0+1) and (u(1)1 , . . . , u(1)τ1 )> ∈ Rτ1×τ1
are time independent (are shared across every t-loop).

• We have an auto-regressive structure of order 1 in (z t)t summarizing the past


history; this structure also resembles a state-space model.

• There are different ways in designing RNNs with multiple hidden layers. We give
examples of two hidden layers, i.e. depth d = 2.

12
Variants with 2 Hidden RNN Layers
• 1st variant of a two-hidden layer RNN:
 
z(1)t = z (1)( xt, z(1)t−1 ),
z(2)t = z (2)( z(1)t , z(2)t−1 ).

• 2nd variant of a two-hidden layer RNN:

z(1)t = z (1)( xt, z(1)t−1, z(2)t−1 ),
z(2)t = z (2)( z(1)t , z(2)t−1 ).

• 3rd variant of a two-hidden layer RNN:

z(1)t = z (1)( xt, z(1)t−1, z(2)t−1 ),
z(2)t = z (2)( xt, z(1)t , z(2)t−1 ).
13
• Long Short-Term Memory (LSTM) Networks

14
Long Short-Term Memory (LSTM) Networks
• The above plain-vanilla RNN architecture is of auto-regressive type of order 1.

• Long short-term memory (LSTM) networks were introduced by Hochreiter–Schmidhuber (1997): they design an RNN architecture that can store information for "longer" by using a so-called memory cell c_t.

15
LSTM Layer: The 3 Gates
• Forget Gate (loss of memory rate):

    f_t = f^{(1)}(x_t, z_{t-1}) = φ_σ(⟨W_f, x_t⟩ + ⟨U_f, z_{t-1}⟩) ∈ (0, 1)^{τ1}.

• Input Gate (memory update rate):

    i_t = i^{(1)}(x_t, z_{t-1}) = φ_σ(⟨W_i, x_t⟩ + ⟨U_i, z_{t-1}⟩) ∈ (0, 1)^{τ1}.

• Output Gate (release of memory information rate):

    o_t = o^{(1)}(x_t, z_{t-1}) = φ_σ(⟨W_o, x_t⟩ + ⟨U_o, z_{t-1}⟩) ∈ (0, 1)^{τ1}.

• Network weights are given by W_f^⊤, W_i^⊤, W_o^⊤ ∈ R^{τ1×(τ0+1)} (including an intercept), and U_f^⊤, U_i^⊤, U_o^⊤ ∈ R^{τ1×τ1} (excluding an intercept).

16
LSTM Layer: The Memory Cell

• The above gates determine the release and update of the memory cell c_t.

• The memory cell (c_t)_t, called cell state process, is defined by

    c_t = c^{(1)}(x_t, z_{t-1}, c_{t-1}) = f_t ⊗ c_{t-1} + i_t ⊗ φ_tanh(⟨W_c, x_t⟩ + ⟨U_c, z_{t-1}⟩) ∈ R^{τ1},

  for weights W_c^⊤ ∈ R^{τ1×(τ0+1)} (incl. intercept), U_c^⊤ ∈ R^{τ1×τ1} (excl. intercept), and Hadamard product ⊗ (element-wise product).

• Finally, define the updated neuron activation, given c_{t-1} and z_{t-1}, by

    z_t = z^{(1)}(x_t, z_{t-1}, c_{t-1}) = o_t ⊗ φ(c_t) ∈ R^{τ1}.

• This is one LSTM layer, indicated by the upper index ^{(1)}.

17
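Putting the three gates and the cell state together, one LSTM time step can be sketched as follows (Python/NumPy with randomly initialized weights rather than the course's R/keras; φ_σ is the sigmoid, and both the candidate activation and the final φ are taken as tanh, the intercepts being set to zero for brevity):

```python
import numpy as np

rng = np.random.default_rng(1)
tau0, tau1 = 3, 5
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))   # phi_sigma

# one weight pair (W, U) per gate (f, i, o) and for the cell candidate (c)
Ws = {k: rng.normal(size=(tau1, tau0)) for k in "fioc"}
Us = {k: rng.normal(size=(tau1, tau1)) for k in "fioc"}

def lstm_step(x_t, z_prev, c_prev):
    f = sigmoid(Ws["f"] @ x_t + Us["f"] @ z_prev)   # forget gate
    i = sigmoid(Ws["i"] @ x_t + Us["i"] @ z_prev)   # input gate
    o = sigmoid(Ws["o"] @ x_t + Us["o"] @ z_prev)   # output gate
    c = f * c_prev + i * np.tanh(Ws["c"] @ x_t + Us["c"] @ z_prev)  # cell state
    z = o * np.tanh(c)                              # updated neuron activation
    return z, c

z, c = np.zeros(tau1), np.zeros(tau1)               # z_0 = c_0 = 0
for t in range(10):
    z, c = lstm_step(rng.normal(size=tau0), z, c)   # hypothetical inputs x_t
print(z.shape, c.shape)                             # (5,) (5,)
```

The cell state c_t is only partially overwritten in each step (via f_t and i_t), which is what allows the LSTM to store information for longer than the plain-vanilla RNN.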
Outputs and Time-Distributed Layers

• The LSTM produces a latent variable z_T , based on the time series input (x_1, …, x_T).

• LSTM prediction: choose link function g and set

    (x_1, …, x_T) ↦ μ_{T+1} = E[Y_{T+1}] = g^{-1} ⟨β, z_T⟩.

• Network weights W_f^⊤, W_i^⊤, W_o^⊤, W_c^⊤ ∈ R^{τ1×(τ0+1)}, U_f^⊤, U_i^⊤, U_o^⊤, U_c^⊤ ∈ R^{τ1×τ1} and β ∈ R^{(τ1+1)×dim(Y_{T+1})}. All are time t-independent.

• The LSTM produces a latent time series z_1, …, z_T . A so-called time-distributed layer outputs all of them such that we can fit

    (x_1, …, x_t) ↦ μ_{t+1} = E[Y_{t+1}] = g^{-1} ⟨β, z_t⟩,

  using the same output filter g^{-1}⟨β, ·⟩ for all t = 1, …, T.
18
• Code LSTM Layers and Networks

19
R Code for Single LSTM Layer Architecture
T <- 10      # length of time series x_1, ..., x_T
tau0 <- 3    # dimension of inputs x_t
tau1 <- 5    # dimension of the neurons z_t and cell states c_t

Input <- layer_input(shape = c(T, tau0), dtype = 'float32', name = 'Input')

Output = Input %>%
  layer_lstm(units = tau1, activation = 'tanh', recurrent_activation = 'tanh', name = 'LSTM1') %>%
  layer_dense(units = 1, activation = 'exponential', name = "Output")

model <- keras_model(inputs = list(Input), outputs = c(Output))

Layer (type)          Output Shape      Param #
===============================================
Input (InputLayer)    (None, 10, 3)     0
_______________________________________________
LSTM1 (LSTM)          (None, 5)         180
_______________________________________________
Output (Dense)        (None, 1)         6
===============================================
Total params: 186
Trainable params: 186
Non-trainable params: 0
20
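The parameter counts in the summary above can be verified by hand: the LSTM layer carries four weight sets (forget, input and output gates plus the cell candidate), each of size τ1·(τ0 + τ1) plus τ1 intercepts, and the output layer has β ∈ R^{τ1+1}. A quick arithmetic check:

```python
tau0, tau1 = 3, 5
lstm_params = 4 * (tau1 * (tau0 + tau1) + tau1)   # W's, U's and intercepts for f, i, o, c
dense_params = tau1 + 1                           # output weights beta incl. intercept
print(lstm_params, dense_params, lstm_params + dense_params)  # 180 6 186
```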
R Code for LSTM Time-Distribution

Output = Input %>%
  layer_lstm(units = tau1, activation = 'tanh', recurrent_activation = 'tanh',
             return_sequences = TRUE, name = 'LSTM1') %>%
  time_distributed(layer_dense(units = 1, activation = 'exponential',
                               name = "Output"), name = 'TD')

Layer (type)          Output Shape      Param #
===============================================
Input (InputLayer)    (None, 10, 3)     0
_______________________________________________
LSTM1 (LSTM)          (None, 10, 5)     180
_______________________________________________
TD (TimeDistributed)  (None, 10, 1)     6
===============================================
Total params: 186
Trainable params: 186
Non-trainable params: 0

21
R Code for Deep LSTMs

tau2 <- 4

Output = Input %>%
  layer_lstm(units = tau1, activation = 'tanh', recurrent_activation = 'tanh',
             return_sequences = TRUE, name = 'LSTM1') %>%
  layer_lstm(units = tau2, activation = 'tanh', recurrent_activation = 'tanh', name = 'LSTM2') %>%
  layer_dense(units = 1, activation = 'exponential', name = "Output")

Layer (type)          Output Shape      Param #
===============================================
Input (InputLayer)    (None, 10, 3)     0
_______________________________________________
LSTM1 (LSTM)          (None, 10, 5)     180
_______________________________________________
LSTM2 (LSTM)          (None, 4)         160
_______________________________________________
Output (Dense)        (None, 1)         5
===============================================
Total params: 345
Trainable params: 345
Non-trainable params: 0

22
• Gated Recurrent Unit (GRU) Networks

23
Gated Recurrent Unit (GRU) Networks

• A shortcoming of LSTMs is their complexity.

• Gated recurrent unit (GRU) networks were introduced by Cho et al. (2014).

• They share similar properties with LSTMs but use fewer parameters.

24
GRU Layer
• Reset gate:

    r_t = r^{(1)}(x_t, z_{t-1}) = φ_σ(⟨W_r, x_t⟩ + ⟨U_r, z_{t-1}⟩) ∈ (0, 1)^{τ1}.

• Update gate:

    u_t = u^{(1)}(x_t, z_{t-1}) = φ_σ(⟨W_u, x_t⟩ + ⟨U_u, z_{t-1}⟩) ∈ (0, 1)^{τ1}.

• Latent time series z_1, …, z_T:

    z_t = z^{(1)}(x_t, z_{t-1}) = r_t ⊗ z_{t-1} + (1 − r_t) ⊗ φ(⟨W_z, x_t⟩ + u_t ⊗ ⟨U_z, z_{t-1}⟩) ∈ R^{τ1}.

  Thus, the update of z_t is a credibility-weighted average; this can also be understood as a skip connection.

• Network weights are given by W_r^⊤, W_u^⊤, W_z^⊤ ∈ R^{τ1×(τ0+1)} (including an intercept), and U_r^⊤, U_u^⊤, U_z^⊤ ∈ R^{τ1×τ1} (excluding an intercept).
25
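One GRU time step, with the gate roles exactly as stated on this slide (the reset gate r_t weighting the credibility average), can be sketched as follows (Python/NumPy with random weights rather than the course's R/keras, intercepts set to zero for brevity):

```python
import numpy as np

rng = np.random.default_rng(2)
tau0, tau1 = 3, 5
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))   # phi_sigma

# one weight pair (W, U) per gate (r, u) and for the candidate state (z)
Ws = {k: rng.normal(size=(tau1, tau0)) for k in "ruz"}
Us = {k: rng.normal(size=(tau1, tau1)) for k in "ruz"}

def gru_step(x_t, z_prev):
    r = sigmoid(Ws["r"] @ x_t + Us["r"] @ z_prev)           # reset gate
    u = sigmoid(Ws["u"] @ x_t + Us["u"] @ z_prev)           # update gate
    cand = np.tanh(Ws["z"] @ x_t + u * (Us["z"] @ z_prev))  # candidate state
    return r * z_prev + (1.0 - r) * cand                    # credibility average

z = np.zeros(tau1)                                          # z_0 = 0
for t in range(10):
    z = gru_step(rng.normal(size=tau0), z)                  # hypothetical x_t
print(z.shape)                                              # (5,)
```

Compared with the LSTM step, there is no separate cell state c_t and one fewer weight set, which is where the parameter savings come from.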
R Code for Single GRU Layer Architecture
T <- 10      # length of time series x_1, ..., x_T
tau0 <- 3    # dimension of inputs x_t
tau1 <- 5    # dimension of the neurons z_t

Input <- layer_input(shape = c(T, tau0), dtype = 'float32', name = 'Input')

Output = Input %>%
  layer_gru(units = tau1, activation = 'tanh', recurrent_activation = 'tanh', name = 'GRU1') %>%
  layer_dense(units = 1, activation = 'exponential', name = "Output")

model <- keras_model(inputs = list(Input), outputs = c(Output))

Layer (type)          Output Shape      Param #
===============================================
Input (InputLayer)    (None, 10, 3)     0
_______________________________________________
GRU1 (GRU)            (None, 5)         135
_______________________________________________
Output (Dense)        (None, 1)         6
===============================================
Total params: 141
Trainable params: 141
Non-trainable params: 0
26
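Again the summary can be checked by hand: the GRU layer carries three weight sets (reset gate, update gate and candidate state), each of size τ1·(τ0 + τ1) plus τ1 intercepts. This matches the 135 parameters above; note that newer keras versions with the default reset_after = TRUE would report 150 instead (two intercept vectors per weight set):

```python
tau0, tau1 = 3, 5
gru_params = 3 * (tau1 * (tau0 + tau1) + tau1)  # W's, U's and intercepts for r, u, z
dense_params = tau1 + 1                         # output weights beta incl. intercept
print(gru_params, dense_params, gru_params + dense_params)  # 135 6 141
```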
• Example: Mortality Modeling

27
Toy Example of LSTMs and GRUs

• Consider raw Swiss female log-mortality rates log(M_{x,t}) for calendar years 1990, …, 2001 and ages 0 ≤ x ≤ 99.

• Set T = 10 and τ0 = 3. Define for ages 1 ≤ x ≤ 98 and years 1 ≤ t ≤ T the features

    x_{x,t} = ( log(M_{x−1,1999−(T−t)}), log(M_{x,1999−(T−t)}), log(M_{x+1,1999−(T−t)}) )^⊤ ∈ R^{τ0},

  and the observations

    Y_{x,T+1} = log(M_{x,2000}) = log(M_{x,1999−(T−(T+1))}).

• Based on these definitions, choose training data

    D = { (x_{x,1}, …, x_{x,T}; Y_{x,T+1}); 1 ≤ x ≤ 98 }.

  Thus, we have 98 training samples.
28
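The construction of D amounts to a sliding-window extraction over the log-mortality surface. A sketch in Python/NumPy, using a hypothetical random surface standing in for the Swiss female rates (rows are ages 0–99, columns the calendar years 1990–2001):

```python
import numpy as np

rng = np.random.default_rng(3)
T = 10
# hypothetical raw log-mortality surface: rows = ages 0..99,
# columns = calendar years 1990..2001
logM = rng.normal(loc=-5.0, scale=1.5, size=(100, 12))

X, Y = [], []
for x in range(1, 99):                 # ages 1 <= x <= 98
    # features x_{x,t}, t = 1..T: ages (x-1, x, x+1) over years 1990..1999
    feats = np.stack([logM[x-1:x+2, t] for t in range(T)])  # shape (T, 3)
    X.append(feats)
    Y.append(logM[x, T])               # response: calendar year 2000 (column 10)
X, Y = np.array(X), np.array(Y)
print(X.shape, Y.shape)                # (98, 10, 3) (98,)
```

The resulting array shapes (98, 10, 3) and (98,) are exactly the (samples, T, τ0) input and (samples,) output expected by the layer_input(shape = c(T, tau0)) architectures above.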
Toy Example of LSTMs and GRUs
• Consider ages (x − 1, x, x + 1) simultaneously in x_{x,t} to smooth inputs over neighboring ages to predict the central mortality rate Y_{x,T+1}.

[Figure "data toy example": raw log-mortality rates over calendar years 1990–2000.]

⋆ black lines: explanatory variables (x_{x,t})_{1≤t≤T} (input data)
⋆ blue dots: response variables Y_{x,T+1} (for training)
⋆ test data T = { (x_{x,2}, …, x_{x,T+1}; Y_{x,T+2}); 1 ≤ x ≤ 98 } (shifted data); or alternatively T+ = { (x_{x,1}, …, x_{x,T+1}; Y_{x,T+2}); 1 ≤ x ≤ 98 } (expanded data)
29
Toy Example of LSTMs and GRUs
• Pre-process all variables x_{x,t} with the MinMaxScaler to the domain [−1, 1].

• Use shallow LSTM1 and deep LSTM2 as above, the corresponding GRU1 and GRU2, and a deep FNN, fitted by gradient descent (GDM): blue is the in-sample loss, red the out-of-sample loss.
[Figure: MSE losses over 500 training epochs for LSTM1, LSTM2, GRU1, GRU2 and the deep FNN; in-sample losses in blue, out-of-sample losses in red; early stopping rules indicated.]

           # param.  epochs  run time  in-sample loss  out-of-sample loss
LSTM1           186     150     8 sec          0.0655              0.0936
LSTM2           345     200    15 sec          0.0603              0.0918
GRU1            141     100     5 sec          0.0671              0.0860
GRU2            260     200    14 sec          0.0651              0.0958
deep FNN        184     200     5 sec          0.0485              0.1577

• In general: LSTMs seem more robust than GRUs.


30
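The MinMaxScaler pre-processing of the x_{x,t} mentioned at the top of this slide is the affine map x ↦ 2(x − min)/(max − min) − 1; a minimal sketch (the function name and example values are illustrative):

```python
import numpy as np

def min_max_scale(x, lo=-1.0, hi=1.0):
    # map feature values affinely onto [lo, hi]
    xmin, xmax = x.min(), x.max()
    return lo + (hi - lo) * (x - xmin) / (xmax - xmin)

x = np.array([-9.2, -6.1, -3.0, -1.4])   # e.g. some log-mortality values
s = min_max_scale(x)
print(s.min(), s.max())                  # -1.0 1.0
```

Such scaling of all input components to a common domain is important for gradient descent fitting to treat all directions comparably.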
Hyper-Parameters in LSTMs

# param. epochs run time in-sample out-of-sample


base case:
LSTM1 (T = 10, τ0 = 3, τ1 = 5) 186 150 8 sec 0.0655 0.0936
LSTM1 (T = 10, τ0 = 1, τ1 = 5) 146 100 5 sec 0.0681 0.0994
LSTM1 (T = 10, τ0 = 5, τ1 = 5) 226 150 15 sec 0.0572 0.0795
LSTM1 (T = 5, τ0 = 3, τ1 = 5) 186 100 4 sec 0.0753 0.1028
LSTM1 (T = 20, τ0 = 3, τ1 = 5) 186 200 16 sec 0.0678 0.0914
LSTM1 (T = 10, τ0 = 3, τ1 = 3) 88 200 10 sec 0.0614 0.1077
LSTM1 (T = 10, τ0 = 3, τ1 = 10) 571 100 5 sec 0.0667 0.0962

31
Application to Swiss Mortality Data
in-sample out-of-sample run times
female male female male female male
LSTM3 (T = 10, (τ0 , τ1 , τ2 , τ3 ) = (5, 20, 15, 10)) 2.5222 6.9458 0.3566 1.3507 225s 203s
GRU3 (T = 10, (τ0 , τ1 , τ2 , τ3 ) = (5, 20, 15, 10)) 2.8370 7.0907 0.4788 1.2435 185s 198s
LC model with SVD 3.7573 8.8110 0.6045 1.8152 – –

[Figure: estimated processes k_t for Female (left) and Male (right), calendar years 1950–2013; in-sample values together with LC drift, LSTM drift and GRU drift extrapolations.]

• Main difficulty: Robustness of results.

• More stability by simultaneous multi-population modeling.

• For more sophisticated models see Perla et al. (2020).


32
• RNNs vs. Convolutional Neural Networks (CNNs)

33
RNNs vs. Convolutional Neural Networks (CNNs)

• RNNs respect time series structures.

• Convolutional neural networks (CNNs) respect local spatial structure.

• Intuitively, for CNNs we move small windows (filters) over the images to discover
similar structure at different locations in the images.

[Figure: heat maps of Swiss Female (left) and Swiss Male (right) raw log-mortality rates, age x (0–99) versus calendar year t (1950–2016).]

34
Convolutional Neural Networks (CNNs)

• CNNs were introduced by Fukushima (1980).

• CNNs are used for image and speech recognition, natural language processing (NLP), and in many other fields; for references see our tutorial Meier–Wüthrich (2020).

• Main feature of CNNs is translation invariance, see Wiatowski–Bölcskei (2018).

35
Convolutional Layer: Sketch of Structure
A convolutional layer (we consider a two-dimensional image here and a single filter)

    z^{(m)}: R^{n_1^{(m−1)} × n_2^{(m−1)}} → R^{n_1^{(m)} × n_2^{(m)}},
    x ↦ z^{(m)}(x) = ( z_{i_1,i_2}^{(m)}(x) )_{1 ≤ i_1 ≤ n_1^{(m)}, 1 ≤ i_2 ≤ n_2^{(m)}},

with a (local) filter/window having filter sizes f_1^{(m)} and f_2^{(m)}:

    x ↦ z_{i_1,i_2}^{(m)}(x) = φ( w_{0,0}^{(m)} + Σ_{j_1=1}^{f_1^{(m)}} Σ_{j_2=1}^{f_2^{(m)}} w_{j_1,j_2}^{(m)} x_{i_1+j_1−1, i_2+j_2−1} ).

⋆ In our tutorial we add the activation φ only later (after batch normalization).
⋆ We illustrate a single filter; multiple filters are used to extract different features.
⋆ Multiple filters require three-dimensional inputs in deep CNNs.
⋆ Pooling layers, flatten layers and so-called padding are used.
36
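The single-filter convolution above can be sketched directly (Python/NumPy with naive loops rather than an optimized implementation; the 3×3 averaging filter, the toy image and φ = tanh are illustrative choices, with no padding and stride 1):

```python
import numpy as np

def conv2d_single_filter(x, w, w0, phi=np.tanh):
    # move one filter of size f1 x f2 over image x (no padding, stride 1)
    f1, f2 = w.shape
    n1, n2 = x.shape[0] - f1 + 1, x.shape[1] - f2 + 1
    z = np.empty((n1, n2))
    for i1 in range(n1):
        for i2 in range(n2):
            z[i1, i2] = phi(w0 + np.sum(w * x[i1:i1+f1, i2:i2+f2]))
    return z

x = np.arange(36.0).reshape(6, 6)   # hypothetical 6x6 "image"
w = np.ones((3, 3)) / 9.0           # a 3x3 averaging filter
z = conv2d_single_filter(x, w, 0.0)
print(z.shape)                      # (4, 4): (6-3+1) x (6-3+1)
```

Because the same weights w are applied at every position (i_1, i_2), the layer discovers similar structure at different locations of the image, which is the translation invariance mentioned on the previous slide.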
References
• Cho, van Merrienboer, Gulcehre, Bahdanau, Bougares, Schwenk, Bengio (2014). Learning phrase representations
using RNN encoder-decoder for statistical machine translation. arXiv:1406.1078.
• Efron, Hastie (2016). Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. Cambridge UP.
• Fukushima (1980). Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition
unaffected by shift in position. Biological Cybernetics, 36/4, 193-202.
• Goodfellow, Bengio, Courville (2016). Deep Learning. MIT Press.
• Hastie, Tibshirani, Friedman (2009). The Elements of Statistical Learning. Springer.
• Hochreiter, Schmidhuber (1997). Long short-term memory. Neural Computation 9/8, 1735-1780.
• Lee, Carter (1992). Modeling and forecasting US mortality. Journal American Statistical Association 87/419, 659-671.
• Meier, Wüthrich (2020). Convolutional neural network case studies: (1) anomalies in mortality rates (2) image
recognition. SSRN 3656210.
• Perla, Richman, Scognamiglio, Wüthrich (2020). Time-series forecasting of mortality rates using deep learning. SSRN
3595426.
• Richman, Wüthrich (2019). Lee and Carter go machine learning. SSRN 3441030.
• Richman, Wüthrich (2020). A neural network extension of the Lee–Carter model to multiple populations. Annals of
Actuarial Science.
• Smyl (2019). A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting.
International Journal of Forecasting.
• Wiatowski, Bölcskei (2018). A mathematical theory of deep convolutional neural networks for feature extraction.
IEEE Transactions on Information Theory 64/3, 1845-1866.
• Wilmoth, Shkolnikov (2010). Human Mortality Database. University of California.
• Wüthrich, Buser (2016). Data Analytics for Non-Life Insurance Pricing. SSRN 2870308, Version September 10, 2020.
37
Wrap Up

Mario V. Wüthrich
RiskLab, ETH Zurich

“Deep Learning with Actuarial Applications in R”


Swiss Association of Actuaries SAA/SAV, Zurich
October 14/15, 2021
Programme SAV Block Course

• Refresher: Generalized Linear Models (THU 9:00-10:30)

• Feed-Forward Neural Networks (THU 13:00-15:00)

• Discrimination-Free Insurance Pricing (THU 17:15-17:45)

• LocalGLMnet (FRI 9:00-10:30)

• Convolutional Neural Networks (FRI 13:00-14:30)

• Wrap Up (FRI 16:00-16:30)

1
Exponential Dispersion Family

• EDF provides a unified notation for a big class of distribution functions.

• EDF has the following structure

    Y_i ∼ f(y; θ_i, v_i/ϕ) = exp( (yθ_i − κ(θ_i)) / (ϕ/v_i) + a(y; v_i/ϕ) ),

  with expected mean, and canonical link h = (κ′)^{-1},

    μ_i = E[Y_i] = κ′(θ_i)  ⟺  h(μ_i) = θ_i.
• This family contains the Gauss, Poisson, binomial, negative binomial, gamma,
inverse Gaussian, and Tweedie’s models.

• The cumulant function κ determines the distribution type, and the canonical
parameter θi is estimated with MLE.
2
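For instance, the Poisson model is the EDF member with cumulant function κ(θ) = e^θ, so the canonical link is h = (κ′)^{-1} = log. A quick numerical check in Python/NumPy (the mean value is hypothetical, and the sampling check is a simple Monte Carlo illustration):

```python
import numpy as np

mu = 1.7                       # hypothetical mean
theta = np.log(mu)             # canonical parameter: theta = h(mu) = log(mu)
kappa_prime = np.exp(theta)    # kappa'(theta) = exp(theta) recovers mu

rng = np.random.default_rng(4)
sample = rng.poisson(lam=mu, size=200_000)
print(np.isclose(kappa_prime, mu), abs(sample.mean() - mu) < 0.02)
```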
Generalized Linear Models

• GLMs are the basis for regression modeling.

• We start from the EDF

    Y_i ∼ f(y; θ_i, v_i/ϕ) = exp( (yθ_i − κ(θ_i)) / (ϕ/v_i) + a(y; v_i/ϕ) ),

  with mean μ_i = κ′(θ_i).

• A GLM chooses a link g and assumes a linear structure (in covariates)

    x_i ↦ g(μ_i) = g(E[Y_i]) = ⟨β, x_i⟩ = β_0 + Σ_{j=1}^{q} β_j x_{i,j},

  for covariate information x_i ∈ R^q of insurance policy i.

3
Generalized Linear Models: Fitting
• GLMs are fit using MLE.

• Maximizing (log-)likelihoods is equivalent to minimizing deviance losses

    D*(Y, β) = 2 [ ℓ_Y(Y) − ℓ_Y(β) ]
             = 2 Σ_{i=1}^{n} (v_i/ϕ) [ Y_i h(Y_i) − κ(h(Y_i)) − Y_i h(μ_i) + κ(h(μ_i)) ] ≥ 0.

• Deviance losses are distribution-adapted loss functions.

• For the canonical link g = h, the fitted model fulfills the balance property

    Σ_{i=1}^{n} v_i Ê[Y_i] = Σ_{i=1}^{n} v_i κ′(⟨β̂, x_i⟩) = Σ_{i=1}^{n} v_i Y_i.

  Otherwise, one should adjust the intercept β_0.


4
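The balance property under the canonical link can be verified on simulated data: fitting a Poisson GLM with log link by Newton's method, the fitted means re-aggregate the observations exactly. A Python/NumPy sketch with hypothetical data and all volumes v_i = 1 (Newton iteration standing in for the usual GLM fitting routine):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + 1 covariate
beta_true = np.array([0.2, 0.5])
Y = rng.poisson(np.exp(X @ beta_true)).astype(float)

beta = np.zeros(2)
for _ in range(25):               # Newton's method for the Poisson MLE
    mu = np.exp(X @ beta)         # canonical (log) link: mu = exp(<beta, x>)
    beta += np.linalg.solve(X.T @ (mu[:, None] * X), X.T @ (Y - mu))

mu = np.exp(X @ beta)
print(np.isclose(mu.sum(), Y.sum()))   # balance property: True
```

The balance holds because the MLE score equation for the intercept column reads Σ_i (Y_i − μ_i) = 0 under the canonical link.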
(Feed-Forward) Neural Networks

• Neural networks can be seen as extensions of GLMs.

• Neural networks perform representation learning:

    x_i ↦ μ_i = E[Y_i] = g^{-1} ⟨β, z^{(d:1)}(x_i)⟩,

  with learned representation z_i = z^{(d:1)}(x_i) of x_i.

• The neural network

    x ↦ z^{(d:1)}(x) = (z^{(d)} ∘ ··· ∘ z^{(1)})(x)

  processes the information x into a suitable form.

• The family of networks fulfills the universality property.

5
(Feed-Forward) Neural Networks

• Neural networks perform representation learning:

    x_i ↦ μ_i = E[Y_i] = g^{-1} ⟨β, z^{(d:1)}(x_i)⟩,

  with learned representation z_i = z^{(d:1)}(x_i) of x_i.

• Deep networks should be preferred to capture interactions more efficiently.

• Categorical variables can be integrated into (so-called) embedding layers.

• Time series, image and text data can be processed (in a similar fashion),
using different types of network architectures, but the general philosophy of
representation learning is the same.

6
(Feed-Forward) Neural Networks

• Neural networks perform representation learning:

    x_i ↦ μ_i = E[Y_i] = g^{-1} ⟨β, z^{(d:1)}(x_i)⟩,

  with learned representation z_i = z^{(d:1)}(x_i) of x_i.

• We have output parameter β ∈ R^{q_d+1}, and each hidden layer z^{(m)} has parameters (weights) (w_1^{(m)}, …, w_{q_m}^{(m)}) ∈ R^{q_m (q_{m−1}+1)}.

• This network parameter ϑ = (w_1^{(1)}, …, w_{q_d}^{(d)}, β) is fit with gradient descent methods, and early stopping is used to prevent (in-sample) over-fitting.

• Every different seed (starting point) of the gradient descent will provide a different (early stopped) network calibration ϑ̂.

7
(Feed-Forward) Neural Networks: Peculiarities
• There is no "unique best" network, but there are infinitely many sufficiently (equally) good networks.

• These sufficiently good networks have equally good predictive power on the portfolio level, but they can be quite different on the policy level.

• Aggregating/ensembling/nagging helps to reduce noise and, generally, improves predictive models.

• Typically, the balance property fails to hold. This requires an extra fitting (bias regularization) step.

• The LocalGLMnet provides an explainable network architecture that allows for variable selection, for a variable importance measure, and for the study of interactions.

• There is a LASSO version of the LocalGLMnet.

• Based on the learned structure one can still try to improve a GLM.
8
Convolutional Neural Networks

• CNNs process image and time series data.

• FNNs act globally, CNNs act locally.

• CNNs extract local structure via (small-size) filters (windows).

• Time series data and text data can also be processed with RNNs (not presented here, but there is an SAV tutorial). An RNN is an FNN with loops.

• Often one uses regression trees, random forests and tree boosting as competing data science models to FNNs.

• This works for tabular data; however, time series, image and text data do not have obvious non-network counterparts.

9
References: www.ActuarialDataScience.org
• Ferrario, Hämmerli (2019). On boosting: theory and applications. SSRN 3402687.
• Ferrario, Nägelin (2020). The art of natural language processing: classical, modern and contemporary approaches to
text document classification. SSRN 3547887.
• Ferrario, Noll, Wüthrich (2018). Insights from inside neural networks. SSRN 3226852.
• Lorentzen, Mayer (2020). Peeking into the black box: an actuarial case study for interpretable machine learning.
SSRN 3595944.
• Meier, Wüthrich (2020). Convolutional neural network case studies: (1) anomalies in mortality rates (2) image
recognition. SSRN 3656210.
• Noll, Salzmann, Wüthrich (2018). Case study: French motor third-party liability claims. SSRN 3164764.
• Rentzmann, Wüthrich (2019). Unsupervised learning: What is a sports car? SSRN 3439358.
• Richman, Wüthrich (2019). Lee and Carter go machine learning: recurrent neural networks. SSRN 3441030.
• Schelldorfer, Wüthrich (2019). Nesting classical actuarial models into neural networks. SSRN 3320525.
• Schelldorfer, Wüthrich (2021). LocalGLMnet: a deep learning architecture for actuaries. SSRN 3900350.

Many thanks for attending!

10
