
PRICING BY GEOGRAPHICAL ZONE:

AN APPLICATION WITH A SPANISH DATABASE

by

Núria Puig

December 2003

A dissertation submitted for the award of the degree of


Master of Science
in

Actuarial Science

Department of Actuarial Science and Statistics

CITY UNIVERSITY, London


CONTENTS

List of tables
List of figures
ACKNOWLEDGEMENTS
ABSTRACT
1 INTRODUCTION
    1.1 The problem under study
    1.2 Some references
    1.3 Outline
2 THE MODEL
    2.1 Introduction
    2.2 Bayesian approach
    2.3 The model for insurance risk
    2.4 The Gibbs sampler
    2.5 The Gibbs sampler for the insurance risk
3 NUMERICAL APPLICATION: A Spanish portfolio
    3.1 Preparing the database
    3.2 The standard factors
    3.3 Modelling the standard factors
    3.4 Estimators of the standard factors
    3.5 Assessing goodness of fit
4 SPATIAL ANALYSIS
    4.1 Modelling the spatial factors
    4.2 Analysis of the results
5 CONCLUSION
6 REFERENCES
7 APPENDICES
    Appendix I: A brief review of Generalised Linear Models
    Appendix II: Diagram of the process
    Appendix III: Explanation of the programmes
    Appendix IV: SAS programmes

Pricing by geographical zone 2


List of tables

Table 3.1: Database composition by year
Table 3.2: Variable ranges - possible values
Table 3.3: Variables selected and levels considered – Bodily Injury
Table 3.4: Variables selected and levels considered – Material Damage
Table 3.5: Stepwise Analysis (forward and backward) – Bodily Injury
Table 3.6: Stepwise Analysis (forward and backward) – Material Damage
Table 3.7: Estimated values – Bodily Injury
Table 3.8: Estimated values – Material Damage
Table 3.9: Parameters to assess goodness of fit
Table 3.10: Summary of the deviance residuals – Bodily Injury
Table 3.11: Summary of the deviance residuals – Material Damage
Table 4.1: Database composition by region
Table 4.2: $u_i$ and $v_i$ estimates - the Boskov and Verrall model results
Table 4.3: Actual and expected number of claims accounting for spatial effects



List of figures

Figure 3.1: Database composition by factor – driver age
Figure 3.2: Database composition by factor – driver license
Figure 3.3: Database composition by factor – vehicle age
Figure 3.4: Database composition by factor – loyalty
Figure 3.5: Database composition by factor – power
Figure 3.6: Database composition by factor – weight
Figure 3.7: Database composition by factor – type of fuel
Figure 3.8: Database composition by factor – policyholder sex
Figure 3.9: Database composition by factor – type of car
Figure 3.10: Database composition by factor – driver sex
Figure 3.11: Database composition by factor – number of doors
Figure 3.12: Comparison univariate values versus multivariate estimates – driver age - BI
Figure 3.13: Comparison univariate values versus multivariate estimates – driver license - BI
Figure 3.14: Comparison univariate values versus multivariate estimates – car power - BI
Figure 3.15: Comparison univariate values versus multivariate estimates – vehicle age - BI
Figure 3.16: Comparison univariate values versus multivariate estimates – driver sex - BI
Figure 3.17: Comparison univariate values versus multivariate estimates – loyalty - BI
Figure 3.18: Comparison univariate values versus multivariate estimates – driver age - MD
Figure 3.19: Comparison univariate values versus multivariate estimates – driver licence - MD
Figure 3.20: Comparison univariate values versus multivariate estimates – car power - MD
Figure 3.21: Comparison univariate values versus multivariate estimates – weight - MD
Figure 3.22: Comparison univariate values versus multivariate estimates – type of veh. - MD
Figure 3.23: Comparison univariate values versus multivariate estimates – weight - MD
Figure 3.24: Fitted values against deviance residuals – Bodily Injury
Figure 3.25: Histogram of the deviance residuals – Bodily Injury
Figure 3.26: Fitted values against deviance residuals – Material Damage
Figure 3.27: Histogram of the deviance residuals – Material Damage
Figure 4.1: Map of Spain with the 50 labelled “provincias”
Figure 4.2: Map of Spain with exposures-to-risk (number of policies-year)
Figure 4.3: Map of Spain with crude rates – Bodily Injury
Figure 4.4: Map of Spain with crude rates – Material Damage
Figure 4.5: Ratio actual divided by expected number of claims – Bodily Injury
Figure 4.6: Ratio actual divided by expected number of claims – Material Damage
Figure 4.7: Maximum a posteriori estimates of $u_i$ - Bodily Injury
Figure 4.8: Maximum a posteriori estimates of $u_i$ - Material Damage
Figure 4.9: Expected frequency accounting for spatial effects – Bodily Injury
Figure 4.10: Expected frequency accounting for spatial effects – Material Damage



ACKNOWLEDGEMENTS

I would like to thank Professor Richard Verrall for his guidance, supervision and

friendly approach.

Also my acknowledgment to Winterthur Seguros from Barcelona for the database used

and all the facilities they have given me.



ABSTRACT

This dissertation focuses on the premium rating area for personal insurance.

In terms of pricing, the current situation is based on the use of Generalised Linear

Model techniques to evaluate how different factors affect the size and number of

claims. Although this methodology is commonly accepted, there is not the same

agreement about how to deal with some related problems, such as the appropriate

grouping of the levels of some variables or how to obtain smoothed estimates for

specific factors. Both of these issues affect the geographical area variable, since we

are interested in defining homogeneous rating regions and estimating coefficients

that vary smoothly across them.

In view of these problems, Boskov and Verrall (1994) proposed a model for premium

rating by postcode area. The model follows a Bayesian approach and uses the Gibbs

sampler as a particular Markov Chain Monte Carlo method to solve computational

complexities.

The aim of this dissertation is to give a comprehensive explanation of this model and

provide an application of it to a Spanish motor portfolio, corresponding to the fifth

largest company of that market. The analysis refers to the frequency risk of passenger

cars and third party liability (bodily injury and material damage), and the information

corresponds to years 2000 and 2001.


Since the model assumes the correct estimation of the other factors apart from the

geographical area (called standard factors), an initial analysis has been performed so as

to estimate their coefficients by means of Generalised Linear Models and evaluate the

goodness of fit of the models.

The spatial rating is carried out afterwards and deals with the differences between the

actual number of claims and the estimated number of claims, having accounted for the

standard factors, in order to evaluate how much of this variation can be attributed to

spatial effects and how much is unexplained variation without a defined pattern. The

model operates by borrowing information from neighbouring areas, which are more

likely to be similar to the region considered, to estimate appropriate coefficients for

each region.

Once the spatial analysis is performed, the results highlight some difficulties in

applying the model in this country, which mainly refer to the different postcode

structure. So, although the model is viewed as a convenient solution to rate by

geographical zone, a detailed analysis to solve the postcode inefficiencies and identify

optimal areas is finally recommended to further progress in the implementation of the

model in Spain.



1 INTRODUCTION

1.1 The problem under study

The aim of actuaries is to assess risks as well as they can. This is important in terms of

equity as well as in terms of company profits. If the premium is higher than the risk

covered, competitors will attract this client with lower prices. By contrast, if the risk is

undercharged, the company will incur losses. So, to assess the risk properly, actuaries

try to find the variables which best describe the underwritten risk so as to charge the

fair premium to each insured person.

If we focus on motor insurance (as it is the example used in the numerical application

in chapters 3 and 4), there are many variables which are commonly used because of

their proven power for explaining and differentiating motor risks. That is, factors like

age and sex of the driver, driving experience, power of the car, use of the car, type of

fuel, weight of the car, etc. have been considered good explanatory variables to

discriminate between risks. A particular factor is the geographical zone (defined as the

area where the car is mainly driven). Although its explanatory value is also generally

accepted, many difficulties arise when using it for pricing purposes, which explains

the diverse treatment of this variable by different insurers.

The first problem to deal with is the decision about the number of regions to consider.

A first intuition is that the smaller the regions, the better the risk is

assessed. However, this leads to the next problem, which is having enough volume in

each region to calculate a reliable coefficient. Another important issue is the transition

between regions. It is sensible to consider that neighbouring areas are likely to be more

similar than those which are far apart. Needless to say, in terms of client

comprehension, it is also difficult to accept that adjacent regions could have a big

difference in premiums when other factors remain the same. In addition, big differences

could encourage fraudulent behaviour. Therefore, spatial coefficients should vary smoothly

across regions.

In this sense, the Boskov and Verrall (1994) model provides a solution to these

problems (smoothness and lack of information) by allowing the transfer of information

to and from neighbouring areas and thus, offering a way to deal with spatial assessment

of risks.

1.2 Some references

Although most of the research related to mapping and geographical zone risk

evaluation has been carried out in the archaeological and, especially, medical and

epidemiological fields, the results can be transferred to an insurance environment since

the target is similar, even though the nature of the risk is, obviously, different.

Some relevant papers on this matter have been published in medical journals such

as Statistics in Medicine, and a good contribution coming from the archaeological area

is the work of Julian Besag et al. (1991) about image restoration. However there is a

vast amount of literature about spatial evaluation of risks or similar characteristics in

these related areas.


As a brief reference to the short background in the insurance field, a first approach was

made by Taylor (1989) using bivariate splines to rate Householders insurance

portfolios. Boskov and Verrall (1994) provided an alternative method which this

dissertation is devoted to. Later on, both authors developed extensions to this latter

model, either introducing Whittaker graduation (Taylor, 1996) or including weighting

factors accounting for distances between regions (Dixon, Kelsey and Verrall, 2000).

1.3 Outline

This dissertation is organised as follows. Chapter 2 provides an overview of the Boskov

and Verrall model. The chapter starts with an examination of the Bayesian statistical

methods and continues by focusing on the Gibbs sampler as the practical way to

calculate a posteriori estimates. The methods are introduced in general first, and the

approach moves to an insurance environment later. The main references for this chapter

are the Boskov and Verrall paper (1994) and the work of Smith and Roberts (1993)

about Bayesian computation. The next two chapters refer to the practical application

with the Spanish portfolio. Chapter 3 includes all the work previous to the spatial

analysis, for bodily injury and material damage types of claims, from the preparation of

the database until the point where the estimates of the standard factors are obtained by

means of a Generalised Linear Model. This statistical methodology is covered from a

practical point of view. However, a brief review of the theory is included in appendix I.

The chapter ends with the assessment of the goodness of fit of the estimated models. In

chapter 4, the spatial analysis is carried out. The geographical zones are defined and the

models run to obtain the expected number of claims by region accounting for spatial

effects. The most interesting part is the last section related to the analysis of the results.

Finally, the dissertation concludes in chapter 5 with a review of the difficulties faced

and some proposals for future work to assess risks by geographical zone in Spain.



2 THE MODEL

2.1 Introduction

The starting point is the aim of evaluating a risk. Specifically, we are interested in

analysing its variation by geographical area. In order to do that, we can assume that the

area under study is divided into n regions. In practice, the regions correspond to

different postal districts, usually identified by the postal code.

Since the idea behind the model is that areas which are close are likely to be more

similar than those which are far apart, and that it is possible to borrow information from

neighbours to work out the region estimates, the concept of “neighbouring areas” has to

be defined.

We consider neighbouring areas to be those which are adjacent, in other words, regions

which border the one analysed. We define $\delta_i$ as the set of neighbouring areas of region

$i$. It could reasonably be argued that distances between regions, or other

elements, matter more than the borders themselves. A further extension of the

model has been developed allowing for distance weighting, as mentioned, but it will not

be part of this dissertation (see Dixon, Kelsey and Verrall, 2000).

We define $x_i$ as the true risk of region $i$ and $x$ as the vector of risks over the whole

area. In addition, let $y_i$ denote the observed data of area $i$ and $y$ the corresponding

vector.



2.2 Bayesian approach

In this particular problem, the main target is to identify the true underlying risk of each

area, so x is the unknown parameter of interest.

Following Bayesian statistics, unknown parameters are treated as random variables and

the first stage consists of expressing our “prior belief” about the parameter distribution

by defining its density function (prior distribution of $x$)

(2.2.1) $p(x) = p(x_1, x_2, \ldots, x_i, \ldots, x_n)$.

However, since this density function may contain unspecified hyperparameters that

need to be estimated in addition to $x$, it becomes useful to derive the conditional

density instead

(2.2.2) $p_i(x_i / x_1, x_2, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n), \quad i = 1, \ldots, n$

But, to be precise, $x_i$ does not depend on regions which are not in the neighbourhood of $i$,

so (2.2.2) can be simplified to

(2.2.3) $p_i(x_i / \delta_i)$.

The second step of Bayesian formulation is based on past experience, that is, the

random observed outcomes $y$. Having this information and assuming that the $y_i$ are

conditionally independent given $x$, the joint density of the sample values, or likelihood

function, can be calculated as

(2.2.4) $f(y / x) \propto \prod_{i=1}^{n} f(y_i / x_i)$

Now applying Bayes' Theorem

(2.2.5) $f(x / y) f(y) = f(y / x) f(x)$

(2.2.6) $p(x / y) \propto f(y / x) p(x)$

We thus obtain the posterior distribution of the true risk, given the sample

information, $p(x / y)$.

Once the posterior distribution has been derived, the most obvious Bayesian point

estimate is the one that maximises (2.2.6), especially if the maximum is unique.

However, this is not straightforward, as $p(x / y)$ is not easily derived,

although $p_i(x_i / \delta_i)$ can be calculated. This problem is overcome by generating the

empirical density from realisations of the posterior density so as to later calculate

maximum a posteriori estimates. This is done with a Markov Chain Monte Carlo

method, a variant of the Metropolis algorithm called the Gibbs sampler, which

is explained in more detail in section 2.4.
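
To make the idea of maximising (2.2.6) concrete, the following minimal sketch treats a single region in isolation, with no spatial structure. It assumes a Poisson likelihood and a gamma prior; the prior parameters, the observed count and the function names are hypothetical choices made only for illustration. The unnormalised posterior is evaluated on a grid and its maximiser is compared with the known mode of the resulting gamma posterior.

```python
import math

def log_posterior(x, y, a, b):
    """Unnormalised log posterior: Poisson log-likelihood plus gamma log-prior."""
    log_likelihood = y * math.log(x) - x          # log f(y / x), dropping -log(y!)
    log_prior = (a - 1.0) * math.log(x) - b * x   # log p(x), dropping the constant
    return log_likelihood + log_prior

def map_on_grid(y, a, b, lo=0.01, hi=40.0, step=0.01):
    """Return the grid point that maximises the unnormalised posterior."""
    n_points = int(round((hi - lo) / step)) + 1
    grid = [lo + i * step for i in range(n_points)]
    return max(grid, key=lambda x: log_posterior(x, y, a, b))

y_observed = 12                  # hypothetical number of claims in one region
a_prior, b_prior = 3.0, 0.25     # hypothetical gamma prior (shape, rate)

x_map = map_on_grid(y_observed, a_prior, b_prior)
# For this conjugate pair the posterior is gamma(a + y, b + 1), whose mode is known.
exact_mode = (a_prior + y_observed - 1.0) / (b_prior + 1.0)
print(x_map, exact_mode)
```

In the spatial model the posterior involves all regions jointly, so a grid search of this kind is no longer practical, which is the motivation for the simulation approach of section 2.4.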

2.3 The model for insurance risk

We now consider the insurance risk. Specifically, we want to estimate the frequency of

claims, which is the expected number of claims divided by the exposure in each region,

$x_i / r_i$. However, given the exposure values, it is more common to model the expected

number of claims rather than the frequency itself. In this sense, a Poisson distribution

turns out to be appropriate to model the number of events over a given period of time.


The purpose is to explain the variation in the number of claims by different variables

and estimate how much each variable contributes in determining the level of risk. The

variables are assumed to enter through a linear function, and the estimate of this function is

called the linear predictor $\eta_i$. A standard choice of link function between this linear

predictor and the variable that has to be explained (number of claims) is the logarithmic

function. Therefore, the model can be expressed as:

(2.3.1) $x_i = r_i \exp(\eta_i) = r_i \exp(\beta_0 + \beta_1 z_1 + \beta_2 z_2 + \ldots)$
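
As an illustration of how the log link turns the coefficients into multiplicative relativities, the sketch below evaluates the expected number of claims for a few profiles. The coefficient values, the two indicator variables (young driver, diesel) and the exposure are hypothetical and are not taken from the portfolio analysed later.

```python
import math

def expected_claims(exposure, beta0, betas, z):
    """Expected claims x = r * exp(beta0 + sum_j beta_j * z_j)."""
    eta = beta0 + sum(b * zj for b, zj in zip(betas, z))
    return exposure * math.exp(eta)

beta0 = math.log(0.08)                    # hypothetical base frequency of 8%
betas = [math.log(1.5), math.log(0.9)]    # hypothetical: young driver, diesel

exposure = 100.0                          # policy-years
base = expected_claims(exposure, beta0, betas, [0, 0])
young = expected_claims(exposure, beta0, betas, [1, 0])
young_diesel = expected_claims(exposure, beta0, betas, [1, 1])
print(base, young, young_diesel)
```

Each coefficient acts as a factor on the base frequency: here the young-driver profile has 1.5 times the base expected claims, and adding the diesel indicator multiplies that figure by 0.9.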

However, the linear predictor is made up of different components. On one hand there is

a part of the risk which can be explained by the standard variables (age, sex, power,

etc.). On the other hand there is the risk component which depends on the geographical

zone. That spatial component is the one we are especially interested in. Finally, there is

some random variation with no defined pattern.

So, for each region i , we define

$t_i$ as the component based on the standard factors. Each factor coefficient is

estimated using generalised linear models, commonly a standard Poisson

regression,

$u_i$ as the component with spatial structure,

$v_i$ as unexplained variation with no defined pattern.

Therefore, (2.3.1) can be alternatively specified as

(2.3.2) $x_i = r_i \exp(t_i + u_i + v_i)$


As far as the theory is concerned, we assume that the coefficients of the standard

factors have been correctly estimated using a Generalised Linear Model. This means

that the $t_i$ are already known and can be removed from the data and the model.

Defining $c_i = r_i \exp(t_i)$ and $\theta_i = \exp(u_i + v_i)$, (2.3.2) can be expressed as

(2.3.3) $x_i = c_i \theta_i$

Hence, the unknown parameter of interest is, in fact, $\theta_i$.
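
The decomposition can be sketched numerically as follows. The three regions, their exposures, standard-factor components and claim counts are hypothetical. The ratio of actual claims to $c_i$ is only a crude, unsmoothed indication of $\theta_i$; it is not the Bayesian estimate developed in the rest of this chapter.

```python
import math

def offset(exposure, t):
    """c_i = r_i * exp(t_i): expected claims from the standard factors alone."""
    return exposure * math.exp(t)

# Hypothetical regions: (exposure r_i, standard-factor component t_i, observed claims y_i)
regions = [
    (1000.0, math.log(0.05), 60),
    (400.0, math.log(0.06), 18),
    (2500.0, math.log(0.04), 100),
]

c = [offset(r, t) for r, t, _ in regions]
# Crude relativity: actual claims divided by claims expected from standard factors.
crude_theta = [y / ci for (_, _, y), ci in zip(regions, c)]
print(c, crude_theta)
```

A value above 1 indicates more claims than the standard factors explain. In small regions this ratio is very volatile, which is why information is borrowed from neighbouring regions.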

The first stage in Bayesian statistics is, as mentioned, to define the prior distribution of

the unknown parameter $\theta_i$. Assuming that the risk in each region $i$ only depends on the

neighbouring regions $\delta_i$, we may first define the conditional prior distribution of

$x_i / \delta_i$ in terms of $u_i$ and $v_i$. It is reasonable to assume that $u_i$ and $v_i$ are independent,

so we will look for the most appropriate distribution for each of them.

Since the $\{v_i : i = 1, \ldots, n\}$ do not follow any defined pattern, a normal distribution with

mean zero and unknown variance $\lambda$ will be considered. There are no reasons to use any other

distribution. Hence,

(2.3.4) $p(v_i) \propto \lambda^{-1/2} \exp\left(-\frac{1}{2\lambda} v_i^2\right)$

Turning to the spatial component $u_i$, we assume that its prior can be factorised into

components representing the dependence on each neighbouring region. Hence a

possible form for a pairwise difference prior is

(2.3.5) $p_i(u_i / u_1, \ldots, u_{i-1}, u_{i+1}, \ldots, u_n) \propto \exp\left(-\sum_{j \in \delta_i} \phi(u_i - u_j)\right)$

The function φ must reflect the spatial dependence so it should reduce when the

distance between regions increases. Meanwhile adjacent regions should get similar

values.

Additionally, $\phi$ could be preceded by a factor $\omega_{ij}$ to take account of other elements of

neighbouring regions such as distance, population at risk, etc. For the moment it

will be ignored, which is equivalent to making the simplest choice $\omega_{ij} = 1$.

The expression with weights would be

(2.3.6) $p_i(u_i / u_1, \ldots, u_{i-1}, u_{i+1}, \ldots, u_n) \propto \exp\left(-\sum_{j \in \delta_i} \omega_{ij}\, \phi(u_i - u_j)\right)$

Going back to the function $\phi$, two possible choices have been investigated in the literature.

One is $\phi(z) = z^2 / (2k)$, where $k$ is an unknown positive constant. In this case, (2.3.5)

becomes

(2.3.7) $p_i(u_i / u_1, \ldots, u_{i-1}, u_{i+1}, \ldots, u_n) \propto \exp\left(-\frac{1}{2k} \sum_{j \in \delta_i} (u_i - u_j)^2\right)$

and the expression for the vector of risk over the whole region is

(2.3.8) $p(u / k) \propto \frac{1}{k^{n_i/2}} \exp\left(-\frac{1}{2k} \sum_{j \in \delta_i} (u_i - u_j)^2\right)$

where $n_i$ is the cardinality of $\delta_i$.


The alternative choice is $\phi(z) = |z| / k$, where $k$ is an unknown scale parameter. Now

(2.3.5) becomes

(2.3.9) $p_i(u_i / u_1, \ldots, u_{i-1}, u_{i+1}, \ldots, u_n) \propto \frac{1}{k} \exp\left(-\frac{1}{k} \sum_{j \in \delta_i} |u_i - u_j|\right)$

The first option can be interpreted as a stochastic version of linear interpolation while

the second can be viewed as a stochastic version of the median filter.

The first formulation is considered in this dissertation.
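
The two interpretations can be checked numerically. The sketch below looks only at the prior conditional (2.3.5), ignoring the data, and finds by grid search the value of $u_i$ that minimises the total penalty for a set of neighbouring values. The neighbouring values (including one deliberate outlier), the grid step and the function names are hypothetical.

```python
import statistics

def conditional_mode(neighbour_values, phi, grid_step=0.001):
    """Grid search for the u_i that minimises sum_j phi(u_i - u_j)."""
    lo, hi = min(neighbour_values), max(neighbour_values)
    steps = int(round((hi - lo) / grid_step))
    best_u, best_cost = lo, float("inf")
    for s in range(steps + 1):
        u = lo + s * grid_step
        cost = sum(phi(u - uj) for uj in neighbour_values)
        if cost < best_cost:
            best_u, best_cost = u, cost
    return best_u

k = 1.0

def quadratic(z):
    return z * z / (2.0 * k)   # first choice of phi

def absolute(z):
    return abs(z) / k          # second choice of phi

neighbours = [0.10, 0.20, 0.30, 0.40, 1.50]   # hypothetical u_j; the last is an outlier
mode_quadratic = conditional_mode(neighbours, quadratic)
mode_absolute = conditional_mode(neighbours, absolute)
print(mode_quadratic, statistics.mean(neighbours))
print(mode_absolute, statistics.median(neighbours))
```

Under the quadratic choice the most likely value is the mean of the neighbours, which the outlier pulls upwards, while under the absolute-value choice it is their median, which is insensitive to the outlier.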

Finally, the prior density of the two hyperparameters which determine the variance of

$u$ and $v$ has to be defined. A natural choice would be proportional to $k^{-1}\lambda^{-1}$.

However, this particular choice is not suitable because of its behaviour near the origin

$u = v = 0$, $k = \lambda = 0$, and because the origin then becomes an

absorbing state of the Markov chain, which invalidates the Gibbs sampler. To avoid

these computational problems the following expression is assumed

(2.3.10) $\mathrm{prior}(k, \lambda) \propto \exp\left(-\frac{\varepsilon}{2k} - \frac{\varepsilon}{2\lambda}\right)$

where $\varepsilon$ is a small positive constant, say 0.01.

Having defined the distributions of $u$, $v$, $k$ and $\lambda$, the joint posterior density is given by

(2.3.11) $p(u, v, k, \lambda / y) \propto \prod_{i=1}^{n} f(y_i / x_i) \times k^{-n_i/2} \exp\left(-\frac{1}{2k} \sum_{j \in \delta_i} (u_i - u_j)^2\right) \lambda^{-1/2} \exp\left(-\frac{1}{2\lambda} v_i^2\right) \mathrm{prior}(k, \lambda)$


The second stage of Bayesian formulation relates to the observed outcomes.

Although various forms for $f(y_i / x_i)$ can be considered, a Poisson distribution is

assumed to be the most appropriate function to model the number of claims. Hence

(2.3.12) $f(y_i / x_i) = \frac{\exp(-x_i)\, x_i^{y_i}}{y_i!} = \frac{\exp(-c_i e^{u_i + v_i})\, (c_i e^{u_i + v_i})^{y_i}}{y_i!}$

Taking the structure of $f(y_i / x_i)$ into account, the joint posterior density of $u$, $v$, $k$ and

$\lambda$ finally becomes

(2.3.13) $p(u, v, k, \lambda / y) \propto \prod_{i=1}^{n} \frac{\exp(-c_i e^{u_i + v_i})\, (c_i e^{u_i + v_i})^{y_i}}{y_i!} \times k^{-n_i/2} \exp\left(-\frac{1}{2k} \sum_{j \in \delta_i} (u_i - u_j)^2\right) \lambda^{-1/2} \exp\left(-\frac{1}{2\lambda} v_i^2\right) \mathrm{prior}(k, \lambda)$

Now the remaining problem is to obtain maximum a posteriori estimates for the

parameters. However, since the process becomes mathematically difficult because of its

complexity, high dimensionality and multimodality, a suitable approximation technique

is used based on a version of a Markov Chain Monte Carlo method called the Gibbs

sampler.

2.4 The Gibbs sampler

The Gibbs sampler is an algorithm to generate realizations of a distribution function. As

has been said, it is a specific version of the approximation methodology known as

Markov Chain Monte Carlo method (MCMC).


The key idea of this methodology is that we want to generate a sample from a

distribution but it cannot be done directly. Instead, we can construct a Markov chain

whose stationary distribution is consistent with the distribution of interest. So if we then

run the chain for a long time, simulated values of the chain can be used to summarize,

under suitable conditions, features of the distribution. To implement this strategy, we

simply need algorithms for constructing chains with specified stationary distributions.

One of these algorithms is the Gibbs sampler which exploits conditional densities to

obtain realizations from the posterior density. It proceeds as follows:

Suppose $\pi(x) = \pi(x_1, \ldots, x_n)$ is the joint probability density function, and $\pi(x_i / x_{-i})$ the

conditional densities for each component $x_i$, given the values of the other components.

Suppose we want to generate a sample from $\pi(x)$, but this function is so complicated that

it becomes impossible to do this directly. Alternatively, we can set up a Markov chain

whose stationary distribution is $\pi(x)$. In order to do that, first we take some

arbitrary starting values $x^0 = (x_1^0, \ldots, x_n^0)$. Then, we make successive random drawings

from the full conditional distributions as follows:

$x_1^1$ from $\pi(x_1 / x_2^0, \ldots, x_n^0)$

$x_2^1$ from $\pi(x_2 / x_1^1, x_3^0, \ldots, x_n^0)$

$x_3^1$ from $\pi(x_3 / x_1^1, x_2^1, x_4^0, \ldots, x_n^0)$

...

$x_n^1$ from $\pi(x_n / x_{-n}^1)$

This completes a transition from $x^0 = (x_1^0, \ldots, x_n^0)$ to $x^1 = (x_1^1, \ldots, x_n^1)$. Iterations of this

cycle produce a sequence $x^0, x^1, \ldots, x^t, \ldots$ which is a realization of a Markov chain.

After obtaining a sufficient number of realizations, we may use the empirical density

generated to find maximum a posteriori estimates.

As for how many iterations are necessary to reach the stationary distribution, some

practical applications have shown that the chain generally has to run for about 1,000 steps

before converging to its stationary distribution. Once convergence has been obtained, a

sample of every 10th step over the next 10,000 steps usually provides a reasonable

estimate of the stationary distribution.

To check that the process has converged, it is recommended to run several chains in

parallel with different starting values. The chains should provide similar results; if

they do not, they have not yet converged and more iterations are necessary.
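
The cycle of conditional draws, the burn-in period, the thinning and the comparison of parallel chains can all be seen in a toy case. The sketch below is not the insurance model: it assumes a standard bivariate normal target with correlation `rho`, chosen only because its full conditionals are known exactly. The starting values, seeds and function names are hypothetical.

```python
import math
import random

def gibbs_bivariate_normal(rho, start, burn_in=1000, n_steps=10000, thin=10, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho * rho)
    x1, x2 = start
    kept = []
    for step in range(burn_in + n_steps):
        # Each component is drawn from its full conditional given the other one.
        x1 = rng.gauss(rho * x2, cond_sd)
        x2 = rng.gauss(rho * x1, cond_sd)
        # Discard the burn-in, then keep every `thin`-th step.
        if step >= burn_in and (step - burn_in) % thin == 0:
            kept.append((x1, x2))
    return kept

def mean_of_first(sample):
    return sum(p[0] for p in sample) / len(sample)

# Two chains with deliberately different starting values.
chain_a = gibbs_bivariate_normal(0.6, start=(5.0, -5.0), seed=1)
chain_b = gibbs_bivariate_normal(0.6, start=(-5.0, 5.0), seed=2)
mean_a, mean_b = mean_of_first(chain_a), mean_of_first(chain_b)
print(len(chain_a), mean_a, mean_b)
```

If the two chains gave clearly different means, that would suggest that the burn-in was too short for this target.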

2.5 The Gibbs sampler for the insurance risk

We now formulate the Gibbs sampler in terms of the frequency risk. In the terminology

used, at each step a value for $x_i$ is sampled at random from the density function

(2.5.1) $p_i(x_i / \delta_i, y)$

The values of the risk parameters in all regions other than $i$ included in $\delta_i$ are assumed

fixed at their current values, and each step involves sampling from each of the

distributions subsumed into $x_i$, that is, $u_i$, $v_i$, $k$ and $\lambda$.


Considering that the data follow a Poisson distribution, the conditional posterior of $u_i$ is given by

(2.5.2)
\[
\begin{aligned}
p(u_i \mid u_{-i}, v, k, \lambda, y) &\propto f(y_i \mid x_i)\, p(u_i \mid u_{-i}, k) \\
&\propto f(y_i \mid x_i) \exp\Big(-\frac{1}{2k} \sum_{j \in \delta_i} (u_i - u_j)^2\Big) \\
&\propto \exp\Big(-c_i \exp(u_i + v_i) + u_i y_i - \frac{n_i}{2k} (u_i - \bar{u}_i)^2\Big)
\end{aligned}
\]

where $u_{-i}$ denotes all values of $u$ except $u_i$ and $\bar{u}_i$ is the mean value of $u_j$ over $\delta_i$.

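Similarly, the conditional posterior of $v_i$ is of the form

(2.5.3)
\[
\begin{aligned}
p(v_i \mid v_{-i}, u, k, \lambda, y) &\propto f(y_i \mid x_i)\, p(v_i \mid \lambda) \\
&\propto f(y_i \mid x_i) \exp\Big(-\frac{1}{2\lambda} v_i^2\Big) \\
&\propto \exp\Big(-c_i \exp(u_i + v_i) + v_i y_i - \frac{1}{2\lambda} v_i^2\Big)
\end{aligned}
\]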

Additionally, the posterior distribution of the hyperparameters is

(2.5.4)
\[
p(k \mid u, v, \lambda, y) \propto p(u \mid v, \lambda, k, y)\, p(k \mid v, \lambda, y)
\propto k^{-n/2} \exp\Big\{-\frac{1}{2k}\Big(\varepsilon + \sum_{i \sim j} (u_i - u_j)^2\Big)\Big\}
\]

for $k$, and

(2.5.5)
\[
p(\lambda \mid u, v, k, y) \propto p(v \mid u, k, \lambda, y)\, p(\lambda \mid u, k, y)
\propto \lambda^{-n/2} \exp\Big\{-\frac{1}{2\lambda}\Big(\varepsilon + \sum_{i=1}^{n} v_i^2\Big)\Big\}
\]

for $\lambda$.


In practice, the $u_i$ and $v_i$ conditional distributions are sampled efficiently by a carefully designed rejection method, while the hyperparameter densities are sampled using standard techniques designed for $\chi^2$ distributions.
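As a hedged illustration of the $\chi^2$ connection: a density proportional to $k^{-n/2} \exp(-S/(2k))$, the kernel shown in (2.5.4) with $S = \varepsilon + \sum (u_i - u_j)^2$, implies that $S/k$ follows a $\chi^2$ distribution with $n - 2$ degrees of freedom (for $n > 2$). The sketch below uses that change of variable. It is one possible way to draw $k$, not necessarily the technique used in the original work, and the function name and numerical values are hypothetical.

```python
import random

def sample_k(sum_sq, n, eps, rng):
    """Draw k from a density proportional to k^(-n/2) * exp(-(eps + sum_sq) / (2k))."""
    s = eps + sum_sq
    # Under this density, s / k follows a chi-square law with n - 2 degrees
    # of freedom; a chi-square(nu) is a Gamma(shape=nu/2, scale=2).
    chi2 = rng.gammavariate((n - 2) / 2.0, 2.0)
    return s / chi2

rng = random.Random(0)
n, eps, sum_sq = 50, 0.01, 9.99  # hypothetical values, for illustration only
draws = [sample_k(sum_sq, n, eps, rng) for _ in range(20000)]
sample_mean = sum(draws) / len(draws)
theoretical_mean = (eps + sum_sq) / (n - 4)  # mean of the implied inverse-gamma law
print(round(sample_mean, 3), round(theoretical_mean, 3))
```

The same function would serve for $\lambda$ in (2.5.5) by passing $\sum v_i^2$ as `sum_sq`.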

In addition to the maximum a posteriori estimates, the Gibbs sampler provides an alternative way to estimate $u$, $v$, $k$ and $\lambda$ using approximations to their posterior means, which are estimated by the corresponding sample means. Hence

\[
\hat{u} = E(u \mid y), \qquad \hat{v} = E(v \mid y), \qquad \hat{k} = E(k \mid y), \qquad \hat{\lambda} = E(\lambda \mid y)
\]

However, since the joint posterior density of $u$ and $v$ given $k$, $\lambda$ and $y$, $p(u, v \mid k, \lambda, y)$, is log-concave and differentiable, it has a single maximum whenever $\phi(z)$ is a differentiable convex function of $z$. The maximum a posteriori estimates $u^*$ and $v^*$ of $u$ and $v$ given $\hat{k}$, $\hat{\lambda}$ and $y$ can then be calculated using the following expressions

\[
\sum_{i=1}^{n} v_i^* = 0 \qquad \text{and} \qquad \sum_{i=1}^{n} c_i \exp(u_i^* + v_i^*) = \sum_{i=1}^{n} y_i
\]

which provides an alternative to $\hat{u}$ and $\hat{v}$.

For further details about Bayesian computation via the Gibbs sampler, see Smith and

Roberts (1993).


3 NUMERICAL APPLICATION: A Spanish portfolio

In this section an application of the model is provided. The aim is to illustrate how to proceed with a real database until the results of the model are obtained and, thereafter, to analyse these results and draw some conclusions. The database belongs to the fifth-largest general insurance company in the Spanish market, and the information refers to motor insurance for the years 2000 and 2001.

All the work related to preparing the database, modelling the standard factors and calculating the estimated number of claims has been performed with the statistical software SAS. Appendix II contains a diagram of the process followed, and the following appendix explains the purpose of each programme. Finally, the programmes (in SAS syntax) are included in Appendix IV.

3.1 Preparing the database

The first step in the preparation of the database consists of selecting the relevant

information for the analysis.

On the one hand, it is recommended to model risks which are as homogeneous as possible. In terms of motor insurance, this means that different types of vehicle (passenger cars, vans, trucks, motorbikes, trailers, etc.) should be modelled separately, and the same advice applies to the type of claim. The risks covered by third party liability and own damage, for example, are completely different, so they should be part of different studies. Accordingly, the analysis has been performed for passenger cars only and for the third party liability peril, distinguishing between bodily injury and material damage.

On the other hand, the premium is made up of two components: frequency and severity.

Since these two random variables have different distributions, it is also convenient to

model them separately to obtain a better rating. This particular numerical application

refers to claim frequency.

Finally, the database selected consists of 1,044,006 policies, 19,699 bodily injury (BI)

and 93,040 material damage (MD) claims with the distribution between years as shown

in Table 3.1.

Year     Policies    BI claims   MD claims
2000       532,567      10,716      50,363
2001       511,439       8,983      42,677
Total    1,044,006      19,699      93,040
Table 3.1: Database composition by year

The number of claims corresponds to those that have occurred during these two years

by policies in force, so they include current and IBNR (incurred but not reported)

claims which are reported by the end of year 2001.

3.2 The standard factors

The factors considered to assess the risk are those currently used by the company. They

refer mainly to the characteristics of the vehicle and of the driver and are listed below:

Characteristics of the vehicle
    o Power of the car
    o Number of doors
    o Type of fuel
    o Weight
    o Ratio of weight to power
    o Age of the vehicle
    o Type of car (4-wheel drive, monovolume, others)
Characteristics of the driver
    o Age of the driver
    o Driving experience (years since obtaining the driving licence)
    o Sex of the driver
    o Loyalty (number of years with the company)
Sex of the policyholder
Occasional drivers allowed
No claims discount level
Geographical zone

The power of the car is one of the most traditional factors which is applied by nearly all

of the companies. In some countries a classification of the cars in a number of groups is

used instead. However, the same idea is behind both approaches.

The variable number of doors has been included recently as a standard factor in the

company. A smaller, 3-door car is commonly the second car of the family which is

driven mainly within the city. Alternatively it is sometimes used by the children in their

initial driving experience. Therefore, the inclusion of this variable tries to capture and

differentiate this particular behaviour of risk.

The type of fuel was supposed to reflect how often a car was driven. Because of their higher purchase price, diesel cars were mainly bought by people who drive a lot; since the fuel is cheaper, this type of car was a good deal for them. In this sense, a diesel car was equivalent to a higher effective exposure to risk. However, the price gap between diesel and petrol cars has narrowed in recent years, so this standard factor tends to be less important.


Another new variable is the ratio of weight to power. Once again, this tries to differentiate a particular kind of risk, namely cars with high power and low weight. These types of cars are considered very dangerous. In fact, they are known as “flying cars” in Spain or “hot hatches” in the UK.

The second group of factors refers to the driver. All these variables are very similar in

all insurance companies, with the driver age being the most important by far.

Nevertheless, some differences can be mentioned in relation to the driving experience.

That is, this factor is sometimes removed because of its high correlation with driver

age.

Finally, some comments on the possibility of including an occasional driver. This option was introduced some years ago with the intention of uncovering fraudulent behaviour. A common practice consisted of declaring the driver named in the policy in the claim report, even though he was not driving when the claim occurred. So, the purpose of including this factor was to charge an additional amount when other drivers were declared, with no additional claims cost, since the company was already paying these claims. However, the results were worse than expected and more claims were declared, so this option is no longer available. Nevertheless, the factor is still considered since some policies with this option are still in force.

After selecting the information and the explanatory variables, some routine work has

been done cleaning the database, which means checking that the information is correct.

The possible values that each factor can take are described in Table 3.2. Any value outside these ranges has been invalidated (considered missing).


FACTOR                     TYPE   VARIABLE FROM FILES   MEASURE       POSSIBLE VALUES
Power                      Num    Cvdin                 Horse power   5 to 500
Number of doors            Num    Puertas                             3 or 5
Type of fuel               Char   Combust                             D or G
Weight                     Num    Peso                  Kg.           0 to 3000
Weight/power               Num    Pespot                              0 to 600
Vehicle age                Num    Antvehi               years         0 to 50
Type of car                Char   Tip                                 TT, MON, NTM
Driver age                 Num    Edad                  years         18 to 90
Driving experience         Num    Antcarne              years         0 to 72
Driver sex                 Char   Sexcon                              V, M, missing*
Loyalty                    Num    Antwin                years         0 to 99
Policyholder sex           Char   Sexcli                              E, missing
Occasional driver          Char   Conoc                               S, N
No claims discount level   Char   Bm                                  -65 to 200
Geographical zone          Char   Cpcirc                              5 digit code
Table 3.2: Variable ranges - possible values
* missing values correspond to company cars where the driver is not specified

Note that

    G stands for petrol and D for diesel
    V stands for men, M for women and E for company
    S stands for yes and N for no
    TT stands for 4-wheel drive, MON for monovolume car and NTM for the rest
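The cleaning rule described above (any value outside the ranges of Table 3.2 is treated as missing) can be sketched as follows. The original work was carried out in SAS; this Python version is only an illustration. The dictionary layout, the function name and the example record are assumptions, while the ranges and codes are copied from Table 3.2 for a subset of the factors.

```python
# Valid ranges and codes transcribed from Table 3.2 (subset of the factors).
NUMERIC_RANGES = {
    "cvdin": (5, 500),     # power, horse power
    "peso": (0, 3000),     # weight, kg
    "antvehi": (0, 50),    # vehicle age, years
    "edad": (18, 90),      # driver age, years
    "antcarne": (0, 72),   # driving experience, years
}
ALLOWED_CODES = {
    "combust": {"D", "G"},
    "tip": {"TT", "MON", "NTM"},
    "conoc": {"S", "N"},
}

def clean_record(record):
    """Return a copy of `record` with out-of-range values set to None (missing)."""
    cleaned = dict(record)
    for name, (low, high) in NUMERIC_RANGES.items():
        value = cleaned.get(name)
        if value is not None and not (low <= value <= high):
            cleaned[name] = None
    for name, codes in ALLOWED_CODES.items():
        value = cleaned.get(name)
        if value is not None and value not in codes:
            cleaned[name] = None
    return cleaned

# Hypothetical policy record: driver age and fuel code are invalid.
policy = {"cvdin": 90, "peso": 1150, "edad": 17, "combust": "X", "tip": "MON"}
cleaned = clean_record(policy)
print(cleaned)
```

Keeping the invalid entries as missing, rather than dropping the policy, matches the "mis" level that appears for several factors in the later tables.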

For an overview of the composition of the database after the cleaning process has been

performed, some univariate graphs are included. They display the exposure and

frequency of claims (BI and MD) for the main standard factors considered.

[Figure: two panels, "Exposure - driver age" and "Frequency - driver age" (series BI and MD), by driver age.]

Figure 3.1: Database composition by factor – driver age


[Figure: two panels, "Exposure - driver license" and "Frequency - driver license" (series BI and MD), by years of driving experience.]

Figure 3.2: Database composition by factor – driver license

[Figure: two panels, "Exposure - vehicle age" and "Frequency - vehicle age" (series BI and MD), by vehicle age.]

Figure 3.3: Database composition by factor – vehicle age

[Figure: two panels, "Exposure - loyalty" and "Frequency - loyalty" (series BI and MD), by years with the company.]

Figure 3.4: Database composition by factor – loyalty


[Figure: two panels, "Exposure - power" and "Frequency - power" (series BI and MD), by horse power.]

Figure 3.5: Database composition by factor – power

[Figure: two panels, "Exposure - weight" and "Frequency - weight" (series BI and MD), by weight.]

Figure 3.6: Database composition by factor – weight

[Figure: two panels, "Exposure - type of fuel" and "Frequency - type of fuel" (series BI and MD), for categories D, G and missing.]

Figure 3.7: Database composition by factor – type of fuel


[Figure: two panels, "Exposure - policyholder sex" and "Frequency - policyholder sex" (series BI and MD), for categories E, M, V and missing.]

Figure 3.8: Database composition by factor – policyholder sex

[Figure: two panels, "Exposure - type of car" and "Frequency - type of car" (series BI and MD), for categories MON, NTM and TT.]

Figure 3.9: Database composition by factor – type of car

[Figure: two panels, "Exposure - driver sex" and "Frequency - driver sex" (series BI and MD), for categories M, V and missing.]

Figure 3.10: Database composition by factor – driver sex


[Figure: two panels, "Exposure - number of doors" and "Frequency - number of doors" (series BI and MD), for categories 3, 5 and missing.]

Figure 3.11: Database composition by factor – number of doors

In view of the univariate graphs, the variables driver sex and number of doors do not seem to be very relevant. In addition, it is quite likely that the estimates will increase with the weight and the power of the car, and decrease with the number of years with the company (loyalty) and the vehicle age. By contrast, driver age and driving experience behave in a singular way. For these two variables, the frequency decreases across the lower age groups and increases again for older people, with a surprising hump in the middle age groups.

It will be interesting to compare these figures with the multivariate outcomes.

The last step of the pre-modelling stage consists of grouping the possible answers of the

numerical variables in order to reduce the number of levels. This will also reduce the

possible combinations of rating factors or cells. The grouping could be based on

empirical knowledge or by means of other techniques like clustering, multivariate

grouping, etc. Again the levels considered are those in force in the company.


The variable levels are shown in Tables 3.3 and 3.4. We include only the variables that were later selected in each model. How these variables were selected is explained in section 3.3.

As will be seen, the large number of levels for the variable driver age is probably the most surprising component. The reason for this treatment is to avoid jumps as the insured remains with the company and grows older. In some sense, this gives a kind of continuous treatment in a discrete way.

Third Party Liability - Bodily Injury

Variable             Levels
Driver age           18-21, 22, 23, 24, 25, 26, 27, 28-29, 30-31, 32-33, 34-35, 36-37, 38-39, 40-41, 42-43, 44-45, 46-47, 48-49, 50-51, 52-53, 54-55, 56-57, 58-59, 60-64, 65-70, 71-75, +75
Driving experience   0-2, 3-5, 6-9, 10-13, 14-22, 23-32, 33-39, +39
Loyalty              0, 1-9, +9
Vehicle age          0, 1-10, +10
Power                0-49, 50-75, 76-94, 95-129, 130-160, 161-199, +199
Weight               0-799, 800-1199, 1200-1499, 1500-1799, +1799
Weight/power         0-9, 10-18, +18

Table 3.3: Variables selected and levels considered – Bodily Injury


Third Party Liability - Material Damage

Variable             Levels
Driver age           18-21, 22, 23, 24, 25, 26, 27, 28, 29, 30-31, 32-33, 34-35, 36-37, 38-39, 40-41, 42-43, 44-45, 46-47, 48-49, 50-51, 52-53, 54-55, 56-57, 58-60, 61-64, 65-70, 71-74, +74
Driving experience   0, 1, 2, 3, 4-6, 7-8, 9-10, 11-14, 15-23, 24-32, 33-38, +38
Loyalty              0, 1-9, +9
Vehicle age          0, +0
Power                0-49, 50-75, 76-94, 95-129, 130-160, 161-199, +199
Weight               0-799, 800-1199, 1200-1499, 1500-1799, +1799

Table 3.4: Variables selected and levels considered – Material Damage
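The grouping of a numerical variable into levels can be sketched as a lookup on the lower bounds of each band. The example below uses the power bands listed in the tables above; the function name and the treatment of negative or absent values as missing are assumptions made for the illustration, and integer horse power is assumed.

```python
from bisect import bisect_right

# Lower bounds and labels of the power levels listed in Tables 3.3 and 3.4.
POWER_LOWER_BOUNDS = [0, 50, 76, 95, 130, 161, 200]
POWER_LABELS = ["0-49", "50-75", "76-94", "95-129", "130-160", "161-199", "+199"]

def power_level(horse_power):
    """Map a horse-power value to its rating level; None means missing."""
    if horse_power is None or horse_power < 0:
        return None
    # Index of the last lower bound that does not exceed the value.
    index = bisect_right(POWER_LOWER_BOUNDS, horse_power) - 1
    return POWER_LABELS[index]

levels = [power_level(hp) for hp in (45, 75, 76, 199, 200, None)]
print(levels)
```

The same pattern applies to the other numerical factors by swapping in their own bounds and labels.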

3.3 Modelling the standard factors

The variable number of claims has been modelled using a Poisson regression model

which is a specific case of the Generalised Linear Model family.

The main idea behind the Generalised Linear Models is that we want to model a

response by means of a linear function involving one or more explanatory variables. So


the aim is to identify the most influential factors and, thereafter, estimate or measure

this influence.

The linear function of the explanatory variables is called the linear predictor and a

relationship between this linear predictor and the response variable can be defined,

which is known as the link function.

In fact, one of the powerful features of the Generalised Linear Models approach is this

possibility of defining different functions between the dependent variable (the variable

which has to be explained) and the explanatory variables, which makes the method very

useful and versatile.

Just to clarify, it can be said that the simple linear regression is a special case of a

Generalised Linear Model where the link function is the identity.

For additional details about Generalised Linear Models, a brief review of this

methodology is included in appendix I.

Modelling involves three steps: 1) specifying the model; 2) identifying the subset of variables which provides a good estimation; and 3) calculating the estimated coefficients of these variables.

Focusing on the number of claims, the Poisson distribution is considered to be the one that best describes this variable, and the suitable link function is the logarithm. So the model is specified as


\[
\frac{x_i}{r_i} = \exp(\eta_i) = \exp(\beta_0 + \beta_1 z_1 + \beta_2 z_2 + \dots)
\]

or alternatively

\[
\ln(x_i) = \ln(r_i) + \eta_i = \ln(r_i) + (\beta_0 + \beta_1 z_1 + \beta_2 z_2 + \dots)
\]

where, following previous notation, $x_i$ is the expected number of claims (the response variable), $r_i$ is the risk exposure, $z_1, z_2, \dots$ are the explanatory variables and $\beta_0, \beta_1, \dots$ are the coefficients that have to be estimated.

As the second expression shows, since we model the number of claims instead of the frequency, the logarithm of the exposure has to be included as an offset variable.
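To make the role of the offset concrete, here is a minimal sketch of a Poisson regression with log link, one binary explanatory variable and $\ln(r_i)$ as offset, fitted by Newton-Raphson. It is not the SAS procedure used in this work; the function name and the two-cell portfolio are hypothetical. With a single binary factor the fitted frequencies should reproduce claims divided by exposure in each cell, which gives an easy check.

```python
import math

def fit_poisson_offset(cells, n_iter=25):
    """Newton-Raphson fit of ln(x_i) = ln(r_i) + b0 + b1 * z_i.

    `cells` holds (z_i, r_i, y_i): covariate, exposure, observed claims.
    """
    total_claims = sum(y for _, _, y in cells)
    total_exposure = sum(r for _, r, _ in cells)
    b0, b1 = math.log(total_claims / total_exposure), 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for z, r, y in cells:
            mu = r * math.exp(b0 + b1 * z)  # expected claims, offset included
            g0 += y - mu                    # score components
            g1 += z * (y - mu)
            h00 += mu                       # information matrix entries
            h01 += z * mu
            h11 += z * z * mu
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical portfolio: z = 1 marks one level of a binary rating factor.
cells = [(0, 1000.0, 50.0), (1, 500.0, 50.0)]
b0, b1 = fit_poisson_offset(cells)
base_frequency = math.exp(b0)  # claims per unit of exposure at z = 0
relativity = math.exp(b1)      # multiplicative factor for z = 1
print(round(base_frequency, 4), round(relativity, 4))
```

Here the cell with z = 0 has 50 claims on 1,000 units of exposure and the cell with z = 1 has 50 claims on 500, so the fit should recover a base frequency of 0.05 and a relativity of 2.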

The second step, identifying the relevant subset of variables, implies finding a balance between accuracy and simplicity. The variables which best explain the risk have to be selected, but always taking into account what is known as parsimony. This criterion implies that a simpler model which describes the data adequately is preferred to a more complicated one when the latter does not significantly improve the goodness of fit.

There are several procedures to identify the relevant variables. However, all of them are

based on adding and deleting terms from the model and testing the significance of the

terms introduced or removed.

On the one hand, there is the backward-type selection technique, which starts with all the explanatory variables and proceeds by removing the least significant one at each step. On the other hand, there is the forward-type selection which, by contrast, starts with only the most explanatory variable and at each step introduces the next most relevant one into the model. Finally, the stepwise procedure combines the forward and backward techniques. Initially it proceeds as a forward-type selection, but at each step the possibility of removing a variable previously introduced is also checked.

This last procedure has been used. The final variables retained in each of the models and their level of significance (forward and backward analysis) are shown in the next tables.

Forward Analysis
Variable retained Deviance Num DF F Value Pr > F Chi-Square Pr > ChiSq
INTERCEPT 88553.819
EDAD 88178.114 27 14.78 <.0001 399.14 <.0001
CARNET 88138.481 8 5.26 <.0001 42.1 <.0001
ANTVEHI 87921.066 3 76.99 <.0001 230.98 <.0001
ANTWIN 87785.847 2 71.83 <.0001 143.65 <.0001
CVDIN 87445.863 7 51.6 <.0001 361.19 <.0001
PESO 87155.754 5 61.64 <.0001 308.2 <.0001
PESPOT 87103.696 3 18.43 <.0001 55.3 <.0001
COMBUST 86907.42 2 104.26 <.0001 208.52 <.0001
SEXCLI 86899.866 1 8.02 0.0046 8.02 0.0046
FAINPER 86841.956 2 30.76 <.0001 61.52 <.0001
CONOC 86751.393 1 96.21 <.0001 96.21 <.0001
TIP 86604.769 2 77.88 <.0001 155.77 <.0001
SEXCON 86533.681 2 37.76 <.0001 75.52 <.0001
PUERTAS 86507.677 2 13.81 <.0001 27.63 <.0001

Backward Analysis

Variable retained Num DF F Value Pr > F Chi-Square Pr > ChiSq


EDAD 27 4.34 <.0001 117.05 <.0001
CARNET 8 5.46 <.0001 43.68 <.0001
ANTVEHI 3 58.17 <.0001 174.5 <.0001
ANTWIN 2 83.51 <.0001 167.02 <.0001
CVDIN 7 8.23 <.0001 57.61 <.0001
PESO 5 16.2 <.0001 80.98 <.0001
PESPOT 3 10.98 <.0001 32.95 <.0001
COMBUST 2 89.67 <.0001 179.34 <.0001
SEXCLI 1 8.14 0.0043 8.14 0.0043
FAINPER 2 28.69 <.0001 57.38 <.0001
CONOC 1 102.48 <.0001 102.48 <.0001
TIP 2 87.56 <.0001 175.12 <.0001
SEXCON 2 39.68 <.0001 79.35 <.0001
PUERTAS 2 13.81 <.0001 27.63 <.0001
Table 3.5: Stepwise Analysis (forward and backward) – Bodily Injury

Forward Analysis
Variable retained Deviance Num DF F Value Pr > F Chi-Square Pr > ChiSq
INTERCEPT 203881.209
EDAD 203362.153 28 18.13 <.0001 507.63 <.0001
CARNET 203292.77 12 5.65 <.0001 67.86 <.0001
ANTWIN 202565.232 2 355.76 <.0001 711.52 <.0001
CVDIN 201460.877 7 154.29 <.0001 1080.04 <.0001
PESO 200414.605 5 204.65 <.0001 1023.24 <.0001
COMBUST 200046.261 2 180.12 <.0001 360.24 <.0001
ANTVEHI 199487.372 2 273.29 <.0001 546.59 <.0001
TIP 199232.634 2 124.57 <.0001 249.13 <.0001
SEXCLI 199125.07 1 105.2 <.0001 105.2 <.0001
SEXCON 198970.928 2 75.37 <.0001 150.75 <.0001
PUERTAS 198964.329 2 3.23 0.0397 6.45 0.0397
FAINPER 198599.34 2 178.48 <.0001 356.95 <.0001
CONOC 198506.783 1 90.52 <.0001 90.52 <.0001

Backward Analysis

Variable retained Num DF F Value Pr > F Chi-Square Pr > ChiSq


EDAD 28 4.74 <.0001 132.84 <.0001
CARNET 12 6.13 <.0001 73.55 <.0001
ANTWIN 2 314.01 <.0001 628.01 <.0001
CVDIN 7 64.88 <.0001 454.17 <.0001
PESO 5 61.97 <.0001 309.85 <.0001
COMBUST 2 174.37 <.0001 348.75 <.0001
ANTVEHI 2 245.55 <.0001 491.09 <.0001
TIP 2 122.19 <.0001 244.38 <.0001
SEXCLI 1 105.64 <.0001 105.64 <.0001
SEXCON 2 93.09 <.0001 186.17 <.0001
PUERTAS 2 4.31 0.0134 8.62 0.0134
FAINPER 2 168.4 <.0001 336.8 <.0001
CONOC 1 90.52 <.0001 90.52 <.0001
Table 3.6: Stepwise Analysis (forward and backward) – Material Damage


It can be seen that the factors retained are all highly significant.
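The text does not state how the Chi-Square and F Value columns are derived. As a hedged observation, the bodily injury figures in the forward analysis appear consistent with dividing each reduction in deviance by the square of the Scale value reported in Table 3.7 (0.9702), and then dividing by the degrees of freedom to obtain F. The short check below uses only numbers copied from those tables; the function name is hypothetical.

```python
# Figures copied from the forward analysis in Table 3.5 (bodily injury).
deviances = [
    ("INTERCEPT", 88553.819, None),
    ("EDAD", 88178.114, 27),
    ("CARNET", 88138.481, 8),
]
scale = 0.9702  # "Scale" row of Table 3.7

def scaled_tests(rows, scale):
    """Return (variable, chi_square, f_value) for each term added in turn."""
    results = []
    for (_, previous, _), (name, current, df) in zip(rows, rows[1:]):
        # Reduction in deviance, rescaled by the squared scale parameter.
        chi_square = (previous - current) / scale ** 2
        results.append((name, chi_square, chi_square / df))
    return results

tests = scaled_tests(deviances, scale)
for name, chi_square, f_value in tests:
    print(name, round(chi_square, 2), round(f_value, 2))
```

For EDAD and CARNET this reproduces the tabulated values (399.14 and 42.1 for Chi-Square, 14.78 and 5.26 for F) to the precision shown.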

The variable related to the no claims discount deserves special comment. The no claims discount system (NDS) rewards those drivers who do not make claims. However, the discount granted is far from being a technical or real one, that is, a discount which matches the actual cost avoided. Instead, there are mainly commercial reasons behind it. So, if the results of the models were applied directly, the tariff would be clearly insufficient because the NDS introduces disequilibria. Therefore, to preserve the tariff sufficiency, a loading has to be introduced. Two possibilities can be considered: introducing a homogeneous increase, or charging in a discriminating way according to the explanatory variables. The second approach has been selected which, from a practical point of view, means including the logarithm of this variable, the current level of discount, as an offset variable.

More generally, any restriction on the model, or any estimate that has to be forced to a given value, is handled by including these values in the offset variable.
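As a sketch of what this means for the model equation, write $d_i$ for the current no claims discount level of policy $i$, expressed as a multiplicative factor. The symbol $d_i$ and the assumption about its units are introduced here only for illustration. The offset then contains both the exposure and the discount level:

\[
\ln(x_i) = \ln(r_i) + \ln(d_i) + \beta_0 + \beta_1 z_1 + \beta_2 z_2 + \dots
\]

Because the coefficient of $\ln(d_i)$ is fixed at one rather than estimated, the remaining coefficients are fitted on top of the discount actually granted.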

3.4 Estimators of the standard factors

With all these considerations taken into account, the model has been calculated and the

estimated parameters obtained in each model are shown in the tables 3.7 and 3.8.

Analysis of Parameter Estimates

Parameter   DF   Estimate   Standard Error   Wald 95% Confidence Limits (lower, upper)   Chi-Square   Pr > ChiSq
Intercept 1 -5.3489 0.1086 -5.5617 -5.136 2425.98 <.0001
edad 18-21 1 0.4553 0.1627 0.1363 0.7742 7.83 0.0051
edad 22 1 0.083 0.1786 -0.267 0.433 0.22 0.6421
edad 23 1 0.0406 0.1488 -0.2511 0.3323 0.07 0.785
edad 24 1 0.3016 0.1099 0.0862 0.517 7.53 0.0061
edad 25 1 0.0863 0.0975 -0.1047 0.2773 0.78 0.3759
edad 26 1 0.1925 0.0837 0.0284 0.3566 5.29 0.0215
edad 27 1 0.1466 0.0768 -0.0039 0.2971 3.64 0.0562
edad 28-29 1 0.1014 0.0548 -0.0061 0.2089 3.42 0.0644
edad 30-31 1 0.1536 0.0499 0.0557 0.2514 9.46 0.0021
edad 32-33 1 0.0604 0.0453 -0.0283 0.1491 1.78 0.1817
edad 34-35 0 0 0 0 0 . .
edad 36-37 1 -0.0909 0.0449 -0.179 -0.0028 4.09 0.043
edad 38-39 1 -0.0777 0.0445 -0.1649 0.0094 3.06 0.0804
edad 40-41 1 0.0084 0.043 -0.0759 0.0927 0.04 0.8445
edad 42-43 1 0.027 0.0442 -0.0595 0.1136 0.37 0.5405
edad 44-45 1 0.0611 0.0448 -0.0267 0.1488 1.86 0.1724
edad 46-47 1 0.0994 0.0448 0.0116 0.1872 4.93 0.0265
edad 48-49 1 0.1272 0.0448 0.0394 0.215 8.06 0.0045
edad 50-51 1 0.0541 0.045 -0.0342 0.1423 1.44 0.2298
edad 52-53 1 0.0424 0.0456 -0.047 0.1318 0.87 0.3523
edad 54-55 1 0.0569 0.0463 -0.0339 0.1477 1.51 0.2196
edad 56-57 1 -0.0374 0.0489 -0.1332 0.0584 0.59 0.444
edad 58-59 1 -0.0159 0.0515 -0.1169 0.0851 0.1 0.7575
edad 60-64 1 -0.0518 0.0464 -0.1428 0.0391 1.25 0.2639
edad 65-70 1 -0.1578 0.0498 -0.2554 -0.0602 10.04 0.0015
edad 71-75 1 -0.1781 0.0633 -0.3022 -0.054 7.91 0.0049
edad +75 1 -0.1106 0.0745 -0.2566 0.0353 2.21 0.1374
edad mis 1 -0.3346 0.2286 -0.7826 0.1135 2.14 0.1433
carnet 0-2 1 0.0471 0.1201 -0.1883 0.2824 0.15 0.695
carnet 3-5 1 0.0081 0.0657 -0.1207 0.1369 0.02 0.9021
carnet 6-9 1 -0.0072 0.0392 -0.0841 0.0697 0.03 0.8542
carnet 10-13 1 0.0098 0.0281 -0.0452 0.0648 0.12 0.7259
carnet 14-22 0 0 0 0 0 . .
carnet 23-32 1 -0.0088 0.0221 -0.0521 0.0345 0.16 0.6897
carnet 33-39 1 -0.0373 0.0301 -0.0964 0.0218 1.53 0.216
carnet +39 1 -0.2512 0.0494 -0.3481 -0.1542 25.8 <.0001
carnet mis 1 0.7964 0.2405 0.3249 1.2678 10.96 0.0009
antvehi 1-10 1 0.3148 0.0375 0.2412 0.3883 70.41 <.0001
antvehi +10 1 0.1383 0.0402 0.0596 0.217 11.87 0.0006
antvehi mis 1 0.0435 0.2176 -0.383 0.4701 0.04 0.8414
antvehi 0 0 0 0 0 0 . .
antwin 1-9 1 -0.1019 0.0195 -0.14 -0.0637 27.37 <.0001
antwin +9 1 -0.3446 0.0278 -0.399 -0.2902 153.99 <.0001
antwin 0 0 0 0 0 0 . .
cvdin 0-49 1 -0.2035 0.0408 -0.2835 -0.1235 24.84 <.0001
cvdin 50-75 1 -0.0052 0.0198 -0.044 0.0336 0.07 0.7941
cvdin 76-94 0 0 0 0 0 . .
cvdin 94-129 1 -0.0341 0.0239 -0.0809 0.0126 2.05 0.1524
cvdin 129-160 1 -0.0506 0.04 -0.129 0.0278 1.6 0.2055
cvdin 160-199 1 -0.1624 0.0589 -0.2777 -0.047 7.61 0.0058
cvdin +199 1 -0.3566 0.0777 -0.5089 -0.2044 21.07 <.0001
cvdin mis 1 -0.3909 0.2394 -0.8602 0.0783 2.67 0.1025
peso 0-7 1 -0.5964 0.9488 -2.456 1.2631 0.4 0.5296
peso 8-11 1 -0.3881 0.9479 -2.246 1.4699 0.17 0.6823
peso 12-14 1 -0.5384 0.9476 -2.3958 1.3189 0.32 0.5699
peso 15-17 1 -0.5733 0.9475 -2.4304 1.2838 0.37 0.5451
peso +17 1 -0.5065 0.9479 -2.3644 1.3514 0.29 0.5931
peso mis 0 0 0 0 0 . .
pespot 0-9 1 0.9182 0.9505 -0.9448 2.7812 0.93 0.3341
pespot 10-18 1 0.876 0.9498 -0.9855 2.7376 0.85 0.3564
pespot +18 1 0.602 0.949 -1.258 2.4619 0.4 0.5258
pespot mis 0 0 0 0 0 . .
combust G 0 0 0 0 0 . .
combust D 1 0.2276 0.0179 0.1924 0.2628 160.83 <.0001
combust mis 1 0.799 0.1516 0.5018 1.0962 27.77 <.0001
sexcli E 1 -0.0869 0.0308 -0.1472 -0.0265 7.96 0.0048
sexcli mis 0 0 0 0 0 . .
fainper 1999 1 -17.8134 4543.239 -8922.4 8886.772 0 0.9969
fainper 2000 1 0.101 0.014 0.0734 0.1285 51.69 <.0001
fainper 2001 0 0 0 0 0 . .
conoc S 1 0.3425 0.0324 0.279 0.406 111.78 <.0001
conoc N 0 0 0 0 0 . .
Tip MON 1 0.2119 0.0636 0.0872 0.3365 11.1 0.0009
Tip NTM 1 0.5425 0.0444 0.4554 0.6295 149.23 <.0001
Tip TT 0 0 0 0 0 . .
sexcon V 0 0 0 0 0 . .
sexcon M 1 -0.1086 0.0199 -0.1476 -0.0695 29.7 <.0001
sexcon mis 1 -0.2966 0.0417 -0.3784 -0.2147 50.46 <.0001
puertas 3 1 0.0806 0.0161 0.0489 0.1122 24.95 <.0001
puertas 5 0 0 0 0 0 . .
puertas mis 1 -0.1266 0.0804 -0.2841 0.031 2.48 0.1154
Scale 0 0.9702 0 0.9702 0.9702
Table 3.7: Estimated values – Bodily Injury

Analysis of Parameter Estimates

Parameter   DF   Estimate   Standard Error   Wald 95% Confidence Limits (lower, upper)   Chi-Square   Pr > ChiSq
Intercept 1 -3.1148 0.0457 -3.2045 -3.0252 4636.98 <.0001
edad 18-21 1 0.1548 0.0969 -0.0352 0.3448 2.55 0.1104
edad 22 1 0.0319 0.0936 -0.1516 0.2153 0.12 0.7336
edad 23 1 0.186 0.0711 0.0467 0.3253 6.85 0.0089
edad 24 1 0.1867 0.0621 0.0651 0.3084 9.05 0.0026
edad 25 1 -0.0037 0.0528 -0.1071 0.0998 0 0.9448
edad 26 1 0.0629 0.0455 -0.0263 0.152 1.91 0.167
edad 27 1 0.0347 0.0389 -0.0416 0.1109 0.79 0.3733
edad 28 1 0.0106 0.0367 -0.0613 0.0825 0.08 0.7725
edad 29 1 -0.0275 0.0332 -0.0926 0.0376 0.69 0.4072
edad 30-31 1 0.0435 0.0246 -0.0048 0.0918 3.12 0.0775
edad 32-33 1 0.0237 0.0228 -0.0209 0.0683 1.09 0.2973
edad 34-35 0 0 0 0 0 . .
edad 36-37 1 -0.0038 0.0214 -0.0458 0.0381 0.03 0.8575
edad 38-39 1 0.0238 0.0211 -0.0176 0.0651 1.27 0.2606
edad 40-41 1 0.0315 0.0209 -0.0095 0.0725 2.27 0.1319
edad 42-43 1 0.0087 0.0212 -0.033 0.0503 0.17 0.6835
edad 44-45 1 0.0557 0.0216 0.0133 0.0981 6.62 0.0101
edad 46-47 1 0.0731 0.0218 0.0305 0.1157 11.29 0.0008
edad 48-49 1 0.094 0.0218 0.0512 0.1368 18.54 <.0001
edad 50-51 1 0.0832 0.0217 0.0406 0.1258 14.67 0.0001
edad 52-53 1 0.0537 0.0221 0.0104 0.097 5.92 0.015
edad 54-55 1 0.0235 0.0226 -0.0209 0.0678 1.08 0.2997
edad 56-57 1 0.0198 0.0234 -0.026 0.0656 0.72 0.3961
edad 58-60 1 0.0037 0.023 -0.0414 0.0487 0.03 0.8725
edad 61-64 1 -0.0123 0.0233 -0.058 0.0334 0.28 0.5984
edad 65-70 1 -0.0325 0.0234 -0.0783 0.0132 1.94 0.1635
edad 71-74 1 -0.0555 0.0305 -0.1152 0.0043 3.31 0.0687
edad +74 1 0.1379 0.0305 0.078 0.1978 20.38 <.0001
edad mis 1 -0.0001 0.0933 -0.183 0.1828 0 0.9992
carnet 0 1 0.0565 0.1517 -0.2409 0.3539 0.14 0.7096
carnet 1 1 0.1701 0.1049 -0.0354 0.3756 2.63 0.1048
carnet 2 1 -0.0111 0.0736 -0.1553 0.1332 0.02 0.8807
carnet 3 1 -0.1136 0.06 -0.2313 0.0041 3.58 0.0586
carnet 4-6 1 -0.0034 0.0309 -0.0641 0.0572 0.01 0.9122
carnet 7-8 1 -0.0124 0.0247 -0.0608 0.0361 0.25 0.6169
carnet 9-10 1 0.0065 0.0195 -0.0318 0.0447 0.11 0.7409
carnet 11-14 1 0.0064 0.0135 -0.02 0.0328 0.23 0.6341
carnet 15-23 0 0 0 0 0 . .
carnet 24-32 1 0.0111 0.0108 -0.01 0.0322 1.06 0.3023
carnet 33-38 1 -0.0061 0.0143 -0.0341 0.0219 0.18 0.6701
carnet +38 1 -0.0804 0.02 -0.1195 -0.0413 16.26 <.0001
carnet mis 1 0.5779 0.0981 0.3857 0.7702 34.72 <.0001
antwin 0 0 0 0 0 0 . .
antwin 1-9 1 -0.0774 0.0094 -0.0958 -0.059 67.98 <.0001
antwin +9 1 -0.3085 0.0132 -0.3343 -0.2827 548.81 <.0001
cvdin 0-49 1 -0.2608 0.0196 -0.2992 -0.2224 177.29 <.0001
cvdin 50-75 1 -0.0322 0.0097 -0.0511 -0.0132 11.07 0.0009
cvdin 76-94 0 0 0 0 0 . .
cvdin 94-129 1 0.0664 0.0103 0.0462 0.0865 41.7 <.0001
cvdin 129-160 1 0.0263 0.0159 -0.0049 0.0575 2.72 0.0989
cvdin 160-199 1 -0.1015 0.0223 -0.1451 -0.0579 20.79 <.0001
cvdin +199 1 -0.2355 0.0294 -0.2931 -0.1779 64.22 <.0001
cvdin mis 1 -0.3441 0.0951 -0.5304 -0.1577 13.1 0.0003
peso 0-7 1 0.0075 0.0416 -0.0739 0.089 0.03 0.8562
peso 8-11 1 0.2272 0.0372 0.1543 0.3 37.38 <.0001
peso 12-14 1 0.1747 0.0371 0.1019 0.2475 22.12 <.0001
peso 15-17 1 0.2194 0.038 0.1449 0.2938 33.32 <.0001
peso +17 1 0.3889 0.0398 0.3108 0.467 95.22 <.0001
peso mis 0 0 0 0 0 . .
combust G 0 0 0 0 0 . .
combust D 1 0.1627 0.0087 0.1456 0.1797 348.15 <.0001
combust mis 1 0.2087 0.0786 0.0547 0.3626 7.06 0.0079
antvehi 0 0 0 0 0 0 . .
antvehi +0 1 0.3709 0.0178 0.3359 0.4059 432.46 <.0001
antvehi mis 1 -0.0404 0.1092 -0.2545 0.1737 0.14 0.7116
tip MON 1 -0.3961 0.0264 -0.4478 -0.3444 225.52 <.0001
tip NTM 1 -0.0967 0.0167 -0.1294 -0.064 33.55 <.0001
tip TT 0 0 0 0 0 . .
secli E 1 0.134 0.0128 0.1089 0.1592 109.01 <.0001
secli mis 0 0 0 0 0 . .
secon V 0 0 0 0 0 . .
secon M 1 -0.0868 0.0098 -0.106 -0.0675 77.87 <.0001
secon mis 1 -0.2091 0.0196 -0.2476 -0.1707 113.5 <.0001
puertas 3 1 -0.0186 0.0078 -0.034 -0.0032 5.63 0.0176
puertas 5 0 0 0 0 0 . .
puertas mis 1 -0.0579 0.0328 -0.1221 0.0064 3.12 0.0775
fainper 1999 1 -2.4621 1.0112 -4.444 -0.4801 5.93 0.0149
fainper 2000 1 0.12 0.0067 0.1068 0.1332 318.46 <.0001
fainper 2001 0 0 0 0 0 . .
conoc S 1 0.1703 0.0175 0.136 0.2046 94.76 <.0001
conoc N 0 0 0 0 0 . .
Scale 0 1.0112 0 1.0112 1.0112
Table 3.8: Estimated values – Material Damage

Because the link function is the logarithm, the final estimators are calculated as the exponential of these estimated values. These factors are applied multiplicatively since the model is a multiplicative one.¹
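A small sketch of this multiplicative use of the estimates is given below. The coefficients are copied from Table 3.7 (bodily injury) for two factors only; the function names and the chosen profile are hypothetical. This is not a full tariff calculation, since the remaining factors and the offset are left out.

```python
import math

# Estimates copied from Table 3.7 (bodily injury); reference levels have estimate 0.
intercept = -5.3489
estimates = {
    ("edad", "18-21"): 0.4553,
    ("edad", "34-35"): 0.0,     # reference level
    ("combust", "D"): 0.2276,
    ("combust", "G"): 0.0,      # reference level
}

def relativity(factor, level):
    """Multiplicative factor for one level: the exponential of its estimate."""
    return math.exp(estimates[(factor, level)])

def linear_predictor_frequency(levels):
    """exp(intercept + sum of estimates) for the listed factor levels."""
    eta = intercept + sum(estimates[key] for key in levels)
    return math.exp(eta)

# Hypothetical profile: driver aged 18-21 with a diesel car.
young_diesel = [("edad", "18-21"), ("combust", "D")]
via_sum = linear_predictor_frequency(young_diesel)
via_product = (math.exp(intercept)
               * relativity("edad", "18-21")
               * relativity("combust", "D"))
print(via_sum, via_product)
```

Exponentiating the sum of the estimates and multiplying the individual exponentials give the same value, which is why the factors can be applied one after another as multipliers.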

To visualise the results, the estimated factors of some of the variables have been plotted

in figures 3.12 to 3.23. In the same graph, the univariate values of these variables have

been included to give an idea of the differences between a univariate versus a

multivariate approach.

The graphs show very clearly how decisions based on a univariate analysis could be wrong. Figure 3.12, corresponding to the driver age factor for bodily injury, for example, shows that the estimator for young drivers has to be lower than the value calculated on a univariate basis, which means that the gap is already explained by another factor. Maybe most young people drive similar kinds of cars or have another characteristic in common. The same effect is shown in Figure 3.13 for the driving licence factor. New drivers do not have to be charged as much as the univariate analysis suggests because this increase is partly included in the driver age factor. Without doubt, these two variables are highly correlated, especially in the lower age groups. So, if the univariate numbers were applied, younger drivers would be overcharged.

¹ For additional information about the discussion of an additive versus a multiplicative model to estimate the frequency, see Brockman & Wright (1992).

By contrast, the univariate and multivariate results are very similar for some other

variables. That is the case of the car power in bodily injury for example (figure 3.14).

This suggests that car power is an important factor, which has probably been included

in the first steps in the selection of the variables.

One additional comment can be made related to the variable driver age. The hump

observed in the univariate analysis still persists in the multivariate one. Further

investigation carried out to discover the reason for this particular pattern has

determined that the children of the insured people are the cause of this shape. This is

due to the children driving the parents’ car, which is likely to happen when the parents

are between 45-55 years old. Later on, young people tend to have their own car.

Figure 3.12: Comparison univariate values versus multivariate estimates – driver age – BI (plot "Frequency BI - driver age"; series: univ, multiv)

Figure 3.13: Comparison univariate values versus multivariate estimates – driver licence – BI (plot "Frequency BI - driver licence"; series: univ, multiv)

Figure 3.14: Comparison univariate values versus multivariate estimates – car power – BI (plot "Frequency BI - car power"; series: univ, multiv)

Figure 3.15: Comparison univariate values versus multivariate estimates – vehicle age – BI (plot "Frequency BI - vehicle age"; series: univ, multiv)

Figure 3.16: Comparison univariate values versus multivariate estimates – driver sex – BI (plot "Frequency BI - driver sex"; series: univ, multiv)

Figure 3.17: Comparison univariate values versus multivariate estimates – loyalty – BI (plot "Frequency BI - loyalty"; series: univ, multiv)

Similar comments can be made for the material damage results. The following figures

3.18 to 3.23 show some of the variables of this model. The correlation between driver

age and driver licence is even more evident and, by contrast, the car power variable

displays a different shape between the univariate and multivariate curves for highly

powerful vehicles. There is a singular case, the weight of the car, where the multivariate

analysis swings the curve.


Figure 3.18: Comparison univariate values versus multivariate estimates – driver age – MD (plot "Frequency MD - driver age"; series: univ, multiv)

Figure 3.19: Comparison univariate values versus multivariate estimates – driver licence – MD (plot "Frequency MD - driver licence"; series: univ, multiv)

Figure 3.20: Comparison univariate values versus multivariate estimates – car power – MD (plot "Frequency MD - car power"; series: univ, multiv)

Figure 3.21: Comparison univariate values versus multivariate estimates – weight – MD (plot "Frequency MD - weight"; series: univ, multiv)

Figure 3.22: Comparison univariate values versus multivariate estimates – type of vehicle – MD

Figure 3.23: Comparison univariate values versus multivariate estimates – loyalty – MD



3.5 Assessing goodness of fit

Although the stepwise technique ensures that the variables selected provide the best

possible choice, it is always important to evaluate the adequacy of the model or

goodness of fit.

The deviance and the scale parameter (deviance divided by the degrees of freedom) are often used as a crude way of judging how well the model fits the data, by comparing their values with the appropriate $\chi^2$ distribution.

The deviance reflects the discrepancy between the fitted model and the saturated model, that is, the model that reproduces the observed values exactly. Both models are specified under the same assumptions in terms of distribution and link function.

Analytically, it can be defined as:

$$D = -2\left[\log \hat{L}_C - \log \hat{L}_S\right]$$

where $\hat{L}_C$ is the maximised likelihood when the parameters are set equal to their maximum likelihood estimates (the fitted ones) and $\hat{L}_S$ is the likelihood for the saturated model.

Then, assuming that $p$ parameters are to be estimated from a data set of $N$ observations, $D$ has approximately a $\chi^2$ distribution with $N - p$ degrees of freedom.



If the model is good, we expect the value of the deviance to be near the middle of the distribution which, for a $\chi^2$, means near its degrees of freedom.

Meanwhile, the scale parameter gives us an estimate of the variance:

$$\hat{\sigma}^2 = \frac{D}{N - p}$$

Table 3.9 contains the value of these parameters for both models.

BI model MD model
Deviance 86507.68 198506.78
DF 482475 392306
Deviance/DF 0.17930 0.50600
Table 3.9: Parameters to assess goodness of fit
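As a quick check, the last row of Table 3.9 can be recomputed from the first two. The sketch below also derives the number of fitted parameters, assuming that DF equals N − p as defined above and taking N from the summaries in Tables 3.10 and 3.11; the function name `scale_parameter` is hypothetical.

```python
# Deviance and degrees of freedom copied from Table 3.9;
# n_obs is the number of observations reported in Tables 3.10 and 3.11.
models = {
    "BI": {"deviance": 86507.68, "df": 482475, "n_obs": 482526},
    "MD": {"deviance": 198506.78, "df": 392306, "n_obs": 392403},
}


def scale_parameter(deviance, df):
    """Deviance divided by its degrees of freedom."""
    return deviance / df


for name, m in models.items():
    scale = scale_parameter(m["deviance"], m["df"])
    n_params = m["n_obs"] - m["df"]  # p = N - (N - p), an inferred quantity
    print(f"{name}: deviance/DF = {scale:.5f}, implied p = {n_params}")
```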

The deviance values are small relative to the degrees of freedom, indicating that the models fit the data well. As for the scale parameter, although the MD model gives a much higher value, in both cases it is small enough to support the goodness of fit of the models selected.

It can be pointed out that the deviance is also very frequently used to decide between

different models. In this case, the deviances of the models are compared and the one

with the lowest value is considered to be the best, since it also means the lowest

discrepancy between estimated and observed values.



Other useful methods to assess goodness of fit involve the examinations of the

residuals. These methods basically facilitate the investigation of specific aspects of the

model.

There are several possible definitions of residuals, but two of the most commonly used are:

The Pearson residual, defined as

$$\frac{y_i - \hat{y}_i}{\sqrt{V[\hat{y}_i]}}$$

The deviance residual, given by

$$\operatorname{sgn}(y_i - \hat{y}_i)\sqrt{d_i}$$

where $\hat{y}_i$ are the fitted values and $d_i$ is the contribution of each observation to the deviance.
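The two definitions can be sketched in a few lines. The example assumes a Poisson frequency model, so that the variance function is V[μ] = μ; the observed counts and fitted means are hypothetical and the function names are illustrative only.

```python
import math


def poisson_unit_deviance(y, mu):
    """Contribution d_i of one observation to the Poisson deviance."""
    log_term = y * math.log(y / mu) if y > 0 else 0.0
    return max(2.0 * (log_term - (y - mu)), 0.0)


def pearson_residual(y, mu):
    """(y - mu) / sqrt(V[mu]) with the Poisson variance function V[mu] = mu."""
    return (y - mu) / math.sqrt(mu)


def deviance_residual(y, mu):
    """sgn(y - mu) * sqrt(d_i)."""
    return math.copysign(math.sqrt(poisson_unit_deviance(y, mu)), y - mu)


# Hypothetical policies: observed claim counts and fitted means.
observed = [0, 0, 1, 0, 2]
fitted = [0.05, 0.02, 0.08, 0.10, 0.06]

dev_res = [deviance_residual(y, mu) for y, mu in zip(observed, fitted)]
total_deviance = sum(poisson_unit_deviance(y, mu) for y, mu in zip(observed, fitted))

for y, mu, r in zip(observed, fitted, dev_res):
    print(y, mu, round(pearson_residual(y, mu), 4), round(r, 4))
```

Under this assumption the squared deviance residuals add up to the deviance, and a policy without claims always has a negative deviance residual equal to −√(2μ).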

A useful way to analyse the residuals consists of plotting them against the fitted values.

The model can be considered to be satisfactory if the residuals do not follow any

evident pattern and they are concentrated around zero. By contrast, the values

associated with large residuals can be classified as anomalous observations (outliers). It

is recommended to remove them.

First we evaluate the bodily injury model. Figure 3.24 displays the plot of the deviance residuals against the fitted values.



Figure 3.24: Fitted values against deviance residuals – Bodily Injury

It can be seen that the dispersion is not large and the residuals are quite concentrated around zero. Nearly all the residuals lie in the interval (−4, 4), which means that no observation is distorting the estimation. In this sense, the model can be said to be good.

However, it is quite evident that the residuals follow a defined pattern. This is an

indication that there are one or more factors not considered in the model that are real

explanatory variables. Since with the stepwise technique all the possible combinations

of the variables selected are considered, and the model chosen has been tested to be the

best, we can conclude that there are additional factors apart from the ones considered

which influence the response variable.

Probably we can expect the spatial elements to be important so, even more now, the

spatial factors analysis appears to be relevant in assessing the bodily injury peril.



In addition to the graph shown, figure 3.25 reproduces the histogram of the deviance residuals and table 3.10 summarises the main features (mean, variance, quartiles, extreme values, …) of this variable, as produced by the SAS UNIVARIATE procedure.

This additional information basically reaffirms what has been seen with the residual

graphs. The histogram is pretty much concentrated around zero so the mean is close to

that value and the standard deviation is quite low. The extreme observations are not far

from zero.

Figure 3.25: Histogram of the deviance residuals – Bodily Injury

The UNIVARIATE Procedure


Moments

N 482526 Sum Weights 482526


Mean -0.145872 Sum Observations -70387.054
Std Deviation 0.39749533 Variance 0.15800253
Skewness 5.16970598 Kurtosis 32.2494595
Uncorrected SS 86507.6768 Corrected SS 76240.1732
Coeff Variation -272.49589 Std Error Mean 0.00057223

Basic Statistical Measures


Location Variability

Mean -0.14587 Std Deviation 0.39750


Median -0.18036 Variance 0.15800
Mode -0.47624 Range 9.93294
Interquartile Range 0.09165

NOTE: The mode displayed is the smallest of 122 modes with a count of 2.



Tests for Location: Mu0=0

Test -Statistic- -----p Value------

Student's t t -254.918 Pr > |t| <.0001


Sign M -226393 Pr >= |M| <.0001
Signed Rank S -5.13E10 Pr >= |S| <.0001

Quantiles (Definition 5)
Quantile Estimate

100% Max 6.5538459


99% 2.2638414
95% -0.0496721
90% -0.1049035
75% Q3 -0.1405540
50% Median -0.1803559
25% Q1 -0.2322007
10% -0.3149512
5% -0.4090052
1% -0.7298383
0% Min -3.3790934

Extreme Observations

------Lowest----- -----Highest-----
Value Obs Value Obs

-3.37909 426785 4.61876 184368


-3.12826 317635 4.72664 98848
-3.08085 214859 4.77086 48651
-2.80271 318327 5.37791 432911
-2.61847 189822 6.55385 184326

Table 3.10: Summary of the deviance residuals – Bodily Injury
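Several entries of Table 3.10 are linked by simple arithmetic, which the following sketch reproduces from the printed values. It relies on the definition of the deviance residual given earlier, under which the sum of squared residuals (the uncorrected SS) should equal the deviance of Table 3.9; the variable names are illustrative.

```python
import math

# Values copied from Table 3.10 (deviance residuals, bodily injury).
n = 482526
sum_observations = -70387.054
std_dev = 0.39749533
uncorrected_ss = 86507.6768  # sum of squared deviance residuals

# Deviance of the bodily injury model, copied from Table 3.9.
deviance_bi = 86507.68

mean = sum_observations / n
std_error = std_dev / math.sqrt(n)
t_stat = mean / std_error  # Student's t for the hypothesis Mu0 = 0

print(f"mean = {mean:.6f}")
print(f"std error of the mean = {std_error:.8f}")
print(f"t = {t_stat:.3f}")
print(f"uncorrected SS - deviance = {uncorrected_ss - deviance_bi:.4f}")
```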

Related to the material damage model, the same analysis has been performed. Figure

3.26 displays the fitted values against deviance residuals.

Figure 3.26: Fitted values against deviance residuals – Material damage



Although this second model presents more dispersion, it is not alarming. The range of the residuals is indeed wider, and quite a lot of observations have residuals of up to 4. However, there are no clear outliers. As for the pattern, there are also some signs of a defined pattern, but it is not as evident as it was in the bodily injury model.

To illustrate and validate these conclusions, figure 3.27 contains a histogram of the

deviance residuals and table 3.11 is a statistical summary of this variable. The

histogram shows that the residuals are still very much concentrated around zero and

according to the numerical values of table 3.11, the extreme values are not very far

from zero. Therefore, similar conclusions can be applied to both models.

Figure 3.27: Histogram of the deviance residuals – Material Damage



The UNIVARIATE Procedure
Moments

N 392403 Sum Weights 392403


Mean -0.2356438 Sum Observations -92467.321
Std Deviation 0.67107969 Variance 0.45034795
Skewness 2.47269947 Kurtosis 6.8520141
Uncorrected SS 198506.783 Corrected SS 176717.435
Coeff Variation -284.78567 Std Error Mean 0.00107129

Basic Statistical Measures


Location Variability

Mean -0.23564 Std Deviation 0.67108


Median -0.39983 Variance 0.45035
Mode -0.76798 Range 11.25378
Interquartile Range 0.19863

NOTE: The mode displayed is the smallest of 98 modes with a count of 2.

Tests for Location: Mu0=0

Test -Statistic- -----p Value------

Student's t t -219.962 Pr > |t| <.0001


Sign M -150191 Pr >= |M| <.0001
Signed Rank S -2.29E10 Pr >= |S| <.0001

Quantiles (Definition 5)

Quantile Estimate

100% Max 7.468486


99% 2.277407
95% 1.532982
90% 0.691858
75% Q3 -0.302567
50% Median -0.399827
25% Q1 -0.501196
10% -0.639928
5% -0.780878
1% -1.215588
0% Min -3.785296

Extreme Observations
------Lowest----- -----Highest-----

Value Obs Value Obs

-3.78530 275285 5.86867 19981


-3.25684 268283 5.95437 361240
-3.21583 63136 5.96257 382253
-3.21511 321232 6.54284 183262
-3.21389 288426 7.46849 98375

Table 3.11: Summary of the deviance residuals – Material Damage

For further details about the deviance and the scale parameter and other suggestions

about residuals in a Generalised Linear Models context see Dobson (1990) or

McCullagh and Nelder (1989).



4 SPATIAL ANALYSIS

4.1 Modelling the spatial factors

Once the standard factors estimators have been calculated, the next stage is related to

the spatial analysis.

First of all, we have to define the different areas. Usually the areas correspond to postal codes. The Spanish postal code is made up of 5 digits. The first two digits, for example, divide the country into 50 regions, which are known as “provincias”. As more digits are considered, the code refers to a smaller area.

The analysis has been performed using these first two digits. So, 50

different regions have been considered. Although it would be strongly recommended to

consider smaller regions which also means that there would be a larger number of them

(see chapter 5 about the conclusions), the unavailability of more detailed maps and the

cost of purchasing this additional information have made it impossible to go further.
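Assigning a policy to one of the 50 regions then amounts to reading the first two digits of its postal code. A minimal sketch follows, with a hypothetical helper `province_code`, an illustrative input code and a partial lookup whose names are taken from Table 4.1.

```python
# Partial lookup of region digits to names, as listed in Table 4.1.
REGION_NAMES = {"08": "Barcelona", "28": "Madrid", "46": "Valencia"}


def province_code(postal_code: str) -> str:
    """Return the two-digit region code of a five-digit Spanish postal code."""
    code = postal_code.strip()
    if len(code) != 5 or not (code.isascii() and code.isdigit()):
        raise ValueError(f"not a five-digit postal code: {postal_code!r}")
    return code[:2]


# "08001" is an illustrative postal code, not a record from the database.
region = province_code("08001")
print(region, REGION_NAMES.get(region, "unknown"))
```

Using more leading digits in the slice would give the smaller areas mentioned above, at the price of needing correspondingly detailed maps.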

The map in figure 4.1 displays the 50 areas considered and the name of each region, as

it will be useful to know these in later comments about the results.



Figure 4.1: Map of Spain with the 50 labelled “provincias”

Once the regions have been defined, and provided that the estimators for the standard

factors have been calculated, we are in the position to work out the expected number of

claims in each region, $c_i$.

In addition, we also have the actual number of claims reported in each area, $Y_i$. So, a first approach for the estimators of the spatial effect could be the ratio of the actual divided by the estimated number of claims, $\hat{\theta}_i = Y_i / c_i$. This ratio can be seen as a sort of residual and gives us an idea about how much the spatial factors could contribute in

assessing the risk. Applying the model, we will get information about how much of this

unexplained variation can be attributed to the spatial factors and how much corresponds

to variation without a defined pattern. This undefined variation can be caused by other

unknown factors which have not been considered in the modelling approach.



Table 4.1 shows the exposure, the actual number of claims, the estimated number of

claims and the ratio actual divided by estimated of each region in both models. The

estimated number of claims has been calculated using the estimates of the standard

factors.

                                TPL - Bodily Injury            TPL - Material Damage
Region  Region                  Actual  Estimated              Actual  Estimated
digit   name        Exposure    claims  claims      Ratio      claims  claims      Ratio
01 Álava 16439.19 345 266.09 1.2966 1627 1394.18 1.1670
02 Albacete 5265.36 70 89.64 0.7809 376 462.48 0.8130
03 Alicante 58940.59 1009 1031.34 0.9783 4813 5063.79 0.9505
04 Almería 9779.99 180 175.91 1.0233 844 862.12 0.9790
05 Ávila 815.62 19 14.31 1.3277 81 74.67 1.0848
06 Badajoz 13055.09 137 203.71 0.6725 918 1083.3 0.8474
07 Baleares 48644.62 814 792.59 1.0270 3589 3830.09 0.9371
08 Barcelona 189551.38 4216 3432.24 1.2284 17326 16981.31 1.0203
09 Burgos 14846.52 258 271.83 0.9491 1485 1342.4 1.1062
10 Cáceres 6166.52 78 105.38 0.7402 394 519.93 0.7578
11 Cádiz 13624.05 237 239.64 0.9890 1278 1130.13 1.1308
12 Castellón 40518.27 499 756.98 0.6592 3298 3701.42 0.8910
13 Ciudad Real 5510.26 87 91.77 0.9480 451 463.32 0.9734
14 Córdoba 15101.28 213 233.95 0.9105 1257 1274.68 0.9861
15 La Coruña 37309.25 751 712.93 1.0534 3446 3271.67 1.0533
16 Cuenca 3690.65 37 61.7 0.5997 281 330.69 0.8497
17 Girona 22137.56 454 388.45 1.1687 1775 1935.33 0.9172
18 Granada 20778.37 332 350.57 0.9470 1754 1749.06 1.0028
19 Guadalajara 1013.83 16 20.24 0.7905 101 101.91 0.9911
20 Guipúzcoa 11879.28 275 202.19 1.3601 1177 1039.17 1.1326
21 Huelva 2839.4 60 52.02 1.1534 283 261.04 1.0841
22 Huesca 7770.97 109 115.87 0.9407 505 667.55 0.7565
23 Jaén 18607.13 275 270.35 1.0172 1422 1637.65 0.8683
24 León 8872.26 187 130.86 1.4290 713 718.65 0.9921
25 Lleida 28249.1 429 444.03 0.9662 1888 2401.28 0.7862
26 La Rioja 8536.94 142 137.32 1.0341 767 744.6 1.0301
27 Lugo 5794.08 96 97.27 0.9869 377 489.87 0.7696
28 Madrid 25122.08 486 504.92 0.9625 3082 2667.1 1.1556
29 Málaga 36845.87 686 664.97 1.0316 3829 3257.45 1.1755
30 Murcia 19509.05 529 339.75 1.5570 1716 1684.43 1.0187
31 Navarra 10141.58 194 176.99 1.0961 1007 911.12 1.1052
32 Orense 10325.96 175 194.1 0.9016 757 922.77 0.8204
33 Asturias 27842.3 603 497.79 1.2114 2677 2496.34 1.0724
34 Palencia 3278.14 61 56.38 1.0819 296 279.84 1.0577
35 Las Palmas 18630.7 512 378.85 1.3515 1858 1910.43 0.9726
36 Pontevedra 22106.25 682 429.99 1.5861 2212 1973.85 1.1207



37 Salamanca 8776.39 149 148.6 1.0027 713 758.58 0.9399
38 Tenerife 42243.04 682 677.98 1.0059 3744 3639.6 1.0287
39 Cantabria 12797.77 341 236.1 1.4443 1451 1136.28 1.2770
40 Segovia 3219.07 37 55.47 0.6670 229 284.18 0.8058
41 Sevilla 16296.02 307 317.9 0.9657 1835 1525.97 1.2025
42 Soria 6949.75 89 112.35 0.7922 569 577.34 0.9856
43 Tarragona 19361.12 297 334.02 0.8892 1524 1657.65 0.9194
44 Teruel 5741.55 59 89.21 0.6614 323 487.4 0.6627
45 Toledo 13959.89 184 271.11 0.6787 1131 1337.84 0.8454
46 Valencia 65049.78 1106 1163.08 0.9509 6033 5709.13 1.0567
47 Valladolid 11459.72 201 194.04 1.0359 1122 988.06 1.1356
48 Vizcaya 12398.39 360 217.097 1.6582 1326 1059.84 1.2511
49 Zamora 4355.31 58 65.676 0.8831 314 349.97 0.8972
50 Zaragoza 24868.05 393 411.26 0.9556 2093 2171.97 0.9636
Table 4.1: Database composition by region
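The Ratio columns of Table 4.1 follow directly from the actual and estimated claims. The sketch below recomputes the bodily injury ratio for three regions copied from the table; the function name `crude_spatial_ratio` is hypothetical.

```python
# (actual claims Y_i, estimated claims c_i) for bodily injury, copied from Table 4.1.
bodily_injury = {
    "Álava": (345, 266.09),
    "Barcelona": (4216, 3432.24),
    "Teruel": (59, 89.21),
}


def crude_spatial_ratio(actual, estimated):
    """First approximation to the spatial effect: actual over estimated claims."""
    return actual / estimated


ratios = {
    region: crude_spatial_ratio(actual, estimated)
    for region, (actual, estimated) in bodily_injury.items()
}

for region, ratio in ratios.items():
    print(f"{region}: {ratio:.4f}")
```

A ratio above 1 means that the region reported more claims than the standard factors alone would predict, and a ratio below 1 means fewer.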

To illustrate these numbers graphically, the exposure has been mapped to provide a

view of the composition of the portfolio by region. As can be seen, the company has a

good position in the big cities (Barcelona, Madrid, Valencia, etc.) and, in general, an

important presence in the northeast, on the east coast, and at the extreme northwest. On

the contrary, the exposure is much lower in the centre (with the exception of Madrid).

Figure 4.2: Map of Spain with exposures-to-risk (number of policies-year)



The next two maps (4.3 and 4.4) display crude rates (actual number of claims divided

by exposure) for bodily injury and material damage respectively. These maps are

difficult to interpret because this crude estimation is highly affected by the amount of

risk exposure. So, regions with small exposure tend to get higher rates than the areas

with larger exposure. This means that crude rates are not a proper starting point to deal

with assessing spatial risks and reinforces the utility of the model, since smoothing the

crude rates directly would become impossible.

Just to point out some examples, areas like Huelva or Cádiz (south) are affected by this

low exposure effect. The same behaviour can be seen in regions around Madrid (Ávila

or Toledo for example).
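The instability of crude rates in small regions can be made concrete with a rough calculation. The sketch below takes exposures and bodily injury claims from Table 4.1 and, assuming that the claim counts are approximately Poisson, uses 1/√Y as a relative standard error; this approximation is an illustration and not part of the original analysis.

```python
import math

# (exposure in policy-years, actual bodily injury claims), copied from Table 4.1.
regions = {
    "Ávila": (815.62, 19),
    "Huelva": (2839.40, 60),
    "Barcelona": (189551.38, 4216),
}


def crude_rate(exposure, claims):
    """Actual number of claims divided by exposure."""
    return claims / exposure


def relative_std_error(claims):
    """Approximate relative standard error of a Poisson count: 1 / sqrt(claims)."""
    return 1.0 / math.sqrt(claims)


for name, (exposure, claims) in regions.items():
    rate = crude_rate(exposure, claims)
    print(f"{name}: rate = {rate:.4f}, relative std error = {relative_std_error(claims):.3f}")
```

With only 19 claims, the crude rate of Ávila carries a relative uncertainty more than ten times that of Barcelona, which is why the maps of crude rates are hard to read.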

Figure 4.3: Map of Spain with crude rates – Bodily Injury



Figure 4.4: Map of Spain with crude rates – Material Damage

Finally, figures 4.5 and 4.6 show the ratio of actual divided by expected number of

claims. This ratio provides a sort of residual approach since it reflects the differences

between the actual number of claims and the expected number of claims according to

the estimates of the standard factors. The differences between regions can be attributed

to spatial factors.

Figure 4.5: Ratio actual divided by expected number of claims – Bodily Injury



Figure 4.6: Ratio actual divided by expected number of claims – Material Damage

At this stage, when the regions and neighbours have been defined and the estimated

number of claims by region according to the standard factors has been calculated, the

next step consists of applying the Boskov and Verrall model.

In order to do that, a computer program has been developed in the Department of Actuarial Science and Statistics of City University. This software provides a user-friendly interface to run the process and obtain the maximum a posteriori estimates of $u_i$ and $v_i$ for each region and the expected claim frequency.

For both types of claims, the process has been run for 5000 iterations to allow convergence to a steady state. After that, every 10th step over the next 1000 steps has been sampled to calculate the maximum a posteriori estimates.
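The burn-in and thinning scheme described above can be written down as a short sketch. The iteration counts come from the text; the exact offset of the first retained step is an assumption, and the function name `retained_iterations` is hypothetical.

```python
BURN_IN = 5000         # iterations discarded while the process converges
SAMPLING_STEPS = 1000  # iterations run after the burn-in
THIN = 10              # keep every 10th step


def retained_iterations(burn_in=BURN_IN, steps=SAMPLING_STEPS, thin=THIN):
    """1-based iteration numbers kept after burn-in, taking every `thin`-th step."""
    return list(range(burn_in + thin, burn_in + steps + 1, thin))


kept = retained_iterations()
print(len(kept), kept[0], kept[-1])
```

Under these settings the estimates are based on 100 retained states of the process.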



Table 4.2 shows the results of the processes for bodily injury and material damage.

                     TPL - Bodily Injury            TPL - Material Damage
Region  Region                       Estimated                      Estimated
digit   name         u       v       ratio          u       v       ratio
01 Álava 0.13277 0.05104 1.2868 0.09740 0.04642 1.1641
02 Albacete -0.16548 -0.08808 0.8309 -0.09287 -0.09972 0.8316
03 Alicante -0.08428 -0.00567 0.9786 -0.06015 0.00118 0.9505
04 Almería -0.01861 -0.02086 1.0293 -0.02790 -0.00136 0.9791
05 Ávila -0.10958 0.08078 1.0403 -0.02020 0.04472 1.0332
06 Badajoz -0.26795 -0.14472 0.7087 -0.06751 -0.09714 0.8551
07 Baleares -0.04169 0.00000 1.0270 -0.07314 -0.00003 0.9371
08 Barcelona 0.04720 0.08906 1.2270 -0.06672 0.07827 1.0199
09 Burgos -0.01744 -0.08624 0.9653 0.06073 0.03031 1.1043
10 Cáceres -0.24456 -0.07614 0.7770 -0.12184 -0.13465 0.7801
11 Cádiz -0.08206 0.00218 0.9885 0.08456 0.02834 1.1287
12 Castellón -0.36832 -0.10600 0.6663 -0.12107 -0.00242 0.8911
13 Ciudad Real -0.18650 0.04068 0.9255 -0.06556 0.02556 0.9687
14 Córdoba -0.14696 -0.01227 0.9131 -0.01783 -0.00402 0.9864
15 La Coruña 0.03606 -0.04906 1.0569 0.00725 0.03562 1.0523
16 Cuenca -0.29599 -0.12500 0.7028 -0.09692 -0.05680 0.8645
17 Girona 0.05137 0.03256 1.1645 -0.10415 0.00908 0.9168
18 Granada -0.08471 -0.03301 0.9518 -0.02333 0.01716 1.0020
19 Guadalajara -0.24911 -0.01319 0.8237 -0.05673 0.02129 0.9731
20 Guipúzcoa 0.19508 0.03723 1.3507 0.10214 0.01329 1.1315
21 Huelva -0.06256 0.07299 1.0820 0.04515 0.02104 1.0772
22 Huesca -0.07916 -0.03437 0.9558 -0.17256 -0.09806 0.7691
23 Jaén -0.10362 0.04413 1.0089 -0.08482 -0.06085 0.8715
24 León 0.13472 0.12054 1.3821 0.00289 -0.01690 0.9942
25 Lleida -0.06929 -0.02994 0.9696 -0.16259 -0.08230 0.7892
26 La Rioja -0.01918 -0.01152 1.0384 0.03904 -0.01579 1.0319
27 Lugo 0.03842 -0.07891 1.0282 -0.07966 -0.15549 0.7969
28 Madrid -0.17994 0.06643 0.9558 -0.00121 0.13389 1.1512
29 Málaga -0.07452 0.03473 1.0290 0.07336 0.07838 1.1734
30 Murcia 0.12744 0.22510 1.5233 -0.02934 0.03785 1.0168
31 Navarra 0.04272 -0.01529 1.1005 0.04258 0.04543 1.1009
32 Orense -0.01966 -0.11840 0.9327 -0.09671 -0.09834 0.8295
33 Asturias 0.12175 0.00152 1.2112 0.03284 0.02798 1.0714
34 Palencia 0.04859 -0.02089 1.1008 0.05144 -0.00267 1.0586
35 Las Palmas 0.15920 0.06696 1.3425 -0.04013 0.00396 0.9724
36 Pontevedra 0.22173 0.15923 1.5672 0.01239 0.08986 1.1167
37 Salamanca -0.09747 0.02369 0.9946 -0.04922 -0.01865 0.9420
38 Tenerife -0.01390 -0.04516 1.0093 0.00155 0.01816 1.0283
39 Cantabria 0.18626 0.09825 1.4231 0.11106 0.11824 1.2680
40 Segovia -0.20471 -0.11804 0.7754 -0.04749 -0.12913 0.8450
41 Sevilla -0.11746 0.01220 0.9638 0.07010 0.10139 1.1968
42 Soria -0.19060 -0.07094 0.8243 -0.00829 -0.01252 0.9874
43 Tarragona -0.18181 -0.00341 0.8897 -0.11814 0.02453 0.9181



44 Teruel -0.32377 -0.08624 0.7106 -0.17859 -0.19126 0.6965
45 Toledo -0.29495 -0.12656 0.7025 -0.09209 -0.07809 0.8504
46 Valencia -0.19305 0.07111 0.9478 -0.06100 0.10651 1.0551
47 Valladolid -0.02910 -0.00320 1.0367 0.03720 0.07592 1.1289
48 Vizcaya 0.26703 0.14908 1.6233 0.13592 0.07508 1.2450
49 Zamora -0.05386 -0.07501 0.9413 -0.04574 -0.05571 0.9109
50 Zaragoza -0.12781 0.01244 0.9541 -0.07005 0.02388 0.9627

Table 4.2: $u_i$ and $v_i$ estimates - the Boskov and Verrall model results

In general, the $v_i$ are small compared to the $u_i$, especially in the bodily injury model, which means that the spatially structured variation dominates the unstructured one. Recall

that in section 3.5, the deviance residuals graph (figures 3.24 and 3.26) displayed an

evident pattern, in particular for the bodily injury model. So the results of the model are

consistent with the features observed in the data.
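One simple way to compare the size of the two components is the mean absolute value of each. The sketch below uses only the first five bodily injury rows of Table 4.2, so it illustrates the comparison rather than summarising all 50 regions; the helper `mean_abs` is hypothetical.

```python
# First five bodily injury rows of Table 4.2 (Álava to Ávila).
u = [0.13277, -0.16548, -0.08428, -0.01861, -0.10958]
v = [0.05104, -0.08808, -0.00567, -0.02086, 0.08078]


def mean_abs(values):
    """Mean absolute value of a list of estimates."""
    return sum(abs(x) for x in values) / len(values)


print(f"mean |u| = {mean_abs(u):.5f}")
print(f"mean |v| = {mean_abs(v):.5f}")
```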

To provide an easy view of the results the next two maps (figures 4.7 and 4.8) display

the maximum a posteriori estimates of the $u_i$ for bodily injury and material damage.

These values determine the geographical variations and will smooth the frequencies

between adjacent regions.

Figure 4.7: Maximum a posteriori estimates of $u_i$ - Bodily Injury



Figure 4.8: Maximum a posteriori estimates of $u_i$ - Material Damage

In both maps, the north appears to be a riskier area, while the centre displays the lowest spatial estimates. For material damage claims, a risky zone also emerges in the southwest.

The influence of these results is illustrated in figures 4.9 and 4.10 where the new

estimated ratio (actual divided by expected) accounting for the spatial effects is mapped

for both types of claims. These maps should be compared with figures 4.5 and 4.6 to

evaluate the impact of the spatial factors. At first sight, the smoothing effect is

appreciable although it is a bit diffuse. Following what has been observed from the

data, the spatial effect is more noticeable in the bodily injury type of claim, as is the

smoothing.



Figure 4.9: Expected frequency accounting for spatial effects – Bodily Injury

Figure 4.10: Expected frequency accounting for spatial effects – Material Damage

Finally, table 4.3 gives the expected number of claims by region according to the new estimated frequencies. It can be observed that the estimated values are very close to the actual observed number of claims.



                     Bodily Injury          Material Damage
Region  Region       Actual  Expected       Actual  Expected
digit   name         claims  n. claims      claims  n. claims
01 Álava 345 342.4 1627 1623
02 Albacete 70 74.5 376 384.6
03 Alicante 1009 1009.3 4813 4812.9
04 Almería 180 181.1 844 844.1
05 Ávila 19 14.9 81 77.1
06 Badajoz 137 144.4 918 926.4
07 Baleares 814 814 3589 3589
08 Barcelona 4216 4211.5 17326 17319.3
09 Burgos 258 262.4 1485 1482.4
10 Cáceres 78 81.9 394 405.6
11 Cádiz 237 236.9 1278 1275.6
12 Castellón 499 504.4 3298 3298.2
13 Ciudad Real 87 84.9 451 448.8
14 Córdoba 213 213.6 1257 1257.3
15 La Coruña 751 753.5 3446 3442.9
16 Cuenca 37 43.4 281 285.9
17 Girona 454 452.3 1775 1774.2
18 Granada 332 333.7 1754 1752.5
19 Guadalajara 16 16.7 101 99.2
20 Guipúzcoa 275 273.1 1177 1175.9
21 Huelva 60 56.3 283 281.2
22 Huesca 109 110.8 505 513.4
23 Jaén 275 272.8 1422 1427.2
24 León 187 180.9 713 714.5
25 Lleida 429 430.5 1888 1895.1
26 La Rioja 142 142.6 767 768.4
27 Lugo 96 100 377 390.4
28 Madrid 486 482.6 3082 3070.5
29 Málaga 686 684.2 3829 3822.3
30 Murcia 529 517.5 1716 1712.7
31 Navarra 194 194.8 1007 1003.1
32 Orense 175 181 757 765.5
33 Asturias 603 602.9 2677 2674.6
34 Palencia 61 62.1 296 296.2
35 Las Palmas 512 508.6 1858 1857.7
36 Pontevedra 682 673.9 2212 2204.3
37 Salamanca 149 147.8 713 714.6
38 Tenerife 682 684.3 3744 3742.4
39 Cantabria 341 336 1451 1440.8
40 Segovia 37 43 229 240.1
41 Sevilla 307 306.4 1835 1826.3
42 Soria 89 92.6 569 570.1
43 Tarragona 297 297.2 1524 1521.9
44 Teruel 59 63.4 323 339.5
45 Toledo 184 190.4 1131 1137.7
46 Valencia 1106 1102.4 6033 6023.8
47 Valladolid 201 201.2 1122 1115.5
48 Vizcaya 360 352.4 1326 1319.5
49 Zamora 58 61.8 314 318.8
50 Zaragoza 393 392.4 2093 2090.9
Table 4.3: Actual and expected number of claims accounting for spatial effects
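The expected numbers in Table 4.3 appear to be consistent with multiplying the estimated claims of Table 4.1 by the estimated ratio of Table 4.2. The sketch below checks this reading for three bodily injury rows; the relationship is inferred from the published figures rather than stated in the text, and small differences come from rounding of the ratios.

```python
# Bodily injury: estimated claims c_i (Table 4.1), estimated ratio (Table 4.2)
# and expected number of claims as printed in Table 4.3.
rows = {
    "Álava": (266.09, 1.2868, 342.4),
    "Albacete": (89.64, 0.8309, 74.5),
    "Barcelona": (3432.24, 1.2270, 4211.5),
}


def expected_claims(estimated_claims, estimated_ratio):
    """Claims expected once the spatial effect is applied to the standard estimate."""
    return estimated_claims * estimated_ratio


for region, (c, ratio, printed) in rows.items():
    print(f"{region}: computed {expected_claims(c, ratio):.1f}, table {printed}")
```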
4.2 Analysis of the results

The first element to consider, related to the results, is the fact that the north (especially

north centre and north west) is identified as a zone with the highest risk for bodily

injury as well as for material damage. The weather conditions are worse in the north:

more rain and colder weather which facilitates the occurrence of claims. In addition, the

north is more densely populated with a higher level of activity, which means a busier

lifestyle and conditions more likely to give rise to more claims. Therefore, this was an

expected outcome.

The second element to point out, while analysing the results, is the relatively low amount

of smoothing provided by the model for both types of claims. Although there are some

variations in the ratios as a result of the spatial effects, the smoothing is not strongly

evident.

If we first consider the bodily injury peril, some changes operate in the centre east and

centre west areas. The regions of Castellón, Teruel and Cuenca (centre east) are moved

to a higher ratio section influenced by the neighbouring regions which hold higher

ratios. By contrast, Salamanca and Ávila (centre west) are shifted to lower levels since

the adjacent areas display values of ratios lower than 1. The same change can be

mentioned for Girona (northeast); however, the evolution of this zone deserves a

detailed comment.

Effectively, in this zone, there are two regions, Barcelona and Girona, which belong to

the same range of ratio and have similar neighbouring regions. However, because of the

influence of the neighbours, Girona is moved towards a lower level while Barcelona



remains in the same position. The reason for this diverse behaviour has to be found in

the exposure. The number of policies from Barcelona is very high so it is not necessary

to borrow information from neighbouring regions to calculate a reliable estimate. Even

if the neighbouring information were taken into account, it would not influence the

region value very much, due to the high volume of information for Barcelona.
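The role of exposure can be illustrated with a deliberately simplified weighting scheme. This is not the Boskov and Verrall model: it is a plain credibility-style average in which the weight given to a region's own ratio grows with its estimated number of claims. The constant `k` and the neighbour ratio are hypothetical; the ratios and estimated claims are the bodily injury figures of Table 4.1.

```python
def exposure_weight(estimated_claims, k):
    """Weight given to the region's own ratio; approaches 1 as the volume grows."""
    return estimated_claims / (estimated_claims + k)


def smoothed_ratio(own_ratio, neighbour_ratio, estimated_claims, k=50.0):
    """Weighted average of the region's own ratio and its neighbours' ratio."""
    w = exposure_weight(estimated_claims, k)
    return w * own_ratio + (1.0 - w) * neighbour_ratio


NEIGHBOUR_RATIO = 0.95  # hypothetical average ratio of the neighbouring regions

girona = smoothed_ratio(1.1687, NEIGHBOUR_RATIO, 388.45)
barcelona = smoothed_ratio(1.2284, NEIGHBOUR_RATIO, 3432.24)

print(f"Girona:    1.1687 -> {girona:.4f}")
print(f"Barcelona: 1.2284 -> {barcelona:.4f}")
```

Even in this crude form the region with the larger volume moves much less towards its neighbours, which is the qualitative behaviour described for Barcelona and Girona.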

Related to the material damage type of claims, some smoothing is carried out in

Zamora (northwest) and Sevilla (southwest). Their ratio shifts towards a higher level

and a lower level respectively. Additionally, the same effect explained for Barcelona

now operates in Madrid which remains unsmoothed although the neighbours display

lower coefficients.

Expanding on the reasons for this low degree of smoothing, the number of regions and

their composition has to be analysed.

Firstly, it has to be said that the number of regions considered is very small and

therefore, it entails less “borrowing” of information from neighbouring areas. Since the

exposure of the regions is large, a reliable spatial coefficient can be calculated with the

information from the region itself, as previously mentioned. So not much smoothing is

exhibited.

Secondly, and closely related to the previous point, the bigger the areas the more

heterogeneity they include. This means that, in the same region, there could be huge

differences within subareas. A clear example is a region which contains a big city, like Madrid or Barcelona. Without doubt, the risk covered inside the city is completely different from the risk insured in the rest of the region. What is more, there would probably be at least three distinguishable areas: the city itself, the metropolitan area or surroundings of the city, and the rest. In particular, the risk covered in the countryside subarea is more likely to be similar to the risk of neighbouring areas, whereas the risk insured inside the big city is probably very different from the less populated adjacent areas.

Therefore, the 50 regions considered do not allow the calculation of spatial coefficients

that discriminate between this existing diversity of risks.

To deal with this problem, some competitors have included a binary variable in the

modelling process which differentiates the big cities from the rest. However, this partial

solution still involves inaccuracies since it looks for a common coefficient for all the

big cities. That is, it assumes similar behaviour in all of them, which is not necessarily

true.

Finally, a last comment on the results: after running the

model, there is still some unexplained variation, given by the v_i values. These can be

interpreted as the existence of other relevant factors not yet considered. One of these

factors is almost certainly the annual mileage. This is generally accepted as an

important variable since it reflects how much risk the driver is actually exposed to.

However, the use of this factor involves problems of implementation. The issue of how

to get this number and how to annually check its updated value is not solved and other

complications appear when rating a new driver or with second hand cars. Maybe, in the

future, the use of some technological advances, like GPS (Global Positioning System),



will facilitate the inclusion of this new rating factor and therefore, the estimation of

even more precise premiums for each assured person.



5 CONCLUSION

The aim of this dissertation was to run the Boskov and Verrall model with a Spanish

database.

At the same time, this involves checking that the model and its implementation

can be transferred to a different market from that for which it was initially created.

Although the philosophy is obviously applicable to all data in which there is a

geographical area effect, some differences operate in Spain that generate additional

difficulties.

The main difference comes from the structure of the Spanish postal codes. Although

these also refer to some geographical areas, they do not provide the same level of

accuracy as the UK postcodes. In the UK, for example, by means of the postal code, it

is possible to practically identify a single house. However, it is not the same in Spain.

So, the UK postcode certainly allows for defining more homogeneous areas in terms of

risk.

First, it has to be said that there is still a lot of scope for improvement by means of

exploiting the existing Spanish postal code. As mentioned, it is made up of 5 digits and

the application has been carried out taking into account the areas identified by the two

first components. So, smaller areas could be defined considering the other elements,

which would probably provide more satisfactory results.
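
As a minimal sketch of how the 5-digit code can be broken down, the function below extracts the province from the first two digits, as in the application. The capital flag mirrors the rule used in the SAS exposure programme (third digit equal to '0'); that this rule identifies the provincial capital in every case is an assumption. The function name is hypothetical.

```python
def split_postal_code(code):
    """Split a 5-digit Spanish postal code into the parts used for zoning.

    Returns (province, is_capital). The province is given by the first two
    digits; the capital flag follows the rule of the exposure programme
    (third digit equal to '0').
    """
    code = str(code).strip().zfill(5)
    if len(code) != 5 or not code.isdigit():
        raise ValueError(f"not a 5-digit postal code: {code!r}")
    province = code[:2]
    is_capital = code[2] == "0"
    return province, is_capital


print(split_postal_code("08001"))  # ('08', True)
print(split_postal_code("28220"))  # ('28', False)
```

Using the third, fourth and fifth digits as well would define narrower zones, subject to the caveats about how those digits are allocated.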



However, going further into the use of the Spanish postal code, the zone allocated by

the fourth and fifth digits is not always well defined. That is, especially in the

countryside, two areas with the same fifth digit for example, could be crossed by

another area with a different fifth component. In addition, there are some areas not

identified by the last digits if no people live there. So this would cause other problems

to arise when locating the neighbouring regions for narrower geographical zones.

Although we have no certain proof, the experience of dealing with postal codes seems

to confirm that in Spain, new codes are allocated to new settlements but these do not

necessarily arise in a correlated order in terms of geographical location. Meanwhile, in

the UK, the postal code always refers to a specific territory regardless of the number of

people living there.

In this sense, detailed work focusing on solving the postcode inefficiencies, or looking

for an alternative to this code in terms of identifying the optimal areas to consider,

would be recommended to make progress in the implementation of the model in Spain.

However, once this problem has been overcome, the model certainly provides a useful

method in personal lines insurance for dealing with the geographical zone. In view

of the diversity of alternatives adopted by different companies to introduce this factor,

the way the geographical zone is dealt with is still an open issue.

To conclude, although this model is viewed as a useful solution to rate by geographical

zone, the different structure of the Spanish postcode in addition to the unavailability of



more detailed data have prevented more satisfactory results. A detailed geographical

study would probably be the key factor to further progress.



6 REFERENCES

BESAG, J.E., YORK, J., & MOLLIÉ, A. (1991). Bayesian image restoration, with two

applications in spatial statistics. Annals of the Institute of Statistical Mathematics 43, 1-20.

BOSKOV, M. & VERRALL, R.J. (1994). Premium rating by geographical area using

spatial models. ASTIN Bulletin 24, 131-143.

BROCKMAN, J. & WRIGHT, T.S. (1992). Statistical motor rating: making effective use

of your data. Journal of the Institute of Actuaries vol. 119, part III, 457-543.

BROUHNS, N., DENUIT, M., MASUY, B. & VERRALL, R. (2002). Ratemaking by

geographical area: a case study using the Boskov and Verrall Model. Discussion Paper,

Institut de Statistique. Université Catholique de Louvain.

CLAYTON, D., & KALDOR, J. (1987). Empirical Bayes estimates of age-standardized

relative risk for use in disease mapping. Biometrics 43, 671-681.

DIXON, M., KELSEY, R. & VERRALL, R. (2000). Postcode insurance rating: spatial

modelling and performance evaluation. Paper presented at the 4th IME Congress,

Barcelona.

DOBSON, A. (1990). An introduction to Generalized Linear Models. Chapman and

Hall.



KAFADAR, K. (1996). Smoothing geographical data, particularly rates of disease.

Statistics in Medicine 15, 2539-2560.

McCULLAGH, P. & NELDER, J.A. (1989). Generalised Linear Models. Chapman and

Hall.

SMITH, A.F.M. & ROBERTS, G.O. (1993). Bayesian computation via the Gibbs

sampler and related Markov Chain Monte Carlo methods. Journal of the Royal

Statistical Society, Series B 55, Nº 1.

TAYLOR, G.C. (1989). Use of spline functions for premium rating by geographical

area. ASTIN Bulletin 19, Nº 1, 91-122.

TAYLOR, G.C. (1996). Geographic premium rating by Whittaker spatial smoothing.

Technical report. Centre for actuarial studies. University of Melbourne.



7 APPENDICES

Appendix I: A brief review of Generalised Linear Models

Generalised Linear Models (GLM) are a general approach to modelling a response y as

a linear function involving one or more explanatory variables, x.

Although the main idea is the same as the simple linear regression, GLM expands this

methodology in two ways:

- The response variable can have a non-Normal distribution.

- The relationship between the response and explanatory variables can be

different from the simple linear form.

In this sense, the main distributions used in GLMs belong to a wider class of

distributions called the exponential family.

And, as the second expansion implies, we can now estimate parameters from linear

combinations, Xβ , like in a simple regression, but also from functions of linear

combinations, g ( Xβ ) .

The general form of the exponential family of distributions is as follows:

$$ f_Y(y \mid \theta, \phi) = \exp\left\{ \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right\} $$

where θ is known as the natural parameter, and

φ is the scale or dispersion parameter.

In addition, the expressions for the mean and variance of the response variable are

$$ E[Y] = \mu = \frac{d}{d\theta} b(\theta), \qquad V[Y] = a(\phi) \frac{d^2}{d\theta^2} b(\theta) $$



However, in general the variance depends on the mean, $\mu$, so it is often written as

$V[Y] = a(\phi)\, V(\mu)$, where $V(\mu) = \frac{d^2}{d\theta^2} b(\theta)$ is called the variance function.

The most common distributions belong to the exponential family. For example, the

Poisson, Normal and Binomial can all be written in the exponential form.
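
As a worked illustration of this statement, the Poisson case can be matched term by term against the general exponential form. The derivation below uses only the definitions already given, with $a(\phi) = 1$.

```latex
\begin{align*}
f_Y(y \mid \mu) &= \frac{\mu^{y} e^{-\mu}}{y!}
                 = \exp\left\{ y \log\mu - \mu - \log y! \right\} \\
\theta &= \log\mu, \qquad b(\theta) = e^{\theta}, \qquad a(\phi) = 1,
          \qquad c(y, \phi) = -\log y! \\
E[Y] &= \frac{d}{d\theta} b(\theta) = e^{\theta} = \mu, \qquad
V[Y] = a(\phi) \frac{d^{2}}{d\theta^{2}} b(\theta) = e^{\theta} = \mu
\end{align*}
```

The variance function of the Poisson distribution is therefore $V(\mu) = \mu$.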

The explanatory variables enter the model through a linear function, known as the linear

predictor, and may be either quantitative or qualitative.

The linear predictor can be formulated as $\eta = X\beta$ where,

$\beta = (\beta_1, \beta_2, \dots, \beta_p)^T$ is the vector of parameters of the model,

$X = (X_1, X_2, \dots, X_p)$ is the matrix of explanatory variables, with one column per variable,

and $p$ is the number of explanatory variables.

The improvement provided by GLM is the possibility of defining a relationship between the

response variable and the linear predictor. This relation is called the link function, g.

The next table summarises the relevant components for the GLM approach for the most

common distributions.

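| Family | $\theta$ | $\phi$ | $a(\phi)$ | $b(\theta)$ | $c(y, \phi)$ | link | fitted y |
|---|---|---|---|---|---|---|---|
| Normal | $\mu$ | $\sigma^2$ | $\phi$ | $\theta^2/2$ | $-\frac{1}{2}\left(\frac{y^2}{\phi} + \log(2\pi\phi)\right)$ | $\mu$ | $\eta$ |
| Poisson | $\log(\mu)$ | $1$ | $1$ | $e^{\theta}$ | $-\log y!$ | $\log(\mu)$ | $e^{\eta}$ |
| Binomial | $\log\left(\frac{\mu}{1-\mu}\right)$ | $n$ | $1/\phi$ | $\log(1 + e^{\theta})$ | $\log\binom{n}{ny}$ | $\log\left(\frac{\mu}{1-\mu}\right)$ | $\frac{e^{\eta}}{1 + e^{\eta}}$ |
| Gamma | $-1/\mu$ | $\alpha$ | $1/\phi$ | $-\log(-\theta)$ | $(\phi - 1)\log y + \phi\log\phi - \log\Gamma(\phi)$ | $1/\mu$ | $1/\eta$ |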

The estimation of the parameters of the model is carried out by the method of maximum likelihood. Given a set of independent random variables $Y_1, \dots, Y_n$, each with density function $f_Y(y_i \mid \theta, \phi)$, the maximum likelihood estimator of $\theta$ is the value which maximizes the likelihood function

$$ L(\theta, \phi \mid y) = \prod_{i=1}^{n} f_Y(y_i \mid \theta, \phi). $$

However, since the logarithmic function is monotonic, the same value maximizes the log-likelihood function $l(\theta, \phi \mid y) = \log L(\theta, \phi \mid y)$. The estimator $\hat{\theta}$ is obtained by differentiating the log-likelihood function with respect to each element $\theta_j$ of $\theta$ and solving the simultaneous equations

$$ \frac{\partial l(\theta, \phi \mid y)}{\partial \theta_j} = 0 \qquad \text{for } j = 1, \dots, p. $$

We must check that the solutions correspond to a maximum by verifying that the matrix

of second derivatives is negative definite.
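
The estimation procedure can be sketched for the simplest possible case: a Poisson model with a single parameter and a known exposure, $Y_i \sim \text{Poisson}(e_i \exp(\beta))$. The example is an illustration under these assumptions, not the routine used by any particular package; it solves the score equation by Newton-Raphson and checks the sign of the second derivative. The function name and data are hypothetical, and at least one claim is assumed so that the maximum is finite.

```python
import math


def fit_log_rate(claims, exposure, tol=1e-12, max_iter=50):
    """Maximum likelihood estimate of beta in y_i ~ Poisson(e_i * exp(beta)).

    Newton-Raphson on the log-likelihood
        l(beta) = sum(y_i * (log e_i + beta) - e_i * exp(beta) - log y_i!).
    Returns the estimate and the second derivative evaluated at it.
    """
    total_claims = sum(claims)
    total_exposure = sum(exposure)
    beta = 0.0
    for _ in range(max_iter):
        mu_total = total_exposure * math.exp(beta)
        score = total_claims - mu_total   # first derivative dl/dbeta
        hessian = -mu_total               # second derivative, always negative
        step = score / hessian
        beta -= step
        if abs(step) < tol:
            break
    return beta, -total_exposure * math.exp(beta)


beta_hat, second_derivative = fit_log_rate([3, 0, 2, 1], [10.0, 5.0, 20.0, 15.0])
print(math.exp(beta_hat))      # estimated claim frequency per unit of exposure
print(second_derivative < 0)   # True: the solution is a maximum
```

In this one-parameter case the solution has the closed form $\hat{\beta} = \log(\sum y_i / \sum e_i)$, which gives a direct check on the iterations.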



Appendix II: Diagram of the process

[Diagram: flow of the data sets through the five programmes]

1.- Claims part: Sin00cf, Sin01cf -> Cla0001

2.- Exposure part: Per1200, Per1201 -> Per0001; Per0001, Inft -> Per0001b; Per0001b, Base7 -> Per0001c; Per0001c, Per1201 -> Per0001d

3.- Renewal date cutting: Per0001d -> Posinprr; Cla0001, Per0001d -> Sin1t -> Corp1 -> Sinfase2; Posinprr, Sinfase2 -> Posinprs

4.- Mapping: Posinprs -> Ff1corp1, Ff1mate1, Ff1corp2, Ff1mate2

5.- Genmod



Appendix III: Explanation of the programmes

The information is structured in two files: the claims file and the file with all the

characteristics of the policies. The number of the policy is the relational factor.

The claims file contains data related to all the claims that were open at the beginning

of the year or were reported during the year. Each record represents a different

claim. The information includes dates (occurrence date, reporting date, closing date, …)

as well as amounts (payments and reserves).

The file with the characteristics of the policies contains records which represent

homogeneous situations of the risk. This means that any change in the characteristics of

the risk (change of vehicle, change of address, …) is reflected by the generation of a

new record. The information provided relates to characteristics of the driver,

policyholder and car owner (date of birth, sex, profession, address, …), characteristics

of the car (type of vehicle, power, weight, type of fuel, …), and other relevant

information of the risk. In addition, there are variables which are necessary to calculate

the number of days exposed-to-risk.
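
A minimal sketch of the exposure calculation follows. It assumes each record carries a start date and an end date, and divides the number of days by 365 as the SAS listing does for the variable pesrc. The record layout and the dates are hypothetical.

```python
from datetime import date


def years_exposed(start, end):
    """Exposure in years for one record: days between the dates over 365,
    as in the SAS expression pesrc = (dfiper - dinper) / 365."""
    return (end - start).days / 365


# Hypothetical policy with a change of vehicle on 1 July 2000, which
# generates a second record for the same policy number.
records = [
    {"numpol": 1, "start": date(2000, 1, 1), "end": date(2000, 7, 1)},
    {"numpol": 1, "start": date(2000, 7, 1), "end": date(2001, 1, 1)},
]
total = sum(years_exposed(r["start"], r["end"]) for r in records)
print(round(total, 4))  # 1.0027, because 2000 has 366 days
```

Splitting a policy into several homogeneous records does not change its total exposure; it only allocates the exposure to the characteristics that applied at the time.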

1.- Claims part

Since the information is processed yearly, this programme joins the claims databases of

the two years 2000 and 2001 that the analysis relates to. At the same time, claims are

filtered so as to keep only the ones corresponding to third party liability (bodily injury

and material damage). However, the resulting file still contains information about all

types of vehicles as this factor is not known at this level.
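
The filtering step can be sketched as follows. The codes, the occurrence years and the rule of keeping the last record per claim number and code (ordered by anocon) are taken from the SAS listing; the dictionary layout, the function name and the sample records are hypothetical.

```python
KEPT_CODES = {"DA", "SM", "RS", "SC"}   # codes kept by the SAS filter
YEARS = {2000, 2001}                    # occurrence years analysed


def prepare_claims(records):
    """Filter claim records and keep one row per (numexp, score).

    Each record is a dict with keys numexp, score, faocur (occurrence
    year) and anocon; the row with the highest anocon is kept, mirroring
    the sort followed by last.score in the SAS step.
    """
    kept = {}
    for rec in records:
        if rec["score"] in KEPT_CODES and rec["faocur"] in YEARS:
            key = (rec["numexp"], rec["score"])
            if key not in kept or rec["anocon"] >= kept[key]["anocon"]:
                kept[key] = dict(rec, claim=1)
    return [kept[key] for key in sorted(kept)]


sample = [
    {"numexp": 10, "score": "SM", "faocur": 2000, "anocon": 2000},
    {"numexp": 10, "score": "SM", "faocur": 2000, "anocon": 2001},
    {"numexp": 11, "score": "XX", "faocur": 2001, "anocon": 2001},
    {"numexp": 12, "score": "RS", "faocur": 1999, "anocon": 2000},
]
claims = prepare_claims(sample)
print(len(claims))  # 1: only claim 10 passes both filters
```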



2.- Exposure part

This programme deals with the file that contains all the characteristics of the policies.

The information for the two years is also joined and filtered to keep passenger cars. Since the tariff

incorporates new factors related to the characteristics of the car which were not used

before, we must include external information to be able to analyse those new variables.

The information comes from an external file which contains all the technical

characteristics of all the cars. This file is called base7. In addition the exposure file is

cleaned which means that any value out of the standard ranges is removed.
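
The cleaning rule can be sketched as a range check. The ranges for the age of the driver (18 to 90) and the power of the car (5 to 500) and the missing-value code 99999 are taken from the SAS listing, where out-of-range values are recoded rather than the records being deleted; the function names and the dictionary layout are hypothetical.

```python
MISSING = 99999  # code used for missing values in the SAS programmes


def clean(value, low, high):
    """Return value if it lies in [low, high], otherwise the missing code."""
    if value is None or value < low or value > high:
        return MISSING
    return value


def clean_record(rec):
    """Apply the range checks to a copy of one exposure record (a dict)."""
    out = dict(rec)
    out["edad"] = clean(rec.get("edad"), 18, 90)     # age of the driver
    out["cvdin"] = clean(rec.get("cvdin"), 5, 500)   # power of the car
    return out


print(clean_record({"edad": 17, "cvdin": 90}))   # age recoded to 99999
print(clean_record({"edad": 45, "cvdin": 600}))  # power recoded to 99999
```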

3.- Renewal date cutting

Since the information related to the no claims discount level or any date factor could

change every renewal date, the homogeneous situations of the risk are additionally split

by this date. This allows the addition of this discount level to the file and the calculation

of the factors: age of the driver, years since obtaining the driver's licence and years with

the company. The variable related to days exposed-to-risk is also created. At this stage,

the two files (claims and characteristics of the risks) are ready and can be merged.
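
The splitting of a record at the renewal date can be sketched as below. This is a simplified illustration rather than a translation of the SAS programme: it assumes the period is half-open, [start, end), and that the renewal month and day form a valid calendar date. Moving 29 February to 28 February follows the SAS listing. The function name is hypothetical.

```python
from datetime import date


def split_at_renewal(start, end, renew_month, renew_day):
    """Split the period [start, end) at each renewal date strictly inside it.

    Returns a list of (start, end) sub-periods. A renewal day of
    29 February is moved to 28 February, as in the SAS programme.
    """
    if renew_month == 2 and renew_day == 29:
        renew_day = 28
    cuts = []
    for year in range(start.year, end.year + 1):
        renewal = date(year, renew_month, renew_day)
        if start < renewal < end:
            cuts.append(renewal)
    bounds = [start] + cuts + [end]
    return list(zip(bounds[:-1], bounds[1:]))


pieces = split_at_renewal(date(2000, 3, 1), date(2001, 3, 1), 9, 15)
for piece_start, piece_end in pieces:
    print(piece_start, piece_end)
```

Each sub-period can then carry its own no claims discount level, age of the driver and years of licence, all evaluated at the renewal date that applies to it.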

4.- Mapping

The last step before the modelling process consists of grouping the content of the

variables into the levels decided. After that, the information is accumulated and at the

same time, the offset variable is generated.
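
A minimal sketch of the grouping and accumulation follows. The field names edad, nsin and pesrc come from the SAS listing, but the age bands are a shortened, hypothetical banding, and taking the offset as the logarithm of the accumulated exposure is an assumption (it is the usual choice with a logarithmic link).

```python
import math
from collections import defaultdict


def age_band(edad):
    """Map the age of the driver to a band (shortened, hypothetical banding)."""
    if edad == 99999:
        return "mis"
    if edad <= 21:
        return "18-21"
    if edad <= 29:
        return "22-29"
    return "30+"


def accumulate(records):
    """Sum claims and exposure per band and add offset = log(exposure)."""
    totals = defaultdict(lambda: {"claims": 0, "exposure": 0.0})
    for rec in records:
        cell = totals[age_band(rec["edad"])]
        cell["claims"] += rec["nsin"]
        cell["exposure"] += rec["pesrc"]
    return {
        band: dict(cell, offset=math.log(cell["exposure"]))
        for band, cell in totals.items()
        if cell["exposure"] > 0
    }


cells = accumulate([
    {"edad": 19, "nsin": 1, "pesrc": 0.5},
    {"edad": 20, "nsin": 0, "pesrc": 1.0},
    {"edad": 45, "nsin": 0, "pesrc": 1.0},
])
print(cells["18-21"])  # one claim over 1.5 years of exposure
```

Accumulating the records into cells reduces the size of the data set without changing the Poisson likelihood, since claim counts and exposures add up within a cell.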

5.- Genmod

Finally, this programme estimates the parameters by means of a Poisson regression

model with a logarithmic link function.
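
For a model with a single categorical factor, the Poisson regression with a logarithmic link and an exposure offset has a closed-form solution, which gives a compact illustration of what the estimated parameters mean. The sketch below is not the output of the SAS procedure; the function name, the levels and the figures are hypothetical, and every level is assumed to have at least one claim.

```python
import math


def one_factor_poisson(cells, base):
    """Fit log(mu) = offset + intercept + level effect for a single factor.

    cells maps level -> (claims, exposure). With one categorical factor the
    maximum likelihood fit reproduces the observed frequency of each level,
    so the coefficients have a closed form.
    """
    base_claims, base_exposure = cells[base]
    intercept = math.log(base_claims / base_exposure)
    effects = {
        level: math.log(claims / exposure) - intercept
        for level, (claims, exposure) in cells.items()
    }
    return intercept, effects


cells = {"18-21": (30, 200.0), "22-29": (40, 500.0), "30+": (60, 1500.0)}
intercept, effects = one_factor_poisson(cells, base="30+")
relativities = {level: math.exp(b) for level, b in effects.items()}
print(relativities)  # multiplicative factors relative to the base level
```

The exponential of each coefficient is a multiplicative relativity against the base level, which is how the parameters of a log-link model translate into a tariff.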



Appendix IV: SAS programmes

/******************************************************************/
/*** ***/
/*** Pricing by geographical zone ***/
/*** ***/
/*** Date: September/2003 ***/
/*** Author: Núria Puig ***/
/*** Job name: Claims part.sas ***/
/*** Job description: Preparation of the claims file ***/
/*** ***/
/******************************************************************/

libname pc 'C:\Dissertation';run;
options compress=yes;

data cla0001 (keep=numpol finper ffiper numexp score faocur anocon


score fmocur fdocur);
set pc.sin00cf pc.sin01cf;
where score in ('DA' 'SM' 'RS' 'SC') and faocur in (2000 2001);
run;

proc sort data=cla0001;by numexp score anocon;

/* Keep one record per claim number and score: the last one by anocon */
data pc.cla0001;
set cla0001;
by numexp score;
if last.score then do;
claim=1;
output;end;
run;

/******************************************************************/
/*** ***/
/*** Pricing by geographical zone ***/
/*** ***/
/*** Date: September/2003 ***/
/*** Author: Núria Puig ***/
/*** Job name: Exposure part.sas ***/
/*** Job description: Preparation of the exposure file ***/
/*** ***/
/******************************************************************/

libname pc 'C:\Dissertation';run;
options compress=yes;

data pc.per1200z (keep=numpol combust conoc cpcirc ffiper finper


impfranq marcwint multicar peso sexcli sexcon sexpro
uso1 valacce valtari valvehi cvdin uso gardan fmvto
fdvto garrobo garlun fainhis fminhis fdinhis fanaci
fmnaci facarne fmcarne anomatri cmarca cmodelo
cversion objeto);
set pc.per1200 (rename=(fanaci=afanaci fmnaci=afmnaci facarne=afacarne



fmcarne=afmcarne fmvto=afmvto fdvto=afdvto fainhis=
afainhis fminhis=afminhis fdinhis=afdinhis uso1=
auso1 uso2=auso2 uso3=auso3 uso4=auso4 uso5=auso5));

where auso1 in ('1');

anomatri=substr(put(fcons,8.),1,4);
marcwint=dmarca;modwint=dmodelo;
valtari=sum(valvehi,valacce)/1000;
if indrcv='S' then do;if mutrcv=1 then garrcv=4;
if mutrcv ne 1 then garrcv=3;end;
if indrcv='N' then garrcv=5;
if inddan='S' and mutdan=2 then do;if impfranq gt 0 then gardan=2;
if impfranq eq 0 then gardan=1;end;
if inddan='S' and mutdan=1 then do;if indinc='S' then gardan=7;
if indinc='N' then gardan=4;end;
if inddan='N' and indinc='S' then gardan=6;
if gardan=. then gardan=5;

if gravamen=1 then tipvehi='T1N';


if gravamen in (2 20) then tipvehi='T2N';

garrobo=indrob;
garlun=indlun;

if fanaci=. then fanaci=0;


if fmnaci=. then fmnaci=0;
if fdnaci=. then fdnaci=0;

fanaci=input(translate(afanaci,'0',' '),8.);
fmnaci=input(translate(afmnaci,'0',' '),8.);
facarne=input(afacarne,8.);
fmcarne=input(afmcarne,8.);
fmvto=input(afmvto,8.);
fdvto=input(afdvto,8.);
fainhis=input(afainhis,8.);
fminhis=input(afminhis,8.);
fdinhis=input(afdinhis,8.);
uso1=input(auso1,8.);uso2=input(auso2,8.);uso3=input(auso3,8.);
uso4=input(auso4,8.);uso5=input(auso5,8.);

uso=uso1*10000+uso2*1000+uso3*100+uso4*10+uso5;run;

data pc.per1201z (keep=numpol combust conoc cpcirc ffiper finper


impfranq marcwint multicar peso sexcli sexcon sexpro
uso1 valacce valtari valvehi cvdin uso gardan fmvto
fdvto garrobo garlun fainhis fminhis fdinhis fanaci
fmnaci facarne fmcarne anomatri cmarca cmodelo
cversion objeto);
set pc.per1201 (rename=(fanaci=afanaci fmnaci=afmnaci facarne=afacarne
fmcarne=afmcarne fmvto=afmvto fdvto=afdvto fainhis=
afainhis fminhis=afminhis fdinhis=afdinhis uso1=
auso1 uso2=auso2 uso3=auso3 uso4=auso4 uso5=auso5));
where auso1 in ('1');

fainper=substr(finper,1,4);
anomatri=substr(put(fcons,8.),1,4);
marcwint=dmarca;modwint=dmodelo;
valtari=sum(valvehi,valacce)/1000;



if indrcv='S' then do;if mutrcv=1 then garrcv=4;
if mutrcv ne 1 then garrcv=3;end;
if indrcv='N' then garrcv=5;
if inddan='S' and mutdan=2 then do;if impfranq gt 0 then gardan=2;
if impfranq eq 0 then gardan=1;end;
if inddan='S' and mutdan=1 then do;if indinc='S' then gardan=7;
if indinc='N' then gardan=4;end;
if inddan='N' and indinc='S' then gardan=6;
if gardan=. then gardan=5;

garrobo=indrob;
garlun=indlun;

if gravamen=1 then tipvehi='T1N';


if gravamen in (2 20) then tipvehi='T2N';

fanaci=input(translate(afanaci,'0',' '),8.);
fmnaci=input(translate(afmnaci,'0',' '),8.);
facarne=input(afacarne,8.);
fmcarne=input(afmcarne,8.);
fmvto=input(afmvto,8.);
fdvto=input(afdvto,8.);
fainhis=input(afainhis,8.);
fminhis=input(afminhis,8.);
fdinhis=input(afdinhis,8.);
uso1=input(auso1,8.);uso2=input(auso2,8.);uso3=input(auso3,8.);
uso4=input(auso4,8.);uso5=input(auso5,8.);

uso=uso1*10000+uso2*1000+uso3*100+uso4*10+uso5;run;

data pc.per0001;
set pc.per1200z pc.per1201z;
fainper=substr(finper,1,4);run;

proc sort data=pc.per0001;by numpol;run;

data pc.per0001b (drop=tipneg99 tipneg00 tipneg01);


merge pc.per0001 (in=c)
pc.inft;
by numpol;
if fainper='2001' then tipneg=tipneg01;
if fainper='2000' then tipneg=tipneg00;

if tipneg=' ' then do;tipneg=tipneg00;


if tipneg01 ne ' ' then tipneg=tipneg01;end;
if c;run;

proc sort data=pc.b7ve1505;by cmarca cmodelo cversion;run;

data base7;
set pc.b7ve1505;
by cmarca cmodelo cversion;
if last.cversion then output;

proc datasets lib=pc memtype=data nolist;


modify per0001b;
index create ind=(cmarca cmodelo cversion);run;



proc sort data=base7;by cmarca cmodelo cversion;run;

data pc.per0001c (drop= combusti cvdini pesoi tipneg cvdinu objeto);


merge pc.per0001b (in=c rename=(combust=combusti cvdin=cvdini )
where=(tipneg='CP' and finper ge '19990101'
and objeto not in (110 111) and numpol ne
(512700) and fmvto ne (0)))
base7 (keep=cmarca cmodelo cversion cvdin cvdinu combust peso
puertas tipov rename=(peso=pesoi));
by cmarca cmodelo cversion;
if c;

if combust in (' ' '') then combust=combusti;


if cvdinu gt 0 then cvdin=cvdinu;
if cvdin in (' ' '') then cvdin=cvdini;
if peso in (' ' '' '0') then peso=pesoi;run;

data per01;
set pc.per1201;
by numpol finper ffiper;
if last.ffiper then output;

proc datasets lib=work memtype=data nolist;


modify per01;
index create coche2=(numpol cmarca cmodelo cversion);run;

proc datasets lib=pc memtype=data nolist;


modify per0001c;
index create coche2=(numpol cmarca cmodelo cversion);run;

data pc.per0001d (drop=dinper dfiper dvenci dfnaci dfcarne dinhis


favenci fminper fdinper fafiper fmfiper fdfiper
combustp pesop fconsp gardan garrobo garlun
tipov cmarca cmodelo cversion);
merge pc.per0001c (in=c)
per01 (keep=numpol combust fcons peso cmarca cmodelo cversion
rename=(combust=combustp fcons=fconsp peso=pesop));
by numpol cmarca cmodelo cversion;
if c;

/* Province = first two digits of the postal code; third digit '0' flags the capital */
provin=substr(cpcirc,1,2);
if substr(cpcirc,3,1)='0' then capital='1';
if substr(cpcirc,3,1) in ('1' '2' '3' '4' '5' '6' '7' '8' '9') then
capital='0';

fminper=substr(finper,5,2);
fdinper=substr(finper,7,2);
fafiper=substr(ffiper,1,4);
fmfiper=substr(ffiper,5,2);
fdfiper=substr(ffiper,7,2);
if (fmvto=02 and fdvto=29) then fdvto=28;

favenci=fainper;

dinper=mdy(fminper,fdinper,fainper);
dfiper=mdy(fmfiper,fdfiper,fafiper);
dinhis=mdy(fminhis,fdinhis,fainhis);



if favenci gt 0 and fmvto gt 0 and fmvto le 12 and fdvto gt 0 and
fdvto le 31 then
dvenci=mdy(fmvto,fdvto,favenci);
if fanaci gt 0 and fmnaci gt 0 and fmnaci le 12 then
dfnaci=mdy(fmnaci,1,fanaci);
if facarne gt 0 and fmcarne gt 0 and fmcarne le 12 then
dfcarne=mdy(fmcarne,1,facarne);

if combust in ('' ' ') then combust=combustp;


if anomatri in (' ') then anomatri=substr(fconsp,1,4);
if peso in ('' ' ' '0') then peso=pesop; /* fall back to the weight read from per01 */

edad=int((dvenci-dfnaci)/365);
carnet=int((dvenci-dfcarne)/365);
antwin=int((dvenci-dinhis)/365);
antvehi=fainper-anomatri;

if abs(dfiper-dvenci) le abs(dinper-dvenci) then do;


edad=edad-1;
carnet=carnet-1;
antwin=antwin-1;end;
pesrc=((dfiper-dinper)/365);
if gardan ne 5 then pesda=pesrc;

if edad lt 18 or edad gt 90 then edad=.;


if sexcli='E' and edad=. then edad=99990;
if edad=. then edad=99999;

if carnet=-1 then carnet=0;


if carnet gt 72 then carnet=.;
if sexcli='E' and carnet=. then carnet=99990;
if carnet=. then carnet=99999;

if antvehi=-1 then antvehi=0;


if antvehi lt -1 or antvehi gt 50 then antvehi=.;
if antvehi=. then antvehi=99999;

if antwin=-1 then antwin=0;

if cvdin lt 5 or cvdin gt 500 then cvdin=.;


if cvdin in (0 .) then cvdin=99999;

if sexcli='H' then sexcli='V';


if sexcli not in ('V' 'M' 'E') then sexcli='.';
if sexcli in (' ' '.') then sexcli='mis';

if sexcon='H' then sexcon='V';


if sexcon not in ('V' 'M' 'E') then sexcon='.';
if sexcon in (' ' '.') then sexcon='mis';

if sexpro='H' then sexpro='V';


if sexpro not in ('V' 'M' 'E') then sexpro='.';
if sexpro in (' ' '.') then sexpro='mis';

if provin in ('53' '80') then provin=' ';


if provin in ('' '00') then provin='mis';

format pespot 8.2;


if cvdin in (99999) or peso in (0 .) then pespot=99999;
if cvdin ne (99999) and peso not in (0 .) then pespot=int(peso/cvdin);



if pespot=. then pespot=99999;

if puertas in (2) then puertas=3;


if puertas in (4) then puertas=5;
if puertas in (0 6 .) then puertas=99999;

if combust='' then combust='mis';

if multicar not in ('S' 'N') then multicar='mis';

if peso gt 3000 then peso=.;


peso=round(peso/100);
if peso in (0 .) then peso=99999;

if tipov in (150 250) then tip='TT ';


if tipov in (120) then tip='MON ';
if tipov not in (150 250 120) then tip='NTM';

if conoc='A' then conoc='S';

if gardan in (1) then combi='TR sin franq';


if gardan in (2) then combi='TR con franq';
if gardan in (5 3) and garrobo not in ('S') and garlun not in ('S')
then combi='TC';
if (gardan in (4 6 7)) or (gardan in (5 3) and (garrobo='S' or
garlun='S')) then combi='TC ampliat';run;

proc datasets lib=pc memtype=data nolist;


modify per0001d;
index create polper=(numpol finper ffiper);run;

data pc.per0001d;
set pc.per0001d;
by numpol finper ffiper;
if last.ffiper then output;run;

/******************************************************************/
/*** ***/
/*** Pricing by geographical zone ***/
/*** ***/
/*** Date: September/2003 ***/
/*** Author: Núria Puig ***/
/*** Job name: Expire date cut.sas ***/
/*** Job description: Divide records by renewal date ***/
/*** ***/
/******************************************************************/

libname pc 'C:\Dissertation';run;
options compress=yes;

data pc.posinprr (drop=aaaaiper aaaafper mmddiper mmddfper inia


edad carnet antwin antvehi pesrc pesda);
set pc.per9901d;
mmddfvto=put(100*fmvto+fdvto,4.);
aaaaiper=put(trunc(input(finper,8.)/10000,4)-input(finper,2.)*100,2.);
aaaafper=put(trunc(input(ffiper,8.)/10000,4)-input(ffiper,2.)*100,2.);
inia=input(finper,2.);



if inia=19 then do;
mmddiper=put(finper,8.)-19000000-aaaaiper*10000;
mmddfper=put(ffiper,8.)-19000000-aaaafper*10000;end;

if inia=20 then do;


mmddiper=put(finper,8.)-20000000-aaaaiper*10000;
mmddfper=put(ffiper,8.)-20000000-aaaafper*10000;end;

if (mmddiper ne mmddfvto or mmddfvto ne mmddfper) and


( (mmddiper<mmddfvto and mmddfvto<mmddfper) or
(mmddiper<mmddfvto and aaaaiper<aaaafper) or
(mmddiper=mmddfper and aaaaiper<aaaafper ) ) and
( mmddfvto ne 0) then do;
auxini=finper; auxfin=ffiper;

ffiper=input(finper,2.)*1000000+aaaaiper*10000+fmvto*100+fdvto;
output;
finper=ffiper; ffiper=auxfin; output;
end;
else output;

drop mmddfvto auxini auxfin;run;

proc sort data=pc.cla0001;by numpol finper ffiper;run;

data sinciib;
merge pc.cla0001 (in=c)
pc.per0001d (in=d keep=numpol finper ffiper fmvto fdvto);
by numpol finper ffiper;
if c and d;
run;

proc sort data=sinciib;by score;

proc means data=sinciib sum;


class score;
var claim;run;

data sin1 sin2;


set sinciib;

aaiper=put(trunc(input(finper,8.)/10000,4)-input(finper,2.)*100,2.);
aafper=put(trunc(input(ffiper,8.)/10000,4)-input(ffiper,2.)*100,2.);
inia=input(finper,2.);

if inia=19 then do;


mmddiper=put(finper,8.)-19000000-aaiper*10000;
mmddfper=put(ffiper,8.)-19000000-aafper*10000;end;

if inia=20 then do;


mmddiper=put(finper,8.)-20000000-aaiper*10000;
mmddfper=put(ffiper,8.)-20000000-aafper*10000;end;

mmddfvto=put(100*fmvto+fdvto,4.);

if (faocur*10000+fmocur*100+fdocur>=input(ffiper,8.) or
faocur*10000+fmocur*100+fdocur<input(finper,8.))



then output sin1;

else

if (mmddiper ne mmddfvto or mmddfvto ne mmddfper) and


( (mmddiper<mmddfvto and mmddfvto<mmddfper) or
(mmddiper<mmddfvto and aaiper<aafper) or
(mmddiper=mmddfper and aaiper<aafper ) ) and
( mmddfvto ne 0)

then do;
auxini=finper; auxfin=ffiper;

ffiper=input(finper,2.)*1000000+aaiper*10000+fmvto*100+fdvto;

if (faocur*10000+fmocur*100+fdocur<input(ffiper,8.) and
faocur*10000+fmocur*100+fdocur>=input(finper,8.))
then output sin2;

finper=ffiper; ffiper=auxfin;
aaiper=put(trunc(input(finper,8.)/10000,4)-input(finper,2.)*100,2.);
aafper=put(trunc(input(ffiper,8.)/10000,4)-input(ffiper,2.)*100,2.);
inia=input(finper,2.);

if inia=19 then do;


mmddiper=put(finper,8.)-19000000-aaiper*10000;
mmddfper=put(ffiper,8.)-19000000-aafper*10000;end;

if inia=20 then do;


mmddiper=put(finper,8.)-20000000-aaiper*10000;
mmddfper=put(ffiper,8.)-20000000-aafper*10000;end;

if (faocur*10000+fmocur*100+fdocur<input(ffiper,8.) and
faocur*10000+fmocur*100+fdocur>=input(finper,8.))
then output sin2;
end;
else output sin2;

drop mmddiper mmddfper mmddfvto auxini auxfin aaiper aafper inia;run;

proc sql;
create table sin1b as
select
posinprr.finper, posinprr.ffiper, sin1.*
from sin1, pc.posinprr
where ( posinprr.numpol=sin1.numpol and
( input(posinprr.finper,8.) <=
sin1.faocur*10000+sin1.fmocur*100+sin1.fdocur))

order by sin1.numpol, sin1.faocur, sin1.fmocur, sin1.fdocur;

proc sort data=sin1b; by numpol numexp anocon score finper ffiper;


run;

data sin1b;
set sin1b;
by numpol numexp anocon score finper ffiper;
if first.score then output;run;



proc sql;
create table sin1c as
select * from sin1
where put(sin1.numpol,9.)||put(sin1.numexp,10.)||sin1.score not in
(select put(numpol,9.)||put(numexp,10.)||score from sin1b);
quit;

data sin1d;
set sin1c;

aaiper=put(trunc(input(finper,8.)/10000,4)-input(finper,2.)*100,2.);
aafper=put(trunc(input(ffiper,8.)/10000,4)-input(ffiper,2.)*100,2.);
inia=input(finper,2.);

if inia=19 then do;


mmddiper=put(finper,8.)-19000000-aaiper*10000;
mmddfper=put(ffiper,8.)-19000000-aafper*10000;end;

if inia=20 then do;


mmddiper=put(finper,8.)-20000000-aaiper*10000;
mmddfper=put(ffiper,8.)-20000000-aafper*10000;end;

mmddfvto=put(100*fmvto+fdvto,4.);

if (mmddiper ne mmddfvto or mmddfvto ne mmddfper) and


( (mmddiper<mmddfvto and mmddfvto<mmddfper) or
(mmddiper=mmddfper and aaiper<aafper ) ) and
( mmddfvto ne 0)
then do;
auxini=finper; auxfin=ffiper;

ffiper=input(finper,2.)*1000000+aaiper*10000+fmvto*100+fdvto;
aaiper=put(trunc(input(finper,8.)/10000,4)-input(finper,2.)*100,2.);
aafper=put(trunc(input(ffiper,8.)/10000,4)-input(ffiper,2.)*100,2.);
inia=input(finper,2.);

if inia=19 then do;


mmddiper=put(finper,8.)-19000000-aaiper*10000;
mmddfper=put(ffiper,8.)-19000000-aafper*10000;end;

if inia=20 then do;


mmddiper=put(finper,8.)-20000000-aaiper*10000;
mmddfper=put(ffiper,8.)-20000000-aafper*10000;end;

if (fmocur*100+fdocur<mmddfper and
fmocur*100+fdocur>=mmddiper)
then output ;
else do;
finper=ffiper; ffiper=auxfin;
aaiper=put(trunc(input(finper,8.)/10000,4)-input(finper,2.)*100,2.);
aafper=put(trunc(input(ffiper,8.)/10000,4)-input(ffiper,2.)*100,2.);
inia=input(finper,2.);

if inia=19 then do;


mmddiper=put(finper,8.)-19000000-aaiper*10000;
mmddfper=put(ffiper,8.)-19000000-aafper*10000;end;

if inia=20 then do;


mmddiper=put(finper,8.)-20000000-aaiper*10000;
mmddfper=put(ffiper,8.)-20000000-aafper*10000;end;



output;
end;
end;
else output;

drop aaiper aafper mmddiper mmddfper mmddfvto auxini auxfin inia;run;

data sin1d (drop=finper ffiper rename=(afinper=finper


affiper=ffiper));
set sin1d;
afinper=put(finper,8.);
affiper=put(ffiper,8.);run;

data pc.sin1t;
set sin2 sin1b sin1d;run;

data corp;
set pc.sin1t;
where score in ('RS' 'SC');run;

proc sort data=corp;by numexp;run;

data corp1;
set corp;
by numexp;
if first.numexp and not last.numexp then do;nsin=0;end;
retain;
if not first.numexp and last.numexp then do;
nsin=1;
cobert='CORP';
output;end;
if first.numexp and last.numexp then do;
nsin=1;
cobert='CORP';
output;end;

data sinfase2;
set corp1
pc.sin1t (where=(score in ('SM' 'DA')));
if score='SM' then cobert='MATE';
if score='DA' then cobert='DANO';
if score in ('SM' 'DA') then nsin=1;

if cobert='MATE' then do;nsinSM=nsin;end;


if cobert='CORP' then do;nsinCO=nsin;end;
if cobert='DANO' then do;nsinDA=nsin;end;run;

proc sort data=sinfase2;by numpol finper ffiper;

proc means data=sinfase2 sum noprint;


by numpol finper ffiper;
var nsinsm nsinco nsinda;
output out=pc.sinfase2 sum=;run;

proc means data=pc.sinfase2 sum;


var nsinsm nsinco nsinda;run;



proc datasets lib=pc memtype=data nolist;
modify posinprr;
index create ind=(numpol finper ffiper);run;

data pc.posinprs (drop= fanaci fmnaci facarne fmcarne anomatri


fminhis fdinhis fdvto favenci fafiper
fmfiper fdfiper fdinper dfnaci dfcarne);
merge pc.posinprr (in=c)
pc.sinfase2 (keep=numpol finper ffiper nsinsm nsinco nsinda);
by numpol finper ffiper;
if c;

fainper=substr(finper,1,4);
fminper=substr(finper,5,2);
fdinper=substr(finper,7,2);
fafiper=substr(ffiper,1,4);
fmfiper=substr(ffiper,5,2);
fdfiper=substr(ffiper,7,2);
if (fmvto=02 and fdvto=29) then fdvto=28;

favenci=fainper;

dinper=mdy(fminper,fdinper,fainper);
dfiper=mdy(fmfiper,fdfiper,fafiper);
dinhis=mdy(fminhis,fdinhis,fainhis);

if favenci gt 0 and fmvto gt 0 and fmvto le 12 and fdvto gt 0 and


fdvto le 31 then
dvenci=mdy(fmvto,fdvto,favenci);
if fanaci gt 0 and fmnaci gt 0 and fmnaci le 12 then
dfnaci=mdy(fmnaci,1,fanaci);
if facarne gt 0 and fmcarne gt 0 and fmcarne le 12 then
dfcarne=mdy(fmcarne,1,facarne);

edad=int((dvenci-dfnaci)/365);
carnet=int((dvenci-dfcarne)/365);
antwin=int((dvenci-dinhis)/365);
antvehi=fainper-anomatri;

if abs(dfiper-dvenci) le abs(dinper-dvenci) then do;


edad=edad-1;
carnet=carnet-1;
antwin=antwin-1;end;
pesrc=((dfiper-dinper)/365);
if gardan ne 5 then pesda=pesrc;

if edad lt 18 or edad gt 90 then edad=.;
if sexcli='E' and edad=. then edad=99990;
if edad=. then edad=99999;

if carnet=-1 then carnet=0;
if carnet gt 72 then carnet=.;
if sexcli='E' and carnet=. then carnet=99990;
if carnet=. then carnet=99999;

if antvehi=-1 then antvehi=0;
if antvehi lt -1 or antvehi gt 50 then antvehi=.;
if antvehi=. then antvehi=99999;

if antwin=-1 then antwin=0;run;
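
The step above builds SAS dates from the YYYYMMDD strings `finper` and `ffiper` by slicing them with `substr` and passing the pieces to `mdy`, then measures exposure in years as the number of days divided by 365. The following is only a minimal sketch of that calculation on toy data, not the dissertation's code. It assumes both fields are 8-character strings and reads each one in a single call with the `yymmdd8.` informat. The dataset name `exposure_demo` and the sample rows are invented for illustration.

```sas
/* Toy data: period start and end stored as YYYYMMDD text (assumed layout) */
data exposure_demo;
   input numpol finper $ ffiper $;
   /* Read each string directly as a SAS date value */
   dinper = input(finper, yymmdd8.);
   dfiper = input(ffiper, yymmdd8.);
   /* Exposure in years, using the same 365-day convention as the source */
   pesrc = (dfiper - dinper) / 365;
   format dinper dfiper date9. pesrc 6.4;
   datalines;
1 20010101 20010702
2 20010315 20020315
;
run;

proc print data=exposure_demo noobs;
run;
```

With this convention a period of exactly 365 days (the second row) has an exposure of 1, while a one-year period that spans 29 February comes out slightly above 1.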

/******************************************************************/
/*** ***/
/*** Pricing by geographical zone ***/
/*** ***/
/*** Date: September/2003 ***/
/*** Author: Núria Puig ***/
/*** Job name: Mapping.sas ***/
/*** Job description: Generate the mapped variables ***/
/*** ***/
/******************************************************************/

libname pc 'C:\Dissertation';run;
options compress=yes;

data pc.ff1corp1 (keep=xedad xcarnet xantvehi xantwin xcvdin xpeso
                  xpespot xcombust xsexcli xpuertas xconoc
                  fainper xsexcon tip nsinco nsinsm pesrc drc
                  provin cp3);
set pc.posinprt;
where pesrc gt 0;

length xedad $ 6.;
select (edad);
when (18, 19, 20, 21) xedad='18-21';
when (22) xedad='22';
when (23) xedad='23';
when (24) xedad='24';
when (25) xedad='25';
when (26) xedad='26';
when (27) xedad='27';
when (28, 29) xedad='28-29';
when (30, 31) xedad='30-31';
when (32, 33) xedad='32-33';
when (34, 35) xedad='z34-35';
when (36, 37) xedad='36-37';
when (38, 39) xedad='38-39';
when (40, 41) xedad='40-41';
when (42, 43) xedad='42-43';
when (44, 45) xedad='44-45';
when (46, 47) xedad='46-47';
when (48, 49) xedad='48-49';
when (50, 51) xedad='50-51';
when (52, 53) xedad='52-53';
when (54, 55) xedad='54-55';
when (56, 57) xedad='56-57';
when (58, 59) xedad='58-59';
when (60, 61, 62, 63, 64) xedad='60-64';
when (65, 66, 67, 68, 69, 70) xedad='65-70';
when (71, 72, 73, 74, 75) xedad='71-75';
when (99999, 99990) xedad='mis';
otherwise xedad='+75';end;

length xcarnet $ 6.;
select (carnet);
when (0, 1, 2) xcarnet='0-2';
when (3, 4, 5) xcarnet='3-5';
when (6, 7, 8, 9) xcarnet='6-9';
when (10, 11, 12, 13) xcarnet='10-13';
when (14, 15, 16, 17, 18, 19, 20, 21, 22) xcarnet='z14-22';
when (23, 24, 25, 26, 27, 28, 29, 30, 31, 32) xcarnet='23-32';

when (33, 34, 35, 36, 37, 38, 39) xcarnet='33-39';
when (99990, 99999) xcarnet='mis';
otherwise xcarnet='+39';end;

length xantvehi $ 5.;
select (antvehi);
when (0) xantvehi='z0';
when (1, 2, 3, 4, 5, 6, 7, 8, 9, 10) xantvehi='1-10';
when (99999) xantvehi='mis';
otherwise xantvehi='+10';end;

length xantwin $ 5.;
select (antwin);
when (0) xantwin='z0';
when (1, 2, 3, 4, 5, 6, 7, 8, 9) xantwin='1-9';
when (99999) xantwin='mis';
otherwise xantwin='+9';end;

length xcvdin $7.;
if cvdin le 49 then xcvdin='0-49';
if 49 lt cvdin le 75 then xcvdin='50-75';
if 75 lt cvdin le 94 then xcvdin='z76-94';
if 94 lt cvdin le 129 then xcvdin='95-129';
if 129 lt cvdin le 160 then xcvdin='130-160';
if 160 lt cvdin le 199 then xcvdin='161-199';
if cvdin gt 199 then xcvdin='+199';
if cvdin in (99999) then xcvdin='mis';

length xpeso $5.;
if 0 lt peso lt 8 then xpeso='0-7';
if 8 le peso lt 12 then xpeso='8-11';
if 12 le peso lt 15 then xpeso='12-14';
if 15 le peso lt 18 then xpeso='15-17';
if peso ge 18 then xpeso='+17';
if peso in (99999 0) then xpeso='mis';

length xpespot $5.;
if 0 lt pespot le 9 then xpespot='0-9';
if 9 lt pespot le 18 then xpespot='10-18';
if pespot gt 18 then xpespot='+18';
if pespot in (99999 0) then xpespot='mis';

length xcombust $3.;
if combust='G' then xcombust='zG';
if combust='D' then xcombust='D';
if combust not in ('G', 'D') then xcombust='mis';

length xsexcli $4.;
if sexcli in ('E') then xsexcli='E';
if sexcli ne ('E') then xsexcli='zmis';

length xsexcon $4.;
if sexcon in ('V') then xsexcon='zV';
if sexcon in ('M') then xsexcon='M';
if sexcon not in('M' 'V') then xsexcon='mis';

length xpuertas $3.;
if puertas in (3) then xpuertas='3';
if puertas in (5) then xpuertas='z5';
if puertas not in (3, 5) then xpuertas='mis';

length xconoc $2.;
if conoc in ('N') then xconoc='zN';
else xconoc=conoc;

cp3=substr(cpcirc,1,3);run;
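
The `select` blocks above assign each rating variable to a labelled band, with one band per variable prefixed by `z`, presumably so that it sorts last and becomes the reference level that PROC GENMOD uses by default. As a hedged alternative sketch, the same banding can be stored once in a user-defined format and applied with `put`. The format name `carnetf` and the dataset `band_demo` are hypothetical; the bands copy the `xcarnet` mapping used for the bodily injury dataset.

```sas
/* Hypothetical format holding the licence-age bands.          */
/* 'z14-22' keeps the z prefix used in the source for one band. */
proc format;
   value carnetf
      0 - 2        = '0-2'
      3 - 5        = '3-5'
      6 - 9        = '6-9'
      10 - 13      = '10-13'
      14 - 22      = 'z14-22'
      23 - 32      = '23-32'
      33 - 39      = '33-39'
      99990, 99999 = 'mis'
      other        = '+39';
run;

/* Toy data: apply the format to a few licence ages */
data band_demo;
   input carnet;
   length xcarnet $ 6;
   xcarnet = put(carnet, carnetf.);
   datalines;
0
7
18
45
99999
;
run;

proc print data=band_demo noobs;
run;
```

Keeping the bands in a format means the two mapping steps (bodily injury and material damage) could differ only in which format they apply, at the cost of one more object to maintain.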

proc freq data=pc.ff1corp1;
weight pesrc;
tables xedad xcarnet xantwin xantvehi xcvdin xpeso xpespot xcombust
       xpuertas xsexcli xsexcon fainper xconoc drc;run;

proc sort data=pc.ff1corp1;
by xedad xcarnet xantwin xantvehi xcvdin xpeso xpespot xcombust
   xpuertas xsexcli xsexcon fainper xconoc tip drc;run;

data pc.ff1corp2 (drop=pesrc nsinCO nsinSM provin);
set pc.ff1corp1;
by xedad xcarnet xantwin xantvehi xcvdin xpeso xpespot xcombust
   xpuertas xsexcli xsexcon fainper xconoc tip drc;
if first.drc then do;
pobl=0;
claCO=0;
claSM=0;end;
pobl+pesrc;
claCO+nsinCO;
claSM+nsinSM;
if last.drc then do;
if drc in (-65) then lnpobl=log(sum(pobl, 0.35));
if drc in (-60) then lnpobl=log(sum(pobl, 0.40));
if drc in (-55) then lnpobl=log(sum(pobl, 0.45));
if drc in (-52) then lnpobl=log(sum(pobl, 0.475));
if drc in (-50) then lnpobl=log(sum(pobl, 0.50));
if drc in (-47) then lnpobl=log(sum(pobl, 0.525));
if drc in (-45) then lnpobl=log(sum(pobl, 0.55));
if drc in (-42) then lnpobl=log(sum(pobl, 0.575));
if drc in (-40) then lnpobl=log(sum(pobl, 0.60));
if drc in (-35) then lnpobl=log(sum(pobl, 0.65));
if drc in (-30) then lnpobl=log(sum(pobl, 0.70));
if drc in (-25) then lnpobl=log(sum(pobl, 0.75));
if drc in (-20) then lnpobl=log(sum(pobl, 0.80));
if drc in (-15) then lnpobl=log(sum(pobl, 0.85));
if drc in (-10) then lnpobl=log(sum(pobl, 0.90));
if drc in (-5) then lnpobl=log(sum(pobl, 0.95));
if drc in (0) then lnpobl=log(sum(pobl, 1));
if drc in (10) then lnpobl=log(sum(pobl, 1.10));
if drc in (20) then lnpobl=log(sum(pobl, 1.20));
if drc in (30) then lnpobl=log(sum(pobl, 1.30));
if drc in (50) then lnpobl=log(sum(pobl, 1.50));
if drc in (75) then lnpobl=log(sum(pobl, 1.75));
if drc in (80) then lnpobl=log(sum(pobl, 1.80));
if drc in (100) then lnpobl=log(sum(pobl, 2));
if drc in (150) then lnpobl=log(sum(pobl, 2.5));
if drc in (200) then lnpobl=log(sum(pobl, 3));
if drc in (225) then lnpobl=log(sum(pobl, 3.25));
if drc in (300) then lnpobl=log(sum(pobl, 4));
if drc in (.) then lnpobl=log(pobl);

output;end;
run;
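
The data step above collapses the sorted policy records into one row per rating cell, accumulating exposure (`pobl`) and claim counts with `first.`/`last.` processing. As a rough sketch of the same aggregation on toy data, PROC SUMMARY can produce the cell totals without a prior sort. The dataset names `cells_demo` and `cells_agg` and the two-variable cell definition are assumptions made for illustration, and only the plain `log(pobl)` offset (the source's `drc=.` case) is reproduced here.

```sas
/* Toy policy records: two rating variables, exposure and claim count */
data cells_demo;
   input xedad $ xsexcon $ pesrc nsinCO;
   datalines;
22 M 0.50 0
22 M 1.00 1
22 zV 0.25 0
23 M 1.00 .
23 M 0.75 2
;
run;

/* One output row per combination of the class variables */
proc summary data=cells_demo nway missing;
   class xedad xsexcon;
   var pesrc nsinCO;
   output out=cells_agg (drop=_type_ _freq_) sum=pobl claCO;
run;

data cells_agg;
   set cells_agg;
   /* A cell whose claim counts are all missing sums to missing; treat as 0 */
   claCO = coalesce(claCO, 0);
   /* Log exposure, used as the offset in the Poisson model */
   lnpobl = log(pobl);
run;

proc print data=cells_agg noobs;
run;
```

The `coalesce` call mirrors the behaviour of the sum statement `claCO+nsinCO` in the source, which treats a missing claim count as zero.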

data pc.ff1mate1 (keep=xedad xcarnet xantvehi xantwin xcvdin xpeso
                  xcombust xsexcli xpuertas xconoc fainper
                  xsexcon tip nsinco nsinsm pesrc drc provin);
set pc.posinprt;
where pesrc gt 0;

length xedad $ 6.;
select (edad);
when (18, 19, 20, 21) xedad='18-21';
when (22) xedad='22';
when (23) xedad='23';
when (24) xedad='24';
when (25) xedad='25';
when (26) xedad='26';
when (27) xedad='27';
when (28) xedad='28';
when (29) xedad='29';
when (30, 31) xedad='30-31';
when (32, 33) xedad='32-33';
when (34, 35) xedad='z34-35';
when (36, 37) xedad='36-37';
when (38, 39) xedad='38-39';
when (40, 41) xedad='40-41';
when (42, 43) xedad='42-43';
when (44, 45) xedad='44-45';
when (46, 47) xedad='46-47';
when (48, 49) xedad='48-49';
when (50, 51) xedad='50-51';
when (52, 53) xedad='52-53';
when (54, 55) xedad='54-55';
when (56, 57) xedad='56-57';
when (58, 59, 60) xedad='58-60';
when (61, 62, 63, 64) xedad='61-64';
when (65, 66, 67, 68, 69, 70) xedad='65-70';
when (71, 72, 73, 74) xedad='71-74';
when (99999, 99990) xedad='mis';
otherwise xedad='+74';end;

length xcarnet $ 6.;
select (carnet);
when (0) xcarnet='0';
when (1) xcarnet='1';
when (2) xcarnet='2';
when (3) xcarnet='3';
when (4, 5, 6) xcarnet='4-6';
when (7, 8) xcarnet='7-8';
when (9, 10) xcarnet='9-10';
when (11, 12, 13, 14) xcarnet='11-14';
when (15, 16, 17, 18, 19, 20, 21, 22, 23) xcarnet='z15-23';
when (24, 25, 26, 27, 28, 29, 30, 31, 32) xcarnet='24-32';
when (33, 34, 35, 36, 37, 38) xcarnet='33-38';
when (99990, 99999) xcarnet='mis';
otherwise xcarnet='+38';end;

length xantvehi $ 5.;
select (antvehi);
when (0) xantvehi='z0';
when (99999) xantvehi='mis';
otherwise xantvehi='+0';end;

length xantwin $ 5.;
select (antwin);
when (0) xantwin='z0';
when (1, 2, 3, 4, 5, 6, 7, 8, 9) xantwin='1-9';
when (99999) xantwin='mis';
otherwise xantwin='+9';end;

length xcvdin $7.;
if cvdin le 49 then xcvdin='0-49';
if 49 lt cvdin le 75 then xcvdin='50-75';
if 75 lt cvdin le 94 then xcvdin='z76-94';
if 94 lt cvdin le 129 then xcvdin='95-129';
if 129 lt cvdin le 160 then xcvdin='130-160';
if 160 lt cvdin le 199 then xcvdin='161-199';
if cvdin gt 199 then xcvdin='+199';
if cvdin in (99999) then xcvdin='mis';

length xpeso $5.;
if 0 lt peso lt 8 then xpeso='0-7';
if 8 le peso lt 12 then xpeso='8-11';
if 12 le peso lt 15 then xpeso='12-14';
if 15 le peso lt 18 then xpeso='15-17';
if peso ge 18 then xpeso='+17';
if peso in (99999 0) then xpeso='mis';

length xcombust $3.;
if combust='G' then xcombust='zG';
if combust='D' then xcombust='D';
if combust not in ('G', 'D') then xcombust='mis';

length xsexcli $4.;
if sexcli in ('E') then xsexcli='E';
if sexcli ne ('E') then xsexcli='zmis';

length xsexcon $4.;
if sexcon in ('V') then xsexcon='zV';
if sexcon in ('M') then xsexcon='M';
if sexcon not in('M' 'V') then xsexcon='mis';

length xpuertas $3.;
if puertas in (3) then xpuertas='3';
if puertas in (5) then xpuertas='z5';
if puertas not in (3, 5) then xpuertas='mis';

length xconoc $2.;
if conoc in ('N') then xconoc='zN';
else xconoc=conoc;run;

proc freq data=pc.ff1mate1;
weight pesrc;
tables xedad xcarnet xantwin xantvehi xcvdin xpeso xcombust xpuertas
       xsexcli xsexcon fainper xconoc drc;run;

proc sort data=pc.ff1mate1;
by xedad xcarnet xantwin xantvehi xcvdin xpeso xcombust xpuertas
   xsexcli xsexcon fainper xconoc tip drc;run;

data pc.ff1mate2 (drop=pesrc nsinSM provin);
set pc.ff1mate1;
by xedad xcarnet xantwin xantvehi xcvdin xpeso xcombust xpuertas
xsexcli xsexcon fainper xconoc tip drc;
if first.drc then do;
pobl=0;
claSM=0;end;
pobl+pesrc;
claSM+nsinSM;
if last.drc then do;
if drc in (-65) then lnpobl=log(sum(pobl, 0.35));
if drc in (-60) then lnpobl=log(sum(pobl, 0.40));
if drc in (-55) then lnpobl=log(sum(pobl, 0.45));
if drc in (-52) then lnpobl=log(sum(pobl, 0.475));
if drc in (-50) then lnpobl=log(sum(pobl, 0.50));
if drc in (-47) then lnpobl=log(sum(pobl, 0.525));
if drc in (-45) then lnpobl=log(sum(pobl, 0.55));
if drc in (-42) then lnpobl=log(sum(pobl, 0.575));
if drc in (-40) then lnpobl=log(sum(pobl, 0.60));
if drc in (-35) then lnpobl=log(sum(pobl, 0.65));
if drc in (-30) then lnpobl=log(sum(pobl, 0.70));
if drc in (-25) then lnpobl=log(sum(pobl, 0.75));
if drc in (-20) then lnpobl=log(sum(pobl, 0.80));
if drc in (-15) then lnpobl=log(sum(pobl, 0.85));
if drc in (-10) then lnpobl=log(sum(pobl, 0.90));
if drc in (-5) then lnpobl=log(sum(pobl, 0.95));
if drc in (0) then lnpobl=log(sum(pobl, 1));
if drc in (10) then lnpobl=log(sum(pobl, 1.10));
if drc in (20) then lnpobl=log(sum(pobl, 1.20));
if drc in (30) then lnpobl=log(sum(pobl, 1.30));
if drc in (50) then lnpobl=log(sum(pobl, 1.50));
if drc in (75) then lnpobl=log(sum(pobl, 1.75));
if drc in (80) then lnpobl=log(sum(pobl, 1.80));
if drc in (100) then lnpobl=log(sum(pobl, 2));
if drc in (150) then lnpobl=log(sum(pobl, 2.5));
if drc in (200) then lnpobl=log(sum(pobl, 3));
if drc in (225) then lnpobl=log(sum(pobl, 3.25));
if drc in (300) then lnpobl=log(sum(pobl, 4));
if drc in (.) then lnpobl=log(pobl);
output;end;
run;

/******************************************************************/
/*** ***/
/*** Pricing by geographical zone ***/
/*** ***/
/*** Date: September/2003 ***/
/*** Author: Núria Puig ***/
/*** Job name: Genmod.sas ***/
/*** Job description: Generalised linear model ***/
/*** ***/
/******************************************************************/

libname pc 'C:\Dissertation';run;
options compress=yes;

proc genmod data=pc.ff1corp2;
class xedad xcarnet xantwin xantvehi xcvdin xpeso xpespot xcombust tip
xsexcli xsexcon xpuertas fainper xconoc;
model claCO=xedad xcarnet xantvehi xantwin xcvdin xpeso xpespot
xcombust xsexcli fainper xconoc tip xsexcon xpuertas
/ dist=poisson link=log type1 type3 offset=lnpobl pscale corrb;
run;

proc genmod data=pc.ff1mate2;
class xedad xcarnet xantwin xantvehi xcvdin xpeso xcombust tip xsexcli
xsexcon xpuertas fainper xconoc;
model claSM=xedad xcarnet xantwin xcvdin xpeso xcombust*xantvehi tip
xsexcli xsexcon xpuertas fainper xconoc
/ dist=poisson link=log type1 type3 offset=lnpobl pscale corrb;
run;
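
Read together, the options `dist=poisson link=log offset=lnpobl pscale` in the two PROC GENMOD calls above correspond to a Poisson log-linear model with a fixed offset and a Pearson-based dispersion estimate. In standard GLM notation (the symbols are introduced here for illustration and are not taken from the dissertation), with $N_i$ the claim count in rating cell $i$ (`claCO` or `claSM`), $\mathbf{x}_i$ the dummy-coded class levels, $n$ the number of cells and $p$ the number of estimated parameters:

```latex
\begin{align*}
N_i &\sim \operatorname{Poisson}(\mu_i), \\
\log \mu_i &= \mathrm{lnpobl}_i + \mathbf{x}_i^{\top}\boldsymbol{\beta}, \\
\hat{\phi} &= \frac{1}{n-p}\sum_{i=1}^{n}\frac{(N_i-\hat{\mu}_i)^2}{\hat{\mu}_i}.
\end{align*}
```

The offset enters with a coefficient fixed at 1, so $\exp(\mathbf{x}_i^{\top}\boldsymbol{\beta})$ is a claim frequency per unit of the exposure measure behind `lnpobl`. With `pscale`, the point estimates $\hat{\boldsymbol{\beta}}$ are unchanged and the standard errors are multiplied by $\sqrt{\hat{\phi}}$ to allow for overdispersion.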
