Pricing by Geographical Zone: An Application With A Spanish Database
by
Núria Puig
December 2003
Actuarial Science
I would like to thank Professor Richard Verrall for his guidance, supervision and
friendly approach.
Also my acknowledgment to Winterthur Seguros from Barcelona for the database used in this dissertation.
This dissertation focuses on the premium rating area for personal insurance.
In terms of pricing, the current situation is based on the use of Generalised Linear
Model techniques to evaluate how different factors affect the size and number of
claims. Although this methodology is commonly accepted, there is not the same
agreement about how to deal with some related problems, such as the convenient
grouping of the responses of some variables or how to obtain smoothed estimates for specific factors. Both of these elements affect the geographical area variable, since we have to decide how to group the regions and how to smooth the estimates across them.
In view of these problems, Boskov and Verrall (1994) proposed a model for premium rating by postcode area. The model follows a Bayesian approach and uses the Gibbs sampler to overcome the computational complexities.
The aim of this dissertation is to give a comprehensive explanation of this model and to apply it to a real database from the Spanish motor insurance market, provided by the fifth largest company of that market. The analysis refers to the frequency risk of passenger cars and third party liability (bodily injury and material damage), and the information corresponds to the years 2000 and 2001. For the explanatory variables other than the geographical area (called standard factors), an initial analysis has been performed so as to estimate their coefficients by means of Generalised Linear Models and evaluate the goodness of fit of the resulting models.
The spatial rating is carried out afterwards and deals with the differences between the actual number of claims and the estimated number of claims, having accounted for the standard factors, in order to evaluate how much of this variation can be attributed to spatial effects and how much is unexplained variation without a defined pattern. The model operates by borrowing information from neighbouring areas, which are more likely to be similar, in order to obtain a smoothed estimate for each region.
Once the spatial analysis is performed, the results highlight some difficulties in applying the model in this country, which mainly refer to the different postcode structure compared with the UK. Before assessing risks by geographical zone, a detailed analysis to solve the postcode inefficiencies and identify suitable rating areas would be needed in order to apply the model in Spain.
The aim of actuaries is to assess risks as best as they can. This is important in terms of
equity as well as in terms of company profits. If the premium is higher than the risk
covered, competitors will attract this client with lower prices. By contrast, if the risk is
undercharged, the company will incur losses. So, to assess the risk properly, actuaries
try to find the variables which best describe the underwritten risk so as to charge the appropriate premium for each policy.
If we focus on motor insurance (as it is the example used in the numerical application
in chapters 3 and 4), there are many variables which are commonly used because of
their proven power for explaining and differentiating motor risks. That is, factors like
age and sex of the driver, driving experience, power of the car, use of the car, type of
fuel, weight of the car, etc. have been considered good explanatory variables to
discriminate between risks. A particular factor is the geographical zone (defined as the
area where the car is mainly driven). Although its explanatory value is also generally
accepted, many difficulties arise when using it for pricing purposes, which explains why it deserves a special treatment.
The first problem to deal with is the decision about the number of regions to consider. The quickest answer could be that the smaller the regions, the better the risk is assessed. However this leads to the next problem, which is to have enough volume in each region to calculate a reliable coefficient. Another important issue is the transition between regions. It is sensible to consider that neighbouring areas are likely to be more similar than areas which are far apart. In terms of comprehension, it is also difficult to accept that adjacent regions could have a big difference in premiums when other factors remain the same. In addition, big differences in premium are hard to justify as a policyholder moves through regions.
In this sense, the Boskov and Verrall (1994) model provides a solution to these problems by borrowing information to and from neighbouring areas, thus offering a way to deal with the spatial assessment of risks.
Although most of the research related to mapping and geographical risk evaluation has been carried out in the archaeological and, especially, medical and epidemiological fields, the techniques can be transferred to insurance because the target is similar, even though the nature of the risk is, obviously, different.
Some relevant papers around this matter have been published in medical journals such as Statistics in Medicine, and a good contribution coming from the archaeological area is the work of Besag et al. (1991) about image restoration. However, there is a shortage of applications of these methods to insurance portfolios. Boskov and Verrall (1994) provided an alternative method, to which this dissertation is devoted. Later on, both authors developed extensions to this latest model, for instance by introducing weighting factors accounting for distances between regions (Dixon, Kelsey and Verrall, 2000).
1.3 Outline
Chapter 2 provides the theoretical background of the Boskov and Verrall model. The chapter starts with an examination of the Bayesian statistical
methods and continues by focusing on the Gibbs sampler as the practical way to
calculate a posteriori estimates. The methods are introduced in general first, and the
approach moves to an insurance environment later. The main references for this chapter
are the Boskov and Verrall paper (1994) and the work of Smith and Roberts (1993)
about Bayesian computation. The next two chapters refer to the practical application
with the Spanish portfolio. Chapter 3 includes all the work previous to the spatial
analysis, for bodily injury and material damage types of claims, from the preparation of
the database until the point where the estimates of the standard factors are obtained by means of Generalised Linear Models. The methods are presented from a practical point of view. However, a brief review of the theory is included in appendix I.
The chapter ends with the assessment of the goodness of fit of the estimated models. In
chapter 4, the spatial analysis is carried out. The geographical zones are defined and the
models run to obtain the expected number of claims by region accounting for spatial
effects. The most interesting part is the last section related to the analysis of the results.
and some proposals for future work to assess risks by geographical zone in Spain.
2.1 Introduction
The starting point is the aim of evaluating a risk. Specifically, we are interested in
analysing its variation by geographical area. In order to do that, we can assume that the
area under study is divided into n regions. In practice, the regions correspond to postcode areas.
Since the idea behind the model is that areas which are close are likely to be more
similar than those which are far apart, and that it is possible to borrow information from
neighbours to work out the region estimates, the concept of “neighbouring areas” has to
be defined.
We consider neighbouring areas to be those which are adjacent. In other words, regions
which border the one analysed. We define δ_i as the set of neighbouring areas of region i. Some sensible comments could be made in the sense that distances between regions or other elements matter rather than the borders themselves. A further extension of the model has been developed allowing for distance weighting, as mentioned, but it will not be considered here.
We define x_i as the true risk of region i and x the vector of risk over the whole area. In addition, let y_i denote the observed data of area i and y the corresponding vector.
In this particular problem, the main target is to identify the true underlying risk of each region, given the observed data.
Following Bayesian statistics, unknown parameters are treated as random variables and the first stage consists of expressing our "prior belief" about the parameter distribution, p(x). However, since this density function may contain unspecified hyperparameters that are difficult to handle directly, it is convenient to work with the conditional prior density of each x_i given the rest of the vector instead. To be precise, x_i does not depend on regions which are not in the neighbourhood of i, so this conditional density reduces to

(2.2.3)   $p_i(x_i \mid \delta_i)$
The second step of the Bayesian formulation is based on past experience, that is, the random observed outcomes y. Having this prior information and assuming the y_i are conditionally independent given x, the joint density of the sample values, or likelihood, is

(2.2.4)   $f(y \mid x) \propto \prod_{i=1}^{n} f(y_i \mid x_i)$

Combining the likelihood with the prior by means of Bayes' theorem, the posterior density is

(2.2.6)   $p(x \mid y) \propto f(y \mid x)\, p(x)$

In this way we obtain the posterior distribution of the true risk, given the sample information, p(x | y).
Once the posterior distribution has been derived, the most obvious Bayesian point estimate is the one that maximises (2.2.6), especially if the maximum is unique. Since this maximisation cannot be carried out analytically in our setting, the estimates are obtained with a simulation method based on a variant of a Metropolis algorithm called the Gibbs sampler. This method is described later in this chapter.
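Written out, the point estimate referred to above is the maximum a posteriori (MAP) value,

$\hat{x} = \arg\max_{x}\, p(x \mid y) = \arg\max_{x}\, f(y \mid x)\, p(x),$

and the Gibbs sampler provides a practical route to it when direct maximisation is not feasible.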
We now consider the insurance risk. Specifically, we want to estimate the frequency of
claims, which is the expected number of claims divided by the exposure in each region,
x_i / r_i. However, given the exposure values, it is more common to model the expected
number of claims rather than the frequency itself. In this sense, a Poisson distribution
turns out to be appropriate to model the number of events over a given period of time.
In addition, the modelling approach allows us to include explanatory variables and estimate how much each variable contributes in determining the level of risk. The variables are considered to follow a linear function and the estimate of this function is called the linear predictor η_i. A standard choice of link function between this linear predictor and the variable that has to be explained (number of claims) is the logarithm, so that x_i = r_i exp(η_i).
However, the linear predictor is made up of different components. On one hand there is
a part of the risk which can be explained by the standard variables (age, sex, power,
etc.). On the other hand there is the risk component which depends on the geographical
zone. That spatial component is the one we are especially interested in. Finally, there is an unstructured component without a defined pattern, similar to the error term of a standard regression. Denoting these three components by t_i, u_i and v_i respectively, the model becomes

(2.3.2)   $x_i = r_i \exp(t_i + u_i + v_i)$

We assume that the standard factors have been correctly estimated using a Generalised Linear Model. This means that the t_i are already known and can be removed from the data and the model. Writing c_i = r_i exp(t_i) for the number of claims expected from the standard factors alone, and θ_i = exp(u_i + v_i) for the spatial relativity, the model reduces to

(2.3.3)   $x_i = c_i \theta_i$
The first stage in Bayesian statistics is, as mentioned, to define the prior distribution of the unknown parameter θ_i. Assuming that the risk in each region i only depends on the risk of its neighbouring regions, we will look for the most appropriate distribution for each of the two components, u_i and v_i.
Since {vi : i = 1,..., n} do not follow any defined pattern, a normal distribution will be
considered with unknown variance λ . There are no reasons to use any other
distribution. Hence,
(2.3.4)   $p(v_i) \propto \lambda^{-1/2} \exp\!\left(-\frac{1}{2\lambda} v_i^2\right)$
For the spatially structured components {u_i : i = 1,...,n}, the conditional prior density is taken to be of the form

(2.3.5)   $p_i(u_i \mid u_1,\dots,u_{i-1},u_{i+1},\dots,u_n) \propto \exp\!\left(-\frac{1}{k}\sum_{j\in\delta_i}\phi(u_i - u_j)\right)$

where k is an unknown scale hyperparameter. The function φ must reflect the spatial dependence, so it should reduce when the distance between regions increases, while adjacent regions should get similar values. Other elements could be brought into the relationship between neighbouring regions, such as distance, population at risk, etc., but for the moment only adjacency is taken into account.
Going back to the φ function, two possible choices have been investigated in the literature. With the first one, φ(z) = z²/2, the conditional prior density (2.3.5) becomes

(2.3.7)   $p_i(u_i \mid u_1,\dots,u_{i-1},u_{i+1},\dots,u_n) \propto \exp\!\left(-\frac{1}{2k}\sum_{j\in\delta_i}(u_i - u_j)^2\right)$

and the corresponding joint density over the whole set of regions is

(2.3.8)   $p(u \mid k) \propto k^{-n/2} \exp\!\left(-\frac{1}{2k}\sum_{i \approx j}(u_i - u_j)^2\right)$

where i ≈ j denotes pairs of neighbouring regions.
With the second choice, φ(z) = |z|, (2.3.5) becomes

(2.3.9)   $p_i(u_i \mid u_1,\dots,u_{i-1},u_{i+1},\dots,u_n) \propto \exp\!\left(-\frac{1}{k}\sum_{j\in\delta_i}\left|u_i - u_j\right|\right)$

The first option can be interpreted as a stochastic version of linear interpolation, while the second is related to median-based smoothing. The quadratic form is the one used in what follows.
Finally, the prior density of the two hyperparameters which determine the variance of the structured and the unstructured components, k and λ, has to be specified. The conventional non-informative choice would be a prior proportional to (kλ)⁻¹. However this particular value is not suitable because of its behaviour near the origin: values of k or λ close to zero act as an absorbing state of the Markov chain, which invalidates the Gibbs sampler. To avoid this problem, the following prior is used, where ε is a small positive constant:

(2.3.10)   $\mathrm{prior}(k, \lambda) \propto \exp\!\left(-\frac{\varepsilon}{2k} - \frac{\varepsilon}{2\lambda}\right)$
Having defined the distributions of u, v, k and λ, the joint posterior density is given by

(2.3.11)   $p(u, v, k, \lambda \mid y) \propto \left[\prod_{i=1}^{n} f(y_i \mid x_i)\right] k^{-n/2}\exp\!\left(-\frac{1}{2k}\sum_{i\approx j}(u_i-u_j)^2\right) \lambda^{-n/2}\exp\!\left(-\frac{1}{2\lambda}\sum_{i=1}^{n} v_i^2\right) \mathrm{prior}(k, \lambda)$
As discussed above, the Poisson distribution is assumed to be the most appropriate function to model the number of claims. Hence

$f(y_i \mid x_i) = \frac{e^{-x_i} x_i^{y_i}}{y_i!}, \qquad \text{with } x_i = c_i \exp(u_i + v_i)$

Taking the structure of f(y_i | x_i) into account, the joint posterior density of u, v, k and λ finally becomes

(2.3.13)   $p(u, v, k, \lambda \mid y) \propto \left[\prod_{i=1}^{n} \frac{\exp(-c_i e^{u_i+v_i})\,(c_i e^{u_i+v_i})^{y_i}}{y_i!}\right] k^{-n/2}\exp\!\left(-\frac{1}{2k}\sum_{i\approx j}(u_i-u_j)^2\right) \lambda^{-n/2}\exp\!\left(-\frac{1}{2\lambda}\sum_{i=1}^{n} v_i^2\right) \mathrm{prior}(k, \lambda)$
Now the remaining problem is to obtain maximum a posteriori estimates for the parameters. However, since the process becomes mathematically difficult because of its high dimensionality, a simulation approach is used, based on a version of a Markov Chain Monte Carlo method called the Gibbs sampler.
The idea behind Markov Chain Monte Carlo methods is that we want to sample from a distribution but it cannot be done directly. Instead, we can construct a Markov chain whose stationary distribution is the distribution of interest. If we run the chain for a long time, simulated values of the chain can be used to summarise it, so we simply need algorithms for constructing chains with specified stationary distributions. One of these algorithms is the Gibbs sampler, which exploits conditional densities to build the transitions of the chain: it only requires the conditional densities for each component x_i, given the values of the other components.
Suppose we want to generate a sample from π(x), but this function is so complicated that direct simulation is not feasible. The Gibbs sampler constructs a Markov chain with stationary distribution π(x). In order to do that, first we take some arbitrary starting values x⁰ = (x₁⁰, ..., x_n⁰). Then, we make successive random drawings from the full conditional densities,

$x_1^1 \sim \pi(x_1 \mid x_2^0, \dots, x_n^0)$
$x_2^1 \sim \pi(x_2 \mid x_1^1, x_3^0, \dots, x_n^0)$
$\vdots$
$x_n^1 \sim \pi(x_n \mid x_1^1, \dots, x_{n-1}^1)$

which completes one iteration and produces x¹ = (x₁¹, ..., x_n¹); the cycle is then repeated. After obtaining a sufficient number of realizations, we may use the empirical distribution of the simulated values to estimate the posterior densities of interest.
Related to how many iterations are necessary to get a stationary distribution, some
practical applications have shown that generally the chain must run for 1,000 steps
before converging to its stationary distribution. Once convergence has been obtained, a
sample of every 10th step over the next 10,000 steps usually provides a reasonable basis for the estimates.
To ensure that the process has converged, it is recommended to run several chains in parallel with different starting values. The chains must provide similar results; otherwise, convergence has not yet been reached and more iterations will be necessary.
We now formulate the Gibbs sampler in terms of the frequency risk. In the terminology used, at each step a value for x_i is sampled at random from the density function

(2.5.1)   $p_i(x_i \mid \delta_i, y)$

The values of the risk parameters in all regions other than i, in particular those included in δ_i, are assumed fixed at their current values, and each step involves sampling in turn from each of the full conditional densities. For the spatially structured components these are given by

(2.5.2)   $p(u_i \mid u_{-i}, v, k, \lambda, y) \propto f(y_i \mid x_i)\exp\!\left(-\frac{1}{2k}\sum_{j\in\delta_i}(u_i-u_j)^2\right) \propto \exp\!\left(-c_i \exp(u_i + v_i) + u_i y_i - \frac{n_i}{2k}(u_i - \bar{u}_i)^2\right)$

where u_{-i} denotes all values of u except u_i, n_i is the number of neighbours of region i, and ū_i is the mean value of u over δ_i.
Similarly, for the unstructured components,

(2.5.3)   $p(v_i \mid v_{-i}, u, k, \lambda, y) \propto f(y_i \mid x_i)\, p(v_i \mid \lambda) \propto f(y_i \mid x_i)\exp\!\left(-\frac{1}{2\lambda}v_i^2\right) \propto \exp\!\left(-c_i \exp(u_i + v_i) + v_i y_i - \frac{1}{2\lambda}v_i^2\right)$
For the hyperparameters, the corresponding conditional densities are

(2.5.4)   $p(k \mid u, v, \lambda, y) \propto k^{-n/2}\exp\!\left(-\frac{1}{2k}\left[\varepsilon + \sum_{i\approx j}(u_i-u_j)^2\right]\right)$

for k, and

(2.5.5)   $p(\lambda \mid u, v, k, y) \propto \lambda^{-n/2}\exp\!\left(-\frac{1}{2\lambda}\left[\varepsilon + \sum_{i=1}^{n} v_i^2\right]\right)$

for λ.
The conditional densities (2.5.2) and (2.5.3) do not have a standard form, so they are sampled by means of a carefully designed rejection method, while the hyperparameter densities are sampled directly, since (2.5.4) and (2.5.5) have a standard (inverse gamma) form. In addition, at each iteration the sampled values are standardised so that they satisfy the expressions

$\sum_{i=1}^{n} v_i^* = 0 \qquad \text{and} \qquad \sum_{i=1}^{n} c_i \exp(u_i^* + v_i^*) = \sum_{i=1}^{n} y_i$
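As an aside, one way to see why the hyperparameter densities can be sampled directly is to note that (2.5.4) is the kernel of an inverse gamma density. The following change of variable is a standard argument rather than part of the original paper:

Writing $a = \varepsilon + \sum_{i\approx j}(u_i-u_j)^2$ and $w = a/(2k)$, the density of w becomes $p(w) \propto w^{\,n/2-2}\,e^{-w}$, so w follows a Gamma(n/2 − 1, 1) distribution and k can be generated as $k = a/(2w)$. The same argument applies to λ, with $a = \varepsilon + \sum_{i=1}^{n} v_i^2$.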
For further details about Bayesian computation via the Gibbs sampler, see Smith and
Roberts (1993).
In this section an application of the model is provided. The aim is to illustrate how to
proceed with a real database until the results of the model are obtained and, thereafter,
analyse these results to get some conclusions. The database corresponds to the fifth
biggest general insurance company of the Spanish market, and the information refers to the years 2000 and 2001.
All the work related to preparing the database, modelling standard factors and
calculating the estimated number of claims has been performed with the statistical
software SAS. In appendix II there is a diagram of the process followed and the next
appendix explains the purpose of all the programmes. Finally, the programmes (in SAS language) are reproduced in the final appendix.
The first step in the preparation of the database consists of selecting the relevant information, so that homogeneous risks are modelled together. In terms of motor insurance, it means that different types of vehicles (passenger cars, vans, trucks, motorbikes, trailers, etc.) should be modelled separately, and the same advice could be given for the type of claim. The risks covered with third party liability and own damage, for example, are completely different, so they should be part of different studies. In this sense, the analysis has been performed with passenger cars and third party liability claims, split into bodily injury and material damage.
On the other hand, the premium is made up of two components: frequency and severity.
Since these two random variables have different distributions, it is also convenient to
model them separately to obtain a better rating. This particular numerical application focuses on the frequency component.
Finally, the database selected consists of 1,044,006 policies, 19,699 bodily injury (BI)
and 93,040 material damage (MD) claims with the distribution between years as shown
in Table 3.1.
The number of claims corresponds to those that have occurred during these two years by policies in force, so they include current and IBNR (incurred but not reported) claims.
The factors considered to assess the risk are those currently used by the company. They
refer mainly to the characteristics of the vehicle and of the driver and are listed below:
The power of the car is one of the most traditional factors and is applied by nearly all insurance companies.
The variable number of doors has been included recently as a standard factor in the
company. A smaller, 3-door car is commonly the second car of the family which is
driven mainly within the city. Alternatively it is sometimes used by the children in their
initial driving experience. Therefore, the inclusion of this variable tries to capture and price this kind of risk.
Related to the type of fuel, it was supposed to reflect how often a car was driven.
Because of their higher price of purchase, diesel cars were mainly bought by people
who drive a lot. Since the fuel is cheaper, this type of car was a good deal for them. In
this sense, a diesel car was equivalent to a higher effective exposure to risk. However, the gap in price between diesel and petrol cars has been reduced in recent years, so this effect is vanishing. Another vehicle factor, the ratio between power and weight, tries to differentiate a particular kind of risk, which is cars with high power and small weight.
These types of cars are considered very dangerous. In fact, they are popularly known as "flying" cars.
The second group of factors refers to the driver. All these variables are very similar in
all insurance companies, with the driver age being the most important by far. The years of driving experience provide similar information; that is, this factor is sometimes removed because of its high correlation with driver age.
This option was introduced some years ago, with the intention of uncovering fraudulent
behaviour. A general practice consisted of declaring the driver’s name in the claim
report, even though he was not driving when the claim occurred. So, the purpose of
including this factor was to charge an additional amount when declaring other drivers,
with no additional claims costs. The company was already paying these claims.
However, the results were worse than expected and more claims were declared, so this
option is not available anymore. Nevertheless, the factor is still considered since some policies still have other drivers declared.
After selecting the information and the explanatory variables, some routine work has
been done cleaning the database, which means checking that the information is correct.
The possible values that each factor can take are described in table 3.2. In this sense, any value out of these ranges has been invalidated (considered missing).
For an overview of the composition of the database after the cleaning process has been
performed, some univariate graphs are included. They display the exposure and
frequency of claims (BI and MD) for the main standard factors considered.
[Univariate graphs: exposure and claim frequency (BI and MD) for the main standard factors considered.]
Figure 3.5: Database composition by factor – power
In view of the univariate graphs, the variables driver sex and number of doors seem not
to be very relevant. In addition, it is quite likely that the estimates will increase with the
weight and the power of the car, and decrease with the number of years with the
company (loyalty) and the vehicle age. By contrast, singular factors are the driver age
and driver experience. For these two variables, the frequency decreases in the lower age
groups and increases for older people. In the middle ages there is a surprising hump.
The last step of the pre-modelling stage consists of grouping the possible answers of the numerical variables in order to reduce the number of levels. This will also reduce the number of parameters to be estimated. There are different criteria to decide this grouping; again, the levels considered are those in force in the company. The grouping of the variables that have later been selected in each model is shown in the corresponding tables; how the selection of these variables has been carried out is explained in the next section.
As will be seen, the large number of levels for the variable driver age is probably the most surprising element. The reason for this treatment is the aim of avoiding jumps in premium as the insured remains in the company and ages through the years. In some sense, the variable is treated almost as if it were continuous.
The variable number of claims has been modelled using a Poisson regression model, which belongs to the family of Generalised Linear Models.
The main idea behind the Generalised Linear Models is that we want to model a response variable as a function of a set of explanatory variables and to quantify this influence.
The linear function of the explanatory variables is called the linear predictor, and a relationship between this linear predictor and the response variable can be defined by means of a link function.
In fact, one of the powerful features of the Generalised Linear Models approach is this possibility of defining different functions between the dependent variable (the variable which has to be explained) and the explanatory variables, which makes the method very flexible. Just to clarify, it can be said that the simple linear regression is a special case of a Generalised Linear Model. For additional details about Generalised Linear Models, a brief review of this methodology is included in appendix I.
Actually, modelling involves three steps: 1) specifying the model; 2) identifying the subset of variables which provide a good estimation, and 3) calculating the estimated parameters.
Focusing on the number of claims, a Poisson model is considered to be the function that best describes this distribution, and the suitable link function is the logarithm. So the model is specified as

$x_i = r_i \exp(\beta_0 + \beta_1 z_1 + \beta_2 z_2 + \dots)$

or alternatively

$\log(x_i) = \log(r_i) + \beta_0 + \beta_1 z_1 + \beta_2 z_2 + \dots$

where, following previous notation, x_i is the expected number of claims (the response variable), r_i is the risk exposure, z_1, z_2,... are the explanatory variables and β_0, β_1, β_2,... the parameters to be estimated.
As is clearly shown in the second expression, since we model the number of claims instead of the frequency, the logarithm of the exposure has to be included as an offset variable.
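As an illustration, a minimal sketch of how such a model can be specified with SAS PROC GENMOD is shown below. The data set name and the offset variable logexp are illustrative, not the exact ones of the programmes reproduced in the appendices; the factor names follow the tables of this chapter.

proc genmod data=modeldata;
class edad carnet antvehi cvdin peso;     /* rating factors treated as categorical */
model nsin = edad carnet antvehi cvdin peso
      / dist=poisson link=log offset=logexp type3;
run;

Here nsin is the number of claims of each record and logexp contains the logarithm of the exposed-to-risk period, so that the fitted values are expected numbers of claims rather than frequencies.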
The second step, identifying the relevant subset of variables, implies finding a balance
between accuracy and simplicity. The variables which best explain the risk have to be
selected, but always taking into account what is known as parsimony. This criterion implies that a simpler model which describes the data adequately will be preferred to a more complicated one when the latter does not significantly improve the goodness of fit.
There are several procedures to identify the relevant variables. However, all of them are based on adding and deleting terms from the model and testing the significance of the corresponding change in deviance.
On one hand there is the backward-type selection technique, which starts with all the explanatory variables and proceeds by removing the least significant one at each step. On the other hand, there is the forward-type selection which, by contrast, starts with no variables and, at each step, the most significant one is introduced in the model. Finally, the stepwise procedure combines the forward and backward approaches.
This last procedure has been used. The final variables retained in each of the models and their level of significance (forward and backward analysis) are shown in the next tables.
Forward Analysis – Bodily Injury model
Variable retained Deviance Num DF F Value Pr > F Chi-Square Pr > ChiSq
INTERCEPT 88553.819
EDAD 88178.114 27 14.78 <.0001 399.14 <.0001
CARNET 88138.481 8 5.26 <.0001 42.1 <.0001
ANTVEHI 87921.066 3 76.99 <.0001 230.98 <.0001
ANTWIN 87785.847 2 71.83 <.0001 143.65 <.0001
CVDIN 87445.863 7 51.6 <.0001 361.19 <.0001
PESO 87155.754 5 61.64 <.0001 308.2 <.0001
PESPOT 87103.696 3 18.43 <.0001 55.3 <.0001
COMBUST 86907.42 2 104.26 <.0001 208.52 <.0001
SEXCLI 86899.866 1 8.02 0.0046 8.02 0.0046
FAINPER 86841.956 2 30.76 <.0001 61.52 <.0001
CONOC 86751.393 1 96.21 <.0001 96.21 <.0001
TIP 86604.769 2 77.88 <.0001 155.77 <.0001
SEXCON 86533.681 2 37.76 <.0001 75.52 <.0001
PUERTAS 86507.677 2 13.81 <.0001 27.63 <.0001
Backward Analysis
Forward Analysis – Material Damage model
Variable retained Deviance Num DF F Value Pr > F Chi-Square Pr > ChiSq
INTERCEPT 203881.209
EDAD 203362.153 28 18.13 <.0001 507.63 <.0001
CARNET 203292.77 12 5.65 <.0001 67.86 <.0001
ANTWIN 202565.232 2 355.76 <.0001 711.52 <.0001
CVDIN 201460.877 7 154.29 <.0001 1080.04 <.0001
PESO 200414.605 5 204.65 <.0001 1023.24 <.0001
COMBUST 200046.261 2 180.12 <.0001 360.24 <.0001
ANTVEHI 199487.372 2 273.29 <.0001 546.59 <.0001
TIP 199232.634 2 124.57 <.0001 249.13 <.0001
SEXCLI 199125.07 1 105.2 <.0001 105.2 <.0001
SEXCON 198970.928 2 75.37 <.0001 150.75 <.0001
PUERTAS 198964.329 2 3.23 0.0397 6.45 0.0397
FAINPER 198599.34 2 178.48 <.0001 356.95 <.0001
CONOC 198506.783 1 90.52 <.0001 90.52 <.0001
Backward Analysis
The variable related to the no claims discount deserves special comment. The no claims discount system (NDS) rewards those drivers who do not make claims. However, the discount granted is far from being a technical or real one, that is, a discount which matches the actual cost avoided. Instead, there are mainly commercial reasons behind it. So, if the results of the models were applied directly, the tariff would be clearly insufficient because the NDS introduces disequilibria. Therefore, to preserve the tariff balance, the NDS levels can either be estimated together with the other factors or be forced to take their current commercial values instead of being treated as ordinary explanatory variables. This second approach has been selected which, from a practical point of view, means including the logarithm of this variable, the current level of discount, in the offset term.
It can be mentioned that the introduction of restrictions or the intention of forcing some
estimated values in the model is solved by including these values in the offset variable.
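In SAS terms, and keeping the illustrative names used above, forcing the commercial NDS relativities simply means extending the offset before refitting; ndsfactor is a hypothetical variable holding the current discount relativity of each policy.

data modeldata2;
set modeldata;
/* combined offset: log of the exposure plus log of the forced NDS relativity */
logoff = log(exposure) + log(ndsfactor);
run;

The model is then refitted with offset=logoff, so the NDS enters the tariff with its commercial values while the remaining factors are re-estimated around it.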
With all these considerations taken into account, the models have been calculated and the estimated parameters obtained in each model are shown in tables 3.7 and 3.8.
Because the link function is the logarithm, the final estimators are calculated as the exponential of these estimated values. These factors are applied multiplicatively, since the underlying model is multiplicative¹.
¹ For additional information about the discussion between an additive and a multiplicative model to estimate the frequency, see Brockman & Wright (1992).
To visualise the results, the estimated factors of some of the variables have been plotted in figures 3.12 to 3.23. In the same graphs, the univariate values of these variables have been included, so as to show the differences between the univariate and the multivariate approach.
The graphs show very clearly how decisions taken on a univariate analysis could be wrong. Figure 3.12, corresponding to the driver age factor for bodily injury, for example, displays that the young drivers estimator has to be lower than the value calculated under the univariate basis, which means that the gap is already explained by other factors, since young drivers tend to share another characteristic in common: little driving experience. The same effect is shown in figure 3.13 for the driver licence factor. New drivers do not have to be charged as much as the univariate analysis suggests, because this increase is partly included in the driver age factor. Without doubt, these two variables are highly correlated, especially in the lower age groups. So, if the univariate values were used directly, part of the risk would be double counted.
By contrast, the univariate and multivariate results are very similar for some other variables. That is the case of the car power in bodily injury, for example (figure 3.14). This suggests that car power is an important factor, whose effect is hardly explained by any of the other variables included.
One additional comment can be made related to the variable driver age. The hump observed in the univariate analysis still persists in the multivariate one. Further investigation carried out to discover the reason for this particular pattern has determined that the children of the insured people are the cause of this shape. This is due to the children driving the parents' car, which is likely to happen when the parents are between 45 and 55 years old. Later on, young people tend to have their own car.
Figure 3.12: Comparison univariate values versus multivariate estimates – driver age – BI
Figure 3.13: Comparison univariate values versus multivariate estimates – driver licence – BI
Figure 3.14: Comparison univariate values versus multivariate estimates – car power – BI
Figure 3.15: Comparison univariate values versus multivariate estimates – vehicle age – BI
Figure 3.16: Comparison univariate values versus multivariate estimates – driver sex – BI
Figure 3.17: Comparison univariate values versus multivariate estimates – loyalty – BI
Similar comments can be made for the material damage results. The following figures 3.18 to 3.23 show some of the variables of this model. The correlation between driver age and driver licence is even more evident and, by contrast, the car power variable displays a different shape between the univariate and multivariate curves for highly powerful vehicles. There is a singular case, the weight of the car, where the multivariate estimates depart clearly from the univariate values.
Figure 3.18: Comparison univariate values versus multivariate estimates – driver age – MD
Figure 3.19: Comparison univariate values versus multivariate estimates – driver licence – MD
Figure 3.20: Comparison univariate values versus multivariate estimates – car power – MD
Figure 3.21: Comparison univariate values versus multivariate estimates – weight – MD
Figure 3.22: Comparison univariate values versus multivariate estimates – type of vehicle – MD
Figure 3.23: Comparison univariate values versus multivariate estimates – weight – MD
Although the stepwise technique ensures that the variables selected provide the best model among those considered, it is still necessary to assess its goodness of fit.
The deviance and the scale parameter (deviance divided by the degrees of freedom) are often used as a crude method to know how well the model fits the data, by comparing their values with the degrees of freedom.
In fact, the deviance reflects the discrepancy between the fitted model and the model which reproduces the observed values exactly, called the saturated model. Both models being in the same family, the deviance can be written as

$D = 2\left(\log \hat{L}_S - \log \hat{L}_C\right)$

where L̂_C is the maximised likelihood when the parameters are set equal to their estimated values and L̂_S is the maximised likelihood of the saturated model.
Then, assuming that the number of parameters to be estimated is p for a data set of N observations, the scale parameter is estimated as

$\hat{\sigma}^2 = \frac{D}{N - p}$
The next table 3.9 contains the value of these parameters for both models.
BI model MD model
Deviance 86507.68 198506.78
DF 482475 392306
Deviance/DF 0.17930 0.50600
Table 3.9: Parameters to assess goodness of fit
The deviance values are small relative to the degrees of freedom, indicating that the model fits the data well. Related to the scale parameter, although the MD model provides a much higher value, in both cases the values are small enough to support the adequacy of the fit.
It can be pointed out that the deviance is also very frequently used to decide between different models. In this case, the deviances of the models are compared and the one with the lowest value is considered to be the best, since it also means the lowest discrepancy with the data. A complementary check of the fit is the analysis of the residuals. These methods basically facilitate the investigation of specific aspects of the model.
There are several possible definitions of residuals but two of the most commonly used
are:
The Pearson residual, defined as $\dfrac{y_i - \hat{y}_i}{\sqrt{V[\hat{y}_i]}}$, and the deviance residual, defined as $\operatorname{sign}(y_i - \hat{y}_i)\sqrt{d_i}$, where ŷ_i are the fitted values and d_i the contribution of each observation to the deviance.
A useful way to analyse the residuals consists of plotting them against the fitted values. The model can be considered satisfactory if the residuals do not follow any evident pattern and are concentrated around zero. By contrast, values far from zero or a clear systematic pattern would point to deficiencies in the model.
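A possible way of producing these plots, again with the illustrative names used before, is to ask PROC GENMOD for the fitted values and deviance residuals and plot one against the other:

proc genmod data=modeldata;
class edad carnet antvehi cvdin peso;
model nsin = edad carnet antvehi cvdin peso / dist=poisson link=log offset=logexp;
output out=resid pred=fitted resdev=devres;   /* fitted values and deviance residuals */
run;

proc gplot data=resid;
plot devres*fitted;                            /* residuals against fitted values */
run;
quit;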
First we are going to evaluate the bodily injury model. Figure 3.24 displays the plot of the deviance residuals against the fitted values.
It can be seen that there is no big dispersion and the residuals are quite concentrated around zero. That is, nearly all the residuals are included in the interval (−4, 4), which means that no single observation is distorting the estimation. In this sense, the model can be said to be good.
However, it is quite evident that the residuals follow a defined pattern. This is an indication that there are one or more factors not considered in the model that are real explanatory variables. Since with the stepwise technique all the possible combinations of the variables selected are considered, and the model chosen has been tested to be the best, we can conclude that there are additional factors, apart from the ones considered, which influence the risk.
Probably we can expect the spatial elements to be important so, even more now, the spatial factors analysis appears to be relevant in assessing the bodily injury peril.
To complete the analysis, the next figure contains a histogram of the deviance residuals and table 3.10 summarises their main features (mean, variance, quartiles, extreme observations, etc.).
This additional information basically reaffirms what has been seen with the residual graphs. The histogram is very much concentrated around zero, so the mean is close to that value and the standard deviation is quite low. The extreme observations are not far from zero.
[Table 3.10: Statistical summary of the deviance residuals – Bodily Injury (quantiles and extreme observations).]
Related to the material damage model, the same analysis has been performed. Figure 3.26 displays the corresponding plot of the deviance residuals against the fitted values. The range of the residuals is wider and up to 4 there are quite a lot of observations. However, there are no clear outliers. Related to the pattern, there are also some signs of a possible defined pattern, but it is not as evident as it was in the bodily injury model.
To illustrate and validate these conclusions, figure 3.27 contains a histogram of the deviance residuals and table 3.11 is a statistical summary of this variable. The histogram shows that the residuals are still very much concentrated around zero and, according to the numerical values of table 3.11, the extreme values are not very far from zero.
[Table 3.11: Statistical summary of the deviance residuals – Material Damage (quantiles and extreme observations).]
For further details about the deviance, the scale parameter and other suggestions for assessing the goodness of fit, see McCullagh and Nelder (1989).
Once the standard factor estimators have been calculated, the next stage is the spatial analysis itself.
First of all, we have to define the different areas. Usually the areas correspond to some
postal codes. In this sense, the Spanish postal code is made up of 5 digits. The two first
digits, for example, divide the country into 50 regions, which are known as provinces.
The analysis has been performed taking into account these two first digits. So, 50 regions have been considered. Although it would have been possible to consider smaller regions, which also means that there would be a larger number of them (see chapter 5 about the conclusions), the unavailability of more detailed maps and the structure of the postcode itself led to this choice.
The map in figure 4.1 displays the 50 areas considered and the name of each region, as a reference for the rest of the analysis.
Once the regions have been defined, and provided that the estimators for the standard factors have been calculated, we are in a position to work out the expected number of claims in each region according to these standard factors, c_i.
In addition, we also have the actual number of claims reported in each area, Y_i. So, a first approach to the estimators of the spatial effect could be the ratio of the actual number of claims divided by the estimated number of claims, $\hat{\theta}_i = Y_i / c_i$. This ratio can be seen as a sort of
residual and gives us an idea about how much the spatial factors could contribute in
assessing the risk. Applying the model, we will get information about how much of this
unexplained variation can be attributed to the spatial factors and how much corresponds
to variation without a defined pattern. This undefined variation can be caused by other
unknown factors which have not been considered in the modelling approach.
Table 4.1 shows, for both models, the exposure, the actual number of claims, the expected number of claims and the ratio of actual divided by estimated for each region. The estimated number of claims has been calculated using the estimates of the standard factors.
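Taking the output data set of the earlier residual sketch, the actual and expected numbers of claims by region, and their ratio, can be accumulated as follows (provin holds the two first digits of the postal code, as in the programmes of the appendix; the other names are illustrative):

proc summary data=resid nway;
class provin;                        /* two-digit postcode region */
var nsin fitted;
output out=byregion sum=actual expected;
run;

data byregion;
set byregion;
ratio = actual / expected;           /* crude spatial relativity, Y_i / c_i */
run;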
To illustrate these numbers graphically, the exposure has been mapped to provide a
view of the composition of the portfolio by region. As can be seen, the company has a
good position in the big cities (Barcelona, Madrid, Valencia, etc.) and, in general, an
important presence in the northeast, on the east coast, and at the extreme northwest. On
the contrary, the exposure is much lower in the centre (with the exception of Madrid).
The next two maps display the crude frequencies (actual number of claims divided by exposure) for bodily injury and material damage respectively. These maps are
difficult to interpret because this crude estimation is highly affected by the amount of
risk exposure. So, regions with small exposure tend to get higher rates than the areas
with larger exposure. This means that crude rates are not a proper starting point to deal
with assessing spatial risks and reinforces the utility of the model, since some smoothing of these crude values is clearly needed.
Just to point out some examples, areas like Huelva or Cádiz (south) are affected by this low exposure effect. The same behaviour can be seen in regions around Madrid (Ávila, among others).
Finally, figures 4.5 and 4.6 show the ratio of actual divided by expected number of
claims. This ratio provides a sort of residual approach since it reflects the differences
between the actual number of claims and the expected number of claims according to
the estimates of the standard factors. The differences between regions can be attributed
to spatial factors.
Figure 4.5: Ratio actual divided by expected number of claims – Bodily Injury
At this stage, when the regions and neighbours have been defined and the estimated number of claims by region according to the standard factors has been calculated, the Boskov and Verrall model can be run. The software used was developed in the Department of Actuarial Science and Statistics of City University. This software provides a friendly way of specifying the regions, their neighbourhood structure and the data.
For both types of claims, the process has been run for 5000 iterations to enable the convergence to a steady state. After that, a sample of every 10th step over the next 1000 iterations has been used to obtain the estimates shown in table 4.2 for bodily injury and material damage.
Table 4.2: u i and vi estimates - the Boskov and Verrall model results
In general, the vi are small compared to the u i , especially in the bodily injury model,
which means that the spatial structure variation dominates the unstructured one. Recall
that in section 3.5, the deviance residuals graph (figures 3.24 and 3.26) displayed an
evident pattern, in particular for the bodily injury model. So the results of the spatial model are consistent with what was observed there.
To provide an easy view of the results the next two maps (figures 4.7 and 4.8) display
the maximum a posteriori estimates of the u i for bodily injury and material damage.
These values determine the geographical variations and will smooth the frequencies across regions.
In both maps, the north appears to be a riskier area, while the centre displays the lowest spatial estimates. In the material damage type of claim, an additional risky zone also emerges.
The influence of these results is illustrated in figures 4.9 and 4.10 where the new
estimated ratio (actual divided by expected) accounting for the spatial effects is mapped
for both types of claims. These maps should be compared with figures 4.5 and 4.6 to
evaluate the impact of the spatial factors. At first sight, the smoothing effect is
appreciable although it is a bit diffuse. Following what has been observed from the
data, the spatial effect is more noticeable in the bodily injury type of claim, as is the
smoothing.
Figure 4.10: Expected frequency accounting for spatial effects – Material Damage
Finally, in table 4.3, the expected number of claims by region, according to the new estimated frequencies, is given. It can be observed that the estimated values are very close to the actual numbers of claims.
The first element to consider, related to the results, is the fact that the north (especially
north centre and north west) is identified as a zone with the highest risk for bodily
injury as well as for material damage. The weather conditions are worse in the north:
more rain and colder weather which facilitates the occurrence of claims. In addition, the
north is more densely populated with a higher level of activity, which means a busier
lifestyle and conditions more likely to give rise to more claims. Therefore, this was an
expected outcome.
The second element to point out, while analysing the results, is the relatively low amount of smoothing provided by the model for both types of claims. Although there are some variations in the ratios as a result of the spatial effects, the smoothing is not strongly evident.
If we first consider the bodily injury peril, some changes operate in the centre east and
centre west areas. The regions of Castellón, Teruel and Cuenca (centre east) are moved
to a higher ratio section influenced by the neighbouring regions which hold higher
ratios. By contrast, Salamanca and Ávila (centre west) are shifted to lower levels since
the adjacent areas display values of ratios lower than 1. The same change can be
mentioned for Girona (northeast); however, the evolution of this zone deserves a
detailed comment.
Effectively, in this zone, there are two regions, Barcelona and Girona, which belong to the same range of ratio and have similar neighbouring regions. However, because of the influence of the neighbours, Girona is moved towards a lower level while Barcelona remains in the same range. The explanation lies in the exposure. The number of policies from Barcelona is very high, so it is not necessary to borrow much information from the neighbours; even if the neighbouring information were taken into account, it would not influence the region value very much, due to the high volume of information for Barcelona.
Related to the material damage type of claims, some smoothing is carried out in
Zamora (northwest) and Sevilla (southwest). Their ratio shifts towards a higher level
and a lower level respectively. Additionally, the same effect explained for Barcelona
now operates in Madrid which remains unsmoothed although the neighbours display
lower coefficients.
Expanding on the reasons for this low degree of smoothing, the number of regions and their size have to be considered.
Firstly, it has to be said that the number of regions considered is very small and
therefore, it entails less “borrowing” of information from neighbouring areas. Since the
exposure of the regions is large, a reliable spatial coefficient can be calculated with the
information from the region itself, as previously mentioned. So not much smoothing is
exhibited.
Secondly, and closely related to the previous point, the bigger the areas the more
heterogeneity they include. This means that, in the same region, there could be huge
differences within subareas. A clear example is the regions which contain a big city, like Madrid or Barcelona. Without doubt, the risk covered inside the city is completely different from the risk in the rest of the region, and within such a region there would be at least three differentiable areas: the city itself, the metropolitan area or
surroundings of the city and the rest. In particular, the risk covered in the countryside
subarea is more likely to be similar to the risk of neighbouring areas, whereas the risk
assured inside the big city is probably very different from the less populated adjacent
areas.
Therefore, the 50 regions considered do not allow the calculation of spatial coefficients for sufficiently homogeneous areas.
To deal with this problem, some competitors have included a binary variable in the
modelling process which differentiates the big cities from the rest. However, this partial
solution still involves inaccuracies since it looks for a common coefficient for all the
big cities. That is, it assumes similar behaviour in all of them, which is not necessarily
true.
Finally, some last comments about the results can be made in the sense that, after running the model, there is still some unexplained variation given by the v_i values. These can be interpreted as the existence of other relevant factors not yet considered. One of these could be the annual mileage, which would be a very important variable since it reflects how much risk the driver is actually exposed to.
However, the use of this factor involves problems of implementation. The issue of how
to get this number and how to annually check its updated value is not solved and other
complications appear when rating a new driver or with second hand cars. Maybe, in the future, the use of some technological advances, like GPS (Global Positioning System), will make it feasible to collect and use this kind of information.
The aim of this dissertation was to run the Boskov and Verrall model with a Spanish
database.
However, at the same time, it involves checking that the model and its implementation can be transferred to a market different from that for which it was initially created. Although the relevance of the geographical area effect is shared by both markets, some differences operate in Spain that generate additional difficulties.
The main difference comes from the structure of the Spanish postal codes. Although
these also refer to some geographical areas, they do not provide the same level of
accuracy as the UK postcodes. In the UK, for example, by means of the postal code, it
is possible to practically identify a single house. However, it is not the same in Spain.
So, the UK postcode certainly allows for defining more homogeneous areas in terms of
risk.
First, it has to be said that there is still a lot of scope for improvement by means of exploiting the existing Spanish postal code. As mentioned, it is made up of 5 digits and the application has been carried out taking into account the areas identified by the two first components. So, smaller areas could be defined considering the other digits. However, the territory identified by the fourth and fifth digits is not always well defined. That is, especially in the countryside, two areas with the same fifth digit, for example, could be crossed by another area with a different fifth component. In addition, there are some areas not identified by the last digits if no people live there. So this would cause other problems to arise when locating the neighbouring regions for narrower geographical zones.
Although we have no certain proof, the experience of dealing with postal codes seems to confirm that in Spain new codes are allocated to new settlements, but these do not always correspond to a well delimited territory. By contrast, in the UK, the postal code always refers to a specific territory regardless of the number of inhabitants.
In this sense, detailed work focusing on solving the postcode inefficiencies, or looking for an alternative to this code in terms of identifying the optimal areas to consider, would be necessary before implementing this kind of spatial rating in Spain.
However, once this problem has been overcome, the model certainly provides a useful method in personal lines insurance to deal with the geographical zone in a market where, in view of current practice, the way the geographical zone is dealt with is still an open issue.
In summary, the main obstacles found for applying the model relate to the definition of the geographical zone: the different structure of the Spanish postcode in addition to the unavailability of more detailed maps.
BESAG, J.E., YORK, J. & MOLLIÉ, A. (1991). Bayesian image restoration, with two applications in spatial statistics. Annals of the Institute of Statistical Mathematics, 43, 1-20.
BOSKOV, M. & VERRALL, R.J. (1994). Premium rating by geographical area using spatial models. ASTIN Bulletin, 24 (1), 131-143.
BROCKMAN, M.J. & WRIGHT, T.S. (1992). Statistical motor rating: making effective use of your data. Journal of the Institute of Actuaries, vol. 119, part III, 457-543.
geographical area: a case study using the Boskov and Verrall Model. Discussion Paper,
DIXON, M., KELSEY, R. & VERRALL, R. (2000). Postcode insurance rating: spatial
modelling and performance evaluation. Paper presented at the 4th IME Congress,
Barcelona.
Hall.
McCULLAGH, P. & NELDER, J.A. (1989). Generalised Linear Models. Chapman and
Hall.
SMITH, A.F.M. & ROBERTS, G.O. (1993). Bayesian computation via the Gibbs sampler and related Markov Chain Monte Carlo methods. Journal of the Royal Statistical Society, Series B, 55, 3-23.
TAYLOR, G.C. (1989). Use of spline functions for premium rating by geographical area. ASTIN Bulletin, 19 (1), 91-122.
Although the main idea is the same as in simple linear regression, GLM expands this approach in two directions. In this sense, the main distributions used in GLMs belong to a wider class of distributions, the exponential family. And, as the second expansion implies, we can now estimate parameters from linear combinations transformed through a link function, g(Xβ).
A distribution belongs to the exponential family if its density function can be written as

$f_Y(y \mid \theta, \phi) = \exp\!\left\{\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)\right\}$
In addition, the expressions for the mean and variance of the response variable are

$E[Y] = \mu = b'(\theta) \qquad V[Y] = a(\phi)\, b''(\theta)$

which can also be written as $V[Y] = a(\phi)\,V(\mu)$, where $V(\mu) = b''(\theta)$ is called the variance function.
The most common distributions belong to the exponential family. For example, the
Poisson, Normal and Binomial can all be written in the exponential form.
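For example, the Poisson probability function can be rearranged into this form (a standard manipulation, shown here only as an illustration):

$f_Y(y \mid \theta, \phi) = \frac{e^{-\mu}\mu^{y}}{y!} = \exp\{\,y\log\mu - \mu - \log y!\,\}$

so that $\theta = \log\mu$, $b(\theta) = e^{\theta}$, $a(\phi) = 1$ and $c(y, \phi) = -\log y!$; the canonical link is therefore the logarithm, which is the one used in chapter 3.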
Related to the explanatory variables, they enter through a linear function, known as the linear predictor, η = Xβ. The improvement provided by GLM is the possibility of defining a relationship between the response variable and the linear predictor. This relation is called the link function, g.
The components of the GLM formulation can be summarised for the most common distributions. For the Normal distribution, for example, $\theta = \mu$, $\phi = \sigma^2$, $a(\phi) = \phi$, $b(\theta) = \theta^2/2$, $c(y, \phi) = -\tfrac{1}{2}\left(\tfrac{y^2}{\phi} + \log(2\pi\phi)\right)$, and the canonical link is the identity, $\eta = \mu$.
The estimation of the parameters of the model is carried out by the method of maximum likelihood. So, given a set of independent random variables Y_1,...,Y_n with densities in the exponential family, we look for the value of the parameters which maximises the likelihood function, given by

$L(\theta, \phi \mid y) = \prod_{i=1}^{n} f_Y(y_i \mid \theta_i, \phi)$

However, since the logarithmic function is monotonic, the same value maximises the log-likelihood, and the estimates are obtained by solving the system of equations

$\frac{\partial l(\theta, \phi \mid y)}{\partial \theta_j} = 0 \qquad \text{for } j = 1, \dots, p$
We must check that the solutions correspond to a maximum by verifying that the matrix of second derivatives is negative definite at that point.
[Process diagram: flow of the SAS data sets Per1201, Per0001c, Per0001d, Posinprr, Posinprs, Sin1t, Corp1, Ff1corp1, Ff1mate1 and Sinfase2 through the preparation steps.]
The information is structured in two files: the claims file and the file with all the
characteristics of the policies. The number of the policy is the relational factor.
The claims file contains data related to all the claims that were opened at the beginning
of the year or have been reported during the year. Each record represents a different
claim. The information includes dates (occurrence date, reporting date, closing date, …)
The file with the characteristics of the policies contains records which represent homogeneous situations of the risk. This means that any change in the characteristics of the risk generates a new record. The variables include data about the policyholder and car owner (date of birth, sex, profession, address, …), characteristics of the car (type of vehicle, power, weight, type of fuel, …), and other relevant information about the risk. In addition, there are variables which are necessary to calculate the exposure.
1.- Claims part
Since the information is processed yearly, this programme joins the claims databases of the two years, 2000 and 2001, that the analysis relates to. At the same time, claims are filtered so as to keep only the ones corresponding to third party liability (bodily injury and material damage). However, the resulting file still contains claims from all types of vehicles at this stage.
2.- Exposure part
This programme deals with the file that contains all the characteristics of the policies. The two years of information are also joined and the passenger cars filtered. Since the tariff incorporates new factors related to the characteristics of the car which were not used before, we must include external information to be able to analyse those new variables. The information comes from an external file which contains all the technical characteristics of all the cars. This file is called base7. In addition, the exposure file is cleaned, which means that any value out of the standard ranges is removed.
3.- Expire date cut
Since the information related to the no claims discount level or any date factor could change at every renewal date, the homogeneous situations of the risk are additionally split by this date. This allows the addition of the discount level to the file and the calculation of the factors: age of the driver, years since obtaining the driving licence and years with the company. The variable related to days exposed-to-risk is also created. At this stage, the two files (claims and characteristics of the risks) are ready and can be merged.
4.- Mapping
The last step before the modelling process consists of grouping the content of the variables into the levels decided. After that, the information is accumulated and the modelling database is ready.
5.- Genmod
This programme fits the Generalised Linear Models described in chapter 3 by means of the GENMOD procedure.
/******************************************************************/
/*** ***/
/*** Pricing by geographical zone ***/
/*** ***/
/*** Date: September/2003 ***/
/*** Author: Núria Puig ***/
/*** Job name: Claims part.sas ***/
/*** Job description: Preparation of the claims file ***/
/*** ***/
/******************************************************************/
libname pc 'C:\Dissertation';run;
options compress=yes;
/* Keep one record per claim and coverage code: the last record of each numexp/score group */
data pc.cla0001;
set cla0001;
by numexp score;
if last.score then do;
claim=1;
output;end;
run;
/******************************************************************/
/*** ***/
/*** Pricing by geographical zone ***/
/*** ***/
/*** Date: September/2003 ***/
/*** Author: Núria Puig ***/
/*** Job name: Exposure part.sas ***/
/*** Job description: Preparation of the exposure file ***/
/*** ***/
/******************************************************************/
libname pc 'C:\Dissertation';run;
options compress=yes;
anomatri=substr(put(fcons,8.),1,4);
marcwint=dmarca;modwint=dmodelo;
valtari=sum(valvehi,valacce)/1000;
if indrcv='S' then do;if mutrcv=1 then garrcv=4;
if mutrcv ne 1 then garrcv=3;end;
if indrcv='N' then garrcv=5;
if inddan='S' and mutdan=2 then do;if impfranq gt 0 then gardan=2;
if impfranq eq 0 then gardan=1;end;
if inddan='S' and mutdan=1 then do;if indinc='S' then gardan=7;
if indinc='N' then gardan=4;end;
if inddan='N' and indinc='S' then gardan=6;
if gardan=. then gardan=5;
garrobo=indrob;
garlun=indlun;
fanaci=input(translate(afanaci,'0',' '),8.);
fmnaci=input(translate(afmnaci,'0',' '),8.);
facarne=input(afacarne,8.);
fmcarne=input(afmcarne,8.);
fmvto=input(afmvto,8.);
fdvto=input(afdvto,8.);
fainhis=input(afainhis,8.);
fminhis=input(afminhis,8.);
fdinhis=input(afdinhis,8.);
uso1=input(auso1,8.);uso2=input(auso2,8.);uso3=input(auso3,8.);
uso4=input(auso4,8.);uso5=input(auso5,8.);
uso=uso1*10000+uso2*1000+uso3*100+uso4*10+uso5;run;
fainper=substr(finper,1,4);
anomatri=substr(put(fcons,8.),1,4);
marcwint=dmarca;modwint=dmodelo;
valtari=sum(valvehi,valacce)/1000;
garrobo=indrob;
garlun=indlun;
fanaci=input(translate(afanaci,'0',' '),8.);
fmnaci=input(translate(afmnaci,'0',' '),8.);
facarne=input(afacarne,8.);
fmcarne=input(afmcarne,8.);
fmvto=input(afmvto,8.);
fdvto=input(afdvto,8.);
fainhis=input(afainhis,8.);
fminhis=input(afminhis,8.);
fdinhis=input(afdinhis,8.);
uso1=input(auso1,8.);uso2=input(auso2,8.);uso3=input(auso3,8.);
uso4=input(auso4,8.);uso5=input(auso5,8.);
uso=uso1*10000+uso2*1000+uso3*100+uso4*10+uso5;run;
data pc.per0001;
set pc.per1200z pc.per1201z;
fainper=substr(finper,1,4);run;
data base7;
set pc.b7ve1505;
by cmarca cmodelo cversion;
if last.cversion then output;
data per01;
set pc.per1201;
by numpol finper ffiper;
if last.ffiper then output;
provin=substr(cpcirc,1,2);
if substr(cpcirc,3,1)='0' then capital='1';
if substr(cpcirc,3,1) in ('1' '2' '3' '4' '5' '6' '7' '8' '9') then
capital='0';
fminper=substr(finper,5,2);
fdinper=substr(finper,7,2);
fafiper=substr(ffiper,1,4);
fmfiper=substr(ffiper,5,2);
fdfiper=substr(ffiper,7,2);
if (fmvto=02 and fdvto=29) then fdvto=28;
favenci=fainper;
dinper=mdy(fminper,fdinper,fainper);
dfiper=mdy(fmfiper,fdfiper,fafiper);
dinhis=mdy(fminhis,fdinhis,fainhis);
edad=int((dvenci-dfnaci)/365);
carnet=int((dvenci-dfcarne)/365);
antwin=int((dvenci-dinhis)/365);
antvehi=fainper-anomatri;
data pc.per0001d;
set pc.per0001d;
by numpol finper ffiper;
if last.ffiper then output;run;
/******************************************************************/
/*** ***/
/*** Pricing by geographical zone ***/
/*** ***/
/*** Date: September/2003 ***/
/*** Author: Núria Puig ***/
/*** Job name: Expire date cut.sas ***/
/*** Job description: Divide records by renewal date ***/
/*** ***/
/******************************************************************/
libname pc 'C:\Dissertation';run;
options compress=yes;
ffiper=input(finper,2.)*1000000+aaaaiper*10000+fmvto*100+fdvto;
output;
finper=ffiper; ffiper=auxfin; output;
end;
else output;
data sinciib;
merge pc.cla0001 (in=c)
pc.per0001d (in=d keep=numpol finper ffiper fmvto fdvto);
by numpol finper ffiper;
if c and d;
run;
aaiper=put(trunc(input(finper,8.)/10000,4)-input(finper,2.)*100,2.);
aafper=put(trunc(input(ffiper,8.)/10000,4)-input(ffiper,2.)*100,2.);
inia=input(finper,2.);
mmddfvto=put(100*fmvto+fdvto,4.);
if (faocur*10000+fmocur*100+fdocur>=input(ffiper,8.) or
faocur*10000+fmocur*100+fdocur<input(finper,8.))
else
then do;
auxini=finper; auxfin=ffiper;
ffiper=input(finper,2.)*1000000+aaiper*10000+fmvto*100+fdvto;
if (faocur*10000+fmocur*100+fdocur<input(ffiper,8.) and
faocur*10000+fmocur*100+fdocur>=input(finper,8.))
then output sin2;
finper=ffiper; ffiper=auxfin;
aaiper=put(trunc(input(finper,8.)/10000,4)-input(finper,2.)*100,2.);
aafper=put(trunc(input(ffiper,8.)/10000,4)-input(ffiper,2.)*100,2.);
inia=input(finper,2.);
if (faocur*10000+fmocur*100+fdocur<input(ffiper,8.) and
faocur*10000+fmocur*100+fdocur>=input(finper,8.))
then output sin2;
end;
else output sin2;
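/* Look up, for each claim in sin1, the policy periods of the same policy in posinprr whose start date is not later than the occurrence date */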
proc sql;
create table sin1b as
select
posinprr.finper, posinprr.ffiper, sin1.*
from sin1, pc.posinprr
where ( posinprr.numpol=sin1.numpol and
( input(posinprr.finper,8.) <=
sin1.faocur*10000+sin1.fmocur*100+sin1.fdocur))
data sin1b;
set sin1b;
by numpol numexp anocon score finper ffiper;
if first.score then output;run;
data sin1d;
set sin1c;
aaiper=put(trunc(input(finper,8.)/10000,4)-input(finper,2.)*100,2.);
aafper=put(trunc(input(ffiper,8.)/10000,4)-input(ffiper,2.)*100,2.);
inia=input(finper,2.);
mmddfvto=put(100*fmvto+fdvto,4.);
ffiper=input(finper,2.)*1000000+aaiper*10000+fmvto*100+fdvto;
aaiper=put(trunc(input(finper,8.)/10000,4)-input(finper,2.)*100,2.);
aafper=put(trunc(input(ffiper,8.)/10000,4)-input(ffiper,2.)*100,2.);
inia=input(finper,2.);
if (fmocur*100+fdocur<mmddfper and
fmocur*100+fdocur>=mmddiper)
then output ;
else do;
finper=ffiper; ffiper=auxfin;
aaiper=put(trunc(input(finper,8.)/10000,4)-input(finper,2.)*100,2.);
aafper=put(trunc(input(ffiper,8.)/10000,4)-input(ffiper,2.)*100,2.);
inia=input(finper,2.);
data pc.sin1t;
set sin2 sin1b sin1d;run;
data corp;
set pc.sin1t;
where score in ('RS' 'SC');run;
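/* The next step keeps one record per claim number (numexp), combining the bodily injury scores RS and SC into a single CORP claim */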
data corp1;
set corp;
by numexp;
if first.numexp and not last.numexp then do;nsin=0;end;
retain;
if not first.numexp and last.numexp then do;
nsin=1;
cobert='CORP';
output;end;
if first.numexp and last.numexp then do;
nsin=1;
cobert='CORP';
output;end;
data sinfase2;
set corp1
pc.sin1t (where=(score in ('SM' 'DA')));
if score='SM' then cobert='MATE';
if score='DA' then cobert='DANO';
if score in ('SM' 'DA') then nsin=1;
fainper=substr(finper,1,4);
fminper=substr(finper,5,2);
fdinper=substr(finper,7,2);
fafiper=substr(ffiper,1,4);
fmfiper=substr(ffiper,5,2);
fdfiper=substr(ffiper,7,2);
if (fmvto=02 and fdvto=29) then fdvto=28;
favenci=fainper;
dinper=mdy(fminper,fdinper,fainper);
dfiper=mdy(fmfiper,fdfiper,fafiper);
dinhis=mdy(fminhis,fdinhis,fainhis);
edad=int((dvenci-dfnaci)/365);
carnet=int((dvenci-dfcarne)/365);
antwin=int((dvenci-dinhis)/365);
antvehi=fainper-anomatri;
libname pc 'C:\Dissertation';run;
options compress=yes;
cp3=substr(cpcirc,1,3);run;
/******************************************************************/
/*** ***/
/*** Pricing by geographical zone ***/
/*** ***/
/*** Date: September/2003 ***/
/*** Author: Núria Puig ***/
/*** Job name: Genmod.sas ***/
/*** Job description: Generalised linear model ***/
/*** ***/
/******************************************************************/
libname pc 'C:\Dissertation';run;
options compress=yes;
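The body of this programme was not reproduced above. A minimal sketch of the kind of call it would contain is the following; the data set name and the offset variable logexp are illustrative, while the factor names are those listed in the tables of section 3.4.

proc genmod data=pc.modeldata;
class edad carnet antvehi antwin cvdin peso pespot combust
      sexcli fainper conoc tip sexcon puertas;
model nsin = edad carnet antvehi antwin cvdin peso pespot combust
             sexcli fainper conoc tip sexcon puertas
      / dist=poisson link=log offset=logexp type3;
run;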