Gradient boosting trees for auto insurance loss cost modeling and prediction
Leo Guelman
Royal Bank of Canada, RBC Insurance, 6880 Financial Drive, Mississauga, Ontario, Canada L5N 7Y5
Keywords: Statistical learning; Gradient boosting trees; Insurance pricing

Abstract

Gradient Boosting (GB) is an iterative algorithm that combines simple parameterized functions with "poor" performance (high prediction error) to produce a highly accurate prediction rule. In contrast to other statistical learning methods usually providing comparable accuracy (e.g., neural networks and support vector machines), GB gives interpretable results, while requiring little data preprocessing and tuning of the parameters. The method is highly robust to less than clean data and can be applied to classification or regression problems from a variety of response distributions (Gaussian, Bernoulli, Poisson, and Laplace). Complex interactions are modeled simply, missing values in the predictors are managed almost without loss of information, and feature selection is performed as an integral part of the procedure. These properties make GB a good candidate for insurance loss cost modeling. However, to the best of our knowledge, the application of this method to insurance pricing has not been fully documented to date. This paper presents the theory of GB and its application to the problem of predicting auto "at-fault" accident loss cost using data from a major Canadian insurer. The predictive accuracy of the model is compared against the conventional Generalized Linear Model (GLM) approach.
The predictive learning problem can be characterized by a vector of inputs or predictor variables x = {x_1, ..., x_p} and an output or target variable y. In this application, the input variables are represented by a collection of quantitative and qualitative attributes of the vehicle and the insured, and the output is the actual loss cost. Given a collection of M instances {(y_i, x_i); i = 1, ..., M} of known (y, x) values, the goal is to use this data to obtain an estimate of the function that maps the input vector x into the values of the output y. This function can then be used to make predictions on instances where only the x values are observed. Formally, we wish to learn a prediction function \hat{f}(x): x -> y that minimizes the expectation of some loss function L(y, f) over the joint distribution of all (y, x)-values

\hat{f}(x) = \arg\min_{f(x)} E_{y,x} L(y, f(x))    (1)

Boosting methods are based on the intuitive idea that combining many "weak" rules to approximate (1) should result in classification and regression models with improved predictive performance compared to a single model. A weak rule is a learning algorithm which performs only a little better than a coin flip. The aim is to characterize "local rules" relating variables (e.g., "if an insured characteristic A is present and B is absent, then a claim has a high probability of occurring"). Although such a rule alone would not be strong enough to make accurate predictions on all insureds, it is possible to combine many of those rules to produce a highly accurate model. This idea, known as the "strength of weak learnability" (Schapire, 1990), originated in the machine learning community with the introduction of AdaBoost, which is described in the next section.

The success of AdaBoost for classification problems was seen as a mysterious phenomenon by the statistics community until Friedman, Hastie, and Tibshirani (2000) showed the connection between boosting and statistical concepts such as additive modeling and maximum likelihood. Their main result is that it is possible to rederive AdaBoost as a method for fitting an additive model in a forward stagewise manner. This gave significant understanding of why this algorithm tends to outperform a single base model: by fitting an additive model of different and potentially simple functions, it expands the class of functions that can be approximated.

4. Additive models and boosting

Our discussion in this section will be focused on the regression problem, where the output y is quantitative and the objective is to estimate the mean E(y|x) = f(x). The standard linear regression model assumes a linear form for this conditional expectation

E(y|x) = f(x) = \sum_{j=1}^{p} \beta_j x_j    (2)

An additive model extends the linear model by replacing the linear component \eta = \sum_{j=1}^{p} \beta_j x_j with an additive predictor of the form \eta = \sum_{j=1}^{p} f_j(x_j). We assume

E(y|x) = f(x) = \sum_{j=1}^{p} f_j(x_j),    (3)

where each f_j is an unspecified ("nonparametric") function of the corresponding predictor.
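To make the additive specification in (3) concrete, the short R sketch below fits an additive model with smoothing splines via the mgcv package. The simulated data, variable names and settings are illustrative assumptions only and are not part of the paper's analysis.

library(mgcv)

set.seed(1)
n  <- 500
x1 <- runif(n)
x2 <- runif(n)
y  <- 2 * x1 + sin(2 * pi * x2) + rnorm(n, sd = 0.3)

# E(y|x) = f1(x1) + f2(x2), with each f_j estimated nonparametrically
fit <- gam(y ~ s(x1) + s(x2))
summary(fit)
plot(fit, pages = 1)   # plots the estimated component functions f_j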
Boosting approximates the regression function by an additive expansion of the form

f(x) = \sum_{t=1}^{T} \beta_t h(x; a_t),    (4)

where h(x; a) represents the "weak learner" and f(x) the weighted majority vote of the individual weak learners. Estimation of the parameters in (4) amounts to solving

\min_{\{\beta_t, a_t\}_{1}^{T}} \sum_{i=1}^{M} L\left(y_i, \sum_{t=1}^{T} \beta_t h(x_i; a_t)\right),    (5)

where L(y, f(x)) is the loss function chosen in (1) to define lack-of-fit. A "greedy" forward stagewise method solves (5) by sequentially fitting a single weak learner and adding it to the expansion of prior fitted terms. The corresponding solution values of each new fitted term are not readjusted as new terms are added into the model. This is outlined in Algorithm 2.

Algorithm 2. Forward Stagewise Additive Modeling
1: Initialize f_0(x) = 0
2: for t = 1 to T do
3:   Obtain estimates \beta_t and a_t by minimizing \sum_{i=1}^{M} L(y_i, f_{t-1}(x_i) + \beta h(x_i; a))
4:   Update f_t(x) = f_{t-1}(x) + \beta_t h(x; a_t)
5: end for
6: Output \hat{f}(x) = f_T(x)

If squared-error is used as the loss function, line 3 simplifies to

L(y_i, f_{t-1}(x_i) + \beta h(x_i; a)) = (y_i - f_{t-1}(x_i) - \beta h(x_i; a))^2 = (r_{it} - \beta h(x_i; a))^2,    (6)

where r_{it} is the residual of the ith observation at the current iteration. Thus, for squared-error loss, the term \beta_t h(x; a_t) fitted to the current residuals is added to the expansion in line 4. It is also fairly easy to show (Hastie et al., 2001) that the AdaBoost algorithm described in Section 3 is equivalent to forward stagewise modeling based on an exponential loss function of the form L(y, f(x)) = exp(-y f(x)).

5. Gradient boosting trees

Squared-error and exponential error are plausible loss functions commonly used for regression and classification problems, respectively. However, there may be situations in which other loss functions are more appropriate. For instance, binomial deviance is far more robust than exponential loss in noisy settings where the Bayes error rate is not close to zero, or in situations where the target classes are mislabeled. Similarly, the performance of squared-error significantly degrades for long-tailed error distributions or in the presence of "outliers" in the data. In such situations, other functions such as absolute error or Huber loss are more appropriate.

Under these alternative specifications for the loss function and for a particular weak learner, the solution to line 3 in Algorithm 2 is difficult to obtain. The gradient boosting algorithm solves the problem using a two-step procedure which can be applied to any differentiable loss function. The first step estimates a_t by fitting a weak learner h(x; a) to the negative gradient of the loss function (i.e., the "pseudo-residuals") using least-squares. In the second step, the optimal value of \beta_t is determined given h(x; a_t). The procedure is shown in Algorithm 3.

Algorithm 3. Gradient Boosting
1: Initialize f_0(x) to be a constant, f_0(x) = \arg\min_{\beta} \sum_{i=1}^{M} L(y_i, \beta)
2: for t = 1 to T do
3:   Compute the negative gradient as the working response
     r_i = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x) = f_{t-1}(x)}, i = 1, ..., M
4:   Fit a regression model to r_i by least-squares using the input x_i and get the estimate a_t of \beta h(x; a)
5:   Get the estimate \beta_t by minimizing \sum_{i=1}^{M} L(y_i, f_{t-1}(x_i) + \beta h(x_i; a_t))
6:   Update f_t(x) = f_{t-1}(x) + \beta_t h(x; a_t)
7: end for
8: Output \hat{f}(x) = f_T(x)

For squared-error loss, the negative gradient in line 3 is just the usual residuals, so in this case the algorithm reduces to standard least-squares boosting. With absolute error loss, the negative gradient is the sign of the residuals. Least-squares is used in line 4 independently of the chosen loss function.
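As a concrete illustration of Algorithm 3, the following R sketch implements the two-step procedure for squared-error loss, using single-split rpart trees as the weak learners h(x; a). The simulated data and all settings are assumptions chosen for illustration; this is not the code used in the paper.

library(rpart)

set.seed(1)
n   <- 1000
dat <- data.frame(x1 = runif(n), x2 = runif(n))
dat$y <- 2 * dat$x1 + sin(2 * pi * dat$x2) + rnorm(n, sd = 0.3)

T_iter <- 200
f_hat  <- rep(mean(dat$y), n)          # line 1: f_0(x) = argmin_b sum L(y_i, b) = mean(y)

for (t in seq_len(T_iter)) {
  dat$r <- dat$y - f_hat               # line 3: negative gradient = ordinary residuals for squared error
  tree  <- rpart(r ~ x1 + x2, data = dat,
                 control = rpart.control(maxdepth = 1, cp = 0, minsplit = 20))
  h     <- predict(tree, dat)          # line 4: least-squares fit of the weak learner
  beta  <- sum(dat$r * h) / sum(h^2)   # line 5: optimal multiplier (close to 1 for squared error)
  f_hat <- f_hat + beta * h            # line 6: update the expansion
}

mean((dat$y - f_hat)^2)                # training error after T iterations

For a general differentiable loss, line 5 becomes a one-dimensional optimization of the chosen loss rather than the closed-form expression used here.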
Although boosting is not restricted to trees, our work will focus on the case in which the weak learner is a "small" regression tree, since trees have proven to be a convenient representation for the weak learners h(x; a) in the context of boosting. In this specific case, the algorithm above is called gradient boosting trees, and the parameters a_t represent the split variables, their split values and the fitted values at each terminal node of the tree. Henceforth in this paper, the term "Gradient Boosting" will be used to denote gradient boosting trees.

6. Injecting randomness and regularization

Two additional ingredients to the gradient boosting algorithm were proposed by Friedman, namely regularization through shrinkage of the contributed weak learners (Friedman, 2001) and injecting randomness into the fitting process (Friedman, 2002).

The generalization performance of a statistical learning method is related to its prediction capabilities on independent test data. Fitting a model too closely to the training data can lead to poor generalization performance. Regularization methods are designed to prevent "overfitting" by placing restrictions on the parameters of the model. In the context of boosting, this translates into controlling the number of iterations T (i.e., trees) during the training process. An independent test sample or cross-validation can be used to select the optimal value of T. However, an alternative strategy has been shown to provide better results; it relates to scaling the contribution of each tree by a factor s ∈ (0, 1]. This implies changing line 6 in Algorithm 3 to

f_t(x) = f_{t-1}(x) + s \beta_t h(x; a_t)    (7)

The parameter s has the effect of retarding the learning rate of the series, so the series has to be longer to compensate for the shrinkage, but its accuracy is better. Lower values of s will produce a larger value of T for the same test error. Empirically, it has been shown that small shrinkage factors (s < 0.1) yield dramatic improvements over boosting series built with no shrinkage (s = 1). The trade-off is that a small shrinkage factor requires a higher number of iterations, and computational time increases.
A strategy for model selection often used in practice is to set the value of s as small as possible (i.e., between 0.01 and 0.001) and then choose T by early stopping.

The second modification introduced in the algorithm was to incorporate randomness as an integral part of the fitting procedure. This involves taking a simple random sample without replacement, usually of approximately half the size of the full training data set, at each iteration. This sample is then used to fit the weak learner (line 4 in Algorithm 3) and compute the model update for the current iteration. As a result of this randomization procedure, the variance of the individual weak learner estimates at each iteration increases, but there is less correlation between these estimates at different iterations. The net effect is a reduction in the variance of the combined model. In addition, this randomization procedure has the benefit of reducing the computational demand. For instance, taking half-samples reduces computation by almost 50%.
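Both ingredients of this section, shrinkage and random subsampling, together with cross-validated early stopping, are exposed as tuning parameters in the gbm package (Ridgeway, 2007). The sketch below shows one way they are typically combined; the simulated data and parameter values are illustrative assumptions, not the data or the exact settings used in the paper.

library(gbm)

set.seed(1)
n   <- 5000
dat <- data.frame(x1 = runif(n),
                  x2 = runif(n),
                  x3 = factor(sample(letters[1:4], n, replace = TRUE)))
dat$y <- rbinom(n, 1, plogis(-2 + 1.5 * dat$x1 - dat$x2))

fit <- gbm(y ~ x1 + x2 + x3,
           data              = dat,
           distribution      = "bernoulli",  # Bernoulli deviance, as for a frequency-type model
           n.trees           = 5000,         # a deliberately long boosting series
           shrinkage         = 0.01,         # the scaling factor s in Eq. (7)
           interaction.depth = 3,            # size of the individual trees
           bag.fraction      = 0.5,          # random half-samples at each iteration
           cv.folds          = 5)            # cross-validation used to choose T

best_T <- gbm.perf(fit, method = "cv")       # early stopping: T where the cv error stops decreasing
best_T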
7. Interpretation

Accuracy and interpretability are two fundamental objectives of predictive learning. However, these objectives do not always coincide. In contrast to other statistical learning methods providing comparable accuracy (e.g., neural networks and support vector machines), gradient boosting gives interpretable results. An important measure often useful for interpretation is the relative influence of the input variables on the output. For a single decision tree, Breiman, Friedman, Olshen, and Stone (1984) proposed the following measure as an approximation of the relative influence of a predictor x_j

\hat{I}_j^2 = \sum_{\text{all splits on } x_j} \hat{m}_s^2,    (8)

where \hat{m}_s^2 is the empirical improvement in squared-error as a result of using x_j as a splitting variable at the non-terminal node s. For Gradient Boosting, this relative influence measure is naturally extended by averaging (8) over the collection of trees.

Another important interpretation component is given by a visual representation of the partial dependence of the approximation \hat{f}(x) on a subset x_\ell of size \ell < p of the input vector x. The dependency of \hat{f}(x) on the remaining predictors x_c (i.e., x_\ell \cup x_c = x) must be conditioned out. This can be estimated based on the training data by

\hat{f}(x_\ell) = \frac{1}{M} \sum_{i=1}^{M} \hat{f}(x_\ell, x_{ic})    (9)

Note that this method requires predicting the response over the training sample for each set of the joint values of x_\ell, which can be computationally very demanding. However, for regression trees, a weighted traversal method (Friedman, 2001) can be used, from which \hat{f}(x_\ell) is computed using only the tree, without reference to the data itself.
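In practice, both the relative influence measure in (8) and the partial dependence functions in (9) can be extracted directly from a fitted gbm object, as sketched below. The simulated data and variable names are illustrative assumptions; note that summary.gbm rescales the influences to sum to 100, a slightly different normalization from the one used later in Fig. 2, where the top predictor is set to 100.

library(gbm)

set.seed(1)
n   <- 5000
dat <- data.frame(x1 = runif(n),
                  x2 = runif(n),
                  x3 = factor(sample(letters[1:4], n, replace = TRUE)))
dat$y <- rbinom(n, 1, plogis(-2 + 1.5 * dat$x1 - dat$x2))

fit <- gbm(y ~ x1 + x2 + x3, data = dat, distribution = "bernoulli",
           n.trees = 2000, shrinkage = 0.01, interaction.depth = 3,
           bag.fraction = 0.5)

summary(fit, n.trees = 2000)                      # relative influence of each predictor
plot(fit, i.var = "x1", n.trees = 2000)           # partial dependence on a single predictor
plot(fit, i.var = c("x1", "x2"), n.trees = 2000)  # joint partial dependence of two predictors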
8. Application to auto insurance loss cost modeling

8.1. The data

The data used for this analysis were extracted from a large database from a major Canadian insurer. It consists of policy and claim information at the individual vehicle level. There is one observation for each period of time during which the vehicle was exposed to the risk of having an at-fault collision accident. Mid-term changes and policy cancellations result in a corresponding reduction in the exposure period.

The data set includes 426,838 earned exposures (measured in vehicle-years) from Jan-06 to Jun-09, and 14,984 claims incurred during the same period of time, with losses based on best reserve estimates as of Dec-09. The input variables (for an overview, see Table 1) were measured at the start of the exposure period, and are represented by a collection of quantitative and qualitative attributes of the vehicle and the insured. The output is the actual loss cost, which is calculated as the ratio of the total amount of losses to the earned exposure. In practice, insurance legislation may restrict the usage of certain input variables to calculate insurance premiums. Although our analysis was developed assuming a free rating regulatory environment, the techniques described here can be applied independently of the limitations imposed by any specific legislation.

For statistical modeling purposes, we first partitioned the data into train (70%) and test (30%) data sets. The train set was used for model training and selection, and the test set to assess the predictive accuracy of the selected gradient boosting model against the Generalized Linear Model. To ensure that the estimated performance of the model, as measured on the test sample, is an accurate approximation of the expected performance on future "unseen" cases, the inception date of the policies in the test set is posterior to that of the policies used to build and select the model.

Loss cost is usually broken down into two components: claim frequency (calculated as the ratio of the number of claims to the earned exposure) and claim severity (calculated as the ratio of the total amount of losses to the number of claims). Some factors affect claim frequency and claim severity differently, and thus we considered them separately. For the claim frequency model, the target variable was coded as binary since only a few records had more than one claim during a given exposure period. The exposure period was treated as an offset variable in the model (i.e., a variable with a known parameter of 1).

The actual claim frequency measured on the entire sample is 3.51%. This represents an imbalanced or skewed class distribution for the target variable, with one class represented by a large sample (i.e., the non-claimants) and the other represented by only a few (i.e., the claimants). Classification of data with an imbalanced class distribution has posed a significant drawback for the performance attainable by most standard classifier algorithms, which assume a relatively balanced class distribution (Sun, Kamel, Wong, & Wang, 2007). These classifiers tend to output the simplest hypothesis which best fits the data and, as a result, classification rules that predict the small class tend to be fewer and weaker compared to those that predict the majority class. This may hinder the detection of claim predictors and eventually decrease the predictive accuracy of the model. To address this issue, we re-balanced the class distribution for the target in the frequency model by resampling the data space. Specifically, we under-sampled instances from the majority class to attain a 10% representation of claims in the train sample. The test sample was not modified and thus contains the original class distribution for the target. In econometrics, this sampling scheme is known as choice-based or endogenous stratified sampling (Green, 2000), and it is also popular in the computer science community (Chan & Stolfo, 1998; Estabrooks & Japkowicz, 2004). The "optimal" class distribution for the target variable based on under-sampling is generally dependent on the specific data set (Weiss & Provost, 2003), and it is usually considered as an additional tuning parameter to optimize based on the performance measured on a validation sample.

The estimation of a classification model from a balanced sample can be efficient but will overestimate the actual claim frequency. An appropriate statistical method is required to correct this bias, and several alternatives exist for that purpose. In this application, we used the method of prior correction, which fundamentally involves adjusting the predicted values based on the actual claim frequency in the population. This correction is described for the logit model in King and Zeng (2001), and the same method has been successfully used in a boosting application to predict customer churn (Lemmens & Croux, 2006).
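One common form of prior correction simply rescales the predicted odds from the re-balanced training sample back to the population claim frequency (equivalently, it shifts the model intercept by the log of the ratio of the two odds). The function below is a minimal sketch of this idea using the sample and population rates reported above; the function name and its use are illustrative assumptions, not the paper's code.

# adjust predicted probabilities from the 10% re-balanced training rate
# back to the 3.51% population claim frequency
prior_correct <- function(p_model, rate_sample = 0.10, rate_pop = 0.0351) {
  odds_model     <- p_model / (1 - p_model)
  odds_corrected <- odds_model * (rate_pop / (1 - rate_pop)) /
                                 (rate_sample / (1 - rate_sample))
  odds_corrected / (1 + odds_corrected)
}

prior_correct(0.12)   # e.g., a 12% predicted probability on the re-balanced scale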
Table 1. Overview of loss cost predictors.
8.2. Building the model

The first choice in building the model involves selecting an appropriate loss function L(y, f(x)) as in (1). Squared-error loss, \sum_{i=1}^{M} (y_i - f(x_i))^2, and Bernoulli deviance, -2 \sum_{i=1}^{M} [y_i f(x_i) - \log(1 + \exp(f(x_i)))], were used to define prediction error for the severity and frequency models, respectively. Then, it is necessary to select the shrinkage parameter s applied to each tree and the sub-sampling rate as defined in Section 6. The former was set at the fixed value of 0.001 and the latter at 50%. Next, the size of the individual trees S and the number of boosting iterations T (i.e., the number of trees) need to be selected. The size of the trees was selected by sequentially increasing the interaction depth of the tree, starting with an additive model (single-split regression trees), followed by two-way interactions, and up to six-way interactions. This was done in turn for the frequency and severity models. For each of these models, we ran 20,000 boosting iterations using the training data set.

A drawback of the under-sampling scheme described in Section 8.1 is that we may risk losing information from the majority class when it is under-sampled. To maximize the usage of the information available in the training data, the optimal value for the parameters S and T was chosen based on the smallest estimated prediction error using a K-fold cross-validation procedure with K = 10. This involves splitting the training data into K equal parts, fitting the model to K - 1 parts of the data, and then calculating the value of the prediction error on the kth part. This is done for k = 1, 2, ..., K and then the K estimated values for the prediction error are averaged. Using a three-way interaction gave the best results in both the frequency and severity models. Based on this level of interaction, Fig. 1 shows the train and cv-error as a function of the number of iterations for the severity model. The optimal value of T was set at the point at which the cv-error ceased to decrease.

Fig. 1. The relation between train and cross-validation error and the optimal number of boosting iterations (shown by the vertical green line).
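The following R sketch mirrors the tuning procedure just described: for each interaction depth, a long boosting series is run with 10-fold cross-validation, and the smallest cv error and the corresponding optimal number of trees are recorded. The simulated data are placeholders, and the settings (20,000 trees, shrinkage 0.001, 50% subsampling, depths 1 to 6) simply echo the ones stated above, so the loop is slow; it is an illustration of the search, not the paper's code.

library(gbm)

set.seed(1)
n     <- 5000
train <- data.frame(x1 = runif(n),
                    x2 = runif(n),
                    x3 = factor(sample(letters[1:4], n, replace = TRUE)))
train$claim <- rbinom(n, 1, plogis(-3 + 1.5 * train$x1 - train$x2))

depths  <- 1:6
results <- data.frame(depth = depths, best_T = NA, cv_error = NA)

for (d in depths) {
  fit <- gbm(claim ~ x1 + x2 + x3, data = train,
             distribution      = "bernoulli",
             n.trees           = 20000,
             interaction.depth = d,
             shrinkage         = 0.001,
             bag.fraction      = 0.5,
             cv.folds          = 10)
  best_T              <- gbm.perf(fit, method = "cv", plot.it = FALSE)
  results$best_T[d]   <- best_T
  results$cv_error[d] <- fit$cv.error[best_T]
}

results   # pick the interaction depth with the smallest cross-validated error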
The test data set was not used for model selection purposes, but to assess the generalization error of the final chosen model relative to the Generalized Linear Model approach. The latter model was estimated based on the same training data and using Binomial/Gamma distributions for the response variables in the Frequency/Severity models.

Fig. 2 shows the relative influence of the predictor variables for the frequency (left) and severity (right) models. Since these measures are relative, a value of 100 was assigned to the most important predictor and the others were scaled accordingly. There is a clear differential effect between the models. For instance, the number of years licensed of the principal operator of the vehicle is the most relevant predictor in the frequency model, followed by, among others, driving convictions and the age of the principal operator. For the severity model, the vehicle age is the most influential predictor, followed by the price of the vehicle and the horse power to weight ratio. Partial dependence plots offer additional insights into the way these variables affect the dependent variable in each model. Fig. 3 shows the partial dependence plots for the frequency model. The vertical scale is in the log odds and the hash marks at the base of each plot show the deciles of the distribution of the corresponding variable.
Fig. 2. Relative importance of the predictors for the Frequency (left) and Severity (right) models.

Fig. 3. Partial dependence plots for the frequency model (vertical axes: partial dependence in log odds; panels include DC2, PC7 and AC5).
The partial dependence of each predictor accounts for the average joint effect of the other predictors in the model.

Claim frequency has a nonmonotonic partial dependence on years licensed. It decreases over the main body of the data and increases near the end. The partial dependence on age initially decreases abruptly up to a value of approximately 30, followed by a long plateau up to 70, when it steeply increases. The variables vehicle age and postal code risk score have a roughly monotonically decreasing partial dependence.
Fig. 4. Partial dependence plots for the severity model (vertical axes: partial dependence; panels include PC3, DC2 and AC5).
The age of the vehicle is widely recognized as an important predictor in the frequency model (Brockman & Wright, 1992), since it is believed to be negatively associated with annual mileage. It is not a common practice to use annual mileage directly as an input in the model, due to the difficulty in obtaining a reliable estimate for this variable. Claim frequency is also estimated to increase with the number of driving convictions, and it is higher for vehicles with an occasional driver under 25 years of age.

Note that these plots are not necessarily smooth, since there is no smoothness constraint imposed on the fitting procedure. This is a consequence of using a tree-based model. If a smooth trend is observed, it is a result of the estimated nature of the dependence of the predictors on the response and is purely dictated by the data.

Fig. 4 shows the partial dependence plots for the severity model. The nature of the dependence on vehicle age and price of the vehicle is naturally due to the fact that newer and more expensive cars would cost more to repair in the event of a collision. The shape of these curves is fairly linear over the vast majority of the data. The variable horse power to weight ratio measures the actual performance of the vehicle's engine. The upward trend observed in the curve is anticipated, since drivers with high performance engines will generally drive at a higher speed compared to those with low performance engines. All the remaining variables have the expected partial dependence effect on claim severity.

An interesting relationship is given in Fig. 5, which shows the joint dependence between years licensed and horse power to weight ratio on claim severity. There appears to be an interaction effect between these two variables. Claim severity tends to be higher for low values of years licensed, but this relation tends to be much stronger for high values of horse power to weight ratio.

We next compare the predictive accuracy of Gradient Boosting (GB) against the conventional Generalized Linear Model (GLM) approach based on the test sample. This was done by calculating the ratio of the rate we would charge based on the GB model to the rate we would charge based on the GLM. Then we grouped the observations into five fairly equally sized buckets ranked by the ratio. Finally, for each bucket we calculated the GLM-loss ratio, defined as the ratio of the actual losses to the GLM predicted loss cost. Fig. 6 displays the results. Note that the GLM-loss ratio increases whenever the GB model would suggest to charge a higher rate relative to the GLM. The upward trend in the GLM-loss ratio curve indicates the higher predictive performance of GB relative to GLM.
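The comparison just described can be reproduced with a few lines of R: rank the test policies by the ratio of the GB rate to the GLM rate, cut them into quintiles, and compute the GLM-loss ratio within each bucket. The data frame and its columns below are simulated placeholders, not the insurer's test sample.

set.seed(1)
n    <- 10000
test <- data.frame(glm_loss_cost = rgamma(n, shape = 2, scale = 150))
test$gb_loss_cost <- test$glm_loss_cost * exp(rnorm(n, sd = 0.3))
test$actual_loss  <- rpois(n, lambda = 0.04) * rgamma(n, shape = 1, scale = 6000)

rate_ratio <- test$gb_loss_cost / test$glm_loss_cost
bucket     <- cut(rate_ratio,
                  breaks = quantile(rate_ratio, probs = seq(0, 1, by = 0.2)),
                  include.lowest = TRUE, labels = 1:5)

glm_loss_ratio <- tapply(test$actual_loss, bucket, sum) /
                  tapply(test$glm_loss_cost, bucket, sum)
glm_loss_ratio   # an upward trend across buckets favours the GB model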
Fig. 5. Partial dependence of claim severity on years licensed and horse power to weight ratio.
Fig. 6. GLM-loss ratio (actual losses/GLM predicted loss cost) and exposure count by bucket.
9. Discussion

In this paper, we described the theory of Gradient Boosting (GB) and its application to the analysis of auto insurance loss cost modeling. GB was presented as an additive model that sequentially fits a relatively simple function (weak learner) to the current residuals by least-squares. The most important practical steps in building a model using this methodology have been described. Estimating loss cost involves solving regression and classification problems with several challenges. The large number of categorical and numerical predictors, the presence of non-linearities in the data and the complex interactions among the inputs are often the norm. In addition, the data might not be clean and/or might contain missing values for some predictors. GB fits this data structure very well. First, based on the sample data used in this analysis, the level of accuracy in prediction was shown to be higher for GB relative to the conventional Generalized Linear Model approach. This is not surprising since GLMs are, in essence, relatively simple linear models and thus they are constrained by the class of functions they can approximate. Second, as opposed to other non-linear statistical learning methods such as neural networks and support vector machines, GB provides interpretable results via the relative influence of the input variables and their partial dependence plots. This is a critical aspect to consider in a business environment, where models usually must be approved by non-statistically trained decision makers who need to understand how the output from the "black-box" is being produced. Third, GB requires very little data preprocessing, which is one of the most time consuming activities in a data mining project. Lastly, model selection is done as an integral part of the GB procedure, and so it requires little "detective" work on the part of the analyst.

In short, Gradient Boosting is a good alternative method to Generalized Linear Models for building insurance loss cost models. The freely available package gbm implements gradient boosting methods under the R environment for statistical computing (Ridgeway, 2007).

Acknowledgments

I am deeply grateful to Matthew Buchalter and Charles Dugas for thoughtful discussions. Also special thanks to Greg Ridgeway for freely distributing the gbm software package in R. Comments are welcome.

References

Anderson, D., Feldblum, S., Modlin, C., Schirmacher, D., Schirmacher, E., & Thandi, N. (2007). A practitioner's guide to generalized linear models. Casualty Actuarial Society (CAS), Syllabus Year: 2010, Exam Number: 9, 1–116.
Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and regression trees. CRC Press.
Breiman, L. (2001). Statistical modeling: The two cultures. Statistical Science, 16, 199–231.
Brockman, M., & Wright, T. (1992). Statistical motor rating: Making effective use of your data. Journal of the Institute of Actuaries, 119, 457–543.
Chan, P., & Stolfo, S. (1998). Toward scalable learning with non-uniform class and cost distributions: A case study in credit card fraud detection. Proceedings of the International Conference on Knowledge Discovery and Data Mining, 4 (pp. 164–168).
Chapados, N., Bengio, Y., Vincent, P., Ghosn, J., Dugas, C., Takeuchi, I., et al. (2001). Estimating car insurance premia: A case study in high-dimensional data inference. University of Montreal, DIRO Technical Report, 1199.
Estabrooks, T., & Japkowicz, T. (2004). A multiple resampling method for learning from imbalanced data sets. Computational Intelligence, 20, 315–354.
Francis, L. (2001). Neural networks demystified. Casualty Actuarial Society Forum, Winter 2001, 252–319.
Freund, Y., & Schapire, R. (1996). Experiments with a new boosting algorithm. Proceedings of the International Conference on Machine Learning, 13 (pp. 148–156).
Friedman, J., Hastie, T., & Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting. The Annals of Statistics, 28, 337–407.
Friedman, J. (2001). Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29, 1189–1232.
Friedman, J. (2002). Stochastic gradient boosting. Computational Statistics & Data Analysis, 38, 367–378.
Green, W. (2000). Econometric analysis (4th ed.). Prentice-Hall.
Haberman, S., & Renshaw, A. (1996). Generalized linear models and actuarial science. Journal of the Royal Statistical Society, Series D, 45, 407–436.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning. Springer.
King, G., & Zeng, L. (2001). Explaining rare events in international relations. International Organization, 55, 693–715.
Kolyshkina, I., Wong, S., & Lim, S. (2004). Enhancing generalised linear models with data mining. Casualty Actuarial Society 2004, Discussion Paper Program.
Lemmens, A., & Croux, C. (2006). Bagging and boosting classification trees to predict churn. Journal of Marketing Research, 43, 276–286.
McCullagh, P., & Nelder, J. (1989). Generalized linear models (2nd ed.). Chapman and Hall.
Ridgeway, G. (2007). Generalized boosted models: A guide to the gbm package. Available from http://cran.r-project.org/web/packages/gbm/index.html.
Schapire, R. (1990). The strength of weak learnability. Machine Learning, 5, 197–227.
Sun, Y., Kamel, M., Wong, A., & Wang, Y. (2007). Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition, 40, 3358–3378.
Weiss, G., & Provost, F. (2003). Learning when training data are costly: The effect of class distribution on tree induction. Journal of Artificial Intelligence Research, 19, 315–354.