Development and Validation of Credit-Scoring Models1

July 9, 2008
1 Disclaimer: The statements made and views expressed herein are solely those of the authors and do not necessarily represent official policies, statements, or views of the Office of the Comptroller of the Currency or its staff. Acknowledgement: We are grateful to our colleagues for many helpful comments and discussions, and especially to Regina Villasmil, curator of the OCC/RAD consumer credit database. Please address comments and questions to Dennis Glennon, OCC, Risk Analysis Division, Third and E Streets, SW, Washington, DC 20219, email: [email protected].
2 U.S. Department of the Treasury, Office of the Comptroller of the Currency, Risk Analysis Division.
3 Cornell University, Departments of Economics and Statistical Sciences; U.S. Department of the Treasury, Office of the Comptroller of the Currency, Risk Analysis Division; and CREATES, University of Aarhus, Denmark.
4 Promontory Financial Group and ceriklarson.com
5 Texas A&M University, Department of Economics
A substantial literature addresses the issues that arise during the process of developing credit-scoring models. Bierman and Hausman (1970); Dirickx and Wakeman (1976); Srinivasan and Kim (1987); Thomas, Crook, and Edelman (1992); Thomas, Edelman, and Crook (2002); Hand (1997); and others outline the development of scorecards using a range of different mathematical and statistical techniques. A recent research conference with industrial, academic, and supervisory participants, sponsored by the Office of the Comptroller of the Currency (OCC), the primary supervisor of nationally chartered banks in the United States, had a full program of papers on specification and evaluation of credit-scoring models. This literature reflects substantial advances but not consensus on best practices in credit scoring.
In this paper, we demonstrate a range of techniques commonly employed by practitioners to build and validate credit-scoring models using the OCC Risk Analysis Division (OCC/RAD) consumer credit database (CCDB). We compare the models with each other and with a commercially developed generic bureau-based credit score. Our model development process illustrates several aspects of common industry practices. We provide a framework in which to compare and contrast alternative modeling approaches, and we demonstrate the strengths and weaknesses of alternative modeling techniques commonly used to develop a scoring model. We focus on a limited number of sample and modeling issues that typically arise during the model-development process and that are likely to have significant impacts on the accuracy and reliability of a model.1 Specifically, we find that accuracy in predicting default probabilities can deteriorate substantially as forecasts move away from the development time frame. We attribute this at least in part to the differential effects of changing macroeconomic conditions on the different credit categories. Higher-default-risk groups are considerably more affected by small changes in the economy than low-default-risk groups. This finding points out robustness issues that can guide future research and applications. On the other hand, although the accuracy
1 There are other legitimate ways of addressing issues of sample design, model selection, and validation beyond those outlined below. Moreover, we believe newer and better techniques continue to be developed in the statistical and econometric literature. For those reasons, we emphasize that there are alternatives to the processes outlined below that can and, under certain circumstances, should be used as part of a well-developed and comprehensive model development process.
deteriorates, the ranking or separation quality is largely maintained. The models remain useful, but their weaknesses must be recognized.
One significant objective of our work is to illustrate aspects of model validation that can, and we believe should, be employed at the time of model development. Model validation is a process comprising three general types of activities: (1) the collection of evidence in support of the model's design, estimation, and evaluation at the time of development; (2) the establishment of on-going monitoring and benchmarking methods by which to evaluate model performance during implementation and use; and (3) the evaluation of a model's performance utilizing outcomes-based measures and the establishment of feedback processes which ensure that unexpected performance is acted upon. The focus of this paper is on the first of these activities: the compilation of developmental evidence in support of a model. However, as a natural part of the model development process, which involves benchmarking alternative models and identifying appropriate outcomes-based measures of performance, we do touch upon some of the post-development validation activities noted in (2) and (3). Finally, we show that there are limitations to the application of a model developed using a static sample design as a risk measurement tool. A model that performs well at ranking the population by expected performance may still perform poorly at generating the valid default probabilities required for pricing and profitability analysis.
Section 2 describes the data development process employed to create the OCC/RAD consumer credit database. The CCDB is unique in many ways. It contains both tradeline (account) and summary information for individuals obtained from a recognized national credit bureau, and it is sufficiently large to allow us to construct both a holdout sample drawn from the population at the time of development and several out-of-sample and out-of-time validation samples. The database also allows one to observe the longitudinal performance of individual borrowers and individual accounts; however, models exploiting this type of dynamic structure generally have not been developed or used by lenders and other practitioners. Such dynamic models are consequently not within the scope of this paper.
Section 3 outlines the methods used to specify and estimate our suite of models
and the calibration process used to construct our scores. Section 4 describes methods
that we employ to benchmark and compare the performance of the scores within the
development sample and in various validation samples from periods subsequent to
that of the development sample. Section 5 summarizes our findings.
Most financial institutions that purchase research samples of credit bureau data do so in order to analyze and build models that describe the credit behavior of their current or likely future customers. In these cases, the sample design might be limited to selecting a sample of the bank's current or prior customers, or alternatively to selecting a sample of individuals with a generic credit score greater than some pre-specified value (under the assumption that future customers will look like those from the past). In contrast, large-scale developers of generic bureau-based credit scores are interested in having these scoring tools robustly predict performance for a broad spectrum of the consumer credit-using population and consequently will want a broader, more nationally representative sample on which to base their work. In many ways, the design of the CCDB and the development of the models in this paper more closely parallel that of the latter group.
2.3 Temporal Coverage
Sample designs differ in their breadth and unit of analysis and in terms of their temporal coverage. Common modeling practice in the development of credit-scoring tools has historically utilized cross-sectional sampling designs, in which a selection of consumer credit histories is observed at time t, and payment behavior is tracked over k future time periods (k is typically defined as 24 months). Scoring models are developed to predict performance over the interval [t, t + k] as a function of characteristics observed at time t.

In contrast, the study of the dynamic behavior of credit quality requires observations over multiple periods of time for a fixed set of analysis units that have been sampled in a base year (i.e., a longitudinal or panel data design). In both instances, data have to be extracted with sufficient detail to allow the tracking of performance, balances, line increases, etc., by tradeline (i.e., by lender) for each unit over time.
Under a longitudinal sample design, annual extracts represent updated (or refreshed) observations for each of the observations in the sample. To facilitate the objectives of illustrating existing cross-sectional methods and allowing for experimentation with longitudinal-based analysis, the CCDB has a unique structure. The database has been constructed so as to incorporate a "rolling" set of panels, as well as an annual sequence of random cross-sectional samples. Rather than simply identifying a base-period sample and then tracking the same individuals through time, as might be the case in a classic panel, the CCDB seeks to maintain the representative nature of the longitudinal data by introducing supplemental, parallel-structure individuals at various points in time, and by developing weights relating the panel to the population at any point in time. Further details are presented in the following sections.
The initial sample consists of 1,000,000 randomly selected individual credit reports as of June 30, 1999. Nine hundred fifty thousand of these individuals were randomly sampled from the sub-population of individuals for whom the value of a
generic, bureau-based score (GBS) could be computed (the scoreable population), while 50,000 individuals were sampled from the unscoreable population. The allocation of the sample between scoreable and unscoreable populations was chosen in order to track some initially unscoreable observations longitudinally through subsequent time periods. Because the unscoreable segment represents roughly 25 percent of the credit bureau population, a purely random sampling from the main credit bureau database would have yielded too many unscoreable individuals.2
Given the required cross-sectional size and the need to observe future performance when developing a model, it was also determined that the sample should include performance information through June 30, 2004, the terminal date of our data set. The 1,000,000 observations from the June 30, 1999 sample make up the initial "core" set of observations under our panel data design. The panel is constructed by updating the credit profile of each observation in the core on June 30th of each subsequent year. In Figure 1 we illustrate the general sampling and matching strategy using the 1999 and 2000 data; counts of sampled and matched individuals are presented in Tables 1 and 2.
In general, the match rate from one year's sample to the following year's bureau master file is high. Some of the scoreable individuals sampled in 1999 became unscoreable in 2000, again due to death or inactivity, and some of the previously unscoreable became scoreable in 2000 (for instance, if they had acquired enough credit history). Of the 1,000,000 individuals sampled in 1999, 949,790 individuals were found to be scoreable as of June 30, 2000. As indicated in Table 2, this change resulted from 17,339 individuals moving from scoreable to unscoreable or missing, while 17,129 individuals moved from unscoreable to scoreable.
Over time, the credit quality of a fixed sample of observations (i.e., the core) is likely to diverge from that of a growing population. For that reason, we update the core each year by sampling additional individuals from the general population and
2 Unscoreable individuals include those who are deceased or who have only public records or very thin credit tradeline experience.
then developing "rebalanced" sampling weights which allow for comparison between the updated core and the current population. For example, we update the core in 2000 by comparing the GBS distribution of the 949,790 individuals from the 1999-2000 matched sample (tabulated using 10-point score buckets from 300 to 900, the range of the GBS) to a similarly constructed GBS distribution for an additional 950,000 individuals randomly sampled from the credit bureau's master file as of June 30, 2000. The relative difference in frequency by bucket between the two distributions was then used to identify the size of an "update sample" of individuals to add to the 1999-2000 matched sample. The minimum of these bucket-level frequency changes (i.e., the maximum rate of decrease in relative frequency) was then used as a sampling proportion to determine the number of additional individuals that would be randomly sampled from the June 30, 2000, scoreable population and added to the core data set (i.e., the 1999-2000 matched file). For 2000, the "updating proportion" was determined to be 7 percent, resulting in the addition of 66,500 individuals from the 2000 scoreable population to the 1999-2000 matched scoreable sample on the CCDB. Use of this updating strategy ensures that the precision with which one might estimate characteristics at the GBS bucket level in a given year does not diminish due to drift in the credit quality of those individuals sampled in earlier years.
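The rebalancing step lends itself to a short sketch. The code below is our reading of the procedure, not the authors' code: treating the relative difference as (panel − fresh)/fresh is an assumption, since the exact formula is not given in the text.

```python
import numpy as np

def updating_proportion(panel_scores, fresh_scores, lo=300, hi=900, width=10):
    """Sketch of the CCDB rebalancing step: compare the GBS distribution of
    the matched panel with that of a fresh random cross section over
    10-point buckets, and return the largest relative decrease in frequency,
    which the text uses as the sampling proportion for the update sample."""
    edges = np.arange(lo, hi + width, width)
    c_panel, _ = np.histogram(panel_scores, bins=edges)
    c_fresh, _ = np.histogram(fresh_scores, bins=edges)
    f_panel = c_panel / c_panel.sum()
    f_fresh = c_fresh / c_fresh.sum()
    # Relative change by bucket (assumed form); its minimum is the largest
    # proportional shortfall of the panel relative to the fresh sample.
    rel_change = (f_panel - f_fresh) / np.where(f_fresh > 0, f_fresh, np.nan)
    return -np.nanmin(rel_change)

# In the 2000 update the proportion came out at 7 percent, and
# 0.07 * 950,000 = 66,500 individuals were added.
```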
Sampling for years 2001-2004 proceeded along similar lines, with the results reported again in Tables 1 and 2. The individuals who were members of the CCDB panel in a previous year (i.e., the core) were matched to a current year's master file. Individuals who were unmatched, or who remained or became unscoreable in the current year, were dropped from the CCDB panel and then replaced with another draw of 50,000 unscoreable individuals from the current year's master file. The GBS distribution from the panel was compared with that for a random cross section of individuals drawn from the current master file, and an "updating proportion" was determined and applied to define an additional fraction of the random cross section to add to and complete the current-year CCDB panel.
3 Scorecard Development
3.1 Defining Performance and Identifying Risk Drivers
We follow industry-accepted practices to generate a comprehensive risk profile for each individual. We use as a starting point the five broadly defined categories outlined in Fair-Isaac (2006). We summarize our own examples of possible credit bureau variables that fall within each category and which are obtainable from our data set; these are presented in Table 3.
Scorecard development attempts to build a segmentation or index that can be used to classify agents into two or more distinct groups. Econometric methods for the modeling of limited dependent variables and statistical classification methods are therefore commonly applied. In order to implement these types of models using the type of credit information available from bureaus, it is necessary to define a performance outcome; this is usually, but not necessarily, dichotomous, with classes generally distinguishing between "good" and "bad" credit histories based upon some measure of performance.
In this paper, we choose to classify and develop a predictive model for performance of good and bad credits based upon their "default" experience. Bad outcomes correspond to individuals who experience a "default" and "good" outcomes to individuals who do not. It is our convention to assign a default if an individual becomes 90 days past due (DPD), or worse, on at least one bankcard over a 24-month performance period (for example, July 1999 through June 2001). Although regulatory rules require banks to charge off credit card loans at 180 DPD, it is not uncommon among practitioners to use our more conservative definition of default (90+ DPD). We experimented with definitions of default based on both 12- and 18-month performance periods. The results of our analysis are fundamentally the same under the alternative definitions of default.
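As a minimal sketch of this outcome definition (the column names and long data layout are hypothetical; the text does not specify one), the function below flags an individual as bad if any bankcard reaches 90+ DPD within the 24-month window:

```python
import pandas as pd

def label_defaults(perf: pd.DataFrame) -> pd.Series:
    """Bad = 90+ DPD, or worse, on at least one bankcard over a 24-month
    performance window.  Illustrative long-format columns:
      id       - individual identifier
      bankcard - True for bankcard tradelines
      month    - months after the observation date (1..24)
      dpd      - days past due reported that month
    Returns a boolean Series indexed by individual id."""
    window = perf[perf["bankcard"] & perf["month"].between(1, 24)]
    return (window["dpd"] >= 90).groupby(window["id"]).any()
```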
3.2 Construction of the Development and Hold-Out (In-Time Validation) Samples
We develop our model using a conventional scorecard sample design. The refinement process that was applied to the CCDB and that resulted in the development samples is presented in Figure 2. A randomly selected, cross-sectional sample of 995,251 individual credit files with valid tradeline data, representing over 14.5 million tradelines, is drawn from the CCDB database as of June 30, 1999. The sample includes 733,820 individuals with at least one open bankcard line of credit that had been updated during the January through June 1999 time period.3 We drop 19,122 files with a bankcard currently 90+ DPD, choosing to model the performance of accounts that are no worse than 60 DPD at the time of model development. A separate model for accounts that are currently seriously delinquent (i.e., greater than 60 DPD) could be developed (although we do not attempt to develop such a model in this paper). An additional 37,436 accounts are deleted because their future performance could not be reliably observed in our panel, leaving us with a sample of bankcard credit performance on 677,262 individual credit records. We split this group randomly into two samples of approximately equal size and then develop our suite of models using a sample of 338,578 individual credit histories. The remaining 338,684 individuals are used as a holdout sample for (within-period) validation purposes.
To allow for the more parsimonious modeling of different risk factors (i.e., characteristics), and possibly different effects of common risk drivers, it is standard practice in the industry to segment (or split) the sample prior to model development. We have implemented a common segmentation by introducing splits based upon the amount of credit experience and the amount, if any, of prior delinquency. Credit files that contain no history of delinquencies are defined as clean, and those with a history of one or more delinquencies are defined as dirty.4 Because individuals with
3 A bankcard tradeline is defined as a credit card or other revolving credit account with variable terms issued by a commercial bank, industrial bank, co-op bank, credit union, savings and loan company, or finance company.
4 We define an observation as dirty if the individual has ever had a delinquency greater than 30 DPD, a public record, or collections proceedings against him or her.
little or no credit experience are expected to perform differently from those with more experience and thicker files, we create additional segments within the clean group made up of individuals with thin credit files (fewer than 3 tradelines) or thick credit files (3 or more tradelines). Similarly, we create two segments within the dirty group consisting of individuals with no current delinquency and individuals with mild current delinquency (no worse than 60 DPD). Consequently, we identify four mutually exclusive segments: clean/thick, clean/thin, dirty/current, and dirty/delinquent.
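A compact sketch of this four-way split, using the definitions above (dirty = a delinquency over 30 DPD ever, a public record, or collections; thin = fewer than 3 tradelines); the inputs are illustrative summary attributes of our own choosing:

```python
def assign_segment(ever_30dpd: bool, public_record: bool, collections: bool,
                   n_tradelines: int, current_dpd: int) -> str:
    """Assign one of the four mutually exclusive scorecard segments.
    Individuals currently worse than 60 DPD were already dropped."""
    dirty = ever_30dpd or public_record or collections
    if dirty:
        return "dirty/delinquent" if current_dpd > 0 else "dirty/current"
    return "clean/thin" if n_tradelines < 3 else "clean/thick"
```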
In Figure 2 we report the number of individuals and the average default rate in each of the segments. The development sample has an average default rate of 7.19 percent. The clean and dirty segments have default rates of 3.1 percent and 20.3 percent, respectively. Our objective is to model the likelihood of default (i.e., 90+ DPD) for each segment using credit bureau information only.
nonparametric, we retain the assumption that the link function is the same across segments. That is, we retain the assumption that there is a common relationship between the value of the index and the default probability, though we no longer require the logistic functional form. We experiment with further generalizations to different link functions across segments; however, these generalizations are not especially productive, especially for the segments with smaller sample sizes. Finally, we compare these two regression forms with a fully nonparametric model developed using a decision-tree approach. This can be thought of as a further generalization in which both the index and the link are estimated nonparametrically.
The parametric model takes the logistic form

\[ p_i = E(y_i \mid x_i) = \frac{1}{1 + \exp(-x_i'\beta)} \quad \text{for each individual } i, \tag{1} \]

and the associated log-odds are

\[ Z = \ln\!\left(\frac{\hat{p}}{1 - \hat{p}}\right). \tag{3} \]
The semiparametric models use the estimated (parametric) index function to partition the sample into relative-risk segments. We rank the sample by the estimated index from the logistic regression and then estimate the link function nonparametrically. Specifically, for this model the estimates of the default rate are equal to the
empirically observed default rate within each segment.
We follow current industry practice and partition the sample into discrete segments, chosen so that each band contains the same number of observations, m. Given the sample size, we create 30 distinct segments. For each segment, the predicted probability of default is given by

\[ \hat{p}_i = \bar{y}_{J_i}, \tag{4} \]

where

\[ \bar{y}_{J_i} = \frac{\sum_{k=1}^{n} y_k \, 1\{J_k = J_i\}}{\sum_{k=1}^{n} 1\{J_k = J_i\}} \tag{5} \]

and $J_i \in \{1, \ldots, 30\}$ denotes the segment to which individual $i$ belongs.
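The two model forms can be sketched in a few lines. Below, a logistic regression supplies the index of equation (1); the semiparametric variant then replaces the logistic link with the empirical default rate within 30 equal-size bands of the estimated index, as in equations (4) and (5). This is an illustration of the approach, not the production code used in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_semiparametric(X, y, n_bands=30):
    """Parametric index (eq. 1) via logistic regression, then a
    nonparametric link: the empirical default rate within each of
    n_bands equal-size bands of the estimated index (eqs. 4 and 5).
    X is a 2-D NumPy array of attributes; y is a 0/1 default array."""
    logit = LogisticRegression(max_iter=1000).fit(X, y)
    index = X @ logit.coef_.ravel() + logit.intercept_[0]
    # Cut the ranked index into bands with (roughly) equal counts.
    cuts = np.quantile(index, np.linspace(0, 1, n_bands + 1)[1:-1])
    bands = np.searchsorted(cuts, index)
    band_rate = np.array([y[bands == j].mean() for j in range(n_bands)])
    return logit, band_rate[bands]   # eq. (4): p_hat_i = band default rate
```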
we repeat the resampling and stepwise regressor selection k times and choose the variables that appear most often in the k replications (variables that occurred in 10 or more of the replications). We use k = 20 and experiment with values of ω ∈ {20 percent, 50 percent, 100 percent}. After some experimentation, we use the results from the 50 percent trial. We applied the Stepwise and Resampling methods separately to each segment.
Finally, we define the Intersection method as the variable selection resulting from taking the common set of covariates that appear under the Stepwise and Resampling methods. The Stepwise selection approach generates the largest, and the Intersection approach the smallest, set of covariates.
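A sketch of the Resampling and Intersection selections follows; `stepwise_select` is a hypothetical helper (any stepwise routine returning a set of variable names), and drawing the subsamples with replacement is our assumption, since the text does not say whether the ω = 100 percent draws are bootstrap draws.

```python
from collections import Counter
import numpy as np

def resampling_select(df, y_col, stepwise_select, k=20, omega=0.5,
                      min_hits=10, seed=0):
    """Run the (hypothetical) stepwise_select helper on k random subsamples
    of size omega * n, drawn with replacement, and keep the variables
    selected in at least min_hits of the k replications."""
    rng = np.random.default_rng(seed)
    hits = Counter()
    for _ in range(k):
        sub = df.sample(frac=omega, replace=True,
                        random_state=int(rng.integers(2**32)))
        hits.update(stepwise_select(sub.drop(columns=[y_col]), sub[y_col]))
    return {v for v, c in hits.items() if c >= min_hits}

# Intersection method: variables common to the full-sample Stepwise set and
# the Resampling set, e.g.  intersection_vars = stepwise_vars & resampling_vars
```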
The fully nonparametric model form does not assume a functional form for the covariates. To implement our nonparametric specification, we use a tree method called CHAID (Chi-squared Automatic Interaction Detector) to cluster the data into multiple "nodes" by individual characteristics (attributes). The variable selection process searches by sequential subdivision for a grouping of the data giving maximal discrimination, subject to limitations on the sizes of the groups (avoiding the best-fit solution of one group per data point). The approach is due to Kass (1980).6 The CHAID approach splits the data sequentially by performing consecutive Chi-square tests on all possible splits. It accepts the best split. If all possible splits are rejected, or if a minimum group size limit is reached, it stops. Each of the final nodes is assigned predictions equal to the empirical default probability, $\hat{p}_n$, for node n. By design of the algorithm, individuals within a node are chosen to be as homogeneous as possible, while individuals in different nodes are as heterogeneous as possible (in terms of $\hat{p}_n$), resulting in maximum discrimination. Note that the splitting of the development sample into the four segments that preceded construction of the parametric and semiparametric models was not undertaken prior to implementing the CHAID algorithm.
6 Various refinements have been made to Kass's original specification; we implement CHAID using the SAS macro %TREEDISC (SAS (1995)).
For the CHAID method we have to specify (1) the candidate variable list, (2) the transformation of continuous variables into discrete variables, and (3) the minimum size of the final nodes. We considered two different sets of candidate variables. Initially, we considered all available attributes and kept only those that generated at least one split. As an alternative, we used only those attributes that were identified using the Intersection method for variable selection outlined above. In the latter case, for each model segment (i.e., clean/thick, clean/thin, dirty/current, and dirty/delinquent), we take the intersection of the variables from the stepwise selection process with the variables appearing 10+ times in the 20 percent, 50 percent, and 100 percent Resampling methods, then combine the selected variables across the model segments by taking the union of those sets of variables.
As the CHAID approach considers all possible splits, it requires the splitting of continuous variables into discrete ranges. We chose the common and practical approach of constructing dummy variables to represent each quartile of each continuous variable. As a validity check on this procedure, we also split the continuous variables into 200 bins. (Note that this process includes all intermediate splits from 4 to 199 bins as special cases.)
To prevent nodes from having too few observations or having only one kind of account (good or bad), we set the minimum number of observations in a node to be 1,000. CHAID rejects a split if it produces a node smaller than 1,000. The size of the final nodes therefore works as a stopping rule for CHAID. Since this specification is rather arbitrary, we experiment with different node sizes ranging from 100 to 8,000 observations.
methods, respectively. The worst status for open bankcards within the last six months, the total number of tradelines with 30+ DPD, and the total number of tradelines in good standing are "individual credit history" variables that consistently show up as important explanatory variables. Utilization rates for bankcards and for revolving accounts are the more important "amounts-owed" variables. The age of the oldest bankcard tradeline enters as a relevant measure of the "length of credit history," and "new credit activity" is measured using the total number of inquiries within the last 12 months and the total number of bankcard accounts opened within the last two years. Finally, the total number of active revolving tradelines was an important explanatory variable capturing the impact of the "type of credit used." It is clear from our results that a fairly small set of variables suffices to capture almost all of the available explanatory power. In Table 8, we report the set of "splitting" variables identified under the CHAID selection method, again sorted by variable type.
1. A score of 700 corresponds to good:bad odds of 20:1, i.e.,

\[ \frac{1 - \hat{p}_{700}}{\hat{p}_{700}} = 20; \tag{6} \]

2. Every 20-unit increase in S doubles the odds ratio. The score values, S, are calibrated using the affine transformation

\[ S = 700 - \frac{20}{\ln 2}\left(Z + \ln 20\right), \tag{7} \]

where Z is as given in equation (3). We calculate eight different RAD scores, one from each of the three parametric (Stepwise, Resampling, and Intersection), three semiparametric (Stepwise, Resampling, and Intersection), and two nonparametric (all variables, and Intersection) models.
We also recalibrate the GBS so as to allow for comparison with the RAD scores. Since we cannot observe the predicted $\hat{p}$ associated with the GBS, we estimate it through a linear regression of the empirical log-odds in our sample against the score values. Data for the regression consist of empirical log-odds estimated for 20 different buckets of individuals sorted by the GBS, together with the associated bucket-mean bureau score values.
The Hosmer-Lemeshow statistic is computed as

\[ HL = \sum_{j=1}^{10} \frac{(\bar{p}_j - \hat{p}_j)^2}{\hat{p}_j(1 - \hat{p}_j)/n_j}, \tag{8} \]

where $n_j$ is the number of observations, $\hat{p}_j$ the mean predicted default probability, and $\bar{p}_j$ the observed default rate in each of the $j = 1, \ldots, 10$ deciles. The H-L statistic is distributed as Chi-square with 8 degrees of freedom under the null that $\hat{p}_j = \bar{p}_j$ for all $j$. To be clear, a good model should have a high value of the separation measure, K-S, but a low value of the accuracy measure, H-L. It would perhaps be better to label H-L an "inaccuracy" measure, as it is a Chi-squared measure of fit, but the contrary convention is long established.
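Both statistics are easy to compute; the sketch below is our own, with K-S reported on the 0-100 scale used in the tables and H-L taken over equal-size deciles of the predicted probability.

```python
import numpy as np
from scipy.stats import chi2, ks_2samp

def ks_separation(score, bad):
    """K-S separation: maximum distance between the score CDFs of the
    bad and good accounts, on a 0-100 scale."""
    return 100 * ks_2samp(score[bad == 1], score[bad == 0]).statistic

def hl_accuracy(p_hat, bad, n_deciles=10):
    """Hosmer-Lemeshow statistic of eq. (8) over deciles of the predicted
    default probability, with its chi-square(8) p-value."""
    hl = 0.0
    for d in np.array_split(np.argsort(p_hat), n_deciles):
        pred, obs, nj = p_hat[d].mean(), bad[d].mean(), len(d)
        hl += (obs - pred) ** 2 / (pred * (1 - pred) / nj)
    return hl, chi2.sf(hl, df=n_deciles - 2)
```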
model form) as stand-alone models. Table 11 shows K-S and H-L measures from the parametric and semiparametric models for each segment.7 Individually, the segment-specific models perform well at differentiating between good and bad accounts. As is commonly observed in practice, credit bureau-based models perform better on the clean-history segments of the population, as reflected in the nearly 20-point difference in the K-S values between the clean-history and dirty-history segments across model forms and variable selection procedures. It is interesting to note, however, that the parametric models are relatively accurate on the development and in-sample, hold-out data except for the clean-history/thick-file segment. The latter result is likely driving the accuracy results in Table 10, given the relative size of the clean-history/thick-file segment.8 These results clearly show that a model can perform well at discriminating between good and bad accounts (i.e., high K-S value), yet perform poorly at generating accurate estimates of the default probabilities, a result that illustrates the importance of considering model purpose (i.e., discrimination or prediction) in the development and selection of a credit-scoring model.
The K-S test evaluates separation at a specific point of the full distribution of outcomes. In Figures 3 and 4, we plot the Gains charts for each of the models. The Gains charts describe the separation ability graphically by showing the CDF for observations with "bad" outcomes plotted against the CDF for all sample observations (the 45-degree line serves as a benchmark representing no separation power). The parametric and semiparametric models and the GBS produce very similar graphs, while the CHAID models show much weaker discriminatory power.
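A Gains chart of this kind takes only a few lines to produce; the sketch below is our own illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

def gains_chart(score, bad, label):
    """Plot the CDF of bads against the CDF of all observations, with
    observations sorted from low to high score."""
    order = np.argsort(score)
    cdf_all = np.arange(1, len(score) + 1) / len(score)
    cdf_bad = np.cumsum(bad[order]) / bad.sum()
    plt.plot(cdf_all, cdf_bad, label=label)

# 45-degree benchmark: plt.plot([0, 1], [0, 1], "k--", label="No Separation")
```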
In Figures 5 and 6, we plot the empirical log odds by RAD score for each model, for the development and hold-out samples, respectively. We compare the empirical
7 By design, the actual performance within each decile (i.e., score band) from the development sample is used to generate the predicted values under the semiparametric method. For that reason (as noted above), the H-L test is not well designed for evaluating the accuracy of the semiparametric models. Therefore, we use the actual performance (i.e., default rate) derived from the pooled-segment analysis summarized in Table 10 as the predicted values in the calculation of the H-L values for each of the semiparametric models in Table 11.
8 The more accurate model results for the semiparametric model on the development and in-sample, hold-out data are likely to be due to the construction of the tests and, therefore, must be interpreted carefully. Clearly an out-of-sample test will better reflect the true accuracy of the models constructed using this approach.
log odds for each model against calibrated target values. The calibration target line is given in eq. (7). The graphs show that the models perform relatively well for score values below 750. Although the semiparametric models and the CHAID do not generate estimates for scores below 600, due to the smoothing nature of the models, we point out that the parametric model continues to perform well on the score range below 600. For scores between 760 and 780, the parametric and semiparametric models slightly overestimate default risk. For the score range over 800, the RAD models underestimate the default rates. These results suggest that the lack of overall accuracy of the model is driven primarily by the imprecision of the estimates at the higher end (i.e., greater than 750) of the distribution: that portion of the score distribution is, based on the median scores reported in Table 9, heavily populated by observations from the clean-history (both thick and thin) segments.
It is worth noting that the Resampling and Intersection models generate very similar levels of separation and accuracy using fewer covariates. Those results hold across both the development and hold-out samples. The decision-tree approach (i.e., the CHAID method), however, clearly generates models with lower discriminatory power. That result is reflected in the five-point difference in K-S values between the Stepwise parametric model and the Intersection CHAID model in Table 10. It is difficult, however, to interpret the meaning of that result. Instead, we look to the relationship between the Gains charts in Figure 3. The Gains chart for the Stepwise parametric model is above the Gains chart for the CHAID-Intersection model. As a result, at each point on the horizontal axis, the Stepwise parametric model identifies a greater percentage of the bad distribution. For example, over the bottom 10 percent of the score distribution, the Stepwise parametric model identifies roughly 60 percent of the bad accounts, while the CHAID-Intersection identifies only approximately 48 percent. At that point, the Stepwise model identifies nearly 25 percent (i.e., 12/48) more bad accounts, a substantial increase over the CHAID model.
We have estimated the models using the widely, but not universally, accepted 90+ DPD definition for the outcome variable. It is interesting to ask whether the model would also do well at discriminating between good and bad accounts if default is defined at 60+ DPD, or evaluated over a shorter performance horizon (e.g., 18, 12, or 6 months). In Table 12, we summarize the observed performance over these alternative definitions of performance. A reliable model should order individuals by credit quality over a variety of bad definitions. In Table 13 we compare the K-S measures from eight different RAD models and the GBS, using both the 90+ DPD and 60+ DPD bad definitions. We find that a model's ability to differentiate between good and bad accounts is virtually the same, as reflected by the K-S values across the development and holdout samples for all methods. As expected, the models perform better under the 90+ DPD definition. Nevertheless, the models seem to order observations well by credit quality under the alternative definitions. This topic is revisited below.
in the out-of-time validation samples, indicating a general lack of statistical fit for predictive purposes. None of the scoring models developed using conventional industry practices generated accurate predictions over time, even though all the models maintained their ability to differentiate between good and bad accounts. These conclusions are supported by the out-of-sample results in Table 17. For each segment, the K-S values remained relatively constant, or improved, over time; however, in all cases, the H-L statistics increased significantly. The significant increase in the H-L values across all model segments in Table 17 suggests that our simple cross-section model is under-specified relative to the factors that reflect changes in the economic environment over time.
As an additional test of the nonparametric approach, we reran the CHAID model with continuous variables discretized to 200 values and compared the performance to the CHAID model based on quartiles. The CHAID based on all variables did substantially worse in terms of model accuracy in the out-of-time validation samples. The CHAID based on the Intersection selection performed about the same with 200 values as with quartiles for the 2000 and 2001 samples, but substantially worse in 2002 in terms of model accuracy. Thus, there seems to be no real benefit from adding splits beyond quartiles for our continuous variables.
Figure 12 compares the empirical log-odds by the different RAD scores for the 2002 validation samples. The plot clearly shows a deterioration in predictive accuracy over the score range 650-750: actual performance is worse than predicted, and the RAD scores underestimate the default rates. The results for other years were very similar to 2002 and are not shown here.
Overall, the out-of-sample analyses show that the separation power of the models is relatively stable over time; however, model accuracy decreases substantially. This result, combined with the observed increase in the average default rate over the full sample period for all but the clean/thin segment (Table 14), implies that the models estimated on a cross section of data from 1999 will underpredict defaults over future periods. Moreover, it suggests that when the defaults are disaggregated into buckets, the higher-default buckets will tend to be underpredicted more than the low-default buckets, a result observed in Figure 12. These results imply that
models aimed at accuracy should be frequently updated, or that dynamic models,
with some dependence on macroeconomic conditions, should be considered.
Figure 13 compares the Gains charts for each of the RAD scoring models using the 2002 validation samples. Other years showed very similar results. As in the development samples, the parametric and semiparametric models and the GBS performed very similarly, and the CHAID models were worse than the others. Although the Gains charts for all parametric and semiparametric models are nearly overlapping, the Stepwise selection method produces models that discriminate slightly better (for both parametric and semiparametric forms). The Resampling selection method is nearly as good, followed by the Intersection method.
We compare the Gains charts for the development samples and the validation samples for each of the "preferred" models (the Resampling-based parametric model, the Stepwise-based semiparametric model, and the CHAID with all variables) and the calibrated GBS in Figures 14 through 17. For all models and the GBS, the Gains charts are again nearly overlapping and support the general results of the comparison of K-S values over time.
The results in Table 18 show that the K-S measures for different definitions of default are relatively consistent over time under the alternative event horizons. Although the models perform better under a 90+ DPD definition of default, they perform reasonably well under a 60+ DPD definition. Comparing across models, the parametric and semiparametric models show the best separation, being slightly better than the calibrated GBS. The CHAID model consistently performs slightly worse at separating good from bad accounts. These results show that the RAD scores are very robust and informative in the separation metric for the delinquency events we considered.
5 Conclusion
We developed credit-scoring models for bankcard performance using the OCC Risk Analysis Division consumer credit database and methods that are often encountered in the industry. We validated and compared a parametric model, a semiparametric model, and a popular nonparametric approach (CHAID).
It is worth pointing out that data preparation is crucial. The sample design issues are important, as discussed, but simple matters such as variable definition and the treatment of missing or ambiguous data become critical. This is especially true in cases where similar credit attributes could be calculated in slightly different ways. Evaluating these data issues was one of the most time-consuming components of the project.
With the data in hand, we find that careful statistical analysis will deliver a useful model, and that, while there are differences across methods, the differences are small. The parametric and semiparametric models appear to work slightly better than the CHAID. There is little difference between the parametric and semiparametric models.
We find that within-period validation is useful, but out-of-time validation shows a substantial loss of accuracy. We attribute this to the changing macroeconomic conditions. These conditions led to a small change in the overall default rate, but that change reflects much larger changes in the default rates of the high-default (low-score) components of the population. This raises robustness issues in default prediction.
A practical conclusion is that accurate out-of-time prediction of within-score-group default rates should be based on models which are frequently updated. The longer-term response is to develop models which have variables reflecting aggregate credit conditions. On the positive side, the separation properties of the models seem quite robust in the out-of-time validation samples. This suggests that it is easier to rank individuals by creditworthiness than to predict actual default rates.
There are many additional models in each of the categories, parametric, semiparametric, and nonparametric, which could be considered. We have taken a representative approach from each category. Our models are similar to those used in practice. Our results suggest that the performance of models developed using simple cross-sectional techniques may be unreliable in terms of accuracy as macroeconomic conditions change. The results suggest that increased attention be placed on the use of longitudinal modeling methods as a means by which to estimate performance conditional on temporally varying economic factors.
References
Bierman, H., and W. H. Hausman (1970): "The Credit Granting Decision," Management Science, 16, 519-532.

FRB (2006): Statistical Release G.19. Board of Governors of the Federal Reserve System.

Glennon, D. C. (1998): "Issues in Model Design and Validation," in Credit Risk Modeling: Design and Application, ed. by E. Mays, chap. 13, pp. 207-221. Glenlake Publishing.

Thomas, L., J. Crook, and D. B. Edelman (1992): Credit Scoring and Credit Control. Oxford University Press, Oxford.

Thomas, L. C., D. B. Edelman, and J. Crook (2002): Credit Scoring and Its Applications. SIAM.
TABLES AND FIGURES
Table 1:
CCDB Sampling Design Counts
(Column headers, reassembled from the extract: matched scoreable individuals from the previous year's panel; individuals unmatched from the previous year's panel and dropped; unscoreable random sample from the current year's masterfile; random cross section from the current year's masterfile; updating proportion; updating random sample; total current-year masterfile extracts (= B + E + G); CCDB panel (= B + C + D + E).)
Table 2:
Transitions to scoreable and unscoreable states

Transition   Base Year    Individuals Transitioning   Individuals Transitioning   Net Transitions
Period       Panel Size   to Unscoreable from         to Scoreable from           to Scoreable
                          Scoreable                   Unscoreable
1999-2000    1,000,000    17,129                      17,339                        210
2000-2001    1,066,290    14,150                      14,971                        821
2001-2002    1,127,329    11,988                      13,648                      1,660
2002-2003    1,183,256     9,846                      13,069                      3,223
2003-2004    1,218,370     8,837                      14,485                      5,648
Table 3:
Variables by Type
Amounts Owed
Aggregate credit amount of bankcard tradelines of which the records were updated within 12 months BK27
Aggregate credit amount of installment tradelines of which the records were updated within 12 months IN27
Aggregate credit amount of mortgage tradelines of which the records were updated within 12 months MG27
Aggregate credit amount of auto loan tradelines of which the records were updated within 12 months AL27
Aggregate credit amount of revolving tradelines of which the records were updated within 12 months RV27
Dummy variable for the positive aggregate credit amount of bankcard tradelines U11
Dummy variable for the positive aggregate credit amount of installment tradelines U12
Dummy variable for the positive aggregate credit amount of mortgage tradelines U13
Dummy variable for the positive aggregate credit amount of auto loan tradelines U17
Dummy variable for the positive aggregate credit amount of revolving tradelines U18
Aggregate balance amount of open bankcard tradelines of which the records were updated within 12 months ABK16
Aggregate balance amount of installment tradelines of which the records were updated within 12 months IN16
Aggregate balance amount of mortgage tradelines of which the records were updated within 12 months MG16
Aggregate balance amount of auto loan tradelines of which the records were updated within 12 months AL16
Aggregate balance amount of finance tradelines of which the records were updated within 12 months ALN08
Aggregate balance amount of retail tradelines of which the records were updated within 12 months ART08
Aggregate balance amount of revolving tradelines of which the records were updated within 12 months RV16
Aggregate balance amount of open home equity tradelines of which the records were updated within 12 months AEQ08
Bankcard utilization rate (Aggregate balance / Aggregate credit amount) BK28
Dummy variable for zero bankcard utilization rate BK28_0
Dummy variable for bankcard utilization rate=100% BK28_100
Dummy variable for bankcard utilization rate>100% BK28_101
Installment accounts utilization rate (Aggregate balance / Aggregate credit amount) IN28
Mortgage accounts utilization rate (Aggregate balance / Aggregate credit amount) MG28
Auto loan accounts utilization rate (Aggregate balance / Aggregate credit amount) AL28
Open bankcard utilization rate (Aggregate balance / Aggregate credit amount) ABK18
Revolving accounts utilization rate (Aggregate balance / Aggregate credit amount) RV28
Average credit amount of bankcard tradelines with positive balance and of which the records were updated within 12 months BK17
Average credit amount of installment tradelines with positive balance and of which the records were updated within 12 months IN17
Average credit amount of mortgage tradelines with positive balance and of which the records were updated within 12 months MG17
Average credit amount of retail tradelines with positive balance and of which the records were updated within 12 months RT17
Average credit amount of auto loan tradelines with positive balance and of which the records were updated within 12 months AL17
Average credit amount of revolving tradelines with positive balance and of which the records were updated within 12 months RV17
New Credit
Total number of inquiries within 6 months AIQ01
Total number of inquiries within 12 months IQ12
Total number of bankcard accounts opened within 2 years BK61
Total number of installment accounts opened within 2 years IN61
Total number of mortgage accounts opened within 2 years MG61
Dummy variable for the existence of new accounts within 2 years NUM71
Total number of retail tradelines of which the records were updated within 12 months RT21
Total number of revolving retail tradelines of which the records were updated within 12 months RTR21
Total number of auto lease tradelines of which the records were updated within 12 months AS21
Total number of auto loan tradelines of which the records were updated within 12 months AL21
Total number of revolving tradelines of which the records were updated within 12 months RV21
Total number of bankcard tradelines with positive balance and of which the records were updated within 12 months BK31
Total number of installment tradelines with positive balance and of which the records were updated within 12 months IN31
Total number of mortgage tradelines with positive balance and of which the records were updated within 12 months MG31
Total number of retail tradelines with positive balance and of which the records were updated within 12 months RT31
Total number of auto loan tradelines with positive balance and of which the records were updated within 12 months AL31
Total number of revolving tradelines with positive balance and of which the records were updated within 12 months RV31
Table 4: Dirty/Delinquent Segment - Explanatory variables selected using Stepwise, Resample, and Intersection methods
Variables selected using the Stepwise method (sorted by variable type) | Variable Names | Significance Ranking in Selection Methods: Intersection, Resample, Stepwise
I. Payment History
Total number of tradelines with good standing, positive balance, and of which the records were updated within 12 months GO01 1 1 1
Worst status of open bankcards within 6 months CURR 3 3 3
Total number of closed tradelines within 12 months NUM_Closed 10 6
Dummy variable for the existence of tradelines with 90 days past due or worse BAD11 9 13
Worst status of bankcard tradelines with 60 days past due or worse and of which the records were updated within 12 months BK43 15
Maximum of the balance amount, past due amount, and charged-off amount of delinquent bankcard tradelines with 60 days past due or worse and of which the records were updated within 12 months BK53 29
Dummy variable for the existence of installment tradelines with 90 days past due or worse within 12 months IN13 31
Total number of inquiries within 12 months IQ12 11 10
Dummy variable for the existence of new accounts within 2 years NUM71 14 18
Total number of installment accounts opened within 2 years IN61 12 20
Table 5: Dirty/Current Segment - Explanatory variables selected using Stepwise, Resample, and Intersection methods
Variables selected using the Stepwise method (sorted by variable type) | Variable Names | Significance Ranking in Selection Methods: Intersection, Resample, Stepwise
I. Payment History
Total number of tradelines with good standing, positive balance, and of which the records were updated within 12 months GO01 3 4 3
Total number of tradelines with 30 days past due or worse BAD41 4 5 4
Total number of tradelines with 90 days past due or worse BAD01 7 9 7
Total number of closed tradelines within 12 months NUM_Closed 12 10 9
Total number of bankcard tradelines with 90 days past due or worse within 12 months BK03 25 13
Dummy variable for the existence of mortgage tradelines with 90 days past due or worse within 12 months MG13 18 21 16
Dummy variable for the existence of installment tradelines with 90 days past due or worse within 12 months IN13 13 13 18
Dummy variable for the existence of bankcard tradelines with 90 days past due or worse within 12 months BK13 24 20
Dummy variable for the existence of retail tradelines with 90 days past due or worse within 12 months RT13 28
Dummy variable for the existence of revolving tradelines with 90 days past due or worse within 12 months RV13 20 11 34
Dummy variable for the existence of revolving retail tradelines with 90 days past due or worse within 12 months RTR13 36
Dummy variable for the existence of auto loan tradelines with 90 days past due or worse within 12 months AL13 38
Dummy variable for the existence of tradelines with 90 days past due or worse BAD11 43
Maximum of the balance amount, past due amount, and charged-off amount of delinquent bankcard tradelines with 60 days past due or worse and of which the records were updated within 12 months BK53 44
Aggregate balance amount of installment tradelines of which the records were updated within 12 months IN16 46
Mortgage accounts utilization rate (Aggregate balance / Aggregate credit amount) MG28 47
Dummy variable for the positive aggregate credit amount of installment tradelines U12 48
Table 6: Clean/Thin Segment - Explanatory variables selected using Stepwise, Resample, and Intersection methods
Variables selected using the Stepwise method (sorted by variable type) | Variable Names | Significance Ranking in Selection Methods: Intersection, Resample, Stepwise
I. Payment History
Worst status of open bankcards within 6 months CURR 2 2 4
Total number of tradelines with 30 days past due or worse BAD41 10
Total number of tradelines with good standing, positive balance, and of which the records were updated within 12 months GO01 11
Table 7: Clean/Thick Segment - Explanatory variables selected using Stepwise, Resample, and Intersection methods
Variables selected using the Stepwise method (sorted by variable type) | Variable Names | Significance Ranking in Selection Methods: Intersection, Resample, Stepwise
I. Payment History
Total number of tradelines with 30 days past due or worse BAD41 33
Dummy variable for the existence of tradelines with 30 days past due or worse BAD51 3 5 5
Worst status of open bankcards within 6 months CURR 2 3 3
Total number of tradelines with good standing, positive balance, and of which the records were updated within 12 months GO01 6 4
Total number of closed tradelines within 12 months NUM_Closed 19
Total number of inquiries within 6 months AIQ01 31
Total number of bankcard accounts opened within 2 years BK61 7 7 6
Total number of installment accounts opened within 2 years IN61 10 11
Total number of inquiries within 12 months IQ12 5 4 17
Total number of mortgage accounts opened within 2 years MG61 32
Dummy variable for the existence of new accounts within 2 years NUM71 37
Table 8: Variables used at least once in CHAID splitting
I. Payment History
Total number of tradelines with 60 days past due or worse BAD21 X X
Dummy variable for the existence of tradelines with 60 days past due or worse BAD31 X
Total number of tradelines with 30 days past due or worse BAD41 X X
Dummy variable for the existence of tradelines with 30 days past due or worse BAD51 X X
Worst status of open bankcards within 6 months CURR X X
Total number of tradelines with good standing, positive balance, and of which the records were updated within 12 months GO01 X X
IV. New Credit
Total number of inquiries within 6 months IQ06 X
Total number of bankcard accounts opened within 2 years BK61 X X
Total number of installment accounts opened within 2 years IN61 X
Total number of inquiries within 12 months IQ12 X X
Dummy variable for the existence of new accounts within 2 years NUM71 X
Table 9:
In-Time Validation: Median Scores for Various Models at Development
Table 10:
In-Time Validation: Model Separation and Accuracy Measures at Development (Pooled Across Segments)
(Bad = 90+ Days Past Due, or Worse, over the Following 24 Months)
Calibrated Generic Bureau Score 62.4 62.6 194.1 (<.0001) 2506.3 (<.0001)
1. The H-L test does not apply. By design, the predicted outcomes under the semiparametric approach are equal to the actual outcomes.
2. The p-values are derived under the null hypothesis H0: predicted = observed default rate for all j deciles (see equation 8), under the assumption that H-L is distributed as Chi-square with 8 degrees of freedom.
Table 11:
In-Time Validation: Separation (K-S) and Accuracy (H-L) Measures for Parametric and Semi
Parametric Models at Development, by Segment
Segment | Model Form | Variable Selection | Kolmogorov-Smirnov: 1999 Dev, 1999 Hold-out | Hosmer-Lemeshow: 1999 Dev (value, p-value), 1999 Hold-out (value, p-value)
Stepwise 40.3 38.9 16.5 .0358 22.1 .0047
Dirty History Parametric Resampling 39.3 37.2 12.5 .1303 23.9 .0024
and Presently
Mildly Intersection 37.5 37.0 6.1 .6360 10.8 .2133
Delinquent Stepwise 40.3 38.5 489.8 (<.0001) 548.1 (<.0001)
Semi
1 Resampling 39.1 37.3 526.8 (<.0001) 565.6 (<.0001)
(n=13,302) Parametric
Intersection 39.1 37.3 592.7 (<.0001) 625.6 (<.0001)
Stepwise 42.9 43.2 30.8 .0002 26.9 .0007
Dirty History Parametric Resampling 42.4 43.0 26.0 .0011 27.3 .0006
and Presently Intersection 41.4 42.0 31.1 .0001 32.8 .0001
Current
Stepwise 43.0 43.2 34.7 (<.0001) 25.8 .0011
Semi
(n=67,814) Resampling 42.7 43.2 46.1 (<.0001) 42.5 (<.0001)
Parametric
Intersection 42.2 42.6 52.7 (<.0001) 35.8 (<.0001)
Stepwise 58.2 57.1 8.3 .4047 27.3 .0006
Parametric Resampling 57.4 56.1 11.5 .1750 16.7 .0334
Clean History Intersection 54.3 54.4 48.7 (<.0001) 51.7 (<.0001)
and Thin File
Stepwise 57.6 57.1 7.5 .4837 16.4 .0370
Semi
(n=15,132) Resampling 57.2 56.8 6.1 .6360 16.9 .0312
Parametric
Intersection 55.5 55.2 21.4 .0062 51.8 (<.0001)
Stepwise 60.2 60.1 84.9 (<.0001) 94.8 (<.0001)
Parametric Resampling 60.0 59.9 74.6 (<.0001) 74.5 (<.0001)
Clean History
and Thick File Intersection 58.7 58.9 67.6 (<.0001) 82.2 (<.0001)
Stepwise 60.1 60.2 9.7 .2867 15.4 .0518
(n=242,330) Semi Resampling 60.0 59.9 7.7 .4633 14.4 .0719
Parametric
Intersection 59.1 59.2 6.8 .5584 15.9 .0438
1. The predicted values were derived from the actual default rates in the decile range based on the pooled segment data in Table 10.
Table 12:
Empirical Bad Rates for Alternate Bad Definitions on the Development Samples
1. Missing observations were generated if the lender failed to report performance as of the observation date 6, 12, or 18 months forward.
Table 13:
K-S separation measures for models built to alternate bad definitions on the development sample
Model Form | Variable Selection | Bad Event Horizon (months) | 90+ Days Past Due or Worse: Dev, Hold-Out | 60+ Days Past Due or Worse: Dev, Hold-Out
24 64.0 64.0 61.5 61.6
18 65.8 65.8 63.1 63.4
Stepwise
12 67.8 67.6 65.4 65.4
6 71.8 72.2 68.6 68.7
24 63.7 63.8 61.4 61.6
18 65.7 65.5 63.0 63.3
Parametric Resampling
12 67.7 67.5 65.3 65.3
6 71.5 72.0 68.4 68.8
24 62.7 62.9 60.5 60.9
18 64.6 64.8 62.2 62.6
Intersection
12 66.7 66.8 64.5 64.8
6 71.0 71.7 67.8 68.2
24 64.0 63.9 61.6 61.7
18 65.8 65.8 63.0 63.3
Stepwise
12 67.8 67.6 65.3 65.3
6 71.7 72.2 68.6 68.7
24 63.8 63.7 61.4 61.6
Semi 18 65.7 65.5 63.0 63.2
Parametric Resampling
12 67.6 67.5 65.2 65.3
6 71.6 72.1 68.4 68.8
24 62.6 62.8 60.5 60.9
18 64.6 64.8 62.1 62.5
Intersection
12 66.6 66.7 64.4 64.8
6 71.0 71.6 67.8 68.2
24 58.3 57.9 56.6 56.4
18 59.9 59.4 57.9 57.7
All variables
12 61.6 61.0 59.6 59.4
Non 6 65.0 64.9 62.3 62.2
Parametric
24 59.2 58.9 57.2 56.8
18 60.9 60.4 58.4 58.1
Intersection
12 62.7 62.0 60.1 59.6
6 65.8 65.7 62.8 62.0
24 62.4 62.6 60.3 60.4
Calibrated Generic Bureau 18 64.6 64.4 61.9 61.9
Score
12 66.5 66.4 63.9 63.8
6 70.2 70.7 67.2 66.7
Table 14:
Sample Sizes and Bad Rates For the Development and Out-of-Time Validation Samples
(Bad = 90 Days Past Due, or Worse, over the Following 24 Months)
Table 15:
Out-of-Time Validation: Median Scores for Various Models Across Validation Samples
Table 16:
Out-of-Time Validation: Model Separation and Accuracy Measures (Pooled Across Segments)
NonParametric All Variables 58.3 59.3 60.0 59.9 2.1 .9778 3962.4 (<.0001) 3231.5 (<.0001) 2792.4 (<.0001)
(CHAID) Intersection 59.2 60.6 60.9 60.9 44.8 (<.0001) 4163.6 (<.0001) 3758.9 (<.0001) 3053.0 (<.0001)
Calibrated Generic Bureau Score 62.4 64.9 65.6 65.1 194.1 (<.0001) 2506.3 (<.0001) 4267.6 (<.0001) 2970.0 (<.0001)
Table 17:
Out-of-Time Validation: Separation (K-S) and Accuracy (H-L) Measures for Parametric and Semi Parametric Models, by
Segment
Table 18:
Out-of-Time Validation: Separation (K-S) Measures for Different Definitions of Default
Figure 1:
OCC/RAD CCDB Sample Design: 1999 & 2000
[Flow diagram of the 1999-2000 sampling and matching: the 1,000,000-person 1999 sample; a cross-sectional random sample from the 2000 master file (1,000,000); an updating sample of 66,500 scoreable individuals; 883,500 scoreable.]
Figure 2:
1999 Development and 1999 Hold-Out Sample Construction, and Bad Rates
(Bad = 90+ Days Past Due, or Worse, over the Following 24 Months)
[Flow diagram. From the 1 million individuals sampled in 1999, 995,251 have CCDB attribute records with valid tradeline data (4,749 individuals with attributes but without matching valid tradeline data are excluded); 733,820 have at least one open bankcard with a balance update date between 1/99 and 6/99 (261,431 excluded); 714,698 are not presently severely delinquent (19,122 individuals with a bankcard presently severely delinquent or worse are excluded); and 677,262 have future bankcard performance observable at month 24. These are split into the development sample of 338,578 individuals (p = 7.19%) and the in-time hold-out sample of 338,684 individuals (p = 7.22%). Development clean-history segments: clean total, 257,462 individuals (p = 3.06%); clean/thin, 15,132 (p = 4.76%); clean/thick, 242,330 (p = 2.96%). In-time hold-out segments: dirty history, 80,942 (p = 20.39%), comprising dirty/delinquent, 13,207 (p = 49.04%), and dirty/current, 67,735 (p = 14.79%); clean history, 257,742 (p = 3.09%), comprising clean/thin, 15,150 (p = 4.86%), and clean/thick, 242,592 (p = 2.97%).]
Figure 3:
[Gains chart: CDF of bads plotted against the CDF of all observations, sorted from low to high score, for the Parametric (Stepwise, Resampling, Intersection), SemiParametric (Stepwise, Resampling, Intersection), CHAID (All variables, Intersection), and Calibrated GBS models; the 45-degree line marks no separation.]
Figure 4:
[Gains chart: CDF of bads plotted against the CDF of all observations, sorted from low to high score, for the Parametric (Stepwise, Resampling, Intersection), SemiParametric (Stepwise, Resampling, Intersection), CHAID (All variables, Intersection), and Calibrated GBS models; the 45-degree line marks no separation.]
Figure 5:
[Empirical log odds by RAD score, development sample: SemiPara (Stepwise, Resampling, Intersection), CHAID (All variables, Intersection), Calibrated GBS, and the Calibration Target line.]
Figure 6:
[Empirical log odds by RAD score, hold-out sample: SemiPara (Stepwise, Resampling, Intersection), CHAID (All variables, Intersection), Calibrated GBS, and the Calibration Target line.]
Figure 7: RAD Scores, Full Sample, Development and Validation
[Score distributions (values 500 to 900) for the Parametric (Resampling), Semiparametric (Stepwise), Nonparametric (CHAID), and Calibrated GBS models across the Dev, Hold, 00, 01, and 02 samples.]
Figure 8: RAD Scores, Dirty/Delinquent Sample, Development and Validation
[Score distributions (values 500 to 900) for the Parametric (Resampling), Semiparametric (Stepwise), Nonparametric (CHAID), and Calibrated GBS models across the Dev, Hold, 00, 01, and 02 samples.]
Figure 9: RAD Scores, Dirty/Current Sample, Development and Validation
[Score distributions (values 500 to 900) for the Parametric (Resampling), Semiparametric (Stepwise), Nonparametric (CHAID), and Calibrated GBS models across the Dev, Hold, 00, 01, and 02 samples.]
Figure 10: RAD Scores, Clean/Thin Sample, Development and Validation
[Score distributions (values 500 to 900) for the Parametric (Resampling), Semiparametric (Stepwise), Nonparametric (CHAID), and Calibrated GBS models across the Dev, Hold, 00, 01, and 02 samples.]
Figure 11: RAD Scores, Clean/Thick Sample, Development and Validation
[Score distributions (values 500 to 900) for the Parametric (Resampling), Semiparametric (Stepwise), Nonparametric (CHAID), and Calibrated GBS models across the Dev, Hold, 00, 01, and 02 samples.]
Figure 12:
[Empirical log odds by RAD score, 2002 validation samples: SemiPara (Stepwise, Resampling, Intersection), CHAID (All variables, Intersection), Calibrated GBS, and the Calibration Target line.]
Figure 13:
[Gains chart, 2002 validation samples: CDF of bads plotted against the CDF of all observations, sorted from low to high score, for the Parametric, SemiParametric, CHAID, and Calibrated GBS models; the 45-degree line marks no separation.]
Figure 14:
[Gains chart for one of the "preferred" models (see text): CDF of bads plotted against the CDF of all observations for the Development, Hold-out, 2000, 2001, and 2002 samples, with the 45-degree no-separation line.]
Figure 15:
[Gains chart for one of the "preferred" models (see text): CDF of bads plotted against the CDF of all observations for the Development, Hold-out, 2000, 2001, and 2002 samples, with the 45-degree no-separation line.]
Figure 16:
[Gains chart for one of the "preferred" models (see text): CDF of bads plotted against the CDF of all observations for the Development, Hold-out, 2000, 2001, and 2002 samples, with the 45-degree no-separation line.]
Figure 17:
[Gains chart for one of the "preferred" models (see text): CDF of bads plotted against the CDF of all observations for the Development, Hold-out, 2000, 2001, and 2002 samples, with the 45-degree no-separation line.]