Experiences with using R in credit risk
R in credit risk
Hong Ooi
Introduction
Page 2
Mortgage haircut model
• When a mortgage defaults, the bank can take possession of the property
and sell it to recoup the loss [1]
• We have some idea of the market value of the property
• Actual sale price tends to be lower on average than the market value (the
"haircut") [2]
• If sale price > exposure at default, we don’t make a loss (excess is passed
on to customer); otherwise, we make a loss
Notes:
1. For ANZ, <10% of defaults actually result in possession
2. Meaning of “haircut” depends on context; very different when talking
about, say, US mortgages
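The loss mechanics above can be sketched in a few lines of R. This is an illustrative simulation only: the parameters, the normal haircut distribution, and all names are assumptions, not ANZ's actual model.

```r
# Shortfall on a defaulted mortgage: a loss arises only if the
# sale price falls short of the exposure at default (EAD);
# any excess is passed on to the customer.
set.seed(1)
valuation <- 500000
ead <- 450000                                  # exposure at default
haircut <- rnorm(10000, mean = 0.10, sd = 0.15) # illustrative distribution
sale_price <- valuation * (1 - haircut)
shortfall <- pmax(ead - sale_price, 0)          # floored at zero
mean(shortfall)                                 # expected shortfall per default
```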
Page 3
Sale price distribution
[Figure: distribution of sale price, annotated with the valuation, the haircut, the exposure at default, and the expected shortfall]
Page 4
Stat modelling
Page 5
[Figure: scatter plots of log sale price against valuation, in three panels: valuation at origination, valuation at kerbside, and valuation after possession]
Page 6
Volatility

            SD(haircut)*
A           11.6%
B           9.3%
C           31.2%

State/territory   SD(haircut)*
1                 NA
2                 13.3%
3                 7.7%
4                 9.2%
5                 15.6%
6                 18.4%
7                 14.8%
Page 7
Volatility modelling
Page 8
Shortfall
[Figure: shortfall (y-axis, $0 to $140,000) against sale price]
Page 9
Volatility: T-regression
Page 10
Example impact
Gaussian model
t5 model
Page 11
Model fitting function (simplified)
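The original fitting function is not reproduced here; the following is a minimal sketch of the idea, fitting a t5 regression with a log-linear scale by maximum likelihood via `optim`. The model form, data, and names are illustrative assumptions, not ANZ's actual code.

```r
# Fit y ~ x with t5 errors and a scale that is log-linear in x.
# The log-likelihood of a scaled t is dt((y - mu)/sigma, df, log = TRUE)
# minus log(sigma); we minimise its negative with optim.
fit_t5 <- function(y, x, df = 5) {
    negll <- function(p) {
        mu    <- p[1] + p[2] * x         # location: linear in x
        sigma <- exp(p[3] + p[4] * x)    # scale: log-linear in x
        -sum(dt((y - mu) / sigma, df = df, log = TRUE) - log(sigma))
    }
    optim(c(mean(y), 0, log(sd(y)), 0), negll, method = "BFGS")
}

set.seed(42)
x <- runif(200)
y <- 1 + 2 * x + exp(-1 + x) * rt(200, df = 5)  # simulated t5 data
fit <- fit_t5(y, x)
fit$par   # intercept, slope, and the two log-scale coefficients
```

Because the t density downweights outliers, the fitted location line is less sensitive to extreme sale prices than a Gaussian fit would be.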
Page 12
Shortfall
[Figure: shortfall against sale price]
Because it downweights outliers, the t distribution is more concentrated in the centre.
Page 13
Normal model residuals
[Figure: density plot and Q–Q plot of the normal-model residuals]
t5-model residuals
[Figure: density plot and Q–Q plot of the t5-model residuals]
Page 14
Notes on model behaviour
Page 15
In SAS
• SAS has PROC MIXED for modelling variances, but only allows one grouping
variable and assumes a normal distribution
• PROC NLIN does general nonlinear optimisation
• Also possible in PROC IML
Page 16
Through-the-cycle calibration
Page 17
TTC approach
PD(x, e) = f(x, e)   (x = customer attributes, e = state of the economic cycle)
Page 18
TTC approach
Page 19
TTC calculation
Page 20
Binning/cohorting
• Raw TTC estimate is a combination of many spot PDs, each of which is from
a logistic regression
→ TTC estimate is a complicated function of customer attributes
• Need to simplify for communication, implementation purposes
• Turn into bins or cohorts based on customer attributes: estimate for each
cohort is the average for customers within the cohort
• Take pragmatic approach to defining cohorts
• Create tiers based on small selection of variables that will split out
riskiest customers
• Within each tier, create contingency table using attributes deemed most
interesting/important to the business
• Number of cohorts limited by need for simplicity/manageability, <1000
desirable
• Not a data-driven approach, although selection of variables informed by
data exploration/analysis
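The tier-then-cohort step above can be sketched in R. The data, the variables (lvr, enq, low_doc), and the cutoffs are illustrative stand-ins; the point is only that each cohort's TTC PD is the mean of its customers' estimates.

```r
# Cohorting a raw TTC PD: split on a small set of risk-driving
# variables, then average customer-level PDs within each cohort.
set.seed(1)
d <- data.frame(
    lvr     = sample(c(50, 60, 70, 80, 95), 1000, replace = TRUE),
    enq     = sample(0:5, 1000, replace = TRUE),
    low_doc = sample(c("N", "Y"), 1000, replace = TRUE),
    spot_pd = runif(1000, 0.001, 0.05)  # stand-in for the raw TTC estimate
)
# cap enquiries at 3 to keep the contingency table small
d$cohort <- interaction(d$lvr, pmin(d$enq, 3), d$low_doc, drop = TRUE)
cohort_pd <- aggregate(spot_pd ~ cohort, data = d, FUN = mean)
nrow(cohort_pd)   # well under the ~1000-cohort limit
```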
Page 21
Binning/cohorting
[Figure: two density plots of PD, side by side]
Page 22
Binning input
Page 23
Binning output
if lvr = 50 and enq = 0 and low_doc = 'N' then do; tier2 = 1; ttc_pd = ________; end;
else if lvr = 60 and enq = 0 and low_doc = 'N' then do; tier2 = 2; ttc_pd = ________; end;
else if lvr = 70 and enq = 0 and low_doc = 'N' then do; tier2 = 3; ttc_pd = ________; end;
else if lvr = 80 and enq = 0 and low_doc = 'N' then do; tier2 = 4; ttc_pd = ________; end;
else if lvr = 95 and enq = 0 and low_doc = 'N' then do; tier2 = 5; ttc_pd = ________; end;
else if lvr = 50 and enq = 3 and low_doc = 'N' then do; tier2 = 6; ttc_pd = ________; end;
else if lvr = 60 and enq = 3 and low_doc = 'N' then do; tier2 = 7; ttc_pd = ________; end;
...
Page 24
Binning/cohorting
Page 25
Stress testing simulation
• Banks run stress tests on their loan portfolios, to see what a downturn would
do to their financial health
• Mathematical framework is similar to the “Vasicek model”:
• Represent the economy by a parameter X
• Each loan has a transition matrix that is shifted based on X, determining
its risk grade in year t given its grade in year t - 1
• Defaults if bottom grade reached
• Take a scenario/simulation-based approach: set X to a stressed value, run N
times, take the average
• Contrast to VaR: “average result for a stressed economy”, as opposed to
“stressed result for an average economy”
• Example data: portfolio of 100,000 commercial loans along with current risk
grade, split by subportfolio
• Simulation horizon: ~3 years
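The Vasicek-style shift above can be sketched in R: express each row of the transition matrix as thresholds on the normal scale, shift them by a loading on the economy parameter X, and resample grades. The matrix, the loading w, and the sign convention (X < 0 is a downturn) are all illustrative assumptions.

```r
# Shift a transition matrix by the economy parameter X, then
# simulate one year of grade transitions for a portfolio.
stress_tm <- function(P, X, w = 0.3) {
    thr <- qnorm(t(apply(P, 1, cumsum)))        # thresholds per grade
    shifted <- pnorm(thr + w * X)               # X < 0 pushes mass downward
    t(apply(shifted, 1, function(r) diff(c(0, r))))
}
P <- matrix(c(0.90, 0.08, 0.02,
              0.10, 0.80, 0.10,
              0.00, 0.00, 1.00), 3, 3, byrow = TRUE)  # grade 3 = default
Pstress <- stress_tm(P, X = -1)                 # stressed economy

set.seed(1)
grades <- sample(1:2, 1000, replace = TRUE)     # current risk grades
new_grades <- sapply(grades, function(g) sample(3, 1, prob = Pstress[g, ]))
mean(new_grades == 3)                           # stressed 1-year default rate
```

Setting X to a stressed value and averaging over N runs gives the "average result for a stressed economy" described above.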
Page 26
Application outline
• Previous version was an ad-hoc script written entirely in SAS, took ~4 hours
to run, often crashed due to lack of disk space
• Series of DATA steps (disk-bound)
• Transition matrices represented by unrolled if-then-else statements
(25x25 matrix becomes 625 lines of code)
• Rewritten in R: runtime reduced to ~2 minutes, and 1 MB of code cut to ~10 KB
• No rocket science involved: simply due to using a better tool
• Similar times achievable with PROC IML, of which more later
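For contrast with the unrolled if-then-else approach, a 25x25 transition matrix applied to a whole portfolio is one indexing step plus one draw per loan in R (illustrative random matrix):

```r
# Replace 625 lines of unrolled if-then-else with a matrix lookup:
# P[grade, ] pulls each loan's own transition row in one operation.
set.seed(1)
ngrades <- 25
P <- matrix(runif(ngrades^2), ngrades)
P <- P / rowSums(P)                            # rows sum to 1
grade <- sample(ngrades, 10000, replace = TRUE)
next_grade <- apply(P[grade, ], 1,
                    function(p) sample(ngrades, 1, prob = p))
```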
Page 27
Application outline
• For each subportfolio and year, get the median result and store it
• Next year’s simulation uses this year’s median portfolio
• To avoid having to store multiple transited copies of the portfolio, we
manipulate random seeds
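The seed trick can be sketched as follows: store the seed that produced each year's run rather than the run itself, since resetting the seed replays identical draws. The function and data are illustrative stand-ins for the real transition step.

```r
# Storing a seed instead of a simulated portfolio copy:
# set.seed makes the draws reproducible on demand.
simulate_year <- function(portfolio, seed) {
    set.seed(seed)
    portfolio + rnorm(length(portfolio))  # stand-in for a real transition
}
seeds <- sample.int(1e6, 3)               # one seed per stored run
runs <- lapply(seeds, simulate_year, portfolio = rep(0, 5))
# later: reproduce run 2 exactly, without having kept a copy of it
identical(simulate_year(rep(0, 5), seeds[2]), runs[[2]])
```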
Page 28
Data structures
• But desired output for each [i, j] might be a bunch of summary statistics,
diagnostics, etc
→ Output needs to be a list
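One way to hold list-valued output indexed by two dimensions is a matrix whose cells are lists; the dimensions and statistics below are illustrative.

```r
# A list-matrix: cell [i, j] holds the summary statistics and
# diagnostics for subportfolio i in year j.
set.seed(1)
out <- matrix(vector("list", 3 * 2), nrow = 3, ncol = 2)
for (i in 1:3) for (j in 1:2) {
    x <- rnorm(100)                       # stand-in simulation result
    out[[i, j]] <- list(mean = mean(x), sd = sd(x), n = length(x))
}
out[[2, 1]]$mean                          # pull one cell's diagnostics
sapply(out[, 1], `[[`, "n")               # or slice a whole column
```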
Page 29
Data structures
Page 30
Data structures
Page 31
PROC IML: a gateway to R
• As of SAS 9.2, you can use IML to execute R code, and transfer datasets to
and from R:
PROC IML;
call ExportDataSetToR('portfol', 'portfol'); /* creates a data frame */
call ExportMatrixToR("&Rfuncs", 'rfuncs');
call ExportMatrixToR("&Rscript", 'rscript');
call ExportMatrixToR("&nIters", 'nIters');
...
submit /R;
source(rfuncs)
source(rscript)
endsubmit;
call ImportDataSetFromR('result', 'result');
QUIT;
Page 32
IML: a side-rant
• IML lacks:
• Logical vectors: everything has to be numeric or character
• Support for zero-length vectors (you don’t realise how useful they are
until they’re gone)
• Unoriented vectors: everything is either a row or column vector
(technically, everything is a matrix)
• So something like x = x + y[z < 0]; fails in three ways
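For contrast, the same operations work directly in R (toy data):

```r
# The expression that fails three ways in IML, piece by piece in R:
x <- c(1, 2)
y <- c(10, 20, 30)
z <- c(-1, 5, -2)
z < 0            # a logical vector: TRUE FALSE TRUE
y[z < 0]         # logical subscripting: 10 30
x + y[z < 0]     # plain element-wise addition: no row/column orientation
y[z > 99]        # a zero-length vector, not an error: numeric(0)
```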
Page 33
Other SAS/R interfaces
• SAS has a proprietary dataset format (or, many proprietary dataset formats)
• R’s foreign package includes read.ssd and read.xport for importing,
and write.foreign(*, package="SAS") for exporting
• Package Hmisc has sas.get
• Package sas7bdat has an experimental reader for this format
• Revolution R can read SAS datasets
• All have glitches, are not widely available, or not fully functional
• First 2 also need SAS installed
• SAS 9.2 and IML make these issues moot
• You just have to pay for it
• Caveat: only works with R <= 2.11.1 (2.12 changed the locations of
binaries)
• SAS 9.3 will support R 2.12+
Page 34
R and SAS rundown
• Advantages of R
• Free! (base distribution, anyway)
• Very powerful statistical programming environment: SAS takes 3
languages to do what R does with 1
• Flexible and extensible
• Lots of features (if you can find them)
• User-contributed packages are a blessing and a curse
• Ability to handle large datasets is improving
• Advantages of SAS
• Pervasive presence in large firms
• “Nobody got fired for buying IBM SAS”
• Compatibility with existing processes/metadata
• Long-term support
• Tremendous data processing/data warehousing capability
• Lots of features (if you can afford them)
• Sometimes cleaner than R, especially for data manipulation
Page 35
R and SAS rundown
Page 36
Challenges for deploying R
Page 37
Commercial R: an aside
Page 38
Good problems to have
Page 39
Other resources
Page 40