Project-Report Sample
Franklin Lam
January, 2023
Introduction
In this assignment, the mlr package is used to build an SVM model and a neural network model for the Credit
Card Default classification problem. This sample illustrates how to use R to perform exploratory
analysis and to develop and benchmark SVM and neural network models for the given problem. Only a brief discussion
is included as an illustration.
NOTE: A more detailed analysis is required in the assignment, and marks will not be given for directly
copying this analysis.
First, we load the required libraries and disable the training output from the mlr package.
library(dplyr)
library(psych)
library(VIM)
library(ggplot2)
library(corrplot)
library(car)
library(mlr)
library(gridExtra)
# turn off training outputs of mlr
configureMlr(show.info=FALSE, show.learner.output=FALSE)
Exploratory analysis
The following code produces basic descriptive statistics of the raw data and plots the missing-data pattern.
# Read the Data
df <- read.csv("default_credit_card.csv",header=TRUE) %>%
mutate(default=factor(Y, labels=c("No", "Yes"))) %>%
select(-Y)
str(df)
## $ X11 : int -2 2 0 0 0 0 0 -1 0 -1 ...
## $ X12 : int 3913 2682 29239 46990 8617 64400 367965 11876 11285 0 ...
## $ X13 : int 3102 1725 14027 48233 5670 57069 412023 380 14096 0 ...
## $ X14 : int 689 2682 13559 49291 35835 57608 445007 601 12108 0 ...
## $ X15 : int 0 3272 14331 28314 20940 19394 542653 221 12211 0 ...
## $ X16 : int 0 3455 14948 28959 19146 19619 483003 -159 11793 13007 ...
## $ X17 : int 0 3261 15549 29547 19131 20024 473944 567 3719 13912 ...
## $ X18 : int 0 0 1518 2000 2000 2500 55000 380 3329 0 ...
## $ X19 : int 689 1000 1500 2019 36681 1815 40000 601 0 0 ...
## $ X20 : int 0 1000 1000 1200 10000 657 38000 0 432 0 ...
## $ X21 : int 0 1000 1000 1100 9000 1000 20239 581 1000 13007 ...
## $ X22 : int 0 0 1000 1069 689 1000 13750 1687 1000 1122 ...
## $ X23 : int 0 2000 5000 1000 679 800 13770 1542 1000 0 ...
## $ default: Factor w/ 2 levels "No","Yes": 2 2 1 1 1 1 1 1 1 1 ...
summary(df)
## X1 X2 X3 X4
## Min. : 10000 Min. :1.000 Min. :0.000 Min. :0.000
## 1st Qu.: 50000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median : 140000 Median :2.000 Median :2.000 Median :2.000
## Mean : 167484 Mean :1.604 Mean :1.853 Mean :1.552
## 3rd Qu.: 240000 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :1000000 Max. :2.000 Max. :6.000 Max. :3.000
## X5 X6 X7 X8
## Min. :21.00 Min. :-2.0000 Min. :-2.0000 Min. :-2.0000
## 1st Qu.:28.00 1st Qu.:-1.0000 1st Qu.:-1.0000 1st Qu.:-1.0000
## Median :34.00 Median : 0.0000 Median : 0.0000 Median : 0.0000
## Mean :35.49 Mean :-0.0167 Mean :-0.1338 Mean :-0.1662
## 3rd Qu.:41.00 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :79.00 Max. : 8.0000 Max. : 8.0000 Max. : 8.0000
## X9 X10 X11 X12
## Min. :-2.0000 Min. :-2.0000 Min. :-2.0000 Min. :-165580
## 1st Qu.:-1.0000 1st Qu.:-1.0000 1st Qu.:-1.0000 1st Qu.: 3559
## Median : 0.0000 Median : 0.0000 Median : 0.0000 Median : 22382
## Mean :-0.2207 Mean :-0.2662 Mean :-0.2911 Mean : 51223
## 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 67091
## Max. : 8.0000 Max. : 8.0000 Max. : 8.0000 Max. : 964511
## X13 X14 X15 X16
## Min. :-69777 Min. :-157264 Min. :-170000 Min. :-81334
## 1st Qu.: 2985 1st Qu.: 2666 1st Qu.: 2327 1st Qu.: 1763
## Median : 21200 Median : 20089 Median : 19052 Median : 18105
## Mean : 49179 Mean : 47013 Mean : 43263 Mean : 40311
## 3rd Qu.: 64006 3rd Qu.: 60165 3rd Qu.: 54506 3rd Qu.: 50191
## Max. :983931 Max. :1664089 Max. : 891586 Max. :927171
## X17 X18 X19 X20
## Min. :-339603 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 1256 1st Qu.: 1000 1st Qu.: 833 1st Qu.: 390
## Median : 17071 Median : 2100 Median : 2009 Median : 1800
## Mean : 38872 Mean : 5664 Mean : 5921 Mean : 5226
## 3rd Qu.: 49198 3rd Qu.: 5006 3rd Qu.: 5000 3rd Qu.: 4505
## Max. : 961664 Max. :873552 Max. :1684259 Max. :896040
## X21 X22 X23 default
## Min. : 0 Min. : 0.0 Min. : 0.0 No :23364
## 1st Qu.: 296 1st Qu.: 252.5 1st Qu.: 117.8 Yes: 6636
## Median : 1500 Median : 1500.0 Median : 1500.0
## Mean : 4826 Mean : 4799.4 Mean : 5215.5
## 3rd Qu.: 4013 3rd Qu.: 4031.5 3rd Qu.: 4000.0
## Max. :621000 Max. :426529.0 Max. :528666.0
# plot the missing data pattern
aggr(df[,1:23], prop=FALSE, numbers=TRUE)
[Figure: missing-data pattern from aggr() for X1-X23, showing the counts of missing values and their combinations.]
From the descriptive statistics and the missing-data plot, there are no missing values in the dataset.
Descriptive statistics of the data are given below.
# count frequency and calculate percentage
freq=table(df$default)
percentage=round(prop.table(freq),3)
cbind(freq, percentage)
## freq percentage
## No 23364 0.779
## Yes 6636 0.221
# descriptive statistics
describe(df[,-24])
## X14 14 30000 47013.15 69349.39 20088.5 32064.43 29219.82 -157264 1664089
## X15 15 30000 43262.95 64332.86 19052.0 29212.37 27659.39 -170000 891586
## X16 16 30000 40311.40 60797.16 18104.5 26920.95 26224.97 -81334 927171
## X17 17 30000 38871.76 59554.11 17071.0 25726.08 24840.96 -339603 961664
## X18 18 30000 5663.58 16563.28 2100.0 2997.22 2864.38 0 873552
## X19 19 30000 5921.16 23040.87 2009.0 2876.43 2951.86 0 1684259
## X20 20 30000 5225.68 17606.96 1800.0 2468.91 2661.27 0 896040
## X21 21 30000 4826.08 15666.16 1500.0 2199.23 2223.90 0 621000
## X22 22 30000 4799.39 15278.31 1500.0 2202.19 2223.90 0 426529
## X23 23 30000 5215.50 17777.47 1500.0 2165.33 2223.90 0 528666
## range skew kurtosis se
## X1 990000 0.99 0.54 749.10
## X2 1 -0.42 -1.82 0.00
## X3 6 0.97 2.08 0.00
## X4 3 -0.02 -1.36 0.00
## X5 58 0.73 0.04 0.05
## X6 10 0.73 2.72 0.01
## X7 10 0.79 1.57 0.01
## X8 10 0.84 2.08 0.01
## X9 10 1.00 3.50 0.01
## X10 10 1.01 3.99 0.01
## X11 10 0.95 3.43 0.01
## X12 1130091 2.66 9.80 425.14
## X13 1053708 2.70 10.30 410.92
## X14 1821353 3.09 19.78 400.39
## X15 1061586 2.82 11.31 371.43
## X16 1008505 2.88 12.30 351.01
## X17 1301267 2.85 12.27 343.84
## X18 873552 14.67 415.16 95.63
## X19 1684259 30.45 1641.25 133.03
## X20 896040 17.21 564.18 101.65
## X21 621000 12.90 277.27 90.45
## X22 426529 11.13 180.02 88.21
## X23 528666 10.64 167.12 102.64
Boxplots of numerical data
# boxplot - only numerical data
# X1: credit
df %>% select(X1) %>%
reshape2::melt() %>%
ggplot(aes(x=variable, y=value, fill=variable)) +
geom_boxplot() +
labs(title="Predicting variables") +
theme(axis.text.x = element_text(angle=90, vjust=0.5),
plot.title=element_text(hjust=0.5))
[Figure: "Predicting variables" - boxplot of X1.]
[Figure: "Predicting variables" - boxplots of X6-X11.]
# X12-X17: bill
df %>% select(X12:X17) %>%
reshape2::melt() %>%
ggplot(aes(x=variable, y=value, fill=variable)) +
geom_boxplot() +
labs(title="Predicting variables") +
theme(axis.text.x = element_text(angle=90, vjust=0.5),
plot.title=element_text(hjust=0.5))
[Figure: "Predicting variables" - boxplots of X12-X17.]
[Figure: "Predicting variables" - boxplots of X18-X23.]
# predicting variables
# X1
df %>% select(X1, default) %>%
group_by(default) %>%
summarise_all(mean) %>%
reshape2::melt() %>%
ggplot(aes(x=variable, y=value, fill=default)) +
geom_col(position="dodge") +
labs(title = "The Means of Predicting Variables by Class") +
theme(axis.text.x = element_text(angle=90, vjust=0.5),
plot.title=element_text(hjust=0.5))
[Figure: "The Means of Predicting Variables by Class" - mean of X1 by default class.]
# X2:X5
df %>% select(X2, default) %>%
mutate(Sex=factor(X2, labels=c("male", "female"))) %>%
select(Sex, default) %>%
reshape2::melt() %>%
ggplot(aes(x=Sex, fill=default)) +
geom_bar(position="dodge") +
labs(title = "The Means of Predicting Variables by Class") +
theme(axis.text.x = element_text(angle=90, vjust=0.5),
plot.title=element_text(hjust=0.5))
[Figure: counts by Sex (male, female) and default class.]
[Figure: counts by Education (Graduate School, High School, University, Unknown, Others) and default class.]
# X4: marital status (the factor labels below are an assumption)
df %>% mutate(Marital_status=factor(X4, labels=c("Unknown", "Married", "Single", "Others"))) %>%
select(Marital_status, default) %>%
reshape2::melt() %>%
ggplot(aes(x=Marital_status, fill=default)) +
geom_bar(position="dodge") +
labs(title = "The Means of Predicting Variables by Class") +
theme(axis.text.x = element_text(angle=90, vjust=0.5),
plot.title=element_text(hjust=0.5))
[Figure: counts by Marital_status (Unknown, Married, Single, Others) and default class.]
[Figure: histogram of Age by default class (No, Yes).]
# X6:X11
df %>% select(X6:X11, default) %>%
group_by(default) %>%
summarise_all(mean) %>%
reshape2::melt() %>%
ggplot(aes(x=variable, y=value, fill=default)) +
geom_col(position="dodge") +
labs(title = "The Means of Predicting Variables by Class") +
theme(axis.text.x = element_text(angle=90, vjust=0.5),
plot.title=element_text(hjust=0.5))
[Figure: "The Means of Predicting Variables by Class" - means of X6-X11 by default class.]
# X12:X17
df %>% select(X12:X17, default) %>%
group_by(default) %>%
summarise_all(mean) %>%
reshape2::melt() %>%
ggplot(aes(x=variable, y=value, fill=default)) +
geom_col(position="dodge") +
labs(title = "The Means of Predicting Variables by Class") +
theme(axis.text.x = element_text(angle=90, vjust=0.5),
plot.title=element_text(hjust=0.5))
[Figure: "The Means of Predicting Variables by Class" - means of X12-X17 by default class.]
# X18:X23
df %>% select(X18:X23, default) %>%
group_by(default) %>%
summarise_all(mean) %>%
reshape2::melt() %>%
ggplot(aes(x=variable, y=value, fill=default)) +
geom_col(position="dodge") +
labs(title = "The Means of Predicting Variables by Class") +
theme(axis.text.x = element_text(angle=90, vjust=0.5),
plot.title=element_text(hjust=0.5))
[Figure: "The Means of Predicting Variables by Class" - means of X18-X23 by default class.]
[Figure: histograms (frequency vs. value) of the predictor variables X1, X6-X11 and X12-X23.]
# correlation plot
df %>% select(X1, X6:X11) %>% cor() %>%
round(3) %>%
corrplot(method = "color", addCoef.col="white", type = "upper",
title="Correlation between response variables",
mar=c(0,0,2,0))
[Figure: correlation matrix of X1 and X6-X11.]
df %>% select(X1, X12:X17) %>% cor() %>%
round(3) %>%
corrplot(method = "color", addCoef.col="white", type = "upper",
title="Correlation between response variables",
mar=c(0,0,2,0))
[Figure: correlation matrix of X1 and X12-X17.]
df %>% select(X1, X18:X23) %>% cor() %>%
round(3) %>%
corrplot(method = "color", addCoef.col="white", type = "upper",
title="Correlation between response variables",
mar=c(0,0,2,0))
[Figure: correlation matrix of X1 and X18-X23.]
df %>% select(X6:X23) %>% cor() %>%
round(3) %>%
corrplot(method = "color", addCoef.col="white", type = "upper",
title="Correlation between response variables",
mar=c(0,0,1,0),
tl.cex=0.5, number.cex = 0.5)
[Figure: correlation matrix of X6-X23.]
[Figure: correlation matrix of X6-X23 for the non-default cases.]
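The non-default correlation plot above is produced by a chunk analogous to the one for the default cases below; a minimal sketch under that assumption:
# sketch: correlation plot for the non-default cases (assumed to mirror the default-case chunk)
df %>% filter(default=="No") %>%
select(X6:X23) %>%
cor() %>%
round(3) %>%
corrplot(method = "color", addCoef.col="white", type = "upper",
title="Correlation between response variables (Non-default)",
mar=c(0,0,1,0),
tl.cex=0.5, number.cex = 0.5)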
df %>% filter(default=="Yes") %>%
select(X6:X23) %>%
cor() %>%
round(3) %>%
corrplot(method = "color", addCoef.col="white", type = "upper",
title="Correlation between response variables (default)",
mar=c(0,0,1,0),
tl.cex=0.5, number.cex = 0.5)
[Figure: correlation matrix of X6-X23 for the default cases.]
There is no significant difference in the correlations between variables for the default and non-default cases.
Given the high correlations among X6 - X11 and among X12 - X17, we may replace X6 - X11 and X12 - X17 by
their means, history and bill respectively, in order to reduce the dimension of the problem.
# derive variables
df_reduce <- df %>% mutate(history=(X6 + X7 + X8 + X9 + X10 + X11)/6,
bill=(X12 + X13 + X14 + X15 + X16 + X17)/6) %>%
select(X1:X5, history:bill, X18:X23, default)
describe(df_reduce)
[Figure: "Predicting variables" - history by default class.]
[Figure: "Predicting variables" - bill by default class.]
[Figure: "Predicting variables" - X18-X23 by default class.]
[Figure: correlation matrix of X1, history, bill and X18-X23.]
Variables X2 - X4 are categorical. It is advised that SVM should not be applied directly to categorical
variables because it is based on Euclidean distances. For SVM, we can create dummy variables to represent
the categorical variables. The following code uses dummy_cols() from the fastDummies package to replace
the categorical variables with dummy variables.
# use encoding for categorical variables X2:X4
dfc <- df %>% mutate(X2=as.factor(X2), X3=as.factor(X3), X4=as.factor(X4)) %>%
fastDummies::dummy_cols() %>%
select(X1, X6:X23, X2_1:X4_3, default)
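The reduced set with dummy encoding, dfc_reduce, is used in the data-splitting step below; a minimal sketch of its construction, assuming it mirrors dfc:
# sketch: dummy-encode the categorical variables of the reduced set (assumed construction)
dfc_reduce <- df_reduce %>% mutate(X2=as.factor(X2), X3=as.factor(X3), X4=as.factor(X4)) %>%
fastDummies::dummy_cols() %>%
select(X1, history, bill, X18:X23, X2_1:X4_3, default)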
Preparing data
The following code splits the dataset into a training dataset (two-thirds) and a testing dataset (one-third). The minimum
and maximum of each variable in the training dataset are used to normalize the datasets to the [0, 1] range.
# split data into training and test sets
set.seed(800)
index <- 1:nrow(df)
test_set_index <- sample(index, trunc(length(index)/3))
# (1) Full set
test_set <- df[test_set_index,]
train_set <- df[-test_set_index,]
# (2) Reduced set
test_set_r <- df_reduce[test_set_index,]
train_set_r <- df_reduce[-test_set_index,]
# (3) Full set - dummy encoding for categorical variables
test_setc <- dfc[test_set_index,]
train_setc <- dfc[-test_set_index,]
# (4) Reduced set - dummy encoding for categorical variables
test_set_rc <- dfc_reduce[test_set_index,]
train_set_rc <- dfc_reduce[-test_set_index,]
# function to rescale the columns based on the training-set min/max
rescale <- function(dat, d_min, d_max) {
c <- ncol(dat)
for (i in 1:c) {
dat[,i] <- sapply(dat[,i], function(x) (x - d_min[i])/(d_max[i] - d_min[i]))
}
return (dat)
}
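The rescale() function can then be applied with the training-set minima and maxima, as described above; a minimal sketch (the column selection, every column except the factor target default, is an assumption):
# sketch: normalize the full-set predictors to [0, 1] using the training-set min/max
num_cols <- setdiff(names(train_set), "default")
d_min <- sapply(train_set[, num_cols], min)
d_max <- sapply(train_set[, num_cols], max)
train_set[, num_cols] <- rescale(train_set[, num_cols], d_min, d_max)
test_set[, num_cols] <- rescale(test_set[, num_cols], d_min, d_max)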
SVM
Develop basic model
The basic SVM classifier is constructed with a radial kernel. The following code develops the full and reduced
models. default = "Yes" is set as the positive class and the models predict class probabilities, as follows:
# (1) Full set
# create a classification task
task_f <- makeClassifTask(id = "Credit_Default_F", data = train_set,
target = "default",
positive = "Yes")
task_f
## Observations: 20000
## Features:
## numerics factors ordered functionals
## 23 0 0 0
## Missings: FALSE
## Has weights: FALSE
## Has blocking: FALSE
## Has coordinates: FALSE
## Classes: 2
## No Yes
## 15590 4410
## Positive class: Yes
# create a svm learner - with predict.type = "prob" for plotting the training
svm_lrn_f <- makeLearner("classif.svm", id="svm_full",
kernel="radial", predict.type = "prob")
svm_lrn_f
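The learner is then trained on the full task; a minimal sketch of the training step, mirroring the reduced-model chunk further below (the seed value is an assumption):
# sketch: train the full-set svm classifier
set.seed(500)
svm_mod_f <- train(svm_lrn_f, task_f)
svm_mod_f$learner.model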
##
## Call:
## svm.default(x = d$data, y = d$target, kernel = "radial", probability = .learner$predict.type ==
## "prob")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 8882
svm_mod_f$features
## [1] "X1" "X2" "X3" "X4" "X5" "X6" "X7" "X8" "X9" "X10" "X11" "X12"
## [13] "X13" "X14" "X15" "X16" "X17" "X18" "X19" "X20" "X21" "X22" "X23"
svm_mod_f$time
## [1] 339.06
train_svm_f <- predict(svm_mod_f, task_f); train_svm_f
## time: 13.53
## id truth prob.Yes prob.No response
## 1 1 Yes 0.4350538 0.5649462 No
## 2 2 Yes 0.1618716 0.8381284 No
## 3 3 No 0.1555109 0.8444891 No
## 4 4 No 0.1576471 0.8423529 No
## 5 5 No 0.1553270 0.8446730 No
## 9 6 No 0.1530721 0.8469279 No
## ... (#rows: 20000, #cols: 5)
test_svm_f <- predict(svm_mod_f, newdata=test_set); test_svm_f
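The reduced-model task and learner are created in the same way as the full-model ones; a minimal sketch (the id strings are assumptions):
# sketch: task and learner for the reduced set
task_r <- makeClassifTask(id = "Credit_Default_R", data = train_set_r,
target = "default",
positive = "Yes")
svm_lrn_r <- makeLearner("classif.svm", id="svm_reduced",
kernel="radial", predict.type = "prob")
svm_lrn_r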
## Properties: twoclass,multiclass,numerics,factors,prob,class.weights
## Predict-Type: prob
## Hyperparameters: kernel=radial
# train the svm classifier
set.seed(500)
svm_mod_r <- train(svm_lrn_r, task_r)
svm_mod_r$learner.model
##
## Call:
## svm.default(x = d$data, y = d$target, kernel = "radial", probability = .learner$predict.type ==
## "prob")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 8963
svm_mod_r$features
svm_mod_r$time
## [1] 178.28
train_svm_r <- predict(svm_mod_r, task_r); train_svm_r
## 19572 No 0.1701972 0.8298028 No
## ... (#rows: 10000, #cols: 4)
The performance of the classifiers (without dummy encoding) is evaluated below:
cat("(1) Full set\n")
## Test set auc: 0.7287499
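The accuracy, AUC, TP-rate and TN-rate figures quoted in this section are obtained with mlr's performance() on the training and test predictions; a minimal sketch (test_svm_r, the reduced-model test prediction, is assumed to be built analogously to test_svm_f):
# sketch: evaluate the full- and reduced-model predictions; acc, auc, tpr and tnr are mlr measures
performance(train_svm_f, measures=list(acc, auc))
performance(test_svm_f, measures=list(acc, auc, tpr, tnr))
performance(train_svm_r, measures=list(acc, auc))
performance(test_svm_r, measures=list(acc, auc, tpr, tnr))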
# plot ROC and threshold
d = generateThreshVsPerfData(train_svm_f, measures = list(fpr, tpr, mmce))
plotROCCurves(d)
[Figure: ROC curve of the full SVM model (training set).]
plotThreshVsPerf(d)
[Figure: false positive rate, true positive rate and mean misclassification error vs. threshold for the full SVM model.]
## -n- 7774 2226 <NA> 20000
##
##
## Absolute confusion matrix:
## No Yes -err.- -n-
## No 7555 219 219 7774
## Yes 1764 462 1764 2226
## -err.- 1764 219 1983 NA
## -n- 9319 681 NA 20000
# performance
cat("\n")
cat("Training set accuracy: ", performance(train_svm_r, measures=acc), "\n")
[Figure: ROC curve of the reduced SVM model.]
plotThreshVsPerf(d)
[Figure: performance vs. threshold for the reduced SVM model.]
The testing results are as follows:

* Full model - TP rate: 0.3, TN rate: 0.96, Accuracy: 0.8212, AUC: 0.7145
* Reduced model - TP rate: 0.2, TN rate: 0.97, Accuracy: 0.808, AUC: 0.7187

Note: the TP rate and TN rate depend on the cutoff threshold used for making the prediction. By default,
the threshold is 0.5. If the threshold is reduced to 0.3, the results are as follows:
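In mlr the cutoff is changed with setThreshold() on a prediction object; a minimal sketch for the full-model test prediction (the object name test_svm_f_03 is only for illustration):
# sketch: re-label the test predictions using a 0.3 threshold for the positive class "Yes"
test_svm_f_03 <- setThreshold(test_svm_f, 0.3)
performance(test_svm_f_03, measures=list(acc, tpr, tnr))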
cat("(1) Full set - threshold=0.3\n")
## Test set accuracy: 0.8203
cat("(2) Reduced set - threshold=0.3\n")
kernel="radial", predict.type = "prob")
svm_lrn_fc
##
## Call:
## svm.default(x = d$data, y = d$target, kernel = "radial", probability = .learner$predict.type ==
## "prob")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 8777
svm_mod_fc$features
## [1] "X1" "X6" "X7" "X8" "X9" "X10" "X11" "X12" "X13" "X14"
## [11] "X15" "X16" "X17" "X18" "X19" "X20" "X21" "X22" "X23" "X2_1"
## [21] "X2_2" "X3_0" "X3_1" "X3_2" "X3_3" "X3_4" "X3_5" "X3_6" "X4_0" "X4_1"
## [31] "X4_2" "X4_3"
svm_mod_fc$time
## [1] 437.62
train_svm_fc <- predict(svm_mod_fc, task_fc); train_svm_fc
## predict.type: prob
## threshold: No=0.50,Yes=0.50
## time: 8.25
## truth prob.Yes prob.No response
## 21299 No 0.1494978 0.8505022 No
## 8619 No 0.1588606 0.8411394 No
## 11730 No 0.1580871 0.8419129 No
## 3266 Yes 0.1577301 0.8422699 No
## 15740 No 0.1570811 0.8429189 No
## 19572 No 0.1548008 0.8451992 No
## ... (#rows: 10000, #cols: 4)
# (2) Reduced set - dummy encoding
task_rc <- makeClassifTask(id = "Credit_Default_RC", data = train_set_rc,
target = "default",
positive = "Yes")
task_rc
##
## Call:
## svm.default(x = d$data, y = d$target, kernel = "radial", probability = .learner$predict.type ==
## "prob")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 8902
svm_mod_rc$features
svm_mod_rc$time
## [1] 268.22
train_svm_rc <- predict(svm_mod_rc, task_rc); train_svm_rc
## Relative confusion matrix (normalized by row/column):
## predicted
## true No Yes -err.- -n-
## No 0.97/0.83 0.03/0.28 0.03 18092
## Yes 0.69/0.17 0.31/0.72 0.69 1908
## -err.- 0.17 0.28 0.18 <NA>
## -n- 15590 4410 <NA> 20000
##
##
## Absolute confusion matrix:
## No Yes -err.- -n-
## No 15064 526 526 15590
## Yes 3028 1382 3028 4410
## -err.- 3028 526 3554 NA
## -n- 18092 1908 NA 20000
calculateConfusionMatrix(test_svm_fc, relative = TRUE, sums = TRUE)
[Figure: ROC curve.]
plotThreshVsPerf(d)
[Figure: performance vs. threshold.]
# Relative confusion matrix (normalized by row/column)
calculateConfusionMatrix(train_svm_rc, relative = TRUE, sums = TRUE)
[Figure: ROC curve.]
plotThreshVsPerf(d)
[Figure: performance vs. threshold.]
The results for the dummy-encoding models are as follows:

* Full model - TP rate: 0.29, TN rate: 0.96, Accuracy: 0.8104, AUC: 0.7159
* Reduced model - TP rate: 0.19, TN rate: 0.97, Accuracy: 0.807, AUC: 0.7121

The performance of the dummy-encoding models is slightly worse than before, which may be due to the
increased number of dummy variables. So, we will use the original variables instead of dummy encoding in the
following analysis.
The following code generates sample plots of the decision boundary of the SVM classifier.
# Visualizing the prediction
plotLearnerPrediction(svm_lrn_f, task = task_f,
features =c("X1", "X23"))
[Figure: SVM decision boundary on features X1 and X23; train mmce = 0.2205, CV mmce = 0.2205.]
[Figure: SVM decision boundary for the reduced model on a feature pair including history; train mmce = 0.1963, CV mmce = 0.1975.]
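The second boundary plot above comes from an analogous call on the reduced model; a sketch (the exact feature pair is an assumption; history is one of the two plotted features):
# sketch: decision boundary of the reduced-model svm on two features
plotLearnerPrediction(svm_lrn_r, task = task_r,
features =c("X1", "history"))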
Tuning model
Due to the long model training time, we only perform a grid search for a better cost for the
reduced model, using 3-fold cross-validation.
## Define the resampling strategy
set.seed(800)
rdesc <- makeResampleDesc(method = "CV", iters=3)
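The grid search itself is run with tuneParams(); a minimal sketch of the tuning call (the parameter bounds and grid resolution are assumptions, chosen to be consistent with the reported optimum cost = 1.075):
# sketch: grid search over cost for the reduced-model svm with 3-fold CV
ps <- makeParamSet(makeNumericParam("cost", lower = 0.1, upper = 2))
ctrl <- makeTuneControlGrid(resolution = 5L)
res_svm <- tuneParams(svm_lrn_r, task = task_r, resampling = rdesc,
par.set = ps, control = ctrl,
measures = list(auc, acc, mmce))
res_svm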
## Tune result:
## Op. pars: cost=1.075
## auc.test.mean=0.7179790,acc.test.mean=0.8006500,mmce.test.mean=0.1993500
res_svm$x
## $cost
## [1] 1.075
res_svm$y
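The tuning-trace plots p1 and p2 used below can be generated from the tuning result with generateHyperParsEffectData() and plotHyperParsEffect(); a sketch (the axis choices are assumptions):
# sketch: extract the tuning trace and plot AUC and accuracy against iteration
hpe <- generateHyperParsEffectData(res_svm)
p1 <- plotHyperParsEffect(hpe, x = "iteration", y = "auc.test.mean", plot.type = "line")
p2 <- plotHyperParsEffect(hpe, x = "iteration", y = "acc.test.mean", plot.type = "line")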
# plot graphs
grid.arrange(p1 + ggtitle("Tuning performance") + ylab("AUC"),
p2 + ggtitle("Tuning performance") + ylab("Accuracy"), nrow=1)
[Figure: "Tuning performance" - AUC and accuracy against tuning iteration.]
From the result, the optimal cost (1.075) is essentially unchanged from the default of 1.0, so it seems that the model cannot be further
improved by tuning.
The tuned cost is nevertheless used to retrain the model, and the performance of the tuned SVM classifier (reduced model) is then
evaluated as follows:
# update the hyperparameter of reduced model
svm_lrn_r_tuned <- setHyperPars(svm_lrn_r, par.vals = res_svm$x)
svm_lrn_r_tuned$id <- "svm_r_tuned"
svm_lrn_r_tuned$par.vals #after update
## $kernel
## [1] "radial"
##
## $cost
## [1] 1.075
svm_mod_r_tuned <- train(svm_lrn_r_tuned, task_r)
train_svm_r_tuned <- predict(svm_mod_r_tuned, task_r); train_svm_r_tuned
## -err.- 1762 220 1982 NA
## -n- 9316 684 NA 20000
cat("Training set accuracy: ", performance(train_svm_r_tuned, measures=acc), "\n")
[Figure: ROC curve of the tuned reduced SVM model.]
plotThreshVsPerf(d)
[Figure: performance vs. threshold for the tuned reduced SVM model.]
The results show that tuning the cost cannot further improve the accuracy of the SVM classifier, and the
full model performed better than the reduced model. Although the out-of-sample accuracy of the full and reduced
models is over 80%, the TP rate at threshold = 0.5 is only about 30% and 20%, respectively. From the ROC
curve of the full model, the balanced TP/TN rate is below 0.7. Overall, the performance of the
present SVM models is not very good; more intensive tuning is required to further improve the accuracy and
robustness of the models.
Neural Network
Basic neural network
A single-layer neural network with 5 hidden neurons is built for the problem below. In addition, we test the effect of
applying dummy encoding to the categorical variables on the performance of the neural network model. In view
of the fast training of the neural network, only the full dataset is used.
# (1) Full set
# create a neural network learner - 5 neurons
nn_lrn_f <- makeLearner("classif.nnet", id="nn-full", size=5, maxit=10000L,
predict.type = "prob")
nn_lrn_f
# train the neural network classifier
set.seed(500)
nn_mod_f <- train(nn_lrn_f, task_f)
nn_mod_f$learner.model
nn_mod_f$features
## [1] "X1" "X2" "X3" "X4" "X5" "X6" "X7" "X8" "X9" "X10" "X11" "X12"
## [13] "X13" "X14" "X15" "X16" "X17" "X18" "X19" "X20" "X21" "X22" "X23"
nn_mod_f$time
## [1] 12.15
train_nn_f <- predict(nn_mod_f, task_f); train_nn_f
## Class: classif.nnet
## Properties: twoclass,multiclass,numerics,factors,prob,weights
## Predict-Type: prob
## Hyperparameters: size=5,maxit=10000
# train the neural network classifier
set.seed(500)
nn_mod_fc <- train(nn_lrn_f, task_fc)
nn_mod_fc$learner.model
nn_mod_fc$features
## [1] "X1" "X6" "X7" "X8" "X9" "X10" "X11" "X12" "X13" "X14"
## [11] "X15" "X16" "X17" "X18" "X19" "X20" "X21" "X22" "X23" "X2_1"
## [21] "X2_2" "X3_0" "X3_1" "X3_2" "X3_3" "X3_4" "X3_5" "X3_6" "X4_0" "X4_1"
## [31] "X4_2" "X4_3"
nn_mod_fc$time
## [1] 14.95
train_nn_fc <- predict(nn_mod_fc, task_fc); train_nn_fc
# Relative confusion matrix (normalized by row/column)
calculateConfusionMatrix(train_nn_f, relative = TRUE, sums = TRUE)
[Figure: ROC curve of the neural network model.]
plotThreshVsPerf(d)
[Figure: performance vs. threshold for the neural network model.]
## Relative confusion matrix (normalized by row/column):
## predicted
## true No Yes -err.- -n-
## No 0.95/0.84 0.05/0.33 0.05 17539
## Yes 0.62/0.16 0.38/0.67 0.62 2461
## -err.- 0.16 0.33 0.18 <NA>
## -n- 15590 4410 <NA> 20000
##
##
## Absolute confusion matrix:
## No Yes -err.- -n-
## No 14783 807 807 15590
## Yes 2756 1654 2756 4410
## -err.- 2756 807 3563 NA
## -n- 17539 2461 NA 20000
calculateConfusionMatrix(test_nn_fc, relative = TRUE, sums = TRUE)
[Figure: ROC curve of the neural network model with dummy encoding.]
plotThreshVsPerf(d)
[Figure: performance vs. threshold for the neural network model with dummy encoding.]
The testing results are as follows:

* Full model - TP rate: 0.35, TN rate: 0.95, Accuracy: 0.8196, AUC: 0.7693
* Full model (dummy encoding) - TP rate: 0.38, TN rate: 0.95, Accuracy: 0.821, AUC: 0.7665

Applying dummy encoding slightly increases the out-of-sample TP rate from 0.35 to 0.38 and the accuracy
from 0.8196 to 0.821. The ROC curves of both models are similar. The out-of-sample accuracy of
the basic neural network is slightly lower than that of the SVM model. To compare with the SVM model, we will
approximate the categorical variables by numerical variables without using dummy encoding.
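The hidden-layer size is tuned with the same tuneParams() workflow used for the SVM; a minimal sketch (the candidate sizes and the reuse of rdesc are assumptions, consistent with the reported optimum size = 4):
# sketch: grid search over the number of hidden neurons with 3-fold CV
ps_nn <- makeParamSet(makeDiscreteParam("size", values = 3:7))
set.seed(500)
res_nn <- tuneParams(nn_lrn_f, task = task_f, resampling = rdesc,
par.set = ps_nn, control = makeTuneControlGrid(),
measures = list(auc, acc, mmce))
res_nn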
## Tune result:
## Op. pars: size=4
## auc.test.mean=0.7649980,acc.test.mean=0.8153499,mmce.test.mean=0.1846501
res_nn$x
## $size
## [1] 4
res_nn$y
## Model for learner.id=nn_tuned; learner.class=classif.nnet
## Trained on: task.id = Credit_Default_F; obs = 20000; features = 23
## Hyperparameters: size=4,maxit=10000
# prediction
train_nn_tuned <- predict(nn_mod_tuned, task_f)
test_nn_tuned <- predict(nn_mod_tuned, newdata = test_set)
cat("Test set auc: ", performance(test_nn_tuned, measures=auc), "\n")
[Figure: ROC curve of the tuned neural network model.]
plotThreshVsPerf(d)
[Figure: performance vs. threshold for the tuned neural network model.]
The testing results are as follows:

* Full model (tuned) - TP rate: 0.35, TN rate: 0.95, Accuracy: 0.8196, AUC: 0.7693

Tuning cannot further improve the accuracy of the model.
# 3-fold cross-validation
rdesc.outer <- makeResampleDesc("CV", iters = 3)
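The learner list, task and measures passed to benchmark() below are defined as follows; a minimal sketch (these definitions are assumptions, based on the learner ids nn-full and svm_full and the measures that appear in the results):
# sketch: learners, task and measures for the benchmark experiment
lrns <- list(svm_lrn_f, nn_lrn_f)
task <- task_f
ms <- list(acc, mmce, ber, auc)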
# perform benchmark
set.seed(500)
bmr <- benchmark(lrns, tasks = task, resampling =
rdesc.outer, measures = ms)
bmr
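The ROC curves and the threshold plot that follow are based on threshold-vs-performance data generated from the benchmark result; a minimal sketch (the object name df matches the plotThreshVsPerf() call below):
# sketch: threshold-vs-performance data from the benchmark result
df <- generateThreshVsPerfData(bmr, measures = list(fpr, tpr, mmce))
plotROCCurves(df)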
[Figure: ROC curves of nn-full and svm_full from the benchmark experiment.]
plotThreshVsPerf(df) +
theme(strip.text.x = element_text(size = 7))
[Figure: performance vs. threshold for nn-full and svm_full.]
# density plots
perf <- getBMRPerformances(bmr, as.df=TRUE)
p1<-ggplot(perf, aes(acc, colour = learner.id)) +
geom_density() +
labs(title="Accuracy")
p2<-ggplot(perf, aes(mmce, colour = learner.id)) +
geom_density() +
labs(title="Mean misclassification rate")
p3<-ggplot(perf, aes(ber, colour = learner.id)) +
geom_density() +
labs(title="Balanced error rate")
grid.arrange(p1,p2,p3,ncol=1)
[Figure: density plots of accuracy, mean misclassification rate and balanced error rate for svm_full and nn-full.]
From the results of the benchmark experiment, the neural network performed slightly better than
the SVM classifier in all measures. Overall, the results of both classifiers are not highly satisfactory, with a 33%
- 37% balanced error rate. More effective machine learning algorithms are required for this credit card default
problem.