Assignment 2 Sol
The model intercept is 1.537386 on the log-odds scale, which corresponds to a high baseline
probability of default (about 82%) when all predictors are zero.
Age
Older customers are a bit less likely to default on their loans. For each additional year
of age, the log-odds of defaulting decrease by 0.014793.
Number of Dependents
Having more dependents slightly increases the chance of defaulting. For each
additional dependent, the log-odds of defaulting increase by 0.039196.
Monthly Income
Higher monthly income slightly reduces the chance of defaulting. For each additional
thousand INR of income, the log-odds of defaulting decrease by 0.005679.
Salary Date
Getting paid earlier in the month significantly lowers the risk of defaulting. For each
unit increase in the salary date fraction, the log-odds of defaulting decrease by
0.453885.
Loan Tenure
Longer loan terms significantly increase the risk of defaulting. For each additional
year of loan tenure, the log-odds of defaulting increase by 0.725872.
Down Payment
Making a larger down payment significantly reduces the risk of defaulting. For each
unit increase in the down payment fraction, the log-odds of defaulting decrease by
1.073405.
Profession
Business people are slightly more likely to default compared to professionals. The
log-odds of defaulting increase by 0.269408 for business people.
Education Level
Customers with only high school education (HSC) are a bit more likely to default
compared to undergraduates. The log-odds of defaulting increase by 0.105550 for
HSC qualified customers. Those with a postgraduate education are less likely to
default, with the log-odds decreasing by 0.346036.
Gender
Men are slightly more likely to default than women. The log-odds of defaulting
increase by 0.237154 for men.
Post-Dated Checks
Customers who provide full post-dated checks are much less likely to default. The
log-odds of defaulting decrease by 1.280570.
Refrigerator and Washing Machine Ownership
Owning household appliances slightly reduces the risk of defaulting. The log-odds of
defaulting decrease by 0.190968 for refrigerator owners and by 0.154021 for washing
machine owners.
Key Takeaways
Older age (-0.014793), higher income (-0.005679), earlier salary date (-0.453885),
larger down payment (-1.073405), postgraduate education (-0.346036), and
ownership of household appliances (refrigerator: -0.190968, washing machine: -
0.154021) are all factors that reduce the likelihood of loan default.
More dependents (0.039196) and being a business person (0.269408) slightly increase the
risk of defaulting, while longer loan terms (0.725872) increase it substantially.
Providing full post-dated checks (-1.280570) is a strong indicator of a lower risk of
default.
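For interpretation, the log-odds coefficients above can be converted to odds ratios with exp(). A minimal sketch, assuming the fitted model object is named lg.mod.1 as in the code section at the end:
# Convert log-odds coefficients to odds ratios (assumes the fitted glm object lg.mod.1)
odds_ratios=exp(coef(lg.mod.1))
round(odds_ratios, 3)
# e.g. exp(-1.073405) is about 0.342, so a one-unit increase in the down payment
# fraction multiplies the odds of default by roughly 0.34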
Code and output-
Call:
glm(formula = DefaulterFlag ~ AGE + NOOFDEPE + MTHINCTH + SALDATFR +
TENORYR + DWNPMFR + PROFBUS + QUALHSC + QUAL_PG + SEXCODE +
FULLPDC + FRICODE + WASHCODE, family = binomial, data = trainData)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.537386 0.128123 11.999 < 2e-16 ***
AGE -0.014793 0.001605 -9.218 < 2e-16 ***
NOOFDEPE 0.039196 0.010093 3.884 0.000103 ***
MTHINCTH -0.005679 0.003376 -1.682 0.092525 .
SALDATFR -0.453885 0.038951 -11.653 < 2e-16 ***
TENORYR 0.725872 0.042502 17.079 < 2e-16 ***
DWNPMFR -1.073405 0.121978 -8.800 < 2e-16 ***
PROFBUS 0.269408 0.045698 5.895 3.74e-09 ***
QUALHSC 0.105550 0.038096 2.771 0.005594 **
QUAL_PG -0.346036 0.074628 -4.637 3.54e-06 ***
SEXCODE 0.237154 0.057760 4.106 4.03e-05 ***
FULLPDC -1.280570 0.033054 -38.742 < 2e-16 ***
FRICODE -0.190968 0.036351 -5.253 1.49e-07 ***
WASHCODE -0.154021 0.043326 -3.555 0.000378 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Performance Metrics
Accuracy: 0.741
Precision: 0.619
Recall (Sensitivity): 0.301
F1 Score: 0.405
AUC: 0.717
The recall is quite low at 30.1%, which means the model is missing a large number of actual
defaulters.
The precision is relatively better at 61.9%, indicating that when the model predicts a default,
it is reasonably accurate.
The balanced accuracy (0.612) takes into account both the sensitivity and specificity, giving a
more balanced view of the model performance.
There are many false negatives (313), which means many defaulters are being classified as
non-defaulters.
There are also many false positives (1184), which means many non-defaulters are being
classified as defaulters.
The model shows moderate performance with an accuracy of 74.1% and an AUC of 0.717.
However, the low recall (30.1%) is a major issue as it indicates the model's poor performance
in identifying actual defaulters.
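As a quick check, the F1 score and balanced accuracy follow directly from the caret confusion matrix. A minimal sketch, assuming conf_matrix is the confusionMatrix object shown in the output below:
# F1 and balanced accuracy recomputed from the caret confusion matrix
prec=conf_matrix$byClass['Pos Pred Value']   # 0.619
rec=conf_matrix$byClass['Sensitivity']       # 0.301
spec=conf_matrix$byClass['Specificity']      # 0.923
f1=2 * prec * rec / (prec + rec)             # ~0.405
bal_acc=(rec + spec) / 2                     # ~0.612
print(c(F1 = unname(f1), BalancedAccuracy = unname(bal_acc)))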
"AUC: 0.717266143080232"
> opt_threshold=0.5
> print(paste("Optimal Threshold:", opt_threshold))
[1] "Optimal Threshold: 0.5"
> testData$predicted_class=ifelse(testData$predicted_prob > opt_threshold, 1, 0)
> levels(testData$predicted_class)=levels(testData$DefaulterFlag)
> table(testData$predicted_class)
0 1
822 4959
> table(testData$DefaulterFlag)
0 1
1693 4088
> conf_matrix=confusionMatrix(as.factor(testData$predicted_class),
as.factor(testData$DefaulterFlag))
> print(conf_matrix)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 509 313
1 1184 3775
Accuracy : 0.741
95% CI : (0.7295, 0.7523)
No Information Rate : 0.7071
P-Value [Acc > NIR] : 5.438e-09
Kappa : 0.2638
Sensitivity : 0.30065
Specificity : 0.92343
Pos Pred Value : 0.61922
Neg Pred Value : 0.76124
Prevalence : 0.29286
Detection Rate : 0.08805
Detection Prevalence : 0.14219
Balanced Accuracy : 0.61204
'Positive' Class : 0
> table(testData$predicted_class,testData$DefaulterFlag)
0 1
0 509 313
1 1184 3775
> accuracy=conf_matrix$overall['Accuracy']
> precision=conf_matrix$byClass['Pos Pred Value']
> recall=conf_matrix$byClass['Sensitivity']
> f1_score=2 * ((precision * recall) / (precision + recall))
> print(paste("Accuracy:", accuracy))
[1] "Accuracy: 0.741048261546445"
> print(paste("Precision:", precision))
[1] "Precision: 0.619221411192214"
> print(paste("Recall:", recall))
[1] "Recall: 0.300649734199646"
> print(paste("F1 Score:", f1_score))
[1] "F1 Score: 0.404771371769384"
Q3. Suppose Auto Finance Ltd. provides a loan for 2 years. The management of Auto Finance Ltd. has estimated that the profit associated with a “True Positive” case is Rs. 6,000. Furthermore, they also estimated that the losses associated with a “False Negative” case and a “False Positive” case are Rs. 12,000 and Rs. 6,000, respectively. The profit associated with a “True Negative” case is Rs. 2,000. Provide the benefit matrix. Based on the confusion matrix obtained for the validation data set, calculate the total profit/loss for the company.
TP: 3775
TN: 509
FP: 1184
FN: 313
Benefit Matrix
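Per-case profit/loss as specified by management, laid out in the same prediction-by-reference orientation as the confusion matrix:
               Reference = 0          Reference = 1
Prediction 0   TN: Rs.  2,000         FN: Rs. -12,000
Prediction 1   FP: Rs. -6,000         TP: Rs.  6,000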
profit_TP=6000
> loss_FP=-6000
> loss_FN=-12000
> profit_TN=2000
> TP=3775
> TN=509
> FP=1184
> FN=313
> total_profit_loss=(TP * profit_TP) + (FP * loss_FP) + (FN * loss_FN) + (TN * profit_TN)
> print(paste("Total Profit/Loss:", total_profit_loss))
[1] "Total Profit/Loss: 12808000"
Using an alternative approach to determine the threshold that maximizes profit, we identified an
optimal threshold of 0.37, which yields a maximum profit of Rs. 14,520,000. This is a substantial
improvement over the Rs. 12,808,000 obtained at the default threshold of 0.5.
At this threshold, the model's performance metrics are as follows: an accuracy of 71.93%,
sensitivity (recall) of 6.56%, specificity of 98.99%, precision of 73.03%, and balanced
accuracy of 52.78%. While the model achieves high specificity and reasonable precision, the
sensitivity is very low, indicating that the model misses many actual defaulters. This is a
significant risk, since a missed defaulter carries the largest per-case loss in the benefit
matrix (Rs. 12,000 per false negative).
From a managerial perspective, this threshold allows for a selective loan approval strategy,
ensuring that those approved are less likely to default, thus reducing overall risk exposure.
However, due to the low sensitivity, many potential defaulters are not identified,
necessitating additional monitoring and intervention strategies for high-risk customers.
The model's high specificity gives confidence in identifying non-defaulters, allowing secure
loan approvals, but the low sensitivity means enhanced monitoring and proactive engagement
with potential defaulters are crucial to mitigate risk. This approach focuses on profit
maximization by approving fewer, but more reliable, loans. The threshold should be reviewed
continuously and adjusted as economic conditions and customer behavior change in order to
maintain financial performance.
Reference
Prediction 0 1
0 111 41
1 1582 4047
Accuracy : 0.7193
95% CI : (0.7075, 0.7308)
No Information Rate : 0.7071
P-Value [Acc > NIR] : 0.02196
Kappa : 0.0757
Sensitivity : 0.06556
Specificity : 0.98997
Pos Pred Value : 0.73026
Neg Pred Value : 0.71896
Prevalence : 0.29286
Detection Rate : 0.01920
Detection Prevalence : 0.02629
Balanced Accuracy : 0.52777
'Positive' Class : 0
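Check at threshold 0.37: 4,047 x 6,000 + 111 x 2,000 - 1,582 x 6,000 - 41 x 12,000 = 24,282,000 + 222,000 - 9,492,000 - 492,000 = Rs. 14,520,000, matching the reported maximum.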
Logistic Regression
Accuracy: 74.1%
Precision: 61.9%
Recall: 30.1%
F1 Score: 40.5%
AUC: 0.717
Classification Tree
Accuracy: 73.1%
Precision: 55.5%
Recall: 33.2%
F1 Score: 41.5%
AUC: 0.675
Random Forest
Accuracy: 72.7%
Precision: 55.1%
Recall: 29.2%
F1 Score: 38.2%
AUC: 0.712
Logistic Regression: Best overall performance with highest accuracy (74.1%) and AUC (0.717).
Moderate precision and recall.
Classification Tree: Slightly higher recall (33.2%) than logistic regression but lower overall
performance.
Random Forest: Lowest recall (29.2%) and slightly lower accuracy. Most important predictors:
FULLPDC, DWNPMFR, MTHINCTH, TENORYR, AGE.
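To visualise the AUC gap between the three models, the ROC curves can be overlaid on one plot. A minimal sketch, assuming pred, pred_tree and pred_rf are the ROCR prediction objects created in the code section:
# Overlay ROC curves for the three models
plot(performance(pred, "tpr", "fpr"), col = "black", main = "ROC comparison")
plot(performance(pred_tree, "tpr", "fpr"), col = "blue", add = TRUE)
plot(performance(pred_rf, "tpr", "fpr"), col = "red", add = TRUE)
abline(0, 1, lty = 2)   # reference line for a random classifier
legend("bottomright",
       legend = c("Logistic (AUC 0.717)", "Tree (AUC 0.675)", "Random Forest (AUC 0.712)"),
       col = c("black", "blue", "red"), lty = 1)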
print(paste("Accuracy:", accuracy))
[1] "Accuracy: 0.741048261546445"
> print(paste("Precision:", precision))
[1] "Precision: 0.619221411192214"
> print(paste("Recall:", recall))
[1] "Recall: 0.300649734199646"
> print(paste("F1 Score:", f1_score))
[1] "F1 Score: 0.404771371769384"
tree_pred_class
0 1
996 4785
> table(testData$DefaulterFlag)
0 1
1666 4115
> conf_matrix_tree=confusionMatrix(as.factor(tree_pred_class),
as.factor(testData$DefaulterFlag))
> print(conf_matrix_tree)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 553 443
1 1113 3672
Accuracy : 0.7308
95% CI : (0.7192, 0.7422)
No Information Rate : 0.7118
P-Value [Acc > NIR] : 0.000689
Kappa : 0.2548
Sensitivity : 0.33193
Specificity : 0.89235
Pos Pred Value : 0.55522
Neg Pred Value : 0.76740
Prevalence : 0.28819
Detection Rate : 0.09566
Detection Prevalence : 0.17229
Balanced Accuracy : 0.61214
'Positive' Class : 0
> table(testData$DefaulterFlag)
0 1
1666 4115
> conf_matrix_rf=confusionMatrix(as.factor(rf_pred_class),
as.factor(testData$DefaulterFlag))
> print(conf_matrix_rf)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 487 397
1 1179 3718
Accuracy : 0.7274
95% CI : (0.7157, 0.7388)
No Information Rate : 0.7118
P-Value [Acc > NIR] : 0.004512
Kappa : 0.2276
Sensitivity : 0.29232
Specificity : 0.90352
Pos Pred Value : 0.55090
Neg Pred Value : 0.75924
Prevalence : 0.28819
Detection Rate : 0.08424
Detection Prevalence : 0.15291
Balanced Accuracy : 0.59792
'Positive' Class : 0
Using the logistic regression model, Auto Finance Ltd. can make a number of strategic
management decisions that improve the loan approval process, strengthen risk management,
segment customers more effectively, focus marketing efforts, and streamline operations.
With an AUC of 0.717 and an accuracy of 74.1%, the logistic regression model provides a
solid basis for these decisions.
Loan Approval Strategy: The main goal is to reduce defaults while still granting loans to
applicants who can repay them. The company can balance recall and precision by adjusting the
decision threshold: lowering the threshold flags more applicants as potential defaulters and
so catches more of them, at the cost of some precision. Using the predicted probabilities, a
credit scoring system can also be built to automate decisions, with low-risk applicants
approved right away and applicants in the middle range routed for additional review (a
minimal banding sketch follows below). This speeds up the approval process while keeping
the risk assessment consistent.
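A minimal sketch of such probability-based banding, assuming testData$predicted_prob holds the model's predicted default probabilities; the cut-offs 0.37 and 0.70 are illustrative only and would need business validation:
# Band applicants by predicted default probability (cut-offs are hypothetical)
testData$risk_band=cut(testData$predicted_prob,
                       breaks = c(0, 0.37, 0.70, 1),
                       labels = c("Approve", "Manual review", "Decline"),
                       include.lowest = TRUE)
table(testData$risk_band)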
Risk Management: Auto Finance Ltd. can increase monitoring of the high-risk customers
identified by the model to cut down on losses from defaults. Early-intervention tactics such
as payment reminders or loan-restructuring options can help prevent missed payments.
Diversifying the loan portfolio by balancing high- and low-risk customers makes overall risk
easier to manage. Adjusting credit limits based on risk scores, with higher limits for
low-risk customers and lower limits or additional security requirements for high-risk
customers, can reduce losses further.
Customer Segmentation: Understanding the different customer segments helps the business
tailor its products and services to each one. The predicted risk scores can be used to divide
customers into low-, medium-, and high-risk groups, and loan products can then be designed
for each group; for example, low-risk customers can be offered loans at lower interest rates,
while high-risk customers can be offered loans with flexible repayment plans. Personalized
communication strategies can also be used to build stronger relationships with low-risk
customers and to engage high-risk customers proactively to prevent defaults.
Marketing and Customer Outreach: The logistic regression model can make marketing campaigns
more effective by targeting low- and medium-risk customers who are more likely to be approved
for loans. This focused approach acquires more customers while keeping risk down. Different
risk segments can be offered different promotions; for example, low-risk customers can
receive preferential interest rates or waived processing fees, while medium-risk customers
can be retained through rewards for on-time payments and loyalty, building long-term
relationships and improving overall retention.
Operational Improvements: The logistic regression model makes it possible to set up automated
systems for loan processing and risk assessment, reducing manual work and speeding up
decision-making. Staff training on interpreting model output and using it in decisions helps
ensure the predictions are applied correctly. The model should be refreshed regularly with
new data to keep it accurate and useful, and its performance metrics should be tracked so
that the threshold and the model itself can be adjusted when needed.
Code-
set.seed(017)
library(readr)
library(dplyr)
library(caret)
d=read.csv(file.choose())
head(d)
str(d)
attach(d)
colSums(is.na(d))
data=na.omit(d)
data$DefaulterFlag=as.factor(data$DefaulterFlag)
set.seed(017)
trainIndex=createDataPartition(data$DefaulterFlag, p = 0.8, list = FALSE)
trainData=data[trainIndex,]
testData=data[-trainIndex,]
# Logistic regression model (same specification as the summary output above)
lg.mod.1=glm(DefaulterFlag ~ AGE + NOOFDEPE + MTHINCTH + SALDATFR + TENORYR +
               DWNPMFR + PROFBUS + QUALHSC + QUAL_PG + SEXCODE + FULLPDC +
               FRICODE + WASHCODE, family = binomial, data = trainData)
summary(lg.mod.1)
# Predicted default probabilities for the test set
testData$predicted_prob=predict(lg.mod.1, testData, type = "response")
install.packages("ROCR")
library(ROCR)
pred=prediction(testData$predicted_prob, testData$DefaulterFlag)
perf=performance(pred, "tpr", "fpr")
auc=performance(pred, "auc")
auc_value=auc@y.values[[1]]
print(paste("AUC:", auc_value))
opt_threshold=0.5
print(paste("Optimal Threshold:", opt_threshold))
# Classify using the chosen threshold
testData$predicted_class=ifelse(testData$predicted_prob > opt_threshold, 1, 0)
table(testData$predicted_class)
table(testData$DefaulterFlag)
conf_matrix=confusionMatrix(as.factor(testData$predicted_class),
as.factor(testData$DefaulterFlag))
print(conf_matrix)
table(testData$predicted_class,testData$DefaulterFlag)
accuracy=conf_matrix$overall['Accuracy']
precision=conf_matrix$byClass['Pos Pred Value']
recall=conf_matrix$byClass['Sensitivity']
f1_score=2 * ((precision * recall) / (precision + recall))
print(paste("Accuracy:", accuracy))
print(paste("Precision:", precision))
print(paste("Recall:", recall))
print(paste("F1 Score:", f1_score))
profit_TP=6000
loss_FP=-6000
loss_FN=-12000
profit_TN=2000
TP=3775
TN=509
FP=1184
FN=313
## Question4-:
library(ROCR)
library(caret)
profit_TP=6000
loss_FP= -6000
loss_FN=-12000
profit_TN=2000
thresholds=seq(0, 1, by = 0.01)
profits_losses=numeric(length(thresholds))
for (i in seq_along(thresholds)) {
  threshold=thresholds[i]
  predicted_class=ifelse(testData$predicted_prob > threshold, 1, 0)
  # Confusion counts at this threshold (rows = predicted, columns = actual)
  cm=table(factor(predicted_class, levels = c(0, 1)),
           factor(testData$DefaulterFlag, levels = c(0, 1)))
  TP=cm[2, 2]
  TN=cm[1, 1]
  FP=cm[2, 1]
  FN=cm[1, 2]
  profits_losses[i]=(TP * profit_TP) + (FP * loss_FP) + (FN * loss_FN) + (TN * profit_TN)
}
optimal_threshold=thresholds[which.max(profits_losses)]
max_profit_loss=max(profits_losses)
print(paste("Optimal Threshold:", optimal_threshold))
print(paste("Maximum Profit/Loss:", max_profit_loss))
# Re-classify at the optimal threshold and evaluate
testData$predicted_class_opt=ifelse(testData$predicted_prob > optimal_threshold, 1, 0)
testData$predicted_class_opt=as.factor(testData$predicted_class_opt)
testData$DefaulterFlag=as.factor(testData$DefaulterFlag)
levels(testData$predicted_class_opt)=levels(testData$DefaulterFlag)
conf_matrix_opt=confusionMatrix(testData$predicted_class_opt, testData$DefaulterFlag)
print(conf_matrix_opt)
TP_opt=conf_matrix_opt$table[2, 2]
TN_opt=conf_matrix_opt$table[1, 1]
FP_opt=conf_matrix_opt$table[2, 1]
FN_opt=conf_matrix_opt$table[1, 2]
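The extracted counts can then be plugged back into the same profit formula to confirm the maximum; a short sketch using the variables defined above:
profit_opt=(TP_opt * profit_TP) + (FP_opt * loss_FP) + (FN_opt * loss_FN) + (TN_opt * profit_TN)
print(paste("Profit/Loss at optimal threshold:", profit_opt))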
library(readr)
library(dplyr)
library(caret)
library(rpart)
library(randomForest)
library(ROCR)
set.seed(017)
data$DefaulterFlag = as.factor(data$DefaulterFlag)
trainIndex = createDataPartition(data$DefaulterFlag, p = 0.8, list = FALSE)
trainData = data[trainIndex,]
testData = data[-trainIndex,]
tree_model = rpart(DefaulterFlag ~ AGE + NOOFDEPE + MTHINCTH + SALDATFR +
TENORYR + DWNPMFR +
PROFBUS + QUALHSC + QUAL_PG + SEXCODE + FULLPDC +
FRICODE + WASHCODE,
data = trainData, method = "class")
tree_pred_prob = predict(tree_model, testData, type = "prob")[, 2]
tree_pred_class = predict(tree_model, testData, type = "class")
table(tree_pred_class)
table(testData$DefaulterFlag)
conf_matrix_tree = confusionMatrix(as.factor(tree_pred_class),
                                   as.factor(testData$DefaulterFlag))
print(conf_matrix_tree)
pred_tree = prediction(tree_pred_prob, testData$DefaulterFlag)
perf_tree = performance(pred_tree, "tpr", "fpr")
auc_tree = performance(pred_tree, "auc")
auc_value_tree = auc_tree@y.values[[1]]
print(paste("AUC for Classification Tree:", auc_value_tree))
# Fit the random forest on the same predictors (package defaults)
rf_model = randomForest(DefaulterFlag ~ AGE + NOOFDEPE + MTHINCTH + SALDATFR +
                          TENORYR + DWNPMFR + PROFBUS + QUALHSC + QUAL_PG +
                          SEXCODE + FULLPDC + FRICODE + WASHCODE,
                        data = trainData)
rf_pred_prob = predict(rf_model, testData, type = "prob")[, 2]
rf_pred_class = predict(rf_model, testData, type = "response")
table(rf_pred_class)
table(testData$DefaulterFlag)
conf_matrix_rf = confusionMatrix(as.factor(rf_pred_class),
                                 as.factor(testData$DefaulterFlag))
print(conf_matrix_rf)
pred_rf = prediction(rf_pred_prob, testData$DefaulterFlag)
perf_rf = performance(pred_rf, "tpr", "fpr")
auc_rf = performance(pred_rf, "auc")
auc_value_rf = auc_rf@y.values[[1]]
print(paste("AUC for Random Forest:", auc_value_rf))
importance(rf_model)
varImpPlot(rf_model)
rf_accuracy=conf_matrix_rf$overall['Accuracy']
rf_precision=conf_matrix_rf$byClass['Pos Pred Value']
rf_recall=conf_matrix_rf$byClass['Sensitivity']
rf_f1_score=2 * ((rf_precision * rf_recall) / (rf_precision + rf_recall))
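To report the random forest metrics in the same format as the logistic model, the following print calls can be appended:
print(paste("RF Accuracy:", rf_accuracy))
print(paste("RF Precision:", rf_precision))
print(paste("RF Recall:", rf_recall))
print(paste("RF F1 Score:", rf_f1_score))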