0% found this document useful (0 votes)
7 views

Assignment 2 Sol

Solution of assignment Analytics R
Copyright
© © All Rights Reserved
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
7 views

Assignment 2 Sol

Solution of assignment Analytics R
Copyright
© © All Rights Reserved
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 19

Assignment – 2

Q1. Build a logistic regression model based on the


training data set to identify good customers and bad
customers. A good customer is one who has never
delayed the payment, whereas a bad customer is one
who has delayed the payment even once. Discuss the
interpretations of the model coefficients in detail.
Intercept

 The base likelihood of someone defaulting on their loan, without considering other
factors, is moderately high (log-odds of 1.537386).

Age

 Older customers are a bit less likely to default on their loans. For each additional year
of age, the log-odds of defaulting decrease by 0.014793.

Number of Dependents

 Having more dependents slightly increases the chance of defaulting. For each
additional dependent, the log-odds of defaulting increase by 0.039196.

Monthly Income

 Higher monthly income slightly reduces the chance of defaulting. For each additional
thousand INR of income, the log-odds of defaulting decrease by 0.005679.

Salary Date

 Getting paid earlier in the month significantly lowers the risk of defaulting. For each
unit increase in the salary date fraction, the log-odds of defaulting decrease by
0.453885.

Loan Tenure

 Longer loan terms significantly increase the risk of defaulting. For each additional
year of loan tenure, the log-odds of defaulting increase by 0.725872.
Down Payment

 Making a larger down payment significantly reduces the risk of defaulting. For each
unit increase in the down payment fraction, the log-odds of defaulting decrease by
1.073405.

Profession

 Business people are slightly more likely to default compared to professionals. The
log-odds of defaulting increase by 0.269408 for business people.

Education Level

 Customers with only high school education (HSC) are a bit more likely to default
compared to undergraduates. The log-odds of defaulting increase by 0.105550 for
HSC qualified customers. Those with a postgraduate education are less likely to
default, with the log-odds decreasing by 0.346036.

Gender

 Men are slightly more likely to default than women. The log-odds of defaulting
increase by 0.237154 for men.

Post-Dated Checks

 Customers who provide full post-dated checks are much less likely to default. The
log-odds of defaulting decrease by 1.280570.

Refrigerator Ownership

 Owning a refrigerator slightly reduces the risk of defaulting. The log-odds of


defaulting decrease by 0.190968.

Washing Machine Ownership

 Owning a washing machine slightly reduces the risk of defaulting. The log-odds of
defaulting decrease by 0.154021.

Key Takeaways

 Older age (-0.014793), higher income (-0.005679), earlier salary date (-0.453885),
larger down payment (-1.073405), postgraduate education (-0.346036), and
ownership of household appliances (refrigerator: -0.190968, washing machine: -
0.154021) are all factors that reduce the likelihood of loan default.
 More dependents (0.039196), longer loan terms (0.725872), and being a business
person (0.269408) slightly increase the risk of defaulting.
 Providing full post-dated checks (-1.280570) is a strong indicator of a lower risk of
default.
Code and output-

Call:
glm(formula = DefaulterFlag ~ AGE + NOOFDEPE + MTHINCTH + SALDATFR +
TENORYR + DWNPMFR + PROFBUS + QUALHSC + QUAL_PG + SEXCODE +
FULLPDC + FRICODE + WASHCODE, family = binomial, data = trainData)

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.537386 0.128123 11.999 < 2e-16 ***
AGE -0.014793 0.001605 -9.218 < 2e-16 ***
NOOFDEPE 0.039196 0.010093 3.884 0.000103 ***
MTHINCTH -0.005679 0.003376 -1.682 0.092525 .
SALDATFR -0.453885 0.038951 -11.653 < 2e-16 ***
TENORYR 0.725872 0.042502 17.079 < 2e-16 ***
DWNPMFR -1.073405 0.121978 -8.800 < 2e-16 ***
PROFBUS 0.269408 0.045698 5.895 3.74e-09 ***
QUALHSC 0.105550 0.038096 2.771 0.005594 **
QUAL_PG -0.346036 0.074628 -4.637 3.54e-06 ***
SEXCODE 0.237154 0.057760 4.106 4.03e-05 ***
FULLPDC -1.280570 0.033054 -38.742 < 2e-16 ***
FRICODE -0.190968 0.036351 -5.253 1.49e-07 ***
WASHCODE -0.154021 0.043326 -3.555 0.000378 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 27726 on 23124 degrees of freedom


Residual deviance: 24786 on 23111 degrees of freedom
AIC: 24814

Number of Fisher Scoring iterations: 4

Q2. Judge the performance of the model based on the


validation data set, after identifying the optimal
threshold point. Is the performance of the model
satisfactory? Consider at least two criteria.

 True Positives: 3775


 True Negatives: 509
 False Positives: 1184
 False Negatives: 313

Performance Metrics

 Accuracy: 0.741
 Precision: 0.619
 Recall (Sensitivity): 0.301
 F1 Score: 0.405
 AUC: 0.717

The recall is quite low at 30.1%, which means the model is missing a large number of actual
defaulters.
The precision is relatively better at 61.9%, indicating that when the model predicts a default,
it is reasonably accurate.
The balanced accuracy (0.612) takes into account both the sensitivity and specificity, giving a
more balanced view of the model performance.
There are many false negatives (313), which means many defaulters are being classified as
non-defaulters.
There are also many false positives (1184), which means many non-defaulters are being
classified as defaulters
The model shows moderate performance with an accuracy of 74.1% and an AUC of 0.717.
However, the low recall (30.1%) is a major issue as it indicates the model's poor performance
in identifying actual defaulters.

Code and output-

"AUC: 0.717266143080232"
>
>
>
>
> opt_threshold=0.5
> print(paste("Optimal Threshold:", opt_threshold))
[1] "Optimal Threshold: 0.5"
>
>
> testData$predicted_class=ifelse(testData$predicted_prob > opt_threshold, 1, 0)
> levels(testData$predicted_class)=levels(testData$DefaulterFlag)
>
> table(testData$predicted_class)

0 1
822 4959
> table(testData$DefaulterFlag)

0 1
1693 4088
> conf_matrix=confusionMatrix(as.factor(testData$predicted_class),
as.factor(testData$DefaulterFlag))
> print(conf_matrix)
Confusion Matrix and Statistics

Reference
Prediction 0 1
0 509 313
1 1184 3775

Accuracy : 0.741
95% CI : (0.7295, 0.7523)
No Information Rate : 0.7071
P-Value [Acc > NIR] : 5.438e-09

Kappa : 0.2638

Mcnemar's Test P-Value : < 2.2e-16

Sensitivity : 0.30065
Specificity : 0.92343
Pos Pred Value : 0.61922
Neg Pred Value : 0.76124
Prevalence : 0.29286
Detection Rate : 0.08805
Detection Prevalence : 0.14219
Balanced Accuracy : 0.61204

'Positive' Class : 0

> table(testData$predicted_class,testData$DefaulterFlag)

0 1
0 509 313
1 1184 3775
>
> accuracy=conf_matrix$overall['Accuracy']
> precision=conf_matrix$byClass['Pos Pred Value']
> recall=conf_matrix$byClass['Sensitivity']
> f1_score=2 * ((precision * recall) / (precision + recall))
>
> print(paste("Accuracy:", accuracy))
[1] "Accuracy: 0.741048261546445"
> print(paste("Precision:", precision))
[1] "Precision: 0.619221411192214"
> print(paste("Recall:", recall))
[1] "Recall: 0.300649734199646"
> print(paste("F1 Score:", f1_score))
[1] "F1 Score: 0.404771371769384"
Q3. Suppose Auto Finance Ltd. provides a loan for 2
years. The management of Auto Finance Ltd. has
estimated that the profit associated with a “True
Positive” case is Rs. 6000. Furthermore, they also
estimated that the losses associated with a “False
Negative” case and a “False Positive” case are Rs.
12000 and Rs. 6000, respectively. The profit
associated with a “True Negative” case is Rs. 2000.
Provide the benefit matrix. Based on the confusion
matrix obtained for the validation data set,
calculate the total profit/loss for the company.

 TP: 3775
 TN: 509
 FP: 1184
 FN: 313
Benefit Matrix

 Profit for True Positive: Rs. 6000


 Loss for False Positive: -Rs. 6000
 Loss for False Negative: -Rs. 12000
 Profit for True Negative: Rs. 2000

Code and output-

Total Profit/Loss: 12808000

profit_TP=6000
> loss_FP=-6000
> loss_FN=-12000
> profit_TN=2000
>
>
> TP=3775
> TN=509
> FP=1184
> FN=313
>
>
> total_profit_loss=(TP * profit_TP) + (FP * loss_FP) + (FN * loss_FN) + (TN * profit_TN)
> print(paste("Total Profit/Loss:", total_profit_loss))
[1] "Total Profit/Loss: 12808000"

Q4. Based on the benefit matrix, can you provide an


alternative approach for deciding the optimal
threshold point? Does the optimal threshold point
differ from the one identified earlier? If yes, recalculate
the confusion matrix and profit/loss.

Using an alternative approach to determine the optimal threshold for maximizing profit/loss,
we identified an optimal threshold of 0.37, which yields a maximum profit/loss of Rs.
14,520,000. This approach significantly improves financial performance compared to a
default threshold.

At this threshold, the model's performance metrics are as follows: an accuracy of 71.93%,
sensitivity (recall) of 6.56%, specificity of 98.99%, precision of 73.03%, and balanced
accuracy of 52.78%. While the model achieves high specificity and reasonable precision, the
sensitivity is very low, indicating that the model misses many actual defaulters. This poses a
significant risk as these missed defaulters may default on their loans.

From a managerial perspective, this threshold allows for a selective loan approval strategy,
ensuring that those approved are less likely to default, thus reducing overall risk exposure.
However, due to the low sensitivity, many potential defaulters are not identified,
necessitating additional monitoring and intervention strategies for high-risk customers.

The model’s high specificity ensures confident identification of non-defaulters, allowing for
secure loan approvals. However, given the low sensitivity, enhanced monitoring and
proactive engagement with potential defaulters are crucial to mitigate risk. This approach
focuses on profit maximization by approving fewer, but more reliable loans. Continuous
monitoring and potential adjustments of the threshold based on economic conditions and
customer behavior will help maintain optimal financial performance.

In conclusion, while the threshold of 0.37 maximizes profitability, it requires careful


management of the risk posed by missed defaulters. Auto Finance Ltd. should balance the
profit-maximizing benefits of this threshold with strategies to manage and monitor potential
risks, ensuring a holistic approach to loan approval and risk management.

Confusion Matrix and Statistics

Reference
Prediction 0 1
0 111 41
1 1582 4047

Accuracy : 0.7193
95% CI : (0.7075, 0.7308)
No Information Rate : 0.7071
P-Value [Acc > NIR] : 0.02196

Kappa : 0.0757

Mcnemar's Test P-Value : < 2e-16

Sensitivity : 0.06556
Specificity : 0.98997
Pos Pred Value : 0.73026
Neg Pred Value : 0.71896
Prevalence : 0.29286
Detection Rate : 0.01920
Detection Prevalence : 0.02629
Balanced Accuracy : 0.52777

'Positive' Class : 0

"Total Profit/Loss for the optimal threshold: 14520000"


> Optimal Threshold based on Profit/Loss: 0.37"
> print(paste("Maximum Profit/Loss:", max_profit_loss))
[1] "Maximum Profit/Loss: 14520000"

Q 5. Build now a classification tree and random forest


model in the same context. Judge the performances
of these models as compared to the fitted logistic
regression model above. Identify the important
predictors. Compare and contrast the same with your
observation earlier

Logistic Regression

 Accuracy: 74.1%
 Precision: 61.9%
 Recall: 30.1%
 F1 Score: 40.5%
 AUC: 0.717

Classification Tree

 Accuracy: 73.1%
 Precision: 55.5%
 Recall: 33.2%
 F1 Score: 41.5%
 AUC: 0.675

Random Forest

 Accuracy: 72.7%
 Precision: 55.1%
 Recall: 29.2%
 F1 Score: 38.2%
 AUC: 0.712

Logistic Regression: Best overall performance with highest accuracy (74.1%) and AUC (0.717).
Moderate precision and recall.
Classification Tree: Slightly higher recall (33.2%) than logistic regression but lower overall
performance.
Random Forest: Lowest recall (29.2%) and slightly lower accuracy. Most important predictors:
FULLPDC, DWNPMFR, MTHINCTH, TENORYR, AGE.

Code and output-

print(paste("Accuracy:", accuracy))
[1] "Accuracy: 0.741048261546445"
> print(paste("Precision:", precision))
[1] "Precision: 0.619221411192214"
> print(paste("Recall:", recall))
[1] "Recall: 0.300649734199646"
> print(paste("F1 Score:", f1_score))
[1] "F1 Score: 0.404771371769384"
tree_pred_class
0 1
996 4785
> table(testData$DefaulterFlag)

0 1
1666 4115
> conf_matrix_tree=confusionMatrix(as.factor(tree_pred_class),
as.factor(testData$DefaulterFlag))
> print(conf_matrix_tree)
Confusion Matrix and Statistics

Reference
Prediction 0 1
0 553 443
1 1113 3672

Accuracy : 0.7308
95% CI : (0.7192, 0.7422)
No Information Rate : 0.7118
P-Value [Acc > NIR] : 0.000689

Kappa : 0.2548

Mcnemar's Test P-Value : < 2.2e-16

Sensitivity : 0.33193
Specificity : 0.89235
Pos Pred Value : 0.55522
Neg Pred Value : 0.76740
Prevalence : 0.28819
Detection Rate : 0.09566
Detection Prevalence : 0.17229
Balanced Accuracy : 0.61214

'Positive' Class : 0

> pred_tree <- prediction(tree_pred_prob, testData$DefaulterFlag)


>
> perf_tree <- performance(pred_tree, "tpr", "fpr")
> auc_tree <- performance(pred_tree, "auc")
> auc_value_tree <- [email protected][[1]]
> print(paste("AUC for Classification Tree:", auc_value_tree))
[1] "AUC for Classification Tree: 0.675466444171836"
>
> rf_model <- randomForest(DefaulterFlag ~ AGE + NOOFDEPE + MTHINCTH +
SALDATFR + TENORYR + DWNPMFR +
+ PROFBUS + QUALHSC + QUAL_PG + SEXCODE + FULLPDC +
FRICODE + WASHCODE,
+ data = trainData, ntree = 500, importance = TRUE)
> rf_pred_prob <- predict(rf_model, testData, type = "prob")[,2]
> rf_pred_class <- predict(rf_model, testData,type="class")
>
> table(rf_pred_class)
rf_pred_class
0 1
884 4897
> table(testData$DefaulterFlag)

0 1
1666 4115
> conf_matrix_rf=confusionMatrix(as.factor(rf_pred_class),
as.factor(testData$DefaulterFlag))
> print(conf_matrix_rf)
Confusion Matrix and Statistics

Reference
Prediction 0 1
0 487 397
1 1179 3718

Accuracy : 0.7274
95% CI : (0.7157, 0.7388)
No Information Rate : 0.7118
P-Value [Acc > NIR] : 0.004512

Kappa : 0.2276

Mcnemar's Test P-Value : < 2.2e-16

Sensitivity : 0.29232
Specificity : 0.90352
Pos Pred Value : 0.55090
Neg Pred Value : 0.75924
Prevalence : 0.28819
Detection Rate : 0.08424
Detection Prevalence : 0.15291
Balanced Accuracy : 0.59792

'Positive' Class : 0

> pred_rf <- prediction(rf_pred_prob, testData$DefaulterFlag)


> perf_rf <- performance(pred_rf, "tpr", "fpr")
> auc_rf <- performance(pred_rf, "auc")
> auc_value_rf <- [email protected][[1]]
> print(paste("AUC for Random Forest:", auc_value_rf))
[1] "AUC for Random Forest: 0.711854632496984"
>
> importance(rf_model)
0 1 MeanDecreaseAccuracy MeanDecreaseGini
AGE 9.623505 11.267268 14.523652 1188.37944
NOOFDEPE 1.482156 12.704786 12.204627 609.31152
MTHINCTH -4.916054 29.425415 24.097508 1215.93278
SALDATFR -10.184515 38.832478 31.731234 416.84193
TENORYR 37.754316 44.645115 66.346723 605.85118
DWNPMFR 16.075278 29.865796 37.419267 1263.42715
PROFBUS -6.726649 20.042520 17.653610 118.17213
QUALHSC 5.062489 6.207251 8.515154 164.18279
QUAL_PG 7.893954 2.478583 6.909609 79.42708
SEXCODE 6.413266 6.126289 9.197658 108.96061
FULLPDC 114.473405 83.382018 130.298974 713.12662
FRICODE 13.466021 13.159150 21.752985 147.09237
WASHCODE 7.025977 18.756664 24.814680 125.78148
> varImpPlot(rf_model)
>
> rf_accuracy <- conf_matrix_rf$overall['Accuracy']
> rf_precision <- conf_matrix_rf$byClass['Pos Pred Value']
> rf_recall <- conf_matrix_rf$byClass['Sensitivity']
> rf_f1_score <- 2 * ((rf_precision * rf_recall) / (rf_precision + rf_recall))
>
> print(paste("Random Forest Accuracy:", rf_accuracy))
[1] "Random Forest Accuracy: 0.727382805742951"
> print(paste("Random Forest Precision:", rf_precision))
[1] "Random Forest Precision: 0.550904977375566"
> print(paste("Random Forest Recall:", rf_recall))
[1] "Random Forest Recall: 0.292316926770708"
> print(paste("Random Forest F1 Score:", rf_f1_score))
[1] "Random Forest F1 Score: 0.381960784313726"
> print(paste("Random Forest AUC:", auc_value_rf))
[1] "Random Forest AUC: 0.711854632496984"
>
>
>
>
> # Classification Tree Results
> tree_accuracy <- conf_matrix_tree$overall['Accuracy']
> tree_precision <- conf_matrix_tree$byClass['Pos Pred Value']
> tree_recall <- conf_matrix_tree$byClass['Sensitivity']
> tree_f1_score <- 2 * ((tree_precision * tree_recall) / (tree_precision + tree_recall))
>
> print(paste("Classification Tree Accuracy:", tree_accuracy))
[1] "Classification Tree Accuracy: 0.730842414807127"
> print(paste("Classification Tree Precision:", tree_precision))
[1] "Classification Tree Precision: 0.555220883534137"
> print(paste("Classification Tree Recall:", tree_recall))
[1] "Classification Tree Recall: 0.331932773109244"
> print(paste("Classification Tree F1 Score:", tree_f1_score))
[1] "Classification Tree F1 Score: 0.415477084898573"
> print(paste("Classification Tree AUC:", auc_value_tree))
[1] "Classification Tree AUC: 0.675466444171836"

Q 6. Based on your chosen model, how will you take


managerial decisions? Discuss in detail.

Using the logistic regression model, Auto Finance Ltd. can make a number of strategy
management choices that will improve the loan approval process, make risk management better,
better divide customers into groups, focus marketing efforts, and make operations run more
smoothly. With an AUC of 0.717 and a high accuracy rate of 74.1%, the logistic regression model
is a solid basis for these choices.

Loan Approval Strategy: The main goal is to reduce failures while giving loans to people who can
pay them back. To catch more defaulters, the company can find a good mix between recall and
precision by changing the decision threshold. This can be done by lowering the recall level, which
will help find more people who might not pay back their loans. Using the expected odds, it is also
possible to make a credit scoring system that can make decisions automatically. Applicants with
high scores can be accepted right away, while applicants with average scores may need more
review. This speeds up the clearance process and makes sure that the risk is evaluated more
accurately.

Risk management: Auto Finance Ltd. can increase tracking of high-risk customers found by the
model to cut down on financial losses caused by defaults. To stop people from not making their
payments on time, early intervention tactics like payment reminders or loan restructuring options
can be used. Managing general risk is easier when the loan portfolio is diversified by balancing
customers with high and low risk. Changing credit limits based on risk scores and giving higher
limits to customers with low risk and lower limits or more security requirements to customers
with high risk can help reduce losses even more.

Customer Segmentation: Knowing about the different types of customers helps a business make
sure its goods and services are just right for each one. Risk numbers can be used to divide
customers into low-, medium-, and high-risk groups. Then, loan products can be made that are
specifically designed for each group. For example, low-risk customers can get loans with low
interest rates, while high-risk customers can get loans with flexible payment plans. Personalized
communication strategies can also be used to build better relationships with low-risk customers
and get involved with high-risk customers ahead of time to stop defaults.

Marketing and Reaching Out to Customers: The logistic regression model can make marketing
plans much more effective by focusing on low- and medium-risk customers who are more likely
to be approved for loans. This focused marketing method gets more customers while lowering the
risk. Different groups of people with different levels of risk can be given different promotional
deals. For example, low-risk customers can get special interest rates or no processing fees.
Medium-risk customers could be kept by offering rewards for on-time payments and loyalty,
which would help build long-term relationships with customers and increase total retention rates.

Operational Improvements: The logistic regression model makes it possible to set up automated
systems for loan handling and risk assessment. This cuts down on the amount of work that needs
to be done by hand and speeds up the decision-making process. Staff training programs on how to
understand model results and use them in decision-making can make sure that the predictions are
used correctly. Setting up methods for continuous improvement and regularly adding new data to
the model to keep it accurate and useful should be done. The model will continue to make
accurate predictions as long as success metrics are tracked and any necessary changes are made.

Code-

set.seed(017)

library(readr)
library(dplyr)
library(caret)

d=read.csv(file.choose())

head(d)
str(d)
attach(d)
colSums(is.na(d))

data=na.omit(d)

d$DefaulterFlag=as.factor(data$DefaulterFlag)

set.seed(017)
trainIndex=createDataPartition(data$DefaulterFlag, p = 0.8, list = FALSE)
trainData=data[trainIndex,]
testData=data[-trainIndex,]

lg.mod.1=glm(DefaulterFlag ~ AGE + NOOFDEPE + MTHINCTH + SALDATFR +


TENORYR + DWNPMFR +
PROFBUS + QUALHSC + QUAL_PG + SEXCODE + FULLPDC +
FRICODE + WASHCODE,
data = trainData, family = binomial)

summary(lg.mod.1)
install.packages("ROCR")
library(ROCR)

testData$predicted_prob=predict(lg.mod.1, newdata = testData, type = "response")

pred=prediction(testData$predicted_prob, testData$DefaulterFlag)
perf=performance(pred, "tpr", "fpr")

plot(perf, colorize = TRUE)


abline(a = 0, b = 1, lty = 2, col = "Red")

auc=performance(pred, "auc")
[email protected][[1]]
print(paste("AUC:", auc_value))

opt_threshold=0.5
print(paste("Optimal Threshold:", opt_threshold))

testData$predicted_class=ifelse(testData$predicted_prob > opt_threshold, 1, 0)


levels(testData$predicted_class)=levels(testData$DefaulterFlag)

table(testData$predicted_class)
table(testData$DefaulterFlag)
conf_matrix=confusionMatrix(as.factor(testData$predicted_class),
as.factor(testData$DefaulterFlag))
print(conf_matrix)
table(testData$predicted_class,testData$DefaulterFlag)

accuracy=conf_matrix$overall['Accuracy']
precision=conf_matrix$byClass['Pos Pred Value']
recall=conf_matrix$byClass['Sensitivity']
f1_score=2 * ((precision * recall) / (precision + recall))

print(paste("Accuracy:", accuracy))
print(paste("Precision:", precision))
print(paste("Recall:", recall))
print(paste("F1 Score:", f1_score))

profit_TP=6000
loss_FP=-6000
loss_FN=-12000
profit_TN=2000
TP=3775
TN=509
FP=1184
FN=313

total_profit_loss=(TP * profit_TP) + (FP * loss_FP) + (FN * loss_FN) + (TN * profit_TN)


print(paste("Total Profit/Loss:", total_profit_loss))

## Question4-:

library(ROCR)
library(caret)

profit_TP=6000
loss_FP= -6000
loss_FN=-12000
profit_TN=2000

testData$predicted_prob=predict(lg.mod.1, newdata = testData, type = "response")

thresholds=seq(0, 1, by = 0.01)

profits_losses=numeric(length(thresholds))

for (i in seq_along(thresholds)) {
threshold=thresholds[i]
predicted_class=ifelse(testData$predicted_prob > threshold, 1, 0)

conf_matrix=table(factor(predicted_class, levels = c(0, 1)),


factor(testData$DefaulterFlag, levels = c(0, 1)))

TP=conf_matrix[2, 2]
TN=conf_matrix[1, 1]
FP=conf_matrix[2, 1]
FN =conf_matrix[1, 2]

total_profit_loss= (TP * profit_TP) + (FP * loss_FP) + (FN * loss_FN) + (TN * profit_TN)


profits_losses[i] = total_profit_loss
}

optimal_threshold = thresholds[which.max(profits_losses)]
max_profit_loss = max(profits_losses)

print(paste("Optimal Threshold based on Profit/Loss:", optimal_threshold))


print(paste("Maximum Profit/Loss:", max_profit_loss))

testData$predicted_class_opt = ifelse(testData$predicted_prob > optimal_threshold, 1,


0)

testData$predicted_class_opt = as.factor(testData$predicted_class_opt)
testData$DefaulterFlag = as.factor(testData$DefaulterFlag)

levels(testData$predicted_class_opt) = levels(testData$DefaulterFlag)

conf_matrix_opt = confusionMatrix(testData$predicted_class_opt,
testData$DefaulterFlag)
print(conf_matrix_opt)

TP_opt = conf_matrix_opt$table[2, 2]
TN_opt = conf_matrix_opt$table[1, 1]
FP_opt = conf_matrix_opt$table[2, 1]
FN_opt = conf_matrix_opt$table[1, 2]

total_profit_loss_opt = (TP_opt * profit_TP) + (FP_opt * loss_FP) + (FN_opt * loss_FN) +


(TN_opt * profit_TN)
print(paste("Total Profit/Loss for the optimal threshold:", total_profit_loss_opt))

library(readr)
library(dplyr)
library(caret)
library(rpart)
library(randomForest)
library(ROCR)
set.seed(017)
data$DefaulterFlag = as.factor(data$DefaulterFlag)
trainIndex = createDataPartition(data$DefaulterFlag, p = 0.8, list = FALSE)
trainData = data[trainIndex,]
testData = data[-trainIndex,]
tree_model = rpart(DefaulterFlag ~ AGE + NOOFDEPE + MTHINCTH + SALDATFR +
TENORYR + DWNPMFR +
PROFBUS + QUALHSC + QUAL_PG + SEXCODE + FULLPDC +
FRICODE + WASHCODE,
data = trainData, method = "class")
tree_pred_prob = predict(tree_model, testData, type = "prob")[,2]
tree_pred_class = predict(tree_model, testData, type = "class")

table(tree_pred_class)
table(testData$DefaulterFlag)
conf_matrix_tree=confusionMatrix(as.factor(tree_pred_class),
as.factor(testData$DefaulterFlag))
print(conf_matrix_tree)
pred_tree = prediction(tree_pred_prob, testData$DefaulterFlag)

perf_tree = performance(pred_tree, "tpr", "fpr")


auc_tree = performance(pred_tree, "auc")
auc_value_tree = [email protected][[1]]
print(paste("AUC for Classification Tree:", auc_value_tree))

rf_model = randomForest(DefaulterFlag ~ AGE + NOOFDEPE + MTHINCTH +


SALDATFR + TENORYR + DWNPMFR +
PROFBUS + QUALHSC + QUAL_PG + SEXCODE + FULLPDC +
FRICODE + WASHCODE,
data = trainData, ntree = 500, importance = TRUE)
rf_pred_prob = predict(rf_model, testData, type = "prob")[,2]
rf_pred_class = predict(rf_model, testData,type="class")

table(rf_pred_class)
table(testData$DefaulterFlag)
conf_matrix_rf=confusionMatrix(as.factor(rf_pred_class),
as.factor(testData$DefaulterFlag))
print(conf_matrix_rf)
pred_rf = prediction(rf_pred_prob, testData$DefaulterFlag)
perf_rf = performance(pred_rf, "tpr", "fpr")
auc_rf = performance(pred_rf, "auc")
auc_value_rf = [email protected][[1]]
print(paste("AUC for Random Forest:", auc_value_rf))

importance(rf_model)
varImpPlot(rf_model)

rf_accuracy=conf_matrix_rf$overall['Accuracy']
rf_precision=conf_matrix_rf$byClass['Pos Pred Value']
rf_recal=conf_matrix_rf$byClass['Sensitivity']
rf_f1_score=2 * ((rf_precision * rf_recall) / (rf_precision + rf_recall))

print(paste("Random Forest Accuracy:", rf_accuracy))


print(paste("Random Forest Precision:", rf_precision))
print(paste("Random Forest Recall:", rf_recall))
print(paste("Random Forest F1 Score:", rf_f1_score))
print(paste("Random Forest AUC:", auc_value_rf))

# Classification Tree Results


tree_accuracy = conf_matrix_tree$overall['Accuracy']
tree_precision = conf_matrix_tree$byClass['Pos Pred Value']
tree_recall = conf_matrix_tree$byClass['Sensitivity']
tree_f1_score = 2 * ((tree_precision * tree_recall) / (tree_precision + tree_recall))

print(paste("Classification Tree Accuracy:", tree_accuracy))


print(paste("Classification Tree Precision:", tree_precision))
print(paste("Classification Tree Recall:", tree_recall))
print(paste("Classification Tree F1 Score:", tree_f1_score))
print(paste("Classification Tree AUC:", auc_value_tree))

You might also like