Apex Financial Services Loan Data Automation
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# This makes sure that the matplotlib plots are displayed within the notebook
%matplotlib inline
Data Inspection
An initial inspection helps us understand the structure of the data and check for any
inconsistencies or issues that might need to be addressed.
After successfully loading the data, we can see that every column has non-null values across all
247 entries. This indicates that the dataset has no missing values, which simplifies the
cleaning process.
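The loading and inspection cells are not shown in this export. A minimal sketch of what such a step could look like follows. The file name `loan_data.csv` and the sample rows are assumptions made for illustration, and a small stand-in frame is built so that the snippet runs on its own.

```python
import pandas as pd

# Stand-in for the real file; in the notebook this would be something like
# pd.read_csv("loan_data.csv") (the file name is an assumption).
# The rows below are made up for illustration.
loan_data = pd.DataFrame({
    "Loan_ID": [1, 2, 3, 4],
    "ApplicantIncome": [5849, 4583, 3000, 2583],
    "LoanAmount": [128, 128, 66, 120],
    "Loan_Status": [1, 0, 1, 1],
})

# Column dtypes and non-null counts in one summary
loan_data.info()

# Explicit per-column count of missing values
missing_counts = loan_data.isnull().sum()
print(missing_counts)

# (rows, columns); the real dataset is described as having 247 rows
print(loan_data.shape)
```

If `missing_counts` is zero for every column, no imputation or row dropping is needed before the cleaning step.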
Data Cleaning
# Checking for duplicates
if loan_data['Loan_ID'].duplicated().any():
    loan_data = loan_data.drop_duplicates('Loan_ID')
    print("Duplicates removed.")
else:
    print("No duplicates found.")
# Converting categorical variables to 'category' dtype
categorical_vars = ['Gender', 'Married', 'Dependents', 'Graduate',
                    'Self_Employed', 'Credit_History', 'Property_Area',
                    'Loan_Status']
loan_data[categorical_vars] = loan_data[categorical_vars].astype('category')
# Confirm the conversion
print("Data types after conversion:")
print(loan_data.dtypes)
No duplicates found.
Data types after conversion:
Loan_ID int64
Gender category
Married category
Dependents category
Graduate category
Self_Employed category
ApplicantIncome int64
CoapplicantIncome float64
LoanAmount int64
Loan_Amount_Term int64
Credit_History category
Property_Area category
Loan_Status category
dtype: object
• The average applicant income is approximately 5,404, with a wide range from 210 to
81,000.
• The co-applicant income also shows significant variation.
• The average loan amount is approximately 153, with loans ranging from 9 to 600.
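Summary figures like the ones above are typically read off `DataFrame.describe()`. The sketch below uses made-up rows (only the column names come from the dtype listing) to show where the mean, minimum and maximum appear.

```python
import pandas as pd

# Made-up rows for illustration; column names follow the dtype listing above
sample = pd.DataFrame({
    "ApplicantIncome": [210, 3500, 5400, 81000],
    "CoapplicantIncome": [0.0, 1500.0, 0.0, 2400.0],
    "LoanAmount": [9, 120, 150, 600],
})

# count, mean, std, min, quartiles and max for each numeric column
summary = sample.describe()

# Keep only the rows discussed in the text
print(summary.loc[["mean", "min", "max"]])
```

A large gap between the mean and the maximum, as with applicant income here, is a hint that the distribution is right-skewed.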
Property Area:
Loan Approval:
The bar chart illustrates the approval rate of loans based on the marital status of applicants.
There are two categories represented:
• 0 for Single
• 1 for Married
The graph shows that married applicants have a slightly higher loan approval rate compared to
single applicants. This could suggest that married applicants might be viewed as having more
stable financial conditions or possibly dual incomes, which could influence the decision-making
process in approving loans.
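The plotting cells are not part of this export. One way an approval rate per group could be computed is sketched below. The helper name `approval_rate_by` is hypothetical, the rows are made up, and the sketch assumes `Loan_Status` is coded numerically with 1 meaning approved.

```python
import pandas as pd

def approval_rate_by(df, column, target="Loan_Status"):
    """Share of approved loans (target == 1) within each level of `column`."""
    return df.groupby(column, observed=True)[target].mean()

# Made-up rows: Married 0 = Single, 1 = Married; Loan_Status 1 = approved
sample = pd.DataFrame({
    "Married":     [0, 0, 0, 0, 1, 1, 1, 1],
    "Loan_Status": [1, 0, 1, 0, 1, 1, 1, 0],
})

rates = approval_rate_by(sample, "Married")
print(rates)
```

Calling `rates.plot(kind="bar")` would draw the corresponding bar chart. If `Loan_Status` has already been converted to the `category` dtype, it would need to be cast back to integers (for example with `.astype(int)`) before taking the mean. The same helper applies to `Graduate`, `Property_Area` and `Credit_History`.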
The bar chart shows the approval rate of loans based on whether the applicants are graduates.
From the graph, it is evident that graduates have a higher loan approval rate compared to non-
graduates. This may be attributed to the perception that graduates are more likely to have stable
and higher-paying jobs, making them better candidates for loan approvals.
Graph 3: Loan Approval Rate by Property Area
The bar chart displays the loan approval rates based on the property area of the applicants:
• 1 for Urban
• 2 for Semiurban
• 3 for Rural
The graph indicates that applicants from Semiurban areas have the highest loan approval rate,
followed closely by Urban and Rural areas. This might reflect varying credit policies or economic
conditions across different regions that influence loan approval decisions.
The bar chart visualizes the impact of credit history on loan approval rates:
• 0 for No Credit History
• 1 for Has Credit History
The graph starkly illustrates that applicants with a credit history (1) have a significantly higher
approval rate compared to those without a credit history (0).
This underscores the importance of credit history in lending decisions, where a positive credit
history strongly favors the likelihood of loan approval.
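from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Data Preparation
X = loan_data.drop('Loan_Status', axis=1)
y = loan_data['Loan_Status']
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)
# The split step is missing from this export. test_size=0.3 is an assumption
# (it is consistent with the 75 test rows reported below); random_state is arbitrary.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
# Model Training
model = LogisticRegression()
model.fit(X_train, y_train)
# Model Evaluation
predictions = model.predict(X_test)
print(classification_report(y_test, predictions))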
              precision    recall  f1-score   support

    accuracy                           0.85        75
   macro avg       0.83      0.80      0.81        75
weighted avg       0.85      0.85      0.85        75
The precision for class 1 (approved loans) is particularly high at 88%, which means that most of
the loans the model predicts as approved are in fact approved. Here is a breakdown of the model's
performance and what each metric signifies:
Precision: Indicates how accurate the positive predictions are. For instance, when the model
predicted a loan would be approved, it was correct 88% of the time.
Recall: Reflects the ability to find all relevant instances. For approved loans, the model correctly
identified 93% of all actual approvals.
F1-Score: The harmonic mean of precision and recall. An F1-score reaches its best value at 1
(perfect precision and recall) and its worst at 0.
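To make the three definitions concrete, the sketch below computes them from raw counts of true positives, false positives and false negatives. The function name and the counts are hypothetical and are not taken from the model above.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 for one class from its confusion counts."""
    precision = tp / (tp + fp)   # of everything predicted positive, how much was right
    recall = tp / (tp + fn)      # of everything actually positive, how much was found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical counts: 9 correct approvals, 1 wrong approval, 3 missed approvals
p, r, f1 = precision_recall_f1(tp=9, fp=1, fn=3)
print(p, r, f1)

# The same formula applied to the reported class 1 precision and recall
f1_class_1 = 2 * 0.88 * 0.93 / (0.88 + 0.93)
print(round(f1_class_1, 2))
```

Because the harmonic mean is pulled toward the smaller of the two values, a model cannot reach a high F1-score by doing well on only one of precision or recall.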
Interpretation:
The model is reliable at identifying loan approvals: its high precision for class 1 means that
few of the applications it approves should have been rejected, which limits exposure to risky loans.
There is slightly lower precision and recall for the rejected class (0), which could suggest a need
for additional features or alternative modeling techniques to improve identification of rejected
applications.
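The per-class rows of the report are not shown above, but with two classes the macro average is the plain mean of the two per-class values, so the class 0 figures can be estimated from the numbers that are reported. This is a rough back-calculation from values rounded to two decimals, so each result may be off by a point or two.

```python
# Values quoted in the report and text above (rounded to two decimals)
macro_precision, macro_recall = 0.83, 0.80
precision_1, recall_1 = 0.88, 0.93

# macro = (class_0 + class_1) / 2, so class_0 = 2 * macro - class_1
precision_0 = 2 * macro_precision - precision_1
recall_0 = 2 * macro_recall - recall_1

print(round(precision_0, 2), round(recall_0, 2))
```

Under these assumptions the gap is wider for recall than for precision, which suggests that missed rejections are the main weakness to address.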
Next Steps:
Given the success of this initial model, you might consider the following enhancements or
further analysis:
Feature Engineering: You can create new features that might help improve model predictions,
such as ratios of income to loan amount, or aggregate measures of credit history.
Try Different Models: Experiment with other models like Decision Trees, Random Forests, or
Gradient Boosting Machines to see if they can achieve better performance.
Model Tuning: Adjust model parameters using techniques like grid search or random search to
find the best settings for your algorithms.
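The three suggestions above can be sketched together as follows. Everything here is illustrative: the data is synthetic, the `Income_to_Loan` feature name is hypothetical, and the parameter grid is an arbitrary small example rather than a recommended configuration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; column names follow the loan dataset
rng = np.random.default_rng(0)
n = 120
demo = pd.DataFrame({
    "ApplicantIncome": rng.integers(1000, 20000, size=n),
    "CoapplicantIncome": rng.integers(0, 5000, size=n).astype(float),
    "LoanAmount": rng.integers(50, 400, size=n),
    "Credit_History": rng.integers(0, 2, size=n),
})
# Synthetic target loosely tied to credit history, for illustration only
demo["Loan_Status"] = (
    (demo["Credit_History"] == 1) & (rng.random(n) > 0.2)
).astype(int)

# Feature engineering: total household income relative to the loan amount
demo["Income_to_Loan"] = (
    demo["ApplicantIncome"] + demo["CoapplicantIncome"]
) / demo["LoanAmount"]

X = demo.drop(columns="Loan_Status")
y = demo["Loan_Status"]

# Different model plus tuning: grid search over a small random forest grid
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="f1",
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

`GridSearchCV` evaluates every combination in `param_grid` with cross-validation, so `best_score_` is a cross-validated estimate and should still be confirmed on a held-out test set.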
Segmentation Analysis
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
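The cell that creates the `Cluster` column is not shown in this export. A minimal sketch of how it could be produced with the imported `StandardScaler` and `KMeans` follows. The feature set, the `random_state`, and the synthetic `demo` frame (standing in for `loan_data`) are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for loan_data: a larger low-income group and a
# smaller high-income group
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "ApplicantIncome": np.concatenate(
        [rng.normal(4000, 500, 30), rng.normal(20000, 2000, 10)]),
    "LoanAmount": np.concatenate(
        [rng.normal(130, 15, 30), rng.normal(400, 40, 10)]),
    "Graduate": np.concatenate(
        [rng.integers(0, 2, 30), np.ones(10, dtype=int)]),
})

# Assumed feature set; the features actually used are not shown in the source
features = ["ApplicantIncome", "LoanAmount", "Graduate"]

# Standardize so that income does not dominate the distance calculation
scaled = StandardScaler().fit_transform(demo[features])

# Three clusters, matching the range(3) loop in the analysis code
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
demo["Cluster"] = kmeans.fit_predict(scaled)

print(demo["Cluster"].value_counts().sort_index())
```

Scaling matters here because K-means relies on Euclidean distance: without it, a feature measured in thousands (income) would outweigh a 0/1 flag such as `Graduate`.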
# Analyzing clusters
for i in range(3):
    cluster = loan_data[loan_data['Cluster'] == i]
    print(f"Cluster {i}:")
    print(f"Average Income: {cluster['ApplicantIncome'].mean()}")
    print(f"Average Loan Amount: {cluster['LoanAmount'].mean()}")
    print(f"Proportion of Graduates: {cluster['Graduate'].value_counts(normalize=True)}")
    print(f"Property Area Distribution: {cluster['Property_Area'].value_counts(normalize=True)}\n")
Cluster 0:
Average Income: 3880.126984126984
Average Loan Amount: 128.11111111111111
Proportion of Graduates: Graduate
0 1.0
1 0.0
Name: proportion, dtype: float64
Property Area Distribution: Property_Area
3 0.428571
2 0.349206
1 0.222222
Name: proportion, dtype: float64
Cluster 1:
Average Income: 4614.629411764706
Average Loan Amount: 141.87058823529412
Proportion of Graduates: Graduate
1 1.0
0 0.0
Name: proportion, dtype: float64
Property Area Distribution: Property_Area
2 0.394118
3 0.311765
1 0.294118
Name: proportion, dtype: float64
Cluster 2:
Average Income: 21841.14285714286
Average Loan Amount: 393.57142857142856
Proportion of Graduates: Graduate
1 1.0
0 0.0
Name: proportion, dtype: float64
Property Area Distribution: Property_Area
2 0.428571
1 0.285714
3 0.285714
Name: proportion, dtype: float64
Analysis of Clusters
The analysis results show distinct characteristics for each cluster:
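Cluster 0: Non-graduates only, with the lowest average income (about 3,880) and loan amount
(about 128). Rural is the most common property area (43%).
Cluster 1: Graduates only, with a moderate average income (about 4,615) and loan amount
(about 142). Semiurban is the most common property area (39%).
Cluster 2: Graduates only, with a much higher average income (about 21,841) and loan amount
(about 394). Semiurban is the most common property area (43%).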
Targeting Strategies:
• Different marketing strategies can be employed that resonate with the unique
characteristics of each cluster. For example, more straightforward, assurance-focused
messaging might work better for Cluster 0, while more sophisticated, investment-
opportunity-focused messaging could appeal to Cluster 2.
• Risk Management: Understanding the income and educational background can help in
adjusting the risk models, as higher-income, educated groups (like Cluster 2) might have
a lower default rate.
Conclusion
This project covers a complete data analysis lifecycle for Apex Financial Services loan data, from
loading and cleaning through analysis and modeling. Insights derived from this analysis
help in understanding the lending environment and making informed decisions on loan
approvals and risk management.