Fraud Detection[1]
SUBMITTED BY
Sudesh puri
is a bona fide student of this institute, and the work has been carried out by him/her
under the supervision of Prof. Rushali Patil. It is approved in partial
fulfillment of the requirements of the Third Year course on DSBDA Mini Project of
Savitribai Phule Pune University.
I would like to express my heartfelt gratitude to my guide, Prof. Rushali Patil, for her
unwavering support, valuable insights, and encouragement throughout the course of
this project. Her expertise and guidance were instrumental in shaping the
direction of this project and ensuring its success.
I also extend my appreciation to the other staff for their valuable feedback and
support, which helped refine my understanding of the subject matter. Their
commitment to fostering a collaborative and enriching learning environment greatly
contributed to my academic growth.
Sudesh puri
ABSTRACT
In the modern financial ecosystem, fraud detection has become essential to safeguard
transactional integrity. This project presents an analysis-driven approach to
identifying fraudulent activities using Exploratory Data Analysis (EDA) and the
XGBoost machine learning algorithm.
After cleaning and transforming the dataset, EDA revealed that fraudulent
transactions primarily occur in two types: TRANSFER and CASH_OUT. These
insights were leveraged to engineer new features that capture inconsistencies and
anomalies.
A gradient boosting model (XGBoost) was then trained to classify transactions,
achieving high accuracy and recall. This project showcases how domain knowledge,
combined with statistical rigor and machine learning, can effectively detect fraud in
real-world datasets.
1. Introduction
Digital transactions have revolutionized financial systems, but they also open avenues for
fraud. Rule-based systems are often rigid and miss nuanced or evolving fraud patterns. This
project presents a machine learning–based approach using EDA and XGBoost to intelligently
detect fraud.
The dataset used simulates mobile money transactions, providing rich transactional features.
The challenge lies in accurately flagging fraud among millions of legitimate transactions,
many of which closely mimic fraudulent behavior.
In recent years, financial institutions and digital platforms have increasingly turned to
artificial intelligence and machine learning to safeguard user transactions. Unlike
traditional rule-based systems, which rely on predefined conditions and often struggle
with novel fraud strategies, machine learning models can adapt and learn from data
patterns over time. This adaptability makes them highly effective in detecting
evolving fraud behaviors. By combining exploratory data analysis (EDA) with a
powerful model like XGBoost, this project aims to not only detect fraud but also
understand the underlying transaction dynamics that contribute to fraudulent
activity. This dual approach is intended to make the detection system both accurate
and explainable, which are critical qualities for building trust and ensuring
compliance in real-world financial systems.
2. Problem Statement
The core challenge of this project is to identify and classify fraudulent transactions
from a large dataset that includes multiple transaction types and patterns. The
complexity increases due to:
The highly imbalanced nature of the dataset, where fraudulent transactions form a
tiny fraction of the total.
The presence of manipulated values (e.g., fake balances, shell accounts) that aim to
mimic real user behavior.
This leads to two guiding questions:
What patterns (e.g., balance inconsistencies) correlate highly with fraudulent activity?
How can a machine learning model distinguish fraud using historical transaction
features?
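The class imbalance described above is commonly countered in XGBoost by weighting the rare class. The sketch below computes the negative-to-positive ratio that is usually taken as a starting value for the `scale_pos_weight` parameter. The toy labels and the parameter dictionary are illustrative assumptions, not the settings used in this project.

```python
def scale_pos_weight(labels):
    """Ratio of negative to positive examples, a common starting value
    for XGBoost's `scale_pos_weight` parameter on imbalanced data."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("no positive (fraud) examples in labels")
    return negatives / positives


# Toy labels (illustrative only): 2 fraud cases among 1000 transactions.
toy_labels = [1] * 2 + [0] * 998

# Hypothetical parameter dictionary; the values are placeholders and are
# not the hyperparameters used in this project.
params = {
    "objective": "binary:logistic",
    "eval_metric": "aucpr",
    "scale_pos_weight": scale_pos_weight(toy_labels),
}

print(params["scale_pos_weight"])
```

A larger weight makes each missed fraud case cost more during training, which usually raises recall at some expense of precision.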
3. Objectives
The project was carried out with the following focused objectives:
Data Cleaning & Filtering: To remove irrelevant or misleading transaction types that do
not contribute to fraud patterns (e.g., DEBIT, PAYMENT).
EDA & Visualization: To identify hidden patterns, such as imbalances in sender/receiver
balances or inconsistencies in amount transfers.
Feature Engineering: To derive meaningful attributes that better describe the anomaly—
for example, error margins between expected and actual balances.
Model Development: To use XGBoost for binary classification with a focus on reducing
false negatives (missed frauds).
Model Evaluation: To assess accuracy, recall, precision, and ROC-AUC on a validation
set, especially under class imbalance.
Explainability: To interpret important features influencing model decisions using feature
importance plots.
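The first three objectives can be sketched in plain Python as follows. This is a minimal illustration rather than the project's actual code: the field names (`type`, `amount`, `oldBalanceOrig`, `newBalanceOrig`, `oldBalanceDest`, `newBalanceDest`), the sign convention of the error terms, and the toy transactions are all assumptions.

```python
# Transaction types in which fraud was observed during EDA.
FRAUD_PRONE_TYPES = {"TRANSFER", "CASH_OUT"}


def add_balance_errors(tx):
    """Return a copy of one transaction with two balance-error features.

    Each error is (expected new balance) - (recorded new balance), so a
    value of 0 means the balances are consistent with the amount moved.
    The sign convention is an assumption made for this sketch.
    """
    out = dict(tx)
    # The sender's balance should drop by the transferred amount.
    expected_new_orig = tx["oldBalanceOrig"] - tx["amount"]
    out["errorBalanceOrig"] = expected_new_orig - tx["newBalanceOrig"]
    # The receiver's balance should rise by the transferred amount.
    expected_new_dest = tx["oldBalanceDest"] + tx["amount"]
    out["errorBalanceDest"] = expected_new_dest - tx["newBalanceDest"]
    return out


def prepare(transactions):
    """Keep only TRANSFER and CASH_OUT rows, then add the error features."""
    return [add_balance_errors(tx) for tx in transactions
            if tx["type"] in FRAUD_PRONE_TYPES]


# Toy transactions (illustrative only, not taken from the dataset).
toy_transactions = [
    {"type": "TRANSFER", "amount": 100.0, "oldBalanceOrig": 100.0,
     "newBalanceOrig": 0.0, "oldBalanceDest": 0.0, "newBalanceDest": 0.0},
    {"type": "PAYMENT", "amount": 20.0, "oldBalanceOrig": 50.0,
     "newBalanceOrig": 30.0, "oldBalanceDest": 0.0, "newBalanceDest": 0.0},
    {"type": "CASH_OUT", "amount": 50.0, "oldBalanceOrig": 200.0,
     "newBalanceOrig": 150.0, "oldBalanceDest": 10.0, "newBalanceDest": 60.0},
]

features = prepare(toy_transactions)
for row in features:
    print(row["type"], row["errorBalanceOrig"], row["errorBalanceDest"])
```

In the first toy row the destination balance never receives the transferred amount, so `errorBalanceDest` is nonzero; this is the kind of inconsistency the engineered features are meant to expose.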
4. Literature Survey
Financial fraud detection using machine learning has been an area of active research
due to the growing need for adaptive, intelligent systems. This section presents a
survey of foundational studies and modern approaches:
1. Ngai et al. (2011)
In their review “The application of data mining techniques in financial fraud
detection,” the authors outline a classification framework that incorporates
statistical, AI, and hybrid techniques for detecting anomalies. This research
highlights the effectiveness of machine learning models like decision trees and
boosting methods in combating fraud.
2. Bhattacharyya et al. (2010)
Their comparative study explores credit card fraud detection using decision trees,
neural networks, and logistic regression. The paper emphasizes the role of data
imbalance and the importance of precision-recall trade-offs—a challenge
addressed in this project using XGBoost’s class-weighted training.
3. Sahin & Duman (2011)
This study evaluated artificial neural networks (ANNs) and logistic regression
on fraud datasets. Although ANNs provided good accuracy, interpretability was a
challenge. This supports the decision to use XGBoost in this project, which
balances performance with explainability.
4. XGBoost Algorithm Documentation
Official documentation provided insights into regularization, tree pruning, and
early stopping, which were incorporated in hyperparameter tuning for improved
fraud detection.
5. Kaggle Competitions and Discussions
Online platforms like Kaggle offered real-world case studies and public
notebooks that helped in shaping the EDA and modeling approach. They also
reinforced best practices for handling imbalanced data and feature engineering.
5. Data Sources
The dataset used in this project simulates mobile money transactions and is widely
used in fraud detection challenges. Key aspects:
Preprocessing Steps:
This data preparation stage was vital to ensure that the model learns meaningful
distinctions.
6. Exploratory Data Analysis (EDA)
Key Observations:
Visualizations Used:
Overall, EDA showed that fraudulent transactions are concentrated in the TRANSFER
and CASH_OUT types and are accompanied by balance inconsistencies. It provided a
data-driven foundation for the feature engineering and modeling steps that
follow.
Bar Charts
Bar charts are one of the most widely used visualization tools in data analysis. They
represent categorical data with rectangular bars, where the length or height of each
bar is proportional to the value it represents. This makes bar charts highly effective
for comparing discrete groups or categories.
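As a small illustration of the idea, the sketch below counts fraud cases per transaction type and renders them as a text bar chart, with bar length proportional to the count. The records are toy values invented for the example, and the text rendering stands in for whatever plotting library is actually used.

```python
from collections import Counter

# Toy records (illustrative only): (transaction type, is_fraud flag).
records = [
    ("TRANSFER", 1), ("CASH_OUT", 1), ("CASH_OUT", 1),
    ("PAYMENT", 0), ("DEBIT", 0), ("TRANSFER", 0), ("CASH_OUT", 0),
]

# Count fraudulent transactions per category.
fraud_by_type = Counter(t for t, is_fraud in records if is_fraud == 1)


def text_bar_chart(counts, width=20):
    """Return one line per category; bar length is proportional to its count."""
    peak = max(counts.values())
    lines = []
    for category, n in sorted(counts.items()):
        bar = "#" * round(width * n / peak)
        lines.append(f"{category:<10} {bar} {n}")
    return lines


for line in text_bar_chart(fraud_by_type):
    print(line)
```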
Theoretical Basis:
The AUC-ROC (Area Under the Receiver Operating Characteristic Curve) is a metric used to
evaluate the performance of a binary classification model.
The ROC curve plots the True Positive Rate (Recall) against the False Positive Rate at
various threshold settings.
A curve that lies closer to the top-left corner indicates better discrimination between the classes.
The AUC (Area Under the Curve) score ranges from 0 to 1; closer to 1 means excellent
model performance.
An AUC of 0.5 suggests no discriminative ability (random guessing), while above 0.9 is
considered outstanding.
In fraud detection, a high AUC indicates that the model separates fraudulent from
legitimate transactions well.
Because it summarizes performance across all thresholds, it is more informative than accuracy
on imbalanced datasets, although it is best read alongside precision and recall.
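The definitions above can be made concrete with a short sketch. It uses the equivalent ranking interpretation of AUC: the probability that a randomly chosen fraud case receives a higher score than a randomly chosen legitimate one, with ties counted as one half. The function names, labels, and scores are hypothetical and chosen only for illustration.

```python
def roc_point(labels, scores, threshold):
    """Return (false positive rate, true positive rate) at one threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    return fp / negatives, tp / positives


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Toy labels and model scores (illustrative only).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

print(roc_point(labels, scores, threshold=0.5))
print(roc_auc(labels, scores))
```

The pairwise loop is quadratic in the number of samples, so it suits small examples only; library implementations sort the scores instead.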
7. Result and Discussion
The model developed using XGBoost was evaluated on multiple performance metrics
to determine its effectiveness in detecting fraudulent financial transactions. The
discussion below outlines the evaluation results and interprets them from a practical
and data-driven perspective.
7.1 Model Performance Metrics
The XGBoost classifier was trained using the preprocessed and feature-engineered
dataset. Since the dataset was highly imbalanced—with fraudulent transactions
representing less than 1% of total transactions—standard accuracy metrics were
supplemented with more reliable indicators like precision, recall, F1-score, and
ROC-AUC.
Metric          Value
Accuracy        99.93%
Precision       90.16%
Recall          92.47%
F1-score        91.30%
ROC-AUC Score   0.998
These results indicate that the model performs exceptionally well in identifying
fraudulent transactions, with high precision and recall. The recall of 92.47% is
particularly important in fraud detection, as it reflects the model’s ability to correctly
identify the majority of fraudulent cases, minimizing false negatives.
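As a consistency check on the figures above, the F1-score is the harmonic mean of precision and recall. The short sketch below recomputes it from the reported precision and recall; only the function name is introduced here.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in Section 7.1.
reported_precision = 0.9016
reported_recall = 0.9247

f1 = f1_score(reported_precision, reported_recall)
print(f"{f1:.4f}")  # 0.9130, matching the reported 91.30%
```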
7.2 Confusion Matrix Analysis
The confusion matrix showed:
A very low number of False Negatives (missed fraud cases), which is crucial in
real-world financial systems.
Minimal False Positives, meaning legitimate users are not frequently
misclassified as fraudulent.
This balance is ideal in fraud detection systems where undetected fraud can lead to
major financial and reputational losses, while false alarms can cause inconvenience
to genuine users.
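The four cells of a confusion matrix, and the precision and recall derived from them, can be sketched as follows. The labels and predictions are toy values, not the project's results, and the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives, with fraud as label 1."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            counts["tp"] += 1
        elif truth == 0 and pred == 1:
            counts["fp"] += 1  # legitimate transaction flagged as fraud
        elif truth == 0 and pred == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1  # missed fraud case
    return counts


def precision_recall(counts):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    flagged = counts["tp"] + counts["fp"]
    actual = counts["tp"] + counts["fn"]
    precision = counts["tp"] / flagged if flagged else 0.0
    recall = counts["tp"] / actual if actual else 0.0
    return precision, recall


# Toy labels and predictions (illustrative only).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

cm = confusion_counts(y_true, y_pred)
print(cm)
print(precision_recall(cm))
```

False negatives lower recall and false positives lower precision, which is why both are reported alongside accuracy.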
7.3 Feature Importance
The feature importance plot generated by XGBoost revealed that the most influential
features for detecting fraud were:
errorBalanceOrig: Difference between expected and actual origin balance.
errorBalanceDest: Difference between expected and actual destination balance.
amount: The transaction amount.
Encoded type: Categorical indicator of transaction type (TRANSFER,
CASH_OUT).
These insights are consistent with observations from EDA, reinforcing that balance
inconsistencies and transaction types are reliable indicators of suspicious activity.
This alignment between statistical patterns and model learning adds credibility to
both the analysis and the predictive output.
7.4 Interpretation of ROC-AUC Curve
The ROC curve confirmed that the model maintains high sensitivity and specificity
across different threshold values. The AUC score of 0.998 reflects excellent
discriminative power: the model ranks fraudulent transactions above legitimate
ones in nearly all cases. This matters in fraud detection, where fraudulent
activity is rare but highly consequential.
7.5 Model Strengths
Robustness: The model generalizes well to unseen data, thanks to regularization
techniques in XGBoost.
Speed and Scalability: XGBoost’s parallel processing allows for fast training
even on large datasets.
Explainability: Feature importance outputs help interpret model behavior—
important for audits and financial regulations.
7.6 Limitations and Considerations
While the model shows high performance, some limitations must be acknowledged:
Simulated Data: The dataset used is synthetic and may not capture all
complexities of real-world fraud.
Static Nature: The current implementation does not support real-time detection,
which would be critical in production environments.
8. Conclusion
This project highlights the efficacy of combining domain knowledge, EDA, and
XGBoost to detect fraud. The pipeline achieves reliable classification while
prioritizing interpretability and minimizing risk.
Future Scope:
Deploy in a real-time fraud detection pipeline
Incorporate deep learning or graph-based models
Enhance stream-based anomaly detection
9. References