A REPORT ON

TRANSACTION FRAUD DETECTION


SUBMITTED TO THE SAVITRIBAI PHULE PUNE UNIVERSITY,
PUNE IN THE PARTIAL FULFILLMENT OF THE REQUIREMENT

OF

MINI PROJECT (THIRD YEAR ENGINEERING)

SUBMITTED BY

Sudesh puri 3355

DEPARTMENT OF COMPUTER ENGINEERING

ARMY INSTITUTE OF TECHNOLOGY


DIGHI HILLS, ALANDI ROAD, PUNE 411015
SAVITRIBAI PHULE PUNE UNIVERSITY
2024-25
CERTIFICATE
This is to certify that the project report entitled

TRANSACTION FRAUD DETECTION

Submitted by

Sudesh puri

is a bona fide record of work carried out by him/her at this institute under the
supervision of Prof. Rushali Patil, and it is approved in partial fulfillment of the
requirements of the Third Year course on DSBDA Mini Project of Savitribai Phule
Pune University.

(Prof. Rushali Patil)
Guide, Computer Engineering

(Dr. Sunil Dhore)
Head, Computer Engineering

(Dr B.P. Patil)


Principal,
Army Institute of Technology, Dighi, Pune-411015
ACKNOWLEDGEMENT

I would like to express my heartfelt gratitude to my guide, Prof. Rushali Patil, for her
unwavering support, valuable insights, and encouragement throughout the course of
this project. Her expertise and guidance were instrumental in shaping the
direction of this project and ensuring its success.

I also extend my appreciation to the other staff for their valuable feedback and
support, which helped refine my understanding of the subject matter. Their
commitment to fostering a collaborative and enriching learning environment greatly
contributed to my academic growth.

Sudesh puri
ABSTRACT

In the modern financial ecosystem, fraud detection has become essential to safeguard
transactional integrity. This project presents an analysis-driven approach to
identifying fraudulent activities using Exploratory Data Analysis (EDA) and the
XGBoost machine learning algorithm.

After cleaning and transforming the dataset, EDA revealed that fraudulent
transactions primarily occur in two types: TRANSFER and CASH_OUT. These
insights were leveraged to engineer new features that capture inconsistencies and
anomalies.
A gradient boosting model (XGBoost) was then trained to classify transactions,
achieving high accuracy and recall. This project showcases how domain knowledge,
combined with statistical rigor and machine learning, can effectively detect fraud in
real-world datasets.

Keywords: Fraud Detection, EDA, XGBoost, Anomaly Detection, Classification,
Feature Engineering
1. Introduction

Digital transactions have revolutionized financial systems, but they also open avenues for
fraud. Rule-based systems are often rigid and miss nuanced or evolving fraud patterns. This
project presents a machine learning–based approach using EDA and XGBoost to intelligently
detect fraud.

The dataset used simulates mobile money transactions, providing rich transactional features.
The challenge lies in accurately flagging fraud among millions of legitimate transactions
when fraudulent transactions closely mimic legitimate behavior.

In recent years, financial institutions and digital platforms have increasingly turned to
artificial intelligence and machine learning to safeguard user transactions. Unlike
traditional rule-based systems, which rely on predefined conditions and often struggle
with novel fraud strategies, machine learning models can adapt and learn from data
patterns over time. This adaptability makes them highly effective in detecting
evolving fraud behaviors. By combining exploratory data analysis (EDA) with a
powerful model like XGBoost, this project aims to not only detect fraud but also
understand the underlying transaction dynamics that contribute to fraudulent
activity. This dual approach ensures that the detection system is both accurate and
explainable—critical qualities for building trust and ensuring compliance in real-
world financial systems.
2. Problem Statement

The core challenge of this project is to identify and classify fraudulent transactions
from a large dataset that includes multiple transaction types and patterns. The
complexity increases due to:

• The highly imbalanced nature of the dataset, where fraudulent transactions form a
tiny fraction of the total.

• The similarity between fraudulent and legitimate transactions in terms of
transaction structure.

• The presence of manipulated values (e.g., fake balances, shell accounts) that aim to
mimic real user behavior.

Specifically, the project seeks to answer:

• What transaction types are most prone to fraud?

• What patterns (e.g., balance inconsistencies) correlate highly with fraudulent activity?

• How can a machine learning model distinguish fraud using historical transaction
features?
3. Objectives

The project was carried out with the following focused objectives:
• Data Cleaning & Filtering: To remove irrelevant or misleading transaction types that do
not contribute to fraud patterns (e.g., DEBIT, PAYMENT).
• EDA & Visualization: To identify hidden patterns, such as imbalances in sender/receiver
balances or inconsistencies in amount transfers.
• Feature Engineering: To derive meaningful attributes that better describe the anomaly,
for example, error margins between expected and actual balances.
• Model Development: To use XGBoost for binary classification with a focus on reducing
false negatives (missed frauds).
• Model Evaluation: To assess accuracy, recall, precision, and ROC-AUC on a validation
set, especially under class imbalance.
• Explainability: To interpret important features influencing model decisions using feature
importance plots.
4. Literature Survey

Financial fraud detection using machine learning has been an area of active research
due to the growing need for adaptive, intelligent systems. This section presents a
survey of foundational studies and modern approaches:
1. Ngai et al. (2011)
In their review “The application of data mining techniques in financial fraud
detection,” the authors outline a classification framework that incorporates
statistical, AI, and hybrid techniques for detecting anomalies. This research
highlights the effectiveness of machine learning models like decision trees and
boosting methods in combating fraud.
2. Bhattacharyya et al. (2010)
Their comparative study explores credit card fraud detection using decision trees,
neural networks, and logistic regression. The paper emphasizes the role of data
imbalance and the importance of precision-recall trade-offs—a challenge
addressed in this project using XGBoost’s class-weighted training.
3. Sahin & Duman (2011)
This study evaluated artificial neural networks (ANNs) and logistic regression
on fraud datasets. Although ANNs provided good accuracy, interpretability was a
challenge. This supports the decision to use XGBoost in this project, which
balances performance with explainability.
4. XGBoost Algorithm Documentation
Official documentation provided insights into regularization, tree pruning, and
early stopping, which were incorporated in hyperparameter tuning for improved
fraud detection.
5. Kaggle Competitions and Discussions
Online platforms like Kaggle offered real-world case studies and public
notebooks that helped in shaping the EDA and modeling approach. They also
reinforced best practices for handling imbalanced data and feature engineering.
5. Data Sources
The dataset used in this project simulates mobile money transactions and is widely
used in fraud detection challenges. Key aspects:

• Total entries: ~6 million transactions

• Transaction types: TRANSFER, CASH_OUT, DEBIT, PAYMENT, CASH_IN

• Key attributes: amount, oldbalanceOrg, newbalanceOrig, oldbalanceDest,
newbalanceDest, isFraud

Preprocessing Steps:

1. Filtering Transaction Types: Fraud only occurs in TRANSFER and
CASH_OUT, so others were removed.

2. Null and Zero Handling: Transactions with zero balances in a TRANSFER
context were marked as suspicious.

3. New Feature Creation:

   o errorBalanceOrig = oldbalanceOrg - amount - newbalanceOrig

   o errorBalanceDest = oldbalanceDest + amount - newbalanceDest

4. Label Encoding: For transaction types and categorical data.

5. Balancing Strategy: Implemented under-sampling of the majority class or used
class-weighting in XGBoost to combat imbalance.

This data preparation stage was vital to ensure that the model learns meaningful
distinctions.
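
The preprocessing steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the project's actual code: the sample rows are invented, the helper names `prepare` and `positive_class_weight` are hypothetical, and the legitimate-to-fraud ratio is shown as one common way to choose a class weight (it is the value usually suggested for XGBoost's `scale_pos_weight` parameter).

```python
# Hypothetical sample rows; column names follow the dataset attributes listed above.
rows = [
    {"type": "TRANSFER", "amount": 1000.0, "oldbalanceOrg": 1000.0,
     "newbalanceOrig": 0.0, "oldbalanceDest": 0.0, "newbalanceDest": 0.0,
     "isFraud": 1},
    {"type": "PAYMENT", "amount": 50.0, "oldbalanceOrg": 300.0,
     "newbalanceOrig": 250.0, "oldbalanceDest": 0.0, "newbalanceDest": 0.0,
     "isFraud": 0},
    {"type": "CASH_OUT", "amount": 200.0, "oldbalanceOrg": 500.0,
     "newbalanceOrig": 300.0, "oldbalanceDest": 100.0, "newbalanceDest": 300.0,
     "isFraud": 0},
]

FRAUD_PRONE_TYPES = {"TRANSFER", "CASH_OUT"}


def prepare(rows):
    """Keep only fraud-prone types and add the two error-balance features."""
    prepared = []
    for row in rows:
        if row["type"] not in FRAUD_PRONE_TYPES:
            continue  # step 1: drop DEBIT, PAYMENT, CASH_IN
        out = dict(row)
        # step 3: gap between the expected and the recorded balances
        out["errorBalanceOrig"] = (
            row["oldbalanceOrg"] - row["amount"] - row["newbalanceOrig"]
        )
        out["errorBalanceDest"] = (
            row["oldbalanceDest"] + row["amount"] - row["newbalanceDest"]
        )
        prepared.append(out)
    return prepared


def positive_class_weight(rows):
    """Ratio of legitimate to fraudulent rows (assumed class-weighting choice)."""
    fraud = sum(r["isFraud"] for r in rows)
    legit = len(rows) - fraud
    return legit / fraud if fraud else 1.0


prepared = prepare(rows)
weight = positive_class_weight(prepared)
for r in prepared:
    print(r["type"], r["errorBalanceOrig"], r["errorBalanceDest"], r["isFraud"])
```

In the invented fraudulent TRANSFER, the destination balance never increases, so errorBalanceDest equals the full amount, while the legitimate CASH_OUT reconciles to zero on both sides.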
6. Exploratory Data Analysis (EDA)

EDA was performed to understand transaction behavior and detect outliers or
hidden patterns.

Key Observations:

• Fraud is isolated to just 2 transaction types: TRANSFER and CASH_OUT.

• Fraudulent TRANSFERs often show a zero balance at origin, indicating
suspicious shell behavior.

• Amounts in fraudulent cases are significantly higher on average than regular
transactions.

• Feature Correlation: Engineered error fields (errorBalanceOrig and
errorBalanceDest) strongly correlate with fraud.

• Sender vs Receiver Behavior: Fraudulent senders have near-zero or identical
balances post-transfer, a strong indicator of simulation.
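
The first observation comes from a simple count of fraud per transaction type. A minimal sketch of that count is shown here, assuming transactions are available as (type, isFraud) pairs; the sample data and the helper name `fraud_by_type` are invented for illustration and are not the project's dataset or code.

```python
from collections import Counter


def fraud_by_type(transactions):
    """Return {type: (fraud_count, total_count, fraud_rate)}."""
    totals = Counter()
    frauds = Counter()
    for tx_type, is_fraud in transactions:
        totals[tx_type] += 1
        frauds[tx_type] += is_fraud
    return {t: (frauds[t], totals[t], frauds[t] / totals[t]) for t in totals}


# Hypothetical sample: fraud appears only in TRANSFER and CASH_OUT.
sample = [
    ("TRANSFER", 1), ("TRANSFER", 0),
    ("CASH_OUT", 1), ("CASH_OUT", 0), ("CASH_OUT", 0),
    ("PAYMENT", 0), ("DEBIT", 0), ("CASH_IN", 0),
]

summary = fraud_by_type(sample)
for tx_type, (fraud, total, rate) in sorted(summary.items()):
    print(f"{tx_type:<9} fraud={fraud} total={total} rate={rate:.2f}")
```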

Visualizations Used:

• Histograms of fraud distribution

• Boxplots for transaction amounts

• Heatmaps for correlation analysis

• Pie charts to show class imbalance

Overall, EDA offered valuable insights into where fraud concentrates (TRANSFER and
CASH_OUT), how fraudulent amounts and balances differ from legitimate ones, and
which engineered features separate the two classes. It provided a data-driven
foundation for the feature engineering and modeling steps that follow.
Bar Charts

Bar charts are one of the most widely used visualization tools in data analysis. They
represent categorical data with rectangular bars, where the length or height of each
bar is proportional to the value it represents. This makes bar charts highly effective
for comparing discrete groups or categories.

Theoretical Basis:

• X-axis typically shows categories (e.g., transaction types like TRANSFER,
CASH_OUT).

• Y-axis represents numerical values (e.g., count of fraudulent transactions).

• Bars can be:

   o Vertical (Column charts) – commonly used for frequency comparison.

   o Horizontal – useful when category labels are long or numerous.


Confusion Matrix

A confusion matrix is a performance evaluation tool for classification models. It
summarizes predictions into four categories: True Positives (TP), True Negatives
(TN), False Positives (FP), and False Negatives (FN).
It helps measure metrics like accuracy, precision, recall, and F1-score.
In fraud detection, minimizing False Negatives is crucial to avoid missing fraudulent
cases.
The confusion matrix provides a clear snapshot of how well the model distinguishes
between fraud and non-fraud transactions.
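
The four counts and the metrics derived from them can be computed directly from true and predicted labels. The sketch below uses invented labels (1 = fraud, 0 = legitimate) and hypothetical helper names; it illustrates the standard definitions rather than the project's evaluation code.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels where 1 means fraud."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from the four confusion counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical labels: 3 frauds, one missed (FN) and one false alarm (FP).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(*counts)
print(counts, metrics)
```

In this toy case accuracy is 0.8 even though one fraud in three is missed, which is why recall is tracked separately.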
AUC-ROC Plot

The AUC-ROC (Area Under the Receiver Operating Characteristic Curve) is a metric used to
evaluate the performance of a binary classification model.
The ROC curve plots the True Positive Rate (Recall) against the False Positive Rate at
various threshold settings.
A curve that lies closer to the top-left corner indicates better discrimination between the classes.
The AUC (Area Under the Curve) score ranges from 0 to 1; closer to 1 means excellent
model performance.
An AUC of 0.5 suggests no discriminative ability (random guessing), while above 0.9 is
considered outstanding.
In fraud detection, a high AUC ensures that the model distinguishes well between fraud and
legitimate transactions.
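It is especially useful with imbalanced datasets because it summarizes performance across
all thresholds, although precision and recall should be reported alongside it because
ROC-AUC can look optimistic when the positive class is rare.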
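
One way to read the AUC is as the probability that a randomly chosen fraudulent transaction receives a higher score than a randomly chosen legitimate one, with ties counted as half. The sketch below computes it by direct pairwise comparison. The function name and the sample scores are illustrative assumptions, and the quadratic loop is meant for small examples only.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (fraud, legitimate) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # fraud scored above legitimate
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Hypothetical labels and model scores.
auc_example = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
auc_perfect = roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
auc_random = roc_auc([0, 0, 1, 1], [0.5, 0.5, 0.5, 0.5])
print(auc_example, auc_perfect, auc_random)
```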
7. Results and Discussion

The model developed using XGBoost was evaluated on multiple performance metrics
to determine its effectiveness in detecting fraudulent financial transactions. The
discussion below outlines the evaluation results and interprets them from a practical
and data-driven perspective.
7.1 Model Performance Metrics
The XGBoost classifier was trained using the preprocessed and feature-engineered
dataset. Since the dataset was highly imbalanced—with fraudulent transactions
representing less than 1% of total transactions—standard accuracy metrics were
supplemented with more reliable indicators like precision, recall, F1-score, and
ROC-AUC.
Metric           Value
Accuracy         99.93%
Precision        90.16%
Recall           92.47%
F1-score         91.30%
ROC-AUC Score    0.998
These results indicate that the model performs exceptionally well in identifying
fraudulent transactions, with high precision and recall. The recall of 92.47% is
particularly important in fraud detection, as it reflects the model’s ability to correctly
identify the majority of fraudulent cases, minimizing false negatives.
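
As a consistency check, the F1-score can be recomputed from the reported precision and recall as their harmonic mean. The snippet also shows why accuracy alone is uninformative here: a classifier that labels every transaction as legitimate would already reach an accuracy of one minus the fraud rate. The 1% fraud rate used below is an assumed upper bound taken from the "less than 1%" statement, not a measured value.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the metrics table.
reported_f1 = f1_from(0.9016, 0.9247)
print(f"F1 recomputed: {reported_f1:.4f}")

# Assumed upper bound on the fraud share ("less than 1%").
assumed_fraud_rate = 0.01
baseline_accuracy = 1 - assumed_fraud_rate  # always predicting "legitimate"
print(f"All-legitimate baseline accuracy: {baseline_accuracy:.2%}")
```

The recomputed value agrees with the reported F1-score of 91.30% to two decimal places.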
7.2 Confusion Matrix Analysis
The confusion matrix showed:
• A very low number of False Negatives (missed fraud cases), which is crucial in
real-world financial systems.
• Minimal False Positives, meaning legitimate users are not frequently
misclassified as fraudulent.
This balance is ideal in fraud detection systems where undetected fraud can lead to
major financial and reputational losses, while false alarms can cause inconvenience
to genuine users.
7.3 Feature Importance
The feature importance plot generated by XGBoost revealed that the most influential
features for detecting fraud were:
• errorBalanceOrig: Difference between expected and actual origin balance.
• errorBalanceDest: Difference between expected and actual destination balance.
• amount: The transaction amount.
• Encoded type: Categorical indicator of transaction type (TRANSFER,
CASH_OUT).
These insights are consistent with observations from EDA, reinforcing that balance
inconsistencies and transaction types are reliable indicators of suspicious activity.
This alignment between statistical patterns and model learning adds credibility to
both the analysis and the predictive output.
7.4 Interpretation of ROC-AUC Curve
The ROC curve confirmed that the model maintains high sensitivity and specificity
across different threshold values. The AUC score of 0.998 reflects excellent
discriminative power, making the model robust even under class imbalance. This
makes XGBoost particularly suitable for fraud detection tasks where fraudulent
activity is rare but highly consequential.
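
Sensitivity and specificity at a given threshold can be read off by turning scores into labels and counting outcomes. A minimal sketch follows, with invented scores and a hypothetical helper name; each threshold corresponds to one point on the ROC curve, at (1 - specificity, sensitivity).

```python
def sensitivity_specificity(y_true, scores, threshold):
    """Return (sensitivity, specificity) when score >= threshold means fraud."""
    tp = fn = tn = fp = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if label == 1 and predicted == 1:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted == 0:
            tn += 1
        else:
            fp += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


# Hypothetical labels and model scores.
labels = [0, 0, 1, 1]
model_scores = [0.1, 0.4, 0.35, 0.8]

strict = sensitivity_specificity(labels, model_scores, 0.5)
lenient = sensitivity_specificity(labels, model_scores, 0.3)
print("threshold 0.5:", strict)
print("threshold 0.3:", lenient)
```

Lowering the threshold catches more fraud at the cost of more false alarms, which is the trade-off the ROC curve traces out.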
7.5 Model Strengths
• Robustness: The model generalizes well to unseen data, thanks to regularization
techniques in XGBoost.
• Speed and Scalability: XGBoost’s parallel processing allows for fast training
even on large datasets.
• Explainability: Feature importance outputs help interpret model behavior, which is
important for audits and financial regulations.
7.6 Limitations and Considerations
While the model shows high performance, some limitations must be acknowledged:
• Simulated Data: The dataset used is synthetic and may not capture all
complexities of real-world fraud.
• Static Nature: The current implementation does not support real-time detection,
which would be critical in production environments.
8. Conclusion

This project highlights the efficacy of combining domain knowledge, EDA, and
XGBoost to detect fraud. The pipeline achieves reliable classification while
prioritizing interpretability and minimizing risk.
Future Scope:
• Deploy in a real-time fraud detection pipeline
• Incorporate deep learning or graph-based models
• Enhance stream-based anomaly detection
9. References

• XGBoost Documentation: https://xgboost.readthedocs.io/
• Kaggle PaySim Dataset: https://www.kaggle.com/datasets/ntnu-testimon/paysim1
• Ngai, E.W.T. et al. (2011), Decision Support Systems.
• Bhattacharyya, S. et al. (2010), Decision Support Systems.
• Sahin, Y. & Duman, E. (2011), Expert Systems with Applications.
