Fraud Detection[1]

This report presents a project on transaction fraud detection using Exploratory Data Analysis (EDA) and the XGBoost machine learning algorithm. The study identifies fraudulent transactions primarily in TRANSFER and CASH_OUT types, achieving high accuracy and recall in classification. The project emphasizes the importance of combining domain knowledge with machine learning techniques to effectively detect fraud in financial systems.

Uploaded by

dakiadmi

A REPORT ON

TRANSACTION FRAUD DETECTION

SUBMITTED TO THE SAVITRIBAI PHULE PUNE UNIVERSITY,
PUNE IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE

MINI PROJECT (THIRD YEAR ENGINEERING)

SUBMITTED BY

Sudesh puri 3355

DEPARTMENT OF COMPUTER ENGINEERING

ARMY INSTITUTE OF TECHNOLOGY

DIGHI HILLS, ALANDI ROAD, PUNE 411015

SAVITRIBAI PHULE PUNE UNIVERSITY

2024-25

CERTIFICATE

This is to certify that the project report entitled

TRANSACTION FRAUD DETECTION

Submitted by

Sudesh puri

who is a bona fide student of this institute, is a record of work carried out by
him/her under the supervision of Prof. Rushali Patil, and it is approved in partial
fulfillment of the requirements of the Third Year course on DSBDA Mini Project of
Savitribai Phule Pune University.

(Prof. Rushali Patil)
Guide, Computer Engineering

(Dr. Sunil Dhore)
Head, Computer Engineering

(Dr. B.P. Patil)
Principal,
Army Institute of Technology, Dighi, Pune-411015

ACKNOWLEDGEMENT

I would like to express my heartfelt gratitude to my guide, Prof. Rushali Patil, for her
unwavering support, valuable insights, and encouragement throughout the course of
this research project. Her expertise and guidance were instrumental in shaping the
direction of this project and ensuring its success.

I also extend my appreciation to the other staff for their valuable feedback and
support, which helped refine my understanding of the subject matter. Their
commitment to fostering a collaborative and enriching learning environment greatly
contributed to my academic growth.

Sudesh puri
ABSTRACT

In the modern financial ecosystem, fraud detection has become essential to safeguard
transactional integrity. This project presents an analysis-driven approach to
identifying fraudulent activities using Exploratory Data Analysis (EDA) and the
XGBoost machine learning algorithm.

After cleaning and transforming the dataset, EDA revealed that fraudulent
transactions occur only in two transaction types: TRANSFER and CASH_OUT. These
insights were leveraged to engineer new features that capture inconsistencies and
anomalies.

A gradient boosting model (XGBoost) was then trained to classify transactions,
achieving 99.93% accuracy and 92.47% recall. This project showcases how domain
knowledge, combined with statistical rigor and machine learning, can effectively
detect fraud in a simulated transaction dataset.

Keywords: Fraud Detection, EDA, XGBoost, Anomaly Detection, Classification,
Feature Engineering
1. Introduction

Digital transactions have revolutionized financial systems, but they also open avenues for
fraud. Rule-based systems are often rigid and miss nuanced or evolving fraud patterns. This
project presents a machine learning–based approach using EDA and XGBoost to intelligently
detect fraud.

The dataset used simulates mobile money transactions, providing rich transactional features.
The challenge lies in accurately flagging fraud among millions of legitimate transactions,
which fraudulent transactions are often designed to resemble.

In recent years, financial institutions and digital platforms have increasingly turned to
machine learning to safeguard user transactions. Unlike rule-based systems, which rely on
predefined conditions and often struggle with novel fraud strategies, machine learning
models can be retrained as new data arrives, which helps them keep pace with evolving
fraud behaviors. By combining EDA with XGBoost, this project aims not only to detect
fraud but also to understand the transaction dynamics that contribute to it. This dual
approach is intended to make the detection system both accurate and explainable, two
qualities that are critical for building trust and ensuring compliance in real-world
financial systems.
2. Problem Statement

The core challenge of this project is to identify and classify fraudulent transactions
from a large dataset that includes multiple transaction types and patterns. The
complexity increases due to:

• The highly imbalanced nature of the dataset, where fraudulent transactions form a
  tiny fraction of the total.

• The similarity between fraudulent and legitimate transactions in terms of
  transaction structure.

• The presence of manipulated values (e.g., fake balances, shell accounts) that aim to
  mimic real user behavior.

Specifically, the project seeks to answer:

• Which transaction types are most prone to fraud?

• Which patterns (e.g., balance inconsistencies) correlate highly with fraudulent activity?

• How can a machine learning model distinguish fraud using historical transaction
  features?
3. Objectives

The project was carried out with the following focused objectives:

• Data Cleaning & Filtering: To remove irrelevant or misleading transaction types that do
  not contribute to fraud patterns (e.g., DEBIT, PAYMENT).
• EDA & Visualization: To identify hidden patterns, such as imbalances in sender/receiver
  balances or inconsistencies in amount transfers.
• Feature Engineering: To derive meaningful attributes that better describe anomalies,
  for example error margins between expected and actual balances.
• Model Development: To use XGBoost for binary classification with a focus on reducing
  false negatives (missed frauds).
• Model Evaluation: To assess accuracy, recall, precision, and ROC-AUC on a validation
  set, especially under class imbalance.
• Explainability: To interpret important features influencing model decisions using feature
  importance plots.
4. Literature Survey

Financial fraud detection using machine learning has been an area of active research
due to the growing need for adaptive, intelligent systems. This section presents a
survey of foundational studies and modern approaches:
1. Ngai et al. (2011)
In their review “The application of data mining techniques in financial fraud
detection,” the authors outline a classification framework that incorporates
statistical, AI, and hybrid techniques for detecting anomalies. This research
highlights the effectiveness of machine learning models like decision trees and
boosting methods in combating fraud.
2. Bhattacharyya et al. (2010)
Their comparative study explores credit card fraud detection using decision trees,
neural networks, and logistic regression. The paper emphasizes the role of data
imbalance and the importance of precision-recall trade-offs—a challenge
addressed in this project using XGBoost’s class-weighted training.
3. Sahin & Duman (2011)
This study evaluated artificial neural networks (ANNs) and logistic regression
on fraud datasets. Although ANNs provided good accuracy, interpretability was a
challenge. This supports the decision to use XGBoost in this project, which
balances performance with explainability.
4. XGBoost Algorithm Documentation
Official documentation provided insights into regularization, tree pruning, and
early stopping, which were incorporated in hyperparameter tuning for improved
fraud detection.
5. Kaggle Competitions and Discussions
Online platforms like Kaggle offered real-world case studies and public
notebooks that helped in shaping the EDA and modeling approach. They also
reinforced best practices for handling imbalanced data and feature engineering.
5. Data Sources
The dataset used in this project is the synthetic PaySim dataset from Kaggle (see
References), which simulates mobile money transactions and is widely used in fraud
detection challenges. Key aspects:

• Total entries: ~6 million transactions

• Transaction types: TRANSFER, CASH_OUT, DEBIT, PAYMENT, CASH_IN

• Key attributes: amount, oldbalanceOrg, newbalanceOrig, oldbalanceDest,
  newbalanceDest, isFraud

Preprocessing Steps:

1. Filtering Transaction Types: Fraud only occurs in TRANSFER and CASH_OUT, so
   others were removed.

2. Null and Zero Handling: Transactions with zero balances in a TRANSFER context
   were marked as suspicious.

3. New Feature Creation:

   o errorBalanceOrig = oldbalanceOrg - amount - newbalanceOrig

   o errorBalanceDest = oldbalanceDest + amount - newbalanceDest

   Each error is zero when the recorded balances are consistent with the transferred
   amount, so non-zero values flag an inconsistency.

4. Label Encoding: For transaction types and categorical data.

5. Balancing Strategy: Implemented under-sampling of the majority class or used
   class-weighting in XGBoost to combat imbalance.
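
Steps 1, 3, and 4 above can be sketched with pandas as follows. This is a minimal illustration, not the project's actual code: it assumes the transaction-type column is named `type` (as in the PaySim CSV), the function name `preprocess` and the integer codes are hypothetical, and the sample rows are made up for demonstration. Step 2 is left out because the report does not specify how suspicious rows were marked.

```python
import pandas as pd

# Transaction types in which fraud occurs, per the EDA findings.
FRAUD_PRONE_TYPES = ["TRANSFER", "CASH_OUT"]
# Hypothetical integer codes for label encoding.
TYPE_CODES = {"TRANSFER": 0, "CASH_OUT": 1}


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Keep fraud-prone types, add balance-error features, encode the type."""
    out = df[df["type"].isin(FRAUD_PRONE_TYPES)].copy()
    # Both errors are 0 when balances are consistent with the amount moved.
    out["errorBalanceOrig"] = (
        out["oldbalanceOrg"] - out["amount"] - out["newbalanceOrig"]
    )
    out["errorBalanceDest"] = (
        out["oldbalanceDest"] + out["amount"] - out["newbalanceDest"]
    )
    out["type"] = out["type"].map(TYPE_CODES)
    return out.reset_index(drop=True)


# Made-up sample rows (not taken from the dataset).
sample = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT"],
    "amount": [50.0, 200.0, 100.0],
    "oldbalanceOrg": [500.0, 200.0, 300.0],
    "newbalanceOrig": [450.0, 0.0, 200.0],
    "oldbalanceDest": [0.0, 0.0, 1000.0],
    "newbalanceDest": [0.0, 0.0, 1100.0],
    "isFraud": [0, 1, 0],
})
processed = preprocess(sample)
print(processed[["type", "errorBalanceOrig", "errorBalanceDest", "isFraud"]])
```

In the sample, the TRANSFER row empties the origin account while the destination balance stays at zero, so its errorBalanceDest is non-zero; the consistent CASH_OUT row has both errors equal to zero.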

This data preparation stage was vital to ensure that the model learns meaningful
distinctions.
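
The class-weighting option in step 5 is commonly set through XGBoost's `scale_pos_weight` parameter, for which the ratio of negative to positive examples is a usual starting value. The sketch below computes that ratio in plain Python and places it in a parameter dictionary. The helper names and the label counts are hypothetical, and the other parameter values are illustrative choices rather than the settings used in this project.

```python
def negative_to_positive_ratio(labels):
    """Return (#negatives / #positives) for a sequence of 0/1 labels."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("labels contain no positive (fraud) examples")
    return negatives / positives


def build_xgb_params(labels):
    """Assemble an illustrative XGBoost parameter dict with class weighting."""
    return {
        "objective": "binary:logistic",  # binary classification
        "eval_metric": "aucpr",          # area under the precision-recall curve
        "scale_pos_weight": negative_to_positive_ratio(labels),
    }


# Made-up labels: 997 legitimate transactions and 3 frauds.
labels = [0] * 997 + [1] * 3
params = build_xgb_params(labels)
print(params["scale_pos_weight"])
```

A dictionary like this could be unpacked into `xgboost.XGBClassifier(**params)`; weighting keeps all majority-class rows, whereas under-sampling discards some of them.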
6. Exploratory Data Analysis (EDA)

EDA was performed to understand transaction behavior and detect outliers or
hidden patterns.

Key Observations:

• Fraud is isolated to just 2 transaction types: TRANSFER and CASH_OUT.

• Fraudulent TRANSFERs often show a zero balance at origin, indicating
  suspicious shell behavior.

• Amounts in fraudulent cases are significantly higher on average than regular
  transactions.

• Feature Correlation: Engineered error fields (errorBalanceOrig and
  errorBalanceDest) strongly correlate with fraud.

• Sender vs Receiver Behavior: Fraudulent senders have near-zero or identical
  balances post-transfer, a strong indicator of simulation.
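
The first observation can be checked with a short group-by. The sketch below assumes a `type` column alongside `isFraud` (as in the PaySim CSV); the function name and the sample rows are hypothetical and only illustrate the shape of the check.

```python
import pandas as pd


def fraud_summary_by_type(df: pd.DataFrame) -> pd.DataFrame:
    """Count transactions and frauds per transaction type."""
    summary = df.groupby("type")["isFraud"].agg(
        transactions="count",
        frauds="sum",
    )
    summary["fraud_rate"] = summary["frauds"] / summary["transactions"]
    return summary.sort_values("frauds", ascending=False)


# Made-up sample rows (not taken from the dataset).
sample = pd.DataFrame({
    "type": ["TRANSFER", "TRANSFER", "TRANSFER", "CASH_OUT", "CASH_OUT",
             "PAYMENT", "PAYMENT", "PAYMENT", "CASH_IN"],
    "isFraud": [1, 0, 1, 1, 0, 0, 0, 0, 0],
})
summary = fraud_summary_by_type(sample)
fraud_types = sorted(summary.index[summary["frauds"] > 0])
print(summary)
print(fraud_types)
```

On the real data, types whose `frauds` count is zero are the ones the preprocessing step drops.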

Visualizations Used:

• Histograms of fraud distribution

• Boxplots for transaction amounts

• Heatmaps for correlation analysis

• Pie charts to show class imbalance

Overall, EDA offered valuable insights into where and how fraud occurs in the
dataset. It confirmed that fraud is confined to TRANSFER and CASH_OUT transactions
and that balance inconsistencies are strong fraud indicators, providing a data-driven
foundation for feature engineering and model development.
Bar Charts

Bar charts are one of the most widely used visualization tools in data analysis. They
represent categorical data with rectangular bars, where the length or height of each
bar is proportional to the value it represents. This makes bar charts highly effective
for comparing discrete groups or categories.

Theoretical Basis:

• X-axis typically shows categories (e.g., transaction types like TRANSFER,
  CASH_OUT).

• Y-axis represents numerical values (e.g., count of fraudulent transactions).

• Bars can be:

   o Vertical (Column charts) – commonly used for frequency comparison.

   o Horizontal – useful when category labels are long or numerous.

Confusion Matrix

A confusion matrix is a performance evaluation tool for classification models. It
summarizes predictions into four categories: True Positives (TP), True Negatives
(TN), False Positives (FP), and False Negatives (FN). From these counts, metrics such
as accuracy, precision, recall, and F1-score can be computed. In fraud detection,
minimizing False Negatives is crucial to avoid missing fraudulent cases. The confusion
matrix provides a clear snapshot of how well the model distinguishes between fraud
and non-fraud transactions.
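
The relationship between the four counts and the derived metrics can be made concrete with a few lines of Python. The counts used below are made up for illustration and are not the project's confusion matrix; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # frauds correctly flagged
    tn: int  # legitimate transactions correctly passed
    fp: int  # legitimate transactions wrongly flagged
    fn: int  # frauds missed


def classification_metrics(c: ConfusionCounts) -> dict:
    """Derive accuracy, precision, recall, and F1 from confusion counts."""
    total = c.tp + c.tn + c.fp + c.fn
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (c.tp + c.tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up counts: 100 frauds among 10,000 transactions.
metrics = classification_metrics(ConfusionCounts(tp=90, tn=9890, fp=10, fn=10))
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, accuracy is 0.998 even though one fraud in ten is missed, which is why recall is reported separately on imbalanced data.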
AUC-ROC Plot

The AUC-ROC (Area Under the Receiver Operating Characteristic Curve) is a metric used to
evaluate the performance of a binary classification model. The ROC curve plots the True
Positive Rate (Recall) against the False Positive Rate at various threshold settings; a
curve that lies closer to the top-left corner indicates better discrimination between
the classes. The AUC (Area Under the Curve) score ranges from 0 to 1. An AUC of 0.5
suggests no discriminative ability (random guessing), while a value above 0.9 is
generally considered outstanding. In fraud detection, a high AUC indicates that the
model tends to score fraudulent transactions above legitimate ones. Because the score
does not depend on a single threshold, it is more informative than accuracy on
imbalanced datasets, although under heavy imbalance it is best read alongside
precision and recall.
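
One way to see what the AUC measures is its ranking interpretation: it equals the probability that a randomly chosen fraud receives a higher score than a randomly chosen legitimate transaction, with ties counted as one half. The sketch below computes it that way in plain Python. The function name, labels, and scores are made up for illustration, and the pairwise loop is meant for small examples only, not for millions of rows.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = fraud) and model scores.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]
auc = roc_auc(labels, scores)
print(round(auc, 4))
```

Here 6 of the 9 fraud/legitimate pairs are ordered correctly, so the AUC is 6/9; a perfect ranking gives 1.0 and identical scores for every transaction give 0.5.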
7. Result and Discussion

The model developed using XGBoost was evaluated on multiple performance metrics
to determine its effectiveness in detecting fraudulent financial transactions. The
discussion below outlines the evaluation results and interprets them from a practical
and data-driven perspective.
7.1 Model Performance Metrics
The XGBoost classifier was trained using the preprocessed and feature-engineered
dataset. Since the dataset was highly imbalanced—with fraudulent transactions
representing less than 1% of total transactions—standard accuracy metrics were
supplemented with more reliable indicators like precision, recall, F1-score, and
ROC-AUC.
Metric            Value
Accuracy          99.93%
Precision         90.16%
Recall            92.47%
F1-score          91.30%
ROC-AUC Score     0.998
These results indicate that the model performs exceptionally well in identifying
fraudulent transactions, with high precision and recall. The recall of 92.47% is
particularly important in fraud detection, as it reflects the model’s ability to correctly
identify the majority of fraudulent cases, minimizing false negatives.
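
As a quick consistency check, the F1-score is the harmonic mean of precision and recall, so it can be recomputed from the two reported values. The snippet below is only an arithmetic check on the figures in the table; the function name is hypothetical.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the metrics table.
reported_precision = 0.9016
reported_recall = 0.9247
f1 = f1_from_precision_recall(reported_precision, reported_recall)
print(f"{f1:.4f}")
```

The result rounds to 0.9130, which agrees with the reported F1-score of 91.30%.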
7.2 Confusion Matrix Analysis
The confusion matrix showed:

• A very low number of False Negatives (missed fraud cases), which is crucial in
  real-world financial systems.
• Minimal False Positives, meaning legitimate users are not frequently
  misclassified as fraudulent.

This balance is ideal in fraud detection systems where undetected fraud can lead to
major financial and reputational losses, while false alarms can cause inconvenience
to genuine users.
7.3 Feature Importance
The feature importance plot generated by XGBoost revealed that the most influential
features for detecting fraud were:

• errorBalanceOrig: Difference between expected and actual origin balance.
• errorBalanceDest: Difference between expected and actual destination balance.
• amount: The transaction amount.
• Encoded type: Categorical indicator of transaction type (TRANSFER,
  CASH_OUT).

These insights are consistent with observations from EDA, reinforcing that balance
inconsistencies and transaction types are reliable indicators of suspicious activity.
This alignment between statistical patterns and model learning adds credibility to
both the analysis and the predictive output.
7.4 Interpretation of ROC-AUC Curve
The ROC curve confirmed that the model maintains high sensitivity and specificity
across different threshold values. The AUC score of 0.998 reflects excellent
discriminative power despite the class imbalance, which supports the choice of
XGBoost for fraud detection tasks where fraudulent activity is rare but highly
consequential.
7.5 Model Strengths
• Robustness: The model generalizes well to unseen data, thanks to regularization
  techniques in XGBoost.
• Speed and Scalability: XGBoost's parallel processing allows for fast training
  even on large datasets.
• Explainability: Feature importance outputs help interpret model behavior, which is
  important for audits and financial regulations.
7.6 Limitations and Considerations
While the model shows high performance, some limitations must be acknowledged:
• Simulated Data: The dataset used is synthetic and may not capture all
  complexities of real-world fraud.
• Static Nature: The current implementation does not support real-time detection,
  which would be critical in production environments.
8. Conclusion

This project highlights the efficacy of combining domain knowledge, EDA, and
XGBoost to detect fraud. The pipeline achieves reliable classification while
prioritizing interpretability and minimizing risk.
Future Scope:
• Deploy in a real-time fraud detection pipeline
• Incorporate deep learning or graph-based models
• Enhance stream-based anomaly detection
9. References

• XGBoost Documentation: https://xgboost.readthedocs.io/
• Kaggle PaySim Dataset: https://www.kaggle.com/datasets/ntnu-testimon/paysim1
• Ngai, E.W.T. et al. (2011), "The application of data mining techniques in financial
  fraud detection," Decision Support Systems.
• Bhattacharyya, S. et al. (2010), Decision Support Systems.
• Sahin, Y. & Duman, E. (2011), Expert Systems with Applications.
