ML LAB REPORT
ON
CLASSIFICATION OF ALGORITHMS ON ONLINE
PAYMENT FRAUD DETECTION DATASET
Submitted to Jawaharlal Nehru Technological University, Hyderabad,
in partial fulfillment of the requirements for the award of the degree of
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
(ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING)
SUBMITTED BY
MIRZA KHIZER BAIG (22J21A6654)
KUSTHAPURAM PAVANI (22J21A6649)
MEKA SUJIT (22J21A6653)
Under the guidance of
Mrs. VIJAYALAXMI MATHPATI
Assistant Professor
JOGINPALLY B.R ENGINEERING COLLEGE
Accredited by NAAC with an A+ Grade, Recognized under sec. 2(f) of the UGC Act. 1956
Approved by AICTE & Affiliated to Jawaharlal Nehru Technological University, Hyderabad
Bhaskar Nagar, Yenkapally, Moinabad, Ranga Reddy, Hyderabad, Telangana – 500075
CERTIFICATE
PRINCIPAL
ACKNOWLEDGMENT
We would like to place on record that this Lab Project would never have taken shape
without the cooperation extended to us by certain individuals. Though it is not
possible to name all of them, it would be unpardonable on our part not to mention
some of the most important among them.
We express our gratitude to Dr. M.L.M. PRASAD, HOD of Computer Science and
Engineering (Artificial Intelligence and Machine Learning) for his valuable
suggestions and advice.
We sincerely acknowledge our deep sense of gratitude to our project guide, Mrs.
VIJAYALAXMI MATHPATI, Assistant Professor, for her constant encouragement, help, and
valuable suggestions.
DECLARATION
ABSTRACT
This project aims to evaluate the performance of multiple classification algorithms for
detecting fraudulent transactions. The dataset used in this study comprises anonymized
transaction records, including features such as transaction time, amount, and customer ID,
with a severe class imbalance due to the rarity of fraud cases. Key ML algorithms analyzed
include Logistic Regression, Random Forest, Support Vector Machine (SVM), and Gradient
Boosting.
The performance of these models is compared using metrics such as accuracy, precision,
recall, F1-score, and ROC-AUC. Experimental results reveal that ensemble models like
Random Forest and Gradient Boosting significantly outperform other algorithms, achieving a
high F1-score and robust precision-recall balance. The findings underscore the potential of
ensemble learning techniques in improving fraud detection systems while minimizing false
positive rates.
This documentation discusses the dataset preprocessing steps, model selection rationale,
evaluation metrics, and insights derived from the results. Future work involves exploring
advanced deep learning models and real-time fraud detection techniques to address evolving
fraud strategies effectively.
TABLE OF CONTENTS
1 INTRODUCTION
1.1 Background
1.2 Importance of Fraud Detection
1.3 Challenges in Fraud Detection
1.4 Objectives of the Study
1.5 Scope of the Study
2 DATASET DESCRIPTION
2.1 Dataset Source
2.2 Dataset Overview
2.3 Class Imbalance
2.4 Data Preprocessing
2.5 Dataset Challenges
3 METHODOLOGY
3.1 Overview
3.2 Workflow
3.3 Algorithms
4 EXPERIMENTAL SETUP
4.1 Environment Setup
4.2 Dataset Partitioning
4.3 Performance Metrics
5 RESULTS
5.1 Model Performance Overview
5.2 Key Observations
5.3 Visualization of Results
6 CONCLUSION
6.1 Conclusion
6.2 Key Findings
7 REFERENCES
LIST OF FIGURES
Fig 2: Dataset
Fig 3: Flowchart
1. Introduction
1.1 Background
The rise of online payment platforms has transformed the global economy, enabling secure,
fast, and convenient financial transactions. However, with this transformation comes the
growing challenge of fraudulent activities. Online payment fraud includes unauthorized
transactions, identity theft, and account takeovers. In 2023 alone, global losses due to
payment fraud were estimated to exceed $40 billion, affecting individuals, businesses, and
financial institutions.
Detecting fraudulent transactions is complex because fraud patterns continuously evolve, and
fraudulent transactions make up a small fraction of the overall dataset. The task is analogous
to finding a needle in a haystack, where even small inaccuracies can lead to significant
financial consequences.
1.2 Importance of Fraud Detection
Fraud detection systems aim to distinguish between legitimate and fraudulent transactions,
ensuring customer security and maintaining trust in financial services. Effective fraud
detection systems must strike a balance between catching fraudulent transactions and
minimizing false positives.
Machine learning (ML) has emerged as a promising solution to these challenges, leveraging
historical transaction data to train predictive models capable of identifying suspicious
patterns.
1.3 Challenges in Fraud Detection
1. Class Imbalance: Fraudulent transactions are rare, often comprising less than 1% of
the total dataset. This imbalance can bias ML models toward predicting non-fraudulent
transactions.
2. Evolving Fraud Patterns: Fraudsters continuously adapt their methods to bypass
detection systems.
1.4 Objectives of the Study
To compare the models using evaluation metrics such as accuracy, precision, recall,
F1-score, and ROC-AUC.
To identify the most effective algorithm for fraud detection and discuss its advantages
and limitations.
1.5 Scope of the Study
The findings of this project are valuable for organizations aiming to enhance their fraud
detection capabilities. The study focuses on offline analysis using a publicly available dataset.
Future extensions may include real-time fraud detection systems, integration with streaming
platforms, or applying advanced deep learning techniques.
Fig: Fraud Detection
2. Dataset Description
2.1 Dataset Source
The dataset used in this project is the publicly available Kaggle Credit Card Fraud
Detection dataset, which contains anonymized transactions made by European cardholders.
The dataset is widely used in fraud detection research because of its real-world
characteristics and its significant class imbalance, making it a relevant choice for
evaluating classification algorithms.
2.2 Dataset Overview
Target Variable:
o Class: 1 for a fraudulent transaction, 0 for a legitimate one.
Features:
Most features are the output of a Principal Component Analysis (PCA) transformation,
applied for anonymization, except for:
o Time: The seconds elapsed between this transaction and the first transaction in
the dataset.
o Amount: The transaction amount.
Feature             Description
V1, V2, ..., V28    Principal components derived from the PCA transformation.
2.3 Class Imbalance
The dataset is highly imbalanced, with fraudulent transactions comprising only 0.172% of the
total observations.
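To make the 0.172% figure concrete, the short sketch below works through the arithmetic for a hypothetical batch of 100,000 transactions (the batch size is an assumption chosen for readability, not the size of the dataset). It also shows what a trivial rule that labels every transaction as legitimate would score.

```python
# Hypothetical batch size, chosen only to make the arithmetic easy to follow.
n_transactions = 100_000
fraud_rate = 0.00172  # 0.172% of observations, as reported above

n_fraud = round(n_transactions * fraud_rate)
n_legit = n_transactions - n_fraud

# A trivial "classifier" that labels every transaction as legitimate
# is right on every legitimate transaction and wrong on every fraud.
baseline_accuracy = n_legit / n_transactions
baseline_recall = 0 / n_fraud  # no fraud is ever flagged

print(f"fraud cases: {n_fraud}, legitimate cases: {n_legit}")
print(f"all-legitimate baseline accuracy: {baseline_accuracy:.3%}")
print(f"all-legitimate baseline recall:   {baseline_recall:.1%}")
```

The baseline reaches an accuracy above 99.8% while detecting no fraud at all, which is why accuracy is not a sufficient metric on its own for this task.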
Such imbalance makes fraud detection challenging, as models can achieve high accuracy by
simply predicting all transactions as legitimate. To address this, various techniques such as
SMOTE (Synthetic Minority Over-sampling Technique) and undersampling were considered
during preprocessing.
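The core idea behind SMOTE can be sketched in a few lines: a synthetic minority sample is placed at a random position on the line segment between a minority point and one of its nearest minority neighbours. The function below is a simplified illustration in plain Python, not the imbalanced-learn implementation; the function name, the sample points, and the value of k are all assumptions made for the example.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical 2-D minority (fraud) samples, for illustration only.
fraud_points = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5), (1.8, 2.6)]
new_points = smote_like_oversample(fraud_points, n_new=10)
print(len(new_points), "synthetic minority samples created")
```

Because every synthetic point lies between two real minority points, oversampling of this kind should be applied to the training subset only, so that the test subset keeps reflecting the real class ratio.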
2.4 Data Preprocessing
Before feeding the dataset into machine learning models, the following preprocessing steps
were performed:
2. Scaling Features
o The Amount and Time features were scaled using StandardScaler to bring
them onto a similar scale as PCA-transformed features.
3. Splitting Data
o The dataset was divided into 80% training and 20% testing subsets using a
stratified split to preserve the class imbalance ratio.
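The scaling and splitting steps above can be sketched with scikit-learn as follows. The data here is synthetic (5,000 rows with a 1% fraud rate and a single PCA-like column) and stands in for the real dataset. The sketch fits the scaler on the training subset only, which is a common precaution against leakage and an assumption on our part, not necessarily the order used in this project.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data: 5,000 rows with a 1% fraud rate.
rng = np.random.default_rng(42)
n_rows, n_fraud = 5_000, 50
df = pd.DataFrame({
    "Time": rng.uniform(0, 172_800, n_rows),
    "Amount": rng.exponential(80.0, n_rows),
    "V1": rng.normal(size=n_rows),
    "Class": np.r_[np.ones(n_fraud, dtype=int),
                   np.zeros(n_rows - n_fraud, dtype=int)],
})

X = df.drop(columns="Class")
y = df["Class"]

# stratify=y keeps the fraud proportion the same in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training rows only, then apply it to both subsets,
# so that test-set statistics do not leak into training.
scaler = StandardScaler()
X_train = X_train.copy()
X_test = X_test.copy()
X_train[["Time", "Amount"]] = scaler.fit_transform(X_train[["Time", "Amount"]])
X_test[["Time", "Amount"]] = scaler.transform(X_test[["Time", "Amount"]])

print("train fraud rate:", y_train.mean())
print("test fraud rate: ", y_test.mean())
```

With a stratified split, both subsets keep the same fraud proportion as the full data, so the test metrics are computed on a realistic class ratio.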
2.5 Dataset Challenges
Fig 2: Dataset
3. Methodology
3.1 Overview
The methodology focuses on using supervised machine learning techniques to classify
transactions as fraudulent or legitimate. The process involves:
1. Data preparation and preprocessing (discussed in Section 2.4).
2. Selection of machine learning algorithms.
3. Evaluation of performance using appropriate metrics.
4. Comparative analysis to identify the most effective algorithm.
This section outlines the step-by-step approach used to implement and evaluate the machine
learning models.
3.2 Workflow
The workflow consists of the following steps:
1. Dataset Preparation:
o Cleaning and preprocessing (as described in Section 2.4).
o Splitting the dataset into training (80%) and testing (20%) subsets using
stratified sampling.
2. Model Selection:
o Several supervised machine learning models were chosen for the analysis:
Logistic Regression: A simple and interpretable model, often used as a
baseline.
Random Forest: An ensemble method that builds multiple decision
trees to improve prediction accuracy.
Support Vector Machine (SVM): A model that finds the optimal
hyperplane to separate classes, particularly effective for small and
imbalanced datasets.
Gradient Boosting: A boosting algorithm that combines weak learners
iteratively to reduce errors (e.g., XGBoost).
3. Hyperparameter Tuning:
o Grid Search Cross-Validation was used to find the optimal hyperparameters
for each model, ensuring better performance.
o Example parameters tuned:
Logistic Regression: Regularization strength (C).
Random Forest: Number of trees, maximum depth.
SVM: Kernel type (linear, rbf), regularization parameter (C).
Gradient Boosting: Learning rate, number of estimators, maximum
depth.
4. Model Training:
o Models were trained on the preprocessed training dataset.
o Class weights were adjusted (where applicable) to handle the class imbalance.
5. Model Evaluation:
o Models were evaluated on the testing dataset using performance metrics such
as:
Accuracy: Overall percentage of correctly predicted transactions.
Precision: Proportion of true positives (frauds) among all predicted
positives.
Recall: Proportion of actual frauds detected.
F1-Score: Harmonic mean of precision and recall.
6. Comparative Analysis:
o Results from all models were compared using the evaluation metrics.
o The best-performing model was identified based on its ability to detect fraud
while minimizing false positives.
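The tuning, training, and evaluation steps of the workflow can be sketched for one model as shown below. This is a minimal illustration on synthetic data, assuming Logistic Regression with balanced class weights and a small hypothetical grid over C; the grid values, the F1 scoring choice, and the number of folds are assumptions, not the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic imbalanced data (about 5% positives) standing in for the real set.
X, y = make_classification(
    n_samples=2_000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# class_weight="balanced" reweights errors inversely to class frequency.
model = LogisticRegression(class_weight="balanced", max_iter=1_000)

# Hypothetical grid over the regularization strength C.
search = GridSearchCV(
    model,
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="f1",  # optimise for the minority class rather than accuracy
    cv=5,
)
search.fit(X_train, y_train)

print("best C:", search.best_params_["C"])
print(classification_report(y_test, search.predict(X_test), digits=3))
```

Scoring the search on F1 instead of the default accuracy keeps the tuning focused on the fraud class, which matters when the majority class dominates the data.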
3.3 Algorithms in Detail
1. Logistic Regression
o A probabilistic model that uses the logistic function to predict the probability of fraud.
o Works well on linearly separable data and serves as a baseline for comparison.
2. Random Forest
o An ensemble model that combines multiple decision trees.
o Feature importance analysis helps identify which features contribute most
to fraud detection.
o Robust to overfitting and performs well on imbalanced datasets.
3. Support Vector Machine (SVM)
o Effective in handling smaller datasets and works well with non-linear decision
boundaries (using kernel tricks).
o Requires careful tuning of kernel type and regularization parameters.
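As an illustration of the logistic function mentioned for Logistic Regression above, the sketch below maps a linear score to a fraud probability. The weights, bias, and feature values are hypothetical; a trained model would learn its own values from the data.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function: maps any real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fraud_probability(features, weights, bias):
    """Linear score w . x + b passed through the logistic function."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(score)


# Hypothetical weights and inputs; a trained model would learn these values.
weights = [0.8, -1.2]
bias = -3.0
p = fraud_probability([2.0, -1.0], weights, bias)
print(f"predicted fraud probability: {p:.3f}")
```

A transaction is flagged as fraudulent when this probability exceeds a chosen threshold; lowering the threshold typically raises recall at the cost of precision.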
Fig 3: Flowchart
4. Experimental Setup
4.1 Environment Setup
The experiments were conducted in a controlled environment with the following
specifications:
Hardware:
o Processor: Intel Core i7-12700H (12th Gen) or equivalent.
o RAM: 16 GB.
o Storage: SSD with 512 GB capacity.
o GPU: NVIDIA RTX 3060 (if applicable for advanced computations).
Software:
o Operating System: Windows 10/Ubuntu 20.04.
o Programming Language: Python 3.9.
o IDE/Notebook: Jupyter Notebook, PyCharm, or Google Colab (for cloud
execution).
Libraries and Frameworks:
o Pandas and NumPy: Data manipulation and analysis.
o Matplotlib and Seaborn: Data visualization.
o Scikit-learn: Machine learning algorithms, preprocessing, and evaluation.
o Imbalanced-learn: Class balancing techniques (e.g., SMOTE).
o XGBoost: For Gradient Boosting implementation.
o Joblib: For saving and reloading trained models.
4.3 Performance Metrics
The following evaluation metrics were used to assess model performance:
1. Accuracy: Measures the overall correctness of predictions. However, due to the
imbalanced nature of the dataset, it is not sufficient alone.
2. Precision: Focuses on the correctness of positive predictions (fraudulent
transactions).
3. Recall (Sensitivity): Focuses on the model's ability to identify all fraudulent
transactions.
4. F1-Score: Harmonic mean of precision and recall, balancing the trade-off between the
two.
5. ROC-AUC (Receiver Operating Characteristic - Area Under Curve): Evaluates
the model's ability to distinguish between the classes.
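The metrics listed above can be computed directly from confusion-matrix counts and predicted scores. The sketch below uses plain Python with hypothetical counts and scores, and the function names are our own. The ROC-AUC is computed from its pairwise definition: the probability that a randomly chosen fraud receives a higher score than a randomly chosen legitimate transaction.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def roc_auc(labels, scores):
    """Probability that a random fraud outscores a random legitimate case."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Ties between a fraud and a legitimate score count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical counts: 80 frauds caught, 20 missed, 10 false alarms.
m = classification_metrics(tp=80, fp=10, tn=9_890, fn=20)
print({k: round(v, 4) for k, v in m.items()})

# Hypothetical scores for six transactions (1 = fraud).
auc = roc_auc([0, 0, 1, 0, 1, 1], [0.1, 0.4, 0.35, 0.2, 0.8, 0.9])
print(round(auc, 4))
```

In this example the accuracy is 99.7% even though one fraud in five is missed, which illustrates why precision, recall, and F1 are reported alongside it.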
5. Results
3. Support Vector Machine (SVM):
o Advantages:
Handled the imbalanced dataset effectively, achieving 80.3% precision and
72.4% recall.
Kernel methods allowed it to capture non-linear relationships in the
data.
o Limitations:
Training time increased significantly as the dataset size increased.
Required extensive hyperparameter tuning to achieve optimal results.
4. Gradient Boosting (XGBoost):
o Advantages:
Outperformed other models across all metrics, achieving the highest
recall (81.3%), F1-Score (83.5%), and ROC-AUC (94.2%).
Effectively handled class imbalance by prioritizing misclassified
samples.
Robust against overfitting due to its regularization techniques.
o Limitations:
Computationally more expensive compared to Logistic Regression.
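Because the F1-score is the harmonic mean of precision and recall, the figures quoted above can be cross-checked against each other. The sketch below derives the F1-score implied by the SVM precision and recall, and the precision implied by the XGBoost F1-score and recall. Both outputs are derived values for illustration, not results reported by the experiments, and the helper names are our own.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def precision_from_f1(f1, recall):
    """Solve F1 = 2PR / (P + R) for P, given F1 and R."""
    return f1 * recall / (2 * recall - f1)


# Figures quoted in this report.
svm_f1 = f1_score(precision=0.803, recall=0.724)
xgb_precision = precision_from_f1(f1=0.835, recall=0.813)

print(f"SVM F1 implied by its precision and recall: {svm_f1:.1%}")
print(f"XGBoost precision implied by its F1 and recall: {xgb_precision:.1%}")
```

Under these formulas the SVM figures imply an F1-score of about 76.1%, and the XGBoost figures imply a precision of about 85.8%.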
6. Conclusion
6.1 Conclusion
This project focused on the performance analysis of various classification algorithms
for online payment fraud detection using an imbalanced dataset. Based on the results,
the following conclusions were drawn:
1. Gradient Boosting (XGBoost) emerged as the most effective algorithm for fraud
detection, achieving the highest recall (81.3%) and ROC-AUC (94.2%). Its ability to
prioritize minority class samples made it the best-suited model for this task.
2. Random Forest demonstrated a strong balance between precision (84.1%) and recall
(78.5%), making it a reliable choice for fraud detection when computational
efficiency is prioritized.
3. The use of class balancing techniques like SMOTE significantly improved the recall
of all models, ensuring better detection of fraudulent transactions without severely
impacting precision.
4. Simpler models like Logistic Regression performed reasonably well but struggled
with the dataset's class imbalance, missing a substantial portion of fraudulent
transactions.
This study highlights the importance of selecting robust machine learning algorithms
and addressing class imbalance to effectively detect online payment fraud.
7. References
This section provides a list of all the sources, datasets, libraries, and tools referenced or used
throughout the project. Proper citation ensures credibility and acknowledges the contributions
of others.
7.1 Dataset