New Report
PROJECT REPORT ON
By
DIVYASHREE S 1RN21EC049
JASWANT R 1RN21EC055
KHUSHI RAJ 1RN21EC064
2024-25
VISVESVARAYA TECHNOLOGICAL UNIVERSITY
Jnana Sangama, Belagavi - 590 018
RNS INSTITUTE OF TECHNOLOGY
Autonomous Institution Affiliated to VTU, Recognized by GOK, Approved by AICTE
(NAAC ‘A+ Grade’ Accredited, NBA Accredited (UG - CSE, ECE, ISE, EIE and EEE))
Channasandra, Dr. Vishnuvardhan Road, Bengaluru - 560 098
CERTIFICATE
Certified that the Project work entitled “Credit Card Fraud Detection Using
Machine Learning” is carried out by DIVYASHREE S (USN: 1RN21EC049),
JASWANT R (USN: 1RN21EC055), KHUSHI RAJ (USN: 1RN21EC064)
in partial fulfillment for the award of degree of Bachelor of Engineering in
Electronics and Communication Engineering of Visvesvaraya Technological University,
Belagavi, during the year 2024-2025. It is certified that all corrections and
suggestions indicated during internal assessment have been incorporated in the report.
The project report has been approved as it satisfies the academic requirements in
respect of the mini project work prescribed for the award of degree of Bachelor of
Engineering.
External Viva
1 .......................................... ...........................................
2 .......................................... ...........................................
DECLARATION
We hereby declare that the entire work embodied in this project report titled
“Credit Card Fraud Detection Using Machine Learning”, submitted to Visvesvaraya
Technological University, Belagavi, is carried out at the Department of
Electronics and Communication Engineering, RNS Institute of Technology,
Bengaluru, under the guidance of Dr. Leena Chandrashekhar, Associate Professor.
This report has not been submitted for the award of any Diploma or Degree of
this or any other University.
.........................
3. KHUSHI RAJ 1RN21EC064
Acknowledgements
The joy and satisfaction that accompany the successful completion of
any task would be incomplete without thanking those who made it possible.
We are proud of being students of RNS Institute of Technology, the
Institution which shaped us for a better future.
Divyashree S
Jaswant R
Khushi Raj
Abstract
Credit card fraud detection addresses a real-world problem. Credit card
fraud has increased sharply compared with earlier times: criminals use fake
identities and a variety of technologies to trap users and take their money,
so a solution to this type of fraud is essential. In this project we designed
a model to detect fraudulent activity in credit card transactions. The system
provides most of the important features required to detect illegal and illicit
transactions. As technology changes constantly, it is becoming difficult to
track the behavior and patterns of criminal transactions. With the advancement
of machine learning, artificial intelligence, and other related information
technologies, it has become possible to automate the process of detecting
credit card fraud. This not only streamlines the task but also helps reduce
the significant amount of labor typically involved.
We first collect a credit card usage data set, divide it into training and
testing data sets, and classify the transactions using the logistic regression
algorithm. With this algorithm we can analyze both the larger data set and the
current data provided by the user, and then work to improve the accuracy of the
results. Selected attributes are then processed, and the data are visualized
graphically to show which factors affect fraud detection. Performance is gauged
on the basis of accuracy, sensitivity, specificity, and precision. The best
accuracy obtained with the logistic regression algorithm is 98.6 per cent.
Table of Contents
Abstract
List of Figures
Acronyms
1 Introduction
1.1 Motivation
1.2 Objectives
1.3 Methodology
1.4 Block diagram
1.5 Applications
1.6 Advantages and Disadvantages
1.7 Organisation of Report
2 Literature Survey
3 Software Requirements
3.1 Software
3.2 Description of Software used
5 Result Analysis
References
List of Figures
1.1 Workflow
5.1 Result
Acronyms
ML : Machine Learning
LR : Logistic Regression
AI : Artificial Intelligence
F1S : F1-Score
TN : True Negative
FP : False Positive
FN : False Negative
TP : True Positive
Chapter 1
Introduction
Credit card fraud detection is one of the most critical areas of research in the financial
sector. Fraudulent transactions can result in significant financial losses and damage
the reputation of financial institutions. With the growth in online transactions, de-
tecting fraud has become more challenging due to the complexity of data patterns.
Machine learning (ML) algorithms, particularly logistic regression, are increasingly be-
ing used to detect fraudulent activities by identifying patterns in transaction data and
distinguishing between legitimate and fraudulent transactions. This project focuses
on credit card fraud detection using machine learning, particularly logistic regression.
The project aims to develop a model that can predict whether a given credit card
transaction is fraudulent or legitimate based on historical transaction data. Credit
card fraud refers to unauthorized use of a credit card or its details to make fraudulent
transactions. Fraud can take several forms, including:
Card-not-present fraud: Fraudsters make transactions without the physical card,
often in online shopping.
Card-present fraud: Fraudsters use a stolen physical card to make purchases in person.
Account takeover: A fraudster takes control of an existing cardholder’s account and
uses it for unauthorized transactions.
With the rise of online and mobile payments, detecting fraudulent activities in
real time has become paramount to preventing significant financial losses.
Credit card fraud is a pervasive and growing problem in the financial industry. As
more consumers turn to online shopping, digital transactions, and mobile payments,
the financial sector faces an increasing volume of transaction data that must be care-
fully monitored to identify fraudulent activity. Fraudulent transactions not only result
in significant monetary losses but can also damage the trust and reputation of financial
institutions, which are essential for maintaining customer relationships and business
stability. For financial institutions, detecting fraud in a timely manner is crucial. A
delayed response to fraud can lead to significant financial losses for both the insti-
tution and its customers. The complexity of modern fraud schemes, combined with
the ever-increasing number of transactions being processed globally, has made man-
ual detection methods and traditional rule-based systems less effective. Fraudsters
continuously evolve their tactics, finding new ways to bypass security measures.
Credit Card Fraud Detection Using Machine Learning 2024-25
1.1 Motivation
This project on credit card fraud detection using machine learning is driven by several
critical factors that reflect both the increasing prevalence of fraud and the potential
of machine learning to address this challenge effectively. Credit card fraud is a major
global issue, with billions of dollars lost each year to fraudulent activities. The sophis-
tication of fraudsters continues to grow, making traditional fraud detection methods
less effective. With the rise of online shopping and digital financial transactions, fraud
detection has become even more important, as fraudsters exploit digital platforms to
steal personal and financial information. Fraudulent activities are becoming increas-
ingly complex, with fraudsters using more advanced techniques such as identity theft,
synthetic identity creation, and account takeovers. Fraudulent transactions are of-
ten conducted in real time, and detecting them as soon as they happen is crucial to
minimize losses and protect consumers. Traditional fraud detection systems are often
based on predefined rules or heuristics, which may not be flexible enough to detect
novel or sophisticated fraudulent behavior.
1.2 Objectives
Data Collection and Preprocessing: Gather relevant datasets that contain credit
card transaction details, including features like transaction amount, merchant, trans-
action type, and customer demographics. Preprocess the data by handling missing
values, encoding categorical variables, scaling numerical features, and removing any
noise or outliers.
Feature Engineering: Identify important features that can help detect fraudulent
transactions (e.g., transaction amount, location, time of transaction, frequency of
transactions, etc.). Create new features or transform existing ones to enhance the
model’s ability to distinguish between fraudulent and non-fraudulent transactions.
Data Splitting: Split the dataset into training and testing sets to evaluate the
model’s performance effectively. This allows the model to be trained on a subset of
the data and tested on unseen data.
Model Evaluation: Evaluate the model’s performance using various metrics such
as accuracy, precision, and recall.
Handling Class Imbalance: Explore techniques to deal with the imbalance be-
tween fraudulent and non-fraudulent transactions, such as using SMOTE (Synthetic
Minority Over-sampling Technique), cost-sensitive learning, or balancing the dataset.
1.3 Methodology
Start by loading the dataset. This is usually a file with many fields for each
transaction, such as how much was spent, where, and when. Convert the data into a
table format (like an Excel sheet) so it is easy to work with. In programming, this
is often done using a data frame, which helps you view and manipulate the data.
Since there are often far fewer fraud cases than normal transactions, the data need
to be balanced. This can be done by making more copies of the fraud cases so that the
model gets a fair chance to learn what fraud looks like. Split the data into two
parts: one part is used to train the model (help it learn) and the other to test it
(see how well it learned). Usually 80 per cent of the data is used for training and
20 per cent for testing, although a 70-30 split is also common. Give the training
data to each model, which uses it to learn the patterns and differences between
fraudulent and non-fraudulent transactions; in this project the training set is used
to train the logistic regression model. The test set is used to evaluate the model’s
performance and generalization ability. Once the models are trained, use them to make
predictions on the test data and check how often each model gets the right answer
(accuracy). A confusion matrix shows the details of the model’s performance: how many
fraud cases it caught, how many it missed, and whether it labeled any good
transactions as fraud. Finally, compare the models on different criteria (such as
accuracy and how often they find fraud) to decide which is best for detecting fraud
in the dataset.
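The balancing step described above (making more copies of the fraud cases) can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `oversample_minority` and the toy data are assumptions made for this example and are not taken from the project code.

```python
import random


def oversample_minority(rows, labels, minority_label=1, seed=0):
    """Duplicate randomly chosen minority rows until both classes are the same size."""
    rng = random.Random(seed)
    minority = [r for r, y in zip(rows, labels) if y == minority_label]
    if not minority:
        raise ValueError("no minority-class rows to copy")
    majority_count = sum(1 for y in labels if y != minority_label)
    extra = max(0, majority_count - len(minority))
    copies = [rng.choice(minority) for _ in range(extra)]
    return list(rows) + copies, list(labels) + [minority_label] * len(copies)


# Toy data (hypothetical): 8 legitimate transactions (0) and 2 fraudulent ones (1).
rows = [[float(i)] for i in range(10)]
labels = [0] * 8 + [1] * 2
balanced_rows, balanced_labels = oversample_minority(rows, labels)
```

In many workflows the copies are made only in the training portion, so that duplicated fraud cases do not also appear in the test data.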
Data Preprocessing: Data preprocessing cleans and prepares the dataset for analysis
and modeling. It involves checking for and handling missing values, converting
categorical variables (if any) into numerical formats, normalizing or scaling data,
and dealing with class imbalance.
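The preprocessing steps named above can be illustrated with a small sketch. It assumes median imputation for missing values, min-max scaling, and integer encoding of categories; these particular choices, the helper names, and the sample values are assumptions for illustration only.

```python
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def encode_categories(values):
    """Map each distinct category to an integer code (sorted for stability)."""
    codes = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes


amounts = [10.0, None, 30.0, 50.0]             # hypothetical transaction amounts
channels = ["online", "pos", "online", "atm"]  # hypothetical categorical field

filled = impute_median(amounts)
scaled = min_max_scale(filled)
encoded, mapping = encode_categories(channels)
```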
The Area Under the Curve (AUC) is a key metric used in evaluating the performance
of classification models, including logistic regression. It is specifically related to the
Receiver Operating Characteristic (ROC) curve, which is a graphical representation
of the model’s performance across all possible classification thresholds.
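As a small worked example, the AUC can be computed directly from its probabilistic interpretation: it equals the fraction of (fraud, legitimate) pairs in which the fraudulent transaction receives the higher score, with ties counted as one half. The function name `roc_auc` and the scores below are illustrative assumptions.

```python
def roc_auc(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]   # hypothetical predicted fraud probabilities
auc = roc_auc(y_true, scores)    # 3 of the 4 pairs are ordered correctly -> 0.75
```

This pairwise form is quadratic in the number of samples, so it is meant for understanding the metric rather than for large datasets.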
Data Analysis: The objective of data analysis is to gain insights into the
distribution and relationship of features. We conduct exploratory data analysis to
understand patterns, trends, and correlations in the data. A statistical summary is
prepared using basic statistical measures (mean, median, variance, etc.) to
understand feature distributions. Correlation analysis is performed to identify the
variables most influential for predicting fraud. The data are also visualized to
better understand the trends and patterns.
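A statistical summary and a correlation check of the kind described here might look like the following sketch. The `pearson` helper and the sample values are assumptions for illustration; the real dataset would have many more rows and features.

```python
from math import sqrt
from statistics import mean, median, pvariance


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # correlation is undefined for a constant column; report 0 here
    return cov / (sx * sy)


amount = [5.0, 20.0, 12.0, 300.0, 8.0, 250.0]  # hypothetical transaction amounts
is_fraud = [0, 0, 0, 1, 0, 1]                  # hypothetical labels

summary = {
    "mean": mean(amount),
    "median": median(amount),
    "variance": pvariance(amount),
}
r = pearson(amount, is_fraud)  # close to 1: large amounts go with fraud in this toy data
```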
SMOTE (Synthetic Minority Over-sampling Technique) is a technique used to ad-
dress class imbalance in classification problems, where the number of instances in one
class (typically the minority class) is significantly lower than the number of instances
in the other class (majority class). Imbalanced datasets can lead to poor performance
of machine learning models like Logistic Regression, as the model tends to be biased
toward the majority class.
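The core idea of SMOTE can be sketched in a few lines: each synthetic sample is placed at a random position on the line segment between a minority sample and one of its nearest minority-class neighbours. The code below is a simplified illustration under that assumption, not a full implementation; the function name `smote`, the parameter `k`, and the sample points are hypothetical.

```python
import random
from math import dist


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Hypothetical fraud samples described by two numeric features.
fraud_points = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(fraud_points, n_new=5)
```

Unlike plain duplication, the new points are not exact copies of existing fraud cases, although they always lie between existing ones.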
Train-Test Split: We split the dataset into training and testing subsets to
evaluate the model’s performance. Typically, a 70-30 or 80-20 split is used. The
training set is used to train the logistic regression model. The test set is used
to evaluate the model’s performance and generalization ability.
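An 80-20 split can be sketched as below. The example assumes a stratified split, meaning each class is shuffled and divided separately so that the rare fraud cases appear in both subsets; stratification, the function name, and the toy data are assumptions for illustration.

```python
import random


def stratified_split(rows, labels, test_fraction=0.2, seed=0):
    """Shuffle each class separately so both subsets keep the same fraud ratio."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    X_train = [rows[i] for i in train_idx]
    X_test = [rows[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Toy data (hypothetical): 100 transactions, of which 10 are fraudulent.
rows = [[float(i)] for i in range(100)]
labels = [1 if i < 10 else 0 for i in range(100)]
X_train, X_test, y_train, y_test = stratified_split(rows, labels)
```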
Logistic Regression Model: Here, we initialize the logistic regression model and
train it on the training data. During training, the model learns the coefficients
(weights) associated with each feature. In the context of credit card fraud
detection, logistic regression predicts the probability that a given transaction is
fraudulent based on input features such as transaction amount, time, location, and
other transactional attributes. This probability is then converted into a class
label: fraudulent (1) or non-fraudulent (0). Logistic regression is a widely used
binary classification algorithm because it is simple, interpretable, and performs
well with linearly separable data.
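To make the mechanics concrete, the sketch below trains a logistic regression model with plain batch gradient descent on the log loss, then turns the predicted probability into a 0/1 label with a threshold of 0.5. The learning rate, epoch count, threshold, function names, and data are assumptions for this illustration; a library implementation would normally be used in practice.

```python
from math import exp


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    ez = exp(z)
    return ez / (1.0 + ez)


def predict_proba(w, b, x):
    """Probability that sample x is fraudulent under weights w and bias b."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_logistic(X, y, lr=1.0, epochs=3000):
    """Fit weights and bias by batch gradient descent on the mean log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi  # derivative of log loss w.r.t. z
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x, threshold=0.5):
    """Convert the predicted probability into a class label (1 = fraud)."""
    return 1 if predict_proba(w, b, x) >= threshold else 0


# Hypothetical one-feature data: larger scaled amounts are fraudulent.
X = [[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]]
y = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(X, y)
predictions = [predict(w, b, x) for x in X]
```

The learned weight for the single feature is positive, which is the kind of coefficient-level interpretation the text refers to.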
Evaluation: After training the logistic regression model on the training dataset,
the next crucial step is to assess its performance on the test dataset. The trained
model is used to predict the labels (fraud or non-fraud) on the test set, and these
predictions are assessed using accuracy, precision, recall, and other relevant
metrics. The test dataset contains unseen data, which allows us to evaluate how well
the model generalizes to new, real-world data. The goal is to understand the model’s
effectiveness in predicting whether a transaction is fraudulent or non-fraudulent.
This evaluation step is essential for validating the model’s practical utility.
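The metrics mentioned in this step all follow from the four confusion-matrix counts (TP, TN, FP, FN). The sketch below computes them for a small set of hypothetical labels and predictions; the function names and data are illustrative assumptions.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN), treating label 1 (fraud) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def safe_div(a, b):
    """Divide a by b, returning 0.0 when the denominator is zero."""
    return a / b if b else 0.0


def evaluate(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)  # also called sensitivity
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": safe_div(tn, tn + fp),
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


# Hypothetical test labels and predictions: 3 frauds, 7 legitimate transactions.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
metrics = evaluate(y_true, y_pred)  # TP=2, TN=6, FP=1, FN=1
```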
1.5 Applications
Credit card fraud detection using logistic regression has several important
applications, primarily in the financial services sector but also in related fields
that involve secure transactions and data analysis. The principles behind fraud
detection models, like those built using logistic regression, extend to other
industries and sectors that deal with sensitive or secure transactions and need
data-driven analysis. Below are some key applications of this project:
Credit Card Issuers and Banks Fraud Prevention: Banks and credit card is-
suers use fraud detection systems to monitor customer accounts continuously for any
fraudulent activities. By using machine learning models, such as logistic regression,
banks can reduce false positives and improve the accuracy of fraud detection, offering
better protection to customers while minimizing disruptions in legitimate transac-
tions. Example: A customer’s card is used for a high-value transaction, and the
model identifies it as potentially fraudulent, triggering an automatic notification to
the customer for verification.
Handling of Imbalanced Data: Credit card fraud datasets often suffer from class
imbalance, where fraudulent transactions are much fewer than legitimate ones. Lo-
gistic regression can effectively handle imbalanced data through techniques like class
weighting and oversampling of minority classes. This improves the accuracy of de-
tecting fraudulent transactions without biasing the model toward the majority class
(non-fraudulent transactions), thus reducing false negatives.
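Class weighting can be illustrated with a short sketch. It assumes the common "balanced" heuristic, in which each class is weighted by n_samples / (n_classes * class_count), and applies the weights inside the log loss so that errors on fraud cases cost more. The function names and data are assumptions for illustration.

```python
from collections import Counter
from math import log


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n / (k * count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


def weighted_log_loss(y_true, probs, weights, eps=1e-12):
    """Mean log loss in which each sample is scaled by its class weight."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # keep log() away from 0
        total += -weights[y] * (y * log(p) + (1 - y) * log(1 - p))
    return total / len(y_true)


# Hypothetical label distribution: 98 legitimate transactions and 2 frauds.
labels = [0] * 98 + [1] * 2
weights = balanced_class_weights(labels)   # fraud gets weight 25.0

# The same size of mistake is penalised far more on a fraud case.
missed_fraud = weighted_log_loss([1], [0.1], weights)
false_alarm = weighted_log_loss([0], [0.9], weights)
```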
Bias Toward Majority Class: Logistic regression models can be biased towards
the majority class (non-fraudulent transactions) in imbalanced datasets, especially
when no additional measures (e.g., class weighting, oversampling) are taken. This
can lead to a high number of false negatives (fraudulent transactions incorrectly
classified as legitimate) and therefore a low recall for detecting actual fraud.
This is a common challenge in classification tasks whenever the distribution of
classes (e.g., fraudulent vs. non-fraudulent transactions) is skewed.
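A small numerical example shows why accuracy alone is misleading on skewed data. It assumes a hypothetical dataset with 5 frauds in 1000 transactions and a model that always predicts the majority class.

```python
# Hypothetical skewed dataset: 995 legitimate transactions (0) and 5 frauds (1).
y_true = [0] * 995 + [1] * 5

# A model that is fully biased toward the majority class predicts 0 every time.
y_pred = [0] * len(y_true)

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
false_negatives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
recall = (sum(y_true) - false_negatives) / sum(y_true)
```

Here the accuracy is 99.5 per cent even though every fraud is missed (recall is zero), which is why recall and precision are reported alongside accuracy.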
Chapter 3 : System Analysis: This chapter gives us insight into the technical
details of our project such as the software requirement specification, hardware re-
quirement specifications, high-level design, etc.
In this chapter, we will summarize the research paper(s) we used as references for our
credit card fraud detection project, explaining the methodology behind the work and
how it directly contributed to the development of our project. This includes how we
adapted ideas from the referenced papers, incorporated them into our approach, and
customized these methodologies to suit our specific project goals.
Fraud Detection Systems (FDS) are automated, machine learning based solutions that
credit card companies employ to detect fraudulent transactions even before the end
user’s feedback. The goal of such a system is to detect a fraudulent transaction
before it is committed to the database and thus prevent the fraud from taking place.
An ideal FDS should also minimize false detections, where a genuine transaction is
interrupted, causing inconvenience to the end user. Machine learning based algorithms
work with large amounts of example data from the underlying domain to define a
computational model that classifies future data seen in the domain. One class of
these algorithms, called Supervised Learning Algorithms, requires the example data
classes to be pre-labeled. The other class of algorithms uses Unsupervised Learning,
where the data is clustered into similar groups and each group is treated as
belonging to one class. Many algorithms based on both approaches have been proposed
in the literature. FDS collect a lot of historical data to apply computations on
them, but the transaction data sets are typically imbalanced, with the number of
normal transactions far outnumbering the fraudulent ones. In this paper, we outline
and evaluate various popular machine learning algorithms with respect to their
capability to correctly classify fraudulent transactions in a real-world imbalanced
dataset[1].
fraudulent transactions in real-time datasets. Two methods under random forests are
used to train the behavioural features of normal and abnormal transactions:
random-tree-based random forest and CART-based random forest. Even though random
forest obtains good results on small data sets, there are still some problems in the
case of imbalanced data. Future work will focus on solving this problem, and the
random forest algorithm itself should be improved.
The performance of Logistic Regression, K-Nearest Neighbour, and Naïve Bayes is
analysed on highly skewed credit card fraud data, and research is carried out on
meta-classifiers and meta-learning approaches for handling highly imbalanced credit
card fraud data. Although supervised learning methods can be used, they may fail to
detect fraud in certain cases. A model combining a deep auto-encoder and a restricted
Boltzmann machine (RBM) can reconstruct normal transactions in order to find
anomalies that deviate from normal patterns. In addition, a hybrid method is
developed that combines AdaBoost and Majority Voting methods[2].
It is essential that credit card companies are able to detect fraudulent transactions
so that customers are not charged for items they did not purchase. Data science and
machine learning can be used to address these issues, and their importance could
hardly be greater. When someone defrauds you of your money or otherwise harms your
financial well-being through deception or other illegal means, this is referred to
as financial fraud. Billions of dollars’ worth of financial fraud is committed every
year. According to the Federal Trade Commission (FTC), the number of theft reports
has more than doubled in the last two years. One of the major types of financial
fraud is credit card fraud.
Research on fraud detection using machine learning in credit card problems has
received considerable attention. The paper considers the use of popular supervised
algorithms. However, problems remain with drift, limited data availability,
interpretability, and explanation. Future research opportunities include ensemble
methods, deep learning architectures, additional data sources, and real-time fraud
detection systems. Machine learning can analyse many patterns in credit card
information to combat fraud. However, explaining the model and keeping it up to date
remain problematic because the fraud patterns associated with credit card activity
change over time.
The introduction of credit cards and similar payment instruments has led to the
broad adoption of online transactions in daily life. A credit card is a plastic card
used for purchases and cash withdrawal. This invention made trading faster, thereby
improving business and increasing economic activity. Banks introduced cards to
provide consumers with convenient and efficient purchases without needing immediate
cash. Furthermore, by streamlining payments and improving the overall user
experience, cards encourage spending and help people establish credit histories,
which are helpful for various financial endeavours, including applying for loans and
mortgages. Additionally, credit cards benefit banks by creating new revenue streams
through interest charges associated with card usage, thereby facilitating the
development of a credit-based economy. Credit card fraud remains a global problem:
with the rise in online commerce and the growing use of credit cards, it has become
increasingly frequent[4].
During the literature search, many models created by other researchers were found,
which shows that the credit card fraud problem has been widely studied. Najdat’s
team used an approach based on bidirectional long short-term memory to build their
model, while other researchers tried different data splitting ratios to obtain
different accuracies. Sahin and Duman used Support Vector Machine (SVM) methods with
RBF, polynomial, sigmoid, and linear kernels.
The lowest accuracy of the four models that will be studied in this research, is 54.86%
for KNN and 36.40% for logistic Regression which were scored by Awoyemi and his
A credit card is often described as a card granted to a customer (the cardholder)
that enables them to buy goods and services within their credit limit or to withdraw
cash in advance, typically because funds are not available at the time. Credit cards
give cardholders the benefit of time, allowing consumers to postpone payments that
are due beyond a certain period of time by rolling them over to the following billing
cycle. The payments industry employs a process called credit card fraud detection,
which uses historical data to determine whether a transaction is fraudulent.
Detecting fraudulent credit card transactions can be a challenging task because it
involves identifying unauthorized usage of a credit card by an individual who does
not have control over the account[5].
Machine learning (ML) algorithms are used to assess all authorized transactions and
identify any that appear suspicious. Investigators then contact the cardholders to
ask whether the transaction was genuine or fraudulent. The conclusions drawn in this
investigation are based on the datasets used, which are outlined in the methodology
section. The work is concluded with the Conclusion section and suggestions for
further investigation on relevant topics. Credit card fraud occurs in a transaction
when a fictitious source of funds is created using a credit card. The majority of
credit card fraud detection methods depend on artificial intelligence, meta-learning,
and pattern matching as their founding principles.
As the number of online transactions grows, so does the number of credit card frauds.
An effective solution is necessary to reduce losses due to fraudulent transactions at
the initial stage.
Credit card fraud poses a significant threat to the financial sector, resulting in sub-
stantial financial losses. This research investigates the application of advanced ma-
chine learning techniques to effectively detect fraudulent transactions. By utilizing
a publicly available dataset, this study evaluates and compares various algorithms,
There are many types of fraud in our daily life. One of the frauds occurring these days
is credit card fraud. When people around the globe make credit card transactions,
there will also be fraudulent transactions. To avoid credit card fraud, we must know
the patterns and how the fraud values differ. This paper proposed credit card fraud
detection using machine learning based on the labeled data and differentiating the
fraudulent and legitimate transactions. The experiment was conducted using super-
vised machine-learning techniques.
The performance of Logistic Regression is analysed on highly skewed credit card
fraud data, and research is carried out on meta-classifiers and meta-learning
approaches for handling highly imbalanced credit card fraud data. Although supervised
learning methods can be used, they may fail to detect fraud in certain cases. A model
combining a deep auto-encoder and a restricted Boltzmann machine (RBM) can
reconstruct normal transactions in order to find anomalies that deviate from normal
patterns. In addition, a hybrid method is developed that combines AdaBoost and
Majority Voting methods[8].
Financial fraud is an ever-growing menace with far-reaching consequences in the
finance industry, corporate organizations, and government. Fraud can be defined as
criminal deception with the intent of acquiring financial gain. High dependence on
internet technology has led to increased credit card transactions, and credit card
transactions have become the most prevalent mode of payment for both online and
offline transactions. Data mining has played an imperative role in the detection of
credit card fraud in online transactions. Credit card fraud detection, which is a
data mining problem, is challenging for two major reasons: first, the profiles of
normal and fraudulent behaviour change constantly, and second, credit card fraud
data sets are highly skewed. The performance of fraud detection in credit card
transactions is greatly affected by the sampling approach applied to the dataset,
the selection of variables, and the detection technique(s) used. This paper
investigates the performance of logistic regression on highly skewed credit card
fraud data. The dataset of credit card transactions is sourced from European
cardholders and contains 284,807 transactions[9].
Challenges arise when the current system for detecting fraud does not sufficiently
tackle new and evolving fraud patterns. Credit card fraud is a significant concern
that causes damage to citizens, firms, and the economy. Credit card fraud cost an
estimated 27.85 billion just last year. This represents an increase of 16.2 per cent
from 23.97 billion in 2017. There are severe consequences from credit card fraud
experienced by most
Credit card fraud is a wide-ranging term for theft and fraud committed using or
involving a payment card at the time of payment. The purpose may be to purchase
goods without paying, or to transfer unauthorized funds from an account. Credit
card fraud is also an adjunct to identity theft. According to the United States
Federal Trade Commission, the rate of identity theft held stable during the mid
2000s but increased by 21 percent in 2008. However, credit card fraud, the crime
that most people associate with ID theft, decreased as a percentage of all ID theft
complaints. In 2000, out of 13 billion transactions made annually, approximately
10 million, or one out of every 1300 transactions, turned out to be fraudulent. It
would be necessary to investigate the possibility of using additional sources of
information to construct credit card fraud detection systems. Such information would
go beyond the transactional data discussed above and include user behaviour
patterns, geo-location, and device fingerprints.
In this way, additional data sources could feed machine learning models with more
comprehensive intelligence about user activity and, consequently, improve accuracy in
detecting possible fraud. Another promising area of research and development is
real-time credit card fraud detection systems. Financial institutions should develop
models that can analyse transactions in the shortest time possible, helping them to
spot fraud-related activity quickly, before it causes substantial monetary losses.
Credit card companies need to be able to detect fraudulent transactions so that
customers are not charged for items they did not purchase. Data science can be used
to address these issues, and its importance, together with that of machine learning,
could hardly be greater. When someone defrauds you of your money or otherwise harms
your financial well-being through deception or other illegal means, this is referred
to as financial fraud. Billions of dollars' worth of financial fraud is committed
every year. According to the Federal Trade Commission, the number of theft reports
has more than doubled in the last two years. One of the major types of financial
fraud is credit card fraud. As the number of online transactions grows, so does the
number of credit card frauds. An effective solution is necessary to reduce losses due
to fraudulent transactions at the initial stage.
In corporate and finance business, financial fraud has become a very crucial issue.
Moreover, financial fraud strongly affects business, contributes to economic
instability, and also affects people's cost of living. Several types of fraud, which
can be classified further, are major issues nowadays: credit card fraud, mortgage
fraud, money laundering, financial statement fraud, securities and commodities fraud,
automobile insurance fraud and healthcare fraud. In this paper, we focus on credit
card fraud and its detection techniques. An effective way to detect credit card fraud
is to use machine learning algorithms. This paper examines the latest advances and
applications in the field of machine learning-based credit card fraud detection[12].
The software requirements outlined in this document define the necessary features,
functionalities, and system constraints for developing the fraud detection system.
These include the need for real-time data processing, integration with transaction
databases, and the ability to train and update machine learning models based on
historical transaction data. Key aspects of the system include data pre-processing,
model training, anomaly detection, decision support, and system performance metrics
to ensure that the solution can effectively balance detection accuracy with response
time. The software must also be capable of handling different types of fraud
detection techniques, including supervised and unsupervised learning models, to adapt
to various fraud scenarios.
3.1 Software
The software used is Google Colab, a cloud-based integrated development environment
(IDE) that allows users to write and execute Python code in a web-based notebook.
Its notebooks run Python 3, and it suits a wide range of coding tasks. Google Colab
is particularly useful for data analysis, machine learning, and deep learning
projects, as it provides free access to computing resources such as GPUs and TPUs.
Additionally, it integrates with Google Drive, making it easy to store and share
notebooks. With its user-friendly interface and collaborative features, Google Colab
is a good choice for both beginners and advanced developers working on Python
projects.
management, which involves handling the execution of programs and managing sys-
tem resources like CPU time. Memory management is another essential aspect, as
the OS allocates and tracks memory usage to ensure that running programs do not
interfere with each other. Fraud detection systems often need to work in real time to
monitor transactions as they occur. The OS plays a critical role in managing network
communications between different systems, such as when data is transferred from the
point-of-sale (POS) systems to fraud detection servers.
It handles TCP/IP connections, HTTP requests, and data streaming protocols for
continuous monitoring. For fraud detection, APIs are used to send transaction data
from one system to another (e.g., payment gateway to fraud detection service),
enabling real-time fraud scoring. From managing system resources (CPU, memory, and
storage) to securing sensitive data and ensuring smooth real-time communication, the
OS is integral to the successful operation of a fraud detection system. It helps ensure
that fraud detection models can function efficiently, securely, and reliably in detecting
and preventing credit card fraud.
Python IDEs are essential for building and deploying a credit card fraud detection
system using machine learning. They provide a streamlined workflow for data
preprocessing, model building, debugging, and testing, as well as features for
collaboration. Google Colab is a free, cloud-based platform provided by Google that
enables users to write, execute, and share Python code within a Jupyter notebook
interface. It offers a powerful environment in which to develop, train, and deploy
machine learning models for credit card fraud detection. With its free access to
GPUs/TPUs, integration with popular libraries, collaboration features, and the ability
to handle large datasets, it significantly speeds up the machine learning development
cycle. It is well suited to experimenting with algorithms, visualizing data, and
building end-to-end fraud detection systems in a collaborative and scalable way. Its
accessibility, ease of use, and integration with hardware resources have made it
especially popular in data science, machine learning, artificial intelligence, and
education.
Designing the architecture for “Credit Card Fraud Detection Using Machine Learning”
involves a systematic approach that integrates multiple components and stages to
detect fraudulent transactions efficiently. The architecture is built to process large
datasets, train a model, and make predictions in real time. The key stages include:
4.1 Dataset
The use of datasets in credit card fraud detection is fundamental to training machine
learning models that can identify fraudulent transactions. These datasets typically
contain transaction records, including features such as transaction amount, merchant
information, time of transaction, geographical location, user ID, and past transaction
history. The most critical aspect of these datasets is their ability to represent both
legitimate and fraudulent transactions, with fraud being a much smaller subset, mak-
ing the data highly imbalanced. This imbalance requires special handling techniques
such as oversampling, undersampling, or generating synthetic data to improve model
performance. The dataset is usually split into training, validation, and test sets to
ensure the model can generalize well to unseen data.
Figure 4.1 (Dataset example) shows a preview of the first five rows of the dataset,
which has 31 columns. The table starts with a “Time” column, followed by numerical
features labeled V1, V2, V3, and so on up to V28, and ends with the Amount and Class
columns. Each cell contains a numeric value; the V features appear to be standardized
or normalized, as they include both positive and negative values, many of which are
fractions.
The “Time” column represents the temporal order of the records, while the V1 through
V28 columns are input features derived from the original transaction data. Their
names are generic, so the precise nature of these features is not clear from the
table. The dataset has therefore already been preprocessed before it is used here for
classification.
Training: Train these models using the preprocessed dataset, partitioned into training
and testing sets. Model training in credit card fraud detection is a critical step in
building a reliable machine learning system. The process begins by preparing the data,
which involves feature engineering, handling missing values, scaling features, and
addressing class imbalance. A balanced dataset is crucial, as fraud detection datasets
tend to be highly imbalanced.
Imports libraries: NumPy and Pandas are imported for numerical computation and data
manipulation, train_test_split for splitting the dataset into training and testing
sets, LogisticRegression for building a logistic regression model, and accuracy_score
for evaluating the model's performance. Numerical Computation: NumPy (Numerical
Python) is a powerful library for numerical computing in Python. It provides support
for handling large, multi-dimensional arrays and matrices, along with a collection of
mathematical functions to operate on these arrays. NumPy's array structure is far more
efficient than Python's built-in lists when dealing with large datasets, making it
indispensable for numerical analysis. One of the fundamental steps in machine learning
is to split the dataset into training and testing subsets. This ensures that the model
is evaluated on unseen data, allowing for an unbiased assessment of its generalization
capability. Splitting the dataset into these two subsets simulates how the model will
behave when deployed in real-world scenarios, where it will encounter unseen data.
Cross-validation can also be used to further refine the evaluation.
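The imports described above might look as follows in a Colab notebook. This is a minimal sketch: the tiny array is made-up stand-in data used only to show what an 80/20 split does, and the names features and labels are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Toy data (hypothetical): 10 samples with 2 features each.
features = np.arange(20).reshape(10, 2)
labels = np.array([0, 1] * 5)

# Hold out 20% of the rows so the model can later be scored on unseen data.
train_X, test_X, train_y, test_y = train_test_split(
    features, labels, test_size=0.2, random_state=0
)
print(train_X.shape, test_X.shape)
```

The held-out rows play the role of the unseen data that the model would meet after deployment.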
Loads a dataset: The dataset (credit_data.csv) is loaded into a Pandas DataFrame
(credit_card_data) from the specified path (/content/credit_data.csv). Loading the
CSV (Comma-Separated Values) file into a DataFrame imports the data into a structure
that is easy to manipulate and analyze. Pandas is a popular Python library for data
manipulation and analysis, and it allows us to work with data in a tabular format,
similar to an Excel sheet or an SQL table. This is the initial step in preparing the
data for further processing, analysis, or model training. To load the dataset, we use
the pd.read_csv() function from Pandas. This function reads the data from the
specified file path and converts it into a DataFrame, a 2-dimensional data structure
where rows represent data entries and columns represent features.
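The loading step can be sketched as below. To keep the example self-contained it reads a three-row, made-up excerpt from an in-memory string instead of the real file; in the notebook the argument would be the file path given in the text.

```python
import io
import pandas as pd

# Hypothetical three-row excerpt standing in for the real CSV file.
csv_text = """Time,V1,Amount,Class
0,-1.36,149.62,0
1,1.19,2.69,0
406,-2.31,0.00,1
"""

# pd.read_csv accepts a file path or any file-like object.
credit_card_data = pd.read_csv(io.StringIO(csv_text))

print(credit_card_data.shape)   # (rows, columns)
print(credit_card_data.head())  # first rows of the table
```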
Calling info() displays a summary of the dataset, including the number of entries,
the columns, and their data types. It also shows how many non-null values are present
in each column. Counting the values of the target variable (Class) shows the
distribution of the classes of transactions. The output indicates that the dataset has
284,315 legitimate transactions (label 0) and 492 fraudulent transactions (label 1).
The dataset is therefore highly unbalanced, with a much larger number of legitimate
transactions than fraudulent ones, which requires special handling during model
training. An initial analysis with functions like df.info() and df.describe() gives a
concise summary of the dataset, which is vital for understanding its structure and
characteristics.
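A small sketch of this inspection step is given below. The ten-row DataFrame is invented for illustration; only the last calculation uses the class counts reported above for the real dataset.

```python
import pandas as pd

# Hypothetical miniature dataset: 8 legitimate rows and 2 fraudulent rows.
credit_card_data = pd.DataFrame({
    "Amount": [12.5, 3.0, 99.9, 45.0, 7.25, 60.0, 1.0, 18.4, 250.0, 0.0],
    "Class":  [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

credit_card_data.info()                      # entries, columns, dtypes, non-null counts
missing = credit_card_data.isnull().sum()    # number of missing values per column
counts = credit_card_data["Class"].value_counts()
print(counts)

# Share of fraudulent rows in the toy data, as a percentage of all rows.
fraud_share = 100 * counts.loc[1] / counts.sum()
print(f"toy fraud share: {fraud_share:.1f}%")

# The same calculation with the counts reported for the real dataset.
real_share = 100 * 492 / (284315 + 492)
print(f"real dataset fraud share: {real_share:.3f}%")
```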
Separating the data for analysis: The dataset is filtered to select legitimate
transactions (where Class is 0), which are stored in the legit DataFrame, and
fraudulent transactions (where Class is 1), which are stored in the fraud DataFrame.
To separate the two groups we use Boolean indexing in Pandas, which selects rows based
on a condition and lets us store the results in separate DataFrames for further
analysis or model training.
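Boolean indexing as described here can be sketched as follows, using a made-up six-row DataFrame in place of the real data.

```python
import pandas as pd

# Hypothetical stand-in for the loaded dataset.
credit_card_data = pd.DataFrame({
    "Amount": [12.5, 3.0, 99.9, 45.0, 250.0, 0.0],
    "Class":  [0, 0, 0, 0, 1, 1],
})

# The comparison yields a Boolean Series; indexing with it keeps the True rows.
legit = credit_card_data[credit_card_data["Class"] == 0]
fraud = credit_card_data[credit_card_data["Class"] == 1]

# shape is (number of rows, number of columns) for each subset.
print(legit.shape, fraud.shape)
```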
The legitimate and fraudulent transactions are separated into different DataFrames
for the following reasons. Analyze each class separately: You might want to perform
different kinds of analysis, like looking at the distribution of amounts or features
for legitimate vs. fraudulent transactions. Balance the data: Since the dataset is
imbalanced, separating the classes allows you to apply techniques like oversampling
(for the minority class, i.e., fraud) or undersampling (for the majority class, i.e.,
legit) before combining the datasets back for model training. Model training: During
model training, you might prefer to handle the two classes separately. For example,
you may want to focus more on improving the detection of fraudulent transactions
(the minority class), or ensure the model learns from both legitimate and fraudulent
transactions while accounting for their imbalance.
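One plausible way to balance the data is random undersampling of the majority class, sketched below on invented rows. The name legit_sample is hypothetical, and the report does not state that this exact procedure was used; new_dataset is the name used for the combined data later in this chapter.

```python
import pandas as pd

# Hypothetical stand-in data: 6 legitimate rows and 2 fraudulent rows.
credit_card_data = pd.DataFrame({
    "Amount": [12.5, 3.0, 99.9, 45.0, 7.25, 60.0, 250.0, 0.0],
    "Class":  [0, 0, 0, 0, 0, 0, 1, 1],
})
legit = credit_card_data[credit_card_data["Class"] == 0]
fraud = credit_card_data[credit_card_data["Class"] == 1]

# Random undersampling: keep as many legitimate rows as there are fraud rows.
legit_sample = legit.sample(n=len(fraud), random_state=1)

# Stack the two groups row-wise into one balanced DataFrame.
new_dataset = pd.concat([legit_sample, fraud], axis=0)
print(new_dataset["Class"].value_counts())
```

Undersampling discards most legitimate rows, so it trades information for balance; oversampling the minority class is the alternative named in the text.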
Printing the shapes of the datasets: The output (284315, 31) indicates that there
are 284,315 legitimate transactions with 31 columns. The output (492, 31) indicates
that there are 492 fraudulent transactions with the same 31 columns. When we print
the shapes of the two datasets (legit and fraud), we are essentially asking how many
rows and columns each one contains.
Statistical measures of the data: The code displays summary statistics (count,
mean, standard deviation, min, max, quartiles) for the Amount column of the
legitimate transactions, giving an overview of their transaction amounts. Using
describe() in Pandas gives us key insights such as the count, mean, standard
deviation, min, max, and quartiles (25%, 50%, and 75%). The count represents the
number of non-null entries in the Amount column. For the legitimate transactions
(legit), this equals the number of legitimate transactions, provided there are no
missing values.
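The summary statistics can be sketched on a made-up Amount column as follows; the four values are illustrative only.

```python
import pandas as pd

# Hypothetical legitimate transactions with four amounts.
legit = pd.DataFrame({"Amount": [10.0, 20.0, 30.0, 40.0]})

# describe() returns count, mean, std, min, the three quartiles and max.
summary = legit["Amount"].describe()
print(summary)

# Individual statistics can be read by label.
print("median amount:", summary["50%"])
```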
new_dataset.head(): Displays the first 5 rows of the newly created new_dataset, giving
a quick preview of the combined data. The head() function in Pandas is a method that
shows the first 5 rows of a DataFrame by default. This is a quick and common way to
preview the data, especially after making modifications or combining different
datasets. new_dataset.tail(): Displays the last 5 rows of new_dataset, allowing a view
of the dataset's ending entries.
Both are convenient ways to inspect the contents of a dataset quickly, especially when
working with large datasets where printing the entire DataFrame is impractical.
The resulting DataFrame, stored in the variable X, contains all the feature columns
from new_dataset except for the Class column. Features (X): These are the input
variables used by the machine learning model to make predictions. In the case of
fraud detection, features could include transaction amounts, time, and various derived
variables (e.g., V1, V2, V3, etc.). Target Variable (Class): The Class column (which
was just dropped from X) is the output variable that the model aims to predict. This
column indicates whether a transaction is legitimate (0) or fraudulent (1).
Y = new_dataset['Class']: This extracts the Class column from new_dataset and
assigns it to the variable Y, the target variable that the model is trained to predict.
This is an important step in preparing the dataset for machine learning. In a
classification problem, the target variable is the value that the machine learning
model predicts based on the features. Here the Class column contains binary values: 0
for legitimate transactions and 1 for fraudulent transactions.
Since this is a supervised learning problem, the model learns to map input features
(X) to the target variable (Y). The model's goal is to learn the patterns in the
features that correspond to the target labels (fraudulent or legitimate). After
splitting the data into training and test sets, X (features) and Y (target) are used
to train the machine learning model.
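Separating the features from the target might look like this. The four-row new_dataset is invented for illustration and has far fewer columns than the real data.

```python
import pandas as pd

# Hypothetical stand-in for the balanced dataset.
new_dataset = pd.DataFrame({
    "Time":   [0, 1, 2, 3],
    "V1":     [-1.36, 1.19, -2.31, 0.5],
    "Amount": [149.62, 2.69, 0.0, 20.0],
    "Class":  [0, 0, 1, 1],
})

X = new_dataset.drop(columns="Class")  # every column except the label
Y = new_dataset["Class"]               # the label column only

print(X.shape, Y.shape)
```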
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y,
random_state=2): Splits the data (X for features and Y for target labels) into
training and testing sets. X contains all the information that the model will use to
make predictions, such as transaction amounts, times, and other relevant features. Y
is the target variable (Class column), which contains the labels (0 for legitimate or
1 for fraudulent transactions); it is the actual outcome that the model is trying to
predict based on the features in X. The train_test_split function from
sklearn.model_selection splits the dataset into training and testing sets. The
training set is used to train the model, while the testing set is used to evaluate the
model's performance. The test_size parameter specifies the proportion of the dataset
to be included in the test set: test_size=0.2 means 20% of the data is used for
testing, and the remaining 80% is used for training the model. An 80/20 split is
typical, but other ratios can be used depending on the situation. Passing stratify=Y
keeps the proportion of legitimate and fraudulent labels the same in both sets, and
random_state=2 fixes the random shuffling so that the split is reproducible.
Figure 4.10: Split the data into Training and Testing Data
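The split shown in the figure can be reproduced in miniature as below. The call uses the same arguments as in the text; the twenty-row X and Y are made-up stand-ins.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data: 10 legitimate and 10 fraudulent rows.
X = pd.DataFrame({"Amount": range(20)})
Y = pd.Series([0] * 10 + [1] * 10, name="Class")

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, stratify=Y, random_state=2
)

# Shapes of the full, training and testing feature tables.
print(X.shape, X_train.shape, X_test.shape)
# With stratify=Y the test set keeps the same class proportions as Y.
print(Y_test.value_counts())
```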
Instead of outputting a raw score, Logistic Regression passes the linear combination
z of the input features (a weighted sum of the features plus a bias term) through a
sigmoid function, which maps any real-valued number to a value between 0 and 1. The
model predicts the probability of class 1 (fraudulent) for each transaction. A
prediction is then made based on a threshold (typically 0.5): if the predicted
probability p is greater than or equal to 0.5, the model classifies the transaction
as fraudulent (Class 1); if p is less than 0.5, the model classifies the transaction
as legitimate (Class 0).
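The sigmoid and the thresholding rule can be written in a few lines of plain Python. This is an illustrative sketch, not the library's internal code; the function names sigmoid and classify are hypothetical.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real number to a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, this equivalent form avoids overflow in exp().
    e = math.exp(z)
    return e / (1.0 + e)

def classify(z: float, threshold: float = 0.5) -> int:
    """Return 1 (fraudulent) if the probability reaches the threshold, else 0."""
    return 1 if sigmoid(z) >= threshold else 0

for z in (-2.0, 0.0, 2.0):
    print(z, round(sigmoid(z), 3), classify(z))
```

Lowering the threshold below 0.5 would flag more transactions as fraudulent, which raises recall at the cost of more false alarms.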
model.fit(X_train, Y_train): Trains the logistic regression model using the training
data (X_train for features and Y_train for the target labels). This is a key step in
machine learning, as it allows the model to “learn” from the data, enabling it to make
predictions on unseen data. X_train contains the feature data (i.e., the independent
variables that describe each transaction, such as transaction amount, time, etc.).
Y_train contains the target labels (i.e., the class label for each transaction, where
0 represents a legitimate transaction and 1 represents a fraudulent transaction).
The Logistic Regression model uses this data to learn the relationship between the
input features (X train) and the target labels (Y train). In other words, it tries to
understand how the features of a transaction (like the amount, time, etc.) can help
determine if it’s fraudulent or legitimate. The model uses an optimization algorithm,
such as gradient descent, to minimize the cost function or loss function. The loss
function measures how well the model’s predictions match the true target labels. In
logistic regression, the loss function is typically the log-loss (cross-entropy loss), which
penalizes the model more when it makes incorrect predictions. The optimization pro-
cess iteratively adjusts the weights to minimize the log-loss function and improve the
model’s accuracy. The model continues adjusting the weights through multiple iter-
ations until it reaches a point where the loss function cannot be minimized further,
or the changes in the loss function are very small. At this point, the model has con-
verged, and it has learned the optimal set of weights for the given data.
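The optimization loop described above can be illustrated with plain gradient descent on a single feature. This is a teaching sketch on invented data: scikit-learn's LogisticRegression uses the L-BFGS solver by default rather than this simple update rule, and the learning rate and iteration count are arbitrary choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(w, b, xs, ys):
    """Mean cross-entropy between the labels and the predicted probabilities."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total += -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
    return total / len(xs)

# Hypothetical one-feature data: larger x values tend to be labelled 1.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]

w, b, learning_rate = 0.0, 0.0, 0.5
initial_loss = log_loss(w, b, xs, ys)

for _ in range(200):
    # Gradient of the mean log-loss with respect to the weight and the bias.
    grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / len(xs)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_loss = log_loss(w, b, xs, ys)
print(f"loss before: {initial_loss:.4f}, loss after: {final_loss:.4f}")
```

Each pass moves the weight and bias a small step against the gradient, so the loss after training is lower than the loss at the starting point.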
Model Performance Assessment: Accuracy gives an overall sense of how well the
model is doing in classifying the data into the correct categories (fraudulent or
legitimate). If the accuracy is high, it suggests that the model is making correct
predictions for most transactions. The training accuracy is a measure of how good the
model is at classifying the training data. However, for imbalanced datasets, like in
fraud detection where fraudulent transactions are much fewer than legitimate ones,
accuracy alone can be misleading.
Class Imbalance Problem: In highly imbalanced datasets (e.g., where there are far
more legitimate transactions than fraudulent ones), a model that always predicts
“legitimate” (0) could still achieve high accuracy, even if it fails to identify any
fraudulent transactions (Class 1). For example, if 99% of the data is legitimate, a
model that predicts legitimate for all transactions will have an accuracy of 99%, but
it will not help in detecting fraud, which is the goal of the model. In fraud
detection, recall (the fraction of actual fraudulent transactions that are detected)
is often more important than accuracy, because failing to detect fraud (false
negatives) can be costly. So, you might prioritize a model that has good recall, even
if it sacrifices some accuracy.
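The 99% example above can be checked directly. The sketch below uses invented labels (990 legitimate, 10 fraudulent) and a "model" that always predicts legitimate.

```python
# Hypothetical labels: 990 legitimate (0) and 10 fraudulent (1) transactions.
y_true = [0] * 990 + [1] * 10
y_pred = [0] * 1000   # always predicts "legitimate"

# Accuracy: share of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: share of actual fraud cases that were predicted as fraud.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
false_negatives = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy = {accuracy:.2%}, recall = {recall:.2%}")
```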
Accuracy on Test Data: X_test_prediction = model.predict(X_test): Uses the trained
model to make predictions on the test data (X_test). After training the model on the
training data, the next step is to test how well the model generalizes to unseen data.
This is done by calling the predict() method on the test data, which classifies each
test sample into one of the two classes: fraudulent (1) or legitimate (0). Test Data
(X_test): X_test consists of the features (input variables) of the test data, just as
X_train contains the features of the training data. The test data has not been used
during the training process, so it represents new, unseen data on which the model
makes its predictions. Trained Model (model): The model has already been trained using
model.fit(X_train, Y_train).
test_data_accuracy = accuracy_score(X_test_prediction, Y_test): Compares the predicted
values (X_test_prediction) with the actual values (Y_test) and calculates the accuracy
score for the test data. This is an important step in evaluating how well the model
performs when applied to unseen data. Accuracy is simply the percentage of correct
predictions the model makes out of the total number of predictions made.
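The training, prediction and scoring steps fit together as sketched below. The data is synthetic and balanced (one informative feature drawn from two normal distributions), so the accuracies it prints say nothing about the real dataset. scikit-learn documents the argument order as accuracy_score(y_true, y_pred); for accuracy the result is the same either way.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): 200 samples per class, one feature.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(2.0, 1.0, 200)]).reshape(-1, 1)
Y = np.array([0] * 200 + [1] * 200)

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, stratify=Y, random_state=2
)

model = LogisticRegression()
model.fit(X_train, Y_train)

# Accuracy on the data the model was trained on.
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

# Accuracy on held-out data the model has not seen.
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)

print(training_data_accuracy, test_data_accuracy)
```

A large gap between the two numbers would suggest overfitting; similar values suggest that the model generalizes.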
Exploratory Data Analysis (EDA) is a crucial phase in the data science workflow.
It involves analyzing and visualizing the data to better understand its structure,
identify patterns, check assumptions, and detect anomalies before building machine
learning models. For Logistic Regression, which is a supervised learning algorithm,
EDA primarily focuses on three things: understanding the relationships between the
features and the target variable; checking data quality to ensure there are no issues
like missing values, outliers, or incorrect data types; and visualizing the data to
better comprehend the distribution of features and their interaction with the target
variable. The goal of EDA is to ensure that the data is prepared and suitable for
modeling. Logistic Regression is used for binary classification tasks, where the
target variable (Y) has two possible values (e.g., 0 or 1, which in fraud detection
means legitimate or fraudulent). Each feature (or independent variable) needs to be
checked for its relationship with the target variable (e.g., Class in fraud
detection). It is important to know whether the features are linearly related to the
log-odds of the target, since Logistic Regression assumes this linearity. Visualizing
the features against the target variable helps to understand their distribution and
correlation with the target.
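A few of these checks can be done numerically before any plotting, as sketched below on an invented eight-row table. The column values are hypothetical and chosen only so that V1 differs visibly between the two classes.

```python
import pandas as pd

# Hypothetical stand-in for the balanced dataset.
new_dataset = pd.DataFrame({
    "V1":     [-0.5, 0.2, 0.1, -0.3, -3.0, -2.5, -4.1, -2.2],
    "Amount": [20.0, 5.5, 60.0, 12.0, 1.0, 99.0, 0.5, 250.0],
    "Class":  [0, 0, 0, 0, 1, 1, 1, 1],
})

# Data quality: missing values per column and column types.
print(new_dataset.isnull().sum())
print(new_dataset.dtypes)

# Relationship with the target: mean of each feature within each class.
class_means = new_dataset.groupby("Class").mean()
print(class_means)

# Linear correlation of each feature with the target label.
target_corr = new_dataset.corr()["Class"].drop("Class")
print(target_corr)
```

Features whose class means differ clearly, or whose correlation with Class is far from zero, are the ones most likely to help a linear model such as logistic regression.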
In this chapter we analyze the results obtained in the project. The main aim of
this project is the detection of fraudulent credit card transactions, which matters
because customers should not be charged for products that they did not buy. The
detection is performed with multiple ML techniques, and the outcomes of each
technique are then compared to find the model best suited to detecting fraudulent
credit card transactions; graphs and numbers are provided as well. In addition,
previous literature and the different techniques used to distinguish fraud within a
dataset are explored. The main objective of this project was to find the most
suitable machine learning technique for credit card fraud detection. The result of a
credit card fraud detection project using machine learning typically involves
evaluating the model's effectiveness in identifying fraudulent transactions, often by
measuring several key performance metrics.
The result of the project would typically be a trained model that can predict whether
a transaction is fraudulent or not with good accuracy and recall, alongside insights
into the effectiveness of various machine learning techniques and approaches to handle
imbalanced data. The dataset considered included 284,807 transactions, 492 of which
were fraudulent and the rest legitimate. These numbers show that the dataset is
severely skewed, with only 0.173 percent of transactions being classified as
fraudulent. Among the 31 features, Class has only two values: 1 in the case of a
fraudulent transaction and 0 otherwise. The most basic performance statistic is
accuracy, which is just the ratio of correctly predicted observations to all
observations. One might assume that a model is the best if it has a high level of
accuracy, but on a dataset this skewed a model can reach high accuracy while missing
most fraud. As a result, other parameters must be considered while evaluating the
models' performance.
The project focuses on applying multiple machine learning (ML) techniques to detect
fraudulent credit card transactions. The dataset contains a mixture of legitimate and
fraudulent transactions, and the goal is to determine the most effective model to
identify the fraudulent ones. After training multiple models, a comparison of their
results helps identify the best model for fraud detection. Key metrics such as
accuracy, recall, precision, and F1-score are used to evaluate and compare the models'
performance. To improve the model's ability to detect fraud, it is essential to
address the class imbalance present in the dataset. The dataset is heavily skewed,
with only 492 fraudulent transactions out of a total of 284,807, so only 0.173% of
the transactions are fraudulent. This is typical for fraud detection, where fraudulent
cases are rare but critical to identify. The imbalance introduces challenges in
training, as most models become biased toward predicting the majority class
(legitimate transactions). The dataset contains 31 columns describing the
transactions: numerical attributes and the binary Class label. The most important is
the Class column, which indicates whether a transaction is fraudulent (1) or
legitimate (0). Class imbalance is one of the most significant challenges in fraud
detection. A model that simply predicts that all transactions are legitimate would
achieve a high accuracy (over 99%) but would fail to detect any fraud. Therefore,
alternative evaluation metrics (such as recall and precision) are critical when
measuring the model's performance.
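The metrics named above follow directly from the four confusion-matrix counts. The sketch below uses made-up counts for a test set of 1,000 transactions; the function name classification_metrics is hypothetical.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    # Precision: of the transactions flagged as fraud, how many really were fraud.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the actual fraud cases, how many were flagged.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical counts: 10 fraud cases, 8 of them caught, 4 false alarms.
metrics = classification_metrics(tp=8, fp=4, fn=2, tn=986)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this invented case the accuracy is above 99% even though one fraud case in five is missed, which is why recall and precision are reported alongside it.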
The project implemented a machine learning approach to credit card fraud detection
using a logistic regression model, through the following steps:
Data Preprocessing: The dataset was cleaned, missing values were handled, and the
data was balanced using sampling techniques.
Model Building: A logistic regression model was trained using the features (such as
transaction amount, time, etc.) to predict whether a transaction is fraudulent
(Class = 1) or legitimate (Class = 0).
Model Evaluation: The model was evaluated using accuracy, reaching around 94 per cent
on the training set and 93.9 per cent on the test set. These results indicate that the
model is well suited to detecting fraudulent transactions and generalizes well to
unseen data.
While the logistic regression model provides a solid approach for fraud detection,
it is important to note that the dataset was imbalanced, with a much larger number
of legitimate transactions compared to fraudulent ones. Balancing techniques, such
as oversampling or undersampling, were used to address this issue.
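The text does not say which sampling variant was applied, so the following is only a minimal sketch of one common option, random undersampling of the majority class. The function name `random_undersample`, the toy rows, and the seed are assumptions made for illustration; only the `Class` column name comes from the report.

```python
import random

def random_undersample(rows, label_key="Class", seed=42):
    """Return a balanced list by down-sampling the majority class.

    `rows` is a list of dicts; `label_key` holds 0 (legitimate) or 1 (fraud).
    """
    rng = random.Random(seed)
    fraud = [r for r in rows if r[label_key] == 1]
    legit = [r for r in rows if r[label_key] == 0]
    # Keep every minority row and an equally sized random subset of the majority.
    minority, majority = (fraud, legit) if len(fraud) <= len(legit) else (legit, fraud)
    sampled_majority = rng.sample(majority, len(minority))
    balanced = minority + sampled_majority
    rng.shuffle(balanced)
    return balanced

# Toy data (made up for illustration): 1,000 legitimate rows and 5 fraudulent rows.
toy_rows = [{"Amount": float(i), "Class": 0} for i in range(1000)]
toy_rows += [{"Amount": 900.0 + i, "Class": 1} for i in range(5)]

balanced_rows = random_undersample(toy_rows)
fraud_count = sum(r["Class"] for r in balanced_rows)
print(len(balanced_rows), fraud_count)
```

Undersampling discards most legitimate rows, so it trades information about the majority class for a balanced training set; oversampling makes the opposite trade.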
are several important hyperparameters that can be optimized to improve the model’s
generalization and performance, especially in the context of fraud detection. By lever-
aging advanced algorithms and performing thorough hyperparameter tuning, we can
improve the performance of the Logistic Regression model for credit card fraud detec-
tion. Techniques such as regularization, class balancing, and ensemble methods can
enhance the model’s ability to identify rare fraudulent transactions and improve its
robustness to overfitting.
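To make the role of regularization and class balancing concrete, here is a small pure-Python sketch of logistic regression trained by batch gradient descent with an L2 penalty and per-class weights. It is not the implementation used in the project: the function names, the default values of `l2`, `lr`, and `epochs`, the class weights, and the toy data are all assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train_logistic_regression(X, y, l2=0.01, class_weight=None, lr=0.1, epochs=2000):
    """Batch gradient descent on class-weighted log loss with an L2 penalty.

    l2           -- regularization strength (illustrative default)
    class_weight -- dict mapping label -> weight, e.g. {0: 1.0, 1: 4.0}
    """
    class_weight = class_weight or {0: 1.0, 1: 1.0}
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    total_weight = sum(class_weight[label] for label in y)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # Errors on the heavier class contribute more to the gradient.
            err = class_weight[yi] * (p - yi)
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        for j in range(n_features):
            # The L2 term shrinks weights toward zero; the bias is not penalized.
            w[j] -= lr * (grad_w[j] / total_weight + l2 * w[j])
        b -= lr * grad_b / total_weight
    return w, b

def predict_proba(w, b, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Toy one-feature data (made up): large values are fraud, and fraud is rare.
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [0.7], [0.8], [2.5], [3.0]]
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

w, b = train_logistic_regression(X, y, l2=0.01, class_weight={0: 1.0, 1: 4.0})
print(predict_proba(w, b, [0.2]), predict_proba(w, b, [3.0]))
```

In a tuning run, `l2` and the fraud-class weight would be the values searched over, with each candidate scored on held-out data using recall and precision rather than accuracy.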
Handling Class Imbalance: The dataset is highly imbalanced, with the majority of
transactions being legitimate (class 0) and a very small proportion being fraudulent
(class 1). This is a significant challenge for any machine learning model, as the model
tends to be biased toward predicting the majority class and may underperform in
detecting the minority class.
To address this issue, we can apply advanced balancing techniques such as SMOTE
(Synthetic Minority Over-sampling Technique) and cost-sensitive learning. These
techniques aim to enhance the model’s ability to correctly identify fraudulent
transactions while maintaining the overall accuracy of the model. SMOTE handles
imbalanced datasets by generating synthetic samples for the minority class (fraudulent
transactions) rather than simply duplicating existing ones. This helps the model learn
more about the minority class and improves its predictions for that class. By applying
these techniques, the model can achieve better recall and precision for detecting
fraudulent transactions, which is vital in minimizing financial losses and ensuring a
secure transaction environment for customers.
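The core idea of SMOTE, creating new minority points by interpolating between an existing minority sample and one of its nearest minority neighbours, can be sketched in a few lines. This is a simplified illustration and not a library implementation; the function name `smote_like_oversample`, the choice `k=3`, and the toy samples are assumptions.

```python
import random

def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (the core SMOTE idea)."""
    rng = random.Random(seed)

    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: squared_distance(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([bv + gap * (nv - bv) for bv, nv in zip(base, neighbour)])
    return synthetic

# Toy fraud samples (made up): two features per transaction.
fraud_samples = [[1.0, 2.0], [1.2, 1.9], [0.9, 2.2], [1.1, 2.1]]
new_samples = smote_like_oversample(fraud_samples, n_new=6)
print(len(new_samples))
```

Because every synthetic point lies on a segment between two real fraud samples, the new points stay inside the region already occupied by the minority class instead of being exact copies. Synthetic samples should be generated from the training split only, so that the test set still reflects the real class ratio.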
Feature Engineering: More sophisticated feature engineering can be done to extract
additional relevant features or transform existing ones. Features such as user
behavior and transaction patterns over time can be valuable in identifying fraudulent
transactions. Feature engineering plays a crucial role in improving the model’s ability
to correctly identify fraudulent transactions. Its goal is to create new features or
transform existing ones so that they are more informative for the machine learning
model, especially when detecting anomalies like fraud. By improving the quality and
relevance of features, we can enhance the performance of the Logistic Regression
algorithm. Fraudulent
transactions are often characterized by unusual patterns and behaviors that are not
Model Evaluation Metrics: Instead of just accuracy, other evaluation metrics like
precision, recall, F1-score, and AUC-ROC curve should be considered, especially for
imbalanced datasets, to get a better sense of model performance on detecting fraud.
In the Credit Card Fraud Detection project, the primary objective is to correctly iden-
tify fraudulent transactions (Class = 1) while minimizing the number of false positives
(legitimate transactions incorrectly classified as fraud) and false negatives (fraudulent
transactions incorrectly classified as legitimate). Since the dataset is highly imbal-
anced, with a very small fraction of fraudulent transactions, evaluating the model’s
performance based solely on accuracy might not give a true picture of its effectiveness.
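As an illustration of how these metrics differ from accuracy, the sketch below computes accuracy, precision, recall, and F1-score for the fraud class from true and predicted labels. The helper name `classification_metrics` and the toy labels are assumptions made for this example and are not results from the project.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for the fraud class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fraud caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate flagged as fraud
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fraud missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legitimate passed

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels (made up): 100 transactions, 4 of them fraudulent.
y_true = [1, 1, 1, 1] + [0] * 96
y_pred = [1, 1, 0, 0] + [1] + [0] * 95

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

On these toy labels the accuracy is high even though half of the fraudulent transactions are missed, which the recall value makes visible.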