Enoch Project
INTRODUCTION
Credit card use has grown in popularity across numerous areas, including
healthcare, due to the simplicity with which transactions can be made with them.
Credit cards have made internet transactions more convenient and accessible as
society moves toward cashless transactions (Mehbodniya et al., 2021). Fraudulent transactions, on the other hand, result in a large loss of capital every year, a loss that is expected to rise in the coming years. Fraud can be identified manually, with fraud investigators examining each transaction and providing binary feedback on it, or automatically, with algorithms trained on the ways fraudulent transactions have occurred in the past. Health-care
fraud is a severe problem that affects both patients and providers of health-care
services. As a result, fraud detection is critical while doing online transactions.
The technique of examining the behavior of cardholder transactions to determine
whether they are genuine is known as fraud detection. The unauthorized use of a
credit or debit card, or a comparable payment method (ACH, EFT, recurring
charge, etc.) to obtain money or property is known as credit card fraud. Credit and
debit card numbers can be taken via unprotected websites or identity theft schemes.
Many techniques are constantly used to obtain sensitive information about credit card users, such as phishing and trojan viruses. As a result, robust technology for identifying various types of credit card fraud should be available. Standard algorithms such as Naive Bayes, Logistic Regression, K-Nearest Neighbor (KNN), and Random Forest, as well as deep learning techniques such as the sequential Convolutional Neural Network, are employed for detecting credit card fraud (Mehbodniya et al., 2021).
The study of computer algorithms that may learn and improve over time as a result
of their experience and usage of data is known as machine learning (ML) (Mitchell, 1997). It is a type of artificial intelligence. Machine learning algorithms create a model from training data and use it to make predictions or judgments without the need for explicit programming (Samuel, 1959). Machine learning
algorithms are employed in a range of applications where traditional algorithms are
difficult or impossible to design, such as medicine, email filtering, speech
recognition, and computer vision (Hu. J et al, 2020).
Several authors have used machine learning to model and assess publicly available
data; however, according to the comparative analysis, several approaches, including the sequential Convolutional Neural Network, Naive Bayes, Logistic Regression, K-Nearest Neighbor (KNN), and Random Forest, among others, produce superior results but still require further improvement.
However, because the K-Nearest Neighbor algorithm calculates the distance between data points for every training sample, it has a large computing cost. This
could have an impact on the accuracy of the outcome.
To address this problem, this study proposes a machine learning model for financial fraud detection in health care that uses the Recursive Feature Elimination (RFE) feature selection method with three classifiers: K-Nearest Neighbor, Naïve Bayes, and Logistic Regression, to classify the dataset and to analyze and evaluate the performance of the developed model.
1.3 Aim and Objectives
The aim of this project is to detect fraudulent credit card transactions in the health care sector. The objectives of this project are to:
i. Design a financial fraud detection model using an existing dataset from an open-source repository.
ii. Develop a machine learning model using the RFE feature selection method with KNN, Naïve Bayes, and Logistic Regression to create the intended detection model.
iii. Evaluate the implemented algorithms in terms of performance metrics.
iv. Compare this model to other previously developed models.
The significance of this study is to aid in improving fraud detection accuracy and performance. This study is aimed at the victims of financial fraud in the health care sector, such as patients, customers, and health service providers.
This project helps in the further reduction of credit-card-related financial losses caused by financial fraud in health care among victims such as patients, customers, and health care service providers. This study uses machine learning algorithms to detect financial fraud and dimensionality reduction to reduce the number of input variables in the training data.
CHAPTER 2
LITERATURE REVIEW
Medical professionals, organizations, and ancillary health care staff all work in the health services industry, providing medical treatment to those in need. Patients, families, communities, and the broader public all benefit from health services. Emergency, preventive, rehabilitative, long-term, hospital, diagnostic, primary, palliative, and home care services are all covered. These services strive to improve the accessibility, quality, and patient-centeredness of health care. Several different types of care and providers are required to deliver successful health services (Health Services: Definition, Types & Providers, 2018).
When someone defrauds you of money or harms your financial well-being in any
way by deception, fraud, or other unlawful means, it is known as financial fraud.
Identity theft and investment fraud are two examples of how this might be
performed. The bulk of victim compensation schemes do not reimburse money lost
due to deception or fraud. You should investigate your state's victim compensation
laws. Civil justice may be the only legal option for recovering money that has been
misappropriated. Regardless of the type of financial fraud, it is vital to report the
crime as quickly as possible to the appropriate agencies and law enforcement.
When fraudulent charges are discovered, they should be challenged or cancelled as
quickly as possible. Victims should also collect evidence connected to the crime,
such as bank statements, credit reports, and current and prior year tax forms, and
continue to file crucial information throughout the reporting process. (Apoorva,
2022)
Fraud is deception with the intent of obtaining an unlawful advantage for the
offender or depriving a victim of a right. Fraud may take many forms, including
tax fraud, credit card fraud, wire fraud, securities fraud, and bankruptcy fraud. A
single person, a group of people, or an entire corporation can engage in fraudulent
behavior (James & Margaret, 2021).
Fraud detection refers to a set of measures for preventing money or property from
being obtained via deception. Fraud detection is utilized in a wide range of
businesses, such as banking and insurance. Banking fraud includes things like
check forgery and the use of stolen credit cards. Other sorts of fraud include
exaggerating losses or fabricating an accident for the sole purpose of getting a
payout (Gillis, 2021).
With the limitless and ever-increasing number of ways someone may commit fraud, detection can be difficult. A company's ability to identify fraud might be harmed by reorganization, downsizing, transitioning to new information systems, or experiencing a cybersecurity incident. Techniques such as real-time fraud monitoring have been proposed. Fraud should be examined across all financial
transactions, locations, devices utilized, initiated sessions, and authentication
methods.
Algorithms for detecting fraud are present in all modern financial systems. They're
a crucial tool for financial institutions to avoid chargebacks, investigation fees,
government fines, and brand damage. A good preventive and detection system can
benefit companies in numerous ways. It can screen out the great majority of
fraudulent transactions, allowing security personnel to focus on other tasks.
However, not all systems for detecting fraud are the same. Machine learning-
based credit card fraud detection is an interesting new advancement in the field of
detecting payment abnormalities. It enables financial organizations to detect
fraudulent transactions with unprecedented precision. It aids in the reduction of
false positives for legitimate transactions. It accomplishes this while lowering
overall IT costs (Sidelov, 2021).
Machine learning (ML) is a type of artificial intelligence (AI) that allows software
to improve prediction accuracy without being created particularly for it. Machine
learning algorithms anticipate new output values using past data as input (Burns, 2021).
Machine learning is a type of data analysis in which analytical models are created
using artificial intelligence. It's a branch of artificial intelligence based on the
premise that computers can learn from data, recognize patterns, and make
judgments with little or no human input. Machine learning (ML) systems offer a
far more current and efficient method of automating safety procedures. ML algorithms have proven to be extremely successful for a number of big business firms (Sidelov, 2021). Machine learning approaches are commonly divided into three categories:
i. Supervised Learning
ii. Unsupervised Learning
iii. Reinforcement Learning
2.4.1.1 Supervised Learning
Supervised learning is a method of developing artificial intelligence (AI) that
includes training a computer system on input data that has been tagged for a certain
output. When given never-before-seen data, the model is trained until it can find
the underlying patterns and links between the input data and the output labels,
allowing it to offer suitable labeling results (Petersson, 2021).
Common supervised learning algorithms include:
i. Polynomial regression
ii. Random forest
iii. Linear Regression
iv. Logistic regression
v. Decision trees
vi. K-Nearest Neighbor
vii. Naïve Bayes
2.4.1.2 Unsupervised Learning
The use of artificial intelligence (AI) systems to find patterns in data sets that
comprise data points that are neither classified nor labeled is known as
unsupervised learning. As a consequence, the algorithms can categorize, label,
and/or organize the data points in the data sets without the need for outside
assistance. Unsupervised learning, in other words, allows the system to detect
patterns in data sets on its own. Even if no categories are specified, an AI system
will categorize unsorted data based on similarities and differences in unsupervised
learning. Compared to supervised learning systems, unsupervised learning
algorithms can handle more complicated processing tasks. Furthermore,
unsupervised learning is one method of putting AI to the test. (Pratt, 2021).
Filter, wrapper, and embedded techniques are the three types of feature selection methods, depending on how they interact with the classifier.
The Fisher score is one of the most often used supervised feature selection methods. Based on the Fisher score, the method returns the variables ranked in decreasing order; the variables can then be chosen based on the situation.
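As an illustration of score-based ranking, the short sketch below ranks features with scikit-learn's ANOVA F-statistic (f_classif), used here as a stand-in for the Fisher score since both rank features by between-class separation; the file and column names ('dataset_full.csv', 'Class') are assumptions, not taken from the original code.

import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Load the data; the file and label column names are assumed for illustration.
data = pd.read_csv('dataset_full.csv')
X = data.drop(columns=['Class'])
y = data['Class']

# Score every feature and keep the ten highest-scoring ones.
selector = SelectKBest(score_func=f_classif, k=10)
selector.fit(X, y)

# Sorting the scores in decreasing order gives the feature ranking.
ranking = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
print(ranking)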
The K-Nearest Neighbor (KNN) technique classifies items by learning data that is
nearest to the item based on previous and current data comparisons. KNN
determines the distance to the closest neighbor using the Euclidean distance formula, but other algorithms optimize the distance formula by comparing it to other related formulae to achieve the best results. To determine the distance to the nearest neighbor, the Euclidean distance formula in KNN will be compared to the normalized Euclidean, Manhattan, and normalized Manhattan distances (Lubis et al., 2020).
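A minimal sketch of that comparison, assuming scikit-learn; KNeighborsClassifier exposes the distance formula through its metric parameter, so Euclidean and Manhattan distances can be evaluated on the same split. Synthetic data stands in for the credit card dataset.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Synthetic, imbalanced stand-in data for illustration only.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the same KNN model with two different distance formulas and compare accuracy.
for metric in ['euclidean', 'manhattan']:
    knn = KNeighborsClassifier(n_neighbors=5, metric=metric)
    knn.fit(X_train, y_train)
    print(metric, accuracy_score(y_test, knn.predict(X_test)))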
In that study, many similarity measures for both numerical and binary data were
generated using a mixture of well-known distances, and the efficiency of k-NN for
categorizing such diverse data sets was studied (Ali et al., 2019). The trials used
six different datasets from various domains and two different types of metrics. For
heterogeneous data, the suggested measures beat Euclidean distance, suggesting
that the challenges faced by different data demand unique similarity metrics
adapted to the data characteristics (Ali et al., 2019).
One of the most fundamental probabilistic classifiers is Naive Bayes. Despite the
strong assumption that all characteristics are conditionally independent given the
class, it frequently performs wonderfully in a wide range of real-world settings.
Class probabilities and conditional probabilities are generated using training data
in the learning phase of this known structure classifier, and the values of these
probabilities are subsequently utilized to categorize fresh observations (Taheri & Mammadov, 2013).
Bayesian Networks (BNs) were first introduced by Pearl (1988). They are high-level representations of probability distributions over a set of variables X = X1, X2, ..., Xn. The two stages of BN learning are structure
learning and parameter learning. The former generates a directed acyclic graph
from the collection X. Each node in the graph represents a variable, and each arc
depicts a causal link between two variables, with the arc's orientation representing
the direction of causality. The causal node is referred to as the parent, while the
other node is referred to as the child, when two nodes are connected by an arc. The
set of parents of the node Xi is Pa(Xi), where Xi signifies both the variable (feature) and the related node. Finding the probability distributions, class probabilities, and conditional probabilities associated with each variable given a structure is referred to as parameter learning (Taheri & Mammadov, 2013).
As shown in Figure 2.4, the Naive Bayes classifier assumes that each feature
simply depends on the class. This signifies that the class is the single parent for
each feature. NB is appealing because it has a clear and robust theoretical
foundation that ensures optimal induction given a set of stated assumptions. The
independency assumptions of features with respect to the class are violated in some
real-world problems, which is a flaw. However, it has been demonstrated that NB
is remarkably resistant to such violations. NB is quick, easy to use, and effective because of its straightforward structure. It is also well suited to high-dimensional data because each feature's probability is calculated separately. According to Wu et al. (2008), NB is one of the top ten data mining methods (Taheri & Mammadov, 2013).
Let C stand for the class of observation X. Using the Bayes method to forecast the class of observation X, the highest posterior probability of

p(C \mid X) = \frac{p(C)\, P(X \mid C)}{P(X)}        (eqn 2.2)

should be found.
Using the premise that features X1, X2, ..., Xn are conditionally independent of each other given the class, we derive the NB classifier:

p(C \mid X) = \frac{p(C) \prod_{i=1}^{n} P(X_i \mid C)}{P(X)}        (eqn 2.3)
There are three distinct optimization models to estimate the class probabilities P(C) and the conditional probabilities P(Xi|C), i = 1, ..., n, in eqn 2.3 (Taheri & Mammadov, 2013).
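To make eqn 2.2 and eqn 2.3 concrete, the toy sketch below computes the Naive Bayes posterior for one observation with two binary features; all probability values are invented purely for illustration.

# Toy Naive Bayes posterior following eqn 2.2 / eqn 2.3 (illustrative values only).
prior = {'fraud': 0.02, 'legit': 0.98}            # p(C)
cond = {                                          # P(X_i | C) for two observed feature values
    'fraud': {'x1': 0.70, 'x2': 0.60},
    'legit': {'x1': 0.05, 'x2': 0.10},
}

# Numerator of eqn 2.3: p(C) multiplied by the product of the conditional probabilities.
scores = {c: prior[c] * cond[c]['x1'] * cond[c]['x2'] for c in prior}

# Dividing by P(X), the sum over classes, normalizes the scores into posteriors.
evidence = sum(scores.values())
posterior = {c: s / evidence for c, s in scores.items()}
print(posterior)                                  # the class with the highest posterior is predicted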
Figure 2.4: Structure of the Naive Bayes classifier, in which the class node C is the sole parent of the feature nodes x1, x2, x3, ..., xn.
In 2017, Awoyemi et al. compared the efficacy of several techniques such as Naive
Bayes, KNN, and Logistic Regression, when they looked at severely distorted
credit card fraudulent data. For a total of 284,807 transactions, customers
throughout Europe supplied credit card transaction information. A hybrid approach of undersampling and oversampling was applied to the skewed data, and the three approaches were tested on both the raw and the preprocessed data. Python was used to complete the work. The Naive Bayes, K-Nearest Neighbor, and Logistic Regression classifiers had optimal accuracies of 97.92 percent, 97.69 percent, and 54.86 percent, respectively, according to the data. According to the findings of the comparison, KNN outperforms the Naive Bayes and Logistic Regression approaches.
In 2017, Dal Pozzolo et al. offered three major contributions. First, the authors provided a formalization of the fraud detection problem that appropriately depicts the operating conditions of FDSs that monitor massive amounts of credit card transactions on a regular basis. The authors also demonstrated how to spot fraud by employing the most appropriate evaluation techniques. Second, to address class imbalance, concept drift, and verification delay, the authors designed and tested a novel learning method. Finally, the authors demonstrated the impact of the unequal class distribution and concept drift using a real-world data stream of over 75 million transactions collected over the course of three years. Two types of random forests are used to learn the behavior of regular and anomalous transactions.
In terms of credit card fraud detection, the framework presented by Xuan et al. in 2018 examined the results of several random forest variants against a variety of classification models. These studies used data from a Chinese e-commerce company.
In their study published in 2018, Jurgovsky et al. framed the fraud detection
problem as a sequence classification job and employed long short-term memory
networks to include transactional sequences. Furthermore, the system employs
cutting-edge attribute aggregation algorithms and reports the framework's findings
using standard retrieval metrics. When compared to a benchmark Random Forest
classifier, the LSTM improves identification accuracy on offline transactions
where the cardholder is physically present at merchants. Both sequential and
nonsequential learning systems benefit from manual attribute aggregation
strategies. Following an examination of true positives, it was determined that both
approaches detect different types of fraud, implying that they should be used
jointly.
In a study published in 2019, Varmedja et al. revealed many approaches for
determining if transactions are fraudulent or not. The credit card fraud
identification dataset was used in this research. Because the dataset was severely
imbalanced, the SMOTE method was used to oversample it. Furthermore,
attributes were picked, and the dataset was split into two sections: training data and
test data. The technologies used in the study were Logistic Regression, Random
Forest, Naive Bayes, and Multilayer Perceptron. The research shows that each
system is capable of accurately detecting credit card fraud. Additional anomalies
could be discovered using the developed framework. Systems that use supervised
learning approaches to detect credit card fraud are based on the premise that
fraudulent patterns can be learned from a review of previous transactions.
By merging supervised and unsupervised approaches, Carcillo et al. proposed a
hybrid methodology for enhancing fraud detection accuracy in 2019. Unsupervised
anomaly ratings created at various degrees of granularity are investigated and
assessed using a real, labeled credit card fraud identification dataset. Experimental
data show that the combination is effective and enhances identification accuracy.
In a study published in 2018, Randhawa et al. used machine learning approaches to
detect credit card fraud. To begin, traditional methods are employed. Then, using a
combination of AdaBoost and majority voting, hybrid techniques are used. The
framework's effectiveness is evaluated using a publicly available credit card
dataset. The information is then analyzed using a real-time credit card dataset
obtained from a financial institution. In addition, to test the approaches' resiliency,
distortion is inserted into the data samples. The results of the experiments reveal
that the majority voting method accurately detects instances of credit card fraud.
To identify credit card fraud issues, De Sá et al. introduced the Fraud-BNC
approach in 2018. The Bayesian network classification model is used to underpin
the proposed technique. Fraud-BNC was developed using a dataset from PagSeguro, Brazil's most popular online payment platform, and tested against two
cost-sensitive categorization algorithms. The obtained results were compared to
seven other techniques, and the methodology's cost efficiency and data
classification issue were evaluated.
In 2020, Sailusha et al. developed a credit card fraud detection model. The focus of
this study is on machine learning techniques. The AdaBoost and Random Forest
methodologies were used in this study. To compare the outcomes of the two
methods, the accuracy, precision, recall, and F1-score are used. To create the ROC
curve, the confusion matrix is used. These two techniques were evaluated in terms
of performance criteria like accuracy, precision, recall, and F1-score. The best
methodology for detecting fraud is the one that has the best performance metrics.
Bagga et al. proposed a framework in 2020 to compare the efficacy of various
methodologies on credit card fraud data, including Logistic Regression, Naive
Bayes, Random Forest, KNN, AdaBoost, Multilayer Perceptron, Pipelining, and
Ensemble Learning. The variables used and the method used to detect fraud have
an impact on the effectiveness of fraud detection.
Zhaohui Zhang proposed a Convolutional Neural Network-Based Model for
Detecting Online Transaction Fraud in 2018. It creates an input feature sequencing
layer that allows raw transaction features to be reorganized into multiple
convolutional patterns. When compared to the existing CNN for fraud detection,
the experimental results show that the model achieves excellent fraud detection
performance without derivative features, with precision stabilizing at 91 percent
and recall stabilizing around 94 percent, an increase of 26 percent and 2 percent
respectively.
Srivastava et al. (2008) used the Hidden Markov Model (HMM) to describe the
credit card transaction process. For this analysis, HMM was used as a detector for
fraudulent transactions after being programmed with specific cardholder behavior.
Following the training phase, the incoming credit card transactions were checked
using the model. If HMM did not accept the incoming credit card transaction, it
would be considered a fraud. The main disadvantage of this approach is that HMM
generates a high rate of false alarms in both positive and negative situations.
Halvaiee and Akbari (2014) proposed an Artificial Immune System-based Fraud
Detection Model (AISFDM) for detecting credit card fraudulent behavior. In this
approach, AIS was used as the artificial immune detection mechanism. An
algorithm inspired by the immune system was developed to improve the accuracy
of fraud detection. It does not, however, improve classification accuracy.
Duman and Ozcelik (2011) addressed the issue of detecting fraudulent credit card
transactions. For better classification performance, the authors first introduced a
new classification cost function for fraud detection, and then combined two meta-heuristic algorithms, the genetic algorithm and scatter search.
Krivko (2010) used a data-customized approach to detect plastic card fraud. To
compensate for the shortcomings of the individual methods, the proposed approach
combined supervised and unsupervised methodologies. The proposed method first
tracked changes in transaction behavior over time, and then assigned scores to each
fraudulent transaction based on the assumption of fraud behavior. The rule-based
filters were fed a sequence of transactions with scores greater than a certain
threshold value. The rules were then generated from those transactional records
with the goal of improving the detection's performance. However, it is also critical
to keep the saving information in order to improve detection.
Lei and Ghorbani (2012) proposed an Improved Competitive Learning Network
(ICLN) and a clustering algorithm of the Supervised Improved Competitive
Learning Network (SICLN). To represent the data centers, the ICLN neural network was programmed to use a reward-punishment update rule. SICLN then used the
updated rule and achieved better results during clustering by assigning class labels
to guide the training process. Improving the SICLN's convergence speed
necessitates the use of an efficient method.
Ravisankar et al. (2011) compared data-driven fraud detection approaches based on
past fraudulent behavior and financial ratios. The authors compared methods for
detecting fraud in business financial statements, including Multilayer Feed
Forward Neural Network (MFFNN), Support Vector Machines (SVM), Genetic
Programming (GP), Group Method of Data Handling (GMDH), Logistic
Regression (LR), and Probabilistic Neural Network (PNN). Then, feature selection
methods were used to extract fraudulent transactions from the dataset, and fraud
behavior was found to be effective. GP, which was found to be the best method among them, suffers from marginally lower accuracy in detecting fraudulent financial reporting activities.
Glancy and Yadav (2011) proposed a Computational Fraud Detection Model
(CFDM). CFDM discovered incorrect details in annual filings with the assistance
of the US Securities and Exchange Commission (SEC) using information
presented in a text document for the detection process.
A. Shen et al. (2007) demonstrated the effectiveness of classification models on the credit card fraud detection problem, proposing three classification models: decision tree, neural network, and logistic regression. Among the three models, the neural network and logistic regression outperformed the decision tree.
Y. Sahin and E. Duman (2011) conducted research on credit card fraud detection in which seven classification algorithms were employed. To reduce the
risk of the banks, they used decision trees and SVMs in this study. They propose
that Artificial Neural Networks and Logistic Regression classification models are
more useful in improving fraud detection performance.
To increase the efficiency of identifying financial fraud, Li and Wong (2015)
developed Grammar-based Multi-Objective Genetic Programming with Statistical
Selection Learning (GBMGP-SSL). To adjust the goal values of each solution, this
system employed token competition. To maintain variety, similar objective values
of various definitions were separated.
For credit card fraud detection, Van Vlasselaer et al. (2015) developed Anomaly
Prevention Using Advanced Transaction Exploration (APATE). The proposed
approach combines past transactional trends and customer actions into useful
features that are then matched with incoming transactions. APATE has the
advantage of being able to detect fraudulent transactions in as little as six seconds.
The proposed method, however, was inapplicable to defining a group of fraudulent
behavior.
CHAPTER 3
METHODOLOGY
We used three key stages in this study: dataset, feature selection, and classification. This study uses open-source data from https://data.world/vlad/credit-card-fraud-detection, which is passed through a feature selection method called Recursive Feature Elimination (RFE) to select the most relevant features and reduce dimensionality; the subset obtained is then used for classification with K-Nearest Neighbor, Logistic Regression, and Naïve Bayes. Afterwards, performance measures are used to compare and assess the results. The proposed framework for this investigation is shown in Figure 3.1.
Recursive Feature Elimination
RFE assesses the features by significance and returns the top-n features after
removing the least important features, where n is the user's input.
To use Recursive Feature Elimination, it must be imported from the sklearn.feature_selection library. The major parameters it takes are the estimator used to rank the features, the number of features to select, and the step, which controls how many features are removed at each iteration.
After that, the coefficient associated with each attribute, obtained from the estimator's coef_ or feature_importances_ property, is considered. Those coefficients are
essentially the same as the ones we get after fitting the model to the dataset and
minimizing the residuals. The relevance of these coefficients with the target
variable is shown by their value. The feature with the smallest absolute coefficient
value is deemed the least significant, and so on.
The least important coefficient is then removed from the list of characteristics, and
the model is rebuilt using the remaining features. The step parameter determines
the number of features to be dropped at each iteration. It is preferable to remove
one feature at a time because the coefficient values of other features change when
the model is rebuilt.
With each iteration, it rebuilds the model, removing the least important feature(s) and continuing the process until only the requested number of features remains. After that, it assigns a score to each feature depending on when it was eliminated. The feature that was eliminated first receives the highest rank number, and so on, while the final n features that remain are each given a rank of one (Mittal, 2020).
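A minimal sketch of this procedure with scikit-learn's RFE, assuming a logistic regression estimator and the file and column names used elsewhere in this report ('dataset_full.csv', 'Class'); the selected features appear in support_, and every retained feature receives rank 1 in ranking_.

import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Load the data; the label column name is assumed to be 'Class'.
data = pd.read_csv('dataset_full.csv')
X = data.drop(columns=['Class'])
y = data['Class']

# Recursively eliminate one feature per iteration until nine remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000),
          n_features_to_select=9,
          step=1)
rfe.fit(X, y)

print('Selected features:', list(X.columns[rfe.support_]))
print('Feature ranking:', dict(zip(X.columns, rfe.ranking_)))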
KNN measures the distance between two data points Xi and Xj with m features using the Euclidean distance:

Dist_{ij} = \sqrt{\sum_{l=1}^{m} (X_{il} - X_{jl})^{2}}        (eqn 3.2)
We run the KNN algorithm numerous times with different values of K to find the K that reduces the number of errors we encounter while preserving the algorithm's ability to make correct predictions when it is given data it has not seen before.
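The K search can be sketched as a simple loop over candidate values, recording the misclassification rate on held-out data and keeping the K with the lowest error; synthetic data stands in for the reduced credit card feature set.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the nine selected features.
X, y = make_classification(n_samples=2000, n_features=9, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Track the test error for each candidate K and keep the best one.
errors = {}
for k in range(1, 16):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    errors[k] = 1 - knn.score(X_test, y_test)

best_k = min(errors, key=errors.get)
print('Best K:', best_k, 'with error rate:', errors[best_k])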
Naïve Bayes is a statistical approach based on Bayesian theory that determines the outcome based on the highest likelihood. Based on the given value, it calculates the likelihood of the unknown value. To anticipate unknown probabilities, logic and prior information can be used. Binary classes and conditional probabilities are the foundations of Naïve Bayes (Mehbodniya et al., 2021).
prob(class_j \mid feature_k) = \frac{prob(feature_k \mid class_j) \times prob(class_j)}{prob(feature_k)}        (eqn 3.4)

prob(feature_1, \ldots, feature_m \mid class_j) = \prod_{k=1}^{m} prob(feature_k \mid class_j)        (eqn 3.5)
Input:
Training dataset T,
F = (f1, f2, f3, ..., fn) // values of the predictor variables in the testing dataset
Output:
A class label for the testing instance
Steps:
1. Read the training dataset T and compute the prior probability of each class.
2. Compute the conditional probability of each predictor value fi given each class.
3. Compute the posterior probability of each class using eqn 3.4 and eqn 3.5.
4. Assign the testing instance to the class with the highest posterior probability.
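In practice these steps correspond closely to scikit-learn's GaussianNB, which estimates the class priors and per-feature conditional distributions during fit and returns the highest-posterior class from predict; the sketch below assumes the file and label column names used elsewhere in this report.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Load and split the data (assumed file and column names).
data = pd.read_csv('dataset_full.csv')
X = data.drop(columns=['Class'])
y = data['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# fit() learns class priors and per-feature mean/variance; predict() returns the
# class with the highest posterior probability for each test instance.
nb = GaussianNB().fit(X_train, y_train)
print(nb.predict(X_test.iloc[:5]))
print(nb.predict_proba(X_test.iloc[:5]))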
3.6.2 Specificity
Specificity, also known as the true negative rate, is the number of correct negative predictions divided by the total number of actual negatives, that is, the sum of the true negatives and the false positives. A specificity value ranges from 0 to 1.

specificity = \frac{\sum TrueNegative}{\sum ConditionNegative} = \frac{\sum TrueNegative}{\sum FalsePositive + \sum TrueNegative}
3.6.3 Precision
Precision is the number of correct positive predictions divided by the total number of positive predictions. A precision value ranges from 0 to 1.

precision = \frac{\sum TruePositive}{\sum TruePositive + \sum FalsePositive}
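Both metrics can be read off a confusion matrix; a small sketch with toy labels, assuming 1 marks the fraud (positive) class:

from sklearn.metrics import confusion_matrix

# Toy ground truth and predictions for illustration (1 = fraud).
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# ravel() unpacks the 2x2 matrix as true negatives, false positives,
# false negatives, and true positives.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
print('specificity:', specificity, 'precision:', precision)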
3.7 System Configuration
To carry out the study, a 64-bit HP Pavilion 15 system with 16 GB RAM and an Intel Core i5-10300H 2.5 GHz processor will be used.
CHAPTER 4
The dataset was stored in the Downloads folder on the system and it contains 284,807 instances and 32 attributes. The dataset was loaded using the following code: data = pd.read_csv('dataset_full.csv').
The shape of the data, which has 284,807 instances and 32 attributes, is returned by data.shape. The shape of the dataset is depicted in Figure 4.3.
Datasets are described in the program using data.describe(), a method for calculating statistics from a data frame's numerical values, such as the mean, percentiles, and standard deviation (std). The count, mean, standard deviation, minimum, 25th, 50th, and 75th percentiles, and maximum were calculated for this dataset. The credit card dataset's data description is shown in
Figure 4.4.
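The loading and inspection steps described above can be summarized in a few lines; the file name matches the one quoted earlier, while the 'Class' label column is an assumption about the dataset.

import pandas as pd

data = pd.read_csv('dataset_full.csv')    # dataset stored locally, as noted above
print(data.shape)                         # expected: (284807, 32)
print(data.describe())                    # count, mean, std, min, percentiles, max
print(data['Class'].value_counts())       # class distribution, assuming 'Class' is the label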
This is a data-splitting approach for machine learning that divides data into training, test, and validation sets. Each algorithm divided the data into subgroups for training and testing. The training set was used to fit the model, and the test set was used to conduct the evaluation. In this study, 80% of the data was used for training and 20% for testing. Figure 4.5 depicts the data for training and testing.
Figure 4.5 Diagram showing split dataset.
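A sketch of the 80/20 split, reusing the data frame loaded above; stratification is added here as a common choice for imbalanced fraud data, although the original split settings are not stated.

from sklearn.model_selection import train_test_split

# Separate features and label ('data' and 'Class' as in the previous sketch).
X = data.drop(columns=['Class'])
y = data['Class']

# 80% for training, 20% for testing; stratify keeps the fraud ratio in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
print(X_train.shape, X_test.shape)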
The optimal number of selected features offered in this study is 9, and RFE is used to choose the relevant features that are most suitable for performance. The features selected using the Recursive Feature Elimination technique are shown in Figure 4.6.
Figure 4.6 Selecting features from the dataset using RFE
4.4.1 Evaluating RFE for classification
RFE is evaluated for classification in order to assess its accuracy when paired with a decision tree classifier; the model is evaluated and its accuracy is reported. Figure 4.7 shows the evaluation of RFE for classification.
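One common way to run this check, assumed here rather than taken from the original code, is to wrap RFE and the decision tree in a pipeline and report cross-validated accuracy:

from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# RFE selects nine features with a decision tree, then a decision tree classifies them.
pipeline = Pipeline([
    ('rfe', RFE(estimator=DecisionTreeClassifier(), n_features_to_select=9)),
    ('model', DecisionTreeClassifier()),
])

# X and y as prepared in the earlier sketches.
scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')
print('Mean accuracy: %.3f (std %.3f)' % (scores.mean(), scores.std()))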
The Bayesian theorem is utilized to create the Naïve Bayes algorithm, which is used to handle classification problems. Simple and effective classification methods, such as the Naïve Bayes classifier, are critical for quickly constructing machine learning models that can make accurate predictions. The confusion matrix and ROC curve of Naïve Bayes after RFE are shown in Figures 4.13 and 4.14.
Figure 4.13 Confusion Matrix of Naïve Bayes
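A sketch of how the confusion matrix and ROC curve for the Naïve Bayes model can be produced, assuming the train/test split from the earlier sketches:

import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, roc_curve, auc

# Fit Naive Bayes and print its confusion matrix on the test set.
nb = GaussianNB().fit(X_train, y_train)
print(confusion_matrix(y_test, nb.predict(X_test)))

# ROC curve from the predicted probability of the fraud class.
y_score = nb.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_score)
plt.plot(fpr, tpr, label='Naive Bayes (AUC = %.3f)' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], linestyle='--')   # chance line
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()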
This type of graph is vital to the study of statistics because it can reveal the degree
of correlation between selected features or variables. Observation and visualization
of relationships between two numeric variables are the primary purposes of scatter
plots. Individual data points are represented by dots in a scatter plot, while the data as a whole is represented by the patterns those dots form. Figure 4.15 shows the scatter plot visualization
in the credit card dataset.
Figure 4.15 Scatter Plot Graph
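A minimal scatter-plot sketch; the two column names below are placeholders for whichever selected numeric features are plotted.

import matplotlib.pyplot as plt

# Plot two numeric features against each other, colored by class ('V1', 'V2' are assumed names).
plt.scatter(data['V1'], data['V2'], c=data['Class'], s=5, alpha=0.5)
plt.xlabel('V1')
plt.ylabel('V2')
plt.title('Scatter plot of selected features')
plt.show()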
Experiments were carried out, and the results are shown in Table 4.1: KNN, Logistic Regression (L1), Logistic Regression (L2), and Naïve Bayes performed with accuracies of 99.9%, 99.9%, 99.9%, and 98.1%, respectively.
Table 4.2 shows the comparison of the result with other works.
5.1 Summary
To detect credit card fraud, this study uses machine learning approaches. Different conventional models are introduced and evaluated, including Logistic Regression, Decision Tree, K-Nearest Neighbor, and Naïve Bayes. Logistic Regression and KNN beat Naïve Bayes in terms of performance, according to the technique used in this study. This might arise because the dataset is insufficient to train on and uncover hidden patterns in order to anticipate future or upcoming data, and because the weights' initialization was extremely random, potentially interfering with the training process. The dataset was imbalanced and high-dimensional, and it was refined by removing less relevant features using Recursive Feature Elimination.
5.2 Conclusion
The credit card dataset is open to the general public. To achieve accuracy, a variety of standard models are trained and evaluated, and the best model using both stored and real-time data is picked. The dataset is used to train and test machine learning classifiers, and their performance is assessed using a variety of credit card fraud indicators. When contrasted with the sequential pattern of previously expected fraud detection data, our research shows that online and offline transactions have distinct characteristics.
5.3 Recommendation