Real-Time Credit Card Fraud Detection Using Machine Learning
Anuruddha Thennakoon, Chee Bhagyani, Sasitha Premadasa, Shalitha Mihiranga, Nuwan Kuruwitaarachchi
Faculty of Computing
Sri Lanka Institute of Information Technology
Colombo, Sri Lanka
[email protected], [email protected], [email protected], [email protected], [email protected]
Abstract: Credit card fraud events take place frequently and result in huge financial losses [1]. The number of online transactions has grown considerably, and online credit card transactions hold a huge share of these transactions. Therefore, banks and financial institutions place great value on, and have high demand for, credit card fraud detection applications. Fraudulent transactions can occur in various ways and can be put into different categories. This paper focuses on four main fraud occasions in real-world transactions. Each fraud is addressed using a series of machine learning models, and the best method is selected via an evaluation. This evaluation provides a comprehensive guide to selecting an optimal algorithm with respect to the type of fraud, and we illustrate the evaluation with an appropriate performance measure. Another key area that we address in our project is real-time credit card fraud detection. For this, we use predictive analytics performed by the implemented machine learning models and an API module to decide whether a particular transaction is genuine or fraudulent. We also assess a novel strategy that effectively addresses the skewed distribution of data. The data used in our experiments come from a financial institution under a confidential disclosure agreement.
Keywords: credit card frauds, fraud detection system, fraud detection, confidential disclosure agreement, real-time credit card fraud detection, skewed distribution.
I. INTRODUCTION
Fraud has been increasing drastically with the progression of state-of-the-art technology and worldwide communication [5]. Fraud can be avoided in two main ways: prevention and detection. Prevention avoids attacks from fraudsters by acting as a layer of protection. Detection happens once prevention has already failed. Therefore, detection helps in identifying and alerting as soon as a fraudulent transaction is triggered. Recently, card-not-present transactions [6] in credit card operations have become popular among web payment gateways. According to the Nilson Report in October 2016, more than $31 trillion were generated worldwide by online payment systems in 2015, an increase of 7.3% over 2014. Worldwide losses from credit card fraud rose to $21 billion in 2015 and will possibly reach $31 billion by 2020 [3]. However, there has been an extreme increase in fraudulent transactions, which affects the economy dramatically. Credit card fraud can be classified into several categories. The two types of fraud that can mainly be identified in a set of transactions are card-not-present (CNP) frauds and card-present (CP) frauds. Those two types can be described further as bankruptcy fraud, theft/counterfeit fraud, application fraud, and behavioural fraud. Our study addresses four kinds of fraud that belong to the CNP fraud category described above, and we propose a method to detect those frauds in real time.
Machine learning is this generation's solution, which replaces such methodologies and can work on large datasets, something that is not easily possible for human beings. Machine learning techniques fall into two main categories: supervised learning and unsupervised learning. Fraud detection can be done in either way, and which one to use can only be decided according to the dataset. Supervised learning requires prior classification of anomalies. During the last few years, several supervised algorithms have been used in detecting credit card fraud.
The data used in this study are analyzed in two main ways: as categorical data and as numerical data. The dataset originally comes with categorical data. The raw data can be prepared by data cleaning and other basic preprocessing techniques. First, categorical data can be transformed into numerical data, and then appropriate techniques are applied to do the evaluation. Secondly, categorical data are used directly in the machine learning techniques to find the optimal algorithm.
This paper selects optimal algorithms for the four fraud patterns through an extensive comparison of machine learning techniques, using an effective performance measure for the detection of fraudulent credit card transactions.
The rest of this paper is organized as follows. Section II presents the literature review. Section III provides the experimental methodology, including results. Finally, conclusions and discussions of the paper are presented in Section IV.
II. LITERATURE REVIEW
In earlier studies, many approaches have been proposed to detect fraud, ranging from supervised and unsupervised approaches to hybrid ones, which makes it essential to learn the technologies associated with credit card fraud detection and to have a clear understanding of the types of credit card fraud. As time progressed, fraud patterns evolved, introducing new forms of fraud and making it a keen area of interest for researchers. The remainder of this section describes single machine learning algorithms, machine learning models and fraud detection systems that were used in fraud detection. The problems encountered in the review have been analyzed for the
978-1-5386-5933-5/19/$31.00 ©2019 IEEE
Authorized licensed use limited to: PES University Bengaluru. Downloaded on September 29,2022 at 03:11:42 UTC from IEEE Xplore. Restrictions apply.
later use in implementing an efficient machine learning model.
With the analysis of various detection models, past researchers have found many problems regarding fraud detection. In [14] and [3], the lack of real-life data is mentioned as a huge issue. Real-life data are lacking because of data sensitivity and privacy issues. Papers [3] and [7] have studied imbalanced data, or the skewed distribution of data. The reason behind this is that frauds are far fewer than non-frauds in the transaction datasets. Paper [3] states that data mining techniques take time to execute when dealing with big data. Overlapping of data is another major drawback in the preparation of credit card transaction data. According to papers [2] and [7], the issue occurs in scenarios where legitimate transactions look exactly like fraudulent transactions. Conversely, fraudulent transactions may appear to be legitimate transactions. They also came across the difficulty of dealing with categorical data. In credit card transaction data, most of the features have categorical values, and almost all machine learning algorithms do not support categorical values. In [3] and [4], the choice of detection algorithms and feature selection is mentioned as a challenge in detecting frauds, since most machine learning algorithms take much more time for training than for predicting. Another key issue that affects financial fraud detection is feature selection. It aims to filter out the attributes that best describe the aspects of fraud detection and its characteristics. In paper [7], fraud detection cost and lack of adaptability are highlighted as challenges in the fraud detection process. When considering a system, the cost of fraudulent behaviour and the prevention cost should be taken into consideration. Lack of adaptability occurs when the algorithm is exposed to new types of fraud patterns and normal transactions. Effectiveness can change according to the problem definition and its specifications, so having a good understanding of the performance measure is necessary [4].
There are different kinds of models implemented for credit card fraud detection. In those models, different algorithms have been used.
Adapting the fraud detection system to newly introduced frauds can be problematic: retraining the machine learning model after drastic changes in the fraud patterns may be costly and risky. For instance, Cody et al. extended a framework proposed in [12], implemented the model and applied it to a real-world transaction log. To address the classification problem, Logistic Regression (LR) was used. The instances of fraudulent transactions were discretized into strategies by using Gaussian Mixture Models (GMMs). Here, the synthetic minority oversampling technique was used to address the class imbalance. To bring out the significance of estimates in economic value, sensitivity analysis was used. The results showed that a practical method which uses minimal steps to retrain a model could perform the same as a classifier that typically retrains every round [13].
There is another model called Risk-Based Ensemble (RBE) that can handle data containing issues and give outstanding results. For handling imbalanced data, a highly efficient bagging model has been used. To handle the implicit noise in the transaction dataset, they used the Naive Bayes algorithm [9]. Roy et al. evaluated several deep learning algorithms with respect to their efficacy. The four topologies are Recurrent Neural Networks (RNNs), Gated Recurrent Units (GRUs), Long Short-Term Memory networks (LSTMs), and Artificial Neural Networks (ANNs). In their project, in addition to data cleaning and other data preparation steps, they overcame class imbalance and scalability problems by using undersampling. To discover which hyper-parameters had the highest influence on the performance of the model, a sensitivity analysis was carried out. They discovered that the performance of the model was affected by the size of the network, and concluded that the larger the network, the better the performance [11].
Credit card data have the issue of skewed distribution, which is also known as class imbalance. Dal Pozzolo et al. address class imbalance along with other issues such as concept drift and verification latency. They have also illustrated the most relevant performance metrics that can be used in credit card fraud detection. The achievement of the research also includes a formal model and a powerful learning strategy for addressing the 'verification latency' and an 'alert and feedback' mechanism. Based on their experiments, they declared the precision of the alerts to be the most important measure [15].
Randhawa et al. used twelve standard models and hybrid methods which use AdaBoost and majority voting to achieve better accuracy rates in credit card fraud detection [16]. They were evaluated using both benchmark and real-world data. The strengths and limitations of the methods were summarized. The Matthews Correlation Coefficient (MCC) was taken as the performance measure. To evaluate the robustness of the algorithms, noise was added to the data. They also showed that the majority voting method was not affected by the added noise.
The analysis carried out on highly imbalanced data in paper [17] shows that KNN gives outstanding performance for sensitivity, specificity and MCC, except for accuracy. Paper [18] discussed commonly used supervised techniques and provided a thorough evaluation of supervised learning techniques. It also showed that the performance of all algorithms changes according to the problem area.
The fraud detection system presented in paper [19] is built to handle class imbalance, the formation of labelled and unlabelled data, and the processing of large datasets. The proposed system was able to overcome all the challenges.
III. EXPERIMENTAL METHODOLOGY
A. Data description
The dataset was created by combining two data sources: the fraud transactions log file and the all transactions log file. The fraud transactions log file holds all the online credit card fraud occurrences, while the all transactions log file holds
9th International Conference on Cloud Computing, Data Science & Engineering (Confluence) 489
all transactions stored by the corresponding bank within a specified time period. Due to the confidential disclosure agreement made between the bank and the authors of the paper, some of the sensitive attributes, such as the card number, were hashed. When evaluating the combined dataset, the shape of the data was highly skewed due to the imbalanced numbers of legitimate transactions and fraudulent occurrences. The file with the fraud cases had 200 records, while the transaction log file had 917781 records. Attributes of the two data sources are as follows.
TABLE I. ATTRIBUTES OF THE GENUINE TRANSACTIONS LOG
Field Name            Description
CARD_NO               Credit cardholder's hashed card number.
DATE                  Date of the transaction.
TIME                  Time of the transaction.
TRANSACTION_AMOUNT    Transaction amount.
MERCHANT_NAME         Merchant name relevant to the transaction.
MERCHANT_CITY         Registered city of the merchant.
B. Data preparation
Collected raw data were first divided into four datasets according to fraud pattern. This was done with the information gained from the bank. The four datasets are:
1. Transactions with risky Merchant Category Code (MCC).
2. Transactions larger than $100.
3. Transactions with risky ISO response code.
4. Transactions with unknown web addresses.
Those four datasets were used in two different ways:
1. By transforming raw data into a numerical form. (Type A)
2. By preparing raw data categorically without making any transformation. (Type B)
Type A was applied to datasets 1, 2 and 3, and Type B was applied to dataset 4. In data preparation the data are cleaned, transformed, integrated and reduced. The first three datasets were subjected to all of the above-mentioned steps to prepare them numerically. To prepare the data categorically, all the steps except data transformation were applied.
The basic steps involved in Type A are described below.
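The following Python sketch is an illustration only and is not the authors' procedure. It shows one way the split into the four fraud-pattern datasets and a Type A style categorical-to-numerical transformation could look. The field names mcc, iso_response_code and web_address, the risk lists, and the use of integer label encoding are all assumptions, since the paper does not publish the bank's risk information or name the encoding it used.

```python
from collections import defaultdict

# Hypothetical risk lists: the paper obtains this information from the bank
# and does not publish it, so these values are placeholders.
RISKY_MCC = {"5967", "7995"}
RISKY_ISO_CODES = {"05", "51"}
KNOWN_WEB_ADDRESSES = {"shop.example.com"}
AMOUNT_THRESHOLD = 100.0  # "Transactions larger than $100"


def fraud_patterns(txn):
    """Return the names of the pattern datasets a transaction belongs to."""
    patterns = []
    if txn["mcc"] in RISKY_MCC:
        patterns.append("risky_mcc")
    if txn["amount"] > AMOUNT_THRESHOLD:
        patterns.append("amount_over_100")
    if txn["iso_response_code"] in RISKY_ISO_CODES:
        patterns.append("risky_iso_code")
    if txn["web_address"] not in KNOWN_WEB_ADDRESSES:
        patterns.append("unknown_web_address")
    return patterns


def split_by_pattern(transactions):
    """Group transactions into pattern datasets (one may land in several)."""
    datasets = defaultdict(list)
    for txn in transactions:
        for name in fraud_patterns(txn):
            datasets[name].append(txn)
    return dict(datasets)


def label_encode(rows, field):
    """Map each category of `field` to an integer code without mutating rows."""
    categories = sorted({row[field] for row in rows})
    codes = {value: code for code, value in enumerate(categories)}
    return [{**row, field: codes[row[field]]} for row in rows], codes


# Two made-up transactions, used only to exercise the functions above.
transactions = [
    {"mcc": "5967", "amount": 20.0, "iso_response_code": "00",
     "web_address": "shop.example.com", "merchant_city": "Colombo"},
    {"mcc": "5411", "amount": 250.0, "iso_response_code": "51",
     "web_address": "pay.example.net", "merchant_city": "Kandy"},
]
datasets = split_by_pattern(transactions)
encoded, codes = label_encode(transactions, "merchant_city")
```

In this sketch a transaction can fall into more than one pattern dataset. Under the paper's scheme, an encoding step of this kind would apply to datasets 1 to 3 (Type A), while dataset 4 (Type B) would keep its categorical values.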
normalization scales the attribute data to fall in a small
numeric range.
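As an illustration of scaling an attribute into a small numeric range, the sketch below applies min-max scaling to the interval [0, 1]. The paper does not state which normalization method was used, so min-max scaling, the function name and the sample amounts are assumptions.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no spread; map every value to new_min.
        return [new_min for _ in values]
    span = new_max - new_min
    return [new_min + (v - lo) * span / (hi - lo) for v in values]


# Made-up transaction amounts, used only to show the effect of the scaling.
amounts = [20.0, 70.0, 120.0, 220.0]
scaled = min_max_scale(amounts)  # [0.0, 0.25, 0.5, 1.0]
```

Scaling of this kind keeps an attribute with large raw values, such as the transaction amount, from dominating distance-based classifiers such as K-Nearest Neighbor.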
Additionally, we used 10-fold cross-validation. The data to which the cross-validation was applied were then resampled using the above-mentioned resampling techniques.
D. Modelling and testing
Our study analyses four different fraud patterns. For analyzing each pattern, we followed the process described in Figure 2. Several techniques were used in the data analysis. Four machine learning algorithms were prioritized in our analysis with the help of the literature: Support Vector Machine, Naive Bayes, K-Nearest Neighbor and Logistic Regression. We applied those selected supervised learning classifiers to our resampled data. When selecting machine learning models which can capture each fraud, the accuracy and performance of each model were taken into consideration. Optimal models were selected by comparing them against an appropriate performance matrix (Table 3).
The following graphs show the accuracy rates for the four types of fraud when the ML classifiers are applied to preprocessed and resampled data.
Fig. 3. Risky MCC Results
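The evaluation loop described in this section can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: random oversampling stands in for the resampling techniques, resampling is applied only to the training part of each fold (one possible reading of the text), and two toy classifiers stand in for the SVM, Naive Bayes, KNN and Logistic Regression models. All function names, class names and data are hypothetical.

```python
import random
from collections import Counter


def stratified_folds(labels, k, seed=0):
    """Assign sample indices to k folds while keeping class ratios similar."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority samples until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for cls, n in counts.items():
        pool = [x for x, label in zip(X, y) if label == cls]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(cls)
    return X_out, y_out


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_validate(make_model, X, y, k=10):
    """Mean accuracy over k folds; only the training part of each fold is resampled."""
    scores = []
    for test_idx in stratified_folds(y, k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        X_tr, y_tr = random_oversample([X[i] for i in train_idx],
                                       [y[i] for i in train_idx])
        model = make_model().fit(X_tr, y_tr)
        preds = model.predict([X[i] for i in test_idx])
        scores.append(accuracy([y[i] for i in test_idx], preds))
    return sum(scores) / len(scores)


class OneNearestNeighbour:
    """Toy stand-in classifier: 1-NN on squared Euclidean distance."""

    def fit(self, X, y):
        self.X, self.y = X, y
        return self

    def predict(self, X):
        def nearest(x):
            dists = [sum((a - b) ** 2 for a, b in zip(x, xi)) for xi in self.X]
            return self.y[dists.index(min(dists))]
        return [nearest(x) for x in X]


class MajorityClass:
    """Toy baseline: always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label] * len(X)


# Made-up, well-separated data: 20 legitimate (0) and 5 fraudulent (1) samples.
X = [(i * 0.1, i * 0.1) for i in range(20)] + [(5.0 + i * 0.1, 5.0) for i in range(5)]
y = [0] * 20 + [1] * 5
candidates = {"1-NN": OneNearestNeighbour, "majority": MajorityClass}
scores = {name: cross_validate(cls, X, y, k=5) for name, cls in candidates.items()}
best = max(scores, key=scores.get)
```

Resampling inside each fold keeps duplicated minority samples out of the held-out part, so the reported score is not inflated by copies of training records. The paper's own ordering of cross-validation and resampling may differ from this choice.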
Fig. 6. Unknown web address results
E. Real-time Fraud Detection
In the past, fraud detection was done by taking already completed transactions in bulk and applying machine learning models to them. Since the results could only be seen after weeks or months, tracking down detected frauds was extremely difficult, and there were many cases where fraudsters were able to commit many more fraudulent purchases before being exposed. Real-time fraud detection is the execution of fraud detection models the second an online purchase takes place. That way, our system is capable of detecting frauds in real time. It gives an alert to the bank indicating the fraud pattern and accuracy rate, making it easy for fraud monitoring teams to move to their next action without wasting time and money.
F. Fraud Detection System
Real-time detection of credit card fraud can be stated as one of the main contributions of this project. The real-time fraud detection system consists of three main units: API MODULE, FRAUD DETECTION MODELS and DATA WAREHOUSE. All the components are involved in fraud detection simultaneously. Fraudulent transactions are classified into four fraud types (frauds that occur due to Risky MCC, ISO-Response Code, Unknown web address, Transaction above $100) using three supervised learning classifiers. The API module is responsible for transferring real-time transactions between the fraud detection model, the GUI and the data warehouse. A data warehouse is used for storing live transactions, the predicted results and other important data of the machine learning models. The user can interact with the fraud detection system through GUIs, which show the real-time transactions, alerts regarding frauds and historical data regarding frauds in a graphical representation. When a transaction is recognized as fraudulent by the fraud detection model, a message is sent to the API module. The API module then notifies the end user by sending a notification, and the feedback given by the end user is stored. Figure 7 describes the overall system flow of the fraud detection system.
IV. CONCLUSIONS
Credit card fraud detection has been a keen area of research for years and will remain an intriguing area of research in the future. This is mainly due to the continuous change of patterns in frauds. In this paper, we propose a novel credit card fraud detection system that detects four different patterns of fraudulent transactions using the best-suited algorithms and addresses the related problems identified by past researchers in credit card fraud detection. By addressing real-time credit card fraud detection using predictive analytics and an API module, the end user is notified over the GUI the second a fraudulent transaction takes place. This part of our system allows the fraud investigation team to decide to move to the next step as soon as a suspicious transaction is detected. Optimal algorithms that address the four main types of fraud were selected through literature, experimentation and parameter tuning, as shown in the methodology. We also assess sampling methods that effectively address the skewed distribution of data. Therefore, we can conclude that using resampling techniques has a major impact on obtaining a comparatively higher performance from the classifier. The machine learning models that captured the four fraud patterns (Risky MCC, Unknown web address, ISO-Response Code, Transaction above $100) with the highest accuracy rates are LR, NB, LR and SVM, with accuracy rates of 74%, 83%, 72% and 91% respectively. As the developed machine learning models present an average level of accuracy, we hope to focus on improving the prediction levels. Future extensions also aim to focus on location-based frauds.
REFERENCES
[1] S. Xuan, G. Liu, Z. Li, L. Zheng, S. Wang, and G. N. Surname, “Random Forest for credit card fraud,” 15th Int. Conf. Networking Sens. Control, 2018.
[2] M. Zareapoor, S. K. Seeja K. R., and M. Afshar Alam, “Analysis on Credit Card Fraud Detection Techniques: Based on Certain Design Criteria,” Int. J. Comput. Appl., vol. 52, no. 3, pp. 35–42, 2012.
[3] David Robertson, “Investments & Acquisitions - September 2016 Top Card Issuers in Asia–Pacific Card Fraud Losses Reach $21.84 Billion,” Nilson Rep.
[4] J. West and M. Bhattacharya, “An Investigation on Experimental Issues in Financial Fraud Mining,” Procedia Comput. Sci., vol. 80, pp. 1734–1744, 2016.
[5] D. S. Sisodia, N. K. Reddy, and S. Bhandari, “Performance Evaluation of Class Balancing Techniques for Credit Card Fraud Detection,” IEEE Int. Conf. Power, Control. Signals Instrum. Eng., pp. 2747–2752, 2017.
[6] G. Liu, W. Luan, Z. Li, and Y. Zhang, “A new FDS for credit card fraud detection based on behavior certificate.”
[7] Z. Zojaji, R. E. Atani, and A. H. Monadjemi, “A Survey of Credit Card Fraud Detection Techniques: Data and Technique Oriented Perspective,” pp. 1–26, 2016.
[8] Suman and Nutan, “Review Paper on Credit Card Fraud Detection,” Int. J. Comput. Trends Technol., vol. 4, no. 7, 2013.
[9] S. Akila and U. S. Reddy, “Risk based Bagged Ensemble (RBE) for Credit Card Fraud Detection,” no. Icici, pp. –674, 2017.
[10] D. P. Methods, “Data Preprocessing Techniques for Data Mining,” Science (80-. )., p. 6, 2011.
[11] A. Roy, J. Sun, R. Mahoney, L. Alonzi, S. Adams, and P. Beling, “Deep Learning Detecting Fraud in Credit Card Transactions,” pp. 129–134, 2018.
[12] M. F. Zeager, A. Sridhar, N. Fogal, S. Adams, D. E. Brown, and P. A. Beling, “Adversarial learning in credit card fraud detection,” 2017 Syst. Inf. Eng. Des. Symp., pp. 112–116, 2017.
[13] T. Cody, S. Adams, and P. A. Beling, “A Utilitarian Approach to Adversarial Learning in Credit Card Fraud Detection,” pp. –242, 2018.
[14] M. Rafało, “Real-time fraud detection in credit card transactions,” Data Science Warsaw, 2017.
[15] A. Dal Pozzolo, G. Boracchi, O. Caelen, and C. Alippi, “Credit Card Fraud Detection: A Realistic Modeling and a Novel Learning Strategy,” IEEE Trans. Neural Networks Learn. Syst., pp. 1–14, 2018.
[16] K. Randhawa, C. K. Loo, M. Seera, C. P. Lim, and A. K. Nandi, “Credit card fraud detection using AdaBoost and majority voting,” IEEE Access, vol. XX, pp. 1–1, 2018.
[17] J. O. Awoyemi, A. O. Adetunmbi, and S. A. Oluwadare, “Credit card fraud detection using machine learning techniques: A comparative analysis,” 2017 Int. Conf. Comput. Netw. Informatics, pp. 1–9, 2017.
[18] R. Choudhary and H. K. Gianey, “Comprehensive Review On Supervised Machine Learning Algorithms,” 2017 Int. Conf. Mach. Learn. Data Sci., pp. 37–43, 2017.
[19] G. E. Melo-Acosta, F. Duitama-Muñoz, and J. D. Arias-Londoño, “Fraud detection in big data using supervised and semi-supervised learning techniques,” Commun. Comput. (COLCOM), 2017 IEEE Colomb. Conf., pp. 1–6, 2017.
[20] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.
[21] D. S. Sisodia, N. K. Reddy, and S. Bhandari, “Performance Evaluation of Class Balancing Techniques for Credit Card Fraud Detection,” IEEE Int. Conf. Power, Control. Signals Instrum. Eng., pp. 2747–2752, 2017.
[22] D. S. Sisodia, N. K. Reddy, and S. Bhandari, “Performance Evaluation of Class Balancing Techniques for Credit Card Fraud Detection,” IEEE Int. Conf. Power, Control. Signals Instrum. Eng., pp. 2747–2752, 2017.