Financial Fraud Detection Based on Machine Learning

This document presents a systematic literature review on machine learning techniques for detecting financial fraud, highlighting the limitations of traditional methods and the advantages of AI-based approaches. It categorizes various types of financial fraud, such as credit card and insurance fraud, and discusses evaluation metrics used in fraud detection models. The paper identifies gaps in current research and suggests future directions, including addressing imbalanced datasets and exploring unstructured data for improved detection methods.

Machine Learning for Fraud Detection and Financial Crimes

1. Abstract:
Financial fraud, considered as deceptive tactics for gaining financial benefits, has recently
become a widespread menace in companies and organizations. Conventional techniques
such as manual verifications and inspections are imprecise, costly, and time consuming for
identifying such fraudulent activities. With the advent of artificial intelligence, machine-
learning-based approaches can be used to detect fraudulent transactions by
analyzing large amounts of financial data. Therefore, this paper presents a
systematic literature review (SLR) that reviews and synthesizes the existing
literature on machine learning (ML)-based fraud detection. In particular, the review
employed the Kitchenham approach, which uses well-defined protocols to extract and
synthesize the relevant articles and then report the obtained results. Based on the specified
search strategies from popular electronic database libraries, several studies have been
gathered. After inclusion/exclusion criteria, 93 articles were chosen, synthesized, and
analyzed. The review summarizes popular ML techniques used for fraud detection, the most
popular fraud type, and evaluation metrics. The reviewed articles showed that support
vector machine (SVM) and artificial neural network (ANN) are popular ML algorithms used
for fraud detection, and credit card fraud is the most popular fraud type addressed using ML
techniques. The paper finally presents main issues, gaps, and limitations in financial fraud
detection areas and suggests possible areas for future research.
2. Introduction:
Financial fraud is the act of gaining financial benefits by using illegal and fraudulent
methods. Financial fraud can be committed in different areas, such as insurance, banking,
taxation, and corporate sectors. Recently, financial transaction fraud, money laundering, and
other types of financial fraud have become an increasing challenge among companies and
industries. Despite several efforts to reduce fraudulent financial activities, their persistence
affects the economy and society adversely, as large amounts of money are lost to fraud
every day. Several fraud detection approaches were introduced many years ago. Most
traditional methods are manual, which makes them time consuming, costly, imprecise,
and impractical. More studies have been conducted to reduce losses resulting from fraudulent
activities, but the resulting methods are not efficient. With the advancement of artificial intelligence (AI),
machine learning and data mining have been utilized to detect fraudulent
activities in the financial sector. Both unsupervised and supervised methods have been employed
to predict fraudulent activities. Classification has been the most popular approach for
detecting fraudulent financial transactions. In this setting, the first stage trains a model
on a dataset with class labels and feature vectors. The trained model is then used to
classify test samples in the next step.
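The two-stage process described above (train on labeled feature vectors, then classify unseen samples) can be illustrated with a minimal sketch. The nearest-centroid classifier, the function names, and the toy transaction data below are assumptions chosen for brevity; they are not taken from any of the reviewed studies.

```python
from math import dist

def train_centroids(features, labels):
    """Stage 1: learn one centroid (mean feature vector) per class label."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def classify(centroids, x):
    """Stage 2: assign a test sample to the class with the closest centroid."""
    return min(centroids, key=lambda y: dist(centroids[y], x))

# Toy, made-up data: [amount in thousands, transactions in the last hour]
train_X = [[0.05, 1], [0.12, 2], [0.08, 1], [4.0, 9], [5.5, 12]]
train_y = ["legit", "legit", "legit", "fraud", "fraud"]

model = train_centroids(train_X, train_y)
predictions = [classify(model, x) for x in [[0.1, 1], [4.8, 10]]]
print(predictions)
```

In practice, the reviewed studies use richer models such as SVM or ANN, but the train-then-classify structure is the same.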
Thus, this study attempts to identify machine-learning-based techniques employed for
financial transaction fraud and to analyse gaps to discover research trends in this area.
Recently, some reviews have been conducted on detecting fraudulent financial activities.
However, most of these reviews focused on
specific areas of finance, such as detecting credit card fraudulent activities, fraud in online
banking, fraud in bank credit administration, and fraud in payment cards. Hence, there is a
need for a study that encompasses all popular areas of financial fraud to fill the gap
in this aspect. More recently, a study was published to review fraud-detection methods in
financial records. The authors integrated the prior multi-disciplinary literature on financial
statement fraud. However, there are several differences between their work and our review.
First, their primary objective is to integrate research from several fields, including
information systems, analytics, and accounting. On the other hand, we aim to identify
financial fraud transactions based on machine learning methods and to discover datasets
applied in the ML-based financial fraud detection. Furthermore, we considered conference
articles in our study while they did not. This study reviews existing machine learning (ML)-
based methods applied for financial transaction fraud detection. Furthermore, the SLR can
guide researchers in their choice of applying ML-based financial transaction fraud-detection
methods along with the datasets to be used for predicting fraudulent activities in financial
transactions.
3. Research Methods
RQ1: What Are the Different Categories of Fraudulent Activities That Are Addressed Using
ML Techniques?
Fraudulent activities vary depending on industry sectors. This section attempts to answer
RQ1 by presenting different fraudulent activities that were addressed using ML techniques
based on the selected articles. Based on the reviewed articles, fraudulent activities in the
financial sector can be broadly categorized into credit card, mortgage, financial statement,
and health care fraud. These categories are described in the following subsections.
A. Credit Card Fraud
The term credit typically refers to electronic financial transactions made without the use
of physical cash. A credit card, which is extensively used for online transactions, is a small, thin piece
of plastic that carries credit services and customer details. Fraudsters use
credit cards to make unlawful transactions that result in massive losses to banks and card
holders. Moreover, counterfeit cards have made it easier for fraudsters to perform
illicit transactions. In general, it is regarded as illegitimate to use a card
without the proper owner’s authorization, and any transaction carried out by obtaining
illegitimate access to an account is considered fraudulent. Credit card
fraud can be divided into two types, namely, offline and online fraud. In
offline fraud, fraudsters conduct their illicit transactions with stolen credit
cards as if they were the genuine card holders, while online fraudsters conduct their activities in
transactions over the Internet.
B. Financial Statement Fraud
Fraud in financial statements involves forging financial reports to claim that a company is
more profitable than it actually is, avoid the payment of taxes, increase stock prices, or obtain
a bank loan. Financial statements are the confidential records generated by organizations
that describe their finances, including expenses, profits, income,
loans, etc. These statements also include write-ups by management that
discuss business performance and predicted future trends. Together, these
records present the financial reality of the organization, which indicates how successful the
organization is and assists in checking whether the organization is bankable. Financial
statement fraudsters deceive the users of financial statements by introducing misstatements
that make the organization appear more profitable. The main purposes of fraudulent financial
statements are to enhance share prices, minimize tax liabilities, attract as many investors as
possible, and obtain bank loans, among others.
C. Insurance Fraud
Insurance fraud can be defined as the act of misusing an insurance policy for gaining
illegitimate benefits from an insurance business. Usually, insurance is taken out to protect an
organization’s or individual’s transactions against financial risks. The main
targets of fraudulent insurance claims are healthcare and automobile
insurance companies; home and crop insurance fraud also occur, but
there is a paucity of literature on both. It has been estimated recently that the total cost
of insurance fraud in the United States is over a billion USD yearly, and it is ultimately passed on
to consumers in the form of higher insurance premiums.
Automobile insurance claims typically involve an agreement
between the insurance provider and the insured person or organization to cover the relevant
costs of theft or accidental damage to a car. Individual fraudsters can commit fraudulent
claims, and one method of doing so is deception during the claims process.
Evidence of organized groups working together to conduct insurance fraud also exists.
Typically, these groups stage or fake incidents; in some cases, an accident may not have even
occurred, and the vehicles were simply brought to the scene. Nevertheless, the majority of
fraud cases are opportunistic rather than planned: an individual
seizes the opportunity presented by an accident by exaggerating the claimed
statements or damages. Another popular type of insurance fraud occurs in the health sector. Healthcare
has grown to be a serious issue in contemporary society that is entangled with social,
political, and economic concerns. There is a significant financial expense associated with
meeting the public demand for high-quality medical services and the technology required to
provide them. Additionally, many low-income people and families rely on government-
sponsored healthcare insurance programs for support in order to pay for the steadily rising
costs of prescription medications and medical services.
D. Financial Cyber-Fraud
The term financial cyber fraud is a new umbrella term for crimes committed
in cyberspace for the sole purpose of illegal economic gain. Financial cybercrime
perpetrators are difficult to identify. They purposely mask their activities to blend their
actions with the normal behavior of any other customer or user of a website or financial
service; however, when grouped together, the activity is more obvious in terms of its
abnormality. As technical skills and advancements in technology are increasingly available to
criminals, their tactics for committing criminal offenses become more difficult to combat.
This symbiosis of financial crime and cybersecurity is leading financial institutions to use
their in-house developed methods to protect their assets using tools such as real-time
analytics and interdiction to prevent financial loss. However, as models are showing signs of
an inability to prevent and address these attacks, new methods must be developed and
deployed across organizations to prevent further loss to their business, customer data, and
their own reputation. The new methods deployed in the research community and industry
include machine learning and deep learning models.
E. Other Financial Fraudulent Types
Apart from the above types of fraud committed in the financial sector, other
types are encountered in the financial domain, including commodities and securities fraud,
mortgage fraud, corporate fraud, and money laundering. Securities and commodities fraud
is a dishonest practice that occurs when a person is induced to invest in a company based on fake
information. Mortgage fraud is a material misstatement made by a debtor at any stage of the
application procedure that an underwriter relies on when granting a loan or credit. It
intentionally targets documents associated with a mortgage by modifying information during
the mortgage loan application process. Another popular fraud is corporate fraud, which
involves the falsification of financial documents by insiders to cover up fraud or criminal
activity. Money laundering is another type of financial fraud in which criminals try to
disguise the source of illegal money so that their dirty money appears to be
legitimate money. Money laundering has a major influence on society because it is the
primary means by which other crimes, such as funding terrorism and trading in weapons, are
accomplished. Another popular financial crime is cryptocurrency fraud. This type of fraud
systematically offers fake investments to naïve users in order to defraud them. The main
idea is to entice innocent individuals with the promise of significant gains from their
investments.
RQ2: What Are the Performance Evaluation Metrics Used for Financial Fraud Detection
Using Machine Learning Methods?
The choice of evaluation metric is very important when assessing the performance of a model in financial
fraud detection. However, there are no specific evaluation measures that are strictly used for
evaluating ML techniques for fraud detection. In recent times, different researchers have employed
several performance evaluation metrics, including accuracy,
precision, recall, F1 measure, false-negative rate (FNR), the area under the curve (AUC),
specificity, etc. In this section, we present the evaluation metrics that have been employed in
the reviewed articles.
The recall of a classifier, also known as sensitivity, is the percentage of positive cases that
the classifier correctly detects. Precision is the proportion of samples classified as positive
that are truly positive. The classifier’s specificity is the ratio of
correctly classified negative samples to all negative samples. While it makes sense to
increase both recall and precision, the two metrics typically trade off against each other: forcing
a higher precision may lower recall, and vice versa. This is referred to as the
recall/precision trade-off. The F-score, which is the harmonic mean of precision and
recall, is a better metric to maximize instead. It is possible to track the performance change in
terms of the trade-off between certain measures such as recall and precision by tweaking
the decision threshold value for a classifier. This process of evaluating all confusion matrices
that can result from changing the decision threshold is known as parametric evaluation.
Other evaluation tools are the Receiver Operating Characteristic (ROC) curve and the precision–recall
curve. The area under the ROC curve (AUC) can be used to compare classifiers, with perfect
classifiers possessing an AUC equal to 1 and purely random classifiers having an AUC equal
to 0.5. Different distance measures, such as the mean squared error (MSE), Euclidean
distance, Manhattan distance, and others, are used by clustering algorithms to quantify the
similarity or dissimilarity of samples or observations. These algorithms group similar data
points based on relevant features. For the false-positive rate (FPR) and FNR, lower values
indicate better generalization ability of the models. In these
definitions, TP, TN, FP, and FN stand for the numbers of true positives, true negatives, false positives, and false
negatives, respectively.
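As an illustration of the metrics above, the following sketch computes them from TP, TN, FP, and FN at a given decision threshold and estimates the AUC by pairwise comparison of scores. The function names, the labels, and the scores are made up for this example; they do not come from the reviewed articles.

```python
def confusion_counts(y_true, scores, threshold):
    """Count TP, TN, FP, FN when scores >= threshold are flagged as fraud (label 1)."""
    tp = tn = fp = fn = 0
    for y, s in zip(y_true, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        elif predicted == 1 and y == 0:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn

def metrics(tp, tn, fp, fn):
    """Standard confusion-matrix metrics; a zero denominator yields 0.0."""
    def ratio(num, den):
        return num / den if den else 0.0
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,                      # also called sensitivity
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
        "fpr": ratio(fp, fp + tn),
        "fnr": ratio(fn, fn + tp),
    }

def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels (1 = fraud) and classifier scores
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 0, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.70, 0.90]

strict = metrics(*confusion_counts(y_true, scores, threshold=0.5))
lenient = metrics(*confusion_counts(y_true, scores, threshold=0.4))
auc = roc_auc(y_true, scores)
```

On this toy data, lowering the threshold from 0.5 to 0.4 raises recall while lowering precision, which is the trade-off discussed above.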
RQ3: What Are the Gaps and Future Research Directions in Machine-Learning-Based Fraud
Detection?
This section attempts to answer RQ3, which aims to present research gaps and future
directions of the field. To determine the limitations and explore the gaps and future
work opportunities in this field, we synthesized the reviewed articles as discussed in the
following subsections.
Imbalanced Dataset
Most financial transaction datasets comprise millions of transactions, and virtually all of
them share a common issue, namely class imbalance: the number of
fraudulent financial transactions is far smaller than the number of non-fraudulent ones. This
arises because the rate of actual fraudulent transactions out of all
transactions is very low. Imbalanced data distribution generally degrades the
performance of machine learning models. Therefore, training models to detect fraudulent
activity that is so rare requires extra consideration.
To address the issue of imbalanced data, some studies applied oversampling approaches.
Others attempted to introduce approaches that work effectively with extremely
imbalanced data. Perols used the imbalanced datasets and left balancing them with an
oversampling method for future work. Additionally, Lin et al. avoided oversampling as it may
lead to choice-based sample biases. Notably, no studies among the reviewed articles applied
under-sampling methods. Moreover, the only
oversampling method applied was SMOTE (Synthetic Minority Oversampling Technique).
Future studies could therefore consider employing other
oversampling techniques, as well as under-sampling techniques.
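The two resampling strategies mentioned above can be sketched as follows. This is a simplified illustration, not the reference SMOTE implementation: `smote_like` only captures the core idea of interpolating between a minority sample and one of its nearest minority neighbours, and all names and data are hypothetical.

```python
import random
from math import dist

def random_undersample(majority, minority, rng):
    """Keep every minority sample and an equally sized random subset of the majority."""
    return rng.sample(majority, len(minority)), list(minority)

def smote_like(minority, n_new, k, rng):
    """Create n_new synthetic minority samples by interpolating between a sample
    and one of its k nearest minority neighbours (simplified SMOTE idea)."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic

# Made-up data: [transaction amount, transactions per hour]
rng = random.Random(0)
legit = [[rng.gauss(50, 10), rng.gauss(2, 1)] for _ in range(1000)]
fraud = [[rng.gauss(900, 100), rng.gauss(8, 2)] for _ in range(10)]

kept_legit, kept_fraud = random_undersample(legit, fraud, rng)
new_fraud = smote_like(fraud, n_new=90, k=3, rng=rng)
```

Under-sampling discards most of the majority class, whereas oversampling enlarges the minority class with synthetic points; which is preferable depends on the dataset size and the model.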
Data Size and Feature Vectors
Some research works noted that the dataset’s size was a limitation of their studies. For
instance, Jan showed that the financial markets in Taiwan are smaller than those of Europe, Japan,
and China in terms of scope and size. Moreover, in Taiwan, the number of registered
companies is relatively small. Data size is likewise one of the major
problems in other nations. Therefore, if the issue of dataset size can be resolved, an
improved and more efficient ML approach for identifying fraudulent financial activity
can be achieved.
On the other hand, the majority of the studies in the reviewed literature noted that the
performance of the detection model can be enhanced by improving the input vectors.
Future works could combine data from other sources, such as financial social media sites
like Seeking Alpha, numerical information from financial documents, and the transcripts
of earnings calls, to produce more relevant feature vectors. In addition, if textual data
are considered in the model’s design process, then exploring emerging methods of
converting textual content into vectors, such as Word2Vec, Doc2Vec, and BERT,
could benefit further research.
Unstructured Data
Recently, several studies investigated different kinds of unstructured data such as vocal
inputs and textual data. However, to achieve remarkable results, unstructured data
exploration in the financial fraud detection domain requires more attention. We expect
future research to look into the text sources from financial statements. In addition, future
research could also explore the use of new data mining techniques, particularly word-
embedding methods such as Doc2Vec, Word2Vec, and BERT, to convert financial texts into
feature vectors that can be used to create machine learning models.
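To make the idea of converting text into a feature vector concrete, the sketch below averages per-word embeddings into a single document vector. The embedding table is a tiny hand-made stand-in and the function name is hypothetical; in practice, the word vectors would come from a trained model such as Word2Vec, and Doc2Vec or BERT would produce the document representation differently.

```python
# Hypothetical 3-dimensional word vectors; real embeddings are learned from a corpus.
TOY_EMBEDDINGS = {
    "revenue": [0.9, 0.1, 0.0],
    "growth": [0.8, 0.2, 0.1],
    "restated": [0.1, 0.9, 0.3],
    "investigation": [0.0, 0.8, 0.6],
}

def document_vector(text, embeddings, dim=3):
    """Average the vectors of all known words; unknown words are skipped."""
    tokens = text.lower().split()
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]

vec = document_vector("Revenue growth was restated", TOY_EMBEDDINGS)
empty = document_vector("no known words here", TOY_EMBEDDINGS)
```

The resulting fixed-length vector can be concatenated with numerical features and passed to any of the classifiers discussed in this review.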
Machine-Learning-Based Technique
Classifying the machine learning techniques used for financial fraud detection is an effective
method for determining the appropriate methods for this research domain. Research gaps
can be identified by investigating why particular methods were selected and why others
were not given more attention. Based on the literature, many learning algorithms that have
received attention and are used in other fields have not been widely applied in
financial fraud detection. For instance, active learning can address the issue of insufficient
data and reduce the learning cost, incremental learning can dynamically add sample data
to improve accuracy, and transfer learning can use the knowledge acquired from learning tasks to
enhance the learning effect on other tasks. In the future, these learning methods can be given
more attention in financial fraud detection.
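As one possible illustration of incremental learning, the sketch below updates a logistic regression model one labeled sample at a time instead of retraining on the full dataset. The class name, learning rate, and toy stream are assumptions made for this example only.

```python
import math

class OnlineLogisticRegression:
    """Logistic regression trained by one stochastic gradient step per new sample."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, x, y):
        """Incorporate a single labeled sample (y is 1 for fraud, 0 otherwise)."""
        error = self.predict_proba(x) - y
        self.weights = [
            w - self.learning_rate * error * xi for w, xi in zip(self.weights, x)
        ]
        self.bias -= self.learning_rate * error

# Made-up stream of scaled features arriving over time
model = OnlineLogisticRegression(n_features=2)
for _ in range(500):
    model.partial_fit([0.1, 0.2], 0)  # a legitimate-looking transaction
    model.partial_fit([0.9, 0.8], 1)  # a fraud-looking transaction

fraud_score = model.predict_proba([0.9, 0.8])
legit_score = model.predict_proba([0.1, 0.2])
```

Because each update touches only one sample, newly labeled transactions can be added as they arrive without storing or revisiting the historical data.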
Another future direction that is worth considering in machine learning is concept
drift. Due to the continuous development and evolution of financial fraud, the
accuracy of the algorithms often diminishes over time. This phenomenon is called concept drift, which is a
problem that scholars have been attempting to address in the area of ML. Although the
concept drift issue can be addressed by periodically retraining the algorithms on new
datasets, the overhead is large, and the model performance
between two training cycles cannot be guaranteed.
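One simple way to reduce the uncertainty between training cycles is to monitor the recent error rate and retrain only when it degrades. The sketch below is a hypothetical monitor based on a sliding window; the class name, window size, and tolerance are illustrative assumptions rather than a method from the reviewed literature.

```python
from collections import deque

class DriftMonitor:
    """Signals possible concept drift when the error rate over the most recent
    `window` predictions exceeds the baseline error rate by more than `tolerance`."""

    def __init__(self, baseline_error, window=50, tolerance=0.10):
        self.baseline_error = baseline_error
        self.tolerance = tolerance
        self.recent = deque(maxlen=window)

    def update(self, predicted, actual):
        """Record one outcome; return True if retraining looks warranted."""
        self.recent.append(0 if predicted == actual else 1)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough evidence yet
        error_rate = sum(self.recent) / len(self.recent)
        return error_rate > self.baseline_error + self.tolerance

# Simulated stream: 5% errors for 100 steps, then 40% errors after fraud patterns change
monitor = DriftMonitor(baseline_error=0.05)
first_alarm = None
for step in range(200):
    if step < 100:
        wrong = step % 20 == 0
    else:
        wrong = (step - 100) % 5 < 2
    actual = 1
    predicted = 0 if wrong else 1
    if monitor.update(predicted, actual) and first_alarm is None:
        first_alarm = step
```

A monitor of this kind requires ground-truth labels for recent transactions, which in fraud detection often arrive with a delay.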
4. Discussion
In this section, the content of the SLR is highlighted, which includes popular financial fraud
types and the machine learning techniques used to detect them. We categorized the
findings on ML techniques and fraud types in this SLR based on their frequency of usage.
The most popular ML approaches identified in this study for the period
from 2010 to 2021 are summarized. We identified that
the NB algorithm is the most popular technique used for identifying fraudulent activities in
the financial sector, followed by the SVM and ANN, with 11, 10, and 10 articles, respectively.
This shows that the NB, SVM, and ANN are the most popular machine learning techniques
used for financial fraud detection based on the reviewed literature. The following shows the
frequency distribution of the machine learning techniques used for fraud detection and the
financial fraud types addressed in the reviewed articles.

Frequency of the machine learning methods used for fraud detection.
The frequency of the different fraud types.
Based on the findings of our study, we categorized financial fraud into four
categories: bank fraud, financial statement fraud, insurance fraud, and others. The
number of reviewed papers and the type of fraud are provided. Of the
reviewed papers, 50% focus on bank fraud, 29% address financial statement fraud,
and the remaining 21% focus on insurance fraud and other fraud, at 14% and 7%, respectively. This shows
that the majority of studies have a significant focus on bank fraud and financial statement
fraud, while insurance frauds, which include health insurance and auto insurance, are not as
frequently addressed in the reviewed articles. It also shows that the two most common
frauds are credit card and financial statement fraud, which appear to be the most prevalent types
of fraud. On the other hand, our review did not cover money laundering, stock,
commodities, or mortgage fraud. One of the reasons is the difficulty in
acquiring these data and the inability to reveal results if they are related to sensitive
subjects.
5. Limitation and Threat to Validity
In this SLR, various ML techniques and fraud types were identified. We developed our
protocols to promote external and internal validity as much as possible while answering the
RQs. However, some limitations and validity threats remain, and they are
discussed here.
1. This SLR is limited to conference and journal papers that discuss machine
learning (ML) in the context of detecting financial fraud. By using our search
approach in the early stages of the review, several non-relevant research papers were
identified and excluded from this review. This ensures that the selected research
papers satisfied the criteria for the study. However, it is believed that using more
sources, such as books, would have further enhanced this review.
2. Although major databases were taken into consideration when exploring the
research articles, there may be other digital libraries with relevant studies that were
overlooked. We compared search terms and keywords to a well-known list of
research studies to mitigate this limitation. However, some synonyms may have been
overlooked when searching for the keywords. The SLR protocol has been revised to
address this problem by ensuring no essential terms are left out.
3. We restricted our search to English-language articles. This may introduce
linguistic bias because some related papers in this field of study may exist in other
languages. However, all the papers gathered in this study were written in
English, so no retrieved paper had to be excluded on the basis of language.
6. Conclusions
Financial fraud can be committed in different areas, such as insurance, banking,
taxation, and corporate sectors. Recently, financial fraud has become increasingly worrisome
among companies and industries. Despite several efforts to eradicate financial fraud, its
persistence adversely affects the economy and society as very large amounts of money are
lost to fraud every day. With the advent of artificial intelligence, machine-learning-based
approaches can be used to detect fraudulent transactions by analyzing large
amounts of financial data. In this paper, we presented a study that systematically reviewed
and synthesized the existing literature on ML-based fraud detection. In particular, this paper
adopted the Kitchenham methodology, which uses well-defined protocols to extract,
synthesize, and report results. Several studies have been gathered based on the specified
search strategies for popular electronic libraries. After applying the inclusion/exclusion criteria, 87
articles were selected. In this review, popularly used ML techniques for fraud detection, the most
common fraud type, and the evaluation metrics are summarized. Based on the reviewed
articles, results showed that SVM and NN are popular ML algorithms used for fraud detection, and
credit card fraud is the most popular fraud type in the literature. The paper finally presented
the key issues, gaps, and limitations in the area of financial fraud detection and suggested
areas for future research. We identified gaps in the research by examining unexplored or less
studied algorithms. Previous studies in financial fraud detection focused on supervised
classification and regression methods, such as SVM, neural networks, and logistic regression.
The use of ensemble methods that take advantage of multiple algorithms to classify samples
is a rising trend in the field. Interestingly, we discovered that unsupervised learning
approaches, such as clustering, were less employed in the present literature. Clustering is
beneficial for investigating latent relations and resemblances. In addition, since there are a
small number of fraud cases that have to be identified, clustering could be effective. We
recommend that future studies pay more attention to unsupervised practices, such as
anomaly detection, which can uncover new insights. Additionally, another avenue for future
research would be to use emerging text-mining techniques and word-embedding techniques
such as Word2Vec, Doc2Vec, or BERT to transform financial texts into vectors of features,
which will then be used to build machine learning models.
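As a small illustration of the unsupervised direction recommended above, the following sketch flags unusually large deviations without using any fraud labels. It scores each value by its distance from the median in units of the median absolute deviation (MAD); the function name, the threshold of 5, and the amounts are assumptions made for this example.

```python
from statistics import median

def robust_anomaly_scores(values):
    """Score each value by |value - median| / MAD; larger means more unusual."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return [0.0 for _ in values]
    return [abs(v - med) / mad for v in values]

# Made-up transaction amounts with one extreme value
amounts = [42, 38, 51, 47, 40, 45, 39, 5200, 44, 48]
anomaly_scores = robust_anomaly_scores(amounts)
flagged = [a for a, s in zip(amounts, anomaly_scores) if s > 5]
print(flagged)
```

Real transaction data are multivariate, so clustering or other anomaly detection methods would replace this single-feature score, but the principle of ranking samples by how far they lie from typical behavior is the same.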
