
Decision Support Systems xxx (xxxx) xxx

Contents lists available at ScienceDirect

Decision Support Systems


journal homepage: www.elsevier.com/locate/dss

Data engineering for fraud detection


Bart Baesens a,b, Sebastiaan Höppner c, Tim Verdonck c,d,*

a KU Leuven, Faculty of Economics and Business, Naamsestraat 69, Leuven 3000, Belgium
b University of Southampton, School of Management, Highfield Southampton, SO17 1BJ, United Kingdom
c KU Leuven, Department of Mathematics, Celestijnenlaan 200B, Leuven 3001, Belgium
d University of Antwerp, Department of Mathematics, Middelheimlaan 1, Antwerp 2020, Belgium

ARTICLE INFO

Keywords:
Decision analysis
Payment transactions fraud
Instance engineering
Feature engineering
Cost-based model evaluation

ABSTRACT

Financial institutions increasingly rely upon data-driven methods for developing fraud detection systems, which are able to automatically detect and block fraudulent transactions. From a machine learning perspective, the task of detecting suspicious transactions is a binary classification problem and therefore many techniques can be applied. Interpretability is however of utmost importance for the management to have confidence in the model and for designing fraud prevention strategies. Moreover, models that enable the fraud experts to understand the underlying reasons why a case is flagged as suspicious will greatly facilitate their job of investigating the suspicious transactions. Therefore, we propose several data engineering techniques to improve the performance of an analytical model while retaining the interpretability property. Our data engineering process is decomposed into several feature and instance engineering steps. We illustrate the improvement in performance of these data engineering steps for popular analytical models on a real payment transactions data set.

1. Introduction

The Association of Certified Fraud Examiners (ACFE) estimates that a typical organization loses 5% of its revenues to fraud each year. The fifth oversight report on card fraud, which analyses developments in fraud related to card payment schemes (CPSs) in the Single Euro Payments Area (SEPA), was issued in September 2018 by the European Central Bank and covers almost the entire card market. It indicates that the total value of fraudulent transactions conducted using cards issued within SEPA and acquired worldwide amounted to 1.8 billion Euros in 2016, which in relative terms, i.e. as a share of the total value of transactions, amounted to 0.041% in 2016 [21]. These are just a few numbers that indicate the severity of the payment transactions fraud problem. It is also seen that losses due to fraudulent activities keep increasing each year and affect card holders worldwide. Therefore, fraud detection and prevention are more important than ever before, and developing powerful fraud detection systems is of crucial importance to many organizations and firms in order to reduce losses by timely blocking, containing and preventing fraudulent transactions.

The Oxford Dictionary defines fraud as follows: the crime of cheating somebody in order to get money or goods illegally. This definition captures the essence of fraud and covers the many different forms and types of fraud. On the other hand, it does not very precisely describe the nature and characteristics of fraud and as such does not provide much direction for discussing the requirements of a fraud detection system. A more thorough and detailed characterization of the multifaceted phenomenon of fraud is provided by Van Vlasselaer et al. [61]: Fraud is an uncommon, well-considered, imperceptibly concealed, time-evolving and often carefully organized crime which appears in many types of forms. This definition highlights five characteristics that are associated with particular challenges related to developing a fraud detection system.

The first emphasized characteristic and associated challenge concerns the fact that fraud is uncommon. Independent of the exact setting or application, only a small minority of the involved population of cases typically concerns fraud, of which furthermore only a limited number will be known to be fraudulent. This makes it difficult both to detect fraud, since the fraudulent cases are covered by the non-fraudulent ones, and to learn from historical cases to build a powerful fraud detection system, since only few examples are available. This makes it hard for machine learning techniques to extract meaningful patterns from the data.

Fraud is also imperceptibly concealed, since fraudsters try to blend into their environments to remain unnoticed. This relates to the subtlety of fraud: fraudsters try to imitate normal behavior.

* Corresponding author at: University of Antwerp, Department of Mathematics, Middelheimlaan 1, Antwerp 2020, Belgium.
E-mail addresses: [email protected] (B. Baesens), [email protected] (S. Höppner), [email protected] (T. Verdonck).

https://doi.org/10.1016/j.dss.2021.113492
Received 15 July 2020; Received in revised form 25 November 2020; Accepted 7 January 2021
Available online 12 January 2021
0167-9236/© 2021 Published by Elsevier B.V.

Please cite this article as: Bart Baesens, Decision Support Systems, https://doi.org/10.1016/j.dss.2021.113492
Fig. 1. Timeline of transactions of a customer using, for example, a particular payment channel.

Moreover, fraud is well-considered and intentional, and complex fraud structures are carefully planned upfront. Fraudsters can also adapt or refine their tactics whenever needed, for example, due to changing fraud detection mechanisms. Therefore, fraud detection systems need to improve and learn by example.

The traditional approach to fraud detection is expert-driven, which builds on the experience, intuition, and business or domain knowledge of one or more fraud investigators. Such an expert-based rule base or engine is typically hard to build and maintain. A shift is occurring towards data-driven or machine learning based fraud detection methodologies. This shift is triggered by the digitization of almost every aspect of society and daily life, which leads to an abundance of available data. Financial institutions increasingly rely upon data-driven methods for developing powerful fraud detection systems, which are able to automatically detect and block fraudulent transactions. In other words, we need adaptive analytical models to complement experience-based approaches for fighting fraud. A stream of literature has reported upon the adoption of data-driven approaches for developing fraud detection systems [45,47]. These methods significantly improve the efficiency of fraud detection systems and are easier to maintain and more objective. From a machine learning perspective, the task of detecting fraudulent transactions is a binary classification problem.

A natural first step to move from expert-based approaches to data-driven techniques (while still taking into account the experience of the fraud experts) is to consider logistic regression and/or decision trees. These simple analytical models can then be replaced by complex techniques such as random forests and boosting methods, support vector machines, neural networks and deep learning to increase the detection power. Although the latter are definitely powerful analytical techniques, they suffer from a very important drawback which is not desirable from a fraud prevention perspective: they are black box models, which means that they are very complex to interpret. We would also like to note that these complex models do not always significantly outperform simple analytical models such as logistic regression [4,38], and we strongly believe that you should always start with implementing these simple techniques. Many benchmarking studies have illustrated that complex analytical techniques only provide marginal performance gains on structured, tabular data sets as frequently encountered in common classification tasks such as fraud detection, credit scoring and marketing analytics [4,38]. It is our firm belief that in order to improve the performance of any analytical model, we should focus more on the data itself rather than on developing new, complex predictive analytical techniques. This is exactly the aim of data engineering. It can be defined as the clever engineering of data, hereby exploiting the bias of the analytical technique to our benefit, both in terms of accuracy and interpretability at the same time. Oftentimes it will be applied in combination with simple analytical techniques such as linear or logistic regression, so as to maintain the interpretability property which is so often needed in analytical modeling. In our context of fraud analytics, interpretability is of key importance to design smart fraud prevention mechanisms. Data engineering can be decomposed into feature engineering and instance engineering. Feature engineering aims at designing smart features in one of two possible ways: either by transforming existing features using smart transformations, which will allow a simple analytical technique such as linear or logistic regression to boost its performance, or by extracting or creating new meaningful features (a process often called featurization) from different sources (e.g. transactional data, network data, time series data, text data, ...) to achieve better performance. Instance engineering entails the careful selection of instances or observations, again with the aim to improve predictive modeling performance. Put differently, it aims at selecting those observations which positively contribute to the learning of the analytical technique and removing those that have a detrimental impact on it. Obviously, this is not a trivial exercise, and many instance engineering techniques have been developed which we will carefully study and experiment with in this paper. In this paper the focus will be on successful data engineering steps to improve the performance of a fraud detection model. More concretely, we will describe the lessons that we have learnt when complementing expert-based approaches with machine learning or data-driven techniques to combat payment transactions fraud for a large European bank.

This paper is organized as follows. We start with presenting our data engineering process: Section 2 presents feature engineering steps whereas instance engineering is explained in Section 3. In Section 4 popular performance measures in an (imbalanced) classification setting are described. In Section 5, more information about payment transaction fraud and the observed data set is given. This section also illustrates the benefits of the various data engineering steps by showing increased performance on our real data set. Finally, concluding remarks and potential directions for future research are provided in Section 6.

2. Feature engineering

The main objective of machine learning is to extract patterns to turn data into knowledge. Since the beginning of this century, technological advances have drastically changed the size of data sets as well as the speed with which these data must be analyzed. Modern data sets may have a huge number of instances, a very large number of features, or both. In most applications, data sets are compiled by combining data from different sources and databases (containing both structured and unstructured data) where each source of information has its strengths and weaknesses. Before applying any machine learning algorithm, it is therefore necessary to transform these raw data sources into interesting features that better help the predictive models. This essential step, which is often denoted feature engineering, is of utmost importance in the machine learning process. We believe that data scientists should be well aware of the power of feature engineering and that they should share good practices.

An important set of interesting features can be created based on the famous Recency, Frequency, Monetary (RFM) principle. Recency measures how long ago a certain event took place, whereas frequency counts the number of specific events per unit of time. Besides recency features, we also present several other time-related features. Features related to monetary value measure the intensity of a transaction, typically expressed in a currency such as Euros or USD. We also introduce features based on unsupervised anomaly detection and briefly discuss some other advanced feature engineering techniques.

2.1. Frequency features

We explain the idea behind the RFM principle by first deriving frequency features using a transaction aggregation strategy in order to capture a customer's spending behavior. This methodology was first


proposed by Whitrow et al. [62] and has been used by a number of studies [6,10,18,35]. Frequency calculates how many transactions were made during a sliding time window that satisfies predefined conditions, as illustrated in Fig. 1. The first step in creating frequency features consists in aggregating the transactions made during the last given time period (e.g. the last 3 months), first by card or account number, then by payment channel, authentication method, beneficiary country or other, followed by counting the number of transactions. It is important to choose an appropriate time period over which to aggregate a customer's transactions. As time passes, the spending patterns of a customer are not expected to remain constant over the years. For transactions made with debit cards, we propose to use a fixed time frame of 90, 120 or 180 days (~3, 4 or 6 months). Let $\mathcal{D}$ denote a set of $N$ transactions where each transaction is represented by the pair $(x_i, y_i)$ for $i = 1, 2, \ldots, N$. Here $y_i \in \{0, 1\}$ describes the true class of transfer $i$ and $x_i = (x_i^1, x_i^2, \ldots, x_i^p)$ represents the $p$ associated features of transfer $i$. Bahnsen et al. [6] describe the process of creating frequency features as selecting those transactions that were made in the previous $t_p$ days, for each transaction $i$ in the data set $\mathcal{D}$,

$$\mathcal{D}^{\mathrm{freq}}_{t_p,i} = \mathrm{AGG}^{\mathrm{freq}}\left(\mathcal{D}, i, t_p\right) = \left\{ x^{\mathrm{amt}}_j \;\middle|\; \left(x^{\mathrm{id}}_j = x^{\mathrm{id}}_i\right) \text{ and } \left(\mathrm{days}\!\left(x^{\mathrm{time}}_i, x^{\mathrm{time}}_j\right) < t_p\right) \right\}_{j=1}^{N} \tag{1}$$

where $\mathrm{AGG}(\cdot)$ is a function that aggregates transactions of $\mathcal{D}$ into a subset associated with a transaction $i$ with respect to the time frame $t_p$; $x^{\mathrm{time}}_i$ is the timestamp of transaction $i$; $x^{\mathrm{amt}}_i$ is the amount of transaction $i$; $x^{\mathrm{id}}_i$ is the customer or card identification number of transaction $i$; and $\mathrm{days}(t_1, t_2)$ is a function that calculates the number of days between the times $t_1$ and $t_2$. Finally, the frequency feature is calculated as

$$x^{\mathrm{freq}}_i = \left|\mathcal{D}^{\mathrm{freq}}_{t_p,i}\right| \tag{2}$$

where $|\cdot|$ is the cardinality of a set. This aggregation strategy, however, does not take the combination of different features into account. For example, we can aggregate transactions according to certain criteria, such as: transactions made in the last $t_p$ days using the same authentication method (e.g. pin code or fingerprint) and the same payment channel (e.g. online banking or mobile app). For calculating such features, Bahnsen et al. [6] expand (1) as follows

$$\mathcal{D}^{\mathrm{freq2}}_{t_p,i} = \mathrm{AGG}^{\mathrm{freq}}\left(\mathcal{D}, i, t_p, \mathrm{cond1}, \mathrm{cond2}\right) = \left\{ x^{\mathrm{amt}}_j \;\middle|\; \left(x^{\mathrm{id}}_j = x^{\mathrm{id}}_i\right) \text{ and } \left(\mathrm{days}\!\left(x^{\mathrm{time}}_i, x^{\mathrm{time}}_j\right) < t_p\right) \text{ and } \left(x^{\mathrm{cond1}}_j = x^{\mathrm{cond1}}_i\right) \text{ and } \left(x^{\mathrm{cond2}}_j = x^{\mathrm{cond2}}_i\right) \right\}_{j=1}^{N} \tag{3}$$

where cond1 and cond2 could be one of the features of a transaction (e.g. authentication method, payment channel, beneficiary country, etc.). Similarly, the frequency feature is then calculated as

$$x^{\mathrm{freq2}}_i = \left|\mathcal{D}^{\mathrm{freq2}}_{t_p,i}\right| \tag{4}$$

One could also define new features as the ratio of frequency features. For example,

$$x^{\mathrm{ratio}}_i = x^{\mathrm{freq2}}_i \big/ x^{\mathrm{freq}}_i \tag{5}$$

which is always between 0 and 1. Since $x^{\mathrm{ratio}}_i$ is the fraction of transfers for which conditions cond1 and cond2 hold over all transactions in the past $t_p$ days, this feature represents the probability that both conditions cond1 and cond2 are met by the customer.

We show an example to further clarify how the frequency features are calculated. Consider a set of transactions made by a customer between 01/07/2019 and 03/07/2019, as shown in Table 1.

Table 1
Example calculation of frequency features: $x^{\mathrm{freq}}_i$ is the number of transactions in the last 24 h, and $x^{\mathrm{freq2}}_i$ is the number of transactions with the same authentication method and payment channel in the last 24 h.

| TransId | CustId | Timestamp        | Authentication method | Payment channel | $x^{\mathrm{freq}}_i$ | $x^{\mathrm{freq2}}_i$ |
|---------|--------|------------------|-----------------------|-----------------|-----------------------|------------------------|
| 1       | 1      | 01/07/2019 16:51 | pin code              | web             | 0                     | 0                      |
| 2       | 1      | 01/07/2019 19:04 | pin code              | web             | 1                     | 1                      |
| 3       | 1      | 01/07/2019 19:36 | fingerprint           | app             | 2                     | 0                      |
| 4       | 1      | 01/07/2019 23:31 | pin code              | web             | 3                     | 2                      |
| 5       | 1      | 02/07/2019 17:48 | fingerprint           | app             | 3                     | 1                      |
| 6       | 1      | 02/07/2019 22:12 | fingerprint           | app             | 2                     | 1                      |
| 7       | 1      | 02/07/2019 23:34 | fingerprint           | app             | 2                     | 2                      |
| 8       | 1      | 03/07/2019 01:40 | pin code              | app             | 3                     | 0                      |

Then we

Fig. 2. Example of a recency feature derived from the authentication method used by a customer. When the customer makes a transaction, she chooses one of five possible authentication methods, labeled AU01, AU02, …, AU05. If the time between two successive uses of the same authentication method is long, the recency is close to zero; if that time is short, the recency is close to one. If an authentication method is used for the first time, its recency is defined as zero.
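As a concrete illustration, the frequency features of Table 1 can be reproduced with a short script. This is an illustrative sketch in Python (the paper itself provides no code; the function and variable names are ours). All eight transactions belong to the same customer, and the window is interpreted as the strictly earlier transactions within the last 24 hours, which matches the counts in the table:

```python
from datetime import datetime, timedelta

# Transactions from Table 1 (all made by the same customer).
raw = [
    ("01/07/2019 16:51", "pin code", "web"),
    ("01/07/2019 19:04", "pin code", "web"),
    ("01/07/2019 19:36", "fingerprint", "app"),
    ("01/07/2019 23:31", "pin code", "web"),
    ("02/07/2019 17:48", "fingerprint", "app"),
    ("02/07/2019 22:12", "fingerprint", "app"),
    ("02/07/2019 23:34", "fingerprint", "app"),
    ("03/07/2019 01:40", "pin code", "app"),
]
txns = [(datetime.strptime(t, "%d/%m/%Y %H:%M"), au, ch) for t, au, ch in raw]

def freq_features(txns, tp=timedelta(hours=24)):
    """For each transaction, return (x_freq, x_freq2): the number of strictly
    earlier transactions within the window tp (Eq. (2)), and the number that
    additionally share the authentication method and payment channel (Eq. (4))."""
    out = []
    for i, (ti, au_i, ch_i) in enumerate(txns):
        # strictly earlier transactions of the same customer inside the window
        window = [(tj, au, ch) for tj, au, ch in txns[:i] if ti - tj < tp]
        x_freq = len(window)
        x_freq2 = sum(1 for _, au, ch in window if au == au_i and ch == ch_i)
        out.append((x_freq, x_freq2))
    return out

features = freq_features(txns)
```

Running this on the eight transactions yields exactly the $x^{\mathrm{freq}}_i$ and $x^{\mathrm{freq2}}_i$ columns of Table 1; the ratio feature of Eq. (5) would simply divide the two counts where the denominator is nonzero.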


Fig. 3. Recency versus time (in days) for different values of γ.

Fig. 4. (Left) Circular histogram of timestamps of transactions. The dashed line is the estimated periodic mean of the von Mises distribution. (Right) Circular
histogram including the 90% confidence interval (orange area).

estimate the frequency features $x^{\mathrm{freq}}_i$ and $x^{\mathrm{freq2}}_i$ by setting $t_p = 1$ day (~24 h) for ease of calculation.

The frequency features give us specific details about the spending behavior of the customer. For example, if a customer frequently used a particular payment channel in the past $t_p$ days, its frequency is obviously large. However, a zero frequency for a particular payment channel implies that the customer has not used that payment channel in the past $t_p$ days, which indicates anomalous behavior and perhaps fraud. The total number of frequency features can grow quite quickly, as $t_p$ can take several values, and the combination of criteria can be quite large as well. For the experiments we set the different values of $t_p$ to 90, 120 and 180 days. Then we calculate the frequency features using (2) and (4) as well as (5), with the aggregation criteria including payment channel, authentication method, beneficiary country, type of communication, and others.

2.2. Recency features

Although frequency features are powerful in describing a customer's spending behavior, they do not take the aspect of time into account. Recency features are a way to capture this information. Recency measures the time passed since the previous transaction that satisfies predefined conditions. To explain how recency features are defined, we show an example where we create a recency feature derived from the authentication method used by the customer, as illustrated in Fig. 2.

When a customer makes a transfer $x_i$, she chooses a method $x^{\mathrm{AU}}_i$ to authenticate herself. Examples of authentication methods are passwords, pin codes, fingerprints, itsme,¹ iris scans and hardware tokens. For each transaction $i$ in the data set $\mathcal{D}$, we define the recency of the transaction's authentication method as

$$x^{\mathrm{AU,recency}}_i = \exp(-\gamma \cdot \Delta t_i) \quad \text{where} \quad \Delta t_i = \min\left\{ \mathrm{days}\!\left(x^{\mathrm{time}}_i, x^{\mathrm{time}}_j\right) \;\middle|\; x^{\mathrm{id}}_j = x^{\mathrm{id}}_i \text{ and } x^{\mathrm{AU}}_j = x^{\mathrm{AU}}_i \right\}_{j=1}^{N} \tag{6}$$

Here $\Delta t_i$ is the time interval, typically in days, between two consecutive transfers made by the same customer with identification number $x^{\mathrm{id}}_i$ using the same authentication method $x^{\mathrm{AU}}_i$. The parameter $\gamma$ can be chosen such that, for example, the recency is small (e.g. 0.01) when $\Delta t = 180$ days (~6 months), in which case $\gamma = -\log(0.01)/180 = 0.026$. Notice that recency is always a number between 0 and 1. When the time period $\Delta t$ between two consecutive transfers with the same authentication method is small (large), we say that the authentication method has (not) recently been used. In that case the recency for this authentication method is close to one (zero). When an authentication method is used for the first time, we define its recency to be zero. A zero or small recency shows atypical behavior and might indicate fraud. Fig. 3 shows that recency indeed decreases when the time interval becomes larger. The parameter $\gamma$ determines how fast the recency decreases. For larger values of $\gamma$, recency will decrease more quickly with time, and vice versa.
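The recency feature of Eq. (6) is straightforward to compute once the time since the last use of the authentication method is known. A minimal sketch in Python (names ours; the calibration of γ follows the text):

```python
import math

# gamma calibrated as in the text: recency(180 days) = 0.01
GAMMA = -math.log(0.01) / 180  # ~0.026

def recency(dt_days, gamma=GAMMA):
    """Recency of an authentication method, Eq. (6): exp(-gamma * dt),
    where dt is the number of days since the same customer last used
    this method. A first-time use gets recency 0 by convention."""
    if dt_days is None:  # authentication method used for the first time
        return 0.0
    return math.exp(-gamma * dt_days)
```

An immediate reuse gives recency 1, a gap of 180 days gives the calibrated 0.01, and a first use maps to 0, the convention the text uses to flag atypical behavior.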

¹ This is a popular app in Belgium that allows you to safely, easily and reliably confirm your (digital) identity and approve transactions.


Table 2
Example calculation of a binary feature that indicates whether a transaction is made within the confidence interval (with α = 0.9) of the time of the previous transactions.

| TransId | Time             | Periodic mean | Confidence interval | Binary feature |
|---------|------------------|---------------|---------------------|----------------|
| 1       | 01/07/2019 16:51 | –             | –                   | –              |
| 2       | 01/07/2019 19:04 | –             | –                   | –              |
| 3       | 01/07/2019 19:36 | 17:57         | 16:07–19:48         | 1              |
| 4       | 01/07/2019 23:31 | 18:31         | 16:32–20:29         | 0              |
| 5       | 02/07/2019 17:48 | 19:40         | 15:39–23:40         | 1              |
| 6       | 02/07/2019 22:12 | 19:14         | 15:27–23:01         | 1              |
| 7       | 02/07/2019 23:34 | 19:47         | 15:52–23:42         | 1              |
| 8       | 03/07/2019 01:40 | 20:21         | 16:05–00:38         | 0              |

2.3. Other time-related features

It is well-known that time is an important aspect in fraud detection. Besides recency features, other time-related features can be created based on the assumption that certain events, like a customer making transactions, occur at similar moments in time. Having a transaction at 22:00 might be very regular for one person, but very suspicious for another person. Since, for every customer, we know the timestamps of all their transactions in the past, we can use this information to decide whether a new transaction at 22:00 is atypical for a particular customer. For the set of timestamps of transactions made by each customer we can construct a circular histogram, as shown in Fig. 4 (left). Since 00:00 is the same as 24:00, we have to model the time of a transaction as a periodic variable by fitting an appropriate statistical distribution [6]. A popular choice is the von Mises distribution, also known as the periodic normal distribution because it represents a normal distribution wrapped around a circle [25]. The von Mises distribution of a set of timestamps $\mathcal{D}^{\mathrm{time}} = \{t_1, t_2, \ldots, t_N\}$ is defined as

$$\mathcal{D}^{\mathrm{time}} \sim \text{von Mises}(\mu, \kappa) \tag{7}$$

where the parameters $\mu$ and $1/\kappa$ represent the periodic mean and the periodic standard deviation, respectively. These parameters can easily be estimated by most statistical software. We use the function mle.vonmises from the R package circular to compute the maximum likelihood estimates for the parameters of a von Mises distribution.

For each customer we construct a confidence interval for the time of a transaction. First, we select the set of transactions made by the same customer in the last $t_p$ days,

$$\mathcal{D}^{\mathrm{time}}_{t_p,i} = \mathrm{AGG}^{\mathrm{time}}\left(\mathcal{D}, i, t_p\right) = \left\{ x^{\mathrm{time}}_j \;\middle|\; \left(x^{\mathrm{id}}_j = x^{\mathrm{id}}_i\right) \text{ and } \left(\mathrm{days}\!\left(x^{\mathrm{time}}_i, x^{\mathrm{time}}_j\right) < t_p\right) \right\}_{j=1}^{N} \tag{8}$$

Based on this set of selected timestamps, the estimated parameters $\widehat{\mu}$ and $\widehat{\kappa}$ are calculated. Next, a von Mises distribution is fitted on the set of timestamps using these estimates:

$$x^{\mathrm{time}}_i \sim \text{von Mises}\left(\widehat{\mu}\left(\mathcal{D}^{\mathrm{time}}_{t_p,i}\right), \widehat{\kappa}\left(\mathcal{D}^{\mathrm{time}}_{t_p,i}\right)\right) \tag{9}$$

Once the von Mises distribution is fitted on the timestamps of the customer's transactions, we can construct a confidence interval with probability α, e.g. 80%, 90%, 95%. An example is presented in Fig. 4 (right). Using the confidence interval, a binary feature is created: a transaction is flagged as normal or suspicious depending on whether or not the time of the transaction is within the confidence interval. Table 2 shows an example of a binary feature that takes the value of one if the current time of the transaction is within the confidence interval of the time of the previous transactions with a confidence of α = 0.9. Of course, multiple of these binary features can be extracted for different values of α and time period $t_p$. The new feature also helps to get a better understanding of when a customer is expected to make transactions. Note that this feature (just as many others) solely indicates atypical behavior for a customer, which might give an indication of fraud. If a certain transaction is flagged as potentially fraudulent due to this feature, then it is important that this information is also given to the fraud investigators. If they see that the customer is abroad, then that could be the reason for the atypical value of this feature.

Instead of looking at the timestamp of a transaction within a day, we can of course create similar features indicating how atypical it is for a customer to have a payment on a certain day or above a certain amount. Some customers, for example, may only do transactions during the weekend. Adding such features based on customer spending history may bring a significant increase in model performance. Most predictive models also let you easily evaluate which features increased the performance of your model and which are not significant for discriminating frauds from non-frauds.

2.4. Monetary value related features

The last pillar of the RFM principle involves monetary value related features, which focus on the amount that is transferred. Monetary features calculate various statistics, such as the total value, the average, and the standard deviation of the amounts transferred during a sliding time window that satisfies predefined conditions (Fig. 5). The first step in creating monetary features is the same as with frequency features: select those transactions that were made in the last $t_p$ days, as in (1). Next, we can calculate the total amount spent on those transactions,

$$x^{\mathrm{total}}_i = \sum_{j=1}^{N} x^{\mathrm{amt}}_j \, I\!\left(x^{\mathrm{amt}}_j \in \mathcal{D}^{\mathrm{freq}}_{t_p,i}\right) \tag{10}$$

where $I(\cdot)$ is the indicator function. Of course, we can also aggregate transactions according to certain criteria, as in (3), followed by calculating their sum,

$$x^{\mathrm{total2}}_i = \sum_{j=1}^{N} x^{\mathrm{amt}}_j \, I\!\left(x^{\mathrm{amt}}_j \in \mathcal{D}^{\mathrm{freq2}}_{t_p,i}\right) \tag{11}$$

Transferring 500 Euros may be little for one person, but a lot for another person. A monetary feature that calculates the so-called z-score of an amount can indicate whether the amount is atypical for a

Fig. 5. Timeline of amounts transferred by a customer using, for example, a particular payment channel.
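The timing feature of Section 2.3 can be sketched in simplified form. The paper fits a full von Mises distribution with mle.vonmises from the R package circular; the sketch below (Python, names ours) replaces that fit by the circular (periodic) mean and a fixed half-width interval, which conveys the same idea of flagging transactions made at unusual hours:

```python
import math

def circular_mean_hour(hours):
    """Periodic mean of timestamps given as hours in [0, 24): map each time
    to an angle on the circle, average the unit vectors, and map the mean
    direction back to an hour, so 23:00 and 01:00 average to midnight."""
    angles = [2 * math.pi * h / 24 for h in hours]
    s = sum(math.sin(a) for a in angles) / len(angles)
    c = sum(math.cos(a) for a in angles) / len(angles)
    return (math.atan2(s, c) % (2 * math.pi)) * 24 / (2 * math.pi)

def within_interval(hour, center, half_width):
    """Binary feature: 1 if `hour` lies within +/- half_width hours of the
    periodic mean, taking the wrap-around at midnight into account."""
    diff = abs(hour - center) % 24
    return 1 if min(diff, 24 - diff) <= half_width else 0
```

For a customer who transacts around midnight (say at 23:00 and 01:00), the periodic mean is midnight; a 22:00 transaction falls inside a ±2-hour interval while a noon transaction does not. A proper confidence interval would come from the fitted $\widehat{\kappa}$, as in Eq. (9).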

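The monetary aggregates of Eqs. (10) and (11) follow the same pattern as the frequency features: sum the amounts of the customer's transactions that fall in the window and, optionally, satisfy extra criteria as in Eq. (3). A sketch in Python (the dictionary layout, names and data are our own hypothetical choices):

```python
from datetime import datetime, timedelta

def monetary_total(txns, i, tp=timedelta(days=90), conds=()):
    """Total amount spent by the same customer in the last tp days, Eq. (10);
    with `conds` (e.g. same payment channel) it becomes Eq. (11)."""
    ti = txns[i]
    total = 0.0
    for tj in txns:
        if tj["cust"] != ti["cust"]:
            continue                      # other customers are ignored
        dt = ti["time"] - tj["time"]
        if not timedelta(0) < dt < tp:
            continue                      # keep strictly earlier transfers in the window
        if any(tj[c] != ti[c] for c in conds):
            continue                      # extra aggregation criteria, as in Eq. (3)
        total += tj["amt"]
    return total

# Hypothetical three-transfer history for one customer.
txns = [
    {"time": datetime(2019, 7, 1), "cust": 1, "amt": 10.0, "channel": "web"},
    {"time": datetime(2019, 7, 2), "cust": 1, "amt": 20.0, "channel": "app"},
    {"time": datetime(2019, 7, 3), "cust": 1, "amt": 30.0, "channel": "web"},
]
```

Here monetary_total(txns, 2) sums the two earlier transfers (30.0), while restricting to the same payment channel via conds=("channel",) keeps only the first one (10.0).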

Fig. 6. An example of transferred amounts. The last amount of 500 Euros is clearly an outlier compared to the previous amounts. The atypically high amount is not indicated when using traditional estimates such as the sample mean and sample standard deviation. Instead, we have to use robust estimates such as the median and the median absolute deviation (MAD).

particular customer. For a set of amounts $\mathcal{D}^{\mathrm{freq}}_{t_p,i}$, the standardized values or z-scores are defined as

$$z_i = \frac{x^{\mathrm{amt}}_i - \widehat{\mu}_{\mathcal{D}}}{\widehat{\sigma}_{\mathcal{D}}} \tag{12}$$

where $\widehat{\mu}_{\mathcal{D}}$ and $\widehat{\sigma}_{\mathcal{D}}$ are the sample mean and sample standard deviation, respectively,

$$\widehat{\mu}_{\mathcal{D}} = \mathrm{Mean}\left(\mathcal{D}^{\mathrm{freq}}_{t_p,i}\right) \quad \text{and} \quad \widehat{\sigma}_{\mathcal{D}} = \mathrm{Stdev}\left(\mathcal{D}^{\mathrm{freq}}_{t_p,i}\right) \tag{13}$$

As a rule of thumb, an amount is flagged as an outlier if its z-score is larger than 3 in absolute value, $|z_i| > 3$. Now consider the transactions made by a customer, as shown in Fig. 6. The last amount of 500 Euros is clearly an outlier compared to the previous amounts. However, when using the sample mean and sample standard deviation, the z-score of the atypically high amount is only 2.66 and it is therefore not regarded as abnormal.

Instead of computing the z-score using traditional estimates such as the sample mean and sample standard deviation, we propose using robust alternatives such as the median and the median absolute deviation (MAD),

$$z^{r}_i = \frac{x^{\mathrm{amt}}_i - \mu^{r}_{\mathcal{D}}}{\sigma^{r}_{\mathcal{D}}} \tag{14}$$

with

$$\mu^{r}_{\mathcal{D}} = \mathrm{Median}\left(\mathcal{D}^{\mathrm{freq}}_{t_p,i}\right) \quad \text{and} \quad \sigma^{r}_{\mathcal{D}} = \mathrm{MAD}\left(\mathcal{D}^{\mathrm{freq}}_{t_p,i}\right) \tag{15}$$

where

$$\mathrm{MAD}(\{x_1, x_2, \ldots, x_n\}) = 1.4826 \cdot \underset{i=1,\ldots,n}{\mathrm{Median}} \left| x_i - \mathrm{Median}\left(\{x_j\}_{j=1}^{n}\right) \right| \tag{16}$$

The constant scale factor 1.4826 ensures that the MAD is a consistent estimator of the standard deviation $\sigma$, i.e. $E[\mathrm{MAD}(\{X_1, X_2, \ldots, X_n\})] = \sigma$ for $X_j$ distributed as $N(\mu, \sigma^2)$ and large $n$. Using the robust estimates, the z-score of the last amount in Fig. 6 is 5.79, which clearly indicates that the 500 Euros is atypical for this customer.

Remark: transferred amounts are often right-skewed, as shown in Fig. 7 (left). The rule of thumb, i.e. $|z_i| > 3$, implicitly assumes that the z-scores are distributed as $N(\mu, \sigma^2)$. Before standardizing the amounts, a transformation is often applied to them that changes their distribution to one that resembles a normal distribution, or at least a symmetric distribution. One such transformation is the natural logarithm, as shown in Fig. 7 (right).

A popular alternative for computing (robust) z-scores is the boxplot, which is a frequently used graphical tool to analyze a univariate data set [60]. The boxplot marks all observations outside the interval $[Q_1 - 1.5\,\mathrm{IQR};\; Q_3 + 1.5\,\mathrm{IQR}]$ as potential outliers, where $Q_1$, $Q_2$ and $Q_3$ denote respectively the first, second (or median) and third quartile, and $\mathrm{IQR} = Q_3 - Q_1$ equals the interquartile range. It is known that the boxplot typically flags too many points as outlying when the data are skewed, and therefore Hubert and Vandervieren [34] have modified the boxplot interval so that the skewness is sufficiently taken into account.

In practice one often tries to detect outliers using diagnostics starting from a classical or traditional fitting method. Unfortunately, these traditional techniques can be affected by outliers so strongly that the resulting fitted model may not allow us to detect the deviating observations. This is called the masking effect (see e.g. Rousseeuw and Leroy [55]). Additionally, some good data points might even appear to be outliers, which is known as swamping [19]. To avoid these effects, the goal of robust statistics is to find a fit which is close to the fit we would have found without the outliers. We can then automatically identify the outliers by their large 'deviation' (e.g., their distance or residual) from that robust fit. It is not our aim to replace traditional techniques by a robust alternative, but we have illustrated that robust methods can give you extra insights into the data and may improve the reliability and accuracy of your analysis.

2.5. Features based on (unsupervised) anomaly detection techniques

In this section we focus on unsupervised techniques that do not use the target variable (fraudulent or not). Anomaly detection techniques flag anomalies or outliers, which are observations that deviate from the pattern of the majority of the data. These flagged observations indicate
Fig. 7. Histogram and kernel density estimate of amounts (left) and natural logarithm of those amounts (right).
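The masking effect behind Eqs. (12)-(16) is easy to demonstrate numerically. In the sketch below (Python, with a hypothetical amount history in which the 500 plays the role of the outlier in Fig. 6), the classical z-score of the large amount stays below the threshold of 3, while the median/MAD-based score flags it clearly:

```python
import statistics

def classical_zscore(amounts, x):
    """z-score using the sample mean and sample standard deviation, Eqs. (12)-(13)."""
    return (x - statistics.mean(amounts)) / statistics.stdev(amounts)

def robust_zscore(amounts, x):
    """Robust z-score, Eqs. (14)-(16): the median replaces the mean and the
    MAD (with consistency factor 1.4826) replaces the standard deviation."""
    med = statistics.median(amounts)
    mad = 1.4826 * statistics.median([abs(a - med) for a in amounts])
    return (x - med) / mad

amounts = [20, 25, 22, 24, 21, 23, 500]  # hypothetical history; 500 is the outlier
# The outlier inflates the mean and standard deviation so much that its own
# classical z-score stays below 3 (masking), whereas the robust z-score is
# far above 3 and flags the amount.
```

The same comparison could be made with the (adjusted) boxplot fence discussed in the text; the point is that the scale estimate must not be driven by the very observation one is trying to detect.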


atypical behavior and hence may contain crucial information for fraud detection and should be investigated by the fraud expert. As an alternative, we propose to use the outlyingness score or metric of several anomaly detection techniques as features that we add to our data set.

Anomalies in a single dimension (i.e. univariate outliers) can be detected by computing (robust) z-scores (and checking which observations are larger than 3 in absolute value) or by constructing the (adjusted) boxplot (and checking which observations fall outside the boxplot interval or fence). Another tool for univariate anomaly detection that is also popular in fraud detection is the Newcomb-Benford law, which makes predictions about the distribution of the first leading digit of all numbers [7,46]. These techniques can then be applied on each feature in the data set. However, in this way it is only possible to detect anomalies that are atypical in (at least) one dimension or feature of our data set. Since fraudsters succeed very well in blending in with legitimate customers, they are typically not detected by checking each feature separately. It is important to flag those observations that deviate in several dimensions from the main data structure but are not atypical in any single feature. Such multivariate outliers can only be detected in the multidimensional space and require the use of advanced models.

A first tool for this purpose is robust statistics, which first fits the majority of the data and then flags the observations that deviate from this robust fit [54]. For a multivariate n × p data set X, one can calculate the robust Mahalanobis distance (or robust generalized distance) for each observation x_i:

MD(x_i, μ̂, Σ̂) = sqrt( (x_i - μ̂)^T Σ̂^{-1} (x_i - μ̂) ).   (17)

An observation is then flagged as an anomaly if its distance exceeds the cut-off value sqrt(χ²_{p,0.975}), where χ²_{p,0.975} is the 0.975 quantile of the chi-squared distribution with p degrees of freedom. It is of utmost importance that robust estimates of multivariate location and scatter are used in the computation of the distances (to avoid masking and swamping effects). A popular method yielding such estimates is the Minimum Covariance Determinant (MCD) method of Rousseeuw and Driessen [53] or, in case of high-dimensional data, the Minimum Regularized Covariance Determinant (MRCD) estimator of Boudt et al. [11]. Note that various robust alternatives for popular predictive models have also been proposed in the literature. These robust supervised techniques automatically flag anomalies (typically with a convenient graphical tool to visualize them). Therefore it is interesting to also apply robust versions of the predictive models on the data and carefully examine the anomalies flagged by these techniques (for more information see e.g. Maronna et al. [44]; Heritier et al. [33]; Atkinson and Riani [3]). Recently, Rousseeuw et al. [56] also used robust statistics to detect potential fraud cases in time series of imports into the European Union.

Besides robust statistics, many other unsupervised anomaly detection tools from various research fields have been proposed [28]. We briefly introduce and illustrate three popular techniques: the k-nearest neighbors distance [2,14], the local outlier factor (LOF) [13] and isolation forests [40]. The k-nearest neighbors distance for an observation is the average distance to each of its k closest neighbors. This distance measures how isolated an observation is from its neighbors, and hence a large distance typically indicates an anomaly. The LOF score is the average density around the k nearest neighbors divided by the density around the observation itself; anomalies typically have a score above one. An isolation forest is obtained by taking an ensemble of isolation trees which try to isolate each observation as quickly as possible. The final score is the average of the standardized path length (i.e. the number of splits needed to isolate the observation) over all trees. Hence, for all the methods above it holds: the higher the score or metric, the more suspicious the observation.

2.6. Other feature engineering techniques

In this paper, we only study a few feature engineering techniques to illustrate their importance as a key data engineering mechanism. Other powerful feature engineering techniques are the Box-Cox and Yeo-Johnson transformations, which both univariately transform data variables so as to boost the performance of the predictive analytical model. Note that these transformation techniques are sensitive to outliers and will try to move outliers inward at the expense of the normality of the central part of the data. Therefore various robust transformation procedures have been proposed in the literature (see e.g. Carroll and Ruppert [15]; Riani [51]; Marazzi et al. [43]; Raymaekers and Rousseeuw [50]). Feature engineering techniques have also been designed for unstructured data such as text, network data, and multimedia data (e.g., images, audio, videos). For text data, one commonly uses Singular Value Decomposition (SVD) or Natural Language Processing (NLP) as feature engineering techniques. For network data, node2vec and GraphSage [29,30] have proven to be very valuable techniques. Deep learning has been used to learn complex features for multimedia data. As an example, convolutional neural networks can learn key features to describe objects in images. However, an important caveat is that many of these features are black box in nature and thus hard to interpret for business decision makers. Finally, tailored feature engineering techniques have been designed for specific domains, e.g., Item2Vec in Recommender Systems [8].

3. Instance engineering

A major challenge in fraud analytics is the imbalance or skewness of the data, meaning that typically there are plenty of historical examples of non-fraudulent cases, but only a limited number of fraudulent cases. For example, in a credit card fraud setting, typically less than 0.5% of transactions are fraudulent. Such a problem is commonly referred to as the needle-in-a-haystack problem, and might cause an analytical technique to experience difficulties in learning an accurate model. Every classifier faced with a skewed data set typically tends to favor the majority class. In other words, the classifier tends to label all transactions as non-fraudulent since it then already achieves a classification accuracy of more than 99%. Classifiers typically learn better from a more balanced distribution. Two popular ways to accomplish this are undersampling, whereby non-fraudulent transactions in the training set are removed, and oversampling, whereby fraudulent transactions in the training set are replicated.

A practical question concerns the optimal non-fraud/fraud odds that should be targeted by under- or oversampling. This of course depends on the data characteristics and on the quality and type of classifier. Although trial and error is commonly adopted to determine this optimal ratio, a ratio of 90% non-fraudsters versus 10% fraudsters is usually already sufficient for most business applications.

The Synthetic Minority Oversampling technique, or SMOTE, is another interesting approach to deal with skewed class distributions [16]. In SMOTE, the minority class is oversampled by adding synthetic observations. The creation of these artificial fraudsters goes as follows. In Step 1 of SMOTE, for each minority class observation, the k nearest neighbors (of the same class) are determined. Step 2 then randomly selects one of these neighbors and generates a synthetic observation as follows: 1) take the difference between the features of the current minority sample and those of its selected neighbor; 2) multiply this difference by a random number between 0 and 1; and 3) add the obtained result as a new observation to the sample, hereby increasing the frequency of the minority class.

The key idea of these undersampling and oversampling techniques is to adjust the class priors to enable the analytical technique to create a meaningful model that discriminates the fraudsters from the non-fraudsters. By doing so, the class posteriors become biased. This is not a problem if the fraud analyst is interested in ranking the observations in


Fig. 8. Illustration of SMOTE, ADASYN, MWMOTE and ROSE. The blue circles represent the legitimate cases, the black squares are the original fraud cases, and the
red dots are the synthetic fraud cases. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
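The multivariate detectors introduced in Section 2.5 can be turned into extra score features. The sketch below assumes scikit-learn and SciPy are available; the simulated data and all parameter choices are illustrative assumptions, not taken from the paper's data set.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# Simulated 2-D feature matrix: a strongly correlated bulk plus two
# multivariate outliers that are unremarkable in each coordinate separately.
X_bulk = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.9], [0.9, 1.0]], size=500)
X_out = np.array([[2.0, -2.0], [-2.0, 2.0]])
X = np.vstack([X_bulk, X_out])

# Robust Mahalanobis distances of Eq. (17), with location and scatter
# estimated by the MCD so that the outliers cannot mask themselves.
mcd = MinCovDet(random_state=0).fit(X)
robust_dist = np.sqrt(mcd.mahalanobis(X))  # mahalanobis() returns squared distances
cutoff = np.sqrt(chi2.ppf(0.975, df=X.shape[1]))
flagged = robust_dist > cutoff

# LOF and isolation-forest scores, oriented so that higher = more suspicious.
lof = LocalOutlierFactor(n_neighbors=20).fit(X)
lof_score = -lof.negative_outlier_factor_
iforest = IsolationForest(random_state=0).fit(X)
iforest_score = -iforest.score_samples(X)

# Append the outlyingness scores as extra feature columns.
X_aug = np.column_stack([X, robust_dist, lof_score, iforest_score])
```

The two planted outliers exceed the chi-squared cut-off even though each of their coordinates lies well within the univariate range of the bulk, which is exactly the multivariate effect discussed above.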

terms of their fraud risks. However, if well-calibrated fraud probabilities are needed, then the posterior probabilities can be adjusted [57].

Since its introduction in 2002, many variants of SMOTE have been proposed in the literature (see e.g. Zhu et al. [63] and Kovács [36] for an overview). In Fig. 8, we visually show the differences between ADASYN [32], MWMOTE [9] and ROSE [41] and show their performance on our data set. We refer to their papers for details. It is clear that there is not one oversampling technique that always yields the best result [1].

4. Measuring performance

Table 3
Confusion matrix of a binary classification task.

                                             Actual legitimate (negative)   Actual fraudulent (positive)
                                             y = 0                          y = 1
Predicted as legitimate (negative), ŷ = 0    True negative (TN)             False negative (FN)
Predicted as fraudulent (positive), ŷ = 1    False positive (FP)            True positive (TP)

The aim of detecting transfer fraud is to identify transactions with a high probability of being fraudulent. From the perspective of machine learning, the task of predicting the fraudulent nature of transactions can be presented as a binary classification problem where observations (i.e. transactions, customers, etc.) belong either to class 0 or to class 1. We follow the convention that the fraudulent observations belong to class 1, whereas the legitimate observations correspond to class 0. We often speak of positive (class 1) and negative (class 0) observations.

Consider again our set D = {(x_i, y_i)}_{i=1}^N of N transactions. In general, a classification algorithm provides a continuous score s_i := s(x_i) ∈ [0, 1] for each transaction i. This score s_i is a function of the observed features x_i of transaction i and represents the fraud propensity of that transaction. Here we assume that legitimate transfers (class 0) have a lower score than fraudulent ones (class 1). The score s_i is then converted to a predicted class ŷ_i ∈ {0, 1} by comparing it with a classification threshold t ∈ [0, 1]. If a transfer's probability of being fraudulent as estimated by the classification model lies above this threshold value, then the transfer is predicted as fraud (s_i > t ⇒ ŷ_i = 1), and otherwise it is classified as legitimate (s_i ≤ t ⇒ ŷ_i = 0).

A classification exercise typically leads to a confusion matrix as shown in Table 3. Based on the confusion matrix, we can compute several performance measures such as Precision, Recall (also called True Positive Rate, Sensitivity or Hit Rate), False Positive Rate, and F1-measure. Each of these measures is calculated for a given confusion matrix that is based on a certain threshold value t ∈ [0, 1].

The receiver operating characteristic (ROC) curve, as shown on the left plot in Fig. 9, is probably the most popular method to analyze the effectiveness of a classifier. The ROC curve is obtained by plotting for each possible threshold value the false positive rate (FPR) on the X-axis and the true positive rate (TPR) on the Y-axis. As a graphical tool the ROC curve visualizes the tradeoff between achieving a high recall (TPR) while maintaining a low false positive rate (FPR), and is often used to find an appropriate decision threshold. Provost et al. [48] argue that ROC curves, as an alternative to accuracy estimation for comparing classifiers, would enable stronger and more general conclusions. For more information about ROC curves we refer to Krzanowski and Hand [37] and Swets [59].
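The thresholding rule s_i > t ⇒ ŷ_i = 1 and the measures derived from the counts in Table 3 can be sketched as follows; the labels and scores below are hypothetical.

```python
import numpy as np

def confusion_metrics(y, scores, t=0.5):
    """Precision, recall, F1 and FPR from the Table 3 counts at threshold t."""
    y = np.asarray(y)
    y_hat = (np.asarray(scores) > t).astype(int)   # s_i > t  =>  predicted fraud
    tp = np.sum((y == 1) & (y_hat == 1))
    fp = np.sum((y == 0) & (y_hat == 1))
    fn = np.sum((y == 1) & (y_hat == 0))
    tn = np.sum((y == 0) & (y_hat == 0))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # true positive rate / hit rate
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)               # false alarm rate
    return precision, recall, f1, fpr

# Hypothetical labels (1 = fraud) and classifier scores.
y = [0, 0, 0, 0, 1, 1, 0, 1]
s = [0.10, 0.20, 0.70, 0.30, 0.90, 0.40, 0.10, 0.80]
precision, recall, f1, fpr = confusion_metrics(y, s, t=0.5)
```

Sweeping t over [0, 1] and collecting (FPR, TPR) pairs traces out exactly the ROC curve described above.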


Fig. 9. (Left) example of a ROC curve. (Right) example of a Precision-Recall curve. Both curves are based on the same classifier validated on the same data set.
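Both curves in Fig. 9 are usually summarized by the area under them. A minimal sketch, assuming scikit-learn is available; average_precision_score is the customary estimate of the area under the Precision-Recall curve, and the data are hypothetical:

```python
from sklearn.metrics import average_precision_score, roc_auc_score

# Hypothetical labels (1 = fraud, heavily outnumbered) and classifier scores.
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
s = [0.10, 0.20, 0.15, 0.30, 0.25, 0.05, 0.60, 0.35, 0.90, 0.40]

# AUC: probability that a randomly chosen fraud case receives a higher
# score than a randomly chosen legitimate case.
auc = roc_auc_score(y, s)

# Area under the Precision-Recall curve, more informative under imbalance.
auprc = average_precision_score(y, s)
```

With heavy class imbalance the AUPRC separates models more sharply than the AUC, for the reasons discussed in the text.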

Comparing classifiers based solely on their ROC curves can be challenging. Therefore, the ROC curve is often summarized in a single score, namely the Area Under the ROC Curve (AUC), which varies between 0 and 1 [22,23,39]. In the context of fraud detection, the AUC of a classifier can be interpreted as the probability that a randomly chosen fraud case is assigned a higher score than a randomly chosen legitimate case. Therefore, a higher AUC indicates superior classification performance. A perfect classifier would achieve an AUC of 1 while a random model (i.e. no prediction power) would yield an AUC of 0.5.

When dealing with highly imbalanced data, as is the case with fraud detection, the AUC (and ROC curves) may be too optimistic and the Area under the Precision-Recall Curve (AUPRC) gives a more informative picture of a classifier's performance [20,24,58]. As the name suggests, the Precision-Recall curve (right plot in Fig. 9) plots the precision (Y-axis) against the recall (X-axis) for each possible threshold. The AUPRC is therefore also a value between 0 and 1. Both ROC and PR curves use the recall, but the ROC curve also plots the FPR whereas PR curves focus on precision. In the denominator of the FPR, one sums the number of true negatives and false positives. In highly imbalanced data, the number of negatives (legitimate observations) is much larger than the number of positives (fraudulent observations) and hence the number of true negatives is typically very high compared to the number of false positives. Therefore, a large increase or decrease in the number of false positives will have almost no impact on the FPR in the ROC curves. Precision, on the other hand, compares the number of true positives to the number of false positives and hence copes better with the imbalance between positive and negative observations. Since precision is more sensitive to class imbalance, the area under the Precision-Recall curve (AUPRC) is better suited to highlight differences between models for highly imbalanced data sets.

Despite the many ways to evaluate a classification model's performance, we argue that the true business objective of a fraud detection system is to minimize the financial losses due to fraud. However, the performance measures mentioned so far do not incorporate any costs related to incorrect predictions, such as not detecting a fraudulent transaction. Therefore, they may not be the most appropriate evaluation criteria when evaluating fraud detection models. In fact, the previous performance measures tacitly assume that all misclassification errors carry the same cost, and similarly for the correctly classified transactions. This assumption clearly does not hold in practice because wrongly predicting a fraudulent transaction as legitimate carries a significantly different financial cost than the inverse case. To better align the assessment of data-driven fraud detection systems with the actual objective of decreasing losses due to fraud, we extend the confusion matrix in Table 3 by incorporating costs as proposed in [5]. Let C_i(ŷ|y) be the cost of predicting class ŷ for a transfer i when the true class is y. If ŷ = y then the prediction is correct, while if ŷ ≠ y the prediction is incorrect. In general, the costs can be different for each of the four cells in the confusion matrix and can even be instance-dependent, in other words, specific to each transaction i as indicated in Table 4.

Table 4
Cost matrix where, between square brackets, the related instance-dependent classification costs for transfer fraud are given.

                                               Actual legitimate (negative)      Actual fraudulent (positive)
                                               y_i = 0                           y_i = 1
Predicted as legitimate (negative), ŷ_i = 0    True negative [C_i(0|0) = 0]      False negative [C_i(0|1) = A_i]
Predicted as fraudulent (positive), ŷ_i = 1    False positive [C_i(1|0) = c_f]   True positive [C_i(1|1) = c_f]

Hand et al. [31] proposed a cost matrix where, in the case of a false positive (i.e. incorrectly predicting a transaction as fraudulent), the associated cost is the administrative cost C_i(1|0) = c_f. This fixed cost c_f has to do with investigating the transaction and contacting the card holder. When detecting a fraudulent transfer, the same cost C_i(1|1) = c_f is allocated to a true positive, because in this situation the card owner will still need to be contacted. In other words, the action undertaken by the company towards an individual transaction i comes at a fixed cost c_f ≥ 0, regardless of the nature of the transaction. However, in the case of a false negative, in which a fraudulent transfer is not detected, the cost is defined as the amount C_i(0|1) = A_i of the transaction i. The instance-dependent costs are summarized in Table 4. We argue that the cost matrix in Table 4 is a reasonable assumption. However, one could alter the cost matrix, for example, by using a variable cost for false positives that reflects the level of friction that the card holder experiences.

Using the instance-dependent cost matrix in Table 4, Bahnsen et al. [6] define the cost of using a classifier s(⋅) on the transactions in D as

Cost(s(D)) = Σ_{i=1}^N [ y_i ( ŷ_i C_i(1|1) + (1 - ŷ_i) C_i(0|1) ) + (1 - y_i) ( ŷ_i C_i(1|0) + (1 - ŷ_i) C_i(0|0) ) ]   (18)
           = Σ_{i=1}^N ( y_i (1 - ŷ_i) A_i + ŷ_i c_f ).

In other words, the total cost is the sum of the amounts of the undetected fraudulent transactions (y_i = 1, ŷ_i = 0) plus the administrative costs incurred. The total cost may not always be easy to interpret because there is no reference to which the cost is compared [62]. Therefore Bahnsen et al. [6] proposed the cost savings of a classification algorithm as the cost of using the algorithm compared to using no algorithm at all. The cost of using no algorithm is


Cost_l(D) = min{ Cost(s_0(D)), Cost(s_1(D)) }   (19)

where s_0 refers to a classifier that predicts all the transactions in D as belonging to class 0 (legitimate) and similarly s_1 refers to a classifier that predicts all the transfers in D as belonging to class 1 (fraud). The cost savings is then expressed as the cost improvement of using an algorithm as compared with Cost_l(D),

Savings(s(D)) = ( Cost_l(D) - Cost(s(D)) ) / Cost_l(D).   (20)

In the case of transaction fraud, the cost of not using an algorithm is equal to the sum of the amounts of the fraudulent transactions, Cost_l(D) = Σ_{i=1}^N y_i A_i. The savings are then calculated as

Savings(s(D)) = Σ_{i=1}^N ( y_i ŷ_i A_i - ŷ_i c_f ) / Σ_{i=1}^N y_i A_i.   (21)

In other words, the costs that can be saved by using an algorithm are the sum of the amounts of the detected fraudulent transactions minus the administrative cost incurred in detecting them, divided by the sum of the amounts of the fraudulent transactions.

Besides obtaining the best statistical accuracy or the highest cost savings, there are many other reasons why one model might be preferred above another, such as interpretability, operational efficiency and economical cost.

Interpretability refers to the intelligibility or readability of the analytical model. Models that enable the user to understand the underlying reasons why the model signals a case to be suspicious are called white-box models. Complex, incomprehensible mathematical models are often referred to as black-box models. It might well be, in a fraud detection setting, that black-box models are acceptable, although in most settings some level of understanding and in-fact validation, which is facilitated by interpretability, is required for the management to have confidence and allow the effective implementation of the model. In most situations, the aim of the fraud detection system is to select, out of millions of payments, the transactions that are most suspicious. These top, say 100, most suspicious transactions are then given to the fraud investigators for further examination. When using white-box models, it is straightforward to also give information about why a certain transaction is flagged as being suspicious. This of course facilitates the job of the fraud investigators, so that more suspicious transactions can be examined, for example, in one day. The need for interpretability on the operator side, which advocates for relatively simple models and methods, also has the advantage of simplifying, for the end-user (a bank), the implementation, maintainability and possibility to update/enrich the system over time.

Operational efficiency refers to the response time or the time that is required to evaluate the model, or in other words, the time required to evaluate whether a case is suspicious. It also entails the efforts needed to collect and preprocess the data, evaluate the model, monitor and back-test the model, and re-estimate it when necessary. Operational efficiency can be a key requirement, meaning that the fraud detection system might have only a limited amount of time available to reach a decision and let a transaction pass or not. In other words, huge volumes of data need to be processed in a short time span. For example, in a credit card fraud detection setting, the decision time must typically be less than eight seconds. Such a requirement clearly impacts the design of the operational IT systems, but also the design of the analytical model.

The economical cost refers to the total cost of ownership and return on investment of the analytical fraud model. Although the former can be approximated reasonably well, the latter is more difficult to determine. Fraud analytical models should also be in line and comply with all applicable regulation and legislation with respect to, for example, privacy or the use of cookies in a web browser.

5. Experimental assessment

In Section 5.1 we first describe the observed data set for the experiments. In Section 5.2 we present the experimental design and in Section 5.3 we show the results of the experiments.

5.1. Information about the real data set

We illustrate the proposed techniques on a data set that has been provided to our research group by a large European bank. The data set consists of fraudulent and legitimate transactions made with debit cards between September 2018 and July 2019. Note that the magnitude of the data set illustrated here is much smaller than data sets typically used in fraud prediction and its incidence of fraudulent transactions is also much higher. This is because a kind of white-listing (based on experience-driven business rules) was first applied to the data by the bank to filter out "definitely safe" transactions. The total data set contains 31,763 individual transactions, each with 14 attributes and a fraud label that indicates when a transaction is confirmed as fraudulent. This label was created internally in the bank by fraud investigators, and can be considered as highly accurate. Only 506 transactions in the data set were labeled as fraud, resulting in a fraud ratio of 1.6%.

The initial set of features includes information regarding individual transactions, such as amount, timestamp, payment channel and beneficiary country. Table 5 contains examples of such typical attributes that are available for transactions.

Table 5
Examples of typical features of transactions.

Feature name                    Description
Transaction ID                  Transaction identification number
Timestamp                       Date and time of the transaction
Originator's account number     Identification number of the originator's bank account
Beneficiary's account number    Identification number of the beneficiary's bank account
Beneficiary's name              Name of the beneficiary
Card number                     Identification of the debit card
Payment channel                 Electronic channel (e.g. online banking, mobile app, ...)
Authentication method           e.g. pin code, fingerprint, itsme, ...
Currency                        Original currency (e.g. Euros, USD, ...)
Amount                          Amount of the transaction in Euros
Originator country              Country from which the money is sent
Beneficiary country             Country to which the money is sent
Communication                   Message provided with the transfer
Gender                          Gender of the customer
Age                             Age of the customer
Country                         Customer's country of residence
Language                        Customer's preferred language

5.2. Experimental design

In order to test the performance of machine learning models that only use these 14 initial features, we split the data into a training and a testing set, containing 70% and 30% of the transactions, respectively, stratified according to the fraud label to obtain similar fraud distributions as observed in the original data set. Table 6 summarizes the different data sets.

Table 6
Summary of the data sets.

Set        Transactions    Frauds
Total      31,763          506
Training   22,234          354
Testing    9529            153

For the experiments we use the following popular classification


Table 7
Performance of logistic regression (LR), decision tree (DT) and gradient boosted trees (GBT) on the testing set using (top) the 14 original features, (middle) the RFM and other time-related features, and (bottom) the features based on anomaly detection techniques.
Original features

Precision Recall F1 FPR AUPRC Savings % of fraud amount detected

LR 0.6154 0.3810 0.4706 0.0025 0.4417 0.5117 0.5340


DT 1.0000 0.1905 0.3200 0.0000 0.3050 0.3191 0.3260
GBT 0.7778 0.3333 0.4667 0.0010 0.4632 0.5068 0.5223

Including RFM and other time-related features


Precision Recall F1 FPR AUPRC Savings % of fraud amount detected
LR 0.5625 0.4286 0.4865 0.0035 0.4680 0.5483 0.5757
DT 0.8000 0.3810 0.5161 0.0010 0.4836 0.6635 0.6807
GBT 0.6923 0.4286 0.5294 0.0020 0.6333 0.5979 0.6202

Including features based on anomaly detection techniques


Precision Recall F1 FPR AUPRC Savings % of fraud amount detected
LR 0.7647 0.6190 0.6842 0.0020 0.6975 0.6751 0.7042
DT 0.8125 0.6190 0.7027 0.0015 0.6370 0.6883 0.7158
GBT 0.8750 0.6667 0.7568 0.0010 0.7669 0.7908 0.8183

methods: logistic regression (LR), decision tree (DT), using the CART algorithm [12], and gradient boosted trees (GBT), using the XGBoost algorithm [17]. Logistic regression is often used in the industry because it is fast to compute and easy to understand and interpret. Moreover, logistic regression is often used as a benchmark model to which other classification algorithms are compared. Commonly used decision tree algorithms include CART [12] and C4.5 [49]. The tree-like structure of a decision tree makes it particularly easy to gain insight into its decision process. This is especially useful in a fraud detection setting to understand how fraud is committed and work out corresponding fraud prevention strategies. XGBoost is short for eXtreme Gradient Boosting [17]. It is an efficient and scalable implementation of the gradient boosting framework by Friedman et al. [27] and Friedman [26], but it uses a more regularized model formalization to control over-fitting, which gives it better performance. The name XGBoost refers to the engineering goal to push the limit of computational resources for boosted tree algorithms. The XGBoost algorithm is widely used by data scientists to achieve state-of-the-art results on many machine learning challenges and has been used by a series of competition winning solutions [17]. Note that recent model explaining techniques, such as SHapley Additive exPlanations (SHAP, Lundberg and Lee [42]) and Local Interpretable Model-agnostic Explanations (LIME, Ribeiro et al. [52]), make it possible to provide model interpretability for such black-box methods. These perturbation-based methods estimate the contribution of individual features towards a specific prediction.

The purpose of this paper is to illustrate the benefit of the proposed data engineering techniques for the performance of fraud detection models regardless of the chosen model structure. Therefore, all three classifiers (LR, DT and GBT) are trained on the training set using their default parameters as suggested by their respective authors. The performance of the three classifiers is evaluated on the testing set using Precision, Recall (i.e. hit rate), F1 measure, false positive rate (FPR, i.e. false alarm rate), Area Under the Precision-Recall Curve (AUPRC), Savings, and the fraction of fraudulent amounts that are detected. Hereby a decision threshold of t = 50% is used. For the calculation of the Savings measure, we choose a fixed cost of c_f = 5 Euros.

5.3. Results

Table 7 contains the performance of logistic regression (LR), decision tree (DT) and gradient boosted trees (GBT) on the testing set using the 14 original features (top). When we include RFM features and time features using the von Mises distribution, the performance of all three models improves significantly (middle of Table 7). In particular the Savings, F1 and AUPRC values of the three models have clearly increased. Their overall performance is further enhanced when we add the features that are based on the anomaly detection techniques (bottom of Table 7).
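The cost and savings measures of Eqs. (18), (20) and (21), with the fixed administrative cost c_f = 5 Euros used in the experiments, can be sketched as follows; the amounts, labels and predictions below are hypothetical.

```python
import numpy as np

C_F = 5.0  # fixed administrative cost in Euros, as in the experiments

def total_cost(y, y_hat, amounts, c_f=C_F):
    """Eq. (18): every missed fraud costs its amount A_i, every alert
    (true or false positive) costs the fixed administrative cost c_f."""
    y, y_hat, A = (np.asarray(v, dtype=float) for v in (y, y_hat, amounts))
    return float(np.sum(y * (1 - y_hat) * A + y_hat * c_f))

def savings(y, y_hat, amounts, c_f=C_F):
    """Eqs. (20)-(21): relative cost improvement over predicting everything
    as legitimate, whose cost is the sum of the fraudulent amounts."""
    y, A = np.asarray(y, dtype=float), np.asarray(amounts, dtype=float)
    cost_no_model = float(np.sum(y * A))
    return (cost_no_model - total_cost(y, y_hat, amounts, c_f)) / cost_no_model

# Hypothetical example: the 200 Euro fraud is caught, the 80 Euro fraud is
# missed, and the two alerts (one true, one false) each cost c_f = 5 Euros.
y     = [0, 1, 0, 1, 0]       # 1 = fraud
y_hat = [0, 1, 1, 0, 0]       # model decisions
A     = [50.0, 200.0, 120.0, 80.0, 10.0]
```

Unlike accuracy or AUC, these two quantities weight each error by the money actually at stake, which is the business objective argued for above.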

Table 8
Performance of logistic regression (top), decision tree (middle) and gradient boosted trees (bottom) on the testing set using different over-sampling methods: SMOTE,
ADASYN, MWMOTE and ROSE.
Logistic regression (LR)

Precision Recall F1 FPR AUPRC Savings % of fraud amount detected

Original 0.7647 0.6190 0.6842 0.0020 0.6975 0.6751 0.7042


SMOTE 0.4103 0.7619 0.5333 0.0116 0.6408 0.7647 0.8316
ADASYN 0.4167 0.7143 0.5263 0.0106 0.6924 0.6674 0.7291
MWMOTE 0.4706 0.7619 0.5818 0.0091 0.6388 0.7733 0.8316
ROSE 0.4324 0.7619 0.5517 0.0106 0.6692 0.7681 0.8316

Decision tree (DT)


Precision Recall F1 FPR AUPRC Savings % of fraud amount detected
Original 0.8125 0.6190 0.7027 0.0015 0.6370 0.6883 0.7158
SMOTE 0.5000 0.7619 0.6038 0.0081 0.5118 0.7712 0.8261
ADASYN 0.5667 0.8095 0.6667 0.0066 0.3716 0.7987 0.8501
MWMOTE 0.4545 0.7143 0.5556 0.0091 0.4001 0.6739 0.7305
ROSE 0.6190 0.6190 0.6190 0.0040 0.6565 0.6866 0.7226

Gradient boosted trees (GBT)


Precision Recall F1 FPR AUPRC Savings % of fraud amount detected
Original 0.8750 0.6667 0.7568 0.0010 0.7669 0.7908 0.8183
SMOTE 0.6842 0.6190 0.6500 0.0030 0.7146 0.5941 0.6266
ADASYN 0.8462 0.5238 0.6471 0.0010 0.7763 0.5962 0.6184
MWMOTE 0.7500 0.5714 0.6486 0.0020 0.6931 0.5975 0.6249
ROSE 0.6667 0.0952 0.1667 0.0005 0.4341 0.0430 0.0482
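The over-sampling evaluated in Table 8 follows the two SMOTE steps described in Section 3. Below is a minimal hand-rolled sketch that rebalances toward the 90%/10% ratio used above; the data are hypothetical, and production code would rather rely on a dedicated library such as imbalanced-learn.

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Hand-rolled SMOTE following the two steps of Section 3.

    Step 1: for each minority sample, find its k nearest minority neighbors.
    Step 2: repeatedly pick a sample and one of its neighbors, and add the
    sample plus a random fraction (in [0, 1)) of the difference between them.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    X_min = np.asarray(X_min, dtype=float)
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)                 # a point is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]    # Step 1
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))               # a minority sample
        j = rng.choice(neighbors[i])               # one of its k neighbors
        gap = rng.random()
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))  # Step 2
    return np.array(synthetic)

# Hypothetical training set with 360 legitimate and 4 fraud cases; to reach
# the 90%/10% ratio we need 360/9 = 40 frauds, i.e. 36 synthetic ones.
rng = np.random.default_rng(42)
X_fraud = rng.normal(loc=3.0, size=(4, 2))
X_synthetic = smote(X_fraud, n_new=360 // 9 - 4, k=3, rng=rng)
```

Because every synthetic point is an interpolation between two real fraud cases, it lies inside the region spanned by the minority class, which is what distinguishes SMOTE from simple replication.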


Using the original features, the three models are only able to detect [4] B. Baesens, T. Van Gestel, S. Viaene, M. Stepanova, J. Suykens, J. Vanthienen,
Benchmarking state-of-the-art classification algorithms for credit scoring, J. Oper.
around 50% of the fraudulent amounts. By including the features that
Res. Soc. 54 (2003) 627–635.
are created by the various feature engineering methods, the improved [5] B. Baesens, S. Höppner, W. Verbeke, T. Verdonck, Instance-dependent cost-
models can block more than 70% of the stolen money and thus saving sensitive learning for detecting transfer fraud, arXiv (2020) preprint arXiv:
more than 67% of the costs compared to not using any fraud detection 2005.02488.
[6] A.C. Bahnsen, D. Aouada, A. Stojanovic, B. Ottersten, Feature engineering
system. strategies for credit card fraud detection, Expert Syst. Appl. 51 (2016) 134–142.
While the data set is now extended with new features, the imbalance between the fraudulent and legitimate transactions remains. To address this issue we apply the following over-sampling methods to the extended training set: SMOTE, ADASYN, MWMOTE and ROSE, each with the default parameters suggested by their respective authors. We apply these over-sampling techniques such that the new, re-balanced training set contains a ratio of 90% legitimate cases versus 10% fraud cases. In Table 8 we present the results for all three classifiers with each of the over-sampling methods. Notice how the performance varies depending on the chosen over-sampling method. The Savings value of the logistic regression model improves most with MWMOTE, as well as with SMOTE and ROSE. The Savings value of the decision tree, however, only increases with ADASYN and SMOTE. While logistic regression and the decision tree may benefit from over-sampling, the overall performance of the gradient boosted trees decreases. This may be because the boosting algorithm over-fits the classifier on the over-sampled training set, resulting in worse performance on the test set. Depending on the chosen classification method, there is definitely potential in over-sampling the training set with synthetic fraud cases, although no single over-sampling technique always yields the best result.
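To sketch how such synthetic fraud cases can be generated, the snippet below implements the core idea of SMOTE: each synthetic sample is drawn on the line segment between a minority (fraud) case and one of its k nearest minority-class neighbours. This is a minimal NumPy illustration with a hypothetical helper name, not the implementation used in our experiments.

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples in SMOTE fashion:
    interpolate between a random minority sample and one of its
    k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                # a point is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]          # k nearest neighbours per sample
    base = rng.integers(0, len(X_min), n_new)  # minority samples to expand
    nbr = nn[base, rng.integers(0, k, n_new)]  # one random neighbour per seed
    gap = rng.random((n_new, 1))               # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[nbr] - X_min[base])
```

To reach the 90/10 ratio used above, one would generate roughly n_legit/9 − n_fraud synthetic fraud cases and append them to the training set; in practice, packages such as smote-variants implement SMOTE, ADASYN, MWMOTE and many related techniques with their authors' default parameters.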
6. Conclusions and future research

In this paper, we extensively researched data engineering in a fraud detection setting. More specifically, we decomposed data engineering into feature engineering and instance engineering. Our motivation for doing so is that, based upon past extensive research, it is our firm belief that the best way to boost the performance of any analytical technique is to smartly engineer the data, instead of overly focusing on the development of new, often highly complex, analytical techniques that are frequently only poorly benchmarked and offer no interpretability at all. We used a payment transactions data set from a large European bank to illustrate the substantial impact of data engineering on the performance of a fraud detection model. We empirically showed that both the feature engineering and instance engineering steps significantly improved the performance of popular analytical models. Moreover, we have illustrated that, with clever engineering of the data, simple analytical techniques such as logistic regression and classification trees yield very good results. Although the focus in this paper is on payment transactions fraud, the discussed techniques are also useful for, or could be extended to, other types of fraud, e.g. in healthcare, insurance or e-commerce.
Acknowledgements

The authors gratefully acknowledge the financial support from the BNP Paribas Fortis Research Chair in Fraud Analytics at KU Leuven and the Internal Funds KU Leuven under grant C16/15/068.


Bart Baesens, Faculty of Economics and Business, KU Leuven, Naamsestraat 69, B-3000 Leuven, Belgium. www.dataminingapps.com. Southampton Business School, University of Southampton, 12 University Road, Highfield, Southampton SO17 1BJ, United Kingdom. Research interests: data mining and analytics, credit scoring, fraud detection, marketing analytics.

Sebastiaan Höppner, Faculty of Science, Department of Mathematics, KU Leuven, Celestijnenlaan 200B, B-3001 Leuven, Belgium. https://www.kuleuven.be/wieiswie/nl/person/00111217. Research interests: robust statistics, fraud detection, high-dimensional data analysis.

Tim Verdonck, Faculty of Science, Department of Mathematics, UAntwerp, Middelheimlaan 1, B-2020 Antwerp, Belgium. https://www.uantwerpen.be/nl/personeel/tim-verdonck/. Faculty of Science, Department of Mathematics, KU Leuven, Celestijnenlaan 200B, B-3001 Leuven, Belgium. https://www.kuleuven.be/wieiswie/nl/person/00071962. Research interests: statistical data science, anomaly and fraud detection, actuarial science.
