Data Engineering For Fraud Detection
Keywords: Decision analysis; Payment transactions fraud; Instance engineering; Feature engineering; Cost-based model evaluation

Abstract

Financial institutions increasingly rely upon data-driven methods for developing fraud detection systems, which are able to automatically detect and block fraudulent transactions. From a machine learning perspective, the task of detecting suspicious transactions is a binary classification problem and therefore many techniques can be applied. Interpretability is however of utmost importance for the management to have confidence in the model and for designing fraud prevention strategies. Moreover, models that enable the fraud experts to understand the underlying reasons why a case is flagged as suspicious will greatly facilitate their job of investigating the suspicious transactions. Therefore, we propose several data engineering techniques to improve the performance of an analytical model while retaining the interpretability property. Our data engineering process is decomposed into several feature and instance engineering steps. We illustrate the improvement in performance of these data engineering steps for popular analytical models on a real payment transactions data set.
1. Introduction

The Association of Certified Fraud Examiners (ACFE) estimates that a typical organization loses 5% of its revenues to fraud each year. The fifth oversight report on card fraud, which analyses developments in fraud related to card payment schemes (CPSs) in the Single Euro Payments Area (SEPA) and was issued in September 2018 by the European Central Bank, covers almost the entire card market and indicates that the total value of fraudulent transactions conducted using cards issued within SEPA and acquired worldwide amounted to 1.8 billion Euros in 2016, which in relative terms, i.e. as a share of the total value of transactions, amounted to 0.041% in 2016 [21]. These are just a few numbers that indicate the severity of the payment transactions fraud problem. Losses due to fraudulent activities moreover keep increasing each year and affect card holders worldwide. Therefore, fraud detection and prevention are more important than ever before, and developing powerful fraud detection systems is of crucial importance to many organizations and firms in order to reduce losses by timely blocking, containing and preventing fraudulent transactions.

The Oxford Dictionary defines fraud as follows: the crime of cheating somebody in order to get money or goods illegally. This definition captures the essence of fraud and covers the many different forms and types of fraud. On the other hand, it does not very precisely describe the nature and characteristics of fraud and as such does not provide much direction for discussing the requirements of a fraud detection system. A more thorough and detailed characterization of the multifaceted phenomenon of fraud is provided by Van Vlasselaer et al. [61]: Fraud is an uncommon, well-considered, imperceptibly concealed, time-evolving and often carefully organized crime which appears in many types of forms. This definition highlights five characteristics that are associated with particular challenges related to developing a fraud detection system.

The first emphasized characteristic and associated challenge concerns the fact that fraud is uncommon. Independent of the exact setting or application, only a small minority of the involved population of cases typically concerns fraud, of which furthermore only a limited number will be known to be fraudulent. This makes it difficult both to detect fraud, since the fraudulent cases are covered by the non-fraudulent ones, and to learn from historical cases to build a powerful fraud detection system, since only few examples are available. This makes it hard for machine learning techniques to extract meaningful patterns from the data.

Fraud is also imperceptibly concealed, since fraudsters try to blend into their environments to remain unnoticed. This relates to the subtlety of fraud: fraudsters try to imitate normal behavior.
Fig. 1. Timeline of transactions of a customer using, for example, a particular payment channel.
Moreover, fraud is well-considered and intentional, and complex fraud structures are carefully planned upfront. Fraudsters can also adapt or refine their tactics whenever needed, for example due to changing fraud detection mechanisms. Therefore, fraud detection systems need to improve and learn by example.

The traditional approach to fraud detection is expert-driven, building on the experience, intuition, and business or domain knowledge of one or more fraud investigators. Such an expert-based rule base or engine is typically hard to build and maintain. A shift is occurring towards data-driven or machine learning based fraud detection methodologies. This shift is triggered by the digitization of almost every aspect of society and daily life, which leads to an abundance of available data. Financial institutions increasingly rely upon data-driven methods for developing powerful fraud detection systems, which are able to automatically detect and block fraudulent transactions. In other words, we need adaptive analytical models to complement experience-based approaches for fighting fraud. A stream of literature has reported upon the adoption of data-driven approaches for developing fraud detection systems [45,47]. These methods significantly improve the efficiency of fraud detection systems and are easier to maintain and more objective. From a machine learning perspective, the task of detecting fraudulent transactions is a binary classification problem.

A natural first step to move from expert-based approaches to data-driven techniques (while still taking into account the experience of the fraud experts) is to consider logistic regression and/or decision trees. These simple analytical models can then be replaced by complex techniques such as random forests and boosting methods, support vector machines, neural networks and deep learning to increase the detection power. Although the latter are definitely powerful analytical techniques, they suffer from a very important drawback which is not desirable from a fraud prevention perspective: they are black box models, which means that they are very complex to interpret. We would also like to note that these complex models do not always significantly outperform simple analytical models such as logistic regression [4,38], and we strongly believe that you should always start with implementing these simple techniques. Many benchmarking studies have illustrated that complex analytical techniques only provide marginal performance gains on structured, tabular data sets as frequently encountered in common classification tasks such as fraud detection, credit scoring and marketing analytics [4,38]. It is our firm belief that in order to improve the performance of any analytical model, we should focus more on the data itself rather than on developing new, complex predictive analytical techniques. This is exactly the aim of data engineering. It can be defined as the clever engineering of data, hereby exploiting the bias of the analytical technique to our benefit, both in terms of accuracy and interpretability at the same time. Oftentimes it will be applied in combination with simple analytical techniques such as linear or logistic regression so as to maintain the interpretability property which is so often needed in analytical modeling. In our context of fraud analytics, interpretability is of key importance to design smart fraud prevention mechanisms. Data engineering can be decomposed into feature engineering and instance engineering. Feature engineering aims at designing smart features in one of two possible ways: either by transforming existing features using smart transformations, which will allow a simple analytical technique such as linear or logistic regression to boost its performance, or by extracting or creating new meaningful features (a process often called featurization) from different sources (e.g. transactional data, network data, time series data, text data, ...) to achieve better performance. Instance engineering entails the careful selection of instances or observations, again with the aim to improve predictive modeling performance. Put differently, it aims at selecting those observations which positively contribute to the learning of the analytical technique and removing those that have a detrimental impact on it. Obviously, this is not a trivial exercise and many instance engineering techniques have been developed, which we will carefully study and experiment with in this paper. In this paper the focus will be on successful data engineering steps to improve the performance of a fraud detection model. More concretely, we will describe the lessons that we have learnt when complementing expert-based approaches with machine learning or data-driven techniques to combat payment transactions fraud for a large European bank.

This paper is organized as follows. We start with presenting our data engineering process: Section 2 presents feature engineering steps whereas instance engineering is explained in Section 3. In Section 4 popular performance measures in an (imbalanced) classification setting are described. In Section 5, more information about payment transaction fraud and the observed data set is given. This section also illustrates the benefits of the various data engineering steps by showing increased performance on our real data set. Finally, concluding remarks and potential directions for future research are provided in Section 6.

2. Feature engineering

The main objective of machine learning is to extract patterns to turn data into knowledge. Since the beginning of this century, technological advances have drastically changed the size of data sets as well as the speed with which these data must be analyzed. Modern data sets may have a huge number of instances, a very large number of features, or both. In most applications, data sets are compiled by combining data from different sources and databases (containing both structured and unstructured data), where each source of information has its strengths and weaknesses. Before applying any machine learning algorithm, it is therefore necessary to transform these raw data sources into interesting features that better help the predictive models. This essential step, which is often denoted feature engineering, is of utmost importance in the machine learning process. We believe that data scientists should be well aware of the power of feature engineering and that they should share good practices.

An important set of interesting features can be created based on the famous Recency, Frequency, Monetary (RFM) principle. Recency measures how long ago a certain event took place, whereas frequency counts the number of specific events per unit of time. Besides recency features, we also present several other time-related features. Features related to monetary value measure the intensity of a transaction, typically expressed in a currency such as Euros or USD. We also introduce features based on unsupervised anomaly detection and briefly discuss some other advanced feature engineering techniques.

2.1. Frequency features

We explain the idea behind the RFM principle by first deriving frequency features using a transaction aggregation strategy in order to capture a customer's spending behavior. This methodology was first proposed for credit card fraud detection by Whitrow et al. [62].
Table 1
Example calculation of frequency features: x_i^freq is the number of transactions in the last 24 h, and x_i^freq2 is the number of transactions with the same authentication method and payment channel in the last 24 h.

D_{t_p,i}^freq = AGG^freq(D, i, t_p) = { x_j^amt | x_j^id = x_i^id and days(x_i^time, x_j^time) < t_p }_{j=1}^N    (1)
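As an illustration of Eq. (1) and the frequency features of Table 1, the count of recent transactions can be computed per customer with a rolling time window. The sketch below is a minimal example assuming a pandas data frame with hypothetical columns customer_id and timestamp; additional aggregation criteria (authentication method, payment channel, ...) can simply be added to the grouping keys.

    import numpy as np
    import pandas as pd

    def frequency_feature(df, t_p=pd.Timedelta("1D"), keys=("customer_id",)):
        # For each transaction, count the earlier transactions of the same customer
        # (optionally restricted further, e.g. same authentication method) that fall
        # within the time window t_p, cf. Eq. (1).
        df = df.sort_values("timestamp")

        def count_in_window(group):
            t = group["timestamp"].to_numpy()
            first_inside = np.searchsorted(t, t - t_p.to_timedelta64(), side="left")
            return pd.Series(np.arange(len(t)) - first_inside, index=group.index)

        return df.groupby(list(keys), group_keys=False).apply(count_in_window)

    # usage: df["n_last_24h"] = frequency_feature(df)   (the result aligns on the index)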
Fig. 2. Example of a recency feature derived from the authentication method used by a customer. When the customer makes a transaction, she chooses one of five possible authentication methods, labeled AU01, AU02, …, AU05. If the time between two successive uses of the same authentication method is long, the recency is close to zero, while if that time is short, the recency is close to one. If an authentication method is used for the first time, its recency is defined as zero.
Fig. 4. (Left) Circular histogram of timestamps of transactions. The dashed line is the estimated periodic mean of the von Mises distribution. (Right) Circular
histogram including the 90% confidence interval (orange area).
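The time-related features visualized in Fig. 4 rely on fitting a von Mises distribution to the (circular) timestamps of a customer's transactions. A minimal sketch with illustrative data, using scipy to obtain the estimates referred to as μ̂ and κ̂ later in the text:

    import numpy as np
    from scipy.stats import vonmises

    hours = np.array([9.5, 10.2, 11.0, 14.3, 15.1, 16.0, 10.8, 13.5, 9.9, 12.4])  # toy data
    angles = 2 * np.pi * hours / 24            # map hours of the day to angles on the circle

    kappa_hat, loc_hat, _ = vonmises.fit(angles, fscale=1)   # fix scale = 1 for circular data
    mean_hour = (loc_hat % (2 * np.pi)) * 24 / (2 * np.pi)   # estimated periodic mean in hours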
several values, and the combination of criteria can be quite large as well. For the experiments we set the different values of t_p to 90, 120 and 180 days. Then we calculate the frequency features using (2) and (4) as well as (5), with the aggregation criteria including payment channel, authentication method, beneficiary country, type of communication, and others.

2.2. Recency features

Although frequency features are powerful in describing a customer's spending behavior, they do not take the aspect of time into account. Recency features are a way to capture this information. Recency measures the time passed since the previous transaction that satisfies predefined conditions. To explain how recency features are defined, we show an example where we create a recency feature derived from the authentication method used by the customer, as illustrated in Fig. 2. The recency decays exponentially with the elapsed time, recency_i = e^(−γ·Δt_i), where Δt_i is the time interval, typically in days, between two consecutive transfers made by the same customer with identification number x_i^id using the same authentication method x_i^AU. The parameter γ can be chosen such that, for example, the recency is small (e.g. 0.01) when Δt = 180 days (~ 6 months), in which case γ = −log(0.01)/180 = 0.026. Notice that recency is always a number between 0 and 1. When the time period Δt between two consecutive transfers with the same authentication method is small (large), we say that the authentication method has (not) recently been used. In that case the recency for this authentication method is close to one (zero). When an authentication method is used for the first time, we define its recency to be zero. A zero or small recency shows atypical behavior and might indicate fraud. Fig. 3 shows that recency indeed decreases when the time interval becomes larger. The parameter γ determines how fast the recency decreases: for larger values of γ, recency will decrease more quickly with time and vice versa.

¹ This is a popular app in Belgium that allows you to safely, easily and reliably confirm your (digital) identity and approve transactions.
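A minimal sketch of such a recency feature, assuming a pandas data frame with hypothetical columns customer_id, auth_method and timestamp:

    import numpy as np
    import pandas as pd

    gamma = -np.log(0.01) / 180.0   # recency equals 0.01 for a gap of 180 days

    def recency_feature(df, keys=("customer_id", "auth_method")):
        # Time since the previous transaction of the same customer with the same
        # authentication method, turned into an exponentially decaying recency;
        # a first use of the method gets recency 0.
        df = df.sort_values("timestamp")
        gap = df.groupby(list(keys))["timestamp"].diff()
        delta_days = gap.dt.total_seconds() / (24 * 3600)
        return np.exp(-gamma * delta_days).fillna(0.0)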
Based on this set of selected timestamps, the estimated parameters μ̂ and κ̂ are calculated. Next, a von Mises distribution is fitted on the set of selected timestamps.

Fig. 5. Timeline of amounts transferred by a customer using, for example, a particular payment channel.

Transferring 500 Euros may be little for one person, but a lot for another person. A monetary feature that calculates the so-called z-score of an amount can indicate whether the amount is atypical for a particular customer. For a set of amounts D_{t_p,i}^freq, the standardized values or z-scores are defined as

z_i = (x_i^amt − μ̂_D) / σ̂_D    (12)

where μ̂_D and σ̂_D are the sample mean and sample standard deviation, respectively,

μ̂_D = Mean(D_{t_p,i}^freq)  and  σ̂_D = Stdev(D_{t_p,i}^freq)    (13)

As a rule of thumb, an amount is flagged as an outlier if its z-score is larger than 3 in absolute value, |z_i| > 3. Now consider the transactions made by a customer, as shown in Fig. 6. The last amount of 500 Euros is clearly an outlier compared to the previous amounts. However, when using the sample mean and sample standard deviation, the z-score of the atypically high amount is only 2.66 and is therefore not regarded as abnormal. Instead of computing the z-score using traditional estimates such as sample mean and sample standard deviation, we propose using robust alternatives such as the median and the median absolute deviation (MAD),

z_i^r = (x_i^amt − μ_D^r) / σ_D^r    (14)

where μ_D^r is the median of the amounts and σ_D^r their median absolute deviation, with

MAD({x_1, x_2, …, x_n}) = 1.4826 · Median_{i=1,…,n} | x_i − Median({x_j}_{j=1}^n) |    (16)

The constant scale factor 1.4826 ensures that the MAD is a consistent estimator of the standard deviation σ, i.e. E[MAD({X_1, X_2, …, X_n})] = σ for X_j distributed as N(μ, σ²) and large n. Using the robust estimates, the z-score of the last amount in Fig. 6 is 5.79, which clearly indicates that the 500 Euros is atypical for this customer.

Fig. 6. An example of transferred amounts. The last amount of 500 Euros is clearly an outlier compared to the previous amounts. The atypically high amount is not indicated when using traditional estimates such as sample mean and sample standard deviation. Instead, we have to use robust estimates such as the median and the median absolute deviation (MAD).

Remark: transferred amounts are often right-skewed as shown in Fig. 7 (left). The rule of thumb, i.e. |z_i| > 3, implicitly assumes that the z-scores are distributed as N(μ, σ²). Before standardizing the amounts, a transformation is often applied to them that changes their distribution to one that resembles a normal distribution, or at least a symmetric distribution. One such transformation is the natural logarithm, as shown in Fig. 7 (right).

Fig. 7. Histogram and kernel density estimate of amounts (left) and natural logarithm of those amounts (right).
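A minimal sketch of the robust z-score described above, with an illustrative series of amounts; the natural logarithm is applied first because transferred amounts are typically right-skewed:

    import numpy as np

    def robust_zscores(x):
        # z-scores based on the median and the MAD with consistency factor 1.4826
        x = np.asarray(x, dtype=float)
        med = np.median(x)
        mad = 1.4826 * np.median(np.abs(x - med))
        return (x - med) / mad

    amounts = np.array([55, 60, 40, 70, 65, 50, 60, 45, 500])   # hypothetical history
    z = robust_zscores(np.log(amounts))
    print(np.abs(z) > 3)    # the 500 Euro transfer is flagged as an outlier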
A popular alternative for computing (robust) z-scores is the boxplot, which is a frequently used graphical tool to analyze a univariate data set [60]. The boxplot marks all observations outside the interval [Q1 − 1.5 IQR; Q3 + 1.5 IQR] as potential outliers, where Q1, Q2 and Q3 denote respectively the first, second (or median) and third quartile and IQR = Q3 − Q1 equals the interquartile range. It is known that the boxplot typically flags too many points as outlying when the data are skewed and therefore Hubert and Vandervieren [34] have modified the boxplot interval so that the skewness is sufficiently taken into account.

In practice one often tries to detect outliers using diagnostics starting from a classical or traditional fitting method. Unfortunately, these traditional techniques can be affected by outliers so strongly that the resulting fitted model may not allow to detect the deviating observations. This is called the masking effect (see e.g. Rousseeuw and Leroy [55]). Additionally, some good data points might even appear to be outliers, which is known as swamping [19]. To avoid these effects, the goal of robust statistics is to find a fit which is close to the fit we would have found without the outliers. We can then automatically identify the outliers by their large 'deviation' (e.g., their distance or residual) from that robust fit. It is not our aim to replace traditional techniques by a robust alternative, but we have illustrated that robust methods can give you extra insights in the data and may improve the reliability and accuracy of your analysis.
2.5. Features based on (unsupervised) anomaly detection techniques

Anomalous observations are often simply removed before a model is built; in a fraud detection context, however, they represent atypical behavior and hence may contain crucial information for fraud detection and should be investigated by the fraud expert. As an alternative, we propose to use the outlyingness score or metric of several anomaly detection techniques as features that we add to our data set.

Anomalies in a single dimension (i.e. univariate outliers) can be detected by computing (robust) z-scores (and checking which observations are larger than 3 in absolute value) or by constructing the (adjusted) boxplot (and checking which observations fall outside the boxplot interval or fence). Another tool for univariate anomaly detection that is also popular in fraud detection is the Newcomb-Benford law, which makes predictions about the distribution of the first leading digit of all numbers [7,46]. These techniques can then be applied on each feature in the data set. However, in this way it is only possible to detect anomalies that are atypical in (at least) one dimension or feature of our data set. Since fraudsters succeed very well in blending in with legitimate customers, they are typically not detected by checking each feature separately. It is important to flag those observations that deviate in several dimensions from the main data structure but are not atypical in any single feature. Such multivariate outliers can only be detected in the multidimensional space and require the use of advanced models.

A first tool for this purpose is robust statistics, which first fits the majority of the data and then flags the observations that deviate from this robust fit [54]. For a multivariate n × p data set X, one can calculate the robust Mahalanobis distance (or robust generalized distance) for each observation x_i:

MD(x_i, μ̂, Σ̂) = √( (x_i − μ̂)ᵀ Σ̂⁻¹ (x_i − μ̂) ).    (17)
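A minimal sketch of how such robust distances can be obtained in practice, here with the MCD-based MinCovDet estimator of scikit-learn (the data matrix and the 97.5% chi-square cutoff are illustrative choices, not the authors' exact configuration):

    import numpy as np
    from scipy.stats import chi2
    from sklearn.covariance import MinCovDet

    X = np.random.default_rng(0).normal(size=(1000, 5))   # stand-in for an n x p feature matrix

    mcd = MinCovDet(random_state=0).fit(X)        # robust location and scatter estimates
    rd = np.sqrt(mcd.mahalanobis(X))              # robust Mahalanobis distances, cf. Eq. (17)
    outlying = rd > np.sqrt(chi2.ppf(0.975, df=X.shape[1]))   # commonly used cutoff

    # The distances rd can be added as an extra (outlyingness) feature to the data set.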
2.6. Other feature engineering techniques

In this paper, we only study a few feature engineering techniques to illustrate their importance as a key data engineering mechanism. Other powerful feature engineering techniques are the Box-Cox and Yeo-Johnson transformations, which both univariately transform data variables so as to boost the performance of the predictive analytical model. Note that these transformation techniques are sensitive to outliers and will try to move outliers inward at the expense of the normality of the central part of the data. Therefore various robust transformation procedures have been proposed in the literature (see e.g. Carroll and Ruppert [15]; Riani [51]; Marazzi et al. [43]; Raymaekers and Rousseeuw [50]). Feature engineering techniques have also been designed for unstructured data such as text, network data, and multimedia data (e.g., images, audio, videos). For text data, one commonly uses Singular Value Decomposition (SVD) or Natural Language Processing (NLP) as feature engineering techniques. For network data, node2vec and GraphSage [29,30] have proven to be very valuable techniques. Deep learning has been used to learn complex features for multimedia data. As an example, convolutional neural networks can learn key features to describe objects in images. However, an important caveat is that many of these features are black box in nature and thus hard to interpret for business decision makers. Finally, tailored feature engineering techniques have been designed for specific domains, e.g., Item2Vec in Recommender Systems [8].
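As a brief illustration, the Yeo-Johnson transformation is readily available in scikit-learn; the data and column name below are hypothetical:

    import pandas as pd
    from sklearn.preprocessing import PowerTransformer

    amounts = pd.DataFrame({"amount": [12.5, 40.0, 55.0, 60.0, 75.0, 500.0]})   # toy data
    pt = PowerTransformer(method="yeo-johnson", standardize=True)
    amounts["amount_yj"] = pt.fit_transform(amounts[["amount"]])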
3. Instance engineering

Fig. 8. Illustration of SMOTE, ADASYN, MWMOTE and ROSE. The blue circles represent the legitimate cases, the black squares are the original fraud cases, and the red dots are the synthetic fraud cases.
Fig. 9. (Left) example of a ROC curve. (Right) example of a Precision-Recall curve. Both curves are based on the same classifier validated on the same data set.
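As a minimal sketch, both curves in Fig. 9 and their summary areas can be computed with scikit-learn; the labels and scores below are toy values standing in for the test-set fraud labels and predicted fraud probabilities:

    import numpy as np
    from sklearn.metrics import (average_precision_score, precision_recall_curve,
                                 roc_auc_score, roc_curve)

    y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 0, 1])
    y_score = np.array([0.1, 0.2, 0.05, 0.3, 0.15, 0.4, 0.1, 0.8, 0.2, 0.35])

    fpr, tpr, _ = roc_curve(y_true, y_score)
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    print(roc_auc_score(y_true, y_score))             # area under the ROC curve
    print(average_precision_score(y_true, y_score))   # area under the PR curve (AUPRC)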
Table 7
Performance of logistic regression (LR), decision tree (DT) and gradient boosted trees (GBT) on the testing set using (top) the 14 original features, (middle) the RFM and other time-related features, and (bottom) the features based on anomaly detection techniques.
The following classification methods are considered: logistic regression (LR), decision tree (DT), using the CART algorithm [12], and gradient boosted trees (GBT), using the XGBoost algorithm [17]. Logistic regression is often used in industry because it is fast to compute and easy to understand and interpret. Moreover, logistic regression is often used as a benchmark model to which other classification algorithms are compared. Commonly used decision tree algorithms include CART [12] and C4.5 [49]. The tree-like structure of a decision tree makes it particularly easy to gain insight into its decision process. This is especially useful in a fraud detection setting to understand how fraud is committed and work out corresponding fraud prevention strategies. XGBoost is short for eXtreme Gradient Boosting [17]. It is an efficient and scalable implementation of the gradient boosting framework by Friedman et al. [27] and Friedman [26], but it uses a more regularized model formalization to control over-fitting, which gives it better performance. The name XGBoost refers to the engineering goal to push the limit of computational resources for boosted tree algorithms. The XGBoost algorithm is widely used by data scientists to achieve state-of-the-art results on many machine learning challenges and has been used by a series of competition winning solutions [17]. Note that recent model explaining techniques, such as SHapley Additive exPlanations (SHAP, Lundberg and Lee [42]) and Local Interpretable Model-agnostic Explanations (LIME, Ribeiro et al. [52]), make it possible to provide model interpretability for such black box methods. These perturbation-based methods estimate the contribution of individual features towards a specific prediction. The purpose of this paper is to illustrate the benefit of the proposed data engineering techniques to the performance of fraud detection models regardless of the chosen model structure. Therefore, all three classifiers (LR, DT and GBT) are trained on the training set using their default parameters as suggested by their respective authors. The performance of the three classifiers is evaluated on the testing set using Precision, Recall (i.e. hit rate), F1 measure, false positive rate (FPR, i.e. false alarm rate), Area Under the Precision-Recall Curve (AUPRC), Savings, and the fraction of fraudulent amounts that are detected. Hereby a decision threshold of t = 50% is used. For the calculation of the Savings measure, we choose a fixed cost of c_f = 5 Euros.
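The Savings measure is cost-based. As a sketch, assuming the commonly used cost matrix in which a missed fraud costs the transferred amount and every alert triggers the fixed investigation cost c_f (the authors' exact cost assignment may differ), it can be computed as follows:

    import numpy as np

    def savings(y_true, y_score, amounts, c_f=5.0, threshold=0.5):
        # Savings relative to a benchmark system that never raises an alert.
        y_true = np.asarray(y_true)
        amounts = np.asarray(amounts, dtype=float)
        y_pred = (np.asarray(y_score) >= threshold).astype(int)
        cost_model = np.sum(y_true * (1 - y_pred) * amounts) + np.sum(y_pred) * c_f
        cost_no_detection = np.sum(y_true * amounts)
        return 1.0 - cost_model / cost_no_detection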
5.3. Results

Table 7 contains the performance of logistic regression (LR), decision tree (DT) and gradient boosted trees (GBT) on the testing set using the 14 original features (top). When we include RFM features and time features using the von Mises distribution, the performance of all three models improves significantly (middle of Table 7). In particular, the Savings, F1 and AUPRC values of the three models have clearly increased. Their overall performance is further enhanced when we add the features that are based on the anomaly detection techniques (bottom of Table 7).
Table 8
Performance of logistic regression (top), decision tree (middle) and gradient boosted trees (bottom) on the testing set using different over-sampling methods: SMOTE, ADASYN, MWMOTE and ROSE.
Using the original features, the three models are only able to detect around 50% of the fraudulent amounts. By including the features that are created by the various feature engineering methods, the improved models can block more than 70% of the stolen money and thus save more than 67% of the costs compared to not using any fraud detection system.

While the data set is now extended with new features, the imbalance between the fraudulent and legitimate transactions remains. To address this issue we apply the following over-sampling methods on the extended training set: SMOTE, ADASYN, MWMOTE and ROSE, each with their default parameters as suggested by their respective authors. We use these over-sampling techniques such that the new, re-balanced training set contains a ratio of 90% legitimate cases versus 10% fraud cases. In Table 8 we present the results for all three classifiers with each of the over-sampling methods. Notice how the performance varies depending on the chosen over-sampling method. The Savings value of the logistic regression model is mostly improved with MWMOTE, as well as with SMOTE and ROSE. The Savings value of the decision tree, however, only increases with ADASYN and SMOTE. While logistic regression and decision tree may benefit from over-sampling methods, the overall performance of the gradient boosted trees decreases. This may be due to the boosting algorithm over-fitting the classifier on the over-sampled training set, resulting in a lesser performance on the testing set. Depending on the chosen classification method, there is definitely potential in over-sampling the training set with synthetic fraud cases, although there is not one over-sampling technique that will always yield the best result.
6. Conclusions and future research

In this paper, we extensively researched data engineering in a fraud detection setting. More specifically, we decomposed data engineering into feature engineering and instance engineering. Our motivation for doing so is that, based upon past extensive research, it is our firm belief that the best way to boost the performance of any analytical technique is to smartly engineer the data instead of overly focusing on the development of new, oftentimes highly complex, analytical techniques, which often yield models that are only poorly benchmarked and offer no interpretability at all. We used a payment transactions data set from a large European bank to illustrate the substantial impact of data engineering on the performance of a fraud detection model. We empirically showed that both the feature engineering and instance engineering steps significantly improved the performance of popular analytical models. Moreover, we have illustrated that by clever engineering of the data, simple analytical techniques such as logistic regression and classification trees yield very good results. Although the focus in this paper is on payment transactions fraud, the discussed techniques are also useful or could be extended to other types of fraud, e.g. in healthcare, insurance or e-commerce.

Acknowledgements

The authors gratefully acknowledge the financial support from the BNP Paribas Fortis Research Chair in Fraud Analytics at KU Leuven and the Internal Funds KU Leuven under grant C16/15/068.

References

[1] A. Amin, S. Anwar, A. Adnan, M. Nawaz, N. Howard, J. Qadir, A. Hawalah, A. Hussain, Comparing oversampling techniques to handle the class imbalance problem: a customer churn prediction case study, IEEE Access 4 (2016) 7940–7957.
[2] F. Angiulli, C. Pizzuti, Fast outlier detection in high dimensional spaces, in: European Conference on Principles of Data Mining and Knowledge Discovery, Springer, 2002, pp. 15–27.
[3] A. Atkinson, M. Riani, Robust Diagnostic Regression Analysis, Springer Science & Business Media, 2000.
[4] B. Baesens, T. Van Gestel, S. Viaene, M. Stepanova, J. Suykens, J. Vanthienen, Benchmarking state-of-the-art classification algorithms for credit scoring, J. Oper. Res. Soc. 54 (2003) 627–635.
[5] B. Baesens, S. Höppner, W. Verbeke, T. Verdonck, Instance-dependent cost-sensitive learning for detecting transfer fraud, arXiv preprint arXiv:2005.02488, 2020.
[6] A.C. Bahnsen, D. Aouada, A. Stojanovic, B. Ottersten, Feature engineering strategies for credit card fraud detection, Expert Syst. Appl. 51 (2016) 134–142.
[7] L. Barabesi, A. Cerasa, A. Cerioli, D. Perrotta, Goodness-of-fit testing for the Newcomb-Benford law with application to the detection of customs fraud, J. Bus. Econ. Stat. 36 (2018) 346–358.
[8] O. Barkan, N. Koenigstein, Item2vec: neural item embedding for collaborative filtering, arXiv preprint arXiv:1603.04259, 2016.
[9] S. Barua, M.M. Islam, X. Yao, K. Murase, MWMOTE: majority weighted minority oversampling technique for imbalanced data set learning, IEEE Trans. Knowl. Data Eng. 26 (2012) 405–425.
[10] S. Bhattacharyya, S. Jha, K. Tharakunnel, J.C. Westland, Data mining for credit card fraud: a comparative study, Decis. Support. Syst. 50 (2011) 602–613.
[11] K. Boudt, P.J. Rousseeuw, S. Vanduffel, T. Verdonck, The minimum regularized covariance determinant estimator, Stat. Comput. 30 (2020) 113–128.
[12] L. Breiman, J. Friedman, R. Olshen, C. Stone, Classification and Regression Trees, Wadsworth Int. Group 37 (1984) 237–251.
[13] M.M. Breunig, H.P. Kriegel, R.T. Ng, J. Sander, LOF: identifying density-based local outliers, in: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, 2000, pp. 93–104.
[14] M.R. Brito, E.L. Chávez, A.J. Quiroz, J.E. Yukich, Connectivity of the mutual k-nearest-neighbor graph in clustering and outlier detection, Statistics & Probability Letters 35 (1997) 33–42.
[15] R.J. Carroll, D. Ruppert, Transformations in regression: a robust analysis, Technometrics 27 (1985) 1–12.
[16] N.V. Chawla, K.W. Bowyer, L.O. Hall, W.P. Kegelmeyer, SMOTE: synthetic minority over-sampling technique, J. Artif. Intell. Res. 16 (2002) 321–357.
[17] T. Chen, C. Guestrin, XGBoost: a scalable tree boosting system, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2016, pp. 785–794.
[18] A. Dal Pozzolo, O. Caelen, Y.A. Le Borgne, S. Waterschoot, G. Bontempi, Learned lessons in credit card fraud detection from a practitioner perspective, Expert Syst. Appl. 41 (2014) 4915–4928.
[19] L. Davies, U. Gather, The identification of multiple outliers, J. Am. Stat. Assoc. 88 (1993) 782–792.
[20] J. Davis, M. Goadrich, The relationship between precision-recall and ROC curves, in: Proceedings of the 23rd International Conference on Machine Learning, ACM, 2006, pp. 233–240.
[21] European Central Bank, Fifth Report on Card Fraud, www.ecb.europa.eu/pub/cardfraud/html/ecb.cardfraudreport201809.en.html, September 2018.
[22] T. Fawcett, ROC graphs: notes and practical considerations for researchers, Mach. Learn. 31 (2004) 1–38.
[23] T. Fawcett, An introduction to ROC analysis, Pattern Recogn. Lett. 27 (2006) 861–874.
[24] A. Fernández, S. García, M. Galar, R.C. Prati, B. Krawczyk, F. Herrera, Learning from Imbalanced Data Sets, Springer, 2018.
[25] N.I. Fisher, Statistical Analysis of Circular Data, Cambridge University Press, 1995.
[26] J.H. Friedman, Greedy function approximation: a gradient boosting machine, Ann. Stat. (2001) 1189–1232.
[27] J. Friedman, T. Hastie, R. Tibshirani, et al., Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors), Ann. Stat. 28 (2000) 337–407.
[28] M. Goldstein, S. Uchida, A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data, PLoS One 11 (2016), e0152173.
[29] A. Grover, J. Leskovec, node2vec: scalable feature learning for networks, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 855–864.
[30] W.L. Hamilton, R. Ying, J. Leskovec, Inductive representation learning on large graphs, arXiv preprint arXiv:1706.02216, 2017.
[31] D.J. Hand, C. Whitrow, N.M. Adams, P. Juszczak, D. Weston, Performance criteria for plastic card fraud detection tools, J. Oper. Res. Soc. 59 (2008) 956–962.
[32] H. He, Y. Bai, E.A. Garcia, S. Li, ADASYN: adaptive synthetic sampling approach for imbalanced learning, in: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), IEEE, 2008, pp. 1322–1328.
[33] S. Heritier, E. Cantoni, S. Copt, M.P. Victoria-Feser, Robust Methods in Biostatistics 825, John Wiley & Sons, 2009.
[34] M. Hubert, E. Vandervieren, An adjusted boxplot for skewed distributions, Computational Statistics & Data Analysis 52 (2008) 5186–5201.
[35] S. Jha, M. Guillen, J.C. Westland, Employing transaction aggregation strategy to detect credit card fraud, Expert Syst. Appl. 39 (2012) 12650–12657.
[36] G. Kovács, smote-variants: a Python implementation of 85 minority oversampling techniques, Neurocomputing 366 (2019) 352–354.
[37] W.J. Krzanowski, D.J. Hand, ROC Curves for Continuous Data, Chapman and Hall/CRC, 2009.
[38] S. Lessmann, B. Baesens, H.V. Seow, L.C. Thomas, Benchmarking state-of-the-art classification algorithms for credit scoring: an update of research, Eur. J. Oper. Res. 247 (2015) 124–136.
[39] C.X. Ling, J. Huang, H. Zhang, et al., AUC: a statistically consistent and more discriminating measure than accuracy, in: IJCAI, 2003, pp. 519–524.
[40] F.T. Liu, K.M. Ting, Z.H. Zhou, Isolation forest, in: 2008 Eighth IEEE International Conference on Data Mining, IEEE, 2008, pp. 413–422.
[41] N. Lunardon, G. Menardi, N. Torelli, ROSE: a package for binary imbalanced learning, R Journal 6 (2014).
[42] S.M. Lundberg, S.I. Lee, A unified approach to interpreting model predictions, in: Advances in Neural Information Processing Systems, 2017, pp. 4765–4774.
[43] A. Marazzi, A.J. Villar, V.J. Yohai, Robust response transformations based on optimal prediction, J. Am. Stat. Assoc. 104 (2009) 360–370.
[44] R.A. Maronna, R.D. Martin, V.J. Yohai, M. Salibián-Barrera, Robust Statistics: Theory and Methods (with R), John Wiley & Sons, 2019.
[45] E.W. Ngai, Y. Hu, Y.H. Wong, Y. Chen, X. Sun, The application of data mining techniques in financial fraud detection: a classification framework and an academic review of literature, Decis. Support. Syst. 50 (2011) 559–569.
[46] M.J. Nigrini, Benford's Law: Applications for Forensic Accounting, Auditing, and Fraud Detection 586, John Wiley & Sons, 2012.
[47] C. Phua, V. Lee, K. Smith, R. Gayler, A comprehensive survey of data mining-based fraud detection research, arXiv preprint arXiv:1009.6119, 2010.
[48] F. Provost, T. Fawcett, R. Kohavi, The case against accuracy estimation for comparing classifiers, in: Proceedings of the 15th International Conference on Machine Learning, Morgan Kaufmann, San Francisco, 1998, pp. 445–453.
[49] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, San Mateo, 1993.
[50] J. Raymaekers, P.J. Rousseeuw, Transforming variables to central normality, arXiv preprint arXiv:2005.07946, 2020.
[51] M. Riani, Robust transformations in univariate and multivariate time series, Econ. Rev. 28 (2008) 262–278.
[52] M.T. Ribeiro, S. Singh, C. Guestrin, "Why should I trust you?": explaining the predictions of any classifier, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1135–1144.
[53] P.J. Rousseeuw, K.V. Driessen, A fast algorithm for the minimum covariance determinant estimator, Technometrics 41 (1999) 212–223.
[54] P.J. Rousseeuw, M. Hubert, Anomaly detection by robust statistics, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8 (2018), e1236.
[55] P.J. Rousseeuw, A.M. Leroy, Robust Regression and Outlier Detection 589, John Wiley & Sons, 2005.
[56] P. Rousseeuw, D. Perrotta, M. Riani, M. Hubert, Robust monitoring of time series with application to fraud detection, Econometrics and Statistics 9 (2019) 108–121.
[57] M. Saerens, P. Latinne, C. Decaestecker, Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure, Neural Comput. 14 (2002) 21–41.
[58] T. Saito, M. Rehmsmeier, The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets, PLoS One 10 (2015).
[59] J.A. Swets, Signal Detection Theory and ROC Analysis in Psychology and Diagnostics: Collected Papers, Psychology Press, 2014.
[60] J.W. Tukey, Exploratory Data Analysis, vol. 2, Reading, MA, 1977.
[61] V. Van Vlasselaer, T. Eliassi-Rad, L. Akoglu, M. Snoeck, B. Baesens, Gotcha! Network-based fraud detection for social security fraud, Manag. Sci. 63 (2017) 3090–3110.
[62] C. Whitrow, D.J. Hand, P. Juszczak, D. Weston, N.M. Adams, Transaction aggregation as a strategy for credit card fraud detection, Data Min. Knowl. Disc. 18 (2009) 30–55.
[63] B. Zhu, Z. Gao, J. Zhao, S.K. Vanden Broucke, IRIC: an R library for binary imbalanced classification, SoftwareX 10 (2019) 100341.

Bart Baesens, Faculty of Economics and Business, KU Leuven, Naamsestraat 69, B-3000 Leuven, Belgium (www.dataminingapps.com); Southampton Business School, University of Southampton, 12 University Road, Highfield, Southampton SO17 1BJ, United Kingdom. Research interests: data mining and analytics, credit scoring, fraud detection, marketing analytics.

Sebastiaan Höppner, Faculty of Science, Department of Mathematics, KU Leuven, Celestijnenlaan 200B, B-3001 Leuven, Belgium (https://www.kuleuven.be/wieiswie/nl/person/00111217). Research interests: robust statistics, fraud detection, high-dimensional data analysis.

Tim Verdonck, Faculty of Science, Department of Mathematics, UAntwerp, Middelheimlaan 1, B-2020 Antwerp, Belgium (https://www.uantwerpen.be/nl/personeel/tim-verdonck/); Faculty of Science, Department of Mathematics, KU Leuven, Celestijnenlaan 200B, B-3001 Leuven, Belgium (https://www.kuleuven.be/wieiswie/nl/person/00071962). Research interests: statistical data science, anomaly and fraud detection, actuarial science.