EMSS2014 Final

This document introduces BankSim, a multi-agent simulation of bank payments developed using aggregated transaction data from a Spanish bank. The goal of BankSim is to generate synthetic transaction data that can be used to test fraud detection techniques while protecting sensitive customer information. The simulation models normal payment behavior and relationships between merchants and customers based on statistical analysis of the real data. Future work will involve injecting known fraud signatures to evaluate detection strategies. By simulating realistic scenarios, BankSim aims to address the lack of public data available for research in fraud detection and related fields.


BankSim: A Bank Payment Simulation for Fraud Detection Research

Conference Paper · September 2014

All content following this page was uploaded by Edgar Alonso Lopez-Rojas on 06 January 2015.



BANKSIM: A BANK PAYMENTS SIMULATOR FOR FRAUD DETECTION RESEARCH

Edgar Alonso Lopez-Rojas(a) and Stefan Axelsson(b)

(a),(b) Blekinge Institute of Technology, School of Computing
(a) [email protected], (b) [email protected]

ABSTRACT
BankSim is an agent-based simulator of bank payments based on a sample of aggregated transactional data provided by a bank in Spain. The main purpose of BankSim is the generation of synthetic data that can be used for fraud detection research. Statistical analysis and Social Network Analysis (SNA) of the relations between merchants and customers were used to develop and calibrate the model. Our ultimate goal is for BankSim to be usable to model relevant scenarios that combine normal payments and injected known fraud signatures. The data sets generated by BankSim contain no personal information or disclosure of legal and private customer transactions. Therefore, they can be shared by academia, and others, to develop and reason about fraud detection methods. Synthetic data has the added benefit of being easier, faster and cheaper to acquire for experimentation, even for those that have access to their own data. We argue that BankSim generates data that usefully approximates the relevant aspects of the real data. We intend to make the simulation and its results available to the research community.

Keywords: Multi-Agent Based Simulation, Bank Payments, Fraud Detection, Credit Card Fraud, Synthetic Data.

1. INTRODUCTION
In this paper we present BankSim, a Bank payment Simulation, built on the concept of Multi Agent-Based Simulation (MABS). BankSim is based on a sample of aggregated transaction data provided by one bank in Spain with the aim of promoting the development of applications for Big Data. This data contains several thousand records of transactional data covering six months, from November 2012 until April 2013, restricted by zip code location to Madrid and Barcelona. That is, this data is recent enough to reflect current conditions of payments, but aggregated so as not to pose a risk to the privacy of any specific customer.

The defence against fraud is an important topic that has seen some study. In a bank the cost of fraud is of course ultimately transferred to the consumer, and finally impacts the overall economy. Our aim with BankSim is to learn the relevant parameters that govern the behaviour of a bank payment system, in order to simulate normal behaviour and inject specific fraud scenarios that are interesting to study.

The main contribution, and focus, of this paper is a method of generating anonymous synthetic data from aggregated transactional data of a bank payment system, which can then be used as part of the necessary data for the development and testing of fraud detection techniques. Even so, the data set generated could also be the basis for research in other fields, such as consumer behaviour and general economic study including social development and forecasting.

Later we plan to address the actual fraud and develop malicious agents that inject fraudulent and anomalous behaviour, and then develop and test different strategies for detecting these instances of fraud. Even though we do not address these issues in this paper, we describe some typical scenarios of credit card fraud that affect bank payments. As this is our ultimate goal, fraud heavily influenced the design of BankSim.

The main goal of developing this simulation is that it enables us to share realistic fraud data, without exposing potentially business or personally sensitive information about the actual source. As data relevant for computer security research is often sensitive, for a multitude of reasons, e.g. financial, privacy related, legal, contractual and other, research has historically been hampered by a lack of publicly available relevant data sets. Our aim with this work is to address that situation. However, simulation also has other benefits: it can be much faster and less expensive than trying different scenarios of fraud, detection algorithms, and personnel and security policy approaches in an actual store. The latter also risks incurring e.g. unhappiness amongst the staff, due to trying e.g. an ill-advised policy, which leads to even greater expense and unwanted problems.

Outline: The rest of this paper is organised as follows: Section 2 introduces the topic of fraud detection for bank payments and presents previous and related work. Section 3 describes the problem, which is the generation of synthetic data of a bank payment system. Section 4 shows a data analysis of the current data. Section 5 describes some credit card fraud scenarios. Section 6 presents an implementation of a MABS for our domain. We present our results and verification of the simulation in section 7 and finish with a discussion and conclusions, including future work, in section 8.
2. BACKGROUND AND RELATED WORK
Simulations in the domain of financial markets have traditionally been focused on finding answers to prediction problems such as economic growth, market growth, consumption patterns and so on.

There is currently a lack of research in the area of simulation of bank systems, more specifically for fraud detection.

We have previously analysed the implications of using machine learning techniques for fraud detection using a synthetic data set (Lopez-Rojas and Axelsson, 2012a). We then built a simple simulation of a financial transaction system based on these assumptions, in order to overcome our limitations and lack of real data (Lopez-Rojas and Axelsson, 2012b). However, this work was not based on any underlying data, but rather on assumptions of what such data could contain. We learned the principles of simulation and modelling and successfully applied them to RetSim (Lopez-Rojas et al., 2013). RetSim is the older brother of BankSim and uses data from a retail store to produce a realistic simulation that generates synthetic data.

Here we continued our work and built a realistic simulation based on a real aggregated payment data set that can be used to test diverse fraud detection techniques. All our simulators are part of a financial system chain. They have in common that all are built with the aim of modelling financial activity with the purpose of generating synthetic data sets for fraud detection research. We are continuing to build the needed components to integrate them into a complex financial chain and produce a virtual financial world that covers many domains. This is specifically useful to implement more complex fraud scenarios such as money laundering.

Data mining based methods have previously been used to detect fraud (Phua et al., 2010). This led to the result that machine learning algorithms can identify novel methods of fraud by detecting those transactions that are different (anomalous) in comparison to benign transactions. This problem in machine learning is known as novelty detection. Supervised learning algorithms have previously been used on a synthetic data set to prove the performance of outlier detection (Abe et al., 2006); however, this has not been performed on transactional data. There are tools such as IDSG (IDAS Data and Scenario Generator (Lin et al., 2006)), which was developed with the purpose of generating synthetic data based on the relationship between attributes and their statistical distributions. IDSG was created to support data mining systems during their test phase and it has been used to test fraud detection systems.

The most common method used today for preventing illegal financial transactions consists of flagging different clients according to perceived risk and restricting their transactions using thresholds (Bolton and Hand, 2002). Transactions that exceed these thresholds require extra scrutiny whereby the client needs to declare the provenance of the funds. These thresholds are usually set by legislation without distinction made between different economic sectors or actors. This of course leads to fraudsters adapting their behaviour in order to avoid this kind of control, e.g. by making many smaller transactions that fall just below the threshold. Hence, these and other similar methods have proven insufficient (Magnusson, 2009).

Nowadays, with the popularity of social networks such as Facebook, the topic of Social Network Analysis (SNA) has been given special interest in the research community (Alam and Geller, 2012). Social Network Analysis is a topic that is currently being combined with Social Simulation. Both topics support each other for the benefit of representing the interactions and behaviour of agents in the specific context of social networks.

Our approach aims to fill the gap between existing methods and provide researchers with a tool that generates reliable data to experiment with different fraud detection techniques and compare them with other approaches.

3. PROBLEM
Fraud and fraud detection is an important problem that has a number of applications in diverse domains. However, in order to investigate, develop, test and improve fraud detection techniques one needs detailed information about the domain and its specific problems.

There is a lack of data sets available for research in fields such as money laundering, financial fraud and illegal payments. Disclosure of personal or private information is only one of the many concerns that those that own relevant data have. This leads to in-house solutions that are not shared with the research community and hence there can be no mutual benefit from free exchange of ideas between the many worlds of the data owners and the research community.

After describing the problem we formulated the main research question that we address in this paper:

RQ How could we model and simulate a bank payment system and generate a realistic and reliable synthetic data set for the purpose of fraud detection?

4. DATA ANALYSIS
To better understand the problem we began by performing data analysis of the sample data provided by a bank in Spain. We are interested in finding the necessary and sufficient attributes to enable us to simulate a realistic scenario in which we could reason about and detect interesting cases of fraud.

The bank in Spain, which we will name Bank Inc., provided a web service interface to query aggregated information about bank payments. The web service limited the query to transactions that occurred between November 2012 and April 2013, restricted to transactions that took place in Madrid and Barcelona. The service provided by Bank Inc. groups the data by month, week, day of the week and hour. The interface allows three types of queries: consumption habits, customer classification, and origin and source of transactions. The basic information provided by the queries is mainly statistical information
about payments such as: number, average, minimum and maximum values. It also provides information regarding zip code location of origin/source, merchant category and customer gender and age. There are 16 merchant categories that differentiate between payments made for example in a restaurant or payments performed while buying cars or other goods.

It was not possible to query information where fewer than two customers made payments. This means that some information is missing from the data provided, but fortunately we know exactly which data is missing, because the response from the web service differs depending on whether the data is missing or restricted.

We initially started by selecting a few zip codes that contain enough information to avoid missing fields. We selected two of the biggest zip codes by number of transactions and amount. We extracted statistical information, presented in table 1. Age categories are given in table 2 and gender categories are given in table 3. All prices given are in euro.

Due to a lack of space we will focus our presentation of the analysis on one of the biggest zip codes by payment volume, which we will call Zip Code One (ZC1). ZC1 is relatively richer in data than the smaller zip codes: it contains 731658 payments during a six month period. This is especially interesting, since we are more likely to find actual cases of fraud.

Table 1: Statistical Analysis Data

zipcode  gender  age  payments  avgAmountMonth  avgNumCardsMonth
ZC1      E       U         823           31.97             90.67
ZC1      F       6       12375           44.83           1002.33
ZC1      F       5       39461           35.81           3297.50
ZC1      F       4       72336           33.79           6514.83
ZC1      F       3       94536           31.87           9337.50
ZC1      F       2      128117           29.37          13457.33
ZC1      F       1       41299           30.13           5002.00
ZC1      F       0        1809           28.81            257.00
ZC1      M       6       18030           36.93           1676.33
ZC1      M       5       38097           33.29           3534.83
ZC1      M       4       62314           32.39           5871.67
ZC1      M       3       82222           30.38           8451.83
ZC1      M       2      106404           27.42          10969.33
ZC1      M       1       32031           27.70           3739.67
ZC1      M       0        1516           28.37            213.83
ZC1      U       6         193           17.56             13.83
ZC1      U       4          14           23.95              3.00
ZC1      U       3          54           12.03              4.00
ZC1      U       2          27           23.60              3.40
ZC2      E       U       23349            5.78            482.83
ZC2      F       6       13160           61.97           1373.00
ZC2      F       5       27250           55.58           2766.50
ZC2      F       4       50074           48.90           4508.00
ZC2      F       3       63122           43.59           5746.00
ZC2      F       2       91343           37.89           8026.67
ZC2      F       1       37303           30.17           3152.50
ZC2      F       0        1842           26.89            172.83
ZC2      M       6       11176           80.01           1203.00
ZC2      M       5       18854           74.22           1951.83
ZC2      M       4       29474           67.89           2990.83
ZC2      M       3       45850           53.18           4612.17
ZC2      M       2       63568           41.72           6048.00
ZC2      M       1       21538           32.88           2054.50
ZC2      M       0         977           28.16             92.83
ZC2      U       6          67           74.08              6.33
ZC2      U       5           8          103.15              3.00
ZC2      U       3          10           24.48              4.00

Table 2: Age Categories

idAge  Range
0      <=18
1      19-25
2      26-35
3      36-45
4      46-55
5      56-65
6      >65
U      Unknown

Table 3: Gender Categories

idGender  Description
E         ENTERPRISE
F         FEMALE
M         MALE
U         UNKNOWN

Table 4: Categories ZC1

category              percentage           avg           std
Auto                      0.0049  224.35916667  267.52611111
Bars and restaurants      0.0244   31.03238095   39.19238095
Books and press           0.0014   33.34714286   45.01428571
Fashion                   0.0076   49.73190476   59.08452381
Food                      0.0726   32.49333333   30.87285714
Health                    0.0179   59.39119048  113.98619048
Home                      0.0021   75.48317073  121.77292683
Accommodation             0.0016   97.41071429   86.85047619
Hypermarkets              0.0178   33.06547619   35.78166667
Leisure                   0.0001   74.86357143   22.01107143
Other services            0.009    52.9897619    76.65309524
Sports and toys           0.0043   74.8047619    75.45452381
Technology                0.0031   67.28285714  108.68452381
Transport                 0.8176   24.56047619   20.76928571
Travel                    0.0004  577.46285714  518.41885714
Wellness and beauty       0.0155   44.20809524   55.29142857

5. FRAUD SCENARIOS IN A BANK PAYMENT SYSTEM
In this section we describe three examples of fraud that can be implemented in BankSim. These fraud scenarios are based on selected cases from the Grant Thornton report (Member and Council, 2009). As can be seen in section 6, the different scenarios can be implemented in almost the same way. Furthermore, a fraudster will probably use several different methods of fraud, which means that BankSim needs to be able to model combinations of all fraud scenarios implemented. Although the implementation of these scenarios is out of the scope of this paper, we include a description and explain how to implement them in BankSim.

We will focus on card related frauds. This kind of fraud usually begins when the important data on the card is compromised: account name, credit card number, expiration date and verification code. This data can be acquired by a fraudster either by theft of the physical card or by gaining knowledge of the important data associated with the account.

5.1. Theft
This scenario includes cases where the customer loses physical possession of her card and a fraudster impersonates the customer, purchasing goods or services with the stolen card. In terms of the object model used in BankSim, the Theft scenario can be implemented by the following setting: include in the fraudster the behaviour of sensing customer proximity, then execute the theft and later purchase goods from another merchant with the information from the customer. The volume of fraudulent activity can be modelled by changing the specific parameters for number of thefts, zip code and frequency. A "red flag" for detection in this case could be a high number of unusual transactions with high value in a short period.

5.2. Cloned Card/Skimming
This scenario includes cases where the fraudster creates a clone of the card, letting the user keep the original card but without knowledge of the loss of security. In terms of the object model used in BankSim, the cloned card scenario can be implemented by the following setting: include in the fraudster the behaviour of sensing customer proximity, then execute the acquisition or cloning of a card and later purchase goods from another merchant with the information from the customer. An alternative way to implement this scenario could be when a merchant is compromised in different ways (e.g. by hacking) and allows a fraudster to steal information on a massive scale from all customers that have been served there. The volume of fraudulent activity can be modelled by changing the specific parameters for number of thefts and merchants affected, zip code and frequency of use for purchasing. A "red flag" for detection in this case could be similar to the previous case: a high number of unusual transactions with high value in a short period. Other signs, such as simultaneous payments in different physical locations, or use of the card far from previously known locations, could also be flagged.

5.3. Internet purchases
This scenario includes cases where the fraudster uses a method called Carding to purchase immaterial goods, e.g. music files, redeemable coupons, tickets etc., on the Internet using websites that check the validity of the card instantly. This is to ascertain whether the card data is still valid without having to run the risk of getting caught when using the card while physically present. Similar to cloned cards, the customer keeps the original card but without knowledge of the situation. In terms of the object model used in BankSim, the Internet purchases scenario can be implemented by the following setting: include in the fraudster the behaviour of sensing customer proximity, then execute the acquisition of the important information of a card and later on proceed with the method of Carding, to check for validity. A "red flag" for detection in this case could be to have a black list of Carding websites and to cross this information with current user activity to detect any unusual purchases after the Carding was executed.

6. MODEL AND METHOD
The design of BankSim was based on the ODD model introduced by Grimm et al. (2006). ODD contains 3 main parts: Overview, Design Concepts and Details.

6.1. Overview
6.1.1. Purpose
We aim to produce a simulation that resembles a bank payment system. Our main purpose is to generate a synthetic data set of commercial transactions that can be used for the development and testing of different fraud detection techniques.

If we want to use the real original data for the development of fraud detection methods, it is often difficult to find enough diverse cases of fraud. However this is not the case in a simulated environment, where fraud can be injected following known patterns of fraud and flagged for easy recognition and evaluation of the performance of the detectors.

6.1.2. Entities, state variables and scales
There are three agents in this simulation: Merchant, Customer and Fraudster.

Merchant: This agent serves the customer with one category of merchandise specified by the original data. It offers products or services according to the statistics obtained from the specific zip code and time (week, day of the week and/or hour). Merchants wait for customers to request products and register the payments.

Customer: This agent's main objective is to satisfy a need for one of the 16 categories and purchase goods or services from merchants. Customers possess a payment method, which in this case we generalise as a credit card.

Fraudster: The behaviour is determined by the goal of defrauding the customers and/or merchants. The specific behaviour can be extended to fulfil different patterns and can mutate depending on the specific fraud behaviour we are interested in studying. Some of the known fraud behaviour is presented in section 5.

6.1.3. Process overview and scheduling
During a normal step of the simulation, a customer that enters the simulation can decide to purchase an item or service from one of the offered categories. Once the category has been selected, the customer senses nearby merchants that offer that category and listens to the offers from the merchant. If accepted (with a certain probability of rejection) the transaction takes place and the merchant registers the payment.

The time granularity of the simulation is that each step represents a day of commercial activity, but the original data is so rich that this can be modified to the specific hour of the day. So a normal week has 7 steps and a month will consist of around 30 steps. Notice that in the future we can choose to make the distinction between specific days of the week explicit, since the information from Bank Inc. is good enough to obtain statistics from it. But for now, we are not taking the specific day of the week into account to feed the consumption pattern and we treat all days the same.
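The step described in section 6.1.3 can be sketched in a few lines of Python. This is only an illustrative sketch and not BankSim's Java/MASON implementation: the class names, the sensing radius, the rejection probability, the uniform category choice and the pricing rule are all assumptions made for the example.

```python
import math
import random
from dataclasses import dataclass, field


@dataclass
class Merchant:
    merchant_id: str
    category: str
    x: float
    y: float
    payments: list = field(default_factory=list)  # registered payments

    def offer(self, rng):
        # Hypothetical pricing rule: a positive amount around a fixed mean.
        return round(abs(rng.gauss(30.0, 10.0)) + 0.01, 2)


@dataclass
class Customer:
    customer_id: str
    x: float
    y: float

    def step(self, step_no, merchants, categories, rng,
             radius=50.0, p_reject=0.1):
        """One simulation step (one day); returns the payment or None."""
        category = rng.choice(categories)            # decide what to buy
        nearby = [m for m in merchants               # sense nearby merchants
                  if m.category == category
                  and math.hypot(m.x - self.x, m.y - self.y) <= radius]
        if not nearby:
            return None                              # nobody offers it here
        merchant = rng.choice(nearby)
        amount = merchant.offer(rng)                 # listen to the offer
        if rng.random() < p_reject:
            return None                              # offer rejected
        payment = (step_no, self.customer_id, merchant.merchant_id,
                   category, amount)
        merchant.payments.append(payment)            # merchant registers it
        return payment


def run(steps=7, seed=1):
    """Run a toy world for `steps` days (7 steps = one week)."""
    rng = random.Random(seed)
    categories = ["Food", "Transport"]
    merchants = [Merchant("M1", "Food", 10, 10),
                 Merchant("M2", "Transport", 20, 5)]
    customers = [Customer(f"C{i}", rng.uniform(0, 30), rng.uniform(0, 30))
                 for i in range(5)]
    log = []
    for step_no in range(steps):
        for customer in customers:
            payment = customer.step(step_no, merchants, categories, rng)
            if payment is not None:
                log.append(payment)
    return log, merchants
```

Calling `run()` returns the list of payments together with the merchants that registered them; every day is treated the same, matching the current granularity of the model.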
6.2. Design Concepts
The basic principle of this model is the concept of a commercial transaction. We can observe an emergent social network from the relation between the customers and the merchants. Each of the customers has the objective of purchasing articles from the merchants. The merchants' objective is to serve the customers and commit the payments, which results in the generation of a synthetic data set. In our virtual environment the interaction between agents is always between merchant and customer. Purchasing articles from another customer or selling articles to another merchant is not included in our model.
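To make the idea of an emergent customer-merchant network concrete, the following Python sketch aggregates payments into weighted edges of a bipartite graph. It is an illustration under assumptions, not part of BankSim: the tuple layout (customer, merchant, amount), the function names and the example payments are all invented.

```python
from collections import defaultdict


def build_network(payments):
    """Aggregate (customer, merchant, amount) payments into weighted edges."""
    edges = defaultdict(lambda: {"count": 0, "total": 0.0})
    for customer, merchant, amount in payments:
        edge = edges[(customer, merchant)]
        edge["count"] += 1
        edge["total"] += amount
    return dict(edges)


def merchant_degree(edges):
    """Number of distinct customers connected to each merchant."""
    degree = defaultdict(int)
    for _customer, merchant in edges:
        degree[merchant] += 1
    return dict(degree)


# Invented example payments, not taken from the BankSim data.
payments = [
    ("C1", "M1", 12.50),
    ("C1", "M1", 7.30),
    ("C2", "M1", 40.00),
    ("C2", "M2", 3.20),
]
edges = build_network(payments)
degree = merchant_degree(edges)
```

Because customers only ever trade with merchants, every edge joins one customer to one merchant, which is why a bipartite representation is sufficient here.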
Customers can scout for merchants in any radial direction from their current position in the virtual world and search for a merchant that matches their category selection. If no merchant is found then the transaction cannot take place, and the step for this customer ends.

The agents do not perform any specific learning activities. Their behaviour is given by probabilistic Markov models where the probabilities are extracted from the real data set.

6.3. Details
6.3.1. Initialization
The simulation starts with a number of merchants that match the categories of what a specific zip code offers, an initial number of customers and fraudsters.

6.3.2. Input Data
BankSim has different inputs needed in order to run a simulation. The input data concerns the probability distributions for each of the merchants, and the consumption pattern of the customers specified by gender and age. The items that can be purchased are all grouped into a category using the statistic measures for the payments.

For setting the parameters, we use a parameter file that is loaded as the simulation starts. It contains the zip codes that we want to simulate and the malicious parameters. Some parameters can also be set manually in the GUI. The zip codes are queried against the API of the bank and we retrieve information corresponding to the customers: quantity, age and gender distribution. We also query the merchants and obtain sales distributions for each of the merchant categories.

6.3.3. Submodels
Figure 1 shows the different use cases of the agents including the misuse cases for the fraudsters. This model represents the different actions that an agent can take inside the system.

Figure 1: BankSim Use Case Diagram including misuse cases

Find Merchant: The first step in a simulation for a customer is to find a merchant. Each agent decides which category of service they want to find, so the next step is to sense the environment and find a merchant that provides the category selected. The search by the customer starts here, i.e. the customers move from merchant to merchant.

Offer service or product: This is performed by the merchant. Once a merchant is approached by a customer, it offers a product or service according to the demand specified in the parameters for each category.

Buy/Sell: Once a customer finds a merchant and after the merchant offers a product, a transaction takes place and it stores the required information for the generation of the synthetic data of transactions.

Steal card or info: Fraudsters move around the environment of the simulation and find customers in order to steal the physical card or just the important information of the customer's credit card. This information is stored for later use. In this misuse case we aim to emulate the behaviour of a criminal cloning a card or just stealing the card.

Abuse purchasing: This misuse case is performed by fraudsters. They make purchases of goods or services at physical merchants or at internet merchants, which hides their physical presence.

Report/Block Card: This use case is performed by customers. When they realise that abusive behaviour has been committed on their accounts, they report the case to the bank and block the card to prevent further abuse.

Log of transactions: Each time an item or service is purchased from a merchant a transaction is created. A log contains the information about the customer, merchant, amount, location, date and fraud if any.

7. RESULTS
BankSim uses the Multi-Agent Based Simulation toolkit MASON, which is implemented in Java (Luke, 2005). MASON offers several tools that aid the development of a MABS. We justify our choice mainly by its multi-platform support, parallelisation and good execution speed in comparison with other agent frameworks, which is especially important for computationally intensive simulations such as BankSim (Railsback et al., 2006). BankSim can be run with a GUI that helps the user see the states and balance of the customers (purple dots) and more easily identify the merchants (green circles) and fraudsters (red dots), as can be seen in the example in figure 2.

Table 5: Simulated ZC1

gender  age  payments  avgAmount
E       U        1171      34.02
F       6       13795      32.13
F       5       33574      31.47
F       4       57835      31.74
F       3       77333      31.97
F       2      103112      32.14
F       1       32340      32.09
F       0        1818      34.75
M       6       12718      31.57
M       5       28382      31.35
M       4       49780      31.99
M       3       67870      31.83
M       2       81690      31.48
M       1       24924      31.84
M       0         586      33.36
U       3         173      32.28
U       2         164      28.83
U       1         178      33.23
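As a worked check on Table 5, the group rows can be combined into an overall payment count and a payment-weighted mean amount. The Python sketch below only re-uses the figures printed in Table 5; the function names are ours, and because the per-group averages are rounded to two decimals the weighted mean is approximate.

```python
# Rows of Table 5: (gender, age, payments, avgAmount)
TABLE_5 = [
    ("E", "U", 1171, 34.02),
    ("F", "6", 13795, 32.13), ("F", "5", 33574, 31.47),
    ("F", "4", 57835, 31.74), ("F", "3", 77333, 31.97),
    ("F", "2", 103112, 32.14), ("F", "1", 32340, 32.09),
    ("F", "0", 1818, 34.75),
    ("M", "6", 12718, 31.57), ("M", "5", 28382, 31.35),
    ("M", "4", 49780, 31.99), ("M", "3", 67870, 31.83),
    ("M", "2", 81690, 31.48), ("M", "1", 24924, 31.84),
    ("M", "0", 586, 33.36),
    ("U", "3", 173, 32.28), ("U", "2", 164, 28.83), ("U", "1", 178, 33.23),
]


def total_payments(rows):
    """Sum of the payments column over all groups."""
    return sum(payments for _gender, _age, payments, _avg in rows)


def weighted_mean_amount(rows):
    """Mean amount per payment, weighting each group by its payments."""
    total_amount = sum(payments * avg for _gender, _age, payments, avg in rows)
    return total_amount / total_payments(rows)


n_payments = total_payments(TABLE_5)
mean_amount = weighted_mean_amount(TABLE_5)
```

The count sums to 587443 payments, and the weighted mean comes out at roughly 31.8 euro per payment.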

Figure 2: Screenshot of BankSim during a step

Table 6: Categories Simulated ZC1

category              payments    perc  avgAmount     std
Accommodation             1196  0.002      106.55   69.34
Bars and restaurants      6253  0.0105      41.15   29.55
Books and press            885  0.0015      44.55   33.14
Fashion                   6338  0.0107      62.35   44.36
Food                     26254  0.0442      37.07   25.00
Health                   14437  0.0243     103.74   76.87
Home                      1684  0.0028     113.34   83.23
Hypermarkets              5818  0.0098      40.04   27.96
Leisure                     25  0           73.23   20.91
Other services             684  0.0012      75.69   54.59
Sports and toys           2020  0.0034      88.50   63.13
Technology                2212  0.0037      99.92   73.49
Transport               505119  0.8494      26.96   17.53
Travel                     150  0.0003     669.03  494.90
Wellness and beauty      14368  0.0242      57.32   41.48

The output of BankSim is a CSV file that contains the fields: Step, CustomerId, Age, Gender, zipCodeOrigin, merchant, zipMerchant, category, amount and a special field to flag fraudsters called fraud.
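A minimal Python sketch of how a file with these fields might be consumed is given below. The header re-uses the field names listed for the CSV output, but everything else is an assumption for illustration: the sample rows and identifiers are invented, the delimiter is assumed to be a comma, and the fraud field is assumed to hold 0 or 1.

```python
import csv
import io
from collections import defaultdict

# Invented sample in the described layout; these values are not BankSim output.
SAMPLE = """Step,CustomerId,Age,Gender,zipCodeOrigin,merchant,zipMerchant,category,amount,fraud
0,C1,2,F,ZC1,M10,ZC1,Transport,26.90,0
0,C2,4,M,ZC1,M11,ZC1,Food,35.10,0
1,C1,2,F,ZC1,M12,ZC1,Travel,640.00,1
"""


def summarise_by_fraud(csv_file):
    """Return {fraud_flag: (number of payments, total amount)}."""
    counts = defaultdict(int)
    totals = defaultdict(float)
    for row in csv.DictReader(csv_file):
        flag = int(row["fraud"])          # assumed encoding: 0 normal, 1 fraud
        counts[flag] += 1
        totals[flag] += float(row["amount"])
    return {flag: (counts[flag], round(totals[flag], 2)) for flag in counts}


summary = summarise_by_fraud(io.StringIO(SAMPLE))
```

Passing an open file handle instead of the `io.StringIO` wrapper would apply the same summary to a real log file.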

7.1. Simulated scenarios
We aimed to perform a simulation that would produce a data set comparable to our sample data set, covering payments over 6 months to match the original data. The simulation was loaded with information from ZC1 (see table 1), which was selected due to the highest amount of payments.

We ran BankSim for 180 steps (approx. six months) several times and calibrated the parameters in order to obtain a distribution that gets close enough to be reliable for testing. We collected several log files and selected the most accurate. We injected thieves that aim to steal an average of three cards per step and perform about two fraudulent transactions per day. We produced 594643 records in total, of which 587443 are normal payments and 7200 are fraudulent transactions. Since this is a randomised simulation the values are of course not identical to the original data.

The result of the simulation for normal transactions is summarised in tables 5, 6 and 7. Remember that the codes for age categories are given in table 2 and gender codes are given in table 3. All prices given are in euro.

Table 7: Fraud Simulated ZC1

fraud  payments  % of payments  total amount  % of amount     avg     std
0        587443          98.78   18708432.56        83.03   31.84   31.47
1          7200           1.21    3822671.17        16.96  530.92  835.52

7.2. Evaluation of the model
We begin the evaluation with the verification and validation of the generated simulation data (Ormerod and Rosewell, 2009). The verification ensures that the simulation corresponds to the described model presented by the chosen scenarios. We described BankSim in section 6. In our model, we have included several characteristics from a real payment system, and successfully generated a distribution of payments that involved the interaction of merchants and customers.

The validation of the model answers the question: Is the model a realistic model of the real problem we are addressing? After several runs of the simulation to calibrate it, we are able to answer that question affirmatively. We summarise the generated data in tables 5, 6 and 7.

Table 5 can be compared with table 1: both tables show the distribution of payments by gender and age. Similar values are found in both tables because we created the agents based on the gender and age distribution of the zip code. However, we did not programme the consumption behaviour of agents based on gender and age. This is because we did not have the standard deviation for the consumption patterns per age and gender; we only have the average. This affects the results, even though the overall results are similar. We think the missing information from the real system can be found with further calibration, which is at the moment beyond the scope of our work. Figure 3 shows a distribution of gender and age from our simulated data.

barsandrestaurants

wellnessandbeauty
sportsandtoys
transportation

otherservices

hotelservices

contents
fashion

leisure
health

hyper

home

travel
food

tech
Categories

Figure 4: BoxPlot of a BankSim simulation

Boxplot Payments and Categories without Travel


0

2000

1500

1000

500
Amount

2000

1500

1000
Figure 3: ScatterPlot Payments vs Age/Gender 500

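The percentages and averages reported in table 7 can be re-derived from the payment counts and totals alone. The following is a minimal sketch of that arithmetic, not part of BankSim itself; the names `rows` and `summarise` are hypothetical and chosen only for illustration, and the input figures are copied from table 7.

```python
# Payment counts and total amounts (in euro) copied from table 7.
# The dictionary layout and function name are illustrative assumptions.
rows = {
    "normal": {"payments": 587443, "total": 18708432.56},
    "fraud": {"payments": 7200, "total": 3822671.17},
}


def summarise(rows):
    """Return share of payments, share of amount and average per class."""
    n_all = sum(r["payments"] for r in rows.values())
    t_all = sum(r["total"] for r in rows.values())
    out = {}
    for label, r in rows.items():
        out[label] = {
            "share_of_payments": 100 * r["payments"] / n_all,
            "share_of_amount": 100 * r["total"] / t_all,
            "average": r["total"] / r["payments"],
        }
    return out


total_records = sum(r["payments"] for r in rows.values())
summary = summarise(rows)

print(f"records in total: {total_records}")
for label, s in summary.items():
    print(
        f"{label}: {s['share_of_payments']:.2f}% of payments, "
        f"{s['share_of_amount']:.2f}% of amount, "
        f"avg {s['average']:.2f} euro"
    )
```

The recomputed values agree with table 7 to within about 0.01, which is consistent with the table entries having been truncated rather than rounded. The sketch also makes the imbalance explicit: roughly 1.2% of the payments account for nearly 17% of the total amount.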
Table 6 is comparable to table 4. We succeeded in generating a distribution of categories that resembles the real data. We matched the percentage of categories and simulated an average and standard deviation similar to the ones present in the original data. One thing to notice is that the category auto did not get any transaction during the simulation. This could be because the location of the merchant in the environment is random and was perhaps far enough away to be hidden from customers that wanted to purchase from this category. A box plot of the simulated categories is shown in figure 4. Since the values of travel are bigger than those of the other categories, we decided to draw the box plot omitting this category in figure 5 to improve the visualization of the simulated data.

Figure 5: BoxPlot of a BankSim simulation without category Travel (plot title: Boxplot Payments and Categories without Travel; x-axis: Categories; y-axis: Amount)

The simulated fraud behaviour is presented in table 7. The total amount stolen was around 3.8 million euros, which corresponds to a rather high crime rate of nearly 17% of the total amount of payments. We programmed an aggressive behaviour where few transactions (only 7200, or 1.2% of the total) could defraud 17% of the payments with an average of 530 euros per fraud. For the purpose of fraud detection there is a benefit from the occurrence of enough cases of fraud that can help the investigators to gather the evidence needed to prosecute the criminals. In our case we benefit from the abundance of fraud cases because many detection methods need enough data to better train a classifier that can detect the fraud behaviour.

So in summary, our agent model with its programmed micro behaviour produces the same type of overall interaction that we can observe in the original data, and furthermore, this interaction gives rise to the same macro behaviour for the whole zip code as in a real situation.

Since we are running a simulation, we argue that the differences are not significant for our purpose, which is to use this distribution to simulate the normal behaviour of payments, and simultaneously combine this with injected anomalies and known patterns of fraud.

8. CONCLUSIONS
BankSim is a simulation of bank payments with the objective to generate a synthetic transactional data set that can be used for research into fraud detection. The data sets generated with BankSim can aid academia, financial organisations and governmental agencies to test their fraud detection methods or to compare the performance of different methods under similar conditions using a common, publicly available and standard synthetic data set for the test.

In section 3 we formulated our research question: How could we model and simulate a bank payment system and generate a realistic and reliable synthetic data set for the purpose of fraud detection?

In section 6 we presented the model for BankSim, which is based on the ODD methodology. In order to better support our claim and answer our research question we analysed the type of data needed to generate and output as a CSV file (see section 7) and we evaluated and verified our model in section 7.2.

It is important to know how much information from the real data set is contained in the generated synthetic data. First, we do not have access to any specific record of who is making a purchase, nor of the merchant involved in the transaction. We based our simulation purely on the aggregated statistical measures present in the original data, which give us an approximate description of how the individual agents behave. This means that Bank Inc. can be sure that the privacy of the customers is preserved when using BankSim.

We argue that BankSim is ready to be used as a generator of synthetic data sets of financial payment activity. Data sets generated by BankSim can be used to implement fraud detection scenarios and malicious behaviour scenarios such as stolen or cloned credit cards or unusual simultaneous purchase activity in different physical locations. We will make a stable release of BankSim available to the research community together with standard data sets developed for this article and further research.

For the future we plan several improvements of and additions to the current model. BankSim can be calibrated to improve the results presented in section 7 and to increase the granularity and the coverage of zip codes, which would enrich the synthetic data set and make it even more valuable as a realistic data set for fraud detection.

In order to generate records with malicious behaviour we plan to extend BankSim to also generate malicious activity that can come from the merchants, customers, different fraudsters or combinations of these.

Among the additions we consider are increasing the step granularity and adding more zip codes to the simulation simultaneously. We intend to make BankSim a complete bank system by adding other bank transactions such as deposits, withdrawals and transfers besides the current payments. Unfortunately, for this addition there is a lack of real data that we can use, but hopefully in the future we will find financial institutions interested in our project that are willing to share this data.

REFERENCES
Naoki Abe, Bianca Zadrozny, and John Langford. Outlier detection by active learning. Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD 06, page 504, 2006. doi: 10.1145/1150402.1150459.
SJ Alam and Armando Geller. Networks in agent-based social simulation. Agent-based models of geographical systems, pages 77-79, 2012.
R.J. Bolton and D.J. Hand. Statistical fraud detection: A review. Statistical Science, 17(3):235-249, 2002.
Volker Grimm, Uta Berger, Finn Bastiansen, Sigrunn Eliassen, Vincent Ginot, Jarl Giske, John Goss-Custard, Tamara Grand, Simone K. Heinz, Geir Huse, Andreas Huth, Jane U. Jepsen, Christian Jørgensen, Wolf M. Mooij, Birgit Müller, Guy Pe’er, Cyril Piou, Steven F. Railsback, Andrew M. Robbins, Martha M. Robbins, Eva Rossmanith, Nadja Rüger, Espen Strand, Sami Souissi, Richard A. Stillman, Rune Vabø, Ute Visser, and Donald L. DeAngelis. A standard protocol for describing individual-based and agent-based models. Ecological Modelling, 198(1-2):115-126, September 2006. ISSN 03043800. doi: 10.1016/j.ecolmodel.2006.04.023.
P.J. Lin, B. Samadi, and Alan Cipolone. Development of a synthetic data set generator for building and testing information discovery systems. In ITNG 2006, pages 707-712. IEEE, 2006. ISBN 0769524974.
Edgar Alonso Lopez-Rojas and Stefan Axelsson. Money Laundering Detection using Synthetic Data. The 27th workshop of Swedish Artificial Intelligence Society (SAIS), pages 33-40, 2012a.
Edgar Alonso Lopez-Rojas and Stefan Axelsson. Multi Agent Based Simulation (MABS) of Financial Transactions for Anti Money Laundering (AML). The 17th Nordic Conference on Secure IT Systems, pages 25-32, 2012b.
Edgar Alonso Lopez-Rojas, Stefan Axelsson, and Dan Gorton. RetSim: A Shoe Store Agent-Based Simulation for Fraud Detection. The 25th European Modeling and Simulation Symposium, 2013.
S. Luke. MASON: A Multiagent Simulation Environment. Simulation, 81(7):517-527, July 2005. ISSN 0037-5497. doi: 10.1177/0037549705058073.
Dan Magnusson. The costs of implementing the anti-money laundering regulations in Sweden. Journal of Money Laundering Control, 12(2):101-112, 2009. ISSN 1368-5201. doi: 10.1108/13685200910951884.
Associate Member and Advisory Council. Reviving retail Strategies for growth in 2009 Executive summary, 2009. URL http://www.grantthornton.com/staticfiles/GTCom/files/Industries/Consumer&industrialproducts/Whitepapers/Revivingretail_Strategiesforgrowthin2009.pdf.
Paul Ormerod and Bridget Rosewell. Validation and Verification of Agent-Based Models in the Social Sciences. In Flaminio Squazzoni, editor, LNCS, pages 130-140. Springer Berlin / Heidelberg, 2009. ISBN 978-3-642-01108-5.
Clifton Phua, Vincent Lee, Kate Smith, and Ross Gayler. A comprehensive survey of data mining-based fraud detection research. Arxiv preprint arXiv:1009.6119, 2010.
S. F. Railsback, S. L. Lytinen, and S. K. Jackson. Agent-based Simulation Platforms: Review and Development Recommendations. Simulation, 82(9):609-623, September 2006. ISSN 0037-5497. doi: 10.1177/0037549706073695.

AUTHORS BIOGRAPHY
MSc. Edgar A. Lopez-Rojas
Edgar Lopez is a PhD student in Computer Science and his research area is Multi-Agent Based Simulation and Machine Learning techniques with applied Visualization for fraud detection and Anti Money Laundering (AML) in the domains of retail stores, payment systems and financial transactions. He obtained a Bachelor's degree in Computer Science from EAFIT University in Colombia (2004). After that he worked for 5 more years at EAFIT University as a systems analyst and developer and part-time as a lecturer. He obtained a Master's degree in Computer Science from Linköping University in Sweden in 2011 and a licentiate degree in computer science (a degree halfway between a Master's degree and a PhD) in 2014.

Dr. Stefan Axelsson
Stefan Axelsson is a senior lecturer at Blekinge Institute of Technology. He received his M.Sc. in computer science and engineering in 1993, and his Ph.D. in computer science in 2005, both from Chalmers University of Technology, in Gothenburg, Sweden. His research interests revolve around computer security, especially the detection of anomalous behaviour in computer networks, financial transactions and ship/cargo movements, to name a few. He is also interested in how to combine the application of machine learning and information visualization to better aid the operator in understanding how the system classifies a certain behaviour as anomalous. Stefan has ten years of industry experience, most of it working with systems security issues at Ericsson.
