
White Paper:

Multi-Model Identifies Fraud At Scale

By Arthur Keen (Senior Solution Architect, ArangoDB)

May 2020
Table of Contents
The Significance of Fraud and Graphs
Why Multi-Model for Fraud Detection?
Converting from Relational Source to Multi-model Graph
Fraud Questions
Detect Fraud Rings From a Suspicious Account
Detect All Fraud Rings
Find Orphan Accounts
Find Most Influential Customers and Accounts
What are the top 3 most influential accounts?
Finding Money Laundering Patterns
Detecting Fraud At Scale
Conclusion
Hands-on with Fraud Detection & Anti Money Laundering
Appendix A: Queries

The Significance of Fraud and Graphs
Fraud is an enormous and ever-growing problem impacting all industries and
government services. Global fraud results in over $3.7 trillion in losses annually.
Businesses lose on average 5% of their income to fraud every year. In 2018,
businesses incurred $3.13 in remediation costs for each dollar of fraud [1], dealing
with chargebacks, fees, interest, and labor.

Traditional fraud detection views data through a straw, focusing on discrete data
points including specific accounts, individuals, devices or IP addresses. However,
today’s sophisticated fraudsters escape detection by forming fraud rings
composed of stolen and synthetic identities and circuitous back channels.

To uncover fraud rings, it is essential to look beyond individual data points in
individual data sources to a broader view of the connection patterns that exist
across multiple data modalities¹. Multiple disparate data sources store
individual activities and relationships that need to be analyzed in concert to
detect complex fraudulent behavior.

ArangoDB’s native multi-model is ideal for tackling this challenge, because it
supports graphs, documents, key-value stores, and relational models. This provides
streamlined, flexible, and agile harmonization of the relevant multi-modal² user
activity data, provides the performance and scale to detect the complex fraud
patterns, and serves results in the different data models needed by
stakeholders.

¹ Multimodal data. Our experience of the world is multimodal: we see objects, hear sounds, feel
the texture, smell odors, and taste flavors. Modality refers to the way in which something
happens or is experienced, and a research problem is characterized as multimodal when it
includes multiple such modalities.
² This juxtaposition of multi-model and multi-modal is deliberate; they are orthogonal terms.

Figure 1: Identify fraud patterns in the network of transactions and relationships.

Why Multi-Model for Fraud Detection?


ArangoDB’s multi-model graph allows you to easily fuse together disparate data
and identify complex fraudulent patterns of connections, such as fraud rings,
using the ArangoDB Query Language (AQL).

The identification of fraud ring patterns requires very deep (multi-hop) traversals
across the graph. The query for detecting a fraud ring can be written in
six lines of (easy to write and maintain) AQL code, and ArangoDB can execute
these queries with sub-second response times.

With multi-model, you do not have to convert the entire dataset to a graph to do this.
Use graphs where needed for analytics. Multi-model graphs allow you to
combine documents, joins, and graphs to solve this problem.

Converting from Relational Source to Multi-model Graph

The source of data for fraud detection would likely be a relational database,
for example, the schema depicted in Figure 3, which describes the foreign
key relationships among the Bank, Branch, Customer, Account, and
Transaction Tables.

Figure 3: Relational Source Schema

How do we convert this to a graph in ArangoDB? Because ArangoDB is a
multi-model database, the tables can be ingested as-is, directly into
ArangoDB as collections, so the Bank table becomes the Bank collection and
so on. Then you can choose whether to convert all or part of it to a graph
model based on the requirements for fraud analytics.

Since we need to do deep link/traversal analytics on the account transactions
and the customers, it makes sense to add graph edges in this area of the
graph. In this transformation it makes sense to use the Transaction
collection as an edge and to materialize the CustomerAccount foreign key as
the AccountHolder edge.

Figure 4: Multi-Model Schema: Documents, Joins, Graph

We use the convention of converting foreign key relationships to edges that
are directed from the dependent to the independent entity. Resolution
entities (also known as join tables) in the relational model can be used as
edges in a graph model, as we have done with Transaction.
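
The conversion convention above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the import procedure used for the paper's data set: the row layouts and the column names (account_id, customer_id, sender, receiver) are hypothetical, and only the _from/_to edge attributes are taken from the queries in Appendix A.

```python
# Hypothetical relational rows; the column names are assumptions for illustration.
accounts = [
    {"account_id": "10000032", "customer_id": "C1"},
    {"account_id": "10000033", "customer_id": "C2"},
]
transfers = [
    {"tx_id": "T1", "sender": "10000032", "receiver": "10000033", "amount": 250.0},
]


def account_holder_edges(account_rows):
    """Materialize the account-to-customer foreign key as edge documents.

    Each edge points from the dependent entity (the account, which carries
    the foreign key) to the independent entity (the customer).
    """
    return [
        {
            "_from": f"account/{row['account_id']}",
            "_to": f"customer/{row['customer_id']}",
        }
        for row in account_rows
    ]


def transaction_edges(transfer_rows):
    """Turn each join-table row into an edge between two accounts,
    keeping the remaining columns as edge attributes."""
    edges = []
    for row in transfer_rows:
        edge = {k: v for k, v in row.items() if k not in ("sender", "receiver")}
        edge["_from"] = f"account/{row['sender']}"
        edge["_to"] = f"account/{row['receiver']}"
        edges.append(edge)
    return edges


holder_edges = account_holder_edges(accounts)
tx_edges = transaction_edges(transfers)
```

The vertex rows themselves stay untouched as documents; only the relationships that need deep traversal are turned into edges.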

Fraud Questions
We will describe how to use ArangoDB to answer various questions:

● Are there potential fraud rings connected to a suspicious account?
● Are there any potential fraud rings in my data?
● Are there orphan accounts (those not transacting)?
● Who are the most influential customers/accounts in transactions?
● Are there any money laundering patterns?

The following sections show how these questions can be answered in
ArangoDB on synthetically generated transaction data. The queries are
examples for detecting the patterns on this synthetic data set, meant to
inspire practitioners to develop real-world fraud detection capabilities on
ArangoDB with real data.

Detect Fraud Rings From a Suspicious Account

Fraud rings consist of very long loops of transactions and relationships
among individuals that are used by fraudsters to evade detection. These
long loops are also used in sophisticated cyber crime, where the perpetrators
create long paths of logins across multiple systems to avoid detection. The
reason these long paths are difficult to detect and understand is that they
require deep multi-hop traversals into the graph of transactions and
relationships among the individuals collaborating in the fraud.

In conventional systems, these multi-hop queries require a high number of
joins, which can take a substantial amount of time and consume a large
amount of computing resources. ArangoDB’s graph model supports high
performance multi-hop queries, where for example 10-hop queries on large
datasets can take less than 10 milliseconds, depending on the topology of the
graph. For this example, the query finds long loops of transactions starting
from a suspicious account and looping back to the suspicious account over 5
to 10 transaction hops.

Figure 5 depicts the fraud ring detection query written in the ArangoDB
Query Language (AQL) being developed and executed in the ArangoDB
administrative panel. Note that this sophisticated query is expressed in 6
lines of AQL code and that the compact representation is easily
understandable and maintainable. The query results are displayed as a
circuit in the graph visualization and are also available in JSON, so they can be
processed by applications calling this query. Note also that the query is
parameterized by the suspicious account and the number of loops to return.

Figure 5: Finding fraud ring(s) from a suspicious account
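
To make the traversal logic concrete, here is a minimal sketch in plain Python of what the query does conceptually: walk outbound transactions from one account, stop expanding a path as soon as it returns to the start, and keep only loops of 5 to 10 hops. It is not how ArangoDB executes the traversal internally. The function name find_rings, the edge-list format, and the choice to use each edge at most once per path are assumptions made for this illustration.

```python
from collections import defaultdict


def find_rings(edges, start, min_hops=5, max_hops=10):
    """Return every path of min_hops..max_hops edges that leaves `start`
    and returns to it on its final hop. Each edge is used at most once
    per path."""
    outgoing = defaultdict(list)
    for index, (src, dst) in enumerate(edges):
        outgoing[src].append((index, dst))

    rings = []

    def walk(node, path, used):
        for index, nxt in outgoing[node]:
            if index in used:
                continue
            new_path = path + [nxt]
            hops = len(new_path) - 1
            if nxt == start:
                # Back at the start: report the loop if it is long enough,
                # and never expand past it (the role PRUNE plays in AQL).
                if hops >= min_hops:
                    rings.append(new_path)
                continue
            if hops < max_hops:
                walk(nxt, new_path, used | {index})

    walk(start, [start], frozenset())
    return rings


# Hypothetical data: one 6-hop ring through "a" and one short 2-hop loop.
example_edges = [
    ("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f"), ("f", "a"),
    ("a", "x"), ("x", "a"),
]
rings = find_rings(example_edges, "a")
```

In this example only the 6-hop ring is reported; the 2-hop loop through "x" is too short to count as a long loop.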

Detect All Fraud Rings

In the previous example, we detected fraud rings connected to a suspicious
account. What if we did not have a list of suspicious accounts to analyze yet
and wanted to analyze our graph to detect all of the fraud ring patterns in it?

This is easily accomplished in AQL by adding an outer loop to the fraud ring
detector for suspicious accounts. This sophisticated query is written in only 6
lines of AQL!
The query for finding all fraud loops is depicted in Figure 6.

Figure 6: Find all fraud rings
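
One practical consequence of the outer loop is that the same ring is found once for every account on it. The sketch below, in plain Python, shows one possible way to collapse those duplicates by rotating each ring to a canonical starting account. This post-processing step is an assumption of the sketch and is not part of the paper's query; the helper names are hypothetical, and the search here only reports loops that do not revisit an account.

```python
from collections import defaultdict


def rings_from(outgoing, start, min_hops, max_hops):
    """Depth-first search for loops of min_hops..max_hops hops through
    `start` that do not revisit any other account."""
    stack = [(start, (start,))]
    while stack:
        node, path = stack.pop()
        hops = len(path)  # hops used if the next edge closes the loop
        for nxt in outgoing[node]:
            if nxt == start:
                if hops >= min_hops:
                    yield path
            elif nxt not in path and hops < max_hops:
                stack.append((nxt, path + (nxt,)))


def canonical(ring):
    """Rotate a ring so it starts at its smallest account id, so the same
    loop found from different starting accounts compares equal."""
    pivot = ring.index(min(ring))
    return ring[pivot:] + ring[:pivot]


def all_rings(edges, min_hops=5, max_hops=10):
    outgoing = defaultdict(list)
    for src, dst in edges:
        outgoing[src].append(dst)
    unique = set()
    # The outer loop: try every account as a starting point.
    for account in list(outgoing):
        for ring in rings_from(outgoing, account, min_hops, max_hops):
            unique.add(canonical(ring))
    return sorted(unique)


# Hypothetical data: one 5-hop ring and one short 2-hop loop.
example_edges = [
    ("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "a"),
    ("x", "y"), ("y", "x"),
]
found = all_rings(example_edges)
```

The 5-hop ring is discovered five times, once per member account, but appears only once in the result.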

Find Orphan Accounts

There are many patterns for finding suspicious accounts that may require
further investigation. Most of these patterns are essentially finding
anomalous behavior to flag accounts.

One pattern is the orphan account, where an account is set up to participate
in very specific fraud transaction patterns, but otherwise does not interact in
a ‘normal’ way with other accounts and may be used very infrequently.

Figure 7 depicts a query that finds orphan accounts and reports the
accounts and their owners.

Figure 7: Find Suspicious “Orphan” Accounts
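
The core of the orphan check is a set difference: collect every account that appears as the sender or receiver of any transaction, then report the accounts that are not in that set. A minimal plain-Python sketch of that idea follows; the function name and the sample identifiers are hypothetical, while the _from/_to attribute names follow the queries in Appendix A.

```python
def orphan_accounts(account_ids, transactions):
    """Return the accounts that never appear on either side of a transaction."""
    used = {t["_from"] for t in transactions} | {t["_to"] for t in transactions}
    return sorted(a for a in account_ids if a not in used)


# Hypothetical sample data.
account_ids = ["account/1", "account/2", "account/3", "account/4"]
transactions = [
    {"_from": "account/1", "_to": "account/2"},
    {"_from": "account/2", "_to": "account/1"},
]
orphans = orphan_accounts(account_ids, transactions)
```

A production check would likely also flag accounts with very few transactions, since the pattern covers infrequent use as well as no use.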

Find Most Influential Customers and Accounts

We can also use standard graph algorithms like PageRank to find deeply
coordinated activity, by looking for the most influential customers and
accounts.

The PageRank algorithm scores how important or influential a vertex is
relative to the rest of the network. This is accomplished in ArangoDB by
executing ArangoDB’s PageRank algorithm on the graph via the Pregel
interface and then visualizing the results.
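
For readers unfamiliar with PageRank, the following plain-Python sketch shows the textbook power-iteration form of the algorithm on a tiny hypothetical graph. It is meant only to illustrate what the score measures; ArangoDB's Pregel implementation is distributed and may differ in details such as convergence criteria. The damping factor of 0.85 and the fixed iteration count are conventional choices assumed here.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    count = len(nodes)
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {n: 1.0 / count for n in nodes}
    for _ in range(iterations):
        # Rank held by vertices without outgoing edges is spread evenly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1.0 - damping) / count + damping * dangling / count
        new_rank = {n: base for n in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Hypothetical graph: three accounts all send to "hub", which sends to "a".
ranks = pagerank([("a", "hub"), ("b", "hub"), ("c", "hub"), ("hub", "a")])
most_influential = max(ranks, key=ranks.get)
```

Here "hub" receives the highest score because every other vertex points to it, and "a" comes second because it is the only vertex "hub" points to.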

Figure 8 depicts a visualization of several clusters of
customer/account/transaction activity, where the size of each vertex is scaled
in proportion to the PageRank computed for that vertex. This visualization
provides visual cues to the relative dominance of customers and accounts in
the network.

Figure 8: Find most influential accounts and customers

What are the top 3 most influential accounts?

Top 3 or top 10 queries are often used to focus attention. In this example,
we use an AQL query to find the top 3 most influential accounts. This query
essentially reads the PageRank value stored by ArangoDB’s PageRank
algorithm, sorts the results in descending order, and limits the output
to three. The query and the results of execution are depicted in Figure 9.

Figure 9: Query for listing top 3 most influential accounts
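
The read, sort, and limit steps described above can be sketched in plain Python as follows. The attribute name pagerank is an assumption of this sketch; the name of the attribute that holds the score depends on how the algorithm run was configured. The sample documents are hypothetical.

```python
def top_n(documents, n=3, field="pagerank"):
    """Sort documents by a stored score in descending order and keep the first n."""
    return sorted(documents, key=lambda doc: doc[field], reverse=True)[:n]


# Hypothetical account documents that already carry a stored score.
scored_accounts = [
    {"_id": "account/1", "pagerank": 0.12},
    {"_id": "account/2", "pagerank": 0.41},
    {"_id": "account/3", "pagerank": 0.05},
    {"_id": "account/4", "pagerank": 0.27},
]
top_three = [doc["_id"] for doc in top_n(scored_accounts)]
```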

Finding Money Laundering Patterns

ArangoDB can also be used to find more specific patterns, for example, in
money laundering. In money laundering there is a funds
disaggregation/aggregation pattern, where many small transactions (below
some known triggering threshold) are used to split up a large sum of money,
followed by multiple transaction hops across accounts to further avoid
detection, ultimately followed by a number of transactions that aggregate
the funds back to an account.

This fan-out/fan-in pattern can easily be detected using AQL. The query and
results are depicted in Figure 10.

Figure 10: Finding Money Laundering Patterns
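
As an illustration of the fan-out/fan-in idea, the plain-Python sketch below flags a source account that sends sub-threshold transfers to several distinct accounts whose funds can all reach one common destination within a few hops. This is a simplified heuristic written for this example and is not the query shown in Figure 10 or in Appendix A. The function names, the tuple format of the transactions, and the default values for min_fan and max_hops are assumptions; the $10,000 threshold follows the comment in the appendix query.

```python
from collections import Counter, defaultdict


def reachable(outgoing, start, max_hops):
    """Accounts reachable from `start` in 1..max_hops hops."""
    seen, frontier = set(), {start}
    for _ in range(max_hops):
        frontier = {nxt for node in frontier for nxt in outgoing[node]} - seen
        seen |= frontier
    return seen


def fan_out_fan_in(transactions, threshold=10_000, min_fan=3, max_hops=4):
    """Return (source, sink) pairs where `source` fans out to at least
    `min_fan` distinct accounts and at least `min_fan` of those branches
    lead to `sink` within `max_hops` hops of the source."""
    outgoing = defaultdict(set)
    for src, dst, amount in transactions:
        if amount < threshold:  # only sub-threshold transfers are considered
            outgoing[src].add(dst)

    hits = []
    for source in sorted(outgoing):
        branches = outgoing[source]
        if len(branches) < min_fan:
            continue
        # Count how many distinct branches lead to each downstream account.
        arrivals = Counter()
        for first_hop in branches:
            arrivals.update(reachable(outgoing, first_hop, max_hops - 1))
        hits += [
            (source, sink)
            for sink, branch_count in sorted(arrivals.items())
            if branch_count >= min_fan and sink != source
        ]
    return hits


# Hypothetical data: S splits funds across three mules that all pay into D.
sample = [
    ("S", "m1", 9000), ("S", "m2", 9500), ("S", "m3", 8000),
    ("m1", "D", 9000), ("m2", "D", 9500), ("m3", "D", 8000),
    ("S", "big", 50000),  # above the threshold, so ignored by this heuristic
]
suspicious_pairs = fan_out_fan_in(sample)
```

Real data would need additional conditions, for example time windows and amount matching between the outgoing and incoming legs, to keep false positives manageable.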

Detecting Fraud At Scale
Real-world financial transactions generate billions of data points and
relationships, which will rapidly overrun the capabilities of a single server.
Providing fraud-detection performance at scale requires the underlying data
systems to be able to scale out data across multiple nodes in a distributed
cluster and to be able to efficiently distribute computation in parallel across
the cluster.

On a distributed database cluster, the limiting factor is network performance,
because network access is two orders of magnitude slower than memory access
and in a distributed cluster there will be data and communication
traffic between nodes. For example, the performance of
detecting a fraud ring would be negatively impacted if many of the edges
being traversed caused computation to hop back and forth between servers.
Better network performance obviously improves overall performance.
However, there are also data distribution and query optimizations that can
greatly reduce the amount of inter-node communication needed to execute
queries, and therefore improve distributed performance.

Optimizing the layout of data on the cluster can reduce the inter-node
communication needed to perform queries. ArangoDB uses SmartGraphs
to optimize graph distribution across a cluster, SmartJoins to
ensure that joins do not cross servers, and SatelliteCollections to replicate
metadata across servers so that lookups occur local to servers.

Figure 11: Bad distribution of graph data causes network hops during query execution

The SmartGraphs feature of ArangoDB allows us to handle this problem in a
smarter way. In fraud detection we might know from the past that
fraudsters use banks in certain countries or regions to launder their money.
We can use this domain knowledge as a sharding key for our graph data and
allocate all financial transactions performed in this region to DB server 1,
and distribute other transactions across the other DB servers. By using this
approach we can place all data that needs to be grouped together on the same
machine, and use the query engines on each DB server to execute our
queries in parallel.

Figure 12: Optimized data distribution with ArangoDB SmartGraphs
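
The effect of choosing a good sharding key can be illustrated with a small plain-Python sketch that counts how many edges cross server boundaries under two placements. It does not reproduce how SmartGraphs assigns data internally; the accounts, regions, server numbers, and the function name are all hypothetical.

```python
def cross_shard_edges(edges, shard_of):
    """Count edges whose two endpoints live on different DB servers."""
    return sum(1 for src, dst in edges if shard_of[src] != shard_of[dst])


# Hypothetical accounts tagged with the region of their bank.
region = {"a1": "EU", "a2": "EU", "a3": "EU", "b1": "US", "b2": "US", "b3": "US"}
edges = [
    ("a1", "a2"), ("a2", "a3"), ("a3", "a1"),  # ring inside the EU
    ("b1", "b2"), ("b2", "b3"), ("b3", "b1"),  # ring inside the US
    ("a1", "b1"),                              # a single cross-region transfer
]

# Placement 1: ignore domain knowledge and alternate accounts across 2 servers.
naive = {account: i % 2 for i, account in enumerate(sorted(region))}

# Placement 2: use the region as the sharding key.
server_for_region = {"EU": 0, "US": 1}
by_region = {account: server_for_region[r] for account, r in region.items()}

naive_hops = cross_shard_edges(edges, naive)
region_hops = cross_shard_edges(edges, by_region)
```

In this toy data set the naive placement forces 5 of the 7 edges to cross servers, while sharding by region leaves only the single cross-region transfer as a network hop.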

Conclusion
This paper points the way to using ArangoDB as part of a fraud detection
solution. We encourage users to experiment with our sample data and
sample queries, learn how to apply ArangoDB to fraud via experimentation
by adding or modifying the data and queries, and be inspired and empowered
to apply their knowledge of fraud to detect fraudulent activity in their own
data with ArangoDB. To get started easily, you can follow the interactive
demo provided on our cloud service, ArangoDB Oasis, and described below.

Hands-on with Fraud Detection & Anti Money Laundering


Testing ArangoDB and its capabilities for detecting fraud and money
laundering is very simple. Many of the use cases shown in this white paper
are part of an interactive demo available for free on ArangoDB’s cloud
service Oasis. No credit card is needed for a 14-day free trial deployment, and
the examples can be installed with just one click. A detailed guide is provided
so that everyone can follow along easily.

Just sign up for ArangoDB Oasis and follow the few steps below:

1. Create a Deployment (Here is a 2min video Tutorial)
2. Install the Fraud Detection Example in Oasis (Project -> Deployment
   Tab -> View your deployment -> Examples Tab, or just click “view
   Deployment” directly after initiating the deployment creation)
3. After the example is ready (~1 minute), follow the Fraud Detection
   guide provided to run real queries against the demo data you just
   installed

This White Paper was written by Arthur Keen. For any questions about solving
Fraud Detection cases with ArangoDB, feel free to reach out to
[email protected]

Appendix A: Queries
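/*
Find all suspicious long loops of transactions
Show the graph and JSON results
Scroll to bottom of graph results and click "GraphViewer" to see results in Graph Viewer
*/

WITH transaction, account
// Outer loop: treat every account as a potential starting point
FOR suspicious_account IN account
  // Follow 5 to 10 outbound transaction hops from that account
  FOR acct, tx, path IN 5..10 OUTBOUND suspicious_account._id GRAPH 'fraud-detection'
    // Stop expanding a path once it returns to the starting account
    PRUNE tx._to == suspicious_account._id
    // Keep only the paths that close the loop
    FILTER tx._to == suspicious_account._id
    RETURN path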

/*
Find number of curious loops from a suspicious account
Hints:
Try suspiciousAccountID = account/10000032
Rerun the query for different number of loops detected
Show the graph and JSON results
Scroll to bottom of graph results and click "GraphViewer" to see results in Graph Viewer
*/

WITH account, transaction
LET suspicious_account = DOCUMENT(@suspiciousAccountID)
FOR acct, tx, path IN 5..10 OUTBOUND suspicious_account._id GRAPH 'fraud-detection'
  PRUNE tx._to == suspicious_account._id
  FILTER tx._to == suspicious_account._id
  LIMIT @numberOfLoopsReturned
  RETURN path

/*
Find Orphan Account
An orphan account is an account with few or no transactions.
These may be set up in advance of money laundering operations.
This query finds accounts with no transactions
*/

// Every account that appears as sender or receiver of a transaction.
// Each subquery is wrapped in its own parentheses because it is one of
// several function arguments.
LET usedResources = UNION_DISTINCT(
  (FOR relationship IN transaction RETURN relationship._from),
  (FOR relationship IN transaction RETURN relationship._to))
FOR resource IN account
  FILTER resource._id NOT IN usedResources
  SORT resource.account_type, resource.customer_id
  RETURN {
    "customerName": DOCUMENT(CONCAT("customer/", resource.customer_id)).Name,
    "customerID": resource.customer_id,
    "accountID": resource._id,
    "type": resource.account_type
  }

/*
Anti Money Laundering Pattern Detection
Find transaction patterns that contain a disaggregation and re-aggregation of funds
pattern.
This pattern is characterized by transactions that dis-aggregate funds from a source
account to multiple accounts in amounts that are below a reporting threshold, i.e.,
below $10,000, followed by a series of small transactions into 1 or more accounts,
followed by re-aggregation of the small transactions into a destination account.
Show the graph and JSON results
Scroll to bottom of graph results and click "GraphViewer" to see results in Graph Viewer
*/

WITH account, transaction
// Number of outgoing transactions per account
LET accountOutDegree = (FOR tx IN transaction
  COLLECT accountOut = tx._from WITH COUNT INTO outDegree
  RETURN {account : accountOut, outDegree : outDegree})
// Number of incoming transactions per account
LET accountInDegree = (FOR tx IN transaction
  COLLECT accountIn = tx._to WITH COUNT INTO inDegree
  RETURN {account : accountIn, inDegree : inDegree})
// Combined in/out degree for accounts that both send and receive
LET accountDegree = (FOR inRecord IN accountInDegree
  FOR outRecord IN accountOutDegree
    FILTER inRecord.account == outRecord.account
    RETURN MERGE(inRecord, outRecord))
// Account with the highest out-degree (the strongest fan-out)
LET maxAccount = (FOR maxDegree IN accountOutDegree
  FILTER maxDegree.outDegree == MAX(accountOutDegree[*].outDegree)
  RETURN maxDegree)[0]
// Follow the funds 1 to 4 hops out from that account
FOR acct, tx IN 1..4 OUTBOUND maxAccount.account transaction
  RETURN tx
