TitAnt: Online Real-time Transaction Fraud Detection in Ant Financial

Shaosheng Cao
AI Department (Hangzhou), Ant Financial Services Group
556 Xixi Rd, Hangzhou, China
[email protected]

Xinxing Yang
AI Department (Beijing), Ant Financial Services Group
9F East Tower, WFC, 1 East 3rd Ring, Beijing, China
[email protected]

Cen Chen
AI Department (Singapore), Ant Financial Services Group
1 Raffles Place, Singapore
[email protected]

Jun Zhou
AI Department (Beijing), Ant Financial Services Group
9F East Tower, WFC, 1 East 3rd Ring, Beijing, China
[email protected]

Xiaolong Li
AI Department (Seattle), Ant Financial Services Group
500 108th A. NE Bellevue, Washington 98004, USA
[email protected]

Yuan Qi
AI Department (Hangzhou), Ant Financial Services Group
556 Xixi Rd, Hangzhou, China
[email protected]

arXiv:1906.07407v1 [cs.LG] 18 Jun 2019

ABSTRACT

With the explosive growth of e-commerce and the boom of e-payment, detecting online transaction fraud in real time has become increasingly important to Fintech business. To tackle this problem, we introduce TitAnt, a transaction fraud detection system deployed in Ant Financial, one of the largest Fintech companies in the world. The system is able to predict online real-time transaction fraud in mere milliseconds. We present the problem definition, feature extraction, detection methods, implementation and deployment of the system, as well as its empirical effectiveness. Extensive experiments have been conducted on large real-world transaction data to show the effectiveness and the efficiency of the proposed system.

PVLDB Reference Format:
Shaosheng Cao, Xinxing Yang, Cen Chen, Jun Zhou, Xiaolong Li, and Yuan Qi. TitAnt: Online Real-time Transaction Fraud Detection in Ant Financial. PVLDB, 12(xxx): xxxx-yyyy, 2019.
DOI: https://doi.org/10.14778/xxxxxxx.xxxxxxx

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 12, No. xxx
ISSN 2150-8097.
DOI: https://doi.org/10.14778/xxxxxxx.xxxxxxx

1. INTRODUCTION

Fraud, such as phone fraud, insurance fraud and credit card fraud, causes severe problems for government and business. However, detecting such fraud has always been challenging. With the rapid development of e-commerce and e-payment, the problem of online transaction fraud has become increasingly prominent. Compared with traditional areas, online transactions involve a considerably larger volume of fund transfer.

According to the statistics [41], in 2017 the number and the volume of online transactions reached 48 billion and 2,075 trillion yuan respectively in China alone. Ant Financial¹, also known as Alipay, accounts for about 58% of China’s third-party online payment transactions [30]. Specifically, on 2017’s Double Eleven Shopping Festival² (similar to Black Friday in the US), a single day’s transactions shot up to US$25 billion [51, 25]. With such a transaction volume, it is of great significance to detect and prevent online transaction fraud.

¹ https://en.wikipedia.org/wiki/Ant_Financial
² https://en.wikipedia.org/wiki/Singles%27_Day

Collecting and analyzing transaction data of such magnitude requires a robust database component for offline storage and management. Furthermore, a large-scale distributed computing component for running algorithms is also necessary. To satisfy the low latency requirements of online serving, online prediction with efficient data access is of great significance. Meanwhile, feature extraction and detection methods are equally important.

Rule-based methods for the fraud detection problem have been extensively studied over the years [46]. However, fraud patterns change rapidly over time, greatly deteriorating the effectiveness of rules summarized from expert experience. Subsequently, many data mining based methods have been investigated. For example, supervised learning methods have been proposed recently [40, 53]. However, transaction data usually exhibit two characteristics: 1) the labels are unbalanced, i.e., the majority of transactions are not fraudulent but normal, and 2) compared with analyzing individual transaction records, aggregated data often provides much richer information to identify fraud patterns.

To cope with the first characteristic, several unsupervised learning and anomaly detection methods have been introduced [10, 35]; however, label information can hardly be utilized. On the other hand, some existing data aggregation strategies have also been applied to fraud detection [65, 28]; nevertheless, most of the previous approaches can hardly capture the complex fraud patterns of online transactions. It is this paper’s
topic to investigate how to deal with these two characteristics with our methods.

In this paper, we present a real-world task in FinTech and introduce our TitAnt³ system, which is actively detecting fraudulent transactions. Our contributions are summarized as follows:

• We carefully analyze the task and report several findings. Based on our observations, we examine new feature extraction approaches for transaction fraud detection that are capable of making full use of the information from aggregated data.

• We design and develop a real-world transaction fraud detection system which is able to train on large-scale data offline in hours, and predict online real-time transaction fraud within only milliseconds.

• We conduct extensive experiments on a large transaction record dataset to validate the effectiveness and efficiency of our system, including rule-based methods, anomaly detection approaches and classification models.

³ The name combines Titan and Ant, where the Titans are giant deities with incredible strength in Greek mythology and Ant is the name of the company, carrying the meaning of “greatness comes from micro things”.

Our paper is organized as follows. Section 2 discusses related work on fraud detection. Section 3 presents the problem definition, feature extraction and detection methods. Section 4 describes the details of the implementation and deployment of our TitAnt system. Section 5 shows experimental results, followed by the conclusion in Section 6.

2. RELATED WORK

In this section, we investigate the related literature, including expert systems and rule-based approaches, supervised and unsupervised learning algorithms for the fraud detection task, as well as recently proposed network representation learning models.

2.1 Rule-based Methods and Expert System

Quinlan [48] and Cohen [13] first introduce assertion statements of the form IF {conditions} THEN {a consequent} to recognize fraud records. By distinguishing fraudulent and normal records, Brause et al. [7] generalize and weigh the association rules for detecting credit card fraud. Building on previous achievements, Baulier et al. [4] identify implicit fraudulent calls by generating decision variables, Rosset et al. [52] investigate a two-stage rule-based solution to detect telephone fraud, and Wheeler and Aitken [64] adopt case-based reasoning to analyze the hardest of the misclassified cases. Expert system based methods, on the other hand, have also been well investigated. Major and Riedinger [38] use statistical knowledge to construct a five-layer system, while Von Altrock [61], Stefano and Gisella [54], and Pathak et al. [42] respectively design different fuzzy expert systems for specific scenes. Besides, Chiu and Tsai [12] propose the FPM algorithm to mine frequent patterns of credit card transactions. With the rapid evolution of fraud patterns, hand-summarized rules or expert knowledge alone are not sufficient for today’s online detection; methods [47, 33] that learn knowledge from historical data are more worthwhile to investigate.

2.2 Supervised Learning Models

Hand [27] first uses a linear discriminative model to detect fraud, and later Foster and Stine [18] propose an improved least squares regression with stepwise selection for prediction. Bayesian approaches have been investigated, where Ezawa and Norton [17] employ a four-stage Bayesian network model for telephone fraud and Viaene et al. [60] adopt AdaBoosted naive Bayes for insurance fraud. In addition, neural network based models are applied in fraud diagnosis [21, 43, 2]. Subsequently, Syeda et al. [55] develop a parallel system of fuzzy neural networks, Barse et al. [3] leverage a memory-based neural network to capture temporal dependencies, and Maes et al. [37] combine Bayesian networks and neural networks for detecting credit card fraud. Also, Bhowmik [5] applies Bayesian classification and decision trees to the insurance fraud detection task. Besides, Kim et al. [31] and Wang and Ma [63] utilize SVM-based ensemble strategies for detecting telecommunication subscription fraud and credit fraud. Furthermore, Halvaiee and Akbari [26] and Jia-jie [29] respectively investigate the effectiveness of the artificial immune system and the particle swarm optimization algorithm in fraud detection.

2.3 Unsupervised Learning and Aggregation Strategies

Cox et al. [15] visualize data using color, position, size, etc. to help detect fraud. Bolton et al. [6] introduce a profiling method to detect credit card fraud, Burge and Shawe-Taylor [8] use a recurrent neural network to exploit temporal information of account behavior, and Cortes et al. [14] explore graph mining algorithms such as link analysis. Later, Yamanishi et al. [66] detect fraud in medical insurance data by recognizing statistical outliers. Aggregated data analysis is also investigated, in which Perlich and Provost [44] propose a novel target-dependent aggregation method, Casas et al. [10] utilize k-means to classify network security data, and Vadoodparast et al. [59] combine the results of several different clustering methods. In addition, Jha et al. [28] and Whitrow et al. [65] detect credit card fraud by employing transaction aggregation. Anomaly detection methods, such as isolation forest [35], shed light on fraud detection tasks, since fraudulent transactions can naturally be regarded as abnormal cases.

2.4 Network Representation Learning Models

Recently, network representation learning, also known as graph embedding, plays an increasingly important role in network analysis. Perozzi et al. propose DeepWalk [45], which is superior to traditional graph analysis approaches like Spectral Clustering [58], Modularity [57], and wvRN [36]. After that, many models have been introduced, for example, LINE [56], GraRep [9], and node2vec [24]. Besides, Structure2Vec [16] is a state-of-the-art supervised way of generating embeddings. Although these models have been demonstrated to be effective on public datasets, there does not exist a distributed version that is able to support real-world industrial-scale transaction records.

3. PROBLEM DEFINITION, FEATURE EXTRACTION AND DETECTION METHODS
Figure 1: An illustrated example of basic features extraction and user node embeddings generation. (a) Basic features are extracted from user profile, transfer environment, etc. (b) User node embeddings are learned from historical transaction records.

In this section, based on our analysis of the problem, feature extraction and detection methods are introduced.

3.1 Problem Definition

In general, online transaction fraud can be categorized into two different types: explicit and implicit. In an explicit case, a user becomes aware of the fraud afterwards. After a transaction is completed, the user can file a fraud report and upload supporting proof. Based on the transaction details, profiles and evidence, the authenticity of the reported fraud is examined. If the user has indeed suffered from fraud, the fraudsters are punished with punitive measures, such as action restrictions or account lockout, but it is difficult to recover the losses according to the laws. This type is defined as explicit fraud, handled after the incident.

In an implicit case, our concern is to take proactive actions to prevent potential fraudulent transactions, i.e., actively detecting online transaction fraud and taking immediate steps to stop suspicious transactions. In contrast to explicit fraud, implicit fraud reveals less information and requires real-time prediction by the system.

In this paper, we aim to tackle the implicit online real-time transaction fraud detection task, and a formal problem definition is given as follows:

Definition 1. (Online Real-time Transaction Fraud Detection) Given historical transaction records with fraud labels, the task of online real-time transaction fraud detection is to design a system to predict whether an online real-time transaction is a fraud or not.

3.2 Feature Extraction from Aggregated Data

In order to discover transaction fraud, user profile and transfer contextual information are often of great importance. In particular, the fraud rates in some specific locations are consistently higher than in other areas. Figure 1 (a) illustrates the basic user profile features extracted, such as age, gender, and transfer city (trans_city)⁴.

⁴ trans_city can be inferred from the transfer IP address.

In addition, aggregated information on transaction records can provide much richer information. Based on our investigation, approximately 70% of the fraudsters commit fraud more than once. This suggests that fraudsters tend to repeat their deceitful actions once successful. In Figure 2, we give a simple example to demonstrate the value of aggregated data. A directed edge reflects the transfer relationship from the corresponding transferor to the transferee. Directed red lines with a dollar sign indicate fraudulent transactions, while a black user node stands for the fraudster. The on-going transaction, i.e., the dashed line with a question mark, is very likely to be a potential implicit fraud. Such gathering behaviors are often observed in real cases and manifest in more complex ways.

Figure 2: A simple case of aggregated data over the transaction network.

To extract useful information from the aggregated transaction data, a transaction network is leveraged. Formally, we define the transaction network as follows:
Definition 2. (Transaction Network) A transaction network is defined as $G = (V, E)$. $V = \{v_1, v_2, \ldots, v_n\}$ is a collection of nodes with each node $v$ indicating a user, while $E = \{e_{i,j}\}$ is a set of edges with each edge $e$ indicating the transfer relationship from a transferor to a transferee, both regarded as user nodes.

Figure 3: The architecture of TitAnt system.

Based on historical records over a period of time, a transaction network is built for analysis. Recall the simple case in Figure 2: all the victims, including the potential one, have a common neighbor, i.e., the fraudster. This means they are 2-hop neighbors of each other. Therefore, the topological relationships in the transaction network are well worth studying. To capture topological relationship information, Network Representation Learning (NRL) is a promising direction to be explored [67]. Given a transaction network, NRL methods aim to learn a low dimensional representation matrix $D \in \mathbb{R}^{|V| \times d}$, whose $i$-th row $D_i$ is a $d$-dimensional vector representing the node $v_i$ in the transaction network. In this way, the topological information can be captured by dense vectors, i.e., node embeddings. Figure 1 (b) shows the procedure of generating user node embeddings. First, historical transaction records are collected to construct the transaction network, and then user node embeddings are learned by NRL methods.

As most NRL implementations in the literature are limited to a single machine, we need to reimplement them in a distributed learning framework, since a huge amount of transaction records is produced every day. Based on the insight that no single NRL method is the best in all cases [22], we select DeepWalk (DW) [45] for its efficiency, effectiveness and simplicity.

The original DW utilizes random walks to generate short node sequences, which transform the topological information of the network into sequences. Intuitively, the neighbors of one node will often occur in its contextual positions in the linear node sequences. After the linear node sequences are generated, Skip-gram with negative sampling from word2vec [39] is applied to generate the final user node embeddings.

We also reimplement Structure2Vec (S2V) [16] as an alternative. Such a supervised method can take full advantage of label information, but the learned user node embeddings are also affected by unbalanced labels. Meanwhile, unsupervised methods like DW do not require any labels; therefore, the topological information is extracted only from the transaction network, without being influenced by the imbalance of labels.

3.3 Detection Methods

As the problem of fraud detection is vital to a Fintech business, efforts have been spent on it for years, during which about fifty features were carefully engineered. We call such features basic features, which are also treated as rules or attributes. For each user, we generate user node embeddings, i.e., aggregated features, as additional information from the aggregated transaction records. Basic features and aggregated features are then concatenated together. Labels are collected from user fraud reports, and thus cannot be obtained in real time.

In order to detect fraud precisely, we extensively investigate and validate rule-based methods, anomaly detection approaches and classification models.

Rule-based methods are widely used in many fraud detection applications. Iterative Dichotomiser 3 (ID3) [47] is a traditional approach based on decision tree learning, whereas C5.0 [33, 50] is a revised version of C4.5 [49] that extracts informative patterns from data with higher accuracy. In those methods, features are regarded as rules and label information is utilized for fine-tuning.

Isolation Forest (IF) [35] is a classical anomaly detection
approach widely used due to its effectiveness. We treat features as attributes and directly predict fraudulent transactions, since it does not require any label information. Intuitively, transaction fraud detection is similar to anomaly detection tasks, since the goal is to find abnormal transactions, i.e., outliers that are more likely to be separated from most of the other data.

One of the most popular classification models is Logistic Regression (LR) [62]. Although continuous features can be used in LR, better performance can be achieved after feature discretization in most cases. Compared with LR, non-linear models such as Gradient Boosting Decision Tree (GBDT) [19, 20, 1] are able to achieve better performance in a variety of industrial tasks. GBDT is a tree-based classification model, whose decision trees learn the decision boundary of the classification dataset, and gradient boosting combines several weak classifiers into a stronger one.

We will examine the effectiveness of the above detection methods in Section 5.

4. TITANT SYSTEM IMPLEMENTATION AND DEPLOYMENT

In this section, we show the details of the implementation and deployment of our TitAnt system.

4.1 The Framework of TitAnt System

To guarantee timely response to fraud detection requests, a low latency predictor, a robust database storage platform, and distributed algorithms ought to be carefully designed. As illustrated in Figure 3, our system mainly has two parts, i.e., offline periodical training and online real-time prediction. In the offline training part, models are trained on a fixed schedule, and model files are uploaded to the online predictor for real-time transaction monitoring.

Once users initiate transaction requests in Alipay⁵, transaction logs are periodically sent to MaxCompute⁶ for offline computation. MaxCompute supports SQL and MapReduce for extracting basic features and labels and for constructing the transaction network. At the same time, KunPeng supports large-scale distributed NRL and classification model training⁷. The learned user node embeddings and classification models are stored in MaxCompute.

⁵ https://itunes.apple.com/us/app/alipay-simplify-your-life/id333206289?mt=8
⁶ https://www.alibabacloud.com/product/maxcompute
⁷ Only classification based detection methods are reimplemented in KunPeng for better performance, as reimplementation is time-consuming. Rule-based and anomaly detection methods are not distributed.

Online prediction happens at the Model Server (MS), where the model files are periodically updated. Once a transaction is created by a user in the Alipay app, the Alipay server immediately sends a request to the MS. The MS then gets the related data from Ali-HBase and makes a real-time prediction. If the transaction is detected as fraud, the on-going transaction is interrupted and the transferor is notified. More details on each component are elaborated in the following subsections.

4.2 MaxCompute

MaxCompute, formerly known as Open Data Processing Service (ODPS), is a database storage and management platform. It has three logical layers: the client layer, the server layer and the storage & compute layer. As illustrated in Figure 4, developers can log in with their cloud account and submit jobs through the web console in the client layer, where the HTTP server receives the command and sends a message to the next layer. The server layer contains workers, executors and a scheduler that split jobs into subjobs for distribution. Also, heterogeneous jobs, such as MapReduce, SQL, etc., can be recognized and operated in the storage & compute layer based on Pangu and Fuxi, where Pangu is a disk storage module and Fuxi is a resource scheduling module [68].

Figure 4: The architecture of MaxCompute.

When a SQL command is submitted through the web console, the message is sent to the HTTP server, which requires verification of the cloud account information. If authentication passes, the job is delivered to a worker and the corresponding job instance is sent to the scheduler. After that, the scheduler registers the instance in Open Table Service
(OTS) via the SQL planner and its status is set as “running” simultaneously. OTS maintains the status of all the instances. Finally, the scheduler adds the instance into the queue and the corresponding instance ID is generated.

Subsequently, the scheduler splits the task of the job instance into multiple subtasks, which are arranged into the task pool in priority order. After that, the scheduler keeps waiting for available computing resources. As soon as the resource conditions are satisfied, the subtasks are sent to an executor, which requests Fuxi to trigger computing resources in the compute layer. When all the subtasks are finished, the executor updates the status of the instance to “terminated” in OTS. Finally, the results are stored in Pangu.

4.3 KunPeng

Figure 6: The system architecture of KunPeng.

As numerous transaction records wait for analysis every day, a distributed computing platform is urgently needed. Traditional frameworks, such as MPI [23], do not provide good failure tolerance. In contrast, Parameter Server (PS) [34] tolerates the failure of a single instance, i.e., the failed instance can be restarted and recovered to its previous status automatically while other instances remain unaffected. The KunPeng system [69] was developed in-house on top of the PS framework, and various machine learning algorithms run on it simultaneously.

KunPeng supports data parallelism and model parallelism. As illustrated in Figure 6, it consists of server nodes and worker nodes, where server nodes store the model parameters while worker nodes are responsible for training. Pull and Push operations are defined between server and worker nodes for data exchange. Besides, communication also happens among server nodes.

Based on KunPeng, we redesign NRL and classification algorithms, such as DW, S2V, LR, and GBDT. As an important part of DW, our reimplemented word2vec involves both worker and server nodes. Worker nodes receive the node sequences produced by the random walk algorithm. In every iteration, each worker first reads a batch of sequence data and generates a negative word list. The embeddings are then pulled from the server nodes and updated by gradient descent. Subsequently, the updated embeddings are pushed to the server nodes. On the other hand, server nodes are responsible for communicating with workers in order to exchange embedding data. Server nodes first randomly initialize the embeddings and wait for pull requests from worker nodes. Once a pull request is received, the corresponding embeddings are sent. After each worker’s update, server nodes receive the new embeddings and aggregate them by executing the model average operation.

4.4 MS and Ali-HBase

Figure 5: The architecture of the MS and its interactions with other components.

Figure 7: The architecture of Ali-HBase.

Once the offline training stage ends, online real-time prediction takes over. Figure 5 shows an illustrative example of the whole real-time prediction process. When a user transfers money in the Alipay app, the transfer request is sent to the Alipay server, followed by the MS for fraud monitoring. The MS accesses Ali-HBase for the latest version of the user node embeddings and basic features. MS instances are distributed to satisfy low latency and high service load. As shown in Figure 5, the transaction TID=2 is probably a fraud, with a predicted fraud probability of 99%; thus the MS sends an alert to the Alipay server, which further interrupts the corresponding on-going transaction.
Figure 8: A graphical illustration of the datasets.
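The dataset layout of Figure 8, as described in Section 5.1, comes down to date arithmetic. The sketch below assumes that the three windows are contiguous and non-overlapping and that all bounds are inclusive; the function name `slice_dataset` and the dictionary keys are hypothetical.

```python
from datetime import date, timedelta

NETWORK_DAYS = 90  # days of records used to build the transaction network
TRAIN_DAYS = 14    # days of labeled records used to train the classifier


def slice_dataset(test_day):
    """Return inclusive (start, end) date pairs for the three subsets."""
    train_start = test_day - timedelta(days=TRAIN_DAYS)
    train_end = test_day - timedelta(days=1)
    network_start = train_start - timedelta(days=NETWORK_DAYS)
    network_end = train_start - timedelta(days=1)
    return {
        "network": (network_start, network_end),
        "train": (train_start, train_end),
        "test": (test_day, test_day),
    }


# One dataset per test day, April 10 to April 16, 2017.
datasets = [slice_dataset(date(2017, 4, 10) + timedelta(days=i)) for i in range(7)]
```

Under these assumptions, the first dataset trains on March 27 to April 9, 2017 and tests on April 10, and each later dataset shifts all three windows forward by one day.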

Ali-HBase is based on the HBase project⁸. HBase is modeled after Bigtable [11] and is a distributed, scalable big data store, which is suitable for our real-time data access scenario. The data is organized in the form of Column Families (CF), where a qualifier is used as a marker. As shown in Figure 7, the first CF is basic features, where age, gender, and trans_city are qualifiers. The second is user node embeddings, where each dimension of the vector is a qualifier. In Figure 7, users like Zoe, Sam and Liam are row keys that index the corresponding data. Every time offline training is completed, the data is uploaded to Ali-HBase, versioned by date and time.

4.5 Discussion

In this section, we discuss the implementation design, deployment issues and the construction of the transaction network.

First, the system has strict serving requirements, i.e., tens of milliseconds at most for online detection, including computation and communication costs. However, labels are usually delayed, as they are collected through user feedback, which makes online training impractical. Thus, we adopt periodical offline training and real-time prediction in our system.

Second, in our system, we only demonstrate the usefulness of user node embeddings learned from the transaction network. One may ask about other aggregated information, such as device and IP information. Constructing such a heterogeneous network is an interesting question. We will explore this direction in future work.

5. EXPERIMENTS

To empirically quantify the benefits of each component of our TitAnt system, we conduct experiments under different configurations.

5.1 Experimental Setup

On this task, we adopt the “T+1” mode to update the model, which means a model is trained and deployed offline on a daily basis and is used for real-time prediction on the next day. To demonstrate the effectiveness of our system, we have conducted several experiments and report the performance of each day over a continuous week. In total, we have seven sets of data, where each one is sliced into three subsets: one for learning user node embeddings, another for training the classifier, and the last for testing. Specifically, we collect 90 days of transaction records to build the transaction network. The next 14 days of labeled records are treated as the training set and the last day of labeled records is used as the test set. For example, in Dataset 1 (illustrated in Figure 8), transaction records of April 10, 2017 are chosen as the test set, the records of the 14 days prior to the test set are used for training, and the earlier 90 days of records are employed to build the transaction network. Different from other industrial scenes, such as e-commerce recommendation, online testing is hard to achieve since labels are not obtained in real time.

In our experiment, one of the goals is to investigate the effectiveness of basic features and of the user node embeddings learned from the transaction network. More specifically, we compare both unsupervised DW and supervised S2V models on our task with unbalanced labels. For a fair comparison, the size of the learned embeddings is set to 32, and the embeddings are concatenated with the basic features. In addition, for detection methods, we test the validity of rule-based ID3 and C5.0, anomaly detection based IF, and classification based LR and GBDT.

For DW, we set the length of the random walk to 50, and each node is sampled as the first node of a sequence 100 times, i.e., the number of samplings is 100. It takes around 1.5 hours to learn the embeddings from approximately 8 million randomly selected transaction records with 20 machines equipped with 10 threads in our production environment. Aside from the transaction network, we also feed S2V with the fraud ground truth as the edge labels. Besides, there are a total of 52 carefully extracted basic features.

We set 100 trees for IF, and raw basic features are fed as attributes. As rule-based ID3 and C5.0 cannot support continuous values well, we discretize the data into different bins [32]. We impose L1 regularization with a weight of 0.1 for LR, and set 300 iterations as the stopping criterion. For GBDT, we generate 400 trees with a depth of 3 to ensemble the results and use root mean square error as the objective. The subsampling rates of samples and features are set to 0.4 to prevent overfitting.

5.2 Empirical Results on Transaction Fraud Detection

In this section, we empirically evaluate the effectiveness of our proposed system for the transaction fraud detection task. Eleven configurations are tested in Table 1 from April 10 to April 16, where F1 score is chosen as the evaluation metric. The best results are written in bold font for each day.
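Since F1 score is the evaluation metric, a minimal sketch of how it follows from confusion counts may help. The counts in the usage note are hypothetical and are not taken from the experiments; the function name `f1_score` is likewise an illustrative choice.

```python
def f1_score(true_positive, false_positive, false_negative):
    """Harmonic mean of precision and recall; 0.0 when undefined."""
    predicted = true_positive + false_positive
    actual = true_positive + false_negative
    if predicted == 0 or actual == 0 or true_positive == 0:
        return 0.0
    precision = true_positive / predicted
    recall = true_positive / actual
    return 2 * precision * recall / (precision + recall)
```

For example, with 60 correctly flagged frauds, 30 false alarms and 50 missed frauds, precision is 60/90 and recall is 60/110, which gives an F1 score of 0.6. Because true negatives do not enter the formula, the metric is not inflated by the large share of normal transactions.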
8 https://hbase.apache.org/

Table 1: F1 scores under eleven different configurations.

No.  Configuration                 April 10  April 11  April 12  April 13  April 14  April 15  April 16
1    Basic Features/Attributes+IF  10.30%    10.38%    11.62%    11.21%    10.82%    11.00%    13.30%
2    Basic Features/Rules+ID3      42.08%    44.72%    41.21%    44.25%    42.33%    41.94%    47.69%
3    Basic Features/Rules+C5.0     44.56%    51.55%    45.94%    51.17%    50.23%    51.91%    57.07%
4    Basic Features+LR             53.08%    58.47%    55.72%    60.13%    56.87%    52.52%    64.38%
5    Basic Features+GBDT           56.80%    65.47%    59.05%    64.87%    59.19%    60.34%    68.85%
6    Basic Features+S2V+LR         55.21%    62.08%    60.78%    64.11%    61.04%    55.83%    68.86%
7    Basic Features+S2V+GBDT       60.23%    66.37%    63.24%    68.87%    64.79%    63.30%    71.10%
8    Basic Features+DW+LR          56.06%    61.15%    58.37%    61.13%    60.08%    56.00%    67.33%
9    Basic Features+DW+GBDT        61.43%    66.87%    64.11%    69.93%    65.10%    64.00%    71.84%
10   Basic Features+DW+S2V+LR      56.70%    61.41%    60.69%    62.78%    63.29%    57.74%    67.21%
11   Basic Features+DW+S2V+GBDT    61.37%    66.76%    64.11%    69.67%    64.53%    63.48%    71.40%

First, we analyze the effectiveness of the learned user node embeddings. With the same classifier, it is obvious that introducing additional features from aggregated data can consistently improve the performance of the task. For example, on April 10, the F1 score for “Basic Features+GBDT” is 56.80%. Adding the embeddings generated by S2V improves on this baseline by 3.4 percentage points, while adding the embeddings learned by DW boosts the performance by 4.6 percentage points. Similar conclusions can be drawn for the rest of the days.
We can observe that using user node embeddings learned by DW leads to better results than S2V. Although supervised S2V utilizes extra label information, it also suffers from the issue of unbalanced labels. In this case, the experimental results demonstrate that the benefit of the label information is outweighed by the loss caused by the label imbalance. Moreover, in the experiments, we further concatenate the user node embeddings learned by DW and S2V, but the performance does not improve compared with using DW alone. This suggests that the topological information has already been well extracted by DW.

Figure 9: Recall scores for the top 1% of the most suspicious frauds under different detection methods. (x-axis: IF, ID3, C5.0, LR, GBDT; y-axis: Rec@top 1% (%).)

In general, the performance of rule-based methods is not as good as that of classification models. C5.0 performs better than ID3 by 6.9 percentage points on average, probably because it uses better data discretization and segmentation mechanisms such as Gain Ratio. LR is implemented with discretization preprocessing, which tremendously improves its performance. Only the best performance of LR is shown in the table, with the discretization bin size set to 200. Still, it is obvious that GBDT achieves better results than LR: on April 16 it outperforms LR by 4.5, 2.2, 4.5 and 4.2 percentage points for “Basic Features”, “Basic Features+S2V”, “Basic Features+DW” and “Basic Features+DW+S2V”, respectively.
Besides F1, recall at different thresholds is also important for real-world analysis. Such a recall metric measures the ability of the classifier to find the most suspicious fraud. Figure 9 shows the recall for the top 1% of the most suspicious cases, i.e., rec@top 1%, over five different detection methods. From the results, we can see that IF performs the worst, i.e., under 10%, which is consistent with F1. Such results are intuitive, as the outliers found by IF are probably caused not by fraud but by other reasons. The rule-based ID3 and C5.0 methods achieve much higher results, i.e., 30% and 40%, respectively. GBDT slightly outperforms LR and performs the best.

Figure 10: Time cost over the numbers of machines. (x-axis: numbers of machines (4, 10, 20, 40); y-axes: time cost of DW (minutes) and time cost of GBDT (seconds).)

Based on the above observations, we choose DW to extract additional aggregated information. GBDT is selected as the classifier for its good performance. To decide on the computing resources, we further test time cost versus the number of machines. For our reimplemented version on KunPeng, half of the machines are selected as server nodes, and the rest are used as worker nodes.
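The gaps quoted in the discussion of Table 1 can be rechecked with simple arithmetic. The sketch below hard-codes the relevant rows of Table 1 and recomputes the differences in percentage points; the shortened row keys and the helper names `gap` and `mean_gap` are illustrative assumptions, not part of the TitAnt system.

```python
DAYS = ["April 10", "April 11", "April 12", "April 13",
        "April 14", "April 15", "April 16"]

# F1 scores (%) copied from Table 1; "Basic" abbreviates "Basic Features".
F1 = {
    "Basic+ID3":         [42.08, 44.72, 41.21, 44.25, 42.33, 41.94, 47.69],
    "Basic+C5.0":        [44.56, 51.55, 45.94, 51.17, 50.23, 51.91, 57.07],
    "Basic+LR":          [53.08, 58.47, 55.72, 60.13, 56.87, 52.52, 64.38],
    "Basic+GBDT":        [56.80, 65.47, 59.05, 64.87, 59.19, 60.34, 68.85],
    "Basic+S2V+LR":      [55.21, 62.08, 60.78, 64.11, 61.04, 55.83, 68.86],
    "Basic+S2V+GBDT":    [60.23, 66.37, 63.24, 68.87, 64.79, 63.30, 71.10],
    "Basic+DW+LR":       [56.06, 61.15, 58.37, 61.13, 60.08, 56.00, 67.33],
    "Basic+DW+GBDT":     [61.43, 66.87, 64.11, 69.93, 65.10, 64.00, 71.84],
    "Basic+DW+S2V+LR":   [56.70, 61.41, 60.69, 62.78, 63.29, 57.74, 67.21],
    "Basic+DW+S2V+GBDT": [61.37, 66.76, 64.11, 69.67, 64.53, 63.48, 71.40],
}


def gap(a, b, day):
    """Percentage-point gap F1[a] - F1[b] on the given day."""
    i = DAYS.index(day)
    return F1[a][i] - F1[b][i]


def mean_gap(a, b):
    """Average percentage-point gap F1[a] - F1[b] over all seven days."""
    diffs = [x - y for x, y in zip(F1[a], F1[b])]
    return sum(diffs) / len(diffs)


# Gain from adding embeddings to the GBDT baseline on April 10.
s2v_gain = gap("Basic+S2V+GBDT", "Basic+GBDT", "April 10")
dw_gain = gap("Basic+DW+GBDT", "Basic+GBDT", "April 10")

# Average advantage of C5.0 over ID3 across the week.
c50_vs_id3 = mean_gap("Basic+C5.0", "Basic+ID3")

# GBDT versus LR on April 16 for each feature set.
feature_sets = ["Basic", "Basic+S2V", "Basic+DW", "Basic+DW+S2V"]
gbdt_vs_lr = {f: gap(f + "+GBDT", f + "+LR", "April 16") for f in feature_sets}

# Days on which Basic+DW+GBDT is at least as good as every other row here.
dw_gbdt_best_days = [
    day for i, day in enumerate(DAYS)
    if F1["Basic+DW+GBDT"][i] >= max(row[i] for row in F1.values())
]

print(f"S2V gain: {s2v_gain:.1f}, DW gain: {dw_gain:.1f}")
print(f"C5.0 vs ID3 (mean): {c50_vs_id3:.1f}")
print({f: round(g, 1) for f, g in gbdt_vs_lr.items()})
print(f"DW+GBDT best or tied on {len(dw_gbdt_best_days)} of {len(DAYS)} days")
```

On these rows, “Basic Features+DW+GBDT” is the best or tied-best configuration on every day of the week, which is consistent with selecting DW and GBDT.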
As shown in Figure 10, the time cost of DW continues to decrease as the number of machines increases. However, we also notice that the time cost of GBDT does not obviously halve when the number of machines increases from 20 to 40. In fact, in a real-world PS environment, IO and network communication might become the bottleneck besides computation, and more machines often mean greater communication cost due to uneven machine traffic. Moreover, in the production environment, heterogeneous tasks execute at the same time, so resource allocation must be considered: the more resources are requested, the more waiting time may be needed for allocation. As a compromise, we finally assign 40 machines for DW and 20 machines for GBDT.

5.3 Hyperparameter Sensitivity
We further perform a hyperparameter sensitivity analysis, where Dataset 1 shown in Figure 8 is selected for this experiment.

Figure 11: Performance versus the dimensions of the learned user node embeddings. (x-axis: Dimensions (8, 16, 32, 64); y-axis: F1 Score (%); curves: Basic Features+S2V+GBDT, Basic Features+DW+GBDT, Basic Features+DW+S2V+GBDT.)

The dimension size of the learned embeddings is an important hyperparameter, which influences the amount of topological information extracted from the transaction network. As shown in Figure 11, we compare the F1 score against the dimension size using different NRL methods. Among the tested sizes, 32 is clearly the best dimension size. We believe that the topological information of the network is not well extracted when the dimension is too small, while the results probably overfit when it is too large.
In addition, we vary the number of trees to examine its importance in GBDT. As illustrated in Figure 12, the F1 score consistently improves as the number of trees increases to 400 and then decreases when the number of trees further increases to 800. It is intuitive that the model is not sufficiently trained when the number of trees is too small. Conversely, the model is prone to overfitting when the number of trees is too large.

Figure 12: Performance versus the numbers of GBDT decision trees. (x-axis: Numbers of Trees (100, 200, 400, 800); y-axis: F1 Score (%); curves: Basic Features+GBDT, Basic Features+S2V+GBDT, Basic Features+DW+GBDT, Basic Features+DW+S2V+GBDT.)

Table 2: Performance versus the number of node sampling.

No. of Sampling  25      50      100     200
F1 Score         59.67%  60.62%  61.43%  61.57%

Finally, we analyze the impact of the number of node sampling in DW, which controls the number of linear node sequences generated. Similar to the conclusion in [45], the performance tends to be stable once the number reaches a specific value. Table 2 suggests that the performance tends to stabilize when the number reaches 100. Although the result is slightly better at 200, it takes about twice as long to generate node sequences and learn embeddings. The depth of the GBDT decision trees is also worth exploring; we omit it here because it can be analyzed in a similar way.

6. CONCLUSION
In this paper, we first reveal the significance of the online real-time transaction fraud detection task in Ant Financial and then demonstrate our feature extraction approaches, detection models and implementation details. Extensive experiments on real-world data are conducted, showing the effectiveness and performance of our proposed TitAnt system. In the future, we will investigate more possibilities for system design, explore dynamic construction and modeling of a heterogeneous network, and study the interpretability of embeddings learned by NRL models.

7. ACKNOWLEDGMENTS
The authors thank the anonymous reviewers for their constructive and valuable advice, the MaxCompute team for their suggestions on data storage and management, and Kai Xiao and Xiujing Lin for their help with data preparation.

8. ADDITIONAL AUTHORS
9. REFERENCES
[1] M. M. Ahmed and M. Abdel-Aty. Application of stochastic gradient boosting technique to enhance reliability of real-time risk assessment: use of automatic vehicle identification and remote traffic microwave sensor data. Transportation research record, 2386(1):26–34, 2013.
[2] E. Aleskerov, B. Freisleben, and B. Rao. Cardwatch: A neural network based database mining system for credit card fraud detection. In Proceedings of the IEEE/IAFE 1997 computational intelligence for financial engineering (CIFEr), pages 220–226. IEEE, 1997.
[3] E. L. Barse, H. Kvarnstrom, and E. Jonsson. Synthesizing test data for fraud detection systems. In 19th Annual Computer Security Applications Conference, 2003. Proceedings., pages 384–394. IEEE, 2003.
[4] G. D. Baulier, M. H. Cahill, V. K. Ferrara, and D. Lambert. Automated fraud management in transaction-based networks, Dec. 19 2000. US Patent 6,163,604.
[5] R. Bhowmik. Detecting auto insurance fraud by data mining techniques. Journal of Emerging Trends in Computing and Information Sciences, 2(4):156–162, 2011.
[6] R. J. Bolton, D. J. Hand, et al. Unsupervised profiling methods for fraud detection. Credit Scoring and Credit Control VII, pages 235–255, 2001.
[7] R. Brause, T. Langsdorf, and M. Hepp. Neural data mining for credit card fraud detection. In Proceedings 11th International Conference on Tools with Artificial Intelligence, pages 103–106. IEEE, 1999.
[8] P. Burge and J. Shawe-Taylor. An unsupervised neural network approach to profiling the behavior of mobile phone users for use in fraud detection. Journal of parallel and distributed computing, 61(7):915–925, 2001.
[9] S. Cao, W. Lu, and Q. Xu. Grarep: Learning graph representations with global structural information. In Proceedings of the 24th ACM international on conference on information and knowledge management, pages 891–900. ACM, 2015.
[10] P. Casas, A. D’Alconzo, G. Settanni, P. Fiadino, and F. Skopik. Poster: (semi)-supervised machine learning approaches for network security in high-dimensional network data. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 1805–1807. ACM, 2016.
[11] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems, 26(2):4, 2008.
[12] C.-C. Chiu and C.-Y. Tsai. A web services-based collaborative scheme for credit card fraud detection. In IEEE International Conference on e-Technology, e-Commerce and e-Service, 2004. EEE’04. 2004, pages 177–181. IEEE, 2004.
[13] W. W. Cohen. Fast effective rule induction. In Machine Learning Proceedings 1995, pages 115–123. Elsevier, 1995.
[14] C. Cortes, D. Pregibon, and C. Volinsky. Computational methods for dynamic graphs. Journal of Computational and Graphical Statistics, 12(4):950–970, 2003.
[15] K. C. Cox, S. G. Eick, G. J. Wills, and R. J. Brachman. Brief application description; visual data mining: Recognizing telephone calling fraud. Data Mining and Knowledge Discovery, 1(2):225–231, 1997.
[16] H. Dai, B. Dai, and L. Song. Discriminative embeddings of latent variable models for structured data. In International conference on machine learning, pages 2702–2711, 2016.
[17] K. J. Ezawa and S. W. Norton. Constructing bayesian networks to predict uncollectible telecommunications accounts. IEEE Expert, 11(5):45–51, 1996.
[18] D. P. Foster and R. A. Stine. Variable selection in data mining: Building a predictive model for bankruptcy. Journal of the American Statistical Association, 99(466):303–313, 2004.
[19] J. H. Friedman. Greedy function approximation: a gradient boosting machine. Annals of statistics, pages 1189–1232, 2001.
[20] J. H. Friedman. Stochastic gradient boosting. Computational statistics & data analysis, 38(4):367–378, 2002.
[21] S. Ghosh and D. L. Reilly. Credit card fraud detection with a neural-network. In System Sciences, 1994. Proceedings of the Twenty-Seventh Hawaii International Conference on, volume 3, pages 621–630. IEEE, 1994.
[22] P. Goyal and E. Ferrara. Graph embedding techniques, applications, and performance: A survey. Knowledge-Based Systems, 151:78–94, 2018.
[23] W. D. Gropp, W. Gropp, E. Lusk, and A. Skjellum. Using MPI: portable parallel programming with the message-passing interface, volume 1. MIT press, 1999.
[24] A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 855–864. ACM, 2016.
[25] T. Guardian. Chinese shoppers spend a record $25bn in singles day splurge. https://www.theguardian.com/world/2017/nov/12/chinese-shoppers-spend-a-record-25bn-in-singles-day-splurge/, 2018. Accessed May 24, 2018.
[26] N. S. Halvaiee and M. K. Akbari. A novel model for credit card fraud detection using artificial immune systems. Applied soft computing, 24:40–49, 2014.
[27] D. J. Hand. Discrimination and classification. Wiley Series in Probability and Mathematical Statistics, Chichester: Wiley, 1981.
[28] S. Jha, M. Guillen, and J. C. Westland. Employing transaction aggregation strategy to detect credit card fraud. Expert systems with applications, 39(16):12650–12657, 2012.
[29] S. Jia-jie. Electronic transaction fraud detection based on improved pso algorithm. In Proceedings of 2012 2nd International Conference on Computer Science and Network Technology, pages 2121–2125. IEEE, 2012.
[30] W. S. Journal. 5 things to know about china’s ant financial. https://blogs.wsj.com/briefly/2016/04/26/5-things-to-know-about-chinas-ant-financial/, 2016. Accessed May 24, 2018.
[31] J. Kim, A. Ong, and R. E. Overill. Design of an artificial immune system as a novel anomaly detector for combating financial fraud in the retail sector. In The 2003 Congress on Evolutionary Computation, 2003. CEC’03., volume 1, pages 405–412. IEEE, 2003.
[32] S. Kotsiantis and D. Kanellopoulos. Discretization techniques: A recent survey. GESTS International Transactions on Computer Science and Engineering, 32(1):47–58, 2006.
[33] M. Kuhn and K. Johnson. Applied predictive modeling, volume 26. Springer, 2013.
[34] M. Li, L. Zhou, Z. Yang, A. Li, F. Xia, D. G. Andersen, and A. Smola. Parameter server for distributed machine learning. In Big Learning NIPS Workshop, volume 6, page 2, 2013.
[35] F. T. Liu, K. M. Ting, and Z.-H. Zhou. Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining, pages 413–422. IEEE, 2008.
[36] S. A. Macskassy and F. Provost. A simple relational classifier. Technical report, NEW YORK UNIV NY STERN SCHOOL OF BUSINESS, 2003.
[37] S. Maes, K. Tuyls, B. Vanschoenwinkel, and B. Manderick. Credit card fraud detection using bayesian and neural networks. In Proceedings of the 1st international naiso congress on neuro fuzzy technologies, pages 261–270, 2002.
[38] J. A. Major and D. R. Riedinger. Efd: A hybrid knowledge/statistical-based system for the detection of fraud. Journal of Risk and Insurance, 69(3):309–324, 2002.
[39] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
[40] E. W. Ngai, Y. Hu, Y. H. Wong, Y. Chen, and X. Sun. The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision support systems, 50(3):559–569, 2011.
[41] P. B. of China. The overall operation of the payment system in 2017. http://www.pcac.org.cn/Upload/image/20180306/20180306144824_91997.pdf/, 2018. Accessed February 19, 2019.
[42] J. Pathak, N. Vidyarthi, and S. L. Summers. A fuzzy-based algorithm for auditors to detect elements of fraud in settled insurance claims. Managerial Auditing Journal, 20(6):632–644, 2005.
[43] R. Patidar, L. Sharma, et al. Credit card fraud detection using neural network. International Journal of Soft Computing and Engineering (IJSCE), 1(32-38), 2011.
[44] C. Perlich and F. Provost. Aggregation-based feature invention and relational concept classes. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 167–176. ACM, 2003.
[45] B. Perozzi, R. Al-Rfou, and S. Skiena. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 701–710. ACM, 2014.
[46] C. Phua, V. Lee, K. Smith, and R. Gayler. A comprehensive survey of data mining-based fraud detection research. arXiv preprint arXiv:1009.6119, 2010.
[47] J. R. Quinlan. Induction of decision trees. Machine learning, 1(1):81–106, 1986.
[48] J. R. Quinlan. Learning logical definitions from relations. Machine learning, 5(3):239–266, 1990.
[49] J. R. Quinlan. C4.5: programs for machine learning. Elsevier, 2014.
[50] R. Quinlan. Data mining tools see5 and c5.0. http://www.rulequest.com/see5-info.html. Accessed February 12, 2019.
[51] M. T. Review. Big data game-changer: Alibaba’s double 11 event raises the bar for online sales. https://www.technologyreview.com/s/602850/big-data-game-changer-alibabas-double-11-event-raises-the-bar-for-online-sales/, 2016. Accessed May 24, 2018.
[52] S. Rosset, U. Murad, E. Neumann, Y. Idan, and G. Pinkas. Discovery of fraud rules for telecommunications: challenges and solutions. In Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 409–413. ACM, 1999.
[53] B. Sagar, P. Singh, and S. Mallika. Online transaction fraud detection techniques: A review of data mining approaches. In 2016 3rd International Conference on Computing for Sustainable Global Development, pages 3756–3761. IEEE, 2016.
[54] B. Stefano and F. Gisella. Insurance fraud evaluation: a fuzzy expert system. In 10th IEEE International Conference on Fuzzy Systems. (Cat. No. 01CH37297), volume 3, pages 1491–1494. IEEE, 2001.
[55] M. Syeda, Y.-Q. Zhang, and Y. Pan. Parallel granular neural networks for fast credit card fraud detection. In 2002 IEEE World Congress on Computational Intelligence. 2002 IEEE International Conference on Fuzzy Systems. FUZZ-IEEE’02. Proceedings (Cat. No. 02CH37291), volume 1, pages 572–577. IEEE, 2002.
[56] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei. Line: Large-scale information network embedding. In Proceedings of the 24th international conference on world wide web, pages 1067–1077. International World Wide Web Conferences Steering Committee, 2015.
[57] L. Tang and H. Liu. Relational learning via latent social dimensions. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 817–826. ACM, 2009.
[58] L. Tang and H. Liu. Leveraging social media networks for classification. Data Mining and Knowledge Discovery, 23(3):447–478, 2011.
[59] M. Vadoodparast, A. R. Hamdan, et al. Fraudulent electronic transaction detection using dynamic kda model. International Journal of Computer Science and Information Security, 13(3):90, 2015.
[60] S. Viaene, R. A. Derrig, and G. Dedene. A case study of applying boosting naive bayes to claim fraud diagnosis. IEEE Transactions on Knowledge and Data Engineering, 16(5):612–620, 2004.
[61] C. Von Altrock. Fuzzy logic and neurofuzzy applications in business and finance. Prentice-Hall, Inc., 1996.
[62] S. H. Walker and D. B. Duncan. Estimation of the probability of an event as a function of several independent variables. Biometrika, 54(1-2):167–179, 1967.
[63] G. Wang and J. Ma. A hybrid ensemble approach for enterprise credit risk assessment based on support vector machine. Expert Systems with Applications, 39(5):5325–5331, 2012.
[64] R. Wheeler and S. Aitken. Multiple algorithms for fraud detection. In Applications and Innovations in Intelligent Systems VII, pages 219–231. Springer, 2000.
[65] C. Whitrow, D. J. Hand, P. Juszczak, D. Weston, and N. M. Adams. Transaction aggregation as a strategy for credit card fraud detection. Data mining and knowledge discovery, 18(1):30–55, 2009.
[66] K. Yamanishi, J.-I. Takeuchi, G. Williams, and P. Milne. On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms. Data Mining and Knowledge Discovery, 8(3):275–300, 2004.
[67] D. Zhang, J. Yin, X. Zhu, and C. Zhang. Network representation learning: A survey. IEEE transactions on Big Data, 2018.
[68] Z. Zhang, C. Li, Y. Tao, R. Yang, H. Tang, and J. Xu. Fuxi: a fault-tolerant resource management and job scheduling system at internet scale. PVLDB, 7(13):1393–1404, 2014.
[69] J. Zhou, X. Li, P. Zhao, C. Chen, L. Li, X. Yang, Q. Cui, J. Yu, X. Chen, Y. Ding, et al. Kunpeng: Parameter server based distributed learning systems and its applications in alibaba and ant financial. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1693–1702. ACM, 2017.
