Analysis and Detection of Fraud in International Calls Using Decision Tree
Abstract. Fraud is one of the most severe threats to revenue and quality
of service in telecommunication networks. The advent of new technologies
has provided fraudsters with new techniques to commit fraud. Subscriber
identity module box (SIMbox) fraud is one such fraud: it targets
international calls and has emerged with the use of VoIP technologies.
In this paper, we propose a novel technique for detecting SIMbox fraud
in international calls. The proposed technique uses a decision tree
algorithm to build a model based on six features extracted from call
data records (CDRs). The proposed algorithm is tested using a dataset
obtained from a real mobile operator (Almadar Aljadid Co.) and shows
97.95% detection accuracy.
1 Introduction
Cellular network operators lose about 3% of their annual revenue due to
fraudulent and illegal services [1]. Juniper Research estimated the total losses
from the underground mobile network industry to be 58 billion in 2011 [1, 2].
The impact of voice traffic termination fraud, commonly known as Subscriber
Identity Module (SIMbox) fraud or bypass fraud, on mobile networks is partic-
ularly severe in some parts of the globe [2]. Recent highly publicized raids on
fraudsters include those in Mauritius, Haiti, and El Salvador [3].
Fraudulent SIMboxes hijack international voice calls and transfer them over the
Internet to a cellular device, which injects them back into the cellular network.
As a result, the calls become local at the destination network [4]. When an
international call arrives showing a local number on the recipient's phone screen,
the call should be flagged as this type of fraud, which causes considerable losses
for telecommunications companies. Cellular operators of the intermediate and
destination networks do not receive payments for call routing and termination.
Fraudulent SIMboxes also hijack domestic traffic in certain areas, e.g. in Alaska
within the United States, where call termination costs are high. In some cases,
the traffic is injected into a cellular network and is forwarded to the terminating
country [5]. This increases the call routing cost for the operator of the injected
traffic. Besides causing economic loss, SIMboxes degrade the quality of local
service where they operate. Often, cells are overloaded, and voice calls routed
over a SIMbox have poor quality, which results in customer dissatisfaction.
Although some vendors provide cellular anti-fraud services, the large amount of
daily cellular traffic and the number of connected mobile devices make detecting
call bypass fraud extremely challenging. Moreover, traffic patterns and
characteristics of fraudulent SIMboxes are very similar to those of certain
legitimate devices, such as cellular network probes. Detecting fraudulent SIMboxes
therefore resembles searching for a few needles in a huge haystack full of small
objects that look like needles. While operators of the intermediate and destination
networks have high financial incentives to understand the problem, they do not have
the data needed to analyse the international calls that have bypassed them. Also,
the absence of publicly available SIMbox-related data is a major obstacle to
comprehensive studies on voice bypass fraud analysis and detection [6]. By
contrast, most of the SIMbox traffic analysed in this paper is on the originating
end of the communication, giving us insight into SIMbox fraud from a different
perspective than most networks with a bypass problem. This work analyses fraudulent
SIMbox traffic based on communication data from Almadar Aljadid company, one of
the major mobile operators in Libya. It neither collects nor uses any personally
identifiable information. Based on these observations, we propose using a decision
tree for detecting fraudulent SIMboxes. The proposed technique shows a high
detection rate and correctly filters out mobile network probes with traffic
patterns similar to those of SIMboxes.
Fig. 1: Example of one-hop SIM-box bypass fraud hijacking of an international
call [7]
The rest of this paper is organized as follows. Section 2 gives an overview of
fraud in international calls and illustrates it with a basic example. Section 3
presents the decision tree algorithm for fraud detection in international calls.
Section 4 analyses SIMbox-related traffic and compares it to legitimate traffic
based on the extracted features. Section 5 describes the experiments we performed,
and Section 6 concludes the paper.
Machine learning is a technique in which a computer learns from a set of data given
to it and then becomes able to predict the result for new data similar to
the training data. A machine learning algorithm identifies patterns
based on different characteristics or features and then makes predictions on new,
unclassified data based on the patterns learned earlier. The input data usually
consist of numerous instances of relations between the different variables or
features relevant to the data.
There are various approaches to machine learning, including decision trees,
random forests, neural networks, clustering, Bayesian networks, reinforcement
learning, support vector machines, genetic algorithms, and many more. Decision
tree learning is a method commonly used in data mining. Decision trees are
powerful and popular tools for classification and prediction. Decision trees
represent rules, which can be understood by humans and used in knowledge systems
such as databases. The goal is to create a model that predicts the value of a
target feature based on several input features. Figure 2 shows the general
criteria in a decision tree. A decision tree represents a multi-stage decision
process, where a binary decision is made at each stage. The tree is made up of
nodes and branches.
Fig. 2: General criteria in decision tree
2- The data set is split into two branches, (a) and (b), by a feature, and the
entropy of each branch is calculated:
$$H_a = H(m, n) = -\frac{m}{m+n}\log_2\frac{m}{m+n} - \frac{n}{m+n}\log_2\frac{n}{m+n} \tag{2}$$
$$H_b = H(M-m, N-n) = -\frac{M-m}{(M-m)+(N-n)}\log_2\frac{M-m}{(M-m)+(N-n)} - \frac{N-n}{(M-m)+(N-n)}\log_2\frac{N-n}{(M-m)+(N-n)} \tag{3}$$
Fig. 3: Model example
3- The entropy of each branch is added proportionally to get the total entropy
for the split:
$$H(S|A) = P_a H_a + P_b H_b = \frac{m+n}{M+N} H_a + \frac{(M-m)+(N-n)}{M+N} H_b \tag{4}$$
where P_a is the number of samples at node (a) divided by the number of samples
at node (A), and P_b is the number of samples at node (b) divided by the number
of samples at node (A).
4- The resulting entropy is subtracted from the entropy before the split; the
result is the information gain (I.G), or decrease in entropy.
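The split evaluation in steps 2 to 4 can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation. It assumes that M and N are the numbers of samples of the two classes at the parent node (A), that m and n are the corresponding counts in branch (a), and that the entropy before the split is H(M, N). The function names and the example counts are made up.

```python
from math import log2


def entropy(pos: int, neg: int) -> float:
    """Binary entropy H(pos, neg) in bits; a zero count contributes 0."""
    total = pos + neg
    if total == 0:
        return 0.0
    h = 0.0
    for count in (pos, neg):
        if count > 0:
            p = count / total
            h -= p * log2(p)
    return h


def information_gain(M: int, N: int, m: int, n: int) -> float:
    """Gain of splitting a node holding (M, N) samples into (m, n) and (M-m, N-n)."""
    h_before = entropy(M, N)              # assumed entropy before the split
    h_a = entropy(m, n)                   # Eq. (2)
    h_b = entropy(M - m, N - n)           # Eq. (3)
    p_a = (m + n) / (M + N)
    p_b = ((M - m) + (N - n)) / (M + N)
    h_split = p_a * h_a + p_b * h_b       # Eq. (4)
    return h_before - h_split             # step 4


# Made-up counts, for illustration only.
gain = information_gain(M=10, N=10, m=8, n=2)
print(round(gain, 4))
```

A split that separates the two classes completely yields the full entropy of the parent node as gain, while a split that leaves the class proportions unchanged yields zero.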
Samples are the training samples (building samples). Target-feature is the
feature whose value is to be predicted by the tree. Feature-list is a list of other
features that may be tested by the decision tree.
If all samples are SIMbox, return the single-node tree Root, with label = SIMbox
If all samples are Legitimate, return the single-node tree Root, with label = Legitimate
If Feature-list is empty, return the single-node tree Root, with label = most common
value of Target-feature in samples
Otherwise begin
    Let A be the feature from Feature-list that best classifies samples
    (the feature that gives the largest I.G)
    The decision feature for Root is A
    For each possible value, v_i, of A:
        Add a new tree branch below Root, corresponding to the test A = v_i
        Let samples_vi be the subset of samples that have value v_i for A
        If samples_vi is empty
            then below this new branch add a leaf node with label = most common
            value of Target-feature in samples
            else below this new branch add the subtree
            Decision tree(samples_vi, Target-feature, Feature-list without A)
End
Return Root
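The recursive procedure above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the code used in the experiments: features are taken to be discrete, the tree is stored as nested tuples, and the names build_tree, classify and domains as well as the toy samples are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(samples, target, feature):
    """Entropy of the target before the split minus the weighted entropy after it."""
    before = entropy([s[target] for s in samples])
    after = 0.0
    for value in {s[feature] for s in samples}:
        subset = [s[target] for s in samples if s[feature] == value]
        after += len(subset) / len(samples) * entropy(subset)
    return before - after


def build_tree(samples, target, features, domains):
    """Return a class label (leaf) or a (feature, {value: subtree}) pair."""
    labels = [s[target] for s in samples]
    most_common = Counter(labels).most_common(1)[0][0]
    # Stop when all samples share one label or no feature is left to test.
    if len(set(labels)) == 1 or not features:
        return most_common
    # A is the feature with the largest information gain.
    best = max(features, key=lambda f: information_gain(samples, target, f))
    remaining = [f for f in features if f != best]
    branches = {}
    for value in domains[best]:
        subset = [s for s in samples if s[best] == value]
        # An empty branch becomes a leaf with the most common label.
        branches[value] = (build_tree(subset, target, remaining, domains)
                           if subset else most_common)
    return (best, branches)


def classify(tree, sample):
    """Follow the branches that match the sample until a leaf label is reached."""
    while isinstance(tree, tuple):
        feature, branches = tree
        tree = branches[sample[feature]]
    return tree


# Made-up, discretised samples for illustration only.
samples = [
    {"nodl": "low", "mt": "low", "kind": "SIMbox"},
    {"nodl": "low", "mt": "low", "kind": "SIMbox"},
    {"nodl": "low", "mt": "high", "kind": "Legitimate"},
    {"nodl": "high", "mt": "low", "kind": "Legitimate"},
    {"nodl": "high", "mt": "high", "kind": "Legitimate"},
    {"nodl": "high", "mt": "high", "kind": "Legitimate"},
]
domains = {"nodl": ["low", "high"], "mt": ["low", "high"]}
tree = build_tree(samples, "kind", ["nodl", "mt"], domains)
print(classify(tree, {"nodl": "low", "mt": "low"}))
```

Passing the value domain of each feature explicitly is a design choice of this sketch: it lets the empty-subset case of the pseudocode occur, which it cannot if the values are read from the current samples only.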
We analyse samples of fully anonymized call data records (CDRs) from a tier-1
cellular operator in Libya (Almadar Aljadid Co.). The data were collected between
October 2014 and November 2014. CDRs are logs of all phone calls, text messages,
and data exchanges in the network. If the two communicating parties (caller
and receiver) belong to the same cellular provider, two records are stored.
The data set contains CDRs of 34 known fraudulent SIMbox accounts and of
about 273 legitimate accounts. The legitimate accounts consist of fully anonymized
post-paid family plans, which are unlikely to be involved in fraudulent activities,
corporate accounts, and mobile network probing devices. It is common practice for
local and foreign cellular operators and device manufacturers to probe the mobility
network to measure the quality of service in terms of latency, to test upcoming
cellular devices, etc. [12,13]. Probing devices generate a rather large number
of voice calls, most of which are addressed to different recipients. This contrasts
with the communication pattern of regular users, who make fewer phone calls to
fewer contacts [14]. The data set is split into two parts: the first is used for
building (training) the tree and the second for testing.
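As an illustration of how per-account features might be derived from CDRs, the sketch below counts events per account and the number of different locations (NoDL). The record layout is hypothetical: the field names account, event, cell and handovers are assumptions made for this example and do not describe the operator's actual CDR format.

```python
from collections import defaultdict

# Assumed event codes: MO/MT calls and originating/terminating SMS.
EVENT_TYPES = ("MO", "MT", "SMSO", "SMST")


def extract_features(cdrs):
    """Aggregate an iterable of CDR dicts into one feature dict per account."""
    counts = defaultdict(lambda: dict.fromkeys(EVENT_TYPES, 0))
    handovers = defaultdict(int)
    locations = defaultdict(set)
    for rec in cdrs:
        account = rec["account"]
        counts[account][rec["event"]] += 1             # call and SMS counters
        handovers[account] += rec.get("handovers", 0)  # missing field counts as 0
        locations[account].add(rec["cell"])            # distinct locations seen
    return {
        account: {**counts[account],
                  "handovers": handovers[account],
                  "NoDL": len(locations[account])}
        for account in counts
    }


# Made-up records for illustration only.
cdrs = [
    {"account": "A", "event": "MO", "cell": "c1", "handovers": 0},
    {"account": "A", "event": "MO", "cell": "c1", "handovers": 0},
    {"account": "A", "event": "SMSO", "cell": "c1"},
    {"account": "B", "event": "MO", "cell": "c1", "handovers": 1},
    {"account": "B", "event": "MT", "cell": "c2", "handovers": 2},
    {"account": "B", "event": "SMST", "cell": "c3"},
]
features = extract_features(cdrs)
print(features["A"])
print(features["B"])
```

Each resulting dictionary holds six values per account, which could then serve as one building or testing sample.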
[Figure: SIMbox and legitimate accounts plotted over pairs of features. Panels:
(a) NoDL versus number of MO calls; (b) NoDL versus number of MT calls;
(c) plotted against number of SMSO; (d) number of SMSO versus number of MT calls.]
5 Experimental results
The practical performance of the decision tree algorithm described in the
previous section was tested using a separate data sample (the testing part) that
consists of 12 SIMbox accounts and 251 legitimate accounts. According to the
information gain measure, the number of different locations (NoDL) provides the
best prediction of the target feature (kind of account) over the training
samples. Therefore, the number of different locations is selected as the decision
feature for the root node, and branches are created below the root for each of its
possible values. Table 2 summarises the information gain for the features at
each node, where R is the root node, L is a node on the left and r is a node on
the right.
Table 2: Information Gain for the features at each node
Node            MO        MT        SMSO      SMST      NoDL
Root (R)        0.052596  0.207557  0.097026  0.088234  0.276763
R-L             0.514704  0.245623  0.118183  0.133216  0.181276
R-L-L           0.066197  0.136376  0.040580  0.174136  0.072861
R-L-L-L         0.311689  0.311689  0         0         0.141619
R-L-L-L-L       0.027740  0.257678  0         0         0.242697
R-L-L-L-L-L     0.144484  0.078982  0         0         0.144484
R-L-L-L-L-L-r   0.122556  0.122556  0         0         0.811278
R-L-r           0.093531  0.111687  0.138122  0.185579  0.012461
R-L-r-r         0.970950  0.970950  0.321928  0.170950  0
R-r             0.013723  0.053982  0.002601  0.006265  0.008751
R-r-L           0.918295  0.251629  0.251629  0.918295  0.918295
There are two types of test to determine the accuracy of the algorithm: the
true negative rate test and the true positive rate test. The true negative rate
is the proportion of legitimate accounts classified as legitimate (the complement
of the false positive rate), whereas the true positive rate is the proportion of
SIMboxes classified as SIMbox accounts (the complement of the false negative rate).
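The two rates, together with the overall accuracy, can be computed from the actual and predicted labels as in the following sketch. The function name detection_rates and the label lists are made up for illustration and are not the experimental data.

```python
def detection_rates(actual, predicted, positive="SIMbox"):
    """Return true positive rate, true negative rate and accuracy."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    return {
        "TPR": tp / (tp + fn),        # equals 1 - false negative rate
        "TNR": tn / (tn + fp),        # equals 1 - false positive rate
        "accuracy": (tp + tn) / len(pairs),
    }


# Made-up labels: 4 SIMbox accounts followed by 6 legitimate accounts.
actual = ["SIMbox"] * 4 + ["Legitimate"] * 6
predicted = ["SIMbox"] * 3 + ["Legitimate"] * 6 + ["SIMbox"]
rates = detection_rates(actual, predicted)
print(rates)
```

Because the test set contains far more legitimate accounts than SIMboxes, reporting both rates is more informative than accuracy alone: a classifier that labels every account as legitimate would still reach a high accuracy.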
Figure 5a shows the prediction accuracy of the proposed algorithm as a function
of the number of building samples. It can be clearly seen that as the number of
samples increases, the accuracy of the algorithm improves. When the full set
of samples was used, the classification accuracy reached 97.95%. Figure 5b
shows the true negative rate versus the number of legitimate building samples when
using the decision tree algorithm to predict the status of the SIM card. It can be
seen that the prediction accuracy improves as the number of samples for
legitimate users increases. The improvement is due to the fact that increasing the
number of legitimate building samples improves the model's understanding of the
behaviour of legitimate users.
6 Conclusions
In this paper, six features extracted from CDR data are used to build a decision
tree that can distinguish between legitimate and SIMbox accounts.
The features include the total number of outgoing and incoming calls, the total
number of originating and terminating SMS messages, the total number of handovers
and the total number of different locations. The proposed decision tree algorithm
has shown accuracy of up to 97.95% when tested on testing samples
from Almadar Aljadid company.
[Fig. 5: (a) horizontal axis: number of building samples; (b) horizontal axis:
number of building legitimate samples.]
References
1. H. Windsor, Mobile Revenue Assurance Fraud Management, Juniper Research,
http://goo.gl/GX7G4.
2. M. Yelland, Fraud in mobile networks, Computer Fraud & Security, vol. 2013, no.
3, pp. 5-9, 2013.
3. Raids on SIM Box/GSM Gateway Fraudsters Save Mobile Operators Millions,
Reuters, http://goo.gl/pHCpK.
4. Fraud in the Mobile World, Revector, http://goo.gl/Uobx6.
5. I. Murynets, M. Zabarankin, R. P. Jover and A. Panagia, Analysis and detection
of SIMbox fraud in mobility networks, INFOCOM, 2014 Proceedings IEEE, pp.
1519-1526, May 2014.
6. A. H. Elmi, S. Ibrahim and R. Sallehuddin, Detecting SIM box fraud using neural
network, in IT Convergence and Security 2012, Springer, pp. 575-582, 2013.
7. N2B Risk Management,
http://www.zira.com.ba/products/risk-managemet/n2b-fraud-management-system/sim-box.
8. G. Kesavaraj and S. Sukumaran, A study on classification techniques in data mining,
International Conference on Computing, Communications and Networking
Technologies (ICCCNT), pp. 1-7, July 2013.
9. W. L. Martinez and A. R. Martinez, Computational Statistics Handbook with
MATLAB, 2002.
10. T. M. Mitchell, Machine Learning, McGraw-Hill, March 1997.
11. Y. Freund, The alternating decision tree learning algorithm, in Machine
Learning: Proceedings of the Sixteenth International Conference, March 1999.
12. I. Murynets and R. Piqueras Jover, Crime scene investigation: SMS spam data
analysis, in Proceedings of the 2012 ACM Conference on Internet Measurement,
ACM, pp. 441-452, 2012.
13. RCATS - Remote Cellular Active Test System, JDSU, http://goo.gl/VEbMA.
14. A.-L. Barabasi and R. Albert, Emergence of scaling in random networks, Science,
vol. 286, no. 5439, pp. 509-512, 1999.