Fitsum Tesfaye
A Thesis submitted to
School of Electrical and Computer Engineering
Addis Ababa Institute of Technology
I, the undersigned, declare that this thesis comprises my own work in compliance
with internationally accepted practices; I have fully acknowledged and referenced
all materials used in this thesis work.
Fitsum Tesfaye
Signature
Name
Addis Ababa University
Addis Ababa Institute of Technology
School of Electrical and Computer Engineering
This is to certify that the thesis prepared by Fitsum Tesfaye, entitled Near-Real Time
SIM-box Fraud Detection Using Machine Learning in the Case of ethio telecom and
submitted in partial fulfillment of the requirements for the degree of Master of Science
in Telecommunication Engineering, complies with the regulations of the University
and meets the accepted standards with respect to originality and quality.
In this study, the Sliding Window (SW) aggregation mode is applied to provide
relevant dataset instances and reduce the detection delay to one hour using supervised
Machine Learning (ML) algorithms. Three supervised ML classifier algorithms were
used, namely Random Forest (RF), Artificial Neural Network (ANN), and Support
Vector Machine (SVM), with two validation techniques: 10-fold cross-validation
and a supplied test set. Call Detail Record (CDR) data were collected, relevant
attributes were selected, and preprocessing tasks such as data cleaning, integration,
and aggregation were performed.
KEYWORDS
ACKNOWLEDGMENTS
First, I would like to thank God for giving me the strength to pass through all the steps.
Next, I would like to give special gratitude to my advisor, Ephrem Teshale (PhD),
for his constructive and valuable comments and support. I would also like to
thank my evaluators, Yalemzewd Negash (PhD) and Murad Ridwan (PhD), for
their feedback during the thesis progress presentations. I also want to thank my
company, ethio telecom, for giving me this opportunity.
I would also like to give my special thanks to the ethio telecom staff for their support
in providing data and resources, and also special thanks to my friends Gebremeskel
G/medhin, Surafel G/Mariam, and Tamirat Teshome for being supportive of this
research work.
Lastly, I would like to give my special thanks to my beloved wife, Liya Abiyu, for
her unforgettable support and patience, and to my sweet kids too.
CONTENTS
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
Acronyms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Statement of the Problem . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 General Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.2 Specific Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Contributions of the Research . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.5 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.6 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 SIM-box Fraud . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3 Machine Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.3 Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.4 Semi-Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.5 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4 Experimental Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.2 Understanding The Data . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.3 Data Preprocessing and feature selection . . . . . . . . . . . . . . . . . 21
4.3.1 Sample Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.3.2 Preprocessing Data . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.3.3 Data Aggregation Mode . . . . . . . . . . . . . . . . . . . . . . . 26
LIST OF FIGURES
LIST OF TABLES
ACRONYMS
AI Artificial Intelligence
DT Decision Tree
FN False Negative
FP False Positive
ML Machine Learning
RF Random Forest
SW Sliding Window
TN True Negative
TP True Positive
1
INTRODUCTION
1.1 statement of the problem
In Ethiopia, the sole telecom company, ethio telecom, is not only losing revenue;
the country as a whole also loses a lot of foreign currency. Telecom companies build
their own Fraud Management Systems (FMS) to protect themselves from fraudulent
activities. SIM-box fraud is one of the major interconnect bypass fraud types
affecting telecom operators. Telecom companies mainly use three approaches to
overcome SIM-box fraud: Test Call Generation (TCG), where telecom operators make
international calls to their own network through an international gateway; FMS,
a rule-based fraud detection technique; and controlling SIM card distribution, which
limits the number of SIM cards given to customers.
A report made by ethio telecom's CEO on October 15, 2018, indicates that ethio
telecom lost a huge amount of foreign currency, about 2.5 billion Birr, in 2017/18 [8].
Ethio telecom has given this problem high attention and is working hard to
overcome these fraudsters using different controlling techniques.
ML has been found effective in the detection of telecom fraud, but there is a trade-off
between detection accuracy and detection delay. Since SIM-box fraudsters harm
telecom operators, they should be detected before causing major damage. Overall,
exploring SIM-box fraud detection techniques helps to identify fraudulent patterns
easily and to reduce the detection delay.
1.2 objective
The main objective of the research is to detect SIM-box fraud in near-real time using
machine learning algorithms by analyzing users' CDR data.
• Explore the best machine learning algorithm for SIM-box fraud detection.
• Select the relevant attributes to build the SIM-box fraud detection model.
• Build models with the selected algorithms for the detection of SIM-box fraud.
To the best of the author's knowledge, there is no specific work on SIM-box fraud
detection using the SW data aggregation mode with machine learning algorithms.
The output of this research will give a better understanding of near-real time
SIM-box fraud detection and serve as an initial idea for further research related
to near-real time or real-time SIM-box fraud detection.

1.4 literature review
Different researchers have studied telecom fraud, covering the detection and
prevention of fraudsters' impact on telecom companies. SIM-box fraud is one of
the top interconnect bypass telecom frauds and has brought a huge impact on
telecom companies; a number of studies have been conducted on applying machine
learning algorithms for the detection of SIM-box fraud.
Ilona Murynets et al. [12] analyzed and detected SIM-box fraud using high-volume
CDR traffic with consideration of users' mobility. Fraudulent users generate a much
higher rate of call traffic than legitimate users and have a static location, meaning
they have no or very low mobility compared with legitimate users. The authors set
a classifier rule as a linear combination of three classifier algorithms, obtaining the
weight coefficients by minimizing the model's error on the training dataset. The
final result shows that the new classifier rule obtains better accuracy than the three
original decision tree algorithms.
Using data mining for the detection of SIM-box fraud has become common since
the concept of big-data analysis emerged. The research in [13] was conducted to
detect SIM-box fraud using data mining and to compare data mining techniques
and the classification results of the algorithms. The author selected four supervised
machine learning classifiers: Logistic Classifier, Boosted Trees Classifier, SVM, and
ANN. Five common machine learning model evaluation metrics were used to
analyze the results: Accuracy, Confusion matrix, Area Under Curve (AUC),
Precision, and Recall. Both the Boosted Trees Classifier and the Logistic Classifier
models performed better than the other two, the SVM and ANN algorithms.
Due to the impact of fraudsters, telecom companies should try to detect fraudulent
activities in real time. The work in [14] focused on algorithm-based fraud detection,
namely the United Intelligent Scoring (UIS) algorithm. Kun Niu et al. believe that
commonly used fraud detection approaches, such as rule-based systems, outlier
detectors, and classifiers, have high computational costs when processing mass
data. So, telecom companies need a real-time solution to reduce the impact of
fraud. To achieve that, the authors propose a new algorithm called United
Intelligent Scoring (UIS). The UIS algorithm has less computational complexity at
classification time, updates scores in real time, and could detect new fraud patterns
effectively.
A recent research work [10] studied SIM-box fraud detection using data mining
techniques in the case of ethio telecom. The author collected one month of CDR
data for 20,000 customers from ethio telecom; 5,000 of them were fraudulent
customers that had been detected and blocked by the ethio telecom security
department. The research basically focused on data mining techniques for SIM-box
fraud detection. RF, SVM, and ANN were the algorithms applied in the research.
Each algorithm's model was trained and tested at different granularity levels:
4-hour, one day, and one month. Each algorithm achieved different classification
performance. Finally, the RF model with a 4-hour granularity level achieved better
accuracy than the SVM and ANN models at the daily and monthly granularity
levels. As the granularity level becomes smaller, the classifier algorithms obtain
better performance, or classification accuracy. Another research work [15]
was conducted on SIM-box fraud detection a year before [10]. It also used ethio
telecom customers' CDR data for its experimental process, applying data mining
techniques. The selected algorithms have the capability to depict users' patterns
from their voice, Short Message Service (SMS), and data usage. 12,686 CDR records
were used for the experiment, which is a very small dataset compared to the CDRs
generated in the company. Decision tree (J48), rule-based (PART), and neural
network (Multilayer Perceptron) algorithms were implemented for training and
testing the models; the models were evaluated using confusion matrices, Precision,
Recall, F-measure, and Accuracy. The decision tree algorithm performed better than
the other algorithms. Both researchers [10, 15] used the Waikato Environment for
Knowledge Analysis (WEKA) data mining tool for their experiments.
1.5 methodology
The main purpose of this research is to build a near-real time SIM-box fraud
detection model with SW using machine learning. In order to achieve the objectives,
the following steps are carried out.
• CDR and customer profile data are collected, the required preprocessing is
performed, and the SW aggregation mode is applied to prepare the final
instance datasets.
• The WEKA workbench tool is used to train the selected machine learning
algorithms and to analyze the classification performance using overall
accuracy, confusion matrix, F-measure, and the Receiver Operating
Characteristic (ROC) curve as evaluation metrics.
1.6 thesis organization
2
SIM-BOX FRAUD
SIM-box fraudsters hijack international voice calls and transfer them through the
internet using VoIP; in some countries, like the USA, SIM-boxers even hijack local
calls and terminate them as local calls using the local SIM cards inserted in a
SIM-box device [4, 7]. SIM-boxers are very interested in countries with high
international call termination costs and low local call costs [12]. SIM-boxers work
with fraudulent transient operators; these fraudulent transient operators offer a
lower call routing cost and hijack the calls [10]. The SIM-box fraud scenario is
depicted in Figure 2.2: the green line indicates the normal, legitimate international
call route, and the broken red line indicates a hijacked route. Ethio telecom, the
sole telecom service provider in Ethiopia, is one of the telecom service providers
impacted by interconnect bypass fraud, mainly SIM-box fraud.
Telecom operators have been battling telecom fraudsters since the appearance of
telecommunication fraud. Telecom operators use different SIM-box fraud detection
approaches; Test Call Generation (TCG), Fraud Management System (FMS), and
SIM card Distribution Control (SDC) are some of the detection techniques [4].
When telecom operators apply TCG, they make a huge number of international
calls to their own network and check whether the calls terminate through the
legitimate route or through SIM-box routes. TCG produces no false positives in its
test results, but it depends on its probabilistic nature, and making several
international calls is also costly. FMS is a user-profiling method based on CDR; it
tries to detect fraudulent behavior by providing a long list of rules to distinguish
between legitimate and fraudulent users. SDC is a controlling mechanism that
prevents fraudsters from getting an excessive number of SIM cards, for example by
limiting the number of SIM cards per customer and demanding different customer
identification information for SIM provisioning.
SIM-boxers fight back against telecom operators' detection techniques using
improved technologies so as not to get blocked. They analyze the incoming voice
call pattern to determine whether the calls come from real subscribers or from
TCG; once they identify that a call is coming from TCG, they either block the test
call or reroute it to the legitimate route [4, 10]. SIM-box fraudsters act like
legitimate users: they imitate legitimate behaviors using special Human Behavior
Simulation (HBS) software installed in the SIM-box device. Using the HBS software
they perform SIM migration and rotation, acting as if they are mobile; they use
other network services like SMS and GPRS to look more like legitimate customers;
and they prepare their own family lists and call one another so as not to look
suspicious.
3
MACHINE LEARNING ALGORITHMS
3.1 introduction
Machine learning refers to computer algorithms that use certain instructions and
rules to extract important concepts of information and services from enormous
amounts of input data; those rules are not created by computer programmers [18].
Machines were initially intended to do their tasks much faster and with a higher
level of precision than humans, making human life easy and smooth. Machine
learning algorithms learn from sample data, or experience, to extract knowledge or
information without step-by-step instruction. Machine learning is a part of Artificial
Intelligence (AI) that helps to extract knowledge patterns from huge input data; it
learns and improves on a given problem based on experience [19, 20]. The basic
process of ML is to train and test a model that generates a new set of rules based
on inference from source data [18]. ML is closely related to Knowledge Discovery
from Data (KDD), data mining, and pattern recognition. ML uses different
mathematical formulations, called machine learning algorithms, to extract
information from prior or historical data. Machine learning algorithms are
organized based on the desired output of the algorithm. There are four common
types of machine learning algorithms [20, 21]: Supervised Learning, Unsupervised
Learning, Semi-Supervised Learning, and Reinforcement Learning.
3.2 supervised learning
Supervised machine learning algorithms use a labeled dataset at training time and
then build a classification model.
For decision trees, the entropy of a dataset S over the set of classes C is computed
using Equation (3.1):

$$H(S) = -\sum_{c \in C} P(c) \log_2 P(c) \qquad (3.1)$$
The other basic concept used to select the root node and decision nodes is
information gain. The attribute with the highest information gain value is set as the
root node, and the process continues for the remaining attributes to develop the
hierarchy of the tree. Information gain is computed by taking the entropy results as
one component, using Equation (3.2):
$$IG(A, S) = H(S) - \sum_{t \in T} P(t)\, H(t) \qquad (3.2)$$
SVM is a supervised machine learning algorithm that works for classification and
regression tasks; its separating hyperplane is given by Equation (3.3):
$$S(x) = w^{T}x + b \qquad (3.3)$$
During training, SVM varies the input weights and combines the bias values to
separate the classes, putting instances of class one (C1) on one side of the
hyperplane and instances of class two (C2) on the other side. Equation (3.4) and
Equation (3.5) set the class of new dataset instances: according to the result of Y(x),
the class is identified as C1 if Y(x) > 0 and C2 if Y(x) < 0 [27].
The above scenario does not work for nonlinearly separable datasets; to overcome
this, SVM applies a technique called the kernel trick, shown in Equation (3.6),
generating a smooth nonlinear separating decision boundary. Using the training
dataset, SVM builds a model that assigns new examples to one category or the
other [20].
$$K(x_i, x_j) = e^{-\|x_i - x_j\|^2 / \sigma^2} \qquad (3.6)$$
The simplest kind of ANN is the Perceptron, which is used to classify linearly
separable classes of m-dimensional data. A weighted sum is computed using
Equation (3.7) and provided to an activation function, such as the sigmoid function
calculated in Equation (3.8).
$$y = \sum_{j=1}^{d} w_j x_j + w_0 \qquad (3.7)$$
where $w_j$ denotes the inputs' weights, and the sigmoid activation is

$$\mathrm{sigmoid}(a) = \frac{1}{1 + \exp(-w^{T}x)} \qquad (3.8)$$

3.3 unsupervised learning
3.4 semi-supervised learning

Semi-supervised learning uses a collection of data that combines both labeled and
unlabeled data. In most cases of semi-supervised machine learning, labeled data
are scarce; the target is to train the model to predict the classes of test data better
than a model generated using the labeled data alone [20].

3.5 reinforcement learning
4
EXPERIMENTAL ANALYSIS
This chapter discusses the overall experimental process conducted in this research.
Figure 4.1 shows the experimental process of the model; data collection, data
preprocessing, and classification are the main tasks performed in order to detect
SIM-box fraud. Details of the tasks done under these modules are described in the
coming sections.

4.1 data collection

The current FMS at ethio telecom uses customers' CDR to analyze and detect
telecom frauds, and this research uses the same CDR source for the experiment.
Raw CDR data is stored to the database server every five minutes; on average,
about 26 million CDR records are dumped to the database server every day. Since the
CDR data size is huge, a separate storage place is required. The Information System
Division (ISD) prepared a Windows Server 2012 R2 machine with 8 GB RAM and
4 TB of storage capacity.
The raw CDR is stored on the storage server as a flat file, so it needs to be imported
into the database, which is installed on the same server. To do so, an automatic
data loader script is used to import it fully into the database. Importing one day's
CDR flat file takes more than 20 hours on average. While new CDR data was being
collected, the previously collected CDR data was imported to the database server.
To speed up the importing process, automatic data loader scripts ran in parallel,
importing the CDR data into four different tables.
4.2 understanding the data

The imported CDR data has 33 columns, or attributes, listed in Table 4.1 with their
descriptions. Some fields, like CALLING_IMEI, CALLING_CARRIER,
CALLED_CARRIER, CALLED_DISTRICT, HOTLINEINDICATOR,
CALLING_TRUNK_ID, and CALLED_TRUNK_ID, have no values. Some fields are
generated for billing purposes, like CDR_ID, RE_ID, CDR_TYPE, CALL_FEE,
STATUS_DATE, CHARGE_1, CHARGE_2, RATE_ID, and ACCOUNT_ITEM_ID.
The remaining fields are also used for billing but have their own values. CDR_ID
uniquely identifies each CDR, and RE_ID differentiates the service type of the
record: voice, SMS, or internet data usage. CDR_TYPE distinguishes mobile
originating, terminating, and forwarding call types. CELL_A and CELL_B include
the IDs of the calling and called district or cell. Some sensitive fields, such as the
called number, calling number, and billing numbers, have been hashed for privacy
reasons [10].
The RE_ID field has a value between 1 and 6; each value represents a different type
of service CDR record. For example, 1 represents voice service records, 2 represents
SMS service records, and 5 represents data service records. To simplify the
experiment, the CDR data of the three services (voice, SMS, and data) were
segregated into different tables named VOICE_SOURCE_TABLE,
SMS_SOURCE_TABLE, and DATA_SOURCE_TABLE. In addition, customer profile
data (activation date) was also
collected only for currently active customers. While doing the segregation, fields
with no value are removed.
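The segregation by RE_ID can be sketched as follows (records are simplified to dictionaries; in the actual experiment this was done with database tables):

```python
# Split raw CDR records into per-service groups by RE_ID:
# 1 = voice, 2 = SMS, 5 = data (values taken from the description above).
SERVICE_BY_RE_ID = {
    1: "VOICE_SOURCE_TABLE",
    2: "SMS_SOURCE_TABLE",
    5: "DATA_SOURCE_TABLE",
}

def segregate(records):
    tables = {name: [] for name in SERVICE_BY_RE_ID.values()}
    for rec in records:
        name = SERVICE_BY_RE_ID.get(rec["RE_ID"])
        if name is not None:  # other RE_ID values are ignored in this sketch
            tables[name].append(rec)
    return tables

records = [
    {"CDR_ID": 1, "RE_ID": 1},
    {"CDR_ID": 2, "RE_ID": 2},
    {"CDR_ID": 3, "RE_ID": 5},
    {"CDR_ID": 4, "RE_ID": 1},
]
tables = segregate(records)
print({k: len(v) for k, v in tables.items()})
```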
4.3 data preprocessing and feature selection
Before proceeding to the data preprocessing stage, sampling is a mandatory prior
task. Supervised ML is applied to detect SIM-box fraud, and it uses labeled data
with two classes; the data types are either nominal or numeric.
Fraudulent customer numbers were provided by the ethio telecom security
department; many SIM-box fraudulent numbers are detected every day. The
company provided 5,000 SIM-box fraudulent numbers that were detected and
suspended by the FMS within the CDR data collection period. The ratio of
fraudulent to legitimate numbers must be proportional, and most studies propose
a ratio of 25% fraudulent to 75% legitimate numbers. Currently, ethio telecom has
34 million active customers, which is a huge number compared with the sample of
fraudulent numbers. So, 15,000 legitimate numbers were chosen randomly from
the active customer database. Each active customer has an equal probability of
being selected; Simple Random Sampling (SRS) is applied to get the 75% legitimate
sample numbers. The research uses a total
of 20,000 sample subscriber numbers. Table 4.4 shows the sample sizes of
legitimate and fraudulent service numbers.
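The sample construction can be sketched with simple random sampling; the population sizes below are scaled-down stand-ins for the real 5,000 fraudulent and 15,000 legitimate numbers:

```python
import random

def build_sample(fraud_numbers, active_numbers, ratio=3):
    """Simple Random Sampling: draw `ratio` legitimate numbers per
    fraudulent number, each active customer equally likely to be chosen."""
    k = ratio * len(fraud_numbers)
    legit = random.sample(active_numbers, k)  # sampling without replacement
    return fraud_numbers, legit

random.seed(42)  # make this illustration reproducible
fraud = [f"F{i}" for i in range(50)]     # stand-in fraudulent numbers
active = [f"L{i}" for i in range(3400)]  # stand-in active customers
fraud_sample, legit_sample = build_sample(fraud, active)
print(len(fraud_sample), len(legit_sample))  # 50 150, a 25%/75% split
```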
State-of-the-art SIM-box fraud detection using machine learning broadly uses
customer usage data, or CDR, which is highly helpful for extracting knowledge
about customer behavior.
Real-world databases are susceptible to noise, missing values, and data
inconsistency due to their huge size. The data sources are multiple and
heterogeneous, and they need to be properly processed to improve the data
quality [28]. There are a few data preprocessing techniques that help to improve
data quality.
• Data integration is useful to merge data from multiple sources into coherent
data.
• Data reduction is used to reduce the instance data size by aggregating,
eliminating redundant features, or clustering.
Data cleaning is a step in preparing a relevant data source for ML: remove or fill
the vacant values of the data, maintain the consistency of the data, remove noisy
values, and remove redundant values by keeping a single record, so as not to bias
the ML. The data cleaning process is time consuming and requires high attention
to avoid producing irrelevant data at the end. Since the collected data are stored in
different tables and places, all the data cleaning activities were applied to all data
sources, making sure that the same columns found in each table have equal size,
data type, and format; for example, calling number, called number, call start time,
and call end time are found in more than two tables.
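The cleaning steps above can be sketched as follows; the record fields are simplified and hypothetical, since the real cleaning was performed on database tables:

```python
def clean(records, required_fields, fill_value=0):
    """Remove duplicate records and fill vacant values, mirroring the
    cleaning steps described above (simplified illustration)."""
    seen, cleaned = set(), []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key in seen:            # redundant record: keep a single copy
            continue
        seen.add(key)
        for f in required_fields:  # fill vacant (None) values
            if rec.get(f) is None:
                rec[f] = fill_value
        cleaned.append(rec)
    return cleaned

raw = [
    {"calling": "A", "called": "B", "duration": 30},
    {"calling": "A", "called": "B", "duration": 30},    # exact duplicate
    {"calling": "C", "called": "D", "duration": None},  # vacant value
]
print(clean(raw, ["calling", "called", "duration"]))
```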
Data aggregation is one type of data preprocessing task performed on the collected
CDR data, and it helps to give full information about a specific user. A single
record in the raw CDR shows one activity of a specific user, but it is difficult to
understand a user's behavior from a single CDR record; collective CDR records
need to be aggregated together to give a full picture of the user's behaviour. An
aggregation is the cumulative result for each individual user within a given time
span, and the time span of the aggregation depends on the nature of the research.
This research concerns near-real time SIM-box fraud detection using usage data
and understanding users' usage behavior patterns within a given time span. It uses
the SW aggregation technique to produce the instances for the experiment; a
detailed description of SW is given in Section 4.3.3. Finding the minimum
granularity level of an instance requires considering some points about fraudulent
activities; the points listed below try to depict customer behavior.
• The number of VOICE calls made by the user within the time span
When the aggregation time span is too short, a suitable pattern may not be
available to detect SIM-box fraud. On the other hand, a long aggregation time span
is not convenient for detecting SIM-box fraud in near-real time. The research in [10]
was conducted using ethio telecom's CDR data to detect SIM-box fraud with three
different granularity levels: 4-hour, 1-day, and 1-month. Its cumulative results show
that the minimum granularity level (4-hour) achieves better performance than the
other granularity levels. In addition, the current ethio telecom FMS also uses a
4-hour granularity level as its minimum aggregation time span. So, since this
research targets near-real time detection, it is more persuasive to set the minimum
aggregation time span, or granularity level, to 4 hours.
• Aggregated_VOICE table
• Aggregated_SMS table
• Aggregated_DATA table
• Aggregated_IN_VOICE table, and
• Service_AGE
In order to identify and collect the aggregated values from each table, a unique
identifier is required for each aggregation time range; the FLAGE label is used for
this purpose. A single service number's instance record is the integration of all the
above-listed tables' records with the same FLAGE value, or aggregation hour.
While collecting each service number's record from those tables, the records must
match the calling time: the FLAGE and the date of the calling time must be the
same. For records that have no value in a given time range, zero is assigned to
indicate that the specific service was not used by the user in that time range. All
instances are prepared in the same fashion.
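The FLAGE-based integration can be sketched as follows; the table contents and feature names here are hypothetical stand-ins for the aggregated tables listed above:

```python
def build_instance(number, flag, day, tables, features):
    """Integrate one service number's records that share the same FLAGE
    value and date; absent services contribute zero, as described above."""
    instance = {"number": number, "FLAGE": flag, "date": day}
    for name, feature in zip(tables, features):
        rec = tables[name].get((number, flag, day))
        instance[feature] = rec if rec is not None else 0
    return instance

# Hypothetical aggregated tables, keyed by (number, FLAGE, date).
tables = {
    "Aggregated_VOICE": {("0911", 3, "2019-01-05"): 42},
    "Aggregated_SMS":   {("0911", 3, "2019-01-05"): 7},
    "Aggregated_DATA":  {},  # no data usage in this window -> zero
}
features = ["voice_calls", "sms_count", "data_sessions"]
print(build_instance("0911", 3, "2019-01-05", tables, features))
```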
The Service_AGE table holds the service activation dates of all service numbers
taken as samples for this research, which helps to know how long a customer has
used the service number; note that, most of the time, a fraudulent number's service
life is very short compared with legitimate customers'. The other four tables
provide the total usage of a given service type by each customer, which provides a
lot of information about customer behavior. Figure 4.2 shows sample screenshots of
the final aggregated instances prepared for both SW and Fixed Four-Hour (F4H)
aggregation.
As Section 4.3.2.2 describes, the minimum time span, or granularity level, for
aggregation is four hours. The window size equals the minimum time span; the
window then slides one hour to the next timeline and continues until the end of
the collected CDR data.
This research mainly focuses on minimizing the detection delay from the 4-hour
system granularity level to 1 hour, with better detection accuracy. SW is the basic
idea used to achieve this research's main target. The scenario for the SW is
explained next. As shown in Figure 4.3, let us say the current day is the initial date
on which the detection process starts before moving on to the next window.
The CDR data collected within the initial hour is aggregated with the previous
three hours of collected CDR to get the cumulative instance result for the window's
time span. On the next slide, the same process is repeated to get the next instance,
and this continues until the collected CDR data is fully covered by the SW. Every
time the window slides, the aggregated result is stored in the database, and the
final aggregated result is collected. Using the SW technique, it is possible to
capture customers' behavior within one hour, by considering the previous three
hours, without waiting for an additional four hours.
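The sliding-window procedure above can be sketched as follows; this is a minimal illustration over the call hours of a single day, whereas the real aggregation runs over the full CDR collection period and many features, not just call counts:

```python
def sliding_window_counts(call_hours, window=4, step=1, horizon=24):
    """For each 1-hour slide, aggregate the current hour with the previous
    `window - 1` hours: a 4-hour window with a 1-hour detection delay."""
    per_hour = [0] * horizon
    for h in call_hours:  # hour of day for each call
        per_hour[h] += 1
    instances = []
    for end in range(window - 1, horizon, step):
        span = range(end - window + 1, end + 1)
        instances.append((end, sum(per_hour[h] for h in span)))
    return instances

# Hypothetical calls at hours 0, 1, 1, 2, 5 of one day.
print(sliding_window_counts([0, 1, 1, 2, 5])[:4])
# windows ending at hours 3, 4, 5, 6
```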
Besides SW, F4H is applied in this research for comparison with the previous
research by Kahsu Hagos [10]; the current FMS likewise takes the CDR of every
four hours. As the name indicates, the aggregation time range is a minimum of
4 hours: every four hours, the collected CDR is aggregated for those service
numbers that used telecom services within those 4 hours. The daily 24 hours are
chunked into six parts with a time frame of 4 hours each. This way of aggregating
the CDR data is also applied in [10].
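In the F4H mode, mapping a call to its fixed chunk reduces to integer division of the hour of day:

```python
from collections import Counter

def f4h_bin(hour):
    """Map an hour of day (0-23) to one of the six fixed 4-hour chunks:
    0 -> 00:00-03:59, 1 -> 04:00-07:59, ..., 5 -> 20:00-23:59."""
    return hour // 4

def f4h_counts(call_hours):
    """Aggregate a user's calls into the six fixed 4-hour chunks of a day."""
    return Counter(f4h_bin(h) for h in call_hours)

print(f4h_counts([0, 3, 4, 13, 23]))  # chunks 0, 0, 1, 3, 5
```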
An outlier is a value that is not consistent with the rest of the dataset and can be
considered noisy data. Such values need to be detected and removed from a
dataset to enhance the classification performance of the algorithms [3, 29]. The
Inter Quartile Range (IQR) method is applied for the detection of outliers in this
research. Outliers are individual values that fall outside the overall pattern of the
rest of the dataset. The IQR method first sorts the dataset and divides it into four
equal parts, then finds the three quartile values Q1, Q2, and Q3; the Q2 value is
the same as the median. Since the outliers lie somewhere outside a specific
boundary, IQR finds the upper and lower boundaries, which are called the fences.
The IQR value is calculated as shown in Equation (4.1).
IQR = Q3 − Q1 (4.1)
The boundaries are calculated using an outlier factor, which is typically set to 1.5;
Equation (4.2) and Equation (4.3) show the upper and lower boundaries
respectively.

Upper = Q3 + 1.5 × IQR (4.2)

Lower = Q1 − 1.5 × IQR (4.3)
Once the process is finalized, the outliers detected in both the SW and Fixed
Four-Hour datasets are removed. Table 4.5 shows the number of outliers that were
detected and removed.
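A minimal sketch of IQR-based outlier removal, using a simple median-split convention for the quartiles (statistical packages may compute quartiles slightly differently):

```python
def quartiles(sorted_vals):
    """Median-split quartiles: Q1/Q3 are medians of the lower/upper half."""
    def median(v):
        n = len(v)
        mid = n // 2
        return v[mid] if n % 2 else (v[mid - 1] + v[mid]) / 2
    half = len(sorted_vals) // 2
    return (median(sorted_vals[:half]),
            median(sorted_vals),
            median(sorted_vals[-half:]))

def remove_outliers(values, factor=1.5):
    """Drop values outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    s = sorted(values)
    q1, _, q3 = quartiles(s)
    iqr = q3 - q1               # Equation (4.1)
    lower = q1 - factor * iqr   # lower fence, Equation (4.3)
    upper = q3 + factor * iqr   # upper fence, Equation (4.2)
    return [v for v in values if lower <= v <= upper]

data = [10, 12, 11, 13, 12, 11, 95]  # 95 is an obvious outlier
print(remove_outliers(data))
```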
4.4 algorithm training
Once all the data preprocessing, feature selection, aggregation, and integration are
completed, the next step is training and building classification models using the
selected algorithms. Machine learning classification techniques are discussed in
detail in Chapter 3. The three selected supervised machine learning algorithms,
Random Forest, Artificial Neural Network, and Support Vector Machine, are used
to create the models.
Two separate techniques are used to train the proposed ML models: K-fold cross-validation and separate test data. K-fold cross-validation is the most widely used training method. It chunks the instance dataset into k equal parts, or folds; the classifier algorithm is then trained on k−1 folds and tested on the remaining fold. This process is repeated iteratively, changing the test fold from the first up to the kth fold. Finally, the cumulative average error over all training and testing runs is reported [10, 28]. The 10-fold cross-validation technique is suited to medium-sized datasets, so this research uses it as well.
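The fold partitioning described above can be sketched in plain Python. This illustrates the index bookkeeping only, not necessarily the tool implementation used in the experiments.

```python
def kfold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of the k folds.
    Each instance appears in exactly one test fold."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    idx, start = list(range(n_samples)), 0
    for size in fold_sizes:
        test = idx[start:start + size]
        train = idx[:start] + idx[start + size:]
        yield train, test
        start += size

# 100 instances, k = 10: ten folds of ten test instances each.
folds = list(kfold_indices(100, k=10))
```

In each iteration the model is trained on the 90 training indices and scored on the 10 held-out indices; averaging the ten scores gives the cross-validation estimate.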
Separate test data is another model training technique. The dataset is divided into two parts, one for training and the other for testing. There is no specific ratio of labeled test to training data, but the divided portions should be large enough for both training and testing. If there is enough data, it is even possible to split the dataset 50%/50%; otherwise, the training and testing process is biased by insufficient data [3]. Table 4.6 shows that the separate test data technique uses 40% of the total dataset for algorithm testing, and Table 4.7 contains the 60% of the total dataset used for algorithm training.
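The 60%/40% split can be illustrated as follows; the shuffle seed and the ratio parameter are illustrative assumptions, not values stated in the thesis.

```python
import random

def split_train_test(instances, train_ratio=0.6, seed=42):
    """Shuffle the instances, then split them into training and
    test portions at the given ratio."""
    rng = random.Random(seed)        # fixed seed for reproducibility
    shuffled = instances[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

data = list(range(1000))             # stand-in for dataset instances
train, test = split_train_test(data)
```

Shuffling before splitting matters: aggregated CDR instances are ordered in time, and an unshuffled split could place whole time periods in only one portion.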
4.5 model building
Once the experiment environment is ready, the relevant algorithms (RF, ANN, and SVM), the training modes that fit this research (cross-validation and separate test data), and the aggregated dataset modes (SW and F4H) are selected. With the possible combinations, a total of 12 models are built to detect SIM-box fraud. Table 4.8 shows the number of models that can be built by combining the three selections (algorithms, training modes, and datasets).
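The twelve combinations in Table 4.8 follow directly from the Cartesian product of the three selections, as this short sketch shows:

```python
from itertools import product

algorithms = ["RF", "ANN", "SVM"]
training_modes = ["10-fold cross-validation", "Separate test data"]
datasets = ["SW", "F4H"]

# 3 algorithms x 2 training modes x 2 dataset modes = 12 models.
models = [
    {"algorithm": a, "training": t, "dataset": d}
    for a, t, d in product(algorithms, training_modes, datasets)
]
```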
Each model is explained in detail in the coming subsections. The models built with the RF, ANN, and SVM algorithms are addressed in Section 4.5.1, Section 4.5.2, and Section 4.5.3, respectively, and their classification performance is collected and evaluated.
Building models with the ANN algorithm follows the same fashion as the RF model building explained in Section 4.5.1; as a result, four ANN models are also built.
The remaining four models are built using the SVM algorithm, with the same training and data aggregation modes as the models described in Section 4.5.1 and Section 4.5.2. The SVM model building is explained in Section 4.5.3.
4.6 algorithm evaluation
The main objective of this research is to evaluate and compare the classification performance of machine learning algorithms within the desired time span. Once data collection is completed, the preprocessing tasks are handled, and model training and testing follow; SIM-box frauds are detected in near-real time, within one hour. Validation is performed using 10-fold cross-validation and separate test data to evaluate the algorithms' performance. All ML models are evaluated using different evaluation metrics. The common evaluation metrics are the confusion matrix, classification accuracy, F-measure, recall, Root Mean Squared (RMS) error, and the ROC curve. These evaluation metrics are discussed in the following pages.
Correctly classified fraudulent and normal (legitimate) instances are indicated by TP and TN, respectively. Conversely, incorrectly classified fraudulent and normal (legitimate) instances are identified by FP and FN, respectively. Several researchers [3, 10, 30–32] use the confusion matrix and classification accuracy as common classification metrics.
Classification accuracy measures the ratio of correctly classified instances (both the normal class 'N' and the fraudulent class 'Y') to the overall dataset. Equation 4.4 gives the percentage of correctly classified instances.

Accuracy = (TP + TN) / (TP + TN + FP + FN) (4.4)
4.6.3 F-Measure

The F-measure is the harmonic mean of precision and recall, as shown in Equation 4.5.

F-Measure = (2 × Precision × Recall) / (Precision + Recall) (4.5)
Recall measures the ratio of instances correctly classified as normal (class 'N') to the sum of the correctly and incorrectly classified instances of class 'N'. Recall for the fraudulent class 'Y' is computed the same way using Equation 4.6.

Recall = TP / (TP + FN) (4.6)
Precision measures the ratio of instances correctly classified as normal (class 'N') to the sum of the instances correctly classified as 'N' and the instances of class 'Y' incorrectly classified as 'N'. Precision for the fraudulent class 'Y' is computed the same way using Equation 4.7.

Precision = TP / (TP + FP) (4.7)
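Equations 4.4 through 4.7 can be checked against the confusion-matrix counts reported later for the RF/SW model in Table 5.2. The sketch below computes the metrics for the fraudulent class 'Y'; note that the F-measure figures in the results tables are class-weighted averages, so the per-class value here differs from them.

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F-measure (Equations 4.4-4.7)
    from confusion-matrix counts, taking 'Y' as the positive class."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure

# RF with SW aggregation, 10-fold cross-validation (Table 5.2):
# 120,789 frauds correctly flagged (TP), 460,570 normal calls correctly
# passed (TN), 4,991 false alarms (FP), 17,952 missed frauds (FN).
acc, prec, rec, f1 = metrics(tp=120_789, tn=460_570, fp=4_991, fn=17_952)
```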
The ROC curve is a graphical representation of the True Positive Rate (TPR) against the False Positive Rate (FPR), with FPR on the X-axis and TPR on the Y-axis. When an ROC curve lies close to the top-left corner of the plot, the algorithm is considered a near-perfect classifier. On the contrary, if the ROC curve lies under the diagonal line (X = Y), the algorithm is considered a poor classifier.
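A simplified sketch of how the (FPR, TPR) points of an ROC curve are obtained by sweeping a decision threshold over classifier scores. Tied scores are not handled separately here, and the labels and scores are toy values, not thesis data.

```python
def roc_points(labels, scores):
    """(FPR, TPR) points from sweeping a threshold over the scores;
    labels are 1 for fraudulent ('Y'), 0 for normal ('N')."""
    pairs = sorted(zip(scores, labels), reverse=True)  # high scores first
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in pairs:
        if label:
            tp += 1            # one more fraud caught at this threshold
        else:
            fp += 1            # one more false alarm at this threshold
        points.append((fp / neg, tp / pos))
    return points

labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
pts = roc_points(labels, scores)
```

Plotting these points gives the curves of Figures 5.1 and 5.2; a curve hugging (0, 1) corresponds to high TPR at low FPR.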
5
RESULTS AND DISCUSSION
This chapter describes the experimental results of the research with the 10-fold cross-validation and separate test data validation techniques. Both SW and F4H aggregated datasets are applied with the selected classifier algorithms RF, ANN, and SVM.
The main target of this research is detecting SIM-box fraud in near-real time using ML algorithms and comparing each algorithm's performance. To counter SIM-box fraud activities, near-real time SIM-box fraud detection experiments have been discussed in the chapters above, and the SW data aggregation technique is applied to achieve the desired results of the research.
Experiments are conducted using the selected algorithms for the SIM-box fraud detection process. The two aggregation modes (SW and F4H) are applied to obtain the final dataset instances for each experiment, and the 10-fold cross-validation and separate/supplied test data validation techniques are applied with the supervised ML algorithms. The final experimental results of the models are recorded, evaluated, and compared against each other.
Comparing the models, the RF classifier algorithm has better accuracy than the other two classifier algorithms, ANN and SVM. Four independent models were built with each ML algorithm; under both training techniques, 10-fold cross-validation and supplied test data, all four RF models achieve better performance than the models built using ANN and SVM. Comparing the RF models by aggregation mode, SW aggregation achieves the highest accuracies of 96.2% and 94.9% with the 10-fold cross-validation and separate test data training techniques, respectively.
5.1 model evaluation
10-fold cross-validation with SW mode performs better than supplied test data because the model recursively draws training and test data from the same data source. The overall result of SW mode is better than F4H mode because the number of instances used by SW mode is much higher than that of F4H; SW provides far more data to train and test the models and therefore obtains better performance than F4H mode.
The ANN classifier models in turn achieve higher accuracy than the SVM classifier models. Comparing the ANN models with each other, each achieves a similar performance of about 85% accuracy; each model's performance is stated in Table 4.10. Last is the performance of the SVM classifier models, which is much lower than that of the other two classifier algorithms: SW with the 10-fold cross-validation training technique obtains 68.9% accuracy. The performance of the other three SVM classifier models is presented in Table 4.11.
As explained earlier, because the cross-validation technique draws training and test sets from the same instances, its experiments usually obtain better performance than separate test data validation. Since this research targets near-real time detection, training and evaluation time is a major comparison criterion for the selected algorithms. In this research, SVM takes much longer in both model building and evaluation than the other two algorithms when classifying under cross-validation; SVM takes more than a day to build a model, and similarly with separate test data. For that reason, SVM is not recommended for this near-real time use case.
Supplied test data uses a small test set compared with the training set; because of that, its evaluation time is much lower than that of the 10-fold technique, since using a huge number of instances increases model building and evaluation time. A Windows laptop with 8 GB of RAM is used for the experiments. Detailed results of the 10-fold cross-validation test are depicted in Table 5.2, and Table 5.1 shows the overall time consumption of each algorithm for model building.
Figure 5.1: ROC curve for 10-Fold Cross Validation of SW and F4H
Table 5.2: Overall performance of classification algorithms with 10-fold cross-validation.
Each confusion matrix lists the actual classes NO/YES as rows and the predicted NO/YES as columns.

Validation technique: 10-fold cross-validation

Algorithm  Aggregation  Confusion matrix (NO, YES)    Accuracy  F-Measure  Time (seconds)
RF         SW           NO:  460,570    4,991         96.2%     0.961         1,062.95
                        YES:  17,952  120,789
RF         F4H          NO:  113,127    2,848         91.38%    0.91            217.19
                        YES:  10,110   24,183
ANN        SW           NO:  461,671    3,890         84.87%    0.822         1,240.25
                        YES:  87,563   51,178
ANN        F4H          NO:  114,497    1,478         84.87%    0.824           365.65
                        YES:  21,255   13,038
SVM        SW           NO:  362,466  103,095         68.9%     0.694        49,965
                        YES:  84,867   53,874
SVM        F4H          NO:   88,426   27,549         68.34%    0.694         4,895.28
                        YES:  20,031   14,262
Figure 5.1 depicts the performance of the models built with 10-fold cross-validation on the SW and F4H aggregated data instances. The model built using the RF algorithm is close to the top-left corner, at (0, 1), of the graph, which indicates that the RF model performs better than the other two algorithms' models. Similarly, the ROC curves in Figure 5.2 show that the RF models achieve better performance than the other models. Figure 5.1a and Figure 5.2a have a similar appearance as a result of their close results.
Figure 5.2: ROC curve for Supplied Test Data of SW and F4H
As shown in both Figure 5.1 and Figure 5.2, the RF classifier algorithm remains the better classifier with high performance and accuracy. The model with SW is the top performer under both validation techniques, 10-fold cross-validation and the supplied test case. Unlike the RF classifier, SVM is the worst-performing ML algorithm in these experiments. The detailed experimental results of the supplied test are depicted in Table 5.3.
Table 5.3: Overall performance of classification algorithms with supplied test data.
Each confusion matrix lists the actual classes NO/YES as rows and the predicted NO/YES as columns.

Validation technique: supplied test data

Algorithm  Aggregation  Confusion matrix (NO, YES)    Accuracy  F-Measure  Time (minutes)
RF         SW           NO:  183,668    2,557         94.9%     0.948           798.69
                        YES:   9,777   45,720
RF         F4H          NO:   45,221    1,169         90.56%    0.901           110.98
                        YES:   4,504    9,214
ANN        SW           NO:  181,615    4,610         84.52%    0.824         1,103.80
                        YES:  32,818   22,679
ANN        F4H          NO:   35,376   11,014         85.09%    0.823           236.87
                        YES:   7,971    5,747
SVM        SW           NO:  141,903   44,322         68.5%     0.695         6,449.47
                        YES:  32,071   23,426
SVM        F4H          NO:   35,376   11,014         68.42%    0.695         1,452.83
                        YES:   7,971    5,747
6
CONCLUSION AND RECOMMENDATION
6.1 conclusion
The main focus of this research is detecting SIM-box fraud in near-real time from users' CDR data with the help of ML algorithms. The SW aggregation mode is used with a minimum time span of four hours: a window of four hours (the window size) slides forward by one hour at a time, and each slide of the window delivers an aggregated instance. The SW four-hour window aggregation technique improves the trade-off between detection accuracy and detection delay.
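The sliding-window mechanism can be sketched as follows. This is a minimal illustration under the stated 4-hour window and 1-hour slide; the assumption that 21 full windows fit in one day is mine, since the thesis does not say how windows crossing midnight are handled.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=4)   # window size from the text
SLIDE = timedelta(hours=1)    # slide step from the text

def sliding_windows(day_start, n_slides=21):
    """Start/end times of the 4-hour windows sliding by 1 hour across
    one day (21 full windows fit between 00:00 and 24:00)."""
    return [(day_start + i * SLIDE, day_start + i * SLIDE + WINDOW)
            for i in range(n_slides)]

def window_count(cdr_times, start, end):
    """Number of CDR events for one service number inside a window."""
    return sum(1 for t in cdr_times if start <= t < end)

day = datetime(2019, 1, 20)
windows = sliding_windows(day)
```

Because consecutive windows overlap by three hours, each call contributes to up to four instances, which is why SW yields far more instances than F4H.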
6.2 recommendations for future work
The amount of time taken by the RF algorithm with SW for classification is much higher than the classification time of RF with the F4H aggregation mode; evaluation time increases because the SW dataset contains many more instances than the F4H dataset. Ethio telecom has reduced local voice call prices, which could increase SIM-box fraudsters' interest in hijacking international call termination. As future work to improve near-real time SIM-box fraud detection, continuous research is required on CDR data analysis, incorporating additional CDR features such as the International Mobile Equipment Identity (IMEI) and the Mobile Termination ID (the receiver's cell ID). In addition, reducing the detection time with state-of-the-art methods to move closer to real time, and investigating voice-call quality degradation as a signal for SIM-box fraud detection, are recommended.
REFERENCES
[5] Y. Kou, C.-T. Lu, S. Sirwongwattana, and Y.-P. Huang, “Survey of fraud
detection techniques,” in IEEE International Conference on Networking, Sensing
and Control, 2004, IEEE, vol. 2, 2004, pp. 749–754.
[7] R. Alves, P. Ferreira, O. Belo, and J. Lopes, "Discovering telecom fraud situations through mining anomalous behavior patterns," ACM Workshop on Data Mining for Business Applications (DMBA), 2006.
[8] Apanews. (2017). Ethiopia loses over $52m to telecom fraud-official. 2017-03-
06, [Online]. Available: https://mobile.apanews.net/en/news/ethiopia-
loses-over-52m-to-telecom-fraud-official (visited on 01/20/2019).
[10] H. Kahsu, "SIM-box fraud detection using data mining techniques: The case of ethio telecom," p. 84, 2018.
[11] R. Sallehuddin, S. Ibrahim, A. Mohd Zain, and A. Hussein Elmi, "Classification of SIM box fraud detection using support vector machine and artificial neural network," International Journal of Innovative Computing, vol. 4, no. 2, pp. 19–27, 2014.
[14] K. Niu, H. Jiao, N. Deng, and Z. Gao, “A real-time fraud detection algorithm
based on intelligent scoring for the telecom industry,” Proceedings - 2016
International Conference on Networking and Network Applications, NaNA 2016,
vol. 1, pp. 303–306, 2016.
[15] F. Mola, "Analysis and Detection Mechanisms of SIM Box Fraud in The Case of Ethio Telecom," p. 76, 2017.
[18] Internet Society, "Artificial intelligence and machine learning: Policy paper," Apr. 2017.
[26] S. Haykin, Neural Networks and Learning Machines, 3rd ed. Pearson Education, 2009.
[28] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, 3rd ed. Morgan Kaufmann Publishers, 2011.