Anomaly Detection Using Machine Learning
Abstract
Outliers are studied in many fields of research and across various domains. In this
paper, we analyse and bring together various outlier detection techniques, with the aim of
attaining a better understanding of the different approaches to outlier detection research.
The goal of this project was to detect outliers in housing prices in Melbourne (Australia)
using statistical and machine learning prediction models. All models were trained with
unsupervised learning. The models used were Isolation Forest, Elliptic Envelope,
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and Local
Outlier Factor (LOF). The results of each model were visualised on multivariate data to
highlight the detected outliers. Outlier detection was performed on both univariate and
multivariate data. A dummy data frame with 1000 random observations and 4 features
was created to apply the parametric methods to univariate and multivariate data.
1. Introduction
The past twenty years have been profoundly dedicated to intrusion detection within the
information technology world, both because intrusions violate the security policies that
protect a system's information and because of the challenge of identifying the processes
used to carry out intrusions. Research on intrusion detection has gained tremendous
attention and has produced a comprehensive body of work. However, the research
community is still confronted by severe problems: it remains difficult to reduce the
number of false alerts while also identifying unknown attack patterns, and this trade-off
has stayed unresolved for some time. Nevertheless, research has shown that a solution to
this problem lies in anomaly detection, also known as outlier detection. Outlier detection
has come to be key in intrusion detection research, because an anomaly, anything against
the norm, can indicate the presence of unintended or intended faults, induced attacks, or
any other form of intrusion. Outlier detection is mainly based on two families of machine
learning techniques: supervised outlier detection and unsupervised outlier detection. To
shed light on how outlier detection is carried out, this paper reports on the machine
learning techniques used, both supervised and unsupervised.
The rest of this paper is organised as follows. Section II discusses previous work and
Section III discusses the nature of the input data. The proposed work is described in
Section IV. The implementation results are given in Section V, followed by the
conclusion and future scope in Section VI.
2. Literature Review
Supervised machine learning and various statistical prediction models discussed in [1]
can be used to predict housing prices in Melbourne (Australia); these include three
Linear Regression based models and two decision tree based models. The data was first
cleaned and Exploratory Data Analysis (EDA) was performed to gain insight into it.
"Price" was considered the main variable/feature used by the prediction models for price
prediction. RIDGE Linear Regression (MAPE 25.5%) gave the best result among the
three Linear Regression models used. The Random Forest model (MAPE 9.5%)
performed best among all the models; its results improved up to 300 trees, after which
there was no major difference. EDA was helpful in data cleaning and in identifying the
main/target variable, i.e. "Price". RIDGE Regression and the LASSO method help to
reduce the variance and the sample error, while the decision tree model is useful for
interpretability. The only problem faced with this model was in predicting prices above a
particular threshold.
Location plays a major role in predicting house prices compared to other in-house
features. Various locations and their related data were gathered. The house data and the
prediction models were partitioned, and a Multi-Task Learning (MTL) model was used
for each partition, with each partition aligned to a task. Different MTL-based methods
were used to find the relatedness between the aligned tasks. Experimental evaluations
were performed, and the results show the superiority of MTL-based methods over other
methods. The impact of the task definitions was analysed along with MTL-based method
selection, and the prediction performance of the MTL-based methods was shown to
exceed that of the other methods.
In [3], Active Anomaly Detection is presented as a new framework for anomaly
detection whose cost is the same as that of unsupervised anomaly detection while
producing better results. It is shown that a prior should be assumed in order to obtain
guarantees on the performance of the anomaly probability distribution. A new layer can
be added to an unsupervised anomaly detection model to turn it into an active anomaly
detection method, and this can yield better results on any anomaly detection dataset.
According to [4], the degree of dispersion between an object and its neighbours is often
ignored by some local outlier detection approaches. These approaches are less efficient
because they compute the local outlier factor over the entire dataset, which contains only
a small amount of outlier data. The Local Deviation Coefficient (LDC) uses the
distribution of the object and its neighbours. For data preprocessing, non-outlier data are
removed by Rough Clustering based on Multi-Level Queries (RCMLQ), which reduces
the amount of data passed to local outlier detection. It is used alongside other existing
local outlier detection methods to improve their efficiency. LDC helps to reveal unusual
or abnormal situations in scattered datasets.
Threshold-based alarms are used for detecting anomalies on critical metrics or health-
probing requests, as discussed in [5]. Machine learning classifiers can be used to predict
the status of the system's health. A Recurrent Neural Network with Long Short-Term
Memory (LSTM) was applied to a real-world dataset and was found to be more effective
in detecting system health issues and anomalies than other ML classifiers. The area
under the precision-recall curve was 0.44, and 70% of the anomalies were automatically
detected at the default threshold. The false positive rate was 4%, even though the
precision was found to be low (31%).
According to [6], outliers describe patterns in data that do not comply with the broadly
anticipated behaviour. Real outliers are found in data because of malicious activities
such as credit card fraud, system breakdown, and cyber intrusion. At the same time, they
are vital to the relevant analyst, and the real-life relevance of outliers is a primary
motivation for outlier detection. Noise removal and noise accommodation both deal with
unwanted noise, which can be regarded as a hindrance to data analysis and is therefore
not required in the data. Noise removal is driven by the need to remove unwanted
objects before conducting data analysis. Since outlying observations are of interest, the
detection of outliers becomes a critical point, and a variety of factors determine how an
outlier detection problem is formulated:
1. Point outliers
A point outlier is an individual data instance that is anomalous with respect to the rest of
the data. It is the most common kind of outlier and can occur in any data set in which
the data instances are related.
2. Contextual outliers
Some data instances are rare with respect to one particular context while appearing
normal with respect to another setting; such data are referred to as contextual outliers,
and this category also includes time series data [8].
3. Collective outliers
If a collection of related data instances is anomalous with respect to the entire data set,
even though the individual instances may not be anomalous by themselves, the collection
is classified as a collective outlier. Collective outliers can occur in sequence data, spatial
data, and even graph data [8].
4. Erroneous outliers
When an observation is incorrectly recorded as an outlier, due to some inherent
difficulty or catastrophic failure, it is a mistake outlier, also defined as an illusive outlier
[9]; such outliers distort the interpretation of the data.
Further, in [10], anomalies are detected in live streaming data in the case of errors, for
instance unwanted alarms. False alarms are detected using effective algorithms in order
to avoid the unnecessary workload and security response that follow when an alarm
rings without any emergency situation.
4. Proposed Method
In our dataset (Melbourne housing prices), there are 13 features, and using those features
a price prediction is made to classify each observation as normal or anomalous with
various algorithms. Each algorithm uses a different technique to give the best output,
and after analysing the different results we obtain both classes of data (normal points
and outliers). The "Melbourne housing prices" data set was used for non-parametric
methods on univariate and multivariate data. It consists of 34,858 observations and 21
features. First, the missing values in each column were filled with that column's median,
and specific columns were visualised using histograms. The target variables from the
given data set were "Rooms" and "Price". The models based on DBSCAN and LOF
gave quite satisfactory results, which were visualised with the outliers duly marked in
the plots.
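The median-fill preprocessing step described above can be sketched as follows. This is an illustrative sketch, not the paper's actual code: the small DataFrame is a hypothetical stand-in for the Melbourne data, with only the "Rooms" and "Price" columns named in the text and invented values.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Melbourne housing data: a few rows with
# missing values. Column names follow the text; values are invented.
df = pd.DataFrame({
    "Rooms": [2, 3, np.nan, 4, 3],
    "Price": [850000.0, np.nan, 1200000.0, 2100000.0, 950000.0],
})

# Fill the missing values in each numeric column with that column's median.
df = df.fillna(df.median(numeric_only=True))
print(df)
```

The same call generalises to all 21 columns of the real data set, since `df.median(numeric_only=True)` returns one median per numeric column.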
Figure 1. System Architecture (Melbourne dataset → data preprocessing → outlier
detection)
4.3.1. Standard Deviation: If the distribution of the data is roughly normal, then about
68% of the values lie within one standard deviation of the mean, about 95% within two
standard deviations, and about 99.7% within three standard deviations. So if a data point
lies more than three standard deviations from the mean, it is very likely to be an
anomaly or outlier. This method applies to one-dimensional data.
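The three-standard-deviation rule above can be sketched as follows, using synthetic one-dimensional data (not the paper's dataset) with a few injected extreme values:

```python
import numpy as np

rng = np.random.default_rng(0)
# Roughly normal synthetic data plus a few injected extreme values.
data = np.concatenate([rng.normal(0.0, 1.0, 1000), [8.0, -9.0, 10.0]])

mean, std = data.mean(), data.std()
# Flag points lying more than three standard deviations from the mean.
outliers = data[np.abs(data - mean) > 3 * std]
print(outliers)
```

The injected extremes are flagged; an occasional genuine normal draw beyond three standard deviations may also appear, which is exactly the ~0.3% tail the rule describes.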
4.3.2. Box plots: A box plot is a graphical representation of the data using quartiles. It is
the simplest, yet a very effective, method. The three quartiles divide the sorted data into
four intervals. After finding the minimum and maximum whisker values, we can identify
the outliers that lie below the minimum range or above the maximum range, where the
interquartile range is
IQR=Q3-Q1
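The box-plot rule can be sketched as follows; the conventional whiskers at Q1 − 1.5·IQR and Q3 + 1.5·IQR are assumed, and the small data set is invented for illustration:

```python
import numpy as np

# Small synthetic sample with two obvious extremes (102 and 107).
data = np.array([10, 11, 12, 12, 12, 12, 13, 13, 14, 14, 15, 17, 19, 102, 107])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                  # IQR = Q3 - Q1
lower = q1 - 1.5 * iqr         # lower whisker (minimum range)
upper = q3 + 1.5 * iqr         # upper whisker (maximum range)

# Points below the lower whisker or above the upper whisker are outliers.
outliers = data[(data < lower) | (data > upper)]
print(outliers)
```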
4.4. Non-parametric (Univariate)
4.4.1. Isolation Forest: Isolation Forest is an ensemble method used to find outliers that
differ markedly from the rest of the data. Traditionally, outliers were detected by
identifying the data that lies in the outer region and deviates most from the normal
pattern; Isolation Forest takes a different approach in which there is no profiling of
normal instances and no point-based distance calculation. It builds random trees over the
given data set, which in turn provide an anomaly score indicating how isolated each
object is in the resulting structure. First, a distribution is generated from the given data
set and compared with the anomaly score to find the highlighted regions where outliers
are most likely.
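A minimal sketch of this idea, assuming scikit-learn's IsolationForest as the implementation (the paper does not name its library) and a synthetic two-dimensional stand-in for the data:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Synthetic stand-in: a dense 2-D cluster plus three far-away points.
normal = rng.normal(0.0, 1.0, size=(300, 2))
anomalies = np.array([[8.0, 8.0], [-9.0, 7.0], [10.0, -8.0]])
X = np.vstack([normal, anomalies])

# contamination is the assumed fraction of outliers; each random tree
# contributes to an anomaly score measuring how easily a point is isolated.
model = IsolationForest(n_estimators=100, contamination=0.02, random_state=0)
labels = model.fit_predict(X)        # +1 = inlier, -1 = outlier
scores = model.decision_function(X)  # lower score = more anomalous
print((labels == -1).sum())
```

Points that are isolated in few random splits receive low scores and are labelled -1, matching the "highlighted region" intuition above.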
Figure 5. After plotting the contour, the objects in the denser region inside the contour
are normal data, while those outside the contour are anomalies
4.6.1. DBSCAN Clustering: DBSCAN is a density-based algorithm that can be used to
find outliers effectively, built on a few concepts about data points.
Concepts:
1. Core point: governed by two hyperparameters: the minimum number of samples
(min_samples) that can form a cluster around a core point, and eps, the maximum
distance between two samples for them to be considered part of the same
neighbourhood.
2. Border point: a point that belongs to a cluster but lies far from its centre, within
eps of a core point.
3. Noise point: an exception that does not belong to any cluster; such points can be
analysed as potential anomalies.
In this algorithm every data point is visited individually and labelled as a core, border,
or noise point based on the following conditions.
Conditions: For a core point, a neighbourhood of radius eps is drawn around the visited
point; if at least the minimum number of points (here, 3) falls inside it, the visited point
is a core point. A border point has fewer points (here, a minimum of 2) in its
neighbourhood but is a neighbour of a core point. A noise point satisfies neither
condition and has no such neighbour.
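The core/border/noise labelling above can be sketched with scikit-learn's DBSCAN, which marks noise points with the label -1. The data and the eps/min_samples values here are illustrative, chosen to suit the synthetic clusters rather than the Melbourne data:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
# Two tight clusters plus two isolated noise points.
cluster1 = rng.normal(0.0, 0.3, size=(50, 2))
cluster2 = rng.normal(5.0, 0.3, size=(50, 2))
noise = np.array([[2.5, 10.0], [-6.0, -6.0]])
X = np.vstack([cluster1, cluster2, noise])

# eps: neighbourhood radius; min_samples: points required for a core point.
db = DBSCAN(eps=1.0, min_samples=5).fit(X)
# Points labelled -1 joined no cluster: DBSCAN's noise points, i.e. outliers.
outlier_mask = db.labels_ == -1
print(outlier_mask.sum())
```

Core and border points share their cluster's label, while the two isolated points receive -1, which is the condition the paper uses to mark anomalies.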
4.6.2. LOF (Local Outlier Factor): LOF is a density-based method and a very effective
way to find anomalies by forming a contour. Local outliers are points whose density
differs from that of their neighbourhood in one area of the dataset; they are distinct from
global outliers, but both can be detected by considering relative density, which also
allows outliers to be detected in skewed datasets. By applying LOF, anomalies are
therefore shown in a different colour and lie outside the contour.
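A minimal sketch of LOF, assuming scikit-learn's LocalOutlierFactor as the implementation, on a synthetic stand-in for the data. The parameter values n_neighbors=50 and contamination="auto" mirror the settings reported in the experiments:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(7)
# Dense inlier cloud plus three locally isolated points.
X = np.vstack([
    rng.normal(0.0, 0.5, size=(200, 2)),
    [[4.0, 4.0], [-4.0, 4.5], [5.0, -4.0]],
])

# contamination="auto" uses scikit-learn's default score threshold.
lof = LocalOutlierFactor(n_neighbors=50, contamination="auto")
labels = lof.fit_predict(X)              # +1 = inlier, -1 = outlier
factor = -lof.negative_outlier_factor_   # larger factor = locally more outlying
print((labels == -1).sum())
```

A factor close to 1 means the point's local density matches its neighbours'; the three isolated points get much larger factors and are labelled -1.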
5. Experiment Results
After analysing all the algorithms separately, we obtained good results; outliers were
detected by the various algorithms using the different methods given below:
5.1. LOF
LOF (Local Outlier Factor) has the capability to find local outliers as well as global
outliers, whereas other algorithms can easily find global outliers but are not accurate for
local outliers. It is also a density-based algorithm, similar to DBSCAN, but it forms a
contour where DBSCAN does not. It therefore gives its results based on the
neighbouring points and whether those points lie in the contour or not. In the LOF
method, the number of neighbours to be considered (n_neighbors) was taken to be 50
and the parameter "contamination" was set to "auto", which uses the default threshold.
The visualised result of this method is given below.
Figure 6. Points outside the contour, in red, are outliers; points inside the contour, in
white, are normal points
5.2. DBSCAN
DBSCAN is a density based based method which detect the anomaly with help of
neighbouring point, if the density of point is significantly different from its neighbour
then it is said to be outlier. Moreover in DBSCAN method, the parameter „eps‟ (the
maximum distance between two samples for one to be considered as in the neighbourhood
of the other) is given the value of 3.0 and the number of samples, in a neighbourhood for
a point to be considered as core point („min_samples‟), is taken as 10. The visualised
result of this method is given below.
Figure 7. Points in red are outliers, which do not satisfy any condition; blue points are
normal points
Figure 8. Outlier regions, shown in pink, are found in the areas of low probability based
on the anomaly score
References
[1] C.C. Aggarwal, Outlier Analysis, Springer Publishing Company, Incorporated, 2nd
edition, (2016).
[2] B. Baingana and G. Giannakis, “Joint community and anomaly tracking in dynamic
networks”, (2016).
[3] G. Gao, Z. Bao, J. Cao, A. Qin, T. Sellis and Z. Wu, “Location-centered house price
prediction: A multi-task learning approach”, (2019).
[4] M. Gupta, J. Gao, C.C. Aggarwal and J. Han, “Outlier detection for temporal data: A
survey”, IEEE Transactions on Knowledge and Data Engineering, 26(9), (2014), 2250–
2267.
[5] F. Huch, M. Golagha, A. Petrovska and A. Krauss, “Machine learning-based run-time
anomaly detection in software systems: An industrial evaluation”, 2018 IEEE Workshop