Model-Based Anomaly Detection in High-Dimensional DATA
Abstract: Anomaly detection is a growing research issue in several application domains. This paper attempts to provide a structured overview of anomaly detection research. A state of the art of anomaly detection techniques is then presented. A classification of methods is proposed, based on the type of dataset (Big DATA, data flow, graphs, time series, etc.), the application domain (fraud detection, intrusion detection, medical anomaly detection, etc.) and the approach considered (Deep Learning, statistical, classification, clustering based, etc.). We propose a multi-vision approach to case base representation for anomaly detection in high-dimensional data using Case-Based Reasoning.

Keywords: Anomaly detection, CBR, high-dimensional data, Big DATA, case base, Case-Based Reasoning.

I. INTRODUCTION

Various research fields and applications have addressed the problem of anomaly detection. It consists of detecting rare events or, more generally, observations that are outliers and differ from the majority of the data. These rare events are often called anomalies; they can be of various types and are encountered in different areas. In computer networks, an abnormal traffic pattern could indicate that a hacked computer is dispatching suspicious data to gain unauthorized access to computer systems [10]. In medical diagnosis, an abnormal brain MRI could reveal critical information on tumors [11]. In bank security, an abnormal credit card transaction could point to the theft of the credit card or of an identity [10]. Anomaly detection improves data quality by deleting or replacing anomalous data. In other cases, anomalies reflect an event and provide useful new knowledge. Anomaly detection is important because anomalies in data indicate important, and often critical, actionable information in a wide variety of application domains. Case-Based Reasoning is proposed as a framework to use in order to tackle our main target of anomaly detection in high-dimensional data and to position the proposed multi-vision model.

The focus of this paper is two-fold. First, we present a structured overview of anomaly detection research using Case-Based Reasoning in a smart environment. Second, a multi-vision case base model is proposed in order to overcome major challenges in a Big Data framework.

Our paper is organized as follows. The following section reviews the related research in anomaly detection methods. Section 3 presents the major CBR processes for anomaly detection. Section 4 explains the representation of the case base according to the multi-vision approach for anomaly detection in high-dimensional data. Finally, Section 5 contains the conclusion.

II. RELATED WORK
Authorized licensed use limited to: GITAM University. Downloaded on May 01,2025 at 05:44:23 UTC from IEEE Xplore. Restrictions apply.
Table 1. Comparison of 10 reviews
Reviews compared: Hodge and Austin (2004); Patcha and Park (2007); Chandola et al. (2009); Zhang (2013); Gupta et al. (2014); Souiden et al. (2016); Aggarwal (2017); Salehi and Rashidi (2018); Chalapathy and Chawla (2019); A. Blazquez et al. (2021)
Techniques
Statistical ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Clustering Based ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Nearest Neighbor Based ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Classification Based ✓ ✓ ✓ ✓
Regression ✓ ✓
Spectral ✓ ✓
Deep learning ✓ ✓ ✓
Data types
Data flow ✓ ✓ ✓ ✓ ✓
Time series ✓ ✓ ✓ ✓
Graphs ✓ ✓
Big DATA ✓ ✓ ✓
Applications
Cyber-Intrusion Detection ✓ ✓ ✓ ✓ ✓
Fraud Detection ✓ ✓ ✓
Anomaly detection interests many researchers and has been the subject of numerous works. Several methods have been proposed, each with its strengths and weaknesses. Patcha and Park [4] reviewed the methods used for intrusion detection. General reviews of existing techniques covering several approaches are given by Aggarwal [6] and Chandola et al. [3]. Gupta et al. [5] review the state of the art according to the type of data considered: temporal data such as time series, spatio-temporal data and data flows. Salehi and Rashidi [7] have also presented methods applicable to data flows. Table 1 summarizes the major reviews in the literature, identifying the anomaly detection techniques, types of datasets, and application areas covered in each of these reviews. The purpose of this section is to provide a complete state of the art by aggregating information on the different anomaly detection methods, datasets and application domains. A classification is proposed in order to recommend anomaly detection methods according to the type of data available (Big DATA, data flow, graphs, etc.), with relevant bibliographic references (Table 1).

Anomaly detection is defined in this paper as the detection of any situation inconsistent with normal resident behavior and daily routines [12]. Several approaches have been used to allow systems to interpret and reason with previous situations. Among these approaches, data mining techniques such as neural networks, decision trees, and machine learning are widely used. Other approaches are based on probabilistic reasoning tools, such as hidden Markov models, to reason with different types of situations and to overcome uncertainty problems. However, despite their success in detecting abnormal situations, these approaches generally struggle with anomaly detection in high-dimensional data, which has been a long-standing problem [13].
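The difficulty with high-dimensional data can be illustrated with a small synthetic experiment. The sketch below is not taken from any of the reviewed methods: the data, the helper names (`make_data`, `norm`, `contrast`) and the contrast score are assumptions made for illustration. A single planted anomaly deviates strongly on one feature; its deviation is measured once on that feature alone and once over all features.

```python
import math
import random
import statistics


def make_data(n_normal, n_dims, rng):
    """Synthetic data (an assumption): standard-normal features, one planted anomaly."""
    normal = [[rng.gauss(0.0, 1.0) for _ in range(n_dims)] for _ in range(n_normal)]
    anomaly = [rng.gauss(0.0, 1.0) for _ in range(n_dims)]
    anomaly[0] = 6.0  # the anomaly deviates on feature 0 only
    return normal, anomaly


def norm(point, dims):
    """Euclidean norm of a point restricted to the given feature indices."""
    return math.sqrt(sum(point[i] ** 2 for i in dims))


def contrast(normal, anomaly, dims):
    """Standard deviations by which the anomaly's norm exceeds the normal points' norms."""
    scores = [norm(p, dims) for p in normal]
    return (norm(anomaly, dims) - statistics.mean(scores)) / statistics.stdev(scores)


rng = random.Random(0)  # fixed seed so the run is repeatable
N_DIMS = 300
normal, anomaly = make_data(200, N_DIMS, rng)

z_sub = contrast(normal, anomaly, dims=[0])             # the informative feature only
z_full = contrast(normal, anomaly, dims=range(N_DIMS))  # all features
print(f"contrast on feature 0: {z_sub:.1f}, on all {N_DIMS} features: {z_full:.1f}")
```

Under these assumptions, the deviation that stands out on the single informative feature is diluted once distances are computed over all features, which is the effect that hampers distance-based detectors on high-dimensional data.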
The next section presents the general Case-Based Reasoning framework that we propose to use in order to tackle our main target of anomaly detection in high-dimensional data and to position the proposed multi-vision model.

III. CASE-BASED REASONING

We present in this section the main Case-Based Reasoning processes for the detection of anomalies. A Case-Based Reasoning system is a combination of processes and knowledge containers, which preserve and exploit past experiences to solve future abnormal situations.

1. Representation and formalization

Case-Based Reasoning, considered as a reasoning approach, solves new problems by adapting previous successful solutions to similar problems. As a result, the set of cases can be enriched with new problems over time. CBR techniques are widely used in smart environments for anomaly detection [14]. To simplify the presentation, we use the CBR model presented by Afouba et al. [15]. The knowledge structures are: the indexing vocabulary, the case base, similarity metrics, and adaptation knowledge [16].

a) Case Definition

b) Structure of a case

First, a case in a Case-Based Reasoning system is generally composed of two disjoint spaces: the space of problems and the space of solutions. The problem space describes the objectives to be achieved by the solution. The solution space groups together the description of the solution provided by the reasoning, its justification, its evaluation and the steps that led to this solution. Two types of cases can be distinguished: source and target cases.

- The source case is the one in which the "problem" and "solution" parts are filled in. It is thus a case that will inspire the system to solve a new problem. The source case may also contain another part called "quality information", which contains information on how to use the case in the system;

- The target case is the one that bears the problem and whose solution part is not filled in.

In our system, a case is represented as a set of features which are grouped into categories of parameters, as follows:
b) Case Base
c) Measures of similarity
d) Adaptive Knowledge
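The case structure and the knowledge containers listed in this section (case base, similarity measure, adaptation knowledge) can be sketched as follows. This is a minimal illustration under stated assumptions, not the model of the cited works: the class `Case`, the feature names `detour_km` and `delay_min`, the inverse-distance similarity and the copy-based `reuse` step are all hypothetical choices.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Case:
    problem: Dict[str, float]        # problem space: feature name -> value
    solution: Optional[str] = None   # solution space; None for a target case
    quality: Optional[str] = None    # optional "quality information"

    @property
    def is_source(self) -> bool:
        """A source case has both its problem and solution parts filled in."""
        return self.solution is not None


def similarity(a: Case, b: Case) -> float:
    """Inverse Euclidean distance over shared features (an assumed metric)."""
    shared = a.problem.keys() & b.problem.keys()
    if not shared:
        return 0.0
    dist = math.sqrt(sum((a.problem[k] - b.problem[k]) ** 2 for k in shared))
    return 1.0 / (1.0 + dist)


def retrieve(case_base: List[Case], target: Case) -> Case:
    """Return the source case most similar to the target case."""
    sources = [c for c in case_base if c.is_source]
    if not sources:
        raise ValueError("the case base contains no source case")
    return max(sources, key=lambda c: similarity(c, target))


def reuse(source: Case, target: Case) -> Case:
    """Simplest possible adaptation (an assumption): copy the source solution."""
    note = f"copied from a source case with similarity {similarity(source, target):.2f}"
    return Case(problem=target.problem, solution=source.solution, quality=note)


# Hypothetical case base: two solved (source) cases.
case_base = [
    Case({"detour_km": 0.2, "delay_min": 1.0}, solution="normal"),
    Case({"detour_km": 5.0, "delay_min": 20.0}, solution="anomaly"),
]

# A target case: the problem part is known, the solution part is empty.
target = Case({"detour_km": 4.8, "delay_min": 19.0})
solved = reuse(retrieve(case_base, target), target)
print(solved.solution, "-", solved.quality)
```

Once validated, the solved case could be appended to `case_base`, which is how the case base grows over time in this sketch.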
This section presents the unsolved detection challenges in complex anomaly data. It also explains the case-base representation with multiple visions.

Fig.2. Representation of case base

A huge amount of real-time data is generated daily and stored in case databases, where we need to visualize and manipulate high-dimensional data to detect anomalies. Detecting anomalies in this type of database is therefore complicated. Furthermore, in a low-dimensional space, anomalies often exhibit evident abnormal characteristics, but in a high-dimensional space, they become hidden and unnoticeable. Detecting anomalies in a reduced lower-dimensional space spanned by a small subset of original features or newly constructed features is a straightforward solution. This is why it is crucial to represent the case database according to several criteria (a time-indexed database, a user database, a user-group database and a database linked to the infrastructure generated from our surveillance model) to reduce the dimension of the data, as defined in Fig.2.

Fig.2. Model of infrastructure
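The multi-vision representation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the record fields (`user`, `group`, `hour`, `node`, `travel_min`), the sample values and the helpers `build_view` and `flag_outliers` are hypothetical, and the modified z-score based on the median absolute deviation is a stand-in detector chosen for simplicity.

```python
from collections import defaultdict
from statistics import median

# Hypothetical case records from a road-traffic application.
CASES = (
    [{"user": "u1", "group": "g1", "hour": 8, "node": "A", "travel_min": t}
     for t in (12.0, 13.0, 12.0, 14.0, 13.0, 41.0)]
    + [{"user": "u2", "group": "g2", "hour": 18, "node": "B", "travel_min": t}
       for t in (30.0, 31.0, 29.0, 32.0)]
)


def build_view(cases, key):
    """Group the case base by one criterion (user, group, hour or node)."""
    view = defaultdict(list)
    for case in cases:
        view[case[key]].append(case)
    return dict(view)


def flag_outliers(cases, feature, threshold=3.5):
    """Flag cases whose modified z-score (median/MAD based) exceeds the threshold."""
    values = [c[feature] for c in cases]
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread: nothing can be ranked as abnormal
    return [c for c in cases if 0.6745 * abs(c[feature] - med) / mad > threshold]


# One view ("vision") of the same case base per criterion.
views = {key: build_view(CASES, key) for key in ("user", "group", "hour", "node")}

global_flags = flag_outliers(CASES, "travel_min")
user_flags = flag_outliers(views["user"]["u1"], "travel_min")
print("flagged in the whole case base:", [c["travel_min"] for c in global_flags])
print("flagged in the view of user u1:", [c["travel_min"] for c in user_flags])
```

In this toy data the 41-minute trip is unremarkable in the whole case base, because another user routinely takes about 30 minutes, but it stands out in the reduced view of user u1. Each view restricts the detector to the cases relevant to one type of anomaly.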
... anomaly detection in high-dimensional data, since we can act on the relevant reduced database. To put this vision into practice, consider our application (traffic in a road system): if a user has always taken a path different from the optimal path, there is an anomaly; a priori it is a mono-user anomaly. Hence, depending on the type of anomaly, we focus on the relevant case base; in the example above, the user case base is the one to search for the abnormal case. The nature of the desired anomaly is an important aspect of an anomaly detection technique. Anomalies can be classified into the following categories: point, contextual and collective anomalies. It is important to properly identify their type and then choose the algorithm most suitable for their detection. The type of anomalies considered depends on the problem. It is also possible to want to detect several types of anomalies at once, which makes the problem more complex and the choice of the detection algorithm more complicated.

Fig.3. Types of anomalies: point, contextual and collective; user, infrastructure and user-group anomalies

Point Anomalies: A point anomaly occurs when an individual data instance can be considered abnormal in comparison with the rest of the data. This is the most basic type of anomaly, and it is the subject of the majority of anomaly detection research (for example, a behavior anomaly in a mono-user system).

... path. We can clearly see that a user taking this path cannot, by itself, be an anomaly.

Anomaly detection concerns not only anomalies at the level of the case bases (users, user groups) but also anomalies at the model level (the infrastructure). The case base and the model are linked. The anomaly detection system is checked by the model, which is based on external knowledge. If we have information from the infrastructure that defines an incident, for example a node that will be closed because of road works, we cannot interpret this information as an anomaly, whereas without this knowledge it would obviously be one. Thus, the anomaly detected at the model level is observed in the case database.

IV. CONCLUSION

Since anomaly detection cuts across many fields of data processing, different methods are proposed according to the constraints of each application domain and data type. In this paper, we have proposed a review and a general framework for applying the existing methods and those adapted to each field of application and main type of dataset. Furthermore, we presented the various ways in which anomaly detection problems have been formulated in the literature. This work introduces a conceptual model of anomaly detection in high-dimensional data and proposes, as a solution for big data applications, to consider the dataset (i.e. the case base) following different visions: individual user, group of users, and time of occurrence of instances. This multi-vision approach makes it possible to tackle anomaly detection efficiently and to exploit different sources of knowledge, for instance the knowledge source summarizing the infrastructure resources of the node-based structure considered in our application.
REFERENCES