Online Mining for Association Rules and Collective Anomalies in Data Streams
Abstract—When analyzing streaming data, the results can depreciate in value faster than the analysis can be completed and the results deployed. This is certainly the case in the area of anomaly detection, where detecting a potential problem as it is occurring (or in its early stages) can permit corrective behavior. However, most anomaly detection methods focus on point anomalies, while in practice many fraudulent behaviors can be detected only through collective analysis of sequences of data. Moreover, anomaly detection systems often stop at detecting anomalies; they typically do not provide information about how the features (attributes) of anomalies relate to each other or to those in normal states. The goal of this research is to create a distributed system that allows for the detection of collective anomalies from streaming data, and to provide a richer context of information about the anomalies besides their presence. To accomplish this, we (a) re-engineered an online sequence anomaly detection algorithm and (b) designed new algorithms for targeted association mining to run in a streaming, distributed environment. Our experiments, conducted on both synthetic and real-world data sets, demonstrated that the proposed framework is able to achieve near real-time response in detecting anomalies and extracting information pertaining to the anomalies.

Keywords—association mining; anomaly detection; streaming data; online processing; collective anomalies

I. INTRODUCTION

The objective of this paper is to introduce an online mining framework for data streams designed to detect collective anomalies and extract richer contextual information. To accomplish this, we first developed a scalable association mining algorithm for processing large volumes of data over time frames. Second, we developed distributed batch processing algorithms that store and model high-volume historical data. The model captures the existing sequence patterns within the historical data. Third, we developed online distributed stream processing algorithms that continuously compare incoming data with the model and evaluate them to detect collective anomalies. Finally, the methods are integrated, allowing both the rapid, online detection of anomalies and the extraction of association rules about the anomalies.

When mining rules, Association Rule Mining usually makes multiple passes over the entire data set [1][3]. One solution to reduce the number of passes requires the user to specify part of the rule (query). This solution, entitled targeted association mining, uses an in-memory tree structure called the Itemset Tree, which requires, at most, one pass through the tree. While computationally fast, a query could potentially traverse the entire tree before the system is able to move on to the next query [14]. By distributing the tree into multiple sub-trees, a query could be executed concurrently on the sub-trees and return a list of rules within a certain threshold with significantly less processing time, allowing queries to have less of a wait time before their execution.

Currently, the majority of the work on sequence-based anomaly detection addresses batch processing. Only a small number of these methods handle direct online detection of outliers [4]. Online anomaly detection is crucially important in many cases because it enables immediate action to be taken in order to reduce or prevent losses; examples are credit card fraud and health monitoring [4]. Collective online anomaly detection is far more challenging and has higher computational complexity, which potentially makes fully online detection unfeasible. For our real-time responsive method of anomaly detection, we propose a strategy of utilizing two models: (1) a current model/snapshot of the current data and (2) a historical data model based upon previous data in the stream. The current model is then compared against the streaming data as it arrives. Since the online comparison of incoming data is much easier and only involves a small subset of the data, we can achieve near real-time response.

Our system combines both anomaly detection and targeted association rule mining into one online knowledge discovery system. We utilize this system to generate association rules based upon anomalous sequences discovered by our detection model. The system will (1) process a streaming data set, one time window at a time, (2) find anomalous sequences using a stable model built from historical data and an off-line retrained model from a much shorter but more current period, and (3) generate a list of strong association rules which describe the anomalous sequences. These rules will allow insight into not only the why/how of the anomalous sequences, but also outline possible cause/effect involved in the sequences and
2371
Authorized licensed use limited to: University of Queensland. Downloaded on December 11,2023 at 09:09:09 UTC from IEEE Xplore. Restrictions apply.
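The distributed sub-tree strategy sketched in the introduction can be illustrated in miniature. The sketch below is a simplification under stated assumptions: plain per-partition transaction lists stand in for the Itemset Tree sub-trees, threads stand in for cluster workers, and all function names are ours, not the paper's.

```python
# Simplified stand-in for distributed targeted queries: transactions are split
# into partitions (the paper's sub-trees), each partition is queried in its own
# worker, and the partial supports are merged.
from concurrent.futures import ThreadPoolExecutor

def partition_transactions(transactions, num_partitions):
    """Assign each transaction to a partition (hash of its first item)."""
    parts = [[] for _ in range(num_partitions)]
    for t in transactions:
        parts[hash(t[0]) % num_partitions].append(frozenset(t))
    return parts

def support_in_partition(partition, query):
    """Count the transactions in one partition containing the queried itemset."""
    q = frozenset(query)
    return sum(1 for t in partition if q <= t)

def targeted_support(transactions, query, num_partitions=4):
    """Query all partitions concurrently and merge the partial counts."""
    parts = partition_transactions(transactions, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = pool.map(support_in_partition, parts, [query] * num_partitions)
    return sum(partials)

txns = [("a", "b"), ("a", "c"), ("b", "c"), ("a", "b", "c")]
print(targeted_support(txns, ("a", "b")))  # 2: two transactions contain {a, b}
```

Because every partition is examined, the merged count is exact regardless of how transactions are assigned; the gain is that no single query has to walk one monolithic structure.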
Figure 1. Sliding window technique

W(i+1), and W(i+2) and contain overlapping information to allow a smoother snapshot of the data to be gathered.

D. Collective Anomaly Detection

Outliers can be defined as those observations that deviate from normal observations to an extent that arouses suspicion that they were generated by a dissimilar mechanism [1]. These events can also be referred to as anomalies; we will use the terms outliers and anomalies interchangeably. The results of outlier detection usually translate to significant or critical real-life events, e.g. network intrusions, fraudulent credit card transactions, or failures in operational systems. Outliers can be classified into three categories: (1) point outliers, (2) contextual outliers, and (3) collective outliers [9]. Point, or global, outliers are data points which significantly differ from the current norm of the data. Contextual/conditional outliers deviate from the surrounding context, referred to as contextual attributes. Collective outliers are similar to contextual outliers, except that we don't consider individual points; instead, we consider a collection of points that has common relationship(s). Examples of related collections of objects include sub-sequences for temporal data, local regions for spatial data, and connected sub-graphs for graph data [4]. Sequence-based outlier detection is an example of collective outlier detection, where the temporal order of the points is their relationship. Our algorithm focuses solely on sequence-based outlier detection for our collective anomaly detection. Sequence data is common in many applications such as network intrusion detection, bioinformatics, flight safety, health monitoring, and web site navigational click sequences [4][5].

In order to implement a real-time response for anomaly detection, we adopted the strategy of building a current model based upon a previously built historical data model. The historical model is updated as concept drifts so as to remain current. The computationally expensive job of building the updated model is done off-line. We then compare this current model against incoming streams of data. The online comparison of incoming data is much easier as it involves only a small subset of the data represented within a sliding window. This way we can achieve a near real-time response. For our system, we chose to use a Markovian model because it is unsupervised. This means it does not require the labels of the training data to be known in advance, thereby making it suitable for detecting new and unexpected varieties of outliers. It also can be applied locally on our sliding window subsets of data for processing of online streaming datasets [7]. The challenges for this online approach are (1) scalability and (2) concept drift. The scalability challenge is the potentially escalating computational cost that could occur as the number of incoming streams grows higher. The concern is that the system will become unable to process all of the data simultaneously and generate labels in real time. This lag, in turn, could result in a cascading backlog, resulting in system failure. In our system, we address the scalability issue by utilizing the horizontally scalable MapReduce-based distributed computing platform. Concept drift occurs when the models representing the current data are no longer considered valid. If concept drift is not addressed, an increase in the number of false alarms and/or missed true anomalies will occur. For the concept drift issue, our system proposes a simple approach in which we periodically run an off-line retraining of the model. If the difference between the newly trained model and the existing historical model exceeds a pre-defined threshold (implying concept drift), the new model will replace the existing model.

E. Technology Selection

Our framework has both a batch processing component for off-line model training, and a streaming component for on-line anomaly detection. The alternatives considered for batch processing were Hadoop's MapReduce and Spark [23]. Both are open source projects, and provide scalable distributed processing based on MapReduce with high fault-tolerance. We selected Spark because it has faster processing due to its capability to cache results of previous steps in memory instead of on disk. Considering the streaming component, we evaluated both Storm [19] and Spark Streaming. We selected Spark Streaming because of the ability to smoothly integrate it with the batch component.

III. PROPOSED APPROACH

The proposed approach is a materialization of the general strategy explained above that is also capable of handling the scalability challenge by using new distributed MapReduce-based technologies.

A. Parallel Itemset Trees Using Spark

Combining Spark Core and Spark Streaming provides our system with an efficient platform for high-speed data streaming and processing with fault tolerance. Our Spark application will create multiple Itemset Tree structures, which will initially reside in the driver program. The system begins with continuous streaming data handled via the executors. These executors will also update the trees as the data changes. For every batch of data, the executor will get a tree from the driver, run the Tree Insertion Algorithm to insert the itemsets into the tree, and then return the tree back to the driver program. We have defined a parameter, called "level
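The periodic retraining policy described above can be sketched as follows. The transition-matrix representation, the average-absolute-difference distance, and the threshold value are illustrative assumptions; the paper only states that a pre-defined threshold on the model difference triggers replacement.

```python
# Concept-drift check: compare a freshly retrained Markov model against the
# historical one, and replace the historical model only when they differ enough.
def model_distance(old, new):
    """Mean absolute difference of transition probabilities over all state pairs."""
    total, count = 0.0, 0
    for i in set(old) | set(new):
        for j in set(old.get(i, {})) | set(new.get(i, {})):
            total += abs(old.get(i, {}).get(j, 0.0) - new.get(i, {}).get(j, 0.0))
            count += 1
    return total / count if count else 0.0

def maybe_replace(historical, retrained, threshold=0.1):
    """Return (model to keep, whether drift forced a replacement)."""
    if model_distance(historical, retrained) > threshold:
        return retrained, True
    return historical, False

old = {"A": {"A": 0.9, "B": 0.1}}
new = {"A": {"A": 0.5, "B": 0.5}}
model, replaced = maybe_replace(old, new, threshold=0.1)
print(replaced)  # True: mean |difference| is 0.4, above the 0.1 threshold
```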
Figure 2. Itemset Tree Parallelism
of i:

    P_ij = Frequency(ij) / Frequency(i)

    μ = ( Σ_{i=1}^{w−1} Miss-Probability(item_i, item_{i+1}) ) / (w − 1)

The outlierness score μ is measured, online, as the average miss-probability over the length of the sliding window, w − 1. The label is anomalous if μ ≥ t, where t is a user-defined threshold, and normal otherwise. Alarms can be issued upon detection of anomalous instances.

The model builder takes the sequenced data as input and produces the transition matrix as output. It does a single scan over the data and updates two frequency arrays. The first array is a one-dimensional array containing the frequencies of all elements, Frequency(i), initialized to zeros and incremented upon each occurrence of item i. The second array is a two-dimensional array representing Frequency(ij), initialized to zeros and incremented upon each occurrence of item i followed by item j. At the end of the scan, each row of the matrix is divided by the corresponding element in the one-dimensional frequency array so that each element represents the probability p(ij). The training stage is simple to parallelize and implement on Spark.

2) Detection Stage: The detection stage is an online process which takes as input a stream of time-stamped transactional data. This stage continuously receives the incoming stream and produces anomaly labels or alarms whenever it detects anomalous sub-sequences. The system can also save the anomalous records for further investigation. We implemented a data feeder module to emulate a real data source. To handle real-time streams in an efficient, scalable, and fault-tolerant manner, we needed a message broker. We selected Apache Kafka as the message broker to handle the message passing and queuing. Kafka adopts a publish/subscribe approach. It is important to note that Spark provides smooth integration with Kafka.

The detector module is the most important component in our system. The module implements the online classifier described previously. Similar to the training stage, we need to transform incoming streams of data into a sequence representation. To construct sequences, we need to acquire all previous records. HBase, an open source column-based NoSQL store built on top of HDFS, enables fast querying of data resident on HDFS. Also, Spark has a native integration with HBase. We use HBase to hold new incoming records and to return them when needed as answers to the detector program's queries. Sequences retrieved from HBase are scanned using a sliding window of length w, and the outlierness score is calculated using the Miss-Probability metric defined previously for every subsequence of length w. Spark adopts a micro-batching mechanism to simulate real-time streaming.

The Markov model requires categorical data which can be mapped into states. If the data has continuous attributes, discretization is required to transform them into categorical attributes. We selected SAX (Symbolic Aggregate Approximation) for this purpose. SAX is a recently developed discretization method which has achieved improved results on similar machine learning tasks [17]. This method endeavors to create equi-frequent symbols and is specifically designed for streaming data.

C. Anomaly Detection + Association Rules Mining

Our algorithm proposes to utilize association mining to generate association rules about anomalous sequences. These rules are implications containing anomalous data in their antecedent, and can be used to explain the anomalous sequences which created them. In addition, they can theoretically predict the anomaly itself by creating "A implies B" implications. Our anomaly detection framework consists of two stages: a training stage and a detection stage. We start from historical data to build a sequence model using Spark batch processing, and the model is then applied in the detection stage. We use a sliding window approach to process the data stream and compare the outlier score against a threshold to determine whether a sequence is anomalous or not. The historical data is updated at the same time. The integrated framework of anomaly detection and association rules mining for continuous data streams can be seen in Fig. 4, while the data flow of our system's integrated analysis is illustrated in Fig. 5. We verified these rules with a domain expert.

D. Tools and Prototypes

A prototype of the integrated framework was implemented on a cluster of three Ubuntu nodes with 24 cores each. Each node runs Hadoop's HDFS, Apache Spark, Apache HBase, Apache Kafka, and Apache Zookeeper. Both the Java and Scala languages were used to write scripts and programs. Eclipse with a plug-in for Scala was used as the IDE for developing, debugging, and testing the code. SBT (Scala Build Tool) was used to build and package the final JAR file, which is submitted to Spark to run on the cluster. The output, the anomalous sub-sequences, is both displayed on the console and saved to HDFS. R scripts were used to visualize the sequence data. We also used Rickshaw, a JavaScript toolkit, to create an interactive time-series graph tool to visualize the anomalous sequences.
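The training and detection stages described above can be condensed into a single-machine sketch. This is a simplified, sequential variant (the paper's implementation runs distributed on Spark with HBase and Kafka); in particular, the normalization below uses the count of item i as a predecessor, and all names are illustrative.

```python
# Stage 1: one scan builds the transition matrix P(j | i).
# Stage 2: sliding windows are scored by average miss-probability.
from collections import defaultdict

def train(sequence):
    """Count item and item-pair frequencies in one scan, then normalize rows."""
    freq = defaultdict(int)
    pair = defaultdict(lambda: defaultdict(int))
    for i, j in zip(sequence, sequence[1:]):
        freq[i] += 1
        pair[i][j] += 1
    return {i: {j: c / freq[i] for j, c in nexts.items()} for i, nexts in pair.items()}

def outlierness(window, model):
    """Average miss-probability, 1 - P(j | i), over the w - 1 transitions."""
    misses = [1.0 - model.get(i, {}).get(j, 0.0) for i, j in zip(window, window[1:])]
    return sum(misses) / len(misses)

def detect(stream, model, w, t):
    """Label each length-w sliding window anomalous when its score reaches t."""
    alarms = []
    for idx in range(len(stream) - w + 1):
        score = outlierness(stream[idx:idx + w], model)
        if score >= t:
            alarms.append((idx, score))
    return alarms

model = train(["A", "B"] * 50)        # history: strict A->B->A alternation
print(detect(["A", "B", "A", "A", "A"], model, w=3, t=0.5))  # [(1, 0.5), (2, 1.0)]
```

The two alarms come from windows containing the transition A→A, which never occurs in the training history and therefore has miss-probability 1.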
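The SAX discretization selected above can likewise be sketched. This is a minimal version of the usual pipeline (z-normalization, piecewise aggregate approximation, symbol assignment via standard-normal breakpoints); the breakpoint table, alphabet size, and the assumption that the series length divides evenly into segments are illustrative simplifications.

```python
# Minimal SAX-style discretization: z-normalize, average equal-width segments
# (PAA), then map each segment mean to a symbol using Gaussian breakpoints.
import statistics

BREAKPOINTS = {3: [-0.43, 0.43], 4: [-0.67, 0.0, 0.67]}  # standard-normal quantiles

def sax(series, segments, alphabet="abcd"):
    mean, sd = statistics.fmean(series), statistics.pstdev(series)
    z = [(x - mean) / sd for x in series] if sd else [0.0] * len(series)
    step = len(z) // segments                       # assumes an even split
    paa = [statistics.fmean(z[k * step:(k + 1) * step]) for k in range(segments)]
    cuts = BREAKPOINTS[len(alphabet)]
    return "".join(alphabet[sum(v > c for c in cuts)] for v in paa)

print(sax([1, 1, 2, 2, 9, 9, 10, 10], segments=4, alphabet="abcd"))  # 'aadd'
```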
Figure 4. Integrated Framework

Table I
PERFORMANCE COMPARISON BETWEEN THE SEQUENTIAL VERSION AND THE SPARK VERSION

Cores            1        2        4        8        16
Time (seconds)   99,434   82,042   81,301   74,534   68,905
Memory (MB)      1,026    1,450    1,504    1,566    1,604

Figure 5. Data flow for Anomaly Detection + Association Rules Mining

A. System Specifications

The system is developed on the Spark framework (version 1.4.0). The cluster consisted of three Ubuntu 12.04 servers; each server has 16 GB of memory and 24 cores. The programs were developed in Java 8.0 and Scala 2.11. Eclipse is our main IDE for programming.

B. Data Sources

1) Synthetic Data A: This data set was generated mainly for the evaluation of our system's performance, with each partition comprising ten million records. Each record consisted of 1 to 10 random numbers, whose cardinality is between 0 and 99. Table I shows the comparison of performance between the core and Spark versions with a varying number of trees. We can see that as the tree number grows, the time for execution decreases gradually. Even though the performance of the system improves as the number of trees increases, when compared to the original PFP version, the speed growth rate is not as good as the original because of the additional overhead introduced by MapReduce. Our current analysis pointed out that most of the execution time is spent on data transfer, especially during the merge process. Decreasing the number of node communications and reducing the serialization cost are part of our future work.

2) Synthetic Data B: We generated a synthetic dataset in order to simulate credit card transactional data. Credit card fraud is an interesting application for anomaly detection, with the goal of detecting fraudulent transactions based on normal behavior patterns. The data is organized into sequences of historical transactions for a specific customer and their peers. Online credit card fraud detection is crucial to prevent such fraudulent transactions in real time. In this application, we set the window size to w = 5 transactions and proposed 8 features to represent the data within the credit card transactions.

Eight discretized transaction item types:
1) Total transaction amount (L = Low, M = Medium, H = High)
2) Highest item price (L = Low, N = Normal, H = High)
3) Time since the last transaction (L = Long, M = Medium, S = Short)
4) Distance from customer's address (C = Close, F = Faraway, V = Very faraway)
5) Hour of the day (A = 0−6, B = 6−12, C = 12−18, D = 18−24)
6) Day of the week (A = Mon−Thu, B = Fri−Sun)
7) Age (A = 0−12, B = 12−21, C = 21−39, D = 40−65, E = 65+)
8) Electronic transaction (T = True, F = False)

A sample anomalous sequence detected can be seen below:
1) (M, H, L, V, B, B, C, F)
2) (L, N, L, V, C, B, A, T)
3) (H, H, S, V, A, B, A, T)
4) (L, H, N, V, C, B, B, T)
5) (H, H, S, V, D, A, B, T)

This sequence includes five transactions; each transaction contains the eight discretized features mentioned above. Transaction 3 indicates (a) a large purchase amount with high-priced items, (b) a short amount of time since the last transaction, (c) a purchase issued very far away from the customer's home address, (d) occurring in the morning hours of a weekend day, (e) an electronic transaction (online purchase), and (f) that the person who made this purchase is young.

3) Electricity Load Diagram: Next, we applied our method to a real-world data set available in the UCI repository for machine learning [2]. The data represents the electric power consumption of 370 data points/clients. The values are measured in kilowatts every 15 minutes. Each point has 140,256 readings.

Figure 6. Electricity Load Diagram

Fig. 6 shows the raw data of one point in the interval from 2011 to 2015.

Figure 7. Electricity Load Diagram Anomalous Sequences

Fig. 7 shows samples of the program's discovered anomalous subsequences, in bold, plotted along with the whole sequence of a single client.

4) Government Spending Dataset: Typically, collective anomalies are difficult to interpret without appropriate context. This is because each individual data point might look normal without a frame of reference. For example, government spending draws a great deal of attention, and each expenditure is supposed to undergo close examination by regulators and auditors. However, fraudulent activities can still appear in government spending. Detecting anomalies/frauds in this case is a very challenging problem. The challenge is in defining what is considered normal, given the very limited domain knowledge the research team has. Our proposed solution to this challenge is to generate sequences for contract data based upon when the contract is awarded/modified, for both contracting entities and vendors. Also, when our sequence-based collective anomaly detection method detects anomalous sequences which are difficult to interpret, we will leverage anomalous association rules to understand them. This data set is publicly accessible from www.usaspending.gov. It contains records from United States federal contracts, grants, and loans occurring between 2008 and 2016 and contains more than 200 attributes. We chose to focus on the contracts of one agency, the US Department of Defense (DOD), and one bureau, the Department of the Army. Two sequence data sets were generated from the transactional data: (1) one for the investigation of contracting offices and (2) one for vendors.

The Contracting Offices data set had its unique sequence key as a composite of "ContractingOfficeId", "ContractActionType", and "ProductOrServiceCode". This dataset compares different contracting offices inside DOD, considering the temporal sequence of contracts. Due to the large impact of contract type and product prices, we decided to add them to the sequence key. The attributes for the sequence items, which make up the different Markov states, are the following: "SolicitationProcedures", "NumberOfOffersReceived", and "SUM(DollarsObligated)". The Vendors data set has the "DUNSNumber" as its sequence key. This data set compares different vendors doing business with the DOD, considering the temporal sequence of contracts. The attributes for the sequence items, which make up the different Markov states, are the same as for the Contracting Offices data set: "SolicitationProcedures", "NumberOfOffersReceived", and "SUM(DollarsObligated)".

Common sense dictates that you should not compare disparate sets; i.e. contracts for purchasing office supplies won't be relevant to those for building a flood control levee. Contracts are comparable only for similar types, which will define an appropriate context for the investigation. For a context-based approach, we decided to partition both data sets by the contract type. This separates the data sets into eight different partitions each: (1) BPA Call, (2) BPA Call Blanket Purchase Agreement, (3) DCA Definitive Contract, (4) Definitive Contract, (5) Delivery Order, (6) DO Delivery Order, (7) PO Purchase Order, (8) Purchase Order.

Our reason for selecting "SolicitationProcedures" and "NumberOfOffersReceived" as behavior attributes was derived from the assumption that the offers received by a contracting office should align with the solicitation type. Therefore, any discrepancy may mean anomalous behavior, especially for large-amount contracts. The types of solicitation procedures for this data set are listed in Table II. From Table II you can see that some solicitations explicitly define how many offers the contracting office is supposed to receive. For
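The construction of per-key sequences described above can be sketched as follows; the record layout and field names are hypothetical stand-ins for the actual USASpending columns, and the binning of obligated dollars is assumed.

```python
# Group time-ordered contract records into one state sequence per composite key.
from collections import defaultdict

def build_sequences(records, key_fields, state_fields):
    sequences = defaultdict(list)
    for rec in sorted(records, key=lambda r: r["date"]):   # temporal order
        key = tuple(rec[f] for f in key_fields)
        state = tuple(rec[f] for f in state_fields)
        sequences[key].append(state)
    return dict(sequences)

records = [
    {"date": "2015-01-03", "ContractingOfficeId": "W91", "ContractActionType": "DO",
     "ProductOrServiceCode": "8405", "SolicitationProcedures": "SP1",
     "NumberOfOffersReceived": 1, "DollarsObligatedBin": "IV"},
    {"date": "2015-02-10", "ContractingOfficeId": "W91", "ContractActionType": "DO",
     "ProductOrServiceCode": "8405", "SolicitationProcedures": "NP",
     "NumberOfOffersReceived": 3, "DollarsObligatedBin": "II"},
]
seqs = build_sequences(
    records,
    ["ContractingOfficeId", "ContractActionType", "ProductOrServiceCode"],
    ["SolicitationProcedures", "NumberOfOffersReceived", "DollarsObligatedBin"],
)
print(seqs[("W91", "DO", "8405")])  # [('SP1', 1, 'IV'), ('NP', 3, 'II')]
```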
Table II
DIFFERENT SOLICITATION PROCEDURES

Code   Explanation
NP     NEGOTIATED PROPOSAL OR QUOTE
AE     ARCHITECT ENGINEER FAR 6.102
MAFO   SUBJECT TO MULTIPLE AWARD FAIR OPPORTUNITY
SP1    SIMPLIFIED ACQUISITION
SSS    ONLY ONE SOURCE

Table V
SAMPLE ANOMALOUS ASSOCIATION RULES - ONE OFFER RECEIVED - VENDORS

Rule (Purchase Orders)                                        Confidence   Support
{Sum of Obligated Dollars IV} ⇒ {Received Offer 1}            55%          15%
{Received Offer 1, Sum of Obligated Dollars part 3}
  ⇒ {Solicitation Procedure SP1}                              67.5%        9%
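The confidence and support columns in Table V follow the standard association-rule definitions; the sketch below restates them over a toy set of transactions (the item labels are illustrative, not the paper's encoding).

```python
# support(X) = fraction of transactions containing X;
# confidence(X => Y) = support(X and Y) / support(X).
def support(transactions, itemset):
    return sum(1 for t in transactions if set(itemset) <= set(t)) / len(transactions)

def confidence(transactions, lhs, rhs):
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

txns = [{"offers=1", "dollars=IV"}, {"offers=1"},
        {"offers=3", "dollars=IV"}, {"dollars=IV"}]
print(round(confidence(txns, {"dollars=IV"}, {"offers=1"}), 2))  # 0.33
```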
Table VI
SAMPLE ANOMALOUS ASSOCIATION RULES - ONE OFFER RECEIVED - VENDORS

Table IX
SAMPLE ANOMALOUS ASSOCIATION RULES - LARGE CONTRACT - VENDORS
probability. Combining the parallelization of the Itemset Tree and our anomaly detection allows for fast generation of anomalous sequences and rules. This allows the knowledge gained to be acted upon before it becomes obsolete. Online mining of associations and collective anomalies is of practical importance because it enables decision makers to take swift action to prevent losses or take advantage of potential profit-making opportunities at the correct moment in time, before the value of the information depreciates. This research designs and implements an online mining framework that is designed specifically for big data. The system is designed to be general and can be applied in various application domains with minimal customization. This system can be beneficial in several applications such as intrusion detection, epidemiology, click analysis, law enforcement, anti-terrorism, and marketing (emerging markets, cross-selling, and advertisement optimization).

ACKNOWLEDGMENT

The work was partially supported by the National Science Foundation under grant No. IIP-1160958 and partially by the Industry Advisory Board of the Center for Visual and Decision Informatics.

[10] J. Somesh, et al., "Markov Chains, Classifiers, and Intrusion Detection", CSFW '01: Proceedings of the 14th IEEE Workshop on Computer Security Foundations, vol. 1, p. 206, 2001.
[11] R. Kiran and P. Re, "An improved multiple minimum support based approach to mine rare association rules", 2009 IEEE Symposium on Computational Intelligence and Data Mining, 2009.
[12] M. Kubat, et al., "Itemset trees for targeted association querying", IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 6, pp. 1522-1534, 2003.
[13] J. Lavergne, et al., "Min-Max Itemset Trees for Dense and Categorical Datasets", International Symposium on Methodologies for Intelligent Systems (ISMIS 2012): Foundations of Intelligent Systems, pp. 51-60, 2013.
[14] J. Lavergne, et al., "DynTARM: An In-Memory Data Structure for Targeted Strong and Rare Association Rule Mining over Time-Varying Domains", 2013 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT), vol. 1, 2013.
[15] H. Li, et al., "Pfp: parallel fp-growth for query recommendation", RecSys '08: Proceedings of the 2008 ACM Conference on Recommender Systems, pp. 107-114, 2008.