
2017 IEEE International Conference on Big Data (BIGDATA)

Online Mining for Association Rules and Collective Anomalies in Data Streams

Shaaban Abbady, Cheng-Yuan Ke, Jennifer Lavergne, Jian Chen, Vijay Raghavan
Center for Visual and Decision Informatics
University of Louisiana at Lafayette
Lafayette, USA
{smr5476, cxk9886, jjsl5910, jchen, raghvan}@louisiana.edu

Ryan Benton
School of Computing
University of South Alabama
Mobile, USA
[email protected]

Abstract—When analyzing streaming data, the results can depreciate in value faster than the analysis can be completed and the results deployed. This is certainly the case in the area of anomaly detection, where detecting a potential problem as it is occurring (or in its early stages) can permit corrective behavior. However, most anomaly detection methods focus on point anomalies, while in practice many fraudulent behaviors can be detected only through collective analysis of sequences of data. Moreover, anomaly detection systems often stop at detecting anomalies; they typically do not provide information about how the features (attributes) of anomalies relate to each other or to those in normal states. The goal of this research is to create a distributed system that allows for the detection of collective anomalies from streaming data, and to provide a richer context of information about the anomalies beyond their presence. To accomplish this, we (a) re-engineered an online sequence anomaly detection algorithm and (b) designed new algorithms for targeted association mining to run in a streaming, distributed environment. Our experiments, conducted on both synthetic and real-world data sets, demonstrated that the proposed framework is able to achieve near real-time response in detecting anomalies and extracting information pertaining to the anomalies.

Keywords-association mining; anomaly detection; streaming data; online processing; collective anomalies;

I. INTRODUCTION

The objective of this paper is to introduce an online mining framework for data streams designed to detect collective anomalies and extract richer contextual information. To accomplish this, we first developed a scalable association mining algorithm for processing large volumes of data over time frames. Second, we developed distributed batch processing algorithms that store and model high-volume historical data. The model captures the existing sequence patterns within the historical data. Third, we developed an online distributed stream processing algorithm that continuously compares incoming data with the model and evaluates it to detect collective anomalies. Finally, the methods are integrated, allowing both the rapid, online detection of anomalies and the extraction of association rules about the anomalies.

When mining rules, Association Rule Mining usually makes multiple passes over the entire data set [1][3]. One solution to reduce the number of passes requires the user to specify part of the rule (a query). This solution, entitled targeted association mining, uses an in-memory tree structure called the Itemset Tree, which requires, at most, one pass through the tree. While computationally fast, a query could potentially traverse the entire tree before the system is able to move on to the next query [14]. By distributing the tree into multiple sub-trees, a query can be executed concurrently on the sub-trees and return a list of rules within a certain threshold with significantly less processing time, allowing queries to have less of a wait time before their execution.

Currently, the majority of the work on sequence-based anomaly detection addresses batch processing; only a small number of these methods handle direct online detection of outliers [4]. Online anomaly detection is crucially important in many cases because it enables immediate action to be taken in order to reduce or prevent losses; examples are credit card fraud and health monitoring [4]. Collective online anomaly detection is far more challenging and has higher computational complexity, which potentially makes fully online detection unfeasible. For our real-time responsive method for anomaly detection, we propose a strategy of utilizing two models: (1) a current model/snapshot of the current data and (2) a historical data model based upon previous data in the stream. The current model is then compared against the streaming data as it arrives. Since the online comparison of incoming data is much easier and involves only a small subset of the data, we can achieve near real-time response.

Our system combines both anomaly detection and targeted association rule mining into one online knowledge discovery system. We utilize this system to generate association rules based upon anomalous sequences discovered by our detection model. The system will (1) process a streaming data set, one time window at a time, (2) find anomalous sequences using a stable model from historical data and an off-line retrained model from a much shorter but more current period, and (3) generate a list of strong association rules which describe the anomalous sequences. These rules will allow insight into not only the why/how of the anomalous sequences, but also outline possible cause/effect involved in the sequences and their component parts.
978-1-5386-2715-0/17/$31.00 ©2017 IEEE 2370


Authorized licensed use limited to: University of Queensland. Downloaded on December 11,2023 at 09:09:09 UTC from IEEE Xplore. Restrictions apply.
In this paper, we will first discuss the methodologies we utilized and then expand upon our innovations. These are Association Mining, Targeted Association Mining, and Collective Anomaly Detection. Next, we will cover the proposed approach and detail the system we created. Then, we will visit our experiments and their results. Finally, we will offer a conclusion for each part of our algorithm.

II. METHODOLOGIES

This section contains an overview of the methodologies which inspired us to create our system, and the modifications we made to each in order to create our proposed system.

A. Association Mining

For many years, organizations and government entities have collected large data repositories from their day to day workings. Hidden within these data sets are potentially highly desirable/interesting correlations; an example would be correlations between certain purchase contracts and suspicious bidding practices. Such correlations can help an entity regulate bidding practices for certain contracts to minimize fraud. The process by which these hidden gems of information are discovered is referred to as Association Mining [1] [3] [9]. Association Mining processes large quantities of data, typically transactional, and returns strong correlations called Association Rules, ranked by mathematical measures of occurrence. This process can be divided into two sub-tasks: (1) mining frequent co-occurrences and (2) generating the strong correlations.

An early method of finding association rules was the Apriori algorithm, created by Agrawal et al. [1] [3]. In this method, all possible combinations of itemsets are generated in a lattice-like pattern. Given an itemset I = {i1, i2, i3, i4}, a rule {i1, i3} => {i2, i4}, and a database of size N, Support and Confidence are defined as:

Support(I) = TotalCount(I) / N

Confidence({i1, i3} => {i2, i4}) = Sup({i1, i2, i3, i4}) / Sup({i1, i3})

As the itemset combinations are generated, their support is calculated. If an itemset's support is greater than or equal to a user defined threshold called Minimum Support, or MinSup, the pattern is kept as a Candidate Itemset; otherwise, the pattern is discarded. Additionally, no other itemsets containing a discarded itemset are generated, as their support will be less than or equal to that of the discarded itemset. Next, the association rules are generated: each itemset is broken apart into two sub-itemsets to make a rule, and the rule's Confidence is calculated as the support of the itemset itself divided by the support of the sub-itemset on the left hand side of the rule. If the confidence is greater than or equal to another user defined threshold, Minimum Confidence, or MinConf, the rule is added to the Strong Rules set. This method can be cost prohibitive because of the generation of itemsets in the first step, but otherwise produces reliable results, depending upon the proper selection of MinSup and MinConf [1] [3] [9].

B. Targeted Association Mining

Targeted Association Mining takes advantage of the fact that most users actually know what they want from a data set. Using this knowledge, and a tree structure that represents the database itself, targeted association mining allows for not only faster querying times but also results filtered to what the user actually wants to see. This is unlike traditional association mining, which returns all strong rules [3]. Targeted association mining makes use of an index tree structure that contains all the information in the database [12]; this tree is referred to as an Itemset Tree. Using domain experts and their own common knowledge, users can create a list of items referred to as the User Defined Interest Itemsets, or UDII. All returned itemsets I will then contain the UDII plus additional items. By using the UDII, finding the confidence of a rule UDII => (I − UDII) is less computationally expensive [13][14].

Support(I) = TotalCount(I) / N

Confidence(UDII => (I − UDII)) = Sup(I) / Sup(UDII)

Using this list, we can query the tree structure and return all itemsets whose support is greater than or equal to MinSup [1] [3] [9]. The rule generation sequence is similar to the Apriori version: rules are generated and their confidence is calculated. It differs only in the manner in which the support of the left hand side is calculated. Instead of querying the whole database, the support is calculated using the Itemset Tree. Our algorithm proposes distributing the Itemset Tree into multiple sub-trees, therefore allowing a query to run concurrently on the sub-trees. The results of this process are all items with support ≥ MinSup and rules with confidence ≥ MinConf. The concurrent processing allows these results to be discovered in a more expedited manner. Our system allows the Itemset Tree to be built and queried in a parallel/distributed manner upon a MapReduce platform.

C. Streaming Data Processing

For our streaming data approach, we utilized a "sliding window" technique. This allows us to process a stream of data by operating on the subset considered current. Using this method we can window over the data and, given a certain amount of overlap, catch trends in the data as they change. Fig. 1 illustrates the sliding window over a stream of data. Three consecutive windows are shown: W(i), W(i+1), and W(i+2).
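The sliding window mechanism just described can be sketched as a small generator (a minimal illustration under our own assumptions, not the paper's Spark implementation; the `step` parameter controls how much consecutive windows overlap):

```python
def sliding_windows(stream, size, step):
    """Yield overlapping windows over a stream; consecutive windows
    share size - step elements, like W(i) and W(i+1) in Fig. 1."""
    buf = []
    for item in stream:
        buf.append(item)
        if len(buf) == size:
            yield tuple(buf)
            buf = buf[step:]  # keep the overlapping tail for the next window
```

For example, `sliding_windows(range(6), size=4, step=2)` yields `(0, 1, 2, 3)` and then `(2, 3, 4, 5)`, with the two middle elements shared between windows.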

Figure 1. Sliding window technique

These windows contain overlapping information, allowing a smoother snapshot of the data to be gathered.

D. Collective Anomaly Detection

Outliers can be defined as those observations that deviate from normal observations to an extent that arouses suspicion that they were generated by a dissimilar mechanism [1]. These events can also be referred to as anomalies; we will use the terms outliers and anomalies interchangeably. The results of outlier detection usually translate to significant or critical real life events, e.g. network intrusions, fraudulent credit card transactions, or failures in an operational system. Outliers can be classified into three categories: (1) point outliers, (2) contextual outliers, and (3) collective outliers [9]. Point, or global, outliers are data points which significantly differ from the current norm of the data. Contextual (conditional) outliers deviate from the surrounding context, referred to as the contextual attributes. Collective outliers are similar to contextual outliers, except that we don't consider individual points but rather a collection of points that has common relationship(s). Examples of related collections of objects include sub-sequences for temporal data, local regions for spatial data, and connected sub-graphs for graph data [4]. Sequence based outlier detection is an example of collective outlier detection, where the temporal order of the points is their relationship. Our algorithm focuses solely on sequence based outlier detection for our collective anomaly detection. Sequence data is common in many applications such as network intrusion detection, bioinformatics, flight safety, health monitoring, and web site navigational click sequences [4] [5].

In order to implement a real-time response for anomaly detection, we adopted the strategy of building a current model based upon a previously built historical data model. The historical model is updated as concepts drift, so as to remain current; the computationally expensive job of building the updated model is done off-line. We then compare this current model against incoming streams of data. The online comparison of incoming data is much easier, as it involves only a small subset of the data represented within a sliding window. This way we can achieve a near real-time response. For our system, we chose a Markovian model because it is unsupervised: it does not require the labels of training data to be known in advance, thereby making it suitable for detecting new and unexpected varieties of outliers. It can also be applied locally on our sliding window subsets of data for processing of online streaming datasets [7].

The challenges for this online approach are (1) scalability and (2) concept drift. The scalability challenge is the potentially escalating computational cost as the number of incoming streams grows. The concern is that the system will become unable to process all of the data simultaneously and generate labels in real time; this lag, in turn, could result in a cascading backlog, resulting in system failure. In our system, we address the scalability issue by utilizing a horizontally scalable MapReduce based distributed computing platform. Concept drift occurs when the models representing the current data are no longer valid. If concept drift is not addressed, an increase in the number of false alarms and/or missed true anomalies will occur. For the concept drift issue, our system proposes a simple approach in which we periodically run an off-line retraining of the model. If the difference between the newly trained model and the existing historical model exceeds a predefined threshold (implying concept drift), the new model replaces the existing model.

E. Technology Selection

Our framework has both a batch processing component for off-line model training and a streaming component for on-line anomaly detection. The alternatives considered for batch processing were Hadoop's MapReduce and Spark [23]. Both are open source projects, and both provide scalable distributed processing based on MapReduce with high fault-tolerance. We selected Spark because of its faster processing, due to its capability to cache results of previous steps in memory instead of on disk. For the streaming component, we evaluated both Storm [19] and Spark Streaming. We selected Spark Streaming because of the ability to smoothly integrate it with the batch component.

III. PROPOSED APPROACH

The proposed approach is a materialization of the general strategy explained above that is also capable of handling the scalability challenge by using new distributed MapReduce based technologies.

A. Parallel Itemset Trees Using Spark

Combining Spark Core and Spark Streaming provides our system with an efficient platform for high-speed data streaming and processing with fault tolerance. Our Spark application creates multiple Itemset Tree structures, which initially reside in the driver program. The system begins with continuous streaming data handled via the executors. These executors also update the trees as the data changes. For every batch of data, an executor will get a tree from the driver, run the Tree Insertion Algorithm to insert the itemsets into the tree, and then return the tree to the driver program. We have defined a parameter, called the "level of parallelism".

Figure 2. Itemset Tree Parallelism

This parameter can be configured while running the application. The parallelism level defines the number of trees to be created and, thereby, the number of partitions into which the incoming data stream is split before processing.

An example of the data flow between executors with a parallelism level of n is illustrated in Fig. 2. Query processing is completed by sending the query and a tree to each of the executors, so that each executor can apply the user query to its input tree. The results of the query are collected from each executor and combined by the driver program. When applying the query, the minimum support and minimum confidence values are not considered at the executor level; these constraints are applied only at the driver program level while combining the results. The algorithm to combine the Association Rules query results is as follows:

Let R = {R1, R2, ..., Rn} be the list of results from the n trees.
Let each Ri include elements of the form:
  Ri = {Rule_i, Supp(Antecedent_i), Supp(Consequent_i)}
For each Ri and Rj where Rule_i = Rule_j:
  1) Update Ri = {Rule_i,
       Supp(Antecedent_i) + Supp(Antecedent_j),
       Supp(Consequent_i) + Supp(Consequent_j)}
  2) Delete Rj

The final results are filtered based upon the minimum support and confidence constraints.

B. Anomaly Detection

Fig. 3 depicts a conceptual design of the proposed approach's anomaly detection stages. We can divide our approach into two stages: (1) the training stage and (2) the detection stage.

Figure 3. Online Anomaly Detection Approach

1) Training Stage: The training stage is a batch process which takes the available historical data as input and produces the Markov model as output. The Markov model is represented by a transition probability matrix [10]. This model is then placed into HDFS for use in the detection step. Since in many cases data arrives and is stored as individual transactions, with or without time attributes, we need to transform the data into sequences. A data sequencer completes this task by grouping transactions with a common key and sorting them temporally. For example, we can transform a database of credit card transactions into a group of sequences, where each sequence represents one customer's trace of transactions sorted chronologically.

Given our Markovian transition matrix, we can build an online classifier which receives the incoming input stream and produces as output labels of normal or anomalous. In addition, it can output a numeric score of the outlierness of the sub-sequence at hand. To do that, we need to store a time window containing the most recent incoming input. The system then applies a metric to decide how different a point is from the norm. This metric is calculated based on the subsequence resident in the sliding window and is updated whenever a new element arrives. The metric refers to the statistical model to calculate a score of outlierness based upon the extent to which the sub-sequence matches the model. Many metrics for calculating outlierness utilize the Markovian model; we chose to implement the miss-probability metric. For an observed transition from the current state i to a new state j, the metric adds the transition probabilities from i to all states other than j:

Miss-Probability(i, j) = Σ_{k ≠ j} P_ik = 1 − P_ij

where P_ij is the element in the ith row and jth column of the transition matrix. It represents the probability of receiving item j given that the last item was item i. This can be calculated off-line from the historical data by dividing the frequency of i followed by j by the frequency of i:

P_ij = Frequency(ij) / Frequency(i)
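The transition-matrix estimation and miss-probability scoring described in this section can be sketched as follows. This is a minimal Python illustration of the formulas, not the paper's Spark implementation; note that Frequency(i) here counts only occurrences of i that have a successor, which is one reasonable reading of the normalization:

```python
from collections import defaultdict

def train_transition_matrix(sequences):
    """Single scan over sequenced data: count Frequency(i) and Frequency(ij),
    then normalize so P[(i, j)] estimates P(next item = j | current item = i)."""
    freq = defaultdict(int)   # Frequency(i), counted when i has a successor
    pair = defaultdict(int)   # Frequency(i followed by j)
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            freq[a] += 1
            pair[(a, b)] += 1
    return {(a, b): n / freq[a] for (a, b), n in pair.items()}

def miss_probability(P, i, j):
    """1 - P_ij: total probability of moving from i to any state other than j."""
    return 1.0 - P.get((i, j), 0.0)

def outlier_score(window, P):
    """mu: average miss-probability over the w - 1 transitions in a window."""
    pairs = list(zip(window, window[1:]))
    return sum(miss_probability(P, a, b) for a, b in pairs) / len(pairs)
```

On a toy sequence ["a", "b", "a", "b", "a"], every observed transition alternates, so a window that repeats a state (e.g. ["a", "a", "b"]) receives a higher outlier score than one that alternates.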

The outlierness score μ is measured online as the average miss-probability over the w − 1 transitions in the sliding window:

μ = ( Σ_{i=1}^{w−1} Miss-Probability(item_i, item_{i+1}) ) / (w − 1)

The label is anomalous if μ ≥ t, where t is a user defined threshold, and normal otherwise. Alarms can be issued upon detection of anomalous instances.

The model builder takes the sequenced data as input and produces the transition matrix as output. It performs a single scan over the data and updates two frequency arrays. The first is a one-dimensional array containing the frequencies of all elements, Frequency(i), initialized to zeros and incremented upon each occurrence of item i. The second is a two-dimensional array representing Frequency(ij), initialized to zeros and incremented upon each occurrence of item i followed by item j. At the end of the scan, each row of the matrix is divided by the corresponding element of the one-dimensional frequency array, so that each element represents the probability P_ij. The training stage is simple to parallelize and implement on Spark.

The Markov model requires categorical data which can be mapped into states. If the data has continuous attributes, discretization is required to transform them into categorical attributes. We selected SAX (Symbolic Aggregate Approximation) for this purpose. SAX is a recently developed discretization method which has achieved improved results on similar machine learning tasks [17]. The method endeavors to create equi-frequent symbols and is specifically designed for streaming data.

2) Detection Stage: The detection stage is an online process which takes as input a stream of time-stamped transactional data. This stage continuously receives the incoming stream and produces anomaly labels or alarms whenever it detects anomalous sub-sequences. The system can also save the anomalous records for further investigation. We implemented a data feeder module to emulate a real data source. To handle real-time streams in an efficient, scalable, and fault tolerant manner, we needed a message broker. We selected Apache Kafka to handle the message passing and queuing. Kafka adopts a publish/subscribe approach, and Spark provides smooth integration with Kafka.

The detector module is the most important component of our system. The module implements the online classifier described previously. Similar to the training stage, we need to transform incoming streams of data into a sequence representation. To construct sequences, we need access to all previous records. HBase, an open source column-based NoSQL store built on top of HDFS, enables fast querying of data resident on HDFS, and Spark has native integration with HBase. We use HBase to hold new incoming records and to return them when needed as answers to the detector program's queries. Sequences retrieved from HBase are scanned using a sliding window of length w, and the outlierness score is calculated using the Miss-Probability metric defined previously for every subsequence of length w. Spark adopts a micro-batching mechanism to simulate real time streaming.

C. Anomaly Detection + Association Rules Mining

Our algorithm proposes to utilize association mining to generate association rules about anomalous sequences. These rules are implications containing anomalous data in their antecedent, and can be used to explain the anomalous sequences which created them. In addition, they can theoretically predict the anomaly itself through A-implies-B implications. Our anomaly detection framework consists of two stages: a training stage and a detection stage. We start from historical data to build a sequence model using Spark batch processing, and the model is then applied in the detection stage. We use a sliding window approach to process the data stream and compare against the outlier score to determine whether a sequence is anomalous or not. The historical data is updated at the same time. The integrated framework of anomaly detection and association rules mining for continuous data streams can be seen in Fig. 4, while the data flow of our system's integrated analysis is illustrated in Fig. 5. We verified the resulting rules with a domain expert.

D. Tools and Prototypes

A prototype of the integrated framework was implemented on a cluster of three Ubuntu nodes with 24 cores each. Each node runs Hadoop's HDFS, Apache Spark, Apache HBase, Apache Kafka, and Apache Zookeeper. Both the Java and Scala languages were used to write scripts and programs. Eclipse with a Scala plug-in was used as the IDE for developing, debugging, and testing the code. SBT (Scala Build Tool) was used to build and package the final JAR file, which is submitted to Spark to run on the cluster. The output, anomalous sub-sequences, is both displayed on the console and saved to HDFS. R scripts were used to visualize the sequence data. We also used Rickshaw, a JavaScript toolkit, to create an interactive time-series graph tool to visualize the anomalous sequences.

IV. RESULTS

The MapReduce implementation of the Paralleled Itemset Tree was able to generate accurate rules in less time than the sequential implementation, with much higher scalability, limited only by available processing power. The streaming framework, built on the integration of multiple new big data technologies, was able to achieve near real-time response in detecting outliers. Its throughput scales with the available computing power.
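The driver-side combination of per-tree rule results (Section III-A) can be sketched as follows. This is a minimal illustration: interpreting the stored pair for each rule as (antecedent count, full-itemset count), so that confidence is their ratio, is our assumption about the paper's bookkeeping:

```python
from collections import defaultdict

def combine_rule_results(per_tree_results, min_conf):
    """Merge per-tree query results and apply the confidence filter at the
    driver. Each result maps rule -> (antecedent_count, full_itemset_count);
    counts for identical rules are summed across sub-trees."""
    totals = defaultdict(lambda: [0, 0])
    for result in per_tree_results:
        for rule, (ant, full) in result.items():
            totals[rule][0] += ant
            totals[rule][1] += full
    # Keep only rules whose combined confidence meets the threshold.
    return {rule: full / ant
            for rule, (ant, full) in totals.items()
            if ant and full / ant >= min_conf}
```

Summing the counts before dividing is what makes the executor-level results combinable: a per-tree confidence ratio could not simply be averaged, but the underlying counts can always be added.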

Figure 4. Integrated Framework

Table I
PERFORMANCE COMPARISON BETWEEN THE SEQUENTIAL VERSION AND THE SPARK VERSION

Cores         1        2        4        8        16
Time (s)      99,434   82,042   81,301   74,534   68,905
Memory (MB)   1,026    1,450    1,504    1,566    1,604

Figure 5. Data flow for Anomaly Detection + Association Rules Mining

A. System Specifications

The system was developed on the Spark framework (version 1.4.0). The cluster consisted of three Ubuntu 12.04 servers, each with 16 gigabytes of memory and 24 cores. The programs were developed in Java 8.0 and Scala 2.11, with Eclipse as our main IDE.

B. Data Sources

1) Synthetic Data A: This data set was generated mainly for the evaluation of our system's performance, with each partition comprising ten million records. Each record consisted of 1 to 10 random numbers, whose cardinality is between 0 and 99. Table I shows the performance comparison between the core and Spark versions with a varying number of trees. We can see that as the number of trees grows, the execution time decreases gradually. Even though the performance of the system improves as the number of trees increases, the speedup rate is not as good as the original PFP version because of the additional overhead introduced by MapReduce. Our current analysis pointed out that most of the execution time is spent on data transfer, especially during the merge process. Decreasing the number of node communications and reducing the serialization cost are part of our future work.

2) Synthetic Data B: We generated a synthetic dataset in order to simulate credit card transactional data. Credit card fraud is an interesting application for anomaly detection, with the goal of detecting fraudulent transactions based on normal behavior patterns; online detection is crucial to prevent such fraudulent transactions in real time. The data is organized into sequences of historical transactions for a specific customer and their peers. In this application, we set the window size w = 5 transactions and proposed 8 features to represent the data within the credit card transactions.

Eight discretized transaction item types:
1) Total transaction amount (L = Low, M = Medium, H = High)
2) Highest item price (L = Low, N = Normal, H = High)
3) Time since the last transaction (L = Long, M = Medium, S = Short)
4) Distance from customer's address (C = Close, F = Faraway, V = Very faraway)
5) Hour of the day (A = 0−6, B = 6−12, C = 12−18, D = 18−24)
6) Day of the week (A = Mon−Thu, B = Fri−Sun)
7) Age (A = 0−12, B = 12−21, C = 21−39, D = 40−65, E = 65+)
8) Electronic transaction (T = True, F = False)

A sample anomalous sequence detected can be seen below:
1) (M, H, L, V, B, B, C, F)
2) (L, N, L, V, C, B, A, T)
3) (H, H, S, V, A, B, A, T)

4) (L, H, N, V, C, B, B, T)
5) (H, H, S, V, D, A, B, T)

This sequence includes five transactions; each transaction contains the eight discretized features mentioned above. Transaction 3 indicates (a) a large purchase amount with high priced items, (b) a short amount of time since the last transaction, (c) a purchase issued very far away from the customer's home address, (d) a purchase that occurred in the morning hours on a weekend day, (e) an electronic transaction (online purchase), and (f) that the person who made this purchase is young.

3) Electricity Load Diagram: Next, we applied our method to a real-world data set available in the UCI machine learning repository [2]. The data represents the electric power consumption of 370 data points/clients. The values are measured in kilowatts every 15 minutes, and each point has 140,256 readings. Fig. 6 shows the raw data of one point in the interval from 2011 to 2015.

Figure 6. Electricity Load Diagram

Fig. 7 shows samples of the program's discovered anomalous subsequences, in bold, plotted along with the whole sequence of a single client.

Figure 7. Electricity Load Diagram Anomalous Sequences

4) Government Spending Dataset: Typically, collective anomalies are difficult to interpret without appropriate context, because each individual data point might look normal without a frame of reference. For example, government spending draws a great deal of attention, and each expenditure is supposed to undergo close examination by regulators and auditors. However, fraudulent activities are difficult to identify given the very limited domain knowledge the research team has. Our proposed solution to this challenge is to generate sequences for contract data based upon when a contract is awarded/modified, for both contracting entities and vendors. Also, when our sequence-based collective anomaly detection method detects anomalous sequences which are difficult to interpret, we will leverage anomalous association rules to understand them. This data set is publicly accessible from www.usaspending.gov. It contains records of United States federal contracts, grants, and loans occurring between 2008 and 2016, with more than 200 attributes. We chose to focus on the contracts of one agency, the US Department of Defense (DOD), and one bureau, the Department of the Army. Two sequence data sets were generated from the transactional data: (1) one for the investigation of contracting offices and (2) one for vendors.

The Contracting Offices data set has a unique sequence key composed of "ContractingOfficeId", "ContractActionType", and "ProductOrServiceCode". This data set compares different contracting offices inside the DOD, considering the temporal sequence of contracts. Due to the large impact of contract type and product prices, we decided to add them to the sequence key. The attributes for the sequence items, which make up the different Markov states, are the following: "SolicitationProcedures", "NumberOfOffersReceived", and "SUM(DollarsObligated)". The Vendors data set has "DUNSNumber" as its sequence key. This data set compares different vendors doing business with the DOD, considering the temporal sequence of contracts. The attributes for the sequence items, which make up the different Markov states, are the same as for the Contracting Offices data set: "SolicitationProcedures", "NumberOfOffersReceived", and "SUM(DollarsObligated)".

Common sense dictates that one should not compare disparate sets; i.e., contracts for purchasing office supplies won't be relevant to those for building a flood control levee. Contracts are comparable only within similar types, which define an appropriate context for the investigation. For a context-based approach, we decided to partition both data sets by contract type. This separates each data set into eight different partitions: (1) BPA Call, (2) BPA Call Blanket Purchase Agreement, (3) DCA Definitive Contract, (4) Definitive Contract, (5) Delivery Order, (6) DO Delivery Order, (7) PO Purchase Order, (8) Purchase Order. Our reason for selecting "SolicitationProcedures" and "NumberOfOffersReceived" as behavior attributes derives from the assumption that the offers received by a contracting office should align with the solicitation type. Therefore, any discrepancy may indicate anomalous behavior, especially for large-amount contracts. The types of solicitation
can still appear in government spending. To detect anoma- procedures for this data set are listed in Table II. From
lies/frauds in this case is a very challenging problem. The Table II you can see some solicitations explicitly define how
challenge is in defining what is considered normal given many offers the contracting office is supposed to receive. For

2376
Authorized licensed use limited to: University of Queensland. Downloaded on December 11,2023 at 09:09:09 UTC from IEEE Xplore. Restrictions apply.
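The construction of the sequence data sets described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the record layout and the "DollarsObligatedLevel" field (a pre-discretized version of "SUM(DollarsObligated)") are assumptions, while the key and state attributes follow the names given in the text.

```python
from collections import defaultdict

def build_sequences(transactions):
    """Group time-ordered contract transactions by a composite sequence key,
    mapping each transaction to a Markov-state symbol such as 'NP-3-3'
    (solicitation procedure, offers received, discretized dollar level)."""
    sequences = defaultdict(list)
    for t in transactions:
        # Composite sequence key, as described for the Contracting Offices set.
        key = (t["ContractingOfficeId"],
               t["ContractActionType"],
               t["ProductOrServiceCode"])
        # Each sequence item becomes one Markov state.
        state = "{}-{}-{}".format(t["SolicitationProcedures"],
                                  t["NumberOfOffersReceived"],
                                  t["DollarsObligatedLevel"])
        sequences[key].append(state)
    return dict(sequences)

# Two toy transactions for one office (all field values invented).
txns = [
    {"ContractingOfficeId": "W91QVN", "ContractActionType": "B",
     "ProductOrServiceCode": "3510", "SolicitationProcedures": "NP",
     "NumberOfOffersReceived": 3, "DollarsObligatedLevel": 3},
    {"ContractingOfficeId": "W91QVN", "ContractActionType": "B",
     "ProductOrServiceCode": "3510", "SolicitationProcedures": "SP1",
     "NumberOfOffersReceived": 5, "DollarsObligatedLevel": 4},
]
print(build_sequences(txns))
# {('W91QVN', 'B', '3510'): ['NP-3-3', 'SP1-5-4']}
```

For the Vendors data set, the same grouping would be keyed on "DUNSNumber" instead of the composite office key.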
For example, under the type SSS, only one offer is expected, while multiple offers are expected under solicitation type MAFO. Contract awards with a MAFO solicitation that received only one offer look suspicious, because the awardee did not face any competition, as required by that type of solicitation. Suspicion is aroused even further if the contract is for a large amount.

Table II
DIFFERENT SOLICITATION PROCEDURES

Code   Explanation
NP     NEGOTIATED PROPOSAL OR QUOTE
AE     ARCHITECT ENGINEER FAR 6.102
MAFO   SUBJECT TO MULTIPLE AWARD FAIR OPPORTUNITY
SP1    SIMPLIFIED ACQUISITION
SSS    ONLY ONE SOURCE
SB     SEALED BID
BR     BASIC RESEARCH
AS     ALTERNATIVE SOURCES
TS     TWO STEP

The "SUM(DollarsObligated)" attribute was discretized into five levels using SAX, as seen in Table III, for both data sets. The "NumberOfOffersReceived" attribute was similarly discretized into five categories. Both data sets contained approximately 1.5 million transactions. For our experiments we set the outlier score threshold to 0.8 (with the highest possible value being 1.0); all sequences with a higher score were reported as anomalous. Sample anomalous sequences can be seen in Table IV.

Table III
CONTRACT TOTAL AMOUNT DISCRETIZATION

Categ.  Vendors Dataset       Contracting Office Dataset
I       [-∞, $55)             [-∞, $69)
II      [$55, $1,506)         [$69, $1,734)
III     [$1,506, $26,015)     [$1,734, $27,842)
IV      [$26,015, $710,808)   [$27,842, $698,906)
V       [$710,808, ∞)         [$698,906, ∞)
        (Max: $8.15 Billion)  (Max: $8,547,900,116)

Table IV
SAMPLE ANOMALOUS SEQUENCES

Office ID       Score   Sequence
W91QVN-B-3510   0.979   NP-3-3, SP1-5-4, NP-5-1, NP-5-4, SSS-1-3, NP-2-1, SP1-3-3, NP-5-3

DUNS Number     Score   Sequence
123456787       0.98    SP1-3-3, SP1-5-1, NP-1-4, NP-1-1, SP1-1-2, NP-4-3, NP-3-3, NP-1-2

Each transaction has three categorical or discretized attributes, as described above. As an example, the transaction "NP-3-3" represents a contract awarded under an NP (Negotiated Proposal/Quote) solicitation, with 3 offers received and a total amount in category III ($1,734-$27,842). Due to the difficulty of interpreting these sequences manually, we used the top anomalous sequences as input for association rule mining, trying to find interesting, suspicious associations.

Table V
SAMPLE ANOMALOUS ASSOCIATION RULES - ONE OFFER RECEIVED - VENDORS

Rule - Purchase Orders                             Confidence   Support
{Sum of Obligated Dollars IV}
  ⇒ {Received Offer 1}                             55%          15%
{Received Offer 1, Sum of Obligated Dollars III}
  ⇒ {Solicitation Procedure SP1}                   67.5%        9%

Rule - DCA Definitive Contract                     Confidence   Support
{Sum of Obligated Dollars IV}
  ⇒ {Received Offer 1}                             60%          30%
{Received Offer 1, Sum of Obligated Dollars V}
  ⇒ {Solicitation Procedure SSS}                   66%          9.7%

The Targeted Association Mining portion of our system uses the detected anomalous sequences as queries, which resulted in several interesting findings. For interesting association rules in this context, we paid special attention to those with high confidence as well as significant support. Continuing our principal idea about solicitation procedures versus offers received, we looked closely at rules with a component of only one offer received. Sample anomalous association rules in this scenario are shown in Table V (vendors) and Table VII (contracting offices); the contract types of these sample rules were randomly selected out of the eight in total. Consider the last rule in Table IX: it says that if the solicitation procedure is SP1 Simplified Acquisition and the total amount falls in the largest category (larger than $710,808), there is about a 62.4% chance that the awarded contract received only one offer. That is definitely an interesting association pattern worth further investigation.

Sample rules for large contracts are presented in Table VIII (contracting offices) and Table IX (vendors). Again, the contract types presented in these tables were randomly selected without any preference. Consider the top three rules together: they describe that, for large contracts, if only one offer was received, the solicitation was most likely of type SSS (one offer required); if the total number of offers received is 2 or 3, the solicitation was most likely of type NP (negotiated proposal/quote). We discussed these results with domain experts and received very positive feedback. We believe that further investigation is necessary.
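The sequence-scoring step (a probabilistic model over the discretized Markov states, with sequences scoring above the 0.8 threshold reported as anomalous) can be sketched as follows. The paper does not give its exact scoring formula, so this sketch assumes a simple first-order Markov model and defines the outlier score as one minus the geometric mean of the Laplace-smoothed transition probabilities; the toy history and the function names are illustrative.

```python
from collections import Counter, defaultdict

def train_markov(sequences):
    """Count state transitions in historical sequences (first-order Markov)."""
    trans = defaultdict(Counter)
    states = set()
    for seq in sequences:
        states.update(seq)
        for a, b in zip(seq, seq[1:]):
            trans[a][b] += 1
    return trans, states

def outlier_score(seq, trans, states):
    """Outlier score = 1 - geometric mean of the smoothed transition
    probabilities of seq under the historical model (higher = more unusual)."""
    if len(seq) < 2:
        return 0.0
    v = len(states) or 1
    prod = 1.0
    for a, b in zip(seq, seq[1:]):
        total = sum(trans[a].values())
        prod *= (trans[a][b] + 1) / (total + v)   # add-one (Laplace) smoothing
    return 1.0 - prod ** (1.0 / (len(seq) - 1))

# Toy history dominated by the state 'NP-3-3'.
history = [["NP-3-3"] * 5] * 20 + [["SP1-5-4", "NP-3-3"]] * 5
trans, states = train_markov(history)

normal = outlier_score(["NP-3-3", "NP-3-3"], trans, states)   # ~0.01
odd = outlier_score(["SP1-5-4", "SP1-5-4"], trans, states)    # ~0.86
print(normal < 0.8 < odd)  # True: only the unusual sequence crosses the threshold
```

A sequence built from transitions the historical model has rarely or never seen scores close to 1 and is flagged, matching the behavior described for the 0.8 threshold above.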
Table VI
SAMPLE ANOMALOUS ASSOCIATION RULES - ONE OFFER RECEIVED - VENDORS

Rule - Purchase Order                              Confidence   Support
{Sum of Obligated Dollars IV}
  ⇒ {Received Offer 1}                             55%          15%
{Received Offer 1, Sum of Obligated Dollars III}
  ⇒ {Solicitation Procedure SP1}                   67%          9%

Rule - DCA Definitive Contract                     Confidence   Support
{Sum of Obligated Dollars IV}
  ⇒ {Received Offer 1}                             60%          30%
{Received Offer 1, Sum of Obligated Dollars V}
  ⇒ {Solicitation Procedure SSS}                   66%          9.7%

Table VII
SAMPLE ANOMALOUS ASSOCIATION RULES - ONE OFFER RECEIVED - CONTRACTING OFFICES

Rule - Delivery Order                              Confidence   Support
{Received Offer 1, Sum of Obligated Dollars III}
  ⇒ {Solicitation Procedure MAFO}                  59%          5.5%
{Received Offer 1, Sum of Obligated Dollars IV}
  ⇒ {Solicitation Procedure MAFO}                  46%          5%

Rule - DCA Definitive Contract                     Confidence   Support
{Solicitation Procedure MAFO, Sum of Obligated Dollars III}
  ⇒ {Received Offer 1}                             69%          3.5%
{Sum of Obligated Dollars IV}
  ⇒ {Received Offer 1}                             60.9%        6.2%

Table VIII
SAMPLE ANOMALOUS ASSOCIATION RULES - LARGE CONTRACT - CONTRACTING OFFICES

Rule - DCA Definitive Contract                     Confidence   Support
{Sum of Obligated Dollars V}
  ⇒ {Received Offer 1}                             65.6%        15%

Rule - Purchase Order                              Confidence   Support
{Sum of Obligated Dollars V}
  ⇒ {Solicitation Procedure SP1}                   61.5%        0.51%
{Received Offer 1, Sum of Obligated Dollars III}
  ⇒ {Solicitation Procedure SP1}                   55.4%        0.25%

Table IX
SAMPLE ANOMALOUS ASSOCIATION RULES - LARGE CONTRACT - VENDORS

Rule - BPA Call Blanket Purchase Agreement         Confidence   Support
{Sum of Obligated Dollars IV}
  ⇒ {Received Offer 1}                             55%          15%
{Received Offer 1, Sum of Obligated Dollars III}
  ⇒ {Solicitation Procedure SP1}                   67%          9%

Rule - DCA Definitive Contract                     Confidence   Support
{Solicitation Procedure SP1, Sum of Obligated Dollars V}
  ⇒ {Received Offer 1}                             62.4%        0.7%
{Solicitation Procedure SP1, Received Offer 1}
  ⇒ {Sum of Obligated Dollars V}                   59.3%        3.5%

V. CONCLUSIONS AND FUTURE WORK

The proposed framework is promising for online association mining and/or online sequence-based anomaly detection applications. It is also scalable and able to generate a near-real-time response before the value of the output deprecates.

A. Association Rules Mining

We proposed and implemented a distributed, parallel version of the Itemset Tree utilizing MapReduce technology. The new parallel Itemset Tree outperforms the sequential tree algorithm in running time, for both construction and querying, even when using fewer cores. The parallelization introduces no sacrifice in the accuracy of results. The resulting Distributed Itemset Tree is more scalable and suitable for bigger data sets.

B. Collective Anomaly Detection

We implemented our proposed framework for online detection of anomalies in data streams and sought collective, sequence-based anomalies. We built a probabilistic model of the data based on available historical information and then utilized an online classifier to compare and label subsequences against the historical model. The framework makes good use of new big data technologies to fulfill these needs and yield real-time detection of outliers. The approach is scalable, so it can handle increasing sizes and different data streams. To overcome the problem of concept drift and model expiration, we proposed regular rebuilds of the model, which can be done efficiently on a distributed computation platform. For future work, further investigation into mitigating the concept drift problem is underway.

C. Association Rule Mining + Collective Anomaly Detection

By combining association mining and anomaly detection, we were able to generate association rules from anomalous sequences. These rules are implications containing anomalous data points and can be used to explain the anomalies found within each sequence. In addition, they can theoretically predict the anomaly itself by creating the implication that if A occurs, then B also occurs, with a certain amount of probability.
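As a concrete illustration of such implications, the confidence and support values attached to the rules above follow the standard association-rule definitions; a minimal sketch of how they are computed from transaction data (the toy transactions and item names here are invented for illustration):

```python
def rule_metrics(transactions, antecedent, consequent):
    """Standard measures: support = fraction of transactions containing
    both antecedent and consequent; confidence = that count divided by
    the number of transactions containing the antecedent."""
    ante = set(antecedent)
    both = ante | set(consequent)
    n_ante = sum(1 for t in transactions if ante <= t)
    n_both = sum(1 for t in transactions if both <= t)
    support = n_both / len(transactions)
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence

# Ten toy itemized contracts (all values invented for illustration).
txns = ([{"SP1", "Dollars:V", "Offers:1"}] * 3 +
        [{"SP1", "Dollars:V", "Offers:2"}] * 2 +
        [{"NP", "Dollars:IV", "Offers:3"}] * 5)

support, confidence = rule_metrics(txns, {"SP1", "Dollars:V"}, {"Offers:1"})
print(support, confidence)  # 0.3 0.6
```

Here the rule {SP1, Dollars:V} ⇒ {Offers:1} holds in 3 of the 10 transactions (support 30%) and in 3 of the 5 transactions matching the antecedent (confidence 60%), mirroring how the percentages in Tables V-IX are read.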
Combining the parallelization of the Itemset Tree and our anomaly detection allows for fast generation of anomalous sequences and fast rule generation, so the knowledge gained can be acted upon before it becomes obsolete. Online mining of associations and collective anomalies is of practical importance because it enables decision makers to take swift action to prevent losses, or to take advantage of potential profit-making opportunities, at the correct moment in time, before the value of the information deprecates. This research designs and implements an online mining framework built specifically for big data. The system is designed to be general and can be applied in various application domains with minimal customization. It can be beneficial in several applications, such as intrusion detection, epidemiology, click analysis, law enforcement, anti-terrorism, and marketing (emerging markets, cross-selling, and advertisement optimization).

ACKNOWLEDGMENT

The work was partially supported by the National Science Foundation under grant No. IIP-1160958 and partially by the Industry Advisory Board of the Center for Visual and Decision Informatics.

REFERENCES

[1] C. Aggarwal, Data Mining, 1st ed. New Delhi: Springer International Publishing, 2015, pp. 1-734.

[2] "UCI Machine Learning Repository", Archive.ics.uci.edu, 2013. [Online]. Available: http://archive.ics.uci.edu/ml.

[3] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules", VLDB '94 Proceedings of the 20th International Conference on Very Large Data Bases, vol. 1215, pp. 487-499, 1994.

[4] C. Varun, "Anomaly detection for symbolic sequences and time series data", Ph.D. dissertation, University of Minnesota, 2009.

[5] V. Chandola, et al., "Anomaly Detection for Discrete Sequences: A Survey", IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 5, pp. 823-839, 2012.

[6] F. Gedikli and D. Jannach, "Neighborhood-Restricted Mining and Weighted Application of Association Rules for Recommenders", Web Information Systems Engineering - WISE 2010, Lecture Notes in Computer Science, vol. 6488, pp. 157-165, 2010.

[7] J. Pei, et al., "Mining sequential patterns by pattern-growth: the PrefixSpan approach", IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 11, pp. 1424-1440, 2004.

[8] J. Han, et al., "Mining frequent patterns without candidate generation", ACM SIGMOD Record, vol. 29, no. 2, pp. 1-12, 2000.

[9] H. Jiawei, et al., "Data mining: concepts, models and techniques", Choice Reviews Online, vol. 49, no. 04, 2011.

[10] J. Somesh, et al., "Markov Chains, Classifiers, and Intrusion Detection", CSFW '01 Proceedings of the 14th IEEE Workshop on Computer Security Foundations, vol. 1, p. 206, 2001.

[11] R. Kiran and P. Re, "An improved multiple minimum support based approach to mine rare association rules", 2009 IEEE Symposium on Computational Intelligence and Data Mining, 2009.

[12] M. Kubat, et al., "Itemset trees for targeted association querying", IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 6, pp. 1522-1534, 2003.

[13] J. Lavergne, et al., "Min-Max Itemset Trees for Dense and Categorical Datasets", International Symposium on Methodologies for Intelligent Systems - ISMIS 2012: Foundations of Intelligent Systems, pp. 51-60, 2013.

[14] J. Lavergne, et al., "DynTARM: An In-Memory Data Structure for Targeted Strong and Rare Association Rule Mining over Time-Varying Domains", 2013 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT), vol. 1, 2013.

[15] H. Li, et al., "Pfp: parallel fp-growth for query recommendation", RecSys '08 Proceedings of the 2008 ACM Conference on Recommender Systems, pp. 107-114, 2008.

[16] Y. Li and M. Kubat, "Searching for high-support itemsets in itemset trees", Intelligent Data Analysis, vol. 10, no. 2, pp. 105-120, 2006.

[17] J. Lin, et al., "A symbolic representation of time series, with implications for streaming algorithms", DMKD '03 Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp. 2-11, 2003.

[18] B. Liu, et al., "Mining association rules with multiple minimum supports", KDD '99 Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 337-341, 1999.

[19] N. Marz, "nathanmarz/storm", GitHub, 2014. [Online]. Available: https://github.com/nathanmarz/storm/wiki/Tutorial.

[20] R. Srikant and R. Agrawal, "Mining generalized association rules", Future Generation Computer Systems, vol. 13, no. 2-3, pp. 161-180, 1997.

[21] R. Srikant and R. Agrawal, "Mining quantitative association rules in large relational tables", ACM SIGMOD Record, vol. 25, no. 2, pp. 1-12, 1996.

[22] H. Yun, et al., "Mining association rules on significant rare data using relative support", Journal of Systems and Software, vol. 67, no. 3, pp. 181-191, 2003.

[23] M. Zaharia, et al., "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing", NSDI '12: 9th USENIX Symposium on Networked Systems Design and Implementation, p. 2, 2012.