CApriori: A Conviction-Based Apriori Algorithm for Discovering Frequent Determinant Patterns from High Dimensional Datasets
Prasanna Kottapalle
G. Narayanamma Institute of Technology and Science
Abstract— At present, due to developments in Database Technology, large volumes of data are produced by everyday operations, and this has introduced the necessity of representing the data in High Dimensional Datasets. Discovering Frequent Determinant Patterns and Association Rules from these High Dimensional Datasets has become very tedious, since these databases contain a large number of different attributes: mining them generates an extremely large number of redundant rules, which makes the algorithms inefficient, and the data does not fit in main memory. In this paper, a new Association Rule Mining approach is presented which efficiently discovers Frequent Determinant Patterns and Association Rules from High Dimensional Datasets. The proposed approach adapts the conventional Apriori algorithm and devises a new CApriori algorithm to prune the generated Frequent Determinant Sets effectively. A Frequent Determinant set is selected by first comparing its value with a Conviction threshold and then with a Support threshold. This double comparison eliminates redundancy and generates strong Association Rules. To improve the mining process, the algorithm also makes use of a compressed data structure, f_list, constructed from feature attributes selected using a Heuristic Fitness Function (HFF) and a Heuristic Discretization algorithm. It further makes use of a Count Array (CA), devised as a One Dimensional Triple Array pair set, to minimize main memory utilization. This comprehensive study shows that the approach outperforms traditional Apriori, obtains more rapid computing speed, and at the same time generates sententious rules. Further, the mining methodology is ascertained to be better at generating strong Association Rules from High Dimensional Databases.

Keywords— Frequent Determinant Patterns, Association Rule Mining, Conviction Value, Heuristic Fitness Function, One Dimensional Triple Array pair set

I. INTRODUCTION

The development of Bioinformatics and Microarray Technology has produced many High Dimensional Datasets, like Gene Expression Data and Microarray Data, which are different from transactional data. Microarray Data usually contains a small number of rows (samples) and a large number of columns (attributes, or genes). This kind of Very High Dimensional Data needs efficient data mining techniques to discover interesting knowledge from the datasets.

Since its introduction, Association Rule Mining has become a core research topic in Data Mining. It has gained much attention for discovering interesting Frequent Patterns, affinities and their relationships from huge amounts of business transactional data, which are potentially useful for real-life applications and decision making. Association Rule Mining was first introduced in [1]. Given a transactional database, where each transaction is a set of items, an Association Rule is an inference of the form α → β, where α and β are sets of items. A Frequent Determinant Pattern set is a set of attributes (α, β) where the attributes α and β are frequently occurring attributes in the High Dimensional Dataset.

Association Rule Mining is a two-step process. In the first step, all frequent itemsets are discovered; in the second step, strong Association Rules are generated from the frequent itemsets found in the first step. Association Rule Mining is used to find interesting correlations, along with affinities, in a given transactional database [1], [2]. Apriori is used to discover Association Rules from Market Basket data, which involves sets of items [1]; in general, Apriori is the prominent algorithm for mining the frequent itemsets from which Association Rules are generated. In the literature, there is an assortment of developments in algorithms and techniques for finding frequent itemsets. Nowadays, many applications and databases are in High Dimensional Space and contain multi-valued attributes that pose a great challenge to the knowledge mining process. The efficiency of Association Rule Mining has been a concern for the last decade, since it is a difficult problem: because a database comprehends a large number of attributes and rules, the mining may have to generate or combine an explosive number of candidate patterns and rules, which makes decision making difficult for the user.

Association Rules are classified according to the number of attributes appearing in the Condition Action Rule. Considering each database attribute as a dimension, it is now interesting to mine High Dimensional Association Rules. If a rule contains a large number of attributes in its derivation, it is called a High Dimensional Association Rule [2].
There are several data repositories storing data with different features. In general, the databases contain Qualitative and Quantitative attributes, and mining Association Rules on these databases is a challenging issue. There is a need to develop efficient techniques to mine High Dimensional Data effectively for different applications and decision making. Relatively, there has been steady advancement in mining High Dimensional Association Rules from databases. In this paper, a new approach is projected to discover frequent k-dimension sets and Association Rules efficiently from large databases.

Finally, the main contribution of the paper is to devise a new Apriori-based algorithm, known as the CApriori algorithm, which efficiently prunes the generated frequent k-dimension sets using conviction values. This algorithm also makes use of a Count Array, a one dimensional triple array pair set, to minimize main memory references while accessing the dataset. It further makes use of a compressed data structure, f_list, which is constructed by selecting feature attributes using a Heuristic Fitness Function (HFF), and it uses Heuristic Discretization as a preprocessing technique to make the mining task faster. In Section 2, we present basic preliminaries associated with Association Rule Mining. In Section 3 the proposed approach is elicited. Comprehensive analysis is presented in Section 4 and the conclusion is given in Section 5.

II. BASIC PRELIMINARIES

A. Problem statement

The main objective is to discover Frequent Determinants and Association Rules from a High Dimensional Database D with N attributes and M records. The entry a_mn gives the value of the attribute A_n over the record r_m, as shown in Table 1.

TABLE I. HIGH DIMENSIONAL DATABASE

Ref.ID | A1 | A2 | A3 | A4 | A5 | … | An

Definition 1: A Frequent Determinant pattern set (α, β) is frequent if both α and β in the set are also frequent; it holds in database D if it has high support (s) and confidence (c) in the database.

Definition 2: An Association Rule is an implication of the form α → β which satisfies user-supplied Support s and Confidence c.

Definition 3: Support (s) is defined as the percentage of transactions in D that contain both α and β, represented as

Support (α → β) = P(α ∪ β)

Definition 4: Confidence (c) of the rule α → β holds in database D if it equals the percentage of transactions containing α that also contain β, represented as

Confidence (α → β) = P(β | α) = P(α ∪ β) / P(α)

The problem is to find all Association Rules which satisfy user-specified minimum support and confidence constraints. The problem of generating Association Rules was initially introduced with the well-known Apriori algorithm [3]. There are several algorithms for mining Association Rules, and several studies target obtaining Association Rules from transactional databases. The pre-eminently recognized Apriori is proven to improve performance on 2-Dimensional Transactional Databases. Further, several algorithms similar to Apriori have been designed and developed to find Association Rules from 2-D transactional databases. Since its introduction, there has been considerable work on designing algorithms for mining Association Rules [1]. This work was subsequently extended to find Association Rules over multidimensional databases. Currently, researchers are focusing on designing and developing techniques for High Dimensional Association Rules. The proposed work discusses the issues of effective mining of High Dimensional datasets. The major task of this approach is to find High Dimensional Association Rules which satisfy the conditions of minimum support and confidence.
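To make Definitions 3 and 4 concrete, the following minimal Python sketch computes support and confidence for attribute sets over a toy record collection; the helper names and the toy data are illustrative assumptions, not taken from the paper.

    def support(records, itemset):
        """Definition 3: fraction of records containing every attribute in itemset."""
        return sum(1 for r in records if itemset <= r) / len(records)

    def confidence(records, alpha, beta):
        """Definition 4: P(beta | alpha) = support(alpha U beta) / support(alpha)."""
        s_alpha = support(records, alpha)
        return support(records, alpha | beta) / s_alpha if s_alpha else 0.0

    # Toy high dimensional records: each record is the set of attributes it contains.
    records = [{"A1", "A2", "A3"}, {"A1", "A3"}, {"A2", "A3"}, {"A1", "A2", "A4"}]
    print(support(records, {"A1", "A3"}))       # 0.5
    print(confidence(records, {"A1"}, {"A3"}))  # ~0.67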
It is necessary to understand and analyze such large amounts of data for efficient decision making. A study on mining large databases is presented in [4]. Naturally, in Gene Expression Microarray Databases there could be tens or hundreds of attributes or dimensions in the dataset, and there may be up to hundreds of thousands of samples, each of which is mapped to a dimension. Analyzing these datasets poses great challenges for attribute selection; these challenges are better described in [5], [6], [7]. The complexity of many existing data mining algorithms is exponential with respect to the number of dimensions [5]. With increasing dimensionality, these algorithms soon become computationally intractable and therefore inapplicable in many real applications. The detailed description of the process of discovering Frequent Determinants and generating High Dimensional Association Rules is given in the next section.

III. FREQUENT DETERMINANT MINING

In this section, the process of generating Frequent Determinant Patterns and Association Rules on High Dimensional datasets is described. Generating patterns and rules on High Dimensional datasets using the basic Apriori algorithm is a time-consuming process and generates redundant rules, since it takes only two-dimensional data as input. In this section, a new approach that takes a High Dimensional Dataset as input and produces Association Rules as output is elucidated.

Here, a new measure called Conviction is used along with the Apriori algorithm to prune the Frequent Determinant Pattern Set according to user-supplied prior knowledge in the form of support. The Conviction value is mainly used to reduce the infrequent patterns and generate strong association rules. CApriori uses a One Dimensional Triple Array Pair Set, known as the Count Array (CA), to count the occurrences of attributes in the dataset. The objective of using the CA is to optimize main memory usage. The detailed description of the process is given in the following subsections.

A. Discretization of Quantitative Attributes

An essential preprocessing technique used by many Association Rule Mining algorithms is discretization. For every algorithm it is necessary to define the character of the input data, that is, the data the algorithm can deal with. In the Data Mining literature, databases may include different types of data, such as nominal, ordinal, discrete and continuous. The main objective of preprocessing the data is to reduce a potentially infinite number of values for these data types, and many Association Rule Mining algorithms use discretization as a preprocessing technique. The most frequently used unsupervised discretization types are Equi-depth discretization and Equi-distance discretization.

Discretization is a process which partitions the data into discrete intervals. In general, many discretization methods are guided or controlled by an external expert, such as the "k" in k-means discretization. In this paper, a new Heuristic Discretization algorithm, derived from k-means discretization, is used. Heuristic Discretization is an automatic unsupervised discretization method which is able to adapt to the character of the data set and to combine the advantages of Equi-depth and Equi-distance discretization. It is described as follows.

Heuristic Discretization algorithm

Input: d, the degree of discretization; D, a High Dimensional database with M records; rm, a record of the database
Output: Discretized set of records
• For each discretized attribute:
• Generate the discretized attribute intervals S1, S2, …, Sd (a sketch of this step follows the listing).
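The interval equations in the printed listing were not recoverable, so the sketch below shows one plausible reading of the interval-generation step, assuming a plain 1-D k-means as the starting point the paper names; the function and variable names are illustrative.

    import random

    def heuristic_discretize(values, d, iters=50, seed=0):
        """Partition one quantitative attribute into d intervals via 1-D k-means.
        Returns the d interval centres; each raw value is then mapped to the
        index of its nearest centre. A sketch only: the paper's exact interval
        equations were lost in extraction."""
        random.seed(seed)
        centres = random.sample(sorted(set(values)), d)
        for _ in range(iters):
            bins = [[] for _ in range(d)]
            for v in values:
                bins[min(range(d), key=lambda i: abs(v - centres[i]))].append(v)
            centres = [sum(b) / len(b) if b else centres[i] for i, b in enumerate(bins)]
        return sorted(centres)

    sepal_length = [5.1, 4.9, 6.3, 5.8, 7.1, 6.5, 5.0, 5.5, 6.9, 4.7]
    print(heuristic_discretize(sepal_length, d=3))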
Partitioning the data into intervals renovates the raw data into corresponding binary values [10].

B. Compute Fitness value for attribute selection

Attribute selection plays an important role in predicting the mining outcomes. Because of discretization, a quantitative attribute introduces plentiful sub-attributes into the dataset and the mining task becomes complex, so there is a need to reduce the number of attributes in the mining task. In this paper, a Heuristics-based Fitness Function (HFF) is used to evaluate the quality of each added attribute [9]. Consider each attribute in the dataset as a sequence of data items, and apply the Fitness function to compute the quality of each attribute using the formula

Fitness = (1/N) Σi (SUMi / C)²

where N = the number of discretized attribute bins with Equi-depth partitioning, SUMi = the sum of all random sequences in bin i, and C = the bin capacity.

The attributes whose Fitness value is less than the threshold limit are removed from the dataset because they are not useful for the mining process. After removing such attributes, a compressed data structure f_list is constructed.
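The printed fitness equation did not survive extraction; given the variable definitions (N, SUMi, C) and the bin-packing heuristics of [9], the sketch below assumes a fitness of the form (1/N)·Σ(SUMi/C)². The bin sums and capacity are illustrative values; only the 0.2 threshold comes from the paper's Iris experiment.

    def fitness(bin_sums, capacity):
        """Heuristic Fitness Function for one attribute.
        bin_sums : per-bin sums SUM_i from equi-depth partitioning
        capacity : the bin capacity C
        Assumes the bin-packing style fitness (1/N) * sum((SUM_i / C)**2)."""
        return sum((s / capacity) ** 2 for s in bin_sums) / len(bin_sums)

    # Attributes whose fitness falls below the threshold are dropped before
    # the compressed f_list is built (bin sums and capacity are made up).
    THRESHOLD = 0.2
    bins_per_attr = {"Sepallength": [14.0, 11.5, 9.0], "Petalwidth": [4.5, 3.0, 2.0]}
    CAPACITY = 25.0
    f_list = [a for a, b in bins_per_attr.items() if fitness(b, CAPACITY) >= THRESHOLD]
    print(f_list)  # ['Sepallength']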
C. Conviction-based Apriori Algorithm

In this section, the process to find all Frequent Determinant patterns is described. A Frequent Determinant set is selected using the Conviction value. Conviction values are computed for each frequent pair and used to identify infrequent patterns in the databases; a Frequent Determinant set is selected once it satisfies the conviction-first principle. A few changes are made to the basic Apriori algorithm to select Frequent Determinant sets. The new Conviction-Based Apriori algorithm has the following process:

• Scan the database to obtain frequent pattern sets.
• Every pattern set is first compared with the conviction value and then compared with the support value. If both are satisfied, it produces a valid Frequent Determinant Pattern Set.
• For these Frequent Determinants, strong association rules are generated using confidence and conviction values.

At this stage, the prepared and preprocessed dataset is used as described above. This dataset contains partitioned quantitative attributes and combinations of intervals created from the quantitative attributes. These combinations, along with the values of categorical attributes, collectively form the frequent dimension sets. The f-list contains all possible frequent patterns generated from the dataset. The description of the algorithm is as follows. Let f-list represent the set of frequent pattern sets and Ct the set of candidate sets for the frequent pattern set.

Conviction-Based Apriori algorithm for Discovering Frequent Determinant Pattern sets

Input: Reduced Feature Matrix D, minsup and minconf
Output: complete set of Frequent Determinant Patterns
1. Set Count Array[i].count = 0;
2. For each reference id in the reduced feature matrix
3. For (k = 2; Fk-1 ≠ ∅; k++)
4. for all references Rid ∈ D do Ct = subset(f-list).
5. For each pattern pair (A, B) in the f-list compute Conviction Value (A, B);
6. for all patterns c ∈ Ct in the f-list which are above the conviction value, update Count Array[i].count++
7. if Count Array[i].count ≥ minsup then the pattern (A, B) is discovered as a Frequent Determinant

The algorithm reduces passes over the database, where each pass consists of two phases. The algorithm scans the database for each frequent pattern set to discover Conviction(A, B) using the formula given below. For each pattern pair A, B ∈ f-list with A ≠ B, the Conviction value is computed as

Conviction Value (CV) = P(A) P(B) / P(AB)

If the conviction value is above the limit, all Frequent Determinant patterns found in the (k−1)th pass are used to generate the candidate dimension set Ct; this ensures that Ct is a superset of the set of all frequent sets. After this procedure, each discovered pattern is compared with the user-supplied support value to determine which of the pattern sets are Frequent Determinant pattern sets. The process terminates when no more patterns are added to the Count Array.

The Conviction value is used to identify infrequent itemsets in the database: the patterns whose values are less than the Conviction threshold are discovered as infrequent pattern sets. Beginning with the frequent pattern sets, all Frequent Determinant sets are generated by the procedure given above. Let Lk be the set of frequent pattern sets and Ct the set of candidate sets; if an itemset (A, B) is frequent, then c ∈ Ct. The algorithm makes use of a One Dimensional Triple Array pair set as the Count Array to measure the frequency of occurrence. If its count value is less than the Conviction value, the frequent pattern is treated as an infrequent pattern and removed from the set. At each iteration step, the pattern attributes in a frequent pattern set which do not satisfy the user-specified interest measure 'support' are pruned further.
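A minimal Python sketch of the double comparison follows; conviction_value() implements CV(A, B) = P(A)P(B)/P(AB) exactly as defined above, support() repeats the Section II helper, and the threshold names are illustrative.

    def support(records, itemset):  # as in the Section II sketch
        return sum(1 for r in records if itemset <= r) / len(records)

    def conviction_value(records, a, b):
        """CV(A, B) = P(A) * P(B) / P(AB); infinite when the pair never co-occurs."""
        p_ab = support(records, {a, b})
        return support(records, {a}) * support(records, {b}) / p_ab if p_ab else float("inf")

    def capriori_pairs(records, f_list, min_cv, minsup):
        """Keep a pattern pair only if it first clears the conviction threshold
        and then the support threshold (the CApriori double comparison)."""
        kept = []
        for i, a in enumerate(f_list):
            for b in f_list[i + 1:]:
                if conviction_value(records, a, b) >= min_cv and \
                   support(records, {a, b}) >= minsup:
                    kept.append((a, b))
        return kept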
D. One Dimensional Triple Array Pair Set

In general, Association Rule Mining algorithms maintain various item count frequency values throughout a scan over the database. For instance, it is essential to have adequate main memory to hold each pattern count, that is, the number of times a pattern pair set occurs in the transaction database. It is hard to add 1 to a count when the
counting sequences are stored in different memory locations and it is difficult to load the page into main memory. In such cases, these algorithms are slow in finding a pattern pair count in main memory, as it takes extra overhead in processing time and increases the time to discover frequent pattern sets. It is difficult to count anything that does not fit in main memory, so each algorithm has a limit on how many items it can deal with. When it comes to high dimensional datasets, it is difficult to maintain everything in one memory. So a new one dimensional triple array set is used to count all the pattern occurrences in the given database.

To optimize main memory, each occurrence of a pattern pair (i, j) in the dataset should be counted in one place. One option is to order each pattern pair such that i < j and use only the entry a[i, j] of a two-dimensional array a; however, this strategy leaves half of the array unused. The Count Array (CA) is a more efficient way to store pattern pairs in memory. A Count Array is defined as a one-dimensional triple array set which stores the count for the pair (i, j), with 1 ≤ i < j ≤ n, as a[k], where

k = (i − 1)(n − i/2) + j − i

The pattern pair sets are stored in lexicographic order.
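The indexing equation given above is the standard triangular-matrix layout; the paper's own printed formula was not recoverable, so the following Count Array sketch rests on that assumption.

    class CountArray:
        """One dimensional triple (i, j, count) storage for pattern pairs.
        The count of pair (i, j), 1 <= i < j <= n, lives in the single slot
        k = (i - 1)(n - i/2) + j - i, so only n(n-1)/2 cells are kept instead
        of a half-empty n x n matrix."""

        def __init__(self, n):
            self.n = n
            self.counts = [0] * (n * (n - 1) // 2)

        def _slot(self, i, j):
            if i > j:
                i, j = j, i
            return int((i - 1) * (self.n - i / 2) + j - i) - 1  # 0-based

        def increment(self, i, j):
            self.counts[self._slot(i, j)] += 1

        def count(self, i, j):
            return self.counts[self._slot(i, j)]

    ca = CountArray(n=5)
    ca.increment(2, 4)
    print(ca.count(4, 2))  # 1 -- the order of i and j does not matter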
E. Generating and Validating Association Rules

In the second phase of Association Rule Mining, the basic rule generation algorithm is used to generate Association Rules from the Frequent Determinant sets. The strong rules, which have maximum support and confidence, are validated. Once the frequent attributes have been found, it is straightforward to generate strong Association Rules satisfying both minimum support and confidence. This process generates all Association Rules which are above the confidence value. Redundant rules are eliminated by selecting feature attributes using the Fitness function.
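A sketch of this second phase is given below, reusing the confidence() helper from the Section II sketch; emitting both rule directions for a pair is an assumption, since the paper does not spell the enumeration out.

    def generate_rules(records, frequent_pairs, minconf):
        """Emit A -> B and B -> A for every Frequent Determinant pair whose
        confidence clears the user-supplied threshold."""
        rules = []
        for a, b in frequent_pairs:
            for lhs, rhs in ((a, b), (b, a)):
                c = confidence(records, {lhs}, {rhs})
                if c >= minconf:
                    rules.append((lhs, rhs, c))
        return rules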
IV. RESULTS AND DISCUSSION

In this section, the performance of the proposed approach is evaluated based on support and elapsed time with respect to three factors, namely the number of Frequent Determinant sets, the number of strong rules generated, and the Dimensionality (D). The elapsed time is measured as the duration (in milliseconds) to generate all frequent pattern sets. The datasets used for evaluation are shown in Table 2. The datasets are thoroughly preprocessed to replace integer and real data values with integer and binary values. Using these datasets, the interesting Association Rules with varying support and confidence values are generated. Table 3 shows the evaluation of the Heuristic Discretization algorithm on the Iris data.

TABLE III. DISCRETIZATION VALUES OF IRIS DATASET

Discretized Intervals | Attribute1 Sepallength | Attribute2 Sepalwidth | Attribute3 Petallength | Attribute4 Petalwidth
Interval 1 | 4.7748 | 3.1789 | 1.4194 | 0.1948
Interval 2 | 6.8585 | 3.0862 | 5.7859 | 2.1327
Interval 3 | 6.1613 | 2.8547 | 4.7484 | 1.5757
Interval 4 | 5.2823 | 3.7037 | 1.5173 | 0.3028
Interval 5 | 5.5432 | 2.5786 | 3.863 | 1.1696
Fitness Value | 0.26 | 0.23 | 0.28 | 0.15

In Table 3, for each attribute and each interval, the calculated average Sd is shown, along with the calculated fitness value for each attribute. The threshold value for the fitness function is 0.2, so the attributes selected for the mining process are Sepallength, Sepalwidth and Petallength.

The graphs shown in Figure 1 and Figure 2 depict the execution times of discovering Frequent Determinant sets on two datasets with varying combinations of parameter values.

Fig. 1. Performance on synthetic dataset with varying support values
The increase of elapsed time with the decrease of the support value is noticeable.

The numbers of Frequent Determinant Pattern sets generated for the datasets with varying support values are shown in Table 4, and it is observed that the number of generated Frequent Determinant Pattern sets decreases as the support values increase. As can be seen, in all the experiments the runtime is highest when the support values are lowest; as the support values increase, the elapsed time gradually reduces. The reason is that more Frequent Determinant Pattern sets are generated when support values are low, and fewer when support values increase.

TABLE IV. FREQUENT DETERMINANT PATTERN SET GENERATION

Support | T100-AT10-I100-P50-AP5 | SAGE
0.2 | 131 | 9945
0.4 | 43 | 5689
0.6 | 16 | 2673
0.8 | 2 | 809
1.0 | 0 | 89

The numbers of interesting Association Rules generated with varying confidence values on the SAGE dataset are shown in Table 5. As the confidence values increase, the number of rules and the elapsed time decrease.

TABLE V. RESULTS ON SAGE DATASET

Confidence | Number of rules generated | Elapsed time in ms
0.2 | 37520 | 621
0.4 | 31110 | 490
0.6 | 18300 | 141
0.8 | 8230 | 78
1.0 | 5860 | 57

V. CONCLUSION

In recent years, Association Rule Mining has gained considerable interest in the research community. In this paper, a framework approach for mining Association Rules on High Dimensional data using the new CApriori is discussed. The quantitative data values are better dealt with by a partitioning method, and the different attributes are combined into a master table from which effective Frequent Determinant Pattern sets can be easily generated. From the experiments, it is ascertained that strong rules are generated with the feature attributes selected using the Heuristic Fitness function, and that efficient frequent patterns are selected using Conviction values. With the above results, our approach for mining High Dimensional Association Rules produces better results compared to the traditional method.

REFERENCES

[1] Rakesh Agrawal, Tomasz Imielinski and Arun Swami, "Mining Association Rules between Sets of Items in Large Databases", in Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 207-216, Washington, D.C., May 1993.
[2] F. Bodon, "A Fast Apriori Implementation", FIMI'03, November 2003.
[3] Rakesh Agrawal, Tomasz Imielinski and Arun Swami, "Database Mining: A Performance Perspective", IEEE Transactions on Knowledge and Data Engineering, vol. 5, 1993.
[4] M. J. Zaki and C. J. Hsiao, "CHARM: An Efficient Algorithm for Closed Itemset Mining", in Proceedings of SDM 2002, pp. 457-473, 2002.
[5] C. C. Aggarwal, A. Hinneburg and D. A. Keim, "On the Surprising Behavior of Distance Metrics in High Dimensional Space", IBM Research Report, RC 21739, 2000.
[6] K. Beyer, J. Goldstein, R. Ramakrishnan and U. Shaft, "When Is Nearest Neighbor Meaningful?", in Proceedings of the 7th ICDT, Jerusalem, Israel, 1999.
[7] K. Beyer and R. Ramakrishnan, "Bottom-Up Computation of Sparse and Iceberg Cubes", in Proceedings of the ACM SIGMOD 1999 International Conference on Management of Data, Philadelphia, PA, pp. 359-370, 1999.
[8] K. Prasanna and M. Seetha, "Mining High Dimensional Association Rules by Generating Large Frequent K-Dimension Set", in Proceedings of the IEEE International Conference on Data Science and Engineering, Kochin, India, 2012.
[9] Masri Ayob and Yang Xiao Fei, "Local Search Heuristics for One Dimensional Bin Packing Problem", International Journal of Soft Computing, 8(2):108-112, 2013.
[10] Ramakrishna Srikant and Rakesh Agrawal, "Mining Quantitative Association Rules in Large Relational Tables", in Proceedings of ACM SIGMOD, USA, 1996.