1. Explain Apriori algorithm with example or Finding Frequent Itemsets Using Candidate Generation

The Apriori algorithm is an algorithm used for mining frequent itemsets and the relevant association rules. Generally, it operates on a database containing a huge number of transactions, for example, the items customers buy at a Big Bazar.

The Apriori algorithm helps the customers to buy their products with ease and increases the sales performance of the particular store.

Components of Apriori algorithm

The following three components comprise the Apriori algorithm:

• Support
• Confidence
• Lift

Suppose you have 4,000 customer transactions at a Big Bazar. You have to calculate the Support, Confidence, and Lift for two products, say Biscuits and Chocolates, because customers frequently buy these two items together.

Out of the 4,000 transactions, 400 contain Biscuits and 600 contain Chocolates, and 200 transactions contain both Biscuits and Chocolates. Using this data, we will find the support, confidence, and lift.

Support

Support refers to the default popularity of any product. You find the support by dividing the number of transactions containing that product by the total number of transactions. Hence, we get

Support (Biscuits) = (Transactions containing Biscuits) / (Total transactions)

= 400/4000 = 10 percent

Confidence

Confidence refers to the likelihood that customers who bought biscuits also bought chocolates. To get the confidence, you divide the number of transactions that contain both biscuits and chocolates by the number of transactions that contain biscuits.
Hence,

Confidence = (Transactions containing both Biscuits and Chocolates) / (Transactions containing Biscuits)

= 200/400

= 50 percent

It means that 50 percent of the customers who bought biscuits also bought chocolates.

Lift

Continuing the above example, lift refers to the increase in the ratio of the sale of chocolates when you sell biscuits. It is the confidence of the rule divided by the support of the item being recommended (Chocolates):

Lift = Confidence(Biscuits => Chocolates) / Support(Chocolates)

= 50/15 ≈ 3.33

A lift greater than 1 means that biscuits and chocolates are positively correlated: the rule does better than random chance.
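As a quick check, here is a minimal Python sketch that reproduces these numbers (the counts are the ones assumed in the example above):

# Counts from the Big Bazar example above.
total_transactions = 4000
biscuit_count = 400        # transactions containing Biscuits
chocolate_count = 600      # transactions containing Chocolates
both_count = 200           # transactions containing both

support_biscuits = biscuit_count / total_transactions        # 0.10
support_chocolates = chocolate_count / total_transactions    # 0.15
confidence = both_count / biscuit_count                      # 0.50
lift = confidence / support_chocolates                       # ~3.33

print(f"Support(Biscuits) = {support_biscuits:.0%}")
print(f"Confidence        = {confidence:.0%}")
print(f"Lift              = {lift:.2f}")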

How does the Apriori Algorithm work in Data Mining?

We will understand this algorithm with the help of an example.

Consider a Big Bazar scenario where the product set is P = {Rice, Pulse, Oil,
Milk, Apple}. The database comprises six transactions where 1 represents the
presence of the product and 0 represents the absence of the product.

The Apriori algorithm makes the given assumptions:

• All subsets of a frequent itemset must be frequent.
• All supersets of an infrequent itemset must be infrequent.
• A threshold support level is fixed; in our case, we have fixed it at 50 percent.

Step 1

Make a frequency table of all the products that appear in the transactions. Now, shortlist the frequency table to retain only those products that meet the 50 percent threshold support level. The resulting frequency table contains the products frequently bought by the customers.

Step 2

Create pairs of products such as RP, RO, RM, PO, PM, OM. You will get the
given frequency table.

Step 3

Apply the same threshold support of 50 percent and consider only the pairs that exceed it. In our case, with six transactions, that means a count of more than 3.

Thus, we get RP, RO, PO, and PM

Step 4
Now, look for a set of three products that the customers buy together. We get
the given combination.

1. RP and RO give RPO


2. PO and PM give POM

Step 5

Calculate the frequency of these two itemsets (RPO and POM), and you will get the given frequency table.
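To make the procedure concrete, here is a minimal Python sketch of Apriori-style candidate generation. The six transactions are invented for illustration (the document's original transaction table is not reproduced), and only the frequent-itemset search is shown, not rule generation:

from itertools import combinations

# Hypothetical transactions over P = {Rice, Pulse, Oil, Milk, Apple}.
transactions = [
    {"Rice", "Pulse", "Oil"},
    {"Rice", "Pulse", "Milk"},
    {"Rice", "Oil", "Milk"},
    {"Pulse", "Oil", "Milk"},
    {"Rice", "Pulse", "Oil", "Milk"},
    {"Apple", "Oil"},
]
min_support = 3  # 50 percent of 6 transactions

def support_count(itemset):
    """Number of transactions containing every item in the itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)

# Pass 1: frequent 1-itemsets.
items = {item for t in transactions for item in t}
frequent = [{i} for i in items if support_count({i}) >= min_support]

k = 2
while frequent:
    print(f"Frequent {k - 1}-itemsets:", frequent)
    # Candidate generation: join frequent (k-1)-itemsets into k-itemsets.
    candidates = {frozenset(a | b) for a in frequent for b in frequent
                  if len(a | b) == k}
    # Apriori pruning: every (k-1)-subset of a candidate must be frequent;
    # then count each surviving candidate's support against the database.
    frequent = [set(c) for c in candidates
                if all(set(s) in frequent for s in combinations(c, k - 1))
                and support_count(c) >= min_support]
    k += 1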
2. Explain in detail the various methods that improve the efficiency of the Apriori algorithm?

Ans:

Techniques to improve the efficiency of the Apriori algorithm:

Hash-based technique

Transaction reduction

Partitioning

Sampling

Dynamic itemset counting

Hash-Based Technique:

A hash-based technique hashes the itemsets generated during a scan into the buckets of a hash table and counts each bucket. If a bucket's count is below the minimum support threshold, none of the itemsets hashed to that bucket can be frequent, so all of those candidates can be pruned. This is especially effective for reducing the number of candidate 2-itemsets.
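A rough Python sketch of this idea for candidate 2-itemsets (the transactions, bucket count, and hash function are illustrative assumptions):

from itertools import combinations

transactions = [{"A", "B", "C"}, {"A", "B"}, {"B", "C"}, {"A", "C"}, {"A", "B", "C"}]
min_support = 2
n_buckets = 7  # arbitrary small hash table for illustration

# While scanning for 1-itemset counts, also hash every 2-itemset
# occurring in each transaction into a bucket and count the bucket.
bucket_counts = [0] * n_buckets
for t in transactions:
    for pair in combinations(sorted(t), 2):
        bucket_counts[hash(pair) % n_buckets] += 1

# A candidate 2-itemset can be pruned if its bucket count is below
# min_support: the bucket count bounds the pair's true support from above.
def may_be_frequent(pair):
    return bucket_counts[hash(tuple(sorted(pair))) % n_buckets] >= min_support

print(may_be_frequent({"A", "B"}))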

Transaction Reduction:

A transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets, so such a transaction can be marked or removed from the database and skipped in subsequent scans.

Recall that Apriori is an algorithm for frequent itemset mining and association rule learning over transactional databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger itemsets as long as those itemsets appear sufficiently often in the database. Apriori uses a "bottom-up" approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found. Transaction reduction shrinks the database these repeated passes must scan.
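A minimal sketch of the transaction-reduction idea in Python (the helper name and the sample data are assumptions for illustration):

def reduce_transactions(transactions, frequent_k_itemsets):
    """Drop transactions that contain no frequent k-itemset: they
    cannot contribute to any frequent (k+1)-itemset in later passes."""
    return [t for t in transactions
            if any(itemset <= t for itemset in frequent_k_itemsets)]

# Example: after pass k = 2, keep only transactions that still matter.
transactions = [{"Rice", "Pulse"}, {"Oil"}, {"Rice", "Oil", "Milk"}]
frequent_2 = [frozenset({"Rice", "Pulse"}), frozenset({"Rice", "Oil"})]
transactions = reduce_transactions(transactions, frequent_2)
print(transactions)  # the {"Oil"} transaction is removed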

Partitioning:

The partitioning technique requires just two database scans to mine the frequent itemsets. The database is divided into n non-overlapping partitions, each small enough to fit in main memory. In the first scan, the locally frequent itemsets of each partition are found, using the minimum support expressed as a fraction of the partition size. Any itemset that is frequent in the whole database must be frequent in at least one partition, so the union of all locally frequent itemsets forms the global candidate set. In the second scan, the actual support of each candidate is counted to determine the globally frequent itemsets.

Sampling:

The sampling improvement to Apriori draws a random sample S from the original database, small enough to be stored in main memory, and mines the frequent itemsets in S instead of the whole database, which reduces the mining time. Because a sample can miss some globally frequent itemsets, a minimum support threshold lower than the target is typically used on the sample. Random sampling has the advantage of being simple and quick.
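A brief Python sketch of the sampling idea (the sample fraction, seed, and data are illustrative assumptions):

import random

def sample_database(transactions, fraction=0.1, seed=42):
    """Draw a random sample of transactions that fits in memory."""
    rng = random.Random(seed)
    k = max(1, int(len(transactions) * fraction))
    return rng.sample(transactions, k)

transactions = [{"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"},
                {"B"}, {"A", "C"}, {"C"}, {"A", "B"}, {"B", "C"}, {"A"}]
sample = sample_database(transactions, fraction=0.5)
# Mine `sample` (e.g. with the Apriori sketch above) using a support
# threshold slightly below the target, to reduce the chance of missing
# itemsets that are frequent in the full database.
print(sample)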

Dynamic Itemset Counting:

This is an alternative to Apriori itemset generation: itemsets are dynamically added and deleted as transactions are read. It relies on the fact that for an itemset to be frequent, all of its subsets must also be frequent, so we only examine those itemsets whose subsets are all frequent.

Itemsets are marked in four different ways as they are counted:

• Solid box: confirmed frequent itemset - an itemset we have finished counting that exceeds the support threshold minsupp
• Solid circle: confirmed infrequent itemset - we have finished counting and it is below minsupp
• Dashed box: suspected frequent itemset - an itemset we are still counting that exceeds minsupp
• Dashed circle: suspected infrequent itemset - an itemset we are still counting that is below minsupp
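A tiny Python sketch of the four counting states just listed (the names and helper are illustrative, not part of a specific library):

from enum import Enum

class State(Enum):
    SOLID_BOX = "confirmed frequent"        # finished counting, >= minsupp
    SOLID_CIRCLE = "confirmed infrequent"   # finished counting, < minsupp
    DASHED_BOX = "suspected frequent"       # still counting, >= minsupp
    DASHED_CIRCLE = "suspected infrequent"  # still counting, < minsupp

def classify(count, minsupp, finished):
    """Map an itemset's current count to its DIC marker."""
    if finished:
        return State.SOLID_BOX if count >= minsupp else State.SOLID_CIRCLE
    return State.DASHED_BOX if count >= minsupp else State.DASHED_CIRCLE

print(classify(count=5, minsupp=3, finished=False))  # State.DASHED_BOX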
3. Describe FP-growth algorithm with example?
This algorithm is an improvement to the Apriori method. A frequent pattern is
generated without the need for candidate generation. FP growth algorithm
represents the database in the form of a tree called a frequent pattern tree or FP
tree.
This tree structure will maintain the association between the itemsets. The
database is fragmented using one frequent item. This fragmented part is called
“pattern fragment”. The itemsets of these fragmented patterns are analyzed.
Thus with this method, the search for frequent itemsets is reduced
comparatively.
FP Tree
Frequent Pattern Tree is a tree-like structure that is made with the initial
itemsets of the database. The purpose of the FP tree is to mine the most frequent
pattern. Each node of the FP tree represents an item of the itemset.
The root node represents null while the lower nodes represent the itemsets. The
association of the nodes with the lower nodes that is the itemsets with the other
itemsets are maintained while forming the tree.
Frequent Pattern Algorithm Steps
#1) The first step is to scan the database to find the occurrences of the itemsets
in the database. This step is the same as the first step of Apriori. The count of 1-
itemsets in the database is called support count or frequency of 1-itemset.
#2) The second step is to construct the FP tree. For this, create the root of the
tree. The root is represented by null.
#3) The next step is to scan the database again and examine the transactions. Examine the first transaction and find out the itemsets in it. The itemset with the max count is taken at the top, the next itemset with the lower count below it, and so on. It means that the branch of the tree is constructed with the transaction's itemsets in descending order of count.
#4) Examine the next transaction in the same way, with its items again ordered in descending order of count. If this transaction shares a prefix with an existing branch, the common nodes are reused; only the remaining items create new nodes.
#5) The count of each itemset is incremented as it occurs in the transactions: the count of every common node is increased by 1, and new nodes are created with a count of 1 and linked according to the transactions.
#6) The next step is to mine the created FP tree. For this, the lowest node is examined first, along with the links of the lowest nodes. The lowest node represents a frequent pattern of length 1. From there, traverse the paths in the FP tree. These paths are called the conditional pattern base.
The conditional pattern base is a sub-database consisting of the prefix paths in the FP tree that occur with the lowest node (the suffix).
#7) Construct a Conditional FP Tree, which is formed by a count of itemsets in
the path. The itemsets meeting the threshold support are considered in the
Conditional FP Tree.
#8) Frequent Patterns are generated from the Conditional FP Tree.
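To ground the construction steps, here is a compact Python sketch of building an FP tree. The transaction data is made up, and mining the tree into conditional pattern bases is omitted to keep the sketch short:

from collections import Counter, defaultdict

class FPNode:
    def __init__(self, item, parent=None):
        self.item = item      # item stored at this node (None for root)
        self.count = 0        # how many transactions pass through here
        self.parent = parent
        self.children = {}    # item -> FPNode

def build_fp_tree(transactions, min_support):
    # Pass 1: support count of each 1-itemset (step #1).
    counts = Counter(item for t in transactions for item in t)
    frequent = {i for i, c in counts.items() if c >= min_support}

    root = FPNode(None)  # the root represents null (step #2)
    header = defaultdict(list)  # item -> nodes, used later for mining
    for t in transactions:
        # Keep frequent items, ordered by descending count (steps #3/#4).
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-counts[i], i))
        node = root
        for item in items:
            # Reuse a common-prefix node if it exists, else create one (#5).
            if item not in node.children:
                node.children[item] = FPNode(item, parent=node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

transactions = [{"A", "B"}, {"B", "C", "D"}, {"A", "B", "C"}, {"A", "B", "D"}]
root, header = build_fp_tree(transactions, min_support=2)
print({item: sum(n.count for n in nodes) for item, nodes in header.items()})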
Mining Multilevel Association Rules
For many applications, it is difficult to find strong associations among data
items at low or primitive levels of abstraction due to the sparsity of data at
those levels.

Mining Multidimensional Association Rules from Relational Databases and Data Warehouses

For instance, in mining our AllElectronics database, we may discover the Boolean association rule

buys(X, "digital camera") => buys(X, "HP printer")

which contains a single repeated predicate (buys) and is therefore a single-dimensional rule. Considering each database attribute or warehouse dimension as a predicate, we can therefore mine association rules containing multiple predicates, such as

age(X, "20...29") ∧ occupation(X, "student") => buys(X, "laptop")

which involves three different predicates (age, occupation, and buys) and is a multidimensional rule.
5. Explain mining frequent patterns?

Ans: A frequent pattern is a pattern which appears frequently in a data set. By identifying frequent patterns, we can observe strongly correlated items and easily identify similar characteristics and associations among them.

Support: How often a given rule appears in the database being mined.

Confidence: The number of times a given rule turns out to be true in practice.

Example: One possible association rule is A => D

Total no. of transactions (N) = 5

Frequency(A, D) = 3, i.e. A and D appear together in 3 transactions

Frequency(A) = 4, i.e. A occurs in 4 transactions

Support = 3 / 5

Confidence = 3 / 4
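A small Python check of these two numbers (the five transactions are hypothetical but chosen to match the counts above):

# Hypothetical transactions matching Frequency(A) = 4, Frequency(A, D) = 3.
transactions = [{"A", "D"}, {"A", "B", "D"}, {"A", "C", "D"},
                {"A", "B"}, {"B", "C"}]
n = len(transactions)

freq_a = sum(1 for t in transactions if "A" in t)          # 4
freq_ad = sum(1 for t in transactions if {"A", "D"} <= t)  # 3

print("Support(A => D)    =", freq_ad, "/", n)       # 3 / 5
print("Confidence(A => D) =", freq_ad, "/", freq_a)  # 3 / 4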

In frequent pattern mining, there are 2 categories to be considered:

1. Mining frequent pattern with candidate generation

2. Mining frequent pattern without candidate generation

Generate Candidate set 1, do the first scan, and generate the one-item set

In this stage, we take the sample data set, count each individual item, and form frequent itemset 1 (k = 1). Since the minimum support is 2, item E is removed from Candidate set 1.

Generate Candidate set 2, do the second scan, and generate the two-item set

Through this step, we create frequent itemset 2 (k = 2) and take each pair's support count. Since the minimum support is 2, itemset {B, D} is removed from Candidate set 2.

Generate Candidate set 3, do the third scan, and generate the three-item set

In this iteration, we create frequent itemset 3 (k = 3), take the support counts, and compare them with the minimum support value from Candidate set 3.
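Since the original tables are not reproduced here, a hypothetical data set that behaves like this walkthrough (item E falls below the minimum support of 2 in the first scan, and the pair {B, D} in the second) can make it concrete:

from itertools import combinations

transactions = [{"A", "B", "C"}, {"A", "C", "D"}, {"B", "C"},
                {"A", "B", "C", "D"}, {"A", "C", "E"}]
min_support = 2

def frequent_k(k):
    """Count all k-item combinations per transaction; keep frequent ones."""
    counts = {}
    for t in transactions:
        for combo in combinations(sorted(t), k):
            counts[combo] = counts.get(combo, 0) + 1
    return {c: n for c, n in counts.items() if n >= min_support}

for k in (1, 2, 3):
    print(f"Frequent {k}-itemsets:", frequent_k(k))
# E drops out at k = 1 and ("B", "D") at k = 2, mirroring the eliminations.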

6. Explain association and correlation analysis?

Ans:

Correlation Analysis :

Correlation analysis is a method of statistical evaluation used to study the strength of a relationship between two numerically measured, continuous variables (e.g. height and weight). This type of analysis is useful when a researcher wants to establish whether there are possible connections between variables.

If a correlation is found, depending upon the numerical values measured, it can be either positive or negative.

Positive correlation exists if one variable increases simultaneously with the other, i.e. the high numerical values of one variable relate to the high numerical values of the other.

Negative correlation exists if one variable decreases when the other increases,
i.e. the high numerical values of one variable relate to the low numerical values
of the other.
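A short Python illustration of measuring the strength of such a relationship with the Pearson correlation coefficient (the sample data is invented):

# Pearson correlation coefficient for two numeric samples.
heights = [150, 160, 165, 170, 180]   # invented example data
weights = [50, 58, 63, 66, 75]

n = len(heights)
mean_h = sum(heights) / n
mean_w = sum(weights) / n
cov = sum((h - mean_h) * (w - mean_w) for h, w in zip(heights, weights)) / n
std_h = (sum((h - mean_h) ** 2 for h in heights) / n) ** 0.5
std_w = (sum((w - mean_w) ** 2 for w in weights) / n) ** 0.5

r = cov / (std_h * std_w)
print(f"r = {r:.2f}")  # close to +1: strong positive correlation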
Association Analysis:

Association Rule :

An association rule is an implication expression of the form X→Y, where X and Y are disjoint itemsets (X∩Y=∅).

The strength of an association rule can be measured in terms of its support and
confidence. A rule that has very low support may occur simply by chance.
Confidence measures the reliability of the inference made by a rule.

Support of an association rule X→Y:

s(X→Y) = σ(X∪Y) / N

where σ(X∪Y) is the support count of X∪Y (the number of transactions containing both X and Y) and N is the number of transactions in the transaction set T.

Confidence of an association rule X→Y:

conf(X→Y) = σ(X∪Y) / σ(X)

where σ(X) is the support count of X.

Interest of an association rule X→Y:

I(X→Y) = P(X,Y) / (P(X) × P(Y))

where P(Y) = s(Y) is the support of Y (the fraction of baskets that contain Y).

If the interest of a rule is close to 1, it is uninteresting:

I(X→Y) = 1 → X and Y are independent

I(X→Y) > 1 → X and Y are positively correlated

I(X→Y) < 1 → X and Y are negatively correlated
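A minimal Python helper for this measure; the demo reuses the Big Bazar counts from question 1 as an assumed example:

def interest(n_xy, n_x, n_y, n):
    """I(X -> Y) = P(X,Y) / (P(X) * P(Y)), estimated from counts."""
    return (n_xy / n) / ((n_x / n) * (n_y / n))

# 200 baskets with both X and Y, 400 with X, 600 with Y, out of 4000.
i = interest(200, 400, 600, 4000)
print(f"I(X -> Y) = {i:.2f}")  # > 1, so X and Y are positively correlated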

7. Explain pattern mining in multilevel?

Ans: Multilevel Association Rule:

Association rules generated from mining data at multiple levels of abstraction are called multiple-level or multilevel association rules.

Multilevel association rules can be mined efficiently using concept hierarchies under a support-confidence framework.

Rules at a high concept level may add to common sense, while rules at a low concept level may not always be useful.

Need for multilevel association rules:

Sometimes, at a low data level, the data does not show any significant pattern, yet there is useful information hidden behind it.

The aim is to find the hidden information in or between levels of abstraction.

Approaches to multilevel association rule mining:

Uniform Support (using a uniform minimum support for all levels)

Reduced Support (using a reduced minimum support at lower levels)

Group-based Support (using item or group based support)

Uniform Support –

When a uniform minimum support threshold is used, the search procedure is simplified. The method is also simple in that users are required to specify only a single minimum support threshold.

Reduced Support –

For mining multiple-level associations with reduced support, there are several alternative search strategies, as follows.

• Level-by-level independent – This is a full-breadth search, where no background knowledge of frequent itemsets is used for pruning; each node is examined regardless of whether its parent node is frequent.

• Level-cross filtering by k-itemset – A k-itemset at the i-th level is examined if and only if its corresponding parent k-itemset at the (i-1)-th level is frequent.

Group-based support –

The group-wise threshold value for support and confidence is input by the user or an expert. The group is selected based on a product price or item set, because the expert often has insight as to which groups are more important than others.

Example –

Experts are interested in purchase patterns of laptops or clothes in the electronics and non-electronics categories. Therefore, a low support threshold is set for these groups to give attention to these items' purchase patterns.
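A minimal Python sketch of the multilevel idea: roll items up through a concept hierarchy and count support at each level (the hierarchy, data, and thresholds are invented):

# Concept hierarchy: item -> higher-level category.
hierarchy = {"laptop": "computer", "desktop": "computer",
             "shirt": "clothes", "jeans": "clothes"}

transactions = [{"laptop", "shirt"}, {"laptop"}, {"desktop", "jeans"},
                {"shirt", "jeans"}, {"laptop", "jeans"}]

def support(item, txns):
    return sum(1 for t in txns if item in t) / len(txns)

# Roll each transaction up to the higher concept level.
rolled_up = [{hierarchy[i] for i in t} for t in transactions]

# Reduced support: a lower threshold at the lower (more specific) level.
print("computer (high level):", support("computer", rolled_up))  # 0.8
print("laptop (low level):  ", support("laptop", transactions))  # 0.6
# With e.g. min_sup = 0.7 at the high level and 0.5 at the low level,
# both "computer" and "laptop" are frequent at their respective levels.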
8. Explain multidimensional space?

Ans: Multidimensional Space

A space having more than three dimensions. Ordinary Euclidean space studied in
elementary geometry is three dimensional, planes are two dimensional, and
lines are one dimensional. The concept of a multidimensional space arose in the
process of the generalization of the subject of geometry.

In an n-dimensional space, we consider not only two-dimensional planes but also k-dimensional planes (k < n), which, as in ordinary Euclidean space, are defined by linear equations or by systems of such equations.

• A dimension describes some aspect of the data that the company wants to analyze. For example, your company's data might have a time element in it; Time could then become a dimension in your model.
• A member corresponds to one point on a dimension. For example, in the
Time dimension, Monday would be a dimension member.
• A value is a unique characteristic of a member. For example, in the Time
dimension, 5/12/2008 might be the value of the member with the caption
“Monday.”
• An attribute is the full collection of members. For example, all the days
of the week would be an attribute of the Time dimension.
• The size, or cardinality, of a dimension is the number of members it
contains. For example, a Time dimension made up of the days of the
week would have a size of 7.
The following list defines some more of the common terms we use in
describing a multidimensional space.

A tuple is a coordinate in multidimensional space.

A slice is a section of multidimensional space that can be defined by a tuple.

Aggregation function—A function that enables us to calculate the values of cells in the logical space from the values of the cells in the fact space

Attribute—A collection of similar members of a dimension

Cell value—A measure value of a cell

Dimension—An element in the data that the company wants to analyze

Dimension hierarchy—An ordered structure of dimension members

Dimension size—The number of members a dimension contains

Measure—The value in a cell

Member—One point on a dimension

Member value—A unique characteristic of a member

Tuple—A coordinate in multidimensional space

Slice—A section of multidimensional space that can be defined by a tuple

Subcube—A portion of the full space of a cube
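As an illustration of these terms, here is a small Python sketch representing a cube as cells keyed by tuples of dimension members (the dimensions and values are made up):

# A cube over two dimensions: Time (members: days) and Product.
# Each cell value is a measure (here, units sold); keys are tuples.
cube = {
    ("Monday", "Biscuits"): 120,
    ("Monday", "Chocolates"): 80,
    ("Tuesday", "Biscuits"): 95,
    ("Tuesday", "Chocolates"): 60,
}

# A tuple is a coordinate in the multidimensional space.
print(cube[("Monday", "Biscuits")])  # cell value at one coordinate

# A slice fixes one dimension member: all cells where Time = "Monday".
monday_slice = {k: v for k, v in cube.items() if k[0] == "Monday"}
print(monday_slice)

# An aggregation function computes higher-level values from cell values.
total_biscuits = sum(v for k, v in cube.items() if k[1] == "Biscuits")
print(total_biscuits)  # 215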
