
Department of Computer Engineering

Machine Learning (01CE0715), Sem 7, 4 Credits

Unit # 4: Unsupervised Learning

Prof. Urvi Bhatt

Course Outcomes

After completion of this course, students will be able to:

 Understand machine-learning concepts.

 Understand and implement Classification concepts.

 Understand and analyse the different Regression algorithms.

 Apply the concept of Unsupervised Learning.

 Apply the concepts of Artificial Neural Networks.

Topics

 Unsupervised Learning
 Introduction & Importance
 Types of Unsupervised Learning
 Clustering
 Introduction to clustering
 Types of Clustering
 Hierarchical Clustering (Agglomerative Clustering, Divisive clustering)
 Partitional Clustering
 K-means clustering Algorithm
 Evaluation metrics for Clustering (Silhouette Coefficient, Dunn's Index)
 Association Rule Mining
 Introduction to Association rule mining
 Apriori Algorithm
 FP tree algorithm
Unsupervised Learning
Introduction & Importance, Types of Unsupervised Learning
Introduction to Unsupervised Learning

 Unsupervised learning is a type of machine learning in which models are trained on an unlabeled dataset and are allowed to act on that data without any supervision.

 Unsupervised learning cannot be directly applied to a regression or classification problem because, unlike supervised learning, we have the input data but no corresponding output data.

 The goal of unsupervised learning is to find the underlying structure of the dataset, group the data according to similarities, and represent the dataset in a compressed format.
Why use Unsupervised Learning?

Below are some main reasons which describe the importance of unsupervised learning:

 Unsupervised learning is helpful for finding useful insights from data.

 Unsupervised learning is much like how a human learns to think from their own experiences, which makes it closer to real AI.

 Unsupervised learning works on unlabeled and uncategorized data, which makes it all the more important.

 In the real world, we do not always have input data with corresponding outputs; to solve such cases, we need unsupervised learning.
How it works? (figure)

Types of Unsupervised Learning Algorithm

 Clustering
 K-means Clustering Algorithm
 Hierarchical Clustering Algorithm
 DBSCAN Algorithm
 Association Rule Learning
 Apriori Algorithm
 FP-Growth Algorithm
 Clustering: Clustering is a method of grouping objects into clusters such that objects with the most similarities remain in one group and have few or no similarities with the objects of another group.

 Cluster analysis finds the commonalities between the data objects and categorizes them according to the presence and absence of those commonalities.
 Association: An association rule is an unsupervised learning method used for finding relationships between variables in a large database. It determines the sets of items that occur together in the dataset.

 Association rules make marketing strategy more effective. For example, people who buy item X (say, bread) also tend to purchase item Y (butter or jam). A typical example of association rules is Market Basket Analysis.
Difference Between Clustering and Association Rule Mining

Feature        | Clustering                                          | Association Rule Mining
Purpose        | Group similar data points into clusters.            | Discover interesting relationships between variables.
Output         | A set of clusters or groups.                        | Association rules (e.g., "If A, then B").
Data Type      | Often applied to unlabeled data.                    | Typically works with transactional or categorical data.
Approach       | Looks for patterns based on distance or similarity. | Looks for co-occurrence or frequency of items.
Examples       | Grouping customers by buying behavior.              | Market basket analysis (e.g., buying bread and butter together).
Interpretation | Clusters can be visualized and analyzed.            | Rules can be evaluated for support and confidence.
Techniques     | K-means, hierarchical clustering, DBSCAN, etc.      | Apriori, FP-Growth, etc.
Clustering
Introduction, Types of Clustering (Hierarchical, Agglomerative, Divisive, Partitional), K-means clustering Algorithm, Evaluation metrics for Clustering (Silhouette Coefficient, Dunn's Index)
Clustering

 Clustering is a way to group similar things together.

 In a more technical sense, it is a method used in data analysis where the goal is to find patterns or natural groupings in a set of data points without knowing in advance what those groups are. So, if you have a lot of data, clustering can help you organize it into meaningful categories.

 Example: Imagine you have a bunch of different fruits: apples, oranges, and bananas. Clustering helps you sort these fruits into groups based on similarities, like color or shape.
Clustering  Now it is not necessary that the clusters formed must be
circular in shape. The shape of clusters can be arbitrary.
Uses of Clustering

 Customer Segmentation: Group customers based on purchasing behavior, demographics, or preferences, allowing for targeted marketing strategies.

 Image Segmentation: Divide images to identify objects or regions within an image.

 Anomaly Detection: Find unusual patterns or outliers in data, useful for fraud detection or network security.

 Recommendation Systems: Suggest items based on similar interests.

 Document Clustering: Organize texts.

 Biology and Genetics: Classify genes based on their characteristics.

 Social Networks: Identify communities by grouping users with similar connections or interactions.

 Market Research: Analyze trends.

Types of Clustering Methods

 Centroid-based Clustering (Partitioning methods)

 Density-based Clustering (Model-based methods)

 Connectivity-based Clustering (Hierarchical clustering: Agglomerative & Divisive)

 Distribution-based Clustering

 Fuzzy Clustering
 Partitioning Clustering

 It is a type of clustering that divides the data into non-hierarchical groups. It is also known as the centroid-based method. The most common example of partitioning clustering is the K-Means Clustering algorithm.
 Density-Based Clustering

 The density-based clustering method connects highly dense areas into clusters, and arbitrarily shaped distributions are formed as long as the dense regions can be connected. The algorithm does this by identifying different clusters in the dataset and connecting the areas of high density into clusters. The dense areas in data space are separated from each other by sparser areas.
 Distribution Model-Based Clustering

 In the distribution model-based clustering method, the data is divided based on the probability that a data point belongs to a particular distribution. The grouping is done by assuming some distribution, most commonly the Gaussian distribution.
 Hierarchical Clustering

 Hierarchical clustering can be used as an alternative to partitional clustering, as there is no requirement to pre-specify the number of clusters to be created.

 In this technique, the dataset is divided into clusters to create a tree-like structure, which is also called a dendrogram. Any number of clusters can then be selected by cutting the tree at the correct level. The most common example of this method is the Agglomerative Hierarchical algorithm.
Types of Hierarchical Clustering (Agglomerative & Divisive)
 Fuzzy Clustering

 Fuzzy clustering is a type of soft method in which a data object may belong to more than one group or cluster. Each data point has a set of membership coefficients, which depend on its degree of membership in each cluster. The Fuzzy C-means algorithm is an example of this type of clustering; it is sometimes also known as the Fuzzy k-means algorithm.
K-Means Clustering Algorithm

 K-Means Clustering is an unsupervised learning algorithm that groups an unlabeled dataset into different clusters.

 Here K defines the number of pre-defined clusters that need to be created in the process: if K=2, there will be two clusters; for K=3, there will be three clusters; and so on.

 It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in an unlabeled dataset on its own, without the need for any training.

 It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids.
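Formally, the objective K-means minimizes can be written as the within-cluster sum of squared distances (a standard formulation added here for reference, with \mu_j denoting the centroid of cluster C_j):

J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2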
 The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until it converges on the best clusters. The value of k should be predetermined in this algorithm.

 The k-means clustering algorithm mainly performs two tasks:

 Determines the best value for the K center points or centroids by an iterative process.

 Assigns each data point to its closest k-center. The data points which are near a particular k-center create a cluster.

 Hence each cluster has data points with some commonalities, and it is away from other clusters.
How does the K-Means Algorithm Work?

 The working of the K-Means algorithm is explained in the steps below (a code sketch follows the list):

 Step-1: Select the number K to decide the number of clusters.

 Step-2: Select K random points or centroids. (They can be points other than those from the input dataset.)

 Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.

 Step-4: Calculate the variance and place a new centroid for each cluster.

 Step-5: Repeat Step-3, i.e., reassign each data point to the new closest centroid.

 Step-6: If any reassignment occurred, go to Step-4; else go to FINISH.

 Step-7: The model is ready.
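A minimal Python sketch of these steps, assuming one-dimensional data and absolute distance (illustrative code, not from the slides; the function name kmeans and all variable names are hypothetical):

import random

def kmeans(data, k, max_iters=100):
    # Step-1/2: pick K random points from the data as initial centroids
    centroids = random.sample(data, k)
    assignments = None
    for _ in range(max_iters):
        # Step-3/5: assign each point to its closest centroid
        new_assignments = [min(range(k), key=lambda j: abs(x - centroids[j]))
                           for x in data]
        # Step-6: stop when no reassignment occurs
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step-4: move each centroid to the mean of its cluster
        for j in range(k):
            members = [x for x, a in zip(data, assignments) if a == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return centroids, assignments

For example, kmeans([3, 8, 6, 7, 2], 2) typically converges to centroids 2.5 and 7, matching the worked example later in this unit.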


Example:

 Apply the K-means Clustering Algorithm to divide the data into 2 clusters.

 Data: {1, 5, 2, 4, 5}
Example:

 Apply the K-means clustering algorithm again to divide the data into 2 clusters. Here is a new set of data:

 {3, 8, 6, 7, 2}
Solution:

 Step 1: Initialization

 We start by randomly selecting 2 initial centroids. Let's choose 3 and 8 as the initial centroids.

 Step 2: Calculate Distance and Assign Points to the Nearest Centroid

 Next, we compute the distance of each data point to the centroids (3 and 8), and assign each point to the nearest centroid:

Point | Distance to 3 | Distance to 8 | Assigned Cluster
3     | 0             | 5             | Cluster 1
8     | 5             | 0             | Cluster 2
6     | 3             | 2             | Cluster 2
7     | 4             | 1             | Cluster 2
2     | 1             | 6             | Cluster 1
 Step 3: Update Centroids

 Now, we update the centroids by calculating the mean of the points in each cluster:

 Cluster 1: Points = {3, 2}
 New Centroid = (3 + 2) / 2 = 2.5

 Cluster 2: Points = {8, 6, 7}
 New Centroid = (8 + 6 + 7) / 3 = 7
 Step 4: Recalculate Distance and Reassign Points

 Now, recalculate the distances of each point to the updated centroids (2.5 and 7):

Point | Distance to 2.5 | Distance to 7 | Assigned Cluster
3     | 0.5             | 4             | Cluster 1
8     | 5.5             | 1             | Cluster 2
6     | 3.5             | 1             | Cluster 2
7     | 4.5             | 0             | Cluster 2
2     | 0.5             | 5             | Cluster 1
 Step 5: Since the cluster assignments did not change, the algorithm stops.

 Final Clusters

 Cluster 1: {3, 2}

 Cluster 2: {8, 6, 7}
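As a sanity check (an addition, not part of the original slides; assumes scikit-learn is installed), the same data can be clustered with scikit-learn's KMeans:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[3], [8], [6], [7], [2]])       # each 1-D point as a row
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                             # cluster label of each point
print(km.cluster_centers_)                    # expected near 2.5 and 7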
Evaluation metrics for Clustering

 Silhouette Coefficient: tells you how well points fit into their clusters, where higher is better.

 Dunn's Index: tells you whether the clusters are well-separated and compact, where a higher value indicates better clustering.
Silhouette Coefficient

 What it measures: How well each data point fits within its cluster compared to other clusters.

 Range: From -1 to 1.
 1 means the point is well-clustered (fits perfectly in its cluster).
 0 means the point is on the border between clusters.
 -1 means the point is likely in the wrong cluster.

 How it works: It looks at the average distance between a point and other points in the same cluster, and compares it with the distance to points in the nearest different cluster.
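In symbols (the standard definition, added here for reference): for a point i, let a(i) be the mean distance to the other points in its own cluster, and b(i) the mean distance to the points in the nearest other cluster. Then

s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}

and the overall Silhouette Coefficient is the mean of s(i) over all points.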
Dunn's Index

 What it measures: How well-separated and compact the clusters are.
 It wants clusters to be far apart (well-separated) and points within a cluster to be close together (compact).

 Higher Dunn's Index is better: A higher value means better clustering. It balances two things:
 Distance between clusters: The farther apart, the better.
 Cluster size: The smaller (tighter) the clusters, the better.
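In symbols (the standard definition, added here for reference): with d(C_i, C_j) the distance between clusters C_i and C_j, and diam(C_k) the diameter of cluster C_k (its maximum within-cluster distance),

D = \frac{\min_{i \neq j} d(C_i, C_j)}{\max_k \operatorname{diam}(C_k)}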


Association Rule Mining
Definition, Support and Confidence
Association Rule Mining

 In simple words, association rule mining is based on IF/THEN statements.

 Association, as the name suggests, finds the relationships between different items. And the best thing about it is that it also works with non-numeric or categorical datasets.

 Association rule mining finds frequently occurring patterns in the given data. An association rule has two parts:
 Antecedent (IF)
 Consequent (THEN)

 For example: If a person in the supermarket buys bread, then it is more likely that they will buy butter, jam, or eggs.
Support and Confidence

 Support: Support is an indication of the frequency of an item. The support of an item X is the ratio of the number of transactions in which X appears to the total number of transactions. The greater the value of support, the more frequently that item is bought.

 Confidence: Confidence is a measure of how likely product Y will be sold with product X, i.e., X => Y. It is calculated as the ratio of the support of (X ∪ Y) (i.e., the union) to the support of (X).

 Lift: Lift is a measure of the popularity of an item or a measure of the performance of the targeting model. The lift of a rule is calculated by dividing the confidence by the support of (Y).
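In formula form (standard definitions consistent with the text above, where T is the set of all transactions):

\text{Support}(X) = \frac{|\{ t \in T : X \subseteq t \}|}{|T|}
\qquad
\text{Confidence}(X \Rightarrow Y) = \frac{\text{Support}(X \cup Y)}{\text{Support}(X)}
\qquad
\text{Lift}(X \Rightarrow Y) = \frac{\text{Confidence}(X \Rightarrow Y)}{\text{Support}(Y)}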
Support and Confidence: Example

 Support(wine) = Probability(X = wine) = 4/6 (wine is bought in 4 of the 6 transactions)

 Confidence(X = {wine, chips} => Y = {bread})
 = Support(wine, chips, bread) / Support(wine, chips)

 i.e., Confidence = (2/6) / (3/6) = 0.667
Support | Confidence
Support is a measure of the number of times an itemset appears in a dataset. | Confidence is a measure of the likelihood that an itemset will appear if another itemset appears.
Support is calculated by dividing the number of transactions containing an itemset by the total number of transactions. | Confidence is calculated by dividing the number of transactions containing both itemsets by the number of transactions containing the first itemset.
Support is used to identify itemsets that occur frequently in the dataset. | Confidence is used to evaluate the strength of a rule.
Support is often used with a threshold to identify itemsets that occur frequently enough to be of interest. | Confidence is often used with a threshold to identify rules that are strong enough to be of interest.
Support is interpreted as the percentage of transactions in which an itemset appears. | Confidence is interpreted as the percentage of transactions in which the second itemset appears given that the first itemset appears.
Apriori Algorithm

Below are the steps of the Apriori algorithm (a code sketch follows the list):

 Step-1: Determine the support of the itemsets in the transactional database, and select the minimum support and confidence.

 Step-2: Take all the itemsets in the transactions with a support value higher than the minimum (selected) support value.

 Step-3: Find all the rules from these subsets that have a confidence value higher than the threshold (minimum confidence).

 Step-4: Sort the rules in decreasing order of lift.
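A minimal Python sketch of Steps 1-2 (finding frequent itemsets), assuming a list of transactions and a minimum support count. The function name apriori and the sample database are hypothetical; the database is chosen so the counts match the wine/chips/bread example that follows:

from itertools import combinations

def apriori(transactions, min_support):
    # Level 1: candidate single-item itemsets
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    current = [frozenset([i]) for i in items]
    frequent, k = {}, 1
    while current:
        # Count the support of each candidate by scanning the database
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets to form (k+1)-item candidates
        # (the classic subset-pruning step is omitted for brevity)
        current = list({a | b for a, b in combinations(level, 2)
                        if len(a | b) == k + 1})
        k += 1
    return frequent

db = [{"wine", "chips", "bread"}, {"wine", "chips", "bread"},
      {"wine", "chips"}, {"wine"}, {"chips", "bread"}, {"milk"}]
for itemset, count in apriori(db, 2).items():
    print(set(itemset), count)   # e.g. {'wine', 'chips'} 3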


Example:

 Find the frequent itemsets using the Apriori Algorithm. Assume a minimum support threshold of s = 2. (The transaction table and the worked solution were presented as figures.)
Example:

 Find the frequent itemsets using the Apriori Algorithm. Assume a minimum support of s = 3. (The transaction table was presented as a figure.)

 There is only one itemset with minimum support 3, so only one itemset is frequent.

 Frequent Itemset (I) = {Coke, Chips}


Example:

 Find the frequent itemsets and generate association rules using the Apriori Algorithm. Assume a minimum support threshold of s = 33.33% and a minimum confidence threshold of c = 60%. (The transaction table was presented as a figure.)
 There is only one itemset with minimum support 2. So only one itemset is frequent.

 Frequent Itemset (I) = {Hot Dogs, Coke, Chips}

 Association rules:

 [Hot Dogs^Coke] => [Chips]   // confidence = sup(Hot Dogs^Coke^Chips)/sup(Hot Dogs^Coke) = 2/2*100 = 100%   // Selected

 [Hot Dogs^Chips] => [Coke]   // confidence = sup(Hot Dogs^Coke^Chips)/sup(Hot Dogs^Chips) = 2/2*100 = 100%   // Selected

 [Coke^Chips] => [Hot Dogs]   // confidence = sup(Hot Dogs^Coke^Chips)/sup(Coke^Chips) = 2/3*100 = 66.67%   // Selected

 [Hot Dogs] => [Coke^Chips]   // confidence = sup(Hot Dogs^Coke^Chips)/sup(Hot Dogs) = 2/4*100 = 50%   // Rejected

 [Coke] => [Hot Dogs^Chips]   // confidence = sup(Hot Dogs^Coke^Chips)/sup(Coke) = 2/3*100 = 66.67%   // Selected

 [Chips] => [Hot Dogs^Coke]   // confidence = sup(Hot Dogs^Coke^Chips)/sup(Chips) = 2/4*100 = 50%   // Rejected

 There are four strong rules (minimum confidence greater than 60%).
Example:

 Consider the frequent itemset {I1, I2, I3}. Find the association rules generated from it using the Apriori Algorithm.

 Here, taking one frequent itemset as an example, we show the rule generation.

 Itemset {I1, I2, I3}   // from L3

 So the rules can be:
 [I1^I2] => [I3]   // confidence = sup(I1^I2^I3)/sup(I1^I2) = 2/4*100 = 50%
 [I1^I3] => [I2]   // confidence = sup(I1^I2^I3)/sup(I1^I3) = 2/4*100 = 50%
 [I2^I3] => [I1]   // confidence = sup(I1^I2^I3)/sup(I2^I3) = 2/4*100 = 50%
 [I1] => [I2^I3]   // confidence = sup(I1^I2^I3)/sup(I1) = 2/6*100 = 33%
 [I2] => [I1^I3]   // confidence = sup(I1^I2^I3)/sup(I2) = 2/7*100 = 28%
 [I3] => [I1^I2]   // confidence = sup(I1^I2^I3)/sup(I3) = 2/6*100 = 33%

 So if the minimum confidence is 50%, then the first 3 rules can be considered strong association rules.
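A minimal sketch of this rule-generation step in Python (illustrative; rules_from_itemset is a hypothetical helper, and the support counts are taken from the worked example above):

from itertools import combinations

def rules_from_itemset(itemset, support, min_conf):
    # Try every non-empty proper subset as the antecedent (IF part)
    itemset, rules = frozenset(itemset), []
    for r in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), r):
            lhs = frozenset(lhs)
            conf = support[itemset] / support[lhs]
            if conf >= min_conf:
                rules.append((set(lhs), set(itemset - lhs), conf))
    return rules

support = {
    frozenset({"I1", "I2", "I3"}): 2,
    frozenset({"I1", "I2"}): 4, frozenset({"I1", "I3"}): 4,
    frozenset({"I2", "I3"}): 4,
    frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I3"}): 6,
}
for lhs, rhs, conf in rules_from_itemset({"I1", "I2", "I3"}, support, 0.5):
    print(lhs, "=>", rhs, f"confidence = {conf:.0%}")   # the 3 strong rules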
Advantages and Disadvantages of the Apriori Algorithm

Advantages:

 Simple to Understand: Easy to implement and understand.

 Straightforward Approach: Uses a clear, step-by-step method to find frequent itemsets in a database.

 Widely Used: Popular in market basket analysis to find associations between products.

 No Domain Knowledge Required: Doesn't need prior knowledge of patterns.

 Flexibility: Can be applied to a wide range of data mining tasks.

Disadvantages:

 High Time Complexity: Can be slow for large datasets, as it has to scan the entire database multiple times.

 Memory Intensive: Requires a lot of memory, especially with big datasets.

 Generates Redundant Rules: Often produces many rules, including irrelevant or redundant ones.

 Prone to Scalability Issues: Not efficient with large and complex databases.

 Requires Pruning: Needs careful tuning of support and confidence thresholds to filter out uninteresting rules.
FP Tree Algorithm

 The two primary drawbacks of the Apriori Algorithm are:
 At each step, candidate sets have to be built.
 To build the candidate sets, the algorithm has to repeatedly scan the database.

 These two properties inevitably make the algorithm slower.

 To overcome these redundant steps, a new association-rule mining algorithm was developed, named the Frequent Pattern Growth (FP-Growth) Algorithm.

 It overcomes the disadvantages of the Apriori algorithm by storing all the transactions in a Trie data structure.
FP Tree Algorithm Steps

Algorithm:

 Input: Given database and minimum support value

 Step 1: Make the Frequency Table

 Step 2: Find the Frequent Pattern set

 Step 3: Ordered-Item set Creation

 Step 4: Make an FP-Tree

 Step 5: Computation of the Conditional Pattern Base

 Step 6: Compute the Conditional Frequent Pattern Tree

 Step 7: Frequent Pattern rules Generation

Algorithm (detailed; a code sketch of Steps 1-4 follows the list):

 Input: Given database and minimum support value

 Step 1: Making the Frequency Table - The frequency of each individual item is computed.

 Step 2: Find the Frequent Pattern set - A Frequent Pattern set is built which contains all the elements whose frequency is greater than or equal to the minimum support. These elements are stored in descending order of their respective frequencies.

 Step 3: Ordered-Item set Creation - For each transaction, the respective Ordered-Item set is built.

 Step 4: Make an FP-Tree - All the Ordered-Item sets are inserted into a Trie data structure.

 Step 5: Computation of the Conditional Pattern Base - For each item, the Conditional Pattern Base is computed, which is the set of path labels of all the paths which lead to any node of the given item in the frequent-pattern tree. The items are processed in ascending order of their frequencies.

 Step 6: Compute the Conditional Frequent Pattern Tree - This is done by taking the set of elements that is common to all the paths in the Conditional Pattern Base of that item and calculating its support count by summing the support counts of all the paths in the Conditional Pattern Base.

 Step 7: Frequent Pattern rules Generation - From the Conditional Frequent Pattern Tree, the Frequent Pattern rules are generated by pairing the items of the Conditional Frequent Pattern Tree set with the corresponding item, as given in the tables in the worked examples below.
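A minimal Python sketch of Steps 1-4 (frequency table, frequent items, ordered-item sets, and trie insertion). The class and function names are hypothetical, and ties in frequency are broken alphabetically so the ordering matches the worked example later in this unit:

from collections import Counter

class Node:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def build_fp_tree(transactions, min_sup):
    # Steps 1-2: frequency table, keep items meeting the minimum support
    freq = Counter(item for t in transactions for item in t)
    rank = {i: f for i, f in freq.items() if f >= min_sup}
    root = Node(None)
    for t in transactions:
        # Step 3: ordered-item set (descending frequency, ties alphabetical)
        ordered = sorted((i for i in t if i in rank),
                         key=lambda i: (-rank[i], i))
        # Step 4: insert the ordered-item set into the trie
        node = root
        for item in ordered:
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root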
Example:

 Apply the FP Tree Algorithm.

 (The frequency table, ordered-item sets, FP-tree, and conditional pattern bases for this example were presented as figures; the resulting Conditional FP-Trees and Frequent Pattern rules are summarized below.)

Item | Conditional FP-Tree | Frequent Pattern rules
p | {c:3}     | {c,p:3}
m | {f,c,a:3} | {f,m:3}, {c,m:3}, {a,m:3}, {f,c,m:3}, {f,a,m:3}, {c,a,m:3}, {f,c,a,m:3}
b | Φ         | Φ
a | {f,c:3}   | {f,a:3}, {c,a:3}, {f,c,a:3}
c | {f:3}     | {f,c:3}
f | Φ         | Φ
Example:

Transaction | List of items
T1 | I1,I2,I3
T2 | I2,I3,I4
T3 | I4,I5
T4 | I1,I2,I4
T5 | I1,I2,I3,I5
T6 | I1,I2,I3,I4

 Apply the FP Tree Algorithm with support threshold = 50%.

 Step 1: Making the Frequency Table - The frequency of each individual item is computed:

Item | Count
I1 | 4
I2 | 5
I3 | 4
I4 | 4
I5 | 2
 Step 2: Find the Frequent Pattern set - A Frequent Pattern set is built which contains all the elements whose frequency is greater than or equal to the minimum support. These elements are stored in descending order of their respective frequencies.

 Support threshold = 50% => 0.5 * 6 = 3 => min_sup = 3

Item | Count
I2 | 5
I1 | 4
I3 | 4
I4 | 4

Ordered-Item set - {I2:5, I1:4, I3:4, I4:4}


 Step 3: Ordered-Item set Creation - For each transaction, the respective Ordered-Item set is built.

 Ordered-Item set - {I2:5, I1:4, I3:4, I4:4}

Transaction | List of items | Ordered-Item set
T1 | I1,I2,I3 | I2,I1,I3
T2 | I2,I3,I4 | I2,I3,I4
T3 | I4,I5 | I4
T4 | I1,I2,I4 | I2,I1,I4
T5 | I1,I2,I3,I5 | I2,I1,I3
T6 | I1,I2,I3,I4 | I2,I1,I3,I4
 Step 4: Make an FP-Tree - All the Ordered-Item sets are inserted into a Trie data structure. (The resulting tree was shown as a figure; a usage sketch follows.)

Ordered-Item set
I2,I1,I3
I2,I3,I4
I4
I2,I1,I4
I2,I1,I3
I2,I1,I3,I4
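Running the build_fp_tree sketch from earlier on these transactions (an assumed usage example, with a hypothetical show helper) reproduces this tree:

db = [{"I1","I2","I3"}, {"I2","I3","I4"}, {"I4","I5"},
      {"I1","I2","I4"}, {"I1","I2","I3","I5"}, {"I1","I2","I3","I4"}]
tree = build_fp_tree(db, min_sup=3)

def show(node, depth=0):
    # Print the trie with indentation showing parent-child structure
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

show(tree)   # I2:5 -> I1:4 -> (I3:3 -> I4:1), I4:1; I2 -> I3:1 -> I4:1; I4:1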
 Step 5: Computation of the Conditional Pattern Base - For each item, the Conditional Pattern Base is computed, which is the set of path labels of all the paths which lead to any node of the given item in the frequent-pattern tree. Note that the items in the table below are arranged in ascending order of their frequencies.

Item | Conditional Pattern Base
I4 | {I2,I1,I3:1}, {I2,I3:1}, {I2,I1:1}
I3 | {I2,I1:3}, {I2:1}
I1 | {I2:4}
I2 | Φ

 Step 6: Compute the Conditional Frequent Pattern Tree - This is done by taking the set of elements that is common to all the paths in the Conditional Pattern Base of that item and calculating its support count by summing the support counts of all the paths in the Conditional Pattern Base.

Item | Conditional Pattern Base | Conditional Frequent Pattern Tree
I4 | {I2,I1,I3:1}, {I2,I3:1}, {I2,I1:1} | {I2:3}
I3 | {I2,I1:3}, {I2:1} | {I2:4}
I1 | {I2:4} | {I2:4}
I2 | Φ | Φ
 Step 7: Frequent Pattern rules Generation - From the Conditional Frequent Pattern Tree, the Frequent Pattern rules are generated by pairing the items of the Conditional Frequent Pattern Tree set with the corresponding item, as given in the table below.

Item | Conditional Pattern Base | Conditional Frequent Pattern Tree | Frequent Pattern rules
I4 | {I2,I1,I3:1}, {I2,I3:1}, {I2,I1:1} | {I2:3} | {I2,I4:3}
I3 | {I2,I1:3}, {I2:1} | {I2:4} | {I2,I3:4}
I1 | {I2:4} | {I2:4} | {I2,I1:4}
I2 | Φ | Φ | Φ
Advantages and Disadvantages of the FP-Tree Algorithm

Advantages of the FP Tree Algorithm

 This algorithm needs to scan the database only twice, whereas Apriori scans the transactions for each iteration.

 The pairing of items is not done in this algorithm, which makes it faster.

 The database is stored in a compact version in memory.

 It is efficient and scalable for mining both long and short frequent patterns.

Disadvantages of the FP Tree Algorithm

 The FP-Tree is more cumbersome and difficult to build than Apriori.

 It may be expensive.

 The algorithm may not fit in shared memory when the database is large.
Difference between Apriori and FP Tree Algorithm

Apriori | FP Growth
Apriori generates frequent patterns by making itemsets using pairings such as single itemsets, double itemsets, and triple itemsets. | FP Growth generates an FP-Tree for making frequent patterns.
Apriori uses candidate generation, where frequent subsets are extended one item at a time. | FP Growth generates a conditional FP-Tree for every item in the data.
Since Apriori scans the database at each step, it becomes time-consuming for data where the number of items is large. | The FP-Tree requires only one database scan in its beginning steps, so it consumes less time.
A converted version of the database is saved in memory. | A set of conditional FP-Trees for every item is saved in memory.
It uses a breadth-first search. | It uses a depth-first search.
Any Queries..??

Thank you
