Data Mining Unit 2
Lecture Notes
------------------------------------------------------------------------------------------------------
Association Rule Mining: Mining Frequent Patterns, Associations and Correlations; Mining
Methods; Mining Various Kinds of Association Rules; Correlation Analysis; Constraint-Based
Association Mining; Graph Pattern Mining; Sequential Pattern Mining (SPM).
Frequent itemset mining leads to the discovery of associations and correlations among items
in large transactional or relational data sets. With massive amounts of data continuously
being collected and stored, many industries are becoming interested in mining such patterns
from their databases. The discovery of interesting correlation relationships among huge
amounts of business transaction records can help in many business decision-making
processes such as catalog design, cross-marketing, and customer shopping behaviour
analysis.
A typical example of frequent itemset mining is market basket analysis. This process analyses
customer buying habits by finding associations between the different items that customers
place in their “shopping baskets”.
The discovery of these associations can help retailers develop marketing strategies by gaining
insight into which items are frequently purchased together by customers. For instance, if
customers are buying milk, how likely are they to also buy bread (and what kind of bread) on
the same trip to the supermarket? This information can lead to increased sales by helping
retailers do selective marketing and plan their shelf space.
Market basket analysis. Suppose, as manager of an AllElectronics branch, you would like to
learn more about the buying habits of your customers. Specifically, you wonder, “Which
groups or sets of items are customers likely to purchase on a given trip to the store?” To
answer your question, market basket analysis may be performed on the retail data of
customer transactions at your store. You can then use the results to plan marketing or
advertising strategies, or in the design of a new catalog.
For instance, market basket analysis may help you design different store layouts. In one
strategy, items that are frequently purchased together can be placed in proximity to further
encourage the combined sale of such items. If customers who purchase computers also tend
to buy antivirus software at the same time, then placing the hardware display close to the
software display may help increase the sales of both items.
In an alternative strategy, placing hardware and software at opposite ends of the store may
entice customers who purchase such items to pick up other items along the way. For instance,
after deciding on an expensive computer, a customer may observe security systems for sale
while heading toward the software display to purchase antivirus software, and may decide to
purchase a home security system as well.
If we think of the universe as the set of items available at the store, then each item has a
Boolean variable representing the presence or absence of that item. Each basket can then be
represented by a Boolean vector of values assigned to these variables. The
Boolean vectors can be analysed for buying patterns that reflect items that are frequently
associated or purchased together. These patterns can be represented in the form of association
rules.
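As a small illustration of this Boolean representation, here is a sketch in Python (the item universe and basket shown are hypothetical):

# Hypothetical universe of items available at the store.
universe = ["computer", "antivirus_software", "printer", "camera"]

# Items one customer placed in their basket on a single trip.
basket = {"computer", "antivirus_software"}

# Boolean vector: 1 if the item is present in the basket, 0 otherwise.
vector = [1 if item in basket else 0 for item in universe]
print(vector)  # [1, 1, 0, 0]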
For example, the information that customers who purchase computers also tend to buy
antivirus software at the same time is represented in the following association rule:
computer ⇒ antivirus_software [support = 2%, confidence = 60%]
A support of 2% means that 2% of all the transactions under analysis show that computer and
antivirus software are purchased together. A confidence of 60% means that 60% of the
customers who purchased a computer also bought the software.
Typically, association rules are considered interesting if they satisfy both a minimum support
threshold and a minimum confidence threshold. These thresholds can be set by users or
domain experts.
Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence
threshold (min_conf) are called strong. By convention, support and confidence values are
written as percentages between 0% and 100%, rather than as fractions between 0 and 1.0.
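Formally, using the standard definitions (here A ∪ B denotes the itemset containing the items of both A and B):

support(A ⇒ B) = P(A ∪ B)
confidence(A ⇒ B) = P(B | A) = support_count(A ∪ B) / support_count(A)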
The occurrence frequency of an itemset is the number of transactions that contain the itemset.
This is also known, simply, as the frequency, support count, or count of the itemset.
In general, association rule mining can be viewed as a two-step process:
1. Find all frequent itemsets: By definition, each of these itemsets will occur at least as
frequently as a predetermined minimum support count, min_sup.
2. Generate strong association rules from the frequent itemsets: By definition, these rules
must satisfy minimum support and minimum confidence.
For example, a frequent itemset of length 100, such as {a1, a2, ..., a100}, contains
C(100, 1) = 100 frequent 1-itemsets: {a1}, {a2}, ..., {a100};
C(100, 2) = 4950 frequent 2-itemsets: {a1, a2}, {a1, a3}, ..., {a99, a100}; and so on.
The total number of frequent itemsets it contains is thus 2^100 − 1.
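These counts can be checked quickly with Python's standard library:

import math

# Number of frequent k-itemsets contained in a frequent 100-itemset.
print(math.comb(100, 1))  # 100
print(math.comb(100, 2))  # 4950

# Total number of non-empty subsets, i.e. frequent itemsets overall.
print(2 ** 100 - 1)       # 1267650600228229401496703205375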
As a worked example, we start by looking for single items that meet the support threshold of
2. In this case, it's simply A, B, C, D, and E, because there are at least 2 occurrences of each
of these in the table. This is summarized in the single-item support table below.
Next, we take all of the items that meet the support requirements (everything so far in this
example) and make all of the patterns/combinations we can out of them: AB, AC, AD, AE,
BC, BD, BE, CD, CE, DE. When we list all of these combinations in a table and determine
the support for each, we get a table that looks like this.
Several of these patterns don't meet the support threshold of 2, so we remove them from the
list of options.
At this point, we use the surviving items to make other patterns that contain 3 items. If you
logically work through all of the options, you'll get a list like this: ABC, ABD, ABE, BCD,
BCE, BDE. (Notice that I didn't list ABCD or BCDE here because they are 4 items long.)
Before I create the support table for these, let's look at these patterns. The first one, ABC,
was created by combining AB and BC. If you look in the 2-item support table (before or after
filtering), you'll find that AC doesn't have the minimum support required. If AC isn't
supported, a more complicated pattern that includes AC (such as ABC) can't be supported
either. This is a key point of the Apriori Principle. So, without having to go back to the
original data, we can exclude some of the 3-item patterns. When we do this, we eliminate
ABC (AC not supported), ABD (AD not supported), ABE (AE not supported), BCE (CE not
supported), and BDE (DE not supported). This process of removing patterns that can't be
supported because their subsets (or shorter combinations) aren't supported is called pruning.
This pruning process leaves only BCD with a support of 2.
The final list of all of the patterns with support greater than or equal to 2 is summarized
here.
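The original transaction table is not reproduced in these notes, so the sketch below uses hypothetical transactions chosen to be consistent with the results described above (every single item has support of at least 2, and BCD is the only frequent 3-itemset). A minimal Apriori sketch in Python:

from itertools import combinations

# Hypothetical transactions, consistent with the worked example above.
transactions = [
    {"A", "B"},
    {"A", "B"},
    {"B", "C", "D"},
    {"B", "C", "D"},
    {"B", "E"},
    {"B", "E"},
]
MIN_SUP = 2  # minimum support count

def support(itemset):
    # Number of transactions containing every item of the itemset.
    return sum(1 for t in transactions if itemset <= t)

# Frequent 1-itemsets.
items = sorted({i for t in transactions for i in t})
levels = [{frozenset([i]) for i in items if support(frozenset([i])) >= MIN_SUP}]

k = 2
while levels[-1]:
    prev = levels[-1]
    # Join step: union pairs of frequent (k-1)-itemsets into k-candidates.
    candidates = {a | b for a in prev for b in prev if len(a | b) == k}
    # Prune step (Apriori Principle): drop any candidate with an
    # infrequent (k-1)-subset, then check support against the data.
    levels.append({c for c in candidates
                   if all(frozenset(s) in prev for s in combinations(c, k - 1))
                   and support(c) >= MIN_SUP})
    k += 1

for level in levels:
    for itemset in sorted(level, key=sorted):
        print("".join(sorted(itemset)), support(itemset))

Running this prints the same final pattern list as above: A, B, C, D, E, AB, BC, BD, BE, CD, and BCD.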
FP Tree
A Frequent Pattern (FP) tree is a tree-like structure built from the itemsets of the database.
The purpose of the FP tree is to mine the most frequent patterns. Each node of the FP tree
represents an item of an itemset.
The root node represents null, while the lower nodes represent the itemsets. The associations
of the nodes with the lower nodes, that is, of the itemsets with the other itemsets, are
maintained while forming the tree.
Let us see the steps followed to mine frequent patterns using the frequent pattern growth
algorithm:
1) The first step is to scan the database to find the occurrences of the itemsets in the database.
This step is the same as the first step of Apriori. The count of each 1-itemset in the database
is called its support count or frequency.
2) The second step is to construct the FP tree. For this, create the root of the tree. The root is
represented by null.
3) The next step is to scan the database again and examine the transactions. Examine the first
transaction and find out the items in it. The item with the maximum count is placed at the
top, followed by the next item with a lower count, and so on. This means that each branch of
the tree is constructed with transaction items in descending order of count.
4) The next transaction in the database is examined. Its items are ordered in descending
order of count. If any items of this transaction are already present in another branch (for
example, from the 1st transaction), then this transaction's branch shares a common prefix
starting from the root.
This means that the common items are linked to new nodes for the remaining items of this
transaction.
5) Also, the counts of the items are incremented as they occur in the transactions. The count
of each common node is incremented by 1 as a transaction passes through it, and new nodes
are created with a count of 1 and linked according to the transaction.
6) The next step is to mine the created FP tree. For this, the lowest node is examined first,
along with the links of the lowest nodes. The lowest node represents a frequent pattern of
length 1. From this, traverse the paths in the FP tree. These paths are called the conditional
pattern base.
A conditional pattern base is a sub-database consisting of the prefix paths in the FP tree that
occur with the lowest node (the suffix).
7) Construct a Conditional FP Tree, which is formed from the counts of the itemsets along
these paths. The itemsets meeting the threshold support are included in the Conditional FP Tree.
8) Frequent Patterns are generated from the Conditional FP Tree.
For example, with a support threshold of 50% and 6 transactions in the database, the
minimum support count is min_sup = 0.5 × 6 = 3, as used in the sketch below.
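The transaction table for this example is not shown in the notes, so the following sketch builds an FP tree over a hypothetical 6-transaction database with min_sup = 3, following steps 1 to 6 above, and prints the conditional pattern base of one item:

from collections import Counter, defaultdict

class Node:
    # One FP-tree node: item label, count, parent link, children by item.
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}

def build_fp_tree(transactions, min_sup):
    # Step 1: scan once for the support count of each 1-itemset.
    counts = Counter(i for t in transactions for i in t)
    frequent = {i for i, c in counts.items() if c >= min_sup}
    root = Node(None, None)      # step 2: the root represents null
    header = defaultdict(list)   # item -> list of its nodes in the tree
    # Steps 3-5: insert each transaction with items in descending count
    # order; shared prefixes bump existing counts, new items start at 1.
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in frequent),
                           key=lambda i: (-counts[i], i)):
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
    return root, header

def conditional_pattern_base(item, header):
    # Step 6: prefix paths ending at `item` (the suffix), with counts.
    paths = []
    for node in header[item]:
        path, p = [], node.parent
        while p is not None and p.item is not None:
            path.append(p.item)
            p = p.parent
        paths.append((path[::-1], node.count))
    return paths

# Hypothetical 6-transaction database; min_sup = 3 as computed above.
transactions = [{"I1", "I2", "I3"}, {"I2", "I3", "I4"}, {"I4", "I5"},
                {"I1", "I2", "I4"}, {"I1", "I2", "I3", "I5"},
                {"I1", "I2", "I3", "I4"}]
root, header = build_fp_tree(transactions, min_sup=3)
print(conditional_pattern_base("I4", header))

For I4 this yields the prefix paths {I2, I3}: 1, {}: 1, {I2, I1}: 1, and {I2, I1, I3}: 1. Only I2 reaches min_sup in this base (count 3), so it alone would enter the Conditional FP Tree (steps 7 and 8).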
Graph Pattern Mining
Graph pattern mining has many practical applications. For example, it can be used to
generate compact and effective graph index structures based on the concept of frequent and
discriminative graph patterns. Approximate structure similarity search can be achieved by
exploring graph index structures and multiple graph features. Moreover, classification of
graphs can also be performed effectively using frequent and discriminative subgraphs as
features.
Graph Mining (GM) is essentially the problem of discovering repetitive subgraphs occurring
in the input graphs.
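As a deliberately simplified sketch of this idea (the dataset is hypothetical, and each graph is reduced to a set of labeled edges; real frequent-subgraph miners such as gSpan must handle subgraph isomorphism, not just edge-set containment):

# Hypothetical graph dataset: each graph is a set of undirected,
# labeled edges, written (label_u, label_v) with label_u <= label_v.
graphs = [
    {("A", "B"), ("B", "C"), ("A", "C")},
    {("A", "B"), ("B", "C")},
    {("A", "B"), ("C", "D")},
]

def support(pattern):
    # Support = number of input graphs containing all edges of the pattern.
    return sum(pattern <= g for g in graphs)

print(support({("A", "B")}))              # 3
print(support({("A", "B"), ("B", "C")}))  # 2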
Motivation
- Finding subgraphs capable of compressing the data by abstracting instances of the
substructures
- Identifying conceptually interesting patterns
Sequential Pattern Mining (SPM)
Sequential pattern mining finds statistically relevant patterns between data examples where
the values are delivered in a sequence. It is usually presumed that the values are discrete, and
thus time series mining is closely related but usually considered a different activity.
Sequential pattern mining is a special case of structured data mining.
There are several key traditional computational problems addressed within this field. These
include building efficient databases and indexes for sequence information, extracting the
frequently occurring patterns, comparing sequences for similarity, and recovering missing
sequence members. In general, sequence mining problems can be classified as string mining,
which is typically based on string processing algorithms, and itemset mining, which is
typically based on association rule learning. Local process models extend sequential pattern
mining to more complex patterns that can include (exclusive) choices, loops, and concurrency
constructs in addition to the sequential ordering construct.
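As a small illustration (the purchase sequences are hypothetical), the core operation in sequential pattern mining is counting how many sequences contain a pattern as an ordered, possibly non-contiguous, subsequence:

def is_subsequence(pattern, sequence):
    # True if pattern's items occur in sequence in order (gaps allowed).
    it = iter(sequence)
    return all(item in it for item in pattern)

# Hypothetical customer purchase sequences.
sequences = [
    ["bread", "milk", "butter", "jam"],
    ["bread", "butter", "jam"],
    ["milk", "bread", "butter"],
    ["bread", "jam"],
]

def support(pattern):
    # Number of sequences that contain the pattern.
    return sum(is_subsequence(pattern, s) for s in sequences)

print(support(["bread", "butter"]))         # 3
print(support(["bread", "butter", "jam"]))  # 2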