Pattern Mining: Getting the
most out of your log data.
Krishna Sridhar
Staff Data Scientist, Dato Inc.
krishna_srd
• Background
- Machine Learning (ML) Research.
- Ph.D. in Numerical Optimization @Wisconsin
• Now
- Build ML tools for data-scientists & developers @Dato.
- Help deploy ML algorithms.
@krishna_srd, @DatoInc
About Me!
45+ and growing fast!
About Us!
Questions?
• (Now) I love questions. Feel free to interrupt for questions!
• (Later) Email me srikris@dato.com.
DAML Talks!
About you?
Creating a model pipeline
Ingest Transform Model Deploy
Unstructured Data
exploration
data
modeling
Data Science Workflow
Ingest Transform Model Deploy
Log Journey
Lots of data
Insights Profits
Log Mining: Pattern Mining
Logs are everywhere!
Machine Learning in Logs
Source: Mining Your Logs - Gaining Insight Through Visualization
Coffee shop
Coffee Shops Menu
Receipts
Coffee Shops Menu
Coffee Store Logs
Frequent Pattern Mining
What sets of items were bought together?
Real Applications
Log Mining: Rule Mining
Can we recommend items?
Rule Mining
Real Applications
Log Mining: Feature Extraction
Feature Extraction
0 1 0 0 0 0 1 1 0
1 1 0 0 1 0 0 0 0
0 0 1 1 1 0
Receipt Space → Features in Menu Space → ML
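One way to read the picture above: each receipt becomes a 0/1 vector with one slot per mined pattern, and that vector is what the ML model sees. A minimal sketch, assuming made-up menu items and patterns (none of these names or the helper `pattern_features` come from the talk):

```python
def pattern_features(receipt, patterns):
    """Return a 0/1 vector: 1 where the receipt contains every item of the pattern."""
    items = set(receipt)
    return [1 if set(pattern) <= items else 0 for pattern in patterns]


# Hypothetical frequent patterns and receipts, for illustration only.
patterns = [("latte",), ("latte", "muffin"), ("tea", "scone")]
receipts = [
    ["latte", "muffin", "water"],
    ["tea", "scone"],
    ["espresso"],
]

# One feature row per receipt, one column per pattern.
features = [pattern_features(receipt, patterns) for receipt in receipts]
```

The rows in `features` can be handed to any classifier or clustering method that accepts numeric vectors.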
3 Useful Data Mining Tasks
Rule Mining, Pattern Mining, Feature Extraction
Demo
Pattern Mining: Explained
Formulating Pattern Mining
N distinct items → 2^N itemsets
Formulating Pattern Mining
Find the top K most frequent sets of length at least L
that occur at least M times.
- max_patterns
- min_length
- min_support
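To make the three parameters concrete, here is a brute-force sketch that enumerates every itemset and applies `max_patterns`, `min_length` and `min_support` directly. The function name `top_patterns_brute_force` and the toy transactions are assumptions for illustration; this is the exponential baseline, not how a real miner works.

```python
from itertools import combinations


def top_patterns_brute_force(transactions, max_patterns, min_length, min_support):
    """Score every itemset of size >= min_length and keep the top max_patterns."""
    items = sorted(set().union(*transactions))
    scored = []
    for size in range(max(min_length, 1), len(items) + 1):
        for candidate in combinations(items, size):
            # Support = number of transactions that contain the whole candidate.
            support = sum(1 for t in transactions if set(candidate) <= t)
            if support >= min_support:
                scored.append((support, candidate))
    # Highest support first; ties broken by item names for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:max_patterns]


transactions = [
    {"B", "C", "D"}, {"A", "C", "D"}, {"A", "B", "C", "D"},
    {"A", "D"}, {"B", "C", "D"}, {"B", "C", "D"},
]
result = top_patterns_brute_force(
    transactions, max_patterns=3, min_length=2, min_support=4
)
```

With N items the two loops visit up to 2^N - 1 candidates, which is why the rest of the talk is about pruning.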
Pattern Mining
N distinct items → 2^N itemsets
Pattern Mining: Principles
Principle 1: What is frequent?
A pattern is frequent if it occurs at least M times.
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
Is the pattern {C, D} frequent?
M = 4
Patterns
Principle 1: What is frequent?
A pattern is frequent if it occurs at least M times.
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
{C, D} occurs 5 times
M = 4
Patterns
Principle 1: What is frequent?
A pattern is frequent if it occurs at least M times.
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
M = 4
Patterns
{C, D} occurs 5 times
Frequent!
Principle 1: What is frequent?
A pattern is frequent if it occurs at least M times.
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
Is the pattern {A, D} frequent?
M = 4
Patterns
Principle 1: What is frequent?
A pattern is frequent if it occurs at least M times.
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
M = 4
Patterns
{A, D} occurs 3 times.
Not frequent.
Principle 1: What is frequent?
A pattern is frequent if it occurs at least M times.
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
{C, D}: 5 is frequent
M = 4
{A, D}: 3 is not frequent
min_support
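The counting on these slides can be reproduced in a few lines. A minimal sketch using the six transactions shown above; the helper name `support` is an assumption, not an API from the talk:

```python
def support(pattern, transactions):
    """Number of transactions that contain every item of the pattern."""
    return sum(1 for t in transactions if pattern <= t)


transactions = [
    {"B", "C", "D"}, {"A", "C", "D"}, {"A", "B", "C", "D"},
    {"A", "D"}, {"B", "C", "D"}, {"B", "C", "D"},
]
M = 4  # min_support

cd_support = support({"C", "D"}, transactions)
ad_support = support({"A", "D"}, transactions)

cd_is_frequent = cd_support >= M
ad_is_frequent = ad_support >= M
```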
Principle 2: Apriori principle
A pattern is frequent only if all of its subsets are frequent
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
{B, C, D} : 4 is frequent therefore
{C, D} : 5 is frequent
M = 4
Principle 2: Apriori principle
A pattern is frequent only if all of its subsets are frequent
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
{B, C, D} : 4 is frequent therefore
{C, D} : 5 is frequent
M = 4
Why?
{C, D} must occur at least as
many times as {B, C, D}.
Principle 2: Apriori principle (Contrapositive)
If a pattern is not frequent then all supersets are not frequent
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
M = 4
{A} : 3 is not frequent therefore
{A, D} : 3 is not frequent
Principle 2: Apriori principle (Contrapositive)
If a pattern is not frequent then all supersets are not frequent
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
M = 4
{A} : 3 is not frequent therefore
{A, D} : 3 is not frequent
Why?
{A, D} cannot occur more times
than {A}.
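The principle can be checked exhaustively on the toy data: adding an item to a pattern never increases its count. A small sketch, assuming the same six transactions; the names `support` and `violations` are illustrative:

```python
from itertools import combinations


def support(pattern, transactions):
    """Number of transactions that contain every item of the pattern."""
    return sum(1 for t in transactions if pattern <= t)


transactions = [
    {"B", "C", "D"}, {"A", "C", "D"}, {"A", "B", "C", "D"},
    {"A", "D"}, {"B", "C", "D"}, {"B", "C", "D"},
]
items = sorted(set().union(*transactions))

# Collect every (subset, superset) pair where the superset occurs more often.
violations = []
for size in range(1, len(items)):
    for subset in combinations(items, size):
        for extra in items:
            if extra in subset:
                continue
            superset = set(subset) | {extra}
            if support(superset, transactions) > support(set(subset), transactions):
                violations.append((subset, tuple(sorted(superset))))
```

An empty `violations` list is what lets a miner skip every superset of an infrequent pattern such as {A}.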
Two Main Algorithms
• Candidate Generation
- Apriori
- Eclat
• Pattern Growth
- FP-Growth
- TopK FP-Growth
Candidate Generation
Lots of Generalizations
Source: http://www.philippe-fournier-viger.com/spmf/
Candidate Generation
Two phases
1. Candidate generation.
2. Candidate filtering.
Exploit Apriori Principle!
Candidate Generation
{AB} : ? {AC} : ? {AD} : ? {BC} : ? {BD} : ? {CD} : ?
{A} : ? {B} : ? {C} : ? {D} : ?
{ } : 6
{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
Candidate Generation
{AB} : ? {AC} : ? {AD} : ? {BC} : ? {BD} : ? {CD} : ?
{A} : 3 {B} : 4 {C} : 5 {D} : 6
{ } : 6
{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
Candidate Generation
{AB} : ? {AC} : ? {AD} : ? {BC} : ? {BD} : ? {CD} : ?
{A} : 3 {B} : 4 {C} : 5 {D} : 6
{ } : 6
{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
Not frequent
Candidate Generation
{AB} : ? {AC} : ? {AD} : ? {BC} : ? {BD} : ? {CD} : ?
{A} : 3 {B} : 4 {C} : 5 {D} : 6
{ } : 6
{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
No need to
explore!
Candidate Generation
{AB} : ? {AC} : ? {AD} : ? {BC} : ? {BD} : ? {CD} : ?
{A} : 3 {B} : 4 {C} : 5 {D} : 6
{ } : 6
{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
Candidate Generation
{AB} : ? {AC} : ? {AD} : ? {BC} : 4 {BD} : 4 {CD} : 5
{A} : 3 {B} : 4 {C} : 5 {D} : 6
{ } : 6
{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
Two Main Algorithms
• Candidate Generation
- Apriori
- Eclat
• Pattern Growth
- FP-Growth
- TopK FP-Growth
Candidate Generation
Two phases
1. Candidate generation: Enumerate all subsets.
2. Candidate filtering: Eliminate infrequent subsets.
Exploit Apriori Principle!
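The two phases can be sketched as a level-wise loop in the spirit of Apriori. This is a simplified illustration under my own naming (`apriori`, `count`), not the implementation demonstrated in the talk; each level costs one pass over the data.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for all frequent itemsets, level by level."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One pass over the data per level.
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    items = set().union(*transactions)
    singletons = count({frozenset([i]) for i in items})
    level = {c: s for c, s in singletons.items() if s >= min_support}
    frequent = dict(level)
    size = 2
    while level:
        previous = set(level)
        # Phase 1: candidate generation by joining frequent (size-1)-itemsets.
        candidates = {a | b for a in previous for b in previous if len(a | b) == size}
        # Apriori principle: drop candidates with any infrequent (size-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in previous for s in combinations(c, size - 1))
        }
        # Phase 2: candidate filtering by counting against the data.
        level = {c: s for c, s in count(candidates).items() if s >= min_support}
        frequent.update(level)
        size += 1
    return frequent


transactions = [
    {"B", "C", "D"}, {"A", "C", "D"}, {"A", "B", "C", "D"},
    {"A", "D"}, {"B", "C", "D"}, {"B", "C", "D"},
]
frequent = apriori(transactions, min_support=4)
```

Because {A} fails the threshold at the first level, no itemset containing A is ever generated or counted.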
Pattern Growth
Pattern Growth
Two phases
1. Candidate filtering
2. Conditional database construction.
Avoid full scans over the data & large
candidate sets!
Pattern Growth - Depth First
{B, C, D}
{A, C, D}
{B, D}
{A, C, D}
{B, C, D}
{A, B, D}
{AB} : 1 {AC} : 2 {AD} : 3 {BD} : 4 {CD} : 4
{A} : 3 {B} : 4 {C} : 4 {D} : 6
{ } : 6
{ABC} : 0 {ABD} : 1 {ACD} : 2 {BCD} : 2
{BC} : 2
Pattern Growth - Preprocessing
{B, C, D}
{A, C, D}
{B, D}
{A, C, D}
{B, C, D}
{A, B, D}
{A} : 3 {B} : 4 {C} : 4 {D} : 6
{ } : 6
Preprocessing
First, count the number of times
each item (singleton) occurs.
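The preprocessing pass is a single scan with a counter. A minimal sketch on the six transactions of this example, assuming the same threshold M = 4 used earlier in the talk:

```python
from collections import Counter

db = [
    {"B", "C", "D"}, {"A", "C", "D"}, {"B", "D"},
    {"A", "C", "D"}, {"B", "C", "D"}, {"A", "B", "D"},
]

# Each transaction is a set, so an item is counted at most once per transaction.
item_counts = Counter(item for transaction in db for item in transaction)

min_support = 4  # assumed threshold, matching M = 4 from the earlier slides
frequent_items = {item for item, n in item_counts.items() if n >= min_support}
```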
Pattern Growth - Depth First
{B, C, D}
{A, C, D}
{B, D}
{A, C, D}
{B, C, D}
{A, B, D}
{AB} : ? {AC} : ? {AD} : ? {BD} : ? {CD} : ?
{A} : 3 {B} : 4 {C} : 4 {D} : 6
{ } : 6
{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?
{BC} : ?
Pattern Growth - Depth First
{B, C, D}
{A, C, D}
{B, D}
{A, C, D}
{B, C, D}
{A, B, D}
{AB} : ? {AC} : ? {AD} : ? {BD} : ? {CD} : ?
{A} : 3 {B} : 4 {C} : 4 {D} : 6
{ } : 6
{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?
{BC} : ?
No need to
explore!
Pattern Growth - Depth First
{B, C, D}
{A, C, D}
{B, D}
{A, C, D}
{B, C, D}
{A, B, D}
{AB} : X {AC} : ? {AD} : ? {BD} : X {CD} : ?
{A} : 3 {B} : 4 {C} : 4 {D} : 6
{ } : 6
{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?
{BC} : X
Explore depth
first on {B}
Pattern Growth
{B} : 4
{ } : 6
Conditional Database Construction
DB{} DB{B}
{B, C, D}
{A, C, D}
{B, D}
{A, C, D}
{B, C, D}
{A, B, D}
{C, D}
{D}
{C, D}
{D}
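The construction of DB{B} can be sketched as a filter plus a projection. The slide's DB{B} contains no A, which I read as infrequent items being dropped during projection; that reading, and the helper name `conditional_db`, are assumptions.

```python
def conditional_db(db, item, frequent_items):
    """Keep transactions containing `item`; strip `item` and infrequent items."""
    keep = set(frequent_items) - {item}
    return [transaction & keep for transaction in db if item in transaction]


db = [
    {"B", "C", "D"}, {"A", "C", "D"}, {"B", "D"},
    {"A", "C", "D"}, {"B", "C", "D"}, {"A", "B", "D"},
]
frequent_items = {"B", "C", "D"}  # singletons with count >= 4

db_b = conditional_db(db, "B", frequent_items)
```

Only four of the six transactions survive, so every later count for patterns starting with B scans a smaller database.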
Pattern Growth
{B} : 4
{ } : 6
Candidate Filtering
DB{B}
{C, D}
{D}
{C, D}
{D}
{D} : 4
{C} : 2
DB{}
{B, C, D}
{A, C, D}
{B, D}
{A, C, D}
{B, C, D}
{A, B, D}
DB{B}
Add {BD} as frequent
Pattern Growth - Depth First
{C, D}
{D}
{C, D}
{D}
{AB} : X {AC} : ? {AD} : ? {BD} : 4 {CD} : ?
{A} : 3 {B} : 4 {C} : 4 {D} : 6
{ } : 6
{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?
{BC} : 2
Explore depth
first on {BD}
Pattern Growth - Depth First
{AB} : X {AC} : ? {AD} : ? {BD} : 4 {CD} : ?
{A} : 3 {B} : 4 {C} : 4 {D} : 6
{ } : 6
{ABC} : ? {ABD} : X {ACD} : ? {BCD} : X
{BC} : 2
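The whole depth-first walk can be sketched as a short recursion: filter candidates in the current database, then build a conditional database for each surviving item and recurse. This sketch uses plain lists of sets rather than an FP-tree, and the alphabetical item order and the name `pattern_growth` are my assumptions.

```python
from collections import Counter


def pattern_growth(db, min_support, prefix=frozenset()):
    """Return {pattern: support} for frequent patterns that extend `prefix`."""
    # Candidate filtering: count items in the current (conditional) database.
    counts = Counter(item for transaction in db for item in transaction)
    frequent = sorted(item for item, n in counts.items() if n >= min_support)

    results = {}
    for index, item in enumerate(frequent):
        pattern = prefix | {item}
        results[pattern] = counts[item]
        # Conditional database: transactions containing `item`, restricted to
        # frequent items that come later in the order (avoids duplicates).
        later = set(frequent[index + 1:])
        conditional = [t & later for t in db if item in t]
        conditional = [t for t in conditional if t]
        results.update(pattern_growth(conditional, min_support, pattern))
    return results


db = [
    {"B", "C", "D"}, {"A", "C", "D"}, {"B", "D"},
    {"A", "C", "D"}, {"B", "C", "D"}, {"A", "B", "D"},
]
patterns = pattern_growth(db, min_support=4)
```

On this data the branch under {B} sees exactly the conditional database from the slides, adds {B, D}, and stops because nothing remains to extend it.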
Compare & Contrast
• Candidate Generation
+ Better than brute force
+ Filters candidate sets
- Multiple passes over the data
• Pattern Growth
+ Fewer passes over the data
+ Space efficient.
Compare & Contrast
• Candidate Generation
+ Better than brute force
+ Filters candidate sets
- Multiple passes over the data
• Pattern Growth
+ Fewer passes over the data
+ Space efficient.
Better choice
Some cool ideas
FP-Tree Compression
Figures From Florian Verhein’s Slides on FP-Growth
FP-Growth Algorithm
Figures From Florian Verhein’s Slides on FP-Growth
Two phases
1. Candidate filtering.
2. Conditional database construction.
TopK FP-Growth Algorithm
Similar to FP-Growth
1. Dynamically raise min_support.
2. Estimates of min_support greatly help.
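One way to picture the dynamic threshold: keep the best K patterns in a min-heap and, once the heap is full, use its smallest support as the new min_support. The sketch below is my own simplification on projected lists, not the TopK FP-Growth implementation from the talk; `top_k_patterns` is a hypothetical name, `min_length` is ignored, and ties at the cut-off are broken arbitrarily.

```python
import heapq
from collections import Counter


def top_k_patterns(db, k, min_support=1):
    """Return up to k (support, pattern) pairs with the highest support."""
    heap = []  # min-heap holding the best patterns seen so far
    threshold = min_support

    def grow(conditional_db, prefix):
        nonlocal threshold
        counts = Counter(item for t in conditional_db for item in t)
        # Visit high-support items first so the threshold rises early.
        order = sorted(counts, key=lambda item: (-counts[item], item))
        for index, item in enumerate(order):
            support = counts[item]
            if support < threshold:
                break  # remaining items have even lower support
            pattern = prefix + (item,)
            if len(heap) < k:
                heapq.heappush(heap, (support, pattern))
            elif support > heap[0][0]:
                heapq.heapreplace(heap, (support, pattern))
            if len(heap) == k:
                # Dynamically raise min_support to the k-th best support.
                threshold = max(threshold, heap[0][0])
            later = set(order[index + 1:])
            projected = [t & later for t in conditional_db if item in t]
            grow([t for t in projected if t], pattern)

    grow([frozenset(t) for t in db], ())
    return sorted(heap, key=lambda entry: (-entry[0], entry[1]))


db = [
    {"B", "C", "D"}, {"A", "C", "D"}, {"B", "D"},
    {"A", "C", "D"}, {"B", "C", "D"}, {"A", "B", "D"},
]
best = top_k_patterns(db, k=3)
```

A good initial estimate of `min_support` helps because pruning starts before the heap has filled up.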
Future Work
Distributed FP-Growth
Partition database on item-ids.
Database
Bags + Sequences
× 2
Itemset: {Item}
Bags: {Item: quantity}
Sequences : (item)
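The three input shapes map naturally onto standard Python containers. A minimal sketch with a made-up purchase log (the item names are hypothetical):

```python
from collections import Counter

# Hypothetical log of one customer's purchases, in order.
purchase_log = ["latte", "muffin", "latte", "water"]

as_itemset = frozenset(purchase_log)   # Itemset: which items appear
as_bag = Counter(purchase_log)         # Bag: item -> quantity
as_sequence = tuple(purchase_log)      # Sequence: items in order, repeats kept
```

Each step keeps more information than the one before it: the itemset forgets quantities and order, the bag forgets only order.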
Demo: Model built, now what?
Summary
Log Data Mining
≠
Rocket Science
• FP-Growth for finding frequent patterns.
• Find rules from patterns to make predictions.
• Extract features for useful ML in pattern space.
SELECT questions FROM audience
WHERE difficulty == “Easy”
Thanks!

Frequent Pattern Mining - Krishna Sridhar, Feb 2016
