Data Mining Tutorial
Gregory Piatetsky-Shapiro
KDnuggets
© 2006 KDnuggets
Outline
Introduction
Data Mining Tasks
Classification & Evaluation
Clustering
Application Examples
2
© 2006 KDnuggets
Trends leading to Data Flood
More data is generated:
Web, text, images …
Business transactions, calls, ...
Scientific data: astronomy, biology, etc.
More data is captured:
Storage technology faster and cheaper
DBMS can handle bigger DB
3
© 2006 KDnuggets
Largest Databases in 2005
Winter Corp. 2005 Commercial Database Survey:
1. Max Planck Inst. for Meteorology, 222 TB
2. Yahoo, ~100 TB (largest data warehouse)
3. AT&T, ~94 TB
www.wintercorp.com/VLDB/2005_TopTen_Survey/TopTenWinners_2005.asp
4
© 2006 KDnuggets
Data Growth
5
© 2006 KDnuggets
Data Growth Rate
6
© 2006 KDnuggets
Knowledge Discovery Definition
Knowledge Discovery in Data is the non-trivial process of identifying
valid,
novel,
potentially useful,
and ultimately understandable patterns in data.
from Advances in Knowledge Discovery and Data Mining, Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy (Chapter 1), AAAI/MIT Press, 1996
7
© 2006 KDnuggets
Related Fields
(Diagram: Data Mining and Knowledge Discovery at the intersection of Machine Learning, Visualization, Statistics, and Databases.)
8
© 2006 KDnuggets
Statistics, Machine Learning and
Data Mining
Statistics:
more theory-based
more focused on testing hypotheses
Machine learning
more heuristic
focused on improving performance of a learning agent
also looks at real-time learning and robotics – areas not part of data mining
Data Mining and Knowledge Discovery
integrates theory and heuristics
focus on the entire process of knowledge discovery, including data cleaning, learning, and integration and visualization of results
Distinctions are fuzzy
9
© 2006 KDnuggets
Knowledge Discovery Process
flow, according to CRISP-DM
(Diagram: the CRISP-DM process cycle; see www.crisp-dm.org for more information.)
Continuous monitoring and improvement is an addition to CRISP.
10
© 2006 KDnuggets
Historical Note:
Many Names of Data Mining
Data Fishing, Data Dredging: 1960-
used by statisticians (as a derogatory term)
© 2006 KDnuggets
Some Definitions
Attribute or Field
measures an aspect of an instance, e.g. temperature
Class (Label)
a grouping of instances, e.g. days good for playing
13
© 2006 KDnuggets
Major Data Mining Tasks
Classification: predicting an item class
Clustering: finding clusters in data
Associations: e.g. A & B & C occur frequently
Visualization: to facilitate human discovery
Summarization: describing a group
Deviation Detection: finding changes
Estimation: predicting a continuous value
Link Analysis: finding relationships
…
14
© 2006 KDnuggets
Classification
Learn a method for predicting the instance class
from pre-labeled (classified) instances
Many approaches:
Statistics,
Decision Trees,
Neural Networks,
...
15
© 2006 KDnuggets
Clustering
Find “natural” grouping of
instances given un-labeled data
16
© 2006 KDnuggets
Association Rules &
Frequent Itemsets
Transactions:
TID  Produce
1    MILK, BREAD, EGGS
2    BREAD, SUGAR
3    BREAD, CEREAL
4    MILK, BREAD, SUGAR
5    MILK, CEREAL
6    BREAD, CEREAL
7    MILK, CEREAL
8    MILK, BREAD, CEREAL, EGGS
9    MILK, BREAD, CEREAL

Frequent Itemsets:
Milk, Bread (4)
Bread, Cereal (3)
Milk, Bread, Cereal (2)
…

Rules:
Milk => Bread (66%)
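As an added sketch (not part of the original slides), the itemset supports and the rule confidence above can be counted directly in Python; the transaction list below mirrors the table.

```python
from itertools import combinations
from collections import Counter

# Transactions from the table above, in TID order
transactions = [
    {"MILK", "BREAD", "EGGS"},
    {"BREAD", "SUGAR"},
    {"BREAD", "CEREAL"},
    {"MILK", "BREAD", "SUGAR"},
    {"MILK", "CEREAL"},
    {"BREAD", "CEREAL"},
    {"MILK", "CEREAL"},
    {"MILK", "BREAD", "CEREAL", "EGGS"},
    {"MILK", "BREAD", "CEREAL"},
]

# Count the support (number of transactions) of every itemset of size 1 to 3
support = Counter()
for t in transactions:
    for size in (1, 2, 3):
        for itemset in combinations(sorted(t), size):
            support[itemset] += 1

print(support[("BREAD", "MILK")])                       # 4, the support of {Milk, Bread}
# Confidence of Milk => Bread: support(Milk, Bread) / support(Milk)
print(support[("BREAD", "MILK")] / support[("MILK",)])  # about 0.67, i.e. the 66% above
```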
17
© 2006 KDnuggets
Visualization & Data Mining
Visualizing the data to facilitate human discovery
Presenting the discovered results in a visually "nice" way
18
© 2006 KDnuggets
Summarization
© 2006 KDnuggets
Classification
Learn a method for predicting the instance class
from pre-labeled (classified) instances
Many approaches:
Regression,
Decision Trees,
Bayesian,
Neural Networks,
...
Linear Regression
w0 + w1 x + w2 y >= 0
Regression computes wi from data to minimize squared error to 'fit' the data
Not flexible enough
23
© 2006 KDnuggets
Regression for Classification
Any regression technique can be used for classification
Training: perform a regression for each class, setting the output to 1 for training instances that belong to the class, and 0 for those that don't
Prediction: predict the class corresponding to the model with the largest output value (membership value)
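To make the recipe concrete, here is a rough sketch (my illustration, not from the tutorial) of one-regression-per-class using scikit-learn's LinearRegression; the toy data and class names are made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: two numeric attributes, three classes (made-up values)
X = np.array([[1.0, 2.0], [2.0, 1.0], [6.0, 5.0], [7.0, 6.0], [1.0, 8.0], [2.0, 9.0]])
y = np.array(["a", "a", "b", "b", "c", "c"])

# Training: one regression per class, target is 1 for that class and 0 otherwise
models = {}
for cls in np.unique(y):
    models[cls] = LinearRegression().fit(X, (y == cls).astype(float))

# Prediction: the class whose regression gives the largest output (membership value)
def predict(x):
    return max(models, key=lambda cls: models[cls].predict([x])[0])

print(predict([6.5, 5.5]))  # likely "b" for a point near the "b" cluster
```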
24
© 2006 KDnuggets
Classification: Decision Trees
25
© 2006 KDnuggets
DECISION TREE
An internal node is a test on an attribute.
A branch represents an outcome of the test, e.g., Color=red.
A leaf node represents a class label or class label distribution.
At each node, one attribute is chosen to split training examples into distinct classes as much as possible.
A new instance is classified by following a matching path to a leaf node.
26
© 2006 KDnuggets
Weather Data: Play or not Play?
Outlook   Temperature  Humidity  Windy  Play?
sunny     hot          high      false  No
sunny     hot          high      true   No
overcast  hot          high      false  Yes
rain      mild         high      false  Yes
rain      cool         normal    false  Yes
rain      cool         normal    true   No
overcast  cool         normal    true   Yes
sunny     mild         high      false  No
sunny     cool         normal    false  Yes
rain      mild         normal    false  Yes
sunny     mild         normal    true   Yes
overcast  mild         high      true   Yes
overcast  hot          normal    false  Yes
rain      mild         high      true   No

Note: Outlook is the weather forecast; no relation to the Microsoft email program.
27
© 2006 KDnuggets
Example Tree for “Play?”
Outlook?
  sunny    -> Humidity? (high: No, normal: Yes)
  overcast -> Yes
  rain     -> Windy?    (true: No, false: Yes)
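For concreteness, the tree above can be written as nested attribute tests; a small sketch (added here, not from the slides):

```python
def play(outlook, humidity, windy):
    """Classify a day with the decision tree above."""
    if outlook == "sunny":
        return "No" if humidity == "high" else "Yes"
    if outlook == "overcast":
        return "Yes"
    # outlook == "rain"
    return "No" if windy else "Yes"

print(play("sunny", "high", False))     # No  (row 1 of the weather data)
print(play("overcast", "high", False))  # Yes (row 3)
print(play("rain", "high", True))       # No  (last row)
```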
28
© 2006 KDnuggets
Classification: Neural Nets
29
© 2006 KDnuggets
Classification: other approaches
Naïve Bayes
Rules
Support Vector Machines
Genetic Algorithms
…
See www.KDnuggets.com/software/
30
© 2006 KDnuggets
Evaluation
© 2006 KDnuggets
Evaluating which method works best for classification
No model is uniformly the best
Dimensions for Comparison
speed of training
speed of model application
noise tolerance
explanation ability
32
© 2006 KDnuggets
Comparison of Major
Classification Approaches
34
© 2006 KDnuggets
Evaluation issues
35
© 2006 KDnuggets
Classifier error rate
36
© 2006 KDnuggets
Evaluation on “LARGE” data
37
© 2006 KDnuggets
Classification Step 1:
Split data into train and test sets
(Diagram: historical data with known results (+/-) is split into a Training set and a Testing set.)
38
© 2006 KDnuggets
Classification Step 2:
Build a model on a training set
(Diagram: the Training set is fed into a Model Builder; the Testing set is held aside.)
39
© 2006 KDnuggets
Classification Step 3:
Evaluate on test set (Re-train?)
(Diagram: the model built on the Training set makes Y/N predictions on the Testing set, which are compared with the known results to evaluate the model.)
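A minimal sketch of the three steps, assuming scikit-learn (the tutorial itself names no toolkit) and using a bundled two-class dataset in place of the +/- data from the diagrams:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)

# Step 1: split the data with known results into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Step 2: build a model on the training set only
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Step 3: evaluate its predictions on the held-out testing set
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```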
40
© 2006 KDnuggets
Unbalanced data
41
© 2006 KDnuggets
Handling unbalanced data –
how?
If we have two classes that are very unbalanced, how can we evaluate our classification method?
42
© 2006 KDnuggets
Balancing unbalanced data, 1
43
© 2006 KDnuggets
Balancing unbalanced data, 2
44
© 2006 KDnuggets
A note on parameter tuning
It is important that the test data is not used in any way to create the classifier
Some learning schemes operate in two stages:
Stage 1: builds the basic structure
Stage 2: optimizes parameter settings
45
© 2006 KDnuggets
Making the most of the data
46
© 2006 KDnuggets
Classification:
Train, Validation, Test split
(Diagram: the Training set is used by a Model Builder; predictions on the Validation set are evaluated to tune the model; a held-out Final Test Set gives the Final Evaluation of the Final Model.)
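The same idea with a separate validation set for parameter tuning; a sketch under the same scikit-learn assumption, with arbitrary candidate tree depths, where the final test set is scored only once:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Split into train / validation / final test (60% / 20% / 20%)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Tune a parameter (tree depth) on the validation set, never on the test set
best_depth = max([2, 4, 8, None],
                 key=lambda d: DecisionTreeClassifier(max_depth=d, random_state=0)
                               .fit(X_train, y_train).score(X_val, y_val))

# Final evaluation: refit with the chosen parameter, score the untouched test set once
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_train, y_train)
print("final test accuracy:", final_model.score(X_test, y_test))
```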
47
© 2006 KDnuggets
Cross-validation
Break the data into groups (folds) of the same size
Hold aside one group for testing and use the rest to build the model
Repeat for each group (a short sketch follows)
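A sketch of k-fold cross-validation, again assuming scikit-learn; cross_val_score holds aside each fold in turn and reports the per-fold accuracy:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 10 folds: each fold is held aside once for testing; the rest builds the model
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean())
```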
49
© 2006 KDnuggets
More on cross-validation
50
© 2006 KDnuggets
Direct Marketing Paradigm
Find most likely prospects to contact
Not everybody needs to be contacted
Number of targets is usually much smaller than the number of prospects
Typical Applications
retailers, catalogues, direct mail (and e-mail)
customer acquisition, cross-sell, attrition prediction
...
51
© 2006 KDnuggets
Direct Marketing Evaluation
52
© 2006 KDnuggets
Model-Sorted List
Use a model to assign score to each customer
Sort customers by decreasing score
Expect more targets (hits) near the top of the list
No   Score  Target  CustID  Age
1    0.97   Y       1746    …
2    0.95   N       1024    …
3    0.94   Y       2478    …
4    0.93   Y       3820    …
5    0.92   N       4897    …
…    …      …       …       …
99   0.11   N       2734    …
100  0.06   N       2422

3 hits in the top 5% of the list.
If there are 15 targets overall, then the top 5 have 3/15 = 20% of the targets.
53
© 2006 KDnuggets
CPH (Cumulative Pct Hits)
Definition: CPH(P, M) = % of all targets in the first P% of the list scored by model M.
CPH is frequently called Gains.
(Chart: Cumulative % Hits vs. percent of the list for a random list.)
5% of a random list has 5% of the targets.
54
© 2006 KDnuggets
CPH: Random List vs
Model-ranked list
(Chart: Cumulative % Hits vs. percent of the list for a random list and a model-ranked list.)
5% of a random list has 5% of the targets,
but 5% of the model-ranked list has 21% of the targets:
CPH(5%, model) = 21%.
55
© 2006 KDnuggets
Lift
Lift(P, M) = CPH(P, M) / P, where P is the percent of the list.
For example, from the previous slide: Lift(5%, model) = 21% / 5% = 4.2.
Note: some use "Lift" for what we call CPH.
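A small sketch (mine, not the tutorial's) computing CPH and Lift from a model-sorted list of 0/1 targets; the list below is illustrative and built to match the earlier example of 3 hits in the top 5% out of 15 targets:

```python
def cph(sorted_targets, p):
    """CPH(P, M): % of all targets in the first P% of the model-sorted list."""
    n_top = round(len(sorted_targets) * p / 100)
    return 100.0 * sum(sorted_targets[:n_top]) / sum(sorted_targets)

def lift(sorted_targets, p):
    """Lift(P, M) = CPH(P, M) / P."""
    return cph(sorted_targets, p) / p

# 100 customers already sorted by decreasing model score:
# 1 = target (hit), 0 = non-target; 15 targets overall, 3 of them in the top 5
targets = [1, 0, 1, 1, 0] + [1] * 12 + [0] * 83
print(cph(targets, 5))   # 20.0 -> the top 5% of the list holds 20% of the targets
print(lift(targets, 5))  # 4.0
```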
56
© 2006 KDnuggets
Lift – a measure of model quality
57
© 2006 KDnuggets
Clustering
© 2006 KDnuggets
Clustering
Unsupervised learning:
Finds “natural” grouping of
instances given un-labeled data
59
© 2006 KDnuggets
Clustering Methods
Many different methods and algorithms:
For numeric and/or symbolic data
Deterministic vs. probabilistic
Exclusive vs. overlapping
Hierarchical vs. flat
Top-down vs. bottom-up
60
© 2006 KDnuggets
Clustering Evaluation
Manual inspection
Benchmarking on existing labels
Cluster quality measures
distance measures
high similarity within a cluster, low across clusters
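As a toy illustration of a distance-based quality measure (my example; the slide names no specific measure), average within-cluster and between-cluster distances can be compared directly:

```python
import numpy as np
from itertools import combinations

# Toy 2D points with a hand-assigned cluster label for each (illustrative values)
points = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
labels = np.array([0, 0, 0, 1, 1, 1])

def mean_dist(pairs):
    return np.mean([np.linalg.norm(points[i] - points[j]) for i, j in pairs])

pairs = list(combinations(range(len(points)), 2))
within = mean_dist([(i, j) for i, j in pairs if labels[i] == labels[j]])
between = mean_dist([(i, j) for i, j in pairs if labels[i] != labels[j]])

# A good clustering has small within-cluster and large between-cluster distances
print("average within-cluster distance: ", within)
print("average between-cluster distance:", between)
```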
61
© 2006 KDnuggets
The distance function
62
© 2006 KDnuggets
Simple Clustering: K-means
Works with numeric data only
1) Pick a number (K) of cluster centers, at random
2) Assign every item to its nearest cluster center (e.g. using Euclidean distance)
3) Move each cluster center to the mean of its assigned items
4) Repeat steps 2-3 until convergence (change in cluster assignments less than a threshold); see the sketch below
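A compact NumPy sketch of these four steps (my illustration; the slide gives only the pseudocode above):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1) pick K cluster centers at random (here: K distinct data points)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    assign = None
    for _ in range(n_iter):
        # 2) assign every item to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        # 4) stop when the cluster assignments no longer change
        if assign is not None and np.array_equal(new_assign, assign):
            break
        assign = new_assign
        # 3) move each center to the mean of its assigned items
        # (empty clusters are not handled in this minimal sketch)
        centers = np.array([X[assign == j].mean(axis=0) for j in range(k)])
    return centers, assign

# Toy run on a few 2D points with two obvious groups
X = np.array([[1.0, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]])
centers, assign = kmeans(X, k=2)
print(centers)   # roughly [[1.33, 1.33], [8.33, 8.33]] (order may vary)
print(assign)
```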
63
© 2006 KDnuggets
K-means example, step 1
Pick 3 initial cluster centers (at random).
(Figure: scatter of points with centers c1, c2, c3.)
64
© 2006 KDnuggets
K-means example, step 2
Assign each point to the closest cluster center.
(Figure: each point linked to its nearest center c1, c2, or c3.)
65
© 2006 KDnuggets
K-means example, step 3
Move each cluster center to the mean of each cluster.
(Figure: centers c1, c2, c3 shifted to the cluster means.)
66
© 2006 KDnuggets
K-means example, step 4a
Reassign points that are now closest to a different cluster center.
Q: Which points are reassigned?
(Figure: points shown with the updated centers c1, c2, c3.)
67
© 2006 KDnuggets
K-means example, step 4b
A: these three points.
(Figure: the three reassigned points highlighted.)
68
© 2006 KDnuggets
K-means example, step 5a
Re-compute cluster means.
(Figure: points grouped around centers c1, c2, c3.)
69
© 2006 KDnuggets
K-means example, step 5b
Move cluster centers to cluster means.
(Figure: centers moved to the new cluster means.)
70
© 2006 KDnuggets
Data Mining Applications
© 2006 KDnuggets
Problems Suitable for Data Mining
72
© 2006 KDnuggets
Major Application Areas for
Data Mining Solutions
Advertising
Bioinformatics
Customer Relationship Management (CRM)
Database Marketing
Fraud Detection
eCommerce
Health Care
Investment/Securities
Manufacturing, Process Control
Sports and Entertainment
Telecommunications
Web
73
© 2006 KDnuggets
Application: Search Engines
76
© 2006 KDnuggets
Application:
Direct Marketing and CRM
Most major direct marketing companies are using modeling and data mining
Most financial companies are using customer modeling
Modeling is easier than changing customer behaviour
Example:
Verizon Wireless reduced customer attrition rate from 2% to 1.5%, saving many millions of $
77
© 2006 KDnuggets
Application: e-Commerce
Amazon.com recommendations
if you bought (viewed) X, you are likely to buy Y
Netflix
If you liked "Monty Python and the Holy Grail",
you get a recommendation for "This is Spinal Tap"
Comparison shopping
Froogle, mySimon, Yahoo Shopping, …
78
© 2006 KDnuggets
Application:
Security and Fraud Detection
Credit Card Fraud Detection
over 20 million credit cards protected by neural networks (Fair, Isaac)
79
© 2006 KDnuggets
Data Mining, Privacy, and Security
80
© 2006 KDnuggets
Criticism of Analytic Approaches
to Threat Detection:
Data Mining will be ineffective: it will generate millions of false positives and invade privacy
81
© 2006 KDnuggets
Can Data Mining and Statistics
be Effective for Threat Detection?
Criticism: Databases have 5% errors, so analyzing 100 million suspects will generate 5 million false positives
Reality: Analytical models correlate many items of information to reduce false positives.
Example: Identify one biased coin from 1,000.
After one throw of each coin, we cannot tell which coin is biased.
After 30 throws, one biased coin will stand out with high probability.
With a sufficient number of throws, we can identify even 19 biased coins out of 100 million.
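A back-of-the-envelope sketch of the coin argument in Python; the 80% heads bias and the ≥ 21-heads decision rule are my own illustrative assumptions:

```python
from math import comb

def tail_prob(n, p, k):
    """P(at least k heads in n throws of a coin with heads probability p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n, threshold = 30, 21      # flag any coin showing >= 21 heads in 30 throws
fair, biased = 0.5, 0.8    # assumed bias of the "bad" coin (illustrative)

print("fair coin flagged:  ", tail_prob(n, fair, threshold))    # roughly 0.02
print("biased coin flagged:", tail_prob(n, biased, threshold))  # roughly 0.9
```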
82
© 2006 KDnuggets
Another Approach: Link Analysis
84
© 2006 KDnuggets
Data Mining with Privacy
85
© 2006 KDnuggets
The Hype Curve for
Data Mining and Knowledge
Discovery
(Figure: hype curve, 1990–2005, with two curves, Expectations and Performance; labeled phases: rising expectations, over-inflated expectations, disappointment, and growing acceptance and mainstreaming, while performance rises steadily.)
86
© 2006 KDnuggets
Summary
87
© 2006 KDnuggets
Additional Resources
www.KDnuggets.com
data mining software, jobs, courses, etc.
www.acm.org/sigkdd
ACM SIGKDD – the professional society for data mining
88
© 2006 KDnuggets