Data Mining: Practical Machine Learning Tools and Techniques
●
Inferring rudimentary rules
●
Statistical modeling
●
Constructing decision trees
●
Constructing rules
●
Association rule learning
●
Linear models
●
Instance-based learning
●
Clustering
●
Simple algorithms often work very well!
●
There are many kinds of simple structure, e.g.:
♦ One attribute does all the work
♦ All attributes contribute equally & independently
♦ A weighted linear combination might do
♦ Instance-based: use a few prototypes
♦ Use simple logical rules
●
Success of method depends on the domain
●
1R: learns a 1-level decision tree
♦ I.e., rules that all test one particular attribute
●
Basic version
♦ One branch for each value
♦ Each branch assigns most frequent class
♦ Error rate: proportion of instances that don’t belong
to the majority class of their corresponding branch
♦ Choose attribute with lowest error rate
(assumes nominal attributes)
●
Note: “missing” is treated as a separate attribute value
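The basic 1R procedure described above can be written down compactly; the following is a minimal sketch for nominal attributes (data layout and names are my own illustration, not the book's code):

from collections import Counter, defaultdict

def one_r(instances, attributes, class_index):
    """Basic 1R: for each attribute, build one branch per value that predicts
    the majority class, then keep the attribute with the lowest error rate."""
    best = None
    for a in attributes:
        # Count the classes observed for each value of attribute a.
        counts = defaultdict(Counter)
        for row in instances:
            counts[row[a]][row[class_index]] += 1
        # Each branch assigns the most frequent class for its value.
        rules = {value: cnt.most_common(1)[0][0] for value, cnt in counts.items()}
        # Error rate: instances not in the majority class of their branch.
        errors = sum(sum(cnt.values()) - cnt.most_common(1)[0][1]
                     for cnt in counts.values())
        if best is None or errors < best[2]:
            best = (a, rules, errors)
    return best  # (attribute index, {value: predicted class}, total errors)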
●
Example, 1R on a numeric attribute (temperature values sorted, with their classes; “|” marks partition boundaries):
64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes No Yes Yes Yes | No No Yes Yes Yes | No Yes Yes No
●
Resulting rule set:
●
Another simple technique: build one rule for each class
♦ Each rule is a conjunction of tests, one for each attribute
♦ For numeric attributes: the test checks whether the value lies
inside an interval
● Interval given by minimum and maximum observed in
training data
♦ For nominal attributes: the test checks whether the value is
one of the values observed in the training data
●
“Opposite” of 1R: use all the attributes
●
Two assumptions: Attributes are
♦ equally important
♦ statistically independent (given the class value)
● I.e., knowing the value of one attribute says nothing about
the value of another (if the class is known)
●
Independence assumption is never correct!
●
But … this scheme works well in practice
●
A new day: Outlook Temp. Humidity Windy Play
Sunny Cool High True ?
Pr[H | E] = Pr[E | H] · Pr[H] / Pr[E]
●
A priori probability of H: Pr[H]
♦ Probability of event before evidence is seen
●
A posteriori probability of H: Pr[H | E]
♦ Probability of event after evidence is seen
Thomas Bayes
Born: 1702 in London, England
Died: 1761 in Tunbridge Wells, Kent, England
●
Classification learning: what’s the probability of
the class given an instance?
♦ Evidence E = instance
♦ Event H = class value for instance
●
Naïve assumption: evidence splits into parts (i.e.
attributes) that are independent
●
What if an attribute value doesn’t occur with every class
value?
(e.g. “Humidity = high” for class “yes”)
♦ Probability will be zero! Pr [Humidity=High∣yes]=0
♦ A posteriori probability will also be zero! Pr [yes∣E]=0
(No matter how likely the other values are!)
●
Remedy: add 1 to the count for every attribute value-class
combination (Laplace estimator)
●
Result: probabilities will never be zero!
(also: stabilizes probability estimates)
●
In some cases adding a constant different from 1
might be more appropriate
●
Example: attribute outlook for class yes
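As a concrete, hedged illustration: applying the plain Laplace estimator (add one to every count) to the outlook counts for class yes in the weather data (sunny 2, overcast 4, rainy 3, out of 9 yes instances):

# Counts of outlook values among the 9 "yes" instances of the weather data.
counts = {"sunny": 2, "overcast": 4, "rainy": 3}
total = sum(counts.values())          # 9
k = len(counts)                       # 3 possible attribute values

# Laplace estimator: add 1 to each count (and hence k to the total).
laplace = {v: (c + 1) / (total + k) for v, c in counts.items()}
print(laplace)   # {'sunny': 3/12, 'overcast': 5/12, 'rainy': 4/12}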
f(x) = (1 / (√(2π) · σ)) · e^(−(x − μ)² / (2σ²))
Statistics for weather data
●
Example density value:
f(temperature = 66 | yes) = (1 / (√(2π) · 6.2)) · e^(−(66 − 73)² / (2 · 6.2²)) = 0.0340
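This value can be checked directly in code (μ = 73 and σ = 6.2 are the temperature mean and standard deviation for class yes quoted above):

import math

def gaussian_density(x, mu, sigma):
    # Normal density: (1 / (sqrt(2*pi)*sigma)) * exp(-(x-mu)^2 / (2*sigma^2))
    return (1.0 / (math.sqrt(2 * math.pi) * sigma)) * \
           math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

print(gaussian_density(66, 73, 6.2))   # ≈ 0.0340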
●
A new day: Outlook Temp. Humidity Windy Play
Sunny 66 90 true ?
●
Missing values during training are not included
in calculation of mean and standard deviation
●
Relationship between probability and density:
Pr[c − ε/2 ≤ x ≤ c + ε/2] ≈ ε × f(c)
●
But: this doesn’t change calculation of a posteriori
probabilities because ε cancels out
●
Exact relationship:
Pr[a ≤ x ≤ b] = ∫_a^b f(t) dt
●
Ignores probability of generating a document of the right length
(prob. assumed constant for each class)
●
Naïve Bayes works surprisingly well (even if independence
assumption is clearly violated)
●
Why? Because classification doesn’t require accurate
probability estimates as long as maximum probability is
assigned to correct class
●
However: adding too many redundant attributes will cause
problems (e.g. identical attributes)
●
Note also: many numeric attributes are not normally
distributed (→ kernel density estimators)
●
Strategy: top down
Recursive divide-and-conquer fashion
♦ First: select attribute for root node
Create branch for each possible attribute value
♦ Then: split instances into subsets
One for each branch extending from the node
♦ Finally: repeat recursively for each branch, using only
instances that reach the branch
●
Stop if all instances have the same class
●
Which is the best attribute?
♦ Want to get the smallest tree
♦ Heuristic: choose the attribute that produces the
“purest” nodes
●
Popular impurity criterion: information gain
♦ Information gain increases with the average purity of
the subsets
●
Strategy: choose attribute that gives greatest
information gain
●
Measure information in bits
♦ Given a probability distribution, the info required to
predict an event is the distribution's entropy
●
Outlook = Sunny:
info([2,3]) = entropy(2/5, 3/5) = −2/5 log(2/5) − 3/5 log(3/5) = 0.971 bits
●
Outlook = Overcast:
info([4,0]) = entropy(1, 0) = −1 log(1) − 0 log(0) = 0 bits
(Note: 0 log(0) is normally undefined.)
●
Outlook = Rainy:
info([2,3]) = entropy(3/5, 2/5) = −3/5 log(3/5) − 2/5 log(2/5) = 0.971 bits
●
Expected information for attribute:
info([3,2],[4,0],[3,2]) = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.693 bits
●
Information gain: information before splitting –
information after splitting
gain(Outlook ) = info([9,5]) – info([2,3],[4,0],[3,2])
= 0.940 – 0.693
= 0.247 bits
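A small sketch that reproduces these numbers from the class counts of the weather data split on Outlook (a hedged illustration, not the book's code):

import math

def entropy(counts):
    """Information value (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def info_after_split(subsets):
    """Expected information after splitting into the given subsets."""
    total = sum(sum(s) for s in subsets)
    return sum(sum(s) / total * entropy(s) for s in subsets)

before = entropy([9, 5])                            # ≈ 0.940 bits
after = info_after_split([[2, 3], [4, 0], [3, 2]])  # ≈ 0.693 bits
print(round(before - after, 3))                     # gain(Outlook) = 0.247 bits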
●
Information gain for attributes from weather data:
gain(Outlook) = 0.247 bits
gain(Temperature) = 0.029 bits
gain(Humidity) = 0.152 bits
gain(Windy) = 0.048 bits
●
Note: not all leaves need to be pure; sometimes
identical instances have different classes
⇒ Splitting stops when data can’t be split any further
●
Properties we require from a purity measure:
♦ When node is pure, measure should be zero
♦ When impurity is maximal (i.e. all classes equally
likely), measure should be maximal
♦ Measure should obey multistage property (i.e. decisions
can be made in several stages):
measure([2,3,4]) = measure([2,7]) + (7/9) × measure([3,4])
●
Entropy is the only function that satisfies all three
properties!
●
The multistage property:
entropy(p, q, r) = entropy(p, q + r) + (q + r) × entropy(q/(q + r), r/(q + r))
●
Simplification of computation:
info([2,3,4]) = −2/9 × log(2/9) − 3/9 × log(3/9) − 4/9 × log(4/9)
             = [−2 × log(2) − 3 × log(3) − 4 × log(4) + 9 × log(9)] / 9
●
Note: instead of maximizing info gain we could
just minimize information
●
Problematic: attributes with a large number of
values (extreme case: ID code)
●
Subsets are more likely to be pure if there is a
large number of values
⇒ Information gain is biased towards choosing
attributes with a large number of values
⇒ This may result in overfitting (selection of an
attribute that is non-optimal for prediction)
●
Another problem: fragmentation
●
Entropy of split:
info(“ID code”) = info([0,1]) + info([0,1]) + ... + info([0,1]) = 0 bits
⇒ Information gain is maximal for ID code (namely
0.940 bits)
●
Gain ratio: a modification of the information gain that
reduces its bias
●
Gain ratio takes number and size of branches into
account when choosing an attribute
♦ It corrects the information gain by taking the intrinsic
information of a split into account
●
Intrinsic information: entropy of distribution of instances
into branches (i.e. how much info do we need to tell
which branch an instance belongs to)
●
Example: intrinsic information for ID code
info([1,1,...,1]) = 14 × (−1/14 × log(1/14)) = 3.807 bits
●
Value of attribute decreases as intrinsic
information gets larger
●
Definition of gain ratio:
gain_ratio(attribute) = gain(attribute) / intrinsic_info(attribute)
●
Example:
Outlook:      Info: 0.693   Gain: 0.940 − 0.693 = 0.247   Split info: info([5,4,5]) = 1.577   Gain ratio: 0.247/1.577 = 0.157
Temperature:  Info: 0.911   Gain: 0.940 − 0.911 = 0.029   Split info: info([4,6,4]) = 1.557   Gain ratio: 0.029/1.557 = 0.019
Humidity:     Info: 0.788   Gain: 0.940 − 0.788 = 0.152   Split info: info([7,7]) = 1.000     Gain ratio: 0.152/1.000 = 0.152
Windy:        Info: 0.892   Gain: 0.940 − 0.892 = 0.048   Split info: info([8,6]) = 0.985     Gain ratio: 0.048/0.985 = 0.049
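Reusing the entropy helper from the information-gain sketch above, the figures for Outlook can be checked like this:

# Split info: entropy of how the 14 instances distribute over the 3 branches.
split_info = entropy([5, 4, 5])              # ≈ 1.577 bits
gain_outlook = 0.940 - 0.693                 # 0.247 bits
print(round(gain_outlook / split_info, 3))   # ≈ 0.157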
●
“Outlook” still comes out top
●
However: “ID code” has greater gain ratio
♦ Standard fix: ad hoc test to prevent splitting on that type of
attribute
●
Problem with gain ratio: it may overcompensate
♦ May choose an attribute just because its intrinsic
information is very low
♦ Standard fix: only consider attributes with greater than
average information gain
●
Top-down induction of decision trees: ID3,
algorithm developed by Ross Quinlan
♦ Gain ratio just one modification of this basic algorithm
♦ ⇒ C4.5: deals with numeric attributes, missing values,
noisy data
●
Similar approach: CART
●
There are many other attribute selection criteria!
(But little difference in accuracy of result)
●
Possible rule set for class “b”:
If x ≤ 1.2 then class = b
If x > 1.2 and y ≤ 2.6 then class = b
●
Could add more rules, get “perfect” rule set
●
But: rule sets can be more perspicuous when decision
trees suffer from replicated subtrees
●
Also: in multiclass situations, covering algorithm
concentrates on one class at a time whereas decision tree
learner takes all classes into account
●
Generates a rule by adding tests that maximize rule’s
accuracy
●
Similar to situation in decision trees: problem of
selecting an attribute to split on
♦ But: decision tree inducer maximizes overall purity
●
Each new test reduces
rule’s coverage:
●
Goal: maximize accuracy
♦ t: total number of instances covered by rule
♦ p: positive examples of the class covered by rule
♦ t − p: number of errors made by rule
⇒ Select the test that maximizes the ratio p/t
●
Rule we seek:
If ?
then recommendation = hard
●
Possible tests:
●
Rule with best test added:
If astigmatism = yes
then recommendation = hard
●
Instances covered by modified rule:
Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
Young           Myope                   Yes          Reduced               None
Young           Myope                   Yes          Normal                Hard
Young           Hypermetrope            Yes          Reduced               None
Young           Hypermetrope            Yes          Normal                Hard
Pre-presbyopic  Myope                   Yes          Reduced               None
Pre-presbyopic  Myope                   Yes          Normal                Hard
Pre-presbyopic  Hypermetrope            Yes          Reduced               None
Pre-presbyopic  Hypermetrope            Yes          Normal                None
Presbyopic      Myope                   Yes          Reduced               None
Presbyopic      Myope                   Yes          Normal                Hard
Presbyopic      Hypermetrope            Yes          Reduced               None
Presbyopic      Hypermetrope            Yes          Normal                None
●
Current state:
If astigmatism = yes
and ?
then recommendation = hard
●
Possible tests:
●
Rule with best test added:
If astigmatism = yes
and tear production rate = normal
then recommendation = hard
●
Instances covered by modified rule:
Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
Young           Myope                   Yes          Normal                Hard
Young           Hypermetrope            Yes          Normal                Hard
Pre-presbyopic  Myope                   Yes          Normal                Hard
Pre-presbyopic  Hypermetrope            Yes          Normal                None
Presbyopic      Myope                   Yes          Normal                Hard
Presbyopic      Hypermetrope            Yes          Normal                None
●
Possible tests:
Age = Young 2/2
Age = Pre-presbyopic 1/2
Age = Presbyopic 1/2
Spectacle prescription = Myope 3/3
Spectacle prescription = Hypermetrope 1/3
●
Tie between the first and the fourth test
♦ We choose the one with greater coverage
●
Final rule: If astigmatism = yes
and tear production rate = normal
and spectacle prescription = myope
then recommendation = hard
●
Second rule for recommending “hard lenses”:
(built from instances not covered by first rule)
●
These two rules cover all “hard lenses”:
♦ Process is repeated with other two classes
●
PRISM with outer loop removed generates a decision
list for one class
♦ Subsequent rules are designed for instances that are not covered
by previous rules
♦ But: order doesn’t matter because all rules predict the same
class
●
Outer loop considers all classes separately
♦ No order dependence implied
●
Problems: overlapping rules, default rule required
●
Methods like PRISM (for dealing with one class)
are separate-and-conquer algorithms:
♦ First, identify a useful rule
♦ Then, separate out all the instances it covers
♦ Finally, “conquer” the remaining instances
●
Difference to divide-and-conquer methods:
♦ Subset covered by rule doesn’t need to be explored
any further
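A condensed, hedged sketch of a PRISM-style separate-and-conquer learner for one class, following the scheme just described (nominal attributes only; the data layout and names are my own illustration):

def learn_rules_for_class(instances, attributes, class_index, target):
    """Separate-and-conquer: build rules for `target` until all of its
    instances are covered; each rule greedily adds the attribute-value
    test that maximizes the accuracy ratio p/t."""
    rules = []
    remaining = list(instances)
    while any(row[class_index] == target for row in remaining):
        covered = list(remaining)
        conditions = []
        used = set()
        # Grow the rule until it is perfect or no attributes are left.
        while any(r[class_index] != target for r in covered) and len(used) < len(attributes):
            best = None
            for a in attributes:
                if a in used:
                    continue
                for value in {r[a] for r in covered}:
                    subset = [r for r in covered if r[a] == value]
                    p = sum(1 for r in subset if r[class_index] == target)
                    t = len(subset)
                    # Maximize p/t; break ties in favour of greater coverage.
                    if t > 0 and (best is None or (p / t, p) > best[0]):
                        best = ((p / t, p), a, value, subset)
            _, a, value, subset = best
            conditions.append((a, value))
            used.add(a)
            covered = subset
        rules.append(conditions)
        # "Separate": remove the instances covered by the new rule.
        remaining = [r for r in remaining
                     if not all(r[a] == v for a, v in conditions)]
    return rules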
●
Naïve method for finding association rules:
♦ Use separate-and-conquer method
♦ Treat every possible combination of attribute values as a
separate class
●
Two problems:
♦ Computational complexity
♦ Resulting number of rules (which would have to be
pruned on the basis of support and confidence)
●
But: we can look for high support rules directly!
●
Support: number of instances correctly covered by
association rule
♦ The same as the number of instances covered by all tests in
the rule (LHS and RHS!)
●
Item: one test/attribute-value pair
●
Item set : all items occurring in a rule
●
Goal: only rules that exceed pre-defined support
⇒ Do it by finding all item sets with the given minimum
support and generating rules from them!
●
In total: 12 one-item sets, 47 two-item sets, 39
three-item sets, 6 four-item sets and 0 five-item sets
(with minimum support of two)
●
Once all item sets with minimum support have been
generated, we can turn them into rules
●
Example:
Humidity = Normal, Windy = False, Play = Yes (4)
●
Seven (2^N − 1) potential rules:
If Humidity = Normal and Windy = False then Play = Yes 4/4
If Humidity = Normal and Play = Yes then Windy = False 4/6
If Windy = False and Play = Yes then Humidity = Normal 4/6
If Humidity = Normal then Windy = False and Play = Yes 4/7
If Windy = False then Humidity = Normal and Play = Yes 4/8
If Play = Yes then Humidity = Normal and Windy = False 4/9
If True then Humidity = Normal and Windy = False
and Play = Yes 4/12
●
In total:
3 rules with support four
5 with support three
50 with support two
●
Resulting rules (all with 100% confidence):
Temperature = Cool, Windy = False ⇒ Humidity = Normal, Play = Yes
Temperature = Cool, Windy = False, Humidity = Normal ⇒ Play = Yes
Temperature = Cool, Windy = False, Play = Yes ⇒ Humidity = Normal
●
How can we efficiently find all frequent item sets?
●
Finding one-item sets easy
●
Idea: use one-item sets to generate two-item sets, two-item
sets to generate three-item sets, …
♦ If (A B) is frequent item set, then (A) and (B) have to be
frequent item sets as well!
♦ In general: if X is a frequent k-item set, then all (k−1)-item
subsets of X are frequent item sets as well
⇒ Compute k-item sets by merging (k−1)-item sets
●
Given: five three-item sets
(A B C), (A B D), (A C D), (A C E), (B C D)
●
Lexicographically ordered!
●
Candidate four-item sets:
(A B C D): OK, because (A C D) and (B C D) are also frequent
(A C D E): not OK, because (C D E) is not frequent
●
Final check by counting instances in dataset!
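A hedged sketch of this candidate-generation step: merge lexicographically ordered (k−1)-item sets that agree on their first k−2 items, and prune candidates with an infrequent (k−1)-subset (the final support check against the dataset is not shown):

from itertools import combinations

def generate_candidates(frequent):
    """frequent: list of lexicographically sorted (k-1)-item tuples."""
    frequent_set = set(frequent)
    k = len(frequent[0]) + 1
    candidates = []
    for i in range(len(frequent)):
        for j in range(i + 1, len(frequent)):
            a, b = frequent[i], frequent[j]
            # Merge two item sets that agree on their first k-2 items.
            if a[:-1] == b[:-1]:
                candidate = a + (b[-1],)
                # Prune: every (k-1)-subset must itself be frequent.
                if all(sub in frequent_set
                       for sub in combinations(candidate, k - 1)):
                    candidates.append(candidate)
    return candidates

three_item = [("A","B","C"), ("A","B","D"), ("A","C","D"), ("A","C","E"), ("B","C","D")]
print(generate_candidates(three_item))   # [('A', 'B', 'C', 'D')]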
●
(k –1)-item sets are stored in hash table
●
We are looking for all high-confidence rules
♦ Support of antecedent obtained from hash table
♦ But: brute-force method is (2^N − 1)
●
Better way: building (c + 1)-consequent rules from c-
consequent ones
♦ Observation: (c + 1)-consequent rule can only hold if all
corresponding c-consequent rules also hold
●
Resulting algorithm similar to procedure for large
item sets
●
Final check of antecedent against hash table!
●
Above method makes one pass through the data for each
different size item set
♦ Other possibility: generate (k+2)-item sets just after (k+1)-item
sets have been generated
♦ Result: more (k+2)-item sets than necessary will be considered,
but fewer passes through the data
♦ Makes sense if data too large for main memory
●
Practical issue: generating a certain number of rules (e.g. by
incrementally reducing min. support)
●
Standard ARFF format very inefficient for typical
market basket data
♦ Attributes represent items in a basket and most items are
usually missing
♦ Data should be represented in sparse format
●
Instances are also called transactions
●
Confidence is not necessarily the best measure
♦ Example: milk occurs in almost every supermarket
transaction
♦ Other measures have been devised (e.g. lift)
●
Work most naturally with numeric attributes
●
Standard technique for numeric prediction
♦ Outcome is linear combination of attributes
x = w_0 + w_1 a_1 + w_2 a_2 + ... + w_k a_k
●
Weights are calculated from the training data
●
Predicted value for the first training instance a^(1):
w_0 a_0^(1) + w_1 a_1^(1) + w_2 a_2^(1) + ... + w_k a_k^(1) = Σ_{j=0}^{k} w_j a_j^(1)
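The weights are typically chosen to minimize the squared error on the training data; here is a minimal illustration (toy numbers of my own, not from the book) using NumPy's least-squares solver:

import numpy as np

# Toy training data: each row is an instance, each column an attribute.
X = np.array([[2.0, 3.0],
              [1.0, 5.0],
              [4.0, 1.0],
              [3.0, 2.0]])
y = np.array([13.0, 16.0, 11.0, 12.0])

# Add a column of ones so that w_0 acts as the intercept (a_0 = 1).
A = np.hstack([np.ones((X.shape[0], 1)), X])

# Least squares: find w minimizing ||A w - y||^2.
w, *_ = np.linalg.lstsq(A, y, rcond=None)

# Predicted value for the first training instance: sum_j w_j * a_j^(1).
print(A[0] @ w)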
●
Logistic regression builds a linear model for a transformed target, the log-odds:
log( Pr[1 | a_1, a_2, ..., a_k] / (1 − Pr[1 | a_1, a_2, ..., a_k]) )
●
Logit transformation maps [0,1] to (-∞ , +∞ )
●
Resulting model:
Pr[1 | a_1, a_2, ..., a_k] = 1 / (1 + e^(−w_0 − w_1 a_1 − ... − w_k a_k))
●
Parameters are found from training data using
maximum likelihood
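The slides only state that the parameters are found by maximum likelihood; below is a minimal, hedged sketch that maximizes the conditional log-likelihood by plain gradient ascent (toy data, learning rate and iteration count are arbitrary choices of mine):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, iters=5000):
    """Gradient ascent on the conditional log-likelihood.
    X: (n, k) attribute matrix, y: (n,) labels in {0, 1}."""
    A = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend a_0 = 1 for w_0
    w = np.zeros(A.shape[1])
    for _ in range(iters):
        p = sigmoid(A @ w)            # Pr[1 | a_1, ..., a_k] under current w
        w += lr * A.T @ (y - p)       # gradient of the log-likelihood
    return w

# Toy one-attribute example (illustrative numbers only).
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])
w = fit_logistic(X, y)
print(sigmoid(np.array([1.0, 2.25]) @ w))   # probability of class 1 at a_1 = 2.25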
●
Pairwise classification with k classes: if learning time is linear in the number
of instances n, total training time goes from kn to (k(k−1)/2) × (2n/k) = (k−1)n
●
For a pair of classes, predict the first class if the first model's output is larger:
w_0^(1) + w_1^(1) a_1 + ... + w_k^(1) a_k > w_0^(2) + w_1^(2) a_1 + ... + w_k^(2) a_k
⇔ (w_0^(1) − w_0^(2)) + (w_1^(1) − w_1^(2)) a_1 + ... + (w_k^(1) − w_k^(2)) a_k > 0
●
Why does this work?
Consider situation where instance a pertaining to the first class has
been added:
(w_0 + a_0) a_0 + (w_1 + a_1) a_1 + (w_2 + a_2) a_2 + ... + (w_k + a_k) a_k
This means the output for a has increased by:
a_0 a_0 + a_1 a_1 + ... + a_k a_k, which is always positive
(Figure: the perceptron as a network with an input layer of attribute nodes and an output layer consisting of a single node)
●
Difference: multiplicative updates instead of additive updates
♦ Weights are multiplied by a user-specified parameter α > 1
(or by its inverse)
●
Another difference: user-specified threshold parameter θ
♦ Predict first class if w_0 a_0 + w_1 a_1 + ... + w_k a_k > θ
●
Winnow is very effective in homing in on relevant
features (it is attribute efficient)
●
Can also be used in an on-line setting in which new
instances arrive continuously
(like the perceptron algorithm)
●
Instance is classified as belonging to the first class (of two classes) if:
●
Different attributes are measured on different scales ⇒
need to be normalized:
a_i = (v_i − min v_i) / (max v_i − min v_i)
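A small illustration of min-max normalization followed by a linear-scan nearest-neighbour lookup with Euclidean distance (toy data and naming are my own):

import numpy as np

def min_max_normalize(X):
    """Scale each attribute to [0, 1]: a_i = (v_i - min) / (max - min)."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0), lo, hi

def nearest_neighbour(train_X, train_y, query):
    """Linear scan: return the class of the closest training instance."""
    dists = np.sqrt(((train_X - query) ** 2).sum(axis=1))
    return train_y[np.argmin(dists)]

X = np.array([[170.0, 60.0], [160.0, 55.0], [185.0, 90.0]])   # toy data
y = np.array(["a", "a", "b"])
Xn, lo, hi = min_max_normalize(X)
q = (np.array([180.0, 80.0]) - lo) / np.where(hi > lo, hi - lo, 1.0)
print(nearest_neighbour(Xn, y, q))   # "b"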
●
Simplest way of finding nearest neighbour: linear scan of
the data
♦ Classification takes time proportional to the product of the
number of instances in training and test sets
●
Nearest-neighbor search can be done more efficiently using
appropriate data structures
●
We will discuss two methods that represent training data in
a tree structure: kD-trees and ball trees
●
Using value closest to mean (rather than median) can be better if
data is skewed
●
Can apply this recursively
●
Can we do the same with kD-trees?
●
Heuristic strategy:
♦ Find leaf node containing new instance
Discussion of nearest-neighbor learning
More discussion
●
Instead of storing all training instances, compress them into
regions
●
Example: hyperpipes (from discussion of 1R)
●
Another simple technique (Voting Feature Intervals):
♦ Construct intervals for each attribute
● Discretize numeric attributes
Clustering
●
Clustering techniques apply when there is no class to be
predicted
●
Aim: divide instances into “natural” groups
●
As we've seen, clusters can be:
♦ disjoint vs. overlapping
♦ deterministic vs. probabilistic
♦ flat vs. hierarchical
●
We'll look at a classic clustering algorithm called k-means
♦ k-means clusters are disjoint, deterministic, and flat
The k-means algorithm
♦ Choose k random instances as the initial cluster centres
♦ Assign each instance to its closest cluster centre
♦ Recompute each centre as the centroid (mean) of the instances assigned to it
♦ Repeat until the cluster assignments no longer change
Discussion
●
Algorithm minimizes squared distance to cluster centers
●
Result can vary significantly
♦ based on initial choice of seeds
●
Can get trapped in local minimum
♦ Example (figure): a particular placement of the initial cluster
centres relative to the instances leads to such a local minimum
●
To increase chance of finding global optimum: restart with
different random seeds
●
Can be applied recursively with k = 2
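A compact sketch of the basic k-means loop (random seeds, assignment, centroid update); restarting with different seeds, as suggested above, amounts to calling it several times and keeping the run with the smallest total squared distance:

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]   # random seeds
    for _ in range(iters):
        # Assign each instance to its closest cluster centre.
        dists = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centre as the mean of the instances assigned to it.
        new_centres = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centres[j] for j in range(k)])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return centres, labels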
Faster distance calculations
●
Can we use kD-trees or ball trees to speed up the
process? Yes:
♦ First, build tree, which remains static, for all the
data points
♦ At each node, store number of instances and sum of
all instances
♦ In each iteration, descend tree and find out which
cluster each node belongs to
● Can stop descending as soon as we find out that a node
belongs entirely to a particular cluster
● Use statistics stored at the nodes to compute new
cluster centers
Multi-instance learning
●
Simplicity-first methodology can be applied to
multi-instance learning with surprisingly good
results
●
Two simple approaches, both using standard
single-instance learners:
♦ Manipulate the input to learning
♦ Manipulate the output of learning
Aggregating the input
●
Convert multi-instance problem into single-instance one
♦ Summarize the instances in a bag by computing mean,
mode, minimum and maximum as new attributes
♦ “Summary” instance retains the class label of its bag
♦ To classify a new bag the same process is used
●
Results using summary instances with minimum and
maximum + support vector machine classifier are
comparable to special purpose multi-instance learners on
original drug discovery problem
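A minimal sketch of this "aggregate the input" idea, turning each bag into one summary instance built from per-attribute minima and maxima (the data structures are my own illustration):

import numpy as np

def summarize_bags(bags, labels):
    """bags: list of (n_i, k) arrays of instances; labels: one class per bag.
    Returns one summary instance per bag: [min_1..min_k, max_1..max_k]."""
    X = np.array([np.concatenate([b.min(axis=0), b.max(axis=0)]) for b in bags])
    y = np.array(labels)
    return X, y    # feed X, y to any single-instance learner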
Aggregating the output
●
Learn a single-instance classifier directly from the original
instances in each bag
♦ Each instance is given the class of the bag it originates from
●
To classify a new bag:
♦ Produce a prediction for each instance in the bag
♦ Aggregate the predictions to produce a prediction for the bag as
a whole
♦ One approach: treat predictions as votes for the various class
labels
♦ A problem: bags can contain differing numbers of instances →
give each instance a weight inversely proportional to the bag's
size
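A matching sketch of "aggregate the output": classify each instance in a new bag and let the predictions vote, weighting each vote by one over the bag size as suggested above (predict_instance stands for whatever single-instance classifier is used; it is an assumption, not a specific API):

from collections import defaultdict

def classify_bag(bag, predict_instance):
    """bag: list of instances; predict_instance: callable returning a class label."""
    votes = defaultdict(float)
    weight = 1.0 / len(bag)          # instances in large bags count for less
    for instance in bag:
        votes[predict_instance(instance)] += weight
    return max(votes, key=votes.get)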
Comments on basic methods
●
Bayes’ rule stems from his “Essay towards solving a
problem in the doctrine of chances” (1763)
♦ Difficult bit in general: estimating prior probabilities (easy in the
case of naïve Bayes)
●
Extension of naïve Bayes: Bayesian networks (which we'll
discuss later)
●
Algorithm for association rules is called APRIORI
●
Minsky and Papert (1969) showed that linear classifiers
have limitations, e.g. can’t learn XOR
♦ But: combinations of them can (→ multi-layer neural nets,
which we'll discuss later)