Chapter 4. Classification and Prediction
■ Typical applications
■ Medical diagnosis
■ Fraud detection
Classification—A Two-Step Process
■ Model construction: describing a set of predetermined classes
■ Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
■ Model usage: for classifying future or unknown objects
■ Estimate the accuracy of the model
■ The known label of each test sample is compared with the classified result from the model
■ Accuracy rate is the percentage of test set samples that are correctly classified by the model
■ Test set is independent of training set, otherwise over-fitting will occur
■ If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
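Below is a minimal code sketch of this two-step process (an illustration, not part of the original slides). It assumes scikit-learn is installed; the tiny dataset is made up purely for demonstration.

```python
# Sketch of the two-step classification process (assumes scikit-learn is installed).
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Illustrative toy data: each row is a tuple of attribute values, y holds class labels.
X = [[2, 0], [7, 0], [5, 1], [7, 1], [3, 0], [6, 1]]
y = ["no", "no", "yes", "yes", "no", "yes"]

# Step 1: model construction on the training set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: model usage -- estimate accuracy on the independent test set,
# then classify unseen data if the accuracy is acceptable.
accuracy = accuracy_score(y_test, model.predict(X_test))
print("accuracy on test set:", accuracy)
print("prediction for unseen tuple [4, 1]:", model.predict([[4, 1]]))
```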
[Figure: Process (1) Model construction: the training data below are fed to a classification algorithm, which produces a classifier. Process (2) Model usage: the classifier is applied to testing data and then to unseen data such as (Jeff, Professor, 4) → Tenured?]
NAME    RANK            YEARS  TENURED
Tom     Assistant Prof  2      no
Merlisa Associate Prof  7      no
George  Professor       5      yes
Joseph  Assistant Prof  7      yes
Supervised vs. Unsupervised Learning
■ Data cleaning
■ Preprocess data in order to reduce noise and handle
missing values
■ Relevance analysis (feature selection)
■ Remove the irrelevant or redundant attributes
■ Speed
■ time to construct the model (training time)
■ time to use the model (classification/prediction time)
■ Robustness: handling noise and missing values
■ Scalability: efficiency in disk-resident databases
■ Interpretability
■ understanding and insight provided by the model
■ Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules
Decision Tree Induction: Training Dataset
age income student credit_rating buys_cloths
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Output: A Decision Tree for “ buys_cloths”
[Figure: decision tree with root node age; branch “<=30” tests student (no → no, yes → yes); branch “31…40” → yes; branch “>40” tests credit_rating (excellent → no, fair → yes)]
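A hedged sketch of how a tree for this training data could be fitted in code (not part of the slides). It assumes scikit-learn and pandas are installed; the categorical attributes are one-hot encoded, which is an implementation choice, and scikit-learn uses a CART-style procedure, so the resulting tree may differ in detail from the ID3-style tree shown above.

```python
# Sketch: fit a decision tree to the buys_cloths training data (assumes scikit-learn, pandas).
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

rows = [
    ("<=30", "high", "no", "fair", "no"),       ("<=30", "high", "no", "excellent", "no"),
    ("31...40", "high", "no", "fair", "yes"),   (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),       (">40", "low", "yes", "excellent", "no"),
    ("31...40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),      (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31...40", "medium", "no", "excellent", "yes"),
    ("31...40", "high", "yes", "fair", "yes"),  (">40", "medium", "no", "excellent", "no"),
]
df = pd.DataFrame(rows, columns=["age", "income", "student", "credit_rating", "buys_cloths"])

# One-hot encode the categorical attributes so the library can split on them.
X = pd.get_dummies(df[["age", "income", "student", "credit_rating"]])
y = df["buys_cloths"]

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```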
Algorithm for Decision Tree Induction
■ Basic algorithm (a greedy algorithm)
■ Tree is constructed in a top-down recursive divide-and-conquer
manner
■ At start, all the training examples are at the root
■ Attributes are categorical (if continuous-valued, they are
discretized in advance)
■ Examples are partitioned recursively based on selected attributes
■ Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
■ Conditions for stopping partitioning
■ All samples for a given node belong to the same class
■ There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
■ There are no samples left
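A from-scratch sketch of this greedy top-down procedure, using only the Python standard library. The helper names (entropy, info_gain, build_tree) are illustrative, not from the slides; attributes are assumed categorical, as stated above.

```python
# ID3-style top-down recursive divide-and-conquer tree induction (illustrative sketch).
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, labels, attributes, default=None):
    if not rows:                        # no samples left
        return default
    if len(set(labels)) == 1:           # all samples belong to the same class
        return labels[0]
    if not attributes:                  # no remaining attributes: majority voting
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(rows, labels, a))   # heuristic selection
    majority = Counter(labels).most_common(1)[0][0]
    node = {}
    for value in set(row[best] for row in rows):                       # recursive partitioning
        sub = [(row, lab) for row, lab in zip(rows, labels) if row[best] == value]
        sub_rows, sub_labels = [r for r, _ in sub], [l for _, l in sub]
        remaining = [a for a in attributes if a != best]
        node[(best, value)] = build_tree(sub_rows, sub_labels, remaining, majority)
    return node

# Usage: rows is a list of dicts, e.g. {"age": "<=30", "student": "no", ...},
# labels the matching class values; the result nests {(attribute, value): subtree}.
```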
Tree construction general algorithm
[Figure: two candidate attributes for splitting S = [9+, 5−]: A1 = humidity (branches high / normal) and A2 = wind (branches weak / strong)]
Entropy
• Given a collection S of positive and negative objects, p⁺ is the proportion of positive objects in S and p⁻ is the proportion of negative objects in S
• In the play-tennis example, these numbers are 14, 9/14 and 5/14, respectively
• Entropy is defined as follows:
Entropy(S) = − p⁺ log2(p⁺) − p⁻ log2(p⁻)
Example: Training data for concept “play-tennis”
From 14 examples of Play-Tennis, 9 positive and 5 negative objects (denoted by [9+, 5−])
Entropy([9+, 5−]) = − (9/14) log2(9/14) − (5/14) log2(5/14) = 0.940
Notice:
1. Entropy is 0 if all members of S belong to the same
class
2. Entropy is 1 if the collection contains an equal number
of positive and negative examples. If these numbers are
unequal, the entropy is between 0 and 1.
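A small numeric check of the definition and of the two notices above (a sketch, standard library only):

```python
from math import log2

def entropy(pos, neg):
    """Entropy of a collection with `pos` positive and `neg` negative objects."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:  # 0 * log2(0) is taken as 0
            p = count / total
            result -= p * log2(p)
    return result

print(round(entropy(9, 5), 3))   # 0.94  (the [9+, 5-] play-tennis collection)
print(entropy(7, 0))             # 0.0   (all members in the same class)
print(entropy(7, 7))             # 1.0   (equal numbers of positive and negative)
```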
Information gain measures the
expected reduction in entropy
We define a measure, called information gain, of the
effectiveness of an attribute in classifying data. It is the expected
reduction in entropy caused by partitioning the objects according
to this attribute
Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|Sv| / |S|) · Entropy(Sv)
where Values(A) is the set of all possible values for attribute A, and Sv is the subset of S for which A has value v.
Information gain measures the expected reduction in entropy
Example: for the attribute Wind, S = [9+, 5−], S_weak = [6+, 2−] (8 examples) and S_strong = [3+, 3−] (6 examples):
Gain(S, Wind) = Entropy(S) − Σ_{v ∈ {weak, strong}} (|Sv| / |S|) · Entropy(Sv)
= Entropy(S) − (8/14) · Entropy(S_weak) − (6/14) · Entropy(S_strong)
= 0.940 − (8/14) × 0.811 − (6/14) × 1.0 = 0.048
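The same computation in code (a sketch, standard library only; the [6+, 2−] / [3+, 3−] split for weak/strong wind is read off the play-tennis data):

```python
from math import log2

def entropy(pos, neg):
    total = pos + neg
    return -sum((c / total) * log2(c / total) for c in (pos, neg) if c)

# S = [9+, 5-]; Wind splits it into S_weak = [6+, 2-] and S_strong = [3+, 3-].
gain_wind = entropy(9, 5) - (8 / 14) * entropy(6, 2) - (6 / 14) * entropy(3, 3)
print(round(gain_wind, 3))  # 0.048
```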
Which attribute is the best classifier?
[Figure: splitting S = [9+, 5−] (E = 0.940) on Humidity (high / normal) versus Wind (weak / strong)]
Gain(S, Outlook) = 0.246
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
Gain(S, Temperature) = 0.029
Next step in growing the decision tree
Attributes with many values
■ convertible to simple and easy to understand classification rules
■ can use SQL queries for accessing databases
Bayesian Classification
■ Bayes’ theorem: P(Ci | X) = P(X | Ci) · P(Ci) / P(X)
■ Since P(X) is constant for all classes, only P(Ci | X) ∝ P(X | Ci) · P(Ci) needs to be maximized
Derivation of Naïve Bayes
Classifier
■ A simplified assumption: attributes are conditionally
independent (i.e., no dependence relation between
attributes): n
P(X | Ci) = Π_{k=1}^{n} P(xk | Ci) = P(x1 | Ci) × P(x2 | Ci) × … × P(xn | Ci)
■ This greatly reduces the computation cost: only counts the class distribution
■ If Ak is categorical, P(xk | Ci) is the # of tuples in Ci having value xk for Ak divided by |Ci,D| (# of tuples of Ci in D)
■ If Ak is continuous-valued, P(xk | Ci) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:
g(x, μ, σ) = (1 / (√(2π) · σ)) · e^(−(x − μ)² / (2σ²))
and P(xk | Ci) = g(xk, μCi, σCi)
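A hedged sketch of the continuous-attribute case: estimate μ and σ of an attribute within a class and evaluate the Gaussian density g(x, μ, σ). The sample numbers below are illustrative only, not from the slides.

```python
from math import exp, pi, sqrt

def gaussian(x, mu, sigma):
    """g(x, mu, sigma) = 1 / (sqrt(2*pi)*sigma) * exp(-(x - mu)^2 / (2*sigma^2))"""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

# Illustrative: ages of the tuples belonging to class Ci.
ages_in_class = [25, 28, 31, 35, 38, 42]
mu = sum(ages_in_class) / len(ages_in_class)
sigma = sqrt(sum((a - mu) ** 2 for a in ages_in_class) / len(ages_in_class))

# P(age = 30 | Ci) is approximated by the Gaussian density at x = 30.
print(gaussian(30, mu, sigma))
```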
Naïve Bayesian Classifier: Training Dataset
■ P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643, P(buys_computer = “no”) = 5/14 = 0.357
■ Compute P(X|Ci) for each class:
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
■ X = (age <= 30, income = medium, student = yes, credit_rating = fair)
■ P(X | buys_computer = “yes”) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X | buys_computer = “no”) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019
■ P(X | Ci) · P(Ci): 0.044 × 0.643 = 0.028 for “yes”, 0.019 × 0.357 = 0.007 for “no”
■ Therefore, X belongs to class “buys_computer = yes”
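The same decision as straightforward arithmetic (a sketch, not from the slides; the probabilities are the ones read off the training data above):

```python
# Naive Bayes decision for X = (age<=30, income=medium, student=yes, credit_rating=fair).
prior = {"yes": 9 / 14, "no": 5 / 14}
cond = {
    "yes": {"age<=30": 2 / 9, "income=medium": 4 / 9, "student=yes": 6 / 9, "credit=fair": 6 / 9},
    "no":  {"age<=30": 3 / 5, "income=medium": 2 / 5, "student=yes": 1 / 5, "credit=fair": 2 / 5},
}

score = {}
for c in ("yes", "no"):
    p = prior[c]
    for attr_value in cond[c].values():
        p *= attr_value          # class-conditional independence assumption
    score[c] = p

print(score)                      # roughly {'yes': 0.028, 'no': 0.007}
print(max(score, key=score.get))  # 'yes'
```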
Naïve Bayesian Classifier:
Comments
■ Advantages
■ Easy to implement
■ Good results obtained in most of the cases
■ Disadvantages
■ Assumption: class conditional independence, therefore loss of accuracy
■ Practically, dependencies exist among variables
■ E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
■ Dependencies among these cannot be modeled by a Naïve Bayesian Classifier
■ How to deal with these dependencies?
■ Bayesian Belief Networks
Naive Bayesian Classifier
Example
Outlook Temperature Humidity Windy Class
sunny hot high false N
sunny hot high true N
overcast hot high false P
rain mild high false P
rain cool normal false P
rain cool normal true N
overcast cool normal true P
sunny mild high false N
sunny cool normal false P
rain mild normal false P
sunny mild normal true P
overcast mild high true P
overcast hot normal false P
rain mild high true N
■ Classifying an unseen sample X = <sunny, cool, high, false>:
P(X | P) · P(P) = 9/14 · 2/9 · 3/9 · 3/9 · 6/9 ≈ 0.0106
P(X | N) · P(N) = 5/14 · 3/5 · 1/5 · 4/5 · 2/5 ≈ 0.0137
■ Since P(X | N) · P(N) > P(X | P) · P(P), X is classified as class N (don’t play)
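A counting-based sketch of the same computation (not from the slides): the priors and conditionals are derived directly from the table above by counting.

```python
# Sketch: derive the conditional probabilities by counting over the play-tennis table,
# then score the unseen sample X (attribute order: Outlook, Temperature, Humidity, Windy).
from collections import Counter

data = [
    ("sunny", "hot", "high", "false", "N"), ("sunny", "hot", "high", "true", "N"),
    ("overcast", "hot", "high", "false", "P"), ("rain", "mild", "high", "false", "P"),
    ("rain", "cool", "normal", "false", "P"), ("rain", "cool", "normal", "true", "N"),
    ("overcast", "cool", "normal", "true", "P"), ("sunny", "mild", "high", "false", "N"),
    ("sunny", "cool", "normal", "false", "P"), ("rain", "mild", "normal", "false", "P"),
    ("sunny", "mild", "normal", "true", "P"), ("overcast", "mild", "high", "true", "P"),
    ("overcast", "hot", "normal", "false", "P"), ("rain", "mild", "high", "true", "N"),
]

class_counts = Counter(row[-1] for row in data)

def score(x):
    result = {}
    for c, n_c in class_counts.items():
        p = n_c / len(data)                            # prior P(Ci)
        for k, value in enumerate(x):                  # product of P(xk | Ci)
            n_match = sum(1 for row in data if row[-1] == c and row[k] == value)
            p *= n_match / n_c
        result[c] = p
    return result

print(score(("sunny", "cool", "high", "false")))       # N scores higher than P here
```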
Rule Induction: Sequential Covering Method
■ Each time a rule is learned, the tuples covered by the rule are removed
■ The process repeats on the remaining tuples until a termination condition holds, e.g., when there are no more training examples or when the quality of a rule returned is below a user-specified threshold
■ Compare with decision-tree induction, which learns a set of rules simultaneously
How to Learn-One-Rule?
■ Start with the most general rule possible: condition = empty
■ Add new attributes by adopting a greedy depth-first strategy
■ Picks the one that most improves the rule quality
■ Rule-Quality measures: consider both coverage and accuracy
■ Foil-gain (in FOIL & RIPPER): assesses info_gain by extending
condition
FOIL_Gain = pos′ × ( log2( pos′ / (pos′ + neg′) ) − log2( pos / (pos + neg) ) )
where pos (neg) and pos′ (neg′) are the numbers of positive (negative) tuples covered by the rule before and after adding the new condition
It favors rules that have high accuracy and cover many positive tuples
■ Rule pruning based on an independent set of test tuples
FOIL_Prune(R) = (pos − neg) / (pos + neg)
where pos (neg) is the number of positive (negative) tuples covered by R; if FOIL_Prune is higher for the pruned version of R, then prune R
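A small sketch of the two measures (standard library only); the counts used in the example call are illustrative, not from the slides.

```python
from math import log2

def foil_gain(pos, neg, pos_new, neg_new):
    """FOIL_Gain = pos' * (log2(pos'/(pos'+neg')) - log2(pos/(pos+neg)))."""
    return pos_new * (log2(pos_new / (pos_new + neg_new)) - log2(pos / (pos + neg)))

def foil_prune(pos, neg):
    """FOIL_Prune(R) = (pos - neg) / (pos + neg)."""
    return (pos - neg) / (pos + neg)

# Illustrative counts: extending the rule keeps 6 of 8 positives and drops 10 of 12 negatives.
print(foil_gain(pos=8, neg=12, pos_new=6, neg_new=2))
print(foil_prune(pos=6, neg=2))
```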
■ Classification: predicts categorical class labels
■ Example: x1 = # of occurrences of the word “homepage”, x2 = # of occurrences of the word “welcome”
■ Mathematically: x ∈ X = ℝⁿ, y ∈ Y = {+1, −1}
■ We want to learn a function f: X → Y
Linear Classification
■ Binary classification problem
[Figure: 2-D scatter plot of ‘x’ and ‘o’ points separated by a red line]
■ The data above the red line belongs to class ‘x’
■ The data below the red line belongs to class ‘o’
■ Examples: SVM, Perceptron, Probabilistic Classifiers
Discriminative Classifiers
■ Advantages
■ prediction accuracy is generally high
■ as compared to Bayesian methods – in general
■ robust, works when training examples contain errors
■ fast evaluation of the learned target function
■ Bayesian networks are normally slow
■ Criticism
■ long training time
■ difficult to understand the learned function (weights)
■ Bayesian networks can be used easily for pattern discovery
■ not easy to incorporate domain knowledge
■ easy in the form of priors on the data or distributions
Perceptron & Winnow
• Vector: x, w;  Scalar: x, y, w
• Classification function f(x): f(xi) > 0 for yi = +1 and f(xi) < 0 for yi = −1
• Decision boundary: w · x + b = 0, i.e., w1x1 + w2x2 + b = 0 in two dimensions
• Perceptron: update W additively
• Winnow: update W multiplicatively
[Figure: 2-D plot (axes x1, x2) of the separating line w1x1 + w2x2 + b = 0]
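A hedged sketch contrasting the two update rules on a misclassified example (x, y): the perceptron adds to W, winnow scales it multiplicatively. The learning rate and the winnow promotion factor below are illustrative choices, not taken from the slides.

```python
# Additive (perceptron) vs. multiplicative (winnow) weight updates -- illustrative sketch.

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

def perceptron_update(w, b, x, y, eta=1.0):
    """On a mistake, move w additively toward the correct side: w <- w + eta*y*x."""
    if predict(w, b, x) != y:
        w = [wi + eta * y * xi for wi, xi in zip(w, x)]
        b = b + eta * y
    return w, b

def winnow_update(w, b, x, y, alpha=2.0):
    """On a mistake, scale each weight multiplicatively: w_i <- w_i * alpha^(y*x_i)."""
    if predict(w, b, x) != y:
        w = [wi * (alpha ** (y * xi)) for wi, xi in zip(w, x)]
    return w, b

w, b = [0.5, 0.5], 0.0
w, b = perceptron_update(w, b, x=[1.0, 0.0], y=-1)
print(w, b)  # [-0.5, 0.5] -1.0
```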
Classification by Backpropagation
[Figure: a single neuron: input vector x = (x0, x1, …, xn), weight vector w = (w0, w1, …, wn), weighted sum with bias −μk, activation function f, output y]
For example: y = sign( Σ_{i=0}^{n} wi xi − μk )
■ The n-dimensional input vector x is mapped into variable y by means of the scalar product and a nonlinear function mapping
Neural Networks
What are they?
Based on early research aimed at representing the
way the human brain works
Neural networks are composed of many processing
units called neurons
Neural Networks are great, but..
Problem 1: The black box model!
Solution 1: Do we really need to know?
Solution 2. Rule Extraction techniques
Neural Network Concepts
Neural computing
Artificial neural network (ANN)
[Figure: biological neuron with dendrites, soma, and axon; synapses connect the axon of one neuron to the dendrites of another]
[Figure: ANN model: inputs x1 … xn weighted by w1 … wn feed a summation and a transfer function, producing outputs Y1 … Yn]
Three-step process:
1. Compute temporary outputs
2. Compare outputs with desired targets
3. Adjust the weights and repeat the process
[Flowchart: Compute output → Is desired output achieved? If No, adjust weights and repeat; if Yes, stop learning]
How a Network Learns
Learning parameters:
Learning rate
Momentum
Backpropagation Learning
[Figure: a neuron with inputs x1 … xn, weights w1 … wn, a summation and a transfer function; the weights are adjusted using the error term a(Zi − Yi), where Zi is the desired output and Yi the actual output]
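A minimal sketch of one weight-adjustment step for a single output unit, in the spirit of the error term a(Zi − Yi) above. Treating a as a learning rate and using a sigmoid transfer function are assumptions of this sketch, not statements from the slides.

```python
from math import exp

def sigmoid(v):
    return 1.0 / (1.0 + exp(-v))

def train_step(w, x, z, a=0.1):
    """One learning step: compute output y, compare with target z, adjust weights by a*(z - y)*x."""
    y = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))         # 1. compute temporary output
    error = z - y                                             # 2. compare with desired target
    return [wi + a * error * xi for wi, xi in zip(w, x)], y   # 3. adjust the weights

w = [0.2, -0.4, 0.1]
x = [1.0, 0.5, 1.0]          # inputs (the first could be a bias input)
w, y = train_step(w, x, z=1.0)
print(y, w)
```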
Let data D be (X1, y1), …, (X|D|, y|D|), where Xi is the set of training tuples
associated with the class labels yi
There are infinite lines (hyperplanes) separating the two classes but we want to
find the best one (the one that minimizes classification error on unseen data)
SVM searches for the hyperplane with the largest margin, i.e., maximum
marginal hyperplane (MMH)
SVM—Linearly Separable
■ A separating hyperplane can be written as W ● X + b = 0, where W = {w1, w2, …, wn} is a weight vector and b a scalar (bias)
■ For 2-D it can be written as w0 + w1 x1 + w2 x2 = 0
■ The hyperplanes defining the sides of the margin:
H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
H2: w0 + w1 x1 + w2 x2 ≤ −1 for yi = −1
■ Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin) are support vectors
■ This becomes a constrained (convex) quadratic optimization problem: quadratic objective function and linear constraints ► Quadratic Programming ► Lagrangian multipliers
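A sketch with scikit-learn (assumed installed, not part of the slides): fit a linear SVM on a tiny linearly separable set and inspect the support vectors that define the maximum-margin hyperplane W ● X + b = 0. The toy points and the choice of a large C (to approximate a hard margin) are illustrative.

```python
# Linear SVM sketch (assumes scikit-learn and numpy are installed).
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e3).fit(X, y)   # large C approximates a hard margin

print("W =", clf.coef_[0], " b =", clf.intercept_[0])   # hyperplane W.X + b = 0
print("support vectors:", clf.support_vectors_)          # tuples lying on H1 / H2
```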
Why Is SVM Effective on High Dimensional Data?
■ The complexity of the trained classifier is characterized by the # of support vectors rather than the dimensionality of the data
■ The support vectors are the essential or critical training examples — they lie closest to the decision boundary (MMH)
■ If all other training examples are removed and the training is repeated, the same separating hyperplane would be found
■ The number of support vectors found can be used to compute an (upper) bound on the expected error rate of the SVM classifier, which is independent of the data dimensionality
■ Thus, an SVM with a small number of support vectors can have good
generalization, even when the dimensionality of the data is high
SVM—Linearly Inseparable
[Figure: linearly inseparable data in the original 2-D input space (axes A1, A2)]
■ Transform the original input data into a higher dimensional space, then search for a linear separating hyperplane in the new space
■ SVM can also be used for classifying multiple (> 2) classes and for regression analysis (with additional user parameters)
Scaling SVM by Hierarchical Micro-Clustering
Prediction
■ Linear regression: Y = w0 + w1 X, with coefficients estimated by the least squares method:
w1 = Σ_{i=1}^{|D|} (xi − x̄)(yi − ȳ) / Σ_{i=1}^{|D|} (xi − x̄)² ,  w0 = ȳ − w1 x̄
■ Non-linear regression
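A sketch of the least-squares estimate above (standard library only; the (x, y) pairs are illustrative, not from the slides):

```python
# Least-squares fit of Y = w0 + w1*X -- a sketch of the formula above.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1]

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

w1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
w0 = y_bar - w1 * x_bar

print("w1 =", w1, "w0 =", w0)
print("prediction for x = 6:", w0 + w1 * 6)
```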