CS583 Supervised Learning
Supervised Learning
Road Map
Basic concepts
Evaluation of classifiers
Rule induction
K-nearest neighbor
Summary
An example application
age
Marital status
annual salary
outstanding debts
credit rating
etc.
k attributes: A_1, A_2, …, A_k.
Classes: Yes (approved), No (not approved)
Given
a data set D,
a task T, and
a performance measure M,
a computer system is said to learn from D to perform the task T if, after learning, the system's performance on T improves as measured by M.
Road Map
Basic concepts
Evaluation of classifiers
Rule induction
K-nearest neighbor
Summary
Introduction
Decision tree learning is widely used for classification, and it is very efficient.
Pr(c_j) is the probability of class c_j in data set D.

entropy(D) = -\sum_{j=1}^{|C|} \Pr(c_j) \log_2 \Pr(c_j)
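To make the entropy formula concrete, here is a minimal Python sketch (not from the slides; the function name and the 6/9 class split are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    """entropy(D) = -sum_j Pr(c_j) * log2(Pr(c_j)), with Pr(c_j) estimated by counting."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

# A 6/9 class split of 15 examples gives entropy of about 0.971.
print(entropy(["Yes"] * 6 + ["No"] * 9))
```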
Entropy measure: let us get a feeling
If we make attribute A_i, with v values, the root of the current tree, this will partition D into v subsets D_1, D_2, …, D_v. The expected entropy if A_i is used as the current root:

entropy_{A_i}(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \, entropy(D_j)
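A hedged Python sketch of the expected entropy above, plus the resulting information gain gain(D, A_i) = entropy(D) - entropy_{A_i}(D); the helper names are illustrative, not from the slides:

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def expected_entropy(attr_values, labels):
    """entropy_Ai(D) = sum_j |D_j|/|D| * entropy(D_j), partitioning D by the values of A_i."""
    parts = defaultdict(list)
    for v, y in zip(attr_values, labels):
        parts[v].append(y)
    total = len(labels)
    return sum(len(p) / total * entropy(p) for p in parts.values())

def information_gain(attr_values, labels):
    return entropy(labels) - expected_entropy(attr_values, labels)
```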
Information gain (cont.)

entropy_{Own\_house}(D) = \frac{|D_1|}{|D|} entropy(D_1) + \frac{|D_2|}{|D|} entropy(D_2)

entropy_{Age}(D) = \frac{5}{15} entropy(D_1) + \frac{5}{15} entropy(D_2) + \frac{5}{15} entropy(D_3)
= \frac{5}{15} \times 0.971 + \frac{5}{15} \times 0.971 + \frac{5}{15} \times 0.722
= 0.888
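A quick arithmetic check of the Age computation above:

```python
# 5/15 * 0.971 + 5/15 * 0.971 + 5/15 * 0.722 = 0.888 (to three decimals)
print(round(5/15 * 0.971 + 5/15 * 0.971 + 5/15 * 0.722, 3))
```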
We build the final tree
Attribute construction
Etc.
Road Map
Basic concepts
Evaluation of classifiers
Rule induction
K-nearest neighbor
Summary
Evaluating classification methods
Predictive accuracy
Efficiency
Interpretability
n-fold cross-validation: use each subset in turn as the test set and combine the remaining n-1 subsets as the training set to learn a classifier.
a training set,
a test set.
recall r = 1%, because we classified only one positive example correctly and no negative examples wrongly.
For the F_1-value to be large, both p and r must be large.
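A minimal sketch of precision, recall, and the F_1-value computed from counts (the variable names and the 1-of-100 scenario are illustrative):

```python
def precision_recall_f1(tp, fp, fn):
    """p = TP/(TP+FP), r = TP/(TP+FN), F1 = 2*p*r/(p+r)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# One positive classified correctly, no negatives classified wrongly,
# 99 positives missed: p = 100%, r = 1%, and F1 stays small because r is small.
print(precision_recall_f1(tp=1, fp=0, fn=99))
```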
Another evaluation method: Scoring and ranking
Road Map
Basic concepts
Evaluation of classifiers
Rule induction
K-nearest neighbor
Summary
Introduction
Yes.
Differences:
where each av_j is a condition (an attribute-value pair).
gain(R_0, R_1) = p_1 \times \left( \log_2 \frac{p_1}{p_1 + n_1} - \log_2 \frac{p_0}{p_0 + n_0} \right)
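Assuming the gain formula is the FOIL-style gain reconstructed above, a small Python sketch (the counts in the example call are made up):

```python
import math

def rule_gain(p0, n0, p1, n1):
    """gain(R0, R1) = p1 * (log2(p1/(p1+n1)) - log2(p0/(p0+n0)))."""
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

# Adding a condition that keeps 20 of 30 positives but drops most negatives:
print(rule_gain(p0=30, n0=70, p1=20, n1=5))
```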
Rule pruning in learn-one-rule-2
Discussions
Road Map
Basic concepts
Evaluation of classifiers
Rule induction
K-nearest neighbor
Summary
Association rules for classification
the confidence of r_i is greater than that of r_j, or
Road Map
Basic concepts
Evaluation of classifiers
Rule induction
K-nearest neighbor
Summary
Bayesian classification
Let A_1 through A_k be attributes with discrete values.
The class is C.
Pr(C = c_j) is the class prior probability: easy to estimate from the training data.
\Pr(C = c_j \mid A_1 = a_1, \ldots, A_{|A|} = a_{|A|})
= \frac{\Pr(A_1 = a_1, \ldots, A_{|A|} = a_{|A|} \mid C = c_j)\,\Pr(C = c_j)}{\Pr(A_1 = a_1, \ldots, A_{|A|} = a_{|A|})}
= \frac{\Pr(A_1 = a_1, \ldots, A_{|A|} = a_{|A|} \mid C = c_j)\,\Pr(C = c_j)}{\sum_{r=1}^{|C|} \Pr(A_1 = a_1, \ldots, A_{|A|} = a_{|A|} \mid C = c_r)\,\Pr(C = c_r)}
Computing probabilities
Formally, we assume
\Pr(A_1 = a_1 \mid A_2 = a_2, \ldots, A_{|A|} = a_{|A|}, C = c_j) = \Pr(A_1 = a_1 \mid C = c_j)
and so on for A_2 through A_{|A|}. I.e.,
\Pr(A_1 = a_1, \ldots, A_{|A|} = a_{|A|} \mid C = c_j) = \prod_{i=1}^{|A|} \Pr(A_i = a_i \mid C = c_j)
Final naïve Bayesian classifier
We are done!
\Pr(C = c_j \mid A_1 = a_1, \ldots, A_{|A|} = a_{|A|}) = \frac{\Pr(C = c_j) \prod_{i=1}^{|A|} \Pr(A_i = a_i \mid C = c_j)}{\sum_{r=1}^{|C|} \Pr(C = c_r) \prod_{i=1}^{|A|} \Pr(A_i = a_i \mid C = c_r)}
Classify a test instance
c = \arg\max_{c_j} \Pr(c_j) \prod_{i=1}^{|A|} \Pr(A_i = a_i \mid C = c_j)
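A minimal naive Bayes sketch for discrete attributes following the argmax rule above; the tiny data set is made up for illustration, and no smoothing is applied:

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Estimate Pr(c_j) and Pr(A_i = a_i | C = c_j) by simple counting."""
    class_counts = Counter(labels)
    cond_counts = defaultdict(Counter)        # (attribute index, class) -> value counts
    for row, c in zip(rows, labels):
        for i, a in enumerate(row):
            cond_counts[(i, c)][a] += 1
    return class_counts, cond_counts

def classify_nb(row, class_counts, cond_counts):
    """c = argmax_{c_j} Pr(c_j) * prod_i Pr(A_i = a_i | C = c_j)."""
    total = sum(class_counts.values())
    best, best_score = None, -1.0
    for c, nc in class_counts.items():
        score = nc / total
        for i, a in enumerate(row):
            score *= cond_counts[(i, c)][a] / nc
        if score > best_score:
            best, best_score = c, score
    return best

rows = [("m", "t"), ("m", "f"), ("g", "t"), ("g", "t")]
labels = ["yes", "no", "yes", "yes"]
model = train_nb(rows, labels)
print(classify_nb(("g", "t"), *model))        # "yes"
```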
An example
For C = t, we compute \Pr(C = t) \prod_{i=1}^{|A|} \Pr(A_i = a_i \mid C = t), and similarly for the other class; the class with the larger value is the prediction.
On the naïve Bayesian classifier
Advantages:
Easy to implement
Very efficient
Disadvantages
Road Map
Basic concepts
Evaluation of classifiers
Rule induction
K-nearest neighbor
Summary
Text classification/categorization
… K}, where \Pr(c_j \mid \Theta) is the mixture weight (or mixture probability) of mixture component j and \theta_j is the parameters of component j.

\Pr(d_i \mid \Theta) = \sum_{j=1}^{|C|} \Pr(c_j \mid \Theta)\,\Pr(d_i \mid c_j; \Theta)    (23)
Model text documents
\Pr(d_i \mid c_j; \Theta) = \Pr(|d_i|)\,|d_i|!\,\prod_{t=1}^{|V|} \frac{\Pr(w_t \mid c_j; \Theta)^{N_{ti}}}{N_{ti}!}    (24)

where \sum_{t=1}^{|V|} N_{ti} = |d_i| and \sum_{t=1}^{|V|} \Pr(w_t \mid c_j; \Theta) = 1.    (25)
Parameter estimation
\Pr(w_t \mid c_j; \hat\Theta) = \frac{\sum_{i=1}^{|D|} N_{ti}\,\Pr(c_j \mid d_i)}{\sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N_{si}\,\Pr(c_j \mid d_i)}    (26)

With smoothing to avoid zero probabilities:

\Pr(w_t \mid c_j; \hat\Theta) = \frac{\lambda + \sum_{i=1}^{|D|} N_{ti}\,\Pr(c_j \mid d_i)}{\lambda|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N_{si}\,\Pr(c_j \mid d_i)}    (27)
Parameter estimation (cont.)

\Pr(c_j \mid \hat\Theta) = \frac{\sum_{i=1}^{|D|} \Pr(c_j \mid d_i)}{|D|}    (28)
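A hedged sketch of the smoothed estimates (27) and the prior (28) for the fully labeled case, where Pr(c_j | d_i) is 1 for the document's own class and 0 otherwise; all names and the toy corpus are illustrative:

```python
from collections import Counter

def estimate_nb_text(docs, labels, vocab, lam=1.0):
    """Return class priors Pr(c_j) and smoothed word probabilities Pr(w_t | c_j)."""
    classes = sorted(set(labels))
    priors = {c: labels.count(c) / len(labels) for c in classes}          # Eq. (28)
    word_probs = {}
    for c in classes:
        counts = Counter()
        for doc, y in zip(docs, labels):
            if y == c:
                counts.update(doc)            # accumulates N_ti over documents of class c
        total = sum(counts.values())
        word_probs[c] = {w: (lam + counts[w]) / (lam * len(vocab) + total)
                         for w in vocab}      # Eq. (27)
    return priors, word_probs

docs = [["cheap", "buy", "now"], ["meeting", "agenda"], ["buy", "cheap"]]
labels = ["spam", "ham", "spam"]
vocab = {"cheap", "buy", "now", "meeting", "agenda"}
priors, word_probs = estimate_nb_text(docs, labels, vocab)
```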
Classification
\Pr(c_j \mid d_i; \hat\Theta) = \frac{\Pr(c_j \mid \hat\Theta)\,\Pr(d_i \mid c_j; \hat\Theta)}{\Pr(d_i \mid \hat\Theta)}
= \frac{\Pr(c_j \mid \hat\Theta) \prod_{k=1}^{|d_i|} \Pr(w_{d_i,k} \mid c_j; \hat\Theta)}{\sum_{r=1}^{|C|} \Pr(c_r \mid \hat\Theta) \prod_{k=1}^{|d_i|} \Pr(w_{d_i,k} \mid c_r; \hat\Theta)}
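A small sketch of the classification step, using log probabilities for numerical stability (an implementation choice, not something stated in the slides); the hard-coded parameters are made up and would normally come from the estimation step above:

```python
import math

def classify_text(doc, priors, word_probs):
    """argmax_{c_j}  log Pr(c_j) + sum_k log Pr(w_{d,k} | c_j); out-of-vocabulary words are skipped."""
    best, best_logp = None, float("-inf")
    for c, prior in priors.items():
        logp = math.log(prior)
        for w in doc:
            p = word_probs[c].get(w)
            if p:
                logp += math.log(p)
        if logp > best_logp:
            best, best_logp = c, logp
    return best

priors = {"spam": 2/3, "ham": 1/3}
word_probs = {
    "spam": {"cheap": 0.3, "buy": 0.3, "now": 0.2, "meeting": 0.1, "agenda": 0.1},
    "ham":  {"cheap": 0.1, "buy": 0.1, "now": 0.1, "meeting": 0.35, "agenda": 0.35},
}
print(classify_text(["buy", "cheap", "now"], priors, word_probs))   # "spam"
```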
Discussions
Road Map
Basic concepts
Evaluation of classifiers
Rule induction
K-nearest neighbor
Summary
Introduction
y_i = \begin{cases} 1 & \text{if } \langle \mathbf{w} \cdot \mathbf{x}_i \rangle + b \geq 0 \\ -1 & \text{if } \langle \mathbf{w} \cdot \mathbf{x}_i \rangle + b < 0 \end{cases}
The hyperplane
Let us compute d_+.
d_+ = \frac{|\langle \mathbf{w} \cdot \mathbf{x}_s \rangle + b|}{\|\mathbf{w}\|} = \frac{1}{\|\mathbf{w}\|}, where \mathbf{x}_s is a point on the margin hyperplane \langle \mathbf{w} \cdot \mathbf{x} \rangle + b = 1    (38)

margin = d_+ + d_- = \frac{2}{\|\mathbf{w}\|}    (39)
An optimization problem!
Definition (Linear SVM: separable case): Given a set of
linearly separable training examples,
D = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_r, y_r)\}
Learning is to solve the following constrained minimization problem.
The constraint y_i(\langle \mathbf{w} \cdot \mathbf{x}_i \rangle + b) \geq 1 summarizes
\langle \mathbf{w} \cdot \mathbf{x}_i \rangle + b \geq 1 for y_i = 1, and
\langle \mathbf{w} \cdot \mathbf{x}_i \rangle + b \leq -1 for y_i = -1.
Minimize: \frac{\langle \mathbf{w} \cdot \mathbf{w} \rangle}{2}
Subject to: y_i(\langle \mathbf{w} \cdot \mathbf{x}_i \rangle + b) \geq 1, \quad i = 1, 2, \ldots, r    (40)
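In practice the quadratic program in (40) is rarely solved by hand. A minimal sketch using scikit-learn's SVC (assuming scikit-learn is installed; the toy data are made up, and a large C approximates the hard-margin, separable case):

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data.
X = np.array([[1.0, 1.0], [2.0, 2.5], [0.5, 2.0],
              [4.0, 4.5], [5.0, 4.0], [4.5, 5.5]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)    # large C ~ hard margin
w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("margin =", 2.0 / np.linalg.norm(w))     # Eq. (39): margin = 2 / ||w||
print(clf.predict([[1.0, 2.0], [5.0, 5.0]]))   # [-1  1]
```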
Solve the constrained minimization
Standard Lagrangian method
where \alpha_i \geq 0 are the Lagrange multipliers.

L_P = \frac{1}{2}\langle \mathbf{w} \cdot \mathbf{w} \rangle - \sum_{i=1}^{r} \alpha_i \left[ y_i(\langle \mathbf{w} \cdot \mathbf{x}_i \rangle + b) - 1 \right]    (41)
Kuhn-Tucker conditions
These points are called the support vectors. All the other parameters \alpha_i = 0.
Solve the problem
\langle \mathbf{w} \cdot \mathbf{x} \rangle + b = \sum_{i \in sv} y_i \alpha_i \langle \mathbf{x}_i \cdot \mathbf{x} \rangle + b = 0    (57)

sign(\langle \mathbf{w} \cdot \mathbf{z} \rangle + b) = sign\left( \sum_{i \in sv} y_i \alpha_i \langle \mathbf{x}_i \cdot \mathbf{z} \rangle + b \right)    (58)
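A small sketch of the classification rule (58); the support vectors, multipliers, and bias below are a hand-built consistent example, not output from a solver:

```python
import numpy as np

# Two support vectors x1 = (1, 1) with y = -1 and x2 = (3, 3) with y = +1;
# solving the hard-margin conditions gives alpha = (0.25, 0.25) and b = -2.
sv_x = np.array([[1.0, 1.0], [3.0, 3.0]])
sv_y = np.array([-1.0, 1.0])
alpha = np.array([0.25, 0.25])
b = -2.0

def classify(z):
    """sign( sum_{i in sv} y_i * alpha_i * <x_i . z> + b )  -- Eq. (58)."""
    return np.sign(np.sum(sv_y * alpha * (sv_x @ z)) + b)

print(classify(np.array([0.0, 1.0])), classify(np.array([4.0, 3.0])))   # -1.0 1.0
```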
Linear SVM: Non-separable case
Minimize: \frac{\langle \mathbf{w} \cdot \mathbf{w} \rangle}{2} + C \sum_{i=1}^{r} \xi_i^k    (60)
New optimization problem
Minimize: \frac{\langle \mathbf{w} \cdot \mathbf{w} \rangle}{2} + C \sum_{i=1}^{r} \xi_i
Subject to: y_i(\langle \mathbf{w} \cdot \mathbf{x}_i \rangle + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, 2, \ldots, r    (61)

L_P = \frac{1}{2}\langle \mathbf{w} \cdot \mathbf{w} \rangle + C\sum_{i=1}^{r}\xi_i - \sum_{i=1}^{r}\alpha_i \left[ y_i(\langle \mathbf{w} \cdot \mathbf{x}_i \rangle + b) - 1 + \xi_i \right] - \sum_{i=1}^{r}\mu_i \xi_i    (62)
Kuhn-Tucker conditions
From primal to dual
Interestingly, \xi_i and its Lagrange multipliers \mu_i are not in the dual. The objective function is identical to that for the separable case.
The resulting \alpha_i values are then used to compute \mathbf{w} and b. \mathbf{w} is computed using Equation (63) and b is computed using the Kuhn-Tucker complementarity conditions (70) and (71).
b = \frac{1}{y_j} - \sum_{i=1}^{r} y_i \alpha_i \langle \mathbf{x}_i \cdot \mathbf{x}_j \rangle    (73)
(65), (70) and (71) in fact tell us more
\langle \mathbf{w} \cdot \mathbf{x} \rangle + b = \sum_{i=1}^{r} y_i \alpha_i \langle \mathbf{x}_i \cdot \mathbf{x} \rangle + b    (75)
How to deal with nonlinear separation?
\phi: X \to F    (76)
\mathbf{x} \mapsto \phi(\mathbf{x})    (77)
Geometric interpretation
Polynomial kernel
K(\mathbf{x}, \mathbf{z}) = \langle \mathbf{x} \cdot \mathbf{z} \rangle^d
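A small sketch of the kernel trick for the polynomial kernel with d = 2: the kernel value equals the dot product of explicitly mapped features, so the feature mapping never has to be formed. The degree-2 feature map shown is one standard choice for 2-dimensional inputs:

```python
import numpy as np

def poly_kernel(x, z, d=2):
    """K(x, z) = <x . z>^d."""
    return float(np.dot(x, z)) ** d

def phi(x):
    """Explicit degree-2 feature map for 2-d input: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])
print(poly_kernel(x, z))                # 121.0
print(float(np.dot(phi(x), phi(z))))    # 121.0 -- same value, demonstrating the equality
```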
Road Map
Basic concepts
Evaluation of classifiers
Rule induction
K-nearest neighbor
Summary
k-Nearest Neighbor Classification (kNN)
Estimate \Pr(c_j \mid d) as n/k.
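A minimal kNN sketch matching the Pr(c_j | d) ≈ n/k estimate above; the distance function (Euclidean) and the toy data are illustrative:

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=3):
    """Find the k nearest training points and estimate Pr(c_j | x) as n_j / k."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[i] for i in nearest)
    probs = {c: n / k for c, n in votes.items()}     # Pr(c_j | x) = n_j / k
    return max(probs, key=probs.get), probs

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [1.2, 0.8]])
y_train = ["neg", "neg", "pos", "pos", "pos"]
print(knn_classify(np.array([1.0, 0.9]), X_train, y_train, k=3))   # ('pos', {'pos': 1.0})
```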
Road Map
Basic concepts
Evaluation of classifiers
Rule induction
K-nearest neighbor
Summary
Summary
Bayesian networks
Neural networks
Genetic algorithms
Fuzzy classification
This large number of methods also shows the importance of classification and its wide applicability.