Classification

☞ Example: classify customers as non-profitable (N) or profitable (P) with a decision tree.

Training data:

  Name    Gender  Age
  Bob     M       20
  John    M       45
  Dave    M       25
  Marthe  F       27
  Kathy   F       40
  Kimi    M       35
  Tod     M       50

Decision tree: split on Gender, then on Age
(Gender = F: Age < 40 → N, Age ≥ 40 → P; Gender = M: Age < 30 → N, Age ≥ 30 → P)

Classified data:

  Name    Gender  Age  Prof.
  Bob     M       20   N
  John    M       45   P
  Dave    M       25   N
  Marthe  F       27   N
  Kathy   F       40   P
  Kimi    M       35   P
  Tod     M       50   P
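☞ A minimal Python sketch of this decision tree; the split points 30 and 40 are read off the reconstructed tree above, and the function name is illustrative:

    # Training tuples: (name, gender, age)
    customers = [
        ("Bob", "M", 20), ("John", "M", 45), ("Dave", "M", 25),
        ("Marthe", "F", 27), ("Kathy", "F", 40), ("Kimi", "M", 35),
        ("Tod", "M", 50),
    ]

    def classify(gender, age):
        """Decision tree: split on gender, then on age."""
        if gender == "F":
            return "N" if age < 40 else "P"   # F branch splits at 40
        return "N" if age < 30 else "P"       # M branch splits at 30

    for name, gender, age in customers:
        print(name, gender, age, classify(gender, age))

Applied to the training data this reproduces the classified table above.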
☞ Outline:

1 – Classification
2 – Linear Regression in Classification
3 – Bayesian Classification
4 – Distance Based Classification
5 – Questions?
Classification (Definition)

☞ Definition 0.1 Given a database D = {t1, t2, ..., tn} and a set of classes C = {C1, C2, ..., Cm}, the classification problem is to define the mapping f : D → C, where each tuple t is assigned to exactly one class.

☞ The classes:
• are pre-defined and known in advance
• are non-overlapping
• partition the whole database (domain)

☞ Main strategies to classify data:
• Specify the boundaries of the domain
• With the help of probability distributions: P(ti ∧ Cj) = P(ti | Cj) P(Cj)
• With the help of posterior probabilities: P(Cj | ti) (see the sketch after this list)
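☞ A small numeric sketch of the probabilistic strategies. The priors 3/7 and 4/7 are the class frequencies in the example table; the likelihood values are made up for illustration:

    # Hypothetical likelihoods P(ti | Cj) for some tuple ti (illustrative values)
    prior = {"N": 3 / 7, "P": 4 / 7}        # P(Cj) from the example table
    likelihood = {"N": 0.20, "P": 0.05}     # P(ti | Cj), assumed

    # Joint probabilities P(ti ∧ Cj) = P(ti | Cj) * P(Cj)
    joint = {c: likelihood[c] * prior[c] for c in prior}

    # Posteriors P(Cj | ti) = P(ti ∧ Cj) / P(ti), with P(ti) = sum of the joints
    evidence = sum(joint.values())
    posterior = {c: joint[c] / evidence for c in joint}
    print(posterior)  # classify ti into the class with the largest posterior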
Classification (Missing Values)

☞ There are three main strategies to deal with missing values:
• Ignore the missing values
• Predict the missing values
• Classify the missing values separately
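☞ A minimal sketch of the second strategy (predict the missing values), assuming simple mean imputation of the Age attribute; mean imputation is one of many possible predictors, and the helper name is illustrative:

    def impute_age(ages):
        """Replace missing ages (None) with the mean of the observed ages."""
        observed = [age for age in ages if age is not None]
        mean_age = sum(observed) / len(observed)
        return [age if age is not None else mean_age for age in ages]

    print(impute_age([20, 45, None, 27, 40, None, 50]))  # None -> 36.4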
Classification (Quality Metrics)

☞ True Positive: f(ti) ∈ Cj and ti ∈ Cj
☞ False Negative: f(ti) ∉ Cj but ti ∈ Cj (we miss data)
☞ True Negative: f(ti) ∉ Cj and ti ∉ Cj
☞ False Positive: f(ti) ∈ Cj but ti ∉ Cj (we get too much data)
Linear Regression in Classification

☞ Complications in the data:
• outliers
• noise

☞ Sample mean: X̄ = (X1 + X2 + · · · + Xn)/n
Linear Regression in Classification (Example 3)

☞ Linear regression (a line) can be used to separate two classes
☞ Let's use f(x) = ax + b + ε for our classification problem
• The general formulas for the coefficients:

  a = cov(X, Y) / σ²_X
  b = Ȳ − a X̄

• In our case: a = 0.04, b = −0.7
☞ All x for which f(x) ≥ 1/2 are classified as profitable customers (see the sketch after the plot)
[Two plots: the fitted line f(x) = 0.04x − 0.7 over ages 15 to 55, vertical axis from −1 to 2]
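☞ A sketch of the Example 3 pipeline on the customer table from the first slide, assuming labels y = 1 for profitable and y = 0 otherwise; on this data the fitted coefficients come out near, though not exactly at, a = 0.04 and b = −0.7:

    # Ages and 0/1 labels (1 = profitable) from the customer table
    xs = [20, 45, 25, 27, 40, 35, 50]
    ys = [0, 1, 0, 0, 1, 1, 1]

    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n

    # a = cov(X, Y) / var(X),  b = y_mean - a * x_mean
    cov_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / n
    var_x = sum((x - x_mean) ** 2 for x in xs) / n
    a = cov_xy / var_x
    b = y_mean - a * x_mean

    def is_profitable(age):
        """Classify as profitable when f(age) = a*age + b >= 1/2."""
        return a * age + b >= 0.5

    print(round(a, 3), round(b, 3))
    print([(x, is_profitable(x)) for x in xs])  # matches the labels on this data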
Linear Regression in Classification (Conclusions)

☞ Regression is hardly applicable in classification unless strong assumptions about the data distribution are true
Bayesian Classification (Conclusion)

+ Easy to use

Distance Based Classification (A Naive Distance Based Classification Algorithm, the Idea)

☞ Given:
• Training data
• Classes
The idea:

☞ The training data is the model of the data
☞ Compute the k nearest neighbors of a tuple in the training data
☞ A DB tuple t is placed in the class that holds the majority of its nearest neighbors
☞ The choice of k has a huge impact on the classification results
☞ A rule of thumb: k ≤ √(number of tuples in the training set)
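☞ A minimal k-nearest-neighbor sketch over (gender, age) tuples from the first slide; the 0/1 gender encoding and the Euclidean distance are illustrative choices, not from the slides:

    import math
    from collections import Counter

    # Training tuples: ((gender encoded as M=1/F=0, age), class)
    train = [
        ((1, 20), "N"), ((1, 45), "P"), ((1, 25), "N"),
        ((0, 27), "N"), ((0, 40), "P"), ((1, 35), "P"), ((1, 50), "P"),
    ]

    def knn_classify(point, k):
        """Place the tuple in the class held by most of its k nearest neighbors."""
        neighbors = sorted(train, key=lambda item: math.dist(point, item[0]))[:k]
        votes = Counter(label for _, label in neighbors)
        return votes.most_common(1)[0][0]

    # The rule of thumb suggests k <= sqrt(7) ≈ 2.6; k = 3 is used here
    # only to avoid tied votes in this tiny example.
    print(knn_classify((1, 30), k=3))  # -> "N"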
Questions?

☞ Questions?