Classification

This document discusses classification: assigning data points to predefined categories or classes. It defines the classification problem, presents the main classification strategies (specifying boundaries, probability distributions, posterior probabilities), discusses handling of missing values and quality metrics (true/false positives and negatives), and shows how linear regression can be used for classification, with examples.

Uploaded by

Kamal Jack
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
23 views

Classificationi 4

This document discusses classification, which is assigning data points to predefined categories or classes. It provides definitions of classification and discusses strategies like specifying boundaries, using probability distributions, and posterior probabilities. It also covers dealing with missing data values, quality metrics for classification like true/false positives and negatives, and using linear regression for classification with examples. The key aspects covered are definition of classification, main classification strategies, handling missing values, and evaluating classification quality.

Uploaded by

Kamal Jack
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 4

Classification (A High Level Idea)

☞ Example: a table of customers is classified as profitable (P) or not profitable (N):

    Name    Gender  Age         Name    Sex  Age  Prof.
    Bob     M       20          Bob     M    20   N
    John    M       45          John    M    45   P
    Dave    M       25          Dave    M    25   N
    Marthe  F       27   -->    Marthe  F    27   N
    Kathy   F       40          Kathy   F    40   P
    Kimi    M       35          Kimi    M    35   P
    Tod     M       50          Tod     M    50   P

[Figure: a decision tree assigns the class: split on Gender, then on Age (F: <40 → N, >40 → P; M: <30 → N, >30 → P). A smaller table (Dave M 25 N, Kathy F 40 P, Tod M 50 P) shows training tuples.]

☞ Outline:
1 – Classification (page 2)
2 – Linear Regression in Classification (page 7)
3 – Bayesian Classification (page 11)
4 – Distance Based Classification (page 14)
5 – Questions? (page 16)
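The decision tree sketched above can be written as a short rule. A minimal sketch in Python; treating the split values 40 and 30 as ≥ boundaries is an assumption, since the slide only labels the branches <40/>40 and <30/>30:

```python
# Decision tree from the figure as a rule; the >= boundary handling at
# 40 and 30 is an assumption (the slide shows only <40/>40 and <30/>30).
def classify(sex, age):
    if sex == "F":
        return "P" if age >= 40 else "N"
    return "P" if age >= 30 else "N"

rows = [("Bob", "M", 20, "N"), ("John", "M", 45, "P"), ("Dave", "M", 25, "N"),
        ("Marthe", "F", 27, "N"), ("Kathy", "F", 40, "P"),
        ("Kimi", "M", 35, "P"), ("Tod", "M", 50, "P")]

# The rule reproduces every class label in the example table.
print(all(classify(s, a) == c for _, s, a, c in rows))  # True
```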

Fall, 2006 Arturas Mazeika Page 2

Classification (Definition) Classification (Strategies)

☞ Definition 0.1 Given a database D = {t1 , t2 , . . . , tn } and a set of classes ☞ Main strategies to classify data:
C = {C1 , C2 , . . . , Cm } the classification problem is to define the mapping • Specify the boundaries of the domain
f : D → C , where f (t) is assigned to one class only. • With the help of probability distributions:
☞ The Classes: P (ti ∧ Cj ) = P (ti |Cj )P (Cj )
• are pre-defined and known in advanced
• With the help of posterior probabilities:
• non-overlapping
• partition the whole database (domain) P (Cj |ti )

Classification (Missing Values)

☞ There are three main strategies to deal with missing values:
• Ignore the missing values
• Predict the missing values
• Classify the missing values separately

Classification (Quality Metrics)

☞ True Positive: f(ti) ∈ Cj, ti ∈ Cj
☞ False Negative (we miss data): f(ti) ∉ Cj, ti ∈ Cj
☞ True Negative: f(ti) ∉ Cj, ti ∉ Cj
☞ False Positive (we get too much data): f(ti) ∈ Cj, ti ∉ Cj
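The four quality metrics can be counted directly by comparing predicted and actual classes. A minimal sketch; the toy prediction vector is an illustrative assumption:

```python
# Count true/false positives and negatives for a binary classifier.
def confusion_counts(actual, predicted, positive="P"):
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    return tp, fn, fp, tn

actual    = ["N", "P", "N", "N", "P", "P", "P"]   # true classes
predicted = ["N", "P", "N", "P", "P", "N", "P"]   # illustrative predictions
print(confusion_counts(actual, predicted))  # (3, 1, 1, 2)
```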


Linear Regression in Classification (Example 1)

☞ One can use (linear) regression to classify data into:
• data which follows the linear regression
• outliers
• noise
☞ The training data should be provided without noise and outliers
☞ The idea:
• Compute the (linear) regression from the data
• Compute the regression error ε
• All data which are farther away from the regression line than ε are noise and outliers

Linear Regression in Classification (Example 2)

☞ Linear regression (a line) can be used to separate two classes
☞ Let's use f(x) = a + ε for our classification problem
• The general formula to compute a:

    a = (1/n) Σ_{i=1}^{n} X_i

• In our case: a = 34.7

[Figure: the class labels N and P plotted along the age axis (20–50); the constant fit a = 34.7 lies between the two groups.]

Linear Regression in Classification (Example 3)

☞ Linear regression (a line) can be used to separate two classes
☞ Let's use f(x) = ax + b + ε for our classification problem
• The general formulas to compute a and b:

    a = cov(X, Y) / σ²_X
    b = Ȳ − a X̄

• In our case: a = 0.04, b = −0.7
☞ All x for which f(x) ≥ 1/2 will be classified as profitable customers

[Figure: the labels (N = 0, P = 1) plotted against age (15–55) together with the fitted line; the threshold f(x) = 1/2 separates the two classes.]

Linear Regression in Classification (Conclusions)

☞ Regression is hardly applicable in classification, unless strong assumptions about the data distribution are true
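The slope/intercept formulas and the f(x) ≥ 1/2 rule can be sketched as follows. The age/label vectors are taken from the example table, so the fitted values come out close to, but not necessarily identical to, the slide's rounded a = 0.04, b = −0.7:

```python
# Fit f(x) = a*x + b by least squares and classify f(x) >= 1/2 as P.
ages   = [20, 45, 25, 27, 40, 35, 50]
labels = [0, 1, 0, 0, 1, 1, 1]          # P -> 1, N -> 0

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(labels) / n
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, labels)) / n
var_x  = sum((x - mean_x) ** 2 for x in ages) / n

a = cov_xy / var_x        # slope: cov(X, Y) / sigma_X^2
b = mean_y - a * mean_x   # intercept: Y-bar - a * X-bar

classify = lambda x: "P" if a * x + b >= 0.5 else "N"
print(round(a, 2), classify(45), classify(25))
```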


Bayesian Classification (The Formalism)

☞ Use Bayes' theorem to classify the data:

    P(Hj | ti) = P(ti | Hj) P(Hj) / P(ti)

☞ Say ti = (t1i, t2i, ..., tki) consists of k attributes:

    P(Hj | t1i, t2i, ..., tki) = P(t1i, t2i, ..., tki | Hj) P(Hj) / P(t1i, t2i, ..., tki)

☞ If we assume that the attributes are independent:

    P(t1i, t2i, ..., tki | Hj) = P(t1i | Hj) P(t2i | Hj) ... P(tki | Hj)

☞ Therefore

    P(Hj | t1i, t2i, ..., tki) = P(t1i | Hj) P(t2i | Hj) ... P(tki | Hj) P(Hj) / (P(t1i) P(t2i) ... P(tki))

Bayesian Classification (Example)

☞ Given the customer table:

    Name    Sex  Age  Prof.
    Bob     M    20   N
    John    M    45   P
    Dave    M    25   N
    Marthe  F    27   N
    Kathy   F    40   P
    Kimi    M    35   P
    Tod     M    50   P

☞ What is
• P(HP | (F, 27))
• P(HN | (F, 27))
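A sketch of the computation for the query (F, 27). Binning Age into <30 / ≥30 is an assumption not on the slides, made so that P(Age | class) can be estimated from only seven tuples:

```python
# Naive Bayes score proportional to P(H | (Sex, Age)); Age is binned
# into "<30"/">=30" (an assumption) to make the estimates non-trivial.
data = [("M", 20, "N"), ("M", 45, "P"), ("M", 25, "N"), ("F", 27, "N"),
        ("F", 40, "P"), ("M", 35, "P"), ("M", 50, "P")]

def age_bin(age):
    return "<30" if age < 30 else ">=30"

def score(cls, sex, age):
    rows = [r for r in data if r[2] == cls]
    prior = len(rows) / len(data)                              # P(H)
    p_sex = sum(r[0] == sex for r in rows) / len(rows)         # P(Sex | H)
    p_age = sum(age_bin(r[1]) == age_bin(age) for r in rows) / len(rows)
    return prior * p_sex * p_age        # proportional to P(H | t)

# No profitable customer is under 30, so the P-score is 0 -> class N.
print(score("N", "F", 27), score("P", "F", 27))
```

A zero count zeroes out the whole product; in practice one would add Laplace smoothing to the frequency estimates.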

Bayesian Classification (Conclusion)

+ Easy to use
+ Requires one scan of the training data
+ Handles missing data
− Independence of attributes might not be realistic
− Continuous data is not easily handled

Distance Based Classification (A Naive Distance Based Classification Algorithm, the Idea)

☞ Given:
• Training data
• Classes
☞ The main idea:
• Compute representatives for each class
• For each DB point t, assign t to the closest class in terms of the given distance or similarity
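The two steps (compute a representative per class, assign each point to the nearest one) can be sketched as follows. Using the centroid as the representative and Euclidean distance are assumptions; the slides leave both choices open:

```python
# Naive distance-based classifier: one centroid per class (assumption),
# points assigned to the nearest centroid by Euclidean distance.
import math

def centroid(points):
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))

def classify(t, representatives):
    return min(representatives, key=lambda c: math.dist(t, representatives[c]))

# 1-D example over the ages from the customer table.
reps = {"N": centroid([(20,), (25,), (27,)]),
        "P": centroid([(45,), (40,), (35,), (50,)])}
print(classify((30,), reps), classify((48,), reps))  # N P
```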


Distance Based Classification (K-Nearest Neighbors Classification, the Idea)

☞ The idea:
• The training data itself is the model of the data
• Compute the k nearest neighbors of a tuple in the training data
• DB tuple t is placed in the class which has most of the nearest neighbors
☞ The choice of k has a huge impact on the classification results
☞ A rule of thumb: k ≤ √(number of tuples in the training set)
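The k-nearest-neighbors idea can be sketched in one dimension over the ages from the example table; k = 3 is an illustrative choice (a small odd k avoids voting ties):

```python
# k-NN: majority vote among the k training tuples closest to the query.
from collections import Counter

train = [(20, "N"), (45, "P"), (25, "N"), (27, "N"),
         (40, "P"), (35, "P"), (50, "P")]

def knn(x, k=3):
    nearest = sorted(train, key=lambda t: abs(t[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn(26), knn(42))  # N P
```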

Questions?

☞ Questions?
