
Basics of Classification

(Adapted from various sources; these slides are for teaching purposes only.)
BASIC IDEA
§ In the exams, each student’s grade was assigned based on their marks as follows:

Rules:
  Mark ≥ 90 : A
  90 > Mark ≥ 80 : B
  80 > Mark ≥ 70 : C
  70 > Mark ≥ 60 : D
  60 > Mark : F

§ Here the classification is done based on a simple rule!
§ We apply a rule / set of rules to classify the data.
§ Classification is a technique for describing important data classes based on some rules.
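
For instance, the grading rule above maps directly to code. A minimal sketch, assuming the 90/80/70/60 boundaries shown above:

    # Rule-based classification: assign a grade from a mark.
    def grade(mark):
        if mark >= 90: return 'A'
        if mark >= 80: return 'B'
        if mark >= 70: return 'C'
        if mark >= 60: return 'D'
        return 'F'

    print(grade(85))   # B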

Classification!
§ The classes are mutually exhaustive and exclusive.
§ This indicates each object can be assigned to precisely one class.
Applications
§ Science
§ Finance
§ Medical
§ Security
§ Prediction
§ Entertainment
§ Social media
§ And more…
Classification types
§ Supervised Classification
§ We already know the set of possible classes.

§ Unsupervised Classification
§ It is called clustering.
§ We don’t know the classes or the number of possible classes.
§ We try to categorize based on some rule, which may not serve our purpose at all.
Image taken from www.webstockreview.net
Points to remember
§ Classification is a supervised technique.
§ A good classifier depends on the two factors below:
§ We need rules for classification.
§ We need a teacher.
How to proceed?
§ Training set (the teacher)
§ Collection of records with a set of attributes and one class label.
§ Develop a model for the class in terms of the other attributes with the training set.
§ Define the rules.

The image is taken from https://sebastianraschka.com/Articles/2015_pca_in_3_steps.html
Classifiers
§ Statistics based - Bayesian
§ Distance based - KNN
§ Decision tree based - CART
§ Machine learning based - SVM
§ Neural network based - CNN
Naïve Bayes
§ The ‘idiot’ or ‘simple’ classifier.
§ Based on statistics.
§ Empirically proven to be useful.
§ Scales very well.
§ Predicts class membership probabilities.
§ Based on Bayes’ Theorem.
§ Assumes the attributes are independent given the class.
In this database (the Iris dataset), there are four attributes
A = [Sepal length, Sepal width, Petal length, Petal width]
with 150 samples.

The categories of classes are:
C = [Iris Versicolor, Iris Setosa, Iris Virginica]

Given this knowledge of the data and classes, we are to find the most likely classification for any unseen instance.
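
As a concrete illustration (not part of the original slides), here is a minimal sketch of exactly this task, assuming scikit-learn and its bundled Iris dataset:

    # Naive Bayes on the Iris data: 150 samples, 4 attributes, 3 classes.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    clf = GaussianNB().fit(X_train, y_train)   # learn per-class statistics
    print(clf.predict(X_test[:5]))             # most likely class for unseen instances
    print(clf.predict_proba(X_test[:5]))       # class membership probabilities
    print("accuracy:", clf.score(X_test, y_test))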
Why Statistics?
§ In many applications, an unknown sample cannot be classified to a class label with certainty.
§ In such a situation, the classification can be achieved probabilistically.
§ In a Bayesian classifier, we try to model probabilistic relationships between the attribute set and the class variable.
§ Bayesian classifiers use Bayes’ Theorem of probability for classification.
Another Application: Digit Recognition

[Diagram: pixel image → Classifier → 5]

§ X1, …, Xn ∈ {0, 1} (blue vs. red pixels)
§ Y ∈ {5, 6} (predict whether a digit is a 5 or a 6)
§ A good strategy is to predict the probability that the image represents a 5 given its pixels.
The Bayes Classifier
§ So … how do we compute the probability that the image represents a 5 given its pixels? Using Bayes’ Theorem:

    P(Y = 5 | X1, …, Xn) = P(X1, …, Xn | Y = 5) · P(Y = 5) / P(X1, …, Xn)

    (numerator: Likelihood × Prior; denominator: Normalization Constant)

How
§ To classify, we’ll simply compute these two probabilities and predict based on which one is greater.
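
To make the comparison concrete, here is a tiny hand-worked sketch (not from the slides; all pixel likelihoods and priors are made-up numbers for illustration):

    # Naive Bayes comparison for the 5-vs-6 digit example.
    # All probabilities below are hypothetical, for illustration only.
    prior = {5: 0.5, 6: 0.5}                      # P(Y)
    p_on = {5: [0.9, 0.2, 0.7],                   # P(Xi = 1 | Y = 5), one per pixel
            6: [0.4, 0.8, 0.6]}                   # P(Xi = 1 | Y = 6)

    def score(y, x):
        # Unnormalized posterior: P(Y = y) * prod_i P(Xi = xi | Y = y),
        # using the naive assumption that pixels are independent given Y.
        s = prior[y]
        for p, xi in zip(p_on[y], x):
            s *= p if xi == 1 else (1 - p)
        return s

    x = [1, 0, 1]                                  # an observed 3-pixel image
    print(5 if score(5, x) > score(6, x) else 6)   # predict the larger posterior

The normalization constant cancels in the comparison, which is why it can be ignored when we only need to know which posterior is larger.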


CLASSIFIERS

[Diagram: feature values X1, X2, X3, …, Xn feed into a Classifier, which outputs a category Y]

[Diagram: the same, where the Classifier also uses a DB, a collection of instances with known categories]
EXAMPLE 1
Determining the decision on a scholarship application based on the following features:

§ Household income (annual income in millions of pesos)
§ Number of siblings in family
§ High school grade (QPI, on a scale of 1.0 – 4.0)

Intuition (reflected in the data set): award scholarships to high-performers and to those with financial need
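
For illustration only, this intuition could be hand-coded as a rule; the thresholds and the way the two criteria combine below are entirely hypothetical:

    # Hypothetical scholarship rule: award to high performers or applicants in need.
    def scholarship_decision(income_mpesos, num_siblings, qpi):
        high_performer = qpi >= 3.5                                 # hypothetical cutoff
        financial_need = income_mpesos < 0.5 or num_siblings >= 4   # hypothetical cutoffs
        return "award" if (high_performer or financial_need) else "deny"

    print(scholarship_decision(0.3, 2, 3.0))   # award (financial need)
    print(scholarship_decision(1.2, 1, 3.8))   # award (high performer)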
Ø Instance-based learning is often termed lazy learning, as there is typically no “transformation” of training instances into more general “statements”

Ø Instead, the presented training data is simply stored and, when a new query instance is encountered, a set of similar, related instances is retrieved from memory and used to classify the new query instance

Ø Hence, instance-based learners never form an explicit general hypothesis regarding the target function. They simply compute the classification of each new query instance as needed
K-NN APPROACH
Ø The simplest and most widely used instance-based learning algorithm is the k-NN algorithm

Ø k-NN assumes that all instances are points in some n-dimensional space and defines neighbors in terms of distance (usually Euclidean distance)

Ø k is the number of neighbors considered
K-NN APPROACH
Ø Unlike the previous learning methods, kNN does not build a model from the training data

Ø To classify a test instance d, define the k-neighborhood P as the k nearest neighbors of d

Ø Count the number n of training instances in P that belong to class cj

Ø Estimate Pr(cj|d) as n/k

Ø No training is needed. Classification time is linear in the training set size for each test case
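
The procedure above is short enough to write out directly. A minimal pure-Python sketch (illustrative, not from the slides):

    # k-NN classification: no training step, just distance + majority vote.
    import math
    from collections import Counter

    def knn_predict(train, query, k=3):
        # train: list of (feature_vector, label) pairs; query: a feature vector.
        # Take the k training instances nearest to the query (Euclidean distance).
        neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
        # Majority vote among their labels; Pr(cj|d) is estimated as n/k.
        votes = Counter(label for _, label in neighbors)
        return votes.most_common(1)[0][0]

    train = [((1.0, 1.0), '+'), ((1.2, 0.8), '+'),
             ((4.0, 4.2), '-'), ((4.1, 3.9), '-')]
    print(knn_predict(train, (1.1, 0.9), k=3))   # '+'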
K-NEAREST-NEIGHBORS

WHAT IS THE MOST LIKELY LABEL FOR c?
Solution: look for the k nearest neighbors of c, and take the majority label as c’s label.
Let’s suppose k = 3.
The 3 nearest points to c are: a, a and o.
Therefore, the most likely label for c is a.
SIMPLE ILLUSTRATION: THE COMPLEXITY

[Figure: a query point q surrounded by + and - training points]

What is the class of q?
q is + under 1-NN, but - under 5-NN
K-NEAREST NEIGHBORS ALGORITHM

Get dataset: for a given instance T, get the top k instances that are “nearest” to T
• Select a reasonable distance measure

Inspect: inspect the category of these k instances, and choose the category C that represents the most instances

Conclude: conclude that T belongs to category C

A case is classified by a majority vote of its neighbors, with the case being assigned to the class most common among its k nearest neighbors, measured by a distance function. If k = 1, then the case is simply assigned to the class of its nearest neighbor.
Evaluation of Classification Models

Adapted from Václav Hlaváč, Czech Technical University, Prague.
Performance of a learned classifier?

— Classifiers (both supervised and unsupervised) are learned (trained) on a finite training multiset (called simply a training set in the sequel).

— A learned classifier has to be tested experimentally on a different test set.

— In run mode, the classifier performs on different data than that on which it was learned.

— The experimental performance on the test data is a proxy for the performance on unseen data. It checks the classifier’s generalization ability.

— There is a need for a criterion function assessing the classifier performance experimentally, e.g., its error rate, accuracy, or expected Bayesian risk (to be discussed later).

— There is also a need for comparing classifiers experimentally.
Evaluation as hypothesis testing

— Evaluation has to be treated as hypothesis testing in statistics.

— The value of the population parameter has to be statistically inferred based on the sample statistics (i.e., a training set in pattern recognition).
Danger of overfitting

— Learning the training data too precisely usually leads to poor classification results on new data.

— The classifier has to have the ability to generalize.


Training vs. test data

— Problem: only finite data are available, and they have to be used both for training and testing.

— More training data gives better generalization. More test data gives a better estimate of the classification error probability.

— Never evaluate performance on training data.
  • The conclusion would be optimistically biased.
Training vs. test data

— Hold out: partitioning of the available finite set of data into training / test sets.

— Bootstrap and cross validation.

— Once evaluation is finished, all the available data can be used to train the final classifier.
Hold out method

— Given data is randomly partitioned into two independent sets.
  • Training multiset (e.g., 2/3 of the data) for the statistical model construction, i.e., learning the classifier.
  • Test set (e.g., 1/3 of the data) is held out for the accuracy estimation of the classifier.

— Random sampling is a variation of the hold out method: repeat the hold out k times; the accuracy is estimated as the average of the accuracies obtained.
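
A minimal sketch of the hold out method with repeated random sampling, assuming scikit-learn (the classifier choice is arbitrary):

    # Hold out: 2/3 training, 1/3 test, repeated k times with random splits.
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    accs = []
    for seed in range(10):                           # repeat the hold out k = 10 times
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=1/3, random_state=seed)
        accs.append(KNeighborsClassifier().fit(X_tr, y_tr).score(X_te, y_te))
    print("estimated accuracy:", np.mean(accs))      # average over repetitions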
K-fold cross validation

— The training set is randomly divided into K disjoint sets of equal size, where each part has roughly the same class distribution.

— The classifier is trained K times, each time with a different set held out as a test set.

— The estimated error is the mean of these K errors.

Graphical example: [figure not included]
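
A minimal sketch of K-fold cross validation, assuming scikit-learn; StratifiedKFold keeps roughly the same class distribution in each fold, as required above:

    # 5-fold cross validation: train K times, hold out a different fold each time.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(KNeighborsClassifier(), X, y, cv=cv)
    print("per-fold accuracy:", scores)
    print("estimated error:", 1 - scores.mean())   # mean of the K errors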
Leave-one-out

— A special case of K-fold cross validation with K = n, where n is the total number of samples in the training multiset.

— n experiments are performed, using (n − 1) samples for training and the remaining sample for testing.

— It is rather computationally expensive.

— Leave-one-out cross-validation does not guarantee the same class distribution in training and test data!

— The extreme case:
  • 50% class A, 50% class B. Predict the majority class label in the training data. True error 50%; leave-one-out error estimate 100%!
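
The extreme case can be verified in a few lines (an illustrative sketch, not from the slides):

    # Majority-vote classifier on perfectly balanced data:
    # true error is 50%, but leave-one-out estimates 100%.
    from collections import Counter

    data = ['A'] * 5 + ['B'] * 5                # 50% class A, 50% class B
    errors = 0
    for i, true_label in enumerate(data):
        train = data[:i] + data[i + 1:]         # leave one sample out
        majority = Counter(train).most_common(1)[0][0]
        errors += (majority != true_label)      # the left-out class is now the minority
    print("LOO error estimate:", errors / len(data))   # 1.0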
Bootstrap aggregating

— The bootstrap uses sampling with replacement to form the training set.

— Let the training set T consist of n entries.

— The bootstrap generates m new datasets Ti, each of size n0 < n, by sampling T uniformly with replacement. The consequence is that some entries can be repeated in Ti.

— The m statistical models (e.g., classifiers, regressors) are learned using the above m bootstrap samples.

— The statistical models are combined, e.g., by averaging the output (for regression) or by voting (for classification).
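
A minimal hand-rolled sketch of bagging, assuming scikit-learn for the base models (decision trees are an arbitrary choice):

    # Bagging: m bootstrap samples -> m models -> combine by voting.
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X, y = load_iris(return_X_y=True)
    n, m = len(X), 25                              # n entries, m bootstrap datasets

    models = []
    for _ in range(m):
        idx = rng.integers(0, n, size=n)           # sample uniformly with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

    preds = np.array([mdl.predict(X[:5]) for mdl in models])   # shape (m, 5)
    votes = [np.bincount(col).argmax() for col in preds.T]     # majority vote per instance
    print(votes)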
Recommended experimental validation
procedure

— Use K-fold cross-validation (K = 5 or K = 10) for estimating performance (accuracy, etc.).

— Compute the mean value of the performance estimate, along with its standard deviation and confidence intervals.

— Report mean values of performance estimates and their standard deviations or 95% confidence intervals around the mean.
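
A minimal sketch of this reporting procedure, assuming scikit-learn and scipy:

    # Mean, standard deviation, and 95% confidence interval of K-fold scores.
    import numpy as np
    from scipy import stats
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    scores = cross_val_score(KNeighborsClassifier(), X, y, cv=10)   # K = 10
    mean, sd = scores.mean(), scores.std(ddof=1)
    # 95% CI around the mean, t distribution with K - 1 degrees of freedom:
    half = stats.t.ppf(0.975, df=len(scores) - 1) * sd / np.sqrt(len(scores))
    print(f"accuracy: {mean:.3f} +/- {sd:.3f}, 95% CI [{mean - half:.3f}, {mean + half:.3f}]")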
Criterion function to assess classifier
performance

— Accuracy and error rate
  • Accuracy is the percent of correct classifications.
  • Error rate is the percent of incorrect classifications.
  • Accuracy = 1 - Error rate.

— Problems with accuracy:
  • Assumes equal costs for misclassification.
  • Assumes a relatively uniform class distribution.

— Other characteristics derived from the confusion matrix.

— Expected Bayesian risk.
Confusion matrix, two classes only

(Entries: a = TN, b = FP, c = FN, d = TP.)

— Accuracy = (a + d)/(a + b + c + d) = (TN + TP)/total

— Precision, positive predictive value = d/(b + d) = TP/predicted positive

— True positive rate, recall, sensitivity = d/(c + d) = TP/actual positive

— Specificity, true negative rate = a/(a + b) = TN/actual negative

— False positive rate, false alarm rate = b/(a + b) = FP/actual negative = 1 - specificity

— False negative rate = c/(c + d) = FN/actual positive
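
These measures follow directly from the four counts. A quick sketch with hypothetical counts:

    # Measures from a 2x2 confusion matrix (a = TN, b = FP, c = FN, d = TP).
    a, b, c, d = 50, 10, 5, 35                    # hypothetical counts

    accuracy    = (a + d) / (a + b + c + d)       # (TN + TP) / total
    precision   = d / (b + d)                     # TP / predicted positive
    recall      = d / (c + d)                     # TP / actual positive (sensitivity)
    specificity = a / (a + b)                     # TN / actual negative
    fpr         = b / (a + b)                     # FP / actual negative = 1 - specificity
    fnr         = c / (c + d)                     # FN / actual positive
    print(accuracy, precision, recall, specificity, fpr, fnr)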
Confusion matrix, # of classes > 2
Here is the summary!!!

— Any ML/AI model depends on the training data set.

— Class balancing is important.

— Validation is important!

— And we also need some measure for validating the data.
THANK YOU FOR LISTENING

Questions???
