Classification

The problem of predicting a categorical output from a set of training examples is known as a
classification problem. Predicting a categorical output is equivalent to classifying a new
observation into one of the known categories based on the features of that observation, which is
why the task is termed the classification problem. The classes themselves can be crisp or
conceptual in nature. Some examples of classification problems are:
o Predicting the medical condition of a patient from the symptoms
o Detecting fraudulent transactions in an online banking service
o Classifying a hotel as a luxury or a budget hotel
o Predicting whether a credit card holder will default
o Classifying whether an email is spam
o Recognizing a handwritten character

A model used for classification is called a classifier. Many classifiers have been developed in the
literature, such as

o K-Nearest Neighbors
o Decision tree
o Bayesian classification
o Logistic regression
o Linear discriminant analysis
o Random forest
o Support vector machine
A brief idea of K-Nearest Neighbor (KNN) classification was discussed earlier. In the present
session we will discuss Bayesian classification.

Bayesian Classification
Let (x1, y1), (x2, y2), …, (xn, yn) be the training examples, where the outputs yi are categorical.
Let f be a mapping that maps an input X to the corresponding output Y.
A Bayesian classification technique is one based on Bayes' theorem, and the resultant
classifier is called a Bayesian classifier. A Bayesian classifier predicts the probabilities of class
membership for a given input, and the class having the highest probability is declared as
the predicted class. In other words, given the input x, we estimate the conditional probability
for every class Cj as

P(Y = Cj | X = x)

and declare Ci as the predicted class if

P(Ci | X = x) > P(Cj | X = x) for all j ≠ i

The conditional probabilities, also called the posterior class probabilities, are computed using
Bayes' theorem. The function f̂ modelled using this approach is also called the 'maximum
a posteriori (MAP) hypothesis'.
In comparative studies of classification algorithms, Bayesian classifiers have exhibited high
accuracy and speed when applied to large databases.
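
The decision rule above amounts to picking the class with the largest posterior probability. The
following is a minimal sketch, assuming the posterior probabilities have already been estimated
(the class names and values below are hypothetical, not from the source):

# A minimal sketch of the MAP decision rule described above.
# The posteriors P(Y = Cj | X = x) are taken as given; the numbers are hypothetical.
posteriors = {"spam": 0.78, "not_spam": 0.22}   # posteriors for one input x

# Declare the class with the highest posterior probability as the prediction.
predicted_class = max(posteriors, key=posteriors.get)
print(predicted_class)   # -> spam
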
Naïve Bayes Classifier
This is one of the simplest Bayesian classifiers. It assumes that, given a class, the value of a
feature is independent of the values of the other features. This assumption is called class
conditional independence. It is made in order to simplify the computations, and it is in this sense
that the classifier is considered “naïve.”

1. The posterior probabilities are computed using Bayes' theorem as

   P(C_i \mid X = x) = \frac{P(X = x \mid C_i)\, P(C_i)}{P(X = x)}
2. The prior probabilities P(Ci) for different classes are estimated as
   P(C_i) = \frac{\text{number of training examples in the training dataset with class label } C_i}{\text{total number of training examples in the training dataset}}
3. If the training dataset has many input variables, computing P(X = x | Ci) directly is
   computationally very expensive. In order to reduce the computational effort, the Naïve
   Bayes algorithm assumes class conditional independence. This assumption states that the
   values of the input variables are conditionally independent of one another given the class
   label of the example. With this assumption, we have

   P(X = x \mid C_i) = \prod_{k=1}^{p} P(X_k = x_k \mid C_i)

   where p is the number of input variables.

   The probabilities P(Xk = xk | Ci) are much easier to compute from the training tuples.
   (a) If the attribute Xk is categorical, we estimate

   P(X_k = x_k \mid C_i) = \frac{\text{number of examples of class } C_i \text{ having value } x_k \text{ for variable } X_k}{\text{number of examples of class } C_i}
   (b) If the input variable Xk is continuous valued, it is generally assumed that, for class Ci,
   the attribute is normally distributed with mean μi and standard deviation σi. The
   probability P(xk | Ci) is then calculated using the values of μi and σi estimated from the
   training dataset.
4. To predict the class label for a new example with input x, P(X = x | Ci) P(Ci) is calculated
   for each class Ci. The classifier predicts the class for which this product is maximum.
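
The four steps above can be put together in a short program. The following is a minimal sketch
for categorical inputs only; the toy dataset, feature values and function names are hypothetical
(not from the source), and it does not yet include the Laplacian correction discussed next.

from collections import Counter, defaultdict

def train_naive_bayes(X, y):
    """Estimate the priors P(Ci) and the conditionals P(Xk = xk | Ci) by counting."""
    n = len(y)
    class_counts = Counter(y)
    priors = {c: cnt / n for c, cnt in class_counts.items()}        # step 2
    value_counts = defaultdict(Counter)      # (class, feature index) -> counts of values
    for xi, ci in zip(X, y):
        for k, v in enumerate(xi):
            value_counts[(ci, k)][v] += 1

    def conditional(ci, k, v):               # step 3(a): relative frequency within class ci
        return value_counts[(ci, k)][v] / class_counts[ci]

    return priors, conditional

def predict(priors, conditional, x):
    """Step 4: choose the class Ci maximising P(X = x | Ci) * P(Ci)."""
    scores = {}
    for ci, prior in priors.items():
        score = prior
        for k, v in enumerate(x):
            score *= conditional(ci, k, v)   # class-conditional independence (step 3)
        scores[ci] = score
    return max(scores, key=scores.get)

# Tiny hypothetical dataset: features = (outlook, windy), class label = play
X = [("sunny", "yes"), ("sunny", "no"), ("rainy", "yes"), ("rainy", "no")]
y = ["no", "yes", "no", "yes"]

priors, conditional = train_naive_bayes(X, y)
print(predict(priors, conditional, ("sunny", "no")))   # -> yes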

Need for the Laplacian Correction

While computing P(X | Ci) using the class-conditional independence assumption, we may encounter
situations where some P(xk | Ci) is estimated as zero, which in turn makes the estimate of
P(X | Ci) also zero. To avoid this problem, we add 1 to each count that we obtain from our
dataset D. For a large dataset, this does not change the estimated probabilities significantly,
and it avoids the zero-probability problem described above.

This technique of avoiding the zero-probability problem is called the Laplacian correction.
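
A minimal sketch of the corrected estimate for a categorical attribute, assuming hypothetical
counts; note that adding 1 to every value count implicitly adds the number of distinct values of
the attribute to the denominator, since the denominator is the sum of those counts.

def conditional_laplace(value_count_in_class, class_count, n_distinct_values):
    """Add-one (Laplacian) estimate of P(Xk = xk | Ci).

    Adding 1 to every value count also adds the number of distinct values of Xk
    to the denominator, so the estimated probabilities still sum to 1.
    """
    return (value_count_in_class + 1) / (class_count + n_distinct_values)

# Hypothetical counts: a value never seen with the class no longer gets probability zero.
print(conditional_laplace(0, 2, 2))   # 0.25 instead of 0.0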

Measuring Classifier Accuracy

Classifier accuracy is measured using a test set that consists of class-labeled tuples that were
not used to train the model.

Accuracy

The accuracy of a classifier M on a given test set is defined as

Acc(M) = Percentage of test set tuples that are correctly classified by the classifier.

Error rate

The error rate (or misclassification rate) for classifier M is given by 1 − Acc(M).
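
A minimal sketch of these two measures, computed on hypothetical test labels and predictions:

def accuracy(y_true, y_pred):
    """Fraction of test tuples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical test-set labels and predictions.
y_true = ["spam", "spam", "ham", "ham"]
y_pred = ["spam", "ham",  "ham", "ham"]

acc = accuracy(y_true, y_pred)
print("Accuracy  :", acc)        # 0.75
print("Error rate:", 1 - acc)    # 0.25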

Confusion Matrix

This is a useful tool for analyzing how well a classifier can recognize tuples of different classes.
The confusion matrix for a classifier is defined as

CM = (Ci,j)m×m,

where m is the number of classes and Ci,j is the proportion of tuples of class i that were labeled
by the classifier as class j.

For a classifier with good accuracy, most of the off-diagonal entries should be close to zero.
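
A minimal sketch of the proportion-based confusion matrix defined above, using hypothetical
labels and class names:

from collections import Counter

def confusion_matrix(y_true, y_pred, classes):
    """Ci,j = proportion of tuples of true class i labeled by the classifier as class j."""
    pair_counts = Counter(zip(y_true, y_pred))
    class_sizes = Counter(y_true)
    return [[pair_counts[(i, j)] / class_sizes[i] for j in classes] for i in classes]

# Hypothetical three-class example.
classes = ["a", "b", "c"]
y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "b", "b", "b", "c", "c"]
for label, row in zip(classes, confusion_matrix(y_true, y_pred, classes)):
    print(label, row)
# a [0.5, 0.5, 0.0]
# b [0.0, 1.0, 0.0]
# c [0.0, 0.0, 1.0]
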
When one of the classes is of special interest

The tuples belonging to the class of interest are termed positive tuples, and the other tuples
are termed negative tuples.

The positive tuples that are correctly labeled by the classifier as positive are called true positive
tuples, and the negative tuples that are correctly labeled by the classifier as negative are called
true negative tuples.

Similarly, the positive tuples that are incorrectly labeled by the classifier as negative are called
false negative tuples, and the negative tuples that are incorrectly labeled by the classifier as
positive are called false positive tuples.

                  Labeled by classifier
True Class        Positive             Negative
Positive          True Positive        False Negative
Negative          False Positive       True Negative

Sensitivity

\text{Sensitivity} = \frac{\#\,\text{True Positives}}{\#\,\text{Positives}}

Specificity

\text{Specificity} = \frac{\#\,\text{True Negatives}}{\#\,\text{Negatives}}

Precision

\text{Precision} = \frac{\#\,\text{True Positives}}{\#\,\text{True Positives} + \#\,\text{False Positives}}
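
A minimal sketch of these three measures, assuming a binary setting with hypothetical labels and
"spam" taken as the positive class:

def binary_counts(y_true, y_pred, positive):
    """Count TP, FN, FP and TN, treating `positive` as the class of interest."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fn, fp, tn

# Hypothetical labels, with "spam" as the positive class.
y_true = ["spam", "spam", "spam", "ham", "ham"]
y_pred = ["spam", "spam", "ham",  "spam", "ham"]
tp, fn, fp, tn = binary_counts(y_true, y_pred, positive="spam")

print("Sensitivity:", tp / (tp + fn))   # 2/3
print("Specificity:", tn / (tn + fp))   # 1/2
print("Precision  :", tp / (tp + fp))   # 2/3
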
Kappa Statistic
Cohen’s Kappa statistic is computed as
\kappa = \frac{P_o - P_e}{1 - P_e} = 1 - \frac{1 - P_o}{1 - P_e}

where P_o = proportion of observed agreement (= accuracy), and
P_e = probability of agreement by chance.


If there is complete agreement, κ will be 1, whereas κ will be 0 when the observed agreement
is the same as the agreement expected by chance. Thus, a larger value of κ indicates a better
performance of the classifier.
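
A minimal sketch on hypothetical labels, where the chance agreement P_e is estimated from the
marginal class frequencies of the true and predicted labels (one common way of estimating it):

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (Po - Pe) / (1 - Pe)."""
    n = len(y_true)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n       # observed agreement = accuracy
    labels = set(y_true) | set(y_pred)
    # Chance agreement estimated from the marginal class frequencies.
    p_e = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in labels)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels: accuracy is 0.8, but kappa discounts the agreement expected by chance.
y_true = ["a", "a", "a", "b", "b"]
y_pred = ["a", "a", "a", "a", "b"]
print(round(cohen_kappa(y_true, y_pred), 3))   # about 0.545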
