Machine Learning
The problem of predicting a categorical output from a set of training examples is known as a
classification problem. Predicting a categorical output is equivalent to classifying a new
observation into one of the known categories based on the features of that observation; hence
the name. The classes themselves can be crisp or conceptual in nature. Some examples of
classification problems are:
o Predicting the medical condition of a patient from the symptoms
o Predicting fraudulent transactions in an online banking service
o Classifying a hotel as a luxury or budget hotel
o Predicting whether a credit card holder will default
o Classifying whether an email is spam
o Recognizing a handwritten character
A model used for classification is called a classifier. Many classifiers have been developed in the
literature, such as:
K-Nearest Neighbors
Decision Tree
Bayesian classification
Logistic regression
Linear discriminant analysis
Random forest
Support vector machine
The K-Nearest Neighbors (KNN) classifier was briefly discussed earlier. In the present
session we will discuss Bayesian classification.
Bayesian Classification
Let (x_1, y_1), (x_2, y_2), …, (x_n, y_n) be the training examples, where the outputs y_i are categorical.
Let f be a mapping that maps an input X to the corresponding output Y.
A Bayesian classification technique is a technique based on Bayes' theorem. The resultant
classifier is called a Bayesian classifier. A Bayesian classifier predicts the probabilities of class
membership based on the given input. The class having the highest probability is declared as
the predicted class. In other words, given the input x, we estimate the conditional probability
for every class C_j as
P(Y = C_j | X = x),
and declare C_i as the predicted class if
P(C_i | X = x) > P(C_j | X = x) for all j ≠ i.
The conditional probabilities, also called the posterior class probabilities, are computed using
Bayes' theorem. The fitted function \hat{f} obtained with this approach is also called the
'maximum a posteriori (MAP) hypothesis'.
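As a minimal sketch of this decision rule in Python: given estimated posterior probabilities for
each class (the class names and probability values below are purely hypothetical), the predicted
class is simply the one with the largest posterior.

```python
# Hypothetical posterior probabilities P(Y = Cj | X = x) for a single input x;
# in practice these would be estimated from the training data via Bayes' theorem.
posteriors = {"spam": 0.82, "not_spam": 0.18}

# Bayesian decision rule: declare the class with the largest posterior probability.
predicted_class = max(posteriors, key=posteriors.get)
print(predicted_class)  # -> spam
```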
In comparative studies of classification algorithms, Bayesian classifiers have exhibited high
accuracy and speed when applied to large databases.
Naïve Bayes Classifier
This is one of the simplest Bayesian classifiers. It assumes that given a class, the value of a
feature is independent of the values of the other features. This assumption is called class
conditional independence. It is made in order to simplify the computations, and it is in this
sense that the classifier is considered "naïve."
The classifier works as follows.
1. Given a new input x, the posterior probability of each class C_i is computed using Bayes' theorem:
P(C_i | X = x) = \frac{P(X = x | C_i) \, P(C_i)}{P(X = x)}
Since the denominator P(X = x) is the same for every class, only the numerator
P(X = x | C_i) P(C_i) needs to be compared across classes.
2. The prior probabilities P(C_i) for the different classes are estimated as
P(C_i) = \frac{\text{\# of training examples in the training dataset having class label } C_i}{\text{total \# of training examples in the training dataset}}
3. If the training dataset has many input variables, computing P(X = x | C_i) directly is
computationally very expensive. In order to reduce the computational effort, the Naïve Bayes
algorithm assumes class conditional independence: the values of the input variables are
conditionally independent of one another given the class label of the example. With this
assumption, we have
P(X = x | C_i) = \prod_{k=1}^{p} P(X_k = x_k | C_i)
The probabilities P(X_k = x_k | C_i) are much easier to compute from the training tuples.
(a) If the attribute X_k is categorical, we estimate
P(X_k = x_k | C_i) = \frac{\text{\# of examples of class } C_i \text{ having value } x_k \text{ for variable } X_k}{\text{\# of examples of class } C_i}
(b) If the input variable X_k is continuous valued, it is generally assumed that, within class C_i,
the attribute is normally distributed with mean \mu_i and standard deviation \sigma_i. The
probability P(X_k = x_k | C_i) is then calculated using the values of \mu_i and \sigma_i
estimated from the training dataset.
4. To predict the class label for a new example with input x, the product P(X = x | C_i) P(C_i) is
calculated for each class C_i. The classifier predicts the class for which this product is maximum
(a small end-to-end sketch of these steps is given below).
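The following is a minimal from-scratch sketch of steps 1-4 in Python, assuming all input
variables are categorical as in case (a). The toy weather-style dataset, the feature values, and
the function names are hypothetical and only illustrate the counting-based estimates.

```python
from collections import Counter, defaultdict

def train_naive_bayes(X_rows, y):
    """Estimate the priors P(Ci) (step 2) and the per-feature conditionals
    P(Xk = xk | Ci) (step 3a) by simple counting over the training examples."""
    n = len(y)
    class_counts = Counter(y)
    priors = {c: cnt / n for c, cnt in class_counts.items()}
    # counts[c][k][v] = number of class-c examples whose k-th feature equals v
    counts = defaultdict(lambda: defaultdict(Counter))
    for row, c in zip(X_rows, y):
        for k, value in enumerate(row):
            counts[c][k][value] += 1

    def conditional(c, k, value):
        return counts[c][k][value] / class_counts[c]

    return priors, conditional

def predict(x, priors, conditional):
    """Step 4: predict the class maximizing P(X = x | Ci) * P(Ci), where the
    likelihood is a product over features (class conditional independence)."""
    scores = {}
    for c, prior in priors.items():
        likelihood = 1.0
        for k, value in enumerate(x):
            likelihood *= conditional(c, k, value)
        scores[c] = likelihood * prior
    return max(scores, key=scores.get)

# Hypothetical training examples: (outlook, humidity) -> play
X_rows = [("sunny", "high"), ("sunny", "normal"), ("rain", "high"), ("rain", "normal")]
y = ["no", "yes", "no", "yes"]

priors, conditional = train_naive_bayes(X_rows, y)
print(predict(("sunny", "high"), priors, conditional))  # -> no
```

A continuous attribute, as in case (b), would instead contribute a Gaussian density term
evaluated at x_k, with the class-specific mean and standard deviation estimated from the
training data.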
Classifier accuracy is measured using a test set that consists of class-labeled tuples that were
not used to train the model.
Accuracy
Acc(M) = Percentage of test set tuples that are correctly classified by the classifier.
Error rate
The error rate (or misclassification rate) for classifier M is given by 1-Acc(M).
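As a quick illustration in Python (the test-set labels below are hypothetical):

```python
# Hypothetical true and predicted labels for a small test set.
y_true = ["spam", "ham", "spam", "ham", "ham"]
y_pred = ["spam", "ham", "ham", "ham", "spam"]

# Acc(M): fraction of test tuples classified correctly.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
error_rate = 1 - accuracy  # misclassification rate = 1 - Acc(M)
print(accuracy, error_rate)  # -> 0.6 0.4
```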
Confusion Matrix
This is a useful tool for analyzing how well a classifier can recognize tuples of different classes.
Confusion matrix for a classifier is defined as
CM = (C_{i,j})_{m \times m},
where m is the number of classes and
C_{i,j} = proportion of tuples of class i that were labeled by the classifier as class j.
For a classifier with good accuracy, the diagonal entries should be close to one and the
off-diagonal entries close to zero.
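Below is a minimal sketch of building such a row-normalized confusion matrix in Python; the
label lists and the class ordering are hypothetical.

```python
from collections import Counter, defaultdict

def confusion_matrix(y_true, y_pred, classes):
    """Entry (i, j): proportion of class-i tuples labeled by the classifier as class j."""
    counts = defaultdict(Counter)
    for t, p in zip(y_true, y_pred):
        counts[t][p] += 1
    totals = Counter(y_true)
    return [[counts[i][j] / totals[i] for j in classes] for i in classes]

# Hypothetical labels; rows and columns ordered as ["spam", "ham"].
y_true = ["spam", "ham", "spam", "ham", "ham"]
y_pred = ["spam", "ham", "ham", "ham", "spam"]
print(confusion_matrix(y_true, y_pred, classes=["spam", "ham"]))
# -> [[0.5, 0.5], [0.333..., 0.666...]]
```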
When one of the classes is of special interest
The tuples belonging to the class of interest are termed positive tuples, and the other tuples
are termed negative tuples.
The positive tuples that are correctly labeled by the classifier as positive are called true positive
tuples, and the negative tuples that are correctly labeled by the classifier as negative are called
true negative tuples.
Similarly, the positive tuples that are incorrectly labeled by the classifier as negative are called
false negative tuples, and the negative tuples that are incorrectly labeled by the classifier as
positive are called false positive tuples.
Sensitivity
Sensitivity = \frac{\# \text{true positives}}{\# \text{positives}}
Specificity
Specificity = \frac{\# \text{true negatives}}{\# \text{negatives}}
Precision
Precision = \frac{\# \text{true positives}}{\# \text{true positives} + \# \text{false positives}}
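A short sketch of these three measures in Python, using hypothetical counts of true positives
(tp), true negatives (tn), false positives (fp), and false negatives (fn):

```python
# Hypothetical counts for the class of interest (the positive class).
tp, tn, fp, fn = 40, 45, 5, 10

sensitivity = tp / (tp + fn)  # true positives / all positive tuples
specificity = tn / (tn + fp)  # true negatives / all negative tuples
precision = tp / (tp + fp)    # true positives / all tuples predicted positive
print(sensitivity, specificity, precision)  # -> 0.8 0.9 0.888...
```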
Kappa Statistic
Cohen’s Kappa statistic is computed as
\kappa = \frac{P_o - P_e}{1 - P_e} = 1 - \frac{1 - P_o}{1 - P_e},
where P_o is the observed agreement between the classifier's predictions and the true labels
(its accuracy on the test set) and P_e is the agreement expected by chance.
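A minimal sketch of the computation in Python, using the same kind of hypothetical test-set
labels as above:

```python
from collections import Counter

# Hypothetical true and predicted labels for a small test set.
y_true = ["spam", "ham", "spam", "ham", "ham"]
y_pred = ["spam", "ham", "ham", "ham", "spam"]

n = len(y_true)
p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n  # observed agreement (accuracy)
true_counts, pred_counts = Counter(y_true), Counter(y_pred)
p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)  # chance agreement

kappa = (p_o - p_e) / (1 - p_e)
print(kappa)  # -> about 0.167; 1.0 would mean perfect chance-corrected agreement
```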