
Introduction to Artificial Intelligence
Lecture 07
Amna Iftikhar, Fall 2019
Today’s Agenda
• Supervised learning
– Naïve Bayes
• Accuracy of Classifiers
• Ensemble Methods



Example
• According to the American Lung Association, 7% of the population has lung
  disease. Of those having lung disease, 90% are smokers; of those not having
  lung disease, 25.3% are smokers.
• Determine the probability that a randomly selected smoker has lung disease.
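A minimal worked sketch of the Bayes computation for this example, in Python (the variable names are only illustrative):

    # P(disease | smoker) via Bayes' theorem
    p_disease = 0.07               # P(D): prior probability of lung disease
    p_smoker_given_d = 0.90        # P(S | D)
    p_smoker_given_not_d = 0.253   # P(S | not D)

    # Total probability of being a smoker
    p_smoker = p_smoker_given_d * p_disease + p_smoker_given_not_d * (1 - p_disease)

    # Bayes' theorem: P(D | S) = P(S | D) P(D) / P(S)
    p_d_given_smoker = p_smoker_given_d * p_disease / p_smoker
    print(round(p_d_given_smoker, 3))   # ~0.211, i.e. about a 21% chance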



Bayesian Classifiers
• Consider each attribute and class label as
random variables
• Given a record with attributes (A1, A2,…,An)
– Goal is to predict class C
– Specifically, we want to find the value of C that
maximizes P(C| A1, A2,…,An )
• Can we estimate P(C| A1, A2,…,An ) directly
from data?



Bayesian Classifiers
• Approach:
  – Compute the posterior probability P(C | A1, A2, …, An) for all values of C
    using Bayes' theorem:
      P(C | A1, A2, …, An) = P(A1, A2, …, An | C) P(C) / P(A1, A2, …, An)
  – Choose the value of C that maximizes P(C | A1, A2, …, An)
  – Since the denominator P(A1, A2, …, An) is the same for every class, this is
    equivalent to choosing the value of C that maximizes P(A1, A2, …, An | C) P(C)
Naïve Bayes
• Naïve Bayes classifiers assume that the effect of
an attribute value on a given class is
independent of the values of the other
attributes.
• This assumption is called class conditional
independence.
• It is made to simplify the computations involved
and, in this sense, is considered “naïve”.



Naïve Bayes Classifier
• Assume independence among attributes Ai when the class is given:
  – P(A1, A2, …, An | Cj) = P(A1 | Cj) P(A2 | Cj) … P(An | Cj)
  – Can estimate P(Ai | Cj) for all Ai and Cj from the training data
  – A new point is classified as Cj if P(Cj) ∏ P(Ai | Cj) is maximal
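A minimal sketch of such a classifier for categorical attributes, in Python (the toy attributes and records below are hypothetical, not taken from the slides):

    from collections import Counter, defaultdict

    def train_nb(records, labels):
        # Estimate P(Cj) and P(Ai = v | Cj) from categorical training data
        class_counts = Counter(labels)
        cond_counts = defaultdict(int)   # cond_counts[(attr_index, value, cls)]
        for rec, cls in zip(records, labels):
            for i, v in enumerate(rec):
                cond_counts[(i, v, cls)] += 1
        return class_counts, cond_counts, len(labels)

    def predict_nb(rec, class_counts, cond_counts, n):
        # Pick the class Cj that maximizes P(Cj) * prod_i P(Ai | Cj)
        best_cls, best_score = None, -1.0
        for cls, c_count in class_counts.items():
            score = c_count / n                                # P(Cj)
            for i, v in enumerate(rec):
                score *= cond_counts[(i, v, cls)] / c_count    # P(Ai = v | Cj)
            if score > best_score:
                best_cls, best_score = cls, score
        return best_cls

    # Hypothetical toy data: (gives_birth, can_fly) -> mammal or non-mammal
    X = [("yes", "no"), ("no", "yes"), ("no", "no"), ("yes", "no")]
    y = ["mammal", "non-mammal", "non-mammal", "mammal"]
    model = train_nb(X, y)
    print(predict_nb(("yes", "no"), *model))   # -> "mammal"

A practical implementation would add Laplace smoothing so that an unseen attribute value does not drive the whole product to zero.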



Naïve Bayes Classification: Mammals vs. Non-Mammals



Naïve Bayes Classification: Mammals vs. Non-Mammals



Example: Play Tennis



Characteristics of Naïve Bayes Classifier
• They are robust to isolated noise points because such points
are averaged out when estimating conditional probabilities
from data.
• Naïve Bayes classifiers can also handle missing values by
ignoring the example during model building and classification.
• They are robust to irrelevant attributes. If Xi is an irrelevant
attribute, then P(Xi | Y) becomes almost uniformly distributed.
• Correlated attributes can degrade the performance of naïve
Bayes classifiers because the conditional independence
assumption no longer holds for such attributes.



Evaluating the Accuracy of a Classifier

• Holdout, random subsampling, cross-validation, and the bootstrap are common
  techniques for assessing accuracy based on randomly sampled partitions of
  the given data.



Cross Validation
• Holdout (training-and-testing)
  – Use two independent data sets, e.g., a training set (2/3) and a test set (1/3)
  – Suited to data sets with a large number of examples
  – Stratification: guarantee that each class is properly represented in both
    the training and test sets
• Cross-validation
  – Divide the data set randomly into k subsamples
  – Each class is represented in approximately the same proportions as in the
    full data set
  – Use k−1 subsamples as training data and one subsample as test data;
    k-fold cross-validation suits data sets of moderate size (see the sketch below)
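A minimal sketch of k-fold cross-validation in Python (numpy only; the train_and_score callback is a placeholder for whatever model-fitting routine is used):

    import numpy as np

    def k_fold_accuracy(X, y, train_and_score, k=10, seed=0):
        # Shuffle the indices, split them into k folds, then train on k-1 folds
        # and test on the held-out fold, averaging the k accuracy estimates.
        rng = np.random.default_rng(seed)
        folds = np.array_split(rng.permutation(len(y)), k)
        scores = []
        for i in range(k):
            test_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            scores.append(train_and_score(X[train_idx], y[train_idx],
                                          X[test_idx], y[test_idx]))
        return float(np.mean(scores))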
Repeated Cross validations
• A single 10-fold cross-validation might not be enough to get a
reliable error estimate.
• Different 10-fold cross-validation experiments with the same
learning method and dataset often produce different results,
because of the effect of random variation in choosing the folds
themselves.
• Stratification reduces the variation, but it certainly does not
eliminate it entirely.
• When seeking an accurate error estimate, it is standard
procedure to repeat the cross-validation process 10 times—that
is, 10 times 10-fold cross-validation—and average the results.
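A sketch of 10-times-10-fold (repeated, stratified) cross-validation using scikit-learn, assuming that library is available; the classifier and synthetic data are only illustrative:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
    from sklearn.naive_bayes import GaussianNB

    X, y = make_classification(n_samples=200, n_features=5, random_state=0)
    clf = GaussianNB()

    # 10 repetitions of stratified 10-fold cross-validation (100 scores in total)
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
    scores = cross_val_score(clf, X, y, cv=cv)
    print(scores.mean(), scores.std())   # averaged estimate and its variation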



Metrics for Performance Evaluation

• Focus on the predictive capability of a model
  – Rather than how fast it classifies or builds models, scalability, etc.
• Confusion Matrix:

                            PREDICTED CLASS
                            Class=Yes    Class=No
  ACTUAL      Class=Yes     a (TP)       b (FN)
  CLASS       Class=No      c (FP)       d (TN)

  a: TP (true positive), b: FN (false negative),
  c: FP (false positive), d: TN (true negative)


Metrics for Performance Evaluation…

                            PREDICTED CLASS
                            Class=Yes    Class=No
  ACTUAL      Class=Yes     a (TP)       b (FN)
  CLASS       Class=No      c (FP)       d (TN)

• Most widely-used metric:

  Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
Limitation of Accuracy
• Consider a 2-class problem
– Number of Class 0 examples = 9990
– Number of Class 1 examples = 10

• If model predicts everything to be class 0,


accuracy is 9990/10000 = 99.9 %
– Accuracy is misleading because model does not
detect any class 1 example



Cost Matrix
                            PREDICTED CLASS
  C(i|j)                    Class=Yes      Class=No
  ACTUAL      Class=Yes     C(Yes|Yes)     C(No|Yes)
  CLASS       Class=No      C(Yes|No)      C(No|No)

C(i|j): cost of misclassifying a class j example as class i


Cost Matrix (Cont’d)
Matrix 1:
                    PREDICTED CLASS
                    True    False
  ACTUAL  True       10       5
  CLASS   False       1      14

Matrix 2:
                    PREDICTED CLASS
                    True    False
  ACTUAL  True       10       3
  CLASS   False       3      14

Matrix 3:
                    PREDICTED CLASS
                    True    False
  ACTUAL  True       10       6
  CLASS   False       0      14

All three confusion matrices have the same accuracy, i.e., 24/30.
What if the cost of misclassification is not the same for both types of errors?



Cost Matrix (Cont’d)
Suppose the cost of misclassifying True as False is 5, while the cost of
misclassifying False as True is 1. Weighting each error by its cost, the
cost-weighted accuracy values for the three matrices above become:

  Matrix 1: 24 / (10 + 14 + 5×5 + 1×1) = 24/50
  Matrix 2: 24 / (10 + 14 + 3×5 + 3×1) = 24/42
  Matrix 3: 24 / (10 + 14 + 6×5 + 0×1) = 24/54

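A minimal sketch of this cost-weighted accuracy in Python (the function name is illustrative; the counts and costs come from the slide):

    def weighted_accuracy(tp, fn, fp, tn, cost_fn=5, cost_fp=1):
        # Correct predictions divided by the cost-weighted total
        correct = tp + tn
        weighted_total = tp + tn + cost_fn * fn + cost_fp * fp
        return correct / weighted_total

    # The three confusion matrices above, with FN cost 5 and FP cost 1
    print(weighted_accuracy(10, 5, 1, 14))   # 24/50 = 0.48
    print(weighted_accuracy(10, 3, 3, 14))   # 24/42 ≈ 0.571
    print(weighted_accuracy(10, 6, 0, 14))   # 24/54 ≈ 0.444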


Cost Matrix (Cont’d)
Now suppose the cost of misclassifying True as False is 4, while the cost of
misclassifying False as True is 1. The cost-weighted accuracy values become:

  Matrix 1: 24 / (10 + 14 + 5×4 + 1×1) = 24/45
  Matrix 2: 24 / (10 + 14 + 3×4 + 3×1) = 24/39
  Matrix 3: 24 / (10 + 14 + 6×4 + 0×1) = 24/48



Cost-Sensitive Measures
                            PREDICTED CLASS
                            Class=Yes    Class=No
  ACTUAL      Class=Yes     a (TP)       b (FN)
  CLASS       Class=No      c (FP)       d (TN)

  Precision (p) = a / (a + c)
  Recall (r)    = a / (a + b)
  F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)

• Precision is biased towards C(Yes|Yes) & C(Yes|No)
• Recall is biased towards C(Yes|Yes) & C(No|Yes)
• F-measure is biased towards all except C(No|No)

  Weighted Accuracy = (w1·a + w4·d) / (w1·a + w2·b + w3·c + w4·d)



Example
• Compare one system that:
– locates 100 documents, 40 of which are relevant,
• With:
– Another that locates 400 documents, 80 of which are relevant.
• Which is better?
• The answer should now be obvious:
  – It depends on the relative cost of false positives, i.e., documents that
    are returned but are not relevant,
  – And false negatives, i.e., documents that are relevant but are not
    returned.



Recall – Precision for this example



Recall and Precision
  Actual    Prediction
    T           T
    T           F
    F           T
    F           F
    F           T
    T           T
    T           T
    T           F
    F           T
    T           T

  Precision (p) = a / (a + c)
  Recall (r)    = a / (a + b)
  F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)

                            PREDICTED CLASS
                            Class=Yes    Class=No
  ACTUAL      Class=Yes     a (TP)       b (FN)
  CLASS       Class=No      c (FP)       d (TN)
Recall and Precision
• From the table above: a (TP) = 4, b (FN) = 2, c (FP) = 3, d (TN) = 1
• Precision = 4 / 7



Recall and Precision
• Recall = 4 / 6
• Precision = 4 / 7
• F-Measure = 8 / 13
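The same numbers can be checked with a short Python sketch over the actual/prediction pairs from the slide:

    pairs = [("T", "T"), ("T", "F"), ("F", "T"), ("F", "F"), ("F", "T"),
             ("T", "T"), ("T", "T"), ("T", "F"), ("F", "T"), ("T", "T")]

    tp = sum(1 for actual, pred in pairs if actual == "T" and pred == "T")   # a = 4
    fp = sum(1 for actual, pred in pairs if actual == "F" and pred == "T")   # c = 3
    fn = sum(1 for actual, pred in pairs if actual == "T" and pred == "F")   # b = 2

    precision = tp / (tp + fp)                                   # 4/7
    recall = tp / (tp + fn)                                      # 4/6
    f_measure = 2 * precision * recall / (precision + recall)    # 8/13
    print(precision, recall, f_measure)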



Ensemble Method
• An ensemble is a machine-learning approach in which multiple models are
  trained, often with the same learning algorithm, and their predictions are
  combined. Ensembles belong to a broader family of methods, called
  multi-classifiers, in which a set of (possibly hundreds or thousands of)
  learners with a common objective is fused together to solve the problem.
• The main causes of error in learning are noise, bias, and variance.
  Ensembles help to minimize these factors; they are designed to improve the
  stability and accuracy of machine-learning algorithms.



Bagging and Boosting
• Bagging and boosting are examples of ensemble methods, or methods
that use a combination of models.
• Each combines a series of k learned models (classifiers or predictors), M1,
M2, …, Mk, with the aim of creating an improved composite model, M*.
• Both bagging and boosting can be used for classification as well as
prediction



Boosting
• In boosting, weights are assigned to each
training tuple.
• A series of k classifiers is iteratively learned.
• After a classifier Mi is learned, the weights are
updated to allow the subsequent
classifier, Mi+1, to “pay more attention” to the
training tuples that were misclassified by Mi.
• The final boosted classifier, M, combines the
votes of each individual classifier, where the
weight of each classifier’s vote is a function of
its accuracy.
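A boosting sketch using scikit-learn's AdaBoost (one common boosting implementation; assumes scikit-learn is installed and uses a synthetic dataset for illustration):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier

    X, y = make_classification(n_samples=300, n_features=8, random_state=0)

    # Each round reweights the training tuples so the next weak learner (by
    # default a shallow decision tree) pays more attention to the tuples the
    # previous one misclassified; each learner's vote is weighted by its accuracy.
    boosted = AdaBoostClassifier(n_estimators=50, random_state=0)
    boosted.fit(X, y)
    print(boosted.score(X, y))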
Bagging
• When a single decision tree is fit while holding out a test set, the results
  can vary dramatically from one sample to another.
• This high variance is undesirable, so we instead work with new datasets that
  are subsets of the original (bootstrap samples).
• In bagged trees, we resample observations from the dataset with replacement
  and fit a tree to each sample.
• All features are considered when fitting each resampled tree, and this
  process is repeated n times; the trees' predictions are then combined
  (a minimal sketch follows).
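A bagging sketch in Python (assumes scikit-learn for the base decision trees; the data and parameters are illustrative):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=300, n_features=8, random_state=0)
    rng = np.random.default_rng(0)

    trees = []
    for _ in range(25):
        # Bootstrap sample: draw observations with replacement, same size as original
        idx = rng.integers(0, len(y), size=len(y))
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

    # Combine the bagged trees by majority vote (labels here are 0/1)
    votes = np.array([t.predict(X) for t in trees])
    bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)
    print((bagged_pred == y).mean())   # accuracy on the training data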



Random Forest
• One issue not considered in this bagging process is how similar the
trees tend to be.
• Consider one strong predictor in the data set. All the bagged trees will
tend to make the same cuts because they all share the same features.
• This makes all these trees look very similar hence increasing
correlation.
• To reduce tree correlation, a random forest randomly chooses only m of the
  predictors as candidates when performing each split.
• Now the bagged trees all have different randomly selected features to
perform cuts on.
• Therefore, the feature space is split on different predictors,
decorrelating all the trees.
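A random forest sketch using scikit-learn (assumes that library; max_features controls how many randomly chosen predictors are considered at each split):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=300, n_features=8, random_state=0)

    # max_features="sqrt" considers roughly sqrt(p) randomly chosen predictors
    # per split, which decorrelates the bagged trees
    forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                    random_state=0)
    forest.fit(X, y)
    print(forest.score(X, y))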



Bagging or Boosting?
• Depends on the data, the simulation and the circumstances.
• If the problem is that the single model gets a very low
performance, Bagging will rarely get a better bias. However,
Boosting could generate a combined model with lower
errors as it optimises the advantages and reduces pitfalls of
the single model.
• By contrast, if the difficulty of the single model is over-
fitting, then Bagging is the best option. Boosting for its part
doesn’t help to avoid over-fitting; in fact, this technique is
faced with this problem itself. For this reason, Bagging is
effective more often than Boosting.

