
Machine Learning

Lecture 9 - Performance Evaluation


Evaluation
• Evaluation = Process of judging the merit or worth of
something
• Evaluation is key to building effective and efficient
Machine Learning systems
• Usually carried out in controlled (offline) experiments
• Online testing can also be done
Why System Evaluation?
• There are many models, algorithms, and systems; which one is the best?
• What is the best component for:
• Similarity function (cosine, correlation, …)
• Term selection (stopword removal, stemming, …)
• Term weighting (TF, TF-IDF, …)
• How far down the list will a user need to look to find
some/all relevant documents in text retrieval?
Regression / classification models

• Predictive modeling / Supervised learning


• A model is a specification of mathematical/probabilistic
relationships that exist between different variables
• The goal is usually to use existing data to develop models that
we can use to predict outcomes for new data, such as
• Predicting whether an email message is spam or not
• Predicting whether a credit card transaction is fraudulent
• Predicting which advertisement a shopper is most likely to click on
• Predicting which football team is going to win the Super Bowl
(Nominal targets: categorical, with no particular order)

• Predicting stock price of a given company
• Predicting number of buyers of a certain product
• Predicting user ratings of a new movie
• Predicting the grade of a disease
(Continuous / ordinal targets)
Performance evaluation

• How predictive is the model we learned?


• For regression, usually R² or MSE
• For classification, many options (discussed later in this lecture)
• Accuracy can be used, with caution
• Performance on the training data (data used to build models) is not a
good indicator of performance on future data
• Q: Why?
• A: Because new data will probably not be exactly the same as the training
data!
Overfitting vs underfitting
• Overfitting – fitting the training data too precisely - usually
leads to poor results on new data
• Underfitting – model does not fit training data well

Underfitting: high bias, low variance. Remedy: increase the number of features or the complexity of the model.
Overfitting: low bias, high variance. Remedy: get more training data, or reduce the number of features or the complexity of the model.
Bias

• Definition of Bias
• Inability of the model to predict accurately.
• Difference or error between predicted values and actual values.
• Bias as Systematic Error
• Caused by wrong assumptions in the machine learning process.
• Mathematical Representation (see the formulation below)
• Measurement of Model Fit
• Indicates how well the model aligns with the data.
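A common mathematical representation of bias, given here as a standard formulation rather than the slide's own equation, is:

\text{Bias}[\hat{f}(x)] = \mathbb{E}[\hat{f}(x)] - f(x)

where f is the true target function and \hat{f} is the learned model.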
Bias Levels

• Low Bias
• Fewer assumptions in target function construction.
• Model closely matches the training dataset.
• High Bias
• More assumptions lead to poor model fit.
• Results in underfitting and high error rates.
• Example of High Bias
• Linear regression model for non-linear data.
Reducing High Bias in Machine Learning

• Use a More Complex Model
• Increase hidden layers in deep learning models.
• Employ models like polynomial regression, CNNs, or RNNs.
• Increase Number of Features
• Adding features enhances model complexity.
• Improves the ability to capture underlying data patterns.
• Reduce Regularization
• Relax L1 or L2 regularization.
• Helps improve model performance when bias is high.
• Increase Size of Training Data
• More examples provide better learning opportunities.
• Can significantly reduce bias.
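A minimal sketch, assuming scikit-learn, of the first two remedies (a more complex model with more features): a plain linear model underfits a non-linear target, while a degree-2 polynomial model fits it well. The synthetic data here is purely illustrative.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic non-linear data: y ~ x^2 plus noise
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.3, size=200)

linear = LinearRegression().fit(X, y)                 # high bias: underfits
poly = make_pipeline(PolynomialFeatures(degree=2),    # more complex model, more features
                     LinearRegression()).fit(X, y)

print("linear R^2:", round(linear.score(X, y), 3))
print("poly   R^2:", round(poly.score(X, y), 3))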
Variance

• Definition of Variance
• Measure of spread in data from its mean.
• Indicates how the model's performance varies with different training data subsets.
• Sensitivity to Training Data
• Variance reflects how much the model adjusts to new subsets of data.
• Mathematical Representation (see the formulation below)
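As with bias, a common mathematical representation of variance, given here as a standard formulation rather than the slide's own equation, is:

\text{Var}[\hat{f}(x)] = \mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]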
Types of Variance Errors

• Low Variance
• Model is less sensitive to training data changes.
• Produces consistent estimates across subsets.
• Often accompanies underfitting (poor generalization).
• High Variance
• Model is very sensitive to training data changes.
• Significant changes in estimates based on subsets.
• Results in overfitting (good performance on training data but poor on unseen data).
Ways to Reduce Variance
• Cross-Validation
• Splitting data into training/testing sets to identify overfitting/underfitting.
• Feature Selection
• Choosing only relevant features to decrease model complexity.
• Regularization
• Applying L1 or L2 regularization to reduce variance in models.
• Ensemble Methods
• Combining multiple models (e.g. bagging, boosting) for better generalization.
• Simplifying the Model
• Reducing complexity, such as decreasing parameters or layers in neural networks.
• Early Stopping
• Halting training when performance on the validation set stops improving, to prevent overfitting.
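A minimal sketch, assuming scikit-learn, of two of these remedies together: L2 regularization (Ridge) evaluated with cross-validation on a small, wide dataset where ordinary least squares tends to overfit. The data and the alpha value are illustrative.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Few samples, many features: a setting prone to high variance
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 30))
y = X[:, 0] + rng.normal(scale=0.5, size=60)

for name, model in [("OLS", LinearRegression()), ("Ridge", Ridge(alpha=10.0))]:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(name, "mean cross-validated R^2:", round(scores.mean(), 3))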
Evaluation on “LARGE” data

• If many (thousands of) examples are available, how can we evaluate our model?
• A simple evaluation is sufficient:
• Randomly split the data into training and test sets (e.g. 2/3 for training, 1/3 for testing)
• For classification, make sure the training and test sets have a similar distribution of class labels
• Build a model using the training set and evaluate it using the test set.
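A minimal sketch of this split-and-evaluate procedure, assuming scikit-learn; the dataset and classifier are placeholders chosen only for illustration.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# 2/3 train, 1/3 test; stratify keeps class proportions similar in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=42)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("test accuracy:", round(model.score(X_test, y_test), 3))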
Model Evaluation Step 1:
Split data into train and test sets
[Diagram: historical data with known results ("the past") is split into a training set and a testing set.]
Model Evaluation Step 2:
Build a model on a training set
[Diagram: a model builder learns a model from the training set; the testing set is held aside.]
Model Evaluation Step 3:
Evaluate on test set
[Diagram: the learned model makes predictions for the testing set, and the predictions are compared against the known results to evaluate the model.]
A note on parameter tuning

• It is important that the test data is not used in any way to build the
model
• Some learning schemes operate in two stages:
• Stage 1: builds the basic structure
• Stage 2: optimizes parameter settings
• The test data can’t be used for parameter tuning!
• Proper procedure uses three sets: training data, validation data, and
test data
• Validation data is used to optimize parameters
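A minimal sketch of the three-set procedure, assuming scikit-learn; the dataset, the classifier, and the max_depth parameter being tuned are illustrative choices, not something prescribed by the slides.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Hold out a test set first; it is used exactly once, at the very end
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
# Split the rest into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

best_depth, best_acc = None, 0.0
for depth in (2, 4, 8, None):                     # tune the parameter on the validation set
    acc = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(
        X_train, y_train).score(X_val, y_val)
    if acc > best_acc:
        best_depth, best_acc = depth, acc

final = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_train, y_train)
print("chosen max_depth:", best_depth, "| test accuracy:", round(final.score(X_test, y_test), 3))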
Evaluation on “small” data, 1

• The holdout method reserves a certain amount for testing and uses
the remainder for training
• Usually: one third for testing, the rest for training
• For “unbalanced” datasets, samples might not be representative
• Few or none instances of some classes
• Stratified sample: advanced version of balancing the data
• Make sure that each class is represented with approximately equal
proportions in both subsets
Evaluation on “small” data, 2

• What if we have a small data set?


• The chosen 2/3 for training may not be representative.
• The chosen 1/3 for testing may not be representative.
Cross-validation

• Cross-validation is more useful for small datasets
• First step: the data is split into k subsets of equal size
• Second step: each subset in turn is used for testing and the remainder for training
• This is called k-fold cross-validation
• For classification, the subsets are often stratified before the cross-validation is performed
• The error estimates are averaged to yield an overall error estimate
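A minimal sketch of stratified k-fold cross-validation, assuming scikit-learn; the dataset and classifier are placeholders.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in skf.split(X, y):
    # Each fold is used once for testing; the remaining folds form the training set
    model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

print("mean accuracy over 10 folds:", round(float(np.mean(scores)), 3))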
Cross-validation example:

— Break up the data into groups (folds) of the same size

— Hold aside one group for testing and use the rest to build the model

— Repeat, so that each group is used once for testing
More on cross-validation

• Standard method for evaluation: stratified ten-fold cross-validation


• Why ten? Extensive experiments have shown that this is the best
choice to get an accurate estimate
• Stratification reduces the estimate’s variance
• Even better: repeated stratified cross-validation
• E.g. ten-fold cross-validation is repeated ten times and results are averaged
(reduces the variance)
The bootstrap
• CV uses sampling without replacement
• The same instance, once selected, can not be selected again for a
particular training/test set
• The bootstrap uses sampling with replacement to form the training set
• Sample a dataset of n instances n times with replacement to form a new dataset of n instances
• Use this data as the training set
• Use the instances from the original dataset that don't occur in the new training set for testing
The bootstrap
• The bootstrap approach allows us to use a computer to mimic the
process of obtaining new data sets, so that we can estimate the
variability of our estimate without generating additional samples.
• Rather than repeatedly obtaining independent data sets from the
population, we instead obtain distinct data sets by repeatedly
sampling observations from the original data set with
replacement.
• Each of these “bootstrap data sets” is created by sampling with
replacement, and is the same size as our original dataset. As a
result some observations may appear more than once in a given
bootstrap data set and some not at all.
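A minimal sketch, using only NumPy, of a single bootstrap training/test split as described above; the tiny dataset is a stand-in for real instances.

import numpy as np

rng = np.random.default_rng(0)
n = 20
data = np.arange(n)                      # stand-in for n instances (their indices)

boot_idx = rng.integers(0, n, size=n)    # sample n times with replacement
train = data[boot_idx]                   # bootstrap training set (may repeat instances)
test = np.setdiff1d(data, boot_idx)      # instances never drawn form the test set

print("training instances (with repeats):", np.sort(train))
print("held-out test instances:", test)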
Precision and Recall
[Diagram: the space of all documents, partitioned into relevant vs. not relevant and retrieved vs. not retrieved; the intersection is "relevant and retrieved".]

Recall = (number of relevant documents retrieved) / (total number of relevant documents)

Precision = (number of relevant documents retrieved) / (total number of documents retrieved)
Confusion Matrix
A confusion matrix is a table that is often used to describe the
performance of a classification model (or "classifier") on a set of test
data
• true positives (TP): These are cases in which we predicted positive
(they have the disease), and they do have the disease.
• true negatives (TN): We predicted negative, and they don't have the
disease.
• false positives (FP): We predicted positive, but they don't actually
have the disease. (Also known as a "Type I error.")
• false negatives (FN): We predicted negative, but they actually do have
the disease. (Also known as a "Type II error.")
                      Actual: Positive   Actual: Negative
Predicted: Positive      TP (430)           FP (200)
Predicted: Negative      FN (70)            TN (300)
Precision and Recall in Text Retrieval
• Precision
• The ability to retrieve top-ranked documents that are
mostly relevant.
• Precision P = tp/(tp + fp)
• Recall
• The ability of the search to find all of the relevant items
in the corpus.
• Recall R = tp/(tp + fn)
               Relevant   Nonrelevant
Retrieved         tp          fp
Not Retrieved     fn          tn
Precision/Recall : Example

Precision = |Relevant ∩ Retrieved| / |Retrieved|
Recall = |Relevant ∩ Retrieved| / |Relevant|

Example: a ranking retrieves 3 documents, 2 of which are relevant, out of 6 relevant documents in total:
Recall = 2/6 = 0.33
Precision = 2/3 = 0.67
Precision/Recall : Example

Example: the ranking retrieves 6 documents, 5 of which are relevant, out of 6 relevant documents in total:
Recall = 5/6 = 0.83
Precision = 5/6 = 0.83
Accuracy
• Overall, how often is the classifier correct?
• Number of correct predictions / Total number of predictions
• Accuracy = (tp + tn) / (tp + fp + fn + tn)

                      Actual: Positive   Actual: Negative
Predicted: Positive         1                  1
Predicted: Negative         8                 90

• Accuracy = (1 + 90) / (1 + 1 + 8 + 90) = 0.91
• 91 correct predictions out of 100 total examples
• Precision = 1/2 and Recall = 1/9
• Accuracy alone doesn't tell the full story when you're working with a class-imbalanced data set
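A quick check of the numbers above, in plain Python:

tp, fp, fn, tn = 1, 1, 8, 90

accuracy = (tp + tn) / (tp + fp + fn + tn)   # 0.91
precision = tp / (tp + fp)                   # 0.5
recall = tp / (tp + fn)                      # 1/9, about 0.11

print(accuracy, precision, recall)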
Home Activity

Accuracy of a retrieval model is defined by:
Accuracy = (tp + tn) / (tp + fp + fn + tn)

               Relevant   Nonrelevant
Retrieved       tp = ?      fp = ?
Not Retrieved   fn = ?      tn = ?

Calculate tp, fp, fn, tn and the accuracy for ranking algorithm #1 and #2 at the highlighted location in the ranking.
F Measure (F1/Harmonic Mean)

• One measure of performance that takes into account both recall and precision.
• Harmonic mean of recall and precision:

F = 2 · Recall · Precision / (Recall + Precision)

• Why harmonic mean?
• The harmonic mean emphasizes the importance of small values, whereas the arithmetic mean is affected more by outliers that are unusually large.
• Retrieval data are extremely skewed; over 99% of documents are non-relevant. This is why accuracy is not an appropriate measure.
• Compared to the arithmetic mean, both values need to be high for the harmonic mean to be high.
F Measure (F1/Harmonic Mean) : example

Recall = 2/6 = 0.33


Precision = 2/3 = 0.67
F = 2*Recall*Precision/(Recall + Precision)
= 2*0.33*0.67/(0.33 + 0.67) = 0.44
F Measure (F1/Harmonic Mean) : example

Recall = 5/6 = 0.83


Precision = 5/6 = 0.83
F = 2*Recall*Precision/(Recall + Precision)
= 2*0.83*0.83/(0.83 + 0.83) = 0.83
Mean Average Precision (MAP)

• Average Precision: the average of the precision values at the points at which each relevant document is retrieved.
• Ex1: (1 + 1 + 0.75 + 0.667 + 0.38 + 0)/6 = 0.633
• Ex2: (1 + 0.667 + 0.6 + 0.5 + 0.556 + 0.429)/6 = 0.625
• Average the precision values from the rank positions where a relevant document was retrieved.
• Use a precision value of zero for relevant documents that are not retrieved.
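A minimal sketch, in plain Python, of average precision for one ranked list; the ranking at the end is a hypothetical example, not the slide's Ex1 or Ex2.

def average_precision(ranking, total_relevant):
    # ranking: 1 if the document at that rank is relevant, else 0
    hits, precisions = 0, []
    for rank, relevant in enumerate(ranking, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)        # precision at this rank
    # relevant documents that were never retrieved contribute precision 0
    precisions += [0.0] * (total_relevant - hits)
    return sum(precisions) / total_relevant

# Hypothetical ranking: relevant documents at ranks 1, 2 and 4; 6 relevant overall
print(average_precision([1, 1, 0, 1, 0, 0], total_relevant=6))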
Average Precision: Example

[Worked examples from the slides: average precision computed for rankings that retrieve all relevant documents, that miss one relevant document, and that miss two relevant documents; each missed relevant document contributes a precision of zero and lowers the average.]
Mean Average Precision (MAP)

• Summarize rankings from multiple queries by averaging


average precision
• Most commonly used measure in research papers
• Assumes user is interested in finding many relevant documents
for each query
• Requires many relevance judgments in text collection
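The MAP formula itself is not reproduced in this text; the standard definition (assumed here) averages the per-query average precision over the set of queries Q:

\text{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \text{AP}(q)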
Recall-Precision Graph

• The Recall-Precision Graph is created using the standard Recall


values from the Recall Level and Precision Averages.
• Typically these graphs slope downward from left to right,
enforcing the notion that as more relevant documents are
retrieved (recall increases), the more nonrelevant documents are
retrieved (precision decreases).
• This graph is the most commonly used method for comparing
systems. The plots of different runs can be superimposed on the
same graph to determine which run is superior.
• Curves closest to the upper right-hand corner of the graph
(where recall and precision are maximized) indicate the best
performance

Recall-Precision Graph

[Graph: a recall-precision graph from a ranked list; some recall levels have multiple precision values.]
Interpolation

• Defines precision at any recall level as the maximum precision observed in any recall-precision point at a higher recall level
• Produces a step function
• Defines precision at recall levels 0.0, 0.1, …, 1.0
Interpolation

(The original slides fill in this table one recall level at a time; the completed table is:)

Recall                  0.0   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0
Interpolated Precision  1.0   1.0   1.0   0.67  0.67  0.5   0.5   0.5   0.5   0.5   0.5
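A minimal sketch, in plain Python, of this interpolation rule; the measured recall-precision points below are hypothetical values chosen so that they reproduce the table above.

def interpolate(points, levels=(0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)):
    # points: (recall, precision) pairs measured where relevant documents occur
    # interpolated precision at level r = max precision at any recall >= r
    return {r: max((p for rec, p in points if rec >= r), default=0.0) for r in levels}

# Hypothetical measured points (5 relevant documents, found at ranks 1, 3, 6, 8, 10)
points = [(0.2, 1.0), (0.4, 0.67), (0.6, 0.5), (0.8, 0.5), (1.0, 0.5)]
print(interpolate(points))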
Recap: Confusion matrix

• The confusion matrix (easily generalize to multi-class)

                      Actual: Positive   Actual: Negative
Predicted: Positive         tp                 fp
Predicted: Negative         fn                 tn

• Machine Learning methods usually minimize FP+FN


• TPR (True Positive Rate): TP / (TP + FN) = Recall
• FPR (False Positive Rate): FP / (TN + FP)
ROC Curves

• A receiver operating characteristic curve, i.e. ROC curve, is a


graphical plot that illustrates the diagnostic ability of a binary
classifier system as its discrimination threshold is varied.
• The diagnostic performance of a test, or the accuracy of a test to
discriminate diseased cases from normal cases is evaluated using
Receiver Operating Characteristic (ROC) curve analysis
• A ROC Curve is a way to compare diagnostic tests. It is a plot of the
true positive rate against the false positive rate.
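A minimal sketch, assuming scikit-learn, of computing the ROC points (TPR vs. FPR at each threshold) and the AUC from classifier scores; the dataset and classifier are placeholders.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]           # predicted probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)   # one (FPR, TPR) point per threshold
print("AUC:", round(roc_auc_score(y_test, scores), 3))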
ROC Curves

• The ideal situation: the model has an ideal measure of separability and is perfectly able to distinguish between the positive class and the negative class (AUC close to 1).
• The worst situation: when AUC is approximately 0.5, the model has no discrimination capacity to distinguish between the positive class and the negative class; its predictions are effectively random.
Multiple ROC Curves

• Comparison of multiple classifiers is usually straightforward, especially when no curves cross each other. Curves close to the perfect ROC curve have a better performance level than the ones closest to the baseline (the diagonal).
PR Curves Vs ROC Curves

• Remember, a ROC curve represents a relation between sensitivity (recall) and the false positive rate (not precision).
• A ROC curve plots the true positive rate vs. the false positive rate, whereas a PR curve plots precision vs. recall.
• If your question is "How well can this classifier be expected to perform in general?", go with a ROC curve.
• If true negatives are not very valuable to the problem, or negative examples are abundant, then a PR curve is typically more appropriate.
• For example, if the class is highly imbalanced and positive samples are very rare, use a PR curve.
• It answers the question "How meaningful is a positive result from my classifier?"
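A companion sketch, again assuming scikit-learn, of the precision-recall view on a synthetic, highly imbalanced dataset; the data and classifier are made up for illustration.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 5% positive examples
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_test, scores)
print("average precision:", round(average_precision_score(y_test, scores), 3))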
Cost-Sensitive Learning

• Learning to minimize the expected cost of misclassifications


• Most classification learning algorithms attempt to minimize the
expected number of misclassification errors
• In many applications, different kinds of classification errors have
different costs, so we need cost-sensitive methods
Examples of Applications with Unequal
Misclassification Costs
• Medical Diagnosis:
• Cost of false positive error: Unnecessary treatment;
unnecessary worry
• Cost of false negative error: Postponed treatment or
failure to treat; death or injury
• Fraud Detection:
• False positive: resources wasted investigating non-fraud
• False negative: failure to detect fraud could be very
expensive
Cost Matrix
Model 1: Confusion matrix

            Predicted: P   Predicted: N
Actual: P       150            40 (FN)
Actual: N        60 (FP)      250

Accuracy: 80%
Cost: 150×(-1) + 40×100 + 60×1 = 3910

Model 2: Confusion matrix

            Predicted: P   Predicted: N
Actual: P       250            45
Actual: N         5           200

Accuracy: 90%
Cost: 250×(-1) + 45×100 + 5×1 = 4255

Cost matrix

            Predicted: P   Predicted: N
Actual: P       -1            100
Actual: N        1              0

• If we are focusing on accuracy, we will go with Model 2 (and compromise on cost); however, if we are focusing on cost, we will go with Model 1 (and compromise on accuracy).
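A quick NumPy check of the cost figures above; the matrices are copied from the slide, with rows = actual class (P, N) and columns = predicted class (P, N).

import numpy as np

model1 = np.array([[150, 40],     # actual P: TP, FN
                   [ 60, 250]])   # actual N: FP, TN
model2 = np.array([[250, 45],
                   [  5, 200]])
cost   = np.array([[ -1, 100],    # cost of TP, cost of FN
                   [  1,   0]])   # cost of FP, cost of TN

# Total cost = elementwise product of counts and costs, summed
print("Model 1 cost:", int((model1 * cost).sum()))   # 3910
print("Model 2 cost:", int((model2 * cost).sum()))   # 4255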
Significance Testing

• Also called “hypothesis testing”


• Objective: to test a claim about parameter μ
• Procedure:
  A. State hypotheses H0 and Ha
  B. Calculate the test statistic
  C. Convert the test statistic to a P-value and interpret it
  D. Consider the significance level (optional)
Hypotheses
• H0 (null hypothesis) claims “no difference”
• Ha (alternative hypothesis) contradicts the null
• Example: We test whether a population gained weight
on average…
H0: no average weight gain in population
Ha: H0 is wrong (i.e., “weight gain”)
• Next: collect data, then quantify the extent to which the data provides evidence against H0
Significance Tests

• Given the results from a number of queries, how can


we conclude that ranking algorithm B is better than
algorithm A?
• A significance test
• null hypothesis: no difference between A and B
• alternative hypothesis: B is better than A
• the power of a test is the probability that the test will reject
the null hypothesis correctly
t-test

• The t-test (also called Student's t-test) compares two averages (means) and tells you if they are different from each other.
• The t-test also tells you how significant the differences are; in other words, it lets you know if those differences could have happened by chance.
t-test

• What are T-Values and P-values?


• How big is “big enough”? Every t-value has a p-value to
go with it.
• A p-value is the probability that the results from your
sample data occurred by chance.
• P-values are from 0% to 100%. They are usually written as a
decimal. For example, a p value of 5% is 0.05.
• Low p-values are good; they indicate your data did not occur by chance.
• For example, a p-value of 0.01 means there is only a 1% probability that the results from an experiment happened by chance. In most cases, a threshold of 0.05 (5%) is used, i.e. 95% confidence that the results did not happen by chance.
Example Experimental Results

Significance level: α = 0.05; we test the probability that B = A (the null hypothesis).

[Table of per-query effectiveness results for ranking algorithms A and B not reproduced.]
Example Experimental Results

p-value = 0.03 < 0.05
The probability that B = A is 0.03.

Average effectiveness: 41.1 (A) vs. 62.5 (B)

Reject the null hypothesis → B is better than A

Significance level: α = 0.05
The p-value is less than the alpha level (p < 0.05), so we reject the null hypothesis and can be 95% confident that there is a significant difference between the means.
T-test Python
# Welch's two-sample t-test: compares the means of two independent samples
import scipy.stats as stats

sample1 = [25, 43, 39, 75, 43, 15, 20, 52, 49, 50]
sample2 = [35, 84, 15, 75, 68, 85, 80, 50, 58, 75]

# equal_var=False gives Welch's t-test (does not assume equal variances)
t_stat, p_val = stats.ttest_ind(sample1, sample2, equal_var=False)
print("t statistic:", t_stat)
print("p-value:", p_val)
