
Machine Learning

Lecture 9 - Performance Evaluation


Evaluation
• Evaluation = Process of judging the merit or worth of
something
• Evaluation is key to building effective and efficient
Machine Learning systems
• Usually carried out in controlled (offline) experiments
• Online testing can also be done
Why System Evaluation?
• There are many models, algorithms, and systems; which one is the best?
• What is the best component for:
• Similarity function (cosine, correlation, …)
• Term selection (stopword removal, stemming, …)
• Term weighting (TF, TF-IDF, …)
• How far down the list will a user need to look to find
some/all relevant documents in text retrieval?
Regression / classification models

• Predictive modeling / Supervised learning


• A model is a specification of mathematical/probabilistic
relationships that exist between different variables
• The goal is usually to use existing data to develop models that
we can use to predict outcomes for new data, such as
• Predicting whether an email message is spam or not
• Predicting whether a credit card transaction is fraudulent
• Predicting which advertisement a shopper is most likely to click on
• Predicting which football team is going to win the Super Bowl
(Nominal targets: categorical, with no particular order)

• Predicting stock price of a given company
• Predicting number of buyers of a certain product
• Predicting user ratings of a new movie
• Predicting the grade of a disease
(Continuous / ordinal targets)
Performance evaluation

• How predictive is the model we learned?


• For regression, usually R² or MSE
• For classification, many options (discussed later in this lecture)
• Accuracy can be used, with caution
• Performance on the training data (data used to build models) is not a
good indicator of performance on future data
• Q: Why?
• A: Because new data will probably not be exactly the same as the training
data!
Overfitting vs underfitting
• Overfitting – fitting the training data too precisely - usually
leads to poor results on new data
• Underfitting – model does not fit training data well

Underfitting: high bias, low variance. Remedy: increase the number of features or the complexity of the model.
Overfitting: low bias, high variance. Remedy: get more training data, or reduce the number of features or the complexity of the model.
Bias

• Definition of Bias
• Inability of the model to predict accurately.
• Difference or error between predicted values and actual values.
• Bias as Systematic Error
• Caused by wrong assumptions in the machine learning process.
• Mathematical Representation (see the formulation below)
• Measurement of Model Fit
• Indicates how well the model aligns with the data.
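A common mathematical representation of bias, given here as a standard formulation rather than the slide's own equation, is:

\text{Bias}[\hat{f}(x)] = \mathbb{E}[\hat{f}(x)] - f(x)

where f is the true target function and \hat{f} is the learned model.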
Bias Levels

• Low Bias
• Fewer assumptions in target function construction.
• Model closely matches the training dataset.
• High Bias
• More assumptions lead to poor model fit.
• Results in underfitting and high error rates.
• Example of High Bias
• Linear regression model for non-linear data.
Reducing High Bias in Machine Learning

• Use a More Complex Model
• Increase hidden layers in deep learning models.
• Employ models like polynomial regression, CNNs, or RNNs.
• Increase Number of Features
• Adding features enhances model complexity.
• Improves the ability to capture underlying data patterns.
• Reduce Regularization
• Relax L1 or L2 regularization.
• Helps improve model performance when bias is high.
• Increase Size of Training Data
• More examples provide better learning opportunities.
• Can significantly reduce bias.
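A minimal sketch, assuming scikit-learn, of the first two remedies (a more complex model with more features): a plain linear model underfits a non-linear target, while a degree-2 polynomial model fits it well. The synthetic data here is purely illustrative.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic non-linear data: y ~ x^2 plus noise
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.3, size=200)

linear = LinearRegression().fit(X, y)                 # high bias: underfits
poly = make_pipeline(PolynomialFeatures(degree=2),    # more complex model, more features
                     LinearRegression()).fit(X, y)

print("linear R^2:", round(linear.score(X, y), 3))
print("poly   R^2:", round(poly.score(X, y), 3))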
Variance

• Definition of Variance
• Measure of spread in data from its mean.
• Indicates how the model's performance varies with different training data subsets.
• Sensitivity to Training Data
• Variance reflects how much the model adjusts to new subsets of data.
• Mathematical Representation (see the formulation below)
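As with bias, a common mathematical representation of variance, given here as a standard formulation rather than the slide's own equation, is:

\text{Var}[\hat{f}(x)] = \mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]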
Types of Variance Errors

• Low Variance
• Model is less sensitive to training data changes.
• Produces consistent estimates across subsets.
• Often accompanies underfitting (poor generalization).
• High Variance
• Model is very sensitive to training data changes.
• Significant changes in estimates based on subsets.
• Results in overfitting (good performance on training data but poor on unseen data).
Ways to Reduce Variance
• Cross-Validation
• Splitting data into training/testing sets to identify overfitting/underfitting.
• Feature Selection
• Choosing only relevant features to decrease model complexity.
• Regularization
• Applying L1 or L2 regularization to reduce variance in models.
• Ensemble Methods
• Combining multiple models (e.g. bagging, boosting) for better generalization.
• Simplifying the Model
• Reducing complexity, such as decreasing parameters or layers in neural networks.
• Early Stopping
• Halting training when performance on the validation set stops improving, to prevent overfitting.
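A minimal sketch, assuming scikit-learn, of two of these remedies together: L2 regularization (Ridge) evaluated with cross-validation on a small, wide dataset where ordinary least squares tends to overfit. The data and the alpha value are illustrative.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Few samples, many features: a setting prone to high variance
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 30))
y = X[:, 0] + rng.normal(scale=0.5, size=60)

for name, model in [("OLS", LinearRegression()), ("Ridge", Ridge(alpha=10.0))]:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(name, "mean cross-validated R^2:", round(scores.mean(), 3))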
Evaluation on “LARGE” data

• If many (thousands of) examples are available, how can we evaluate our model?
• A simple evaluation is sufficient:
• Randomly split the data into training and test sets (e.g. 2/3 for training, 1/3 for testing)
• For classification, make sure the training and test sets have a similar distribution of class labels
• Build a model using the training set and evaluate it using the test set.
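A minimal sketch of this split-and-evaluate procedure, assuming scikit-learn; the dataset and classifier are placeholders chosen only for illustration.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# 2/3 train, 1/3 test; stratify keeps class proportions similar in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=42)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("test accuracy:", round(model.score(X_test, y_test), 3))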
Model Evaluation Step 1:
Split data into train and test sets
[Diagram: historical data with known results ("the past") is split into a training set and a testing set.]
Model Evaluation Step 2:
Build a model on a training set
[Diagram: a model builder learns a model from the training set; the testing set is held aside.]
Model Evaluation Step 3:
Evaluate on test set
[Diagram: the learned model makes predictions for the testing set, and the predictions are compared against the known results to evaluate the model.]
A note on parameter tuning

• It is important that the test data is not used in any way to build the
model
• Some learning schemes operate in two stages:
• Stage 1: builds the basic structure
• Stage 2: optimizes parameter settings
• The test data can’t be used for parameter tuning!
• Proper procedure uses three sets: training data, validation data, and
test data
• Validation data is used to optimize parameters
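A minimal sketch of the three-set procedure, assuming scikit-learn; the dataset, the classifier, and the max_depth parameter being tuned are illustrative choices, not something prescribed by the slides.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Hold out a test set first; it is used exactly once, at the very end
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
# Split the rest into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

best_depth, best_acc = None, 0.0
for depth in (2, 4, 8, None):                     # tune the parameter on the validation set
    acc = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(
        X_train, y_train).score(X_val, y_val)
    if acc > best_acc:
        best_depth, best_acc = depth, acc

final = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_train, y_train)
print("chosen max_depth:", best_depth, "| test accuracy:", round(final.score(X_test, y_test), 3))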
Evaluation on “small” data, 1

• The holdout method reserves a certain amount for testing and uses
the remainder for training
• Usually: one third for testing, the rest for training
• For “unbalanced” datasets, samples might not be representative
• Few or none instances of some classes
• Stratified sample: advanced version of balancing the data
• Make sure that each class is represented with approximately equal
proportions in both subsets
Evaluation on “small” data, 2

• What if we have a small data set?


• The chosen 2/3 for training may not be representative.
• The chosen 1/3 for testing may not be representative.
Cross-validation

• Cross-validation is more useful for small datasets
• First step: the data is split into k subsets of equal size
• Second step: each subset in turn is used for testing and the remainder for training
• This is called k-fold cross-validation
• For classification, the subsets are often stratified before the cross-validation is performed
• The error estimates are averaged to yield an overall error estimate
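A minimal sketch of stratified k-fold cross-validation, assuming scikit-learn; the dataset and classifier are placeholders.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in skf.split(X, y):
    # Each fold is used once for testing; the remaining folds form the training set
    model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

print("mean accuracy over 10 folds:", round(float(np.mean(scores)), 3))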
Cross-validation example:

— Break up the data into groups (folds) of the same size

— Hold aside one group for testing and use the rest to build the model

— Repeat, so that each group is used once for testing
More on cross-validation

• Standard method for evaluation: stratified ten-fold cross-validation


• Why ten? Extensive experiments have shown that this is the best
choice to get an accurate estimate
• Stratification reduces the estimate’s variance
• Even better: repeated stratified cross-validation
• E.g. ten-fold cross-validation is repeated ten times and results are averaged
(reduces the variance)
The bootstrap
• CV uses sampling without replacement
• The same instance, once selected, can not be selected again for a
particular training/test set
• The bootstrap uses sampling with replacement to form the training set
• Sample a dataset of n instances n times with replacement to form a new dataset of n instances
• Use this data as the training set
• Use the instances from the original dataset that don't occur in the new training set for testing
The bootstrap
• The bootstrap approach allows us to use a computer to mimic the
process of obtaining new data sets, so that we can estimate the
variability of our estimate without generating additional samples.
• Rather than repeatedly obtaining independent data sets from the
population, we instead obtain distinct data sets by repeatedly
sampling observations from the original data set with
replacement.
• Each of these “bootstrap data sets” is created by sampling with
replacement, and is the same size as our original dataset. As a
result some observations may appear more than once in a given
bootstrap data set and some not at all.
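A minimal sketch, using only NumPy, of a single bootstrap training/test split as described above; the tiny dataset is a stand-in for real instances.

import numpy as np

rng = np.random.default_rng(0)
n = 20
data = np.arange(n)                      # stand-in for n instances (their indices)

boot_idx = rng.integers(0, n, size=n)    # sample n times with replacement
train = data[boot_idx]                   # bootstrap training set (may repeat instances)
test = np.setdiff1d(data, boot_idx)      # instances never drawn form the test set

print("training instances (with repeats):", np.sort(train))
print("held-out test instances:", test)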
Precision and Recall
[Diagram: the space of all documents, partitioned into relevant vs. not relevant and retrieved vs. not retrieved; the intersection is "relevant and retrieved".]

Recall = (number of relevant documents retrieved) / (total number of relevant documents)

Precision = (number of relevant documents retrieved) / (total number of documents retrieved)
Confusion Matrix
A confusion matrix is a table that is often used to describe the
performance of a classification model (or "classifier") on a set of test
data
• true positives (TP): These are cases in which we predicted positive
(they have the disease), and they do have the disease.
• true negatives (TN): We predicted negative, and they don't have the
disease.
• false positives (FP): We predicted positive, but they don't actually
have the disease. (Also known as a "Type I error.")
• false negatives (FN): We predicted negative, but they actually do have
the disease. (Also known as a "Type II error.")
                      Actual: Positive   Actual: Negative
Predicted: Positive      TP (430)           FP (200)
Predicted: Negative      FN (70)            TN (300)
Precision and Recall in Text Retrieval
• Precision
• The ability to retrieve top-ranked documents that are
mostly relevant.
• Precision P = tp/(tp + fp)
• Recall
• The ability of the search to find all of the relevant items
in the corpus.
• Recall R = tp/(tp + fn)
               Relevant   Nonrelevant
Retrieved         tp          fp
Not Retrieved     fn          tn
Precision/Recall : Example

Precision = |Relevant ∩ Retrieved| / |Retrieved|
Recall = |Relevant ∩ Retrieved| / |Relevant|

Example: a ranking retrieves 3 documents, 2 of which are relevant, out of 6 relevant documents in total:
Recall = 2/6 = 0.33
Precision = 2/3 = 0.67
Precision/Recall : Example

Example: the ranking retrieves 6 documents, 5 of which are relevant, out of 6 relevant documents in total:
Recall = 5/6 = 0.83
Precision = 5/6 = 0.83
Accuracy
• Overall, how often is the classifier correct?
• Number of correct predictions / Total number of predictions
• Accuracy = (tp + tn) / (tp + fp + fn + tn)

                      Actual: Positive   Actual: Negative
Predicted: Positive         1                  1
Predicted: Negative         8                 90

• Accuracy = (1 + 90) / (1 + 1 + 8 + 90) = 0.91
• 91 correct predictions out of 100 total examples
• Precision = 1/2 and Recall = 1/9
• Accuracy alone doesn't tell the full story when you're working with a class-imbalanced data set
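A quick check of the numbers above, in plain Python:

tp, fp, fn, tn = 1, 1, 8, 90

accuracy = (tp + tn) / (tp + fp + fn + tn)   # 0.91
precision = tp / (tp + fp)                   # 0.5
recall = tp / (tp + fn)                      # 1/9, about 0.11

print(accuracy, precision, recall)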
Home Activity

Accuracy of a retrieval model is defined by:
Accuracy = (tp + tn) / (tp + fp + fn + tn)

               Relevant   Nonrelevant
Retrieved       tp = ?      fp = ?
Not Retrieved   fn = ?      tn = ?

Calculate tp, fp, fn, tn and the accuracy for ranking algorithm #1 and #2 at the highlighted location in the ranking.
F Measure (F1/Harmonic Mean)

• One measure of performance that takes into account both recall and precision.
• Harmonic mean of recall and precision:

F = 2 · Recall · Precision / (Recall + Precision)

• Why harmonic mean?
• The harmonic mean emphasizes the importance of small values, whereas the arithmetic mean is affected more by outliers that are unusually large.
• Retrieval data are extremely skewed; over 99% of documents are non-relevant. This is why accuracy is not an appropriate measure.
• Compared to the arithmetic mean, both values need to be high for the harmonic mean to be high.
F Measure (F1/Harmonic Mean) : example

Recall = 2/6 = 0.33


Precision = 2/3 = 0.67
F = 2*Recall*Precision/(Recall + Precision)
= 2*0.33*0.67/(0.33 + 0.67) = 0.44
F Measure (F1/Harmonic Mean) : example

Recall = 5/6 = 0.83


Precision = 5/6 = 0.83
F = 2*Recall*Precision/(Recall + Precision)
= 2*0.83*0.83/(0.83 + 0.83) = 0.83
Mean Average Precision (MAP)

• Average Precision: the average of the precision values at the points at which each relevant document is retrieved.
• Ex1: (1 + 1 + 0.75 + 0.667 + 0.38 + 0)/6 = 0.633
• Ex2: (1 + 0.667 + 0.6 + 0.5 + 0.556 + 0.429)/6 = 0.625
• Average the precision values from the rank positions where a relevant document was retrieved.
• Use a precision value of zero for relevant documents that are not retrieved.
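A minimal sketch, in plain Python, of average precision for one ranked list; the ranking at the end is a hypothetical example, not the slide's Ex1 or Ex2.

def average_precision(ranking, total_relevant):
    # ranking: 1 if the document at that rank is relevant, else 0
    hits, precisions = 0, []
    for rank, relevant in enumerate(ranking, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)        # precision at this rank
    # relevant documents that were never retrieved contribute precision 0
    precisions += [0.0] * (total_relevant - hits)
    return sum(precisions) / total_relevant

# Hypothetical ranking: relevant documents at ranks 1, 2 and 4; 6 relevant overall
print(average_precision([1, 1, 0, 1, 0, 0], total_relevant=6))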
Average Precision: Example

[Worked examples from the slides: average precision computed for rankings that retrieve all relevant documents, that miss one relevant document, and that miss two relevant documents; each missed relevant document contributes a precision of zero and lowers the average.]
Mean Average Precision (MAP)

• Summarize rankings from multiple queries by averaging


average precision
• Most commonly used measure in research papers
• Assumes user is interested in finding many relevant documents
for each query
• Requires many relevance judgments in text collection
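The MAP formula itself is not reproduced in this text; the standard definition (assumed here) averages the per-query average precision over the set of queries Q:

\text{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \text{AP}(q)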
Recall-Precision Graph

• The Recall-Precision Graph is created using the standard Recall


values from the Recall Level and Precision Averages.
• Typically these graphs slope downward from left to right,
enforcing the notion that as more relevant documents are
retrieved (recall increases), the more nonrelevant documents are
retrieved (precision decreases).
• This graph is the most commonly used method for comparing
systems. The plots of different runs can be superimposed on the
same graph to determine which run is superior.
• Curves closest to the upper right-hand corner of the graph
(where recall and precision are maximized) indicate the best
performance

Recall-Precision Graph

[Graph: a recall-precision graph from a ranked list; some recall levels have multiple precision values.]
Interpolation

• Defines precision at any recall level as the maximum precision observed in any recall-precision point at a higher recall level
• Produces a step function
• Defines precision at recall levels 0.0, 0.1, …, 1.0
Interpolation

(The original slides fill in this table one recall level at a time; the completed table is:)

Recall                  0.0   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0
Interpolated Precision  1.0   1.0   1.0   0.67  0.67  0.5   0.5   0.5   0.5   0.5   0.5
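A minimal sketch, in plain Python, of this interpolation rule; the measured recall-precision points below are hypothetical values chosen so that they reproduce the table above.

def interpolate(points, levels=(0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)):
    # points: (recall, precision) pairs measured where relevant documents occur
    # interpolated precision at level r = max precision at any recall >= r
    return {r: max((p for rec, p in points if rec >= r), default=0.0) for r in levels}

# Hypothetical measured points (5 relevant documents, found at ranks 1, 3, 6, 8, 10)
points = [(0.2, 1.0), (0.4, 0.67), (0.6, 0.5), (0.8, 0.5), (1.0, 0.5)]
print(interpolate(points))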
Recap: Confusion matrix

• The confusion matrix (easily generalize to multi-class)

                      Actual: Positive   Actual: Negative
Predicted: Positive         tp                 fp
Predicted: Negative         fn                 tn

• Machine Learning methods usually minimize FP+FN


• TPR (True Positive Rate): TP / (TP + FN) = Recall
• FPR (False Positive Rate): FP / (TN + FP)
ROC Curves

• A receiver operating characteristic curve, i.e. ROC curve, is a


graphical plot that illustrates the diagnostic ability of a binary
classifier system as its discrimination threshold is varied.
• The diagnostic performance of a test, or the accuracy of a test to
discriminate diseased cases from normal cases is evaluated using
Receiver Operating Characteristic (ROC) curve analysis
• A ROC Curve is a way to compare diagnostic tests. It is a plot of the
true positive rate against the false positive rate.
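A minimal sketch, assuming scikit-learn, of computing the ROC points (TPR vs. FPR at each threshold) and the AUC from classifier scores; the dataset and classifier are placeholders.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]           # predicted probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)   # one (FPR, TPR) point per threshold
print("AUC:", round(roc_auc_score(y_test, scores), 3))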
ROC Curves

• The ideal situation: the model has an ideal measure of separability and is perfectly able to distinguish between the positive class and the negative class (AUC close to 1).
• The worst situation: when AUC is approximately 0.5, the model has no discrimination capacity to distinguish between the positive class and the negative class; its predictions are effectively random.
Multiple ROC Curves

• Comparison of multiple classifiers is usually straightforward, especially when no curves cross each other. Curves close to the perfect ROC curve have a better performance level than the ones closest to the baseline (the diagonal).
PR Curves Vs ROC Curves

• Remember, a ROC curve represents a relation between sensitivity (recall) and the false positive rate (not precision).
• A ROC curve plots the true positive rate vs. the false positive rate, whereas a PR curve plots precision vs. recall.
• If your question is "How well can this classifier be expected to perform in general?", go with a ROC curve.
• If true negatives are not very valuable to the problem, or negative examples are abundant, then a PR curve is typically more appropriate.
• For example, if the class is highly imbalanced and positive samples are very rare, use a PR curve.
• It answers the question "How meaningful is a positive result from my classifier?"
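A companion sketch, again assuming scikit-learn, of the precision-recall view on a synthetic, highly imbalanced dataset; the data and classifier are made up for illustration.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 5% positive examples
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_test, scores)
print("average precision:", round(average_precision_score(y_test, scores), 3))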
Cost-Sensitive Learning

• Learning to minimize the expected cost of misclassifications


• Most classification learning algorithms attempt to minimize the
expected number of misclassification errors
• In many applications, different kinds of classification errors have
different costs, so we need cost-sensitive methods
Examples of Applications with Unequal
Misclassification Costs
• Medical Diagnosis:
• Cost of false positive error: Unnecessary treatment;
unnecessary worry
• Cost of false negative error: Postponed treatment or
failure to treat; death or injury
• Fraud Detection:
• False positive: resources wasted investigating non-fraud
• False negative: failure to detect fraud could be very
expensive
Cost Matrix
Model 1: Confusion matrix

            Predicted: P   Predicted: N
Actual: P       150            40 (FN)
Actual: N        60 (FP)      250

Accuracy: 80%
Cost: 150×(-1) + 40×100 + 60×1 = 3910

Model 2: Confusion matrix

            Predicted: P   Predicted: N
Actual: P       250            45
Actual: N         5           200

Accuracy: 90%
Cost: 250×(-1) + 45×100 + 5×1 = 4255

Cost matrix

            Predicted: P   Predicted: N
Actual: P       -1            100
Actual: N        1              0

• If we are focusing on accuracy, we will go with Model 2 (and compromise on cost); however, if we are focusing on cost, we will go with Model 1 (and compromise on accuracy).
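A quick NumPy check of the cost figures above; the matrices are copied from the slide, with rows = actual class (P, N) and columns = predicted class (P, N).

import numpy as np

model1 = np.array([[150, 40],     # actual P: TP, FN
                   [ 60, 250]])   # actual N: FP, TN
model2 = np.array([[250, 45],
                   [  5, 200]])
cost   = np.array([[ -1, 100],    # cost of TP, cost of FN
                   [  1,   0]])   # cost of FP, cost of TN

# Total cost = elementwise product of counts and costs, summed
print("Model 1 cost:", int((model1 * cost).sum()))   # 3910
print("Model 2 cost:", int((model2 * cost).sum()))   # 4255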
Significance Testing

• Also called “hypothesis testing”


• Objective: to test a claim about parameter μ
• Procedure:
  A. State hypotheses H0 and Ha
  B. Calculate the test statistic
  C. Convert the test statistic to a P-value and interpret it
  D. Consider the significance level (optional)
Hypotheses
• H0 (null hypothesis) claims “no difference”
• Ha (alternative hypothesis) contradicts the null
• Example: We test whether a population gained weight
on average…
H0: no average weight gain in population
Ha: H0 is wrong (i.e., “weight gain”)
• Next: collect data, then quantify the extent to which the data provides evidence against H0
Significance Tests

• Given the results from a number of queries, how can


we conclude that ranking algorithm B is better than
algorithm A?
• A significance test
• null hypothesis: no difference between A and B
• alternative hypothesis: B is better than A
• the power of a test is the probability that the test will reject
the null hypothesis correctly
t-test

• The t-test (also called Student's t-test) compares two averages (means) and tells you if they are different from each other.
• The t-test also tells you how significant the differences are; in other words, it lets you know if those differences could have happened by chance.
t-test

• What are T-Values and P-values?


• How big is “big enough”? Every t-value has a p-value to
go with it.
• A p-value is the probability that the results from your
sample data occurred by chance.
• P-values are from 0% to 100%. They are usually written as a
decimal. For example, a p value of 5% is 0.05.
• Low p-values are good; they indicate your data did not occur by chance.
• For example, a p-value of 0.01 means there is only a 1% probability that the results from an experiment happened by chance. In most cases, a threshold of 0.05 (5%) is used, i.e. 95% confidence that the results did not happen by chance.
Example Experimental Results

Significance level: α = 0.05; we test the probability that B = A (the null hypothesis).

[Table of per-query effectiveness results for ranking algorithms A and B not reproduced.]
Example Experimental Results

p-value = 0.03 < 0.05
The probability that B = A is 0.03.

Average effectiveness: 41.1 (A) vs. 62.5 (B)

Reject the null hypothesis → B is better than A

Significance level: α = 0.05
The p-value is less than the alpha level (p < 0.05), so we reject the null hypothesis and can be 95% confident that there is a significant difference between the means.
T-test Python
# Welch's two-sample t-test: compares the means of two independent samples
import scipy.stats as stats

sample1 = [25, 43, 39, 75, 43, 15, 20, 52, 49, 50]
sample2 = [35, 84, 15, 75, 68, 85, 80, 50, 58, 75]

# equal_var=False gives Welch's t-test (does not assume equal variances)
t_stat, p_val = stats.ttest_ind(sample1, sample2, equal_var=False)
print("t statistic:", t_stat)
print("p-value:", p_val)
