6
Model Evaluation
Learning Objectives
After studying this chapter, students will be able to:
• Understand the role of Evaluation
• Know more about the Train Test Split method
• Understand Accuracy and Error for evaluating AI models
• Understand Evaluation metrics for Classification
• Know more about the ethical concerns around model evaluation
EVALUATION
Once a model has been made and trained, it needs to go through proper
testing so that one can calculate the efficiency and performance of the
model. Evaluation is the process of understanding the reliability of any
AI model, based on outputs by feeding the test dataset into the model
and comparing it with actual answers.
There can be different evaluation techniques, depending on the type and purpose of the model. Remember that it is not recommended to use the data we used to build the model to evaluate it. This is because our model will simply remember the whole training set and will therefore always predict the correct label for any point in the training set. This is known as overfitting.
Hence, the model is tested with the help of Testing Data (which was separated out of the acquired dataset at the Data Acquisition stage), and the efficiency of the model is calculated on the basis of parameters such as Accuracy, Precision, Recall and F1 Score.
NEED FOR MODEL EVALUATION
Model evaluation is like a quality check for your AI creations. Just as you wouldn't ship a product without testing it, you wouldn't deploy an AI model without evaluating its performance. Evaluation parameters let us check the following:
1. Understanding Model Behavior:
• Accuracy: How often is the model correct in its predictions?
• Bias: Does the model favor certain groups or outcomes unfairly?
• Overfitting/Underfitting: Is the model too specific to the training data (overfitting) or too simple to capture the underlying patterns (underfitting)?
2. Building Trust and Reliability:
• Real-world Applications: Accurate models are essential for tasks like medical diagnosis, self-driving
cars, and financial predictions.
• Ethical Considerations: Biased models can have serious consequences, so evaluation helps ensure
fairness and responsible AI development.
3. Improving Model Performance:
• Identifying Weaknesses: Evaluation highlights areas where the model struggles, guiding improve-
ments in the training data or model architecture.
• Comparing Models: By evaluating multiple models, you can choose the one that performs best for a
specific task.
TRAIN-TEST SPLIT
Train-test split is a machine learning technique that divides a dataset into two subsets: a training set and a testing set. The training set is used to train the machine learning model, while the testing set is used to evaluate its performance. This crucial technique helps us build models that are not only good at learning from the data they are trained on but also perform well on new, unseen data, which ensures that they are reliable enough for real-world applications.
Imagine you’re learning to play cricket. You practice a lot (that’s your training), and then you play a match
(that’s your test). The train-test split in machine learning is very similar!
Details of Train-Test split
• It’s like dividing your cricket practice. You have a set of data (like your cricket skills).
• You split it into two parts:
  – Training data: This is like your practice sessions. You use this data to teach your machine learning model how to play (make predictions).
  – Testing data: This is like your actual match. You use this data to see how well your model actually performs. It's data the model hasn't seen before.
Why is it Important?
• To avoid overfitting: Imagine you practice only one type of bowling (like spin). You might become
very good at it, but you’ll struggle against fast bowlers in a real match. Similarly, a model trained on
the same data might become too specific and not perform well on new, unseen data.
• To get a realistic estimate of performance: By testing on unseen data, we get a better idea of how
well our model will perform in the real world.
Example: Predicting Weather
Let’s say you want to build a model to predict if it will rain tomorrow.
1. Gather data: You collect data on temperature, humidity, wind speed, etc., for the past few years, along
with whether it rained the next day.
2. Split the data:
• Training data (70%): You use 70% of the data to train your model. The model learns to identify
patterns between the weather conditions and whether it rained.
• Testing data (30%): You use the remaining 30% of the data to test the model. The model predicts
if it will rain based on the weather conditions in this data, and you compare its predictions to the
actual outcomes.
3. Evaluate: You compare the model’s predictions to the actual outcomes in the testing data. This helps
you understand how accurate your model is.
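In Python, a split like this is usually done with scikit-learn. Below is a minimal sketch of the 70/30 split described above; the tiny weather dataset (feature values and rain labels) is invented purely for illustration.

# Each row is [temperature, humidity, wind_speed]; each label is
# 1 if it rained the next day, 0 otherwise. (Toy data for illustration.)
from sklearn.model_selection import train_test_split

X = [[30, 85, 10], [35, 40, 5], [28, 90, 12], [33, 55, 8], [27, 95, 15],
     [36, 35, 6], [29, 80, 11], [34, 50, 7], [26, 92, 14], [37, 30, 5]]
y = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Hold back 30% of the data for testing; random_state makes the
# split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

print(len(X_train), "training samples;", len(X_test), "testing samples")
# 7 training samples; 3 testing samples

The model is then trained only on X_train and y_train, and evaluated only on X_test and y_test.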
Limitations of Accuracy
• Class Imbalance: In datasets with imbalanced classes (e.g., 90% of samples belong to one class), high
accuracy can be misleading.
• Misleading Metrics: In some cases, accuracy might not be the most informative metric. For example,
in medical diagnosis, false negatives (failing to detect a disease) can have more severe consequences
than false positives.
CLASSIFICATION
Classification is a fundamental task in machine learning where the goal is to predict the class or category
of a given input data point. It’s a type of supervised learning, where we provide the algorithm with labeled
data (data where the correct class is already known) during training.
Examples:
• Spam detection: Classifying emails as spam or not spam.
• Image recognition: Identifying objects in images (e.g., cats, dogs, cars).
• Medical diagnosis: Predicting the presence or absence of a disease.
Types of Classification:
• Binary Classification: The simplest type, where there are only two possible classes (e.g., spam/not
spam, yes/no, positive/negative).
• Multi-class Classification: Involves predicting one of multiple classes (e.g., classifying images of dif-
ferent animals: cat, dog, bird).
• Multi-label Classification: Assigns multiple labels to a single data point (e.g., classifying a news arti-
cle as belonging to multiple topics: politics, sports, technology).
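The difference between these three types is easiest to see in the shape of the labels themselves. A small sketch (the label values are invented for illustration):

# Binary classification: every email gets exactly one of two labels.
email_labels = ["spam", "not spam", "spam", "not spam"]

# Multi-class classification: every image gets exactly one of many labels.
animal_labels = ["cat", "dog", "bird", "dog"]

# Multi-label classification: a single article can get several labels at once.
article_labels = [["politics", "technology"], ["sports"], ["politics", "sports"]]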
Evaluation metrics for Classification
Several new terms come into the picture when we evaluate a model. Let us explore them with the example of a forest fire scenario.
The Scenario
Imagine that you have come up with an AI-based prediction model which has been deployed in a forest that is prone to forest fires. The objective of the model is to predict whether a forest fire has broken out in the forest or not. To understand the efficiency of this model, we need to check whether the predictions it makes are correct. Thus, there exist two conditions that we need to consider: Prediction and Reality. The prediction is the output given by the machine, and the reality is the actual situation in the forest at the time the prediction is made. These two conditions can combine in four ways:
• True Positive (TP): the model predicted a fire and there really was a fire.
• True Negative (TN): the model predicted no fire and there was no fire.
• False Positive (FP): the model predicted a fire but there was no fire (a false alarm).
• False Negative (FN): the model predicted no fire but a fire had actually broken out (a miss).
These four combinations are recorded in a confusion matrix.
The confusion matrix provides a visual summary of the model’s performance. It helps identify specific types
of errors made by the model. This information can be used to improve the model’s accuracy and address
potential biases.
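As a sketch, the confusion matrix can be computed directly from two lists, one holding reality and one holding the model's predictions. The 0/1 values below are invented for illustration (1 = fire, 0 = no fire):

from sklearn.metrics import confusion_matrix

reality    = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]  # what actually happened
prediction = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]  # what the model predicted

# scikit-learn puts reality in the rows and prediction in the columns:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(reality, prediction))
# [[5 1]
#  [1 3]]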
Activity
Let us go back to the forest fire example. Assume that the model always predicts that there is no fire, but in reality there is a 2% chance of a forest fire breaking out. Out of 100 cases, the model will be right in the 98 cases where there is no fire, but in the 2 cases where a fire actually broke out, the model will still predict no fire.
Here,
True Positives (TP) = 0
True Negatives (TN) = 98
Total cases = 100
Accuracy = (TP + TN) / Total cases * 100% = (0 + 98) / 100 * 100% = 98%
This is a fairly high accuracy for an AI model. But this parameter is useless for us as the actual cases where
the fire broke out are not taken into account. Hence, there is a need to look at another parameter which
takes account of such cases as well.
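This trap is easy to reproduce in code. A minimal sketch, assuming 2 real fires out of 100 cases:

# 100 cases: 2 with a real fire (1), 98 without (0).
reality = [1] * 2 + [0] * 98
# The model always predicts "no fire".
prediction = [0] * 100

correct = sum(r == p for r, p in zip(reality, prediction))
print("Accuracy:", correct / len(reality) * 100, "%")
# Accuracy: 98.0 % -- yet both real fires were missed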
Precision parameter
Precision is defined as the percentage of true positive cases versus all the cases where the prediction is true.
That is, it takes into account the True Positives and False Positives.
Precision = (True Positives / All Predicted Positives) * 100%

Precision = TP / (TP + FP) * 100%
Going back to the forest fire example, assume now that the model always predicts that there is a forest fire, irrespective of the reality. In this case, all the positive predictions would be taken into account, that is, True Positives (Prediction = Yes and Reality = Yes) and False Positives (Prediction = Yes and Reality = No). The firefighters would have to check every time an alarm is raised to see whether it was true or false.
You might recall the story of the boy who cried wolf: he raised so many false alarms that when the wolves actually arrived, no one came to his rescue. Similarly, if the precision is low (which means there are more false alarms than actual fires), the firefighters would get complacent and might not go and check every time, assuming it could be a false alarm.
This makes precision an important evaluation criterion. If precision is high, most of the positive predictions are True Positives, which means fewer false alarms.
But again, is good Precision equivalent to a good model performance? Why?
Let us consider a model that has 100% precision, which means that whenever the machine says there is a fire, there actually is a fire (True Positive). Even in this model, there can be a rare case where there was an actual fire but the system could not detect it. This is a False Negative. The precision value is not affected by it, because precision does not take False Negatives into account.
Is precision then a good parameter for model performance?
Recall parameter
Another parameter for evaluating the model's performance is Recall. It can be defined as the fraction of positive cases that are correctly identified. It focuses on the cases where, in reality, there was a fire, whether the machine detected it or not. That is, it considers True Positives (there was a forest fire in reality and the model predicted a forest fire) and False Negatives (there was a forest fire and the model didn't predict it).
Recall = (True Positives / (True Positives + False Negatives)) * 100%

Recall = TP / (TP + FN) * 100%
Notice that the numerator in both Precision and Recall is the same: True Positives. In the denominator, however, Precision counts the False Positives while Recall counts the False Negatives.
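Both metrics can be computed with scikit-learn from the same reality/prediction lists used for the confusion matrix earlier (again, the labels are invented for illustration):

from sklearn.metrics import precision_score, recall_score

reality    = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
prediction = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]

# Here TP = 3, FP = 1, FN = 1 (see the confusion matrix above).
print("Precision:", precision_score(reality, prediction))  # 3 / (3 + 1) = 0.75
print("Recall:   ", recall_score(reality, prediction))     # 3 / (3 + 1) = 0.75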
Let us ponder... Which one do you think is better? Precision or Recall? Why?
Which Metric is Important?
Choosing between Precision and Recall depends on the condition in which the model has been deployed.
In a case like Forest Fire, a False Negative can cost us a lot and is risky too. Imagine no alert being given
even when there is a Forest Fire. The whole forest might burn down.
Another case where a False Negative can be dangerous is Viral Outbreak. Imagine a deadly virus has started
spreading and the model which is supposed to predict a viral outbreak does not detect it. The virus might
spread widely and infect a lot of people.
On the other hand, there can be cases in which the False Positive condition costs us more than False
Negatives. One such case is Mining. Imagine a model telling you that there exists treasure at a point and
you keep on digging there but it turns out that it is a false alarm. Here, False Positive cases (predicting
there is treasure but there is no treasure) can be very costly.
Similarly, let’s consider a model that predicts whether a mail is spam or not. If the model always predicts that
the mail is spam, people would not look at it and eventually might lose important information. Here also
False Positive condition (Predicting the mail as spam while the mail is not spam) would have a high cost.
Cases with high FN cost: Forest fire, Viral outbreak
Cases with high FP cost: Spam, Mining
Which one is more important: RECALL or PRECISION?
Activity: To demonstrate how to calculate and interpret accuracy, precision, recall, and F1-score in a
classification context.
Scenario: Predicting Loan Default
Imagine a bank is developing an AI model to predict whether a loan applicant will default (fail to repay the
loan).
Data:
1. Actual Class:
• 1: Applicant will default
• 0: Applicant will not default
2. Predicted Class:
• 1: Model predicts the applicant will default
• 0: Model predicts the applicant will not default
Let’s say we have the following test data:
Applicant ID Actual Class Predicted Class
1 1 1
2 0 0
3 1 0
4 0 1
5 1 1
6 0 0
7 1 1
8 0 0
9 1 0
10 0 1
Confusion Matrix: [[3, 2], [2, 3]]
• True Positives (TP): 3 (Correctly predicted to default)
• True Negatives (TN): 3 (Correctly predicted to not default)
• False Positives (FP): 2 (Incorrectly predicted to default)
• False Negatives (FN): 2 (Incorrectly predicted to not default)
Accuracy:
• Accuracy = (TP + TN) / (TP + FP + TN + FN)
• Accuracy = (3 + 3) / (3 + 2 + 2 + 3) = 0.6 or 60%
Precision:
• Precision = TP / (TP + FP)
• Precision = 3 / (3 + 2) = 0.6 or 60%
Recall:
• Recall = TP / (TP + FN)
• Recall = 3 / (3 + 2) = 0.6 or 60%
F1-Score:
• F1-Score = 2 * (Precision * Recall) / (Precision + Recall)
• F1-Score = 2 * (0.6 * 0.6) / (0.6 + 0.6) = 0.6
Interpretation:
• Accuracy: The model correctly predicted the default status of 60% of the applicants.
• Precision: When the model predicts that an applicant will default, it is correct 60% of the time.
• Recall: The model correctly identifies 60% of the actual defaulters.
• F1-Score: The model has a balanced performance between precision and recall.
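All four numbers can be verified in a few lines of Python, using the Actual and Predicted columns from the table above:

from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

# Actual and predicted classes for applicants 1-10, from the table.
actual    = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]

print(confusion_matrix(actual, predicted))
# [[3 2]
#  [2 3]]
print("Accuracy: ", accuracy_score(actual, predicted))   # 0.6
print("Precision:", precision_score(actual, predicted))  # 0.6
print("Recall:   ", recall_score(actual, predicted))     # 0.6
print("F1-Score: ", f1_score(actual, predicted))         # 0.6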
Data Characteristics
• Class Imbalance: If one class significantly outnumbers the others, accuracy can be misleading. Consider precision, recall, or F1-score instead.
Cost of Errors
• False Positives: What are the consequences of incorrectly predicting a positive outcome? (e.g., flag-
ging a legitimate email as spam, misdiagnosing a healthy patient)
• False Negatives: What are the consequences of incorrectly predicting a negative outcome? (e.g., fail-
ing to detect a disease, approving a risky loan)
Examples:
• Spam Detection: Precision is crucial to avoid annoying users with false alarms.
• Disease Diagnosis: Recall is critical to minimize the risk of missing actual cases.
• Fraud Detection: Both precision and recall are important to minimize both false alarms and missed
fraud cases.
RECAP
• Evaluation is the process of understanding the reliability of any AI model, based on outputs by feeding
the test dataset into the model and comparing it with actual answers.
• There can be different Evaluation techniques, depending on the type and purpose of the model.
• Train-test split is a machine learning technique that divides a dataset into two subsets: a training set and
a testing set.
• The training set is used to train the machine learning model, while the testing set is used to evaluate
its performance.
• Classification is a fundamental task in machine learning where the goal is to predict the class or category
of a given input data point.
• Evaluating AI models effectively and ethically is crucial to ensure their responsible development and
deployment.
• The efficiency of the model is calculated on the basis of the parameters like Accuracy, Precision, Recall
and F1 Score.
KEY TERMS
• Model Evaluation is the process of understanding the reliability of any AI model, based on outputs by
feeding the test dataset into the model and comparing it with actual answers.
• Accuracy is defined as the percentage of correct predictions out of all the observations.
• Precision is defined as the percentage of true positive cases versus all the cases where the prediction is
true.
• Recall can be defined as the fraction of positive cases that are correctly identified.
• F1 score can be defined as the measure of balance between precision and recall.
• Confusion matrix is used to record the result of comparison between the prediction and reality.
• Error, also known as error rate, represents the proportion of incorrect predictions made by the model.
• Classification is a fundamental task in machine learning where the goal is to predict the class or category
of a given input data point.
EXERCISES
A. Multiple choice questions.
1. ______________ is the process of understanding the reliability of any AI model, based on outputs by feeding
the test dataset into the model and comparing it with actual answers.
(a) Data Reliability (b) Data Feed (c) Model Evaluation (d) None of these
2. __________ is defined as the percentage of correct predictions out of all the observations.
(a) Precision (b) F1 Score (c) Accuracy (d) None of these
3. _________ is defined as the percentage of true positive cases versus all the cases where the prediction is true.
(a) Precision (b) Accuracy
(c) F1 Score (d) None of these
4. ___________ can be defined as the measure of balance between precision and recall.
(a) Precision (b) F1 Score (c) Accuracy (d) None of these
5. The result of comparison between the prediction and reality is recorded in the ____________.
(a) Confusion Matrix (b) F1 Score
(c) Evaluation Model (d) None of these
6. Raunak was learning the conditions that make up the confusion matrix. He came across a scenario in which the
machine that was supposed to predict an animal was always predicting not an animal. What is this condition
called?
(a) False Positive (b) True Positive
(c) False Negative (d) True Negative
7. Priya was confused with the terms used in the evaluation stage. Suggest her the term used for the percentage
of correct predictions out of all the observations.
(a) Accuracy (b) Precision (c) Recall (d) F1-Score
8. Prediction and Reality can be easily mapped together with the help of :
(a) Prediction (b) Reality (c) Accuracy (d) Confusion Matrix
9. What is the primary purpose of the train-test split in machine learning?
(a) To increase the size of the dataset. (b) To prevent overfitting of the model.
(c) To improve the speed of training. (d) To reduce the number of features.
10. Which of the following is NOT a typical ratio for splitting data into training and testing sets?
(a) 70/30 (b) 80/20 (c) 90/10 (d) 50/50
11. What does “overfitting” mean in the context of machine learning?
(a) The model performs well on the training data but poorly on unseen data.
(b) The model performs poorly on both training and testing data.
(c) The model performs well on both training and testing data.
(d) The model performs poorly on training data but well on unseen data.
12. In binary classification, how many possible class labels are there?
(a) One (b) Two (c) Three (d) Many
13. In a medical diagnosis scenario where missing a disease is critical, which metric is generally more important:
precision or recall?
(a) Precision (b) Recall (c) Accuracy (d) F1-score
14. Which metric is most suitable for evaluating a model that predicts rare events, such as fraud detection?
(a) Accuracy (b) Precision (c) Recall (d) F1-score
15. What is the main difference between binary and multi-class classification?
(a) The number of features in the input data. (b) The number of possible output classes.
(c) The type of algorithm used. (d) The presence of noise in the data.
1. (a) Ritwik has been given the prediction results generated by a model. His senior has asked him to understand these results and explain them to his counterparts. Suggest to him the technique through which he can understand the prediction results easily.
(b) Take a look at the confusion matrix given below:
Confusion Matrix    Reality: Yes            Reality: No
Prediction: Yes     True Positive (TP)      False Positive (FP)
Prediction: No      False Negative (FN)     True Negative (TN)
How do you calculate F1 score?
(c) What should be the value of F1 score if the model needs to have 100% accuracy?
2. (a) Mayank wants to measure the balance between precision and recall for evaluating a model. Which method should he use for measuring this balance? Also give its formula.
(b) Calculate Accuracy, Precision, Recall and F1 Score for the following Confusion Matrix on Heart Attack Risk.
Also suggest which metric would not be a good evaluation parameter here and why?
Confusion Matrix    Reality: 1    Reality: 0
Prediction: 1       50            20
Prediction: 0       10            20
3. Calculate Accuracy, Precision, Recall and F1 Score for the following Confusion Matrix on Water Shortage in School. Also suggest which metric would not be a good evaluation parameter here and why?
Confusion Matrix (Water Shortage in School)    Reality: 1    Reality: 0
Prediction: 1                                  75            5
Prediction: 0                                  5             15
4. Imagine that you have come up with an AI-based prediction model which has been deployed on the roads to check traffic jams. The objective of the model is to predict whether there will be a traffic jam or not. To understand the efficiency of this model, we need to check if the predictions which it makes are correct or not. Thus, there exist two conditions which we need to consider: Prediction and Reality.
Traffic Jams have become a common part of our lives nowadays. Living in an urban area means you have to
face traffic each and every time you get out on the road. Mostly, school students opt for buses to go to school. Many
times, the bus gets late due to such jams and the students are not able to reach their school on time.
Considering all the possible situations, make a Confusion Matrix for the above situation.
5. Frequent tsunamis are a rising problem across the world. Tsunamis can cause severe damage to the economy
and livelihood. An AI model has been created which can predict if there is a chance of tsunamis in an area. The
confusion matrix for this model is given below.
Predicted Positive 99 41
6. (a) Calculate Accuracy, Precision, Recall and F1 Score using the confusion matrix given below.
1. Consider a scenario in which people often face the problem of a sudden downpour. People wash clothes and put them out to dry but, due to unexpected rain, their work gets wasted. Thus, an AI model has been created which predicts if there will be rain or not. The confusion matrix for the same is:
The Confusion Matrix    Reality: 1    Reality: 0
Predicted: 1            5             0
Predicted: 0            45            50
2. Consider a scenario in which traffic jams have become a common part of our lives nowadays. Living in an urban
area means you have to face traffic each and every time you get out on the road. Mostly, school students opt
for buses to go to school. Many times, the bus gets late due to such jams and the students are not able to reach
their school on time. Thus, an AI model is created to predict explicitly if there would be a traffic jam on their
way to school or not. The confusion matrix for the same is:
The Confusion Matrix    Reality: 1    Reality: 0
Predicted: 1            50            50
Predicted: 0            0             0
3. In schools, a lot of times it happens that there is no water to drink. At a few places, cases of water shortage in
schools are very common and prominent. Hence, an AI model is designed to predict if there is going to be a
water shortage in the school in near future or not. The confusion matrix for the same is:
The Confusion Matrix    Reality: 1    Reality: 0
Predicted: 1            22            12
Predicted: 0            47            118
4. Nowadays, the problem of floods has worsened in some parts of the country. Not only does it damage the whole
place but it also forces the people to move out of their homes and relocate. To address this issue, an AI model
has been created which can predict if there is a chance of floods or not. The confusion matrix for the same is:
The Confusion Matrix    Reality: 1    Reality: 0
Predicted: 1            50            3
Predicted: 0            3             94
5. Watch the following video for understanding the confusion matrix with the help of an example and then discuss
it in the class.
https://www.youtube.com/watch?v=8Oog7TXHvFY
1. Is it healthy to have only a few big players (Facebook, Google, Microsoft, Amazon, IBM) governing AI development, or should we place some restrictions on this and encourage many more startups to work on AI?
2. Divide the class into two groups and debate the topic "The Importance of Stakeholder Involvement in AI Model Evaluation and Development".
http://wiki.pathmind.com/accuracy-precision-recall-f1
https://ai.plainenglish.io/what-is-accuracy-precision-recall-and-f1-score-what-is-its-significance-in-machine-learning-77d262952287
G. Experiential learning
https://youtu.be/jjsRC1Wv750