6
Model Evaluation
Learning Objectives
After studying this chapter, students will be able to:
• Understand the role of Evaluation
• Know more about the Train Test Split method
• Understand Accuracy and Error for evaluating AI models
• Understand Evaluation metrics for Classification
• Know more about the ethical concerns around model evaluation
EVALUATION
Once a model has been made and trained, it needs to go through proper
testing so that one can calculate the efficiency and performance of the
model. Evaluation is the process of understanding the reliability of any
AI model, based on outputs by feeding the test dataset into the model
and comparing it with actual answers.
There can be different evaluation techniques, depending on the type and purpose of the model. Remember that it is not recommended to use the data we used to build the model to evaluate it. This is because our model will simply remember the whole training set and will therefore always predict the correct label for any point in the training set. This is known as overfitting.
Hence, the model is tested with the help of Testing Data (which was separated out of the acquired dataset at the Data Acquisition stage), and the efficiency of the model is calculated on the basis of parameters such as Accuracy, Precision, Recall and F1 Score.
NEED FOR MODEL EVALUATION
Model evaluation is like a quality check for your AI creations. Just as you wouldn't ship a product without testing it, you wouldn't deploy an AI model without evaluating its performance. Evaluation parameters let us check the following:
1. Understanding Model Behavior:
• Accuracy: How often is the model correct in its predictions?
• Bias: Does the model favor certain groups or outcomes unfairly?
• Overfitting/Underfitting: Is the model too specific to the training data (overfitting) or too simple to capture the underlying patterns (underfitting)?
2. Building Trust and Reliability:
• Real-world Applications: Accurate models are essential for tasks like medical diagnosis, self-driving
cars, and financial predictions.
• Ethical Considerations: Biased models can have serious consequences, so evaluation helps ensure
fairness and responsible AI development.
3. Improving Model Performance:
• Identifying Weaknesses: Evaluation highlights areas where the model struggles, guiding improve-
ments in the training data or model architecture.
• Comparing Models: By evaluating multiple models, you can choose the one that performs best for a
specific task.
TRAIN-TEST SPLIT
Train-test split is a machine learning technique that divides a dataset into two subsets: a training set and a testing set. The training set is used to train the machine learning model, while the testing set is used to evaluate its performance. This crucial technique helps us build models that are not only good at learning from the data they are trained on but also perform well on new, unseen data, which ensures that they are reliable enough for real-world applications.
Imagine you’re learning to play cricket. You practice a lot (that’s your training), and then you play a match
(that’s your test). The train-test split in machine learning is very similar!
Details of Train-Test split
• It’s like dividing your cricket practice. You have a set of data (like your cricket skills).
• You split it into two parts:
  – Training data: This is like your practice sessions. You use this data to teach your machine learning model how to play (make predictions).
  – Testing data: This is like your actual match. You use this data to see how well your model actually performs. It's data the model hasn't seen before.
Why is it Important?
• To avoid overfitting: Imagine you practice only one type of bowling (like spin). You might become
very good at it, but you’ll struggle against fast bowlers in a real match. Similarly, a model trained on
the same data might become too specific and not perform well on new, unseen data.
• To get a realistic estimate of performance: By testing on unseen data, we get a better idea of how
well our model will perform in the real world.
Example: Predicting Weather
Let’s say you want to build a model to predict if it will rain tomorrow.
1. Gather data: You collect data on temperature, humidity, wind speed, etc., for the past few years, along
with whether it rained the next day.
2. Split the data:
• Training data (70%): You use 70% of the data to train your model. The model learns to identify
patterns between the weather conditions and whether it rained.
• Testing data (30%): You use the remaining 30% of the data to test the model. The model predicts
if it will rain based on the weather conditions in this data, and you compare its predictions to the
actual outcomes.
3. Evaluate: You compare the model’s predictions to the actual outcomes in the testing data. This helps
you understand how accurate your model is.
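In Python, a split like this is usually done with scikit-learn. Below is a minimal sketch of the 70/30 split described above; the tiny weather dataset (feature values and rain labels) is invented purely for illustration.

# Each row is [temperature, humidity, wind_speed]; each label is
# 1 if it rained the next day, 0 otherwise. (Toy data for illustration.)
from sklearn.model_selection import train_test_split

X = [[30, 85, 10], [35, 40, 5], [28, 90, 12], [33, 55, 8], [27, 95, 15],
     [36, 35, 6], [29, 80, 11], [34, 50, 7], [26, 92, 14], [37, 30, 5]]
y = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Hold back 30% of the data for testing; random_state makes the
# split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

print(len(X_train), "training samples;", len(X_test), "testing samples")
# 7 training samples; 3 testing samples

The model is then trained only on X_train and y_train, and evaluated only on X_test and y_test.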
Limitations of Accuracy
• Class Imbalance: In datasets with imbalanced classes (e.g., 90% of samples belong to one class), high
accuracy can be misleading.
• Misleading Metrics: In some cases, accuracy might not be the most informative metric. For example,
in medical diagnosis, false negatives (failing to detect a disease) can have more severe consequences
than false positives.
CLASSIFICATION
Classification is a fundamental task in machine learning where the goal is to predict the class or category
of a given input data point. It’s a type of supervised learning, where we provide the algorithm with labeled
data (data where the correct class is already known) during training.
Examples:
• Spam detection: Classifying emails as spam or not spam.
• Image recognition: Identifying objects in images (e.g., cats, dogs, cars).
• Medical diagnosis: Predicting the presence or absence of a disease.
Types of Classification:
• Binary Classification: The simplest type, where there are only two possible classes (e.g., spam/not
spam, yes/no, positive/negative).
• Multi-class Classification: Involves predicting one of multiple classes (e.g., classifying images of dif-
ferent animals: cat, dog, bird).
• Multi-label Classification: Assigns multiple labels to a single data point (e.g., classifying a news arti-
cle as belonging to multiple topics: politics, sports, technology).
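The difference between these three types is easiest to see in the shape of the labels themselves. A small sketch (the label values are invented for illustration):

# Binary classification: every email gets exactly one of two labels.
email_labels = ["spam", "not spam", "spam", "not spam"]

# Multi-class classification: every image gets exactly one of many labels.
animal_labels = ["cat", "dog", "bird", "dog"]

# Multi-label classification: a single article can get several labels at once.
article_labels = [["politics", "technology"], ["sports"], ["politics", "sports"]]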
Evaluation metrics for Classification
Several new terms come into the picture when we evaluate a model. Let us explore them with the example of a forest fire scenario.
The Scenario
Imagine that you have come up with an AI-based prediction model which has been deployed in a forest that is prone to forest fires. The objective of the model is to predict whether a forest fire has broken out in the forest or not. To understand the efficiency of this model, we need to check whether the predictions it makes are correct. Thus, there exist two conditions that we need to consider: Prediction and Reality. The prediction is the output given by the machine, and the reality is the actual situation in the forest at the time the prediction is made. These two conditions can combine in four ways:
• True Positive (TP): the model predicted a fire and there really was a fire.
• True Negative (TN): the model predicted no fire and there was no fire.
• False Positive (FP): the model predicted a fire but there was no fire (a false alarm).
• False Negative (FN): the model predicted no fire but a fire had actually broken out (a miss).
These four combinations are recorded in a confusion matrix.
The confusion matrix provides a visual summary of the model’s performance. It helps identify specific types
of errors made by the model. This information can be used to improve the model’s accuracy and address
potential biases.
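As a sketch, the confusion matrix can be computed directly from two lists, one holding reality and one holding the model's predictions. The 0/1 values below are invented for illustration (1 = fire, 0 = no fire):

from sklearn.metrics import confusion_matrix

reality    = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]  # what actually happened
prediction = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]  # what the model predicted

# scikit-learn puts reality in the rows and prediction in the columns:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(reality, prediction))
# [[5 1]
#  [1 3]]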
Activity
Let us go back to the forest fire example. Assume that the model always predicts that there is no fire, but in reality there is a 2% chance of a forest fire breaking out. Out of 100 cases, the model will be right in the 98 cases where there is no fire, but in the 2 cases where a fire actually broke out, the model will still predict no fire.
Here,
True Positives (TP) = 0
True Negatives (TN) = 98
Total cases = 100
Accuracy = (TP + TN) / Total cases * 100% = (0 + 98) / 100 * 100% = 98%
This is a fairly high accuracy for an AI model. But this parameter is useless for us as the actual cases where
the fire broke out are not taken into account. Hence, there is a need to look at another parameter which
takes account of such cases as well.
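This trap is easy to reproduce in code. A minimal sketch, assuming 2 real fires out of 100 cases:

# 100 cases: 2 with a real fire (1), 98 without (0).
reality = [1] * 2 + [0] * 98
# The model always predicts "no fire".
prediction = [0] * 100

correct = sum(r == p for r, p in zip(reality, prediction))
print("Accuracy:", correct / len(reality) * 100, "%")
# Accuracy: 98.0 % -- yet both real fires were missed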
Precision parameter
Precision is defined as the percentage of true positive cases versus all the cases where the prediction is true.
That is, it takes into account the True Positives and False Positives.
Precision = (True Positives / All Predicted Positives) * 100%

Precision = TP / (TP + FP) * 100%
Going back to the forest fire example, assume now that the model always predicts that there is a forest fire, irrespective of the reality. In this case, all the positive predictions would be taken into account, that is, True Positives (Prediction = Yes and Reality = Yes) and False Positives (Prediction = Yes and Reality = No). The firefighters would have to check every time an alarm is raised to see whether it was true or false.
You might recall the story of the boy who cried wolf: he raised so many false alarms that when the wolves actually arrived, no one came to his rescue. Similarly, if the precision is low (which means there are more false alarms than actual fires), the firefighters would get complacent and might not go and check every time, assuming it could be a false alarm.
This makes precision an important evaluation criterion. If precision is high, most of the positive predictions are True Positives, which means fewer false alarms.
But again, is good Precision equivalent to a good model performance? Why?
Let us consider a model that has 100% precision, which means that whenever the machine says there is a fire, there actually is a fire (True Positive). Even in this model, there can be a rare case where there was an actual fire but the system could not detect it. This is a False Negative. The precision value is not affected by it, because precision does not take False Negatives into account.
Is precision then a good parameter for model performance?
Recall parameter
Another parameter for evaluating the model's performance is Recall. It can be defined as the fraction of positive cases that are correctly identified. It focuses on the cases where, in reality, there was a fire, whether the machine detected it or not. That is, it considers True Positives (there was a forest fire in reality and the model predicted a forest fire) and False Negatives (there was a forest fire and the model didn't predict it).
Recall = (True Positives / (True Positives + False Negatives)) * 100%

Recall = TP / (TP + FN) * 100%
Notice that the numerator in both Precision and Recall is the same: True Positives. In the denominator, however, Precision counts the False Positives while Recall counts the False Negatives.
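Both metrics can be computed with scikit-learn from the same reality/prediction lists used for the confusion matrix earlier (again, the labels are invented for illustration):

from sklearn.metrics import precision_score, recall_score

reality    = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
prediction = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]

# Here TP = 3, FP = 1, FN = 1 (see the confusion matrix above).
print("Precision:", precision_score(reality, prediction))  # 3 / (3 + 1) = 0.75
print("Recall:   ", recall_score(reality, prediction))     # 3 / (3 + 1) = 0.75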
Let us ponder... Which one do you think is better? Precision or Recall? Why?
Which Metric is Important?
Choosing between Precision and Recall depends on the condition in which the model has been deployed.
In a case like Forest Fire, a False Negative can cost us a lot and is risky too. Imagine no alert being given
even when there is a Forest Fire. The whole forest might burn down.
Another case where a False Negative can be dangerous is Viral Outbreak. Imagine a deadly virus has started
spreading and the model which is supposed to predict a viral outbreak does not detect it. The virus might
spread widely and infect a lot of people.
On the other hand, there can be cases in which the False Positive condition costs us more than False
Negatives. One such case is Mining. Imagine a model telling you that there exists treasure at a point and
you keep on digging there but it turns out that it is a false alarm. Here, False Positive cases (predicting
there is treasure but there is no treasure) can be very costly.
Similarly, let’s consider a model that predicts whether a mail is spam or not. If the model always predicts that
the mail is spam, people would not look at it and eventually might lose important information. Here also
False Positive condition (Predicting the mail as spam while the mail is not spam) would have a high cost.
Cases with high FN cost: Forest fire, Viral outbreak
Cases with high FP cost: Spam, Mining
Which one is more important: RECALL or PRECISION?
Activity: To demonstrate how to calculate and interpret accuracy, precision, recall, and F1-score in a
classification context.
Scenario: Predicting Loan Default
Imagine a bank is developing an AI model to predict whether a loan applicant will default (fail to repay the
loan).
Data:
1. Actual Class:
• 1: Applicant will default
• 0: Applicant will not default
2. Predicted Class:
• 1: Model predicts the applicant will default
• 0: Model predicts the applicant will not default
Let’s say we have the following test data:
Applicant ID Actual Class Predicted Class
1 1 1
2 0 0
3 1 0
4 0 1
5 1 1
6 0 0
7 1 1
8 0 0
9 1 0
10 0 1
Confusion Matrix: [[3, 2], [2, 3]]
• True Positives (TP): 3 (Correctly predicted to default)
• True Negatives (TN): 3 (Correctly predicted to not default)
• False Positives (FP): 2 (Incorrectly predicted to default)
• False Negatives (FN): 2 (Incorrectly predicted to not default)
Accuracy:
• Accuracy = (TP + TN) / (TP + FP + TN + FN)
• Accuracy = (3 + 3) / (3 + 2 + 2 + 3) = 0.6 or 60%
Precision:
• Precision = TP / (TP + FP)
• Precision = 3 / (3 + 2) = 0.6 or 60%
Recall:
• Recall = TP / (TP + FN)
• Recall = 3 / (3 + 2) = 0.6 or 60%
F1-Score:
• F1-Score = 2 * (Precision * Recall) / (Precision + Recall)
• F1-Score = 2 * (0.6 * 0.6) / (0.6 + 0.6) = 0.6
Interpretation:
• Accuracy: The model correctly predicted the default status of 60% of the applicants.
• Precision: When the model predicts that an applicant will default, it is correct 60% of the time.
• Recall: The model correctly identifies 60% of the actual defaulters.
• F1-Score: The model has a balanced performance between precision and recall.
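All four numbers can be verified in a few lines of Python, using the Actual and Predicted columns from the table above:

from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

# Actual and predicted classes for applicants 1-10, from the table.
actual    = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]

print(confusion_matrix(actual, predicted))
# [[3 2]
#  [2 3]]
print("Accuracy: ", accuracy_score(actual, predicted))   # 0.6
print("Precision:", precision_score(actual, predicted))  # 0.6
print("Recall:   ", recall_score(actual, predicted))     # 0.6
print("F1-Score: ", f1_score(actual, predicted))         # 0.6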
Data Characteristics
• Class Imbalance: If one class significantly outnumbers the others, accuracy can be misleading. Consider precision, recall, or F1-score instead.
Cost of Errors
• False Positives: What are the consequences of incorrectly predicting a positive outcome? (e.g., flag-
ging a legitimate email as spam, misdiagnosing a healthy patient)
• False Negatives: What are the consequences of incorrectly predicting a negative outcome? (e.g., fail-
ing to detect a disease, approving a risky loan)
Examples:
• Spam Detection: Precision is crucial to avoid annoying users with false alarms.
• Disease Diagnosis: Recall is critical to minimize the risk of missing actual cases.
• Fraud Detection: Both precision and recall are important to minimize both false alarms and missed
fraud cases.
RECAP
• Evaluation is the process of understanding the reliability of any AI model, based on outputs by feeding
the test dataset into the model and comparing it with actual answers.
• There can be different Evaluation techniques, depending on the type and purpose of the model.
• Train-test split is a machine learning technique that divides a dataset into two subsets: a training set and
a testing set.
• The training set is used to train the machine learning model, while the testing set is used to evaluate
its performance.
• Classification is a fundamental task in machine learning where the goal is to predict the class or category
of a given input data point.
• Evaluating AI models effectively and ethically is crucial to ensure their responsible development and
deployment.
• The efficiency of the model is calculated on the basis of the parameters like Accuracy, Precision, Recall
and F1 Score.
KEY TERMS
• Model Evaluation is the process of understanding the reliability of any AI model, based on outputs by
feeding the test dataset into the model and comparing it with actual answers.
• Accuracy is defined as the percentage of correct predictions out of all the observations.
• Precision is defined as the percentage of true positive cases versus all the cases where the prediction is
true.
• Recall can be defined as the fraction of positive cases that are correctly identified.
• F1 score can be defined as the measure of balance between precision and recall.
• Confusion matrix is used to record the result of comparison between the prediction and reality.
• Error, also known as error rate, represents the proportion of incorrect predictions made by the model.
• Classification is a fundamental task in machine learning where the goal is to predict the class or category
of a given input data point.
EXERCISES
A. Multiple choice questions.
1. ______________ is the process of understanding the reliability of any AI model, based on outputs by feeding
the test dataset into the model and comparing it with actual answers.
(a) Data Reliability (b) Data Feed (c) Model Evaluation (d) None of these
2. __________ is defined as the percentage of correct predictions out of all the observations.
(a) Precision (b) F1 Score (c) Accuracy (d) None of these
3. _________ is defined as the percentage of true positive cases versus all the cases where the prediction is true.
(a) Precision (b) Accuracy
(c) F1 Score (d) None of these
4. ___________ can be defined as the measure of balance between precision and recall.
(a) Precision (b) F1 Score (c) Accuracy (d) None of these
5. The result of comparison between the prediction and reality is recorded in the ____________.
(a) Confusion Matrix (b) F1 Score
(c) Evaluation Model (d) None of these
6. Raunak was learning the conditions that make up the confusion matrix. He came across a scenario in which the
machine that was supposed to predict an animal was always predicting not an animal. What is this condition
called?
(a) False Positive (b) True Positive
(c) False Negative (d) True Negative
7. Priya was confused with the terms used in the evaluation stage. Suggest her the term used for the percentage
of correct predictions out of all the observations.
(a) Accuracy (b) Precision (c) Recall (d) F1-Score
8. Prediction and Reality can be easily mapped together with the help of :
(a) Prediction (b) Reality (c) Accuracy (d) Confusion Matrix
9. What is the primary purpose of the train-test split in machine learning?
(a) To increase the size of the dataset. (b) To prevent overfitting of the model.
(c) To improve the speed of training. (d) To reduce the number of features.
10. Which of the following is NOT a typical ratio for splitting data into training and testing sets?
(a) 70/30 (b) 80/20 (c) 90/10 (d) 50/50
11. What does “overfitting” mean in the context of machine learning?
(a) The model performs well on the training data but poorly on unseen data.
(b) The model performs poorly on both training and testing data.
(c) The model performs well on both training and testing data.
(d) The model performs poorly on training data but well on unseen data.
12. In binary classification, how many possible class labels are there?
(a) One (b) Two (c) Three (d) Many
13. In a medical diagnosis scenario where missing a disease is critical, which metric is generally more important:
precision or recall?
(a) Precision (b) Recall (c) Accuracy (d) F1-score
14. Which metric is most suitable for evaluating a model that predicts rare events, such as fraud detection?
(a) Accuracy (b) Precision (c) Recall (d) F1-score
15. What is the main difference between binary and multi-class classification?
(a) The number of features in the input data. (b) The number of possible output classes.
(c) The type of algorithm used. (d) The presence of noise in the data.
1. (a) Ritwik has been given the prediction results generated by a model. His senior has asked him to understand these results and explain them to his counterparts. Suggest to him the technique through which he can understand the prediction results easily.
(b) Take a look at the confusion matrix given below:
Confusion Matrix    Reality: Yes            Reality: No
Prediction: Yes     True Positive (TP)      False Positive (FP)
Prediction: No      False Negative (FN)     True Negative (TN)
How do you calculate F1 score?
(c) What should be the value of F1 score if the model needs to have 100% accuracy?
2. (a) Mayank wants to measure the balance between precision and recall for evaluating a model. Which method should he use for measuring this balance? Also give its formula.
(b) Calculate Accuracy, Precision, Recall and F1 Score for the following Confusion Matrix on Heart Attack Risk.
Also suggest which metric would not be a good evaluation parameter here and why?
Confusion Matrix    Reality: 1    Reality: 0
Prediction: 1       50            20
Prediction: 0       10            20
3. Calculate Accuracy, Precision, Recall and F1 Score for the following Confusion Matrix on Water Shortage in School. Also suggest which metric would not be a good evaluation parameter here and why?
Confusion Matrix (Water Shortage in School)    Reality: 1    Reality: 0
Prediction: 1                                  75            5
Prediction: 0                                  5             15
4. Imagine that you have come up with an AI-based prediction model which has been deployed on the roads to check traffic jams. The objective of the model is to predict whether there will be a traffic jam or not. To understand the efficiency of this model, we need to check if the predictions which it makes are correct or not. Thus, there exist two conditions which we need to consider: Prediction and Reality.
Traffic Jams have become a common part of our lives nowadays. Living in an urban area means you have to
face traffic each and every time you get out on the road. Mostly, school students opt for buses to go to school. Many
times, the bus gets late due to such jams and the students are not able to reach their school on time.
Considering all the possible situations, make a Confusion Matrix for the above situation.
5. Frequent tsunamis are a rising problem across the world. Tsunamis can cause severe damage to the economy
and livelihood. An AI model has been created which can predict if there is a chance of tsunamis in an area. The
confusion matrix for this model is given below.
Predicted Positive 99 41
6. (a) Calculate Accuracy, Precision, Recall and F1 Score using the confusion matrix given below.
1. Consider a scenario in which people often face the problem of a sudden downpour. People wash clothes and put them out to dry but, due to unexpected rain, their work gets wasted. Thus, an AI model has been created which predicts if there will be rain or not. The confusion matrix for the same is:
The Confusion Matrix    Reality: 1    Reality: 0
Predicted: 1            5             0
Predicted: 0            45            50
2. Consider a scenario in which traffic jams have become a common part of our lives nowadays. Living in an urban
area means you have to face traffic each and every time you get out on the road. Mostly, school students opt
for buses to go to school. Many times, the bus gets late due to such jams and the students are not able to reach
their school on time. Thus, an AI model is created to predict explicitly if there would be a traffic jam on their
way to school or not. The confusion matrix for the same is:
The Confusion Matrix    Reality: 1    Reality: 0
Predicted: 1            50            50
Predicted: 0            0             0
3. In schools, a lot of times it happens that there is no water to drink. At a few places, cases of water shortage in
schools are very common and prominent. Hence, an AI model is designed to predict if there is going to be a
water shortage in the school in near future or not. The confusion matrix for the same is:
The Confusion Matrix    Reality: 1    Reality: 0
Predicted: 1            22            12
Predicted: 0            47            118
4. Nowadays, the problem of floods has worsened in some parts of the country. Not only does it damage the whole
place but it also forces the people to move out of their homes and relocate. To address this issue, an AI model
has been created which can predict if there is a chance of floods or not. The confusion matrix for the same is:
The Confusion Matrix    Reality: 1    Reality: 0
Predicted: 1            50            3
Predicted: 0            3             94
5. Watch the following video for understanding the confusion matrix with the help of an example and then discuss
it in the class.
https://www.youtube.com/watch?v=8Oog7TXHvFY
1. Is it healthy to have only a few big players (Facebook, Google, Microsoft, Amazon, IBM) governing AI development, or should we place some restrictions on this and encourage many more startups to work on AI?
2. Divide the class into two groups and debate the topic "The Importance of Stakeholder Involvement in AI Model Evaluation and Development".
http://wiki.pathmind.com/accuracy-precision-recall-f1
https://ai.plainenglish.io/what-is-accuracy-precision-recall-and-f1-score-what-is-its-significance-in-machine-learning-77d262952287
G. Experiential learning
https://youtu.be/jjsRC1Wv750