AI XII (843) Units 1,2,3
8. A _________ is a set of historical data in which the outcomes are already known.
a) Training set b) Test set
c) Data requirement d) Model
Answer: a)Training set
9. You can compute the residual in regression by
a) actual y‐coordinate value - predicted y‐coordinate value
b) predicted y‐coordinate value - actual y‐coordinate value
c) actual y‐coordinate value / predicted y‐coordinate value
d) None
Answer: a) actual y‐coordinate value - predicted y‐coordinate value
10. ________ is used to evaluate the machine learning model.
a) Train model b) Test Dataset
c) Train data set d) Test model
Answer: b) Test Dataset
11. _________ is an iterative process with a prescribed sequence of steps that are
followed by data scientists to approach a problem and find a solution.
a) Robotic methodology b) Engineering
c) Design thinking d) Data Science methodology
Answer: d) Data Science methodology
12. In problem decomposition:
i. Understand the problem and then restate the problem in your own words
ii. Gather all simple facts to create a complicated piece
iii. Break larger units into simpler ones
iv. Code one small unit at a time
Which of the following is true?
a) (i) and (ii) b) (i), (iii) and (iv)
c) (ii) and (iv) d) (i), (ii), (iii), (iv)
Answer: b) (i), (iii) and (iv)
14. _________ helps us to summarize all the key points into one single outline so that
in future, whenever there is a need to look back at the basis of the problem, we can take
a look at it and understand its key elements.
a) 4W Problem canvas b) Problem Statement Template
c) Data Acquisition d) Algorithm
Answer: b) Problem Statement Template
15. Which of these is the code for test data split of 0.33?
(a) x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33)
(b) x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.67)
(c) x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5)
(d) x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=1)
Answer: (a) x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33)
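To make the meaning of test_size=0.33 concrete, here is a minimal pure-Python sketch of a shuffled hold-out split. The helper simple_train_test_split and its seed parameter are hypothetical names used only for illustration; this is not scikit-learn's implementation, just the same idea of holding out about 33% of the rows for testing.

```python
import math
import random


def simple_train_test_split(x, y, test_size=0.33, seed=0):
    """Hypothetical helper: shuffle row indices and hold out a fraction for testing."""
    n = len(x)
    n_test = math.ceil(n * test_size)      # number of rows kept aside for testing
    indices = list(range(n))
    random.Random(seed).shuffle(indices)   # shuffle so the split is not order-dependent
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [x[i] for i in train_idx]
    x_test = [x[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


# Made-up data: 100 rows where y is simply twice x.
x = list(range(100))
y = [2 * v for v in x]
x_train, x_test, y_train, y_test = simple_train_test_split(x, y, test_size=0.33)
print(len(x_train), len(x_test))
```

With 100 rows, roughly one third lands in the test set and the rest is used for training; each x value stays paired with its y value.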
Assertion(A): Nowadays, predictive models are not able to better predict rare events such as
disease or system failure.
Reason(R): Today’s high-performance database analytics enable data scientists to utilize
large or even all of the available data.
Answer: d. A is false but R is true.
Assertion (A) – Consider that the goal of an AI model is to predict an answer such as "yes" or
"no".
Reason (R) – In such a case, predictive modelling can be used.
Answer: c. A is true but R is false.
II. Answer the given questions in 20 – 30 words each
(2 marks each)
f) Fill in the blanks: MSE is known as the ______ of the error value, while RMSE is the
________ of errors.
Answer: MSE is known as the Variance of the error value, while RMSE is the
Standard Deviation of errors.
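For reference, the two quantities can be written out as formulas. The symbols are notational assumptions not used in the original: y_i is the observed value, \hat{y}_i the predicted value, and n the number of data points. The variance and standard deviation reading holds exactly when the residuals average to zero.

```latex
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}
```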
Answer: ‘A’ is called residual. The differences between actual and predicted data
values are known as residuals in statistical or machine learning models. They serve as
a diagnostic tool for evaluating a model's quality. They are also called errors.
Importance of residual:
• Residuals are crucial in assessing a model's quality.
• When all of the residuals are 0, the model makes perfectly accurate
predictions. The model is considered biased when the average residual is not
0 (i.e., it consistently over- or under-predicts).
• When residuals show patterns, the model is most likely qualitatively
inaccurate because it is unable to account for some aspect of the data.
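The bias check described above can be sketched in a few lines of Python. All numbers here are made up for illustration, and residuals is a hypothetical helper name; the point is only that an average residual far from 0 signals consistent over- or under-prediction.

```python
from statistics import mean


def residuals(actual, predicted):
    """Residual = actual value - predicted value, one per data point."""
    return [a - p for a, p in zip(actual, predicted)]


# Made-up observations and two made-up sets of predictions.
actual = [10, 12, 14, 16, 18]
balanced_pred = [11, 11, 15, 15, 18]   # errors in both directions cancel out
biased_pred = [12, 14, 16, 18, 20]     # every prediction is too high

balanced_res = residuals(actual, balanced_pred)
biased_res = residuals(actual, biased_pred)

print(balanced_res, mean(balanced_res))  # average residual is 0: no systematic bias
print(biased_res, mean(biased_res))      # average residual is -2: model over-predicts
```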
c) Sudha is studying the Data Science Methodology. Answer the following questions for
Sudha.
i) Why does data need to be split?
ii) Differentiate between the Training Data and Test Data.
Answer:
i. To assess how effectively our machine learning model works, we must divide
a dataset into train and test sets. The statistics of the train set are known and
are utilised to fit the model. The test data set, which is the second set, is
utilised just for predictions.
ii. A collection of data used to fit the model is called the training dataset. This
dataset is used to train the model. The model observes and learns from this data.
An unbiased assessment of the final model fit is provided by the test dataset,
which is a separate portion of the original dataset that the model does not see
during training. Testing of the AI model is conducted after model training.
Predicted value 14 19 17 13 12 7 24 23 17 18
Observed Value 17 18 18 15 18 11 20 18 13 19
Answer:
Predicted   Observed   Residual (observed - predicted)   Squared residual
14          17          3                                 9
19          18         -1                                 1
17          18          1                                 1
13          15          2                                 4
12          18          6                                 36
7           11          4                                 16
24          20         -4                                 16
23          18         -5                                 25
17          13         -4                                 16
18          19          1                                 1
Total                                                     125
MSE   12.5
RMSE  3.54
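The same calculation can be reproduced with a short Python sketch that uses the predicted and observed values from the question. The variable names are illustrative.

```python
import math

# Values copied from the question.
predicted = [14, 19, 17, 13, 12, 7, 24, 23, 17, 18]
observed = [17, 18, 18, 15, 18, 11, 20, 18, 13, 19]

# Residual = observed - predicted, then square each residual.
residuals = [o - p for o, p in zip(observed, predicted)]
squared = [r ** 2 for r in residuals]

total = sum(squared)        # sum of squared residuals
mse = total / len(squared)  # mean squared error
rmse = math.sqrt(mse)       # root mean squared error

print(total, mse, round(rmse, 2))  # 125 12.5 3.54
```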
e) “Regression functions predict a quantity, and classification functions predict a label.”
What are these functions collectively called? Define it and also give 2 examples from
each category.
Answer: These are called loss functions. A loss function evaluates the accuracy with
which your prediction model can forecast the desired result (or value). The learning
problem is converted into an optimization problem, a loss function is defined, and the
algorithm is then optimised to minimise the loss function.
Examples – Regression Loss Function – RMSE, MAPE
Classification Loss Function – Log loss, Focal Loss
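As a rough illustration, three of the loss functions named above can be written in plain Python. This is a minimal sketch: the function names and the eps clipping value are assumptions made here, MAPE assumes no actual value is zero, log loss is shown for binary labels (0 or 1) only, and focal loss is not included.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error (regression)."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mape(actual, predicted):
    """Mean absolute percentage error in percent (regression); actual values must be non-zero."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def log_loss(labels, probs, eps=1e-15):
    """Binary log loss (classification); probs are predicted probabilities of class 1."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(labels)


# Made-up values for illustration.
print(rmse([3, 5, 7], [2, 5, 9]))
print(mape([100, 200], [110, 180]))
print(log_loss([1, 0], [0.9, 0.1]))  # confident and correct: small loss
print(log_loss([1, 0], [0.1, 0.9]))  # confident and wrong: large loss
```

In every case a lower value means the predictions are closer to the desired result, which is why training aims to minimise the loss.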
f) Neha is studying linear regression analysis between 2 variables. She finds that the
residuals are quite large. Answer the following questions for Neha:
i) What are residuals?
ii) What is their role in model evaluation?
iii) What does their value imply?
Answer:
i. A residual in a linear regression is the difference between the dependent
variable's predicted value and its actual observed value.
ii. Residuals measure how well the model matches the data, so they are used to
assess the model's effectiveness.
iii. The residuals should be small and evenly scattered around zero if the
model successfully fits the data. The model may not be a good match for the
data if the residuals are large and not evenly distributed, and it may need to be
improved or revised.
g) “Cross Validation is doing train-test split several times.” Do you agree? Differentiate
between train test split approach and Cross validation technique with visualizations.
Answer: Yes, the statement is true.
Train Test Split Approach | Cross Validation Technique
Dataset is divided into two parts – Training Set and the unseen Test Set. | At the core it is doing train-test split several times.
Data points used for training are not used for testing. | Every data point could potentially be in either the testing or the training set.
Better for large datasets. | More robust for small datasets.
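The difference can also be seen by printing which rows fall in the training and testing sets for each fold. The sketch below is a hand-rolled illustration, not a library routine; k_fold_indices is a hypothetical helper, and the dataset is just 10 row indices split into 5 folds.

```python
def k_fold_indices(n_samples, k):
    """Hypothetical helper: yield (train_idx, test_idx) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


# A single train-test split would be just one of these rows;
# cross validation repeats the split once per fold.
folds = list(k_fold_indices(10, 5))
for fold_number, (train_idx, test_idx) in enumerate(folds, start=1):
    print("fold", fold_number, "train:", train_idx, "test:", test_idx)
```

Across the five folds every row is used for testing exactly once and for training four times, which is why the technique gives a steadier estimate on small datasets.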
1. During the initial data collection phase, data scientists identify available data sources
relevant to the problem area. Identify the data sources.
i) Structured
ii) Unstructured
iii) Semi-structured
iv) Re-structured
2. A company has created an AI model to predict the amount of nitrogen in soil. The model
was tested extensively and then deployed to the customer. Now, however, the model is
not giving reliable insights. What could be the reason for this? Also give a solution to
this problem.
Answer: The model is overfitting. A model is overfitted when it fits its training data
too closely, memorising noise and specific examples instead of learning general
patterns. As the model has effectively memorised its dataset, it should be retrained with a
completely new set of data and evaluated using techniques like k-fold cross validation.
3. Manoj is part of the testing team for developing an AI model. He is told “The testing
team should put the AI and ML algorithms through rigorous testing while maintaining
model validity and keeping successful learning ability, and algorithm efficacy in mind.”
What steps should Manoj now take?
Answer:
• Data validation should be performed on test data to check for bias.
• Test data should include all relevant subsets of training data, i.e., the data you
will use for training the AI system.
• Manoj’s team must create test suites that help validate the ML models.
• The AI model can be tested using cross validation technique. This technique
will give a better idea of model performance.
4. Reeta is working as a data scientist. She is tasked with studying the learning rate of
the model. She finds that the learning rate is quite low. What should she do now?
What category of parameter does learning rate belong to?
Answer: Learning rate is a hyperparameter. Since the learning rate is low, the model
is taking too much time for training/learning. Hence this hyperparameter needs to be
tweaked for optimal model performance.
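The effect of a low learning rate can be illustrated with a toy gradient descent. This is a minimal sketch on the made-up function f(w) = w², not an actual model; steps_to_converge and its parameters are hypothetical names chosen for this example.

```python
def steps_to_converge(learning_rate, start=10.0, tolerance=1e-3, max_steps=100_000):
    """Gradient descent on f(w) = w**2; return the steps needed until |w| < tolerance."""
    w = start
    for step in range(1, max_steps + 1):
        gradient = 2 * w               # derivative of w**2
        w -= learning_rate * gradient  # the learning rate scales each update
        if abs(w) < tolerance:
            return step
    return max_steps                   # did not converge within the step budget


slow = steps_to_converge(0.001)  # low learning rate: tiny updates
fast = steps_to_converge(0.1)    # larger learning rate: bigger updates
print("steps with learning rate 0.001:", slow)
print("steps with learning rate 0.1:", fast)
```

The low learning rate needs far more steps to reach the same point. Raising it speeds up training, but a value that is too high can overshoot the minimum, so it is usually tuned gradually.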
5. Consider the following:
“Is this soap dispenser racist?” This question became an internet sensation. In
a video at a Marriott hotel in Atlanta in 2015, an automatic soap dispenser was
shown unable to detect a black customer’s hand. What is this called in AI
terminology? When an AI model behaves like this, which stage should you go back
to? Give reasons.
Answer: The above episode depicts AI bias. When an AI model behaves like this, we
should go back to data collection stage. The dataset should have more samples from
diverse population groups. If the AI model is trained on the right data, it will give
better and accurate results.
3. Manisha is studying data storytelling. List 2 ethics that Manisha should follow while
creating a data story.
Answer:
• Respect privacy.
• Avoid making assumptions about the experiences of others.
5. Manju is part of a team developing an AI model. What will data stories created by
Manju and her team convey in the Data Exploration step?
Answer: Data Exploration is the first step in developing a model. In this step, a data
analyst uses visual exploration to understand what is in a dataset and the characteristics
of the data, rather than through traditional data management systems. So, the data
stories will reflect the data characteristics.
3. Consider the following visual. Create an effective data story using this visual.
[Visual: 2014. Female turnout – 65%; male turnout – 67%]
4. Identify and define the following icons related to data storytelling. How can these
elements bring about change?
a. Data b. Narrative c. Visuals
Answer:
a) Data – facts and information.
b) Narrative – the skill or process of telling a story.
c) Visuals – a graphical representation such as a graph, chart, or other
presentation.
When the proper visuals and narrative are combined with the correct data, you
have a data story that has the potential to impact and drive change.
5. Anita is studying data stories. She creates a data story on women's education
in rural areas in India. However, her teacher tells her that her story lacks structure.
What structure should Anita follow? Explain.
Answer: Anita should structure her story as follows:
• Setup - the context of the story which will include the problem, the objective, the data
sources, and the audience.
• Conflict - the difficulties and obstacles that the characters of the story have to
overcome, such as information gaps, constraints, risks, or trade-offs.
• Resolution - the outcome or solution of the story. This can also include
any insights, suggestions, advantages, or next actions.
6. Answer the following briefly:
i. Name two digital tools which help to visualise data.
Answer: Tableau, Google Charts
ii. Which one out of the following is a better visual? Give reasons.
(Please Note: Data Story can vary as per student’s understanding and creativity)