Introduction To Machine Learning IIT KGP Week 2
Week 2: Assignment 2
Due date: 2023-08-09, 23:59 IST.
2 points
Question 1.
Answer: To answer this question, we need to understand what entropy is and how it is calculated for a decision tree. Entropy is a measure of the uncertainty or disorder in a group of observations; it determines how a decision tree chooses to split data based on the features. The entropy formula is:

Entropy(S) = −∑_{i=1}^{n} p_i log2(p_i)

where S is the set of observations, n is the number of classes, and p_i is the probability of an observation belonging to class i.
To calculate the entropy of Emotion | Wig = Y, we need to find the probability of each class (happy or sad) given that the person wears a wig. We can use the table in the image to count the number of observations that satisfy this condition. There are 4 people who wear a wig, out of which 2 are happy and 2 are sad. Therefore, the probabilities are:

p(happy) = 2/4 = 0.5
p(sad) = 2/4 = 0.5

Plugging these values into the entropy formula, we get:

Entropy(Emotion | Wig = Y) = −(0.5 log2 0.5 + 0.5 log2 0.5) = −(−0.5 − 0.5) = 1
Therefore, the correct option is a. 1.
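To make the calculation concrete, here is a minimal Python sketch of the same entropy computation. The `wig_yes` list is an assumption standing in for the Wig = Y rows of the table (which is only available as an image): two happy and two sad observations.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# Assumed stand-in for the Wig = Y rows of the table: 2 happy, 2 sad.
wig_yes = ["happy", "happy", "sad", "sad"]
print(entropy(wig_yes))  # 1.0
```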
2 points
Question 2.
Answer: To answer this question, we need to use the same entropy formula as before, but
with a different condition. We need to find the probability of each class (happy or sad)
given that the person has 3 ears. We can use the table in the image to count the
number of observations that satisfy this condition. There are 2 people who have 3 ears,
out of which 1 is happy and 1 is sad. Therefore, the probabilities are:
p(happy) = 1/2 = 0.5
p(sad) = 1/2 = 0.5

Plugging these values into the entropy formula, we get:

Entropy(Emotion | Ears = 3) = −(0.5 log2 0.5 + 0.5 log2 0.5) = −(−0.5 − 0.5) = 1
Therefore, the correct option is a. 1.
2 points
Question 3.
Answer: To find the correct option, we need to use the table in the image to calculate either the information gain or the Gini impurity for each attribute. I will use information gain as an example, but you can also use the Gini impurity if you prefer. The entropy of the root node is 1, as we calculated before. The information gain for each attribute is:

IG(Color) = 1 − (4/8 × 0.811 + 4/8 × 0.811) ≈ 0.189
IG(Wig) = 1 − (4/8 × 1 + 4/8 × 1) = 0
IG(Ears) = 1 − (2/8 × 1 + 2/8 × 0 + 4/8 × 0.811) ≈ 0.344
The attribute with the highest information gain is Ears,
so the correct option is c. Number of ears.
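As a hedged sanity check on these numbers, here is a small Python sketch of the information gain computation. The per-branch class counts below are assumptions reconstructed from the weights used above (an 8-row table with 4 happy and 4 sad, and an Ears split into groups of sizes 2, 2, and 4 with entropies 1, 0, and 0.811):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, subsets):
    """IG = entropy(parent) minus the weighted entropy of the child subsets."""
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in subsets)

# Assumed class counts per branch of the Ears split (the table is image-only).
parent = ["happy"] * 4 + ["sad"] * 4
ears_split = [["happy", "sad"],                      # entropy 1
              ["sad", "sad"],                        # entropy 0
              ["happy", "happy", "happy", "sad"]]    # entropy 0.811
print(round(information_gain(parent, ears_split), 3))  # 0.344
```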
2 points
Question 4.
Answer: To answer this question, we need to understand what linear regression is and what kind of
output it produces. Linear regression is a machine learning technique that models the relationship
between one or more input variables (x) and a single output variable (y) using a linear equation. The
output variable is also called the dependent variable or the response variable, and the input variables
are also called the independent variables or the predictors.
The output of linear regression is a numeric value that is calculated from a linear combination of the
input variables. For example, if we have two input variables x1 and x2, and a linear equation y = b0 +
b1x1 + b2x2, then the output y is a numeric value that depends on the values of x1 and x2 and the
coefficients b0, b1 and b2.
A numeric value can be either discrete or continuous. A discrete value is one that can only take
certain values, such as integers or counts. A continuous value is one that can take any value within a
range, such as fractions or measurements. For example, the number of students in a class is a
discrete value, while the height of a student is a continuous value.
The quantity being modeled can be either discrete or continuous, depending on the nature of the data and the problem. For example, if we want to predict the number of sales of a product based on its price and advertising budget, the target is a discrete count; if we want to predict the weight of a person based on their height and age, the target is continuous. In both cases, the prediction h(x) itself is a real-valued number.
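As a toy illustration of this linear combination (the coefficient values here are made up, not learned from any data):

```python
def h(x1, x2, b0=1.0, b1=0.5, b2=-2.0):
    """Linear regression hypothesis y = b0 + b1*x1 + b2*x2, arbitrary coefficients."""
    return b0 + b1 * x1 + b2 * x2

print(h(3.0, 1.0))  # 1.0 + 1.5 - 2.0 = 0.5, a real-valued output
```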
2 points
Question 5.
Answer: To answer this question, we need to apply the linear regression hypothesis to the given training data
and find the value of h(x) for each option. The linear regression hypothesis is h(x) = θ0 + θ1x, where θ0 and θ1
are the parameters that need to be learned from the data. The training data is given in a table with two
columns, X and Y, where X is the input variable and Y is the output variable.
To find the value of h(x) for each option, we need to substitute the value of x with the option and use
the values of θ0 and θ1 that minimize the mean squared error (MSE). The cost is defined as (1/(2m)) ∑(h(x) − y)², where m is the number of training examples and the sum runs over the squared errors between the predicted and actual outputs.
To find the values of θ0 and θ1 that minimize the MSE, we can use a technique called normal equation, which
gives a closed-form solution for the parameters. The normal equation is:
θ = (X^T X)^{-1} X^T y
where X is a matrix of input values with an extra column of 1s for the intercept term, y is a vector of output
values, and θ is a vector of parameters.
Using this equation, we can calculate the values of θ0 and θ1 as follows:
X = [[1, 5], [1, 6], [1, 10], [1, 3]] (the first column of 1s is the intercept term), y = [7, 4, 9, 5]^T, θ = [θ0, θ1]^T

θ = (X^T X)^{-1} X^T y = [3.571, 0.429]^T
Therefore, θ0 = 3.571 and θ1 = 0.429.
Now, we can find the value of h(x) for each option by substituting x with the option and using these values of θ0
and θ1:
a. h(1) = θ0 + θ1 * 1 = 3.571 + 0.429 * 1 = 4
b. h(0) = θ0 + θ1 * 0 = 3.571 + 0.429 * 0 = 3.571
c. h(2) = θ0 + θ1 * 2 = 3.571 + 0.429 * 2 = 4.429
d. h(0.5) = θ0 + θ1 * 0.5 = 3.571 + 0.429 * 0.5 = 3.786
The correct option is c: substituting x = 2 into the learned hypothesis gives h(2) = 4.429.
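For completeness, here is a minimal numpy sketch of the normal-equation computation. Since the assignment's training table is only available as an image, the x and y arrays below are made-up illustrative data, not the assignment's values; the sketch shows the method, not the exact numbers above.

```python
import numpy as np

# Made-up training data for illustration (not the assignment's table).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 5.0, 8.0])

# Design matrix with a leading column of 1s for the intercept term theta_0.
X = np.column_stack([np.ones_like(x), x])

# Normal equation: theta = (X^T X)^{-1} X^T y, solved as a linear system.
theta0, theta1 = np.linalg.solve(X.T @ X, X.T @ y)

def h(x_new):
    return theta0 + theta1 * x_new

print(theta0, theta1, h(2.0))  # theta ≈ (0, 1.9); h(2) ≈ 3.8 for this toy data
```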
2 points
Question 6.
Answer: The statement is false. The ID3 algorithm is not guaranteed to find the optimal
decision tree, because it uses a greedy approach that can converge upon local optima.
The algorithm’s optimality can be improved by using backtracking during the search for
the optimal decision tree at the cost of possibly taking longer. ID3 can also overfit the
training data, which means it can perform poorly on unseen data. To avoid overfitting, smaller decision trees should be preferred over larger ones.
2 points
Question 7.
Answer: The statement is false. A classifier trained on less training data is more likely to overfit, because a small sample may not capture the underlying trend of the data, so the model fits noise instead. Overfitting occurs when a classifier fits the training data too tightly, such that it performs poorly on unseen data. It can be reduced by using more training data, regularization techniques, cross-validation, or simpler models.
2 points
Question 8.
2 points
Question 9.
Answer: The correct answer is c. It would probably result in a decision tree that scores well on the
training set but badly on a test set. This is because a multiway split with one branch for each
distinct value of the attribute would create a very fine-grained partition of the data, which
may capture noise or irrelevant patterns that do not generalize well to unseen data. This
is a form of overfitting, which reduces the accuracy and comprehensibility of the
decision tree. A better way to handle real-valued attributes is to use binary splits based
on some threshold that maximizes the information gain or minimizes the impurity.
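A small sketch of the threshold-based binary split just described (the helper names and the toy data are mine, not from the assignment): candidate thresholds are midpoints between consecutive sorted values, and the one with the highest information gain is kept.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Choose the binary split (value <= t) that maximizes information gain."""
    pairs = sorted(zip(values, labels))
    vs = [v for v, _ in pairs]
    ls = [l for _, l in pairs]
    base = entropy(ls)
    best_t, best_gain = None, 0.0
    for i in range(1, len(vs)):
        if vs[i] == vs[i - 1]:
            continue  # no decision boundary between identical values
        t = (vs[i] + vs[i - 1]) / 2
        left, right = ls[:i], ls[i:]
        gain = base - (len(left) * entropy(left)
                       + len(right) * entropy(right)) / len(ls)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Made-up real-valued attribute with binary labels.
print(best_threshold([2.0, 3.5, 1.0, 4.0],
                     ["sad", "happy", "sad", "happy"]))  # (2.75, 1.0)
```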
2 points
Question 10.
The other statements are false. Decision trees are not resistant to overfitting, but rather
prone to it. This means that they may create very complex trees that fit the training data
well, but perform poorly on new data. To prevent overfitting, some techniques such as
pruning, regularization, or ensemble methods can be used.
2 points
Question 11.
The other options are false. Increasing the tree depth, decreasing the minimum number
of samples required to split a node, or adding more features to the dataset can all
increase the complexity and variance of the decision tree, which can lead to overfitting.
These are not techniques to handle overfitting, but rather factors that can cause it.
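As a hedged illustration (assuming scikit-learn is available; the parameter values are arbitrary), the constructor arguments below are the standard knobs for limiting tree complexity, i.e. the reverse of the overfitting-inducing changes listed above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, just to have something to fit.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Constrain the tree: shallow depth, higher split threshold, and pruning.
tree = DecisionTreeClassifier(max_depth=3, min_samples_split=10,
                              ccp_alpha=0.01, random_state=0)
tree.fit(X_tr, y_tr)
print(tree.score(X_tr, y_tr), tree.score(X_te, y_te))  # train vs. test accuracy
```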
2 points
Question 12.
The other options are false. Support Vector Machine, K-Means Clustering, and Naive
Bayes are not measures for selecting the best split in decision trees, but rather different
types of machine learning algorithms for classification or clustering problems. They
have their own assumptions, methods, and parameters, which are different from those
of decision trees.
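For concreteness, here is a minimal sketch of one actual split measure, Gini impurity, against which the distractor algorithms above can be contrasted (the labels are made up):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["happy", "happy", "sad", "sad"]))  # 0.5 for a 50/50 split
```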
2 points
Question 13.
Answer: The correct answer is b. The root node of a decision tree serves as the starting point
for tree traversal during prediction. The root node contains the feature that best splits
the data into different classes or outcomes. The algorithm compares the values of the
root attribute with the record attribute and follows the branch that matches the condition.
The process is repeated until a leaf node is reached, which represents the predicted
class or value.
The other options are false. The root node does not represent the class labels of the
training data, as these are stored in the leaf nodes. The root node does not contain the
feature values of the training data, as these are used to split the data at each node. The
root node does not determine the stopping criterion for tree construction, as this is
usually based on some measure of impurity, complexity, or error.
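A toy sketch of this traversal (the tree structure and attribute names are invented for illustration): prediction starts at the root attribute and follows the branch matching the record's value until a leaf is reached.

```python
# A decision tree as nested dicts: each internal node maps one attribute to its
# branches, each branch maps an attribute value to a subtree; strings are leaves.
tree = {"Wig": {"Y": {"Ears": {"2": "sad", "3": "happy"}},
                "N": "happy"}}

def predict(node, record):
    while isinstance(node, dict):
        attribute = next(iter(node))               # attribute tested at this node
        node = node[attribute][record[attribute]]  # follow the matching branch
    return node                                    # leaf: the predicted class

print(predict(tree, {"Wig": "Y", "Ears": "3"}))  # happy
```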
2 points
Question 14.
The other options are false. Linear regression is a supervised learning algorithm used
only for regression tasks, not classification tasks. Linear regression is affected by
outliers in the data, as they can distort the slope and intercept of the regression line.
Linear regression cannot handle missing values in the dataset, as they can reduce the
sample size and introduce bias.
2 points
Question 15.
Regularization is a technique that adds a penalty term to the loss function of the algorithm, which discourages large parameter values and thereby reduces the complexity of the model. This prevents the model from fitting too closely to the training data and reduces the variance of the model.
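A minimal sketch of such a penalty term, here an L2 (ridge) penalty added to the squared-error loss (the symbols w for the parameter vector and lam for the regularization strength are my own, and the data is made up):

```python
import numpy as np

def ridge_loss(w, X, y, lam):
    """Squared-error loss plus an L2 penalty lam * ||w||^2 on the parameters."""
    residuals = X @ w - y
    return np.mean(residuals ** 2) + lam * np.sum(w ** 2)

# Made-up data: as lam grows, the optimum shifts toward smaller weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=20)
print(ridge_loss(np.array([1.0, -2.0, 0.5]), X, y, lam=0.1))
```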
Gathering more training data is another technique that can help reduce overfitting, as it provides more examples and diversity for the algorithm to learn from. This increases the representativeness and robustness of the model and reduces its variance.