CS725 2020 Midsem
(a) Consider a Bernoulli random variable X with parameter b (i.e., P(X = 1) = b). Say you observe the
following values of X: (0, 0, 1, 0, 1).
(i) Write an expression for the likelihood as a function of b. (For example, b(1 − b)^2 can be written as b(1-b)**2.)
[1 pts]
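For reference, a minimal Python sketch (the helper name likelihood is illustrative, not from the exam): it evaluates the likelihood of the observations (0, 0, 1, 0, 1) on a grid of values of b and confirms that the maximizer is the sample mean 2/5.

    # Likelihood of b for observations (0, 0, 1, 0, 1): two 1s and three 0s,
    # so L(b) = b**2 * (1 - b)**3.
    def likelihood(b, xs=(0, 0, 1, 0, 1)):
        L = 1.0
        for x in xs:
            L *= b if x == 1 else (1 - b)
        return L

    # Evaluate on a grid and report the maximizer.
    grid = [i / 1000 for i in range(1, 1000)]
    b_hat = max(grid, key=likelihood)
    print(b_hat)  # ~0.4, i.e. the sample mean 2/5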
(b) Suppose we build a least-squares linear regression model where we impose a Gaussian prior
distribution on the weights. Then we are doing: [1 pts]
(i) Logistic regression (ii) Ridge regression (iii) Lasso regression (iv) L1 regularization.
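To see the connection numerically: maximizing the posterior under a zero-mean Gaussian prior on w is equivalent to minimizing ||y − Xw||^2 + λ||w||^2. The sketch below (synthetic data and an arbitrary λ, purely for illustration) checks that the ridge closed-form solution minimizes this penalized objective.

    import numpy as np

    # MAP estimation with a zero-mean Gaussian prior on w is equivalent to
    # minimizing ||y - Xw||^2 + lam * ||w||^2, i.e. ridge regression.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)
    lam = 0.5  # arbitrary regularization strength for this demo

    # Closed-form ridge solution: (X^T X + lam I)^{-1} X^T y.
    w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

    def objective(w):
        return np.sum((y - X @ w) ** 2) + lam * np.sum(w ** 2)

    # Perturbing the closed-form solution only increases the objective.
    print(all(objective(w_ridge) <= objective(w_ridge + 0.01 * rng.normal(size=3))
              for _ in range(100)))  # True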
(c) Consider two decision trees T1 and T2 of depths 2 and 4, respectively, that are trained on the same
dataset. Which of the following are likely correct? [1 pts]
(i) Bias(T1) > Bias(T2)
(ii) Bias(T1) < Bias(T2)
(iii) Variance(T1) > Variance(T2)
(iv) Variance(T1) < Variance(T2)
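A rough empirical illustration (a Python sketch on synthetic data of our own choosing, not the exam's): refitting trees of depth 2 and 4 on many resampled training sets and measuring how much their predictions fluctuate shows the deeper tree's higher variance, the flip side of the shallower tree's higher bias.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    x_test = np.linspace(0, 1, 20).reshape(-1, 1)

    def predictions(depth, n_trials=200):
        # Refit a depth-limited tree on fresh noisy samples of the same
        # underlying function, collecting its predictions on fixed test points.
        preds = []
        for _ in range(n_trials):
            x = rng.uniform(0, 1, size=(100, 1))
            y = np.sin(4 * x[:, 0]) + rng.normal(scale=0.3, size=100)
            preds.append(DecisionTreeRegressor(max_depth=depth).fit(x, y).predict(x_test))
        return np.array(preds)

    for depth in (2, 4):
        p = predictions(depth)
        print(depth, p.var(axis=0).mean())  # prediction variance grows with depth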
(a) True or False: Increasing the depth of a decision tree cannot increase its training error. (Answer: True) [1 pts]
(b) True or False: When a linear separator f(x) = w^T x + w0 is trained on data that is linearly separable, a
perceptron classifier is guaranteed to achieve zero error on the training data. (Answer: True) [1 pts]
(c) True or False: Given a matrix X, (XX^T + λI)^{-1} always exists for λ > 0. (Answer: True) [1 pts]
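One way to see this is the standard positive-definiteness argument, sketched below:

    \[
    v^\top \left( X X^\top + \lambda I \right) v
      \;=\; \|X^\top v\|^2 + \lambda \|v\|^2 \;>\; 0
      \quad \text{for all } v \neq 0,\ \lambda > 0,
    \]

so XX^T + λI is positive definite and therefore invertible.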
For the next few questions, consider the following dataset with binary attributes X1–X5 and label Y:

X1 X2 X3 X4 X5 Y
1 0 1 1 0 0
1 0 0 1 0 0
1 0 1 0 1 1
1 0 1 0 1 1
0 0 1 0 1 1
0 0 0 1 0 0
1 0 0 1 1 0
1 1 0 1 0 1
You can use the following approximate values: log2(3) ≈ 8/5, log2(5) ≈ 11/5, and log2(7) ≈ 14/5.
(a) Let IG(Xi, Y) denote the information gain of attribute Xi at the root node. What is IG(X3, Y)? [1 pts]
(c) Using information gain as the splitting criterion, which attribute would you choose to use at the root
of the tree?
(d) With information gain as the splitting criterion, suppose you construct the smallest decision tree
that will yield zero training error on the dataset above. This tree can be constructed using only
two attributes, one of which you found in the previous question. Which are the two attributes?
(Note that you do not need any more information gain computations to answer this question.)
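For checking answers, here is a minimal Python sketch that computes the information gain of every attribute at the root for the dataset above. It uses exact base-2 logarithms, so the numbers differ slightly from ones obtained with the stated approximations.

    import math

    # Information gain IG(Xi, Y) of each attribute at the root.
    # Rows are (X1, X2, X3, X4, X5, Y), copied from the table above.
    rows = [
        (1, 0, 1, 1, 0, 0),
        (1, 0, 0, 1, 0, 0),
        (1, 0, 1, 0, 1, 1),
        (1, 0, 1, 0, 1, 1),
        (0, 0, 1, 0, 1, 1),
        (0, 0, 0, 1, 0, 0),
        (1, 0, 0, 1, 1, 0),
        (1, 1, 0, 1, 0, 1),
    ]

    def entropy(labels):
        # Binary entropy of a list of 0/1 labels, in bits.
        p = sum(labels) / len(labels)
        if p in (0.0, 1.0):
            return 0.0
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    y = [r[-1] for r in rows]
    for i in range(5):
        # Partition the labels by the value of attribute Xi.
        groups = {}
        for r in rows:
            groups.setdefault(r[i], []).append(r[-1])
        remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
        print(f"IG(X{i + 1}, Y) = {entropy(y) - remainder:.3f}")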
For the next two questions, consider the following dataset:

x1 x2 y
0 0 1
0 1 1
1 1 0
1 1 1
0 0 0
0 1 0
0 0 1
0 1 0
(a) Say we want to use a Naive Bayes classifier to predict y using x1 and x2. What is
P(y = 0 | x1 = 0, x2 = 0) according to this Naive Bayes classifier? (You can leave the final answer as a fraction.)
[1 pts]
(b) What is the expected error rate of the Naive Bayes classifier on test samples generated according to
probabilities estimated using the dataset in the table above? Pick the correct answer.
(i) 1/2 (ii) 3/4 (iii) 5/8 (iv) 3/8 [2 pts]
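A minimal Python sketch for part (a) (the helpers prob and joint are illustrative names of our own): it estimates the Naive Bayes factors by counting rows of the table and returns the posterior P(y = 0 | x1 = 0, x2 = 0) as an exact fraction.

    from fractions import Fraction

    # Naive Bayes with simple count-based (MLE) estimates from the table above.
    # Rows are (x1, x2, y).
    rows = [(0, 0, 1), (0, 1, 1), (1, 1, 0), (1, 1, 1),
            (0, 0, 0), (0, 1, 0), (0, 0, 1), (0, 1, 0)]

    def prob(pred, cond=lambda r: True):
        # Empirical probability of pred among rows satisfying cond.
        matching = [r for r in rows if cond(r)]
        return Fraction(sum(pred(r) for r in matching), len(matching))

    # P(y) * P(x1=0 | y) * P(x2=0 | y), the Naive Bayes joint score for class y.
    def joint(y):
        return (prob(lambda r: r[2] == y)
                * prob(lambda r: r[0] == 0, lambda r: r[2] == y)
                * prob(lambda r: r[1] == 0, lambda r: r[2] == y))

    posterior = joint(0) / (joint(0) + joint(1))
    print(posterior)  # the posterior P(y=0 | x1=0, x2=0) as an exact fraction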
Consider the perceptron weight update rule, applied whenever the current weights w misclassify a training example (x, y) with y ∈ {−1, +1}:

if y w^T x ≤ 0 then
    w ← w + η y x
Here, η is a learning rate. Say we want to modify the weight update rule by appropriately setting η
such that if the algorithm sees the same example twice in a row, it will never incorrectly label the second
occurrence of the example. Which of the following constraints on η would help satisfy this property?
[1 pts]
(a) η ≤ −y w^T x / ||x||^2
(b) η > −y w^T x / ||x||^2
(c) η ≤ (1 − y w^T x) / ||x||^2
(d) η > (1 − y w^T x) / ||x||^2
[Pen and paper] Show how you derived the constraint you picked above. [3 pts]
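A derivation sketch for reference (assuming labels y ∈ {−1, +1}, so y^2 = 1): after the update w' = w + η y x, the score on the repeated example is

    \[
    y \, w'^\top x
      \;=\; y \, (w + \eta y x)^\top x
      \;=\; y \, w^\top x + \eta \, y^2 \, \|x\|^2
      \;=\; y \, w^\top x + \eta \, \|x\|^2 ,
    \]

which is strictly positive, i.e. the second occurrence is classified correctly, exactly when η > −y w^T x / ||x||^2, matching option (b).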