Midterm Sp16 Solutions
Name:
Andrew ID:
• You are allowed to use one page of notes, front and back.
• Note that the questions vary in difficulty. Make sure to look over the entire exam
before you start and answer the easier questions first.
Question        Points
1               20
2               20
3               20
4               20
5               20
Extra Credit    14
Total           114
Solution:

ℓ(θ; X_1, …, X_n) = log L(θ; X_1, …, X_n)
                  = log[ θ^(Σ_{i=1}^n x_i) (1 − θ)^(n − Σ_{i=1}^n x_i) ]
                  = (Σ_{i=1}^n x_i) log(θ) + (n − Σ_{i=1}^n x_i) log(1 − θ)
(c) Extra Credit: [2 pts.] Derive the following formula for the MLE: θ̂ = (1/n) Σ_{i=1}^n X_i.
Solution: To find the MLE we solve dℓ(θ; X_1, …, X_n)/dθ = 0. The derivative is given by

d/dθ ℓ(θ; X_1, …, X_n) = d/dθ [ (Σ_{i=1}^n x_i) log(θ) + (n − Σ_{i=1}^n x_i) log(1 − θ) ]
                       = (Σ_{i=1}^n x_i)/θ − (n − Σ_{i=1}^n x_i)/(1 − θ)

Next, we solve dℓ(θ; X_1, …, X_n)/dθ = 0:

(Σ_{i=1}^n x_i)/θ − (n − Σ_{i=1}^n x_i)/(1 − θ) = 0
⇐⇒ (Σ_{i=1}^n x_i)(1 − θ) − (n − Σ_{i=1}^n x_i) θ = 0
⇐⇒ Σ_{i=1}^n x_i − nθ = 0
⇐⇒ θ̂ = (1/n) Σ_{i=1}^n X_i.
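A quick numerical sketch of this result (assuming NumPy and simulated Bernoulli data, used only for illustration): the θ that maximizes ℓ over a fine grid coincides with the closed-form MLE, the sample mean.

import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=1000)   # simulated Bernoulli(0.3) samples (assumed data)
s, n = x.sum(), len(x)

def log_likelihood(theta):
    # l(theta) = (sum_i x_i) log(theta) + (n - sum_i x_i) log(1 - theta)
    return s * np.log(theta) + (n - s) * np.log(1 - theta)

grid = np.linspace(0.001, 0.999, 999)
theta_grid = grid[np.argmax(log_likelihood(grid))]
theta_mle = x.mean()                  # closed form: (1/n) sum_i X_i
print(theta_grid, theta_mle)          # both close to the true value 0.3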
(a) [2 pts.] T or F: In the limit, as n (the number of samples) increases, the MAP and
MLE estimates become the same.
Solution: True. As the number of examples increases, the data log-likelihood term grows in magnitude roughly linearly with n, while the log prior stays fixed. In the limit, the prior plays an insignificant role in the MAP estimate and the two estimates converge.
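For example, under an assumed Beta(2, 2) prior on a Bernoulli parameter θ, the MAP estimate (s + α − 1)/(n + α + β − 2) and the MLE s/n converge as n grows; a minimal sketch (assuming NumPy and simulated data):

import numpy as np

rng = np.random.default_rng(1)
alpha, beta = 2.0, 2.0            # assumed Beta(2, 2) prior
theta_true = 0.7                  # assumed true parameter for the simulation

for n in [10, 100, 1000, 10000]:
    s = rng.binomial(1, theta_true, size=n).sum()
    mle = s / n
    map_est = (s + alpha - 1) / (n + alpha + beta - 2)   # Beta-Bernoulli MAP estimate
    print(n, round(mle, 4), round(map_est, 4))           # the gap between the two shrinks with n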
(b) [2 pts.] T or F: Naive Bayes can only be used with MAP estimates, and not MLE
estimates.
Solution: False. In Naive Bayes we need to estimate the distribution of each feature Xi
given the label Y . Any technique for estimating the distribution can be used, including
both the MLE and the MAP estimate.
1.4 Probability
Assume we have a sample space Ω. Answer each question with T or F. No justification
is required.
(a) [1 pts.] T or F: If events A, B, and C are disjoint then they are independent.
Solution: False. If they are disjoint, i.e. mutually exclusive, they are highly dependent: knowing that A occurred tells us that B and C did not occur.
(b) [1 pts.] T or F: P(A|B) ∝ P(A)P(B|A) / P(A|B). (The sign '∝' means 'is proportional to')
Solution: False. P(A|B) = P(A)P(B|A) / P(B).
(c) [1 pts.] T or F: P(A ∪ B) ≤ P(A).
Solution: False. P(A ∪ B) ≥ P(A), since A ⊆ A ∪ B.
(a) [4 pts.] T or F: The true error of a hypothesis h can be lower than its training error on
the sample S.
Solution: True. If the sample S happens to favor part of the space where h makes
mistakes then the sample error will overestimate the true error. An extreme example is
when the hypothesis h has true error 0.5, but the training sample S contains a single
sample that h incorrectly classifies.
(b) [4 pts.] T or F: Learning theory allows us to determine with 100% certainty the true error of a hypothesis to within any ε > 0.
Solution: False. There is always a small chance that the sample S is not representative
of the underlying distribution D, in which case the sample S may have no relationship
to the true error. The sample complexity bounds discussed in class show that this is
rare, but not impossible.
[Figure: two error curves, labeled curve (i) and curve (ii), plotted as error versus the number of training examples.]
(a) [8 pts.] Which curve represents the training error? Please provide 1–2 sentences of
justification.
Solution: It is easier to correctly classify small training datasets. For example, if the
data contains just a single point, then logistic regression will always have zero training
error. On the other hand, we don’t expect a classifier learned from few examples to
generalize well, so for small training sets the true error is large. Therefore, curve (ii)
shows the general trend of the training error.
(b) [4 pt.] In one word, what does the gap between the two curves represent?
Solution: The gap between the two curves represents the amount of overfitting.
Consider the dataset S plotted in Fig. 1 along with its associated regression line. For
each of the altered data sets S new plotted in Fig. 3, indicate which regression line (relative
to the original one) in Fig. 2 corresponds to the regression line for the new data set. Write
your answers in the table below.
(a) Adding one outlier to the original data set.
(b) Adding two outliers to the original data set.
(c) Adding three outliers to the original data set. Two on one side and one on the other side.
(d) Duplicating the original data set.
(a) [5 pts.] Is it possible to get a closed form for the parameters ŵ that maximize the
conditional log likelihood? How would you compute ŵ in practice?
Solution: There is no closed form expression for maximizing the conditional log likeli-
hood. One has to consider iterative optimization methods, such as gradient descent, to
compute ŵ.
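A minimal sketch of this in practice (assuming NumPy, a tiny synthetic dataset with the bias folded into X as a constant column, and a fixed step size): gradient ascent on the conditional log likelihood, since no closed form exists.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, iters=1000):
    # Gradient ascent on the conditional log likelihood
    #   l(w) = sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ],  p_i = sigmoid(w . x_i),
    # whose gradient is X^T (y - p); there is no closed-form maximizer.
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ w)
        w += lr * X.T @ (y - p)
    return w

# Tiny synthetic example (assumed data); first column is a constant bias feature.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w_hat = fit_logistic(X, y)
y_hat = (X @ w_hat >= 0).astype(int)   # predict 1 iff w . x >= 0 (a linear classifier)
print(w_hat, y_hat)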
(b) [5 pts.] What is the form of the classifier output by logistic regression?
Solution: Given x, we predict ŷ = 1 if p(y = 1|x) ≥ p(y = 0|x). This reduces to ŷ = 1 if wᵀx ≥ 0, which is a linear classifier.
(c) [2 pts.] Extra Credit: Consider the case with binary features, i.e., x ∈ {0, 1}^d ⊂ R^d, where feature x1 is rare and happens to appear in the training set with only label 1. What is ŵ1? Is the gradient ever zero for any finite w? Why is it important to include a regularization term to control the norm of ŵ?
Solution: If a binary feature is active only for label 1 in the training set then, by maximizing the conditional log likelihood, we will drive the weight associated with that feature to infinity: whenever this feature is observed, we will want to predict 1 irrespective of everything else. The gradient is therefore never zero for any finite w, since increasing w1 always increases the conditional likelihood of those training examples. This is undesirable behaviour from the point of view of generalization performance, as most likely we do not believe this rare feature carries that much information about class 1; most likely it is a spurious co-occurrence. Controlling the norm of the weight vector prevents these pathological cases.
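A sketch of this failure mode and of the regularization fix (assuming NumPy and a toy dataset in which the first binary feature appears only with label 1): without a penalty the corresponding weight keeps growing as training continues, while an L2 penalty keeps it bounded.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit(X, y, lam=0.0, lr=0.1, iters=5000):
    # Gradient ascent on the L2-penalized conditional log likelihood
    #   l(w) - (lam / 2) * ||w||^2, whose gradient is X^T (y - p) - lam * w.
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ w)
        w += lr * (X.T @ (y - p) - lam * w)
    return w

# Toy data (assumed): feature 0 is rare and appears only in a label-1 example.
X = np.array([[1.0, 1.0], [0.0, 1.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]])
y = np.array([1.0, 1.0, 0.0, 1.0, 0.0])

print(fit(X, y, lam=0.0)[0])   # unpenalized: this weight keeps growing as iters increases
print(fit(X, y, lam=1.0)[0])   # L2-regularized: the weight stays bounded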
(b) [2 pts.] Suppose φ(x) is an arbitrary feature mapping from input x ∈ X to φ(x) ∈ R^N and let K(x, z) = φ(x) · φ(z). Then K(x, z) will always be a valid kernel function.
Solution: True. Any function expressible as an inner product of feature maps is, by definition, a valid (positive semidefinite) kernel.
(c) [2 pts.] Given the same training data, in which the points are linearly separable, the
margin of the decision boundary produced by SVM will always be greater than or equal
to the margin of the decision boundary produced by Perceptron.
Solution: True. SVM explicitly maximizes the margin; Perceptron does not differentiate between separating decision boundaries, stopping at any separator that correctly classifies the training data regardless of its margin.
(b) [3 pt.] Recall that when the data are not linearly separable, SVM minimizes ‖w‖² + C Σ_i ξ_i subject to the constraint that ∀i, y_i w · x_i ≥ 1 − ξ_i and ξ_i ≥ 0. Which of the following may happen to the size of the margin if the tradeoff parameter C is increased? Circle all that apply.
• Increases — No
• Decreases — Yes
Next, we show that if w₀* = argmin_w ‖w‖² + c₀ Σ_i ξ_i and w₁* = argmin_w ‖w‖² + c₁ Σ_i ξ_i, then ‖w₁*‖² ≥ ‖w₀*‖².

By contradiction, assume ‖w₁*‖² < ‖w₀*‖². Since w₀* = argmin_w ‖w‖² + c₀ Σ_i ξ_i:

‖w₀*‖² + c₀ Σ_i ξ_i^(0) ≤ ‖w₁*‖² + c₀ Σ_i ξ_i^(1)

Combined with the assumption ‖w₁*‖² < ‖w₀*‖², this inequality forces Σ_i ξ_i^(0) < Σ_i ξ_i^(1). Since c₁ > c₀ and Σ_i ξ_i^(1) > Σ_i ξ_i^(0), it follows that

‖w₀*‖² + c₁ Σ_i ξ_i^(0) < ‖w₁*‖² + c₁ Σ_i ξ_i^(1)

But w₁* = argmin_w ‖w‖² + c₁ Σ_i ξ_i, so no feasible solution can achieve a strictly smaller objective value under c₁. This contradiction shows ‖w₁*‖² ≥ ‖w₀*‖².
4.3 Analysis
(a) [4 pts.] In one or two sentences, describe the benefit of using the Kernel trick.
Solution: The kernel trick lets us implicitly map features into a higher-dimensional space (enabling non-linear decision boundaries) while avoiding the extra computational cost of constructing that higher-dimensional feature map explicitly; we only ever evaluate kernel values K(x, z).
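A small sketch of that saving (assuming a degree-2 polynomial kernel on 2-D inputs, chosen purely for illustration): the kernel value equals an inner product in a 6-dimensional feature space, but it is computed entirely in the original 2-D space.

import numpy as np

def phi(x):
    # Explicit degree-2 feature map for a 2-D input (6 dimensions).
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 * x1, x2 * x2,
                     np.sqrt(2) * x1 * x2])

def poly_kernel(x, z):
    # Degree-2 polynomial kernel, computed in the original 2-D space.
    return (np.dot(x, z) + 1.0) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(np.dot(phi(x), phi(z)))  # explicit high-dimensional mapping
print(poly_kernel(x, z))       # kernel trick: same value, no explicit mapping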
(b) [4 pt.] The concept of margin is essential in both SVM and Perceptron. Describe why a
large margin separator is desirable for classification.
Solution: Separators with large margin will have low generalization errors with high
probability.
(c) [4 pts.] Extra Credit: Consider the dataset in Fig. 4. Under the SVM formulation in
section 4.2(a),
(a) [3 pts.] T or F: It is possible to label 4 points in R² in all possible 2⁴ ways via linear separators in R².
Solution: F. The VC dimension of linear separators in R² is 3, hence a set of size 4 cannot be shattered, i.e., labeled in all possible ways.
(d) [3 pts.] T or F: The VC dimension of a concept class with infinite size is also infinite.
Solution: F. Consider all the half-spaces in R², which have infinite cardinality but VC dimension 3.
(f) [3 pts.] T or F: Given a realizable concept class and a set of training instances, a
consistent learner will output a concept that achieves 0 error on the training instances.
Solution: T. Since the concept class is realizable, a hypothesis with zero training error (e.g., the target concept itself) exists in the class, and a consistent learner by definition outputs such a hypothesis.
5.2 VC dimension
Briefly explain in 2–3 sentences the importance of sample complexity and VC dimension
in learning with generalization guarantees.
Solution: Sample complexity guarantees quantify how many training samples we need from the underlying data distribution D so that, uniformly over all hypotheses in the class of functions under consideration, the empirical error rates are close to the true errors. This is important because we care about finding a hypothesis of small true error, but we can only optimize over a fixed training sample. VC bounds are one kind of sample complexity guarantee, where the bound depends on the VC-dimension of the hypothesis class; they are particularly useful when the class of functions is infinite.
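For reference, a sketch of one common form of such a guarantee (the exact constants and logarithmic factors vary across textbooks): for a hypothesis class H of VC-dimension d, a sample of size

\[
m \;=\; O\!\left(\frac{1}{\epsilon^{2}}\left(d\,\ln\frac{1}{\epsilon} \;+\; \ln\frac{1}{\delta}\right)\right)
\quad\Longrightarrow\quad
\Pr\Big[\,\forall h \in \mathcal{H}:\ \big|\widehat{\mathrm{err}}_S(h) - \mathrm{err}_D(h)\big| \le \epsilon\,\Big] \;\ge\; 1-\delta ,
\]

i.e., with that many i.i.d. samples from D, with probability at least 1 − δ every hypothesis in the class has empirical error within ε of its true error.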
[Figure 5: (a) The dataset with groups S1, S2, and S3, plotted in the (x1, x2) plane with both axes ranging from 0 to 5. (b) The neural network architecture: inputs x1 and x2 feed hidden units h1 and h2, which connect to the output unit through weights w31 and w32.]

[Figure 6: NN classification. Three copies of the (x1, x2) plane for drawing decision boundaries: (a) sets S2 and S3, (b) sets S1 and S2, (c) sets S1, S2 and S3.]
(a) First we will set the parameters w11 , w12 and b1 of the neuron labeled h1 so that its
output h1 (x) = φ(w11 x1 + w12 x2 + b1 ) forms a linear separator between the sets S2 and
S3 .
(1) [1 pt.] On Fig. 6a, draw a linear decision boundary that separates S2 and S3 .
(2) [1 pt.] Write down the corresponding weights w11 , w12 , and b1 so that h1 (x) = 0 for
all points in S3 and h1 (x) = 1 for all points in S2 .
Solution: w11 = −1, w12 = 0, b1 = 3. With these parameters, we have w11 x1 + w12 x2 + b1 > 0 if and only if −x1 > −3, which is equivalent to x1 < 3.
(b) Next we set the parameters w21 , w22 and b2 of the neuron labeled h2 so that its output
h2 (x) = φ(w21 x1 + w22 x2 + b2 ) forms a linear separator between the sets S1 and S2 .
(1) [1 pt.] On Fig. 6b, draw a linear decision boundary that separates S1 and S2 .
(2) [1 pt.] Write down the corresponding weights w21 , w22 , and b2 so that h2 (x) = 0 for
all points in S1 and h2 (x) = 1 for all points in S2 .
Solution: The provided line has a slope of −2 and crosses the x2 axis at the value 5. From this, the equation for the region above the line (those points for which h2(x) = 1) is given by x2 ≥ −2x1 + 5 or, equivalently, x2 + 2x1 − 5 ≥ 0. Therefore, w21 = 2, w22 = 1, b2 = −5.
(c) Now we have two classifiers h1 (to classify S2 from S3 ) and h2 (to classify S1 from S2 ). We
will set the weights of the final neuron of the neural network based on the results from h1
and h2 to classify the crosses from the circles. Let h3(x) = φ(w31 h1(x) + w32 h2(x) + b3).
(1) [1 pt.] Compute w31 , w32 , b3 such that h3 (x) correctly classifies the entire dataset.
Solution: Consider the weights w31 = w32 = 1 and b3 = −1.5. With these weights, h3(x) = 1 if h1(x) + h2(x) ≥ 1.5. For points in S1 and S3, either h1 or h2 is zero, so they will be classified as 0, as required. For points in S2, both h1 and h2 output 1, so the point is classified as 1. This rule has zero training error; see the sketch after part (2) below.
(2) [1 pt.] Draw your decision boundary in Fig. 6c.
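A minimal sketch of this construction (assumptions: a unit-step activation φ(z) = 1 if z ≥ 0 and 0 otherwise, plus a few hypothetical representative points for S1, S2, and S3, since the exact coordinates come from Fig. 5a):

def phi(z):
    # Unit-step activation (assumed): outputs 1 when z >= 0, else 0.
    return 1.0 if z >= 0 else 0.0

def h1(x):
    # Fires (outputs 1) iff x1 < 3: separates S2 (output 1) from S3 (output 0).
    return phi(-1.0 * x[0] + 0.0 * x[1] + 3.0)

def h2(x):
    # Fires iff x2 + 2*x1 - 5 >= 0: separates S2 (output 1) from S1 (output 0).
    return phi(2.0 * x[0] + 1.0 * x[1] - 5.0)

def h3(x):
    # Output unit: effectively an AND of h1 and h2 via the 1.5 threshold.
    return phi(1.0 * h1(x) + 1.0 * h2(x) - 1.5)

# Hypothetical points standing in for the groups in Fig. 5a.
groups = {"S1": [(1.0, 1.0), (0.5, 2.0)],   # expect output 0
          "S2": [(2.0, 4.0), (1.5, 3.5)],   # expect output 1
          "S3": [(4.0, 2.0), (4.5, 4.0)]}   # expect output 0
for name, pts in groups.items():
    print(name, [h3(p) for p in pts])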