CS 229, Public Course Problem Set #3 Solutions: Learning Theory and Unsupervised Learning
1. Uniform Convergence and Model Selection
For this question you will prove the following bound. Let any δ > 0 be fixed. Then with
probability at least 1 − δ, we have that
\[
\varepsilon(\hat{h}) \;\le\; \min_{i=1,\dots,k}\left( \varepsilon(h_i^{\ast}) + \sqrt{\frac{2}{(1-\beta)m}\log\frac{4|H_i|}{\delta}} \right) + \sqrt{\frac{2}{\beta m}\log\frac{4k}{\delta}}
\]
Answer: For each ĥ_i, the empirical error on the cross-validation set, ε̂_{S_cv}(ĥ_i), is the
average of βm random variables with mean ε(ĥ_i), so by the Hoeffding inequality, for
any ĥ_i,

\[
P\left(|\varepsilon(\hat{h}_i) - \hat{\varepsilon}_{S_{\mathrm{cv}}}(\hat{h}_i)| \ge \gamma\right) \le 2\exp(-2\gamma^2 \beta m).
\]

As in the class notes, to ensure that this holds simultaneously for all ĥ_i, we take the union
bound over all k of the ĥ_i's, which gives

\[
P\left(\exists\, i.\ |\varepsilon(\hat{h}_i) - \hat{\varepsilon}_{S_{\mathrm{cv}}}(\hat{h}_i)| \ge \gamma\right) \le 2k\exp(-2\gamma^2 \beta m).
\]
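Setting the right-hand side equal to δ/2 and solving for γ gives the margin used in part (b):

\[
2k\exp(-2\gamma^2 \beta m) = \frac{\delta}{2}
\quad\Longrightarrow\quad
\gamma = \sqrt{\frac{1}{2\beta m}\log\frac{4k}{\delta}},
\]

so with probability at least 1 − δ/2, |ε(ĥ_i) − ε̂_{S_cv}(ĥ_i)| ≤ γ holds for every ĥ_i.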
Answer: Let j = arg min_i ε(ĥ_i). Using part (a), with probability at least 1 − δ/2,

\[
\begin{aligned}
\varepsilon(\hat{h}) &\le \hat{\varepsilon}_{S_{\mathrm{cv}}}(\hat{h}) + \sqrt{\frac{1}{2\beta m}\log\frac{4k}{\delta}} \\
&= \min_i\, \hat{\varepsilon}_{S_{\mathrm{cv}}}(\hat{h}_i) + \sqrt{\frac{1}{2\beta m}\log\frac{4k}{\delta}} \\
&\le \hat{\varepsilon}_{S_{\mathrm{cv}}}(\hat{h}_j) + \sqrt{\frac{1}{2\beta m}\log\frac{4k}{\delta}} \\
&\le \varepsilon(\hat{h}_j) + 2\sqrt{\frac{1}{2\beta m}\log\frac{4k}{\delta}} \\
&= \min_{i=1,\dots,k}\, \varepsilon(\hat{h}_i) + \sqrt{\frac{2}{\beta m}\log\frac{4k}{\delta}}.
\end{aligned}
\]
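The last equality uses the definition of j, namely ε(ĥ_j) = min_i ε(ĥ_i), together with the identity

\[
2\sqrt{\frac{1}{2\beta m}\log\frac{4k}{\delta}} = \sqrt{\frac{2}{\beta m}\log\frac{4k}{\delta}}.
\]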
(c) Let j = arg min_i ε(ĥ_i). We know from class that for H_j, with probability 1 − δ/2,

\[
|\varepsilon(\hat{h}_j) - \varepsilon(h_j^{\star})| \le \sqrt{\frac{2}{(1-\beta)m}\log\frac{4|H_j|}{\delta}}.
\]
Use this to prove the final bound given at the beginning of this problem.
Answer: The bounds in parts (a) and (c) both hold simultaneously with probability
(1 − δ/2)² = 1 − δ + δ²/4 > 1 − δ, so with probability greater than 1 − δ,

\[
\varepsilon(\hat{h}) \le \varepsilon(h_j^{\star}) + 2\sqrt{\frac{1}{2(1-\beta)m}\log\frac{2|H_j|}{\delta/2}} + 2\sqrt{\frac{1}{2\beta m}\log\frac{2k}{\delta/2}}.
\]
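The two square-root terms simplify as

\[
2\sqrt{\frac{1}{2(1-\beta)m}\log\frac{2|H_j|}{\delta/2}} = \sqrt{\frac{2}{(1-\beta)m}\log\frac{4|H_j|}{\delta}}
\qquad\text{and}\qquad
2\sqrt{\frac{1}{2\beta m}\log\frac{2k}{\delta/2}} = \sqrt{\frac{2}{\beta m}\log\frac{4k}{\delta}},
\]

which are exactly the terms appearing in the bound stated at the beginning of the problem.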
2. VC Dimension
Let the input domain of a learning problem be X = R. Give the VC dimension for each
of the following classes of hypotheses. In each case, if you claim that the VC dimension is
d, then you need to show that the hypothesis class can shatter d points, and explain why
there are no d + 1 points it can shatter.
3. ℓ1 Regularization for Least Squares

There has been a great deal of recent interest in ℓ1 regularization, which, as we will see,
has the benefit of outputting sparse solutions (i.e., many components of the resulting θ are
equal to zero).
The ℓ1 regularized least squares problem is more difficult than the unregularized or ℓ2
regularized case, because the ℓ1 term is not differentiable. However, there have been many
efficient algorithms developed for this problem that work very well in practice. One very
straightforward approach, which we have already seen in class, is the coordinate descent
method. In this problem you’ll derive and implement a coordinate descent algorithm for
ℓ1 regularized least squares, and apply it to test data.
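For reference, the optimization objective referred to in part (a) below is the ℓ1 regularized least squares problem

\[
\min_{\theta}\; J(\theta) = \frac{1}{2}\|X\theta - \vec{y}\|_2^2 + \lambda\|\theta\|_1 .
\]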
(a) Here we’ll derive the coordinate descent update for a given θi . Given the X and
~y matrices, as defined in the class notes, as well as a parameter vector θ, how can we
adjust θi so as to minimize the optimization objective? To answer this question, we’ll
rewrite the optimization objective above as
\[
J(\theta) = \frac{1}{2}\|X\theta - \vec{y}\|_2^2 + \lambda\|\theta\|_1 = \frac{1}{2}\|X\bar{\theta} + X_i\theta_i - \vec{y}\|_2^2 + \lambda\|\bar{\theta}\|_1 + \lambda|\theta_i|
\]
where Xi ∈ Rm denotes the ith column of X, and θ̄ is equal to θ except with θ̄i = 0;
all we have done in rewriting the above expression is to make the θi term explicit in
the objective. However, this still contains the |θi | term, which is non-differentiable
and therefore difficult to optimize. To get around this we make the observation that
θi must be either non-negative or non-positive. But if we knew the sign of
θi, then |θi| becomes just a linear term. That is, we can rewrite the objective as
\[
J(\theta) = \frac{1}{2}\|X\bar{\theta} + X_i\theta_i - \vec{y}\|_2^2 + \lambda\|\bar{\theta}\|_1 + \lambda s_i\theta_i
\]
where si denotes the sign of θi , si ∈ {−1, 1}. In order to update θi , we can just
compute the optimal θi for both possible values of si (making sure that we restrict
the optimal θi to obey the sign restriction we used to solve for it), then look to see
which achieves the best objective value.
For each of the possible values of si , compute the resulting optimal value of θi . [Hint:
to do this, you can fix si in the above equation, then differentiate with respect to θi
to find the best value. Finally, clip θi so that it lies in the allowable range — i.e., for
si = 1, you need to clip θi such that θi ≥ 0.]
Answer: For si = 1,
\[
\begin{aligned}
J(\theta) &= \frac{1}{2}\operatorname{tr}\!\left[(X\bar{\theta} + X_i\theta_i - \vec{y})^T(X\bar{\theta} + X_i\theta_i - \vec{y})\right] + \lambda\|\bar{\theta}\|_1 + \lambda\theta_i \\
&= \frac{1}{2}\left(X_i^T X_i\,\theta_i^2 + 2X_i^T(X\bar{\theta} - \vec{y})\,\theta_i + \|X\bar{\theta} - \vec{y}\|_2^2\right) + \lambda\|\bar{\theta}\|_1 + \lambda\theta_i ,
\end{aligned}
\]
so
\[
\frac{\partial J(\theta)}{\partial \theta_i} = X_i^T X_i\,\theta_i + X_i^T(X\bar{\theta} - \vec{y}) + \lambda ,
\]
which means the optimal θi is given by
\[
\theta_i = \max\!\left\{\frac{-X_i^T(X\bar{\theta} - \vec{y}) - \lambda}{X_i^T X_i},\; 0\right\}.
\]
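By the same calculation with s_i = −1 (the objective now contains −λθ_i, and the optimum must be clipped so that θ_i ≤ 0), the optimal value is

\[
\theta_i = \min\!\left\{\frac{-X_i^T(X\bar{\theta} - \vec{y}) + \lambda}{X_i^T X_i},\; 0\right\}.
\]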
(b) Implement the above coordinate descent algorithm using the updates you found in
the previous part. We have provided a skeleton theta = l1ls(X,y,lambda) function
in the q3/ directory. To implement the coordinate descent algorithm, you should
repeatedly iterate over all the θi ’s, adjusting each as you found above. You can
terminate the process when θ changes by less than 10⁻⁵ after all n of the updates.
Answer: The following is our implementation of l1ls.m:
m = size(X,1);
n = size(X,2);
theta = zeros(n,1);
old_theta = ones(n,1);
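The lines above show the initialization. A minimal sketch of the remaining coordinate descent loop, following the updates derived in part (a), is given below; it is one possible completion rather than the official l1ls.m listing.

while norm(theta - old_theta) > 1e-5
  old_theta = theta;
  for i = 1:n,
    theta(i) = 0;                      % theta now plays the role of theta-bar
    Xi = X(:,i);
    r = X*theta - y;                   % X*theta_bar - y
    % candidate updates for s_i = 1 and s_i = -1, clipped to the allowed sign
    theta_p = max((-Xi'*r - lambda) / (Xi'*Xi), 0);
    theta_m = min((-Xi'*r + lambda) / (Xi'*Xi), 0);
    % keep whichever candidate attains the lower objective value
    obj_p = 0.5*norm(r + Xi*theta_p)^2 + lambda*norm(theta,1) + lambda*abs(theta_p);
    obj_m = 0.5*norm(r + Xi*theta_m)^2 + lambda*norm(theta,1) + lambda*abs(theta_m);
    if (obj_p <= obj_m)
      theta(i) = theta_p;
    else
      theta(i) = theta_m;
    end
  end
end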
4. K-Means Clustering
In this problem you’ll implement the K-means clustering algorithm on a synthetic data
set. There is code and data for this problem in the q4/ directory. Run load 'X.dat'; to
load the data file for clustering. Implement the [clusters, centers] = k_means(X, k)
function in this directory. As input, this function takes the m × n data matrix X and
the number of clusters k. It should output an m-element vector, clusters, which indicates
which of the clusters each data point belongs to, and a k × n matrix, centers, which
contains the centroids of each cluster. Run the algorithm on the data provided, with k = 3
and k = 4. Plot the cluster assignments and centroids for each iteration of the algorithm
using the draw_clusters(X, clusters, centroids) function. For each k, be sure to run
the algorithm several times using different initial centroids.
Answer: The following is our implementation of k_means.m:
m = size(X,1);
n = size(X,2);
oldcentroids = zeros(k,n);
centroids = X(ceil(rand(k,1)*m),:);
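A minimal sketch of the remaining loop (alternating the assignment and centroid-update steps) follows; it is one possible completion rather than the official k_means.m listing.

clusters = zeros(m,1);
while norm(oldcentroids - centroids) > 1e-10
  oldcentroids = centroids;
  % assignment step: put each point in the cluster of its nearest centroid
  for i = 1:m,
    dists = sum((centroids - repmat(X(i,:), k, 1)).^2, 2);
    [mindist, ind] = min(dists);
    clusters(i) = ind;
  end
  % update step: move each centroid to the mean of its assigned points
  for j = 1:k,
    if any(clusters == j)
      centroids(j,:) = mean(X(clusters == j, :), 1);
    end
  end
  draw_clusters(X, clusters, centroids);   % plot each iteration, as the problem asks
end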
Below we show the centroid evolution for two typical runs with k = 3. Note that the different
starting positions of the clusters lead to different final clusterings.
[Figure: cluster centroids plotted at each iteration for the runs described above.]
5. The Generalized EM (GEM) Algorithm

Recall that the M-step of the standard EM algorithm performs the maximization
\[
\theta := \arg\max_{\theta}\ \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}.
\]
The GEM algorithm, in contrast, performs the following update in the M-step:

\[
\theta := \theta + \alpha \nabla_{\theta} \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}
\]
where α is a learning rate which we assume is chosen small enough such that we do not
decrease the objective function when taking this gradient step.
(a) Prove that the GEM algorithm described above converges. To do this, you should
show that the likelihood is monotonically improving, as it does for the EM algo-
rithm — i.e., show that ℓ(θ(t+1) ) ≥ ℓ(θ(t) ).
Answer: We use the same logic as for the standard EM algorithm. Specifically, just
as for EM, we have for the GEM algorithm that
\[
\begin{aligned}
\ell(\theta^{(t+1)}) &\ge \sum_i \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t+1)})}{Q_i^{(t)}(z^{(i)})} \\
&\ge \sum_i \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{Q_i^{(t)}(z^{(i)})} \\
&= \ell(\theta^{(t)}) ,
\end{aligned}
\]
where as in EM the first line holds due to Jensen's inequality, and the last line holds because
we choose the Q distribution to make this hold with equality. The only difference between
EM and GEM is the logic as to why the second line holds: for EM it held because θ(t+1)
was chosen to maximize this quantity, but for GEM it holds by our assumption that we
take a gradient step small enough so as not to decrease the objective function.
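As a purely illustrative sketch (not part of the problem set), the following Octave/MATLAB code runs GEM on a toy two-component, one-dimensional Gaussian mixture, holding the mixing weights phi and variance sigma2 fixed and taking a single gradient step on the means mu in each M-step. For a sufficiently small learning rate alpha, the stored log-likelihood values ll are non-decreasing, illustrating the argument in part (a). The model choice and all variable names here are assumptions of this sketch.

m = 200;
x = [randn(m/2,1) - 2; randn(m/2,1) + 2];          % synthetic 1-D data
phi = [0.5 0.5];  sigma2 = 1;                      % mixing weights and variance held fixed
mu = [-0.5 0.5];  alpha = 1e-3;                    % initial means and (small) learning rate
ll = zeros(200,1);
for t = 1:200
  D = repmat(x,1,2) - repmat(mu,m,1);              % x_i - mu_j, an m x 2 matrix
  p = exp(-D.^2/(2*sigma2)) / sqrt(2*pi*sigma2);   % Gaussian densities N(x_i; mu_j, sigma2)
  num = p .* repmat(phi,m,1);                      % phi_j * N(x_i; mu_j, sigma2)
  w = num ./ repmat(sum(num,2),1,2);               % E-step: posterior weights Q_i(z^(i) = j)
  ll(t) = sum(log(sum(num,2)));                    % log-likelihood before the gradient step
  grad = sum(w .* D, 1) / sigma2;                  % gradient of the EM lower bound w.r.t. mu
  mu = mu + alpha*grad;                            % GEM: one gradient ascent step in the M-step
end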
(b) Instead of using the EM algorithm at all, suppose we just want to apply gradient ascent
to maximize the log-likelihood directly. In other words, we are trying to maximize
the (non-convex) function
\[
\ell(\theta) = \sum_i \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta).
\]
Show that this procedure in fact gives the same update as the GEM algorithm de-
scribed above.
Answer: Differentiating the log likelihood directly we get
\[
\begin{aligned}
\frac{\partial}{\partial\theta_j} \sum_i \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta)
&= \sum_i \frac{1}{\sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta)} \sum_{z^{(i)}} \frac{\partial}{\partial\theta_j} p(x^{(i)}, z^{(i)}; \theta) \\
&= \sum_i \sum_{z^{(i)}} \frac{1}{p(x^{(i)}; \theta)} \cdot \frac{\partial}{\partial\theta_j} p(x^{(i)}, z^{(i)}; \theta).
\end{aligned}
\]
Recall that in the E-step we set

\[
Q_i(z^{(i)}) = p(z^{(i)} \mid x^{(i)}; \theta) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{p(x^{(i)}; \theta)},
\]

so the gradient appearing in the GEM update satisfies

\[
\sum_i \sum_{z^{(i)}} \frac{Q_i(z^{(i)})}{p(x^{(i)}, z^{(i)}; \theta)} \cdot \frac{\partial}{\partial\theta_j} p(x^{(i)}, z^{(i)}; \theta)
= \sum_i \sum_{z^{(i)}} \frac{1}{p(x^{(i)}; \theta)} \cdot \frac{\partial}{\partial\theta_j} p(x^{(i)}, z^{(i)}; \theta)
\]