CS 229, Public Course Problem Set #3 Solutions: Learning Theory and Unsupervised Learning

This document contains solutions to problems from CS229 Problem Set #3 on learning theory and unsupervised learning. The first problem proves a bound on the error of a model selection procedure using hypothesis classes of different sizes. The second problem calculates the VC dimension for various hypothesis classes of functions over the real numbers. The third problem considers regularized least squares regression using an l1 norm penalty instead of an l2 norm penalty.

Uploaded by

lalm9931

CS229 Problem Set #3 Solutions 1

CS 229, Public Course

Problem Set #3 Solutions: Learning Theory and
Unsupervised Learning

1. Uniform convergence and Model Selection

In this problem, we will prove a bound on the error of a simple model selection procedure.
Let there be a binary classification problem with labels $y \in \{0, 1\}$, and let
$H_1 \subseteq H_2 \subseteq \dots \subseteq H_k$ be $k$ different finite hypothesis classes ($|H_i| < \infty$).
Given a dataset $S$ of $m$ iid training examples, we will divide it into a training set
$S_{\mathrm{train}}$ consisting of the first $(1 - \beta)m$ examples, and a hold-out cross validation set
$S_{\mathrm{cv}}$ consisting of the remaining $\beta m$ examples. Here, $\beta \in (0, 1)$.
Let $\hat{h}_i = \arg\min_{h \in H_i} \hat{\varepsilon}_{S_{\mathrm{train}}}(h)$ be the hypothesis in $H_i$ with the lowest
training error (on $S_{\mathrm{train}}$). Thus, $\hat{h}_i$ would be the hypothesis returned by training (with
empirical risk minimization) using hypothesis class $H_i$ and dataset $S_{\mathrm{train}}$. Also let
$h_i^\star = \arg\min_{h \in H_i} \varepsilon(h)$ be the hypothesis in $H_i$ with the lowest generalization error.
Suppose that our algorithm first finds all the $\hat{h}_i$'s using empirical risk minimization, then
uses the hold-out cross validation set to select the hypothesis from the set
$\{\hat{h}_1, \dots, \hat{h}_k\}$ with minimum error on $S_{\mathrm{cv}}$. That is, the algorithm will output

\[
\hat{h} = \arg\min_{h \in \{\hat{h}_1, \dots, \hat{h}_k\}} \hat{\varepsilon}_{S_{\mathrm{cv}}}(h).
\]

For this question you will prove the following bound. Let any δ > 0 be fixed. Then with
probability at least 1 − δ, we have that
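\[
\varepsilon(\hat{h}) \le \min_{i=1,\dots,k} \left( \varepsilon(h_i^\star) + \sqrt{\frac{2}{(1-\beta)m} \log \frac{4|H_i|}{\delta}} \right) + \sqrt{\frac{2}{\beta m} \log \frac{4k}{\delta}}
\]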

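(a) Prove that with probability at least $1 - \frac{\delta}{2}$, for all $\hat{h}_i$,

\[
|\varepsilon(\hat{h}_i) - \hat{\varepsilon}_{S_{\mathrm{cv}}}(\hat{h}_i)| \le \sqrt{\frac{1}{2\beta m} \log \frac{4k}{\delta}}.
\]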

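Answer: For each $\hat{h}_i$, the empirical error on the cross-validation set,
$\hat{\varepsilon}_{S_{\mathrm{cv}}}(\hat{h}_i)$, is the average of $\beta m$ random variables with mean $\varepsilon(\hat{h}_i)$
(each $\hat{h}_i$ is chosen using $S_{\mathrm{train}}$ only, so it is fixed with respect to $S_{\mathrm{cv}}$), so by the
Hoeffding inequality, for any $\hat{h}_i$,
\[
P\left(|\varepsilon(\hat{h}_i) - \hat{\varepsilon}_{S_{\mathrm{cv}}}(\hat{h}_i)| \ge \gamma\right) \le 2 \exp(-2\gamma^2 \beta m).
\]
As in the class notes, to ensure that this holds for all $\hat{h}_i$, we apply the union bound
over all $k$ of the $\hat{h}_i$'s:

\[
P\left(\exists i \text{ s.t. } |\varepsilon(\hat{h}_i) - \hat{\varepsilon}_{S_{\mathrm{cv}}}(\hat{h}_i)| \ge \gamma\right) \le 2k \exp(-2\gamma^2 \beta m).
\]

Setting this term equal to $\delta/2$ and solving for $\gamma$ yields

\[
\gamma = \sqrt{\frac{1}{2\beta m} \log \frac{4k}{\delta}},
\]

proving the desired bound.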
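As a quick numerical sanity check of part (a), the sketch below solves for γ and plugs it back into the union bound. The function names and the example values of k, β, m, and δ are illustrative assumptions, not part of the problem set.

```python
import math

def holdout_gamma(k, beta, m, delta):
    """Deviation gamma that solves 2*k*exp(-2*gamma^2*beta*m) = delta/2."""
    return math.sqrt(math.log(4 * k / delta) / (2 * beta * m))

def union_bound_failure(k, beta, m, gamma):
    """Right-hand side of the union bound over the k selected hypotheses."""
    return 2 * k * math.exp(-2 * gamma ** 2 * beta * m)

# Hypothetical setting: k = 10 classes, m = 10000 examples, 30% held out.
gamma = holdout_gamma(k=10, beta=0.3, m=10000, delta=0.05)
failure = union_bound_failure(k=10, beta=0.3, m=10000, gamma=gamma)
print(gamma, failure)  # failure should match delta/2 up to rounding
```

Because γ shrinks like the inverse square root of βm, a larger hold-out fraction tightens this part of the bound.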

(b) Use part (a) to show that with probability $1 - \frac{\delta}{2}$,
\[
\varepsilon(\hat{h}) \le \min_{i=1,\dots,k} \varepsilon(\hat{h}_i) + \sqrt{\frac{2}{\beta m} \log \frac{4k}{\delta}}.
\]

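Answer: Let $j = \arg\min_i \varepsilon(\hat{h}_i)$. Using part (a), with probability at least $1 - \frac{\delta}{2}$,
\[
\begin{aligned}
\varepsilon(\hat{h}) &\le \hat{\varepsilon}_{S_{\mathrm{cv}}}(\hat{h}) + \sqrt{\frac{1}{2\beta m} \log \frac{4k}{\delta}} \\
&= \min_i \hat{\varepsilon}_{S_{\mathrm{cv}}}(\hat{h}_i) + \sqrt{\frac{1}{2\beta m} \log \frac{4k}{\delta}} \\
&\le \hat{\varepsilon}_{S_{\mathrm{cv}}}(\hat{h}_j) + \sqrt{\frac{1}{2\beta m} \log \frac{4k}{\delta}} \\
&\le \varepsilon(\hat{h}_j) + 2\sqrt{\frac{1}{2\beta m} \log \frac{4k}{\delta}} \\
&= \min_{i=1,\dots,k} \varepsilon(\hat{h}_i) + \sqrt{\frac{2}{\beta m} \log \frac{4k}{\delta}}.
\end{aligned}
\]
The first and fourth lines apply part (a), the second uses the definition of $\hat{h}$, and the
third holds because the minimum over $i$ is at most the value at $i = j$.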

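(c) Let $j = \arg\min_i \varepsilon(\hat{h}_i)$. We know from class that for $H_j$, with probability $1 - \frac{\delta}{2}$,
\[
|\varepsilon(\hat{h}_j) - \hat{\varepsilon}_{S_{\mathrm{train}}}(h_j^\star)| \le \sqrt{\frac{2}{(1-\beta)m} \log \frac{4|H_j|}{\delta}}, \quad \forall h_j \in H_j.
\]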

Use this to prove the final bound given at the beginning of this problem.
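Answer: The bounds in parts (a) and (c) both hold simultaneously with probability
$(1 - \frac{\delta}{2})^2 = 1 - \delta + \frac{\delta^2}{4} > 1 - \delta$, so with probability greater than $1 - \delta$,
\[
\varepsilon(\hat{h}) \le \varepsilon(h_j^\star) + 2\sqrt{\frac{1}{2(1-\beta)m} \log \frac{2|H_j|}{\delta/2}} + 2\sqrt{\frac{1}{2\beta m} \log \frac{2k}{\delta/2}}
\]

which is equivalent to the bound we want to show.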
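To get a feel for how the hold-out fraction β trades off the two square-root terms, here is a minimal sketch that evaluates their sum as derived in part (c). The function name and all numeric values (class size, k, m, δ) are hypothetical choices for illustration.

```python
import math

def selection_slack(class_size, k, beta, m, delta):
    """Sum of the training-side and hold-out-side terms in the final bound."""
    train_term = math.sqrt(2.0 / ((1.0 - beta) * m) * math.log(4.0 * class_size / delta))
    cv_term = math.sqrt(2.0 / (beta * m) * math.log(4.0 * k / delta))
    return train_term + cv_term

# Hypothetical numbers: |H_i| = 1000, k = 5 classes, m = 50000, delta = 0.05.
for beta in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(beta, round(selection_slack(1000, 5, beta, 50000, 0.05), 4))
```

The training-side term grows as β approaches 1 and the hold-out term grows as β approaches 0, so the slack is smallest somewhere in between.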

2. VC Dimension
Let the input domain of a learning problem be X = R. Give the VC dimension for each
of the following classes of hypotheses. In each case, if you claim that the VC dimension is
d, then you need to show that the hypothesis class can shatter d points, and explain why
there are no d + 1 points it can shatter.

• h(x) = 1{a < x}, with parameter a ∈ R.

Answer: VC-dimension = 1.
(a) It can shatter point {0}, by choosing a to be 2 and −2.
(b) It cannot shatter any two points {x1 , x2 }, x1 < x2 , because the labelling x1 = 1 and
x2 = 0 cannot be realized.
• h(x) = 1{a < x < b}, with parameters a, b ∈ R.
Answer: VC-dimension = 2.
(a) It can shatter points {0, 2} by choosing (a, b) to be (3, 5), (−1, 1), (1, 3), (−1, 3).
(b) It cannot shatter any three points {x1 , x2 , x3 }, x1 < x2 < x3 , because the labelling
x1 = x3 = 1, x2 = 0 cannot be realized.
• h(x) = 1{a sin x > 0}, with parameter a ∈ R.
Answer: VC-dimension = 1. a controls the amplitude of the sine curve.
(a) It can shatter the point {π/2} by choosing a to be 1 and −1.
(b) It cannot shatter any two points {x1 , x2 }, since the labellings of x1 and x2 will flip
together. If x1 = x2 = 1 for some a, then we cannot achieve x1 ≠ x2 . If x1 ≠ x2
for some a, then we cannot achieve x1 = x2 = 1 (x1 = x2 = 0 can be achieved by
setting a = 0).
• h(x) = 1{sin(x + a) > 0}, with parameter a ∈ R.
Answer: VC-dimension = 2. a controls the phase of the sine curve.
(a) It can shatter the points {π/4, 3π/4}, by choosing a to be 0, π/2, π, and 3π/2.
(b) It cannot shatter any three points {x1 , x2 , x3 }. Since sine has a period of 2π, let's
define x′i = xi mod 2π. W.l.o.g., assume x′1 < x′2 < x′3 . If the labelling
x1 = x2 = x3 = 1 can be realized, then the labelling x1 = x3 = 1, x2 = 0 will
not be realizable. Notice the similarity to the second question.
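The shattering claims above can be spot-checked by brute force. The sketch below enumerates the labelings that each hypothesis class realizes on the specific point sets used in the answers; the helper `can_shatter` and the parameter grids are assumptions made for this illustration. A finite grid can confirm that a set is shattered, but a failure on one point set is only consistent with the upper-bound arguments, not a proof of them.

```python
import math
from itertools import product

def can_shatter(points, hypotheses):
    """True if the hypotheses realize all 2^n labelings of the points."""
    realized = {tuple(h(x) for x in points) for h in hypotheses}
    return len(realized) == 2 ** len(points)

grid = [i / 4 for i in range(-20, 21)]  # candidate parameters in [-5, 5]

# h(x) = 1{a < x}
thresholds = [lambda x, a=a: int(a < x) for a in grid]
# h(x) = 1{a < x < b}
intervals = [lambda x, a=a, b=b: int(a < x < b) for a, b in product(grid, grid)]
# h(x) = 1{a sin x > 0}; only the sign of a matters, so three values suffice
scaled_sines = [lambda x, a=a: int(a * math.sin(x) > 0) for a in (-1.0, 0.0, 1.0)]
# h(x) = 1{sin(x + a) > 0} with the four phases used in the answer
shifted_sines = [lambda x, a=a: int(math.sin(x + a) > 0)
                 for a in (0.0, math.pi / 2, math.pi, 3 * math.pi / 2)]

print(can_shatter([0], thresholds), can_shatter([0, 2], thresholds))
print(can_shatter([0, 2], intervals), can_shatter([0, 1, 2], intervals))
print(can_shatter([math.pi / 2], scaled_sines),
      can_shatter([math.pi / 2, 3 * math.pi / 2], scaled_sines))
print(can_shatter([math.pi / 4, 3 * math.pi / 4], shifted_sines))
```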

3. ℓ1 regularization for least squares

In the previous problem set, we looked at the least squares problem where the objective
function is augmented with an additional regularization term $\lambda \|\theta\|_2^2$. In this problem we'll
consider a similar regularized objective but this time with a penalty on the ℓ1 norm of
the parameters $\lambda \|\theta\|_1$, where $\|\theta\|_1$ is defined as $\sum_i |\theta_i|$. That is, we want to minimize the
objective
\[
J(\theta) = \frac{1}{2} \sum_{i=1}^{m} (\theta^T x^{(i)} - y^{(i)})^2 + \lambda \sum_{i=1}^{n} |\theta_i|.
\]

There has been a great deal of recent interest in ℓ1 regularization, which, as we will see,
has the benefit of outputting sparse solutions (i.e., many components of the resulting θ are
equal to zero).
The ℓ1 regularized least squares problem is more difficult than the unregularized or ℓ2
regularized case, because the ℓ1 term is not differentiable. However, there have been many
efficient algorithms developed for this problem that work very well in practice. One very
straightforward approach, which we have already seen in class, is the coordinate descent
method. In this problem you'll derive and implement a coordinate descent algorithm for
ℓ1 regularized least squares, and apply it to test data.

(a) Here we'll derive the coordinate descent update for a given $\theta_i$. Given the $X$ and
$\vec{y}$ matrices, as defined in the class notes, as well as a parameter vector $\theta$, how can we
adjust $\theta_i$ so as to minimize the optimization objective? To answer this question, we'll
rewrite the optimization objective above as
\[
J(\theta) = \frac{1}{2} \|X\theta - \vec{y}\|_2^2 + \lambda \|\theta\|_1 = \frac{1}{2} \|X\bar{\theta} + X_i \theta_i - \vec{y}\|_2^2 + \lambda \|\bar{\theta}\|_1 + \lambda |\theta_i|
\]
where $X_i \in \mathbb{R}^m$ denotes the $i$th column of $X$, and $\bar{\theta}$ is equal to $\theta$ except with $\bar{\theta}_i = 0$;
all we have done in rewriting the above expression is to make the $\theta_i$ term explicit in
the objective. However, this still contains the $|\theta_i|$ term, which is non-differentiable
and therefore difficult to optimize. To get around this we make the observation that
the sign of $\theta_i$ must either be non-negative or non-positive. But if we knew the sign of
$\theta_i$, then $|\theta_i|$ becomes just a linear term. That is, we can rewrite the objective as
\[
J(\theta) = \frac{1}{2} \|X\bar{\theta} + X_i \theta_i - \vec{y}\|_2^2 + \lambda \|\bar{\theta}\|_1 + \lambda s_i \theta_i
\]
where $s_i$ denotes the sign of $\theta_i$, $s_i \in \{-1, 1\}$. In order to update $\theta_i$, we can just
compute the optimal $\theta_i$ for both possible values of $s_i$ (making sure that we restrict
the optimal $\theta_i$ to obey the sign restriction we used to solve for it), then look to see
which achieves the best objective value.
For each of the possible values of $s_i$, compute the resulting optimal value of $\theta_i$. [Hint:
to do this, you can fix $s_i$ in the above equation, then differentiate with respect to $\theta_i$
to find the best value. Finally, clip $\theta_i$ so that it lies in the allowable range; i.e., for
$s_i = 1$, you need to clip $\theta_i$ such that $\theta_i \ge 0$.]
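Answer: For $s_i = 1$,
\[
\begin{aligned}
J(\theta) &= \frac{1}{2} \operatorname{tr}\, (X\bar{\theta} + X_i \theta_i - \vec{y})^T (X\bar{\theta} + X_i \theta_i - \vec{y}) + \lambda \|\bar{\theta}\|_1 + \lambda \theta_i \\
&= \frac{1}{2} \left( X_i^T X_i \theta_i^2 + 2 X_i^T (X\bar{\theta} - \vec{y}) \theta_i + \|X\bar{\theta} - \vec{y}\|_2^2 \right) + \lambda \|\bar{\theta}\|_1 + \lambda \theta_i,
\end{aligned}
\]
so
\[
\frac{\partial J(\theta)}{\partial \theta_i} = X_i^T X_i \theta_i + X_i^T (X\bar{\theta} - \vec{y}) + \lambda.
\]
Setting this derivative to zero and clipping to the allowable range $\theta_i \ge 0$, the optimal $\theta_i$ is given by
\[
\theta_i = \max\left\{ \frac{-X_i^T (X\bar{\theta} - \vec{y}) - \lambda}{X_i^T X_i},\ 0 \right\}.
\]

Similarly, for $s_i = -1$, the optimal $\theta_i$ is given by

\[
\theta_i = \min\left\{ \frac{-X_i^T (X\bar{\theta} - \vec{y}) + \lambda}{X_i^T X_i},\ 0 \right\}.
\]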
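For a single coordinate, picking the better of the two clipped candidates coincides with the familiar soft-thresholding rule. The sketch below checks this on scalar inputs; the names `rho` (for $-X_i^T(X\bar{\theta} - \vec{y})$), `z` (for $X_i^T X_i$), and both function names are shorthand introduced here, not notation from the problem set.

```python
def coordinate_update(rho, z, lam):
    """Best of the two clipped candidates for one coordinate.

    rho = -X_i^T (X theta_bar - y), z = X_i^T X_i > 0, lam = lambda >= 0.
    """
    pos = max((rho - lam) / z, 0.0)   # candidate for s_i = +1
    neg = min((rho + lam) / z, 0.0)   # candidate for s_i = -1

    def objective(t):
        # Terms of J that depend on theta_i only (constants dropped).
        return 0.5 * z * t * t - rho * t + lam * abs(t)

    return pos if objective(pos) <= objective(neg) else neg

def soft_threshold_update(rho, z, lam):
    """Closed form: shrink rho toward zero by lam, then divide by z."""
    if rho > lam:
        return (rho - lam) / z
    if rho < -lam:
        return (rho + lam) / z
    return 0.0

for rho in (-3.0, -1.0, -0.2, 0.0, 0.5, 1.0, 4.0):
    print(rho, coordinate_update(rho, 2.0, 1.0), soft_threshold_update(rho, 2.0, 1.0))
```

Whenever |rho| is at most lambda both candidates clip to zero, which is the mechanism that produces exact zeros in the solution.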

(b) Implement the above coordinate descent algorithm using the updates you found in
the previous part. We have provided a skeleton theta = l1ls(X,y,lambda) function
in the q3/ directory. To implement the coordinate descent algorithm, you should
repeatedly iterate over all the $\theta_i$'s, adjusting each as you found above. You can
terminate the process when $\theta$ changes by less than $10^{-5}$ after all $n$ of the updates.
Answer: The following is our implementation of l1ls.m:

function theta = l1ls(X,y,lambda)

m = size(X,1);
n = size(X,2);
theta = zeros(n,1);
old_theta = ones(n,1);

% sweep over all coordinates until theta moves by less than 1e-5
while (norm(theta - old_theta) > 1e-5)
  old_theta = theta;
  for i=1:n
    % compute possible values for theta
    % (with theta(i) = 0, X*theta equals X*theta_bar)
    theta(i) = 0;
    theta_i(1) = max((-X(:,i)'*(X*theta - y) - lambda) / (X(:,i)'*X(:,i)), 0);
    theta_i(2) = min((-X(:,i)'*(X*theta - y) + lambda) / (X(:,i)'*X(:,i)), 0);

    % get objective value for both possible thetas
    theta(i) = theta_i(1);
    obj_theta(1) = 0.5*norm(X*theta - y)^2 + lambda*norm(theta,1);
    theta(i) = theta_i(2);
    obj_theta(2) = 0.5*norm(X*theta - y)^2 + lambda*norm(theta,1);

    % pick the theta which minimizes the objective
    [min_obj, min_ind] = min(obj_theta);
    theta(i) = theta_i(min_ind);
  end
end
(c) Test your implementation on the data provided in the q3/ directory. The [X, y,
theta_true] = load_data; function will load all the data. The data was generated
by y = X*theta_true + 0.05*randn(20,1), but theta_true is sparse, so that very
few of the columns of X actually contain relevant features. Run your l1ls.m
implementation on this data set, ranging λ from 0.001 to 10. Comment briefly on how this
algorithm might be used for feature selection.
Answer: For λ = 1, our implementation of ℓ1 regularized least squares recovers the
exact sparsity pattern of the true parameter that generated the data. In contrast, using
any amount of ℓ2 regularization still leads to θ's that contain no zeros. This suggests
that ℓ1 regularization could be very useful as a feature selection algorithm: we could
run ℓ1 regularized least squares to see which coefficients are non-zero, then select only
these features for use with either least-squares or possibly a completely different machine
learning algorithm.
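The feature-selection idea can be tried on synthetic data. The following is a minimal pure-Python sketch of the same coordinate descent, written with the equivalent soft-thresholding form of the update and an incrementally maintained residual. The data below is made up for illustration (20 examples, 10 features, two non-zero true coefficients) and is not the q3/ data set; `l1_least_squares`, `objective`, and `max_sweeps` are names introduced here.

```python
import random

def l1_least_squares(X, y, lam, tol=1e-5, max_sweeps=1000):
    """Coordinate descent for 0.5*||X theta - y||^2 + lam*||theta||_1."""
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    residual = [-yi for yi in y]                  # X theta - y with theta = 0
    col_sq = [sum(X[r][j] ** 2 for r in range(m)) for j in range(n)]
    for _ in range(max_sweeps):
        change = 0.0
        for j in range(n):
            old = theta[j]
            # rho = -X_j^T (X theta_bar - y), where theta_bar zeroes coordinate j
            rho = -sum(X[r][j] * (residual[r] - X[r][j] * old) for r in range(m))
            if rho > lam:
                new = (rho - lam) / col_sq[j]
            elif rho < -lam:
                new = (rho + lam) / col_sq[j]
            else:
                new = 0.0
            if new != old:
                for r in range(m):
                    residual[r] += X[r][j] * (new - old)
                theta[j] = new
            change += (new - old) ** 2
        if change ** 0.5 < tol:                   # theta moved less than tol
            break
    return theta

def objective(X, y, theta, lam):
    fit = sum((sum(a * t for a, t in zip(row, theta)) - yi) ** 2 for row, yi in zip(X, y))
    return 0.5 * fit + lam * sum(abs(t) for t in theta)

# Made-up sparse problem: only features 1 and 6 matter.
rng = random.Random(0)
m, n = 20, 10
true_theta = [0.0] * n
true_theta[1], true_theta[6] = 2.0, -3.0
X = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(m)]
y = [sum(X[r][j] * true_theta[j] for j in range(n)) + 0.05 * rng.gauss(0, 1)
     for r in range(m)]

for lam in (0.001, 0.01, 0.1, 1.0, 10.0):
    theta = l1_least_squares(X, y, lam)
    print(lam, [j for j, t in enumerate(theta) if t != 0.0])
```

The printed index lists are the selected features for each λ; once λ exceeds the largest |X_j^T y| every coefficient is thresholded to zero.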

4. K-Means Clustering
In this problem you'll implement the K-means clustering algorithm on a synthetic data
set. There is code and data for this problem in the q4/ directory. Run load 'X.dat';
to load the data file for clustering. Implement the [clusters, centers] = k_means(X, k)
function in this directory. As input, this function takes the m × n data matrix X and
the number of clusters k. It should output an m-element vector, clusters, which indicates
which of the clusters each data point belongs to, and a k × n matrix, centers, which
contains the centroids of each cluster. Run the algorithm on the data provided, with k = 3
and k = 4. Plot the cluster assignments and centroids for each iteration of the algorithm
using the draw_clusters(X, clusters, centroids) function. For each k, be sure to run
the algorithm several times using different initial centroids.
Answer: The following is our implementation of k_means.m:

function [clusters, centroids] = k_means(X, k)

m = size(X,1);
n = size(X,2);
clusters = zeros(m,1);
oldcentroids = zeros(k,n);
% initialize each centroid to a randomly chosen data point
centroids = X(ceil(rand(k,1)*m),:);

% iterate until the centroids stop moving
while (norm(oldcentroids - centroids) > 1e-15)
  oldcentroids = centroids;

  % compute cluster assignments
  for i=1:m
    dists = sum((repmat(X(i,:), k, 1) - centroids).^2, 2);
    [min_dist, clusters(i,1)] = min(dists);
  end

  draw_clusters(X, clusters, centroids);
  pause(0.1);

  % compute cluster centroids
  % (mean along dimension 1, so a one-point cluster still yields a row)
  for i=1:k
    centroids(i,:) = mean(X(clusters == i, :), 1);
  end
end

Below we show the centroid evolution for two typical runs with k = 3. Note that the different
starting positions of the clusters lead to different final clusterings.

[Figure: two rows of six scatter plots, one row per run, showing the cluster assignments and centroids at successive iterations; both axes span roughly −1 to 1.]
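The dependence on the starting centroids can be reproduced without the q4/ data. Below is a small Python sketch of the same two alternating steps, run from several random initializations on made-up two-dimensional blobs. It differs from the MATLAB listing in two hedged ways: it samples distinct data points as initial centroids, and it keeps the previous centroid when a cluster ends up empty. The `distortion` helper and the blob data are assumptions for this illustration.

```python
import random

def k_means(points, k, rng, max_iters=100):
    """Alternate assignment and centroid steps on a list of equal-length tuples."""
    centroids = rng.sample(points, k)              # k distinct data points
    assignments = None
    for _ in range(max_iters):
        new_assignments = [
            min(range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(x, centroids[c])))
            for x in points
        ]
        if new_assignments == assignments:         # no point changed cluster
            break
        assignments = new_assignments
        for c in range(k):
            members = [x for x, a in zip(points, assignments) if a == c]
            if members:                            # keep old centroid if empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignments, centroids

def distortion(points, assignments, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(sum((p - q) ** 2 for p, q in zip(x, centroids[a]))
               for x, a in zip(points, assignments))

# Made-up data: three Gaussian blobs of 30 points each.
data_rng = random.Random(1)
blobs = [(-0.5, -0.5), (0.5, -0.5), (0.0, 0.6)]
points = [(cx + data_rng.gauss(0, 0.1), cy + data_rng.gauss(0, 0.1))
          for cx, cy in blobs for _ in range(30)]

for seed in range(3):
    assignments, centroids = k_means(points, 3, random.Random(seed))
    print(seed, round(distortion(points, assignments, centroids), 3))
```

Comparing the printed distortion across seeds is one simple way to choose among runs that converge to different clusterings.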

5. The Generalized EM algorithm

When attempting to run the EM algorithm, it may sometimes be difficult to perform the M
step exactly: recall that we often need to implement numerical optimization to perform
the maximization, which can be costly. Therefore, instead of finding the global maximum
of our lower bound on the log-likelihood, an alternative is to just increase this lower bound
a little bit, by taking one step of gradient ascent, for example. This is commonly known
as the Generalized EM (GEM) algorithm.

Put slightly more formally, recall that the M-step of the standard EM algorithm performs
the maximization
\[
\theta := \arg\max_{\theta} \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}.
\]

The GEM algorithm, in contrast, performs the following update in the M-step:
\[
\theta := \theta + \alpha \nabla_\theta \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}
\]

where $\alpha$ is a learning rate which we assume is chosen small enough such that we do not
decrease the objective function when taking this gradient step.
(a) Prove that the GEM algorithm described above converges. To do this, you should
show that the likelihood is monotonically improving, as it does for the EM
algorithm; i.e., show that $\ell(\theta^{(t+1)}) \ge \ell(\theta^{(t)})$.
Answer: We use the same logic as for the standard EM algorithm. Specifically, just
as for EM, we have for the GEM algorithm that
\[
\begin{aligned}
\ell(\theta^{(t+1)}) &\ge \sum_i \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t+1)})}{Q_i^{(t)}(z^{(i)})} \\
&\ge \sum_i \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{Q_i^{(t)}(z^{(i)})} \\
&= \ell(\theta^{(t)})
\end{aligned}
\]
where, as in EM, the first line holds due to Jensen's inequality, and the last line holds because
we choose the $Q$ distribution to make this hold with equality. The only difference between
EM and GEM is the logic as to why the second line holds: for EM it held because $\theta^{(t+1)}$
was chosen to maximize this quantity, but for GEM it holds by our assumption that we
take a gradient step small enough so as not to decrease the objective function.
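(b) Instead of using the EM algorithm at all, suppose we just want to apply gradient ascent
to maximize the log-likelihood directly. In other words, we are trying to maximize
the (non-convex) function
\[
\ell(\theta) = \sum_i \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta)
\]

so we could simply use the update

\[
\theta := \theta + \alpha \nabla_\theta \sum_i \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta).
\]

Show that this procedure in fact gives the same update as the GEM algorithm
described above.
Answer: Differentiating the log likelihood directly we get
\[
\begin{aligned}
\frac{\partial}{\partial \theta_j} \sum_i \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta)
&= \sum_i \frac{1}{\sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta)} \sum_{z^{(i)}} \frac{\partial}{\partial \theta_j} p(x^{(i)}, z^{(i)}; \theta) \\
&= \sum_i \sum_{z^{(i)}} \frac{1}{p(x^{(i)}; \theta)} \cdot \frac{\partial}{\partial \theta_j} p(x^{(i)}, z^{(i)}; \theta).
\end{aligned}
\]

For the GEM algorithm,

\[
\frac{\partial}{\partial \theta_j} \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}
= \sum_i \sum_{z^{(i)}} \frac{Q_i(z^{(i)})}{p(x^{(i)}, z^{(i)}; \theta)} \cdot \frac{\partial}{\partial \theta_j} p(x^{(i)}, z^{(i)}; \theta).
\]

But the E-step of the GEM algorithm chooses

\[
Q_i(z^{(i)}) = p(z^{(i)} \mid x^{(i)}; \theta) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{p(x^{(i)}; \theta)},
\]
so
\[
\sum_i \sum_{z^{(i)}} \frac{Q_i(z^{(i)})}{p(x^{(i)}, z^{(i)}; \theta)} \cdot \frac{\partial}{\partial \theta_j} p(x^{(i)}, z^{(i)}; \theta)
= \sum_i \sum_{z^{(i)}} \frac{1}{p(x^{(i)}; \theta)} \cdot \frac{\partial}{\partial \theta_j} p(x^{(i)}, z^{(i)}; \theta)
\]

which is the same as the derivative of the log likelihood.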
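The equality of the two gradients can also be checked numerically. The sketch below uses a toy model chosen purely for illustration: a two-component mixture with fixed mixing weights in which component 0 is N(0, 1) and component 1 is N(θ, 1), so θ is a single scalar. It compares central-difference derivatives of the lower bound (with Q held fixed at the posterior) and of the log-likelihood at the same θ. All names and data values are assumptions.

```python
import math

PRIOR = (0.7, 0.3)          # hypothetical fixed mixing weights p(z)

def joint(x, z, theta):
    """p(x, z; theta): component 0 is N(0, 1), component 1 is N(theta, 1)."""
    mean = theta if z == 1 else 0.0
    return PRIOR[z] * math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2 * math.pi)

def log_likelihood(xs, theta):
    return sum(math.log(joint(x, 0, theta) + joint(x, 1, theta)) for x in xs)

def posteriors(xs, theta):
    """E-step: Q_i(z) = p(z | x_i; theta)."""
    out = []
    for x in xs:
        j0, j1 = joint(x, 0, theta), joint(x, 1, theta)
        out.append((j0 / (j0 + j1), j1 / (j0 + j1)))
    return out

def lower_bound(xs, q, theta):
    """Sum over i and z of Q_i(z) * log(p(x_i, z; theta) / Q_i(z))."""
    return sum(q_i[z] * math.log(joint(x, z, theta) / q_i[z])
               for x, q_i in zip(xs, q) for z in (0, 1))

def derivative(f, theta, h=1e-5):
    """Central finite difference."""
    return (f(theta + h) - f(theta - h)) / (2 * h)

xs = [-1.2, 0.3, 0.8, 2.5, 3.1]      # made-up observations
theta0 = 1.5
q = posteriors(xs, theta0)            # Q stays fixed while differentiating
grad_bound = derivative(lambda t: lower_bound(xs, q, t), theta0)
grad_loglik = derivative(lambda t: log_likelihood(xs, t), theta0)
print(grad_bound, grad_loglik)
```

The two printed values should agree up to finite-difference error, and the lower bound itself equals the log-likelihood at theta0 because Q is the exact posterior there.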
