CS 189 Introduction to Machine Learning

Spring 2024 Jonathan Shewchuk HW7


Due: Wednesday, May 1 at 11:59 pm
Deliverables:

1. Submit a PDF of your homework, with an appendix listing all your code, to the Gradescope
assignment entitled “Homework 7 Write-Up”. In addition, please include, as your solutions to each
coding problem, the specific subset of code relevant to that part of the problem. You may typeset your
homework in LaTeX or Word (submit PDF format, not .doc/.docx format) or submit neatly
handwritten and scanned solutions. Please start each question on a new page. If there are graphs, include
those graphs in the correct sections. Do not put them in an appendix. We need each solution to be
self-contained on pages of its own.

• In your write-up, please state with whom you worked on the homework.
• In your write-up, please copy the following statement and sign your signature next to it. (Mac
Preview and FoxIt PDF Reader, among others, have tools to let you sign a PDF file.) We want
to make it extra clear so that no one inadvertently cheats.
“I certify that all solutions are entirely in my own words and that I have not looked at another
student’s solutions. I have given credit to all external sources I consulted.”

2. Submit all the code needed to reproduce your results to the Gradescope assignment entitled
“Homework 7 Code”. Yes, you must submit your code twice: once in your PDF write-up following the
directions as described above so the readers can easily read it, and once in compilable/interpretable
form so the readers can easily run it. Do NOT include any data files we provided. Please include a
short file named README listing your name, student ID, and instructions on how to reproduce your
results. Please take care that your code doesn’t take up inordinate amounts of time or memory to run.
If your code cannot be executed, your solution cannot be verified.

HW7, ©UCB CS 189, Spring 2024. All Rights Reserved. This may not be publicly shared without explicit permission. 1
1 Honor Code
Declare and sign the following statement:
“I certify that all solutions in this document are entirely my own and that I have not looked at anyone else’s
solution. I have given credit to all external sources I consulted.”
Signature :
While discussions are encouraged, everything in your solution must be your (and only your) creation.
Furthermore, all external material (i.e., anything outside lectures and assigned readings, including figures and
pictures) should be cited properly. We wish to remind you that consequences of academic misconduct are
particularly severe!

2 The Training Error of AdaBoost
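Recall that in AdaBoost, our input is an $n \times d$ design matrix $X$ with $n$ labels $y_i = \pm 1$, and at the end of
iteration $T$ the importance of each sample is reweighted as

$$w_i^{(T+1)} = w_i^{(T)} \exp(-\beta_T y_i G_T(X_i)), \quad \text{where } \beta_T = \frac{1}{2} \ln\left(\frac{1 - \mathrm{err}_T}{\mathrm{err}_T}\right) \text{ and } \mathrm{err}_T = \frac{\sum_{y_i \neq G_T(X_i)} w_i^{(T)}}{\sum_{i=1}^n w_i^{(T)}}.$$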

Note that $\mathrm{err}_T$ is the weighted error rate of the classifier $G_T$. Recall that $G_T(z)$ is $\pm 1$ for all points $z$, but the
metalearner has a non-binary decision function $M(z) = \sum_{t=1}^T \beta_t G_t(z)$. To classify a test point $z$, we calculate
$M(z)$ and return its sign.
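As a concrete illustration of the quantities recalled above, here is a minimal sketch of one reweighting round in Python with NumPy. The function name `adaboost_round` and the toy arrays are hypothetical, and the weak learner's predictions are assumed to be given as a vector of ±1 values.

```python
import numpy as np

def adaboost_round(w, y, g):
    """One AdaBoost reweighting step (unnormalized).

    w: current sample weights w_i^(T), shape (n,)
    y: labels in {-1, +1}, shape (n,)
    g: weak learner predictions G_T(X_i) in {-1, +1}, shape (n,)
    Returns (new weights, beta_T, err_T).
    """
    err = w[y != g].sum() / w.sum()          # weighted error rate err_T
    beta = 0.5 * np.log((1.0 - err) / err)   # learner coefficient beta_T
    w_new = w * np.exp(-beta * y * g)        # misclassified points gain weight
    return w_new, beta, err

# Toy example: 5 points, the learner misclassifies only the last one.
y = np.array([1, 1, -1, -1, 1])
g = np.array([1, 1, -1, -1, -1])
w = np.full(5, 0.2)
w_new, beta, err = adaboost_round(w, y, g)

# The metalearner sums beta_t * G_t(z) over rounds and returns the sign.
M = beta * g            # only one round in this toy example
pred = np.sign(M)
```

In this toy case the four correctly classified points drop from weight 0.2 to 0.1 and the misclassified point rises to 0.4, which is the behavior the update rule is designed to produce.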
In this problem we will prove that if every learner $G_t$ achieves at least 51% accuracy (that is, only slightly above
random), AdaBoost will converge to zero training error. (If you get stuck on one part, move on; all five parts
below can be done without solving the other parts, and parts (c) and (e) are the easiest.)

(a) We want to change the update rule to “normalize” the weights so that each iteration’s weights sum to 1;
that is, $\sum_{i=1}^n w_i^{(T+1)} = 1$. That way, we can treat the weights as a discrete probability distribution over
the sample points. Hence we rewrite the update rule in the form

$$w_i^{(T+1)} = \frac{w_i^{(T)} \exp(-\beta_T y_i G_T(X_i))}{Z_T} \qquad (1)$$

for some scalar $Z_T$. Show that if $\sum_{i=1}^n w_i^{(T)} = 1$ and $\sum_{i=1}^n w_i^{(T+1)} = 1$, then

$$Z_T = 2 \sqrt{\mathrm{err}_T (1 - \mathrm{err}_T)}. \qquad (2)$$

Hint: sum over both sides of (1), then split the right summation into misclassified points and correctly
classified points.
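Before attempting the proof, it can help to confirm numerically that the claimed identity is plausible. The sketch below is a sanity check on synthetic data, not a proof; the random weights, labels, and the roughly 30% mistake rate are all made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
w = rng.random(n)
w /= w.sum()                               # weights sum to 1, as assumed in part (a)
y = rng.choice([-1, 1], size=n)
g = np.where(rng.random(n) < 0.3, -y, y)   # flip about 30% of labels to create mistakes

err = w[y != g].sum()                      # denominator is 1 because weights are normalized
beta = 0.5 * np.log((1 - err) / err)
Z_sum = np.sum(w * np.exp(-beta * y * g))  # normalizer obtained by summing both sides of (1)
Z_formula = 2 * np.sqrt(err * (1 - err))   # closed form claimed in (2)
```

The two values `Z_sum` and `Z_formula` agree up to floating-point error, whatever weights and mistakes are drawn.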

(b) The initial weights are $w_1^{(1)} = w_2^{(1)} = \cdots = w_n^{(1)} = \frac{1}{n}$. Show that

$$w_i^{(T+1)} = \frac{1}{n \prod_{t=1}^T Z_t} e^{-y_i M(X_i)}. \qquad (3)$$

(c) Let $B$ (for “bad”) be the number of sample points out of $n$ that the metalearner classifies incorrectly.
Show that

$$\sum_{i=1}^n e^{-y_i M(X_i)} \geq B. \qquad (4)$$

Hint: split the summation into misclassified points and correctly classified points.

(d) Use the formulas (2), (3), and (4) to show that if $\mathrm{err}_t \leq 0.49$ for every learner $G_t$, then $B \to 0$ as $T \to \infty$.
Hint: (2) implies that every $Z_t < 0.9998$. How can you combine this fact with (3) and (4)?
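The constant in the hint can be checked with a few lines of arithmetic. This sketch only evaluates formula (2) at the boundary error rate 0.49 and shows how quickly a product of such factors shrinks; the particular values of T are arbitrary choices for illustration.

```python
import math

z_bound = 2 * math.sqrt(0.49 * 0.51)   # value of (2) at err_t = 0.49

# A product of T factors, each at most z_bound, is at most z_bound ** T.
decay = {T: z_bound ** T for T in (1_000, 10_000, 100_000)}
```

Because each factor is strictly below 1, the bound `z_bound ** T` decays geometrically in T, although slowly when the learners are only barely better than random.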

(e) Explain briefly why AdaBoost with short decision trees is a form of subset selection when the number
of features is large.

3 Movie Recommender System
In this problem, we will build a personalized movie recommender system! Suppose that there are m = 100
movies and n = 24,983 users in total, and each user has watched and rated a subset of the m movies. Our
goal is to recommend more movies for each user given their preferences.
Our historical ratings dataset is given by a matrix $R \in \mathbb{R}^{n \times m}$, where $R_{ij}$ represents the rating that user $i$ gave
movie $j$. The rating is a real number in the range $[-10, 10]$: a higher value indicates that the user was more
satisfied with that movie. If user $i$ did not rate movie $j$, $R_{ij}$ = NaN.
The provided movie_data/ directory contains the following files:

• movie_train.mat contains the training data, i.e. the matrix $R$ of historical ratings specified above.
• movie_validate.txt contains user-movie pairs that don’t appear in the training set (i.e. $R_{ij}$ = NaN).
Each line takes the form “i, j, s”, where i is the user index, j is the movie index, and s indicates the
user’s rating of the movie. In contrast to the training set, the rating here is binary: if the user liked the
movie (positive rating), s = 1, and if the user did not like the movie (negative rating), s = −1.
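To make the validation file format concrete, the following sketch parses lines of the form “i, j, s” into integer triples. The helper name `parse_validation_lines` and the sample lines are hypothetical; the starter code presumably performs its own loading, and whether indices are 0-based or 1-based is not stated here.

```python
def parse_validation_lines(lines):
    """Parse lines of the form "i, j, s" into (user, movie, label) int triples."""
    triples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        i, j, s = (int(tok) for tok in line.split(","))
        triples.append((i, j, s))
    return triples

# Made-up sample lines in the documented format.
sample = ["3, 17, 1", "3, 42, -1", ""]
parsed = parse_validation_lines(sample)
```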

We also provide movie_recommender.py, containing starter code for building your recommender system.
The singular value decomposition (SVD) is a powerful tool to decompose and analyze matrices. In lecture,
we saw that the SVD can be used to efficiently compute the principal coordinates of a data matrix for PCA.
Here, we will see that SVD can also produce dense, compact featurizations of the variables in the input
matrix (in our case, the $m$ movies and $n$ users). This application of SVD is known as Latent Semantic Analysis
(Wikipedia), and we can use it to construct a Latent Factor Model (LFM) for personalized recommendation.
Specifically, we want to learn a feature vector $x_i \in \mathbb{R}^d$ for user $i$ and a feature vector $y_j \in \mathbb{R}^d$ for movie $j$
such that the inner product $x_i \cdot y_j$ approximates the rating $R_{ij}$ that user $i$ would give movie $j$.

(a) Recall the SVD definition for a matrix $R \in \mathbb{R}^{n \times m}$ from Lecture 21: $R = U D V^\top$. Write an expression for
$R_{ij}$, user $i$’s rating for movie $j$, in terms of only the contents of $U$, $D$, and $V$.
(b) Based on your answer above, what should we choose as our user and movie feature vector
representations $x_i$ and $y_j$ to achieve 100% training accuracy (correctly predict all known ratings in $R$)?
(c) In the provided movie_recommender.py, complete the code for part (c) by filling in the missing parts of
the function svd_lfm. Start by replacing all missing (NaN) values in $R$ with 0. Then, compute the SVD
of the resulting matrix, and follow your above derivations to compute the feature vector representations
for each user and movie. Note: do not center the data matrix; this is not PCA.
Once you are finished with the code, the rows of the user_vecs array should contain the feature vectors
for users (so the $i$th row of user_vecs is $x_i$), and the rows of movie_vecs should contain the feature
vectors for movies (so the $j$th row of movie_vecs is $y_j$).
Hint: we recommend using scipy.linalg.svd to compute the SVD, with full_matrices=False.
This returns $U$ ($n \times m$), $D$ (as a vector of $m$ singular values in descending order, not a diagonal matrix),
and $V^\top$ ($m \times m$) in that order.
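The shapes described in the hint can be seen on a small example. This sketch uses a made-up 6 × 4 matrix in place of the real ratings and only demonstrates the call and the reconstruction identity; it deliberately does not choose the user or movie feature vectors.

```python
import numpy as np
from scipy.linalg import svd

rng = np.random.default_rng(0)
R = rng.uniform(-10, 10, size=(6, 4))      # toy ratings: n = 6 users, m = 4 movies
R[0, 1] = np.nan                           # pretend user 0 did not rate movie 1
R_filled = np.nan_to_num(R, nan=0.0)       # replace NaN with 0, as part (c) asks

U, D, Vt = svd(R_filled, full_matrices=False)
# U is (n, m), D is a length-m vector, Vt is (m, m)
reconstruction = (U * D) @ Vt              # same as U @ np.diag(D) @ Vt
```

Multiplying `U` by the vector `D` with broadcasting scales each column by its singular value, which avoids building the diagonal matrix explicitly.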
(d) To measure the training performance of the model, we can use the mean squared error (MSE) loss,

$$\mathrm{MSE} = \sum_{(i,j) \in S} (x_i \cdot y_j - R_{ij})^2 \quad \text{where } S := \{(i, j) : R_{ij} \neq \mathrm{NaN}\}.$$

Complete the code to implement the training MSE computation within the function get_train_mse.
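One possible way to restrict a squared-error computation to the rated entries is a boolean mask over the NaN pattern. The sketch below follows the formula exactly as written (a sum over S); the function name `squared_error_on_rated` and the tiny arrays are hypothetical, and the starter code's get_train_mse may expect a different signature or an average over |S| instead of a sum.

```python
import numpy as np

def squared_error_on_rated(R, user_vecs, movie_vecs):
    """Sum of (x_i . y_j - R_ij)^2 over rated entries, as the formula is written."""
    pred = user_vecs @ movie_vecs.T        # pred[i, j] = x_i . y_j
    rated = ~np.isnan(R)                   # boolean mask for the index set S
    return np.sum((pred[rated] - R[rated]) ** 2)

# Tiny made-up example: 2 users, 2 movies, one rating each.
R = np.array([[1.0, np.nan], [np.nan, -2.0]])
X = np.array([[1.0, 0.0], [0.0, 1.0]])
Y = np.array([[1.0, 0.0], [0.0, -1.0]])
loss = squared_error_on_rated(R, X, Y)
```

Here the prediction for entry (0, 0) is exact and the prediction for entry (1, 1) is off by 1, so the summed squared error is 1.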

(e) Our model as constructed may achieve 100% training accuracy, but it is prone to overfitting. Instead,
we would like to use lower-dimensional representations for $x_i$ and $y_j$ to approximate our known ratings
closely while still generalizing well to unknown user/movie combinations. Specifically, we want each
$x_i$ and $y_j$ to be $d$-dimensional for some $d < m$, such that only the top $d$ features are used to make
predictions $x_i \cdot y_j$. The “top $d$ features” are those corresponding to the $d$ largest singular values: use this
as a hint for how to prune your current user/movie vector representations to $d$ dimensions.
In your code, compute pruned user/movie vector representations with $d = 2, 5, 10, 20$. Then, for
each setting, compute the training MSE (using the function you implemented in part (d)), the
training accuracy (using the provided get_train_acc), and the validation accuracy (using the provided
get_val_acc). Plot the training MSE as a function of $d$ on one plot, and the training and validation
accuracies as a function of $d$ together on a separate plot. The code for this part is already included in the
starter code, so if your training MSE function from part (d) is implemented correctly, the required plots
should be saved to your project directory.
Comment on which value of $d$ leads to optimal performance.
Hint: as a sanity check, if implemented correctly, your best validation accuracy should be about 71%.
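The effect of keeping only the d largest singular values can be seen on a generic matrix. The sketch below uses a random 50 × 20 matrix rather than the ratings data, and it measures reconstruction error over the whole matrix, which is not the same quantity as the rated-only training MSE of part (d). The helper name `rank_d_approx` is hypothetical.

```python
import numpy as np
from scipy.linalg import svd

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 20))              # stand-in matrix, not the ratings data
U, D, Vt = svd(A, full_matrices=False)

def rank_d_approx(U, D, Vt, d):
    """Keep only the d largest singular values (they come first in D)."""
    return (U[:, :d] * D[:d]) @ Vt[:d, :]

# Squared reconstruction error for the same d values used in part (e).
errors = {d: np.sum((A - rank_d_approx(U, D, Vt, d)) ** 2) for d in (2, 5, 10, 20)}
```

The error shrinks as d grows and equals the sum of the squared discarded singular values, so on the training matrix alone a larger d always looks better; the validation accuracy is what reveals overfitting.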

(f) For sparse data, replacing all missing values with zero, as we did in part (c), is not a very satisfying
solution. A missing value in the training matrix $R$ means that the user has not watched the movie; this
does not imply that the rating should be zero. Instead, we can learn our user/movie vector representations
by minimizing the MSE loss, which only incorporates the loss on rated movies ($R_{ij} \neq$ NaN).
Let’s define a loss function

$$L(\{x_i\}, \{y_j\}) = \sum_{(i,j) \in S} (x_i \cdot y_j - R_{ij})^2 + \sum_{i=1}^n \|x_i\|_2^2 + \sum_{j=1}^m \|y_j\|_2^2$$

where $S$ has the same definition as in the MSE. This is similar to the original MSE loss, except with two
additional regularization terms to prevent the norms of the user/movie vectors from getting too large.
Implement an algorithm to learn vector representations of dimension $d$, the optimal value you found in
part (e), for users and movies by minimizing $L(\{x_i\}, \{y_j\})$.
We suggest employing an alternating minimization scheme. First, randomly initialize $x_i$ and $y_j$ for all
$i, j$. Then, minimize the above loss function with respect to the $x_i$ by treating the $y_j$ as constant vectors,
and subsequently minimize the loss with respect to the $y_j$ by treating the $x_i$ as constant vectors. Repeat
these two steps for a number of iterations. Note that when either the $x_i$ or the $y_j$ are held constant, minimizing
the loss function with respect to the other set of vectors has a closed-form solution. Derive this solution
first in your report, showing all your work.
The starter code provides a template for this algorithm. Start by inputting your best d value from part (e)
to initialize the user and movie vectors, and then implement the functions to update the user and movie
vectors (holding the other constant) to their loss-minimizing values.

• To improve efficiency, we recommend using the user_rated_idxs and movie_rated_idxs
arrays provided, which contain the indices of movies that each user rated and the indices of users that
rated each movie (respectively), to iterate through the non-NaN values of $R$ in the update functions.
• Run these 2 update steps for 20 iterations. Include your final training MSE, training accuracy, and
validation accuracy in your report, and compare these results with your best results from part (e).
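The alternating scheme described above can be sketched on a small synthetic matrix. This is one possible implementation under stated assumptions, not the starter code's template: the names `update_side` and `loss` are hypothetical, the index lists are rebuilt locally rather than taken from the provided arrays, and the per-vector update assumes the standard ridge-regression normal equations $(\sum_j y_j y_j^\top + I)\, x_i = \sum_j R_{ij} y_j$, which you should still derive yourself as the problem asks.

```python
import numpy as np

def update_side(R, fixed, rated_idxs, d):
    """Closed-form ridge update for each row vector, holding `fixed` constant.

    R: ratings oriented so that row r belongs to the vector being updated
    fixed: vectors of the other side, shape (num_other, d)
    rated_idxs: rated_idxs[r] lists the column indices with a rating in row r
    """
    out = np.zeros((R.shape[0], d))
    for r, idxs in enumerate(rated_idxs):
        F = fixed[idxs]                          # vectors paired with row r
        A = F.T @ F + np.eye(d)                  # sum of outer products plus identity
        b = F.T @ R[r, idxs]                     # ratings-weighted sum of fixed vectors
        out[r] = np.linalg.solve(A, b)
    return out

def loss(R, X, Y):
    """Squared error on rated entries plus squared norms of all vectors."""
    rated = ~np.isnan(R)
    return np.sum(((X @ Y.T)[rated] - R[rated]) ** 2) + np.sum(X ** 2) + np.sum(Y ** 2)

rng = np.random.default_rng(0)
n, m, d = 30, 8, 3                               # toy sizes, not the real dataset
R = rng.uniform(-10, 10, size=(n, m))
R[rng.random((n, m)) < 0.4] = np.nan             # hide about 40% of ratings
user_idxs = [np.flatnonzero(~np.isnan(R[i])) for i in range(n)]
movie_idxs = [np.flatnonzero(~np.isnan(R[:, j])) for j in range(m)]

X = rng.normal(size=(n, d))
Y = rng.normal(size=(m, d))
history = [loss(R, X, Y)]
for _ in range(20):
    X = update_side(R, Y, user_idxs, d)          # minimize over users, movies fixed
    Y = update_side(R.T, X, movie_idxs, d)       # minimize over movies, users fixed
    history.append(loss(R, X, Y))
```

Because each half-step exactly minimizes the loss over one set of vectors, the recorded loss never increases from one iteration to the next, which is a useful check on any implementation.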

4 IM2SPAIN: Nearest Neighbors for Geo-location
For this problem, we will use nearest neighbors (NN or k-NN) to predict latitude and longitude coordinates of
images from their CLIP embeddings. You’ll be modifying starter code in the provided im2spain directory.
We are using a dataset of images scraped from Flickr with geo-tagged locations within Spain. Each image
has been processed with OpenAI’s CLIP image model (https://github.com/openai/CLIP) to produce
features that can be used with k-NN.
The CLIP model was not explicitly trained to predict coordinates from images, but through task-agnostic
pretraining on a large web-crawl dataset of captioned images it has learned a generally useful mapping from
images to embedding vectors. These feature vectors turn out to encode various pieces of information about
the image content such as object categories, textures, 3D shapes, etc. In fact, these very same features were
used to filter out indoor images from outdoor images in the construction of this dataset.
Note: Throughout the problem we use MDE, which stands for Mean Displacement Error (in miles).
Displacement is the (technically spherical) distance between the predicted coordinates and ground truth
coordinates. Since all our images are located within a relatively small region of the globe, we can approximate
spherical distances with Euclidean distances by treating latitude/longitude as Cartesian coordinates. Assume
1 degree latitude is equal to 69 miles and 1 degree longitude is 52 miles in this problem.
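The MDE definition above translates into a short function. This is a minimal sketch assuming coordinates are stored as (latitude, longitude) pairs in degrees; that column order, the function name `mean_displacement_error`, and the sample points are assumptions, so check the starter code's convention before reusing it.

```python
import numpy as np

MILES_PER_DEG_LAT = 69.0   # stated in the problem
MILES_PER_DEG_LON = 52.0   # stated in the problem

def mean_displacement_error(pred, truth):
    """pred, truth: arrays of shape (n, 2) holding (latitude, longitude) in degrees."""
    diff = (pred - truth) * np.array([MILES_PER_DEG_LAT, MILES_PER_DEG_LON])
    return np.mean(np.linalg.norm(diff, axis=1))   # average Euclidean distance in miles

# Made-up points: one prediction is off by 1 degree latitude, the other by 1 degree longitude.
truth = np.array([[40.0, -3.0], [41.0, 2.0]])
pred = np.array([[41.0, -3.0], [41.0, 3.0]])
mde = mean_displacement_error(pred, truth)
```

The two displacements are 69 and 52 miles, so the mean is 60.5 miles.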
Deliverables: Include your modified im2spain_starter.py script in your submission. Your submitted
file should include all modifications requested in this problem.

(a) Let’s visualize the data. Using matplotlib and scikit-learn, plot the image locations and modify the
code to apply PCA to the image features (remember to re-center the features first) in the plot_data
method of im2spain_starter.py. Plot the data in its first two PCA dimensions, colored by longitude
coordinate (east-west position).
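A two-dimensional PCA projection of feature vectors can be sketched with scikit-learn as follows. The random arrays stand in for the real CLIP features and longitudes, and the variable names are hypothetical; the structure of the actual plot_data method is not shown in this handout.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 16))      # stand-in for CLIP embeddings
longitudes = rng.uniform(-9.0, 3.0, 200)   # stand-in for east-west coordinates

centered = features - features.mean(axis=0)              # explicit re-centering
coords_2d = PCA(n_components=2).fit_transform(centered)  # first two principal coordinates
```

scikit-learn's PCA also subtracts the mean internally, so the explicit centering is redundant but harmless. To color the points by longitude, a call such as `plt.scatter(coords_2d[:, 0], coords_2d[:, 1], c=longitudes)` with matplotlib would be one option.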
(b) Modify the starter code in im2spain_starter.py to find the three nearest neighbors in the training
set of the test image file 53633239060.jpg. Include those three image files (as images) in order from
nearest to 3rd nearest in your submission. Now look at their coordinates. How many of the 3 nearest
neighbors are “correct”?
Note: most images have been replaced with a placeholder image in the interest of storage space.
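Finding the nearest training points to a query in feature space can be done with a sort over distances. The sketch below assumes Euclidean distance and uses a made-up two-dimensional training set; the helper name `k_nearest` is hypothetical, and the starter code may rely on a library routine instead.

```python
import numpy as np

def k_nearest(train_feats, query, k):
    """Indices of the k training rows closest to `query` (Euclidean), nearest first."""
    dists = np.linalg.norm(train_feats - query, axis=1)
    order = np.argsort(dists)[:k]
    return order, dists[order]

# Made-up training features and query point.
train_feats = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]])
idx, dist = k_nearest(train_feats, np.array([0.9, 0.1]), k=3)
```

The returned indices can then be used to look up the corresponding file names and coordinates in whatever arrays hold them.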

(c) Before we begin with our k-NN model, let’s first establish a naive constant baseline of simply predicting
the training set centroid (coordinate-wise average) location for every test image. Modify the code in
im2spain_starter.py to implement the constant baseline. What is its MDE in miles?
(d) The main hyperparameter of a k-nearest neighbor classifier is k itself. Use a 1-D grid search in
im2spain_starter.py to create a plot of the MDE (in miles) of k-NN regression versus k, where k is
the number of neighbors. Include your plot in your write-up. What is the best value of k? What is the
lowest error?
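A one-dimensional grid search over k can be sketched as a loop that records the error for each candidate. Everything here is synthetic: the data, the grid `ks`, and the helper names `knn_predict` and `mde` are assumptions for illustration, and the brute-force distance matrix is only practical for small datasets.

```python
import numpy as np

def knn_predict(train_X, train_y, test_X, k):
    """Unweighted k-NN regression: average the coordinates of the k nearest neighbors."""
    # Pairwise Euclidean distances, shape (n_test, n_train)
    d = np.linalg.norm(test_X[:, None, :] - train_X[None, :, :], axis=2)
    nn = np.argsort(d, axis=1)[:, :k]
    return train_y[nn].mean(axis=1)

def mde(pred, truth):
    scale = np.array([69.0, 52.0])               # miles per degree, assuming (lat, lon) order
    return np.mean(np.linalg.norm((pred - truth) * scale, axis=1))

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
coords = X[:, :2] + 0.3 * rng.normal(size=(300, 2))   # synthetic noisy targets
train_X, test_X = X[:200], X[200:]
train_y, test_y = coords[:200], coords[200:]

ks = [1, 2, 3, 5, 8, 13, 20, 50, 100, 200]
errors = [mde(knn_predict(train_X, train_y, test_X, k), test_y) for k in ks]
best_k = ks[int(np.argmin(errors))]
```

Note that with k equal to the number of training points, every prediction is the training centroid, so the last grid point reproduces the constant baseline of part (c).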

(e) Explain your plot in (d) in terms of bias and variance. (In the definitions of bias and variance, you
should think of the ground truth function g and the predicted hypothesis h as both being in the form of
a longitude and a latitude, and you should assume that we integrate the bias-squared and the variance
over the probability distribution of Spain travel photos that people might take.) In particular, given n
training points, how is the bias different for k = 1 versus k = n? How is the variance different for k = 1
versus k = n? What happens for intermediate values of k?

(f) We do not need to weight every neighbor equally: closer neighbors may be more relevant. For this
problem, weight each neighbor by the inverse of its distance (in feature space) to the test point by
modifying im2spain_starter.py. Plot the error of k-NN regression with distance weighting vs. k,
where k is the number of neighbors. What is the best value of k? What is the MDE in miles? How does
performance compare to the unweighted k-NN results in part (d)?
Note: When computing the inverse distance, add a small value (e.g. $10^{-8}$) to the denominator to avoid
division by zero.
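Inverse-distance weighting with the small constant from the note can be sketched as follows. The function name `weighted_knn_predict`, the Euclidean metric, and the toy data are assumptions; the sketch handles a single query point for clarity.

```python
import numpy as np

def weighted_knn_predict(train_X, train_y, query, k, eps=1e-8):
    """Predict coordinates as the inverse-distance-weighted mean of k neighbors."""
    d = np.linalg.norm(train_X - query, axis=1)
    nn = np.argsort(d)[:k]
    w = 1.0 / (d[nn] + eps)                  # eps avoids division by zero
    return (w[:, None] * train_y[nn]).sum(axis=0) / w.sum()

# Made-up one-dimensional features with (lat, lon) targets.
train_X = np.array([[0.0], [1.0], [3.0]])
train_y = np.array([[40.0, -3.0], [42.0, -1.0], [36.0, -6.0]])
pred = weighted_knn_predict(train_X, train_y, np.array([0.25]), k=2)
```

In this example the query is three times closer to the first training point than to the second, so the first point receives three quarters of the total weight. When the query coincides with a training point, that point's weight is about 1/eps and dominates the prediction.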

(g) k-NN yields a non-parametric model which means its complexity can grow without bound as we in-
crease the amount of training data. This is in contrast to parametric models such as linear regression that
assume a fixed number of parameters, so the complexity of the model is bounded even if trained with
infinite data. (We typically think of modern deep neural nets as functionally non-parametric, though
they technically have a finite parameter size, because when we have more data we usually add more
parameters.)
Let’s compare the performance of k-NN with linear regression at different sizes of training datasets to
get a sense of their respective “scaling curves”. Plot the test error of both k-NN and linear regression
for various percentages of training data. Which method would you expect to continue improving with
twice as much training data?
Note: use the optimal value of k at each training dataset size by running grid search.
