
Homework #3

CSE 446/546: Machine Learning


Prof. Kevin Jamieson and Prof. Simon S. Du
Due: May 19, 2023 11:59pm
Points A: 90; B: 5

Please review all homework guidance posted on the website before submitting it to Gradescope. Reminders:
• Make sure to read the “What to Submit” section following each question and include all items.
• Please provide succinct answers and supporting reasoning for each question. Similarly, when discussing
experimental results, concisely create tables and/or figures when appropriate to organize the experimental
results. All explanations, tables, and figures for any particular part of a question must be grouped together.
• For every problem involving generating plots, please include the plots as part of your PDF submission.
• When submitting to Gradescope, please link each question from the homework in Gradescope to the
location of its answer in your homework PDF. Failure to do so may result in deductions of up to 10% of
the value of each question not properly linked. For instructions, see https://www.gradescope.com/get_
started#student-submission.
• If you collaborate on this homework with others, you must indicate who you worked with on your homework
by providing a complete list of collaborators on the first page of your assignment. Make sure to include
the name of each collaborator, and on which problem(s) you collaborated. Failure to do so may result
in accusations of plagiarism. You can review the course collaboration policy at https://courses.cs.
washington.edu/courses/cse446/23sp/assignments/
• For every problem involving code, please include all code you have written for the problem as part of your
PDF submission in addition to submitting your code to the separate assignment on Gradescope created
for code. Not submitting all code files will lead to a deduction of up to 10% of the value of each question
missing code.
Not adhering to these reminders may result in point deductions.

Conceptual Questions
A1. These questions should be answerable without referring to external materials. Briefly justify your answers with a few words.
a. [2 points] Say you trained an SVM classifier with an RBF kernel, K(u, v) = exp(−‖u − v‖₂² / (2σ²)). It seems to underfit the training set: should you increase or decrease σ?

b. [2 points] True or False: Training deep neural networks requires minimizing a convex loss function, and
therefore gradient descent will provide the best result.
c. [2 points] True or False: It is a good practice to initialize all weights to zero when training a deep neural
network.
d. [2 points] True or False: We use non-linear activation functions in a neural network’s hidden layers so that
the network learns non-linear decision boundaries.
e. [2 points] True or False: Given a neural network, the time complexity of the backward pass step in the
backpropagation algorithm can be prohibitively larger compared to the relatively low time complexity of
the forward pass step.
f. [2 points] True or False: Neural Networks are the most extensible model and therefore the best choice for
any circumstance.

What to Submit:
• Parts a-f: 1-2 sentence explanation containing your answer.

Support Vector Machines


A2. Recall that solving the SVM problem amounts to solving the following constrained optimization problem: given data points D = {(x_i, y_i)}_{i=1}^n, find

min_{w,b} ‖w‖_2   subject to   y_i(x_i^T w − b) ≥ 1 for i ∈ {1, . . . , n}

where x_i ∈ R^d, y_i ∈ {−1, 1}, and w ∈ R^d.


Consider the following labeled data points:

(1, 2), (1, 3), (2, 3), (3, 4) with label y = −1,   and   (0, 0.5), (1, 0), (2, 1), (3, 0) with label y = 1.

a. [2 points] Graph the data points above. Highlight the support vectors and write their coordinates. Draw the two parallel hyperplanes separating the two classes of data such that the distance between them is as large as possible. Draw the maximum-margin hyperplane. Write the equations describing these three hyperplanes using only x, w, b (that is, without using any specific values). Draw w (it doesn't have to have the exact magnitude, but it should have the correct orientation).
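If it helps to get started on the graph in part a, here is a minimal matplotlib sketch for plotting the two classes. The coordinates are taken from the list above; the marker choices and variable names are just illustrative, and the support vectors and hyperplanes are left for you to add.

import matplotlib.pyplot as plt
import numpy as np

# Coordinates as listed above.
X_neg = np.array([[1, 2], [1, 3], [2, 3], [3, 4]])    # points with label y = -1
X_pos = np.array([[0, 0.5], [1, 0], [2, 1], [3, 0]])  # points with label y = +1

plt.scatter(X_neg[:, 0], X_neg[:, 1], marker="o", label="y = -1")
plt.scatter(X_pos[:, 0], X_pos[:, 1], marker="x", label="y = +1")
plt.xlabel("x1")
plt.ylabel("x2")
plt.legend()
plt.show()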

b. [2 points] For the data points above, find w and b.

Hint: Use the support vectors and the values {−1, 1} to create a linear system of equations where the unknowns are w_1, w_2, and b.

c. [4 points] Show that for any solvable SVM problem, the distance between the two separating hyperplanes is 2 / ‖w‖₂.

Hint 1: The distance between two hyperplanes is the distance between any point x0 on one of the hyper-
planes and its projection on the other hyperplane.
Hint 2: A direction w and an offset c define the hyperplane H = {x ∈ R^n | w^T x = c}. The projection of a vector y onto H is given by P_H(y) = y − ((w^T y − c) / ‖w‖₂²) w.

What to Submit:
• Part a: Write down support vectors and equations. Graph the points, hyperplanes, and w.
• Part b: Solution and corresponding calculations.
• Part c: Proof.

Kernels
A3. [5 points] Suppose that our inputs x are one-dimensional and that our feature map is infinite-dimensional:
φ(x) is a vector whose ith component is:
(1/√(i!)) e^{−x²/2} x^i,

for all nonnegative integers i. (Thus, φ is an infinite-dimensional vector.) Show that K(x, x′) = e^{−(x−x′)²/2} is a kernel function for this feature map, i.e.,

φ(x) · φ(x′) = e^{−(x−x′)²/2}.

Hint: Use the Taylor expansion of z ↦ e^z. (This is the one-dimensional version of the Gaussian (RBF) kernel.)

What to Submit:
• Proof.

A4. This problem will get you familiar with kernel ridge regression using the polynomial and RBF kernels.
First, let's generate some data. Let n = 30 and f_*(x) = 4 sin(πx) cos(6πx²). For i = 1, . . . , n let each x_i be drawn uniformly at random from [0, 1], and let y_i = f_*(x_i) + ε_i where ε_i ∼ N(0, 1). For any function f, the true error and the train error are respectively defined as:

E_true(f) = E_{X,Y}[(f(X) − Y)²],    Ê_train(f) = (1/n) Σ_{i=1}^n (f(x_i) − y_i)².

Now, our goal is, using kernel ridge regression, to construct a predictor:
f̂(x) = Σ_{i=1}^n α̂_i k(x_i, x),    α̂ = argmin_α ‖Kα − y‖₂² + λ α^T K α,

where K ∈ R^{n×n} is the kernel matrix such that K_{i,j} = k(x_i, x_j), and λ ≥ 0 is the regularization constant.

a. [10 points] Using leave-one-out cross validation, find a good λ and hyperparameter settings for the following
kernels:
• k_poly(x, z) = (1 + x^T z)^d, where d ∈ N is a hyperparameter,
• k_rbf(x, z) = exp(−γ‖x − z‖₂²), where γ > 0 is a hyperparameter¹.

¹ Given a dataset x_1, . . . , x_n ∈ R^d, a heuristic for choosing a range of γ in the right ballpark is the inverse of the median of all (n choose 2) squared distances ‖x_i − x_j‖₂².

We strongly recommend implementing either grid search or random search. Do not use sklearn; actually implement one of these algorithms yourself. Reasonable values to search over in this problem are: λ ∈ 10^[−5, −1], d ∈ [5, 25], and γ sampled from a narrow Gaussian distribution centered at the value described in the footnote.
Report the values of d, γ, and the λ values for both kernels.
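As a rough starting point for part a (not the required implementation; the function names and the grid/random-search wrapper are left to you), a minimal NumPy sketch of the pieces involved is below. It uses the closed-form choice α̂ = (K + λI)⁻¹ y, which makes the gradient of the objective above vanish.

import numpy as np

def k_poly(x, z, d):
    # Polynomial kernel (1 + x * z)^d for 1-D input arrays x (n,) and z (m,).
    return (1.0 + np.outer(x, z)) ** d

def k_rbf(x, z, gamma):
    # RBF kernel exp(-gamma * (x - z)^2) for 1-D input arrays.
    return np.exp(-gamma * np.subtract.outer(x, z) ** 2)

def fit_alpha(K, y, lam):
    # Minimizer of ||K a - y||_2^2 + lam * a^T K a is a = (K + lam * I)^{-1} y.
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), y)

def loo_error(x, y, kernel, lam, **hyper):
    # Leave-one-out CV: refit with point i held out, then measure squared error on x_i.
    n, errs = len(x), []
    for i in range(n):
        keep = np.arange(n) != i
        alpha = fit_alpha(kernel(x[keep], x[keep], **hyper), y[keep], lam)
        pred = kernel(x[[i]], x[keep], **hyper) @ alpha   # shape (1,)
        errs.append((pred[0] - y[i]) ** 2)
    return float(np.mean(errs))

A grid or random search over (λ, d) or (λ, γ) would simply call loo_error for each candidate setting and keep the one with the lowest error.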
b. [10 points] Let f̂_poly(x) and f̂_rbf(x) be the functions learned using the hyperparameters you found in part a. For a single plot per function f̂ ∈ {f̂_poly(x), f̂_rbf(x)}, plot the original data {(x_i, y_i)}_{i=1}^n, the true f(x), and f̂(x) (i.e., define a fine grid on [0, 1] to plot the functions).

What to Submit:
• Part a: Report the values of d, γ and the value of λ for both kernels as described.
• Part b: Two plots. One plot for each function.

• Code on Gradescope through coding submission.

Perceptron

B1. One of the oldest algorithms used in machine learning (from the early 60’s) is an online algorithm for
learning a linear threshold function called the Perceptron Algorithm. It works as follows:

1. Start with the all-zeroes weight vector w_1 = 0, and initialize t to 1. Also let's automatically scale all
examples x to have (Euclidean) norm 1, since this doesn't affect which side of the plane they are on.
2. Given example x, predict positive iff w_t · x > 0.
3. On a mistake, update as follows:
• Mistake on positive: w_{t+1} ← w_t + x.
• Mistake on negative: w_{t+1} ← w_t − x.
4. t ← t + 1.

If we make a mistake on a positive x we get w_{t+1} · x = (w_t + x) · x = w_t · x + 1, and similarly if we make a mistake on a negative x we have w_{t+1} · x = (w_t − x) · x = w_t · x − 1. So, in both cases we move closer (by 1) to the value we wanted. Here is a link if you are interested in more details.
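For concreteness, a minimal sketch of the update rule just described (illustrative only; it assumes rows of X have unit Euclidean norm and labels lie in {−1, +1}):

import numpy as np

def perceptron(X, y, n_passes=10):
    # Online perceptron: predict positive iff w . x > 0; on a mistake add y_i * x_i.
    w = np.zeros(X.shape[1])
    for _ in range(n_passes):
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:      # mistake (counting the boundary as a mistake)
                w = w + y_i * x_i         # +x on positive mistakes, -x on negative ones
    return w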

Now consider the linear decision boundary for classification (labels in {−1, 1}) of the form w · x = 0 (i.e., no offset), and consider the following loss function, a variant of the hinge loss, evaluated at a data point (x, y):

ℓ((x, y), w) = max{0, −y(w · x)}.
a. [2 points] Given a dataset of (x_i, y_i) pairs, write down a single step of subgradient descent with a step size of η if we are trying to minimize

(1/n) Σ_{i=1}^n ℓ((x_i, y_i), w)

for ℓ(·) defined as above. That is, given a current iterate w̃, what is an expression for the next iterate?

b. [2 points] Use what you derived to argue that the Perceptron can be viewed as implementing SGD applied to the loss function just described. For what value of η?

c. [1 point] Suppose your data was drawn i.i.d. and that there exists a w∗ that separates the two classes
perfectly. Provide an explanation for why hinge loss is generally preferred over the loss given above.

What to Submit:
• Part a: Expression for a single step of subgradient descent.

• Part b: A 1-2 sentence explanation.

• Part c: A 1-2 sentence explanation.

Introduction to PyTorch
A5. PyTorch is a great tool for developing, deploying, and researching neural networks and other gradient-based algorithms. In this problem we will explore how this package is built and re-implement some of its core components. Start by reading the README.md file provided in the intro_pytorch subfolder. Many of the problem statements overlap between this document, the READMEs, and the comments in the provided functions.

a. [10 points] You will start by implementing components of our own PyTorch modules. You can find these in the layers, losses, and optimizers folders. Almost every file there contains at least one function to implement, along with exact directions for what to achieve in this problem. Lastly, you should implement the functions in the train.py file.
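As a rough illustration only, the kind of interface such a hand-rolled component typically exposes is sketched below; the exact class names, signatures, and initialization scheme you must use are the ones specified in the starter code, not the ones here.

import torch

class IllustrativeLinear(torch.nn.Module):
    # Sketch of a fully-connected layer y = x W^T + b built from raw tensors.
    # The real classes to implement live in the layers/losses/optimizers folders.

    def __init__(self, dim_in: int, dim_out: int):
        super().__init__()
        alpha = dim_in ** -0.5  # illustrative Unif(-alpha, alpha) initialization
        self.weight = torch.nn.Parameter(torch.empty(dim_out, dim_in).uniform_(-alpha, alpha))
        self.bias = torch.nn.Parameter(torch.empty(dim_out).uniform_(-alpha, alpha))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.weight.T + self.bias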

b. [5 points] Next we will use the above module to perform a hyperparameter search. Here we will also treat the loss function as a hyperparameter. However, because cross-entropy and MSE require different shapes, we are going to use two different files: crossentropy_search.py and mean_squared_error_search.py. For each, you will need to build and train (in the provided order) 5 models:
• Linear neural network (single layer, no activation function)
• NN with one hidden layer (2 units) and sigmoid activation function after the hidden layer
• NN with one hidden layer (2 units) and ReLU activation function after the hidden layer
• NN with two hidden layers (each with 2 units) and sigmoid, ReLU activation functions after the first and second hidden layers, respectively
• NN with two hidden layers (each with 2 units) and ReLU, sigmoid activation functions after the first and second hidden layers, respectively
For each loss function, submit a plot of losses from the training and validation sets. All models should be on the same plot (10 lines per plot), with two plots total (1 for MSE, 1 for cross-entropy).
c. [5 points] For each loss function, report the best performing architecture (best performing is defined here as achieving the lowest validation loss at any point during training), and plot its guesses on the test set. You should use the plot_model_guesses function from the train.py file. Lastly, report the accuracy of that model on the test set.

On the Softmax Function
One of the activation functions we ask you to implement is softmax. For a prediction ŷ ∈ R^k corresponding to a single datapoint (in a problem with k classes):

softmax(ŷ_i) = exp(ŷ_i) / Σ_j exp(ŷ_j)
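A common, numerically stable way to compute this is to subtract the per-row maximum before exponentiating; the shift cancels in the ratio, so the result is unchanged. A minimal sketch (function name is illustrative):

import torch

def softmax(y_hat):
    # Softmax over the last dimension of a tensor of logits of shape (..., k).
    shifted = y_hat - y_hat.max(dim=-1, keepdim=True).values  # subtract max for stability
    exp = torch.exp(shifted)
    return exp / exp.sum(dim=-1, keepdim=True)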

What to Submit:
• Part b: 2 plots (one per loss function), with 10 lines each, showing both training and validation loss of
each model. Make sure plots are titled, and have proper legends.

• Part c: Names of the best performing models (i.e., descriptions of their architectures), and their accuracy on the test set.
• Part c: 2 scatter plots (one per loss function), with predictions of the best performing models on the test set.
• Code on Gradescope through coding submission

Neural Networks for MNIST


Resources
For questions A.4, A.5, and A.6 you will use a lot of PyTorch. In the section materials (Week 6) there is a notebook that you might find useful. Additionally, make use of the PyTorch documentation when needed.
If you do not have access to a GPU, you might find Google Colaboratory useful: it allows you to use a cloud GPU for free. To enable it, make sure "Runtime" -> "Change runtime type" -> "Hardware accelerator" is set to "GPU". When submitting, please download and submit a .py version of your notebook.
A6. In Homework 1, we used ridge regression for training a classifier for the MNIST data set. In Homework 2,
we used logistic regression to distinguish between the digits 2 and 7. In this problem, we will use PyTorch to
build a simple neural network classifier for MNIST to further improve our accuracy.

We will implement two different architectures: a shallow but wide network, and a narrow but deeper network. For both architectures, we use d to refer to the number of input features (in MNIST, d = 28² = 784), h_i to refer to the dimension of the i-th hidden layer, and k for the number of target classes (in MNIST, k = 10). For the non-linear activation, use ReLU. Recall from lecture that

ReLU(x) = x if x ≥ 0, and 0 if x < 0.

Weight Initialization
Consider a weight matrix W ∈ R^{n×m} and b ∈ R^n. Note that here m refers to the input dimension and n to the output dimension of the transformation x ↦ Wx + b. Define α = 1/√m. Initialize all your weight matrices and biases according to Unif(−α, α).
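For instance, a minimal sketch of this initialization (the helper name is just illustrative):

import torch

def init_linear_params(n, m):
    # W in R^{n x m} and b in R^n drawn i.i.d. from Unif(-alpha, alpha) with alpha = 1/sqrt(m).
    alpha = 1.0 / m ** 0.5
    W = torch.empty(n, m).uniform_(-alpha, alpha).requires_grad_()
    b = torch.empty(n).uniform_(-alpha, alpha).requires_grad_()
    return W, b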

Training
For this assignment, use the Adam optimizer from torch.optim. Adam is a more advanced form of gradient
descent that combines momentum and learning rate scaling. It often converges faster than regular gradient
descent in practice. You can use either Gradient Descent or any form of Stochastic Gradient Descent. Note
that you are still using Adam, but might pass either the full data, a single datapoint or a batch of data to it.
Use cross entropy for the loss function and ReLU for the non-linearity.
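A minimal sketch of what such a training loop might look like; the forward function, parameter list, and data loader are placeholders you would supply, not part of the required interface.

import torch
import torch.nn.functional as F

def train(params, forward, loader, epochs=20, lr=1e-3):
    # Generic Adam loop: `forward(x, params)` should return logits of shape (batch, k),
    # and `loader` yields (inputs, integer-label) mini-batches.
    optimizer = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for x, labels in loader:
            optimizer.zero_grad()
            loss = F.cross_entropy(forward(x, params), labels)
            loss.backward()
            optimizer.step()
    return params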

Implementing the Neural Networks


a. [10 points] Let W_0 ∈ R^{h×d}, b_0 ∈ R^h, W_1 ∈ R^{k×h}, b_1 ∈ R^k and σ(z) : R → R some non-linear activation function applied element-wise. Given some x ∈ R^d, the forward pass of the wide, shallow network can be formulated as:

F_1(x) := W_1 σ(W_0 x + b_0) + b_1

Use h = 64 for the number of hidden units and choose an appropriate learning rate. Train the network until it reaches 99% accuracy on the training data and provide a training plot (loss vs. epoch). Finally, evaluate the model on the test data and report both the accuracy and the loss.
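One way the forward pass F_1 could be written with raw parameter tensors, using only torch.nn.functional.relu (consistent with the restriction noted under "Using PyTorch" below), is sketched here; names and shapes are illustrative, not the required implementation.

import torch
import torch.nn.functional as F

def forward_wide(x, W0, b0, W1, b1):
    # F_1(x) = W_1 * relu(W_0 x + b_0) + b_1, applied to a batch x of shape (batch, d).
    hidden = F.relu(x @ W0.T + b0)   # shape (batch, h)
    return hidden @ W1.T + b1        # shape (batch, k), unnormalized logits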

b. [10 points] Let W_0 ∈ R^{h_0×d}, b_0 ∈ R^{h_0}, W_1 ∈ R^{h_1×h_0}, b_1 ∈ R^{h_1}, W_2 ∈ R^{k×h_1}, b_2 ∈ R^k and σ(z) : R → R some non-linear activation function. Given some x ∈ R^d, the forward pass of the network can be formulated as:

F_2(x) := W_2 σ(W_1 σ(W_0 x + b_0) + b_1) + b_2

Use h_0 = h_1 = 32 and perform the same steps as in part a.

c. [5 points] Compute the total number of parameters of each network and report them. Then compare the
number of parameters as well as the test accuracies the networks achieved. Is one of the approaches (wide,
shallow vs. narrow, deeper) better than the other? Give an intuition for why or why not.

Using PyTorch: For your solution, you may not use any functionality from the torch.nn module except for
torch.nn.functional.relu and torch.nn.functional.cross_entropy. You must implement the networks
F1 and F2 from scratch. For starter code and a tutorial on PyTorch refer to the sections 6 and 7 material.

What to Submit:
• Parts a-b: Provide a plot of the training loss versus epoch. In addition, evaluate the trained model on the test data and report the accuracy and loss.

• Part c: Report the number of parameters for the network trained in part (a) and for the network trained in part (b). Provide a comparison of the two networks, as described in part (c), in 1-2 sentences.
• Code on Gradescope through coding submission.
