
Homework #2

CSE 446/546: Machine Learning


Prof. Kevin Jamieson and Prof. Simon S. Du
Due: May 3, 2023 11:59pm
Points A: 104; B: 10

Please review all homework guidance posted on the website before submitting your work to Gradescope. Reminders:
• Make sure to read the “What to Submit” section following each question and include all items.
• Please provide succinct answers and supporting reasoning for each question. When discussing experimental results, use tables and/or figures where appropriate to organize them concisely. All explanations, tables, and figures for any particular part of a question must be grouped together.
• For every problem involving generating plots, please include the plots as part of your PDF submission.
• When submitting to Gradescope, please link each question from the homework in Gradescope to the location of its answer in your homework PDF. Failure to do so may result in deductions of up to 10% of the value of each question not properly linked. For instructions, see https://www.gradescope.com/get_started#student-submission.
• If you collaborate on this homework with others, you must indicate who you worked with on your homework
by providing a complete list of collaborators on the first page of your assignment. Make sure to include
the name of each collaborator, and on which problem(s) you collaborated. Failure to do so may result
in accusations of plagiarism. You can review the course collaboration policy at https://courses.cs.washington.edu/courses/cse446/23sp/assignments/
• For every problem involving code, please include all code you have written for the problem as part of your
PDF submission in addition to submitting your code to the separate assignment on Gradescope created
for code. Not submitting all code files will lead to a deduction of up to 10% of the value of each question with missing code.
Not adhering to these reminders may result in point deductions.

Conceptual Questions
A1. These questions should be answerable without referring to external materials. Briefly justify your answers with a few words.
a. [2 points] Compared to an L2 norm penalty, explain why an L1 norm penalty is more likely to result in sparsity (a larger number of 0s) in the weight vector.
b. [2 points] In at most one sentence each, state one possible upside and one possible downside of using the following regularizer: $\sum_i |w_i|^{0.5}$.

c. [2 points] True or False: If the step-size for gradient descent is too large, it may not converge.
d. [2 points] In at most one sentence each, state one possible advantage of SGD over GD (gradient descent),
and one possible disadvantage of SGD relative to GD.
e. [2 points] Why is it necessary to apply the gradient descent algorithm on logistic regression but not linear
regression?

What to Submit:
• Part c: True or False.

• Parts a-e: Brief (2-3 sentence) explanation.

Convexity and Norms
A2. A norm $\|\cdot\|$ over $\mathbb{R}^n$ is defined by the properties: (i) non-negativity: $\|x\| \geq 0$ for all $x \in \mathbb{R}^n$ with equality if and only if $x = 0$, (ii) absolute scalability: $\|ax\| = |a|\,\|x\|$ for all $a \in \mathbb{R}$ and $x \in \mathbb{R}^n$, (iii) triangle inequality: $\|x + y\| \leq \|x\| + \|y\|$ for all $x, y \in \mathbb{R}^n$.

a. [3 points] Show that $f(x) = \sum_{i=1}^n |x_i|$ is a norm. (Hint: for (iii), begin by showing that $|a + b| \leq |a| + |b|$ for all $a, b \in \mathbb{R}$.)

b. [2 points] Show that $g(x) = \left(\sum_{i=1}^n |x_i|^{1/2}\right)^2$ is not a norm. (Hint: it suffices to find two points in $n = 2$ dimensions such that the triangle inequality does not hold.)
Context: norms are often used in regularization to encourage specific behaviors of solutions. If we define $\|x\|_p := \left(\sum_{i=1}^n |x_i|^p\right)^{1/p}$, then one can show that $\|x\|_p$ is a norm for all $p \geq 1$. The important cases of $p = 2$ and $p = 1$ correspond to the penalty for ridge regression and the Lasso, respectively.

What to Submit:
• Parts a, b: Proof.

B1. [6 points] For any $x \in \mathbb{R}^n$, define the following norms: $\|x\|_1 = \sum_{i=1}^n |x_i|$, $\|x\|_2 = \sqrt{\sum_{i=1}^n |x_i|^2}$, and $\|x\|_\infty := \lim_{p \to \infty} \|x\|_p = \max_{i=1,\ldots,n} |x_i|$. Show that $\|x\|_\infty \leq \|x\|_2 \leq \|x\|_1$.

What to Submit:
• Proof.

A3. [2 points] A set A ⊆ Rn is convex if λx + (1 − λ)y ∈ A for all x, y ∈ A and λ ∈ [0, 1]. For each of the
grey-shaded sets below (I-II), state whether each one is convex, or state why it is not convex using any of the
points a, b, c, d in your answer.

What to Submit:
• Parts I, II: 1-2 sentence explanation of why the set is convex or not.

A4. [2 points] We say a function f : Rd → R is convex on a set A if f (λx + (1 − λ)y) ≤ λf (x) + (1 − λ)f (y)
for all x, y ∈ A and λ ∈ [0, 1]. For each of the functions shown below (I-II), state whether each is convex on the
specified interval, or state why not with a counterexample using any of the points a, b, c, d in your answer.

a. Function in panel I on [a, c]

b. Function in panel II on [a, d]

What to Submit:
• Parts a, b: 1-2 sentence explanation of why the function is convex or not.

B2. For $i = 1, \ldots, n$ let $\ell_i(w)$ be convex functions over $w \in \mathbb{R}^d$ (e.g., $\ell_i(w) = (y_i - w^T x_i)^2$), let $\|\cdot\|$ be any norm, and let $\lambda > 0$.

a. [3 points] Show that

$$\sum_{i=1}^n \ell_i(w) + \lambda \|w\|$$

is convex over $w \in \mathbb{R}^d$. (Hint: Show that if $f, g$ are convex functions, then $f(x) + g(x)$ is also convex.)

b. [1 point] Explain in one sentence why we prefer to use loss functions and regularized loss functions
that are convex.

What to Submit
• Part a: Proof.
• Part b: 1-2 sentence explanation.

Lasso on a Real Dataset
Given $\lambda > 0$ and data $\{(x_i, y_i)\}_{i=1}^n$, the Lasso is the problem of solving

$$\arg\min_{w \in \mathbb{R}^d,\, b \in \mathbb{R}} \; \sum_{i=1}^n (x_i^T w + b - y_i)^2 + \lambda \sum_{j=1}^d |w_j|$$

where λ is a regularization parameter. For the programming part of this homework, you will implement the
iterative shrinkage thresholding algorithm shown in Algorithm 1 to solve the Lasso problem in ISTA.py. This
is a variant of the subgradient descent method; a more detailed discussion can be found in these slides. You may use common computing packages (such as numpy or scipy), but do not use an existing Lasso solver (e.g., the one in scikit-learn).

Algorithm 1: Iterative Shrinkage Thresholding Algorithm for Lasso

Input: Step size η
while not converged do
    b′ ← b − 2η Σ_{i=1}^n (x_i^T w + b − y_i)
    for k ∈ {1, 2, . . . , d} do
        w′_k ← w_k − 2η Σ_{i=1}^n x_{i,k} (x_i^T w + b − y_i)
        w′_k ← w′_k + 2ηλ   if w′_k < −2ηλ
        w′_k ← 0            if w′_k ∈ [−2ηλ, 2ηλ]
        w′_k ← w′_k − 2ηλ   if w′_k > 2ηλ
    end
    b ← b′, w ← w′
end
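As a point of reference, here is a minimal vectorized numpy sketch of a single pass of Algorithm 1, assuming X is the n × d data matrix and y the length-n response vector (the function name and signature are illustrative, not part of the assignment):

import numpy as np

def ista_step(X, y, w, b, eta, lam):
    # One pass of Algorithm 1 (sketch): bias update, gradient step on w, then soft-thresholding
    residual = X @ w + b - y                       # x_i^T w + b - y_i for all i at once
    b_new = b - 2 * eta * np.sum(residual)
    w_new = w - 2 * eta * (X.T @ residual)
    thresh = 2 * eta * lam
    w_new = np.sign(w_new) * np.maximum(np.abs(w_new) - thresh, 0.0)   # the three-case threshold in one line
    return w_new, b_new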

Before you get started, the following hints may be useful:


• Wherever possible, use matrix libraries for matrix operations (not for loops). This especially applies to
computing the updates for w. While we wrote the algorithm above with a for loop for clarity, you should
be able to replace this loop using equivalent matrix/vector operations in your code (e.g., numpy functions).
• As a sanity check, ensure the objective value is nonincreasing with each step.
• It is up to you to decide on a suitable stopping condition. A common criterion is to stop when no element
of w changes by more than some small δ during an iteration. If you need your algorithm to run faster, an
easy place to start is to loosen this condition.
• You will need to solve the Lasso on the same dataset for many values of λ. This is called a regularization
path. One way to do this efficiently is to start at a large λ, and then for each consecutive solution, initialize
the algorithm with the previous solution, decreasing λ by a constant ratio (e.g., by a factor of 2).
• The smallest value of λ for which the solution $\hat{w}$ is entirely zero is given by

$$\lambda_{\max} = \max_{k=1,\ldots,d} \; 2\left|\sum_{i=1}^n x_{i,k}\Big(y_i - \frac{1}{n}\sum_{j=1}^n y_j\Big)\right| \qquad (1)$$

This is helpful for choosing the first λ in a regularization path.
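Under the same assumptions as above (X an n × d numpy array, y a length-n vector), Equation (1) can be computed in one line; this is only a sketch:

import numpy as np

def lambda_max(X, y):
    # Equation (1): largest absolute inner product between a column of X and the centered response, times 2
    return np.max(2 * np.abs(X.T @ (y - y.mean())))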

A5. We will first try out your solver with some synthetic data. A benefit of the Lasso is that if we believe
many features are irrelevant for predicting y, the Lasso can be used to enforce a sparse solution, effectively
differentiating between the relevant and irrelevant features. Suppose that $x \in \mathbb{R}^d$, $y \in \mathbb{R}$, $k < d$, and data are generated independently according to the model $y_i = w^T x_i + \epsilon_i$ where

$$w_j = \begin{cases} j/k & \text{if } j \in \{1, \ldots, k\} \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

and $\epsilon_i \sim N(0, \sigma^2)$ is noise (note that in the model above $b = 0$). We can see from Equation (2) that since $k < d$ and $w_j = 0$ for $j > k$, the features $k + 1$ through $d$ are irrelevant for predicting $y$.
Generate a dataset using this model with $n = 500$, $d = 1000$, $k = 100$, and $\sigma = 1$. You should generate the dataset such that each $\epsilon_i \sim N(0, 1)$, and $y_i$ is generated as specified above. You are free to choose a distribution from which the x's are drawn, but make sure to standardize the x's before running your experiments.
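One possible way to generate such a dataset (a sketch; drawing the x's from a standard normal is just one choice the prompt allows, and the variable names are illustrative):

import numpy as np

rng = np.random.default_rng(0)
n, d, k, sigma = 500, 1000, 100, 1.0

X = rng.standard_normal((n, d))
X = (X - X.mean(axis=0)) / X.std(axis=0)           # standardize the x's
w_true = np.zeros(d)
w_true[:k] = np.arange(1, k + 1) / k               # w_j = j/k for j = 1, ..., k (Equation (2))
y = X @ w_true + sigma * rng.standard_normal(n)    # y_i = w^T x_i + eps_i, eps_i ~ N(0, sigma^2)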

a. [10 points] With your synthetic data, solve multiple Lasso problems on a regularization path, starting at
λmax where no features are selected (see Equation (1)) and decreasing λ by a constant ratio (e.g., 2) until
nearly all the features are chosen. In plot 1, plot the number of non-zeros as a function of λ on the x-axis
(Tip: use plt.xscale('log')).
b. [10 points] For each value of λ tried, record values for the false discovery rate (FDR) (number of incorrect nonzeros in $\hat{w}$ / total number of nonzeros in $\hat{w}$) and the true positive rate (TPR) (number of correct nonzeros in $\hat{w}$ / $k$). Note: for each $j$, $\hat{w}_j$ is an incorrect nonzero if and only if $\hat{w}_j \neq 0$ while $w_j = 0$. In plot 2, plot these values with the x-axis as FDR, and the y-axis as TPR (a small helper for computing these quantities is sketched after this list).
Note that in an ideal situation we would have an (FDR, TPR) pair in the upper left corner. We can always trivially achieve $(0, 0)$ and $\left(\frac{d-k}{d}, 1\right)$.

c. [5 points] Comment on the effect of λ in these two plots in 1-2 sentences.
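For part (b), one possible sketch of how the FDR and TPR definitions above might be computed, assuming w_hat is your Lasso solution and w_true is the vector from Equation (2) (names are illustrative):

import numpy as np

def fdr_tpr(w_hat, w_true, k):
    # FDR and TPR as defined in part (b)
    nonzero = w_hat != 0
    incorrect = nonzero & (w_true == 0)             # nonzero in w_hat but truly zero
    correct = nonzero & (w_true != 0)               # nonzero in w_hat and truly nonzero
    fdr = incorrect.sum() / max(nonzero.sum(), 1)   # guard against an all-zero solution
    tpr = correct.sum() / k
    return fdr, tpr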

What to Submit:
• Part a: Plot 1.
• Part b: Plot 2.
• Part c: 1-2 sentence explanation.
• Code on Gradescope through coding submission

• All code you wrote in the write-up, with correct page mapping.

A6. We'll now put the Lasso to work on some real data in crime_data_lasso.py. Download the training data
set “crime-train.txt” and the test data set “crime-test.txt” from the website. Store your data in your working
directory, ensure you have the pandas library for Python installed, and read in the files with:

import pandas as pd
df_train = pd.read_table("crime-train.txt")
df_test = pd.read_table("crime-test.txt")

This stores the data as Pandas DataFrame objects. DataFrames are similar to Numpy arrays but more flexible;
unlike arrays, DataFrames store row and column indices along with the values of the data. Each column of a
DataFrame can also store data of a different type (here, all data are floats). Here are a few commands that will
get you working with Pandas for this assignment:

df.head()                  # Print the first few lines of DataFrame df.
df.index                   # Get the row indices for df.
df.columns                 # Get the column indices.
df["foo"]                  # Return the column named "foo".
df.drop("foo", axis=1)     # Return all columns except "foo".
df.values                  # Return the values as a Numpy array.
df["foo"].values           # Grab column foo and convert to Numpy array.
df.iloc[:3, :3]            # Use numerical indices (like Numpy) to get 3 rows and cols.

The data consist of local crime statistics for 1,994 US communities. The response y is the rate of violent crimes
reported per capita in a community. The name of the response variable is ViolentCrimesPerPop, and it is held
in the first column of df_train and df_test. There are 95 features, spanning many kinds of variables.
Some features are the consequence of complex political processes, such as the size of the police force and other
systemic and historical factors. Others are demographic characteristics of the community, including self-reported
statistics about race, age, education, and employment drawn from Census reports.
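One possible way to pull the design matrix and response out of the DataFrames read in above (a sketch; it assumes only the read-in code shown earlier):

y_train = df_train["ViolentCrimesPerPop"].values
X_train = df_train.drop("ViolentCrimesPerPop", axis=1).values
y_test = df_test["ViolentCrimesPerPop"].values
X_test = df_test.drop("ViolentCrimesPerPop", axis=1).values
feature_names = df_train.drop("ViolentCrimesPerPop", axis=1).columns   # handy for locating specific features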

The goals of this problem are threefold: (i) to encourage you to think about how data collection processes affect
the resulting model trained from that data; (ii) to encourage you to think deeply about models you might train
and how they might be misused; and (iii) to see how Lasso encourages sparsity of linear models in settings where
d is large relative to n. We emphasize that training a model on this dataset can suggest a degree of
correlation between a community’s demographics and the rate at which a community experiences
and reports violent crime. We strongly encourage students to consider why these correlations
may or may not hold more generally, whether correlations might result from a common cause,
and what issues can result in misinterpreting what a model can explain.

The dataset is split into a training and test set with 1,595 and 399 entries, respectively.¹ We will use this training set to fit a model to predict the crime rate in new communities and evaluate model performance on the test set. As there are a considerable number of input variables and fairly few training observations, overfitting is a serious issue. In order to avoid this, use the Lasso solver (Algorithm 1) implemented in the previous problem.

a. [4 points] Read the documentation for the original version of this dataset: http://archive.ics.uci.edu/ml/datasets/communities+and+crime. Report 3 features included in this dataset for which historical policy choices in the US would lead to variability in these features. As an example, the number of police in a community is often the consequence of decisions made by governing bodies, elections, and the amount of tax revenue available to decision makers.
b. [4 points] Before you train a model, describe 3 features in the dataset which might, if found to have nonzero weight in the model, be interpreted as reasons for higher levels of violent crime, but which might actually be a result rather than (or in addition to being) the cause of this violence.

Now, we will run the Lasso solver. Begin with λ = λmax defined in Equation (1). Initialize all weights to 0.
Then, reduce λ by a factor of 2 and run again, but this time initialize ŵ from your λ = λmax solution as your
initial weights, as described above. Continue the process of reducing λ by a factor of 2 until λ < 0.01. For all
plots use a log-scale for the λ dimension (Tip: use plt.xscale('log')).
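One possible shape for this regularization path loop (a sketch only: solve_lasso stands in for your own Algorithm 1 implementation, and its name and signature are placeholders; X_train and y_train are the numpy arrays extracted from df_train):

import numpy as np

lam = np.max(2 * np.abs(X_train.T @ (y_train - y_train.mean())))   # lambda_max from Equation (1)
w, b = np.zeros(X_train.shape[1]), 0.0
lambdas, nonzeros, weight_paths = [], [], []
while lam >= 0.01:
    w, b = solve_lasso(X_train, y_train, lam, w_init=w, b_init=b)   # warm start from the previous solution
    lambdas.append(lam)
    nonzeros.append(np.count_nonzero(w))
    weight_paths.append(w.copy())
    lam = lam / 2                                                   # reduce lambda by a factor of 2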

c. [4 points] Plot the number of nonzero weights of each solution as a function of λ.


d. [4 points] Plot the regularization paths (in one plot) for the coefficients for input variables agePct12t29,
pctWSocSec, pctUrban, agePct65up, and householdsize.

e. [4 points] On one plot, plot the mean squared error on the training and test data as a function of λ.
f. [4 points] Sometimes a larger value of λ performs nearly as well as a smaller value, but a larger value will
select fewer variables and perhaps be more interpretable. Inspect the weights ŵ for λ = 30. Which feature
had the largest (most positive) Lasso coefficient? What about the most negative? Discuss briefly.

g. [4 points] Suppose there was a large negative weight on agePct65up and upon seeing this result, a politician
suggests policies that encourage people over the age of 65 to move to high crime areas in an effort to reduce
crime. What is the (statistical) flaw in this line of reasoning? (Hint: fire trucks are often seen around
burning buildings, do fire trucks cause fire?)
¹ The features have been standardized to have mean 0 and variance 1.

What to Submit:
• Parts a, b: 1-2 sentence explanation.

• Part c: Plot 1.
• Part d: Plot 2.
• Part e: Plot 3.
• Parts f, g: Answers and 1-2 sentence explanation.

• Code on Gradescope through coding submission.


• All code you wrote in the write-up, with correct page mapping.

Logistic Regression
A7. Here we consider the MNIST dataset, but for binary classification. Specifically, the task is to determine
whether a digit is a 2 or 7. Here, let Y = 1 for all the “7” digits in the dataset, and use Y = −1 for “2”.
We will use regularized logistic regression. Given a binary classification dataset $\{(x_i, y_i)\}_{i=1}^n$ for $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$, we showed in class that the regularized negative log likelihood objective function can be written as

$$J(w, b) = \frac{1}{n}\sum_{i=1}^n \log\left(1 + \exp(-y_i (b + x_i^T w))\right) + \lambda \|w\|_2^2$$

Note that the offset term $b$ is not regularized. For all experiments, use $\lambda = 10^{-1}$. Let $\mu_i(w, b) = \frac{1}{1 + \exp(-y_i (b + x_i^T w))}$.
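For reference, the objective and the $\mu_i$'s above can be evaluated with a few lines of numpy; this is a sketch assuming X is the n × d data matrix and y the ±1 label vector (function names are illustrative):

import numpy as np

def mu(w, b, X, y):
    # mu_i(w, b) = 1 / (1 + exp(-y_i (b + x_i^T w))), evaluated for all i at once
    return 1.0 / (1.0 + np.exp(-y * (b + X @ w)))

def J(w, b, X, y, lam=0.1):
    # Regularized negative log-likelihood from the display above
    return np.mean(np.log(1.0 + np.exp(-y * (b + X @ w)))) + lam * np.dot(w, w)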

a. [8 points] Derive the gradients ∇w J(w, b), ∇b J(w, b) and give your answers in terms of µi (w, b) (your
answers should not contain exponentials).
b. [8 points] Implement gradient descent with an initial iterate of all zeros. Try several step sizes to find one that appears to make convergence on the training set as fast as possible. Run until you feel you are near to convergence.
(i) For both the training set and the test set, plot $J(w, b)$ as a function of the iteration number (and show both curves on the same plot).
(ii) For both the training set and the test set, classify the points according to the rule $\mathrm{sign}(b + x_i^T w)$ and plot the misclassification error as a function of the iteration number (and show both curves on the same plot).
Reminder: Make sure you are only using the test set for evaluation (not for training).
c. [7 points] Repeat (b) using stochastic gradient descent with a batch size of 1. Note, the expected gradient
with respect to the random selection should be equal to the gradient found in part (a). Show both plots
described in (b) when using batch size 1. Take careful note of how to scale the regularizer.
d. [7 points] Repeat (b) using mini-batch gradient descent with a batch size of 100. That is, instead of approximating the gradient with a single example, use 100. Note: the expected gradient with respect to the random selection should be equal to the gradient found in part (a). (A skeleton of the training loop for parts (b)-(d), with the part (a) gradients left as a placeholder, follows this list.)
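A minimal skeleton of the training loop for parts (b)-(d), with the part (a) gradients left as a placeholder (gradients, the step size, iteration count, and batch size are illustrative assumptions, not prescribed by the assignment):

import numpy as np

def train(X, y, step_size, num_iters, batch_size=None, lam=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0                          # initial iterate of all zeros
    for t in range(num_iters):
        if batch_size is None:                       # part (b): full-batch gradient descent
            Xb, yb = X, y
        else:                                        # part (c): batch_size=1, part (d): batch_size=100
            idx = rng.choice(n, size=batch_size, replace=False)
            Xb, yb = X[idx], y[idx]
        grad_w, grad_b = gradients(w, b, Xb, yb, lam)   # placeholder for your part (a) result
        w = w - step_size * grad_w
        b = b - step_size * grad_b
        # record J(w, b) and misclassification error on train and test here for the plots
    return w, b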

What to Submit
• Part a: Proof
• Part b: Separate plots for b(i) and b(ii).

• Part c: Separate plots for c which reproduce those from b(i) and b(ii) for this case.
• Part d: Separate plots for d which reproduce those from b(i) and b(ii) for this case.
• Code on Gradescope through coding submission.

Administrative
A8.
a. [2 points] About how many hours did you spend on this homework? There is no right or wrong answer :)
