
Homework assignment 3

Return this notebook (filled with your answers) by the deadline via mycourses. Also provide pdf printout of the
notebook.

Note that the notebook that you submit needs to work, that is, if running it produces errors, then that may result
in reduction of points.

The first two questions relate to the lecture notes by Prof. Ollila and the last two questions to the lecture notes by Prof. Vorobyov.

My name : Ido Akov


My student number : 100597222

In [1]:
import pandas as pd
from sklearn import linear_model as LM

In [2]:

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
matplotlib.rcParams['figure.figsize'] = (8.0, 8.0)

Question 1: look at the data (basic operations)


Load the data in prostate.txt to your workspace. You can consult the file prostate.info.txt for a description of the data set.

In [3]:
data_table = pd.read_csv('prostate.txt', sep='\t')
predictors=list(data_table.columns[1:-2])
data_table

Out[3]:

Unnamed: 0 lcavol lweight age lbph svi lcp gleason pgg45 lpsa train

0 1 -0.579818 2.769459 50 -1.386294 0 -1.386294 6 0 -0.430783 T

1 2 -0.994252 3.319626 58 -1.386294 0 -1.386294 6 0 -0.162519 T

2 3 -0.510826 2.691243 74 -1.386294 0 -1.386294 7 20 -0.162519 T

3 4 -1.203973 3.282789 58 -1.386294 0 -1.386294 6 0 -0.162519 T

4 5 0.751416 3.432373 62 -1.386294 0 -1.386294 6 0 0.371564 T

... ... ... ... ... ... ... ... ... ... ... ...

92 93 2.830268 3.876396 68 -1.386294 1 1.321756 7 60 4.385147 T

93 94 3.821004 3.896909 44 -1.386294 1 2.169054 7 40 4.684443 T

94 95 2.907447 3.396185 52 -1.386294 1 2.463853 7 10 5.143124 F

95 96 2.882564 3.773910 68 1.558145 1 1.558145 7 80 5.477509 T

96 97 3.471966 3.974998 68 0.438255 1 2.904165 7 20 5.582932 F


97 rows × 11 columns

1 a)
Make a scatterplot matrix of the prostate cancer variables, where the first row shows the response against each
of the predictors in turn.
Hint: you should get the same picture that is displayed in Figure 1.1., page 3, of Hastie et al. (2017).
https://hastie.su.domains/ElemStatLearn/printings/ESLII_print12_toc.pdf

In [4]:
sns.pairplot(data_table[['lpsa']+predictors])
Out[4]:
<seaborn.axisgrid.PairGrid at 0x7f251a6c2d00>

1 b)
Center and standardize all predictors to have mean zero and unit sample variance

In [5]:
X = data_table[predictors].to_numpy()
y = data_table['lpsa']
#center X
X = (X-np.mean(X,axis=0))
#standardize X
s = np.sqrt(np.diag((X.T@X))/len(X))
X = np.divide(X,s)

1 c)
Split data into the training and test sets, according to the labels in the last column in prostate.txt .

In [103]:
train_indices = data_table['train']== 'T'
Xtrain, ytrain = X[train_indices], y[train_indices]
Xtest, ytest = X[np.invert(train_indices)], y[np.invert(train_indices)]

1 d)

Fit a LS linear regression (with intercept) on the training set. Report the estimated regression coefficients.
Plot the residuals versus observation number.

Hint: you should get exactly the same results as given in Table 3.2, page 50, of Hastie et al. (2009).

1 e)
Compute the prediction error (PE) on the test set, where PE is defined as

$$\mathrm{PE} = \frac{1}{N_{\text{test}}}\sum_{i \in I_{\text{test}}} \left(y_i - \hat{y}_i\right)^2,$$

where $I_{\text{test}}$ denotes the indices in the test set and $N_{\text{test}}$ the number of test observations.

In [104]:
#add intercept
Xtrain = np.hstack((np.ones((len(Xtrain),1)),Xtrain))
Xtest = np.hstack((np.ones((len(Xtest),1)),Xtest))

#closed form solution of linear regression


LS_coeffs = np.linalg.pinv(Xtrain)@ytrain
print(dict(zip(['intercept'] + predictors ,np.round(LS_coeffs,decimals=2))))
pred_err = lambda b: np.mean(np.square(Xtest@b - ytest))
LS_pred_err = pred_err(LS_coeffs)
print("prediction error is: ",LS_pred_err)

#plot residuals
plt.scatter(range(len(ytest)),Xtest@LS_coeffs - ytest)
plt.axhline(y=0, color='r', linestyle='--',label = 'actual lpsa value per observation number')

plt.xlabel("observation number")
plt.ylabel("residual")
plt.legend()
plt.show()

{'intercept': 2.46, 'lcavol': 0.68, 'lweight': 0.26, 'age': -0.14, 'lbph': 0.21, 'svi': 0.3, 'lcp': -0.29, 'gleason': -0.02, 'pgg45': 0.27}
prediction error is: 0.5212740055076002
1 f)
Compute and report the correlation matrix of predictors variables in the training set.
Identify the largest correlation between the predictors and report it in the form:
max correlation (3 decimal accuracy) is XXX between predictors XXX and XXX.

Hint: you should get exactly the same values as in Table 3.1 of Hastie et al. (2017, 12th printing, p. 50).

In [8]:
#get rid of intercept
Xtrain = Xtrain[:,1:]
corrs = np.round(np.corrcoef(Xtrain.T),decimals=3)
#look only at lower part of correlation matrix
corrs = np.tril(corrs) - np.eye(*corrs.shape)
corrdf = pd.DataFrame(data=corrs[1:,:-1])
corrdf.index, corrdf.columns = predictors[1:], predictors[:-1]
pred1, pred2 = np.unravel_index(corrs.argmax(), corrs.shape)
print(corrdf)
print("max correlation is {0:.3f} between predictors {1} and {2}".format(np.max(corrs),predictors[pred1],predictors[pred2]))

lcavol lweight age lbph svi lcp gleason
lweight 0.300 0.000 0.000 0.000 0.000 0.000 0.000
age 0.286 0.317 0.000 0.000 0.000 0.000 0.000
lbph 0.063 0.437 0.287 0.000 0.000 0.000 0.000
svi 0.593 0.181 0.129 -0.139 0.000 0.000 0.000
lcp 0.692 0.157 0.173 -0.089 0.671 0.000 0.000
gleason 0.426 0.024 0.366 0.033 0.307 0.476 0.000
pgg45 0.483 0.074 0.276 -0.030 0.481 0.663 0.757
max correlation is 0.757 between predictors pgg45 and gleason

Question 2: Cyclic Coordinate Descent (CCD) for lasso and elastic net

Read lecture notes discussing the cyclic coordinate descent algorithm (Esa's lecture notes) for lasso and elastic net.

2 a)
Implement the basic CCD Elastic Net (EN) algorithm (ccden) (see Algorithm 6.1) by yourself by writing a function named ccden below. Recall that this algorithm assumes that the predictors are standardized.
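For reference, a compact statement of the coordinate-wise update that the ccden function below applies (this is my reading of the soft-thresholding update of Algorithm 6.1, assuming unit-variance predictors and the 1/(2N) loss scaling used in 2 b); here r = y − Xβ̂ is the current residual and S_t is the soft-thresholding operator):

$$\hat{\beta}_j \leftarrow \frac{S_{\lambda\alpha}\!\left(\hat{\beta}_j + \tfrac{1}{N}\,x_j^\top r\right)}{1 + \lambda(1-\alpha)}, \qquad S_t(u) = \operatorname{sign}(u)\,\max(0,\,|u|-t).$$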

In [9]:
def ccden(y, X, b_init, lam, alpha=1.0, delta=1e-4, max_iter=10000):
    #compute all betas for convergence graphs later
    denom = 1 + lam*(1 - alpha)
    soft_thresh = lambda x: np.sign(x) * np.maximum(0,abs(x) - lam*alpha)
    update = lambda j,r,b: soft_thresh(b[j] + np.dot(r, X[:,j])/len(X)) / denom

    #initialize values
    r = y - X@b_init
    betas = [b_init]
    #iters
    for i in range(max_iter):
        beta_hat = np.zeros(b_init.shape)
        #cyclic coordinates
        for j in range(len(X[0])):
            #update every new coordinate of beta with update operator
            beta_hat[j] = update(j,r,betas[i])
            #update residual
            r += (betas[i][j] - beta_hat[j])* X[:,j]
        if np.linalg.norm(betas[i]-beta_hat)/np.linalg.norm(beta_hat) < delta:
            #converged
            break
        betas.append(beta_hat)
    return betas

2 b)
But did my code work? Let's check this out. So your ccden function has produced an estimate that minimizes

$$\frac{1}{2N}\,\|y - X\beta\|_2^2 + \lambda\alpha\,\|\beta\|_1 + \frac{\lambda(1-\alpha)}{2}\,\|\beta\|_2^2.$$

Recall the subgradient optimality conditions: β̂ is the solution of the EN optimization problem for a given penalty parameter if and only if equation (6.4) in the lecture notes holds.
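Concretely, the condition checked by the subg_cond function further below is the following coordinate-wise statement (written here as I interpret equation (6.4); the lecture notes may use an equivalent formulation): for each coordinate j,

$$\frac{1}{N}x_j^\top\!\left(y - X\hat{\beta}\right) - \lambda(1-\alpha)\hat{\beta}_j = \lambda\alpha\,\operatorname{sign}(\hat{\beta}_j) \;\text{ if } \hat{\beta}_j \neq 0, \qquad \left|\frac{1}{N}x_j^\top\!\left(y - X\hat{\beta}\right)\right| \le \lambda\alpha \;\text{ if } \hat{\beta}_j = 0.$$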

Perform these steps:

1. First center both the response and predictor variables of the prostate cancer training data set you created in 1(c).
2. Then standardize the predictors.
3. Give this training data as inputs to ccden function to find the solution β̂(λ, α) with a penalty parameter value λ = 0.3 and α = 1 (lasso) and α = 0.9. The initial value of iteration beta_init should be a vector of zeros.
4. Report the solutions.
5. Verify that the subgradient optimality condition holds for your solutions.

Note: Essentially, items 1 and 2 perform Steps 1 and 2 of Algorithm 5.1 of Esa's lecture notes.

In [10]:
lam = 0.3 # This is the penalty for lasso
al = 0.9 # This is the EN penalty parameter
In [11]:
# step 1 and 2: center and standardize
X_cen = (Xtrain-np.mean(Xtrain,axis=0))
s = np.sqrt(np.diag((X_cen.T@X_cen))/len(X_cen))
X0 = np.divide(X_cen,s)
y0 = ytrain - np.mean(ytrain) #centered

In [13]:
# step 3: compute solutions, use last coefficients beta star as optimal solution
lasso_coeffs = ccden(y0, X0, np.zeros(len(X0[0])), lam)
en_coeffs = ccden(y0, X0, np.zeros(len(X0[0])), lam,alpha=al)
my_beta_las_star = lasso_coeffs[-1]
my_beta_en_star = en_coeffs[-1]

In [14]:

def subg_cond(y,X,b,alpha,lam,delta):
    """
    compute subgradient optimality condition equation
    """
    #values to compare
    vals = X.T@(y - X@b)/len(X) - lam*(1-alpha)*b
    #conditions to compare with
    conditions = np.array([alpha*lam*np.sign(b_j) if b_j \
                           else alpha*lam for b_j in b])
    #check that vals are close enough to conditions for b_j!=0 and smaller than
    #conditions for b_j==0
    return np.all(np.isclose(vals[b != 0],conditions[b != 0],atol=delta)) and \
           np.all(np.abs(vals[b == 0]) <= conditions[b == 0])

In [15]:
# step 5: subgradient optimality condition
print(subg_cond(y0,X0,my_beta_las_star,1,lam,1e-4))
print(subg_cond(y0,X0,my_beta_en_star,al,lam,1e-4))

True
True

Note : You can verify that your code works by checking that it returns the same value as the following code:

In [16]:
# y0 and X0 are here the centered / standardized data from 2b)
regLasso0 = LM.Lasso(fit_intercept=False,alpha=lam).fit(X0, y0)
regEN0 = LM.ElasticNet(fit_intercept=False,alpha=lam,l1_ratio=al).fit(X0, y0)
beta_las_star = regLasso0.coef_
beta_en_star = regEN0.coef_
pd.DataFrame(data=(beta_las_star, beta_en_star))
Out[16]:

0 1 2 3 4 5 6 7

0 0.521421 0.118423 0.0 0.0 0.036958 0.0 0.0 0.0

1 0.511467 0.141785 0.0 0.0 0.066632 0.0 0.0 0.0

2c)
Then transform the obtained estimates to original scale and compute the intercept, i.e., apply steps 4 and 5 of Algorithm 5.1. Report the obtained values of the regression coefficients β̂(λ, α) and intercept β̂₀(λ, α) when (λ, α) = (0.3, 1) and (λ, α) = (0.3, 0.9). Note that the former (α = 1) yields the lasso solution for the original training data y and X. Compare the found lasso and EN solution with the LSE solution you computed in question 1d).
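As a sketch, the back-transformation carried out by scale_and_intercept below (which should correspond to Steps 4 and 5 of Algorithm 5.1 up to notation) divides each coefficient by the training-set scale s_j from 2 b) and recovers the intercept from the training means:

$$\hat{\beta}_j^{\,\text{orig}} = \frac{\hat{\beta}_j(\lambda,\alpha)}{s_j}, \qquad \hat{\beta}_0(\lambda,\alpha) = \bar{y} - \bar{x}^\top \hat{\beta}^{\,\text{orig}}.$$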

In [17]:
def scale_and_intercept(coeffs,s,y,X):
    coeffs = np.divide(coeffs,s)
    interc = np.mean(y) - np.dot(np.mean(X,axis=0),coeffs)
    return np.insert(coeffs,0,interc)

las_coeffs_interc = scale_and_intercept(my_beta_las_star,s,ytrain,Xtrain)
en_coeffs_interc = scale_and_intercept(my_beta_en_star,s,ytrain,Xtrain)

print("LS coefficient values are {} ".format(
    dict(zip(['intercept'] + predictors ,np.round(LS_coeffs,decimals=2)))))
print("Lasso coefficient values are {} ".format(
dict(zip(['intercept'] + predictors ,np.round(las_coeffs_interc,decimals=2)))))
print("EN coefficient values are {} ".format(
dict(zip(['intercept'] + predictors ,np.round(en_coeffs_interc,decimals=2)))))

#added bias predictor column to test set previously


preds = {'LS':pred_err(LS_coeffs), 'lasso': \
pred_err(las_coeffs_interc), 'en': pred_err(en_coeffs_interc)}
print(*["{} prediction error: {} ".format(key,pair) \
for key, pair in preds.items()],sep="\n ")

LS coefficient values are {'intercept': 2.46, 'lcavol': 0.68, 'lweight': 0.26, 'age': -0.14, 'lbph': 0.21, 'svi': 0.3, 'lcp': -0.29, 'gleason': -0.02, 'pgg45': 0.27}
Lasso coefficient values are {'intercept': 2.47, 'lcavol': 0.5, 'lweight': 0.11, 'age': 0.0, 'lbph': 0.0, 'svi': 0.04, 'lcp': 0.0, 'gleason': 0.0, 'pgg45': 0.0}
EN coefficient values are {'intercept': 2.47, 'lcavol': 0.49, 'lweight': 0.13, 'age': 0.0, 'lbph': 0.0, 'svi': 0.07, 'lcp': 0.0, 'gleason': 0.0, 'pgg45': 0.0}
LS prediction error: 0.5212740055076002
lasso prediction error: 0.5387062169510012
en prediction error: 0.5195710179001547

Question 3: FISTA for lasso

3a)
Implement FISTA for the lasso problem (see the lecture notes by Prof. Vorobyov, p. 33) by yourself by writing a function named fistalasso.
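As a sketch, the recursion iterated by fistalasso below, written with the same indexing as in the code (the lecture notes may use a slightly different but equivalent indexing), with step size 1/L, soft-thresholding operator S_t, t_1 = 1, and no momentum on the very first step:

$$t_{k+1} = \tfrac{1}{2}\left(1+\sqrt{1+4t_k^2}\right), \qquad z_k = x_k + \frac{t_{k-1}-1}{t_k}\,(x_k - x_{k-1}), \qquad x_{k+1} = S_{\tau/L}\!\left(z_k - \tfrac{1}{L}A^\top(Az_k - y)\right).$$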

In [18]:
def fistalasso(A,y,L,tau,delta=1e-4, max_iter=10000):
    #initialize starter values
    xs = [np.random.rand(len(A[0]))]
    ts = [1]
    step_size = 1/L
    soft_thresh = lambda x: np.sign(x) * np.maximum(0,abs(x) - tau*step_size)

    for i in range(max_iter):
        if i and np.linalg.norm(xs[i] - xs[i-1])/np.linalg.norm(xs[i]) < delta:
            #converged
            break
        next_t = 0.5*(1+np.sqrt(1+4*ts[i]**2))
        ts.append(next_t)
        z_k = xs[i] + (ts[i-1]-1)*(xs[i] - xs[i-1])/ts[i] if i else xs[i] #return x[i] for i<=1
        next_x = soft_thresh(z_k - step_size*A.T@(A@z_k - y))
        xs.append(next_x)
    return xs
3 b)
Use the same data set as in question 2b). Show the convergence graphs (as in Assignment 1) for both FISTA and
CCD and compare. Use the value beta_las_star (obtained in the Note of question 2b) as the true optimum β∗ of the lasso objective function.

In [109]:
# compute the values
u,d,vt = np.linalg.svd(X0)
#use alpha = 1/N, tau = L*mu
fistalasso_coeffs = fistalasso(X0,y0,len(X0),min(d)*max(d))
fistalasso_coeffs_interc = scale_and_intercept(fistalasso_coeffs[-1],s,ytrain,Xtrain)
print("{} prediction error: {} ".format('Fista-Lasso',pred_err(fistalasso_coeffs_interc)))

Fista-Lasso prediction error: 0.9300245965239617

In [98]:
# show convergerence graph

def plot_converg(res,A,b,text):
    inner_pred_err = lambda x: np.mean(np.square(A@x - b))
    errors = [inner_pred_err(x) for x in res]

    diffs = [errors[i]/errors[i-1] for i in range(1,len(errors))]

    #give more weight to later values because of short convergence time
    estimed_conv_rate = np.average(diffs,weights=np.linspace(1.0,2.0,len(diffs)))
    if estimed_conv_rate < 1:
        #plot linear function to compare with
        linear = [estimed_conv_rate**k for k in range(len(res))]
        plt.plot(range(len(res)),linear, label= "linear convergence: ({:.3f}**k)".format(estimed_conv_rate))
    elif np.isclose(estimed_conv_rate,1,atol=1e-3):
        quasi_linear = [estimed_conv_rate**k for k in range(len(res))]
        plt.plot(range(len(res)),quasi_linear, label= "sublinear (almost linear convergence): ({:.4f}**k)".format(estimed_conv_rate))
    else:
        #plot sublinear function to compare with
        sublinear = [errors[0]] + [1/x for x in range(1,len(res))] if errors[0] > 1 else [errors[0]] + [errors[0]/x for x in range(1,len(res))]
        plt.plot(range(len(res)),sublinear, label= "sublinear convergence: 1/x")
    plt.plot(range(len(res)),errors, label= text)
    plt.title(text)
    plt.ylabel('Prediction error train-set')
    plt.xlabel('iteration number k')
    plt.legend()
    plt.show()
#convergence looks sublinear
plot_converg(fistalasso_coeffs,X0,y0,"FISTA-Lasso convergence")
In [94]:
#convergence looks linear
plot_converg(lasso_coeffs,X0,y0,"CCD-Lasso convergence")

Question 4: Alternating Direction Method of Multipliers for lasso

4 a)
Implement the Alternating Direction Method of Multipliers (ADMM) for the lasso problem (see the lecture notes by Prof. Vorobyov, p. 34) by yourself by writing a function named admmlasso. Use ρ = 1 in the ADMM algorithm.
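For reference, a sketch of the ADMM iterations implemented in admmlasso below for the split problem min ½‖Ax − y‖₂² + τ‖z‖₁ subject to x = z, using the unscaled dual variable u (called ls in the code; the lecture notes may present the scaled-dual form instead):

$$x^{k+1} = \left(A^\top A + \rho I\right)^{-1}\!\left(A^\top y + \rho z^k - u^k\right), \quad z^{k+1} = S_{\tau/\rho}\!\left(x^{k+1} + u^k/\rho\right), \quad u^{k+1} = u^k + \rho\left(x^{k+1} - z^{k+1}\right).$$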

In [61]:
def admmlasso(A,y,rho,tau,delta=1e-4, max_iter=10000):

    soft_thresh = lambda x: np.sign(x) * np.maximum(0,abs(x) - tau/rho)

    #initialize values
    xs = []
    zs = [np.zeros(len(A[0]))]
    ls = [0]
    for i in range(max_iter):
        next_x = np.linalg.inv(A.T@A + rho*np.eye(len(A[0])))@(A.T@y + rho*zs[i] - ls[i])
        xs.append(next_x)
        if i>1 and np.linalg.norm(xs[i] - xs[i-1])/np.linalg.norm(xs[i]) < delta:
            #converged
            break
        next_z = soft_thresh(next_x + ls[i]/rho)
        zs.append(next_z)
        next_l = ls[i] + rho*(next_x-next_z)
        ls.append(next_l)
    return xs

4b)
Show the convergence graph of ADMM implementation and compare it to that of FISTA and CCD from the
previous problems.

In [88]:
# compute the values
#used tau = L*mu/N
admmlasso_coeffs = admmlasso(X0,y0,1,min(d)*max(d)/len(X0))

In [99]:
# show convergerence graph
#convergence is sublinear
plot_converg(admmlasso_coeffs,X0,y0,"ADMM-Lasso convergence")

In [107]:

admmlasso_coeffs_interc = scale_and_intercept(admmlasso_coeffs[-1],s,ytrain,Xtrain)
print("{} prediction error: {} ".format('ADMM-Lasso',pred_err(admmlasso_coeffs_interc)))

ADMM-Lasso prediction error: 0.4965710693884858
