Homework Assignment 3
Return this notebook (filled with your answers) by the deadline via MyCourses. Also provide a PDF printout of the notebook.
Note that the notebook you submit needs to work: if running it produces errors, that may result in a reduction of points.
The first two questions relate to the lecture notes by Prof. Ollila and the last two questions to the lecture notes by Prof. Vorobyov.
In [1]:
import pandas as pd
from sklearn import linear_model as LM
In [2]:
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
matplotlib.rcParams['figure.figsize'] = (8.0, 8.0)
In [3]:
data_table = pd.read_csv('prostate.txt', sep='\t ')
predictors=list(data_table.columns[1:-2])
data_table
Out[3]:
Unnamed: 0 lcavol lweight age lbph svi lcp gleason pgg45 lpsa train
(data rows omitted)
1 a)
Make a scatterplot matrix of the prostate cancer variables, where the first row shows the response against each
of the predictors in turn.
Hint: you should get the same picture that is displayed in Figure 1.1., page 3, of Hastie et al. (2017).
https://hastie.su.domains/ElemStatLearn/printings/ESLII_print12_toc.pdf
In [4]:
sns.pairplot(data_table[['lpsa']+predictors])
Out[4]:
<seaborn.axisgrid.PairGrid at 0x7f251a6c2d00>
1 b)
Center and standardize all predictors to have mean zero and unit sample variance
In [5]:
X = data_table[predictors].to_numpy()
y = data_table['lpsa']
#center X
X = (X-np.mean(X,axis=0))
#standardize X
s = np.sqrt(np.diag((X.T@X))/len(X))
X = np.divide(X,s)
1 c)
Split the data into training and test sets according to the labels in the last column of prostate.txt.
In [103]:
train_indices = data_table['train']== 'T'
Xtrain, ytrain = X[train_indices], y[train_indices]
Xtest, ytest = X[np.invert(train_indices)], y[np.invert(train_indices)]
1 d)
Fit an LS linear regression (with intercept) on the training set. Report the estimated regression coefficients.
Plot the residuals versus observation number.
Hint: you should get exactly the same results as given in Table 3.2, page 50, of Hastie et al. (2009) and shown below:
1 e)
Compute the prediction error (PE) on the test set, where PE is defined as
$$\mathrm{PE} = \frac{1}{N_{\text{test}}} \sum_{i \in I_{\text{test}}} \left(y_i - \hat{y}_i\right)^2 .$$
In [104]:
#add intercept
Xtrain = np.hstack((np.ones((len(Xtrain),1)),Xtrain))
Xtest = np.hstack((np.ones((len(Xtest),1)),Xtest))
#LS fit on the training set (note: the lines defining LS_coeffs and pred_err were
#cut from this printout and are reconstructed here from the reported output)
LS_coeffs = np.linalg.lstsq(Xtrain, ytrain, rcond=None)[0]
print(dict(zip(['intercept'] + predictors, np.round(LS_coeffs, 2))))
#prediction error (PE) on the test set
pred_err = lambda b: np.mean(np.square(Xtest@b - ytest))
print("prediction error is:", pred_err(LS_coeffs))
#plot residuals versus observation number (test set)
plt.scatter(range(len(ytest)), Xtest@LS_coeffs - ytest)
plt.axhline(y=0, color='r', linestyle='--', label='zero-residual reference line')
plt.xlabel("observation number")
plt.ylabel("residual")
plt.legend()
plt.show()
{'intercept': 2.46, 'lcavol': 0.68, 'lweight': 0.26, 'age': -0.14, 'lbph': 0.21, 'svi': 0.3, 'lcp': -0.29, 'gleason': -0.02, 'pgg45': 0.27}
prediction error is: 0.5212740055076002
1 f)
Compute and report the correlation matrix of the predictor variables in the training set.
Identify the largest correlation between the predictors and report it in the form:
max correlation (3 decimal accuracy) is XXX between predictors XXX and XXX.
Hint: you should get exactly the same values as in Table 3.1 of Hastie et al. (2017, 12th printing, p. 50) and shown below:
In [8]:
#get rid of intercept
Xtrain = Xtrain[:,1:]
corrs = np.round(np.corrcoef(Xtrain.T),decimals=3)
#look only at lower part of correlation matrix
corrs = np.tril(corrs) - np.eye(*corrs.shape)
corrdf = pd.DataFrame(data=corrs[1:,:-1])
corrdf.index, corrdf.columns = predictors[1:], predictors[:-1]
pred1, pred2 = np.unravel_index(corrs.argmax(), corrs.shape)
print(corrdf)
print("max correlation is {0:.3f} between predictors {1} and {2}".format(np.max(corrs),pr
edictors[pred1],predictors[pred2]))
2 a)
Implement the basic CCD Elastic Net (EN) algorithm ( ccden ) (see Algorithm 6.1 ) by yourself by writing a
function named ccden below. Recall that this algorithm assumes that the predictors are standardized.
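For reference, with standardized predictors the coordinate update implemented by the update lambda in the code below can be written as
$$\hat\beta_j \leftarrow \frac{S_{\lambda\alpha}\!\left(\hat\beta_j + \tfrac{1}{N}\,\mathbf{x}_j^{\top} r\right)}{1 + \lambda(1-\alpha)}, \qquad S_t(u) = \operatorname{sign}(u)\max(0,\,|u| - t),$$
where r is the current residual and $\mathbf{x}_j$ the j-th predictor column.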
In [9]:
def ccden(y, X, b_init, lam, alpha=1.0, delta=1e-4, max_iter=10000):
    #keep all beta iterates so they can be used for convergence graphs later
    denom = 1 + lam*(1 - alpha)
    soft_thresh = lambda x: np.sign(x) * np.maximum(0, abs(x) - lam*alpha)
    update = lambda j, r, b: soft_thresh(b[j] + np.dot(r, X[:,j])/len(X)) / denom
    #initialize values
    r = y - X@b_init
    betas = [b_init]
    #iterations
    for i in range(max_iter):
        beta_hat = np.zeros(b_init.shape)
        #cyclic coordinate updates
        for j in range(len(X[0])):
            #update coordinate j of beta with the soft-thresholding update operator
            beta_hat[j] = update(j, r, betas[i])
            #update the residual to account for the new value of coordinate j
            r += (betas[i][j] - beta_hat[j]) * X[:,j]
        betas.append(beta_hat)
        #stop when the relative change of the iterates is small enough
        if np.linalg.norm(betas[i] - beta_hat)/np.linalg.norm(beta_hat) < delta:
            break
    return betas
2 b)
But did my code work? Let's check this out. So your ccden function has produced you an estimate that minimizes
$$\frac{1}{2N}\,\lVert y - X\beta \rVert_2^2 + \lambda\alpha \lVert \beta \rVert_1 + \frac{\lambda(1-\alpha)}{2}\,\lVert \beta \rVert_2^2 .$$
1. First center both the response and predictor variables of the prostate cancer training data set you created in 1(c).
2. Then standardize the predictors.
3. Give this training data as inputs to the ccden function to find the solution β̂(λ, α) with a penalty parameter value λ = 0.3 and α = 1 (lasso) and α = 0.9. The initial value of iteration beta_init should be a vector of zeros.
4. Report the solutions.
5. Verify that the subgradient optimality condition holds for your solutions (the condition is written out below).
Note: Essentially, items 1 and 2 perform Steps 1 and 2 of Algorithm 5.1 of Esa's lecture notes.
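For reference, the subgradient optimality condition checked by the subg_cond function below, written for the objective above, is
$$\frac{1}{N}\,\mathbf{x}_j^{\top}\!\left(y - X\hat\beta\right) - \lambda(1-\alpha)\hat\beta_j = \lambda\alpha\,\operatorname{sign}\!\big(\hat\beta_j\big) \quad \text{if } \hat\beta_j \neq 0, \qquad \left|\frac{1}{N}\,\mathbf{x}_j^{\top}\!\left(y - X\hat\beta\right)\right| \leq \lambda\alpha \quad \text{if } \hat\beta_j = 0,$$
for every predictor j, where $\mathbf{x}_j$ denotes the j-th (standardized) predictor column.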
In [10]:
lam = 0.3 # This is the penalty for lasso
al = 0.9 # This is the EN penalty parameter
In [11]:
# step 1 and 2: center and standardize
X_cen = (Xtrain-np.mean(Xtrain,axis=0))
s = np.sqrt(np.diag((X_cen.T@X_cen))/len(X_cen))
X0 = np.divide(X_cen,s)
y0 = ytrain - np.mean(ytrain) #centered
In [13]:
# step 3: compute solutions, use last coefficients beta star as optimal solution
lasso_coeffs = ccden(y0, X0, np.zeros(len(X0[0])), lam)
en_coeffs = ccden(y0, X0, np.zeros(len(X0[0])), lam,alpha=al)
my_beta_las_star = lasso_coeffs[-1]
my_beta_en_star = en_coeffs[-1]
In [14]:
def subg_cond(y, X, b, alpha, lam, delta):
    """
    compute the subgradient optimality condition equation
    """
    #values to compare
    vals = X.T@(y - X@b)/len(X) - lam*(1-alpha)*b
    #conditions to compare with
    conditions = np.array([alpha*lam*np.sign(b_j) if b_j else alpha*lam for b_j in b])
    #check that vals are close enough to conditions for b_j != 0 and smaller than
    #the conditions for b_j == 0
    return np.all(np.isclose(vals[b != 0], conditions[b != 0], atol=delta)) and \
           np.all(np.abs(vals[b == 0]) <= conditions[b == 0])
In [15]:
# step 5: subgradient optimality condition
print(subg_cond(y0,X0,my_beta_las_star,1,lam,1e-4))
print(subg_cond(y0,X0,my_beta_en_star,al,lam,1e-4))
True
True
Note: You can verify that your code works by checking that it returns the same values as the following code:
In [16]:
# y0 and X0 are here the centered / standardized data from 2b)
regLasso0 = LM.Lasso(fit_intercept=False,alpha=lam).fit(X0, y0)
regEN0 = LM.ElasticNet(fit_intercept=False,alpha=lam,l1_ratio=al).fit(X0, y0)
beta_las_star = regLasso0.coef_
beta_en_star = regEN0.coef_
pd.DataFrame(data=(beta_las_star, beta_en_star))
Out[16]:
(dataframe with the two coefficient vectors beta_las_star and beta_en_star)
2 c)
Then transform the obtained estimates to the original scale and compute the intercept, i.e., apply steps 4 and 5 of Algorithm 5.1. Report the obtained values of the regression coefficients β̂(λ, α) and intercept β̂₀(λ, α) when (λ, α) = (0.3, 1) and (λ, α) = (0.3, 0.9). Note that the former (α = 1) yields the lasso solution for the original training data y and X. Compare the found lasso and EN solutions with the LSE solution you computed in question 1d).
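For reference, the back-transformation implemented by scale_and_intercept below (presumably steps 4 and 5 of Algorithm 5.1) is, with $s_j$ the training-sample standard deviation of predictor j computed in 2b) and $\hat\beta^{(0)}(\lambda,\alpha)$ the solution on the standardized data,
$$\hat\beta_j(\lambda, \alpha) = \frac{\hat\beta_j^{(0)}(\lambda, \alpha)}{s_j}, \qquad \hat\beta_0(\lambda, \alpha) = \bar y - \bar{\mathbf{x}}^{\top}\hat\beta(\lambda, \alpha),$$
where $\bar y$ and $\bar{\mathbf{x}}$ are the training means of the response and of the original (unstandardized) predictors.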
In [17]:
def scale_and_intercept(coeffs, s, y, X):
    #transform the coefficients back to the original scale of the predictors
    coeffs = np.divide(coeffs, s)
    #intercept from the training means of the response and the original predictors
    interc = np.mean(y) - np.dot(np.mean(X, axis=0), coeffs)
    return np.insert(coeffs, 0, interc)
las_coeffs_interc = scale_and_intercept(my_beta_las_star,s,ytrain,Xtrain)
en_coeffs_interc = scale_and_intercept(my_beta_en_star,s,ytrain,Xtrain)
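#NOTE: the print statements that produced the comparison below were cut from this
#printout; a minimal sketch of how they could look, reusing LS_coeffs and pred_err
#from 1d)/1e):
names = ['intercept'] + predictors
print("LS coefficient values are", dict(zip(names, np.round(LS_coeffs, 2))))
print("Lasso coefficient values are", dict(zip(names, np.round(las_coeffs_interc, 2))))
print("EN coefficient values are", dict(zip(names, np.round(en_coeffs_interc, 2))))
print("LS prediction error:", pred_err(LS_coeffs))
print("lasso prediction error:", pred_err(las_coeffs_interc))
print("en prediction error:", pred_err(en_coeffs_interc))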
LS coefficient values are {'intercept': 2.46, 'lcavol': 0.68, 'lweight': 0.26, 'age': -0.14, 'lbph': 0.21, 'svi': 0.3, 'lcp': -0.29, 'gleason': -0.02, 'pgg45': 0.27}
Lasso coefficient values are {'intercept': 2.47, 'lcavol': 0.5, 'lweight': 0.11, 'age': 0.0, 'lbph': 0.0, 'svi': 0.04, 'lcp': 0.0, 'gleason': 0.0, 'pgg45': 0.0}
EN coefficient values are {'intercept': 2.47, 'lcavol': 0.49, 'lweight': 0.13, 'age': 0.0, 'lbph': 0.0, 'svi': 0.07, 'lcp': 0.0, 'gleason': 0.0, 'pgg45': 0.0}
LS prediction error: 0.5212740055076002
lasso prediction error: 0.5387062169510012
en prediction error: 0.5195710179001547
3a)
Implement FISTA for the lasso problem (see Lecture notes by Prof. Vorobyov, pp. 33) by yourself by writing a function named fistalasso.
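For reference, the iteration implemented below is the standard FISTA recursion for the lasso objective $\tfrac{1}{2}\lVert Ax - y\rVert_2^2 + \tau \lVert x \rVert_1$ with step size 1/L (here $S_t$ is the soft-thresholding operator from 2a):
$$x_k = S_{\tau/L}\!\left(z_k - \tfrac{1}{L}A^{\top}(Az_k - y)\right), \qquad t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2}, \qquad z_{k+1} = x_k + \frac{t_k - 1}{t_{k+1}}\,(x_k - x_{k-1}),$$
with $z_1 = x_0$ and $t_1 = 1$.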
In [18]:
def fistalasso(A, y, L, tau, delta=1e-4, max_iter=10000):
    #initialize starter values
    xs = [np.random.rand(len(A[0]))]
    ts = [1]
    step_size = 1/L
    soft_thresh = lambda x: np.sign(x) * np.maximum(0, abs(x) - tau*step_size)
    for i in range(max_iter):
        if i and np.linalg.norm(xs[i] - xs[i-1])/np.linalg.norm(xs[i]) < delta:
            #converged
            break
        next_t = 0.5*(1 + np.sqrt(1 + 4*ts[i]**2))
        ts.append(next_t)
        #momentum step; z_k equals xs[i] for i <= 1 (zero momentum at the start)
        z_k = xs[i] + (ts[i-1] - 1)*(xs[i] - xs[i-1])/ts[i] if i else xs[i]
        #gradient step on the smooth part followed by soft-thresholding
        next_x = soft_thresh(z_k - step_size*A.T@(A@z_k - y))
        xs.append(next_x)
    return xs
3 b)
Use the same data set as in question 2b). Show the convergence graphs (as in Assignment 1) for both FISTA and
CCD and compare. Use the value beta_las_star (obtained in the Note of question 2b) as the true optimum β∗ of the lasso objective function.
In [109]:
# compute the values
u,d,vt = np.linalg.svd(X0)
#use alpha = 1/N, tau = L*mu
fistalasso_coeffs = fistalasso(X0,y0,len(X0),min(d)*max(d))
fistalasso_coeffs_interc = scale_and_intercept(fistalasso_coeffs[-1],s,ytrain,Xtrain)
print("{} prediction error: {} ".format('Fista-Lasso',pred_err(fistalasso_coeffs_interc))
)
In [98]:
# show convergence graph
def plot_converg(res, A, b, text):
    inner_pred_err = lambda x: np.mean(np.square(A@x - b))
    errors = [inner_pred_err(x) for x in res]
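    # NOTE: the rest of this cell is cut off in the printout. A possible completion
    # (a sketch, not the original code): plot, on a log scale, the gap between the
    # data-fit MSE along the iterates and its value at the reference optimum
    # beta_las_star from 2b).
    ref = np.mean(np.square(A@beta_las_star - b))
    plt.semilogy(np.abs(np.array(errors) - ref))
    plt.xlabel("iteration")
    plt.ylabel("|MSE(x_k) - MSE(beta*)|")
    plt.title(text)
    plt.show()

# convergence graphs for CCD and FISTA (the calls are also missing from the
# printout; these are the presumable ones)
plot_converg(lasso_coeffs, X0, y0, "CCD-Lasso convergence")
plot_converg(fistalasso_coeffs, X0, y0, "FISTA-Lasso convergence")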
4 a)
Implement Alternating Direction Method of Multipliers (ADMM) for the lasso problem (see Lecture notes by Prof. Vorobyov, pp. 34) by yourself by writing a function named admmlasso. Use ρ = 1 in the ADMM algorithm.
In [61]:
def admmlasso(A,y,rho,tau,delta=1e-4, max_iter=10000):
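    # NOTE: the body of this function is cut off in the printout. The lines below are
    # a sketch of the standard ADMM updates for the lasso problem
    # min_x (1/2)||Ax - y||^2 + tau*||x||_1, written to match the signature above and
    # the way admmlasso is called in 4b): x-update via a cached Cholesky factor of
    # (A^T A + rho*I), soft-thresholding z-update, and dual ascent step.
    n = A.shape[1]
    Aty = A.T @ y
    # cache the Cholesky factor so every x-update is just two triangular solves
    L_chol = np.linalg.cholesky(A.T @ A + rho*np.eye(n))
    soft_thresh = lambda v, k: np.sign(v) * np.maximum(0, np.abs(v) - k)
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)
    zs = [z]
    for i in range(max_iter):
        # x-update: minimize the quadratic part with z - u held fixed
        x = np.linalg.solve(L_chol.T, np.linalg.solve(L_chol, Aty + rho*(z - u)))
        # z-update: proximal (soft-thresholding) step of the l1 penalty
        z = soft_thresh(x + u, tau/rho)
        # dual variable update
        u = u + x - z
        zs.append(z)
        # relative-change stopping rule, as in ccden and fistalasso
        if np.linalg.norm(zs[i+1] - zs[i]) / max(np.linalg.norm(zs[i+1]), 1e-12) < delta:
            break
    # return the sparse iterates z (zs[-1] is used as the lasso estimate in 4b)
    return zs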
4b)
Show the convergence graph of the ADMM implementation and compare it to those of FISTA and CCD from the previous problems.
In [88]:
# compute the values
#used tau = L*mu/N
admmlasso_coeffs = admmlasso(X0,y0,1,min(d)*max(d)/len(X0))
In [99]:
# show convergence graph
#convergence is sublinear
plot_converg(admmlasso_coeffs,X0,y0,"ADMM-Lasso convergence")
In [107]:
admmlasso_coeffs_interc = scale_and_intercept(admmlasso_coeffs[-1],s,ytrain,Xtrain)
print("{} prediction error: {} ".format('ADMM-Lasso',pred_err(admmlasso_coeffs_interc)))