Homework 2
Theoretical Exercises
Q1 (Converting ⋆-values to p-values)
Let X be a random variable (on R) which has a density p(x). Assume that p(x) > 0 for every x ∈ R. Let F_1(s) = P(X ≤ s) and F_2(s) = P(X ≥ s). Show that F_1(X) ∼ Unif([0, 1]) and F_2(X) ∼ Unif([0, 1]). (Please avoid confusing notation such as P(X ≤ X).)
Q2 (Deriving the log-likelihood function of logistic regression with labels Y ∈ {−1, 1})
Let (x_i, y_i)_{i∈[n]} ∼_iid (X, Y) ∈ R^d × {−1, 1} (this is different from the {0, 1} labels used in class), where P_β(Y = 1 | X) = exp(⟨X, β⟩)/(1 + exp(⟨X, β⟩)) and P_β(Y = −1 | X) = 1 − P_β(Y = 1 | X). Let the likelihood function of the dataset be L_n(β). Write down and simplify log L_n(β), and calculate expressions for the gradient ∇_β[log L_n(β)] and the Hessian ∇²_β[log L_n(β)].
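This is a pen-and-paper derivation, but a small numerical harness can help you sanity-check the gradient you obtain. The sketch below (synthetic data, illustrative variable names) evaluates log L_n(β) directly from the definition above and lets you compare your closed-form gradient against central finite differences.

import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.standard_normal((n, d))
beta_star = rng.standard_normal(d)
p1 = np.exp(X @ beta_star) / (1.0 + np.exp(X @ beta_star))
y = np.where(rng.random(n) < p1, 1, -1)      # labels in {-1, 1}

def log_Ln(beta):
    # log L_n(beta), written directly from the definition of P_beta(Y | X) above
    p = np.exp(X @ beta) / (1.0 + np.exp(X @ beta))       # P_beta(Y = 1 | X = x_i)
    return np.sum(np.log(np.where(y == 1, p, 1.0 - p)))

def finite_diff_grad(f, beta, eps=1e-6):
    # central finite differences, one coordinate at a time
    g = np.zeros_like(beta)
    for j in range(beta.size):
        e = np.zeros_like(beta)
        e[j] = eps
        g[j] = (f(beta + e) - f(beta - e)) / (2.0 * eps)
    return g

beta0 = rng.standard_normal(d)
print(finite_diff_grad(log_Ln, beta0))    # compare with your closed-form gradient at beta0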
Q3 (Projection matrix 1)
Let P_1, P_2 ∈ R^{n×n} be two projection matrices (so that P_i⊤ = P_i, P_i² = P_i) with P_1P_2 = 0. Let the rank of P_i be r_i (so that we must have r_1 + r_2 ≤ n). Let D_1 = diag(1_{r_1}, 0_{n−r_1}), D_2 = diag(0_{r_1}, 1_{r_2}, 0_{n−r_1−r_2}) ∈ R^{n×n} be two diagonal matrices with diagonal elements in {0, 1}. We would like to show that there exists an orthogonal matrix U ∈ R^{n×n} such that P_1 = UD_1U⊤ and P_2 = UD_2U⊤ (i.e., P_1 and P_2 are simultaneously diagonalizable). To show this, one can proceed in the following way.
1. Show that there exist orthogonal matrices V_1, V_2 ∈ R^{n×n} such that P_1 = V_1D_1V_1⊤ and P_2 = V_2D_2V_2⊤ (hint: use the property that the eigenvalues of projection matrices are either 0 or 1).
2. Let U_1 ∈ R^{n×r_1} be the submatrix of V_1 ∈ R^{n×n} obtained by selecting the first r_1 columns of V_1. Let U_2 ∈ R^{n×r_2} be the submatrix of V_2 ∈ R^{n×n} obtained by selecting columns r_1 + 1 through r_1 + r_2 of V_2. Show that P_1 = U_1U_1⊤ and P_2 = U_2U_2⊤.
3. Show that U_1⊤U_2 = 0_{r_1×r_2} (use the properties that P_1P_2 = 0 and U_i⊤U_i = I_{r_i}).
4. Show that there exists U_3 ∈ R^{n×(n−r_1−r_2)} such that, if we define U = [U_1, U_2, U_3] ∈ R^{n×n}, then U is an orthogonal matrix.
5. Show that P_1 = UD_1U⊤ and P_2 = UD_2U⊤. Then this U is the matrix we would like to find.
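This is a proof exercise, but a quick numerical check can make the construction concrete. The sketch below builds two such projection matrices (from orthonormal column blocks of a random orthogonal matrix, so that P_1P_2 = 0 holds by design) and then walks through steps 1-5 numerically; the dimensions are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
n, r1, r2 = 8, 2, 3

# Two projection matrices with P1 @ P2 = 0, built from orthonormal blocks.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
P1 = Q[:, :r1] @ Q[:, :r1].T
P2 = Q[:, r1:r1 + r2] @ Q[:, r1:r1 + r2].T

# Steps 1-2: eigenvectors with eigenvalue 1 give U_1 and U_2.
w1, V1 = np.linalg.eigh(P1)
w2, V2 = np.linalg.eigh(P2)
U1 = V1[:, w1 > 0.5]                        # n x r1
U2 = V2[:, w2 > 0.5]                        # n x r2

# Step 3: U_1^T U_2 = 0 because P1 @ P2 = 0.
print(np.allclose(U1.T @ U2, 0.0))

# Step 4: complete [U_1, U_2] to an orthogonal U via an orthonormal basis of
# the orthogonal complement (left singular vectors beyond the column span).
B = np.hstack([U1, U2])
U3 = np.linalg.svd(B)[0][:, r1 + r2:]       # n x (n - r1 - r2)
U = np.hstack([U1, U2, U3])
print(np.allclose(U.T @ U, np.eye(n)))

# Step 5: U D_1 U^T and U D_2 U^T recover P_1 and P_2.
D1 = np.diag(np.concatenate([np.ones(r1), np.zeros(n - r1)]))
D2 = np.diag(np.concatenate([np.zeros(r1), np.ones(r2), np.zeros(n - r1 - r2)]))
print(np.allclose(U @ D1 @ U.T, P1), np.allclose(U @ D2 @ U.T, P2))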
Q4 (Projection matrix 2)
Let X ∈ R^{n×d} with n ≥ d and assume that X has full column rank. Define P = X(X⊤X)⁻¹X⊤ ∈ R^{n×n}. Let T ⊆ {1, 2, . . . , d} with |T| = t and let X_T ∈ R^{n×t} be the submatrix of X obtained by selecting the columns with indices in T. Let P_T = X_T(X_T⊤X_T)⁻¹X_T⊤ ∈ R^{n×n}. Define P_1 = I_n − P and P_2 = P − P_T. We would like to show that P_1 and P_2 are projection matrices, rank(P_1) = n − d, rank(P_2) = d − t, and P_1P_2 = 0. To show this, one can proceed in the following way.
1. Show that P⊤ = P, P_T⊤ = P_T, P² = P, and P_T² = P_T, so that P and P_T are projection matrices.
2. Show that P_1⊤ = P_1 and P_1² = P_1, so that P_1 is a projection matrix with rank n − d.
3. Suppose now that ε ∼ N(0, σ²I_n). Show that:
(a) ε⊤P_1ε ∼ σ²χ²(n − d) and ε⊤P_2ε ∼ σ²χ²(d − d_0).
(b) ε⊤P_1ε is independent of ε⊤P_2ε (hint: if ε̄ ∼ N(0, I_n), then the coordinates of ε̄ are independent).
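Purely optional, but the rank and orthogonality claims are easy to check numerically before proving them. The sketch below uses a random design X (full column rank with probability one) and takes T to be the first t column indices.

import numpy as np

rng = np.random.default_rng(0)
n, d, t = 50, 6, 2

X = rng.standard_normal((n, d))
X_T = X[:, :t]                        # columns with indices in T = {1, ..., t}

P = X @ np.linalg.inv(X.T @ X) @ X.T
P_T = X_T @ np.linalg.inv(X_T.T @ X_T) @ X_T.T
P1 = np.eye(n) - P
P2 = P - P_T

# Projection properties, ranks, and orthogonality from the statement.
print(np.allclose(P1, P1.T), np.allclose(P1 @ P1, P1))
print(np.allclose(P2, P2.T), np.allclose(P2 @ P2, P2))
print(np.linalg.matrix_rank(P1), n - d)   # these should agree
print(np.linalg.matrix_rank(P2), d - t)   # these should agree
print(np.allclose(P1 @ P2, 0.0))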
Computational Exercise: Two algorithms to find the least squares solution
Recall that the least squares solution β̂ = arg min_b ∥Xb − y∥₂² is given by β̂ = (X⊤X)⁻¹X⊤y. Generally, the most numerically taxing component of computing β̂ is the inverse (X⊤X)⁻¹. This inverse may not exist (if X is not full column rank), may be numerically unstable (e.g. if X is "poorly conditioned"), or may be slow to compute (if p is large). In this exercise, we will walk through two alternative algorithms that can be used to solve the least squares problem in these situations.
Q1
In this problem, we will use the Boston dataset, which we used during the lab, to test the algorithms. In Python the dataset can be loaded using the sklearn library. Please install sklearn by running "pip install scikit-learn", and then load the dataset and standardize the features using the following code:
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import StandardScaler

boston = fetch_openml(name="Boston", version=1, as_frame=True)
X_raw = boston.data
y_raw = boston.target

scaler = StandardScaler()
scaler.set_output(transform='pandas')
X_raw = scaler.fit_transform(X_raw)
After running this code, you will get a dataframe X_raw with 506 rows and 13 columns, and y_raw with 506 rows and 1 column.
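Before working through the two algorithms below, it can help to compute a reference solution with numpy's built-in least squares solver and keep it around for comparison. This is only a sketch: it uses all 13 standardized columns and no intercept column, so adapt the design matrix to whatever model you actually fit.

import numpy as np

X = np.asarray(X_raw, dtype=float)
y = np.asarray(y_raw, dtype=float)

beta_ref, *_ = np.linalg.lstsq(X, y, rcond=None)   # reference least squares fit
rss_ref = np.sum((y - X @ beta_ref) ** 2)          # residual sum of squares
print(beta_ref)
print(rss_ref)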
Consider a model regressing median home value on six of the available features.
Q2
Recall that the least squares coefficients β solve the normal equations
X⊤Xβ = X⊤y.
Rather than multiplying either side by the inverse of X⊤X, if we plug the QR decomposition X = QR (where Q has orthonormal columns and R is upper triangular) into these equations, we obtain
X⊤Xβ = X⊤y ⇐⇒ R⊤Q⊤QRβ = R⊤Q⊤y ⇐⇒ Rβ = Q⊤y,
where the middle step uses Q⊤Q = I.
Equations of the form Ru = v, where R is a triangular matrix, can be solved for u efficiently using an algorithm called backward substitution. The idea of backward substitution is simple: suppose we had the following system of equations:
2x + y + z = 3 (1)
y − 2z = 1 (2)
2z = 4. (3)
A natural way to solve this would be to first solve for z in eqn (3), then plug this into eqn (2) to find y, and finally plug these both into eqn (1) to find x. This is simple precisely because this system is triangular. The general algorithm is given in Algorithm 1.

Algorithm 1 Backward substitution: find vector u such that Ru = v.
Require: Upper triangular matrix R ∈ R^{p×p}, vector v ∈ R^p.
  Initialize u = 0_p (the all-zeros vector in R^p).
  for i = p, . . . , 1 do
    t ← v_i
    for j > i do
      t ← t − R_{ij} · u_j
    end for
    u_i ← t/R_{ii}
  end for
  return u
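You should try to implement this yourself (see the note below), but for checking your own code, a direct translation of Algorithm 1 into Python (with 0-based indices) might look like the sketch below, verified here on the 3 × 3 system (1)-(3).

import numpy as np

def backward_substitution(R, v):
    # Solve R u = v for an upper triangular R (Algorithm 1 with 0-based indices).
    p = len(v)
    u = np.zeros(p)
    for i in range(p - 1, -1, -1):        # i = p, ..., 1 in the pseudocode
        t = v[i]
        for j in range(i + 1, p):         # all j > i
            t -= R[i, j] * u[j]
        u[i] = t / R[i, i]
    return u

# The system (1)-(3): 2z = 4 gives z = 2, then y = 1 + 2z = 5, then x = (3 - y - z)/2 = -2.
R = np.array([[2.0, 1.0, 1.0],
              [0.0, 1.0, -2.0],
              [0.0, 0.0, 2.0]])
v = np.array([3.0, 1.0, 4.0])
print(backward_substitution(R, v))        # [-2.  5.  2.]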
Use the QR decomposition together with the backward substitution algorithm to find the least squares estimator β̂_qr, compute the RSS ∥y − Xβ̂_qr∥₂², and compare them to the results you obtained in Q1.
Note: the QR decomposition can be done using the built-in function numpy.linalg.qr() in Python. You should attempt to implement the backward substitution algorithm yourself, but if you get stuck you can get some credit if you use a pre-implemented version: in Python, this algorithm is implemented in the function scipy.linalg.solve_triangular(R, v); in R it is implemented as backsolve(R, v).
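Putting the pieces together, the QR route might look roughly like the following sketch (variable names are illustrative, and you can substitute your own backward substitution routine for scipy.linalg.solve_triangular):

import numpy as np
from scipy.linalg import solve_triangular

X = np.asarray(X_raw, dtype=float)    # design matrix from Q1 (add an intercept column if your model uses one)
y = np.asarray(y_raw, dtype=float)

Q, R = np.linalg.qr(X)                # thin QR: Q has orthonormal columns, R is upper triangular
beta_qr = solve_triangular(R, Q.T @ y, lower=False)
rss_qr = np.sum((y - X @ beta_qr) ** 2)
print(beta_qr)
print(rss_qr)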
Q3
Another method, which can be used to solve a wide variety of optimization problems, is the gradient descent algorithm, which finds a minimizer x = arg min_{x′} f(x′) of a differentiable function f. The algorithm starts at a random initialization x_0 and iteratively updates x via
x_{k+1} = x_k − α∇f(x_k), k = 0, 1, 2, . . . ,
until we reach some convergence criterion. Here α > 0 is a step-size parameter which needs to be chosen before running the algorithm. In the context of linear regression, we can use this method to find the solution β̂ by applying it to the function f(β) = ∥Xβ − y∥₂². In particular, for linear regression, the gradient is given by
∇β f(β) = 2X⊤Xβ − 2X⊤y.
The update is repeated until, for example, the norm of the gradient ∥∇β f(β)∥₂ is sufficiently small. The full algorithm is given in Algorithm 2.
Implement and run the gradient descent algorithm to approximately find the least squares coefficients β̂_gd. Also compute the RSS ∥y − Xβ̂_gd∥₂². Compare your answer with the ones you obtained in the previous two parts. (Note: You may have to experiment with several values of the step size α; we recommend starting with a very small value, e.g. α = 0.0000001, and adjusting it up or down while monitoring the decay of the loss function or the size of the gradient.)
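For concreteness, a minimal gradient-descent loop along these lines might look like the sketch below; the iteration cap, tolerance, and initialization are illustrative, and the step size will need tuning as noted above.

import numpy as np

X = np.asarray(X_raw, dtype=float)
y = np.asarray(y_raw, dtype=float)

alpha = 1e-7                  # step size; adjust up or down while monitoring the loss
beta = np.zeros(X.shape[1])   # initialization (a random initialization also works)

for it in range(200000):
    grad = 2 * X.T @ (X @ beta - y)      # gradient of f(beta) = ||X beta - y||_2^2
    if np.linalg.norm(grad) < 1e-6:      # convergence criterion on the gradient norm
        break
    beta = beta - alpha * grad

beta_gd = beta
rss_gd = np.sum((y - X @ beta_gd) ** 2)
print(beta_gd)
print(rss_gd)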