
STAT154 Modern Statistical Prediction and Machine Learning

p-values, F statistics, logistic regression,


Lecturer: Song Mei. GSI: Ruiqi Zhang. Assignment 2 - Due on 09/22/2024

Homework submissions are expected to be in PDF format produced by LaTeX.


For exercises with multiple sub-questions, please also annotate the sub-question id in your solution.
Questions solely for students enrolled in 254 are marked in the question titles. Students enrolled in 154 may
ignore these questions.
For coding exercises, you should use Python. For submission of coding exercises: report the results and
figures produced by the simulations, and also paste the source code.

Theoretical Exercises
Q1 (Converting ⋆-values to p-values)
Let X be a random variable (on R) which has a density p(x). Assume that p(x) > 0 for any x ∈ R. Let
F1(s) = P(X ≤ s) and F2(s) = P(X ≥ s). Show that F1(X) ∼ Unif([0, 1]) and F2(X) ∼ Unif([0, 1]).
(Please avoid using the confusing notation P(X ≤ X).)
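(Optional numerical sanity check, not a proof: for one concrete choice of distribution, one can simulate F1(X) and F2(X) and test them for uniformity. The sketch below assumes X ∼ N(0, 1) and uses SciPy's norm.cdf/norm.sf and a Kolmogorov-Smirnov test; the sample size is arbitrary.)

import numpy as np
from scipy.stats import norm, kstest

# Simulate X ~ N(0, 1) and evaluate F1(X) = P(X <= s) and F2(X) = P(X >= s) at s = X
x = np.random.default_rng(0).normal(size=100_000)
f1 = norm.cdf(x)   # F1(X)
f2 = norm.sf(x)    # F2(X)

# Both samples should be indistinguishable from Unif([0, 1])
print(kstest(f1, "uniform"))
print(kstest(f2, "uniform"))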

Q2 (Deriving the log-likelihood function of logistic regression with label Y ∈ {−1, 1})
Let (xi, yi)i∈[n] ∼iid (X, Y) ∈ R^d × {−1, 1} (this is different from the {0, 1} label in class), where
Pβ(Y = 1|X) = exp(⟨X, β⟩)/(1 + exp(⟨X, β⟩)) and Pβ(Y = −1|X) = 1 − Pβ(Y = 1|X). Let the likelihood
function of the dataset be Ln(β). Please write down and simplify log Ln(β). Calculate the expression of the
gradient ∇β[log Ln(β)] and the Hessian ∇²β[log Ln(β)].
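(Optional sanity check: the likelihood can be evaluated numerically directly from the conditional probabilities stated above, and whatever gradient expression you derive can be compared against a finite-difference approximation. A minimal sketch on synthetic data, assuming only NumPy; the data, beta_true, and helper names are made up for illustration.)

import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
beta_true = np.array([1.0, -2.0, 0.5])
p1 = 1.0 / (1.0 + np.exp(-X @ beta_true))        # P(Y = 1 | X)
y = np.where(rng.uniform(size=n) < p1, 1, -1)    # labels in {-1, 1}

def log_likelihood(beta):
    # log Ln(beta) = sum_i log P_beta(Y = y_i | x_i), using the probabilities given in the problem
    eta = X @ beta
    p_plus = np.exp(eta) / (1.0 + np.exp(eta))
    p = np.where(y == 1, p_plus, 1.0 - p_plus)
    return np.sum(np.log(p))

def numerical_grad(beta, h=1e-6):
    # central finite differences, to compare against the closed-form gradient you derive
    g = np.zeros_like(beta)
    for j in range(len(beta)):
        e = np.zeros_like(beta)
        e[j] = h
        g[j] = (log_likelihood(beta + e) - log_likelihood(beta - e)) / (2 * h)
    return g

beta0 = np.zeros(d)
print(log_likelihood(beta0))
print(numerical_grad(beta0))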

Q3 (Projection matrix 1)
Let P_1, P_2 ∈ R^{n×n} be two projection matrices (so that P_i = P_i^T and P_i^2 = P_i) with P_1 P_2 = 0. Let the rank
of P_i be r_i (so that we must have r_1 + r_2 ≤ n). Let D_1 = diag(1_{r_1}^T, 0_{n−r_1}^T), D_2 = diag(0_{r_1}^T, 1_{r_2}^T, 0_{n−r_1−r_2}^T) ∈
R^{n×n} be two diagonal matrices with diagonal elements in {0, 1}. We would like to show that there exists
an orthogonal matrix U ∈ R^{n×n} such that P_1 = U D_1 U^T and P_2 = U D_2 U^T (i.e., P_1 and P_2 are
simultaneously diagonalizable). To show this, one can proceed in the following way.

1. Show that there exist orthogonal matrices V_1, V_2 ∈ R^{n×n} such that P_1 = V_1 D_1 V_1^T and P_2 = V_2 D_2 V_2^T
(hint: use the property that the eigenvalues of projection matrices are either 0 or 1).
2. Let U_1 ∈ R^{n×r_1} be the submatrix of V_1 ∈ R^{n×n} obtained by selecting the first r_1 columns of V_1. Let U_2 ∈ R^{n×r_2}
be the submatrix of V_2 ∈ R^{n×n} obtained by selecting columns r_1 + 1 through r_1 + r_2 of V_2. Show that P_1 = U_1 U_1^T
and P_2 = U_2 U_2^T.
3. Show that U_1^T U_2 = 0_{r_1×r_2} (use the properties P_1 P_2 = 0 and U_i^T U_i = I_{r_i}).
4. Show that there exists U_3 ∈ R^{n×(n−r_1−r_2)} such that, if we define U = [U_1, U_2, U_3] ∈ R^{n×n}, then U
is an orthogonal matrix.
5. Show that P_1 = U D_1 U^T and P_2 = U D_2 U^T. This U is then the matrix we would like to find.

Q4 (Projection matrix 2)
Let X ∈ R^{n×d} with n ≥ d and assume that X has full column rank. Define P = X(X^T X)^{−1} X^T ∈ R^{n×n}.
Let T ⊆ {1, 2, . . . , d} with |T| = t and let X_T ∈ R^{n×t} be the submatrix of X by selecting columns with
indices in T. Let P_T = X_T(X_T^T X_T)^{−1} X_T^T ∈ R^{n×n}. Define P_1 = I_n − P and P_2 = P − P_T. We would
like to show that P_1 and P_2 are projection matrices, rank(P_1) = n − d, rank(P_2) = d − t, and P_1 P_2 = 0.
To show this, one can proceed in the following way.

1. Show that P = P^T, P_T = P_T^T, P^2 = P, and P_T^2 = P_T, so that P and P_T are projection matrices.

2. Show that P_1 = P_1^T and P_1^2 = P_1, so that P_1 is a projection matrix with rank n − d.

3. Show that P X = X so that P X_T = X_T.

4. Show that P P_T = P_T P = P_T.

5. Show that P_2 = P_2^T and P_2^2 = P_2, so that P_2 is a projection matrix with rank d − t.

6. Show that P_1 P_2 = 0 using the properties above.

Q5 (Showing that the F statistic follows the F distribution)


Let (x_i, y_i)_{i∈[n]} ⊆ R^d × R. Let X = [x_1, . . . , x_n]^T ∈ R^{n×d}, and y = (y_1, . . . , y_n)^T ∈ R^n. Assume that
n ≥ d and X has full column rank. Let S ⊆ {1, 2, . . . , d} and S^c = {1, 2, . . . , d} \ S with |S^c| = d_0. Let
X_{S^c} ∈ R^{n×d_0} be the submatrix of X by selecting columns with indices in S^c.
Assume the null hypothesis to be true, so that y_i = ⟨β_{S^c}, x_{i,S^c}⟩ + ε_i, ε_i ∼iid N(0, σ²) (in matrix
form, we have y = X_{S^c} β_{S^c} + ε).
Define RSS_1 = min_{β′ ∈ R^d} ∥y − Xβ′∥_2^2 and RSS_0 = min_{β′_{S^c} ∈ R^{d_0}} ∥y − X_{S^c} β′_{S^c}∥_2^2. We would like to show
that RSS_0 − RSS_1 ∼ σ² · χ²(d − d_0), RSS_1 ∼ σ² · χ²(n − d), and RSS_0 − RSS_1 is independent of RSS_1 (so
that F = [(RSS_0 − RSS_1)/(d − d_0)]/[RSS_1/(n − d)] follows the F distribution).
To show this, one can proceed in the following way.

1. Show that RSS_1 = ε^T(I_n − X(X^T X)^{−1} X^T)ε and RSS_0 = ε^T(I_n − X_{S^c}(X_{S^c}^T X_{S^c})^{−1} X_{S^c}^T)ε, where
ε = (ε_1, . . . , ε_n)^T ∈ R^n.

2. Define P_1 = I_n − X(X^T X)^{−1} X^T and P_2 = X(X^T X)^{−1} X^T − X_{S^c}(X_{S^c}^T X_{S^c})^{−1} X_{S^c}^T; then RSS_1 =
ε^T P_1 ε and RSS_0 − RSS_1 = ε^T P_2 ε. Use Q4 to show that P_1 and P_2 are projection matrices with ranks
n − d and d − d_0 respectively, and that P_1 P_2 = 0.

3. By Q3, there exists an orthogonal matrix U ∈ R^{n×n} such that P_1 = U D_1 U^T and P_2 = U D_2 U^T,
where D_1 = diag(1_{n−d}^T, 0_d^T), D_2 = diag(0_{n−d}^T, 1_{d−d_0}^T, 0_{d_0}^T) ∈ R^{n×n}. Use this to show that

(a) ε^T P_1 ε ∼ σ² χ²(n − d), ε^T P_2 ε ∼ σ² χ²(d − d_0).
(b) ε^T P_1 ε is independent of ε^T P_2 ε (hint: if ε̄ ∼ N(0, I_n), then the coordinates of ε̄ are independent).

Q6 (Fisher information matrix. Question for 254)


Assume Z ∼ p_θ(z) for θ ∈ R^d, for general d ∈ N. The Fisher information matrix is defined as I(θ) =
−E_{Z∼p_θ}[∇²_θ log p_θ(Z)]. Show that
1. I(θ) = E_{Z∼p_θ}[∇_θ log p_θ(Z) ∇_θ log p_θ(Z)^T].
2. Let p_θ(Z) be the Gaussian distribution N(θ, Σ) for Σ ∈ R^{d×d}; give the expression of I(θ).
3. Let p_λ(Z) = (λ^Z/Z!) e^{−λ} be the Poisson distribution (that is, taking θ = λ ∈ R); give the expression of I(λ).

Computational Exercise: Two algorithms to find the least squares solution
Recall that the least squares solution β̂ = arg min_b ∥Xb − y∥_2^2 is given by β̂ = (X^⊤ X)^{−1} X^⊤ y. Generally,
the most numerically taxing component of computing β̂ is computing the inverse (X^⊤ X)^{−1}. This inverse can
fail to exist (if X is not full column rank), be numerically unstable (e.g. if X is "poorly conditioned"),
or be slow to compute (if p is large). In this exercise, we will walk through two alternative algorithms that
can be used to solve the least squares problem in these situations.
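(As a quick illustration of the "poorly conditioned" case, the condition number of X^⊤X can be inspected directly; the toy design matrix below is made up for illustration and assumes only NumPy.)

import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)   # nearly collinear with x1
X = np.column_stack([np.ones(n), x1, x2])

# A huge condition number means that forming (X^T X)^{-1} explicitly is numerically unreliable
print(np.linalg.cond(X.T @ X))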

Q1
In this problem, we will use the Boston dataset, which we used during the lab, to test the algorithms. In
Python the dataset can be loaded using the sklearn library. Please install sklearn by running "pip install
scikit-learn", and then load the dataset and standardize the features using the following code:

from sklearn.datasets import fetch_openml
boston = fetch_openml(name="Boston", version=1, as_frame=True)
X_raw = boston.data
y_raw = boston.target

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.set_output(transform="pandas")
X_raw = scaler.fit_transform(X_raw)

After running this code, you will get a dataframe "X_raw" with 506 rows and 13 columns, and "y_raw" with
506 rows and 1 column.
Consider a model regressing median home value on six of the available features:

medv = β0 + β1 crim + β2 dis + β3 indus + β4 lstat + β5 tax + β6 rm + ε,

where ε ∼ N(0, σ²). Compute β̂ = (X^⊤ X)^{−1} X^⊤ y, and evaluate the RSS ∥y − X β̂∥_2^2 (remember to append
a column of ones to your data matrix to account for the intercept term).
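(Before applying this to the Boston data, here is a minimal sketch of the computation on synthetic data; the feature matrix and coefficients are made up, and np.linalg.solve is used in place of forming the inverse explicitly.)

import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X_feat = rng.normal(size=(n, p))
y = 1.0 + X_feat @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=n)

X = np.column_stack([np.ones(n), X_feat])      # append a column of ones for the intercept
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # solves (X^T X) beta = X^T y
rss = np.sum((y - X @ beta_hat) ** 2)
print(beta_hat, rss)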

Q2: The QR decomposition and backsubstitution


The QR decomposition of a matrix A ∈ R^{n×p}, where n ≥ p, expresses A = QR, where Q ∈ R^{n×p} has
orthonormal columns (i.e. Q^⊤ Q = I) and R ∈ R^{p×p} is an upper triangular matrix. Furthermore, recall that
setting the gradient ∇_β ∥Xβ − y∥_2^2 = 0 leads to the so-called normal equations:

X ⊤ Xβ = X ⊤ y.

Rather than multiplying either side by the inverse of X ⊤ X, if we plug in the QR decomposition of X, we
obtain

X^⊤ Xβ = X^⊤ y ⇐⇒ R^⊤ Q^⊤ Q Rβ = R^⊤ Q^⊤ y ⇐⇒ Rβ = Q^⊤ y,

where the last step uses Q^⊤ Q = I.

The problem of solving equations of the form Ru = v for u, where R is a triangular matrix, can be done
efficiently using an algorithm called backward substitution. The idea of backward substitution is simple:
suppose we had a system of equations:

2x + y + z = 3 (1)
y − 2z = 1 (2)
2z = 4. (3)

3
Algorithm 1 Backward substitution: find vector u such that Ru = v.
Require: Upper triangular matrix R ∈ Rp×p , vector v ∈ Rp .
Initialize u = 0p (the all zeros vector in Rp ).
for i = p, . . . , 1 do
t ← vi
for j > i do
t ← t − Rij · uj
end for
ui ← t/Rii
end for
return u

A natural way to solve this would be to first solve for z in eqn (3), then plug this into eqn (2) to find y, and
finally plug these both into eqn (1) to find x. This is simple precisely because this system is triangular. The
general algorithm is given in Algorithm 1.
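(For instance, the small triangular system in equations (1)-(3) above can be checked against the pre-implemented routine mentioned in the note below; a quick sketch, assuming SciPy is installed.)

import numpy as np
from scipy.linalg import solve_triangular

# The upper triangular system from equations (1)-(3)
R = np.array([[2.0, 1.0,  1.0],
              [0.0, 1.0, -2.0],
              [0.0, 0.0,  2.0]])
v = np.array([3.0, 1.0, 4.0])

u = solve_triangular(R, v, lower=False)  # backward substitution
print(u)   # expect x = -2, y = 5, z = 2  ->  [-2.  5.  2.]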
Use the QR decomposition together with the backward substitution algorithm to find the least squares
estimator β̂_qr, compute the RSS ∥y − X β̂_qr∥_2^2, and compare them to the results you obtained in Q1.
Note: the QR decomposition can be done using the built-in function numpy.linalg.qr() in Python. You
should attempt to implement the backward substitution algorithm yourself, but if you get stuck you can get
some credit if you use a pre-implemented version: in Python, this algorithm is implemented in the function
scipy.linalg.solve_triangular(R, v); in R, it is implemented as backsolve(R, v).

Q3: Gradient descent

Algorithm 2 Gradient descent for linear regression


Require: Data X, y, step-size α > 0, tolerance for gradient norm ϵ.
Randomly initialize β
while ∥X ⊤ Xβ − X ⊤ y∥2 > ϵ do
β ← β − 2α(X ⊤ Xβ − X ⊤ y)
end while
return β

Another method, which can be used to solve a wide variety of optimization problems, is the gradient descent
algorithm, which finds a minimizer x = arg min_{x′} f(x′) of a differentiable function f. The
algorithm starts at a random initialization x^{(0)}, and iteratively updates x using the iteration

x^{(t+1)} = x^{(t)} − α · ∇_x f(x^{(t)})

until we reach some convergence criterion. Here α > 0 is a step-size parameter which needs to be chosen
before running the algorithm. In the context of linear regression, we can use this method to find the solution
β̂ by applying it to the function f(β) = ∥Xβ − y∥_2^2. In particular, for linear regression, the gradient is given
by

∇_β f(β) = 2X^⊤ Xβ − 2X^⊤ y.

The update is repeated until, for example, the norm of the gradient ∥∇_β f(β)∥_2 is sufficiently small. The full
algorithm is given in Algorithm 2.
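(As an illustration of the update rule, not a solution for the Boston data, here is a minimal sketch of Algorithm 2 on synthetic data; the step size and tolerance are made-up values for this toy problem.)

import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.1, size=n)

alpha, eps = 1e-3, 1e-8          # illustrative step size and gradient-norm tolerance
beta = rng.normal(size=p)        # random initialization
grad = 2 * (X.T @ X @ beta - X.T @ y)
while np.linalg.norm(grad) > eps:
    beta = beta - alpha * grad               # beta <- beta - alpha * gradient
    grad = 2 * (X.T @ X @ beta - X.T @ y)    # recompute the gradient
print(beta)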
Implement and run the gradient descent algorithm to approximately find the least squares coefficients β̂_gd.
Also compute the RSS ∥y − X β̂_gd∥_2^2, and compare your answer with the ones you obtained in the previous two
parts. (Note: you may have to experiment with several values of the step size α, though we recommend starting
with a very small value, e.g. α = 0.0000001, and adjusting it up or down by monitoring the decay of the loss
function or the size of the gradient.)
