Worksheet For Quiz
“Spectral Clustering, PCA, and Linear Regression”
PRML – CS5691 (Jul–Nov 2023)
October 27, 2023
1. (5 marks) Consider the undirected, weighted graph given below. (a) Write down the Laplacian matrix L for this graph; (b) write down the largest set of orthonormal eigenvectors of this graph Laplacian with eigenvalue 0; and (c) redo part (b) for a graph where the edge (3,4) is removed.
Solution: (a)
$$L = \begin{bmatrix}
100 & -100 & 0 & 0 & 0 & 0 \\
-100 & 150 & -50 & 0 & 0 & 0 \\
0 & -50 & 150 & -100 & 0 & 0 \\
0 & 0 & -100 & 110 & -10 & 0 \\
0 & 0 & 0 & -10 & 110 & -100 \\
0 & 0 & 0 & 0 & -100 & 100
\end{bmatrix}$$
(b) The set comprising a single vector, the all-ones vector divided by $\sqrt{6}$, is the solution. Eigenvalue 0 has multiplicity 1 (the number of connected components in the graph), so there is no orthonormal set with more than one eigenvector for eigenvalue 0.
(c) Eigenvalue 0 now has multiplicity 2. The two corresponding eigenvectors are the indicator vectors of the two components in the modified graph (each normalized by the square root of the corresponding component's size), i.e., the vectors $\frac{1}{\sqrt{3}}[1\ 1\ 1\ 0\ 0\ 0]^T$ and $\frac{1}{\sqrt{3}}[0\ 0\ 0\ 1\ 1\ 1]^T$.
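For a quick numerical check, here is a minimal sketch assuming numpy (edge weights are read off the Laplacian above; nodes are 0-indexed and variable names are illustrative):

import numpy as np

# Weighted adjacency matrix; edge weights read off the Laplacian above.
W = np.zeros((6, 6))
for (i, j), w in {(0, 1): 100, (1, 2): 50, (2, 3): 100, (3, 4): 10, (4, 5): 100}.items():
    W[i, j] = W[j, i] = w

L = np.diag(W.sum(axis=1)) - W          # Laplacian L = D - W
vals, vecs = np.linalg.eigh(L)
print(np.round(vals, 6))                # exactly one zero eigenvalue (connected graph)
print(vecs[:, 0])                       # proportional to the all-ones vector / sqrt(6)

W[2, 3] = W[3, 2] = 0                   # part (c): remove edge (3,4)
L2 = np.diag(W.sum(axis=1)) - W
vals2, vecs2 = np.linalg.eigh(L2)
print(np.round(vals2, 6))               # eigenvalue 0 now has multiplicity 2
# vecs2[:, :2] spans the same space as the two normalized component indicator vectors.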
2. (5 marks) Consider the following data matrix, representing four sample points $x_n \in \mathbb{R}^2$:
$$X = \begin{bmatrix} 4 & 1 \\ 2 & 3 \\ 5 & 4 \\ 1 & 0 \end{bmatrix}$$
Use principal component analysis (PCA) to represent the above data in only one direction. Report PC1 of the dataset, the PC1-based representation of the last datapoint $x_4 = [1\ 0]^T$, and the reconstruction error of $x_4$.
Solution:
1. Mean: $\mu = [3\ 2]^T$.
2. Center the data: $X'$ has rows $x_n - \mu$.
3. Covariance (scatter) matrix: $(X')^T X' = \begin{bmatrix} 10 & 6 \\ 6 & 10 \end{bmatrix}$.
4. Eigenvalues: $\lambda = 16, 4$.
5. The eigenvector corresponding to the largest eigenvalue (16) is $PC1 = [\tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{2}}]^T$.
Project $x'_4 = [-2\ {-2}]^T$ onto this PC1, and add the mean vector back to find the PC1-based reconstruction/representation of $x_4$. Also calculate the resulting reconstruction error of this datapoint.
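A minimal sketch of these steps, assuming numpy (the unnormalized scatter matrix is used; it has the same eigenvectors as the covariance matrix):

import numpy as np

X = np.array([[4, 1], [2, 3], [5, 4], [1, 0]], dtype=float)
mu = X.mean(axis=0)                     # [3, 2]
Xc = X - mu                             # centered data

S = Xc.T @ Xc                           # [[10, 6], [6, 10]]
vals, vecs = np.linalg.eigh(S)          # eigenvalues 4, 16 (ascending)
pc1 = vecs[:, -1]                       # [1/sqrt(2), 1/sqrt(2)] up to sign

x4c = Xc[-1]                            # [-2, -2]
x4_recon = mu + (x4c @ pc1) * pc1       # PC1-based representation of x_4
err = np.linalg.norm(X[-1] - x4_recon)  # here 0, since x_4 - mu lies along PC1
print(pc1, x4_recon, err)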
3. (6 marks) Consider the dataset of three datapoints below, where we would like to predict y using x:
$$X = \begin{bmatrix} 1 & 2 \\ 1 & 4 \\ 1 & 6 \end{bmatrix}, \qquad Y = \begin{bmatrix} 3 \\ 6 \\ 7 \end{bmatrix}$$
(a) Perform least squares regression and report the resulting regression coefficients $w_{LS}$.
(b) Perform regularised least squares regression and find the optimal regression coefficients $w_{RLS}$, where the weight of the penalty term $\lambda$ is assumed to be 1.
(c) Assume a maximum likelihood approach for linear regression. Under this setting, for a new datapoint $x_{new} = [1\ 1]^T$:
(i) What is the predicted $y_{new}$ value? Show your calculation.
(ii) What is the uncertainty around $y_{new}$ (in terms of the estimated variance of $y_{new} \mid x_{new}$)?
(iii) Can you use this variance to not just report a single predicted value for $y_{new}$, but an interval that has 95% probability of containing the true $y_{new}$ value? If so, specify this interval.
(Note: Assume that the regression coefficients or its distribution estimated from the data
are correct, and use the fact that approximately 95% of the values sampled from a normal
distribution lie within two standard deviations from the mean.)
(d) Answer the three sub-parts of part (c) above when the Bayesian linear regression approach is used instead of the MLE approach. Assume the parameters $\alpha$ (precision of the Gaussian prior on each $w_i$) and $\beta$ (precision of the Gaussian $y \mid x$) to each be 1.
Solution:
(a) $X^T X = \begin{bmatrix} 3 & 12 \\ 12 & 56 \end{bmatrix}$, so $(X^T X)^{-1} = \frac{1}{24}\begin{bmatrix} 56 & -12 \\ -12 & 3 \end{bmatrix}$, and
$w_{LS} = (X^T X)^{-1} X^T Y = \frac{1}{24}\begin{bmatrix} 32 \\ 24 \end{bmatrix} = \begin{bmatrix} 4/3 \\ 1 \end{bmatrix}$.
(b) Repeat the above calculation but with 1 added to each diagonal entry of $X^T X$, i.e., $w_{RLS} = (X^T X + I)^{-1} X^T Y$.
(c) (i) Since $w_{LS} = w_{ML}$, we have $y_{new} = x_{new}^T w_{LS} = [1\ 1]\, w_{LS} = (32 + 24)/24 = 7/3$.
(ii) Use the formula for $1/\hat\beta_{ML}$ from the slides, which is simply the average of the squared residuals of the three datapoints, to calculate this variance.
(iii) Yes, the interval is $[\hat\mu - 2\hat\sigma,\ \hat\mu + 2\hat\sigma]$, where $\hat\mu$ is the predicted $y_{new}$ from sub-part (i) and $\hat\sigma$ is $\sqrt{1/\hat\beta_{ML}}$ from sub-part (ii) above. Substitute these values from the above parts and simplify the interval.
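A minimal sketch of parts (a)–(c), assuming numpy (variable names are illustrative; the interval uses the two-standard-deviation rule stated in the question):

import numpy as np

X = np.array([[1., 2.], [1., 4.], [1., 6.]])
Y = np.array([3., 6., 7.])
x_new = np.array([1., 1.])

# (a) Ordinary least squares: w_LS = (X^T X)^{-1} X^T Y = [4/3, 1].
w_ls = np.linalg.solve(X.T @ X, X.T @ Y)

# (b) Regularised least squares (lambda = 1): add 1 to each diagonal entry of X^T X.
w_rls = np.linalg.solve(X.T @ X + np.eye(2), X.T @ Y)

# (c) ML prediction, noise variance (mean squared residual), and 95% interval.
y_ml = x_new @ w_ls                           # 56/24 = 7/3
var_ml = np.mean((X @ w_ls - Y) ** 2)         # 1 / beta_hat_ML
interval = (y_ml - 2 * np.sqrt(var_ml), y_ml + 2 * np.sqrt(var_ml))
print(w_ls, w_rls, y_ml, var_ml, interval)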
(d) The solution for Bayesian linear regression is similar, with the main difference being that the mean is calculated using $w_{RLS}$ instead of $w_{LS}$, and that the variance of the predictive distribution is a function of $x_{new}$ instead of being the same for all $x$.
(i) Since $w_{RLS} = w_{MAP} = w_{\text{mean-of-posterior}}$, we have $y_{new} = x_{new}^T w_{RLS} = \ldots$ (complete the calculation).
(ii) Refer to the slides for the formulas involving the parameters $m_N, S_N$ of the posterior of $w$ and the results for the posterior predictive distribution. Verify that those formulas applied to the current problem yield $\mathrm{var}(y_{new} \mid x_{new}) = \frac{1}{\beta} + x_{new}^T S_N x_{new} = 1 + x_{new}^T (I + X^T X)^{-1} x_{new}$. Complete the calculation.
(iii) Yes, the interval is $[\hat\mu - 2\hat\sigma,\ \hat\mu + 2\hat\sigma]$, where $\hat\mu$ is the predicted $y_{new}$ from sub-part (i) and $\hat\sigma$ is the square root of the variance from sub-part (ii) above. Substitute these values from the above parts and simplify the interval.
Check how numerically different the Bayesian and MLE-based linear regression answers above are.
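A corresponding sketch for part (d), again assuming numpy (with $\alpha = \beta = 1$ the posterior mean $m_N$ coincides with $w_{RLS}$ above):

import numpy as np

X = np.array([[1., 2.], [1., 4.], [1., 6.]])
Y = np.array([3., 6., 7.])
x_new = np.array([1., 1.])
alpha, beta = 1.0, 1.0

S_N = np.linalg.inv(alpha * np.eye(2) + beta * X.T @ X)   # posterior covariance of w
m_N = beta * S_N @ X.T @ Y                                # posterior mean (= w_RLS here)

y_bayes = x_new @ m_N                                     # predictive mean
var_bayes = 1.0 / beta + x_new @ S_N @ x_new              # predictive variance
interval = (y_bayes - 2 * np.sqrt(var_bayes), y_bayes + 2 * np.sqrt(var_bayes))
print(m_N, y_bayes, var_bayes, interval)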
4. (2 marks) Prove that a Laplacian matrix of an undirected simple graph is positive semi-definite.
Solution: We have already seen this in class using the adjacency matrix representation A of the graph.
Another way to prove this is using the incidence matrix representation of a graph. If B is the (vertex-by-edge) incidence matrix of an orientation of the edges of G (each edge (i, j) of the undirected graph is represented in only one of the two directions, either (i, j) or (j, i), but not both), then we can show $L = BB^T$. So $x^T L x = \|B^T x\|^2 = \sum_{(i,j) \in E} (x_i - x_j)^2 \geq 0$ for all $x$.
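A small numerical illustration of the incidence-matrix argument, assuming numpy (the example graph and orientation are arbitrary):

import numpy as np

# Path graph 1-2-3-4 with an arbitrary orientation of each edge.
edges = [(0, 1), (1, 2), (2, 3)]
n = 4
B = np.zeros((n, len(edges)))            # vertex-by-edge incidence matrix
for k, (i, j) in enumerate(edges):
    B[i, k], B[j, k] = 1, -1

L = B @ B.T                              # equals the unweighted graph Laplacian D - A
x = np.random.randn(n)
print(np.allclose(x @ L @ x, np.sum((B.T @ x) ** 2)))   # True: x^T L x = ||B^T x||^2 >= 0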
Some students have asked for an additional reference on the Laplacian. One such reference is here:
https://www.cs.yale.edu/homes/spielman/561/2009/lect02-09.pdf
5. (4 marks) We would like to reduce the dimensionality of a set of data points $D_N = \{x_n\}_{n=1}^N$ using PCA. Let $u_i$ denote the $i$-th PC of the dataset, and $\bar{x}$ the average of all the datapoints in $D_N$. Let each $x_n \in \mathbb{R}^3$. Now choose all formula(s) from below that will correctly compute the PC1-based reconstruction $\tilde{x}_n$ of $x_n$, and justify your answer.
Solution: F1 and F2. Both give the same answer, because we have proved the more general statement below in class; letting D = 3, M = 1 in this proof shows that F1 and F2 above are the same.
If $x_n \in \mathbb{R}^D$ and $\tilde{x}_n$ is the reconstruction of $x_n$ using only the top-M PCs of the dataset, then, as shown in class:
\begin{align*}
\tilde{x}_n &= \sum_{i=1}^{M} (x_n^T u_i) u_i + \sum_{i=M+1}^{D} (\bar{x}^T u_i) u_i && \text{(reconstruction formula from class slides)} \\
&= \bar{x} - \bar{x} + \sum_{i=1}^{M} (x_n^T u_i) u_i + \sum_{i=M+1}^{D} (\bar{x}^T u_i) u_i && \text{(add and subtract } \bar{x}\text{)} \\
&= \bar{x} - \sum_{i=1}^{D} (\bar{x}^T u_i) u_i + \sum_{i=1}^{M} (x_n^T u_i) u_i + \sum_{i=M+1}^{D} (\bar{x}^T u_i) u_i && \text{(represent } \bar{x} \text{ in terms of the basis } \{u_i\} \text{ of } \mathbb{R}^D\text{)} \\
&= \bar{x} + \sum_{i=1}^{M} \left((x_n - \bar{x})^T u_i\right) u_i
\end{align*}
Geometrically, you can also try to visualize the above proof for D = 2 and M = 1.
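A quick numerical check of the derived identity, assuming numpy (random data in $\mathbb{R}^3$, M = 1, with the PCs taken as eigenvectors of the centered scatter matrix):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                 # N = 20 points in R^3
xbar = X.mean(axis=0)
_, U = np.linalg.eigh((X - xbar).T @ (X - xbar))
U = U[:, ::-1]                               # columns u_1, u_2, u_3 (descending eigenvalue)

M = 1
xn = X[0]
lhs = sum((xn @ U[:, i]) * U[:, i] for i in range(M)) \
    + sum((xbar @ U[:, i]) * U[:, i] for i in range(M, 3))
rhs = xbar + sum(((xn - xbar) @ U[:, i]) * U[:, i] for i in range(M))
print(np.allclose(lhs, rhs))                 # True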
6. (2 marks) Suppose we have a data set with five predictors, X1 = GPA, X2 = IQ, X3 = Gender (1 for
Female and 0 for Male), X4 = Interaction between GPA and IQ, and X5 = Interaction between GPA
and Gender. The response is starting salary after graduation (in thousands of dollars). Suppose we
use least squares to fit the model and get β̂0 = 50, β̂1 = 20, β̂2 = 0.07, β̂3 = 35, β̂4 = 0.01, β̂5 = −10.
For a fixed value of IQ and GPA, females earn more on average than males provided that the GPA
is high enough. Is this statement correct? Justify.
Solution: False.
The least squares fit is ŷ = 50 + 20 (GPA) + 0.07 (IQ) + 35 (Gender) + 0.01 (GPA × IQ) − 10 (GPA × Gender),
which for males (Gender = 0) becomes ŷ = 50 + 20 (GPA) + 0.07 (IQ) + 0.01 (GPA × IQ),
and for females (Gender = 1) becomes ŷ = 85 + 10 (GPA) + 0.07 (IQ) + 0.01 (GPA × IQ).
So the starting salary for males is higher than for females on average if 50 + 20 (GPA) ≥ 85 + 10 (GPA), which is equivalent to GPA ≥ 3.5. Hence, once the GPA is high enough (above 3.5), it is males, not females, who earn more on average, so the statement is false.
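A tiny sketch of this comparison (the IQ value is arbitrary, since the IQ terms cancel in the difference):

def salary(gpa, iq, female):
    # Fitted model from the solution above.
    return 50 + 20*gpa + 0.07*iq + 35*female + 0.01*gpa*iq - 10*gpa*female

for gpa in (3.0, 3.5, 4.0):
    diff = salary(gpa, 110, female=1) - salary(gpa, 110, female=0)
    print(gpa, round(diff, 2))   # 35 - 10*GPA: +5 at 3.0, 0 at 3.5, -5 at 4.0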
7. (4 marks) Consider the dataset D in the table below.
n    x (or $x_n$)    y (or $y_n$)
#1 1 1
#2 2 2
#3 4 3
#4 5 4
#5 6 4
(a) Find the line y = wx + b that minimizes the squared vertical distance of the datapoints to the
line (i.e., the squared errors in y). Specifically, find the w, b that minimizes
$$\sum_{n=1}^{5} \left((w x_n + b) - y_n\right)^2.$$
(b) Find the line y = mx + c that minimizes the squared perpendicular distance (i.e., shortest
distance) of the datapoints to the line. Specifically, find the m, c that minimizes
$$\sum_{n=1}^{5} \frac{\left((m x_n + c) - y_n\right)^2}{m^2 + 1}.$$
(Note: The point-to-line distance formula from geometry gives each summation term above.)
(c) The two minimization problems above are each related to which ML task seen in class?
Solution: You can verify that the linear regression least-squares solution (the $w_{LS}$ formula seen in class) solves part (a), and the PCA $u_1$ (PC1) formula solves part (b); this also answers part (c): the two problems correspond to linear regression and PCA, respectively. Use matrix notation and the relevant matrix-vector formulas from class to obtain a quick solution to both parts. The usual "setting the gradient to zero" approach to these minimization problems will also work, but may take longer.
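A minimal sketch contrasting the two fits, assuming numpy (the perpendicular-distance line is taken through the mean along PC1, consistent with the remark above):

import numpy as np

x = np.array([1., 2., 4., 5., 6.])
y = np.array([1., 2., 3., 4., 4.])

# (a) Vertical squared error: ordinary least squares on [x, 1].
A = np.column_stack([x, np.ones_like(x)])
w, b = np.linalg.lstsq(A, y, rcond=None)[0]

# (b) Perpendicular squared error: line through the mean in the PC1 direction.
P = np.column_stack([x, y])
mu = P.mean(axis=0)
_, vecs = np.linalg.eigh((P - mu).T @ (P - mu))
v = vecs[:, -1]                      # PC1 of the centered 2-D points
m, c = v[1] / v[0], mu[1] - (v[1] / v[0]) * mu[0]

print(w, b)   # regression (vertical-distance) line
print(m, c)   # PCA (perpendicular-distance) line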
End Note: Please also go through Assignment 2 questions and tutorials related to Spectral
clustering, PCA, and Linear Regression.