
College of Engineering and Computer Science

MATH2050. Linear Algebra


Fall 2023

Homework 2

Student name: Tran Le Hai Student ID: V202100435

1 Visualizing data on a line


In this exercise, we examine how to visualize a high-dimensional data set of points xi ∈ Rn , i = 1, . . . , m,
by computing and visualizing a single scalar, or score, associated with each data point. Specifically, the
score associated with a generic data point x ∈ Rn is obtained via the linear formula

f (x) = uT x + v.

Without loss of generality, and in order to compare different scoring mechanisms, we may assume that
the vector u is unit-norm (∥u∥2 = 1) and that the scores are centered, that is,
∑_{i=1}^m f(xi) = 0.

1. Show that the centering requirement implies that v can be expressed as a function of u, which you
will determine. Interpret the resulting scoring mechanism in terms of the centered data points xi − x̄,
i = 1, . . . , m, where
x̄ := (1/m) ∑_{i=1}^m xi
is the center of the data points.
- We have:

∑_{i=1}^m (uᵀxi + v) = 0 =⇒ uᵀ( ∑_{i=1}^m xi ) + mv = 0 =⇒ v = −(1/m) uᵀ ∑_{i=1}^m xi = −uᵀx̄,

with x̄ := (1/m) ∑_{i=1}^m xi the center of the data points. In terms of the centered points xi − x̄, the score of a point measures the component of its deviation from the center along the direction u.
2. Interpret the scoring formula above as a projection on a line, which you will determine in terms of u.

- f (x) = uᵀx − uᵀx̄ = uᵀ(x − x̄): the score is the scalar coordinate of the projection of the centered point x − x̄ onto the line {αu : α ∈ R}.
3. Consider a data set of your choice and try different vectors u (do not forget to normalize them):
• Random vectors
• All ones (normalized)
• Any other choice
Look at the spread of the scores, as measured by their variance. What do you observe? Which vector u
would you choose? Comment.
- We randomly generate a dataset of points xi ∈ R²:


X = [ 0.0                 0.1764052345967664
      0.1111111111111111  0.26223794305894454
      0.2222222222222222  0.5423182428550184
      0.3333333333333333  0.8907559865868124
      0.4444444444444444  1.0756446879038857
      0.5555555555555556  1.0133833231234701
      0.6666666666666666  1.4283421750858922
      0.7777777777777777  1.5404198347257856
      0.8888888888888888  1.7674558925984218
      1.0                 2.0410598501938373 ]
- We compute the scores in code, normalizing each vector u before use:

- Vector u1 (random vector): u1 = [2, 3]


- Vector u2 (all ones vector): u2 = [1, 1]

- Vector u3 (another choice): u3 = [−1, 1]
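The code itself is not reproduced above; the following is a minimal Python sketch (helper names `normalize`, `scores`, `variance` are my own) of how the centered scores f(xi) = uᵀ(xi − x̄) and their variances can be computed for the dataset X. Exact variance values depend on rounding and the variance convention used, so the printed numbers need not match the quoted ones to every digit.

```python
# Sketch: centered scores along a unit direction u, for the 2-D dataset X above.

def normalize(u):
    """Scale u to unit Euclidean norm."""
    n = sum(ui * ui for ui in u) ** 0.5
    return [ui / n for ui in u]

def scores(X, u):
    """Centered scores f(x_i) = u^T (x_i - xbar); they sum to zero."""
    m, d = len(X), len(X[0])
    xbar = [sum(row[k] for row in X) / m for k in range(d)]
    return [sum(u[k] * (row[k] - xbar[k]) for k in range(d)) for row in X]

def variance(s):
    """Variance of an (already centered) score vector."""
    return sum(si * si for si in s) / len(s)

X = [[0.0, 0.1764052345967664], [0.1111111111111111, 0.26223794305894454],
     [0.2222222222222222, 0.5423182428550184], [0.3333333333333333, 0.8907559865868124],
     [0.4444444444444444, 1.0756446879038857], [0.5555555555555556, 1.0133833231234701],
     [0.6666666666666666, 1.4283421750858922], [0.7777777777777777, 1.5404198347257856],
     [0.8888888888888888, 1.7674558925984218], [1.0, 2.0410598501938373]]

for u in ([2.0, 3.0], [1.0, 1.0], [-1.0, 1.0]):
    print(u, variance(scores(X, normalize(u))))
```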

Given three vectors u1 , u2 , and u3 with their corresponding variances:


• Vector u1 with var(u1 ) = 0.45031475625984135 has the highest variance, indicating that the pro-
jection of the data points along this vector results in the greatest spread. This suggests that u1 is
potentially the best direction for visualizing the data on a line because it may preserve the most
information about the data’s variability.
• Vector u2 has a slightly lower variance than u1 , which means it is also a good candidate but might
not capture as much variability as u1 .
• Vector u3 with var(u3 ) = 0.04000633515372719 has the lowest variance, indicating that the projec-
tion along this vector results in the least spread. This suggests that u3 is the least desirable vector
for visualizing the data on a line since it captures the least amount of variability.


2 Clustering
In clustering problems, we are given data points xi ∈ Rn , i = 1, . . . , m. We seek to assign each point to
a cluster of points. The so-called k-means algorithm is one of the most widely used clustering methods.
It is based on choosing a number of clusters, k(< m), and minimizing the average squared Euclidean
distance from the data points to their closest cluster “representative”. The objective function to minimize
is thus
Jclust := min_{c1,...,ck} ∑_{i=1}^m min_{1≤j≤k} ∥xi − cj∥².

Each cj ∈ Rn is the “representative” point for the j-th cluster, denoted Cj . Note that each term inside
the sum assigns a specific point xi to its closest cluster representative, so the problem amounts to
minimizing the sum of those squared distances.
1. Show that the problem can be written as one involving two matrix variables C, U :

min_{C,U} ∑_{i=1}^m ∥xi − ∑_{j=1}^k uij cj∥²   subject to   ∑_{j=1}^k uij = 1, 1 ≤ i ≤ m,

uij ∈ {0, 1}, 1 ≤ i ≤ m, 1 ≤ j ≤ k.

In the above, the n × k matrix C has columns cj , 1 ≤ j ≤ k, the center representatives; you are asked to
explain why the Boolean m × k matrix U with entries uij , 1 ≤ i ≤ m, 1 ≤ j ≤ k, is referred to as an
assignment matrix. Hint: show that, for a given point x ∈ Rn , we have A(x) = B(x), where

A(x) := min_{1≤j≤k} ∥x − cj∥² ,   B(x) := min_{u∈U} F (x, u),

F (x, u) := ∥x − ∑_{j=1}^k uj cj∥² ,   U := { u ∈ {0, 1}^k | uᵀ1k = 1 }.

We define:
- C: an n × k matrix with columns cj , 1 ≤ j ≤ k; each column is the center representative of a cluster.
- U: a Boolean m × k matrix with entries uij ∈ {0, 1}, 1 ≤ i ≤ m, 1 ≤ j ≤ k.

⇒ Jclust = min_{C,U} ∑_{i=1}^m ∥xi − ∑_{j=1}^k uij cj∥²

We have:

U := { u ∈ {0, 1}^k | uᵀ1k = 1 }.

- As uᵀ1k = 1 and u is Boolean, u must be one of the standard unit vectors e1 , . . . , ek of Rk . With u = ej , the sum ∑_{l=1}^k ul cl reduces to cj , so

⇒ B(x) = min_{u∈U} F (x, u) = min_{1≤j≤k} F (x, ej ) = min_{1≤j≤k} ∥x − cj∥² = A(x).

Note: 1k denotes the k-dimensional vector of all ones, and ej the j-th unit vector of Rk .
2. Show that in turn, the above problem is equivalent to finding an (approximate) factorization of the
data matrix X, into a product of two matrices with specific properties. Make sure to express the above
problem in terms of matrices, matrix norms, and matrix constraints. It will be convenient to use the
notation 1S for the vector of ones in RS , and B := {0, 1}m×k for the set of Boolean matrices in Rm×k .


∑_{j=1}^k uij cj = [ c1  c2  · · ·  ck ] (ui1 , ui2 , . . . , uik )ᵀ = C Uᵀei

The Frobenius norm of a matrix is the square root of the sum of the squared absolute values of its
entries; equivalently, the squared Frobenius norm is the sum of the squared Euclidean norms of the
columns (or rows):

∥A∥_F = ( ∑_{i,j} |aij|² )^{1/2}

Consequently, the squared Frobenius norm of a difference, ∥A − B∥²_F , equals the sum of the squared
Euclidean norms of the differences between corresponding columns of A and B. Substituting A = X and
B = C Uᵀ (whose i-th column is C Uᵀei ), we obtain:

⇒ ∑_{i=1}^m ∥xi − ∑_{j=1}^k uij cj∥² = ∑_{i=1}^m ∥xi − C Uᵀei∥² = ∥X − C Uᵀ∥²_F

The problem can therefore be rewritten as:

min_{C,U} ∥X − C Uᵀ∥²_F   such that   C ∈ Rn×k ,  U 1k = 1m ,  U ∈ B := {0, 1}m×k

3. One idea to solve the above problem is to alternate over the matrices C and U . We start with an initial
point (C⁰ , U⁰) and update the pair by minimizing J(C, U ) over U with C fixed, and then over C with
U fixed. Derive the solution to the C-step, that is, minimizing over C for fixed U . Express the result in
terms of mj , the number of points assigned to cluster j, and Ij , the set of indices of points assigned to
cluster Cj ; then express your result in words. Hint: using the fact that the gradient of a differentiable
function is zero at its minimum, show that the vector c which minimizes the sum of squared distances to
given vectors y1 , . . . , yL ,

F (c) = ∥y1 − c∥² + · · · + ∥yL − c∥² ,                    (1)

is the average of the vectors, c∗ = (1/L)(y1 + · · · + yL ).
- Compute the gradient of F (c) with respect to c:

∇F (c) = 2(c − y1 ) + 2(c − y2 ) + · · · + 2(c − yL )

- Setting the gradient to zero:

∇F (c) = 0 ⇔ c = (1/L)(y1 + · · · + yL )

- With U fixed, the problem separates over the clusters: for every j = 1, 2, . . . , k,

min_c ∑_{i∈Ij} ∥xi − c∥²

- Applying the hint with the vectors {xi : i ∈ Ij }, we obtain:

cj = (1/mj) ∑_{i∈Ij} xi

In words: the optimal representative cj is the centroid (average) of the points currently assigned to cluster j.


- Equivalently, writing the cluster-j objective directly:

F (c) = ∑_{i∈Ij} ∥xi − c∥² ⇒ ∇F (c) = 2( mj c − ∑_{i∈Ij} xi ),

∇F (c) = 0 ⇒ cj = (1/mj) ∑_{i∈Ij} xi

4. Find the solution to the U -step, where we fix C and solve for the assignment matrix U . Express your
result in words.
- With C fixed, the problem separates over the points: for each i, we solve

min_u ∥xi − ∑_{j=1}^k uj cj∥²   subject to   uᵀ1k = 1, u ∈ {0, 1}^k .

- For every data point in our set, we find the nearest cluster center and assign the point to that cluster.
Practically, this means we compare the distances from the data point to each cluster center and select
the smallest one. The assignment matrix U is then updated by setting the corresponding entry of row i
to 1, indicating that the data point is assigned to that particular cluster, and setting all other entries of
that row to 0. So the optimal j for point xi is the index of the nearest center cj .
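Putting the C-step and the U-step together gives the familiar alternating k-means loop. Below is a minimal, illustrative Python sketch (function and variable names are my own; the fixed initialization sidesteps the usual random restarts):

```python
def kmeans(X, C, iters=10):
    """Alternate the U-step (assign each point to its nearest center) and the
    C-step (move each center to the centroid of its assigned points)."""
    assign = [0] * len(X)
    for _ in range(iters):
        # U-step: row i of U is the unit vector e_j of the nearest center c_j
        for i, x in enumerate(X):
            dists = [sum((xk - ck) ** 2 for xk, ck in zip(x, c)) for c in C]
            assign[i] = dists.index(min(dists))
        # C-step: c_j becomes the average of the points in cluster j
        for j in range(len(C)):
            pts = [x for x, a in zip(X, assign) if a == j]
            if pts:  # keep a center unchanged if its cluster is empty
                C[j] = [sum(p[k] for p in pts) / len(pts) for k in range(len(pts[0]))]
    return C, assign

# Two well-separated blobs; the centers converge to the blob centroids.
X = [[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [5.0, 5.0], [5.2, 5.0], [5.0, 5.2]]
C, assign = kmeans(X, [[0.1, 0.1], [4.0, 4.0]])
```

Note that each iteration can only decrease Jclust, since each step minimizes the objective exactly in one variable, but the scheme can stall in a local minimum depending on the initialization.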

3 Matrices
1. Let f : Rm → Rk and g : Rn → Rm be two maps. Let h : Rn → Rk be the composite map h = f ◦ g,
with values h(x) = f (g(x)) for x in Rn . Show that the derivatives of h can be expressed via a matrix-
matrix product, as Jh (x) = Jf (g(x)) · Jg (x), where the Jacobian matrix of h at x is defined as the matrix
Jh (x) with (i, j) element ∂hi/∂xj (x).
According to the problem, writing f (g(x)) = (f1 , f2 , . . . , fk ), Jf (g(x)) is the k × m matrix:

             [ ∂f1/∂g1  ∂f1/∂g2  · · ·  ∂f1/∂gm ]
Jf (g(x)) =  [ ∂f2/∂g1  ∂f2/∂g2  · · ·  ∂f2/∂gm ]
             [    ...      ...    . . .    ...   ]
             [ ∂fk/∂g1  ∂fk/∂g2  · · ·  ∂fk/∂gm ]

and Jg (x) is the m × n matrix:

          [ ∂g1/∂x1  ∂g1/∂x2  · · ·  ∂g1/∂xn ]
Jg (x) =  [ ∂g2/∂x1  ∂g2/∂x2  · · ·  ∂g2/∂xn ]
          [    ...      ...    . . .    ...   ]
          [ ∂gm/∂x1  ∂gm/∂x2  · · ·  ∂gm/∂xn ]

- We calculate the (k × m)(m × n) = k × n product:

                     [ ∂f1/∂g1  · · ·  ∂f1/∂gm ] [ ∂g1/∂x1  · · ·  ∂g1/∂xn ]
Jf (g(x)) · Jg (x) = [    ...    . . .    ...   ] [    ...    . . .    ...   ]
                     [ ∂fk/∂g1  · · ·  ∂fk/∂gm ] [ ∂gm/∂x1  · · ·  ∂gm/∂xn ]


- The (1, 1) entry of the product is the dot product of the first row of Jf (g(x)) with the first column of Jg (x):

(∂f1/∂g1)(∂g1/∂x1) + · · · + (∂f1/∂gm)(∂gm/∂x1) = ∑_{l=1}^m (∂f1/∂gl)(∂gl/∂x1) = ∂h1/∂x1 ,

which is exactly the multivariable chain rule applied to h1 (x) = f1 (g(x)).
- Repeating for every entry (i, j), and using hi (x) = fi (g(x)), we obtain the result:

                     [ ∂h1/∂x1  ∂h1/∂x2  · · ·  ∂h1/∂xn ]
Jf (g(x)) · Jg (x) = [ ∂h2/∂x1  ∂h2/∂x2  · · ·  ∂h2/∂xn ] = Jh (x)
                     [    ...      ...    . . .    ...   ]
                     [ ∂hk/∂x1  ∂hk/∂x2  · · ·  ∂hk/∂xn ]
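The identity Jh (x) = Jf (g(x)) · Jg (x) can be sanity-checked numerically with finite differences; the maps f and g below are arbitrary smooth examples chosen only for the check.

```python
def jacobian_fd(F, x, h=1e-6):
    """Central-difference approximation of the Jacobian of F at x."""
    k, n = len(F(x)), len(x)
    J = [[0.0] * n for _ in range(k)]
    for j in range(n):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        Fp, Fm = F(xp), F(xm)
        for i in range(k):
            J[i][j] = (Fp[i] - Fm[i]) / (2 * h)
    return J

def matmul(A, B):
    return [[sum(A[i][l] * B[l][j] for l in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

g = lambda x: [x[0] ** 2, x[0] * x[1]]      # g : R^2 -> R^2
f = lambda y: [y[0] + y[1], y[0] * y[1]]    # f : R^2 -> R^2
h_comp = lambda x: f(g(x))                  # h = f o g

x0 = [1.5, -0.5]
direct = jacobian_fd(h_comp, x0)                               # J_h(x0)
chained = matmul(jacobian_fd(f, g(x0)), jacobian_fd(g, x0))    # J_f(g(x0)) J_g(x0)
```

Both `direct` and `chained` agree to finite-difference accuracy, as the chain rule predicts.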

2. A matrix P in Rn×n is a permutation matrix if it is a permutation of the columns of the n × n identity
matrix. For an n × n matrix A, we consider the products P A and AP . Describe in simple terms what
these matrices look like with respect to the original matrix A.
- When we multiply matrix A by the permutation matrix P from the left to get P A, each row of A is
rearranged according to the permutation of the n × n identity matrix.
- Similarly, multiplying A by P from the right to get AP results in a matrix where each column of A is
reordered according to the permutation pattern of the identity matrix.
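A small concrete check of this description, with a hypothetical 3 × 3 example:

```python
def matmul(A, B):
    return [[sum(A[i][l] * B[l][j] for l in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

# P permutes the columns of the 3x3 identity: its columns are (e3, e1, e2).
P = [[0, 1, 0],
     [0, 0, 1],
     [1, 0, 0]]
A = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]

PA = matmul(P, A)  # rows of A reordered: [row 2, row 3, row 1]
AP = matmul(A, P)  # columns of A reordered: [col 3, col 1, col 2]
```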
3. a) Show that a square matrix is invertible if and only if its determinant is non-zero. You may
use the fact that the determinant of a product is the product of the determinants, together with the QR
decomposition of matrix A.
- Writing A = QR, we have:

det(A) = det(Q) · det(R)

- Since Q is an orthogonal matrix, Qᵀ = Q⁻¹, so:

QᵀQ = I =⇒ det(Q) · det(Qᵀ) = det(I) =⇒ (det(Q))² = 1 =⇒ det(Q) = ±1 ̸= 0.

- Now consider the upper-triangular matrix R:

     [ r1   · · ·  · · ·  · · · ]
R =  [ 0    r2     · · ·  · · · ]     =⇒ det(R) = r1 · r2 · · · rn
     [ ...   . . .  . . .   ... ]
     [ 0    · · ·  0      rn    ]

- Hence R is invertible if and only if det(R) ̸= 0, i.e., all diagonal entries r1 , r2 , . . . , rn are non-zero
(equivalently, the rows of R are linearly independent).
- Since det(Q) = ±1 is always non-zero, det(A) ̸= 0 ⟺ det(R) ̸= 0 ⟺ R is invertible. As Q is always
invertible, A = QR is invertible exactly when R is. This proves the result: a square matrix is invertible
if and only if its determinant is non-zero.
3. b) Let A ∈ Rm×n , B ∈ Rn×p , and let C := AB ∈ Rm×p . Show that ∥C∥ ≤ ∥A∥ · ∥B∥ where ∥ · ∥
denotes the ℓ2 -induced norm of its matrix argument, defined for a matrix M as:

∥M ∥ := max_{z̸=0} ∥M z∥₂ / ∥z∥₂ .

- From the definition:

∥M ∥ = max_{z̸=0} ∥M z∥₂ / ∥z∥₂ =⇒ ∥M z∥₂ ≤ ∥M ∥ · ∥z∥₂ for all z.


- Let x be a non-zero p-vector. Applying the inequality twice:

∥Cx∥₂ = ∥A(Bx)∥₂ ≤ ∥A∥ · ∥Bx∥₂ ≤ ∥A∥ · ∥B∥ · ∥x∥₂

=⇒ ∥Cx∥₂ / ∥x∥₂ ≤ ∥A∥ · ∥B∥ for all x ̸= 0 =⇒ ∥C∥ ≤ ∥A∥ · ∥B∥

4 Hermitian product and projection of complex vectors on a line
In lecture we have defined the scalar product of two complex vectors x, y ∈ Cn as

xᴴy = x̄ᵀy = ∑_{i=1}^n x̄(i) y(i),

where z̄ denotes the entrywise conjugate of a complex vector z. The ordinary scalar product results when x, y are
both real vectors. In this exercise, we explain why this choice makes sense from the point of view of
projections. Precisely, we show that the projection z of a point p ∈ C n on the line L(u) := {αu : α ∈ C},
where u ∈ C n satisfies uH u = 1 without loss of generality, is given by z = (uuH )p = (uH p)u.
1. As a preliminary result, show that for any real vector z, the minimum value of

∥w∥22 − 2z T w (1)

over real vectors w, is obtained for w = z. Hint: express the objective function of the above problem as
the difference of two squared terms, the second one independent of w.

∥w∥² − 2zᵀw = ∥w∥² − 2zᵀw + ∥z∥² − ∥z∥²

= ∥w − z∥² − ∥z∥² ≥ −∥z∥²

The minimum value −∥z∥² is attained for w = z.


2. Show that the proposed formula for the projected vector is correct when u, p are real.
- When u and p are real, uH p = uT p.

⇒ z = (uH p)u = (uT p)u = (uuT )p

3. Show that the proposed formula is also correct in the complex case. That is, solve the problem

min_{α∈C} ∥p − αu∥₂

and show that the optimal α is α∗ = uH p. Hint: optimize over the real and imaginary parts of α, and
transform the problem into one of the form (1) involving two-dimensional real vectors; then apply the
result of part 1.
- Let D² = ∥p − αu∥² = (p − αu)ᴴ(p − αu) = pᴴp − ᾱ uᴴp − α pᴴu + |α|² uᴴu
= |α|² − (ᾱ uᴴp + α pᴴu) + pᴴp.
- Since pᴴp does not depend on α, it can be omitted when considering the optimization problem.
- Define α = a + bi and uᴴp = c + di (so that pᴴu = c − di). Then

|α|² − (ᾱ uᴴp + α pᴴu) = a² + b² − [(a − bi)(c + di) + (a + bi)(c − di)] = a² + b² − 2(ac + bd),

which is of the form (1) with w = (a, b) and z = (c, d); by part 1 it is minimized at (a∗ , b∗ ) = (c, d).

⇒ α∗ = c + di = uᴴp
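A quick numerical check of this result, with an arbitrary unit-norm u ∈ C² and point p ∈ C² chosen for illustration:

```python
def hdot(x, y):
    """Hermitian product x^H y = sum_i conj(x_i) y_i."""
    return sum(xi.conjugate() * yi for xi, yi in zip(x, y))

r = 1 / 2 ** 0.5
u = [r + 0j, r * 1j]          # u^H u = 1
p = [2 + 1j, 1 - 3j]

alpha_star = hdot(u, p)       # claimed optimal alpha* = u^H p

def sq_dist(alpha):
    """||p - alpha u||_2^2."""
    return sum(abs(pi - alpha * ui) ** 2 for pi, ui in zip(p, u))

# alpha* beats any perturbed alpha; since u^H u = 1, the gap is exactly |delta|^2
for delta in (0.1, -0.1, 0.1j, -0.1j, 0.05 + 0.05j):
    assert sq_dist(alpha_star) < sq_dist(alpha_star + delta)
```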


5 Convolutions
The convolution of an n-vector a and an m-vector b is the (n + m − 1)-vector c = a ∗ b, with entries

ck = ∑_{i+j=k+1} ai bj ,   k = 1, . . . , n + m − 1.

1. Express the coefficients of the product of two polynomials

p(x) = a1 + a2 x + · · · + an xn−1 , q(x) = b1 + b2 x + · · · + bm xm−1 ,

in terms of an appropriate convolution.


- We have: p(x) · q(x) = c1 + c2 x + · · · + cn+m−1 x^{n+m−2} , where

c1 = a1 b1
c2 = a1 b2 + a2 b1
c3 = a1 b3 + a2 b2 + a3 b1
...
cn+m−1 = an bm

⇒ ck = ∑_{i+j=k+1} ai bj ,  k = 1, 2, . . . , n + m − 1, i.e. the coefficient vector of p · q is c = a ∗ b.
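A direct Python sketch of this convolution (0-based indices, so `c[i + j] += a[i] * b[j]` matches ck = ∑_{i+j=k+1} ai bj in the 1-based notation above):

```python
def conv(a, b):
    """Convolution c = a * b: c_k = sum_{i+j=k+1} a_i b_j (1-based indexing)."""
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj
    return c

# (1 + 2x)(3 + x + x^2) = 3 + 7x + 3x^2 + 2x^3
print(conv([1, 2], [3, 1, 1]))  # -> [3, 7, 3, 2]
```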

2. Given a time-series x ∈ Rn , the (4-point) moving average of x is a new time-series y such that, for
every i = 4, 5, . . . , n, yi is the average of xi , xi−1 , xi−2 , xi−3 . Express y in terms of a convolution of x
with an appropriate vector.
Hint: Think about time-series with only a single 1 in it.
- The 4-point moving average of a time series x can be expressed as a convolution with the averaging
window w = (1/4)(1, 1, 1, 1) ∈ R⁴: setting y = x ∗ w gives

yi = (1/4)(xi + xi−1 + xi−2 + xi−3 )   for i = 4, 5, . . . , n,

which is exactly the 4-point moving average of x.
- Following the hint: let t = (1, 0, . . . , 0) be the time-series with a single 1 in the first place; then
x ∗ t = x, and convolving with a shifted unit vector shifts the series by the same amount. The window
w is the average of four such shifted unit vectors, so x ∗ w averages x with its three shifted copies.
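Checking the claim on a small hypothetical series, with the same direct convolution as in the definition:

```python
def conv(a, b):
    """Direct convolution c = a * b."""
    c = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj
    return c

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
w = [0.25, 0.25, 0.25, 0.25]
y = conv(x, w)
# In 1-based indexing, y_i for i = 4..6 is the 4-point moving average of x:
# y_4 = (1+2+3+4)/4 = 2.5, y_5 = 3.5, y_6 = 4.5
```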
3. Show that
a ∗ b = T (a)b = T (b)a,
where T (a), T (b) are two appropriate matrices. Specify those matrices for the case n = 3, m = 4.
- Let c = a ∗ b; by definition of the convolution:

ck = ∑_{i+j=k+1} ai bj ,  k = 1, . . . , 6

⇒ c1 = a1 b1 ,  c2 = a1 b2 + a2 b1 ,  . . . ,  c6 = a3 b4


Therefore:

     [ c1 ]   [ a1  0   0   0  ]          [ b1  0   0  ]
     [ c2 ]   [ a2  a1  0   0  ] [ b1 ]   [ b2  b1  0  ] [ a1 ]
     [ c3 ]   [ a3  a2  a1  0  ] [ b2 ]   [ b3  b2  b1 ] [ a2 ]
c =  [ c4 ] = [ 0   a3  a2  a1 ] [ b3 ] = [ b4  b3  b2 ] [ a3 ]
     [ c5 ]   [ 0   0   a3  a2 ] [ b4 ]   [ 0   b4  b3 ]
     [ c6 ]   [ 0   0   0   a3 ]          [ 0   0   b4 ]

so that

        [ a1  0   0   0  ]            [ b1  0   0  ]
        [ a2  a1  0   0  ]            [ b2  b1  0  ]
T (a) = [ a3  a2  a1  0  ] ,  T (b) = [ b3  b2  b1 ]
        [ 0   a3  a2  a1 ]            [ b4  b3  b2 ]
        [ 0   0   a3  a2 ]            [ 0   b4  b3 ]
        [ 0   0   0   a3 ]            [ 0   0   b4 ]
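The banded (Toeplitz) structure can be built and verified programmatically; `toeplitz` below is my own helper, with entry T[k][j] = a[k − j] in 0-based indexing:

```python
def toeplitz(a, rows, cols):
    """Toeplitz matrix with T[k][j] = a[k-j] (0 outside range), so that T(a) b = a * b."""
    return [[a[k - j] if 0 <= k - j < len(a) else 0 for j in range(cols)]
            for k in range(rows)]

def matvec(T, v):
    return [sum(tkj * vj for tkj, vj in zip(row, v)) for row in T]

a = [1, 2, 3]        # n = 3
b = [4, 5, 6, 7]     # m = 4
Ta = toeplitz(a, 6, 4)   # 6 x 4, columns are shifted copies of a
Tb = toeplitz(b, 6, 3)   # 6 x 3, columns are shifted copies of b

print(matvec(Ta, b))  # -> [4, 13, 28, 34, 32, 21]
print(matvec(Tb, a))  # -> [4, 13, 28, 34, 32, 21]
```

Both products agree, illustrating a ∗ b = T (a)b = T (b)a.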
4. A T -vector r gives the average daily rainfall in some region over a period of T days. The vector h
gives the daily height of a river in the region. Using model fitting, it is found that the two vectors are
related by h = g ∗ r, where
g := (0.1, 0.4, 0.5, 0.2).
(a) If one day there is a heavy rainfall, assuming uniform rainfall for all other days, how many days
after that day is the river at maximum height?
(b) How many days does it take for the river to return to 0 after rain stops?
- We have: h = g ∗ r = r ∗ g.
- Since the convolution is a weighted sum of shifted copies of r, each hi is a weighted sum of recent rainfall:

hi = 0.1 ri + 0.4 ri−1 + 0.5 ri−2 + 0.2 ri−3 .

- The largest coefficient, 0.5, multiplies ri−2 , so the river height is most affected by the rainfall of 2 days prior.

a) 2 days.
b) Since g has four elements, a single day's rainfall affects the river height on that day and on the three
following days. The river therefore returns to 0 four days after the rain stops.
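Both answers can be read off the impulse response, i.e. the river height produced by one unit of rain on a single day:

```python
g = [0.1, 0.4, 0.5, 0.2]
r = [1.0] + [0.0] * 9       # unit rainfall on day 0, then dry

def conv(a, b):
    """Direct convolution c = a * b."""
    c = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj
    return c

h = conv(r, g)
print(h.index(max(h)))      # -> 2: the river peaks 2 days after the rain
print(h[4])                 # -> 0.0: back to zero on the 4th day after
```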
