Homework 2 MATH2050
\[ f(x) = u^T x + v. \]
Without loss of generality, and in order to compare different scoring mechanisms, we may assume that
the vector $u$ is unit-norm ($\|u\|_2 = 1$) and that the scores are centered, that is,
\[ \sum_{i=1}^{m} f(x_i) = 0. \]
1. Show that the centering requirement implies that $v$ can be expressed as a function of $u$, which you
will determine. Interpret the resulting scoring mechanism in terms of the centered data points $x_i - \bar{x}$,
$i = 1, \dots, m$, where
\[ \bar{x} := \frac{1}{m} \sum_{i=1}^{m} x_i \]
is the center of the data points.
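- We have:
\[
\sum_{i=1}^{m} \left( u^T x_i + v \right) = 0
\implies u^T \left( \sum_{i=1}^{m} x_i \right) + m v = 0
\implies v = -u^T \left( \frac{1}{m} \sum_{i=1}^{m} x_i \right)
\implies v = -u^T \bar{x},
\]
with:
\[ \bar{x} := \frac{1}{m} \sum_{i=1}^{m} x_i. \]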
2. Interpret the scoring formula above as a projection on a line, which you will determine in terms of u.
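\[ f(x) = u^T x - u^T \bar{x} = u^T (x - \bar{x}). \]
Since $\|u\|_2 = 1$, $f(x)$ is the signed coordinate of the orthogonal projection of $x$ on the line $\{\bar{x} + t u : t \in \mathbb{R}\}$: the projected point is $\bar{x} + f(x)\, u$.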
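The two formulas above can be checked numerically. The sketch below is only an illustration: the random data set, its size, and the seed are assumptions made here, not part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)      # arbitrary seed, for reproducibility only
X = rng.normal(size=(20, 3))        # hypothetical data: m = 20 points in R^3, one per row

u = rng.normal(size=3)
u /= np.linalg.norm(u)              # enforce ||u||_2 = 1

x_bar = X.mean(axis=0)              # center of the data points
v = -u @ x_bar                      # offset implied by the centering requirement

scores = X @ u + v                  # f(x_i) = u^T x_i + v
scores_centered = (X - x_bar) @ u   # u^T (x_i - x_bar)

print(scores.sum())                 # numerically zero
```

Both ways of computing the scores agree, and the scores sum to zero up to rounding.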
3. Consider a data set of your choice and try different vectors u (do not forget to normalize them):
• Random vectors
• All ones (normalized)
• Any other choice
Look at the spread of the scores, as measured by their variance. What do you observe? Which vector u
would you choose? Comment.
- Randomly generated dataset of x:
College of Engineering and Computer Science
MATH2050. Linear Algebra
Fall 2023
0.0 0.1764052345967664
0.1111111111111111 0.26223794305894454
0.2222222222222222 0.5423182428550184
0.3333333333333333 0.8907559865868124
0.4444444444444444 1.0756446879038857
0.5555555555555556 1.0133833231234701
0.6666666666666666 1.4283421750858922
0.7777777777777777 1.5404198347257856
0.8888888888888888 1.7674558925984218
1.0 2.0410598501938373
- We use code to compute the scores, normalizing every vector u first:
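A minimal sketch of such a computation, assuming NumPy and treating the two printed columns as the coordinates of ten points in R^2. The helper names `normalize` and `score_variance`, the seed, and the extra "top eigenvector" candidate are illustrative choices made here, not the original listing.

```python
import numpy as np

# The ten data points listed above, one point per row.
X = np.array([
    [0.0,                0.1764052345967664],
    [0.1111111111111111, 0.26223794305894454],
    [0.2222222222222222, 0.5423182428550184],
    [0.3333333333333333, 0.8907559865868124],
    [0.4444444444444444, 1.0756446879038857],
    [0.5555555555555556, 1.0133833231234701],
    [0.6666666666666666, 1.4283421750858922],
    [0.7777777777777777, 1.5404198347257856],
    [0.8888888888888888, 1.7674558925984218],
    [1.0,                2.0410598501938373],
])

def normalize(u):
    """Return u scaled to unit Euclidean norm."""
    u = np.asarray(u, dtype=float)
    return u / np.linalg.norm(u)

def score_variance(X, u):
    """Variance of the centered scores u^T (x_i - x_bar), with u normalized."""
    scores = (X - X.mean(axis=0)) @ normalize(u)
    return scores.var()

rng = np.random.default_rng(0)          # arbitrary seed
candidates = {
    "random": rng.normal(size=2),
    "all ones": np.ones(2),
    "first axis": np.array([1.0, 0.0]),
}

# Extra candidate (an assumption of this sketch): the leading eigenvector
# of the sample covariance matrix, which maximizes u^T Sigma u over unit u.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / len(X)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
candidates["top eigenvector"] = eigvecs[:, -1]

variances = {name: score_variance(X, u) for name, u in candidates.items()}
for name, var in variances.items():
    print(f"{name:>16}: {var:.4f}")
```

The score variance equals $u^T \Sigma u$, with $\Sigma$ the sample covariance matrix, so no unit vector can give a larger spread than the leading eigenvector of $\Sigma$.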
2 Clustering
In clustering problems, we are given data points xi ∈ Rn , i = 1, . . . , m. We seek to assign each point to
a cluster of points. The so-called k-means algorithm is one of the most widely used clustering methods.
It is based on choosing a number of clusters, k(< m), and minimizing the average squared Euclidean
distance from the data points to their closest cluster “representative”. The objective function to minimize
is thus
\[ J_{\text{clust}} := \min_{c_1, \dots, c_k} \; \sum_{i=1}^{m} \; \min_{1 \le j \le k} \|x_i - c_j\|_2^2 . \]
Each $c_j \in \mathbb{R}^n$ is the “representative” point for the $j$-th cluster, denoted $C_j$. Note that the terms inside
the sum express the assignment of a specific point $x_i$ to its closest cluster center, so that the problem amounts
to minimizing the sum of those squared distances.
1. Show that the problem can be written as one involving two matrix variables C, U :
\[
\min_{C, U} \; \sum_{i=1}^{m} \Big\| x_i - \sum_{j=1}^{k} u_{ij} c_j \Big\|_2^2
\quad \text{subject to} \quad \sum_{j=1}^{k} u_{ij} = 1, \quad 1 \le i \le m.
\]
In the above, the $n \times k$ matrix $C$ has columns $c_j$, $1 \le j \le k$, the center representatives; you are asked to
explain why the Boolean $m \times k$ matrix $U$ with entries $u_{ij}$, $1 \le i \le m$, $1 \le j \le k$, is referred to as an
assignment matrix. Hint: show that, for a given point $x \in \mathbb{R}^n$, we have $A(x) = B(x)$, where
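We define:
$C$ is an $n \times k$ matrix with columns $c_j$, $1 \le j \le k$. Each column is the representative of one cluster.
$U$ is a Boolean $m \times k$ matrix with entries $u_{ij} \in \{0, 1\}$, $1 \le i \le m$, $1 \le j \le k$.
For a given point $x \in \mathbb{R}^n$, let
\[
A(x) := \min_{1 \le j \le k} \|x - c_j\|_2^2, \qquad
B(x) := \min_{u \in \mathcal{U}} \Big\| x - \sum_{j=1}^{k} u_j c_j \Big\|_2^2,
\]
with
\[ \mathcal{U} := \left\{ u \;:\; u^T \mathbf{1}_k = 1, \; u \in \{0, 1\}^k \right\}. \]
A Boolean vector whose entries sum to one has exactly one entry equal to 1, so $\mathcal{U} = \{e_1, \dots, e_k\}$, and for $u = e_j$ we get $\sum_{l=1}^{k} u_l c_l = c_j$. Minimizing over $u \in \mathcal{U}$ is therefore the same as minimizing over $j$:
\[ \Rightarrow B(x) = A(x). \]
Applying this to each $x_i$, with $u$ the $i$-th row of $U$:
\[
\Rightarrow J_{\text{clust}} = \min_{C, U} \; \sum_{i=1}^{m} \Big\| x_i - \sum_{j=1}^{k} u_{ij} c_j \Big\|_2^2
\quad \text{subject to} \quad \sum_{j=1}^{k} u_{ij} = 1, \quad 1 \le i \le m.
\]
Row $i$ of $U$ has its single 1 in column $j$ exactly when $x_i$ is assigned to cluster $C_j$, which is why $U$ is called an assignment matrix.
Note: $\mathbf{1}_k$ represents a $k$-dimensional vector of all ones, and $e_j$ represents the $j$-th unit vector in $\mathbb{R}^k$.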
2. Show that in turn, the above problem is equivalent to finding an (approximate) factorization of the
data matrix $X$, into a product of two matrices with specific properties. Make sure to express the above
problem in terms of matrices, matrix norms, and matrix constraints. It will be convenient to use the
notation $\mathbf{1}_s$ for the vector of ones in $\mathbb{R}^s$, and $\mathcal{B} := \{0, 1\}^{m \times k}$ for the set of Boolean matrices in $\mathbb{R}^{m \times k}$.
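\[
\sum_{j=1}^{k} u_{ij} c_j
= \begin{bmatrix} c_1 & c_2 & \cdots & c_k \end{bmatrix}
\begin{bmatrix} u_{i1} \\ u_{i2} \\ \vdots \\ u_{ik} \end{bmatrix}
= C U^T e_i,
\]
where $e_i$ is here the $i$-th unit vector in $\mathbb{R}^m$, so that $U^T e_i$ is the $i$-th row of $U$ written as a column.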
The Frobenius norm of a matrix is defined as the square root of the sum of the absolute squares of
its elements. Equivalently, its square is obtained by taking the Euclidean norm of each row (or each column),
squaring it, and summing these squares across the entire matrix.
\[ \|A\|_F = \Big( \sum_{i,j} |a_{ij}|^2 \Big)^{1/2} \]
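This property means that the squared Frobenius norm of the difference between two matrices $A$ and $B$,
$\|A - B\|_F^2$, is equal to the sum of the squared Euclidean norms of the differences between corresponding
columns of $A$ and $B$. Substituting $A$ with $X$ (whose $i$-th column is $x_i$) and $B$ with $C U^T$ (whose $i$-th column is $C U^T e_i$), we obtain:
\[
\Rightarrow \sum_{i=1}^{m} \Big\| x_i - \sum_{j=1}^{k} u_{ij} c_j \Big\|_2^2
= \sum_{i=1}^{m} \big\| x_i - C U^T e_i \big\|_2^2
= \big\| X - C U^T \big\|_F^2 .
\]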
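\[
\min_{C, U} \; \big\| X - C U^T \big\|_F^2
\quad \text{such that} \quad C \in \mathbb{R}^{n \times k}, \quad U \mathbf{1}_k = \mathbf{1}_m, \quad U \in \mathcal{B} := \{0, 1\}^{m \times k}.
\]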
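The equivalence between the sum-of-distances form and the matrix form can be checked numerically. The following is a minimal sketch: the dimensions, the random data, and the random assignment `labels` are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)          # arbitrary seed
n, m, k = 2, 6, 3
X = rng.normal(size=(n, m))             # data points x_i are the columns of X
C = rng.normal(size=(n, k))             # representatives c_j are the columns of C

labels = rng.integers(0, k, size=m)     # hypothetical assignment of each point to a cluster
U = np.zeros((m, k))
U[np.arange(m), labels] = 1.0           # row i has a single 1, in column labels[i]

# Sum of squared distances from each point to its assigned representative.
sum_form = sum(np.sum((X[:, i] - C[:, labels[i]]) ** 2) for i in range(m))

# Same quantity written as a squared Frobenius norm.
frob_form = np.linalg.norm(X - C @ U.T, "fro") ** 2

print(np.isclose(sum_form, frob_form))
```

The row-sum constraint on the assignment matrix reads `U @ np.ones(k) == np.ones(m)` in this notation.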
3. One idea to solve the above problem is to alternate over the matrices $C$ and $U$. We start with an initial
point $(C^0, U^0)$ and update the pair by minimizing $J(C, U)$ over $U$ with $C$ fixed, and then over $C$ with
$U$ fixed. Derive the solution to the C-step, that is, minimizing over $C$ for fixed $U$. Express the result in
terms of $m_j$, the number of points assigned to cluster $j$, and $I_j$, the index set of points assigned to cluster
$C_j$; then express your result in words. Hint: using the fact that the gradient of a differentiable function
is zero at its minimum, show that the vector $c$ which minimizes the sum of squared distances to given
vectors $y_1, \dots, y_L$, $F(c) := \sum_{l=1}^{L} \|c - y_l\|_2^2$, satisfies
\[ \nabla F(c) = 0 \iff c = \frac{1}{L} (y_1 + \dots + y_L). \]
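- For fixed $U$, the objective splits into one independent term per cluster, so we can write the problem as, for every $j = 1, 2, \dots, k$:
\[ \min_{c} \; \sum_{i \in I_j} \|x_i - c\|_2^2 , \]
whose solution is
\[ c_j = \frac{1}{m_j} \sum_{i \in I_j} x_i . \]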
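- By considering:
\[
F(c) = \sum_{i \in I_j} \|x_i - c\|_2^2
\;\Rightarrow\;
\nabla F(c) = 2 \Big( m_j c - \sum_{i \in I_j} x_i \Big),
\]
\[
\nabla F(c) = 0 \;\Rightarrow\; c_j = \frac{1}{m_j} \sum_{i \in I_j} x_i .
\]
In words: each representative is the average (centroid) of the points currently assigned to its cluster.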
4. Find the solution to the U -step, where we fix C and solve for the assignment matrix U . Express your
result in words.
- With $C$ fixed, the problem separates across data points. For each $x_i$, with $u$ denoting the $i$-th row of $U$, it can be written as:
\[
\min_{u} \; \Big\| x_i - \sum_{j=1}^{k} u_j c_j \Big\|_2^2
\quad \text{subject to} \quad u^T \mathbf{1}_k = 1, \quad u \in \{0, 1\}^k .
\]
- For every data point in our set, we need to find the nearest cluster center and assign the point to
that cluster. In practice, we compare the distances from the data point to each cluster center
and select the smallest one. The assignment matrix $U$ is then updated by setting the
corresponding entry of row $i$ to 1, indicating that the data point is assigned to that cluster, and setting
all other entries of that row to 0. The optimal $j$ is thus the index of the center closest to the data point $x_i$.
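The two steps can be combined into an alternating loop. The sketch below is one possible illustration, not a reference implementation: the function name `kmeans`, the parameters `n_iter` and `seed`, the initialization from randomly chosen data points, and the two-blob test data are all assumptions made here.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Alternate the U-step and the C-step on the columns of X (n x m)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    # Initial representatives: k distinct data points chosen at random.
    C = X[:, rng.choice(m, size=k, replace=False)].copy()
    history = []
    for _ in range(n_iter):
        # U-step: assign each point to its nearest representative.
        d2 = ((X[:, None, :] - C[:, :, None]) ** 2).sum(axis=0)   # k x m squared distances
        labels = d2.argmin(axis=0)
        # C-step: each representative becomes the mean of its assigned points.
        for j in range(k):
            members = labels == j
            if members.any():            # leave c_j unchanged if its cluster is empty
                C[:, j] = X[:, members].mean(axis=1)
        # Objective value after both steps.
        d2 = ((X[:, None, :] - C[:, :, None]) ** 2).sum(axis=0)
        history.append(d2[labels, np.arange(m)].sum())
    return C, labels, history

# Usage on two well-separated hypothetical blobs in R^2.
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(0.0, 0.1, size=(2, 20)),
               rng.normal(3.0, 0.1, size=(2, 20))])
C, labels, history = kmeans(X, k=2)
print(labels)
```

Since each step minimizes the objective over one variable with the other fixed, the recorded objective values never increase from one iteration to the next.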
3 Matrices
1. Let $f : \mathbb{R}^m \to \mathbb{R}^k$ and $g : \mathbb{R}^n \to \mathbb{R}^m$ be two maps. Let $h : \mathbb{R}^n \to \mathbb{R}^k$ be the composite map $h = f \circ g$,
with values $h(x) = f(g(x))$ for $x$ in $\mathbb{R}^n$. Show that the derivatives of $h$ can be expressed via a matrix-matrix
product, as $J_h(x) = J_f(g(x)) \cdot J_g(x)$, where the Jacobian matrix of $h$ at $x$ is defined as the matrix
$J_h(x)$ with $(i, j)$ element $\frac{\partial h_i}{\partial x_j}(x)$.
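According to the problem, $J_f(g(x))$ is a $k \times m$ matrix:
\[
f(g(x)) = \begin{bmatrix} f_1 \\ f_2 \\ \vdots \\ f_k \end{bmatrix}
\;\Rightarrow\;
J_f(g(x)) =
\begin{bmatrix} \frac{\partial f_1}{\partial g} \\ \frac{\partial f_2}{\partial g} \\ \vdots \\ \frac{\partial f_k}{\partial g} \end{bmatrix}
=
\begin{bmatrix}
\frac{\partial f_1}{\partial g_1} & \frac{\partial f_1}{\partial g_2} & \cdots & \frac{\partial f_1}{\partial g_m} \\
\frac{\partial f_2}{\partial g_1} & \frac{\partial f_2}{\partial g_2} & \cdots & \frac{\partial f_2}{\partial g_m} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial f_k}{\partial g_1} & \frac{\partial f_k}{\partial g_2} & \cdots & \frac{\partial f_k}{\partial g_m}
\end{bmatrix}_{k \times m}
\]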
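- We calculate:
\[
J_f(g(x)) \cdot J_g(x) =
\begin{bmatrix}
\frac{\partial f_1}{\partial g_1} & \frac{\partial f_1}{\partial g_2} & \cdots & \frac{\partial f_1}{\partial g_m} \\
\frac{\partial f_2}{\partial g_1} & \frac{\partial f_2}{\partial g_2} & \cdots & \frac{\partial f_2}{\partial g_m} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial f_k}{\partial g_1} & \frac{\partial f_k}{\partial g_2} & \cdots & \frac{\partial f_k}{\partial g_m}
\end{bmatrix}_{k \times m}
\begin{bmatrix}
\frac{\partial g_1}{\partial x_1} & \frac{\partial g_1}{\partial x_2} & \cdots & \frac{\partial g_1}{\partial x_n} \\
\frac{\partial g_2}{\partial x_1} & \frac{\partial g_2}{\partial x_2} & \cdots & \frac{\partial g_2}{\partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial g_m}{\partial x_1} & \frac{\partial g_m}{\partial x_2} & \cdots & \frac{\partial g_m}{\partial x_n}
\end{bmatrix}_{m \times n}
\]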
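- Take the dot product of the first row of $J_f(g(x))$ with the first column of $J_g(x)$. By the multivariable chain rule applied to $h_1(x) = f_1(g_1(x), \dots, g_m(x))$:
\[
\frac{\partial f_1}{\partial g_1} \frac{\partial g_1}{\partial x_1} + \cdots + \frac{\partial f_1}{\partial g_m} \frac{\partial g_m}{\partial x_1}
= \sum_{l=1}^{m} \frac{\partial f_1}{\partial g_l} \frac{\partial g_l}{\partial x_1}
= \frac{\partial h_1}{\partial x_1} .
\]
- Doing the same for every row $i$ and column $j$ gives the entry $\frac{\partial h_i}{\partial x_j}$, and therefore:
\[
J_f(g(x)) \cdot J_g(x) =
\begin{bmatrix}
\frac{\partial h_1}{\partial x_1} & \frac{\partial h_1}{\partial x_2} & \cdots & \frac{\partial h_1}{\partial x_n} \\
\frac{\partial h_2}{\partial x_1} & \frac{\partial h_2}{\partial x_2} & \cdots & \frac{\partial h_2}{\partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial h_k}{\partial x_1} & \frac{\partial h_k}{\partial x_2} & \cdots & \frac{\partial h_k}{\partial x_n}
\end{bmatrix}_{k \times n}
= J_h(x)
\]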
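The identity $J_h(x) = J_f(g(x)) \cdot J_g(x)$ can be checked numerically with finite differences. In the sketch below, the maps `f` and `g`, the test point `x0`, and the helper `jacobian` are hypothetical examples chosen for illustration.

```python
import numpy as np

def g(x):
    """Example map g : R^2 -> R^3."""
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])

def f(y):
    """Example map f : R^3 -> R^2."""
    return np.array([y[0] + y[1] * y[2], np.exp(y[0]) * y[2]])

def jacobian(func, x, eps=1e-6):
    """Central finite-difference approximation of the Jacobian of func at x."""
    x = np.asarray(x, dtype=float)
    cols = []
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        cols.append((func(x + step) - func(x - step)) / (2 * eps))
    return np.column_stack(cols)       # column j holds the derivatives w.r.t. x_j

x0 = np.array([0.7, -1.2])
J_h = jacobian(lambda x: f(g(x)), x0)            # 2 x 2 Jacobian of h = f o g
J_prod = jacobian(f, g(x0)) @ jacobian(g, x0)    # (2 x 3) times (3 x 2)

print(np.allclose(J_h, J_prod, atol=1e-5))
```

The shapes follow the statement: here $k = 2$, $m = 3$, $n = 2$, so the product of a $k \times m$ and an $m \times n$ matrix is $k \times n$.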
\[ Q^T Q = Q Q^{-1} = I \]
\[ \implies \det(R) = r_1 \cdot r_2 \cdots r_n \]
- Hence, $R$ is invertible if and only if $\det(R) \neq 0$, which means that all rows of $R$ are linearly independent
($r_1, r_2, \dots, r_n \neq 0$).
- From the given problem, if $\det(A) \neq 0$, then $\det(R)$ and $\det(Q)$ must both be non-zero, so
$Q$ and $R$ are invertible and therefore $A$ is invertible. This gives the result: a square matrix
is invertible if and only if its determinant is non-zero.
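3. b) Let $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{n \times p}$, and let $C := AB \in \mathbb{R}^{m \times p}$. Show that $\|C\| \le \|A\| \cdot \|B\|$ where $\| \cdot \|$
denotes the $\ell_2$-induced norm of its matrix argument, defined for a matrix $M$ as:
\[ \|M\| := \max_{z \neq 0} \frac{\|M z\|_2}{\|z\|_2} . \]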
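- From the given definition:
\[
\|M\| = \max_{z \neq 0} \frac{\|M z\|_2}{\|z\|_2} \implies \|M z\|_2 \le \|M\| \cdot \|z\|_2 \quad \text{for every } z.
\]
- Applying this inequality twice, for any $z \neq 0$:
\[ \|C z\|_2 = \|A (B z)\|_2 \le \|A\| \cdot \|B z\|_2 \le \|A\| \cdot \|B\| \cdot \|z\|_2 . \]
Dividing by $\|z\|_2$ and taking the maximum over $z \neq 0$ gives $\|C\| \le \|A\| \cdot \|B\|$.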
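A quick numerical illustration of the bound, under the assumption of random test matrices with arbitrarily chosen sizes. `np.linalg.norm(M, 2)` returns the largest singular value of `M`, which is the $\ell_2$-induced norm.

```python
import numpy as np

rng = np.random.default_rng(2)      # arbitrary seed
A = rng.normal(size=(4, 3))         # hypothetical A in R^{4 x 3}
B = rng.normal(size=(3, 5))         # hypothetical B in R^{3 x 5}

norm_A = np.linalg.norm(A, 2)       # largest singular value of A
norm_B = np.linalg.norm(B, 2)
norm_AB = np.linalg.norm(A @ B, 2)

# The defining inequality ||M z||_2 <= ||M|| ||z||_2, tested on random vectors z.
zs = rng.normal(size=(5, 100))
ratios = np.linalg.norm(B @ zs, axis=0) / np.linalg.norm(zs, axis=0)

print(norm_AB <= norm_A * norm_B, ratios.max() <= norm_B)
```

A numerical check on one pair of matrices is of course no proof; it only illustrates the inequality derived above.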
where $\bar{z}$ is the conjugate of the complex vector $z$. The ordinary scalar product results when $x, y$ are
both real vectors. In this exercise, we explain why this choice makes sense from the point of view of
projections. Precisely, we show that the projection $z$ of a point $p \in \mathbb{C}^n$ on the line $L(u) := \{\alpha u : \alpha \in \mathbb{C}\}$,
where $u \in \mathbb{C}^n$ satisfies $u^H u = 1$ without loss of generality, is given by $z = (u u^H) p = (u^H p) u$.
1. As a preliminary result, show that for any real vector $z$, the minimum value of
\[ \|w\|_2^2 - 2 z^T w \qquad (1) \]
over real vectors $w$, is obtained for $w = z$. Hint: express the objective function of the above problem as
the difference of two squared terms, the second one independent of $w$.
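One way to carry out the hint is to complete the square. The short derivation below is a sketch of that step, added here for reference:

```latex
\[
\|w\|_2^2 - 2 z^T w
= \left( \|w\|_2^2 - 2 z^T w + \|z\|_2^2 \right) - \|z\|_2^2
= \|w - z\|_2^2 - \|z\|_2^2 .
\]
% The second term does not depend on w, and the first term is
% non-negative and vanishes exactly when w = z. The minimum value is
% therefore -\|z\|_2^2, attained at w = z.
```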
3. Show that the proposed formula is also correct in the complex case. That is, solve the problem
\[ \min_{\alpha \in \mathbb{C}} \; \|p - \alpha u\|_2 \]
and show that the optimal $\alpha$ is $\alpha^* = u^H p$. Hint: optimize over the real and imaginary parts of $\alpha$, and
transform the problem into one of the form (1) involving two-dimensional real vectors; then apply the
result of part 1.
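- Let
\[
D^2 = \|p - \alpha u\|_2^2 = (p - \alpha u)^H (p - \alpha u) = p^H p - p^H (\alpha u) - (\alpha u)^H p + (\alpha u)^H (\alpha u)
\]
\[
= p^H p - \alpha\, p^H u - \bar{\alpha}\, u^H p + \bar{\alpha} \alpha\, u^H u
= |\alpha|^2 - \left( \bar{\alpha}\, u^H p + \alpha\, p^H u \right) + p^H p ,
\]
using $u^H u = 1$.
- Since $p^H p$ does not depend on $\alpha$, it is a constant and can be omitted when considering
the optimization problem.
- Define: $\alpha = a + b i$, $u^H p = c + d i$, so that $p^H u = \overline{u^H p} = c - d i$. Then
\[
|\alpha|^2 - \left( \bar{\alpha}\, u^H p + \alpha\, p^H u \right) = a^2 + b^2 - \left[ (a - b i)(c + d i) + (a + b i)(c - d i) \right]
\]
\[
= a^2 + b^2 - (ac + ad i - bc i + bd + ac - ad i + bc i + bd) = a^2 + b^2 - 2 (ac + bd) .
\]
This is of the form (1) with $w = (a, b)$ and $z = (c, d)$, so by part 1 it is minimized when $(a^*, b^*) = (c, d)$:
\[ \Rightarrow \alpha^* = c + d i = u^H p . \]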
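The result $\alpha^* = u^H p$ can be checked numerically. The vectors below are random and hypothetical; `np.vdot` conjugates its first argument, so `np.vdot(u, p)` computes $u^H p$.

```python
import numpy as np

rng = np.random.default_rng(3)      # arbitrary seed
n = 4
u = rng.normal(size=n) + 1j * rng.normal(size=n)
u /= np.linalg.norm(u)              # enforce u^H u = 1
p = rng.normal(size=n) + 1j * rng.normal(size=n)

alpha_star = np.vdot(u, p)          # u^H p (first argument is conjugated)

def dist2(alpha):
    """Squared distance from p to the point alpha * u on the line L(u)."""
    return np.linalg.norm(p - alpha * u) ** 2

# Compare against randomly perturbed values of alpha.
perturbations = rng.normal(size=(200, 2)) @ np.array([1.0, 1j])
best = dist2(alpha_star)
no_improvement = all(dist2(alpha_star + d) >= best for d in perturbations)

# The residual p - alpha* u is orthogonal to u.
residual_inner = np.vdot(u, p - alpha_star * u)

print(no_improvement, abs(residual_inner))
```

The orthogonality of the residual to $u$ is the usual characterization of a projection, here with respect to the complex scalar product.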
5 Convolutions
The convolution of an $n$-vector $a$ and an $m$-vector $b$ is the $(n + m - 1)$-vector $c = a * b$, with entries
\[ c_k = \sum_{i + j = k + 1} a_i b_j , \qquad k = 1, \dots, n + m - 1. \]
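\[
\begin{aligned}
\Rightarrow\; c_1 &= a_1 b_1 \\
c_2 &= a_1 b_2 + a_2 b_1 \\
c_3 &= a_1 b_3 + a_2 b_2 + a_3 b_1 \\
&\;\;\vdots \\
c_{n+m-1} &= a_n b_m
\end{aligned}
\]
\[ \Rightarrow c_k = \sum_{i + j = k + 1} a_i b_j , \qquad k = 1, 2, \dots, n + m - 1. \]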
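The definition translates directly into a double loop. The sketch below uses 0-based indices, so the condition $i + j = k + 1$ becomes `i + j == k`; the function name `conv` and the example vectors are illustrative assumptions.

```python
import numpy as np

def conv(a, b):
    """Convolution c_k = sum_{i+j=k+1} a_i b_j, written with 0-based indices."""
    n, m = len(a), len(b)
    c = np.zeros(n + m - 1)
    for i in range(n):
        for j in range(m):
            c[i + j] += a[i] * b[j]     # a_i b_j contributes to entry i + j
    return c

a = np.array([1.0, 2.0, 3.0])           # hypothetical 3-vector
b = np.array([4.0, 5.0, 6.0, 7.0])      # hypothetical 4-vector
c = conv(a, b)

print(c)                                 # [ 4. 13. 28. 34. 32. 21.]
print(np.allclose(c, np.convolve(a, b)))
```

The first and last entries are $a_1 b_1 = 4$ and $a_3 b_4 = 21$, as in the list above, and the result agrees with NumPy's `np.convolve`.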
2. Given a time-series $x \in \mathbb{R}^n$, the (4-point) moving average of $x$ is a new time-series $y$ such that, for
every $i = 4, 5, \dots, n$, $y_i$ is the average of $x_i, x_{i-1}, x_{i-2}, x_{i-3}$. Express $y$ in terms of a convolution of $x$
with an appropriate vector.
Hint: Think about time-series with only a single 1 in it.
The 4-point moving average of a time series $x$ can be expressed as a convolution with a window vector.
Define the window vector $w := \frac{1}{4} (1, 1, 1, 1)$. With the definition of convolution given above, the convolution of $x$
with $w$ has entries
\[ (x * w)_i = \sum_{j + l = i + 1} x_j w_l = \frac{1}{4} \left( x_i + x_{i-1} + x_{i-2} + x_{i-3} \right), \qquad i = 4, 5, \dots, n. \]
- Following the hint, define $t$ as the sequence $t = (1, 0, \dots)$, with 1 located in the first place and the remaining entries
zero. $\Rightarrow x * t = x$ (up to trailing zeros). Let $t'$, $t''$, and $t'''$ denote $t$ shifted by one, two, and three places; convolving $x$ with each of them shifts $x$ by the same number of places.
- The combined vector that encapsulates these shifts is $e = \frac{1}{4} (t + t' + t'' + t''') = \frac{1}{4} (1, 1, 1, 1, 0, \dots)$, which coincides with $w$.
- Therefore, $y_i = (x * e)_i = \frac{x_i + x_{i-1} + x_{i-2} + x_{i-3}}{4}$ (for $i = 4, 5, \dots, n$): $y$, the 4-point moving average of $x$, is given by entries $4, \dots, n$ of $x * e$.
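A minimal sketch of the moving average as a convolution, assuming NumPy and a made-up time series `x`. Entry $i$ in the 1-based notation of the text corresponds to index `i - 1` in the code.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0])   # hypothetical time series, n = 6
w = np.full(4, 0.25)                             # window (1/4)(1, 1, 1, 1)

full = np.convolve(x, w)          # full convolution, length n + 3
y = full[3:len(x)]                # entries i = 4, ..., n (1-based) of x * w

# Direct computation of the same averages, for comparison.
direct = np.array([x[i - 3:i + 1].mean() for i in range(3, len(x))])

print(y)                          # [5. 7. 9.]
print(np.allclose(y, direct))
```

The same slice is returned by `np.convolve(x, w, mode="valid")`, which keeps only the entries where the window fully overlaps the series.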
3. Show that
a ∗ b = T (a)b = T (b)a,
where T (a), T (b) are two appropriate matrices. Specify those matrices for the case n = 3, m = 4.
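- Let $c = a * b$. For $n = 3$, $m = 4$, the definition of the convolution gives:
\[ c_k = \sum_{i + j = k + 1} a_i b_j , \qquad k = 1, \dots, 6 \]
\[
\begin{aligned}
\Rightarrow\; c_1 &= a_1 b_1 \\
&\;\;\vdots \\
c_6 &= a_3 b_4
\end{aligned}
\]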
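Therefore:
\[
c = \begin{bmatrix} c_1 \\ c_2 \\ c_3 \\ c_4 \\ c_5 \\ c_6 \end{bmatrix}
= \begin{bmatrix}
a_1 & 0 & 0 & 0 \\
a_2 & a_1 & 0 & 0 \\
a_3 & a_2 & a_1 & 0 \\
0 & a_3 & a_2 & a_1 \\
0 & 0 & a_3 & a_2 \\
0 & 0 & 0 & a_3
\end{bmatrix}
\begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \end{bmatrix}
= \begin{bmatrix}
b_1 & 0 & 0 \\
b_2 & b_1 & 0 \\
b_3 & b_2 & b_1 \\
b_4 & b_3 & b_2 \\
0 & b_4 & b_3 \\
0 & 0 & b_4
\end{bmatrix}
\begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix}
\]
\[
T(a) = \begin{bmatrix}
a_1 & 0 & 0 & 0 \\
a_2 & a_1 & 0 & 0 \\
a_3 & a_2 & a_1 & 0 \\
0 & a_3 & a_2 & a_1 \\
0 & 0 & a_3 & a_2 \\
0 & 0 & 0 & a_3
\end{bmatrix},
\qquad
T(b) = \begin{bmatrix}
b_1 & 0 & 0 \\
b_2 & b_1 & 0 \\
b_3 & b_2 & b_1 \\
b_4 & b_3 & b_2 \\
0 & b_4 & b_3 \\
0 & 0 & b_4
\end{bmatrix}
\]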
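The matrices $T(a)$ and $T(b)$ can be built programmatically by placing shifted copies of the vector in successive columns. The helper name `conv_matrix` and the numeric vectors below are assumptions used only for illustration.

```python
import numpy as np

def conv_matrix(a, m):
    """Return the (n + m - 1) x m matrix T(a) such that T(a) @ b = a * b."""
    n = len(a)
    T = np.zeros((n + m - 1, m))
    for j in range(m):
        T[j:j + n, j] = a           # column j holds a shifted down by j rows
    return T

a = np.array([1.0, 2.0, 3.0])       # hypothetical a, n = 3
b = np.array([4.0, 5.0, 6.0, 7.0])  # hypothetical b, m = 4

T_a = conv_matrix(a, len(b))        # 6 x 4
T_b = conv_matrix(b, len(a))        # 6 x 3

print(T_a)
print(np.allclose(T_a @ b, T_b @ a))
```

Both products reproduce `np.convolve(a, b)`, which is the identity $a * b = T(a) b = T(b) a$ for this example.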
4. A T -vector r gives the average daily rainfall in some region over a period of T days. The vector h
gives the daily height of a river in the region. Using model fitting, it is found that the two vectors are
related by h = g ∗ r, where
g := (0.1, 0.4, 0.5, 0.2).
(a) If one day there is a heavy rainfall, assuming uniform rainfall for all other days, how many days
after that day is the river at maximum height?
(b) How many days does it take for the river to return to 0 after rain stops?
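- We have: $h = g * r = r * g$.
- Each entry of $h$ is therefore a weighted sum of the current and previous entries of $r$, with weights given by $g$:
\[ h_t = 0.1\, r_t + 0.4\, r_{t-1} + 0.5\, r_{t-2} + 0.2\, r_{t-3} , \]
where terms with an index below 1 are dropped.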
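The response to a single rainy day can be simulated directly. The sketch below assumes zero rainfall on all other days (an assumption made here for illustration), a hypothetical horizon `T = 10`, and an arbitrary rain day index.

```python
import numpy as np

g = np.array([0.1, 0.4, 0.5, 0.2])

T = 10                    # hypothetical number of days
rain_day = 3              # hypothetical 0-based index of the single rainy day
r = np.zeros(T)
r[rain_day] = 1.0         # one unit of rain on that day, none otherwise (assumption)

h = np.convolve(g, r)     # river height, h = g * r

peak_delay = int(h.argmax()) - rain_day            # days between the rain and the peak
last_nonzero = int(np.flatnonzero(h)[-1])          # last day with non-zero height
days_until_zero = last_nonzero + 1 - rain_day      # first day back at 0, counted from the rain day

print(peak_delay, days_until_zero)                 # 2 4
```

Under these assumptions the height peaks 2 days after the rain, because the largest weight in $g$ is its third entry, and the river is back at 0 four days after the last rainy day, because $g$ has four entries.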