where Φ⊤ = Φ⁻¹ because Φ is an orthogonal matrix. Moreover, note that we always have Φ⊤ Φ = I for an orthogonal Φ, but we only have Φ Φ⊤ = I if "all" the columns of the orthogonal Φ exist (it is not truncated, i.e., it is a square matrix). Eq. (3) is referred to as "eigenvalue decomposition", "eigen-decomposition", or "spectral decomposition".

2.2. Generalized Eigenvalue Problem

The generalized eigenvalue problem (Parlett, 1998; Golub & Van Loan, 2012) of two symmetric matrices A ∈ R^{d×d} and B ∈ R^{d×d} is defined as:

A φi = λi B φi,  ∀i ∈ {1, . . . , d},  (4)

and in matrix form, it is:

A Φ = B Φ Λ,  (5)

where the columns of R^{d×d} ∋ Φ := [φ1, . . . , φd] are the eigenvectors and the diagonal elements of R^{d×d} ∋ Λ := diag([λ1, . . . , λd]⊤) are the eigenvalues. Note that φi ∈ R^d and λi ∈ R.

The generalized eigenvalue problem of Eq. (4) or (5) is denoted by (A, B). The (A, B) is called a "pair" or "pencil" (Parlett, 1998), and the order in the pair matters. The Φ and Λ are called the generalized eigenvectors and eigenvalues of (A, B). The (Φ, Λ) or (φi, λi) is called the "eigenpair" of the pair (A, B) in the literature (Parlett, 1998).

Comparing Eqs. (1) and (4), or Eqs. (2) and (5), shows that the eigenvalue problem is a special case of the generalized eigenvalue problem where B = I.

3. Eigenvalue Optimization

In this section, we introduce the optimization problems which yield the eigenvalue problem.

3.1. Optimization Form 1

Consider the following optimization problem with the variable φ ∈ R^d:

maximize_φ   φ⊤ A φ,
subject to   φ⊤ φ = 1,  (6)

where A ∈ R^{d×d}. The Lagrangian (Boyd & Vandenberghe, 2004) for Eq. (6) is:

L = φ⊤ A φ − λ (φ⊤ φ − 1),

where λ ∈ R is the Lagrange multiplier. Setting the derivative of the Lagrangian to zero gives:

R^d ∋ ∂L/∂φ = 2 A φ − 2 λ φ = 0  ⟹  A φ = λ φ,

which is an eigenvalue problem for A according to Eq. (1). The φ is the eigenvector of A and the λ is the eigenvalue. As Eq. (6) is a maximization problem, the desired eigenvector is the one with the largest eigenvalue. If Eq. (6) is instead a minimization problem, the desired eigenvector is the one with the smallest eigenvalue.

3.2. Optimization Form 2

Consider the following optimization problem with the variable Φ ∈ R^{d×d}:

maximize_Φ   tr(Φ⊤ A Φ),
subject to   Φ⊤ Φ = I,  (7)

where A ∈ R^{d×d}, tr(·) denotes the trace of a matrix, and I is the identity matrix. Note that, according to the properties of the trace, the objective function can equivalently be written as any of tr(Φ⊤ A Φ) = tr(Φ Φ⊤ A) = tr(A Φ Φ⊤).

The Lagrangian (Boyd & Vandenberghe, 2004) for Eq. (7) is:

L = tr(Φ⊤ A Φ) − tr(Λ⊤ (Φ⊤ Φ − I)),

where Λ ∈ R^{d×d} is a diagonal matrix whose entries are the Lagrange multipliers. Setting the derivative of L to zero gives:

R^{d×d} ∋ ∂L/∂Φ = 2 A Φ − 2 Φ Λ = 0  ⟹  A Φ = Φ Λ,

which is an eigenvalue problem for A according to Eq. (2). The columns of Φ are the eigenvectors of A and the diagonal elements of Λ are the eigenvalues.

As Eq. (7) is a maximization problem, the eigenvalues and eigenvectors in Λ and Φ are sorted from the largest to the smallest eigenvalues. If Eq. (7) is instead a minimization problem, the eigenvalues and eigenvectors in Λ and Φ are sorted from the smallest to the largest eigenvalues.
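Before moving on to Form 3, here is a minimal numerical sketch of Forms 1 and 2, assuming NumPy is available; the random symmetric matrix A below is only an illustrative stand-in, not an object defined in the text.

import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M + M.T                                # a symmetric matrix A

# Eigen-decomposition A = Phi Lambda Phi^T; eigh returns ascending eigenvalues.
eigvals, Phi = np.linalg.eigh(A)

# Form 1: the maximizer of phi^T A phi subject to phi^T phi = 1 is the
# eigenvector with the largest eigenvalue; the optimal value is that eigenvalue.
phi = Phi[:, -1]
print(phi @ A @ phi, eigvals[-1])          # the two printed values coincide

# Form 2: the stationarity condition A Phi = Phi Lambda of Eq. (2) holds,
# and tr(Phi^T A Phi) equals the sum of the eigenvalues.
Lam = np.diag(eigvals)
print(np.allclose(A @ Phi, Phi @ Lam))     # True
print(np.isclose(np.trace(Phi.T @ A @ Phi), eigvals.sum()))  # True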
3.3. Optimization Form 3

Consider the following optimization problem with the variable φ ∈ R^d:

minimize_φ   ||X − φ φ⊤ X||²_F,
subject to   φ⊤ φ = 1,  (8)

where X ∈ R^{d×n}. The objective function in Eq. (8) is simplified as:

||X − φ φ⊤ X||²_F
= tr((X − φ φ⊤ X)⊤ (X − φ φ⊤ X))
= tr((X⊤ − X⊤ φ φ⊤)(X − φ φ⊤ X))
= tr(X⊤ X − X⊤ φ φ⊤ X + X⊤ φ (φ⊤ φ) φ⊤ X)
= tr(X⊤ X − X⊤ φ φ⊤ X),

where the last step uses the constraint φ⊤ φ = 1.
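By the cyclic property of the trace and the constraint φ⊤ φ = 1, the last expression equals tr(X⊤ X) − φ⊤ X X⊤ φ, so minimizing the reconstruction error amounts to maximizing φ⊤ X X⊤ φ; by Form 1, the minimizer is the leading eigenvector of X X⊤. A small numerical check of this, assuming NumPy; the random X is only illustrative:

import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 50))           # d = 4 features, n = 50 samples

eigvals, V = np.linalg.eigh(X @ X.T)       # eigenvalues in ascending order
phi = V[:, -1]                             # leading eigenvector of X X^T

# The identity derived above holds for any unit-norm phi.
lhs = np.linalg.norm(X - np.outer(phi, phi) @ X, "fro") ** 2
rhs = np.trace(X.T @ X) - phi @ (X @ X.T) @ phi
print(np.isclose(lhs, rhs))                # True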
3.5. Optimization Form 5

Consider the following optimization problem with the variable φ ∈ R^d:

maximize_φ   (φ⊤ A φ) / (φ⊤ φ).  (10)

According to the Rayleigh–Ritz quotient method (Croot, 2005), this optimization problem can be restated as:

maximize_φ   φ⊤ A φ,
subject to   φ⊤ φ = 1,  (11)

whose solution, by Section 3.1, is the eigenvector of A having the largest eigenvalue.

4. Generalized Eigenvalue Optimization

In this section, we introduce the optimization problems which yield the generalized eigenvalue problem.

4.1. Optimization Form 1

Consider the following optimization problem with the variable φ ∈ R^d:

maximize_φ   φ⊤ A φ,
subject to   φ⊤ B φ = 1,  (12)

where A, B ∈ R^{d×d}. Following the same steps as in Section 3.1, the Lagrangian is L = φ⊤ A φ − λ (φ⊤ B φ − 1), and setting its derivative to zero gives A φ = λ B φ, which is the generalized eigenvalue problem (A, B) of Eq. (4).

As Eq. (12) is a maximization problem, the eigenvector is the one having the largest eigenvalue. If Eq. (12) is instead a minimization problem, the eigenvector is the one having the smallest eigenvalue.

Comparing Eqs. (6) and (12) shows that the eigenvalue problem is a special case of the generalized eigenvalue problem where B = I.

4.2. Optimization Form 2

Consider the following optimization problem with the variable Φ ∈ R^{d×d}:

maximize_Φ   tr(Φ⊤ A Φ),
subject to   Φ⊤ B Φ = I,  (13)

where A ∈ R^{d×d} and B ∈ R^{d×d}. Note that, according to the properties of the trace, the objective function can equivalently be written as any of tr(Φ⊤ A Φ) = tr(Φ Φ⊤ A) = tr(A Φ Φ⊤).

The Lagrangian (Boyd & Vandenberghe, 2004) for Eq. (13) is:

L = tr(Φ⊤ A Φ) − tr(Λ⊤ (Φ⊤ B Φ − I)),

where Λ ∈ R^{d×d} is a diagonal matrix whose entries are the Lagrange multipliers. Setting the derivative of L to zero gives:

R^{d×d} ∋ ∂L/∂Φ = 2 A Φ − 2 B Φ Λ = 0  ⟹  A Φ = B Φ Λ,

which is the generalized eigenvalue problem (A, B) according to Eq. (5). The columns of Φ are the eigenvectors and the diagonal elements of Λ are the eigenvalues.

As Eq. (13) is a maximization problem, the eigenvalues and eigenvectors in Λ and Φ are sorted from the largest to the smallest eigenvalues. If Eq. (13) is instead a minimization problem, the eigenvalues and eigenvectors in Λ and Φ are sorted from the smallest to the largest eigenvalues.

4.3. Optimization Form 3

Consider the following optimization problem with the variable φ ∈ R^d:

minimize_φ   ||X − φ φ⊤ X||²_F,
subject to   φ⊤ B φ = 1,  (14)

where X ∈ R^{d×n}.

Similar to what we had for Eq. (8), the objective function in Eq. (14) is simplified as:

||X − φ φ⊤ X||²_F = tr(X⊤ X − X X⊤ φ φ⊤).

The Lagrangian (Boyd & Vandenberghe, 2004) is:

L = tr(X⊤ X) − tr(X X⊤ φ φ⊤) − λ (φ⊤ B φ − 1),

where λ is the Lagrange multiplier. Setting the derivative of L to zero gives:

R^d ∋ ∂L/∂φ = 2 X X⊤ φ − 2 λ B φ = 0  ⟹  X X⊤ φ = λ B φ  ⟹  A φ = λ B φ,

with A := X X⊤, which is a generalized eigenvalue problem (A, B) according to Eq. (4). The φ is the eigenvector and the λ is the eigenvalue.

4.4. Optimization Form 4

Consider the following optimization problem with the variable Φ ∈ R^{d×d}:

minimize_Φ   ||X − Φ Φ⊤ X||²_F,
subject to   Φ⊤ B Φ = I,  (15)

where X ∈ R^{d×n}.

Similar to what we had for Eq. (9), the objective function in Eq. (15) is simplified as:

||X − Φ Φ⊤ X||²_F = tr(X⊤ X − X X⊤ Φ Φ⊤).

The Lagrangian (Boyd & Vandenberghe, 2004) is:

L = tr(X⊤ X) − tr(X X⊤ Φ Φ⊤) − tr(Λ⊤ (Φ⊤ B Φ − I)),

where Λ ∈ R^{d×d} is a diagonal matrix containing the Lagrange multipliers. Setting the derivative of L to zero gives:

R^{d×d} ∋ ∂L/∂Φ = 2 X X⊤ Φ − 2 B Φ Λ = 0  ⟹  X X⊤ Φ = B Φ Λ  ⟹  A Φ = B Φ Λ,

with A := X X⊤, which is the generalized eigenvalue problem (A, B) according to Eq. (5). The columns of Φ are the eigenvectors and the diagonal elements of Λ are the eigenvalues.
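Before turning to Form 5, here is a minimal numerical sketch of the generalized forms above, assuming NumPy and SciPy; the random symmetric A and the positive definite B are illustrative stand-ins. SciPy's generalized symmetric solver returns eigenvectors that satisfy exactly the constraint Φ⊤ B Φ = I of Eqs. (13) and (15):

import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(2)
M = rng.standard_normal((5, 5))
A = M + M.T                                     # symmetric A
N = rng.standard_normal((5, 5))
B = N @ N.T + 5 * np.eye(5)                     # symmetric positive definite B

eigvals, Phi = eigh(A, B)                       # solves A phi = lambda B phi

Lam = np.diag(eigvals)
print(np.allclose(A @ Phi, B @ Phi @ Lam))      # Eq. (5): A Phi = B Phi Lambda
print(np.allclose(Phi.T @ B @ Phi, np.eye(5)))  # B-orthonormality: Phi^T B Phi = I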
4.5. Optimization Form 5

Consider the following optimization problem (Parlett, 1998) with the variable φ ∈ R^d:

maximize_φ   (φ⊤ A φ) / (φ⊤ B φ).  (16)

According to the Rayleigh–Ritz quotient method (Croot, 2005), this optimization problem can be restated as:

maximize_φ   φ⊤ A φ,
subject to   φ⊤ B φ = 1.  (17)

The Lagrangian (Boyd & Vandenberghe, 2004) is:

L = φ⊤ A φ − λ (φ⊤ B φ − 1),

where λ is the Lagrange multiplier. Setting the derivative of L to zero gives:

∂L/∂φ = 2 A φ − 2 λ B φ = 0  ⟹  2 A φ = 2 λ B φ  ⟹  A φ = λ B φ,

which is a generalized eigenvalue problem (A, B) according to Eq. (4). The φ is the eigenvector and the λ is the eigenvalue.

As Eq. (16) is a maximization problem, the eigenvector is the one having the largest eigenvalue. If Eq. (16) is instead a minimization problem, the eigenvector is the one having the smallest eigenvalue.

5. Examples for the Optimization Problems

In this section, we introduce some examples in machine learning that use the introduced optimization problems.

5.1. Examples for Eigenvalue Problem

5.1.1. Variance in Principal Component Analysis

In Principal Component Analysis (PCA) (Pearson, 1901; Friedman et al., 2009), if we want to project onto one vector (a one-dimensional PCA subspace), the problem is:

maximize_u   u⊤ S u,
subject to   u⊤ u = 1,  (18)

where u is the projection direction and S is the covariance matrix. Therefore, u is the eigenvector of S with the largest eigenvalue.

If we want to project onto a PCA subspace spanned by several directions, we have:

maximize_U   tr(U⊤ S U),
subject to   U⊤ U = I,  (19)

where the columns of U span the PCA subspace.

5.1.2. Reconstruction in Principal Component Analysis

We can look at PCA from another perspective: PCA is the best linear projection with the smallest reconstruction error. If we have one PCA direction, the projection is u⊤ X and the reconstruction is u u⊤ X. We want the error between the reconstructed data and the original data to be minimized:

minimize_u   ||X − u u⊤ X||²_F,
subject to   u⊤ u = 1.  (20)

Therefore, u is the eigenvector of the covariance matrix S = X X⊤ (where X is already centered by removing its mean).

If we consider several PCA directions, i.e., the columns of U, the minimization of the reconstruction error is:

minimize_U   ||X − U U⊤ X||²_F,
subject to   U⊤ U = I.  (21)

Thus, the columns of U are the eigenvectors of the covariance matrix S = X X⊤ (where X is already centered by removing its mean).

5.2. Examples for Generalized Eigenvalue Problem

5.2.1. Kernel Supervised Principal Component Analysis

Kernel Supervised PCA (SPCA) (Barshan et al., 2011) uses the following optimization problem:

maximize_Θ   tr(Θ⊤ Kx H Ky H Kx Θ),
subject to   Θ⊤ Kx Θ = I,  (22)

where Kx and Ky are the kernel matrices over the training data and the labels of the training data, respectively, H := I − (1/n) 1 1⊤ is the centering matrix, and the columns of Θ span the kernel SPCA subspace.

According to Eq. (13), the solution to Eq. (22) is:

Kx H Ky H Kx Θ = Kx Θ Λ,  (23)

which is the generalized eigenvalue problem (Kx H Ky H Kx, Kx) according to Eq. (5), where Θ and Λ are the eigenvector and eigenvalue matrices, respectively.
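As a concrete illustration of Eq. (23) on synthetic data, assuming NumPy and SciPy: the RBF kernel for Kx, the label-agreement kernel for Ky, the small ridge added to Kx (to keep it numerically positive definite, in the spirit of the trick revisited in Section 7.1), and the choice of two directions are all illustrative assumptions, not prescriptions from the text.

import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

rng = np.random.default_rng(3)
n, d = 60, 5
X = rng.standard_normal((n, d))                # rows are training points
y = rng.integers(0, 3, size=n)                 # class labels

Kx = np.exp(-cdist(X, X, "sqeuclidean"))       # kernel matrix over the training data
Ky = (y[:, None] == y[None, :]).astype(float)  # kernel matrix over the labels
H = np.eye(n) - np.ones((n, n)) / n            # centering matrix H = I - (1/n) 1 1^T

A = Kx @ H @ Ky @ H @ Kx                       # left-hand matrix of the pair in Eq. (23)
B = Kx + 1e-8 * np.eye(n)                      # right-hand matrix, with a small ridge

eigvals, Theta = eigh(A, B)                    # generalized eigenvectors, ascending eigenvalues
Theta = Theta[:, ::-1][:, :2]                  # keep the top-2 directions of the kernel SPCA subspace
print(Theta.shape)                             # (n, 2)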
5.2.2. Fisher Discriminant Analysis

Another example is Fisher Discriminant Analysis (FDA) (Fisher, 1936; Friedman et al., 2009), in which the Fisher criterion (Xu & Lu, 2006) is maximized:

maximize_w   (w⊤ SB w) / (w⊤ SW w),  (24)

where w is the projection direction and SB and SW are the between- and within-class scatters:

SB = Σ_{j=1}^{c} (µj − µt)(µj − µt)⊤,  (25)

SW = Σ_{j=1}^{c} Σ_{i=1}^{nj} (xj,i − µj)(xj,i − µj)⊤,  (26)

where c is the number of classes, nj is the sample size of the j-th class, xj,i is the i-th data point in the j-th class, µj is the mean of the j-th class, and µt is the total mean.

According to the Rayleigh–Ritz quotient method (Croot, 2005), the optimization problem in Eq. (24) can be restated as:

maximize_w   w⊤ SB w,
subject to   w⊤ SW w = 1.  (27)

The Lagrangian (Boyd & Vandenberghe, 2004) is:

L = w⊤ SB w − λ (w⊤ SW w − 1),

where λ is the Lagrange multiplier. Setting the derivative of L to zero gives:

∂L/∂w = 2 SB w − 2 λ SW w = 0  ⟹  2 SB w = 2 λ SW w  ⟹  SB w = λ SW w,

which is a generalized eigenvalue problem (SB, SW) according to Eq. (4). The w is the eigenvector with the largest eigenvalue and λ is the corresponding eigenvalue.

6. Solution to Eigenvalue Problem

In this section, we introduce the solution to the eigenvalue problem. Consider Eq. (1):

A φi = λi φi  ⟹  (A − λi I) φi = 0,  (28)

which is a homogeneous linear system of equations. According to Cramer's rule, such a system has non-trivial solutions if and only if the determinant of its coefficient matrix vanishes. Therefore:

det(A − λi I) = 0,  (29)

where det(·) denotes the determinant of a matrix. Eq. (29) gives a degree-d polynomial equation (the characteristic equation), which has d roots. Note that if A is not full rank (i.e., it is a singular matrix), some of the roots will be zero. Moreover, if A is positive semi-definite, i.e., A ⪰ 0, all the roots are non-negative.

The roots of Eq. (29) are the eigenvalues of A. After finding the roots, we put each of them back into Eq. (28) and solve for its corresponding eigenvector, φi ∈ R^d. Note that the vector obtained from Eq. (28) can be normalized, because it is the direction of the eigenvector that matters, not its magnitude; the information about magnitude lives in the corresponding eigenvalue.

7. Solution to Generalized Eigenvalue Problem

In this section, we introduce the solution to the generalized eigenvalue problem. Recall Eq. (16):

maximize_φ   (φ⊤ A φ) / (φ⊤ B φ).

Let ρ denote this fraction, known as the Rayleigh quotient (Croot, 2005):

ρ(u; A, B) := (u⊤ A u) / (u⊤ B u),  ∀u ≠ 0.  (30)

The ρ is stationary at φ ≠ 0 if and only if:

(A − λ B) φ = 0,  (31)

for some scalar λ (Parlett, 1998). Eq. (31) is a linear system of equations, which can also be obtained from Eq. (4):

A φi = λi B φi  ⟹  (A − λi B) φi = 0.  (32)

As mentioned earlier, the eigenvalue problem is a special case of the generalized eigenvalue problem (where B = I), which is obvious by comparing Eqs. (28) and (32).

According to Cramer's rule, a homogeneous linear system of equations has non-trivial solutions if and only if the determinant vanishes. Therefore:

det(A − λi B) = 0.  (33)

Similar to the explanations for Eq. (29), we can solve for the roots of Eq. (33). However, note that Eq. (33) is obtained from Eq. (4) or (16), where only one eigenvector φ is considered.

For solving Eq. (5) in the general case, there exist two approaches to the generalized eigenvalue problem: one is a quick-and-dirty solution and the other is a rigorous method. Both are explained in the following.

7.1. The Quick & Dirty Solution

Consider Eq. (5) again:

A Φ = B Φ Λ.

If B is not singular (i.e., it is invertible), we can left-multiply both sides by B⁻¹:

B⁻¹ A Φ = Φ Λ  ⟹  C Φ = Φ Λ,  (34)

where we define C := B⁻¹ A. Eq. (34) is the eigenvalue problem for C according to Eq. (2) and can be solved using the approach of Eq. (29).

Note that even if B is singular, we can use a numeric hack (which is a little dirty) and slightly strengthen its main diagonal in order to make it full rank:

(B + ε I)⁻¹ A Φ = Φ Λ  ⟹  C Φ = Φ Λ,  (35)

where ε is a very small positive number, e.g., ε = 10⁻⁵, just large enough to make B full rank.
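To make the quick-and-dirty route concrete, here is a minimal numerical sketch, assuming NumPy and SciPy and using illustrative random matrices: it forms C = B⁻¹ A as in Eq. (34) and checks its eigenvalues against SciPy's generalized symmetric solver, which operates directly on the pair (A, B).

import numpy as np
from scipy.linalg import eig, eigh

rng = np.random.default_rng(4)
M = rng.standard_normal((5, 5))
A = M + M.T                                    # symmetric A
N = rng.standard_normal((5, 5))
B = N @ N.T + 5 * np.eye(5)                    # invertible (positive definite) B

# Quick & dirty: C = B^{-1} A is generally not symmetric, so a general solver is used.
C = np.linalg.solve(B, A)
quick_vals = np.sort(eig(C)[0].real)           # eigenvalues of C, as in Eq. (34)

# Solver that works directly on the pair (A, B).
direct_vals = eigh(A, B, eigvals_only=True)    # ascending generalized eigenvalues

print(np.allclose(quick_vals, direct_vals))    # True: same spectrum of the pencil (A, B)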
7.2. The Rigorous Solution

Consider Eq. (5) again:

A Φ = B Φ Λ.

8. Conclusion

This paper was a tutorial introducing the eigenvalue and generalized eigenvalue problems. The problems were introduced, their optimization formulations were described, and some examples from machine learning were provided for them. Moreover, the solutions to the eigenvalue and generalized eigenvalue problems were presented.
References
Barshan, Elnaz, Ghodsi, Ali, Azimifar, Zohreh, and
Jahromi, Mansoor Zolghadri. Supervised principal com-
ponent analysis: Visualization, classification and regres-
sion on subspaces and submanifolds. Pattern Recogni-
tion, 44(7):1357–1371, 2011.
Boyd, Stephen and Vandenberghe, Lieven. Convex opti-
mization. Cambridge university press, 2004.
Croot, Ernie. The Rayleigh principle for finding
eigenvalues. Technical report, Georgia Institute of
Technology, School of Mathematics, 2005. Online:
http://people.math.gatech.edu/~ecroot/notes_linear.pdf,
Accessed: March 2019.
Fisher, Ronald A. The use of multiple measurements in
taxonomic problems. Annals of eugenics, 7(2):179–188,
1936.
Friedman, Jerome, Hastie, Trevor, and Tibshirani, Robert.
The elements of statistical learning, volume 2. Springer
Series in Statistics. Springer, New York, NY, USA, 2009.
Golub, Gene H. and Van Loan, Charles F. Matrix compu-
tations, volume 3. The Johns Hopkins University Press,
2012.
Jolliffe, Ian. Principal component analysis. Springer, 2011.
Parlett, Beresford N. The symmetric eigenvalue problem.
Classics in Applied Mathematics, 20, 1998.
Pearson, Karl. LIII. On lines and planes of closest fit to
systems of points in space. The London, Edinburgh, and
Dublin Philosophical Magazine and Journal of Science,
2(11):559–572, 1901.
Wang, Ruye. Generalized eigenvalue problem.
http://fourier.eng.hmc.edu/e161/lectures/algebra/node7.html,
2015. Accessed: January 2019.
Wilkinson, James Hardy. The algebraic eigenvalue prob-
lem, volume 662. Oxford Clarendon, 1965.
Xu, Yong and Lu, Guangming. Analysis on Fisher discrim-
inant criterion and linear separability of feature space. In
2006 International Conference on Computational Intel-
ligence and Security, volume 2, pp. 1671–1676. IEEE,
2006.