
Math 361S Lecture notes

Finding eigenvalues: The power method


Jeffrey Wong
April 12, 2019

Topics covered
• Finding eigenvalues

◦ Power method, Rayleigh quotient


◦ Benefit of symmetric matrices
◦ Inverse power method

• General tricks

◦ Deflation (and why it is dangerous)


◦ Deflation for the power method (second largest λ)
◦ Aitken extrapolation

1 Computing the dominant eigenvalues


Throughout, let A be an n × n, non-singular, real-valued matrix with a basis of eigenvectors.
Denote the eigenvalues by λj and eigenvectors by vj .

We assume here there is a single eigenvalue of largest magnitude (the ‘dominant’
eigenvalue). Label them as follows:

|λ1| > |λ2| ≥ · · · ≥ |λn| > 0.

Note that if A has real-valued entries, it must be that λ1 is real (why?).

The simplest approach to computing λ1 and v1 is the power method. The idea is as
follows. Let x be any vector. Then, since {vj } is a basis,

x = c1 v1 + · · · + cn vn .

Now suppose c1 ≠ 0 (i.e. x has a non-zero v1 component). Then
Ax = c1 λ1 v1 + · · · + cn λn vn
and, applying A repeatedly,
A^k x = Σ_{j=1}^n cj λj^k vj .

Since the λ1^k term is largest in magnitude, in the sequence

x, Ax, A^2 x, A^3 x, · · ·

we expect the λ1^k v1 term will dominate, so

A^k x ≈ c1 λ1^k v1 + smaller terms.
Each iteration grows the largest term relative to the others, so after enough iterations only
the first term (what we want) will be left.

1.1 Power method: the basic method


Let's formalize the observation and derive a practical method. The main trouble is that λ1^k
will either grow exponentially (bad) or decay to zero (less bad, but still bad). By taking the
right ratio, the issue can be avoided.

Claim: Let x and w be vectors such that w^T v1 ≠ 0 and x has a non-zero v1 component.
Then

(w^T A^k x)/(w^T A^{k−1} x) = λ1 + O((λ2/λ1)^k)   as k → ∞.          (1)
Proof. Since the eigenvectors form a basis, there are scalars c1 , · · · , cn such that
x = Σ_{j=1}^n cj vj .

By assumption, c1 ≠ 0. Since A^k vj = λj^k vj it follows that


A^k x = Σ_{j=1}^n cj λj^k vj = λ1^k ( c1 v1 + Σ_{j=2}^n cj (λj/λ1)^k vj ) .

All the terms in the parentheses except the first go to zero in magnitude as k → ∞. Taking
the dot product with w on the left and computing the ratio (1),
(w^T A^k x)/(w^T A^{k−1} x) = λ1 · [ d1 + Σ_{j=2}^n dj (λj/λ1)^k ] / [ d1 + Σ_{j=2}^n dj (λj/λ1)^{k−1} ]

                            = λ1 · [ 1 + O((λ2/λ1)^k) ] / [ 1 + O((λ2/λ1)^{k−1}) ]

                            = λ1 + O((λ2/λ1)^k) ,
where dj = cj w^T vj (note that d1 ≠ 0 by assumption). Note that we have used that

1/(1 + f ) = 1 + O(f )   as f → 0.

Thus the power method computes the dominant eigenvalue (largest in magnitude),
and the convergence is linear. The rate depends on the size of λ1 relative to the next largest
eigenvalue λ2 .

Power method (naive version):

1) Choose vectors x and w ‘at random’.


2) For k = 1, 2, · · · compute

zk = A zk−1 ,   λ^(k) = (w^T zk)/(w^T zk−1) ,   with z0 = x.

3) Stop when λ^(k) has (approximately) converged.

The result is that λ^(k) converges linearly to the dominant eigenvalue λ1 . The issue of when
to stop is addressed in subsection 1.4.
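As a concrete illustration, here is a minimal Matlab sketch of the naive version. The matrix, the starting vectors, and the iteration count are placeholders (any matrix with a single dominant eigenvalue will do); a real implementation would add a stopping test.

  A = diag([4 2 1 0.5]) + 0.05*randn(4);   % placeholder with a clear dominant eigenvalue (about 4)
  x = randn(4,1);  w = randn(4,1);         % vectors chosen 'at random'
  z_old = x;  z = A*z_old;
  for k = 1:40
      lambda = (w'*z)/(w'*z_old);          % lambda^(k) = (w' z_k)/(w' z_{k-1})
      z_old = z;
      z = A*z_old;                         % note: ||z_k|| grows like |lambda_1|^k (the flaw fixed in 1.2)
  end
  lambda                                   % approximately the dominant eigenvalue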

1.2 A better version


The ‘naive version’ provides the eigenvalue, and the vector zk = A^k x becomes more and more
parallel to the desired eigenvector v1 . However, the magnitude can also grow exponentially
(or decay), which is unacceptable.

In practice, there are two differences:

i) Normalize x at each step to have ‖x‖2 = 1, which keeps elements from growing
exponentially in size. This also gives an eigenvector of unit 2-norm.

ii) Use a ‘left’ vector that depends on k instead of the fixed w in (1). A good choice is
the normalized A^k x (which we have anyway).

Power method (improved version):

• Pick q^(0) such that ‖q^(0)‖2 = 1

• For k = 1, 2, · · · :

    ◦ x^(k) = A q^(k−1)
    ◦ q^(k) = x^(k) / ‖x^(k)‖2
    ◦ λ^(k) = (q^(k))^T A q^(k) .

In practice, very little needs to be stored, and the algorithm is quite simple:

• Pick q such that ‖q‖2 = 1 and set x = Aq

• For k = 1, 2, · · · :

    ◦ q = x/‖x‖2
    ◦ x = Aq
    ◦ λ = q^T x
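For concreteness, a minimal Matlab sketch of this loop (the matrix and the number of iterations are placeholders, and a stopping test is omitted):

  A = randn(6);  A = A + A';        % placeholder (symmetric, so the dominant eigenvalue is real)
  q = randn(6,1);  q = q/norm(q);   % ||q||_2 = 1
  x = A*q;
  for k = 1:100
      q = x/norm(x);
      x = A*q;
      lambda = q'*x;                % equals q'*A*q, the estimate lambda^(k)
  end
  lambda                            % approximates the eigenvalue of A largest in magnitude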

Claim: The result is that

λ^(k) = λ1 + O((λ2/λ1)^k),
q^(k) = v1 + O((λ2/λ1)^k),

where v1 is an eigenvector with ‖v1‖2 = 1 and the Big-O term in the second line means that
the error ‖q^(k) − v1‖ has that Big-O.

Sketch of proof: First, define


zk = A^k q0 ,

the ‘unscaled’ iterates. Observe that since ‖q‖2 = 1 we may write λ^(k) as a ratio (with q = q^(k)):

λ^(k) = q^T A q = (q^T A q)/(q^T q).

It is not too hard (but tedious) to show that the normalization factors all cancel when
replacing qk with zk , leading to

λ^(k) = (qk^T A qk)/(qk^T qk) = (zk^T A zk)/(zk^T zk).          (2)
Now, as before, write the initial vector in terms of the eigenvector basis:
q0 = Σ_{j=1}^n cj vj ,   c1 ≠ 0.

It follows that

zk = A^k q0 = Σ_{j=1}^n cj λj^k vj .          (3)

Then plug (3) into (2) and sort out the mess, identifying the largest terms (left as an exercise;
but see below for a special case).

The convergence of the eigenvector takes more work and is omitted here.

1.3 Symmetric power method
The method above has a nice benefit: if A is a real symmetric matrix, then the convergence
rate is actually better. If A is (real) symmetric then its eigenvectors are orthogonal:

vi · vj = 0 for i ≠ j.

We may also take them to be orthonormal, i.e. ‖vi‖2 = 1.

Now return to the convergence proof. Observe that


zk^T zk = ( Σ_{i=1}^n ci λi^k vi ) · ( Σ_{j=1}^n cj λj^k vj ) = Σ_{j=1}^n cj^2 λj^{2k}

since vi · vj = 0 for i ≠ j and 1 for i = j. Similarly,

zk^T A zk = ( Σ_{i=1}^n ci λi^k vi ) · ( Σ_{j=1}^n cj λj^{k+1} vj ) = Σ_{j=1}^n cj^2 λj^{2k+1} .

Now plugging (3) into (2) is nice:

λ^(k) = (zk^T A zk)/(zk^T zk)

      = ( Σ_{j=1}^n cj^2 λj^{2k+1} ) / ( Σ_{j=1}^n cj^2 λj^{2k} )

      = λ1 · [ c1^2 + Σ_{j=2}^n cj^2 (λj/λ1)^{2k+1} ] / [ c1^2 + Σ_{j=2}^n cj^2 (λj/λ1)^{2k} ]

      = λ1 · [ 1 + O((λ2/λ1)^{2k}) ] / [ 1 + O((λ2/λ1)^{2k}) ]

and so

λ^(k) = λ1 + O((λ2/λ1)^{2k}),
i.e. the rate of convergence is squared compared to the non-symmetric case. The larger error
term from the non-symmetric case is a cross term (from v1 · v2 ) which vanishes here.

1.4 Aitken extrapolation


We saw that the eigenvalue estimate converges linearly to the true value:

λ^(n) ∼ λ + c r^n

where r = (λ2/λ1) (non-symmetric) or (λ2/λ1)^2 (symmetric). Both c and r are, of course,
unknown. However, just as with Richardson extrapolation, we can ’cheat’ here and ’solve’
for c and r to improve the estimate.

To generalize, suppose we have a scalar sequence
x_n ∼ x + c r^n          (4)

with a limit x, the desired quantity. Observe that

lim_{n→∞} (x_{n+1} − x_n)/(x_n − x_{n−1}) = lim_{n→∞} (c r^{n+1} − c r^n)/(c r^n − c r^{n−1}) = r.

Similarly

(x_{n+1} − x_n)^2 ∼ c^2 (r^n)^2 (r − 1)^2

and

x_{n+2} − 2x_{n+1} + x_n ∼ c r^{n+2} − 2c r^{n+1} + c r^n = c r^n (r − 1)^2 .

Thus

(x_{n+1} − x_n)^2 / (x_{n+2} − 2x_{n+1} + x_n) ∼ c r^n ∼ x_n − x.

Let us define the new sequence

yn := Axn = x_n − (x_{n+1} − x_n)^2 / (x_{n+2} − 2x_{n+1} + x_n).          (5)
If (4) holds exactly then yn = x. Otherwise, {Axn } tends to converge to x faster (linear,
with a better rate) than {xn }. This process is called Aitken extrapolation, which can be
used to accelerate the convergence of a sequence.

In practice, it is used when a sequence is known to converge linearly (from theory), and
one wants a bit of extra accuracy (or efficiency) for cheap.

Example: Consider the sequence


xn = 1 + 2^{−n} + 4^{−n} .

Then xn → 1 linearly with rate 1/2. However,

Axn = 1 + O(4^{−n})
(proof: exercise), so Axn converges with a rate of 1/4 instead. By doing the more or less
trivial computation (5), we have a new approximating sequence {Axn } that is much more
accurate (twice the number of digits!).
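As a minimal Matlab sketch of (5) applied to this sequence (the number of terms is arbitrary):

  n = (0:20)';                        % indices of the sequence
  x = 1 + 2.^(-n) + 4.^(-n);          % x_n -> 1 linearly with rate 1/2
  % Aitken extrapolation (5): y_n = x_n - (x_{n+1}-x_n)^2/(x_{n+2}-2x_{n+1}+x_n)
  y = x(1:end-2) - (x(2:end-1) - x(1:end-2)).^2 ./ ...
      (x(3:end) - 2*x(2:end-1) + x(1:end-2));
  disp([abs(x(1:end-2) - 1), abs(y - 1)])   % errors: O(2^-n) versus O(4^-n)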

An example for eigenvalues is shown in subsection 1.5.

Remark: There are many other acceleration techniques of this flavor, that recycle existing
data to construct better approximations. For iterative methods, a more sophisticated
technique called Chebyshev acceleration can be used to obtain an ‘optimized’ approximation
using the first k iterates.


Figure 1: Error in the eigenvalue approximation for the symmetric power method with and
without Aitken extrapolation.

1.5 Example
We find the dominant eigenvalue of the matrix

     [ 1   1   2 ]
A =  [ 1  −1   1 ] .
     [ 2   1  −1 ]

The result of using the power method with q0 = (1, 0, 0)^T is shown in Figure 1. The
eigenvalues are

λ1 ≈ 2.74,   λ2 ≈ −2.35,   λ3 ≈ −1.4.
The error in λ(k) (the approximation to λ1 ) is shown along with an accelerated estimate
using Aitken extrapolation (subsection 1.4). Since A is symmetric, the convergence is linear
with rate
r = (λ2/λ1)^2 ≈ 0.73
which is not bad. The accelerated version (which does not require any more work), however,
does much better. The noise for k > 50 is due to rounding; of course we cannot expect the
error to get much better than around machine precision. There is some additional error in
the Aitken sequence due to cancellation (ratio of two differences of nearly equal numbers),
but it is not of much concern here.
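A minimal Matlab sketch that reproduces this experiment (the iteration count is arbitrary, and eig is used only to get a reference value for the error):

  A = [1 1 2; 1 -1 1; 2 1 -1];
  q = [1; 0; 0];  x = A*q;
  lam = zeros(60,1);
  for k = 1:60
      q = x/norm(x);  x = A*q;
      lam(k) = q'*x;                       % lambda^(k)
  end
  lam1 = max(eig(A));                      % reference value, about 2.74
  % Aitken extrapolation (5) applied to the sequence lambda^(k)
  acc = lam(1:end-2) - (lam(2:end-1) - lam(1:end-2)).^2 ./ ...
        (lam(3:end) - 2*lam(2:end-1) + lam(1:end-2));
  semilogy(1:60, abs(lam - lam1), 1:58, abs(acc - lam1))
  % (as noted above, the last few Aitken values are contaminated by rounding)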

2 Inverse power method
A simple change allows us to compute the smallest eigenvalue (in magnitude). Let us
assume now that A has eigenvalues

|λ1| ≥ |λ2| ≥ · · · > |λn| > 0.

Then A^{−1} has eigenvalues λj^{−1} satisfying

|λn^{−1}| > |λ_{n−1}^{−1}| ≥ · · · ≥ |λ1^{−1}| .

Thus if we apply the power method to A^{−1} , the algorithm will give 1/λn , yielding the
smallest eigenvalue of A (after taking the reciprocal at the end).

Note that in practice, instead of computing A^{−1} , we first compute an LU factorization
of A, and then solve

A x^{(k+1)} = x^{(k)}

at each step, which only takes O(n^2) operations after the initial work.

Now suppose instead we want to find the eigenvalue closest to a number µ. Notice that
the matrix (A − µI)^{−1} has eigenvalues

1/(λj − µ) ,   j = 1, · · · , n.

The eigenvalue of largest magnitude will be 1/(λj0 − µ) where λj0 is the closest eigenvalue
to µ (assuming there is only one). This leads to the inverse power method (sometimes
called inverse iteration):

Inverse power method: To find the eigenvalue of A closest to µ,

1) Apply the power method to (A − µI)−1 , solving

(A − µI)xk = xk−1

at each step using some linear system solver (e.g. LU factorization).


2) Recover λ from the output: the power method returns ν ≈ 1/(λ − µ), so λ = µ + 1/ν.

Note that if µ is fixed, the LU factorization only needs to be computed once!
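A minimal Matlab sketch of this procedure, reusing the matrix from subsection 1.5 with the (arbitrary) guess µ = −1.5; the factorization is done once, outside the loop:

  A  = [1 1 2; 1 -1 1; 2 1 -1];
  mu = -1.5;                         % guess near the eigenvalue we want (about -1.4)
  [L,U,P] = lu(A - mu*eye(3));       % factor A - mu*I once
  x = randn(3,1);  x = x/norm(x);
  for k = 1:30
      y  = U\(L\(P*x));              % solve (A - mu*I) y = x
      nu = x'*y;                     % estimate of the dominant eigenvalue 1/(lambda - mu)
      x  = y/norm(y);
  end
  lambda = mu + 1/nu                 % eigenvalue of A closest to mu, about -1.4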

The method is a cheap, often effective way of computing one eigenvalue, which is often
all that matters. Moreover, a good choice of µ helps convergence. It should be a ’guess’ of
the eigenvalue. Suppose the eigenvalues satisfy
1/|λ1 − µ| > 1/|λ2 − µ| > · · ·

and we seek λ1 . Then inverse iteration will yield λ1 , and from the power method,

error in λ1 = O( |λ1 − µ|^k / |λ2 − µ|^k )

with k replaced by 2k for a symmetric A. In either case, the (linear) rate improves as µ gets
closer to λ1 . With a good guess, the convergence rate will be quite good (close to zero).

An improvement for symmetric matrices: The advantage of a fixed value of µ is that


A − µI only needs to be factored once. But the convergence is only linear.

For a symmetric matrix, the convergence can be accelerated even more by choosing
µ at each step as the Rayleigh quotient

µk = (x^{(k)} · A x^{(k)}) / (x^{(k)} · x^{(k)}) .
The iteration (’Rayleigh quotient iteration’) is then

(A − µk I) x^{(k+1)} = x^{(k)} .

For a typical symmetric matrix, the convergence becomes cubic, which is much faster than
linear! Thus the disadvantage of factoring A − µk I at each step is balanced out by the
dramatic boost in convergence rate.
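A minimal Matlab sketch of Rayleigh quotient iteration on the symmetric example from subsection 1.5 (the starting vector and the number of steps are arbitrary; near convergence the shifted system is nearly singular, so Matlab may warn, but the normalized direction is still fine):

  A = [1 1 2; 1 -1 1; 2 1 -1];
  x = [1; 0; 0];                     % ||x||_2 = 1
  mu = x'*A*x;                       % initial Rayleigh quotient
  for k = 1:5                        % cubic convergence: very few steps are needed
      x  = (A - mu*eye(3))\x;        % a new factorization every iteration
      x  = x/norm(x);
      mu = x'*A*x;                   % updated Rayleigh quotient
  end
  mu                                 % approximately an eigenvalue of A (which one depends on the start)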

2.1 More on eigenvalues


The methods above give an extreme eigenvalue, but information about the other eigenvalues is
lost as the iteration proceeds. Finding the other eigenvalues is more difficult, and the power/inverse
power methods are not particularly well suited to doing so. There are some ways to reduce
the problem after finding one eigenvalue (to find the next largest using the power method,
and so on), but they can be problematic (see next section).

For computing all the eigenvalues of A, there is a powerful class of iterative methods that
can be used, such as the QR algorithm, which will not be covered here.

3 Deflation
Here we introduce a practical trick that is occasionally useful if used carefully. For robust
computation, other more sophisticated methods are used instead.

3.1 Deflation: the idea


To illustrate the point, consider the problem of finding all the roots of a degree n polynomial
p(x) with n real roots. Suppose we have access to a solver that finds one root, such as Newton's method.
Then we can proceed by using deflation:
1) Use Newton’s method to find a root x0 of p(x)
2) Define
q0 (x) = p(x)/(x − x0 )
and find a zero x1 of q0 .
3) Define q1 = q0 /(x − x1 ), find a root x2 and so on up through qn−1 .

Definition: Given a method that finds one solution to a system, the process of ‘dividing
out a solution’ one by one to get all of them is called deflation.

In theory, this will ‘divide out’ the found roots one by one, e.g.
p = (x − 1)(x − 2)(x − 3) =⇒ q0 = (x − 1)(x − 3) =⇒ q1 = (x − 1)   (if the roots are found in the order 2, 3).
Note that each qk is smooth; the division does not add any singularities.

However, if xj is not computed exactly, e.g. x̃0 ≈ x0 , there is trouble, since then
q0 (x) = (x − x0)/(x − x̃0) · (· · · )
which introduces a ‘singularity’ at x̃0 that can be disastrous unless x̃0 is computed to very
high accuracy. It could be that Newton’s method on q0 will still converge to x0 , or just blow
up. The method relies on a theoretical guarantee that must hold exactly for the method
to be correct. Otherwise, it may work in practice, or it may not.

This is not to say the method is worthless; just that it should be used cautiously, and
without high expectations.
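As a minimal Matlab sketch of this procedure for p(x) = (x − 1)(x − 2)(x − 3), using a crude fixed-iteration Newton solver and deconv for the polynomial division (the starting guess 2.4 is arbitrary):

  p = poly([1 2 3]);                 % coefficients of (x-1)(x-2)(x-3)
  r = zeros(3,1);                    % roots found, in the order they are computed
  q = p;
  for m = 1:3
      x = 2.4;                       % starting guess for Newton's method
      for k = 1:50
          x = x - polyval(q,x)/polyval(polyder(q),x);
      end
      r(m) = x;
      q = deconv(q, [1 -x]);         % deflate: divide out the root just found
  end
  r                                  % for this start the roots appear in the order 2, 3, 1 (cf. the example above)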

3.2 Deflation for the power method


Since the eigenvalues are the roots of the characteristic polynomial
p(λ) = det(A − λI),
we might expect that deflation could be used in conjunction with the power method. Indeed,
it can be used, if one is willing to accept the numerical issues that arise.

Reminder (outer product): The outer product v w^T of two column vectors v, w ∈ R^n
is the matrix

C = v w^T   with cij = vi wj .

That is, the (i, j) component of the outer product is the i-th component of v times the
j-th component of w. Observe that the rank of an outer product is always one; for this reason a matrix

B = A + v w^T

is called a ‘rank one perturbation’ (it is A plus a rank-one matrix).

Note that the transpose notation makes some manipulations convenient, e.g.

(v w^T) x = v (w^T x) = (w^T x) v,

noting that v w^T is a matrix and w^T x is a scalar (inner product).

Let A be an n × n real symmetric matrix with (distinct) non-zero eigenvalues

|λ1| > |λ2| > · · · > |λn| > 0

and eigenvectors v1 , · · · , vn . The goal of deflation is to build a modified matrix that has
only λ2 , · · · , λn as eigenvalues.2 Suppose we have obtained λ1 and v1 .

This goal is achieved by defining the ‘deflated’ matrix


B = A − λ1 v1 xT
where x is any vector such that
xT v1 = 1.
Claim: The eigenvalues of B are 0 and λ2 , · · · , λn with eigenvectors v1 , w2 , · · · wn . Only v1
is the same, but v2 can be obtained easily from w2 .
Proof. First, 0 is an eigenvalue of B with eigenvector v1 since
Bv1 = Av1 − λ1 v1 xT v1 = λ1 v1 − λ1 v1 = 0
by the choice of x. If j ≠ 1 then vj is in general no longer an eigenvector of B. Instead, look for an eigenvector
wj = cv1 + vj .
We need to find a c such that (B − λj I)wj = 0. Compute
(B − λj I)wj = c(A − λj I)v1 − λ1 v1 xT wj
= c(λ1 − λj )v1 − λ1 v1 xT wj
= (c(λ1 − λj ) − λ1 xT wj )v1 .
Since xT wj = c + xT vj , the right value of c is

c = −λ1 xT vj /λj

so λj is an eigenvalue with eigenvector

wj = vj − (λ1/λj) (xT vj) v1 .          (6)

2. Discussed also in some textbooks, e.g. Burden and Faires, Numerical Analysis, pp. 587–588, from which
this presentation is adapted.

This gives a way to compute the second-largest eigenvalue in magnitude.

Deflation for the second-largest eigenvalue: To compute λ2 and v2 (where |λ1 | >
|λ2 | > · · · ),

• Use the power method to obtain λ1 and v1 ,

• Construct the deflated matrix

B = A − λ1 v1 xT

where x is chosen so that xT v1 = 1.

• Use the power method on B to obtain λ2 and w2 , then recover the eigenvector v2 for
A from (6).

It is not hard to find such a vector x, but the numerical stability depends on this choice.
One option is Wielandt deflation, which chooses

x = (1/(λ1 (v1)1)) R1

where R1 = (a11 , · · · , a1n)^T is the first row of A (written as a column vector) and (v1)1 is the
first component of v1 .
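A minimal Matlab sketch of this procedure on the symmetric matrix from subsection 1.5, using the Wielandt choice of x (iteration counts are arbitrary, and the last two lines are just (6) rearranged to recover an eigenvector of A):

  A = [1 1 2; 1 -1 1; 2 1 -1];
  q = [1; 0; 0];                            % power method for lambda_1, v_1
  for k = 1:300, q = A*q; q = q/norm(q); end
  lam1 = q'*A*q;  v1 = q;
  x = A(1,:)'/(lam1*v1(1));                 % Wielandt choice: x'*v1 = 1
  B = A - lam1*(v1*x');                     % deflated matrix
  w = randn(3,1);                           % power method on B for lambda_2, w_2
  for k = 1:300, w = B*w; w = w/norm(w); end
  lam2 = w'*B*w                             % about -2.35
  v2 = (lam2 - lam1)*w + lam1*(x'*w)*v1;    % eigenvector of A, from (6) up to scaling
  v2 = v2/norm(v2);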

The peril is that deflation is numerically unstable, and repeated applications can lead to
disaster. Using it to get λ2 is usually fine except for ill-behaved eigen-problems, but it is
not advisable to use it to find all the eigenvalues.

Here is an extreme example where there is trouble. Let

A = λ1 v1 v1T + λ2 v2 v2T

where v1 , v2 are orthonormal (v1T v1 = v2T v2 = 1 and v1T v2 = 0). Then A has eigenvalues λ1 , λ2 and eigenvectors v1 , v2 . Now suppose
λ̃1 ≈ λ1 is computed. We then deflate:

B = A − λ̃1 v1 v1T

choosing x = v1 for simplicity (of the example), so that xT v1 = 1. Then

B = (λ1 − λ̃1 )v1 v1T + λ2 v2 v2T .

So B has an eigenvalue λ1 − λ̃1 . If, say,

λ1 = 10^8 ,   λ2 = 10^{−8} ,

and λ̃1 is computed to machine precision (relative error 10^{−16}) then

|λ1 − λ̃1 | ≈ 10^{−8}

which is the same size as λ2 . Thus the spurious ’leftover’ from the deflation is actually the
dominant part, and the power method cannot see λ2 .

4 An example: PageRank
The notes here expand on the brief discussion in the textbook. For an introduction, see
https://www.mathworks.com/content/dam/mathworks/mathworks-dot-com/moler/exm/
chapters/pagerank.pdf. For a detailed exposition, the best source is Langville & Meyer,
Google’s PageRank and Beyond.

A Matlab demonstration can be found at https://www.mathworks.com/help/matlab/math/


use-page-rank-algorithm-to-rank-websites.html. The summary here is adapted from
this source.

4.1 The setup


We regard a set of web pages as a graph whose vertices are pages {Wi } connected by a
directed edge i → j if Wi links to page Wj . The goal is to determine a metric for the
popularity of each page.

Imagine a hypothetical user (a ’surfer’) meandering through the web by following links.
The surfer will visit pages that are more connected more often. We define

xi = proportion of time spent at page Wi .

Given pages Wi and Wj , also define the transition probability

pij = probability of going from Wi to Wj .

and the transition matrix P whose entries are pij . A small example and its transition matrix
are shown in Figure 2.
If the surfer is at page Wi , they have come from some other site, which occurs with
probability pji . It should follow that
xi = Σ_j pji xj .          (7)

Figure 2: Directed graph of four linked pages to be ranked and the transition matrix.

In probabilistic terms,3

P(at Wi) = Σ_j P(goes from Wj to Wi | at Wj) · P(at Wj).

Now we need to know how likely it is that the surfer goes from i to j. We need one last
definition; the adjacency matrix A is the matrix whose (i, j) entry is 1 if there is a link
from Wi to Wj :

aij = 1 if Wi points to Wj ,   and aij = 0 otherwise.

The PageRank algorithm (in its simplest form) makes the assumption that the surfer clicks
a link uniformly at random to leave a page:

pji = aji /Nj ,   where Nj = # of outgoing links from Wj .
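For instance, a minimal Matlab sketch of this construction (here using the four-page adjacency matrix of Figure 2, which is written out in subsection 4.4):

  A = [0 1 1 1; 0 0 1 0; 1 0 0 1; 0 0 1 0];  % adjacency matrix of the graph in Figure 2
  N = sum(A,2);                              % N_j = number of outgoing links from W_j
  P = diag(1./N)*A;                          % p_ij = a_ij/N_i, so each row of P sums to 1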

4.2 The eigenvalue problem


Now let x be the vector of xi ’s. Then the steady-state equation (7) is

x = P^T x

or, setting M = P^T ,

x = M x.          (8)
Thus x is an eigenvector of M with eigenvalue 1. It can be shown that if the system is
‘irreducible’ (all states can be reached from all others), then:4

• λ = 1 is an eigenvalue with multiplicity 1


• All other eigenvalues are less than one in magnitude

Thus there is a unique eigenvector x (the ‘ranking’) such that

Σ_i xi = 1.

3. This is a specific example of a steady state of a Markov chain.
4. This is a consequence of the Perron–Frobenius theorem. The key facts here are that the entries of
M are non-negative and the entries in each column sum to 1.

The popularity of the website is then measured by this ranking. To get around the issue of
pages that do not link back and for numerical reasons, there is one more ingredient in the
basic model. The surfer is assumed to jump to a random page (regardless of links) with a
small probability 1 − α. That is,
mij = α aji /Nj + (1 − α)/n .

The value of α is typically chosen to be near 1, e.g. α ≈ 0.85. In matrix terms,

M = α P^T + ((1 − α)/n) E          (9)
where E is a matrix of all ones.

4.3 Computation
The ranking x is the eigenvector for the dominant eigenvalue λ = 1. Thus the power method
can be employed. Moreover, we do not need to normalize at each step (why not?). The
matrix P^T has a simple structure, so the multiplication M x can be implemented quite efficiently.

It is not hard to show that


Σ_i xi = 1   =⇒   Σ_i (M x)i = 1

and then that

(M x)i = α Σ_{j=1}^n (aji xj /Nj ) + (1 − α)/n.

In Matlab, this is simply coded as


Mx = alpha*A'*(x./N) + (1-alpha)/n
where N is the vector of Nj ’s. The adjacency matrix A is typically sparse (most pages link
only to a few others), so A^T x is fast to compute. Note that the sparseness is essential for
the enormous data-sets produced by crawling the web.

4.4 Example
Let us compute the ranking for the small system in Figure 2. The transition probabilities
and transition matrix P are also shown. The adjacency matrix and N values are

     [ 0  1  1  1 ]          [ 3 ]
     [ 0  0  1  0 ]          [ 1 ]
A =  [ 1  0  0  1 ] ,   N =  [ 2 ] .
     [ 0  0  1  0 ]          [ 1 ]

Figure 3: Ranking (darker shade =⇒ higher rank) and computed x for the four-page
example.

The matrix to be used for the power method is (with n = 4)

M = α P^T + ((1 − α)/n) E
where E is a matrix of all ones and we seek an eigenvector x such that

x = M x.

Applying the power method (no normalization is required because the eigenvalue is 1) with
α = 0.85, we obtain the solution

x = (0.214, 0.098, 0.414, 0.274)^T

which indicates that the highest ranked page should be #3 (see Figure 3). Since 3 is in the
middle of the network and has the most incoming links, the result makes sense. Similarly,
#2 is last, as it is only connected from #1.
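A minimal Matlab sketch of this computation (it should reproduce the ranking above to the digits shown):

  A = [0 1 1 1; 0 0 1 0; 1 0 0 1; 0 0 1 0];   % adjacency matrix from above
  N = sum(A,2);  alpha = 0.85;  n = 4;
  x = ones(n,1)/n;                            % start from the uniform distribution
  for k = 1:200
      x = alpha*(A'*(x./N)) + (1-alpha)/n;    % x <- M*x; no normalization needed
  end
  x'                                          % approx (0.214, 0.098, 0.414, 0.274)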

For a larger example, see the Matlab demonstration linked at the start of the section.

