Power Method and Deflation
Topics covered
• Finding eigenvalues
• General tricks
We assume here there is a single eigenvalue of largest magnitude (the ‘dominant’ eigenvalue). Label the eigenvalues and associated eigenvectors so that

|λ1| > |λ2| ≥ · · · ≥ |λn|,    A v_j = λ_j v_j.
The simplest approach to computing λ1 and v1 is the power method. The idea is as follows. Let x be any vector. Then, since {v_j} is a basis,

x = c1 v1 + · · · + cn vn .
Now suppose c1 ≠ 0 (i.e. x has a non-zero v1 component). Then
Ax = c1 λ1 v1 + · · · + cn λn vn
and, applying A repeatedly,
A^k x = Σ_{j=1}^n c_j λ_j^k v_j .
Claim: Let w and x be vectors with w^T v1 ≠ 0 and such that x has a non-zero v1 component. Then

(w^T A^k x) / (w^T A^{k−1} x) = λ1 + O((λ2/λ1)^k)  as k → ∞.    (1)
Proof. Since the eigenvectors form a basis, there are scalars c1, · · · , cn such that

x = Σ_{j=1}^n c_j v_j .
Applying A^k and factoring out λ1^k,

A^k x = λ1^k ( c1 v1 + Σ_{j=2}^n c_j (λ_j/λ1)^k v_j ).

All the terms in the parentheses except the first go to zero in magnitude as k → ∞, since |λ_j/λ1| < 1 for j ≥ 2. Taking the dot product with w on the left and computing the ratio (1),
w^T A^k x / w^T A^{k−1} x
    = λ1 · (d1 + Σ_{j=2}^n d_j (λ_j/λ1)^k) / (d1 + Σ_{j=2}^n d_j (λ_j/λ1)^{k−1})
    = λ1 · (1 + O((λ2/λ1)^k)) / (1 + O((λ2/λ1)^{k−1}))
    = λ1 + O((λ2/λ1)^k),
where d_j = c_j w^T v_j (note that d1 ≠ 0 by assumption). Note that we have used that

1/(1 + f) = 1 + O(f)  as f → 0.
Thus the power method computes the dominant eigenvalue (largest in magnitude), and the convergence is linear: the error is reduced by roughly a factor |λ2/λ1| per step, so the rate depends on how well separated λ1 is from the next largest eigenvalue λ2.
In iterative form, starting from z0 = x:

z_k = A z_{k−1},    λ^{(k)} = (w^T z_k) / (w^T z_{k−1}).    (2)
The result is that λ^{(k)} converges linearly to the dominant eigenvalue λ1. The issue of when to stop is addressed in subsection 1.4.
Two practical adjustments are typically made:
i) Normalize x at each step to have ||x||_2 = 1, which keeps the elements from growing exponentially in size. This also gives an eigenvector of unit 2-norm.
ii) Use a 'left' vector that depends on k instead of the fixed w in (1). A good choice is the normalized A^k x (which we have anyway).
This gives the following algorithm. Pick a starting vector q^{(0)} with ||q^{(0)}||_2 = 1.
• For k = 1, 2, · · · :
  ◦ x^{(k)} = A q^{(k−1)}
  ◦ q^{(k)} = x^{(k)} / ||x^{(k)}||_2
  ◦ λ^{(k)} = (q^{(k)})^T A q^{(k)}.
In practice, very little needs to be stored, and the algorithm is quite simple:
• For k = 1, 2, · · · :
  ◦ q = x/||x||_2
  ◦ x = A q
  ◦ λ = q^T x
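A minimal sketch of this loop in Python/NumPy (the function name, tolerance, and stopping test are illustrative choices, not part of the notes):

    import numpy as np

    def power_method(A, x0, tol=1e-12, maxit=500):
        # Power method: estimate the dominant eigenvalue and a unit eigenvector.
        # Assumes |lambda_1| > |lambda_2| and that x0 has a nonzero v1 component.
        x = np.array(x0, dtype=float)
        lam_old = np.inf
        for _ in range(maxit):
            q = x / np.linalg.norm(x)   # normalize
            x = A @ q                   # apply A
            lam = q @ x                 # Rayleigh-quotient estimate q^T A q
            if abs(lam - lam_old) < tol * abs(lam):
                break
            lam_old = lam
        return lam, q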
It follows that

z_k = A^k q_0 = Σ_{j=1}^n c_j λ_j^k v_j .    (3)
Then plug (3) into (2) and sort out the mess, identifying the largest terms (left as an exercise;
but see below for a special case).
The convergence of the eigenvector takes more work and is omitted here.
1.3 Symmetric power method
The method above has a nice benefit: if A is a real symmetric matrix, then the convergence
rate is actually better. If A is (real) symmetric then its eigenvectors are orthogonal:
v_i · v_j = 0 for i ≠ j.
Using the Rayleigh quotient of z_k as the eigenvalue estimate and taking the v_j orthonormal,

λ^{(k)} = (z_k^T A z_k) / (z_k^T z_k)
        = (Σ_{j=1}^n c_j^2 λ_j^{2k+1}) / (Σ_{j=1}^n c_j^2 λ_j^{2k})
        = λ1 · (c1^2 + Σ_{j=2}^n c_j^2 (λ_j/λ1)^{2k+1}) / (c1^2 + Σ_{j=2}^n c_j^2 (λ_j/λ1)^{2k})
        = λ1 + O((λ2/λ1)^{2k}),

so the error is reduced by a factor of (λ2/λ1)^2, rather than λ2/λ1, per iteration.
1.4 Aitken extrapolation
From the analysis above, the eigenvalue estimates behave like

λ^{(n)} ∼ λ1 + c r^n

where r = (λ2/λ1) (non-symmetric) or (λ2/λ1)^2 (symmetric). Both c and r are, of course, unknown. However, just as with Richardson extrapolation, we can 'cheat' here and 'solve' for c and r to improve the estimate.
To generalize, suppose we have a scalar sequence

x_n ∼ x + c r^n    (4)

with a limit x, the desired quantity. Observe that

lim_{n→∞} (x_{n+1} − x_n)/(x_n − x_{n−1}) = lim_{n→∞} (c r^{n+1} − c r^n)/(c r^n − c r^{n−1}) = r.
Similarly

(x_{n+1} − x_n)^2 ∼ c^2 (r^n)^2 (r − 1)^2

and

x_{n+2} − 2x_{n+1} + x_n ∼ c r^{n+2} − 2c r^{n+1} + c r^n = c r^n (r − 1)^2 .

Thus

(x_{n+1} − x_n)^2 / (x_{n+2} − 2x_{n+1} + x_n) ∼ c r^n ∼ x_n − x.
Let us define the new sequence

y_n := A x_n = x_n − (x_{n+1} − x_n)^2 / (x_{n+2} − 2x_{n+1} + x_n),    (5)

where A here denotes the Aitken transform (not the matrix from before).
If (4) holds exactly then y_n = x. Otherwise, {A x_n} tends to converge to x faster (linearly, with a better rate) than {x_n}. This process is called Aitken extrapolation, and it can be used to accelerate the convergence of a linearly converging sequence.
In practice, it is used when a sequence is known to converge linearly (from theory), and
one wants a bit of extra accuracy (or efficiency) for cheap.
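As a sketch, formula (5) applied to a stored array of iterates might look like the following in Python/NumPy (the function name is illustrative):

    import numpy as np

    def aitken(x):
        # Aitken extrapolation (5): y_n = x_n - (x_{n+1}-x_n)^2 / (x_{n+2}-2x_{n+1}+x_n).
        # Returns an array two entries shorter than x; beware of cancellation and
        # near-zero denominators once the sequence has essentially converged.
        x = np.asarray(x, dtype=float)
        num = (x[1:-1] - x[:-2]) ** 2          # (x_{n+1} - x_n)^2
        den = x[2:] - 2 * x[1:-1] + x[:-2]     # x_{n+2} - 2 x_{n+1} + x_n
        return x[:-2] - num / den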
Remark: There are many other acceleration techniques of this flavor that recycle existing data to construct better approximations. For iterative methods, a more sophisticated technique called Chebyshev acceleration can be used to obtain an 'optimized' approximation using the first k iterates.
Figure 1: Error in the eigenvalue approximation for the symmetric power method with and
without Aitken extrapolation.
1.5 Example
We find the dominant eigenvalue of the symmetric matrix

A = [ 1   1   2
      1  −1   1
      2   1  −1 ] .

The result of using the power method with q_0 = (1, 0, 0)^T is shown in Figure 1. The eigenvalues are

λ1 ≈ 2.74,   λ2 ≈ −2.35,   λ3 ≈ −1.4.
The error in λ(k) (the approximation to λ1 ) is shown along with an accelerated estimate
using Aitken extrapolation (subsection 1.4). Since A is symmetric, the convergence is linear
with rate
r = (λ2 /λ1 )2 ≈ 0.73
which is not bad. The accelerated version (which does not require any more work), however,
does much better. The noise for k > 50 is due to rounding; of course we cannot expect the
error to get much better than around machine precision. There is some additional error in
the Aitken sequence due to cancellation (ratio of two differences of nearly equal numbers),
but it is not of much concern here.
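A rough, self-contained script to reproduce this experiment (the starting vector and iteration count follow the text; plotting is omitted):

    import numpy as np

    A = np.array([[1., 1., 2.],
                  [1., -1., 1.],
                  [2., 1., -1.]])
    lam1 = max(np.linalg.eigvalsh(A), key=abs)   # reference value, about 2.74

    q = np.array([1., 0., 0.])                   # q0 = (1, 0, 0)^T
    est = []
    for k in range(100):
        x = A @ q
        est.append(q @ x)                        # Rayleigh-quotient estimate lambda^(k)
        q = x / np.linalg.norm(x)
    est = np.array(est)
    err = np.abs(est - lam1)                     # error curve, compare Figure 1

    # Aitken-accelerated estimates via (5); the denominator is ~0 once the
    # iteration reaches rounding level, so only the earlier entries are useful.
    acc = est[:-2] - (est[1:-1] - est[:-2])**2 / (est[2:] - 2*est[1:-1] + est[:-2])
    err_acc = np.abs(acc - lam1)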
2 Inverse power method
A simple change allows us to compute the smallest eigenvalue (in magnitude). Let us assume now that A is invertible and |λ_{n−1}| > |λ_n| > 0, so that A^{−1} has eigenvalues 1/λ_j ordered by

|λ_n^{−1}| > |λ_{n−1}^{−1}| ≥ · · · ≥ |λ_1^{−1}| .

Thus if we apply the power method to A^{−1}, the algorithm will give 1/λ_n, yielding the smallest eigenvalue of A (after taking the reciprocal at the end).
Now suppose instead we want to find the eigenvalue closest to a number µ. Notice that
the matrix (A − µI)−1 has eigenvalues
1/(λ_j − µ) ,    j = 1, · · · , n.
The eigenvalue of largest magnitude will be 1/(λj0 − µ) where λj0 is the closest eigenvalue
to µ (assuming there is only one). This leads to the inverse power method (sometimes
called inverse iteration):
(A − µI)xk = xk−1
The method is a cheap, often effective way of computing one eigenvalue, which is often all that matters. Moreover, a good choice of µ helps convergence: it should be a 'guess' at the eigenvalue being sought. Suppose the eigenvalues satisfy

1/|λ1 − µ| > 1/|λ2 − µ| > · · ·
and we seek λ1. Then inverse iteration will yield λ1 and, from the power method,

error in λ1 = O( (|λ1 − µ| / |λ2 − µ|)^k ),

with k replaced by 2k for a symmetric A. In either case, the (linear) rate improves as µ gets closer to λ1. With a good guess, the convergence rate will be quite good (close to zero).
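A sketch of inverse iteration with a fixed shift µ (illustrative names; for larger problems one would factor A − µI once and reuse the factorization instead of calling solve each step):

    import numpy as np

    def inverse_iteration(A, mu, x0, maxit=100, tol=1e-12):
        # Solve (A - mu I) x_k = x_{k-1}, normalizing each step; returns an
        # estimate of the eigenvalue of A closest to mu and its eigenvector.
        n = A.shape[0]
        B = A - mu * np.eye(n)
        x = np.array(x0, dtype=float)
        lam_old = np.inf
        for _ in range(maxit):
            x = np.linalg.solve(B, x)        # apply (A - mu I)^{-1}
            x = x / np.linalg.norm(x)
            lam = x @ (A @ x)                # Rayleigh quotient: eigenvalue estimate for A
            if abs(lam - lam_old) < tol * abs(lam):
                break
            lam_old = lam
        return lam, x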
For a symmetric matrix, the convergence can be accelerated even more by choosing µ at each step as the Rayleigh quotient

µ_k = (x^{(k)} · A x^{(k)}) / (x^{(k)} · x^{(k)}) .

The iteration ('Rayleigh quotient iteration') is then

(A − µ_k I) x^{(k+1)} = x^{(k)} .
For a typical symmetric matrix, the convergence becomes cubic, which is much faster than
linear! Thus the disadvantage of factoring A − µk I at each step is balanced out by the
dramatic boost in convergence rate.
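For symmetric A, a corresponding sketch of Rayleigh quotient iteration (note that the matrix being solved changes every step, so a new factorization is needed each time):

    import numpy as np

    def rayleigh_quotient_iteration(A, x0, maxit=20, tol=1e-14):
        # The shift mu_k is the Rayleigh quotient of the current iterate; a new
        # linear solve is required each step, but convergence is typically cubic.
        n = A.shape[0]
        x = np.array(x0, dtype=float)
        x = x / np.linalg.norm(x)
        mu = x @ (A @ x)
        for _ in range(maxit):
            try:
                y = np.linalg.solve(A - mu * np.eye(n), x)
            except np.linalg.LinAlgError:
                break                        # shift hit an eigenvalue (essentially) exactly
            x = y / np.linalg.norm(y)
            mu_new = x @ (A @ x)
            if abs(mu_new - mu) < tol * max(abs(mu_new), 1.0):
                mu = mu_new
                break
            mu = mu_new
        return mu, x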
For computing all the eigenvalues of A, there is a powerful class of iterative methods that
can be used, such as the QR algorithm, which will not be covered here.
3 Deflation
Here we introduce a practical trick that is occasionally useful if used carefully. For robust
computation, other more sophisticated methods are used instead.
Definition: Given a method that finds one solution to a system, the process of 'dividing out' the found solutions one by one to get all of them is called deflation.
The classic setting is polynomial root finding: having found a root of p(x) (say by Newton's method), divide it out and continue on the quotient. In theory, this will 'divide out' the found roots one by one, e.g.

p = (x − 1)(x − 2)(x − 3)  =⇒  q0 = (x − 1)(x − 3)  =⇒  q1 = (x − 1).

Note that each q_k is smooth; the division does not add any singularities.
However, if the root x0 is not computed exactly, e.g. x̃0 ≈ x0, there is trouble, since then

q0(x) = ((x − x0)/(x − x̃0)) · (· · · ) ,

which introduces a 'singularity' at x̃0 that can be disastrous unless x̃0 is computed to very high accuracy. It could be that Newton's method on q0 still converges to x0 (the root already found), or it may just blow up. The method relies on a theoretical guarantee that must hold exactly for the method to be correct. Otherwise, it may work in practice, or it may not.
This is not to say the method is worthless; just that it should be used cautiously, and
without high expectations.
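A tiny illustration of the issue with the cubic above, using NumPy's polynomial division (coefficients in descending order; the perturbation size is an arbitrary illustrative choice):

    import numpy as np

    # p(x) = (x - 1)(x - 2)(x - 3) = x^3 - 6x^2 + 11x - 6
    p = np.array([1., -6., 11., -6.])

    # Divide out an *inexact* root 2 + 1e-6. The remainder is no longer zero,
    # so the computed q0 is only approximately (x - 1)(x - 3), and
    # p(x)/(x - x0_tilde) itself has a near-singularity at the inexact root.
    q0, rem = np.polydiv(p, np.array([1., -(2 + 1e-6)]))
    print(q0)    # close to the coefficients of x^2 - 4x + 3
    print(rem)   # small but nonzero remainder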
Reminder (outer product): The outer product vw^T of two column vectors v, w ∈ R^n is the matrix

C = v w^T  with  c_ij = v_i w_j .

That is, the (i, j) component of the outer product is the i-th component of v times the j-th component of w. Observe that the rank of an outer product is always one; for this reason a matrix of the form

B = A + v w^T

is called a rank-one update of A. Note that the transpose notation makes some manipulations convenient, e.g.

(v w^T) x = v (w^T x) = (w · x) v .
Now suppose λ1 and v1 are known. Pick a vector x with x^T v1 = 1 and set B = A − λ1 v1 x^T. Then B v1 = 0 and, for j ≥ 2, λ_j is still an eigenvalue of B, with eigenvector of the form w_j = v_j + c v1. Since x^T w_j = c + x^T v_j, the right value of c is

c = −λ1 x^T v_j / λ_j ,

so v_j can be recovered from the computed w_j via

v_j = w_j − c v1 .    (6)

Deflation for the second-largest eigenvalue: To compute λ2 and v2 (where |λ1| > |λ2| > · · · ):
• Compute λ1 and v1 (e.g. with the power method).
• Form B = A − λ1 v1 x^T.
• Use the power method on B to obtain λ2 and w2, then recover the eigenvector v2 for A from (6).
It is not hard to find such a vector x, but the numerical stability depends on this choice. One option is Wielandt deflation, which chooses

x = (1/(λ1 (v1)_1)) R1

where R1 = (a11, · · · , a1n)^T is the first row of A and (v1)_1 is the first component of v1.
The peril is that deflation is numerically unstable, and repeated applications can lead to
disaster. Using it to get λ2 is usually fine except for ill-behaved eigen-problems, but it is
not advisable to use it to find all the eigenvalues.
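A sketch of one deflation step in Python/NumPy (the helper names are illustrative; the inner power iteration mirrors the one from section 1):

    import numpy as np

    def dominant_eig(A, x0, iters=200):
        # Basic power method returning (eigenvalue estimate, unit eigenvector).
        q = np.array(x0, dtype=float)
        for _ in range(iters):
            q = A @ q
            q = q / np.linalg.norm(q)
        return q @ (A @ q), q

    def wielandt_deflate(A, lam1, v1):
        # Wielandt deflation: B = A - lam1 * v1 x^T with
        # x = (first row of A) / (lam1 * (v1)_1), which guarantees x^T v1 = 1.
        x = A[0, :] / (lam1 * v1[0])
        return A - lam1 * np.outer(v1, x)

    # Usage sketch for the second-largest eigenvalue:
    #   lam1, v1 = dominant_eig(A, x0)
    #   B = wielandt_deflate(A, lam1, v1)
    #   lam2, w2 = dominant_eig(B, x0)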
As an illustration of the peril, suppose A is symmetric with

A = λ1 v1 v1^T + λ2 v2 v2^T ,

where v1^T v1 = v2^T v2 = 1 and v1^T v2 = 0. Then A has eigenvalues λ1, λ2 and eigenvectors v1, v2. Now suppose an approximation λ̃1 ≈ λ1 is computed. We then deflate,

B = A − λ̃1 v1 v1^T = (λ1 − λ̃1) v1 v1^T + λ2 v2 v2^T ,

choosing x = v1 for simplicity (of the example). Now take

λ1 = 10^8 ,   λ2 = 10^{−8} .

In double precision, even an accurately computed λ̃1 carries a rounding error of roughly ε_mach |λ1| ≈ 10^{−16} · 10^8 = 10^{−8}, which is the same size as λ2. Thus the spurious 'leftover' (λ1 − λ̃1) v1 v1^T from the deflation is actually as large as the part we are after, and the power method applied to B cannot see λ2.
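A small numerical illustration of this failure mode (a sketch; the assumed error in λ̃1 of about one ulp of λ1 stands in for an otherwise perfect computation):

    import numpy as np

    v1 = np.array([1.0, 0.0]); v2 = np.array([0.0, 1.0])
    lam1, lam2 = 1e8, 1e-8
    A = lam1 * np.outer(v1, v1) + lam2 * np.outer(v2, v2)

    lam1_tilde = lam1 + 2e-8               # error of roughly one ulp of lam1
    B = A - lam1_tilde * np.outer(v1, v1)  # deflate with x = v1
    # The leftover (lam1 - lam1_tilde) v1 v1^T is of the same order as lam2,
    # so the dominant eigenvalue of B is no longer a reliable estimate of lam2.
    print(np.linalg.eigvalsh(B))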
4 An example: PageRank
The notes here expand on the brief discussion in the textbook. For an introduction, see
https://www.mathworks.com/content/dam/mathworks/mathworks-dot-com/moler/exm/
chapters/pagerank.pdf. For a detailed exposition, the best source is Langville & Meyer,
Google's PageRank and Beyond.
Imagine a hypothetical user (a 'surfer') meandering through the web by following links. The surfer will visit pages that are more connected more often. Label the pages W_1, · · · , W_n. We define x_i to be the (long-run) probability that the surfer is at page W_i, p_ij to be the probability that a surfer at page W_i moves next to page W_j, and the transition matrix P whose entries are p_ij. A small example and its transition matrix are shown in Figure 2.
If the surfer is at page W_i, they must have come from some other page W_j, which occurs with probability p_ji. It should follow that

x_i = Σ_j p_ji x_j .
Figure 2: Directed graph of four linked pages to be ranked and the transition matrix.
In probabilistic terms,

P(at W_i) = Σ_j P(goes from W_j to W_i | at W_j) · P(at W_j).
Now we need to know how likely it is that the surfer goes from i to j. We need one last definition; the adjacency matrix A is the matrix whose (i, j) entry is 1 if there is a link from W_i to W_j:

a_ij = 1 if W_i points to W_j,   a_ij = 0 otherwise.
The PageRank algorithm (in its simplest form) makes the assumption that the surfer clicks a link uniformly at random to leave a page:

p_ji = a_ji / N_j ,   where N_j = # of outgoing links from W_j.
In matrix form, the condition x_i = Σ_j p_ji x_j reads

x = P^T x

or, setting M = P^T,

x = M x.    (8)
Thus x is an eigenvector of M with eigenvalue 1. It can be shown that if the system is 'irreducible' (all states can be reached from all others), then:
• λ = 1 is an eigenvalue of M, with an eigenvector whose entries can be taken non-negative;
• all other eigenvalues are less than one in magnitude.
Thus there is a unique eigenvector x (the 'ranking') normalized so that

Σ_i x_i = 1.
The popularity of a page is then measured by this ranking. To get around connectivity issues (e.g. pages that do not link back to the rest of the web) and for numerical reasons, there is one more ingredient in the basic model. The surfer is assumed to jump to a random page (regardless of links) with a small probability 1 − α. That is,

m_ij = α a_ji / N_j + (1 − α)/n .
The value of α is typically chosen to be near 1, e.g. α ≈ 0.85. In matrix terms,

M = α P^T + ((1 − α)/n) E    (9)

where E is a matrix of all ones.
4.3 Computation
The ranking x is the eigenvector for the dominant eigenvalue λ = 1. Thus the power method
can be employed. Moreover, we do not need to normalize at each step (why not?). The matrix P^T has a simple structure, so the multiplication M x can be implemented quite efficiently.
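A sketch of this computation in Python/NumPy, following (9) (the function name is illustrative, and pages with no outgoing links are assumed away):

    import numpy as np

    def pagerank(Adj, alpha=0.85, iters=100):
        # Power method for x = M x with M = alpha * P^T + (1 - alpha)/n * E, eq. (9).
        # Adj[i, j] = 1 if page i links to page j; every page is assumed to have
        # at least one outgoing link.
        n = Adj.shape[0]
        N = Adj.sum(axis=1)                   # outgoing links per page
        P = Adj / N[:, None]                  # p_ij = a_ij / N_i
        M = alpha * P.T + (1 - alpha) / n * np.ones((n, n))
        x = np.ones(n) / n                    # start from the uniform distribution
        for _ in range(iters):
            x = M @ x                         # column sums of M are 1, so sum(x) stays 1
        return x

    # e.g. pagerank(Adj) for the adjacency matrix of the example in Figure 2.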
4.4 Example
Let us compute the ranking for the small system in Figure 2. The transition probabilities
and transition matrix P are also shown. The adjacency matrix and N values are
A = [ 0 1 1 1
      0 0 1 0
      1 0 0 0
      0 0 1 0 ] ,    N = (3, 1, 2, 1)^T .
Figure 3: Ranking (darker shade =⇒ higher rank) and computed x for the four-page
example.
M = α P^T + ((1 − α)/n) E

where E is a matrix of all ones, and we seek an eigenvector x such that

x = M x.
Applying the power method (no normalization is required because the eigenvalue is 1) with α = 0.85, we obtain the computed ranking x shown in Figure 3, which indicates that the highest ranked page should be #3. Since 3 is in the middle of the network and has the most incoming links, the result makes sense. Similarly, #2 is last, as it is only linked to from #1.
For a larger example, see the Matlab demonstration linked at the start of the section.