0% found this document useful (0 votes)

29 views17 pages

The Fast Convergence of Incremental PCA

This document summarizes the fast convergence rates of two incremental PCA algorithms: Krasulina (1969) and Oja (1983). The algorithms update their estimate of the top eigenvector in an online fashion as new data points arrive. The paper analyzes the convergence rates of these algorithms and shows that their potential functions decrease at a rate of O(1/n), meaning the estimates converge to the true top eigenvector at a rate proportional to the number of samples seen. It also discusses practical considerations like initialization and setting the learning rate parameter.

Uploaded by

Quarktop

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

29 views17 pages

The Fast Convergence of Incremental PCA

Uploaded by

Quarktop

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 17

The Fast Convergence of Incremental PCA

Akshay Balsubramani Sanjoy Dasgupta Yoav Freund

UC San Diego UC San Diego UC San Diego
[email protected] [email protected] [email protected]

Abstract

We consider a situation in which we see samples Xn ∈ Rd drawn i.i.d. from some

distribution with mean zero and unknown covariance A. We wish to compute the
top eigenvector of A in an incremental fashion: that is, with an algorithm that
maintains an estimate of the top eigenvector, in O(d) space, and incrementally
adjusts the estimate with each new data point that arrives. Two classical such
schemes are due to Krasulina (1969) and Oja (1983). We give finite-sample con-
vergence rates for both.

1 Introduction

Principal component analysis (PCA) is a popular form of dimensionality reduction that projects a
data set on the top eigenvector(s) of its covariance matrix. The default method for computing these
eigenvectors uses O(d2 ) space for data in Rd , which can be prohibitive in practice. It is therefore
of interest to study incremental schemes that take one data point at a time, updating their estimates
of the desired eigenvectors with each new point. For computing one eigenvector, such methods use
O(d) space.
For the case of the top eigenvector, this problem has long been studied, and two elegant solutions
were obtained by Krasulina [7] and Oja [9]. Their methods are closely related. At time n − 1, they
have some estimate Vn−1 ∈ Rd of the top eigenvector. Upon seeing the next data point, Xn , they
update this estimate as follows:
V T Xn XnT Vn−1

Vn = Vn−1 + γn Xn XnT − n−1 I d Vn−1 (Krasulina)
kVn−1 k2
Vn−1 + γn Xn XnT Vn−1
Vn = (Oja)
kVn−1 + γn Xn XnT Vn−1 k
Here γn is a “learning rate” that is typically proportional to 1/n.
Suppose the points X1 , X2 , . . . are drawn i.i.d. from a distribution on Rd with mean zero and co-
variance A. The original papers proved that these estimators converge almost surely to the top
eigenvector of A, call it v ∗ , under mild conditions:
P P 2
• n γn = ∞ while n γn < ∞.
• If λ1 , λ2 denote the top two eigenvalues of A, then λ1 > λ2 .
• EkXn kk < ∞ for some suitable k (for instance, k = 8 works).

There are also other incremental estimators for which convergence has not been established; see, for
instance, [12] and [16].
In this paper, we analyze the rate of convergence of the Krasulina and Oja estimators. They can
be treated in a common framework, as stochastic approximation algorithms for maximizing the

1
Rayleigh quotient
v T Av
G(v) = .
vT v
The maximum value of this function is λ1 , and is achieved at v ∗ (or any nonzero multiple thereof).
The gradient is
v T Av

2
∇G(v) = A − Id v.
kvk2 vT v
Since EXn XnT = A, we see that Krasulina’s method is stochastic gradient descent. The Oja proce-
dure is closely related: as pointed out in [10], the two are identical to within second-order terms.
Recently, there has been a lot of work on rates of convergence for stochastic gradient descent (for in-
stance, [11]), but this has typically been limited to convex cost functions. These results do not apply
to the non-convex Rayleigh quotient, except at the very end, when the system is near convergence.
Most of our analysis focuses on the build-up to this finale.
We measure the quality of the solution Vn at time n using the potential function
(Vn · v ∗ )2
Ψn = 1 − ,
kVn k2
where v ∗ is taken to have unit norm. This quantity lies in the range [0, 1], and we are interested in
the rate at which it approaches zero. The result, in brief, is that E[Ψn ] = O(1/n), under conditions
that are similar to those above, but stronger. In particular, we require that γn be proportional to 1/n
and that kXn k be bounded.

1.1 The algorithm

We analyze the following procedure.

1. Set starting time. Set the clock to time no .

2. Initialization. Initialize Vno uniformly at random from the unit sphere in Rd .
3. For time n = no + 1, no + 2, . . .:
(a) Receive the next data point, Xn .
(b) Update step. Perform either the Krasulina or Oja update, with γn = c/n.

The first step is similar to using a learning rate of the form γn = c/(n + no ), as is sometimes done
in stochastic gradient descent implementations [1]. We have adopted it because the initial sequence
of updates is highly noisy: during this phase Vn moves around wildly, and cannot be shown to make
progress. It becomes better behaved when the step size γn becomes smaller, that is to say, when n
gets larger than some suitable no . By setting the start time to no , we can simply fast-forward the
analysis to this moment.

1.2 Initialization

One possible initialization is to set Vno to the first data point that arrives, or to the average of a few
data points. This seems sensible enough, but can fail dramatically in some situations.
Here is an example. Suppose X can take on just 2d possible values: ±e1 , ±σe2 , . . . , ±σed , where
the ei are coordinate directions and 0 < σ < 1 is a small constant. Suppose further that the
distribution of X is specified by a single positive number p < 1:
p
Pr(X = e1 ) = Pr(X = −e1 ) =
2
1−p
Pr(X = σei ) = Pr(X = −σei ) = for i > 1
2(d − 1)
Then X has mean zero and covariance diag(p, σ 2 (1 − p)/(d − 1), . . . , σ 2 (1 − p)/(d − 1)). We will
assume that p and σ are chosen so that p > σ 2 (1 − p)/(d − 1); in our notation, the top eigenvalues
are then λ1 = p and λ2 = σ 2 (1 − p)/(d − 1), and the target vector is v ∗ = e1 .

2
If Vn is ever orthogonal to some ei , it will remain so forever. This is because both the Krasulina and
Oja updates have the following properties:
Vn−1 · Xn = 0 =⇒ Vn = Vn−1
Vn−1 · Xn 6= 0 =⇒ Vn ∈ span(Vn−1 , Xn ).
If Vno is initialized to a random data point, then with probability 1 − p, it will be assigned to some
ei with i > 1, and will converge to a multiple of that same ei rather than to e1 . Likewise, if it is
initialized to the average of ≤ 1/p data points, then with constant probability it will be orthogonal
to e1 and remain so always.
Setting Vno to a random unit vector avoids this problem. However, there are doubtless cases, for
instance when the data has intrinsic dimension d, in which a better initializer is possible.

1.3 The setting of the learning rate

In order to get a sense of what rates of convergence we might expect, let’s return to the example of a
random vector X with 2d possible values. In the Oja update, Vn = Vn−1 + γn Xn XnT Vn−1 , we can
ignore normalization if we are merely interested in the progress of the potential function Ψn . Since
the Xn correspond to coordinate directions, each update changes just one coordinate of V :
Xn = ±e1 =⇒ Vn,1 = Vn−1,1 (1 + γn )
Xn = ±σei =⇒ Vn,i = Vn−1,i (1 + σ 2 γn )
Recall that we initialize Vno to a random vector from the unit sphere. For simplicity, let’s just
suppose that no = 0 and that this initial value is the all-ones vector (again, we don’t have to worry
about normalization). On each iteration the first coordinate is updated with probability exactly
p = λ1 , and thus
E[Vn,1 ] = (1 + λ1 γ1 )(1 + λ1 γ2 ) · · · (1 + λ1 γn ) ∼ exp(λ1 (γ1 + · · · + γn )) ∼ ncλ1
since γn = c/n. Likewise, for i > 1,
E[Vn,i ] = (1 + λ2 γ1 )(1 + λ2 γ2 ) · · · (1 + λ2 γn ) ∼ ncλ2 .
If all goes according to expectation, then at time n,
2
Vn,1 n2cλ1 d−1
Ψn = 1 − 2
∼ 1 − 2cλ1 ∼ 2c(λ −λ ) .
kVn k n + (d − 1)n2cλ2 n 1 2

(This is all very rough, but can be made precise by obtaining concentration bounds for ln Vn,i .)
From this, we can see that it is not possible to achieve a O(1/n) rate unless c ≥ 1/(2(λ1 − λ2 )).
Therefore, we will assume this when stating our final results, although most of our analysis is in
terms of general γn . An interesting practical question, to which we don’t have an answer, is how
one would empirically set c without prior knowledge of the eigenvalue gap.

1.4 Nested sample spaces

For n ≥ no , let Fn denote the sigma-field of all outcomes up to and including time n, that is,
Fn = σ(Vno , Xno +1 , . . . , Xn ). We start by showing that
E[Ψn |Fn−1 ] ≤ Ψn−1 (1 − 2γn (λ1 − λ2 )(1 − Ψn−1 )) + O(γn2 ).
Initially Ψn is likely to be close to 1. For instance, if the initial Vno is picked uniformly at random
from the surface of the unit sphere in Rd , then we’d expect Ψno ≈ 1 − 1/d. This means that the
initial rate of decrease is very small, because of the (1 − Ψn−1 ) term.
To deal with this, we divide the analysis into epochs: the first takes Ψn from 1 − 1/d to 1 − 2/d, the
second from 1−2/d to 1−4/d, and so on until Ψn finally drops below 1/2. We use martingale large
deviation bounds to bound the length of each epoch, and also to argue that Ψn does not regress. In
particular, we establish a sequence of times nj such that (with high probability)
2j
sup Ψn ≤ 1 − . (1)
n≥nj d

3
The analysis of each epoch uses martingale arguments, but at the same time, assumes that Ψn re-
mains bounded above. Combining the two requires a careful specification of the sample space at
each step. Let Ω denote the sample space of all realizations (vno , xno +1 , xno +2 , . . .), and P the
probability distribution on these sequences. For any δ > 0, we define a nested sequence of spaces
Ω ⊃ Ω0no ⊃ Ω0no +1 ⊃ · · · such that each Ω0n is Fn−1 -measurable, has probability P (Ω0n ) ≥ 1 − δ,
and moreover consists exclusively of realizations ω ∈ Ω that satisfy the constraints (1) upto and
including time n − 1. We can then build martingale arguments by restricting attention to Ω0n when
computing the conditional expectations of quantities at time n.

1.5 Main result

We make the following assumptions:

(A1) The Xn ∈ Rd are i.i.d. with mean zero and covariance A.

(A2) There is a constant B such that kXn k2 ≤ B.
(A3) The eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λd of A satisfy λ1 > λ2 .
(A4) The step sizes are of the form γn = c/n.

Under these conditions, we get the following rate of convergence.

Theorem 1.1. There are absolute constants Ao , A1 > 0 and 1 < a < 4 for which the following
holds. Pick any 0 < δ < 1, and any co > 2. Set the step sizes to γn = c/n, where c = co /(2(λ1 −
λ2 )), and set the starting time to no ≥ (Ao B 2 c2 d2 /δ 4 ) ln(1/δ). Then there is a nested sequence of
subsets of the sample space Ω ⊃ Ω0no ⊃ Ω0no +1 ⊃ · · · such that for any n ≥ no , we have:

P (Ω0n ) ≥ 1 − δ and
a c /2
(Vn · v ∗ )2
2 2 co /no
c B e 1 d no + 1 o
En ≥1− − A1 ,
kVn k2 2(co − 2) n+1 δ2 n+1

where En denotes expectation restricted to Ω0n .

Since co > 2, this bound is of the form E[Ψn ] = O(1/n).

1.6 Related work

There is an extensive line of work analyzing PCA from the statistical perspective, in which the con-
vergence of various estimators is characterized under certain conditions, including generative models
of the data [5] and various assumptions on the covariance matrix spectrum [14, 4] and eigenvalue
spacing [17]. Such works do provide finite-sample guarantees, but they apply only to the batch case
and/or are computationally intensive, rather than considering an efficient incremental algorithm.
Among incremental algorithms, the work of Kuzmin et al. [15] describes and analyzes worst-case
online PCA, using an experts-setting algorithm with a super-quadratic per-iteration cost. More ef-
ficient general-purpose incremental PCA algorithms have lacked finite-sample analyses [2]. There
have been recent attempts to remedy this situation by relaxing the nonconvexity inherent in the prob-
lem [3] or making generative assumptions [8]. The present paper directly analyzes the oldest known
incremental PCA algorithms under relatively mild assumptions.

2 Outline of proof

We now sketch the proof of Theorem 1.1; almost all the details are relegated to the appendix.
Recall that for n ≥ no , we take Fn to be the sigma-field of all outcomes up to and including time n,
that is, Fn = σ(Vno , Xno +1 , . . . , Xn ).
An additional piece of notation: we will use ub to denote u/kuk, the unit vector in the direction of
u ∈ Rd . Thus, for instance, the Rayleigh quotient can be written G(v) = vbT Ab
v.

4
2.1 Expected per-step change in potential

We first bound the expected improvement in Ψn in each step of the Krasulina or Oja algorithms.
Theorem 2.1. For any n > no , we can write Ψn ≤ Ψn−1 + βn − Zn , where
2 2
γn B /4 (Krasulina)
βn =
5γn2 B 2 + 2γn3 B 3 (Oja)
and where Zn is a Fn -measurable random variable with the following properties:

• E[Zn |Fn−1 ] = 2γn (Vbn−1 · v ∗ )2 (λ1 − G(Vn−1 )) ≥ 2γn (λ1 − λ2 )Ψn−1 (1 − Ψn−1 ) ≥ 0.
• |Zn | ≤ 4γn B.

The theorem follows from Lemmas A.4 and A.5 in the appendix. Its characterization of the two
estimators is almost identical, and for simplicity we will henceforth deal only with Krasulina’s
estimator. All the subsequent results hold also for Oja’s method, upto constants.

2.2 A large deviation bound for Ψn

We know from Theorem 2.1 that Ψn ≤ Ψn−1 +βn −Zn , where βn is a constant and Zn is a quantity
of positive expected value. Thus, in expectation, and modulo a small additive term, Ψn decreases
monotonically. However, the amount of decrease at the nth time step can be arbitrarily small when
Ψn is close to 1. Thus, we need to show that Ψn is eventually bounded away from 1, that is, there
exists some o > 0 and some time no such that for any n ≥ no , we have Ψn ≤ 1 − o .
Recall from the algorithm specification that we advance the clock so as to skip the pre-no phase.
Given this, what can we expect o to be? If the initial estimate Vno√is a random unit vector, then
E[Ψno ] = 1 − 1/d and, roughly speaking, Pr(Ψno > 1 − /d) = O( ). If no is sufficiently large,
then Ψn may subsequently increase a little bit, but not by very much. In this section, we establish
the following bound.
Theorem 2.2. Suppose the initial estimate Vno is chosen uniformly at random from the surface of
the unit sphere in Rd . Assume also that the step sizes are of the form γn = c/n, for some constant
c > 0. Then for any 0 < < 1, if no ≥ 2B 2 c2 d2 /2 , we have
√

Pr sup Ψn ≥ 1 − ≤ 2e.
n≥no d

To prove this, we start with a simple recurrence for the moment-generating function of Ψn .
Lemma 2.3. Consider a filtration (Fn ) and random variables Yn , Zn ∈ Fn such that there are two
sequences of nonnegative constants, (βn ) and (ζn ), for which:

• Yn ≤ Yn−1 + βn − Zn .
• Each Zn takes values in an interval of length ζn .

Then for any t > 0, we have E[etYn |Fn−1 ] ≤ exp(t(Yn−1 − E[Zn |Fn−1 ] + βn + tζn2 /8)).

This relation shows how to define a supermartingale based on etYn , from which we can derive a
large deviation bound on Yn .
Lemma 2.4. Assume the conditions of Lemma 2.3, and also that E[Zn |Fn−1 ] ≥ 0. Then, for any
integer m and any ∆, t > 0,
X
Pr sup Yn ≥ ∆ ≤ E[etYm ] exp − t ∆ − (β` + tζ`2 /8) .

n≥m
`>m

In order to apply this to the sequence (Ψn ), we need to first calculate the moment-generating func-
tion of its starting value Ψno .

5
Lemma 2.5. Suppose a vector V is picked uniformly at random from the surface of the unit sphere
in Rd , where d ≥ 3. Define Y = 1 − (V12 )/kV k2 . Then, for any t > 0,
r
tY t d−1
Ee ≤ e .
2t

Putting these pieces together yields Theorem 2.2.

2.3 Intermediate epochs of improvement

We have seen that, for suitable and no , it is likely that Ψn ≤ 1 − /d for all n ≥ no . We now
define a series of epochs in which 1 − Ψn successively doubles, until Ψn finally drops below 1/2.
To do this, we specify intermediate goals (no , o ), (n1 , 1 ), (n2 , 2 ), . . . , (nJ , J ), where no < n1 <
· · · < nJ and o < 1 < · · · < J = 1/2, with the intention that:
For all 0 ≤ j ≤ J, we have sup Ψn ≤ 1 − j . (2)
n≥nj

Of course, this can only hold with a certain probability.

Let Ω denote the sample space of all realizations (vno , xno +1 , xno +2 , . . .), and P the probability
distribution on these sequences. We will show that, for a certain choice of {(nj , j )}, all J + 1
constraints (2) can be met by excluding just a small portion of Ω.
We consider a specific realization ω ∈ Ω to be good if it satisfies (2). Call this set Ω0 :
Ω0 = {ω ∈ Ω : sup Ψn (ω) ≤ 1 − j for all 0 ≤ j ≤ J}.
n≥nj

For technical reasons, we also need to look at realizations that are good up to time n−1. Specifically,
for each n, define
Ω0n = {ω ∈ Ω : sup Ψ` (ω) ≤ 1 − j for all 0 ≤ j ≤ J}.
nj ≤`<n

Crucially, this is Fn−1 -measurable. Also note that Ω0 = Ω0n .

T
n>no
We can talk about expectations under the distribution P restricted to subsets of Ω. In particular, let
Pn be the restriction of P to Ω0n ; that is, for any A ⊂ Ω, we have Pn (A) = P (A ∩ Ω0n )/P (Ω0n ). As
for expectations with respect to Pn , for any function f : Ω → R, we define
Z
1
En f = f (ω)P (dω).
P (Ω0n ) Ω0n

Here is the main result of this section.

Theorem 2.6. Assume that γn = c/n, where c = co /(2(λ1 − λ2 )) and co > 0. Pick any 0 < δ < 1
and select a schedule (no , o ), . . . , (nJ , J ) that satisfies the conditions
δ2 3 1
o = 8ed , and 2 j ≤ j+1 ≤ 2j for 0 ≤ j < J, and J−1 ≤ 4
(3)
(nj+1 + 1) ≥ e5/co (nj + 1) for 0 ≤ j < J
as well as no ≥ (20c2 B 2 /2o ) ln(4/δ). Then Pr(Ω0 ) ≥ 1 − δ.

The first step towards proving this theorem is bounding the moment-generating function of Ψn in
terms of that of Ψn−1 .
Lemma 2.7. Suppose n > nj . Suppose also that γn = c/n, where c = co /(2(λ1 − λ2 )). Then for
any t > 0,
2 2
tΨn
h co j i c B t(1 + 32t)
En [e ] ≤ En exp tΨn−1 1 − exp .
n 4n2

6
We would like to use this result to bound En [Ψn ] in terms of Em [Ψm ] for m < n. The shift in
sample spaces is easily handled using the following observation.
Lemma 2.8. If g : R → R is nondecreasing, then En [g(Ψn−1 )] ≤ En−1 [g(Ψn−1 )] for any n > no .

A repeated application of Lemmas 2.7 and 2.8 yields the following.

Lemma 2.9. Suppose that conditions (3) hold. Then for 0 ≤ j < J and any t > 0,
tc2 B 2 (1 + 32t) 1

tΨnj+1 1
Enj+1 [e ] ≤ exp t(1 − j+1 ) − tj + − .
4 nj nj+1

Now that we have bounds on the moment-generating functions of intermediate Ψn , we can apply
martingale deviation bounds, as in Lemma 2.4, to obtain the following, from which Theorem 2.6
ensues.
Lemma 2.10. Assume conditions (3) hold. Pick any 0 < δ < 1, and set no ≥ (20c2 B 2 /2o ) ln(4/δ).
Then !
J
X δ
Pnj sup Ψn > 1 − j ≤ .
j=1 n≥n j
2

2.4 The final epoch

Recall the definition of the intermediate goals (nj , j ) in (2), (3). The final epoch is the period
n ≥ nJ , at which point Ψn ≤ 1/2. The following consequence of Lemmas A.4 and 2.8 captures
the rate at which Ψ decreases during this phase.
Lemma 2.11. For all n > nJ ,
En [Ψn ] ≤ (1 − αn )En−1 [Ψn−1 ] + βn ,
where αn = (λ1 − λ2 )γn and βn = (B 2 /4)γn2 .

By solving this recurrence relation, and piecing together the various epochs, we get the overall
convergence result of Theorem 1.1.

3 Experiments
When performing PCA in practice with massive d and a large/growing dataset, an incremental
method like that of Krasulina or Oja remains practically viable, even as quadratic-time and -memory
algorithms become increasingly impractical. Arora et al. [2] have a more complete discussion of
the empirical necessity of incremental PCA algorithms, including a version of Oja’s method which
is shown to be extremely competitive in practice.
Since the efficiency benefits of these types of algorithms are well understood, we now instead ex-
perimentally explore some other ways in which our main results seem to accurately characterize the
performance of Oja’s algorithm. We use the CMU PIE faces [13], consisting of 11554 images of
size 32 × 32, as a prototypical example of a dataset with most of its variance captured by a few PCs,
as shown in Fig. 1.
We expect from Theorem 1.1 and the discussion in the introduction that varying c (the constant in
the learning rate) will influence the overall rate of convergence. In particular, if c is low, then halving
it can be expected to halve the exponent of n, and the slope of the log-log convergence graph. This
is exactly what occurs in practice, as illustrated in Fig. 2 on the PIE data. The dotted line in that
figure is a convergence rate of 1/n, drawn as a guide.

4 Open problems
Several fundamental questions remain unanswered. First, the convergence rates of the two incre-
mental schemes depend on the multiplier c in the learning rate γn . If it is too low, convergence will

7
PIE Dataset Covariance Spectrum Oja Subspace Rule Dependence on c
0
5000 10

4500
−1
10
4000

3500 −2
10

Reconstruction Error
3000
Eigenvalue

−3
2500 10

2000
−4
10
c=6
1500 c=3
c=1.5
1000 −5 c=1
10
c=0.666
500 c=0.444
c=0.296
−6
0 10
0 1 2 3 4 5
0 5 10 15 20 25 30 10 10 10 10 10 10
Component Number Iteration Number

Figures 1 and 2.

be slower than O(1/n). If it is too high, the constant in the rate of convergence will be large. Is
there a simple and practical scheme for setting c?
Second, what can be said about incrementally estimating the top p eigenvectors, for p > 1? Oja’s
method extends easily to this case: the estimate at time n is a d × p matrix Vn whose columns
correspond to the eigenvectors. The invariant VnT Vn = Ip is always maintained, and when a new
data point Xn ∈ Rd arrives, the following update is performed:
Wn = Vn−1 + γn Xn XnT Vn−1
Vn = orth(Wn )
where the second step is an orthogonalization, for instance by Gram-Schmidt. It would be interesting
to characterize the rate of convergence of this scheme.
Finally, our analysis applies to a modified procedure in which the starting time no is artificially set
to a large constant. This seems unnecessary in practice, and it would be useful to extend the analysis
to the case where no = 0.

Acknowledgements

The authors are grateful to the National Science Foundation for support under grant IIS-1162581.

References
[1] A. Agarwal, O. Chapelle, M. Dudı́k, and J. Langford. A reliable effective terascale linear
learning system. CoRR, abs/1110.4198, 2011.
[2] R. Arora, A. Cotter, K. Livescu, and N. Srebro. Stochastic optimization for PCA and PLS.
In 50th Annual Allerton Conference on Communication, Control, and Computing, pages 861–
868. 2012.
[3] R. Arora, A. Cotter, and N. Srebro. Stochastic optimization of PCA with capped MSG. In
Advances in Neural Information Processing Systems, 2013.
[4] G. Blanchard, O. Bousquet, and L. Zwald. Statistical properties of kernel principal component
analysis. Machine Learning, 66(2-3):259–294, 2007.
[5] T. T. Cai, Z. Ma, and Y. Wu. Sparse PCA: Optimal rates and adaptive estimation. CoRR,
abs/1211.1309, 2012.
[6] R. Durrett. Probability: Theory and Examples. Duxbury, second edition, 1995.
[7] T.P. Krasulina. A method of stochastic approximation for the determination of the least eigen-
value of a symmetrical matrix. USSR Computational Mathematics and Mathematical Physics,
9(6):189–195, 1969.

8
[8] I. Mitliagkas, C. Caramanis, and P. Jain. Memory limited, streaming PCA. In Advances in
Neural Information Processing Systems, 2013.
[9] E. Oja. Subspace Methods of Pattern Recognition. Research Studies Press, 1983.
[10] E. Oja and J. Karhunen. On stochastic approximation of the eigenvectors and eigenvalues of
the expectation of a random matrix. Journal of Math. Analysis and Applications, 106:69–84,
1985.
[11] A. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly convex
stochastic optimization. In International Conference on Machine Learning, 2012.
[12] S. Roweis. EM algorithms for PCA and SPCA. In Advances in Neural Information Processing
Systems, 1997.
[13] T. Sim, S. Baker, and M. Bsat. The CMU pose, illumination, and expression database. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 25(12):1615–1618, 2003.
[14] V.Q. Vu and J. Lei. Minimax rates of estimation for sparse PCA in high dimensions. Journal
of Machine Learning Research - Proceedings Track, 22:1278–1286, 2012.
[15] M.K. Warmuth and D. Kuzmin. Randomized PCA algorithms with regret bounds that are
logarithmic in the dimension. In Advances in Neural Information Processing Systems. 2007.
[16] J. Weng, Y. Zhang, and W.-S. Hwang. Candid covariance-free incremental principal compo-
nent analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(8):1034–
1040, 2003.
[17] L. Zwald and G. Blanchard. On the convergence of eigenspaces in kernel principal component
analysis. In Advances in Neural Information Processing Systems, 2005.

9
A Expected per-step change in potential
A.1 The change in potential of Krasulina’s update

Write Krasulina’s update equation as

Vn = Vn−1 + γn ξn

ξn = Xn XnT − Vbn−1
T
Xn XnT Vbn−1 Id Vn−1

We start with some basic observations.

Lemma A.1. For all n > no ,

(a) ξn is orthogonal to Vn−1 .

(b) kξn k2 ≤ B 2 kVn−1 k2 /4.
(c) E[ξn |Fn−1 ] = AVn−1 − G(Vn−1 )Vn−1 .
(d) kVn k ≥ kVn−1 k.

Proof. For (a), let Xn⊥ denote the component of Xn orthogonal to Vn−1 . Then
ξn = (Vn−1 ·Xn )Xn −(Vbn−1 ·Xn )2 Vn−1 = (Vn−1 ·Xn )(Xn −(Vbn−1 ·Xn )Vbn−1 ) = (Vn−1 ·Xn )Xn⊥ .

For (b), note from the previous formulation that kξn k2 = (Vn−1 ·Xn )2 kXn⊥ k2 ≤ kVn−1 k2 kXn k4 /4.
Part (c) follows directly from E[Xn XnT |Fn−1 ] = A.
For (d), we use kVn k2 = kVn−1 + γn ξn k2 = kVn−1 k2 + γn2 kξn k2 ≥ kVn−1 k2 .

We now check that (Vn · v ∗ )2 grows in expectation with each iteration.

Lemma A.2. For any n > no , we have

(a) (Vn · v ∗ )2 ≥ (Vn−1 · v ∗ )2 + 2γn (Vn−1 · v ∗ )(ξn · v ∗ ).

(b) E[ξn · v ∗ |Fn−1 ] = (Vn−1 · v ∗ )(λ1 − G(Vn−1 )).

Proof. Part (a) follows directly from the update rule:

(Vn · v ∗ )2 = ((Vn−1 · v ∗ ) + γn (ξn · v ∗ ))2 ≥ (Vn−1 · v ∗ )2 + 2γn (Vn−1 · v ∗ )(ξn · v ∗ ).
Part (b) follows by substituting the expression for E[ξn |Fn−1 ] from Lemma A.1(c):
E[ξn · v ∗ |Fn−1 ] = (Vn−1
T
Av ∗ ) − G(Vn−1 )(Vn−1 · v ∗ ) = λ1 (Vn−1 · v ∗ ) − G(Vn−1 )(Vn−1 · v ∗ ).

In order to use Lemma A.2 to bound the change in potential Ψn , we need to relate Ψn to the quantity
λ1 − G(Vn ).
Lemma A.3. For any n ≥ no , we have λ1 − G(Vn ) ≥ (λ1 − λ2 )Ψn .

Proof. It is easiest to think of Vn in the eigenbasis of A: the component of Vn in direction v ∗ is

Vn · v ∗ , and the orthogonal component is Vn⊥ = Vn − (Vn · v ∗ )v ∗ . Then
VnT AVn (Vn · v ∗ )2 (V ⊥ )T AVn⊥ λ1 (Vn · v ∗ )2 + λ2 kVn⊥ k2
G(Vn ) = 2
= 2
λ1 + n 2
≤ .
kVn k kVn k kVn k kVn k2
Therefore,
λ1 (Vn · v ∗ )2 + λ2 (kVn k2 − (Vn · v ∗ )2 ) (Vn · v ∗ )2

λ1 −G(Vn ) ≥ λ1 − = (λ1 −λ2 ) 1 − = (λ1 −λ2 )Ψn .
kVn k2 kVn k2

10
We can now explicitly bound the expected change in Ψn in each iteration.
Lemma A.4. For any n > no , we can write Ψn ≤ Ψn−1 + βn − Zn , where βn = γn2 B 2 /4 and
where
Zn = 2γn (Vn−1 · v ∗ )(ξn · v ∗ )/kVn−1 k2
is a Fn -measurable random variable with the following properties:

• E[Zn |Fn−1 ] = 2γn (Vbn−1 · v ∗ )2 (λ1 − G(Vn−1 )) ≥ 2γn (λ1 − λ2 )Ψn−1 (1 − Ψn−1 ) ≥ 0.
• |Zn | ≤ 4γn B.

Proof. Using Lemmas A.1 and A.2(a),

kVn k2 − (Vn · v ∗ )2 kVn−1 k2 + γn2 kξn k2 − (Vn · v ∗ )2
Ψn = 2
≤
kVn k kVn−1 k2
1 (Vn · v ∗ )2
≤ 1 + γn2 B 2 −
4 kVn−1 k2
1 (Vn−1 · v ∗ )2 + 2γn (Vn−1 · v ∗ )(ξn · v ∗ )
≤ 1 + γn2 B 2 −
4 kVn−1 k2
1 (Vn−1 · v ∗ )(ξn · v ∗ )
= Ψn−1 + γn2 B 2 − 2γn ,
4 kVn−1 k2
which is Ψn−1 +βn −Zn . The conditional expectation of Zn can be determined from Lemma A.2(b):
2γn (Vn−1 · v ∗ )
E[Zn |Fn−1 ] = E[ξn · v ∗ |Fn−1 ] = 2γn (Vbn−1 · v ∗ )2 (λ1 − G(Vn−1 ))
kVn−1 k2
and this can be lower-bounded using Lemma A.3.
Finally, we need to determine the range of possible values of Zn . By expanding ξn , we get

Zn = 2γn (Vbn−1 · v ∗ ) (Xn · v ∗ )(Xn · Vbn−1 ) − (Vbn−1 · v ∗ )(Xn · Vbn−1 )2 .

Since kXn k2 ≤ B, we see that Zn must lie in the range ±4γn B.

A.2 The change in potential of the Oja update

Recall the Oja update:

Vn−1 + γn Xn XnT Vn−1
Vn = .
kVn−1 + γn Xn XnT Vn−1 k
Since our bounds are on the potential function Ψn , which is insensitive to the length of Vn , we can
skip the normalization, and instead just consider the update rule
Vn = Vn−1 + γn Xn XnT Vn−1 .

The final bounds, as well as many of the intermediate results, are almost exactly the same as for
Krasulina’s estimator. Here is the analog of Lemma A.4.
Lemma A.5. For any n > no , we can write Ψn ≤ Ψn−1 − Zn + βn , where Zn is the same as in
Lemma A.4 and βn = 5γn2 B 2 + 2γn3 B 3 .

Proof. This is a series of calculations. First,

(Vn · v ∗ )2 = ((Vn−1 · v ∗ ) + γn (Vn−1
T
Xn XnT v ∗ ))2
≥ (Vn−1 · v ∗ )2 + 2γn (Vn−1 · v ∗ )(Vn−1
T
Xn XnT v ∗ ).
Similarly,
kVn k2 = kVn−1 + γn Xn XnT Vn−1 k2
= kVn−1 k2 + γn2 kXn XnT Vn−1 k2 + 2γn (Vn−1 · Xn )2
≤ kVn−1 k2 (1 + γn2 B 2 + 2γn (Vbn−1 · Xn )2 )

11
where we have used kXn k2 ≤ B. Combining these,
(Vn · v ∗ )2 (Vn−1 · v ∗ )2 + 2γn (Vn−1 · v ∗ )(Vn−1T
Xn XnT v ∗ )
≥
kVn k2 kVn−1 k2 (1 + γn2 B 2 + 2γn (Vbn−1 · Xn )2 )
(Vbn−1 · v ∗ )2 + 2γn (Vbn−1 · v ∗ )(Vbn−1
T
Xn XnT v ∗ )
=
1 + γn2 B 2 + 2γn (Vbn−1 · Xn )2

≥ (Vbn−1 · v ∗ )2 + 2γn (Vbn−1 · v ∗ )(Vbn−1
T
Xn XnT v ∗ ) 1 − γn2 B 2 − 2γn (Vbn−1 · Xn )2

≥ (Vbn−1 · v ∗ )2 + 2γn (Vbn−1 · v ∗ ) Vbn−1
T
Xn XnT v ∗ − (Vbn−1 · v ∗ )(Vbn−1 · Xn )2 − 5γn2 B 2 − 2γn3 B 3
where the final step involves some extra algebra that we have omitted. The lemma now follows by
invoking Ψn = 1 − (Vbn · v ∗ )2 .

B A large deviation bound for Ψn

B.1 Proof of Lemma 2.3

For any t > 0,

h i
E etYn |Fn−1 ≤ E et(Yn−1 +βn −Zn ) |Fn−1

= et(Yn−1 +βn ) E e−tZn |Fn−1

B.2 Proof of Lemma 2.4

By Lemma 2.3,
t2 ζn2

tYn
E e |Fn−1 ≤ exp tYn−1 + tβn + .
8
Now let’s define an appropriate martingale. Let τn = `>n (β` +tζ`2 /8), and let Mn = exp(t(Yn +
P
τn )). Thus Mn ∈ Fn , and
t2 ζn2

E[Mn |Fn−1 ] = E[etYn |Fn−1 ] exp(tτn ) ≤ exp tYn−1 + tβn + + tτn = Mn−1 .
8
Thus (Mn ) is a positive-valued supermartingale adapted to (Fn ). A version of Doob’s martingale
inequality—see, for instance, page 274 of [6]—then says that for any m, we have Pr(supn≥m Mn ≥
δ) ≤ (EMm )/δ. Using this, we see that for any ∆ > 0,

t∆
Pr sup Yn ≥ ∆ ≤ Pr sup Yn + τn ≥ ∆ = Pr sup Mn ≥ e
n≥m n≥m n≥m
EMm
≤ t∆ = exp(−t(∆ − τm ))EetYm
e

B.3 Proof of Lemma 2.5

It is well known that V can be chosen by picking d values Z = (Z1 , . . . , Zd ) independently from
the standard normal distribution and then setting V = Z/kZk. Therefore,
Z22 + · · · + Zd2 W1
Y = = ,
Z12 + (Z22 + · · · + Zd2 ) W1 + W2

12
where W1 is drawn from a chi-squared distribution with d − 1 degrees of freedom and W2 is drawn
independently from a chi-squared distribution with one degree of freedom. This characterization
implies that Y follows the Beta((d − 1)/2, 1/2) distribution: specifically, for any 0 < y < 1,
Γ( d2 )
Pr(Y = y) = y (d−3)/2 (1 − y)−1/2 .
Γ( d−1 1
2 )Γ( 2 )

The moment-generating function of this distribution is

Γ( d2 )
Z 1
EetY = ety y (d−3)/2 (1 − y)−1/2 dy.
Γ( d−1
2 )Γ( 1
2 ) 0

There isn’t a closed form for this, but an upper bound on the integral can be obtained. Assuming
d ≥ 3,
Z 1 Z 1
−1/2
ty (d−3)/2
e y (1 − y) dy ≤ ety (1 − y)−1/2 dy
0 0
Z t
et
=√ e−z z −1/2 dz
t 0
Z ∞
et et
≤√ e−z z −1/2 dz = √ Γ(1/2),
t 0 t
where the second step uses a change of variable z = t(1 − y), and the√fourth uses the definition of
the gamma function. To finish up, we use the inequality Γ(z + 1/2) ≤ z Γ(z) (Lemma B.1) to get
r
tY Γ( d2 ) et t d−1
Ee ≤ d−1
√ ≤ e .
Γ( 2 ) t 2t

The following inequality is doubtless standard; we give a short proof here because we are unable to
find a reference.
Lemma B.1. For any z > 0,
√

1
Γ z+ ≤ z Γ(z).
2

Proof. Suppose a random√variable T > 0 is drawn according to the density Pr(T = t) ∝ tz−1 e−t .
Let’s compute ET and E T :
R ∞ z −t
t e dt Γ(z + 1)
ET = R ∞0 z−1 −t = =z
0
t e dt Γ(z)
R ∞ z−1/2 −t
√ t e dt Γ(z + 1/2)
E T = R0 ∞ z−1 −t = ,
0
t e dt Γ(z)

√ the√standard fact Γ(z + 1) = zΓ(z). By concavity of the square root function,

where we have used
we know that E T ≤ ET . This yields the lemma.

B.4 Proof of Theorem 2.2

From Lemma A.4(a), we have Ψn ≤ Ψn−1 + βn − Zn , where βn = γn2 B 2 /4, and E[Zn |Fn−1 ] ≥ 0,
and Zn lies in an interval of length ζn = 8γn B. We can thus directly apply the first deviation bound
of Lemma 2.4.
Since Z ∞
X X 1 dx c2
γn2 = c2
≤ c2
= ,
`2 n x2 n
`>n `>n

13
we see that for any t > 0,
X tζ 2 X B2 B 2 c2

2 2 2
β` + ` = γ` + 8B tγ` ≤ (1 + 32t).
8 4 4no
`>no `>no

To make this ≤ /d, it suffices to take no ≥ B 2 c2 d(1 + 32t)/(4), whereupon Lemma 2.4 yields

Pr sup Ψn ≥ 1 − ≤ E[exp(tΨno )]e−t(1−(/d)−(/d))
n≥no d
r r
t d −t(1−(2/d)) d
≤e e = e2t/d .
2t 2t

where the last step uses Lemma 2.5. The result follows by taking t = d/(4).

C Intermediate epochs of improvement

C.1 Proof of Lemma 2.7

Lemma A.4 establishes an inequality Ψn ≤ Ψn−1 − Zn + βn as well as a lower bound on

E[Zn |Fn−1 ], where Zn is a random variable that lies in an interval of length ζn = 8γn B. From
Lemma 2.3, we then have

E[etΨn |Fn−1 ] ≤ exp t(Ψn−1 − E[Zn |Fn−1 ] + βn + tζn2 /8)

≤ exp t(Ψn−1 − 2γn (λ1 − λ2 )Ψn−1 (1 − Ψn−1 ) + γn2 B 2 (1 + 32t)/4)

= exp t(Ψn−1 − co Ψn−1 (1 − Ψn−1 )/n + c2 B 2 (1 + 32t)/4n2 )

For any ω ∈ Ω0n , we have Ψn−1 (ω) ≤ 1 − j . Taking expectations over Ω0n , we get the lemma.

C.2 Proof of Lemma 2.8

Let j be the largest index such that nj < n. Then

for ω ∈ Ω0n

≤ 1 − j
Ψn−1 (ω) has value
> 1 − j for ω ∈ Ω0n−1 \ Ω0n

Thus the expected value of g(Ψn−1 ) over Ω0n is at most the expected value over Ω0n−1 .

C.3 Proof of Lemma 2.9

We begin with the following Lemma.

Lemma C.1. For any n > nj and any t > 0,
c
nj + 1 o j tc2 B 2 (1 + 32t) 1

1
En [etΨn ] ≤ exp t(1 − j ) + − .
n+1 4 nj n

Proof. Define αn = 1 − (co j /n) and ξn (t) = c2 B 2 t(1 + 32t)/4n2 . By Lemmas 2.7 and 2.8, for
n > nj ,

En [etΨn ] ≤ En [etαn Ψn−1 ] exp(ξn (t)) ≤ En−1 [e(tαn )Ψn−1 ] exp(ξn (t)).

14
By applying these inequalities repeatedly, for n shrinking to nj + 1 (and t shrinking as well), we get
En [etΨn ] ≤ Enj +1 exp tΨnj αn αn−1 · · · αnj +1 exp(ξn (t)) exp(ξn−1 (tαn )) · · · exp(ξnj +1 (tαn · · · αnj +2 ))

≤ Enj +1 exp tΨnj αn αn−1 · · · αnj +1 exp(ξn (t)) exp(ξn−1 (t)) · · · exp(ξnj +1 (t))

co j co j co j
= Enj +1 exp tΨnj 1 − 1− ··· 1 − ×
n n−1 nj + 1
2 2
c B t(1 + 32t) 1 1 1
exp + + ··· +
4 n2 (n − 1)2 (nj + 1)2

1 1
≤ exp t(1 − j ) exp −co j + ··· + ×
nj + 1 n
2 2
c B t(1 + 32t) 1 1 1
exp + + ··· +
4 n2 (n − 1)2 (nj + 1)2
since Ψnj (ω) ≤ 1 − j for all ω ∈ Ω0nj +1 . We then use the summations
Z n+1
1 1 dx n+1
+ ··· + ≥ = ln
nj + 1 n nj +1 x n j +1
Z n
1 1 dx 1 1
+ ··· + 2 ≤ = −
(nj + 1)2 n nj x 2 n j n
to get the lemma.

To prove Lemma 2.9, we note that under conditions (3),

co j
nj + 1
(1 − j ) ≤ e−j (e−5/co )co j = e−6j ≤ 1 − 3j ≤ 1 − j+1 − j .
nj+1 + 1
We have used the fact that e−2x ≤ 1 − x for 0 ≤ x ≤ 3/4. The rest follows by applying Lemma C.1
with n = nj+1 .

C.4 Proof of Lemma 2.10

Pick any 0 < j ≤ J. We will mimic the reasoning of Theorem 2.2, being careful to define martin-
gales only on the restricted space Ω0nj and with starting time nj . Then
!
tc2 B 2 (1 + 32t)

tΨnj
Pnj sup Ψn > 1 − j ≤ Enj [e ] exp −t(1 − j ) +
n≥nj 4nj
tc2 B 2 (1 + 32t)

≤ exp −tj−1 + ,
4nj−1
where the second step invokes Lemma 2.9.
To finish, we pick t = (2/o ) ln(4/δ). The lower bound on no is also a lower bound on nj−1 , and
implies that tc2 B 2 (1 + 32t)/4nj−1 ≤ to /2, whereupon
! j−1 /o
tj−1 δ δ
Pnj sup Ψn > 1 − j ≤ exp − = ≤ j+1 .
n≥nj 2 4 2
Summing over j then yields the lemma.

D The final epoch

D.1 Proof of Lemma 2.11

By Lemma A.4,
E[Ψn |Fn−1 ] ≤ Ψn−1 (1 − 2γn (1 − Ψn−1 )(λ1 − λ2 )) + βn .

15
For realizations ω ∈ Ω0n , we have Ψn−1 (ω) ≤ 1/2 and thus the right-hand side of the above
expression is at most (1 − αn )Ψn−1 + βn . Using the fact that Ω0n is Fn−1 -measurable, and taking
expectations over Ω0n ,
En [Ψn ] ≤ (1 − αn )En [Ψn−1 ] + βn
≤ (1 − αn )En−1 [Ψn−1 ] + βn ,
as claimed. The last step uses Lemma 2.8.

D.2 Proof of Theorem 1.1

Define epochs (nj , j ) that satisfy the conditions of Theorem 2.6, with J = 1/2, and with j+1 =
2j whenever possible. Then J = log2 1/(2o ) and
5/(co ln 2) 5/(co ln 2)
5J 1 4ed
nJ + 1 = (no + 1) exp = (no + 1) = (no + 1) .
co 2o δ2
By Theorem 2.6, with probability > 1 − δ, we have Ψn ≤ 1/2 for all n ≥ nJ . More precisely,
P (Ω0n ) ≥ 1 − δ for all n > no .
By Lemma 2.11, for n > nJ ,
a
b
En [Ψn ] ≤ 1−
En−1 [Ψn−1 ] + 2 ,
n n
for a = co /2 and b = c2 B 2 /4. By Lemma D.1,
a a+1
nJ + 1 b 1 1
En [Ψn ] ≤ EnJ [ΨnJ ] + 1+
n+1 a−1 nJ + 1 n+1
a 5/(2 ln 2)
1 no + 1 4ed b a+1 1
≤ 2
+ exp .
2 n+1 δ a−1 nJ + 1 n + 1
which upon further simplification yields the bound of Theorem 1.1.

Lemma D.1. Consider a nonnegative sequence (ut : t ≥ to ), such that for some constants a, b > 0
and for all t > to ≥ 0,
a b
ut ≤ 1 − ut−1 + 2 .
t t
Then, if a > 1,
a a+1
to + 1 b 1 1
ut ≤ uto + 1+ .
t+1 a−1 to + 1 t+1

Proof. Unwrapping the given recurrence for ut yields

!  
t t t
Y a X b  Y a 
ut ≤ 1− uto + 1− .
i=t +1
i i=t +1
i2 j=i+1 j
o o

To bound the product term, we use

t ! Z t+1 k
Y k X1 dx to + 1
1− ≤ exp −k ≤ exp −k = .
i=t +1
i i=t
i to +1 x t+1
o o

Therefore,
a t a
to + 1 X b i+1
ut ≤ uto + 2
t+1 i=to +1
i t+1
a t
2 X
to + 1 b to + 2
≤ uto + (i + 1)a−2 .
t+1 (t + 1)a to + 1 i=to +1

16
We finish by bounding the summation of (i + 1)a−2 by a definite integral, to get:
t
X 1
(i + 1)a−2 ≤ (t + 2)a−1 .
i=to +1
a−1

Numerical Inversion of Laplace Transform For Time Resolved Thermal Characterization Experiment
No ratings yet
Numerical Inversion of Laplace Transform For Time Resolved Thermal Characterization Experiment
5 pages
Monte Carlo Errors With Less Errors.
No ratings yet
Monte Carlo Errors With Less Errors.
23 pages
Katyusha: The First Direct Acceleration of Stochastic Gradient Methods
No ratings yet
Katyusha: The First Direct Acceleration of Stochastic Gradient Methods
45 pages
18-BEJ1073
No ratings yet
18-BEJ1073
29 pages
2000 WHICH EIGENVALUES ARE FOUND BY THE LANCZOS
No ratings yet
2000 WHICH EIGENVALUES ARE FOUND BY THE LANCZOS
16 pages
weighted-ensemble-2003.02316v3
No ratings yet
weighted-ensemble-2003.02316v3
41 pages
Convergence analysis of a collapsed Gibbs sampler for Bayesian vector autoregressions
No ratings yet
Convergence analysis of a collapsed Gibbs sampler for Bayesian vector autoregressions
31 pages
Fundamentals of Statistics (18.6501x)
No ratings yet
Fundamentals of Statistics (18.6501x)
20 pages
Assignment 10 solution
No ratings yet
Assignment 10 solution
8 pages
slides_no_break
No ratings yet
slides_no_break
77 pages
Asymptotic Statistics (By Changliang ZOU)
No ratings yet
Asymptotic Statistics (By Changliang ZOU)
115 pages
Validación Cruzada
No ratings yet
Validación Cruzada
23 pages
Theory of Quadrature PDF
No ratings yet
Theory of Quadrature PDF
280 pages
M Tech-Asp3
No ratings yet
M Tech-Asp3
15 pages
chapter_3 Performance Surface and Search Method
No ratings yet
chapter_3 Performance Surface and Search Method
49 pages
9e6cbf4ac9c3320e4e9d5402ab7ac5eb_MIT14_384F13_rec7
No ratings yet
9e6cbf4ac9c3320e4e9d5402ab7ac5eb_MIT14_384F13_rec7
6 pages
chapter_3 Performance Surface and Search Method
No ratings yet
chapter_3 Performance Surface and Search Method
24 pages
Introduction To Stochastic Approximation Algorithms
No ratings yet
Introduction To Stochastic Approximation Algorithms
14 pages
Low Complexity Adaptive Algorithm For Generalized Eigenvalue Decomposition
No ratings yet
Low Complexity Adaptive Algorithm For Generalized Eigenvalue Decomposition
4 pages
Rakhlin Mathstat sp22
No ratings yet
Rakhlin Mathstat sp22
108 pages
Error Propagation
No ratings yet
Error Propagation
22 pages
Notes On Deep Learning Theory
No ratings yet
Notes On Deep Learning Theory
68 pages
Tikhonov Regularization
No ratings yet
Tikhonov Regularization
8 pages
Fast Adaptive Eigenvalue Decomposition A Maximum Likelihood Approach
No ratings yet
Fast Adaptive Eigenvalue Decomposition A Maximum Likelihood Approach
4 pages
A Step by Step Mathematical Derivation A
No ratings yet
A Step by Step Mathematical Derivation A
32 pages
cs229 Notes3
No ratings yet
cs229 Notes3
30 pages
Nummax
No ratings yet
Nummax
3 pages
Worked Examples in Mathematics for Scientists and Engineers
From Everand
Worked Examples in Mathematics for Scientists and Engineers
G. Stephenson
No ratings yet
05 Spectral Power Iterations For The Random Eigenvalue Problem
No ratings yet
05 Spectral Power Iterations For The Random Eigenvalue Problem
13 pages
Empirical Process (Sara Van de Geer)
No ratings yet
Empirical Process (Sara Van de Geer)
91 pages
Lecture Note SGD
No ratings yet
Lecture Note SGD
4 pages
class-test-1
No ratings yet
class-test-1
5 pages
Montanari
No ratings yet
Montanari
10 pages
Addendum Bias Variance
No ratings yet
Addendum Bias Variance
3 pages
Bias Variance
No ratings yet
Bias Variance
3 pages
Low Variance Sampling Techniques For Particle Filter
No ratings yet
Low Variance Sampling Techniques For Particle Filter
7 pages
On The Eigenspectrum of The Gram Matrix and Its Relationship To The Operator Eigenspectrum
No ratings yet
On The Eigenspectrum of The Gram Matrix and Its Relationship To The Operator Eigenspectrum
18 pages
191IT7310Machine LearningQB
No ratings yet
191IT7310Machine LearningQB
27 pages
Tutorial On Principal Component Analysis: Javier R. Movellan
No ratings yet
Tutorial On Principal Component Analysis: Javier R. Movellan
9 pages
Download full Calculus Single Variable 7th Edition (eBook PDF) ebook all chapters
100% (2)
Download full Calculus Single Variable 7th Edition (eBook PDF) ebook all chapters
51 pages
Particle Filter Tutorial
No ratings yet
Particle Filter Tutorial
8 pages
Arthur E. Albert, Leland A. Gardner Jr. Stochastic Approximation and NonLinear Regression
No ratings yet
Arthur E. Albert, Leland A. Gardner Jr. Stochastic Approximation and NonLinear Regression
211 pages
Introduction To Nonlinear Filtering
No ratings yet
Introduction To Nonlinear Filtering
126 pages
Slides Lecture7 Ext
No ratings yet
Slides Lecture7 Ext
21 pages
Optimal Prediction With Memory: Alexandre J. Chorin, Ole H. Hald, Raz Kupferman
No ratings yet
Optimal Prediction With Memory: Alexandre J. Chorin, Ole H. Hald, Raz Kupferman
19 pages
Approximation of The Invariant Measure of Stable SDEs
No ratings yet
Approximation of The Invariant Measure of Stable SDEs
32 pages
Perceptron Learning Algorithm Lecture Supplement
No ratings yet
Perceptron Learning Algorithm Lecture Supplement
6 pages
Note 6: EECS 189 Introduction To Machine Learning Fall 2020 1 Multivariate Gaussians
No ratings yet
Note 6: EECS 189 Introduction To Machine Learning Fall 2020 1 Multivariate Gaussians
9 pages
Weatherwax Theodoridis Solutions
No ratings yet
Weatherwax Theodoridis Solutions
212 pages
Notes and Solutions For: Pattern Recognition by Sergios Theodoridis and Konstantinos Koutroumbas.
100% (1)
Notes and Solutions For: Pattern Recognition by Sergios Theodoridis and Konstantinos Koutroumbas.
209 pages
11 Hidden Markov Models (HMMS) Model and Problem Description
No ratings yet
11 Hidden Markov Models (HMMS) Model and Problem Description
15 pages
Applications of Random Matrix Theory To Principal Component Analysis (PCA)
No ratings yet
Applications of Random Matrix Theory To Principal Component Analysis (PCA)
25 pages
Adaptive Lecture04 2005
No ratings yet
Adaptive Lecture04 2005
28 pages
Theory of Deep Learning 1652786371
No ratings yet
Theory of Deep Learning 1652786371
118 pages
hw7 Sol
No ratings yet
hw7 Sol
12 pages
Eigenvectors and Eigenvalues of Stationary Processes: N N JK J K
No ratings yet
Eigenvectors and Eigenvalues of Stationary Processes: N N JK J K
8 pages
An Adaptive Simulated Annealing Algorithm PDF
No ratings yet
An Adaptive Simulated Annealing Algorithm PDF
9 pages
Unit 3 - Estimation And Prediction: θ 1 2 n 1 2 n 1 1 2 2 n n
No ratings yet
Unit 3 - Estimation And Prediction: θ 1 2 n 1 2 n 1 1 2 2 n n
14 pages
Fisher Linear Discriminant Analysis: Max Welling
No ratings yet
Fisher Linear Discriminant Analysis: Max Welling
4 pages
An Introduction To Independent Component Analysis (ICA)
No ratings yet
An Introduction To Independent Component Analysis (ICA)
39 pages
Areas of Normal Curve
0% (1)
Areas of Normal Curve
24 pages
Mathematics
No ratings yet
Mathematics
10 pages
Subsequences of Real Numbers
No ratings yet
Subsequences of Real Numbers
18 pages
MAT5007 - Module 1 Problem Set
No ratings yet
MAT5007 - Module 1 Problem Set
3 pages
Bahan1 Matematika Teknik Kimia Salin
No ratings yet
Bahan1 Matematika Teknik Kimia Salin
103 pages
Real Life Integral Calculus
No ratings yet
Real Life Integral Calculus
3 pages
Lec10 PDF
No ratings yet
Lec10 PDF
14 pages
Schlkopf 1998
No ratings yet
Schlkopf 1998
57 pages
More On Tabular Integration by Parts - Leonard Gillman
100% (3)
More On Tabular Integration by Parts - Leonard Gillman
4 pages
Functional Equations PDF
No ratings yet
Functional Equations PDF
4 pages
Project and Operation Management: Project Scheduling With Uncertain Activity Times
No ratings yet
Project and Operation Management: Project Scheduling With Uncertain Activity Times
41 pages
Design of Experiment
No ratings yet
Design of Experiment
13 pages
Signal Processing 1 Script English v2017 PDF
No ratings yet
Signal Processing 1 Script English v2017 PDF
224 pages
Chapter 1 - The Integral
No ratings yet
Chapter 1 - The Integral
12 pages
Applied Stochastic Processes: M. Ottobre
No ratings yet
Applied Stochastic Processes: M. Ottobre
164 pages
Set Theory Short Notes
No ratings yet
Set Theory Short Notes
42 pages
Generalized Bessel Equation: Xy Xa Bxy CDX B Arx BX y Rs
No ratings yet
Generalized Bessel Equation: Xy Xa Bxy CDX B Arx BX y Rs
8 pages
AS/A Level Mathematics Integration: Mathsgenie - Co.uk
No ratings yet
AS/A Level Mathematics Integration: Mathsgenie - Co.uk
3 pages
Architecture in Use Summary of JM Van Der Voordt
100% (1)
Architecture in Use Summary of JM Van Der Voordt
43 pages
Modern Control Engineering Problems CH 8 PDF
No ratings yet
Modern Control Engineering Problems CH 8 PDF
29 pages
Averaging
No ratings yet
Averaging
9 pages
3 5-RationalFunctions
No ratings yet
3 5-RationalFunctions
41 pages
Multiple Integrals Notes
No ratings yet
Multiple Integrals Notes
30 pages
Mixed Methods Research: Integrating Qualitative and Quantitative Methods
No ratings yet
Mixed Methods Research: Integrating Qualitative and Quantitative Methods
52 pages
Abramson Restricted Combinatioons and Compositions
No ratings yet
Abramson Restricted Combinatioons and Compositions
14 pages
Unit 5. Finite Element Analysis
No ratings yet
Unit 5. Finite Element Analysis
48 pages
Understanding Vector Calculus: Practical Development and Solved Problems
From Everand
Understanding Vector Calculus: Practical Development and Solved Problems
Jerrold Franklin
No ratings yet
EC3120 Mathematical Economics
No ratings yet
EC3120 Mathematical Economics
2 pages
Algebraic Equations
From Everand
Algebraic Equations
Demetrios P. Kanoussis
No ratings yet
A-level Maths Revision: Cheeky Revision Shortcuts
From Everand
A-level Maths Revision: Cheeky Revision Shortcuts
Scool Revision
3.5/5 (8)

The Fast Convergence of Incremental PCA

Uploaded by

The Fast Convergence of Incremental PCA

Uploaded by

The Fast Convergence of Incremental PCA

Akshay Balsubramani Sanjoy Dasgupta Yoav Freund

We consider a situation in which we see samples Xn ∈ Rd drawn i.i.d. from some

1.1 The algorithm

We analyze the following procedure.

1. Set starting time. Set the clock to time no .

1.3 The setting of the learning rate

1.4 Nested sample spaces

1.5 Main result

We make the following assumptions:

(A1) The Xn ∈ Rd are i.i.d. with mean zero and covariance A.

Under these conditions, we get the following rate of convergence.

where En denotes expectation restricted to Ω0n .

Since co > 2, this bound is of the form E[Ψn ] = O(1/n).

1.6 Related work

2.2 A large deviation bound for Ψn

Putting these pieces together yields Theorem 2.2.

2.3 Intermediate epochs of improvement

Of course, this can only hold with a certain probability.

Crucially, this is Fn−1 -measurable. Also note that Ω0 = Ω0n .

Here is the main result of this section.

A repeated application of Lemmas 2.7 and 2.8 yields the following.

2.4 The final epoch

Write Krasulina’s update equation as

We start with some basic observations.

(a) ξn is orthogonal to Vn−1 .

We now check that (Vn · v ∗ )2 grows in expectation with each iteration.

(a) (Vn · v ∗ )2 ≥ (Vn−1 · v ∗ )2 + 2γn (Vn−1 · v ∗ )(ξn · v ∗ ).

Proof. Part (a) follows directly from the update rule:

Proof. It is easiest to think of Vn in the eigenbasis of A: the component of Vn in direction v ∗ is

Proof. Using Lemmas A.1 and A.2(a),

Since kXn k2 ≤ B, we see that Zn must lie in the range ±4γn B.

A.2 The change in potential of the Oja update

Recall the Oja update:

Proof. This is a series of calculations. First,

B A large deviation bound for Ψn

For any t > 0,

= et(Yn−1 +βn ) E e−tZn |Fn−1

B.2 Proof of Lemma 2.4

B.3 Proof of Lemma 2.5

The moment-generating function of this distribution is

√ the√standard fact Γ(z + 1) = zΓ(z). By concavity of the square root function,

B.4 Proof of Theorem 2.2

C Intermediate epochs of improvement

C.1 Proof of Lemma 2.7

Lemma A.4 establishes an inequality Ψn ≤ Ψn−1 − Zn + βn as well as a lower bound on

E[etΨn |Fn−1 ] ≤ exp t(Ψn−1 − E[Zn |Fn−1 ] + βn + tζn2 /8)

≤ exp t(Ψn−1 − 2γn (λ1 − λ2 )Ψn−1 (1 − Ψn−1 ) + γn2 B 2 (1 + 32t)/4)

= exp t(Ψn−1 − co Ψn−1 (1 − Ψn−1 )/n + c2 B 2 (1 + 32t)/4n2 )

C.2 Proof of Lemma 2.8

Let j be the largest index such that nj < n. Then

C.3 Proof of Lemma 2.9

We begin with the following Lemma.

To prove Lemma 2.9, we note that under conditions (3),

C.4 Proof of Lemma 2.10

D The final epoch

D.2 Proof of Theorem 1.1

Proof. Unwrapping the given recurrence for ut yields

To bound the product term, we use

You might also like