0% found this document useful (0 votes)

24 views

Dimension Reduction

Uploaded by

idhitappu

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

24 views

Dimension Reduction

Uploaded by

idhitappu

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 23

DIMENSION REDUCTION

1 Principal Component Analysis (PCA)

Principal components analysis (PCA) finds low dimensional approximations to the data by
projecting the data onto linear subspaces.

Let X ∈ Rd and let Lk denote all k-dimensional linear subspaces. The k th principal subspace
is !
e − yk2
`k = argmin E min kX
`∈Lk y∈`

where X e = X − µ and µ = E(X). The dimension-reduced version of X is then Tk (X) =

µ + π`k X where and π`k X is the projection of X onto `k . To find `k proceed as follows.

Let Σ = E((X − µ)(X − µ)T ) denote the covariance matrix, where µ = E(X). Let λ1 ≥ λ2 ≥
· · · ≥ λd be the ordered eigenvalues of Σ and let e1 , . . . , ed be the corresponding eigenvectors.
Let Λ be the diagonal matrix with Λjj = λj and let E = [e1 · · · ed ]. Then the spectral
decomposition of Σ is X
Σ = EΛE T = λj ej eTj .
j

Theorem 1 The k th principal subspace `k is the subspace spanned by e1 , . . . , ek . Further-

more,
Xk
Tk (X) = µ + βj ej
j=1

where βj = hX − µ, ej i. The risk satisfies

d
X
R(k) = EkX − Tk (X)k2 = λj .
j=k+1

We can restate the result as follows. To minimize

EkYi − α − Aβi k2 ,
with respect to α ∈ Rd , A ∈ Rd×k and βi ∈ Rk we set α = µ and A = [e1 e2 · · · ek ]. Any
other solution is equivalent in the sense that it corresponds to the same subspace.

We can choose k by fixing some α and then taking

( ) ( Pm )
R(m) j=1 λj
k = min m : ≥ 1 − α = min m : Pd ≥1−α .
R(0) j=1 λj

1
Let Y = (Y1 , . . . , Yd ) where Yi = eTi (X − µ). Then Y is the PCA-transformation applied to
X. The random variable Y has the following properties:

Lemma 2 We have:

1. E[Y ] = 0 and Var(Y ) = Λ.

2. X = µ + EY .
Pm
3. j=1 Var(Yj ) = Σ11 + · · · + Σmm .

Hence, Pm
j=1 λj
Pd
j=1 λj
is the percentage of variance explained by the first m principal components.

The data version of PCA is obtained by replacing Σ with the sample covariance matrix
n
b= 1
X
Σ (Xi − X n )(Xi − X n )T .
n i=1

Principal Components Analysis (PCA)

b = n−1 Pn (Xi − X n )(Xi − X n )T .
1. Compute the sample covariance matrix Σ i=1

2. Compute the eigenvalues λ1 ≥ λ2 ≥ · · · and eigenvectors e1 , e2 , . . . , of Σ.

3. Choose a dimension k.
Pk
4. Define the dimension reduced data Zi = Tk (Xi ) = X + j=1 βij ej where βij =
hXi − X, ej i.

Example 3 Figure 1 shows a synthetic two-dimensional data set together with the first
principal component.

Example 4 Figure 2 shows some handwritten digits. The eigenvalues and the first few
eigenfunctions are shown in Figures 3 and 4. A few digits and their low-dimensional recon-
structions are shown in Figure 5.

2
1.0
0.8
0.6
0.4
0.2
0.0

0.0 0.2 0.4 0.6 0.8 1.0

Figure 1: First principal component in a simple two dimensional example.

Figure 2: Handwritten digits.

3
Variance

Dimension

Figure 3: Digits data: eigenvalues

How well does the sample version approximate the population version? For now, assume
the dimensions d is fixed and that n is large. We will study the high-dimensional case later
where we will use some random matrix theory.

Define the operator norm ( )

||Σv||
||Σ|| = sup : v 6= 0 .
||v||
b − Σ|| = OP (1/√n). According to Weyl’s theorem
It can be shown that ||Σ

max |λj (Σ)

b − λj (Σ)| ≤ ||Σ
b − Σ||
j

and hence, the estimated eigenvalues are consistent. We can also say that the eigenvectors
are consistent. We have
23/2 ||Σ
b − Σ||
||b
ej − ej || ≤ .
min(λj−1 − λj , λj − λj+1 )

(See Yu, Wang and Samworth, arXiv:1405.0680.) There is a also a central limit theorem
for the eigenvalues and eigenvectors which also leads to a proof that the bootstrap is valid.
However, these limiting results depend on the distinctness of the eigenvalues.

There is a strong connection between PCA and the singular value decomposition (SVD). Let
X be an n × d matrix. The SVD is

X = U DV T

4
Figure 4: Digits: mean and eigenvectors

Figure 5: Digits data: Top: digits. Bottom: their reconstructions.

5
where U is an n × n matrix with orthonormal columns, V is a d × d matrix with orthonormal
columns, and D is an n × d diagonal matrix with non-negative real numbers on the diagonal
(called singular values). Then

X T X = (V DU T )(U DV T ) = V D2 V T

and hence the singular values are the square root of the eigenvalues.

2 Multidimensional Scaling

A different view of dimension reduction is provided by thinking in terms of preserving pair-

wise distances. Suppose that Zi = T (Xi ) for i = 1, . . . , n where T : Rd → Rk with k < d.
Define the loss function
X
L= (||Xi − Xj ||2 − ||Zi − Zj ||2 )
i,j

which measures how well the map T preserves pairwise distances. Multidimensional scal-
ing find the linear map T to minimize L.

Theorem 5 The linear map T : Rd → Rk that minimizes L is the projection onto Span{e1 , . . . , ek }
where e1 , . . . , ek are the first k principal components.

We could use other measures of distortion. In that case, the MDS solution and the PCA
solution will not coincide.

3 Kernel PCA

To get a ninlinear version of PCA, we can use a kernel. Suppose we have a “feature map”
x 7→ Φ(x) and want to carry out PCA in this new feature space. For the moment, assume
that the feature vectors are centered (we return to this point shortly). Define the empirical
covariance matrix n
1X
CΦ = Φ(xi )Φ(xi )T .
n i=1
We can define eigenvalues λ1 , λ2 , . . . and eigenvectors v1 , v2 , . . . of this matrix.

6
It turns out that the eigenvectors are linear combinations of the feature vectors Φ(x1 ), . . . , Φ(xn ).
To see this, note that
n
1X
λv = CΦ v = Φ(xi )Φ(xi )T v
n i=1
n n
1X X
= h Φ(xi ), v i Φ(xi ) = αi Φ(xi )
n i=1 i=1

where
1 1
αi = h Φ(xi ), v i = h Φ(xi ), CΦ v i .
n nλ
Now
n
X
λ αi h Φ(xk ), Φ(xi ) i = λ h Φ(xk ), Cv i
i=1
n
1X
= λ h Φ(xk ), Φ(xj )Φ(xj )T v i
n j=1
n n
1X T
X
= h Φ(xk ), Φ(xj )Φ(xj ) αi Φ(xi ) i
n j=1 i=1
n n
1X X
= αi h Φ(xk ), h Φ(xj ), Φ(xi ) i Φ(xj ) i .
n i=1 j=1

Define the kernel matrix K by Kij = h Φ(xi ), Φ(xj ) i . Then we can write the above equation
as

λnKα = K 2 α

Thus, we need to solve the kernel eigenvalue problem

Kα = nλα

which requires diagonalizing only an n×n system. Normalizing the eigenvectors, h v, v i = 1

leads to the condition λ h α, α i = 1. Of course, we need to find all the solutions giving
eigenvectors v1 , v2 , . . . ,.

In order to compute the kernel PCA projection of a new test point x, it is necessary to
project the feature vector Φ(x) onto the principal direction vm . This requires the evaluation
n
X
h v, Φ(x) i = αi h Φ(xi ), Φ(x) i
i=1
n
X
= αi K(xi , x)
i=1

7
Thus, the entire procedure uses only the kernel evaluations K(x, xi ) and never requires actual
manipulation of feature vectors, which could be infinite dimensional. This is an instance of
the kernel trick. An arbitrary data point x can then be approximated by projecting Φ(x)
onto the first k vectors. This defines an approximation in the feature space. We then need
to find x
e that corresponds to this projection. There is an iterative algorithm for doing this
(Mika et al 1998) which turns out to be a weighed version of the mean shift algorithm.

To complete the description of the algorithm, it is necessary to explain how to center the
data in feature space using only kernel operations. This is accomplished by transforming the
kernel according to
e ij = (K − 1n K − K1n + 1n K1n )
K ij

where
 
1 1 ··· 1
1 1
 1 ··· 1  1 T
1n = . .. ..  = 11
n  .. . ··· . n
1 1 ··· 1
where 1 denotes the vector of all ones.

Kernel PCA. Given a Mercer kernel K and data X1 , X2 , . . . , Xn

1. Center the kernel

2. Compute Kij = K(Xi , Xj )
3. Diagonalize K
1
4. Normalize eigenvectors α(m) so that h α(m) , α(m) i = λm

5. Compute the projection of a test point x onto an eigenvector vm by

Xn
(m)
h vm , Φ(x) i = αi K(Xi , x)
i=1

Just as for standard PCA, this selects components of high variance, but in the feature space
of the kernel. In addition, the “feature functions”
X n
(m)
fm (x) = αi K(Xi , x)
i=1

are orthogonal and act as representative feature functions in the reproducing kernel Hilbert
space of the kernel, with respect to the given data. Intuitively, these functions are smooth
respect to the RKHS norm k · kK among all those supported on the data.

Another perspective on kernel PCA is that it is doing MDS on the kernel distances dij =
p
2(1 − K(Xi , Xj )); see Williams (2002).

8
4 Local Linear Embedding

Local Linear Embedding (LLE) (Roweis et al) is another nonlinear dimension rediction
method. The LLE algorithm is comprised of three steps. First, nearest neighbors are com-
puted for each point Xi ∈P Rd . Second, each point is regressed onto its neighbors, giving
weights wij so that Xi = j wij Xj . Third, the Xi ∈ Rd are replaced by Yi ∈ Rm where
typically m d by solving a sparse eigenvector problem. The result is a highly nonlinear
embedding, but one that is carried out by optimizations that are not prone to local minima.
Underlying the procedure, as for many “manifold” methods, is a weighted sparse graph that
represents the data.

Step 1: Nearest Neighbors. Here the set of the K nearest neighbors in standard Euclidean
space is constructed for each data point. Using brute-force search, this requires O(n2 d) oper-
ations; more efficient algorithms are possible, in particular if approximate nearest neighbors
are calculated. The number of neighbors K is a parameter to the algorithm, but this is the
only parameter needed by LLE.

Step 2: Local weights. In this step, the local geometry of each point is characterized by a
set of weights wij . The weights are computed by reconstructing each input Xi as a linear
combination of its neighbors, as tabulated in Step 1. This is done by solving the least squares
problem
n
X X
min kXi − wij Xj k22 (1)
w
i=1 j

The weights wij are constrained so that wij = 0 if Xj is not one

Pof the K nearest neighbors
of Xi . Moreover, the weights are normalized to sum to one: j wij = 1, for i = 1, . . . , n.
This normalization ensures that the optimal weights are invariant to rotation, translation,
and scaling.

Step 3: Linearization. In this step the points Xi ∈ Rd are mapped to Yi ∈ Rm , where

m selected by the user, or estimated directly from the data. The vectors Yi are chosen to
minimize the reconstruction error under the local linear mappings constructed in the previous
step. That is, the goal is to optimize the functional
n
X X
Ψ(y) = kYi − wij Yj k22 (2)
i=1 j

where the weights wij are calculated in Step 2. To obtain a unique solution, the vectors are
“centered” to have mean zero and unit covariance:
X 1X
Yi = 0 Yi YiT = Im (3)
i
n i

9
Carrying out this optimization is equivalent to finding eigenvectors by minimizing the quadratic
form

Ψ(y) = y T Gy (4)
= y T (I − W )T (I − W )y (5)
(6)

corresponding to a Rayleigh quotient. Each minimization gives one of the lower (m + 1)

eigenvectors of the n×n matrix G = (I −W )T (I −W ). The lowest eigenvector has eigenvalue
0, and consists of the all ones vector (1, 1 . . . , 1)T .

Locally Linear Embedding (LLE). Given n data vectors Xi ∈ Rd ,

1. Compute K nearest neighbors for each point;

2. Compute local reconstruction weights wij by minimizing

n
X X
Φ(w) = kXi − wij Xj k2 (7)
i=1 j
X
subject to wij = 1; (8)
j

3. Compute outputs Yi ∈ Rm by computing the first m eigenvectors with nonzero eigen-

values for the n × n matrix G = (I − W )T (I − W ). The reduced data matrix is
[u1 · · · um ] where uj are the eigenvectors corresponding to the first (smallest) nonzero
eigenvalues of G.

Note that the last step assumes that the underlying graph encoded by the nearest neighbor
graph is connected. Otherwise, there may be more than one eigenvector with eigenvalue
zero. If the graph is disconnected, then the LLE algorithm can be run separately on each
connected component. However, the recommended procedure is to choose K so that the
graph is connected.

Using the simplest algorithms, the first step has time complexity O(dn2 ), the second step
requires O(nK 3 ) operations, and the third step, using routines for computing eigenvalues
for sparse matrices, requires O(mn2 ) operations (and O(n3 ) operations in the worse case if
sparsity is not exploited and the full spectrum is computed). Thus, for high dimensional
problems, the first step is the most expensive. Since the third step computes eigenvectors,
it shares the property with PCA that as more dimensions are added to the embedding, the
previously computed coordinates do not change.

10
Figure 6: Each data set has n = 1, 000 points in d = 3 dimensions, and LLE was run with
K = 8 neighbors.

11
Figure 7: Faces example. n = 1, 965 images with d = 560, with LLE run for K = 12
neighbors.

Figure 8: Lips example. n = 15, 960 images in d = 65, 664 dimensions, with LLE run for
K = 24 neighbors.

12
5 Isomap

Isomap is a technique that is similar to LLE, intended to provide a low dimensional “mani-
fold” representation of a high dimensional data set. Isomap differs in how it assesses similarity
between objects, and in how the low dimensional mapping is constructed.

The first step in Isomap is to constructed a graph with the nodes representing instances
Xi ∈ Rd to be embedded in a low dimensional space. Standard choices are a k-nearest
neighbors, and -neighborhoods. In the k-nearest neighborhood graph, each point Xi is
connected to its closest k neighbors Nk (Xi ), where distance is measured using Euclidean
distance in the ambient space Rd . In the -neighborhood graph, each point Xi is connected
to all points N (Xi ) within a Euclidean ball of radius centered at Xi . The graph G = (V, E)
by taking edge set V = {x1 , . . . , xn } and edge set
(u, v) ∈ E if v ∈ N (u) or u ∈ N (v) (9)
Note that the node degree in these graphs may be highly variable. For simplicity, assume
that the graph is connected; the parameters k or may need to be carefully selected for this
to be the case.

The next step is to form a distance between points by taking path distance in the graph. That
is d(Xi , Xj ) is the shortest path between node Xi and Xj . This distance can be computed
for sparse graphs in time O(|E| + |V | log |V |). The final step is to embed the points into a
low dimensional space using metric multi-dimensional scaling.

Isomap. Given n data vectors Xi ∈ Rd ,

1. Compute k nearest neighbors for each point, forming the nearest neighbor graph G =
(V, E) with vertices {Xi }.
2. Compute graph distances d(Xi , Xj ) using Dijkstra’s algorithm
3. Embed the points into low dimensions using metric multidimensional scaling

Isomap and LLE both obtain nonlinear dimensionality reduction by mapping points into a
low dimensional space, in a manner that preserves the local geometry. This local geometry
will not be preserved by classical PCA or MDS, since far away points on the manifold will
be, typically, be mapped to nearby points in the lower dimensional space.

6 Laplacian Eigenmaps

A similar
approach is based on the use of the graph Laplacian. Recall that if wij =

Xi −Xj
Kh h
is a weighting between pairs of points determined by a kernel K, the graph

13
Laplacian associated W is given by

L=D−W (10)
P
where D = diag(di ) with di = j wij the sum of the weights for edges emanating from node
i. In Laplacian eigenmaps, the embedding is obtained using the spectral decomposition of
L.

In particular, let y0 , y1 , . . . , yk ∈ Rn denote the first k eigenvectors corresponding to eigen-

values 0 = λ0 < λ1 , < λ2 < · · · < λk+1 of the Laplacian. This determines an embedding

Xi 7→ (y1i , y2i , . . . , yki ) ∈ Rk (11)

into k − 1 dimensions.

The intuition behind this approach can be seen from the basic properties of Rayleigh quo-
tients and Laplacians. In particular, we have that the first nonzero eigenvector satisfies
X
y1 = arg min y1T Ly1 = arg min wij (y1i − y1j )2 (12)
i,j

such that y1T Dy1 =1 (13)

Thus, the eigenvector minimizes the weighted graph L2 norm; the intuition is that the
vector changes very slowly with respect to the intrinsic geometry of the graph. This analogy
is strengthened by consistency properties of the graph Laplacian. In particular, if the data
lie on a Riemannian manifold M , and f : M → R is a function on the manifold,
Z
T
f Lf ≈ k∇f (x)k2 dM (x) (14)
M

where on the left hand side we have evaluated the function on n points sampled uniformly
from the manifold.

14
Figure 9: A portion of the similarity graph for actual scanned digits (1s and 2s), projected
to two dimensions using Laplacian eigenmaps. Each image is a point in R256 , as a 16 × 16
pixel image; the graph suggests the data has lower dimensional “manifold” structure

15
7 Diffusion Distances

As we saw when we discussed spectral clustering, there are other versions of graph Laplacians
such as D−1/2 W D−1/2 and D−1 W that can have better behavior. In fact, let us consider the
matrix L = D−1 W which, as we shall now see, has a nice interpretation. We can view L as
the transition matrix for a Markov chain on the data. This has a population analogue: we
define the diffusion (continuous Markov chain) with transition density

K(x, y)
`(y|x) =
s(x)
R R
where s(x) = K(x, y)dP (y). The stationary distribution has density π(y) = s(y)/ s(u)dP (u).
Then L is just the discrete version of this transition probability. Suppose we run the chain
for t steps. The transition matrix is Lt . The properties of this matrix give information on
the larger scale structure of the data. We define the diffusion distance by
Z
p(u)
Dt (x, y) = (qt (u|x) − qt (u|y))2
π(u)

which is a measure of how har it is to get from x to y in t steps (Coifman and Lafon, 2006).
It can be shown that sX
Dt (x, y) = λ2t
j (ψj (x) − ψj (y))
2

where λj and ψj are the eigenvalues and eigenvectors of q. We can now reduce the dimension
of the data by applying MDS to Dt (x, y). Alternatively, they suggest maping a point x to

Ψt (x) = (λt1 ψ1 (x), . . . , πlt ψk (x))

for some k. An example is shown in Figure 10.

8 Principal Curves and Manifolds

A nonparametric generalization of principal components is principal manifolds. The idea

is to replace linear subspaces with more general manifolds. There are many approaches. We
will consider an approach due to Smola et al (2001).

Let X ∈ Rd and let F be a set of functions from [0, 1]k to Rd . The principal manifold (or
principal curve) is the function f ∈ F that minimizes
!
R(f ) = E min kX − f (z)k2 . (15)
z∈[0,1]k

16
2

●
●
● ●● ● ●● ●
● ● ●
● ●● ● ●
●●
● ●●● ●
● ●●●● ●
●●● ●● ● ●
●
●
●
● ●●
● ●
●
●●●
●
● ●
●
●
●●●
●●●
●
●
●●●
●
●●● ● ●
● ●●
● ●
●
●●
●●●●
●
●●
●
●
●
●●●
●
●
●
●
●●
●●●●
●●
●●
●●●●●●
1

● ●●
● ●●●
●●●
●●
●
●●
●●●●●
●
●
● ●●
●●●
●
●●
●● ●
●
●●
●
●●● ●
● ●●● ●
●●●
● ●●●●
●
● ●●● ● ●●● ●
● ●●● ● ● ●
● ●●●●●●●
● ●●● ●●
● ● ● ●●
●●●● ●●●
●●
●●●●
●
●
●●● ●●
●●
●●
●● ●
●●●
● ● ●● ● ● ●●
● ●
●●●●●
●●●
●
●●●●
● ●
●
●
●
●
●
●●
●
●
●
●
●● ● ●● ● ●
●
●
●●
●
●●●●
●
●●●
●●●
●●
●●
●
●
●●● ●●
●●●
● ●
● ●
●
●●●●
●
●● ●
●●●● ● ●
y

● ●●
●● ●●●
● ●
●
●●
●●
●●
●
● ● ●● ●
0

●● ●● ●
●●
●●● ●
●●●●●
●
●
●●
●●●●●
●
●
●●
●●
●●
●●
●
●
●●● ●
●●●
● ● ●●●●●
●● ●
●●
● ●●
●● ● ● ● ●●
●●●●
●
●●●
●●
●●
●
●●
●●
●
●● ●●●●
●
●●● ● ●● ● ●●●
● ●
●●●
● ●● ● ●
●●●●●●● ● ● ●● ● ● ●●
● ● ●● ●● ●●
●●●
●●●
●●●●●
● ●● ● ● ● ●●●
●
●
●●
●
●●
●
●
●●●
●
●
●
● ●●
●
●
●●●
●
● ●●●
●
●● ●●●
● ●●●●●
●●●
● ●●
●●
●●●
●●●
●● ● ●●●
●
−1

● ● ● ●
● ●
● ●●●●
●●
●●
●●
●●
●●●
●●
●
●
●
●●
●●●
●
●●
●
●
●●
●
●
●
●
●●●●●
●
●●
● ●●●●
●●
●
●●●
●●●
●●●
●●
●● ●●●● ●
●●●●● ●● ●
●
●
●●
●
●●
●●● ●● ●
●
● ●
● ●
●●
●
●●●● ● ●
● ● ● ●
● ●
●● ● ●●
● ● ●

−2 −1 0 1 2

Figure 10: Diffusion maps. Top left: data. Top right: Transition matrix for t = 1. Botom
left: Transition matrix for t = 3. Bottom right: Transition matrix for t = 64.

17
To see how general this is, note that we recover principal components as a special case by
taking F to be linear mappings. We recover k-means by taking F to be all mappings from
{1, . . . , k} to Rd . In fact we could construct F to map to k-lines (or k-planes), also called
local principal components; see Bradley and Mangasarian (1998) and Kambhatla and Leen
(1994, 1997). But our focus in this section is on smooth curves.

We will take ( )
F= f : kf k2k ≤ C 2

where kf kK is the norm for a reproducing kernel Hilbert space (RKHS) with kernel K. A
common choice is the Gaussian kernel
kz − uk2

K(z, u) = exp − .
2h2

To approximate the minimizer, we can proceed as in (Smola, Mika, Schölkopf, Williamson

2001). Fix a large number of points z1 , . . . , zM and approximate an arbitrary f ∈ F as
M
X
f (z) = αj K(zj , z)
j=1

which depends on parameters α = (α1 , . . . , αM ). The minimizer can be found as follows.

Define latent variables ξ = (ξ1 , . . . , ξn ) where ξi ∈ Rd and

ξi = argminξ∈[0,1]d kXi − f (ξ)k2 .

For fixed α we find each ξi by any standard nonlinear function minimizer. Given ξ we then
find α by minimizing
n M M M
1X X λ XX
kXi − αj K(zj , ξi )k2 + αi αj K(zi , zj ).
n i=1 j=1
2 i=1 j=1

The minimizer is −1

λn
α= Kz + KξT Kξ KξT X
2
where (Kz )ij = K(zi , zj ) is M × M and (Kξ )ij = K(ξi , zj ) is n × M . Now we iterate,
alternately solving for ξ and α.

Example 6 Figure 11 shows some data and four principal curves based on increasing degrees
of regularization.

18
Figure 11: Principal curve with increasing amounts of regularization.

Theoretical results are due to Kégl et al. (2000) and Smola, Mika, Schölkopf, Williamson
(2001). For example, we may proceed as follows. Define a norm

kf k# ≡ sup kf (z)k2 .
z∈[0,1]k

Theorem 7 Let f∗ minimize R(f ) over F. Assume that the distribution of Xi is supported
on a compact set S and let C = supx,x0 ∈S kx − x0 k2 . For every > 0
2
P |R(f ) − R(f∗ )| > 2 ≤ 2N
b , F, k · k# e−n /(2C)
4L
for some constant L.

Proof. As with any of our previous risk minimization proofs, it suffices to show that
2
P sup |R(f ) − R(f )| > ≤ 2N
b , F, k · k# e−n /(2C) .
f ∈F 4L

Define hf (x) = minz kx − f (z)k. For any fixed f , R(f

b ) − R(f ) = Pn (hf ) − P (hf ) and so, by
Hoeffding’s inequality, 2
P |R(f ) − R(f )| > ≤ 2e−2n /C .
b

19
Let G be the set of functions of the form gf (x, z) = kx − f (z)k2 . Define a metric on G by

d(g, g 0 ) = sup |gf (x, z) − gf 0 (x, z)|.

z∈[0,1]k ,x∈S

Since S is compact, there exists L > 0 such that

kx − x0 k2 − kx − x00 k2 ≤ Lkx0 − x00 k

for all x, x0 , x00 ∈ S. It follows that

d(g, g 0 ) ≤ L sup kf (z) − f 0 (z)k = Lkf − f 0 k# . (16)

z∈[0,1]k

Let δ = /2 and let f1 , . . . , fN be an δ/2 of F. Let gj = gfj , j = 1, . . . , N . It follows from (16)
that g1 , . . . , gN is an δ/(2L) cover of G. For any f there exists fj such that d(gf , gj ) ≤ δ/2.
So

|R(f ) − R(fj )| = E inf kX − f (z)k2 − inf kX − fj (z)k2
z z

= E inf gf (X, z) − inf gj (X, z)
z z

≤ E inf gf (X, z) − inf gj (X, z) ≤ δ/2.

z z

Some comments on this result are in order. First, Smola, Mika, Schölkopf, Williamson (2001)

compute N 4L , F, k · k# for several classes. For the Gaussian kernel they show that
s
1
N (, F, k · k# ) = O

for some constant s. This implies that R(fb) − R(f∗ ) = O(n−1/2 ) which is a parametric rate
of convergence. This is somewhat misleading. As we get more and more data, we should
regularize less and less if we want a truly nonparametric analysis. This is ignored in the
analysis above.

20
9 Random Projections: Part I

A simple method for reducing the dimension is to do a random projection. Surprisingly, this
can actually preserve pairwise distances. This fact is known as the Johnson-Lindenstrauss
Lemma, and this section is devoted to an elementary proof of this result.1

Let X1 , . . . , Xn be a dataset with Xi ∈ Rd . Let S be a m × d matrix filled with iid N (0, 1)

entries, where m < d. Define
Sx
L(x) = √ .
m
The matrix S is called a sketching matrix. Define Yi = L(Xi ) and note that Yi ∈ Rm . The
projected dataset Y1 , . . . , Yn is lower dimensional.

Theorem 8 (Johnson-Lindenstrauss) Fix > 0. Let m ≥ 32 log n/2 . Then, with prob-
2
ability at least 1 − e−m /16 ≥ 1 − (1/n)2 , we have

(1 − )||Xi − Xj ||2 ≤ ||Yi − Yj ||2 ≤ (1 + )||Xi − Xj ||2 (17)

for all i, j.

Notice that the embedding dimension m, does not depend on the original dimension d.

Proof. For any j 6= k,

m
||Yj − Yk ||2 ||S(Xj − Xk )||2 1 X 2
− 1 = − 1 = Z −1
||Xi − Xj ||2 m||Xi − Xj ||2 m i=1 i

where
Xj − X k
Zi = Si ,
||Xj − Xk ||
where Si is the ith row of S. Note that Zi ∼ N (0, 1) and so Zi2 ∼ χ21 and E[Zi2 ] = 1. The
moment generating function of Zi2 is m(λ) = (1 − 2λ)−1/2 (for λ < 1/2). So, for λ > 0 small
enough,
2 e−λ 2
E[eλ(Zi −1) ] = √ ≤ e2λ .
1 − 2λ
Hence, " !#
2
X
E exp λ (Zi2 − 1) ≤ e2mλ .
i

1
In this section and the next, we follow some lecture notes by Martin Wainwright.

21
Thus
m
!
1 X 2 Pm 2
P Zi − 1 ≥ = P eλ i=1 Zi −1 ≥ eλm
m i=1
Pm 2 2
≤ e−λm E eλ i=1 Zi −1 ≤ e2mλ −mλ
2 /8
≤ e−m
1
Pm 2

where, in the last step, we chose λ = /4. By a similar argument, we can bound P m
Z
i=1 i − 1 ≤ − .
Hence,
||S(Xj − Xk )||2

2
P 2
− 1 ≥ ≤ 2e−m /8 .
m||Xj − Xk ||
By the union bound, the probability that (17) fauls for some pair is at most
2 /8 2 /16
n2 2e−m ≤ e−m

where we used the fact that m ≥ 32 log n/2 .

10 Random Projections: Part II

The key to the Johnson-Lindenstrauss (JL) theorem was applying concentration of measure
to the quantity
||Su||2
Γ(K) = sup −1
u∈K m
where
Xj − Xk
K= : j 6= k .
||Xj − Xk ||
Note that K is a subset of the sphere S d−1 .

We can generalize this to other subsets of the sphere. For example, suppose that we take
b = m−1 S T S. Note that each row of Si has mean 0 and variance matrix I
K = S d−1 . Let Σ
and Σ
b is the estimate of the covaraince matrix. Then

||Su||2 ||Su||2
sup − 1 = sup −1
u∈K m ||u||=1 m
= sup uT (m−1 S T S − I)u = ||Σ
b − I||
||u||=1

which is the operator norm of the difference between the sample covariance and true covari-
ance.

22
Now consider least squares. Suppose we want to minimize ||Y − Xβ||2 where Y is an n × 1
vector and X is a n × d matrix. If n is large, this may be expensive. We could try to
approximate the solution by minimizing ||S(Y − Xβ)||2 . The true least squares solution lies
in the column space of X. The approximate solution will lie in the column space of SX. It
can be shown that if This suggests taking
n o
d−1 d
K= u∈S : u = Xv for some v ∈ R .
Later we will show that, if Γ(K) is small, then the solution to the reduced problem approxi-
mates the original problem.

How can we bound Γ(K)? To answer this, we use the Gaussian width which is defined by

W (K) = E suphu, Zi
u∈K

where Z ∼ N (0, I) and I is the d × d identity matrix.

Theorem 9 Let S be a m × d Gaussian projection matrix. Let K be any subset of the sphere
and suppose that m ≥ W 2 (K). Then, for any ∈ (0, 1/2),

W (K) 2
P Γ(K) ≥ 4 √ + ≤ 2e−m /2 .
m
In particular, if m ≥ W 2 (K)/δ 2 , then Γ(K) ≤ 8δ with high probability.

Let us return to the JL theorem. In this case,

Xj − Xk
K= : j 6= k .
||Xj − Xk ||
In this case K is finite. The number of elements is N = n2 . Note that log N ≤ 2 log n.

Since the set is finite, we know from our previous results on expectations of maxima, that
p p
W (K) ≤ 2 log N ≤ 4 log n.
According to the above theorem, we need to take m ≥ W 2 /δ 2 log n/δ 2 which agrees with
the JL theorem.

The proof of the theorem is quite long but it basically uses concentration of measure ar-
guments to control the maximum fluctuations as u varies over K. When applied to least
squares, if we want to approximate ||Y − Xβ||2 it turns out that the Gaussian width has
constant order. Thus, taking m to be a large, fixed constant is enough. But this result
assumed we are interested in approximating β, b we need m ≈ n which is not useful. How-
ever, there is an improvement that uses iterative sketching that only requires m = O(log n)
observations. A good reference is:

M. Pilanci and M. J. Wainwright. Iterative Hessian Sketch: Fast and accurate solution
approximation for constrained least-squares. arXiv:1411.0347.

James Hillman Anima PDF
No ratings yet
James Hillman Anima PDF
197 pages
Dimension Reduction and Hidden Structure: 1.1 Principal Component Analysis (PCA)
No ratings yet
Dimension Reduction and Hidden Structure: 1.1 Principal Component Analysis (PCA)
40 pages
Spectral Decomposition for Graphs
No ratings yet
Spectral Decomposition for Graphs
5 pages
Chapter 2 Metric Spaces 2018
No ratings yet
Chapter 2 Metric Spaces 2018
19 pages
On The Eigenspectrum of The Gram Matrix and Its Relationship To The Operator Eigenspectrum
No ratings yet
On The Eigenspectrum of The Gram Matrix and Its Relationship To The Operator Eigenspectrum
18 pages
Practice_Problems_for_ML_Midterms
No ratings yet
Practice_Problems_for_ML_Midterms
5 pages
Sparse Regression
No ratings yet
Sparse Regression
37 pages
' - Magic: Recovery of Sparse Signals Via Convex Programming
No ratings yet
' - Magic: Recovery of Sparse Signals Via Convex Programming
19 pages
Admm Homework
No ratings yet
Admm Homework
5 pages
Placement Exam
100% (2)
Placement Exam
2 pages
Problems
No ratings yet
Problems
62 pages
Fundations Data Science
No ratings yet
Fundations Data Science
16 pages
Optimization by UC Berkley
No ratings yet
Optimization by UC Berkley
77 pages
Kovic 4
No ratings yet
Kovic 4
16 pages
Machine Learning 2: Exercise Sheet 1
No ratings yet
Machine Learning 2: Exercise Sheet 1
2 pages
A Problem Course in Metric Spaces
No ratings yet
A Problem Course in Metric Spaces
19 pages
ST1_2019_20_ex6
No ratings yet
ST1_2019_20_ex6
3 pages
Serija Iii: WWW - Math.hr/glasnik
No ratings yet
Serija Iii: WWW - Math.hr/glasnik
20 pages
01 - Lab Notes
No ratings yet
01 - Lab Notes
8 pages
Lecture RandomizedLA
No ratings yet
Lecture RandomizedLA
6 pages
Koorwinder Root Syst
No ratings yet
Koorwinder Root Syst
17 pages
LD
No ratings yet
LD
84 pages
HW 5
No ratings yet
HW 5
5 pages
Lecture Notes: Metric Spaces - Sergey Mozgovoy
No ratings yet
Lecture Notes: Metric Spaces - Sergey Mozgovoy
41 pages
Ahmed Rebai PCA-ICA
No ratings yet
Ahmed Rebai PCA-ICA
34 pages
Week 2 Notes
No ratings yet
Week 2 Notes
23 pages
grothendieck
No ratings yet
grothendieck
9 pages
Lecture 00
No ratings yet
Lecture 00
10 pages
Basic R Programming: Exercises
No ratings yet
Basic R Programming: Exercises
7 pages
Mathematical Economics: 1 What To Study
No ratings yet
Mathematical Economics: 1 What To Study
23 pages
K Is An: Ramanujan Graphs
No ratings yet
K Is An: Ramanujan Graphs
8 pages
Henriksson, Exercises For QM
No ratings yet
Henriksson, Exercises For QM
25 pages
A Two Parameter Distribution Obtained by
No ratings yet
A Two Parameter Distribution Obtained by
15 pages
λ and vectors x = 0 f or which Ax = λx
No ratings yet
λ and vectors x = 0 f or which Ax = λx
22 pages
Majority CA
No ratings yet
Majority CA
35 pages
Sparse Geometry
No ratings yet
Sparse Geometry
18 pages
Lec 4 RQ
No ratings yet
Lec 4 RQ
32 pages
Tomacs-paper-1998
No ratings yet
Tomacs-paper-1998
13 pages
Spectral - Graph - Theory - 5
No ratings yet
Spectral - Graph - Theory - 5
32 pages
Semiconductor Physics: 2.1 Basic Band Theory
No ratings yet
Semiconductor Physics: 2.1 Basic Band Theory
6 pages
Paper DCRE
No ratings yet
Paper DCRE
13 pages
[Numerical Methods for Partial Differential Equations vol. 26 iss. 6] Alper Korkmaz - Numerical algorithms for solutions of Korteweg–de Vries equation (2009) [10.1002_num.20505] - libgen.li
No ratings yet
[Numerical Methods for Partial Differential Equations vol. 26 iss. 6] Alper Korkmaz - Numerical algorithms for solutions of Korteweg–de Vries equation (2009) [10.1002_num.20505] - libgen.li
18 pages
hw8-solutions
No ratings yet
hw8-solutions
6 pages
QM - Excercise - 0 - Wavefunction and Probability
No ratings yet
QM - Excercise - 0 - Wavefunction and Probability
4 pages
Classical Multiple Regression
No ratings yet
Classical Multiple Regression
5 pages
Proxscal 1
100% (1)
Proxscal 1
14 pages
Wavelets 3
No ratings yet
Wavelets 3
29 pages
ExerciseSheet2 2017
No ratings yet
ExerciseSheet2 2017
2 pages
Optimization Techniques 1. Least Squares
No ratings yet
Optimization Techniques 1. Least Squares
17 pages
The Role of LUB Axiom in Real Analysis: S. Kumaresan School of Math. and Stat. University of Hyderabad Hyderabad 500046
100% (1)
The Role of LUB Axiom in Real Analysis: S. Kumaresan School of Math. and Stat. University of Hyderabad Hyderabad 500046
10 pages
PS2_signals_and_systems_27Jan2025
No ratings yet
PS2_signals_and_systems_27Jan2025
4 pages
CONSECUTIVE PRIMES IN SHORT INTERVALS ARTYOM RADOMSKII - Proc Steklov Institute - Maynard - Radziwill-Matomaki
No ratings yet
CONSECUTIVE PRIMES IN SHORT INTERVALS ARTYOM RADOMSKII - Proc Steklov Institute - Maynard - Radziwill-Matomaki
82 pages
Sparse Regression and Dictionary Learning
No ratings yet
Sparse Regression and Dictionary Learning
14 pages
Lecture7 PDF
No ratings yet
Lecture7 PDF
16 pages
Assign20153 Sol
No ratings yet
Assign20153 Sol
47 pages
hw8 (5555)
No ratings yet
hw8 (5555)
3 pages
sol3_2020
No ratings yet
sol3_2020
5 pages
Lecture 4
No ratings yet
Lecture 4
51 pages
Theory of Approximation
From Everand
Theory of Approximation
N. I. Achieser
No ratings yet
Student Solutions Manual to Accompany Economic Dynamics in Discrete Time, secondedition
From Everand
Student Solutions Manual to Accompany Economic Dynamics in Discrete Time, secondedition
Yue Jiang
4.5/5 (2)
Differential Forms
From Everand
Differential Forms
Henri Cartan
5/5 (2)
02 The Noisy Channel Model of Spelling 19-30
No ratings yet
02 The Noisy Channel Model of Spelling 19-30
12 pages
03 Real-Word Spelling Correction 9-19
No ratings yet
03 Real-Word Spelling Correction 9-19
4 pages
05 Smoothing - Add-One 6-30
No ratings yet
05 Smoothing - Add-One 6-30
3 pages
08 Kneser-Ney Smoothing 8-59
No ratings yet
08 Kneser-Ney Smoothing 8-59
3 pages
05 Sentence Segmentation 5-31
No ratings yet
05 Sentence Segmentation 5-31
3 pages
02 Regular Expressions in Practical NLP 6-04
No ratings yet
02 Regular Expressions in Practical NLP 6-04
3 pages
Computer System Servicing Teachers Module
86% (84)
Computer System Servicing Teachers Module
114 pages
Free PowerPoint Template Featuring Morph Transitions For 4 Step Processes
No ratings yet
Free PowerPoint Template Featuring Morph Transitions For 4 Step Processes
8 pages
IBM Maximo Anywhere SSO Using SAML Authentication
No ratings yet
IBM Maximo Anywhere SSO Using SAML Authentication
2 pages
Using Zbi2 Keymanager en
No ratings yet
Using Zbi2 Keymanager en
13 pages
Implementation of White Line Following Firebird V Robot
No ratings yet
Implementation of White Line Following Firebird V Robot
11 pages
Prabhat Patel 2nd Sem IOT Project
No ratings yet
Prabhat Patel 2nd Sem IOT Project
86 pages
Liebert Crv4 Brochure
0% (1)
Liebert Crv4 Brochure
8 pages
Instant Download Mastering Jquery 1st Edition Libby Alex PDF All Chapter
100% (4)
Instant Download Mastering Jquery 1st Edition Libby Alex PDF All Chapter
62 pages
CS111 & COM101 - Exercises (7 Problems Sheets - Fall 2024)
No ratings yet
CS111 & COM101 - Exercises (7 Problems Sheets - Fall 2024)
25 pages
HWK 1
No ratings yet
HWK 1
2 pages
Computer Programming Chapter4
No ratings yet
Computer Programming Chapter4
79 pages
20-04-24 TXWD20cv323 Conversant v. Tesla Complaint
No ratings yet
20-04-24 TXWD20cv323 Conversant v. Tesla Complaint
66 pages
USW-StyleGuide2011
No ratings yet
USW-StyleGuide2011
12 pages
Week 3
No ratings yet
Week 3
9 pages
SearchBot - Results - For - Temple of Elemental Evil
No ratings yet
SearchBot - Results - For - Temple of Elemental Evil
2 pages
Experiment No-7
No ratings yet
Experiment No-7
7 pages
Exp-08-Sport Data Analysis Using Hive
No ratings yet
Exp-08-Sport Data Analysis Using Hive
2 pages
Instant Download Approaching the Bible in Medieval England 1st Edition Eyal Poleg PDF All Chapters
100% (1)
Instant Download Approaching the Bible in Medieval England 1st Edition Eyal Poleg PDF All Chapters
44 pages
Page
No ratings yet
Page
163 pages
Applies To:: OBIEE 12.2.1.4 Admin Tool - Fail To Create ODBC Systems Datasource (Doc ID 2495989.1)
No ratings yet
Applies To:: OBIEE 12.2.1.4 Admin Tool - Fail To Create ODBC Systems Datasource (Doc ID 2495989.1)
2 pages
125 Most Predictible Questions For JEE Mains 202412 Topic Notes
No ratings yet
125 Most Predictible Questions For JEE Mains 202412 Topic Notes
122 pages
Block Chain Technology
100% (1)
Block Chain Technology
36 pages
Step 4: Add Reflections: Reflection Effect
No ratings yet
Step 4: Add Reflections: Reflection Effect
5 pages
Beldex-whitepaper
No ratings yet
Beldex-whitepaper
46 pages
IP Part 1
No ratings yet
IP Part 1
14 pages
SDLC - Asm2 - Le Hoang Giang-Bs00606
No ratings yet
SDLC - Asm2 - Le Hoang Giang-Bs00606
33 pages
Science Fair Project - Teachable AI - Final Normal Text
No ratings yet
Science Fair Project - Teachable AI - Final Normal Text
6 pages
_OceanofPDF.com_Big_Data_Analytics_and_Intelligent_Applications_-_Kamal_Upreti
No ratings yet
_OceanofPDF.com_Big_Data_Analytics_and_Intelligent_Applications_-_Kamal_Upreti
355 pages
Deepika G - Resume-1
No ratings yet
Deepika G - Resume-1
2 pages

Dimension Reduction

Uploaded by

Dimension Reduction

Uploaded by

DIMENSION REDUCTION

1 Principal Component Analysis (PCA)

where X e = X − µ and µ = E(X). The dimension-reduced version of X is then Tk (X) =

Theorem 1 The k th principal subspace `k is the subspace spanned by e1 , . . . , ek . Further-

where βj = hX − µ, ej i. The risk satisfies

We can restate the result as follows. To minimize

We can choose k by fixing some α and then taking

1. E[Y ] = 0 and Var(Y ) = Λ.

Principal Components Analysis (PCA)

2. Compute the eigenvalues λ1 ≥ λ2 ≥ · · · and eigenvectors e1 , e2 , . . . , of Σ.

0.0 0.2 0.4 0.6 0.8 1.0

Figure 1: First principal component in a simple two dimensional example.

Figure 2: Handwritten digits.

Figure 3: Digits data: eigenvalues

Define the operator norm ( )

max |λj (Σ)

Figure 5: Digits data: Top: digits. Bottom: their reconstructions.

A different view of dimension reduction is provided by thinking in terms of preserving pair-

Thus, we need to solve the kernel eigenvalue problem

which requires diagonalizing only an n×n system. Normalizing the eigenvectors, h v, v i = 1

Kernel PCA. Given a Mercer kernel K and data X1 , X2 , . . . , Xn

1. Center the kernel

5. Compute the projection of a test point x onto an eigenvector vm by

The weights wij are constrained so that wij = 0 if Xj is not one

Step 3: Linearization. In this step the points Xi ∈ Rd are mapped to Yi ∈ Rm , where

corresponding to a Rayleigh quotient. Each minimization gives one of the lower (m + 1)

Locally Linear Embedding (LLE). Given n data vectors Xi ∈ Rd ,

1. Compute K nearest neighbors for each point;

2. Compute local reconstruction weights wij by minimizing

3. Compute outputs Yi ∈ Rm by computing the first m eigenvectors with nonzero eigen-

Isomap. Given n data vectors Xi ∈ Rd ,

In particular, let y0 , y1 , . . . , yk ∈ Rn denote the first k eigenvectors corresponding to eigen-

Xi 7→ (y1i , y2i , . . . , yki ) ∈ Rk (11)

such that y1T Dy1 =1 (13)

Ψt (x) = (λt1 ψ1 (x), . . . , πlt ψk (x))

for some k. An example is shown in Figure 10.

8 Principal Curves and Manifolds

A nonparametric generalization of principal components is principal manifolds. The idea

To approximate the minimizer, we can proceed as in (Smola, Mika, Schölkopf, Williamson

which depends on parameters α = (α1 , . . . , αM ). The minimizer can be found as follows.

ξi = argminξ∈[0,1]d kXi − f (ξ)k2 .

The minimizer is  −1

Define hf (x) = minz kx − f (z)k. For any fixed f , R(f

d(g, g 0 ) = sup |gf (x, z) − gf 0 (x, z)|.

Since S is compact, there exists L > 0 such that

kx − x0 k2 − kx − x00 k2 ≤ Lkx0 − x00 k

for all x, x0 , x00 ∈ S. It follows that

d(g, g 0 ) ≤ L sup kf (z) − f 0 (z)k = Lkf − f 0 k# . (16)

≤ E inf gf (X, z) − inf gj (X, z) ≤ δ/2.

Let X1 , . . . , Xn be a dataset with Xi ∈ Rd . Let S be a m × d matrix filled with iid N (0, 1)

(1 − )||Xi − Xj ||2 ≤ ||Yi − Yj ||2 ≤ (1 + )||Xi − Xj ||2 (17)

Proof. For any j 6= k,

where we used the fact that m ≥ 32 log n/2 .

10 Random Projections: Part II

where Z ∼ N (0, I) and I is the d × d identity matrix.

Let us return to the JL theorem. In this case,

You might also like

The minimizer is −1

(1 − )||Xi − Xj ||2 ≤ ||Yi − Yj ||2 ≤ (1 + )||Xi − Xj ||2 (17)

where we used the fact that m ≥ 32 log n/2 .