
Principal Component Analysis
The Universal Linear Transformations
It turns out that scaling and rotation transformations are all we need in order to understand any linear transformation
A superbly powerful result in linear algebra assures us that every linear transformation can be expressed as a composition of two rotation transformations and one scaling transformation
This result is known as the singular value decomposition (SVD) theorem and it underlies a very useful ML technique known as principal component analysis (PCA)
Singular Value Decomposition
Let $A \in \mathbb{R}^{n \times n}$ be any (possibly non-symmetric) square matrix. Then we can always write $A$ as a product of three matrices
$A = U \Sigma V^\top$
where $U, V \in \mathbb{R}^{n \times n}$ are orthonormal matrices and $\Sigma$ is a scaling (diagonal) matrix with all entries non-negative
Caution: the third matrix in the product is $V^\top$, not $V$, i.e. the SVD is $U\Sigma V^\top$ and not $U\Sigma V$
Thus, every linear map is simply a rotation (+ possibly axes flips, swaps), followed by axes scaling, followed by another rotation (+ axes flips/swaps)
(Figure: an example map shown as a rotation, an axes scaling, and another rotation.)
Singular Value Decomposition
For example, consider the following four different but equivalent SVDs for the same matrix $A$ (figure: four equivalent factorizations shown on the slide)
Diagonal elements $\sigma_i$ of $\Sigma$ are called the singular values of $A$ (always $\geq 0$)
Column vectors of $U$ are called the left singular vectors of $A$
Column vectors of $V$ (row vecs of $V^\top$) are called the right singular vectors of $A$
Writing $A = U\Sigma V^\top$ as $A = \sum_i \sigma_i\mathbf{u}^i(\mathbf{v}^i)^\top$, note that each matrix $\mathbf{u}^i(\mathbf{v}^i)^\top$ has unit (row and column) rank
Can you find an expression for the rows of $A$ too?
Singular values of a matrix are always unique, singular vectors are not
In order to minimize this ambiguity, people commonly write the SVD such that $\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_n \geq 0$
Given one set of left+right singular vectors, we can obtain other sets of left+right singular vectors using "certain" orthonormal maps – details a bit tedious
Tons of things to study about singular value decompositions – too little time
Singular Value Decomposition
SVD is defined even for matrices that are not square
Suppose $A \in \mathbb{R}^{m \times n}$; then we can always write $A = U\Sigma V^\top$ where $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are orthonormal matrices of different sizes
$\Sigma \in \mathbb{R}^{m \times n}$ is a rectangular diagonal matrix with $\Sigma_{ii} \geq 0$ but $\Sigma_{ij} = 0$ for $i \neq j$
Case 1: $m < n$, i.e. output dim < input dim, i.e. $A$ reduces the dimension of vectors
In this case $\Sigma$ throws out the last $n - m$ dimensions (after $\mathbf{x}$ has been rotated by $V^\top$), in addition to scaling the rest
(Figure: $A\mathbf{x} = U\Sigma V^\top\mathbf{x}$ shown as rotation by $V^\top$, then scaling + dimension reduction by $\Sigma$, then rotation by $U$.)
Singular Value Decomposition
SVD is defined even for matrices that are not square
Suppose $A \in \mathbb{R}^{m \times n}$; then we can always write $A = U\Sigma V^\top$ where $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are orthonormal matrices of different sizes
$\Sigma \in \mathbb{R}^{m \times n}$ is a rectangular diagonal matrix with $\Sigma_{ii} \geq 0$ but $\Sigma_{ij} = 0$ for $i \neq j$
Case 2: $m > n$, i.e. output dim > input dim, i.e. $A$ increases the dimension of vectors
In this case $\Sigma$ adds $m - n$ dummy (zero) dimensions, in addition to scaling
(Figure: $A\mathbf{x} = U\Sigma V^\top\mathbf{x}$ shown as rotation by $V^\top$, then scaling + zero-padding by $\Sigma$, then rotation by $U$.)
Rank
Getting the SVD of a matrix immediately tells us a lot
Rank: We always have $\mathrm{rank}(A) =$ the number of non-zero entries in $\Sigma$
To see why, notice that if some diagonal entries of $\Sigma$ are zero, we can remove those rows and columns of $\Sigma$ and the corresponding columns of $U$ and $V$
If $A \in \mathbb{R}^{m \times n}$ and only $r$ entries of $\Sigma$ are non-zero, then we can equivalently write $A = \tilde U\tilde\Sigma\tilde V^\top$ where $\tilde U \in \mathbb{R}^{m \times r}$, $\tilde\Sigma \in \mathbb{R}^{r \times r}$ and $\tilde V \in \mathbb{R}^{n \times r}$. This is often called the thin SVD of $A$
Note that this automatically shows that row rank and column rank are the same: $A^\top = V\Sigma^\top U^\top$ is also an SVD, and its rank is clearly the same as that of $A$ since the number of non-zeros in $\Sigma$ does not change at all
Finally, notice that $A = \tilde U W$ where $W = \tilde\Sigma\tilde V^\top$, and thus the column space of $A$ is the span of the columns of $\tilde U$
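A quick numerical check of these facts (a minimal sketch, not from the slides; the matrix, the seed and the 1e-10 threshold are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
# Build a 6 x 4 matrix of rank 2 as a product of a 6x2 and a 2x4 matrix
A = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 4))

U, s, Vt = np.linalg.svd(A)              # full SVD: U is 6x6, Vt is 4x4
r = int(np.sum(s > 1e-10))               # rank = number of non-zero singular values
print(r, np.linalg.matrix_rank(A))       # both report 2

# Thin SVD: keep only the r columns/rows that matter
U_t, S_t, Vt_t = U[:, :r], np.diag(s[:r]), Vt[:r, :]
print(np.allclose(A, U_t @ S_t @ Vt_t))  # True: thin SVD reconstructs A exactly
```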
Trace and Determinant
Trace and determinant are defined only for square matrices
Trace: For a square matrix $A \in \mathbb{R}^{n \times n}$, $\mathrm{tr}(A) = \sum_{i=1}^n A_{ii}$
Can use the SVD to get another expression for the trace
Trace has a funny property: $\mathrm{tr}(AB) = \mathrm{tr}(BA)$ for any compatible $A, B$
Trace also satisfies linearity properties: if $C = A + B$, then we have $\mathrm{tr}(C) = \mathrm{tr}(A) + \mathrm{tr}(B)$, and $\mathrm{tr}(cA) = c\cdot\mathrm{tr}(A)$
This gives us $\mathrm{tr}(A) = \mathrm{tr}(U\Sigma V^\top) = \mathrm{tr}(\Sigma V^\top U)$
Determinant: For a square matrix $A$ with SVD $A = U\Sigma V^\top$, $|\det(A)| = \det(\Sigma) = \prod_i \sigma_i$, since $\det(U), \det(V) = \pm 1$ for orthonormal matrices
Sign of the determinant depends on how many axes did $U, V$ flip/swap
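A minimal numerical sketch of these two facts (assumed example, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))

# Cyclic property of the trace: tr(AB) = tr(BA)
print(np.isclose(np.trace(A @ B), np.trace(B @ A)))        # True

# |det(A)| equals the product of the singular values of A
s = np.linalg.svd(A, compute_uv=False)
print(np.isclose(abs(np.linalg.det(A)), np.prod(s)))       # True
```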
Inverses
A square matrix is invertible iff all its singular values are non-zero
A singular value being zero means that $A$ squishes vectors along some direction to the origin. This means that multiple input vectors map to the same output vector, which means that we cannot undo the linear map of $A$
Note that this means $A \in \mathbb{R}^{n \times n}$ is invertible iff $\mathrm{rank}(A) = n$
If $A = U\Sigma V^\top$ is invertible, then we always have $A^{-1} = V\Sigma^{-1}U^\top$
Nice because the inverse of a diagonal matrix is simply its element-wise inverse
If $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_n)$, then $\Sigma^{-1} = \mathrm{diag}(1/\sigma_1, \dots, 1/\sigma_n)$
Recall that we do insist $\sigma_i \neq 0$, i.e. $\sigma_i > 0$, and so $1/\sigma_i$ is well defined
Indeed $AA^{-1} = U\Sigma V^\top V\Sigma^{-1}U^\top = U\Sigma\Sigma^{-1}U^\top = UU^\top = I$
Verify that $A^{-1}A = I$ as well
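A minimal sketch of inverting via the SVD (assumed example; a random square matrix is full rank with high probability):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 5))

U, s, Vt = np.linalg.svd(A)
A_inv = Vt.T @ np.diag(1.0 / s) @ U.T         # A^{-1} = V Sigma^{-1} U^T

print(np.allclose(A_inv, np.linalg.inv(A)))   # True
print(np.allclose(A @ A_inv, np.eye(5)))      # True
```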
Pseudo Inverse
Even if a matrix is not invertible (even if $A$ is not square), we can still define the Moore-Penrose pseudo-inverse of $A$
If $A = U\Sigma V^\top$, then $A^\dagger = V\Sigma^\dagger U^\top$, where we define $\Sigma^\dagger_{ii} = 1/\Sigma_{ii}$ if $\Sigma_{ii} \neq 0$, else $\Sigma^\dagger_{ii} = 0$
The pseudo-inverse satisfies a few nice properties
If $A$ is indeed invertible, then we can show that $A^\dagger = A^{-1}$
If $A \in \mathbb{R}^{m \times n}$ with $\mathrm{rank}(A) = n$, then $A^\dagger A = I_n$, i.e. $A^\dagger$ acts as identity for the columns of $A$. This means that $A^\dagger A\mathbf{x} = \mathbf{x}$ for all $\mathbf{x}$. However, this gives us that if $\mathbf{y} = A\mathbf{x}$, then $A^\dagger\mathbf{y}$ is a vector such that $A(A^\dagger\mathbf{y}) = \mathbf{y}$. Thus, $A^\dagger$ does invert the map of $A$ by sending the output of the map to an input which does indeed generate that very output
In fact, even if there exists no $\mathbf{x}$ s.t. $A\mathbf{x} = \mathbf{y}$, even then $A^\dagger\mathbf{y}$ returns a valid solution which has the smallest error $\|A\mathbf{x} - \mathbf{y}\|_2$ and, among all vectors with this smallest error, is the one with the smallest L2 norm
Indeed, the pseudo-inverse always gives us the "least squares" solution: if $\hat{\mathbf{x}} = A^\dagger\mathbf{y}$, then it can be shown that $\hat{\mathbf{x}}$ is the vector with the smallest L2 norm among those for which $\|A\hat{\mathbf{x}} - \mathbf{y}\|_2$ is smallest
Least squares?? Does this have anything to do with the least squares we did in regression? It sure does.
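A minimal sketch connecting the pseudo-inverse to least squares (assumed example; the tall matrix and vector below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((10, 3))      # tall matrix: Ax = y has no exact solution in general
y = rng.standard_normal(10)

# Pseudo-inverse from the (thin) SVD: A^+ = V diag(1/sigma) U^T
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T

x_pinv = A_pinv @ y
x_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

print(np.allclose(A_pinv, np.linalg.pinv(A)))   # True
print(np.allclose(x_pinv, x_lstsq))             # True: same least-squares solution
```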
Eigenvectors and Eigenvalues
Can be considered quite enigmatic when taught in an MTH course
ML perspective is a bit different/easier (MTH people may dislike this – sorry)
SVD is clearly an extremely important (central even) concept in LinAlg and ML
Eigenblah can be seen as simply a handy way of obtaining the SVD of a matrix
Consider a feature matrix $X \in \mathbb{R}^{n \times d}$ – want to find its SVD $X = U\Sigma V^\top$
Instead, consider $X^\top X \in \mathbb{R}^{d \times d}$
Nice, as this matrix is square, symmetric and lets us focus on finding $V$ – how?
Let $X = U\Sigma V^\top$ (recall that the columns of $U, V$ are orthonormal). Then we have
$X^\top X = V\Sigma^\top U^\top U\Sigma V^\top = V(\Sigma^\top\Sigma)V^\top$
where $\Sigma^\top\Sigma$ is diagonal. Thus, $X^\top X$ merely scales the right singular vectors of $X$: $(X^\top X)\mathbf{v}^i = \sigma_i^2\mathbf{v}^i$
To handle $U$, simply take $XX^\top$ instead
Eigenvectors and Eigenvalues
For a square symmetric matrix $A$, an eigenpair for this matrix is a pair of a vector $\mathbf{v}$ and a scalar $\lambda$ such that $A\mathbf{v} = \lambda\mathbf{v}$
Caution: $A\mathbf{v}$ is mat-vec multiplication whereas $\lambda\mathbf{v}$ is scalar multiplication
The vector in the eigenpair (eigenvector) is merely scaled by the scalar in the pair (eigenvalue) and not rotated etc
Eigenvalues may be negative or zero too
In general, matrices have an infinite number of such eigenpairs – whenever $(\mathbf{v}, \lambda)$ is an eigenpair, so is $(c\mathbf{v}, \lambda)$ for any $c \neq 0$
To get around this issue, people usually define eigenvectors to be only unit vectors $\mathbf{v}$ such that $A\mathbf{v} = \lambda\mathbf{v}$. Even then, there is still some ambiguity as $\mathbf{v}$ and $-\mathbf{v}$ would both be eigenvectors
Recall that we commented that even the singular vectors are not really unique, since we can always generate new ones by flipping the sign etc., so it is not surprising that eigenvectors are not unique either, since right singular vectors of $X$ are eigenvectors of $X^\top X$
The concept of eigenpairs makes sense for even non-symmetric (but still square) matrices, but we will only need the symmetric case in ML
Eigenpairs and Singular Triplets
We have seen how right singular vectors of $X$ are eigenvectors of $X^\top X$ and left singular vectors of $X$ are eigenvectors of $XX^\top$
It turns out that the square roots of the eigenvalues of $X^\top X$ give us all the non-zero singular values of $X$
In fact, the square roots of the eigenvalues of $XX^\top$ also do the same thing, because $X^\top X$ and $XX^\top$ share non-zero eigenvalues
Proof: We have $X^\top X = V(\Sigma^\top\Sigma)V^\top$ and $XX^\top = U(\Sigma\Sigma^\top)U^\top$. We defined $\Sigma^\top\Sigma$, which is a $d \times d$ diagonal matrix, and $\Sigma\Sigma^\top$, which is an $n \times n$ diagonal matrix. It is easy to verify the following simple facts
$\Sigma^\top\Sigma$ and $\Sigma\Sigma^\top$ share their non-zero (diagonal) entries
Those diagonal entries are exactly the eigenvalues corresponding to the eigenvectors $\mathbf{v}^i$ (resp. $\mathbf{u}^i$)
Those diagonal entries are exactly the squares of the singular values of $X$
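A minimal numerical sketch of these facts (assumed example; the sign alignment step is needed only because eigenvectors are unique up to a sign flip):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((50, 4))                # a small feature matrix

U, s, Vt = np.linalg.svd(X, full_matrices=False)
evals, evecs = np.linalg.eigh(X.T @ X)          # eigh: ED of a symmetric matrix
evals, evecs = evals[::-1], evecs[:, ::-1]      # sort largest-first, like the SVD

print(np.allclose(evals, s**2))                 # eigenvalues = squared singular values
signs = np.sign(np.sum(evecs * Vt.T, axis=0))   # align signs column by column
print(np.allclose(evecs * signs, Vt.T))         # eigenvectors = right singular vectors
```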
Definiteness
A square symmetric matrix is PSD if and only if (aka iff) none of its eigenvalues is negative. We use the term Positive Definite (PD) to refer to matrices all of whose eigenvalues are strictly positive
Note that, equivalently, $A$ is PSD iff $\mathbf{x}^\top A\mathbf{x} \geq 0$ for every $\mathbf{x}$, and PD iff $\mathbf{x}^\top A\mathbf{x} > 0$ unless $\mathbf{x} = \mathbf{0}$, in which case obviously we must have $\mathbf{x}^\top A\mathbf{x} = 0$
We are now ready to see when exactly a square symmetric matrix is PSD. We will restrict ourselves to only square symmetric matrices
The non-symmetric case gets a bit complicated due to diagonalizability issues
Can you see that a square symmetric matrix has one or more zero eigenvalues iff it is not full rank? A full rank square symmetric matrix will have only non-zero eigenvalues
Claim without proof: every square symmetric matrix $A$ can be written as $A = V\Lambda V^\top$ where $V$ is an orthonormal matrix and $\Lambda$ is a diagonal matrix
This gives $\mathbf{x}^\top A\mathbf{x} = \sum_i \lambda_i(\mathbf{x}^\top\mathbf{v}^i)^2$: if all $\lambda_i \geq 0$, then it is easy to see that we always have $\mathbf{x}^\top A\mathbf{x} \geq 0$, i.e. $A$ is PSD; if even one $\lambda_i < 0$, then take $\mathbf{x} = \mathbf{v}^i$ to get $\mathbf{x}^\top A\mathbf{x} = \lambda_i < 0$, i.e. $A$ is not PSD
It is illuminating to see this work when the square symmetric matrix is $X^\top X$ or $XX^\top$: all eigenvalues of these matrices are squares of singular values of $X$
As before, we can write $X = U\Sigma V^\top$ where $U, V$ are orthonormal and $\Sigma$ is (rectangular) diagonal. It is easy to see that $(\mathbf{v}^i, \sigma_i^2)$ are eigenpairs for $X^\top X$. Thus, in every case, we have
1. All its eigenvalues must be non-negative, i.e. $X^\top X$ is always PSD
2. If $n \geq d$, then $X^\top X$ can have a zero eigenvalue iff $X$ has a zero singular value. This means that $X^\top X$ is full rank iff $X$ is full rank
3. If $n \leq d$, then $XX^\top$ can have a zero eigenvalue iff $X$ has a zero singular value. This means that $XX^\top$ is full rank iff $X$ is full rank
Singular Blah vs Eigenblah
Singular Value Decomposition (SVD): write a (rect) matrix as $A = U\Sigma V^\top$
Eigendecomposition (ED): write a square symm matrix as $A = V\Lambda V^\top$
Be aware of the differences between SVD and ED – do not get confused
SVD always exists, no matter whether the matrix is square/rect, symm/non-symm
ED always exists for square symm mats, may not exist (may require complex $\Lambda$ or non-orthonormal $V$) for non-symm mats. ED does not make sense for rect mats
Singular values are always non-negative, eigenvalues can be pos/neg/complex
Symmetric square matrices always have real eigenvalues. It is only in the non-symmetric case that funny things start happening. Fortunately, in most ML situations, whenever we encounter square matrices, they are symmetric too
Rotation matrices (and orthonormal matrices in general) are where the difference between SVD and ED is most stark
Rotation matrices in general do not (actually cannot) have any eigenvectors. This is because they rotate every vector (except $\mathbf{0}$, which is a trivial case) so no vector is transformed with just a simple scaling. Thus, they have no ED. However, they do have an SVD (all matrices do) and actually they are their own SVD, i.e. for a rotation matrix $R$ we may take $U = R$, $\Sigma = I$ and $V = I$
However, the identity matrix (which is also a rotation matrix – rotation by $0$ degrees) is a bit weird in that every vector is its eigenvector since $I\mathbf{v} = 1 \cdot \mathbf{v}$. The identity matrix has the same SVD and ED and that is $I = I \cdot I \cdot I^\top$
For a PSD square symmetric matrix, its SVD is its ED and vice versa
For a non-PSD square symm. matrix, its ED can be used to obtain its SVD
Get the ED $A = V\Lambda V^\top$. Let $\Sigma = |\Lambda|$ (element-wise absolute values). Then let $\mathbf{u}^i = \mathbf{v}^i$ if $\lambda_i \geq 0$ and $\mathbf{u}^i = -\mathbf{v}^i$ otherwise (or else flip the sign of $\mathbf{v}^i$ instead) to get the SVD $A = U\Sigma V^\top$
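A minimal sketch of this construction (assumed example):

```python
import numpy as np

rng = np.random.default_rng(6)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                       # symmetric, but generally not PSD

lam, V = np.linalg.eigh(A)              # ED: A = V diag(lam) V^T, lam may be negative
U = V * np.where(lam >= 0, 1.0, -1.0)   # flip columns whose eigenvalue is negative
Sigma = np.diag(np.abs(lam))

print(np.allclose(A, U @ Sigma @ V.T))  # True: a valid SVD of A
print(np.allclose(np.sort(np.abs(lam)),
                  np.sort(np.linalg.svd(A, compute_uv=False))))  # same singular values
```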
Principal Component Analysis
The largest singular value (resp. eigenvalue) of a matrix is called its leading singular value (resp. eigenvalue)
The corresponding left/right singular vector (resp. eigenvector) is called its leading left/right singular vector (resp. eigenvector)
In some cases, there may be more than one singular vector with the same singular value (resp. more than one eigenvector with the same eigenvalue)
Principal Component Analysis: the process of finding the top few singular values and corresponding singular vectors (left + right)
Given $X \in \mathbb{R}^{n \times d}$ (assume $n > d$; the case $n < d$ is similar) with SVD $X = \tilde U\Sigma V^\top$ where $\tilde U \in \mathbb{R}^{n \times d}$, $\Sigma \in \mathbb{R}^{d \times d}$ and $V \in \mathbb{R}^{d \times d}$: $\tilde U$ and $V$ both have orthonormal columns, $V$ is square but $\tilde U$ is not!
Note: we dropped the last $n - d$ columns of $U$, which were useless, to get $\tilde U$
Careful: since we have dropped some columns of $U$, we need to be careful. When $U$ was square, i.e. $n \times n$, its columns were orthonormal as were its rows, i.e. $U^\top U = UU^\top = I_n$. However, now that we have removed some columns, whereas the remaining columns are still orthonormal, i.e. $\tilde U^\top\tilde U = I_d$, the rows are not necessarily orthonormal, i.e. $\tilde U\tilde U^\top$ may not equal $I_n$
Want to find the leading $k$ triplets, i.e. $(\sigma_i, \mathbf{u}^i, \mathbf{v}^i)$ for $i = 1, \dots, k$, for some $k \leq d$
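A minimal sketch of this process (assumed example; the centering step and the choice k = 3 are illustrative, not prescribed by the slides):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.standard_normal((200, 10))          # n = 200 points in d = 10 dimensions
X = X - X.mean(axis=0)                      # centre the data (common before PCA)

k = 3
U, s, Vt = np.linalg.svd(X, full_matrices=False)
sigma_k, U_k, V_k = s[:k], U[:, :k], Vt[:k].T    # leading k singular triplets

Z = X @ V_k                  # n x k representation: project onto top right singular vectors
X_approx = Z @ V_k.T         # best rank-k approximation of X
print(Z.shape, np.linalg.norm(X - X_approx))
```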
Principal Component Analysis
Suppose we wish to find the leading right singular vector of $X$
Same as finding the leading eigenvector of $X^\top X$
To find the leading left singular vector of $X$, find the leading eigenvector of $XX^\top$ instead
Denote $A = X^\top X$
Recall that $A = V\Lambda V^\top$ with $\lambda_i = \sigma_i^2$, and that we reorder things so that $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_d \geq 0$
Assume for sake of simplicity that $\lambda_1 \geq c\cdot\lambda_2$ for some $c > 1$
This is sometimes called an eigengap or even leading eigengap
Easier to see the algorithms at work with an eigengap – will handle the other case later
Caution: textbooks might write the eigengap additively, as $\lambda_1 - \lambda_2 \geq \gamma$ for some $\gamma > 0$ – same thing
Note: $A^s = V\Lambda^s V^\top$ where $\Lambda^s = \mathrm{diag}(\lambda_1^s, \dots, \lambda_d^s)$
Similarly, convince yourself that $A^s\mathbf{v}^i = \lambda_i^s\mathbf{v}^i$ for any $i$
Will use this curious fact very efficiently to find the leading eigenpair
The Power Method
Note that if $A = V\Lambda V^\top$, then $A^s = V\Lambda^s V^\top$
Even if $\lambda_1/\lambda_2$ is only slightly larger than $1$, with a large enough $s$ the gap blows up, e.g. $(\lambda_1/\lambda_2)^s$ becomes huge
Thus, with large $s$, the leading eigenvalue really stands out!
Let us take a vector $\mathbf{u} \in \mathbb{R}^d$ and let $\boldsymbol{\alpha} = V^\top\mathbf{u}$ (i.e. $\mathbf{u} = V\boldsymbol{\alpha}$) to get
$A^s\mathbf{u} = V\Lambda^s V^\top\mathbf{u} = V\Lambda^s\boldsymbol{\alpha} = \alpha_1\lambda_1^s\mathbf{v}^1 + \alpha_2\lambda_2^s\mathbf{v}^2 + \dots + \alpha_d\lambda_d^s\mathbf{v}^d$
The vector $\boldsymbol{\alpha}$ represents $\mathbf{u}$ in terms of the columns of $V$, i.e. $\mathbf{u} = \sum_i \alpha_i\mathbf{v}^i$
Notice that since $V$ is orthonormal, we have $\|\boldsymbol{\alpha}\|_2 = \|\mathbf{u}\|_2$
However, we just saw that $\lambda_1^s$ dwarfs $\lambda_i^s$ for all $i > 1$
This means that $A^s\mathbf{u} \approx \alpha_1\lambda_1^s\mathbf{v}^1$, which means that $\frac{A^s\mathbf{u}}{\|A^s\mathbf{u}\|_2} \to \pm\mathbf{v}^1$ as $s \to \infty$
How do I find the leading eigenvalue? Find $\mathbf{v}^1$ and then use the fact that $(\mathbf{v}^1)^\top A\mathbf{v}^1 = \lambda_1$ is the eigenvalue of $A$ corresponding to $\mathbf{v}^1$
Hmm … this means if our approximation of $\mathbf{v}^1$ is not good, then our approximation of $\lambda_1$ won't be good either
Note that $\alpha_1$ must be non-zero so that the $\mathbf{v}^1$ term survives. If $\alpha_1 = 0$, then the $\mathbf{v}^1$ component of $A^s\mathbf{u}$ is zero as well, which means we will never recover the vector $\mathbf{v}^1$. The longer you run, i.e. the larger the $s$, the better the approximation you will get
We obtained our approximation of $\mathbf{v}^1$. How should we choose $\mathbf{u}$? Will any $\mathbf{u}$ work? How should we choose $s$?
The Power Method
THE POWER METHOD
1. Input: square symmetric matrix $A \in \mathbb{R}^{d \times d}$
2. Initialize $\mathbf{u}^0$ randomly, e.g. $\mathbf{u}^0 \sim \mathcal{N}(\mathbf{0}, I_d)$ – this ensures with high probability that $\alpha_1 \neq 0$
3. For $t = 1, 2, \dots, s$
   1. Let $\mathbf{z}^t = A\mathbf{u}^{t-1}$
   2. Let $\mathbf{u}^t = \mathbf{z}^t / \|\mathbf{z}^t\|_2$
4. Return leading eigenvector estimate as $\hat{\mathbf{v}}^1 = \mathbf{u}^s$
5. Return leading eigenvalue estimate as $\hat\lambda_1 = (\hat{\mathbf{v}}^1)^\top A\hat{\mathbf{v}}^1$
This calculates $A^s\mathbf{u}^0$ (up to normalization) using $s$ mat-vec multiplications instead of first calculating $A^s$ itself, which would need far costlier matrix-matrix multiplications
Good to periodically normalize (as in step 3.2) to prevent overflow errors. Can show that this doesn't affect the working of the algo in any way, since only the direction of $\mathbf{u}^t$ matters
The Power Method is fast – guaranteed to return a good estimate in at most a modest number of iterations (proof beyond CS771)
In settings with no eigengap, it turns out that there is an entire subspace (i.e. infinitely many eigenvectors) corresponding to the largest eigenvalue
To find smaller eigenpairs, we "peel" off the largest eigenpair we have found and repeat the process
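A minimal runnable sketch of the method above (the function name power_method and the fixed iteration count are my choices, not the slides'):

```python
import numpy as np

def power_method(A, n_iters=200, rng=None):
    """Estimate the leading eigenpair of a square symmetric matrix A."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.standard_normal(A.shape[0])   # random init: alpha_1 != 0 with high probability
    for _ in range(n_iters):
        z = A @ u                         # one mat-vec multiplication per iteration
        u = z / np.linalg.norm(z)         # normalize to prevent overflow
    lam = u @ A @ u                       # eigenvalue estimate from the eigenvector estimate
    return lam, u

# Quick check against numpy's eigendecomposition
rng = np.random.default_rng(8)
X = rng.standard_normal((100, 6))
A = X.T @ X                               # square symmetric (PSD) test matrix
lam1, v1 = power_method(A, rng=rng)
evals, evecs = np.linalg.eigh(A)
print(np.isclose(lam1, evals[-1]))                 # leading eigenvalue matches
print(np.isclose(abs(v1 @ evecs[:, -1]), 1.0))     # eigenvector matches up to sign
```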
The Peeling Method
THE PEELING METHOD
1. Input: square symmetric matrix $A \in \mathbb{R}^{d \times d}$, number of eigenpairs $k$
2. Initialize $A^1 = A$
3. For $j = 1, 2, \dots, k$
   1. Let $(\hat\lambda_j, \hat{\mathbf{v}}^j) = \text{PowerMethod}(A^j)$
   2. Let $A^{j+1} = A^j - \hat\lambda_j\hat{\mathbf{v}}^j(\hat{\mathbf{v}}^j)^\top$ (the "peeling" step)
4. Return $(\hat\lambda_1, \hat{\mathbf{v}}^1), \dots, (\hat\lambda_k, \hat{\mathbf{v}}^k)$
Takes overall time proportional to $k$ runs of the Power Method to return the top $k$ leading eigenpairs of $A$
After the leading eigenpair is peeled off, the eigenpair with the second largest eigenvalue becomes the new leading pair and the Power Method can now recover this
Some residue might still be left due to inaccurate estimation of $(\hat\lambda_j, \hat{\mathbf{v}}^j)$, but it is usually small if $s$ is sufficiently large
$A = \lambda_1\mathbf{v}^1(\mathbf{v}^1)^\top + \lambda_2\mathbf{v}^2(\mathbf{v}^2)^\top + \lambda_3\mathbf{v}^3(\mathbf{v}^3)^\top + \lambda_4\mathbf{v}^4(\mathbf{v}^4)^\top + \dots$
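A minimal runnable sketch of the peeling idea (the names power_method and peeling_method are mine; the power-method sketch is repeated here so the block is self-contained):

```python
import numpy as np

def power_method(A, n_iters=500, rng=None):
    # Same power-method sketch as before: estimate the leading eigenpair of symmetric A
    rng = np.random.default_rng() if rng is None else rng
    u = rng.standard_normal(A.shape[0])
    for _ in range(n_iters):
        z = A @ u
        u = z / np.linalg.norm(z)
    return u @ A @ u, u

def peeling_method(A, k, n_iters=500, rng=None):
    """Top-k eigenpairs of a square symmetric A by repeatedly peeling off the leading pair."""
    rng = np.random.default_rng() if rng is None else rng
    A_j = A.copy()
    pairs = []
    for _ in range(k):
        lam, v = power_method(A_j, n_iters=n_iters, rng=rng)
        pairs.append((lam, v))
        A_j = A_j - lam * np.outer(v, v)   # the "peeling" step
    return pairs

# Quick check against numpy on a small symmetric PSD matrix
rng = np.random.default_rng(9)
X = rng.standard_normal((100, 5))
A = X.T @ X
top2 = peeling_method(A, k=2, rng=rng)
print([lam for lam, _ in top2])
print(np.sort(np.linalg.eigvalsh(A))[::-1][:2])    # should closely match the two values above
```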
