EECS 275 Matrix Computation

Ming-Hsuan Yang
Electrical Engineering and Computer Science
University of California at Merced
Merced, CA 95344
http://faculty.ucmerced.edu/mhyang

Lecture 22

1 / 21

Overview

Algorithms for Modern Massive Data Sets (MMDS):
- Explore algorithms for modeling and analyzing massive, high-dimensional, and nonlinearly structured data
- Bring together computer scientists, computational and applied mathematicians, and practitioners
- Tools: numerical linear algebra, kernel methods, multilinear algebra, randomized algorithms, optimization, differential geometry, geometry and topology, etc.
- Organized by Gene Golub et al. in 2006, 2008, and 2010
- Slides available at mmds.stanford.edu

2 / 21

Topics

- Low-rank matrix approximation: theory, sampling algorithms
- Nearest neighbor algorithms and approximation
- Spectral graph theory and applications
- Non-negative matrix factorization
- Kernel methods
- Algebraic topology and analysis of high-dimensional data
- Higher-order statistics, tensors, and approximations

3 / 21

Matrix factorization and applications

- Treat each data point as a vector (a small NumPy example follows this slide)
  - 2D image → 1D vector of intensity values
  - 3D model → 1D vector of 3D coordinates
  - Document → 1D vector of term frequencies
- Massive data set
  - High-dimensional data
  - Find a low-dimensional representation using eigendecomposition
- See O'Leary's slides

4 / 21
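
A minimal NumPy sketch of the data-as-vector view (illustrative only; the 64-by-64 image size and the random data are made up for this example):

    import numpy as np

    # Hypothetical 64x64 grayscale images; each becomes a 4096-dimensional vector
    rng = np.random.default_rng(0)
    images = [rng.random((64, 64)) for _ in range(10)]

    # Stack the flattened images as columns of an m x n data matrix A
    A = np.column_stack([im.reshape(-1) for im in images])
    print(A.shape)  # (4096, 10): m = 4096 intensities, n = 10 data points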

Low rank matrix approximation


- SVD is great but computationally expensive for large-scale problems
- Efficient randomized algorithms exist for low-rank approximation with guaranteed error bounds
- CX algorithm [Drineas and Kannan, FOCS '01]: randomly pick $k$ columns,
  $$A_{m \times n} \approx C_{m \times k} X_{k \times n}$$
- CUR algorithm [Drineas and Kannan, SODA '03]: randomly pick $c$ columns and $r$ rows,
  $$A_{m \times n} \approx C_{m \times c} U_{c \times r} R_{r \times n}$$
- Element-wise sampling [Achlioptas and McSherry, STOC '01]:
  $$A_{m \times n} \approx S_{m \times n}, \qquad S_{ij} = \begin{cases} A_{ij}/p_{ij} & \text{with probability } p_{ij} \\ 0 & \text{with probability } 1 - p_{ij} \end{cases}$$
- See Kannan's and Drineas' slides.
5 / 21

Approximating matrix multiplication


- Given an $m \times n$ matrix $A$ and an $n \times p$ matrix $B$,
  $$AB = \sum_{i=1}^{n} \underbrace{A^{(i)} B_{(i)}}_{\in\, \mathbb{R}^{m \times p}}$$
  where $A^{(i)}$ is the $i$-th column of $A$ and $B_{(i)}$ is the $i$-th row of $B$
- Each term is a rank-one matrix
- Random sampling algorithm (sketched in NumPy after this slide):
  - fix a set of probabilities $p_i$, $i = 1, \ldots, n$, summing to 1
  - for $t = 1$ up to $s$, set $j_t = i$ with probability $\Pr(j_t = i) = p_i$ (i.e., pick $s$ terms of the sum with replacement w.r.t. $p_i$)
  - approximate $AB$ by the sum of the $s$ sampled terms, after scaling:
    $$AB \approx \underbrace{\frac{1}{s} \sum_{t=1}^{s} \frac{1}{p_{j_t}} A^{(j_t)} B_{(j_t)}}_{\in\, \mathbb{R}^{m \times p}}$$
6 / 21
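
The rank-one decomposition and the unbiased sampled estimator above can be checked directly in NumPy; this is only an illustrative sketch with arbitrary sizes and, for simplicity, uniform probabilities:

    import numpy as np

    rng = np.random.default_rng(0)
    m, n, p = 50, 200, 40
    A = rng.standard_normal((m, n))
    B = rng.standard_normal((n, p))

    # AB as a sum of n rank-one terms (column of A times row of B)
    outer_sum = sum(np.outer(A[:, i], B[i, :]) for i in range(n))
    print(np.allclose(outer_sum, A @ B))  # True

    # Sampled estimator: pick s terms with replacement w.r.t. probabilities p_i
    s = 50
    probs = np.full(n, 1.0 / n)           # uniform p_i for this toy example
    idx = rng.choice(n, size=s, p=probs)
    approx = sum(np.outer(A[:, j], B[j, :]) / probs[j] for j in idx) / s
    rel_err = np.linalg.norm(A @ B - approx, "fro") / np.linalg.norm(A @ B, "fro")
    print(rel_err)  # shrinks as s grows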

Approximating matrix multiplication (contd)

- In matrix notation: $A_{m \times n} B_{n \times p} \approx C_{m \times s} R_{s \times p}$
- Create $C$ and $R$ via $s$ i.i.d. trials with replacement
- For $t = 1$ up to $s$, pick a column $A^{(j_t)}$ and a row $B_{(j_t)}$ with probability
  $$\Pr(j_t = i) = \frac{\|A^{(i)}\|_2 \, \|B_{(i)}\|_2}{\sum_{i'=1}^{n} \|A^{(i')}\|_2 \, \|B_{(i')}\|_2}$$
- Include $A^{(j_t)} / (s p_{j_t})^{1/2}$ as a column of $C$ and $B_{(j_t)} / (s p_{j_t})^{1/2}$ as a row of $R$ (see the sketch below)

7 / 21
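
A sketch of the same estimator in matrix form, building C and R with the norm-based probabilities above (illustrative code under these assumptions, not a reference implementation of the course algorithm):

    import numpy as np

    def sampled_product(A, B, s, rng):
        """Approximate A @ B by C @ R using s sampled column/row pairs."""
        n = A.shape[1]
        # Probabilities proportional to ||A^(i)||_2 * ||B_(i)||_2
        weights = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
        probs = weights / weights.sum()
        idx = rng.choice(n, size=s, p=probs)
        scale = 1.0 / np.sqrt(s * probs[idx])
        C = A[:, idx] * scale              # columns A^(j_t) / (s p_{j_t})^{1/2}
        R = B[idx, :] * scale[:, None]     # rows    B_(j_t) / (s p_{j_t})^{1/2}
        return C, R

    rng = np.random.default_rng(1)
    A = rng.standard_normal((100, 500))
    B = rng.standard_normal((500, 80))
    C, R = sampled_product(A, B, s=100, rng=rng)
    err = np.linalg.norm(A @ B - C @ R, "fro")
    print(err / (np.linalg.norm(A, "fro") * np.linalg.norm(B, "fro")))

With these probabilities the relative error printed above typically lands near the $1/\sqrt{s}$ factor in the bound quoted on the next slide.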

Approximating matrix multiplication (contd)


- The input matrices are given in sparse unordered representation, e.g., their non-zero entries are presented as triples $(i, j, A_{ij})$ in any order
- The expectation of $CR$ (element-wise) is $AB$
- The nonuniform sampling minimizes the variance of this estimator
- Easy to implement the sampling algorithm in two phases
- If the matrices are dense, the algorithm runs in $O(smp)$ time instead of $O(nmp)$ time
- Requires only $O(sm + sp)$ memory
- Does not tamper with the sparsity of the matrices
- For the above algorithm,
  $$E\left(\|AB - CR\|_{2,F}\right) \le \frac{1}{\sqrt{s}}\, \|A\|_F \|B\|_F$$
- With probability at least $1 - \delta$,
  $$\|AB - CR\|_{2,F} \le \frac{O(\log(1/\delta))}{\sqrt{s}}\, \|A\|_F \|B\|_F$$
8 / 21

Special case: B = A>


- If $B = A^\top$, then the sampling probabilities are
  $$\Pr(\text{picking } i) = \frac{\|A^{(i)}\|_2^2}{\sum_{i'=1}^{n} \|A^{(i')}\|_2^2} = \frac{\|A^{(i)}\|_2^2}{\|A\|_F^2}$$
- Also $R = C^\top$, and the error bound is
  $$E\left(\|AA^\top - CC^\top\|_{2,F}\right) \le \frac{1}{\sqrt{s}}\, \|A\|_F^2$$
- Improved spectral-norm bound for this special case:
  $$E\left(\|AA^\top - CC^\top\|_2\right) \le \sqrt{\frac{4 \log s}{s}}\, \|A\|_F \|A\|_2$$
- The sampling procedure is slightly different; $s$ columns/rows are kept in expectation, i.e., column $i$ is picked with probability
  $$\Pr(\text{picking } i) = \min\left(1, \frac{s \|A^{(i)}\|_2^2}{\|A\|_F^2}\right)$$
  (a small sketch follows this slide)
9 / 21
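
For this Gram-matrix case the probabilities depend only on squared column norms; a brief illustrative sketch of sampling $C$ so that $CC^\top$ approximates $AA^\top$:

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.standard_normal((80, 1000))
    s = 200

    # p_i = ||A^(i)||_2^2 / ||A||_F^2
    probs = np.linalg.norm(A, axis=0) ** 2 / np.linalg.norm(A, "fro") ** 2
    idx = rng.choice(A.shape[1], size=s, p=probs)
    C = A[:, idx] / np.sqrt(s * probs[idx])       # here R = C.T

    err = np.linalg.norm(A @ A.T - C @ C.T, "fro")
    print(err / np.linalg.norm(A, "fro") ** 2)    # compare with 1/sqrt(s)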

Approximating SVD in O(n) time


- The complexity of computing the SVD of an $m \times n$ matrix $A$ is $O(\min(mn^2, m^2 n))$ (e.g., using the Golub-Kahan algorithm)
- The top few singular vectors/values can be approximated faster using Lanczos/Arnoldi methods
- Let $A_k$ be the rank-$k$ approximation of $A$
- $A_k$ is the matrix of rank $k$ such that $\|A - A_k\|_{2,F}$ is minimized over all rank-$k$ matrices
- Approximate SVD in linear time $O(m + n)$ (sketched in code after this slide):
  - sample $c$ columns from $A$ and rescale them to form the $m \times c$ matrix $C$
  - compute the $m \times k$ matrix $H_k$ of the top $k$ left singular vectors of $C$
- Structural theorem: for any probabilities and any number of columns,
  $$\|A - H_k H_k^\top A\|_{2,F}^2 \le \|A - A_k\|_{2,F}^2 + 2\sqrt{k}\, \|AA^\top - CC^\top\|_F$$
- Algorithmic theorem: if $p_i = \|A^{(i)}\|_2^2 / \|A\|_F^2$ and $c \ge 4k/\epsilon^2$, then
  $$\|A - H_k H_k^\top A\|_{2,F}^2 \le \|A - A_k\|_{2,F}^2 + \epsilon\, \|A\|_F^2$$
10 / 21
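
A sketch of the linear-time column-sampling approximation described above (assumes dense NumPy linear algebra; a full SVD of A is computed here only to compare against the best rank-k approximation):

    import numpy as np

    def approx_svd_columns(A, c, k, rng):
        """Sample c rescaled columns of A; return H_k, the top k left singular vectors of C."""
        probs = np.linalg.norm(A, axis=0) ** 2 / np.linalg.norm(A, "fro") ** 2
        idx = rng.choice(A.shape[1], size=c, p=probs)
        C = A[:, idx] / np.sqrt(c * probs[idx])
        U, _, _ = np.linalg.svd(C, full_matrices=False)
        return U[:, :k]                           # m x k matrix H_k

    rng = np.random.default_rng(3)
    A = rng.standard_normal((512, 2000))
    k, c = 10, 200
    Hk = approx_svd_columns(A, c, k, rng)

    # Compare ||A - H_k H_k^T A||_F^2 with the optimal ||A - A_k||_F^2
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    Ak = (U[:, :k] * S[:k]) @ Vt[:k, :]
    print(np.linalg.norm(A - Hk @ (Hk.T @ A), "fro") ** 2,
          np.linalg.norm(A - Ak, "fro") ** 2)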

Example of randomized SVD


Compute the top $k$ left singular vectors of the sampled matrix $C$ and store them in the 512-by-$k$ matrix $H_k$.

[Figure: the original matrix $A$, the matrix $C$ after sampling columns, and the reconstruction $H_k H_k^\top A$]

11 / 21

Potential problems with SVD

- Structure in the data is not respected by mathematical operations on the data:
  - Reification: maximum variance directions
  - Interpretability: what does a linear combination of 6000 genes mean?
  - Sparsity: destroyed by orthogonalization
  - Non-negativity: a convex but not linear-algebraic notion
- Does there exist a better low-rank matrix approximation?
  - better structural properties for certain applications
  - better at respecting relevant structure
  - better for interpretability and informing intuition

12 / 21

CX matrix decomposition
- Goal: find $A_{m \times n} \approx C_{m \times c} X_{c \times n}$ so that $A - CX$ is small in some norm
- One way to approach this is
  $$\min_{X \in \mathbb{R}^{c \times n}} \|A - CX\|_F = \|A - C(C^{+}A)\|_F$$
  where $C^{+}$ is the pseudoinverse of $C$
- SVD of $A$: $A_k = U_k \Sigma_k V_k^\top$, where $A_k$ is $m \times n$, $U_k$ is $m \times k$, $\Sigma_k$ is $k \times k$, and $V_k^\top$ is $k \times n$
- Subspace sampling: $V_k$ is an orthogonal matrix containing the top $k$ right singular vectors of $A$
- The columns of $V_k$ are orthonormal vectors, but the rows of $V_k$, denoted $(V_k)_{(i)}$, are not orthonormal vectors
- Subspace sampling in $O(\text{SVD}_k(A))$ time:
  $$p_i = \frac{\|(V_k)_{(i)}\|_2^2}{k}, \qquad i = 1, 2, \ldots, n$$
13 / 21

Relative error CX
Relative error CX decomposition (a NumPy sketch follows this slide):
- compute the probabilities $p_i$
- for each $i = 1, 2, \ldots, n$, pick the $i$-th column of $A$ with probability $\min(1, c p_i)$
- let $C$ be the matrix containing the sampled columns

Theorem
For any $k$, let $A_k$ be the best rank-$k$ approximation to $A$. In $O(\text{SVD}_k(A))$ time we can compute probabilities $p_i$ such that if $c = O(k \log k / \epsilon^2)$, then with probability at least $1 - \delta$,
$$\min_{X \in \mathbb{R}^{c \times n}} \|A - CX\|_F = \|A - CC^{+}A\|_F \le (1 + \epsilon)\, \|A - A_k\|_F$$

14 / 21
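
A sketch of the relative-error CX procedure above (illustrative; a dense SVD is used to obtain $V_k$, whereas the theorem only requires an $O(\text{SVD}_k(A))$ computation):

    import numpy as np

    def cx_decomposition(A, k, c, rng):
        """Sample columns of A with subspace-sampling probabilities; return C and X = C^+ A."""
        _, _, Vt = np.linalg.svd(A, full_matrices=False)
        Vk = Vt[:k, :].T                            # n x k, top k right singular vectors
        p = np.sum(Vk ** 2, axis=1) / k             # p_i = ||(V_k)_(i)||_2^2 / k
        keep = rng.random(A.shape[1]) < np.minimum(1.0, c * p)
        C = A[:, keep]                              # about c columns in expectation
        X = np.linalg.pinv(C) @ A
        return C, X

    rng = np.random.default_rng(4)
    # Low-rank-plus-noise test matrix
    A = rng.standard_normal((300, 40)) @ rng.standard_normal((40, 500)) \
        + 0.01 * rng.standard_normal((300, 500))
    k = 40
    C, X = cx_decomposition(A, k=k, c=4 * k, rng=rng)

    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    Ak = (U[:, :k] * S[:k]) @ Vt[:k, :]
    print(np.linalg.norm(A - C @ X, "fro"), np.linalg.norm(A - Ak, "fro"))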

CUR decomposition
- Goal: find $A_{m \times n} \approx C_{m \times c} U_{c \times r} R_{r \times n}$ so that $\|A - CUR\|$ is small in some norm
- Why: after making two passes over $A$, one can compute $C$, $U$, and $R$ with provable guarantees and store them (a sketch of $A$) instead of $A$: $O(m + n)$ vs. $O(mn)$ space
- SVD of $A$: $A_{m \times n} = U_{m \times \rho} \Sigma_{\rho \times \rho} V_{\rho \times n}^\top$, where $\rho = \mathrm{rank}(A)$
- Exact computation of the SVD takes $O(\min(mn^2, m^2 n))$ time; the top $k$ left/right singular vectors/values can be computed with Lanczos/Arnoldi methods
- Rank-$k$ approximation: $A_k = U_k \Sigma_{k \times k} V_k^\top$, where $\Sigma_{k \times k}$ is a diagonal matrix with the top $k$ singular values of $A$
- Note that the columns of $U_k$ are linear combinations of all columns of $A$, and the rows of $V_k^\top$ are linear combinations of all rows of $A$

15 / 21

The CUR decomposition


- Sampling columns for $C$: use the CX algorithm to sample columns of $A$
  - $C$ has $c$ columns in expectation
  - write $C_{m \times c} = U_C \Sigma_C V_C^\top$, where $U_C$ is the $m \times \rho$ orthogonal matrix containing the left singular vectors of $C$ and $\rho$ is the rank of $C$
  - let $(U_C)_{(i)}$ denote the $i$-th row of $U_C$
- Sampling rows for $R$:
  - subspace sampling in $O(c^2 m)$ time with probabilities
    $$q_i = \frac{\|(U_C)_{(i)}\|_2^2}{\rho}, \qquad i = 1, 2, \ldots, m$$
  - $R$ has $r$ rows in expectation
- Compute $U$:
  - let $W$ be the intersection of $C$ and $R$
  - let $U$ be a rescaled pseudoinverse of $W$
16 / 21

The CUR decomposition (contd)


Put together (a NumPy sketch follows this slide):
$$A_{m \times n} \approx CUR = C_{m \times c} \, \underbrace{(DW)^{+} D}_{c \times r} \, R_{r \times n}$$
where $W_{r \times c}$ is the intersection of $C$ and $R$, $D$ is a diagonal rescaling matrix, and $U = (DW)^{+} D$

Theorem
Given $C$, in $O(c^2 m)$ time one can compute probabilities $q_i$ such that
$$\|A - CUR\|_F \le (1 + \epsilon)\, \|A - C(C^{+}A)\|_F$$
holds with probability at least $1 - \delta$ if $r = O(c \log c / \epsilon^2)$ rows are sampled
17 / 21
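
A sketch of the full CUR pipeline under the sampling scheme above (illustrative and simplified: $U$ is taken as the plain pseudoinverse of the intersection block $W$, omitting the diagonal rescaling $(DW)^{+}D$ used in the analysis):

    import numpy as np

    def cur_decomposition(A, k, c, r, rng):
        # Columns: CX-style subspace sampling from the top-k right singular vectors of A
        _, _, Vt = np.linalg.svd(A, full_matrices=False)
        col_p = np.sum(Vt[:k, :].T ** 2, axis=1) / k
        cols = np.flatnonzero(rng.random(A.shape[1]) < np.minimum(1.0, c * col_p))
        C = A[:, cols]                                   # ~c columns in expectation
        # Rows: subspace sampling from the left singular vectors U_C of C
        Uc, Sc, _ = np.linalg.svd(C, full_matrices=False)
        rho = int(np.sum(Sc > 1e-10 * Sc[0]))            # numerical rank of C
        row_q = np.sum(Uc[:, :rho] ** 2, axis=1) / rho
        rows = np.flatnonzero(rng.random(A.shape[0]) < np.minimum(1.0, r * row_q))
        R = A[rows, :]                                   # ~r rows in expectation
        # U from the intersection W of C and R (rescaling omitted in this sketch)
        W = A[np.ix_(rows, cols)]
        return C, np.linalg.pinv(W), R

    rng = np.random.default_rng(5)
    A = rng.standard_normal((400, 30)) @ rng.standard_normal((30, 600)) \
        + 0.01 * rng.standard_normal((400, 600))
    k = 30
    C, U, R = cur_decomposition(A, k=k, c=4 * k, r=4 * k, rng=rng)
    print(np.linalg.norm(A - C @ U @ R, "fro") / np.linalg.norm(A, "fro"))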

Relative error CUR


Theorem
For any $k$, it takes $O(\text{SVD}_k(A))$ time to construct $C$, $U$, and $R$ such that
$$\|A - CUR\|_F \le (1 + \epsilon)\, \|A - U_k \Sigma_k V_k^\top\|_F = (1 + \epsilon)\, \|A - A_k\|_F$$
holds with probability at least $1 - \delta$ by picking
- $O(k \log k \log(1/\delta) / \epsilon^2)$ columns, and
- $O(k \log^2 k \log(1/\delta) / \epsilon^6)$ rows,
where $O(\text{SVD}_k(A))$ is the time to compute the top $k$ left/right singular vectors/values

- Applications: genomic microarray data, time-resolved fMRI data, sequence and mutational data, hyperspectral color cancer data
- For small $k$, in $O(\text{SVD}_k(A))$ time we can construct $C$, $U$, and $R$ such that $\|A - CUR\|_F \le (1 + 0.001)\, \|A - A_k\|_F$ using typically at most $k + 5$ columns and at most $k + 5$ rows
18 / 21

Element-wise sampling
Main idea:
- to approximate matrix $A$, keep a few elements of the matrix (instead of sampling rows or columns) and zero out the remaining elements
- compute a rank-$k$ approximation to this sparse matrix (using Lanczos methods)

- Create the matrix $S$ from $A$ such that
  $$S_{ij} = \begin{cases} A_{ij}/p_{ij} & \text{with probability } p_{ij} \\ 0 & \text{with probability } 1 - p_{ij} \end{cases}$$
- It can be shown that $\|A - S\|_2$ is bounded and the singular values of $S$ and $A$ are close
- Under additional assumptions, the top $k$ left singular vectors of $S$ span a subspace that is close to the subspace spanned by the top $k$ left singular vectors of $A$
19 / 21

Element-wise sampling (contd)

Approximating singular values fast (a sketch follows this slide):
- zero out a large number of elements of $A$, and scale the remaining ones appropriately
- compute the singular values of the resulting sparse matrix using iterative methods
- a good choice is $p_{ij} = \frac{s A_{ij}^2}{\sum_{i,j} A_{ij}^2}$, where $s$ denotes the expected number of elements that we seek to keep in $S$
- note that each element is kept or discarded independently of the others

Similar ideas have been used to:
- explain the success of Latent Semantic Indexing
- design recommendation systems
- speed up kernel computations

20 / 21
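
A sketch of element-wise sampling with the $p_{ij}$ above (illustrative; $p_{ij}$ is capped at 1 here, and a real implementation would store $S$ in a sparse format and use an iterative eigensolver):

    import numpy as np

    rng = np.random.default_rng(6)
    A = rng.standard_normal((300, 40)) @ rng.standard_normal((40, 400))
    s = 30000                                       # expected number of kept entries

    # p_ij = s * A_ij^2 / sum_ij A_ij^2, capped at 1; entries kept independently
    p = np.minimum(1.0, s * A ** 2 / np.sum(A ** 2))
    keep = rng.random(A.shape) < p
    S = np.where(keep, A / np.maximum(p, 1e-12), 0.0)

    print("kept entries:", int(keep.sum()), "of", A.size)
    print("top singular values of A:", np.linalg.svd(A, compute_uv=False)[:5])
    print("top singular values of S:", np.linalg.svd(S, compute_uv=False)[:5])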

Element-wise sampling vs. row/column sampling

- Row/column sampling preserves subspace/structural properties of the matrices
- Element-wise sampling explains how adding noise and/or quantizing the elements of a matrix perturbs its singular values/vectors
- These two techniques should be complementary
- These two techniques have similar error bounds
- Element-wise sampling can be carried out in one pass
- The running time of element-wise sampling depends on the speed of iterative methods

21 / 21
