Non-Negative Matrix Factorization


Non-negative matrix factorization

• Lee & Seung (1999)


• like principal components (SVD), but data and components are
assumed to be non-negative
• Model
X ≈ WH
where X is n × p, W is n × r, H is r × p, r ≤ p.
• we assume X_{ij}, W_{ij}, H_{ij} ≥ 0.
• criterion: maximize

      L(W, H) = \sum_{i=1}^{n} \sum_{u=1}^{p} \left[ X_{iu} \log(WH)_{iu} − (WH)_{iu} \right]

This is the log-likelihood for the model X_{iu} ∼ Poisson((WH)_{iu}).
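To see why, note that the Poisson log-density of a single entry is

      \log p(X_{iu}) = X_{iu} \log(WH)_{iu} − (WH)_{iu} − \log(X_{iu}!);

summing over i and u, and dropping the \log(X_{iu}!) terms, which do not
depend on W or H, gives L(W, H).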



The following alternating algorithm (Lee & Seung 2001) converges
to a local maximum of L(W, H):

      w_{ik} ← w_{ik} \frac{\sum_{j=1}^{p} h_{kj} x_{ij}/(WH)_{ij}}{\sum_{j=1}^{p} h_{kj}}
                                                                          (1)
      h_{kj} ← h_{kj} \frac{\sum_{i=1}^{N} w_{ik} x_{ij}/(WH)_{ij}}{\sum_{i=1}^{N} w_{ik}}

This can be viewed as an instance of the MM algorithm (see text) and
of iterative proportional scaling for log-linear models.
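A minimal NumPy sketch of the updates in (1) — the function and variable
names are my own, and a small constant eps guards against division by zero:

    import numpy as np

    def nmf_poisson(X, r, n_iter=200, eps=1e-9, seed=0):
        """Lee-Seung multiplicative updates for the Poisson criterion.

        X: (n, p) non-negative data matrix; returns W (n, r), H (r, p).
        """
        rng = np.random.default_rng(seed)
        n, p = X.shape
        W = rng.uniform(0.1, 1.0, size=(n, r))
        H = rng.uniform(0.1, 1.0, size=(r, p))
        for _ in range(n_iter):
            WH = W @ H + eps
            # w_ik <- w_ik * [sum_j h_kj x_ij/(WH)_ij] / [sum_j h_kj]
            W *= (X / WH) @ H.T / (H.sum(axis=1) + eps)
            WH = W @ H + eps
            # h_kj <- h_kj * [sum_i w_ik x_ij/(WH)_ij] / [sum_i w_ik]
            H *= W.T @ (X / WH) / (W.sum(axis=0)[:, None] + eps)
        return W, H

Because each update multiplies the current value by a non-negative ratio,
W and H remain non-negative throughout; no projection step is needed.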


Example

[Figure omitted: the NMF, VQ, and PCA face decompositions of Lee &
Seung (1999); the caption follows.]

Figure 1. Non-negative matrix factorization (NMF) learns a parts-based
representation of faces, whereas vector quantization (VQ) and principal
components analysis (PCA) learn holistic representations. The three
learning methods were applied to a database of m = 2,429 facial images,
each consisting of n = 19 × 19 pixels, and constituting an n × m matrix
V. All three find approximate factorizations of the form V ≈ WH, but
with three different types of constraints on W and H, as described more
fully in the main text and methods. As shown in the 7 × 7 montages,
each method has learned a set of r = 49 basis images. Positive values are
illustrated with black pixels and negative values with red pixels. A
particular instance of a face, shown at top right, is approximately
represented by a linear superposition of basis images. The coefficients of
the linear superposition are shown next to each montage, in a 7 × 7 grid,
and the resulting superpositions are shown on the other side of the
equality sign. Unlike VQ and PCA, NMF learns to represent faces with a
set of basis images resembling parts of faces.

Big problem!

See Donoho & Stodden (2004), "When does non-negative matrix
factorization give a correct decomposition into parts?", Advances in
Neural Information Processing Systems 17.
• columns of W are not required to be orthogonal, as in principal
components
• the solution is not unique (even when X = WH holds exactly):
one can choose as columns of W any vectors lying in the gap between
the coordinate axes and the data (see the worked example below)
• this limits its utility in practice
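A tiny worked example (my own, for illustration): take two data points
(2, 1) and (1, 2) as the rows of

      X = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}.

With r = 2 there are many exact non-negative factorizations X = WH:

      X = I_2 \cdot X
        = X \cdot I_2
        = \begin{pmatrix} 3 & 1 \\ 1 & 3 \end{pmatrix}
          \cdot \tfrac{1}{8} \begin{pmatrix} 5 & 1 \\ 1 & 5 \end{pmatrix},

so the columns of W can be the coordinate axes, the data vectors
themselves, or vectors anywhere in the gap between them.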


Example

[Figure omitted: the data cloud with two alternative choices of basis
vectors, labeled W1 and W2.]

Archetypal Analysis

• This method, due to Cutler & Breiman (1994), approximates
data points by prototypes that are themselves linear
combinations of data points. In this sense it has a similar flavor
to K-means clustering.
• However, rather than approximating each data point by a
single nearby prototype, archetypal analysis approximates each
data point by a convex combination of a collection of
prototypes. The use of a convex combination forces the
prototypes to lie on the convex hull of the data cloud. In this
sense, the prototypes are "pure," or "archetypal."


Archetypal Analysis (ctd)

• The N × p data matrix X is modeled as

      X ≈ WH                                                        (2)

where W is N × r and H is r × p.
• We assume that w_{ik} ≥ 0 and \sum_{k=1}^{r} w_{ik} = 1 ∀i. Hence the N
data points (rows of X) in p-dimensional space are represented
by convex combinations of the r archetypes (rows of H).
• We also assume that

      H = BX                                                        (3)

where B is r × N, with b_{ki} ≥ 0 and \sum_{i=1}^{N} b_{ki} = 1 ∀k.
• Thus the archetypes themselves are convex combinations of the
data points.

• Using both (2) and (3), we minimize

      J(W, B) = ‖X − WH‖² = ‖X − WBX‖²                              (4)

over the weights W and B.
• This function is minimized in an alternating fashion, with each
separate minimization involving a convex optimization. The
overall problem is not convex, however, and so the algorithm
converges to a local minimum of the criterion (a small
computational sketch follows).
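One way to organize the alternation in NumPy/SciPy — a sketch under my
own naming, not Cutler & Breiman's exact implementation. The sum-to-one
constraints are enforced approximately through a large penalty row
appended to a non-negative least squares problem:

    import numpy as np
    from scipy.optimize import nnls

    def simplex_ls(A, b, lam=100.0):
        """min_v ||A v - b||^2 subject to v >= 0 and sum(v) = 1,
        approximated by NNLS with a heavy penalty row for the sum."""
        A_aug = np.vstack([A, lam * np.ones((1, A.shape[1]))])
        b_aug = np.append(b, lam)
        v, _ = nnls(A_aug, b_aug)
        return v

    def archetypes(X, r, n_iter=50, seed=0):
        """Alternating minimization of J(W, B) = ||X - W B X||^2."""
        N, p = X.shape
        rng = np.random.default_rng(seed)
        # initialize each archetype at a randomly chosen data point
        B = np.zeros((r, N))
        B[np.arange(r), rng.choice(N, size=r, replace=False)] = 1.0
        H = B @ X
        for _ in range(n_iter):
            # W-step: express each data point as a convex
            # combination of the current archetypes (rows of H)
            W = np.array([simplex_ls(H.T, x) for x in X])
            # B-step: find the best unconstrained archetypes, then
            # re-express each as a convex combination of data points
            H_free = np.linalg.lstsq(W, X, rcond=None)[0]
            B = np.array([simplex_ls(X.T, h) for h in H_free])
            H = B @ X
        return W, B, H

Each sub-problem here is a constrained least squares fit, and hence
convex, matching the description above; the joint problem in (W, B)
remains non-convex.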


The next Figure shows an example with simulated data in two
dimensions. The top panels display the results of archetypal
analysis, while the bottom panels show the results from K-means
clustering. In order to best reconstruct the data from convex
combinations of the prototypes, it pays to locate the prototypes on
the convex hull of the data. This is seen in the top panels of the
Figure and is the case in general, as proven by Cutler & Breiman
(1994). K-means clustering, shown in the bottom panels, chooses
prototypes in the middle of the data cloud.


[Figure omitted: panels showing 2, 4, and 8 prototypes; caption below.]


Archetypal analysis (top panels) and K-means clustering (bottom panels)
applied to 50 data points drawn from a bivariate Gaussian distribution.
The colored points show the positions of the prototypes in each case.


Relation to K-means clustering and NMF

• We can think of K-means clustering as a special case of the
archetypal model, in which each row of W has a single one and
the rest of the entries are zero (a small example follows this
list).
• Notice also that the archetypal model (2) has the same general
form as the non-negative matrix factorization model X ≈ WH.
However, the two models are applied in different settings, and
have somewhat different goals. Non-negative matrix
factorization aims to approximate the columns of the data
matrix X, and the main output of interest is the columns of
W, representing the primary non-negative components in the
data.
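A small illustration of the K-means special case (my own): with N = 3
points, the first two in cluster 1 and the third in cluster 2,

      W = \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \end{pmatrix},
      \qquad
      B = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 0 & 0 & 1 \end{pmatrix},

so H = BX stacks the two cluster means, and WH replaces each data
point by the mean of its cluster.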


Relation to K-means clustering and NMF (ctd)

• Archetypal analysis focuses instead on the approximation of
the rows of X using the rows of H, which represent the
archetypal data points.
• Non-negative matrix factorization also assumes that r ≤ p.
With r = p, we can get an exact reconstruction simply by
choosing W to be the data X with columns scaled so that they
sum to 1 (see the construction after this list). In contrast,
archetypal analysis requires r ≤ N, but allows r > p.
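Explicitly (a one-line construction, assuming every column sum
c_j = \sum_i X_{ij} is positive): take

      W = X \, \mathrm{diag}(c_1, …, c_p)^{-1}, \qquad
      H = \mathrm{diag}(c_1, …, c_p),

so that WH = X exactly, with W, H ≥ 0 and each column of W summing
to 1.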


The next Figure shows the results of archetypal analysis applied to
the database of 3's discussed earlier. The three rows in the Figure
are the resulting archetypes from three runs, specifying two, three
and four archetypes, respectively. As expected, the algorithm has
produced extreme 3's both in size and shape.

[Figure omitted: the archetype montages; caption below.]

Archetypal analysis applied to the database of digitized 3's. The rows in
the figure show the resulting archetypes from three runs, specifying two,
three and four archetypes, respectively.


References

Cutler, A. & Breiman, L. (1994), 'Archetypal analysis', Technometrics
36(4), 338–347.
Lee, D. D. & Seung, H. S. (1999), 'Learning the parts of objects by
non-negative matrix factorization', Nature 401, 788–791.
Lee, D. D. & Seung, H. S. (2001), 'Algorithms for non-negative matrix
factorization', in Advances in Neural Information Processing Systems
(NIPS 2001).
