6.
Figure 1. The intuitions behind latent Dirichlet allocation. We assume that some number of “topics,” which are distributions over words,
exist for the whole collection (far left). Each document is assumed to be generated as follows. First choose a distribution over the topics (the
histogram at right); then, for each word, choose a topic assignment (the colored coins) and choose the word from the corresponding topic.
The topics and topic assignments in this figure are illustrative—they are not fit from real data. See Figure 2 for topics fit from data.
[Figure 1 contents (panels: "Topics", "Documents", "Topic proportions and assignments"). Example topics and their most probable words: gene 0.04, dna 0.02, genetic 0.01, …; life 0.02, evolve 0.01, organism 0.01, …; brain 0.04, neuron 0.02, nerve 0.01, …; data 0.02, number 0.02, computer 0.01, …. The document's words carry color-coded topic assignments, summarized in the histogram of topic proportions.]
(Blei, CACM, 2012)
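As a rough sketch of the generative process described in the caption above, the following toy code samples a document word by word: first a distribution over the topics, then a per-word topic assignment, then the word itself. The topics, vocabulary, and probabilities are invented for illustration (they are not the ones in the figure or fit from data); only NumPy is assumed.

```python
# Toy sketch of the LDA generative process: topics are fixed distributions
# over words; each document draws its own topic proportions, then each word
# draws a topic assignment and a word from that topic.
import numpy as np

rng = np.random.default_rng(0)

vocab = ["gene", "dna", "genetic", "life", "evolve", "organism",
         "brain", "neuron", "nerve", "data", "number", "computer"]

# Four invented topics, each a distribution over the whole vocabulary.
topics = np.array([
    [0.30, 0.25, 0.20, 0.05, 0.05, 0.05, 0.02, 0.02, 0.02, 0.02, 0.01, 0.01],  # "genetics"
    [0.02, 0.02, 0.02, 0.30, 0.25, 0.20, 0.05, 0.05, 0.05, 0.02, 0.01, 0.01],  # "evolution"
    [0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.30, 0.25, 0.20, 0.05, 0.05, 0.03],  # "neuroscience"
    [0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.30, 0.25, 0.27],  # "data analysis"
])

def generate_document(n_words=20, alpha=0.5):
    # 1. Choose the document's distribution over topics (the histogram).
    theta = rng.dirichlet(alpha * np.ones(len(topics)))
    words, assignments = [], []
    for _ in range(n_words):
        # 2a. Choose a topic assignment (the colored coin).
        z = rng.choice(len(topics), p=theta)
        # 2b. Choose the word from that topic's distribution.
        w = rng.choice(len(vocab), p=topics[z])
        assignments.append(z)
        words.append(vocab[w])
    return theta, assignments, words

theta, z, doc = generate_document()
print(np.round(theta, 2))   # topic proportions of this document
print(list(zip(doc, z)))    # each word with its topic assignment
```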
7.
(Blei, CACM, 2012)
Excerpts from the article text shown on this slide (Blei, 2012):
… evolutionary biology, and each word is drawn from one of those three topics. Notice that the next article in the collection might be about data analysis and neuroscience; its distribution over topics would place probability on those two topics. This is the distinguishing characteristic of …
… algorithm assumed that there were 100 topics.) We then computed the inferred topic distribution for the example article (Figure 2, left), the distribution over topics that best describes its particular collection of words. Notice that this topic distribution, though it can use any of the topics, has only "acti-" …
… about subjects like genetics and data analysis are replaced by topics about discrimination and contract law. The utility of topic models stems from the property that the inferred hidden structure resembles the thematic structure of the collection. This interpretable hidden structure annotates …
Figure 2. Real inference with LDA. We fit a 100-topic LDA model to 17,000 articles from the journal Science. At left are the inferred
topic proportions for the example article in Figure 1. At right are the top 15 most frequent words from the most frequent topics found
in this article.
[Figure 2 contents. Left: bar plot of the inferred topic proportions for the example article (Probability, 0.0 to 0.4, over Topics 1 to 100). Right: top 15 words of four topics.
“Genetics”: human, genome, dna, genetic, genes, sequence, gene, molecular, sequencing, map, information, genetics, mapping, project, sequences.
“Evolution”: evolution, evolutionary, species, organisms, life, origin, biology, groups, phylogenetic, living, diversity, group, new, two, common.
“Disease”: disease, host, bacteria, diseases, resistance, bacterial, new, strains, control, infectious, malaria, parasite, parasites, united, tuberculosis.
“Computers”: computer, models, information, data, computers, system, network, systems, model, parallel, methods, networks, software, new, simulations.]
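A minimal sketch of the kind of inference shown in Figure 2, under the assumption that scikit-learn is available: fit an LDA model to a corpus, then read off each topic's top words and a document's inferred topic proportions. The tiny `docs` list below is a stand-in only; the 17,000 Science articles are not included here.

```python
# Fit a small topic model and inspect (a) top words per topic and
# (b) per-document topic proportions, mirroring the two panels of Figure 2.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "gene dna genetic sequence genome",
    "evolution species organisms phylogenetic diversity",
    "computer data network software models",
    "disease bacteria infection host resistance",
    # ... a real corpus would contain thousands of documents
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)            # document-word count matrix

lda = LatentDirichletAllocation(n_components=4, random_state=0)
doc_topics = lda.fit_transform(X)             # per-document topic proportions

vocab = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[::-1][:5]           # indices of the top words
    print(f"topic {k}:", ", ".join(vocab[i] for i in top))

print("topic proportions of doc 0:", doc_topics[0].round(2))
```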
16.
[Lee & Seung (1999), Figure 1: non-negative matrix factorization (NMF) learns a parts-based representation of faces, whereas vector quantization (VQ) and principal component analysis (PCA) learn holistic representations. Each method factorizes the n × m matrix V of m = 2,429 facial images into r = 49 basis images (W) and encodings (H), V ≈ W × H; negative values are shown with red pixels, and each face is reconstructed as a superposition of the basis images.]
Non-negative matrix factorization (NMF)
Principal component analysis (PCA)
(Lee & Seung, 1999)
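A minimal sketch of the factorization compared in the figure, assuming scikit-learn: approximate a nonnegative matrix V as V ≈ WH with NMF, and contrast it with PCA on the same data. The data here is random and merely stands in for the m = 2,429 face images; r = 49 echoes the number of basis images in the figure.

```python
# Factorize a nonnegative matrix V (one image per column in the original
# figure) as V ≈ W H with W, H >= 0, and compare with an unconstrained
# PCA reconstruction of the same data.
import numpy as np
from sklearn.decomposition import NMF, PCA

rng = np.random.default_rng(0)
n, m, r = 361, 500, 49                 # pixels per image, images, basis images
V = rng.random((n, m))                 # random stand-in for the n x m face matrix

nmf = NMF(n_components=r, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(V)               # n x r: columns are nonnegative basis images
H = nmf.components_                    # r x m: columns are per-image encodings
print("NMF reconstruction error:", np.linalg.norm(V - W @ H))

# PCA allows negative values, so its basis vectors are holistic rather
# than parts-based.
pca = PCA(n_components=r)
scores = pca.fit_transform(V.T)        # treat each image (column of V) as a sample
V_pca = (scores @ pca.components_ + pca.mean_).T
print("PCA reconstruction error:", np.linalg.norm(V - V_pca))
```

For the multiplicative update rules themselves, see Lee & Seung (2000) in the references; scikit-learn's solver="mu" option implements updates of that kind.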
27. References
• Strang, Introduction to Linear Algebra, Wellesley-Cambridge Press, 2009.
– Japanese translation: 『線形代数イントロダクション』, Kindai Kagaku Sha, 2015.
• Lee & Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, 1999.
• Lee & Seung, “Algorithms for non-negative matrix factorization,” Adv. Neural Inf. Process. Syst. (NIPS), 2000.
• Morioka, Kanemura, et al., “Learning a common dictionary for subject-transfer decoding with resting calibration,” NeuroImage, 2015.
• Blei, “Probabilistic topic models,” Commun. ACM, 2012.
• Barnard et al., “Matching words and pictures,” J. Mach. Learn. Res., 2003.
• Zhang et al., “Learning from incomplete ratings using non-negative matrix factorization,” SIAM Conf. Data Mining (SDM), 2006.