2. please go to http://ADDRESS
and enter a sentence
interesting relationships?
gensim generated the data for those visualizations
by computing the semantic similarity of the input
3. who am I?
William Bert
developer at Carney Labs (teamcarney.com)
user of gensim
still new to the world of topic modelling, semantic similarity, etc
4. gensim: “topic modeling for humans”
topic modeling attempts to uncover the underlying semantic structure of text by identifying recurring patterns of terms in a set of data (topics).
topic modelling
does not parse sentences,
does not care about word order, and
does not "understand" grammar or syntax.
6. gensim isn't about topic modeling
(for me, anyway)
It's about similarity.
What is similarity?
Some types:
• String matching
• Stylometry
• Term frequency
• Semantic (meaning)
7. Is
"A seven-year quest to collect samples from the solar system's formation ended in triumph in a dark and wet Utah desert this weekend."
similar in meaning to
"For a month, a huge storm with massive lightning has been raging on Jupiter under the watchful eye of an orbiting spacecraft."
more or less than it is similar to
"One of Saturn's moons is spewing a giant plume of water vapour that is feeding the planet's rings, scientists say."
?
8. Who cares about semantic similarity?
Some use cases:
• Query large collections of text
• Automatic metadata
• Recommendations
• Better human-computer interaction
9. gensim.corpora
TextCorpus and other kinds of corpus classes
>>> corpus = TextCorpus(file_like_object)
>>> [doc for doc in corpus]
[[(40, 1), (6, 1), (78, 2)], [(39, 1), (58, 1),...]
a corpus = a stream of sparse vectors of document feature ids
for example, words in documents are features ("bag of words")
10. gensim.corpora
TextCorpus and other kinds of corpus classes
>>> corpus = TextCorpus(file_like_object)
>>> [doc for doc in corpus]
[[(40, 1), (6, 1), (78, 2)], [(39, 1), (58, 1),...]
Dictionary class
>>> print corpus.dictionary
Dictionary(8472 unique tokens)
dictionary maps features (words) to feature ids
(numbers)
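To make the corpus and dictionary ideas concrete, here is a minimal sketch of my own (not from the slides) that builds a Dictionary and a bag-of-words corpus from two tiny hand-made documents; the token lists are invented purely for illustration:

from gensim.corpora import Dictionary

# two toy documents, already tokenized (hypothetical example data)
docs = [["saturn", "rings", "water"], ["jupiter", "storm", "lightning"]]

dictionary = Dictionary(docs)             # maps each token to an integer feature id
print(dictionary.token2id)                # e.g. {'rings': 0, 'saturn': 1, 'water': 2, ...}

corpus = [dictionary.doc2bow(doc) for doc in docs]
print(corpus)                             # each document becomes a list of (feature id, count) pairs

Iterating over such a corpus yields exactly the kind of sparse (feature id, count) vectors shown on the slide above.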
11. need massive collection of documents that
ostensibly has meaning
sounds like a job for wikipedia
>>> wiki_corpus = WikiCorpus(articles)  # articles is a Wikipedia text dump bz2 file. several hours.
>>> wiki_corpus.dictionary.save("wiki_dict.dict")  # persist dictionary
>>> MmCorpus.serialize("wiki_corpus.mm", wiki_corpus)  # uses numpy to persist corpus in Matrix Market format. several GBs. can be BZ2'ed.
>>> wiki_corpus = MmCorpus("wiki_corpus.mm")  # revive a corpus
12. gensim.models
transform corpora using model classes
for example, the term frequency/inverse document frequency (TFIDF) transformation
reflects the importance of a term, not just presence/absence
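For reference, the TFIDF weighting the slides refer to can be written as follows (this formula is my addition, not part of the deck; gensim's TfidfModel defaults are close to it, using a base-2 logarithm and L2 normalization of each document vector):

w_{t,d} = \mathrm{tf}_{t,d} \cdot \log_2\!\left(\frac{N}{\mathrm{df}_t}\right)

where tf_{t,d} is the count of term t in document d, df_t is the number of documents containing t, and N is the total number of documents in the corpus.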
13. gensim.models
>>> tfidf_trans = models.TfidfModel(wiki_corpus, id2word=dictionary)  # TFIDF computes frequencies of all document features in the corpus. several hours.
TfidfModel(num_docs=3430645, num_nnz=547534266)
>>> tfidf_trans[documents]  # emits documents in TFIDF representation. documents must be in the same BOW vector space as wiki_corpus.
[[(40, 0.23), (6, 0.12), (78, 0.65)], [(39, ...]
>>> tfidf_corpus = MmCorpus(corpus=tfidf_trans[wiki_corpus], id2word=dictionary)  # builds new corpus by iterating over documents transformed to TFIDF
16. topics again for a bit
• SVD decomposes a matrix into three simpler matrices
• full rank SVD would be able to recreate the underlying matrix exactly from those three matrices
• lower-rank SVD provides the best (least square error) approximation of the matrix
• this approximation can find interesting relationships among data
• it preserves most information while reducing noise and merging dimensions associated with terms that have similar meanings
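A small numpy sketch of the low-rank idea (my addition, not part of the original deck): decompose a toy term-document matrix and rebuild it keeping only the top k singular values, which gives the least-squares-best rank-k approximation the bullets describe. The matrix values are made up.

import numpy as np

# toy term-document matrix: rows are terms, columns are documents (invented numbers)
A = np.array([
    [2.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 2.0, 1.0],
])

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A == U @ diag(s) @ Vt

k = 2                                              # keep only the k largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # best rank-k approximation (least square error)
print(np.round(A_k, 2))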
17. topics again for a bit
• SVD:
alias-i.com/lingpipe/demos/tutorial/svd/read-me.html
•Original paper:
www.cob.unt.edu/itds/faculty/evangelopoulos/dsci5910/LSA
_Deerwester1990.pdf
• General explanation:
tottdp.googlecode.com/files/LandauerFoltz-Laham1998.pdf
• Many more
18. gensim.models
>>> lsi_trans = models.LsiModel(corpus=tfidf_corpus, id2word=dictionary, num_topics=400, decay=1.0, chunksize=20000)  # creates LSI transformation model from the TFIDF corpus representation
>>> print lsi_trans
LsiModel(num_terms=100000, num_topics=400, decay=1.0, chunksize=20000)
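Once the model is built, topics like the ones shown earlier can be inspected; a brief sketch of mine using LsiModel's show_topics (the variable name lsi_trans follows the slide above):

# each topic is a weighted combination of terms, with positive and negative factors
print(lsi_trans.show_topics(num_topics=3, num_words=10))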
19. gensim.similarities
(the best part)
>>> index = Similarity(corpus=lsi_trans[tfidf_trans[index_corpus]], num_features=400, output_prefix="/tmp/shard")
>>> index[lsi_trans[tfidf_trans[dictionary.doc2bow(tokenize(query))]]]  # similarity of each document in the index corpus to a new query document
>>> [s for s in index]  # a matrix of each document's similarities to all other documents
[array([ 1.  ,  0.  ,  0.08,  0.01]),
 array([ 0.  ,  1.  ,  0.02, -0.02]),
 array([ 0.08,  0.02,  1.  ,  0.15]),
 array([ 0.01, -0.02,  0.15,  1.  ])]
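Putting the whole pipeline together on a toy scale, here is a self-contained sketch of mine (not from the deck) that indexes three short documents, loosely based on the excerpts from slide 7, and scores a new query against them. It uses MatrixSimilarity, which keeps the index in memory, instead of the sharded Similarity class shown above; all names and data are illustrative.

from gensim.corpora import Dictionary
from gensim.models import TfidfModel, LsiModel
from gensim.similarities import MatrixSimilarity

docs = [
    "a seven year quest to collect samples from the solar system's formation",
    "a huge storm with massive lightning has been raging on jupiter",
    "one of saturn's moons is spewing a giant plume of water vapour",
]
tokenized = [d.lower().split() for d in docs]

dictionary = Dictionary(tokenized)
bow_corpus = [dictionary.doc2bow(toks) for toks in tokenized]

tfidf = TfidfModel(bow_corpus, id2word=dictionary)
lsi = LsiModel(tfidf[bow_corpus], id2word=dictionary, num_topics=2)
index = MatrixSimilarity(lsi[tfidf[bow_corpus]], num_features=2)

query = dictionary.doc2bow("giant storm on jupiter".lower().split())
print(index[lsi[tfidf[query]]])   # cosine similarity of the query to each indexed document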
20. about gensim
four additional models available
dependencies: numpy, scipy
optional: Pyro, Pattern
created by Radim Rehurek
•radimrehurek.com/gensim
•github.com/piskvorky/gensim
•groups.google.com/group/gensim
21. thank you
example code, visualization code, and ppt:
github.com/sandinmyjoints
interview with Radim:
williamjohnbert.com
23. gensim.models
• term frequency/inverse document frequency
(TFIDF)
• log entropy
• random projections
• latent Dirichlet allocation (LDA)
• hierarchical Dirichlet process (HDP)
• latent semantic analysis/indexing (LSA/LSI)
24. slightly more about gensim
Dependencies: numpy and scipy, and optionally Pyro for distributed computation and Pattern for lemmatization
data from Lee 2005 and other papers is available in gensim for tests
#3: Hi everyone, thanks for coming. I'm going to start off with a quick demo app. Please go to the address you see up there. Hookbox sometimes takes several seconds to connect to the channel, so give it some time if it's red. It will turn green. You'll be invited to submit a sentence, particularly a statement or fact that has a mixture of nouns, verbs, and adjectives. There are some examples of the kinds of sentence that might work well with this demo, but take a moment to think of a sentence of your own and go ahead and submit. We should see them pop up here on the visualization screen. The idea is to provide a bit of a concrete grounding for the talk, and this will serve as an example of one thing you can do with gensim, or at least the data generated by it. What do we see? We have a table comparing a number of submitted sentences with what I'm going to call similarity scores between them. The darker the green, the higher the score. Are there any interesting results? Here we also have some clustering visualizations that attempt to group the inputs that were found to have the highest scores together. How do they cluster together? [click] Hopefully that worked and showed some interesting relationships among the input. (If not, well, I blame the input.) gensim, which I'll be talking about today, was generating all the underlying similarity scores, measuring how similar each sentence was to the other ones. I'm going to explain how to get results like this from gensim.
#4: A few quick words about me: William Bert, developer at Carney Labs for about seven months. Carney Labs is basically a startup wholly owned by a larger company in Alexandria called Team Carney. I use gensim at work developing a conversational tutoring web app. Topic modelling is still pretty new to me and I'm constantly learning more about it, so my knowledge is still growing, but I'm really fascinated by it and trying to learn more by working with it a lot, and by doing things like this presentation.
#5: gensim is a free Python framework for doing topic modelling. I'm going to blaze through a quick overview of topic modelling, then discuss how gensim uses it to do semantic similarity, generating data like what we saw. Topic modelling attempts to uncover the underlying semantic structure of text (or other data) by using statistical techniques to identify abstract, recurring patterns of terms in a set of data. These patterns are called topics. They may or may not correspond to our intuitive notion of a topic. Topic modelling models documents as collections of features, representing the documents as long vectors that indicate the presence/absence of important features, for example, the presence or absence of words in a document. We can use those vectors to create spaces and plot the locations of documents in those spaces and use that as a kind of proxy for their meaning. What isn't topic modelling? Topic modelling does not parse sentences; in fact, it knows nothing about word order and makes no attempt to "understand" grammar or language syntax. What does a topic look like?
#6: Let's take a quick look at some topics now, and we'll come back to them again after I walk through how to generate them. [click] These three abbreviated topics were extracted from a large corpus of texts by gensim using a technique called latent semantic analysis (LSA). Just quickly note how they are collections of words that don't necessarily/intuitively seem to belong together. There are also positive and negative scalar factors for each word, which get smaller in magnitude as the topic goes on. We don't see it here, but each of these topics actually has thousands more terms; these are just the first ten. So that's what a topic looks like when I talk about topics, but the truth is...
#7: gensim isn't really about topic modeling, for me anyway. [click] It's really about similarity. Topics are a means to an end. [click] A few words about similarity, because it can be elusive. [click] There are different kinds of similarity. String matching: how many characters strings have in common. Stylometry: the similarity of style, which looks at, say, length of words or sentences, use of function words, ratio of nouns to verbs, etc.; used to identify authors, for example. Term frequency: do the documents use the same words the same number of times (when scaled and normalized)? The kind of similarity I'm interested in is semantic similarity, similarity of meanings. But what is that?
#8: Take a moment to read these three sentences. They might be said to share certain elements: non-earth planets, weather, duration, research and data collection. How would you quantify their similarity? How would you decide that two are more similar to each other than to the third? A study done in Australia in 2005 skipped over the question of defining semantic similarity formally and abstractly and instead defined it as what a sample of Australian college students think is similar. They had students read hundreds of paired short excerpts from news articles (and these sentences are excerpted from some of those) and rank the pairwise similarity on a scale. They then examined all the classifications and found that they had a correlation of 0.6. That's obviously a positive correlation, but not terribly high. So, humans don't necessarily agree with each other about semantic similarity; it's kind of a fuzzy notion. That said, we're going to try to put a number on it. In fact, the study I mentioned found that a particular topic modelling technique called latent semantic analysis (LSA) could also achieve a 0.6 correlation with the human ratings, correlating with the study participants' choices about as well as they correlated with each other.
#9: Why do we care about semantic similarity? Some use cases for document similarity comparison:
- Traditionally used on large document collections: legal discovery; answering questions or aiding search over huge corpora like government regulations, manuals, patent databases, etc.
- Automatic metadata: the system can intelligently suggest tags and categories for documents based on other documents they're similar to.
- Something that came up recently on the gensim Google group: in a CMS, when a user creates a new post, they want to see posts that may be similar in content. We can do that with semantic similarity, and in fact someone actually made this into a plugin for Plone using gensim.
- Recommendations, plagiarism detection, exam scoring. And there are a number of other use cases.
There are some new and fast online algorithms that work in real time, whereas previously work was often done in batches. This brings up another potential use:
- Better HCI. Matching on similarity rather than on words or with regexes allows us to accept broader ranges of input, in theory.
So to make it happen, enter... gensim.
#10: To get our topics that we can then use to compute semantic similarity, we're going to start by turning a large set of documents, which we'll call a training/background corpus, into numeric vectors. We'll use the tools in gensim's corpora package. When I say document, a document can be as short as one word, or as long as many pages of text, or anywhere in between. My examples and the demo app mostly use sentence-size documents. In gensim, a corpus is an iterable that returns its documents as sparse vectors. (A sparse vector is just a compact way of storing large vectors that are mostly zeroes.) A corpus can be made from a file, a database query, a network stream, etc., as long as you can stream the documents and emit vectors. gensim iterates over the documents in a corpus with generators, so it uses constant memory, which means you can have enormous corpora and indexes. How you generate those vectors from the documents is up to you. (So a corpus isn't inherently tied to words or even language; it could be constructed from features of anything, such as music or video, if you can figure out what to use as features [like amplitude or frequencies?] and how to extract them.) If your features are the presence or absence of words, your corpus is in what's called "bag of words" (BOW) format. gensim provides a convenience class called TextCorpus for creating such a corpus from a text file. So here we have a list of documents, where each document is a list of (feature id, count) tuples. Feature #40 appears one time in document #0, etc. For BOW, we also need a dictionary...
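Not on the slide, but to make the streaming idea concrete, here is a minimal sketch of a BOW corpus that reads one document per line from a file. The filename, tokenization, and class name are placeholders I've made up for illustration, not part of the talk:

    from gensim import corpora

    class LineCorpus(object):
        """Streams one document per line, yielding sparse BOW vectors."""
        def __init__(self, path, dictionary):
            self.path = path
            self.dictionary = dictionary

        def __iter__(self):
            with open(self.path) as f:
                for line in f:
                    # naive lowercase whitespace tokenization; a real app would preprocess more
                    yield self.dictionary.doc2bow(line.lower().split())

    # build the dictionary in one streaming pass, then iterate the corpus in constant memory
    dictionary = corpora.Dictionary(line.lower().split() for line in open('documents.txt'))
    bow_corpus = LineCorpus('documents.txt', dictionary)

Because the vectors are produced lazily inside __iter__, the whole corpus never has to fit in RAM at once, which is the property the talk keeps relying on.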
#11: The Dictionary maps feature ids back to features (words). The corpus class will generate this for me. So the vectors indicate the presence of words in particular documents, and the resulting matrix containing these vectors represents all the words appearing in all the documents.
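As a rough illustration of the id-to-word mapping (the ids and words below are made up, not the actual dictionary):

    print(dictionary[40])                    # e.g. 'weather'
    print(dictionary.token2id['weather'])    # e.g. 40

    # map a new document into the same id space
    print(dictionary.doc2bow("the weather on mars".split()))
    # e.g. [(12, 1), (40, 1)]  -- words not in the dictionary are silently dropped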
#12: To do interesting and useful things with semantic similarity, we need a good training or background corpus. Finding a good training corpus is something of an art. You want a large collection of documents (at least tens of thousands) that are representative of your problem domain. It can be difficult to find or build such a corpus. [click] Or, you can just use Wikipedia. Helpfully for experimenting, gensim comes with a WikiCorpus class and other code for building a corpus from a Wikipedia article dump. [click] WikiCorpus makes two passes, one to extract the dictionary, and another to create and store the sparse vectors. It takes about 10 hours on an i7 to generate and serialize the corpus and dictionary, though it uses constant memory. The resulting output vectors are about 15 GB uncompressed, about 5 GB compressed. So after these operations, wiki_corpus is now a BOW vector space representation of Wikipedia, embodied in a large corpus file in the Matrix Market format (a popular matrix file format) and a several-megabyte dictionary mapping ids to tokens (words). What can we do with our corpus?
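A minimal sketch of that step, assuming you've already downloaded a Wikipedia dump (the filenames below are placeholders):

    from gensim.corpora import WikiCorpus, MmCorpus

    # first pass: parse the dump and build the dictionary
    wiki_corpus = WikiCorpus('enwiki-latest-pages-articles.xml.bz2')
    dictionary = wiki_corpus.dictionary

    # second pass: stream the BOW vectors to disk in Matrix Market format
    MmCorpus.serialize('wiki_bow.mm', wiki_corpus)
    dictionary.save('wiki.dict')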
#13: We can transform corpora from one vector space to another using models. Transformations can bring out hidden structure in the corpus, such as revealing relationships between words and documents. They can also represent the corpus in a more compact way, preserving much of the information while consuming fewer resources. A gensim 'transformation' is any object which accepts a sparse document via dictionary notation and returns another sparse document. One useful transformation that we can generate from our BOW corpus is term frequency/inverse document frequency (TFIDF). Instead of a count of word appearances in a document, we get a score for each word that also takes into account the global frequency of that word. So a word's TFIDF value in a given document increases proportionally to the number of times the word appears in that particular document, but is offset by the frequency of the word in the entire corpus, which helps to control for the fact that some words are generally more common than others.
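The core idea in one simplified form (a common textbook formulation; gensim's default weighting and normalization may differ in details such as the log base and vector normalization):

    from math import log

    def tfidf_weight(term_count, doc_freq, num_docs):
        """term_count: occurrences of the term in this document
        doc_freq: number of documents containing the term
        num_docs: total number of documents in the corpus"""
        return term_count * log(num_docs / doc_freq)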
#14: Transformations are initialized with a training corpus, so we realize a TFIDF transformation from the corpus we just generated, wiki_corpus. This also takes several hours to generate for Wikipedia. num_docs is the number of documents in the dictionary; num_nnz is the number of non-zeroes in the matrix. [click] Once our model is generated, we can transform documents represented in one vector space model (the wiki_corpus BOW space) and emit them in another (the wiki_corpus TFIDF space) as (word_id, word_weight) tuples, where the weight is a positive, normalized float. These documents can be anything, new and unseen, as long as they have been tokenized and put into the BOW representation using the same tokenizer and dictionary word->id mappings that were used for the wiki corpus. [click] We can emit these new representations right into a fresh MmCorpus, which could also be serialized and persisted on disk (also requiring several GBs). However,
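Roughly, that step looks like this. This is a sketch reusing the wiki_corpus and dictionary names from above; 'wiki_tfidf.mm' is a placeholder filename:

    from gensim.models import TfidfModel
    from gensim.corpora import MmCorpus

    tfidf = TfidfModel(wiki_corpus)        # learn document frequencies from the BOW corpus

    # wrap the whole corpus in the transformation (lazy; evaluated as you iterate) ...
    tfidf_corpus = tfidf[wiki_corpus]

    # ... or transform a single new BOW document
    tfidf_doc = tfidf[dictionary.doc2bow("life on mars".split())]

    # optionally persist the transformed corpus
    MmCorpus.serialize('wiki_tfidf.mm', tfidf_corpus)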
#15: The TFIDF corpus itself is not all that interesting except as a stepping stone to another model called LSI. Latent semantic indexing/analysis (LSI/LSA) is the granddaddy of topic modelling similarity techniques; the original paper is from 1990. It produced the results we saw in the visualization. We can generate an LSI model almost the same way we did the TFIDF model, but we do need to provide an extra parameter called num_features, which brings us back to topics...
#16: num_features is a parameter to LSI telling it how many topics to make. Here again are the topics we saw. These were generated by LSI from Wikipedia articles. What are these topics? It's hard to say, exactly. The "themes" are unclear. But they are in some sense the corpus's "principal components" (and in fact, principal component analysis is similar, if you know what that is). Here's a brief rundown of how LSI works to calculate these topics...
#17: LSI uses a technique called singular value decomposition (SVD) to reduce the number of dimensions of the original term/document matrix while keeping the most information for a given number of topics. I understand the technique conceptually, but I'm not going to try to get into the math behind it because I don't really understand it well enough to explain it, and there are plenty of resources online that explain it in great and accurate detail. Nonetheless, I'll at least describe it briefly: SVD decomposes the term/document matrix into three simpler matrices. Full-rank SVD will recreate the underlying matrix exactly, but LSA uses lower-order SVD, which provides the best (in the sense of least square error) approximation of the matrix at lower dimensions. By lowering the rank, dimensions associated with terms that have similar meanings are merged together. This preserves the most important semantic information in the text while reducing noise, and can uncover interesting relationships among the data of the underlying matrix.

Still, the meaning of the terms and topics is not really apparent to us. This is because a single LSI topic is not about a single thing; the topics work as a set. They contain both positive and negative values, which cancel each other out delicately when generating vectors for documents. This is one of the reasons LSI topics are hard to interpret.

---
The original matrix can be too large for the computing resources; in this case, the approximated low-rank matrix is interpreted as an approximation (a "least and necessary evil").

The original matrix can be noisy: for example, anecdotal instances of terms are to be eliminated. From this point of view, the approximated matrix is interpreted as a de-noisified matrix (a better matrix than the original).

The original term-document matrix is presumed overly sparse relative to the "true" term-document matrix. That is, the original matrix lists only the words actually in each document, whereas we might be interested in all words related to each document (generally a much larger set, due to synonymy).

The consequence of the rank lowering is that some dimensions are combined and depend on more than one term:
{(car), (truck), (flower)} --> {(1.3452 * car + 0.2828 * truck), (flower)}
This mitigates the problem of identifying synonymy, as the rank lowering is expected to merge the dimensions associated with terms that have similar meanings. It also mitigates the problem with polysemy, since components of polysemous words that point in the "right" direction are added to the components of words that share a similar meaning. Conversely, components that point in other directions tend to either simply cancel out, or, at worst, to be smaller than components in the directions corresponding to the intended sense.
---

A is an m x n matrix, and SVD factors it as A = U * S * V^T, where U is an m x k matrix, V is an n x k matrix, S is a k x k diagonal matrix, and k is the rank of A.
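Not part of the talk, but here is a tiny numpy illustration of the low-rank idea, using a plain dense SVD on a toy matrix (gensim's actual LSI implementation is an incremental, streamed algorithm, not this):

    import numpy as np

    A = np.random.rand(6, 5)                   # toy term/document matrix (6 terms x 5 documents)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    k = 2                                      # keep only the top-k "topics"
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

    # A_k is the best rank-k approximation of A in the least-squares sense
    print(np.linalg.norm(A - A_k))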
#18: As I said, there are plenty of resources online to explain it, so for now I will direct your questions there.
#19: So we generate our somewhat mysterious LSI model, asking for 400 topics. Interesting note: no one has really figured out how to determine the best number of topics for a given corpus for LSI, but experimentally people have found good results between 200 and 500. This will also take several hours. LSI model generation can be distributed to multiple CPUs/machines through a library called Python Remote Objects (Pyro), leading to faster model generation times. When it's done, we have an LsiModel with 100,000 terms, which is the size of the dictionary we created. The decay parameter gives more emphasis to new documents if any are added to the model after initial generation. Because the SVD algorithm is incremental, the memory load is constant and can be controlled by a chunksize parameter that says how many documents are to be loaded into RAM at once. Larger chunks speed things up, but also require more RAM.
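A sketch of that call, again reusing names from earlier. Note that in recent gensim versions the number-of-topics parameter is called num_topics; the chunksize and decay values here are illustrative, not the talk's exact settings:

    from gensim.models import LsiModel

    lsi = LsiModel(
        corpus=tfidf_corpus,      # the TFIDF-transformed wiki corpus
        id2word=dictionary,
        num_topics=400,           # how many latent dimensions ("topics") to keep
        chunksize=20000,          # documents held in RAM per training chunk
        decay=1.0,                # weighting of old vs. newly added documents
    )

    # inspect a few of the resulting topics
    for topic in lsi.print_topics(3):
        print(topic)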
#20: Now we get to the best part. With our LSI transformation, we can now use the classes in gensim.similarities to create an index of all the documents that we want to compare subsequent queries against. The Similarity class uses fixed memory by splitting the index across shards on disk and mmap'ing them in as necessary; output_prefix is used for the filenames of the shards. What is index_corpus? It could be my original training/universe corpus, Wikipedia. Then the index would tell me which Wikipedia document any new query is most similar to. But the index corpus could also be a set of entirely different documents, for example arbitrary sentences typed in by a group of Python programmers, and the index will determine which of those documents my query is most similar to. You can even add new documents to an index in realtime. [click] So to calculate the similarity of a query, we tokenize and preprocess the query the same way we treated the wiki corpus, then convert it to BOW, then do the TFIDF transform, then the LSI transform, give that to the index, and we'll get a list of similarity scores between the query and each document in the index. [click] You can also calculate the similarity scores between all documents in the index and get back a 2-dimensional matrix, which is what the visualization app was doing every time a new document was added to the index. This is what makes realtime similarity comparisons possible, for some value of similar.
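Put together, indexing and querying look roughly like this. This is a sketch: index_docs, the shard prefix, and the query text are placeholders, and the pipeline reuses dictionary, tfidf, and lsi from the earlier sketches:

    from gensim.similarities import Similarity

    # project whatever documents we want to search over into LSI space and index them
    bow_docs = [dictionary.doc2bow(doc.lower().split()) for doc in index_docs]
    index_corpus = lsi[tfidf[bow_docs]]
    index = Similarity('/tmp/lsi_shards', index_corpus, num_features=400)

    # query: same pipeline -- tokenize -> BOW -> TFIDF -> LSI -> index
    query = "weather patterns on other planets"
    query_lsi = lsi[tfidf[dictionary.doc2bow(query.lower().split())]]
    sims = index[query_lsi]                  # one cosine similarity score per indexed document

    # similarity of every indexed document against the whole index (the 2-D matrix idea)
    all_sims = [row for row in index]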
#21: A few more things about gensim before I wrap up: TFIDF and LSI are only two of the six models it implements. There are a couple more weighting models and a couple more dimensionality reduction models, each with different properties, but I haven't had a chance to work with those very much. [click] [click] gensim's dependencies are numpy and scipy, and optionally Pyro for distributed model generation and Pattern for additional input processing. [click] To give credit where credit's due, I want to say a few words about where gensim comes from. It was created by a Czech guy named Radim Rehurek, and his work to make the algorithms scalable and online contributed to his PhD thesis. He is an active developer and is very helpful on the mailing list. He's working hard to build a community around gensim and make it into a robust open source project (LGPL license). I asked him a few questions about his work on gensim; the questions and the answers are up on my personal site, williamjohnbert.com. Radim says: "Gensim has no ambition to become an all-encompassing production level tool, with robust failure handling and error recoveries." But in my experience it has performed well, and Radim mentioned several commercial applications that are using it, in addition to universities.
#22: Thanks for listening. This presentation, some sample code, and the demo app are available on my github page, github.com/sandinmyjoints. I should note that the demo web app and the visualization are actually not part of gensim. gensim generated the data, but the app is Flask and hookbox, the clustering is scipy and scikit-learn, and the visualization is d3. Questions?
#24: In addition to TFIDF, gensim implements several VSM algorithms, most of which I know nothing about, but to do justice to gensim's capabilities:
- TFIDF: weights tokens according to importance (local vs. global frequency). Preserves dimensionality.
- Log Entropy: another term weighting function that uses log entropy normalization. Preserves dimensionality.
- Random Projections: approximates TFIDF distances but is less computationally expensive. Reduces dimensionality.
- Latent Dirichlet Allocation (LDA): a generative model that produces more human-readable topics. Reduces dimensionality.
- Hierarchical Dirichlet Process (HDP): very new, first described in a paper from 2006, but not all operations are fully implemented in gensim yet.
- Latent semantic indexing/analysis (LSI/LSA): granddaddy of topic modelling similarity techniques. Reduces dimensionality. Original paper is Deerwester et al. 1990. I have used it most.
#25: Dependencies: numpy and scipy, and optionally Pyro for distributed model generation and Pattern for lemmatization. Data from Lee 2005 and other papers is available in gensim for tests.
#26: These recurring patterns called topics may or may not correspond to our intuitive notion of a topic. The abbreviated ones printed here were extracted from a large corpus of texts by gensim using a technique called latent Dirichlet allocation (LDA), which actually does tend to produce human-readable topics (but not all the techniques do that, and latent semantic analysis, which is what the demo app used and what we'll be looking at soon, does not). These have some themes: they appear to be "about" something, with the terms having a decreasing weighting.