Topic Extraction using Machine Learning
Sanjib Basak
Director of Data Science, Digital River
Jan 2016
Twin Cities Big Data Analytics and Apache Spark User Group meetup
Agenda
• History of Topic Models
• A Use Case
• Demo using R
• Demo using Spark
• Conclusion
History of Topic Modeling
• TF-IDF model (Salton and McGill, 1983)
• A basic vocabulary of “words” or “terms” is chosen, and, for each document in the corpus, a count is formed of the number of occurrences of each word (TF)
• After suitable normalization, this term frequency count is compared to an Inverse Document Frequency (IDF) count, which measures the number of occurrences of a word in the entire corpus
• Not a generative model
TF-IDF
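The TF-IDF weighting above fits in a few lines. This is a minimal pure-Python sketch (the talk's demos use R and Spark); the three-document corpus is invented for illustration, and TF is normalized by document length with IDF taken as log(N / document frequency):

```python
import math

def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.
    TF: raw term count normalized by document length.
    IDF: log(N / number of documents containing the term)."""
    n = len(docs)
    # document frequency: in how many documents does each term appear?
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        counts = {}
        for term in doc:
            counts[term] = counts.get(term, 0) + 1
        # weight = normalized term frequency x inverse document frequency
        w = {t: (c / len(doc)) * math.log(n / df[t]) for t, c in counts.items()}
        weights.append(w)
    return weights

docs = [["topic", "model", "topic"], ["model", "spark"], ["spark", "demo"]]
w = tf_idf(docs)
```

A term that is frequent in one document but rare in the corpus ("topic") scores higher than a term spread across many documents ("model"), which is exactly the discrimination TF-IDF is after.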
History of Topic Modeling
• To address the shortcomings of TF-IDF, Deerwester et al. (1990) came up with the LSI (Latent Semantic Indexing) model
• LSI uses a singular value decomposition of the term-document matrix to identify a linear subspace in the space of TF-IDF features that captures most of the variance in the collection
• They claim that the model can capture some aspects of basic linguistic notions such as synonymy and polysemy
• Still not a generative model of the distribution of words
LSI
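The LSI idea can be sketched with NumPy on an invented toy term-document matrix: take the SVD, then keep only the k largest singular values to obtain the low-rank "latent" subspace the slide describes.

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents (invented counts).
X = np.array([
    [2.0, 0.0, 1.0],   # "topic"
    [1.0, 1.0, 0.0],   # "model"
    [0.0, 2.0, 1.0],   # "spark"
    [0.0, 1.0, 2.0],   # "demo"
])

# Singular value decomposition of the term-document matrix.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep the k largest singular values: the rank-k latent subspace
# that captures most of the variance in the collection.
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Reconstruction error of the low-rank approximation.
err = np.linalg.norm(X - X_k)
```

Documents (columns of `Vt[:k, :]`) that use different but co-occurring terms end up close in the reduced space, which is how LSI gets at synonymy.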
PLSI
• Hofmann (1999) presented the Probabilistic Latent Semantic Analysis (pLSI) model, also known as the aspect model, as an alternative to LSI
• Models each word in a document as a sample from a mixture
model, where the mixture components are multinomial random
variables that can be viewed as representations of “topics.”
• The model is still incomplete
• Not a probabilistic model at the level of documents
• Each document is represented as a list of numbers (the mixing
proportions for topics)
History ofTopic Modeling
• De Finetti (1990) establishes that any collection of exchangeable random variables has a representation as a mixture distribution, in general an infinite mixture. This line of thinking leads to the latent Dirichlet allocation (LDA) model
• Blei, Ng and Jordan (2003) explained LDA
• Hierarchical Bayesian Model - Each item or
word is modeled as a finite mixture over an
underlying set of topics. Each topic is, in turn,
modeled as an infinite mixture over an
underlying set of topic probabilities.
LDA
History of Topic Modeling
Taken from Wikipedia
LDA
• The original paper used a variational Bayes approximation of the posterior distribution
• Alternative inference techniques use Gibbs sampling, the Expectation Maximization algorithm, online variational Bayes and many more
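As a concrete sketch, scikit-learn's `LatentDirichletAllocation` implements the online variational Bayes inference mentioned above; the tiny document-term count matrix below is invented for illustration (two documents about one topic, two about another):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy document-term count matrix: rows = documents, columns = vocabulary terms.
X = np.array([
    [5, 3, 0, 0],
    [4, 4, 1, 0],
    [0, 0, 4, 5],
    [0, 1, 5, 4],
])

# Fit LDA with online variational Bayes inference.
lda = LatentDirichletAllocation(n_components=2,
                                learning_method="online",
                                random_state=0)
doc_topics = lda.fit_transform(X)   # per-document topic mixtures
```

Each row of `doc_topics` is a probability distribution over the two topics, illustrating the "mixture of topics per document" property that distinguishes LDA from hard clustering.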
Model Workflow
Step 1: Preprocessing
Step 2: Create Document-Term Matrix
Step 3: Apply Models
Review Results
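The first two workflow steps can be sketched in plain Python (a simplified stand-in for the R/Spark preprocessing in the demos; the tokenizer and stopword list are illustrative only):

```python
import re
from collections import Counter

# Illustrative stopword list; real pipelines use a fuller one.
STOPWORDS = {"the", "a", "of", "and", "to", "in"}

def preprocess(text):
    # Step 1: lowercase, strip punctuation, drop stopwords.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def document_term_matrix(texts):
    # Step 2: build a shared vocabulary and a document-term count matrix.
    docs = [preprocess(t) for t in texts]
    vocab = sorted({t for d in docs for t in d})
    index = {t: i for i, t in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for row, doc in zip(matrix, docs):
        for term, count in Counter(doc).items():
            row[index[term]] = count
    return vocab, matrix

vocab, dtm = document_term_matrix(["The topic model.",
                                   "A Spark demo of the model!"])
```

The resulting matrix is what Step 3 feeds to K-Means or LDA.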
K-Means
• Choose the number of clusters (K)
• Initialize the clusters: pick K observations as the initial centroids
• Determine which centroid each observation is closest to and assign it to that cluster
• Revise each cluster center as the mean of its assigned observations
• Repeat the above steps until convergence
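The steps above map directly onto a minimal pure-Python implementation (the toy 2-D points are invented for illustration):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal K-Means over n-D points given as tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # initialize: pick k observations
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(x) / len(c) for x in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:          # converged: assignments stable
            break
        centroids = new_centroids
    return centroids, clusters

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9)]
centroids, clusters = kmeans(points, 2)
```

On this well-separated toy data the two clusters recover the two groups regardless of initialization; note each point ends up in exactly one cluster, the "distinct topics" behavior the conclusion contrasts with LDA.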
Demo in R
• Use Case
• Model with K-Means
• Model with LDA and visualization
• Github Code Location - https://github.com/sanjibb/R-Code
K-Means Result
Experimentation with Spark
MLlib
• Work with the dataset in Scala
• Two variations of the optimizer:
• EM optimizer
• Online variational inference - http://www.cs.columbia.edu/~blei/papers/WangPaisleyBlei2011.pdf
GitHub Code Location
• https://github.com/sanjibb/spark_example
Conclusion
1. LDA provides a mixture of topics over the words, whereas K-Means provides distinct topics
• In real life, topics may not be distinctly separated
2. An unsupervised LDA model may require working with SMEs to get a better representation of the topics
• There is a supervised LDA model (sLDA) as well, which is not covered in this presentation
Bibliography
https://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf
https://www.cs.princeton.edu/~blei/papers/Blei2012.pdf
http://vis.stanford.edu/files/2012-Termite-AVI.pdf
http://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf

Editor's Notes

  • #7: Thus, if we wish to consider exchangeable representations for documents and words, we need to consider mixture models that capture the exchangeability of both words and documents.