Topic Extraction using Machine Learning
Sanjib Basak
Director of Data Science, Digital River
Jan 2016
Twin Cities Big Data Analytics and Apache Spark User Group meetup
Agenda
• History of Topic Models
• A Use Case
• Demo using R
• Demo using Spark
• Conclusion
History of Topic Modeling
• TF-IDF model (Salton and McGill, 1983)
• A basic vocabulary of “words” or “terms” is chosen, and, for each document in the corpus, a count is formed of the number of occurrences of each word (TF)
• After suitable normalization, this term frequency count is compared to an Inverse Document Frequency (IDF) count, which measures the number of occurrences of a word in the entire corpus
• Not a generative model
TF-IDF
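The TF-IDF weighting above fits in a few lines. This is a minimal pure-Python sketch (the talk's demos use R and Spark); the three-document corpus is invented for illustration, and TF is normalized by document length with IDF taken as log(N / document frequency):

```python
import math

def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.
    TF: raw term count normalized by document length.
    IDF: log(N / number of documents containing the term)."""
    n = len(docs)
    # document frequency: in how many documents does each term appear?
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        counts = {}
        for term in doc:
            counts[term] = counts.get(term, 0) + 1
        # weight = normalized term frequency x inverse document frequency
        w = {t: (c / len(doc)) * math.log(n / df[t]) for t, c in counts.items()}
        weights.append(w)
    return weights

docs = [["topic", "model", "topic"], ["model", "spark"], ["spark", "demo"]]
w = tf_idf(docs)
```

A term that is frequent in one document but rare in the corpus ("topic") scores higher than a term spread across many documents ("model"), which is exactly the discrimination TF-IDF is after.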
History of Topic Modeling
• To address the shortcomings of TF-IDF, Deerwester et al. (1990) came up with the LSI (Latent Semantic Indexing) model
• LSI uses a singular value decomposition of the term-document matrix to identify a linear subspace in the space of TF-IDF features that captures most of the variance in the collection
• They claim that the model can capture some aspects of basic linguistic notions such as synonymy and polysemy
• Still not a generative model of the distribution of words
LSI
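The LSI idea can be sketched with NumPy on an invented toy term-document matrix: take the SVD, then keep only the k largest singular values to obtain the low-rank "latent" subspace the slide describes.

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents (invented counts).
X = np.array([
    [2.0, 0.0, 1.0],   # "topic"
    [1.0, 1.0, 0.0],   # "model"
    [0.0, 2.0, 1.0],   # "spark"
    [0.0, 1.0, 2.0],   # "demo"
])

# Singular value decomposition of the term-document matrix.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep the k largest singular values: the rank-k latent subspace
# that captures most of the variance in the collection.
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Reconstruction error of the low-rank approximation.
err = np.linalg.norm(X - X_k)
```

Documents (columns of `Vt[:k, :]`) that use different but co-occurring terms end up close in the reduced space, which is how LSI gets at synonymy.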
PLSI
• Hofmann (1999) presented the Probabilistic Latent Semantic Analysis (pLSI) model, also known as the aspect model, as an alternative to LSI
• Models each word in a document as a sample from a mixture
model, where the mixture components are multinomial random
variables that can be viewed as representations of “topics.”
• The model is still incomplete
• Not a probabilistic model at the level of documents
• Each document is represented as a list of numbers (the mixing
proportions for topics)
History ofTopic Modeling
• De Finetti (1990) establishes that any collection of exchangeable random variables has a representation as a mixture distribution, in general an infinite mixture. This line of thinking leads to the latent Dirichlet allocation (LDA) model
• Blei, Ng and Jordan (2003) explained LDA
• Hierarchical Bayesian Model - Each item or
word is modeled as a finite mixture over an
underlying set of topics. Each topic is, in turn,
modeled as an infinite mixture over an
underlying set of topic probabilities.
LDA
History of Topic Modeling
Taken from Wikipedia
LDA
• The original paper used a variational Bayes approximation of the posterior distribution
• Alternative inference techniques use Gibbs sampling, the Expectation Maximization algorithm, online variational Bayes and many more
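As a concrete sketch, scikit-learn's `LatentDirichletAllocation` implements the online variational Bayes inference mentioned above; the tiny document-term count matrix below is invented for illustration (two documents about one topic, two about another):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy document-term count matrix: rows = documents, columns = vocabulary terms.
X = np.array([
    [5, 3, 0, 0],
    [4, 4, 1, 0],
    [0, 0, 4, 5],
    [0, 1, 5, 4],
])

# Fit LDA with online variational Bayes inference.
lda = LatentDirichletAllocation(n_components=2,
                                learning_method="online",
                                random_state=0)
doc_topics = lda.fit_transform(X)   # per-document topic mixtures
```

Each row of `doc_topics` is a probability distribution over the two topics, illustrating the "mixture of topics per document" property that distinguishes LDA from hard clustering.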
Model Workflow
Step 1: Preprocessing
Step 2: Create Document-Term Matrix
Step 3: Apply Models
Review Results
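The first two workflow steps can be sketched in plain Python (a simplified stand-in for the R/Spark preprocessing in the demos; the tokenizer and stopword list are illustrative only):

```python
import re
from collections import Counter

# Illustrative stopword list; real pipelines use a fuller one.
STOPWORDS = {"the", "a", "of", "and", "to", "in"}

def preprocess(text):
    # Step 1: lowercase, strip punctuation, drop stopwords.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def document_term_matrix(texts):
    # Step 2: build a shared vocabulary and a document-term count matrix.
    docs = [preprocess(t) for t in texts]
    vocab = sorted({t for d in docs for t in d})
    index = {t: i for i, t in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for row, doc in zip(matrix, docs):
        for term, count in Counter(doc).items():
            row[index[term]] = count
    return vocab, matrix

vocab, dtm = document_term_matrix(["The topic model.",
                                   "A Spark demo of the model!"])
```

The resulting matrix is what Step 3 feeds to K-Means or LDA.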
K-Means
• Choose the number of clusters (K)
• Initialize the clusters: pick K observations as the initial centroids
• Determine which centroid each observation is closest to and assign it to that cluster
• Revise each cluster center as the mean of its assigned observations
• Repeat the above steps until convergence
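The steps above map directly onto a minimal pure-Python implementation (the toy 2-D points are invented for illustration):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal K-Means over n-D points given as tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # initialize: pick k observations
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(x) / len(c) for x in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:          # converged: assignments stable
            break
        centroids = new_centroids
    return centroids, clusters

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9)]
centroids, clusters = kmeans(points, 2)
```

On this well-separated toy data the two clusters recover the two groups regardless of initialization; note each point ends up in exactly one cluster, the "distinct topics" behavior the conclusion contrasts with LDA.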
Demo in R
• Use Case
• Model with K-Means
• Model with LDA and visualization
• Github Code Location - https://github.com/sanjibb/R-Code
K-Means Result
Experimentation with Spark
MLlib
• Work with the dataset in Scala
• Two variations of the optimizer:
• EM optimizer
• Online variational inference - http://www.cs.columbia.edu/~blei/papers/WangPaisleyBlei2011.pdf
GitHub Code Location
• https://github.com/sanjibb/spark_example
Conclusion
1. LDA provides a mixture of topics over the words, whereas K-Means provides distinct topics
• In real life, topics may not be distinctly separated
2. An unsupervised LDA model may require working with SMEs to get a better representation of the topics
• There is a supervised LDA model (sLDA) as well, which is not covered in this presentation
Bibliography
https://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf
https://www.cs.princeton.edu/~blei/papers/Blei2012.pdf
http://vis.stanford.edu/files/2012-Termite-AVI.pdf
http://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf

Editor's Notes

  • #7: Thus, if we wish to consider exchangeable representations for documents and words, we need to consider mixture models that capture the exchangeability of both words and documents.