Text as Data
Cluster Model
Application
Extensions
Combining latent topics with
document attributes in text analysis
Nelson Auner
Prof. Matt Taddy [1], Prof. Stephen Stigler [2]
University of Chicago
May 13, 2014
[1] Associate Professor of Econometrics and Statistics at Chicago Booth School of Business
[2] Ernest DeWitt Burton Distinguished Service Professor at the Department of Statistics of the University of Chicago
NA Hidden Structure
Motivation
Basic Structure
Metadata and Computation
Topic Models
Text as Data
A document is a collection of words or phrases.
Our datasets are collections of documents
Table: What did homework consist of?
Document  Content
1         Some computation and formula proving, a lot of R code
2         Problems, computation using R
3         Some computations and writing R code
4         Proofs, problems, and programming work
Multinomial Models
If word order doesn’t matter, we can treat each document as a "bag of words".
The vector of word counts in a document can then be modeled as multinomial.
Table: Creating a word-count matrix from text
Document  some  comp  formula  prov  R  code  use  problem  writ  program  work
1         1     1     1        1     1  1     0    0        0     0        0
2         0     1     0        0     1  0     1    1        0     0        0
3         1     1     0        0     1  1     0    0        1     0        0
4         0     0     0        1     0  0     0    1        0     1        1
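As a rough illustration of how such a word-count matrix could be built, here is a minimal Python sketch. The stem map `STEMS` is a hypothetical hand-written stand-in for a real stemmer, chosen only to mirror the column names in the table; words missing from it are treated as stop words. The function name `count_matrix` is likewise an assumption, not something from the talk.

```python
import re
import numpy as np

# The four homework answers from the example table.
docs = [
    "Some computation and formula proving, a lot of R code",
    "Problems, computation using R",
    "Some computations and writing R code",
    "Proofs, problems, and programming work",
]

# Hypothetical hand-written stem map standing in for a real stemmer.
# Words that are not listed (and, a, lot, of) are dropped as stop words.
STEMS = {
    "some": "some", "computation": "comp", "computations": "comp",
    "formula": "formula", "proving": "prov", "proofs": "prov",
    "r": "R", "code": "code", "using": "use",
    "problems": "problem", "writing": "writ",
    "programming": "program", "work": "work",
}
vocab = ["some", "comp", "formula", "prov", "R", "code",
         "use", "problem", "writ", "program", "work"]
col = {term: j for j, term in enumerate(vocab)}

def count_matrix(texts):
    """Return an (n documents x p terms) matrix of stem counts."""
    X = np.zeros((len(texts), len(vocab)), dtype=int)
    for i, text in enumerate(texts):
        # Lowercase, split into alphabetic tokens, and count known stems.
        for word in re.findall(r"[a-z]+", text.lower()):
            stem = STEMS.get(word)
            if stem is not None:
                X[i, col[stem]] += 1
    return X

X = count_matrix(docs)
print(X)
```

Each row of `X` is one document and each column one stem, so word order is discarded and only the counts remain.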
A better model: Metadata
We would like to add structure to the model for inference or
prediction
Metadata is data that accompanies a document
Table: What did homework consist of?
Grade  Content
A+     Some computation and formula proving, a lot of R code
B      Problems, computation using R
B      Some computations and writing R code
C+     Proofs, problems, and programming work
Metadata and Computation
n documents with metadata that takes m discrete values.
Normally, n >> m
⇒ Collapse observations by outcome variable (the metadata value): sum the word counts of all documents that share a value.
Model as m observations instead of n.
Grade  some  comp  formula  prov  R  code  use  problem  writ  program  work
A+     1     1     1        1     1  1     0    0        0     0        0
B      1     2     0        0     2  1     1    1        1     0        0
C+     0     0     0        1     0  0     0    1        0     1        1
Reality: There are thousands of course reviews
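The collapsing step can be sketched in a few lines of Python. This works because the sum of independent multinomial count vectors that share the same word probabilities is again multinomial, so nothing is lost by adding rows within a grade. The function name `collapse_by_label` is an assumption made for illustration; the counts are taken from the four example sentences (document 3 contains the word "code").

```python
import numpy as np

# One grade per document (the metadata).
grades = np.array(["A+", "B", "B", "C+"])

# Word counts per document; columns are
# some, comp, formula, prov, R, code, use, problem, writ, program, work.
X = np.array([
    [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1],
])

def collapse_by_label(X, labels):
    """Sum the rows of X that share a label: n rows become m rows."""
    # dict.fromkeys keeps the labels in order of first appearance.
    levels = list(dict.fromkeys(labels.tolist()))
    collapsed = np.vstack([X[labels == g].sum(axis=0) for g in levels])
    return levels, collapsed

levels, C = collapse_by_label(X, grades)
for g, row in zip(levels, C):
    print(g, row)
```

With thousands of reviews and only a handful of grades, the regression then runs on a handful of rows rather than thousands.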
Topic Models
A topic is a distribution over words.
In a topic model, each document is a mixture of topics.
Running Topic: Stride, Pacing, Stretch
Bike Topic: Pedal, Helmet, Gears
Swimming Topic: Stroke, Air, Water
A book about triathlon training ∼ θ1 Running + θ2 Biking + θ3 Swimming
Problem: we can no longer collapse observations and must use all n observations.
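To make the mixture idea concrete, the following Python sketch builds the word distribution of one document as a weighted average of three topics and then draws word counts from it. All numbers (the topic probabilities, the weights in `theta`, and the document length of 200 words) are invented for illustration and do not come from the talk.

```python
import numpy as np

vocab = ["stride", "pacing", "stretch",
         "pedal", "helmet", "gears",
         "stroke", "air", "water"]

# Each row is one topic: a probability distribution over the vocabulary.
# The values are made up; each row sums to 1.
topics = np.array([
    [0.30, 0.30, 0.30, 0.02, 0.02, 0.02, 0.02, 0.01, 0.01],  # running
    [0.02, 0.02, 0.02, 0.30, 0.30, 0.30, 0.02, 0.01, 0.01],  # biking
    [0.02, 0.02, 0.01, 0.02, 0.02, 0.01, 0.30, 0.30, 0.30],  # swimming
])

# Hypothetical topic weights for one triathlon book (theta1, theta2, theta3).
theta = np.array([0.5, 0.3, 0.2])

# The document's word distribution is the theta-weighted mix of the topics.
word_probs = theta @ topics

# Draw a 200-word document from that distribution.
rng = np.random.default_rng(0)
counts = rng.multinomial(200, word_probs)

for term, prob in zip(vocab, word_probs):
    print(f"{term:8s} {prob:.3f}")
```

Because every document has its own `theta`, two documents with the same metadata no longer share one word distribution, which is why their counts cannot simply be added together.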
Algorithm
Cluster Model
Goal
Want to use the Topic Model but incorporate Metadata
Also want computational ease
Approach
Restrict each document to only one topic ⇒ "cluster"
Can collapse observations over each unique (metadata, cluster) combination
\[ x_i \sim \mathrm{MN}(q_{ij}, m_{ij}); \qquad q_{ij} = \frac{\exp(\alpha_j + y_i \phi_j + u_i \Gamma_{kj})}{\sum_{l=1}^{p} \exp(\alpha_l + y_i \phi_l + u_i \Gamma_{kl})} \]
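The word probabilities in this model are a softmax over the p terms. The sketch below computes them for a single document under two simplifying assumptions that are not stated in the talk: the metadata y_i is a single number, and the cluster membership u_i acts as an indicator that selects row k of Γ. The function name `word_probabilities` and all parameter values are hypothetical.

```python
import numpy as np

def word_probabilities(alpha, phi, Gamma, y, u):
    """Softmax word probabilities for one document.

    alpha : (p,) term intercepts
    phi   : (p,) loadings on the scalar metadata value y (assumed scalar)
    Gamma : (K, p) cluster loadings; row u is used (assumed indicator coding)
    """
    eta = alpha + y * phi + Gamma[u]
    eta = eta - eta.max()          # subtract the max for numerical stability
    weights = np.exp(eta)
    return weights / weights.sum() # normalise over the p terms

# Made-up parameters for p = 4 terms and K = 2 clusters.
alpha = np.array([0.0, -1.0, 0.5, 0.2])
phi = np.array([0.3, 0.0, -0.2, 0.1])
Gamma = np.array([
    [1.0, 0.0, 0.0, -1.0],
    [-1.0, 0.0, 0.0, 1.0],
])

q = word_probabilities(alpha, phi, Gamma, y=1.0, u=0)
print(q, q.sum())
```

Documents with the same (y, u) pair get the same `q`, which is what allows their counts to be collapsed.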
Algorithm for Cluster Membership Model with Gamma
Lasso Penalty
1. Initialize cluster membership u_i for i = 1, ..., n.
2. Determine parameters α, φ, Γ by fitting a multinomial regression of x_i on y_i and u_i with a gamma lasso penalty (Taddy 2013).
3. For each document i, determine the new cluster membership u_i as argmax_{k = 1, ..., K} (u_i = k | α, φ, Γ).
4. Check whether the current cluster assignment differs from the previous one (u^(t) ≠ u^(t−1)). If so, return to step 2. If not, end the algorithm.
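The alternating structure of these steps can be sketched in Python. This is a simplified stand-in, not the method from the talk: it ignores the metadata y_i, and in step 2 it replaces the gamma-lasso-penalized multinomial regression with smoothed per-cluster word frequencies. The function name `fit_clusters`, the `smooth` parameter, and the toy data are all assumptions.

```python
import numpy as np

def fit_clusters(X, K, init, max_iter=50, smooth=0.1):
    """Alternate between fitting cluster word probabilities and reassigning documents."""
    X = np.asarray(X, dtype=float)
    u = np.asarray(init).copy()            # step 1: initial memberships
    for _ in range(max_iter):
        # Step 2 (simplified): smoothed word frequencies per cluster stand in
        # for the penalized multinomial regression.
        counts = np.vstack([X[u == k].sum(axis=0) for k in range(K)]) + smooth
        log_q = np.log(counts / counts.sum(axis=1, keepdims=True))
        # Step 3: assign each document to the cluster with the highest
        # multinomial log-likelihood (up to an additive constant).
        u_new = np.argmax(X @ log_q.T, axis=1)
        # Step 4: stop once the assignment no longer changes.
        if np.array_equal(u_new, u):
            break
        u = u_new
    return u, np.exp(log_q)

# Toy counts: documents 0 and 1 favour term 0, documents 2 and 3 favour term 2.
X = np.array([
    [5, 0, 0, 1],
    [4, 1, 0, 0],
    [0, 0, 5, 1],
    [0, 1, 4, 0],
])
u, q = fit_clusters(X, K=2, init=[0, 0, 0, 1])
print(u)
```

Like other hard-assignment schemes, this loop can stop at a poor local optimum; on the toy data, the initialization `[0, 1, 0, 1]` is already a fixed point that mixes the two groups, which is one reason the choice of initialization matters.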
Congressional Speech Data
Restaurant Review Data
Congressional Speech and Restaurant Reviews
We apply the algorithm to two datasets:
Congressional Speech records (Gentzkow and Shapiro, 2010)
A corpus of restaurant reviews called we8there.
Questions:
Can this simple model capture the variation explained by a
topic model?
How does choice of cluster initialization affect the fit?
An Example Cluster
term loading
1 nation.oil.food 20.09
2 united.nation.oil 12.09
3 liberty.pursuit.happiness 8.11
4 life.liberty.pursuit 8.11
5 minority.women.owned 6.73
6 universal.health 6.67
7 white.care.act 6.64
8 ryan.white.care 6.6
9 universal.health.care 5.99
10 growth.job.creation 5.39
11 drilling.arctic.national 5.3
12 tax.relief.package 5.29
13 judge.john.robert 5.26
14 fre.enterprise 5.07
15 arctic.refuge 4.93
Comparison with the Topic Model
Good news: We are able to recover similar topics with our model:
Table: Comparison of top word loadings on a stem-cell topic
Cluster Membership   Topic Model (LDA)*
umbilic.cord.blood   pluripotent.stem.cel
cord.blood.stem      national.ad.campaign
blood.stem.cel       cel.stem.cel
adult.stem.cel       stem.cel.line
*Results reported in Taddy (2012)
Incorporating metadata: Congressional Speech
[Figure: terms plotted in two panels, (a) Democrat and (b) Republican; axis label: term.]
(a) Democrat: nation.oil.food, united.nation.oil, liberty.pursuit.happiness, life.liberty.pursuit, minority.women.owned, universal.health, white.care.act, ryan.white.care, universal.health.care, growth.job.creation, drilling.arctic.national, tax.relief.package, judge.john.robert, fre.enterprise, arctic.refuge
(b) Republican: nation.oil.food, united.nation.oil, un.official, liberty.pursuit.happiness, life.liberty.pursuit, growth.job.creation, tax.relief.package, judge.john.robert, food.scandal, oil.food.scandal, death.tax.repeal, fre.enterprise, speaker.table, st.paul, judge.alberto.gonzale
Example Topic from Restaurant Review
term loading
1 deep dish 7.76
2 italian beef 7.07
3 pizza like 6.85
4 style food 6.69
5 au jus 6.33
6 cut fri 6.16
7 just ok 6.01
8 great pizza 5.96
9 south side 5.94
10 pizza great 5.82
11 just over 5.75
12 took seat 5.72
13 golden brown 5.61
14 behind counter 5.58
15 got littl 5.52
Incorporating metadata: Restaurant Review
[Figure: terms plotted in two panels, Base Frequency and Positive Review; axis label: term.]
Base Frequency: deep dish, italian beef, pizza like, style food, au jus, cut fri, just ok, great pizza, south side, pizza great, just over, took seat, golden brown, behind counter, got littl
Positive Review: deep dish, pizza like, italian beef, style food, au jus, great pizza, cut fri, best mexican, pizza great, great tast, south side, sauc great, back home, outstand food, just over
1. Relationship Between Clusters and Metadata
2. Feature Allocations: Allow an observation to be a member of multiple clusters
3. Prediction and Cross Validation
Imma Let you Finish, but the Dirichlet was the greatest
prior of all time!
Results
term loading
1 yeezus 5.48
2 constel 3.79
3 homm 3.79
4 preach 3.79
5 bound 3.6
6 thoma 3.38
7 thirti 3.32
8 rocka 3.31
9 rowland 3.25
10 jamaican 3.23
11 blocka 3.22
12 movement 3.22
13 unlik 3.08
14 yknow 3.08
Thank You
Text Analysis: Latent Topics and Annotated Documents
