
Text as Data in the Social Sciences

Data Science Summer School (DS3 ) École Polytechnique

Brandon Stewart1
Princeton University
Code at https://bit.ly/2KtDziR

June 28-29, 2018

1
Huge thanks to Justin Grimmer for many slides included here (see
https://github.com/justingrimmer/TAD).
Stewart (Princeton) Text as Data June 28-29, 2018 1 / 187
Big Data Social Science

Massive increase in unstructured text due to:
- new social structures (the internet, email)
- new/improved data collection (wiki surveys, survey experiments)
- digitization efforts (government documents, Google Books)
Communities now leave digitized footprints.
Tools to analyze text are advancing in parallel:
- text by itself is useless
- importing methods from many different fields
- new analysis techniques can even drive new data availability
Today is about the logic of inference: turning data and tools into
inferences about society.


Admin

Topic: text analysis in the social sciences.
- Each session will involve some lectures and some code.
- Cambria will be helping me out!
- Please stop me with questions. If you have a question, someone else probably does too.
- Heavy focus on the logic of inference, with examples.
- I will send out an email with slides sometime next week, after recovering from jet lag.
- Much of the material and perspective here comes from a joint book manuscript with Justin Grimmer and Molly Roberts, and I'm thankful for their perspective.


Organization by Tasks not Techniques

Discovery, Measurement, Inference


Overview Papers

- Grimmer and Stewart. "Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts" (Political Analysis, 2013)
- Lucas, Roberts, Stewart, Storer, Tingley. "Computer Assisted Text Analysis for Comparative Politics" (Political Analysis, 2015)

Copies at BrandonStewart.org


Goals in Social Science

- Social science seeks to document many aspects of society and relations between humans.
- Social science research agendas often emphasize why a pattern of behavior exists (not just documenting that it does).
- Much of social science is about properties of actors which are hidden from view (e.g. preferences, ideology, goals).
- We are all social scientists now (Grimmer 2015):
  - we need problem-driven, theoretically-motivated research to understand human behavior
  - computational social science: bring social science to machine learning, and machine learning to social science
  - fundamentally interdisciplinary


Today's Workshop

- Text lets researchers learn about latent variables based on observed data in text.
- Focus primarily on methods for estimating latent document-level properties to answer a social science question.
- Concentrate on what the methods are used for rather than the latest methods themselves (they may be replaced by something new in two years anyway!)
- What we won't cover: machine translation, language generation/question answering, deep learning models of text, dependency parsing and grammar.


1 Session 1: Getting Started with Text in Social Science
    What Text Methods Can Do
    Core Concepts and Principles
    Represent
    Example: Understanding Chinese Censorship

2 Session 2: Discover
    Clustering
    Interpretation
    Mixed Membership Models
    Example: Discovery in Congressional Communication

3 Session 3: Measure
    Choosing a Model
    Example: Party Manifestoes in Japan
    Revisiting K
    Structural Topic Model
    Supervised Learning
    Scaling

4 Session 4: Additional Approaches to Measurement
    Difficulties with Trends
    Text Reuse
    Readability
    A few more papers


Why Now?

↑ supply of texts + ≈ demand
Explosion of methods which are:
- generalizable: one method can be used across many problems and collections of texts
- systematic: parameters/statistics demonstrate how models make coding decisions
- cheap: easily applied to many new collections of texts; computing power is inexpensive
Methods created in neighboring disciplines (CS, statistics, etc.) but modified for social science.
Many approaches are applications of high-dimensional statistics, and thus share similarities with methods in genetics, networks, neuroscience, video/image as data, and "big data".


What Can Text Methods Do?

Improve reading (the haystack metaphor):
- Interpreting the meaning of a sentence or phrase (analyzing a single straw of hay)
  - Humans: amazing (Straussian political theory, analysis of English poetry)
  - Computers: struggle
- Comparing, organizing, and classifying texts (organizing the haystack)
  - Humans: terrible; tiny active memories
  - Computers: amazing and getting better all the time!

What they don't do:
- Develop a comprehensive statistical model of language
- Replace the need to read
- Develop a single tool or evaluation applicable for all tasks
Use of the Word Ghetto (Duneier)

[Figure: proportion of uses of the word "ghetto" by year, 1920-1975, contrasting uses referring to the "Warsaw" or "Jewish" ghetto with uses referring to the "Black" or "Negro" ghetto.]


International Events (O'Connor, Stewart & Smith)

O'Connor, Stewart, and Smith. "Learning to Extract International Relations from Political Context." Association for Computational Linguistics. 2013.


Digital Literature Reviews (Nielsen & Stewart)

[Figure: number of articles per year in the international relations literature, 1980-2005, broken out by paradigm: Realism, Liberalism, Constructivism, and Non-paradigmatic.]

[Three follow-up figures show the Realism, Liberalism, and Constructivism results individually.]


Modeling the Progress of Science (Blei and Lafferty)

[Figures from the paper follow.]

Blei, David M., and John D. Lafferty. "Dynamic topic models." Proceedings of the 23rd International Conference on Machine Learning. 2006.
What Do People Search For?

WebSeer: http://hint.fm/seer/

[A series of WebSeer screenshots follows.]


Social Science

We will talk a lot today about methods. However, the best research is problem-driven research.
- What is the question?
- If you had infinite time and resources, what would you do?
- What is your data a sample of?
- What assumptions do you need to connect the observed data to your quantity of interest?

Data needs a great question and assumptions (theory) to produce insight.




Different Types of Methods for Different Goals

Supervised: pursuing a known goal
- human annotates a subset of documents
- algorithm annotates the rest
- usually associated with quantitative research
Unsupervised: goal is to learn the goal
- algorithm discovers themes/patterns in the texts
- human interprets the results
- usually associated with qualitative research
Both strategies amplify human effort, each in different ways.


Principles

From the forthcoming book with Grimmer and Roberts.

Six core principles for how to deploy text-as-data methods.

Based on a radically agnostic view of how to apply text-as-data methods (although how radical it seems may depend on where you sit).

Core idea: do not assume that there is an underlying true model of text. Instead, use methods to discover what we are interested in, and believe the results because they are externally validated.


Principles

1) Theory is the starting point for analysis.
2) Text-as-data methods do not replace humans; they augment them.
3) Social science is iterative.
4) We aren't looking for a universal model of language, just generalizations which are useful.
5) The best method is task-dependent.
6) Validation is essential.




Text is Complex

- The data generation process for text is unknown.
- Complexity of language:
  - Time flies like an arrow; fruit flies like a banana
  - Make peace, not war; Make war, not peace
  - "Years from now, you'll look back and you'll say that this was the moment, this was the place where America remembered what it means to hope."
- Models necessarily fail to capture some complexity of language, but can still be useful for specific tasks.


Identifying and Representing Texts

Selecting texts
- Sample should depend on the research goal.
  - What questions is a sample of Twitter users useful for? Where might it be problematic?
  - What questions are front-page newspaper articles useful for? Where might they be problematic?
Representing texts
- Meaning comes from a full social interaction:
  - words, pictures, context, music, body language, laughter, emojis, facial expressions, tone
- But we will be forced to simplify.
- What aspects of the communication are important to our study?
  - When might emojis be important?
  - When might formatting be important?

Problem-driven research.
Document-Term Matrices

Preprocessing: simplify text, make it useful.
- Lower dimensionality (for our purposes).
- Remember: characterize the haystack.
  - If you want to analyze a single straw of hay, these methods are unlikely to work.
  - But even if you want to closely read texts, characterizing the haystack can be useful.


Preprocessing for Quantitative Text Analysis

One recipe (of many) for preprocessing: retain useful information.
1) Remove capitalization and punctuation
2) Discard word order (the Bag of Words assumption)
3) Discard stop words
4) Create equivalence classes: stem, lemmatize, or map synonyms
5) Discard less useful features (depends on the application)
6) Other reduction, specialization

Output: a count vector; each element counts occurrences of a stem. (A code sketch of this recipe follows.)
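A minimal sketch of this recipe in R, assuming the quanteda package (the slides link to code but do not prescribe a toolkit); any comparable text-processing library would work:

library(quanteda)

texts <- c(d1 = "Political power grows out of the barrel of a gun.")

toks <- tokens(texts, remove_punct = TRUE)    # 1) strip punctuation
toks <- tokens_tolower(toks)                  # 1) remove capitalization
# 2) word order is discarded implicitly once we keep only counts
toks <- tokens_remove(toks, stopwords("en"))  # 3) discard stop words
toks <- tokens_wordstem(toks)                 # 4) equivalence classes via stemming
dtm <- dfm(toks)                              # output: counts of stems per document
# 5)-6) e.g. dfm_trim(dtm, min_termfreq = 5) to drop rare features
dtm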


A Complete Example

"Political power grows out of the barrel of a gun" - Mao

Compound words: an analyst may want to combine words into a single term that can be analyzed. (With substantive justification, words can be combined or split to improve inference.)

[Political], [power], [grows], [out], [of], [the], [barrel of a gun]

Stopword removal: removing from the text terms that are not related to what the author is studying.

[Political], [power], [grows], [out], [barrel of a gun]

Stemming: takes the ends off conjugated verbs or plural nouns, leaving just the "stem."

[Polit], [power], [grow], [out], [barrel of a gun]

Finally, we can turn tokens and documents into a "document-term matrix." Imagine we have a second document in addition to the Mao quote, which tokenizes as follows.

Document #1: [polit], [power], [grow], [out], [barrel of a gun]
Document #2: [compar], [polit], [chicago], [polit]


Output: Document-Term Matrix

                     Doc1   Doc2
  power                 1      0
  grow                  1      0
  out                   1      0
  barrel of a gun       1      0
  compar                0      1
  polit                 1      2
  chicago               0      1

(A worked version of this example appears in the code sketch below.)
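The same example as a code sketch, again assuming quanteda. The slides give only the tokens of Document #2, so its raw text below is a hypothetical reconstruction:

library(quanteda)

docs <- c(doc1 = "Political power grows out of the barrel of a gun",
          doc2 = "Comparative politics Chicago politics")  # hypothetical text

toks <- tokens(docs, remove_punct = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_compound(toks, phrase("barrel of a gun"))   # compound words
toks <- tokens_remove(toks, stopwords("en"))               # stopword removal
toks <- tokens_wordstem(toks)                              # stemming: grows -> grow
dfm(toks)
# Caveat: standard English stopword lists include "out", so it is dropped
# here even though the slide keeps it; the pipeline, not the exact counts,
# is the point.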


How Could This Possibly Work?

There are so many subtle things in speech:
- Sarcasm: "The Star Wars prequels were amazing because everyone loves a good discussion about trade policy." (Source: Grimmer)
- Subtle negation: "We are in no way troubled."
- Order dependence: "Dog bites man." vs. "Man bites dog." (see the bag-of-words check below)
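A quick check of the order-dependence point: under the bag-of-words representation the two sentences become identical count vectors (sketch assuming quanteda):

library(quanteda)

toks <- tokens(c(a = "Dog bites man.", b = "Man bites dog."),
               remove_punct = TRUE)
dfm(toks)  # lowercases by default; both rows are bites=1, dog=1, man=1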


How Could This Possibly Work?

Three answers:
1) It might not: validation is critical (and task-specific).
2) There is a central tendency in text: words often imply what a text is about (war, civil, union) or its tone (consecrate, dead, died, lives). Such words are likely to be used repeatedly, creating a theme for an article.
3) Human supervision can help: injecting human judgement (coders) helps methods identify subtle relationships between words and outcomes of interest.

It is easier to capture some things than others.


Changing Consensus on Preprocessing Steps

Conventions were imported from 1990s computational linguistics.

Excellent new papers are pushing back:
- Schofield and Mimno (2016) "Comparing Apples to Apple: The Effects of Stemmers on Topic Models" TACL
- Denny and Spirling (2018) "Text Preprocessing for Unsupervised Learning: Why It Matters, When It Misleads, and What to Do About It" Political Analysis (comes with the R package preText; a usage sketch follows below)
- Schofield et al. (2017) "Pulling Out the Stops: Rethinking Stopword Removal for Topic Models" EACL

Core point: while pre-processing of text is likely inevitable, choices are consequential and we shouldn't pretend otherwise.

This is great: reconsidering fundamentals is the sign of a maturing field.

Remember: folk wisdom is always a product of its time.
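A minimal sketch of the preText workflow from Denny and Spirling (2018), following the package's documented entry points as I recall them; the corpus, thresholds, and plotting call are illustrative assumptions, not choices from the slides:

library(preText)
library(quanteda)

docs <- data_corpus_inaugural[1:10]   # small placeholder corpus from quanteda

# Run the corpus through factorial combinations of common preprocessing
# choices (punctuation, lowercasing, stopwords, stemming, rare terms).
prep <- factorial_preprocessing(docs,
                                use_ngrams = FALSE,
                                infrequent_term_threshold = 0.2)

# Score how much each preprocessing specification perturbs pairwise
# document distances relative to the others.
res <- preText(prep,
               dataset_name = "Inaugural addresses",
               distance_method = "cosine",
               num_comparisons = 20)

preText_score_plot(res)               # compare the specifications at a glance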


Why Do We Do This In The First Place?

1) Efficiency
- stemming is a form of parameter tying (or equivalence assertion); see the sketch after this slide
- if you have enough data you can learn that car and cars have similar loadings (or not!)
- this arises from the (pragmatically understandable) focus on small corpora
- as usual, the best answer is: get more data
2) Aesthetics
- people like looking at lists of most probable words: stop words and unstemmed words make these lists less informative
- arguably this is a (complicated) software problem
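To make the parameter-tying point concrete, a tiny sketch using the SnowballC stemmer (one common implementation; the slides do not name a particular stemmer):

library(SnowballC)

wordStem(c("car", "cars", "politics", "political"), language = "english")
# [1] "car"   "car"   "polit" "polit"
# Distinct surface forms now share one feature: fewer parameters to
# estimate, at the cost of being unable to learn that the forms differ.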


The Aesthetics of Pre-Processing

[Figure: expected topic proportions for a model fit without stopword removal. Every topic's top words are stop words: Topics 26, 68, 15, 97, 18, and 30: "the, to, of"; Topics 52, 85, 34, and 60: "the, of, to"; Topics 71 and 58: "the, and, to"; Topic 77: "the, to, and"; Topic 45: "the, a, to"; Topic 38: "the, to, that".]


The Aesthetics of Pre-Processing

[Figure: expected topic proportions for the same 15 topics with stop words removed; the topics are now interpretable. Topic 26: tax, cuts, taxes; Topic 68: prices, gas, oil; Topic 15: fundraising, donors, lobbying; Topic 97: supreme, court, judicial; Topic 18: gaza, hamas, palestinian; Topic 52: guantanamo, detainees, gitmo; Topic 85: ossetia, russian, russia; Topic 34: warming, climate, ipcc; Topic 77: pelosi, nancy, speaker; Topic 60: chinese, tibetan, beijing; Topic 71: fannie, mae, freddie; Topic 45: taser, police, tasers; Topic 58: alien, immigrants, illegal; Topic 38: waterboarding, torture, mukasey; Topic 30: stem, cell, embryonic.]

Stewart (Princeton) Text as Data June 28-29, 2018 38 / 187


Why Do We Do This In The First Place?
1) Efficiency
I stemming is a form of parameter tying (or equivalence assertions)
I if you have enough data you can learn that car and cars have similar loadings (or not!)
I this arises from the (pragmatically understandable) focus on small corpora
I as usual, the best answer is get more data
2) Aesthetics
I people like looking at lists of most probable words: stop words and unstemmed words make these lists less informative
I arguably this is a (complicated) software problem

Practical Issue: We have a lot of moving pieces in text analysis and we want people to be able to get their work done.

Stewart (Princeton) Text as Data June 28-29, 2018 39 / 187
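
To make the "parameter tying" point concrete, here is a minimal sketch (my illustration, not the workshop's code) using scikit-learn: a toy stemmer maps car and cars to a single vocabulary entry, so any downstream model estimates one shared loading for them instead of two.

```python
from sklearn.feature_extraction.text import CountVectorizer

def crude_stem(token):
    # Toy stemmer for illustration only: strip a plural "s".
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

docs = ["the car sped", "three cars sped", "the cars stopped"]

# Without stemming: "car" and "cars" are separate columns (separate parameters).
raw = CountVectorizer().fit(docs)
print(sorted(raw.vocabulary_))

# With stemming: both map to one column -- their parameters are tied.
base_analyzer = CountVectorizer().build_analyzer()
tied = CountVectorizer(analyzer=lambda doc: [crude_stem(t) for t in base_analyzer(doc)])
tied.fit(docs)
print(sorted(tied.vocabulary_))
```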


This Can Actually Work!

The Future of Political Science: 100 Perspectives
Edited by Gary King, Harvard University; Kay Lehman Schlozman, Boston College; and Norman H. Nie, Stanford University

“The list of authors in The Future of Political Science is a 'who’s who' of political science. As I was reading it, I came to think of it as a platter of tasty hors d’oeuvres. It hooked me thoroughly.”
—Peter Kingstone, University of Connecticut

“In this one-of-a-kind collection, an eclectic set of contributors offer short but forceful forecasts about the future of the discipline. The resulting assortment is captivating, consistently thought-provoking, often intriguing, and sure to spur discussion and debate.”
—Wendy K. Tam Cho, University of Illinois at Urbana-Champaign

“King, Schlozman, and Nie have created a visionary and stimulating volume. The organization of the essays strikes me as nothing less than brilliant... It is truly a joy to read.”
—Lawrence C. Dodd, Manning J. Dauer Eminent Scholar in Political Science, University of Florida

Available March 2009: 304pp. Pb: 978-0-415-99701-0: $24.95
www.routledge.com/politics

Stewart (Princeton) Text as Data June 28-29, 2018 40 / 187


Evaluators Rate Machine Choices Better Than Their Own (Grimmer and King)

Generate pairs of similar documents: Humans vs. Machines
Scale: (1) unrelated, (2) loosely related, or (3) closely related
Table reports: mean(scale)

Pairs from             Overall Mean   Evaluator 1   Evaluator 2
Random Selection           1.38           1.16          1.60
Hand-Coded Clusters        1.58           1.48          1.68
Hand-Coding                2.06           1.88          2.24
Machine                    2.24           2.08          2.40

p.s. The hand-coders did the evaluation!

Stewart (Princeton) Text as Data June 28-29, 2018 41 / 187


1 Session 1: Getting Started with Text in Social Science
What Text Methods Can Do
Core Concepts and Principles
Represent
Example: Understanding Chinese Censorship

2 Session 2: Discover
Clustering
Interpretation
Mixed Membership Models
Example: Discovery in Congressional Communication

3 Session 3: Measure
Choosing a Model
Example: Party Manifestoes in Japan
Revisiting K
Structural Topic Model
Supervised Learning
Scaling

4 Session 4: Additional Approaches to Measurement


Difficulties with Trends
TextReuse
Readability
A few more papers

Stewart (Princeton) Text as Data June 28-29, 2018 42 / 187


Discovery, Measurement and Causal Inference: Censorship in China

Several slides that follow graciously provided by Molly Roberts.

Stewart (Princeton) Text as Data June 28-29, 2018 43 / 187


Censorship & Post Volume are “Bursty”

[Figure: daily counts of published and censored posts, January–July, spiking sharply around the riots in Zengcheng.]

Unit of analysis:
I volume burst (≈ 3 SDs greater than baseline volume)
They monitored 85 topic areas
Found 87 volume bursts in total
Identified real world events associated with each burst

Their hypothesis: The government censors all posts in volume bursts associated with events with collective action potential

Stewart (Princeton) Text as Data June 28-29, 2018 44 / 187
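
The burst definition above is easy to operationalize. A minimal sketch (my illustration on hypothetical daily counts, not King, Pan, and Roberts's code):

```python
import numpy as np

rng = np.random.default_rng(0)
counts = rng.poisson(lam=8, size=180).astype(float)  # hypothetical daily post counts
counts[120:124] = [40, 55, 48, 35]                   # inject an event-driven spike

# Flag days roughly 3 SDs above baseline volume (a careful analysis would
# estimate the baseline excluding the burst days themselves).
threshold = counts.mean() + 3 * counts.std()
burst_days = np.flatnonzero(counts > threshold)
print(f"threshold = {threshold:.1f}, burst days: {burst_days}")
```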
Observational Test 2: The Event Generating Volume Bursts

Event classification (each category can be +, −, or neutral comments about the state)
1 Collective Action Potential
I protest or organized crowd formation outside the Internet
I individuals who have organized or incited collective action on the ground in the past
2 Criticism of censors
3 Pornography
4 (Other) News
5 Government Policies

Stewart (Princeton) Text as Data June 28-29, 2018 45 / 187


What Types of Events Are Censored?

[Figure 1: densities of censorship magnitude (−0.2 to 0.8) for five event types: Policy, News, Collective Action, Criticism of Censors, Pornography.]

[Figure 2: the same axis with individual events labeled. Heavily censored events (collective action, criticism of censors, pornography) include: Protests in Inner Mongolia; Pornography Disguised as News; Baidu Copyright Lawsuit; Zengcheng Protests; Pornography Mentioning Popular Book; Ai Weiwei Arrested; Collective Anger at Lead Poisoning in Jiangsu; Google is Hacked; Localized Advocacy for Environment Lottery; Fuzhou Bombing; Students Throw Shoes at Fang BinXing; Rush to Buy Salt After Earthquake; New Laws on Fifty Cent Party. Lightly censored events (policies and news) include: U.S. Military Intervention in Libya; Food Prices Rise; Education Reform for Migrant Children; Popular Video Game Released; Indoor Smoking Ban Takes Effect; News About Iran Nuclear Program; Jon Huntsman Steps Down as Ambassador to China; Gov't Increases Power Prices; China Puts Nuclear Program on Hold; Chinese Solar Company Announces Earnings; EPA Issues New Rules on Lead; Disney Announced Theme Park; Popular Book Published in Audio Format.]

Stewart (Princeton) Text as Data June 28-29, 2018 46 / 187


Mechanisms of Censorship

Stewart (Princeton) Text as Data June 28-29, 2018 47 / 187


Experimental Design
Conducted three rounds of experiments. For each round:
Selected 100 top social media sites (∼87% of blogs, >500M users, geographically diverse)
Created 2 accounts on each
Wrote 1,200 unique social media posts
Submitted posts randomly assigned to type
Four treatment conditions:

           Pro-gov       Anti-gov
CA         CA pro        CA anti
non-CA     non-CA pro    non-CA anti

Record whether the post is published immediately or held for review
Check for censorship after 24-72 hours

Stewart (Princeton) Text as Data June 28-29, 2018 48 / 187
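
A minimal sketch of the randomization step (hypothetical post IDs, not the authors' code): complete random assignment of the 1,200 written posts into the four cells of the 2x2 design.

```python
import random
from collections import Counter

random.seed(42)
cells = [(ca, stance) for ca in ("CA", "non-CA") for stance in ("pro-gov", "anti-gov")]
posts = [f"post_{i:04d}" for i in range(1200)]  # placeholder post IDs

# Complete randomization: exactly 300 posts land in each treatment condition.
schedule = cells * (len(posts) // len(cells))
random.shuffle(schedule)
assignment = dict(zip(posts, schedule))
print(Counter(assignment.values()))
```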


Collective Action Events: Large Causal Effect

[Figure: censorship difference (CA Event − Non-CA Event), on a 0.0–0.5 scale, for four events: the Panxu protest, Tibetan self-immolations, the Ai Weiwei album, and protests in Xinjiang. All differences are large and positive.]

Stewart (Princeton) Text as Data June 28-29, 2018 49 / 187


Posts For v. Against Government: Zero Causal Effect

[Figure: censorship difference (Pro − Anti), on a −0.5 to 1.0 scale, for the Panxu protest, Tibetan self-immolations, the Ai Weiwei album, protests in Xinjiang, corruption policy, eliminating Golden Week, the rental tax, yellow light fines, the stock market crash, the investigation of the Sichuan vice governor, gender imbalance, and the Li Tianyi scandal. The estimates cluster around zero.]

Stewart (Princeton) Text as Data June 28-29, 2018 50 / 187


Q&A and Code

Stewart (Princeton) Text as Data June 28-29, 2018 51 / 187


1 Session 1: Getting Started with Text in Social Science
What Text Methods Can Do
Core Concepts and Principles
Represent
Example: Understanding Chinese Censorship

2 Session 2: Discover
Clustering
Interpretation
Mixed Membership Models
Example: Discovery in Congressional Communication

3 Session 3: Measure
Choosing a Model
Example: Party Manifestoes in Japan
Revisiting K
Structural Topic Model
Supervised Learning
Scaling

4 Session 4: Additional Approaches to Measurement


Difficulties with Trends
TextReuse
Readability
A few more papers

Stewart (Princeton) Text as Data June 28-29, 2018 52 / 187




Organization by Tasks not Techniques

Discovery Measurement Inference

Stewart (Princeton) Text as Data June 28-29, 2018 53 / 187


Discovery: the way we teach it

Stewart (Princeton) Text as Data June 28-29, 2018 54 / 187


Discovery: the way it is

Stewart (Princeton) Text as Data June 28-29, 2018 55 / 187


Discovery
We want to discover new concepts, organizations of the data.
Once we acknowledge discovery as part of the research process, we can improve at it.
Qualitative researchers (ethnographers in particular) have a lot to teach us about how to do this right.
Some principles:
I Discovery is not a substitute for theory.
F discovery is aided by theory
I Once you discover something, you need new data to test it.
F replication and science
I The ‘right’ approach depends on the application.
F what works for movie reviews may not work for political blogs
Our view: science is iterative

Stewart (Princeton) Text as Data June 28-29, 2018 56 / 187


1 Session 1: Getting Started with Text in Social Science
What Text Methods Can Do
Core Concepts and Principles
Represent
Example: Understanding Chinese Censorship

2 Session 2: Discover
Clustering
Interpretation
Mixed Membership Models
Example: Discovery in Congressional Communication

3 Session 3: Measure
Choosing a Model
Example: Party Manifestoes in Japan
Revisiting K
Structural Topic Model
Supervised Learning
Scaling

4 Session 4: Additional Approaches to Measurement


Difficulties with Trends
TextReuse
Readability
A few more papers

Stewart (Princeton) Text as Data June 28-29, 2018 57 / 187


What is clustering?

Some basic terminology:

1 Clustering: grouping like units into partitions
2 Unsupervised Learning: learning without using labelled data
3 Topic Models: usually the application of clustering/mixed-membership techniques to documents for determining their subject matter
4 We are both creating the categories and categorizing the documents at the same time.

Stewart (Princeton) Text as Data June 28-29, 2018 58 / 187


Clustering

Fully Automated Clustering

1) Distance metric: when are documents close?
2) Objective function: how do we summarize distances?
3) Optimization method: how do we find the optimal clustering?
THERE IS NO A PRIORI OPTIMAL METHOD
Computer Assisted Clustering (Grimmer and King, 2011)
- crucial to combine human and computer insights

Stewart (Princeton) Text as Data June 28-29, 2018 59 / 187
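
The three choices above can be read directly off a standard implementation. A minimal sketch (my illustration) using scikit-learn: tf-idf vectors with Euclidean geometry fix the distance metric, KMeans's within-cluster sum of squares is the objective function, and Lloyd's algorithm with random restarts is the optimization method.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["tax cuts and spending", "tax policy and rates",
        "protests in the square", "crowds and protest police"]

X = TfidfVectorizer().fit_transform(docs)             # 1) representation + distance
km = KMeans(n_clusters=2, n_init=20, random_state=0)  # 2) objective: within-cluster SSE
km.fit(X)                                             # 3) optimizer: Lloyd + restarts
print(km.labels_, km.inertia_)
```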


Clustering as Discovery

- When we analyze texts (data) we have some idea how to organize them
- How do we formulate new ways to organize texts?
- Clustering methods suggest new (model- and data-driven) ways to organize texts
- Using a new method means a new lens to look at social interaction

Stewart (Princeton) Text as Data June 28-29, 2018 60 / 187


Not just for “big data”
Manually develop a categorization scheme for partitioning a small (100) set of documents
- Bell(n) = number of ways of partitioning n objects
- Bell(2) = 2 (AB, A B)
- Bell(3) = 5 (ABC, AB C, A BC, AC B, A B C)
- Bell(5) = 52
- Bell(100) ≈ 4.75 × 10^115 partitions
- Big number: 7 billion RAs, each enumerating one clustering every millisecond, working around the clock (24/7/365), would need ≈ 1.54 × 10^84 × (14,000,000,000) years
Machine learning methods can help with even small problems

Stewart (Princeton) Text as Data June 28-29, 2018 61 / 187
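
The Bell numbers quoted above are quick to check with the Bell triangle recurrence. A minimal sketch:

```python
def bell(n):
    # Bell triangle: each row starts with the previous row's last entry,
    # and each subsequent entry adds the entry above-left; Bell(n) is the
    # last entry of row n.
    row = [1]
    for _ in range(n - 1):
        nxt = [row[-1]]
        for x in row:
            nxt.append(nxt[-1] + x)
        row = nxt
    return row[-1]

print(bell(2), bell(3), bell(5))        # 2 5 52
print(f"Bell(100) = {bell(100):.2e}")   # ~4.76e+115
```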
Estimating Clustering: Data and Assumptions

Steps common across Fully Automated Clustering methods:

- Assume similarity/dissimilarity between objects (some methods assume this implicitly) ← or a probability distribution
- Define an objective function ← also given by a probability distribution
- Use an approximate inference/optimization algorithm to identify the optimal solution ← a huge search space, a very difficult (and interesting!) problem, only hinted at here

Stewart (Princeton) Text as Data June 28-29, 2018 62 / 187


An Example FAC Method

K-means: the most commonly used clustering algorithm.

Story: Data are grouped in K clusters and each cluster has a center or mean.
→ Two types of parameters to estimate:
1) For each document i and cluster j (j = 1, . . . , K):
r_ij = indicator that document i is assigned to cluster j
2) For each cluster j:
µ_j = a cluster center for cluster j

Stewart (Princeton) Text as Data June 28-29, 2018 63 / 187
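
As a concrete sketch (standard k-means, not workshop-specific code): scikit-learn's KMeans estimates exactly these two parameter sets, the assignments r_ij (exposed as integer labels) and the centers µ_j, by minimizing the within-cluster sum of squared distances Σ_i Σ_j r_ij ||x_i − µ_j||².

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)),    # toy points near (0, 0)
               rng.normal(5, 1, (50, 2))])   # toy points near (5, 5)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_[:5], km.labels_[-5:])  # the assignments r_ij, as labels
print(km.cluster_centers_)              # the cluster means mu_j
```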


How Do We Choose K?
- Most methods assume we know the number of clusters.
- How do we choose?
- Cannot compare the objective function across different K
- Sum of squared errors decreases as K increases
- Trivial answer: each document in its own cluster (useless)
- Modeling problem: fit often increases with features
- How do we choose the number of clusters?

Think!
- No one statistic captures how you want to use your data
- But statistics can help guide your selection
- Combination of statistics + manual search (statistical and experimental methods discussed below)
- Humans should be the final judge
- Compare insights across clusterings

Stewart (Princeton) Text as Data June 28-29, 2018 64 / 187
How Do We Choose K? Summary
Generate many candidate models, then:
1) Assess model fit using surrogate statistics
2) Use experiments
3) Read
4) Combine the above for a final decision

Stewart (Princeton) Text as Data June 28-29, 2018 65 / 187
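
A minimal sketch of step 1 (surrogate fit statistics; my illustration): compute the k-means objective across candidate K. It always decreases as K grows, so look for where the decrease flattens, then read documents before settling on K.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(m, 0.5, (40, 2)) for m in (0, 4, 8)])  # 3 true groups

for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))  # the drop flattens sharply after k = 3
```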


Fully Automated Clustering

Notion of similarity and fit → clustering

Generally there is both a probabilistic and an algorithmic interpretation, so use whichever is easier for you.
How do we know we have something useful? Validate, Validate, Validate
How do we know we have the right model? YOU DON'T! And you never will, but clustering is still useful for discovery (and measurement)

Stewart (Princeton) Text as Data June 28-29, 2018 66 / 187


An Overview of Clustering Models

There are a lot of different clustering models (and many variations within each): k-means, mixture of multinomials, k-medoids, affinity propagation, agglomerative hierarchical, fuzzy k-means, trimmed k-means, k-harmonic means, fuzzy k-medoids, fuzzy k-modes, maximum entropy clustering, model-based hierarchical (agglomerative), proximus, ROCK, divisive hierarchical, DISMEA, fuzzy, QTClust, self-organizing map, self-organizing tree, unnormalized spectral, MS spectral, NJW spectral, SM spectral, Dirichlet process multinomial, Dirichlet process normal, Dirichlet process von Mises-Fisher, mixture of von Mises-Fisher (EM), mixture of von Mises-Fisher (VA), mixture of normals, co-clustering mutual information, co-clustering SVD, LLAhclust, CLUES, bclust, c-shell, qtClustering, LDA, Express Agenda Model, hierarchical Dirichlet process prior multinomial, uniform process multinomial, Chinese restaurant distance Dirichlet process multinomial, Pitman-Yor process multinomial, LSA, ...
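Many of these are available off the shelf. As a minimal sketch (assuming scikit-learn; the corpus and variable names are illustrative, not from the slides), here is the first entry, k-means, run on a tiny document-term matrix:

    # Minimal k-means sketch on a tiny hypothetical corpus (illustrative only).
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["tax cuts for families", "troops in iraq",
            "cut taxes now", "the war in iraq"]

    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(docs)           # documents-by-terms matrix
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(km.labels_)                     # one cluster label per document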
The Problem with Fully Automated Clustering (Grimmer and King 2011)

- Large quantitative literature on cluster analysis
- The goal, an optimal application-independent cluster analysis method, is mathematically impossible:
  - No free lunch theorem: every possible clustering method performs equally well on average over all possible substantive applications
- Existing methods:
  - Many choices
  - Well-defined statistical, data-analytic, or machine learning foundations
  - How to add substantive knowledge: unclear
- The literature: little guidance on when methods apply
  - Deriving such guidance: difficult or impossible

Deep problem in the cluster analysis literature: full automation requires more information
Fully Automated → Computer Assisted (Grimmer and King 2011)

- Fully automated clustering may succeed, but it fails in general: too hard to know when to apply models
- An alternative: computer-assisted clustering
  - Easy (if you don't think about it): list all clusterings, choose the best
  - Impossible in practice
  - Solution: an organized list
- Insight: many clusterings are perceptually identical
  - Consider two clusterings of 10,000 documents where we move one document from cluster 5 to cluster 6.
- How to organize clusterings so humans can understand? Answer: a geography of clusterings


A New Strategy (Grimmer and King 2011)

1) Code text as numbers (in one or more of several ways)
2) Apply many different clustering methods to the data, each representing different (unstated) substantive assumptions
   - Introduce sampling methods to extend the search beyond existing methods
3) Develop a metric between clusterings (a minimal sketch follows below)
4) Create a metric space of clusterings, and a 2-D projection
5) Introduce the local cluster ensemble to summarize any point, including points with no existing clustering
   - New clustering: a weighted average of clusterings from the methods
6) Use animated visualization: use the local cluster ensemble to explore the space of clusterings (smoothly morphing from one into others)
7) Millions of clusterings easily comprehended
8) (Or, our new strategy: represent the entire Bell space directly; no need to examine document contents)
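Step 3 needs a distance between whole clusterings. The paper develops its own; as a hedged stand-in, the sketch below computes the variation of information, a standard distance between two partitions that is zero exactly when they induce the same grouping (the function name and toy labels are illustrative):

    # Variation of information between two clusterings of the same documents.
    import numpy as np
    from sklearn.metrics import mutual_info_score

    def variation_of_information(a, b):
        a, b = np.asarray(a), np.asarray(b)
        def entropy(labels):
            p = np.bincount(labels) / len(labels)
            p = p[p > 0]
            return -(p * np.log(p)).sum()
        # VI(a, b) = H(a) + H(b) - 2 I(a, b)
        return entropy(a) + entropy(b) - 2 * mutual_info_score(a, b)

    print(variation_of_information([0, 0, 1, 1], [1, 1, 0, 0]))  # 0.0: same partition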
Interpreting Cluster Components

Unsupervised methods: low startup costs, high post-model costs

- Apply clustering methods and we have groups of documents
- How to interpret the groups?
- Two (broad) methods:
  - Manual identification
    - Sample a set of documents from the same cluster
    - Read the documents
    - Assign a cluster label
  - Automatic identification (a minimal sketch follows below)
    - Know the label classes
    - Use methods to identify separating words
    - Use these to help infer differences across clusters
- Transparency
  - Debate what the clusters are
  - Debate what they mean
  - Provide documents + organizations
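For the automatic route, one simple separating-words heuristic (a sketch reusing the hypothetical vec, X, and km objects from the k-means sketch above, not the slides' specific method) ranks each cluster's words by how far the cluster's mean tf-idf weight sits above the corpus-wide mean:

    # Label clusters by their most over-represented terms (illustrative heuristic).
    import numpy as np

    terms = np.array(vec.get_feature_names_out())
    overall = np.asarray(X.mean(axis=0)).ravel()      # corpus-wide mean weight
    for k in range(km.n_clusters):
        cluster_mean = np.asarray(X[km.labels_ == k].mean(axis=0)).ravel()
        top = np.argsort(cluster_mean - overall)[::-1][:5]
        print(k, terms[top])                          # candidate label words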
Distinguishing Words

- Basic task: given a partition, find words that summarize the partition
- This is inherently a vague enterprise because there are at least two components:
  - words that are distinctive between clusters
  - words that are representative of the cluster
- Informally, imagine you read the word list and then read the documents; we are trying to minimize your surprise about the contents.
- A good first step is minimizing the information-theoretic version of surprise: mutual information
- MI(x, y) is the reduction in uncertainty about x having been told y


A Method for Identifying Distinguishing Words

Mutual Information

- Unconditional uncertainty (entropy):
  - Randomly sample a movie review
  - Guess positive or negative (without looking at it)
  - Uncertainty about the guess:
    - Maximum when no. positive = no. negative
    - Minimum when all documents are in one category
- Conditional uncertainty given X_j (conditional entropy):
  - Condition on the presence of word X_j
  - Randomly sample a movie review
  - Guess positive or negative (looking only at the presence of word j)
  - Word presence reduces uncertainty:
    - Maximum when unrelated: conditional uncertainty = uncertainty
    - Minimum when a perfect predictor: conditional uncertainty = 0
- Mutual information(X_j): uncertainty − conditional uncertainty
  - Maximum when conditional uncertainty = 0
  - Minimum when conditional uncertainty = uncertainty
A Method for Identifying Distinguishing Words

- Define Mutual Information(X_j) as

    Mutual Information(X_j) = H(Doc) − H(Doc | X_j)

- Maximum: H(Doc), attained when H(Doc | X_j) = 0
- Minimum: 0, attained when H(Doc | X_j) = H(Doc)
- Bigger mutual information ⇒ better discrimination
- Objective function and optimization: estimate the probabilities that we then place into the mutual information formula
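A minimal sketch of this quantity using scikit-learn's mutual_info_score; the two 0/1 vectors are hypothetical stand-ins for word presence and document class:

    # MI between a word's presence and the document class (illustrative data).
    import numpy as np
    from sklearn.metrics import mutual_info_score

    present = np.array([1, 1, 0, 0, 1, 0])    # does word j appear in each document?
    label   = np.array([1, 1, 1, 0, 0, 0])    # e.g., 1 = positive review
    print(mutual_info_score(label, present))  # larger = better discriminator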


Fightin’ Words: An Introduction to Regularization

Monroe, Colaresi, and Quinn (2009): what makes a word partisan? They argue for using the log odds ratio, weighted by its variance.

Recall: for events E and F,

    P(E) = 1 − P(E^c)

    Odds(E) = P(E) / (1 − P(E))

    Odds Ratio(E, F) = [P(E) / (1 − P(E))] / [P(F) / (1 − P(F))]

    Log Odds Ratio(E, F) = log[P(E) / (1 − P(E))] − log[P(F) / (1 − P(F))]

Strategy: construct the objective function on proportions (and then calculate log odds).
Objective Function

Suppose we're interested in how a word separates partisan speech.

    Y = (Republican, Republican, Democrat, ..., Republican)
    X = unnormalized N × J matrix of word counts

Define

    x_Republican = ( Σ_{i=1}^{N} I(Y_i = Republican) X_{i1}, Σ_{i=1}^{N} I(Y_i = Republican) X_{i2}, ..., Σ_{i=1}^{N} I(Y_i = Republican) X_{iJ} )

with N_Republican = the total number of Republican words.


Calculating Log Odds Ratio

Define log Odds Ratio_j as

    log Odds Ratio_j = log[π_{Republican,j} / (1 − π_{Republican,j})] − log[π_{Democratic,j} / (1 − π_{Democratic,j})]

    Var(log Odds Ratio_j) ≈ 1 / (x_{jD} + α_j) + 1 / (x_{jR} + α_j)

    Std. Log Odds_j = log Odds Ratio_j / sqrt(Var(log Odds Ratio_j))
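A hedged sketch of the three quantities above, with a small symmetric prior alpha supplying the smoothing (the count vectors and function name are hypothetical):

    # Variance-weighted log odds ratios for each word (illustrative).
    import numpy as np

    def std_log_odds(x_rep, x_dem, alpha=0.01):
        x_rep = np.asarray(x_rep, dtype=float)
        x_dem = np.asarray(x_dem, dtype=float)
        pi_rep = (x_rep + alpha) / (x_rep.sum() + alpha * len(x_rep))
        pi_dem = (x_dem + alpha) / (x_dem.sum() + alpha * len(x_dem))
        lor = np.log(pi_rep / (1 - pi_rep)) - np.log(pi_dem / (1 - pi_dem))
        var = 1.0 / (x_rep + alpha) + 1.0 / (x_dem + alpha)
        return lor / np.sqrt(var)   # positive = Republican-leaning word

    print(std_log_odds([20, 5, 1], [5, 20, 1]))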


Applying the Model

How do Republicans and Democrats differ in debate? Condition on topic and examine word usage.

- Press releases (64,033)
- Topic coded (with the Structural Topic Model)
- Given that a press release is about a topic, what features distinguish Republican and Democratic language?


Mutual Information, Standardized Log Odds

[Figure: Iraq War, partisan words. Each word in the vocabulary is plotted with its standardized log odds on the x-axis and its mutual information on the y-axis; high-MI words include "republican", "strategi", and "freedom".]
Mutual Information, Standardized Log Odds

[Figure: Gas Prices, partisan words. Same layout: standardized log odds on the x-axis against mutual information on the y-axis; high-MI words include "compani", "profit", and "skyrocket".]
Fictitious Prediction Problem

Identify features that discriminate between groups in order to learn which features are indicative of some group: a fictitious prediction problem.

Core idea: turn labeling into an explicit supervised learning problem. Sometimes this works, but be sure to think through the philosophical implications.


Taddy (2013) Multinomial Inverse Regression

- Interested in classifying documents into some category, for example the political party of a floor speaker
- Denote this desired outcome as Y and the word features for the document as X
- In classification we're generally interested in:

    E[Y | X] = f(X_1, X_2, ..., X_J)

- Problem: J might be very, very big.
- Potential solution: invert the regression,

    E[X | Y] = g(Y)

- Now we have a generative model for X, and we need to make a distributional assumption for X | Y
Multinomial Inverse Regression

As before, take x_Republican to be the Republican count vector.

    x_Republican ∼ Multinomial(N_Republican, π_Republican)

    π_{Republican,j} = exp[α_j + I(Republican) φ_j] / Σ_{l=1}^{J} exp[α_l + I(Republican) φ_l]

    φ_j ∼ Laplace(λ_j)

    λ_j ∼ Gamma(s, r)

- Laplace priors regularize, or shrink estimates toward zero
- Laplace priors are equivalent to L1 or lasso penalization
- The Gamma hyperprior on λ_j gives the Gamma-Lasso prior
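Taddy's model is fit by his R package textir; lacking a drop-in Python equivalent, the sketch below only illustrates the Laplace/L1 idea with scikit-learn's lasso-penalized logistic regression run in the discriminative direction (party on counts), which likewise drives most coefficients to exactly zero. All data here are simulated:

    # L1 penalization zeroes out most word coefficients (not Taddy's MNIR itself).
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.poisson(1.0, size=(200, 50))    # hypothetical doc-term counts
    y = (X[:, 0] > X[:, 1]).astype(int)     # hypothetical party labels
    fit = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
    print((fit.coef_ != 0).sum(), "of", X.shape[1], "coefficients survive")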


What Does This Do in Practice?

Stewart (Princeton) Text as Data June 28-29, 2018 83 / 187


What Does This Do in Practice?

For each word in the vocabulary check if there is a meaningful


difference in the rate at which it is used by republicans vs.
democrats

Stewart (Princeton) Text as Data June 28-29, 2018 83 / 187


What Does This Do in Practice?

For each word in the vocabulary check if there is a meaningful


difference in the rate at which it is used by republicans vs.
democrats
Zero out many estimates of those differences due to the sparsity
inducing prior

Stewart (Princeton) Text as Data June 28-29, 2018 83 / 187


What Does This Do in Practice?

For each word in the vocabulary check if there is a meaningful


difference in the rate at which it is used by republicans vs.
democrats
Zero out many estimates of those differences due to the sparsity
inducing prior
Create a projection which contains all (linear) information
contained in the words about the categorization, this is a
sufficient reduction

Stewart (Princeton) Text as Data June 28-29, 2018 83 / 187


What Does This Do in Practice?

For each word in the vocabulary, check whether there is a meaningful difference in the rate at which it is used by Republicans vs. Democrats
Zero out many estimates of those differences due to the sparsity-inducing prior
Create a projection that contains all the (linear) information in the words about the categorization; this is a sufficient reduction
Estimate a forward regression using this one-dimensional object.

Stewart (Princeton) Text as Data June 28-29, 2018 83 / 187
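
A minimal sketch of these steps in Python, assuming the loadings φ have already been estimated (the φ values, data, and names below are made up for illustration; this is not Taddy's textir implementation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n_docs, J = 500, 200
party = rng.integers(0, 2, n_docs)            # 1 = Republican
phi_hat = np.zeros(J)
phi_hat[:10] = 0.8                            # sparse estimated loadings (assumed)

# fake document-term counts in which Republicans over-use the signal words
rates = np.exp(rng.normal(0.0, 1.0, (n_docs, J)) + np.outer(party, phi_hat))
X = rng.poisson(rates)

m = X.sum(axis=1)                             # document lengths m_i
z = (X @ phi_hat) / m                         # sufficient reduction: z_i = phi' x_i / m_i
forward = LogisticRegression().fit(z.reshape(-1, 1), party)  # forward regression on z alone
```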


Generative vs. Discriminative Models

The first move in Taddy's approach is to give a data generating process for words
Why does this help?
Consider your basic regression model: we don't have to come up with a data generating process for X, we simply condition on it
This allows X to have an arbitrary covariance structure, but it throws away potential information
MNIR uses a data generating process that assumes a particular covariance structure
When the assumption holds, this makes the model more efficient

Stewart (Princeton) Text as Data June 28-29, 2018 84 / 187


Generative vs. Discriminative Models

The efficiency gain increases dramatically with the number of columns in X.
In particular, one can show that the variance scales with the number of words in MNIR, but with the number of documents in the original forward regression problem
Even though the assumptions don't necessarily hold in practice, we can correct for some misspecification through the forward regression.
As a side note: the generative model makes scalable computation much easier, since with a few assumptions you can decouple computation across words.

Stewart (Princeton) Text as Data June 28-29, 2018 85 / 187
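
A sketch of the decoupling idea: conditioning on document totals lets each column of the document-term matrix be treated as (approximately) its own penalized Poisson regression, which is embarrassingly parallel across words. This mirrors the spirit of Taddy's distributed estimation rather than reproducing it; the ridge penalty below is a stand-in for the gamma-lasso, and all data are simulated.

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(2)
n_docs, J = 400, 50
party = rng.integers(0, 2, (n_docs, 1)).astype(float)   # covariate: 1 = Republican
X = rng.poisson(np.exp(0.5 * party + rng.normal(0.0, 0.3, (n_docs, J))))

# one independent penalized regression per word -- no shared state,
# so each column could run on a different core or machine
phi_hat = np.array([
    PoissonRegressor(alpha=1.0).fit(party, X[:, j]).coef_[0]
    for j in range(J)
])
```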


1 Session 1: Getting Started with Text in Social Science
What Text Methods Can Do
Core Concepts and Principles
Represent
Example: Understanding Chinese Censorship

2 Session 2: Discover
Clustering
Interpretation
Mixed Membership Models
Example: Discovery in Congressional Communication

3 Session 3: Measure
Choosing a Model
Example: Party Manifestoes in Japan
Revisiting K
Structural Topic Model
Supervised Learning
Scaling

4 Session 4: Additional Approaches to Measurement


Difficulties with Trends
Text Reuse
Readability
A few more papers

Stewart (Princeton) Text as Data June 28-29, 2018 86 / 187


Latent Dirichlet Allocation

Idea: documents exhibit each topic in some proportion. This is an admixture.
Each document is a mixture over topics. Each topic is a mixture over words.

Latent Dirichlet Allocation estimates:
I The distribution over words for each topic.
I The proportion of a document in each topic, for each document.

Maintained assumptions: bag of words / fixed number of topics ex ante.

Stewart (Princeton) Text as Data June 28-29, 2018 87 / 187
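
A minimal fitting sketch with scikit-learn (the toy documents and K = 2 are assumptions for illustration; the workshop code may use a different implementation). It recovers exactly the two sets of estimates named above:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["tax cuts grow the economy", "jobs and the economy",
        "the war in iraq", "a vote on iraq war funding"]
X = CountVectorizer().fit_transform(docs)        # bag-of-words counts

lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(X)                     # per-document topic proportions (Theta)
beta = lda.components_ / lda.components_.sum(axis=1, keepdims=True)  # topic-word distributions (beta)
```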


This is a Bayesian Model

Figure: Plate Notation of Latent Dirichlet Allocation

Graphic from David Blei’s Website

Stewart (Princeton) Text as Data June 28-29, 2018 88 / 187


“Vanilla” Latent Dirichlet Allocation
1) Task:
- Discover thematic content of documents
- Quickly explore documents
2) Objective Function
f (W , β, Θ, α)
Where:
- Θ = N × K matrix with row θi = (θi1 , θi2 , . . . , θiK ), the
proportion of document i allocated to each topic
- β = K × J matrix with row βk = (βk1 , βk2 , . . . , βkJ ), the topics
- α = K-element vector, the population prior for Θ.
3) Optimization
- Variational Inference: deterministic approximation
- Collapsed Gibbs Sampling: MCMC algorithm
- Spectral/Factorization Methods
4) Validation: application-specific
Stewart (Princeton) Text as Data June 28-29, 2018 89 / 187
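
Of the optimization strategies listed, collapsed Gibbs sampling is the simplest to write down. A toy, unoptimized sketch (function and variable names are ours; real implementations add hyperparameter tuning and far better engineering):

```python
import numpy as np

def gibbs_lda(W, K, V, alpha=0.1, eta=0.01, iters=200, seed=0):
    """W: list of documents, each a list of word ids in [0, V)."""
    rng = np.random.default_rng(seed)
    z = [rng.integers(0, K, len(doc)) for doc in W]         # topic of every token
    ndk = np.zeros((len(W), K)); nkv = np.zeros((K, V)); nk = np.zeros(K)
    for d, doc in enumerate(W):                             # initialize count tables
        for i, v in enumerate(doc):
            k = z[d][i]; ndk[d, k] += 1; nkv[k, v] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(W):
            for i, v in enumerate(doc):
                k = z[d][i]                                 # remove this token's counts
                ndk[d, k] -= 1; nkv[k, v] -= 1; nk[k] -= 1
                p = (ndk[d] + alpha) * (nkv[:, v] + eta) / (nk + V * eta)
                k = rng.choice(K, p=p / p.sum())            # resample from full conditional
                z[d][i] = k; ndk[d, k] += 1; nkv[k, v] += 1; nk[k] += 1
    return ndk, nkv                                         # Theta and beta, up to normalization
```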
A Statistical Highlighter (With Many Colors)

Image from Hanna Wallach

Stewart (Princeton) Text as Data June 28-29, 2018 90 / 187


Why Does This Work? Co-occurrence
Where's the information for each word's topic?
Reconsider the document-term matrix:

       Word1  Word2  ...  WordJ
Doc1     0      1    ...    0
Doc2     2      0    ...    3
...     ...    ...   ...   ...
DocN     0      1    ...    1

We are learning the pattern of which words occur together.

The model wants a topic to contain as few words as possible, but a document to contain as few topics as possible. This tension is what makes the model work.
Stewart (Princeton) Text as Data June 28-29, 2018 91 / 187
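
The raw material in its simplest form: from a document-term count matrix X, the product XᵀX tallies (count-weighted) how often pairs of words appear in the same documents. A tiny illustration, using the toy matrix above:

```python
import numpy as np

# rows = documents, columns = words
X = np.array([[0, 1, 0],
              [2, 0, 3],
              [0, 1, 1]])

cooc = X.T @ X        # entry (j, l): count-weighted co-occurrence of words j and l
print(cooc)
```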
1 Session 1: Getting Started with Text in Social Science
What Text Methods Can Do
Core Concepts and Principles
Represent
Example: Understanding Chinese Censorship

2 Session 2: Discover
Clustering
Interpretation
Mixed Membership Models
Example: Discovery in Congressional Communication

3 Session 3: Measure
Choosing a Model
Example: Party Manifestoes in Japan
Revisiting K
Structural Topic Model
Supervised Learning
Scaling

4 Session 4: Additional Approaches to Measurement


Difficulties with Trends
Text Reuse
Readability
A few more papers

Stewart (Princeton) Text as Data June 28-29, 2018 92 / 187


Example Discovery: Congressional Communication
to Constituents

- Paper (Grimmer and King 2011)

- David Mayhew’s (1974) famous typology


- Advertising
- Credit Claiming
- Position Taking
- Data: 200 press releases from Frank Lautenberg’s office (D-NJ)
- Apply method (relying on many clustering algorithms)

Stewart (Princeton) Text as Data June 28-29, 2018 93 / 187


Example Discovery

[Figure: a two-dimensional map of candidate clusterings of the press releases. Each point is a clustering produced by a different method — k-means, hierarchical clustering under many distance/linkage combinations, k-medoids, affinity propagation, spectral clustering, mixtures of von Mises-Fisher distributions, and others. The space between methods is defined by a local cluster ensemble.]

- Example point: Affinity Propagation-Cosine (Dueck and Frey 2007), which sits close to a Mixture of von Mises-Fisher distributions (Banerjee et al. 2005) ⇒ a similar clustering of documents.
- The search found a region whose clusterings all reveal the same important insight. One clustering there is a mixture: 0.39 Hclust-Canberra-McQuitty, 0.30 Spectral clustering Random Walk (Metrics 1-6), 0.13 Hclust-Correlation-Ward, 0.09 Hclust-Pearson-Ward, 0.05 Kmedoids-Cosine, 0.04 Spectral clustering Symmetric (Metrics 1-6).
- Clusters in this clustering map onto Mayhew's categories and beyond:
  - Credit Claiming, Pork: "Sens. Frank R. Lautenberg (D-NJ) and Robert Menendez (D-NJ) announced that the U.S. Department of Commerce has awarded a $100,000 grant to the South Jersey Economic Development District"
  - Credit Claiming, Legislation: "As the Senate begins its recess, Senator Frank Lautenberg today pointed to a string of victories in Congress on his legislative agenda during this work period"
  - Advertising: "Senate Adopts Lautenberg/Menendez Resolution Honoring Spelling Bee Champion from New Jersey"
  - Partisan Taunting: "Republicans Selling Out Nation on Chemical Plant Security"

Stewart (Princeton) Text as Data June 28-29, 2018 94 / 187
In Sample Illustration of Partisan Taunting
Important Concept Overlooked in Mayhew’s (1974) typology
Definition: Explicit, public, and negative attacks on another political party or its members
Consequences for representation: Deliberation, Polarization, Policy

- “Senator Lautenberg Blasts Republicans as ‘Chicken Hawks’ ” [Government Oversight]
- “Every day the House Republicans dragged this out was a day that made our communities less safe.” [Homeland Security]

Sen. Lautenberg on Senate Floor, 4/29/04
Stewart (Princeton) Text as Data June 28-29, 2018 95 / 187
Out of Sample Confirmation of Partisan Taunting
- Discovered using 200 press releases; 1 senator.

Stewart (Princeton) Text as Data June 28-29, 2018 96 / 187


Out of Sample Confirmation of Partisan Taunting
- Discovered using 200 press releases; 1 senator.
- Demonstrate prevalence using senators’ press releases.

Stewart (Princeton) Text as Data June 28-29, 2018 96 / 187


Out of Sample Confirmation of Partisan Taunting
- Discovered using 200 press releases; 1 senator.
- Demonstrate prevalence using senators’ press releases.
- Apply supervised learning method: measure proportion of press
releases a senator taunts other party

Stewart (Princeton) Text as Data June 28-29, 2018 96 / 187


Out of Sample Confirmation of Partisan Taunting
- Discovered using 200 press releases; 1 senator.
- Demonstrate prevalence using senators’ press releases.
- Apply supervised learning method: measure proportion of press
releases a senator taunts other party
[Figure: histogram of senators; x-axis: Prop. of Press Releases Taunting (0.1 to 0.5); y-axis: Frequency.]


Stewart (Princeton) Text as Data June 28-29, 2018 97 / 187
Out of Sample Confirmation of Partisan Taunting
- Discovered using 200 press releases; 1 senator.
- Demonstrate prevalence using senators’ press releases.
- Apply supervised learning method: measure proportion of press
releases a senator taunts other party
On Avg., Senators Taunt in 27% of Press Releases
[Figure: histogram of senators; x-axis: Prop. of Press Releases Taunting (0.1 to 0.5); y-axis: Frequency.]


Stewart (Princeton) Text as Data June 28-29, 2018 97 / 187
Over Time Taunting Rates in Speeches
[Figure: Proportion Speeches Taunting (0.00 to 0.15) for GOP and DEM, by Year, 1990 to 2005.]

Stewart (Princeton) Text as Data June 28-29, 2018 98 / 187


Q&A and Code

Stewart (Princeton) Text as Data June 28-29, 2018 99 / 187


1 Session 1: Getting Started with Text in Social Science
What Text Methods Can Do
Core Concepts and Principles
Represent
Example: Understanding Chinese Censorship

2 Session 2: Discover
Clustering
Interpretation
Mixed Membership Models
Example: Discovery in Congressional Communication

3 Session 3: Measure
Choosing a Model
Example: Party Manifestoes in Japan
Revisiting K
Structural Topic Model
Supervised Learning
Scaling

4 Session 4: Additional Approaches to Measurement


Difficulties with Trends
TextReuse
Readability
A few more papers

Stewart (Princeton) Text as Data June 28-29, 2018 100 / 187


Organization by Tasks not Techniques

Discovery Measurement Inference

Stewart (Princeton) Text as Data June 28-29, 2018 101 / 187


Measurement
Once you identify the concepts, you need to pin them down.
Measures should have a clear scope or purpose.

Stewart (Princeton) Text as Data June 28-29, 2018 102 / 187


Measurement
Once you identify the concepts, you need to pin them down.
Measures should have a clear scope or purpose.
I how generalizable is the measure?

Stewart (Princeton) Text as Data June 28-29, 2018 102 / 187


Measurement
Once you identify the concepts, you need to pin them down.
Measures should have a clear scope or purpose.
I how generalizable is the measure?
I what place does it have in the theory?

Stewart (Princeton) Text as Data June 28-29, 2018 102 / 187


Measurement
Once you identify the concepts, you need to pin them down.
Measures should have a clear scope or purpose.
I how generalizable is the measure?
I what place does it have in the theory?
Source material is identified and ideally made public.

Stewart (Princeton) Text as Data June 28-29, 2018 102 / 187


Measurement
Once you identify the concepts, you need to pin them down.
Measures should have a clear scope or purpose.
I how generalizable is the measure?
I what place does it have in the theory?
Source material is identified and ideally made public.
I can be difficult with texts

Stewart (Princeton) Text as Data June 28-29, 2018 102 / 187


Measurement
Once you identify the concepts, you need to pin them down.
Measures should have a clear scope or purpose.
I how generalizable is the measure?
I what place does it have in the theory?
Source material is identified and ideally made public.
I can be difficult with texts
I document the way and time periods you got the data

Stewart (Princeton) Text as Data June 28-29, 2018 102 / 187


Measurement
Once you identify the concepts, you need to pin them down.
Measures should have a clear scope or purpose.
I how generalizable is the measure?
I what place does it have in the theory?
Source material is identified and ideally made public.
I can be difficult with texts
I document the way and time periods you got the data
Coding process is explainable and replicable.

Stewart (Princeton) Text as Data June 28-29, 2018 102 / 187


Measurement
Once you identify the concepts, you need to pin them down.
Measures should have a clear scope or purpose.
I how generalizable is the measure?
I what place does it have in the theory?
Source material is identified and ideally made public.
I can be difficult with texts
I document the way and time periods you got the data
Coding process is explainable and replicable.
I inter-coder reliability

Stewart (Princeton) Text as Data June 28-29, 2018 102 / 187


Measurement
Once you identify the concepts, you need to pin them down.
Measures should have a clear scope or purpose.
I how generalizable is the measure?
I what place does it have in the theory?
Source material is identified and ideally made public.
I can be difficult with texts
I document the way and time periods you got the data
Coding process is explainable and replicable.
I inter-coder reliability
The measure is validated and reliable.

Stewart (Princeton) Text as Data June 28-29, 2018 102 / 187


Measurement
Once you identify the concepts, you need to pin them down.
Measures should have a clear scope or purpose.
I how generalizable is the measure?
I what place does it have in the theory?
Source material is identified and ideally made public.
I can be difficult with texts
I document the way and time periods you got the data
Coding process is explainable and replicable.
I inter-coder reliability
The measure is validated and reliable.
I Can you do this again with the same accuracy?

Stewart (Princeton) Text as Data June 28-29, 2018 102 / 187


Measurement
Once you identify the concepts, you need to pin them down.
Measures should have a clear scope or purpose.
I how generalizable is the measure?
I what place does it have in the theory?
Source material is identified and ideally made public.
I can be difficult with texts
I document the way and time periods you got the data
Coding process is explainable and replicable.
I inter-coder reliability
The measure is validated and reliable.
I Can you do this again with the same accuracy?
Limitations are explored, documented, and communicated.

Stewart (Princeton) Text as Data June 28-29, 2018 102 / 187
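
One way to check the inter-coder reliability mentioned above is Krippendorff's alpha. A minimal sketch using the irr package (the coder matrix here is hypothetical; rows are coders, columns are documents):

library(irr)
codes <- rbind(coder1 = c(1, 2, 2, 3, 1, 2),   # hypothetical category labels
               coder2 = c(1, 2, 3, 3, 1, 2))
kripp.alpha(codes, method = "nominal")  # alpha near 1 indicates high reliability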


Two (non-exclusive or exhaustive) Approaches

Stewart (Princeton) Text as Data June 28-29, 2018 103 / 187


Two (non-exclusive or exhaustive) Approaches
Clustering and Topic Models:
- Repurpose models for discovery

Stewart (Princeton) Text as Data June 28-29, 2018 103 / 187


Two (non-exclusive or exhaustive) Approaches
Clustering and Topic Models:
- Repurpose models for discovery
- Infer categories

Stewart (Princeton) Text as Data June 28-29, 2018 103 / 187


Two (non-exclusive or exhaustive) Approaches
Clustering and Topic Models:
- Repurpose models for discovery
- Infer categories
- Infer document assignment to categories

Stewart (Princeton) Text as Data June 28-29, 2018 103 / 187


Two (non-exclusive or exhaustive) Approaches
Clustering and Topic Models:
- Repurpose models for discovery
- Infer categories
- Infer document assignment to categories
- Pre-estimation: relatively little work

Stewart (Princeton) Text as Data June 28-29, 2018 103 / 187


Two (non-exclusive or exhaustive) Approaches
Clustering and Topic Models:
- Repurpose models for discovery
- Infer categories
- Infer document assignment to categories
- Pre-estimation: relatively little work
- Post-estimation: extensive validation testing

Stewart (Princeton) Text as Data June 28-29, 2018 103 / 187


Two (non-exclusive or exhaustive) Approaches
Clustering and Topic Models:
- Repurpose models for discovery
- Infer categories
- Infer document assignment to categories
- Pre-estimation: relatively little work
- Post-estimation: extensive validation testing
Supervised Methods:

Stewart (Princeton) Text as Data June 28-29, 2018 103 / 187


Two (non-exclusive or exhaustive) Approaches
Clustering and Topic Models:
- Repurpose models for discovery
- Infer categories
- Infer document assignment to categories
- Pre-estimation: relatively little work
- Post-estimation: extensive validation testing
Supervised Methods:
- Use an existing classification scheme and models for categorizing
texts

Stewart (Princeton) Text as Data June 28-29, 2018 103 / 187


Two (non-exclusive or exhaustive) Approaches
Clustering and Topic Models:
- Repurpose models for discovery
- Infer categories
- Infer document assignment to categories
- Pre-estimation: relatively little work
- Post-estimation: extensive validation testing
Supervised Methods:
- Use an existing classification scheme and models for categorizing
texts
- Know (develop) categories beforehand

Stewart (Princeton) Text as Data June 28-29, 2018 103 / 187


Two (non-exclusive or exhaustive) Approaches
Clustering and Topic Models:
- Repurpose models for discovery
- Infer categories
- Infer document assignment to categories
- Pre-estimation: relatively little work
- Post-estimation: extensive validation testing
Supervised Methods:
- Use an existing classification scheme and models for categorizing
texts
- Know (develop) categories beforehand
- Hand coding: assign documents to categories

Stewart (Princeton) Text as Data June 28-29, 2018 103 / 187


Two (non-exclusive or exhaustive) Approaches
Clustering and Topic Models:
- Repurpose models for discovery
- Infer categories
- Infer document assignment to categories
- Pre-estimation: relatively little work
- Post-estimation: extensive validation testing
Supervised Methods:
- Use an existing classification scheme and models for categorizing
texts
- Know (develop) categories beforehand
- Hand coding: assign documents to categories
- Infer: new document assignment to categories (distribution of
documents to categories)

Stewart (Princeton) Text as Data June 28-29, 2018 103 / 187


Two (non-exclusive or exhaustive) Approaches
Clustering and Topic Models:
- Repurpose models for discovery
- Infer categories
- Infer document assignment to categories
- Pre-estimation: relatively little work
- Post-estimation: extensive validation testing
Supervised Methods:
- Use an existing classification scheme and models for categorizing
texts
- Know (develop) categories beforehand
- Hand coding: assign documents to categories
- Infer: new document assignment to categories (distribution of
documents to categories)
- Pre-estimation: extensive work constructing categories, building
classifiers

Stewart (Princeton) Text as Data June 28-29, 2018 103 / 187


Two (non-exclusive or exhaustive) Approaches
Clustering and Topic Models:
- Repurpose models for discovery
- Infer categories
- Infer document assignment to categories
- Pre-estimation: relatively little work
- Post-estimation: extensive validation testing
Supervised Methods:
- Use an existing classification scheme and models for categorizing
texts
- Know (develop) categories beforehand
- Hand coding: assign documents to categories
- Infer: new document assignment to categories (distribution of
documents to categories)
- Pre-estimation: extensive work constructing categories, building
classifiers
- Post-estimation: relatively little work
Stewart (Princeton) Text as Data June 28-29, 2018 103 / 187
1 Session 1: Getting Started with Text in Social Science
What Text Methods Can Do
Core Concepts and Principles
Represent
Example: Understanding Chinese Censorship

2 Session 2: Discover
Clustering
Interpretation
Mixed Membership Models
Example: Discovery in Congressional Communication

3 Session 3: Measure
Choosing a Model
Example: Party Manifestoes in Japan
Revisiting K
Structural Topic Model
Supervised Learning
Scaling

4 Session 4: Additional Approaches to Measurement


Difficulties with Trends
TextReuse
Readability
A few more papers

Stewart (Princeton) Text as Data June 28-29, 2018 104 / 187


Desirable Properties of a Representation

Stewart (Princeton) Text as Data June 28-29, 2018 105 / 187


Desirable Properties of a Representation

1 Interpretable
can we clearly communicate the idea to the reader

Stewart (Princeton) Text as Data June 28-29, 2018 105 / 187


Desirable Properties of a Representation

1 Interpretable
can we clearly communicate the idea to the reader
2 Theoretical Interest
helps us advance a relevant argument

Stewart (Princeton) Text as Data June 28-29, 2018 105 / 187


Desirable Properties of a Representation

1 Interpretable
can we clearly communicate the idea to the reader
2 Theoretical Interest
helps us advance a relevant argument
3 Label Fidelity
minimal surprise when going from reading the label to reading
the documents

Stewart (Princeton) Text as Data June 28-29, 2018 105 / 187


Desirable Properties of a Representation

1 Interpretable
can we clearly communicate the idea to the reader
2 Theoretical Interest
helps us advance a relevant argument
3 Label Fidelity
minimal surprise when going from reading the label to reading
the documents
4 Tractable
computationally tractable model and enough samples to estimate

Stewart (Princeton) Text as Data June 28-29, 2018 105 / 187


Types of Representation

The biggest modeling choice is the form of the latent representation.

Stewart (Princeton) Text as Data June 28-29, 2018 106 / 187


Types of Representation

The biggest modeling choice is the form of the latent representation.


There are many options:

Stewart (Princeton) Text as Data June 28-29, 2018 106 / 187


Types of Representation

The biggest modeling choice is the form of the latent representation.


There are many options:
Categorical: one of K mutually exclusive
and exhaustive categories

Stewart (Princeton) Text as Data June 28-29, 2018 106 / 187


Types of Representation

The biggest modeling choice is the form of the latent representation.


There are many options:
Categorical: one of K mutually exclusive
and exhaustive categories
Mixed Membership: proportional member
of K topics

Stewart (Princeton) Text as Data June 28-29, 2018 106 / 187


Types of Representation

The biggest modeling choice is the form of the latent representation.


There are many options:
Categorical: one of K mutually exclusive
and exhaustive categories
Mixed Membership: proportional member
of K topics
Binary Features: K binary latent variables,
each of which could be on or off

Stewart (Princeton) Text as Data June 28-29, 2018 106 / 187


Types of Representation

The biggest modeling choice is the form of the latent representation.


There are many options:
Categorical: one of K mutually exclusive
and exhaustive categories
Mixed Membership: proportional member
of K topics
Binary Features: K binary latent variables,
each of which could be on or off
Scales: K continuous scales or positions

Stewart (Princeton) Text as Data June 28-29, 2018 106 / 187


1 Session 1: Getting Started with Text in Social Science
What Text Methods Can Do
Core Concepts and Principles
Represent
Example: Understanding Chinese Censorship

2 Session 2: Discover
Clustering
Interpretation
Mixed Membership Models
Example: Discovery in Congressional Communication

3 Session 3: Measure
Choosing a Model
Example: Party Manifestoes in Japan
Revisiting K
Structural Topic Model
Supervised Learning
Scaling

4 Session 4: Additional Approaches to Measurement


Difficulties with Trends
TextReuse
Readability
A few more papers

Stewart (Princeton) Text as Data June 28-29, 2018 107 / 187


Topic Models as Measurement

Stewart (Princeton) Text as Data June 28-29, 2018 108 / 187


Topic Models as Measurement

Topic models are algorithms for discovering the main
themes that pervade a large and otherwise unstructured
collection of documents. Topic models can organize the
collection according to the discovered themes.

Blei, 2012

Stewart (Princeton) Text as Data June 28-29, 2018 108 / 187


Topic Models as Measurement

Topic models are algorithms for discovering the main
themes that pervade a large and otherwise unstructured
collection of documents. Topic models can organize the
collection according to the discovered themes.

Blei, 2012

In social science we often try to use these outputs as an approach to measurement.

Stewart (Princeton) Text as Data June 28-29, 2018 108 / 187


Topic Models as Measurement

Topic models are algorithms for discovering the main
themes that pervade a large and otherwise unstructured
collection of documents. Topic models can organize the
collection according to the discovered themes.

Blei, 2012

In social science we often try to use these outputs as an approach to measurement.
But can we?

Stewart (Princeton) Text as Data June 28-29, 2018 108 / 187


Topic Models as Measurement

Topic models are algorithms for discovering the main
themes that pervade a large and otherwise unstructured
collection of documents. Topic models can organize the
collection according to the discovered themes.

Blei, 2012

In social science we often try to use these outputs as an approach to measurement.
But can we? Should we?

Stewart (Princeton) Text as Data June 28-29, 2018 108 / 187


Example: Japanese Campaign Manifestos
(Catalinac 2016)

- IR question: why is Japan now willing to engage in militaristic foreign action?

Stewart (Princeton) Text as Data June 28-29, 2018 109 / 187


Example: Japanese Campaign Manifestos
(Catalinac 2016)

- IR question: why is Japan now willing to engage in militaristic foreign action? Two stories:

Stewart (Princeton) Text as Data June 28-29, 2018 109 / 187


Example: Japanese Campaign Manifestos
(Catalinac 2016)

- IR question: why is Japan now willing to engage in militaristic foreign action? Two stories:
1 rise of China? (need to focus on defensive security)

Stewart (Princeton) Text as Data June 28-29, 2018 109 / 187


Example: Japanese Campaign Manifestos
(Catalinac 2016)

- IR question: why is Japan now willing to engage in militaristic foreign action? Two stories:
1 rise of China? (need to focus on defensive security)
2 1993 change in electoral system? Moving from pork to policy.

Stewart (Princeton) Text as Data June 28-29, 2018 109 / 187


Example: Japanese Campaign Manifestos
(Catalinac 2016)

- IR question: why is Japan now willing to engage in militaristic foreign action? Two stories:
1 rise of China? (need to focus on defensive security)
2 1993 change in electoral system? Moving from pork to policy.
- To answer well: characterize campaigns across 50 + years

Stewart (Princeton) Text as Data June 28-29, 2018 109 / 187


Example: Japanese Campaign Manifestos
(Catalinac 2016)

- IR question: why is Japan now willing to engage in militaristic foreign action? Two stories:
1 rise of China? (need to focus on defensive security)
2 1993 change in electoral system? Moving from pork to policy.
- To answer well: characterize campaigns across 50 + years
- That sounds hard

Stewart (Princeton) Text as Data June 28-29, 2018 109 / 187


Example: Japanese Campaign Manifestos
(Catalinac 2016)

- IR question: why is Japan now willing to engage in militaristic foreign action? Two stories:
1 rise of China? (need to focus on defensive security)
2 1993 change in electoral system? Moving from pork to policy.
- To answer well: characterize campaigns across 50 + years
- That sounds hard
- Determined (relentless) data collection

Stewart (Princeton) Text as Data June 28-29, 2018 109 / 187


Example: Japanese Campaign Manifestos
(Catalinac 2016)

- IR question: why is Japan now willing to engage in militaristic foreign action? Two stories:
1 rise of China? (need to focus on defensive security)
2 1993 change in electoral system? Moving from pork to policy.
- To answer well: characterize campaigns across 50 + years
- That sounds hard
- Determined (relentless) data collection
- Latent Dirichlet Allocation (on Japanese texts)

Stewart (Princeton) Text as Data June 28-29, 2018 109 / 187


Stewart (Princeton) Text as Data June 28-29, 2018 110 / 187
Example: Japanese Campaign Manifestos
(Catalinac 2016)

Japanese Elections:

Stewart (Princeton) Text as Data June 28-29, 2018 111 / 187


Example: Japanese Campaign Manifestos
(Catalinac 2016)

Japanese Elections:
- Election Administration Commission runs elections → district
level

Stewart (Princeton) Text as Data June 28-29, 2018 111 / 187


Example: Japanese Campaign Manifestos
(Catalinac 2016)

Japanese Elections:
- Election Administration Commission runs elections → district
level
- Required to submit manifestos for all candidates to the National Diet

Stewart (Princeton) Text as Data June 28-29, 2018 111 / 187


Example: Japanese Campaign Manifestos
(Catalinac 2016)

Typical Manifesto:

Stewart (Princeton) Text as Data June 28-29, 2018 111 / 187


Example: Japanese Campaign Manifestos
(Catalinac 2016)

Japanese Elections:
- Election Administration Commission runs elections → district
level
- Required to submit manifestos for all candidates to the National Diet
- Collected from 1950-2009

Stewart (Princeton) Text as Data June 28-29, 2018 111 / 187


Example: Japanese Campaign Manifestos
(Catalinac 2016)

Japanese Elections:
- Election Administration Commission runs elections → district
level
- Required to submit manifestos for all candidates to the National Diet
- Collected from 1950-2009
- Available only at district level

Stewart (Princeton) Text as Data June 28-29, 2018 111 / 187


Example: Japanese Campaign Manifestos
(Catalinac 2016)

Japanese Elections:
- Election Administration Commission runs elections → district
level
- Required to submit manifestos for all candidates to the National Diet
- Collected from 1950-2009
- Available only at district level
- Until 2009, when the national library made texts available on microfilm

Stewart (Princeton) Text as Data June 28-29, 2018 111 / 187


Example: Japanese Campaign Manifestos
(Catalinac 2016)

Japanese Elections:
- Election Administration Commission runs elections → district
level
- Required to submit manifestos for all candidates to the National Diet
- Collected from 1950-2009
- Available only at district level
- Until 2009, when the national library made texts available on microfilm
- Collected from microfilm, hand transcribed (no OCR worked),
used a variety of techniques to create a TDM

Stewart (Princeton) Text as Data June 28-29, 2018 111 / 187


Example: Japanese Campaign Manifestos
(Catalinac 2016)

Japanese Elections:
- Election Administration Commission runs elections → district
level
- Required to submit manifestos for all candidates to the National Diet
- Collected from 1950-2009
- Available only at district level
- Until 2009, when the national library made texts available on microfilm
- Collected from microfilm, hand transcribed (no OCR worked),
used a variety of techniques to create a TDM
- Harder for Japanese (no spaces between words to delimit tokens)

Stewart (Princeton) Text as Data June 28-29, 2018 111 / 187


Example: Japanese Campaign Manifestos
(Catalinac 2016)

- Applies Vanilla LDA


- Output: topics (with Japanese characters)

Stewart (Princeton) Text as Data June 28-29, 2018 112 / 187


Example: Japanese Campaign Manifestos
(Catalinac 2016)

Stewart (Princeton) Text as Data June 28-29, 2018 113 / 187


Example: Japanese Campaign Manifestos
(Catalinac 2016)

[Figure: Change in Mean Proportion of Each Manifesto Devoted to Pork Over Time. One point per manifesto; y-axis: Proportions of each Manifesto Devoted to Pork; x-axis: Election Years, 1990 to 2005.]

Stewart (Princeton) Text as Data June 28-29, 2018 114 / 187


Example: Japanese Campaign Manifestos
(Catalinac 2016)

[Figure: Change in Mean Proportion of Each Manifesto Devoted to Foreign Policy Over Time. One point per manifesto; y-axis: Proportions of each Manifesto Devoted to Foreign Policy Issues; x-axis: Election Years, 1990 to 2005.]

Stewart (Princeton) Text as Data June 28-29, 2018 114 / 187


Topic Models as Measurement

the goal is a
representation that is

useful

reliable

valid
Topic Models as Measurement

the goal is a
representation that is

useful

reliable

valid
substantive fit
Topic Models as Measurement

desirable properties

easy to use

transparent

broad support

helpful
Setting the number of topics K
Current approaches to setting K

Optimize model fit


Wallach et al., 2009
Foulds and Smyth, 2014
Snoek, Larochelle and Adams, 2012

Bayesian Nonparametrics
Teh et al., 2005
Wallach et al., 2010
Current approaches to setting K
Optimize a surrogate criterion
Chang et al., 2009
Lau, Newman and Baldwin, 2014
Mimno et al., 2011
Newman et al., 2010
Roberts et al., 2014

Bespoke methods
Grimmer, 2010
Quinn et al., 2010
Grimmer and Stewart, 2013
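
As a concrete illustration of the model-fit and surrogate-criterion approaches, stm::searchK computes held-out likelihood, residuals, and semantic coherence across candidate values of K. A hedged sketch (docs, vocab, and meta are assumed outputs of prepDocuments(); party and year are assumed metadata columns):

library(stm)
kresult <- searchK(docs, vocab, K = c(5, 10, 15, 20),
                   prevalence = ~ party + s(year), data = meta)
plot(kresult)  # compare held-out likelihood, residuals, coherence across K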
Topic Aggregation
Topic Aggregation

Why?

stability

weak supervision

label interpretability

transparency
An Interactive System
1 Session 1: Getting Started with Text in Social Science
What Text Methods Can Do
Core Concepts and Principles
Represent
Example: Understanding Chinese Censorship

2 Session 2: Discover
Clustering
Interpretation
Mixed Membership Models
Example: Discovery in Congressional Communication

3 Session 3: Measure
Choosing a Model
Example: Party Manifestoes in Japan
Revisiting K
Structural Topic Model
Supervised Learning
Scaling

4 Session 4: Additional Approaches to Measurement


Difficulties with Trends
TextReuse
Readability
A few more papers

Stewart (Princeton) Text as Data June 28-29, 2018 115 / 187


Leveraging Information Within and About Texts

Stewart (Princeton) Text as Data June 28-29, 2018 116 / 187


Leveraging Information Within and About Texts
Previous methods leverage the information within documents

Stewart (Princeton) Text as Data June 28-29, 2018 116 / 187


Leveraging Information Within and About Texts
Previous methods leverage the information within documents
I methods developed in computer science and statistics

Stewart (Princeton) Text as Data June 28-29, 2018 116 / 187


Leveraging Information Within and About Texts
Previous methods leverage the information within documents
I methods developed in computer science and statistics
I primarily analyzing unstructured text

Stewart (Princeton) Text as Data June 28-29, 2018 116 / 187


Leveraging Information Within and About Texts
Previous methods leverage the information within documents
I methods developed in computer science and statistics
I primarily analyzing unstructured text
I use words within document to infer its subject

Stewart (Princeton) Text as Data June 28-29, 2018 116 / 187


Leveraging Information Within and About Texts
Previous methods leverage the information within documents
I methods developed in computer science and statistics
I primarily analyzing unstructured text
I use words within document to infer its subject
But, we also have information about documents

Stewart (Princeton) Text as Data June 28-29, 2018 116 / 187


Leveraging Information Within and About Texts
Previous methods leverage the information within documents
I methods developed in computer science and statistics
I primarily analyzing unstructured text
I use words within document to infer its subject
But, we also have information about documents
I captured by metadata: data about data

Stewart (Princeton) Text as Data June 28-29, 2018 116 / 187


Leveraging Information Within and About Texts
Previous methods leverage the information within documents
I methods developed in computer science and statistics
I primarily analyzing unstructured text
I use words within document to infer its subject
But, we also have information about documents
I captured by metadata: data about data
I e.g. author, source, date, audience

Stewart (Princeton) Text as Data June 28-29, 2018 116 / 187


Leveraging Information Within and About Texts
Previous methods leverage the information within documents
I methods developed in computer science and statistics
I primarily analyzing unstructured text
I use words within document to infer its subject
But, we also have information about documents
I captured by metadata: data about data
I e.g. author, source, date, audience
I important because speech is deeply contextual

Stewart (Princeton) Text as Data June 28-29, 2018 116 / 187


Leveraging Information Within and About Texts
Previous methods leverage the information within documents
I methods developed in computer science and statistics
I primarily analyzing unstructured text
I use words within document to infer its subject
But, we also have information about documents
I captured by metadata: data about data
I e.g. author, source, date, audience
I important because speech is deeply contextual
I e.g. who says it, where, when, to whom

Stewart (Princeton) Text as Data June 28-29, 2018 116 / 187


Leveraging Information Within and About Texts
Previous methods leverage the information within documents
I methods developed in computer science and statistics
I primarily analyzing unstructured text
I use words within document to infer its subject
But, we also have information about documents
I captured by metadata: data about data
I e.g. author, source, date, audience
I important because speech is deeply contextual
I e.g. who says it, where, when, to whom
I we want to avoid throwing away valuable information we have

Stewart (Princeton) Text as Data June 28-29, 2018 116 / 187


Leveraging Information Within and About Texts
Previous methods leverage the information within documents
I methods developed in computer science and statistics
I primarily analyzing unstructured text
I use words within document to infer its subject
But, we also have information about documents
I captured by metadata: data about data
I e.g. author, source, date, audience
I important because speech is deeply contextual
I e.g. who says it, where, when, to whom
I we want to avoid throwing away valuable information we have
Structural Topic Model (STM)

Stewart (Princeton) Text as Data June 28-29, 2018 116 / 187


Leveraging Information Within and About Texts
Previous methods leverage the information within documents
I methods developed in computer science and statistics
I primarily analyzing unstructured text
I use words within document to infer its subject
But, we also have information about documents
I captured by metadata: data about data
I e.g. author, source, date, audience
I important because speech is deeply contextual
I e.g. who says it, where, when, to whom
I we want to avoid throwing away valuable information we have
Structural Topic Model (STM)
I general method for modeling documents with context

Stewart (Princeton) Text as Data June 28-29, 2018 116 / 187


Leveraging Information Within and About Texts
Previous methods leverage the information within documents
I methods developed in computer science and statistics
I primarily analyzing unstructured text
I use words within document to infer its subject
But, we also have information about documents
I captured by metadata: data about data
I e.g. author, source, date, audience
I important because speech is deeply contextual
I e.g. who says it, where, when, to whom
I we want to avoid throwing away valuable information we have
Structural Topic Model (STM)
I general method for modeling documents with context
I modeling context in document sets to enable comparison

Stewart (Princeton) Text as Data June 28-29, 2018 116 / 187


Leveraging Information Within and About Texts
Previous methods leverage the information within documents
I methods developed in computer science and statistics
I primarily analyzing unstructured text
I use words within document to infer its subject
But, we also have information about documents
I captured by metadata: data about data
I e.g. author, source, date, audience
I important because speech is deeply contextual
I e.g. who says it, where, when, to whom
I we want to avoid throwing away valuable information we have
Structural Topic Model (STM)
I general method for modeling documents with context
I modeling context in document sets to enable comparison
I two uses of metadata: topic prevalence and topical content

Stewart (Princeton) Text as Data June 28-29, 2018 116 / 187
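
A hedged sketch of getting texts plus metadata into this framework with the stm package (standard stm preprocessing; a data frame df with a text column and metadata columns is assumed):

library(stm)
processed <- textProcessor(df$text, metadata = df)  # tokenize, stem, build vocab
out <- prepDocuments(processed$documents, processed$vocab, processed$meta)
# out$documents, out$vocab, and out$meta are then passed to stm()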


STM = LDA + Contextual Information

Stewart (Princeton) Text as Data June 28-29, 2018 117 / 187


STM = LDA + Contextual Information

STM provides two ways to include contextual information

Stewart (Princeton) Text as Data June 28-29, 2018 117 / 187


STM = LDA + Contextual Information

STM provides two ways to include contextual information


I Topic prevalence can vary by metadata

Stewart (Princeton) Text as Data June 28-29, 2018 117 / 187


STM = LDA + Contextual Information

STM provides two ways to include contextual information


I Topic prevalence can vary by metadata
F e.g. Democrats talk more about education than Republicans

Stewart (Princeton) Text as Data June 28-29, 2018 117 / 187


STM = LDA + Contextual Information

STM provides two ways to include contextual information


I Topic prevalence can vary by metadata
F e.g. Democrats talk more about education than Republicans
I Topic content can vary by metadata

Stewart (Princeton) Text as Data June 28-29, 2018 117 / 187


STM = LDA + Contextual Information

STM provides two ways to include contextual information


I Topic prevalence can vary by metadata
F e.g. Democrats talk more about education than Republicans
I Topic content can vary by metadata
F e.g. Democrats are less likely to use the word “life” when
talking about abortion than Republicans

Stewart (Princeton) Text as Data June 28-29, 2018 117 / 187


STM = LDA + Contextual Information

STM provides two ways to include contextual information


I Topic prevalence can vary by metadata
F e.g. Democrats talk more about education than Republicans
I Topic content can vary by metadata
F e.g. Democrats are less likely to use the word “life” when
talking about abortion than Republicans
Including context improves the model:

Stewart (Princeton) Text as Data June 28-29, 2018 117 / 187


STM = LDA + Contextual Information

STM provides two ways to include contextual information


I Topic prevalence can vary by metadata
F e.g. Democrats talk more about education than Republicans
I Topic content can vary by metadata
F e.g. Democrats are less likely to use the word “life” when
talking about abortion than Republicans
Including context improves the model:
I more accurate estimation

Stewart (Princeton) Text as Data June 28-29, 2018 117 / 187


STM = LDA + Contextual Information

STM provides two ways to include contextual information


I Topic prevalence can vary by metadata
F e.g. Democrats talk more about education than Republicans
I Topic content can vary by metadata
F e.g. Democrats are less likely to use the word “life” when
talking about abortion than Republicans
Including context improves the model:
I more accurate estimation
I better qualitative interpretability

Stewart (Princeton) Text as Data June 28-29, 2018 117 / 187
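
A minimal sketch of the two channels in stm (party and time are assumed metadata columns; out is assumed from prepDocuments()): prevalence covariates shift how much each topic is used, while a content covariate shifts the words a topic uses.

library(stm)
mod <- stm(out$documents, out$vocab, K = 20,
           prevalence = ~ party + s(time),  # topic prevalence varies by metadata
           content = ~ party,               # topic content varies by group
           data = out$meta, init.type = "Spectral")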


Mixed-Membership Topic Models
More formal terminology:

Stewart (Princeton) Text as Data June 28-29, 2018 118 / 187


Mixed-Membership Topic Models
More formal terminology:
User specifies the number of topics: K

Stewart (Princeton) Text as Data June 28-29, 2018 118 / 187


Mixed-Membership Topic Models
More formal terminology:
User specifies the number of topics: K
Observed data for standard topic models

Stewart (Princeton) Text as Data June 28-29, 2018 118 / 187


Mixed-Membership Topic Models
More formal terminology:
User specifies the number of topics: K
Observed data for standard topic models
I Each document (d ∈ 1 . . . D) is a collection of Nd tokens

Stewart (Princeton) Text as Data June 28-29, 2018 118 / 187


Mixed-Membership Topic Models
More formal terminology:
User specifies the number of topics: K
Observed data for standard topic models
I Each document (d ∈ 1 . . . D) is a collection of Nd tokens
I Each token is a particular word from a dictionary of V entries

Stewart (Princeton) Text as Data June 28-29, 2018 118 / 187


Mixed-Membership Topic Models
More formal terminology:
User specifies the number of topics: K
Observed data for standard topic models
I Each document (d ∈ 1 . . . D) is a collection of Nd tokens
I Each token is a particular word from a dictionary of V entries
I Data summarized in a single D × V matrix W

Stewart (Princeton) Text as Data June 28-29, 2018 118 / 187


Mixed-Membership Topic Models
More formal terminology:
User specifies the number of topics: K
Observed data for standard topic models
I Each document (d ∈ 1 . . . D) is a collection of Nd tokens
I Each token is a particular word from a dictionary of V entries
I Data summarized in a single D × V matrix W
Additional data for STM

Stewart (Princeton) Text as Data June 28-29, 2018 118 / 187


Mixed-Membership Topic Models
More formal terminology:
User specifies the number of topics: K
Observed data for standard topic models
I Each document (d ∈ 1 . . . D) is a collection of Nd tokens
I Each token is a particular word from a dictionary of V entries
I Data summarized in a single D × V matrix W
Additional data for STM
I Topic prevalence covariates: D × P matrix X

Stewart (Princeton) Text as Data June 28-29, 2018 118 / 187


Mixed-Membership Topic Models
More formal terminology:
User specifies the number of topics: K
Observed data for standard topic models
I Each document (d ∈ 1 . . . D) is a collection of Nd tokens
I Each token is a particular word from a dictionary of V entries
I Data summarized in a single D × V matrix W
Additional data for STM
I Topic prevalence covariates: D × P matrix X
I Topical content groups: D length vector Y

Stewart (Princeton) Text as Data June 28-29, 2018 118 / 187


Mixed-Membership Topic Models
More formal terminology:
User specifies the number of topics: K
Observed data for standard topic models
I Each document (d ∈ 1 . . . D) is a collection of Nd tokens
I Each token is a particular word from a dictionary of V entries
I Data summarized in a single D × V matrix W
Additional data for STM
I Topic prevalence covariates: D × P matrix X
I Topical content groups: D length vector Y
Latent variables

Stewart (Princeton) Text as Data June 28-29, 2018 118 / 187


Mixed-Membership Topic Models
More formal terminology:
User specifies the number of topics: K
Observed data for standard topic models
I Each document (d ∈ 1 . . . D) is a collection of Nd tokens
I Each token is a particular word from a dictionary of V entries
I Data summarized in a single D × V matrix W
Additional data for STM
I Topic prevalence covariates: D × P matrix X
I Topical content groups: D length vector Y
Latent variables
I D × K matrix θ: proportion of document on each topic.

Stewart (Princeton) Text as Data June 28-29, 2018 118 / 187


Mixed-Membership Topic Models
More formal terminology:
User specifies the number of topics: K
Observed data for standard topic models
I Each document (d ∈ 1 . . . D) is a collection of Nd tokens
I Each token is a particular word from a dictionary of V entries
I Data summarized in a single D × V matrix W
Additional data for STM
I Topic prevalence covariates: D × P matrix X
I Topical content groups: D length vector Y
Latent variables
I D × K matrix θ: proportion of document on each topic.
I K × V matrix β: probability of drawing a word conditional on
topic.

Stewart (Princeton) Text as Data June 28-29, 2018 118 / 187


Mixed-Membership Topic Models
More formal terminology:
User specifies the number of topics: K
Observed data for standard topic models
I Each document (d ∈ 1 . . . D) is a collection of Nd tokens
I Each token is a particular word from a dictionary of V entries
I Data summarized in a single D × V matrix W
Additional data for STM
I Topic prevalence covariates: D × P matrix X
I Topical content groups: D length vector Y
Latent variables
I D × K matrix θ: proportion of document on each topic.
I K × V matrix β: probability of drawing a word conditional on
topic.
I Low rank approximation to expected counts: W̃ (D×V) ≈ θ (D×K) β (K×V)

Stewart (Princeton) Text as Data June 28-29, 2018 118 / 187


Technical Details: The Structural Topic Model
Low rank approximation to expected counts: W̃ (D×V) ≈ θ (D×K) β (K×V)

θ, D × K document-topic matrix

β, K × V topic-word matrix

Each token has a topic drawn from the document mixture


I Draw token topic zd,n from Discrete(θd )
I Draw observed word wd,n from Discrete(βk ) with k = zd,n

Stewart (Princeton) Text as Data June 28-29, 2018 119 / 187


Technical Details: The Structural Topic Model
Low rank approximation to expected counts: W̃ (D×V) ≈ θ (D×K) β (K×V)

θ, D × K document-topic matrix ⇐ logistic normal glm with covariates

β, K × V topic-word matrix

Each token has a topic drawn from the document mixture


I Draw token topic zd,n from Discrete(θd )
I Draw observed word wd,n from Discrete(βk ) with k = zd,n

Stewart (Princeton) Text as Data June 28-29, 2018 119 / 187


Technical Details: The Structural Topic Model
Low rank approximation to expected counts: W̃ (D×V) ≈ θ (D×K) β (K×V)

θ, D × K document-topic matrix ⇐ logistic normal glm with covariates
I Covariate-specific prior with global topic covariance
I θd,· ∼ LogisticNormal(Xd γ, Σ)
β, K × V topic-word matrix

Each token has a topic drawn from the document mixture


I Draw token topic zd,n from Discrete(θd )
I Draw observed word wd,n from Discrete(βk ) with k = zd,n

Stewart (Princeton) Text as Data June 28-29, 2018 119 / 187


Technical Details: The Structural Topic Model
Low rank approximation to expected counts: W̃ (D×V) ≈ θ (D×K) β (K×V)

θ, D × K document-topic matrix ⇐ logistic normal glm with covariates
I Covariate-specific prior with global topic covariance
I θd,· ∼ LogisticNormal(Xd γ, Σ)
β, K × V topic-word matrix ⇐ multinomial logit with covariates

Each token has a topic drawn from the document mixture


I Draw token topic zd,n from Discrete(θd )
I Draw observed word wd,n from Discrete(βk ) with k = zd,n

Stewart (Princeton) Text as Data June 28-29, 2018 119 / 187


Technical Details: The Structural Topic Model
Low rank approximation to expected counts: W̃ (D×V) ≈ θ (D×K) β (K×V)

θ, D × K document-topic matrix ⇐ logistic normal glm with covariates
I Covariate-specific prior with global topic covariance
I θd,· ∼ LogisticNormal(Xd γ, Σ)
β, K × V topic-word matrix ⇐ multinomial logit with covariates
I Each topic is now a sparse, covariate-specific deviation from a
baseline distribution.
I β̃k,· ∝ exp(m + κ(topic) + κ(cov) + κ(int))
I Three parts: topic, covariate, topic-covariate interaction

Each token has a topic drawn from the document mixture


I Draw token topic zd,n from Discrete(θd )
I Draw observed word wd,n from Discrete(βk ) with k = zd,n

Stewart (Princeton) Text as Data June 28-29, 2018 119 / 187


Technical Details: The Structural Topic Model
Low rank approximation to expected counts: W̃ (D×V) ≈ θ (D×K) β (K×V)

θ, D × K document-topic matrix ⇐ logistic normal glm with covariates
I Covariate-specific prior with global topic covariance
I θd,· ∼ LogisticNormal(Xd γ, Σ)
β, K × V topic-word matrix ⇐ multinomial logit with covariates
I Each topic is now a sparse, covariate-specific deviation from a
baseline distribution.
I β̃k,· ∝ exp(m + κ(topic) + κ(cov) + κ(int))
I Three parts: topic, covariate, topic-covariate interaction
I β may instead be point-estimated
Each token has a topic drawn from the document mixture
I Draw token topic zd,n from Discrete(θd )
I Draw observed word wd,n from Discrete(βk ) with k = zd,n

Stewart (Princeton) Text as Data June 28-29, 2018 119 / 187
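
A minimal sketch simulating the generative process above for one document (illustrative values only; covariates are omitted, so θ comes from a plain logistic normal):

set.seed(1)
K <- 3; V <- 50; N_d <- 20
beta <- t(apply(matrix(rgamma(K * V, 0.1), K, V), 1,
                function(x) x / sum(x)))       # K x V topic-word probabilities
eta <- rnorm(K - 1)                            # logistic normal draw
theta <- exp(c(eta, 0)) / sum(exp(c(eta, 0)))  # document-topic proportions
z <- sample(1:K, N_d, replace = TRUE, prob = theta)           # token topics
w <- sapply(z, function(k) sample(1:V, 1, prob = beta[k, ]))  # observed words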


Structural Topic Model

Stewart (Princeton) Text as Data June 28-29, 2018 120 / 187


Estimation and Implementation of the STM

Stewart (Princeton) Text as Data June 28-29, 2018 121 / 187


Estimation and Implementation of the STM
Define a probabilistic model and estimate parameters

Stewart (Princeton) Text as Data June 28-29, 2018 121 / 187


Estimation and Implementation of the STM
Define a probabilistic model and estimate parameters
I Bayesian estimation using variational inference

Stewart (Princeton) Text as Data June 28-29, 2018 121 / 187


Estimation and Implementation of the STM
Define a probabilistic model and estimate parameters
I Bayesian estimation using variational inference
(initialization from spectral method of moments estimator)

Stewart (Princeton) Text as Data June 28-29, 2018 121 / 187


Estimation and Implementation of the STM
Define a probabilistic model and estimate parameters
I Bayesian estimation using variational inference
(initialization from spectral method of moments estimator)
I essentially word co-occurrences used to discover topics

Stewart (Princeton) Text as Data June 28-29, 2018 121 / 187


Estimation and Implementation of the STM
Define a probabilistic model and estimate parameters
I Bayesian estimation using variational inference
(initialization from spectral method of moments estimator)
I essentially word co-occurrences used to discover topics
General to many kinds of corpus structure using covariates

Stewart (Princeton) Text as Data June 28-29, 2018 121 / 187


Estimation and Implementation of the STM
Define a probabilistic model and estimate parameters
I Bayesian estimation using variational inference
(initialization from spectral method of moments estimator)
I essentially word co-occurrences used to discover topics
General to many kinds of corpus structure using covariates
stm Package in R

Stewart (Princeton) Text as Data June 28-29, 2018 121 / 187


Estimation and Implementation of the STM
Define a probabilistic model and estimate parameters
I Bayesian estimation using variational inference
(initialization from spectral method of moments estimator)
I essentially word co-occurrences used to discover topics
General to many kinds of corpus structure using covariates
stm Package in R
I complete workflow: raw texts → figures

Stewart (Princeton) Text as Data June 28-29, 2018 121 / 187


Estimation and Implementation of the STM
Define a probabilistic model and estimate parameters
I Bayesian estimation using variational inference
(initialization from spectral method of moments estimator)
I essentially word co-occurrences used to discover topics
General to many kinds of corpus structure using covariates
stm Package in R
I complete workflow: raw texts → figures
I simple regression style syntax using formulas
mod.out <- stm(documents, vocab, K = 10,
               prevalence = ~ paper + s(time),
               data = metadata, init.type = "Spectral")

Stewart (Princeton) Text as Data June 28-29, 2018 121 / 187


Estimation and Implementation of the STM
Define a probabilistic model and estimate parameters
I Bayesian estimation using variational inference
(initialization from spectral method of moments estimator)
I essentially word co-occurrences used to discover topics
General to many kinds of corpus structure using covariates
stm Package in R
I complete workflow: raw texts → figures
I simple regression style syntax using formulas
mod.out <- stm(documents, vocab, K = 10,
               prevalence = ~ paper + s(time),
               data = metadata, init.type = "Spectral")
I many functions for summarization, visualization and checking

Stewart (Princeton) Text as Data June 28-29, 2018 121 / 187


Estimation and Implementation of the STM
Define a probabilistic model and estimate parameters
I Bayesian estimation using variational inference
(initialization from spectral method of moments estimator)
I essentially word co-occurrences used to discover topics
General to many kinds of corpus structure using covariates
stm Package in R
I complete workflow: raw texts → figures
I simple regression style syntax using formulas
mod.out <- stm(documents, vocab, K = 10,
               prevalence = ~ paper + s(time),
               data = metadata, init.type = "Spectral")
I many functions for summarization, visualization and checking
Complete vignette online with examples

Stewart (Princeton) Text as Data June 28-29, 2018 121 / 187


Estimation and Implementation of the STM
Define a probabilistic model and estimate parameters
I Bayesian estimation using variational inference
(initialization from spectral method of moments estimator)
I essentially word co-occurrences used to discover topics
General to many kinds of corpus structure using covariates
stm Package in R
I complete workflow: raw texts → figures
I simple regression style syntax using formulas
mod.out <- stm(documents, vocab, K = 10,
               prevalence = ~ paper + s(time),
               data = metadata, init.type = "Spectral")
I many functions for summarization, visualization and checking
Complete vignette online with examples

You can do this with your data!


Stewart (Princeton) Text as Data June 28-29, 2018 121 / 187
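
A hedged sketch of the post-estimation workflow the slide refers to, continuing from mod.out above (raw_texts is an assumed vector of the original documents):

labelTopics(mod.out)                          # top words for each topic
findThoughts(mod.out, texts = raw_texts, topics = 1, n = 3)  # exemplar docs
eff <- estimateEffect(1:10 ~ paper + s(time), mod.out, metadata = metadata)
plot(eff, covariate = "time", topics = 1, method = "continuous")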
1 Session 1: Getting Started with Text in Social Science
What Text Methods Can Do
Core Concepts and Principles
Represent
Example: Understanding Chinese Censorship

2 Session 2: Discover
Clustering
Interpretation
Mixed Membership Models
Example: Discovery in Congressional Communication

3 Session 3: Measure
Choosing a Model
Example: Party Manifestoes in Japan
Revisiting K
Structural Topic Model
Supervised Learning
Scaling

4 Session 4: Additional Approaches to Measurement


Difficulties with Trends
TextReuse
Readability
A few more papers

Stewart (Princeton) Text as Data June 28-29, 2018 122 / 187


Classification via Dictionary Methods
1) Task

Stewart (Princeton) Text as Data June 28-29, 2018 123 / 187


Classification via Dictionary Methods
1) Task
a) Categorize documents into predetermined categories

Stewart (Princeton) Text as Data June 28-29, 2018 123 / 187


Classification via Dictionary Methods
1) Task
a) Categorize documents into predetermined categories
b) Measure documents' association with predetermined categories

Stewart (Princeton) Text as Data June 28-29, 2018 123 / 187


Classification via Dictionary Methods
1) Task
a) Categorize documents into predetermined categories
b) Measure documents' association with predetermined categories
2) Objective function:

Stewart (Princeton) Text as Data June 28-29, 2018 123 / 187


Classification via Dictionary Methods
1) Task
a) Categorize documents into predetermined categories
b) Measure documents' association with predetermined categories
2) Objective function:
f (θ, Xi ) = (Σj θj Xij ) / (Σj Xij ), with sums over j = 1, . . . , N

Stewart (Princeton) Text as Data June 28-29, 2018 123 / 187


Classification via Dictionary Methods
1) Task
a) Categorize documents into predetermined categories
b) Measure documents' association with predetermined categories
2) Objective function:
f (θ, Xi ) = (Σj θj Xij ) / (Σj Xij ), with sums over j = 1, . . . , N
where:

Stewart (Princeton) Text as Data June 28-29, 2018 123 / 187


Classification via Dictionary Methods
1) Task
a) Categorize documents into predetermined categories
b) Measure documents' association with predetermined categories
2) Objective function:
f (θ, Xi ) = (Σj θj Xij ) / (Σj Xij ), with sums over j = 1, . . . , N
where:
- θ = (θ1 , θ2 , . . . , θN ) are word weights

Stewart (Princeton) Text as Data June 28-29, 2018 123 / 187


Classification via Dictionary Methods
1) Task
a) Categorize documents into predetermined categories
b) Measure documents' association with predetermined categories
2) Objective function:
f (θ, Xi ) = (Σj θj Xij ) / (Σj Xij ), with sums over j = 1, . . . , N
where:
- θ = (θ1 , θ2 , . . . , θN ) are word weights
- Xi = (Xi1 , Xi2 , . . . , XiN ) count the occurrence of each
corresponding word in document i

Stewart (Princeton) Text as Data June 28-29, 2018 123 / 187


Classification via Dictionary Methods
1) Task
a) Categorize documents into predetermined categories
b) Measure documents' association with predetermined categories
2) Objective function:
f (θ, Xi ) = (Σj θj Xij ) / (Σj Xij ), with sums over j = 1, . . . , N
where:
- θ = (θ1 , θ2 , . . . , θN ) are word weights
- Xi = (Xi1 , Xi2 , . . . , XiN ) count the occurrence of each
corresponding word in document i
3) Optimization: predetermined word list, no task-specific optimization

Stewart (Princeton) Text as Data June 28-29, 2018 123 / 187


Classification via Dictionary Methods
1) Task
a) Categorize documents into predetermined categories
b) Measure documents' association with predetermined categories
2) Objective function:
f (θ, Xi ) = (Σj θj Xij ) / (Σj Xij ), with sums over j = 1, . . . , N
where:
- θ = (θ1 , θ2 , . . . , θN ) are word weights
- Xi = (Xi1 , Xi2 , . . . , XiN ) count the occurrence of each
corresponding word in document i
3) Optimization: predetermined word list, no task-specific optimization
4) Validation (Model checking): weight (model) checking, replication of hand coding, face validity
Stewart (Princeton) Text as Data June 28-29, 2018 123 / 187
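
A minimal sketch of the objective function above in R (X is an assumed document-term count matrix; theta is an assumed vector of word weights aligned to its columns, e.g. +1 for positive and -1 for negative dictionary words):

dictionary_score <- function(X, theta) {
  as.numeric(X %*% theta) / rowSums(X)  # sum_j theta_j X_ij / sum_j X_ij
}
# hypothetical weights built from word lists:
# theta <- ifelse(colnames(X) %in% positive_words, 1,
#                 ifelse(colnames(X) %in% negative_words, -1, 0))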
Word Weights: Separating Classes
General Classification Goal: Place documents into categories

Stewart (Princeton) Text as Data June 28-29, 2018 124 / 187


Word Weights: Separating Classes
General Classification Goal: Place documents into categories
How To Do Classification?

Stewart (Princeton) Text as Data June 28-29, 2018 124 / 187


Word Weights: Separating Classes
General Classification Goal: Place documents into categories
How To Do Classification?
- Dictionaries:

Stewart (Princeton) Text as Data June 28-29, 2018 124 / 187


Word Weights: Separating Classes
General Classification Goal: Place documents into categories
How To Do Classification?
- Dictionaries:
- Rely on humans to identify words that associate with
classes

Stewart (Princeton) Text as Data June 28-29, 2018 124 / 187


Word Weights: Separating Classes
General Classification Goal: Place documents into categories
How To Do Classification?
- Dictionaries:
- Rely on humans to identify words that associate with
classes
- Measure how well words separate (positive/negative, emotional,
...)

Stewart (Princeton) Text as Data June 28-29, 2018 124 / 187


Word Weights: Separating Classes
General Classification Goal: Place documents into categories
How To Do Classification?
- Dictionaries:
- Rely on humans to identify words that associate with
classes
- Measure how well words separate (positive/negative, emotional,
...)
- Supervised Classification Methods (in a few slides):

Stewart (Princeton) Text as Data June 28-29, 2018 124 / 187


Word Weights: Separating Classes
General Classification Goal: Place documents into categories
How To Do Classification?
- Dictionaries:
- Rely on humans to identify words that associate with
classes
- Measure how well words separate (positive/negative, emotional,
...)
- Supervised Classification Methods (in a few slides):
- Rely on statistical models

Stewart (Princeton) Text as Data June 28-29, 2018 124 / 187


Word Weights: Separating Classes
General Classification Goal: Place documents into categories
How To Do Classification?
- Dictionaries:
- Rely on humans to identify words that associate with
classes
- Measure how well words separate (positive/negative, emotional,
...)
- Supervised Classification Methods (in a few slides):
- Rely on statistical models
- Given set of coded documents, statistical relationship between
classes/words

Stewart (Princeton) Text as Data June 28-29, 2018 124 / 187


Word Weights: Separating Classes
General Classification Goal: Place documents into categories
How To Do Classification?
- Dictionaries:
- Rely on Humans humans to identify words that associate with
classes
- Measure how well words separate (positive/negative, emotional,
...)
- Supervised Classification Methods (in a few slides):
- Rely on statistical models
- Given set of coded documents, statistical relationship between
classes/words
- Statistical measures of separation

Stewart (Princeton) Text as Data June 28-29, 2018 124 / 187


Word Weights: Separating Classes
General Classification Goal: Place documents into categories
How To Do Classification?
- Dictionaries:
- Rely on humans to identify words that associate with classes
- Measure how well words separate (positive/negative, emotional,
...)
- Supervised Classification Methods (in a few slides):
- Rely on statistical models
- Given a set of coded documents, learn the statistical relationship
between classes and words
- Statistical measures of separation

Key point: this is the same task

Stewart (Princeton) Text as Data June 28-29, 2018 124 / 187


Types of Classification Problems
Topic: What is this text about?
- Policy area of legislation
⇒ {Agriculture, Crime, Environment, ...}
- Campaign agendas
⇒ {Abortion, Campaign Finance, Taxing, ... }
Sentiment: What is said in this text? [Public Opinion]
- Positions on legislation
⇒ { Support, Ambiguous, Oppose }
- Positions on Court Cases
⇒ { Agree with Court, Disagree with Court }
- Liberal/Conservative Blog Posts
⇒ { Liberal, Middle, Conservative, No Ideology Expressed }
Style/Tone: How is it said?
- Taunting in floor statements
⇒ { Partisan taunt, Intra-party taunt, Agency taunt, ... }
- Negative campaigning
Stewart (Princeton) Text as Data June 28-29, 2018 125 / 187
Applying Methods to Documents
Applying the model:
- Vector of word counts: Xi = (Xi1, Xi2, . . . , XiK), i = 1, . . . , N
- Weights attached to words: θ = (θ1, θ2, . . . , θK), for example
  - θk ∈ {0, 1}
  - θk ∈ {−1, 0, 1}
  - θk ∈ {−2, −1, 0, 1, 2}
For each document i, calculate the score

Y_i = \frac{\sum_{k=1}^{K} \theta_k X_{ik}}{\sum_{k=1}^{K} X_{ik}} = \frac{\theta' X_i}{X_i' \mathbf{1}}

Yi is (approximately) continuous; classify by its sign:
Yi > 0 ⇒ Positive Category
Yi < 0 ⇒ Negative Category

Stewart (Princeton) Text as Data June 28-29, 2018 126 / 187
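A toy end-to-end run of this scoring rule, as a sketch; the word list and weights below are invented for illustration, not a validated dictionary.

docs  <- c("great great win loss", "loss bad bad defeat")
vocab <- c("great", "win", "loss", "bad", "defeat")
theta <- c(1, 1, -1, -1, -1)                  # theta_k in {-1, 0, 1}
X <- t(sapply(strsplit(docs, " "),
              function(w) table(factor(w, levels = vocab))))
Y <- as.numeric(X %*% theta) / rowSums(X)     # per-document scores
ifelse(Y > 0, "Positive", ifelse(Y < 0, "Negative", "Neutral"))

Document 1 scores (2 + 1 − 1)/4 = 0.5, so positive; document 2 scores −4/4 = −1, so negative.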


Methodological Issues/Problems with Dictionaries

Dictionary methods are context invariant


- No optimization step: same word weights regardless of the texts
- Optimization would incorporate information specific to the context
- Without optimization, it is unclear how well dictionaries perform
Just because dictionaries provide measures labeled “positive” or
“negative” it doesn’t mean they are accurate measures in your text
(!!!!)

Validation
Stewart (Princeton) Text as Data June 28-29, 2018 127 / 187
Validation, Dictionaries from other Fields
Accounting Research: measure tone of 10-K reports
- tone matters ($)
Previous state of the art: the Harvard-IV-4 dictionary applied to the texts
Loughran and McDonald (2011): financial documents are different
(polysemes)
- Negative words in Harvard, not negative in accounting:
  tax, cost, capital, board, liability, foreign, cancer, crude (oil), tire
- 73% of Harvard negative word counts fall in this set (!!)
- Not negative in Harvard, negative in accounting:
  felony, litigation, restated, misstatement, and unanticipated

Stewart (Princeton) Text as Data June 28-29, 2018 128 / 187


Supervised Learning
1) Task
- Classify documents into pre-existing categories
- Measure the proportion of documents in each category
2) Objective function
- Suppose we have K categories
- Select Ntrain documents to hand-label, Yi = k, with
  Y = (Y1, Y2, . . . , YNtrain) and

  Y = f(X, θ)

3) Optimization
- Method specific: MLE, Bayesian, EM, ...
- We learn θ̂
4) Validation
- Obtain the predicted fit for new data, f(Xi, θ̂)
- Examine prediction performance: compare classifications to a gold
  standard
Stewart (Princeton) Text as Data June 28-29, 2018 129 / 187
Components to Supervised Learning Method

1) Set of categories
- Credit Claiming, Position Taking, Advertising
- Positive Tone, Negative Tone
- Pro-war, Ambiguous, Anti-war
2) Set of hand-coded documents
- Coding done by human coders
- Training Set: documents we’ll use to learn how to code
- Validation Set: documents we’ll use to learn how well we code
3) Set of unlabeled documents
4) Method to extrapolate from hand coding to unlabeled
documents

Stewart (Princeton) Text as Data June 28-29, 2018 130 / 187


How Do We Generate Coding Rules and
Categories?
Challenge: coding rules/training coders to maximize coder
performance
Challenge: developing a clear set of categories
1) Limits of Humans:
- Small working memories
- Easily distracted
- Insufficient motivation
2) Limits of Language:
- Fundamental ambiguity in language [careful analysis of texts]
- Contextual nature of language
For supervised methods to work: maximize coder agreement
1) Write careful (and brief) coding rules
- Flow charts help simplify problems
2) Train coders to remove ambiguity, misinterpretation
Stewart (Princeton) Text as Data June 28-29, 2018 131 / 187
How Do We Generate Coding Rules?

Iterative process for generating coding rules:


1) Write a set of coding rules
2) Have coders code documents
3) Assess coder agreement
4) Identify sources of disagreement, repeat

Stewart (Princeton) Text as Data June 28-29, 2018 132 / 187


How Do We Identify Coding Disagreement?
Many measures of inter-coder agreement
Essentially attempt to summarize a confusion matrix
Cat 1 Cat 2 Cat 3 Cat 4 Sum, Coder 1
Cat 1 30 0 1 0 31
Cat 2 1 1 0 0 2
Cat 3 0 0 1 0 1
Cat 4 3 1 0 7 11
Sum, Coder 2 34 2 2 7 Total: 45
- Diagonal: coders agree on document
- Off-diagonal : coders disagree (confused) on document
Generalize across k coders:
- k(k−1)/2 pairwise comparisons
- k comparisons: Coder A against all other coders

Stewart (Princeton) Text as Data June 28-29, 2018 133 / 187
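In R, a confusion matrix for two coders and the share of documents on the diagonal can be computed directly; a small sketch with made-up labels:

lv <- c("Cat1", "Cat2", "Cat3", "Cat4")
coder1 <- factor(c("Cat1", "Cat1", "Cat2", "Cat4", "Cat1", "Cat3"), levels = lv)
coder2 <- factor(c("Cat1", "Cat1", "Cat1", "Cat4", "Cat1", "Cat3"), levels = lv)
conf <- table(Coder1 = coder1, Coder2 = coder2)  # rows: coder 1, cols: coder 2
sum(diag(conf)) / sum(conf)                      # proportion agreement (here 5/6)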


Example Coding Document
8 part coding scheme
- Across Party Taunting: explicit public and negative attacks on
the other party or its members
- Within Party Taunting: explicit public and negative attacks on
the same party or its members [for 1960’s politics]
- Other taunting: explicit public and negative attacks not directed
at a party
- Bipartisan support: praise for the other party
- Honorary Statements: qualitatively different kind of speech
- Policy speech: a speech without taunting or credit claiming
- Procedural
- No Content: (occasionally occurs in CR)

Stewart (Princeton) Text as Data June 28-29, 2018 134 / 187


Example Coding Document

Stewart (Princeton) Text as Data June 28-29, 2018 135 / 187


How Do We Summarize Confusion Matrix?

Lots of statistics summarize the confusion matrix.
Most common: intercoder agreement,

\text{InterCoder}(A, B) = \frac{\text{No. (Coder A \& Coder B agree)}}{\text{No. Documents}}

Stewart (Princeton) Text as Data June 28-29, 2018 136 / 187


Liberal measure of agreement:
- Some agreement occurs by chance
- Consider a coding scheme with two categories, {Class 1, Class 2}
- Suppose Coder A and Coder B each flip a (biased) coin:
  Pr(Class 1) = 0.75, Pr(Class 2) = 0.25
- Intercoder reliability: 0.75² + 0.25² = 0.625
What to do?
Suggestion: subtract off the amount expected by chance:

\text{InterCoder}(A, B)_{\text{norm}} = \frac{\text{No. (Coder A \& Coder B agree)} - \text{No. Expected by Chance}}{\text{No. Documents}}

Question: what is the amount expected by chance?
- 1 / (No. of categories)?
- The average proportion in each category across coders? (Krippendorff's
  Alpha)
Best practice: present confusion matrices.
Stewart (Princeton) Text as Data June 28-29, 2018 137 / 187
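A sketch of the chance correction in R (labels invented). The slide's suggestion subtracts expected agreement; Cohen's kappa, shown on the last line, additionally rescales by (1 − expected):

lv <- c("Class1", "Class2")
coderA <- factor(c("Class1","Class1","Class1","Class2","Class1","Class2"), levels = lv)
coderB <- factor(c("Class1","Class1","Class2","Class2","Class1","Class1"), levels = lv)
obs <- mean(coderA == coderB)              # observed agreement
pA <- prop.table(table(coderA))            # each coder's marginal label rates
pB <- prop.table(table(coderB))
exp_chance <- sum(pA * pB)                 # agreement expected under chance
obs - exp_chance                           # chance-adjusted agreement
(obs - exp_chance) / (1 - exp_chance)      # Cohen's kappa rescaling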
Krippendorff's Alpha
Define coder reliability as:

\alpha = 1 - \frac{\text{No. Pairwise Disagreements Observed}}{\text{No. Pairwise Disagreements Expected by Chance}}

- No. pairwise disagreements observed: computed from the data
- No. pairwise disagreements expected by chance: coding at random, with
  the rates at which labels are used estimated from the data
Thinking through expected disagreements:
- Pretend we know the quantity we are trying to estimate
- How do we know coders estimate levels well?
- Present the statistic under varying assumptions about "expectations"
  (from uniform to data-driven)
Calculate in R with the concord package and the function kripp.alpha
Stewart (Princeton) Text as Data June 28-29, 2018 138 / 187
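Usage of the call mentioned on the slide, sketched with invented codes. kripp.alpha() expects a coders × documents matrix; if concord is unavailable, the irr package provides a function with the same name and interface:

library(irr)   # or library(concord), as on the slide
codes <- rbind(coderA = c(1, 1, 2, 2, 1, 3),
               coderB = c(1, 1, 2, 3, 1, 3))
kripp.alpha(codes, method = "nominal")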
Three categories of documents

Hand labeled
- Training set (what we’ll use to estimate model)
- Validation set (what we’ll use to assess model)
Unlabeled
- Test set (what we’ll use the model to categorize)
Label more documents than necessary to train model

Stewart (Princeton) Text as Data June 28-29, 2018 139 / 187
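A minimal sketch of such a split in R (counts and object names invented):

set.seed(42)
n_labeled <- 500                       # hand-labeled documents
idx <- sample(n_labeled)               # shuffle before splitting
train_idx      <- idx[1:400]           # used to estimate the model
validation_idx <- idx[401:500]         # used to assess the model
# all remaining unlabeled documents form the test set to be categorized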


Methods to Perform Supervised Classification

- Use the hand labels to train a statistical model.


- Naive Bayes
- Shockingly simple application of Bayes’ rule
- Use of a terribly implausible independence assumption
- Shockingly useful: often a default classifier

Stewart (Princeton) Text as Data June 28-29, 2018 140 / 187


Naive Bayes and General Problem Setup

Suppose we have documents i = 1, . . . , N, each with J features:
xi = (x1i, x2i, . . . , xJi)
Set of K categories, {C1, C2, . . . , CK}
Subset of labeled documents Y = (Y1, Y2, . . . , YNtrain), where
Yi ∈ {C1, C2, . . . , CK}
Goal: classify every document into one category
Learn a function that maps from the space of (possible) documents to
categories
To do this: use the hand-coded observations to estimate (train) a
regression model
Apply the model to the test data, classify those observations

Stewart (Princeton) Text as Data June 28-29, 2018 141 / 187


Naive Bayes and General Problem Setup (Jurafsky
Inspired Slide)
Goal: For each document xi, we want to infer the most likely category:

C_{Max} = \arg\max_k p(C_k \mid x_i)

We're going to use Bayes' rule to estimate p(Ck | xi):

p(C_k \mid x_i) = \frac{p(C_k, x_i)}{p(x_i)}
                = \frac{\overbrace{p(C_k)}^{\text{Proportion in } C_k}\,\underbrace{p(x_i \mid C_k)}_{\text{Language model}}}{p(x_i)}
Stewart (Princeton) Text as Data June 28-29, 2018 142 / 187


Naive Bayes and Optimization (Jurafsky Inspired
Slide)

C_{Max} = \arg\max_k p(C_k \mid x_i)
        = \arg\max_k \frac{p(C_k)\, p(x_i \mid C_k)}{p(x_i)}
        = \arg\max_k p(C_k)\, p(x_i \mid C_k)

Two probabilities to estimate:

p(C_k) = \frac{\text{No. Documents in } k}{\text{No. Documents}} (training set)

p(xi | Ck) is complicated without assumptions:
- Imagine each xij is just a binary indicator. Then there are 2^J
  possible documents xi
- Simplify: assume each feature is independent, so

p(x_i \mid C_k) = \prod_{j=1}^{J} p(x_{ij} \mid C_k)
Stewart (Princeton) Text as Data June 28-29, 2018 143 / 187
Naive Bayes and Optimization (Jurafsky Inspired
Slide)

Two components to estimation:

p(C_k) = \frac{\text{No. Documents in } k}{\text{No. Documents}} (training set)

p(x_i \mid C_k) = \prod_{j=1}^{J} p(x_{ij} \mid C_k)

Maximum likelihood estimation (training set):

p(x_{ij} = z \mid C_k) = \frac{\text{No}(x_{ij} = z \text{ and } C = C_k)}{\text{No}(C = C_k)}

Problem: what if No(xij = z and C = Ck) = 0? Then

\prod_{j=1}^{J} p(x_{ij} \mid C_k) = 0

Stewart (Princeton) Text as Data June 28-29, 2018 144 / 187
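A sketch of the MLE step on toy binary features, showing how a zero count poisons the product (all data invented):

X <- rbind(c(1, 0, 1),                       # three labeled documents, J = 3
           c(1, 1, 0),
           c(0, 0, 1))
Y <- c("A", "A", "B")
p_C <- prop.table(table(Y))                  # p(C_k): class proportions
# p(x_ij = 1 | C_k) by raw relative frequency, one row per class
p_x1_given_C <- apply(X, 2, function(col) tapply(col, Y, mean))
p_x1_given_C
# Feature 2 never appears in class B, so p(x_i2 = 1 | B) = 0: any new
# document containing feature 2 gets probability 0 under class B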


Naive Bayes and General Problem Setup (Jurafsky
Inspired Slide)
Solution: smoothing (Bayesian estimation)

p(x_{ij} = z \mid C_k) = \frac{\text{No}(x_{ij} = z \text{ and } C = C_k) + 1}{\text{No}(C = C_k) + k}

Algorithm steps:
1) Learn p̂(Ck) and p̂(xi | Ck) on the training data
2) Use these to identify the most likely Ck for each document i in the
   test set:

C_i = \arg\max_k \hat{p}(C_k)\, \hat{p}(x_i \mid C_k)

Simple intuition about Naive Bayes:
- Learn what documents in class k look like
- Find the class k that document i is most similar to

Stewart (Princeton) Text as Data June 28-29, 2018 145 / 187
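A compact sketch of the full algorithm with add-one smoothing, computed in log space to avoid underflow. All names and data are invented; note that for a binary feature the usual add-one denominator is the class size plus 2 (the slide writes + k):

train_X <- rbind(c(1, 0, 1), c(1, 1, 0), c(0, 0, 1), c(0, 1, 1))
train_Y <- c("A", "A", "B", "B")
classes <- unique(train_Y)
# Smoothed p(x_j = 1 | C_k): (count + 1) / (class size + 2)
p1 <- sapply(classes, function(k)
  (colSums(train_X[train_Y == k, , drop = FALSE]) + 1) / (sum(train_Y == k) + 2))
log_prior <- log(prop.table(table(train_Y))[classes])   # log p-hat(C_k)
predict_nb <- function(x) {
  # log p(C_k) + sum_j log p(x_j | C_k); use p1 when x_j = 1, 1 - p1 otherwise
  scores <- log_prior + colSums(x * log(p1) + (1 - x) * log(1 - p1))
  classes[which.max(scores)]
}
predict_nb(c(1, 0, 0))   # -> "A" on this toy data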


Comparing Training and Validation Set

Text classification and model assessment


- Replicate classification exercise with validation set
- General principle of classification/prediction
- Compare supervised learning labels to hand labels

Stewart (Princeton) Text as Data June 28-29, 2018 146 / 187


Classification

Many, Many Algorithms


Gradient boosted machines are a good default
Deep learning approaches work well if you have enough data
Lots of Training Parameters

Stewart (Princeton) Text as Data June 28-29, 2018 147 / 187


Classification

Key Supervised Learning Assumptions


Random Sample
Joint Distribution assumed to be same in training and unlabeled
set
Features are enough to explain the outcome
Still concerns about overfitting

Stewart (Princeton) Text as Data June 28-29, 2018 148 / 187


Hand and Illusion of Advances

Hand (2006) argues that most new algorithms only provide the
illusion of progress. He makes 3 major arguments:
1) Population drift is a bigger problem than people accept
2) Some types of complexity (donut holes etc.) are not that big of a deal
3) Major gains only come after the first step.
In general: better features beat better models every time.

Stewart (Princeton) Text as Data June 28-29, 2018 149 / 187


ReadMe: Optimization for Quantification
(Hopkins and King 2010)
Most classifiers focus on individual document classification.
But what if we're focused on proportions only?
Hopkins and King (2010): a method for characterizing the distribution of classes
Can be much more accurate than individual classifiers and requires fewer
assumptions (no need for a completely random sample of documents).
- King and Lu (2008): derive the method for characterizing causes of
death from verbal autopsies

- Hopkins and King (2010): extend the method to text documents


Basic intuition:
- Examine the joint distribution of characteristics (without making a Naive
Bayes-like independence assumption)

- The focus on distributions (only) is what makes this analysis possible


Stewart (Princeton) Text as Data June 28-29, 2018 150 / 187

ReadMe: Optimization for a Different Goal
(Hopkins and King 2010)
Measure only presence/absence of each term [a (J × 1) vector]

x_i = (1, 0, 0, 1, ..., 0)

What are the possible realizations of x_i?

- 2^J possible vectors

Define:

P(x) = probability of observing x
P(x|C_j) = probability of observing x conditional on category C_j
P(X|C) = matrix collecting these vectors
P(C) = P(C_1, C_2, ..., C_K), the target quantity of interest

Stewart (Princeton) Text as Data June 28-29, 2018 151 / 187


ReadMe: Optimization for a Different Goal
(Hopkins and King 2010)

P(x) = P(x|C ) P(C )


| {z } | {z } | {z }
2J x1 2J xK Kx1

Matrix algebra problem to solve, for P(C )


Like Naive Bayes, requires two pieces to estimate
Complication 2J >> no. documents
Kernel Smoothing Methods (without a formal model)
- P(x) = estimate directly from test set
- P(x|C ) = estimate from training set
- Key assumption: P(x|C ) in training set is equivalent to P(x|C )
in test set

Stewart (Princeton) Text as Data June 28-29, 2018 152 / 187


ReadMe: Optimization for a Different Goal
(Hopkins and King 2010)

P(x) = P(x|C ) P(C )


| {z } | {z } | {z }
2J x1 2J xK Kx1

Matrix algebra problem to solve, for P(C )


Like Naive Bayes, requires two pieces to estimate
Complication 2J >> no. documents
Kernel Smoothing Methods (without a formal model)
- P(x) = estimate directly from test set
- P(x|C ) = estimate from training set
- Key assumption: P(x|C ) in training set is equivalent to P(x|C )
in test set
- If true, can perform biased sampling of documents, worry less
about drift...
Stewart (Princeton) Text as Data June 28-29, 2018 152 / 187
Algorithm Summarized

- Estimate p̂(x) from test set


- Estimate p̂(x|C ) from training set
- Use p̂(x) and p̂(x|C ) to solve for p(C )

Stewart (Princeton) Text as Data June 28-29, 2018 153 / 187
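A minimal sketch of that last step, assuming p̂(x) and p̂(x|C) have already been tabulated over a manageable set of word profiles (the actual ReadMe implementation averages estimates over many random subsets of terms); solving for p(C) is then non-negative least squares, renormalized onto the simplex:

# Sketch of the ReadMe estimation step: solve P(x) = P(x|C) P(C) for P(C).
# p_x: profile frequencies from the unlabeled set, shape (n_profiles,)
# p_x_given_c: profile frequencies by category from the training set,
#              shape (n_profiles, K)
import numpy as np
from scipy.optimize import nnls

def estimate_proportions(p_x, p_x_given_c):
    beta, _ = nnls(p_x_given_c, p_x)   # non-negative least squares
    return beta / beta.sum()           # renormalize onto the simplex

# Toy example: 3 word profiles, K = 2 categories
p_x_given_c = np.array([[0.7, 0.1],
                        [0.2, 0.3],
                        [0.1, 0.6]])
true_p_c = np.array([0.4, 0.6])
p_x = p_x_given_c @ true_p_c
print(estimate_proportions(p_x, p_x_given_c))   # recovers [0.4, 0.6]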


Assessing Model Performance

Not classifying individual documents → different standards


Mean Square Error:

E[(θ̂ − θ)²] = var(θ̂) + Bias(θ̂, θ)²

Suppose we have true proportions P(C)^true. Then, we'll estimate

Root Mean Square Error:

RMSE = √( Σ_{j=1}^{J} (P(C_j)^true − P(C_j))² / J )
Visualize: plot true and estimated proportions

Stewart (Princeton) Text as Data June 28-29, 2018 154 / 187
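For concreteness, a short sketch of the RMSE computation over estimated category proportions:

# RMSE between true and estimated category proportions.
import numpy as np

def rmse(p_true, p_hat):
    p_true, p_hat = np.asarray(p_true), np.asarray(p_hat)
    return np.sqrt(np.mean((p_true - p_hat) ** 2))

print(rmse([0.2, 0.3, 0.5], [0.25, 0.25, 0.5]))   # ~0.041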


Scaling

Scaling involves a (usually) one-dimensional continuous latent representation.
Five important strategies:
1) scoring based on human-set weights (e.g. dictionary methods)
2) supervision of convenience (e.g. wordscores)
3) fictitious prediction problem based on binary categories
4) prediction problem with human-provided scores (e.g. regression)
5) unsupervised scaling (e.g. wordfish, IRT, PCA)
Scaling is quite similar to the other approaches we have covered,
just with fewer dimensions.

Stewart (Princeton) Text as Data June 28-29, 2018 155 / 187


Scaling via Naive Bayes (Beauchamp)
We can also scale via an unnormalized Naive Bayes classifier
We are interested in estimating the probability that a document
belongs to a class (R) given that we are presented with a document
of an unknown class (S). From Bayes Rule we know that:

P(R|S) = P(S|R) P(R) / P(S)

The probability of the words given the class is denoted P(w_i|R).
Thus we take P(S|R) to be the independent product over all words
in the document:

P(S|R) = ∏_i P(w_i|R)

P(R|S) = (P(R) / P(S)) ∏_i P(w_i|R)

Stewart (Princeton) Text as Data June 28-29, 2018 156 / 187


Scaling
We can generate a symmetrical equation for not-R which can be used
to produce a likelihood ratio:
P(R′|S) = (P(R′) / P(S)) ∏_i P(w_i|R′)

P(R|S) / P(R′|S) = (P(R) / P(R′)) ∏_i p(w_i|R) / p(w_i|R′)

Because in practice the product is quite small, we work with the log
ratio.

log[ P(R|S) / P(R′|S) ] = log[ P(R) / P(R′) ] + Σ_i log[ p(w_i|R) / p(w_i|R′) ]
                        ≈ Σ_i log[ p(w_i|R) / p(w_i|R′) ]

We estimate p(w_i|R) as the percentage of word w_i in document R and report it.

Stewart (Princeton) Text as Data June 28-29, 2018 157 / 187
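A minimal sketch of this scaling, assuming two reference token lists stand in for known R and not-R writings, with add-one smoothing so the log ratio stays finite (all names and example tokens here are illustrative):

# Naive Bayes scaling sketch: score a document by the summed log ratio of
# word probabilities under two reference corpora R and R' (add-one smoothing).
import math
from collections import Counter

def word_probs(tokens, vocab, alpha=1.0):
    counts = Counter(tokens)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def nb_scale(doc_tokens, ref_r, ref_not_r):
    vocab = set(ref_r) | set(ref_not_r) | set(doc_tokens)
    p_r = word_probs(ref_r, vocab)
    p_not_r = word_probs(ref_not_r, vocab)
    return sum(math.log(p_r[w] / p_not_r[w]) for w in doc_tokens)

ref_r = "jihad fighting obligatory lands attacked".split()    # toy R corpus
ref_not_r = "prayer fasting charity ruling imam".split()      # toy not-R
print(nb_scale("jihad is obligatory".split(), ref_r, ref_not_r))  # positive => closer to R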

Islamic Clerics and Jihad (Nielsen)
Why do some Islamic Clerics support militant Jihad?
[Figure: short excerpts from three clerics scored on the Jihad scale: a passage on the nature of the religion (Sayyid Qutb), a quietist ruling on Friday-prayer etiquette (Ibn Uthaymeen), and a ruling that Jihad becomes an individually binding obligation when Muslim lands are attacked (Abdallah Azzam). Below, a histogram of cleric Jihad scores from roughly −0.15 to 0.05, marking Ibn Baz, Ibn Uthaymeen, Sayyid Qutb, Abdallah Azzam, and Usama bin Laden.]
Stewart (Princeton) Text as Data June 28-29, 2018 158 / 187


Islamic Clerics and Jihad (Nielsen)
Why do some Islamic Clerics support militant Jihad?
[Figure: word cloud of translated terms arrayed along a Jihadi (left) to Not Jihadi (right) axis, with word size proportional to word frequency (legend: a = 1/250, 1/500, 1/1000, 1/2000).]
Stewart (Princeton) Text as Data June 28-29, 2018 158 / 187


Q&A and Code

Stewart (Princeton) Text as Data June 28-29, 2018 159 / 187


1 Session 1: Getting Started with Text in Social Science
What Text Methods Can Do
Core Concepts and Principles
Represent
Example: Understanding Chinese Censorship

2 Session 2: Discover
Clustering
Interpretation
Mixed Membership Models
Example: Discovery in Congressional Communication

3 Session 3: Measure
Choosing a Model
Example: Party Manifestoes in Japan
Revisiting K
Structural Topic Model
Supervised Learning
Scaling

4 Session 4: Additional Approaches to Measurement


Difficulties with Trends
TextReuse
Readability
A few more papers

Stewart (Princeton) Text as Data June 28-29, 2018 160 / 187




Measuring Slant

Enormous interest across fields in measuring slant, polarization, etc.
Gentzkow and Shapiro (2010) present a method looking at
language change in newspapers
Jensen et al (2012) follow up with a study of 130 years of
partisan speech, finding that polarization is not unusually high
today and was higher in the late 19th and early 20th centuries
Gentzkow, Shapiro and Taddy (2016) investigate properties of
these measures for high-dimensional data
Their goal: measure polarization in opinions or behaviors and
characterize its evolution over time.

Stewart (Princeton) Text as Data June 28-29, 2018 162 / 187


The Setup

Given a vector of choices c_it for a sample of individuals i ∈ R ∪ D at
time t, how different are the choices of individuals in R from
individuals in D at each t?

If c_it is scalar, we can do differences in sample means etc.

If it's indicators from 100 neighborhoods or answers to 20 survey
questions, we have a problem of measuring segregation.

Task: map counts c_it into a scalar s_t so that we can use s_1, ..., s_T to
answer questions about change in segregation/polarization/slant over
time.

Stewart (Princeton) Text as Data June 28-29, 2018 163 / 187


Difficulties in Measuring Segregation

Many proposed indices: isolation index, dissimilarity index,
Atkinson index, mutual information, all derived from some set of
desired properties on the sample
All indices solve the same problem: aggregate differences
across dimensions
Problem when the dimension of c_it is large relative to sample size
(e.g. only 535 legislators, but we might be interested in
differences across millions of phrases)
The issue arises for many types of things: visits to websites, purchases
of products, residence in small areas like census blocks
When most options are chosen by only a few individuals, the
estimated indices are biased

Stewart (Princeton) Text as Data June 28-29, 2018 164 / 187
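For reference, a minimal sketch of two of these indices, computed from counts of R and D individuals across units (zipcodes, phrases, websites); the formulas follow the standard definitions:

# Two standard segregation indices over J units (zipcodes, phrases, ...).
# r[j], d[j]: counts of group R and group D observed in unit j.
import numpy as np

def dissimilarity(r, d):
    r, d = np.asarray(r, float), np.asarray(d, float)
    return 0.5 * np.sum(np.abs(r / r.sum() - d / d.sum()))

def isolation(r, d):
    r, d = np.asarray(r, float), np.asarray(d, float)
    own_share = r / (r + d)                   # R's share within each unit
    return np.sum((r / r.sum()) * own_share)  # R-weighted exposure to own group

r = np.array([10, 0, 5, 5])
d = np.array([0, 10, 5, 5])
print(dissimilarity(r, d), isolation(r, d))   # 0.5, 0.75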


Origins of the Bias

A Thought Experiment
Republicans and Democrats are both 50% of the population and
both groups are uniformly distributed across zipcodes
Our sample is sufficiently small that we get two individuals per
zipcode
Even though the true segregation level is zero, we find that half of
the zipcodes are perfectly segregated
The issue is that these measures rely on the variance across
elements of c_it, which is biased upwards by sampling error.

Stewart (Princeton) Text as Data June 28-29, 2018 165 / 187
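The thought experiment is easy to verify by simulation; a sketch using the isolation index (any of the indices above behaves similarly):

# Simulate the thought experiment: no true segregation, two residents sampled
# per zipcode, yet the estimated isolation index lands far above its true value.
import numpy as np

def isolation(r, d):
    own_share = r / np.maximum(r + d, 1)
    return np.sum((r / r.sum()) * own_share)

rng = np.random.default_rng(1)
n_zipcodes, n_sims = 1000, 200
estimates = []
for _ in range(n_sims):
    r = rng.binomial(2, 0.5, size=n_zipcodes)  # each resident R with prob 0.5
    d = 2 - r
    estimates.append(isolation(r, d))

# The population isolation index is 0.5 (R is half of every zipcode), but
# sampling error pushes the estimate to about 0.75.
print(np.mean(estimates))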


The Problem in Congressional Speech

Stewart (Princeton) Text as Data June 28-29, 2018 166 / 187


Gentzkow, Shapiro and Taddy (2016)

They show this problem affects Gentzkow and Shapiro (2010) as
well as Jensen et al (2012)
Present a measure correcting for the small-sample bias
Show that, contra Jensen et al (2012), partisanship was low until
the 1980s, then shifted sharply upwards, and is now at an
unprecedented high.
More generally, this is an example of how intuition about the
behavior of our measures can fail when shifted into high
dimensions

Stewart (Princeton) Text as Data June 28-29, 2018 167 / 187


Measurement Approach
Specify a multinomial logit model of speech in which the utility
to speaker i of using phrase j is determined by measured
and unmeasured factors
Partisanship of a phrase is the effect of party affiliation on
the mean utility of using the phrase
A speaker's partisanship is the frequency-weighted mean
partisanship of the phrases used by the speaker.
This is essentially the MNIR projection onto party. They
characterize session partisanship as the difference between the mean
partisanship of Republicans and the mean partisanship of
Democrats
Note: without regularization this is overfit, just like the measures
we showed above.

Stewart (Princeton) Text as Data June 28-29, 2018 168 / 187


Some Important Details

A “speech” is an uninterrupted utterance


They exclude speeches made by speakers identified by office
rather than name (speaker of the house, president of the senate)
They use bigram counts after having stemmed and removed
stopwords
They filter out procedural phrases, phrases that include a congressperson's
name or a state's name, and a few more
They trim the data to remove phrases that appear fewer than 10 times
in every congress or are used by fewer than 75 speakers
Final sample: 723,198 unique phrases by 7,285 unique speakers,
modeled at the speaker-session level (resulting in 33,486
observations)

Stewart (Princeton) Text as Data June 28-29, 2018 169 / 187
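A rough sketch of that pipeline (NLTK's Porter stemmer and English stopword list stand in for the paper's exact choices, and the trimming threshold here is a toy value):

# Sketch of the preprocessing pipeline: stem, drop stopwords, count bigrams,
# then trim rare bigrams. Requires nltk.download('stopwords') once.
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stops = set(stopwords.words("english"))

def bigrams(speech):
    toks = [stemmer.stem(w) for w in speech.lower().split() if w not in stops]
    return zip(toks, toks[1:])

speeches = ["the death tax punishes hardworking families",
            "the estate tax funds essential public services"]
counts = Counter(b for s in speeches for b in bigrams(s))

min_count = 1   # the paper trims at 10 mentions per congress / 75 speakers
kept = {b: c for b, c in counts.items() if c >= min_count}
print(kept)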


A technical aside
They employ a clever technique from Taddy (2016), Distributed
Multinomial Regression, which approximates the multinomial likelihood
using independent Poissons:

c_itj ~ Poisson( m_it exp(η_itj) )

Surprisingly (to me at least) this actually works.
Why? Multinomial dependence is induced only by the totals.
Decompose the partisan loadings as:

φ_jt = φ̄_j + Σ_{k=1}^{T} φ̃_jk 1{t > k}

with penalty

c(φ_jt) = λ_j ( |φ̄_j| + Σ_k |φ̃_jk| )

Stewart (Princeton) Text as Data June 28-29, 2018 170 / 187
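A minimal sketch of the Poisson approximation (assuming statsmodels; the data are simulated, and each phrase gets its own independent regression with log m_it as an offset, which is what makes the fits distributable across phrases):

# Distributed Multinomial Regression sketch: fit each phrase's counts with an
# independent Poisson regression, offsetting by log of speaker verbosity m_it.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n_speakers, n_phrases = 300, 5
party = rng.integers(0, 2, n_speakers).astype(float)   # 1 = Republican
m = rng.integers(500, 2000, n_speakers)                # total phrases spoken
true_phi = np.array([0.8, -0.5, 0.0, 0.3, -0.2])       # partisan loadings
rate = np.exp(-7.0 + party[:, None] * true_phi)        # per-word phrase rate
counts = rng.poisson(m[:, None] * rate)

X = sm.add_constant(party)
phi_hat = []
for j in range(n_phrases):                             # embarrassingly parallel
    fit = sm.GLM(counts[:, j], X, family=sm.families.Poisson(),
                 offset=np.log(m)).fit()
    phi_hat.append(fit.params[1])                      # party coefficient
print(np.round(phi_hat, 2))                            # roughly recovers true_phi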


Previous Measures

Stewart (Princeton) Text as Data June 28-29, 2018 171 / 187


Previous Measures

Stewart (Princeton) Text as Data June 28-29, 2018 172 / 187


Their Finding

Stewart (Princeton) Text as Data June 28-29, 2018 173 / 187


The Role of Penalization

Stewart (Princeton) Text as Data June 28-29, 2018 174 / 187


Robustness checks

Stewart (Princeton) Text as Data June 28-29, 2018 175 / 187


Concluding Thoughts on This

A lot to love here: cool methods, a fun problem, extensibility to
other problems (racial segregation in browsing histories, spatial
segregation, consumer product purchasing, etc.)
Exposes problems that can happen in high dimensions when
sampling variance and estimation bias are ignored
Creates a generative model and uses it to make inferences
Emphasizes the importance of thinking through what you want
the measure to be used for (do we care about inferring a
population quantity?)
Above all, it demonstrates the importance of checking what you
are doing! There is nothing new in their methods to diagnose
the problem.

Stewart (Princeton) Text as Data June 28-29, 2018 176 / 187


1 Session 1: Getting Started with Text in Social Science
What Text Methods Can Do
Core Concepts and Principles
Represent
Example: Understanding Chinese Censorship

2 Session 2: Discover
Clustering
Interpretation
Mixed Membership Models
Example: Discovery in Congressional Communication

3 Session 3: Measure
Choosing a Model
Example: Party Manifestoes in Japan
Revisiting K
Structural Topic Model
Supervised Learning
Scaling

4 Session 4: Additional Approaches to Measurement


Difficulties with Trends
TextReuse
Readability
A few more papers

Stewart (Princeton) Text as Data June 28-29, 2018 177 / 187


TextReuse

Stewart (Princeton) Text as Data June 28-29, 2018 178 / 187




1 Session 1: Getting Started with Text in Social Science
What Text Methods Can Do
Core Concepts and Principles
Represent
Example: Understanding Chinese Censorship

2 Session 2: Discover
Clustering
Interpretation
Mixed Membership Models
Example: Discovery in Congressional Communication

3 Session 3: Measure
Choosing a Model
Example: Party Manifestoes in Japan
Revisiting K
Structural Topic Model
Supervised Learning
Scaling

4 Session 4: Additional Approaches to Measurement


Difficulties with Trends
TextReuse
Readability
A few more papers

Stewart (Princeton) Text as Data June 28-29, 2018 179 / 187


Readability Scores

There is broad interest in the sophistication / readability of texts.


e.g. has the U.S. State of the Union become dumbed down over
time? Do politicians who are easier to understand connect better
with constituents?
Rich literature on assigning “grade levels” to texts.
Flesch (1948) offers the Flesch Reading Ease score:

206.835 − 1.015 (total words / total sentences) − 84.6 (total syllables / total words)

Kincaid et al. translate it to grade levels

Stewart (Princeton) Text as Data June 28-29, 2018 180 / 187
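A minimal sketch of the score, with a crude vowel-group heuristic standing in for a real syllable counter (a production version would use a pronunciation dictionary such as CMUdict):

# Flesch Reading Ease with a crude vowel-group syllable heuristic.
import re

def count_syllables(word):
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835 - 1.015 * (len(words) / sentences)
            - 84.6 * (syllables / len(words)))

print(flesch_reading_ease("The cat sat on the mat. It was happy."))   # ~108, easy
print(flesch_reading_ease(
    "Institutional heterogeneity complicates intertemporal comparisons."))  # negative, hard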


Examples (via Spirling)

Score Text
-800 Molly Bloom’s (3.6K) Soliloquy, Ulysses
33 judicial opinion
45 life insurance requirement in Florida
48 New York Times
65 Reader’s Digest
67 Al Qaeda press release
77 Dickens’ complete works
80 childen’s books
90 death row inmate last statements
100 this entry right here.

Stewart (Princeton) Text as Data June 28-29, 2018 181 / 187


Notes about Flesch Scoring

Uses only syllable information and not rareness of the word.


Seems to work in practice, but isn’t based on first principles.
One of many reading indices, but they are typically highly
correlated in practice.

Stewart (Princeton) Text as Data June 28-29, 2018 182 / 187


Readability Applications

Stewart (Princeton) Text as Data June 28-29, 2018 183 / 187




1 Session 1: Getting Started with Text in Social Science
What Text Methods Can Do
Core Concepts and Principles
Represent
Example: Understanding Chinese Censorship

2 Session 2: Discover
Clustering
Interpretation
Mixed Membership Models
Example: Discovery in Congressional Communication

3 Session 3: Measure
Choosing a Model
Example: Party Manifestoes in Japan
Revisiting K
Structural Topic Model
Supervised Learning
Scaling

4 Session 4: Additional Approaches to Measurement


Difficulties with Trends
TextReuse
Readability
A few more papers

Stewart (Princeton) Text as Data June 28-29, 2018 184 / 187




Stewart (Princeton) Text as Data June 28-29, 2018 185 / 187
Thanks!

Stewart (Princeton) Text as Data June 28-29, 2018 186 / 187


For more information

BrandonStewart.org

Stewart (Princeton) Text as Data June 28-29, 2018 187 / 187
