Text as Data in the Social Sciences
Brandon Stewart
Princeton University
Code at https://bit.ly/2KtDziR
Huge thanks to Justin Grimmer for many slides included here (see
https://github.com/justingrimmer/TAD).
Stewart (Princeton) Text as Data June 28-29, 2018 1 / 187
Big Data Social Science
Copies at BrandonStewart.org
2 Session 2: Discover
Clustering
Interpretation
Mixed Membership Models
Example: Discovery in Congressional Communication
3 Session 3: Measure
Choosing a Model
Example: Party Manifestoes in Japan
Revisiting K
Structural Topic Model
Supervised Learning
Scaling
[Figure: time series by Year, 1920 to 1975; y-axis from 0.000 to 0.050]
International Events (O’Connor, Stewart & Smith)
O’Connor, Stewart, and Smith. “Learning to Extract International Relations from Political
Context.” Association for Computational Linguistics. 2013
[Figure: Number of Articles per Year by paradigm: Realism, Liberalism, Constructivism, Non-paradigmatic]
Modeling the Progress of Science
(Blei and Lafferty)
Blei, David M., and John D. Lafferty. “Dynamic topic models.” Proceedings of the 23rd
international conference on Machine learning. 2006.
What Do People Search For?
WebSeer http://hint.fm/seer/
Term              Doc1  Doc2
power                1     0
grow                 1     0
out                  1     0
barrel of a gun      1     0
compar               0     1
polit                1     2
chicago              0     1
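As a rough illustration of how such a document-term matrix can be built, the sketch below counts pre-processed tokens per document. The token lists are assumptions chosen only to reproduce the counts in the table (the original texts are not given here), and `document_term_matrix` is a hypothetical helper name, not a function from any package.

```python
from collections import Counter

# Hypothetical pre-processed documents: tokens are already stemmed, and the
# phrase "barrel of a gun" is kept as a single n-gram feature. These token
# lists are assumptions chosen to reproduce the counts in the table.
docs = {
    "Doc1": ["polit", "power", "grow", "out", "barrel of a gun"],
    "Doc2": ["compar", "polit", "chicago", "polit"],
}

def document_term_matrix(docs):
    """Return (vocabulary, rows), where rows[term] lists the count per document."""
    counts = {name: Counter(tokens) for name, tokens in docs.items()}
    vocab = sorted(set().union(*[set(c) for c in counts.values()]))
    # A Counter returns 0 for terms a document never uses.
    rows = {term: [counts[name][term] for name in docs] for term in vocab}
    return vocab, rows

vocab, dtm = document_term_matrix(docs)
for term in vocab:
    print(f"{term:<16}" + "".join(f"{n:>5}" for n in dtm[term]))
```

Word order is discarded at this step: each document is represented only by how often each term occurs.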
Three answers
1) It might not: validation is critical (task specific)
2) There is a central tendency in text: words often imply what a
text is about (war, civil, union) or its tone (consecrate,
dead, died, lives).
Such words are likely to be used repeatedly: they create a theme for an article
3) Human supervision can help: injecting human judgement (coders)
helps methods identify subtle relationships between words and
outcomes of interest
It is easier to capture some things than others.
Censorship & Post Volume are “Bursty”
[Figure: Count Published and Count Censored, January to July; y-axis: Count (0 to 70); annotated burst: Riots in Zengcheng]
Unit of analysis:
- volume burst (≈ 3 SDs greater than baseline volume)
They monitored 85 topic areas
Found 87 volume bursts in total
Identified real world events associated with each burst
Their hypothesis: The government censors all posts in volume bursts
associated with events with collective action potential
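The "≈ 3 SDs greater than baseline volume" rule can be sketched as a simple threshold on daily post counts. This is only an illustration under stated assumptions: the baseline is taken from an initial window of days (`baseline_days`, a hypothetical parameter), the counts are made up, and the original study's exact procedure may differ.

```python
from statistics import mean, stdev

def find_bursts(daily_counts, baseline_days, n_sd=3.0):
    """Return indices of days whose volume exceeds baseline mean + n_sd * SD.

    Assumption: the first `baseline_days` observations represent normal
    (non-burst) volume and are used to estimate the baseline.
    """
    baseline = daily_counts[:baseline_days]
    threshold = mean(baseline) + n_sd * stdev(baseline)
    return [i for i, c in enumerate(daily_counts) if c > threshold]

# Made-up daily post counts for one topic area, with a spike on days 6 and 7.
counts = [10, 12, 9, 11, 10, 13, 55, 48, 12, 11, 9, 10]
bursts = find_bursts(counts, baseline_days=6)
print(bursts)
```

Estimating the baseline from a quiet window keeps the burst itself from inflating the standard deviation, which would otherwise hide the spike.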
Observational Test 2: The Event Generating
Volume Bursts
[Figure: density of Censorship Magnitude (x-axis, -0.2 to 0.8) by event type: Policy, News, Collective Action, Criticism of Censors, Pornography]
[Figure: Censorship Magnitude, y-axis from 0.0 to 0.3]
[Figure: Censorship Difference (Pro − Anti) by event, y-axis from −0.5 to 1.0. Events shown: Panxu Protest, Tibetan Self-Immolations, Ai Weiwei Album, Protests in Xinjiang, Eliminate Golden Week, Rental Tax, Yellow Light Fines, Stock Market Crash, Corruption Policy, Gender Imbalance]
Think!
- No one statistic captures how you want to use your data
- But statistics can help guide your selection
- Combine statistics with manual search (discuss statistical
methods/experimental methods)
- Humans should be the final judge
- Compare insights across clusterings
How do we Choose K? Summary
Generate many candidate models, then:
1) Assess model fit using surrogate statistics
2) Use experiments
3) Read
4) Use a combination of the above to make the final decision
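Step 1 could look something like the sketch below: fit candidate models for several values of K and record one surrogate statistic for each. Everything here is an assumption for illustration: the model is a small hand-written k-means with deterministic farthest-point initialisation, the statistic is the within-cluster sum of squares, and the 2-D points are made up. The statistic narrows the candidates; it does not replace steps 2 to 4.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centers):
    # Index of the nearest center for every point (ties go to the lower index).
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]

def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        # Add the point that is farthest from its nearest existing center.
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    for _ in range(n_iter):
        labels = assign(points, centers)
        new_centers = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            new_centers.append(
                tuple(sum(col) / len(members) for col in zip(*members))
                if members else centers[j])  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return assign(points, centers), centers

def within_ss(points, labels, centers):
    # Surrogate fit statistic: total squared distance to the assigned center.
    return sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))

# Made-up data: three tight groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (0, 10), (0, 11), (1, 10), (1, 11)]

wss = {}
for k in range(1, 5):
    labels, centers = kmeans(points, k)
    wss[k] = within_ss(points, labels, centers)
    print(k, round(wss[k], 2))
```

The statistic keeps falling as K grows, so it cannot pick K on its own; the sharp drop up to K = 3 followed by a small gain at K = 4 is a cue for which models to read closely.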
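\[
\mathrm{Var}(\log \text{Odds Ratio}_j) \approx \frac{1}{x_{jD} + \alpha_j} + \frac{1}{x_{jR} + \alpha_j}
\]
\[
\text{Std. Log Odds}_j = \frac{\log \text{Odds Ratio}_j}{\sqrt{\mathrm{Var}(\log \text{Odds Ratio}_j)}}
\]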
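A small numerical sketch of the standardized log odds follows. The variance term is the approximation given above; the log odds ratio itself is not defined in this section, so the code assumes it is the difference between the two groups' smoothed log odds of using word j. The counts, the prior value 0.5, and the function name are all made up for illustration.

```python
import math

def standardized_log_odds(x_d, x_r, alpha):
    """Return {word: standardized log odds ratio} for two groups, D and R.

    x_d, x_r: dicts mapping word -> count in each group's documents.
    alpha:    dict mapping word -> prior pseudo-count alpha_j (must be > 0).
    """
    n_d, n_r, a0 = sum(x_d.values()), sum(x_r.values()), sum(alpha.values())
    out = {}
    for w, a in alpha.items():
        yd, yr = x_d.get(w, 0) + a, x_r.get(w, 0) + a
        # Assumed definition: difference of smoothed log odds between groups.
        log_or = math.log(yd / (n_d + a0 - yd)) - math.log(yr / (n_r + a0 - yr))
        # Variance approximation: 1/(x_jD + alpha_j) + 1/(x_jR + alpha_j).
        var = 1.0 / yd + 1.0 / yr
        out[w] = log_or / math.sqrt(var)
    return out

# Made-up word counts for two parties.
x_d = {"health": 30, "tax": 5, "iraq": 15}
x_r = {"health": 5, "tax": 30, "iraq": 15}
alpha = {w: 0.5 for w in x_d}

z = standardized_log_odds(x_d, x_r, alpha)
for w in sorted(z, key=z.get, reverse=True):
    print(f"{w:<8}{z[w]:>8.2f}")
```

Dividing by the standard deviation shrinks the scores of rare words, whose raw log odds ratios are large mostly because they are estimated from few observations.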
[Figure: scatterplot of word stems; y-axis: Mutual Information (0.000 to 0.012). Stems plotted highest include republican, strategi, freedom, start.]
[Figure: scatterplot of word stems; y-axis: Mutual Information (0.00 to 0.04). Stems plotted highest include compani, profit, skyrocket, consum.]
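The y-axis in these plots is mutual information. As a hedged sketch of how such a score can be computed for a single word, the code below measures the mutual information between the presence of a stem in a document and a binary document label. Using party as the label is an assumption, and the four toy documents are made up.

```python
import math

def mutual_information(docs, labels, word):
    """MI (in nats) between presence of `word` and a document label.

    docs:   list of sets of stems, one set per document.
    labels: list of labels (for example party), one per document.
    """
    n = len(docs)
    mi = 0.0
    for present in (True, False):
        p_w = sum(1 for d in docs if (word in d) == present) / n
        for label in set(labels):
            p_l = sum(1 for l in labels if l == label) / n
            joint = sum(1 for d, l in zip(docs, labels)
                        if (word in d) == present and l == label) / n
            if joint > 0:  # cells with zero probability contribute nothing
                mi += joint * math.log(joint / (p_w * p_l))
    return mi

# Made-up stemmed documents with a hypothetical party label each.
docs = [{"oil", "profit", "compani"}, {"profit", "price"},
        {"drill", "suppli"}, {"oil", "suppli"}]
labels = ["D", "D", "R", "R"]

mi_profit = mutual_information(docs, labels, "profit")
mi_oil = mutual_information(docs, labels, "oil")
print(round(mi_profit, 3), round(mi_oil, 3))
```

A word used by only one group carries the most information about the label, while a word used equally by both groups scores zero however frequent it is.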
\[
E[Y \mid X] = f(X_1, X_2, \ldots, X_J)
\]
\[
E[X \mid Y] = g(Y)
\]
[Figure: space of clusterings. Each point is one clustering of the same documents, labeled by method and distance metric (for example hclust, kmeans, kmedoids, affprop, divisive, spec_*, mspec_*, dist_*).]
Annotations on the figure, in the order they are built up:
- Affinity Propagation-Cosine (Dueck and Frey 2007)
- Close to: Mixture of von Mises-Fisher distributions (Banerjee et al. 2005) ⇒ Similar clustering of documents
- Space between methods: local cluster ensemble
- Found a region with clusterings that all reveal the same important insight
- 0.30 Spectral clustering, Random Walk (Metrics 1-6)
- 0.13 Hclust-Correlation-Ward
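Placing clusterings in a common space requires a distance between any two clusterings of the same documents. The slides do not say which distance is used, so the sketch below assumes one common choice, the variation of information; the three label vectors are made up and only borrow method names from the figure.

```python
import math
from collections import Counter

def variation_of_information(a, b):
    """VI distance between two clusterings given as equal-length label lists."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    # Entropy of each clustering's cluster-size distribution.
    h_a = -sum(c / n * math.log(c / n) for c in pa.values())
    h_b = -sum(c / n * math.log(c / n) for c in pb.values())
    # Mutual information between the two sets of cluster assignments.
    mi = sum(c / n * math.log((c / n) / ((pa[i] / n) * (pb[j] / n)))
             for (i, j), c in pab.items())
    return h_a + h_b - 2.0 * mi

# Made-up cluster labels for six documents from three hypothetical methods.
kmeans_labels = [0, 0, 1, 1, 2, 2]
hclust_labels = [1, 1, 0, 0, 2, 2]   # same partition, different label names
affprop_labels = [0, 0, 0, 1, 1, 1]  # a genuinely different partition

print(variation_of_information(kmeans_labels, hclust_labels))
print(variation_of_information(kmeans_labels, affprop_labels))
```

Because the distance depends only on which documents are grouped together, relabeled copies of the same partition sit at distance zero. A matrix of such pairwise distances could then be projected to two dimensions to draw a map like the one in the figure.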
Example Discovery
[Figure: space of clusterings with one clustering selected; inset panel "Clusters in this Clustering" (Mayhew)]
Clusters in this Clustering:
- Credit Claiming, Pork: “[...] (D-NJ) announced that the U.S. Department of Commerce has awarded a $100,000 grant to the South Jersey Economic Development District”
- Credit Claiming, Legislation: “[...] Lautenberg today pointed to a string of victories in Congress on his legislative agenda during this work period”
- Advertising: “Senate Adopts Lautenberg/Menendez Resolution Honoring Spelling Bee Champion from New Jersey”
Example Discovery: Partisan Taunting
[Figure: map of clusterings. Point labels name clustering methods (e.g., hclust, kmeans, kmedoids, affprop, divisive, mixvmf, som, rock, spec_*, mspec_*) with their distance metrics and linkages. Inset "Clusters in this Clustering" (visible label: Legislation).]
Partisan Taunting:
“Republicans Selling Out Nation on Chemical Plant Security”
In Sample Illustration of Partisan Taunting
Important Concept Overlooked in Mayhew’s (1974) typology
Sen. Lautenberg
on Senate Floor
4/29/04
[Figure: plot with GOP and DEM series; vertical axis ticks from 0.00 to 0.10.]
2 Session 2: Discover
Clustering
Interpretation
Mixed Membership Models
Example: Discovery in Congressional Communication
3 Session 3: Measure
Choosing a Model
Example: Party Manifestoes in Japan
Revisiting K
Structural Topic Model
Supervised Learning
Scaling
1 Interpretable
can we clearly communicate the idea to the reader
2 Theoretical Interest
helps us advance a relevant argument
3 Label Fidelity
minimal surprise when going from reading the label to reading
the documents
4 Tractable
computationally tractable model and enough samples to estimate
2 Session 2: Discover
Clustering
Interpretation
Mixed Membership Models
Example: Discovery in Congressional Communication
3 Session 3: Measure
Choosing a Model
Example: Party Manifestoes in Japan
Revisiting K
Structural Topic Model
Supervised Learning
Scaling
Blei, 2012
Japanese Elections:
- Election Administration Commission runs elections → district level
- Required to submit manifestos for all candidates to National Diet
- Collected from 1950-2009
- Available only at district level
- Until 2009, when the national library made the texts available on microfilm
- Collected from microfilm, hand transcribed (no OCR worked), used a variety of techniques to create a term-document matrix (TDM)
- Harder for Japanese
Typical Manifesto: [Image: a typical manifesto]
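As a rough illustration of the TDM step, the sketch below counts character n-grams, one way to get terms out of text that has no spaces between words. It is only a sketch under assumptions: the slides do not say which techniques were actually used, and the helper name `char_ngram_tdm`, the choice of bigrams, and the toy strings are all hypothetical.

```python
from collections import Counter

def char_ngram_tdm(docs, n=2):
    """Count character n-grams per document (hypothetical helper).

    Character n-grams stand in for a real Japanese tokenizer here,
    because Japanese text has no spaces between words.
    """
    counts = [Counter(d[i:i + n] for i in range(len(d) - n + 1)) for d in docs]
    vocab = sorted(set().union(*counts))
    # Rows are documents, columns are terms; transpose for terms-by-documents.
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix

# Toy strings, not real manifesto text.
docs = ["税金を下げる", "税金を上げる"]
vocab, tdm = char_ngram_tdm(docs)
```

A dedicated morphological tokenizer would normally replace the n-gram step; the counting logic stays the same.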
Topic Models as Measurement
the goal is a representation that is
useful
reliable
valid
substantive fit
desirable properties
easy to use
transparent
broad support
helpful
Setting the number of topics K
Current approaches to setting K
Bayesian Nonparametrics
Teh et al., 2005
Wallach et al., 2010
Optimize a surrogate criterion
Chang et al., 2009
Lau, Newman and Baldwin, 2014
Mimno et al., 2011
Newman et al., 2010
Roberts et al., 2014
Bespoke methods
Grimmer, 2010
Quinn et al., 2010
Grimmer and Stewart, 2013
Topic Aggregation
Why?
stability
weak supervision
label interpretability
transparency
An Interactive System
1 Session 1: Getting Started with Text in Social Science
What Text Methods Can Do
Core Concepts and Principles
Represent
Example: Understanding Chinese Censorship
2 Session 2: Discover
Clustering
Interpretation
Mixed Membership Models
Example: Discovery in Congressional Communication
3 Session 3: Measure
Choosing a Model
Example: Party Manifestoes in Japan
Revisiting K
Structural Topic Model
Supervised Learning
Scaling
θ, D × K document-topic matrix
β, K × V topic-word matrix
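To make the dimensions concrete, the sketch below multiplies a toy θ (D × K) by a toy β (K × V) to obtain each document's expected distribution over words (D × V). The numbers are invented for illustration, and the sketch assumes that every row of θ and β is a probability vector, as in LDA-style mixed membership models.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

# Toy values (assumed): D = 2 documents, K = 2 topics, V = 3 words.
theta = [[0.9, 0.1],        # document 1 is mostly topic 1
         [0.2, 0.8]]        # document 2 is mostly topic 2
beta = [[0.7, 0.2, 0.1],    # topic 1's distribution over words
        [0.1, 0.3, 0.6]]    # topic 2's distribution over words

# Each row is one document's expected distribution over the V words.
word_probs = matmul(theta, beta)
```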
2 Session 2: Discover
Clustering
Interpretation
Mixed Membership Models
Example: Discovery in Congressional Communication
3 Session 3: Measure
Choosing a Model
Example: Party Manifestoes in Japan
Revisiting K
Structural Topic Model
Supervised Learning
Scaling
Validation
Validation, Dictionaries from other Fields
3) Optimization
- Method specific: MLE, Bayesian, EM, ...
- We learn $\hat{\theta}$
4) Validation
- Obtain predicted fit for new data $f(X_i, \hat{\theta})$
- Examine prediction performance: compare classification to gold standard
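The comparison against a gold standard can be sketched as follows. This is a minimal illustration, not the procedure used in the slides: the function name `evaluate`, the choice of accuracy, precision, and recall as summaries, and the toy labels are assumptions.

```python
def evaluate(predicted, gold, positive):
    """Compare predicted labels with hand-coded (gold standard) labels."""
    pairs = list(zip(predicted, gold))
    accuracy = sum(p == g for p, g in pairs) / len(pairs)
    true_pos = sum(p == positive and g == positive for p, g in pairs)
    pred_pos = sum(p == positive for p, _ in pairs)
    gold_pos = sum(g == positive for _, g in pairs)
    # Guard against division by zero when a class is never predicted or coded.
    precision = true_pos / pred_pos if pred_pos else 0.0
    recall = true_pos / gold_pos if gold_pos else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}

# Toy labels, invented for illustration.
gold = ["credit", "credit", "advert", "advert", "credit"]
predicted = ["credit", "advert", "advert", "credit", "credit"]
scores = evaluate(predicted, gold, positive="credit")
```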
Components to Supervised Learning Method
1) Set of categories
- Credit Claiming, Position Taking, Advertising
- Positive Tone, Negative Tone
- Pro-war, Ambiguous, Anti-war
2) Set of hand-coded documents
- Coding done by human coders
- Training Set: documents we’ll use to learn how to code
- Validation Set: documents we’ll use to learn how well we code
3) Set of unlabeled documents
4) Method to extrapolate from hand coding to unlabeled
documents
Hand labeled
- Training set (what we’ll use to estimate model)
- Validation set (what we’ll use to assess model)
Unlabeled
- Test set (what we’ll use the model to categorize)
Label more documents than necessary to train model
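A minimal sketch of dividing the hand-labeled documents into training and validation sets is shown below. The 30% validation share, the fixed seed, and the helper name `split_hand_labeled` are assumptions for illustration; the slides do not prescribe a particular split.

```python
import random

def split_hand_labeled(labeled_docs, validation_share=0.3, seed=0):
    """Randomly split hand-labeled documents into training and validation sets."""
    docs = list(labeled_docs)
    random.Random(seed).shuffle(docs)  # seeded so the split is reproducible
    n_val = int(round(len(docs) * validation_share))
    return docs[n_val:], docs[:n_val]

# Toy (document id, label) pairs, invented for illustration.
labeled = [(i, "credit" if i % 2 else "advert") for i in range(10)]
train, validation = split_hand_labeled(labeled)
```

Labeling more documents than training strictly needs leaves enough held-out documents for the validation set to give a stable estimate of performance.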
$$
p(C_k \mid x_i) = \frac{p(C_k, x_i)}{p(x_i)}
= \frac{\overbrace{p(C_k)}^{\text{Proportion in } C_k} \; \underbrace{p(x_i \mid C_k)}_{\text{Language model}}}{p(x_i)}
$$
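The decomposition above can be turned into a small classifier. The sketch below is one possible instance, under assumptions that the slide does not state: words are treated as conditionally independent given the category (the "naive" Bayes assumption), each $x_i$ is a binary word-presence vector, and the language model is a Bernoulli model with add-one smoothing. The function names and toy data are hypothetical.

```python
import math
from collections import defaultdict

def train_naive_bayes(X, y, alpha=1.0):
    """Estimate p(C_k) and per-word Bernoulli p(x_ij = 1 | C_k)."""
    by_class = defaultdict(list)
    for x, label in zip(X, y):
        by_class[label].append(x)
    n_words = len(X[0])
    model = {}
    for label, rows in by_class.items():
        prior = len(rows) / len(X)  # proportion in C_k
        # Smoothed language model: share of class documents containing word j.
        word_prob = [(sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
                     for j in range(n_words)]
        model[label] = (prior, word_prob)
    return model

def posterior(model, x):
    """Return p(C_k | x) for every class via Bayes' rule."""
    log_joint = {}
    for label, (prior, word_prob) in model.items():
        log_joint[label] = math.log(prior) + sum(
            math.log(p if xj else 1.0 - p) for xj, p in zip(x, word_prob))
    # p(x) is the sum of the joint over classes; subtract the max for stability.
    top = max(log_joint.values())
    evidence = sum(math.exp(v - top) for v in log_joint.values())
    return {k: math.exp(v - top) / evidence for k, v in log_joint.items()}

# Toy binary word-presence vectors and labels, invented for illustration.
X = [(1, 0, 0, 1), (1, 1, 0, 0), (0, 0, 1, 1), (0, 1, 1, 0)]
y = ["credit", "credit", "advert", "advert"]
model = train_naive_bayes(X, y)
probs = posterior(model, (1, 0, 0, 0))
```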
Hand (2006) argues that most new algorithms only provide the
illusion of progress. He makes 3 major arguments:
1 Population Drift is a bigger problem than people accept
2 Some types of complexity (donut holes etc.) are not that big of
a deal
3 Major gains only come after the first step.
In general: better features beat better models every time.
$x_i = (1, 0, 0, 1, \ldots, 0)$
[Figure: words (English translations) arranged along a horizontal axis running from Jihadi (left) to Not Jihadi (right); labeled Word Frequency; legend: a = 1/1000.]
2 Session 2: Discover
Clustering
Interpretation
Mixed Membership Models
Example: Discovery in Congressional Communication
3 Session 3: Measure
Choosing a Model
Example: Party Manifestoes in Japan
Revisiting K
Structural Topic Model
Supervised Learning
Scaling
A Thought Experiment
Republicans and Democrats are each 50% of the population, and
both groups are uniformly distributed across zip codes
Our sample is so small that we get only two individuals per
zip code
Even though the true segregation level is zero, we find that half of
the zip codes are perfectly segregated
The issue is that the measures of variance and the variance across
elements of $c_{it}$ are biased upwards by sampling error.
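The thought experiment is easy to check by simulation. The sketch below draws two individuals per zip code, each Republican or Democrat with probability 1/2, and reports the share of zip codes whose sample is all one party. The number of zip codes, the seed, and the function name are assumptions for illustration.

```python
import random

def share_perfectly_segregated(n_zipcodes=10000, seed=0):
    """Share of zip codes whose two sampled residents belong to the same party."""
    rng = random.Random(seed)
    segregated = 0
    for _ in range(n_zipcodes):
        # True segregation is zero: party is independent of zip code.
        first = rng.choice("RD")
        second = rng.choice("RD")
        if first == second:
            segregated += 1
    return segregated / n_zipcodes

share = share_perfectly_segregated()
```

The share comes out close to one half, because two independent fair draws match with probability 1/2, even though the population itself is not segregated at all.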
with penalty
$$
c(\varphi_{tj}) = \lambda_j \left( |\bar{\varphi}_j| + \sum_k |\tilde{\varphi}_{jk}| \right)
$$
2 Session 2: Discover
Clustering
Interpretation
Mixed Membership Models
Example: Discovery in Congressional Communication
3 Session 3: Measure
Choosing a Model
Example: Party Manifestoes in Japan
Revisiting K
Structural Topic Model
Supervised Learning
Scaling
Score  Text
-800   Molly Bloom’s (3.6K) Soliloquy, Ulysses
33     judicial opinion
45     life insurance requirement in Florida
48     New York Times
65     Reader’s Digest
67     Al Qaeda press release
77     Dickens’ complete works
80     children’s books
90     death row inmate last statements
100    this entry right here.
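The table does not name its scale. Assuming it is a Flesch Reading Ease score (an assumption, though the range and ordering are consistent with it), a score of this kind can be sketched as below. The syllable counter is a crude vowel-run heuristic, and the function names and example sentences are hypothetical.

```python
import re

def count_syllables(word):
    """Crude heuristic: count runs of vowels, with at least one per word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch Reading Ease: higher scores mean easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

# Invented example sentences.
easy = flesch_reading_ease("The cat sat on the mat. The dog ran.")
hard = flesch_reading_ease(
    "Constitutional jurisprudence necessitates extraordinarily complicated interpretation.")
```

Long sentences and long words both lower the score, which is how a single very long unpunctuated passage can end up far below zero.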
2 Session 2: Discover
Clustering
Interpretation
Mixed Membership Models
Example: Discovery in Congressional Communication
3 Session 3: Measure
Choosing a Model
Example: Party Manifestoes in Japan
Revisiting K
Structural Topic Model
Supervised Learning
Scaling
BrandonStewart.org