Computational Journalism 2017 Week 4: Computational Journalism Platforms

From the course Frontiers of Computational Journalism, Columbia University, Fall 2017 http://www.compjournalism.com/?p=206

Uploaded by

Jonathan Stray

Frontiers of

Computational Journalism
Columbia Journalism School
Week 4: Computational Journalism Platforms

September 29, 2017


This class
What do journalists do with documents
The Computational Journalism Workbench
Plate Notation
NYT Recommender
What do journalists do with documents?
Overview prototype running on Iraq security contractor docs, Feb 2012
Technical troubles with a new system meant that almost 70,000 North
Carolina residents received their food stamps late this summer. That's
8.5 percent of the clients the state currently serves every month. The
problem was eventually traced to web browser compatibility issues.
WRAL reporter Tyler Dukes obtained 4,500 pages of emails on paper
from various government departments and used DocumentCloud and
Overview to piece together this story.

https://blog.overviewdocs.com/completed-stories/
Used Overview's topic tree (TF-IDF clustering) to find a group
of key emails from a listserv.
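The topic tree above groups documents by term similarity. A minimal sketch of the underlying idea, TF-IDF weighting plus cosine similarity, in pure Python (the toy "emails" and the greedy comparison are illustrative assumptions, not Overview's actual implementation):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute sparse TF-IDF weight vectors for tokenized documents."""
    n = len(docs)
    # document frequency: in how many docs each term appears
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse dict vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy emails: two about the food-stamp delay, one unrelated.
emails = [
    "food stamps delayed browser compatibility".split(),
    "food stamps late browser issue".split(),
    "union dispute committee transcript".split(),
]
vecs = tfidf_vectors(emails)
# The two related emails score higher than the unrelated pair,
# which is what lets a topic tree pull them into the same branch.
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))
```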
What do Journalists do with Documents, Stray 2016
1. Robust Import
The hardest feature to implement
The most requested, the most used
2. Robust Analysis
What researchers choose
News articles
Academic literature
NLP test data sets

What journalists deal with


PDF dumps
Printed, scanned emails
A million pages scraped from an antique site
CD full of random files
LAPD Crime Descriptions

VICTS AND SUSPS BECAME INV IN VERBA ARGUMENT SUSP
THEN BEGAN HITTING VICTS IN THE FACE
Entity recognition is not solved!
Entities found out of 150

Incredibly dirty source data. Current methods have low recall (~70%)
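One way to see why recall collapses on text like the LAPD descriptions above: matchers tuned to clean prose miss abbreviated, all-caps forms. A toy gazetteer matcher (the term list is a hypothetical illustration, not Overview's entity plugin):

```python
# Naive gazetteer-based entity matching, to illustrate low recall
# on dirty, abbreviated source text.
GAZETTEER = {"VICTIM", "SUSPECT", "LOS ANGELES"}

def find_entities(text):
    """Return gazetteer terms found as exact substrings of the text."""
    return {term for term in GAZETTEER if term in text.upper()}

clean = "THE SUSPECT HIT THE VICTIM IN LOS ANGELES"
dirty = "VICTS AND SUSPS BECAME INV IN VERBA ARGUMENT"

print(find_entities(clean))  # all three terms match
print(find_entities(dirty))  # nothing: "VICTS"/"SUSPS" aren't in the list
```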
3. Search, not exploration
A number of previous tools aim to help the user explore
a document collection (such as [6, 9, 10, 12]), though few
of these tools have been evaluated with users from a
specific target domain who bring their own data, making
us suspect that this imprecise term often masks a lack of
understanding of actual user tasks.

Overview: The Design, Adoption, and Analysis of a Visual Document
Mining Tool For Investigative Journalists, Brehmer et al., 2014
Suffolk County public safety committee transcript,
Reference to a body left on the street due to union dispute
(via Adam Playford, Newsday, 2014)
4. Quantitative Summaries
Count incident types by date. For Level 14, ProPublica, 2015
LAPD Underreported Serious Assaults, Skewing Crime Stats for 8 Years
Los Angeles Times, 2015
The Child Exchange, Reuters, 2014
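A quantitative summary like "count incident types by date" is a simple tally once documents are reduced to structured records. A sketch with stdlib `Counter` (the incident records are hypothetical, not the actual "For Level 14" data):

```python
from collections import Counter

# Hypothetical incident records extracted from documents: (date, type)
incidents = [
    ("2014-03-01", "restraint"),
    ("2014-03-01", "assault"),
    ("2014-03-02", "restraint"),
    ("2014-03-02", "restraint"),
]

# Tally incident types per date
counts = Counter(incidents)
for (date, kind), n in sorted(counts.items()):
    print(date, kind, n)
```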
5. Interactive Methods
Design Study Methodology: Reflections from the Trenches and the Stacks,
Sedlmair et al, 2012
Extracting yes/no answers from database of Foreign Corrupt Practices
Act cases. Comparison by Ariana Giorgi
6. Clarity and Accuracy
We used a machine-learning method
known as latent Dirichlet allocation to
identify the topics in all 14,400 petitions
and to then categorize the briefs. This
enabled us to identify which lawyers
did which kind of work for which sorts
of petitioners. For example, in cases
where workers sue their employers, the
lawyers most successful getting cases
before the court were far more likely to
represent the employers rather than
the employees.

The Echo Chamber, Reuters, 2014


Evaluation Methods for Topic Models
Wallach et al., 2009
Interpretation refers to the facility with which an
analyst makes inferences about the data through
the lens of a model abstraction. Trust refers to the
actual and perceived accuracy of an analyst's
inferences.

Interpretation and Trust: Designing Model-driven Visualizations
for Text Analysis, Chuang et al., 2012
Overview prototype running on Wikileaks cables, early 2012
Overview circa 2014
Overviewdocs.com today
Overview Entity and Multisearch plugins
Overview plugin API
Computational Journalism Workbench
cjworkbench.org
Plate Notation
Probability graphs

Node = variable
Edge = dependence (sampled from)
Filled node = observed data
Choose a topic for each word

Both PLSA and LDA model each document as a distribution over
topics. Each word belongs to a single topic.
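The plate diagram that follows reads as a generative recipe. A pure-Python sketch of LDA's generative story for one document (the small K, V, N and the concentration values are toy assumptions; Dirichlet draws are built from Gamma samples):

```python
import random

random.seed(0)

def dirichlet(alpha, k):
    """Sample a symmetric Dirichlet via normalized Gamma draws."""
    xs = [random.gammavariate(alpha, 1.0) for _ in range(k)]
    s = sum(xs)
    return [x / s for x in xs]

def categorical(probs):
    """Sample an index from a discrete distribution."""
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

K, V, N = 3, 5, 10      # topics, vocabulary size, words in doc
alpha, beta = 0.1, 0.1  # concentration parameters

topics = [dirichlet(beta, V) for _ in range(K)]  # K word distributions
theta = dirichlet(alpha, K)  # this doc's distribution over topics

doc = []
for _ in range(N):
    z = categorical(theta)       # choose a topic for each word...
    w = categorical(topics[z])   # ...then a word from that topic
    doc.append(w)
print(doc)  # a document of N word ids, each drawn from a single topic
```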
LDA Plate Notation
[Plate diagram: two concentration parameters; topics in doc; topic
chosen for each word; observed word in doc; words in topics.
Plates: N words in doc, D docs, K topics.]
New York Times recommender
Combining collaborative filtering
and topic modeling
Collaborative Topic Modeling
[Plate diagram extending LDA: content side has topic concentration,
topics in doc, topic for word, word in doc, over K topics.
Collaborative side has user topic selections, variation in per-user
topics, topics for user, and the user's rating of a doc, driven by the
weight of topics in doc. Two predictions compared: content only
vs. content + social.]
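The "content only" vs. "content + social" comparison can be sketched numerically. In collaborative topic models of this kind, an article's latent vector is its topic proportions plus an offset learned from reader clicks, and a score is the dot product with a user vector. All numbers below are hypothetical:

```python
# Hypothetical learned quantities, collaborative-topic-model style:
theta = [0.7, 0.2, 0.1]        # article's topic proportions (content)
epsilon = [0.05, -0.02, 0.30]  # offset learned from reader clicks
user = [0.1, 0.0, 0.9]         # reader's latent preference vector

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Content only: works even for brand-new articles with no clicks yet.
content_only = dot(user, theta)

# Content + social: the click-derived offset shifts the item vector,
# so reading behavior can correct what the text alone suggests.
content_plus_social = dot(user, [t + e for t, e in zip(theta, epsilon)])

print(content_only, content_plus_social)
```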

You might also like