Red / Blue
Using Machine Learning to Build an
Ideologically Balanced News Diet
Salil Doshi
Sam Goodgame
Susan Eun Park
Paul Platzman
May 21st, 2016
May 15th, 2016 -- Six Days Ago...
“...Today in every phone in one of your pockets we have access to more information than at any time in human history, at a touch of a button. But, ironically, the flood of information hasn’t made us more discerning of the truth. In some ways, it’s just made us more confident in our ignorance. We assume whatever is on the web must be true. We search for sites that just reinforce our own predispositions.”
-President Obama, Rutgers Commencement Address
Pew Research Center
April 29, 2014
Architecture
Build Phase
Training Data Ingestion and Wrangling
Data Transformation
Removed common English words and candidate and moderator names
Vectorized the Data
Computed Term Frequency-Inverse Document Frequency (TF-IDF) Values
Sample TF-IDF Vectorized Matrix:
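The sample matrix itself is not reproduced in this transcript, but the transformation the slide describes can be sketched with scikit-learn's TfidfVectorizer. This is a minimal illustration, not the team's code: the two documents are invented stand-ins for debate paragraphs, and a real run would extend the stop-word list with candidate and moderator names.

```python
# Sketch of the data transformation slide: drop common English words,
# vectorize, and compute TF-IDF weights in one step.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "we will secure the border and cut taxes",
    "we must expand health care and protect voting rights",
]

# stop_words='english' removes common English words; the real pipeline
# also passed custom stop words (candidate and moderator names).
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)  # sparse matrix: documents x vocabulary

print(X.shape)
```

The result is a sparse matrix because any given word appears in relatively few documents, so most entries are zero.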
Model Estimators
Binary Classification Models:
Logistic Regression (LR)
Multinomial Naive Bayes (MNB)
Support Vector Machine (SVM)
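Comparing the three estimators named above takes only a few lines of scikit-learn. This is a hedged sketch, not the project's actual evaluation: the toy documents and labels are invented, and scoring on the training set stands in for the real held-out evaluation.

```python
# Fit the three binary classifiers on a tiny TF-IDF matrix and
# compare their (training-set) accuracy.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

docs = [
    "lower taxes and strong borders",
    "strong borders and lower taxes now",
    "expand health care for families",
    "families deserve health care and voting rights",
]
labels = ["Republican", "Republican", "Democratic", "Democratic"]

X = TfidfVectorizer().fit_transform(docs)

models = {
    "LR": LogisticRegression(),
    "MNB": MultinomialNB(),
    "SVM": LinearSVC(),
}
# score() returns mean accuracy for classifiers
scores = {name: m.fit(X, labels).score(X, labels) for name, m in models.items()}
print(scores)
```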
Feature Engineering
Truncated Singular Value Decomposition (TSVD)
Reduced number of features without compromising predictive performance
11,228 features --> 2,000 features
No reduction in F-1 Score or Accuracy Score
Models with fewer than 2,000 features experienced diminished performance
Trend observed across each model form
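The TSVD reduction above can be sketched with scikit-learn's TruncatedSVD. The real corpus went from 11,228 to 2,000 features; this toy example shrinks a random sparse matrix (a stand-in for the TF-IDF matrix) to 50 components.

```python
# Reduce a sparse documents x features matrix with Truncated SVD,
# which works directly on sparse input (unlike plain PCA).
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

X = sparse_random(100, 500, density=0.05, random_state=0)  # 100 docs, 500 features

svd = TruncatedSVD(n_components=50, random_state=0)
X_reduced = svd.fit_transform(X)  # dense 100 x 50 array

print(X.shape, "->", X_reduced.shape)  # (100, 500) -> (100, 50)
```

In the project's setting, one would sweep n_components and keep the smallest value that preserves the F-1 and accuracy scores, which is how the 2,000-feature cutoff was found.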
Parameter Tuning: Using Grid Search
● Optimized ‘C’ Value, the penalty parameter
● Maintained generalizability of model to prediction data
Image source: http://www.intechopen.com/source/html/45102/media/image44.png
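The grid search on this slide can be sketched with scikit-learn's GridSearchCV. This is an illustrative assumption-laden sketch: the documents, labels, C grid, and 2-fold split are invented, and LinearSVC is assumed from the notes' mention of linear support vector classification.

```python
# Tune the SVM penalty parameter C by grid search, scoring with
# weighted F-1 so both classes count (as on the slide).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

docs = [
    "cut taxes", "secure the border", "lower taxes now", "border security first",
    "expand health care", "protect voting rights", "health care for all",
    "voting rights matter",
]
labels = ["R", "R", "R", "R", "D", "D", "D", "D"]

X = TfidfVectorizer().fit_transform(docs)

grid = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1, 10]},
                    scoring="f1_weighted", cv=2)
grid.fit(X, labels)
print(grid.best_params_)
```

Cross-validation is what keeps the chosen C honest: a C picked only to minimize training error would overfit, which is the risk the slide's second bullet refers to.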
SVM Model Performance Metrics
              Precision  Recall  F-1 Score
Democratic         0.76    0.58       0.66
Republican         0.86    0.93       0.89
Average/Total      0.83    0.84       0.83
Correct Democratic: n=392     Incorrect Democratic: n=279
Correct Republican: n=1693    Incorrect Republican: n=121
Overall Accuracy Rate: 84%
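Reading "Incorrect Democratic" as true-Democratic instances misclassified as Republican (and vice versa, an assumption about the slide's labeling), the table's precision, recall, and accuracy all follow from the four counts:

```python
# Reproduce the slide's metrics from the confusion-matrix counts,
# given per *true* class.
correct_dem, incorrect_dem = 392, 279    # true Democratic: hits / misses
correct_rep, incorrect_rep = 1693, 121   # true Republican: hits / misses

dem_recall = correct_dem / (correct_dem + incorrect_dem)      # 392/671   ~ 0.58
rep_recall = correct_rep / (correct_rep + incorrect_rep)      # 1693/1814 ~ 0.93
# precision divides by everything *predicted* as that class
dem_precision = correct_dem / (correct_dem + incorrect_rep)   # 392/513   ~ 0.76
rep_precision = correct_rep / (correct_rep + incorrect_dem)   # 1693/1972 ~ 0.86
accuracy = (correct_dem + correct_rep) / (
    correct_dem + incorrect_dem + correct_rep + incorrect_rep)  # ~ 0.84
```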
Operational Phase
Prediction Results: Normalized Spectrum
● 79% of all documents were classified as Republican
Prediction Results: Media Source Spectrum
Prediction Results vs. Pew Research Center Results
Discussion
Results don’t match ideological spectrum of audiences.
Several potential interpretations:
Republican stories dominated news cycles
Republican candidates more regularly used pre-existing media language
Oral language is not strongly predictive of written language
Methodological Self-Evaluation (1)
● Strengths:
○ Expansion of instance set to reduce model performance variation
○ Removal of moderator speech
○ Removal of custom stop words
○ Employed a variety of model forms
○ Reduced feature set size without impeding performance
○ Optimized ‘C’ parameter value
Methodological Self-Evaluation (2)
● Shortcomings:
○ RSS feed content was not always ideal or consistent
■ Contained ‘jQuery’ or advertisement placeholders
■ Variety in article length
■ Variable number of instances from each media outlet
○ Single source of training data
○ Uneven distribution of red/blue training data
Looking Towards Future Iterations
● Future studies could…
○ Use additional training data sources
○ Encompass prediction data of greater breadth and depth: more news sources and more articles per source
○ Include more feature engineering to account for differently formatted RSS feeds
○ Predict oral political dialogue
For Posterity
● Implications for partisanship...
○ The potential virtue of an ideologically balanced diet
○ A shift in media engagement behaviors could promote open-mindedness and compromise
○ This, in turn, could promote legislative functioning
Questions?


Editor's Notes

  • #3: Last weekend, President Obama delivered the commencement address at my alma mater, Rutgers University. In it, he alluded to the flood of information that we’ve become increasingly exposed to and the perhaps counterintuitive notion that it has not made us more informed. Instead, he noted, we use the web and social media as a tool to seek out information that reinforces our preexisting beliefs, tune out voices of those who don’t think like us, and amplify voices of those who do. Indeed, America has become increasingly politically polarized during Obama’s tenure and media consumption habits are thought to play a role in this emerging phenomenon.
  • #4: In 2014, the Pew Research Center measured the ideological placement of audiences of a variety of political news outlets. As you can see, political conservatives and liberals consume different news sources, and each are believed to espouse and reinforce particular philosophies within their readerships. If media consumption differentiation exists, it could presumably influence the political divisiveness that has manifested in, for example, gridlocked government -- the last two U.S. Congresses have been the two least productive historically. So political media content analysis is worthwhile. Past studies have analyzed media content from a variety of angles, such as sentiment analysis, but we sought to evaluate media outlets based on their consistency with language spoken by politicians. Specifically, we asked: to what extent do media outlets’ written articles correspond to Democratic and Republican politicians’ word choices during the 2016 presidential primary debates? Does media language usage vary according to the same spectrum as the political preferences of their respective audiences? If so, could that suggest a link between language choice and political polarization?
  • #5: Data product: a political language classifier for news articles. High-level overview of the build phase (generating the model): pull debate transcripts from the Internet, wrangle them into text documents, put them into the proper format for analysis, and fit the classifier. Operational phase: pull RSS feeds from the Internet, put that data into the same form as our training data, and feed it to our model, receiving one prediction per text document: “red” for Republican or “blue” for Democrat. Now I’m going to drill down into the build phase, and then specifically into the initial data wrangling.
  • #6: More depth on building our model: start with debate HTML documents, get them into text format and into a data bunch, then conduct TF-IDF analysis, a weighted measure of word frequency in each document. The final data form is a sparse matrix (I’ll go into more detail in a moment). Feature engineering: remove stop words (“the” or “for”) and remove non-predictive features. Once the data is in its final form, evaluate three models: LR, SVM, and MNB. Iterative feature engineering and parameter tuning → fitted model.
  • #7: Drilling down into the initial data ingestion and wrangling: debate transcripts were HTML documents -- ugly, with markup like ‘p’ and ‘body’ tags. We used BeautifulSoup to parse out the text, then spit out a document that only includes text. The data bunch format is a particular directory structure compatible with scikit-learn modules.
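The BeautifulSoup step this note describes might look like the following sketch. The HTML snippet is an invented stand-in for a transcript page; the script-stripping mirrors the jQuery/advertisement problem mentioned later in the deck.

```python
# Strip HTML markup from a (toy) transcript page, keeping only the
# visible text for the corpus.
from bs4 import BeautifulSoup

html = """<html><body>
<p>MODERATOR: Welcome to tonight's debate.</p>
<p>CANDIDATE: Thank you. Let me start with the economy.</p>
<script>var ad = "jquery placeholder";</script>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")
for tag in soup(["script", "style"]):  # drop non-content tags
    tag.decompose()
text = soup.get_text(separator="\n", strip=True)
print(text)
```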
  • #8: Vectorize the data, transform it into weighted term frequency values, and remove stop words with one line of code. TF-IDF stands for Term Frequency-Inverse Document Frequency; the scikit-learn package determines the weights for words. The end result is a sparse matrix: any given word appears relatively infrequently, so we have a lot of zeroes.
  • #9: After getting data into proper format, we evaluated these three models: The LR algorithm classifies data by obtaining a best-fit logistic function. The MNB algorithm is a probabilistic classifier that applies Bayes’ theorem; it assumes (naively) that the features in the model are independent. SVM separates categories in data by drawing a separating hyperplane between instances of different classes.
  • #10: Next, we wanted to make our model more efficient.
  • #11: Moving forward with SVM as our best model, we used scikit-learn’s grid search to conduct parameter tuning. The model we used was linear support vector classification, which has a penalty parameter called ‘C’ that controls how many errors or misclassifications the model tolerates. We had to be careful not to overfit by minimizing errors through C, because the model would then be optimized solely on training data. We chose a C that gave us larger margins in our linear model’s hyperplane and the best F-1 score across both Democratic and Republican data, which meant the model was in the best position to predict against outside data.
  • #12: Our optimized SVM model had an overall F-1 score of 0.83. The F-1 score is a weighted average of precision and recall, with 1 being the best value and 0 the worst. You can also see the broken-out precision and recall for our Democratic and Republican data. The accuracy rate is 84%. Precision and recall are high for Republican, but this could be attributed to the fact that we had twice as much Republican training data as Democratic, owing to there being more Republican debates than Democratic ones. “A system with high recall but low precision returns many results, but most of its predicted labels are incorrect when compared to the training labels. A system with high precision but low recall is just the opposite, returning very few results, but most of its predicted labels are correct when compared to the training labels.” The confusion matrix breaks out correct and incorrect predictions per class.
  • #13: Goal: get RSS data into the same exact vector format as our training data, then feed it into our model. OPML (Outline Processor Markup Language) documents provide instructions for pulling specific RSS feeds. Baleen, an automated ingestion service for blogs that constructs a corpus for NLP research, instantiates a separate MongoDB database per news source. Documents are instances, and words are features. HTML is transformed into text and fed to the model.
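The core of this operational phase, getting new article text into the same vector format as the training data, might look like the sketch below. The documents and labels are invented, and the real pipeline ingests RSS feeds via Baleen; the point illustrated is that prediction-time text must pass through the already fitted vectorizer (transform, not fit_transform) so its columns line up with the training features.

```python
# Build phase: fit vectorizer and classifier on (toy) training text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_docs = [
    "cut taxes and secure the border",
    "secure the border now",
    "expand health care access",
    "protect health care and voting rights",
]
train_labels = ["Republican", "Republican", "Democratic", "Democratic"]

vectorizer = TfidfVectorizer()
model = LinearSVC().fit(vectorizer.fit_transform(train_docs), train_labels)

# Operational phase: transform (not fit_transform!) a new article,
# then classify it red or blue.
article = "the plan would cut taxes and tighten the border"
prediction = model.predict(vectorizer.transform([article]))[0]
print(prediction)
```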
  • #14: Here is a graphic imported from Tableau that shows a normalized spectrum of our prediction results. We used 12 news sources, with between 27 and 173 articles per source. As the slide states, 79% of the news articles that we analyzed had language that was more consistent with Republican rhetoric than Democratic rhetoric. While most sources fall within one standard deviation of the mean, the Washington Post and The Nation are outliers.
  • #15: Explain what we’re looking at. Another way of representing the information on the previous slide. Spectrum of absolute values. More uniformly Dem on left, more uniformly Republican on right. As Salil said, among all articles, 79% were classified as more consistent with Rep than Dem. Note that the majority of news sources are clustered together between 76% and 94% Red. You might see that MSNBC, often conventionally assumed to be left wing, is the furthest right. This probably didn’t conform with your expectations.
  • #16: It also didn’t conform with the Pew Research findings. Here is a comparison of our results and the Pew findings. Although WAPO is about as far left as both scales go, most other sources display no meaningful relationship. So what does this mean?
  • #17: Why DIDN’T our results match the ideological spectrum of audiences?
  • #18: Parsing each debate into one document yielded a low sample size for our model, so we re-parsed our debate transcripts to yield one document per paragraph. Next, we removed instances that contained moderators’ remarks (instance engineering). We created a list of custom stop words, added to scikit-learn’s built-in set, to further strengthen our training data by removing candidates’ and moderators’ names. LR, MNB, and SVM were chosen because they were appropriate for the binary classification nature of our analysis. By fitting our TF-IDF vectors to a Truncated Singular Value Decomposition model, we scaled down to 2,000 features, the last point before which we observed gradual reductions in model performance. The ‘C’ value represents the misclassification parameter, which we tuned so our model wasn’t overly optimized on its ability to correctly fit the training data.
  • #19: After transforming the HTML we pulled from RSS feeds, we discovered documents with jQuery script tags in addition to journalistic content. Other transformed documents contained solely advertisements or placeholder HTML tags for advertisements. Further, different news sources produced different kinds of RSS feeds: some were long-form with in-depth analysis, while others simply contained blurbs that set up a resulting slideshow (not used in our data). Other shortcomings include that we used only debate transcripts as our source of training data and that we had far more Republican debates (and candidates) than Democratic ones.
  • #20: First, future studies could include more news sources and many more articles per news source. Second, they could even out the distribution of Republican and Democratic speech in the training set. Third, they could improve feature engineering, specifically regarding transforming data from its organic form into text documents and vectors. Finally, instead of relying solely on debate transcripts for the training data corpus, a future study could use debate transcripts to fit an initial model, use that model to make predictions about a cross-section of article data, then feed the labeled article data back into the fitted model to strengthen and generalize it.
  • #21: So, going back to the conundrum that Paul outlined at the beginning of the presentation, it’s important to consider what implications a text classifier built to identify partisan-leaning language can have for individual news consumption. If people are choosing the news they read to reinforce the pre-existing beliefs they hold, then it’s worth examining the potential virtue of an ideologically balanced diet. By being conscientious with our news consumption, we could witness a shift toward media engagement behaviors that are more open-minded and less entrenched in ideology. Becoming open to compromise and working with the other side could, in turn, promote legislative functioning.