Red / Blue
Using Machine Learning to Build an
Ideologically Balanced News Diet
Salil Doshi
Sam Goodgame
Susan Eun Park
Paul Platzman
May 21st, 2016
May 15th, 2016 -- Six Days Ago...
“...Today in every phone in one of your pockets we have access to more information than at any time in human history, at a touch of a button. But, ironically, the flood of information hasn’t made us more discerning of the truth. In some ways, it’s just made us more confident in our ignorance. We assume whatever is on the web must be true. We search for sites that just reinforce our own predispositions.”
-President Obama, Rutgers Commencement Address
Pew Research Center
April 29, 2014
Architecture
Build Phase
Training Data Ingestion and Wrangling
Data Transformation
Removed common English words and candidate and moderator names
Vectorized the Data
Computed Term Frequency-Inverse Document Frequency (TF-IDF) Values
Sample TF-IDF Vectorized Matrix:
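The sample matrix itself is not reproduced in this transcript, but the transformation the slide describes can be sketched with scikit-learn's TfidfVectorizer. This is a minimal illustration, not the team's code: the two documents are invented stand-ins for debate paragraphs, and a real run would extend the stop-word list with candidate and moderator names.

```python
# Sketch of the data transformation slide: drop common English words,
# vectorize, and compute TF-IDF weights in one step.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "we will secure the border and cut taxes",
    "we must expand health care and protect voting rights",
]

# stop_words='english' removes common English words; the real pipeline
# also passed custom stop words (candidate and moderator names).
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)  # sparse matrix: documents x vocabulary

print(X.shape)
```

The result is a sparse matrix because any given word appears in relatively few documents, so most entries are zero.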
Model Estimators
Binary Classification Models:
Logistic Regression (LR)
Multinomial Naive Bayes (MNB)
Support Vector Machine (SVM)
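Comparing the three estimators named above takes only a few lines of scikit-learn. This is a hedged sketch, not the project's actual evaluation: the toy documents and labels are invented, and scoring on the training set stands in for the real held-out evaluation.

```python
# Fit the three binary classifiers on a tiny TF-IDF matrix and
# compare their (training-set) accuracy.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

docs = [
    "lower taxes and strong borders",
    "strong borders and lower taxes now",
    "expand health care for families",
    "families deserve health care and voting rights",
]
labels = ["Republican", "Republican", "Democratic", "Democratic"]

X = TfidfVectorizer().fit_transform(docs)

models = {
    "LR": LogisticRegression(),
    "MNB": MultinomialNB(),
    "SVM": LinearSVC(),
}
# score() returns mean accuracy for classifiers
scores = {name: m.fit(X, labels).score(X, labels) for name, m in models.items()}
print(scores)
```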
Feature Engineering
Truncated Singular Value Decomposition (TSVD)
Reduced number of features without compromising predictive performance
11,228 features --> 2,000 features
No reduction in F-1 Score or Accuracy Score
Models with fewer than 2,000 features experienced diminished performance
Trend observed across each model form
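The TSVD reduction above can be sketched with scikit-learn's TruncatedSVD. The real corpus went from 11,228 to 2,000 features; this toy example shrinks a random sparse matrix (a stand-in for the TF-IDF matrix) to 50 components.

```python
# Reduce a sparse documents x features matrix with Truncated SVD,
# which works directly on sparse input (unlike plain PCA).
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

X = sparse_random(100, 500, density=0.05, random_state=0)  # 100 docs, 500 features

svd = TruncatedSVD(n_components=50, random_state=0)
X_reduced = svd.fit_transform(X)  # dense 100 x 50 array

print(X.shape, "->", X_reduced.shape)  # (100, 500) -> (100, 50)
```

In the project's setting, one would sweep n_components and keep the smallest value that preserves the F-1 and accuracy scores, which is how the 2,000-feature cutoff was found.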
Parameter Tuning: Using Grid Search
● Optimized ‘C’ Value, the penalty parameter
● Maintained generalizability of model to prediction data
Image source: http://www.intechopen.com/source/html/45102/media/image44.png
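The grid search on this slide can be sketched with scikit-learn's GridSearchCV. This is an illustrative assumption-laden sketch: the documents, labels, C grid, and 2-fold split are invented, and LinearSVC is assumed from the notes' mention of linear support vector classification.

```python
# Tune the SVM penalty parameter C by grid search, scoring with
# weighted F-1 so both classes count (as on the slide).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

docs = [
    "cut taxes", "secure the border", "lower taxes now", "border security first",
    "expand health care", "protect voting rights", "health care for all",
    "voting rights matter",
]
labels = ["R", "R", "R", "R", "D", "D", "D", "D"]

X = TfidfVectorizer().fit_transform(docs)

grid = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1, 10]},
                    scoring="f1_weighted", cv=2)
grid.fit(X, labels)
print(grid.best_params_)
```

Cross-validation is what keeps the chosen C honest: a C picked only to minimize training error would overfit, which is the risk the slide's second bullet refers to.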
SVM Model Performance Metrics
              Precision  Recall  F-1 Score
Democratic         0.76    0.58       0.66
Republican         0.86    0.93       0.89
Average/Total      0.83    0.84       0.83
Correct Democratic: n=392     Incorrect Democratic: n=279
Correct Republican: n=1693    Incorrect Republican: n=121
Overall Accuracy Rate: 84%
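Reading "Incorrect Democratic" as true-Democratic instances misclassified as Republican (and vice versa, an assumption about the slide's labeling), the table's precision, recall, and accuracy all follow from the four counts:

```python
# Reproduce the slide's metrics from the confusion-matrix counts,
# given per *true* class.
correct_dem, incorrect_dem = 392, 279    # true Democratic: hits / misses
correct_rep, incorrect_rep = 1693, 121   # true Republican: hits / misses

dem_recall = correct_dem / (correct_dem + incorrect_dem)      # 392/671   ~ 0.58
rep_recall = correct_rep / (correct_rep + incorrect_rep)      # 1693/1814 ~ 0.93
# precision divides by everything *predicted* as that class
dem_precision = correct_dem / (correct_dem + incorrect_rep)   # 392/513   ~ 0.76
rep_precision = correct_rep / (correct_rep + incorrect_dem)   # 1693/1972 ~ 0.86
accuracy = (correct_dem + correct_rep) / (
    correct_dem + incorrect_dem + correct_rep + incorrect_rep)  # ~ 0.84
```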
Operational Phase
Prediction Results: Normalized Spectrum
● 79% of all documents were classified as Republican
Prediction Results: Media Source Spectrum
Prediction Results vs. Pew Research Center Results
Discussion
Results don’t match ideological spectrum of audiences.
Several potential interpretations:
Republican stories dominated news cycles
Republican candidates more regularly used pre-existing media language
Oral language is not strongly predictive of written language
Methodological Self-Evaluation (1)
● Strengths:
○ Expansion of instance set to reduce model performance variation
○ Removal of moderator speech
○ Removal of custom stop words
○ Employed a variety of model forms
○ Reduced feature set size without impeding performance
○ Optimized ‘C’ parameter value
Methodological Self-Evaluation (2)
● Shortcomings:
○ RSS feed content was not always ideal or consistent
■ Contained ‘jQuery’ or advertisement placeholders
■ Variety in article length
■ Variable number of instances from each media outlet
○ Single source of training data
○ Uneven distribution of red/blue training data
Looking Towards Future Iterations
● Future studies could…
○ Use additional training data sources
○ Encompass prediction data of greater breadth and depth: more news sources and more articles per source
○ Include more feature engineering to account for differently formatted RSS feeds
○ Predict oral political dialogue
For Posterity
● Implications for partisanship...
○ The potential virtue of an ideologically balanced diet
○ A shift in media engagement behaviors could promote open-mindedness and compromise
○ This, in turn, could promote legislative functioning
Questions?


Editor's Notes

  • #3: Last weekend, President Obama delivered the commencement address at my alma mater, Rutgers University. In it, he alluded to the flood of information that we’ve become increasingly exposed to and the perhaps counterintuitive notion that it has not made us more informed. Instead, he noted, we use the web and social media as a tool to seek out information that reinforces our preexisting beliefs, tune out voices of those who don’t think like us, and amplify voices of those who do. Indeed, America has become increasingly politically polarized during Obama’s tenure and media consumption habits are thought to play a role in this emerging phenomenon.
  • #4: In 2014, the Pew Research Center measured the ideological placement of audiences of a variety of political news outlets. As you can see, political conservatives and liberals consume different news sources, and each are believed to espouse and reinforce particular philosophies within their readerships. If media consumption differentiation exists, it could presumably influence the political divisiveness that has manifested in, for example, gridlocked government -- the last two U.S. Congresses have been the two least productive historically. So political media content analysis is worthwhile. Past studies have analyzed media content from a variety of angles, such as sentiment analysis, but we sought to evaluate media outlets based on their consistency with language spoken by politicians. Specifically, we asked: to what extent do media outlets’ written articles correspond to Democratic and Republican politicians’ word choices during the 2016 presidential primary debates? Does media language usage vary according to the same spectrum as the political preferences of their respective audiences? If so, could that suggest a link between language choice and political polarization?
  • #5: Data product: a political language classifier for news articles. High-level overview of the build phase (generating the model): pull debate transcripts from the Internet, wrangle them into text documents, put them into the proper format for analysis, and fit the classifier. Operational phase: pull RSS feeds from the Internet, put that data into the same form as our training data, and feed it to our model, receiving one prediction per text document: “red” for Republican or “blue” for Democrat. Now I’m going to drill down into the build phase, and then specifically into the initial data wrangling.
  • #6: More depth on building our model: start with debate HTML documents, get them into text format and into a data bunch, then conduct TF-IDF analysis, a weighted measure of word frequency in each document. The final data form is a sparse matrix (I’ll go into more detail in a moment). Feature engineering: remove stop words (“the” or “for”) and remove non-predictive features. Once the data is in its final form, evaluate three models: LR, SVM, and MNB. Iterative feature engineering and parameter tuning → fitted model.
  • #7: Drilling down into the initial data ingestion and wrangling: debate transcripts were HTML documents -- ugly, with markup like ‘p’ and ‘body’ tags. We used BeautifulSoup to parse out the text, then spit out a document that only includes text. The data bunch format is a particular directory structure compatible with scikit-learn modules.
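The BeautifulSoup step this note describes might look like the following sketch. The HTML snippet is an invented stand-in for a transcript page; the script-stripping mirrors the jQuery/advertisement problem mentioned later in the deck.

```python
# Strip HTML markup from a (toy) transcript page, keeping only the
# visible text for the corpus.
from bs4 import BeautifulSoup

html = """<html><body>
<p>MODERATOR: Welcome to tonight's debate.</p>
<p>CANDIDATE: Thank you. Let me start with the economy.</p>
<script>var ad = "jquery placeholder";</script>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")
for tag in soup(["script", "style"]):  # drop non-content tags
    tag.decompose()
text = soup.get_text(separator="\n", strip=True)
print(text)
```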
  • #8: Vectorize the data, transform it into weighted term frequency values, and remove stop words with one line of code. TF-IDF stands for Term Frequency-Inverse Document Frequency; the scikit-learn package determines the weights for words. The end result is a sparse matrix: any given word appears relatively infrequently, so we have a lot of zeroes.
  • #9: After getting data into proper format, we evaluated these three models: The LR algorithm classifies data by obtaining a best-fit logistic function. The MNB algorithm is a probabilistic classifier that applies Bayes’ theorem; it assumes (naively) that the features in the model are independent. SVM separates categories in data by drawing a separating hyperplane between instances of different classes.
  • #10: Next, we wanted to make our model more efficient.
  • #11: Moving forward with SVM as our best model, we used scikit-learn’s grid search to conduct parameter tuning. The model we used was linear support vector classification, which has a penalty parameter called ‘C’ that controls how many errors or misclassifications the model tolerates. We had to be careful not to overfit by minimizing errors through C, because the model would then be optimized solely on training data. We chose a C that gave us larger margins in our linear model’s hyperplane and the best F-1 score across both Democratic and Republican data, which meant the model was in the best position to predict against outside data.
  • #12: Our optimized SVM model had an overall F-1 score of 0.83. The F-1 score is a weighted average of precision and recall, with 1 being the best value and 0 the worst. You can also see the broken-out precision and recall for our Democratic and Republican data. The accuracy rate is 84%. Precision and recall are high for Republican, but this could be attributed to the fact that we had twice as much Republican training data as Democratic, owing to there being more Republican debates than Democratic ones. “A system with high recall but low precision returns many results, but most of its predicted labels are incorrect when compared to the training labels. A system with high precision but low recall is just the opposite, returning very few results, but most of its predicted labels are correct when compared to the training labels.” The confusion matrix breaks out correct and incorrect predictions per class.
  • #13: Goal: get RSS data into the same exact vector format as our training data, then feed it into our model. OPML (Outline Processor Markup Language) documents provide instructions for pulling specific RSS feeds. Baleen, an automated ingestion service for blogs that constructs a corpus for NLP research, instantiates a separate MongoDB database per news source. Documents are instances, and words are features. HTML is transformed into text and fed to the model.
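The core of this operational phase, getting new article text into the same vector format as the training data, might look like the sketch below. The documents and labels are invented, and the real pipeline ingests RSS feeds via Baleen; the point illustrated is that prediction-time text must pass through the already fitted vectorizer (transform, not fit_transform) so its columns line up with the training features.

```python
# Build phase: fit vectorizer and classifier on (toy) training text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_docs = [
    "cut taxes and secure the border",
    "secure the border now",
    "expand health care access",
    "protect health care and voting rights",
]
train_labels = ["Republican", "Republican", "Democratic", "Democratic"]

vectorizer = TfidfVectorizer()
model = LinearSVC().fit(vectorizer.fit_transform(train_docs), train_labels)

# Operational phase: transform (not fit_transform!) a new article,
# then classify it red or blue.
article = "the plan would cut taxes and tighten the border"
prediction = model.predict(vectorizer.transform([article]))[0]
print(prediction)
```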
  • #14: Here is a graphic imported from Tableau that shows a normalized spectrum of our prediction results. We used 12 news sources, with between 27 and 173 articles per source. As the slide states, 79% of the news articles that we analyzed had language that was more consistent with Republican rhetoric than Democratic rhetoric. While most sources fall within one standard deviation of the mean, the Washington Post and The Nation are outliers.
  • #15: Explain what we’re looking at. Another way of representing the information on the previous slide. Spectrum of absolute values. More uniformly Dem on left, more uniformly Republican on right. As Salil said, among all articles, 79% were classified as more consistent with Rep than Dem. Note that the majority of news sources are clustered together between 76% and 94% Red. You might see that MSNBC, often conventionally assumed to be left wing, is the furthest right. This probably didn’t conform with your expectations.
  • #16: It also didn’t conform with the Pew Research findings. Here is a comparison of our results and the Pew findings. Although WAPO is about as far left as both scales go, most other sources display no meaningful relationship. So what does this mean?
  • #17: Why DIDN’T our results match the ideological spectrum of audiences?
  • #18: Parsing each debate into one document yielded a low sample size for our model, so we re-parsed our debate transcripts to yield one document per paragraph. Next, we removed instances that contained moderators’ remarks (instance engineering). We created a list of custom stop words, added to scikit-learn’s built-in set, to further strengthen our training data by removing candidates’ and moderators’ names. LR, MNB, and SVM were chosen because they were appropriate for the binary classification nature of our analysis. By fitting our TF-IDF vectors to a Truncated Singular Value Decomposition model, we scaled down to 2,000 features, the last point before which we observed gradual reductions in model performance. The ‘C’ value represents the misclassification parameter, which we tuned so our model wasn’t overly optimized on its ability to correctly fit the training data.
  • #19: After transforming the HTML we pulled from RSS feeds, we discovered documents with jQuery script tags in addition to journalistic content. Other transformed documents contained solely advertisements or placeholder HTML tags for advertisements. Further, different news sources produced different kinds of RSS feeds: some were long-form with in-depth analysis, while others simply contained blurbs that set up a resulting slideshow (not used in our data). Other shortcomings include that we used only debate transcripts as our source of training data and that we had far more Republican debates (and candidates) than Democratic ones.
  • #20: First, future studies could include more news sources and many more articles per news source. Second, they could even out the distribution of Republican and Democratic speech in the training set. Third, they could improve feature engineering, specifically regarding transforming data from its organic form into text documents and vectors. Finally, instead of relying solely on debate transcripts for the training data corpus, a future study could use debate transcripts to fit an initial model, use that model to make predictions about a cross-section of article data, then feed the labeled article data back into the fitted model to strengthen and generalize it.
  • #21: So, going back to the conundrum that Paul outlined at the beginning of the presentation, it’s important to consider what implications a text classifier built to identify partisan-leaning language can have for individual news consumption. If people are choosing the news they read to reinforce the pre-existing beliefs they hold, then it’s worth examining the potential virtue of an ideologically balanced diet. By being conscientious with our news consumption, we could witness a shift toward media engagement behaviors that are more open-minded and less entrenched in ideology. Becoming open to compromise and working with the other side could, in turn, promote legislative functioning.