Clickbait detection using word embeddings
The torpedo Clickbait Detector at the Clickbait Challenge 2017
Vijayasaradhi Indurthi, Subba Reddy Oota
International Institute of Information Technology, Hyderabad
[email protected], [email protected]
arXiv:1710.02861v1 [cs.CL] 8 Oct 2017
ABSTRACT
Clickbait is a pejorative term describing web content that is aimed at generating online advertising revenue, especially at the expense of quality or accuracy, relying on sensationalist headlines or eye-catching thumbnail pictures to attract click-throughs and to encourage forwarding of the material over online social networks. We use distributed word representations of the words in the title as features to identify clickbaits in online news media. We train a machine learning model using linear regression to predict the clickbait score of a given tweet. Our methods achieve an F1-score of 64.98% and an MSE of 0.0791. Compared to other methods, our method is simple, fast to train, does not require extensive feature engineering, and is yet moderately effective.

1. INTRODUCTION
Clickbait is web content whose main purpose is to attract attention and encourage visitors to click on a link to a particular web page. Examples of such clickbaits include:

• “21 Completely Engrossing Fan Fictions You Won’t Be Able To Stop Reading"
• “These White Tiger Cubs Are The Most Beautiful Creatures You’ll See Today"
• “Here’s What Real Vegans Actually Eat"
• “Bow Wow Had No Clue How To Kill Time During The Grammys And It Was Hilarious"
• “We Know Who Your Celebrity Husband Should Be Based On One Question"

Clickbaits exploit the cognitive phenomenon known as the Curiosity Gap [4]: the headline provides forward-referencing cues which generate sufficient curiosity to compel the reader to click the link and fill their curiosity gap. Clickbaits eventually cause disappointment, as they are unable to live up to the promises made in the headline. Due to their heavy use in online journalism, it is important to develop techniques that automatically detect and combat clickbaits.

Research has shown that using distributed word embeddings can improve the performance of text classification, as they capture lexical and semantic features of the text without the need for explicit feature engineering. However, these word embeddings are generic and may not capture domain-specific knowledge necessary for the classification task. Our motivation for this work is to answer a specific question: "Can we use distributed word embeddings to train a machine learning model and predict the rating of a clickbait item?" By conducting experiments on the clickbait dataset, we show that this approach can improve the performance of the classification task.

The main contributions of our paper are as follows: (1) we identify a few hand-crafted features which capture domain-specific information and use them for the classification task; (2) we use pre-trained GloVe vectors as features for the classification task; (3) we augment the GloVe embeddings with the hand-crafted features for predicting the clickbait score of a tweet. Our methods achieve an F1-score of 64.98% and an MSE of 0.0791.

2. RELATED WORK
[2] highlighted many interesting differences between the clickbait and non-clickbait categories, including sentence structure and word patterns, and relied on a rich set of 14 hand-crafted features to detect clickbait headlines. In addition, [2] built a browser extension which warns the readers of different media sites about the possibility of being baited by such headlines. Their methods achieve 93% accuracy in detecting and 89% accuracy in blocking clickbaits.

[7] attempted to detect clickbaity tweets on Twitter by using common words occurring in clickbaits and by extracting some other tweet-specific features. They achieve an F1 score of 73% in classifying tweets as clickbaits or not.

[3] argued for labeling clickbaits as misleading content or false news.

[1] used deep learning techniques, namely a bi-directional recurrent neural network model with character and word embeddings as features. They achieve state-of-the-art results with an F1 score of 98% in classifying online content as clickbait or not.

While [2] and [1] explore identifying clickbaity titles in webpages, [7] explores identifying clickbaits in tweets. Unlike earlier work on clickbaits, the Clickbait Challenge [9] requires us to calculate a clickbait score for a tweet post.

3. APPROACH
The clickbait dataset [10] contains tweets from Twitter, an online news and social networking service where users post and interact with messages, "tweets", restricted to 140 characters. Each tweet in the dataset has the text of the posted tweet and its associated metadata, such as keywords, time of the post, media linked with the post, a description of the target, and the target paragraphs. In spite of the availability of the tweets' metadata, we limit our experiments to the text of the post alone when training a machine learning model to predict the clickbait score of each tweet.

We augment a few hand-crafted domain-specific features with pre-trained distributed word representations as features for this task, and we train a linear regression model to predict the clickbait score of a tweet.

Hand-crafted features: In addition to using the first three features used by [2], i.e. the number of words, the number of stopwords, and the average word length of the headline, we use the following additional hand-crafted features:

1. Presence of a question form: When, What, Which, Who, Whose, Whom, How, Where, Can, Should
2. Presence of digits at the beginning of the headline
3. Presence of gerunds, i.e. the continuous form of a verb, in the headline, such as walking, eating, attending
4. Presence of superlative forms of adjectives, such as cutest, best, hottest, greatest

Distributed word embeddings: Distributed word embeddings map words in a language to high-dimensional real-valued vectors in order to capture hidden semantic and syntactic properties of words. These embeddings are typically learned from large unlabeled text corpora. In our work, we use the pre-trained 300-dimensional GloVe embeddings [5], which were trained on about 6 billion tokens from a 2014 Wikipedia dump and the English Gigaword Fifth Edition corpus.

Table 1: Features used for training our model.
    Number of words
    Number of stop words
    Average length of the words
    Presence of question form
    Presence of numbers at the start of the headline
    Presence of continuous form of verb
    Presence of superlative forms of adjectives
    300 dimensions of the GloVe embeddings

Table 2: Dataset details.
    Dataset           Total     Clickbaits   Non-clickbaits
    Training [8]      2495      762          1697
    Validation [10]   19538     4761         14777
    Unlabelled        80012     NA           NA
    Test              Unknown   Unknown      Unknown

Table 3: Results of the official evaluation of our method.
    Evaluation metric               Value
    Mean squared error              0.0791655793621
    Median absolute error           0.236312405103
    F1 score                        0.649884407912
    Precision                       0.5297319933
    Recall                          0.840531561462
    Accuracy                        0.784551346225
    Normalised mean squared error   1.07669865075
    Mean absolute error             0.240963463871
    Explained variance              0.345845579242
    R2 score                        -0.0766986507455
    Runtime                         00:04:55

For our experiments, we concatenate the training and validation datasets (Table 2) to make a bigger training dataset. Out of this, we select the same number of clickbaits and non-clickbaits to have equal representation of the classes, and we randomly split the result 80:20 into training and validation portions. For the final official evaluation, we used the whole of the above dataset to train the model, which was then used to make predictions on the unseen test set. The official evaluation was carried out on the TIRA platform [6].
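To make the feature construction described above concrete, here is a minimal sketch. It is not the authors' code: the pre-trained GloVe lookup is replaced by a small random stand-in dictionary, the stopword list is truncated, and the gerund and superlative checks are crude suffix heuristics, all of which are our assumptions for illustration.

```python
# Sketch of the 307-dim feature vector (7 hand-crafted + 300-dim averaged
# embeddings) and the unregularised linear regression described in the text.
import numpy as np
from sklearn.linear_model import LinearRegression

EMB_DIM = 300
rng = np.random.default_rng(0)
# Stand-in for pre-trained GloVe vectors (word -> 300-dim vector).
glove = {w: rng.normal(size=EMB_DIM) for w in
         "you won t believe what happened next the cubs are cute".split()}

STOPWORDS = {"the", "a", "an", "of", "to", "you", "are", "what"}  # truncated
QUESTION_WORDS = {"when", "what", "which", "who", "whose", "whom",
                  "how", "where", "can", "should"}

def handcrafted(tokens):
    """The 7 hand-crafted features computed from the headline tokens."""
    return np.array([
        len(tokens),                                         # number of words
        sum(t in STOPWORDS for t in tokens),                 # number of stopwords
        np.mean([len(t) for t in tokens]) if tokens else 0,  # avg word length
        float(tokens[0] in QUESTION_WORDS) if tokens else 0.0,  # question form
        float(tokens[0][0].isdigit()) if tokens else 0.0,    # leading digit
        float(any(t.endswith("ing") for t in tokens)),       # gerund (heuristic)
        float(any(t.endswith("est") for t in tokens)),       # superlative (heuristic)
    ])

def tweet_vector(text):
    """307-dim features: hand-crafted + mean embedding of in-vocabulary words."""
    tokens = text.lower().split()
    vecs = [glove[t] for t in tokens if t in glove]
    emb = np.mean(vecs, axis=0) if vecs else np.zeros(EMB_DIM)
    return np.concatenate([handcrafted(tokens), emb])

# Toy training pairs: (tweet text, clickbait score in [0, 1]).
data = [("you won t believe what happened next", 0.9),
        ("the cubs are cute", 0.2)]
X = np.stack([tweet_vector(t) for t, _ in data])
y = np.array([s for _, s in data])

model = LinearRegression().fit(X, y)  # no regularisation, as in the paper
print(X.shape)                        # (2, 307)
print(model.predict(X).shape)         # (2,)
```

With real data, `glove` would be loaded from the published glove.6B.300d vectors and the score predicted for each test tweet via `model.predict`.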
To arrive at the embedding of a tweet post, we take the average of the GloVe embeddings of all the words present in the post. We use linear regression, a very simple machine learning algorithm, to predict the clickbait score of the tweet post: we model the challenge as a regression problem where the dependent variable is the clickbait score and the independent variables are the features mentioned above. We had 307 features for training (7 hand-crafted and 300 from GloVe). We did not use any regularisation technique, as we felt that a simple model like linear regression would generalize better than a more complex model.

The advantages of our approach are:

1. It is simple to implement, as there is not much feature engineering.
2. Pre-trained vectors are available which are ready to use.
3. The machine learning technique is simple to train and does not need long training times.
4. Unlike deep learning methods, our method is interpretable to a certain extent.
5. It works with modest hardware requirements, as training is not memory-intensive.
6. The generated model is compact and can be used on low-cost hardware such as phones.

4. EVALUATION RESULTS
Two datasets have been provided for training the model for [9]: a small initial dataset used in [8] for training and a bigger dataset [10] for validation.

Table 3 shows the evaluation metrics computed for our model by the official evaluators. The MSE for the baseline system was 0.0435, which our system was unable to achieve. This might be because of high bias, as our model was a very simple one. Selecting a slightly more complex model or machine learning technique, or doing more feature engineering, might help in improving the performance of the model.

5. CONCLUSION
In this paper, we developed a machine learning model, using pre-trained distributed representations of words trained on a huge corpus, to predict the clickbait score of a tweet post. In the future, we would like to use more complex machine learning models, such as neural networks, to predict the clickbait score, and to include more hand-crafted features.

References
[1] A. Anand, T. Chakraborty, and N. Park. We used Neural Networks to Detect Clickbaits: You won’t believe what happened Next! arXiv preprint arXiv:1612.01340, 2016.
[2] A. Chakraborty, B. Paranjape, S. Kakarla, and N. Ganguly. Stop Clickbait: Detecting and Preventing Clickbaits in Online News Media. In Proc. of the 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pages 9–16. IEEE, 2016.
[3] Y. Chen, N. J. Conroy, and V. L. Rubin. Misleading Online Content: Recognizing Clickbait as False News. In Proceedings of the 2015 ACM on Workshop on Multimodal Deception Detection, pages 15–19. ACM, 2015.
[4] G. Loewenstein. The Psychology of Curiosity: A Review and Reinterpretation. Psychological Bulletin, vol. 116, 1994.
[5] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
[6] M. Potthast, T. Gollub, F. Rangel, P. Rosso, E. Stamatatos, and B. Stein. Improving the Reproducibility of PAN’s Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling. In CLEF, pages 268–299. Springer, 2014.
[7] M. Potthast, S. Köpsel, B. Stein, and M. Hagen. Clickbait Detection. In European Conference on Information Retrieval, pages 810–817. Springer, 2016.
[8] M. Potthast, S. Köpsel, B. Stein, and M. Hagen. Clickbait Detection. In N. Ferro, F. Crestani, M.-F. Moens, J. Mothe, F. Silvestri, G. Di Nunzio, C. Hauff, and G. Silvello, editors, Advances in Information Retrieval. 38th European Conference on IR Research (ECIR 16), volume 9626 of Lecture Notes in Computer Science, pages 810–817, Berlin Heidelberg New York, Mar. 2016. Springer.
[9] M. Potthast, T. Gollub, M. Hagen, and B. Stein. The Clickbait Challenge 2017: Towards a Regression Model for Clickbait Strength. In Proceedings of the Clickbait Challenge, 2017.
[10] M. Potthast, T. Gollub, K. Komlossy, S. Schuster, M. Wiegmann, E. Garces, M. Hagen, and B. Stein. Crowdsourcing a Large Corpus of Clickbait on Twitter. To appear, 2017.