Clickbait detection using word embeddings

The torpedo Clickbait Detector at the Clickbait Challenge 2017


Vijayasaradhi Indurthi
International Institute of Information Technology, Hyderabad
[email protected]

Subba Reddy Oota
International Institute of Information Technology, Hyderabad
[email protected]
arXiv:1710.02861v1 [cs.CL] 8 Oct 2017

ABSTRACT

Clickbait is a pejorative term for web content that is aimed at generating online advertising revenue, especially at the expense of quality or accuracy, relying on sensationalist headlines or eye-catching thumbnail pictures to attract click-throughs and to encourage forwarding of the material over online social networks. We use distributed word representations of the words in the title as features to identify clickbaits in online news media. We train a machine learning model using linear regression to predict the clickbait score of a given tweet. Our method achieves an F1-score of 64.98% and an MSE of 0.0791. Compared to other methods, our method is simple, fast to train, and does not require extensive feature engineering, yet is moderately effective.

1. INTRODUCTION

Clickbait is web content whose main purpose is to attract attention and encourage visitors to click on a link to a particular web page. Examples of such clickbaits include

• “21 Completely Engrossing Fan Fictions You Won’t Be Able To Stop Reading”
• “These White Tiger Cubs Are The Most Beautiful Creatures You’ll See Today”
• “Here’s What Real Vegans Actually Eat”
• “Bow Wow Had No Clue How To Kill Time During The Grammys And It Was Hilarious”
• “We Know Who Your Celebrity Husband Should Be Based On One Question”

Clickbaits exploit the cognitive phenomenon known as the curiosity gap [4]: the headlines provide forward-referencing cues which generate enough curiosity to compel the reader to click the link and fill the gap. Clickbaits eventually cause disappointment, as they do not live up to the promises made in the headline. Due to their heavy use in online journalism, it is important to develop techniques that automatically detect and combat clickbaits.

Research has shown that using distributed word embeddings can improve the performance of text classification, as they capture lexical and semantic features of the text without the need for explicit feature engineering. However, these word embeddings are generic and may not capture the domain-specific knowledge necessary for the classification task. Our motivation for this work is to answer the following question: “Can we use distributed word embeddings to train a machine learning model and predict the rating of a clickbait item?” By conducting experiments on the clickbait dataset we show that this approach can improve the performance of the classification task.

The main contributions of our paper are as follows: (1) We identify a few hand-crafted features which capture domain-specific information and use them for the classification task. (2) We use pre-trained GloVe vectors as features for the classification task. (3) We combine the GloVe embeddings with the hand-crafted features to predict the clickbait score of a tweet. Our method achieves an F1-score of 64.98% and an MSE of 0.0791.

2. RELATED WORK

[2] highlighted many interesting differences between the clickbait and non-clickbait categories, including sentence structure and word patterns. They rely on a rich set of 14 hand-crafted features to detect clickbait headlines. In addition, [2] build a browser extension which warns the readers of different media sites about the possibility of being baited by such headlines. Their methods achieve 93% accuracy in detecting and 89% accuracy in blocking clickbaits.

[7] attempted to detect clickbait tweets on Twitter by using common words occurring in clickbaits and by extracting some other tweet-specific features. They achieve an F1 score of 73% in classifying tweets as clickbaits or not.

[3] argued for labeling clickbaits as misleading content or false news.

[1] used deep learning techniques, namely a bidirectional recurrent neural network model with character and word embeddings as features. They achieve state-of-the-art results with an F1 score of 98% in classifying online content as clickbaits or not.

While [2] and [1] explore identifying clickbait titles in web pages, [7] explore identifying clickbaits in tweets.

Unlike earlier work on clickbaits, the Clickbait Challenge [9] requires us to calculate a clickbait score for a tweet post.

3. APPROACH

The clickbait dataset [10] contains tweets from Twitter. Twitter is an online news and social networking service where users post and interact with messages, “tweets”, restricted to 140 characters. Each tweet in the dataset has the text of the posted tweet and its associated metadata, such as keywords, time of the post, media linked with the post, description of the target and the target paragraphs.

In spite of the availability of the tweets’ metadata, we limit our experiments to only the text of the post for training a machine learning model to predict the clickbait score of each tweet.

We combine a few hand-crafted, domain-specific features with pre-trained distributed word representations as features for this
task. We train a linear regression model to predict the clickbait score of a tweet.

Hand-crafted features: In addition to the first three features used by [2], i.e., the number of words, the number of stop words and the average word length of the headline, we use the following additional hand-crafted features.

1. Presence of a question form: When, What, Which, Who, Whose, Whom, How, Where, Can, Should
2. Presence of digits at the beginning of the headline
3. Presence of gerunds, i.e., the continuous form of a verb in the headline, like walking, eating, attending etc.
4. Presence of superlative forms of adjectives like cutest, best, hottest, greatest etc.

Distributed word embeddings: Distributed word embeddings map words in a language to high-dimensional real-valued vectors in order to capture hidden semantic and syntactic properties of words. These embeddings are typically learned from large unlabeled text corpora. In our work, we use the pre-trained 300-dimensional GloVe embeddings [5], which were trained on about 6 billion words from the 2014 Wikipedia corpus and the English Gigaword Fifth Edition corpus. GloVe fits these vectors to global word-word co-occurrence statistics of the corpus.

To arrive at the embedding of a tweet post, we take the average of the GloVe embeddings of all the words present in the tweet post.

We use linear regression, a very simple machine learning algorithm, to predict the clickbait score of the tweet post. We model the given challenge as a regression problem, where the dependent variable is the clickbait score and the independent variables are the features mentioned above. We had 307 features for training (7 hand-crafted and 300 from GloVe). We have not used any kind of regularisation technique. We felt that a simple model like linear regression would generalize sufficiently well compared with a more complex model.

Features Used
Number of words
Number of stop words
Average length of the word
Presence of question form
Presence of numbers at the start of headline
Presence of continuous form of verb
Presence of superlative forms of adjectives
300 dimensions of the GloVe embeddings

Table 1: Features used for training our model.

The advantages of our approach are

1. Simple to implement, as there is not much feature engineering.
2. Pre-trained vectors are available and ready to use.
3. The machine learning technique is simple to train and does not need long training times.
4. Unlike deep learning methods, our methods are interpretable to a certain extent.
5. Can work with modest hardware requirements, as training is not memory intensive.
6. The model generated is compact and can be used on low-cost hardware like phones.

4. EVALUATION RESULTS

Two datasets have been provided for training the model for [9]: a small initial dataset, used in [8], for training and a bigger dataset [10] for validation. Table 2 shows the details of each dataset. In our approach, we concatenate the training and validation datasets to make a bigger training dataset. Out of this we select the same number of clickbaits and non-clickbaits, so that both classes are equally represented. We then randomly split the resulting set 80:20 for training and validation.

Dataset            Total     Clickbaits   No-Clickbaits
Training [8]       2495      762          1697
Validation [10]    19538     4761         14777
Unlabelled         80012     NA           NA
Test               Unknown   Unknown      Unknown

Table 2: Dataset details.

For the final official evaluation, we used the whole of the above dataset for training the model, which was then used to make predictions on the unseen test set.

Official evaluation has been done on the platform called TIRA [6]. Table 3 shows the evaluation metrics which the official evaluators computed for our model.

Evaluation Metric                Value
Mean squared error               0.0791655793621
Median absolute error            0.236312405103
F1 score                         0.649884407912
Precision                        0.5297319933
Recall                           0.840531561462
Accuracy                         0.784551346225
Normalised mean squared error    1.07669865075
Mean absolute error              0.240963463871
Explained variance               0.345845579242
R2 score                         -0.0766986507455
Runtime                          00:04:55

Table 3: Results of the official evaluation for our method.

The MSE of the baseline system was 0.0435, which our system did not match. This might be because of high bias in our model, as it is a very simple model. A somewhat more complex model, a different machine learning technique, or more feature engineering might help in improving the performance of the model.

5. CONCLUSION

In this paper, we develop a machine learning model, using pre-trained distributed representations of words trained on a huge corpus, to predict the clickbait score of a tweet post. In future, we would like to use more complex machine learning models, such as neural networks, to predict the clickbait score. We would also like to include more hand-crafted features in our future work.

References
[1] A. Anand, T. Chakraborty, and N. Park. We used Neural Networks to Detect Clickbaits: You won’t believe what happened Next! arXiv preprint arXiv:1612.01340, 2016.
[2] A. Chakraborty, B. Paranjape, S. Kakarla, and N. Ganguly. Stop Clickbait: Detecting and Preventing Clickbaits in Online News Media. In Proc. of the 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pages 9–16. IEEE, 2016.
[3] Y. Chen, N. J. Conroy, and V. L. Rubin. Misleading Online Content: Recognizing Clickbait as False News. In Proceedings of the 2015 ACM on Workshop on Multimodal Deception Detection, pages 15–19. ACM, 2015.
[4] G. Loewenstein. The Psychology of Curiosity: A Review and Reinterpretation. Psychological Bulletin, vol. 116, 1994.
[5] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
[6] M. Potthast, T. Gollub, F. Rangel, P. Rosso, E. Stamatatos, and B. Stein. Improving the Reproducibility of PAN’s Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling. In CLEF, pages 268–299. Springer, 2014.
[7] M. Potthast, S. Köpsel, B. Stein, and M. Hagen. Clickbait Detection. In European Conference on Information Retrieval, pages 810–817. Springer, 2016.
[8] M. Potthast, S. Köpsel, B. Stein, and M. Hagen. Clickbait Detection. In N. Ferro, F. Crestani, M.-F. Moens, J. Mothe, F. Silvestri, G. Di Nunzio, C. Hauff, and G. Silvello, editors, Advances in Information Retrieval. 38th European Conference on IR Research (ECIR 16), volume 9626 of Lecture Notes in Computer Science, pages 810–817, Berlin Heidelberg New York, Mar. 2016. Springer.
[9] M. Potthast, T. Gollub, M. Hagen, and B. Stein. The Clickbait Challenge 2017: Towards a Regression Model for Clickbait Strength. In Proceedings of the Clickbait Challenge, 2017.
[10] M. Potthast, T. Gollub, K. Komlossy, S. Schuster, M. Wiegmann, E. Garces, M. Hagen, and B. Stein. Crowdsourcing a Large Corpus of Clickbait on Twitter. In (to appear), 2017.
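
The feature construction described in Section 3 (seven hand-crafted features followed by the averaged word embedding of the post) can be sketched as below. This is only an illustration under stated assumptions, not the authors' code: the stop-word list, the suffix heuristics for gerunds and superlatives, the handling of out-of-vocabulary words, and the tiny 3-dimensional `TOY_EMBEDDINGS` table (a stand-in for the 300-dimensional GloVe vectors) are all hypothetical. All function and constant names are invented for this sketch.

```python
import re

# Assumption: a short illustrative stop-word list; the paper does not name its list.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "in", "on",
              "and", "you", "be", "it", "this", "these"}

# Question words listed in Section 3 (duplicates removed).
QUESTION_WORDS = {"when", "what", "which", "who", "whose", "whom",
                  "how", "where", "can", "should"}

# Hypothetical 3-dimensional stand-in for the 300-dimensional GloVe vectors.
TOY_EMBEDDINGS = {
    "tiger": [1.0, 0.0, 0.0],
    "cubs": [0.0, 1.0, 0.0],
    "beautiful": [0.0, 0.0, 1.0],
}
EMBEDDING_DIM = 3


def tokenize(text):
    """Lowercase the text and split it into letter/digit/apostrophe tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def handcrafted_features(text):
    """Return the seven hand-crafted features in the order of Table 1."""
    words = tokenize(text)
    n_words = len(words)
    n_stop = sum(w in STOP_WORDS for w in words)
    avg_len = sum(len(w) for w in words) / n_words if words else 0.0
    # Assumption: a question word anywhere in the post counts as question form.
    has_question = any(w in QUESTION_WORDS for w in words)
    starts_with_digit = re.match(r"\s*\d", text) is not None
    # Assumption: crude suffix heuristics instead of a part-of-speech tagger.
    has_ing = any(len(w) > 4 and w.endswith("ing") for w in words)
    has_superlative = any(
        (len(w) > 4 and w.endswith("est")) or w in {"best", "worst", "most", "least"}
        for w in words
    )
    return [float(n_words), float(n_stop), avg_len, float(has_question),
            float(starts_with_digit), float(has_ing), float(has_superlative)]


def average_embedding(text):
    """Average the vectors of known words; unknown words are skipped (assumption)."""
    vectors = [TOY_EMBEDDINGS[w] for w in tokenize(text) if w in TOY_EMBEDDINGS]
    if not vectors:
        return [0.0] * EMBEDDING_DIM
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def tweet_features(text):
    """Concatenate hand-crafted features and the averaged embedding."""
    return handcrafted_features(text) + average_embedding(text)


if __name__ == "__main__":
    print(tweet_features("These White Tiger Cubs Are The Most Beautiful Creatures"))
```

With real GloVe vectors the returned list would have 307 entries, and one such list per post would form a row of the design matrix passed to an ordinary least-squares regressor.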

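
Two of the values in Table 3 can be cross-checked against the others. The sketch below recomputes the F1 score from the reported precision and recall, and the R2 score from the normalised mean squared error. The second relation assumes that the normalised MSE is the MSE divided by the variance of the true scores, which the paper does not state; the reported numbers are consistent with that reading. The helper names are invented for this check.

```python
# Values copied from Table 3.
PRECISION = 0.5297319933
RECALL = 0.840531561462
REPORTED_F1 = 0.649884407912
NMSE = 1.07669865075
REPORTED_R2 = -0.0766986507455


def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


def r2_from_nmse(nmse):
    """Assumption: NMSE = MSE / Var(y), in which case R2 = 1 - NMSE."""
    return 1.0 - nmse


if __name__ == "__main__":
    print("F1 recomputed:", f1_from(PRECISION, RECALL), "reported:", REPORTED_F1)
    print("R2 recomputed:", r2_from_nmse(NMSE), "reported:", REPORTED_R2)
```

Under that assumption, a normalised MSE above 1 (equivalently a negative R2) means the squared error is larger than that of always predicting the mean score, which fits the observation in Section 4 that the system did not match the baseline.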