Predicting Article Retweets and Likes Based On The Title Using Machine Learning
Abstract—Choosing a good title for an article is an important step in the writing process. The more interesting the article title seems, the higher the chance a reader will interact with the whole content. This project focuses on predicting the number of retweets and likes on Twitter for FreeCodeCamp's articles based on their titles. The problem is treated as a classification task using supervised learning. With data from FreeCodeCamp on Twitter and Medium, machine learning methods including support vector machines (SVM), decision trees, Gaussian naive Bayes (GaussianNB), k-nearest neighbors, logistic regression and the naive Bayes classifier for multinomial models (MultinomialNB) were used to make the predictions. This study shows that the MultinomialNB model performed best for retweets, reaching an accuracy of 60.6%, while logistic regression reached 55.3% for likes.
Keywords— prediction, machine learning, social media, title, performance
values from all the three variables: articles are retweeted and liked hundreds of times and clapped thousands of times. From these numbers, we can define what is expected from our articles and the interaction with them. The length of the text goes from 21 to 146 characters, as expected for tweet content.
FIG. 1. Histogram
4. Title length that performed better
FIG. 9. Best Words for Retweet

FIG. 10. Best Words for Like

FIG. 11. Best Words for Claps

Support vector machines (SVM): This model searches for the hyperplane that separates the classes with the largest possible margin, finding the coefficients that maximize this margin. Only the closest data points are relevant to identify (or to support the definition of) the hyperplane, and they are named support vectors. SVM performs linear classifications, but it can also efficiently perform non-linear ones; for this, it is necessary to use a kernel trick, mapping the inputs into high-dimensional feature spaces.

This model was chosen because it works well with a big quantity of features and a relatively small quantity of data, and because it deals well with both linear and non-linear datasets. Due to the fact that we have more samples than features, it can generate a good prediction.

Decision trees: This model uses a decision tree to classify the dataset into smaller subsets and to reach a conclusion about a target value. The tree consists of nodes, where the intermediate ones are decision nodes and the ones at the extremes (the leaves) are the final outcomes.

This model was chosen because it can be easily interpreted, visualized and explained, and because it implicitly performs variable screening or feature selection.

Gaussian naive Bayes (GaussianNB): This model is a classification technique based on Bayes' theorem. It assumes independence among the involved features; nevertheless, this approach performs well even on data whose features are dependent. It relies on the probability of an event, based on prior knowledge of conditions that might be related to the event.

This model was chosen because this family of algorithms can predict well with a small set of data and when there is a comparatively large number of features.

K-nearest neighbors (KNN): This algorithm takes into consideration the k closest points (neighbors) around the target and uses them to learn how to classify the desired point.

This model was chosen because it is simple to implement, no assumption about the data is necessary, and the non-parametric nature of KNN gives an advantage in certain settings where the data may be highly unusual.

Logistic regression: This model is named after the core statistical function it is based on, the logistic function. Logistic regression estimates the parameters of this function (the coefficients), and as a result it predicts the probability of the presence of the characteristic of interest.

This model was chosen because it provides probabilities for outcomes and convenient probability scores for observations.

Naive Bayes classifier for multinomial models (MultinomialNB): This model is similar to Gaussian naive Bayes, but it assumes a multinomial distribution of the dataset instead of a Gaussian one.

This model was chosen because it works well for data that can easily be turned into counts, such as word counts in text. In practice, fractional counts such as TF-IDF may also work.

In the end, the models with the best accuracy were selected. To estimate it, we used 5-fold cross-validation, which splits the dataset into 5 parts, 4 for training and 1 for testing. The implementation of this project was made using Python, NumPy [15] and scikit-learn [16]. The full code used in this project is available at [17].

D. Benchmark

This project ran the same training and testing data through multiple algorithms, and the comparison between them was used to evaluate the overall performance. The overall benchmark was made by comparing our results with the logistic regression results.
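To make this selection step concrete, below is a minimal sketch of how the six candidate classifiers can be compared against the logistic regression benchmark with 5-fold cross-validation in scikit-learn. It assumes the titles were already vectorized into a feature matrix X with range labels y; the variable names and the accuracy scoring are illustrative assumptions, not taken verbatim from the project code [17].

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

MODELS = {
    "LogisticRegression (benchmark)": LogisticRegression(),
    "GaussianNB": GaussianNB(),
    "MultinomialNB": MultinomialNB(),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "SVC": SVC(kernel="linear"),
    "DecisionTreeClassifier": DecisionTreeClassifier(),
}

def compare_models(X, y):
    # 5-fold cross-validation: 4 parts train, 1 part tests, averaged over folds.
    # Note: GaussianNB needs a dense matrix; pass X.toarray() if X is sparse.
    for name, model in MODELS.items():
        scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
        print(name, round(scores.mean(), 3), "+/-", round(scores.std(), 3))

The model with the highest mean accuracy per feature (retweets, likes, claps) is the one kept in the end.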
A. Data Preprocessing

1. Data cleaning

The first part of the data processing was to clean the dataset. After downloading the tweets, we removed the ones that didn't have any URL (pointing to the Medium article) or any title. Data points whose values of likes, claps or retweets were not positive numbers or zero were also excluded.

Words that were Twitter user names were replaced by the character '@' (so they could still be used in the statistics), and words that were wrongly encoded non-ASCII characters were also removed.
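A minimal sketch of these cleaning rules follows; the function name and the regular expression are our illustrative assumptions, not the project's exact code.

import re

def clean_title(title):
    # Replace @username mentions with a bare '@' so mentions can still be counted.
    title = re.sub(r"@\w+", "@", title)
    # Drop wrongly encoded characters outside the ASCII range.
    title = title.encode("ascii", errors="ignore").decode("ascii")
    return title.strip()

print(clean_title("Learn CSS Grid with @freecodecamp \u2014 today!"))
# prints: Learn CSS Grid with @  today!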
Some of the data points have the same URL, meaning they were shared more than once on the Twitter account. After analyzing each one of the duplicates, we noticed that there were two types of repeated tweets: same URL and same title; and same URL but a different title. We removed the ones of the first type. The ones of the second type we kept, because their titles were completely rewritten, so each can be considered a different data point.

For the remaining data points, we removed the ones that are considered outliers, as explained in II B 4. We reached the following numbers: retweets and likes have 711 items (658 without outliers) each; claps has the same number with or without outliers, 711.

2. Classify the features in ranges

The defined ranges for our features are:

1. Retweets: 0-10, 10-30, 30+

2. Likes: 0-25, 25-60, 60+

where the range "x-y" means bigger than x and less than or equal to y. The first range item of each feature also contains zero.

3. Bag of words

To make it possible to analyze the title in each data point, we need to map each word into a number. This is necessary because machine learning models normally don't process raw text, but numerical values. To achieve this, we used a bag of words model [18]. In this model, the presence and often the frequency of words are taken into consideration, but their order or position is ignored.

For the calculation of the bag of words, we use a measure called Term Frequency-Inverse Document Frequency (TF-IDF) [19]. The goal is to limit the impact of tokens (words) that occur very frequently.

At this step, we processed the collection of documents and built a vocabulary with the known words. We reached a vocabulary of 1356 words for retweets, 1399 words for likes and 1430 words for claps.

B. Implementation

1. Training and Testing Data Split

2. Clean the title removing not desired words III A 1

3. Filter the outliers from the dataset II B 4

4. Classify the features in ranges III A 2

5. Divide the dataset in training and test

6. Create a bag of words using TF-IDF for the titles III A 3 (see the sketch after this list)

7. Train the model and calculate the accuracy
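As a sketch of steps 4 and 6 above, the snippet below classifies a retweet count into the ranges defined in III A 2 and builds the TF-IDF bag of words with scikit-learn's TfidfVectorizer; the helper name and the sample titles are illustrative assumptions.

from sklearn.feature_extraction.text import TfidfVectorizer

def retweet_range(count):
    # First range includes zero; "x-y" means greater than x and at most y.
    if count <= 10:
        return "0-10"
    if count <= 30:
        return "10-30"
    return "30+"

titles = ["Learn CSS Grid in 5 minutes", "How I became a web developer"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(titles)  # sparse matrix: titles x vocabulary
print(len(vectorizer.vocabulary_))    # vocabulary size (1356 for retweets in our data)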
3. Model Performance Metrics

We separated the dataset into a learning and a validation set. A validation set is important to reduce the risk of over-fitting the chosen model. To avoid discarding relevant data points, we used a cross-validation strategy. Cross-validation splits the training dataset into k folds, with k - 1 folds used to train the model and the last one used to test it. This strategy is repeated multiple times and the overall performance is the average of the computed values.

To estimate the model's accuracy, we used a 5-fold cross-validation that splits the dataset into 5 parts, 4 for training and 1 for testing.
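A minimal sketch of this 5-fold procedure, assuming a feature matrix X and labels y; the shuffling and the choice of logistic regression here are illustrative assumptions.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

def five_fold_accuracy(X, y):
    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    accuracies = []
    for train_idx, test_idx in kf.split(X):
        # 4 folds train the model, the remaining fold tests it.
        model = LogisticRegression()
        model.fit(X[train_idx], y[train_idx])
        accuracies.append(model.score(X[test_idx], y[test_idx]))
    # The reported accuracy is the average over the 5 runs.
    return np.mean(accuracies)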
During the implementation of the code available at [17], the biggest challenge was to find the best accuracy, because in the initial attempts the final accuracy was close to or even worse than the benchmark. To reach an acceptable value, it was necessary to iterate over the same code several times and try to understand which factors increased or decreased this metric.

Since our input data was limited, because we are using just the titles, we had to bring alternatives and hypotheses about what we could obtain from this value. In III C, we discuss several approaches that were used to get a better accuracy.

C. Refinement

During the models' implementation a lot of steps were tested and some of them needed to be modified to reach better performance. For choosing a better parameter, we iterated over the options and decided on the one that optimized the accuracy.

Ranges of likes and retweets: We tried ranges with different numbers of elements and also different values for the ranges. Some of them were underperforming, while others reached close values. The chosen one divides the ranges with a similar number of data points and also offers a good overview of the feature analyzed.

New models: We increased the number of models to be tested by three. As the classifiers previously chosen were not reaching the desired accuracy, we decided to add new models to try to make better predictions. The new models had a worse performance than the first ones.

Outliers: We made some tests to discover whether we should keep the outliers for the training or remove them. During the tests, we discovered that if we keep the outliers, the accuracy is always worse.

Bag of words: To create the bag of words, we had the option of choosing the CountVectorizer or the TfidfVectorizer. During the simulations we got better results with the latter, TfidfVectorizer.

Clean words: Another step during implementation was to decide whether we should keep the title in its original form or remove the undesired words. We tested removing the names of the Twitter users that appeared in some titles and some wrong characters that appeared in our dataset during the crawling process. After checking the results, we decided to clean up the data.

Model's parameters: For each model tested, we calculated the accuracy for the default model (without any tuning, just the default parameters) and we also tried to come up with better parameters to improve the accuracy. To test the combinations of new parameters and fine-tune the model, we used grid search (GridSearchCV).
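A sketch of this tuning step for the logistic regression model follows; the parameter grid is an illustrative assumption, not the exact grid used in [17].

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

param_grid = {
    "C": [0.1, 1, 2, 10],
    "solver": ["lbfgs", "liblinear"],
    "fit_intercept": [True, False],
}

def tune_logistic_regression(X, y):
    # Exhaustively tries every combination in param_grid with 5-fold CV
    # and keeps the estimator with the best mean accuracy.
    search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                          cv=5, scoring="accuracy")
    search.fit(X, y)
    print(search.best_params_, search.best_score_)
    return search.best_estimator_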
A. Model Evaluation and Validation

Tables IV, V and VI describe the accuracy values we reached with the proposed models. The final accuracy for each of the features is: 55.3% for likes, 60.6% for retweets and 49% for claps.

TABLE IV. Accuracy Likes (%)

ID  Model Name                  Default  Tuned
0   Benchmark                   53.15    –
1   LogisticRegression          55.30    45.45
2   GaussianNB                  46.21    46.21
3   DecisionTreeClassifier      43.18    45.45
4   SVC                         51.51    51.51
5   KNeighborsClassifier        44.70    46.21
6   MultinomialNB               40.91    45.45
7   GradientBoostingClassifier  47.73    48.48

The parameters that we obtained by running grid search for the models and features are shown in tables VII, VIII and IX.

After the steps presented in the previous sections, we noticed that this model is robust to outliers. Even though part of the design process was to eliminate them during the training step, figure 13 shows the little variation of the final result whether the outliers are ignored or not.

This model can be considered to reach a reasonable accuracy for the proposed goal. We can make further improvements.
V. CONCLUSION

A. Free-Form Visualization

FIG. 13. Evolution Accuracy

Figure 13 shows the evolution of the development of this project. We started with the benchmark; from there we started adding and testing features and treating the dataset.

During the evolution of the metrics, some variables decreased the accuracy value at first, but in further steps made it grow. The evolution happened in the following steps:

1. Benchmark

2. t1: Removed outliers

3. t2: t1 + Cleaned the words

4. t3: t2 + Added TF-IDF

5. t4: t3 + Used stopwords

6. Winner: t4 + Parameters from the model tuned
TABLE VIII. Tuned Parameters Retweets

ID  Parameters
1   C=10, class_weight=None, dual=False, fit_intercept=False, intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1, penalty='l2', random_state=None, solver='lbfgs', tol=0.0001, verbose=0, warm_start=False
2   priors=None
3   class_weight='balanced', criterion='gini', max_depth=20, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=None, splitter='best'
4   C=1, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=1, gamma='auto', kernel='linear', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False
5   algorithm='auto', leaf_size=10, metric='minkowski', metric_params=None, n_jobs=1, n_neighbors=10, p=2, weights='uniform'
6   alpha=1, class_prior=None, fit_prior=True
7   criterion='friedman_mse', init=None, learning_rate=0.5, loss='deviance', max_depth=10, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=2, min_samples_split=1.0, min_weight_fraction_leaf=0.0, n_estimators=50, presort='auto', random_state=None, subsample=1.0, verbose=0, warm_start=False

TABLE IX. Tuned Parameters Claps

ID  Parameters
1   C=2, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=10, multi_class='ovr', n_jobs=1, penalty='l2', random_state=None, solver='liblinear', tol=0.0001, verbose=0, warm_start=False
2   priors=None
3   class_weight=None, criterion='gini', max_depth=10, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=None, splitter='random'
4   C=1, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=1, gamma='auto', kernel='linear', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False
5   algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=1, n_neighbors=20, p=2, weights='uniform'
6   alpha=0.5, class_prior=None, fit_prior=True
7   criterion='friedman_mse', init=None, learning_rate=0.5, loss='deviance', max_depth=3, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=0.5, min_weight_fraction_leaf=0.0, n_estimators=100, presort='auto', random_state=0, subsample=1.0, verbose=0, warm_start=False
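To show how the tuned parameters map back to scikit-learn estimators, the snippet below reconstructs a subset of the retweet models from Table VIII (rows 1, 5 and 6). This is a sketch; some parameters (e.g. multi_class) depend on the scikit-learn version used at the time.

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier

tuned_retweet_models = {
    1: LogisticRegression(C=10, fit_intercept=False, max_iter=100,
                          multi_class="ovr", penalty="l2", solver="lbfgs"),
    5: KNeighborsClassifier(algorithm="auto", leaf_size=10,
                            metric="minkowski", n_neighbors=10, p=2,
                            weights="uniform"),
    6: MultinomialNB(alpha=1, class_prior=None, fit_prior=True),
}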
B. Reflection

In this project, we developed classifiers to understand how many times an article will receive interactions like retweets and likes (both on Twitter) and claps (on Medium). We also presented a list of words that have a high chance of impacting the readers positively when used in the title of the article. We classified and extracted information about the Categories used on Medium that are commonly present among our top performers. The number of words and the length of the title were also discussed, and an optimal number to increase the success numbers was presented.

Besides the mathematical analysis used to extract important characteristics of the dataset, we also developed and trained models to predict how an article would perform. To achieve this machine learning project, some features and characteristics were used:

1. Bag of words to tokenize the words of the title

2. Term Frequency-Inverse Document Frequency (TF-IDF) to translate the frequency of words in the dataset

3. Clean the dataset and each title before training the model (remove Twitter users and invalid characters)

4. Grid search to search for the best model parameters

5. Remove outliers before processing the data

6. Test the dataset to discover a good relation between train and test data points

7. Test and split the number of likes, retweets and claps in ranges

8. Use stopwords to remove common terms of the language (see the sketch below)

Following the steps listed here, this methodology and framework can be used to classify any kind of article and subject that is created on Medium and shared on Twitter. This solution is not limited by the context or the subject of the articles and can be easily reproduced on other datasets.
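As a sketch of step 8 above, scikit-learn's TfidfVectorizer can drop common English stopwords before building the vocabulary; the sample titles are illustrative assumptions.

from sklearn.feature_extraction.text import TfidfVectorizer

titles = ["How to learn JavaScript the easy way",
          "The best tools for a web developer"]
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(titles)
print(sorted(vectorizer.vocabulary_))  # "how", "to", "the", ... are dropped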
The hard part of the project was to reach a higher accuracy than the one found with simple models; it was necessary to do multiple iterations and several modifications to the initial assumptions. Reaching 61%, 55% and 49% is not the ideal solution, but it can clearly lead to the creation of a good title.

C. Improvement

For future work we can think about some additional improvements: adding more features to the original dataset, making it possible to relate more information to the success of the article (for example, we can correlate the words of the title with trendy words of the month); bringing more data points to train our model, which would also increase the accuracy of the solution; and trying to use the position of the word in the title to classify its importance.

REFERENCES

[1] eMarketer Report. (2017). US Time Spent with Media: eMarketer's Updated Estimates for 2017. Accessed 9 Aug. 2018. Available at: https://www.emarketer.com/Report/US-Time-Spent-with-Media-eMarketers-Updated-Estimates-2017/2002142.

[2] Common Sense Media. (2015). The Common Sense Census: Media Use by Tweens and Teens. Accessed 9 Aug. 2018. Available at: https://www.commonsensemedia.org/research/the-common-sense-census-media-use-by-tweens-and-teens.

[3] Lindgaard, Gitte; Fernandes, Gary; Dudek, Cathy; Brown, Judith M. (2006). Attention web designers: You have 50 milliseconds to make a good first impression! Behaviour and Information Technology, 25(2), 115-126. 10.1080/01449290500330448.

[4] Shearer, E., Gottfried, J. (2017). News Use Across Social Media Platforms 2017. Accessed 9 Aug. 2018. Available at: http://www.journalism.org/2017/09/07/news-use-across-social-media-platforms-2017/.

[5] Chen, Yimin, Niall J. Conroy, and Victoria L. Rubin. "Misleading Online Content: Recognizing Clickbait as 'False News'." ACM WMDD, 9 Nov. 2015.

[6] Lex, Elisabeth, Andreas Juffinger, and Michael Granitzer. "Objectivity Classification in Online Media." Proceedings of the 21st ACM Conference on Hypertext and Hypermedia, 13 June 2010.

[7] Glorot, Xavier, Antoine Bordes, and Yoshua Bengio. "Domain Adaptation for Large-Scale Sentiment Classification: A Deep Learning Approach." Proceedings of the 28th International Conference on Machine Learning, 2011.

[8] Twitter. Accessed 13 Aug. 2018. Available at: https://www.twitter.com.

[9] Medium. Accessed 13 Aug. 2018. Available at: https://www.medium.com.

[10] freeCodeCamp on Medium. Accessed 13 Aug. 2018. Available at: https://medium.freecodecamp.org/.

[11] freeCodeCamp on Twitter. Accessed 13 Aug. 2018. Available at: https://twitter.com/freecodecamp.

[12] N. Abdelhamid, A. Ayesh, F. Thabtah, S. Ahmadi, W. Hadi. MAC: A multiclass associative classification algorithm. Journal of Information & Knowledge Management (JIKM), 11(2) (2012), pp. 1250011-1-1250011-10. WorldSciNet.

[13] I.H. Witten, E. Frank, M.A. Hall. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, Burlington, MA (2011).

[14] F. Thabtah, S. Hammoud, H. Abdeljaber. Parallel Associative Classification Data Mining Frameworks Based on MapReduce. Journal of Parallel Processing Letters, World Scientific, March 2015.

[15] NumPy. Accessed 14 Aug. 2018. Available at: http://www.numpy.org/.

[16] scikit-learn. Accessed 14 Aug. 2018. Available at: http://scikit-learn.org/stable/.

[17] de Freitas, Flavio H. Jupyter notebook - Implementation code: Predicting article retweets and likes based on the title using Machine Learning. Accessed 11 Sep. 2018. Available at: https://github.com/flaviohenriquecbc/machine-learning-capstone-project/blob/master/title-success-prediction.ipynb.

[18] Nahm, U. Y.; Mooney, R. J. Text mining with information extraction. In: AAAI 2002 Spring Symposium on Mining Answers from Texts and Knowledge Bases. [S.l.: s.n.], 2002. v. 1.

[19] Baeza-Yates, R.; Ribeiro-Neto, B. et al. Modern Information Retrieval. [S.l.]: ACM Press, New York, 1999. v. 463.