Predicting User Influence in Social Media
All content following this page was uploaded by Chunjing Xiao on 24 May 2015.
Abstract: Understanding influence plays a vital role in enhancing business operations and improving the effect of information propagation. User influence in social media, such as Twitter, has therefore been widely studied using different standards, such as the number of followers, retweets and so on. However, little work considers the accurate click number of short URLs as the measurement of influence. In Twitter, short URLs are frequently included in tweets because of the character limit, and some users may care more about the click number of their URLs than about the number of followers or retweets. Thus, it is necessary to analyze the factors that impact the click number received by the URLs of users. In this paper, we conduct predictive analyses of user influence as measured by the click number of short URLs. We first exploit a wide range of possible features consisting of the sets of user properties, behavior and topics. We then employ logistic regression analysis to identify the significant features for predicting user influence, and find that most of the features we propose have significant predictive power. Finally, based on large-scale Twitter data, four models are used for the prediction, and the Bagging model achieves the best result, an overall accuracy of more than 82%.
Index Terms: Twitter; Influence; Web Traffic; Predict
I. INTRODUCTION
Social media such as Twitter and Facebook have become an important platform to publish and receive information, which is changing the way people communicate and share knowledge. Users in these systems post and discuss millions of news items, opinions, and reviews to promote them. Correspondingly, influence, which has long been studied in the fields of sociology, communication, marketing and political science [1, 2], also receives much attention in social media, because understanding influence can provide insights into why certain information propagates faster and how to improve the effect of content diffusion.
Currently, user influence has been analyzed from different aspects based on different standards. For example, Cha et al. [3] present a comparison of three measures of influence: the number of followers, retweets and mentions in Twitter. Kwak et al. [4] analyze user influence based on another three standards: the number of followers, PageRank and retweets. Subsequently, the amount of extended retweets is used as the standard for predicting the influence of Twitter users [5], and for analyzing word-of-mouth information propagation in Twitter [6]. However, none of these works consider the accurate click number of short URLs as the standard of influence. In fact, shortened URLs are frequently included in the content published by users because of the limit on content length, especially in Twitter, which limits a tweet to 140 characters. There should also be many users who aim to attract web traffic through Twitter. In addition, in Twitter the number of retweets and the click number of URLs received by a user are disproportionate [7]. Therefore it is necessary to understand user influence based on the standard of the click number of short URLs.
In this paper, we predict user influence based on the accurate click number of short URLs. Due to the importance of the click number, Antoniades et al. [8] compare the popular websites in Twitter, as measured by the click number received via Twitter, with those in Alexa.com. Romero et al. [9] use the global click number of URLs as the ground truth to evaluate their proposed algorithm for ranking users in Twitter but, as they note, the global clicks are noisy because they also include clicks from sources outside Twitter, such as Facebook and forums.
In contrast to the existing work, we use the accurate click number as the standard of user influence. According to the click information provided by Bitly.com, there are three kinds of click number: accurate clicks, referring to the click number received by each URL of each user; domain clicks, referring to the click number received by any object in a domain (such as Twitter.com); and global clicks, referring to the sum of all the click numbers. Therefore the accurate click number is a precise measure, without the noise present in the domain clicks and global clicks.
Based on the standard of the accurate click number, we predict the influence of users in Twitter. We treat the prediction as a classification task by defining four categories to represent the levels of user influence. To conduct the prediction, we exploit a wide range of possible features consisting of the sets of user properties, behavior and topics. The set of user properties includes the basic properties of users, such as the number of followers, friends and lists, as well as the properties we
defined, such as the number of active followers and the type of user domains. The set of behavior features is composed of the number of tweets and the entropies of the published time of URLs. The set of topics includes the topic category and topic entropy. After extracting these features, we first analyze the significance of each feature by using logistic regression analysis and find that the majority of features are statistically significant. Then, by using multiple classification models, we predict the levels of user influence and achieve an overall accuracy of more than 82% with the Bagging model.
II. RELATED WORKS
Studies related to influence in online social networks range from ranking influential users [3, 4, 10, 11] and quantifying user influence [5, 9] to predicting the popularity of contents [12-15]. Specifically, Kwak et al. [4] find that the ranking of users by the amount of retweets differs from the ranking by the number of followers and by PageRank in the follower network. Cha et al. [3] also compare user influence based on indegree (the number of followers), the number of retweets and the number of mentions, and demonstrate that popular users with high indegree are not necessarily influential in terms of spawning retweets or mentions. Besides, influential users are identified in Twitter by taking the topical similarity and the link structure into account [10] and by using a modified k-shell decomposition algorithm [11].
Apart from ranking influential users, Hofman et al. [5] quantify the influence of general users based on the number of extended retweets. This standard, beyond official retweets, also includes the amount of implicit propagation, which occurs when a user shares a URL that has already been shared by one of his friends (followings) without necessarily citing the information source. Based on this measure, to predict user influence they explore features consisting of the numbers of followers, friends and tweets, the date of joining, and the past influence of users, including past total influence and local influence. Their predictive model, the regression tree, achieves relatively poor performance ($R^2 = 0.34$) without averaging predicted and actual values at the leaf nodes. Since the majority of users act as passive information consumers and do not forward content to the network, Romero et al. [9] developed an efficient algorithm to quantify the influence of all the users in Twitter by taking passivity into account, and they used the global clicks of short URLs as the standard of influence, which is noisy, as they note.
Another body of work is the prediction of popularity in Twitter. Hong et al. [12] predict the popularity of tweets as measured by the number of future retweets. They define several categories to represent the volume of retweets and predict which category a tweet will belong to. The prediction of whether a tweet will be retweeted is studied in [13]. Based on a model with the passive-aggressive algorithm, they can automatically predict retweets and find that the performance is dominated by social features, but the tweet features add a substantial boost. Artzi et al. [14] predict whether a message will elicit a user response in Twitter based on a discriminative model, and they explore various sources as features, such as the language used in the tweet and the user's social network and history. Bandari et al. [15] predict the popularity of news items in Twitter prior to their release. A multi-dimensional feature space derived from properties of the article is exploited for the prediction.
However, differing from these studies, we use a different standard for user influence, the accurate click number of short URLs. Compared with the existing work, analyses and prediction based on the click number can provide insights for improving web traffic via social media. In addition, we explore different features for the prediction, such as the type of tweets, the entropy of the published time of URLs and so on.
III. DATA DESCRIPTION
As our goal is to predict user influence measured by the accurate click number of short URLs, the data for the experiments should mainly comprise the short URLs in tweets published by Twitter users and the accurate click number received by these short URLs.
To obtain these data, we first select the targeted users of Twitter. In particular, we select users who tend to publish tweets including short URLs, and these URLs should be hosted by Bitly, because the short URLs of Bitly take up about 50% of all the URLs in Twitter [8] and their accurate click number can be collected.
To this end, from more than 790 million tweets collected during June 2012 by the Twitter streaming APIs, which return roughly 10% of all public tweets, we extract around 46 million unique users. From these users, we select users satisfying the following conditions:
(i) The language in the profile settings of the user is English, because users speaking English are most popular in Twitter [16] and we are familiar with this language;
(ii) The ratio between the number of tweets including Bitly URLs and the number of all tweets of the user is no less than 80%, because, for users with many Bitly URLs, their influence can be more properly represented by the accurate click number of their URLs;
(iii) The domain focus of the user is more than 80%, and the focused domain is the same as the domain of the website in the user's profile. Here the domain focus refers to the highest fraction of the number of URLs of a domain over the entire number of URLs, and is defined as below:
$$D_i = \frac{1}{V_i} \max_k v_{ik} \tag{1}$$
where $V_i$ refers to the total number of URLs of user i, and $v_{ik}$ refers to the number of URLs with domain k of user i. We employ this selection because this kind of user is more likely to aim to attract web traffic via Twitter;
(iv) The user publishes at least one URL per day on average, because too few URLs will skew the results.
As a result, 32,942 users are selected as our targeted users.
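As a rough illustration of Eq. (1), the domain focus of a single user could be computed along the following lines. This is only a sketch under stated assumptions: the helper name `domain_focus` and the sample URLs are hypothetical, and the input is assumed to be the list of expanded (long) URLs of one user; the paper does not describe its actual implementation.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_focus(long_urls):
    """Return (dominant_domain, D_i) for one user's expanded URLs.

    D_i is the largest per-domain URL count divided by the total
    number of URLs, following Eq. (1). The function name and input
    format are illustrative assumptions, not taken from the paper.
    """
    # Extract the host part of each URL and drop URLs without one.
    domains = [urlparse(u).netloc.lower() for u in long_urls]
    domains = [d for d in domains if d]
    if not domains:
        return None, 0.0
    counts = Counter(domains)
    domain, top = counts.most_common(1)[0]
    return domain, top / len(domains)


# Hypothetical sample: four of five URLs point to the same domain.
urls = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.com/c",
    "http://example.com/d",
    "http://example.org/x",
]
domain, focus = domain_focus(urls)
```

Under condition (iii), a user would be kept only if the returned focus exceeds 0.8 and the dominant domain matches the website listed in the user's profile.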
Then, for these selected users, we use the Twitter APIs to download their profiles, followers and the lists that include them, which amount to more than 194.69 million follower links and 4.33 million list links. We also download their tweets during June through the Twitter APIs, around 9.13 million in total. Among them, approximately 8.57 million tweets include a short URL, and the click information of these URLs is downloaded through the Bitly APIs. The detailed information is presented in Table I.
IV. FEATURES ENGINEERING
Here we introduce the features that will be used to predict user influence. We try to explore a wide range of possible features which help determine the attributes related to user influence. The features consist of the sets of user properties, user behavior and topics.
A. Features of User Properties
We first describe the features about user properties. Based on the user information we can collect, the metadata, such as the number of followers, friends and lists, are extracted as features. Besides, we exploit related information to further describe user characteristics, such as the number of active followers and user domains.
1) Number of followers: Followers of a user are the people who receive all the updates or messages published by this user. The number of followers of a user directly indicates the size of the audience for this user. Therefore, the number of followers is frequently used to measure user influence [3, 4]. Hence, this number is extracted as a feature to predict user influence in this paper. Correspondingly, another basic property of users, the number of friends, which can reflect social capital, is also used as a feature.
In addition, many accounts have been suspended due to spamming or other reasons [17], and some users register multiple accounts and only use one of them, or stop using Twitter. We therefore further compute the number of active followers as a feature for prediction. To this end, we need to identify whether a user is active. In general, users are regarded as active by the owners of online social networks, such as Twitter or Facebook, if they log in at least once a month [18]. However, since we cannot obtain login information, we regard users as active if they have published at least one tweet in the most recent two months. After collecting the recent tweets of each follower through the Twitter APIs, we can compute the number of active followers for each user.
2) Number of Lists: The Twitter List, launched in November 2009, is an official functionality to group sets of users into topical or other categories, and it aims to help users organize the people they follow. If a user has been added into more lists, it means that this user receives more attention from others. Therefore, the number of lists including a user should, to some extent, reflect the popularity of this user.
We also analyze the correlation between the number of lists and the click number of URLs of users in Fig. 1. The x-axis is the number of lists into which the users have been added, and the y-axis refers to the average click number of URLs published by the users. From this figure, we can see that the correlation exhibits some linear characteristics, and the linear correlation coefficient is 0.7337. This indicates that the number of lists cannot accurately reflect the click number, but there exists a certain linear relationship between them. Hence, we explore the number of lists as a feature to identify its importance in the predictive model.
3) User domains: Short URLs can be generated under users' own domains or under public domains provided by short-URL companies, such as Bitly or Ownly. We try to find whether the special domains of short URLs have a significant impact on the click number by exploring the feature of user domains. To compute this feature, we need to learn whether a domain is a special one or a public one. To this end, we identify public domains by checking whether their short URLs expand to long URLs with multiple domains, i.e., the domain of a short URL is regarded as public if the corresponding long URLs are directed to multiple domains. As a result, two domains, bit.ly and j.mp, are identified as public ones and the others as special ones. Consequently, based on the dominant domains of the short URLs published by users, we can classify users into two groups: one with special domains and another with public domains.
B. Features of User Behavior
Compared with the features of user properties, the features of user behavior mainly describe characteristics which can be controlled arbitrarily by users. For example, the type of tweets and the published time of tweets can be easily changed by users. Therefore, analyzing which features have predictive power for user influence is important for users who want to adapt their behavior.
1) Type of tweets: Twitter provides different marks to enhance the contents of tweets, such as hashtags and mentions. The hashtag, whose format is #keyword, marks keywords or topics of the tweet for convenience of searching or categorizing messages. The mention, whose format is @username, is a kind of conversation between Twitter users. The tweets
might include only hashtags, only mentions, or both. Correspondingly, the tweets can be called hashtag tweets, mention tweets, or hashtag and mention tweets. To conduct prediction, for each kind of tweet, the ratio between the number of such tweets and the number of all tweets is calculated as a feature. The number of all the tweets is also computed as a feature.
Figure 2. The number of URLs against time.
2) Published time: To exploit the characteristics of the published time of tweets, we first analyze the number of URLs published in different hours, as shown in Fig. 2. The y-axis is the average number of URLs. The figure clearly shows that more tweets are published in the day and fewer in the night. Because of the significant difference between day and night, for each user we divide his tweets into two groups: tweets in day time and tweets in night time. For convenience, we simply regard the day time as from 8 AM to 7 PM and the remaining hours as the night time.
By intuition, there are two time-related properties of URLs that might impact the click number: the amount of URLs and the intervals between the published times of URLs. For example, if a user publishes too many tweets in a short time, the large amount of information will be beyond the receptive ability of his audience, and a part of the URLs will be skipped. Hence, we introduce a comprehensive variable, time entropy, to measure the number of URLs and the intervals between the published times of URLs as a feature of our predictive model. For user i, his time entropy $E_i$ is defined as below:
$$E_i = -\frac{1}{\ln M} \sum_{h=1}^{M} \frac{d_{ih}}{d_i} \ln \frac{d_{ih}}{d_i} \tag{2}$$
where $d_{ih}$ refers to the number of URLs published during the h-th hour by user i, $d_i$ is the total number of URLs published by user i, and M is the total number of hours. For a user, if her tweets are published only in one hour, the time entropy will be 0; while if her tweets are published uniformly over the M hours, the time entropy will be 1. Hence, higher entropy denotes that users have lower inter-tweet delays and tend to publish tweets regularly. We compute the entropies for day time and night time as two features.
C. Features of Topics
Here we exploit the features related to the topics. We want to measure whether the topic category and topic distribution in the tweets are important for impacting user influence. Thus these features include two values: user topic category and topic entropy.
1) User topic category: The role of content is generally analyzed in the work related to influence in online social networks. For example, Cha et al. [3] study the propagation of three popular topics in 2009 in Twitter, and find that the most influential users can hold significant influence over a variety of topics. Hofman et al. [5] use humans to classify the content of a sample of 1000 URLs and find that the content features are not informative in predicting influence in Twitter. However, we use a different standard, the accurate click number, to represent user influence. Besides, we classify users automatically into different topic categories. Therefore, here we exploit the feature of user topics for the prediction to measure the correlation between the accurate click number and the content of tweets.
To this end, we first need to classify users into different topic categories. In Twitter, organizations and individuals tend to create multiple accounts for publishing different contents. For example, there are more than 30 accounts for the Washington Post [19]. Thus, the accounts in Twitter can be divided into different topic categories. The method of classification is mainly based on the Twitter list. Because the names and descriptions of Twitter lists provide valuable semantic cues to the experts' domain of expertise [20], the names of lists can be used to classify users. Specifically, if a user is frequently added into lists with similar names, it is put in the category related to these names. For example, if the New York Times is often added into lists with the name News, it is put in the News category. To compute the most frequent names of lists including a user, we define the dominant frequency of list names for user i, $R_i$, as below:
$$R_i = \max_m L_{im} \tag{3}$$
where m ranges over the names of the lists that include user i, and $L_{im}$ is the number of lists with the m-th name. Correspondingly, the dominant name refers to the name of the lists with the dominant frequency.
The classification procedure has two steps. First, we clean the data by removing users who are included in fewer than 10 lists and explore the stems of frequent names of lists. Second, we select nine categories based on the data we downloaded: Tech, News, Music, Sports, Food, Politics, Education, Health and Travel, and compare the dominant names of the lists which include the users with these nine categories. The users whose dominant list names match one of the nine categories are put into the corresponding category; otherwise they are excluded from any category.
2) Topic distribution: The topics may have a wide range in the tweets for some users while narrow for others. Even if multiple users belong to one topic category, their topic ranges might still be different. For example, for two users from the news category, one has a wider range of topics if he publishes tweets including international news, domestic news and technology news, while another has a comparatively narrower one if he only publishes domestic news. To
levels of significance of less than 0.0001, except for three features: domain, mention and mentionHashtag. These three features have high significance levels (large p-values), i.e., they have a weak relationship with user influence. It should be noted that the sign of a coefficient in the regression might not indicate a positive or negative correlation between the features (independent variables) and the categories of user influence (dependent variable), because some features are highly correlated with each other. For example, the number of followers has a negative coefficient in the regression analysis because it is highly correlated with the number of active followers. Indeed, when we remove the number of active followers from the regression, the coefficient of the number of followers becomes positive and this feature still remains significant.

TABLE IV. THE RESULTS OF REGRESSION ANALYSIS
Set         Name             Estimate    Significance
Properties  Followers        -4.15E-05   4.03E-47***
Properties  Friends          -4.93E-05   1.33E-07***
Properties  Lists            1.36E-03    1.90E-17***
Properties  ActiveFollowers  1.95E-04    1.62E-31***
Properties  Domains          1.85E-01    5.81E-02
Behavior    Tweets           3.99E-04    1.14E-05***
Behavior    Mentions         -2.42E-01   1.82E-01
Behavior    Hashtags         -2.15E-01   9.34E-03**
Behavior    MentionHashtags  -5.46E-01   1.21E-01
Behavior    DayEntropy       6.07E+00    2.51E-64***
Behavior    NightEntropy     -1.63E+00   1.73E-19***
Topics      Categories       1.85E-01    3.43E-24***
Topics      TopicEntropy     -1.11E+00   3.38E-06***
Significant at the: *** 0.001, ** 0.01, or * 0.05 level.

The results of the regression analysis thus provide strong evidence that most of the characteristics from user properties, behavior and topics affect user influence. The features we defined, such as the day entropy, user topic category and topic entropy, have significant predictive power for the influence.
C. Influence Prediction
To identify the better model, we select four widely used methods: Support Vector Machine (SVM) classification, J48 decision tree, Naive Bayes, and Bagging. For the SVM model, we use LIBSVM [24], an integrated software package that implements SVMs for classification. For this model, we use the popular e-SVR algorithm with a Radial Basis Function (RBF) kernel. For the other three models, we use Weka [25], a collection of machine learning algorithms for data mining tasks, to perform the experiments. All methods are employed with 10-fold cross-validation. Their performance is evaluated by comparing the accuracy and F-score. The accuracy is the proportion of true results in the population. Let tp, fp, fn and tn denote the true positive, false positive, false negative and true negative counts, respectively; the accuracy can then be computed as below:
$$\text{accuracy} = \frac{tp + tn}{tp + fp + fn + tn} \tag{6}$$
The F-score combines Recall and Precision with an equal weight, in the following form:
$$\text{F-score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{7}$$
where Precision represents the proportion of true positives among all the positive results, and Recall represents the proportion of true positives among the sum of true positives and false negatives. Both are computed as below:
$$\text{Precision} = \frac{tp}{tp + fp} \tag{8}$$
$$\text{Recall} = \frac{tp}{tp + fn} \tag{9}$$

TABLE V. THE PREDICTIVE RESULTS
Method              Accuracy (%)  F-score (%)
SVM                 77.10         74.16
J48 Decision Trees  80.61         80.01
Naive Bayes         75.99         76.86
Bagging             82.55         81.96

The results are shown in Table V. We can see that the choice of method has a modest impact on the predictive performance. For example, the difference between the best accuracy and the worst one is around 6%. The Bagging model achieves the best performance in both accuracy and F-score, and the overall accuracy reaches more than 82% in determining whether a user will belong to a low-influence, medium-influence, high-influence, or extra-high-influence group.
VI. CONCLUSIONS
In this paper, we predicted user influence based on the standard of the accurate click number of URLs. We first exploited a wide range of possible features consisting of the sets of user properties, behavior and topics. These features not only include the basic properties, such as the number of followers, friends and lists, but also include our defined features, such as the entropies of published time and topics. We then defined four categories based on the click number to represent the levels of user influence. After that, we conducted logistic regression analysis to identify whether the features have predictive power for user influence, and found that most of the features, such as the number of followers, number of tweets, time entropy, topic category and topic entropy, have significant predictive power. Finally, by using four models (SVM, J48 Decision Trees, Naive Bayes and Bagging), we predicted the levels of user influence, and found that the choice of model has a modest impact on the predictive performance and that the Bagging model achieves the best result, with an overall accuracy of more than 82% in determining whether a user will belong to a low-influence, medium-influence, high-influence, or extra-high-influence group.
ACKNOWLEDGMENT
This work is supported by the National Science Foundation of China (NSFC), Grant No. 61272527.
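As a supplementary note on the evaluation measures of Eqs. (6) to (9), the following sketch computes accuracy, Precision, Recall and F-score from the four confusion counts of a single class. It rests on assumptions: the function name `binary_metrics` and the example counts are hypothetical, and the paper does not state how per-class values are combined across the four influence categories, so only the one-class case is shown.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F-score from counts.

    Follows Eqs. (6)-(9) for one class treated as "positive".
    Zero denominators yield 0.0; that convention is an assumption
    made here, not something specified in the paper.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


# Hypothetical counts, chosen only to illustrate the formulas.
m = binary_metrics(tp=40, fp=10, fn=10, tn=40)
```

With these illustrative counts, all four measures come out at 0.8, since Precision and Recall are equal and the F-score is their harmonic mean.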