
Predicting User Influence in Social Media

Article in Journal of Networks · October 2013

DOI: 10.4304/jnw.8.11.2649-2655


JOURNAL OF NETWORKS, VOL. 8, NO. 11, NOVEMBER 2013 2649


Chunjing Xiao 1,*, Yuhong Zhang 2, Xue Zeng 1, and Yue Wu 1
1. School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China
2. College of Information Science and Engineering, Henan University of Technology, Zhengzhou, China
*Corresponding author

Abstract: Understanding influence plays a vital role in enhancing business operations and improving the effect of information propagation. User influence in social media such as Twitter has therefore been widely studied based on different standards, such as the number of followers, the number of retweets and so on. However, little work considers the accurate click number of short URLs as the measurement of influence. In Twitter, short URLs are frequently included in tweets because of the character limit, and some users may focus more on the click number of their URLs than on the number of followers or retweets. Thus, it is necessary to analyze the factors that impact the click number received by users' URLs. In this paper, we conduct predictive analyses of user influence as measured by the click number of short URLs. We first exploit a wide range of possible features consisting of the sets of user properties, behavior and topics. We then employ logistic regression analysis to identify the significant features for predicting user influence, and find that most of the features we propose have significant predictive power for user influence. Finally, based on large-scale Twitter data, four models are used for the prediction, and the Bagging model achieves the best result, an overall accuracy of more than 82%.

Index Terms: Twitter; Influence; Web Traffic; Predict

I. INTRODUCTION

Social media such as Twitter and Facebook have become an important platform for publishing and receiving information, which is changing the way people communicate and share knowledge. Users of these systems post and discuss millions of news items, opinions, and reviews to promote them. Correspondingly, influence, which has long been studied in the fields of sociology, communication, marketing and political science [1, 2], also receives much attention in social media, because understanding influence can provide insights into why certain information propagates faster and how to improve the effect of content diffusion.

Currently, user influence has been analyzed from different aspects based on different standards. For example, Cha et al. [3] present a comparison of three measures of influence: the number of followers, retweets and mentions in Twitter. Kwak et al. [4] analyze user influence based on another three standards: the number of followers, PageRank and retweets. Subsequently, the amount of extended retweets is used as the standard for predicting the influence of Twitter users [5], and for analyzing word-of-mouth information propagation in Twitter [6]. However, none of these works considers the accurate click number of short URLs as the standard of influence. In fact, shortened URLs are frequently included in the content published by users because of the character limit on content, especially on Twitter, which limits a tweet to 140 characters, and many users presumably aim to attract web traffic through Twitter. In addition, in Twitter the number of retweets and the click number of URLs received by a user are disproportionate [7]. Therefore, it is necessary to understand user influence based on the standard of the click number of short URLs.

In this paper, we predict user influence based on the accurate click number of short URLs. Because of the importance of the click number, Antoniades et al. [8] compare the popular websites in Twitter, as measured by the click number received via Twitter, with those in Alexa.com. Romero et al. [9] use the global click number of URLs as the ground truth to evaluate their proposed algorithm for ranking users in Twitter, but, as they note, the global clicks are noisy because they also include clicks from sources outside Twitter, such as Facebook and forums.

In contrast to the existing work, we use the accurate click number as the standard of user influence. According to the click information provided by Bitly.com, there are three kinds of click number: accurate clicks, referring to the click number received by each URL of each user; domain clicks, referring to the click number received by any object in a domain (such as Twitter.com); and global clicks, referring to the sum of all the click numbers. The accurate click number is therefore a precise, noise-free measure in comparison with the domain clicks and global clicks.

Based on the standard of the accurate click number, we predict the influence of users in Twitter. We treat the prediction as a classification task by defining four categories to represent the levels of user influence. To conduct the prediction, we exploit a wide range of possible features consisting of the sets of user properties, behavior and topics. The set of user properties includes the basic properties of users, such as the number of followers, friends and lists, as well as the properties we

© 2013 ACADEMY PUBLISHER


defined, such as the number of active followers and the type of user domains. The set of behavior features is composed of the number of tweets and the entropies of the published time of URLs. The set of topics includes the topic category and topic entropy. After extracting these features, we first analyze the significance of each feature using logistic regression analysis and find that the majority of the features are statistically significant. Then, using multiple classification models, we predict the levels of user influence and achieve an overall accuracy of more than 82% with the Bagging model.

II. RELATED WORKS

Studies related to influence in online social networks range from ranking influential users [3, 4, 10, 11] and quantifying user influence [5, 9] to predicting the popularity of content [12-15]. Specifically, Kwak et al. [4] find that the ranking of users by the amount of retweets differs from the rankings by the number of followers and by PageRank in the follower network. Cha et al. [3] also compare user influence based on indegree (the number of followers), the number of retweets and the number of mentions, and demonstrate that popular users with high indegree are not necessarily influential in terms of spawning retweets or mentions. In addition, influential users are identified in Twitter by taking topical similarity and the link structure into account [10] and by using a modified k-shell decomposition algorithm [11].

Apart from ranking influential users, Hofman et al. [5] quantify the influence of general users based on the number of extended retweets. This standard, beyond official retweets, also includes the implicit propagation that occurs when a user shares a URL that has already been shared by one of his friends (followings) without necessarily citing the information source. Based on this measure, to predict user influence they explore features consisting of the numbers of followers, friends and tweets, the date of joining, and the past influence of users, including past total influence and local influence. Their predictive model, a regression tree, achieves relatively poor performance ($R^2 = 0.34$) without averaging predicted and actual values at the leaf nodes. Since the majority of users act as passive information consumers and do not forward content to the network, Romero et al. [9] develop an efficient algorithm to quantify the influence of all users in Twitter by taking passivity into account, and they use the global clicks of short URLs as the standard of influence, which is noisy, as they note.

Another body of work is the prediction of popularity in Twitter. Hong et al. [12] predict the popularity of tweets as measured by the number of future retweets. They define several categories to represent the volume of retweets and predict which category each tweet will belong to. Whether a tweet will be retweeted is studied in [13]. Based on a model with the passive-aggressive algorithm, the authors automatically predict retweets and find that the performance is dominated by social features, but that tweet features add a substantial boost. Artzi et al. [14] predict whether a message will elicit a user response in Twitter based on a discriminative model, and they explore various sources as features, such as the language used in the tweet and the user's social network and history. Bandari et al. [15] predict the popularity of news items in Twitter prior to their release, exploiting a multi-dimensional feature space derived from properties of the article.

However, differing from these studies, we use a different standard for user influence: the accurate click number of short URLs. Compared with the existing work, analyses and prediction based on the click number can provide insights for improving web traffic via social media. In addition, we explore different features for the prediction, such as the type of tweets, the entropy of the published time of URLs and so on.

III. DATA DESCRIPTION

As our goal is to predict user influence measured by the accurate click number of short URLs, the data for the experiments should mainly comprise the short URLs in tweets published by Twitter users and the accurate click number received by these short URLs.

To obtain these data, we first select the targeted users of Twitter. In particular, we select users who tend to publish tweets including short URLs, and these URLs should be hosted by Bitly, because Bitly short URLs make up about 50% of all the URLs in Twitter [8] and their accurate click number can be collected.

To this end, from more than 790 million tweets collected during June 2012 by the Twitter streaming APIs, which return roughly 10% of all public tweets, we extract around 46 million unique users. From these users, we select users satisfying the following conditions: (i) The language in the profile settings of the user is English, because English-speaking users are the most common in Twitter [16] and we are familiar with this language; (ii) The ratio between the number of tweets including Bitly URLs and the number of all tweets of the user is no less than 80%, because, for users with many Bitly URLs, influence can be represented more properly by the accurate click number of their URLs; (iii) The domain focus of the user is more than 80%, and the focused domain is the same as the domain of the website in the user's profile. Here the domain focus refers to the highest fraction of the number of URLs of a domain over the entire number of URLs, and is defined as below:

$$D_i = \frac{1}{V_i} \max_k v_{ik}. \tag{1}$$

where $V_i$ refers to the total number of URLs of user i, and $v_{ik}$ refers to the number of URLs with domain k of user i. We employ this selection because this kind of user is more likely to aim to attract web traffic via Twitter; (iv) The user publishes at least one URL per day on average, because too few URLs would skew the results. As a result, 32,942 users are selected as our targeted users.
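
As an illustration of selection conditions (i)-(iv) and of the domain focus in Eq. (1), the following Python sketch computes $D_i$ for one user and applies the four thresholds stated above. It is only a sketch under stated assumptions, not the authors' implementation: the function and parameter names (`domain_focus`, `meets_selection`, `profile_url`, `days`) are hypothetical, the URLs are assumed to be already expanded long URLs, and a leading `www.` is assumed to be irrelevant when comparing domains.

```python
from collections import Counter
from urllib.parse import urlparse


def _domain(url):
    """Host name of a URL, lower-cased, without a leading 'www.' (assumed normalisation)."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def domain_focus(urls):
    """Return (dominant_domain, D_i) with D_i = max_k v_ik / V_i, as in Eq. (1)."""
    domains = [_domain(u) for u in urls]
    if not domains:
        return None, 0.0
    counts = Counter(domains)                  # v_ik for every domain k
    domain, top = counts.most_common(1)[0]     # domain with the largest v_ik
    return domain, top / len(domains)          # V_i is the total number of URLs


def meets_selection(language, n_tweets, n_bitly_tweets, urls, profile_url, days):
    """Check conditions (i)-(iv); all argument names are hypothetical."""
    # (i) profile language is English
    if language != "en":
        return False
    # (ii) at least 80% of the user's tweets contain a Bitly URL
    if n_tweets == 0 or n_bitly_tweets / n_tweets < 0.8:
        return False
    # (iii) domain focus above 80% and focused domain equal to the profile website's domain
    domain, focus = domain_focus(urls)
    if focus <= 0.8 or domain != _domain(profile_url):
        return False
    # (iv) at least one URL per day on average over the observation window
    return len(urls) / days >= 1.0
```

For a user whose URLs point nine times to one site and once to another, `domain_focus` returns that site with a focus of 0.9, so condition (iii) holds whenever the profile website is on the same domain.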


Then, for these selected users, we download their profiles, followers and the lists that include them through the Twitter APIs. There are more than 194.69 million follower links and 4.33 million list links. We also download their tweets during June through the Twitter APIs, around 9.13 million. Among them, approximately 8.57 million tweets include a short URL, and the click information of these URLs is downloaded through the Bitly APIs. The detailed information is presented in Table I.

TABLE I. TWITTER DATA DESCRIPTION

Number of users                     32,942
Number of tweets                    9,135,996
Number of tweets with Bitly URLs    8,574,672
Number of follower links            194,693,901
Number of list links                4,337,344

IV. FEATURES ENGINEERING

Here we introduce the features that will be used to predict user influence. We try to explore a wide range of possible features which help determine the attributes related to user influence. The features consist of the sets of user properties, user behavior and topics.

A. Features of User Properties

We first describe the features of user properties. Based on the user information we can collect, metadata such as the number of followers, friends and lists are extracted as features. In addition, we exploit related information to further describe user characteristics, such as the number of active followers and user domains.

1) Number of followers: Followers of a user are the people who receive all the updates or messages published by this user. The number of followers of a user directly indicates the size of the audience for this user, and is therefore frequently used to measure user influence [3, 4]. Hence, this number is extracted as a feature to predict user influence in this paper. Correspondingly, another basic property of users, the number of friends, which can reflect social capital, is also used as a feature.

In addition, many accounts have been suspended for spamming or other reasons [17], and some users presumably register multiple accounts and use only one of them, or stop using Twitter. We therefore also compute the number of active followers as a feature for prediction. To this end, we need to identify whether a user is active. In general, the owners of online social networks, such as Twitter or Facebook, regard users as active if they log in at least once a month [18]. However, since we cannot obtain login information, we regard users as active if they have published at least one tweet in the most recent two months. After collecting the recent tweets of each follower through the Twitter APIs, we can compute the number of active followers for each user.

2) Number of Lists: The Twitter List, launched in November 2009, is an official functionality for grouping sets of users into topical or other categories, and it aims to help users organize the people they follow. If a user has been added into more lists, this user receives more attention from others. Therefore, the number of lists including a user should, to some extent, reflect the popularity of this user.

We also analyze the correlation between the number of lists and the click number of URLs of users in Fig. 1. The x-axis is the number of lists into which the users have been added, and the y-axis is the average click number of URLs published by the users. From this figure, we can see that the correlation exhibits some linear characteristics, and the linear correlation coefficient is 0.7337. This indicates that the number of lists cannot accurately reflect the click number, but that a certain linear relationship exists between them. Hence, we use the number of lists as a feature in order to identify its importance in the predictive model.

Figure 1. Correlation between the list number and click number.

3) User domains: Short URLs can be generated under users' own domains or under public domains provided by short-URL companies, such as Bitly or Ownly. By exploring the feature of user domains, we try to find whether special domains of short URLs have a significant impact on the click number. To compute this feature, we need to learn whether a domain is a special one or a public one. To this end, we identify public domains by checking whether their short URLs expand to long URLs with multiple domains, i.e., the domain of a short URL is regarded as public if the corresponding long URLs are directed to multiple domains. As a result, two domains, bit.ly and j.mp, are identified as public ones and the others as special ones. Consequently, based on the dominant domain of the short URLs published by each user, we can classify users into two groups: one with special domains and another with public domains.

B. Features of User Behavior

Compared with the features of user properties, the features of user behavior mainly describe characteristics which users can control freely. For example, the type of tweets and the published time of tweets can easily be changed by users. Therefore, analyzing which features have predictive power for user influence is important for users who want to adapt their behavior.

1) Type of tweets: Twitter provides different marks to enhance the content of tweets, such as hashtags and mentions. The hashtag, whose format is #keyword, marks keywords or topics of the tweet for convenience of searching or categorizing messages. The mention, whose format is @username, is a kind of conversation between users of Twitter. The tweets


might include only hashtags, only mentions, or both. Correspondingly, the tweets can be called hashtag tweets, mention tweets, or hashtag and mention tweets. To conduct prediction, for each kind of tweet, the ratio between the number of such tweets and the number of all the tweets is calculated as a feature. The number of all the tweets is also computed as a feature.

2) Published time: To exploit the characteristics of the published time of tweets, we first analyze the number of URLs published in different hours, as shown in Fig. 2. The y-axis is the average number of URLs. The figure clearly shows that more tweets are published in the day and fewer at night. Because of the significant difference between day and night, for each user we divide his tweets into two groups: tweets in day time and tweets in night time. For convenience, we simply regard the day time as from 8 AM to 7 PM and the remaining hours as the night time.

Figure 2. The number of URLs against time.

Intuitively, there are two properties related to the time of URLs that might impact the click number: the amount of URLs and the intervals between the published times of URLs. For example, if a user publishes too many tweets in a short time, the large amount of information will exceed the receptive ability of his audience, and a part of the URLs will be skipped. Hence, we introduce a comprehensive variable, time entropy, to measure the number of URLs and the intervals between the published times of URLs as a feature of our predictive model. For user i, his time entropy $E_i$ is defined as below:

$$E_i = -\frac{1}{\ln M} \sum_{h=1}^{M} \frac{d_{ih}}{d_i} \ln \frac{d_{ih}}{d_i}. \tag{2}$$

where $d_{ih}$ refers to the number of URLs published during the h-th hour by user i, $d_i$ is the total number of URLs published by user i, and M is the total number of hours. For a user, if her tweets are published only in one hour, the time entropy will be 0; while if her tweets are published uniformly over the M hours, the time entropy will be 1. Hence, higher entropy denotes that users have lower inter-tweet delays and tend to publish tweets regularly. We compute the entropies for day time and night time as two features.

C. Features of Topics

Here we exploit the features related to topics. We want to measure whether the topic category and the topic distribution in the tweets are important for user influence. Thus these features include two values: user topic category and topic entropy.

1) User topic category: The role of content is generally analyzed in work related to influence in online social networks. For example, Cha et al. [3] study the propagation of three popular topics in 2009 in Twitter, and find that the most influential users can hold significant influence over a variety of topics. Hofman et al. [5] use humans to classify the content of a sample of 1000 URLs and find that the content features are not informative in predicting influence in Twitter. However, we use a different standard, the accurate click number, to represent user influence. Besides, we classify users automatically into different topic categories. Therefore, here we exploit the feature of user topics for the prediction to measure the correlation between the accurate click number and the content of tweets.

To this end, we first need to classify users into different topic categories. In Twitter, organizations and individuals tend to create multiple accounts for publishing different content. For example, there are more than 30 accounts for the Washington Post [19]. Thus, the accounts in Twitter can be divided into different topic categories. The classification method is mainly based on Twitter lists. Because the names and descriptions of Twitter lists provide valuable semantic cues to the experts' domain of expertise [20], the names of lists can be used to classify users. Specifically, if a user is frequently added into lists with similar names, it is put in the category related to these names. For example, if the New York Times is often added into lists with the name News, it is put in the News category. To compute the most frequent names of the lists including a user, we define the dominant frequency of list names for user i, $R_i$, as below:

$$R_i = \max_m L_{im}. \tag{3}$$

where m indexes the distinct names of the lists that include user i, and $L_{im}$ is the number of lists with the m-th name. Correspondingly, the dominant name refers to the name of the lists with the dominant frequency.

Following the steps of the classification procedure, we first clean the data by removing users who are included in fewer than 10 lists and explore the stems of frequent names of lists. Second, we select nine categories based on the data we downloaded: Tech, News, Music, Sports, Food, Politics, Education, Health and Travel, and compare the dominant names of the lists which include the users with these nine categories. The users whose dominant list names match one of the nine categories are put into the corresponding category; otherwise they are excluded from any category.

2) Topic distribution: The topics in the tweets may cover a wide range for some users and a narrow range for others. Even if multiple users belong to one topic category, their topic ranges might still be different. For example, of two users from the news category, one has a wider range of topics if he publishes tweets including international news, domestic news and technology news, while another has a comparatively narrower one if he only publishes domestic news. To


measure the breadth of topics, we use Latent Dirichlet Allocation (LDA) [21] to compute the topic distribution of users. LDA is an unsupervised generative probabilistic machine learning model, which identifies latent topic information in large collections of data including text corpora, and has been widely used to explore interests and topics [22, 23]. For a corpus of M documents, LDA assumes that documents are generated from a set of N latent topics. In a document, each word $w_i$ is associated with a hidden variable $z_i \in \{1, \ldots, N\}$ indicating the topic from which $w_i$ is generated. The probability of word $w_i$ is expressed as:

$$P(w_i) = \sum_{j=1}^{N} P(w_i \mid z_i = j)\, P(z_i = j). \tag{4}$$

where $\beta = P(w_i \mid z_i = j)$ is the probability of word $w_i$ in topic j and $\theta = P(z_i = j)$ is a document-specific mixture weight indicating the proportion of topic j. LDA treats the multinomial parameters β and θ as latent random variables sampled from a Dirichlet prior with hyperparameters α and η respectively.

To compute the topic distribution, for each user, we first merge all its tweets into one document. Similar to [22], we remove the 570 stop-words and the terms that cannot be found in the Wikipedia dataset, as well as the terms appearing in fewer than 10 tweets. Second, based on the document, LDA generates the topic distribution, which shows the percentage that each topic takes up among all the topics (here we set the number of topics N = 100). Necessarily, the percentages of all the topics in a document sum to 100%. Based on the topic distribution, we can define the topic entropy $T_i$ of user i as below:

$$T_i = -\frac{1}{\ln N} \sum_{k=1}^{N} \frac{c_{ik}}{c_i} \ln \frac{c_{ik}}{c_i}. \tag{5}$$

where $c_{ik}$ refers to the percentage of topic k for user i, $c_i$ is the sum of the percentages of all the topics of user i (here $c_i$ is always 1), and N is the number of topics, 100. For a user, if its tweets only focus on one topic, its topic entropy will be 0; while if its tweets cover all the 100 topics uniformly, its topic entropy will be 1. Hence, higher entropy denotes a wider range of topics in the tweets of users.

D. The Summary of Features

We summarize all the features we proposed for the prediction in Table II. The table presents the abbreviated name and the description of each feature. Next we will show whether these features are significant for predicting user influence and how the predictive models perform.

TABLE II. THE COMPLETE LIST OF FEATURES

Set         Name             Description
Properties  Followers        Number of followers
Properties  Friends          Number of friends
Properties  Lists            Number of lists including this user
Properties  ActiveFollowers  Number of active followers
Properties  Domains          Type of user domain
Behavior    Tweets           Number of tweets
Behavior    Mentions         Rate of number of tweets including mentions
Behavior    Hashtags         Rate of number of tweets including hashtags
Behavior    MentionHashtags  Rate of number of tweets including both mentions and hashtags
Behavior    DayEntropy       Time entropy of day time
Behavior    NightEntropy     Time entropy of night time
Topics      Categories       Topic category of users
Topics      TopicEntropy     Topic entropy indicating the breadth of topic distribution in tweets of the user

V. REGRESSION ANALYSES AND PREDICTION

Here we first present the logistic regression analysis to show the correlation between the features we proposed and user influence, and then we conduct the predictions via four models: Support Vector Machine, J48 decision tree, Naive Bayes, and Bagging.

A. Experiment Setup

We further filter the users selected in Section III for the experiments, because some of them lack the data needed to compute certain features. For example, because some users have no time zone information in their profiles, we cannot convert the published time of their tweets into local time to identify the period of day time or night time, and therefore cannot compute the time entropy feature. We filter out these kinds of users, and as a result, 11,025 users are selected for the experiments.

For these selected users, influence is measured by the average click number per day received by their URLs. Rather than predicting the exact click number, we define several categories to represent the levels of influence and predict which category each user will belong to. The reason is that these categories give a clear picture of the levels of user influence, and this approach lets us compute the predictive accuracy, which describes the performance of the prediction more plainly. Specifically, we divide the users into four categories depending on the average click number per day, and the classification results are shown in Table III.

TABLE III. THE CATEGORIES OF USERS

Name  Range of click number  Number of users
1     0 - 10                 7,593
2     10 - 100               2,414
3     100 - 1000             835
4     1000 +                 183

B. Regression Analysis

Before conducting the prediction, we first explore the correlation between the features and the influence, i.e., whether the features we proposed have predictive power for user influence and how significant the features are for the prediction. To this end, we employ logistic regression analysis using all the features to predict the categories of influence. Table IV presents the results of the regression analysis.

Apparently most of the features we proposed have a significant predictive power to the user influence with


levels of significance of less than 0.0001, except for the three features: Domains, Mentions and MentionHashtags. These three features have high significance levels (large p-values), i.e., they have a weak relationship with user influence. It should be noted that the sign of a coefficient in the regression might not indicate a positive or negative correlation between the features (independent variables) and the categories of user influence (dependent variable), because some features are highly correlated with each other. For example, the number of followers has a negative coefficient in the regression analysis because it is highly correlated with the number of active followers. Indeed, when we remove the number of active followers from the regression, the coefficient of the number of followers becomes positive and this feature still remains significant.

TABLE IV. THE RESULTS OF REGRESSION ANALYSIS

Set         Name             Estimate    Significance
Properties  Followers        -4.15E-05   4.03E-47***
Properties  Friends          -4.93E-05   1.33E-07***
Properties  Lists            1.36E-03    1.90E-17***
Properties  ActiveFollowers  1.95E-04    1.62E-31***
Properties  Domains          1.85E-01    5.81E-02
Behavior    Tweets           3.99E-04    1.14E-05***
Behavior    Mentions         -2.42E-01   1.82E-01
Behavior    Hashtags         -2.15E-01   9.34E-03**
Behavior    MentionHashtags  -5.46E-01   1.21E-01
Behavior    DayEntropy       6.07E+00    2.51E-64***
Behavior    NightEntropy     -1.63E+00   1.73E-19***
Topics      Categories       1.85E-01    3.43E-24***
Topics      TopicEntropy     -1.11E+00   3.38E-06***
Significant at the: *** 0.001, ** 0.01, or * 0.05 level.

The results of the regression analysis thus provide strong evidence that most of the characteristics from user properties, behavior and topics affect user influence. The features we defined, such as the day entropy, user topic category and topic entropy, have significant predictive power for the influence.

C. Influence Prediction

To identify the better model, we select four widely used methods: Support Vector Machine (SVM) classification, J48 decision tree, Naive Bayes, and Bagging. For the SVM model, we use LIBSVM [24], an integrated software package that implements SVM classification. For this model, we use the popular e-SVR algorithm with a Radial Basis Function (RBF) kernel. For the last three models, we use Weka [25], a collection of machine learning algorithms for data mining tasks, to perform the experiments. All methods are employed with 10-fold cross-validation, and their performances are evaluated by comparing the accuracy and F-score. The accuracy is the proportion of true results in the population. Let tp, fp, fn and tn denote the true positive, false positive, false negative and true negative counts, respectively; the accuracy can then be computed as below:

$$\text{accuracy} = \frac{tp + tn}{tp + fp + fn + tn}. \tag{6}$$

The F-score combines Recall and Precision with an equal weight:

$$F\text{-score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}. \tag{7}$$

where Precision represents the proportion of the true positives against all the positive results and Recall shows the proportion of the true positives against the true positive and false negative results. Both are computed as below:

$$\text{Precision} = \frac{tp}{tp + fp}. \tag{8}$$

$$\text{Recall} = \frac{tp}{tp + fn}. \tag{9}$$

TABLE V. THE PREDICTIVE RESULTS

Method              Accuracy (%)  F-score (%)
SVM                 77.10         74.16
J48 Decision Trees  80.61         80.01
Naive Bayes         75.99         76.86
Bagging             82.55         81.96

The results are shown in Table V. We can see that the choice of method has only a small impact on the predictive performance. For example, the difference between the best accuracy and the worst one is around 6%. The Bagging model achieves the best performance in both accuracy and F-score, and its overall accuracy reaches more than 82% in determining whether a user will belong to a low-influence, medium-influence, high-influence, or extra-high-influence group.

VI. CONCLUSIONS

In this paper, we predicted user influence based on the standard of the accurate click number of URLs. We first exploited a wide range of possible features consisting of the sets of user properties, behavior and topics. These features include not only the basic properties, such as the number of followers, friends and lists, but also the features we defined, such as the entropies of published time and topics. We then defined four categories based on the click number to represent the levels of user influence. After that, we conducted logistic regression analysis to identify whether the features have predictive power for user influence, and found that most of the features, such as the number of followers, number of tweets, time entropy, topic category and topic entropy, have significant predictive power. Finally, using four models (SVM, J48 Decision Trees, Naive Bayes and Bagging), we predicted the levels of user influence, and found that the choice of model has only a small impact on the predictive performance and that the Bagging model achieves the best result, with an overall accuracy of more than 82% in determining whether a user will belong to a low-influence, medium-influence, high-influence, or extra-high-influence group.

ACKNOWLEDGMENT

This work is supported by the National Science
weight, in the following form: Foundation of China (NSFC), Grant No. 61272527.
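As an illustration of the evaluation metrics used for influence prediction (accuracy, Precision, Recall and the equal-weight F-score), the following minimal Python sketch computes them from the four confusion counts. The function name `classification_metrics` and the example counts are hypothetical and chosen only to exercise the formulas; they are not taken from the experiments.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F-score from confusion counts."""
    total = tp + fp + fn + tn
    # Accuracy: share of correct predictions among all cases.
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: share of predicted positives that are truly positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of true positives that were found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-score: harmonic mean of precision and recall (equal weight).
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


# Hypothetical counts, not results from the paper.
metrics = classification_metrics(tp=40, fp=10, fn=10, tn=40)
```

For a four-class task such as the influence levels, these counts would typically be computed per class and then averaged; which averaging scheme the experiments used is not stated, so the sketch stays with the binary form of the definitions.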

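The 10-fold cross-validation protocol used to compare the classifiers can be sketched in plain Python as below. This is not the LIBSVM or Weka setup from the experiments: the helper names (`k_fold_indices`, `cross_validated_accuracy`), the majority-class stand-in classifier and the synthetic labels are all assumptions made for illustration. Any model exposing a fit/predict pair could be plugged in the same way.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_accuracy(features, labels, fit, predict, k=10, seed=0):
    """Average held-out accuracy over k folds for any fit/predict pair."""
    scores = []
    for held_out in k_fold_indices(len(labels), k, seed):
        held = set(held_out)
        train = [i for i in range(len(labels)) if i not in held]
        # Train on the other k-1 folds, then score on the held-out fold.
        model = fit([features[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(model, features[i]) == labels[i] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)


# Stand-in classifier (an assumption): always predict the most common
# training label. A real comparison would swap in SVM, a decision tree, etc.
def fit_majority(train_x, train_y):
    return Counter(train_y).most_common(1)[0][0]


def predict_majority(model, x):
    return model


# Synthetic data: 100 users across four influence levels (not real data).
labels = ["low"] * 70 + ["medium"] * 15 + ["high"] * 10 + ["extra-high"] * 5
features = [[float(i)] for i in range(len(labels))]
baseline = cross_validated_accuracy(features, labels, fit_majority, predict_majority)
```

A majority-class baseline like this one gives a useful floor: a classifier's cross-validated accuracy is only informative to the extent that it exceeds what always guessing the largest group would achieve.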

REFERENCES

[1] E. M. Rogers, Diffusion of Innovations. New York: Free Press, 1962.
[2] E. Katz and P. F. Lazarsfeld, Personal Influence: The Part Played by People in the Flow of Mass Communications. Free Press: New York, 1955.
[3] M. Cha, H. Haddadi, F. Benevenuto and K. Gummadi. "Measuring User Influence in Twitter: The Million Follower Fallacy", in International AAAI Conference on Weblogs and Social Media (ICWSM), 2010, pp. 10-17.
[4] H. Kwak, C. Lee, H. Park and S. Moon. "What is Twitter, a social network or a news media?", in Proceedings of the international conference on World Wide Web (WWW), 2010, pp. 591-600.
[5] E. Bakshy, W. A. Mason, J. M. Hofman and D. J. Watts. "Everyone's an influencer: Quantifying influence on twitter", in ACM International Conference on Web Search and Data Mining (WSDM), 2011, pp. 65-74.
[6] T. Rodrigues, F. Benevenuto, M. Cha, K. Gummadi and V. Almeida. "On word-of-mouth based discovery of the web", in ACM SIGCOMM conference on Internet Measurement Conference (IMC), 2011, pp. 381-396.
[7] Engaging News Hungry Audiences Tweet by Tweet: An audience analysis of prominent mainstream media news accounts on Twitter. http://blog.socialflow.com/post/7120243870.
[8] D. Antoniades, I. Polakis, G. Kontaxis, E. Athanasopoulos, S. Ioannidis, E. P. Markatos and T. Karagiannis. "we.b: the web of short urls", in Proceedings of the international conference on World Wide Web (WWW), 2011, pp. 715-724.
[9] D. M. Romero, W. Galuba, S. Asur and B. A. Huberman. "Influence and passivity in social media", in Proceedings of the international conference on World Wide Web (WWW), 2011, pp. 113-114.
[10] J. Weng, E.-P. Lim, J. Jiang and Q. He. "TwitterRank: finding topic-sensitive influential twitterers", in ACM international conference on Web search and data mining (WSDM), 2010, pp. 261-270.
[11] P. E. Brown and J. Feng. "Measuring User Influence on Twitter Using Modified K-Shell Decomposition", in International AAAI Conference on Weblogs and Social Media (ICWSM), 2011, pp. 18-23.
[12] L. Hong, O. Dan and B. D. Davison. "Predicting popular messages in Twitter", in Proceedings of the international conference on World Wide Web (WWW), 2011, pp. 57-58.
[13] S. Petrovic, M. Osborne and V. Lavrenko. "RT to Win! Predicting Message Propagation in Twitter", in International AAAI Conference on Weblogs and Social Media (ICWSM), 2011, pp. 586-589.
[14] Y. Artzi, P. Pantel and M. Gamon. "Predicting responses to microblog posts", in Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2012, pp. 602-606.
[15] R. Bandari, S. Asur and B. A. Huberman. "The Pulse of News in Social Media: Forecasting Popularity", in International AAAI Conference on Weblogs and Social Media (ICWSM), 2012.
[16] An Exhaustive Study of Twitter Users Across the World - Beevolve, Social Media Analytics Platform. http://www.beevolve.com/twitter-statistics/.
[17] K. Thomas, C. Grier, D. Song and V. Paxson. "Suspended accounts in retrospect: an analysis of twitter spam", in ACM SIGCOMM conference on Internet Measurement Conference (IMC), 2011, pp. 243-258.
[18] Twitter Announces 100 Million Active Users. http://www.mediabistro.com/alltwitter/twitter_active_users_b13510.
[19] Washingtonpost.com on Twitter. http://www.washingtonpost.com/twitter.
[20] S. Ghosh, N. Sharma, F. Benevenuto, N. Ganguly and K. Gummadi. "Cognos: crowdsourcing search for topic experts in microblogs", in ACM SIGIR conference on Research and development in information retrieval, 2012, pp. 575-590.
[21] D. M. Blei, A. Y. Ng and M. I. Jordan, "Latent dirichlet allocation", The Journal of Machine Learning Research, vol. 3, no. 4, pp. 993-1022, 2003.
[22] Z. Xu, R. Lu, L. Xiang and Q. Yang. "Discovering user interest on twitter with a modified author-topic model", in IEEE/WIC/ACM International Conference on Web Intelligence, 2011, pp. 422-429.
[23] J. Sang and C. Xu, "Faceted subtopic retrieval: Exploiting the topic hierarchy via a multi-modal framework", Journal of Multimedia, vol. 7, no. 1, pp. 9-20, 2012.
[24] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines", ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, pp. 1-27, 2011.
[25] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann and I. H. Witten, "The WEKA data mining software: an update", ACM SIGKDD Explorations Newsletter, vol. 11, no. 1, pp. 10-18, 2009.
