2016 - Elsevier - Neurocomputing - Identification of Influential Nodes in Social Networks With Community Structure Based On Label Propagation

This document summarizes a research paper that proposes a new algorithm for identifying influential nodes in social networks with community structure. The algorithm is based on label propagation and can find the core nodes of different communities through the label propagation process. It has low time complexity, making it applicable to large-scale networks. Experiments on synthetic and real-world networks show the effectiveness and efficiency of the proposed algorithm compared to other influence maximization methods.

Neurocomputing 210 (2016) 34–44

Contents lists available at ScienceDirect

Neurocomputing
journal homepage: www.elsevier.com/locate/neucom

Identification of influential nodes in social networks with community structure based on label propagation
Yuxin Zhao a,c,*, Shenghong Li a,b, Feng Jin c
a Department of Electronic Engineering, Shanghai Jiao Tong University, 800 Dong Chuan Road, Shanghai 200240, China
b School of Information Security Engineering, Shanghai Jiao Tong University, 800 Dong Chuan Road, Shanghai 200240, China
c IBM China Research Laboratory, 399 Ke Yuan Road, Shanghai 201203, China

ARTICLE INFO

Article history:
Received 14 April 2015
Received in revised form 4 November 2015
Accepted 9 November 2015
Available online 11 June 2016

Keywords:
Social network
Influential node
Community structure
Label propagation

ABSTRACT

A social network is an abstract representation of a social system in which ideas and information propagate through the interactions between individuals. An essential issue is to find a set of the most influential individuals in a social network so that they can spread influence to the largest range on the network. Traditional methods for identifying influential nodes in networks are based on greedy algorithms or specific centrality measures. Some recent studies have shown that community structure, which is a common and important topological property of social networks, has a significant effect on the dynamics of networks. However, most influence maximization methods do not take into consideration the community structure in the network, which limits their application to social networks with community structure. In this paper, we propose a new algorithm for identifying influential nodes in social networks with community structure based on label propagation. The proposed algorithm can find the core nodes of different communities in the network through the label propagation process. Moreover, our algorithm has low time complexity, which makes it applicable to large-scale networks. Extensive experiments on both synthetic and real-world networks under common diffusion models demonstrate the effectiveness and efficiency of our proposed algorithm.
© 2016 Elsevier B.V. All rights reserved.

* Corresponding author at: Department of Electronic Engineering, Shanghai Jiao Tong University, 800 Dong Chuan Road, Shanghai 200240, China.
E-mail address: [email protected] (Y. Zhao).

http://dx.doi.org/10.1016/j.neucom.2015.11.125
0925-2312/© 2016 Elsevier B.V. All rights reserved.

1. Introduction

A social network is an abstract representation of a real-world social system consisting of a large number of individuals and the relationships between them [1–3]. The nature of a social network lies in the interactions between different individuals, which lead to the spread of ideas, information and influence in the network [4]. With the increasing popularity of online social networks, such as Facebook, Twitter, MicroBlog and WeChat, conducting product promotions through social influence among individual circles of friends and families has become an effective and promising marketing strategy [5]. A motivating application is viral marketing, which aims to select a small number of influential users to adopt a product and subsequently trigger a large cascade of further adoptions by utilizing the “word-of-mouth” effect in social networks.

Motivated by this background, an essential issue that has received considerable attention is the influence maximization problem, which aims to find a set of the most influential individuals in a social network so that they can spread influence to the largest number of nodes in the network [6–9]. Formally, the influence maximization problem can be described as follows: given a positive integer k, identify a node set containing k nodes that maximizes the influence effect under a specific diffusion model, where the influence effect is quantitatively measured by the expected number of influenced nodes during the whole spreading process [6].

In recent years, a number of methods for finding the influential nodes in networks have been proposed to solve the influence maximization problem. These influence maximization methods can be roughly classified into two categories: centrality-based algorithms and greedy algorithms. Centrality-based algorithms evaluate the centrality or importance of the nodes according to some topological measures and identify the nodes with the largest centrality as the influential nodes [10,11]. Degree centrality [12], betweenness centrality [13] and closeness centrality [14] are the most basic centrality measures. Many other, more complicated centrality measures [15–18] for identifying core nodes in the network have also been proposed from different perspectives. On the other hand, greedy algorithms formulate the influence maximization problem as a discrete optimization problem and use a greedy strategy to achieve
an approximately optimal solution. Kempe et al. [6] first came up with a greedy algorithm to solve the influence maximization problem and used Monte-Carlo simulations to estimate the influence scope of initial node sets. However, this greedy algorithm is extremely time consuming and only applicable to small networks. To reduce the computational complexity, Leskovec et al. [19] put forward the CELF (cost-effective lazy-forward) method, which avoids redundant calculations of influence scope by exploiting the submodularity property of influence spreading. Some other methods [20–24] estimate the influence scope using heuristic strategies instead of Monte-Carlo simulations, which can effectively improve the time efficiency.

Community structure is a common and important property of real-world networks [25]. A community can be generally described as a group of nodes with dense internal connections and relatively sparse connections to the nodes in other groups [26–28]. Community structure helps in understanding the function and organization of social networks, since communities often correspond to real social associations and organizations. Some recent studies have shown that community structure has an important effect on the spreading process in networks [29–31]. However, most existing influence maximization methods do not take into account the influence of community structure in the network, which limits their application to social networks with community structure.

In this paper, we propose an algorithm to identify influential nodes in social networks with community structure based on label propagation. Our main contributions are summarized as follows:

1. We successfully introduce the label propagation process to identify the influential nodes in social networks with community structure.
2. Our proposed algorithm is parameter-free and requires no prior information about the community structure. It also exhibits very low time complexity, which makes it applicable to large-scale networks.
3. The tests on both synthetic and real-world networks under common diffusion models demonstrate the effectiveness and efficiency of our algorithm.

The rest of the paper is organized as follows. Section 2 introduces the related work on the influence maximization problem. In Section 3, we introduce the basic concept of community structure in networks and the label propagation process. Our proposed method for identifying influential nodes in social networks with community structure is described in detail in Section 4. The experimental results and discussions are reported in Section 5. Finally, Section 6 gives the conclusion of this paper.

2. Influence maximization

The influence maximization problem is how to identify a node set containing k nodes that maximizes the influence effect when influence is propagated in the network under a specific diffusion model [6]. Here, we first introduce the diffusion models adopted in our paper, and then give a brief review of the existing influence maximization methods.

2.1. Diffusion model

The diffusion model is the key to simulating the actual spreading process of ideas and information in social networks. In our study, we employ two widely used diffusion models, namely the independent cascade model (IC model) and the linear threshold model (LT model). In both diffusion models, any node in the network can stay in one of two states: active and inactive. The active state indicates that the corresponding individual adopts the information, while the inactive state means that the individual does not accept the information. A node can convert from the inactive state to the active state under the influence of its neighbor nodes in the network, but cannot convert in the opposite direction. The spreading process starts from an initial set of active nodes and unfolds in discrete time steps.

Let a social network be represented by a graph $G = (V, E)$, where $V$ and $E = \{(u, v) \mid u, v \in V\}$ are the sets of nodes and edges in the network, respectively. Given the initial active node set $S \subset V$, the number of active nodes at the end of the spreading process is denoted by a variable $\varphi(S)$. The influence effect is measured by $\sigma(S)$, which is defined as the expected value of $\varphi(S)$. We call $\sigma(S)$ the influence scope of $S$. Since $\sigma(S)$ is very difficult to calculate precisely, Monte-Carlo simulations are used to estimate the influence scope $\sigma(S)$ in practice.

2.1.1. Independent cascade model

In the independent cascade model (IC model) [32,33], every link $(u, v)$ in the network is associated with a diffusion probability $p_{u,v}$, which indicates the probability that node u successfully activates node v. Any node in the network attempts to activate its inactive neighbors only at the time when it has just been activated. When the initial active node set S is given, the spreading process under the IC model unfolds according to the following rule. At each time step t, if node u was newly activated at time step $t - 1$, it makes one attempt to activate every inactive neighbor node v with probability $p_{u,v}$. If the attempt succeeds, node v becomes active at time step $t + 1$; otherwise, it stays inactive. No matter whether the attempt succeeds or not, node u can never try to activate other nodes at later time steps. When more than one newly activated node tries to activate the same node, these activation attempts are independent of each other and can proceed sequentially in an arbitrary order. The spreading process terminates when there are no more newly activated nodes in the network.

In our paper, due to the lack of information, the diffusion probabilities of different links in the IC model are set to a uniform value, $p_{u,v} = p, \ \forall (u, v) \in E$.

2.1.2. Linear threshold model

In the linear threshold model (LT model) [34,35], every link $(u, v)$ in the network is associated with a weight $w_{u,v}$, which indicates the influence that node u exerts on node v. The weights of links satisfy the constraint $\sum_{v \in N(u)} w_{u,v} \le 1, \ \forall u \in V$, where $N(u)$ is the set of neighbor nodes of node u in the network. Given an initial active node set S, the spreading process under the LT model unfolds according to the following rule. First, each node in the network is assigned a random threshold in the range [0, 1]. The threshold reflects the tendency of the node to convert to the active state, so that a node with a larger threshold is harder to activate. Then, at each time step t, each inactive node v in the network is activated only if the sum of the influence of all its active neighbor nodes reaches its threshold, i.e., $\sum_{u \in \Gamma(v)} w_{u,v} \ge \theta_v$, where $\Gamma(v)$ is the set of active neighbor nodes of node v and $\theta_v$ is the threshold for node v. This spreading process continues until no more activations are possible in the network.

2.2. Influence maximization methods

Most existing influence maximization methods can be roughly classified into two categories: centrality-based algorithms and greedy algorithms.
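The independent cascade rule of Section 2.1.1 and the Monte-Carlo estimate of the influence scope $\sigma(S)$ from Section 2.1 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the adjacency-list dictionary `adj`, the function names `simulate_ic` and `estimate_influence_scope`, and the default of 1000 runs are all hypothetical choices made for the example; the uniform probability `p` follows the setting described in the text.

```python
import random

def simulate_ic(adj, seeds, p, rng):
    """Run one independent cascade from `seeds` and return the final active set.

    adj  : dict mapping each node to a list of its neighbors (assumed layout)
    p    : uniform diffusion probability, as in the paper's setting
    rng  : a random.Random instance
    """
    active = set(seeds)
    frontier = list(seeds)              # nodes activated in the previous step
    while frontier:
        newly_active = []
        for u in frontier:
            for v in adj[u]:
                # u gets exactly one attempt on each still-inactive neighbor
                if v not in active and rng.random() < p:
                    active.add(v)
                    newly_active.append(v)
        frontier = newly_active         # only new nodes may attempt next step
    return active

def estimate_influence_scope(adj, seeds, p, runs=1000, seed=0):
    """Monte-Carlo estimate of sigma(S): mean final active count over `runs` cascades."""
    rng = random.Random(seed)
    total = sum(len(simulate_ic(adj, seeds, p, rng)) for _ in range(runs))
    return total / runs

# Toy path graph 0 - 1 - 2 (illustrative data only)
adj = {0: [1], 1: [0, 2], 2: [1]}
print(estimate_influence_scope(adj, [0], 1.0, runs=10))   # every attempt succeeds: 3.0
print(estimate_influence_scope(adj, [0], 0.0, runs=10))   # no attempt succeeds: 1.0
```

Because each node enters the frontier exactly once, a single cascade touches every edge at most twice, which is consistent with the O(m) cost per simulation quoted for the greedy algorithm.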
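The linear threshold rule of Section 2.1.2 can be sketched in the same spirit. Again this is a minimal sketch under assumptions rather than the paper's code: the nested dictionary `in_weights` (mapping each node v to the weights $w_{u,v}$ of its neighbors u) and the function name `simulate_lt` are hypothetical, thresholds are drawn uniformly at random once per run, and nodes are updated synchronously at each time step.

```python
import random

def simulate_lt(in_weights, seeds, rng):
    """Run one linear threshold spreading process and return the final active set.

    in_weights : dict, in_weights[v][u] = w_uv, the influence u exerts on v
                 (assumed layout; every node must appear as a key)
    seeds      : iterable of initially active nodes
    rng        : a random.Random instance used to draw the thresholds
    """
    # Each node receives one random threshold for the whole run.
    thresholds = {v: rng.random() for v in in_weights}
    active = set(seeds)
    while True:
        # Synchronous step: decide activations from the current active set.
        newly_active = [
            v for v in in_weights
            if v not in active
            and sum(w for u, w in in_weights[v].items() if u in active)
                >= thresholds[v]
        ]
        if not newly_active:            # no more activations are possible
            return active
        active.update(newly_active)

# Toy path graph 0 - 1 - 2 with illustrative weights of 0.5 on every link
in_weights = {0: {1: 0.5}, 1: {0: 0.5, 2: 0.5}, 2: {1: 0.5}}
print(simulate_lt(in_weights, [0, 2], random.Random(0)))
```

In the toy example both neighbors of node 1 are seeds, so its incoming influence sums to 1.0 and it is activated whatever threshold it draws.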

2.2.1. Centrality-based algorithm

The basic idea of a centrality-based algorithm is to evaluate the importance of nodes in the network using some centrality measure and to take the nodes with large centrality as the initial active nodes.

Degree centrality is the simplest centrality measure. It is believed that, in social networks with a broad degree distribution, the most connected people are the hubs for extensive influence spreading [12]. Although degree centrality has been shown to be related to influence scope to some extent, it is strongly biased since it does not consider the whole network topology. Betweenness centrality and closeness centrality are two other typical centrality measures for the influence maximization problem. Betweenness centrality [13] is defined as the number of shortest paths passing through the node, which reflects the interpersonal influence a person exerts on others in social network theory [10]. Closeness centrality [14] is defined as the reciprocal of the sum of geodesic distances from one node to all the other nodes in the network, which is a measure of how long it takes to spread information from a given node to other reachable nodes in the network.

Chen et al. [15] proposed the local centrality, which is an extension of degree centrality to the local neighborhood. We et al. [16] identified influential nodes in the network based on the coritivity theory of complex networks. Coritivity theory measures the importance of a node set by the number of connected components that appear after deleting the nodes and their incident edges from the network. Kitsak et al. [17] distinguished nodes with different degrees of influence using the k-shell decomposition. The influential nodes identified by k-shell decomposition do not correspond to the nodes with high degree and betweenness, but are located in the center of the network topology. Lu et al. [18] improved the well-known PageRank algorithm [36] to find leaders in social networks, which performs especially well on directed networks.

2.2.2. Greedy algorithm

Kempe et al. [6] made the first attempt to solve the influence maximization problem using a simple greedy algorithm. This greedy algorithm uses large numbers of Monte-Carlo simulations to estimate the influence scope and takes the influence scope as the optimization objective. A hill climbing strategy is adopted to pursue the optimal solution. At each iteration, the node with the maximal marginal gain of influence scope is added to the initial active node set. It has been demonstrated that the greedy algorithm is $(1 - 1/e - \epsilon)$ optimal for the influence maximization problem, where e is the base of the natural logarithm and $\epsilon$ is a very small positive real number. Thus, the influence scope achieved by the greedy algorithm is at least about 63% of the actual global optimum. Extensive experiments have shown that the greedy algorithm significantly outperforms the typical centrality-based algorithms in influence scope.

However, the main drawback of the greedy algorithm is its high computational complexity. Let n and m be the number of nodes and edges in the network, respectively. Each simulation of the spreading process takes $O(m)$ time, so that the calculation of influence scope for any initial active node set requires $O(Rm)$ time, where R is the number of repeated Monte-Carlo simulations. Therefore, the total time complexity of the greedy algorithm is $O(knRm)$.

In order to improve the calculation efficiency of the greedy algorithm, Leskovec et al. [19] presented the CELF (cost-effective lazy-forward) method to avoid redundant calculations of influence scope according to the submodularity property of influence spread. In the CELF greedy algorithm, the marginal gain of influence scope for a node does not need to be re-calculated if its value at the previous iteration is already smaller than that of any other node at the current iteration. The CELF greedy algorithm can achieve approximately the same solutions as the original greedy algorithm with much faster calculation speed.

Many other methods try to estimate the influence scope using heuristic strategies instead of Monte-Carlo simulations. Kimura and Saito [20] imposed a strict constraint on the IC model that each node can only be activated at specific time steps and deduced the mathematical expression of influence scope. Chen et al. proposed the DiscountIC [21] algorithm, which estimates the marginal gain of influence scope under the IC model using degree discount heuristics. Chen et al. [22] and Goyal et al. [23] estimated the influence scope under the LT model by exploring all the paths within a small neighborhood in the network. Kimura et al. [24] evaluated the marginal influence scope through a bond percolation process in social networks.

3. Community structure

3.1. Definition of community structure

Community structure is a common and important property of real-world networks [25]. The communities in social networks have great practical significance since they correspond to real associations or organizations. Therefore, the identification of community structure can provide important insight into the function, organization and dynamics of social networks. Generally, a community can be defined as a group of nodes with dense internal connections and relatively sparse connections to the rest of the network [26–28]. In order to describe community structure more explicitly, we introduce the quantitative definitions of community structure that are widely adopted in the literature.

Consider a network represented by a simple graph $G = (V, E)$, where $V$ is the set of nodes and $E = \{(u, v) \mid u, v \in V\}$ is the set of edges between the nodes. The topology of the network is fully specified by the adjacency matrix $A$, where $A_{uv} = 1$ if node u and node v are directly connected and $A_{uv} = 0$ otherwise.

Radicchi et al. [37] proposed two local definitions of community, in a strong sense and in a weak sense, respectively. Considering a subnetwork $C \subset G$ to which node u belongs, the total degree of node u is split into two contributions: $k_u = k_u^{in}(C) + k_u^{out}(C)$, where $k_u^{in}(C) = \sum_{v \in C} A_{uv}$ is the number of edges connecting node u to the nodes belonging to subnetwork C and $k_u^{out}(C) = \sum_{v \notin C} A_{uv}$ is the number of edges connecting node u to the rest of the network. A strong community is a subnetwork that satisfies the constraint:

$$k_u^{in}(C) > k_u^{out}(C), \quad \forall u \in C \tag{1}$$

In a strong community, each node has more connections within the community than with the rest of the network. Compared with a strong community, a weak community is subject to a relaxed constraint:

$$\sum_{u \in C} k_u^{in}(C) > \sum_{u \in C} k_u^{out}(C) \tag{2}$$

A weak community requires that the sum of node degrees within the community is larger than the sum of node degrees toward the rest of the network. According to the definitions, a strong community is also a weak community, whereas the converse is not true.

3.2. Label propagation

Label propagation [39] is an important dynamic process in networks, which aims at detecting community structure in the network. The procedure of label propagation is conceptually
simple. Initially, each node in the network is assigned a unique label. Then, at each time step, each node adopts the label shared by the maximum number of its neighbors, in a synchronous or asynchronous fashion. The label updating rule can be formulated as:

$$l_u' = \arg\max_{l} \sum_{v \in \sigma(u)} \delta(l, l_v) \tag{3}$$

where $l_u'$ is the new label for node u, $l_v$ is the current label for node v, $\sigma(u)$ is the set of the neighbor nodes of node u, $\arg\max_l$ returns the label l for which the sum attains the largest value, and $\delta(\cdot)$ is the Kronecker delta function, i.e., $\delta(l, l_v) = 1$ when $l = l_v$; otherwise, $\delta(l, l_v) = 0$. If more than one maximal label exists, one of the maximal labels is chosen at random. As time increases, the densely connected groups of nodes quickly reach a consensus on a unique label and expand outward to take over more nodes. The process converges when the label of each node is the maximal label among its neighbors. At the end of label propagation, one label covers all the nodes in a community and different labels indicate different communities in the network.

The community structure obtained by label propagation approximately accords with the definition of a strong community. While a strong community requires each node to have strictly more connections within its community than outside the community, label propagation guarantees that each node has at least as many connections within its community as it has with each of the other communities.

Label propagation is a simple dynamic process and exhibits near linear time complexity $O(Tm)$, where T is the number of time steps during the process and m is the number of edges in the network. However, the label propagation process has the drawback of weak robustness due to its random nature. The same community in the network may be covered by different labels in different runs of label propagation. Refer to [40] for a detailed analysis of the label propagation process.

4. Our method

The spreading process in social networks with community structure has its own unique pattern. Due to the structural compactness of the community, information can easily propagate within the community but has little chance to spread outside the community [38]. Therefore, it is necessary to identify the core nodes in different communities as the initial active nodes, so that they can spread influence to the largest range on the network.

Our proposed influence maximization algorithm based on label propagation (IM-LPA) specifically aims at identifying influential nodes in social networks with community structure. We introduce the label propagation process to solve the influence maximization problem and propose a novel heuristic: the most influential node of a community can propagate its label to all the nodes within the community during the label propagation process. The IM-LPA algorithm is composed of two phases: the seeding phase and the label propagation phase. In the seeding phase, some special nodes are extracted as the seeds of the communities in the network. Then, in the label propagation phase, the algorithm propagates the labels from the seed nodes and measures the centrality of these seed nodes for their communities according to the label propagation process. The details of the IM-LPA algorithm are presented below.

4.1. Seeding phase

Seeding techniques have been adopted in a variety of studies related to community structure in networks [41,42]. Generally, a seed is a subnetwork which can be thought of as the potential core of a community in the network. The choice of seed is crucial for finding and revealing the community structure in the network. Nodes with high degree or clustering coefficient, fully-connected cliques and maximal cliques are usually taken as the seeds of communities in the network [43].

In the seeding phase of the IM-LPA algorithm, we identify some specific nodes as the seeds of communities in the network according to the degree of the nodes in the network. Consider a network $G = (V, E)$, where $V$ is the set of the nodes and $E$ is the set of the edges connecting the nodes. The procedure of the seeding phase is described below.

Step 1: Initialize the set of seed nodes $\Omega = \emptyset$, and the set of candidate nodes $W = V$.
Step 2: Calculate the degree $k_v$ of each node v in the network.
Step 3: Find the node with the largest degree in the candidate node set, $v_{max} = \arg\max_{v \in W} k_v$.
Step 4: Add $v_{max}$ to the seed node set, $\Omega = \Omega \cup \{v_{max}\}$.
Step 5: Remove $v_{max}$ and its neighbor nodes $N(v_{max})$ from the candidate node set, $W = W \setminus (\{v_{max}\} \cup N(v_{max}))$.
Step 6: Go back to Step 3 and repeat the process until the candidate node set is empty, $W = \emptyset$.
Step 7: Output the seed node set $\Omega$.

In social network analysis, the membership contribution of a node to its community has been shown to be highly related to its degree [44]. A node with larger degree is more likely to be the core of the community it belongs to. The seeding phase of the IM-LPA algorithm ensures that the chosen seed nodes have relatively large degree. Moreover, the seeding phase also guarantees that the seed nodes are independent of each other, which means that the geodesic distance between any two seed nodes is at least 2. The independence of seed nodes is crucial for the next label propagation phase, because it eliminates interference in the label propagation process.

4.2. Label propagation phase

After the seed nodes are extracted from the network, the IM-LPA algorithm expands these seeds to reveal the community structure based on label propagation. Considering that a node may belong to multiple communities in real social networks [45], we allow a node to have more than one label in the label propagation process. In particular, when a node is unlabeled, it has an empty label set.

The label propagation phase follows a simple procedure. First, each seed node is assigned a unique label and the other nodes in the network are assigned empty label sets. Then, at each time step t, every node in the network updates its label set synchronously according to the following label updating rule:

$$L_v'(t) = \arg\max_{l} \sum_{u \in N(v)} \delta(l, L_u(t-1)) \tag{4}$$

where $L_v'(t)$ is the updated label set for node v at time step t, $L_u(t-1)$ is the label set of node u at time step $t - 1$, $N(v)$ is the set of the neighbor nodes of node v, $\arg\max_l$ returns the set of labels for which the sum attains the largest value, and $\delta(\cdot)$ is the extended Kronecker delta function, i.e., $\delta(l, L_u) = 1$ when $l \in L_u$; otherwise, $\delta(l, L_u) = 0$. Finally, the label propagation process terminates when the label set of every node in the network no longer changes.

The label updating rule of the IM-LPA algorithm is similar to that of basic label propagation. The main difference is that when more than one maximal label exists, one of them is chosen at random in basic label propagation, while all of them are retained in the new label set in the IM-LPA algorithm. For basic label
propagation, random choices of label happen very frequently in the first few iterations since each node initially has a unique label. This is the main reason why basic label propagation often achieves poor and unreasonable performance. Different from basic label propagation, the IM-LPA algorithm initially assigns a unique label only to each seed node, which removes large numbers of misleading labels. Moreover, the improved label updating rule also eliminates the influence of randomness during the label propagation process. Thus, the label propagation in the IM-LPA algorithm is much more robust and stable than basic label propagation.

At the beginning of the label propagation process in the IM-LPA algorithm, the labels expand from the initial seed nodes and attempt to acquire more nodes in the neighborhood. When different labels reach the same node, they start to compete for the occupation of the node, and the node only adopts the most frequent label or labels among its neighbors. We argue that the label derived from the most influential node of a community would finally defeat other labels and occupy all the nodes within the community, while the label derived from a less important node would be taken over by the labels from the most influential nodes and gradually disappear during label propagation. At the end of the label propagation process, the nodes in the same community would be associated with the label from the most influential node of the community. In addition, the nodes with multiple labels indicate the overlaps between the communities in the network. Therefore, the whole label propagation process can reflect the centrality of different nodes in the network.

Let $l(v)$ denote the label initially assigned to seed node v. According to the label propagation process, we define a process variable $N(v, t)$ denoting the number of nodes associated with label $l(v)$ at time step t. Then, the centrality of each seed node can be measured as:

$$C(v) = \max_{0 \le t \le T} N(v, t) \tag{5}$$

where $C(v)$ is the centrality of node v and T is the final time step of the label propagation process. The centrality reflects the importance and influence of a node for the community to which it belongs. The centrality of the most influential node of each community is equal to the number of nodes in the community, while the centrality of a less important node is determined by the compactness of the local structure near the node. After ranking the seed nodes according to their centrality, the top k nodes are identified as the most influential nodes for influence maximization.

4.3. Overview of the IM-LPA algorithm

By combining the seeding phase and label propagation phase, we can outline the IM-LPA algorithm as follows:

Algorithm IM-LPA.
Input
– $A_{n \times n}$: the adjacency matrix of the network $G = (V, E)$, where $V$ is the set of the nodes, $E$ is the set of the edges connecting the nodes and $n = |V|$ is the number of nodes in the network. $A_{uv} = 1$ if node u and node v are directly connected; otherwise, $A_{uv} = 0$.
– k: the number of initial active nodes.
Output
The initial active node set S.
Method
Seeding Phase
Step 1: Initialize the set of seed nodes $\Omega = \emptyset$, and the set of candidate nodes $W = V$.
Step 2: Calculate the degree $k_v$ of each node v in the network.
repeat
  Step 3: Find the node with the largest degree in the candidate node set, $v_{max} = \arg\max_{v \in W} k_v$.
  Step 4: Add $v_{max}$ to the seed node set, $\Omega = \Omega \cup \{v_{max}\}$.
  Step 5: Remove $v_{max}$ and its neighbor nodes $N(v_{max})$ from the candidate node set, $W = W \setminus (\{v_{max}\} \cup N(v_{max}))$.
until the candidate node set is empty, $W = \emptyset$.
Label Propagation Phase
Step 6: Assign each seed node $v \in \Omega$ a unique label $l(v)$, and assign the other nodes in the network empty label sets.
repeat
  Step 7: Update the label set of each node in the network using Eq. (4) in synchronous fashion.
until the label set of every node in the network no longer changes.
Step 8: Calculate the centrality of each seed node $v \in \Omega$ using Eq. (5).
Step 9: Select the top k nodes with the largest centrality as the initial active node set S.

The proposed IM-LPA algorithm uses the label propagation process to identify the influential nodes in social networks with community structure. It can reveal the community structure in the network and measure the centrality of different nodes for their communities. The IM-LPA algorithm is parameter-free and requires no prior information about the community structure in the network.

4.4. Time complexity analysis

We also analyze the time complexity of the proposed IM-LPA algorithm. Let n and m be the number of nodes and edges in the network, respectively, and let $\bar{d}$ be the average degree of the nodes, which satisfies $O(m) = O(n) \cdot O(\bar{d})$.

1. In the seeding phase, it takes $O(m)$ time to calculate the degree of the nodes in the network. Ranking the nodes according to the degree also needs $O(m)$ time. Thus, the time complexity of the seeding phase is $O(m)$.
2. In the label propagation phase, each node has $O(\bar{d})$ neighbors and at most $O(\bar{d})$ labels in its label set. In the worst case, it costs $O(\bar{d}^2)$ time for one node to update its label set, since it needs to traverse $O(\bar{d}^2)$ labels among the neighbor nodes. Therefore, each time step of the label propagation process requires $O(n\bar{d}^2)$ time and the total time of the label propagation phase is $O(Tn\bar{d}^2) = O(Tm\bar{d})$, where T is the number of time steps during the label propagation. According to our experiments, the value of T is generally very small compared with the size of the network.
3. The total time complexity of the IM-LPA algorithm is $O(m\bar{d}) + O(Tm\bar{d}) = O(Tm\bar{d})$. For sparse networks, the time complexity is therefore $O(Tm)$, which is near linear in the number of edges in the network.

5. Experimental results and discussions

We have tested our proposed IM-LPA algorithm in comparison with other influence maximization methods on both synthetic benchmark networks and real-world social networks. The experiments adopt two basic diffusion models: the independent cascade (IC) model and the linear threshold (LT) model, as described in Section 2.1. For the IC model, the diffusion probability is uniformly set as
p = 0.025, 0.05 and 0.1. The performance of each influence maximization method is measured by the fraction of active nodes at the end of the spreading process, which is calculated over 10,000 independent simulations. All the experiments are implemented in MATLAB 2009b running on a PC with a 2.7 GHz processor and 3 GB memory.

The influence maximization algorithms for comparison include the CELF greedy algorithm [19], degree centrality, betweenness centrality [13], closeness centrality [14], local centrality [15], k-shell decomposition [17] and the PageRank algorithm [36]. For the CELF greedy algorithm, the influence scope is estimated by 10,000 simulations. For the PageRank algorithm, the damping factor d is set to 0.85.

For all the test networks, we also use a measure called modularity [46] to evaluate the significance of the community structure in the network. Formally, modularity can be formulated as:

$$Q = \frac{1}{2m} \sum_{u, v \in V} \left( A_{uv} - \frac{k_u k_v}{2m} \right) \delta(u, v) \qquad (6)$$

where m is the number of edges in the network, A_uv is the element of the adjacency matrix of the network, k_u is the degree of node u, and δ(·) is the extended Kronecker delta function, i.e., δ(u, v) = 1 if nodes u and v are in the same community; otherwise, δ(u, v) = 0. The term k_u k_v / 2m indicates the expected number of edges connecting node u and node v in a random network of the same size and node degree distribution. If the number of edges within communities is greater than the expected number in a random network, the modularity value Q is greater than 0. A larger value of modularity indicates more significant community structure in the network.

5.1. Synthetic networks

We first use the LFR benchmark model introduced by Lancichinetti et al. [47] to construct synthetic networks with community structure. A number of parameters are used to constrain the topological structure of the LFR benchmark networks. Both the node degree and the community size follow the power-law distribution, as commonly observed in real-world networks. The significance of the community structure is determined by a critical mixing parameter μ, which denotes the average fraction of the connections to other communities per node. A smaller value of the mixing parameter μ leads to more significant communities. Fig. 1 shows the relationship between the mixing parameter μ and the modularity Q in LFR benchmark networks. We can see that the modularity Q decreases as the parameter μ increases from 0 to 1; that is, the community structure in the network is gradually weakened as μ increases.

Our experiments construct a series of LFR benchmark networks with obvious community structure. The network size is set to 1000 and the mixing parameter μ is set to 0.1. The power-law exponents of node degree and community size are set to 2 and 1 respectively. The node degree lies in the range [1, 50] and has an average value of 20.

The first set of LFR benchmark networks contains small communities, whose sizes are in the range [20, 50]. In this set, most influential nodes identified by the IM-LPA algorithm are the core nodes of different communities in the networks. Fig. 2 shows the performance of the IM-LPA algorithm on the first set of LFR benchmark networks in comparison with other algorithms, where each data point is an average over 10 different networks.

As shown in Fig. 2, under the IC model with p = 0.025 and p = 0.1, the greedy algorithm performs the best among all the algorithms. Our IM-LPA algorithm is inferior only to the greedy algorithm and shows significant advantages over the other algorithms in most situations. Under the IC model with p = 0.05 and the LT model, the IM-LPA algorithm performs the best among all the algorithms. It is slightly better than the greedy algorithm and greatly outperforms the other algorithms.

The second set of LFR benchmark networks contains large communities, whose sizes are in the range [100, 200]. In this set, many influential nodes identified by the IM-LPA algorithm are in the same community since there exist only a few communities in the network. Fig. 3 shows the performance of the IM-LPA algorithm on the second set of LFR benchmark networks in comparison with other algorithms, where each data point is still an average over 10 different networks.

From Fig. 3, under the IC model with p = 0.025 and p = 0.05, the performances of the IM-LPA algorithm, the greedy algorithm, degree centrality and the PageRank algorithm are very close to each other. The IM-LPA algorithm is slightly superior to the other algorithms when the number of initial active nodes k is larger than 20. Under the LT model, the IM-LPA algorithm performs the best among all the algorithms and shows some advantages over the other algorithms. Only under the IC model with p = 0.1 is our IM-LPA algorithm inferior to the greedy algorithm and betweenness centrality, but it still performs better than the other algorithms.

From the above experimental results, we can see that the IM-LPA algorithm can effectively find the influential nodes in synthetic networks with community structure.
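
The spread measurement used in these experiments can be illustrated with a minimal Python sketch of the IC model. It assumes the standard formulation, in which each newly activated node gets one chance to activate each inactive neighbor with probability p, and it assumes the network is given as an adjacency-list dictionary. The function names `simulate_ic` and `estimate_spread` are hypothetical; the experiments reported here were implemented in MATLAB, so this is an illustration rather than the authors' code.

```python
import random


def simulate_ic(adj, seeds, p, rng):
    """Run one independent cascade and return the set of active nodes."""
    active = set(seeds)
    frontier = list(active)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in adj[u]:
                # A newly active node u gets a single activation attempt on v.
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active


def estimate_spread(adj, seeds, p, runs=1000, seed=0):
    """Average fraction of active nodes over `runs` independent simulations."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    total = sum(len(simulate_ic(adj, seeds, p, rng)) for _ in range(runs))
    return total / (runs * len(adj))
```

Calling `estimate_spread(adj, seeds, 0.05, runs=10000)` would mirror the 10,000-simulation setting described above, with `seeds` being the k nodes returned by whichever algorithm is under test.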

5.2. Real-world networks

We also compare the IM-LPA algorithm with other algorithms on several real-world social networks which are widely used in the literature. General information on these real-world social networks is shown in Table 1. It can be seen that the Football network, SFI network, Facebook network and PGP network have significant community structure since their modularity is large. For the Email network, the relatively small modularity indicates that the community structure is rather indistinct.
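
The modularity values in Table 1 follow Eq. (6). As a rough illustration, the Python sketch below evaluates Eq. (6) directly for an undirected, unweighted network. The function name `modularity`, the adjacency-list dictionary and the node-to-community map are assumptions made for this example. The double loop costs O(n²) time, so the sketch suits only small networks.

```python
def modularity(adj, community):
    """Evaluate Q = 1/(2m) * sum_{u,v} (A_uv - k_u k_v / 2m) * delta(u, v).

    adj: dict mapping each node to a list of its neighbors (undirected).
    community: dict mapping each node to its community label.
    """
    degree = {u: len(nbrs) for u, nbrs in adj.items()}
    two_m = sum(degree.values())  # each edge is counted from both endpoints
    q = 0.0
    for u in adj:
        nbrs = set(adj[u])
        for v in adj:
            if community[u] != community[v]:
                continue  # delta(u, v) = 0 across communities
            a_uv = 1.0 if v in nbrs else 0.0
            q += a_uv - degree[u] * degree[v] / two_m
    return q / two_m
```

For two triangles joined by a single edge, with each triangle taken as one community, the sum works out to Q = 5/14, while placing all six nodes in one community gives Q = 0.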
Table 1
General information of the real-world social networks.

Network   Description                                                     Nodes    Edges    Communities   Modularity
Football  American College football union [48]                            115      616      12            0.6010
SFI       Collaboration network of scientists at Santa Fe Institute [48]  118      200      8             0.7335
Email     E-mail interchanges between members of URV [49]                 1133     5451     14            0.4876
Facebook  Friend relationship on Facebook Site [50]                       4038     88,234   8             0.7379
PGP       The network of users of PGP algorithm [51]                      10,680   24,316   240           0.8432

Fig. 1. The relationship between the mixing parameter μ and the modularity Q in LFR benchmark networks.

Fig. 2. The influence spreading of different algorithms on LFR benchmark networks with small communities. (a) IC model, p = 0.025; (b) IC model, p = 0.05; (c) IC model, p = 0.1; (d) LT model.

Fig. 3. The influence spreading of different algorithms on LFR benchmark networks with large communities. (a) IC model, p = 0.025; (b) IC model, p = 0.05; (c) IC model, p = 0.1; (d) LT model.

The experiments on real-world social networks exclude the IC model with p = 0.025, because influence can only spread to a tiny range on these networks when the diffusion probability of the IC model is small. Fig. 4 shows the performances of the IM-LPA algorithm on the real-world social networks in comparison with other algorithms.

On the Football network, the performance of the IM-LPA algorithm is close to that of the greedy algorithm and significantly better than that of the other algorithms.

On the SFI network, when the number of initial active nodes k < 4, the IM-LPA algorithm performs very closely to the greedy algorithm and shows obvious superiority over the other algorithms. But as the number of initial active nodes k exceeds 4, the performance of the IM-LPA algorithm falls behind that of the greedy algorithm and gets close to that of degree centrality, k-shell decomposition and the PageRank algorithm.

On the Email network, the greedy algorithm performs the best among all the algorithms, but there is little difference between the performances of most algorithms. Our IM-LPA algorithm does not perform well, mainly due to the indistinct community structure of the Email network. But it still performs better than betweenness centrality, closeness centrality, local centrality and k-shell decomposition in most situations.

On the Facebook network, the proposed IM-LPA algorithm gives excellent results and performs almost the same as the greedy algorithm under the IC model. Under the LT model, the greedy algorithm performs the best, and the IM-LPA algorithm still performs better than most of the other algorithms.

On the PGP network, the greedy algorithm performs better than the other algorithms. The IM-LPA algorithm shows significant advantages over the other algorithms under the IC model. Under the LT model, the IM-LPA algorithm is inferior to the PageRank algorithm and very close to degree centrality.

The above experimental results demonstrate that the IM-LPA algorithm is promising and effective for identifying the influential nodes in real-world social networks with community structure.

Fig. 4. The influence spreading of different algorithms on real-world social networks. (a) Football Network; (b) SFI Network; (c) Email Network; (d) Facebook Network; (e) PGP Network.

5.3. Time complexity

Finally, we experimentally measure the time complexity of the IM-LPA algorithm. The LFR benchmark networks are still used in our experiments. The mixing parameter μ is set to 0.1, the average node degree is set to 20 and the community size is in the range [20, 200]. The number of nodes in the network increases from 100 to 10,000, so that the number of edges varies from 1000 to 100,000. The execution time and the number of time steps of the IM-LPA algorithm on benchmark networks are shown in Fig. 5, where each data point is an average over 10 networks.

As seen from Fig. 5, the execution time of the IM-LPA algorithm increases slightly faster than linearly with the number of edges in the network. This is mainly because the number of time steps of the label propagation process in the IM-LPA algorithm increases along with the number of edges. Therefore, the time complexity of the IM-LPA algorithm is consistent with the previous analysis in Section 4.4. The timing results demonstrate that our proposed IM-LPA algorithm is a fast influence maximization algorithm and applicable to large-scale networks.

Fig. 5. (a) Average execution time of the IM-LPA algorithm on benchmark networks with different sizes; (b) average number of time steps of the IM-LPA algorithm on benchmark networks with different sizes.

6. Conclusion

In this paper, we propose the IM-LPA algorithm to solve the influence maximization problem in social networks with community structure. The label propagation process is introduced to identify the influential nodes in the network. Our proposed algorithm is based on a novel heuristic that the most influential node of a community could propagate its label to all the nodes within the community during the label propagation process. Hence, we make the labels propagate from some seed nodes and evaluate the centrality of these seed nodes according to the label propagation process.

The IM-LPA algorithm is parameter-free and requires no prior information about the community structure in the network. Moreover, the IM-LPA algorithm has near linear time complexity, which makes it applicable to large-scale networks. We test the IM-LPA algorithm along with several other influence maximization
methods on both synthetic and real-world networks for comparison. The experimental results demonstrate the effectiveness and efficiency of our proposed algorithm.

Acknowledgments

This research work is funded by the National Science Foundation of China (61271316), 973 Program of China (2013CB329605), the National Social Science Foundation of China (14ZDB167), and Shanghai Key Laboratory of Integrated Administration Technologies for Information Security.

References

[1] C. Haythornthwaite, Social network analysis: an approach and technique for the study of information exchange, Libr. Inf. Sci. Res. 18 (1996) 323–342.
[2] S. Wasserman, Social Network Analysis: Methods and Applications, Cambridge University Press, Cambridge, 1994.
[3] S.H. Strogatz, Exploring complex networks, Nature 410 (2001) 268–276.
[4] M.J. Keeling, P. Rohani, Modeling Infectious Diseases: in Humans and Animals, Princeton University Press, Princeton, 2008.
[5] W. Chen, C. Wang, Y. Wang, Scalable influence maximization for prevalent viral marketing in large-scale social networks, in: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington DC, USA, 2010, pp. 1029–1038.
[6] D. Kempe, J. Kleinberg, E. Tardos, Maximizing the spread of influence through a social network, in: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington DC, USA, 2003, pp. 137–146.
[7] E. Even-Dar, A. Shapira, A note on maximizing the spread of influence in social networks, Inf. Process. Lett. 111 (2011) 184–187.
[8] C. Gao, X. Lan, X. Zhang, Y. Deng, A bio-inspired methodology of identifying influential nodes in complex networks, PLoS One 8 (2013) e66732.
[9] M. Kitsak, L.K. Gallos, S. Havlin, F. Liljeros, L. Muchnik, H.E. Stanley, H.A. Makse, Identification of influential spreaders in complex networks, Nat. Phys. 6 (2010) 888–893.
[10] N.E. Friedkin, Theoretical foundations for centrality measures, Am. J. Sociol. 96 (1991) 1478–1504.
[11] T. Opsahl, F. Agneessens, J. Skvoretz, Node centrality in weighted networks: generalizing degree and shortest paths, Soc. Netw. 32 (2010) 245–251.
[12] R. Pastor-Satorras, A. Vespignani, Epidemic spreading in scale-free networks, Phys. Rev. Lett. 86 (2001) 3200–3203.
[13] L.C. Freeman, A set of measures of centrality based on betweenness, Sociometry 40 (1977) 35–41.
[14] G. Sabidussi, The centrality index of a graph, Psychometrika 31 (1966) 581–603.
[15] D. Chen, L. Lu, M.S. Shang, Y.C. Zhang, T. Zhou, Identifying influential nodes in complex networks, Physica A 391 (2012) 1777–1787.
[16] Y. Wu, Y. Yang, F. Jiang, S. Jin, J. Xu, Coritivity-based influence maximization in social networks, Physica A 416 (2014) 467–480.
[17] M. Kitsak, L.K. Gallos, S. Havlin, F. Liljeros, L. Muchnik, H.E. Stanley, H.A. Makse, Identification of influential spreaders in complex networks, Nat. Phys. 6 (2010) 888–893.
[18] L. Lu, Y.C. Zhang, C.H. Yeung, T. Zhou, Leaders in social networks, the delicious case, PLoS One 6 (2011) e21202.
[19] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, N. Glance, Cost-effective outbreak detection in networks, in: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, USA, 2007, pp. 420–429.
[20] M. Kimura, K. Saito, Tractable models for information diffusion in social networks, in: Knowledge Discovery in Databases: PKDD 2006, 2006, pp. 259–271.
[21] W. Chen, Y. Wang, S. Yang, Efficient influence maximization in social networks, in: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, France, 2009, pp. 199–208.
[22] W. Chen, Y. Yuan, L. Zhang, Scalable influence maximization in social networks under the linear threshold model, in: Proceedings of IEEE 10th International Conference on Data Mining (ICDM 2010), Sydney, Australia, 2010, pp. 88–97.
[23] A. Goyal, W. Lu, L.V. Lakshmanan, SIMPATH: an efficient algorithm for influence maximization under the linear threshold model, in: 2011 IEEE 11th International Conference on Data Mining (ICDM 2011), Vancouver, Canada, 2011, pp. 211–220.
[24] M. Kimura, K. Saito, R. Nakano, H. Motoda, Extracting influential nodes on a social network for information diffusion, Data Min. Knowl. Discov. 20 (2010) 70–97.
[25] M.E.J. Newman, The structure and function of complex networks, SIAM Rev. 45 (2003) 167–256.
[26] M.E.J. Newman, Detecting community structure in networks, Eur. Phys. J. B 38 (2004) 321–330.
[27] M.E.J. Newman, M. Girvan, Finding and evaluating community structure in networks, Phys. Rev. E 69 (2004) 026113.
[28] S. Fortunato, Community detection in graphs, Phys. Rep. 486 (2010) 75–174.
[29] X. Wu, Z. Liu, How community structure influences epidemic spread in social networks, Physica A 387 (2008) 623–630.
[30] W. Huang, C. Li, Epidemic spreading in scale-free networks with community structure, J. Stat. Mech. 2007 (2007) P01014.
[31] X. Chu, J. Guan, Z. Zhang, S. Zhou, Epidemic spreading in weighted scale-free networks with community structure, J. Stat. Mech. 2009 (2009) P07043.
[32] J. Goldenberg, B. Libai, E. Muller, Talk of the network: a complex systems look at the underlying process of word-of-mouth, Mark. Lett. 12 (2001) 211–223.
[33] J. Goldenberg, B. Libai, E. Muller, Using complex systems analysis to advance marketing theory development: modeling heterogeneity effects on new product growth through stochastic cellular automata, Acad. Mark. Sci. Rev. 9 (2001) 1–18.
[34] M. Granovetter, Threshold models of collective behavior, Am. J. Sociol. 1978 (1978) 1420–1443.
[35] D.J. Watts, A simple model of global cascades on random networks, Proc. Natl. Acad. Sci. USA 99 (2002) 5766–5771.
[36] L. Page, S. Brin, R. Motwani, T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web, Stanford InfoLab Publication, Stanford, 1999.
[37] F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, D. Parisi, Defining and identifying communities in networks, Proc. Natl. Acad. Sci. USA 101 (2004) 2658–2663.
[38] X. Chu, J. Guan, Z. Zhang, S. Zhou, Epidemic spreading in weighted scale-free networks with community structure, J. Stat. Mech. 2009 (2009) P07043.
[39] U.N. Raghavan, R. Albert, S. Kumara, Near linear time algorithm to detect community structures in large-scale networks, Phys. Rev. E 76 (2007) 036106.
[40] M.J. Barber, J.W. Clark, Detecting network communities by propagating labels under constraints, Phys. Rev. E 80 (2009) 026129.
[41] A. Lancichinetti, S. Fortunato, J. Kertesz, Detecting the overlapping and hierarchical community structures in complex networks, New J. Phys. 11 (2009) 033015.
[42] C. Lee, F. Reid, A. McDaid, N. Hurley, Detecting highly overlapping community structure by greedy clique expansion, in: Proceedings of International Workshop on Social Network Mining and Analysis (SNAKDD 2010), Washington, DC, USA, 2010, pp. 33–42.
[43] C. Lee, F. Reid, A. McDaid, N. Hurley, Seeding for pervasively overlapping communities, Phys. Rev. E 83 (2011) 066107.
[44] J. Scripps, P.N. Tan, A.-H. Esfahanian, Exploration of link structure and community-based node roles in network analysis, in: Proceedings of 7th IEEE International Conference on Data Mining (ICDM 2007), 2007, pp. 649–654.
[45] Y. Zhao, S. Li, S. Wang, Agglomerative clustering based on label propagation for detecting overlapping and hierarchical communities in complex networks, Adv. Complex Syst. 17 (2014) 1450021.
[46] M.E.J. Newman, M. Girvan, Finding and evaluating community structure in networks, Phys. Rev. E 69 (2004) 026113.
[47] A. Lancichinetti, S. Fortunato, F. Radicchi, Benchmark graphs for testing community detection algorithms, Phys. Rev. E 78 (2008) 046110.
[48] M. Girvan, M.E.J. Newman, Community structure in social and biological networks, Proc. Natl. Acad. Sci. USA 99 (2002) 7821–7826.
[49] M.R. Guimerà, L. Danon, A. Díaz-Guilera, F. Giralt, A. Arenas, Self-similar community structure in a network of human interactions, Phys. Rev. E 68 (2003) 065103.
[50] J. Leskovec, J.J. Mcauley, Learning to discover social circles in ego networks, in: Advances in Neural Information Processing Systems, 2012, pp. 539–547.
[51] M. Boguña, R. Pastor-Satorras, A. Díaz-Guilera, A. Arenas, Models of social networks based on social distance attachment, Phys. Rev. E 70 (2004) 056122.

Yuxin Zhao received the B.S. degree and Ph.D. degree in Electronic Engineering from Shanghai Jiao Tong University, Shanghai, China, in 2010 and 2015 respectively. He won the First Class Scholarship of Shanghai Jiao Tong University in 2008. He is currently conducting research at IBM China Research Laboratory on machine learning and big data analysis. His research interests include complex networks, data mining and information security.

Shenghong Li received the B.S. and the M.S. degrees in Electrical Engineering from Jilin University of Technology, China, in 1993 and 1996 respectively, and received the Ph.D. degree in radio engineering from Beijing University of Posts and Telecommunications, China, in 1999. Since September 1999, he has been working at Shanghai Jiao Tong University, China, as a Research Fellow, Associate Professor and Professor, successively. In 2010, he worked as a Visiting Scholar at Nanyang Technological University, Singapore. His research interests include information security, signal and information processing, and artificial intelligence. He has published more than 80 papers, co-authored four books, and holds ten granted patents. In 2003, he received the 1st Prize of Shanghai Science and Technology Progress in China. In 2006 and 2007, he was elected as New Century Talent of the Chinese Education Ministry and Shanghai Dawn Scholar.

Feng Jin received the B.S. degree in Automation and the Ph.D. degree in Control Science and Engineering from Tsinghua University, Beijing, China, in 2003 and 2008 respectively. Since February 2009, he has been working at IBM China Research Laboratory on analytics and optimization management systems. His research interests include optimization algorithms, data mining and machine learning.