2021 Book ComplexNetworksTheirApplicatio
Complex
Networks & Their
Applications IX
Volume 1, Proceedings of the Ninth
International Conference on Complex
Networks and Their Applications
COMPLEX NETWORKS 2020
Studies in Computational Intelligence
Volume 943
Series Editor
Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
The series “Studies in Computational Intelligence” (SCI) publishes new develop-
ments and advances in the various areas of computational intelligence—quickly and
with a high quality. The intent is to cover the theory, applications, and design
methods of computational intelligence, as embedded in the fields of engineering,
computer science, physics and life sciences, as well as the methodologies behind
them. The series contains monographs, lecture notes and edited volumes in
computational intelligence spanning the areas of neural networks, connectionist
systems, genetic algorithms, evolutionary computation, artificial intelligence,
cellular automata, self-organizing systems, soft computing, fuzzy systems, and
hybrid intelligent systems. Of particular value to both the contributors and the
readership are the short publication timeframe and the world-wide distribution,
which enable both wide and rapid dissemination of research output.
Indexed by SCOPUS, DBLP, WTI Frankfurt eG, zbMATH, SCImago.
All books published in the series are submitted for consideration in Web of
Science.
Editors
Rosa M. Benito
Grupo de Sistemas Complejos
Universidad Politécnica de Madrid
Madrid, Madrid, Spain

Chantal Cherifi
IUT Lumière
University of Lyon
Bron Cedex, France
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
This 2020 edition of the International Conference on Complex Networks & Their
Applications is the ninth of a series that began in 2011. Over the years, this
adventure has made the conference one of the major international events in network
science.
Network science continues to attract tremendous interest from the scientific
community in fields as varied as finance and economy, medicine and neuroscience,
biology and earth sciences, sociology and politics, and computer science and
physics. The scientific topics range from network theory, network models,
network geometry, community structure, network analysis and measures, link
analysis and ranking, resilience and control, machine learning and networks,
and dynamics on/of networks to diffusion and epidemics, and visualization. It is
also worth mentioning some recent applications with high added value for current
social concerns, such as social and urban networks, human behavior, urban systems
and mobility, and quantifying success. The conference brings together researchers
who study the world through the lens of networks. Catalyzing the efforts of this
scientific community, it drives network science to generate cross-fertilization
between fundamental issues and innovative applications, review the current state
of the field and promote future research directions.
Every year, researchers from all over the world gather in our host venue. This
year’s edition was initially to be hosted in Spain by Universidad Politécnica de
Madrid. Unfortunately, the COVID-19 global health crisis forced us to organize the
conference as a fully online event.
Nevertheless, this edition attracted numerous authors with 323 submissions from
51 countries. The papers selected for the volumes of proceedings clearly reflect the
multiple aspects of complex network issues as well as the high quality of the
contributions.
All the submissions were peer-reviewed by three independent reviewers from our
strong international program committee. This ensured high-quality contributions as
well as compliance with the conference topics. After the review process, 112 papers
were selected for inclusion in the proceedings.
Undoubtedly, the success of this edition relied on the authors who have pro-
duced high-quality papers, as well as the impressive list of keynote speakers who
delivered fascinating plenary lectures:
– Leman Akoglu (Carnegie Mellon University, USA): “Graph-Based Anomaly
Detection: Problems, Algorithms and Applications”
– Stefano Boccaletti (Florence University, Italy): “Synchronization in Complex
Networks, Hypergraphs and Simplicial Complexes”
– Fosca Giannotti (KDD Lab, Pisa, Italy): “Explainable Machine Learning for
Trustworthy AI”
– János Kertész (Central European University, Hungary): “Possibilities and
Limitations of using mobile phone data in exploring human behavior”
– Vito Latora (Queen Mary University of London, UK): “Simplicial model of
social contagion”
– Alex “Sandy” Pentland (MIT Media Lab, USA): “Human and Optimal
Networked Decision Making in Long-Tailed and Non-stationary Environments”
– Nataša Pržulj (Barcelona Supercomputing Center, Spain): “Untangling biolog-
ical complexity: From omics network data to new biomedical knowledge and
Data-Integrated Medicine”
The topics addressed in the keynote talks allowed a broad coverage of the issues
encountered in complex networks and their applications to complex systems.
For the traditional tutorial sessions prior to the conference, our two invited
speakers delivered insightful talks. David Garcia (Complexity Science Hub Vienna,
Austria) gave a lecture entitled “Analyzing complex social phenomena through
social media data,” and Mikko Kivela (Aalto University, Finland) delivered a talk
on “Multilayer Networks.”
Each edition of the conference represents a challenge that cannot be successfully
met without the deep involvement of many people, institutions and sponsors.
First of all, we sincerely thank our advisory board members, Jon Crowcroft
(University of Cambridge), Raissa D’Souza (University of California, Davis, USA),
Eugene Stanley (Boston University, USA) and Ben Y. Zhao (University of
Chicago, USA), for inspiring the essence of the conference.
We record our thanks to our fellow members of the Organizing Committee. José
Fernando Mendes (University of Aveiro, Portugal), Jesús Gomez Gardeñes
(University of Zaragoza, Spain) and Huijuan Wang (TU Delft, Netherlands) chaired
the lightning sessions. Manuel Marques-Pita (Universidade Lusófona, Portugal),
José Javier Ramasco (IFISC, Spain) and Taha Yasseri (University of Oxford, UK)
managed the poster sessions. Luca Maria Aiello (Nokia Bell Labs, UK) and Leto
Peel (Université Catholique de Louvain, Belgium) were our tutorial chairs. Finally,
Sabrina Gaito (University of Milan, Italy) and Javier Galeano (Universidad
Politécnica de Madrid, Spain) were our satellite chairs.
We extend our thanks to Benjamin Renoust (Osaka University, Japan), Michael
Schaub (MIT, USA), Andreia Sofia Teixeira (Indiana University Bloomington,
USA) and Xiangjie Kong (Dalian University of Technology, China), the publicity
chairs, for advertising the conference in America, Asia and Europe, thereby
encouraging participation.
We would also like to acknowledge Regino Criado (Universidad Rey Juan
Carlos, Spain) and Roberto Interdonato (CIRAD - UMR TETIS, Montpellier,
France), our sponsor chairs.
Our deep thanks go to Matteo Zignani (University of Milan, Italy), publication
chair, for the tremendous work he has done in managing the submission system and
the proceedings publication process.
Thanks to Stephany Rajeh (University of Burgundy, France), Web chair, for
maintaining the website.
We would also like to record our appreciation for the work of the local
committee chair, Juan Carlos Losada (Universidad Politécnica de Madrid, Spain),
and all the local committee members, David Camacho (UPM, Spain), Fabio
Revuelta (UPM, Spain), Juan Manuel Pastor (UPM, Spain), Francisco Prieto (UPM,
Spain), Leticia Perez Sienes (UPM, Spain), Jacobo Aguirre (CSIC, Spain) and Julia
Martinez-Atienza (UPM, Spain), for their work in managing the online sessions.
They contributed greatly to the success of this edition.
We are also indebted to our partners, Alessandro Fellegara and Alessandro Egro
from Tribe Communication, for their passion and patience in designing the visual
identity of the conference.
We would like to express our gratitude to our partner journals involved in the
sponsoring of keynote talks: Applied Network Science, EPJ Data Science, Social
Network Analysis and Mining, and Entropy.
More generally, we are thankful to all those who have helped us and contributed
to the success of this meeting. Sincere thanks go to the contributors; the success
of the technical program would not have been possible without their creativity.
Finally, we would like to express our most sincere thanks to the program
committee members for their huge efforts in producing high-quality reviews in a
very limited time.
These volumes make the most advanced contribution of the international
community to the research issues surrounding the fascinating world of complex
networks. Their breadth, quality and novelty signal how profound a role complex
networks play in our understanding of our world. We hope that you will
enjoy reading the papers as much as we enjoyed organizing the conference and
putting this collection of papers together.
Rosa M. Benito
Hocine Cherifi
Chantal Cherifi
Esteban Moro
Luis Mateus Rocha
Marta Sales-Pardo
Organization and Committees
General Chairs
Advisory Board
Jon Crowcroft University of Cambridge, UK
Raissa D’Souza University of California, Davis, USA
Eugene Stanley Boston University, USA
Ben Y. Zhao University of Chicago, USA
Program Chairs
Chantal Cherifi University of Lyon, France
Luis M. Rocha Indiana University Bloomington, USA
Marta Sales-Pardo Universitat Rovira i Virgili, Spain
Satellite Chairs
Sabrina Gaito University of Milan, Italy
Javier Galeano Universidad Politécnica de Madrid, Spain
Lightning Chairs
José Fernando Mendes University of Aveiro, Portugal
Jesús Gomez Gardeñes University of Zaragoza, Spain
Huijuan Wang TU Delft, Netherlands
Poster Chairs
Manuel Marques-Pita University Lusófona, Portugal
José Javier Ramasco IFISC, Spain
Taha Yasseri University of Oxford, UK
Publicity Chairs
Benjamin Renoust Osaka University, Japan
Andreia Sofia Teixeira University of Lisbon, Portugal
Michael Schaub MIT, USA
Xiangjie Kong Dalian University of Technology, China
Tutorial Chairs
Luca Maria Aiello Nokia Bell Labs, UK
Leto Peel UCLouvain, Belgium
Sponsor Chairs
Roberto Interdonato CIRAD - UMR TETIS, France
Regino Criado Universidad Rey Juan Carlos, Spain
Local Committee
Jacobo Aguirre CSIC, Spain
David Camacho UPM, Spain
Julia Martinez-Atienza UPM, Spain
Juan Manuel Pastor UPM, Spain
Leticia Perez Sienes UPM, Spain
Francisco Prieto UPM, Spain
Fabio Revuelta UPM, Spain
Publication Chair
Matteo Zignani University of Milan, Italy
Web Chair
Stephany Rajeh University of Burgundy, France
Program Committee
Jacobo Aguirre Centro Nacional de Biotecnología, Spain
Amreen Ahmad Jamia Millia Islamia, India
Masaki Aida Tokyo Metropolitan University, Japan
Luca Maria Aiello Nokia Bell Labs, UK
Marco Aiello University of Stuttgart, Germany
Esra Akbas Oklahoma State University, USA
Mehmet Aktas University of Central Oklahoma, USA
Tatsuya Akutsu Kyoto University, Japan
Reka Albert The Pennsylvania State University, USA
Aleksandra Aloric Institute of Physics Belgrade, Serbia
Claudio Altafini Linköping University, Sweden
Benjamin Althouse New Mexico State University, USA
Lucila G. Alvarez-Zuzek IFIMAR-UNMdP, Argentina
Luiz G. A. Alves Northwestern University, USA
Enrico Amico Swiss Federal Institute of Technology
in Lausanne, Switzerland
Hamed Amini Georgia State University, USA
Chuankai An Dartmouth College, USA
Marco Tulio Angulo National Autonomous University of Mexico
(UNAM), Mexico
Demetris Antoniades RISE Research Center, Cyprus
Alberto Antonioni Carlos III University of Madrid, Spain
Nino Antulov-Fantulin ETH Zurich, Switzerland
Nuno Araujo Universidade de Lisboa, Portugal
Elsa Arcaute University College London, UK
Laura Arditti Polytechnic of Turin, Italy
Samin Aref Max Planck Institute for Demographic Research,
Germany
Panos Argyrakis Aristotle University of Thessaloniki, Greece
Malbor Asllani University of Limerick, Ireland
Tomaso Aste University College London, UK
Martin Atzmueller Tilburg University, Netherlands
Konstantin Avrachenkov Inria, France
Jean-Francois Baffier National Institute of Informatics, Japan
Giacomo Baggio University of Padova, Italy
Rodolfo Baggio Bocconi University, Italy
Franco Bagnoli University of Florence, Italy
Annalisa Barla Università di Genova, Italy
Paolo Barucca University College London, UK
Anastasia Baryshnikova Calico Life Sciences, USA
Nikita Basov St. Petersburg State University, Russia
Gareth Baxter University of Aveiro, Portugal
Contents
Community Structure
A Method for Community Detection in Networks with Mixed Scale Features at Its Nodes
Soroosh Shalileh and Boris Mirkin
Efficient Community Detection by Exploiting Structural Properties of Real-World User-Item Graphs
Larry Yueli Zhang and Peter Marbach
Measuring Proximity in Attributed Networks for Community Detection
Rinat Aynulin and Pavel Chebotarev
Core Method for Community Detection
A. A. Chepovskiy, S. P. Khaykova, and D. A. Leshchev
Effects of Community Structure in Social Networks on Speed of Information Diffusion
Nako Tsuda and Sho Tsugawa
Closure Coefficient in Complex Directed Networks
Mingshan Jia, Bogdan Gabrys, and Katarzyna Musial
Nondiagonal Mixture of Dirichlet Network Distributions for Analyzing a Stock Ownership Network
Wenning Zhang, Ryohei Hisano, Takaaki Ohnishi, and Takayuki Mizuno
Spectral Clustering for Directed Networks
William R. Palmer and Tian Zheng
Composite Modularity and Parameter Tuning in the Weight-Based Fusion Model for Community Detection in Node-Attributed Social Networks
Petr Chunaev, Timofey Gradov, and Klavdiya Bochenina
Network Analysis
Complex Network Analysis of North American Institutions of Higher Education on Twitter
Dmitry Zinoviev, Shana Cote, and Robert Díaz
Connectivity-Based Spectral Sampling for Big Complex Network Visualization
Jingming Hu, Seok-Hee Hong, Jialu Chen, Marnijati Torkel, Peter Eades, and Kwan-Liu Ma
Graph Signal Processing on Complex Networks for Structural Health Monitoring
Stefan Bloemheuvel, Jurgen van den Hoogen, and Martin Atzmueller
An Analysis of Four Academic Department Collaboration Networks with Respect to Gender
Lauren Nakamichi, Theresa Migler, and Zoë Wood
\[ F(\lambda_k, s_k, c_k) = \rho \sum_{k=1}^{K} \sum_{i,v} (y_{iv} - c_{kv} s_{ik})^2 + \xi \sum_{k=1}^{K} \sum_{i,j} (p_{ij} - \lambda_k s_{ik} s_{jk})^2 \tag{3} \]
The factors ρ and ξ in Eq. (3) are expert-driven constants to balance the two
sources of data.
At first glance, the criterion in Eq. (3) differs from what follows from Eqs. (2)
and (1): in (3), the summation over k is outside the parentheses, whereas those
equations require it to be inside. However, the formulation in (3) is consistent
with the models in (2) and (1) because the vectors s_k (k = 1, 2, ..., K)
correspond to a partition and thus are mutually orthogonal: for any specific
i ∈ I, s_ik is zero for all k except one, namely the k for which i ∈ S_k.
Therefore, each of the sums over k in Eqs. (2) and (1) consists of just one item,
so the summation sign may indeed be moved outside the parentheses.
To use a one-by-one clustering strategy [13] here, let us denote an individ-
ual community by S; its center in feature space, by c; and the corresponding
intensity weight, by λ (just removing the index, k, for convenience). The extent
of fit between the community and the dataset will be the corresponding part of
criterion in (3):
\[ F(\lambda, c_v, s_i) = \rho \sum_{i,v} (y_{iv} - c_v s_i)^2 + \xi \sum_{i,j} (p_{ij} - \lambda s_i s_j)^2 \tag{4} \]
The problem: given matrices P = (pij ) and Y = (yiv ), find binary s, as well
as real-valued λ and c = (cv ), minimizing criterion (4).
As is well known, and, in fact, easy to prove, the optimal real-valued cv is
equal to the within-S mean of feature v, and the optimal intensity value λ is
equal to the mean within-cluster link value:
\[ c_v = \frac{\sum_{i \in S} y_{iv}}{|S|}; \qquad \lambda = \frac{\sum_{i,j \in S} p_{ij}}{|S|^2} \tag{5} \]
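As a minimal sketch of Eqs. (4) and (5), assuming that Y and P are given as plain nested lists and that a community S is a list of node indices, the optimal center and intensity can be computed and checked numerically. The function names and the toy data are illustrative assumptions, not part of the original method description.

```python
def optimal_center_and_intensity(Y, P, S):
    """Eq. (5): within-S feature means c_v and mean within-S link value lambda."""
    size = len(S)
    n_features = len(Y[0])
    c = [sum(Y[i][v] for i in S) / size for v in range(n_features)]
    lam = sum(P[i][j] for i in S for j in S) / size ** 2
    return c, lam


def criterion(Y, P, S, c, lam, rho=1.0, xi=1.0):
    """Eq. (4): least-squares misfit of one community, with s_i = 1 iff i is in S."""
    members = set(S)
    n = len(Y)
    s = [1 if i in members else 0 for i in range(n)]
    feature_part = sum((Y[i][v] - c[v] * s[i]) ** 2
                       for i in range(n) for v in range(len(c)))
    link_part = sum((P[i][j] - lam * s[i] * s[j]) ** 2
                    for i in range(n) for j in range(n))
    return rho * feature_part + xi * link_part


# Toy data (hypothetical): three nodes, two features, one link between nodes 0 and 1.
Y = [[1.0, 0.0], [3.0, 0.0], [10.0, 5.0]]
P = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
S = [0, 1]
c, lam = optimal_center_and_intensity(Y, P, S)
best = criterion(Y, P, S, c, lam)
```

Moving either c or λ away from the values in Eq. (5) while keeping S fixed should only increase the criterion, which is what the optimality claim amounts to.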
Criterion (4) can be further reformulated as:
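\[ F(s) = \rho \sum_{i,v} y_{iv}^2 - 2\rho \sum_{i,v} y_{iv} c_v s_i + \rho \sum_v c_v^2 \sum_i s_i^2 + \xi \sum_{i,j} p_{ij}^2 - 2\xi\lambda \sum_{i,j} p_{ij} s_i s_j + \xi\lambda^2 \sum_i \sum_j s_i^2 s_j^2 \tag{6} \]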
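The items $T(Y) = \sum_{i,v} y_{iv}^2$ and $T(P) = \sum_{i,j} p_{ij}^2$ in (6) express the quadratic
scatters of data matrices Y and P, respectively. Using them, Eq. (6) can be
reformulated as
\[ F(s) = \rho T(Y) + \xi T(P) - G(s) \tag{7} \]
where
\[ G(s) = 2\rho \sum_{i,v} y_{iv} c_v s_i - \rho \sum_v c_v^2 \sum_i s_i^2 + 2\xi\lambda \sum_{i,j} p_{ij} s_i s_j - \xi\lambda^2 \sum_i \sum_j s_i^2 s_j^2 \tag{8} \]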
Equation (7) shows that the combined data scatter, ρT(Y) + ξT(P), is decomposed
into two complementary parts. One of them, F(s), expresses the residual,
that part of the data scatter which is minimized in Eqs. (1) and (2), whereas the
other, G(s), expresses the contribution of the model to the data scatter.
By putting the optimal values c_v and λ from (5) into this expression, we
obtain a simpler expression for G(s):
\[ G = \rho |S| \sum_v c_v^2 + \xi\lambda \sum_{i,j} p_{ij} s_i s_j \tag{9} \]
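The decomposition can be checked numerically. The sketch below assumes that the combined scatter splits as ρT(Y) + ξT(P) = F(s) + G(s), with F taken from criterion (4), G from Eq. (9), and c and λ set to their optimal values from Eq. (5); the function name and the random test data are illustrative assumptions.

```python
import random


def scatter_parts(Y, P, S, rho=1.0, xi=1.0):
    """Return (F, G, total scatter) for community S with optimal c and lambda."""
    n, n_features, size = len(Y), len(Y[0]), len(S)
    members = set(S)
    s = [1 if i in members else 0 for i in range(n)]
    c = [sum(Y[i][v] for i in S) / size for v in range(n_features)]  # Eq. (5)
    lam = sum(P[i][j] for i in S for j in S) / size ** 2             # Eq. (5)
    # Residual part: criterion (4).
    F = (rho * sum((Y[i][v] - c[v] * s[i]) ** 2
                   for i in range(n) for v in range(n_features))
         + xi * sum((P[i][j] - lam * s[i] * s[j]) ** 2
                    for i in range(n) for j in range(n)))
    # Explained part: Eq. (9).
    G = (rho * size * sum(cv ** 2 for cv in c)
         + xi * lam * sum(P[i][j] * s[i] * s[j]
                          for i in range(n) for j in range(n)))
    # Combined data scatter rho*T(Y) + xi*T(P).
    total = (rho * sum(y ** 2 for row in Y for y in row)
             + xi * sum(p ** 2 for row in P for p in row))
    return F, G, total


rng = random.Random(0)  # fixed seed so the check is reproducible
Y = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(6)]
P = [[rng.random() for _ in range(6)] for _ in range(6)]
F, G, total = scatter_parts(Y, P, S=[0, 2, 5], rho=0.7, xi=1.3)
```

Since the total scatter does not depend on S, minimizing F over S is equivalent to maximizing G, which is why G can serve as the quantity to optimize when searching for a community.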
3. Redefine J by removing all the elements of S_k from it. Check whether the
resulting J is empty. If it is, stop: define the current k as K and output
all the solutions S_k, c_k, λ_k, G_k, k = 1, 2, ..., K. If not, add 1 to k and go to step 2.
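The control flow of step 3 can be sketched as a loop over a shrinking node set J. Only the stopping rule and the removal of extracted elements are taken from the text; the cluster-finding step is passed in as a callable `find_cluster`, which is a hypothetical placeholder rather than the SEFNAC extraction procedure itself.

```python
def extract_one_by_one(nodes, find_cluster):
    """Repeatedly extract a cluster from the remaining node set J until J is empty."""
    J = set(nodes)
    clusters = []
    while J:
        S = set(find_cluster(J))
        if not S or not S <= J:
            raise ValueError("find_cluster must return a non-empty subset of J")
        clusters.append(S)
        J -= S  # step 3: remove the extracted elements from J
    return clusters  # the number of clusters K is len(clusters)


def two_smallest(J):
    """Toy stand-in for the cluster-finding step (hypothetical, not SEFNAC)."""
    return sorted(J)[:2]


clusters = extract_one_by_one(range(5), two_smallest)
```

Because every pass removes at least one element from J, the loop terminates after at most N extractions, and the number of clusters K is an output rather than an input.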
3.2 Datasets
We use both real world datasets and synthetic datasets.
Real World Datasets. We use five real-world data sets listed in Table 1.
Some of them involve both quantitative and categorical features. The algorithms
under comparison, unlike the proposed algorithm SEFNAC, require categorical
features. Therefore, whenever a data set contains a quantitative
feature, we convert that feature to a categorical version.
Malaria data set [9]
The nodes are amino acid sequences containing six highly variable regions (HVR)
each. The edges are drawn between sequences with similar HVRs number 6. In
this data set, there are two nominal attributes of nodes:
1. Cys labels derived from the highly variable region HVR6 (assumed ground
truth);
2. Cys-PoLV labels derived from the sequences adjacent to regions HVR5 and HVR6.
Table 1. Real world datasets under consideration. Symbols N, E, and F stand for the
number of nodes, the number of edges, and the number of node features, respectively.
Before applying SEFNAC, all attribute categories are converted into 1/0
dummy variables which are considered quantitative.
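The conversion of categories into 1/0 dummy variables can be sketched as follows. The helper name `to_dummies` and the ordering of categories (sorted) are assumptions made for illustration.

```python
def to_dummies(values):
    """Map a categorical column to 1/0 indicator columns, one per category."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return rows, categories


rows, categories = to_dummies(["red", "blue", "red", "green"])
```

Each entity gets exactly one 1 per original feature, and the resulting columns can be treated as quantitative features.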
Generating Synthetic Data Sets. First of all, we specify the number of nodes
N , the number of features V , and the number of communities, K, in a dataset to
be generated. As the number of parameters to control is rather high, we narrow
down the variation of our data generator by maintaining two types of settings
only, a small size network and a medium size network. For a small size setting,
we specify the values of the three parameters as follows: N = 200, V = 5, and
K = 5. For the medium size, N = 1000, V = 10, and K = 15.
Generating Networks
For given numbers of nodes, N, and communities, K, the cardinalities of
communities are drawn uniformly at random, subject to the constraints that no
community has fewer than a pre-specified number of nodes (in our experiments,
this is set to 30, so that probabilistic approaches are applicable) and that the
community sizes sum to N.
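One possible reading of this size-generation step is sketched below: every community first receives the minimum number of nodes, and each remaining node is then assigned to a community chosen uniformly at random. The text does not specify the exact sampling scheme, so this allocation rule and the function name are assumptions.

```python
import random


def random_community_sizes(n_nodes, n_communities, min_size, rng):
    """Random sizes that sum to n_nodes with every community >= min_size."""
    spare = n_nodes - n_communities * min_size
    if spare < 0:
        raise ValueError("n_nodes is too small for the requested minimum size")
    sizes = [min_size] * n_communities
    for _ in range(spare):
        # Give one spare node to a community chosen uniformly at random.
        sizes[rng.randrange(n_communities)] += 1
    return sizes


small = random_community_sizes(200, 5, 30, random.Random(1))
medium = random_community_sizes(1000, 15, 30, random.Random(1))
```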
Given the community sizes, we populate them with nodes, which are specified
just by indices. Then we specify two probability values, p and q. Every
within-community edge is drawn with probability p, independently of other
edges. Similarly, every between-community edge is drawn independently with
probability q.
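The edge-generation rule can be sketched as below, assuming an undirected graph without self-loops stored as a 0/1 adjacency matrix; both of these choices, as well as the function name, are assumptions not stated in the text.

```python
import random


def planted_partition(sizes, p, q, rng):
    """Return (labels, adjacency) for an undirected graph without self-loops."""
    labels = [k for k, size in enumerate(sizes) for _ in range(size)]
    n = len(labels)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Within-community pairs use p, between-community pairs use q.
            prob = p if labels[i] == labels[j] else q
            if rng.random() < prob:
                A[i][j] = A[j][i] = 1
    return labels, A


labels, A = planted_partition([4, 3], p=1.0, q=0.0, rng=random.Random(0))
```

With p = 1 and q = 0 the generator returns disjoint cliques, which makes a convenient sanity check before moving to settings such as p = 0.7 and q = 0.6.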
Generating Quantitative Features
To model quantitative features, we generate each cluster from a Gaussian
distribution whose covariance matrix is diagonal, with diagonal values drawn
uniformly at random from the range [0.05, 0.1] to specify the cluster’s spread.
Each component of the cluster center is generated uniformly at random from the
range α[−1, +1], so that the real positive α controls the cluster intermix: the
smaller the α, the closer the cluster centers are to each other.
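A minimal sketch of this generator follows. It treats the diagonal values of the covariance matrix as per-feature variances, so the standard deviation passed to the Gaussian sampler is their square root; the function name and return layout are assumptions.

```python
import math
import random


def gaussian_cluster(size, n_features, alpha, rng):
    """Generate one cluster: a random center plus diagonal-covariance Gaussian points."""
    # Each center component is uniform on alpha * [-1, +1].
    center = [alpha * rng.uniform(-1.0, 1.0) for _ in range(n_features)]
    # Diagonal covariance entries (variances) are uniform on [0.05, 0.1].
    variances = [rng.uniform(0.05, 0.1) for _ in range(n_features)]
    points = [[rng.gauss(center[v], math.sqrt(variances[v]))
               for v in range(n_features)]
              for _ in range(size)]
    return center, variances, points


center, variances, points = gaussian_cluster(2000, 5, 0.7, random.Random(42))
```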
In addition to cluster intermix, we take into account the possible presence
of noise in data. We generate a noise feature uniformly at random from an
interval defined by the maximum and minimum values. In this way, we may
add noise features amounting to 50% of the original features.
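The noise step can be sketched as appending uniformly random columns to the feature matrix. Two points are assumptions here: the interval is taken as the minimum and maximum over the whole data matrix (the text does not say whether it is global or per feature), and the number of noise columns is the rounded share of the original feature count.

```python
import random


def add_noise_features(X, share, rng):
    """Append round(share * V) uniform noise columns spanning the data range."""
    n_features = len(X[0])
    n_noise = round(share * n_features)
    lo = min(min(row) for row in X)
    hi = max(max(row) for row in X)
    return [row + [rng.uniform(lo, hi) for _ in range(n_noise)] for row in X]


X = [[0.0, 1.0, 2.0, 3.0], [-1.0, 0.5, 0.2, 4.0], [2.0, 2.0, -3.0, 1.0]]
noisy = add_noise_features(X, 0.5, random.Random(7))
```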
Generating Categorical Features
To model categorical features, we randomly choose the number of categories for
each of them from the set {2, 3, ..., L}, where L = 10 for small-size networks and
L = 15 for medium-size networks. Then, given the number of communities,
K, and the numbers of entities, N_k (k = 1, ..., K), the cluster centers are
generated randomly so that no two centers coincide at more than 50% of
features.
Once the center of the k-th cluster, c_k = (c_kv), is specified, the N_k entities
of this cluster are generated as follows. Given a pre-specified threshold of
intermix, ε, between 0 and 1, for every pair (i, v), i = 1 : N_k; v = 1 : V, a
uniformly random real number r between 0 and 1 is generated. If r > ε, the entry
x_iv is set to be equal to c_kv; otherwise, x_iv is taken randomly from the set of
categories specified for feature v.
Consequently, all entities in the k-th cluster coincide with its center, up to
rare errors if ε is close to 1. The smaller the ε, the more diverse, and thus
intermixed, the generated entities would be.
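The categorical generator can be sketched as below. Read literally, the rule "r > ε" copies the center entry with probability 1 − ε, whereas the closing remark suggests that ε close to 1 gives entities that nearly coincide with the center. The sketch therefore takes the probability of copying the center as an explicit argument, `p_center` (a hypothetical name), so that either reading can be plugged in. The rejection sampling of centers and the `max_tries` guard are also assumptions.

```python
import random


def random_centers(n_clusters, n_categories, rng, max_overlap=0.5, max_tries=10_000):
    """Draw centers so that no two coincide on more than max_overlap of the features."""
    n_features = len(n_categories)
    centers = []
    tries = 0
    while len(centers) < n_clusters:
        tries += 1
        if tries > max_tries:
            raise RuntimeError("could not find sufficiently distinct centers")
        candidate = [rng.randrange(m) for m in n_categories]
        # Accept the candidate only if it is distinct enough from every accepted center.
        if all(sum(a == b for a, b in zip(candidate, c)) <= max_overlap * n_features
               for c in centers):
            centers.append(candidate)
    return centers


def cluster_entities(center, size, n_categories, p_center, rng):
    """Each entry copies the center with probability p_center, else is a random category."""
    return [[c if rng.random() < p_center else rng.randrange(m)
             for c, m in zip(center, n_categories)]
            for _ in range(size)]


rng = random.Random(3)
n_categories = [5] * 6  # hypothetical: six features with five categories each
centers = random_centers(4, n_categories, rng)
tight = cluster_entities(centers[0], 10, n_categories, 1.0, rng)
loose = cluster_entities(centers[0], 10, n_categories, 0.0, rng)
```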
Generating Mixed Scale Features
We divide the features into two approximately equal parts, one consisting
of quantitative features and the other of categorical features. Each part is
filled in independently, as described above.
To set a more realistic design, in some datasets we may explicitly insert 50%
features that are uniformly random.
Thus, the generation of synthetic datasets is controlled by specifying six two-
valued and one three-valued parameters: feature scales: quantitative, categorical,
mixed; data size: small, medium; presence of noise features: yes, no; the
probability of a within-community edge p; the probability of a between-community
edge q; cluster intermix parameter α/ε. Therefore, there are 192 combinations
of these altogether. At each setting, we generate 10 datasets, run a community
detection algorithm, and calculate the mean and the standard deviation of ARI
(NMI) values over these 10 datasets.
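The evaluation step can be sketched as follows: an adjusted Rand index (ARI) in the pair-counting form of Hubert and Arabie [6], followed by the mean and standard deviation over a batch of runs. Whether the reported deviation is the population or the sample version is not stated in the text; the population version (`pstdev`) is assumed here, and the scores in the usage lines are made up for illustration.

```python
from collections import Counter
from math import comb
from statistics import mean, pstdev


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same items."""
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to compare
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)
    maximum = (pairs_a + pairs_b) / 2
    if maximum == expected:  # degenerate case, e.g. both partitions are one cluster
        return 1.0
    return (pairs_both - expected) / (maximum - expected)


def summarize(scores):
    """Mean and (population) standard deviation of a list of scores."""
    return mean(scores), pstdev(scores)


same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])     # identical up to relabeling
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])  # worse than chance
avg, std = summarize([1.0, 0.5])  # hypothetical scores from two runs
```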
The following two sections present our experimental results for (a) testing
the validity of the SEFNAC algorithm on synthetic data, and (b) comparing the
performance of SEFNAC and its competitors on both real and synthetic data.
Table 2 presents the results of our experiments on synthetic datasets with mixed
scale features.
We can see that SEFNAC successfully recovers the numbers of communities
at q = 0.3 and mostly fails at q = 0.6, because the latter corresponds to a
counterintuitive situation in which the probability of a link between separate
communities is greater than 0.5. Yet even in this case the partition is recovered
exactly when other parameters keep its structure tight, as, say, at p = 0.9. This
holds for both small size and medium size cases. Insertion of noise features does
reduce the levels of ARI (NMI), but not by much. The real reduction in the numbers
of recovered communities, 7–8 out of the 15 generated, occurs at the medium
size data sets with really loose data structures, p = 0.7 and q = 0.6, leading
to significant drops in ARI (NMI) values.
The picture is much the same in the cases of quantitative-only and
categorical-only feature scales; we do not present them, to shorten the paper.
Table 3. Comparison of CESNA, SIAN and SEFNAC at synthetic data sets with
categorical features. The best results are highlighted using bold-face. The average ARI
and NMI values and their standard deviations (in parentheses) over 10 different data
sets are reported.

Small size networks
               CESNA                   SIAN                    SEFNAC
p, q, ε        ARI         NMI         ARI         NMI         ARI         NMI
0.9, 0.3, 0.9  1.00(0.00)  1.00(0.00)  0.55(0.29)  0.58(0.30)  0.99(0.01)  0.99(0.01)
0.9, 0.3, 0.7  0.95(0.10)  0.97(0.06)  0.48(0.29)  0.52(0.27)  0.97(0.02)  0.97(0.03)
0.9, 0.6, 0.9  0.93(0.08)  0.93(0.06)  0.32(0.25)  0.37(0.27)  0.97(0.01)  0.96(0.02)
0.9, 0.6, 0.7  0.90(0.06)  0.90(0.06)  0.11(0.14)  0.12(0.15)  0.75(0.12)  0.73(0.11)
0.7, 0.3, 0.9  0.97(0.08)  0.97(0.04)  0.55(0.16)  0.60(0.15)  0.98(0.02)  0.97(0.02)
0.7, 0.3, 0.7  0.89(0.14)  0.91(0.10)  0.51(0.21)  0.55(0.19)  0.87(0.07)  0.85(0.06)
0.7, 0.6, 0.9  0.50(0.10)  0.59(0.08)  0.05(0.09)  0.05(0.10)  0.90(0.07)  0.89(0.05)
0.7, 0.6, 0.7  0.20(0.08)  0.29(0.08)  0.03(0.04)  0.04(0.04)  0.60(0.09)  0.59(0.08)

Medium size networks
               CESNA                   SIAN                    SEFNAC
p, q, ε        ARI         NMI         ARI         NMI         ARI         NMI
0.9, 0.3, 0.9  0.89(0.05)  0.94(0.03)  0.00(0.00)  0.00(0.00)  1.00(0.00)  1.00(0.00)
0.9, 0.3, 0.7  0.85(0.08)  0.92(0.03)  0.00(0.00)  0.00(0.00)  0.99(0.01)  0.99(0.01)
0.9, 0.6, 0.9  0.63(0.06)  0.75(0.04)  0.00(0.00)  0.00(0.00)  0.99(0.01)  1.00(0.00)
0.9, 0.6, 0.7  0.48(0.09)  0.66(0.06)  0.00(0.00)  0.00(0.00)  0.96(0.03)  0.97(0.01)
0.7, 0.3, 0.9  0.77(0.07)  0.89(0.03)  0.03(0.08)  0.04(0.12)  1.00(0.01)  1.00(0.01)
0.7, 0.3, 0.7  0.71(0.13)  0.84(0.06)  0.00(0.00)  0.00(0.00)  0.99(0.01)  0.99(0.01)
0.7, 0.6, 0.9  0.06(0.02)  0.35(0.06)  0.00(0.00)  0.00(0.00)  0.99(0.01)  0.99(0.01)
0.7, 0.6, 0.7  0.02(0.01)  0.25(0.02)  0.00(0.00)  0.00(0.00)  0.91(0.04)  0.99(0.04)
One can see that on small-size networks CESNA wins three times (out of 8)
with regard to ARI; with regard to NMI, CESNA wins in two more settings. In all
the other cases, including all the medium-size datasets, SEFNAC wins. SIAN never
wins in this table. There is a striking change in the performance of SIAN on
the medium-size datasets: SIAN fails on all counts there, producing NaN, which
we interpret as a one-cluster solution.
We also experimented with a slightly different design for categorical feature
generation, in which an entity either coincides with its cluster center or is
entirely random. Under that design, CESNA wins 7 times on the small-size
datasets and SEFNAC wins on 7 medium-size datasets.
Real world datasets lead to somewhat different results: CESNA performs
rather poorly; SEFNAC wins three times regarding ARI and two times regarding
NMI, and SIAN, two times regarding ARI and three times regarding NMI (see
Table 4).
Here, we chose the data normalization method leading, on average, to the
larger ARI values. Specifically, we used z-scoring for normalizing features in
the Lawyers, HVR and COSN data sets. The best results on the World Trade
and Parliament data sets are obtained with no normalization. The
network data in Lawyers and HVR are normalized by applying the modularity
transformation [15]. The network data of COSN are normalized by shifting all the
similarities by the average link value [13].
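The three normalizations mentioned here can be sketched as below. The modularity transformation is written in the usual form p_ij − p_i+ p_+j / p_++, where p_i+, p_+j and p_++ are the row, column and grand totals; the uniform shift subtracts the mean over all entries, diagonal included, and the z-scoring uses the population standard deviation. These details and the function names are assumptions, since the text only names the methods.

```python
from statistics import mean, pstdev


def z_score_columns(X):
    """Standardize each feature column to zero mean and unit standard deviation."""
    columns = list(zip(*X))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # constant columns stay at zero
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in X]


def modularity_transform(P):
    """Subtract the random-interaction term: p_ij - p_i+ * p_+j / p_++."""
    row_sums = [sum(row) for row in P]
    col_sums = [sum(col) for col in zip(*P)]
    total = sum(row_sums)
    return [[P[i][j] - row_sums[i] * col_sums[j] / total
             for j in range(len(P[i]))]
            for i in range(len(P))]


def uniform_shift(P):
    """Subtract the average link value from every entry."""
    average = sum(map(sum, P)) / (len(P) * len(P[0]))
    return [[p - average for p in row] for row in P]


P = [[0.0, 1.0, 1.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]  # toy link matrix
B = modularity_transform(P)
shifted = uniform_shift(P)
Z = z_score_columns([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
```

After the modularity transformation every row sums to zero, and after the uniform shift the entries average to zero, so in both cases within-community links stand out as positive values against a negative background.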
Table 4. Comparison of CESNA, SIAN and SEFNAC on real-world data sets; average
values of ARI and NMI and their standard deviations (std, in parentheses) are
presented over 10 random initialisations. The best results are shown in bold-face.

               CESNA                   SIAN                    SEFNAC
Data set       ARI         NMI         ARI         NMI         ARI         NMI
HVR6           0.20(0.00)  0.37(0.00)  0.39(0.29)  0.39(0.22)  0.45(0.14)  0.62(0.05)
Lawyers        0.28(0.00)  0.48(0.00)  0.59(0.04)  0.71(0.04)  0.63(0.06)  0.65(0.05)
World Trade    0.23(0.00)  0.59(0.00)  0.55(0.07)  0.77(0.03)  0.23(0.03)  0.58(0.04)
Parliament     0.25(0.00)  0.52(0.00)  0.79(0.12)  0.82(0.07)  0.28(0.01)  0.47(0.01)
COSN           0.44(0.00)  0.45(0.00)  0.43(0.05)  0.61(0.03)  0.50(0.11)  0.64(0.06)
5 Conclusion
This paper proposes a novel combined data recovery criterion for the problem
of detecting communities in a feature-rich network. Our algorithm SEFNAC
(Sequential Extraction of Feature-Rich Network Addition Clusters) extracts
clusters one by one. Our approach is more or less universal regarding the scales
of the data available. On the other hand, SEFNAC results may depend on data
normalization.
We experimentally show that SEFNAC is competitive on both synthetic
and real-world data sets against two popular state-of-the-art algorithms, CESNA
[21] and SIAN [16].
Possible directions for future work:
References
1. Bojchevski, A., Günnemann, S.: Bayesian robust attributed graph clustering: joint
learning of partial anomalies and group structure. In: Thirty-Second AAAI Con-
ference on Artificial Intelligence (2018)
2. Chiang, M.M.T., Mirkin, B.: Intelligent choice of the number of clusters in k-means
clustering: an experimental study with different cluster spreads. J. Classif. 27(1),
3–40 (2010)
3. Chunaev, P.: Community detection in node-attributed social networks: a survey
(2019). arXiv preprint arXiv:1912.09816
4. Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley, New York
(2012)
5. Cross, R.L., Parker, A.: The Hidden Power of Social Networks: Understanding How
Work Really Gets Done in Organizations. Harvard Business Press, Boston (2004)
6. Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2(1), 193–218 (1985)
7. He, D., Jin, D., Chen, Z., Zhang, W.: Identification of hybrid node and link com-
munities in complex networks. Nat. Sci. Rep. 5, 8638 (2015)
8. Interdonato, R., Atzmueller, M., Gaito, S., Kanawati, R., Largeron, C., Sala, A.:
Feature-rich networks: going beyond complex network topologies. Appl. Netw. Sci.
4, 4:1–4:13 (2019)
9. Larremore, D.B., Clauset, A., Buckee, C.O.: A network approach to analyzing
highly recombinant malaria parasite genes. PLoS Comput. Biol. 9(10), e1003268
(2013)
10. Lazega, E.: The Collegial Phenomenon: The Social Mechanisms of Cooperation
Among Peers in a Corporate Law Partnership. Oxford University Press, Oxford
(2001)
11. Leskovec, J., Sosič, R.: SNAP: a general-purpose network analysis and graph-
mining library. ACM Trans. Intell. Syst. Technol. (TIST) 8-1, 1 (2016). CESNA
on GitHub: https://github.com/snap-stanford/snap/tree/master/examples/cesna
12. Mirkin, B., Nascimento, S.: Additive spectral method for fuzzy cluster analysis
of similarity data including community structure and affinity matrices. Inf. Sci.
183(1), 16–34 (2012)
13. Mirkin, B.: Clustering: A Data Recovery Approach, 1st edn. (2005); 2d edn. (2012).
CRC Press, Routledge (2005; 2012)
14. Nature Communications. https://www.nature.com/articles/ncomms11863
15. Newman, M.E.: Modularity and community structure in networks. Proc. Nat.
Acad. Sci. 103(23), 8577–8582 (2006)
16. Newman, M.E., Clauset, A.: Structure and inference in annotated networks. Nat.
Commun. 7, 11863 (2016)
17. De Nooy, W., Mrvar, A., Batagelj, V.: Exploratory Social Network Analysis with
Pajek, chap. 2. Cambridge University Press, Cambridge (2004)
18. Stanley, N., Bonacci, T., Kwitt, R., Niethammer, M., Mucha, P.J.: Stochastic block
models with multiple continuous attributes. Appl. Netw. Sci. 4(1), 1–22 (2019)
19. Snijders, T.: The Siena webpage. https://www.stats.ox.ac.uk/~snijders/siena/Lazega_lawyers_data.htm
20. Xu, Z., Ke, Y., Wang, Y., Cheng, H., Cheng, J.: A model-based approach to
attributed graph clustering. In: Proceedings of the 2012 ACM SIGMOD Inter-
national Conference on Management of Data, pp. 505–516. ACM (2012)
21. Yang, J., McAuley, J., Leskovec, J.: Community detection in networks with node
attributes. In: 2013 IEEE 13th International Conference on Data Mining,
pp. 1151–1156. IEEE (2013). https://arxiv.org/pdf/1401.7267.pdf. Accessed 22 Nov 2019
Efficient Community Detection
by Exploiting Structural Properties
of Real-World User-Item Graphs
1 Introduction
Numerous real-world Internet applications generate large amounts of data con-
sisting of “user-item” interactions. For example, in video streaming services such
as YouTube and Netflix, users interact with the videos by watching or rating
them; in an online shopping application such as Amazon, users interact with the
items by viewing or purchasing them; in a social network application such as
Twitter, the follower/following relation can also be modelled as user-item inter-
actions. It is an important problem to detect the underlying community structure
of these user-item interactions. More precisely, we want to detect a set of users
that share a common interest and, as a result, tend to interact with a common
set of items. Being able to identify such communities is essential for function-
alities such as item recommendation and discovering similar-minded users. The
data of user-item interactions can be modelled as a bipartite user-item graph
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 943, pp. 15–26, 2021.
https://doi.org/10.1007/978-3-030-65347-7_2
16 L. Y. Zhang and P. Marbach
and the task of detecting a community involves identifying a user set along with
an item set, where the user set consists of users who share a common interest,
and the item set consists of items that are representative of the shared interest.
The community detection problem has traditionally been modelled as a prob-
lem of finding clusters of densely connected vertices (e.g., cliques and quasi-
cliques) in a graph. There has been a rich literature on graph clustering algo-
rithms. Compared to most of the existing works, a key difference in the design of
the proposed algorithm is that it attempts to exploit the special structural prop-
erties that are inherent in real-world social graphs. That is, rather than being
a generic clustering algorithm for arbitrary graphs, the proposed algorithm is
designed to be efficient for a particular subset of graphs – user-item graphs that
are formed by human agents interacting with items of their interest. We focus
our effort on detecting “interest-based communities”, i.e., groups of individuals
that share a common interest. The structural property that the algorithm
takes into consideration is the “core-periphery” structure, i.e., the existence of
“core users” whose interests are largely identical to the common interest of the
community. By specializing in detecting such core users, the proposed algorithm
is able to effectively detect all communities in the graph by examining a much
smaller search space than that of an exhaustive generic clustering algorithm,
and therefore achieves high efficiency in terms of the time needed to
process the entire graph. More precisely, the proposed algorithm achieves a time
complexity that is guaranteed to be at most quadratic (in terms of the number
of vertices) and is in practice close to linear on certain datasets. Our
experiments (presented in Sect. 6) show that the actual runtime of the proposed
algorithm on large graphs is several orders of magnitude shorter than that of
existing community detection algorithms for bipartite graphs. It also effectively
detects the vast majority of communities that can be detected by other algorithms.
In summary, the contributions of this paper are to (1) propose the design of
a low-complexity community detection algorithm that is tailored for real-world
user-item graphs by exploiting assumptions on the structural properties of the
graph, (2) provide experimental results on real-world datasets to investigate the
efficiency and accuracy of the algorithm, and (3) demonstrate the effectiveness
of the algorithm-design method that specializes the algorithm to take advantage
of structures in real-world complex networks.
2 Related Work
spectrum or modularity, therefore do not scale well when the size of the input
graph becomes substantially large. Algorithms based on the detection of
quasi-cliques/bicliques [1,7,12,14,16,19] are the state of the art. The best known
runtime of quasi-clique algorithms is $O(|V|^3)$ (e.g., in [12]) for processing the
entire graph, where $|V|$ is the number of vertices in the graph.
The algorithm design in this paper is based on efficiently taking advantage of
the core-periphery structure [6,8,13,15,17,20,21] in real-world social graphs, i.e.,
the existence of densely connected subgraphs surrounded by sparsely connected
periphery nodes. Communities in social graphs exhibit internal core-periphery
structure. In [17] it is explained that the core component can be viewed as
users who are highly identical in their interests and the peripheries are the users
whose interests partially overlap with the core. A key idea behind the proposed
algorithm in this paper is to detect communities by identifying their core users.
In this section, we provide the main intuition behind the algorithm that we
propose, i.e., we discuss the structural properties of communities in the user-item
graphs that we use to define our algorithm. For this, we focus on the so-called
“interest-based” community which is given by a group of individuals that share
a common interest. We refer to the common interest of the community as the
core interest of the community. Note that the communities in a user-item graph
are interest-based communities, i.e., a community consists of a set of users that
have the same interest in the sense they are interested in the same type of items.
Interest-based communities have been extensively studied in the litera-
ture [11,13,18,21]. An important result is the existence of the core-periphery
structure [6,8,13,15,17,20,21]. That is, for a given community, there exists a
set of core users whose interests are largely identical to the core interest of the
community, and there exist peripheral users who have some overlap with the
core interest but the overlap is partial. This result is important as it implies
that we can identify a community as a tuple consisting of (a) the core interest
and (b) the core users of the community. Moreover, given the core interest of
a community, we can identify the core users by selecting the set of users that,
not only share the core interest, but also have the core interest as their main
interest. Similarly, given a set of core users, we can identify the core interest of
the community by finding the set of topics that all (or a large portion of) core
users are interested in while users outside the core are only partially interested
in.
In a user-item graph, an interest-based community can be identified by a
tuple consisting of (a) the core items and (b) the core users of the community.
The core items of the community are the items that represent the core interest
of the community. This relationship between the core items and the core users is
the key property that we use to design our algorithm. To highlight this, we use
the terms “popularity” and “typicality” to describe this property. More precise
definitions of these terms can be found in Sect. 4.
4 Model
A user-item graph is an undirected bipartite graph in which the vertices are
divided into two sets, the user set and the item set. Let U denote the set of all
user vertices and I be the set of all item vertices. An edge (u, i), u ∈ U, i ∈ I
exists if user u interacts with/accesses item i. The bipartite user-item graph is
a natural model for a wide range of real-life applications. For example, in video
streaming services such as YouTube and Netflix, the user-video viewing/rating
relationship can be modelled by a user-item graph.
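As an illustration of this model, the following minimal sketch builds the two adjacency maps of a user-item graph from a list of (user, item) interactions. The function name `build_user_item_graph` and the sample user and item identifiers are hypothetical and not taken from the paper; repeated interactions are collapsed into a single undirected edge.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Set, Tuple

Adj = Dict[Hashable, Set[Hashable]]  # vertex -> neighbours on the other side


def build_user_item_graph(
    interactions: Iterable[Tuple[Hashable, Hashable]]
) -> Tuple[Adj, Adj]:
    """Return (user -> items, item -> users) adjacency maps of the bipartite graph."""
    user_adj = defaultdict(set)
    item_adj = defaultdict(set)
    for user, item in interactions:
        # An undirected edge (u, i) is stored once on each side.
        user_adj[user].add(item)
        item_adj[item].add(user)
    return dict(user_adj), dict(item_adj)


# Hypothetical interactions; the duplicate ("alice", "m1") yields a single edge.
interactions = [("alice", "m1"), ("alice", "m2"), ("bob", "m1"), ("alice", "m1")]
user_adj, item_adj = build_user_item_graph(interactions)
```

Storing neighbours as sets keeps the set intersections used by the popularity and typicality constraints cheap.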
Based on the discussion in Sect. 3, an interest-based community is defined by
a set of core users who share a common interest identified by a set of core items.
More precisely, given the core users of a community, we say that the core items
of the community are the items that are (a) popular among the core users in the
sense that a core item is accessed/connected by a significant fraction of the core
users, and (b) typical among the core users in the sense that a significant fraction
of the accesses/edges of a core item come from within the core users instead of
from outside the core users. Similarly, given the core items of a community, we
say that the core users of the community are the users that are (a) popular among
the core items in the sense that a core user accesses a significant fraction of the
core items, and (b) typical among the core items in the sense that a significant
fraction of the accesses made by a core user go to within the core items instead
of outside the core items. Below are the formal definitions.
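Definition 3 A $(\bar{\delta}, \bar{\epsilon}, \bar{\rho}, \bar{\alpha})$-community consists of a user set
$U_C \subseteq U$ and an item set $I_C \subseteq I$ such that
$\forall u \in U_C:\ \delta(u, I_C) \ge \bar{\delta} \wedge \alpha(u, I_C) \ge \bar{\alpha}$ and
$\forall i \in I_C:\ \epsilon(i, U_C) \ge \bar{\epsilon} \wedge \rho(i, U_C) \ge \bar{\rho}$.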
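The informal description of popularity and typicality above can be sketched in code. The formal definitions of the two quantities are not reproduced in this text, so the ratios below are an assumption based on the prose: popularity is the fraction of the core set that a vertex is connected to, and typicality is the fraction of the vertex's own edges that fall inside the core set. All function names and the toy graph are hypothetical.

```python
from typing import Dict, Hashable, Set

Adj = Dict[Hashable, Set[Hashable]]  # vertex -> neighbours on the other side


def popularity(neighbours: Set[Hashable], core: Set[Hashable]) -> float:
    """Assumed definition: fraction of the core set the vertex is connected to."""
    if not core:
        return 0.0
    return len(neighbours & core) / len(core)


def typicality(neighbours: Set[Hashable], core: Set[Hashable]) -> float:
    """Assumed definition: fraction of the vertex's edges that stay inside the core set."""
    if not neighbours:
        return 0.0
    return len(neighbours & core) / len(neighbours)


def is_community(user_adj: Adj, item_adj: Adj,
                 users: Set[Hashable], items: Set[Hashable],
                 min_user_pop: float, min_user_typ: float,
                 min_item_pop: float, min_item_typ: float) -> bool:
    """Check the threshold conditions of Definition 3 for a candidate (users, items) pair."""
    users_ok = all(
        popularity(user_adj[u], items) >= min_user_pop
        and typicality(user_adj[u], items) >= min_user_typ
        for u in users
    )
    items_ok = all(
        popularity(item_adj[i], users) >= min_item_pop
        and typicality(item_adj[i], users) >= min_item_typ
        for i in items
    )
    return users_ok and items_ok


# Toy graph: u1 and u2 share items a and b; u3 touches b and c.
user_adj = {"u1": {"a", "b"}, "u2": {"a", "b"}, "u3": {"b", "c"}}
item_adj = {"a": {"u1", "u2"}, "b": {"u1", "u2", "u3"}, "c": {"u3"}}
```

In the toy graph, item b is fully popular among {u1, u2} but only two of its three edges come from that set, so raising the item typicality threshold above 2/3 rejects the candidate pair.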
5 Algorithm
Based on the above definitions, we devise a community detection algorithm that
detects $(\bar{\delta}, \bar{\epsilon}, \bar{\alpha}, \bar{\rho})$-communities with given thresholds.
Algorithm 1: DETECT-SINGLE-COMMUNITY
Input: U_C: an initial set of users; thresholds δ̄, ε̄, ᾱ, ρ̄
Output: (U_C, I_C): a pair of user and item sets that is the core of a
detected community, or NIL if a community is not detected
converged ← False
num_iterations ← 0
while not converged and num_iterations < max_iterations do
    U'_C ← copy of U_C
    I_C ← SELECT-ITEMS(U_C, ε̄, ρ̄)
    U_C ← SELECT-USERS(I_C, δ̄, ᾱ)
    if U_C = U'_C then
        converged ← True
    num_iterations ← num_iterations + 1
if converged then
    return (U_C, I_C)
else
    return NIL
Algorithm 1 shows the routine for detecting a single community core. Given
the user set, the SELECT-ITEMS subroutine returns the set of items that satisfy
the constraints on the item’s popularity and typicality with respect to the user
set. The SELECT-USERS subroutine works symmetrically.
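The routine in Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the popularity and typicality ratios follow the informal description in Sect. 4, a single hypothetical helper `select` plays the role of both SELECT-ITEMS and SELECT-USERS, an empty fixed point is treated as "no community", and the constant cap on the number of selected vertices mentioned in the complexity discussion is omitted.

```python
from typing import Dict, Hashable, Optional, Set, Tuple

Adj = Dict[Hashable, Set[Hashable]]  # vertex -> neighbours on the other side


def select(core: Set[Hashable], core_adj: Adj, other_adj: Adj,
           min_popularity: float, min_typicality: float) -> Set[Hashable]:
    """Pick vertices on the other side that are popular and typical w.r.t. `core`."""
    if not core:
        return set()
    # Only neighbours of the core can have non-zero popularity.
    candidates = set().union(*(core_adj[v] for v in core))
    selected = set()
    for c in candidates:
        inside = len(other_adj[c] & core)
        if (inside / len(core) >= min_popularity
                and inside / len(other_adj[c]) >= min_typicality):
            selected.add(c)
    return selected


def detect_single_community(
    user_adj: Adj, item_adj: Adj, seed_users: Set[Hashable],
    min_user_pop: float, min_user_typ: float,
    min_item_pop: float, min_item_typ: float,
    max_iterations: int = 20,
) -> Optional[Tuple[Set[Hashable], Set[Hashable]]]:
    """Alternate item and user selection until the user set stops changing."""
    users = set(seed_users)
    for _ in range(max_iterations):
        previous = set(users)
        items = select(users, user_adj, item_adj, min_item_pop, min_item_typ)
        users = select(items, item_adj, user_adj, min_user_pop, min_user_typ)
        if not users or not items:
            return None  # assumption: an empty fixed point is not a community
        if users == previous:
            return users, items
    return None  # did not converge within max_iterations


# Toy graph: u1 and u2 form a core around items a and b.
user_adj = {"u1": {"a", "b"}, "u2": {"a", "b"}, "u3": {"b", "c"}, "u4": {"c", "d"}}
item_adj: Adj = {}
for u, its in user_adj.items():
    for i in its:
        item_adj.setdefault(i, set()).add(u)

core = detect_single_community(user_adj, item_adj, item_adj["a"], 0.6, 0.5, 0.6, 0.5)
```

Starting from the users of item a, the iteration keeps items a and b, rejects user u3 (who accesses only half of the selected items), and converges after one round.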
Algorithm 2: DETECT-ALL-COMMUNITIES
Input: the user-item graph with user set U and item set I; thresholds δ̄, ε̄, ᾱ, ρ̄
Output: C: a set of communities, each being a pair of user and item sets
C ← ∅
foreach item i in I do
    U_i ← the set of users that are connected with i
    c ← DETECT-SINGLE-COMMUNITY(U_i, δ̄, ε̄, ᾱ, ρ̄)
    if c is not NIL then
        Add c to C
return C
uniform. In the average-case graph, let $d_I$ be the average number of (user)
neighbours of an item and $d_U$ be the average number of (item) neighbours of a user.
The number of users and items being selected in each iteration is upper-bounded
by a constant in our algorithm’s implementation. The number of candidate
items that need to be processed in SELECT-ITEMS is in $O(d_U)$. The set intersection
performed when computing the popularity/typicality is between a set of size
$d_U$ and a set of constant size. Therefore, the runtime of SELECT-ITEMS is overall
$O(d_U)$. Similarly, the overall runtime of SELECT-USERS is $O(d_I)$. The main loop
in Algorithm 1 typically terminates after a small constant number of iterations
(as will be shown experimentally in Sect. 6). Therefore, the overall runtime of
DETECT-SINGLE-COMMUNITY is in the order of $O(d_U + d_I)$. More precisely, letting
$|V| = |U| + |I|$ be the total number of vertices in the graph and treating $d_U$ and
$d_I$ as functions of $|V|$, the above runtime becomes $O(d_U(|V|) + d_I(|V|))$.
Algorithm 1 detects a single community core from an initial user set. To
detect all communities in a given user-item graph, theoretically, it would suffice
if we run DETECT-SINGLE-COMMUNITY on all possible subsets of the user set U of
the graph. However, it would lead to exponential runtime. The proposed algo-
rithm does the following: for each item, use its neighbouring user set as an initial
set to detect a community. In this way, we only need to run Algorithm 1 on |I| ini-
tial user sets. Based on the core-periphery structure of real-world interest-based
communities, it can be argued that only going through these initial sets would
be sufficient for detecting all the communities/interests in the graph: assuming
that each community contains a set of core users/items who are dedicated to the
interest of the community, it suffices to detect all the community cores in order
to detect all the communities. Moreover, because the core users/items are highly
focussed on the interest of the community, an iteration starting from the user set
of a core item is highly likely to converge to the core itself. This observation is
critical for reducing the overall runtime of processing the entire user-item graph.
Algorithm 2 is the pseudocode of the routine for detecting all communities.
The time complexity of Algorithm 2 is simply $|I|$ multiplied by the runtime of
Algorithm 1, i.e., $O(|I| \cdot (d_U(|V|) + d_I(|V|)))$. Let $d(|V|) = \max(d_U(|V|), d_I(|V|))$;
given that $|V| = |U| + |I|$, the overall runtime of Algorithm 2 becomes
$O(|V| \cdot d(|V|))$. Since the degree of a vertex is upper-bounded by $|V|$, the overall
runtime for detecting all communities is guaranteed to be in $O(|V|^2)$.
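The outer loop of Algorithm 2 can be sketched as follows. To keep the example self-contained, the single-community routine is passed in as a callable; `toy_detect` below is a hypothetical stand-in for DETECT-SINGLE-COMMUNITY and does not implement the popularity and typicality constraints. The sketch makes exactly one call per item, which is where the $|I|$ factor in the bound above comes from.

```python
from typing import Callable, Dict, Hashable, List, Optional, Set, Tuple

Community = Tuple[Set[Hashable], Set[Hashable]]
Adj = Dict[Hashable, Set[Hashable]]  # item -> set of users connected to it


def detect_all_communities(
    item_adj: Adj,
    detect_single: Callable[[Set[Hashable]], Optional[Community]],
) -> List[Community]:
    """Seed the single-community routine once per item and collect distinct cores."""
    found = {}
    for seed_users in item_adj.values():
        result = detect_single(seed_users)
        if result is None:
            continue
        users, items = result
        # Different seed items may converge to the same core; keep one copy.
        found[(frozenset(users), frozenset(items))] = (set(users), set(items))
    return list(found.values())


# Hypothetical stand-in for DETECT-SINGLE-COMMUNITY: a seed with at least two
# users is returned together with every item that all seed users share.
item_adj = {"a": {"u1", "u2"}, "b": {"u1", "u2"}, "c": {"u3"}}


def toy_detect(seed: Set[Hashable]) -> Optional[Community]:
    if len(seed) < 2:
        return None
    shared = {i for i, users in item_adj.items() if seed <= users}
    return set(seed), shared


communities = detect_all_communities(item_adj, toy_detect)
```

Items a and b seed the same core, so the result contains a single community even though three seeds were tried.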
6 Experimental Evaluation
We perform our experiments using two real-world datasets: the Netflix dataset
and the Yelp dataset. The Netflix dataset [4] consists of 100,480,507 ratings that
480,189 users gave to 17,770 movies, which can be modelled as a user-item graph
with each rating as a user-item edge (ignoring the value of the rating). The Yelp
Dataset [9] contains user reviews of businesses posted on Yelp across 10
metropolitan areas. We select the subgraph of one metropolitan area (Toronto,
Ontario). This results in a dataset with 148,570 users, 33,412 businesses, and
784,462 reviews. Compared to the Netflix dataset, the Yelp dataset is “sparser”
in the sense that the average degree of a vertex is much lower.
We compare the proposed algorithm (namely MUISI, standing for Mutual User-
Item Subset Iterations) with a state-of-the-art quasi-biclique algorithm in [12]
(named LIU hereinafter). The definition of a quasi-biclique in LIU is closely
related to MUISI—it is essentially the MUISI definition with only the popularity
constraints. In order for LIU to finish in a reasonable amount of time, we sampled
a subset of the Netflix data in the following way: pick 200 movies from a number
of known communities, then randomly add 100 additional movies as “noise”
items. Among all users that rated any of the 300 movies, we selected uniformly
at random a subset of 3000 users. The resulting graph contains 14,540 edges.
Table 2. Interest vector coverage between LIU and MUISI (Netflix data)

(a) Detection popularity threshold = 0.1
Recommendation popularity    0.1    0.2    0.3    0.4    0.5
MUISI cover LIU (%)         99.8   97.6   92.2   87.8   81.5
LIU cover MUISI (%)         96.8   95.4   94.9   94.3   93.2

(b) Detection popularity threshold = 0.5
Recommendation popularity    0.1    0.2    0.3    0.4    0.5
MUISI cover LIU (%)         99.8   97.6   92.2   87.8   81.5
LIU cover MUISI (%)         96.8   95.4   94.9   94.3   93.2

Table 3. Interest vector coverage between LIU and MUISI (Yelp data)

(a) Detection popularity threshold = 0.1
Recommendation popularity    0.1    0.2    0.3    0.4    0.5
MUISI cover LIU (%)         98.5   98.4   98.4   98.1   90.1
LIU cover MUISI (%)         91.3   91.3   89.9   89.8   84.6

(b) Detection popularity threshold = 0.5
Recommendation popularity    0.1    0.2    0.3    0.4    0.5
MUISI cover LIU (%)         99.9   99.4   99.1   99.1   99.0
LIU cover MUISI (%)         99.1   98.9   98.8   98.2   95.9

Table 2 shows the covering percentages (averaged over all users) under dif-
ferent detection and recommendation popularity. The mutual coverage between
MUISI communities and LIU communities is above 80% in most cases. The cov-
erage only dropped slightly below 80% when we apply a high recommendation
popularity (0.5) with communities detected using low popularity (0.1). Table 3
presents the same comparison using the Yelp dataset. The coverage is overall
higher than in the Netflix result, which suggests that the Yelp dataset
contains “clearer” communities. We conclude that the MUISI and LIU communities
achieve similar outcomes in content recommendation; in this sense, the
MUISI communities are equivalent to those detected by LIU.
We compare the time efficiency of the MUISI algorithm with the quasi-clique
algorithm LIU [12]. We generated sampled subgraphs of the Netflix dataset and
Yelp dataset of varying sizes as inputs and compared the time taken to pro-
cess the entire input. All experiments were run on a computer with 2.7 GHz
Intel Core i5 CPU, 8 GB 1867 MHz DDR3 RAM. As shown in Table 4, MUISI
has a significantly lower runtime and scales much better than the quasi-clique
algorithm as the input size increases.
We also evaluated the asymptotic runtime of the proposed algorithm using
both the Netflix and Yelp datasets sampled in varying sizes. For the sampling, we
randomly choose a subgraph with desired numbers of items and users such that
the item-vs-user ratio stays similar to that of the original graph. The result is
plotted in Fig. 1 with log-log scaled axes. The red line shows the slope of
quadratic growth for reference. For both datasets, the growth of the runtime is
either close to or slower than the quadratic growth rate. This asymptotic pattern
agrees with our theoretical analysis of the algorithm’s complexity.
We used the MUISI algorithm to process the complete Netflix dataset with a
popularity threshold of 0.2 and a typicality threshold of 0.02, and a total of
4,617 communities were detected. We are interested in the number of iterations
taken before each DETECT-SINGLE-COMMUNITY routine converges. Figure 2 is the
histogram that shows the distribution of this number. The distribution is bimodal
in the sense that it has two peak areas around 4 and 8. Our hypothesis is that,
when a seed item belongs to the core of some ground-truth community, the
iterations would converge very quickly (within 4 iterations); whereas when the
item does not belong to a community core, it would take a longer time (around
8 iterations). We verified this hypothesis by reviewing the output and dividing
the iterations according to whether the seed item ends up in a community core.
As a result, the average number of iterations from a core item is 3.2 while the
average for non-core items is 6.8, which agrees with our hypothesis.
Another hypothesis to verify is that, when starting from a core item of a
community, the iterations are highly likely to converge to the core it is from.
This hypothesis would imply that, as long as a community core exists, it is
highly likely to be detected by the algorithm. We re-ran the iterations from the
core items and, for each community core, recorded the resulting item set with the
maximum overlap with the starting community core. The overlap between two
sets $A$ and $B$ is calculated as $|A \cap B| / |A \cup B|$. Figure 3 shows the
distribution of the maximum overlap. It is evident that the majority of community
cores have items that lead to iterations converging to the cores themselves.
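The overlap measure used here is the Jaccard index of two sets. A minimal sketch follows; the helper `max_overlap` and the convention that two empty sets have overlap 1.0 are assumptions made for illustration, not details given in the paper.

```python
from typing import Hashable, Iterable, Set


def overlap(a: Set[Hashable], b: Set[Hashable]) -> float:
    """Overlap |A ∩ B| / |A ∪ B| between two sets."""
    union = a | b
    if not union:
        return 1.0  # assumption: two empty sets are treated as identical
    return len(a & b) / len(union)


def max_overlap(core_items: Set[Hashable],
                results: Iterable[Set[Hashable]]) -> float:
    """Best overlap between a community core and the item sets reached from its items."""
    return max((overlap(core_items, r) for r in results), default=0.0)


# Hypothetical core {1, 2, 3} and the item sets reached from three of its items.
best = max_overlap({1, 2, 3}, [{1}, {2, 3, 4}, {1, 2, 3}])
```

A maximum overlap of 1.0 means that at least one core item leads the iteration back to exactly the core it came from.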
Fig. 3. Histogram of the maximum overlap between a community core’s item set and
the resulting item set of iterations starting from one of its core items.
7 Conclusions
Algorithms that take advantage of the inherent structures of real-world complex
networks can be surprisingly simple and efficient. Real-life social graphs, being
generated by human agents that are motivated by specific objectives, naturally
exhibit structural properties such as the core-periphery structure. The key idea
behind the design of the proposed algorithm is to achieve performance by spe-
cializing the algorithm for a targeted category of inputs with such structures.
The proposed algorithm achieves a significant improvement in the time complexity
of processing the entire input graph, reducing it to at most quadratic time in
the number of vertices. In addition to the popularity constraint that is commonly
used in existing methods, the proposed algorithm applies a second constraint on
typicality when detecting communities. In our experiments, the algorithm
demonstrates a runtime that is several orders of magnitude shorter than that of
state-of-the-art quasi-clique based algorithms; at the same time, it detects
communities that are equivalent to those detectable by other algorithms. These
qualities make the proposed algorithm highly practical for real-world
applications with large-scale graphs.
References
1. Abello, J., Resende, M.G., Sudarsky, S.: Massive quasi-clique detection. In: Latin
American Symposium on Theoretical Informatics, pp. 598–612. Springer (2002)
2. Barber, M.J.: Modularity and community detection in bipartite networks. Phys.
Rev. E 76(6), 066102 (2007)
3. Beckett, S.J.: Improved community detection in weighted bipartite networks. R.
Soc. Open Sci. 3(1), 140536 (2016)
4. Bennett, J., Lanning, S., et al.: The netflix prize. In: Proceedings of KDD Cup and
Workshop, New York, vol. 2007, p. 35 (2007)
5. Bokde, D., Girase, S., Mukhopadhyay, D.: Matrix factorization model in collabo-
rative filtering algorithms: A survey. Proc. Comput. Sci. 49, 136–146 (2015)
6. Borgatti, S.P., Everett, M.G.: Models of core/periphery structures. Soc. Netw.
21(4), 375–395 (2000)
7. Brunato, M., Hoos, H.H., Battiti, R.: On effectively finding maximal quasi-cliques
in graphs. In: International Conference on Learning and Intelligent Optimization,
pp. 41–55. Springer (2007)
8. Csermely, P., London, A., Wu, L.Y., Uzzi, B.: Structure and dynamics of
core/periphery networks. J. Complex Netw. 1(2), 93–123 (2013)
9. The Yelp Dataset. https://www.yelp.ca/academic_dataset. Accessed July 2018
10. Fortunato, S.: Community detection in graphs. Phys. Rep. 486(3–5), 75–174 (2010)
11. Li, X., Guo, L., Zhao, Y.E.: Tag-based social interest discovery. In: Proceedings of
the 17th International Conference on World Wide Web, pp. 675–684. ACM (2008)
12. Liu, X., Li, J., Wang, L.: Modeling protein interacting groups by quasi-bicliques:
complexity, algorithm, and application. IEEE/ACM Trans. Comput. Biol. Bioinf.
(TCBB) 7(2), 354–364 (2010)
13. Marbach, P.: The structure of communities in information networks. In: Informa-
tion Theory and Applications Workshop (ITA), 2016, pp. 1–6. IEEE (2016)
14. Pattillo, J., Veremyev, A., Butenko, S., Boginski, V.: On the maximum quasi-clique
problem. Discrete Appl. Math. 161(1–2), 244–257 (2013)
15. Rombach, M.P., Porter, M.A., Fowler, J.H., Mucha, P.J.: Core-periphery structure
in networks. SIAM J. Appl. Math. 74(1), 167–190 (2014)
16. Sim, K., Li, J., Gopalkrishnan, V., Liu, G.: Mining maximal quasi-bicliques: novel
algorithm and applications in the stock market and protein networks. Statistical
Anal. Data Min. ASA Data Sci. J. 2(4), 255–273 (2009)
17. Yang, J., Leskovec, J.: Overlapping communities explain core-periphery organiza-
tion of networks. Proc. IEEE 102(12), 1892–1902 (2014)
18. Yang, J., Leskovec, J.: Defining and evaluating network communities based on
ground-truth. Knowl. Inf. Syst. 42(1), 181–213 (2015)
19. Zeng, Z., Wang, J., Zhou, L., Karypis, G.: Coherent closed quasi-clique discovery
from large dense graph databases. In: Proceedings of the 12th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, pp. 797–802.
ACM (2006)
20. Zhang, J., Ackerman, M.S., Adamic, L.: Expertise networks in online communities:
structure and algorithms. In: Proceedings of the 16th International Conference on
World Wide Web, pp. 221–230. ACM (2007)
21. Zhang, L.Y., Marbach, P.: Stable and efficient structures for the content production
and consumption in information communities. In: Game Theory for Networking
Applications, pp. 163–173. Springer (2019)
Measuring Proximity in Attributed
Networks for Community Detection
1 Introduction
Many real-world systems from the fields of social science, economy, biology,
chemistry, etc. can be represented as networks or graphs¹ [7]. A network consists of
nodes representing objects, connected by edges representing relations between
the objects. Nodes can often be divided into groups called clusters or communi-
ties. Members of such a cluster are more densely connected to each other than
to the nodes outside the cluster.
The task of finding such groups is called clustering or community detection.
There have been plenty of algorithms proposed by researchers in the past to
address this problem.
Some of the community detection algorithms require the introduction of a
distance or proximity measure on the set of graph nodes: a function that shows,
respectively, the distance or proximity (similarity) between a pair of nodes. Only
¹ Formally, a graph is a mathematical representation of a network. However, hereinafter,
the terms “graph” and “network” will be used interchangeably.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 943, pp. 27–37, 2021.
https://doi.org/10.1007/978-3-030-65347-7_3
28 R. Aynulin and P. Chebotarev
the shortest path distance had been studied for a long time. Nowadays, we have
a surprising variety of measures on the set of graph nodes [9, Chap. 15]. Some
of the proximity measures can be defined as kernels on graphs, i.e., symmetric
positive semidefinite matrices [1].
Previously, kernels have been applied mainly to analyze networks without
attributes. However, in many networks, nodes are associated with attributes
that describe them in some way. Thereby, multiple dimensions of information
can be available: a structural dimension representing relations between objects,
a compositional dimension describing attributes of particular objects, and an
affiliation dimension representing the community structure [3]. Combining infor-
mation about relations between nodes and their attributes provides a deeper
understanding of the network structure.
Many methods for community detection in attributed networks have been
proposed recently. Surveys [3,6] describe existing approaches to this problem. We
provide some information on this in Sect. 2. However, kernel-based clustering, as
already noted, has not yet been applied to attributed networks.
In this paper, we extend the definition of a number of previously defined
proximity measures to the case of networks with node attributes. Several sim-
ilarity measures on attributes are used for this purpose. Then, we apply the
obtained proximity measures to the problem of community detection in several
real-world datasets.
According to the results of our experiments, taking both node attributes and
node relations into account can improve the efficiency of clustering in comparison
with clustering based on attributes only or on structural data only. Also, the
most effective attribute similarity measures in our experiments are the Cosine
Similarity and Extended Jaccard Similarity.
2 Related Work
This section is divided into two parts. In the first one, we provide a quick overview
of papers where various measures on the set of graph nodes are discussed. Then, we
introduce a few studies focused on community detection in attributed networks.
For a long time, only the shortest path distance was widely used [10].
[9, Chap. 15] provides a survey of dozens of measures that have been proposed
in various studies in the last decades. Among them are the physics-inspired
Resistance (also known as Electric) measure [31], the logarithmic Walk measure
discussed in [5], the Forest measure related to Resistance [4], and many others.
In [1], the authors analytically study properties of various proximity
measures² and kernels on graphs, including Walk, Communicability, Heat,
PageRank, and several logarithmic measures. Then, these measures are compared in
the context of spectral clustering on the stochastic block model. [12] provides
a survey and numerical comparison of nine kernels on graphs in application to
link prediction and clustering problems.
² Here, we use the term “proximity measure” in a broader sense and, unlike [1], do
not require a proximity measure to satisfy the triangle inequality for proximities.
$c_{ij}$, which is the cost of following this edge. If the cost does not appear
naturally, it can be defined as $c_{ij} = \frac{1}{a_{ij}}$. The cost matrix $C$
contains the costs of all the edges.
The degree of a node is the sum of the weights of the edges linked to the
node. The diagonal degree matrix $D = \operatorname{diag}(A \cdot \mathbf{1})$ shows the
degrees of all the nodes in the graph ($\mathbf{1} = (1, \ldots, 1)^T$). Given $A$ and $D$,
the Laplacian matrix is defined as $L = D - A$, and the Markov matrix is $P = D^{-1} A$.
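These definitions can be illustrated with a small dependency-free sketch that represents matrices as nested lists (a numerical library would normally be used instead). The function names are hypothetical. The treatment of missing edges in the cost matrix is not specified in the text; assigning them infinite cost is an assumption made here.

```python
import math
from typing import List

Matrix = List[List[float]]


def degree_matrix(A: Matrix) -> Matrix:
    """D = diag(A · 1): row sums of A on the diagonal."""
    n = len(A)
    return [[float(sum(A[i])) if i == j else 0.0 for j in range(n)] for i in range(n)]


def laplacian(A: Matrix) -> Matrix:
    """L = D - A."""
    n = len(A)
    D = degree_matrix(A)
    return [[D[i][j] - A[i][j] for j in range(n)] for i in range(n)]


def markov(A: Matrix) -> Matrix:
    """P = D^{-1} A: each row of A divided by its degree (assumes no isolated nodes)."""
    return [[a / sum(row) for a in row] for row in A]


def cost_matrix(A: Matrix) -> Matrix:
    """c_ij = 1 / a_ij on edges; non-edges get infinite cost (assumption)."""
    return [[1.0 / a if a > 0 else math.inf for a in row] for row in A]


# Unweighted path graph 0 - 1 - 2.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
L = laplacian(A)
P = markov(A)
C = cost_matrix(A)
```

For the path graph, every row of $L$ sums to zero and every row of $P$ sums to one, as expected for a Laplacian and a row-stochastic matrix.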
A measure on the set of graph nodes is a function $\kappa$ that characterizes
proximity or similarity between pairs of graph nodes. A kernel on a graph is a
similarity measure that has a Gram matrix (symmetric positive semidefinite
matrix) representation $K$. Given $K$, the corresponding distance matrix $\Delta$ can
be obtained from the equation
$$K = -\frac{1}{2} H \Delta H, \tag{1}$$
where $H = I - \frac{1}{n} \mathbf{1} \cdot \mathbf{1}^T$.
For more details about graph measures and kernels, we refer to [1].
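Transformation (1) can be sketched without forming $H$ explicitly, because multiplying by $H$ on both sides amounts to double centering: $(H \Delta H)_{ij} = \Delta_{ij} - \bar{\Delta}_{i\cdot} - \bar{\Delta}_{\cdot j} + \bar{\Delta}_{\cdot\cdot}$. The sketch below uses nested lists and hypothetical function names. The reverse direction, $\Delta_{ij} = K_{ii} + K_{jj} - 2K_{ij}$, is the standard inversion for a symmetric $\Delta$ with zero diagonal; it is not stated in the text and is included as an assumption.

```python
from typing import List

Matrix = List[List[float]]


def kernel_from_distance(delta: Matrix) -> Matrix:
    """K = -1/2 · H Δ H, computed by double centering of Δ."""
    n = len(delta)
    row_mean = [sum(row) / n for row in delta]
    col_mean = [sum(delta[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_mean) / n
    return [[-0.5 * (delta[i][j] - row_mean[i] - col_mean[j] + total_mean)
             for j in range(n)] for i in range(n)]


def distance_from_kernel(K: Matrix) -> Matrix:
    """Δ_ij = K_ii + K_jj - 2 K_ij (assumed inverse; needs symmetric Δ with zero diagonal)."""
    n = len(K)
    return [[K[i][i] + K[j][j] - 2.0 * K[i][j] for j in range(n)] for i in range(n)]


# Squared distances between the points 0, 1 and 3 on a line.
delta = [[0.0, 1.0, 9.0],
         [1.0, 0.0, 4.0],
         [9.0, 4.0, 0.0]]
K = kernel_from_distance(delta)
delta_back = distance_from_kernel(K)
```

For this example $K$ is the Gram matrix of the centered coordinates $(-4/3, -1/3, 5/3)$, and applying the inverse recovers the original distance matrix.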
Spectral. In this paper, we use the variation of the Spectral algorithm pre-
sented by Shi and Malik in [32]. The approach is based on applying the k-means
algorithm to the eigenvectors of the Laplacian matrix of the graph. For a detailed
review of the mathematics behind the Spectral algorithm, we refer to the tutorial
by Ulrike von Luxburg [22].
3.3 Measures
In this study, we consider five measures which have shown a good efficiency in
[1,33].
Communicability. $K^{C} = \sum_{n=0}^{\infty} \frac{\alpha^n A^n}{n!} = \exp(\alpha A)$, $\alpha > 0$ [11,13].
Heat. $K^{H} = \sum_{n=0}^{\infty} \frac{\alpha^n (-L)^n}{n!} = \exp(-\alpha L)$, $\alpha > 0$ [19].
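Both kernels are matrix exponentials, so they can be approximated by truncating the power series above. The following dependency-free sketch does exactly that on nested lists; the function names and the truncation length are hypothetical choices, and for real graphs a dedicated matrix-exponential routine from a numerical library would be preferable.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(X: Matrix, Y: Matrix) -> Matrix:
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def mat_exp(M: Matrix, terms: int = 40) -> Matrix:
    """Truncated series sum_{k < terms} M^k / k! (adequate for small, well-scaled M)."""
    n = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]  # current term M^k / k!, starting at k = 0
    for k in range(1, terms):
        term = [[x / k for x in row] for row in matmul(term, M)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result


def communicability_kernel(A: Matrix, alpha: float) -> Matrix:
    """K^C = exp(alpha · A)."""
    return mat_exp([[alpha * a for a in row] for row in A])


def heat_kernel(A: Matrix, alpha: float) -> Matrix:
    """K^H = exp(-alpha · L) with L = D - A."""
    n = len(A)
    L = [[(sum(A[i]) if i == j else 0.0) - A[i][j] for j in range(n)] for i in range(n)]
    return mat_exp([[-alpha * x for x in row] for row in L])


# Single-edge graph, where both kernels have closed forms.
A = [[0.0, 1.0],
     [1.0, 0.0]]
K_C = communicability_kernel(A, 1.0)  # [[cosh 1, sinh 1], [sinh 1, cosh 1]]
K_H = heat_kernel(A, 1.0)             # diagonal (1 + e^-2) / 2, off-diagonal (1 - e^-2) / 2
```

On the single-edge graph the results can be checked against the closed forms noted in the comments.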
Free Energy. Given $P$, $C$ and the parameter $\alpha$, the matrix $W$ can be defined
as $W = \exp(-\alpha C) \circ P$ (the “$\circ$” symbol stands for element-wise multiplication).
Then, $Z = (I - W)^{-1}$ and $S = (Z(C \circ W)) \div Z$ (the “$\div$” symbol stands for
element-wise division). Finally, $\Delta^{FE} = \frac{\Phi + \Phi^T}{2}$, where
$\Phi = \frac{\log(Z)}{\alpha}$. $K^{FE}$ can be obtained from $\Delta^{FE}$ using
transformation (1) [18].
– Matching Coefficient³ [34]: $s_{MC}(f_i, f_j) = \frac{\sum_{k=1}^{d} \mathbb{1}(f_{ik} = f_{jk})}{d}$,
where $\mathbb{1}(x)$ is the indicator function which takes the value of one if the
condition $x$ is true and zero otherwise;
– Cosine Similarity [35, Chap. 2]: $s_{CS}(f_i, f_j) = \frac{f_i \cdot f_j}{\|f_i\|_2 \, \|f_j\|_2}$;
– Extended Jaccard Similarity [35, Chap. 2]: $s_{JS}(f_i, f_j) = \frac{f_i \cdot f_j}{\|f_i\|_2^2 + \|f_j\|_2^2 - f_i \cdot f_j}$;
– Manhattan Similarity [8]: $s_{MS}(f_i, f_j) = \frac{1}{1 + \|f_i - f_j\|_1}$;
– Euclidean Similarity [8]: $s_{ES}(f_i, f_j) = \frac{1}{1 + \|f_i - f_j\|_2}$.
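The five attribute similarity measures listed above can be written down directly. The sketch below is a minimal illustration on plain Python sequences with hypothetical function names. Returning 0.0 when a denominator is zero (for example, for all-zero feature vectors) is an assumption, since the formulas leave that case undefined.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def dot(fi: Vector, fj: Vector) -> float:
    return sum(a * b for a, b in zip(fi, fj))


def matching_coefficient(fi: Vector, fj: Vector) -> float:
    """Share of positions at which the two attribute vectors agree."""
    return sum(a == b for a, b in zip(fi, fj)) / len(fi)


def cosine_similarity(fi: Vector, fj: Vector) -> float:
    denom = math.sqrt(dot(fi, fi)) * math.sqrt(dot(fj, fj))
    return dot(fi, fj) / denom if denom else 0.0  # assumption for zero vectors


def extended_jaccard_similarity(fi: Vector, fj: Vector) -> float:
    denom = dot(fi, fi) + dot(fj, fj) - dot(fi, fj)
    return dot(fi, fj) / denom if denom else 0.0  # assumption for zero vectors


def manhattan_similarity(fi: Vector, fj: Vector) -> float:
    return 1.0 / (1.0 + sum(abs(a - b) for a, b in zip(fi, fj)))


def euclidean_similarity(fi: Vector, fj: Vector) -> float:
    return 1.0 / (1.0 + math.sqrt(sum((a - b) ** 2 for a, b in zip(fi, fj))))


# Two binary feature vectors that agree on the first attribute only.
f1 = [1, 0, 1]
f2 = [1, 1, 0]
```

For `f1` and `f2` the dot product is 1 and both squared norms are 2, so the cosine similarity is 1/2 and the extended Jaccard similarity is 1/3.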
5 Experiments
– WebKB [21]: a dataset of university web pages. Each web page is classified
into one of five classes: course, faculty, student, project, staff. Each node
is associated with a binary feature vector (d = 1703) describing presence or
absence of words from the dictionary. This dataset consists of four unweighted
graphs: Washington (n = 230, m = 446), Wisconsin (n = 265, m = 530),
Cornell (n = 195, m = 304), and Texas (n = 187, m = 328).
– CiteSeer [30]: an unweighted citation graph of scientific papers. The dataset
contains 3312 nodes and 4732 edges. Each paper in the graph is classified into
one of six classes (the topic of the paper) and associated with a binary vector
(d = 3703) describing the presence of words from the dictionary.
– Cora [30]: an unweighted citation graph of scientific papers with a structure
similar to the CiteSeer graph. The number of nodes: n = 2708, the number
of edges: m = 5429, the number of classes: c = 7, and the number of words
in the dictionary (the length of the feature vector): d = 1433.
These datasets are clustered using multiple methods. First, we apply the
k-means algorithm, which uses only attribute information and ignores graph
structure. Then, each dataset is clustered with the Spectral algorithm and five
plain proximity measures that do not use attribute information. Finally, commu-
nities are detected using the Spectral algorithm and attribute-aware proximity
measures that employ both data dimensions (structure and attributes).
We use balanced versions of attribute similarity measures with β = 1/2 in (2).
Each of the proximity measures depends on a parameter, so in the experiments we
search for its optimal value, and the results report clustering quality for that
optimal parameter.
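The parameter search described above can be sketched as follows, using the Adjusted Rand Index (ARI) as the quality score. This is a minimal illustration rather than the authors' code: `cluster_fn` is a hypothetical callable that returns cluster labels for a given parameter value, and the toy clusterer at the end is invented for the example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    # ARI computed from the contingency table; assumes at least two items.
    n = len(labels_true)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Degenerate case (e.g. both partitions have a single cluster).
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def best_parameter(candidates, cluster_fn, labels_true):
    # Try every candidate parameter and keep the one with the highest ARI.
    scored = [(adjusted_rand_index(labels_true, cluster_fn(p)), p) for p in candidates]
    best_score, best_value = max(scored)
    return best_value, best_score


# Hypothetical clusterer: only the parameter 0.5 recovers the true partition.
truth = [0, 0, 1, 1]
toy_cluster_fn = lambda p: [1, 1, 0, 0] if p == 0.5 else [0, 1, 0, 1]
print(best_parameter([0.1, 0.5, 0.9], toy_cluster_fn, truth))
```

Note that ARI is invariant to relabeling of clusters, so the swapped labels returned for 0.5 still count as a perfect match.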
³ Since equality will be rare for continuous attributes, Matching Coefficient is mainly
used for discrete attributes, especially binary ones.
6 Results
In this section, we discuss the results of the experiments.
In Table 1, ARI for all the tested proximity measures and similarity measures
on all the datasets is presented. “No” column shows the result for plain proximity
measures that do not use attribute information. The table also presents ARI
for the k-means clustering algorithm. The top-performing similarity measure is
marked in red for each proximity measure.
As we can see, taking attributes into account improves community detection
quality for all the proximity measures. Attribute-aware proximity measures out-
perform k-means for all the datasets except Texas. Therefore, we can conclude
that in most cases, the proximity measures based on structure and attribute
information perform better than both the plain proximity measures which use
only structure information and the k-means clustering method which uses only
attribute information.
Not all tested attribute similarity measures have shown good clustering qual-
ity. Figure 1 presents the average rank and standard deviation for attribute sim-
ilarity measures and k-means. The rank is averaged over 6 datasets. The figure
contains 5 graphs: one for each proximity measure.
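The averaging of ranks over datasets can be sketched as below. The scores are invented for illustration, and two conventions are our assumptions since the text does not state them: rank 1 is assigned to the highest ARI, tied methods share the average of their positions, and the spread is the population standard deviation.

```python
from statistics import mean, pstdev


def ranks_desc(scores):
    # scores: {method: ARI}. Rank 1 is the best; ties share the average rank.
    ordered = sorted(scores, key=scores.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        shared = (i + 1 + j + 1) / 2
        for method in ordered[i:j + 1]:
            ranks[method] = shared
        i = j + 1
    return ranks


def average_ranks(per_dataset_scores):
    # Returns {method: (mean rank, population standard deviation of the rank)}.
    all_ranks = [ranks_desc(scores) for scores in per_dataset_scores]
    methods = list(per_dataset_scores[0])
    return {
        m: (mean([r[m] for r in all_ranks]), pstdev([r[m] for r in all_ranks]))
        for m in methods
    }


# Illustrative ARI values for two datasets (not taken from Table 1).
toy_scores = [
    {"CS": 0.5, "MC": 0.3, "No": 0.1},
    {"CS": 0.4, "MC": 0.4, "No": 0.2},
]
print(average_ranks(toy_scores))
```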
Fig. 1. Average rank and standard deviation for attribute similarity measures and
k-means
One can see that the Cosine Similarity and Extended Jaccard Similarity
measures perform the best: they have the highest ranks for all the proximity
measures. Plain proximity measures, which are not combined with any
similarity measure, are among the lowest-ranked. The performance of the
Matching Coefficient, the Manhattan Similarity, and the Euclidean Similarity
measures varies across proximity measures.
34 R. Aynulin and P. Chebotarev
7 Conclusion
In this paper, we investigated the possibility of applying proximity measures
for community detection in attributed networks. We studied a number of prox-
imity measures, including Communicability, Heat, PageRank, Free Energy, and
Sigmoid Corrected Commute-Time. Attribute information was embedded into
proximity measures using several attribute similarity measures, i.e., the Match-
ing Coefficient, the Cosine Similarity, the Extended Jaccard Similarity, the Man-
hattan Similarity, and the Euclidean Similarity.
According to the results of the experiments, taking node attributes into
account when measuring proximity improves the efficiency of proximity mea-
sures for community detection. Not all attribute similarity measures perform
equally well. The top-performing attribute similarity measures were the Cosine
Similarity and Extended Jaccard Similarity.
Future studies may address the problem of choosing the optimal β in (2).
Another area for future research is to find more effective attribute similarity mea-
sures. Furthermore, the proposed method can be compared with the Embedding
Approach of [20].
References
1. Avrachenkov, K., Chebotarev, P., Rubanov, D.: Kernels on graphs as proximity
measures. In: International Workshop on Algorithms and Models for the Web-
Graph. LNCS, vol. 10519, pp. 27–41. Springer (2017)
2. Aynulin, R.: Efficiency of transformations of proximity measures for graph clus-
tering. In: International Workshop on Algorithms and Models for the Web-Graph.
LNCS, vol. 11631, pp. 16–29. Springer (2019)
3. Bothorel, C., Cruz, J.D., Magnani, M., Micenkova, B.: Clustering attributed
graphs: models, measures and methods. Netw. Sci. 3(3), 408–444 (2015)
4. Chebotarev, P.Y., Shamis, E.: On the proximity measure for graph vertices pro-
vided by the inverse Laplacian characteristic matrix. In: 5th Conference of the
International Linear Algebra Society, Georgia State University, Atlanta, pp. 30–31
(1995)
5. Chebotarev, P.: The walk distances in graphs. Discrete Appl. Math. 160(10–11),
1484–1500 (2012)
6. Chunaev, P.: Community detection in node-attributed social networks: a survey.
Comput. Sci. Rev. 37, 100286 (2020)
7. Costa, L.D.F., Oliveira Jr., O.N., Travieso, G., Rodrigues, F.A., Villas Boas, P.R.,
Antiqueira, L., Viana, M.P., Correa Rocha, L.E.: Analyzing and modeling real-
world phenomena with complex networks: a survey of applications. Adv. Phys.
60(3), 329–412 (2011)
8. Dang, T., Viennet, E.: Community detection based on structural and attribute sim-
ilarities. In: International Conference on Digital Society (ICDS), pp. 7–12 (2012)
9. Deza, M.M., Deza, E.: Encyclopedia of Distances, 4th edn. Springer, Berlin (2016)
10. Dijkstra, E.W., et al.: A note on two problems in connexion with graphs.
Numerische Mathematik 1(1), 269–271 (1959)
11. Estrada, E.: The communicability distance in graphs. Linear Algebra Appl.
436(11), 4317–4328 (2012)
12. Fouss, F., Francoisse, K., Yen, L., Pirotte, A., Saerens, M.: An experimental inves-
tigation of kernels on graphs for collaborative recommendation and semisupervised
classification. Neural Netw. 31, 53–72 (2012)
13. Fouss, F., Yen, L., Pirotte, A., Saerens, M.: An experimental investigation of graph
kernels on a collaborative recommendation task. In: Sixth International Conference
on Data Mining (ICDM’06), pp. 863–868. IEEE (2006)
14. Girvan, M., Newman, M.E.: Community structure in social and biological networks.
Proc. Nat. Acad. Sci. 99(12), 7821–7826 (2002)
15. Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2(1), 193–218 (1985)
16. Ivashkin, V., Chebotarev, P.: Do logarithmic proximity measures outperform
plain ones in graph clustering? In: International Conference on Network Analy-
sis. PROMS, vol. 197, pp. 87–105. Springer (2016)
17. Jain, A.K.: Data clustering: 50 years beyond k-means. Pattern Recogn. Lett. 31(8),
651–666 (2010)
18. Kivimäki, I., Shimbo, M., Saerens, M.: Developments in the theory of randomized
shortest paths with a comparison of graph node distances. Physica A Stat. Mech.
Appl. 393, 600–616 (2014)
19. Kondor, R., Lafferty, J.: Diffusion kernels on graphs and other discrete input spaces.
In: International Conference on Machine Learning, pp. 315–322 (2002)
20. Li, Y., Sha, C., Huang, X., Zhang, Y.: Community detection in attributed graphs:
an embedding approach. In: Thirty-Second AAAI Conference on Artificial Intelli-
gence, pp. 338–345 (2018)
21. Lu, Q., Getoor, L.: Link-based classification. In: Proceedings of the 20th Interna-
tional Conference on Machine Learning (ICML-03), pp. 496–503 (2003)
22. von Luxburg, U.: A tutorial on spectral clustering. Stat. Comput. 17(4), 395–416
(2007)
23. von Luxburg, U., Radl, A., Hein, M.: Getting lost in space: large sample analysis
of the resistance distance. In: Advances in Neural Information Processing Systems,
pp. 2622–2630 (2010)
24. MacQueen, J., et al.: Some methods for classification and analysis of multivariate
observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical
Statistics and Probability, pp. 281–297. Oakland, CA, USA (1967)
25. Milligan, G.W., Cooper, M.C.: A study of the comparability of external criteria
for hierarchical cluster analysis. Multivar. Behav. Res. 21(4), 441–458 (1986)
26. Neville, J., Adler, M., Jensen, D.: Clustering relational data using attribute and
link information. In: Proceedings of the Text Mining and Link Analysis Workshop,
18th International Joint Conference on Artificial Intelligence, pp. 9–15 (2003)
27. Page, L., Brin, S., Motwani, R., Winograd, T.: The pagerank citation ranking:
bringing order to the web. Technical report, Stanford InfoLab (1999)
28. Rand, W.M.: Objective criteria for the evaluation of clustering methods. J. Am.
Stat. Assoc. 66(336), 846–850 (1971)
29. Ruan, Y., Fuhry, D., Parthasarathy, S.: Efficient community detection in large net-
works using content and links. In: Proceedings of the 22nd International Conference
on World Wide Web, pp. 1089–1098 (2013)
30. Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., Eliassi-Rad, T.: Collec-
tive classification in network data. AI Mag. 29(3), 93 (2008)
31. Sharpe, G.: Solution of the (m+1)-terminal resistive network problem by means
of metric geometry. In: Proceedings of the First Asilomar Conference on Circuits
and Systems, Pacific Grove, CA, pp. 319–328 (1967)
32. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern
Anal. Mach. Intell. 22(8), 888–905 (2000)
33. Sommer, F., Fouss, F., Saerens, M.: Comparison of graph node distances on clus-
tering tasks. In: International Conference on Artificial Neural Networks. LNCS,
vol. 9886, pp. 192–201. Springer (2016)
34. Sulc, Z., Řezanková, H.: Evaluation of recent similarity measures for categorical
data. In: Proceedings of the 17th International Conference Applications of Math-
ematics and Statistics in Economics. Wydawnictwo Uniwersytetu Ekonomicznego
we Wroclawiu, Wroclaw, pp. 249–258 (2014)
35. Tan, P.N., Steinbach, M., Kumar, V.: Introduction to Data Mining. Pearson Edu-
cation India (2016)
36. Yang, J., McAuley, J., Leskovec, J.: Community detection in networks with node
attributes. In: 2013 IEEE 13th International Conference on Data Mining, pp. 1151–
1156. IEEE (2013)
37. Zhou, Y., Cheng, H., Yu, J.X.: Graph clustering based on structural/attribute
similarities. Proc. VLDB Endow. 2(1), 718–729 (2009)
Core Method for Community Detection
1 Theory
1.1 About Revealing Communities and Key Applied Tasks
The problem of identifying implicit communities in networks of interacting objects has
been covered in many publications over the past 20 years [1–7], and various algorithms
have been developed to solve problems related to this topic. The analysis of social
networks is a particularly relevant area today [8–10]. When working with social network
graphs, the key applied problems are the following:
1. the task of determining the proximity of user profiles, the coincidence of their
interests, the degree of (face-to-face) acquaintance;
To solve these problems, one must first decide what kind of graph should be built
when importing data from the original source. There is usually no point in working
with the graph of the entire social network because of its size. Therefore, as a rule,
a subgraph is extracted by a breadth-first search from a set of vertices given in
advance. We will work with weighted graphs G(V, E) whose set of vertices V consists of
the original objects, the users of the social network. The weight on the set of edges E
is given by a function w with nonnegative values and corresponds to the intensity of
interaction between objects. The method for determining the values of w(e) depends on
the specific source network. The weight w(v) of a vertex v is then defined as the sum
of the weights of all edges incident to it, and the weight of a set of vertices is
defined as the sum of the weights of all its vertices. The concept of a community is
defined in many works [11–15]. A community S is a subgraph whose vertices are connected
by edges more densely than in the graph as a whole. In this work, we assume that
communities do not overlap, i.e. after the selection of communities, each vertex is in
a single community. The community weight w(S) is therefore the sum of the weights of
its vertices.
We also define the internal weight of a community w∗(S) as the sum of the weights of
the edges whose both endpoints lie inside the community, and the internal weight of a
vertex w∗(v) as the sum of the weights of the edges incident to it that lie in the
community of that vertex. For the tasks listed above, one can build the corresponding
weighted graphs: graphs of general similarity of users, graphs of user sympathy, and
graphs of information interaction of users.
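The weights defined above can be computed directly from an edge list. The sketch below assumes an undirected graph without self-loops, stored as a dictionary that maps vertex pairs to weights; this representation and the function names are our own choices for illustration.

```python
def vertex_weights(edges):
    # w(v): sum of the weights of all edges incident to v.
    weights = {}
    for (u, v), weight in edges.items():
        weights[u] = weights.get(u, 0.0) + weight
        weights[v] = weights.get(v, 0.0) + weight
    return weights


def community_weight(members, weights):
    # w(S): sum of the weights of the vertices of the community.
    return sum(weights[v] for v in members)


def internal_community_weight(members, edges):
    # w*(S): sum of the weights of edges with both endpoints inside S.
    inside = set(members)
    return sum(w for (u, v), w in edges.items() if u in inside and v in inside)


def internal_vertex_weight(vertex, membership, edges):
    # w*(v): weight of the edges incident to v that stay inside v's community.
    total = 0.0
    for (u, v), weight in edges.items():
        if vertex in (u, v) and membership[u] == membership[v]:
            total += weight
    return total


# Toy graph: a path a-b-c-d split into two communities.
toy_edges = {("a", "b"): 2.0, ("b", "c"): 1.0, ("c", "d"): 3.0}
toy_membership = {"a": 0, "b": 0, "c": 1, "d": 1}
toy_weights = vertex_weights(toy_edges)
print(toy_weights, community_weight({"a", "b"}, toy_weights))
```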
To solve the first task and build a graph of general similarity of users, even an
unweighted graph of mutual friends or subscriptions is often sufficient. A weighted
graph whose edge weights are determined from the general attributes of the original
network can also be used for this task [16]. To solve task 2, both the user sympathy
graph and the graph of information interaction can be used. These weighted graphs are
based on the values of the network attributes. A high-quality analysis of the graph of
information interaction, however, requires several procedures, which are described
further in our work. They make it possible to solve tasks 2 and 3 well.
Constructing these graphs well requires choosing the parameters of the edge weights
according to the source network of object interaction. For example, when importing data
from the Twitter network, it is possible to use information about existing likes,
retweets, comments, and user subscriptions. These types of interactions form the set of
attributes. In this case, one option for constructing the weight of the edge between
two vertices is to calculate a weighted sum over these attributes.
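A weighted sum over interaction attributes might look like the sketch below. The coefficient values are purely hypothetical; the text does not specify which coefficients were used, and in practice they would be tuned to the source network.

```python
# Hypothetical coefficients for each interaction type (not from the paper).
COEFFICIENTS = {"likes": 1.0, "retweets": 2.0, "comments": 3.0, "subscriptions": 0.5}


def edge_weight(interactions, coefficients=COEFFICIENTS):
    # interactions: counts of each interaction type between two users.
    # Missing interaction types count as zero.
    return sum(c * interactions.get(name, 0) for name, c in coefficients.items())


# Two users with 3 likes and 1 retweet between them.
print(edge_weight({"likes": 3, "retweets": 1}))
```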
40 A. A. Chepovskiy et al.
If we consider the original graph G(V, E) and apply popular community detection methods
to it, the picture will usually be distorted by a large number of leaf vertices
obtained during data import. There may also be other vertices whose incident edges
weigh significantly less than the others in the graph. Typically, these are the
vertices of users who interact minimally with the rest. For graphs of information
interaction, these users are not interesting. We will conditionally call such vertices
“garbage”. Of interest, in contrast, are the key vertices of opinion leaders, which
have a large weight, as well as the structure of communities, including the “heaviest”
of them. Revealing such vertices is an important task, because “heavy” communities form
around them and other vertices are attracted to them. More precisely, a vertex v is
called δ-garbage if its weight is less than δ. Then the set Junkδ(G) of all garbage in
the graph is defined as follows:
$$\mathrm{Junk}_\delta(G) = \{ v \in V \mid w(v) < \delta \} \quad (1)$$
The mirror situation takes place for vertices with a large weight. Let us call a
vertex v of the graph G an α-star, or simply a “star”, if its weight is greater than
some value α. The set of stars Starα(G) is then defined as follows:
$$\mathrm{Star}_\alpha(G) = \{ v \in V \mid w(v) > \alpha \} \quad (2)$$
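Both sets follow directly from the vertex weights: garbage vertices weigh less than δ and stars weigh more than α. A minimal sketch, assuming the weights are already available as a dictionary (the vertex names and thresholds are illustrative):

```python
def junk(weights, delta):
    # Junk_delta(G): vertices whose weight is less than delta.
    return {v for v, w in weights.items() if w < delta}


def stars(weights, alpha):
    # Star_alpha(G): vertices whose weight is greater than alpha.
    return {v for v, w in weights.items() if w > alpha}


# Illustrative vertex weights.
toy_weights = {"leader": 120.0, "active": 15.0, "lurker": 1.0, "leaf": 0.5}
print(junk(toy_weights, delta=2.0), stars(toy_weights, alpha=100.0))
```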
Such vertices attract others to their communities, unless those, in turn, are in
communities with a significant total weight. In this case, the weight of the edges
between adjacent vertices from different communities will be important [17]. One of the
goals of the core method is to select the key core community (or several such
communities) that has the greatest weight among the communities in graph G. Further,
for a given partition, we denote the community with the maximum weight by Core(G), and
its weight by w(Core(G)). It may happen that several communities Si are revealed in the
initial graph whose weight is very close to, or even coincides with, the maximum weight
w(Core(G)). In these cases, we can speak of several core communities in the graph G.
The admissible degree of proximity of these values is denoted by γ. We define the
γ-core Coreγ(G) as the set of vertices from those communities that satisfy the
following relation:
$$\mathrm{Core}_\gamma(G) = \left\{ v \in V \;\middle|\; v \in S_i : \frac{w(S_i)}{w(\mathrm{Core}(G))} > \gamma \right\} \quad (3)$$
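Relation (3) can be illustrated with a short sketch: compute the weight of every community, find the maximum, and collect the vertices of all communities whose relative weight exceeds γ. The community names and weights below are invented, and the maximum community weight is assumed to be positive.

```python
def gamma_core(communities, weights, gamma):
    # communities: {name: set of vertices}; weights: {vertex: w(v)}.
    community_w = {
        name: sum(weights[v] for v in members)
        for name, members in communities.items()
    }
    w_core = max(community_w.values())  # w(Core(G)), assumed > 0
    core = set()
    for name, members in communities.items():
        if community_w[name] / w_core > gamma:
            core |= set(members)
    return core


# Illustrative partition: S0 and S1 have close weights, S2 is much lighter.
toy_communities = {"S0": {"a", "b"}, "S1": {"c", "d"}, "S2": {"e"}}
toy_weights = {"a": 6.0, "b": 4.0, "c": 5.0, "d": 4.5, "e": 3.0}
print(gamma_core(toy_communities, toy_weights, gamma=0.9))
```

With γ = 0.9 the core here contains the vertices of both S0 (weight 10) and S1 (weight 9.5), while S2 (weight 3) is left out.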
In addition to cores and stars, the graph often contains other, smaller communities
that are internally dense enough, and heavy enough, that cores and other large
communities cannot absorb them.
It is possible to simplify the graph analysis task by ignoring weak interactions of the
original objects, that is, by removing garbage vertices. This can be done in two ways.
The first is simply to remove the garbage vertices and then the edges that were
incident to them. We will use another method: every edge with a weight less than a
given β is treated as having zero weight, i.e. such edges are removed, which gives a
new set of edges E′. After that, some of the vertices become isolated and can be
removed from the analyzed graph, which gives a new set of vertices V′. Let us denote
the graph obtained after these operations as G′(V′, E′), the graph of active
information interaction of network objects.
The second method is better suited for forming such a graph because, by taking β
equal to δ, we can remove not only Junkδ(G) but also other inactive edges and vertices
from the graph.
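The second method can be sketched as follows: drop every edge lighter than β, then keep only the vertices that still have at least one edge. The edge dictionary is an assumed representation and the data is illustrative.

```python
def active_graph(edges, beta):
    # Keep only edges whose weight is at least beta (lighter ones are zeroed out).
    kept_edges = {e: w for e, w in edges.items() if w >= beta}
    # Vertices that lost all their edges become isolated and are dropped.
    kept_vertices = {v for edge in kept_edges for v in edge}
    return kept_vertices, kept_edges


# Illustrative graph: the edge to "leaf" is too weak to survive beta = 2.
toy_edges = {("a", "b"): 5.0, ("b", "c"): 4.0, ("c", "leaf"): 1.0}
toy_vertices, toy_kept = active_graph(toy_edges, beta=2.0)
print(sorted(toy_vertices), toy_kept)
```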
We define the interaction coefficient $k_{int}(G')$ as the ratio of the doubled number
of edges to the square of the number of vertices in the resulting graph of active
information interaction G′:
$$k_{int}(G') = \frac{2|E'|}{|V'|^2} \quad (4)$$
It is easy to see that for the complete graph on n vertices $k_{int} = 1 - \frac{1}{n}$,
which for large n is close to 1. But graphs of networks of interacting objects are
sparse, so usually this coefficient will be closer to 0. High values of $k_{int}(G')$
indicate a significant connection between actively interacting vertices. This indicator
is generally responsible for the activity within the analyzed graph. Similarly, one can
determine the coefficients of interaction within individual communities of the graph.
For the core case, high values will indicate a high level of subjectivity of the graph
content [17].
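The interaction coefficient is a one-line computation, and the claim about the complete graph is easy to check numerically: a complete graph on n vertices has n(n − 1)/2 edges, so the ratio equals (n − 1)/n. A small sketch (the function name is ours):

```python
def interaction_coefficient(num_vertices, num_edges):
    # k_int = 2|E| / |V|^2
    return 2 * num_edges / num_vertices ** 2


# Complete graph on n = 10 vertices has n(n - 1)/2 = 45 edges.
n = 10
k_complete = interaction_coefficient(n, n * (n - 1) // 2)
print(k_complete)

# A sparse graph with the same vertices but only 12 edges.
k_sparse = interaction_coefficient(n, 12)
print(k_sparse)
```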
Let us define $k_S(G')$, the density coefficient of the community S, as the ratio of
the total internal weight of the vertices of this community to the doubled weight of
the graph edges, which is equal to the ratio of the weight of the edges within the
community to the weight of the edges of the graph:
$$k_S(G') = \frac{\sum_{v \in S} w^*(v)}{2 \sum_{e \in E'} w(e)} = \frac{\sum_{e \in S} w(e)}{\sum_{e \in E'} w(e)} \quad (5)$$
Similarly, we define $k_{\mathrm{Core}_\gamma}(G')$, the density coefficient of the
γ-core, as the ratio of the total weight of the vertices of the γ-core
$\mathrm{Core}_\gamma(G')$ to the doubled total weight of the edges of the graph G′:
$$k_{\mathrm{Core}_\gamma}(G') = \frac{\sum_{v \in \mathrm{Core}_\gamma(G')} w(v)}{2 \sum_{e \in E'} w(e)} \quad (6)$$
High values of $k_{\mathrm{Core}_\gamma}(G')$ show that the core in such a graph plays a
significant role in comparison with other communities and garbage vertices. Based on
these coefficients, the classification and methodology for working with graphs will be
built.
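Both density coefficients can be computed from the edge weights alone. The sketch below uses the second form of the community coefficient (internal edge weight over total edge weight) and, for the core, full vertex weights over the doubled total edge weight. The edge dictionary and function names are assumptions made for illustration.

```python
def community_density(members, edges):
    # k_S: weight of edges inside S divided by the weight of all edges.
    inside = set(members)
    internal = sum(w for (u, v), w in edges.items() if u in inside and v in inside)
    return internal / sum(edges.values())


def core_density(core_vertices, edges):
    # k_Core: total weight w(v) of core vertices over twice the total edge weight.
    core = set(core_vertices)
    vertex_weight_sum = 0.0
    for (u, v), w in edges.items():
        # Each edge contributes its weight to w(u) and to w(v).
        vertex_weight_sum += w * ((u in core) + (v in core))
    return vertex_weight_sum / (2 * sum(edges.values()))


# Toy graph: path a-b-c-d with total edge weight 6.
toy_edges = {("a", "b"): 2.0, ("b", "c"): 1.0, ("c", "d"): 3.0}
print(community_density({"a", "b"}, toy_edges), core_density({"c", "d"}, toy_edges))
```

Because the core coefficient uses full vertex weights, edges that leave the core still contribute, which is why it can exceed the community coefficient of the same vertex set.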
a meta-graph as G,1 for the first iteration). Then, with each iteration, the number of
vertices will decrease: each meta-vertex in the new graph is a group of vertices of the
previous graph. Consequently, the meta-vertex is itself a certain graph, within which it
is useful to apply the algorithm and analyze the selected communities.
After revealing communities in the meta-graph and analyzing their connections with
each other, we get new meta-vertices. Repeating the operation of forming meta-vertices
in this way, we obtain the graph G,2 . Similarly, at the i-th step, the graph G,i is obtained.
Thus, at one of the iterations, you can get a general view of the interaction of the largest
groups of users. Of course, this makes sense for a large initial graph G.
Let us present the sequence of analysts’ actions to solve the main tasks 2 and 3. It is
this technique that we will call the “core method”. Acting according to the algorithm
presented below, the analyst will be able to identify both opinion leaders and ways of
disseminating information within and between communities.
1. Remove isolated vertices. The presence of such may be due to the peculiarities of
the source network and the data import process. We get the graph G.
2. Calculate the initial interaction coefficient of the graph kint (G). This will be
important as a reference point for a graph without isolated vertices.
3. Remove garbage vertices. To do this, it is necessary to determine the value of β for
which to perform the operation of removing edges. We get the graph G′.
4. Calculate the updated interaction coefficient of the graph $k_{int}(G')$. The
recommended variation range is within the following limits:
$0.8 < \frac{k_{int}(G')}{k_{int}(G)} < 0.9$. If the ratio falls outside this range, we
recommend returning to step 3 and repeating it with a different value of β.
5. Apply algorithm to reveal communities. An algorithm that identifies non-overlapping
communities should be used, for example, variations of the Infomap [18, 19] or
Louvain [20] algorithms.
6. Identify stars. Choose the value of α and select the set of vertices Starα(G),
consisting of the stars of the graph. Check that stars are highlighted in the largest
communities and that such communities have a high $k_{S_i}(G')$. If not, change α.
7. Detect the core. Determine γ and compose $\mathrm{Core}_\gamma(G')$.
8. Generate meta-graph. Create a graph of meta-vertices G,1 . Reveal communities
inside Coreγ (G) and other key meta-vertices, study their structure in accordance
with the initial tasks of the researcher. If necessary, continue working with each
meta-vertex separately, passing for them recursively to step 2.
9. Create a meta-meta graph. Consider the structure of G,2 in accordance with the
original tasks.
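Steps 2 to 4 of the procedure above amount to a small search over β: threshold the edges, recompute the interaction coefficient, and accept β once the ratio to the initial coefficient falls between 0.8 and 0.9. The sketch below is one possible reading of these steps; the candidate list, the edge dictionary, and the function names are our assumptions.

```python
def interaction_coefficient(edges):
    # k_int = 2|E| / |V|^2, where V is the set of endpoints of the edges.
    vertices = {v for edge in edges for v in edge}
    return 2 * len(edges) / len(vertices) ** 2


def choose_beta(edges, candidates, low=0.8, high=0.9):
    # Returns the first (beta, ratio) whose ratio k_int(G') / k_int(G) is in range.
    k_initial = interaction_coefficient(edges)
    for beta in candidates:
        kept = {e: w for e, w in edges.items() if w >= beta}
        if not kept:
            continue  # beta removed every edge; nothing left to analyze
        ratio = interaction_coefficient(kept) / k_initial
        if low < ratio < high:
            return beta, ratio
    return None  # no candidate satisfied the recommended range


# Toy graph: a 5-cycle of strong edges plus one weak chord a-c.
toy_edges = {
    ("a", "b"): 5.0, ("b", "c"): 5.0, ("c", "d"): 5.0,
    ("d", "e"): 5.0, ("e", "a"): 5.0, ("a", "c"): 1.0,
}
print(choose_beta(toy_edges, candidates=[0.5, 2.0, 6.0]))
```

Here β = 0.5 removes nothing (ratio 1), β = 2 removes only the weak chord (ratio 5/6), and β = 6 would remove every edge, so β = 2 is selected.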
2 Tool
The authors of this work have designed and implemented a special tool for graph analysis
that supports automatic revealing of communities through built-in algorithms and visu-
alization of the result, as well as other actions that allow implementing the previously
indicated methodology.
Since the graph of interacting objects can be obtained from any social network and
other sources, it was extremely important to fix a universal graph representation format
that allows storing the attributes of its vertices and edges. A special XML-like unified
markup format called AVS is used. The description of the vertices and edges of the graph
is written to the AVS format file: their names, attribute names, data format in attributes.
The description of the graph is followed by the description of each vertex and each edge.
An important feature of the AVS format is the way it stores attributes of edges and
vertices that are large texts. In the AVS file itself, only a link to a file (or part
of a file) is stored in the field of the corresponding attribute. The text itself is
stored in that file, which makes it possible to analyze texts separately and to
compress the data if necessary. Therefore, when data is exported from the developed
application, the graph itself (vertices, edges, and their attributes) is saved in an
AVS file, and user texts are stored in separate yaml files for each of them.
The developed software supports the following user scenarios:
• load an AVS file with a graph for visualization and point-by-point consideration of
its vertices and connections;
• apply to the loaded graph any of algorithms to reveal communities available in the
application and get acquainted with the statistics of the resulting partition;
• analyze partitions, including using meta-vertices, in order to change communities
and/or weight functions;
• export the resulting set of communities for analysis or presentation of results in other
systems or manually.
The graph uploaded by the user is displayed on the scene of the main application
window (Fig. 1). The scene is the area of the main application window used to display
the graph and interact with its components.
Fig. 1. Application screen and the scene
Fig. 2. Revealed communities, graph G
Dividing a graph into communities is one of the most essential tools for its
analysis. In the implemented application, two algorithms to reveal communities are
integrated: Infomap and Louvain.
After the partition is applied, the vertices of the graph belonging to the same
community are colored in the same color, and each vertex receives an attribute with its
community number. The vertices are then arranged by revealed communities (Fig. 2).
In addition to these basic actions, the tool supports the actions that are important
for the presented methodology. One can remove edges by a weight predicate, remove
isolated vertices, compose meta-vertices, define stars, and select the core (using the
statistics menu). To determine β and remove “garbage” vertices, it is possible to hide
the edges of the graph on the scene according to a user-specified condition, and later
either remove them completely from the graph or cancel the hiding and return the hidden
edges to the scene. This functionality lets analysts visually evaluate the graph that
will be obtained for different β and choose its optimal value. The user can specify a
condition for hiding edges not only by weight, but also by any other attribute of the
edge, if the graph G has any. To do this, the user specifies the edge attribute, the
sign and the constant value for comparison, as well as the type of comparison. The tool
also provides the ability to remove isolated vertices, which is especially useful after
removing edges, since, with the optimal choice of β, some of the vertices will become
isolated and their removal will help to finally clear the graph of “garbage”. To create
a meta-graph, the “Make meta-vertices” button is provided; it creates, from the current
scene, meta-vertices corresponding to the communities selected at the current step.
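The edge-hiding condition (an attribute, a comparison sign, and a constant) can be modeled as a simple predicate. The sketch below is not the tool's implementation: edges are assumed to be dictionaries of attributes, and the set of supported signs is our guess at a reasonable minimum.

```python
import operator

# Comparison signs mapped to the corresponding standard operators.
COMPARISONS = {
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
    "==": operator.eq, "!=": operator.ne,
}


def split_by_condition(edges, attribute, sign, constant):
    # Returns (visible, hidden): edges matching the condition are hidden.
    compare = COMPARISONS[sign]
    visible, hidden = [], []
    for edge in edges:
        if attribute in edge and compare(edge[attribute], constant):
            hidden.append(edge)
        else:
            visible.append(edge)
    return visible, hidden


# Illustrative edges with a weight attribute.
toy_edges = [
    {"source": "a", "target": "b", "weight": 1.0},
    {"source": "b", "target": "c", "weight": 5.0},
]
visible, hidden = split_by_condition(toy_edges, "weight", "<", 2.0)
print(len(visible), len(hidden))
```

Keeping the hidden edges in a separate list mirrors the described workflow: they can later be either deleted for good or returned to the scene.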
To create a graph of meta-vertices, first, for each community, the total weight of edges
(with both vertices within the same community) is calculated. This gives the weight of
the meta-vertices used to calculate the radius for displaying on the main scene. Then,
for each pair of communities, the total weight of the edges between the vertices of these
communities is calculated. This is how the weights of the new edges in the meta-graph
are determined.
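The construction just described can be sketched as a single pass over the edges: an edge inside a community adds to the weight of that meta-vertex, and an edge between two communities adds to the weight of the corresponding meta-edge. The data structures and names are assumptions for illustration.

```python
def build_meta_graph(edges, membership):
    # edges: {(u, v): weight}; membership: {vertex: community id}.
    meta_vertex_weight = {c: 0.0 for c in set(membership.values())}
    meta_edges = {}
    for (u, v), weight in edges.items():
        cu, cv = membership[u], membership[v]
        if cu == cv:
            # Both endpoints in one community: contributes to the meta-vertex.
            meta_vertex_weight[cu] += weight
        else:
            # Endpoints in different communities: contributes to a meta-edge.
            key = tuple(sorted((cu, cv)))
            meta_edges[key] = meta_edges.get(key, 0.0) + weight
    return meta_vertex_weight, meta_edges


# Toy graph: a 4-cycle split into two communities of two vertices each.
toy_edges = {("a", "b"): 2.0, ("b", "c"): 1.0, ("c", "d"): 3.0, ("a", "d"): 4.0}
toy_membership = {"a": 0, "b": 0, "c": 1, "d": 1}
meta_w, meta_e = build_meta_graph(toy_edges, toy_membership)
print(meta_w, meta_e)
```

Applying the same function to the meta-graph and a partition of its meta-vertices would give the next level of the hierarchy.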
When analyzing a meta-graph, it may be relevant for a user to study the structure
of a community representing one meta-vertex. To do this, the user can “fall” into the
meta-vertex and see the subgraph of the original graph composed only of the vertices
of the given community. From this view, the user can save and open the subgraph as a
separate project and work with it using the full functionality of the application. The
statistics menu implemented in the application allows the user to study the topological
indicators of the graph and its components (vertices, communities). Basic statistics,
located in the text boxes in the lower left corner, help the user calculate the graph
interaction coefficient needed in the early stages of the core method, as well as the
community and γ-core density coefficients for later stages. Communities, sorted by the
number of vertices included in them (the value is indicated in parentheses after the
name of the community), help the user quickly find the largest one and highlight the
core.
Encrypted vertex nick | Vertex degree | Weighted vertex degree | Weighted vertex degree in community | Weighted degree divided by mean
----------------------|---------------|------------------------|-------------------------------------|--------------------------------
v_***ov               | 126           | 445                    | 187                                 | 37.08
le***al               | 84            | 315                    | 204                                 | 26.25
Vi***va               | 76            | 258                    | 129                                 | 21.5
Mr***ay               | 65            | 171                    | 128                                 | 14.25
Pr***at               | 62            | 171                    | 83                                  | 14.25
Ol***13               | 27            | 84                     | 61                                  | 7
aa***an               | 38            | 64                     | 27                                  | 5.33
8a***Wn               | 20            | 60                     | 30                                  | 5
ma***n_               | 25            | 57                     | 29                                  | 4.75
Se***us               | 20            | 49                     | 23                                  | 4.08
ru***60               | 31            | 42                     | 15                                  | 3.5
NpA***36              | 13            | 41                     | 18                                  | 3.41
dj***ef               | 12            | 40                     | 15                                  | 3.33

Consider the partition into internal communities in S1 (Fig. 3). This meta-vertex is a
source of information: a star-vertex corresponding to the official media account,
together with its adjacent vertices. We will call such meta-vertices “constellations of
the first kind”, and their adjacent vertices “planets”. Thus, in our classification, S1
is a constellation of the first kind, consisting of one vertex-star and 48
vertices-planets.
Let us consider in more detail the meta-vertex corresponding to S0 Coreγ (G). Let’s
reveal internal communities in S0 using the Infomap algorithm (Fig. 4). The top star is
clearly visible: it is a pronounced influencer, surrounded by the adjacent users who
interacted with the original posts. It is worth noting that many of these users look
ambiguous: their nicknames are the default 15 random characters created upon
registration, the number of followers is zero or close to zero, and the uploaded photos
mostly show no face. Presumably, such users are bots or fakes of the respective
influencer. Of course, there are also real users who share the leader’s views. They can
even form their own communities; in this case there are two of them, each consisting of
2 vertices. Further on there will be examples where additional communities are larger.
We will call such meta-vertices “constellations of the second kind”, “stars” in them
– opinion leaders, and “planets” – other vertices (in general, not all of them will be
adjacent to a star). Thus, in our classification S0 is a constellation of the second kind,
consisting of one vertex-star, with which 43 vertex-planets are connected.
Communities S2 and S4 also represent constellations of the second kind (Figs. 5, 6),
each consisting of one star and, respectively, 32 and 43 planetary vertices. For the
chosen value of α, the other vertices here are not stars.
Community S3 is a “constellation of the third kind” (Fig. 7): there is no star here,
but the density value $k_{S_3}(G')$ is quite high, and the vertices are still competing
for supremacy in this group. This suggests that a star will appear here as the social
network develops over time.
These constellation variants are not unique to the considered communities and can be
found in other cases as well, for example, in the structure of communities S5, S6 and
S7, each of which matches one of the three previous pictures. Among them, only S5 has a
star for the chosen value of α.
Thus, this technique distinguishes opinion leaders and the communities that constitute
support groups for the respective leaders. It should be noted that the qualitative
analysis identified leaders of various opinions, both pro-government (Fig. 9) and
opposition (Fig. 10).
Then we repeat the selection of communities in the graph of meta-vertices G,1
and obtain the graph of meta-meta-vertices G,2 . If the graph G,1 can be conventionally
called a “constellation graph”, then the graph G,2 is a “galaxy graph” (Fig. 8). Five
communities are distinguished on G,2 . Only one of them includes more than one
meta-vertex. It consists of 39 meta-vertices and is the “Galactic core” for the initial
graph G′. Each of the other four is composed of a single meta-vertex, which, in turn,
consists of several vertices of the original graph. It should be noted that more
interesting “galaxy graphs” are possible.
4 Conclusions
This paper describes a core method for analyzing weighted graphs and solving the
problem of identifying opinion leaders and ways of disseminating information.
Mathematical indicators for a weighted graph, which make it possible to assess the
degree of interaction of network vertices, have been introduced. The functionality of
the software implemented by the authors, which allows analysts to work within the
framework of the suggested method, is described. An example of applying the method with
the described application to a graph built from real Twitter data is given and
considered in detail.
The authors see a possible further development of this area in testing the hypothesis
that networks of interacting objects self-organize: that implicit communities form, as
the graph evolves, through interaction governed by laws similar to the laws of physics,
bringing the system into a state of stable equilibrium.
References
1. Aggarwal, C.: Social Network Data Analytics. Springer, US (2011)
2. Chepovskiy, A., Lobanova, S.: Combined method to detect communities in graphs of
interacting objects. Bus. Inf. 42(4), 64–73 (2017)
3. Coscia, M., Rossetti, G., Giannotti, F., Pedreschi, D.: Demon: a local-first discovery method
for overlapping communities. In: Proceedings of the 18th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, pp. 615–623, ACM (2012)
4. Karimi, F., Lotfi, S., Izadkhah, H.: Multiplex community detection in complex networks using
an evolutionary approach. Expert Syst. Appl. 146, 113184 (2020)
5. Lambiotte, R., Rosvall, M.: Ranking and clustering of nodes in networks with smart
teleportation. Phys. Rev. E 85, 056107 (2012)
6. Newman, M., Girvan, M.: Finding and evaluating community structure in networks. Phys.
Rev. E 69(2), 1–15 (2004)
7. Palla, G., Derenyi, I., Farkas, I., Vicsek, T.: Uncovering the overlapping community structure
of complex networks in nature and society. Nature 435, 814–818 (2005)
8. Lei, T., Huan, L.: Community detection and mining in social media. Synthesis Lectures on
Data Mining and Knowledge Discovery, p. 137 (2010)
9. Xie, J., Szymanski, B.K., Liu, X.: SLPA: uncovering overlapping communities in social
networks via a speaker-listener interaction dynamic process. In: 2011 IEEE 11th International
Conference, Data Mining Workshops (ICDM), pp. 344–349 (2011)
10. Yang Bo, Liu Dayou, Liu Jiming.: Discovering communities from social networks: method-
ologies and applications. Handbook of Social Network Technologies and Applications,
pp. 331–346. Springer, Boston (2010)
11. Clauset, A., Newman, M.E.J., Moore, C.: Finding community structure in very large networks.
Phys. Rev. E 70, 066111 (2004)
12. Fortunato, S.: Community detection in graphs. Phys. Rep. 486(3), 75–174 (2010)
13. Girvan, M., Newman, M.: Community structure in social and biological networks. Proc. Natl.
Acad. Sci. 99(12), 7821–7826 (2002)
14. Lancichinetti, A., Fortunato, S.: Community detection algorithms: a comparative analysis.
Phys. Rev. E 80, 056117 (2009)
15. Radicchi, F., Castellano, C., Loreto, V., Cecconi, F., Parisi, D.: Defining and identifying
communities in networks. Proc. Natl. Acad. Sci. U.S.A. 101(9), 2658–2663 (2004)
16. Leschyov, D.A., Suchkov, D.V., Khaykova, S.P., Chepovskiy, A.A.: Algorithms to reveal
communication groups. Voprosy kiberbezopasnosti. 32(4), 61–71 (2019).
https://doi.org/10.21681/2311-3456-2019-4-61-71
17. Voronin, A.N., Kovaleva, J.B., Chepovskiy, A.A.: Interconnection of network characteristics
and subjectivity of network communities in the social network Twitter. Voprosy
kiberbezopasnosti. 37(3), 40–57 (2020). https://doi.org/10.21681/2311-3456-2020-03-40-57
18. Rosvall, M., Bergstrom, C.T.: Maps of random walks on complex networks reveal community
structure. Proc. Natl. Acad. Sci. U.S.A. 105(4), 1118–1123 (2008)
19. Rosvall, M., Bergstrom, C.T., Axelsson, D.: The map equation. Eur. Phys. J. Spec. Top.
178(1), 13–23 (2009)
20. Blondel, V., Guillaume, J., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities in
large networks. J. Stat. Mech. Theory Exp. 10, P10008 (2008)
Effects of Community Structure in Social
Networks on Speed of Information
Diffusion
1 Introduction
Social media users can widely disseminate information, potentially affecting soci-
etal trends [2]. When information about a given product is widely disseminated,
users receiving that information may purchase that product, increasing its sales.
Therefore, understanding factors affecting information diffusion in social media
is important for successful viral marketing campaigns.
In our previous work, we have investigated how community structures of
users in social media affect the scale of cascading information diffusion [15,16].
Many social networks have a community structure, in which the network is com-
posed of highly clustered communities with sparse links between them [4,12].
Our previous studies have shown that if information is spread across different
communities, the information will be widely spread [15,16]. Other existing stud-
ies have also reported that the community structure has strong influence on
spreading processes using real data and theoretical models [3,6,7,10,11,13,21].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 943, pp. 51–61, 2021.
https://doi.org/10.1007/978-3-030-65347-7_5
52 N. Tsuda and S. Tsugawa
While many previous studies have focused on the scale of information diffu-
sion cascades [1,8,9,15,16,21], there have been few studies of factors affecting
its speed, which is equally important. For instance, suppose a post reaches ten
thousand users within one week, and another reaches ten thousand users within
one hour. Both posts have the same diffusion scale, but significantly different
diffusion speeds. For viral marketing campaigns, it is important to spread infor-
mation both widely and rapidly among social media users. Therefore, it is also
important to understand the factors affecting diffusion speed of information cas-
cades.
In this paper, we focus on the community structure of social networks among
Twitter users and investigate how that structure affects the speed of diffusion
by retweets. Extracting communities among sampled Twitter users, we investi-
gate differences in diffusion speed between tweets with many intra-community
retweets and those with many inter-community retweets. We also use commu-
nity structures in Twitter social networks to tackle the tasks of predicting the
diffusion speed of a given tweet (i.e., time intervals between a first tweet and its
N -th retweet). Consequently, we examine the effectiveness of community struc-
ture features in a social network for predicting information diffusion speed. Our
main contributions are summarized as follows.
– We investigate the effects of inter-community and intra-community diffusion
of tweets on their diffusion speed. To the best of our knowledge, this is the first
study to investigate the effects of community structure of a social network on
diffusion speed of tweets.
– We show the potential of community structure features for predicting
information diffusion speed. While many studies have addressed the task of
predicting the scale of diffusion, the task of predicting diffusion speed has
rarely been studied. Our results contribute to constructing models for predicting
the speed of information diffusion.
The remainder of this paper is organized as follows. In Sect. 2, we investigate
the effects of community structure on diffusion speed of tweets. In Sect. 3, we
tackle the tasks of predicting diffusion speed of tweets. Finally, Sect. 4 contains
our conclusions and a discussion of future work.
where UN,t is the set of target users who post the first through N -th retweets of
tweet t, c(u) is the community to which user u belongs, and u(t) is the user who
posted the original tweet t. By changing N , we investigated the relation between
N th retweet time and the intra-community diffusion rate p(N ).
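The displayed definition of p(N) is not reproduced in this text. As a minimal sketch, assuming p(N) is the fraction of the first N retweeters who belong to the same community as the original poster (which is what the symbol descriptions above suggest), the rate could be computed as follows. The function name, the argument layout, and the toy data are hypothetical.

```python
from typing import Hashable, Mapping, Sequence


def intra_community_rate(
    author: Hashable,
    retweeters: Sequence[Hashable],
    community: Mapping[Hashable, Hashable],
    n: int,
) -> float:
    """Fraction of the first n retweeters that share the author's community.

    Assumption (not taken verbatim from the paper):
    p(N) = |{u in U_{N,t} : c(u) = c(u(t))}| / |U_{N,t}|.
    """
    first_n = retweeters[:n]  # U_{N,t}: users behind the 1st..N-th retweets
    if not first_n:
        raise ValueError("need at least one retweet")
    same = sum(1 for u in first_n if community[u] == community[author])
    return same / len(first_n)


# Toy data: user -> community label (hypothetical).
community = {"a": 1, "b": 1, "c": 2, "d": 1, "e": 3}
p4 = intra_community_rate("a", ["b", "c", "d", "e"], community, n=4)
p1 = intra_community_rate("a", ["b", "c", "d", "e"], community, n=1)
```

In this toy case two of the first four retweeters share the author's community, so `p4` is 0.5. The sketch assumes every user has a community label, which may not hold for every community extraction method.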
2.2 Results
We categorized tweets into four groups based on the intra-community diffusion
rate p(N ): tweets with 0 ≤ p < 0.25, 0.25 ≤ p < 0.5, 0.5 ≤ p < 0.75, and
0.75 ≤ p ≤ 1. We then compared average N th retweet times among the four
groups (Fig. 2). The following shows the results for cycle truss communities with
k = 5 and flow truss communities with k = 15. Note that cycle truss communities
with k = 5 and flow truss communities with k = 15 have similar community sizes.
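The grouping step above can be sketched as follows. This is an illustrative sketch only: the record layout (pairs of p and N-th retweet time in minutes) and all names are assumptions, and the numbers are toy values rather than data from the paper.

```python
from collections import defaultdict
from statistics import mean

GROUPS = ["0<=p<0.25", "0.25<=p<0.5", "0.5<=p<0.75", "0.75<=p<=1"]


def group_index(p: float) -> int:
    """Map p in [0, 1] to one of the four groups; p = 1 joins the last group."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return min(int(p * 4), 3)


def mean_time_by_group(records):
    """records: iterable of (p, nth_retweet_time_minutes) pairs (hypothetical layout)."""
    buckets = defaultdict(list)
    for p, minutes in records:
        buckets[GROUPS[group_index(p)]].append(minutes)
    return {name: mean(times) for name, times in buckets.items()}


toy = [(0.10, 100.0), (0.20, 140.0), (0.60, 300.0), (1.00, 500.0)]
averages = mean_time_by_group(toy)
```

Groups that receive no tweets are simply absent from the result, so a caller comparing groups should handle missing keys.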
For both cycle truss communities and flow truss communities, Fig. 2 shows
that the higher the intra-community diffusion rate, the longer it takes to reach
the N th retweet. For example, it takes approximately 1.9 times longer to reach the 100th
retweet when the intra-community diffusion rate p is 0.75 ≤ p ≤ 1 than when
0 ≤ p < 0.25. We also confirmed similar tendencies with k values other than
those shown in Fig. 2. These results suggest that the intra-community diffusion
rate affects the speed of tweet diffusion.
We next extracted truss communities with different community structure
strengths and performed the same analysis. Previous analyses have shown that
tweet diffusion times can be long when the intra-community diffusion rate is
high. Therefore, we next categorized tweets into four patterns based on the
intra-community diffusion rate before N th retweets, and for each investigated
the extent to which retweet times differ between the cases of diffusion within
relatively strong and weak community structures. Figure 3 compares the N th
retweet time in extracted k = 5, k = 10, and k = 20 cycle truss communities
with intra-community diffusion rate p values where 0 ≤ p < 0.25 (Fig. 3a),
0.25 ≤ p < 0.50 (Fig. 3b), 0.50 ≤ p < 0.75 (Fig. 3c), and 0.75 ≤ p ≤ 1 (Fig. 3d).
Figure 4 shows similar results for extracted flow truss communities with k = 10,
k = 30, and k = 50.
Figures 3 and 4 do not show large differences when intra-community diffu-
sion rates are low, but when the rate is 0.25 or more, tweet diffusion takes longer
when the community structure is strong than when it is weak. In a cycle truss
community where intra-community diffusion rate p is 0.25 ≤ p < 0.50, for exam-
ple, it takes approximately 1.8 times longer to reach 100 retweets when k = 20
than when k = 5. These results suggest that once the intra-community diffusion
rate reaches a certain level (0.25 or more in Figs. 3 and 4), stronger community
structures incur longer diffusion times than do weak structures.
[Figure: N-th RT time (m) versus N-th retweet, two panels; recovered legend entries: intra:0−0.25, intra:0.25−0.50, intra:0.50−0.75.]
100 times and was first retweeted more than sixty minutes after its initial post-
ing. We used tweets posted within a certain period and retweets of those posts
as training data. From these data, we extracted features regarding the tweet
body texts, the reliability and activity of the tweeting or retweeting user, and
community structure features for predictions.
The target users were the same 356,453 users analyzed in the previous
section. The training period ran from January 1st, 2016 to April
30th, 2016, and the testing period from May 1st, 2016 to May 20th,
2016. We used only tweets that were retweeted 100 times or more and whose lifetimes
were sixty minutes or more. Thus, 8,438 original tweets were available for the
training data, and 1,489 original tweets were available for the test data.
We predict the N th retweet time based on the method proposed in [1], because
it has been shown to be useful for predicting both diffusion scales and lifetimes,
which is a type of diffusion time. In this method, features necessary for
predictions are extracted from the training data and stored in a knowledge base.
Given a test-data tweet subject to prediction, we extract tweets having features
similar to those of the test tweet from the knowledge base (i.e., training data),
calculate the average of their N th retweet times, and use the result as the predicted
value.
Fig. 3. Comparison of N th retweet times among different truss numbers k (cycle truss).
[Plot panels: N-th RT time (m) versus N-th retweet, with curves for k = 5, k = 10, and k = 20.]
Fig. 4. Comparison of N th retweet times among different truss numbers k (flow truss).
[Plot panels: N-th RT time (m) versus N-th retweet, with curves for k = 10, k = 30, and k = 50.]
and extract the top γ tweets with smallest DIST. Finally, we calculate the aver-
age of the N th retweet times for the γ tweets with shortest distance between
the prediction tweet (eti ) and the prediction target, taking this as the prediction
value. Following Ref. [9], in this study we used values α = 50, β = 10, and γ = 5.
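The final averaging step can be sketched as below. The definition of DIST and the roles of α and β are not reproduced in this text, so the sketch substitutes a plain Euclidean distance over hypothetical feature vectors and models only the last step: take the γ knowledge-base tweets with the smallest distance and average their N th retweet times.

```python
import heapq
import math

GAMMA = 5  # number of nearest knowledge-base tweets, the value of γ stated above


def euclidean(x, y):
    """Stand-in distance (assumption); the paper's DIST is not reproduced here."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def predict_nth_retweet_time(target, knowledge_base, gamma=GAMMA, dist=euclidean):
    """Average the N-th retweet times of the gamma nearest knowledge-base tweets.

    knowledge_base: list of (feature_vector, nth_retweet_time) pairs
    (hypothetical layout).
    """
    if not knowledge_base:
        raise ValueError("knowledge base is empty")
    nearest = heapq.nsmallest(
        gamma, knowledge_base, key=lambda row: dist(target, row[0])
    )
    return sum(time for _, time in nearest) / len(nearest)


# Toy knowledge base with one-dimensional features.
kb = [((0.0,), 10.0), ((1.0,), 20.0), ((2.0,), 30.0), ((10.0,), 1000.0)]
prediction = predict_nth_retweet_time((0.0,), kb, gamma=3)
```

With `gamma=3` the outlier at distance 10 is excluded, so the prediction is the mean of the three closest times. If the knowledge base holds fewer than γ tweets, the sketch averages whatever is available.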
We next describe the method for prediction by combining community struc-
ture features with the existing method. As a community structure feature, we
use the intra-community diffusion rate p(100), calculated as described in Sect. 2.
By using intra-community diffusion rate p(100) and features used in the existing
method, we measure the similarity between the prediction target tweet and those
in the knowledge base. Note that the method for extracting communities is the
same as that described in Sect. 2. Given a tweet that is subject to prediction,
we extract from the knowledge base tweets with similar intra-community diffu-
sion rates. Specifically, we calculate the quotient obtained by dividing p(100) of
the prediction target tweet by 0.05. We also calculate the quotients obtained by
dividing p(100) for knowledge base tweets by 0.05, and extract knowledge base
tweets with the same quotients as the prediction tweet. After that, we perform
the same processing for predictions as in the existing method, and extract similar
tweets from the knowledge base, and calculate the average of their N th retweet
time.
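The quotient-based filtering described above can be sketched as follows. The knowledge-base layout (dictionaries with a "p100" key) and the function names are hypothetical; the rounding step is an implementation detail added here because dividing floating-point values by 0.05 can land just below an integer.

```python
import math

BIN_WIDTH = 0.05  # divisor stated in the text


def quotient(p: float, width: float = BIN_WIDTH) -> int:
    """Integer quotient of p divided by width.

    The rounding step guards against floating-point artefacts, e.g.
    0.15 / 0.05 evaluating to a value slightly below 3.
    """
    return math.floor(round(p / width, 9))


def same_quotient_tweets(target_p, knowledge_base):
    """Keep knowledge-base rows whose p(100) shares the target's quotient."""
    q = quotient(target_p)
    return [row for row in knowledge_base if quotient(row["p100"]) == q]


kb = [{"id": 1, "p100": 0.15}, {"id": 2, "p100": 0.19}, {"id": 3, "p100": 0.20}]
candidates = same_quotient_tweets(0.16, kb)
candidate_ids = [row["id"] for row in candidates]
```

The surviving candidates would then be passed to the similarity search of the existing method. Tweets 1 and 2 share quotient 3 with the target, while tweet 3 falls into the next bin.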
For the prediction method using only community structure features, we
only use the intra-community diffusion rate p(100) for measuring the similarity
between the target tweet and tweets in the knowledge base. Given a tweet that is
subject to prediction, we calculate the quotients obtained by dividing p(100) for
knowledge base tweets and those subject to prediction by 0.05, extract knowledge
base tweets with same quotients as the prediction tweet, and use the average of
the N th retweet times for the extracted tweets as the prediction value.
As a baseline, we also use the method of using the average of all training data
as the predicted value. In this method, for a given target tweet, the prediction
value is the average of the N th retweet times in the training data.
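A minimal sketch of the community-only predictor and the averaging baseline follows. The training-data layout and names are hypothetical, and the fallback to the baseline when no training tweet shares the target's quotient is an assumption, since the text does not say how that case is handled.

```python
import math
from statistics import mean


def quotient(p, width=0.05):
    """Integer quotient of p / width, rounded first to absorb float error."""
    return math.floor(round(p / width, 9))


def baseline_prediction(training):
    """Baseline: mean N-th retweet time over all training tweets."""
    return mean(time for _, time in training)


def community_only_prediction(target_p, training):
    """Mean N-th retweet time of training tweets in the same p(100) bin.

    training: list of (p100, nth_retweet_time) pairs (hypothetical layout).
    """
    q = quotient(target_p)
    matched = [time for p, time in training if quotient(p) == q]
    if not matched:
        # Assumption: fall back to the baseline when the bin is empty.
        return baseline_prediction(training)
    return mean(matched)


train = [(0.11, 10.0), (0.12, 30.0), (0.52, 100.0)]
pred_bin = community_only_prediction(0.13, train)
pred_empty = community_only_prediction(0.90, train)
pred_base = baseline_prediction(train)
```

Here `pred_bin` averages the two training tweets whose p(100) falls in the same 0.05-wide bin as the target, while `pred_empty` equals the baseline because no training tweet shares its bin.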
[Figure: RMSE of predicting the 150th and 200th RT time (hour), by method; recovered method labels: previous method, previous method +cycle, previous method +flow, average.]
Overall, these results show the potential of community structure features for
predicting N th retweet times. However, they also show that further study is
needed on how to use those features and how to combine them
with other features.
4 Conclusion
In this paper, we have investigated how the community structure of a social net-
work among Twitter users affects the speed of diffusion by retweets. Extracting
communities among sampled Twitter users, we investigate differences in diffu-
sion speed between tweets with many intra-community retweets and those with
many inter-community retweets. Consequently, we have shown that tweets with
many intra-community retweets tend to spread slowly. We have also tackled the
tasks of predicting time intervals between a first tweet and its N th retweet using
features obtained from community structures in Twitter social networks. Our
results have shown that community structure features are potentially effective
for predicting information diffusion speed. However, we have also found
that further study is needed on how to use those features
and how to combine them with other features for predicting diffusion speed.
In future work, generalizability of the results should be investigated. The
results in this paper should be validated using other datasets. Although clear
differences between cycle and flow truss communities are not observed in this
paper, analyzing the effects of different types of communities on information
diffusion is important future work. We are also interested in understanding
the background mechanisms of the effects of community structure on diffusion
speed.
References
1. Bae, Y., Ryu, P.M., Kim, H.: Predicting the lifespan and retweet times of tweets
based on multiple feature analysis. ETRI J. 36(3), 418–428 (2014)
2. Bakshy, E., Hofman, J.M., Mason, W.A., Watts, D.J.: Everyone’s an influencer:
quantifying influence on Twitter. In: Proceedings of WSDM 2011, pp. 65–74 (2011)
3. De Meo, P., Ferrara, E., Fiumara, G., Provetti, A.: On Facebook, most ties are
weak. Commun. ACM 57(11), 78–84 (2014)
4. Ferrara, E.: A large-scale community structure analysis in Facebook. EPJ Data
Sci. 1(1), 9 (2012)
5. Fortunato, S.: Community detection in graphs. Phys. Rep. 486(3), 75–174 (2010)
6. Galstyan, A., Cohen, P.: Cascading dynamics in modular networks. Phys. Rev. E
75(3), 036109 (2007)
7. He, J.L., Fu, Y., Chen, D.B.: A novel top-k strategy for influence maximization in
complex networks with community structure. PLoS ONE 10(12), e0145283 (2015)
8. Hong, L., Dan, O., Davison, B.D.: Predicting popular messages in Twitter. In:
Proceedings of the 20th International Conference Companion on World Wide Web,
pp. 57–58 (2011)
9. Kong, S., Feng, L., Sun, G., Luo, K.: Predicting lifespans of popular tweets in
microblog. In: Proceedings of SIGIR 2012, pp. 1129–1130 (2012)
10. Li, C.T., Lin, Y.J., Yeh, M.Y.: The roles of network communities in social infor-
mation diffusion. In: Proceedings of the IEEE Big Data 2015, pp. 391–400 (2015)
11. Nematzadeh, A., Ferrara, E., Flammini, A., Ahn, Y.Y.: Optimal network modu-
larity for information diffusion. Phys. Rev. Lett. 113(8), 088701 (2014)
12. Newman, M.E.J., Girvan, M.: Finding and evaluating community structure in net-
works. Phys. Rev. E 69(2), 026113 (2004)
13. Onnela, J.P., Saramäki, J., Hyvönen, J., Szabó, G., Lazer, D., Kaski, K., Kertész,
J., Barabási, A.L.: Structure and tie strengths in mobile communication networks.
PNAS 104(18), 7332–7336 (2007)
14. Takaguchi, T., Yoshida, Y.: Cycle and flow trusses in directed networks. R. Soc.
Open Sci. 3(11), 160270 (2016)
15. Tsuda, N., Tsugawa, S.: Effects of truss structure of social network on information
diffusion among Twitter users. In: Proceedings of INCoS 2019, pp. 306–315 (2019)
16. Tsugawa, S.: Empirical analysis of the relation between community structure and
cascading retweet diffusion. In: Proceedings of ICWSM 2019 (2019)
17. Tsugawa, S.: A survey of social network analysis techniques and their applications
to socially aware networking. IEICE Trans. Commun. 102(1), 17–39 (2019)
18. Tsugawa, S., Ohsaki, H.: Negative messages spread rapidly and widely on social
media. In: Proceedings of COSN 2015, pp. 151–160 (2015)
19. Tsugawa, S., Ohsaki, H.: On the relation between message sentiment and its virality
on social media. Soc. Netw. Anal. Min. 7(1), 19:1–19:14 (2017)
20. Wang, J., Cheng, J.: Truss decomposition in massive networks. Proc. VLDB
Endow. 5(9), 812–823 (2012)
21. Weng, L., Menczer, F., Ahn, Y.Y.: Virality prediction and community structure in
social networks. Sci. Rep. 3, 2522 (2013)
Closure Coefficient in Complex Directed
Networks
1 Introduction
Fig. 1. Classification diagram of local clustering measures. In each of the two node-
based clustering measures, the focal node is painted in red, and the dotted edge repre-
sents the potential closing edge in an open triad. In the edge-based clustering measure,
the focal edge is in red, and the dotted outline circle represents the potential node that
forms a triangle.
cites two papers where one tends to cite the other [13]; and in a signed directed
trust network, when Alice distrusts Bob, Alice discounts anything recommended
by Bob [14].
The classic measure of a 3-clique formation is the local clustering coefficient
[15], which is defined by the percentage of the number of triangles formed with a
node (referred to as node i) to the number of triangles that i could possibly form
with its neighbours. In this definition, the focal node i serves as the centre-node
in an open triad. To emphasize, an open triad is an unordered pair of edges
sharing one node. With a focus on node i, it describes the extent to which edges
congregate around it. The extensions of local clustering coefficient have been
thoroughly discussed for weighted networks [16,17], directed networks [18] and
signed networks [19]. Another metric for 3-clique formation, with a focus on an
edge, is the edge clustering coefficient [20] which evaluates to what extent nodes
cluster around this edge.
A recent study has proposed another local edge clustering measure, i.e., the
local closure coefficient [21]. With the focal node i as the end-node of an open
triad, it is quantified as the percentage of two times the number of triangles
containing i to the number of open triads with i as the end-node. Conceptually,
the local clustering coefficient measures the phenomenon that two friends of mine
are also friends themselves, while the local closure coefficient is focusing on a
friend of my friend is also a friend of mine. This new metric has been proven to
be a useful tool in several network analysis tasks such as community detection
and link prediction [21]. Together with the two measures mentioned above, we
propose a classification diagram of all three local clustering measures (Fig. 1).
64 M. Jia et al.
The local closure coefficient is originally defined for undirected binary net-
works. However, in real-world complex networks, the relationships between com-
ponents can be nonreciprocal (a follower is often not followed back by the fol-
lowee), heterogeneous (trade volumes between countries vary significantly), and
negative (an individual can be disliked or distrusted).
In this paper, with an end-node focus, we propose the local directed closure
coefficient to measure local edge clustering in binary directed networks, and we
extend it to weighted directed networks and weighted signed directed networks.
Since in a directed 3-clique, each of the three edges can take either direction,
there are eight different triangles in total. According to the direction of the
closing edge, i.e., the edge that closes an open triad and forms a triangle, we
classify them into two groups (emanating from or pointing to the focal node, as
shown in Fig. 2). Based on that, we propose the source closure coefficient and
the target closure coefficient respectively.
Fig. 2. Taxonomy of directed triangles. Two solid edges connecting nodes i, j and k
form an open triad, which is closed by a dotted edge connecting nodes i and k. Focal
node i, painted in red, is the end-node of an open triad. Eight triangles are classified
into two groups according to the direction of the closing edge. First row shows a group
where the focal node serves as the source node of the closing edge; second row is another
group where the focal node serves as the target.
2 Preliminaries
This section introduces the preliminary knowledge of our work, including the
classic clustering coefficient and the recently proposed closure coefficient.
The subscript c here emphasizes that the focal node i serves as the centre-
node of an open triad. We assume that Cc (i) is well defined (di > 1). Clearly,
Cc (i) ∈ [0, 1].
In order to measure clustering at the network-level, the average clustering
coefficient is introduced by averaging the local clustering coefficient over
all nodes (an undefined local clustering coefficient is treated as zero):
$C_c = \frac{1}{|V|} \sum_{i \in V} C_c(i)$.
where N (i) denotes the set of neighbours of node i. Ce (i) is well defined when
the neighbours of i are not solely connected to it. T (i) is multiplied by two for
the reason that each triangle contains two open triads with i as the end-node.
When a triangle is actually formed (e.g., with nodes i, j and k), the focal node
i can be viewed as the centre-node in one open triad (ji k) or as the end-node
in two open triads (i jk and i kj). Obviously, Ce (i) ∈ [0, 1].
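The two undirected measures can be compared on a toy graph. The displayed definitions are not reproduced in this text, so the sketch below assumes the standard forms: C_c(i) = T(i) / (d_i (d_i − 1) / 2) for the clustering coefficient, and C_e(i) = 2T(i) / Σ_{j∈N(i)} (d_j − 1) for the closure coefficient, which matches the remarks above about N(i) and the factor of two. Returning zero for undefined values is an assumption for the closure coefficient.

```python
from itertools import combinations


def triangles(adj, i):
    """Number of triangles containing node i."""
    return sum(1 for j, k in combinations(adj[i], 2) if k in adj[j])


def clustering_coefficient(adj, i):
    """Triangles at i over open triads centred at i (0.0 when undefined)."""
    d = len(adj[i])
    if d < 2:
        return 0.0
    return triangles(adj, i) / (d * (d - 1) / 2)


def closure_coefficient(adj, i):
    """Twice the triangles at i over open triads with i as the end-node."""
    open_triads = sum(len(adj[j]) - 1 for j in adj[i])
    if open_triads == 0:
        return 0.0
    return 2 * triangles(adj, i) / open_triads


# Toy graph: triangle 0-1-2 plus a pendant edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
cc0, ce0 = clustering_coefficient(adj, 0), closure_coefficient(adj, 0)
cc2, ce2 = clustering_coefficient(adj, 2), closure_coefficient(adj, 2)
```

On this graph the two measures disagree in opposite directions: node 0 has clustering 1 but closure 2/3, while node 2 has clustering 1/3 but closure 1, which illustrates that they capture different local structure.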
At the network-level, the average closure coefficient is then defined as the
mean of the local closure coefficient over all nodes:
$C_e = \frac{1}{|V|} \sum_{i \in V} C_e(i)$. When we consider a random network
where each pair of nodes is connected with probability p, its expected value
is also p, i.e., $E[C_e] = p$.
Motivated by the closure coefficient and the directed clustering coefficient [18],
we aim to measure the directed 3-clique formation from the end-node of an open
triad. There are eight different directed triangles, and a triangle (or an open
triad) with bidirectional edges is treated as a combination of triangles (or open
triads) with only unidirectional edges.
Let $A = \{a_{ij}\}$ denote the adjacency matrix of a directed graph $G^D = (V, E)$.
$a_{ij} = 1$ if there is an edge from node i to node j, otherwise $a_{ij} = 0$. The degree
of node i is denoted as $d_i$, including both outgoing edges and incoming edges:
$d_i = d_i^{out} + d_i^{in} = \sum_j a_{ij} + \sum_j a_{ji}$. The set of neighbours of
node i is denoted N (i). We now give the definition of the closure coefficient in
directed networks.
T D (i) is multiplied by two since each triangle contains two open triads with
i as the end-node. OTeD (i) is multiplied by two because the closing edge of a
directed open triad can take two directions. Obviously, CeD (i) ∈ [0, 1]. When
the adjacency matrix A is symmetric (the network becomes undirected), Eq. 3
reduces to Eq. 2, i.e., CeD (i) = Ce (i).
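The displayed definition (Eq. 3) is not reproduced in this text. The sketch below is a reconstruction from the properties stated above, not the authors' code: bidirectional edges count as two unidirectional ones, the numerator is twice the number of directed triangles at i, the denominator is twice the number of directed open triads with i as the end-node, and the value reduces to the undirected closure coefficient for a symmetric matrix. The function name and matrix layout are assumptions.

```python
def directed_closure(a, i):
    """Local directed closure coefficient of node i (reconstruction).

    a: 0/1 adjacency matrix as a list of lists, a[u][v] = 1 for an edge
    u -> v, with no self-loops.
    """
    n = len(a)
    # s[u][v] counts the edges between u and v in either direction (0, 1 or 2).
    s = [[a[u][v] + a[v][u] for v in range(n)] for u in range(n)]
    closed = 0  # sum over (j, k) of s_ij * s_jk * s_ik, i.e. 2 * T^D(i)
    triads = 0  # directed open triads with i as the end-node, OT_e^D(i)
    for j in range(n):
        if j == i or s[i][j] == 0:
            continue
        for k in range(n):
            if k == i or k == j:
                continue
            triads += s[i][j] * s[j][k]
            closed += s[i][j] * s[j][k] * s[i][k]
    return closed / (2 * triads) if triads else 0.0


# Directed toy graph with edges 0->1, 1->2, 0->2.
directed = [[0, 1, 1],
            [0, 0, 1],
            [0, 0, 0]]
# Undirected triangle 0-1-2 plus pendant edge 2-3, stored symmetrically.
symmetric = [[0, 1, 1, 0],
             [1, 0, 1, 0],
             [1, 1, 0, 1],
             [0, 0, 1, 0]]
value_directed = directed_closure(directed, 0)
value_symmetric = directed_closure(symmetric, 0)
```

For the symmetric matrix, node 0 gets 2/3, the same value the undirected closure coefficient gives on that graph, which is the reduction property stated above. In the directed toy graph each open triad at node 0 is closed in one of its two possible directions, giving 0.5.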
Similarly, in order to measure at the network-level, we propose the definition
of an average directed closure coefficient.
Please note that Cesrc (i) + Cetgt (i) = CeD (i). These two metrics evaluate the
extent to which the focal node is acting as the source node or the target node of
the closing edges in a triangle formation. In Sect. 4.2, we show how the source
and target closure coefficients can be used to improve the performance in a link
prediction task.
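The displayed definitions of the source and target closure coefficients are not reproduced in this text. As a sketch under the assumption that they split the directed closure coefficient by the direction of the closing edge (out of the focal node for source, into it for target) over the same denominator, the identity stated above can be checked numerically. All names are hypothetical.

```python
def source_target_closure(a, i):
    """Return (source, target) closure coefficients of node i (reconstruction).

    a: 0/1 adjacency matrix as a list of lists, a[u][v] = 1 for an edge
    u -> v, with no self-loops.
    """
    n = len(a)
    src = tgt = triads = 0
    for j in range(n):
        if j == i:
            continue
        s_ij = a[i][j] + a[j][i]
        if s_ij == 0:
            continue
        for k in range(n):
            if k in (i, j):
                continue
            weight = s_ij * (a[j][k] + a[k][j])  # open triads i-j-k
            triads += weight
            src += weight * a[i][k]  # closing edge i -> k: i is the source
            tgt += weight * a[k][i]  # closing edge k -> i: i is the target
    if triads == 0:
        return 0.0, 0.0
    return src / (2 * triads), tgt / (2 * triads)


# Directed toy graph with edges 0->1, 1->2, 0->2.
directed = [[0, 1, 1],
            [0, 0, 1],
            [0, 0, 0]]
node0 = source_target_closure(directed, 0)
node1 = source_target_closure(directed, 1)
node2 = source_target_closure(directed, 2)
```

In this toy graph node 0 only emits closing edges and node 2 only receives them, so their coefficients are (0.5, 0.0) and (0.0, 0.5), while node 1 has one of each. In every case the two values sum to the directed closure coefficient of 0.5.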
Obviously, CeW (i) ∈ [0, 1]. When the weight matrix becomes binary, Eq. 5
degrades to Eq. 2, i.e., CeW (i) = Ce (i).
In a similar approach, the definition of closure coefficient in weighted directed
networks can be extended from Eq. 3. Let us denote $W = \{w_{ij}\}$ as the weight
matrix of a weighted directed graph $G^{W,D}$, $w_{ij} \in [0, 1]$. The strength of node i
is denoted by $s_i$ ($s_i = \sum_j w_{ij} + \sum_j w_{ji}$).
This definition can also be used in weighted signed networks ($w_{ij} \in [-1, 1]$),
with a modified definition of $s_i$ ($s_i = \sum_j |w_{ij}| + \sum_j |w_{ji}|$). In this case,
$C_e^{W,D}(i)$ varies in [−1, 1]. It is positive when positive triangles formed around the focal
node outweigh negative ones. It equals zero when no triangles are formed with the
focal node or when positive and negative triangles are balanced.
Table 1. Statistics of datasets, showing the number of nodes (|V |), the number of
edges (|E|), the average degree (k̄), the proportion of reciprocal edges (r), the average
directed clustering coefficient (CcD ), and the average directed closure coefficient (CeD )
defined in this paper. Datasets having timestamps on edge creation are superscripted
by (τ ). Positively weighted networks are superscripted by (+), and networks having
both positive and negative weights are superscripted by (±).
Table 1 lists some key statistics of these datasets. We see that in all 12 net-
works, the average directed closure coefficient is less than the average directed
clustering coefficient. In these types of networks, we may say that compared
to a triangle formation from centre-node based open triads, fewer triangles are
formed from the end-node based open triads. In some networks (Ado-Health
and Amazon), the difference between them is small, while in Q&A networks
the two differ by a factor of more than 40.
From the scatter plots of the local directed closure coefficient and the local
directed clustering coefficient (Fig. 3), we can see their relationship more clearly.
First, the Pearson correlation is positive but weak (ranging from 0.134 to 0.759).
Secondly, similar networks exhibit similar relationships between the two vari-
ables, as in two trust networks BTC-Alpha and Epinions, in two citation net-
works Arxiv-HepPh and US-Patent or in two Q&A networks AskUbuntu
and StackOverflow.
Many studies [33–37] have shown that future interactions among nodes can be
extracted from the network topology information. The key idea is to compare the
proximity or similarity between pairs of nodes, either from the neighbourhoods
[34,35], the local structures [36] or the whole network [37].
Fig. 3. Scatter plots of the local directed closure coefficient and the local directed
clustering coefficient, with the Pearson correlation coefficient.
choose 50% edges and related nodes as $G_{old}$ and repeat 10 times in the
experiment ($r_1 = 10$).
Let $E_{new}$ be the set of future edges among the nodes in $V^*$, which is also what
we aim to predict. The total number of potential links on node set
$V^*$ is $|V^*|^2 - |E_{old}|$. We apply each prediction method to output a list containing
the similarity scores for all potential links in descending order, denoted $L_p$. The
intersection of $L_p[0 : |E_{new}|]$ and $E_{new}$ gives us the set of correctly predicted
links, denoted $E_{true}$. The precision is then calculated as $|E_{true}|/|E_{new}|$.
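The precision computation described above can be sketched in a few lines. The dictionary of scores and the toy values are hypothetical; ties in the ranking are broken by insertion order here, which is an assumption since the text does not address ties.

```python
def precision_at_new(scores, new_edges):
    """Precision of the top-|E_new| ranked candidate links.

    scores: dict mapping candidate links (u, v) to similarity scores.
    new_edges: iterable of links that actually appear later (E_new).
    """
    new_edges = set(new_edges)
    if not new_edges:
        raise ValueError("E_new must not be empty")
    ranked = sorted(scores, key=scores.get, reverse=True)  # L_p, descending
    top = ranked[: len(new_edges)]                          # L_p[0 : |E_new|]
    correct = new_edges.intersection(top)                   # E_true
    return len(correct) / len(new_edges)


scores = {(0, 1): 0.9, (0, 2): 0.8, (2, 0): 0.5, (1, 2): 0.1}
precision = precision_at_new(scores, [(0, 1), (2, 0)])
```

Only one of the two top-ranked candidates is a true future link in this toy case, so the precision is 0.5.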
For large networks (|V | > 10K), we perform a randomised breadth first
search sampling of 5K nodes on GD and repeat the above procedures r2 times
according to the size of the dataset. Therefore, for large networks without times-
tamps we run the experiment r1 ∗ r2 = 10 ∗ r2 times.
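A randomised breadth first search sample can be sketched as below. The exact randomisation used in the experiments is not specified in the text, so this version assumes a uniformly random start node and a random visiting order among neighbours; how edge direction is treated when building the neighbour sets is also left to the caller.

```python
import random
from collections import deque


def random_bfs_sample(adj, size, rng):
    """Sample up to `size` nodes by BFS from a random start node.

    adj: dict mapping each node to a set of neighbours (hypothetical layout).
    rng: a random.Random instance, so that runs are reproducible.
    """
    if size < 1:
        raise ValueError("size must be positive")
    start = rng.choice(sorted(adj))
    seen = {start}
    queue = deque([start])
    while queue and len(seen) < size:
        node = queue.popleft()
        neighbours = list(adj[node])
        rng.shuffle(neighbours)  # randomise the visiting order
        for nb in neighbours:
            if nb not in seen:
                seen.add(nb)
                queue.append(nb)
                if len(seen) == size:
                    break
    return seen


# Toy ring of 10 nodes; the paper samples 5K nodes from much larger graphs.
ring = {i: {(i - 1) % 10, (i + 1) % 10} for i in range(10)}
sample = random_bfs_sample(ring, 4, random.Random(0))
whole = random_bfs_sample(ring, 100, random.Random(1))
```

When the reachable component is smaller than the requested size, the sketch returns the whole component rather than restarting from a new seed.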
Results and Discussion. We compare three baseline methods with the two
proposed methods (Definition 6) in Table 2. The closure closeness index
(CCI) records the highest precision in 5 networks, and the extra closure
closeness index (ECCI) records the highest precision in 4 networks. This
suggests that in most networks, including the local structure information of the closure
coefficient improves link prediction. Sometimes the improvement
is significant: in Ado-Health and Epinions, ECCI is over 30% better than the
baseline methods. In the other six networks (CollegeMsg, Digg-Friends,
US-Patent, AskUbuntu, StackOverflow and Amazon), the precision of
CCI or ECCI is over 5% higher than that of the baselines.
5 Conclusion
In this paper, we introduce the directed closure coefficient and its extension as
another measure of edge clustering in complex directed networks. To better use
it, we further propose the source and target closure coefficients. Through exper-
iments on 12 real-world networks, we show that the proposed metric not only
provides complementary information to the classic directed clustering coefficient
but also helps to make some interesting discoveries in network analysis. Fur-
thermore, we demonstrate that including closure coefficients in link prediction
leads to significant improvement in most directed networks. We anticipate that
the directed closure coefficient can be used as a descriptive feature as well as in
other network analysis tasks.
References
1. Newman, M.E.: The structure and function of complex networks. SIAM Rev. 45,
167–256 (2003)
2. Barabási, A.-L., et al.: Network Science. Cambridge University Press, Cambridge
(2016)
3. Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D., Alon, U.: Network
motifs: simple building blocks of complex networks. Science 298, 824–827 (2002)
4. Palla, G., Derényi, I., Farkas, I., Vicsek, T.: Uncovering the overlapping community
structure of complex networks in nature and society. Nature 435, 814–818 (2005)
5. Solava, R.W., Michaels, R.P., Milenković, T.: Graphlet-based edge clustering
reveals pathogen-interacting proteins. Bioinformatics 28, i480–i486 (2012)
6. Noble, C.C., Cook, D.J.: Graph-based anomaly detection. In: KDD (2003)
7. LaFond, T., Neville, J., Gallagher, B.: Anomaly detection in networks with chang-
ing trends. In: ODD2 Workshop (2014)
8. Henderson, K., Gallagher, B., Eliassi-Rad, T., Tong, H., Basu, S., Akoglu, L.,
Koutra, D., Faloutsos, C., Li, L.: RolX: structural role extraction & mining in
large graphs. In: KDD (2012)
9. Musial, K., Juszczyszyn, K.: Motif-based analysis of social position influence on
interconnection patterns in complex social network. In: ACIIDS. IEEE (2009)
10. Schall, D.: Link prediction in directed social networks. Soc. Netw. Anal. Min. 4(1),
157 (2014)
Closure Coefficient in Complex Directed Networks 73
11. Juszczyszyn, K., Kazienko, P., Musial, K.: Local topology of social network based
on motif analysis. In: KES. Springer (2008)
12. Rapoport, A.: Spread of information through a population with socio-structural
bias: I. Assumption of transitivity. Bull. Math. Biophys. 15, 523–533 (1953)
13. Wu, Z.-X., Holme, P.: Modeling scientific-citation patterns and other triangle-rich
acyclic networks. Phys. Rev. E 80, 037101 (2009)
14. Josang, A., Hayward, R.F., Pope, S.: Trust network analysis with subjective logic
(2006)
15. Watts, D.J., Strogatz, S.H.: Collective dynamics of ‘small-world’ networks. Nature
393, 440–442 (1998)
16. Onnela, J.-P., Saramäki, J., Kertész, J., Kaski, K.: Intensity and coherence of
motifs in weighted complex networks. Phys. Rev. E 71, 065103 (2005)
17. Zhang, B., Horvath, S.: A general framework for weighted gene co-expression net-
work analysis. Stat. Appl. Genet. Mol. Biol. (2005)
18. Fagiolo, G.: Clustering in complex directed networks. Phys. Rev. E 76, 026107
(2007)
19. Kunegis, J., Lommatzsch, A., Bauckhage, C.: The slashdot zoo: mining a social
network with negative edges. In: WWW (2009)
20. Wang, J., Li, M., Wang, H., Pan, Y.: Identification of essential proteins based on
edge clustering coefficient. TCBB 9, 1070–1080 (2011)
21. Yin, H., Benson, A.R., Leskovec, J.: The local closure coefficient: a new perspective
on network clustering. In: WSDM (2019)
22. Moody, J.: Peer influence groups: identifying dense clusters in large networks. Soc.
Netw. 23, 261–283 (2001)
23. Hogg, T., Lerman, K.: Social dynamics of Digg. EPJ Data Sci. 1, 5 (2012)
24. Kumar, S., Hooi, B., Makhija, D., Kumar, M., Faloutsos, C., Subrahmanian, V.:
REV2: fraudulent user prediction in rating platforms. In: WSDM. ACM (2018)
25. Massa, P., Avesani, P.: Controversial users demand local trust metrics: an experi-
mental study on Epinions.com community. In: AAAI (2005)
26. Leskovec, J., Huttenlocher, D., Kleinberg, J.: Predicting positive and negative links
in online social networks. In: WWW (2010)
27. Panzarasa, P., Opsahl, T., Carley, K.M.: Patterns and dynamics of users’ behavior
and interaction: network analysis of an online community. J. Am. Soc. Inf. Sci.
Technol. 60, 911–932 (2009)
28. Leskovec, J., Kleinberg, J., Faloutsos, C.: Graph evolution: densification and
shrinking diameters. ACM Trans. Knowl. Discov. Data (TKDD) 1, 2 (2007)
29. Hall, B.H., Adam, B.: 13 the NBER patent-citations data file: lessons, insights,
and methodological tools. Patents, citations, and innovations: A window on the
knowledge economy (2002)
30. Paranjape, A., Benson, A.R., Leskovec, J.: Motifs in temporal networks. In: WSDM
(2017)
31. Leskovec, J., Adamic, L.A., Huberman, B.A.: The dynamics of viral marketing.
ACM Trans. Web (TWEB) 1, 5 (2007)
32. Leskovec, J., Lang, K.J., Dasgupta, A., Mahoney, M.W.: Statistical properties of
community structure in large social and information networks. In: WWW (2008)
33. Liben-Nowell, D., Kleinberg, J.: The link-prediction problem for social networks.
J. Am. Soc. Inf. Sci. Technol. 58, 1019–1031 (2007)
34. Adamic, L.A., Adar, E.: Friends and neighbors on the web. Soc. Netw. 25, 211–230
(2003)
35. Zhou, T., Lü, L., Zhang, Y.-C.: Predicting missing links via local information. Eur.
Phys. J. B 71, 623–630 (2009)
74 M. Jia et al.
36. Perozzi, B., Al-Rfou, R., Skiena, S.: DeepWalk: online learning of social represen-
tations. In: KDD (2014)
37. Meo, P.D.: Trust prediction via matrix factorisation. ACM Trans. Internet Technol.
(TOIT) 19, 1–20 (2019)
38. Zhang, X., Zhao, C., Wang, X., Yi, D.: Identifying missing and spurious interactions
in directed networks. Int. J. Distrib. Sens. Netw. 11, 507386 (2015)
Nondiagonal Mixture of Dirichlet
Network Distributions for Analyzing
a Stock Ownership Network
1 Introduction
Block modeling has been widely used in studies on complex networks [1,2]. The
goal of block modeling is to uncover the latent group memberships of nodes
responsible for generating the complex network. The uncovered latent block
structure is used for both prediction and interpretation. For prediction, block
modeling is used to find missing or spurious edges [3,4]. For interpretation,
the estimated latent block structure provides a coarse-grained summary of the
linkage structure that is particularly useful in complex networks, which are often
messy at the primary level.
W. Zhang and R. Hisano—These authors made equal contributions.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 943, pp. 75–86, 2021.
https://doi.org/10.1007/978-3-030-65347-7_7
76 W. Zhang et al.
The cornerstone model of block modeling is the stochastic block model (SBM)
[5–7]. In the SBM, each node is assigned to a block. The edge probability in the
network is governed solely by the linkage probability defined among these blocks.
The goal of the SBM is to find the latent block structure and the linkage
probability among the blocks. With only a single block, the model collapses
to the Erdős–Rényi–Gilbert type random graph model that dates back to the
1950s [8,9].
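The SBM description above can be sketched in a few lines. This is a minimal illustration, not the formulation of [5–7]: we assume a directed graph without self-loops, and the names `sample_sbm`, `blocks`, and `link_prob` are hypothetical. Passing a single block reduces the sampler to an Erdős–Rényi–Gilbert type random graph.

```python
import numpy as np

def sample_sbm(block_of, link_prob, rng):
    """Sample a directed 0/1 adjacency matrix from a stochastic block model."""
    block_of = np.asarray(block_of)
    # The edge probability for (i, j) depends only on the blocks of i and j.
    p = link_prob[block_of[:, None], block_of[None, :]]
    adj = (rng.random(p.shape) < p).astype(int)
    np.fill_diagonal(adj, 0)  # assumption: no self-loops
    return adj

rng = np.random.default_rng(0)

# Two blocks of 50 nodes each; the block matrix need not be symmetric.
blocks = np.repeat([0, 1], 50)
link_prob = np.array([[0.30, 0.02],
                      [0.10, 0.25]])
adj = sample_sbm(blocks, link_prob, rng)

# With a single block, every pair shares one edge probability.
er = sample_sbm(np.zeros(100, dtype=int), np.array([[0.1]]), rng)
```

Because every node pair receives an independent Bernoulli draw, the expected number of edges in this sketch grows quadratically with the number of nodes for fixed probabilities.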
The fact that the random graph model cannot reproduce basic properties,
such as the sparsity and heavy-tailed degree distribution of complex networks,
has always been an issue [1,10]. The failure of random graph models to repro-
duce these basic properties has recently been re-examined from the perspective
of node exchangeability [11]. From the graphon formulation [12] and Aldous–
Hoover theorem [13,14], it can be proven that the only possible network in the
random graph model setting is either dense or empty [11,15]. This limitation
makes the SBM rather unsuitable for modeling complex linkage patterns found
in various complex networks.
Several authors have proposed models that go beyond such limitations using
these modern findings. One line of research uses exchangeable point processes
to generate the linkage patterns in a network [16]. In their formulation, edges
appear when a pair of nodes occur in a nearby time position in the point pro-
cesses. [16] showed that this formulation could generate sparse networks. Another
line of research focuses on a more intuitive edge generation process based on edge
exchangeability [11,15,17,18]. Edge exchangeable models have been proven to
generate a sparse and heavy-tailed network. [19] proposed a model that consid-
ers the latent community structure in addition to the edge exchangeable frame-
work. They called their model the mixture of Dirichlet network distributions
(MDND) [19].
However, the MDND oversimplifies the latent block structure, limiting it to
only the diagonal case, similar to community detection algorithms. These limita-
tions are problematic in instances in which the flow of influences (or information)
among the blocks is the focus of research. One such example is the stock owner-
ship network. In this setting, companies consider direct ownership and indirect
ownership to maximize their influence and minimize risks [20]. A simple diagonal
block structure only provides community-like clustering of companies, which is
unsatisfactory.
In this paper, we provide a nondiagonal extension of the MDND (the
NDMDND model) that makes it possible to estimate both the diagonal and
nondiagonal latent block structure. Our model imposes no limitations beyond
those of the MDND; it flexibly infers the number of blocks and considers the
possibility of unseen nodes. It is noteworthy that our model can be regarded
as a nonparametric extension of the sparse block model [21]. The sparse block
model is a precursor model that focused on edge exchangeability even before
the connection between sparse graphs and edge exchangeability was rigorously
proven. We highlight both models in this paper.
Nondiagonal Mixture of Dirichlet Network Distributions 77
2 Related Works
2.1 Sparse Block Model
In this section, we provide a brief explanation of the sparse block model. We use
the notation (sn , rn ) to denote the nth edge of the network, and cn = (csn , crn )
to denote the block pair to which that nth edge is assigned. We use Ak to
define the node proportion distribution that captures which nodes are probable
in block k. We use Dir() to denote the Dirichlet distribution and Cat() the
categorical distribution, where the parameters are written inside the parentheses.
We summarize the generative process as follows:
(A) Initialization
For each block k = 1, . . . , K,
we draw the node proportions Ak ∼ Dir(τ )
(B) Sampling of block pairs and edges
For each edge (sn , rn ),
(1) sample the block pair cn = (csn , crn ) ∼ Cat(θ)
(2) sample the sender node from sn ∼ Cat(Acsn )
(3) sample the receiver node from rn ∼ Cat(Acrn ).
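The generative process above can be sketched as follows. This is an illustration under stated assumptions rather than the implementation of [21]: θ is taken to be a K × K matrix of block-pair probabilities that sums to one, Dir(τ) is taken to be symmetric with a scalar τ, and the function and variable names are hypothetical.

```python
import numpy as np

def sample_sparse_block_model(n_edges, n_nodes, theta, tau, rng):
    """Sample a multigraph edge list following steps (A) and (B)."""
    n_blocks = theta.shape[0]
    # (A) Node proportions A_k ~ Dir(tau), one row per block.
    node_props = rng.dirichlet(np.full(n_nodes, tau), size=n_blocks)
    edges = []
    for _ in range(n_edges):
        # (B1) Block pair (c_s, c_r) ~ Cat(theta), drawn over flattened pairs.
        flat = rng.choice(n_blocks * n_blocks, p=theta.ravel())
        c_s, c_r = divmod(int(flat), n_blocks)
        # (B2) Sender node from the sender block's node proportions.
        s = int(rng.choice(n_nodes, p=node_props[c_s]))
        # (B3) Receiver node from the receiver block's node proportions.
        r = int(rng.choice(n_nodes, p=node_props[c_r]))
        edges.append((s, r))
    return node_props, edges

rng = np.random.default_rng(1)
theta = np.array([[0.40, 0.30],
                  [0.05, 0.25]])  # nondiagonal block-pair probabilities
A, edges = sample_sparse_block_model(500, 30, theta, tau=0.5, rng=rng)
```

Repeated (sender, receiver) pairs in `edges` correspond to the multi-edges that serve as a proxy for edge weights.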
First, note that in the sparse block model, the latent block structure is defined in
advance. The goal of the sparse block model is to infer the probability of each
block to generate nodes (i.e., Ak ), and the probability of each block pair (i.e.,
cn ) appearing from a given network. Having to specify the latent
block structure is a major limitation: we have to provide both the
number of blocks to use and the block pairs’ interaction patterns before seeing
the data. Second, note that the same node pairs could appear multiple times in
this setting (i.e., multigraph). These multiple edges could be used as a proxy for
the edge weights. Although we could add a link function that links the proxy
edge weights to the continuous edge weights, in this paper, we make the simple
assumption that these multiple occurrences of an edge describe the weights of
an edge. Finally, note also that the number of nodes used in the network is fixed;
it does not increase as we sample more edges in the process.
process to the Chinese restaurant franchise process [23]. The Chinese restaurant
franchise process introduces an auxiliary assignment variable called a “table”.
By separating the growth of the popularity of tables and the assignment of nodes
(i.e., a “dish” in the terminology of [23]) to the table, we can make multiple
Chinese restaurant processes share the same set of nodes. We use CRP (α) to
denote the Chinese restaurant process with hyperparameter α. We use subscripts
to discern the multiple Chinese restaurant processes used in the model. We use
snt and rnt to denote the tables assigned to the sender and receiver nodes,
respectively, which originate from the Chinese
restaurant franchise process. α, τ , and γ are hyperparameters of the model. The
generative process is as follows:
(A) Sampling of diagonal blocks
For each edge sample cn ∼ CRPB (α) where csn is always equal to crn
(B) Sampling of edges
(1) Sample a table for the sender node: snt ∼ CRPcn (γ)
if snt is a new table, then sample sn ∼ CRPN (γ)
else sn is assigned the same node as snt
(2) Sample a table for the receiver node: rnt ∼ CRPcn (γ)
if rnt is a new table, then sample rn ∼ CRPN (γ)
else rn is assigned the same node as rnt .
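The steps above can be sketched with a single helper that performs one Chinese restaurant process draw. This is a simplified illustration, not the implementation of [19]: it uses γ for both the per-block table processes and the shared node process, exactly as the steps are written here, and all function and variable names are hypothetical.

```python
import numpy as np

def crp_draw(counts, concentration, rng):
    """Return an existing index with probability proportional to its count,
    or len(counts) (a new item) with probability proportional to concentration."""
    weights = np.append(np.asarray(counts, dtype=float), concentration)
    return int(rng.choice(len(weights), p=weights / weights.sum()))

def sample_mdnd_like(n_edges, alpha, gamma, rng):
    block_counts = []   # edges per block, CRP_B(alpha)
    table_counts = []   # per block: customers seated at each table
    table_node = []     # per block: node ("dish") served at each table
    node_counts = []    # tables per node, the shared CRP_N(gamma)

    def draw_endpoint(c):
        t = crp_draw(table_counts[c], gamma, rng)
        if t == len(table_counts[c]):            # new table in block c
            node = crp_draw(node_counts, gamma, rng)
            if node == len(node_counts):         # previously unseen node
                node_counts.append(0)
            node_counts[node] += 1
            table_counts[c].append(0)
            table_node[c].append(node)
        table_counts[c][t] += 1
        return table_node[c][t]

    edges, blocks = [], []
    for _ in range(n_edges):
        # (A) Diagonal block: sender and receiver share the same block c.
        c = crp_draw(block_counts, alpha, rng)
        if c == len(block_counts):
            block_counts.append(0)
            table_counts.append([])
            table_node.append([])
        block_counts[c] += 1
        # (B) Sender first, then receiver, each through a table draw.
        edges.append((draw_endpoint(c), draw_endpoint(c)))
        blocks.append(c)
    return edges, blocks

edges, blocks = sample_mdnd_like(300, alpha=1.0, gamma=2.0,
                                 rng=np.random.default_rng(2))
```

Because new nodes and new blocks are created on demand, neither count has to be fixed before sampling.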
In NDMDND, γblock controls the number of blocks used. A low γblock with
a relatively high τpair would lead to a denser structure, whereas increasing
γblock would make the number of blocks increase. τpair and τblock are trickier to
interpret as both parameters also affect the possibility of considering a new block
or block pair in the model. We further explain this issue in the next section.
3.2 Inference
The inference of NDMDND is rather involved compared with that of the MDND
counterpart. In MDND, the direct sampling scheme is used to avoid the sampling
of table assignments (Sect. 5.3 in [23]). However, in NDMDND, the sampling
of both the table and table-to-block assignments turns out to be much simpler
(Sect. 5.1 in [23]). Moreover, a bonus of explicitly sampling tables is that we do
not need to simulate the node counts (i.e., the number of tables with block k
for a given node i, $\rho^{(1)}_{k,i}$ and $\rho^{(2)}_{k,i}$ in [19]) and can instead evaluate them from our
table assignments. We used these values to estimate the probability of a node
appearing in an edge without block pairs. This probability is defined for both
already seen nodes $\beta_1, \ldots, \beta_J$ and unseen nodes $\beta_u$. A simple sampling relation
derives these $\beta$s:
$$\beta_1, \ldots, \beta_J, \beta_u \sim \mathrm{Dir}\bigl(\rho^{(\cdot)}_{\cdot 1}, \ldots, \rho^{(\cdot)}_{\cdot J}, \gamma\bigr),$$
where $\rho^{(\cdot)}_{\cdot i} = \sum_k \bigl(\rho^{(1)}_{k,i} + \rho^{(2)}_{k,i}\bigr)$
represents the number of tables at which node i (i ∈ {1, · · · , J, J + 1}) is selected
in all the blocks.
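As a minimal sketch of this sampling relation, the following draws the β values from table counts. We assume here that the two count arrays hold the sender-side and receiver-side table counts per block and node; that reading, the array layout, and the function name are assumptions made for illustration.

```python
import numpy as np

def sample_node_weights(rho_send, rho_recv, gamma, rng):
    """Draw (beta_1..beta_J, beta_u) ~ Dir(rho_1, ..., rho_J, gamma).

    rho_send[k, i] and rho_recv[k, i] are assumed to count the tables in
    block k that serve node i on the sender and receiver side, respectively.
    """
    # Total number of tables per node, summed over all blocks and both sides.
    rho_total = rho_send.sum(axis=0) + rho_recv.sum(axis=0)
    beta = rng.dirichlet(np.append(rho_total, gamma))
    return beta[:-1], beta[-1]   # seen-node weights, unseen-node weight

rng = np.random.default_rng(3)
rho_send = np.array([[2, 0, 1],
                     [1, 3, 0]])   # 2 blocks, 3 seen nodes (made-up counts)
rho_recv = np.array([[1, 1, 0],
                     [0, 2, 2]])
beta_seen, beta_unseen = sample_node_weights(rho_send, rho_recv, 1.0, rng)
```

The last component reserves probability mass for nodes that have not yet appeared in any edge.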
Before introducing the inference scheme in more detail, we need to introduce
some further notation. We use $n_{t_p}$, $n_{t_s}$, and $n_{t_r}$ to count the number of edges or
nodes assigned to a particular pair block table t, send block table s, and receive
block table r, respectively. We use $n^{-i}_{t_p}$ to denote the count, ignoring the ith edge.
We sometimes use the subscript i to denote the ith table, as in $t_{ip}$, $t_{is}$, and $k_{t_{is}}$.
· · · Perform exactly the same steps for the receiver block tables
end for
end while
4 Results
4.1 Dataset
Our experiments used two datasets: one containing synthetic data and the other
containing real-world global stock ownership network data. The synthetic data
was created, assuming the sparse block model. The stock ownership network is a
subset of the Thomson Reuters ownership database. We focused on the ownership
of significant assets in the second quarter of 2015. Both can be considered
weighted networks, and the datasets’ basic summary statistics are shown in
Table 1 (see footnote 1). In both datasets, the network is sparse: the synthetic data has 7.2%,
and the stock ownership data has 0.4% of all possible edges. Moreover, both
datasets exhibit a heavy-tailed degree distribution, as shown in Fig. 1.
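For reference, a density figure of this kind can be computed as below. This is only a sketch: we assume the percentage counts distinct ordered node pairs (ignoring multi-edges and self-loops) over the nodes that appear in the edge list, which may differ from the exact convention used for Table 1. The edge list is a made-up toy example.

```python
def directed_density(edges):
    """Fraction of possible ordered node pairs joined by at least one edge."""
    nodes = {v for edge in edges for v in edge}
    distinct = {(s, r) for s, r in edges if s != r}   # collapse multi-edges
    n = len(nodes)
    return len(distinct) / (n * (n - 1))

# Toy multigraph: the pair (0, 1) occurs twice but is counted once.
toy_edges = [(0, 1), (0, 1), (1, 2), (2, 0), (3, 0)]
density = directed_density(toy_edges)   # 4 distinct pairs out of 4 * 3 = 12
```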
The motivation behind using a synthetic dataset is to illustrate whether our
proposed algorithm recovers the ground truth block structure. In this experi-
ment, we used all the edges in the synthetic data for training. Figure 2 shows
the result of running the algorithm for 1,000 epochs (see footnote 2). We confirm that after 100
epochs, the algorithm almost found the right block structure, and after 1,000
epochs, the result became more stable. Thus, we conclude that our model cor-
rectly uncovers the latent block structure of a given network.
1 The weights for the stock ownership data are in percentage terms.
2 One epoch comprises sampling all the edges in the training data once.
Table 1. Datasets
[Fig. 2 panels: (a) Ground truth, (b) Initial state, (c) 100 epochs, (d) 1,000 epochs]
5 Conclusion
In this paper, we proposed an edge exchangeable block model that estimates the
latent block structure of complex networks. Because the model is edge exchangeable,
it reproduces the sparsity and heavy-tailed degree distribution that its random
graph counterpart (i.e., SBM) fails to capture. We tested our model using
one synthetic dataset and one real-world stock ownership dataset and showed
that our model outperformed state-of-the-art models.
References
1. Newman, M.E.J.: Networks: An Introduction. Oxford University Press, Oxford,
New York (2010)
2. Doreian, P., Batagelj, V., Ferligoj, A.: Advances in Network Clustering and Block-
modeling. Wiley, Hoboken (2019)
3. Liben-Nowell, D., Kleinberg, J.: The link-prediction problem for social networks.
J. Am. Soc. Inf. Sci. Technol. 58(7), 1019–1031 (2007)
4. Martínez, V., Berzal, F., Cubero, J.-C.: A survey of link prediction in complex
networks. ACM Comput. Surv. 49(4), 1–33 (2016)
5. Holland, P.W., Leinhardt, S.: An exponential family of probability distributions
for directed graphs. J. Am. Stat. Assoc. 76(373), 33–50 (1981)
6. Nowicki, K., Snijders, T.: Estimation and prediction for stochastic blockmodels for
graphs with latent block structure. J. Classif. 14 (1997)
7. Nowicki, K., Snijders, T.A.B.: Estimation and prediction for stochastic blockstruc-
tures. J. Am. Stat. Assoc. 96(455), 1077–1087 (2001)
8. Erdős, P.: Graph theory and probability. Can. J. Math. 11, 34–38 (1959)
9. Bollobás, B.: Random Graphs. Cambridge Studies in Advanced Mathematics, 2nd
edn. Cambridge University Press, Cambridge (2001)
10. Barabási, A.-L., Albert, R.: Emergence of scaling in random networks. Science
286(5439), 509–512 (1999)
11. Crane, H.: Probabilistic Foundations of Statistical Network Analysis. CRC Press,
Boca Raton (2018)
12. Bickel, P.J., Chen, A.: A nonparametric view of network models and Newman–
Girvan and other modularities. Proc. Natl. Acad. Sci. 106(50), 21068–21073 (2009)
13. Aldous, D.J.: Representations for partially exchangeable arrays of random vari-
ables. J. Multivariate Anal. 11(4), 581–598 (1981)
14. Hoover, D.N.: Relations on probability spaces and arrays of random variables.
Institute for Advanced Study
15. Crane, H., Dempsey, W.: Edge exchangeable models for interaction networks. J.
Am. Stat. Assoc. 113(523), 1311–1326 (2018)
16. Caron, F., Fox, E.B.: Sparse graphs using exchangeable random measures (2017)
17. Cai, D., Campbell, T., Broderick, T.: Edge-exchangeable graphs and sparsity. In:
Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R. (eds.) Advances in
Neural Information Processing Systems, vol. 29, pp. 4249–4257. Curran Associates,
Inc. (2016)
18. Crane, H., Dempsey, W.: Edge exchangeable models for interaction networks. J.
Am. Stat. Assoc. 113, 07 (2017)
19. Williamson, S.A.: Nonparametric network models for link prediction. J. Mach.
Learn. Res. 17(202), 1–21 (2016)
20. Garcia-Bernardo, J., Fichtner, J., Takes, F.W., Heemskerk, E.M.: Uncovering off-
shore financial centers: conduits and sinks in the global corporate ownership net-
work. Sci. Rep. 7(1), 1–10 (2017)
21. Parkkinen, J., Gyenge, A., Sinkkonen, J., Kaski, S.: A block model suitable for
sparse graphs (2009)
22. Phadia, E.: Prior Processes and Their Applications. Nonparametric Bayesian Esti-
mation. Springer, Heidelberg (2013)
23. Teh, Y.W., Jordan, M.I., Beal, M.J., Blei, D.M.: Hierarchical Dirichlet processes.
J. Am. Stat. Assoc. 101(476), 1566–1581 (2006)
1 Introduction
The goal of community detection—one of the most popular topics in statisti-
cal network analysis—is to identify groups of nodes that are more similar to
each other than to other nodes in the network. Determining the number of com-
munities in a given network and the community assignments gives key insight
into the network structure, creating a natural dimensionality reduction of the
data. Moreover, the existence of clusters of highly connected nodes is a fea-
ture of many empirical networks ([6,8]). Though there is growing research for
directed networks ([10,15]), community detection is best understood and most
often implemented on undirected networks. In directed networks, edge direc-
tionality is often fundamental, and communities of nodes may be characterized
by asymmetric relations. Consider, for example, citations, Twitter follows, and
webpage hyperlinks. Properly accounting for edge directionality when analyzing
such network data is very important.
Community detection is a clustering problem and requires an explicit notion
of similarity between nodes. In general, clustering algorithms fall into two cat-
egories. There is model based clustering, which includes fitting procedures of
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 943, pp. 87–99, 2021.
https://doi.org/10.1007/978-3-030-65347-7_8
88 W. R. Palmer and T. Zheng
a model with well-defined clusters, and there are methods motivated by what
clusters of the data objects should look like. These methods specify a related
objective function, and partition the data to optimize it, often approximately.
For points in Rn , Gaussian mixture modeling falls in the first category, while
k-means falls in the second. The most popular community detection algorithms,
including spectral clustering [16] and modularity [6], fall in the second category.
However, these methods have been shown to provide consistent clustering for
certain random graph models ([2,14]).
A broadly applicable method for clustering relational data, spectral clustering
requires a similarity matrix between the data objects. For graph representations
of network data, the adjacency matrix of edge weights provides measures of
similarity between all nodes. Thus spectral clustering is a natural choice for
community detection. Spectral clustering is particularly well understood in the
symmetric, undirected setting [9]. The problem is more complicated in the more
general setting of weighted, directed networks, which we consider. Building
from [21] and [3], this paper presents some of the theory of spectral clustering
for directed networks, as well as two applications.
Section 1.1 motivates spectral clustering, Sect. 1.2 presents its general frame-
work, and Sect. 2 explains our approach to spectral clustering for directed graphs.
Sections 3 and 4 delve into applications—a stochastic block model simulation
study and an analysis of recent cosponsorship data from the U.S. Senate.
1.1 Motivation
In order to motivate the use of spectral clustering for directed networks, we
consider a toy example involving points in R2 .
Figure 1a shows the points we wish to separate into two spiral-shaped clusters.
In Fig. 1b, k-means clustering fails to do this properly, since the clusters that we
wish to capture have overlapping means. In Fig. 1c, spectral clustering properly
separates the points.
Here we have defined a similarity matrix Wij as the inverse Euclidean dis-
tance between points i and j if point j is among point i’s four nearest neighbors,
Directed Spectral Clustering 89
$$\min_{V \in \mathbb{R}^{n \times k},\; V^T V = I} \operatorname{Tr}(V^T L V).$$
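For a symmetric matrix L, this trace minimization is solved by the eigenvectors belonging to the k smallest eigenvalues, and the minimum equals the sum of those eigenvalues (a standard linear algebra result). The sketch below checks this numerically. The random similarity matrix and the unnormalized Laplacian L = D − W are assumptions made purely for illustration and are not the directed Laplacian of this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 3

# Random symmetric similarity matrix with zero diagonal (illustrative only).
W = rng.random((n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0)
L = np.diag(W.sum(axis=1)) - W          # unnormalized graph Laplacian

# eigh returns eigenvalues in ascending order with orthonormal eigenvectors.
eigvals, eigvecs = np.linalg.eigh(L)
V_opt = eigvecs[:, :k]
best = np.trace(V_opt.T @ L @ V_opt)

# Any other matrix with orthonormal columns gives a trace at least as large.
Q, _ = np.linalg.qr(rng.standard_normal((n, k)))
other = np.trace(Q.T @ L @ Q)
```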
method devised particularly for spectral clustering is the eigengap heuristic [9].
It stipulates that we should choose k such that the first (smallest) k eigenvalues
λ1 , ..., λk are relatively small, but λk+1 is relatively large. We follow the eigengap
heuristic in the applications below, choosing values of k such that λk+1 − λk is
relatively large.
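The eigengap heuristic can be sketched as a small helper. The function name, the cap `k_max`, and the example eigenvalues are hypothetical; in practice the eigenvalues would come from the graph Laplacian.

```python
import numpy as np

def eigengap_choice(eigenvalues, k_max):
    """Return k in 1..k_max that maximizes lambda_{k+1} - lambda_k."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))
    # gaps[j] = lam[j + 1] - lam[j], i.e. lambda_{k+1} - lambda_k for k = j + 1.
    gaps = np.diff(lam[:k_max + 1])
    return int(np.argmax(gaps)) + 1

# Three small eigenvalues followed by a clear jump (made-up values).
example_eigenvalues = [0.0, 0.02, 0.05, 0.61, 0.70, 0.74]
k = eigengap_choice(example_eigenvalues, k_max=5)
```

In this example the largest gap lies between the third and fourth eigenvalues, so the helper selects k = 3.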
3 Simulation Study
hand, the results of Algorithm 2 with the symmetrized adjacency matrix Wsym
and k = 3 combine nodes from communities 1 and 2, and split the nodes of
group 3 into two clusters. For k = 2, the symmetrized approach nearly recovers
group 3.
Increasing the size of the simulated network to n = 100 tells a somewhat
similar story, with improvements for the symmetrized approach. Figure 3 shows
the simulated network as adjacency matrix heatmaps. The block structure asso-
ciated with the true groupings in Fig. 3a is clear. While node indices vary across
the three panels, the estimated clusters in Fig. 3b–c are arranged to best align
with the true blocks. Figure 3b shows again the near recovery of the true com-
munity structure by directed spectral clustering. In Fig. 3c it is clear that the
symmetrized approach continues to struggle to separate the communities
correctly.
Across the bottom of Fig. 3b–c is the Normalized Mutual Information (NMI)
measure between the true communities and the estimated clusterings. A value
of 1 indicates exact agreement up to cluster relabeling. NMI is an information
theoretic measure, relating the information needed to infer one cluster from
the other. NMI satisfies desirable normalization and metric properties, and is
adjusted for chance [17].
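To make the measure concrete, the following sketch computes a plain NMI between two labelings. Several normalizations are discussed in [17]; here we assume normalization by the arithmetic mean of the two entropies, and the sketch does not implement any chance adjustment. The function name is hypothetical.

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Mutual information divided by the mean of the two label entropies."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    def entropy(counts):
        return -sum(c / n * math.log(c / n) for c in counts.values())

    mutual_info = sum(
        c / n * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    denom = (entropy(count_a) + entropy(count_b)) / 2
    # Convention: two trivial one-cluster labelings agree perfectly.
    return mutual_info / denom if denom > 0 else 1.0

same_up_to_relabeling = nmi([0, 0, 1, 1, 2, 2], [2, 2, 0, 0, 1, 1])
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

The first call compares identical partitions with permuted labels, which illustrates the agreement up to relabeling described above; the second compares two partitions that carry no information about each other.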
Fig. 4. Results of spectral approaches on 100 simulations at each network size. (a):
Proportion of simulations where the eigengap heuristic correctly chooses k = 3 over
k = 2. (b): NMI between true grouping and estimated clusterings.
4 Congressional Cosponsorship
[Fig. 5: scatter plots of the columns of X* (recovered axis label: “2nd column of X*”), with points labeled by Senator surname; panels (a)–(b) show k = 2 and panels (c)–(d) show k = 3.]
Figure 5 shows results of the two spectral clustering approaches for k = 2 and
k = 3 communities. Here we plot the columns of X ∗ from step 3 of Algorithm 2,
along with the decision boundary separating the detected communities. The col-
ors indicate party affiliation—blue for Democrat, green for Independent, and red
for Republican. The results for k = 2 (Fig. 5a–b) are similar for the directed and
symmetrized approaches, with, respectively, 85 and 86% of Senators clustered
with the majority of their party, excluding independents.
[Fig. 6: geographic layout of Senators by state, each labeled by surname and party (D, R, ID); legend: cluster1, cluster2, cluster3.]
The results for k = 3 (Fig. 5c–d), however, differ greatly between the
two approaches. The directed approach detects balanced and relatively well-
separated communities, two of which align closely with party, and one that con-
tains a mix of Republicans and Democrats mainly from the Plains, Mountain
West, Southwest, and non-contiguous states. Figure 6 shows the full geographic
correspondence of the detected communities. Meanwhile, the symmetrized app-
roach detects one diffuse and separated community of four Democrats and four
Republicans, and splits the remaining Senators roughly along the same lines as
in Fig. 5b.
edges (yellow and green) for the Democrats in the top left. These patterns are
borne out more clearly by the inter and intra community intensities in Fig. 8a.
Here we show the observed cosponsorship counts divided by the number of pairs
of distinct legislators. The rows correspond, in order, to the blue (Democrat),
green (mixed) and red (Republican) communities. The directed approach reflects
inter-community asymmetries, while the symmetrized approach does not.
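The intensities described here can be sketched as follows. We assume that W[i, j] holds the cosponsorship count directed from legislator i to legislator j, that within-community pairs are counted as ordered pairs of distinct legislators, and that a symmetrized matrix is formed as W + Wᵀ; these conventions, the function name, and the toy data are assumptions for illustration.

```python
import numpy as np

def community_intensities(W, labels):
    """Total edge weight between communities divided by the number of pairs."""
    W = np.asarray(W, dtype=float)
    labels = np.asarray(labels)
    groups = np.unique(labels)
    out = np.zeros((len(groups), len(groups)))
    for a, ga in enumerate(groups):
        for b, gb in enumerate(groups):
            rows, cols = labels == ga, labels == gb
            total = W[np.ix_(rows, cols)].sum()
            na, nb = int(rows.sum()), int(cols.sum())
            # Ordered pairs of distinct legislators.
            pairs = na * (na - 1) if a == b else na * nb
            out[a, b] = total / pairs if pairs > 0 else 0.0
    return out

# Toy directed network with two communities of two nodes each.
W = np.array([[0, 2, 1, 0],
              [3, 0, 0, 1],
              [0, 0, 0, 4],
              [0, 0, 5, 0]])
labels = [0, 0, 1, 1]
directed = community_intensities(W, labels)
symmetrized = community_intensities(W + W.T, labels)
```

In the toy example the directed intensity matrix is asymmetric (weight flows from the first community to the second but not back), whereas the symmetrized version is symmetric by construction.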
A notable feature of Fig. 7 and an exception to the patterns discussed above
are four very prominent green edges from Republicans Graham, Lee, Paul and
Young into Menendez (D-NJ) at the top middle, mirrored by three prominent
brown edges into Menendez from Democrats Leahy, Murphy and Reed. These are
precisely the eight Senators clustered together by the symmetrized approach with
k = 3, appearing at the bottom of Fig. 5d. We isolate this star-like subnetwork in
Fig. 8b. Here we include three nodes for the remaining Senators of each detected
community and show the combined weighted edges between the eight individual
Senators and these ‘remaining’ clusters. The edges flowing into Menendez are
blue, those flowing out from Menendez are red, and the rest are grey.
Each blue edge from the Senators besides Menendez represents more than 23
cosponsorships, combining for a total of 174. Menendez is the minority ranking
member of the Committee on Foreign Relations, and 169 of these cosponsor-
ships involve international affairs legislation. Menendez cosponsors just 4 bills
in return, and the ‘other seven’ have only 18 cosponsorships among themselves.
Meanwhile, all four Democrats exchange heavily with the remaining Senators
in cluster 1, while the Republicans exchange with those remaining in cluster 3.
Considering edge directionality, these eight Senators do not form a natural com-
munity within the context of the entire network. The directed approach reflects
this, splitting these Senators along party lines. Unable to account for the patent
asymmetry, the symmetrized approach allows the high weight of the edges flow-
ing into Menendez to pull these Senators closer together, distorting the entire
embedding, as seen in Fig. 5d, and classifying them as a separate community.
Data Note. Bill cosponsorship data is available from the ProPublica Congress
API [13]. Amendment cosponsorship is obtained directly from Congress.gov [4].
5 Conclusion
In this paper we presented a variation of the general spectral clustering algorithm
adapted for community detection on directed networks. We described the
theoretical motivation behind the directed graph Laplacian, showing its con-
nection to an objective function that reflects a notion of how communities of
nodes in directed networks should behave. We applied our algorithm to sim-
ulated and empirical networks, and found encouraging and insightful results.
When we ignore edge directionality by using a symmetrized adjacency matrix,
we observe different results and worse performance on the simulated networks.
We see clear advantages to taking full account of the directionality of edges
in complex networks. This is an important area of continued research, both from
a theoretical and applied perspective.
References
1. Ng, A.Y., Jordan, M.I., Weiss, Y.: On spectral clustering: analysis and an
algorithm. Neural Inf. Process. Syst. 14, 849–856 (2001).
https://doi.org/10.5555/2980539.2980649
2. Bickel, P.J., Chen, A.: A nonparametric view of network models and Newman-
Girvan and other modularities. Proc. Nat. Acad. Sci. 106(50), 21068–21073 (2009).
https://doi.org/10.1073/pnas.0907096106
3. Chung, F.: Laplacians and the Cheeger inequality for directed graphs. Ann. Comb.
9(1), 1–19 (2005). https://doi.org/10.1007/s00026-005-0237-z
4. Congress.gov: Legislative search results. https://congress.gov/search
5. Fowler, J.H.: Connecting the congress: a study of cosponsorship networks. Polit.
Anal. 14(4), 456–487 (2006). https://doi.org/10.1093/pan/mpl002
6. Girvan, M., Newman, M.E.: Community structure in social and biological
networks. Proc. Nat. Acad. Sci. 99(12), 7823–7826 (2002).
https://doi.org/10.1073/pnas.122653799
7. Jin, J.: Fast community detection by score. Ann. Stat. 43(1), 57–89 (2015).
https://doi.org/10.1214/14-AOS1265
8. Leskovec, J., Lang, K.J., Dasgupta, A., Mahoney, M.W.: Statistical properties of
community structure in large social and information networks. In: Proceedings of
WWW, pp. 695–704 (2008). https://doi.org/10.1145/1367497.1367591
9. von Luxburg, U.: A tutorial on spectral clustering. Stat. Comput. 17, 395–416
(2007). https://doi.org/10.1007/s11222-007-9033-z
10. Malliaros, F., Vazirgiannis, M.: Clustering and community detection in directed
networks. Phys. Rep. 533(4), 95–142 (2013).
https://doi.org/10.1016/j.physrep.2013.08.002
11. Newman, M.E.: Modularity and community structure in networks. Proc. Nat. Acad.
Sci. U. S. A. 103(23), 8577–8582 (2006). https://doi.org/10.1073/pnas.0601602103
12. Oleszek, M.J.: Sponsorship and cosponsorship of house bills. Congressional
Research Service Report RS22477 (2019).
https://fas.org/sgp/crs/misc/RS22477.pdf
13. ProPublica: Congress API (2020).
https://projects.propublica.org/api-docs/congress-api/
Directed Spectral Clustering 99
14. Rohe, K., Chatterjee, S., Yu, B.: Spectral clustering and the high-dimensional
stochastic blockmodel. Ann. Stat. 39(4), 1878–1915 (2011). https://doi.org/10.
1214/11-AOS887
15. Rohe, K., Chatterjee, S., Yu, B.: Co-clustering directed graphs to discover asymme-
tries and directional communities. Proc. Nat. Acad. Sci. U. S. A. 113(45), 12679–
12684 (2016). https://doi.org/10.1073/pnas.1525793113
16. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern
Anal. Mach. Intell. 22(8), 888–905 (2000). https://doi.org/10.1109/34.868688
17. Vinh, N.X., Epps, J.R., Bailey, J.: Information theoretic measures for clusterings
comparison: Variants, properties, normalization and correction for chance. J. Mach.
Learn. Res. 11, 2837–2854 (2010). https://doi.org/10.5555/1756006.1953024
18. Holland, P.W., Laskey, K.B., Leinhardt, S.: Stochastic blockmodels: first steps.
Soc. Netw. 5(2), 109–137 (1983). https://doi.org/10.1016/0378-8733(83)90021-7
19. Xiang, T., Gong, S.: Spectral clustering with eigenvector selection. Pattern Recogn.
41(3), 1012–1029 (2008). https://doi.org/10.1016/j.patcog.2007.07.023
20. Zhang, Y., Friend, A., Traud, A.L., Porter, M.A., Fowler, J.H., Mucha, P.J.: Com-
munity structure in congressional cosponsorship networks. Physica 387(7), 1705–
1712 (2008). https://doi.org/10.1016/j.physa.2007.11.004
21. Zhou, D., Huang, J., Schoelkopf, B.: Learning from labeled and unlabeled data on
a directed graph. In: Proceedings of the 22nd International Conference on Machine
Learning (2005). https://doi.org/10.1145/1102351.1102482
Composite Modularity and Parameter
Tuning in the Weight-Based Fusion
Model for Community Detection
in Node-Attributed Social Networks
1 Introduction
A variety of different methods for ASN CD have already been proposed [4,6].
Although they are widely used in applications, some of them have a serious log-
ical gap that stems from the discrepancy between the objective functions within
the CD optimization process and the functions used for CD quality evaluation [6].
Indeed, it is rather questionable that one function is optimized within the CD
process and another function (that is not directly related to the latter) is used for
evaluating the optimization results. We emphasize that we talk about structure-
and attributes-aware quality measures and do not consider those estimating the
agreement between the detected communities and the ground truth ones. There
are specialized studies of the latter [6, Sect. 9] and we refer an interested reader
to them. Note that the above-mentioned gap may cause misinterpretation of CD
results and difficulties with the CD method parameter tuning.
In this paper, we reveal such a gap in the weight-based fusion model (WBFM)
that is rather popular for ASN CD [6]. (We will give the description of WBFM
under consideration in Sect. 2 and overview existing WBFMs in Sect. 3.) In par-
ticular, we show that there is a discrepancy between the so-called Composite
Modularity that is usually optimized within WBFM and the corresponding CD
quality measures (Modularity, Entropy, etc.), see Sect. 2. To fill this gap, we
theoretically study how Composite Modularity is related to the CD quality mea-
sures called Structural and Attributive Modularities, where the latter is the sub-
stitution of Entropy (Sect. 4). From a more general point of view, we actually
establish the connection between Modularities of two graphs and Modularity of
the graph whose edge weights are linear combinations of edge weights of the two
graphs. Our theoretical results further bring us to a simple parameter tuning
scheme that provides the equal impact of structure and attributes on CD results
(Sect. 5). It is worth mentioning that it is the first non-manual tuning scheme
of this type within WBFM. Experiments with synthetic and real-world ASNs
(Sect. 6) show that our conclusions allow for a reasonable interpretation of CD
results within WBFM and that our tuning scheme is very accurate.
Recall that (V, E) is called the structure and (V, A) the attributes of the ASN.
1 An edge weight may be zero and this indicates that there is no social connection.
2 If one deals with nominal or textual attributes, it is common to use one-hot encoding or embeddings to obtain their numerical representation.
102 P. Chunaev et al.
The general WBFM can be thought of as first converting G into two weighted
complete graphs by a certain rule: structural graph GS = (V, E, WS ) and attribu-
tive graph GA = (V, E, WA ), where WS = {wS (eij )} and WA = {wA (eij )} are
the sets of edge weights in each graph. For convenience, we suppose that
$$\sum_{ij} w_S(e_{ij}) = 1, \qquad \sum_{ij} w_A(e_{ij}) = 1. \tag{1}$$
Furthermore, the two graphs are fused to obtain the weighted graph Gα =
(V, E, Wα ), where the elements of Wα = {wα (eij )}, with eij ∈ E, are as follows:
$$w_\alpha(e_{ij}) = \alpha w_S(e_{ij}) + (1-\alpha) w_A(e_{ij}), \qquad \sum_{ij} w_\alpha(e_{ij}) = 1, \qquad \alpha \in [0, 1]. \tag{2}$$
Here α is the fusion parameter that controls the impact of GS and GA . Note
that G1 = GS and G0 = GA by construction.
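The fusion rule (2) can be sketched in a few lines. This is only an illustration, assuming that each graph is stored as a dictionary mapping an unordered node pair (i, j), i < j, to its weight and that a missing pair means zero weight. The helper names `normalize` and `fuse` and the toy weights are hypothetical and do not come from the paper.

```python
def normalize(weights):
    """Rescale edge weights so that they sum to 1, as required by (1)."""
    total = sum(weights.values())
    return {edge: w / total for edge, w in weights.items()}


def fuse(w_s, w_a, alpha):
    """Edge weights of G_alpha from (2); a missing edge counts as weight 0."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    edges = set(w_s) | set(w_a)
    return {e: alpha * w_s.get(e, 0.0) + (1 - alpha) * w_a.get(e, 0.0)
            for e in edges}


# Toy example (hypothetical data): three nodes, edges keyed as (i, j) with i < j.
w_s = normalize({(0, 1): 2.0, (1, 2): 2.0})   # structural weights
w_a = normalize({(0, 2): 1.0, (1, 2): 3.0})   # attributive weights
w_alpha = fuse(w_s, w_a, alpha=0.25)
```

Because both inputs satisfy (1), the fused weights again sum to 1 for any α in [0, 1], which is the normalization stated in (2).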
Recall that community detection (CD) in G consists in the unsupervised division of
V into K disjoint communities $C_k \subset V$, with $C = \{C_k\}_{k=1}^{K}$, such that
$V = \bigcup_{k=1}^{K} C_k$, and a certain balance between the following properties is achieved [4,6]:
(i) structural closeness, i.e. nodes in a community are more densely connected
than those in different communities; (ii) attributive homogeneity, i.e. nodes in a
community have similar attributes, while those in different ones do not.
As for WBFM, the CD problem consists in the unsupervised division of $G_\alpha$ into K
disjoint communities $C_{k,\alpha} \subset V$, with $C_\alpha = \{C_{k,\alpha}\}_{k=1}^{K}$, such that $V = \bigcup_{k=1}^{K} C_{k,\alpha}$
and nodes in $C_{k,\alpha}$ are structurally close and homogeneous in terms of attributes.
Since one deals with a weighted graph Gα within WBFM, classical graph
CD methods can be applied for finding Cα . A popular choice [6] is Louvain [3]
aiming at maximizing Modularity [14], a measure of divisibility of a graph into
clusters. In the context of WBFM, the maximization of Modularity of Gα is
implicitly thought to provide structural closeness and attributive homogeneity
also in G (similar schemes are applied e.g. in [7–10]).
For completeness, let us first precisely define the involved measures (Modu-
larity and Entropy). Note that the former works with graphs and the latter with
sets of vectors. The definitions below are for G and C of a general form.
Modularity of graph G for a partition C is as follows:
$$Q(G, C) = \frac{1}{2m} \sum_{ij} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta_{ij} \in [-1, 1], \tag{3}$$
where i and j are from 1 to n; Aij is the edge weight between nodes vi and vj ;
ki and kj are the weighted degrees of vi and vj , respectively; m is the sum of
all edge weights in G; and δij = 1 if ci and cj , the community labels of nodes vi
and vj , coincide, and δij = 0 otherwise.
Modularity of graph G is then defined as Q(G) = maxC Q(G, C).
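For illustration, Q(G, C) from (3) can be evaluated directly from the edge weights. The sketch below assumes an undirected graph stored as one dictionary entry per edge and a dictionary of community labels; the function name and the toy graph are hypothetical. Finding Q(G) itself requires a search over partitions (e.g. by Louvain [3]) and is not attempted here.

```python
def modularity(weights, labels):
    """Q(G, C) from (3).

    weights: {(i, j): w} with one entry per undirected edge (i != j).
    labels:  {node: community label} for every node.
    """
    m = sum(weights.values())                 # sum of all edge weights
    degree = {v: 0.0 for v in labels}         # weighted degrees k_i
    internal = 0.0                            # weight of intra-community edges
    for (i, j), w in weights.items():
        degree[i] += w
        degree[j] += w
        if labels[i] == labels[j]:
            internal += w
    degree_sum = {}                           # sum of k_i per community
    for v, k in degree.items():
        degree_sum[labels[v]] = degree_sum.get(labels[v], 0.0) + k
    # The double sum over ordered pairs (i, j) sees each internal edge twice,
    # and sum_ij k_i k_j delta_ij equals the sum of squared community degrees.
    return internal / m - sum(s * s for s in degree_sum.values()) / (4 * m * m)


# Hypothetical toy graph: two disjoint unit-weight edges.
edges = {(0, 1): 1.0, (2, 3): 1.0}
q_split = modularity(edges, {0: "a", 1: "a", 2: "b", 3: "b"})
q_merged = modularity(edges, {0: "a", 1: "a", 2: "a", 3: "a"})
```

Splitting the toy graph into its two edges gives a positive value, while putting all nodes into one community gives zero, as expected from (3).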
Entropy H measures the degree of disorder of attribute vectors A within
communities. To unify notation, we define Entropy of node-attributed graph G
for a partition C for the case of binary D-dimensional vectors as follows:
$$H(G, C) = \sum_{C_k \in C} \frac{|C_k|}{|V|} H(C_k) \in [0, 1], \qquad H(C_k) = -\sum_{d=1}^{D} \frac{\phi(p_{k,d})}{D \ln 2}, \tag{4}$$
where φ(x) = x ln x + (1 − x) ln(1 − x) and pk,d is the proportion of nodes in the
community Ck with the same value on dth attribute.
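A minimal sketch of (4) for binary attribute vectors follows. It assumes that p_{k,d} is taken as the share of nodes in C_k whose dth attribute equals 1 (φ is symmetric, so the share of zeros gives the same value) and uses the convention φ(0) = φ(1) = 0. The function names and the toy data are hypothetical.

```python
import math


def phi(x):
    """x ln x + (1 - x) ln(1 - x), with the convention phi(0) = phi(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return x * math.log(x) + (1.0 - x) * math.log(1.0 - x)


def entropy(attributes, labels):
    """H(G, C) from (4).

    attributes: {node: tuple of 0/1 values of length D}
    labels:     {node: community label}
    """
    communities = {}
    for node, community in labels.items():
        communities.setdefault(community, []).append(node)
    n_nodes = len(labels)
    dim = len(next(iter(attributes.values())))
    total = 0.0
    for nodes in communities.values():
        h_k = 0.0
        for d in range(dim):
            p = sum(attributes[v][d] for v in nodes) / len(nodes)
            h_k -= phi(p) / (dim * math.log(2))
        total += len(nodes) / n_nodes * h_k   # weight by community size
    return total


# Hypothetical toy data: four nodes with two binary attributes each.
attributes = {0: (1, 0), 1: (1, 0), 2: (0, 1), 3: (0, 1)}
h_pure = entropy(attributes, {0: "a", 1: "a", 2: "b", 3: "b"})
h_mixed = entropy(attributes, {0: "a", 1: "b", 2: "a", 3: "b"})
```

Communities with identical attribute vectors reach the lower bound 0, whereas communities split evenly on every attribute reach the upper bound 1.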
Thus the CD task within WBFM is as follows: (i) one finds Cα maximizing
Modularity of Gα , (ii) the Cα is used to calculate Q(GS , Cα ) and H(G, Cα ). Let
us emphasize that this scheme is applied implicitly and there are no theoretical
studies why this jump from Q(Gα , Cα ) to Q(GS , Cα ) and H(G, Cα ) is reasonable.
We call this issue Problem A below. Another issue of the scheme (denoted by
Problem B below) is that Entropy that deals with vectors is unnatural for WBFM
aiming at representing G in a unified graph form. Moreover, Entropy may not be
informative within WBFM as experiments in [5] and our own ones in Sect. 6.3
show. We will provide the solutions to Problems A and B in Sect. 4.
3 Related Works
WBFMs based on (2) have been widely tested on synthetic and real-world ASNs
and have shown their superiority in CD quality to other CD models, see [1,2,7,10,
12,13,15,16]. Furthermore, there are many particular versions of (2) but it seems
that the most balanced one is that from [5] with the following normalization:
$$w_S(e_{ij}) = \frac{\mu(e_{ij})}{\sum_{e_{ij} \in E} \mu(e_{ij})}, \qquad w_A(e_{ij}) = \frac{\nu(e_{ij})}{\sum_{e_{ij} \in E} \nu(e_{ij})}, \tag{5}$$
CD results equal. In fact, α is usually chosen manually and such a choice may
not be fully justified. For example, WBFMs [1,7,13,16] use α = 0, while WBFMs
[10,12,15] set α = 0.5 in experiments, suggesting that this achieves the equal impact.
What is more, we are unaware of any paper devoted to the study of the
above-mentioned gap in WBFM, besides [6] where the problem is stated.
4 Theoretical Study
We first resolve Problem B by substituting Entropy (4) by Q(GA , C), i.e. Mod-
ularity of attributive graph GA for a partition C. In these terms, if C is that
maximizing Q(GA , C), then the links between nodes in each community in C
have high attributive weights, i.e. the node attributes therein are homogeneous
by construction. Additionally, the proposed measure works with graphs (as opposed
to Entropy, which works with vectors) and naturally appears in WBFM as is
seen from the results below. Moreover, it is more informative than Entropy as
the experiments in Sect. 6.3 show.
Now we turn to Problem A about the connection of Modularity Q(Gα ) and
the CD quality measures (in our case, Q(GS , Cα ) and Q(GA , Cα )). The solution
to Problem A is provided by the following theoretical results for a fixed G.
Theorem 1. For any partition C, it holds that
Q(Gα , C) = αQ(GS , C) + (1 − α)Q(GA , C) + α(1 − α)Q(GS , GA , C), (6)
where Q(GS , GA , C) counts the difference of node degrees in GS and GA and is
precisely defined in (9).
Proof. Fix a partition C. We first rewrite the ingredients of (3) in terms of (2):
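$$A_{ij} = \alpha w_S(e_{ij}) + (1-\alpha) w_A(e_{ij}), \qquad m = \sum_{ij} w_\alpha(e_{ij}) = 1,$$
$$k_h = \sum_{l,\, l \neq h} \left( \alpha w_S(e_{hl}) + (1-\alpha) w_A(e_{hl}) \right), \qquad h \in \{i, j\}. \tag{7}$$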
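Furthermore, if $k_h^{\star} = \sum_{l,\, l \neq h} w_{\star}(e_{hl})$, where $h \in \{i, j\}$ and $\star \in \{S, A\}$, then
$$k_i k_j = \left( \alpha k_i^S + (1-\alpha) k_i^A \right) \left( \alpha k_j^S + (1-\alpha) k_j^A \right) = \alpha^2 k_i^S k_j^S + \alpha(1-\alpha) \left( k_i^S k_j^A + k_i^A k_j^S \right) + (1-\alpha)^2 k_i^A k_j^A. \tag{8}$$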
If one takes (7) and (8) into account, (3) can be rewritten in the form
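$$Q(G_\alpha, C) = \alpha \cdot \frac{1}{2} \sum_{ij} \left( w_S(e_{ij}) - \frac{1}{2} \alpha k_i^S k_j^S \right) \delta_{ij} + (1-\alpha) \cdot \frac{1}{2} \sum_{ij} \left( w_A(e_{ij}) - \frac{1}{2} (1-\alpha) k_i^A k_j^A \right) \delta_{ij} - \alpha(1-\alpha) \cdot \frac{1}{2} \sum_{ij} \frac{1}{2} \left( k_i^S k_j^A + k_i^A k_j^S \right) \delta_{ij}.$$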
ij
Extracting Q(GS , C) and Q(GA , C) from this by (1) and (3) yields (6), where
$$Q(G_S, G_A, C) = \frac{1}{4} \sum_{ij} (k_i^S - k_i^A)(k_j^S - k_j^A)\, \delta_{ij}. \tag{9}$$
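Identity (6) is easy to check numerically. The sketch below assumes small random complete graphs normalized as in (1) and an arbitrary partition; all names, the seed and the sizes are hypothetical choices made for the check, not part of the paper's experiments.

```python
import itertools
import random


def degrees(weights, n):
    """Weighted node degrees k_h for edge weights keyed by (i, j), i < j."""
    k = [0.0] * n
    for (i, j), w in weights.items():
        k[i] += w
        k[j] += w
    return k


def modularity(weights, labels):
    """Q(G, C) from (3) for weights normalized as in (1), i.e. m = 1."""
    n = len(labels)
    k = degrees(weights, n)
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                a_ij = 0.0 if i == j else weights[(min(i, j), max(i, j))]
                q += 0.5 * (a_ij - 0.5 * k[i] * k[j])
    return q


def differential(w_s, w_a, labels):
    """Q(G_S, G_A, C) from (9)."""
    n = len(labels)
    k_s, k_a = degrees(w_s, n), degrees(w_a, n)
    return 0.25 * sum((k_s[i] - k_a[i]) * (k_s[j] - k_a[j])
                      for i in range(n) for j in range(n)
                      if labels[i] == labels[j])


def random_complete_weights(n, rng):
    """Random positive weights on the complete graph, summing to 1."""
    raw = {e: rng.random() for e in itertools.combinations(range(n), 2)}
    total = sum(raw.values())
    return {e: w / total for e, w in raw.items()}


rng = random.Random(0)
n, alpha = 8, 0.3
w_s = random_complete_weights(n, rng)
w_a = random_complete_weights(n, rng)
w_alpha = {e: alpha * w_s[e] + (1 - alpha) * w_a[e] for e in w_s}
labels = [rng.randrange(3) for _ in range(n)]   # arbitrary partition

lhs = modularity(w_alpha, labels)
rhs = (alpha * modularity(w_s, labels)
       + (1 - alpha) * modularity(w_a, labels)
       + alpha * (1 - alpha) * differential(w_s, w_a, labels))
q_dif = differential(w_s, w_a, labels)
```

Both sides agree up to floating-point rounding for any choice of α and partition, since the identity is purely algebraic.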
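$$\sum_{ij} k_i k_j \delta_{ij} = \sum_{i} k_i \sum_{j:\, c_j = c_i} k_j = \sum_{k=1}^{K} \sum_{v_i \in C_k} k_i \sum_{v_j \in C_k} k_j = \sum_{k=1}^{K} \Big( \sum_{v_i \in C_k} k_i \Big)^2.$$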
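What is more, $\sum_{ij} k_i^S k_j^A \delta_{ij} = \sum_{ij} k_i^A k_j^S \delta_{ij}$. Therefore by expanding (9) we get
$$Q(G_S, G_A, C) = \frac{1}{4} \sum_{k=1}^{K} \Big[ \Big( \sum_{v_i \in C_k} k_i^S \Big)^2 - 2 \sum_{v_i \in C_k} k_i^S \sum_{v_i \in C_k} k_i^A + \Big( \sum_{v_i \in C_k} k_i^A \Big)^2 \Big] = \frac{1}{4} \sum_{k=1}^{K} \Big( \sum_{v_i \in C_k} (k_i^S - k_i^A) \Big)^2.$$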
The last expression is non-negative. This fact and (9) yield (10). Finally, (11)
follows from (6) by (10). The sharpness of (11) follows from (6).
Note that Theorems 1 and 2 connect Modularities of two graphs and Modularity
of the graph whose weights are linear combinations of weights of the two graphs.
It seems a key result for analysis of Modularity-based models for ASN CD.
We continue by introducing additional notation that simplifies further expo-
sition. For the partition Cα such that Q(Gα ) = Q(Gα , Cα ), Theorem 1 gives:
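$$Q^{\alpha}_{\mathrm{com}} = Q^{\alpha}_{\mathrm{str}} + Q^{\alpha}_{\mathrm{attr}} + Q^{\alpha}_{\mathrm{dif}}, \qquad Q^{\alpha}_{\mathrm{str}} = \alpha Q(G_S, C_\alpha), \qquad Q^{\alpha}_{\mathrm{attr}} = (1-\alpha) Q(G_A, C_\alpha), \qquad Q^{\alpha}_{\mathrm{dif}} = \alpha(1-\alpha) Q(G_S, G_A, C_\alpha), \tag{12}$$
where we call $Q^{\alpha}_{\mathrm{com}} = Q(G_\alpha)$ Composite, $Q^{\alpha}_{\mathrm{str}}$ Structural, $Q^{\alpha}_{\mathrm{attr}}$ Attributive and
$Q^{\alpha}_{\mathrm{dif}}$ Differential Modularity, correspondingly.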
Thus within WBFM we maximize Composite Modularity $Q^{\alpha}_{\mathrm{com}}$, which consists
not only of the two components used for quality evaluation (Structural Modularity $Q^{\alpha}_{\mathrm{str}}$
and Attributive Modularity $Q^{\alpha}_{\mathrm{attr}}$) but also of the additional Differential Modularity $Q^{\alpha}_{\mathrm{dif}}$
that counts the difference of node degrees in GS and GA. It moreover follows from
Theorem 2 that WBFM under consideration does not provide optimal values of
$Q^{\alpha}_{\mathrm{str}} + Q^{\alpha}_{\mathrm{attr}}$ for α ∈ (0, 1) if $Q^{\alpha}_{\mathrm{dif}} \neq 0$. In particular, this means that WBFM is
Theorem 3. Let $Q^{\alpha}_{\mathrm{str}} > 0$ and $Q^{\alpha}_{\mathrm{attr}} > 0$ for any α ∈ [0, 1]. If it holds that
$$\alpha = \frac{Q(G_A, C_\alpha)}{Q(G_S, C_\alpha) + Q(G_A, C_\alpha)}.$$
This and the conditions (14) immediately imply that (15) holds for α instead
of α∗. Furthermore, the conditions (14) guarantee that $Q^{\alpha}_{\mathrm{str}}$ and $Q^{\alpha}_{\mathrm{attr}}$ are
well-approximated uniformly for any α ∈ [0, 1] by αQ(GS) and (1 − α)Q(GA),
correspondingly. These facts imply that (15) particularly holds for α = α∗.
$$\tilde{\alpha} = \frac{Q(G_A)}{Q(G_S) + Q(G_A)} \tag{16}$$
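As a small illustration of (16), the tuned parameter only needs the two Modularities Q(GS) and Q(GA). The sketch below assumes these have already been estimated (for instance by running a Modularity-maximizing method on each graph separately); the function name and the numeric values are hypothetical.

```python
def alpha_tilde(q_s, q_a):
    """Fusion parameter from (16), given Q(G_S) and Q(G_A)."""
    if q_s + q_a <= 0:
        raise ValueError("Q(G_S) + Q(G_A) must be positive")
    return q_a / (q_s + q_a)


# Hypothetical Modularity values, for illustration only.
q_s, q_a = 0.6, 0.2
alpha = alpha_tilde(q_s, q_a)
structural_part = alpha * q_s            # approximates the Structural Modularity
attributive_part = (1 - alpha) * q_a     # approximates the Attributive Modularity
```

With this choice the two approximated terms coincide, which is the equal-impact condition the tuning scheme aims at.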
6 Experiments
Recall that graphs G, GS and GA are complete and weighted according to the
definitions in Sect. 2. Note that some edge weights may be zero and then one
Fig. 1. Modularities in the graph pair ER+BA: (a) Q(GS ) ≈ Q(GA ) (ER-based GS
of 500 nodes and 6210 edges with equal non-zero weights, p = 0.05; BA-based GA of
500 nodes and 6331 edges with equal non-zero weights, m = 13), (b) Q(GS ) < Q(GA )
(ER-based GS of 500 nodes and 24886 edges with equal non-zero weights, p = 0.2;
BA-based GA of 500 nodes and 6331 edges with equal non-zero weights, m = 13).
Fig. 2. Modularities for the experiment with the ER-based structural graph GS (500
nodes and 6210 edges with equal non-zero weights, p = 0.05) and the star-based attribu-
tive graph GA (500 nodes and 499 edges with equal non-zero weights).
should think that there is no structural or attributive link between the corre-
sponding nodes. Below, if we say that a (complete) graph has M edges with
non-zero weights, then edge weights in the graph are set zero for all edges except
for M of them. Let us emphasize that the precise parameters of experiments
necessary for reproducing the results are indicated in the figure captions.
Now we generate random graphs GS and GA by the well-known Erdős-Rényi
(ER) and Barabási-Albert (BA) models which are standard for modelling social
networks. The models are chosen as they produce graphs with different node
degree distributions and thus may influence the behaviour of Differential
Modularity $Q^{\alpha}_{\mathrm{dif}}$. What is more, we generate GS and GA in pairs ER+ER, ER+BA
and BA+BA. In each pair we consider the following two cases: (a) when Q(GS )
and Q(GA ) are almost equal and (b) when one of Q(GS ) and Q(GA ) is greater
than the other. These proportions can be achieved by varying the number of
edges with equal non-zero weights in GS and GA .
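The following sketch shows one way such graph pairs could be generated. It is an assumption that the experiments used generators of this kind: the ER model keeps each node pair independently with probability p, and the BA model attaches every new node to m distinct existing nodes chosen proportionally to degree, which yields (n − m)·m edges (6331 for n = 500, m = 13, the count quoted in the captions of Fig. 1). Function names and seeds are hypothetical.

```python
import itertools
import random


def er_edges(n, p, rng):
    """Erdos-Renyi graph: keep each of the n(n-1)/2 pairs with probability p."""
    return [(i, j) for i, j in itertools.combinations(range(n), 2)
            if rng.random() < p]


def ba_edges(n, m, rng):
    """Barabasi-Albert graph grown from m isolated seed nodes."""
    edges = []
    targets = list(range(m))     # the first new node attaches to the seeds
    repeated = []                # each node appears once per incident edge
    for new in range(m, n):
        edges.extend((t, new) for t in targets)
        repeated.extend(targets)
        repeated.extend([new] * m)
        chosen = set()           # m distinct targets, drawn by degree
        while len(chosen) < m:
            chosen.add(rng.choice(repeated))
        targets = list(chosen)
    return edges


def equal_weights(edges):
    """Equal non-zero weights summing to 1; all other pairs have weight 0."""
    return {e: 1.0 / len(edges) for e in edges}


g_s = er_edges(500, 0.05, random.Random(1))   # structural graph
g_a = ba_edges(500, 13, random.Random(1))     # attributive graph
w_s, w_a = equal_weights(g_s), equal_weights(g_a)
```

Raising p for the ER graph while keeping the BA graph fixed changes the number of non-zero edges in GS only, which is the kind of variation described above.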
It turns out that the results obtained in each pair are very similar qualita-
tively (and even quantitatively). For this reason we provide and analyze only
the results for the pair ER+BA, see Fig. 1. First note that in both cases $Q^{\alpha}_{\mathrm{dif}}$
vanishes for all α ∈ [0, 1]. Thus even the difference in degree distribution does
not make its values large. It can also be observed that the intersection point of
Structural and Attributive Modularities $Q^{\alpha}_{\mathrm{str}}$ and $Q^{\alpha}_{\mathrm{attr}}$ (corresponding to α∗ that
makes (13) valid) is closer to α = 1 when Q(GS) is less than Q(GA), see
Fig. 1(b). In the opposite case, it is close to α = 0 (not shown), and it is close to
α = 0.5 when the Modularities are almost equal, see Fig. 1(a).
The experiments performed so far hint that values of $Q^{\alpha}_{\mathrm{dif}}$ are always vanishing.
However, this is not true as Fig. 2 shows. In this experiment GS is ER-based
and GA is a star graph, if one excludes the edges with zero weights. The difference
in node degree distributions is so notable that the maximal value of $Q^{\alpha}_{\mathrm{dif}}$
for α ∈ [0, 1] is rather separated from zero. This result emphasizes how interestingly
WBFM may work for non-zero $Q^{\alpha}_{\mathrm{dif}}$. Note that $Q^{\alpha}_{\mathrm{str}}$ and $Q^{\alpha}_{\mathrm{attr}}$ have opposite
signs for α ∈ [0, 1] here so that α∗ providing the equal impact of structure and
attributes may be thought to be 0.
Fig. 3. Modularities for (a) WebKB Washington (b) PolBlogs (c) Sinanet (d) Cora.
7 Conclusions
It is proved analytically in this paper that there is a logical gap in the well-known
WBFM for ASN CD. This gap stems from the fact that optimal values of Com-
posite Modularity optimized within WBFM do not generally provide those of
Structural and Attributive Modularities that are the corresponding WBFM CD
quality measures. Indeed, it turns out that Composite Modularity additionally
includes the non-negative Differential Modularity, which may be well separated from
zero in some special cases. At the same time, it is observed in experiments that
it surprisingly vanishes in many cases of synthetic and real-world ASNs thus
making WBFM optimal for providing the balance of structural closeness and
attributive homogeneity in the above-mentioned terms.
Moreover, the identity for Composite Modularity and its usable terms pro-
posed in this paper yield a simple and accurate parameter tuning scheme that
gives clear meaning to and provides the equal impact of structure and attributes
on the WBFM CD results. Note that it is the first non-manual one of this type.
Finally, it is also worth saying that we consider the theoretical results pre-
sented in this paper as a fundamental base for our current analytical comparative
study of several Modularity-based ASN CD models in terms of CD quality. Such
References
1. Akbas, E., Zhao, P.: Graph clustering based on attribute-aware graph embedding.
In: Karampelas, P., Kawash, J., Özyer, T. (eds.) From Security to Community
Detection in Social Networking Platforms, pp. 109–131. Springer, Cham (2019)
2. Alinezhad, E., Teimourpour, B., Sepehri, M.M., Kargari, M.: Community detec-
tion in attributed networks considering both structural and attribute similarities:
two mathematical programming approaches. Neural Comput. Appl. 32, 3203–3220
(2020)
3. Blondel, V.D., Guillaume, J.L., Lambiotte, R., Lefebvre, E.: Fast unfolding of com-
munities in large networks. J. Stat. Mech. Theor. Exp. 2008(10), P10008 (2008)
4. Bothorel, C., Cruz, J., Magnani, M., Micenková, B.: Clustering attributed graphs:
models, measures and methods. Netw. Sci. 3(3), 408–444 (2015)
5. Chunaev, P., Nuzhdenko, I., Bochenina, K.: Community detection in attributed
social networks: a unified weight-based model and its regimes. In: 2019 Interna-
tional Conference on Data Mining Workshops (ICDMW), pp. 455–464 (2019)
6. Chunaev, P.: Community detection in node-attributed social networks: a survey.
Comput. Sci. Rev. 37, 100286 (2020)
7. Combe, D., Largeron, C., Egyed-Zsigmond, E., Gery, M.: Combining relations and
text in scientific network clustering. In: Proceedings of the 2012 International Con-
ference on Advances in Social Networks Analysis and Mining, ASONAM 2012, pp.
1248–1253 (2012)
8. Cruz, J., Bothorel, C., Poulet, F.: Entropy based community detection in aug-
mented social networks. In: International Conference on Computational Aspects
of Social Networks, pp. 163–168 (2011)
9. Cruz, J., Bothorel, C., Poulet, F.: Détection et visualisation des communautés dans
les réseaux sociaux. Revue d’intelligence artificielle 26, 369–392 (2012)
10. Dang, T.A., Viennet, E.: Community detection based on structural and attribute
similarities. In: International Conference on Digital Society, pp. 7–14 (2012)
11. McPherson, M., Smith-Lovin, L., Cook, J.M.: Birds of a feather: homophily in
social networks. Ann. Rev. Sociol. 27(1), 415–444 (2001)
12. Meng, F., Rui, X., Wang, Z., Xing, Y., Cao, L.: Coupled node similarity learning
for community detection in attributed networks. Entropy 20(6), 471 (2018)
13. Neville, J., Adler, M., Jensen, D.: Clustering relational data using attribute and
link information. In: Proceedings of the Text Mining and Link Analysis Workshop,
18th International Joint Conference on Artificial Intelligence, pp. 9–15 (2003)
14. Newman, M.E.J., Girvan, M.: Finding and evaluating community structure in net-
works. Phys. Rev. E 69, 026113 (2004)
15. Ruan, Y., Fuhry, D., Parthasarathy, S.: Efficient community detection in large net-
works using content and links. In: Proceedings of the 22nd International Conference
on World Wide Web, WWW 2013, pp. 1089–1098 (2013)
16. Steinhaeuser, K., Chawla, N.V.: Identifying and evaluating community structure
in complex networks. Pattern Recogn. Lett. 31(5), 413–421 (2010)
Maximal Labeled-Cliques
for Structural-Functional Communities
Debajyoti Bera
1 Introduction
A clique represents a set of mutually related entities in a network and has played
an important role in community detection and graph clustering [6,19]. Many
network analysis methods, e.g., clique-percolation method [18] and maximal clique
centrality [4], rely on the set of maximal cliques of a graph. Therefore, it is natural
to ask how to extend these results to networks with additional information.
One way to extend cliques would be to incorporate attributes on the nodes.
The last decade has witnessed a massive increase in the collection of richer net-
work datasets. These datasets not only contain the inter-entity relationships,
but they also contain additional attributes (aka. “labels”) associated with each
entity. For example, social network datasets contain both “structural relation-
ships” (social links between users) and “functional attributes” (interests, likes,
tags, etc.). A recent experimental study concluded that real-life communities
are formed more on the basis of functional attributes of entities (like interests of
users, functions of genes, etc.) rather than their “structural attributes” (those
defined using cliques, cuts, etc.) [25]. Naturally, given both structural and functional
information, we expect to find communities that are bonded on both.
The notion of cliques playing the role of seeds in a community structure
ought to be strengthened if we also mandate functional similarity. In this work
we address the question “what is the role of such cliques in discovering cohesive
structural-functional clusters?” We are aware of only two prior solutions for this
problem. Modani et al. [13] resolved the problem of finding “like-minded com-
munities in a social network” by reducing it to that of finding maximal cliques
in an unlabeled graph. Their solution was applying any graph clustering tech-
nique on a subgraph constructed using those maximal cliques. Motivated by a
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 943, pp. 112–123, 2021.
https://doi.org/10.1007/978-3-030-65347-7_10
similar problem, Wan et al. [24] studied the problem of finding communities
that are strongly related in terms of both node attributes and inter-node rela-
tionships; their solution was a heuristic to avoid generating all maximal cliques.
To the best of our knowledge, the first comprehensive graph-theoretic model for
structural-functional clusters was given by Bera et al. [2] in the form of maximal
cliques of entities with a maximal set of shared labels, aka. MLMCs. In that work
the authors presented the idea, gave an algorithm to find those structures, and
merely suggested a use for finding communities. In this work we outline tools
and methods to employ MLMCs to analyse networks.
Overview of Results: We answer two specific questions. First, how to analyse a
graph with the help of its MLMCs? In particular, what would be the statistics of
MLMCs in a random graph? And, how far is a network from attaining stability,
i.e., when the structural and functional linkages have converged to the same?
To answer these questions, we propose a null model for labeled-graphs, and then
use this null model to define structural-functional divergence.
The communities that we focus on in this work are built on cliques; however, a
clique in itself may be too strict a definition for a community. We devise an exten-
sion of the clique-percolation method [18] to labeled-graphs named CBCPM
that incorporates similarity of labels also while constructing communities. For
evaluating the functional cohesion of the communities found by our algorithm,
we devise a new metric ΦC to overcome a shortcoming of the likemindedness
measure proposed earlier [13].
The interest in labeled graphs has recently gained popularity and there are
now quite a few techniques for clustering them [1,5]. However, every clustering
technique emphasises a different notion of community and it appears to be dif-
ficult to decide one clear winner. The relevance of this paper is limited only to
the scenarios where clique-based communities are logical.
[Fig. 1: (a) Graph G1; (b) Bipartite graph G2; (c) Labeled-graph GL; (d) Joined-graph GJ; (e) Maximal LCs in GL]
U and V (E1 ) but none among vertices in U . It was shown by Bera et al. [2]
that a labeled graph can be treated as a joined graph and vice versa.
An MLMC (a maximal clique with a maximal set of labels) is a labeled-clique
that does not remain an LC if we add any further vertex or label.
All of these concepts can be understood with the help of Fig. 1. Figure 1a shows a network
of entities {A, B, C, D, E} as the general graph G1 and Fig. 1b shows their
association with labels from {a, b, c, d, e} as the bipartite graph G2. Figure 1c
shows a labeled-graph GL that combines the information from G1 and G2, and
({a, e}, {A, C}) is an LC in GL.
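The two definitions can be turned into simple membership tests. The sketch below reads an LC as a set of vertices that are pairwise adjacent together with a set of labels carried by every one of them, and an MLMC as an LC that no extra vertex or label can extend. The function names, the edge set and the labelings are hypothetical toy data chosen for illustration; they are not claimed to reproduce Fig. 1.

```python
from itertools import combinations


def is_labeled_clique(adjacency, node_labels, vertices, labels):
    """True if `vertices` form a clique and every vertex carries all `labels`."""
    pairwise_adjacent = all(v in adjacency[u]
                            for u, v in combinations(vertices, 2))
    labels_shared = all(set(labels) <= node_labels[v] for v in vertices)
    return pairwise_adjacent and labels_shared


def is_mlmc(adjacency, node_labels, vertices, labels):
    """True if the labeled clique cannot be extended by any vertex or label."""
    vertices, labels = set(vertices), set(labels)
    if not is_labeled_clique(adjacency, node_labels, vertices, labels):
        return False
    all_labels = set().union(*node_labels.values())
    more_vertices = any(
        is_labeled_clique(adjacency, node_labels, vertices | {v}, labels)
        for v in set(adjacency) - vertices)
    more_labels = any(
        is_labeled_clique(adjacency, node_labels, vertices, labels | {l})
        for l in all_labels - labels)
    return not (more_vertices or more_labels)


# Hypothetical labeled graph: edges A-B, A-C, B-C, C-D, D-E.
adjacency = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"},
             "D": {"C", "E"}, "E": {"D"}}
node_labels = {"A": {"a", "c", "d", "e"}, "B": {"a", "e"},
               "C": {"a", "d", "e"}, "D": {"c", "e"}, "E": {"a", "b"}}
```

In this toy graph ({a, e}, {A, C}) is an LC but not an MLMC, because vertex B can still be added; both ({a, e}, {A, B, C}) and ({a, d, e}, {A, C}) are MLMCs.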
Examples: We present two examples to illustrate how MLMCs can help in
analysing networks. Figure 2 presents the number-vs-size distribution of the
MLMCs of two social-network datasets with tens of thousands of links and
labelings (representing “user interests”). Not only do the numbers of MLMCs of
different sizes follow markedly different distributions; observe also that the number
of MLMCs with 5 (or 3 or 4) users is mostly the same in the “Last.fm” dataset,
whereas the same number follows a rapidly decreasing trend in the “The Marker
Cafe” dataset. Our explanation is that users of networks based on user-ratings
(Last.fm) do not necessarily compare and correlate their ratings but users of a
social network (The Marker Cafe) have a natural tendency to bond over shared
interests. Such insights are attractive for targeted advertisement and personal-
ized recommendation.
Table 1 shows some of the patterns we obtained by analysing the MLMCs of a
DBLP dataset of papers published within 1984–2011 in data mining and related
venues [23]—considering only the top venues and authors with 40+ papers in
Table 1. Groups of prolific authors who share (pairwise) a common coauthor but are
not collaborators despite having concurrent papers at common venues
Authors @ Venues
Philip S. Yu, Heikki Mannila, Tao Li @ TKDE (2008, 2009), Know. Inf. Sys. (2005–2008, 2010, 2011), ICDM (2002, 2006), SDM (2008–2010)
Jian Pei, Christos Faloutsos, Wei Fan @ ICDM (2005, 2006, 2008, 2010), SDM (2007, 2008, 2011), CIKM (2009), KDD (2004, 2006, 2008–2011)
them. We wanted to know which scientists are not collaborators but could easily
be so. For that we constructed a labeled-graph of scientists in which the labels
represented the venues of their papers. We linked two scientists if they
do not have a joint paper but share a common coauthor, which roughly indicates a
shared interest. We discovered 58 MLMCs that consisted of at least 3 authors
and at least 10 venues; two such MLMCs are shown in Table 1. All such MLMCs
represent potential collaborative groups that could have been formed due to
familiarity (common coauthors) and concurrency (same venues).
3 Community Detection
Now we discuss how our labeled-cliques can help us find tightly bonded communities.
Our objective is to establish a proof-of-concept application of MLMCs; in
reality, each network requires its own bespoke notion of community. The reader
may refer to a recent survey [5] for many such techniques for labeled graphs.
3.1 Null Model for Labeled-Graphs
A common tool in network analysis is a null model, that is, a random graph with
specific desirable properties. Null models are used to analyse networks, e.g., to assess the distinctness
of a network from a randomly formed one, the quality of a network clustering [7], etc.
For example, the well-known notion of network modularity [16] uses a null model
that preserves the expected degree of vertices. We use a null model that preserves
the degree distribution of labeled-graphs. Given a labeled-graph G = (V, E, L, l),
consider an equivalent joined-graph and denote its bipartite component as GB
and its general component as GN. We define the null model for the labeled-graph G by
simply joining the null models for GN and GB, which we describe below.
Null Model for GN: For the non-bipartite component we use the well-studied
Configuration Model (CM) [14,15], which creates a degree-preserving random
graph. Consider a random approach that starts with an empty graph, picks
two of the “unsaturated” vertices uniformly at random and connects them by
an edge; a vertex is saturated when its number of edges equals its degree in GN .
Null Model for GB : We extend CM and generate a random graph with the
same degrees as in GB. The BiCM null model [20] also generates graphs with the same
properties; however, it uses entropy maximization, unlike our combinatorial
approach. We will nevertheless denote our model by BiCM as well.
We will follow the exact same approach as in CM and add edges between
two randomly chosen unsaturated vertices, one each from L and V . Clearly, the
final random graph has the same degrees as in GB and also the same number
of edges. Favoring simplicity, we allow the random graph to have multiple edges
between vertices just like in CM.
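Both null models can be sampled by stub matching, the standard construction also used in the proof sketch of Lemma 1: every vertex receives as many stubs as its degree and stubs are paired at random. This is a sketch under that assumption rather than the exact sampling procedure of the paper; multiple edges (and, in the non-bipartite case, self-loops) are allowed, and all names and degree sequences are hypothetical.

```python
import random
from collections import Counter


def configuration_model(degrees, rng):
    """Degree-preserving random multigraph via stub matching."""
    stubs = [v for v, d in degrees.items() for _ in range(d)]
    if len(stubs) % 2:
        raise ValueError("the degree sum must be even")
    rng.shuffle(stubs)
    return list(zip(stubs[0::2], stubs[1::2]))


def bipartite_configuration_model(node_degrees, label_degrees, rng):
    """Random node-label multigraph that preserves both degree sequences."""
    node_stubs = [v for v, d in node_degrees.items() for _ in range(d)]
    label_stubs = [l for l, d in label_degrees.items() for _ in range(d)]
    if len(node_stubs) != len(label_stubs):
        raise ValueError("both sides must have the same number of stubs")
    rng.shuffle(label_stubs)
    return list(zip(node_stubs, label_stubs))


rng = random.Random(7)
# Hypothetical degree sequences of a small labeled graph.
structural_degrees = {"A": 2, "B": 2, "C": 3, "D": 2, "E": 1}
labels_per_node = {"A": 4, "B": 2, "C": 3, "D": 2, "E": 2}
nodes_per_label = {"a": 4, "b": 1, "c": 2, "d": 2, "e": 4}

edges = configuration_model(structural_degrees, rng)
labelings = bipartite_configuration_model(labels_per_node, nodes_per_label, rng)
realised = Counter(v for edge in edges for v in edge)
```

Joining `edges` and `labelings` gives one random labeled-graph with the same structural degrees and the same label counts as the input.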
Next we state a technical lemma on the expected number of common labels
in BiCM. Consider any labeled-graph G from BiCM, and further, consider any
two nodes u, v ∈ V and any label l ∈ L. Let $N^{l}_{u,v}$ denote the indicator variable
that is 1 iff l is a label of both u and v; further, let $N_{u,v} = \sum_{l \in L} N^{l}_{u,v}$
denote the number of common labels. Let m denote the number of edges, du and
dv denote the degrees of u and v and cl denote the number of nodes which have
the label l.
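Lemma 1. The expected value of $N_{u,v}$ is
$$\sum_{\substack{l \in L \\ c_l \ge 2}} \frac{1}{\binom{m}{c_l}} \sum_{\substack{r = 1 \ldots d_u \\ g = 1 \ldots d_v \\ r + g \le c_l}} \binom{d_u}{r} \binom{d_v}{g} \binom{m - d_u - d_v}{c_l - r - g}.$$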
Proof (sketch). A standard approach is to attach deg(x) stubs to a vertex x
and connect to unassigned stubs at each step. Then the probability of selecting
cl stubs from the nodes, where there are du stubs from u, dv stubs from v
and (m − du − dv ) other stubs, follows a trivariate hypergeometric distribution.
$E[N^{l}_{u,v}]$, which is the same as the probability of selecting at least one stub each of u and
v, can now be easily calculated, from which the lemma follows.
An equivalent, but easier to compute, expression for $E[N^l_{u,v}]$ can be obtained
by applying the Chu-Vandermonde identity:
$$E[N^l_{u,v}] = \left[ \binom{m}{c_l} + \binom{m - d_u - d_v}{c_l} - \binom{m - d_u}{c_l} - \binom{m - d_v}{c_l} \right] \Big/ \binom{m}{c_l}$$
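The two expressions for $E[N^l_{u,v}]$ can be checked against each other numerically. The sketch below is illustrative only; the function names and the example values (m = 10, d_u = 2, d_v = 3, c_l = 4) are assumptions chosen for the check, not data from the paper.

```python
from math import comb


def p_common_sum(m, du, dv, cl):
    """E[N^l_{u,v}] as the double sum of Lemma 1 (trivariate hypergeometric)."""
    total = 0
    for r in range(1, du + 1):
        for g in range(1, dv + 1):
            if r + g <= cl:
                total += comb(du, r) * comb(dv, g) * comb(m - du - dv, cl - r - g)
    return total / comb(m, cl)


def p_common_closed(m, du, dv, cl):
    """The same quantity in the closed form obtained by inclusion-exclusion."""
    numerator = (comb(m, cl) + comb(m - du - dv, cl)
                 - comb(m - du, cl) - comb(m - dv, cl))
    return numerator / comb(m, cl)


def expected_common_labels(m, du, dv, label_sizes):
    """E[N_{u,v}]: sum of the per-label probabilities over labels with c_l >= 2."""
    return sum(p_common_closed(m, du, dv, cl) for cl in label_sizes if cl >= 2)


a = p_common_sum(10, 2, 3, 4)
b = p_common_closed(10, 2, 3, 4)
total = expected_common_labels(10, 2, 3, [4, 1, 4])
```

The closed form needs four binomial coefficients per label instead of a double loop over r and g, which is why it is the cheaper option when degrees are large.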
Choose some κ > 1. If there are κEu,v or more common labels between u
and v, then this indicates a strong functional similarity between u and v when
compared to the null model. The parameter λ is used for additional restrictions
on the minimum functional similarity.
3.3 Structural-Functional Clustering
The clique percolation method (CPM) is a popular method for clustering of enti-
ties in a network considering only the structural links. This method identifies
overlapping clusters which are composed of several (overlapping and maximal)
cliques [18]. We are interested in clustering entities that are closely related both
structurally and functionally. A previous approach by Modani et al. [13] first
finds all MLMCs with a minimum number of nodes and common labels. Then it
obtains the subgraph induced by the nodes of the MLMCs. They rightly claim
that this subgraph is made up of those nodes that are better connected both
structurally and functionally. The authors then proposed to run any suitable
overlapping (or non-overlapping) algorithm (e.g., CPM) on this subgraph.
However, we think better clusters can be obtained if the functional similarity
is ingrained deeper in the cluster-finding algorithm. Hence, we propose a
“Clique-Biclique Percolation Algorithm” (CBCPM) outlined in Algorithm 1.
Like CPM, the clusters discovered by CBCPM are composed of maximal LCs.
Each cluster is constructed from several LCs that are “connected”: two LCs are
said to be connected if they overlap in at least ks nodes and at least kl labels. The
output of the algorithm is the set of node clusters given by the connected components of
the network of maximal LCs.
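The percolation step described above can be sketched as follows. This is a hedged illustration rather than Algorithm 1 itself: it assumes the maximal LCs have already been enumerated and are given as (node set, label set) pairs, and the name `percolate_labeled_cliques` is hypothetical.

```python
def percolate_labeled_cliques(lcs, ks, kl):
    """Group labeled cliques into clusters of nodes.

    lcs: list of (nodes, labels) pairs, each a set; assumed to be maximal LCs.
    Two LCs are connected if they share >= ks nodes and >= kl labels.
    Returns one node set per connected component of the LC network.
    """
    parent = list(range(len(lcs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(lcs)):
        for j in range(i + 1, len(lcs)):
            nodes_i, labels_i = lcs[i]
            nodes_j, labels_j = lcs[j]
            if len(nodes_i & nodes_j) >= ks and len(labels_i & labels_j) >= kl:
                parent[find(i)] = find(j)

    clusters = {}
    for i, (nodes, _) in enumerate(lcs):
        clusters.setdefault(find(i), set()).update(nodes)
    return list(clusters.values())


example = [
    ({1, 2, 3}, {"a", "b"}),
    ({2, 3, 4}, {"a", "b", "c"}),
    ({7, 8, 9}, {"a", "b"}),
]
result = sorted(sorted(c) for c in percolate_labeled_cliques(example, ks=2, kl=2))
```

The pairwise comparison is quadratic in the number of LCs; it is kept here for clarity, not efficiency.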
Modularity was initially defined for disjoint clusters. To apply this to overlapping
clusters, a common trend is to use the notion of “belongingness” [17]. Shen
et al. defined the contribution of a node u towards a cluster C as $\beta_{u,C} = \frac{1}{|comm(u)|}$
if u ∈ C and 0 otherwise, and used it to define a generalized modularity OQ [22].
$$OCov(\mathcal{C}) = \frac{1}{2e} \sum_{C \in \mathcal{C}} \sum_{u,v \in C} E(u,v)\, \beta_{u,C}\, \beta_{v,C}$$
$$OQ(\mathcal{C}) = OCov(\mathcal{C}) - E[OCov(\mathcal{C})] = \frac{1}{2e} \sum_{C \in \mathcal{C}} \sum_{u,v \in C} \left[ E(u,v) - \frac{d_u d_v}{2e} \right] \beta_{u,C}\, \beta_{v,C}$$
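A small sketch of the OQ computation may help to fix the notation. It assumes an undirected simple graph given as an edge list, treats E(u, v) as the adjacency indicator, and sums over ordered pairs (u, v) within each cluster, including u = v; these conventions and the function name are assumptions of this sketch.

```python
from collections import defaultdict


def overlapping_modularity(edges, communities):
    """Generalized modularity OQ with beta_{u,C} = 1 / (number of clusters containing u)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    two_e = sum(len(nbrs) for nbrs in adj.values())  # 2e: twice the edge count

    membership = defaultdict(int)  # |comm(u)|
    for community in communities:
        for u in community:
            membership[u] += 1

    total = 0.0
    for community in communities:
        for u in community:
            for v in community:
                observed = 1.0 if v in adj[u] else 0.0
                expected = len(adj[u]) * len(adj[v]) / two_e
                total += (observed - expected) / (membership[u] * membership[v])
    return total / two_e


# Two disjoint triangles, each taken as one cluster.
triangles = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6)]
oq = overlapping_modularity(triangles, [{1, 2, 3}, {4, 5, 6}])
```

For disjoint clusters every β equals 1 and OQ reduces to the usual modularity, which gives 0.5 for the two-triangle example.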
Lemma 2. Consider all graphs with a fixed set of labels, say L, and in which,
|c(l)| is fixed for every l ∈ L. Then,
|c(l)|
u,v SH (u, v) = σ
1
l∈L:|c(l)|>1 2
The proof uses a simple double-counting of the nodes with a particular label.
The denominator in E[CohSH ] (and also in CohSH ) therefore becomes a constant
independent of the (random) graph. Furthermore, observe that E[SH (u, v)] in
the random graph is same as E[Nu,v ] in GB (defined earlier).
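$$\Phi_{S_H}(\mathcal{C}) = \sum_{C \in \mathcal{C}} \sum_{u,v \in C} \Big( S(u,v) - E[N_{u,v}] \Big)\, \beta_{u,C}\, \beta_{v,C} \Big/ \sum_{l \in L,\ |c(l)| > 1} \binom{|c(l)|}{2}$$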
4 Evaluation Results
To evaluate the effectiveness of our approaches, we applied them to several real-
life datasets (described in Table 2). The “Twitter-small” dataset is constructed
from the Twitter dataset [10] with edges representing “following a celebrity”;
we selected as labels those users with followers between 15000 and 16000 (i.e.,
celebrities) and for nodes, those non-celebrities with 6000–65000 followers.
4.1 SF-Divergence
First we report the SF-divergence of our labeled-graph datasets in Table 3; we
skip Ciao since it involved computing CC for a large number of nodes and labels
which did not finish within a day.
An SF-divergence value less than one indicates that there are several nodes
that share functionalities but are yet to form structural links. On the other hand,
a value more than one indicates that nodes are yet to fully acquire functionalities
from structurally connected nodes. We conjecture that the SF-divergence of a
static social network (in which users are not joining or leaving) should approach
one in the long term. We can see that the CTM and TwS networks display this
behavior better than the other networks. This is expected for the TwS dataset
since the “labels” in this network are celebrities and two users who follow each
other are more likely to follow the same celebrities. CTM users, in any case, show a highly
“matured” behavior, as was observed earlier in Fig. 2a.
4.2 Discovering Overlapping Communities
Now we report the quality of overlapping communities obtained by our CBCPM
algorithm (Algorithm 1). Our goal was to show that, for similar setting of param-
eters ks and kl , CBCPM creates communities with better likemindedness than
the existing CPMCore method [13] of running the CPM algorithm on the sub-
graph of nodes that are present in the MLMCs with at least kl labels and
ks nodes. These parameters are related to the “percolation” of cliques/labeled-cliques
and have to be chosen carefully; how to choose them was beyond our scope. Too large
values may not find any community and too small values will create a single
community. Therefore, we conducted experiments with different values of ks ≥ 3
and kl ≥ 3 and only considered clusterings with at least two communities.
We compared the overlapping modularity [22] (OQ) and the likemindedness [13]
(LM) of the communities obtained by our CBCPM algorithm vs. those given by
CPMCore [13]. We used the unnormalized Hamming similarity for S().
Table 3
Dataset    ΔSF_{2,3}
FT         0.28
Ning       0.39
Lfm        0.27
CTM        0.82
TwS        0.85

Table 4
Dataset    Parameters        Method     OQ [22]   LM [13]
Ning (*)   kl = 3   ks = 4   CBCPM      0.05      3.38
                    ks = 5   CPMCore    0.03      2.17
FT         kl = 3   ks = 3   CBCPM      0.35      5.27
                             CPMCore    0.41      4.91
           kl = 4   ks = 3   CBCPM      0.27      5.51
                             CPMCore    0.39      4.93
(*) Best ks for kl = 3 is used that maximized LM.
The Ning and the FT datasets generated very few MLMCs for some param-
eters. Therefore, we set kl = 3, ks ≥ 3 for Ning which generated 118 MLMCs.
Similarly, we used kl ≥ 3, ks = 3 for the FT dataset that gave us 72 MLMCs.
Results for the two clustering algorithms are presented in Table 4.
The quality measures of the larger Ciao and Lfm datasets are illustrated
in Figs. 3 and 4, respectively. We tried several different values of kl (indicated
as CBCPM-kl and CPMCore-kl ) and ks (X-axis). We observed that CBCPM
consistently found communities with higher LM compared to those found by
CPMCore. Due to the stronger enforcement of functional similarity, CBCPM
modularities are expected to be lower; however, we observed that the change
is highly non-uniform here and sometimes even higher. We conclude that, in
comparison to CPMCore, CBCPM finds communities with better functional
qualities and with competitive structural qualities.
5 Conclusion
Labeled-graphs are a richer representation of networks that can also store
attributes of nodes, apart from the usual node-node relationships, and have been
gaining popularity. In this work we show how to analyse the maximal
labeled-cliques of these graphs, a concept that was recently introduced [2], and then
show how to use those structures to identify clique-based communities. We also
introduce a null model and a statistic to represent the attribute-level similarities
within a community.
References
1. Baroni, A., Conte, A., Patrignani, M., Ruggieri, S.: Efficiently clustering very large
attributed graphs. In: 2017 IEEE/ACM International Conference on Advances in
Social Networks Analysis and Mining (ASONAM), pp. 369–376 (2017)
2. Bera, D., Esposito, F., Pendyala, M.: Maximal labelled-clique and click-biclique
problems for networked community detection. In: 2018 IEEE Global Communica-
tions Conference (GLOBECOM), pp. 1–6 (2018)
3. Cantador, I., Brusilovsky, P., Kuflik, T.: Second workshop on information het-
erogeneity and fusion in recommender systems. In: Proceedings of the 5th ACM
Conference on Recommender Systems, (HetRec 2011) (2011)
4. Chin, C.H., Chen, S.H., Wu, H.H., Ho, C.W., Ko, M.T., Lin, C.Y.: cytohubba:
identifying hub objects and sub-networks from complex interactome. BMC Syst.
Biol. 8(4), S11 (2014)
5. Chunaev, P.: Community detection in node-attributed social networks: a survey.
Comput. Sci. Rev. 37, 100286 (2020).
http://www.sciencedirect.com/science/article/pii/S1574013720303865
6. Faghani, M.R., Nguyen, U.T.: A study of malware propagation via online social
networking. In: Mining Social Networks and Security Informatics. Springer, Nether-
lands (2013)
7. Fortunato, S.: Community detection in graphs. Phys. Rep. 486(3), 75–174 (2010)
8. Guo, G., Zhang, J., Thalmann, D., Yorke-Smith, N.: ETAF: an extended trust
antecedents framework for trust prediction. In: Proceedings of the 2014 Interna-
tional Conference on Advances in Social Networks Analysis and Mining (2014)
9. Guo, G., Zhang, J., Yorke-Smith, N.: A novel Bayesian similarity measure for
recommender systems. In: Proceedings of the 23rd International Joint Conference
on Artificial Intelligence (2013)
10. Kwak, H., Lee, C., Park, H., Moon, S.: What is Twitter, a social network or a
news media? In: Proceedings of 19th International Conference on World Wide
Web (2010)
11. Latapy, M., Magnien, C., Vecchio, N.D.: Basic notions for the analysis of large
two-mode networks. Soc. Netw. 30(1), 31–48 (2008)
12. Lesser, O., Tenenboim-Chekina, L., Rokach, L., Elovici, Y.: Intruder or welcome
friend: inferring group membership in online social networks. In: Social Computing,
Behavioral-Cultural Modeling and Prediction (2013)
13. Modani, N., et al.: Like-minded communities: bringing the familiarity and similar-
ity together. World Wide Web 17(5), 899–919 (2014)
14. Newman, M.E.J.: The structure and function of complex networks. SIAM Rev.
45(2), 167–256 (2003)
15. Newman, M.E.J.: Networks: An Introduction. Oxford University Press, Oxford
(2010)
16. Newman, M.E.J., Girvan, M.: Finding and evaluating community structure in net-
works. Phys. Rev. E 69(2) (2004)
17. Nicosia, V., Mangioni, G., Carchiolo, V., Malgeri, M.: Extending the definition of
modularity to directed graphs with overlapping communities. J. Stat. Mech. Theor.
Exp. 2009, P03024 (2008)
18. Palla, G., Derényi, I., Farkas, I., Vicsek, T.: Uncovering the overlapping community
structure of complex networks in nature and society. Nature 435(7043), 814–818
(2005)
19. Plantié, M., Crampes, M.: Survey on social community detection. In: Social Media
Retrieval (2013)
20. Saracco, F., Di Clemente, R., Gabrielli, A., Squartini, T.: Randomizing bipartite
networks: the case of the world trade web. Sci. Rep. 5, 10595 (2015)
21. Saracco, F., Straka, M.J., Clemente, R.D., Gabrielli, A., Caldarelli, G., Squartini,
T.: Inferring monopartite projections of bipartite networks: an entropy-based app-
roach. New J. Phys. 19(5), 053022 (2017)
22. Shen, H., Cheng, X., Cai, K., Hu, M.B.: Detect overlapping and hierarchical com-
munity structure in networks. Physica A: Stat. Mech. Appl. 388(8), 1706–1712
(2009)
23. Spyropoulou, E., De Bie, T., Boley, M.: Interesting pattern mining in multi-
relational data. Data Min. Knowl. Disc. 28(3), 808–849 (2014)
24. Wan, L., Liao, J., Wang, C., Zhu, X.: JCCM: Joint cluster communities on attribute
and relationship data in social networks. In: Proceedings of 5th International Con-
ference on Advanced Data Mining and Applications (2009)
25. Yang, J., Leskovec, J.: Defining and evaluating network communities based on
ground-truth. Knowl. Inf. Syst. 42(1), 181–213 (2015)
Community Detection in a Multi-layer Network
Over Social Media
Abstract. Detection of communities over social networks, also known as network
clustering, has been widely studied in the past few years. The objective of
community detection is to identify strongly connected components in a complex
network. It reveals how people connect and interact with each other. In the real
world, however, a person is engaged in several types of connections, and these
connections or social ties pose additional challenges for community detection.
Multiple types of connections can be represented as a multiplex network that
itself contains a collection of multiple interdependent networks, where each
network represents one type of connection. In this paper, we provide readers
with a brief overview of multilayer networks and community detection methods,
and propose an approach to detect communities and their structure using a
multilayer modularity method on a Facebook page. The study also investigates how
strong the ties between users are and how their polarity towards the page changes
over time. The method removes the isolates from the network and builds a
well-defined community structure.
1 Introduction
Social network analysis has been growing rapidly in recent years; one of the main reasons is
the growth of social media platforms. Community detection can be utilized in such
disciplines as marketing, information propagation, identifying ethnic groups in society, and
so on. As social media platforms multiply, virtual communities and networks are
also expanding, and social networks have now become multilayered. Community
detection aims to divide a network into several strongly connected components. Such
subcomponents are formed from sets of similar nodes, and thus can be viewed as
communities. Compared with the traditional community detection problem, detection in a multilayer
network is harder, because the recent era of mobile phones and social network
analysis brings new difficulties. Multiple human interaction networks are encapsulated in
graph data: a person may have a co-worker network and, at the same time,
a friendship network, and will also appear in the social networks of online sites. All these
networks are interdependent and together represent a person’s lifetime network, which is, in other
words, a temporal or complex network. Identifying a community in
such a network is therefore a challenge and has attracted a lot of attention from researchers. The objective
of this research is to provide a layer-based approach to form a layered network from raw
data and find communities over this social network.
Previous research work related to multilayer social networks generally formed the
network from directly connected nodes: for example, the Facebook network focuses on the
friend list of a user and, similarly, the LinkedIn network is based on a person’s connections.
There are very few studies that focus on public pages and posts on social networks,
where users can indirectly form communities and interact with each other by sharing,
commenting, and mentioning others on posts.
We have found that there are 60 million active business, political, news channel,
and other Facebook pages. People like and share the content of a Facebook page and
interact with each other via various activities on pages. We have found that discovering
community structure on a public page becomes more difficult as the number of Facebook
users and their interactions on pages increases along with the features provided
by the page. We propose a method which forms a multilayer network from a given
Facebook page and finds indirect community relations and acquaintanceship within the
page.
2 Related Work
Several research works have addressed and proposed community detection methods
in a single layer and multilayer network. In this section, we discuss some multilayer
community detection methods.
This study narrows down three main approaches that have been used to extract
communities in a multilayer network.
In the traditional approach, social networks are constructed from directly connected
or related nodes: most single-layer networks were based on one’s friend circle,
and multilayer networks on one’s family, friend, or colleague circles. These
networks do not capture a person’s interactions outside these relationships, that is, the
acquaintanceship of a person through social media activities.
Our proposed method constructs a multilayer network over a Facebook public
page to capture the acquaintanceship of a person through these activities, detects
communities and their structure, and performs a temporal analysis of the dynamics on the page.
3 Proposed Work
3.1 Dataset
The Facebook dataset was crawled from the “Pakistan Tehreek-e-Insaf” public page
(www.facebook.com/PTIOfficial). On this page only the administration can post information;
users can only like, share, comment or reply to a comment. Our database has two
tables, one for posts and one for comments. The schema of the post table is composed
of Author, Time, Text, URL, and post_id. The schema of the comment table is composed of
profile_id, time, Text, and post_id. Since commenting is the most common activity on
Facebook, we mainly focus on the comment table (Table 1).
Data                  Measures
Number of users       9457
Number of comments    14196
Number of posts       400
Duration              June 2019 to October 2019
We considered the multilayer network as graphs where each post represents a layer with
a different set of overlapping nodes, a node represents a Facebook user and an edge
represents the interaction between a pair of nodes.
Fig. 2. A toy-example of a multilayer network with two types of interaction among five users
Let P be a set of posts, U be a set of users who have commented on posts and L be
a set of layers, where a layer represents a single graph containing a post and its commenters.
Then the multilayer network of the Facebook page is defined as a graph G = (V, E), where
V ⊆ U × L × P is the set of nodes or vertices and E is a set of
intra-layer edges connecting users to the post on the same layer. Layers are not required to
contain all users and may have different sets of edges. The existence of a user in a layer is
represented as a unique node in that layer (Fig. 2).
This bi-adjacency matrix $B_{i,j}$ is then used with the matrix transposition method to
form a user-user matrix $X_{i,j}$, with one row/column for each user, and element (i, j)
indicating the number of shared posts between user i and user j, called the acquaintanceship
weight and denoted as $wt\,A^{u_j}_{u_i}$.
A confidence value c will be used to determine the strength and density of the
acquaintanceship network. User pairs (i, j) whose weight reaches the confidence value will
be part of the acquaintanceship network. This network can also be defined as a graph
G’ = (V’, E’), where V’ is the set of user nodes and E’ is the set of inter-layer edges
between two users whose $wt\,A^{u_j}_{u_i} \ge c$.
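The construction of the acquaintanceship network can be sketched directly from the comment table. This is an illustrative sketch, not the authors' code: it assumes comments are available as (profile_id, post_id) pairs, and it counts shared posts per user pair, which equals the off-diagonal entries of B·Bᵀ for a binary user-post matrix B.

```python
from collections import defaultdict
from itertools import combinations


def acquaintanceship_edges(comments, c):
    """Return {(u, v): weight} for user pairs sharing at least c posts.

    comments: iterable of (profile_id, post_id) pairs (hypothetical input format).
    c:        confidence value; pairs with fewer than c shared posts are dropped.
    """
    posts_by_user = defaultdict(set)
    for user, post in comments:
        posts_by_user[user].add(post)

    edges = {}
    for u, v in combinations(sorted(posts_by_user), 2):
        weight = len(posts_by_user[u] & posts_by_user[v])  # shared posts
        if weight >= c:
            edges[(u, v)] = weight
    return edges


sample = [("a", 1), ("a", 2), ("b", 1), ("b", 2), ("c", 1)]
strong = acquaintanceship_edges(sample, c=2)
weak = acquaintanceship_edges(sample, c=1)
```

Users who share no pair above the threshold never appear in the edge set, so raising c both sparsifies the network and drops isolates.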
After the formation of the network, we detect communities in this newly formed
acquaintanceship network by applying the Louvain modularity maximization algorithm
[7]. Modularity measures how densely a network is connected when partitioned into
communities. The algorithm divides the graph into clusters called communities and tries
to maximize modularity by initially placing each node in its own cluster
and calculating the modularity gain ΔQ of moving it into a neighbouring cluster. It evaluates how densely
nodes are connected within a community compared to how densely they would
be connected in a random network. The gain in modularity obtained by moving $u_i$ into a community C
can easily be computed by
$$\Delta Q = \left[ \frac{\Sigma_{in} + 2k_{i,in}}{2m} - \left( \frac{\Sigma_{tot} + k_i}{2m} \right)^2 \right] - \left[ \frac{\Sigma_{in}}{2m} - \left( \frac{\Sigma_{tot}}{2m} \right)^2 - \left( \frac{k_i}{2m} \right)^2 \right] \quad (3)$$
where $\Sigma_{in}$ is the sum of links inside the community, $\Sigma_{tot}$ is the sum of links incident to
nodes of the community, $k_i$ is the sum of links of node i, $k_{i,in}$ is the sum of links from node i to nodes in the
community, and m is the total number of links.
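Equation (3) translates into a few lines of code. The sketch below follows the equation as written in this section; the function name and the example numbers are assumptions used only to illustrate the arithmetic.

```python
def modularity_gain(sigma_in, sigma_tot, k_i, k_i_in, m):
    """Modularity gain of moving node i into community C, following Eq. (3).

    sigma_in:  sum of links inside C
    sigma_tot: sum of links incident to nodes of C
    k_i:       sum of links of node i
    k_i_in:    sum of links from node i to nodes in C
    m:         total number of links in the network
    """
    after = (sigma_in + 2 * k_i_in) / (2 * m) - ((sigma_tot + k_i) / (2 * m)) ** 2
    before = sigma_in / (2 * m) - (sigma_tot / (2 * m)) ** 2 - (k_i / (2 * m)) ** 2
    return after - before


gain = modularity_gain(sigma_in=4, sigma_tot=10, k_i=3, k_i_in=2, m=20)
# Expanding the squares shows that the gain simplifies to
# k_i_in / m - sigma_tot * k_i / (2 * m**2), which does not depend on sigma_in.
simplified = 2 / 20 - 10 * 3 / (2 * 20 ** 2)
```

Because the simplified form needs only k_i_in, sigma_tot and k_i, the gain of every candidate move can be evaluated from local information.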
The Louvain algorithm comprises two phases. The first phase computes the modularity
gain ΔQ obtained when a node moves from one community to another. The next
phase aggregates the communities that maximize the modularity to form a new
graph. We borrow the idea of the Louvain algorithm’s first phase, which is to maximize
the modularity gain. Our proposed algorithm identifies communities with signed
edges between nodes to further identify relations or ties between them. It reveals how
strongly or weakly a node is tied within the community. It can also be used to predict
future relations and the strength of the connection.
The identification of signed edges is based on the polarity of comments given by
users on posts. If two users concurrently commented on a post with the same polarity
and are members of the same community then they have a strong positive relationship
between them. If two users commented on a post with opposite polarity and are members
of different communities, then they have a strong negative relationship between them.
The resultant network will be formed by aggregating the communities with signed
edges between nodes.
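The two rules for signed edges can be written as a small decision function. This is a sketch under assumptions: polarity is taken to be one of the strings "positive", "negative" or "neutral", and pairs that match neither rule are left unsigned because the text does not specify them.

```python
def tie_sign(polarity_u, polarity_v, community_u, community_v):
    """Sign of the tie between two users who commented on the same post.

    Returns "strong positive", "strong negative", or None when neither of the
    two stated rules applies (the unspecified cases are an assumption here).
    """
    same_polarity = polarity_u == polarity_v
    opposite_polarity = {polarity_u, polarity_v} == {"positive", "negative"}
    same_community = community_u == community_v

    if same_polarity and same_community:
        return "strong positive"
    if opposite_polarity and not same_community:
        return "strong negative"
    return None


s1 = tie_sign("positive", "positive", 3, 3)
s2 = tie_sign("positive", "negative", 3, 5)
s3 = tie_sign("positive", "neutral", 3, 5)
```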
130 M. M. Sheikh and R. A. S. Malick
Evaluation Metrics. There are two common evaluation metrics of community detec-
tion the first is accuracy if the actual member of the community is given. And second
4 Results
In this section, we will present the analysis of the results of our proposed method on the
multilayer network of Facebook page graph. The single layer is composed of a single
post and users who have commented on that post.
Fig. 4. Community structure with tie strength in acquaintanceship network with c = 3 (a)
Acquaintanceship network. (b) Communities with relations
Fig. 5. Community structure with tie strength in acquaintanceship network with c = 5 (a)
Acquaintanceship network. (b) Communities with relations
Fig. 6. Community structure with tie strength in acquaintanceship network with c = 8 (a)
Acquaintanceship network. (b) Communities with relations
We have performed some basic analyses on the merged graph. Table 4 shows the network
statistics of the merged graph. Figure 7 shows the Betweenness Centrality.
Data                      Measure
Nodes                     9227
Edges                     314061
Avg. degree               68.07
Avg. path length          2.68
Clustering coefficient    0.901
Network diameter          5
Network radius            3
Figure 8 shows the impact of the removal of ten highly centric nodes vs. ten random
nodes on the overall connectedness of the network. From the figure, we can conclude
that removing centric nodes disconnects the network sooner, whereas removing random
nodes has little to no effect on network connectedness.
Similarly, Fig. 9 shows the impact of the removal of ten highly centric nodes and ten
random nodes on the average path length. We can see that as highly centric
nodes are removed the average path length increases, but the removal of random nodes does
not affect the average path length. From Figs. 8 and 9 we can conclude that the removal of
random nodes has little effect on the network state, but removing centric nodes increases
the diameter of the network and decreases network centralization and average degree.
Fig. 9. Impact of removal of nodes on the average path length of the network
graphs from which we can conclude that by the passage of time positive and negative
commenters are decreasing whereas the number of neutral commenters increases.
Fig. 11. Commenter’s polarity over the time (a) Negative commenters graph (b) positive
commenters graph (c) Neutral commenters graph
5 Conclusion
In this research, we have developed a flattening technique to identify global communities
in a multilayer network constructed from a Facebook public page. We consider page
posts as layers and commenters as members of an overlapping community. The algorithm
successfully removed the isolated nodes and constructed the communities around highly centric
nodes. The resultant communities are modular and revolve around the highly centric
users; the algorithm also identified the strength of ties between users. Such community
structures can be used to find the active influential nodes on the public page.
References
1. Berlingerio, M., Coscia, M., Giannotti, F.: Finding and characterizing communities in mul-
tidimensional networks. In: 2011 International Conference on advances in social networks
analysis and mining. IEEE, pp. 490–494 (2011, July)
2. Rocklin, M., Pinar, A.: On clustering on graphs with multiple edge types. Internet Math. 9(1),
82–112 (2013)
3. Kim, J., Lee, J.G., Lim, S.: Differential flattening: A novel framework for community detection
in multilayer graphs. ACM Trans. Intell. Syst. Technol. (TIST) 8(2), 1–23 (2016)
4. Tagarelli, A., Amelio, A., Gullo, F.: Ensemble-based community detection in multilayer
networks. Data Min. Knowl. Disc. 31(5), 1506–1543 (2017).
https://doi.org/10.1007/s10618-017-0528-8
5. Huang, Y., Krim, H., Panahi, A., Dai, L.: Community detection and improved detectability in
multiplex networks. IEEE Trans. Netw. Sci. Eng. 7(3), 1697–1709 (2020). Electronic ISSN:
2327-4697. https://doi.org/10.1109/TNSE.2019.2949036
6. Zhu, Z., Zhou, T., Jia, C., Liu, J., Cao, J.: Community detection across multiple social networks
based on overlapping users. arXiv preprint arXiv:1909.09007 (2019)
7. Jeub, L.G.S., Bazzi, M., Jutla, I.S., Mucha, P.J.: “A generalized Louvain method for community
detection implemented in MATLAB” (2011–2019)
8. Afsarmanesh, N., Magnani, M.: Finding overlapping communities in multiplex networks. arXiv
preprint arXiv:1602.03746 (2018)
9. Zhang, H., Wang, C.D., Lai, J.H., Philip, S.Y.: Community detection using a multilayer edge
mixture model. Knowl. Inf. Syst. 60(2), 757–779 (2019)
Using Preference Intensity for Detecting
Network Communities
1 Introduction
great interest as it resembles more closely the real-world networks [2]. For exam-
ple, in a protein complex network, a large number of proteins may belong to
many protein complexes at the same time [7]. The identification of these community
structures can help in many risky situations. For example,
controlling community infection early during a pandemic like the
current one is of great interest. Many community detection methods have been
developed in the past three decades, and we present a summary of some. Some
algorithms use a local function to characterize the densely connected group of
nodes [8–22]. Lancichinetti uses the local function optimization approach in the
(LFR) overlapping community benchmark [23]. LFR introduces a fitness func-
tion for the definition of a community, as shown in Eq. 1. The random seed nodes
from the network form the community until the fitness function in Eq. 1 is locally
maximal. In the OSLOM method, the local optimization of the fitness function is
used, which determines the statistical significance of clusters for random fluctu-
ations [24]. First, it identifies the relevant cluster until the local fitness function
converges. Then an internal analysis of these clusters is performed on their union.
Lastly, it identifies the hierarchical structure of these clusters. This method gives
a performance comparable with that of other existing algorithms on synthetic
networks. The main advantage of this method is that it can be used to improve
the clusters generated by different algorithms.
fG > fG , (2)
1. In this method, the first limitation is that the overlap regions may belong to
more than one community, which is difficult to decide based on Eq. 2.
3 Preference Relations
This new method based on preferences can be used to define overlapping communities
in networks and also to create benchmark communities. The preference
relation has the monotone property, and here we define the preference implication,
which can be used to make multi-criteria decisions [26,27]. The preference
relation $P_\nu^{(\alpha)}(x, y)$ shows how true the statement (x < y) is, which in our case also
indicates how strong the community is. Here, $x = f_G$ and $y = f_{G'}$.
Parameters              Domain
Preference relation     $P_\nu^{(\alpha)}(x, y) \in (0, 1)$
Threshold               ν ∈ (0, 1)
Sharpness parameter     α ∈ (0, 1)
$x = f_G$               $f_G \in (0, 1)$
$y = f_{G'}$            $f_{G'} \in (0, 1)$
δ ∈ deltaset            deltaset ∈ (0, 1)
Case 3. $P_\nu^{(\alpha)}(9, 6)$: The truth value of the statement (9 < 6) is 0.1, which is less than
ν since 0.5 > 0.1. So, it is a very weak statement or, in other words, a false
statement. Hence, 9 < 6 is not a true statement.
Here the sharpness threshold is defined by α and ν is the threshold used for
comparing the truth values. The intensity of the preference is controlled by the
parameter of this function. The parameter is ν ∈ (0, 1) and f is a generator of a
strict t-norm. The preference implication in pliant logic form is [26]:
$$P_\nu^{(\alpha)}(x, y) = \begin{cases} 1, & \text{if } (x, y) \in \{(0, 0), (1, 1)\} \\ f^{-1}\left( f(\nu) \left( \dfrac{f(y)}{f(x)} \right)^{\alpha} \right), & \text{otherwise} \end{cases} \quad (6)$$
We can also define our own function in the Eq. 6. We define a special function
for our purpose which has the monotonic property [28,29].
Table 2. Preference rule table: rule for the new node addition in the subgraph based
on the threshold ν value and preference value.
Preference value $P_\nu^{(\alpha)}(x, y)$ | Fitness function rule | Decision of a new node addition
$0 < P_\nu^{(\alpha)}(x, y) < 0.5$, $P_\nu^{(\alpha)}(x, y) > \delta$ | $f_G > f_{G'}$ | Not desirable for a community membership, but it can have a very weak overlapping membership
$P_\nu^{(\alpha)}(x, y) = 0.5$, $P_\nu^{(\alpha)}(x, y) > \delta$ | $f_G = f_{G'}$ | Desirable for community membership, and it can have a weak overlapping membership
$1 > P_\nu^{(\alpha)}(x, y) > 0.5$, $P_\nu^{(\alpha)}(x, y) > \delta$ | $f_G < f_{G'}$ | Strong community membership, and it can have a strong overlapping membership
This method allows us to control the size and number of these overlapping
regions. The threshold parameter ν of preference allows us to design the commu-
nities according to our requirement of the strength of a community. The value
of ν is desirable when ν > 0.5, as the Dombi operator system is a sigmoid func-
tion (as in Eq. 9). The graphical form of the sigmoid function is shown in Fig. 1.
Table 2 above lists the preference-based rules in different scenarios and its com-
parison with different values of ν. Also, the case of strong membership occurs
when the threshold value is greater than 0.5.
Cancelling the common term −xy on each side of the equation we get,
x < y, (21)
$$x = \frac{k_{in}}{k_{in} + k_{out}}. \quad (24)$$
Now, taking the reciprocal of x and subtracting 1 from it, we get
$$\frac{1 - x}{x} = \frac{k_{out}}{k_{in}}. \quad (25)$$
144 J. Dombi and S. Dhama
$$\frac{x}{1 - x} = \frac{k_{in}}{k_{out}}. \quad (26)$$
As defined above in commutative property we get
$$\frac{k_{in}}{k_{out}} = \frac{n}{o}. \quad (27)$$
Similarly, for y when α1 = 1 we get,
$$y = \frac{k'_{in}}{k'_{in} + k'_{out}}. \quad (28)$$
Therefore,
$$\frac{1}{1 + \left( \dfrac{o'}{n'} \cdot \dfrac{n}{o} \right)^{\alpha}} > \delta.$$
$$\frac{o'}{n'} \cdot \frac{n}{o} < \left( \frac{1}{\delta} - 1 \right)^{1/\alpha}. \quad (34)$$
$$\left( \frac{1}{\delta} - 1 \right)^{1/\alpha} = k. \quad (35)$$
And we get,
$$\frac{o'}{n'} \cdot \frac{n}{o} < k. \quad (36)$$
This term denotes the strict threshold for the addition of a new member to the
community.
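The threshold test derived in Eqs. (24)-(36) can be sketched in code. This illustration rests on assumptions that the extracted text does not state explicitly: n/o is taken as k_in/k_out of the current subgraph, n'/o' as the same ratio after adding the candidate node, and the preference value as 1 / (1 + ((o'/n')(n/o))^α). All function names are hypothetical.

```python
def odds_ratio(k_in, k_out, k_in_new, k_out_new):
    """(o'/n') * (n/o): the ratio compared against k in Eq. (36)."""
    return (k_out_new / k_in_new) * (k_in / k_out)


def preference_value(k_in, k_out, k_in_new, k_out_new, alpha):
    """Preference value in the sigmoid form assumed for this sketch."""
    return 1.0 / (1.0 + odds_ratio(k_in, k_out, k_in_new, k_out_new) ** alpha)


def accept_node(k_in, k_out, k_in_new, k_out_new, alpha, delta):
    """Strict threshold of Eq. (36): accept when the ratio is below k."""
    k = (1.0 / delta - 1.0) ** (1.0 / alpha)  # Eq. (35)
    return odds_ratio(k_in, k_out, k_in_new, k_out_new) < k


# Current subgraph: k_in = 4, k_out = 4; after adding the node: k_in = 6, k_out = 3.
p = preference_value(4, 4, 6, 3, alpha=1.0)
ok_loose = accept_node(4, 4, 6, 3, alpha=1.0, delta=0.6)
ok_strict = accept_node(4, 4, 6, 3, alpha=1.0, delta=0.7)
```

Comparing the ratio with k avoids evaluating the sigmoid for every candidate, and the two tests agree: the node is accepted exactly when the preference value exceeds δ.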
A brief preliminary version of the algorithm was described in our paper [34].
In our framework, the preference-based method works in the following way. We
start with the number of community nc to be detected and graph G as input in
the main algorithm.
23 L[c] ←− G
24 End while
25 return L
G ←− Gi : a new subgraph which includes nodes with a preference value
greater than δ.
4. The process is repeated from step 4 using a while loop until it satisfies the
stopping criteria.
5 end
6 else
7 P ←− null ;
8 end
9 return P
10 End Function
Fig. 2. In (A) of Fig. 2 from the LFR benchmark the statistics of network that we
used are N = 15, k = 3, maxk = 5, ν = 0.2, t1 = 2, t2 = 1, minc = 3, maxc = 5,
on = 5, om = 2. The value of α is tested for α ∈ [0.05, 0.8]. For each α the ν is tested
for ν ∈ [0.35, 0.8]. The initial seed nodes were the same for all the runs. The value of
number of communities to be created is the same in all the cases; i.e. nc = 6. The delta
set is same in each cases with ν and α with the values [0.24, 0.95]. The Community size
distribution is shown in (B) of Fig. 2 for a fixed value of (α = 0.75) and varying value
of ν ∈ [0.375, 0.70]. The size of network = 1000, Average degree k = 150, Maximum
degree Kmax = 200, Mixing parameter μ = 0.1, t = 1, t2 = 1. The initial seed node
and threshold δ are the same for each value of ν.
Fig. 3. The figure on the right shows the violin plot of the community size distribution for
10% of the seed nodes on large networks with sizes from 200 to 10000. The values
of the parameters deltaset = 0.3, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90, 0.95, α = 1 and
ν = 0.6 are the same in every case. The figure on the left shows a power-law distribution
of the community sizes of artificial networks with sizes from 200 to 10000. The parameter
values are the same as those in the violin plot on the right.
performed using different seed sizes of 5%, 10%, 20% and 30% of the network.
We observed similar behavior of the violin plots in all the cases, and they have
the characteristics of a power law.
can be useful to control the overlapping nodes. The delta set is vital for deciding
the threshold for the community strength. Different seed sizes were used for the
community detection method; they all performed well and gave similar results.
The sizes of the detected communities follow a power law when tested on
different artificial networks with seed sizes ranging from ten to thirty
per cent of the network size. The preference implication provides a new way
of analysing the creation of overlapping communities in networks for
algorithms that employ a local function optimisation approach to detect
overlapping communities. Robustness is an important property
for many networks [33]. For networks which are not robust, our method of
community detection may be useful as it preserves the original structure of the
graph when the algorithm terminates [34]. We showed that the algorithm can
be helpful for analysing artificial and real networks and that it performs well,
and we reported results that demonstrate the effectiveness of the preference
implication method. In the future, we intend to use the map-reduce framework
for a faster implementation of this method. If the network is complex, then
the run time increases, and results become difficult to collect for very
large networks. Optimising the code of the current parallelized algorithm using
a map-reduce function would be useful for making the algorithm scalable, and
it would be beneficial in real-time decentralised networks.
References
1. Nepusz, T., Vicsek, T.: Controlling edge dynamics in complex networks. Nat. Phys.
8(7), 568 (2012)
2. Palla, G., Derényi, I., Farkas, I., Vicsek, T.: Uncovering the overlapping community
structure of complex networks in nature and society. Nature 435(7043), 814 (2005)
3. Newman, M.E.: The structure and function of complex networks. SIAM Rev. 45(2),
167–256 (2003)
4. Girvan, M., Newman, M.E.J.: Community structure in social and biological
networks. Proc. Natl. Acad. Sci. 99(12), 7821–7826 (2002)
5. Barabási, A.L.: Network Science. Cambridge University Press (2016). http://
networksciencebook.com/
6. Gregory, S.: Fuzzy overlapping communities in networks. J. Stat. Mech. Theor.
Exp. 2011(2), P02017 (2011)
7. Gavin, A.C., Bösche, M., Krause, R., Grandi, P., Marzioch, M., Bauer, A., Schultz,
J., Rick, J.M., Michon, A.M., Cruciat, C.M.: Functional organization of the yeast
proteome by systematic analysis of protein complexes. Nature 415(6868), 141
(2002)
8. Baumes, J., Goldberg, M.K., Krishnamoorthy, M.S., Magdon-Ismail, M.,
Preston, N.: Finding communities by clustering a graph into overlapping subgraphs.
In: IADIS AC, pp. 97–104 (2005)
9. Derényi, I., Palla, G., Vicsek, T.: Clique percolation in random networks. Phys.
Rev. Lett. 94(16), 160202 (2005)
10. Gulbahce, N., Lehmann, S.: The art of community detection. BioEssays 30(10),
934–938 (2008)
11. Kelley, S.: The existence and discovery of overlapping communities in large-scale
networks. Ph.D. thesis, Rensselaer Polytechnic Institute (2009)
12. Kim, J., Wilhelm, T.: What is a complex graph? Phys. A 387(11), 2637–2652
(2008)
13. Li, H.J., Bu, Z., Li, A., Liu, Z., Shi, Y.: Fast and accurate mining the community
structure: integrating center locating and membership optimization. IEEE Trans.
Knowl. Data Eng. 28(9), 2349–2362 (2016)
14. Liu, C., Chamberlain, B.P.: Speeding up bigclam implementation on snap. arXiv
preprint arXiv:1712.01209 (2017)
15. Nepusz, T., Petróczi, A., Négyessy, L., Bazsó, F.: Fuzzy communities and the
concept of bridgeness in complex networks. Phys. Rev. E 77(1), 016107 (2008)
16. Page, L., Brin, S., Motwani, R., Winograd, T.: The pagerank citation ranking:
Bringing order to the web. Technical report, Stanford InfoLab (1999)
17. Populi, N.: The real-life applications of graph data structures you must know
(2018). https://leapgraph.com/graph-data-structures-applications
18. Traud, A.L., Kelsic, E.D., Mucha, P.J., Porter, M.A.: Comparing community struc-
ture to characteristics in online collegiate social networks. SIAM Rev. 53(3), 526–
543 (2011)
19. Traud, A.L., Mucha, P.J., Porter, M.A.: Social structure of Facebook networks.
Phys. A 391(16), 4165–4180 (2012)
20. Vanhems, P., Barrat, A., Cattuto, C., Pinton, J.F., Khanafer, N., Régis, C., Kim,
B., Comte, B., Voirin, N.: Estimating potential infection transmission routes in
hospital wards using wearable proximity sensors. PLOS ONE 8(9) (2013)
21. Xie, J., Kelley, S., Szymanski, B.K.: Overlapping community detection in networks:
the state-of-the-art and comparative study. ACM Comput. Surv. (CSUR) 45(4),
1–35 (2013)
22. Yang, J., Leskovec, J.: Overlapping community detection at scale: a nonnegative
matrix factorization approach. In: Proceedings of the 6th ACM International Con-
ference on Web Search and Data Mining, pp. 587–596 (2013)
23. Lancichinetti, A., Fortunato, S., Radicchi, F.: Benchmark graphs for testing com-
munity detection algorithms. Phys. Rev. E 78(4), 046110 (2008)
24. Lancichinetti, A., Radicchi, F., Ramasco, J.J., Fortunato, S.: Finding statistically
significant communities in networks. PLOS ONE 6(4) (2011)
25. Radicchi, F., Castellano, C., Cecconi, F., Loreto, V., Parisi, D.: Defining and iden-
tifying communities in networks. Natl. Acad. Sci. 101, 2658–2663 (2004)
26. Dombi, J., Gera, Z., Vincze, N.: On preferences related to aggregative operators
and their transitivity. In: LINZ, p. 56 (2006)
27. Dombi, J., Baczyński, M.: General characterization of implication’s distributivity
properties: the preference implication. IEEE Trans. Fuzzy Syst. 1 (2019)
28. Dombi, J.: Basic concepts for a theory of evaluation: the aggregative operator. Eur.
J. Oper. Res. 10(3), 282–293 (1982)
29. Dombi, J., Jónás, T.: Approximations to the normal probability distribution func-
tion using operators of continuous-valued logic. Acta Cybernetica 23(3), 829–852
(2018)
30. Csardi, G., Nepusz, T., et al.: The igraph software package for complex network
research. InterJ. Complex Syst. 1695(5), 1–9 (2006)
31. Csardi, G.: igraphdata: A Collection of Network Data Sets for the igraph Package.
r package version 1.0.1 (2015)
32. Rossi, R.A., Ahmed, N.K.: The network data repository with interactive graph
analytics and visualization. In: AAAI. http://networkrepository.com
33. Callaway, D.S., Newman, M.E., Strogatz, S.H., Watts, D.J.: Network robustness
and fragility: percolation on random graphs. Phys. Rev. Lett. 85(25), 5468 (2000)
34. Dombi, J., Dhama, S.: Preference relation and community detection. In: 2019
IEEE 19th International Symposium on Computational Intelligence and Informat-
ics and 7th IEEE International Conference on Recent Achievements in Mechatron-
ics, Automation, Computer Sciences and Robotics (CINTI-MACRo), pp. 33–36
(2019)
Community Detection Algorithm Using
Hypergraph Modularity
There are some recent attempts to deal with hypergraphs in the context of
clustering. For example, Kumar et al. [6,7] still reduce the problem to graphs
but use the original hypergraphs to iteratively adjust weights to encourage some
hyperedges to be included in some cluster but discourage other ones (this process
can be viewed as separating signal from noise). Moreover, in [4] a number of
extensions of the classic null model for graphs are proposed that can potentially
be used by true hypergraph algorithms. Unfortunately, there are many ways
such extensions can be done depending on how often vertices in one community
share hyperedges with vertices from other communities. This is something that
varies between networks at hand and usually depends on the hyperedge sizes.
Indeed, hyperedges associated with papers written by mathematicians might be
more homogeneous and smaller in comparison with those written by medical
doctors who tend to work in large and multidisciplinary teams. Moreover, in
general, papers with a large number of co-authors tend to be less homogeneous.
A good algorithm should be able to automatically decide which extension should
be used.
In this paper, we propose a framework that is able to adjust to various
scenarios mentioned above. We do it by generalizing and unifying all exten-
sions of the graph modularity function to hypergraphs, and putting them into
one framework in which the contribution from different “slices” is controlled by
hyper-parameters that can be tuned for a given scenario (Sect. 2). We propose
two prototype algorithms that show the potential of the framework, the so-called
proof-of-concept (Sect. 3). In order to test the performance of algorithms in vari-
ous scenarios, we introduce a synthetic random hypergraph model (Sect. 4) that
may be of independent interest. We experiment with our prototypes as well as
the two main competitors in this space, namely, the Louvain and Kumar et al.
algorithms (Sect. 5). The experiments show that, after tuning hyper-parameters
appropriately, the proposed prototypes work very well. Independently, we
provide evidence that such tuning can be done in an unsupervised way. Of course,
more work and experiments need to be done before we are able to announce a
scalable and properly tuned algorithm, but at the end of this paper we reveal a
few more details to that effect (Sect. 6). Spoiler alert: the reader who wants to
be surprised should avoid that section.
2 Modularity Functions
We start this section by recalling the classic definition of modularity function for
graphs (Sect. 2.1). In order to deal with hypergraphs, one may reduce the prob-
lem to graphs by considering the corresponding 2-section graph (Sect. 2.2). Alter-
natively, one may generalize the modularity function to hypergraphs (Sect. 2.3)
and then perform algorithms directly on hypergraphs. Such an approach should
presumably give better results as it preserves more information on the original
network in comparison to the corresponding 2-section graphs. In this paper, we
generalize the hypergraph modularity function even further, which allows us to
value various contributions to the modularity function differently (Sect. 2.4).
154 B. Kamiński et al.
H, graph H[2] on the same vertex set as H. For each hyperedge e ∈ E with
|e| ≥ 2 and weight w(e), $\binom{|e|}{2}$ edges are formed, each of them with weight
w(e)/(|e| − 1). While there are other natural choices for the weights (such as the
original weighting scheme $w(e)/\binom{|e|}{2}$ that preserves the total weight), this choice
ensures that the degree distribution of the created graph matches the one of
the original hypergraph H [6,7]. Moreover, let us mention that it also nicely
translates a natural random walk on H into a random walk on the corresponding
H[2] [13]. As hyperedges in H usually overlap, this process creates a multigraph.
In order for H[2] to be a simple graph, if the same pair of vertices appears in
multiple hyperedges, the edge weights are simply added together.
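The construction of the degree-preserving 2-section graph can be sketched as follows. This is a minimal illustration, assuming hyperedges are given as tuples of distinct vertices; the function names and the dictionary-based graph representation are choices made here, not part of the original work.

```python
from collections import defaultdict
from itertools import combinations


def two_section(hyperedges, weights=None):
    """Build the weighted 2-section graph of a hypergraph.

    hyperedges: iterable of collections of distinct vertices
    weights: optional list of hyperedge weights (defaults to 1.0 each)
    Returns a dict {frozenset({u, v}): weight}.
    """
    hyperedges = [tuple(e) for e in hyperedges]
    if weights is None:
        weights = [1.0] * len(hyperedges)
    graph = defaultdict(float)
    for e, w in zip(hyperedges, weights):
        if len(e) < 2:
            continue  # hyperedges of size 1 produce no graph edges
        share = w / (len(e) - 1)  # degree-preserving weight per pair
        for u, v in combinations(e, 2):
            graph[frozenset((u, v))] += share  # parallel edges are merged
    return dict(graph)


def weighted_degrees(graph):
    """Sum of incident edge weights for every vertex of the 2-section graph."""
    degree = defaultdict(float)
    for pair, w in graph.items():
        for u in pair:
            degree[u] += w
    return dict(degree)
```

For the hypergraph with hyperedges (1, 2, 3) and (2, 3), vertices 2 and 3 lie in two hyperedges and vertex 1 in one; the weighted degrees of the resulting graph reproduce exactly these values.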
The framework introduced in [4] is more flexible but, for simplicity, let us
concentrate only on the two extreme cases. For d ∈ N and p ∈ [0, 1], let Bin(d, p)
denote the binomial random variable with parameters d and p. The
majority-based modularity function for hypergraphs is defined as
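$$ q_H^m(A) = \sum_{A_i \in A} \frac{e_H^m(A_i)}{|E|} - \sum_{d \ge 2} \frac{|E_d|}{|E|} \sum_{A_i \in A} P\left( \mathrm{Bin}\left(d, \frac{\mathrm{vol}_H(A_i)}{\mathrm{vol}_H(V)}\right) > \frac{d}{2} \right), \tag{2} $$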
In (2), $e_H^m(A_i)$ counts the number of hyperedges where the majority of vertices
belong to part $A_i$ while in (3), $e_H^s(A_i)$ counts the number of edges where all
vertices are in part $A_i$. The goal is the same as for graphs. We search for a
partition A that yields modularity as close as possible to the maximum modularity
$q^*(H)$ which is defined as the maximum over all possible partitions of the vertex
set. We can define weighted versions of the above functions (with weights on
hyperedges) the same way as we did for graphs. Finally, note that if H consists
only of hyperedges of size 2 (that is, H is a graph), then both (2) and (3) reduce
to (1).
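As an illustration, the majority-based function (2) can be evaluated directly for a small unweighted hypergraph. The sketch below is one possible reading of (2), not the authors' implementation; it assumes hyperedges are tuples of distinct vertices of size at least 2, and the function names are hypothetical.

```python
from collections import Counter
from math import comb


def binom_tail_gt_half(d, p):
    """P(Bin(d, p) > d/2), computed by summing the binomial probabilities."""
    return sum(comb(d, c) * p**c * (1 - p)**(d - c)
               for c in range(d // 2 + 1, d + 1))


def majority_modularity(hyperedges, labels):
    """Unweighted majority-based modularity of a labelled hypergraph.

    hyperedges: list of tuples of distinct vertices, each of size >= 2
    labels: dict mapping every vertex to its part identifier
    """
    m = len(hyperedges)
    # vol(A_i): sum of the degrees of the vertices in part A_i
    vol = Counter()
    for e in hyperedges:
        for v in e:
            vol[labels[v]] += 1
    total_vol = sum(vol.values())

    # Edge contribution: hyperedges with a strict majority in a single part
    edge_term = 0
    for e in hyperedges:
        top = Counter(labels[v] for v in e).most_common(1)[0][1]
        if 2 * top > len(e):
            edge_term += 1

    # Degree tax: expected number of such hyperedges under the null model
    sizes = Counter(len(e) for e in hyperedges)
    tax = sum(n_d * binom_tail_gt_half(d, vol_i / total_vol)
              for d, n_d in sizes.items()
              for vol_i in vol.values())
    return (edge_term - tax) / m
```

For a graph (all hyperedges of size 2) the value agrees with the classic graph modularity: two disjoint edges, each forming its own part, give 0.5 under both definitions.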
where $e_H^{d,c}(A_i)$ is the number of hyperedges of size d that have exactly c members
in $A_i$. So $q_H^m(A)$ can be viewed as:
$$ q_H^m(A) = \sum_{d \ge 2} \; \sum_{c = \lfloor d/2 \rfloor + 1}^{d} q_H^{c,d}(A), $$
where
$$ q_H^{c,d}(A) = \frac{1}{|E|} \sum_{A_i \in A} \left( e_H^{d,c}(A_i) - |E_d| \cdot P\left( \mathrm{Bin}\left(d, \frac{\mathrm{vol}(A_i)}{\mathrm{vol}(V)}\right) = c \right) \right). $$
This definition gives us more flexibility and allows us to value some slices more
than others. In our experiments, we restricted ourselves to the following family
of hyper-parameters, which gave us enough flexibility but is controlled by only 3
variables. (In fact, we will argue later on that one of them, namely ρmax, can
be set to one.) Let α ∈ [0, ∞) and ρmin, ρmax ∈ (0.5, 1] be such that ρmin ≤ ρmax.
Then,
$$ w_{c,d} = \begin{cases} (c/d)^{\alpha} & \text{if } d\rho_{\min} \le c \le d\rho_{\max}, \\ 0 & \text{otherwise.} \end{cases} \tag{5} $$
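The weight family (5) is straightforward to evaluate. A minimal sketch follows; the function name and the default value 0.5 + 1e-9 (standing in for a value of ρmin just above one half) are assumptions made for illustration.

```python
def slice_weight(c, d, alpha=1.0, rho_min=0.5 + 1e-9, rho_max=1.0):
    """Weight w_{c,d} of the slice with c of d hyperedge vertices in one part."""
    if not (0.5 < rho_min <= rho_max <= 1.0):
        raise ValueError("need 0.5 < rho_min <= rho_max <= 1")
    if alpha < 0.0:
        raise ValueError("alpha must be non-negative")
    if d * rho_min <= c <= d * rho_max:
        return (c / d) ** alpha
    return 0.0
```

With α = 0 every admissible slice receives weight 1, while larger values of α increasingly favour purer hyperedges.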
Parameters ρmin and ρmax are related to the assumption on the minimal
and, respectively, maximal "pureness" of hyperedges and depend on the level
of homogeneity of the network. In particular, ρmax may be bounded away from
one if one expects that "totally pure" (that is, occurring in a single community)
hyperedges are unlikely to be observed in practice. Finally, parameter α governs
the smooth transition in relative informativeness between contributing
hyperedges of different levels of "pureness".
As a result, after adjusting the hyper-parameters accordingly, (4) can be
used for the two extreme cases (majority-based and strict-based) and anything
in between. Moreover, (4) may well approximate the graph modularity for the
corresponding 2-section graph H[2]. Indeed, if c vertices of a hyperedge e of size
d and weight w(e) fall into one part of the partition A, then the contribution
to the graph modularity is $w(e)\binom{c}{2}/(|e| - 1)$ (in the variant where the degrees
are preserved) or $w(e)\binom{c}{2}/\binom{|e|}{2} \approx w(e)(c/|e|)^2$ (if the total weight is preserved).
Hence, the hyper-parameters can be adjusted to reflect that. The only
difference is that (4) does not include contributions from parts that contain
at most d/2 vertices of a hyperedge, which still contribute to the graph modularity of H[2].
However, most of the contribution comes from large values of c and so the two
corresponding measures are close in practice.
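The quality of the approximation $\binom{c}{2}/\binom{d}{2} \approx (c/d)^2$ for a hyperedge of size d with c vertices in one part can be checked numerically. The helper names below are hypothetical and serve only this illustration.

```python
from math import comb


def two_section_share(c, d):
    """Fraction of a hyperedge's 2-section pairs that fall inside one part."""
    return comb(c, 2) / comb(d, 2)


def squared_fraction(c, d):
    """The approximation (c/d)**2 of that fraction."""
    return (c / d) ** 2


def approximation_gap(c, d):
    """Absolute difference between the exact share and its approximation."""
    return abs(two_section_share(c, d) - squared_fraction(c, d))
```

Since the exact share equals c(c − 1)/(d(d − 1)), the gap shrinks as hyperedges grow at a fixed ratio c/d.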
3 Algorithms
In this paper, we experiment with four clustering algorithms that can handle
networks represented as hypergraphs. The last two of them are prototypes
of our hybrid and flexible framework, which is under development. A more advanced
version will be presented in forthcoming papers, but some spoilers are provided in
Sect. 6.
changing the cluster of a vertex to one of its neighbours’ if it increases the mod-
ularity function. Unfortunately, trying to apply this strategy to hypergraphs is
challenging. Indeed, if one starts from all vertices in their own community, then
changing the cluster of only one vertex will likely have no positive effect on the
modularity function unless edges of small size are present. For example, it takes
several moves for a hyperedge of size d ≥ 4 to have the majority of its vertices
to fall into the same community.
In order to solve this problem, we propose to use the graph modularity func-
tion qG (A) defined in (1) to “lift the process from the ground” but then switch
to the hypergraph counterpart qH (A) defined in (4). There are many ways to
achieve it and one of them is mentioned in Sect. 6. For experiments provided in
this paper, we consider two prototypes: the first one (HA) switches to hyper-
graphs as soon as possible whereas the second one (LS ) stays with graphs for
much longer. The first algorithm, that we call HA (for hybrid algorithm), works
as follows:
1. Form small, tight clumps by running ECG using qG (A) on the degree-
preserving graph G built from H. Prune edges below the threshold value
of 70% (number of votes), and keep connected components as initial clumps.
2. Merge clumps (in a random order) if qH (A) improves. Repeat until no more
improvement is possible.
3. Move one vertex at a time (in a random order) to a neighbouring cluster if it
improves qH . Repeat until convergence.
The second algorithm, which we call LS (for last step), runs Kumar et al. and then
performs only the last step (step 3) above.
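Step 3 above (moving one vertex at a time to a neighbouring cluster while the objective improves) can be sketched generically. The code below is an illustrative sketch rather than the authors' implementation: the objective is passed in as a callable, and the classic graph modularity is used only as a stand-in where the hypergraph modularity would be used in the algorithm. All names are hypothetical.

```python
import random


def graph_modularity(edges, labels):
    """Classic graph modularity, used here only as a stand-in objective."""
    m = len(edges)
    inside, degree = {}, {}
    for u, v in edges:
        degree[labels[u]] = degree.get(labels[u], 0) + 1
        degree[labels[v]] = degree.get(labels[v], 0) + 1
        if labels[u] == labels[v]:
            inside[labels[u]] = inside.get(labels[u], 0) + 1
    return sum(inside.get(c, 0) / m - (deg / (2 * m)) ** 2
               for c, deg in degree.items())


def local_moves(neighbours, labels, objective, seed=0):
    """Greedy vertex moves: accept a move only if the objective improves.

    neighbours: dict vertex -> iterable of adjacent vertices
    labels: dict vertex -> cluster identifier (identifiers must be sortable)
    objective: callable taking a labels dict and returning a float
    """
    rng = random.Random(seed)
    labels = dict(labels)            # work on a copy
    best = objective(labels)
    improved = True
    while improved:                  # repeat sweeps until convergence
        improved = False
        order = list(neighbours)
        rng.shuffle(order)           # random vertex order in every sweep
        for v in order:
            current = labels[v]
            candidates = {labels[u] for u in neighbours[v]} - {current}
            for target in sorted(candidates):
                labels[v] = target
                score = objective(labels)
                if score > best + 1e-12:
                    best, current, improved = score, target, True
            labels[v] = current      # keep the best cluster found for v
    return labels, best
```

Because every accepted move strictly increases the objective and the number of partitions is finite, the loop terminates.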
Finally, recall that the hypergraph modularity function qH (A) is controlled
by hyper-parameters wc,d but we restrict ourselves to a family of such parameters
guided by parameters α, ρmin , and ρmax ; see (5). Hence, we will refer to the above
algorithms as HA(α, ρmin , ρmax ) and, respectively, LS(α, ρmin , ρmax ).
size of a largest hyperedge and $m = \sum_{d \ge 2} m_d$ is the total number of hyperedges.
Hyperedges are partitioned into community and noise hyperedges. The expected
proportion of noise edges is μ ∈ [0, 1], the parameter that controls the level
of noise. Each community hyperedge will be assigned to one community. The
expected fraction of hyperedges that are assigned to the kth community is $p_k$;
in particular, $\sum_{k=1}^{K} p_k = 1$. Community hyperedges that are assigned to the
kth community will have majority members from that community. On the other
hand, noise hyperedges will be "sprinkled" across the whole hypergraph.
The hyperedges of H are generated as follows. For each edge size d, we
independently generate md edges of size d. For each edge e of size d, we first decide
if e is a community hyperedge or noise. It is noise with probability μ;
otherwise, it is a community hyperedge. If e turns out to be noise, then we simply
choose its d vertices uniformly at random from the set of all sets of vertices
of size d, regardless of which community they belong to. On the other hand,
if e is a community edge, then we assign it to community k with probability
pk . Then, we fix the homogeneity value τe of hyperedge e that is the integer-
valued random variable taken uniformly at random from the homogeneity set
{ τmin d , τmin d + 1, . . . , τmax d }. The homogeneity set depends on parame-
ters τmin and τmax of the model that satisfy 0.5 < τmin ≤ τmax ≤ 1, and is
assumed to be the same for all edges. Finally, members of e are determined
as follows: τe vertices are selected uniformly at random from the kth commu-
nity, and the remaining vertices are selected uniformly at random from vertices
outside of this community.
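The generation procedure can be sketched as follows. This is a simplified illustration, not the authors' generator: it assumes that the bounds of the homogeneity set are obtained by rounding τmin·d and τmax·d up, that vertices are numbered consecutively community by community, and that every community is large enough to supply the required vertices. All names are hypothetical.

```python
import math
import random


def generate_hypergraph(community_sizes, edges_per_size, mu, p,
                        tau_min, tau_max, seed=0):
    """Sample hyperedges from the synthetic model.

    community_sizes: list of community sizes (n_1, ..., n_K)
    edges_per_size: dict {d: m_d} with d >= 2
    mu: probability that a hyperedge is noise
    p: list of community probabilities (p_1, ..., p_K)
    Returns (hyperedges, truth, communities); truth[i] is the community
    index of hyperedge i, or None for a noise hyperedge.
    """
    rng = random.Random(seed)
    communities, start = [], 0
    for n_k in community_sizes:
        communities.append(list(range(start, start + n_k)))
        start += n_k
    all_vertices = list(range(start))

    hyperedges, truth = [], []
    for d, m_d in sorted(edges_per_size.items()):
        lo = math.ceil(tau_min * d)               # assumption: round up
        hi = min(d, max(lo, math.ceil(tau_max * d)))
        for _ in range(m_d):
            if rng.random() < mu:                 # noise hyperedge
                hyperedges.append(tuple(rng.sample(all_vertices, d)))
                truth.append(None)
                continue
            k = rng.choices(range(len(p)), weights=p)[0]
            tau_e = rng.randint(lo, hi)           # homogeneity value
            members = set(communities[k])
            inside = rng.sample(communities[k], tau_e)
            outside_pool = [v for v in all_vertices if v not in members]
            outside = rng.sample(outside_pool, d - tau_e)
            hyperedges.append(tuple(inside + outside))
            truth.append(k)
    return hyperedges, truth, communities
```

Fixing the seed makes the sample reproducible, which is convenient when several clustering algorithms are compared on the same hypergraph.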
As mentioned above, the proposed model aims to be simple but it tries
to capture the fact that many real-world networks represented as hypergraphs
exhibit various levels of homogeneity or the lack thereof. Moreover, some
networks are noisy with some fraction of hyperedges consisting of vertices from
different communities. Such behaviour can be controlled by parameters τmin,
τmax, and μ. It gives us a tool to test the performance of our algorithms for
various scenarios. A good algorithm should be able to adjust to any scenario in
an unsupervised way.
5 Experiments
For our experiments we use the synthetic random hypergraph model
introduced in Sect. 4. It contains 5 communities, each consisting of 40 vertices:
(n1, n2, ..., n5) = (40, 40, ..., 40). The distribution of hyperedge sizes is as
follows: (m2, m3, ..., m11) = (30, 30, 30, 30, 30, 30, 30, 20, 20, 20). The expected
fraction of edges that belong to a given cluster is equal to 0.2: (p1, p2, ..., p5) =
(0.2, 0.2, ..., 0.2). The lower bound for the homogeneity interval is fixed to be
τmin = 0.65. We performed experiments on four hypergraphs with the
remaining two parameters fixed to: a) (μ, τmax) = (0, 0.65), b) (μ, τmax) = (0, 0.8),
c) (μ, τmax) = (0, 1), d) (μ, τmax) = (0.1, 0.80). All of them lead to the same
conclusion so we present figures only for hypergraph H that is obtained with
parameters d).
We test the two known algorithms, Louvain and Kumar et al., as well as
our two prototypes, LS and HA. For each prototype, we test three
different sets of hyper-parameters. In the first variant, we include contribution to
the hypergraph modularity function that comes from all slices, that is, we fix
ρmin = 0.5+ = 0.5 + ε (for some very small ε > 0 so that all "slices" of the
modularity function are included) and ρmax = 1. For simplicity, we fix α = 1.
For convenient notation, let LS = LS(1, 0.5+, 1) and HA = HA(1, 0.5+, 1).
For the second variant, we use the knowledge about the hypergraph (ground
truth) and concentrate only on slices that are above the lower bound for the
homogeneity set, that is, we fix ρmin = τmin = 0.65 but keep ρmax = 1. Let
LS+ = LS(1, 0.65, 1) and HA+ = HA(1, 0.65, 1). Finally, we use the complete
knowledge about the generative process of our synthetic hypergraph and fix
ρmin = τmin = 0.65 and ρmax = τmax = 0.80. The corresponding algorithms are
denoted by LS++ = LS(1, 0.65, 0.80) and HA++ = HA(1, 0.65, 0.80).
In the first experiment, we run each algorithm on H and measure its perfor-
mance using the Adjusted Mutual Information (AMI). AMI is the information
theory measure that allows us to quantify the similarity between two partitions
of the same set of nodes, the partition returned by the algorithm and the ground
truth. Since all algorithms involved are randomized, we run them 100 times and
present a box-plot of the corresponding AMIs in Fig. 1(a). We see that LS and
HA give results comparable to the original Louvain, whereas Kumar et al. is
consistently better. On the other hand, when our prototypes are provided with
knowledge about the homogeneity of H, they perform very well, better than
Kumar et al. There is less of a difference between the + and ++ variants of the two
prototypes. This is a good and desired feature as "pure" hyperedges should not
generally be penalized unless there is some known external hard constraint that
prevents hyperedges from being homogeneous. On the other hand, if large hyperedges
are non-homogeneous, then the quality of + and ++ should be similar as these
“slices” barely contribute to the modularity function anyway. Note that this
observation does not apply to small hyperedges; in the extreme situation when
dealing with graphs with hyperedges of size 2, any choice of ρmax ≥ ρmin leads
to exactly the same results.
The previous experiment shows that knowing some global statistics (namely,
how homogeneous the network is) significantly increases the performance of our
prototypes. However, typically such information is not available and the algo-
rithm has to learn such global statistics in an unsupervised way. In our second
experiment, we test if this is possible. We take a partition returned by Kumar
et al. and investigate all hyperedges of H. For each hyperedge e we check if at
least τ ≥ 0.5 fraction of its vertices belong to some community. We compare it
with the corresponding homogeneity value based on the ground truth. The two
distributions are presented in Fig. 1(c) and are almost indistinguishable. This
suggests that learning the right value of ρmin should be possible in practice.
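The homogeneity check of the second experiment can be sketched in a few lines. The code assumes that the homogeneity of a hyperedge with respect to a partition is the largest fraction of its vertices that lie in a single community; the function names and the rounding of the histogram keys are choices made for this illustration.

```python
from collections import Counter


def edge_homogeneity(hyperedge, labels):
    """Largest fraction of the hyperedge's vertices lying in one community."""
    counts = Counter(labels[v] for v in hyperedge)
    return max(counts.values()) / len(hyperedge)


def homogeneity_histogram(hyperedges, labels, threshold=0.5):
    """Count hyperedges by homogeneity, keeping values of at least threshold."""
    hist = Counter()
    for e in hyperedges:
        h = edge_homogeneity(e, labels)
        if h >= threshold:
            hist[round(h, 2)] += 1
    return dict(hist)
```

Computing the histogram once with the labels returned by a clustering algorithm and once with the ground-truth labels gives the two distributions that are compared in the experiment.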
Finally, we tested the performance of our prototypes for various choices of
parameter α. The + and ++ variants turn out not to be too sensitive, whereas LS
and HA increase their performance as α increases (Fig. 1(b)). It is perhaps not
References
1. Blondel, V.D., Guillaume, J.L., Lambiotte, R., Lefebvre, E.: Fast unfolding of
communities in large networks. J. Stat. Mech. Theor. Exp. 10, P10008 (2008)
2. Clauset, A., Newman, M.E.J., Moore, C.: Finding community structure in very
large networks. Phys. Rev. E 70, 066111 (2004)
3. Fortunato, S., Barthelemy, M.: Resolution limit in community detection. Proc.
Natl. Acad. Sci. U.S.A. 104, 36–41 (2007)
4. Kamiński, B., Poulin, V., Pralat, P., Szufel, P., Théberge, F.: Clustering via hyper-
graph modularity. PLOS ONE 14(11), e0224307 (2019)
5. Kamiński, B., Pralat, P., Théberge, F.: Artificial Benchmark for Community
Detection (ABCD)—Fast Random Graph Model with Community Structure,
arXiv:2002.00843
6. Kumar, T., Vaidyanathan, S., Ananthapadmanabhan, H., Parthasarathy, S.,
Ravindran, B.: A new measure of modularity in hypergraphs: theoretical insights
and implications for effective clustering. In: International Conference on Complex
Networks and Their Applications, Complex Networks 2019, pp. 286–297. Springer,
Cham (2019)
7. Kumar, T., Vaidyanathan, S., Ananthapadmanabhan, H., Parthasarathy, S.,
Ravindran, B.: Hypergraph clustering by iteratively reweighted modularity maxi-
mization. Appl. Netw. Sci 5 (2020)
8. Lancichinetti, A., Fortunato, S.: Limits of modularity maximization in community
detection. Phys. Rev. E 84, 066122 (2011)
9. Lancichinetti, A., Fortunato, S., Radicchi, F.: Benchmark graphs for testing com-
munity detection algorithms. Phys. Rev. E 78 (2008)
10. Newman, M.E.J.: Fast algorithm for detecting community structure in networks.
Phys. Rev. E 69, 066133 (2004)
11. Newman, M.E.J., Girvan, M.: Finding and evaluating community structure in
networks. Phys. Rev. E 69, 026113 (2004)
12. Poulin, V., Théberge, F.: Ensemble clustering for graphs. In: Aiello, L., Cherifi,
C., Cherifi, H., Lambiotte, R., Lió, P., Rocha, L. (eds.) Complex Networks and
their Applications VII, COMPLEX NETWORKS 2018. Studies in Computational
Intelligence, vol. 812. Springer, Cham (2018)
13. Théberge, F.: Summer School on Data Science Tools and Techniques in Modelling
Complex Networks. https://github.com/ftheberge/ComplexNetworks2019/
Towards Causal Explanations of
Community Detection in Networks
1 Introduction
Imagine someone participating in a social network. Due to an analytics engine
that the social network offers for its users, she finds out that she is unintention-
ally part of a community and asks what are the reasons for her belongingness
to this community. She would also wish to become a member of another com-
munity - always in the context of the community detection algorithm offered
by the analytics engine - and asks what new relations she would have to set up
in order to become a member. Note that in this example, one’s membership in
a community is not explicit but implicit through the social network analytics
engine, which affects many aspects of the user’s belongingness to the social net-
work (e.g., recommendation of new friends, selection of ads to show, etc.) and
thus it is of high importance to the user. In particular, we ask:
1. What causes the fact that a node u belongs to a community C? Which are the
edges that are responsible for u ∈ C? Can we rank these edges based on the
degree of their responsibility for u ∈ C?
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 943, pp. 164–176, 2021.
https://doi.org/10.1007/978-3-030-65347-7_14
2. What causes the fact that a node u does not belong to a community C? Which
are the new edges that would allow u to become a member of C?
Networks are used to represent data in almost any field, such as transporta-
tion systems [7], biological systems [22], and social groups [20], just to name a
few. In such networks, certain groups of nodes with particular importance arise,
which form the so-called communities. The dominant definition of a community is a
group of nodes that are more densely connected internally than externally [25].
In real-world networks, communities are of major importance since they are
related to functional units [3,24]. Communities also have topological properties
that are different from those of the network as a whole.
In this work, we formulate such questions related to community detection by
using the structural-model approach introduced by Halpern and Pearl [11,12]. In
particular, we focus on the community detection problem and we define different
sets of causes with different degrees of importance with respect to the question
at hand. This importance is captured by a measure of responsibility [4] for each
cause, thus allowing for a ranking of the causes. Moreover, this structural-model
approach has the side-effect that other communities may change as a result
of a question for a particular node. We introduce a measure for these changes
to quantify how the interventions implied by the cause alter the community
structure of the network. To the best of our knowledge, we are the first to look
at the community detection problem on networks through the lens of causal
explanations. In fact, it seems that this is the first time that such a viewpoint is
adopted with respect to general network analysis problems.
Related Work. Community detection in general has been a very active field
during the last years. There is a plethora of algorithms aiming at finding the best
quality communities in networks, based on different evaluation metrics. Those
works include both disjoint and overlapping community detection algorithms. For
detailed surveys of the field we refer to [8,14]. However, there is no work on
combining causal explanations and community detection. The structural-model
approach introduced by Halpern and Pearl [11,12] has been applied mainly to
database queries. Meliou et al. [18] transferred these notions to databases. This
approach is related to data provenance, lineages and view updates (e.g., deletion
propagation) [19]. Inspired by this approach others have applied this structural-
model approach to reverse-skyline queries [10], to probabilistic nearest neighbor
queries [16], and so on. One more network-related problem where this model has
been applied concerns the ranking of propagation history in information diffusion
in social networks [27].
Contributions and Roadmap. This work focuses on the application of causal-
ity in the community detection problem. Examining causality in community
detection in networks is novel in its own right. We suggest a general frame-
work that can be used to find causal relations about the belongingness of nodes
to communities. An interesting aspect is that the proposed framework is easily
adaptable to other network processing operations apart from community detec-
tion. Apart from transferring the concepts in [11,12], we also introduce the con-
166 G. Baltsou et al.
Here we study the proposed causal model and introduce some fundamental con-
cepts.
2.1 Preliminaries
Initially, we restrict our setting to a simple undirected, unweighted network
G = (V, E), which is composed of a node set V = {1, . . . , n} with n = |V | nodes
and an edge set E ⊆ |V | · |V − 1| with m = |E| edges; we discuss extensions to
more generic graphs in Sect. 4. Let G[S] represent the induced sub-graph of the
node set S, S ⊆ V. The adjacent nodes of a node u, i.e., the nodes connected to
u with an edge, are its neighbours: N(u) = {v | {u, v} ∈ E}. The degree of u is
deg(u) := |N(u)|, i.e., the number of u's neighbours.
Modularity [21] is a widely used objective function to measure the quality of
a network’s decomposition into communities and is defined as:
$$ Q = \sum_{C=1}^{j} \left[ \frac{m_C}{m} - \left( \frac{\deg_C}{2m} \right)^{2} \right] $$
endogenous ones.1 Exogenous pairs of nodes Ex ⊆ |V |·|V −1| are not considered
to have a causal effect on the (non-)belongingness of node v to a community and
endogenous Ee ⊆ |V | · |V − 1| are the pairs of nodes that can in principle infer
such causal implications. Note that Ex ∪ Ee = |V | · |V − 1| and Ex ∩ Ee = ∅.
To check if an edge e is a cause for the (non-)belongingness of a node v to
a particular community C, we have to find a set of endogenous pairs of nodes
whose edge removal/addition will allow e to immediately affect the belongingness
of v to the community C. These sets are called contingencies. Note that the
contingency set does not alone change the community of v but it is required
in order to unlock the causal effect of edge e on the belongingness of v. The
contingency set must be minimal, in the sense that removing any edge from it
weakens the causal effect of e on the belongingness of v to its community;
no redundancy is allowed.
In a sense, all incident edges (and possibly additional ones) of v affect its
belongingness to the community (either in positive or negative manner). Thus,
we need a ranking function that will allow us to reason about the most important
causes for v (non-)participating in the community. Responsibility [4] measures
the degree of causality of an edge e for a node v as a function of the smallest
contingency set.
In the following, we provide a definition of causality tailored to the problem
of why a node belongs to a particular community.
Definition 1. Let e ∈ Ee be the edge connecting an endogenous pair of nodes
and let v belong to community C.
1 The endogenous and exogenous sets differ for different nodes. Also, the endogenous
set typically comprises the incident edges connecting this node to its neighbors.
Finally, allowing self-loops is not an issue, since if they are irrelevant to the setting,
we can simply make them exogenous.
for all contingency sets Γ for e, where |Γ | is the size of the set Γ .
The range of ρe is (0, 1]. If the contingency set is ∅, then the responsibility
is 1; otherwise, the larger the contingency set, the smaller the responsibility. In
this way, we capture the degree of intervention needed (the set Γ) to uncover
the causal implication of e on v with respect to community C.
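The properties listed above can be illustrated with a few lines of code. The sketch assumes the usual form of responsibility from [4], ρe = 1/(1 + min |Γ|) over all contingency sets Γ for e, which agrees with the stated properties (ρe = 1 for an empty contingency set, decreasing as |Γ| grows). The function name is hypothetical.

```python
from fractions import Fraction

def responsibility(contingency_sets):
    """Responsibility of a cause, assuming rho = 1 / (1 + min |Gamma|)."""
    if not contingency_sets:
        raise ValueError("an edge without a contingency set is not a cause")
    smallest = min(len(gamma) for gamma in contingency_sets)
    return Fraction(1, 1 + smallest)

# A counterfactual cause has the empty contingency set, so rho = 1.
rho_counterfactual = responsibility([set()])
# With contingency sets of sizes 1 and 3, the smallest one gives rho = 1/2.
rho_actual = responsibility([{(19, 32)}, {(1, 2), (2, 3), (3, 4)}])
```

Exact fractions are used so that values such as 1/2 or 1/3 are not blurred by floating-point rounding when causes are ranked.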
At this point we need to discuss a distinguishing feature in the introduction
of causality in community detection. The counterfactual interventions suggested
by the contingency set and the cause may well change other communities. This
may seem an undesirable side-effect of our definition that we may choose to
ignore, as the question of the causes for the belongingness or non-belongingness
of v to community C is related to v alone, and so potential changes to other
communities are of no interest to v. Since these causes are counterfactual, in
fact no change happens if they are simply used for the purpose of briefing v.
However, if we consider v to be an agent whose purpose is to find out what
actions should be taken in order to achieve her removal from C (in the case
she asks of the causes of her belongingness to C) or her addition to C (in the
case she asks of the causes of her non-belongingness to C ) then this side-effect
becomes important. In this case, the causes and their corresponding contingency
sets can be considered as a suggested set of actions so that v achieves her goal.
Clearly, the endogenous set must be defined so that v can alter the corresponding
edges. Going into more depth, one confronts various issues, such as the
identity problem that arises in community detection in temporal networks
[23]; that is, after the intervention, what happens if C has changed so much
that it can no longer be considered as C? We avoid such issues by introducing
a measure of such changes, called discrepancy.
Assumption 1. The more distant two nodes are, the less they influence each other.
This assumption allows us to focus on possible causes around node v and in the
involved communities. Pairs of nodes whose corresponding edges are considered
far from v and do not belong to the involved communities are not considered as
endogenous. In community detection in unweighted networks, this assumption
is part of its very definition, since the belongingness of node v to a community
is guided mainly by its incident edges. Such an assumption is widely used
in network analysis [2,5,6]; e.g., in social networks it is known as Friedkin's
postulate [9].
We also make an assumption concerning the size of communities, called
henceforth the Size Assumption. This assumption allows us to bound the number
of endogenous pairs of nodes introduced by the involved communities.
Usually, communities tend to be smaller than the size of the network. As dis-
cussed in [8], after systematic analysis by the authors of [15], communities in
many large networks, including traditional and online social networks, techno-
logical, information networks and web graphs, are fairly small in size. It is also
believed that the communities in biological networks are relatively small, i.e.,
3–150 nodes [26,28]. We capture this phenomenon by assuming that the size of
communities is O(n^ε) for some small constant ε < 1.
Finally, for efficiency reasons, we assume that both the number of causes and
the size of the Γ set are small. We call this assumption the Bound Assumption.
Assumption 3. The number of causes and the size of contingency sets are
bounded by a small constant.
This assumption is important because it limits the available options for causes
and the contingency set. If the number of causes was large, then the information
on the causal relations would be minuscule. Besides, if the size of the contingency
sets were large, that would lead to a very low value of responsibility, meaning
that the effect of the actual cause is minuscule.
and Orange communities are the one group after the division while the Blue
and Purple communities correspond to the other group. The modularity Q of
this network decomposition is 0.417.
Fig. 1. The Zachary karate club network. There are 4 different communities denoted
by different colors: Green, Orange, Blue and Purple.
Let us first look at node 10. What is the cause for 10 ∈ Blue? Removing edge
(10, 34) leads to 10 no longer belonging to Blue but to Green, and
thus edge (10, 34) is a counterfactual cause. The resulting modularity is
Q = 0.427 and no other node changes community, which means that discrepancy
γ = 0 while responsibility ρ = 1, since the contingency set is empty. The Locality
Assumption 1 was used since the endogenous pairs of nodes were assumed to be
only the neighbours of node 10. In case we extend the set of endogenous pairs to
contain the neighbours of 34, we could weaken node 34, by choosing some of its
incident edges (with the exception of edge (34, 10)), thus indirectly making node
10 belong to Green. However, this is not a cause for 10 ∈ Blue but a by-product
of 34 being a hub node of Blue. Of course, by transitivity,2 the fact that 34 is
a hub of Blue causes 10 ∈ Blue through edge (34, 10), but we prefer to look
straightforwardly at the direct cause expressed by this edge.
Why does node 32 ∉ Blue? As seen in Fig. 1, node 32 is quite central in the
Purple community. We found that the edge (31, 32) is a cause for 32 ∉ Blue
with contingency Γ = {(19, 32)}, meaning that its responsibility for 32 ∉ Blue
is 1/2. Note that since the network is undirected the same can be said for edge
(19, 32) as a cause with contingency Γ = {(31, 32)} with ρ = 1/2. In this case
Q = 0.404 while node 29 is also put in the Blue community, and thus γ = 1/34.
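The value γ = 1/34 can be reproduced with a small sketch. The formal definition of discrepancy is not reproduced in this excerpt; the code assumes, consistently with the two examples above, that γ is the number of nodes other than v whose community changes after the intervention, divided by n. It also assumes that community labels are comparable before and after the intervention. The function name and the simplified label maps are hypothetical and do not reproduce the actual partition of Fig. 1.

```python
def discrepancy(before, after, v):
    """Fraction of nodes other than v whose community label changes.

    Assumption: gamma = (#nodes != v that change community) / n, with
    labels that are comparable before and after the intervention.
    """
    n = len(before)
    changed = sum(1 for u in before if u != v and before[u] != after[u])
    return changed / n

# Simplified labels for a 34-node network: everything starts as "Purple";
# after the intervention nodes 32 and 29 end up in "Blue".
before = {u: "Purple" for u in range(1, 35)}
after = dict(before)
after[32] = "Blue"   # the query node itself is not counted
after[29] = "Blue"   # one other node changes community
gamma = discrepancy(before, after, v=32)
```

Under these assumptions only node 29 counts as a side effect, so γ = 1/34.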
Finally, let us look at node 9. Why does 9 ∉ Green? Iterating over all nodes
in Green as causes we get the results in Table 1. The results are expected since
N(9) = {1, 3, 31, 33, 34}, which are the most central nodes w.r.t. degree in their
communities. We expected that (9, 2) would be a counterfactual cause for 9 ∉
Green but this is not the case.
What if we extend the definition of contingency and allow for deletions of
edges of 9 to nodes within its current community so that its belongingness to
Blue community is weakened? Then, in this case the edge (9, 34) is a cause with
Γ = {(9, 31)} since their removal moves 9 to Green with γ = 0 and Q = 0.423.
2 Transitivity does not hold in general w.r.t. causation [11].
Towards Causal Explanations of Community Detection in Networks 171
Cause      Γ           ρ     Q       γ
(9, 12)    {(9, 2)}    1/2   0.402   0
(9, 20)    {(9, 2)}    1/2   0.402   0
(9, 14)    {(9, 2)}    1/2   0.402   0
(9, 4)     {(9, 2)}    1/2   0.402   0
(9, 8)     {(9, 2)}    1/2   0.402   0
(9, 2)     {(9, 8)}    1/2   0.402   0
3 Algorithmic Aspects
In this section we describe algorithmic aspects that allow us to answer why
(Definition 1) and why-not (Definition 2) queries for community detection. We
first provide a general framework that is oblivious to the community detection
algorithm being used. Then, within this framework and for reasons of efficiency,
we specialize by focusing on modularity-based algorithms.
3.1 The General Framework
We begin by describing a trivial algorithm-agnostic framework. In fact, this
framework is so general that it can be used as a first step for introducing causality
in different network settings, as argued in Sect. 4. Assume an algorithm A that
divides a given network G = (V, E) into a set of communities C. We pose the
question “why a node v ∈ V belongs in community C ∈ C” (henceforth why
question). For the why-not question the framework works in the same way.
Following Definition 1, we need to identify edges within Ee that are causes
and discover their respective contingency sets Γ as well as the changes implied
by them in the community structure in order to compute the responsibility ρ
and the discrepancy γ. To accomplish this, we first iterate over all subsets c of
Ee to choose possible causes e in increasing size (starting from singletons) and
then we iterate over all subsets of Ee \ {e} to compute Γ. We maintain the top-k
causes with highest ρ. If we are interested in γ as well, we could use either a
weighted mean or maintain the top-k dominating causes with respect to both
metrics. A very crude upper bound for the method is O(2^{2y}) iterations of the
algorithm A, where y = |Ee| is the number of endogenous pairs of nodes.
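The brute-force search can be sketched as follows for the why question. This is a minimal sketch under several assumptions that go beyond the text: only single-edge causes and edge removals are considered, the stand-in for algorithm A simply labels nodes by connected component, membership in C is tested by checking whether v shares a community with a fixed anchor node of C, and responsibility is taken as 1/(1 + |Γ|). All function names are hypothetical.

```python
from itertools import combinations

def connected_components(nodes, edges):
    """Stand-in for algorithm A: label each node by its connected component."""
    adjacency = {u: set() for u in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    label = {}
    for start in nodes:
        if start in label:
            continue
        label[start] = start
        stack = [start]
        while stack:
            u = stack.pop()
            for w in adjacency[u]:
                if w not in label:
                    label[w] = start
                    stack.append(w)
    return label

def stays_with(detect, nodes, edges, v, anchor):
    """True if v and the anchor node share a community under `detect`."""
    labels = detect(nodes, edges)
    return labels[v] == labels[anchor]

def why_causes(detect, nodes, edges, endogenous, v, anchor):
    """Map each causal edge e to (smallest contingency set, responsibility)."""
    edges = set(edges)
    causes = {}
    for e in endogenous:
        others = [x for x in endogenous if x != e]
        # Try contingency sets in increasing size, so the first hit is smallest.
        for size in range(len(others) + 1):
            for gamma in combinations(others, size):
                remaining = edges - set(gamma)
                # Removing Gamma alone must leave v in its community ...
                if not stays_with(detect, nodes, remaining, v, anchor):
                    continue
                # ... while removing Gamma and e together must move v out.
                if not stays_with(detect, nodes, remaining - {e}, v, anchor):
                    causes[e] = (set(gamma), 1 / (1 + size))
                    break
            if e in causes:
                break
    return causes

nodes = [1, 2, 3, 4]
edges = [(1, 2), (2, 4), (1, 3), (3, 4)]
endogenous = [(1, 2), (1, 3)]   # edges incident to the query node v = 1
causes = why_causes(connected_components, nodes, edges, endogenous, v=1, anchor=4)
```

In this toy graph each incident edge of node 1 is a cause with a contingency set of size one and responsibility 1/2. Every candidate pair (e, Γ) costs two runs of the detection algorithm, which is why the number of endogenous pairs has to be kept small.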
Clearly, the time complexity of this framework is prohibitive. To speed
the algorithm up, we can use the Locality Assumption. In this sense, we can
define the endogenous pairs of nodes to be all corresponding edges at a small
distance from v. For example, if we include in the endogenous set the neighbours
of v, then the number of iterations is O(2^{2 deg(v)}), which is considerably
smaller, especially for the sparse networks that are usually seen in practice. However,
even in this case the number of iterations is quite large. We could further
reduce the complexity by having some information about the inner workings of
the algorithm A. In the following, we adopt such an approach by looking at
an algorithm that optimizes modularity. In addition, for the why question we
consider as endogenous pairs of nodes all the neighbours of the query node v.
In this case, Ee could contain all edges in the k-core of the network since these are
the possible causes for node v being in the k-core. Similarly, one can introduce
causality in the minimum cut problem in a weighted network (why does edge
e belong to the cut?). Efficiency issues must be handled in an ad-hoc manner
based on the problem at hand.
In the present work, we have introduced the concept of causal explanations in
community formation and we have proposed a framework for identifying actual
causes. In the future, we will focus on efficient algorithmic techniques as well
as on extensive experimental evaluation for different types of networks (e.g.,
directed, weighted) and different problems (e.g., overlapping communities, k-
core decomposition).
References
1. Blondel, V.D., Guillaume, J.L., Lambiotte, R., Lefebvre, E.: Fast unfolding of com-
munities in large networks. J. Stat. Mech. Theor. Exp. 2008(10), P10008 (2008)
23. Rossetti, G., Cazabet, R.: Community discovery in dynamic networks: a survey.
ACM Comput. Surv. 51(2) (2018)
24. Sharan, R., Ulitsky, I., Shamir, R.: Network-based prediction of protein function.
Mol. Syst. Biol. 3, 88 (2007)
25. Shen, H.W.: Community Structure of Complex Networks. Springer Science & Business
Media (2013)
26. Tripathi, B., Parthasarathy, S., Sinha, H., Raman, K., Ravindran, B.: Adapting
community detection algorithms for disease module identification in heterogeneous
biological networks. Front. Genet. 10, 164 (2019)
27. Wang, Z., Wang, C., Ye, X., Pei, J., Li, B.: Propagation history ranking in social
networks: a causality-based approach. Tsinghua Sci. Tech. 25(2), 161–179 (2020)
28. Wilber, A.W., Doye, J.P., Louis, A.A., Lewis, A.C.: Monodisperse self-assembly in
a model with protein-like interactions. J. Chem. Phys. 131(17), 11B602 (2009)
29. Zachary, W.W.: An information flow model for conflict and fission in small groups.
J. Anthropol. Res. 33(4), 452–473 (1977)
30. Zarayeneh, N., Kalyanaraman, A.: A fast and efficient incremental approach toward
dynamic community detection. In: Proceedings of the ACM/IEEE International
Conference on Advances in Social Networks Analysis and Mining, pp. 9–16 (2019)
A Pledged Community? Using
Community Detection to Analyze
Autocratic Cooperation in UN
Co-sponsorship Networks
1 Motivation
Who cooperates with whom and why in international relations? Most of what we
know about states’ cooperative behavior is based on studies that focus on cooper-
Due to the random nature of surnames, the authors change the author’s order on a
paper basis. The authors would like to thank Brett Ashley Leeds and Nikolay Marinov
for invaluable input on earlier versions of this research as well as the participants at
the CoMeS 2020 conference and three anonymous reviewers. This research has been
supported by the University of Mannheim’s Graduate School of Economic and Social
Sciences funded by the German Research Foundation.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 943, pp. 177–188, 2021.
https://doi.org/10.1007/978-3-030-65347-7_15
178 C. Meyer and D. Hammerschmidt
1 This estimate is based on the Polity IV score [28], which we use to measure the share of
democratic and autocratic states between 1989 and 2017.
A Pledged Community? 179
3 Lead sponsor(s) decide on potential co-sponsors by balancing the required weight
on the draft necessary to get the resolution passed and the policy positions of co-
sponsors [20]. Hence, we do not expect only to find pure autocratic and pure demo-
cratic resolutions but to observe interesting patterns of cooperation across regime
types as well.
4 Method
We use data from [22] to construct networks from states’ UNGA co-sponsorship
behavior between 1979 and 2014. To account for the relative amount that states
co-sponsor in a given year4 we convert the n × n adjacency matrix of annual
co-sponsorships for n states into an agreement matrix [1].5 This results in 36
weighted and directed networks, one for each year in our time period. Table 1
shows an example of one agreement matrix for states’ co-sponsorship behavior
in 1985 where Austria (AUT) co-sponsors 100% of its resolutions with Australia
(AUS), but Australia co-sponsors only 79% of its resolutions with Austria. We
use these values as weights and thus obtain one weighted and directed network
of states’ co-sponsorship behavior for each year in our sample.
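The conversion can be sketched as follows. The snippet assumes that the diagonal of the count matrix holds the total number of resolutions a state co-sponsored in the year and that each row is divided by its diagonal entry; the exact procedure is the one in [1], which is not reproduced here. The counts are made up for illustration and the function name is hypothetical.

```python
def agreement_matrix(counts):
    """Row-normalise joint co-sponsorship counts by each state's own total.

    Assumption: counts[i][i] is the number of resolutions state i
    co-sponsored, counts[i][j] the number it co-sponsored together with j.
    """
    n = len(counts)
    return [
        [counts[i][j] / counts[i][i] if counts[i][i] else 0.0 for j in range(n)]
        for i in range(n)
    ]

states = ["AUS", "AUT"]
# Made-up counts: AUS co-sponsored 24 resolutions, AUT 19, all 19 jointly.
counts = [[24, 19],
          [19, 19]]
agreement = agreement_matrix(counts)
```

With these made-up counts the AUT row gives 19/19 = 100% towards AUS, while the AUS row gives 19/24 ≈ 79% towards AUT. The matrix is therefore asymmetric, which is why the resulting network is directed as well as weighted.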
Table 1. Agreement matrix of co-sponsoring at the UNGA in 1985 (first 5 rows and
columns)
in year t is not necessarily equal to the first community in year t+1. As a result,
we estimate 36 separate random forest models, one for each year, to predict states'
community membership based on the following variables.
We categorize states as democratic or autocratic based on data by Polity
IV [28]. Polity IV is a 21-point scale measure ranging from −10 to +10. We are
conservative with our democracy measure as we code countries as democratic
if they score +6 or higher and as autocratic otherwise. We further expect other
variables to be associated with the clustering of states. First, we expect that
states are more likely to co-sponsor resolutions following their alliance behavior
[26] and membership in the regional groups at the UN [22].6 We further consider
other economic and political factors such as states’ trade behavior, GDP per
capita, population size, the official religion, the Human Development Index, and
post-conflict environments. All these variables are included in the ATOP, the
Quality of Government, and the UCDP data set [25,26,35,38] and used as control
variables in our multiclass random forest model.7
One particular aspect of the UNGA is that the UN has always been used by
states to communicate and advocate their domestic interests [5]. In particular,
(new) states in dire need for (financial) support use(d) the UNGA to seek sup-
port. Post-conflict countries fall into this category [7,8,17], and we can think of
two possible scenarios of how post-conflict countries might behave at the UNGA.
Post-conflict states can either strategically seek similar cooperation partners
or regard countries they perceive as big players as their best choice.8
5 Results
Our results support our argument that regime type is important for interna-
tional cooperation and that autocracies increasingly cooperate with each other.
Figure 1 shows the distribution of democracies and autocracies for each cluster
in the network in a given year. The more balanced the distribution is, i.e., the
closer it is to the horizontal 50% line, the less clearly separated clusters are
based on states’ regime type. We find that throughout all years, we have clear
6 The literature shows that states usually vote in voting blocs that broadly reflect the
regional groups that they belong to [4].
7 All variables are present across the entire observation period except for the official
religion provided by the Bar-Ilan University and the Human Development Index,
which are only available starting in 1990 [35].
8 We determine the beginning of a post-conflict period by the calendar year when
the UCDP/PRIO data mark the respective country as not being in conflict [2, 14].
This event occurs once the threshold of the considered conflict is below 25 battle-
related deaths [16]. If there are multiple overlapping conflict periods in a geographical
country, we combine them into one single conflict period [2, 12].
Due to the limited scope of this paper, we will only discuss selected variables in
the result section but include all variables in our analysis to control for confounding
factors. We represent the overall mean of these features in Fig. 3.
[Figure 1: four panels, one per community (1–4). Vertical axis: relative share of regime type by community (in %), 0–100. Horizontal axis: time frame, 1980–2010.]
Fig. 1. Relative share of regime type (in %) by community 1–4 over time. The graph
shows the distribution of democracies and autocracies in communities 1–4. The num-
bers associated with the communities do not contain any meaningful information on
each cluster’s type, and clusters can be reshuffled for each year. The less balanced the
shares (i.e., the more different from the 50% line), the more homogeneous a community
is. We observe that over time communities are highly homogeneous for our entire time
frame.
Figure 2 plots the network for states’ co-sponsorship at the UNGA in 1985
as an example with information on each state’s regime type and their identified
community cluster using the force-directed layout algorithm by Fruchterman and
Reingold [13]. Interestingly, while democracies compose one large group of states
with the large majority of Western states in it (cluster 3), autocracies appear
to be divided into three sub-groups. This follows the arguments in the litera-
ture. Democracy is perceived as a strong, cohesive factor that groups countries
with similar institutional characteristics together. When it comes to coopera-
tion, autocracies are not a uniform group but consist of different sub-groups
that require a more disaggregated consideration [29]. We observe this in our
network as well. For instance, cluster 4 consists of primarily Middle Eastern
countries – all autocracies with largely similar institutional characteristics. By
contrast, cluster 2 features (ex-)socialist countries such as the German Demo-
cratic Republic, Cuba, or Venezuela alongside predominantly autocratic Latin
American countries.
9 Given that our community detection does not contain any meaningful information
on the type of each cluster, as described above, and that clusters can be reshuffled
for each observation year, we can observe fluctuations in the distribution of regime
types over time. Put differently, our clusters only describe communities with different
regimes but do not meaningfully label each cluster across the entire time period.
Fig. 2. Co-sponsorship network at the UNGA in 1985. The graph shows the co-
sponsorship network at the UNGA in 1985. The vertex shapes represent the regime
type and colors represent communities (1–4). We observe that similar regime types are
clustered in communities.
Our findings for states’ cooperative behavior hold across several years and
networks in our analysis and thereby reiterate previous findings of the heteroge-
neous nature of autocracies [29]. We believe that states’ co-sponsorship networks
at the UNGA provide fruitful insights into autocratic cooperation behaviors that
are difficult to study in other environments and with approaches that are not
based on states’ networked behavior.
To investigate our findings of autocratic cooperation more systematically
across the entire 36 years in our time frame, Fig. 3 plots the variable impor-
tance of our main determinants for co-sponsorship over time.10 These results
are derived from a multiclass random forest classification. For this classification,
we consider only complete cases and filter variables with zero variance, one-hot
encode the dummy variables for regime and region to contain the explanatory
power for our variables of interest, and normalize all non-nominal variables in
our data set.11
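The permutation importance behind Fig. 3 can be sketched in a few lines. This is an illustration rather than the authors' pipeline: the toy classifier stands in for a fitted random forest, the accuracy is computed on the given rows rather than on out-of-bag observations, and the data, the cluster labels, and all names are hypothetical.

```python
import random

def accuracy(predict, rows, labels):
    """Share of rows for which the model predicts the correct label."""
    hits = sum(1 for row, y in zip(rows, labels) if predict(row) == y)
    return hits / len(rows)

def permutation_importance(predict, rows, labels, column, rng):
    """Accuracy drop after shuffling a single feature column."""
    baseline = accuracy(predict, rows, labels)
    shuffled_values = [row[column] for row in rows]
    rng.shuffle(shuffled_values)
    shuffled_rows = [
        row[:column] + [value] + row[column + 1:]
        for row, value in zip(rows, shuffled_values)
    ]
    return baseline - accuracy(predict, shuffled_rows, labels)

# Toy stand-in for a fitted classifier: it only looks at feature 0.
def predict(row):
    return "cluster 1" if row[0] >= 6 else "cluster 2"

# Hypothetical rows: [regime score, post-conflict dummy].
rows = [[8, 1], [7, 0], [-3, 1], [-9, 0], [10, 1], [-6, 0]]
labels = [predict(row) for row in rows]
rng = random.Random(0)
importance_used = permutation_importance(predict, rows, labels, 0, rng)
importance_unused = permutation_importance(predict, rows, labels, 1, rng)
```

Shuffling a feature the model never consults leaves the accuracy unchanged, so its importance is exactly zero, whereas shuffling a feature the model relies on can only lower the accuracy of this toy classifier.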
The results further support our argument and show the importance of regime
type for predicting states’ membership in different co-sponsorship communities
with the dotted line as a reference that indicates the mean importance across all
10 For our variable importance, we use permutation importance, which describes the
difference between the prediction accuracy in the OOB observations and the predic-
tion accuracy after randomly shuffling one single column in our data frame.
11 We normalize the variables to achieve faster convergence of our models.
features.12 Over time, the regime type of states is a consistently strong variable
for the community clusters of states.13 In line with previous research, we further
observe that both alliances and regions play an important role when understand-
ing cluster formations [22,26]. Both features become more important toward the
end of the Cold War and experience decreasing importance during the 1990s
before becoming relevant factors again at the beginning of the 2000s. However,
it is interesting that while alliances experience a revival in the last period of
our sample, regions are consistently decreasing in their variable importance over
time. We suspect that this reflects increasing globalization and the detachment
from regional cooperation partners that can also be observed elsewhere.
Post-conflict periods yield only limited importance. One reason might be that
post-conflict periods tend to be relatively short (on average seven years) [15] and
only occur in a small fraction of our entire sample. Moreover, post-conflict peri-
ods are rather clustered in the Americas and across the African continent. This
finding might further indicate that post-conflict states instead seek to cooperate
across regime types and clusters, potentially in an attempt to attract support
from a wider audience. However, more research is needed to substantiate this
point.
For robustness, we further estimate models with different specifications of
our main independent variable regime type. Instead of a dichotomous measure
of regime type, we use the original Polity IV measure and the Freedom House
Index as alternative measures of the regime type and obtain similar results.14
This gives us confidence that our results hold across different specifications of
our model.
6 Discussion
Based on our analysis, we conclude that regime type is an important factor and
that states’ co-sponsorship and cooperation behavior at the UNGA correlates
with the regime type. We show that both democracies and autocracies are more
likely to co-sponsor resolutions at the UNGA and support previous findings that
the variations in autocracies’ institutional characteristics are associated with dif-
ferent cooperation partners. Moreover, we show that UNGA co-sponsorship net-
works are valuable resources to learn about international cooperation and find
12 The mean includes alliances, states' trade behavior, GDP per capita, the population
size, political regime type as well as dummies for post-conflict periods and regions.
13 In general, our model achieves a mean weighted AUC score of 0.81 across our models
[18]. We can only calculate weighted, multiclass AUC scores for 34 out of 36 models;
the remaining two models, however, achieve binary AUC scores of 0.83 and 0.97 for
the years 1984 and 2003, respectively.
14 Freedom House is often referred to as an alternative measure of regime type.
Polity IV primarily focuses on the constitutional components of the regime. In contrast,
Freedom House emphasizes civil liberties and political rights, which makes the
Freedom House Index more suitable for specific regions, e.g., sub-Saharan Africa
[11, 30].
[Figure 3: line plot. Vertical axis: variable importance, 0.00–0.08. Series: regime type, alliance, post-conflict country, region (avg.), and mean of VIP per year.]
Fig. 3. Variable importance over time. The graph shows the importance of a subset
of variables from the random forest model across the entire time frame. The focus is
on the variable importance score of regime type, alliance, post-conflict, and an average
across all five regions in the UNGA. For reference, the mean variable importance score
for all features in a given year is included (dotted line). We observe that regime type
is a constant and important variable for predicting the community clusters identified
using Leiden across all years.
clear clusters of cooperative behavior that are in line with historical develop-
ments and insights from previous work on international relations. In particular,
our findings regarding the spikes for alliances and regional patterns support pre-
vious work on international cooperation and allow for further studies on these
developments. For instance, the development of cooperation patterns following
the end of the Cold War and during the post-9/11 period can be observed in
our results. More detailed analyses are needed here to disentangle the role of
these aspects, also concerning states’ regime types. One potential extension of
our study might be to analyze the impact of regime transitions on states’ coop-
erative behavior in more detail. In other words, do recently autocratized states
cooperate more with fellow autocracies, or do they carry over their cooperative
behavior with democracies? These and other related questions become more
important, given a third wave of autocratization that we might be currently
observing [27].
Our paper shows the importance of autocratic cooperation in the field of
International Relations. In particular, we highlight the possibilities that co-sponsoring
networks at the UNGA offer to further study states' cooperation
and behavior on the international stage.
Replication Material
References
1. Alemán, E., Calvo, E., Jones, M.P., Kaplan, N.: Comparing cosponsorship and
roll-call ideal points. Legislat. Stud. Q. 34(1), 87–116 (2009)
2. Appel, B.J., Loyle, C.E.: The economic benefits of justice: post-conflict justice and
foreign direct investment. J. Peace Res. 49(5), 685–699 (2012)
3. Bailey, M.A., Voeten, E.: A two-dimensional analysis of seventy years of United
Nations voting. Public Choice 176(1–2), 33–55 (2018)
4. Ball, M.M.: Bloc voting in the general assembly. Int. Org. 5(1), 3–31 (1951)
5. Baturo, A., Dasandi, N., Mikhaylov, S.J.: Understanding state preferences with
text as data: introducing the UN General Debate corpus. Res. Polit. 4(2), 1–9
(2017)
6. Carter, D.B., Stone, R.W.: Democracy and multilateralism: the case of vote buying
in the UN general assembly. Int. Org. 69, 1–33 (2016)
7. Collier, P.: The Bottom Billion. Oxford University Press, Oxford (2008)
8. Collier, P., Hoeffler, A., Söderbom, M.: Post-conflict risks. J. Peace Res. 45(4),
461–478 (2008)
9. Crescenzi, M.J., Kathman, J.D., Kleinberg, K.B., Wood, R.M.: Reliability, repu-
tation, and alliance formation. Int. Stud. Quart. 56, 259–274 (2012)
10. Dreher, A., Vreeland, J.R.: Buying votes and international organizations. cege
Center for European, Governance and Economic Development Research, vol. 123,
pp. 1–38 (2011)
11. Erdmann, G.: Demokratie in Afrika. GIGA Focus Afr. 10, 1–8 (2007)
12. Flores, T.E., Nooruddin, I.: Democracy under the gun: understanding postconflict
economic recovery. J. Peace Res. 53(1), 3–29 (2009)
13. Fruchterman, T.M., Reingold, E.M.: Graph drawing by force-directed placement.
Softw. Pract. Exp. 21(11), 1129–1164 (1991)
14. Garriga, A.C., Phillips, B.J.: Foreign aid as a signal to investors: predicting FDI
in post-conflict countries. J. Conflict Resolut. 58(2), 280–306 (2014)
15. Gates, S., Nygård, H.M., Trappeniers, E.: Conflict recurrence. Conflict Trends 2,
1–4 (2016)
16. Gleditsch, N.P., Wallensteen, P., Eriksson, M., Sollenberg, M., Strand, H.: Armed
conflict 1946–2001: a new dataset. J. Peace Res. 39(5), 615–637 (2002)
17. Hammerschmidt, D., Meyer, C.: Money makes the world go frowned. Analyzing
the impact of Chinese foreign aid on states’ sentiment using natural language
processing. Working Paper (2020)
18. Hand, D.J., Till, R.J.: A simple generalisation of the area under the ROC curve
for multiple class classification problems. Mach. Learn. 45(2), 171–186 (2001)
19. Jackson, M.O., Nei, S.: Networks of military alliances, wars, and international
trade. Proc. Natl. Acad. Sci. 112(50), 15277–15284 (2015)
20. Jacobsen, K.: Sponsorships in the united nations: a system analysis. J. Peace Res.
6(3), 235–256 (1969)
21. Keohane, R.O.: After Hegemony: Cooperation and Discord in the World Political
Economy. Princeton University Press, Princeton (2005)
22. Lee, E., Stek, P.E.: Shifting alliances in international organizations: a social net-
works analysis of co-sponsorship of UN GA resolutions, 1976–2012. J. Contemp.
Eastern Asia 15(2), (2016)
23. Leeds, B.A.: Domestic political institutions, credible commitments, and interna-
tional cooperation. Am. J. Polit. Sci. 43(4), 979–1002 (1999)
24. Leeds, B.A.: Alliance reliability in times of war: explaining state decisions to violate
treaties. Int. Org. 57(4), 801–827 (2003)
25. Leeds, B.A.: Alliance treaty obligations and provisions (ATOP) codebook
(2018). http://www.atopdata.org/uploads/6/9/1/3/69134503/atopcodebookv4.pdf.
Accessed 04 Apr 2020
26. Leeds, B.A., Ritter, J., Mitchell, S., Long, A.: Alliance treaty obligations and pro-
visions, 1815–1944. Int. Interact. 28(3), 237–260 (2002)
27. Lührmann, A., Lindberg, S.I.: A third wave of autocratization is here: what is new
about it? Democratization 26(7), 1095–1113 (2019)
28. Marshall, M.G., Gurr, T.R., Davenport, C., Jaggers, K.: Polity iv, 1800–1999:
comments on Munck and Verkuilen. Comp. Polit. Stud. 35(1), 40–45 (2002)
29. Mattes, M., Rodriguez, M.: Autocracies and international cooperation. Int. Stud.
Quart. 58(3), 527–538 (2014)
30. Moss, T.J.: African development: making sense of the issues and actors. Lynne
Rienner Publishers Boulder, CO (2007)
31. Panke, D.: The institutional design of the united nations general assembly: an
effective equalizer? Int. Relat. 31(1), 3–20 (2017)
32. Reichardt, J., Bornholdt, S.: Detecting fuzzy community structures in complex
networks with a potts model. Phys. Rev. Lett. 93(21), 218701 (2004)
33. Reichardt, J., Bornholdt, S.: Statistical mechanics of community detection. Phys.
Rev. E 74(1), 016110 (2006)
34. von Soest, C.: Democracy prevention: the international collaboration of authori-
tarian regimes. Eur. J. Polit. Res. 54(4), 623–638 (2015)
35. Teorell, J., Dahlberg, S., Holmberg, S., Rothstein, B., Hartman, F., Svensson, R.:
The Quality of Government Standard Dataset, version Jan15 (2015). http://www.
qog.pol.gu.se. Accessed 05 June 2015
36. Traag, V.A., Waltman, L., van Eck, N.J.: From Louvain to Leiden: guaranteeing
well-connected communities. Sci. Rep. 9(1), 1–12 (2019)
37. Voeten, E.: Data and analyses of voting in the UN General Assembly (2012).
http://papers.ssrn.com/abstract=2111149
38. Wallensteen, P., Sollenberg, M., Eriksson, M., Harbom, L., Buhaug, H., Rød,
J.K.: Armed conflict dataset codebook. version 3.0 (2004). https://www.prio.org/
Global/upload/CSCW/Data/UCDP/v3/codebook v3 0.pdf
Distances on a Graph
1 Introduction
When clustering graphs, we seek to group nodes into clusters of nodes that are
similar to each other. We posit that similarity is reflected in the number of shared
connections. Our node-to-node distances are based on this shared connectivity.
Although a formal definition of vertex clusters (communities) remains a topic of
debate, virtually all authors agree a cluster is a subset of vertices that exhibit a
high level of interconnection between themselves and a low level of connection
to vertices in the rest of the graph [7,21–23] (we quote these authors, but their
definition is very common across the literature). Consequently, clusters, subsets
of strongly inter-connected vertices, also form dense induced subgraphs.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 943, pp. 189–199, 2021.
https://doi.org/10.1007/978-3-030-65347-7_16
We compare three different distance measurements from the literature and
examine how faithfully they reflect connectivity patterns. We argue that the mean
node-to-node distance within a cluster should accurately reflect intra-cluster
density while moving in the opposite direction: densely connected clusters
should display low mean node-to-node distances.
Intra-cluster density is defined as
$$K_{\mathrm{intra}}^{(k)} = \frac{|E_{kk}|}{0.5 \times n_k \times (n_k - 1)} .$$
In this definition, |Ekk | is the cardinality of the set of edges connecting two
vertices within the same cluster ‘k’ and nk = |Vk | is the number of vertices
in that same cluster. This ratio also represents the empirical estimate of the
probability two nodes within a cluster are connected by an edge.
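As an illustration of the definition above, the following minimal sketch computes intra-cluster density for one cluster. It assumes an undirected simple graph stored as a dictionary mapping each vertex to the set of its neighbours; the function name and the toy graph are hypothetical and are not taken from the authors' code.

```python
from itertools import combinations


def intra_cluster_density(adj, cluster):
    """Return |E_kk| / (0.5 * n_k * (n_k - 1)) for the vertices in `cluster`.

    `adj` maps each vertex to the set of its neighbours (assumed layout).
    """
    members = list(cluster)
    n_k = len(members)
    if n_k < 2:
        raise ValueError("density is undefined for fewer than two vertices")
    # Count edges whose two endpoints both lie inside the cluster.
    internal_edges = sum(1 for u, v in combinations(members, 2) if v in adj[u])
    return internal_edges / (0.5 * n_k * (n_k - 1))


# Hypothetical toy graph: a triangle {0, 1, 2} plus a pendant vertex 3.
toy_adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
triangle_density = intra_cluster_density(toy_adj, {0, 1, 2})
```

The triangle contains every possible internal edge, so its density is the maximum value of the ratio.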
We then examine the relationship between mean Jaccard [12], Otsuka-Ochiai
[19] and Burt’s distances [3,7], on one hand, and intra-cluster density within
each cluster, on the other. Because these distances are pairwise measures, we
compare their mean value for a given cluster to the cluster’s internal density.
We begin by calling the reader’s attention to the fact this article is not about
graph embedding. Here, we are not interested in a vector representation of nodes.
We are only interested in the distance separating them.
We also call the reader’s attention to the fact the distance measures under
consideration can all be obtained using simple arithmetic. It is precisely for this
reason that we did not consider the popular “commute distance” and its correc-
tions, like “amplified commute distance” [15,16,20], in this work. While these
distances are known to capture cluster structure, they require matrix inversion
and are very costly to compute [4]. Although some authors have found efficient
approximations that circumvent the need for matrix inversion (e.g., [15]), the
distances under consideration in this article are exact quantities. Exactness of
the distances is a desirable feature, given our ultimate goal to use them to esti-
mate intra-cluster density. Additionally, unlike some of the approximations in
the literature, our distances have simple and intuitive interpretations.
The Jaccard distance separating two vertices ‘i’ and ‘j’ is defined as
$$\zeta_{ij} = 1 - \frac{|c_i \cap c_j|}{|c_i \cup c_j|} \in [0, 1] .$$
Here, ci (cj ) represents the set of all vertices with which vertex ‘i (j)’ shares an
edge.
At the cluster level, we compute the mean distance separating all pairs of
vertices within the cluster, which we denote as J. For an arbitrary cluster ‘k’
with nk vertices, we have
$$J_k = \frac{1}{0.5 \times n_k \times (n_k - 1)} \sum_{i,\, j=i+1} \zeta_{ij} .$$
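The Jaccard distance and its cluster-level mean can be sketched as follows. This is only an illustration under stated assumptions: the graph is a dictionary of neighbour sets, the helper names are hypothetical, and returning 1 for two isolated vertices (empty union) is a convention chosen here, not one stated in the text.

```python
from itertools import combinations


def jaccard_distance(adj, i, j):
    """1 - |c_i ∩ c_j| / |c_i ∪ c_j| for the neighbour sets of i and j."""
    c_i, c_j = adj[i], adj[j]
    union = c_i | c_j
    if not union:
        return 1.0  # assumption: two isolated vertices are maximally distant
    return 1.0 - len(c_i & c_j) / len(union)


def mean_cluster_distance(adj, cluster, distance):
    """Mean of `distance` over all unordered vertex pairs in `cluster`."""
    pairs = list(combinations(sorted(cluster), 2))
    if not pairs:
        raise ValueError("a cluster needs at least two vertices")
    return sum(distance(adj, i, j) for i, j in pairs) / len(pairs)


# Hypothetical toy graph: a triangle {0, 1, 2} plus a pendant vertex 3.
toy_adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
mean_jaccard = mean_cluster_distance(toy_adj, {0, 1, 2}, jaccard_distance)
```

The same `mean_cluster_distance` helper can be reused with any other pairwise distance that takes the same three arguments.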
The Otsuka-Ochiai (OtOc) distance separating two vertices ‘i’ and ‘j’ is defined
as
$$o_{ij} = 1 - \frac{|c_i \cap c_j|}{\sqrt{|c_i| \times |c_j|}} \in [0, 1] .$$
Here too, we obtain a cluster level measure of similarity by taking the mean over
each pair of nodes within a cluster. We denote this mean as O. Again, for an
arbitrary cluster ‘k’ with nk vertices, we have
$$O_k = \frac{1}{0.5 \times n_k \times (n_k - 1)} \sum_{i,\, j=i+1} o_{ij} .$$
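A minimal sketch of the OtOc distance is given below, using the standard Otsuka-Ochiai normalisation by the square root of the product of the two neighbourhood sizes. The dictionary-of-sets graph layout, the function name and the handling of isolated vertices are assumptions made for illustration.

```python
import math


def otsuka_ochiai_distance(adj, i, j):
    """1 - |c_i ∩ c_j| / sqrt(|c_i| * |c_j|) for the neighbour sets of i and j."""
    c_i, c_j = adj[i], adj[j]
    if not c_i or not c_j:
        return 1.0  # assumption: an isolated vertex shares nothing with anyone
    return 1.0 - len(c_i & c_j) / math.sqrt(len(c_i) * len(c_j))


# Hypothetical toy graph: a triangle {0, 1, 2} plus a pendant vertex 3.
toy_adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
d_01 = otsuka_ochiai_distance(toy_adj, 0, 1)
```

Because the geometric mean of the two set sizes never exceeds the size of their union, this distance is never larger than the Jaccard distance for the same pair.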
Burt’s distance between two vertices ‘i’ and ‘j’, denoted as bij , is computed using
the adjacency matrix (A) as
$$b_{ij} = \sum_{k \neq i,j} (A_{ik} - A_{jk})^2 .$$
At the cluster level, we denote the mean Burt distance as B. As with the
other distances, for an arbitrary cluster ‘k’ with nk vertices, it is computed as
$$B_k = \frac{1}{0.5 \times n_k \times (n_k - 1)} \sum_{i,\, j=i+1} b_{ij} .$$
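For an unweighted graph, each term (A_ik − A_jk)² equals 1 exactly when k is a neighbour of one of the two vertices but not the other, so Burt's distance reduces to the size of a symmetric difference. The sketch below relies on that observation; the binary adjacency assumption, the dictionary-of-sets layout and the function name are ours.

```python
def burt_distance(adj, i, j):
    """Sum over k != i, j of (A_ik - A_jk)^2 for a binary adjacency matrix.

    With 0/1 entries, each squared difference is 1 when k neighbours exactly
    one of i and j, so the sum is the size of the symmetric difference of the
    two neighbour sets once i and j themselves are excluded.
    """
    return len((adj[i] ^ adj[j]) - {i, j})


# Hypothetical toy graph: a triangle {0, 1, 2} plus a pendant vertex 3.
toy_adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
b_23 = burt_distance(toy_adj, 2, 3)
```

Unlike the two previous distances, this quantity is an integer count with no upper bound of 1.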
4 Numerical Comparisons
To compare the distance measures and assess the accuracy of each measure as
a reflection of intra-cluster density, we generate synthetic graphs with known
cluster membership, using the NetworkX library’s [11] stochastic block model
generator. In our experiments, we generate several graphs with varying graph
and cluster sizes and inter and intra-cluster edge probabilities. To ensure ease of
readability, we only include a subset of our most revealing results.
For each test graph in the experiments below, we compute our three vertex-to-
vertex distances. We then compute mean distances between nodes in each cluster
and intra-cluster density. To obtain a graph-wide assessment, we then take the
mean of all cluster quantities over the entire graph. Because our clusters vary in
size, we ensure the well-documented “resolution limit” degeneracy [8] does not
affect our conclusions by taking simple unweighted means, regardless of cluster
sizes.
We use the stochastic block model to generate two sets of six graphs, as described
in Table 1. In the first set of experiments, we vary the probability of an intra-
cluster edge, an edge with both ends inside the cluster. To generate noise, we
vary the size (nk ) and number of clusters (K) and as a result the total number
of nodes (|V |). For added noise, we also set inter-cluster edge probability to 0.15.
Details are shown in Table 1.
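The experiments rely on NetworkX's stochastic block model generator. As a dependency-free stand-in, the sketch below samples a graph of the same kind with the standard library only. The cluster sizes, the seed and the function name are hypothetical; only the inter-cluster probability of 0.15 comes from the text.

```python
import random
from itertools import combinations


def sample_sbm(sizes, p_intra, p_inter, seed=None):
    """Sample an undirected stochastic block model graph.

    Returns (adj, membership): neighbour sets and the cluster index of each node.
    """
    rng = random.Random(seed)
    membership = {}
    node = 0
    for k, size in enumerate(sizes):
        for _ in range(size):
            membership[node] = k
            node += 1
    adj = {v: set() for v in membership}
    for u, v in combinations(membership, 2):
        # Same cluster: intra-cluster probability; otherwise inter-cluster.
        p = p_intra if membership[u] == membership[v] else p_inter
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj, membership


# Hypothetical configuration: three clusters of unequal size, P_intra = 0.8.
adj, membership = sample_sbm([30, 40, 50], p_intra=0.8, p_inter=0.15, seed=1)
```

Sweeping `p_intra` over 0, 0.2, ..., 1 while holding `p_inter` at 0.15 reproduces the structure of the first set of experiments, though not the authors' exact graphs.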
P intra       0        0.2      0.4      0.6      0.8      1
Jacc
  Mean        0.919    0.918    0.912    0.898    0.879    0.841
  Stdev       0.000    0.000    0.002    0.005    0.012    0.020
  +1 stdev    0.919    0.919    0.914    0.903    0.891    0.862
  −1 stdev    0.919    0.918    0.910    0.893    0.867    0.821
OtOc
  Mean        0.850    0.849    0.838    0.815    0.784    0.727
  Stdev       0.001    0.001    0.003    0.008    0.019    0.030
  +1 stdev    0.851    0.849    0.842    0.823    0.803    0.757
  −1 stdev    0.849    0.848    0.835    0.807    0.765    0.697
Burt
  Mean        30.325   37.732   37.355   33.499   34.708   30.059
  Stdev       0.125    0.062    0.101    0.095    0.055    0.138
  +1 stdev    30.450   37.794   37.455   33.594   34.763   30.197
  −1 stdev    30.200   37.671   37.254   33.404   34.653   29.921
K intra
  Mean        0.000    0.199    0.400    0.601    0.800    1.000
  Stdev       0.000    0.005    0.007    0.008    0.006    0.000
  +1 stdev    0.000    0.204    0.407    0.609    0.806    1.000
  −1 stdev    0.000    0.195    0.393    0.593    0.794    1.000
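$$P_i \to 0 \;\Rightarrow\; \zeta_{ij} \to 1 - \frac{P_o^2 \times (N - n_k)}{P_o \times (N - n_k)}$$
$$P_i \to 1 \;\Rightarrow\; \zeta_{ij} \to 1 - \frac{(n_k - 2) + P_o^2 \times (N - n_k)}{(n_k - 2) + P_o \times (N - n_k)} .$$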
The main observation here is that while the actual Jaccard distance depends
on the number of nodes in each cluster and the total number of nodes on the
graph, its variation remains in step with intra-cluster edge probability and intra-
cluster density. It is this dependence on the number of nodes in each cluster and
the total number of nodes on the graph that is the main source of additional
variance observed in Table 2 and which is mitigated by keeping cluster sizes
constant across graphs in the second set of experiments shown in Table 3. A
similar argument can be made in the case of OtOc.
P intra       0        0.2      0.4      0.6      0.8      1
Jacc
  Mean        0.919    0.918    0.913    0.903    0.888    0.868
  Stdev       0.000    0.001    0.001    0.003    0.006    0.010
  +1 stdev    0.919    0.919    0.914    0.906    0.894    0.878
  −1 stdev    0.919    0.918    0.911    0.899    0.882    0.858
OtOc
  Mean        0.850    0.849    0.840    0.823    0.798    0.767
  Stdev       0.001    0.001    0.002    0.005    0.010    0.015
  +1 stdev    0.851    0.850    0.842    0.828    0.808    0.783
  −1 stdev    0.849    0.848    0.837    0.817    0.788    0.752
Burt
  Mean        29.190   29.496   29.650   29.641   29.485   29.205
  Stdev       0.068    0.073    0.071    0.076    0.061    0.103
  +1 stdev    29.258   29.569   29.721   29.717   29.546   29.308
  −1 stdev    29.122   29.422   29.579   29.566   29.423   29.102
K intra
  Mean        0.000    0.200    0.402    0.599    0.800    1.000
  Stdev       0.000    0.011    0.015    0.011    0.011    0.000
  +1 stdev    0.000    0.211    0.417    0.610    0.811    1.000
  −1 stdev    0.000    0.189    0.387    0.588    0.788    1.000
Burt’s Distance
$$b_{ij} = \sum_{k \neq i,j} (A_{ik} - A_{jk})^2 \approx 2 \times P_i (1 - P_i) \times (n_k - 2) + 2 \times P_o (1 - P_o) \times (N - n_k)$$
On the other hand, the asymptotic behavior of Burt’s distance explains why
it is a poor reflection of intra-cluster density. We see that as Pi moves toward
either extreme, Burt’s distance moves toward the same quantity. It should also
be noted that it is unbounded and grows with the number of nodes on the graph.
In fact, as the total number of nodes increases in proportion to cluster size, the
intra-cluster portion is minimized, since (nk − 2) ≪ (N − nk ).
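The symmetry described above can be checked numerically with a small sketch of the approximation 2 × Pi(1 − Pi) × (nk − 2) + 2 × Po(1 − Po) × (N − nk). The function name and the cluster and graph sizes are hypothetical; the inter-cluster probability of 0.15 is the value used in the experiments.

```python
def expected_burt_distance(p_intra, p_inter, n_k, n_total):
    """Approximate Burt distance between two vertices of the same cluster."""
    intra_term = 2 * p_intra * (1 - p_intra) * (n_k - 2)
    inter_term = 2 * p_inter * (1 - p_inter) * (n_total - n_k)
    return intra_term + inter_term


# Hypothetical sizes: a cluster of 40 vertices in a graph of 120 vertices.
at_zero = expected_burt_distance(0.0, 0.15, n_k=40, n_total=120)
at_one = expected_burt_distance(1.0, 0.15, n_k=40, n_total=120)
```

Since Pi(1 − Pi) is unchanged when Pi is replaced by 1 − Pi, the approximation takes the same value at Pi = 0 and Pi = 1, which is why the distance cannot distinguish an empty cluster from a complete one.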
7 Conclusion
We show that Jaccard and Otsuka-Ochiai distances, when averaged over clusters,
very accurately follow the evolution of intra-cluster density. Both are
shown to vary in the opposite direction to intra-cluster density. This variation
has been observed to be robust to noise from inter-cluster edge probability and
variations in cluster sizes. Finally, we also show that Jaccard distance displays
lower variance than Otsuka-Ochiai distance.
Our future work will focus on a study of these distances on weighted graphs.
We also intend to conduct empirical comparisons to commute and amplified
commute distances. We are interested in studying the statistical properties of all
these distances when averaged over clusters.
Acknowledgements
– The work of Alexander Ponomarenko was conducted within the framework of the
Basic Research Program at the National Research University Higher School of Eco-
nomics (HSE).
– The authors wish to thank the organizers of the 10th International Conference on
Network Analysis at the Laboratory of Algorithms and Technologies for Networks
Analysis in Nizhny Novgorod.
References
1. Aramon, M., Rosenberg, G., Valiante, E., Miyazawa, T., Tamura, H., Katzgraber,
H.: Physics-inspired optimization for quadratic unconstrained problems using a
digital annealer. Front. Phys. 7, 48 (2019). https://doi.org/10.3389/fphy.2019.00048
2. Bauckhage, C., Piatkowski, N., Sifa, R., Hecker, D., Wrobel, S.: A QUBO
formulation of the k-medoids problem. In: Jäschke, R., Weidlich, M. (eds.) Proceedings
of the Conference on Lernen, Wissen, Daten, Analysen, CEUR Workshop
Proceedings, Berlin, Germany, 30 September–2 October 2019, vol. 2454, pp. 54–63.
CEUR-WS.org (2019). http://ceur-ws.org/Vol-2454/paper_39.pdf
3. Burt, R.: Positions in networks. Soc. Forces 55(1), 93–122 (1976)
4. Camby, E., Caporossi, G.: The extended Jaccard distance in complex networks.
Les Cahiers du GERAD G-2017-77 (September 2017)
5. Fan, N., Pardalos, P.: Linear and quadratic programming approaches for the
general graph partitioning problem. J. Glob. Optim. 48(1), 57–71 (2010).
https://doi.org/10.1007/s10898-009-9520-1
6. Fan, N., Pardalos, P.: Robust optimization of graph partitioning and critical node
detection in analyzing networks. In: Proceedings of the 4th International
Conference on Combinatorial Optimization and Applications - Volume Part I, COCOA
2010, pp. 170–183. Springer, Heidelberg (2010).
http://dl.acm.org/citation.cfm?id=1940390.1940405
7. Fortunato, S.: Community detection in graphs. Phys. Rep. 486, 75–174 (2010)
8. Fortunato, S., Barthélemy, M.: Resolution limit in community detection. Proc.
Natl. Acad. Sci. 104(1), 36–41 (2007).
http://www.pnas.org/content/104/1/36.abstract
9. Fu, Y., Anderson, P.: Application of statistical mechanics to NP-complete problems
in combinatorial optimisation. J. Phys. A Math. Gen. 19(9), 1605–1620 (1986)
10. Glover, F., Kochenberger, G., Du, Y.: A Tutorial on Formulating and Using QUBO
Models. arXiv e-prints arXiv:1811.11538 (June 2018)
11. Hagberg, A., Schult, D., Swart, P.: Exploring network structure, dynamics, and
function using NetworkX. In: Varoquaux, G., Vaught, T., Millman, J. (eds.)
Proceedings of the 7th Python in Science Conference, Pasadena, CA, USA, pp. 11–15
(2008)
12. Jaccard, P.: Étude de la distribution florale dans une portion des Alpes et du Jura.
Bulletin de la Société Vaudoise des Sciences Naturelles 37, 547–579 (1901)
13. Levandowsky, M., Winter, D.: Distance between sets. Nature 234 (1971)
14. Lucas, A.: Ising formulations of many NP problems. Front. Phys. 2, 5 (2014)
15. von Luxburg, U., Radl, A., Hein, M.: Getting lost in space: large sample analysis
of the resistance distance. In: Lafferty, J.D., Williams, C.K.I., Shawe-Taylor, J.,
Zemel, R.S., Culotta, A. (eds.) Advances in Neural Information Processing
Systems, vol. 23, pp. 2622–2630. Curran Associates, Inc. (2010).
http://papers.nips.cc/paper/3891-getting-lost-in-space-large-sample-analysis-of-the-resistance-distance.pdf
16. von Luxburg, U., Radl, A., Hein, M.: Hitting and commute times in large random
neighborhood graphs. J. Mach. Learn. Res. 15(52), 1751–1798 (2014).
http://jmlr.org/papers/v15/vonluxburg14a.html
17. Miasnikof, P., Shestopaloff, A., Bonner, A., Lawryshyn, Y.: A statistical perfor-
mance analysis of graph clustering algorithms, Chap. 11. Lecture Notes in Com-
puter Science. Springer Nature (June 2018)
18. Miasnikof, P., Shestopaloff, A., Bonner, A., Lawryshyn, Y., Pardalos, P.: A
density-based statistical analysis of graph clustering algorithm performance. J. Complex
Netw. 8(3), cnaa012 (2020). https://doi.org/10.1093/comnet/cnaa012
19. Ochiai, A.: Zoogeographical studies on the Soleoid fishes found in Japan and its
neighbouring regions-i. Nippon Suisan Gakkaishi 22(9), 522–525 (1957)
20. Ponomarenko, A., Pitsoulis, L.S., Shamshetdinov, M.: Overlapping community
detection in networks based on link partitioning and partitioning around medoids.
CoRR abs/1907.08731 (2019). http://arxiv.org/abs/1907.08731
21. Prokhorenkova, L.O., Pralat, P., Raigorodskii, A.: Modularity of complex networks
models. In: Bonato, A., Graham, F., Pralat, P. (eds.) Algorithms and Models for
the Web Graph, pp. 115–126. Springer, Cham (2016)
22. Prokhorenkova, L.O., Pralat, P., Raigorodskii, A.: Modularity in several random
graph models. Electron. Notes Discrete Math. 61, 947–953 (2017).
http://www.sciencedirect.com/science/article/pii/S1571065317302238. The European
Conference on Combinatorics, Graph Theory and Applications (EUROCOMB 2017)
23. Schaeffer, S.: Survey: graph clustering. Comput. Sci. Rev. 1(1), 27–64 (2007).
https://doi.org/10.1016/j.cosrev.2007.05.001
Local Community Detection Algorithm
with Self-defining Source Nodes
1 Introduction
Complex networks exhibit modular structures, namely communities, which are
directly related to important functional and topological properties in various
fields. They can, for example, represent modules of proteins with similar func-
tionality in a protein interaction network [17], or affect dynamic processes of a
network such as opinion and epidemic spreading [18]. Despite the various insights
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 943, pp. 200–210, 2021.
https://doi.org/10.1007/978-3-030-65347-7_17
and applications communities represent, they are all referred to as a densely con-
nected set of nodes with relatively sparse links to the rest of the network. This
simple definition, however, has raised great interest in discovering communities
in complex networks. Numerous solutions have been proposed ever since. While
most conventional algorithms are rooted in a top-down view that requires
global information about the entire network [11,20], others reduce the problem
to a local level, using only an available part of the network [1,6] to find the
local communities of one or more given nodes. The existing local community
detection algorithms in the literature are mostly designed to first identify a
set of source nodes to initialize the community detection [2,5,8,15] and then
use a local community modularity to expand the communities [6,13,16]. These
methods raise the following challenges: i) the optimal result depends heavily on
the source node selection [5]; ii) the main goal is to discover the local
communities of a given set of nodes rather than all communities of a network;
iii) the approaches mostly operate at a relaxed level of locality, i.e.,
local-context, the fourth level of locality [19], exploiting the information of
a part of the network in the community detection process; iv) even though they
maintain a level of locality while running the algorithm, they cannot cope with
changes in the network, which are common in real-world complex networks.
Taking the above-mentioned considerations into account, we propose a com-
munity detection approach that has two main properties: First, it is operating
solely based on a node and its local neighbours at a time, thus, it can belong
to the local-bounded category, introduced by Stein et al. [19], which is one level
more restrictive compared to most of the state-of-the-art approaches. Secondly,
it does not depend on any auxiliary process of source node selection. Instead, it
is exploiting a self-defining source node that can adapt based on the local neigh-
bourhood knowledge. Our algorithm progressively iterates over the discovered
part of the network allowing each node to decide on joining one of the neighbour
communities or even create a new community. We define a community influence
degree employing topological measures [10] to identify the community influence
of each node. This metric is used to guarantee a hierarchical community
structure centred on high-degree nodes. We then apply a local modularity
measure to label each node’s community. This way, our algorithm addresses the
challenges raised by the previous algorithms by proposing a local approach based
on a self-defining source node.
The remainder of this paper is organized as follows. In Sect. 2 we review some
of the state-of-the-art in local community detection. Section 3 defines the nota-
tions and concepts that are used in the rest of the paper. In Sect. 4 we present
our proposed algorithm in detail. Next, in Sect. 5 we evaluate our local commu-
nity detection by using artificial and real datasets. Finally, Sect. 6 summarizes
and concludes the paper.
2 Related Work
Many local-context community detection algorithms are founded on the
assumption that global knowledge of the network is not available, there-
close to the border of the community. We aim to find all communities of a network
by allowing each node to adjust its community label given its local neighbours,
Γ (v), and their properties at a time. We exploit a set of measures adopted from
the network structure to assure that each node belongs to a community at the
end of the execution time.
community and nodes with a lower degree to the borders while maximizing the
local modularity defined in Eq. 2. To forge a hierarchical structure, we adjust
the hop-distance hl, and in the meantime, we update each node’s community
influence degree λ(v) as defined in Eq. 1. The metric is considered as a level of
attraction to encourage a node towards a community. On the other hand, to
extend communities or to prevent emerging large communities we initially filter
communities by measuring the local modularity from Eq. 2.
Algorithm description. The general structure of the proposed local approach
to detect communities of a network is described in Algorithm 1. To extend the
communities, we define a set of principles that are explained in Algorithm 2. The
procedure starts by initializing the node list R (line 1), which records visited
nodes and their neighbours. When a node is visited for the first time, its
community label cl and hierarchy level hl are initialized to its node ID and
a constant value HL, respectively (lines 2–3). We chose HL to be 4 initially;
however, it can be any value larger than 1. The next step is to adjust the node’s
hl value, which is reduced if the node has the highest degree among its
neighbours (lines 6–7). Afterwards, the community influence degree λ(v) and the
local modularity μ(v) are calculated (lines 9–10). To update both hl and cl of v,
we pass the node through the principles defined in Algorithm 2 (line 11).
In addition, the list R is updated with the neighbours of node v. Finally, if all
nodes come to an agreement such that no further changes occur, the algorithm
converges and stops. Extracting the cl of all nodes in R yields
all communities of G. A set of principles is defined in Algorithm 2 to decide the
community of node v. First, the most common community label (mc) is chosen
and the local modularity is calculated. If μ(v) is positive, v takes the
same label as mc. Then, v adjusts its hierarchy level by taking the minimum
hl of that community and increasing it by one unit. Otherwise, if
μ(v) is negative or zero, either v itself is selected by the neighbours to
be a new community, or it temporarily follows the best candidate among its
neighbourhood.
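The steps described above can be sketched roughly as follows. This is not the authors' Algorithm 1 or 2: the local modularity is passed in as a precomputed number because its definition (Eq. 2) is not reproduced here, the decrement of one unit for the highest-degree node is an assumption, the minimum hl is taken over neighbours only, and the branch for non-positive modularity is deliberately left as a no-op. All function names are hypothetical.

```python
from collections import Counter

HL = 4  # initial hierarchy level from the text; any value larger than 1 works


def init_node(state, v):
    """First visit: community label cl = node ID, hierarchy level hl = HL."""
    state.setdefault(v, {"cl": v, "hl": HL})


def adjust_hierarchy(state, adj, v):
    """Reduce hl of v when its degree is strictly the highest among its neighbours.

    ASSUMPTION: the reduction is one unit and never goes below 1.
    """
    if adj[v] and all(len(adj[v]) > len(adj[u]) for u in adj[v]):
        state[v]["hl"] = max(1, state[v]["hl"] - 1)


def apply_label_principle(state, adj, v, mu_v):
    """If mu_v > 0, adopt the most common neighbour label (mc).

    The new hl is the minimum hl among neighbours carrying mc, plus one
    (ASSUMPTION: only neighbours are inspected, to stay local).
    For mu_v <= 0 the node is left unchanged in this sketch.
    """
    labels = Counter(state[u]["cl"] for u in adj[v])
    if mu_v > 0 and labels:
        mc = labels.most_common(1)[0][0]
        min_hl = min(state[u]["hl"] for u in adj[v] if state[u]["cl"] == mc)
        state[v] = {"cl": mc, "hl": min_hl + 1}


# Hypothetical star graph: node 0 is the hub, nodes 1 to 3 are leaves.
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
state = {}
for node in star:
    init_node(state, node)
adjust_hierarchy(state, star, 0)
apply_label_principle(state, star, 1, mu_v=0.5)
```

In this toy run the hub keeps its own label at a lower hierarchy level, and leaf 1 joins the hub's community one level below it.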
5 Experimental Analysis
In this section, we examine the performance of our algorithm with different
experiments. We use both real-world and artificial networks, which are
described in Table 1. For the artificial networks, we generate various networks
using the LFR benchmark algorithm [14]. The mixing parameter μ determines
the density of the networks, i.e., the strength of the communities.
We first compare the results of the proposed algorithm on the networks from
Table 1 with a set of algorithms: Louvain [3], Fast-greedy [7], and the Label
Propagation Algorithm (LPA). Next, to examine the ability of self-defining source
nodes of our algorithm, we implement a set of source node selection methods
from the literature and combine them with our algorithm. Finally, we provide
tests to validate the analytically derived low complexity of our algorithm.
Table 2. The AMI quality metric results on the communities detected by Louvain,
LPA, Fast-greedy, and our proposed algorithm (Proposed Alg.) on real-world networks.
The bold values show the best results among other algorithms for each network.
Most of the existing local community detection algorithms require a source node
selection before the community expansion. We implement some of the source
node selection methods from the literature and develop an experiment to ana-
lyze the impact of source node selection on our algorithm. We choose different
centrality and similarity scores: degree centrality [8], extended Jaccard metric [2],
and node density (to find nodes with high degree, however, distant from each
other) [5]. To choose the best candidate nodes fairly, we apply
an outlier detection technique, the interquartile range (IQR), to select nodes with
higher scores. We then adjust the hl of these nodes so that they are treated as the
initial communities of the network and proceed as described in Algorithm 1. We eval-
uate the methods on an LFR benchmark network with n = 2000 and report
the results in Fig. 2. The results show that there are no differences between
the proposed algorithm (Basic) and its variations by each source node selection
(e.g., Basic+Degree). As shown in Fig. 2, our method maintains a self-identifying
source node selection considering node degree.
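The IQR-based selection of candidate source nodes can be illustrated with the sketch below. It assumes the conventional upper fence Q3 + 1.5 × IQR, which the text does not specify, and takes an arbitrary score per node (for example, degree). The function name and the toy scores are hypothetical.

```python
from statistics import quantiles


def iqr_source_nodes(scores, k=1.5):
    """Return the nodes whose score lies above Q3 + k * IQR.

    `scores` maps node -> centrality or similarity score.
    ASSUMPTION: k = 1.5 is the usual Tukey fence, not a value from the paper.
    """
    q1, _, q3 = quantiles(scores.values(), n=4)
    upper_fence = q3 + k * (q3 - q1)
    return {v for v, s in scores.items() if s > upper_fence}


# Hypothetical degree scores: ten nodes of degree 2 and one hub of degree 20.
toy_scores = {v: 2 for v in range(10)}
toy_scores[10] = 20
selected = iqr_source_nodes(toy_scores)
```

Only the hub exceeds the fence here, so it would be the single node whose hl is adjusted before running the main procedure.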
Fig. 2. Employing different source node selection methods from the literature on the
basis of the proposed algorithm. The methods are examined on the LFR 2000s
network using the AMI and modularity measures.
Fig. 3. Results of the experiments on the convergence of the algorithm on LFR
networks: (a) bar plot of the number of iterations; (b) percentage of nodes
modified per iteration.
if the node has a high degree and low hierarchy level in the community that
is defined based on the hop-distance from the source node. This way, we shape
communities in a hierarchical structure where nodes with higher degrees are
towards the center of the community. Our algorithm exploits a set of local
principles allowing each node to decide on its community label based on its
neighborhood’s local information. The algorithm is designed at a more
restrictive level of locality than current local algorithms and offers linear
computational complexity. We conduct extensive experiments to analyze
the performance and efficiency of our algorithm. The experiments on both real
and artificial networks show that the proposed algorithm performs better in
networks with weak community structures than algorithms that benefit
from the global information of the network. Moreover, we perform experiments
to validate the self-defining source node selection of our algorithm.
We show that our algorithm performs independently of the source node
selection methods in the literature. The experiments on the complexity of the
algorithm demonstrate that, regardless of the size of the network, the algorithm
converges after approximately 8 iterations, and the number of nodes
involved in the process does not exceed 87% of the whole network
size. The locality and self-defining properties of this approach
prepare our algorithm for future investigations of its adaptability to
dynamic environments. In addition, we plan to extend the proposed
approach by employing a local merging method on the output communities in
order to increase the accuracy and performance of the results, while still holding
the same level of locality.
Acknowledgment. This work has been partially funded by the joint research pro-
gramme University of Luxembourg/SnT-ILNAS on Digital Trust for Smart-ICT.
References
1. Bagrow, J.P., Bollt, E.M.: Local method for detecting communities. Phys. Rev. E
72(4), 046108 (2005)
2. Berahmand, K., Bouyer, A., Vasighi, M.: Community detection in complex
networks by detecting and expanding core nodes through extended local similarity
of nodes. IEEE Trans. Comput. Soc. Syst. 5(4), 1021–1033 (2018)
3. Blondel, V.D., Guillaume, J.L., Lambiotte, R., Lefebvre, E.: Fast unfolding of
communities in large networks. J. Stat. Mech.
4. Brust, M.R., Frey, H., Rothkugel, S.: Adaptive multi-hop clustering in mobile net-
works. In: Proceedings of the 4th International Conference on Mobile Technology,
Applications (2007)
5. Chen, Y., Zhao, P., Li, P., Zhang, K., Zhang, J.: Finding communities by their
centers. Sci. Rep. 6, 24017 (2016)
6. Clauset, A.: Finding local community structure in networks. Phys. Rev. E 72(2),
026132 (2005)
7. Clauset, A., Newman, M.E., Moore, C.: Finding community structure in very large
networks. Phys. Rev. E 70(6), 066111 (2004)
8. Comin, C.H., da Fontoura Costa, L.: Identifying the starting point of a spreading
process in complex networks. Phys. Rev. E 84(5), 056105 (2011)
9. Coscia, M., Rossetti, G., Giannotti, F., Pedreschi, D.: DEMON: a local-first dis-
covery method for overlapping communities. In: Proceedings of the 18th ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining
(2012)
10. Dilmaghani, S., Brust, M.R., Piyatumrong, A., Danoy, G., Bouvry, P.: Link defini-
tion ameliorating community detection in collaboration networks. Front. Big Data
2, 22 (2019)
11. Fortunato, S.: Community detection in graphs. Phys. Rep. 486(3–5), 75–174 (2010)
12. Girvan, M., Newman, M.E.: Community structure in social and biological networks.
Proc. Natl. Acad. Sci. 99(12), 7821–7826 (2002)
13. Lancichinetti, A., Fortunato, S., Kertész, J.: Detecting the overlapping and
hierarchical community structure in complex networks. New J. Phys. 11(3), 033015 (2009)
14. Lancichinetti, A., Fortunato, S., Radicchi, F.: Benchmark graphs for testing com-
munity detection algorithms. Phys. Rev. E 78(4), 046110 (2008)
15. Li, S., Huang, J., Zhang, Z., Liu, J., Huang, T., Chen, H.: Similarity-based future
common neighbors model for link prediction in complex networks. Sci. Rep. 8(1),
1–11 (2018)
16. Luo, F., Wang, J.Z., Promislow, E.: Exploring local community structures in large
networks. In: 2006 IEEE/WIC/ACM International Conference on Web Intelligence
(WI 2006) (2006)
17. Porter, M.A., Onnela, J.P., Mucha, P.J.: Communities in networks. Not. AMS
56(9), 1082–1097 (2009)
18. Stegehuis, C., Van Der Hofstad, R., Van Leeuwaarden, J.S.: Epidemic spreading on
complex networks with community structures. Sci. Rep. 6(1), 1–7 (2016)
19. Stein, M., Fischer, M., Schweizer, I., Mühlhäuser, M.: A classification of locality
in network research. ACM Comput. Surv. (CSUR) 50(4), 1–37 (2017)
20. Yang, Z., Algesheimer, R., Tessone, C.J.: A comparative analysis of community
detection algorithms on artificial networks. Sci. Rep. 6, 30750 (2016)
Investigating Centrality Measures
in Social Networks with Community
Structure
1 Introduction
With the rapid growth of online social networks (OSNs) such as Facebook and
Twitter, a large amount of data is generated daily. Modeling OSNs as nodes and
edges turns these data into a valuable resource for network mining. Identifying
key nodes in such networks is the basis of major applications such as viral
marketing [1], controlling epidemic spreading [2], and determining sources of
misinformation [3]. Designing centrality measures is a main approach to quantifying
node influence. Numerous centrality measures exploiting various properties of the
network topology have been developed [4]. The information exploited can either be
in the neighborhood of the node or concern the whole topological structure of the
network. The former, called local centrality measures, are less computationally
expensive than the latter, called global centrality measures. However,
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 943, pp. 211–222, 2021.
https://doi.org/10.1007/978-3-030-65347-7_18
In this section preliminaries and definitions used throughout the rest of the paper
are given.
Following are the definitions of the 5 most popular centrality measures used in
the study.
where σ(s, t) is the number of shortest paths between nodes s and t and σi (s, t)
is the number of shortest paths between nodes s and t that pass through node i.
Katz Centrality is based on how many nodes a node is connected to and also
on the connectivity of its neighbors. It is defined as follows:
\[ \alpha_k(i) = \sum_{p=1}^{\infty} \sum_{j=1}^{N} s^{p} a_{ij}^{p} \tag{4} \]
where $a_{ij}^{p}$ is the connectivity of node i with respect to all the other
nodes at $A^{p}$ and $s^{p}$ is the attenuation factor, with $s \in [0, 1]$.
\[ \alpha_p(i) = \frac{1-d}{N} + d \sum_{j \in N_1(i)} \frac{\alpha_p(j)}{k_j} \tag{5} \]
where αp (i) and αp (j) are the PageRank centralities of node i and node j,
respectively, N1 (i) is the set of direct neighbors of node i, kj is the number of
links from node j to node i, and d is the damping parameter where d ∈ [0, 1],
set to 0.85 in the experiments.
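The two definitions above can be illustrated with a minimal sketch, assuming an undirected graph stored as adjacency lists. The function names `katz` and `pagerank`, the toy star graph, and the value s = 0.1 are illustrative assumptions; the text only fixes d = 0.85. Here k_j is taken to be the degree of node j, the usual choice for undirected graphs.

```python
def katz(adj, s=0.1, iters=200):
    """Katz scores as in Eq. (4): sum over p >= 1 of s^p times the row sums of A^p.

    The series is accumulated through the recurrence x <- s * A * (1 + x),
    which converges only if s is below 1 / (largest eigenvalue of A).
    """
    x = {v: 0.0 for v in adj}
    for _ in range(iters):
        x = {v: s * sum(1.0 + x[u] for u in adj[v]) for v in adj}
    return x


def pagerank(adj, d=0.85, iters=200):
    """PageRank as in Eq. (5), by fixed-point iteration from a uniform start."""
    n = len(adj)
    r = {v: 1.0 / n for v in adj}
    for _ in range(iters):
        # Each neighbor u passes on its score divided by its own degree.
        r = {v: (1.0 - d) / n + d * sum(r[u] / len(adj[u]) for u in adj[v])
             for v in adj}
    return r


# Hypothetical toy graph: a star with center 0 and leaves 1, 2, 3.
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
katz_scores = katz(adj)
pr_scores = pagerank(adj)
```

On this toy graph both measures rank the center above the leaves, and the PageRank scores sum to 1 because no node has zero degree.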
In this section, the 8 real-world online social networks are briefly discussed,
alongside the tools applied. Table 1 reports the basic topological characteristics
of the networks. Note that the mixing parameter μ is defined as the proportion of
inter-community links to the total links in a given network. It is calculated after
the community structure is uncovered by the community detection algorithm.
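As a small illustration of the mixing parameter defined above, the following sketch counts inter-community links on a toy graph. The edge list, the community labels, and the function name `mixing_parameter` are hypothetical; in the study the community assignment would come from the community detection algorithm.

```python
def mixing_parameter(edges, community):
    """Proportion of links whose endpoints lie in different communities."""
    inter = sum(1 for u, v in edges if community[u] != community[v])
    return inter / len(edges)


# Hypothetical toy network: two triangles joined by a single link.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
community = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
mu = mixing_parameter(edges, community)
```

Only the link (2, 3) crosses communities here, so μ equals 1/7.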
3.1 Data
Table 1. Basic topological properties of the real-world networks. N is the total
number of nodes. E is the number of edges. <k> is the average degree. <d> is the
average shortest-path length. ν is the density. ζ is the transitivity (also called
the global clustering coefficient). knn(k) is the assortativity (also called the
degree correlation coefficient). Q is the modularity. μ is the mixing parameter.
* indicates the topological properties of the largest connected component of the
network in case it is disconnected.
DeezerEU: This network (deezer europe) is obtained from Deezer, a platform for
music streaming. Nodes are Deezer European users and edges represent online
friendships [26].
PGP: This network (arenas-pgp) is obtained from the web of trust. Nodes are
users of the Pretty Good Privacy (PGP) algorithm and edges represent secure
information sharing among them [27].
Investigating Centrality Measures 217
3.2 Tools
Kendall’s Tau Correlation is used to assess the relationship for all possible
combinations between classical and community-aware centrality measures.
Assume that R(α) and R(β) are the ranking lists of a classical centrality and
a community-aware centrality, respectively. The resulting correlation value, in
the range [−1, +1], reveals the degree of ordinal association between the two
given sets of ranks. If R(αi) > R(αj) and R(βi) > R(βj), or R(αi) < R(αj) and
R(βi) < R(βj), node pair (i, j) is concordant. If R(αi) > R(αj) and
R(βi) < R(βj), or R(αi) < R(αj) and R(βi) > R(βj), node pair (i, j) is
discordant. If R(αi) = R(αj) and/or R(βi) = R(βj), node pair (i, j) is neither
concordant nor discordant. The coefficient is defined as follows:
\[ \tau_b(R(\alpha), R(\beta)) = \frac{n_c - n_d}{\sqrt{(n_c + n_d + u)(n_c + n_d + v)}} \tag{11} \]
where $n_c$ and $n_d$ stand for the number of concordant and discordant pairs,
respectively, and u and v hold the number of tied pairs in sets R(α) and R(β),
respectively.
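The pair counting behind Kendall's Tau-b can be sketched as follows. This is a direct O(n²) illustration rather than the implementation used in the experiments. The function name `kendall_tau_b` and the two toy rankings are hypothetical. The sketch follows the usual Tau-b convention, assumed here, in which u counts pairs tied only in the first list, v counts pairs tied only in the second, and pairs tied in both are ignored.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(a, b):
    """Kendall's Tau-b between two equally long lists of ranks (or scores)."""
    nc = nd = u = v = 0
    for i, j in combinations(range(len(a)), 2):
        da, db = a[i] - a[j], b[i] - b[j]
        if da == 0 and db == 0:
            continue          # tied in both lists: counted nowhere
        elif da == 0:
            u += 1            # tied only in the first list
        elif db == 0:
            v += 1            # tied only in the second list
        elif da * db > 0:
            nc += 1           # concordant pair
        else:
            nd += 1           # discordant pair
    denom = sqrt((nc + nd + u) * (nc + nd + v))
    return (nc - nd) / denom if denom else float("nan")


# Hypothetical rankings of five nodes under two centrality measures.
rank_alpha = [1, 2, 3, 4, 5]
rank_beta = [1, 3, 2, 4, 4]
tau = kendall_tau_b(rank_alpha, rank_beta)
```

For these lists there are 8 concordant pairs, 1 discordant pair, and 1 pair tied only in the second list, so the coefficient is 7/√90.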
Rank-Biased Overlap (RBO) [28] places more emphasis on the top nodes of the
two ranked lists R(α) and R(β) of classical and community-aware centrality
measures. Its value lies in [0, 1]. It is defined as follows:
\[ RBO(R(\alpha), R(\beta)) = (1 - p) \sum_{d=1}^{\infty} p^{(d-1)} \frac{|R(\alpha_d) \cap R(\beta_d)|}{d} \tag{12} \]
where p dictates “user persistence” and the weight given to the top ranks, d is
the depth reached on sets R(α) and R(β), and |R(αd) ∩ R(βd)|/d is the proportion
of the similarity overlap at depth d. Note that p is set to 0.9 in the experiments.
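A minimal sketch of the RBO sum, truncated at a finite depth, is given below. Truncation is an assumption of this sketch: the definition sums to infinity, so the truncated value is a lower bound, and the text does not say how the tail is handled in the experiments. The function name `rbo_truncated` and the toy lists are hypothetical.

```python
def rbo_truncated(list_a, list_b, p=0.9, depth=None):
    """Partial sum of Eq. (12) over depths 1..depth (default: the shorter list)."""
    if depth is None:
        depth = min(len(list_a), len(list_b))
    seen_a, seen_b = set(), set()
    total = 0.0
    for d in range(1, depth + 1):
        seen_a.add(list_a[d - 1])
        seen_b.add(list_b[d - 1])
        agreement = len(seen_a & seen_b) / d   # overlap proportion at depth d
        total += p ** (d - 1) * agreement
    return (1 - p) * total


# Hypothetical top-3 node lists under two centrality measures.
top_alpha = ["a", "b", "c"]
top_beta = ["b", "a", "c"]
score = rbo_truncated(top_alpha, top_beta)
```

With p = 0.9 the agreements at depths 1, 2, and 3 are 0, 1, and 1, giving 0.1 × (0.9 + 0.81) = 0.171. Two identical lists of length n give 1 − pⁿ, which shows how much weight the truncated tail still carries.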
Infomap Community Detection Algorithm [24] is based on compression
of information. The idea is that a random walker on a network is likely to stay
longer inside a given community and shorter outside communities. Accordingly,
using Huffman coding, each community is identified by a unique codeword, and
nodes inside communities are identified by other codewords that can be reused in
different communities. The optimization algorithm minimizes the length of the
code describing the path of the random walker, achieving a concise map of the
community structure.
4 Experimental Results
In this section, the results of the experiments performed on the real-world net-
works are reported. The first set of experiments involves calculating Kendall’s
Tau correlation coefficient for all possible combinations between classical and
community-aware centrality measures. The second experiment involves calculat-
ing the RBO similarity across all the combinations.
Fig. 1. Heatmaps of the Kendall’s Tau correlation (τb ) of real-world networks across
the various combinations between classical (α) and community-aware (β) centrality
measures. The classical centrality measures are: αd = Degree, αb = Betweenness, αc =
Closeness, αk = Katz, αp = PageRank. The community-aware centrality measures are:
βBC = Bridging centrality, βCHB = Community Hub-Bridge, βPC = Participation
Coefficient, βCBM = Community-based Mediator, βNNC = Number of Neighboring
Communities. (Color figure online)
As top nodes are more important than bottom nodes in centrality assessment,
RBO is also calculated. Moreover, high correlation does not necessarily mean
high similarity. This is more obvious when ties exist among the rankings of a set.
Figure 2 shows the RBO similarity heatmaps of the 8 OSNs. The RBO values
range from 0 to 0.86. Low similarity from 0 to 0.3 is characterized by the dark
purple color. Medium similarity from 0.3 to 0.6 is characterized by the fuch-
sia color. High similarity over 0.6 is characterized by the light pink color. For
comparison purposes, the networks are arranged in the same order as in Fig. 1.
5 Conclusion
Communities have major consequences on the dynamics of a network. Humans
tend to form communities within their social presence according to one or many
similarity criteria. In addition to that, humans tend to follow other members
manifesting power, influence, or popularity, resulting in dense community struc-
tures. Centrality measures aim to identify the key members within OSNs, which
is crucial for a lot of strategic applications. However, these measures are agnostic
to the community structure. Newly developed centrality measures account for
the existence of communities.
Most previous work on online social networks has focused on classical centrality
measures. In this work, we shed light on the relationship between classical
and community-aware centrality measures in OSNs. Using 8 real-world OSNs
from different platforms, their community structure is uncovered using Infomap.
Then, for each network, 5 classical and 5 community-aware centrality measures
are calculated. After that, correlation and similarity evaluation between all pos-
sible classical and community-aware centrality measures is conducted. Results
show that globally these two types of centrality do not convey the same infor-
mation. Moreover, community-aware centrality measures exhibit two behaviors.
The first set (Bridging centrality, Community Hub-Bridge, and Participation
Coefficient) exhibit low correlation and low similarity for all the networks under
References
1. Jalili, M., Perc, M.: Information cascades in complex networks. J. Complex Netw.
5(5), 665–693 (2017)
2. Wang, Z., Moreno, Y., Boccaletti, S., Perc, M.: Vaccination and epidemics in net-
worked populations—an introduction (2017)
3. Azzimonti, M., Fernandes, M.: Social media networks, fake news, and polarization.
Technical report, National Bureau of Economic Research (2018)
4. Lü, L., Chen, D., Ren, X.-L., Zhang, Q.-M., Zhang, Y.-C., Zhou, T.: Vital nodes
identification in complex networks. Phys. Rep. 650, 1–63 (2016)
5. Sciarra, C., Chiarotti, G., Laio, F., Ridolfi, L.: A change of perspective in network
centrality. Sci. Rep. 8(1), 1–9 (2018)
6. Ibnoulouafi, A., El Haziti, M., Cherifi, H.: M-centrality: identifying key nodes based
on global position and local degree variation. J. Stat. Mech: Theory Exp. 2018(7),
073407 (2018)
7. Girvan, M., Newman, M.E.J.: Community structure in social and biological net-
works. Proc. Natl. Acad. Sci. 99(12), 7821–7826 (2002)
8. Jebabli, M., Cherifi, H., Cherifi, C., Hamouda, A.: User and group networks on
Youtube: a comparative analysis. In: 2015 IEEE/ACS 12th International Confer-
ence of Computer Systems and Applications (AICCSA), pp. 1–8. IEEE (2015)
9. Cherifi, H., Palla, G., Szymanski, B.K., Lu, X.: On community structure in complex
networks: challenges and opportunities. Appl. Netw. Sci. 4(1), 1–35 (2019)
10. Hwang, W., Cho, Y., Zhang, A., Ramanathan, M.: Bridging centrality: identifying
bridging nodes in scale-free networks. In: Proceedings of the 12th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, pp. 20–23
(2006)
11. Ghalmane, Z., El Hassouni, M., Cherifi, H.: Immunization of networks with non-
overlapping community structure. Soc. Netw. Anal. Min. 9(1), 45 (2019)
12. Guimera, R., Amaral, L.A.N.: Functional cartography of complex metabolic net-
works. Nature 433(7028), 895–900 (2005)
13. Tulu, M.M., Hou, R., Younas, T.: Identifying influential nodes based on community
structure to speed up the dissemination of information in complex network. IEEE
Access 6, 7390–7401 (2018)
14. Gupta, N., Singh, A., Cherifi, H.: Community-based immunization strategies for
epidemic control. In: 2015 7th International Conference on Communication Sys-
tems and Networks (COMSNETS), pp. 1–6. IEEE (2015)
15. Chakraborty, D., Singh, A., Cherifi, H.: Immunization strategies based on the over-
lapping nodes in networks with community structure. In: International Conference
on Computational Social Networks, pp. 62–73. Springer, Cham (2016)
16. Kumar, M., Singh, A., Cherifi, H.: An efficient immunization strategy using over-
lapping nodes and its neighborhoods. In: Companion Proceedings of the The Web
Conference 2018, pp. 1269–1275 (2018)
17. Ghalmane, Z., Cherifi, C., Cherifi, H., El Hassouni, M.: Centrality in complex
networks with overlapping community structure. Sci. Rep. 9(1), 1–29 (2019)
18. Li, C., Li, Q., Van Mieghem, P., Stanley, H.E., Wang, H.: Correlation between
centrality metrics and their application to the opinion model. Eur. Phys. J. B
88(3), 1–13 (2015)
19. Oldham, S., Fulcher, B., Parkes, L., Arnatkevic̆iūtė, A., Suo, C., Fornito, A.: Con-
sistency and differences between centrality measures across distinct classes of net-
works. PloS One 14(7) (2019)
20. Shao, C., Cui, P., Xun, P., Peng, Y., Jiang, X.: Rank correlation between centrality
metrics in complex networks: an empirical study. Open Phys. 16(1), 1009–1023
(2018)
21. Landherr, A., Friedl, B., Heidemann, J.: A critical review of centrality measures
in social networks. Bus. Inf. Syst. Eng. 2, 371–385 (2010)
22. Grando, F., Noble, D., Lamb, L.C.: An analysis of centrality measures for complex
and social networks. In: 2016 IEEE Global Communications Conference (GLOBE-
COM), pp. 1–6. IEEE (2016)
23. Rajeh, S., Savonnet, M., Leclercq, E., Cherifi, H.: Interplay between hierarchy and
centrality in complex networks. IEEE Access 8, 129717–129742 (2020)
24. Rosvall, M., Bergstrom, C.T.: Maps of random walks on complex networks reveal
community structure. Proc. Natl. Acad. Sci. 105(4), 1118–1123 (2008)
25. Rossi, R., Ahmed, N.: The network data repository with interactive graph analytics
and visualization. In: Twenty-Ninth AAAI Conference on Artificial Intelligence
(2015)
26. Rozemberczki, B., Sarkar, R.: Characteristic functions on graphs: birds of a feather,
from statistical descriptors to parametric models (2020)
27. Kunegis, J.: Handbook of network analysis [konect–the koblenz network collection].
arXiv:1402.5500 (2014). http://konect.cc/networks/
28. Webber, W., Moffat, A., Zobel, J.: A similarity measure for indefinite rankings.
ACM Trans. Inf. Syst. (TOIS) 28(4), 1–38 (2010)
Network Analysis
Complex Network Analysis of North
American Institutions of Higher
Education on Twitter
1 Introduction
According to the National Center for Education Statistics [6], in 2018, there were
4,313 degree-granting postsecondary institutions, also known as institutions of
higher education (IHEs), in the USA. This number includes public and private
(both nonprofit and for-profit) universities, liberal arts colleges, community col-
leges, religious schools, and trade schools.
The IHEs enjoy a heavy presence on social media, in particular, on Twit-
ter. In 2012, Linvill et al. [3] found that IHEs employ Twitter primarily as an
institutional news feed to a general audience. These results were confirmed by
Kimmons et al. in 2016 [2] and 2017 [14]; the authors further argue that Twitter
failed to become a “vehicle for institutions to extend their reach and further
demonstrate their value to society”, and a somewhat “missed opportunity for
presidents to use Twitter to connect more closely with alumni and donors” [15].
The same disconnect has been observed for IHE library accounts [11].
Despite the failed promise, the IHEs massively invest in online marketing [12]
and, in reciprocity, collect impressive follower lists that include both organiza-
tions and individuals. The longer follower lists demonstrate a positive effect on
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 943, pp. 225–236, 2021.
https://doi.org/10.1007/978-3-030-65347-7_19
226 D. Zinoviev et al.
2 Data Set
Our data set consists of two subsets: social networking data from Twitter and
IHE demographics from Niche [8]. We used the former to construct a network of
IHEs and the latter to provide independent variables for the network analysis.
Both subsets were collected in Summer 2020.
The Twitter data set describes the Twitter accounts of 1,450 IHEs from all 50
states and the District of Columbia. The majority of the accounts are the official
IHE accounts, but for some IHEs, we had to rely on secondary accounts, such
as those of admission offices or varsity sports teams. For each IHE, we have the
following attributes (and their mean values): geographical location (including
the state), the lists of followers (20,198) and friends (1,130), the numbers of
favorites (“likes”; 4,656) and statuses (“posts”; 9,132), the account age in years
(10.4), and whether the account is verified or not (32% of the accounts are verified).
With some IHEs having more than a million followers (e.g., MIT and Harvard
University), we chose to restrict our lists to up to 10,000 followers per IHE. This
limitation may have resulted in a slight underestimation of the connectedness
of the most popular IHEs. We explain in Subsect. 3.1 why we believe that the
underestimation is not crucial.
It is worth noting that while we have downloaded the friend lists, we do
not use them in this work because they are controlled by the IHE administra-
tions/PR offices and cannot be considered truly exogenous.
The combined list of followers consists of 347,920 users. This number does
not include the “occasional” followers who subscribed to fewer than three IHEs.
The descriptive IHE data comes from Niche [8], an American company that
provides demographics, rankings, report cards, and colleges’ reviews. It covers
1,435 of the IHEs that we selected for the network construction. Five more IHEs
were not found on Niche and, though included in the network, were not used in
further analysis.
Network of Colleges on Twitter 227
Binary:
– “Liberal Arts” college designation,
– Application options: “SAT/ACT Optional”, “Common App Accepted,”
or “No App Fee” (these options can be combined).
Categorical:
– Type: “Private”, “Public”, “Community College”, or “Trade School”;
note that all community colleges and trade schools in our data set are
public;
– Religious affiliation: “Christian”, “Catholic”, “Muslim” or “Jewish”; we
lumped the first two together;
– Online learning options: “Fully Online”, “Large Online Program”, or
“Some Online Degrees”;
– Gender preferences: “All-Women” or “All-Men”;
– Race preferences: “Hispanic-Serving Institution” (HSI) or “Historically
Black College or University” (HBCU).
Count or real-valued: Enrollment and tuition. We noticed that due to the
broad range of enrollments and tuition, enrollment and tuition logarithms
are better predictors. We will use log (enrollment) and log (tuition) instead of
enrollment and tuition throughout the paper.
3 Network Construction
We define the network G of IHEs on Twitter as G = (N, E). Here, N = {ni } is
a set of 1,450 nodes, each representing an IHE account, and E = {eij } is a set
of weighted edges.
Let f(n) be the set of followers of the account n. As noted in Sect. 2, ∀n ∈ N:
#f(n) ≤ 10,000.
Let f⁻¹(q) = {n ∈ N | q ∈ f(n)} be the set of all IHE accounts followed by user
q. Note that q itself may be a member of N: IHEs can follow each other.
The definition of an edge is derived from the concept of G as a network based
on co-following: two nodes ni and nj share an edge eij iff they have at least one
shared follower that also follows at least three IHE accounts. We denote a set of
such qualified followers as Q:
\[ Q = \{\, q \mid \# f^{-1}(q) \ge 3 \,\} \tag{1} \]
\[ \forall i, j : \exists e_{ij} \Leftrightarrow Q \cap f(n_i) \cap f(n_j) \ne \emptyset \tag{2} \]
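The edge definition above can be sketched in a few lines, assuming the follower lists are available as a dictionary from account name to a set of follower identifiers. The account and user names, and the function name `co_following_edges`, are hypothetical. The sketch only determines which edges exist and which qualified followers support them; it does not reproduce the edge weights.

```python
from itertools import combinations


def co_following_edges(followers, min_followed=3):
    """Return the qualified followers Q and, per account pair, the shared members of Q."""
    # Invert f to obtain f^{-1}: user -> set of accounts the user follows.
    followed_by = {}
    for account, users in followers.items():
        for q in users:
            followed_by.setdefault(q, set()).add(account)
    # Q: users who follow at least `min_followed` accounts.
    qualified = {q for q, accts in followed_by.items() if len(accts) >= min_followed}
    # An edge (ni, nj) exists iff some member of Q follows both ni and nj.
    shared = {}
    for q in qualified:
        for ni, nj in combinations(sorted(followed_by[q]), 2):
            shared.setdefault((ni, nj), set()).add(q)
    return qualified, shared


# Hypothetical follower lists for four accounts.
followers = {
    "A": {"u1", "u2", "u3"},
    "B": {"u1", "u2"},
    "C": {"u1", "u4"},
    "D": {"u4"},
}
qualified, shared = co_following_edges(followers)
```

In this toy input only u1 follows three accounts, so the edges are A–B, A–C, and B–C. C and D share the follower u4, but u4 follows fewer than three accounts, so no C–D edge is created.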
We mentioned in Sect. 2 that we use only up to 10,000 followers for edge weight
calculations. The truncated follower lists result in lower weights. We can estimate
the difference between true and calculated weights by assuming the worst-case
scenario: the shared followers are uniformly distributed in the follower lists. Let
F = #f = 21,123 be the mean number of followers; let T = 10,000; let p ≈ 0.685
be the probability that a follower list is not longer than T; let w be the mean
edge weight; finally, let w* be the estimated mean edge weight. Note that if p = 1
then w* = w. One can show that:
\[ \frac{w^{*}}{w} \approx \left( \frac{(F - T)\,p + T}{F} \right)^{-2} \approx 1.436. \tag{4} \]
Seemingly, the weights of all edges that are incident to at least one node with
a truncated follower list are underestimated by ≈30%.
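The quoted figures can be checked numerically. The sketch below assumes that the estimate in Eq. (4) is the inverse square of ((F − T)p + T)/F, which is the reading that reproduces both the ratio of about 1.436 and the underestimation of about 30%. The variable names are illustrative.

```python
F = 21123      # mean number of followers
T = 10000      # truncation threshold for follower lists
p = 0.685      # probability that a follower list is not longer than T

base = ((F - T) * p + T) / F       # about 0.834
ratio = base ** -2                 # estimated w* / w
underestimate = 1 - 1 / ratio      # relative shortfall of w with respect to w*
```

With these inputs `ratio` is about 1.437 (the small difference from 1.436 comes from the rounded value of p) and `underestimate` is about 0.30. Setting p = 1 gives a ratio of exactly 1, matching the remark that w* = w in that case.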
However, we noticed that Twitter reports follower lists not uniformly but
roughly in the order of prominence: the prominent followers with many followers
of their own are reported first. We hope that the shared users responsible for
edge formations are mostly reported among the first 10,000 followers.
4 Network Analysis
In this section, we analyze the constructed network and present the results. We
looked at individual nodes’ positions in the network (monadic analysis), rela-
tions between adjacent nodes (dyadic analysis), and node clusters (community
analysis).
aspects of n’s prominence in a network [16]: the number of closely similar IHEs
(degree), the average similarity of n to all other IHEs (closeness), the number
of IHEs that are similar to each other by being similar to n (betweenness), and
the measure of mutual importance (eigenvector: “n is important if it is similar
to other important nodes”). The local clustering coefficient reports if the nodes
similar to n are also similar to each other.
We use multiple ordinary least squares (OLS) regression to model the rela-
tionships between each of the network attributes and the following independent
variables: tuition, enrollment, Twitter account age, Twitter account verified sta-
tus, “No App Fee”, “Liberal Arts” designation, “SAT/ACT Optional”, “Com-
mon App Accepted”, race preferences, online learning options, type/religious
affiliations, and gender preferences (see Sect. 2). We combined the IHE type and
religious affiliations into one variable because all public schools are secular.
The number of samples in the regression is 1,348 (the intersection of the Niche
set and Twitter set). Table 1 shows the independent variables that significantly
(p ≤ 0.01) explain the monadic network measures, and the regression coefficients.
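For readers unfamiliar with multiple OLS regression, the following sketch estimates the coefficients by solving the normal equations in plain Python. It is illustrative only: the two predictors and the data are hypothetical, no significance tests are computed, and the paper does not state which software was used.

```python
def ols(X, y):
    """Least-squares coefficients from the normal equations (X^T X) b = X^T y.

    X is a list of rows whose first entry is 1 for the intercept. Gauss-Jordan
    elimination with partial pivoting is used; a singular design matrix raises
    ZeroDivisionError.
    """
    k = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))]
         for i in range(k)]
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        A[c] = [val / A[c][c] for val in A[c]]
        for r in range(k):
            if r != c:
                factor = A[r][c]
                A[r] = [a - factor * b for a, b in zip(A[r], A[c])]
    return [A[i][k] for i in range(k)]


# Hypothetical data: rows are [1, log(enrollment), verified flag];
# y is some network measure generated exactly as 1 + 2*x1 - 0.5*x2.
X = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1], [1, 2, 1]]
y = [1.0, 3.0, 0.5, 2.5, 4.5]
coef = ols(X, y)
```

Because the toy response is an exact linear function of the predictors, the recovered coefficients are the intercept 1, the slope 2, and the slope −0.5, up to rounding.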
Table 1. Variables that significantly (p ≤ 0.01) explain the monadic network measures:
betw[enness], clos[eness], degr[ee], eigen[vector] centralities, clust[ering] coefficient, and
numbers of favorites (“likes”), followers, friends, and statuses (posts). † The marked
rows represent levels of the categorical variables.
Variable Coef.
Betw. Clos. Clust. Degr. Eigen. Favorites Followers Friends Posts
Liberal Arts 0.27 0.05 0.08 0.08 −0.98
Private† 0.29 1.40
Account Age 0.12 0.02 0.03 0.03 0.05
Tuition −0.21
Common App 0.02
No App Fee 0.03 0.05 0.05
Large Online† 0.05 0.09 0.09
Some Online† 0.02 0.04 0.04 −0.55
HBCU† 0.06 0.11 0.10
Christian† −0.04 0.07 0.07 0.63
Verified −0.03 −0.05 −0.05 0.66 1.30 0.52 0.48
Enrollment 0.03 0.01 0.06 0.06 0.33 0.65 0.29 0.29
Both clauses emphasize the difference of the monadic attributes along the
incident edge. Table 2 shows the independent variables that significantly (p ≤
0.01) explain the edge weights, and the regression coefficients. For this analysis,
we add the state in which an IHE is located to the monadic variables listed in
Subsect. 4.1.
Variable Coef.
Same state 0.0169
Similar enrollment 0.0024
Similar tuition 0.0022
Same religious affiliation 0.0019
Same online preferences 0.0010
Similar account age 0.0008
Same “Common App Accepted” option 0.0008
Same “No App Fee” option 0.0008
Same race designation 0.0006
Same “SAT/ACT Optional” option 0.0005
Both verified −0.0001
Same gender designation −0.0022
Same “Liberal Arts” designation −0.0026
Variable Coef.
1 2 3 4 5 7 8 11 12 13 14 15 17
Christian† 1.75 −3.85
Comm. Coll.† 2.75
Common App −1.95 2.38 1.75
Enrollment −0.53 1.99 −0.50 −0.73 −0.56
HBCU† 7.29
HSI† 1.30 1.65
Large Online† 3.20
Liberal Arts −1.77 2.97
No App Fee 1.19
Private† −2.88
SAT/ACT Opt 1.44 −0.74 −2.08
Some Online† 0.92 −1.61
Trade School† 3.34 −2.42
Tuition −1.85 1.68 3.14 −1.05
Verified −1.34 −1.27 1.80 −1.27
Fig. 1. An induced network of IHE clusters. Each node represents a cluster named
after its highest-enrollment IHE. The node size represents the number of IHEs in the
cluster. The edge width represents the number of IHE-level connections.
The name of each cluster in Fig. 1 incorporates the name of the Twitter
account of the IHE with the highest enrollment in the cluster.
5 Followers’ Analysis
At the last stage of the network analysis, we shift the focus of attention from
the IHEs to their followers.
We selected 14,750 top followers who follow at least 1% of the IHEs in our
data set. Approximately 8% of them have an empty description or a description
in a language other than English. Another 268 accounts belong to the IHEs from
the original data set, and at least 326 more accounts belong to other IHEs, both
domestic and international.
We constructed a semantic network of lemmatized tokens by connecting the
tokens that frequently (10 or more times) occur together in the descriptions.
We applied the Louvain [1] community detection algorithm to extract topics—
the clusters of words that are frequently used together. The algorithm identified
twelve topics named after the first nine most frequently used words. For each
follower’s account, we selected the most closely matching topics. The names and
counts for the most prominent topics are shown in Table 4.
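The construction of the semantic network can be sketched as a co-occurrence count over tokenized descriptions. This sketch assumes that the descriptions are already lemmatized and tokenized and that a pair is counted once per description. The token lists and the function name `cooccurrence_edges` are hypothetical, and the community detection step is not shown.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(descriptions, min_count=10):
    """Keep token pairs that appear together in at least `min_count` descriptions."""
    counts = Counter()
    for tokens in descriptions:
        # Each unordered pair is counted once per description.
        for a, b in combinations(sorted(set(tokens)), 2):
            counts[(a, b)] += 1
    return {pair: c for pair, c in counts.items() if c >= min_count}


# Hypothetical tokenized descriptions.
descriptions = [["college", "news", "official"]] * 10 + [["college", "sport"]] * 3
edges = cooccurrence_edges(descriptions)
```

Here the three pairs among "college", "news", and "official" each occur 10 times and become edges, while ("college", "sport") occurs only 3 times and is dropped. The resulting weighted edge list is the kind of input a community detection algorithm would then partition into topics.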
Table 4. The most prominent topics and the number of followers’ accounts that use
them. (Since a description may contain words from more than one topic, the sum of
the counts is larger than the number of followers.) † Topic #8 is technical.
Even after the manual cleanup, some of the 12,984 remaining followers’
accounts probably still belong to IHEs and associated divisions, organizations,
and officials. This deficiency would explain the significance of the topics #4 and,
are exogenous to the IHEs and represent higher education services, high schools,
communities, career services, and individuals (“male” and “female”).
6 Discussion
Based on the results from Sect. 4, we look at each independent variable’s influence
on each network and Twitter performance parameter, whenever the influence is
statistically significant (p ≤ 0.01).
It has been observed [13] that the centrality measures are often positively
correlated. Indeed, in G’s case, we saw strong (≥0.97) correlations between the
degree, eigenvector, and closeness centralities, which explains their statistically
significant connection to the same independent variables (Table 1). More central
nodes tend to represent:
Some specialty IHEs: Liberal arts colleges, HBCUs.
Internet-savvy IHEs: IHEs with a longer presence on Twitter, IHEs with
some or many online programs.
Bigger IHEs with simplified application options: IHEs with no application
fees (and accepting Common App—for the closeness centrality), larger IHEs.
All these IHEs blend better in their possibly non-homogeneous network
neighborhoods.
The betweenness centrality—the propensity to act as a shared reference
point—is positively affected by being a liberal arts college or private IHE, and
longer presence on Twitter, and negatively affected by higher tuition and being
a Christian IHE. On the contrary, large and Christian IHEs tend to have a larger
local clustering coefficient and a more homogeneous network neighborhood.
All Twitter performance measures (the numbers of favorites, followers, friends,
and posts) are positively affected by enrollment and the verified account status.
The number of posts is also higher for the IHEs with a more prolonged presence
on Twitter and Christian IHEs. The number of followers is also higher for private
IHEs and lower for liberal arts colleges and IHEs with some online programs. The
latter observation is counterintuitive and needs further exploration.
Edge weight is the only dyadic variable in G. Table 2 shows that the weight
of an edge is explained by the differences of the adjacent nodes’ attributes. Some
of the attributes promote homophily, while others inhibit it.
The strongest edges connect the IHEs located in the same state, which is
probably because many local IHEs admit the bulk of the local high schools’
graduates and are followed by them and their parents. Much weaker, but still
positive, contributors to the edge weight are similar enrollment and tuition, same
religious affiliation, online teaching preferences, racial preferences, and applica-
tion preferences, a “classical” list of characteristics that breed connections [5]. We
hypothesize that prospective students and their parents follow several IHEs that
match the same socio-economic profile. National, regional, and professional asso-
ciations (such as the National Association for Equal Opportunity and National
Association of Independent Colleges and Universities) may follow similar IHEs
for the same reason.
We identified two factors that have a detrimental effect on edge weight: having
the same gender designation (“All-Men”, “All-Women”, or neither) and especially
the same “Liberal Arts” designation. “All-Women” IHEs make up 1.58% of our
data set (there are no “All-Men” IHEs), and Liberal Arts colleges make up 11.2%.
The IHEs of both types may be considered unique and not substitutable, thus
having fewer shared followers.
In the same spirit, some network communities (clusters) of G represent com-
pact groups of IHEs with unique characteristics (Table 3). For example, cluster 1
tends to include community colleges and trade schools with no application fees,
optional SAT/ACT, and lower tuition (e.g., Carl Sandburg College). Cluster 3
is a preferred locus of smaller Christian IHEs that do not accept Common App
but require SAT/ACT (New Saint Andrews College). The last comprehensive
example is cluster 8: smaller public, secular, expensive IHEs embracing Common
App (University of Maine at Machias). IHEs with large online programs are in
cluster 17 (Middle Georgia State University), Historically Black Colleges and
Universities are in cluster 7 (North Carolina A&T State University), and Liberal
Arts colleges are in cluster 15 (St. Olaf College).
It is worth reiterating that membership in five clusters, containing 9.1% of the
IHEs, cannot be explained with statistical significance by any independent variable.
The explanatory variables, if they exist, must be missing from our data set.
7 Conclusion
We constructed and analyzed a social network of select North American insti-
tutions of higher education (IHEs) on Twitter, using the numbers of shared fol-
lowers as a measure of connectivity. We used multiple OLS regression to explain
the network characteristics: centralities, clustering coefficients, and cluster mem-
bership. The regression variables include IHE size, tuition, geographic location,
type, and application preferences. We discovered statistically significant con-
nections between the independent variables and the network characteristics. In
particular, we observed strong homophily among the IHEs in terms of the num-
ber of shared followers. Finally, we analyzed the self-provided descriptions of the
followers and assigned them to several classes. Our findings may help understand
the college application decision-making process from the points of view of the
major stakeholders: applicants, their families, high schools, and marketing and
recruitment companies.
References
1. Blondel, V., Guillaume, J.L., Lambiotte, R., Lefebvre, E.: Fast unfolding of com-
munities in large networks. J. Stat. Mech.: Theory Exp. 2008(10), 1000 (2008)
2. Kimmons, R., Veletsianos, G., Woodward, S.: Institutional uses of Twitter in U.S.
higher education. Innov. High. Educ. 42, 97–111 (2017)
3. Linvill, D., McGee, S., Hicks, L.: Colleges’ and universities’ use of Twitter: a content
analysis. Public Relat. Rev. 38(4), 636–638 (2012)
4. McCoy, C., Nelson, M., Weigle, M.: University Twitter engagement: using Twitter
followers to rank universities. arXiv preprint arXiv:1708.05790 (2017)
5. McPherson, M., Smith-Lovin, L., Cook, J.: Birds of a feather: homophily in social
networks. Ann. Rev. Sociol. 27(1), 415–444 (2001)
6. National Center for Education Statistics: Degree-granting postsecondary
institutions, by control and level of institution (2018).
https://nces.ed.gov/programs/digest/d18/tables/dt18_317.10.asp?current=yes. Accessed September 2020
7. Newman, M.: Modularity and community structure in networks. Proc. Natl. Acad.
Sci. U.S.A. 103(23), 8577–8696 (2006)
8. Niche: 2020 Best Colleges in America. https://niche.com/colleges. Accessed 2020
9. Rutter, R., Roper, S., Lettice, F.: Social media interaction, the university brand
and recruitment performance. J. Bus. Res. 69(8), 3096–3104 (2016)
10. Simas, T., Rocha, L.M.: Distance closures on complex networks. Netw. Sci. 3(2),
227–268 (2015)
11. Stewart, B., Walker, J.: Build it and they will come? Patron engagement via Twit-
ter at historically black college and university libraries. J. Acad. Libr. 44(1), 118–
124 (2018)
12. Taylor, Z., Bicak, I.: Buying search, buying students: how elite U.S. institutions
employ paid search to practice academic capitalism online. J. Mark. High. Educ.
30(2), 271–296 (2020)
13. Valente, T., Coronges, K., Lakon, C., Costenbader, E.: How correlated are network
centrality measures? Connections (Toronto Ont.) 28(1), 16 (2008)
14. Veletsianos, G., Kimmons, R., Shaw, A., Pasquini, L., Woodward, S.: Selective
openness, branding, broadcasting, and promotion: Twitter use in Canada’s public
universities. Educ. Media Int. 54(1), 1–19 (2017)
Abstract. Graph sampling methods have been used to reduce the size
and complexity of big complex networks for graph mining and visual-
ization. However, existing graph sampling methods often fail to preserve
the connectivity and important structures of the original graph.
This paper introduces a new divide and conquer approach to spectral
graph sampling based on the graph connectivity (i.e., decomposition of
a connected graph into biconnected components) and spectral sparsifi-
cation. Specifically, we present two methods, spectral vertex sampling
and spectral edge sampling by computing effective resistance values of
vertices and edges for each connected component. Experimental results
demonstrate that our new connectivity-based spectral sampling approach
is significantly faster than previous methods, while preserving the same
sampling quality.
1 Introduction
Big complex networks are abundant in many application domains, such as social
networks and systems biology. Examples include Facebook networks,
protein-protein interaction networks, biochemical pathways and web graphs.
However, good visualization of big complex networks is challenging due to
scalability and complexity. For example, visualizations of big complex networks
often produce hairball-like drawings, which make it difficult for humans to
understand the structure of the graphs.
Graph sampling methods have been widely used to reduce the size of graphs
in graph mining [6,7]. Popular graph sampling methods include Random Vertex
sampling, Random Edge sampling and Random Walk. However, previous work
based on random sampling methods often fails to preserve the connectivity and
important structures of the original graph, in particular for visualization [13].
Spectral sparsification is a technique to reduce the number of edges in a graph
while retaining its structural properties [12]. More specifically, it is a stochastic
sampling method, using the effective resistance values of edges, which is closely
Supported by the ARC Discovery Projects.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 943, pp. 237–248, 2021.
https://doi.org/10.1007/978-3-030-65347-7_20
238 J. Hu et al.
2 Related Work
2.1 Graph Sampling and Spectral Sparsification
Graph sampling methods have been extensively studied in graph mining to
reduce the size of big complex graphs. Consequently, many stochastic sampling
methods are available [6,7]. For example, the most popular stochastic sampling
methods include Random Vertex sampling and Random Edge sampling. However, it
was shown that random sampling methods often fail to preserve connectivity and
important structure in the original graph, in particular for visualization [13].
Spielman et al. [12] introduced the Spectral Sparsification (SS), a subgraph
which preserves the structural properties of the original graph, and proved that
every n-vertex graph has a spectral approximation with O(n log n) edges. More
specifically, they presented a stochastic sampling method, using the effective
resistance values of edges, which is closely related to the commute distance of
graphs [12]. However, computing effective resistance values of edges is quite
complicated and can be very slow for big graphs [2].
3.1 Algorithm BC SS
Let $G = (V, E)$ be a connected graph with a vertex set $V$ and an edge set $E$,
and let $G_i$, $i = 1, \ldots, k$, denote the biconnected components of $G$.
3.2 Algorithm BC SV
The BC SV algorithm is a divide and conquer algorithm that uses spectral
sampling of vertices [5], i.e., it adapts the spectral sparsification approach
by sampling vertices rather than edges. More specifically, we define an
effective resistance value $r(v)$ for each vertex $v$ as the sum of the
effective resistance values of its incident edges, i.e.,
$r(v) = \sum_{e \in E_v} r(e)$, where $E_v$ denotes the set of edges incident
to the vertex $v$.
The BC SV algorithm first computes the BC tree decomposition, and then
adds the cut vertices and their incident edges to the spectral sampling $G'$ of
$G$. Next, it computes a spectral vertex sampling $G'_i$ for each biconnected
component $G_i$, $i = 1, \ldots, k$, of $G$. Specifically, for each component
$G_i$, we compute the effective resistance values $r(v)$ of the vertices, and
then sample the vertices with the largest effective resistance values. Finally,
it merges $G'_i$, $i = 1, \ldots, k$, to obtain the spectral sampling $G'$ of
$G$. The BC SV algorithm is described as follows:
Algorithm BC SV
1. Partitioning: Divide a connected graph $G$ into biconnected components $G_i$,
$i = 1, \ldots, k$.
2. Cut vertices: Add the cut vertices and their incident edges to the spectral
sampling $G'$ of $G$.
3. Spectral vertex sampling: For each component $G_i$, compute a spectral vertex
sampling $G'_i$ of $G_i$. Specifically, compute the effective resistance values
$r(v)$ of the vertices, and then sample the vertices with the largest effective
resistance values.
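The spectral vertex sampling step for a single component can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes an unweighted, connected, simple graph, computes edge effective resistances by inverting the Laplacian with one vertex grounded (a standard identity), and the function names `effective_resistances`, `vertex_resistances` and `sample_vertices` are hypothetical. The dense inversion is only practical for small graphs.

```python
def effective_resistances(nodes, edges):
    """Effective resistance r(e) of each edge of a connected, unweighted graph."""
    idx = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    # Graph Laplacian L = D - A.
    lap = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        i, j = idx[u], idx[v]
        lap[i][i] += 1.0
        lap[j][j] += 1.0
        lap[i][j] -= 1.0
        lap[j][i] -= 1.0
    # Ground the last vertex: drop its row and column, then invert the
    # remaining (n-1) x (n-1) block by Gauss-Jordan elimination.
    m = n - 1
    aug = [lap[i][:m] + [1.0 if i == j else 0.0 for j in range(m)]
           for i in range(m)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda row: abs(aug[row][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for row in range(m):
            if row != col and aug[row][col] != 0.0:
                f = aug[row][col]
                aug[row] = [a - f * b for a, b in zip(aug[row], aug[col])]
    inv = [row[m:] for row in aug]

    def entry(i, j):
        # Entries involving the grounded vertex are zero.
        return inv[i][j] if i < m and j < m else 0.0

    # R(u, v) = M[u][u] + M[v][v] - 2 M[u][v] for the grounded inverse M.
    return {(u, v): entry(idx[u], idx[u]) + entry(idx[v], idx[v])
                    - 2.0 * entry(idx[u], idx[v])
            for u, v in edges}


def vertex_resistances(nodes, edge_r):
    """r(v): sum of the effective resistances of the edges incident to v."""
    r_v = {v: 0.0 for v in nodes}
    for (u, v), r in edge_r.items():
        r_v[u] += r
        r_v[v] += r
    return r_v


def sample_vertices(nodes, edges, ratio):
    """Keep the fraction `ratio` of vertices with the largest r(v)."""
    r_v = vertex_resistances(nodes, effective_resistances(nodes, edges))
    k = max(1, round(ratio * len(nodes)))
    return sorted(nodes, key=lambda v: r_v[v], reverse=True)[:k]
```

In the divide and conquer setting, a function like `sample_vertices` would be called once per biconnected component, so each inversion only involves that component's vertices.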
Connectivity-Based Spectral Sampling 241
4 BC SS and BC SV Experiments
The main rationale behind the hypotheses is that the graph samples computed
by BC SS and SS (resp., BC SV and SV) are very similar, since the rankings of
edges (resp., vertices) based on the resistance values are highly similar. We
experiment with benchmark real world graphs [2] and synthetic data sets; see
Table 1. The real world graphs are scale-free graphs with highly imbalanced
sizes of biconnected components (i.e., a big biconnected component). The
synthetic graphs are generated with balanced sizes of biconnected components.
(a) Real world graphs
Graph       V      E
facebook    4039   88234
G4          2075   4769
G15         1789   20459
oflights    2939   14458
p2pG        8846   31839
soch        2426   16630
wiki        7115   100762
yeastppi    2361   6646

(b) Synthetic graphs
Graph                            Abbr       V       E
syn path20 150 200 True          sp20       3462    5410
syn tree4 5 10 10 100 True       st4 5      21955   34957
syn tree4 9 10 10 30 True 2      st4 9      18936   40411
syn tree6 3 4 10 30 True 2       st6 3      11679   24868
syn tree6 3 4 10 30 True 3       st6 3 3    13237   41936
values. On the other hand, graph G4 shows very high performance on runtime
improvement and the difference in resistance values is quite small, however the
sampling accuracy is low when the sampling ratio is smaller than 20%.
For BC SV, synthetic graphs show excellent performance overall, achieving
above 70% sampling accuracy at sampling ratio 5% and rising to 90% at
sampling ratio 10%, with steady improvement towards 100% as the sampling
ratio increases. For real world graphs, BC SV achieves above 70% sampling
accuracy at sampling ratio 20% for all graphs. In particular, the Facebook graph
shows excellent sampling accuracy at all sampling ratios, achieving above 99%.
Furthermore, the correlation analysis of the ranking of edges (resp., vertices)
based on resistance values computed by SS and BC SS (resp., SV and BC SV)
shows strong and positive results for all data sets. Overall, our experiments
and analysis confirm that BC SS (resp., BC SV) computes good approximations
on the rankings of edges (resp., vertices) based on effective resistance values,
compared to SS (resp., SV), validating hypothesis H2.
Fig. 4. The KS values of the sampling quality metrics (Closeness, AND, Degree, CC)
of graph samples. (a), (b): computed by BC SS, SS, RE; (c), (d): BC SV, SV, RV. A
lower KS value indicates a better result. BC SS and SS (resp., BC SV and SV) perform
very similarly, and better than RE (resp., RV). (Color figure online)
We also computed the Jaccard similarity index to test the similarity between
the original graph G and the graph samples computed by SS and BC SS (resp.,
SV and BC SV). More specifically, it is defined as the size of the
intersection divided by the size of the union of the two graphs (a value of 1
indicates that the two graphs are the same).
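As a small illustration of this definition, the index can be computed on sets. The sketch below assumes that two graphs are compared through their edge sets (for edge sampling) or vertex sets (for vertex sampling); the text does not state which elements are compared, and the helper names `jaccard` and `edge_set` are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections of hashable items."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty collections are treated as identical.
    return len(a & b) / len(union) if union else 1.0


def edge_set(edges):
    """Undirected edges as frozensets, so (u, v) and (v, u) compare equal."""
    return {frozenset(e) for e in edges}


original = edge_set([(1, 2), (2, 3), (3, 4), (4, 1)])
sample = edge_set([(2, 1), (3, 2)])
similarity = jaccard(original, sample)  # 2 shared edges out of 4 in the union
```

Because a sample is a subgraph of the original graph, the intersection is the sample itself, which is consistent with an index that grows with the sampling ratio.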
Figure 5 shows the average Jaccard similarity values for real world graphs
and synthetic graphs for BC SS and SS (resp. BC SV and SV), with sampling
ratio from 5% to 95%. Clearly, for both data sets, the Jaccard similarity index
linearly increases with the sampling ratio, and SS and BC SS (resp., SV and
BC SV) perform almost the same, validating hypothesis H3.
Fig. 6. Comparison of graph samples of real world graphs and a synthetic graph with
20% sampling ratio, computed by SS, BC SS, SV and BC SV. SS and BC SS (resp.
SV and BC SV) produce almost identical visualizations.
References
1. Barrat, A., Barthelemy, M., Pastor-Satorras, R., Vespignani, A.: The architecture
of complex weighted networks. Proc. Natl. Acad. Sci. U.S.A. 101(11), 3747–3752
(2004)
2. Eades, P., Nguyen, Q.H., Hong, S.: Drawing big graphs using spectral sparsifica-
tion. In: Proceedings of GD 2017, pp. 272–286 (2017)
3. Freeman, L.C.: Centrality in social networks conceptual clarification. Soc. Netw.
1(3), 215–239 (1978)
4. Gammon, J., Chakravarti, I.M., Laha, R.G., Roy, J.: Handbook of Methods of
Applied Statistics (1967)
5. Hu, J., Hong, S., Eades, P.: Spectral vertex sampling for big complex graphs. In:
Proceedings of Complex Networks, pp. 216–227. Springer (2019)
6. Hu, P., Lau, W.C.: A survey and taxonomy of graph sampling. CoRR
abs/1308.5865 (2013)
7. Leskovec, J., Faloutsos, C.: Sampling from large graphs. In: Proceedings of
SIGKDD, pp. 631–636. ACM (2006)
8. Meidiana, A., Hong, S., Huang, J., Eades, P., Ma, K.: Topology-based spectral
sparsification. In: Proceedings of LDAV, pp. 73–82. IEEE (2019)
9. Newman, M.E.J.: Mixing patterns in networks. Phys. Rev. E 67(2), 26126 (2003)
10. Nocaj, A., Ortmann, M., Brandes, U.: Untangling the hairballs of multi-centered,
small-world online social media networks. JGAA 19(2), 595–618 (2015)
11. Saramäki, J., Kivelä, M., Onnela, J.P., Kaski, K., Kertész, J.: Generalizations of
the clustering coefficient to weighted complex networks. Phys. Rev. E 75(2), 27105
(2007)
12. Spielman, D.A., Teng, S.H.: Spectral sparsification of graphs. SIAM J. Comput.
40(4), 981–1025 (2011)
13. Wu, Y., Cao, N., Archambault, D.W., Shen, Q., Qu, H., Cui, W.: Evaluation of
graph sampling: a visualization perspective. IEEE TVCG 23(1), 401–410 (2017)
Graph Signal Processing on Complex
Networks for Structural Health
Monitoring
1 Introduction
For several domains and problems, complex networks provide natural means
for representing complex data, e.g., multi-relational data, dynamic behavior,
and complex systems in general. In particular, for dynamic, temporal, and
continuous sequential data, Graph Signal Processing (GSP) [24] has emerged as
a prominent and versatile framework for analyzing such complex data. This
concerns the analysis of both the network structure and its dynamics. GSP
extends classical signal processing to irregular structures such as graphs
and networks [22]. The advantage of graphs over classical data
representations is that graphs naturally account for such irregular
relations [25].
In this paper, a computational framework utilizing GSP for the analysis of
complex sensor data represented in complex networks is presented. Our applica-
tion context is given by Structural Health Monitoring (SHM), a data-driven
diagnostic framework for investigating and estimating the integrity of mas-
sive structures [1,23]. It aims at improving safety, reliability, efficiency, and
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 943, pp. 249–261, 2021.
https://doi.org/10.1007/978-3-030-65347-7_21
250 S. Bloemheuvel et al.
2 Background on GSP
This section first introduces basic concepts of signal processing on graphs and the
necessary theoretical background. For a detailed overview on GSP see e. g., [18,
24].
$$L = V \Lambda V^{-1}, \qquad (1)$$
where the columns $v_n$ of the matrix $V$ are the eigenvectors of the Laplacian
$L$, and $\Lambda$ is the diagonal matrix of the corresponding eigenvalues. The
eigenvalues act as the frequencies on the graph [20]. The GFT of a signal $s$
is then computed as $\hat{s} = U^{*} s$, where $U$ is the Fourier basis of the
graph and $U^{*}$ is the conjugate transpose of $U$. The inverse GFT of a
Fourier-domain signal $\hat{s}$ is defined as $s = U \hat{s}$.
– Similar to classical signal processing, filters can be applied to graph signals.
The fundamental idea is to transform the graph signal into the graph spectral
domain, weaken unwanted frequencies or magnify wanted frequencies of the
signal by altering the Fourier coefficients, and convert the signal back to the
vertex domain.
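The transform and filtering steps above can be illustrated on the simplest possible case. The sketch below is an assumption-laden toy, not the framework used in the paper: it takes an unweighted path graph, whose Laplacian eigenvectors are known in closed form (the DCT-II basis, with eigenvalues 2 - 2cos(πk/n)), so no numerical eigendecomposition is needed. The function names are hypothetical, and a general graph would require computing the eigenvectors of its Laplacian instead.

```python
import math


def path_graph_fourier_basis(n):
    """Orthonormal Laplacian eigenvectors and eigenvalues of a path graph.

    basis[k] is the k-th eigenvector (a column of the Fourier basis U),
    freqs[k] the matching eigenvalue, used as the graph frequency.
    """
    basis, freqs = [], []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        basis.append([scale * math.cos(math.pi * k * (i + 0.5) / n)
                      for i in range(n)])
        freqs.append(2.0 - 2.0 * math.cos(math.pi * k / n))
    return basis, freqs


def gft(signal, basis):
    """Forward transform: project the signal onto each eigenvector."""
    # The basis is real, so the conjugate transpose is the plain transpose.
    return [sum(u_i * s_i for u_i, s_i in zip(u, signal)) for u in basis]


def inverse_gft(coeffs, basis):
    """Inverse transform: weighted sum of the eigenvectors."""
    n = len(basis[0])
    return [sum(c * u[i] for c, u in zip(coeffs, basis)) for i in range(n)]


def low_pass(signal, basis, freqs, cutoff):
    """Zero every Fourier coefficient whose graph frequency exceeds cutoff."""
    coeffs = gft(signal, basis)
    kept = [c if f <= cutoff else 0.0 for c, f in zip(coeffs, freqs)]
    return inverse_gft(kept, basis)
```

With a cutoff of 0 only the constant component survives, so the filtered signal is the mean of the input at every vertex; larger cutoffs keep progressively less smooth components.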
3 Method
In this section, an overview of our analysis framework is provided. After that,
the dataset and network modeling techniques are described.
Figure 1 shows an overview of the proposed framework applying GSP for SHM
using complex networks. Overall, the framework provides an incremental and
iterative methodology for analysis. The complex signal (sensor network data) is
first modeled into a complex network, which can be assessed in a semi-automatic
approach to apply refinements and enrichment of the network (or the respective
data, e. g., when defective sensors are detected). After that, GSP is applied on
the network to obtain a specific GSP model which can be utilized and deployed
for SHM, e. g., for identifying a minimal subset of sensors, or for detecting specific
patterns, events, mode shapes, etc.
3.2 Dataset
indicated that the bridge did not meet the quality and security requirements.
The network consists of 145 sensors placed along the width of the bridge, which
include 34 vibration, 50 horizontal strain (X-strain), 41 vertical strain (Y-strain),
and 20 temperature sensors. The dataset has been used for various data mining
techniques, such as time series analysis [26] and modal analysis [17].
Overall, the dataset made available to us consists of 5 min of high-resolution
sensor data, about 30,000 observations in total. It contains several traffic events,
of which the 10 most significant events are investigated. For data preprocessing
and alignment, the domain expert mentioned that the strain sensors were not
scaled on the same range and that clock times of the sensors were not aligned;
time synchronization is a general open challenge [15], regarding simultaneous
data collection on all sensors. Thus, the clock times of the sensors were aligned
by matching the peaks in the sensor readings, and the data were rescaled by
z-score standardization. Lastly, the average values per 100 ms were taken from
the original data (sampled at 100 Hz) to smooth out the signal. The sensors
were placed at three cross-sections within one span (see [16]). Thus, to make the
connections in the network meaningful, the 31 sensors in the middle and right
cross-sections were dropped. Also, 4 sensors were found defective, which reduced
the total set of sensors from 145 to 110 (Fig. 2).
Fig. 2. Sensor locations at one of the girders of the bridge (from [16]). Sensors are
either embedded or attached to the deck or girders.
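The rescaling and smoothing steps described above can be sketched as follows. This is only an illustration under stated assumptions: at 100 Hz a 100 ms window holds 10 samples, the population standard deviation is used for the z-score (the text does not say which variant was applied), incomplete trailing windows are dropped, and the function names are hypothetical. The peak-based clock alignment is not covered here.

```python
from statistics import mean, pstdev


def zscore(values):
    """Standardize one sensor's readings to zero mean and unit variance."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        # A constant signal carries no variation to rescale.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def block_average(values, block=10):
    """Average consecutive, non-overlapping windows of `block` samples."""
    return [mean(values[i:i + block])
            for i in range(0, len(values) - block + 1, block)]
```

Under these assumptions, the roughly 30,000 observations per sensor reduce to about 3,000 smoothed values.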
of the strain sensors at the girders. Therefore, the edges were determined by the
correlation score between the sensor measurements, selecting the top-3 edges for
modeling (excluding vibration sensors, which kept all their original edges).
To conclude, four networks are created based on the given set of desired
sensors: X-strain (|V | = 42, |E| = 126), Y-strain (|V | = 37, |E| = 111), X-Y
combined (|V | = 79, |E| = 237) and Vibration (|V | = 15, |E| = 26). The strain
sensors are used in the analysis of sampling and mode shape identification, while
the vibration sensors only assisted in identifying mode shapes.
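The correlation-based edge selection can be sketched as below. This is a hedged illustration rather than the authors' procedure: it assumes Pearson correlation on the raw (signed) scores and that each sensor contributes its three most correlated partners, which would be consistent with the reported edge counts of three per vertex for the strain networks (e.g., 126 edges for 42 vertices). The names `pearson` and `top_k_edges` are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx and vy else 0.0


def top_k_edges(readings, k=3):
    """For every sensor, keep the k partners with the highest correlation.

    `readings` maps a sensor name to its time series; the result is a list
    of (source, target, correlation) triples with k entries per sensor.
    """
    edges = []
    for s in readings:
        scores = [(pearson(readings[s], readings[t]), t)
                  for t in readings if t != s]
        scores.sort(reverse=True)
        edges.extend((s, t, r) for r, t in scores[:k])
    return edges
```

Whether reciprocal pairs were merged into a single undirected edge is not stated in the text, so the sketch keeps one entry per direction.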
Below, we first present and discuss the results of sampling the sensors for obtain-
ing a minimal subset of sensors which allows to reconstruct the total signal. After
that, applications for mode shape identification will be discussed.
Table 1 shows the Root Mean Squared Error (RMSE) for different conditions
during the 10 most noticeable traffic events in the time series. We chose RMSE
as a standard metric since it measures in the same unit as the variable of interest.
The domain expert indicated that Y-strain is harder to model since the bridge
can move more freely in the Y-direction than the X-direction. Therefore, the
algorithms perform best on the X-strain sensors but struggle with the Y-strain
sensors. When directly comparing the algorithms, the top-down algorithm
consistently outperforms the random (+29.92%) and bottom-up (+11.85%)
algorithms in terms of RMSE. Also, the random algorithm was tested in a separate
experiment for 50,000 iterations (100× more than the initial setting). When we
consider the individual events, even after that many runs, the random
algorithm consistently failed to find a better solution than either
hill-climber. In that sense, running only one top-down iteration already
outperforms a very large number of random iterations (for any reasonable
number N of iterations).
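For reference, the RMSE used in this comparison can be written in a few lines. The sketch assumes it is computed between the original signal and the signal reconstructed from the selected sensors, both given as equally long sequences; the function name `rmse` is hypothetical.

```python
from math import sqrt


def rmse(actual, reconstructed):
    """Root mean squared error, in the same unit as the signal itself."""
    if len(actual) != len(reconstructed):
        raise ValueError("sequences must have the same length")
    squared = [(a - r) ** 2 for a, r in zip(actual, reconstructed)]
    return sqrt(sum(squared) / len(squared))
```

A lower value means the reconstruction from the sensor subset stays closer to the full signal.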
Table 1. Mean and standard deviation of the best RMSE scores for each algorithm
during all significant traffic events in the dataset. Non-filtered and Filtered
indicate whether a graph filter was applied to the signal.
Figures 3 and 4 show the performance of the X-strain sensors during the
heaviest traffic event (event 1). In comparison, the random algorithm
(M = 3.90, SD = 1.09) performs poorly in general. The bottom-up algorithm
(M = 2.60, SD = .45) already improves on the random algorithm by a wide margin;
the difference between the two distributions is significant (t-test on the
results of 500 iterations per algorithm, p < .001). The top-down algorithm
(M = 1.20, SD = .06) shares no overlap with the bottom-up algorithm.
The top-down algorithm also shows a much smaller standard deviation, which
indicates that it performs more consistently. Such behavior can be explained by
the underlying procedures of the hill-climbers. The bottom-up algorithm finishes
more iterations because it decides which nodes are selected instead of dropped. It
makes decisions about the 25% of sensors that are selected, while the top-down
algorithm decides on the 75% of sensors that are not selected. If a weakly
performing sensor is not dropped in the first few iterations, it will most likely
be dropped in a later iteration, since it stays longer in the pool of sensors to
drop. Therefore, a good strategy appears to be running a few top-down trials.
Fig. 5. The green and purple nodes are the selected sensors from the top-down algo-
rithm. Circle-shaped nodes indicate the Y-strain sensors and triangle-shaped nodes
indicate the X-strain sensors (X-strain/Y-strain are vertically separated).
Fig. 6. The sensor network consisting of X (bottom component) and Y (top
component) sensors (X-strain/Y-strain sensors are vertically separated for
explanatory purposes).
Fig. 7. The sensor network during traffic event 1. We can observe increased
strain in the girders at the bottom right part of the bridge and the upper region
of the Y-strain sensors, respectively.
any physical phenomenon, e. g., wave propagation, fluid behavior and heat flow.
FEM solves a problem by dividing a system into smaller parts called finite
elements, which in turn form a mesh of the object. Each element contains a simple
equation, and the assembled equations model the entire problem.
With GSP, this procedure is not always necessary since certain mode shapes
can be spotted by examining the graph for a period of t time points. Figure 8a
shows a combination of mode shapes that can be spotted in Fig. 8c. The bridge
is vibrating back and forth; the figure captures a time point at which the
vibration on the left side of the bridge is decreasing. A supplementary page
(footnote 1) with animated GIFs is available since static images do not show the
full story.
Figure 8d shows a vehicle passing by on the right side of the bridge, and
how the girders on the bottom right of the bridge carry the weight and allow
the other parts of the bridge to decrease in strain level (see the left side of
Fig. 8b). In future work, methods to automatically label moments in time by
their combination of mode shapes could be investigated.
Fig. 8. (a) shows a FEM-based (from [17]) combination of mode shapes and the
corresponding graph signal in (c). (b) shows a FEM-based (from [17]) combination of
torsional mode shapes in the girders and the corresponding graph signal in (d).
1 Link to GitHub page.
Graph Signal Processing for Structural Health Monitoring 259
5 Conclusions
In this work, we presented the application of a computational framework utilizing
Graph Signal Processing for the analysis of complex sensor data in the domain
of Structural Health Monitoring. In the proposed framework, GSP proved to
be a promising technique for working with real-world complex sensor data. The
results indicate that GSP is capable of selecting the most important sensors in
the Hollandse Brug, a large bridge in the Netherlands, to arrive at a minimal
subset of sensors in a resource-aware way. The top-down algorithm performed
best among the algorithms tried. Using GSP for sensor selection could lead to
significant cost reductions in monitoring large civil infrastructures.
Furthermore, the sensor selection could be used to increase the lifetime of
battery-powered sensor networks, e.g., by calculating two optimal sets of
sensors that are switched on and off alternately.
In addition, our case study demonstrated a technique to find a combination of
mode shapes in the graph signal plots, indicating important events/mode shapes
in our application context. The events and mode shapes visible in the network
can be used to assess the structural health of the bridge, since the mode shapes
hint at other aspects of the bridge, such as stiffness and damping. Moreover, our
GSP approach requires fewer modeling assumptions and less engineering knowledge
than, e.g., building a complex FEM model.
For future research, we intend to investigate ways to detect mode shapes with
GSP in an unsupervised manner, e.g., by adapting or refining methods from anomaly
detection [3,5]. Furthermore, we plan to apply GSP to other civil infrastructure
datasets with application domains such as heat diffusion or fluid flow.
Additionally, we aim to assess more data-driven methods in order to bypass the
usage of greedy strategies, such as graph neural network approaches, e.g., [9,
27], also including combinations of other network analysis and GSP methods,
e.g., utilizing more information as modeled in feature-rich networks [10]. Such
methods could then also be tested on other civil infrastructure datasets.
Acknowledgement. We wish to thank Dr. A.J. Knobbe for his domain knowledge
regarding the InfraWatch project, which he managed for several years.
References
1. Abdulkarem, M., Samsudin, K., Rokhani, F.Z., A Rasid, M.F.: Wireless sensor
network for structural health monitoring: a contemporary review of technologies,
challenges, and future direction. Struct. Health Monit. 19(3), 693–735 (2020)
2. Aggarwal, C.C., Bar-Noy, A., Shamoun, S.: On sensor selection in linked informa-
tion networks. Comput. Netw. 126, 100–113 (2017)
3. Akoglu, L., Tong, H., Koutra, D.: Graph based anomaly detection and description.
Data Min. Knowl. Disc. 29(3), 626–688 (2015)
4. Anis, A., Gadde, A., Ortega, A.: Efficient sampling set selection for bandlimited
graph signals using graph spectral proxies. IEEE Trans. Signal Process. 64(14),
3775–3789 (2016)
5. Atzmueller, M., Arnu, D., Schmidt, A.: Anomaly detection and structural analysis
in industrial production environments. In: Proceedings of the International Data
Science Conference (IDSC 2017), Salzburg, Austria (2017)
6. Capellari, G., Chatzi, E., Mariani, S.: Cost-benefit optimization of structural health
monitoring sensor networks. Sensors 18(7), 2174 (2018)
7. Cornwell, P., Farrar, C.R., Doebling, S.W., Sohn, H.: Environmental variability of
modal properties. Exp. Tech. 23(6), 45–48 (1999)
8. Defferrard, M., Martin, L., Pena, R., Perraudin, N.: PyGSP: graph signal
processing in python. https://github.com/epfl-lts2/pygsp/
9. Han, Z., Wang, Y., Ma, Y., Günnemann, S., Tresp, V.: Graph hawkes network for
reasoning on temporal knowledge graphs. CoRR abs/2003.13432 (2020)
10. Interdonato, R., Atzmueller, M., Gaito, S., Kanawati, R., Largeron, C., Sala, A.:
Feature-rich networks: going beyond complex network topologies. Appl. Netw. Sci.
4(4), 1–13 (2019)
11. Isufi, E.: Graph-time signal processing: filtering and sampling strategies. Ph.D.
thesis, Doctoral Thesis. Technische Universiteit Delft (2019)
12. Knobbe, A., Blockeel, H., Koopman, A., Calders, T., Obladen, B., Bosma, C.,
Galenkamp, H., Koenders, E., Kok, J.: Infrawatch: data management of large sys-
tems for monitoring infrastructural performance. In: Proceedings of the Interna-
tional Symposium on Intelligent Data Analysis, pp. 91–102. Springer (2010)
13. Krause, A., Singh, A., Guestrin, C.: Near-optimal sensor placements in Gaussian
processes: theory, efficient algorithms and empirical studies. J. Mach. Learn. Res.
9(Feb), 235–284 (2008)
14. Mateos, G., Segarra, S., Marques, A.G., Ribeiro, A.: Connecting the dots: iden-
tifying network structure via graph signal processing. IEEE Signal Process. Mag.
36(3), 16–43 (2019)
15. Mechitov, K., Kim, W., Agha, G., Nagayama, T.: High-frequency distributed sens-
ing for structure monitoring. In: Proceedings of the First International Workshop
on Networked Sensing Systems (INSS 2004), pp. 101–105 (2004)
16. Miao, S.: Structural health monitoring meets data mining. Ph.D. thesis, Doctoral
Thesis. Leiden Institute of Advanced Computer Science (2014)
17. Miao, S., Veerman, R., Koenders, E., Knobbe, A.: Modal analysis of a concrete
highway bridge: structural calculations and vibration-based results. In: Proceedings
of the Conference on Structural Health Monitoring of Intelligent Infrastructure,
Hongkong (2013)
18. Ortega, A., Frossard, P., Kovačević, J., Moura, J.M., Vandergheynst, P.: Graph
signal processing: overview, challenges, and applications. Proc. IEEE 106(5), 808–
828 (2018)
19. Puy, G., Tremblay, N., Gribonval, R., Vandergheynst, P.: Random sampling of ban-
dlimited signals on graphs. Appl. Comput. Harmon. Anal. 44(2), 446–475 (2018)
20. Sandryhaila, A., Moura, J.M.: Discrete signal processing on graphs: frequency anal-
ysis. IEEE Trans. Signal Process. 62(12), 3042–3054 (2014)
21. Seo, J., Hu, J.W., Lee, J.: Summary review of structural health monitoring appli-
cations for highway bridges. J. Perform. Constr. Facilit. 30(4), 04015072 (2016)
22. Shuman, D.I., Narang, S.K., Frossard, P., Ortega, A., Vandergheynst, P.: The
emerging field of signal processing on graphs: extending high-dimensional data
analysis to networks and other irregular domains. IEEE Signal Process. Mag. 30(3),
83–98 (2013)
23. Sony, S., Laventure, S., Sadhu, A.: A literature review of next-generation smart
sensing technology in structural health monitoring. Struct. Control Health Monit.
26(3), e2321 (2019)
24. Stankovic, L., Mandic, D., Dakovic, M., Brajovic, M., Scalzo, B., Constantinides,
T.: Graph signal processing–part I: graphs, graph spectra, and spectral clustering.
arXiv preprint arXiv:1907.03467 (2019)
25. Stankovic, L., Mandic, D.P., Dakovic, M., Kisil, I., Sejdic, E., Constantinides, A.G.:
Understanding the basis of graph signal processing via an intuitive example-driven
approach [lecture notes]. IEEE Signal Process. Mag. 36(6), 133–145 (2019)
26. Vespier, U., Knobbe, A., Nijssen, S., Vanschoren, J.: MDL-based analysis of time
series at multiple time-scales. In: ECML PKDD, pp. 371–386. Springer (2012)
27. Zhang, Z., Cui, P., Zhu, W.: Deep learning on graphs: a survey. IEEE Trans. Knowl.
Data Eng. (2020)
An Analysis of Four Academic
Department Collaboration Networks
with Respect to Gender
1 Introduction
Director at the Office for Diversity and Inclusion at the Japan Science and Tech-
nology Agency has said that younger people are more “data-literate”, meaning
they rely much more on “evidence-based data” rather than experiences to make
decisions for the future [5]. For this reason, it is important to produce data
analyses showing gender collaboration patterns in research in order to further
encourage future diversity and inclusion measures.
We investigate the collaboration patterns between researchers of different
genders at California Polytechnic State University, San Luis Obispo (Cal Poly),
a largely undergraduate university with approximately 1,400 faculty members.
We scope the study to the Biology, Computer Science and Software Engineering,
Electrical Engineering, and Mathematics departments. These departments were
chosen to compare the gender collaborations in departments with both more
applied and more theoretical fields. We evaluate multiple recently published
claims of gender patterns in research across these departments.
2 Related Works
A collaboration network is an instance of a network where vertices repre-
sent researchers and an edge exists between two vertices if the corresponding
researchers collaborated on a research activity. For this paper, an edge will exist
between two researchers if they collaborated on a publication.
This paper will examine collaboration patterns among researchers with
respect to gender. GLAAD, an American organization dedicated to counter-
ing discrimination against the LGBTQ population, defines gender identity as
“a person’s internal, deeply held sense of their gender”. A gender identity can be
a binary gender (man or woman) or non-binary/genderqueer. This differs from
a person’s sex which is the classification of a person as male or female based on
their external anatomy and other bodily characteristics [2].
In 2020, Elsevier, a leading information publishing and analytics company,
released their third gender report, titled The Researcher Journey Through a
Gender Lens, examining researchers in Argentina, Brazil, Mexico, Canada, USA,
UK, Portugal, Spain, France, Italy, Netherlands, Germany, Denmark, Australia,
and Japan as well as EU28 which aggregates data from all 28 countries in the
EU. These researchers were also grouped by subject area, including disciplines
within the physical sciences, life sciences, health sciences, and social sciences.
This report summarizes findings from studies conducted during recent years [5].
The study found that there has been a trend toward gender parity when compar-
ing active authors (authoring at least 2 publications during the study period) in
1999–2003 to those in 2014–2018. For every subject area in each of the countries
analyzed, it was found that the median ratio of women to men among active
authors was higher in the later period of 2014–2018 than in 1999–2003. Despite
this trend toward gender parity, Elsevier’s study still found evidence for a gender
gap. They found that even during the period from 2014–2018, most disciplines
had more active male authors than active female authors. This gap was most
pronounced in the physical sciences, including mathematics, computer science,
and engineering disciplines [5].
264 L. Nakamichi et al.
3 Methods
We build a collaboration network for the Computer Science, Electrical Engi-
neering, Mathematics, and Biology departments at Cal Poly. In this network,
vertices are defined as researchers: either active faculty members at Cal Poly
(as of the 2019–2020 academic year) in one of the departments of interest or a
direct collaborator of one of these faculty members (either internal to Cal Poly
in another department or external to the university). Two researchers are con-
nected if they have ever coauthored a publication within the years from 1972 to
2020. Note that there is no overlap in the faculty within these four departments.
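The network definition above can be sketched as a small adjacency structure. This is a minimal illustration, assuming each publication is available as a list of author names that already identify researchers uniquely; the function name `build_collaboration_network` and the sample names are hypothetical.

```python
from itertools import combinations


def build_collaboration_network(publications):
    """Map each researcher to the set of their direct collaborators.

    `publications` is an iterable of author lists; two researchers are
    connected if they appear together on at least one publication.
    """
    adjacency = {}
    for authors in publications:
        unique = sorted(set(authors))
        for a in unique:
            # Single-author papers still register the researcher as a vertex.
            adjacency.setdefault(a, set())
        for a, b in combinations(unique, 2):
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


network = build_collaboration_network([
    ["Ada", "Bo", "Cy"],
    ["Ada", "Dee"],
    ["Bo"],
])
collaborator_counts = {name: len(peers) for name, peers in network.items()}
```

The size of each set is the number of direct collaborators of a researcher, i.e., the vertex degree in the collaboration network.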
The network was primarily filled through online databases and supplemented
with publication lists provided by researchers. The databases used were the
Microsoft Academic Knowledge API for all departments, MathSciNet for the
Mathematics department, and IEEE Xplore API for the Computer Science and
Electrical Engineering departments.
Four Academic Department Collaboration Networks with Respect to Gender 265
Gender Inference. Since the genders of the researchers in our network are not
labeled, an API was used to label each of the researchers’ genders. We chose to
use Gender API because a study by Santamaría and Mihaljevic found it to be
the most accurate for gender inferences [12].
Since we were already using names as identifiers, no parsing was needed
to extract names. Papers found by manual scraping already had their authors separated. The
first names were lowercased and sent to Gender API, which handled any further
normalization. Gender API reports a confidence value for each inference.
We accepted any label whose confidence was above 50%. For example, if name X was labeled
female with 51% confidence by Gender API, we took the name to be female.
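The labeling rule described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names and the `api_gender` and `confidence` arguments are hypothetical stand-ins and do not reflect Gender API's actual response format. Treating a confidence of exactly 50 as "unknown" is also an assumption.

```python
def first_name_query(full_name):
    """Return the lowercased first name, or None if only an initial is given."""
    tokens = full_name.split()
    if not tokens:
        return None
    first = tokens[0].lower()
    # A bare initial such as "j." carries no usable signal for inference.
    if len(first.rstrip(".")) <= 1:
        return None
    return first


def label_gender(api_gender, confidence, threshold=50):
    """Map one (hypothetical) inference result to 'female', 'male' or 'unknown'."""
    if api_gender in ("female", "male") and confidence > threshold:
        return api_gender
    return "unknown"


# Name X labeled female with 51% confidence is taken to be female.
example = label_gender("female", 51)
```

Authors for whom `first_name_query` returns `None` would never be sent for inference and would stay "unknown".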
We note that the use of this API is a weakness of this study. Some of our
data sources only supplied first initials for the authors of each publication; such
authors are labeled “unknown”. Additionally, we recognize that researchers may
not identify with a binary gender, which Gender API does not consider and which
therefore could not be considered in this study.
The vertex and edge distribution of the network is given in Table 1. Using Gender
API, each researcher was marked as male, female, or unknown. Table 2 gives
the counts of each gender in the entire collaboration network and within each of
the department networks.
This data was used to analyze the validity of the following claims for
researchers in the Biology, Computer Science, Electrical Engineering, and Math-
ematics departments at Cal Poly. In these analyses, researchers whose genders
were marked as unknown were ignored.
Note that in each of the box and whisker plots, the median is represented by
the yellow line and the green triangle represents the mean.
5 Claims
In the following we consider four previously published observations about the
difference in collaboration patterns between male and female researchers. We
encourage the reader to pay particular attention to the Math department’s col-
laboration patterns in each of the following studies.
The 2020 Gender Report found that across all departments and regions stud-
ied, men tended to have more direct collaborators than women [5]. This was
also found by Ductor, Goyal, and Prummer in their study of the collaboration
patterns of the Economics department at the University of Cambridge. They
found that women tended to have 23% fewer collaborators than the men in the
study [9].
To test this, we measured the number of different researchers each Cal Poly
researcher collaborated with and took the average across each department of
interest. For example, if Researcher 1 worked on Paper 1 with Researcher 2
and Researcher 3 and worked on Paper 2 with Researcher 3 and Researcher 4,
Researcher 1 will have 3 collaborators.
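The counting rule in this example can be sketched in a few lines. The function name and the list-of-author-lists input format are assumptions made for illustration only.

```python
from collections import defaultdict


def distinct_collaborators(papers):
    """Count, for each author, the number of distinct coauthors over all papers."""
    collaborators = defaultdict(set)
    for authors in papers:
        for author in authors:
            # Everyone else on the paper is a collaborator of this author.
            collaborators[author].update(a for a in authors if a != author)
    return {author: len(others) for author, others in collaborators.items()}


# The two-paper example from the text.
papers = [
    ["Researcher 1", "Researcher 2", "Researcher 3"],
    ["Researcher 1", "Researcher 3", "Researcher 4"],
]
counts = distinct_collaborators(papers)
```

Researcher 3 appears on both papers but is counted once, so Researcher 1 ends up with 3 collaborators, as stated above.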
The results of this measurement can be seen in Table 3 and Fig. 1. From the
box and whisker plot, it can be seen that the mean number of collaborators
of men is greater than that for women in the Biology, Computer Science, and
Electrical Engineering departments at Cal Poly. In the Math
department, however, women tend to have more collaborators than men. This is
notable because 10 out of the 30 labeled Cal Poly researchers from the Math
department are women, so they represent a substantial portion of the population.
where $d_i$ is the number of researcher i's distinct collaborators and $n_{ij}$ is the
number of publications written between researcher i and researcher j. They
found that women had a 9.4% greater average strength of ties than men.
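The formula for the strength of ties is not reproduced here, so the following sketch rests on an assumption: that the strength of ties of researcher i is the mean number of joint publications per distinct collaborator, i.e. the sum of n_ij over j divided by d_i. The function name and input format are hypothetical.

```python
from collections import Counter
from itertools import combinations


def strength_of_ties(papers):
    """Mean number of joint papers per distinct collaborator (assumed definition)."""
    n = Counter()  # n[(i, j)] = number of papers coauthored by i and j
    for authors in papers:
        for i, j in combinations(sorted(set(authors)), 2):
            n[(i, j)] += 1
    total, d = Counter(), Counter()
    for (i, j), count in n.items():
        for researcher in (i, j):
            total[researcher] += count  # running sum of n_ij for this researcher
            d[researcher] += 1          # number of distinct collaborators d_i
    return {r: total[r] / d[r] for r in d}


papers = [
    ["Researcher 1", "Researcher 2", "Researcher 3"],
    ["Researcher 1", "Researcher 3", "Researcher 4"],
]
strength = strength_of_ties(papers)
```

Under this reading, Researcher 1 has 4 joint publications spread over 3 collaborators, giving a strength of 4/3, while Researcher 2 has a strength of 1.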
In order to test the presence of this pattern in the network, we measured
the strength of ties for women and men in each of the departments of interest.
The measurements can be seen in Fig. 2. The percent differences between the
men’s and women’s average strength of ties in each department can be seen in
Table 4. Since the average strength of ties for women is greater than that for men
in the Computer Science and Math departments, the claim that women tend to
collaborate with the same collaborators is supported in those departments. In the
Math department, it is important to note that there is an outlier for the strength
of ties of the women, suggesting that this percent difference between the strength
of ties between the men and women in Table 4 may be exaggerated. However,
the opposite is true for the Biology and Electrical Engineering departments: the
men tend to repeatedly collaborate with the same collaborators.
Figure 3 shows the average WGRs for each department. It can be seen that
in all departments except Electrical Engineering, the WGR for women is higher
than that for men. This points toward gender homophily in these three depart-
ments since women show a higher ratio of co-authorship experiences with other
women than men. The Electrical Engineering department shows almost equal
WGR between men and women researchers, suggesting that men and women
collaborate equally often with women researchers.
By α Metric. Evidence for gender homophily was also found to be true for
researchers in the life sciences by Holman and Morandin in a study of publica-
tions from PubMed [8]. To measure gender homophily, Holman and Morandin
calculated the coefficient of homophily for a collaboration network, α = p − q
Table 5. α by department
While the measurements of α for the Computer Science department seem
to agree with the findings of Holman and Morandin, the measurements for the
Biology, Electrical Engineering, and Math departments are low, suggesting little
gender homophily. This may be attributed to the researchers labeled “unknown”
or the limited number of researchers in each department. For this calculation,
the α value was calculated for each department, while Holman and Morandin
calculated this value over 3308 journals to obtain this trend.
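As an illustration of how such a coefficient can be computed on a single department network, the sketch below assumes that p is the fraction of men's coauthorship ties that are with men and q is the fraction of women's coauthorship ties that are with men. These definitions of p and q are an assumption, since they are not spelled out here, and the function name and input format are hypothetical. Ties involving a researcher labeled "unknown" are skipped.

```python
def homophily_alpha(edges, gender):
    """alpha = p - q under the assumed definitions of p and q given in the lead-in."""
    men_ties = men_with_men = women_ties = women_with_men = 0
    for a, b in edges:
        if gender.get(a) not in ("male", "female"):
            continue
        if gender.get(b) not in ("male", "female"):
            continue
        # Count each tie once from the point of view of each endpoint.
        for ego, alter in ((a, b), (b, a)):
            if gender[ego] == "male":
                men_ties += 1
                men_with_men += gender[alter] == "male"
            else:
                women_ties += 1
                women_with_men += gender[alter] == "male"
    if men_ties == 0 or women_ties == 0:
        return None  # alpha is undefined without ties for both groups
    return men_with_men / men_ties - women_with_men / women_ties


gender = {"a": "male", "b": "male", "c": "female", "d": "female", "e": "unknown"}
segregated = homophily_alpha([("a", "b"), ("c", "d"), ("a", "e")], gender)
mixed = homophily_alpha([("a", "c"), ("b", "d")], gender)
```

With this convention, a fully same-gender network gives α = 1 and a fully mixed one gives α = −1, so values near zero indicate little homophily.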
This study is ongoing, with the goal of identifying departments with healthy
and successful collaboration networks for researchers of all genders. Our goal is
to examine collaboration trends and understand how to promote equity and
academic collaboration for all.
When interpreting these analyses, it is important to understand the context
of these studies. These were conducted at Cal Poly, a public university in San
Luis Obispo, CA. It is not a Research 1 university, meaning that while research
is conducted at Cal Poly, the primary focus is on undergraduate education. This
limits the amount of data that can be retrieved for each of the departments of
interest because not all faculty members are required to participate in research,
meaning the research output is more limited than it would be at a more research-focused
institution. Nevertheless, from the findings, it can be seen that there is
likely a difference in the collaboration patterns of researchers based on their
gender. This difference seems to depend on the department being examined, but
it does support the need for more research in this area.
A limitation of this study is that we are unable to take into account additional
properties of each researcher, such as academic seniority, prestige of previous
institutions, and demographic information. These factors could also have a strong
influence on the claims discussed.
For many of the claims investigated in this paper, the Math department was
an outlier to the other departments studied. Further study must be conducted
to investigate the reasons behind these differences and whether they persist for
researchers in the field of mathematics across the country or globally.
For future work we intend to investigate the collaboration patterns of the
Mathematics department at Cal Poly and Mathematics departments in general.
Also, we intend to develop a framework where Cal Poly researchers can self-identify
their gender. This would allow for a more accurate picture of collaborative
patterns with respect to gender.
References
1. About IEEE Xplore. IEEE Xplore Digital Library
2. Glossary of terms - transgender. GLAAD
3. Mathscinet. American Mathematical Society
4. Project academic knowledge. Microsoft (2020)
5. The researcher journey through a gender lens. Elsevier (2020)
6. Araujo, E., Araujo, N., Moreira, A., Herrmann, H., Andrade, J.: Gender differences
in scientific collaborations: women are more egalitarian than men. PLoS ONE 12,
05 (2017)
7. D’Angelo, C.A., Abramo, G., Murgia, G.: Gender differences in research collabo-
ration. J. Informetr. 7, 811–822 (2013)
8. Holman, L., Morandin, C.: Researchers collaborate with same-gendered colleagues
more often than expected across the life sciences. PLOS ONE 14, e0216128 (2019)
9. Goyal, S., Ductor, L., Prummer, A.: Gender & collaboration. Cambridge Working
Papers in Economics, Faculty of Economics, University of Cambridge (2018)
10. McNichols, L., Medina-Kim, G., Nguyen, V.L., Rapp, C., Migler, T.: Gender’s
influence on academic collaboration in a university-wide network. In: Complex
Networks and Their Applications VIII, pp. 94–104. Springer, Cham (2020)
11. Rossiter, M.W.: The Matthew Matilda effect in science. Soc. Stud. Sci. 23(2),
325–341 (1993)
12. Santamarı́a, L., Mihaljevic, H.: Comparison and benchmark of name-to-gender
inference services. PeerJ Comput. Sci. 4, e156 (2018)
13. Reich, J.A., Reich, S.M.: Cultural competence in interdisciplinary collaborations: a
method for respecting diversity in research partnerships. Am. J. Commun. Psychol.
38, 51–62 (2006)
Uncovering the Image Structure
of Japanese TV Commercials Through
a Co-occurrence Network Representation
1 Introduction
Companies invest significant amounts of money in TV commercials to secure the
purchase intentions of customers [1,2]. Experiments and empirical data analyses
have been conducted to evaluate the actual effect of TV commercial advertise-
ments on customer purchase intentions and behaviours. Psychological experi-
ments have shown a phenomenon in which people tend to prefer more familiar
objects, which is known as the mere-exposure effect, which supports the efficacy
of TV commercial advertisements [3]. However, it has also been suggested that
the mere-exposure effect can be reduced when the subjects are aware of the
persuasive intent [3,4]. Furthermore, studies have indicated that most viewers
regard TV advertisements as intrusive, thereby reducing their effect [1,4]. By
contrast, for certain categories of TV advertisements, e.g., tourism and foods
targeting children, studies have shown that certain images positively affect the
purchase intention of the viewers [2,5].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 943, pp. 273–283, 2021.
https://doi.org/10.1007/978-3-030-65347-7_23
274 M. I. Ito and T. Ohnishi
2 Methods
The dataset analysed in this study was provided by M Data Co., Ltd. (https://
mdata.tv/en/). This dataset includes information on TV commercials aired on
five TV stations in Japan, i.e. Fuji TV, Nippon TV, TBS TV, TV Asahi, and
TV Tokyo, during the period of January 2017 to June 2020. A total of 1,682,171,
1,672,549, 1,657,578, and 804,828 TV commercials were recorded in 2017, 2018,
2019 and 2020, respectively. Here, commercials that have the same content but
have gone to air at a different time are counted as different advertisements. In
the dataset, the scenario of each TV commercial is represented by keywords,
e.g., ‘nursery’, ‘pick up’, and ‘mother and child’. These keywords are from a
commercial advertising Japanese dumplings and describe a situation in which a
mother picks up her child from a nursery and ends up buying Japanese dumplings
for dinner. Each TV commercial is classified according to the type of the product
advertised. There are 38 categories in the larger classification, and all categories
are further divided into a total of 131 sub-categories (Table 1).
We constructed a co-occurrence network of TV commercials, G. First, for
each sub-category, we constructed an unweighted network of keywords, where each
node represents a keyword and an edge exists between two nodes if the two keywords
appear together in at least one TV commercial (Fig. 1). Subsequently,
we merged these networks to construct a weighted network, G. Network G consists
of all nodes that appear in any of the sub-category networks. The
weight of the edge between two nodes is the number of sub-category networks
in which that edge exists. Network G has
74,106 nodes and 1,659,254 edges. Let A be the weight matrix of G.
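The two-step construction can be sketched as follows, under the assumption that each commercial is available as a list of keywords grouped by sub-category. The function names and the toy sub-category labels are hypothetical and do not come from the dataset.

```python
from collections import Counter
from itertools import combinations


def subcategory_edges(commercials):
    """Unweighted keyword network of one sub-category, returned as a set of edges."""
    edges = set()
    for keywords in commercials:
        # Two keywords are linked if they appear in the same commercial at least once.
        edges.update(combinations(sorted(set(keywords)), 2))
    return edges


def merge_networks(commercials_by_subcategory):
    """Weighted network G: edge weight = number of sub-category networks containing it."""
    weights = Counter()
    for commercials in commercials_by_subcategory.values():
        weights.update(subcategory_edges(commercials))
    return weights


# Toy input with two hypothetical sub-categories.
data = {
    "dumplings": [["nursery", "pick up", "mother and child"]],
    "toys": [["nursery", "mother and child"],
             ["nursery", "mother and child", "logo"]],
}
G = merge_networks(data)
```

Because each sub-category contributes a set of edges, a pair of keywords that co-occurs in many commercials of the same sub-category still adds only 1 to the weight.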
Image Structure of TV Commercials 275
The strength of node i, $s_i = \sum_j A_{i,j}$, is a measure of the importance of node
i in a weighted network [10,11]. Nodes with high strength can be regarded as the
core of the image structure of the TV commercials because they co-occur with
various keywords across a variety of sub-categories of products. We examined
the strength of the nodes and the strength distribution in G.
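As a small illustration, strength and degree can be read off an edge-weight dictionary as below. The dictionary representation and the function name are assumptions for this sketch, not the authors' implementation.

```python
from collections import defaultdict


def strength_and_degree(weights):
    """Return (strength, degree) per node from a dict {(u, v): weight} of undirected edges."""
    strength = defaultdict(int)
    degree = defaultdict(int)
    for (u, v), w in weights.items():
        for node in (u, v):
            strength[node] += w  # s_i = sum of the weights of the edges incident to i
            degree[node] += 1    # number of distinct neighbours of i
    return dict(strength), dict(degree)


# Toy weighted network on three keywords.
weights = {("A", "B"): 2, ("A", "C"): 1, ("B", "C"): 3}
s, k = strength_and_degree(weights)
```

Here all three nodes have degree 2, but B has the highest strength because its edges recur in more sub-category networks.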
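[Fig. 1: Construction of the weighted network G. The unweighted keyword networks of sub-categories 1 to 131 (grouped into categories 1 to 38), drawn with example nodes A, B, C, and D, are merged into the weighted network G.]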
3 Results
Table 2. Fifteen nodes with the highest strength. The degree of each node is also
shown.
which represents the extent to which node i is related to category k. For each
community l and each category k, we calculate the sum of $n_{i,k}$ over the nodes that
belong to community l:

$$W_{l,k} = \sum_{i \in \text{community } l} n_{i,k}. \qquad (4)$$

Let $w_{l,k}$ be the normalisation of $W_{l,k}$: $w_{l,k} = W_{l,k} / \sum_j W_{l,j}$. Therefore, $w_{l,k}$
represents the extent to which community l is composed of nodes associated with
category k. The value of $w_{l,k}$ for each community and each category is shown in
Fig. 3 using a heatmap.
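The aggregation and normalisation can be sketched as follows. The node-level values n_{i,k} are taken as given inputs here, and the function name, the nested-dictionary layout, and the toy numbers are all hypothetical.

```python
from collections import defaultdict


def community_category_profile(n, community):
    """Compute w[l][k] = W[l][k] / sum_j W[l][j], where W[l][k] sums n[i][k] over nodes i in l."""
    W = defaultdict(lambda: defaultdict(float))
    for node, per_category in n.items():
        l = community[node]
        for k, value in per_category.items():
            W[l][k] += value
    w = {}
    for l, row in W.items():
        total = sum(row.values())
        # Each row is normalised so that it sums to 1 over the categories.
        w[l] = {k: value / total for k, value in row.items()} if total else {}
    return w


# Toy values of n[i][k] and a toy community assignment.
n = {
    "product": {"Detergent": 3, "Cosmetics": 1},
    "logo": {"Detergent": 1, "Cosmetics": 3},
    "woman": {"Cosmetics": 2},
}
community = {"product": 1, "logo": 1, "woman": 2}
w = community_category_profile(n, community)
```

Since every row sums to 1, comparing an entry with 1/38 (the uniform value over 38 categories) indicates whether a category is over- or under-represented in a community.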
Table 3. Nodes in each community. Ten nodes with the highest strength are shown,
and the node with the highest strength in each community is highlighted in
bold. The star beside a node indicates that its within-module degree z exceeds 2.5.
Fig. 3. Features of each community. The values of wl,k are shown by a heatmap for
community l (the vertical axis) and category k (the horizontal axis). The average of
wl,k is 1/38 because there are 38 categories, and red (blue) in the heatmap indicates
that the value of wl,k is greater (less) than the average.
image, which are represented by ‘women’, ‘turn around’, ‘look back’, ‘room’, and
‘cafe’, among others.
As another example of the relationship between communities and categories,
community 1 is related to the categories Detergent, Draft beer, Household goods,
Cosmetics, and Medicine, and Detergent and Draft beer, in particular, are salient
only in community 1. In community 1, there are nodes associated with the prod-
uct itself and its description, e.g., ‘product’, ‘product in hand’, and ‘description’.
Therefore, it seems that community 1, at least partially, possesses an image of
persuasion. Thus, products in the categories associated with community 1, Deter-
gent, Draft beer, Household goods, Cosmetics, and Medicine, can be advertised
directly without concealing the intended persuasion.
4 Discussion
mutually connected with significant strength, i.e., that are significantly associ-
ated with each other by frequently co-occurring in TV commercials across the
categories of products. The cores of the image structure, ‘woman’, ‘product’,
‘logo’, ‘man’, ‘white back’, and ‘cinema scope’ were assigned to the different
communities. These can be representative keywords of these communities.
The literature has shown that community detection based on modularity
optimisation has a shortcoming called the resolution limit: communities of
relatively small size cannot be detected [12].
This problem comes from the assumption of the null model incorporated into
modularity. The null model considers a random network of the same size as the
focal network, i.e., it assumes that a node can interact with all other
nodes. Small communities can be obtained by narrowing the range
of possible interactions for a node in the null model [12,17].
This narrowing may be suitable for social networks, considering
people’s limited capacity to interact with others. By contrast, there seems to be no
plausible reason to narrow the range of connection when we are interested in
the overall picture of the image in TV commercials. In addition, in
the co-occurrence network of TV commercials, small communities
may largely reflect the contents of the commercials of a single product, because a
product is sometimes advertised by a series of commercials that share a similar
story and contents. Thus, we think that a partition of the co-occurrence network
by modularity optimisation is preferable for revealing the image structure.
Each community seems to have a specific image; in community 2, for example,
nodes are associated with storylines. As mentioned in Sect. 3, some nodes with
significant strength in community 1 evoke the persuasive intent of the advertisers.
The literature has suggested that the conspicuous placement of products
or apparent persuasion in an advertisement can induce a defensive mindset in
consumers [3,4]. Therefore, it is possible that TV commercials in the
categories of Detergent, Draft beer, Household goods, Cosmetics, and Medicine,
which are particularly related to community 1, do not positively affect the purchase
intention of the viewers.
Our results suggest that the effect of TV commercials should be evaluated by
considering the image structure of such advertisements, which are composed of
multiple subsets (communities), each of which has a specific image. For example,
we may have to classify TV commercials according to their image before eval-
uating their effect. This will lead to an understanding of which type of image
can strongly influence the effect of TV commercials on the purchase intention or
behaviour of the viewers.
We analysed TV commercials aired on five TV stations in Japan during the period
of January 2017 to June 2020.
It would be interesting to study whether the nature of the image structure shown
in this paper is universal across countries or unique to Japan. For example,
community 6, which includes ‘man’, ‘smartphone’, ‘laugh’, ‘talk’, and ‘office’
among others, is associated in particular with the categories of Online shopping,
Credit cards, PC & A/V, Canned coffee, and Tobacco. Whether such a mutual
association of keywords in community 6 can be observed and whether these
References
1. Carreón, E.C.A., Nonaka, H., Hentona, A., Yamashiro, H.: Measuring the influence
of mere exposure effect of TV commercial adverts on purchase behavior based on
machine learning prediction models. Inf. Process. Manag. 56(4), 1339–1355 (2019)
2. Boyland, E.J., Halford, J.C.: Television advertising and branding. Effects on eating
behaviour and food preferences in children. Appetite 62, 236–241 (2013)
3. Hekkert, P., Thurgood, C., Whitfield, T.A.: The mere exposure effect for consumer
products as a consequence of existing familiarity and controlled exposure. Acta
Psychol. 144(2), 411–417 (2013)
4. Davtyan, D., Cunningham, I.: An investigation of brand placement effects on brand
attitudes and purchase intentions: brand placements versus TV commercials. J.
Bus. Res. 70, 160–167 (2017)
5. Pan, S.: The role of TV commercial visuals in forming memorable and impressive
destination images. J. Travel Res. 50(2), 171–185 (2011)
6. Radhakrishnan, S., Erbis, S., Isaacs, J.A., Kamarthi, S.: Novel keyword co-
occurrence network-based methods to foster systematic reviews of scientific lit-
erature. PLoS ONE 12(3), e0172778 (2017)
7. Su, H.N., Lee, P.C.: Mapping knowledge structure by keyword co-occurrence: a
first look at journal papers in technology foresight. Scientometrics 85(1), 65–79
(2010)
8. Zhang, J., Xie, J., Hou, W., Tu, X., Xu, J., Song, F., Wang, Z., Lu, Z.: Map-
ping the knowledge structure of research on patient adherence: knowledge domain
visualization based co-word analysis and social network analysis. PLoS ONE 7(4),
e34497 (2012)
9. Özgür, A., Cetin, B., Bingol, H.: Co-occurrence network of Reuters news. Int. J.
Mod. Phys. C 19(05), 689–702 (2008)
10. Lü, L., Chen, D., Ren, X.L., Zhang, Q.M., Zhang, Y.C., Zhou, T.: Vital nodes
identification in complex networks. Phys. Rep. 650, 1–63 (2016)
11. Barrat, A., Barthelemy, M., Pastor-Satorras, R., Vespignani, A.: The architecture
of complex weighted networks. Proc. Natl. Acad. Sci. U.S.A. 101(11), 3747–3752
(2004)
12. Fortunato, S.: Community detection in graphs. Phys. Rep. 486(3–5), 75–174 (2010)
13. Blondel, V.D., Guillaume, J.L., Lambiotte, R., Lefebvre, E.: Fast unfolding of
communities in large networks. J. Stat. Mech.: Theory Exp. 2008(10), P10008
(2008)
14. Aynaud, T.: python-louvain 0.14: Louvain algorithm for community detection
(2020). https://github.com/taynaud/python-louvain
15. Guimera, R., Amaral, L.A.N.: Functional cartography of complex metabolic net-
works. Nature 433(7028), 895–900 (2005)
16. Rubinov, M., Sporns, O.: Complex network measures of brain connectivity: uses
and interpretations. Neuroimage 52(3), 1059–1069 (2010)
17. Lambiotte, R., Delvenne, J.C., Barahona, M.: Laplacian dynamics and multiscale
modular structure in networks. arXiv preprint arXiv:0812.1770 (2008)
Movie Script Similarity Using Multilayer
Network Portrait Divergence
1 Introduction
Network models have been increasingly used to support the analysis of sto-
ries [11], such as novels and plays [15,27], famous TV series [25], news [22,23],
and movies [16,17,19]. However, most of these models only focus on one facet of
the movie story. Indeed, usually, they capture interactions between the charac-
ters at play [11] to bring out a global picture of the story content. Other works
went beyond by introducing other semantic elements such as scenes and dia-
logues [25]. Nevertheless, they have always captured the information in a single
layer network or a bipartite graph. To enrich the representation, we have pre-
viously proposed a multilayer network model that captures key elements of the
movie story [16,17]. It encompasses the single network analysis based either on
characters or scenes and proposes new topological analysis tools.
Recently, various studies have been conducted to measure the similarity of
movie stories [13,14], focusing on character interactions. This work aims at
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 943, pp. 284–295, 2021.
https://doi.org/10.1007/978-3-030-65347-7_24
Movie Script Similarity 285
2 Background
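$$P_B(k \mid l) = \frac{1}{N} B_{l,k} = \frac{k B_{l,k}}{\sum_c n_c^2} \qquad (3)$$

$$KL\big(P(k \mid l) \,\|\, P'(k \mid l)\big) = \sum_{l=0}^{\max(d,d')} \sum_{k=0}^{N} P(k \mid l) \log\left(\frac{P(k \mid l)}{P'(k \mid l)}\right) \qquad (6)$$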
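To make these quantities concrete, the sketch below computes a network portrait and a Jensen-Shannon divergence between two small graphs. It rests on several assumptions that are not stated in the surviving text: that B_{l,k} counts the nodes having exactly k nodes at distance l, that the compared distributions weight each entry by k B_{l,k} (normalised to sum to 1), and that the divergence is the base-2 Jensen-Shannon combination of two Kullback-Leibler terms. All function names are hypothetical, and the graph must be non-empty.

```python
import math
from collections import Counter, deque


def portrait(adj):
    """B[(l, k)] = number of nodes with exactly k nodes at distance l (assumed definition)."""
    per_node = []
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from this source
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        per_node.append(Counter(dist.values()))
    diameter = max(max(counts) for counts in per_node)
    B = Counter()
    for counts in per_node:
        for l in range(diameter + 1):
            B[(l, counts.get(l, 0))] += 1
    return B


def weighted_distribution(B):
    """Distribution over (l, k) proportional to k * B[(l, k)]."""
    total = sum(k * b for (l, k), b in B.items())
    return {(l, k): k * b / total for (l, k), b in B.items() if k * b > 0}


def kl(p, q):
    """Kullback-Leibler divergence in bits; q must be positive wherever p is."""
    return sum(pv * math.log2(pv / q[key]) for key, pv in p.items())


def js_divergence(p, q):
    """Jensen-Shannon divergence, bounded between 0 and 1 with base-2 logarithms."""
    keys = set(p) | set(q)
    m = {key: 0.5 * (p.get(key, 0.0) + q.get(key, 0.0)) for key in keys}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


path = {0: [1], 1: [0, 2], 2: [1]}
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
B_path = portrait(path)
d_same = js_divergence(weighted_distribution(B_path), weighted_distribution(B_path))
d_diff = js_divergence(weighted_distribution(B_path),
                       weighted_distribution(portrait(triangle)))
```

Unlike the raw Kullback-Leibler term, the Jensen-Shannon combination is symmetric and stays finite when one portrait has entries that the other lacks, which is why it is used in this sketch.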
3 Experimental Evaluation
Experiments are performed using the movie scripts of six episodes of the Star
Wars saga¹. The saga is divided into the sequel (original) and prequel trilogies. The
sequel trilogy was the first created by George Lucas. It is composed of Episode
IV (SW4): A New Hope (1977), Episode V (SW5): The Empire Strikes Back
(1980) and Episode VI (SW6): Return of the Jedi (1983). It was followed by the
prequel trilogy composed of Episode I (SW1): The Phantom Menace (1999),
Episode II (SW2): Attack of the Clones (2002), and Episode III (SW3): Revenge
of the Sith (2005). The experimental process proceeds as follows. First, the multilayer
network of each episode of the saga is extracted. The basic topological
properties of these networks are summarised in Table 1. Then, the portrait of each
layer (character, location, and keyword) of each multilayer network is computed
individually and discussed. Finally, each pair of same-type
layers across all movies is compared using the portrait divergence.

¹ Here is a quick summary of the plot: The saga follows Anakin Skywalker, a young
child freed from slavery to become a Jedi and endeavor to save the galaxy. Anakin,
instructed by the Jedi Masters of the light side, married the senator Padme. Unfortunately,
the Sith (Palpatine) submits him to the dark side, and he rebels and loses against
his Master (Obi-Wan). Anakin is saved by the Sith, now ruling over the galaxy, and
transformed into Darth Vader. Padme died while giving birth to twins Luke and Leia.
Luke becomes a farmer while Leia becomes a princess. Nineteen years later, Obi-Wan
met Luke and taught him the Jedi way, while receiving a distress call from the
princess Leia, leading the resistance against Palpatine. Joining smuggler Han Solo
in the Millennium Falcon they went to save her, and support the resistance. Luke
completes his Jedi training, while Solo gets captured by the Sith, who crushes most
of the resistance. Vader tries to turn Luke to the dark side when discovering that
Luke is his son. Unsuccessful, Palpatine tries to kill Luke, awaking in Vader his old
self. Vader turns back against Palpatine and rescues the galaxy.
288 M. Lafhel et al.
Table 1. Global properties of the character, keyword and location layers for each
movie, with: number of nodes (N ), number of edges (E), diameter (D), transitivity
(T ), and assortativity (A).
Comparing all Three Layers: A quick look at all three layers (characters
in Fig. 2, keywords in Fig. 3, and locations in Fig. 4) clearly shows that each
type of layer displays a distinct, recognizable pattern. Character layers
display a k between 15 and 30, for a maximum path length between 3 and 5.
Keyword layers do not display a much larger path length (up to 6), but the number
of nodes is much larger. Despite this larger number of nodes, most nodes are
distributed in the lower path corner with degree 1 and 3, which may suggest a
lot of small components. The character and keyword layers both display clear
characteristics common to small-world and scale-free networks [3] in their shape,
such as the elongated form and the knot near the center of the portrait. This may be less
obvious for the character layer, but one should remember that character graphs
are rather small. Location layers diverge the most from small-world patterns
and seem to vary greatly between trilogies, while being minimalist in the prequel
series. This suggests that location networks are probably closer to linear loops in
the prequel, while they show more complexity in the original trilogy.
Comparing Character Layers: Each character layer portrait is illustrated in
Fig. 2. Overall, the six portraits display a very similar pattern. SW1 (Fig. 2(a))
and SW2 (Fig. 2(b)), whose similarity is surprising given that this layer shows double
the number of edges (Table 1), together with SW5 (Fig. 2(e)) and SW6
(Fig. 2(f)), make two similar pairs, with SW3 (Fig. 2(c)) and SW4 (Fig. 2(d))
being somewhat intermediary between them. In SW5 and SW6, a lot of action
separates the main characters into different groups with parallel actions, while
the first series is rather focused on the main character, Anakin. The aspect ratio
of the character layer portraits seems to vary between the movies. This
has been linked to the small-world characteristics of the graph that each portrait
summarizes [3]. We can notice a progression from SW1 to SW6, suggesting a
“smaller” world effect for episodes 5 and 6.

Fig. 2. Portraits of the character layer for each movie of the Star Wars saga. (a) SW1,
(b) SW2, (c) SW3, (d) SW4, (e) SW5, (f) SW6. The horizontal axis is the node degree
k. The vertical axis is the distance l. Colors are the entries of the portrait matrix $B_{l,k}$.
The white color indicates $B_{l,k} = 0$.
Fig. 3. Portraits of the keyword layers for each movie of the Star Wars saga. (a) SW1,
(b) SW2, (c) SW3, (d) SW4, (e) SW5, (f) SW6. The horizontal axis is the node degree
k. The vertical axis is the distance l. Colors are the entries of the portrait matrix $B_{l,k}$.
The white color indicates $B_{l,k} = 0$.

Fig. 4. Portraits of the locations layer for each movie of the Star Wars saga. (a) SW1,
(b) SW2, (c) SW3, (d) SW4, (e) SW5, (f) SW6. The horizontal axis is the node degree
k. The vertical axis is the distance l. Colors are the entries of the portrait matrix $B_{l,k}$.
The white color indicates $B_{l,k} = 0$.
number of edges in Table 1). The rhythm is also different in the original trilogy,
which often separates its main characters such that actions happen in parallel.
Frequent cuts between them generate transitions that are less linear. This trend
seems to be less adopted after the fourth episode.
Fig. 5. Portrait divergence of the characters, keywords, and locations networks in the 6
episodes of the Star Wars saga. (a) Character layers. (b) Keyword layers. (c) Location
layers. Each cell of the heat-map represents the portrait divergence between a pair
of episodes.
lower divergence with the sequel: the clones being instrumental in the original
series may help shaping a common structure of relationships. This may also con-
firm the efforts taken in the third episode of the prequel to link characters with
the original series.
In the original trilogy, SW5 and SW6 show the lowest divergence
($D_{JS} = 0.132$). Indeed, individual inspection shows that many of the links are
shared by both episodes. SW4 diverges the most from the other movies
in the trilogy. This makes sense since the movie was made so that it could
stand alone at first. It shows its closest relationship with SW3
($D_{JS} = 0.134$), once more underlining the special care taken to connect the prequel
to the sequel from the point of view of character relationships. SW1 and SW6
show the highest divergence. This is not surprising since at this point the stories
of the prequel and original trilogies are completely different in their characters
and plots.
Divergence Between Keyword Layers: The keyword layers show the highest
divergence among the different movies (Fig. 5(b)). In both trilogies, the
divergence is lower for one pair of movies: SW2 and SW3 in the prequel
($D_{JS} = 0.29$), and SW4 and SW5 in the sequel. SW1 shows its lowest divergence
with SW4 ($D_{JS} = 0.4$) and SW6 ($D_{JS} = 0.33$). It remains difficult to conclude
what in the structure of the movies leads to a similar relationship structure
in keywords among those three. We suspect that the frequent association of the key
terms young and master in these movies plays an important role.
Divergence Between Location Layers: In Fig. 5(c), the portrait
divergence shows very high similarity within the structure of locations of the
prequel trilogy. The original series also shows a low divergence between its movies.
We observe a noticeable difference between the prequel trilogy and SW4, specifically
between SW3 and SW4, with $D_{JS} = 0.585$. From SW4 onward, many scenes
take place in the Millennium Falcon and the Death Stars, giving a specific
rhythm in cuts and locations. Nonetheless, the divergence is rather low between the
prequel and the last two episodes of the series. Remembering that all information
is extracted from the scripts, we may suspect here the different influences the
movie director may have had. Indeed, SW4 was written to possibly be a standalone
movie, while the following SW5 and SW6 scripts were written within a short time
span and planned to complete the trilogy. This applies even more to the prequel
trilogy, which was planned from the beginning to be a unified trilogy.
In this work, a multilayer network model is used to represent the main elements
of a movie script: characters, key elements of the conversations (keywords), and
locations of the scenes. Investigations based on the single-layer components of
the model are performed to relate the similarity of the 6 Star Wars Saga episodes
to the distance between the layers based on portrait divergence.
Preliminary results using portrait and portrait divergence measures are quite
encouraging. Indeed, the analysis of the movie networks confirms a good corre-
spondence between characters of the prequel and sequel trilogies; and a differ-
ence between locations in the prequel and sequel trilogy. Each of the prequel
and sequel trilogies shows a good relationship between characters; and high sim-
ilarity between locations. Results show similarity between topic relationships
(keywords) in the movies SW2 and SW3; and also of SW1 with SW4 and SW6.
Otherwise, other episodes appear to be dissimilar, mainly, the relation connecting
keywords of SW5 with SW2 and SW3. Although more experiments are needed to
fully assess that movie similarities can be discovered by measuring their network
distances, this work opens multiple research directions.
Note that the current results are only based on the sole script of each movie.
The script of SW4 was written in the late 70’s, the two following movies in the
80’s and the prequel in the late 90’s/2000’s. How movies are scripted has
definitely changed over this span, which may affect our results. To go even
further, we can enrich the model. Previous works have also formulated multimedia cues for
movie analysis [17] including its visual features often important for recommen-
dation systems [8,12,20,28]. We may further enrich the model, using for example
sentiment analysis. To further evaluate how the network similarity performs, we
wish to further proceed with user evaluation, and script-based topical [5] and
style [10] similarity.
Our processing scheme is currently exploiting layers of the multilayer network
as separated single-layer networks. One interesting task would be to compare
with distance measures of multilayer networks. Since the Kullback-Leibler
divergence is not a metric and does not respect the triangle inequality, we
have found it irrelevant, although tempting, to average the divergence over all
layers and measure overall similarity between movies. Alternative strategies could be to
project the whole multilayer network into a single-layer knowledge graph, but
such projection remains unsatisfying. We further need to investigate pure mul-
tilayer approaches and include in the comparison the effect of transition layers.
We can try embedding-based comparisons [1], and investigate proper multilayer
metrics such as entanglement [24] or nodes and edge coupling [6] between movies.
This however requires to align named entities between layers and movies, such
that they refer to the same entity. This is not straightforward since we have
at least two entities with an ambiguous definition which are Anakin/Vader and
Amidala/Padme. Finally, we believe this methodology to be useful once incor-
porated into recommendation systems to increase their efficiency [21] since we
provide a different definition of a genre based on actual elements of the movie
content [7].
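The remark above about divergences that are not metrics can be illustrated with a minimal sketch. It compares the Kullback-Leibler divergence with the Jensen-Shannon divergence on two toy distributions; the distributions `p` and `q` are made-up values, not data from the movie networks, and the function names are illustrative only.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits, built from two KL terms against the mixture."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical toy distributions (not taken from the paper).
p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]

kl_pq, kl_qp = kl_divergence(p, q), kl_divergence(q, p)
js_pq, js_qp = js_divergence(p, q), js_divergence(q, p)
print("KL(p||q) =", kl_pq, " KL(q||p) =", kl_qp)   # the two values differ
print("JS(p,q)  =", js_pq, " JS(q,p)  =", js_qp)   # the two values agree
```

The sketch only shows the asymmetry of KL and the symmetry and boundedness of JS; it does not by itself justify averaging divergences across layers.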
References
1. Bagavathi, A., Krishnan, S.: Multi-net: scalable multilayer network embeddings.
arXiv preprint arXiv:1805.10172 (2018)
2. Bagrow, J.P., Bollt, E.M.: An information-theoretic, all-scales approach to com-
paring networks. Appl. Netw. Sci. 4(1), 45 (2019)
3. Bagrow, J.P., Bollt, E.M., Skufca, J.D., Ben-Avraham, D.: Portraits of complex
networks. EPL (Europhys. Lett.) 81(6), 68004 (2008)
4. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn.
Res. 3(Jan), 993–1022 (2003)
5. Bougiatiotis, K., Giannakopoulos, T.: Content representation and similarity of
movies based on topic extraction from subtitles. In: Proceedings of the 9th Hellenic
Conference on Artificial Intelligence, pp. 1–7 (2016)
6. Cozzo, E., Kivelä, M., De Domenico, M., Solé-Ribalta, A., Arenas, A., Gómez,
S., Porter, M.A., Moreno, Y.: Structure of triadic relations in multiplex networks.
New J. Phys. 17(7), 073029 (2015)
7. Deldjoo, Y., Schedl, M., Elahi, M.: Movie genome recommender: a novel recom-
mender system based on multimedia content. In: 2019 International Conference on
Content-Based Multimedia Indexing (CBMI), pp. 1–4. IEEE (2019)
8. Demirkesen, C., Cherifi, H.: A comparison of multiclass SVM methods for real
world natural scenes. In: International Conference on Advanced Concepts for Intel-
ligent Vision Systems, pp. 752–763. Springer, Heidelberg (2008)
9. Grabska-Gradzińska, I., Kulig, A., Kwapień, J., Drożdż, S.: Complex network anal-
ysis of literary and scientific texts. Int. J. Mod. Phys. C 23(07), 1250051 (2012)
10. Kovacs, B., Kleinbaum, A.M.: Language-style similarity and social networks. Psy-
chol. Sci. 31(2), 202–213 (2020)
11. Labatut, V., Bost, X.: Extraction and analysis of fictional character networks: a
survey. ACM Comput. Surv. (CSUR) 52(5), 1–40 (2019)
12. Lasfar, A., Mouline, S., Aboutajdine, D., Cherifi, H.: Content-based retrieval in
fractal coded image databases. In: Proceedings 15th International Conference on
Pattern Recognition, ICPR-2000, vol. 1, pp. 1031–1034. IEEE (2000)
13. Lee, O.J., Jo, N., Jung, J.J.: Measuring character-based story similarity by ana-
lyzing movie scripts. In: Text2Story@ ECIR, pp. 41–45 (2018)
14. Lee, O.J., Jung, J.J.: Explainable movie recommendation systems by using story-
based similarity. In: IUI Workshops (2018)
15. Markovič, R., Gosak, M., Perc, M., Marhl, M., Grubelnik, V.: Applying network
theory to fables: complexity in Slovene Belles-Lettres for different age groups. J.
Complex Netw. 7(1), 114–127 (2018)
16. Mourchid, Y., Renoust, B., Cherifi, H., El Hassouni, M.: Multilayer network model
of movie script. In: International Conference on Complex Networks and their Appli-
cations, pp. 782–796. Springer (2018)
17. Mourchid, Y., Renoust, B., Roupin, O., Văn, L., Cherifi, H., El Hassouni, M.:
Movienet: a movie multilayer network model using visual and textual semantic
cues. Appl. Netw. Sci. 4(1), 1–37 (2019)
18. Nadeau, D., Sekine, S.: A survey of named entity recognition and classification.
Lingvisticae Investigationes 30(1), 3–26 (2007)
19. Park, S.B., Oh, K.J., Jo, G.S.: Social network analysis in a movie using character-
net. Multimed. Tools Appl. 59(2), 601–627 (2012)
20. Pastrana-Vidal, R., Gicquel, J., Blin, J., Cherifi, H.: Predicting subjective video
quality from separated spatial and temporal assessment. In: Human Vision and
Electronic Imaging XI, vol. 6057, p. 60570S. SPIE (2006)
21. Reddy, S., Nalluri, S., Kunisetti, S., Ashok, S., Venkatesh, B.: Content-based movie
recommendation system using genre correlation. In: Smart Intelligent Computing
and Applications, pp. 391–397. Springer (2019)
22. Renoust, B., Kobayashi, T., Ngo, T.D., Le, D.D., Satoh, S.: When face-tracking
meets social networks: a story of politics in news videos. Appl. Netw. Sci. 1(1), 4
(2016)
23. Renoust, B., Le, D.D., Satoh, S.: Visual analytics of political networks from face-
tracking of news video. IEEE Trans. Multimed. 18(11), 2184–2195 (2016)
24. Škrlj, B., Renoust, B.: Patterns of multiplex layer entanglement across real and
synthetic networks. In: International Conference on Complex Networks and Their
Applications, pp. 671–683. Springer (2019)
25. Tan, M.S., Ujum, E.A., Ratnavelu, K.: A character network study of two sci-fi tv
series. In: AIP Conference Proceedings, vol. 1588, pp. 246–251. AIP (2014)
26. Tantardini, M., Ieva, F., Tajoli, L., Piccardi, C.: Comparing methods for compar-
ing networks. Sci. Rep. 9(1), 17557 (2019). https://doi.org/10.1038/s41598-019-53708-y
27. Waumans, M.C., Nicodème, T., Bersini, H.: Topology analysis of social networks
extracted from literature. PLoS ONE 10(6), e0126470 (2015)
28. Zhou, H., Hermans, T., Karandikar, A.V., Rehg, J.M.: Movie genre classification
via scene categorization. In: Proceedings of the 18th ACM International Conference
on Multimedia, pp. 747–750 (2010)
Interaction of Structure and Information on Tor
{Kadariya.2,Derek.Doran}@wright.edu
Abstract. Tor is the most popular dark network in the world. It provides anony-
mous communications using unique application layer protocols and authorization
schemes. Noble uses of Tor, including as a platform for censorship circumvention,
free speech, and information dissemination make it an important socio-technical
system. Past studies on Tor present exclusive investigation over its information or
structure. However, activities in socio-technical systems, including Tor, need to
be driven by considering both structure and information. This work attempts to
address the present gap in our understanding of Tor by scrutinizing the interac-
tion between structural identity of Tor domains and their type of information. We
conduct a micro-level investigation on the neighborhood structure of Tor domains
using struc2vec and classify the extracted structural identities by hierarchical clus-
tering. Our findings reveal that the structural identity of Tor services can be cat-
egorized into eight distinct groups. One group belongs to only Dream market
services where neighborhood structure is almost fully connected and thus, robust
against node removal or targeted attack. Domains with different types of services
form the other clusters based on if they have links to Dream market or to the
domains with low/high out-degree centrality. Results indicate that the structural
identity created by linking to services with significant out-degree centrality is the
dominant structural identity for Tor services.
1 Introduction
Tor is an important socio-technical system which is used as a tool for internet censorship
circumvention, releasing information to the public, sensitive communication between
parties, and as a private space to trade goods and services [1]. Our current
understanding of Tor is limited because existing studies examine either its structure
or its information, exclusively and narrowly. However, activities in socio-technical systems,
and hence Tor, are driven by both structure and information. Thus, there is still a gap in
our understanding of the interplay between the structure and information on Tor.
Structural identity is a concept of categorizing network nodes based on the structure
of relationships they have with others [2]. Assuming that the structural identity of Tor
domains denotes their neighborhood structure independent of their service type or their
location in the network, there are open questions whether the structural identity of Tor
domains has any relationship with the type of services they provide and if there is
any dominant structural identity on Tor. Answering these questions will give us hints on
micro-level connections each domain has in the network. It will also reveal how different
the Tor domains are in their structural identity and what makes this difference. If there is
any relation between the service type and the neighborhood structure of dark domains,
such an insight can be useful in predicting links between a new domain with the others
based on their service type. Scrutinizing the dense or sparse patterns in relationships
among domains also leads to predicting the proportions of the Tor network which have
vulnerability against node removal or targeted attack.
To this end, we conduct a micro-level investigation on the neighborhood structure of
Tor domains using struc2vec and classify the extracted structural identities by hierarchi-
cal clustering to study any relationship between domains’ structural identity and the type
of service they provide. Our findings reveal that the structural identity of Tor services
can be categorized into eight distinct groups. One group belongs to only Dream market
services where neighborhood structure is almost fully connected and thus, robust against
node removal or targeted attack. This insight helps track and trace moving vendors of the
Dream market domains based on the linking patterns in the neighborhood structure of
their services. Domains with different types of services form the other clusters based on
if they have links to Dream market or to the domains with low/high out-degree centrality.
Results indicate that the structural identity created by linking to services with significant
out-degree centrality is the dominant structural identity for Tor services.
This paper is organized as follows: Sect. 2 discusses the related work on inves-
tigating Tor. Section 3 explains how to extract the structural identity of Tor domains
and presents the evaluation results and analyses over service types of domains. Finally,
Sect. 4 summarizes the main conclusions and discusses the future work.
2 Related Work
Previous research on Tor can be categorized into two classes: (1) work which has focused
on topological properties of Tor network; and (2) studies on characterizing different types
of information and services hosted on Tor.
The topological properties of Tor, at physical and logical levels, are only beginning to
be studied. O’Keeffe et al. analyzed the hyperlink structure of Tor services and compared
it with the structure of the World Wide Web. Their comparison described the dark Web
as a set of largely isolated dark silos, which can reveal different social behavior of
dark Web users [3]. Sanchez-Rola et al. conducted a broader structural analysis over
7,257 Tor domains [4]. They find a surprising relation between Tor and the surface Web:
there are more links from Tor domains to the surface Web than to other Tor domains.
Bernaschi et al. presented a characterization study on topology of Tor network graph
and investigated the persistence of hidden services and their hyperlinks [5]. All analyses
are conducted over three different snapshots of Tor captured during a five-month period.
They also compared Tor with other social networks and surface Web graphs using well-
known metrics. In another similar work [6], Bernaschi et al. investigated measurements
to evaluate and characterize Tor hidden services data and topology of their network. They
provided a critical discussion on possible data collection techniques for dark Web and
conducted analyses on the relationship between Tor English content and its topology.
In one of our previous works [7], we presented a broad evaluation of the network of
referencing from Tor to surface Web and investigated to what extent Tor hidden services
are vulnerable against the information leakage caused by linking to the surface websites.
Towards understanding types of content on Tor, Dolliver et al. used geo-visualizations
and exploratory spatial data analyses to analyze distributions of drugs and substances
advertised on the Agora Tor marketplace [8]. Results demonstrate that drugs with Euro-
pean sources are randomly distributed and six countries, with Canada and the United
States at the top, have the major portion of drug dealing around the world. Chen et al.
sought an understanding of terrorist activities by a method incorporating information
collection, analysis, and visualization techniques from 39 Jihad Tor sites [9]. An expert
evaluation on the proposed method indicates its high performance in investigating ter-
rorist activities on the dark Web. Mörch et al. analyze the nature and accessibility of
information related to suicide [10] by investigating the search results of nine popular
search engines on Tor. Experiments show that, in comparison with the surface Web,
searching “suicide” and “suicide method” on Tor yields a much smaller number of
sites providing suicide-related content. Biryukov et al. investigated the content and
popularity of Tor hidden services by scanning their descriptors for open ports and looking at
their request rate [1]. The results indicate that the content of over four fifths of Tor hidden
services is in English and nearly half of them are devoted to drugs, adult, counterfeit, and
weapon topics.
Now, we present the analyses conducted on structural identity of Tor domains. First,
we describe how to extract the structural identity of Tor domains using the struc2vec
algorithm. Then, we present the evaluation results of conducting hierarchical clustering
over the embedding vectors which represent the structural identity of Tor domains in
our data.
In the experiments, we utilize the data collected and labeled in our previous work [11]
which is the product of crawling over 1 million pages from 20,000 Tor seed addresses,
yielding a collection of 7,782 English pages coming from 1,766 unique Tor domains.
Figure 1 presents the distribution of Tor domains based on their service types. It illustrates
that directory and shopping domains and the Dream market dominate by accounting for
58.83% of all domain types in our data. In contrast, domains of Forum, Email, and News
sites account for 23.66% of all the domains and Gambling, Bitcoin, and Multimedia
domains constitute the smallest proportion of the Tor domains in our data. Table 1
presents a summary description of the Tor domain types.
The main purpose of this work is to specify the structural identity of Tor domains based on
their network structure. This identity reveals latent similarities among domains regardless
of the services they provide (as vertices’ labels) and their position in the Tor network.
To this end, we employ the struc2vec algorithm to learn the latent representations for
the structural identity of Tor domains. Struc2vec [2] is a technique to learn a language
model from a network as vector embedding for vertices. The basic steps of struc2vec
are as follows:
– Compute structural similarity between all node pairs: this similarity should be
independent of the node or edge attribute, node position in the network, and even the
network connectivity. The latent representations of nodes should also be strongly
correlated with their structural similarity. Hence, the more similar the network structures of two
nodes, the closer their latent representations. To do so, this step uses a hierarchical
similarity metric to capture more information on the structural similarity of nodes for
different neighborhood sizes.
Suppose that $G = (V, E)$ denotes a graph with node set $V$ ($n = |V|$)
and edge set $E$. Let $d^*$ denote the network diameter, and let $S_d(u)$, $d \ge 0$, denote the
set of nodes at distance $d$ from vertex $u$. Also, for $S \subset V$, $os(S)$ denotes the ordered
degree sequence of the vertices in $S$. The elements of $os(S)$ are integers in the range $[1, n - 1]$,
and repetition is possible. To impose a hierarchy on the measurement of latent similarities,
the ordered degree sequences of each pair of nodes are compared. If
$f_d(u, v)$ denotes the structural distance between $u$ and $v$ based on their neighbors up to $d$ hops
away (including all the edges among them), it is defined as follows:

$$f_d(u, v) = f_{d-1}(u, v) + g\big(os(S_d(u)),\, os(S_d(v))\big), \quad d \ge 0 \qquad (1)$$

where $g(os(S_d(u)), os(S_d(v)))$ indicates the distance between the two ordered degree
sequences $os(S_d(u))$ and $os(S_d(v))$, and $f_{-1}(u, v) = 0$. It is worth mentioning that $f_d(u, v)$ is only
defined when both vertices have nodes at distance $d$. Moreover, $f_{d-1}(u, v) = 0$ if the
$d$-hop neighborhoods of both vertices are isomorphic and map the two vertices onto each other.
both vertices onto each other. Since Sd (u) and Sd (v) can have different sizes, to compute
their distance, Dynamic Time Warping (DTW) is used to loosely compare the patterns
in sequences with different sizes [2]. To do so, DTW matches the elements of the sequences
so that the sum of distances between matched pairs of elements is minimized. Assuming
$e_i \in os(S_d(u))$ and $e_j \in os(S_d(v))$, their distance $dist(e_i, e_j)$ is computed as follows:

$$dist(e_i, e_j) = \frac{\max(e_i, e_j)}{\min(e_i, e_j)} - 1 \qquad (2)$$
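As a rough illustration of Eqs. (1) and (2), the following sketch computes the hierarchical structural distance on a toy graph. It is not the authors' implementation: the graph is a made-up undirected path (Tor hyperlink graphs are directed), the helper names are hypothetical, and every node is assumed to have degree at least 1 so that the ratio in Eq. (2) is defined.

```python
from collections import deque

def element_cost(a, b):
    """Eq. (2): max(a, b) / min(a, b) - 1 for two positive degrees."""
    return max(a, b) / min(a, b) - 1.0

def dtw(seq_a, seq_b):
    """Classic dynamic time warping cost between two non-empty sequences."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = element_cost(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def rings(adj, source):
    """Breadth-first search grouping nodes by hop distance: result[d] plays the role of S_d(source)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    grouped = {}
    for node, d in dist.items():
        grouped.setdefault(d, []).append(node)
    return grouped

def structural_distance(adj, u, v, max_d):
    """Return [f_0(u, v), ..., f_d(u, v)], stopping once either node has no ring at distance d."""
    ring_u, ring_v = rings(adj, u), rings(adj, v)
    total, values = 0.0, []
    for d in range(max_d + 1):
        if d not in ring_u or d not in ring_v:
            break  # f_d is only defined when both nodes have neighbors at distance d
        seq_u = sorted(len(adj[x]) for x in ring_u[d])  # ordered degree sequence
        seq_v = sorted(len(adj[x]) for x in ring_v[d])
        total += dtw(seq_u, seq_v)                      # Eq. (1): f_d = f_{d-1} + g(...)
        values.append(total)
    return values

# Hypothetical toy graph: the undirected path a - b - c - d - e.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
print(structural_distance(adj, "a", "e", 4))  # the two end points are structurally identical
print(structural_distance(adj, "a", "c", 4))  # an end point versus the middle node
```

The two end points of the path play the same structural role, so their distance stays at zero for every depth, whereas the end point and the middle node diverge from depth 0 onward.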
Based on the definition, edges corresponding to the vertices with high similarity to
a vertex will have large weights across all the layers. To connect the layers, each vertex
in layer d is attached to its corresponding vertex in the layers d − 1 and d + 1 using
directed edges. The weights of such edges are defined as below:
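$$wt(E_{u_d, u_{d+1}}) = \log(\Gamma_d(u) + e), \quad d \in \{0, \dots, d^* - 1\} \qquad (4)$$
$$wt(E_{u_d, u_{d-1}}) = 1, \quad d \in \{1, \dots, d^*\}$$

$$\Gamma_d(u) = \sum_{v \in V} I\big(wt(E_{u_d, v_d}) > \overline{wt}_d\big) \qquad (5)$$

$$Pr_d(u, v) = \frac{e^{-f_d(u, v)}}{\sum_{k \in V,\, k \neq u} e^{-f_d(u, k)}} \qquad (6)$$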
According to the definition of Prd (u, v), the probability of moving to nodes which
are structurally more similar to u will be higher than the probability of moving to nodes
with small similarity.
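A minimal sketch of the within-layer step probabilities of Eq. (6) follows, assuming the structural distances from a node u to its candidates in one layer are already available. The dictionary of distances is hypothetical example data and the function name is illustrative.

```python
import math

def same_layer_probabilities(distances):
    """Map {candidate: f_d(u, candidate)} to Pr_d(u, candidate) by normalising exp(-f_d)."""
    weights = {v: math.exp(-f) for v, f in distances.items()}
    normaliser = sum(weights.values())
    return {v: w / normaliser for v, w in weights.items()}

# Hypothetical distances from u to three other nodes in the same layer.
probs = same_layer_probabilities({"v1": 0.0, "v2": 1.0, "v3": 3.0})
for node, p in probs.items():
    print(node, round(p, 3))
```

Smaller structural distance yields a larger weight, so the walk prefers nodes that resemble u structurally, as the paragraph above states.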
Given the probability of stepping to another layer, 1 − Prs , the random walk will
move to the layers d + 1 or d − 1 based on the following probabilities:
$$Pr_d(u_d, u_{d+1}) = \frac{wt(E_{u_d, u_{d+1}})}{wt(E_{u_d, u_{d+1}}) + wt(E_{u_d, u_{d-1}})}$$
$$Pr_d(u_d, u_{d-1}) = 1 - Pr_d(u_d, u_{d+1}) \qquad (7)$$
Every vertex that the random walk moves to in a layer will be added to the context.
This process of random walking is started for each node from layer 0 and repeated for
a certain number of times.
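The layer-switching rule of Eqs. (4), (5), and (7) can be sketched as below. This is only an illustration under stated assumptions: the edge weights and the layer average are made-up numbers, the function name is hypothetical, and the `at_bottom` flag (no downward edge in layer 0) is an assumption about how the boundary is handled.

```python
import math

def layer_switch_probabilities(edge_weights, layer_mean_weight, at_bottom=False):
    """Return (probability of moving up to layer d + 1, probability of moving down to layer d - 1)."""
    # Eq. (5): number of u's edges in this layer that are heavier than the layer average.
    gamma = sum(1 for w in edge_weights if w > layer_mean_weight)
    w_up = math.log(gamma + math.e)        # Eq. (4), upward edge weight
    w_down = 0.0 if at_bottom else 1.0     # Eq. (4), downward edge weight (absent in layer 0)
    p_up = w_up / (w_up + w_down)          # Eq. (7)
    return p_up, 1.0 - p_up

# Hypothetical weights: a node with no above-average edges and one with three.
p_up_sparse, p_down_sparse = layer_switch_probabilities([0.1, 0.2], layer_mean_weight=0.5)
p_up_dense, p_down_dense = layer_switch_probabilities([0.9, 0.8, 0.7], layer_mean_weight=0.5)
print(p_up_sparse, p_up_dense)
```

A node with many above-average edges in the current layer has many structurally similar peers there, so the walk is pushed upward to a layer that uses a larger neighborhood to discriminate between them.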
– Learn the language model: this step generates the language models of node
sequences created in the previous step using Skip-Gram [12]. The main purpose of
Skip-Gram is to maximize the likelihood of each vertex’s context in the corresponding
sequence.
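To make the notion of a vertex's context concrete, the sketch below lists the (center, context) pairs that a Skip-Gram model would be trained on for a single walk. It does not train any model; the walk, the window size, and the function name are hypothetical.

```python
def skipgram_pairs(walk, window):
    """List every (center, context) pair within `window` positions of each other in a walk."""
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, walk[j]))
    return pairs

# Hypothetical random walk over four nodes.
walk = ["a", "b", "c", "d"]
pairs = skipgram_pairs(walk, window=1)
print(pairs)
```

Nodes that repeatedly share contexts across walks end up with nearby embedding vectors, which is what allows the later clustering step to group structurally similar domains.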
We classify the set of vectors using hierarchical clustering to see whether vectors that belong
to domains with similar service types fall into the same clusters. To ensure that the
clustering method is able to find small clusters, we utilize the agglomerative hierarchical
clustering algorithm (AGNES), which takes a bottom-up approach to generating clusters
and is able to capture clusters of small size [13]. To define the similarity between two
clusters, we employ Ward’s minimum variance method [14], which minimizes the
final within-cluster variance by merging the pair of clusters that have the minimum sum of
squared Euclidean distances.
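A naive sketch of bottom-up clustering with Ward's criterion is given below, purely as an illustration. It is a quadratic-time toy in plain Python, not the AGNES implementation used in the experiments, and the two-dimensional points stand in for embedding vectors.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    dim = len(points[0])
    return [sum(p[k] for p in points) / len(points) for k in range(dim)]

def ward_cost(a, b):
    """Increase in within-cluster sum of squares caused by merging clusters a and b."""
    ca, cb = centroid(a), centroid(b)
    squared = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * squared

def agglomerate(points, n_clusters):
    """Start with singletons and repeatedly merge the pair with the smallest Ward cost."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = ward_cost(clusters[i], clusters[j])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical 2-D points standing in for embedding vectors.
points = [(0, 0), (0, 1), (10, 10), (10, 11), (50, 50)]
clusters = agglomerate(points, n_clusters=3)
print(clusters)
```

Note that the isolated point survives as its own cluster, which matches the motivation given above for preferring a bottom-up method when small clusters matter.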
The reason for choosing hierarchical clustering is that it avoids the problem of
choosing the number of clusters before running the algorithm [15]. The dendrogram, which
captures the results of hierarchical clustering, provides a visualization of the clusters
at different granularities. This visualization can help determine the number of clusters
with no need to rerun the algorithm. Hierarchical clustering also copes with
more intricate cluster shapes, in contrast to methods such as k-means or mixture
models, which make more restrictive assumptions on the data [15].
In this work to avoid any bias towards our a priori knowledge, we leverage average
Silhouette width [16] to specify the number of clusters. This metric is a graphical aid to
evaluate clustering validity based on comparison of similarity among clusters. In other
words, it measures how cohesive a sample is with the other samples in its cluster. Ranging in
[−1, +1], Silhouette width equal to +1 for a sample denotes that it is perfectly matched
to its own cluster and poorly matched to others. The higher the average Silhouette width,
the more appropriate the clustering configuration. Figure 2 illustrates the values of this
metric for different number of clusters from 1 to 20. As indicated, the average Silhouette
width for eight clusters has the maximum value (0.54). Hence, in the following analysis
we consider eight clusters for the original data.
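The average Silhouette width can be sketched in a few lines. This is an illustrative version only: it assumes at least two clusters, Euclidean distance, and not all points coincident, and it assigns a width of 0 to singleton clusters by convention; the toy points and the function name are hypothetical.

```python
import math

def average_silhouette(clusters):
    """Mean silhouette width over all points; `clusters` is a list of lists of coordinate tuples."""
    widths = []
    for idx, cluster in enumerate(clusters):
        for p in cluster:
            if len(cluster) == 1:
                widths.append(0.0)  # convention for singleton clusters
                continue
            # a: mean distance to the other members of the same cluster (self-distance is 0).
            a = sum(math.dist(p, q) for q in cluster) / (len(cluster) - 1)
            # b: smallest mean distance to the members of any other cluster.
            b = min(
                sum(math.dist(p, q) for q in other) / len(other)
                for k, other in enumerate(clusters)
                if k != idx
            )
            widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)

# Hypothetical toy data: a sensible grouping and a deliberately poor one.
good = [[(0, 0), (0, 1)], [(10, 10), (10, 11)]]
bad = [[(0, 0), (10, 10)], [(0, 1), (10, 11)]]
score_good = average_silhouette(good)
score_bad = average_silhouette(bad)
print(score_good, score_bad)
```

Selecting the number of clusters then amounts to cutting the dendrogram at each candidate count and keeping the cut with the highest average width.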
Figure 3 indicates the dendrogram of the hierarchies found in the data. With regard to
cut-off equal to 8 as the number of clusters, first five clusters have a hierarchy separated
from clusters 6, 7, and 8 with distance (dissimilarity) larger than 40. The distance between
clusters 1 and 2 is less than 30, which is lower than the distance between other neighboring
clusters. Clusters 1, 2, and 7 have the largest size among the others while clusters 3, 5,
and 8 have the smallest size. To gain better insight into the clustering results, we further
investigate the type of services for domains in each cluster (Fig. 4)¹.
As Fig. 4 illustrates, one group belongs to Dream market only, while the others
contain different types of services. As reported in [11], the Dream market domains have a
tendency towards isolation, which separates them from the other domains in the
Tor network. This is consistent with our finding that the Dream market
services have a distinct structural identity. It also implies that, with high
probability, a new Dream market service contains hyperlinks to other Dream market domains
which are active on the dark Web. Manual exploration over the cluster 6 reveals that
neighborhood structure of the Dream market services is almost fully connected, which
makes their intra-connectivity robust against node removal or targeted attack. This is
also in accordance with our previous finding in [11] which reported the high modu-
larity and dense intra-connectivity of the Dream marketplaces. Figure 5 denotes some
neighborhood structures of these services in the data. The vertices with maximum size
indicate the domains whose neighborhood structure is extracted.
Further investigation over samples in clusters 1 to 5 shows that, on average, more
than 80% of samples in these clusters belong to domains that have no link to Dream
market in their 1st- and 2nd-degree neighborhood. Directories and/or shopping domains
are the majority in clusters 1 to 4. Manual investigation within these clusters indicates
that their samples belong to domains which have links to domains with high out-degree
centrality. To gain better insight into such clusters, we consider basic statistics of
out-degree centrality, presented in Table 2, and its CDF plot, illustrated in Fig. 6.

¹ As mentioned before, samples within a cluster represent the structural identity of the Tor domains,
and the labels of samples indicate the service type provided by the related domain.

Table 2. Basic statistics of out-degree centrality

Statistic        Value
Min              0.000
Max              407.000
Mean             6.486
Median           2.000
First quartile   1.000
Third quartile   5.000

Fig. 6. CDF plot of out-degree centrality
As Table 2 shows, the distribution of out-degree centralities is right-skewed,
since the mean is larger than the median and the median is closer to the first quartile
than to the third. The CDF plot also shows that the probability of values larger than
the average drops notably, from 1 to below 0.1. In fact, only 19% (249 domains)
of Tor domains studied in this work have out-degree centralities larger than the average.
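The skewness argument above (mean above median, median closer to the first quartile) can be checked mechanically. The sketch below does so on a made-up list of out-degrees, not on the Tor data; the function name is hypothetical, and quartiles follow the default method of Python's `statistics.quantiles`.

```python
import statistics

def describe_out_degrees(values):
    """Summarise a sample and apply the two right-skew checks used in the text."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    mean = statistics.fmean(values)
    share_above_mean = sum(1 for v in values if v > mean) / len(values)
    right_skewed = mean > median and (median - q1) < (q3 - median)
    return {
        "mean": mean,
        "median": median,
        "q1": q1,
        "q3": q3,
        "share_above_mean": share_above_mean,
        "right_skewed": right_skewed,
    }

# Hypothetical out-degree sample with one hub-like value.
out_degrees = [0, 1, 1, 1, 2, 2, 2, 3, 5, 5, 8, 40]
summary = describe_out_degrees(out_degrees)
print(summary)
```

Applied to the values in Table 2, the same two checks hold: the mean (6.486) exceeds the median (2.000), and the median lies 1 above the first quartile but 3 below the third.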
The domains belonging to the clusters 1 to 4 have direct links to the domains with
out-degree centralities greater than the average and this linking makes their 2nd -degree
neighborhood large. On the other hand, gambling is the major population of domains in
cluster 5. Investigation shows that the out-degree centrality of 73% of the domains in
cluster 5 is lower than the average (Table 2) and they link to other domains which also have
out-degree centralities lower than this value. Therefore, their 1st-degree neighborhood
is small, sparse, and thus, vulnerable against node removal.
Regarding the cluster sizes, we observe that the first four clusters contain 60% of the
Tor domains studied in this work. As mentioned, these clusters belong to domains which
have links to services with high out-degree centrality. The 1st -degree neighborhood of
the domains can be small or large depending on their out-degree centrality. Figure 7 illus-
trates some examples of 1st -degree neighborhood structure of such domains (specified
with larger size).
Figure 8 demonstrates some examples of the 2nd -degree neighborhood structure of
the domains in clusters 1 to 4. As illustrated, due to linking to services with high
out-degree centrality, the 2nd-degree neighborhood graph is large, dense, and hence robust
against node removal. This implies that, in spite of the sparse nature of Tor domains
reported in [4, 11], the structural identity of more than half of them indicates a 2nd-degree
neighborhood which is robust against node removal.

Fig. 7. Examples of 1st-degree neighborhood structure in clusters 1 to 4 (Figs. 7, 8, and 9
are best viewed digitally and in color.)

Fig. 9. (a) Direct linking to Dream markets; (b) indirect linking to Dream markets
Further investigation over the samples in clusters 7 and 8 reveals that in contrast to the
first five clusters, 88.4% of domains in cluster 7 have direct links to Dream market while
92% of domains in the cluster 8 have Dream market in their 2nd -degree neighborhood
(indirect linking). This explains the reason of having two distinct hierarchies in Fig. 3
between clusters 1 to 5 and clusters 6 to 8. Therefore, not only does Dream market have its
own structural identity, but its presence in the 1st- or 2nd-degree neighborhood of domains
also creates different structural identities for the Tor domains. As Fig. 9(a) indicates, two
directories have direct links to Dream market, which directly affects their 1st-degree
neighborhood. Based on Fig. 4, directory has the maximum size in cluster 7, which is
in accordance with the report in [11] that the directories have the largest
number of hyperlinks to Dream market. On the other hand, Fig. 9(b) shows one shopping
domain (represented by the largest vertex) which has indirect linking to the Dream market
domains. Regarding the number of shopping domains in different clusters, our analysis
indicates that although shopping domains have the major population in cluster 8 (25%),
this portion belongs to only 7% of all the shopping services in our data. This indicates
that there is only a small number of shopping domains which have Dream market in their
2nd-degree neighborhood. However, this implies that Dream market, as the competitor
of the dark shopping services, is only two hops away from them, which can make it more
difficult for the shopping service owners to attract and keep their customers’ attention.
References
1. Biryukov, A., Pustogarov, I., Thill, F.: Content and popularity analysis of tor hidden services.
In: 34th International Conference on Distributed Computing Systems Workshops (2014)
2. Ribeiro, L.F., Saverese, P.H., Figueiredo, D.R.: struc2vec: learning node representations from
structural identity. In: The 23rd ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining (2017)
3. O’Keeffe, K.P., Griffith, V., Xu, Y., Santi, P., Ratti, C.: The darkweb: a social network anomaly.
arXiv preprint arXiv:2005.14023 (2020)
4. Sanchez-Rola, I., Balzarotti, D., Santos, I.: The onions have eyes: a comprehensive structure
and privacy analysis of tor hidden services. In: The 26th International Conference on World
Wide Web (2017)
5. Bernaschi, M., Celestini, A., Guarino, S.: Spiders like onions: on the network of tor hidden
services. In: The World Wide Web Conference (2019)
6. Bernaschi, M., Celestini, A., Guarino, S., Lombardi, F.: Exploring and analyzing the Tor
hidden services graph. ACM Trans. Web 11(4), 1–26 (2017)
7. Zabihimayvan, M., Doran, D.: A first look at references from the dark to surface web world.
arXiv preprint arXiv:1911.07814 (2019)
8. Dolliver, D.S., Ericson, S.P., Love, K.L.: A geographic analysis of drug trafficking patterns
on the tor network. Geogr. Rev. 108(1), 45–68 (2018)
9. Chen, H., Chung, W., Qin, J., Reid, E., Sageman, M., Weimann, G.: Uncovering the dark
web: a case study of jihad on the web. J. Am. Soc. Inform. Sci. Technol. 59(8), 1347–1359
(2008)
10. Mörch, C.-M., Louis-Philippe, C., Corthésy-Blondin, L., Plourde-Léveillé, L., Dargis, L.,
Mishara, B.L.: The darknet and suicide. J. Affect. Disord. 241, 127–132 (2018)
11. Zabihimayvan, M., Sadeghi, R., Doran, D., Allahyari, M.: A broad evaluation of the tor english
content ecosystem. In: The 10th ACM Conference on Web Science (2019)
12. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of
words and phrases and their compositionality. In: Advances in Neural Information Processing
Systems (2013)
13. Rokach, L., Maimon, O.: Clustering methods. In: Data Mining and Knowledge Discovery
Handbook (2005)
14. Ward Jr., J.H.: Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc.
58(301), 236–244 (1963)
15. Murtagh, F.: A survey of recent advances in hierarchical clustering algorithms. Comput. J.
26(4), 354–359 (1983)
16. Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation of cluster
analysis. J. Comput. Appl. Math. 20, 53–65 (1987)
17. Flake, G.W., Tarjan, R.E., Tsioutsiouliklis, K.: Graph clustering and minimum cut trees.
Internet Math. 1(4), 385–408 (2004)
Classifying Sleeping Beauties and Princes
Using Citation Rarity
1 Introduction
Innovative scientific works often go unnoticed for long periods. This
phenomenon is known as “delayed recognition” [1–3]. New discoveries and theories
are important for scientific progress; however, they are often initially
resisted or neglected because the scientific community is skeptical about them
[4,5]. Further, information explosion prevents important ideas from penetrating
the wall of established wisdom related to a subject. Mechanisms underlying
delayed recognition are always relevant to major scientific progress or
groundbreaking scientific revolutions. However, how this delayed recognition occurs
remains unknown.
The quantitative concept of delayed recognition, as proposed by Van Raan,
can be designated simply as a sleeping beauty (SB) phenomenon [6]. Although a
set of papers might go unnoticed for a long time, the same set is suddenly
noticed after a certain point in time. In addition to the original definition of SB
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 943, pp. 308–318, 2021.
https://doi.org/10.1007/978-3-030-65347-7_26
using depth, length, and waking up from sleep [6], several extended terms exist
for the extraction of various cases of SB papers [7,8].
Initially, SB was regarded as a rare phenomenon in scientific progress, but
recent research shows that it is far less exceptional than previously thought. In
fact, SBs carry a substantial amount of information related to scientific findings [8].
Every SB has its own prince (PR), which wakes it up and introduces it to the wider
research community by citing the SB document. In the original definition, the PR
is the first paper to cite the SB [6]. However, this definition is suitable only for
cases of “coma sleep,” i.e., cases wherein the SB attracted no citations while asleep [9].
The Internet makes it easy to access minor but related articles. Therefore, a
co-citation criterion is appropriate for finding a PR [10].
Many studies have positioned SBs and PRs in a specific field or category
[8,11]. Nevertheless, no systematic approach reported to date can find SB–PR
pairs comprehensively from articles, because a PR can discover an SB through so
many different patterns. In the computer science category specifically, SBs have
been found to contribute to some methodologies, and PRs have extended the models
and methodologies established by SBs to make them applicable in other
sub-fields [11]. Comprehensive analysis of SB–PR pair findings is essential
because it remains unknown whether citation distributions for different
sciences are similar.
Our research examines the classification of the various types of scientific
findings across scientific disciplines using SB and PR pairs in various fields.
The SB and PR pairs mark breakpoints of the scientific findings in the field
concerned. Comparing cases of delayed recognition reveals cross-disciplinary
similarity in how delayed recognition is resolved. This might be the first
report of a study analyzing the number of SB–PR papers and categorizing their types.
The driving hypothesis of this paper is that the cross-disciplinary relation
between SBs and PRs can be estimated through citation rarity calculated from
complex citation networks. For this study, we systematically clarified the
relation between SBs and PRs by categorizing them after large-scale acquisition
of SB and PR pairs. As a classification technique, we considered the rarity of
the citation of an SB by a PR, deduced on the basis of the inter-cluster
distance calculated on the complex network formed by the citations.
2 Results
2.1 Sleeping Beauties and Princes
There are various methods to identify SBs, such as an average-based approach
[6,12], a quartile-based approach [13,14], and a non-parametric approach. In
this research, we have used the “beauty coefficient,” a non-parametric
method proposed by Ke et al. [8], for extracting SBs and, subsequently, for
classifying the SB papers. This is because average-based and quartile-based
approaches are strongly affected by arbitrary parameters of citation thresholds, which depend
310 T. Miura et al.
on their categorical citation bias [15]. To focus on articles that have had
sufficient impact on the scientific community, we extracted the top 5% most-cited
papers from the Scopus comprehensive database. This set contains 3,392,918
papers, and the lowest citation count in it is 67. As shown below, we
calculated the beauty-coefficient score B for each paper.
\[
B = \sum_{t=0}^{t_m} \frac{\frac{c_{t_m} - c_0}{t_m} \cdot t + c_0 - c_t}{\max\{1, c_t\}} \tag{1}
\]
In the above equation, $c_t$ represents the number of citations that the paper
received in the $t$-th year after its publication, and $t_m$ represents the year in
which the paper received its maximum number of citations $c_{t_m}$.
Eq. (1) penalizes early citations: the later the citations are accumulated,
the higher the value of index B. We have defined the papers with the top 1% of B
scores as SB papers, which amounts to 33,939 papers.
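As an illustration, Eq. (1) can be evaluated directly from a paper's yearly citation counts. The following is a minimal sketch, not the authors' implementation; the function name `beauty_coefficient` and the list-based input are assumptions made here, and papers whose citations peak in the publication year are assigned B = 0 by convention in this sketch.

```python
def beauty_coefficient(citations):
    """Beauty coefficient B of Eq. (1).

    citations[t] is the number of citations received in year t after
    publication (t = 0 is the publication year). Hypothetical helper.
    """
    if not citations:
        raise ValueError("citations must contain at least the publication year")
    c0 = citations[0]
    # t_m: first year in which the yearly citation count is maximal
    t_m = max(range(len(citations)), key=lambda t: citations[t])
    c_tm = citations[t_m]
    if t_m == 0:
        # The reference line is undefined when the peak is at t = 0;
        # this sketch returns 0 in that case (an assumption).
        return 0.0
    slope = (c_tm - c0) / t_m
    # Sum the gap between the reference line and the actual citation curve
    return sum(
        (slope * t + c0 - citations[t]) / max(1, citations[t])
        for t in range(t_m + 1)
    )


# A paper ignored for four years and then cited heavily scores high,
# while a paper with steadily growing citations scores zero.
sleeper = beauty_coefficient([0, 0, 0, 0, 10])
steady = beauty_coefficient([0, 1, 2, 3, 4])
```

A long flat start followed by a late peak maximizes the area between the reference line and the citation curve, which is why late-blooming papers receive large B values.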
For each SB paper, the candidate for the PR paper is the one with the highest
number of co-citations among all the papers citing that SB. For the SB
papers, we have used Ke's awakening year [8], which describes the time of the
citation burst, as follows.
\[
t_a = \arg\max_{t \le t_m} d_t \tag{2}
\]
\[
d_t = \frac{\left| (c_{t_m} - c_0)\, t - t_m c_t + t_m c_0 \right|}{\sqrt{(c_{t_m} - c_0)^2 + t_m^2}} \tag{3}
\]
If the candidate paper was published within 5 years of $t_a$ (i.e., around the
awakening year of the SB paper), then it was defined as the PR paper of the SB.
This yielded 14,317 SB–PR pairs. Figure 1(a) presents the year-wise
distribution of SB and PR. By definition, the greater the time distance between
SB and PR, the larger the likely beauty coefficient. Therefore, most SBs are
papers published between 1970 and 1990. The gap-year distribution (Fig. 1(b))
shows that SBs are usually discovered after around 25 years.
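The awakening year of Eqs. (2) and (3) can be sketched in the same style. This is an illustrative sketch only: `awakening_time` and `within_pr_window` are hypothetical names, and the symmetric ±5-year window is one reading of "within 5 years" rather than a confirmed detail of the study. The denominator of Eq. (3) is constant in t, so it does not affect which year attains the maximum.

```python
import math


def awakening_time(citations):
    """Awakening time t_a of Eq. (2): the year t <= t_m that maximizes d_t."""
    c0 = citations[0]
    t_m = max(range(len(citations)), key=lambda t: citations[t])
    c_tm = citations[t_m]
    norm = math.sqrt((c_tm - c0) ** 2 + t_m ** 2)
    if norm == 0:
        # Peak in the publication year: no sleeping period (assumption)
        return 0

    def d(t):
        # Distance from the point (t, c_t) to the line through
        # (0, c_0) and (t_m, c_tm), as in Eq. (3)
        return abs((c_tm - c0) * t - t_m * citations[t] + t_m * c0) / norm

    return max(range(t_m + 1), key=d)


def within_pr_window(pr_year, sb_year, t_a, window=5):
    """True if a candidate PR appears within `window` years of the awakening year.

    The symmetric window is an assumption of this sketch.
    """
    return abs(pr_year - (sb_year + t_a)) <= window


t_a = awakening_time([0, 0, 0, 0, 10])
```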
In this section, we have defined the SB–PR pair density with respect to its
citation probability. We clustered the citation network of 67 million papers using
the Leiden algorithm [16]. Citation probability is defined on the basis of the
frequency of the edges between two clusters in the PR publication year. When
papers in the cluster containing a PR paper often cite the cluster that includes
the SB paper, an edge between the SB and the PR is not unusual.
Hence, the density in this case is high.
Fig. 1. (a) Annual distribution of SB and PR. (b) Gap year distribution of the SB
paper and the PR paper.
\[
D_{y,c_i,c_j} = \frac{A_{y,i,j}}{|c_i||c_j|} \quad (i \neq j) \tag{5}
\]
In the above equation, $A_{y,i,j}$ counts the papers in cluster i that were
published during the year y and cite papers in cluster j.
Further, $|c_i||c_j|$ represents the possible edges between cluster i and cluster j,
whereas $A_{y,i,j}$ represents the actual edges between the two clusters until year y.
When a PR published in the year $y_p$, and from the cluster $c_p$, cites the SB in
cluster $c_s$, the density of this SB–PR pair is $D_{y_p,c_p,c_s}$. The density of the pair
cannot be defined if the PR and SB are in the same cluster.
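A minimal sketch of how the density of Eq. (5) could be tabulated is given below. The data layout (an edge list of citing and cited paper identifiers, plus dictionaries `cluster_of` and `year_of`) is hypothetical, and two points are assumptions of this sketch rather than details confirmed by the text: edges are counted cumulatively until year y using the publication year of the citing paper, and cluster sizes are taken over all papers.

```python
from collections import Counter


def cluster_density(citations, cluster_of, year_of, y):
    """Density D[(c_i, c_j)] of directed inter-cluster citations until year y.

    citations  : iterable of (citing_paper, cited_paper) pairs
    cluster_of : dict mapping paper -> cluster label
    year_of    : dict mapping paper -> publication year
    """
    sizes = Counter(cluster_of.values())
    edge_counts = Counter()
    for citing, cited in citations:
        ci, cj = cluster_of[citing], cluster_of[cited]
        # Same-cluster pairs are skipped: the density is undefined for i == j
        if ci != cj and year_of[citing] <= y:
            edge_counts[(ci, cj)] += 1
    return {
        pair: count / (sizes[pair[0]] * sizes[pair[1]])
        for pair, count in edge_counts.items()
    }


# Toy example: cluster 0 = {a, b}, cluster 1 = {c, d, e}
cluster_of = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 1}
year_of = {"a": 2000, "b": 2005, "c": 1990, "d": 1995, "e": 2010}
citations = [("a", "c"), ("b", "c"), ("b", "d"), ("e", "a"), ("b", "a")]
density_2005 = cluster_density(citations, cluster_of, year_of, 2005)
```

A low value for the pair (PR cluster, SB cluster) in the PR publication year then marks the SB–PR citation as rare.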
In this research, we have used the first floor clustering of the entire
citation network obtained with the Leiden algorithm [16] as the label for the
papers. The purpose is to classify each paper into a unique category, as many
papers exist in multiple disciplines these days.
Table 1 shows examples of each cluster. The top clusters include more
than 8 million nodes, which is far too extensive to be considered a
single category. These may be covered under the basic concept of science. As we
have specifically examined the cross-disciplinary SB–PR pairs in this study, we
adopted the first floor clustering as the category to extract a more pointed cross
section of the field. A more detailed analysis of the sub-clustering categories is
left for future work.
pairs are internal findings. Figure 2 presents the density distribution of
cross-disciplinary SB–PR pairs. Compared with a random sample drawn from all
cross-disciplinary citations, the distribution of SB–PR pairs is skewed to the left. This
implies that SB–PR pairs include more rare collaborations than normal
cross-disciplinary citations.
The distribution has two peaks. The first peak represents rare collaborations
($D < 1.07 \times 10^{-3}$). Most cross-disciplinary PR papers “explore” unusual
categories of SB papers, thereby indicating that the PR broadens the possibilities of
the field. The second peak represents common collaborations ($D \ge 1.07 \times 10^{-3}$).
Even when similar papers are cited via common clusters, some PRs “rediscover”
an important concept of SB papers. We have classified the bottom 66% of
density under “exploring citations,” which are rare collaborations that had
transpired until that particular year. The other 33% are “rediscovering citations,” which
re-evaluate the importance of common pairs of knowledge.
ing, and environment related. Biology and chemistry, which are closely related,
demonstrate rediscovery of the core concepts of the mutual findings.
the successful papers from the failed papers, to be larger for exploring citations.
However, the variance did not differ on the basis of whether the cited works were
explored or rediscovered.
Furthermore, examination of key papers related to Nobel prize-winning
findings selected by Ioannidis et al. [18] revealed that very few Nobel prize
papers are among the cross-disciplinary SB and PR papers. We hypothesized that Nobel
prize papers broaden the horizon of a category and have an extremely
strong impact beyond what citations represent, so that some of them
might exist in SB–PR pairs. However, all SB–PR pairs together include only four
such SBs and four such PRs; cross-disciplinary pairs include only one SB. No
correlation was found between the impact of SB–PR papers and their citation
density. These results imply that surprising citations may not necessarily
result in useful findings for the scientific community. With increasing attention being focused on
the importance of cross-disciplinary research, the implications of the rarity of
citations in the network are expected to be a major challenge in the future.
3 Conclusion
In this study, we have classified the types of SB–PR pairs across scientific
disciplines in various fields. The relation of each pair is described on the basis of
the citation rarity of the clusters that its papers belong to. The pairs have been
broadly divided into two categories: the majority are exploring citations and the
minority are rediscovering citations. Rediscovering PRs contain more review articles than average.
They refer to the SB in the Introduction and Results sections, which cite
fundamentally important information about key findings. Meanwhile, the exploring
PRs cite the SB as an integral part of the Methodology section, where an
uncommon method is required to overcome the known challenges in the PR field. Furthermore,
materials science PRs, instead of being explored, are more likely to explore
various types of fields, such as rheology or structural chemistry. This indicates that
the field applies key findings obtained from other fields. However, biological
subjects, such as cancer or cell biology, exhibit rediscovery of important papers
through common clusters of SB–PR pairs.
This research contributes toward a better understanding of delayed
recognition across categories.
4 Data
References
1. Garfield, E.: Premature discovery or delayed recognition - why? Essays Inf. Sci. 4,
488–493 (1980)
2. Garfield, E.: Delayed recognition in scientific discovery: citation frequency analysis
aids the search for case histories. Curr. Contents 23, 3–9 (1989)
3. Garfield, E.: More delayed recognition. Part 2. From inhibin to scanning electron
microscopy. Essays Inf. Sci. 13, 68–74 (1990)
4. Campanario, J.M.: Rejecting and resisting Nobel class discoveries: accounts by
Nobel Laureates. Scientometrics 81(2), 549–565 (2009). https://doi.org/10.1007/
s11192-008-2141-5
5. Fang, H.: An explanation of resisted discoveries based on construal-level theory.
Sci. Eng. Ethics 21(1), 41–50 (2015). https://doi.org/10.1007/s11948-013-9512-x
6. van Raan, A.F.J.: Sleeping beauties in science. Scientometrics 59(3), 467–472
(2004). https://doi.org/10.1023/B:SCIE.0000018543.82441.f1
7. Mazloumian, A., Eom, Y.H., Helbing, D., Lozano, S., Fortunato, S.: How cita-
tion boosts promote scientific paradigm shifts and Nobel Prizes. PLOS ONE 6(5)
(2011). https://doi.org/10.1371/journal.pone.0018975
8. Ke, Q., Ferrara, E., Radicchi, F., Flammini, A.: Defining and identifying sleeping
beauties in science. Proc. Nat. Acad. Sci. U.S.A. 112(24), 7426–7431 (2015).
https://doi.org/10.1073/pnas.1424329112
9. van Raan, A.F.J.: Dormitory of physical and engineering sciences: sleeping beauties
may be sleeping innovations. PLOS ONE 10(10), e0139786 (2015). https://doi.
org/10.1371/journal.pone.0139786
10. Du, J., Wu, Y.: A bibliometric framework for identifying “princes” who wake up
the “sleeping beauty” in challenge-type scientific discoveries. J. Data Inf. Sci. 1(1),
50–68 (2016). https://doi.org/10.20309/jdis.201605
11. Dey, R., Roy, A., Chakraborty, T., Ghosh, S.: Sleeping beauties in computer science:
characterization and early identification. Scientometrics 113, 1645–1663 (2017).
https://doi.org/10.1007/s11192-017-2543-3
12. Glänzel, W., Schlemmer, B., Thijs, B.: Better late than never? On the chance to
become highly cited only beyond the standard bibliometric time horizon.
Scientometrics 58, 571–586 (2003). https://doi.org/10.1023/b:scie.0000006881.30700.ea
13. Costas, R., van Leeuwen, T.N., van Raan, A.F.J.: Is scientific literature subject to
a ‘Sell-By-Date’ ? A general methodology to analyze the ‘durability’ of scientific
documents. J. Am. Soc. Inf. Sci. Technol. 61, 329–339 (2010). https://doi.org/10.
1002/asi.21244
14. Li, J.: Citation curves of “all-elements-sleeping-beauties”: “flash in the pan” first
and then “delayed recognition”. Scientometrics 100(2), 595–601 (2013). https://
doi.org/10.1007/s11192-013-1217-z
15. Ioannidis, J.P.A., Baas, J., Klavans, R., Boyack, K.W.: A standardized citation
metrics author database annotated for scientific field. PLOS Biol. 17(8) (2019).
https://doi.org/10.1371/journal.pbio.3000384
16. Traag, V.A., Waltman, L., van Eck, N.J.: From Louvain to Leiden: guaranteeing
well-connected communities. Sci. Rep. 9, 5233 (2019). https://doi.org/10.1371/
journal.pbio.3000384
17. Marks, M.S., Marsh, M.C., Schroer, T.A., Stevens, T.H.: An alarming trend within
the biological/biomedical research literature towards the citation of review articles
rather than the primary research papers. Traffic 14(1), 1 (2013). https://doi.org/
10.1111/tra.12023
18. Ioannidis, J.P.A., Cristea, I.A., Boyack, K.W.: Work honored by Nobel prizes clus-
ters heavily in a few scientific fields. PLOS ONE 15(7), e0234612 (2020). https://
doi.org/10.1371/journal.pone.0234612
19. Oka, T., Hikoso, S., Yamaguchi, O., Taneike, M., Takeda, T., Tamai, T., Akira, S.:
Mitochondrial DNA that escapes from autophagy causes inflammation and heart
failure. Nature 485(7397), 251–255 (2012). https://doi.org/10.1038/nature10992
20. Wysocka, J., Myers, M.P., Laherty, C.D., Eisenman, R.N., Herr, W.: Human Sin3
deacetylase and trithorax-related Set1/Ash2 histone H3–K4 methyltransferase are
tethered together selectively by the cell-proliferation factor HCF-1. Genes Dev.
17(7), 896–911 (2003). https://doi.org/10.1101/gad.252103
21. Schuh, C.A., Hufnagel, T.C., Ramamurty, U.: Mechanical behavior of amorphous
alloys. Acta Mater. 55(12), 4067–4109 (2007). https://doi.org/10.1016/j.actamat.
2007.01.052
22. Segata, N., Izard, J., Waldron, L., Gevers, D., Miropolsky, L., Garrett, W.S., Hut-
tenhower, C.: Metagenomic biomarker discovery and explanation. Genome Biol.
12(6), 1–18 (2011). https://doi.org/10.1186/gb-2011-12-6-r60
Finding High-Degree Vertices with Inclusive
Random Sampling
Abstract. The friendship paradox (FP) is the famous phenomenon that one’s
friends typically have more friends than they themselves do. The FP has inspired
novel approaches for sampling vertices at random from a network when the goal of
the sampling is to find vertices of higher degree. The most famous of these methods
involves selecting a vertex at random and then selecting one of its neighbors at
random. Another possible method would be to select a random edge from the
network and then select one of its endpoints at random, again predicated on the
fact that high degree vertices will be overrepresented in the collection of edge
endpoints. In this paper we propose a simple tweak to these methods where we
consider the degrees of the two vertices involved in the selection process and
choose the one with higher degree. We explore the different sampling methods
theoretically and establish interesting asymptotic bounds on their performances
as a way of highlighting their respective strengths. We also apply the methods
experimentally to both synthetic graphs and real-world networks to determine the
improvement inclusive sampling offers over exclusive sampling, which version of
inclusive sampling is stronger, and what graph characteristics affect these results.
1 Introduction
1.1 Random Neighbor
It is often of interest to locate vertices within a network that are of relatively high
degree (see for example [1, 5, 7]). But in most networks, there is no obvious way to do
this efficiently, either because total network knowledge does not exist, or because it is
unavailable due to the network’s size and dynamic nature. Consider the most obvious
random sampling method of naïvely sampling a random vertex, which we will abbreviate
as RV . The expected degree of a vertex obtained by RV is simply the mean degree of
the graph, or RV = μ.¹
¹ We will interchangeably use method abbreviations to refer to the method and the expected degree
of a vertex it returns.
Cohen et al. [6] offer a novel approach that leverages the well-known “friendship
paradox” (FP) [9] to give a higher expected degree vertex than RV. The friendship
paradox is the phenomenon that a person’s friends will, on average, have more friends
than the person themselves does. The method of selecting a random neighbor (RN) is
performed by selecting a random vertex as in RV , but then taking the collection of its
neighbors and selecting one of these neighbors instead of the originally selected vertex.
Because the “friends” have a higher degree on average, the expected degree of RN should
be greater than or equal to the expected degree of RV , or RN ≥ RV . The superiority of
RN has been demonstrated by [6] and [12], and a formal proof appears in [10] where it
is also attributed to a comment in an online article [14].
In a paper that is still in progress [10], Kumar et al. differentiate between two mean
degrees of a graph that are both inspired by the friendship paradox. The first is the ‘local
mean’, which is calculated by taking every vertex individually, finding the mean degree
of its neighbors, and then taking the mean value of this mean over all vertices. Clearly
the local mean is perfectly analogous to RN .
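To make the two quantities concrete, the sketch below computes the mean degree (the expectation of RV) and the local mean (the expectation of RN) on a small star. The adjacency-dictionary representation and the function names are assumptions of this illustration, and the graph is assumed to have no isolated vertices.

```python
def mean_degree(adj):
    """Expected degree under RV: the plain mean degree of the graph."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)


def local_mean(adj):
    """Expected degree under RN: mean over vertices of the mean neighbor degree."""
    per_vertex = [
        sum(len(adj[u]) for u in nbrs) / len(nbrs)
        for nbrs in adj.values()
    ]
    return sum(per_vertex) / len(per_vertex)


# Star with center 0 and leaves 1, 2, 3
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
rv = mean_degree(star)  # (3 + 1 + 1 + 1) / 4
rn = local_mean(star)   # (1 + 3 + 3 + 3) / 4
```

In the star, three of the four possible starting vertices are leaves whose only neighbor is the center, which is why RN exceeds RV here.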
The second mean they discuss is the ‘global mean’. The global mean is the mean
degree of the collection of all neighbors in the graph, which is compiled by taking the
collection of neighbors of every individual vertex and combining these collections into a
single collection. Notably, vertices with degree 2 or higher will appear in this collection
multiple times. In fact, they will appear exactly as many times as their degree.
The authors offer a clever sampling technique whose expectation is the global mean
of the graph, by examining all neighbors of a randomly selected vertex and selecting
each with some fixed probability p. By considering all neighbors of a selected vertex,
the probability of a vertex being considered is directly proportional to its degree, and all
vertices that are considered have equal probability (p) of being selected. Therefore, the
expected degree of this sampling method is exactly the global mean.
We note here that the global mean can also be achieved by a different sampling
method. A single edge from the collection of edges in the graph can be selected at
random, then one of its two endpoints can be selected at random. Once again, a vertex’s
likelihood of being selected is proportional to the number of edges it touches, in other
words its degree. We will call this method ‘random edge’ (RE). The probabilistic method
of [10] has strong practical appeal because the edges of a graph are rarely stored as a
separate collection. Typically, an edge could only be found by selecting a vertex and
then identifying its neighbors. (Perhaps a road network would be a real-world exception.)
However, as this paper is an academic exploration of the expected values rather than a
discussion of implementation or practicality, we prefer to frame our results in the context
of RE rather than the probabilistic method, while recognizing that all of the authors’
findings for the global mean apply equally to RE and vice versa.
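The RE procedure and the global mean it targets can be sketched as follows. This is an illustration under stated assumptions: the graph is given as an explicit edge list, the helper names are hypothetical, and the Monte Carlo estimate uses a fixed seed purely for reproducibility.

```python
import random
from collections import Counter


def vertex_degrees(edges):
    """Degree of every vertex that appears in the edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def global_mean(edges):
    """Mean degree over the multiset of all neighbors: sum(d^2) / sum(d)."""
    deg = vertex_degrees(edges)
    return sum(d * d for d in deg.values()) / sum(deg.values())


def sample_random_edge(edges, rng):
    """RE: pick an edge uniformly, then one of its endpoints uniformly."""
    u, v = rng.choice(edges)
    return rng.choice((u, v))


def estimate_re(edges, samples=20000, seed=0):
    """Monte Carlo estimate of the expected degree returned by RE."""
    rng = random.Random(seed)
    deg = vertex_degrees(edges)
    total = sum(deg[sample_random_edge(edges, rng)] for _ in range(samples))
    return total / samples


star_edges = [(0, 1), (0, 2), (0, 3)]
exact = global_mean(star_edges)
approx = estimate_re(star_edges)
```

Because a vertex of degree d sits on d edges, it is returned with probability d / (2m), so the estimate should settle near the global mean.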
Based on the authors’ results for the global mean, RE ≥ RV can be proven directly
from the FP. It is interesting that RE is actually the purer manifestation of the FP as a
sampling method; RN ≥ RV is not in fact directly implied by the FP. It is also interesting
to note where each method reduces to RV . RE = RV only in a regular graph, whereas
The primary contribution of this paper is to propose a tweak to both RN and RE. Instead
of blindly taking the randomly selected neighbor in the case of RN, or selecting the
edge endpoint at random in the case of RE, we compare the degrees of
the two vertices involved (the original vertex and the neighbor in RN, or the two edge
endpoints in RE) and select the one with higher degree. It should be noted that this is not a purely
hypothetical suggestion. Even in networks where the lack of total knowledge prevents
one from selecting high-degree vertices directly, it is still typically possible to know the
degree of any selected vertex. In most cases this value would be stored separately, so the
collection of neighbors would not even need to be enumerated. Even in an offline human
network it is not hard to conceive of a scenario where two selected individuals would
consent to having their phones scanned by software that would give some acceptable
estimate of their popularity based on their contacts, emails, social media activity, etc.
We will call these methods ‘inclusive random neighbor’, or IRN, and ‘inclusive random
edge’, or IRE.
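The two inclusive methods can be sketched as below, together with their exact expected degrees on a small graph. The adjacency-dictionary input, the function names, and the tie-breaking rule (keep the first vertex when degrees are equal) are assumptions of this illustration; ties do not affect the expected degree.

```python
import random


def inclusive_random_neighbor(adj, rng):
    """IRN: pick a vertex and one of its neighbors, keep the higher-degree one."""
    v = rng.choice(list(adj))
    u = rng.choice(list(adj[v]))
    return u if len(adj[u]) > len(adj[v]) else v


def inclusive_random_edge(edges, adj, rng):
    """IRE: pick an edge, keep the endpoint with the higher degree."""
    u, v = rng.choice(edges)
    return u if len(adj[u]) >= len(adj[v]) else v


def irn_expected(adj):
    """Exact expected degree of IRN."""
    total = 0.0
    for nbrs in adj.values():
        total += sum(max(len(adj[u]), len(nbrs)) for u in nbrs) / len(nbrs)
    return total / len(adj)


def ire_expected(adj):
    """Exact expected degree of IRE (each edge is visited from both ends)."""
    num = sum(max(len(adj[u]), len(nbrs)) for nbrs in adj.values() for u in nbrs)
    den = sum(len(nbrs) for nbrs in adj.values())
    return num / den


star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
star_edges = [(0, 1), (0, 2), (0, 3)]
rng = random.Random(1)
picked = {inclusive_random_neighbor(star, rng) for _ in range(50)}
```

On the star both inclusive methods always return the center, so the expected degree is 3 for each, compared with 2.5 for RN and 2 for RE.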
We begin this study with a direct comparison between RN and RE. Kumar et al. [10]
demonstrated that it is possible for RN or RE to have the higher expected degree in a
graph. We will look at the ratio RN : RE as well as the inverse ratio, RE : RN , and show
that both ratios can grow without bound.
In performing this study, we opt to ignore any vertices with degree 0. It is not clear
what RN would even do should a 0-degree vertex be selected. If RN would select the
0-degree vertex there would be an obvious advantage to RE because it does not even
consider these vertices. But regardless of how RN would deal with 0-degree vertices,
we consider a study of RN vs. RE more interesting if we only include the subgraph that
contains edges, so that RN and RE are sampling from the same set of vertices. We will
therefore simply ignore any vertices without neighbors.
In the following equations, we will use n to denote the number of vertices in the
graph, and V is the collection of vertices itself. Similarly, m will denote the number
of edges in the collection of edges E. We will use v to denote the collection of the
neighbors of vertex v. An edge between u and v will be denoted e(u, v). We will consider
D the degree sequence of the graph and use dv to denote the degree of vertex v.
RE can be defined as:
\[
RE = \frac{1}{m} \sum_{e(u,v) \in E} \frac{d_u + d_v}{2} \tag{1}
\]
322 Y. Novick and A. BarNoy
It is worth noting that the contribution of every edge e(u, v) to the outer summation
is $\frac{d_u}{d_v} + \frac{d_v}{d_u}$ and therefore RN can also be expressed as a summation over E.
\[
RN = \frac{1}{n} \sum_{e(u,v) \in E} \left( \frac{d_u}{d_v} + \frac{d_v}{d_u} \right) \tag{3}
\]
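The edge-summation forms in Eqs. (1) and (3) can be checked against the vertex-based definition of RN on a small example. The sketch below uses exact rational arithmetic; the function names and the example graph (a triangle with one pendant vertex) are choices made for this illustration.

```python
from collections import Counter, defaultdict
from fractions import Fraction


def degree_map(edges):
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def re_edge_sum(edges):
    """Eq. (1): mean over edges of the average endpoint degree."""
    deg = degree_map(edges)
    return sum(Fraction(deg[u] + deg[v], 2) for u, v in edges) / len(edges)


def rn_edge_sum(edges):
    """Eq. (3): RN written as a sum over edges."""
    deg = degree_map(edges)
    total = sum(Fraction(deg[u], deg[v]) + Fraction(deg[v], deg[u]) for u, v in edges)
    return total / len(deg)


def rn_vertex_sum(edges):
    """RN from its definition: mean over vertices of the mean neighbor degree."""
    deg = degree_map(edges)
    nbrs = defaultdict(list)
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
    total = sum(Fraction(sum(deg[u] for u in nbrs[v]), deg[v]) for v in nbrs)
    return total / len(nbrs)


# Triangle 0-1-2 with a pendant vertex 3 attached to vertex 2
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
re_value = re_edge_sum(edges)
rn_value = rn_edge_sum(edges)
```

Each edge e(u, v) appears once in the neighbor list of u and once in that of v, which is why the two forms of RN agree.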
Corollary 1. $\frac{RN}{RE} \le \frac{2m}{n}$.
\[
\frac{a}{b} + \frac{b}{a} = \frac{a^2 + b^2}{ab} \le a + b = \frac{a^2 b + b^2 a}{ab}
\]
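One way to see how this inequality yields Corollary 1 is sketched below. It is a reconstruction from Eqs. (1) and (3), not a quotation of the authors' proof, and it assumes every degree is at least 1 (isolated vertices are ignored), so that $a^2 \le a^2 b$ and $b^2 \le b^2 a$ for $a = d_u$, $b = d_v$.

```latex
\begin{align*}
RN &= \frac{1}{n} \sum_{e(u,v) \in E} \left( \frac{d_u}{d_v} + \frac{d_v}{d_u} \right)
    \le \frac{1}{n} \sum_{e(u,v) \in E} \left( d_u + d_v \right) \\
   &= \frac{2m}{n} \cdot \frac{1}{m} \sum_{e(u,v) \in E} \frac{d_u + d_v}{2}
    = \frac{2m}{n} \, RE .
\end{align*}
```

Dividing both sides by RE gives the stated bound.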
Corollary 2. $\frac{RN}{RE} < \frac{2m}{n}$ in all graphs with a single vertex v with $d_v > 1$.
Proof. There exists at least one edge (u, v) with $d_u > 1$. If $a > 1$ and $b \ge 1$ then
\[
a^2 + b^2 < a^2 b + b^2 a
\]
If we assume without loss of generality that $d_u \ge d_v$, we can define IRN as:
\[
IRN = \frac{1}{n} \sum_{e(u,v) \in E} \left( \frac{d_u}{d_v} + 1 \right) \tag{6}
\]
Corollary 3. $\frac{IRE}{RE} < 2$.
In order to maximize one of the ratios, we construct a graph that accentuates the strength
of the sampling method in the numerator. We consider a graph that is comprised of two
disconnected subgraphs with h and k vertices respectively. The first subgraph will have
the majority of the graph’s edges, so that RE will select a vertex from this subgraph with
high probability. Similarly, the second subgraph has the majority of the vertices, so that
RN will select a vertex from this subgraph with high probability. In both cases, we make
the first subgraph a clique, saturating it with edges. If we want RE to be the superior
method, we lower the degrees of the k vertices in the second subgraph by arranging them
as a collection of edges connecting two 1-degree vertices. This is actually a generalization
of a figure in [10] that demonstrates that it is possible for RE > RN . If we want RN to
be stronger we arrange the k vertices of the second subgraph into a star. If we select a
vertex from the second subgraph, with probability (k − 1)/k, we will select a leaf whose
only neighbor is degree k − 1 (Fig. 1).
Fig. 1. Two constructions that illustrate how either RN or RE could be made the stronger sampling
method in a graph.
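The two constructions can be explored numerically with a short sketch. Everything here is illustrative: the helper names are hypothetical, the parameter choices in the examples are arbitrary, and RN and RE are evaluated with the edge-summation forms of Eqs. (3) and (1) in exact arithmetic.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations


def clique(h, offset=0):
    """Edges of a complete graph on h vertices labelled offset..offset+h-1."""
    return [(offset + i, offset + j) for i, j in combinations(range(h), 2)]


def star(k, offset):
    """Edges of a star on k vertices; the center is the vertex `offset`."""
    return [(offset, offset + i) for i in range(1, k)]


def matching(k, offset):
    """k vertices (k even) arranged as k/2 disjoint edges."""
    return [(offset + 2 * i, offset + 2 * i + 1) for i in range(k // 2)]


def rn_re(edges):
    """Exact (RN, RE) computed from the edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    n, m = len(deg), len(edges)
    rn = sum(Fraction(deg[u], deg[v]) + Fraction(deg[v], deg[u]) for u, v in edges) / n
    re = sum(Fraction(deg[u] + deg[v], 2) for u, v in edges) / m
    return rn, re


def clique_plus_star_ratio(x):
    """RN / RE for a clique on x^2 vertices plus a star on x^3 - x^2 vertices."""
    h, k = x * x, x ** 3 - x * x
    rn, re = rn_re(clique(h) + star(k, offset=h))
    return rn / re


def clique_plus_matching_ratio(h, k):
    """RE / RN for a clique on h vertices plus k vertices paired into edges."""
    rn, re = rn_re(clique(h) + matching(k, offset=h))
    return re / rn


ratios = [float(clique_plus_star_ratio(x)) for x in (4, 6)]
```

In the clique-plus-star graph the ratio RN / RE grows with x, while in the clique-plus-matching graph RE exceeds RN because most vertices have degree 1 yet most edge endpoints lie in the clique.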
Notice that, by symmetry, IRE and IRN reduce to RE and RN respectively in this
construction. Therefore, these proofs suffice to prove the same results for IRE/IRN .
RN/RE is Unbounded. We can follow a similar process as the one above to prove that
RN/RE is unbounded in the clique and star graph by setting $h = x^2$ and $k = x^3 - x^2$.
This will give an expression that grows without bound as k/h increases. It can further
be used to give a bound on the ratio as a function of n as $n^{1/3}$. Here too it is possible
to prove that the results apply to IRN/IRE as well. We omit the full steps here in the
interest of brevity.
We will now examine our sampling methods in the specific case of trees.
RN/RE is Bounded by 2. Our first observation is that RN must be less than 2RE. This
follows from Corollary 1 because, in a tree, 2m/n is fixed at 2((n − 1)/n). It would also seem that
a star maximizes RN/RE for all trees of a fixed size n. For a star:
\[
\frac{RN}{RE} = \frac{2(n-1)^2 + 2}{n^2}
\]
This expression approaches the same bound, 2, as n increases.
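The star case can be verified by direct computation. In a star on n vertices, each of the n − 1 leaves has the center (degree n − 1) as its only neighbor, and the center's neighbors all have degree 1, so RN = ((n − 1)² + 1)/n; every edge joins the center to a leaf, so RE = n/2. The sketch below, with hypothetical function names, evaluates the resulting ratio 2((n − 1)² + 1)/n² exactly and compares it with the tree bound from Corollary 1.

```python
from fractions import Fraction


def star_rn_re(n):
    """Exact RN and RE for a star on n >= 2 vertices (one center, n - 1 leaves)."""
    leaves = n - 1
    # Leaves contribute the center's degree each; the center contributes 1
    rn = Fraction(leaves * leaves + 1, n)
    # Each edge has endpoint degrees (n - 1) and 1, so the endpoint mean is n / 2
    re = Fraction(leaves + 1, 2)
    return rn, re


def star_ratio(n):
    rn, re = star_rn_re(n)
    return rn / re


# The ratio climbs towards 2 but stays below the tree bound 2(n - 1)/n
for n in (3, 10, 100, 1000):
    print(n, float(star_ratio(n)), float(Fraction(2 * (n - 1), n)))
```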
Fig. 2. A graph tree with h internal nodes and h(k − 1) leaf nodes.
We calculated values for RN, RE, IRN, and IRE in both synthetic and real-world graphs.
For synthetic graphs, we looked at both Erdős–Rényi (ER) [8] and Barabási–Albert
(BA) [2] models, using different parameters and taking the mean results of 30 randomly
generated graphs for each parameter set. For real-world graphs, we examined networks
from the famous Koblenz Network Collection [11].
These findings again reflect on the respective natures of the sampling methods. RE is
a pure manifestation of the FP: it relies entirely on the fact that high-degree vertices are
overrepresented in a collection of edge endpoints. But, between the two endpoints
of a given edge, there is no favoring one vertex over the other. On the other hand, RN
seems to be implicitly assuming that the jump from a vertex to a neighbor will lead to an
increase in degree. In most natural graphs, the low-degree vertices will outnumber the
high-degree vertices, so RN is actually a type of correction to RV, improving the outcome
by exchanging the random vertex for its neighbor. Therefore, in RN, the gain of inclusive
sampling is less significant: it only applies in the less common case where a high-degree
vertex was selected in the first step of the process. In RE, by contrast, inclusive sampling
is more significant because the mean degree of the two edge endpoints is always less
than or equal to the max degree.
We examined 1072 networks from the Koblenz Network Collection to see the effects
of the four sampling methods. We found that RN > RE in 93% of the networks, yet
IRE > IRN in 43%. The average gain of IRN versus RN was 102.3%, while the average
gain of IRE versus RE was a staggering 186%. This is especially significant in light of
the bound of 2 in Corollary 3.
We also calculated these results for the different network categories of the collection.
The results are summarized in Table 2. RN > RE is true in the majority of networks
in all but three categories, and the mean percent over all categories where this is true is
72.8%. IRE > IRN is true for a majority of networks in all but three categories (note
that these are not the same three categories where RE > RN ), and the mean percent over
all categories where this is true is 82.2%. The modest gains of IRN over RN are roughly
consistent over all categories, while the gain of IRE over RE ranges from 1.13 to 1.98.
RE is purely a function of the degree sequence, so rewiring cannot affect it, and this
is clear in the plots (any noise presumably reflects an inability of some degree sequences
to achieve certain assortativity values). RN becomes weaker as assortativity increases,
and the fact that these two lines meet around zero assortativity is consistent with the fact
that the sign of inversity, which is very close to assortativity, indicates the strength of
RN relative to RE.
The plots again indicate a strength in IRE over IRN , and while both weaken as
assortativity increases, it appears that IRN decreases at a slightly faster rate.
In this paper we explored two sampling methods that leverage the phenomenon of the
FP to perform random sampling with a bias towards higher degree vertices. We proved
that either method can be infinitely better than the other and gave possible lower bounds
on the number of vertices that would be required in order to achieve a desired ratio.
We introduced “inclusive” versions of each of these methods and showed a surprising
result that IRE is often greater than IRN , even in graphs where RN outperforms RE. While
we explored these methods mostly as an academic study, we noted that inclusivity itself
is not a contrived concept considering the fact that, once a vertex in a network has been
selected, its degree is typically available. We believe this study can have strong practical
applications in situations where high-degree sampling is desired. While edges are not
typically stored as collections from which to draw samples, we noted the existence of
a probabilistic sampling method that takes a similar approach to RN but achieves the
results of RE. Furthermore, perhaps in some situations where high-degree sampling is
often required and the nature of the graph makes RE a significantly stronger option, it
would actually be worthwhile to store edges in order to enable RE sampling.
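To make the comparison of the four methods concrete, the following minimal sketch computes the exact expected degree of the returned vertex on a toy graph. It assumes the usual definitions, which lie outside this passage: RN picks a uniform random vertex and then a uniform random neighbor, RE picks a uniform random edge and then one endpoint at random, and the inclusive variants IRN and IRE keep the higher-degree vertex of the pair. The function name and the star-graph example are illustrative only.

```python
from collections import defaultdict


def expected_degrees(edges):
    """Exact expected degree of the vertex returned by each sampling method.

    Assumed definitions (not taken verbatim from the paper):
      RV  - uniform random vertex
      RN  - uniform random vertex, then a uniform random neighbor
      RE  - uniform random edge, then a uniform random endpoint
      IRN - like RN, but keep the higher-degree vertex of the pair
      IRE - like RE, but keep the higher-degree endpoint
    The graph must be simple, undirected, and free of isolated vertices.
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    deg = {v: len(ns) for v, ns in neighbors.items()}
    n, m = len(deg), len(edges)

    rv = sum(deg.values()) / n
    rn = sum(sum(deg[u] for u in neighbors[v]) / deg[v] for v in deg) / n
    irn = sum(sum(max(deg[u], deg[v]) for u in neighbors[v]) / deg[v]
              for v in deg) / n
    # RE sums (d_u + d_v) / 2 over edges, so it depends only on the degrees.
    re_value = sum((deg[u] + deg[v]) / 2 for u, v in edges) / m
    ire = sum(max(deg[u], deg[v]) for u, v in edges) / m
    return {"RV": rv, "RN": rn, "RE": re_value, "IRN": irn, "IRE": ire}


# Star with one hub (vertex 0) and three leaves.
star = [(0, 1), (0, 2), (0, 3)]
result = expected_degrees(star)
```

On this star, RN (2.5) exceeds RE (2.0), while the inclusive versions tie at 3.0, since every sampled pair contains the hub.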
While we have noted the strong connection with degree-homophily, we believe there
are other graph characteristics, for example the power-law exponent, that contribute to the
results of the different methods and we hope to explore some of these in future research.
We also note that we have evaluated the methods based solely on their results. We are
currently studying the methods in light of not only results, but also the computational
complexity and other costs that could be associated with them in order to give a far more
robust analysis of their respective values as high-degree sampling methods.
References
1. Albert, R., Jeong, H., Barabási, A.L.: Error and attack tolerance of complex networks. Lett.
Nat. 406, 378–382 (2000)
2. Barabási, A.L., Albert, R.: Emergence of scaling in random networks. Science 286, 509–512
(1999)
3. Babak, F., Rabbat, M. G.: Degree Correlation in Scale-Free Graphs, arXiv:1308.5169 (2013)
4. Bertotti, M.L., Modanese, G.: The Bass Diffusion Model on Finite Barabasi-Albert Networks,
Phys. Soc. arXiv:1806.05959 (2018)
5. Cohen, R., Erez, K., Ben-Avraham, D., Havlin, S.: Breakdown of the internet under intentional
attack. Phys. Rev. Lett. 86(16), 3682–3685 (2001)
Finding High-Degree Vertices with Inclusive Random Sampling 329
6. Cohen, R., Havlin, S., Ben-Avraham, D.: Efficient immunization strategies for computer
networks and populations. Phys. Rev. Lett. 91(24), 247901 (2003)
7. Christakis, N.A., Fowler, J.H.: Social network sensors for early detection of contagious
outbreaks. PLoS ONE 5(9), e12948 (2010)
8. Erdős, P., Rényi, A.: Publicationes Mathematicae 6(290) (1959)
9. Feld, S.: Why your friends have more friends than you do. Am. J. Soc. 96(6), 1464–1477
(1991)
10. Kumar, V., Krackhardt, D., Feld, S.: Network Interventions Based on Inversity: Leveraging
the Friendship Paradox in Unknown Network Structures (2018). https://vineetkumars.github.
io/Papers/NetworkInversity.pdf
11. Kunegis, J.: KONECT, The Koblenz Network Collection (2013). http://konect.cc/
12. Momeni, N., Rabbat, M.G.: Effectiveness of Alter Sampling in Social Networks, https://arxiv.
org/abs/1812.03096v2 (2018)
13. Newman, M.E.J.: Assortative mixing in networks. Phys. Rev. Lett. 89(20), 208701 (2002)
14. Strogatz, S.: Friends You Can Count On, NY Times (2012). Accessed 17 Sep 2012. https://
opinionator.blogs.nytimes.com/2012/09/17/friends-you-can-count-on/
15. Van Mieghem, P., Wang, H., Ge, X., Tang, S., Kuipers, F.A.: Influence of assortativity and
degree-preserving rewiring on the spectra of networks. Eur. Phys. J. B 76, 643–652 (2010)
16. Xulvi-Brunet, R., Sokolov, I.M.: Reshuffling scale-free networks: from random to assortative.
Phys. Rev. 70, 066102 (2004)
Concept-Centered Comparison
of Semantic Networks
1 Introduction
Research in cultural sociology argues that semantic networks –with vertices being
concepts and links being concept co-occurrences within a given time-window–
are capable of revealing structures of knowledge underpinning production of
texts (Carley 1986; Carley and Newell 1994; Lee and Marin 2015) and thus help
to explore these structures through the lens of network analysis (Abbott et al.
2015; Roth and Cointet 2010). Currently, an issue of particular interest is how
different knowledge systems (for instance institutional-field and local knowledge,
see Basov et al. 2019) interplay with each other. This paper proposes an approach
to examine this interplay at the meso-level of particular concepts which gain
different meanings across different knowledge systems.
We draw on a new textual dataset on professional and local flood management
knowledge collected during an ethnographic study conducted by one of this paper’s
authors in England. Flood management in England as a knowledge domain provides an
apt example because, until recently, it exclusively relied on institutionalized pro-
fessional knowledge. In recent decades, however, flood management has become
more sensitive to ‘local knowledges’ and started seeking ways to engage local
actors (McEwen and Jones 2012; Nye et al. 2011; Wehn et al. 2015). Becoming
stakeholders in flood risk management, local communities/activists are expected
to adhere to professional knowledge and language. They, however, rarely take
professional knowledge at ‘face value’, but rather creatively reuse it to fit the
local context (Nye et al. 2011; Wehn et al. 2015). Our data comes from one
such flood-prone area in England where flood management agencies and local
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 943, pp. 330–341, 2021.
https://doi.org/10.1007/978-3-030-65347-7_28
Concept-Centered Networks 331
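\[ \omega(A, B) = \frac{|A \cap B|}{\min(|A|, |B|)} \tag{1} \]
\[ \text{Vertex Overlap}(\text{concept } c, \text{actors } a, a') = \omega(V^{*}_{ac}, V^{*}_{a'c}) \tag{2} \]
\[ \text{Link Overlap}(\text{concept } c, \text{actors } a, a') = \omega(E^{*}_{ac}, E^{*}_{a'c}) \tag{3} \]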
The way we define vertex and link overlap indices assumes that the central
concept c has at least one alter. Otherwise, there would be division by 0. We
assume this because we are not practically interested in empty/trivial concept-
centered semantic networks and only focus on the most central concepts which
are necessarily non-trivial.
Along with the analytical approach we also propose a visualization approach
to highlight similarities and discrepancies between two concept-centered
networks. We first combine the concept-centered sub-networks into a “union”
network by gathering all vertices and links. The union graph is used to position
the nodes (e.g., in Fig. 1), and these positions are then kept when visualizing
each network separately. We assign weights to links, with those shared by both
actors having the larger weight. The weights serve a functional role: we want
shared links to weigh more than non-shared ones because this way a force-directed
layout pulls vertices that are incident to shared links closer to the ego-concept
and to each other. Using weights not only puts shared concepts closer to each other
but also spatially separates actor-unique concepts that are connected to shared
concepts from those that are not.
2 Which is always smaller than that of professionals, mostly due to different corpus
sizes.
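The union-and-weights step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names, the toy link sets, and the weight values 2.0 and 1.0 are assumptions, since the text only states that shared links weigh more than non-shared ones.

```python
def link(u, v):
    """Undirected link between two concepts (hypothetical helper)."""
    return frozenset((u, v))


def union_with_weights(links_a, links_b, shared_weight=2.0, unique_weight=1.0):
    """Merge two link sets into one weighted 'union' network.

    Links present in both networks receive the larger weight; the concrete
    weight values are placeholders chosen for illustration.
    """
    weights = {}
    for l in links_a | links_b:
        is_shared = l in links_a and l in links_b
        weights[l] = shared_weight if is_shared else unique_weight
    return weights


# Toy concept-centered networks around the same ego-concept.
net_a = {link("ego", "x"), link("ego", "y"), link("ego", "p"), link("x", "p")}
net_b = {link("ego", "x"), link("ego", "y"), link("ego", "q"), link("x", "y")}

union = union_with_weights(net_a, net_b)
shared_links = {l for l, w in union.items() if w == 2.0}
```

The resulting weights could then be handed to any force-directed layout that accepts edge weights (for instance, `spring_layout` in NetworkX takes a `weight` argument), and the positions computed on the union graph reused when drawing each network on its own.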
Figure 1 illustrates this idea. Suppose we have a pair of actors and their
respective concept-centered networks. The ego-concept in both networks is c1.
The first actor links the concept c1 to concepts c2, c3, c4, and c5. In the second
network c1 is not linked to c5 but to c6. Besides, there is a difference in the
alter-alter linkage: the first actor links c2 and c5, while the second actor links
c2 and c3. This results in the vertex overlap between these concept-centered
networks being 0.75 (3 shared alters out of a minimum of 4 alters). At the same
time, the link overlap between them is 0: each actor has one alter-alter link,
and the two links differ.
[Figure 1: the union network and the two concept-centered networks around the
ego-concept c1, with vertices c1 to c6. Legend: Ego; Shared; Unique to Network 1;
Unique to Network 2.]
Fig. 1. Actors share 3 links between the concept c1 and other concepts (c2, c3, and
c4) with the minimum number of unique concepts for two actors being 4. This makes
for vertex overlap similarity of 0.75. The link overlap similarity between two actors is
0, as they do not share alter-alter links at all.
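As a worked check, the overlap index of Eq. (1) can be applied to the example of Fig. 1. The sketch below assumes that the vertex sets contain only the alters (the ego c1 is excluded) and that the link sets contain only alter-alter links; under these assumptions it reproduces the values stated in the caption. The function and variable names are illustrative.

```python
def overlap(a, b):
    """Overlap index from Eq. (1): |A ∩ B| / min(|A|, |B|)."""
    if not a or not b:
        # Mirrors the assumption that the central concept has at least one alter.
        raise ValueError("overlap is undefined for empty sets")
    return len(a & b) / min(len(a), len(b))


# Alters of the ego-concept c1 in each actor's network (ego excluded).
alters_1 = {"c2", "c3", "c4", "c5"}
alters_2 = {"c2", "c3", "c4", "c6"}

# Alter-alter links only; frozensets make the links undirected.
alter_links_1 = {frozenset(("c2", "c5"))}
alter_links_2 = {frozenset(("c2", "c3"))}

vertex_overlap = overlap(alters_1, alters_2)
link_overlap = overlap(alter_links_1, alter_links_2)
```

Here `vertex_overlap` is 3/4 = 0.75 and `link_overlap` is 0/1 = 0.0. Note that a Jaccard index on the same alter sets would give 3/5 = 0.6, so the two measures should not be used interchangeably.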
3 Data Description
Our data come from an ethnographic study focusing on two local flood manage-
ment groups located in the County of Shropshire, England. Professional knowl-
edge is represented with a collection of documents issued by flood management
agencies and authorities (around 316,000 words in total). Local knowledge is
represented with semi-structured interviews with 15 members of the two ‘local
flood groups’—local activist groups involved in flood risk management in two vil-
lages. We denote these groups as LFG1 and LFG2, respectively. The interviews
comprise 186,000 words in total, with the average word number per interview
of around 13,000. With the threshold of 2, the total number of co-occurrences in
the professional network is 266,704, while the average total number of
co-occurrences in local networks is 790.
Note that we represent professional knowledge with one network (further,
‘professional network’). This reflects the ‘universality’ of professional knowledge,
as we assume that the content of official documents should reflect some general
consensus among professionals. For locals, on the other hand, we allow each
interviewee to have their own semantic network. We do so because we are inter-
ested in looking at how particular local actors borrow concepts from professional
knowledge. We denote local networks with lowercase a’s followed by a number
(e.g. a1 or a2 ).
We produce semantic networks from texts as follows. First, raw texts are
tokenized and POS-tagged using the UDpipe package (Wijffels 2019): we con-
vert words into lemmas and combine lemmas with their POS-tags to produce
unique concept identifiers (e.g. flood(v) as the verb and flood(n) as the noun),
which we refer to as concepts in this paper. Research assistants manually
inspected the corpus, checking for machine-missed stopwords. These were usually
numbers (e.g. ‘60s’), transcribed artefacts of oral speech (e.g., ‘aha’, ‘eh’), set
phrases (‘bear mind’, ‘couple time’), incorrectly recognized words,
informants’ real names, and variant spellings of the same word. All such
instances have been replaced with either correct versions or a generic placeholder
(‘xxx’ sign).
We count co-occurrences using a sliding-window approach: two concepts co-occur
when they appear within an 8-concept vicinity of each other, unless they are
separated by a full stop. This yields weighted co-occurrence networks. We then
remove from these networks all vertices that are not nouns, verbs, or adjectives,
as well as trivial verbs (e.g., ‘do’ or ‘make’), leaving only vertices related to
adjectives, nouns, and non-trivial verbs. We take this step to reduce the amount of
information to process and to focus on informative parts of speech. Finally, we
binarize co-occurrence counts using the threshold of 2 for both professional and
local networks, that is, we link all the pairs of concepts which co-occurred at
least 2 times.
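The counting and binarization steps can be sketched as below. This is a minimal illustration under stated assumptions: "within 8-concept vicinity" is read as a distance of at most 8 positions inside one sentence, the part-of-speech filtering step is omitted, and the function names and toy sentences are hypothetical.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=8):
    """Count unordered concept pairs lying within `window` positions.

    Each sentence is a list of concept identifiers such as 'flood(n)'.
    Pairs are never counted across sentence boundaries (full stops).
    """
    counts = Counter()
    for sentence in sentences:
        for i, left in enumerate(sentence):
            for right in sentence[i + 1 : i + 1 + window]:
                if left != right:
                    counts[frozenset((left, right))] += 1
    return counts


def binarize(counts, threshold=2):
    """Keep a link for every pair that co-occurred at least `threshold` times."""
    return {pair for pair, c in counts.items() if c >= threshold}


# Toy corpus of two already lemmatized and POS-tagged sentences.
sentences = [
    ["flood(n)", "risk(n)", "management(n)"],
    ["flood(n)", "management(n)", "plan(n)"],
]
counts = cooccurrence_counts(sentences)
links = binarize(counts, threshold=2)
```

Only the pair flood(n) and management(n) co-occurs twice in this toy corpus, so it is the only link that survives the threshold of 2.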
fessionals and locals. In general, extracting binary networks from weighted data
is an ongoing research field with several sophisticated methods recently proposed
(Dianati 2016; Radicchi et al. 2011; Serrano et al. 2009). We choose to work with
the same threshold of 2 for both professional and local networks. Our choice is
guided by the following consideration.
Let us consider a hypothetical situation in which the professional network
strictly comprises all the local networks. In this case, we would expect all local
networks to be sub-networks of the professional network. Let us denote this
conjecture as ‘SIN’ (Strictly Included Networks). In this case, we would like to
understand how much of the professional network is directly reflected in the local
network and why the professional network features many more interlinked con-
cepts than the local networks. Two simple hypothetical scenarios could explain
the latter.
In the first scenario, professional knowledge covers more topics than locals
use. For instance, professionals may use the concept of group in contexts never
taken on by locals. Simply put, locals might use a concept in a very narrow
context neglecting all the other contexts elaborated in professional knowledge.
In this case, the difference in network densities would reflect the breadth of the
professional knowledge as opposed to the specificity and particularity of the local
knowledge. At the level of network representation, this would imply symmetric
thresholding (i.e. if we use 2 for locals, we should use 2 for professionals), because
otherwise, we would lose all the contexts in which the concept has been used in
the professional knowledge.
The second scenario, at the opposite end of the spectrum, may suggest that
local knowledge is just a scaled-down version of the professional knowledge.
For example, professionals and locals may have the same local linkage pat-
terns around the concept group, but professionals may have more co-occurrences
between all the context concepts. Given the SIN hypothesis, we would still argue
against removing these richer links.
Finally, a higher threshold for the professional network would filter away a great
deal of structure, to the degree that some local concept neighborhoods would
become more linked than the corresponding neighborhoods in the professional
network. This goes against our initial assumption that locals draw on professional
knowledge. While it may well be the case that some concepts are more
elaborated in the local knowledge than in the professional knowledge, we decided
to leave this option for further research and focus instead on the ‘SIN’ hypothesis
and its implications.
It is important to note that our decision to work with the symmetric threshold
is driven by the specificity of the data: we use two corpora of different sizes. We
do not directly engage with the literature on how to choose such a threshold
analytically (for example, Dianati 2016) or on what kind of consequences thresholding
may have for the topology of the resulting network (Cantwell et al. 2020). However, we are
driven by the observation that weight distributions in word co-occurrence net-
works are typically highly skewed. This implies that a higher threshold value
336 D. Medeuov et al.
would discard a large amount of small scale word co-occurrences (Serrano et al.
2009), which, however, might be relevant for local actors.
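The effect of an asymmetric threshold on the ‘SIN’ assumption can be illustrated with a small sketch. All counts below are invented for illustration, and the function name is hypothetical; the point is only that raising the professional threshold can remove links that a local network still contains.

```python
def pair(u, v):
    """Undirected concept pair (hypothetical helper)."""
    return frozenset((u, v))


def included_share(local_counts, prof_counts, local_threshold=2, prof_threshold=2):
    """Share of local links that also appear in the professional network.

    Both weighted networks are binarized first; a share of 1.0 means the
    local network is a sub-network of the professional one.
    """
    local_links = {p for p, c in local_counts.items() if c >= local_threshold}
    prof_links = {p for p, c in prof_counts.items() if c >= prof_threshold}
    if not local_links:
        return None
    return len(local_links & prof_links) / len(local_links)


# Invented co-occurrence counts around the concept 'group'.
prof_counts = {pair("group", "flood"): 40,
               pair("group", "community"): 3,
               pair("group", "action"): 2}
local_counts = {pair("group", "flood"): 5,
                pair("group", "community"): 2,
                pair("group", "action"): 2}

symmetric = included_share(local_counts, prof_counts)
asymmetric = included_share(local_counts, prof_counts, prof_threshold=10)
```

With the symmetric threshold of 2 every local link is found in the professional network, whereas a professional threshold of 10 keeps only one of the three, so the local neighborhood would appear more linked than the professional one.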
5 Illustration
We apply our approach to find and examine concepts shared by local activists
and professionals. We select concepts for inspection based on their similarity
profile, because we want to understand what it means for two semantic networks
to share concepts in terms of those concepts’ immediate neighbors. Comparing
concepts, we take the following steps:
1. First, we create a list of common concepts: those concepts that both profes-
sional and local actors use.
2. For each common concept, we extract its immediate alters in the professional
and in the local network. This yields concept-centered networks: $C_i^{pro}$ and
$C_i^{loc}$.
3. We calculate two metrics characterising similarity between the two concept-
centered networks:
– the vertex overlap
– the link overlap.
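The three steps above can be sketched as a short pipeline. This is a hedged illustration rather than the authors' code: networks are represented as sets of undirected links, the concept-centered network is taken to consist of the alters and the alter-alter links, and all names and the toy data are assumptions.

```python
def overlap(a, b):
    """Overlap index |A ∩ B| / min(|A|, |B|); None when a set is empty."""
    if not a or not b:
        return None
    return len(a & b) / min(len(a), len(b))


def vertices(links):
    """All concepts that appear in at least one link."""
    return {v for l in links for v in l}


def concept_centered(links, ego):
    """Return (alters, alter-alter links) of `ego` in a binary network."""
    alters = {v for l in links if ego in l for v in l if v != ego}
    alter_links = {l for l in links if l <= alters}
    return alters, alter_links


def compare(pro_links, loc_links):
    """Vertex and link overlap for every concept used in both networks."""
    scores = {}
    for concept in vertices(pro_links) & vertices(loc_links):
        v_pro, e_pro = concept_centered(pro_links, concept)
        v_loc, e_loc = concept_centered(loc_links, concept)
        scores[concept] = (overlap(v_pro, v_loc), overlap(e_pro, e_loc))
    return scores


def L(u, v):
    return frozenset((u, v))


# Invented toy networks, not the data analysed in the paper.
pro = {L("management", "flood"), L("management", "plan"),
       L("management", "water"), L("management", "risk"),
       L("flood", "plan"), L("water", "plan")}
loc = {L("management", "plan"), L("management", "water"),
       L("management", "natural"), L("water", "plan")}

scores = compare(pro, loc)
vertex_overlap, link_overlap = scores["management"]
```

For ‘management’ the toy local network shares 2 of its 3 alters with the professional one, and its single alter-alter link is also present there, giving a vertex overlap of 2/3 and a link overlap of 1.0.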
Figure 2 shows vertex and link overlaps for the top 15% of the most central
concepts in one of the local semantic networks. Let us examine two of them:
management(n) and plan(n). The concept ‘management’ is interesting because
it yields the highest similarity scores in both vertex and link overlaps. This
means that the local actor uses the concept in the same contexts as used in the
professional knowledge both in terms of alters and linkages between them. The
concept ‘plan’, on the other hand, has relatively high link overlap but stands out
with a relatively low vertex overlap score.
Concept ‘management’. One thing to note is that all the contexts in which
the local actor uses the concept of ‘management’ are present in the professional
network.
[Figure 2: plot of concepts by vertex and link overlap (recoverable axis label:
Link Overlap). Labeled concepts: surface(n), water(n), plan(n), flood(v), event(n),
flood(n), development(n), resilience(n), new(a).]
Fig. 2. Top 15% of the most central concepts for local actor 9 along with their vertex
and link similarities to professionals. For display purposes, only concepts with link
overlap larger than 0.5 or vertex overlap larger than 0.75 are shown
[Figure 3: Professionals vs. local actor 9, ‘management’. Concepts shown:
management(n), flood(n), plan(n), natural(a), water(n), surface(n), plan(v).
Legend: links (Shared, Local); vertices (Ego, Shared, Local).]
Fig. 3. Professionals vs a local actor: ‘management’. Suffixes after the concept’s name
show its part of speech: (n) - noun, (v) - verb, (a) - adjective. For display purposes, only
shared and local-specific concepts are shown. In case of ‘management’, the professional
network contains 641 other concepts, while the local actor’s network has only 3
“In 2007 Telford & Wrekin Council were successful in a bid to create a
surface water management plan under DEFRA’s Integrated Urban
Drainage pilot studies. The project was driven by the need to gain a better
understanding of the surface water environment within its borough with a
view to reducing the risk of flooding to existing and new properties through
the development control process” (professional text)
The local actor, meanwhile, points out that surface water management plan
is a rich source of information on flood risks in the area that informs flood
management-related activities of the local flood group:
“Most what I find with most documents related to flooding they’re actu-
ally historical reports like the surface water management plan, the
neighborhood plan, there may have been a report on the Priorslee balanc-
ing lake.”(local activist)
“the flood manager helped to set up and Jason was quite instrumental in
creating the community group and actually gave us an insight into quite
important documents like the Surface Water Management Plan or any-
thing else which he could talk to Severn Trent he was quite a good facili-
tator.”(local activist)
Concept ‘plan’. Figure 4 shows the use of the concept plan in the profes-
sional and local networks. Plan is indeed one of the core concepts in the pro-
fessional knowledge, in particular, plan is embedded into a clique with 4 other
concepts some of which also appear in the local actor’s network: surface-water-
flood. Meanwhile, we can also see that in the local network plan has several
unique neighbours, most notably inside the dyad plan-neighborhood which does
not appear in the professional network.
[Figure 4: Professionals vs. local actor 9, ‘plan’. Concepts shown: plan(n),
outline(a), document(n), question(n), suds(n), neighborhood(n), place(n),
connection(n), surface(n), water(n), railway(n), management(n), flood(n), event(n),
contingency(n). Legend: links (Shared, Local); vertices (Ego, Shared, Local).]
Fig. 4. Professionals vs. local actor 9: ‘plan’. For display purposes, only shared and
local-specific concepts are shown. The professional network contains 480 other concepts
linked with ‘plan’
Shared Link: Plan-Flood. Professionals and the locals use the concept ‘plan’
when referring to documents that coordinate various stakeholders’ flood man-
agement activities. Although both the locals and professionals share the idea of
“Flood risk management plans [FRMP] describe the risk of flooding from
rivers, the sea, surface water, groundwater and reservoirs. FRMPs set out
how risk management authorities will work together and with communities
to manage flood and coastal risk over the next 6 years [. . . ] Each EU mem-
ber country must produce FRMPs as set out in the EU Floods Directive
2007.”(professional texts)
“I suppose. . . the other one [issue] which isn’t perhaps as major [a problem]
but it [is] certainly significant for [the village], is the local developers. The
planning permissions are granted on the understanding that certain flood
mitigation steps will be taken... Developers are only allowed to develop in
line with the neighborhood plan.”(local activist)
6 Concluding Remarks
This paper proposed a two-step concept-centered approach to compare semantic
networks, where one network serves as a “gold standard” from which the
other network selectively pulls semantic links. At the first step, we mapped
all the shared concepts onto a two-dimensional space of overlap similarities
of their alters and of links connecting these alters. The joint distribution of
these indices highlights concepts which potentially can give insight into the
selective appropriation of professional knowledge by local actors. At the second
step, we visually inspected chosen concepts using a customized version of the
Fruchterman-Reingold layout (Fruchterman and Reingold 1991) which spatially
separates shared and non-shared concepts.
We argue that while network comparison can happen at any level of analysis,
in the case of semantic networks it is sensible to start with concept-centered
networks, since they provide insights on the meaning residing in these networks.
We also think that researchers may gain deeper insight into how meaning of
concepts in semantic networks emerges because of the productive juxtaposition
of quantitative and qualitative perspectives that this vantage brings together.
Future research in concept-centered networks may focus on working with sev-
eral symmetric thresholds for both professional and local networks. This implies
that instead of working with one single network threshold researchers should
References
Abbott, J.T., Austerweil, J.L., Griffiths, T.L.: Random walks on semantic networks
can resemble optimal foraging. Psychol. Rev. 122(3), 558–569 (2015).
https://doi.org/10.1037/a0038693
Basov, N., De Nooy, W., Nenko, A.: Local meaning structures: mixed method sociose-
mantic network analysis. Am. J. Cult. Sociol. 1–42 (2019)
Cantwell, G.T., Liu, Y., Maier, B.F., Schwarze, A.C., Serván, C.A., Snyder, J.,
St-Onge, G.: Thresholding normally distributed data creates complex networks.
Phys. Rev. E 101(6), 062302 (2020)
Carley, K.: An approach for relating social structure to cognitive structure. J. Math.
Sociol. 12(2), 137–189 (1986)
Carley, K., Newell, A.: The nature of the social agent. J. Math. Sociol. 19(4), 221–262
(1994)
Choobdar, S., Ribeiro, P., Bugla, S., Silva, F.: Comparison of coauthorship networks
across scientific fields using motifs. In: 2012 IEEE/ACM International Conference
on Advances in Social Networks Analysis and Mining, pp. 147–152. IEEE (2012)
Dianati, N.: Unwinding the hairball graph: pruning algorithms for weighted complex
networks. Phys. Rev. E 93(1), 012304 (2016)
Fruchterman, T.M., Reingold, E.M.: Graph drawing by force-directed placement.
Softw.: Pract. Exp. 21(11), 1129–1164 (1991)
Lee, M., Marin, J.L.: Coding, counting and cultural cartography. Am. J. Cult. Sociol.
3, 1–33 (2015). https://doi.org/10.1057/ajcs.2014.13
McEwen, L., Jones, O.: Building local/lay flood knowledges into community flood
resilience planning after the July 2007 floods, Gloucestershire, UK. Hydrol. Res.
43(5), 675–688 (2012)
Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D., Alon, U.: Network
motifs: simple building blocks of complex networks. Science 298(5594), 824–827
(2002)
Nye, M., Tapsell, S., Twigger-Ross, C.: New social directions in UK flood risk manage-
ment: moving towards flood risk citizenship? J. Flood Risk Manage. 4(4), 288–297
(2011)
Pržulj, N.: Biological network comparison using graphlet degree distribution. Bioinfor-
matics 23(2), e177–e183 (2007)
Radicchi, F., Ramasco, J.J., Fortunato, S.: Information filtering in complex weighted
networks. Phys. Rev. E 83, 046101 (2011).
https://doi.org/10.1103/PhysRevE.83.046101
Roth, C., Cointet, J.-P.: Social and semantic coevolution in knowledge networks. Soc.
Netw. 32(1), 16–29 (2010)
Serrano, M. À., Boguñá, M., Vespignani, A.: Extracting the multiscale backbone of
complex weighted networks. Proc. Natl. Acad. Sci. 106(16), 6483–6488 (2009).
https://doi.org/10.1073/pnas.0808904106. eprint:
https://www.pnas.org/content/106/16/6483.full.pdf
Wehn, U., Rusca, M., Evers, J., Lanfranchi, V.: Participation in flood risk management
and the potential of citizen observatories: a governance analysis. Environ. Sci. Pol.
48, 225–236 (2015)
Wijffels, J.: Udpipe: Tokenization, parts of speech tagging, lemmatization and
dependency parsing with the ‘udpipe’ ‘nlp’ toolkit. R package version 0.8.3 (2019).
https://CRAN.R-project.org/package=udpipe
Diffusion and Epidemics
Analyzing the Impact of Geo-Spatial
Organization of Real-World Communities
on Epidemic Spreading Dynamics
Alexandru Topîrceanu
1 Introduction
have helped public health officials prepare better for major outbreaks [10,19,20].
As a result, current strategies for controlling and eradicating diseases, including
the COVID-19 pandemic, are fueled by consistent insights into the processes
that drive, and have driven epidemics in the past [1,16].
A predominant body of recent research has invested in extending, custom-
tailoring, and augmenting standard mass-action mixing models into tools suit-
able for analyzing COVID-19 [2,13,18]. However, in most cases we find that these
mathematical models assume homogeneous mixing of the population (i.e., each
infected individual has a small chance of spreading infection to every suscepti-
ble individual in the population) [2–4,28]. As such, “flattening the curve”-type
solutions have been proposed to decrease the reproduction number R0 , and thus
dampen the peak of the daily infection ratio. Based on homogeneous mixing
populations, several notable studies estimate the length and proportion of the
current COVID-19 pandemic [2,4,13,17,18]. By contrast, most pathogens spread
through contact networks, such that infection has a much higher probability of
spreading to a more limited set of susceptible contacts [15]. Ultimately, govern-
ments’ actions around the world are based on these scientific predictions, having
immense social and economic impact [3].
Over the past decade, an increasing number of studies pertaining to network
science have shown the importance of community structure when considering epi-
demic processes over networks [5,6,23,26,27]. In this sense, the heterogeneous
organization of communities is not a novel concept in network science [24,32].
Salathé et al. [23] show how community structure affects the dynamics of epi-
demics, with implications on how networks can be protected from large-scale
epidemics. Ghalmane et al. [11] reach similar conclusions in the context of time
evolving network nodes and edges. Shang et al. [26] show that overlapping
communities and higher average degree accelerate spreading. Chen et al. [5], using a
power-law model, show that overlapping communities lead to a major infection
prevalence and a peak of the spread velocity in the early stages of the emerging
infection. Stegehuis et al. [27] suggest that community structure is an important
determinant of the behavior of percolation processes on networks, as community
structure can either enforce or inhibit spreading. With a slightly different
approach, Chung et al. [7] use a multiplex network to model heterogeneity in
Singapore’s population; thus, the authors are able to obtain real-world-like
epidemic dynamics.
Mobility patterns represent an important ingredient for augmenting the realism
of complex network models, in order to increase the predictability of epidemic
dynamics. Sattenspiel et al. [25] incorporate five fixed patterns of mobility
into a SIR model to explain a measles epidemic in the Caribbean. Salathé
et al. [24] study US contact networks and conclude that heterogeneity is impor-
tant because it directly affects the basic reproductive number R0 , and that it is
realistic enough to assume (contact) homogeneity inside communities (e.g., high
schools). Their observation supports the simplification of a community’s network,
namely from a complex network to a stochastic block model [14]. Finally, Watts
et al. [32] introduce a synthetic hierarchical block model, capable of reproducing
Epidemics with Geo-Spatial Community Structure 347
multiple epidemic waves, but without any correlation to real-world human set-
tlement organization, or realistic distances between communities.
In this paper, we address the issue of modeling mobile heterogeneous popu-
lation systems, where the community structure is defined by actual real-world
geo-spatial data (i.e., position and size of human settlements). The contributions
of our study can be summarized as follows:
Also note that the reference probability of an individual to remain within its
settlement $s_i$, for $d_{ii} = 0$, becomes $p^i_a(s_i) \propto \Omega_i$. In the
current form of GPM, all individuals from the same settlement have the same
probability for mobility (e.g., $p^i_a(s_j) = p^i_b(s_j), \forall n_a, n_b \in s_i$).
As such, given all probabilities to move from one settlement to all other
settlements $s_0, \ldots, s_k$ in the population system, we express the normalized
probability $p^i$ (summing up to 1, i.e., $\sum_k p^i(s_k) = 1$) by dividing each
reference probability from Eq. 1 by the sum of all probabilities for all settlements:
\[ p^i(s_j) = \frac{\Omega_j \cdot e^{-d_{ij}/\psi}}{\sum_k \Omega_k \cdot e^{-d_{ik}/\psi}} \tag{2} \]
In practice, we find that the probability of an individual to leave its settlement
is roughly 0–2% when the home settlement is moderately large (e.g., a city), and
0–50% when the home settlement is relatively small (e.g., a village or small town).
Experimental assessment has shown a reliable value of ψ = 0.2 for the tunable
parameter; nevertheless, the value of ψ could be customized, in a follow-up study,
for each settlement using reliable real mobility data from specific countries [12].
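The normalization in Eq. 2 can be sketched numerically as follows. This is a minimal illustration with assumptions stated up front: Ω is taken to be the settlement size, distances are expressed in the same (unspecified) normalized units as ψ, and the three settlements and the function name are invented for the example.

```python
import math


def mobility_probabilities(sizes, distances_from_home, psi=0.2):
    """Normalized probability of visiting each settlement, following Eq. 2.

    sizes[j] plays the role of Omega_j and distances_from_home[j] of d_ij,
    with d_ii = 0 for the home settlement. psi is the tunable parameter.
    """
    weights = [omega * math.exp(-d / psi)
               for omega, d in zip(sizes, distances_from_home)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical system seen from a home city (index 0): the city itself,
# a nearby village, and a larger but distant city.
sizes = [100_000, 1_000, 500_000]
distances = [0.0, 0.5, 2.0]

probabilities = mobility_probabilities(sizes, distances, psi=0.2)
stay_probability = probabilities[0]
```

Because the exponential term decays quickly for ψ = 0.2, an individual from a large home settlement stays there with a probability close to 1 in this toy setting, which is in line with the small leaving probabilities reported for cities.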
In order to compare our numerical simulations with real epidemic data we use
the most recent JHU CSSE COVID-19 dataset curated by the Center for Sys-
tems Science and Engineering (CSSE) at Johns Hopkins University [9]. The
comprehensive dataset contains time series information on daily total confirmed
Coronavirus cases for the majority of countries (and some subregions) of the
world. From this data we compute the number of new daily cases and show
several important insights that are further investigated by our simulations using
GPM.
Figure 2a represents the histogram of the current COVID-19 outbreak size
around the world. From the bimodal distribution we conclude that many coun-
tries are (still) weakly affected by the pandemic (e.g., ≤10,000 total cases), and
another significant proportion are strongly affected (e.g., >100,000 cases). In
between, there is a relatively flat distribution of the outbreak size, similar to the
occurrence of measles [32].
In Figs. 2b–d we provide three representative examples of real-world pan-
demic evolution for the first δ ≈ 200 days (starting January 22nd). Here we
underline two important empirical observations: (i) the outbreak sizes (ξ < 1%)
are much smaller than many early predictions based on homogeneous mixing,
(ii) the dynamics are much less predictable, being characterized by multiple
waves (w1..3 ) which do not follow a single skewed Gaussian-like wave. Also, the
pandemic duration is yet to be accurately inferred from the real data.
350 A. Topı̂rceanu
Fig. 2. (A) Histogram of COVID-19 outbreak size, quantified by the number of glob-
ally confirmed cases. The current bimodal distribution shows that countries are either
weakly or strongly affected by the virus. (B–D) Time series evolution of daily cases from
January 22nd to July 26th 2020. The sizes of the current outbreak are given in percent
(ξ), and several similar waves (w1..3 ) characterize the COVID-19 epidemic dynamics in
varied regions of the world. The current duration of the outbreak is δ ≈ 200 days.
3 Results
The numerical simulations running GPM are quantified through the outbreak
duration δ and size ξ. All simulations run for a fixed amount of t = 1000 itera-
tions, ensuring a 3-year overview of the epidemic. The duration δ represents the
number of days (discrete iterations t) from the epidemic onset (iteration t = 0)
to the last registered new infection case. The size ξ represents the proportion of
the total population being infected.
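The two outbreak measures can be computed from a simulated series of new daily infections as in the sketch below. It assumes one list entry per iteration starting at t = 0 and that each case is a distinct individual (no reinfection); the function name and the toy series are illustrative.

```python
def outbreak_metrics(new_cases, population):
    """Return (duration, size) of an outbreak.

    duration: iteration index of the last registered new infection,
              counted from the onset at t = 0 (0 if there are no cases).
    size:     proportion of the total population that was infected.
    """
    duration = 0
    for t, cases in enumerate(new_cases):
        if cases > 0:
            duration = t
    size = sum(new_cases) / population
    return duration, size


# Toy series of new infections per iteration in a population of 1000.
new_cases = [1, 2, 4, 3, 0, 1, 0, 0]
delta, xi = outbreak_metrics(new_cases, population=1000)
```

In this toy series the last new infection occurs at iteration 5, so δ = 5, and 11 of 1000 individuals are infected, so ξ = 0.011.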
Table 1 offers an overview of the numerical simulations’ statistics on the
Grump dataset (before and after filtering out countries with fewer than 50
settlements). In summary, our simulations do not trigger a pandemic in fewer than
20 of the smallest countries, a weak pandemic is characteristic of fewer than 30
countries, and about 20 countries exhibit a strong pandemic (i.e., in terms of
size or duration).
Looking at the persistent panel in Table 1, we notice that, after leaving
just the countries with more than 50 settlements in the dataset, the average
duration δ increases, and the average size ξ drops. We believe this is explained
by the high number of small-sized countries (101) in which the pandemic may
be of shorter duration and higher impact. Furthermore, the top 14 countries
(lowest panel) with the longest epidemic duration (δ > 270) take, on average,
446 days to overcome the simulated pandemic, and reach an average infection
size of ξ = 0.65.
Epidemics with Geo-Spatial Community Structure 351
2. The same distinction is not possible in terms of outbreak size (Fig. 4b). While
larger countries continue to correlate with the number of settlements (ρ =
0.624), smaller countries’ outbreak size is unpredictable (ρ = −0.269).
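The correlation values ρ quoted above can be illustrated with a short sketch. The text here does not state which coefficient ρ denotes; the code assumes the Pearson coefficient, and the per-country numbers are invented purely for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: number of settlements and simulated outbreak size xi.
settlements = [60, 120, 250, 400, 800]
outbreak_size = [0.20, 0.30, 0.35, 0.50, 0.60]
print(round(pearson(settlements, outbreak_size), 3))
```

Splitting the countries into a "smaller" and a "larger" group and calling `pearson` on each group separately would reproduce the kind of comparison shown in Fig. 4.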
Fig. 4. Dependency between outbreak duration (A), size (B) and number of settle-
ments for smaller (blue) and larger countries (orange). Impact of settlements density
on outbreak duration (C) and size (D) when contracting or expanding the position of
settlements.
4 Conclusions
Establishing realistic models for the geographic spread of epidemics is still under-
developed compared to other areas of network modeling, such as online social
networks [29], models for the diffusion of information [30], or network medicine
modeling [21]. Nevertheless, one of the most distinct characteristics of many
viral outbreaks is their spreading across geo-spatially organized human com-
munities. In this paper we investigate the importance of spatially structured
real-world community structures for predicting epidemic dynamics. The GPM
model presented here provides one novel method that may prove useful in better
binding complex networks and mathematical epidemics to the empirical patterns
of infectious disease spread across time and space.
Our numerical simulations confirm that smaller scale environments (e.g.,
countries with fewer settlements) exhibit less predictable epidemic dynamics
(in terms of outbreak size ξ), but as a general observation, the duration δ
is noticeably shorter (within ≈200 days) than that of larger environments.
Indeed, for larger environments, the outbreak duration and size increase lin-
early (ρ ≈ 0.62−0.67). In general, our results illustrate the qualitative point
that epidemics, when they succeed, occur on multiple scales, resulting in
longer duration, repeated waves, and hard-to-predict size.
A planned next step in our model is to include diverse isolation measures,
in the form of mobility restrictions between settlements and reduced infectiousness
inside settlements (e.g., by wearing masks), and to study their feasibility
in limiting the infection size in the long term. Furthermore, there are several
extensions to GPM worth investigating in future studies. For example, environmental
factors associated with settlement location can have important effects on
transmission risk, as they can vary greatly over short distances [25]. The model
can also consider larger scale populations (e.g., continental) where mobility is
given by international travel logs. Between settlements, real data on mobility
patterns can be used when available [12]. Finally, our mobility model may
be further detailed to consider contact between individuals along the way to a
target settlement (e.g., by car, bus, train) instead of direct transfer (e.g., plane).
Taken together, we believe our model represents a timely contribution to
better understanding and tackling the current COVID-19 pandemic that has
proven hard to predict with many existing homogeneously mixing population
models.
References
1. Anderson, R.M., May, R.M.: Directly transmitted infectious diseases: control by
vaccination. Science 215(4536), 1053–1060 (1982)
2. Arenas, A., Cota, W., Gomez-Gardenes, J., Gómez, S., Granell, C., Matamalas,
J.T., Soriano-Panos, D., Steinegger, B.: A mathematical model for the spatiotem-
poral epidemic spreading of covid19. MedRxiv (2020)
3. Atkeson, A.: What will be the economic impact of covid-19 in the us? Rough esti-
mates of disease scenarios. Technical Report, Nat. Bureau of Economic Research
(2020)
4. Block, P., Hoffman, M., Raabe, I.J., Dowd, J.B., Rahal, C., Kashyap, R., Mills,
M.C.: Social network-based distancing strategies to flatten the covid-19 curve in a
post-lockdown world. Nat. Hum. Behav. 4, 1–9 (2020)
5. Chen, J., Zhang, H., Guan, Z.H., Li, T.: Epidemic spreading on networks with
overlapping community structure. Phys. A: Stat. Mech. Appl. 391(4), 1848–1854
(2012)
6. Cherifi, H., Palla, G., Szymanski, B.K., Lu, X.: On community structure in complex
networks: challenges and opportunities. Appl. Netw. Sci. 4(1), 1–35 (2019)
7. Chung, N.N., Chew, L.Y.: Modelling singapore covid-19 pandemic with a seir mul-
tiplex network model. medRxiv (2020)
8. Cohen, J., Kupferschmidt, K.: Labs scramble to produce new coronavirus diagnos-
tics (2020)
9. Dong, E., Du, H., Gardner, L.: An interactive web-based dashboard to track covid-
19 in real time. Lancet Infect. Dis. 20(5), 533–534 (2020)
10. Dye, C., Gay, N.: Modeling the SARS epidemic. Science 300(5627), 1884–1885
(2003)
11. Ghalmane, Z., Cherifi, C., Cherifi, H., El Hassouni, M.: Centrality in complex
networks with overlapping community structure. Sci. Rep. 9(1), 1–29 (2019)
12. Hasan, S., Zhan, X., Ukkusuri, S.V.: Understanding urban human activity and
mobility patterns using large-scale location-based data from online social media.
In: Proceedings of the 2nd ACM SIGKDD International Workshop on Urban Com-
puting, pp. 1–8 (2013)
13. Hellewell, J., Abbott, S., et al.: Feasibility of controlling covid-19 outbreaks by
isolation of cases and contacts. Lancet Global Health 8(4), e488–e496 (2020)
14. Holland, P.W., Laskey, K.B., Leinhardt, S.: Stochastic blockmodels: first steps.
Soc. Netw. 5(2), 109–137 (1983)
15. Keeling, M.: The implications of network structure for epidemic dynamics. Theor.
popul. Biol. 67(1), 1–8 (2005)
16. Keeling, M.J., Rohani, P.: Modeling Infectious Diseases in Humans and Animals.
Princeton University Press, Princeton (2011)
17. Koo, J., Cook, A., Park, M., et al.: Interventions to mitigate early spread of covid-19
in Singapore: a modelling study. Lancet Infect. Dis. (2020)
18. Kucharski, A.J., Russell, T.W., et al.: Early dynamics of transmission and control
of covid-19: a mathematical modelling study. Lancet Infect. Dis. 20(5), 553–558
(2020)
19. Lipsitch, M., Cohen, T., et al.: Transmission dynamics and control of severe acute
respiratory syndrome. Science 300(5627), 1966–1970 (2003)
20. Lloyd-Smith, J.O., Schreiber, S.J., et al.: Superspreading and the effect of individ-
ual variation on disease emergence. Nature 438(7066), 355–359 (2005)
21. Mihaicuta, S., Udrescu, M., Topirceanu, A., Udrescu, L.: Network science meets
respiratory medicine for OSAS phenotyping and severity prediction. Peer J. 5,
e3289 (2017)
22. Nguyen, A.D., Sénac, P., Ramiro, V., Diaz, M.: Steps-an approach for human
mobility modeling. In: International Conference on Research in Networking, pp.
254–265. Springer (2011)
23. Salathé, M., Jones, J.H.: Dynamics and control of diseases in networks with com-
munity structure. PLoS Comput. Biol. 6(4), e1000736 (2010)
24. Salathé, M., Kazandjieva, M., Lee, J.W., Levis, P., Feldman, M.W., Jones, J.H.: A
high-resolution human contact network for infectious disease transmission. Proc.
Nat. Acad. Sci. 107(51), 22020–22025 (2010)
25. Sattenspiel, L., Dietz, K., et al.: A structured epidemic model incorporating geo-
graphic mobility among regions. Math. Biosci. 128(1), 71–92 (1995)
26. Shang, J., Liu, L., Li, X., Xie, F., Wu, C.: Epidemic spreading on complex networks
with overlapping and non-overlapping community structure. Phys. A: Stat. Mech.
Appl. 419, 171–182 (2015)
27. Stegehuis, C., Van Der Hofstad, R., Van Leeuwaarden, J.S.: Epidemic spreading
on complex networks with community structures. Sci. Rep. 6(1), 1–7 (2016)
28. Thunström, L., Newbold, S.C., Finnoff, D., Ashworth, M., Shogren, J.F.: The
benefits and costs of using social distancing to flatten the curve for covid-19. J.
Benefit-Cost Anal. 11(2), 1–27 (2020)
29. Topirceanu, A., Udrescu, M., Vladutiu, M.: Genetically optimized realistic social
network topology inspired by facebook. In: Online Social Media Analysis and Visu-
alization, pp. 163–179. Springer (2014)
30. Topirceanu, A., Udrescu, M., Vladutiu, M., Marculescu, R.: Tolerance-based inter-
action: a new model targeting opinion formation and diffusion in social networks.
Peer J. Comput. Sci. 2, e42 (2016)
31. Warszawski, L., Frieler, K., et al.: Center for international earth science information
network—ciesin—columbia university. gridded population of the world, version 4
(gpwv4). NASA socioeconomic data and applications center (sedac), Atlas of Environmental
Risks Facing China Under Climate Change, p. 228 (2017).
https://doi.org/10.7927/h4np22dq
32. Watts, D.J., Muhamad, R., Medina, D.C., Dodds, P.S.: Multiscale, resurgent epi-
demics in a hierarchical metapopulation model. Proc. Nat. Acad. Sci. 102(32),
11157–11162 (2005)
Identifying Biomarkers for Important
Nodes in Networks of Sexual and Drug
Activity
1 Introduction
Problems associated with HIV transmission extend into all aspects of society,
with minority populations often bearing the brunt in both health and economic
outcomes. For example, 1 in 7 people with HIV does not know it, one in two
minority men who have sex with men (MSM) will become HIV-infected in their
lifetime, and African American women bear a disproportionate HIV burden and
poorer health outcomes than other women [23].
A critical barrier to the eradication of HIV is identification, education, and
treatment of infected and potentially infected individuals. If interventions are
to be successful, they must target the segments of the population that play
important roles in HIV transmission [20]. The effectiveness of highly-targeted
medical interventions has been demonstrated in studies such as [5] in which no
examples of heterosexual HIV transmission were found when an HIV-positive
partner was receiving HAART.
Based on work in [3] and [24], it is thought that transmission and substance-
use networks consist of tree-like sub-groups joined by a central cycle. This study
concentrates on nodes with high betweenness centrality, who are important par-
ticipants because of their topological position on the central cycle of the trans-
mission network. Epidemiological theory suggests that interventions targeting
these nodes are more effective in stopping the spread of a disease through the
network than interventions involving other nodes [9,18].
This study is a secondary analysis of SATHCAP (Sexual Acquisition and
Transmission of HIV Cooperative Agreement Program) [15] user data for
Chicago, Los Angeles, and Raleigh. The SATHCAP data was collected as part of
a study to assess the impact of drug use in the sexual transmission of HIV from
traditional high-risk groups to lower risk groups. The SATHCAP dataset was
collected by a process called Respondent-Driven Sampling (RDS). This process
naturally leads to a network-oriented presentation of the data. Although there
are known theoretical benefits to statistical analysis with RDS data [16,25], there
is some question as to whether the dataset can be accurately analyzed using
classical statistical techniques. An evaluation of the statistical implications for
respondent-driven sampling can be found in Lee et al. [17]. They conclude that
RDS results in a non-random sampling process with errors that make traditional
statistical analysis unwieldy and ineffective. This dissatisfaction with traditional
statistical analysis has led to underuse of the datasets, as well as underuse of
the RDS technique.
The data have been obtained through the National Addiction and HIV Data
Archive Program (NAHDAP), accessible online (see footnote 1). This research was conducted
under the approval of the Southern Illinois University Edwardsville IRB.
2 Related Work
There have been several successful studies on finding important nodes using
the SATHCAP data, including a special issue of the Journal of Urban Health in
2009 [6]. In Youm et al. [27] neighborhoods are identified in Chicago for localized
campaigns. The neighborhoods are hidden because they have fewer than average
cases of HIV. However, they are important in transmission, as they act as bridge
neighborhoods facilitating spreading. Simple factors can cause individuals to be
important, yet overlooked, such as being very poor [22] or belonging to a particular ethnic group [28].
There are difficulties with using RDS data for statistical analysis, which is
intended to be performed on randomly collected data. It is shown in [11] that
the length of data collection chains is often insufficient to obtain an unbiased
sample, and it is stated in [12] that the poor statistical “performance of RDS is
driven not by the bias but by the high variance of estimates.” It is shown in [26]
that valid point estimates are possible with RDS data, but that improvement
in variance estimation is needed. The current study avoids this controversy by
using network science techniques to analyze RDS data.
The graphical nature of respondent driven sampling is examined in [7].
Betweenness centrality is a widely-used concept for finding nodes important to
1 https://www.icpsr.umich.edu/icpsrweb/NAHDAP/index.jsp.
Identifying Biomarkers 359
network resilience and spreading [14], including HIV transmission [4]. Important
nodes in other epidemiological contexts have been called superspreaders [19].
3 Methodology
SATHCAP Origins. The SATHCAP study was conducted across three cities
in the United States: Chicago, Los Angeles, and the Raleigh-Durham area, as
well as St. Petersburg, Russia [15]. This study used a system of peer-recruitment
and respondent-driven sampling to generate a sample of men who have
sex with men (MSM), drug users (DU), and injection drug users (IDU).
The peer recruitment process allowed individuals to recruit sexual and drug
partners to participate in the survey, and for those partners to recruit addi-
tional partners. The participants were provided a set of colored coupons for
referring partners to the survey, with different colors representing different rela-
tionships, such as male sexual partnership, female sexual partnership, or high
risk behaviour (MSM, DU, and IDU). Participants in the program were asked
a series of questions that attempted to gather as much information about the
participant’s sexual and drug activity as possible.
For each city network, we identified the 10 nodes with the highest betweenness
centrality. For each of these nodes, we compared each of the 143 features with
the average of the underlying city population. A feature was marked as notable
for the individual if the value was more than two standard deviations away from
the underlying city average. By accumulating the number of notable features
across the top 10 highest betweenness nodes, patterns began to emerge across
features and cities.
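The procedure described above can be sketched as follows. The reference list cites NetworkX [13] and pandas [21], but this sketch uses only the Python standard library as a stand-in: betweenness is computed with Brandes' algorithm for an unweighted, undirected graph, and a feature is flagged when it lies more than two standard deviations from the population mean. The function names, the use of the population standard deviation, and the reading of "exceptionality" as the fraction of top nodes for which a feature is notable are assumptions of this sketch, not details confirmed by the text.

```python
from collections import deque
from statistics import mean, pstdev


def betweenness(adj):
    """Unnormalised betweenness centrality (Brandes) for an undirected graph.

    adj maps each node to an iterable of its neighbours.
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:
        order = []                       # nodes in order of discovery
        preds = {v: [] for v in adj}     # shortest-path predecessors
        sigma = {v: 0 for v in adj}      # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        dep = {v: 0.0 for v in adj}      # dependency of s on each node
        while order:
            w = order.pop()
            for v in preds[w]:
                dep[v] += sigma[v] / sigma[w] * (1 + dep[w])
            if w != s:
                bc[w] += dep[w]
    # Each undirected pair was counted from both endpoints.
    return {v: c / 2 for v, c in bc.items()}


def notable_features(features, node, threshold=2.0):
    """Features of `node` lying more than `threshold` std devs from the mean."""
    notable = set()
    for name, value in features[node].items():
        values = [f[name] for f in features.values()]
        mu, sd = mean(values), pstdev(values)
        if sd > 0 and abs(value - mu) > threshold * sd:
            notable.add(name)
    return notable


def exceptionality(adj, features, top_k=10):
    """Fraction of the top_k betweenness nodes for which each feature is notable."""
    bc = betweenness(adj)
    top = sorted(bc, key=bc.get, reverse=True)[:top_k]
    if not top:
        return {}
    counts = {}
    for node in top:
        for name in notable_features(features, node):
            counts[name] = counts.get(name, 0) + 1
    return {name: c / len(top) for name, c in counts.items()}


# Hypothetical toy network: one hub bridging ten otherwise unconnected leaves.
leaves = [f"leaf{k}" for k in range(10)]
star = {"hub": leaves, **{leaf: ["hub"] for leaf in leaves}}
features = {node: {"walks": 0.0, "age": 30.0} for node in leaves}
features["hub"] = {"walks": 1.0, "age": 30.0}
scores = exceptionality(star, features, top_k=1)
print(scores)
```

In the toy network only the hub lies on shortest paths between other nodes, and only its hypothetical `walks` attribute deviates from the population, so that attribute is the one reported.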
Fig. 1. Distribution of underlying sexual and drug networks, plotted on a log-log scale.
The downward slope is characteristic of a scale-free network.
4 Results
4.1 Scale-Free Underlying Networks
log-log scale, and show the distinct downward slope representative of a power-law
degree distribution, indicating a scale-free network.
This is an important result, because it is well known that scale-free net-
works demonstrate resilience to random attacks but high susceptibility to tar-
geted attacks [1]. The scale-free property of networks indicates the existence of
important nodes that have a greater effect on the overall structure of the graph.
Therefore, targeting important nodes within the referral network is a reasonable
strategy on which to base efforts to stop spreading.
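A rough way to inspect the downward slope mentioned above is to compute the empirical degree distribution and fit a straight line to it on log-log axes. The sketch below is only a visual-check aid under that assumption; a least-squares fit on log-log data is a crude estimate and not a rigorous test for a power law. The function names and the example distribution are hypothetical.

```python
import math
from collections import Counter


def degree_distribution(adj):
    """Map each positive degree k to the fraction of nodes having degree k."""
    counts = Counter(len(neighbours) for neighbours in adj.values())
    n = len(adj)
    return {k: c / n for k, c in sorted(counts.items()) if k > 0}


def loglog_slope(dist):
    """Least-squares slope of log10(P(k)) against log10(k)."""
    xs = [math.log10(k) for k in dist]
    ys = [math.log10(p) for p in dist.values()]
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two distinct degrees")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical distribution that decays roughly as k^-2.
example = {1: 0.60, 2: 0.15, 4: 0.04, 8: 0.01}
print(round(loglog_slope(example), 2))
```

A clearly negative slope over a wide range of degrees is consistent with, but does not by itself establish, a scale-free degree distribution.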
RALEIGH-DURHAM
Attribute | Plaintext | Exceptionality
slept-2 | Last week, I most often slept in my neighborhood, but not my home. | 0.4
tmode-5 | My primary form of transportation is walking. | 0.3
usedc | I have used heroin and cocaine mixed together (speedball). | 0.3
reside-3 | I currently live in a lover’s apartment or house. | 0.2
slept-3 | Last week, I most often slept in a different neighborhood within 20 miles of my home. | 0.2

CHICAGO
Attribute | Plaintext | Exceptionality
slept-2 | Last week, I most often slept in my neighborhood, but not my home. | 0.3
mstat-5 | I am currently divorced. | 0.3
risk2 | My first sexual encounter was non-consensual. | 0.3
insd2 | I have experienced difficulty getting healthcare due to my race/ethnicity. | 0.2
insd4 | I have experienced difficulty getting healthcare due to my culture/heritage. | 0.2

LOS ANGELES
Attribute | Plaintext | Exceptionality
usedi | Drug Usage (other) | 0.5
reside-5 | I currently live in a rented room at a hotel or a rooming house | 0.4
slept-2 | Last week, I most often slept in my neighborhood but not my home. | 0.3
sexid2-4 | I mostly have sex with women, but occasionally men | 0.3
racee | Other Race | 0.3
that the respondent used a mixture of cocaine and heroin are more prevalent
within Raleigh-Durham, but not as common in Chicago or Los Angeles, although
the generic category of “Drug Usage (other)” had the highest exceptionality
in Los Angeles. Beyond living arrangements and drug use, difficulty getting
healthcare and having sex with both men and women were represented among
the most exceptional attributes.
As seen in Fig. 2, many of the high betweenness nodes share a close prox-
imity to one another within the referral network, indicating a shared subset
of attributes. Nodes with high betweenness centrality are often referred to as
“bridge” nodes because they connect, or bridge, different groups. Conversely,
this suggests that while some attributes are shared among the high betweenness
nodes, other attributes would be unique. Figure 3 shows the largest component
in Los Angeles, which contains 60% of that city’s high betweenness nodes. By
analyzing each node individually, we discovered each node had a set of unique
attributes that no other high betweenness node in that city contained.
366 J. Grubb et al.
Fig. 3. The largest connected component in Los Angeles. The nodes with the six highest
betweenness centrality scores are shown in red. Unique exceptional attributes are shown
for each node.
The exceptional attributes shown in Fig. 3 are unique to each of the associated
respondents within the top 10 high betweenness nodes. The descriptive qualities
of these attributes provide a unique insight into the structure of these networks of
recruitment, via outlier data that may not be captured by traditional statistical
modeling. By focusing on the centrality “bridging” qualities of these nodes, we
can identify core attributes that may not otherwise be captured. For example,
attributes such as youth, being a member of an underrepresented race, and/or
carrying STIs are identifiable as attributes that bridge from one group to another.
References
1. Albert, R., Jeong, H., Barabási, A.L.: Error and attack tolerance of complex net-
works. Nature 406(6794), 378–382 (2000)
2. Barabási, A.L., Bonabeau, E.: Scale-free networks. Sci. Am. 288(5), 60–69 (2003)
3. Bearman, P.S., Moody, J., Stovel, K.: Chains of affection: the structure of adoles-
cent romantic and sexual networks. Am. J. Sociol. 110(1), 44–91 (2004)
4. Bell, D.C., Atkinson, J.S., Carlson, J.W.: Centrality measures for disease trans-
mission networks. Soc. Netw. 21(1), 1–21 (1999)
5. Castilla, J., Del Romero, J., Hernando, V., Marincovich, B., Garcı́a, S., Rodrı́guez,
C.: Effectiveness of highly active antiretroviral therapy in reducing heterosexual
transmission of HIV. JAIDS J. Acquired Immune Defic. Syndr. 40(1), 96–101
(2005)
6. Compton, W., Normand, J., Lambert, E.: Sexual acquisition and transmission of
HIV cooperative agreement program (SATHCAP), July 2009. J. Urban Health
86(1), 1–4 (2009)
7. Crawford, F.W.: The graphical structure of respondent-driven sampling. Sociol.
Methodol. 46(1), 187–211 (2016)
8. Deo, N.: Graph theory with application to engineering and computer science, pp.
39–44. phi pvt., Ltd., India (1974)
9. Eames, K.T.: Modelling disease spread through random and regular contacts in
clustered populations. Theor. popul. Biol. 73(1), 104–111 (2008)
10. Freeman, L.C.: A set of measures of centrality based on betweenness. Sociometry
40, 35–41 (1977)
11. Gile, K.J., Handcock, M.S.: Respondent-driven sampling: an assessment of current
methodology. Sociol. Methodol. 40(1), 285–327 (2010)
12. Goel, S., Salganik, M.J.: Assessing respondent-driven sampling. Proc. Nat. Acad.
Sci. 107(15), 6743–6747 (2010)
13. Hagberg, A., Swart, P., S Chult, D.: Exploring network structure, dynamics, and
function using networkx. Technical Report, Los Alamos National Lab. (LANL),
Los Alamos, NM (United States) (2008)
14. Holme, P., Kim, B.J., Yoon, C.N., Han, S.K.: Attack vulnerability of complex
networks. Phys. Rev. E 65(5), 056109 (2002)
15. Iguchi, M.Y., Ober, A.J., Berry, S.H., Fain, T., Heckathorn, D.D., Gorbach, P.M.,
Heimer, R., Kozlov, A., Ouellet, L.J., Shoptaw, S., et al.: Simultaneous recruitment
of drug users and men who have sex with men in the United States and Russia using
respondent-driven sampling: sampling methods and implications. J. Urban Health
86(1), 5 (2009)
16. Kuhns, L.M., Kwon, S., Ryan, D.T., Garofalo, R., Phillips, G., Mustanski, B.S.:
Evaluation of respondent-driven sampling in a study of urban young men who have
sex with men. J. Urban Health 92(1), 151–167 (2015)
17. Lee, S., Suzer-Gurtekin, T., Wagner, J., Valliant, R.: Total survey error and respon-
dent driven sampling: focus on nonresponse and measurement errors in the recruit-
ment process and the network size reports and implications for inferences. J. Official
Stat. 33(2), 335–366 (2017)
18. Lloyd, A.L., May, R.M.: How viruses spread among computers and people. Science
292(5520), 1316–1317 (2001)
19. Lloyd-Smith, J.O., Schreiber, S.J., Kopp, P.E., Getz, W.M.: Superspreading and
the effect of individual variation on disease emergence. Nature 438(7066), 355–359
(2005)
20. Magnani, R., Sabin, K., Saidel, T., Heckathorn, D.: Review of sampling hard-to-
reach and hidden populations for HIV surveillance. Aids 19, S67–S72 (2005)
21. McKinney, W.: Data structures for statistical computing in python. In: van der
Walt, S., Millman, J. (eds.) Proceedings of the 9th Python in Science Conference,
pp. 51 – 56 (2010)
22. Murphy, R.D., Gorbach, P.M., Weiss, R.E., Hucks-Ortiz, C., Shoptaw, S.J.: Seroad-
aptation in a sample of very poor Los Angeles area men who have sex with men.
AIDS Behav. 17(5), 1862–1872 (2013)
23. Pellowski, J.A., Kalichman, S.C., Matthews, K.A., Adler, N.: A pandemic of the
poor: social disadvantage and the US HIV epidemic. Am. Psychol. 68(4), 197 (2013)
24. Potterat, J.J., Phillips-Plummer, L., Muth, S.Q., Rothenberg, R., Woodhouse, D.,
Maldonado-Long, T., Zimmerman, H., Muth, J.: Risk network structure in the
early epidemic phase of HIV transmission in Colorado Springs. Sex. Transm. Infect.
78(suppl 1), i159–i163 (2002)
25. Rhodes, S.D., McCoy, T.P.: Condom use among immigrant Latino sexual minori-
ties: multilevel analysis after respondent-driven sampling. AIDS Educ. Prev. 27(1),
27–43 (2015)
Abstract. Fake news diffusion represents one of the most pressing issues
of our online society. In recent years, fake news has been analyzed from
several points of view, primarily to improve our ability to separate it
from legitimate news as well as to identify its sources. Within this vast
literature, a rarely discussed theme is likely to be of the utmost importance
for our understanding of such a controversial phenomenon: the analysis
of fake news perception. In this work, we approach this problem by
proposing a family of opinion dynamics models tailored to study how
specific social interaction patterns contribute to the acceptance, or refusal,
of fake news by a population of interacting individuals. To discuss the
peculiarities of the proposed models, we tested them on several synthetic
network topologies, thus highlighting when and how they affect the stable
states reached by the performed simulations.
1 Introduction
Nowadays, one of the most pressing and challenging issues in our continuously
growing and hyperconnected (online) world is identifying fake/bogus news to
reduce their effect on society. Like all controversial pieces of information, fake
news usually polarizes the public debate - both online and offline - with the side
effect of radicalizing population opinions, thus reducing the chances of reaching a
synthesis of opposing views. Moreover, such phenomena are usually amplified due
to the existence of stubborn agents, individuals that foster - either for personal
gain, lack of knowledge, or excessive ego - their point of view disregarding the
existence of sound opposing arguments or, even, debunking evidence. So far, the
leading efforts to study such a complex scenario have been devoted to: (i) identifying
fake news, (ii) debunking it, (iii) identifying the sources of fake news, and (iv)
studying how it spreads. Indeed, all such tasks carry challenges as
well as opportunities: each costly step ahead increases our knowledge of this
complex phenomenon, knowledge that can be applied to reduce its effect on
the public debate. Among such tasks, the analysis of how fake news diffuses is
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 943, pp. 370–381, 2021.
https://doi.org/10.1007/978-3-030-65347-7_31
Bounded Confidence, Stubborness and Peer Pressure 371
probably the most difficult to address. Even by restricting the analysis to the
online world, tracing the path of a content shared by users of online platforms
is not always feasible (at least extensively): it becomes impossible when
we consider that such content can diffuse across multiple services, of which we
usually have only a partial view. However, we can argue that - in the fake news
scenario - it is important not only how a given controversial content spreads (e.g., how
different individuals get in touch with it) but also how the population reached by such
content perceives it. Dangerous fake news can not only reach a broad audience
but is also capable of convincing it of its trustworthiness. The latter component
goes beyond the mere spreading process that allows it to become viral: it strictly
relates to individuals’ perception, opinions that are formed not only from the news
content itself but also through the social context of its users.
In this work, moving from such observation, we propose a family of opin-
ion dynamics models to understand the role of specific social factors on the
acceptance/rejection of fake news. Assuming a population composed of agents
aware of a given piece of information - each starting with its attitude toward
it - we study how different social interaction patterns lead to the consensus or
polarization of opinions. In particular, we model and discuss the effect that stubborn
agents, different levels of trust among individuals, open-mindedness, and
attraction/repulsion phenomena have on the population dynamics of fake news
perception.
The paper is organized as follows. In Sect. 2, the literature relevant to our
work is discussed. Subsequently, in Sect. 3, we describe the opinion dynamics
models we designed to describe and study the evolution of fake news perception.
In Sect. 4, we provide an analysis of the proposed models on synthetic networks
having heterogeneous characteristics. Finally, Sect. 5 concludes the paper
by summarizing our results and outlining future research directions.
2 Related Works
We present the literature review in two parts: first, we characterize fake news
and illustrate the main related areas of research. Then, we introduce opinion
dynamics and describe the most popular models.
Fake News Characterization. Before examining the central studies in the lit-
erature on the topic of fake news, it is appropriate to define the term itself.
There is no universal definition of fake news, but there are several explanations
and taxonomies in the literature. We define “fake news” to be news articles
that are intentionally and verifiably false and could mislead readers, as reported
in [1]. Indeed, identifying the components that characterize fake news is an open
and challenging issue [2]. Moreover, several approaches have been designed to
address the problem of unreliable content online: most of them propose meth-
ods for detecting bogus contents or their creators. Focusing on the target of
the analysis involving fake news, we can distinguish different areas of research:
372 C. Toccaceli et al.
creator analysis (e.g., bots detection [3]), content analysis (e.g., fake news iden-
tification [4]), social context analysis (e.g., the impact of the fake news and their
diffusion on society [5]).
Opinion Dynamics. Recently, opinion formation processes have been attracting
the curiosity of interdisciplinary experts. We hold opinions about virtually every-
thing surrounding us, opinions influenced by several factors, e.g., the individual
predisposition, the possessed information, the interaction with other subjects.
In [6], opinion dynamics is defined as the process that “attempts to describe
how individuals exchange opinions, persuade each other, make decisions, and
implement actions, employing diverse tools furnished by statistical physics, e.g.,
probability and graph theory”. Opinion dynamics models are often devised to
understand how certain assumptions on human behaviors can explain alternative
scenarios, namely consensus, polarization or fragmentation. Consensus
is reached when the stable state of the dynamics describes the population's agreement
on a single, homogeneous opinion cluster; polarization describes the simultaneous
presence of more than one well-defined, separated opinion cluster of
suitable size; finally, fragmentation corresponds to a disordered state with an
even larger number of small opinion clusters.
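To make the three scenarios concrete, the following sketch labels a final opinion vector by counting opinion clusters. The text gives no numeric criteria, so the gap used to separate clusters and the maximum number of clusters still called "polarization" are hypothetical thresholds chosen only for illustration.

```python
def opinion_clusters(opinions, gap=0.05):
    """Group opinions on a line: a new cluster starts when the distance
    between consecutive sorted opinions exceeds `gap`."""
    clusters = []
    for x in sorted(opinions):
        if clusters and x - clusters[-1][-1] <= gap:
            clusters[-1].append(x)
        else:
            clusters.append([x])
    return clusters


def classify_state(opinions, gap=0.05, max_polarized=3):
    """Label a stable state as consensus, polarization or fragmentation.

    Assumption: up to `max_polarized` clusters count as polarization,
    anything above that as fragmentation.
    """
    if not opinions:
        raise ValueError("no opinions to classify")
    k = len(opinion_clusters(opinions, gap))
    if k == 1:
        return "consensus"
    if k <= max_polarized:
        return "polarization"
    return "fragmentation"


print(classify_state([0.50, 0.51, 0.52]))
print(classify_state([-0.80, -0.79, 0.80, 0.81]))
print(classify_state([-0.9, -0.5, -0.1, 0.3, 0.7]))
```

The sketch ignores cluster sizes; a stricter variant could additionally require each cluster to contain a minimum share of the population before it is counted.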
Agent-based modeling is often used to understand how these situations are
achieved. In these models, each agent has a variable corresponding to his opin-
ion. According to the way opinion variables are defined, models can be classified
in discrete or continuous models. Among the classic models, we can distinguish:
the Voter model [7], the Majority rule model [8], and the Sznajd model [9],
which are discrete models that describe scenarios in which individuals have to
choose between two options on a given topic (for example, yes/no, true/false,
iPhone/Samsung). For the continuous models, on the other hand, the most
prominent ones are the Hegselmann-Krause (HK) model [10] and Deffuant-
Weisbuch model [11] that describe the contexts in which an opinion can be
expressed as a real value - within a given range - that can vary smoothly from
one extreme to the other, such as the political orientation of an individual.
Fig. 1. Weight example. Opinion xi is influenced by the opinions of the agents whose
opinions are most similar to its own, e.g., the agents in the yellow ellipse. At the end
of the interaction, xi approaches the opinions of the agents with heavier weights (as
shown visually by the change in position of xi).
The HK model converges in polynomial time, and its behavior is strictly related
to the expressed confidence level ε: the higher the value of ε, the lower the number
of opinion clusters when model stability is reached.
Given its definition, the HK model does not consider the strength of the ties
between agents. In a fake news scenario, we can suppose that when an agent i
reads a post on his Facebook wall concerning a news item A, the reliability attributed
by i to the content of the post is closely linked to the user who shared it, as
exemplified in Fig. 1. To adapt the HK model to include such specific information,
we extend it to leverage weighted, pair-wise interactions.
Definition 2 (Weighted-HK (WHK)). Unlike the HK model, during each
iteration WHK considers a random pair-wise interaction involving agents at
distance ε. Moreover, to account for the heterogeneity of interaction frequency
among agent pairs, WHK leverages edge weights, thus capturing the effect of
different strengths of social bonds and trust, as happens in reality. To this
end, each edge (i, j) ∈ E carries a value $w_{i,j} \in [0, 1]$. The update rule
then becomes:
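\[
x_i(t+1) =
\begin{cases}
x_i(t) + \dfrac{x_i(t) + x_j(t)\,w_{i,j}}{2}\,\bigl(1 - x_i(t)\bigr) & \text{if } x_i(t) \ge 0 \\[1ex]
x_i(t) + \dfrac{x_i(t) + x_j(t)\,w_{i,j}}{2}\,\bigl(1 + x_i(t)\bigr) & \text{if } x_i(t) < 0
\end{cases}
\tag{2}
\]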
The idea behind the WHK formulation is that the opinion of agent i at time
t + 1 is given by the combined effect of his previous belief and the weighted
average of his opinion and that of his selected ε-neighbor j, where $w_{i,j}$
accounts for the influence of, or trust in, j as perceived by i.
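As an illustration of Eq. (2), the following minimal Python sketch applies one pair-wise WHK step to agent i. It is not the NDlib implementation. The function name `whk_update`, the assumption that opinions lie in [-1, 1], the reading of "agents at distance ε" as |x_i − x_j| ≤ ε, and the final clamping are all assumptions made here for illustration.

```python
def whk_update(x_i, x_j, w_ij, epsilon):
    """Return x_i(t+1) after one pair-wise interaction, following Eq. (2).

    Assumptions (not stated in the text): opinions lie in [-1, 1] and
    the pair interacts only if |x_i - x_j| <= epsilon.
    """
    # Bounded confidence: agents that are too far apart do not interact.
    if abs(x_i - x_j) > epsilon:
        return x_i

    # Weighted average of i's opinion and j's trust-weighted opinion.
    half_sum = (x_i + x_j * w_ij) / 2.0

    if x_i >= 0:
        new_x = x_i + half_sum * (1.0 - x_i)
    else:
        new_x = x_i + half_sum * (1.0 + x_i)

    # Safeguard added for this sketch: keep the opinion inside [-1, 1].
    return max(-1.0, min(1.0, new_x))


if __name__ == "__main__":
    # Two like-minded agents with full trust: i moves further toward +1.
    print(whk_update(0.5, 0.5, 1.0, epsilon=1.0))
```

The factors (1 − x_i) and (1 + x_i) shrink the step as the opinion approaches either extreme, which is why the sketch treats ±1 as the bounds of the opinion range.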
Moreover, we can further extend the WHK model to account for more com-
plex interaction patterns, namely attractive-repulsive effects.
Definition 3 (Attraction WHK - (AWHK)). By “attraction”, we identify
those pair-wise interactions between agents that agree on a given topic. At the end
of the interaction, agent i begins to doubt his position and to share some thoughts
of j. For this reason, his opinion will tend to approach that of his interlocutor,
so $d_{i,j}(t) > d_{i,j}(t + 1)$.
After selecting the pair of agents i and j, the model has the following update
rule:
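\[
x_i(t+1) =
\begin{cases}
x_i(t) - \dfrac{\mathrm{sum}_{op}}{2}\,\bigl(1 - x_i(t)\bigr) & \text{if } x_i(t) \ge 0,\ x_j(t) \ge 0,\ x_i(t) > x_j(t) \\[1ex]
x_i(t) + \dfrac{\mathrm{sum}_{op}}{2}\,\bigl(1 - x_i(t)\bigr) & \text{if } x_i(t) \ge 0,\ x_j(t) \ge 0,\ x_i(t) < x_j(t) \\[1ex]
x_i(t) + \dfrac{\mathrm{sum}_{op}}{2}\,\bigl(1 + x_i(t)\bigr) & \text{if } x_i(t) < 0,\ x_j(t) < 0,\ x_i(t) > x_j(t) \\[1ex]
x_i(t) - \dfrac{\mathrm{sum}_{op}}{2}\,\bigl(1 + x_i(t)\bigr) & \text{if } x_i(t) < 0,\ x_j(t) < 0,\ x_i(t) < x_j(t) \\[1ex]
x_i(t) - \dfrac{\mathrm{sum}_{op}}{2}\,\bigl(1 - x_i(t)\bigr) & \text{if } x_i(t) \ge 0,\ x_j(t) < 0,\ \mathrm{sum}_{op} > 0 \\[1ex]
x_i(t) + \dfrac{\mathrm{sum}_{op}}{2}\,\bigl(1 - x_i(t)\bigr) & \text{if } x_i(t) \ge 0,\ x_j(t) < 0,\ \mathrm{sum}_{op} < 0 \\[1ex]
x_i(t) + \dfrac{\mathrm{sum}_{op}}{2}\,\bigl(1 + x_i(t)\bigr) & \text{if } x_i(t) < 0,\ x_j(t) \ge 0,\ \mathrm{sum}_{op} > 0 \\[1ex]
x_i(t) - \dfrac{\mathrm{sum}_{op}}{2}\,\bigl(1 + x_i(t)\bigr) & \text{if } x_i(t) < 0,\ x_j(t) \ge 0,\ \mathrm{sum}_{op} < 0
\end{cases}
\tag{3}
\]
where $\mathrm{sum}_{op} = x_i(t) + x_j(t)\,w_{i,j}$.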
The criterion is always the same: the new opinion of i results from the
combined effect of his initial opinion and that of the neighbor j. Each case
applies a different formula depending on whether the opinions of i and j are
discordant or not, so that we can guarantee that the difference between the
respective opinions is reduced after the communication.
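The eight cases of Eq. (3) can be transcribed literally, as in the following minimal Python sketch. The function name `awhk_update` is hypothetical, and leaving the opinion unchanged on ties that Eq. (3) does not cover (x_i = x_j on the same side, or sum_op = 0 on opposite sides) is an assumption of this sketch, not something the text specifies.

```python
def awhk_update(x_i, x_j, w_ij):
    """Return x_i(t+1) for an attractive interaction, case by case as in Eq. (3)."""
    sum_op = x_i + x_j * w_ij
    step = sum_op / 2.0

    if x_i >= 0 and x_j >= 0:
        if x_i > x_j:
            return x_i - step * (1.0 - x_i)
        if x_i < x_j:
            return x_i + step * (1.0 - x_i)
    elif x_i < 0 and x_j < 0:
        if x_i > x_j:
            return x_i + step * (1.0 + x_i)
        if x_i < x_j:
            return x_i - step * (1.0 + x_i)
    elif x_i >= 0 and x_j < 0:
        if sum_op > 0:
            return x_i - step * (1.0 - x_i)
        if sum_op < 0:
            return x_i + step * (1.0 - x_i)
    else:  # x_i < 0 and x_j >= 0
        if sum_op > 0:
            return x_i + step * (1.0 + x_i)
        if sum_op < 0:
            return x_i - step * (1.0 + x_i)

    # Ties are not covered by Eq. (3); this sketch leaves x_i unchanged.
    return x_i


if __name__ == "__main__":
    # Agent i at 0.8 talks to a trusted agent j at 0.2 and moves toward j.
    print(awhk_update(0.8, 0.2, 1.0))
```

In each covered case the sign in front of the step is the one that moves x_i in the direction of x_j, which is what distinguishes attraction from the repulsive rule.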
However, when observing real phenomena, we are used to identifying more
complex interactions where individuals influence each other despite their initial
opinions, getting closer to the like-minded individuals and moving apart from
ones having opposite views.
Definition 4 (Repulsive WHK - (RWHK)). This circumstance is called
“repulsion”: the opinions of the two agents tend to move apart. Consider the
situation where agent i communicates with an agent j holding an opposite belief.
At the end of the interaction, i will be even more convinced of his thoughts, and
his new opinion will be further away from that of j. So, when the communication
between the two agents ends, the opinion of i moves away from that of j
according to:
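\[
x_i(t+1) =
\begin{cases}
x_i(t) + \dfrac{\mathrm{sum}_{op}}{2}\,\bigl(1 - x_i(t)\bigr) & \text{if } x_i(t) \ge 0,\ x_j(t) \ge 0,\ x_i(t) > x_j(t) \\[1ex]
x_i(t) - \dfrac{\mathrm{sum}_{op}}{2}\,\bigl(1 - x_i(t)\bigr) & \text{if } x_i(t) \ge 0,\ x_j(t) \ge 0,\ x_i(t) < x_j(t) \\[1ex]
x_i(t) - \dfrac{\mathrm{sum}_{op}}{2}\,\bigl(1 + x_i(t)\bigr) & \text{if } x_i(t) < 0,\ x_j(t) < 0,\ x_i(t) > x_j(t) \\[1ex]
x_i(t) + \dfrac{\mathrm{sum}_{op}}{2}\,\bigl(1 + x_i(t)\bigr) & \text{if } x_i(t) < 0,\ x_j(t) < 0,\ x_i(t) < x_j(t) \\[1ex]
x_i(t) + \dfrac{\mathrm{sum}_{op}}{2}\,\bigl(1 - x_i(t)\bigr) & \text{if } x_i(t) \ge 0,\ x_j(t) < 0,\ \mathrm{sum}_{op} > 0 \\[1ex]
x_i(t) - \dfrac{\mathrm{sum}_{op}}{2}\,\bigl(1 - x_i(t)\bigr) & \text{if } x_i(t) \ge 0,\ x_j(t) < 0,\ \mathrm{sum}_{op} < 0 \\[1ex]
x_i(t) - \dfrac{\mathrm{sum}_{op}}{2}\,\bigl(1 + x_i(t)\bigr) & \text{if } x_i(t) < 0,\ x_j(t) \ge 0,\ \mathrm{sum}_{op} > 0 \\[1ex]
x_i(t) + \dfrac{\mathrm{sum}_{op}}{2}\,\bigl(1 + x_i(t)\bigr) & \text{if } x_i(t) < 0,\ x_j(t) \ge 0,\ \mathrm{sum}_{op} < 0
\end{cases}
\tag{4}
\]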
To integrate this idea into the model presented above, we add a binary flag
to each agent to denote it as “stubborn” or not. The update rule changes are
then straightforward: if the randomly selected agent is a stubborn one, he will
not update his opinion and, therefore, xi (t) = xi (t+1); otherwise, the previously
discussed update strategy is applied.
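The attraction rule, the repulsion rule, and the stubbornness flag can be combined in a few lines, as in the minimal Python sketch below. It relies on an observation that can be checked case by case: in every case of Eq. (3) the change in x_i has magnitude |sum_op / 2 · (1 ∓ x_i)| and points toward x_j, while Eq. (4) applies the same magnitude in the opposite direction. The names `pairwise_delta` and `update_opinion` are hypothetical, and how a simulation chooses between attraction and repulsion for a given pair is left to the caller, since that choice is not specified in the text above.

```python
def pairwise_delta(x_i, x_j, w_ij):
    """Signed change of x_i under attraction (Eq. 3), in compact form."""
    sum_op = x_i + x_j * w_ij
    # Remaining room toward the extreme on i's side of the spectrum.
    room = (1.0 - x_i) if x_i >= 0 else (1.0 + x_i)
    magnitude = abs(sum_op / 2.0 * room)
    # +1 if j is above i, -1 if below, 0 on a tie (no movement).
    direction = (x_j > x_i) - (x_j < x_i)
    return direction * magnitude


def update_opinion(x_i, x_j, w_ij, repulsive=False, stubborn=False):
    """Return x_i(t+1); stubborn agents never change their opinion."""
    if stubborn:
        return x_i
    delta = pairwise_delta(x_i, x_j, w_ij)
    # Repulsion (Eq. 4) mirrors attraction (Eq. 3).
    return x_i - delta if repulsive else x_i + delta


if __name__ == "__main__":
    print(update_opinion(0.8, 0.2, 1.0))                  # attraction
    print(update_opinion(0.8, 0.2, 1.0, repulsive=True))  # repulsion
    print(update_opinion(0.8, 0.2, 1.0, stubborn=True))   # unchanged
```

Keeping the stubbornness check outside the pair-wise rule mirrors the description above: the flag only decides whether the update strategy is applied at all.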
4 Experimental Analysis
This section describes the experimental analysis, focusing on its main
components: the selected network datasets, the designed experimental protocol,
and the obtained results. To foster reproducibility, the introduced
models have been integrated within the NDlib¹ Python library [14].
Results. We report the results obtained by AWHK and ARWHK on the pre-
viously described synthetic scenarios and, after that, we discuss the impact of
community structure on them. Edge weights, representing trust values among
agent pairs, are drawn from a normal distribution.
Attraction & Stubbornness. Figure 2 shows the results obtained by AWHK on
the scale-free network for different values of ε while keeping the percentage of
stubborn agents constant (90% of the individuals assume and maintain a
positive opinion). Different colors represent the agent’s initial opinion (positive,
¹ NDlib: Network Diffusion library. https://ndlib.readthedocs.io/.
Fig. 2. Effect of the stubborn agents varying epsilon on scale-free network in the
AWHK model. Stubborn population opinion evolution lines are omitted.
negative, or neutral). We can observe that, in the selected scenarios, increasing
the bounded confidence interval results in a more chaotic regime, characterized
by a subset of agents whose opinions fluctuate heavily toward the critical
mass introduced by the stubborn agents. The presence of stubborn agents affects
the evolution of opinions, since they act as pivots for those open to changing
their minds. We executed the same simulation varying the percentage of stubborn
agents and their set of initial opinions. As expected, we observed a similar
result when stubborn agents are tied to negative opinions, and an even more
chaotic regime when they are equally distributed over the opinion spectrum
(figures are not reported due to limited space). Stubborn agents thus act as
persuaders, bringing the opinion of the population closer to theirs. The higher
their number, the more evident their effect on the remaining population. As
previously stated, Fig. 2 reports the results observed in a scale-free scenario;
however, our experimental investigation shows that the observed trends can also
be identified in random and mean-field scenarios (with a significant reduction
of the chaotic regime due to the more regular topological structure).
Fig. 3. Effect of the stubborn agents varying epsilon on scale-free (first row) and
random network (second row) in the ARWHK model. Stubborn population opinion
evolution lines are omitted.
2
All images are taken from animations that reproduce the unfolding of the simulated
dynamic processes. Animations, as well as the python code to generate them, are
available at https://bit.ly/3jzp1Qs.
Fig. 4. Network visualizations. (a) Initial conditions of the nodes: four communities, two
prevalently negative (blue nodes), two positive (red nodes); (b) AWHK final equilibrium;
(c) ARWHK final equilibrium.
structure of the analyzed network that acts as boundaries for cross-cluster dif-
fusion. The network topologies considered in this analysis are exemplified in the
toy example of Fig. 4, that we will use to summarize the observed outcomes of
our analysis. Such a particular case study describes a setup in which network
nodes are clustered in four loosely interconnected blocks - two composed by
agents sharing opinions in the negative spectrum, the others characterized by
an opposite reality. In Fig. 4(a), we report the initial condition shared by two
simulations (one based on AWHK, the other on ARWHK) that will be further
discussed. Both simulations assume the same value ε = 0.85 and a fixed set of
stubborn agents (e.g., the 6 less community embedded nodes - namely, the ones
with the higher ratio among their intra-community degree and their total degree)
- which are prevalently allocated to the bigger negative (blue) community. While
performing a simulation that involves attraction, using AWHK, we can observe
how the resulting final equilibrium (Fig. 4(b)) converges toward a common spec-
trum. In particular, in this example, we can observe how stubborn agents can
make their opinion prevail, even crossing community boundaries. Indeed, such
a scenario can be explained in terms of the prevalence of negative stubborn
agents and the relative size of the negative communities (covering almost 3/5
of the graph). Conversely, when applying the ARWHK model, we get a com-
pletely different result, as can be observed in Fig. 4(c). Two strongly polarized
communities characterize the final equilibrium. In this scenario, stubborns have
a two-fold role: (i) they increase the polarization of their community by radi-
calizing agents’ opinions and, (ii) as a consequence, make rare the eventuality
of cross-community ties connecting moderate agents, thus ideologically breaking
apart the population. While varying the models’ parameters, our experimen-
tal analysis confirms the results obtained on the scale-free and random graphs:
well-defined mesoscale clusters prevalently slow-down convergence in case of a
population-wide agreement while accelerating the process of fragmentation.
5 Conclusion
In this paper, we modeled the response of individuals to fake news as an opinion
dynamics process. By modeling some of the patterns that regulate the exchange
of opinions regarding a given piece of news (namely, trust, attraction/repulsion,
and the existence of stubborn agents), we were able to draw a few interesting
observations on this complex, often not properly considered, context.
Our simulations underlined that: (i) differences in the topological interaction
layer are reflected in the time to convergence of the proposed models; (ii) the
presence of stubborn agents significantly affects the final system equilibrium,
especially when high confidence bounds regulate pair-wise interactions; (iii)
attraction mechanisms foster convergence toward a common opinion, while repulsion
mechanisms facilitate polarization.
As future work, we plan to extend the experimental analysis to real data
to understand the extent to which the proposed models can replicate observed
ground truths. Moreover, we plan to investigate the effect of higher-order inter-
actions on opinion dynamics, thus measuring the effect that peer-pressure has
on the evolution of individuals’ perceptions.
References
1. Allcott, H., Gentzkow, M.: Social media and fake news in the 2016 election. J.
Econ. Perspect. 31(2), 211–36 (2017)
2. Zhang, X., Ghorbani, A.A.: An overview of online fake news: characterization,
detection, and discussion. Inf. Process. Manage. 57(2), 102025 (2020)
3. Cresci, S., Di Pietro, R., Petrocchi, M., Spognardi, A., Tesconi, M.: DNA-inspired
online behavioral modeling and its application to spambot detection. IEEE Intell.
Syst. 31(5), 58–64 (2016)
4. Sharma, K., Qian, F., Jiang, H., Ruchansky, N., Zhang, M., Liu, Y.: Combating
fake news: a survey on identification and mitigation techniques. ACM Trans. Intell.
Syst. Technol. (TIST) 10(3), 1–42 (2019)
5. Visentin, M., Pizzi, G., Pichierri, M.: Fake news, real problems for brands: the
impact of content truthfulness and source credibility on consumers’ behavioral
intentions toward the advertised brands. J. Interact. Market. 45, 99–112 (2019)
6. Si, X.-M., Li, C.: Bounded confidence opinion dynamics in virtual networks and
real networks. J. Comput. 29(3), 220–228 (2018)
7. Holley, R.A., Liggett, T.M.: Ergodic theorems for weakly interacting infinite sys-
tems and the voter model. Ann. Probab. 3, 643–663 (1975)
8. Galam, S.: Minority opinion spreading in random geometry. Eur. Phys. J. B-
Condens. Matter Complex Syst. 25(4), 403–406 (2002)
9. Sznajd-Weron, K., Sznajd, J.: Opinion evolution in closed community. Int. J. Mod.
Phys. C 11(06), 1157–1165 (2000)
10. Hegselmann, R., Krause, U., et al.: Opinion dynamics and bounded confidence
models, analysis, and simulation. J. Artif. Soc. Soc. Simul. 5(3), 1–33 (2002)
11. Deffuant, G., Neau, D., Amblard, F., Weisbuch, G.: Mixing beliefs among inter-
acting agents. Adv. Complex Syst. 3(01n04), 87–98 (2000)
12. Mobilia, M.: Does a single zealot affect an infinite group of voters? Phys. Rev.
Lett. 91(2), 028701 (2003)
13. Wu, F., Huberman, B.A.: Social structure and opinion formation. arXiv preprint
cond-mat/0407252 (2004)
14. Rossetti, G., Milli, L., Rinzivillo, S., Sîrbu, A., Pedreschi, D., Giannotti, F.: NDlib:
a Python library to model and analyze diffusion processes over complex networks.
Int. J. Data Sci. Anal. 5(1), 61–79 (2018)
15. Erdős, P., Rényi, A.: On random graphs I. Publicationes Mathematicae Debrecen
6, 290 (1959)
16. Barabási, A.-L., Albert, R.: Emergence of scaling in random networks. Science
286(5439), 509–512 (1999)
17. Lancichinetti, A., Fortunato, S., Radicchi, F.: Benchmark graphs for testing com-
munity detection algorithms. Phys. Rev. E 78(4), 046110 (2008)
Influence Maximization for Dynamic
Allocation in Voter Dynamics
1 Introduction
While providing new channels to guide and influence people for public benefits
(e.g., health [14] or education [27]), the increasing use of social media has also
led to a wider spread of fake news and misinformation [15]. Given that people’s
opinions can be influenced and changed by peer interactions [19] and mass media
[30], it is of great importance to understand ways to guide public opinions or
prevent manipulation. This problem has been formalized as the well-known topic
of influence maximization (IM) [12]. The crux of IM is to strategically select the
most influential subsets of agents in the network as the seeds to propagate a given
opinion held by an external partisan (referred to as controller in the following)
throughout the network, in order to maximize the expected number of agents
adopting the opinion.
So far, the majority of research on IM is based on variants of the independent
cascade (IC) model [4,5,10,13]. These models simulate the propagation of influ-
ence as a one-off activation, i.e., once activated, agents keep committed to an
opinion. However, in many real-world settings, individuals may repeatedly flip
their opinions back and forth due to peer and media influence, e.g., attitudes
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 943, pp. 382–394, 2021.
https://doi.org/10.1007/978-3-030-65347-7_32
towards public or political issues. Since IC-like models only allow a single acti-
vation for each agent, they fail to address the above scenarios. Instead, dynamic
models which allow agents to switch their opinions in both directions are suit-
able for modelling such an opinion formation process. In this work, we focus on
the voter model because of its prominence in the literature and its conceptual
simplicity in which the opinion dynamics is treated as a linear system and can
be solved analytically in simple topologies (e.g., star networks) [17].
Typically, the IM problem is explored without time constraints, subject only
to a budget constraint that limits the resources that can be allocated to agents
in the network [4,5,10,13]. However, recent research based on real-world data
shows that time plays a critical role in influence propagation [7,8]. For exam-
ple, many practical applications of IM have natural time constraints. Indeed,
some researchers have incorporated temporal aspects in IM [1–3,9,11,16,18].
Related to our modeling approach, Brede et al. [3] are the first to explore the
IM under time constraints in voter dynamics. However, their paper does not
allow controllers to allocate different amounts of resources over time, which
is in contrast to real-world scenarios such as marketing [31], where the mar-
keters can choose the start of campaigning. Representative works considering
effects of time scales and activating agents depending on stages of the diffusion
process include [1,2,11,18]. Specifically, [2] concentrates on minimizing the dif-
fusion time by targeting agents with different levels of connectivity at different
stages of the contagion process. However, this problem is addressed in a non-
competitive setting where only a single external controller spreads its influence
in the network. In addition, [11,18] explore the optimal sequential seeding for
influence maximization in static and evolving networks respectively. However,
both of them aim to maximize the influence in the stationary state, which is not
suitable for real-world events with time limitation, and they also only investigate
this problem in the presence of a single controller. Given that competition for
influence is common in real-world contexts (e.g., political campaigns [29] or rad-
icalization prevention [24]), the single-controller setting has a restricted range
of applications. The only directly related study [1] solves the time-constrained
IM in the presence of more than one controller by considering when to initiate
opinion propagation via reinforcement learning. However, it focuses on verifying
the effectiveness of the q-learning framework from an algorithmic perspective
and does not explore the mechanism behind the optimal strategies. Besides, like
other models discussed above, [1] is essentially static (i.e., only allows a single
activation of agents) and not appropriate for modeling changing opinions.
To bridge these gaps in research about intertemporal influence allocations, we
study the IM problem for dynamic allocation in voter dynamics under time and
budget constraints in the presence of two opposing controllers. Here, we explore
the dynamic allocation for the constant opponent setting where one active con-
troller competes against a known and fixed-strategy opponent. In the context
of dynamic allocations, the active controller has to design a strategy to make
efficient use of its budget over time. This results in the following trade-off: If
the controller starts allocating later, it has more disposable budget per unit of
time but less time left for its influence to become effective. To address this issue,
we make the following contributions: (i) We are the first to define the dynamic
allocation for IM in voter dynamics, where controllers have the flexibility to
determine when to start control. (ii) To explore the network’s influence propa-
gation timescales, we use the heterogeneous mean-field method [3] and Taylor
expansions to derive estimates for relaxation timescales towards equilibrium on
scale-free networks. (iii) We demonstrate the value of our derivation and algo-
rithm by conducting numerical experiments to address the following aspects.
First, to explore the dependence of timescales towards equilibrium on network
configurations, we investigate networks with different degrees of heterogeneity
characterized by different degree exponents. Second, we use the interior-point
optimization method [21] to obtain the optimal starting time under the time
constraint.
Our main findings are as follows: (i) In the constant-opponent setting, as we fix
one controller to start control from the very beginning, the optimal strategy for
the optimizing controller is to initially leave the system subject to the influence
of the opposing controller and then only use its budget closer to the end of the
campaign. (ii) For short time horizons, the optimized controller tends to start
control later in highly heterogeneous networks compared to less heterogeneous
networks. In contrast, for long time horizons, an earlier start is preferred for
highly heterogeneous networks.
The remainder of the paper is organized as follows. In Sect. 2 we describe
the model we use for dynamic allocation. In Sect. 3 we show the main results for
optimal dynamic allocation. The paper concludes with a summary and future
work in Sect. 4.
2 Model Description
Apart from the budget constraint, ai (t), bi (t) also need to be non-negative, i.e.,
ai (t) ≥ 0, bi (t) ≥ 0.
To proceed, we consider the following updating process of opinions according
to voting dynamics [25]. At time t, one of the agents in the network, e.g., agent
i, is selected randomly. Then, agent i selects an in-neighbour or a controller
at random with a probability proportional to the weight of the incoming link
(including control gains from controllers). Here, we follow the mean-field rate
equation for probability flows [17] by introducing xi as the probability that
agent i has opinion A. We then have:
\[
\frac{dx_i}{dt} = (1 - x_i)\,\frac{\sum_j w_{ji}\,x_j + a_i(t)}{\sum_j w_{ji} + a_i(t) + b_i(t)}
 - x_i\,\frac{\sum_j (1 - x_j)\,w_{ji} + b_i(t)}{\sum_j w_{ji} + a_i(t) + b_i(t)}. \tag{1}
\]
Here, for numerical optimization of Eq. (4), we use the Runge-Kutta method
[22] to integrate Eq. (1) and obtain the optimal starting time of controller A by
interior-point optimization algorithm [21].
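To make the numerical integration of Eq. (1) concrete, the following is a minimal, pure-Python sketch of a classical fourth-order Runge-Kutta step. It is not the authors' code. The function names, the adjacency-list layout `in_neighbors[i] = [(j, w_ji), ...]`, and the choice to hold the allocations a and b constant within each step are assumptions made for illustration. The sketch also assumes every node has a positive total incoming weight.

```python
def voter_rhs(x, in_neighbors, a, b):
    """Right-hand side of Eq. (1) for all nodes."""
    dx = []
    for i, nbrs in enumerate(in_neighbors):
        pull_a = sum(w * x[j] for j, w in nbrs) + a[i]          # toward opinion A
        pull_b = sum(w * (1.0 - x[j]) for j, w in nbrs) + b[i]  # toward opinion B
        total = sum(w for _, w in nbrs) + a[i] + b[i]           # normalisation
        dx.append(((1.0 - x[i]) * pull_a - x[i] * pull_b) / total)
    return dx


def rk4_step(x, dt, in_neighbors, a, b):
    """Advance x by one classical RK4 step of size dt."""
    def shifted(slope, h):
        return [v + h * s for v, s in zip(x, slope)]

    k1 = voter_rhs(x, in_neighbors, a, b)
    k2 = voter_rhs(shifted(k1, dt / 2.0), in_neighbors, a, b)
    k3 = voter_rhs(shifted(k2, dt / 2.0), in_neighbors, a, b)
    k4 = voter_rhs(shifted(k3, dt), in_neighbors, a, b)
    return [v + dt / 6.0 * (p + 2.0 * q + 2.0 * r + s)
            for v, p, q, r, s in zip(x, k1, k2, k3, k4)]


def integrate(x0, t_end, dt, in_neighbors, a, b):
    """Integrate Eq. (1) from 0 to t_end with constant allocations."""
    x = list(x0)
    for _ in range(int(round(t_end / dt))):
        x = rk4_step(x, dt, in_neighbors, a, b)
    return x


if __name__ == "__main__":
    # Toy example: 3-node ring, unit weights, equal allocations from A and B.
    ring = [[(1, 1.0), (2, 1.0)], [(0, 1.0), (2, 1.0)], [(0, 1.0), (1, 1.0)]]
    print(integrate([0.0, 0.0, 0.0], 100.0, 0.1, ring, [1.0] * 3, [1.0] * 3))
```

In the toy example all nodes are equivalent and both controllers allocate equally, so the vote shares settle at a/(a + b) = 0.5. A time-dependent allocation such as a delayed start can be handled by calling `rk4_step` with different `a` values before and after the start time.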
3 Results
We start our analysis with exploring the timescales towards equilibrium by ana-
lyzing relaxation times for networks with different degrees of heterogeneity in
Sect. 3.1. Our approach is based on a heterogeneous mean-field approximation of
Eq. (1). We then carry out numerical experiments to obtain the optimal strate-
gies for the dynamic allocation in Sect. 3.2. All of our experiments are based
on uncorrelated random scale-free networks with power-law degree distribution
pk ∝ k −λ constructed by the configuration model [6].
[Figure 1 plots omitted. Panel (a): relaxation time (τ_relax,k) versus degree. Panel (b): x-axis degree exponent (λ), with a Mean-Field curve.]
Fig. 1. Results for networks with N = 10000 and average degree ⟨k⟩ = 10.5, averaged over
100 realizations. Both controller A and controller B start control at time 0 and allocate
0.1 (Fig. 1(a)) or 1 (Fig. 1(b)) resource on each node per unit time.
Figure 1(a) shows the dependence of the relaxation time τ_relax,k on degree k, calculated via
direct integration using a Runge-Kutta method, the mean-field estimate of Eq. (8),
and a Taylor expansion of Eq. (9) in k up to 2nd order, on networks with exponent
1.6. The data are represented as box plots with median, 25th and 75th percentiles, and
whiskers extending to the maximum or minimum values. Figure 1(b) shows the dependence
of the relaxation time τ_relax on network heterogeneity, calculated numerically via integration
and analytically by the mean-field approximation. Error bars indicate 95% confidence
intervals.
The trend of $\tau_{relax,k}$ with degree k is mainly determined by the constant term
$\frac{\alpha-1}{\alpha}$ and the first-order term $\frac{a+b}{\alpha} k^{-1}$. Specifically, the coefficient
$\frac{a+b}{\alpha}$ is negative, which leads to the observation that the larger the degree of a
node is, the longer the relaxation time will be. Moreover, the second-order term
$\frac{(a+b)^2}{\alpha} k^{-2}$ reduces the difference between relaxation times for nodes of different degrees.
To proceed, we compare τrelax,k calculated via direct integration using a
Runge-Kutta method, the mean-field estimate of Eq. (8) and a Taylor expansion
of Eq. (9) in Fig. 1 (a). From Fig. 1(a), it can be seen that xk is monotonically
increasing with degree k. This phenomenon is consistent with Gershgorin’s circle
theorem [28]. According to the Gershgorin theorem, eigenvalues for nodes with
degree k of Eq. (5) lie within at least the discs with radii
$-1 + \frac{1}{1 + \frac{a_k + b_k}{k}}$ around
zero. As we assume that the controllers target all nodes uniformly, the larger the
degree of nodes, the smaller the absolute values of eigenvalues. In other words, the
larger the node’s degree, the longer its relaxation time scales towards equilibrium.
Additionally, we find that the mean-field method and Taylor expansion are in
reasonable agreement with numerical estimates for τrelax,k .
Furthermore, the overall average relaxation time (i.e., network’s natural
timescales towards equilibrium) for the equally targeting case is given by:
\[
\tau_{relax} = \sum_k p_k\,\tau_{relax,k}
 = \sum_k p_k\,\frac{k\bigl(\beta(a+b) + a\bigr) + \beta(a+b)^2}{\beta(a+b)(a+b+k)} \tag{10}
\]
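Eq. (10) can be evaluated directly once a degree distribution is fixed. The Python sketch below does so for a truncated power law p_k ∝ k^(−λ). The function names, the truncation range [k_min, k_max], and the parameter values in the example are illustrative assumptions. β is simply the symbol appearing in Eq. (10); its definition is not part of this excerpt, so the value used here is a placeholder.

```python
def tau_relax_k(k, a, b, beta):
    """Per-degree relaxation time, i.e. the summand of Eq. (10)."""
    c = a + b
    return (k * (beta * c + a) + beta * c ** 2) / (beta * c * (c + k))


def average_tau_relax(lam, a, b, beta, k_min, k_max):
    """Average of tau_relax_k over p_k ~ k**(-lam) on [k_min, k_max]."""
    ks = range(k_min, k_max + 1)
    weights = [k ** (-lam) for k in ks]
    norm = sum(weights)  # normalise the truncated power law
    return sum(w / norm * tau_relax_k(k, a, b, beta)
               for k, w in zip(ks, weights))


if __name__ == "__main__":
    # Placeholder parameters: a = b = 1, beta = 1, degrees from 2 to 100.
    for lam in (1.6, 5.0):
        print(lam, average_tau_relax(lam, 1.0, 1.0, 1.0, 2, 100))
```

For positive a, b, and β, the summand increases with k, in line with the observation above that higher-degree nodes relax more slowly.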
where t is determined implicitly by: SA (t ) = lSA (∞) for x0 ≤ lSA and SA (t ) =
(2 − l)SA (∞) for x0 ≥ (2 − l)SA (∞). This equation defines an average timescale
at which the vote-share dynamics approaches the desired l-percentage vote shares
when the initial state x0 is less than lSA or greater than (2 − l)SA (∞).
To explore the relationship between the l-percentage relaxation time and
transient control, we plot the dependence of relaxation time on the degree of
equilibrium l and network heterogeneity in Fig. 2 (a). We clearly see a cross-over
of $\tau^{l}_{relax}$ in Fig. 2 (a): relaxation times are larger for less heterogeneous networks
than for more heterogeneous networks for low l, but this ordering is reversed
for large degree of equilibrium (see inset in Fig. 2 (a)). We hypothesize that this
is a consequence of the characteristic dynamics toward equilibrium in heteroge-
neous networks occurring via two stages. To illustrate this point, we visualize
the evolution of vote shares for high-degree and low-degree nodes in Fig. 2 (b).
In more detail, we sort nodes according to their degrees in ascending order. Then
we assign the first 80% as low-degree nodes and the rest as high-degree nodes
according to the Pareto principle [23]. To explore which role they play in the
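transient dynamics, we compute the state changes $\frac{dx_i}{dt}$ grouped by low-degree
nodes (i.e. $\sum_{low} \frac{dx_i}{dt}$) and high-degree nodes (i.e. $\sum_{high} \frac{dx_i}{dt}$). Then the average
contributions of low-degree nodes and high-degree nodes to the vote-share
changes are
\[
\frac{0.2 \sum_{low} \frac{dx_i}{dt}}{0.8 \sum_{high} \frac{dx_i}{dt} + 0.2 \sum_{low} \frac{dx_i}{dt}}
\quad\text{and}\quad
\frac{0.8 \sum_{high} \frac{dx_i}{dt}}{0.8 \sum_{high} \frac{dx_i}{dt} + 0.2 \sum_{low} \frac{dx_i}{dt}},
\]
respectively. In this way,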
we obtain Fig. 2 (b), where we also compare vote-share changes for networks
constructed for different degree exponents. In Fig. 2 (b), we see that a large pro-
portion of vote-share changes is caused by the low-degree nodes at the beginning
of the evolution. As the evolution proceeds, the dynamics of high-degree nodes
[Figure 2 plots omitted. Panel (a): relaxation time τ^l_relax versus degree of equilibrium (l), with mean-field (MF) and Runge-Kutta (RK) curves for λ = 1.6 and λ = 5 and an inset for l between 0.8 and 1. Panel (b): percentage versus time (t), for high-degree and low-degree nodes with λ = 1.6 and λ = 5.]
Fig. 2. Results for networks with N = 10000 and average degree ⟨k⟩ = 10.5, represented
by error bars with 95% confidence intervals over 100 realizations. Both controller A
and controller B start control at time 0 and allocate 1 resource on each node per unit
time. The legend “λ = 1.6(5)” corresponds to the power-law distribution $P(k) \propto k^{-1.6(-5)}$.
Figure 2 (a) shows the dependence of relaxation time on the degree of equilibrium l and network
heterogeneity, obtained by the Runge-Kutta and mean-field methods. Figure 2 (b) shows the
evolution of the proportions of average vote-share changes. “Low-degree nodes” and “high-degree
nodes” refer to the 80% of nodes with the lowest degrees and the 20% with the highest degrees. The
y-axis shows the proportion of the average state changes for high-degree nodes and
low-degree nodes in the total changes.
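The split into low-degree and high-degree contributions plotted in Fig. 2 (b) can be sketched in Python as follows. The function name `degree_group_shares` and its `low_fraction` parameter are hypothetical. The weights 0.2 and 0.8 are the ones given in the text; they amount to comparing per-node averages, because the low-degree group contains four times as many nodes as the high-degree group. The sketch assumes the denominator is non-zero.

```python
def degree_group_shares(degrees, dxdt, low_fraction=0.8):
    """Return (low_share, high_share) of the average state change.

    degrees[i] and dxdt[i] are the degree and dx_i/dt of node i.
    """
    # Sort node indices by degree, ascending, and split at low_fraction.
    order = sorted(range(len(degrees)), key=lambda i: degrees[i])
    n_low = round(low_fraction * len(order))
    low, high = order[:n_low], order[n_low:]

    sum_low = sum(dxdt[i] for i in low)
    sum_high = sum(dxdt[i] for i in high)

    # Weights from the text: 0.2 for the low group, 0.8 for the high group.
    w_low, w_high = 1.0 - low_fraction, low_fraction
    denom = w_high * sum_high + w_low * sum_low
    return w_low * sum_low / denom, w_high * sum_high / denom


if __name__ == "__main__":
    # Five nodes changing at the same rate: both groups contribute equally.
    print(degree_group_shares([1, 1, 1, 1, 10], [0.1] * 5))
```

When every node changes at the same rate, the two shares are equal, which confirms that the weighting removes the effect of group size.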
[Figure 3 plots omitted. Six panels (a–f) with mean-field (MF) and Runge-Kutta (RK) curves for λ = 1.6 and λ = 5; x-axes: time (t) and start time for controller A (ta).]
Fig. 3. (a) and (d) show the evolution of total vote shares when controller
A follows optimal control for time horizons T = 16 and T = 256, respectively. The
turning points are the times when controller A starts control. (b) and (e) show the
dependence of vote shares on controller A’s starting time for time horizons T = 16 and
T = 256, respectively. (c) and (f) show the dependence of the degree of equilibrium l on ta.
To find the optimal control time, a balance has to be struck between budget
per node and degree of equilibrium. All the calculations are based on networks with
N = 10000 and average degree ⟨k⟩ = 10.5 and are averaged over 100 realizations. Controller B always
starts its control from time 0. The black squares and blue triangles stand for networks
with degree exponents λ = 1.6 and λ = 5, respectively.
Figs. 3 (b) and (e). We note that the dependence has a concave shape with a
maximum, which is marked with arrows in Figs. 3 (b) and (e). The peaks of the
curves are consistent with the optimal starting times in Figs. 3 (a) and (d)
(see the turning points in Figs. 3 (a) and (d)). In more detail, the maximum
value of the vote share is the result of a trade-off. On the one hand, if ta is small,
the controller will have more time left to influence the network but with small
resource allocations on each node per unit time. In other words, the final vote
shares are determined by $l\,S_A(\infty)$. Though an early start brings the system closer
to equilibrium (i.e., l becomes larger), small resource allocations result in a
small vote share in equilibrium (i.e., $S_A(\infty)$ becomes smaller). On the other
hand, if controller A starts late, it will have more resource allocations on each
node per unit time, which leads to a larger value of vote share in equilibrium,
but there will be less time left for the exerted influence to become effective.
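The trade-off can be illustrated with a deliberately simplified toy calculation; this is not the paper's heterogeneous model or its interior-point optimization. The sketch assumes that all nodes are identical with degree k, in which case Eq. (1) reduces to dx/dt = (a − (a + b)x)/(k + a + b), with equilibrium a/(a + b). It further assumes that controller A spreads a fixed per-node budget evenly over [ta, T], while controller B allocates a constant b from time 0. All names and parameter values are hypothetical.

```python
import math


def final_vote_share(t_a, horizon, budget, b, k, x0=0.5):
    """Vote share of A at time `horizon` if A starts control at t_a < horizon."""
    # Phase 1 (0 to t_a): only B is active, so x decays toward 0.
    x = x0 * math.exp(-b * t_a / (k + b))

    # Phase 2 (t_a to horizon): A spends its budget at a constant rate.
    duration = horizon - t_a
    a = budget / duration
    x_eq = a / (a + b)               # equilibrium vote share
    rate = (a + b) / (k + a + b)     # relaxation rate toward x_eq
    return x_eq + (x - x_eq) * math.exp(-rate * duration)


def best_start_time(horizon, budget, b, k, steps=1600):
    """Grid search over start times in [0, horizon)."""
    candidates = [horizon * s / steps for s in range(steps)]
    return max(candidates,
               key=lambda t: final_vote_share(t, horizon, budget, b, k))


if __name__ == "__main__":
    T, k, b = 16.0, 10.5, 1.0
    budget = b * T  # same total per-node budget as the opponent
    t_best = best_start_time(T, budget, b, k)
    print(t_best, final_vote_share(t_best, T, budget, b, k))
```

Starting at ta = 0 gives A the longest time but the smallest allocation rate, while starting very close to T gives a large rate with almost no time to act. Even in this toy setting the best start time lies strictly between the two extremes.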
Additionally, Fig. 4 shows the dependence of the optimal starting time of the
targeting controller (T − opt{ta }) on network heterogeneity and time horizons.
[Figure 4 plot omitted. x-axis: time horizons (T), from 1 to 512, with an inset covering T = 128 to 512; the legend includes λ = 1.6.]
Generally, the optimized controller only uses its budget near the end of the
campaign. This means that the system is initially only subject to the influence of
the opponent. Only close to the end of the campaign T does the optimized
controller exert several times the allocations of its opponent on the network. In
doing so, the system approaches the equilibrium $\frac{a}{a+b}$ gradually, which can be seen
from the monotonic rise of vote shares in Figs. 3 (a) and (d). In addition, for
short time horizons, optimal control on highly heterogeneous networks tends
to start later, while for long time horizons, optimal control on highly
heterogeneous networks should start slightly earlier.
This dependence of optimal starting times on network heterogeneity can be
explained by our earlier observations in Fig. 2 (a). For short time horizons, the
network is still far from equilibrium at the end of the competition. In other
words, the network’s degree of equilibrium l is small, which corresponds to the
lower-left corner of Fig. 2 (a). Therefore, the state changes of vote shares are
dominated by the low-degree nodes, which have shorter timescales. As highly
heterogeneous networks have more low-degree nodes, they respond much more
quickly to resource allocations. Consequently, campaigns on highly heterogeneous
networks should start slightly later than on less heterogeneous networks. In
contrast, for long time horizons, the network is close to equilibrium at the end of
the competition. In this case, the network’s degree of equilibrium l approaches
1. From Fig. 2 (a), for a sufficiently large l, the more heterogeneous the net-
work, the larger the relaxation time. As a result, highly heterogeneous networks
respond much more slowly to resource allocations, which explains an earlier start
in optimized control.
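The trade-off between a late start (a stronger allocation per unit time) and the relaxation time of the system can be illustrated with a deliberately simplified sketch. It is not the network model of this paper: it assumes a single well-mixed population in which every node has degree k and all neighbours share the same vote share x, so that dx/dt = (a(t) − (a(t) + b) x) / (k + a(t) + b). The budget, the opponent allocation b, the initial state, and the function names are illustrative assumptions.

```python
def final_vote_share(t_start, horizon, budget=10.0, b=1.0, k=4.0, x0=0.0, dt=0.01):
    """Euler-integrate the toy mean-field vote share of controller A up to `horizon`.

    Controller A spends `budget` uniformly over [t_start, horizon]; the opponent
    allocates a constant b per unit time. All parameter values are assumptions.
    """
    if not 0.0 <= t_start < horizon:
        raise ValueError("t_start must lie in [0, horizon)")
    a_active = budget / (horizon - t_start)  # allocation per unit time once active
    x = x0
    steps = int(round(horizon / dt))
    for step in range(steps):
        t = step * dt
        a = a_active if t >= t_start else 0.0
        # Relaxation towards a / (a + b) at rate (a + b) / (k + a + b).
        x += dt * (a - (a + b) * x) / (k + a + b)
    return x


def best_start(horizon, candidates=50, **kwargs):
    """Grid search over start times; returns (best start time, resulting vote share)."""
    starts = [horizon * i / candidates for i in range(candidates)]
    return max(((s, final_vote_share(s, horizon, **kwargs)) for s in starts),
               key=lambda pair: pair[1])
```

In this toy model the relaxation time is (k + a + b)/(a + b), so a smaller degree k shortens it, which is in line with the observation that low-degree nodes respond on shorter timescales.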
392 Z. Cai et al.
4 Conclusion
In this paper, we explore the IM problem under the dynamic allocation setting
where controllers have the flexibility to determine when to start control. Our
focus is on determining optimal starting times of campaigns on heterogeneous
networks. Our contributions are threefold. (i) We extend
research on transient control in the dynamic allocation setting. (ii) We analyze
how the natural timescales of networks affect optimal control in networks with
different degrees of heterogeneity. (iii) We numerically obtain the dependence of
optimal strategies on time horizons. In addition, we have obtained the following
main results. (i) The network has a natural timescale for information propagation.
The controller must balance the start-up time to leave enough time for its
application of control to take effect. This implies that for a network with high
heterogeneity, given a short time horizon, the optimized controller will start
controlling later. In contrast, for long time horizons and highly heterogeneous
networks, it is preferable to start earlier. (ii) In the constant-opponent setting,
by allowing the opponent to consume its budget first, the competing controllers
can dominate the campaign at later stages. An interesting direction for future
work is to assign different allocations as well as starting time for individual nodes
under the framework of dynamic allocation.
References
1. Ali, K., Wang, C.Y., Chen, Y.S.: A novel nested q-learning method to tackle time-
constrained competitive influence maximization. IEEE Access 7, 6337–6352 (2018)
2. Alshamsi, A., Pinheiro, F.L., Hidalgo, C.A.: When to target hubs? Strategic diffu-
sion in complex networks. arXiv preprint arXiv:1705.00232 (2017)
3. Brede, M., Restocchi, V., Stein, S.: Effects of time horizons on influence maximiza-
tion in the voter dynamics. J. Complex Netw. 7(3), 445–468 (2019)
4. Budak, C., Agrawal, D., El Abbadi, A.: Limiting the spread of misinformation in
social networks. In: Proceedings of the 20th International Conference on World
Wide Web, pp. 665–674 (2011)
5. Carnes, T., Nagarajan, C., Wild, S.M., Van Zuylen, A.: Maximizing influence in a
competitive social network: a follower’s perspective. In: Proceedings of the Ninth
International Conference on Electronic Commerce, pp. 351–360 (2007)
1 Introduction
Infamous waves of uprisings (e.g., Black Lives Matter, Women’s March, Occupy
Wall Street) are commonly characterized by the significant use of social media
to share information prior to, as well as during, protests to reach a critical
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 943, pp. 395–407, 2021.
https://doi.org/10.1007/978-3-030-65347-7_33
396 C. J. Kuhlman et al.
Table 1. Communication mechanisms of the CKF model evaluated in this work, individually
and in combination. These mechanisms may be operative in contagion initiation,
propagation, or both. Mechanism abbreviations are denoted in [·].

Mechanism                       Description
Common knowledge [CK]           This is a common knowledge mechanism characterized by
                                bicliques in social networks. This mechanism can initiate
                                contagion, and can drive contagion propagation. No seeded
                                nodes with contagion are required.
Neighborhood dynamics [ND2]     This is influence (communication) produced by neighbors
                                within distance-2 of an ego node. This mechanism
                                propagates contagion.
Population dynamics [PD2]       Since agents (nodes) know both states and thresholds of
                                agents within distance-2, an agent can infer information
                                about the numbers of agents currently in state 1, even
                                when these other agents are at geodesic distances of 4 or
                                more. This mechanism propagates contagion.
example, i writes information about herself on k’s wall, then j knows i’s infor-
mation by reading k’s wall, without directly communicating with i. The informa-
tion thus travels distance-2, from i to k to j. Multiple mechanisms are operative
in the CKF model, including CK itself, network dynamics, and local and global
interactions. Hence, it is of interest to understand the effects of mechanisms on
the spread of contagions. We aim to develop computational models of the CKF
model mechanisms to study these mechanisms individually and in combination,
to quantify their effects on the spread of collective action. Table 1 describes these
mechanisms, which are formalized in Sect. 3.
Figure 1 provides an example illustrating all three mechanisms summarized
in Table 1. In this network, there are 7 people with different thresholds. Based on
the CKF model [13] summarized in Sect. 3, for agents to participate (i.e., tran-
sition to state 1), they need to share common knowledge with a group of people
(they need to form a complete bipartite graph), and their thresholds should
be less than the size of the common knowledge set (i.e., the group they share
common knowledge with). In this example, agents 1, 2, 3, and 4 each have a threshold
of 3, indicating that each needs at least 3 other people to participate
(i.e., transition to state 1) in order to participate. These four people form a complete
bipartite graph (a square) that allows them to generate common knowledge
about their willingness to participate. They know each other's thresholds and
know that they are sufficiently low for them to jointly participate and achieve
mutual benefits. Hence, they transition to state 1 at t = 1. This is referred to
as the common knowledge [CK] mechanism. On the other hand, agent 5, who
shares common knowledge of thresholds with agents 1, 2, and 4 (through the
4-node star network centered at agent 2), has a threshold of 6, which is not low
enough for him to participate with the other 3 players that he shares CK with.
Agent 5 also is part of CK node sets {2, 5, 6} (a 3-node star centered at agent 5)
and {5, 6, 7} (a 3-node star centered at agent 6), but cannot transition to state 1
for the same reason. Similarly, persons 6 and 7 do not transition to state 1 at
t = 1.
Since agent 2 is within distance-2 of agent 6 (friend-of-friend), agent 6 knows
agent 2’s threshold and state (action), through the Facebook wall or timeline of
agent 5. At t = 2, agent 2’s state is 1 and her fixed threshold is 3. Thus, agent 6
knows that at least four agents are in state 1. Agent 6’s threshold is satisfied and
she transitions to state 1. This is the population dynamics [PD2] mechanism.
Finally, at t = 3, person 7 will transition to state 1 as a result of the neighbor-
hood dynamics [ND2] mechanism: it has one activated neighbor (agent 6) within
distance-2 to meet its threshold of 1. All of the state transitions in this example
are made formal in Sect. 3.
Following others who study contagion dynamics on networks (e.g., [19]), we quan-
tify contagion dynamics on five web-based social networks that range over 6×
(i.e., over a factor of 6) in numbers of nodes, 4× in numbers of edges, 4× in aver-
age degree, 13× in maximum degree, and 80× in average clustering coefficient.
Thresholds range over 3×. We construct agent-based models and a framework
that can turn on and off any combination of mechanisms in simulations of con-
tagion dynamics. (The CKF model is presented in Sect. 3; simulation process
is given in Sect. 5.) There are also companion theoretical results, but owing to
space limitations, these will be included in an extended version of the paper.
as a function of time, show little difference between the effects of the [CK] mech-
anism alone, versus the [CK] and [ND2] mechanisms combined. However, there
are cases (e.g., for the P2P network with θ = 8 and pp = 0.2), where the addi-
tion of [ND2] to [CK] increases the spread fraction by more than 50%. (ii) The
[PD2] mechanism dominates the other two mechanisms for driving contagion in
particular cases (e.g., for the Facebook (FB) network). In other cases, the [CK]
mechanism dominates (e.g., for several cases for the Wiki network). The more the
average degree decreases relative to the threshold, the more the [PD2] mechanism can
dominate; the more the average network degree increases, the more the [CK] mechanism
dominates. This is because a star subgraph is a form of biclique, and the more
nodes in a biclique, the more threshold assignments will cause state transition
owing to CK. (iii) If [CK] and [PD2] mechanisms are already operative, then
there is no increase in spread fraction if mechanism [ND2] is incorporated. (iv)
There are combinations of simulation conditions (e.g., network, threshold, par-
ticipation probability, CK model mechanisms) that can produce small or large
spread size changes by varying only one of these inputs.
2. Sensitivity of contagion dynamics to average degree dave . The spread
size (large or small) is driven by the magnitude of average degree relative to the
node threshold θ assigned uniformly to all nodes. In all five networks, spread size
can be large (e.g., spread fraction > 0.5). If dave > θ, then outbreak sizes are
large; if dave < θ, then outbreak sizes are smaller. We demonstrate a pronounced
effect on spread size even when the magnitudes of dave and θ are close.
2 Related Work
There are several studies that model web-based social media interactions, includ-
ing the following. The spread of hashtags on Twitter is modeled using a thresh-
old model in [17]. Diffusion on Facebook is modeled in [18], and a similar type
of mechanism on Facebook is used to study the resharing of photographs [4].
None of these works uses the “wall” or “timeline” mechanism of Facebook that
is modeled here in the CKF model. Several unilateral models and applications
were identified in the Introduction. These are not repeated. Here, we focus on
game-theoretic common knowledge models, in particular, the CKF model.
A couple of data mining studies have used Facebook walls, including an
experimental study [7]. Features of cascades on Facebook are studied using user
wall posts [10], but again, these are cascades of the conventional social influence
type; there is no assessment of CK-based coordination.
There have been a few works on the CKF model, which was initially intro-
duced in [13]. Details of the game-theoretic formulation are provided there.
For example, the CKF model is not efficiently computable because finding all
bicliques in a network is an NP-hard problem. This makes studying CKF on
very large networks (e.g., with 1 million or more nodes) extremely difficult. An
approximate and computationally efficient CKF model is specified in [14]. CK
dynamics on networks that are devoid of key players is studied in [15]. None of
these investigates the individual and combinations of mechanisms of Table 1.
3 Model
3.1 Preliminaries
where −z < 0 is the penalty she gets if she activates and not enough people join
her. Thus, a person will activate as long as she is sure that there is a sufficient
number of people (in the population) in state 1 at t. A person always gets utility
0 by staying in state 0 regardless of what others do since we do not consider
free-riding problems. When she transitions to the active state, she gets utility 1
if the total number of other people activating is at least θi . (Note that these
“others” do not have to be neighbors of i, as in unilateral models.)
The CKF model describes Facebook-type (friend-of-friend) communication
in which friends write to and read from each other's Facebook walls and this
information is also available to their friends of friends. The mechanisms and their
implications are described below. The communication network indicates that if
{i, j} ∈ E, then node i (resp., j) communicates (θi , ait ) (resp., (θj , ajt )) to node
j (resp., i) over edge {i, j} at time t, and this information is available to j’s
(resp., i’s) neighbors. The communication network helps agents to coordinate by
creating common knowledge at each t. Agents’ presence on the network (online
or offline) is captured by the participation probability 0 ≤ pp ≤ 1 for each node,
which determines whether a node is participating in the contagion dynamics at
each t; e.g., whether i is online or offline at t in Facebook.
Here we describe the three mechanisms in this model (cf. Table 1), and their
implications. Figure 1 illustrates these mechanisms through an example. First of
all, the CKF model describes a Facebook type communication which allows for
Interaction Mechanisms in Common Knowledge Models 401
For the common knowledge [CK] mechanism of Table 1, the biclique sub-
graph is the structure necessary for creation of CK among a group of people [13],
and allows them to jointly activate. We first compute all node-maximal bicliques
in G, which is an NP-hard problem [2]. Let M biclique denote a set of nodes
of G that forms a biclique. Then, V in Eq. (1) is replaced with M biclique . At
each t, Eq. (1) is computed for each i ∈ V in each CK set M biclique for which
i ∈ M biclique .
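The structural condition behind the [CK] mechanism can be sketched as follows. This is only a minimal illustration of what "forming a biclique" means for two given node groups, plus the special case of star subgraphs; it is not the maximal-biclique enumeration of [2], and the adjacency-dictionary format and function names are assumptions made for this example.

```python
from itertools import product


def is_biclique(adj, left, right):
    """Return True if every node in `left` is adjacent to every node in `right`.

    `adj` maps each node to the set of its neighbours in an undirected graph.
    `left` and `right` must be disjoint, non-empty node collections.
    """
    left, right = set(left), set(right)
    if not left or not right or left & right:
        return False
    return all(v in adj[u] for u, v in product(left, right))


def star_bicliques(adj):
    """Yield (centre, leaves) pairs: each node with its neighbours forms a star."""
    for node, neighbours in adj.items():
        if neighbours:
            yield node, set(neighbours)
```

For a square on nodes 1 to 4, `is_biclique(adj, {1, 3}, {2, 4})` holds, which corresponds to the four-agent example of Fig. 1.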
Finally, the population dynamics [PD2] mechanism indicates that a node
i that is in state 0 can infer a minimum number of nodes already in state 1 if a
neighbor j in Ni2 is already in state 1, by knowing θj . Formally,
\[
a_{it} =
\begin{cases}
1 & \text{if } a_{i,t-1} = 1 \text{ or } \max\{\theta_j : j \in N_i^2,\ a_{j,t-1} = 1\} + 1 \ge \theta_i, \\
0 & \text{otherwise.}
\end{cases}
\tag{3}
\]
Assume ai,t−1 = 0. If j ∈ Ni2 and aj,t−1 = 1, with θj , then i can infer that at
least θj + 1 nodes are in state 1. Now, if θi ≤ θj + 1, then i will transition to
state 1; i.e., ait = 1.
At each time t − 1, all operative mechanisms are evaluated, independently,
for each i ∈ V for which ai,t−1 = 0. If any of the three mechanisms causes i to
transition, then ait = 1.
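The [PD2] rule of Eq. (3) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the simulation code of this work: the graph is an adjacency dictionary, the distance-2 neighbourhood includes direct neighbours, all nodes are updated synchronously, and the participation probability is ignored. The function names are hypothetical.

```python
def distance2_neighbourhood(adj, i):
    """Nodes within geodesic distance 2 of i (excluding i itself)."""
    reach = set(adj[i])
    for j in adj[i]:
        reach |= adj[j]
    reach.discard(i)
    return reach


def pd2_step(adj, theta, state):
    """One synchronous application of Eq. (3); returns the new state dict."""
    new_state = dict(state)
    for i in adj:
        if state[i] == 1:
            continue  # nodes in state 1 stay in state 1
        active = [theta[j] for j in distance2_neighbourhood(adj, i) if state[j] == 1]
        # An active node j with threshold theta_j implies at least theta_j + 1 active nodes.
        if active and max(active) + 1 >= theta[i]:
            new_state[i] = 1
    return new_state
```

On a path 2, 5, 6 with agent 2 active and a threshold of 3, an agent 6 with an assumed threshold of 4 transitions, while an agent 5 with a threshold of 6 does not.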
4 Social Networks
The web-based networks of this study are summarized in Table 2. FB is a Face-
book user network [20], P2PG is a peer-to-peer network, Wiki is a Wikipedia
network of voting for administrators, and Enron is an Enron email network [16].
All but the SF1 network are real (i.e., mined) networks. SF1 is a scale-free
(SF) network generated by a standard preferential attachment method [3] to
fill in gaps of the real networks. For networks possessing multiple components,
we use the giant component. These networks have wide-ranging properties and
hence represent a broad sampling of web-based mined network features. Figure 2
shows the average degrees per network in the original graphs G, corresponding
to a geodesic distance of 1, and in the squares of the graphs, G^2, which are particularly
relevant to CK model dynamics (forthcoming in Sect. 6).
Fig. 2. Average vertex degree for geodesic distances 1 and 2 (i.e., for G^1 and G^2), which
are relevant for the CK, ND2, and PD2 mechanisms for driving contagion through
networks.
Parameter                        Description
Agent thresholds θ               Uniform threshold values for a simulation: all nodes in
                                 a network have the same value. Values range from
                                 θ = 8 through θ = 29.
Participation probabilities pp   Uniform value for all nodes in a simulation. Values in
                                 the range of 0.05 to 0.4.
Model mechanisms                 [CK], [ND2], and [PD2] mechanisms described in
                                 Table 1. [CK] is always operative to initiate contagion.
Seed vertices                    No specified seed vertices; all vertices initially in
                                 state 0. CK model initiates contagion without seeds.
Simulation duration tmax         30 and 90 time steps.
6 Simulation Results
In this section, we present the results of our agent-based model simulations. All
results provided are average results from 30 runs.
Effects of CKF model mechanisms on contagion dynamics. We analyze
the effects of the [CK], [ND2], and [PD2] mechanisms (described in Table 1) and
their combinations on the time histories of activated nodes for each network of
the study. Figure 3 contains time histories for the fraction of nodes in state 1 over
time for the Wiki network. In this simulation, all nodes have threshold θ ≈ dave =
29. The mechanism combinations are [CK] only, [CK] plus [ND2], [CK] plus
[PD2], and [CK] plus [ND2] plus [PD2] (i.e., all) mechanisms. In Fig. 3a, pp =
0.1; in Fig. 3b, pp = 0.4. Several observations are important. First, the [ND2]
mechanism does not contribute significantly to the driving force to transmit
contagion in the system. This is seen in the first plot in that the magenta curve
is only slightly above the blue curve, i.e., the addition of [ND2] to [CK] results in a
small increase in spread fraction (i.e., fraction of agents in state 1). In comparing
the orange and green curves in the left plot, we observe that adding [ND2] to
[CK]+[PD2] does not increase spread fraction. The same two comparisons in the
right plot (where pp = 0.4) give the same conclusion. Second, the addition of
the [PD2] mechanism to the [CK] mechanism can produce significant increases
in spread fraction (comparing blue and green curves). Third, in moving from the
left to the right plot, the spread fractions increase, for a given time t, and the
contagion spreads more rapidly, with increasing pp . These findings are shown for
all networks, as described below.
Fig. 3. Wiki network results for θ = dave = 29: (a) pp = 0.1, (b) pp = 0.4. Cumulative
fraction of agents in state 1 is plotted as a function of time in simulation for combina-
tions of different mechanisms (in Table 1). Each propagation mechanism is isolated for
different simulations and is represented by a different curve; however, [CK] (labeled CK)
is always operative. In (a), the blue ([CK] only) and magenta ([CK] + [ND2]) curves are
close together, indicating that for pp = 0.1, the distance-2 classic diffusion mechanism
[ND2] provides a relatively small increase to the overall contagion driving force. In
(b), the blue ([CK] only) and magenta ([CK] + [ND2]) curves overlay; this means that
[ND2] provides no noticeable increase in driving force for contagion spreading. [PD2]
(green and orange curves) in both plots provides significant additional driving force,
since the green and orange curves are well above the blue and magenta curves. The
all-mechanisms curves (orange, denoted all in the legend) coincide with the
[CK]+[PD2] curves (green). This means that, again, [ND2] does
not provide much driving force to spread a contagion.
7 Conclusion
We evaluate the CKF contagion model on a set of networks with wide-ranging
properties, for a range of thresholds and participation probabilities. We model
and investigate multiple mechanisms of contagion spread (initiation and prop-
agation), as well as the full model. We find evidence that the [CK] and [PD2]
mechanisms are the major driving forces for the contagion initiation and spread,
compared to [ND2]. These types of results are being used to specify conditions
for impending human subject experiments that will evaluate CK and its mecha-
nisms (e.g., [12]), and will be used to assess the predictive ability of the models.
References
1. Ahmed, N.K., Alo, R.A., Amelink, C.T., et al.: net.science: a cyberinfrastructure
for sustained innovation in network science and engineering. In: Gateway (2020)
2. Alexe, G., Alexe, S., Crama, Y., Foldes, S., Hammer, P.L., Simeone, B.: Consensus
algorithms for the generation of all maximal bicliques. DAM 145, 11–21 (2004)
3. Barabasi, A., Albert, R.: Emergence of scaling in random networks. Science 286,
509–512 (1999)
4. Cheng, J., Adamic, L.A., Dow, P.A., Kleinberg, J., Leskovec, J.: Can cascades be
predicted? In: WWW (2014)
5. Chwe, M.S.Y.: Communication and coordination in social networks. Rev. Econ.
Stud. 67, 1–16 (2000)
6. Chwe, M.S.Y.: Structure and strategy in collective action. Am. J. Sociol. 105,
128–156 (1999)
7. Devineni, P., Koutra, D., Faloutsos, M., Faloutsos, C.: If walls could talk: Patterns
and anomalies in Facebook wallposts. In: ASONAM, pp. 367–374 (2015)
8. Gonzalez-Bailon, S., Borge-Holthoefer, J., Rivero, A., Moreno, Y.: The dynamics
of protest recruitment through an online network. Sci. Rep. 1, 1–7 (2011)
9. Granovetter, M.: Threshold models of collective behavior. Am. J. Soc. 83(6), 1420–
1443 (1978)
10. Huang, T.K., Rahman, M.S., Madhyastha, H.V., Faloutsos, M., et al.: An analysis
of socware cascades in online social networks. In: WWW, pp. 619–630 (2013)
11. Kempe, D., Kleinberg, J., Tardos, E.: Maximizing the spread of influence through
a social network. In: KDD, pp. 137–146 (2003)
12. Korkmaz, G., Capra, M., Kraig, A., Lakkaraju, K., Kuhlman, C.J., Vega-Redondo,
F.: Coordination and common knowledge on communication networks. In: Proceed-
ings of the AAMAS Conference, 10–15 July, pp. 1062–1070, Stockholm (2018)
13. Korkmaz, G., Kuhlman, C.J., Marathe, A., et al.: Collective action through com-
mon knowledge using a Facebook model. In: AAMAS (2014)
14. Korkmaz, G., Kuhlman, C.J., Ravi, S.S., Vega-Redondo, F.: Approximate conta-
gion model of common knowledge on Facebook. In: Hypertext, pp. 231–236 (2016)
15. Korkmaz, G., Kuhlman, C.J., Vega-Redondo, F.: Can social contagion spread with-
out key players? In: BESC (2016)
16. Leskovec, J., Krevl, A.: SNAP Datasets: Stanford large network dataset collection,
June 2014. http://snap.stanford.edu/data
17. Romero, D., Meeder, B., Kleinberg, J.: Differences in the mechanics of information
diffusion. In: WWW (2011)
18. Sun, E., Rosenn, I., Marlow, C.A., Lento, T.M.: Gesundheit! modeling contagion
through Facebook news feed. In: ICWSM (2009)
19. Tong, H., Prakash, B., Eliassi-Rad, T., Faloutsos, M., Faloutsos, C.: Gelling, and
melting, large graphs by edge manipulation. In: CIKM, pp. 245–254 (2012)
20. Viswanath, B., Mislove, A., Cha, M., Gummadi, K.P.: On the evolution of user
interaction in Facebook. In: WOSN, August 2009
21. Watts, D.: A simple model of global cascades on random networks. Proc. Nat.
Acad. Sci. (PNAS) 99(9), 5766–5771 (2002)
Using Link Clustering to Detect Influential
Spreaders
1 Introduction
Spreading processes, originally examined in areas such as disease modelling [1, 2] and
epidemiological mathematics [3], are increasingly examined to study social phenomena
of information diffusion within complex networks, such as the spreading of rumours
[4] or communication during crisis events [5]. They also gain special
relevance considering the global COVID-19 pandemic, possibly yielding results to better
understand and mitigate its spread. As a result of the way users connect and interact with
each other, the social networks used for these analyses often exhibit properties of small-
world and scale-free networks [6], making topological characteristics of the network an
important aspect when analysing spreading processes. A common goal of the mentioned
studies is to predict the efficiency of the spreading process, with the broader intention to
acquire knowledge on how to control it. In this context, both local and global network
properties related to spreading processes have to be explored. This paper focuses on local
properties (topological properties of the network attributed to nodes) but goes beyond
their immediate neighbourhood. Using local properties to predict spreading efficiency,
Kitsak et al. [7] already showed that the most efficient spreaders within a network are not
necessarily the most connected nodes (i.e. nodes with the highest degree), but the ones
that are located in densely connected cores of the network indicated by a high k-shell
index. However, apart from being located within the core of a network or having a certain
degree, structural properties of the network such as its tendency to form clusters might
also yield an informative measure to predict the spreading efficiency. The general idea is
that someone who is a member of many different overlapping social groups (workplace,
sports club, friendship circles) is better able to inject information into various
densely connected regions of the network, where it circulates further. This notion of
overlapping social groups is operationalised by applying the link-clustering approach
by Ahn, Bagrow & Lehmann [8], where resulting clusters are highly interleaved and
sometimes even nested. The approach clusters the links instead of the nodes, resulting
in nodes possibly belonging to multiple clusters. It is then reasonable to assume that the
number of groups a node belongs to predicts its spreading capability as well as or even
better than its k-shell index. Thus, we derive the following research questions:
2 Background
Spreading processes and the analysis of potential spreaders have a long history in science
and generally describe a flow of information between actors or members of a network [9].
For complex networks such as computer networks or networks of real individuals,
information can refer to diseases and computer viruses [1, 10], whereas for other networks
(e.g. created from social media data), it can refer to opinions, news articles or influence
[11–13]. The spreading, i.e. the flow of information within these networks, can result in
diverse operationalisations, and obviously in opposing motives regarding its analysis. For
disease spreading, potential mitigation strategies are sought after, whereas for influence
maximisation or opinion spreading, strategies to accelerate the flow of information are
desired. For that reason, influencing factors are of great interest. Within complex
networks where spreading can only happen between adjacent nodes, there is one aspect
that affects the spreading, and likewise the efficiency of a single spreader, to an equal
degree regardless of the motive of the analysis: the topology of the network (see [9]).
With regard to this topology, the origin of the diffusion process (i.e. the spreader) is of
interest, as these so-called “seeds” [14] and their properties yield important information
from which inferences regarding the efficiency of the spreading process can be drawn.
410 S. Krukowski and T. Hecking
3 Approach
In this paper, it is hypothesised that actors who connect groups in different social contexts
and thus are part of different overlapping and nested link communities are capable spread-
ers. The underlying assumption is that information items, diseases, etc. mainly circulate
within densely connected groups. Actors in the overlap of such groups can be infected
within one group and inject the spreading process into several other groups. To this end,
we investigate whether the membership in multiple link communities (see Sect. 3.1) is
another factor that determines spreading capability in addition to the node’s k-shell index
[7]. It is further argued that the number of link communities a node belongs to, which is
also a topological feature of the network organisation, indicates that the spreader
has close connections to many other actors from different sub-communities within the
network and is thus able to spread between different highly interconnected communities
more easily. Kitsak et al. [7] argue that, especially during the early stages of spreading
processes, the k-shell index of a node predicts its spreading capability because of the
many pathways that exist for nodes located within the core of the network. However, nodes
within these cores also tend to exhibit multiple community memberships, which yields
additional inferential value compared with taking only the node's k-shell index
into account (see Fig. 1). Additionally, when multiple outbreaks happen, information
can spread more easily between different sub-communities, whereas for different cores,
the distance between them needs to be taken into account [7]. The Link Communities
approach by Ahn, Bagrow, & Lehmann [8] offers a method to determine overlapping
communities, as it assigns multiple community-memberships to a single node. From
this, inferences regarding the spreading capability of single nodes can be drawn.
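For reference, the k-shell index used as the baseline measure can be computed by iterative pruning. The following is a minimal sketch of the standard peeling procedure, assuming an undirected graph stored as a dictionary of neighbour sets; the function name is hypothetical and the code is not taken from this paper.

```python
def k_shell_index(adj):
    """Return {node: k-shell index} for an undirected graph {node: set(neighbours)}."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    shell = {}
    k = 0
    while remaining:
        # Peel every node whose remaining degree is at most k; repeat until none is left.
        peel = [v for v in remaining if degree[v] <= k]
        if not peel:
            k += 1
            continue
        for v in peel:
            shell[v] = k
            remaining.discard(v)
            for u in adj[v]:
                if u in remaining:
                    degree[u] -= 1
    return shell
```

A triangle with one pendant node illustrates the idea: the pendant node is assigned shell 1, while the three triangle nodes form the 2-shell.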
Fig. 1. On the left, an example from Kitsak et al. (2010) can be seen. Nodes within the core of the
network, i.e. with a high k-Shell index, were found to be good spreaders. On the right, the same
network is shown, but clustered with the Link Communities approach by Ahn et al. (2010). Nodes
with black borders are nodes which are hypothesized to exhibit a high spreading capability. In this
example, nodes with multiple community memberships also exhibit a high community centrality.
For the above example (Fig. 2), this would result in S = 1/4. A dendrogram is then built
through single-linkage hierarchical clustering and cut at a certain threshold according
to the partition density, which then results in the link communities [8]. From these link
communities, the community memberships of the nodes can be derived, and thus each
node is assigned a vector of community memberships, from which the actual number of
communities it belongs to can be calculated. The possibility to detect multiple community
memberships differentiates our approach from other approaches analysing properties of
influential spreaders [20], where nodes can only belong to one community each due to
the community detection algorithm being used.
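Two building blocks of this step can be sketched as follows. The link similarity is written here as the Jaccard index of the inclusive neighbourhoods of the two non-shared endpoints, which is the definition given by Ahn et al. [8]; the membership count simply tallies the distinct link communities touching each node. The data formats (an adjacency dictionary and an edge-to-label mapping) and the function names are assumptions for illustration, and the hierarchical clustering and partition-density cut are not reproduced.

```python
from collections import defaultdict


def link_similarity(adj, i, j):
    """Jaccard similarity of the inclusive neighbourhoods of i and j.

    Intended for two links e_ik and e_jk that share a node k.
    """
    n_i = adj[i] | {i}
    n_j = adj[j] | {j}
    return len(n_i & n_j) / len(n_i | n_j)


def membership_counts(link_communities):
    """Map each node to the number of distinct link communities its links belong to.

    `link_communities` maps an edge (u, v) to a community label (hypothetical format).
    """
    memberships = defaultdict(set)
    for (u, v), label in link_communities.items():
        memberships[u].add(label)
        memberships[v].add(label)
    return {node: len(labels) for node, labels in memberships.items()}
```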
4 Evaluation
To evaluate the capability of nodes to spread information through the network, spread-
ing processes are simulated according to well-known SIR models [3]. The process starts
with one initially infected node. This node infects its neighbours at a given infection rate
(denoted as β) and recovers. The resulting infected nodes then try to infect their neigh-
bours themselves. The process terminates when no new infections occur. In this study,
the spreading capability of a node x corresponds to the average fraction of the population
infected in 100 SIR runs starting at node x. For the community detection, we cut
the resulting dendrogram at a smaller threshold to also detect smaller sub-communities.
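The simulation procedure described above can be sketched as follows. This is a minimal discrete-time SIR sketch under stated assumptions: each infected node gets exactly one chance to infect each susceptible neighbour with probability β and then recovers, and the seed counts towards the infected fraction. The function names and the adjacency-dictionary format are hypothetical.

```python
import random


def sir_run(adj, seed, beta, rng):
    """One discrete-time SIR run; returns the fraction of nodes ever infected."""
    infected = {seed}
    recovered = set()
    while infected:
        newly_infected = set()
        for node in infected:
            for nbr in adj[node]:
                if nbr not in infected and nbr not in recovered and rng.random() < beta:
                    newly_infected.add(nbr)
        recovered |= infected  # each infected node recovers after one step
        infected = newly_infected
    return len(recovered) / len(adj)


def spreading_capability(adj, seed, beta=0.07, runs=100, rng_seed=0):
    """Average infected fraction over `runs` SIR runs started at `seed`."""
    rng = random.Random(rng_seed)
    return sum(sir_run(adj, seed, beta, rng) for _ in range(runs)) / runs
```

Fixing `rng_seed` makes the estimate reproducible; averaging over more runs reduces its variance.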
Analysed Datasets. To evaluate our measure, we chose to use both real-world networks
to increase the external validity, as well as artificially generated networks to maintain
a high internal validity. The generated networks were created according to the forest-
fire algorithm as shown by Leskovec, Kleinberg & Faloutsos [22], as it creates networks
with properties typical of real-world graphs such as heavy-tailed degree distributions and
communities. To this end, 8 undirected networks were created, each with 1000 nodes,
with forward burning probabilities (fwprob) ranging from .05 to .40 (Table 1) and a
fixed backward burning probability of 1 (see [22]). The fwprob controls for the tendency
to form densely connected and potentially nested clusters. Additionally, we analysed
one ego network from SNAP (the Stanford Network Analysis Project at Stanford
University) [23] to evaluate our metric on real-world data. The network represents the
connections between all friends of the individual from whom the ego network is derived
(the ego), with all connections between the ego and the friends removed. To increase
informative value and evaluate our measure on a bigger network than our generated ones,
we chose the 107 network because of its size (1912 nodes and 53498 edges). It has an
average local clustering coefficient of .534 and a mean spreading capability of .683.
Table 1. Analysed datasets. We used the average local clustering coefficient. The spreading
capability is the mean spreading capability of all nodes.
The Imprecision Function. To quantify the importance of nodes with a high commu-
nity centrality during the spreading, an objective measure has to be calculated. The
imprecision function serves just that purpose. Similar to Kitsak et al. (2010), these func-
tions are calculated for each of the three relevant measures, and they are denoted as
εkS (p), εCC (p) and εd (p), respectively. For each fraction p of nodes (here, p refers to a
specific percentage of the dataset), two sets are considered: the nodes with the highest
spreading capability and the nodes with the highest value according to one of the three
measures. The average spreading of each set is calculated (denoted as ϕeff and as ϕkS , ϕCC
and ϕd , respectively). Then, the difference in spreading between the
414 S. Krukowski and T. Hecking
p nodes with highest values in any of the measures and the most efficient spreaders is
calculated. Formally, for εCC , the function is defined as follows:
$$
\varepsilon_{CC}(p) = 1 - \frac{\phi_{CC}(p)}{\phi_{eff}(p)} \qquad (2)
$$
Note that through subtracting the fraction from 1, higher values correspond to more
imprecision, while smaller values for ε show less imprecision and thus a better measure.
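To make Eq. (2) concrete, the following is a minimal sketch of how the imprecision could be computed for one measure. The function name `imprecision`, the dictionaries and the toy values are assumptions chosen for illustration and are not taken from the paper.

```python
def imprecision(measure, spreading, p):
    """Return 1 - phi_measure(p) / phi_eff(p) for a fraction p of the nodes.

    measure   -- dict node -> value of the ranking measure (e.g. community centrality)
    spreading -- dict node -> spreading capability of the node
    """
    nodes = list(spreading)
    top_n = max(1, round(p * len(nodes)))  # number of nodes in the top fraction

    def mean_spreading(ranking_key):
        # Average spreading capability of the top_n nodes under a given ranking.
        top = sorted(nodes, key=ranking_key, reverse=True)[:top_n]
        return sum(spreading[v] for v in top) / top_n

    phi_measure = mean_spreading(lambda v: measure[v])
    phi_eff = mean_spreading(lambda v: spreading[v])
    return 1.0 - phi_measure / phi_eff


# Hypothetical toy values, not results from the paper.
spreading = {"a": 0.9, "b": 0.8, "c": 0.4, "d": 0.2}
community_centrality = {"a": 3.0, "b": 1.0, "c": 2.5, "d": 0.5}

eps = imprecision(community_centrality, spreading, p=0.5)
```

Ranking the nodes by their own spreading capability gives an imprecision of 0, which is the lower bound any other measure is compared against.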
4.2 Results
In the following, the results of the evaluation will be presented. All of the evaluations
are calculated at a beta value (infection rate) of 7%.
Descriptive Results
graphs. To additionally evaluate this on a real network, we created the same plot for the
107 network (Fig. 5) at beta = 1 (high beta values would result in little variance of the
spreading capability), where a similar pattern as above and in Kitsak et al. [7] emerges.
Even more clearly than for the generated networks, it can be seen that high values in
either community centrality, degree or k-Shell index, go along with a high spreading
capability. Thus, and also considering what can be seen in Fig. 3, both the k-Shell index
and the Link Communities approach are comparable in predicting the spreading capabil-
ity (RQ2). However, this effect shows less so for our generated networks at high fwprob
values than for our analysed 107 network.
Fig. 4. Bivariate distribution for the examined measures on our generated networks. The colours
indicate the spreading capability.
Correlational Measures. Correlations between the CC and the established measures are
calculated. The mean correlation across all of our generated networks between the CC
and the k-Shell index of a node is r = .56 and r = .54 between CC and the degree. All
measures correlate with the spreading capability of a node (r = .40 for CC, r = .40 for
degree and r = .47 for k-Shell). In the 107 network, the CC also correlates with the other
measures (r = .90 with k-Shell and r = .88 with degree) and with the spreading capability
of a node (r values at .50 for CC, .55 for degree and .63 for k-Shell). Regarding RQ2,
this makes the CC comparable to the k-Shell index of a node.
Imprecision Function. Apart from using correlational measures and looking at the
distribution of data points, the imprecision function, which focuses on the top n nodes,
is applied to all of the studied networks in order to objectively evaluate the proposed
measure in comparison to the other measures. It was calculated for all of our generated
networks at beta = 7%. As can be seen very clearly in Fig. 6, the errors decrease with
higher fwprob values, reflecting the observation described above and indicating that
the predictive value of all measures (the CC included) seems to increase when there
is a stronger community structure. However, with M = 0.18 (SD = 0.20) for the k-Shell
index, M = 0.10 (SD = 0.12) for degree and M = 0.15 (SD = 0.20) for community
centrality, they are generally quite low, with the k-Shell index showing the highest
average error. For the 107 network, the error values are especially low at .01 for degree,
.02 for k-Shell and .01 for the CC. As there is no significant difference between the
imprecision of our measure and the imprecision of other measures for both real and
generated networks, the measures are comparable (RQ2), while the low error values
additionally indicate the CC to be a good predictor for the spreading capability of a node
(RQ1).
Fig. 6. Results of the imprecision function for our generated networks.
5 Discussion
The aim of this paper is to contribute to research on spreading processes within com-
plex networks and identify properties which increase the efficiency of spreading. To
this end, a novel measure is introduced, from which inferences regarding the spreading
capability of single actors can potentially be drawn. Using link clustering [8] to infer
multiple community memberships of nodes, we evaluate and compare the measure to
other approaches. Generally, our results extend the understanding of the effect of struc-
tural properties of networks on information diffusion beyond centralities and k-shell and
show that the community centrality of a node is comparable in predicting its spreading
capability.
Descriptively, the average values of the measures indicate a comparability (RQ2):
Higher fwprob values and thus a higher community structure go along with higher k-
Shell indices, degrees and community centrality. The bivariate plots (Fig. 4) extend this
(fixed to 10), and likewise the beta value at .7 (for the generated networks). Future studies
should also vary these fixed parameters and try to systematise them like the systematised
fwprob values in this paper. Additionally, the spreading capabilities obtained are
not deterministic, meaning that they depend on chance. It is therefore possible that
another run yields slightly different results.
Conclusion and Outlook. Our analyses and evaluations showed that apart from the
k-Shell index and centralities, structural properties of the network can affect spreading
processes. To this end, community centrality, the examined measure, proved to be
comparable in doing so. Along with the other analysed measures, the CC’s inferential value
increases as the fwprob used to create the network with the forest fire algorithm [22]
increases, meaning that for increasing real-world and scale-free properties of a graph,
the CC becomes better in predicting efficient spreaders. While the k-Shell index of a
node seems to be a better predictor for the spreading capability under certain conditions,
there might be applications in which the k-Shell index yields little inferential value or
where there are multiple outbreaks simultaneously. In this case, the community central-
ity might be used instead of the k-Shell index coupled with the distance between cores
[7]. This paper contributes to our understanding of the underlying processes through
offering another measure that can be used to infer the spreading capability of nodes,
and thus the efficiency of information diffusion due to structural properties of complex
networks. Due to factors that could have influenced the evaluations in the present paper,
future studies should further evaluate the measure, apply it to different contexts and
networks and possibly apply new evaluation metrics.
References
1. Goltsev, A.V., Dorogovtsev, S.N., Oliveira, J.G., Mendes, J.F.F.: Localization and spreading
of diseases in complex networks. Phys. Rev. Lett. 109(12), 128702 (2012).
https://doi.org/10.1103/PhysRevLett.109.128702
2. Kong, X., Qi, Y., Song, X., Shen, G.: Modeling disease spreading on complex networks.
Comput. Sci. Inf. Syst. 8(4), 1129–1141 (2011)
3. Keeling, M.J., Rohani, P.: Modeling Infectious Diseases in Humans and Animals. Princeton
University Press (2011)
4. Qian, Z., Tang, S., Zhang, X., Zheng, Z.: The independent spreaders involved SIR rumor
model in complex networks. Phys. Stat. Mech. Its Appl. 429, 95–102 (2015).
https://doi.org/10.1016/j.physa.2015.02.022
5. Stieglitz, S., Bunker, D., Mirbabaie, M., Ehnis, C.: Sense-making in social media during
extreme events. J. Contingencies Crisis Manag. 26(1), 4–15 (2018).
https://doi.org/10.1111/1468-5973.12193
6. Wu, F., Huberman, B.A., Adamic, L.A., Tyler, J.R.: Information flow in social groups. Phys.
Stat. Mech. Appl. 337(1), 327–335 (2004). https://doi.org/10.1016/j.physa.2004.01.030
7. Kitsak, M., et al.: Identification of influential spreaders in complex networks. Nat. Phys. 6(11),
888–893 (2010). https://doi.org/10.1038/nphys1746
8. Ahn, Y.-Y., Bagrow, J.P., Lehmann, S.: Link communities reveal multiscale complexity in
networks. Nature 466, 761 (2010)
9. Karunakaran, R.K., Manuel, S., Narayanan Satheesh, E.: Spreading information in complex
networks: an overview and some modified methods. In: Graph Theory - Advanced Algorithms
Applications, December 2017. https://doi.org/10.5772/intechopen.69204
10. Liu, Q., Han, L., Zou, M., Wei, O.: Spread dynamics with resistances of computer virus on
complex networks. In: 2010 3rd International Conference on Advanced Computer Theory
and Engineering (ICACTE), vol. 2, pp. V2-27–V2-31, August 2010.
https://doi.org/10.1109/icacte.2010.5579087
11. Törnberg, P.: Echo chambers and viral misinformation: modeling fake news as complex
contagion. PLoS ONE 13(9), e0203958 (2018).
https://doi.org/10.1371/journal.pone.0203958
12. Vosoughi, S., Roy, D., Aral, S.: The spread of true and false news online. Science 359(6380),
1146–1151 (2018). https://doi.org/10.1126/science.aap9559
13. Alvarez-Galvez, J.: Network models of minority opinion spreading: using agent-based
modeling to study possible scenarios of social contagion. Soc. Sci. Comput. Rev. 34(5), 567–581
(2016). https://doi.org/10.1177/0894439315605607
14. Comin, C.H., da Fontoura Costa, L.: Identifying the starting point of a spreading process in
complex networks. Phys. Rev. E 84(5), 056105 (2011).
https://doi.org/10.1103/physreve.84.056105
15. Albert, R., Jeong, H., Barabási, A.-L.: Error and attack tolerance of complex networks. Nature
406(6794), 378–382 (2000). https://doi.org/10.1038/35019019
16. Stegehuis, C., van der Hofstad, R., van Leeuwaarden, J.S.H.: Epidemic spreading on complex
networks with community structures. Sci. Rep. 6, 29748 (2016).
https://doi.org/10.1038/srep29748
17. Blondel, V.D., Guillaume, J.-L., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities
in large networks. J. Stat. Mech: Theory Exp. 2008(10), P10008 (2008).
https://doi.org/10.1088/1742-5468/2008/10/P10008
18. Palla, G., Derényi, I., Farkas, I., Vicsek, T.: Uncovering the overlapping community structure
of complex networks in nature and society. Nature 435(7043), 814–818 (2005).
https://doi.org/10.1038/nature03607
19. Jaccard, P.: The distribution of the flora in the alpine zone. New Phytol. 11(2), 37–50 (1912).
https://doi.org/10.1111/j.1469-8137.1912.tb05611.x
20. Hu, Q., Gao, Y., Ma, P., Yin, Y., Zhang, Y., Xing, C.: A new approach to identify influential
spreaders in complex networks. In: Web-Age Information Management, pp. 99–104, Berlin,
Heidelberg (2013). http://doi.org/10.1007/978-3-642-38562-9_10
21. Newman, M.E.J.: Finding community structure in networks using the eigenvectors of matrices.
Phys. Rev. E 74(3), 036104 (2006). https://doi.org/10.1103/PhysRevE.74.036104
22. Leskovec, J., Kleinberg, J., Faloutsos, C.: Graph evolution: densification and shrinking
diameters. ACM Trans. Knowl. Discov. Data 1(1), 2–es (2007).
https://doi.org/10.1145/1217299.1217301
23. Leskovec, J., Mcauley, J.J.: Learning to discover social circles in ego networks. In: Pereira,
F., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information
Processing Systems, vol. 25, pp. 539–547. Curran Associates, Inc. (2012)
Prediction of the Effects of Epidemic
Spreading with Graph Neural Networks
1 Introduction
The spread of information and disease is a common phenomenon that has a lot
of practical and sometimes life-saving applications. One of these applications is
the creation of better strategies for stopping the spreading of misinformation
on social media or an epidemic. Further, companies can analyze spreading to
create better strategies for marketing their product [6,19]. Spreading analysis
can also be suitable for analysis of e.g., fire spreading, implying large practical
value in terms of insurance cost analysis [8]. Analysis of spreading is commonly
studied via extensive simulations [11]. Here, the ideas from statistical mechanics
are exploited to better understand both the extent of spreading, as well as its
speed [2].
Albeit offering high utility, reliable simulations of spreading processes can be
expensive on larger networks, which we believe can be addressed by the employ-
ment of machine learning-aided modelling [29]. The contributions of this work
are multifold and can be summarized as follows.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 943, pp. 420–431, 2021.
https://doi.org/10.1007/978-3-030-65347-7_35
Epidemic Spreading with Graph Neural Networks 421
2 Related Work
In the following section, we discuss the relevant related work. We begin by dis-
cussing the notion of contagion processes, followed by an overview of graph
neural networks.
3 Task Formulation
When analysing an epidemic, three main pieces of information give us the most
insight into how severe it was. The first crucial piece of information
is when the epidemic reaches its peak (most nodes infected), since we are less
likely to be able to stop an epidemic when the peak is reached too quickly. This
information is especially important when trying to find a cure for a disease or
trying to stop misinformation on social media. Related to this, we usually want
to know how many nodes will be infected when the epidemic reaches its peak.
When the maximum number of people infected by some disease is high, there
might not be enough beds or medicine for everyone. In contrast, companies might
want to create marketing campaigns on platforms such as Twitter and target
specific users to reach a certain number of retweets that are needed to become
trending. Another important insight into an epidemic is how many people get
infected in total. If a scam on the internet reaches a lot of people, there is a
greater chance that more people will fall for it.
In our work, we focus on predicting the maximum number of infected nodes
and the time at which this number is reached. We create target data by simulating
epidemics from each node with the SIR diffusion model and identifying the maximum
number of infected nodes, as well as the time when it is reached. We aggregate
the generated target data by taking the mean values for each node. In the end,
we preprocess this data by normalizing it.
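The target construction described above can be sketched as follows. This is a minimal illustration that assumes a discrete-time SIR process on an adjacency dictionary and min-max normalization; the paper itself uses the NDlib library, and the exact normalization is not stated, so the helper names (`sir_peak`, `min_max`, `make_targets`) and the toy network are hypothetical.

```python
import random


def sir_peak(adj, seed_node, beta, gamma, rng, max_steps=10_000):
    """Run one discrete-time SIR epidemic from seed_node.

    Returns (maximum number of simultaneously infected nodes,
    step at which that maximum was first reached).
    """
    infected, recovered = {seed_node}, set()
    peak, peak_time, step = 1, 0, 0
    while infected and step < max_steps:
        step += 1
        # Each infected node infects each susceptible neighbour with probability beta.
        newly_infected = {
            v
            for u in infected
            for v in adj[u]
            if v not in infected and v not in recovered and rng.random() < beta
        }
        # Nodes infected at the start of the step recover with probability gamma.
        newly_recovered = {u for u in infected if rng.random() < gamma}
        infected = (infected | newly_infected) - newly_recovered
        recovered |= newly_recovered
        if len(infected) > peak:
            peak, peak_time = len(infected), step
    return peak, peak_time


def min_max(values):
    """Scale values to [0, 1]; a constant list maps to all zeros (assumed normalization)."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]


def make_targets(adj, runs=5, beta=0.05, gamma=0.005, seed=0):
    """Return {node: (normalized mean peak size, normalized mean peak time)}."""
    rng = random.Random(seed)
    nodes = list(adj)
    mean_peak, mean_time = [], []
    for node in nodes:
        results = [sir_peak(adj, node, beta, gamma, rng) for _ in range(runs)]
        mean_peak.append(sum(r[0] for r in results) / runs)
        mean_time.append(sum(r[1] for r in results) / runs)
    return dict(zip(nodes, zip(min_max(mean_peak), min_max(mean_time))))


# Hypothetical toy network: a triangle (0, 1, 2) with a tail (2-3-4).
toy = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
targets = make_targets(toy)
```

The defaults of five runs per node, β = 5% and γ = 0.5% follow the evaluation setup reported later in the paper; everything else is a simplification.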
4 Proposed Methodology
In this section, we present the creation of target data and summarize centralities
and learners used for the regression task. An overview of the proposed method-
ology can be seen in Fig. 1. The figure shows two branches. On the upper branch,
simulations are created and transformed into target data, while the node repre-
sentation is learned on the lower branch. After this, a regression model is trained
with data from both branches and used to generate predictions for the remaining
(unknown) nodes.
The initial part of the methodology addresses the issue of input data gen-
eration. Intuitively, the first step simulates epidemic spreading from individual
nodes of a given network to assess both the time required to reach the maximum
number of infected, as well as the number itself. In this work, we leverage the
widely used SIR model [10] to simulate epidemics, formulated as follows.
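$$
\begin{aligned}
\frac{dS}{dt} &= -\frac{\beta \cdot S \cdot I}{N} \\
\frac{dI}{dt} &= \frac{\beta \cdot S \cdot I}{N} - \gamma \cdot I \\
\frac{dR}{dt} &= \gamma \cdot I,
\end{aligned}
$$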
424 S. Mežnar et al.
Table 1. Summary of the considered learners with descriptions. Here, A denotes the
adjacency matrix and F the feature matrix.
During model training we minimized the mean squared error between the
prediction (f (u)) and the ground truth (yu ); stated for the u-th node as
$$
MSE = \frac{1}{|N|} \sum_{u \in N} \left( f(u) - y_u \right)^2 .
$$
To summarize, we learn network features with fast algorithms and use them
together with GIN, GAT and XGBoost learners to minimize the mean squared
error between predictions and data we make using simulations on part of the
network. We next present the evaluation process and results of this methodology.
5 Empirical Evaluation
In this section, we show the empirical results of our approach and compare it to
other baselines. We also present how predictions from the CABoost model can be
explained using SHAP [16].
5.1 Baselines
For comparison we also add the averaged simulation error. We calculate this
error with the MSE formula, where we use the mean absolute difference between
the value we get from simulations and the mean value for that node as f (u) and
0 as yu . This baseline corresponds to a situation, where only a single simulation
would be used to approximate the expected value of multiple ones (the goal of
this work).
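One possible reading of the averaged simulation error is sketched below: for each node, f(u) is the mean absolute difference between the individual simulation values and the node's mean, and y_u is 0. The function names and the toy values are hypothetical and only illustrate the computation described above.

```python
def mse(predictions, targets):
    """Mean squared error between two equally long sequences."""
    return sum((f - y) ** 2 for f, y in zip(predictions, targets)) / len(targets)


def averaged_simulation_error(simulations):
    """simulations: dict node -> list of values obtained from repeated simulations."""
    deviations = []
    for values in simulations.values():
        mean = sum(values) / len(values)
        # Mean absolute difference between single simulations and their mean.
        deviations.append(sum(abs(v - mean) for v in values) / len(values))
    # The deviation plays the role of f(u); the ground truth y_u is set to 0.
    return mse(deviations, [0.0] * len(deviations))


# Hypothetical normalized values from two simulations per node.
simulations = {0: [0.2, 0.4], 1: [0.5, 0.5]}
baseline_error = averaged_simulation_error(simulations)
```

A node whose simulations all agree contributes nothing to this error, so the baseline grows with the run-to-run variance of the simulations.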
We used the following approach to test the proposed method as well as base-
lines mentioned in Sect. 5.1. We created the target data by simulating five epi-
demics starting from each node of every dataset. We created each simulation
2 That can be found at https://github.com/smeznar/Epidemic-spreading-CN2020.
Fig. 2. Visualization of Advogato (left) and FB Public Figures (right) networks. The
color represents the target value we get when starting the spreading from a given node.
On the Advogato dataset, color represents the time needed to reach the peak, while on
the FB Public Figures dataset it represents the maximum number of infected nodes. Blue
colors represent low values while red ones represent high ones.
using the SIR diffusion model from the NDlib [23] Python library with
parameters β = 5% and γ = 0.5%. We then created the target variables by identifying
and aggregating the maximum number of infected nodes and the time when this
happens. We used these variables to test methods using five-fold cross-validation.
We used XGBoost [1] with default parameters as the regression model with the
proposed features based on the mentioned centralities, the random baseline and the
node2vec [5] baseline. Baselines GIN and GAT were trained for 200 epochs using
the Adam optimizer [12].
5.3 Results
The results of the evaluation described in Sect. 5.2 are presented in Tables 4
and 5. We can see that in most cases both time and the maximum number of
infected nodes can be predicted significantly better by using information about
the structure of the network.
Observing results in Table 4 we see that overall CABoost achieves best results
and is beaten only on the Wiki vote dataset by GAT and the Hamsterster dataset
by GIN. We can further see that although node2vec does not achieve the best
score on any dataset, it consistently achieves results that are comparable to
CABoost. The biggest improvement over the random baseline can be seen on
the Hamsterster and Advogato datasets, which have more than one connected
component. We can also see that only GAT achieves a result that is significantly
better than the random baseline on the Wiki vote dataset.
The results in Table 5 are very similar to those showcased in Table 4. Here
CABoost performs even better and is outperformed only by GIN on the Wiki
vote dataset. We can see that when predicting time, the random baseline performs
significantly worse than all other baselines on all datasets but that overall these
predictions are better than when the maximum number of infected nodes is being
predicted.
We can see on all datasets that prediction with such learners is more beneficial
than creating only one simulation, further showing their usefulness.
Table 5. Cross-validation results for time when most nodes are infected.
Fig. 3. An example of a model explanation for an instance. Blue arrows indicate how
much the prediction is lowered by some feature value, while the red ones indicate how
much it is raised. The prediction starts at the model's expected value of 0.354 and finishes at 0.328.
Features and their values are shown on the left. The visualization shows for example
that the prediction dropped from 0.350 to 0.328 because of the low value of degree
centrality.
Acknowledgments. The work of the last author (BŠ) was funded by the national
research agency (ARRS)’s grant for junior researchers. The work of other authors was
supported by the Slovenian Research Agency (ARRS) core research program P2-0103
and P6-0411, and research projects J7-7303, L7-8269, and N2-0078 (financed under the
ERC Complementary Scheme).
References
1. Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: Proceedings
of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining, KDD 2016, pp. 785–794, Association for Computing Machinery, New
York (2016)
2. Dong, S., Fan, F.-H., Huang, Y.-C.: Studies on the population dynamics of a rumor-
spreading model in online social networks. Phys. A 492, 10–20 (2018)
3. Fey, M., Lenssen, J.E.: Fast graph representation learning with PyTorch geometric.
In: ICLR Workshop on Representation Learning on Graphs and Manifolds (2019)
4. Granovetter, M.: Threshold models of collective behavior. Am. J. Sociol. 83(6),
1420–1443 (1978)
5. Grover, A., Leskovec, J.: node2vec: scalable feature learning for networks. In: Pro-
ceedings of the 22nd ACM SIGKDD International Conference on Knowledge Dis-
covery and Data Mining, pp. 855–864 (2016)
6. Guille, A., Hacid, H., Favre, C., Zighed, D.A.: Information diffusion in online social
networks: a survey. SIGMOD Rec. 42(2), 17–28 (2013)
7. Hamsterster. Hamsterster social network. http://www.hamsterster.com
8. Kacem, A., Lallemand, C., Giraud, N., Mense, M., De Gennaro, M., Pizzo, Y.,
Loraud, J.-C., Boulet, P., Porterie, B.: A small-world network model for the sim-
ulation of fire spread onboard naval vessels. Fire Saf. J. 91, 441–450 (2017)
9. Kempe, D., Kleinberg, J., Tardos, É.: Maximizing the spread of influence through
a social network. In: Proceedings of the Ninth ACM SIGKDD International Con-
ference on Knowledge Discovery and Data Mining, KDD 2003, pp. 137–146, Asso-
ciation for Computing Machinery, New York (2003)
10. Kermack, W.O., McKendrick, A.G., Walker, G.T.: A contribution to the mathe-
matical theory of epidemics. Proc. Roy. Soc. Lond. Ser. A, Containing Pap. Math.
Phys. Char. 115(772), 700–721 (1927)
11. Kesarev, S., Severiukhina, O., Bochenina, K.: Parallel simulation of community-
wide information spreading in online social networks. In: Russian Supercomputing
Days, pp. 136–148. Springer (2018)
12. Kingma, D.P., Ba, J.: Adam: a Method for Stochastic Optimization. CoRR,
abs/1412.6980 (2015)
13. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional
networks. In: International Conference on Learning Representations (ICLR) (2017)
14. Kleinberg, J.M.: Authoritative sources in a hyperlinked environment. J. ACM
46(5), 604–632 (1999)
15. Leskovec, J., Huttenlocher, D., Kleinberg, J.: Signed networks in social media. In:
Proceedings of the SIGCHI Conference on Human Factors in Computing Systems,
pp. 1361–1370. ACM (2010)
16. Lundberg, S.M., Lee, S.I.: A unified approach to interpreting model predictions.
In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan,
S., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 30,
pp. 4765–4774. Curran Associates Inc. (2017)
17. Lundberg, S.M., Lee, S.I.: A unified approach to interpreting model predictions.
In: Advances in Neural Information Processing Systems, pp. 4765–4774 (2017)
18. Massa, P., Salvetti, M., Tomasoni, D.: Bowling alone and trust decline in social net-
work sites. In: Eighth IEEE International Conference on Dependable, Autonomic
and Secure Computing, 2009. DASC 2009, pp. 658–663. IEEE (2009)
19. Nowzari, C., Preciado, V.M., Pappas, G.J.: Analysis and control of epidemics:
a survey of spreading processes on complex networks. IEEE Control Syst. Mag.
36(1), 26–46 (2016)
20. Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank citation ranking:
bringing order to the Web. In: WWW 1999 (1999)
21. Perozzi, B., Al-Rfou, R., Skiena, S.: DeepWalk: Online learning of social represen-
tations. In: Proceedings of the 20th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, KDD 2014, pp. 701–710. ACM, New York
(2014)
22. Rodrigues, F.A.: Network centrality: an introduction. In: A Mathematical Model-
ing Approach from Nonlinear Dynamics to Complex Systems, p. 177 (2019)
23. Rossetti, G., Milli, L., Rinzivillo, S., Sı̂rbu, A., Pedreschi, D., Giannotti, F.: NDlib:
a python library to model and analyze diffusion processes over complex networks.
Int. J. Data Sci. Anal. 5(1), 61–79 (2018)
24. Rossi, R.A., Ahmed, N.K.: The network data repository with interactive graph
analytics and visualization. In: AAAI (2015)
25. Rozemberczki, B., Davies, R., Sarkar, R., Sutton, C.: GEMSEC: graph embedding
with self clustering. In: Proceedings of the 2019 IEEE/ACM International Confer-
ence on Advances in Social Networks Analysis and Mining 2019, pp. 65–72. ACM
(2019)
26. Škrlj, B., Lavrač, N., Kralj, J.: Symbolic graph embedding using frequent pattern
mining. In: Novak, P.K., Šmuc, T., Džeroski, S. (eds.) Discovery Science, pp. 261–
275. Springer, Cham (2019)
27. Štrumbelj, E., Kononenko, I.: Explaining prediction models and individual predic-
tions with feature contributions. Knowl. Inf. Syst. 41(3), 647–665 (2014)
28. Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., Bengio, Y.: Graph
attention networks. In: International Conference on Learning Representations
(2018)
29. Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., Philip, S.Y.: A comprehensive
survey on graph neural networks. IEEE Trans. Neural Netw. Learn. Syst. (2020).
https://doi.org/10.1109/TNNLS.2020.2978386
30. Xiaojin, Z., Zoubin, G.: Learning from labeled and unlabeled data with label prop-
agation. Technical report CMU-CALD-02–107, Carnegie Mellon University (2002)
31. Xu, K., Hu, W., Leskovec, J., Jegelka, S.: How powerful are graph neural networks?
In: International Conference on Learning Representations (2019)
Learning Vaccine Allocation
from Simulations
1 Introduction
Networks provide a universal language to represent interacting systems with
emerging dynamical patterns. Computationally, every propagation process over
a network can be considered an epidemic. Examples include actual pathogens on
human contact networks [1], fake-news in online social networks [8,10], cascad-
ing failures in an infrastructure network [11], congestions in a traffic network,
malware in computer networks [3], neural activity in a brain network [4], etc.
The problem of vaccine allocation is linked to the control of such a propa-
gation, where limited vaccine resources are available and we aim to reduce the
spread as much as possible, that is, lower the number of nodes reached by the
epidemic.
Vaccine allocation strategies can help in the design of complex systems to
make them more resilient against (cascading) failures. This is particularly rel-
evant regarding infrastructure networks where a “vaccination” might represent
the installation of some kind of protective safeguard. Another example is the mit-
igation of fake-news in online social networks which can be achieved by removing
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 943, pp. 432–443, 2021.
https://doi.org/10.1007/978-3-030-65347-7_36
Learning Vaccine Allocation from Simulations 433
2 Related Work
Traditionally, most methods focused on finding nodes for vaccination using a
static analysis of the contact network, for instance by looking at the between-
ness centrality of nodes [17] or at their degree [15]. Likewise, NetShield tries to
minimize the epidemic threshold of the contact graph (i.e., its general ability to
support epidemics) [20]. A more advanced method is GraphShield that starts
with degree centrality but then takes the flow of information in the contact
graph into account [21]. Eventually, researchers focused more on the dynami-
cal aspects for instance by utilizing linear programming [16] or reinforcement
learning [22,23]. For an overview, we refer the reader to [12].
Conceptually most relevant for us is the work of Zhang et al. who propose
DAVA [24] and Song et al. who propose NIIP [18]. Both methods are based
on a dominator tree architecture which tries to capture the direction of the
epidemic. DAVA merges all initially infected nodes and analyzes the paths from
this node to all other nodes. Nodes that block a large number of paths are
suitable vaccination candidates. NIIP focuses on a problem setting where not all
vaccination units are distributed at once. Therefore, NIIP extracts a maximum
DAG from the contact graph and uses Monte-Carlo simulation to find the best
nodes to vaccinate. It combines this with a greedy simulation-based approach
whose goal is to determine when to distribute a vaccine.
3 Problem Statement
We first formalize the generative epidemic spreading model and the vaccination
allocation problem. We remark that our framework can easily deal with all other
types of spreading models as well.
[Figure: panels "initial network state Linit" and "terminal network state"]
Fig. 1. Example contact network with possible SIR dynamics (S: blue line; I: red, filled;
R: green, shaded interior). The firing node/edge is annotated with an exclamation mark.
The resulting transmission tree is shown on the right.
Complexity. The problem is computationally difficult because there are $\binom{n}{k}$
possibilities to distribute k vaccines to n nodes. The corresponding decision
problem is NP-hard. Specifically, for a given input G, Linit , λ, μ, and threshold
τ , it is NP-hard (in n) to decide if a solution X exists s.t. F (X) > τ . It can
be shown that for this type of problem, NP-hardness holds for any propagation
model that can mimic an independent cascade (IC) model [24]. We can do this
by making μ (λ) arbitrarily small (large). Note that this only holds for the SIR
model and not necessarily for the generalizations to arbitrary spreading models.
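To give a sense of how quickly the number of possible allocations grows, the short sketch below counts the ways of choosing k nodes out of n. The function name `allocation_count` and the chosen (n, k) pairs are illustrative assumptions.

```python
from math import comb


def allocation_count(n, k):
    """Number of ways to distribute k vaccines to n nodes (one vaccine per node)."""
    return comb(n, k)


# Hypothetical network sizes and budgets.
counts = {(n, k): allocation_count(n, k) for n, k in [(10, 3), (100, 5), (1000, 10)]}
for (n, k), value in counts.items():
    print(f"n={n}, k={k}: {value} allocations")
```

Even for moderate n and k the count is far too large to evaluate every allocation by simulation, which is why a heuristic ranking of candidate nodes is attractive.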
4 Our Method
We first explain the main components of Simba (Simulation-based vaccine allo-
cation): the rejection-based simulation method and the construction of the trans-
mission graph with the identification of high-impact nodes based on a ranking
analysis.
We store the number of susceptible nodes when the simulation ends. More-
over, each time a node gets infected, we store from which (infected) neighbor the
infection originated (or all nodes it could have originated from, cf. Sect. 4.2).
Fig. 3. Assume we want to estimate the impact of the two successor nodes of patient
zero based on two simulation runs. Using, for example, the size of their corresponding
subtree in the transmission tree leads to misleading results in this case. Specifically,
both nodes would be assigned drastically different values. Combining the two runs in
a transmission graph yields a more realistic impact score than considering both runs
separately. Note that edges point to the origin of the infection and the transmission
graph is shown without its dummy node.
4.4 Discussion
Here, we want to address three non-obvious questions: (i) why build a transmission
graph, (ii) what does the graph say about the objective function, and (iii)
why is it necessary to consider the dynamics at all?
Impact Score and Objective. Note that we handle two different problems.
The impact score quantifies the question “How many nodes became infected as a
direct (‘multi-hop’) consequence of each node?” However, the objective F(·)
is concerned with “How many nodes will (on average) not become infected if a
specific (set of) node(s) is vaccinated?” The latter question is notoriously more
difficult to answer. The two differ because, if we vaccinate a node, all
of its children in the transmission tree can still become infected via alternative
paths. In this sense, the impact score gives an over-approximation of the effect
of vaccinating a node regarding the objective. Colloquially, if we vaccinate a
node with m children (on average), then the best we can hope for is that these
m nodes do not become infected. Hence, our optimization procedure picks nodes
depending on their theoretical (and over-approximated) capability or potential
to reduce the epidemic spreading.
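The upper bound m in this argument can be made concrete for a single run: it is the number of multi-hop descendants of a node in the transmission tree. The sketch below computes these counts from a child-to-origin map. It is not Simba's impact score, which is derived from the transmission graph over many runs; the function name and the toy tree are our own.

```python
def descendant_counts(origin):
    """origin[v] is the node that infected v (the tree must be acyclic).

    Returns a dict mapping each node of the tree to the number of nodes
    that were infected, directly or over several hops, through it.
    """
    counts = {}
    for v in origin:
        # Walk from v up to the root and credit every ancestor with one descendant.
        u = origin[v]
        while True:
            counts[u] = counts.get(u, 0) + 1
            if u not in origin:
                break
            u = origin[u]
    for v in origin:
        counts.setdefault(v, 0)   # leaves have no descendants
    return counts

# Toy tree: 0 infected 1 and 4, and 1 infected 2 and 3.
print(descendant_counts({1: 0, 2: 1, 3: 1, 4: 0}))
```

Vaccinating node 1 in this toy run can save at most its two descendants (plus itself), and possibly fewer if they are reachable via other paths.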
Importance of Dynamics. The goal is to vaccinate nodes such that the network
becomes less “supportive” of epidemics spreading in it. But then why
should the specific dynamics matter? In other words, how can vaccinating a
specific node be the right decision for some infection rate constants and the wrong
decision for other ones? It is easy to see this in the example in Fig. 4 where
we have a single patient zero and a budget of k = 1. We can either vaccinate
the node to the “right” to protect the fully connected component (FCC) with
six nodes or we can vaccinate the node to the “left” to protect the line graph
with nine nodes. If the epidemic is “weak”, it will die out anyway over the line
graph, so it makes sense to protect the FCC. In contrast, protecting the line
graph “saves” more nodes if the epidemic is strong enough to conquer the whole
graph.
4.5 Generalizations
Fig. 5. Left: Optimization of the terminal fraction of expected susceptible nodes F (X)
on three sample networks. Right: Runtime of 1000 simulations and solution of the
corresponding transmission graph based on random d-regular graphs.
5 Experimental Results
We provide an implementation of Simba in Rust and Python (for visualization/IO);
the code will be made available.¹ We used synthetic networks following
three random graph models (Erdős-Rényi, Geometric, and Barabási-Albert
(BA)) with $10^2$, $10^3$, and $10^4$ nodes, respectively. The corresponding budgets are
k = 2, k = 3, and k = 10. We consider the following baselines: random (expected
F(X) when random nodes are vaccinated), DAVA and DAVA-fast [24], PageRank
and Pers. PageRank (personalized PageRank) [7,24], and Degree (pick nodes
with highest degree). We use $10^3$ simulation runs for each construction of the
transmission tree. We also analyze the runtime of a complete construction and
solution of a transmission graph based on d-regular random graphs (i.e., all nodes
have exactly d neighbors) with varying degree d and n. Practically, the runtime
is almost linear in n. Theoretically, the number of simulation steps in each run
increases linearly. The costs of each simulation step increase sub-linearly. The
costs of solving the DTMC also increase linearly. We see that, even though the
number of iteration steps is quite small, Simba is superior to or (almost) on par
with the baselines in the experiments. Simba struggles the most with BA graphs,
which are a special case but an important one, as they highlight potential problems. It
seems that the general strategy of Simba to separate the initially infected from
the susceptible nodes does not work better than identifying the nodes which
are generally important for the graph’s resilience against epidemics. This is due
to the fact that BA graphs typically possess a small subset of nodes that are
extremely effective candidates for vaccination regardless of the infection source.
Note that DAVA also struggles in this case while Degree and both PageRank
methods shine.
simulations which are analyzed using the transmission graph. The transmission
graph represents the flow of a pathogen in the network as a directed weighted
graph. The method is suitable for all epidemic models that can be simulated
efficiently.
In the future, we aim to perform different types of information flow analysis
on the transmission graph, not only random walks. It remains to be determined
which kind of flow analysis is most useful for which type of objective (e.g., vacci-
nation, control, influence maximization). Moreover, we want to extend numerical
evaluations to more complex spreading models (e.g., non-Markovian, multi-state
ones) and network types (e.g., adaptive networks).
Acknowledgments. This work was partially funded by the DFG project MULTI-
MODE.
References
1. Ball, F., Sirl, D., Trapman, P.: Analysis of a stochastic sir epidemic on a random
network incorporating household structure. Math. Biosci. 224(2), 53–73 (2010)
2. Cota, W., Ferreira, S.C.: Optimized Gillespie algorithms for the simulation of
Markovian epidemic processes on large and heterogeneous networks. Comput.
Phys. Commun. 219, 303–312 (2017)
3. Gan, C., Yang, X., Liu, W., Zhu, Q., Zhang, X.: Propagation of computer virus
under human intervention: a dynamical model. Discrete Dynam. Natu. Soc. 2012
(2012)
4. Goltsev, A., De Abreu, F., Dorogovtsev, S., Mendes, J.: Stochastic cellular
automata model of neural networks. Phys. Rev. E 81(6), 061921 (2010)
5. Grossmann, G., Backenköhler, M., Wolf, V.: Importance of interaction structure
and stochasticity for epidemic spreading: a covid-19 case study. ResearchGate
(2020). https://www.researchgate.net/publication/341119247_Importance_of_Interaction_Structure_and_Stochasticity_for_Epidemic_Spreading_A_COVID-19_Case_Study
6. Großmann, G., Wolf, V.: Rejection-based simulation of stochastic spreading pro-
cesses on complex networks. In: International Workshop on Hybrid Systems Biol-
ogy, pp. 63–79. Springer (2019)
7. Jeh, G., Widom, J.: Scaling personalized web search. In: Proceedings of the 12th
International Conference on World Wide Web, pp. 271–279 (2003)
8. Khurana, P., Kumar, D.: Sir model for fake news spreading through whatsapp. In:
Proceedings of 3rd International Conference on Internet of Things and Connected
Technologies (ICIoTCT), pp. 26–27 (2018)
9. Kiss, I.Z., Miller, J.C., Simon, P.L., et al.: Mathematics of Epidemics on Networks.
vol. 598. Springer, Cham (2017)
10. Kitsak, M., Gallos, L.K., Havlin, S., Liljeros, F., Muchnik, L., Stanley, H.E., Makse,
H.A.: Identification of influential spreaders in complex networks. Nat. Phys. 6(11),
888 (2010)
11. May, R.M., Arinaminpathy, N.: Systemic risk: the dynamics of model banking
systems. J. R. Soc. Interface 7(46), 823–838 (2009)
12. Nowzari, C., Preciado, V.M., Pappas, G.J.: Analysis and control of epidemics:
a survey of spreading processes on complex networks. IEEE Control Syst. Mag.
36(1), 26–46 (2016)
13. Page, L., Brin, S., Motwani, R., Winograd, T.: The pagerank citation ranking:
bringing order to the web. Technical report 1999-66, Stanford InfoLab, November
1999. http://ilpubs.stanford.edu:8090/422/, previous number = SIDL-WP-1999-0120
14. Prakash, B.A., Chakrabarti, D., Valler, N.C., Faloutsos, M., Faloutsos, C.: Thresh-
old conditions for arbitrary cascade models on arbitrary networks. Knowl. Inf. Syst.
33(3), 549–575 (2012)
15. Prakash, B.A., Tong, H., Valler, N., Faloutsos, M., Faloutsos, C.: Virus propagation
on time-varying networks: theory and immunization algorithms. In: Joint European
Conference on Machine Learning and Knowledge Discovery in Databases, pp. 99–
114. Springer (2010)
16. Sambaturu, P., Vullikanti, A.: Designing robust interventions to control epidemic
outbreaks. In: International Conference on Complex Networks and Their Applica-
tions, pp. 469–480. Springer (2019)
17. Schneider, C.M., Mihaljev, T., Havlin, S., Herrmann, H.J.: Suppressing epidemics
with a limited amount of immunization units. Phys. Rev. E 84(6), 061911 (2011)
18. Song, C., Hsu, W., Lee, M.L.: Node immunization over infectious period. In: Pro-
ceedings of the 24th ACM International on Conference on Information and Knowl-
edge Management, pp. 831–840 (2015)
19. Stewart, G., Miller, J.: Methods of simultaneous iteration for calculating eigenvec-
tors of matrices. Top. Numer. Anal. II, 169–185 (1975)
20. Tong, H., Prakash, B.A., Tsourakakis, C., Eliassi-Rad, T., Faloutsos, C., Chau,
D.H.: On the vulnerability of large graphs. In: 2010 IEEE International Conference
on Data Mining, pp. 1091–1096. IEEE (2010)
21. Wijayanto, A.W., Murata, T.: Flow-aware vertex protection strategy on large social
networks. In: 2017 IEEE/ACM International Conference on Advances in Social
Networks Analysis and Mining (ASONAM), pp. 58–63. IEEE (2017)
22. Wijayanto, A.W., Murata, T.: Learning adaptive graph protection strategy on
dynamic networks via reinforcement learning. In: 2018 IEEE/WIC/ACM Interna-
tional Conference on Web Intelligence (WI), pp. 534–539. IEEE (2018)
23. Wijayanto, A.W., Murata, T.: Effective and scalable methods for graph protection
strategies against epidemics on dynamic networks. Appl. Netw. Sci. 4(1), 18 (2019)
24. Zhang, Y., Prakash, B.A.: Data-aware vaccine allocation over large networks. ACM
Trans. Knowl. Discov. Data (TKDD) 10(2), 1–32 (2015)
Suppressing Epidemic Spreading
via Contact Blocking in Temporal
Networks
1 Introduction
Since the outbreak of Covid-19, most countries have taken mitigation measures
to stop or at least reduce the spread. Citizens have significantly reduced their
transportation and social activities, and human contact in general. However,
applying the same mitigation measure (e.g. everyone reducing their physical
contact by 10%) to all citizens might not be the most efficient way to stop the
virus’s spread. The fundamental question is which type of social contacts should
be blocked in order to slow down the epidemic spreading.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 943, pp. 444–454, 2021.
https://doi.org/10.1007/978-3-030-65347-7_37
Human contact networks like face-to-face contact networks in general
evolve over time. In such so-called temporal networks, two nodes are connected
at a given time step when they have a face-to-face contact. Physical
contact networks have become available thanks to the development of sensor and
communication technologies. In contrast to static networks, where links remain
constantly active, the link between a node pair is active (or the two nodes have a
contact) only at specific time steps in a temporal network. A temporal network
G = (N, C) observed within a given time window [0, T] among a set N of N
nodes can be represented by a set of contacts C = {c(i, j, t), t ∈ [0, T], i, j ∈ N},
where contact c(i, j, t) occurs between node pair (i, j) at time step t.
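As a small illustration of this representation, the sketch below stores each contact c(i, j, t) as a tuple and counts the contacts per node pair; these counts are the link weights of the aggregated network introduced later in this section. The contact list is made up and the function name `aggregate` is ours.

```python
from collections import Counter

# Each contact c(i, j, t) is stored as a tuple (i, j, t); the values are made up.
contacts = [(0, 1, 1), (1, 0, 2), (1, 2, 2), (0, 1, 5), (2, 3, 6)]

def aggregate(contacts):
    """Return {(i, j): w_ij} with i < j, where w_ij counts the contacts between i and j."""
    return Counter((min(i, j), max(i, j)) for i, j, _t in contacts)

weights = aggregate(contacts)
print(weights)
```

Sorting the endpoints of each pair treats the network as undirected, so (0, 1) and (1, 0) contribute to the same link.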
In this work, we explore the question of which contacts could be removed in
order to suppress the epidemic spreading effectively. As a simple start, we consider
the Susceptible-Infected (SI) epidemic model, which models information
diffusion and epidemic spreading when the spreading is much faster than the
recovery. Initially, a seed node is randomly selected and infected at t = 0, whereas
all the other nodes are susceptible. An infected node infects a susceptible node
with a probability β when a contact happens between the two nodes. The prevalence,
i.e. the percentage of individuals that are infected, grows over time. The
prevalence over time could be reduced via the removal of contacts. Such a reduction
in prevalence over time is used to quantify the effect of contact removal. In
practice, the temporal contact network at large scale, e.g. at country level, is likely
unavailable. We assume that we could obtain the corresponding aggregated network
$G_W$, where two nodes i and j are connected by a link l(i, j) if the two nodes
have at least one contact, and the link is associated with a weight representing
the number of contacts between them. We aim to design contact removal strategies
based on the aggregated network. We systematically propose a set of link centrality
metrics or properties based on the aggregated network. Furthermore, we
define the probability that a contact c(i, j, t) is removed as a generic function of
a centrality metric of link l(i, j) in the aggregated network and the time t of the
contact. Each centrality metric thus leads to a different mitigation strategy to
select the contacts to block. The average fraction φ of contacts to be removed is
considered as a control parameter, indicating the mitigation cost. We evaluate
the performance of all the strategies that we have proposed in 6 real-world temporal
networks. We find that the epidemic prevalence can be better suppressed
when contacts between node pairs that have fewer contacts are more likely to
be removed, i.e. using the metric one over the number of contacts between a
node pair. Removing contacts that happen earlier in time further enhances
the mitigation effect. The number of contacts between a node pair is heterogeneous.
It seems that the mitigation effect is better if the average number of
contacts removed per node pair varies less. Static network studies have shown
that a weighted network tends to be more robust against epidemic spreading
with respect to its epidemic threshold if its largest eigenvalue is smaller. The
resultant aggregated network after contact removal, however, may have a lower
prevalence if its largest eigenvalue is larger. This implies that the underlying
temporal network may lead to new phenomena in epidemic spreading that differ
from what we have learned from static networks.
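The SI process on a temporal network described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' simulation code: we assume a synchronous update in which a node infected at step t can only infect others from step t + 1 on, and the function name `si_prevalence` and the toy contacts are ours.

```python
import random

def si_prevalence(contacts, num_nodes, seed_node, beta, T, rng):
    """Simulate SI spreading over time-ordered contacts (i, j, t).

    Returns the prevalence (fraction of infected nodes) at t = 0, 1, ..., T.
    """
    infected = {seed_node}
    by_time = {}
    for i, j, t in contacts:
        by_time.setdefault(t, []).append((i, j))
    prevalence = [len(infected) / num_nodes]          # t = 0
    for t in range(1, T + 1):
        newly = set()
        for i, j in by_time.get(t, []):
            # A contact can transmit only if exactly one endpoint is already infected.
            if (i in infected) != (j in infected) and rng.random() < beta:
                newly.add(j if i in infected else i)
        infected |= newly
        prevalence.append(len(infected) / num_nodes)
    return prevalence

# Made-up contacts among 4 nodes; with beta = 1 every eligible contact transmits.
toy = [(0, 1, 1), (1, 2, 2), (2, 3, 1)]
print(si_prevalence(toy, num_nodes=4, seed_node=0, beta=1.0, T=2, rng=random.Random(1)))
```

Note that the contact (2, 3, 1) never transmits here because it occurs before node 2 is infected, which illustrates why the time ordering of contacts matters.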
The influence of temporal networks on dynamic processes has been widely
investigated [1–3]. Gemmetto et al. have studied epidemic mitigation via
excluding a sub-group of nodes in a temporal network [10]. Link blocking strategies
using link centrality metrics to suppress information diffusion have been
explored in [4]. The links to block are selected from the aggregated network.
When a link is blocked, all contacts associated with the link are removed. In
this work, we investigate in more depth at the contact level, i.e. how to choose the
contacts to block when the total number of contacts to block is given. Moreover,
the consideration of the time of a contact in contact removal strategies may
inform the decision of when a mitigation measure should be implemented.
2 Methods
We first propose a set of link centrality metrics/properties based on the aggregated
network $G_W$. Furthermore, the probability that a contact is removed is
defined step by step as a function of a given centrality metric and the time of
the contact, which also ensures that a fraction φ of contacts is removed on
average. We evaluate the effect of the mitigation strategies via the extent to which
the prevalence is reduced over time.
For each metric $m_{ij}$, we consider $1/m_{ij}$ as an extra centrality metric, except for
the random. For any centrality metric, the centrality value of every link in the
aggregated network is positive. The motivation to consider the reciprocal metrics
$1/m_{ij}$ is explained in the design of the removal probabilities of contacts (2).
Link centrality metrics can be correlated [5,6]. We find that the Spearman
rank correlation between any two metrics proposed above is weak, i.e. below
0.2. This implies that each metric captures a specific property that cannot be
captured by another metric.
For a given link centrality metric, we can compute the centrality $m_{ij}$ for each
link l(i, j). We propose the probability $p_{ij}$ that a contact c(i, j, t) between i and
j is removed as
\[ p_{ij} = m_{ij} \, \frac{\phi \sum_{ij} w_{ij}}{\sum_{ij} (w_{ij} m_{ij})} \tag{1} \]
where $w_{ij}$ is the weight of link l(i, j) in the aggregated network, and the
normalization ensures that on average a fraction φ of contacts will be removed.
The probability that a contact is removed is assumed to be proportional to the
centrality $m_{ij}$ of the corresponding link l(i, j).
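A minimal Python sketch of Eq. (1) may help to check the normalization. The link weights below are made up, the function name `removal_probabilities` is ours, and the metric used is the reciprocal of the link weight; no clipping to [0, 1] is applied at this stage.

```python
def removal_probabilities(weights, metric, phi):
    """Eq. (1): p_ij = m_ij * phi * sum(w) / sum(w * m), not yet clipped to [0, 1]."""
    total_w = sum(weights.values())
    total_wm = sum(weights[l] * metric[l] for l in weights)
    return {l: metric[l] * phi * total_w / total_wm for l in weights}

# Made-up aggregated network: link -> number of contacts w_ij.
weights = {("a", "b"): 6, ("b", "c"): 3, ("a", "c"): 1}
inverse_weight = {l: 1.0 / w for l, w in weights.items()}   # metric m_ij = 1 / w_ij

p = removal_probabilities(weights, inverse_weight, phi=0.1)
expected_removed = sum(p[l] * weights[l] for l in weights)
print(p, expected_removed)
```

The expected number of removed contacts, the sum of $p_{ij} w_{ij}$, equals φ times the total number of contacts, and with $m_{ij} = 1/w_{ij}$ every link loses the same expected number of contacts.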
We found that some centrality metrics are highly heterogeneous. Hence, it is
possible that the removal probability calculated by (1) is larger than 1 for contacts
whose associated link l(i, j) has an extremely large centrality $m_{ij}$. In such
cases, the actual fraction of contacts removed can be lower than the expected φ
if all contacts with removal probability larger than 1 are simply removed. Therefore, we
set the removal probabilities of those contacts to 1 and re-normalize the removal
probabilities among the remaining contacts. This process is repeated until the removal
probabilities of all remaining contacts are between 0 and 1, while the actual
fraction of contacts removed is the same as the expected φ.
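The capping and re-normalization loop can be sketched as follows. This is our reading of the procedure, stated as an assumption: links whose probability exceeds 1 are fixed at 1, their contacts are subtracted from the removal budget, and Eq. (1) is re-applied to the remaining links. The function name and the toy numbers are ours.

```python
def clipped_removal_probabilities(weights, metric, phi):
    """Cap probabilities at 1 and re-normalise the rest until all lie in [0, 1]."""
    budget = phi * sum(weights.values())   # expected number of contacts to remove
    capped = set()                         # links whose contacts are removed with probability 1
    while True:
        free = [l for l in weights if l not in capped]
        remaining = budget - sum(weights[l] for l in capped)
        total_wm = sum(weights[l] * metric[l] for l in free)
        p = {l: 1.0 for l in capped}
        p.update({l: metric[l] * remaining / total_wm for l in free})
        over = [l for l in free if p[l] > 1.0]
        if not over:
            return p
        capped.update(over)

# Made-up example in which link "a" has an extremely large centrality.
weights = {"a": 1, "b": 1, "c": 8}
metric = {"a": 10.0, "b": 1.0, "c": 1.0}
p = clipped_removal_probabilities(weights, metric, phi=0.3)
print(p, sum(p[l] * weights[l] for l in weights))
```

In this example the first pass gives link "a" a value above 1, so it is capped, and the remaining budget of two contacts is spread over "b" and "c"; the expected number of removed contacts stays at 3.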
The probability $p^{*}_{ij}$ that a contact between i and j is removed can be defined
in a more general way
\[ p^{*}_{ij} = m_{ij}^{\alpha} \, \frac{\phi \sum_{ij} w_{ij}}{\sum_{ij} (w_{ij} m_{ij}^{\alpha})} \tag{2} \]
prevalence. To take the time factor of the contacts into account, we propose the
probability $p_{ij}(t)$ that a contact c(i, j, t) between i and j at t is removed as
\[ p_{ij}(t) = m_{ij} f(t) \, \frac{\phi \sum_{ij} w_{ij}}{\sum_{ij} (w_{ij} m_{ij} f(t))} \tag{3} \]
where f(t) expresses the preference to block contacts in a specific period. The
probability that c(i, j, t) is removed is proportional to $m_{ij} \cdot f(t)$.
As a simple start, we consider $f(t) = 4 \cdot 1_{t \le T/2} + 1_{t > T/2}$,
$f(t) = 1_{t \le T/2} + 4 \cdot 1_{t > T/2}$ and f(t) = 1, where the indicator function $1_y$ is one if the condition y is
true, and otherwise it is 0. These three functions correspond to the preference
to block contacts happening in [1, T/2], in (T/2, T], and no preference over the
time of the contacts, respectively.
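The time-dependent removal probability can be illustrated per contact. The sketch below assumes that the normalization runs over all contacts, so that the probabilities sum to φ times the number of contacts; this is our interpretation of the normalization in (3), and the names `f_early` and `timed_removal_probabilities` as well as the contact list are made up.

```python
def f_early(t, T):
    """f(t) = 4 for t <= T/2 and 1 otherwise (preference for early contacts)."""
    return 4.0 if t <= T / 2 else 1.0

def timed_removal_probabilities(contacts, metric, phi, f, T):
    """Per-contact removal probability proportional to m_ij * f(t).

    Assumption: the normalisation sums over all contacts, so that a
    fraction phi of the contacts is removed on average.
    """
    score = [metric[(min(i, j), max(i, j))] * f(t, T) for i, j, t in contacts]
    scale = phi * len(contacts) / sum(score)
    return [s * scale for s in score]

contacts = [(0, 1, 1), (0, 1, 3), (1, 2, 4), (1, 2, 2)]   # (i, j, t), made up
metric = {(0, 1): 1.0, (1, 2): 1.0}                       # uniform centrality
p = timed_removal_probabilities(contacts, metric, phi=0.5, f=f_early, T=4)
print(p)
```

With a uniform metric, contacts in the first half of the window are four times as likely to be removed as those in the second half, while the expected removed fraction stays at φ.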
2.3 Datasets
We consider six real-world temporal physical contact networks, measured in three
scenarios:
All networks are considered as undirected. Some basic properties of the networks
are shown in Table 1.
Table 1. Basic properties of the temporal networks: the number of nodes, links and
contacts. The duration is the duration of the observation time window [1,T] measured
in days, thus T times the duration per discrete time step.
2.4 Simulation
3 Results
First of all, we evaluate the performance of all strategies as defined in (1) where
the time of a contact has no influence on its probability of being blocked.
Fig. 1. The prevalence ρ over time, when φ = 10% of the contacts are removed using
each centrality metric according to (1) in two temporal networks.
Figure 1 illustrates how the prevalence ρ grows over time when each contact
blocking strategy is applied in two networks and 10% of the contacts are removed.
Table 2. The prevalence E[ρ] averaged over time when φ = 10% of the contacts
are removed from each temporal network using removal probability (1) based on each
centrality metric. The best performance in each network is marked in bold.
The performance of the strategies in each network can also be compared via the
average prevalence E[ρ] over the whole time window, as shown in Tables 2 and
3, where φ = 10% and φ = 30% of the contacts are removed, respectively.
The 1/link weight metric performs best in all networks except for MIT2 and/or
WorkPlace13. These observations imply that it is more effective to suppress the
epidemic by removing contacts between node pairs that have few contacts.
For any node pair (i, j), the average number of contacts removed between
i and j is $p_{ij} w_{ij}$. For strategy 1/link weight, the average number of contacts
removed is the same for all node pairs or for all links in the aggregated network.¹
We wonder whether a more similar number of contacts removed per node
pair leads to a better mitigation effect. Hence, we further derive the variance
$\mathrm{Var}[p_{ij} w_{ij}]$ for each strategy in each network.
Figure 2(a) shows the scatter plot of the average prevalence E[ρ] versus
$\mathrm{Var}[p_{ij} w_{ij}]$. In each network, a strategy
tends to perform better, i.e. leads to a lower E[ρ], if $\mathrm{Var}[p_{ij} w_{ij}]$ is small.
¹ For strategy 1/link weight, the actual average number of contacts removed per node
pair in the simulation may differ slightly among the links, because when the removal
probability $p_{ij}$ derived from (1) is larger than one, we set $p_{ij} = 1$ and re-normalize
the removal probabilities of the remaining links so that a fraction φ of contacts is removed
as expected.
Table 3. The prevalence E[ρ] averaged over time when φ = 30% of the contacts
are removed from each temporal network using removal probability (1) based on each
centrality metric.
In the studies of the Susceptible-Infected-Susceptible (SIS) epidemic spreading
model on a static weighted network, the largest eigenvalue of the weighted
network has been shown to indicate the robustness of the network against
epidemics [11–14]. The infection rate between two individuals is assumed in the
SIS model to be proportional to the infection rate of the epidemic multiplied by
the link weight, i.e. the contact frequency. In this case, the epidemic threshold is
$\tau_c \sim 1/\lambda_1(W)$, where matrix W with its element $w_{ij}$ captures the weights of all
links in the aggregated network. When the effective infection rate, i.e. the infection
rate normalized by the recovery rate of the epidemic, is above the threshold,
a non-zero fraction of the population gets infected in the meta-stable state,
whereas below the threshold, the epidemic dies out in the meta-stable state. A
static weighted network with a small largest eigenvalue tends to be more robust
against epidemics. We explore further the largest eigenvalue $\lambda_1(W^*)$ of the resultant
aggregated network after contact removal, whose weighted adjacency matrix
is $W^*$. Would a strategy that leads to a smaller $\lambda_1(W^*)$ be more effective in
suppressing the prevalence, according to the findings of the SIS model on static networks?
The scatter plot in Fig. 2(b) of the average prevalence E[ρ] versus $\lambda_1(W^*)$ shows
the contrary: the prevalence tends to be low when the resultant network has
a large largest eigenvalue. This inconsistency can possibly be explained by
the following. Removing many contacts from few links whose end nodes have
a high strength may better reduce the largest eigenvalue. This is less effective in
mitigating an SI spreading process, where each link can transmit the epidemic
at most once, depending also on the time ordering of contacts. It can, however,
be effective in mitigating an SIS epidemic, where such links could transmit the
epidemic frequently.
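For readers who want to reproduce this kind of comparison, the largest eigenvalue of a weighted adjacency matrix can be estimated with power iteration. The sketch below is a generic textbook routine, not the authors' code; it assumes a symmetric, non-negative matrix given as a list of lists, and it iterates on W + I so that bipartite structures do not cause oscillation. The function name and the toy matrices are ours.

```python
def largest_eigenvalue(W, iterations=1000, tol=1e-12):
    """Power iteration on W + I for a symmetric, non-negative matrix W."""
    n = len(W)
    x = [1.0] * n
    lam = 0.0
    for _ in range(iterations):
        # Multiply by (W + I); the shift keeps the dominant eigenvalue unique in modulus.
        y = [x[i] + sum(W[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = sum(v * v for v in y) ** 0.5
        y = [v / norm for v in y]
        # Rayleigh quotient of W at the normalised vector y.
        new_lam = sum(y[i] * sum(W[i][j] * y[j] for j in range(n)) for i in range(n))
        if abs(new_lam - lam) < tol:
            return new_lam
        lam, x = new_lam, y
    return lam

# Toy aggregated networks (weights are made up): a single link with 2 contacts,
# and a 3-node path with unit weights.
W_pair = [[0, 2], [2, 0]]
W_path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
print(largest_eigenvalue(W_pair), largest_eigenvalue(W_path))
```

The reciprocal of the returned value gives the order of magnitude of the SIS threshold discussed above for the corresponding static weighted network.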
Finally, we take the time of a contact into account when selecting the contacts
to remove via the contact removal probability $p_{ij}(t)$ defined in (3). When
$f(t) = 1_{t \le T/2} + 4 \cdot 1_{t > T/2}$, contacts happening late, i.e. at t > T/2, are more likely to
be removed. When $f(t) = 4 \cdot 1_{t \le T/2} + 1_{t > T/2}$, contacts happening early, i.e. at
t ≤ T/2, are 4 times more likely to be removed compared to contacts happening late
(t > T/2). Comparing Tables 3, 4 and 5, where the contact removal is independent
of the time of a contact, favors the removal of late contacts, and favors the removal
of early contacts, respectively,
we find that the suppressing effect is better when early contacts are more likely
to be removed. Furthermore, the metric 1/link weight tends to perform the best
independent of the choice of f(t). Hence, the mitigation effect tends to be better
if contacts between node pairs that have few contacts and earlier contacts are
more likely to be removed. Node pairs with few contacts are usually referred to as
weak social ties. Removing the contacts along weak social ties seems an effective
and likely socially feasible mitigation strategy.
Fig. 2. (a) Scatter plot of the average prevalence E[ρ] versus the standard deviation
$\sqrt{\mathrm{Var}[p_{ij} w_{ij}]}$ of the average number of contacts removed from a node pair. (b) Scatter
plot of the average prevalence E[ρ] versus the largest eigenvalue $\lambda_1(W^*)$ of the resultant
aggregated network after the contact removal. A fraction φ = 30% of contacts are
removed.
Table 4. The prevalence E[ρ] averaged over time when φ = 30% of the contacts
are removed from each temporal network using contact removal probability (3) and
$f(t) = 1_{t \le T/2} + 4 \cdot 1_{t > T/2}$ based on each centrality metric. Contacts happening late, i.e.
at t > T/2, are more likely to be removed.
Table 5. The prevalence E[ρ] averaged over time when φ = 30% of the contacts
are removed from each temporal network using contact removal probability (3) and
$f(t) = 4 \cdot 1_{t \le T/2} + 1_{t > T/2}$ based on each centrality metric. Contacts happening early,
i.e. at t ≤ T/2, are more likely to be removed.
an epidemic spreads extremely fast, e.g. all nodes have already been infected
before T /2, the aggregated network information is likely not ideal to determine
the contact removal probabilities, though this scenario is less realistic.
References
1. Holme, P., Saramäki, J.: Temporal networks. Phys. Rep. 519, 97–125 (2012)
2. Holme, P.: Modern temporal network theory: a colloquium. Eur. Phys. J. B 88,
234 (2015)
3. Zhan, X., Hanjalic, A., Wang, H.: Information diffusion backbones in temporal
networks. Sci. Rep. 9, 6798 (2019)
4. Zhan, X., Hanjalic, A., Wang, H.: Suppressing information diffusion via link block-
ing in temporal networks. In: Cherifi, H., Gaito, S., Mendes, J.F., Moro, E., Rocha,
L.M. (eds.) Complex Networks and Their Applications VIII, vol. 881, pp. 448–458.
Springer, Heidelberg (2020)
5. Li, C., Li, Q., Van Mieghem, P., Stanley, H.E., Wang, H.: Correlation between
centrality metrics and their application to the opinion model. Eur. Phys. J. B 88,
65 (2015)
6. Li, C., Wang, H., de Haan, W., Stam, C.J., Mieghem, P.V.: The correlation of
metrics in complex networks with applications in functional brain networks. J.
Stat. Mech. 2011, P11018 (2011)
7. Fournet, J., Barrat, A.: Contact patterns among high school students. PLoS ONE
9, e107878 (2014)
8. Génois, M., Vestergaard, C., Fournet, J., Panisson, A., Bonmarin, I., Barrat, A.:
Data on face-to-face contacts in an office building suggest a low-cost vaccination
strategy based on community linkers. Netw. Sci. 3, 326–347 (2015)
9. Reality mining network dataset–KONECT. https://konect.uni-koblenz.de/networks/mit
10. Gemmetto, V., Barrat, A., Cattuto, C.: Mitigation of infectious disease at school:
targeted class closure vs school closure. BMC Infect. Dis. 14, 695 (2014)
11. Van Mieghem, P., Omic, J., Kooij, R.: Virus spread in networks. IEEE/ACM Trans.
Netw. 17, 1–14 (2009)
12. Wang, H., Li, Q., D’Agostino, G., Havlin, S., Stanley, H.E., Van Mieghem, P.:
Effect of the interconnected network structure on the epidemic threshold. Phys.
Rev. E 88, 022801 (2013)
13. Ottaviano, S., De Pellegrini, F., Bonaccorsi, S., Mugnolo, D., Van Mieghem, P.:
Community networks with equitable partitions. In: Altman, E., Avrachenkov, K.,
De Pellegrini, F., El-Azouzi, R., Wang, H. (eds.) Multilevel Strategic Interaction
Game Models for Complex Networks, pp. 111–129. Springer, Heidelberg (2019)
14. Qu, B., Wang, H.: SIS epidemic spreading with heterogeneous infection rates. IEEE
Trans. Netw. Sci. Eng. 4, 177–186 (2017)
Blocking the Propagation of Two
Simultaneous Contagions over Networks
1 Introduction
consider disease propagation and use vaccinating nodes as the blocking strategy.
The goal is to reduce the number of newly infected nodes under a budget on the
number of nodes that can be vaccinated. Following [11], we use the synchronous
dynamical system (SyDS) as the formal model for contagion propagation; see
Sect. 2.
Summary of Results: We discuss a general threshold-based model for the
simultaneous propagation of two contagions through a network. As this general
model (which requires the specification of five threshold values for each node) is
somewhat complex, we consider a simplified model that uses only two threshold
values for each node. Using that model, we formulate the problem of minimiz-
ing the number of new infections in a network by vaccinating some nodes. In
practice, there is a budget constraint on the number of vaccinations. We observe
that the resulting budget-constrained optimization problem is computationally
intractable using a known result for the case of a single contagion [11]. Therefore,
we develop an efficient heuristic algorithm called MCICH for the problem. This
heuristic is based on a generalized version of the Minimum Set Cover (MSC)
problem [7]. Through computational experiments, we compare the performance
of MCICH with two baseline methods using three real-world social networks.
Our results indicate that MCICH is able to block the two contagions effectively
even with a small vaccination budget, and performs far better than the other
two methods.
Related Work: Reference [11] treats the single contagion blocking problem
under the threshold model. The goal is again to minimize the number of new
infections subject to a budget on the number of nodes that can be vaccinated.
It is shown that if the budget cannot be violated, even obtaining an approxi-
mation algorithm with any provable performance guarantee is NP-hard. Two
efficient heuristics for the problem are introduced and their performance is eval-
uated on several social networks. Although single contagion epidemic models
have been studied for years, study of the multiple contagion context is newer.
For example, conditions for the coexistence of two contagions in compartmental
models are explored in [3]. A number of references (see e.g., [10,14,15] and the
references cited therein) have considered the propagation of competing conta-
gions (where infection by one contagion prevents or reduces the likelihood of
infection by another), and cooperating contagions (where infection by one con-
tagion makes it easier to get infected by another contagion). While our work uses
the deterministic threshold model, reference [18] discusses a general framework
for a probabilistic multiple-contagion model, namely the Susceptible-Infected-
Recovered (SIR) model.
State  Interpretation
0      Not infected by either C1 or C2
1      Infected by C1 only
2      Infected by C2 only
3      Infected by both C1 and C2
Fig. 1. Possible state transitions for each node
|V | = n, represents the underlying graph of the SyDS, with node set V and edge
set E, and (b) F = {f1 , f2 , . . . , fn } is a collection of functions in the system, with
fi denoting the local function associated with node vi , 1 ≤ i ≤ n. Each node
of G has a state value from B. Each function fi specifies the local interaction
between node vi and its neighbors in G. The inputs to function fi are the state
of vi and those of the neighbors of vi in G; function fi maps each combination
of inputs to a value in B. This value becomes the next state of node vi . It is
assumed that each local function can be computed efficiently.
For a single contagion, the domain B is usually chosen as {0,1}, with 0 and
1 representing that a node is uninfected and infected respectively. Since we have
two contagions (denoted by C1 and C2 ) propagating through the underlying
network, we have four possible states for each node, denoted by 0, 1, 2 and 3;
thus, B = {0, 1, 2, 3}.
The interpretation of these state values is shown in Table 1. An easy way to
think of these states is to consider the 2-bit binary expansion of the state values
0 through 3. The least (most) significant bit indicates whether the node has been
infected by C1 (C2 ).
We assume that the system is progressive with respect to each of the con-
tagions [6]; that is, once a node is infected by a contagion, it remains infected by
that contagion. Using this assumption, Fig. 1 shows the possible state transitions
for each node.
State Transition Rules: Each node v is associated with a local transition
function fv that determines the next state of v given its current state and the
states of the neighbors of v. Such a function may be deterministic or stochas-
tic (as in SIR systems). Here, we will consider a simple class of deterministic
functions called threshold functions.
A General Form of Local Functions: We first discuss a very general (but
somewhat complex) form of local functions for the propagation of two contagions
in a network and then present a simpler form that will be used in the paper. In
the general form, for each node v and each of the five possible state transitions x
to y (shown in Fig. 1), there is a threshold value θ(v, x, y). Let N (v, j) denote the
number of neighbors of v in state j, 0 ≤ j ≤ 3. (If the state of node v is j, then v
is included in the count N (v, j).) For any node v, the rules for each possible state
transition which collectively specify the local function fv are shown in Table 2.
458 H. L. Carscadden et al.
We briefly explain two of the state transition conditions shown in Table 2. The
conditions for other state transitions are similar. Consider the condition for the
“0 −→ 1” transition. For this transition to occur at a node v, the number of
neighbors of v in state 1 or state 3 must be at least θ(v, 0, 1) (i.e., (N (v, 1) +
N (v, 3) ≥ θ(v, 0, 1))) and the number of neighbors of v in state 2 or state 3 must
be less than θ(v, 0, 2) (i.e., (N (v, 2) + N (v, 3) < θ(v, 0, 2))). Likewise, for the
“1 −→ 3” transition to occur at v, the number of neighbors of v in state 2 or
state 3 must be at least θ(v, 1, 3) (i.e., (N (v, 2) + N (v, 3) ≥ θ(v, 1, 3))). At any
state j ∈ {0, 1, 2, 3}, if none of the conditions for transitions out of j hold, the
node remains in state j.
The above general model is powerful as it allows the two contagions to inter-
act. Many references have considered cooperating and competing contagions
(e.g., [10,12,15]). For example, in the case of cooperating contagions, if a node
has already contracted C1 , it may be easier for it to contract C2 . This can be
modeled by choosing a low value for θ(v, 1, 3). However, the model is also com-
plex since it requires the specification of five threshold values for each node. In
this paper, we consider a simpler model which uses only two threshold values for
each node.
A Simpler Form of Local Functions: In the general form discussed above,
each node was associated with five threshold values, one corresponding to each
of the five transitions shown in Fig. 1. In the simpler model, for each node v, we
use two threshold values, denoted by θ(v, 1) and θ(v, 2). The parameter θ(v, 1)
is used when v is in state 0 or state 2 (i.e., has not contracted contagion C1 ); it
specifies the minimum number of neighbors of v whose state is either 1 or 3 so
that v can contract contagion C1 . Similarly, θ(v, 2) is used when v is in state 0 or
1, and it specifies the minimum number of neighbors of v whose state is either 2
or 3 so that v can contract contagion C2 . Unlike the general model, the simpler
model does not permit other interactions between the two contagions. However,
the simpler model facilitates the development of analytical and experimental
results.
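The simpler local function can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `next_state` and the bit masks are assumptions, and the 2-bit encoding follows the interpretation given earlier (least significant bit for C1, most significant bit for C2). Only neighbors are counted; since θ(v, 1) is consulted only when v itself lacks C1 (and likewise for C2), including v in the count would not change the result.

```python
from typing import Iterable

C1_BIT = 0b01  # least significant bit: infected by C1 (states 1 and 3)
C2_BIT = 0b10  # most significant bit: infected by C2 (states 2 and 3)


def next_state(state: int, neighbor_states: Iterable[int],
               theta1: int, theta2: int) -> int:
    """Next state of one node under the simpler two-threshold model.

    `theta1` and `theta2` play the roles of theta(v, 1) and theta(v, 2).
    """
    neighbors = list(neighbor_states)
    # Neighbors in state 1 or 3 carry C1; neighbors in state 2 or 3 carry C2.
    c1_count = sum(1 for s in neighbors if s & C1_BIT)
    c2_count = sum(1 for s in neighbors if s & C2_BIT)

    new_state = state  # progressive system: bits that are set stay set
    if not (state & C1_BIT) and c1_count >= theta1:
        new_state |= C1_BIT
    if not (state & C2_BIT) and c2_count >= theta2:
        new_state |= C2_BIT
    return new_state
```

Because the two contagions are tested independently, a node in state 0 may move directly to state 3 when both thresholds are met in the same step.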
Additional Definitions Concerning SyDSs: At any time τ, the configuration
C of a SyDS is the n-vector (s_1^τ, s_2^τ, . . . , s_n^τ), where s_i^τ ∈ B is the state of
node vi at time τ (1 ≤ i ≤ n). Given a configuration C, the state of a node v in
C is denoted by C(v). As mentioned earlier, in a SyDS, all nodes compute and
update their next state synchronously. Other update disciplines (e.g., sequential
updates) have also been considered in the literature (e.g., [2,13]). Suppose a
given SyDS transitions in one step from a configuration C to a configuration C′.
Then we say that C′ is the successor of C, and C is a predecessor of C′. Since
the SyDSs considered in this paper are deterministic, each configuration has a
unique successor. However, a configuration may have zero or more predecessors.
A configuration C which is its own successor is called a fixed point. Thus, when
a SyDS reaches a fixed point, no further state changes occur at any node.
Example: The underlying network of a SyDS in which two contagions are prop-
agating under the simpler model discussed above is shown in Fig. 2.
Fig. 2. The underlying network of a SyDS with two contagions (nodes a, b, c
and d). For each node v, both the threshold values are 1.
For each node v, the two threshold values θ(v, 1) and θ(v, 2) are both chosen as
1. Suppose the initial states of nodes a, b, c and d are 1, 2, 0 and 0 respectively;
that is, the initial configuration of the system is (1, 2, 0, 0). The local function
fa at a is computed as follows. Since a is in state 1, we need to check if it can
contract contagion C2. Since θ(a, 2) = 1 and a has a neighbor (namely b) in state
2, a can indeed contract contagion C2.
Therefore, the value of the local function fa is 3; that is, the next state of a is 3.
In a similar manner, it can be seen that the local functions fb and fc (at nodes b
and c respectively) also evaluate to 3. For node d, whose current state is 0, there
is one neighbor (namely, b) whose state is 2. Therefore, the local function fd at
d evaluates to 2. Thus, the configuration of the system at time 1 is (3, 3, 3, 2).
Since the system is progressive, the states of nodes a, b and c will continue to be
3 in subsequent time steps. However, the state of node d changes to 3 at time
step 2 since d has a neighbor (namely, b) whose state at time step 1 is 3. Thus,
the configuration of the system at the end of time step 2 is (3, 3, 3, 3). In other
words, the sequence of configurations at times 0, 1 and 2 of the system is:
(1, 2, 0, 0) −→ (3, 3, 3, 2) −→ (3, 3, 3, 3).
Once the system reaches the configuration (3, 3, 3, 3), no further state changes
can occur. Thus, the configuration (3, 3, 3, 3) is a fixed point for the system.
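The example can be replayed with a short synchronous simulation. The sketch below is illustrative only. The edge set (a-b, a-c, b-c, b-d) is an assumption inferred from the walkthrough, since the figure itself is not reproduced here; any edge set consistent with the text would serve. The helper names `step` and `run_to_fixed_point` are likewise hypothetical.

```python
def step(config, adj, theta1, theta2):
    """One synchronous update: every node reads the old configuration."""
    new_config = {}
    for v, s in config.items():
        # Count neighbors carrying C1 (bit 1) and C2 (bit 2) in the old configuration.
        c1 = sum(1 for u in adj[v] if config[u] & 1)
        c2 = sum(1 for u in adj[v] if config[u] & 2)
        if not (s & 1) and c1 >= theta1[v]:
            s |= 1
        if not (s & 2) and c2 >= theta2[v]:
            s |= 2
        new_config[v] = s
    return new_config


def run_to_fixed_point(config, adj, theta1, theta2):
    """Return the list of configurations from time 0 up to the fixed point."""
    history = [config]
    while True:
        nxt = step(history[-1], adj, theta1, theta2)
        if nxt == history[-1]:  # a configuration that is its own successor
            return history
        history.append(nxt)


# ASSUMED edge set, chosen to be consistent with the walkthrough in the text.
adj = {"a": ["b", "c"], "b": ["a", "c", "d"], "c": ["a", "b"], "d": ["b"]}
ones = {v: 1 for v in adj}  # both thresholds are 1 at every node
history = run_to_fixed_point({"a": 1, "b": 2, "c": 0, "d": 0}, adj, ones, ones)
for t, cfg in enumerate(history):
    print(t, tuple(cfg[v] for v in "abcd"))
```

Under the assumed edges, the run prints the configurations (1, 2, 0, 0), (3, 3, 3, 2) and (3, 3, 3, 3) for times 0, 1 and 2, matching the sequence described above.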
In this example, the SyDS reached a fixed point. Using our assumption that
the system is progressive, one can show that every such SyDS reaches a fixed
point.
Proposition 1. Every progressive SyDS under the two-contagion model reaches
a fixed point from every initial configuration.
Proof: Consider any progressive SyDS on B = {0, 1, 2, 3}. Let n denote the
number of nodes in the underlying graph of the SyDS. In any transition from
a configuration to a different configuration, at least one node changes state.
Because the system is progressive, each node may change state at most twice:
once from 0 to 1 (or 0 to 2) and then from 1 to 3 (or 2 to 3). Thus, after at
most 2n transitions where the states of one or more nodes change, there can be
no further state changes. In other words, the system reaches a fixed point after
at most 2n transitions.
Problem Formulation: The focus of this paper is on a method for containing
the propagation of two simultaneous contagions by appropriately vaccinating a
subset of nodes. Before defining the problem formally, we state the assumptions
used in our formulation.
Following [11], we assume that only those nodes that are initially uninfected
by either contagion (i.e., nodes whose initial state is 0) can be vaccinated for C1
and/or C2 . When a node is vaccinated for a certain contagion, the node cannot
get infected by that contagion; as a consequence, such a node cannot propagate
the corresponding contagion. For i = 1, 2, one can think of vaccinating a node v
for a contagion Ci as increasing the threshold θ(v, i) of the node v to degree(v)+1
so that the number of neighbors of v that are infected by Ci will always be less
than θ(v, i). If a node v is vaccinated for both C1 and C2 , then it plays no role in
propagating either contagion. In such a situation, one can think of the effect of
vaccination as removing node v and all the edges incident on v from the network.
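The threshold view of vaccination can be illustrated as follows. This is a sketch under stated assumptions: thresholds are stored in a dictionary keyed by (node, contagion) pairs, the helper name `vaccinate` is hypothetical, and the small graph reuses an assumed edge set for the four-node example.

```python
def vaccinate(theta, adj, node, contagion):
    """Return a copy of `theta` with `node` vaccinated against `contagion` (1 or 2)."""
    updated = dict(theta)
    # A threshold of degree(v) + 1 can never be met by the number of infected neighbors.
    updated[(node, contagion)] = len(adj[node]) + 1
    return updated


# ASSUMED edge set for the four-node example; all thresholds start at 1.
adj = {"a": ["b", "c"], "b": ["a", "c", "d"], "c": ["a", "b"], "d": ["b"]}
theta = {(v, i): 1 for v in adj for i in (1, 2)}
theta_vaccinated = vaccinate(theta, adj, "d", 2)
```

Here node d has degree 1, so its C2 threshold becomes 2 and d can no longer contract C2, while its C1 threshold is unchanged. Vaccinating a node for both contagions raises both thresholds, which has the same effect on propagation as deleting the node.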
The optimization problem studied in this paper is a generalization of a prob-
lem studied in [11] for a single contagion. This problem deals with choosing a
small set of nodes to vaccinate so that the total number of resulting new infec-
tions when the system reaches a fixed point is a minimum. Given a set C of
nodes to be vaccinated, a vaccination scheme specifies for each node w ∈ C,
whether w is vaccinated against C1 , C2 or both. The total number of vaccina-
tions used by a vaccination scheme for a set of nodes C is the sum N1 + N2 ,
where Ni is the number of nodes vaccinated against Ci , i = 1, 2. Note that if
a node w is vaccinated against both C1 and C2 , then it is included in both N1
and N2 . Also, after a vaccination scheme is chosen and the contagions spread
through a network, the number of new infections is measured as the total num-
ber of state transitions, because each state transition means a node acquires a
new contagion. A formal statement of this optimization problem is as follows.
Vaccination Scheme to Minimize the Total Number of New Infections
(VS-MTNNI)
Given: A social network represented by the SyDS S = (G, F) over B = {0, 1,
2, 3}, with each local function fv ∈ F at node v represented by two threshold
values θ(v, 1) and θ(v, 2); the set I of seed nodes which are initially infected
(i.e., the state of each node in I is from {1,2,3}); an upper bound β on the total
number of vaccinations.
Requirement: A set C ⊆ V − I of nodes to be vaccinated and a vaccination
scheme for C so that (i) the total number of vaccinations is at most β and (ii)
among all subsets of V − I which can be vaccinated to satisfy (i), the set C and
the chosen vaccination scheme lead to the smallest number of newly infected
nodes.
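The two quantities in the problem statement, the total number of vaccinations N1 + N2 and the number of new infections, can be computed as sketched below. The representation is an assumption: a vaccination scheme is a dictionary from node to the set of contagions it is vaccinated against, and a run is a list of configurations. The scheme shown is a hypothetical illustration and is unrelated to the run, which is the unvaccinated four-node example from the text.

```python
def total_vaccinations(scheme):
    """N1 + N2: a node vaccinated against both contagions is counted twice."""
    return sum(len(contagions) for contagions in scheme.values())


def count_new_infections(history):
    """Number of state transitions over a run (list of configurations).

    Following the text literally, each change of a node's state counts once,
    so a direct jump from state 0 to state 3 is a single transition.
    """
    return sum(
        1
        for before, after in zip(history, history[1:])
        for v in before
        if before[v] != after[v]
    )


# Hypothetical scheme: c is vaccinated against C1 and C2, d against C2 only.
scheme = {"c": {1, 2}, "d": {2}}

# The unvaccinated run of the four-node example at times 0, 1 and 2.
history = [
    {"a": 1, "b": 2, "c": 0, "d": 0},
    {"a": 3, "b": 3, "c": 3, "d": 2},
    {"a": 3, "b": 3, "c": 3, "d": 3},
]
```

For this data the scheme uses 3 vaccinations, and the run contains 5 state transitions (four in the first step and one in the second).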
It is easy to show that the result of Theorem 1 also holds for the VS-MTNNI
problem.
Proof: The SCS-MNA problem can be easily reduced to the VS-MTNNI prob-
lem as follows. Let an instance of SCS-MNA be given by a graph G(V, E), a
subset I ⊆ V of initially infected nodes (by the only contagion), a vaccination
budget β and an upper bound Q on the number of new infections. From the
graph G(V, E) of the SCS-MNA instance, we create a new graph G′(V′, E) by
adding a new node v′ to V such that v′ has no edges incident on it. In the VS-
MTNNI instance, the initial state of each node in I is chosen as 1 and the initial
state of the new node v′ is chosen as 2. The two threshold values for each node
in G′ are chosen as 2. It is now easy to see that only C1 can spread in the SyDS
represented by G′. Therefore, any vaccination scheme for G′ that uses at most
β vaccinations and causes at most Q new infections is also a solution to the
SCS-MNA instance, and vice versa.
Proposition 2 points out that in the worst-case, even obtaining an effi-
cient approximation algorithm with a provable performance guarantee for the
VS-MTNNI problem is computationally intractable. Therefore, we now focus
on designing a heuristic algorithm that works well in practice. This heuristic
relies on a known approximation algorithm for a generalized version of the
Set Cover problem, called the Set Multicover problem [20]. In this problem,
Footnote 1: An algorithm for the SCS-MNA problem provides a factor-ρ approximation
if, for every instance of the problem, the number of new infections is at most ρQ∗,
where Q∗ is the minimum number of new infections.
3 Experimental Results
In this section, we provide the networks tested; descriptions of the key elements of
the analysis process—simulation and the contagion blocking heuristics (including
the new MCICH); a summary of the overall analysis steps; and results of the
contagion blocking numerical experiments. Throughout this section, we use the
words “activated” and “infected” as synonyms, and also “block” and “vaccinate”
as synonyms.
Networks: The three networks of Table 3 are evaluated. We use only the giant
components from the networks.
Table 3. Networks used in experiments, and selected properties. All properties are
for the giant component of each graph. These properties were computed using the
net.science system [1].
Network          Num. nodes   Num. edges   Ave. degree   Ave. clust. coeff   Diameter
Astroph              17,903      196,972          22.0               0.633         14
FB-Politicians        5,908       41,706          14.1               0.385         14
Wiki                  7,115      100,762          28.3               0.141          7
iterations per simulation, where the difference among the iterations is the
composition of the seed node sets. Simulations are run with and without blocking
nodes.
Blocking Heuristics: We present three methods (heuristics) for blocking a
contagion. The first two are well studied, and serve as baselines for comparison.
The third method is the covering heuristic MCICH that is a contribution of
this work. For a simulation involving two distinct contagions, the corresponding
method is applied for each contagion individually.
Random Heuristic. For a given budget βi on the number of blocking nodes for
contagion Ci , select βi nodes from among all nodes, uniformly at random.
High Degree Heuristic. For a given budget βi on the number of blocking nodes
for contagion Ci , select the βi nodes with the greatest degrees (break ties arbi-
trarily).
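The two baseline heuristics are simple enough to sketch directly. The code below is an illustration, not the authors' implementation: the function names are hypothetical, the graph is an adjacency dictionary, and the budget is an absolute number of nodes rather than a fraction.

```python
import random


def random_blockers(adj, budget, rng=None):
    """Select `budget` nodes uniformly at random from among all nodes."""
    rng = rng or random.Random()
    return set(rng.sample(list(adj), budget))


def high_degree_blockers(adj, budget):
    """Select the `budget` nodes of greatest degree.

    Ties are broken by the (arbitrary) order of the adjacency dictionary,
    because Python's sort is stable.
    """
    ranked = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
    return set(ranked[:budget])


# Toy star-like graph used only for illustration.
toy = {
    "hub": ["x", "y", "z"],
    "x": ["hub", "y"],
    "y": ["hub", "x"],
    "z": ["hub"],
}
```

On the toy graph, `high_degree_blockers(toy, 1)` returns the hub, whereas `random_blockers(toy, 2, random.Random(0))` returns some pair of nodes that depends on the seed. For two contagions, each function would be called once per contagion with the corresponding budget.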
New Multi-Contagion Independent Covering Heuristic (MCICH). We devise a
set cover heuristic to identify a subset of nodes that are activated at time t,
to set as blocking nodes, such that no nodes will activate at time t + 1. If this is
accomplished, then the contagion is halted at t, and our goal is achieved.
A key idea is that any node vi that is activated at time t + 1 does so because
it receives influence from nodes activated at time t, for otherwise, vi would have
activated at an earlier time. Thus, for a node vi that gets activated at time t + 1,
vaccinating or blocking nodes at time t will halt contagion propagation to vi .
This idea is used in the algorithm as follows. Consider the sets St and St+1 of
nodes that get infected or activated at times t and t+1, respectively. We identify
nodes from St , one at a time, iteratively, where the node vk that is removed from
St has the most edges in the graph G to nodes that are still infected in St+1 .
Each time a node vk is removed from St, the “covering requirement” of each neighbor
vj ∈ St+1 is reduced by 1. When the requirement of vj reaches 0 through the removal
of one or more nodes from St, vj can no longer be infected by contagion Ci.
The algorithm for the MCICH is presented in Algorithm 1. The algorithm
computes the set C of blocking nodes for contagion Ci for one iteration.
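The greedy covering idea described above can be sketched as follows. This is not a reproduction of Algorithm 1; it is a minimal illustration written from the prose description alone. The function name, the `budget` cut-off and the `requirement` input are assumptions. In particular, the initial covering requirement of each node in St+1 is passed in by the caller; one plausible choice (an assumption) is the number of infected neighbors minus the threshold plus 1, i.e., the number of removals needed to push the node below its threshold.

```python
def greedy_multicover(s_t, s_next, adj, requirement, budget):
    """Greedily pick blocking nodes from s_t to cover nodes in s_next.

    s_t, s_next : nodes activated at times t and t + 1
    requirement : dict mapping each node in s_next to its covering requirement
    Returns (chosen blocking nodes in order, nodes of s_next left uncovered).
    """
    need = {v: requirement[v] for v in s_next if requirement[v] > 0}
    remaining = set(s_t)
    chosen = []

    def gain(u):
        # Number of edges from u to nodes of s_next that are still uncovered.
        return sum(1 for w in adj[u] if w in need)

    while need and remaining and len(chosen) < budget:
        best = max(remaining, key=gain)
        if gain(best) == 0:
            break  # no remaining candidate helps any uncovered node
        chosen.append(best)
        remaining.discard(best)
        for w in adj[best]:
            if w in need:
                need[w] -= 1
                if need[w] == 0:
                    del need[w]  # w can no longer be infected
    return chosen, set(need)


# Toy instance: x and y activate at time t; p, q and r would activate at t + 1.
toy_adj = {"x": ["p", "q", "r"], "y": ["r"]}
toy_req = {"p": 1, "q": 1, "r": 2}
```

On the toy instance with a budget of 2, the sketch first picks x (three useful edges) and then y, leaving no node uncovered; with a budget of 1 it picks only x and leaves r uncovered.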
Summary of Analysis Process: The steps of the full analysis follow. Step 1:
simulations are performed without consideration of blocking nodes, as described
above. Step 2: using the simulation outputs, blocking nodes are determined using
the blocking heuristics and specified blocking node budget βi for contagion Ci .
Step 3: the simulations are repeated, with all conditions the same as in Step 1,
except that now the blocking nodes are added (these blocking nodes remain in
state 0). Note that the simulation and blocking methods, models and codes can
handle—as they currently exist—non-uniform thresholds across nodes, different
thresholds per contagion for each node, and heterogeneities in other parame-
ters. We are reporting uniform threshold results owing to space limitations and
because it is important to understand baseline behaviors.
Simulation and Blocking Results: Unless otherwise stated, all results are
averages over all 10 iterations of a simulation.
Basic Simulation Data and Temporal Blocking Effects. Figure 3 provides three
types of results for the FB-Politicians network. The first two plots show temporal
data on the spread or propagation of both contagions C1 and C2 simultaneously
without blocking. The third plot shows temporal effects of blocking nodes on the
propagation of both contagions. Figure 3a shows the number of newly activated
nodes at each time step. The curves rise as the uniform threshold decreases from 4
to 3 to 2, since contagion propagates more readily for lower thresholds. Figure 3b
shows the corresponding plots of total or cumulative number of nodes activated
for both contagions as a function of time. Roughly 40% to 70% of FB-Politicians
nodes are activated by tmax = 24, depending on θ. Figure 3c uses the θ = 3 data
from Fig. 3b as a baseline, and shows three additional curves, one for each of the
three blocking methods discussed above. These data show that for a blocking
budget βi = 0.02 fraction of nodes, the MCICH performs best (i.e., the curve
is the lowest). For blocking contagions “lesser” (or “lower”) is better. However,
this budget is not sufficient to completely halt the contagion.
Fig. 3. Simulation results for the FB-Politicians network, where results are averages
over 10 iterations. (a) shows time histories of the average number of newly activated
nodes at each time step for contagions C1 and C2 combined, for three thresholds. (b)
shows time histories of the average number of cumulative activated nodes at each time
step for contagion C1 and C2 combined, for the same thresholds. (c) provides data for
θ = 3, for no blocking, and for each of the three blocking methods, where the blocking
node budget βi = 0.02 fraction of nodes. No method completely blocks the contagion
(a greater budget is required), but MCICH performs best over the entire time history.
Fig. 4. Simulation and blocking results of applying all three blocking methods to block
two-contagion spreading in three networks, using different threshold values for conta-
gion propagation for each network. Results for each network are in one row. From the
top to bottom rows, the networks are: Astroph, FB-Politicians, and Wiki. Each
column contains data for one threshold: left to right, θ = 2, 3, and 4. Each plot displays the
fraction of nodes contracting either contagion in diffusion simulations, as a function of
the fraction of nodes used as blocking nodes, employed to stop the contagions. In each
plot, there are four curves. Contagion spreading without blocking (black dots) is the
reference curve, and is a horizontal line. Results from the random selection of blocking
nodes are shown by the green curve. Results from selecting the highest-degree nodes as
blocking nodes are shown by the blue dashed curve. Results from the MCICH method are
shown by the red dash-dot curve.
The lower the curve, the better the performance in blocking contagion. The MCICH
method does significantly better in all cases.
Acknowledgments. We thank the reviewers for their comments. This work is par-
tially supported by NSF Grants ACI-1443054 (DIBBS), IIS-1633028 (BIG DATA),
CMMI-1745207 (EAGER), OAC-1916805 (CINES), CCF-1918656 (Expeditions),
CRISP 2.0-1832587 and IIS-1908530.
References
1. Ahmed, N.K., Alo, R.A., Amelink, C.T., et al.: net.science: a cyberinfrastructure
for sustained innovation in network science and engineering. In: Gateway (2020)
2. Barrett, C., Hunt III, H.B., Marathe, M.V., Ravi, S.S., Rosenkrantz, D.J., Stearns,
R.E., Thakur, M.: Predecessor existence problems for finite discrete dynamical
systems. Theor. Comput. Sci. 386(1), 3–37 (2007)
3. Beutel, A., Prakash, B.A., Rosenfeld, R., Faloutsos, C.: Interacting viruses in net-
works: can both survive? In: KDD, pp. 426–434 (2012)
4. Centola, D., Macy, M.: Complex contagions and the weakness of long ties. Am. J.
Sociol. 113(3), 702–734 (2007)
5. Chakrabarti, D., Wang, Y., Wang, C., Leskovec, J., Faloutsos, C.: Epidemic thresh-
olds in real networks. TISSEC 10, 1–33 (2008)
6. Easley, D., Kleinberg, J.: Networks, Crowds, and Markets: Reasoning About a
Highly Connected World. Cambridge University Press (2010)
7. Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory
of NP-completeness. W. H. Freeman & Co., San Francisco (1979)
8. González-Bailón, S., Borge-Holthoefer, J., Rivero, A., Moreno, Y.: The dynamics
of protest recruitment through an online network. Sci. Rep. 1, 7 (2011)
9. Granovetter, M.: Threshold models of collective behavior. Am. J. Sociol. 1420–1443
(1978)
10. Karrer, B., Newman, M.E.J.: Competing epidemics on complex networks. Phys.
Rev. E 84(3) (2011). https://doi.org/10.1103/PhysRevE.84.036106
11. Kuhlman, C.J., Kumar, V.A., Marathe, M.V., Ravi, S., Rosenkrantz, D.J.: Inhibit-
ing diffusion of complex contagions in social networks: theoretical and experimental
results. Data Min. Knowl. Disc. 29(2), 423–465 (2015)
12. Kumar, P., Verma, P., Singh, A., Cherifi, H.: Choosing optimal seed nodes in
competitive contagion. Front. Big Data 2, 1–6 (2019)
13. Mortveit, H., Reidys, C.: An Introduction to Sequential Dynamical Systems.
Springer, New York (2007)
14. Myers, S.A., Leskovec, J.: Clash of the contagions: cooperation and competition in
information diffusion. In: ICDM, pp. 539–548 (2012)
15. Newman, M.E.J., Ferrario, C.R.: Interacting epidemics and coinfection on contact
networks. PLoS ONE 8, 1–8 (2013)
16. Romero, D., Meeder, B., Kleinberg, J.: Differences in the mechanics of informa-
tion diffusion across topics: idioms, political hashtags, and complex contagion on
twitter. In: WWW, pp. 695–704 (2011)
17. Schelling, T.C.: Micromotives and Macrobehavior. Norton, New York (1978)
18. Stanoev, A., Trpevski, D., Kocarev, L.: Modeling the spread of multiple concurrent
contagions on networks. PLoS ONE 9(6), e95669 (2014)
19. Ugander, J., Backstrom, L., Marlow, C., Kleinberg, J.: Structural diversity in social
contagion. Proc. Natl. Acad. Sci. 109(16), 5962–5966 (2012)
20. Vazirani, V.V.: Approximation Algorithms. Springer, Heidelberg (2001)
21. Watts, D.J.: A simple model of global cascades on random networks. Proc. Natl.
Acad. Sci. 99, 5766–5771 (2002)
Stimulation Index of Cascading
Transmission in Information Diffusion
over Social Networks
1 Introduction
Social networking services (SNSs) such as Twitter and Facebook have become
crucial information infrastructures, transmitting various information, such as
news and rumors, from person to person. Thus, understanding the process of
information diffusion and how many people it reaches is an important research
question.
In viral marketing (the promotion of products by word of mouth on social
networks), there is a need to reach as many users as possible within a limited
budget. This problem amounts to identifying the essential users who can convey
information to the most users, and is called “Influence Maximization” [7].
In addition, in times of disaster, lies and rumors spread, and many
users receive false information and transmit it. There are also studies on how to
disseminate corrections and block the spread of false information efficiently [6,8].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 943, pp. 469–481, 2021.
https://doi.org/10.1007/978-3-030-65347-7_39
470 K. Inafuku et al.
2 Related Work
2.1 Information Diffusion Model
When information is transmitted from one person to another, it is not only
exchanged between two people, but is also spread more widely as the receiver
transmits it to new people. This information transfer chain phenomenon is called
the “Information Cascade,” [1] and there has been much research on modeling
this phenomenon on complex networks. In particular, the “Independent Cascade
model (IC model)” [4,7] and “Linear Threshold model (LT model)” [15,16] are
widely used.
In comparing these two models, the IC model is sender centric. The diffu-
sion probability must be assigned to each edge in advance. We then select an
information source node and the node transmits information to the neighboring
nodes. The success or failure of information transmission depends independently
on each edge’s diffusion probability. If the information transmission is successful,
the receiver node becomes a sender node and sends the received information on
to its neighboring nodes. The IC model emulates information diffusion by
repeating this process.
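The process just described can be sketched as a short simulation. This is a generic illustration of the IC model, not code from the paper: the function name, the adjacency dictionary `out_adj` and the per-edge probability dictionary `prob` are assumed data structures.

```python
import random


def independent_cascade(out_adj, prob, source, rng):
    """Simulate one run of the Independent Cascade model from `source`.

    out_adj : dict mapping a node to the list of nodes it can send to
    prob    : dict mapping an edge (u, v) to its diffusion probability
    Returns the set of nodes that received the information.
    """
    active = {source}
    frontier = [source]  # nodes that became senders in the previous round
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in out_adj.get(u, []):
                # Each edge gets a single, independent transmission attempt.
                if v not in active and rng.random() < prob[(u, v)]:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active


# Toy chain s -> a -> b used only for illustration.
chain = {"s": ["a"], "a": ["b"]}
always = {("s", "a"): 1.0, ("a", "b"): 1.0}
never = {("s", "a"): 0.0, ("a", "b"): 0.0}
```

With probability 1.0 on every edge the whole chain is reached, and with probability 0.0 only the source is active. In practice the simulation is repeated many times and the results are averaged, because each run is random.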
On the other hand, the LT model simulates information diffusion around the
receiver nodes. First, we assign a threshold (0 to 1) for activation to each node.
We then choose a source node, which transmits the information to all neighboring
nodes. Neighboring nodes receive 1/in-degree as weights. When the sum of the
received weights exceeds the threshold value, the receiver node is activated and
transmits the information to neighboring nodes. These models have been used,
for example, in “Influence Maximization” [7,9,11] to find nodes that maximize
the expected number of nodes that receive information. Because real-world
networks are dynamic, a dynamic extension of Influence Maximization [12,13] is
also being studied.
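For comparison with the IC model, the receiver-centric LT model can be sketched as below. This is a generic illustration under stated assumptions: each active in-neighbor contributes a weight of 1/in-degree, activation requires the summed weight to strictly exceed the threshold (following the wording above), and the names `linear_threshold`, `in_adj` and `theta` are hypothetical.

```python
def linear_threshold(in_adj, theta, sources):
    """Simulate the Linear Threshold model with uniform 1/in-degree weights.

    in_adj : dict mapping a node to the list of its in-neighbors
    theta  : dict mapping a node to its activation threshold in [0, 1]
    Returns the set of active nodes once no further activation occurs.
    """
    active = set(sources)
    while True:
        newly_active = set()
        for v, preds in in_adj.items():
            if v in active or not preds:
                continue
            # Each active in-neighbor contributes a weight of 1 / in-degree.
            weight = sum(1 for u in preds if u in active) / len(preds)
            if weight > theta[v]:
                newly_active.add(v)
        if not newly_active:
            return active
        active |= newly_active


# Toy network: s -> a, s -> b, a -> b, b -> c, d -> c.
toy_in = {"s": [], "a": ["s"], "b": ["s", "a"], "c": ["b", "d"], "d": []}
toy_theta = {"s": 0.0, "a": 0.5, "b": 0.9, "c": 0.6, "d": 0.5}
```

Starting from s, node a activates in the first round, b in the second (once both of its in-neighbors are active), and c never activates because only half of its incoming weight arrives, which does not exceed 0.6.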
The Stimulation Index s(et ) is defined as the sum of the directly stimulated score
sd(et ) and indirectly stimulated score si(et ). By constructing an Edge-Relation
(ER) graph EG that represents the relationship between the edges and dividing
s(et ) into sd(et ) and si(et ), the score can be obtained efficiently.
Figure 2(a) shows the information diffusion network G(V, E). As mentioned
above, an edge may occur by inspiration from older edges, e.g., e2 and e3 were
inspired by e1 . By taking this assumption into consideration, we construct an
ER graph as shown in Fig. 2(b). In Fig. 2, the stimulation of e1 may affect many
subsequent edges, but it directly affects only e2 and e3 ; thus, in the ER graph,
Fig. 2. A toy example: calculation of Stimulation Index for each edge. In this exam-
ple, values colored as red are the directly stimulated scores, green are the indirectly
stimulated scores, and black are stimulation indices.
14 return dictionary
ture, like the Brandes algorithm for betweenness centrality [2]. In Algorithm 1,
the processes from line 10 to line 14 indicate the calculation of si(et ).
4 Experimental Settings
In order to evaluate the proposed method, we generated an artificial network that
imitates a follow network and simulated an information diffusion process over the
network. First, we generated an undirected network with a community structure
using the LFR (Lancichinetti-Fortunato-Radicchi) benchmark graph. We next
added a direction for each edge based on random walks. We then simulated
information diffusion using an extended LT model to allow reactivation in a
Susceptible-Infected-Susceptible manner.
in the case of infectious diseases, so we employed the SIS model for node state
transitions. In the SIS model, an infected (active) node returns to a susceptible
(inactive) state again after sending information. In our experiments, we raised
each node's LT-model threshold every time the node was reactivated; i.e., the
more times a node enters the susceptible state, the higher the threshold for the
next activation, which corresponds to “boredom with the topic.” Specifically, we
denote the initial threshold of node v by θv and the reactivation coefficient by α.
Then, we set the
value of the threshold in the n-th reactivation as (n − 1)θv + αn θv .
In our experiments, we simulated a biased information diffusion in which
information transmission only occurs in a limited range of the network, such as
an intra-community; that is, for certain information, a certain range of nodes
actively transmits the information and other node groups do not transmit much
or do not transmit at all. To generate biased information diffusion, the threshold
θv is multiplied by the bias coefficient βv . Let Vβ be the set of nodes that actively
spread information and set the bias coefficient βv for v ∈ Vβ as follows:
\[
\beta_v =
\begin{cases}
\dfrac{|V_\beta|}{|V|} & v \in V_\beta, \\[6pt]
\dfrac{|V| + |V_\beta|}{|V|} & v \notin V_\beta.
\end{cases}
\]
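As a small numerical illustration of the bias coefficient, the sketch below assumes the piecewise definition reads |Vβ|/|V| for nodes in Vβ and (|V| + |Vβ|)/|V| for all other nodes; this reading of the typeset formula is an assumption, and the function name is hypothetical. Under it, nodes in Vβ have their thresholds scaled down (coefficient below 1) and the remaining nodes have them scaled up (coefficient above 1).

```python
def bias_coefficient(v, nodes, active_group):
    """Bias coefficient beta_v under the assumed piecewise definition.

    nodes        : collection of all nodes V
    active_group : set of nodes V_beta that actively spread the information
    """
    ratio = len(active_group) / len(nodes)
    # (|V| + |V_beta|) / |V| equals 1 + |V_beta| / |V|.
    return ratio if v in active_group else 1 + ratio


# Toy setting: 8 nodes, of which 2 actively spread the information.
all_nodes = list(range(8))
spreaders = {0, 1}
biased_thresholds = {
    v: 0.4 * bias_coefficient(v, all_nodes, spreaders) for v in all_nodes
}
```

In the toy setting the coefficient is 0.25 for the two spreaders and 1.25 for everyone else, so a base threshold of 0.4 becomes 0.1 inside the group and 0.5 outside it.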
5 Experimental Evaluations
5.1 Does Removing the Edge with a High Stimulation Index Inhibit
Information Diffusion?
We evaluated whether the Stimulation Index can detect important edges in infor-
mation diffusion using the artificial sequence generated in Sect. 4.
In our evaluation method, we remove edges from the follow network G and
extract the subgraph G′. For this subgraph, we run an information diffusion
simulation with the same settings. Because edges have been removed, the number of
activations σas(G′) and the number of activated nodes σar(G′) are lower than
before the removal. The more σas(G′) and σar(G′) decrease, the more the edge
deletions inhibit information diffusion; i.e., we consider the deleted edges to be
important for information diffusion. We confirmed that σas(G′) and σar(G′) decrease
more when edges with a high Stimulation Index are removed, showing that the
proposed index is effective.
However, the Stimulation Index is a measure of dynamic edges, while the
follow network is a static structure. Thus, edges connecting the same nodes
may appear at different times tx and ty , such as ex = (u, v) and ey = (u, v).
Then, we calculate the Stimulation Index for each static edge. We represent
the dynamic graph for the mth simulation from any node v on G as Gv,m =
(Vv,m , Ev,m ) and the dynamic edge from node u to v that appeared at time t
as et = (u, v, t). We then define the Stimulation Index si((u, v)) of a static edge
(u, v) ∈ E as follows:
the ranking of edge betweenness centrality and edge-degree centrality from the
follow network G. We chose these indices because they are all representative
centrality indicators and because we consider them to have an essential role in
terms of information diffusion.
Figure 4 shows the number of activations σas(G′) and activated nodes σar(G′)
when edges are removed in the ranking order of each index. The horizontal axis
corresponds to the edge deletion rate i, the vertical axes to the metrics (σas(G′)
and σar(G′)), and the plot lines to the three edge indices. For example, the edge
deletion rate i = 0.1 shows each metric value when removing the top 10% of
edges in the rankings. Naturally, as the deletion rate increases, information diffusion
is inhibited, so the number of activations σas(G′) and activated nodes σar(G′)
decreases. Note that the edge deletion rate i = 0.0 means that no edges are
removed, so we used the same simulation results for the three rankings; therefore,
each metric has the same value. In Sect. 4, we simulated two types of thresholds
θv for information diffusion in the SIS-LT model, one with a uniform distribution
and the other with a biased distribution.
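The edge-removal step of this evaluation can be sketched as follows. The code only illustrates how the top fraction i of edges under a given ranking might be deleted before rerunning a simulation; the function name, the score dictionary and the use of rounding to turn the rate into an edge count are assumptions.

```python
def remove_top_edges(edges, score, rate):
    """Return the edges that remain after deleting the top `rate` fraction.

    edges : list of edges (u, v) of the follow network
    score : dict mapping each edge to its index value (higher = more important)
    rate  : edge deletion rate i in [0, 1]
    """
    ranked = sorted(edges, key=lambda e: score[e], reverse=True)
    k = round(rate * len(ranked))  # number of edges to delete (assumed rounding)
    removed = set(ranked[:k])
    return [e for e in edges if e not in removed]


# Toy ranking over four edges, for illustration only.
toy_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
toy_score = {("a", "b"): 4.0, ("b", "c"): 1.0, ("c", "d"): 3.0, ("d", "a"): 2.0}
```

With a deletion rate of 0.5 the two highest-scoring edges, (a, b) and (c, d), are removed; with a rate of 0.0 the edge list is returned unchanged, which corresponds to the shared baseline point of all rankings.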
Figure 4(c) and Fig. 4(a) show the results of the uniform-distribution simulations.
Although the Stimulation Index yielded smaller values than the edge-degree
centrality, it did not differ much from the edge betweenness centrality. This result
means that, when information reaches the overall network equally, the importance
of an edge for information diffusion does not differ from the static-network case.
Next, we focused on the results of the biased simulations (Fig. 4(b) and Fig. 4(d)).
Compared with the edge-degree centrality, the Stimulation Index reduces the number
of activations σas(G′) and activated nodes σar(G′) already at low edge deletion
rates i. Each metric (σas(G′) and σar(G′)) is also lower than for the edge
betweenness centrality, although the difference is somewhat smaller.
The difference is especially noticeable in the case of i ≤ 0.05, which shows that
the important edges are ranked higher in the biased information diffusion. In
other words, the proposed method works as expected.
s((u, v)) and the number of activated nodes σar (v; G ). There is no correlation
in any of the experimental settings and metrics. Figure 5 shows the relationship
between the Stimulation Index ranking of each edge and the number of activated
nodes of the target node. Due to space limitations, we omit the results for the
number of activations. The horizontal and vertical axes show the Stimulation Index
ranking and the number of activated nodes, respectively. Each plot point
represents an edge or target node. We find some edges with a high ranking and a
small number of activated nodes in both cases. Accordingly, the experimental
results in Sect. 5.1 are valid.
Table 1. Pearson’s correlation coefficients for Stimulation Index and the number of
activations, and Stimulation Index and the number of activated nodes
Fig. 4. The number of activations σas (G ) and activated nodes σar (G ) when edges are
removed. Panels include (c) activated nodes when uniform and (d) activated nodes
when biased.
6 Discussion
Addressing the edges with a high Stimulation Index and a low number of
expected activations revealed in Sect. 5, Fig. 6 shows the network’s visualiza-
tion results, reflecting the number of activations in the color of the edges. The
higher the number of activations, the darker the red. To focus on the edges with
a high Stimulation Index, we extracted the top 30% edges in the ranking and
colored the rest of the edges gray.
First, edges that connect communities have a high number of activations
in both uniform and biased simulations. Figure 6(b) shows that edges in the
community with a low threshold of information diffusion also have a high number
of activations. On the other hand, many edges with a low number of activations
often connect nodes in the same community to each other. These edges contribute
to the transmission of information in the community. Moreover, these black edges
can only be extracted by the Stimulation Index. In summary, the proposed index
can extract the edges that induce information transmission in the community,
in addition to the edges or nodes that deliver information to a larger number of
users, which was the conventional method’s focus.
Fig. 6. The network’s visualization results reflecting the number of activations in the
color of the edges. The higher the number of activations, the darker the red.
480 K. Inafuku et al.
7 Conclusion
In this study, we proposed a Stimulation Index, measuring an edge’s spillover
effect in information diffusion on networks. This Stimulation Index quantifies the
amount of subsequent information transmission caused by an information trans-
mission. To verify the proposed method’s effectiveness, we simulated information
diffusion using an artificial follow network. We demonstrated that removing edges
with a high Stimulation Index inhibits information diffusion conspicuously.
Evaluation methods still require improvement. In this study, we adopted the
number of activations and the number of activated nodes to confirm the proposed
index’s effectiveness. However, this is only an indirect evaluation of the effects
of the Stimulation Index. To evaluate an edge’s importance for information
diffusion more accurately, we would like to examine other metrics.
References
1. Bikhchandani, S., Hirshleifer, D., Welch, I.: A theory of fads, fashion, custom, and
cultural change as informational cascades. J. Polit. Econ. 100(5), 992–1026 (1992)
2. Brandes, U.: A faster algorithm for betweenness centrality. J. Math. Sociol. 25,
163–177 (2001)
3. Cheng, J., Adamic, L., Dow, P.A., Kleinberg, J.M., Leskovec, J.: Can cascades be
predicted? In: Proceedings of the 23rd International Conference on World Wide
Web, pp. 925–936. ACM (2014)
4. Goldenberg, J., Libai, B., Muller, E.: Talk of the network: a complex systems look
at the underlying process of word-of-mouth. Mark. Lett. 12(3), 211–223 (2001)
5. Gomez Rodriguez, M., Leskovec, J., Schölkopf, B.: Structure and dynamics of infor-
mation pathways in online media. In: Proceedings of the Sixth ACM International
Conference on Web Search and Data Mining, WSDM 2013, pp. 23–32. Association
for Computing Machinery, New York (2013)
6. Ikeda, K., Sakaki, T., Toriumi, F., Kurihara, S.: Report of findings obtained from
modeling of false rumor diffusion in case of disaster. In: The 31st Annual Conference
of the Japanese Society for Artificial Intelligent JSAI2017, 3P1–NFC–00a–1 (2017)
7. Kempe, D., Kleinberg, J., Tardos, É.: Maximizing the spread of influence through
a social network. In: Proceedings of the Ninth ACM SIGKDD International Con-
ference on Knowledge Discovery and Data Mining, pp. 137–146. ACM (2003)
8. Kim, J., Bae, J., Hastak, M.: Emergency information diffusion on online social
media during storm Cindy in U.S. Int. J. Inf. Manag. 40, 153–165 (2018)
9. Kimura, M., Saito, K., Nakano, R.: Extracting influential nodes for information
diffusion on a social network. In: AAAI, vol. 7, pp. 1371–1376 (2007)
10. Lancichinetti, A., Fortunato, S., Radicchi, F.: Benchmark graphs for testing com-
munity detection algorithms. Phys. Rev. E 78(4), 046110 (2008)
11. Li, M., Wang, X., Gao, K., Zhang, S.: A survey on information diffusion in online
social networks: models and methods. Information (Switzerland) 8 (2017)
12. Murata, T., Koga, H.: Extended methods for influence maximization in dynamic
networks. Comput. Soc. Netw. 5 (2018)
13. Osawa, S., Murata, T.: Selecting seed nodes for influence maximization in dynamic
networks. Stud. Comput. Intell. 597, 91–98 (2015)
Stimulation Index 481
14. Takashi, K., Masashi, T., Naoki, Y.: Detecting information cascades with social
influence from microblogs. Inf. Process. Soc. Jpn. Trans. Database 9(2), 23–33
(2016)
15. Watts, D.J.: A simple model of global cascades on random networks. Proc. Natl.
Acad. Sci. 99(9), 5766–5771 (2002)
16. Watts, D.J., Dodds, P.S.: Influentials, networks, and public opinion formation. J.
Consum. Res. 34(4), 441–458 (2007)
17. Yuya, Y., Kazumi, S., Hiroshi, M., Kouzou, O., Masahiro, K.: Estimating method of
expected influence curve from single diffusion sequence on social networks. IEICE
Trans. Inf. Syst. 94(11), 1899–1908 (2011)
Diffusion Dynamics Prediction
on Networks Using Sub-graph Motif
Distribution
1 Introduction
Finding key patterns in network structures helps to explain the main semantic
differences between networks and their formation processes. Motifs are structural
units that characterise the systemic properties of a network.
Different kinds of networks can be characterised by their motifs and distinguished
in this way. Motifs are therefore often interpreted as functional units in
biological, chemical, and other kinds of networks. They are associated with
functional groups and are used to explore emerging functional patterns in
dynamics [1]. In contrast, functional properties of social and economic groups are
poorly studied, so extracted motifs are hard to interpret or validate. In
addition, interest networks, as a special type of social network, are characterised
by special “motifs” peculiar to this type of network and its associated
dynamics. Functionally, they reflect how people connect to each other in the
context of their goals in interest societies, or how people differ from molecules
at the global scale. This raises a global research question: how to extract the
basic structural patterns that characterise the specificity of a network and
reproduce its functional and dynamic properties.
Extraction of these patterns contributes to estimating dynamics on a whole
network on the basis of its sub-graph properties, which is of great importance
for understanding the dynamics of large networks of different types and for
optimising computational performance. Existing motif extraction methods
allow for the extraction of 6-node motifs, which is not enough to distinguish some
kinds of networks and to reflect their dynamical properties. Extraction of larger
motifs is time-consuming. Therefore, in the current study we compared sampling
techniques for sub-graph extraction and explored how diffusion dynamics on a graph
can be predicted on the basis of its sub-graph properties.
In order to select the most appropriate samples, we compared their motif
structures with those of the initial network. We assume that a sub-graph that is
too small does not capture all motifs, while a sub-graph that is too large may be
equal to the initial network or contain the required structural patterns multiple
times. We therefore chose the sampling technique that returns the smallest
divergence between motif sequences, and then for each network we selected its
smallest sub-graph with a motif sequence similar to that of the initial graph. To
estimate prediction ability, we took the motif sequences of sub-graphs as the input
of a regression model and explored how well these motifs can predict SI diffusion
dynamics on the whole network.
As a data set, VK friendship networks with various topical attributions were
explored. We extracted non-intersecting 4–5-node motifs with the SuperNoder method
and analysed motif significance, which was shown to contrast with the results of
G-tries, which extracts intersecting motifs. After that, we compared a number of
sampling techniques and explored how the obtained sub-graphs differ from the
initial graph in terms of the Kullback-Leibler divergence between motif
distributions. Finally, we selected the minimal sub-graph size showing an
appropriate divergence. For those graphs, the diffusion time was predicted on the
basis of the motif distribution.
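The comparison of motif distributions can be illustrated with a short sketch of the Kullback-Leibler divergence. It rests on assumptions that the text does not fix: motif counts are given as dictionaries, the counts are normalised to probabilities, a small additive constant `eps` avoids division by zero, and the divergence is taken from the sub-graph distribution to the graph distribution. The motif labels and counts are made up.

```python
import math

def kl_divergence(p_counts, q_counts, eps=1e-9):
    """KL(P || Q) for two motif-count dictionaries, with additive smoothing."""
    motifs = sorted(set(p_counts) | set(q_counts))
    p_total = sum(p_counts.get(m, 0) + eps for m in motifs)
    q_total = sum(q_counts.get(m, 0) + eps for m in motifs)
    kl = 0.0
    for m in motifs:
        p = (p_counts.get(m, 0) + eps) / p_total
        q = (q_counts.get(m, 0) + eps) / q_total
        kl += p * math.log(p / q)
    return kl

# Hypothetical motif counts for an initial graph and one of its samples.
graph_counts = {"m1": 60, "m2": 30, "m3": 10}
sample_counts = {"m1": 55, "m2": 35, "m3": 10}

same = kl_divergence(graph_counts, graph_counts)
diff = kl_divergence(sample_counts, graph_counts)
```

A sample whose motif distribution matches the initial graph gives a divergence of zero, and the value grows as the two distributions drift apart.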
2 Literature
Here we explore how motifs are connected with dynamics on networks, and how
implementations of motif extraction algorithms differ.
between a motif structure and its function: the authors explored the dynamical
responses of motifs with a “bi-fan” network structure that differed in node
properties, in biological networks. This can be due to the sizes of the chosen
motifs, a lack of difference between the given networks, and so on. The function of
motifs observed in social networks may be explained by social restrictions such as
Dunbar’s number [6]; at the same time, interaction patterns in interest networks,
which depend on personal or group goals, are not formalised enough for motif
interpretation.
Estimation of synchronisation dynamics in large networks on the basis of
motifs [11] is provided by means of eigenvectors [20] of connection matrices
obtained from an initial motif by the Kronecker product. Lodato et al. [15] explore
synchronisation dynamics for 3- and 4-node motifs and analyse which of them
are correlated with stability states. D’Huys et al. [4] study Kuramoto oscillator
models for three kinds of network motifs with different symmetries and
find numerical solutions for those single motifs. Nevertheless, this result is not
extended to the whole network. Motif synchronisation is also discussed in [33].
The contribution of motifs to dynamics on networks is studied in [21], which shows
how motif abundance affects the structural stability score. The authors show the
significance of combining motifs that have particular structural properties with
their frequency. In this way, the explored networks are divided into groups
according to aggregated properties, which is observed for both 3- and 4-node motifs.
Tan et al. [31] explore diffusion at individual and population scales in relation
to motif structure and try to infer the diffusion network from the motif profile.
Finally, diffusion networks are considered. Sarkar et al. [28] also use motifs to
understand whether they can explain emerging cascades. For this purpose,
edges covered by a percolation algorithm are compared with edges generated by
motifs. Existing studies related to diffusion dynamics thus mostly focus on the
dynamic paths generated, and distinguish structural and process-related
motifs [29]. The majority of methods cover small motifs because of the high
algorithmic complexity of this procedure, and dynamics on whole networks is
restricted to diffusion paths.
In this study we explore to what extent diffusion dynamics on a graph
can be estimated on the basis of its subgraph. For this purpose we explore and
compare motif distributions for the initial graph and its samples, and analyse how
subgraph sizes affect motif-based prediction accuracy. On the one hand, this
extends the question of estimating diffusion dynamics with motifs; on the other,
it helps to understand which subgraph sizes are appropriate for this kind of
prediction, and how we can represent subgraph structural patterns in order to use
small samples for approximating dynamics on large graphs.
3 Network Data
As the main data, a set of friendship graphs was used. Nodes represent users
subscribed to communities in the VK social network, and edges between them reflect
friendship. Neither nodes nor edges have additional attributes. In total we had
418 graphs with interest attribute markup. Topics were marked up by an expert to
provide balance between group sizes, and the graphs show rich variability in
topological properties (a statistical analysis of the topology of the data set can
be found in [32]).
4 Method
4.1 Motif Detection Methods
The extracted motifs were required to contain enough nodes to represent the graph
structural patterns responsible for process spreading, and to split the graph into
disjoint components. Accordingly, motif extraction methods were selected
according to their computational performance and the intersection of edges in
graph partitions.
First, motif approximations were obtained by building a prefix tree (g-trie
method [24]) for each node in a base graph. This method efficiently evaluates
motifs of 5 nodes; nevertheless, the obtained samples do not split the graph into
disjoint structural components.
As a non-intersecting motif extraction method, the SuperNoder algorithm
was explored [3]. It decomposes a network after each iteration by folding motifs
into a node. In this way, self-similar patterns can be extracted, but computational
efficiency decreases. As a result, the distribution of the embedded motifs and of
the remaining nodes is calculated.
Motif significance was evaluated by z-scores [17]. They were combined with
the corresponding frequencies and used as input features in further classification
tasks.
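As a hedged illustration of the z-score computation, the sketch below compares the count of a motif in the observed graph with its counts in an ensemble of randomised graphs, in the spirit of [17]. The function name `motif_z_score`, the use of the population standard deviation, and the numbers are assumptions made for this example only.

```python
import statistics

def motif_z_score(observed, random_counts):
    """z = (observed - mean of random counts) / standard deviation of random counts."""
    mean = statistics.mean(random_counts)
    std = statistics.pstdev(random_counts)
    if std == 0:
        raise ValueError("random ensemble has zero variance; z-score is undefined")
    return (observed - mean) / std

# Hypothetical counts: the motif occurs 30 times in the real graph and
# about 10 times in each of four randomised graphs.
z = motif_z_score(30, [10, 12, 8, 10])
```

A large positive z-score marks a motif as over-represented relative to the random ensemble; combining it with the relative frequency gives the feature pair mentioned above.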
486 A. L. Zaykov et al.
SHAP values. The data are partitioned into training and test sets by
cross-validation.
5 Results
Fig. 1. Impact of motifs and their relative frequencies on diffusion dynamics prediction
power. The motifs with the highest impact are displayed in decreasing order. Points
correspond to graphs; their colour shows the relative frequency of the motif in the graph
Fig. 2. Connection between prediction accuracy and network sizes and density
Fig. 4. Kullback-Leibler divergence between motif distributions for graphs and
sub-graphs of different relative sizes: (a) divergence against sub-graph sizes;
(b) sub-graph size and relative size; (c) averaged divergence against sub-graph size
The results show that the divergence does not significantly affect prediction
quality, which indicates that the considered relative size of samples is
appropriate for estimation. Nevertheless, for KL divergence above 0.18 there is a
clear increase in deviance and mean error. This means that these types of networks
require additional analysis of their structural patterns and the motifs inside them.
References
1. Alexander, R.P., Kim, P.M., Emonet, T., Gerstein, M.B.: Understanding modular-
ity in molecular networks requires dynamics. Sci. Signal. 2(81), pe44–pe44 (2009)
2. Alon, U.: Network motifs: theory and experimental approaches. Nat. Rev. Genet.
8(6), 450–461 (2007)
3. Dessì, D., Cirrone, J., Recupero, D.R., Shasha, D.: SuperNoder: a tool to
discover over-represented modular structures in networks. BMC Bioinform. 19(1),
318 (2018)
4. D’Huys, O., Vicente, R., Erneux, T., Danckaert, J., Fischer, I.: Synchronization
properties of network motifs: influence of coupling delay and symmetry. Chaos:
Interdiscip. J. Nonlinear Sci. 18(3), 037116 (2008)
5. Gjoka, M., Kurant, M., Butts, C.T., Markopoulou, A.: Walking in Facebook: a
case study of unbiased sampling of OSNs. In: 2010 Proceedings IEEE Infocom, pp.
1–9. IEEE (2010)
6. Gonçalves, B., Perra, N., Vespignani, A.: Modeling users’ activity on twitter net-
works: validation of Dunbar’s number. PLoS ONE 6(8), e22656 (2011)
7. Hübler, C., Kriegel, H.P., Borgwardt, K., Ghahramani, Z.: Metropolis algorithms
for representative subgraph sampling. In: 2008 Eighth IEEE International Confer-
ence on Data Mining, pp. 283–292. IEEE (2008)
8. Ingram, P.J., Stumpf, M.P., Stark, J.: Network motifs: structure does not determine
function. BMC Genom. 7(1), 1–12 (2006)
9. Irigoin, F., Triolet, R.: SuperNode partitioning. In: Proceedings of the 15th ACM
SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pp.
319–329 (1988)
10. Kashani, Z.R.M., Ahrabian, H., Elahi, E., Nowzari-Dalini, A., Ansari, E.S., Asadi,
S., Mohammadi, S., Schreiber, F., Masoudi-Nejad, A.: Kavosh: a new algorithm
for finding network motifs. BMC Bioinform. 10(1), 1–12 (2009)
11. Krishnagopal, S., Lehnert, J., Poel, W., Zakharova, A., Schöll, E.: Synchronization
patterns: from network motifs to hierarchical networks. Philos. Trans. R. Soc. A:
Math. Phys. Eng. Sci. 375(2088), 20160216 (2017)
12. Lee, C.H., Xu, X., Eun, D.Y.: Beyond random walk and Metropolis-Hastings sam-
plers: why you should not backtrack for unbiased graph sampling. ACM SIGMET-
RICS Perform. Eval. Rev. 40(1), 319–330 (2012)
13. Leskovec, J., Faloutsos, C.: Sampling from large graphs. In: Proceedings of the
12th ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, pp. 631–636 (2006)
14. Li, Y., Wu, Z., Lin, S., Xie, H., Lv, M., Xu, Y., Lui, J.C.: Walking with perception:
efficient random walk sampling via common neighbor awareness. In: 2019 IEEE
35th International Conference on Data Engineering (ICDE), pp. 962–973. IEEE
(2019)
15. Lodato, I., Boccaletti, S., Latora, V.: Synchronization properties of network motifs.
EPL (Europhys. Lett.) 78(2), 28001 (2007)
16. Maiya, A.S., Berger-Wolf, T.Y.: Sampling community structure. In: Proceedings
of the 19th International Conference on World Wide Web, pp. 701–710 (2010)
17. Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D., Alon, U.: Network
motifs: simple building blocks of complex networks. Science 298(5594), 824–827
(2002)
18. Paredes, P., Ribeiro, P.: Towards a faster network-centric subgraph census. In: Pro-
ceedings of the 2013 IEEE/ACM International Conference on Advances in Social
Networks Analysis and Mining, pp. 264–271 (2013)
19. Paredes, P., Ribeiro, P.: Rand-FaSE: fast approximate subgraph census. Soc. Netw.
Anal. Min. 5(1), 17 (2015)
20. Poel, W., Zakharova, A., Schöll, E.: Partial synchronization and partial amplitude
death in mesoscale network motifs. Phys. Rev. E 91(2), 022915 (2015)
21. Prill, R.J., Iglesias, P.A., Levchenko, A.: Dynamic properties of network motifs
contribute to biological network organization. PLoS Biol. 3(11), e343 (2005)
22. Ribeiro, P., Paredes, P., Silva, M.E., Aparicio, D., Silva, F.: A survey on subgraph
counting: concepts, algorithms and applications to network motifs and graphlets.
arXiv preprint arXiv:1910.13011 (2019)
23. Ribeiro, P., Silva, F.: Efficient subgraph frequency estimation with g-tries. In:
International Workshop on Algorithms in Bioinformatics, pp. 238–249. Springer
(2010)
24. Ribeiro, P., Silva, F.: G-tries: a data structure for storing and finding subgraphs.
Data Min. Knowl. Disc. 28(2), 337–377 (2014)
25. Ribeiro, P., Silva, F., Lopes, L.: A parallel algorithm for counting subgraphs in
complex networks. In: International Joint Conference on Biomedical Engineering
Systems and Technologies, pp. 380–393. Springer (2010)
26. Ribeiro, P., Silva, F., Lopes, L.: Parallel discovery of network motifs. J. Parallel
Distrib. Comput. 72(2), 144–154 (2012)
27. Rozemberczki, B., Kiss, O., Sarkar, R.: Little ball of fur: a python library for
graph sampling. In: Proceedings of the 29th ACM International Conference on
Information and Knowledge Management (CIKM 2020). ACM (2020)
28. Sarkar, S., Guo, R., Shakarian, P.: Using network motifs to characterize temporal
network evolution leading to diffusion inhibition. Soc. Netw. Anal. Min. 9(1), 14
(2019)
29. Schwarze, A.C., Porter, M.A.: Motifs for processes on networks. arXiv preprint
arXiv:2007.07447 (2020)
30. Shahrivari, S., Jalili, S.: Fast parallel all-subgraph enumeration using multicore
machines. Sci. Program. 2015 (2015)
31. Tan, Q., Liu, Y., Liu, J.: Motif-aware diffusion network inference. Int. J. Data Sci.
Anal. 9(4), 375–387 (2020)
32. Vaganov, D.A., Guleva, V.Y., Bochenina, K.O.: Social media group structure and
its goals: building an order. In: International Conference on Complex Networks
and their Applications, pp. 473–483. Springer (2018)
33. Vega, Y.M., Vázquez-Prada, M., Pacheco, A.F.: Fitness for synchronization of
network motifs. Physica A 343, 279–287 (2004)
34. Wernicke, S.: Efficient detection of network motifs. IEEE/ACM Trans. Comput.
Biol. Bioinf. 3(4), 347–359 (2006)
35. Wernicke, S., Rasche, F.: FANMOD: a tool for fast network motif detection. Bioin-
formatics 22(9), 1152–1153 (2006)
Using Distributed Risk Maps
by Consensus as a Complement
to Contact Tracing Apps
1 Introduction
One of the current challenges to stem the spread of COVID-19 is to track people
infected with coronaviruses that can spread the disease. Although technological
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 943, pp. 494–505, 2021.
https://doi.org/10.1007/978-3-030-65347-7_41
Distributed Risk Maps by Consensus 495
citizens and the administration have the same data, promoting transparency.
One relevant limitation is that some critical mass is still necessary.
The rest of the paper is structured as follows. Section 2 explains how citizens
can collaboratively create risk maps using a consensus process with their close
contacts. Section 3 shows the results using La Gomera as an example: one of the
Canary Islands, with 21,550 inhabitants. The Spanish government carried out
a pilot project with its contact tracing application (RadarCOVID) on that
island, so we decided to use the same scenario. Finally, Sect. 4 summarizes the
main findings of this work.
The authors demonstrated that this consensus process converges to the average
of the initial values when $\varepsilon < 1/\max_i d_i$, where $d_i$ is the degree
of node $i$. There is an equivalent matrix formulation.
where I denotes the identity matrix and L is the Laplacian matrix of the
graph G. The matrix P is called the Perron–Frobenius matrix
and governs the collective dynamics of the consensus process.
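As a minimal sketch of the matrix form (not the authors' code), the snippet below assumes the standard Perron matrix of Olfati-Saber et al. [7], P = I − εL, and iterates x(t+1) = P x(t) on a small undirected path graph. The helper names `perron_matrix` and `consensus`, the graph, and the initial values are illustrative assumptions.

```python
def perron_matrix(adj, eps):
    """Build P = I - eps * L for an undirected graph given as an adjacency matrix."""
    n = len(adj)
    degree = [sum(row) for row in adj]
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Laplacian entry: degree on the diagonal, -1 for each edge.
            laplacian_ij = (degree[i] if i == j else 0) - adj[i][j]
            P[i][j] = (1.0 if i == j else 0.0) - eps * laplacian_ij
    return P

def consensus(P, x, steps):
    """Iterate x <- P x for a fixed number of steps."""
    n = len(x)
    for _ in range(steps):
        x = [sum(P[i][j] * x[j] for j in range(n)) for i in range(n)]
    return x

# Path graph 0 - 1 - 2 - 3; the maximum degree is 2, so eps must stay below 0.5.
adj = [[0, 1, 0, 0],
       [1, 0, 1, 0],
       [0, 1, 0, 1],
       [0, 0, 1, 0]]
x0 = [4.0, 0.0, 8.0, 0.0]   # the average of the initial values is 3.0
final = consensus(perron_matrix(adj, eps=0.4), x0, steps=200)
```

Every node ends up holding the average of the initial values, using only products with its own row of P, that is, only its own value and those of its neighbours.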
If each component contains a vector $x_i = (x_i^1, \ldots, x_i^m) \in \mathbb{R}^m$,
the process carries out a consensus over m independent variables. By expanding the
vector with one additional element $y_i \in \mathbb{R}$, we can determine the size of
the network at the same time as follows: $x_i \oplus y_i = (x_i^1, \ldots, x_i^m, y_i)$.
Initially, $y_i = 0\ \forall i$.
Without loss of generality, we can introduce an additional node in the network
whose initial values are $x_0 \oplus y_0 = (\underbrace{0, \ldots, 0}_{m}, 1)$. As the
process converges to the average value, $y_i = 1/|V|$ and, therefore,
$|V| = 1/y_i$ is the size of the network.
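The size estimation can be sketched with the neighbour-based update x_i ← x_i + ε Σ_j (x_j − x_i). This is only an illustration under stated assumptions: the topology (a ring of five nodes plus the extra node 0), the value ε = 0.3, and the function name `consensus_step` are hypothetical, and |V| is taken here to count the extra node as well.

```python
def consensus_step(neighbors, values, eps):
    """One synchronous update x_i <- x_i + eps * sum_j (x_j - x_i)."""
    return {
        i: values[i] + eps * sum(values[j] - values[i] for j in neighbors[i])
        for i in neighbors
    }

# Ring 1-2-3-4-5-1 plus the additional node 0 attached to node 1.
neighbors = {
    0: [1],
    1: [0, 2, 5],
    2: [1, 3],
    3: [2, 4],
    4: [3, 5],
    5: [4, 1],
}

# Only the additional node starts with y = 1; every other node starts with y = 0.
y = {i: 0.0 for i in neighbors}
y[0] = 1.0

# The maximum degree is 3, so eps = 0.3 satisfies eps < 1 / max degree.
for _ in range(2000):
    y = consensus_step(neighbors, y, eps=0.3)

estimated_size = round(1.0 / y[3])   # any node can read the size from its own y
```

Because the sum of all y values stays equal to 1 throughout the iteration, each node converges to 1/|V| and can invert it locally.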
For the consensus to work, the Perron matrix P must be doubly stochastic, that
is, a matrix whose rows and columns each add up to one. However, in directed
networks, we obtain only a row-stochastic one. Inspired by the matrix scaling
algorithm of Dominguez-Garcia and Hadjicostis [5], we define an iterative process
to convert the Perron matrix into a doubly stochastic one. The process begins with
a row-stochastic matrix P, and, in each iteration, the matrix is scaled following
the expression
P (t) = P Δ(t) + [I − Δ(t)] (3)
where P is a local Perron matrix and Δ(t) is updated as Algorithm 1 describes.
The only consideration is that the Perron matrix P is defined using the degree of
each node instead of a common ε value for all the nodes (see line 4). Furthermore,
as P (t) is based on the Perron matrix, we can combine the matrix’s scaling with
the consensus value calculation in the same step (line 12).
The adaptations of the scaling algorithm are the calculation of Δ(t) (lines 3
and 10), the definition of the matrix P as a local variation of the Perron matrix
(line 4), and how π(t) updates (line 8).
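To illustrate why double stochasticity matters, the following sketch compares a row-stochastic matrix with a doubly stochastic one on a hypothetical directed three-node example. It does not implement Algorithm 1 or the scaling of Eq. (3); both matrices and the initial values are made up for the illustration.

```python
def iterate(P, x, steps):
    """Apply x <- P x repeatedly and return the final vector."""
    n = len(x)
    for _ in range(steps):
        x = [sum(P[i][j] * x[j] for j in range(n)) for i in range(n)]
    return x

# Hypothetical directed example: every row sums to 1, but the columns do not.
P_row = [[0.5, 0.5, 0.0],
         [0.0, 0.5, 0.5],
         [0.5, 0.5, 0.0]]

# Rows and columns both sum to 1 (doubly stochastic).
P_ds = [[0.5, 0.5, 0.0],
        [0.0, 0.5, 0.5],
        [0.5, 0.0, 0.5]]

x0 = [0.0, 0.0, 12.0]               # the true average is 4.0
row_result = iterate(P_row, x0, 100)
ds_result = iterate(P_ds, x0, 100)
```

In this example the row-stochastic matrix still reaches agreement, but on a weighted value (3.0) rather than the average, whereas the doubly stochastic matrix agrees on 4.0. This is the bias that a scaling procedure is meant to remove.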
3 Results
The purpose of this section is to validate the algorithm proposed to create risk
maps in a scenario similar to the conditions of the real world. The ideal situa-
tion would be to launch a pilot project in a controlled environment. However,
1 https://coronavirus.comunidad.madrid/.
Fig. 1. (Left) Population distribution in La Gomera. (Right) Sample of 100 paths using
Lévy flights
The movements of people throughout the day are simulated using recurrent
Lévy flights [6]. Each person is assigned a path with 96 points (one every 5 min
for 8 h) that begins and ends at his or her home location (Fig. 1, right). We have
validated the model by comparing the movements with the data available in the
mobility study based on mobile phone data carried out by the Spanish National
Statistics Institute (INE) in 2019 (see footnote 2). In this study, La Gomera was
divided into two areas. The flows represent commuter trips between both areas.
External sources, such as ferry or plane trips from other islands or the peninsula,
are not included, although they would undoubtedly be relevant. This data, if
available, could be added to the model. The daily mobility between the two areas
was 450 persons leaving the San Sebastián de la Gomera area and 550 entering it
(the inverse from the northern area viewpoint). The simulation with Lévy flights
yields an output flow of 464 persons and an input flow of 668. The movements are
of the same order of magnitude, so we assume that they are coherent.
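A path of this kind can be sketched as follows. This is a loose illustration rather than the authors' mobility model: the heavy-tailed step length drawn with `random.paretovariate`, the parameters `alpha`, `scale`, and `max_step` (in meters), and the choice to retrace the outbound half of the path so that the person returns home are all assumptions.

```python
import math
import random

def levy_path(home, n_points=96, alpha=1.6, scale=50.0, max_step=2000.0, rng=None):
    """Return n_points positions that start and end at `home`.

    The first half is a random walk with heavy-tailed (Pareto) step lengths;
    the second half retraces it in reverse so that the path is recurrent.
    """
    if n_points % 2:
        raise ValueError("n_points must be even in this sketch")
    rng = rng or random.Random()
    x, y = home
    outbound = [(x, y)]
    while len(outbound) < n_points // 2:
        step = min(scale * rng.paretovariate(alpha), max_step)  # truncated step
        angle = rng.uniform(0.0, 2.0 * math.pi)
        x += step * math.cos(angle)
        y += step * math.sin(angle)
        outbound.append((x, y))
    return outbound + outbound[::-1]

home = (0.0, 0.0)
path = levy_path(home, rng=random.Random(42))   # 96 samples, one per 5 min
```

Counting how many simulated paths cross between the two areas would give flows comparable to the INE figures quoted above.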
2 https://www.ine.es/en/experimental/movilidad/experimental_em_en.htm.
500 M. Rebollo et al.
Fig. 2. (Left) The population moves alternately between the home and the working
location. Carriers can infect other people in both places. The cycle consists of a sequence
movement → infection → movement → infection. (Right) Evolution of the risk map
To simulate close contacts, we use the same criteria as the contact tracing
app: a close contact occurs when two persons are at most 5 meters apart (a
2-meter threshold only obtains a 78% accuracy) for 15 min. The result is a daily
risky contact matrix of dimension 21,550 × 21,550.
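The close-contact criterion can be sketched on two sampled paths. The sketch assumes that positions are in meters, that both paths are sampled at the same 5-minute instants, and that each sample stands for a 5-minute slot, so three consecutive samples within 5 m count as 15 min; the function name `is_close_contact` and the toy paths are hypothetical.

```python
import math

def is_close_contact(path_a, path_b, max_dist=5.0, min_samples=3):
    """True if the two paths stay within max_dist for min_samples consecutive samples."""
    run = 0
    for (xa, ya), (xb, yb) in zip(path_a, path_b):
        if math.hypot(xa - xb, ya - yb) <= max_dist:
            run += 1
            if run >= min_samples:
                return True
        else:
            run = 0   # the contact was interrupted, start counting again
    return False

person_a = [(0.0, 0.0)] * 5
person_b = [(3.0, 0.0), (3.0, 0.0), (3.0, 0.0), (100.0, 0.0), (100.0, 0.0)]
person_c = [(3.0, 0.0), (100.0, 0.0), (3.0, 0.0), (100.0, 0.0), (3.0, 0.0)]

contact_ab = is_close_contact(person_a, person_b)   # three consecutive close samples
contact_ac = is_close_contact(person_a, person_c)   # close, but never for long enough
```

Evaluating this predicate for every pair of persons would fill the daily contact matrix; a real implementation would need spatial indexing to avoid comparing all pairs.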
Finally, to simulate the spread of COVID-19, we use an SEIR model. Its
parameters follow findings from the literature that has analyzed the propagation
of COVID-19 [17]. In particular, the incubation time is 7 days, so β = 1/7; the
probability of infection is σ = 0.1; and the recovery time is 15 days, so γ = 1/15.
The purpose of the model is not to predict the behavior of the disease precisely.
Rather, the model provides the consensus process with different scenarios to check
the accuracy of the risk maps.
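One daily update of such a model can be sketched as a stochastic, node-level SEIR step. It keeps the symbols as used above (σ is the infection probability, β = 1/7 the rate of leaving the exposed state, γ = 1/15 the recovery rate), but the update rule itself, with independent infection attempts per infectious contact, is an assumption and not necessarily the authors' implementation.

```python
import random

def seir_step(state, contacts, sigma=0.1, beta=1 / 7, gamma=1 / 15, rng=None):
    """Advance every node by one day and return the new state dictionary."""
    rng = rng or random.Random()
    new_state = dict(state)
    for node, s in state.items():
        if s == "S":
            # Number of infectious close contacts of this susceptible node.
            k = sum(1 for other in contacts.get(node, ()) if state[other] == "I")
            if k and rng.random() < 1.0 - (1.0 - sigma) ** k:
                new_state[node] = "E"
        elif s == "E" and rng.random() < beta:
            new_state[node] = "I"
        elif s == "I" and rng.random() < gamma:
            new_state[node] = "R"
    return new_state

# Toy example with probabilities set to 1 so that the outcome is deterministic.
state = {"a": "I", "b": "S", "c": "S"}
contacts = {"a": ["b"], "b": ["a"], "c": []}
next_state = seir_step(state, contacts, sigma=1.0, beta=1.0, gamma=1.0)
```

With the default parameters the same function produces random trajectories, which is enough to generate different scenarios for testing the risk maps.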
People start at their home location. They move throughout the day, interacting
with other persons. Nodes update their state according to the epidemic
model and the contact matrix, and they go back to their home locations. A
new infection stage is performed at home since, for COVID-19, some studies
show the family to be a strong transmission source. Once the update is
completed, a new cycle begins (Fig. 2).
Fig. 3. Cumulated degree distribution of the networks (Left) Complete contact network
with a power-law with α = −1.7 (Right) Random selection of contacts. It follows a
Weibull distribution with parameters α = 13.4, γ = 0.48.
subset of the potential links. A reasonable limit is 15 contacts, five of each type
(family, colleagues, and friends). If each person randomly chooses 9 ± 2 contacts,
the resulting network has 26,500 contacts and the number of connections varies
from 4 to 16 (Fig. 3, right).
Therefore, we have obtained a network with 3,000 nodes and mean degree 10,
varying from 4 to 16, generated from social network profiles. Over this scenario,
the inhabitants can determine their town’s risk map at the end of the day. We
assume that no additional measures, such as social distancing or limitations of
movements, are taken.
As an example, let us consider the situation after 30 days. People have moved
during this period as described in Sect. 3.1, and the contagion has evolved
following the SEIR model. After 30 days, the COVID-19 situation in La
Gomera appears in Fig. 4, with a clear outbreak in San Sebastián de La Gomera
(in red). The 3,000 persons determine their risk index (some of them are already
infected) and share the value with their direct contacts, following the consensus
process from C4C.
Each node has a vector of 14 components, one for each census district:
$x_i \oplus y_i = (x_i^1, \ldots, x_i^{14}, 0)$. Let $ri_i$ be the risk index of $i$
and $cd_i = k$ the census district $i$ lives in. Then $x_i^k = ri_i$ and the rest
are $x_i^l = 0$, $l \neq k$. Each node executes the process
detailed in Algorithm 2. The evolution appears in Fig. 4. The real risk values
and the values calculated by consensus in one of the 3,000 participants for each
census district are
Area           1     2     3     4     5     6     7     8     9    10    11    12    13    14
Real risk    9.9  10.1   9.9   9.9  12.2  11.3  50.3  55.6  53.1  15.4  21.8   9.9  10.9  10.1
Ri          10.0   8.1  12.3   8.6  13.2   9.5  49.2  48.3  40.1  12.5  22.0   9.6  11.4   8.8
Fig. 4. Evolution of the consensus in the creation of a risk map. (Left) Convergence of
the process. (Right) Evolution of the map calculated for one random node. (Bottom)
Map created by consensus versus real risk map
Let us assume that, depending on the risk map readings, people outside the risky
areas do not move into them, and people who live in high-risk areas do not go out.
Having a risk map available and avoiding high-risk areas does not significantly
reduce the total number of infections. It reduces the
peak but keeps the propagation active for more days (see Fig. 5).
Fig. 5. Evolution of infected with and without considering the risk map. (Left) Evo-
lution (Right) Cumulated
Fig. 6. Evolution of infected in three scenarios: no isolation (blue), total isolation (red)
and isolation for traced users (yellow), from 20% to 80% of users. (top) total infected
by day (bottom) cumulated infections
4 Conclusions
Technology can be an essential ally in controlling the transmission of COVID-19.
Nevertheless, concerns about privacy and the possible use of the data after the
pandemic have made it difficult to deploy technological solutions.
This work proposes an alternative in which users create risk maps
collaboratively. This approach executes a consensus process that uses local
information and data from direct neighbors to calculate the value of a shared
function. In our case, the values are the risk indexes of the different districts
that form the town. Close contacts (family, colleagues, and friends), whom we would
warn about being infected, define the network relationships. The data exchanged is
an aggregation, and it is not possible to reidentify the personal information. At
the end of the process, all the participants obtain the same copy of the complete
risk map. However, constraining movements using the information in risk maps
reduces the peak and smooths the evolution of the infection.
The use of contact tracing apps needs a considerable proportion of active
users to work. Nevertheless, combining the information from risk maps, to avoid
areas with a high index of infections, with exposure alerts obtains good
results even with relatively low penetration of the apps.
References
1. Apple, Google: Exposure notifications: using technology to help public health
authorities fight covid’19, 01 June 2020.
https://www.google.com/covid19/exposurenotifications/
2. Apple, Google: Privacy-preserving contact tracing, 01 June 2020.
https://www.apple.com/covid19/contacttracing
3. Castelluccia, C., Bielova, N., Boutet, A., Cunche, M., Lauradoux, C., Le Métayer,
D., Roca, V.: ROBERT: ROBust and privacy-presERving proximity Tracing, May
2020. https://hal.inria.fr/hal-02611265/file/ROBERT-v1.1.pdf
4. Chan, J., Foster, D., Gollakota, S., Horvitz, E., Jaeger, J., Kakade, S., Kohno,
T., Langford, J., Larson, J., Sharma, P., Singanamalla, S., Sunshine, J., Tessaro,
S.: Pact: privacy sensitive protocols and mechanisms for mobile contact tracing
(2020). arXiv:2004.03544 [cs.CR]
5. Dominguez-Garcia, A.D., Hadjicostis, C.N.: Distributed matrix scaling and appli-
cation to average consensus in directed graphs. IEEE Trans. Autom. Control 58(3),
667–681 (2013)
6. González, M., Hidalgo, C., Barabási, A.: Understanding individual human mobility
patterns. Nature 453, 779–782 (2008)
7. Olfati-Saber, R., Fax, J.A., Murray, R.M.: Consensus and cooperation in networked
multi-agent systems. Proc. IEEE 95(1), 215–233 (2007)
8. O’Neill, P.H., Ryan-Mosley, T., Johnson, B.: A flood of coronavirus apps are
tracking us. Now it’s time to keep track of them, 07 May 2020. https://www.tec
hnologyreview.com/2020/05/07/1000961/launching-mittr-covid-tracing-tracker/
9. Pan-European privacy-preserving proximity tracing, 01 June 2020. https://www.
pepp-pt.org/
10. Rivest, R.L., Weitzner, D.J., Ivers, L.C., Soibelman, I., Zissman, M.A.: Pact: pri-
vate automated contact tracing, 01 June 2020. pact.mit.edu
11. Hinch, R., Probert, W., Nurtay, A., Kendall, M., Wymant, C., Hall, M., Lyth-
goe, K., Cruz, A.B., Zhao, L., Stewart, A., Ferretti, L., Parker, M., Meroueh, A.,
Mathias, B., Stevenson, S., Montero, D., Warren, J., Mather, N.K., Finkelstein,
A., Abeler-Dörner, L., Bonsall, D., Fraser, C.: Effective Configurations of a Digital
Contact Tracing App: A report to NHSX. University of Oxford, 16 April 2020
12. Simko, L., Calo, R., Roesner, F., Kohno, T.: Covid-19 contact tracing and privacy:
studying opinion and preferences (2020). arXiv:2005.06056 [cs.CR]
13. The European Data Protection Board: Guidelines 04/2020 on the use of loca-
tion data and contact tracing tools in the context of the COVID-19 outbreak, 01
June 2020. https://edpb.europa.eu/our-work-tools/our-documents/usmernenia/
guidelines-042020-use-location-data-and-contact-tracing en
14. Troncoso, C., Payer, M., Hubaux, J.P., Salath, M., Larus, J., Bugnion, E., Lueks,
W., Stadler, T., Pyrgelis, A., Antonioli, D., Barman, L., Chatel, S., Paterson, K.,
apkun, S., Basin, D., Beutel, J., Jackson, D., Preneel, B., Smart, N., Singelee, D.,
Abidin, A., Guerses, S., Veale, M., Cremers, C., Binns, R., Cattuto, C.: Decentral-
ized privacy-preserving proximity tracing, April 2020. https://github.com/DP-3T/
documents/blob/master/DP3T
15. Vinuesa, R., Theodorou, A., Battaglini, M., Dignum, V.: A socio-technical frame-
work for digital contact tracing. Results Eng. 8, 100163 (2020)
16. WeTrace: Privacy-preserving bluetooth covid-19 contract tracing application, 01
June 2020. https://wetrace.ch/
17. Yang, Z., Zeng, Z., Wang, K., Wong, S.S., Liang, W., Zanin, M., Liu, P., Cao, X.,
Gao, Z., Mai, Z., Liang, J., Liu, X., Li, S., Li, Y., Ye, F., Guan, W., Yang, Y., Li,
F., Luo, S., Xie, Y., Liu, B., Wang, Z., Zhang, S., Wang, Y., Zhong, N., He, J.:
Modified SEIR and AI prediction of the epidemics trend of covid-19 in china under
public health interventions. J. Thorac. Dis. 12(3) (2020)
Dynamics on/of Networks
Distributed Algorithm for Link Removal
in Directed Networks
Azwirman Gusrialdi
1 Introduction
In order to address this issue, several works have focused on developing
strategies to approximate and compute a sub-optimal solution to this problem
for both unidirectional and bidirectional networks, see for example
[1,4,12,17]. An effective and scalable algorithm based on eigenvalue
sensitivity analysis is presented in [4] to minimize the dominant eigenvalue of
the adjacency matrix by removing some links from a directed network.
Specifically, an optimization problem involving the left and right eigenvectors
corresponding to the dominant eigenvalue is formulated to compute the
sub-optimal solution. Note that the previously mentioned works assume that the
global network structure is available and known to the designer. However, in
practice the global network structure may not be available or may be very hard
to obtain in a centralized manner due to geographical constraints or privacy
concerns [10,11]. In addition to requiring information on the global network
structure, the previously mentioned works do not take into account the (strong)
connectivity of the network after the link removal. In some cases, it is
desirable to preserve the (strong) connectivity of a network, for example so
that important information can still be passed to all the users/nodes in the
network or goods can still be transported between cities. Note that in [8],
distributed algorithms which do not require knowledge of the global network
structure are proposed to remove a fraction of links from a network while
guaranteeing the connectivity of the resulting network. However, the result is
limited to the case of bidirectional (undirected) networks.
The contribution of this paper is the development of distributed algorithms
to compute a sub-optimal solution to the link removal problem in a directed
network while preserving strong connectivity of the resulting network.
Specifically, the matrix perturbation approach proposed in the literature is
combined with novel distributed algorithms that estimate both the left and
right dominant eigenvectors of the adjacency matrix in order to decide the
candidate link to be removed. Furthermore, a distributed verification algorithm
is proposed to check whether a strongly connected directed network remains
strongly connected after removing a fraction of links. This paper also
generalizes the results presented in [8]. The proposed distributed algorithms
can also readily be applied to the link addition problem whose goal is to
maximize the dominant eigenvalue of the adjacency matrix.
The organization of this paper is as follows: preliminaries followed by the
problem formulation are presented in Sect. 2. The proposed distributed algo-
rithms for link removal in directed networks are described in Sect. 3. A numerical
example to demonstrate the proposed distributed strategy is provided in Sect. 4.
Finally, Sect. 5 concludes the paper.
2 Problem Statement
We first provide a brief overview of graph theory and the well-known results
used to develop the distributed link removal strategy, followed by the problem
formulation.
the elements of vector a ∈ Rn on its diagonal. For a given set V, |V| denotes the
number of the elements in this set.
Let $G = (V, E)$ be a directed graph (digraph) with a set of nodes
$V = \{1, 2, \dots, n\}$ and a set of edges $E \subset V \times V$. An edge
$(j, i) \in E$ denotes that node $i$ can obtain information from node $j$. The
set of in-neighbors of node $i$ is denoted by
$N^{\mathrm{in}}_{G,i} = \{j \mid (j, i) \in E\}$. Similarly, the set of
out-neighbors of node $i$ is denoted by
$N^{\mathrm{out}}_{G,i} = \{j \mid (i, j) \in E\}$. The directed graph $G$ is
strongly connected if every node can be reached from any other node by
following a set of directed edges. For a matrix $C \in \mathbb{R}^{n \times n}$,
let $[C]_{i*}$ and $[C]_{*i}$ represent vectors whose elements are equal to the
$i$-th row and column of $C$, respectively. Let us denote the dominant (i.e.,
largest in modulus) eigenvalue of matrix $C$ as $\lambda(C)$. The adjacency
matrix associated with digraph $G$, denoted by
$A(G) \in \mathbb{R}^{n \times n}$, is defined as
$$[A(G)]_{ij} = \begin{cases} 1, & \text{if } i \neq j \text{ and } (j, i) \in E, \\ 0, & \text{otherwise,} \end{cases}$$
where $[A]_{ij}$ denotes the element in the $i$-th row and $j$-th column of
matrix $A$. The proposed algorithm can also be applied to an adjacency matrix
whose rows are defined using the out-neighbors of node $i$. Matrix
$C \in \mathbb{R}^{n \times n}$ is nonnegative (i.e., $C \ge 0$) if all its
elements are nonnegative. A nonnegative matrix $C$ is irreducible if and only
if $(I_n + C)^{n-1} > 0$ where $I_n = \mathrm{diag}(1_n)$. In addition, matrix
$C$ is primitive if it is irreducible and has at least one positive diagonal
element.
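The irreducibility test above can be tried on a small example. The following is a minimal sketch in plain Python, not part of the original paper: the helper names `adjacency` and `is_irreducible` are hypothetical, nodes are numbered from 0 instead of 1, and only the zero pattern of $(I_n + C)^{n-1}$ is tracked so that no large numbers appear.

```python
def adjacency(n, edges):
    """[A]_ij = 1 iff i != j and (j, i) is an edge; nodes are numbered 0..n-1 here."""
    A = [[0] * n for _ in range(n)]
    for j, i in edges:              # edge (j, i): node i obtains information from node j
        if i != j:
            A[i][j] = 1
    return A

def is_irreducible(C):
    """Check (I_n + C)^(n-1) > 0 entrywise, tracking only the zero pattern."""
    n = len(C)
    B = [[1 if (i == j or C[i][j] > 0) else 0 for j in range(n)] for i in range(n)]
    R = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for _ in range(n - 1):
        # Boolean matrix product R * B
        R = [[1 if any(R[i][k] and B[k][j] for k in range(n)) else 0
              for j in range(n)] for i in range(n)]
    return all(all(row) for row in R)

ring = adjacency(4, [(0, 1), (1, 2), (2, 3), (3, 0)])   # directed 4-cycle
path = adjacency(4, [(0, 1), (1, 2), (2, 3)])           # directed path
print(is_irreducible(ring), is_irreducible(path))
```

The directed cycle is strongly connected and passes the test, while the directed path does not, which matches the standard correspondence between irreducibility of the adjacency matrix and strong connectivity of the digraph.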
Finally, we review a max-consensus algorithm. Consider a strongly connected
digraph $G$ with $n$ nodes and let us assign state $x_i(t) \in \mathbb{R}$ to
each node of $G$. If each node executes the following max-consensus algorithm
[13]
$$x_i(t+1) = \max_{j \in N^{\mathrm{in}}_{G,i} \cup \{i\}} x_j(t) \tag{1}$$
3 Main Result
In order to solve Problem 1, we first adopt the strategy based on matrix
perturbation theory presented in [4,8]. Using matrix perturbation theory, for a
graph with a large spectral gap (i.e., difference between the largest and
second largest eigenvalue in magnitude) we can write
$$\lambda(A(G_{m_e})) = \lambda(A(G_0)) - \frac{\nu_0^T \Delta A^- w_0}{\nu_0^T w_0} + O(\|\Delta A^-\|^2) \tag{2}$$
where $\Delta A^-$ denotes the adjacency matrix corresponding to the graph
whose links are given by $\Delta E^-$. Moreover, $\nu_0$, $w_0$ are the
dominant left and right eigenvectors corresponding to $\lambda(A(G_0))$,
respectively. Due to the large spectral gap, we can neglect the higher order
term in (2) and thus minimizing $\lambda(A(G_{m_e}))$ is equivalent to
maximizing $\nu_0^T \Delta A^- w_0 / (\nu_0^T w_0)$. Defining the labeling
$\ell \in \{1, \dots, |E_0|\}$ on the edges of graph $G_0$, matrix
$\Delta A^-$ can be written as
$\Delta A^- = \sum_{\ell=1}^{|E_0|} y_\ell A_\ell$
where $A_\ell$ is a matrix with all zero entries except for the $ij$-th entry
corresponding to the edge of label $\ell \sim (j, i)$, which is equal to 1.
Furthermore, $y_\ell \in \{0, 1\}$ where $y_\ell = 1$ means that the edge
$\ell$ in $E_0$ is removed. Problem 1 can then
where $\nu_{0,i}$ and $w_{0,i}$ denote the $i$-th elements of the left
eigenvector $\nu_0$ and the right eigenvector $w_0$ associated with
$\lambda(A(G_0))$, respectively. In addition, the vector
$y = [y_1, \dots, y_{|E_0|}]^T$. The analysis of the optimality gap between the
solutions obtained by solving (P2) and (P1) is discussed in [4]. Note that, in
order to solve (P2), $\nu_0$ and $w_0$ cannot be directly computed, and whether
graph $G_{m_e}$ is strongly connected cannot be directly checked, since the
global network topology $G_0$ is not available.
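To make the first-order term in (2) concrete, the sketch below works through a centralized toy example. It is an illustration only, not the distributed procedure of the paper. It uses the fact that removing the single edge $(j, i)$ contributes $\nu_{0,i} w_{0,j} / (\nu_0^T w_0)$ to the linear term, since the corresponding matrix has a single nonzero entry at position $ij$. The 4-node digraph, the helper names and the use of a plain power iteration on $I_n + A$ are assumptions made for this example.

```python
def mat_vec(M, x):
    return [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(M))]

def dominant(M, iters=500):
    """Power iteration on I + M; returns (dominant eigenvalue of M, eigenvector)."""
    n = len(M)
    Q = [[M[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    x, s = [1.0] * n, 1.0
    for _ in range(iters):
        y = mat_vec(Q, x)
        s = max(y)                      # infinity norm; entries stay nonnegative
        x = [v / s for v in y]
    return s - 1.0, x

def adjacency(n, edges):
    A = [[0.0] * n for _ in range(n)]
    for j, i in edges:                  # edge (j, i) sets entry [A]_ij
        A[i][j] = 1.0
    return A

def edge_scores(n, edges):
    """First-order eigenvalue decrease predicted for removing each single edge."""
    A = adjacency(n, edges)
    lam, w = dominant(A)                                   # right eigenvector
    _, nu = dominant([list(col) for col in zip(*A)])       # left eigenvector
    scale = sum(a * b for a, b in zip(nu, w))
    return lam, {(j, i): nu[i] * w[j] / scale for j, i in edges}

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2), (2, 0)]   # hypothetical digraph
lam, scores = edge_scores(4, edges)
best = max(scores, key=scores.get)                         # largest predicted decrease
predicted = lam - scores[best]
actual, _ = dominant(adjacency(4, [e for e in edges if e != best]))
print(best, round(lam, 4), round(predicted, 4), round(actual, 4))
```

On such a small graph the spectral gap is modest, so the predicted and recomputed eigenvalues differ noticeably; the point of the sketch is only the ranking of candidate edges by $\nu_{0,i} w_{0,j}$. Summing the scores over all edges recovers $\lambda(A(G_0))$ exactly, which is a convenient sanity check.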
Next, we present distributed algorithms performed at each node, given that
the nodes have local computational capability, to solve (P2) under constraint 1.
To this end, we first define a primitive matrix Q0 given by
Q0 = In + A(G0 ). (3)
Since matrix Q0 is primitive, it is known that there exists a real dominant and
simple eigenvalue of Q0 , denoted by λ(Q0 ) satisfying λ(Q0 ) > |μ| for all other
eigenvalues μ of Q0 [2]. Hence, we have the following relationship: λ(Q0 ) =
1 + λ(A(G0 )). It can also be observed that both matrices Q0 and A(G0 ) share
the same set of left and right eigenvectors (i.e., ν0 , w0 ) which are both positive,
up to rescaling [2].
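The relationship between the two spectra follows in one line from definition (3). As a short worked derivation, using only the symbols already introduced:

```latex
Q_0 w_0 = (I_n + A(G_0))\, w_0 = w_0 + \lambda(A(G_0))\, w_0
        = \bigl(1 + \lambda(A(G_0))\bigr)\, w_0,
\qquad
\nu_0^T Q_0 = \nu_0^T + \lambda(A(G_0))\, \nu_0^T
            = \bigl(1 + \lambda(A(G_0))\bigr)\, \nu_0^T .
```

The same argument applies to every eigenpair of $A(G_0)$, so adding $I_n$ shifts each eigenvalue by one and leaves the eigenvectors unchanged.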
where $\hat{w}_{0,i}(t)$ denotes the local estimate of $w_{0,i}$ at the $t$-th
iteration. Note that since $w_0 > 0$, each node can choose any initial
condition $\hat{w}_{0,i} > 0$. Furthermore, since the graph is strongly
connected, it is guaranteed that under update law (4) the local estimate
$\hat{w}_{0,i}(t)$ will asymptotically converge to $w_{0,i}$ for all nodes $i$.
Note that by using max-consensus algorithm (1) and by setting
$x_i(0) = \sum_{j \in N^{\mathrm{in}}_{G_0,i} \cup \{i\}} [Q_0]_{ij} \hat{w}_{0,j}(t)$,
each node will be able to compute $\|Q_0 \hat{w}_0(t)\|_\infty$ in a
distributed manner. Therefore, update law (4) can be implemented distributively
by each node in the network. The nodes can implement the stopping criterion
$\|\hat{w}_0(t) - \hat{w}_0(t-1)\|_\infty < \epsilon$ for a sufficiently small
pre-defined threshold $\epsilon$ (to guarantee the estimation accuracy), which
can also be checked in a distributed manner using the max-consensus algorithm.
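The interplay between the local update and max-consensus can be simulated in a few lines. The sketch below is an assumption-laden illustration: update law (4) is not reproduced in this excerpt, so it is taken here to be the normalized power iteration $\hat{w}_0(t+1) = Q_0 \hat{w}_0(t) / \|Q_0 \hat{w}_0(t)\|_\infty$, which is consistent with the surrounding description. The function names, the 4-node digraph and the threshold value are hypothetical.

```python
def max_consensus(values, in_nbrs, rounds):
    """Run x_i(t+1) = max over j in N_in(i) and i itself of x_j(t) for `rounds` steps."""
    x = list(values)
    for _ in range(rounds):
        x = [max([x[i]] + [x[j] for j in in_nbrs[i]]) for i in range(len(x))]
    return x

def distributed_power_iteration(in_nbrs, eps=1e-10, max_iter=1000):
    n = len(in_nbrs)
    w = [1.0] * n                                   # any positive initial condition
    for _ in range(max_iter):
        # Local step: row i of Q0 = I_n + A(G0) has ones at i and at its in-neighbors.
        z = [w[i] + sum(w[j] for j in in_nbrs[i]) for i in range(n)]
        norm = max_consensus(z, in_nbrs, n)         # every node obtains ||Q0 w||_inf
        w_new = [z[i] / norm[i] for i in range(n)]
        # Stopping test, also agreed upon through max-consensus.
        diff = max_consensus([abs(a - b) for a, b in zip(w_new, w)], in_nbrs, n)
        w = w_new
        if diff[0] < eps:
            break
    # Local eigenvalue estimate at each node, in the spirit of (6).
    lam = [(w[i] + sum(w[j] for j in in_nbrs[i])) / w[i] for i in range(n)]
    return w, lam

# in_nbrs[i] lists the in-neighbors of node i in a hypothetical 4-node digraph
in_nbrs = [[2, 3], [0], [0, 1], [2]]
w_hat, lam_hat = distributed_power_iteration(in_nbrs)
print([round(v, 4) for v in w_hat], [round(v, 4) for v in lam_hat])
```

Running max-consensus for $n$ rounds is enough here because, in a strongly connected digraph with $n$ nodes, every node is reachable from every other node in at most $n - 1$ hops.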
where $Q_0^T$ denotes the transpose of matrix $Q_0$. Each node can then
estimate $\nu_0$ by solving (5) in a distributed fashion. First, observe that
after estimating $w_{0,i}$ and from $Q_0 w_0 = \lambda(Q_0) w_0$, node $i$ can
estimate $\lambda(Q_0)$ according to
$$\lambda(Q_0) = \frac{[Q_0]_{i*}^T \hat{w}_0}{\hat{w}_{0,i}}. \tag{6}$$
Next, since node $i$ knows $N^{\mathrm{out}}_{G_0,i}$, it can construct the
vector $[Q_0]_{*i}$ or $[Q_0^T]_{i*}$. In addition, after estimating
$\lambda(Q_0)$ from (6), each node then estimates $\nu_0$ by solving
distributively a set of linear equations (5) which can be rewritten as
where $\hat{\nu}_0^i(t)$ denotes the local estimate of $\nu_0$ at node $i$ at
the $t$-th iteration and matrix $P_i$ is defined as
for arbitrary vector $b > 0$, where $|\cdot|$ denotes the Euclidean norm. It is
shown in [18] that under update law (8) whose initial conditions are chosen to
minimize (9), all the nodes' estimates $\hat{\nu}_0^i$ converge exponentially
fast to the solution to (7) that minimizes $\tfrac{1}{2}|\nu_0 - b|^2$. The
settling time of update law (8) can be calculated similarly to the calculation
in [7]. Note that update law (8) utilizes the same communication network $G_0$
as the one utilized to distributively estimate $w_0$. Furthermore, in contrast
to the estimation of $w_0$ presented in the previous subsection, node $i$
obtains an estimate of the full vector $\nu_0$ instead of only the $i$-th
element $\nu_{0,i}$.
Remark 3. In comparison to the distributed algorithm for estimating left and
right eigenvectors corresponding to any irreducible matrix presented in [7],
which requires each node to use $O(n^2)$ memory and to send $n^2$ values to its
neighbors, the proposed distributed algorithm requires only $O(n)$ memory and
sends $n$ values to its neighbors. In addition, applying the distributed
estimation algorithms in [7] will reveal the global network structure to all
nodes, which may violate the privacy of each node. In contrast to [7], the
proposed distributed algorithm respects privacy in terms of the global network
topology.
Proof. To show the necessity ($\Longrightarrow$), since the graph
$G_1 = (V, E_0 \setminus \{(j^*, i^*)\})$ is strongly connected, it is shown in
[13] that under max-consensus protocol (1) all nodes will converge to
$\max_i x_i(0)$, which is equal to 1. Next, we show the sufficiency
($\Longleftarrow$). To do this, note that the graph $G_0$ is strongly
connected. The removal of link $(j^*, i^*)$ thus may result in there being no
direct or indirect path from node $j^*$ to node $i^*$. However, since we have
$x_i(n) = 1$ under update law (1) for all nodes $i$ in the network, there is at
least an indirect path from node $j^*$ to node $i^*$. Hence, the resulting
graph $G_1 = (V, E_0 \setminus \{(j^*, i^*)\})$ remains strongly connected,
which completes the proof.
After each node executes update law (1) for $n$ iterations with initial values
described in Lemma 1, node $i^*$ then checks whether $x_{i^*}(n) = 1$. If
$x_{i^*}(n) = 1$, it needs to notify node $j^*$ that the network remains
strongly connected after the removal of link $(j^*, i^*)$. To this end, each
node is assigned an additional scalar variable $f_i(t)$. If the graph $G_1$ is
strongly connected (resp. not strongly connected), the initial values of $f_i$
are set to $f_{i^*}(0) = 1$ (resp. $f_{i^*}(0) = -1$) and $f_m(0) = 0$ for all
$m \neq i^*$. The nodes then again execute (1) on graph $G_0$ with the
previously described $f_i(0)$ and, after $n$ iterations, node $j^*$ checks
whether $x_{j^*}(n) = 1$ (resp. $x_{j^*}(n) = 0$), from which it knows that the
graph $G_1$ remains strongly connected (resp. will not be strongly connected)
after the removal of the link $(j^*, i^*)$.
The strong connectivity of the resulting graph after removal of multiple links
can then be distributively checked by iteratively applying the result in
Lemma 1.
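The verification step can be sketched as a short simulation. The statement of Lemma 1 is not reproduced in this excerpt, so the initial values used below ($x_{j^*}(0) = 1$ and $0$ for every other node, with update law (1) run on $G_1$) are an assumption inferred from the proof. The function name and the example digraph are hypothetical.

```python
def stays_strongly_connected(n, edges, removed):
    """Check whether removing edge (j*, i*) keeps the digraph strongly connected.

    Edges are pairs (j, i), meaning node i listens to node j.
    G0 = (V, edges) is assumed to be strongly connected.
    """
    j_star, i_star = removed
    remaining = [e for e in edges if e != removed]
    in_nbrs = [[] for _ in range(n)]
    for j, i in remaining:
        in_nbrs[i].append(j)
    x = [1 if v == j_star else 0 for v in range(n)]    # assumed initial values
    for _ in range(n):                                  # n rounds of (1) on G1
        x = [max([x[i]] + [x[j] for j in in_nbrs[i]]) for i in range(n)]
    return x[i_star] == 1                               # test performed by node i*

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2), (2, 0)]   # hypothetical digraph
for e in edges:
    print(e, stays_strongly_connected(4, edges, e))
```

Because $G_0$ is strongly connected, any path that used the removed link can be rerouted as soon as node $i^*$ is still reachable from node $j^*$, which is why a single reachability test suffices. For several links, the same check would be repeated after each tentative removal.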
can then normalize the estimate $\hat{\nu}_0^i$ independently. On the other
hand, since node $i$ only has the estimate of the $i$-th element of the right
eigenvector $w_0$, it needs to normalize $\hat{w}_0$ cooperatively with the
rest of the nodes. To this end, the norm $\|\hat{w}_0\|$ defined as
$\|\hat{w}_0\| = \sqrt{\hat{w}_{0,1}^2 + \cdots + \hat{w}_{0,n}^2}$ can also be
written as
$$\|\hat{w}_0\| = \sqrt{n \, \frac{\hat{w}_{0,1}^2 + \cdots + \hat{w}_{0,n}^2}{n}} = \sqrt{n \, \hat{w}_0^{\mathrm{ave}}}.$$
If the nodes can compute $\hat{w}_0^{\mathrm{ave}}$ distributively and given
that they know $n$, each node can then compute $\|\hat{w}_0\|$. Specifically,
the nodes can compute $\hat{w}_0^{\mathrm{ave}}$ in a distributed manner using
the finite-time average consensus algorithm proposed in the literature, e.g.,
[3], by setting its initial value as $\hat{w}_{0,i}^2$.
4 An Illustrative Example
In this section, we demonstrate the proposed distributed algorithm to compute
a solution to Problem 1. Consider a strongly connected digraph $G_0$ consisting
of eight nodes shown in Fig. 2, where each node may represent, for example, a
city/state or a person. The number of links to be removed $m_e$ is varied
between 1 and 4. We choose a small network so that a comparison with the
centralized brute-force search for the link removal problem, which is NP-hard
in general, becomes possible. The interested reader is referred to the
simulation results in [4] for the performance evaluation of the solution to
optimization problem (P2), without the connectivity constraint, on large real
graphs.
[Figure 3: panels (a) and (b) plot the eigenvector estimates against the time step, together with the true eigenvectors.]
Fig. 3. (a) Estimated right eigenvector $\hat{w}_{0,i}$ corresponding to
$\lambda(Q_0)$ with $Q_0 = A(G_0) + I_n$ by each node (denoted by the markers)
using the power iteration method in (4). The estimates converge in less than 20
time steps; (b) Estimation of the left eigenvector $\nu_0$ corresponding to
$\lambda(Q_0)$ by node 1 (i.e., $\hat{\nu}_0^1$)
We apply Algorithm 1 to find a solution to optimization problem (P2) for each
me . As illustrated in Fig. 3a, the estimation ŵ0,i converges in less than 20 time
steps to the true (unnormalized) right eigenvector w0,i . Next, each node distribu-
tively estimates the left eigenvector ν0 using update law (8). Figure 3b depicts
the left eigenvector estimation of A(G0 ) by node 1. As can be observed, the
local estimation by each node converges to the true (unnormalized) left eigen-
vector ν0 . After estimating the eigenvectors (and normalizing them), the nodes
then compute the candidate edge to be removed and check the strong connectiv-
5 Conclusion
This paper proposes an eigenvalue-sensitivity-based distributed algorithm to
remove a fraction of links from a strongly connected directed network such that
the dominant eigenvalue of the adjacency matrix is minimized. In addition, the
algorithm guarantees that the resulting network remains strongly connected
after the link removals. The proposed distributed algorithms consist of
distributed estimation of both the left and right eigenvectors corresponding to
the largest (in modulus) eigenvalue of the adjacency matrix together with a
distributed verification algorithm to check whether a network is strongly
connected after removal of a
References
1. Bishop, A.N., Shames, I.: Link operations for slowing the spread of disease in
complex networks. Europhys. Lett. 95(18005) (2011)
2. Bullo, F.: Lectures on Network Systems. CreateSpace, 1 edn. (2018).
http://motion.me.ucsb.edu/book-lns. With contributions by J. Cortes,
F. Dorfler, and S. Martinez
3. Charalambous, T., Yuan, Y., Yang, T., Pan, W., Hadjicostis, C.N., Johansson,
M.: Distributed finite-time average consensus in digraphs in the presence of time
delays. IEEE Trans. Control Netw. Syst. 2(4), 370–381 (2015)
4. Chen, C., Tong, H., Prakash, B.A., Eliassi-Rad, T., Faloutsos, M., Faloutsos, C.:
Eigen-optimization on large graphs by edge manipulation. ACM Trans. Knowl.
Discovery Data (TKDD) 10(4), 49 (2016)
5. Ghaboussi, J., Wu, X.S.: Numerical Methods in Computational Mechanics. CRC
Press (2016)
6. Golub, G.H., Van Loan, C.F.: Matrix Computations, 3rd edn. Johns Hopkins Uni-
versity Press, Baltimore (1996)
7. Gusrialdi, A., Qu, Z.: Distributed estimation of all the eigenvalues and eigenvectors
of matrices associated with strongly connected digraphs. IEEE Control Syst. Lett.
1(2), 328–333 (2017)
8. Gusrialdi, A., Qu, Z., Hirche, S.: Distributed link removal using local estimation
of network topology. IEEE Trans. Netw. Sci. Eng. 6(3), 280–292 (2019)
9. Li, C., Wang, H., Van Mieghem, P.: Epidemic threshold in directed networks. Phys.
Rev. E 88(6), 062802 (2013)
10. Li, N., Zhang, N., Das, S.: Preserving relation privacy in online social network
data. IEEE Internet Comput. 15(3), 35–42 (2011)
11. McDaniel, P., McLaughlin, S.: Security and privacy challenges in the smart grid.
Secur. Privacy IEEE 7(3), 75–77 (2009)
12. Milanese, A., Sun, J., Nishikawa, T.: Approximating spectral impact of structural
perturbations in large networks. Phys. Rev. E 81(4), 046112 (2010)
13. Nejad, B., Attia, S., Raisch, J.: Max-consensus in a max-plus algebraic setting: the
case of fixed communication topologies. In: International Symposium on Informa-
tion, Communication and Automation Technologies, pp. 1–7 (2009)
14. Prakash, B.A., Chakrabarti, D., Valler, N., Faloutsos, M., Faloutsos, C.: Threshold
conditions for arbitrary cascade models on arbitrary networks. Knowl. Inf. Syst.
33(3), 549–575 (2012)
15. Shames, I., Charalambous, T., Hadjicostis, C.N., Johansson, M.: Distributed net-
work size estimation and average degree estimation and control in networks iso-
morphic to directed graphs. In: Proceedings of Annual Allerton Conference on
Communication, Control, and Computing, pp. 1885–1892 (2012)
16. Van Mieghem, P., Van de Bovenkamp, R.: Non-markovian infection spread dramat-
ically alters the susceptible-infected-susceptible epidemic threshold in networks.
Phys. Rev. Lett. 110(10), 108701 (2013)
17. Van Mieghem, P., Stevanović, D., Kuipers, F., Li, C., van de Bovenkamp, R., Liu,
D., Wang, H.: Decreasing the spectral radius of a graph by link removals. Phys.
Rev. E 84, 016101 (2011)
18. Wang, X., Mou, S., Sun, D.: Improvement of a distributed algorithm for solving
linear equations. IEEE Trans. Ind. Electron. 64(4), 3113–3117 (2016)
19. Wang, Y., Chakrabarti, D., Wang, C., Faloutsos, C.: Epidemic spreading in real
networks: an eigenvalue viewpoint. In: Proceedings of 22nd International Sympo-
sium on Reliable Distributed Systems, pp. 25–34 (2003)
Data Compression to Choose a Proper
Dynamic Network Representation
Remy Cazabet
1 Introduction
The analysis of dynamic networks is an important topic of research in the net-
work science community. With the ubiquity of digital data collection, more and
more relational data becomes available with a temporal component. However,
the way to handle and model such data is still a research question. As discussed
in several articles in the state of the art [5,8,10,11], there are multiple ways to
model the same original observations.
In this article, we propose a method to choose an appropriate dynamic graph
model, among the three following ones: Sequence of Snapshots, Interval Graphs
and Link Streams. The method we propose is based on the principle of max-
imizing data compression, i.e., minimizing the network description’s encoding
length.
In Sect. 2, we explain the rationale of our approach, and possible applications.
In Sect. 3, we introduce the computation of the encoding cost of a dynamic
network, for each representation. Section 4 presents experiments on synthetic
and real datasets. Finally, we conclude in Sect. 5.
In this article, we will consider three types of dynamic network models, often
used in the literature.
While these representations might appear unrelated at first sight, they are
in fact able to represent the same original data as long as time is provided as
a discrete value, which is the case in most practical situations. For instance,
if time is represented as a POSIX time (timestamp), it can be considered
discrete, since there is a finite number of possible values between any two
POSIX times.
The best way to understand how the same data can be represented by these
different models is to take a practical example: the SocioPatterns project [1]
has collected several physical interaction datasets between individuals in
various contexts, such as schools or hospitals. Wearable sensors collect every
face-to-face interaction at a high frequency between a group of subjects over
an extended period of time, from a few hours to a few days. For practical
reasons, data collection is made for the whole setting every 20 s, in a
synchronous fashion. The publicly provided data therefore consists of a file
containing all those captured
Any of these interpretations is valid a priori, so the choice of using one
instead of another is usually based on practical reasons, e.g., to apply a
method that requires the data to be in one format or another. For instance,
dynamic community detection algorithms assume a specific network format:
Snapshots in most cases (e.g., [13]), but also sometimes Link Streams (e.g.,
[12,17]) or Interval Graphs (e.g., [3,6]).
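To illustrate how one file of triplets can be read in the three ways named above, here is a minimal sketch. It does not use the tnetwork API: the function names and data structures are hypothetical, and the rule used for Interval Graphs (merging observations of the same pair that are exactly one collection step, here 20 s, apart) is an assumption about how intervals would be derived.

```python
from collections import defaultdict

def to_snapshots(triplets):
    """Map each timestamp to the set of node pairs observed at that time."""
    snaps = defaultdict(set)
    for t, u, v in triplets:
        snaps[t].add(frozenset((u, v)))
    return dict(snaps)

def to_link_stream(triplets):
    """Map each node pair to the sorted list of its observation times."""
    stream = defaultdict(set)
    for t, u, v in triplets:
        stream[frozenset((u, v))].add(t)
    return {pair: sorted(times) for pair, times in stream.items()}

def to_interval_graph(triplets, step=20):
    """Merge observations of a pair that are `step` apart into [start, end) intervals."""
    intervals = {}
    for pair, times in to_link_stream(triplets).items():
        merged = []
        for t in times:
            if merged and merged[-1][1] == t:
                merged[-1][1] = t + step        # extend the current interval
            else:
                merged.append([t, t + step])    # open a new interval
        intervals[pair] = [tuple(iv) for iv in merged]
    return intervals

# Hypothetical observations (time in seconds, node, node)
data = [(0, "a", "b"), (20, "a", "b"), (40, "a", "b"), (100, "a", "b"), (20, "b", "c")]
print(to_snapshots(data))
print(to_link_stream(data))
print(to_interval_graph(data))
```

In this toy input the pair a-b is seen at three consecutive steps and once more later, so it yields four link-stream timestamps but only two intervals, which hints at why the cheapest representation depends on how persistent interactions are.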
The principle that the best description of data is the description that minimizes
the cost of its representation can be found in several areas of science, from
Occam’s Razor to the Minimum Description Length [9] (MDL) principle.
For static networks, this principle could for instance be used to choose
between a matrix representation, an edge-list representation and an adjacency
list representation. For an unweighted, undirected network composed of $n$
nodes and $m$ edges, the cost (in bits) of a matrix representation is $n^2$ (a
matrix of boolean values), while its corresponding representation as an edge
list is $2m \log_2(n)$, since encoding each edge requires encoding each of its
2 nodes. It means that if the graph is sparse, $m \ll n^2$, the edge list
representation is the most efficient, and vice versa. The adjacency list
representation is beyond the scope of this article, but is relatively similar
to the edge list. A first implication is that we can choose the most
appropriate way to store the graph in memory given its properties $n$ and $m$,
but, beyond this, it also provides hints on what can or cannot be done on this
graph. For instance, in the community detection problem, methods using matrix
factorization are little penalized by the density of the matrix, while an
algorithm such as Louvain, designed for sparse graphs, only requires the
neighborhoods of nodes, available in an adjacency list representation.
Therefore, the best way to encode a graph also gives us hints on how to handle
it.
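The comparison between the two static costs is easy to reproduce. The following sketch simply evaluates the two expressions from the paragraph above; the function names and the example sizes are illustrative choices.

```python
from math import log2

def matrix_cost(n):
    """Bits for an n x n boolean adjacency matrix."""
    return n * n

def edge_list_cost(n, m):
    """Bits for m edges, each stored as two node identifiers of log2(n) bits."""
    return 2 * m * log2(n)

def cheaper_static_encoding(n, m):
    return "edge list" if edge_list_cost(n, m) < matrix_cost(n) else "matrix"

n = 1024                                   # log2(n) = 10 bits per node identifier
for m in (5_000, 100_000):
    print(m, edge_list_cost(n, m), matrix_cost(n), cheaper_static_encoding(n, m))
```

With 1024 nodes the break-even point sits near 52,000 edges: below it the edge list is shorter, above it the boolean matrix wins.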
Note that in this paper, we limit ourselves to the comparison of encoding
schemes that depend only on the number of nodes, edges and temporal
information, and not on other properties such as the degree distribution, that
could also be optimized with techniques such as Huffman Coding. When dealing
with dynamic graphs, we will also make the assumption in our representation
that cumulated graphs of the networks to represent are sparse, since this is
the most common setting. We therefore propose representations which are
extensions of edge lists rather than adjacency matrices.
2.3 Applications
The method introduced in this article is implemented in tnetwork, a Python
library to manipulate temporal networks. The first application is to
automatically choose the most efficient in-memory representation for a temporal
network loaded from a file containing triplets <T, U1, U2>, as is the case for
temporal networks shared by the SocioPatterns project and those available on
the Network Repository website [14].
The method is also used in the library to choose the most parsimonious
representation when saving temporal networks created using the library, such as
random temporal networks with community structure.
Beyond these memory-related applications, knowing the most appropriate
representation also tells the practitioner how to efficiently manipulate their
data. For instance, if the snapshot representation is inefficient to represent a
dynamic network, it is unwise to analyze each period as an independent snap-
shot, for instance computing centralities or detecting communities for each of
them. Reciprocally, if the network is poorly represented as a link stream, it is
unwise to apply methods expecting such a graph, e.g. the community detection
method introduced in [12].
Fig. 1. Illustration of the chosen encoding strategies. From an observed
dynamic network, each strategy encodes node pairs (red) and temporal
information (blue/green). Textured blocks correspond to STOP symbols that mark
the end of a series of unknown length.
The cost of encoding nodes themselves is a constant for a given network and
is thus ignored in the rest of this paper.
Link Stream Encoding. For this encoding, we list, for each pair of nodes, the
timestamps at which it appears. The total encoding cost of a graph is:
$$I_{ls} = m I_m + e I_t + m I_t$$
where the first term is the cost of encoding the edges, the second term
encodes the timestamps, and the last encodes the stop symbols that signal the
end of each list of times.
We formalize below the cost of encoding a dynamic network using four rep-
resentations. A summary of these representations can be found in Fig. 1.
$$I_{SNM} = m I_m + t I_t + te$$
where the first term is the cost of encoding the edges, the second term
encodes the timestamps, and the last encodes the bits of the matrix.
In the second snapshot representation, each snapshot is represented as a list
of pairs of nodes, and timestamps are added at the start of every snapshot.
This representation is equivalent to representing each snapshot as an edge
list.
$$I_{SNE} = e I_m + t I_t + t I_m$$
where the first term is the cost of encoding the edges, the second term
encodes the timestamps, and the last encodes the stop symbol at the end of each
snapshot.
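The two list-based costs (link stream and edge-list snapshots) can be compared numerically. The sketch below is an illustration under stated assumptions: $m$ is read as the number of distinct node pairs, $e$ as the number of (pair, timestamp) observations and $t$ as the number of snapshots; the unit costs of a node pair and of a timestamp are passed in as arguments, and the example values (14 and 10 bits) are hypothetical. The function names are not part of tnetwork.

```python
def link_stream_cost(m, e, cost_pair, cost_time):
    """m pairs, e timestamps, and one stop symbol (timestamp-sized) per pair."""
    return m * cost_pair + e * cost_time + m * cost_time

def snapshot_edge_list_cost(e, t, cost_pair, cost_time):
    """e edge entries, t timestamps, and one stop symbol (pair-sized) per snapshot."""
    return e * cost_pair + t * cost_time + t * cost_pair

def cheaper_list_encoding(m, e, t, cost_pair, cost_time):
    ls = link_stream_cost(m, e, cost_pair, cost_time)
    sn = snapshot_edge_list_cost(e, t, cost_pair, cost_time)
    return "link stream" if ls < sn else "snapshots (edge lists)"

# Hypothetical unit costs: 14 bits per node pair, 10 bits per timestamp
PAIR, TIME = 14.0, 10.0
# Few repeated interactions spread over many timestamps
print(cheaper_list_encoding(m=500, e=600, t=550, cost_pair=PAIR, cost_time=TIME))
# Many observations concentrated in a handful of snapshots
print(cheaper_list_encoding(m=500, e=2000, t=10, cost_pair=PAIR, cost_time=TIME))
```

The first setting favours the link stream because most pairs are observed only once, while the second favours snapshots because the per-snapshot overhead is paid only ten times.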
4 Experiments
To validate the relevance of our approach, we experiment with synthetic and
real networks. In each experiment, we test with the original temporal
resolution (on the left of figures), and then explore how aggregating at
coarser temporal scales affects encoding costs. To create those aggregated
versions, we use non-overlapping sliding time windows. To every unique time
period, we associate an unweighted cumulative graph, such that an edge exists
between two nodes of this graph if there is at least one interaction between
those two nodes during the corresponding period.
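The aggregation step can be sketched as follows. This is a minimal illustration rather than the code of the accompanying notebook: the function name is hypothetical, and aligning windows on multiples of the window length is an assumption.

```python
from collections import defaultdict

def aggregate(triplets, window):
    """Group (t, u, v) observations into non-overlapping windows of length `window`.

    Returns {window_start: set of undirected edges seen at least once in the window}.
    """
    graphs = defaultdict(set)
    for t, u, v in triplets:
        start = (t // window) * window      # assumed alignment on multiples of `window`
        graphs[start].add(frozenset((u, v)))
    return dict(graphs)

# Hypothetical observations (time in seconds, node, node)
data = [(0, "a", "b"), (20, "a", "b"), (40, "b", "c"), (310, "a", "b")]
print(aggregate(data, window=300))          # 5-minute windows
```

Repeated observations of the same pair inside one window collapse into a single unweighted edge, which is what makes coarser scales cheaper to encode for persistent interactions.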
All experiments can be checked and reproduced using an online notebook.
Fig. 2. Code length for synthetic graphs. We observe that the most efficient network
representation depends strongly on the properties of networks.
Fig. 3. Code length for real graphs. We observe that the most efficient network
representation depends strongly on the properties of networks.
All datasets are public, and available either through their original paper or
through the network data repository [14], as reported in Table 1.
Results of experiments are reported in Fig. 3. When relevant, we indicate
with vertical lines typical temporal scales (respectively, minute, hour, day, week,
month, year).
The three SocioPatterns datasets seem to have similar profiles overall. For
the High School and Hospital datasets, Interval Graphs is the most efficient
representation for the original timescale, while Link Streams become more
efficient if we aggregate every 5 min, approximately. From this observation, we
can infer that interactions tend to be maintained typically for no more than a
few minutes. Each interval of observation thus becomes a single observation
when aggregating, more efficiently represented as a Link Stream. For the
Primary School, the Link Stream representation is the most efficient even for
the original data, either because the data is more noisy or because there are
more singleton observations.
Link Stream is also the most efficient representation for the ENRON dataset, as
expected due to the nature of the dataset: each email is stamped with the exact
minute it was sent, and it is rather unlikely that emails sent on a particular
minute form a well-defined graph, or that several emails are sent between the
same individuals in successive minutes. Only when aggregating at a scale of
weeks or months is the Interval Graph representation the most appropriate, and
when aggregating every year a snapshot representation becomes relevant.
In the primate dataset, on the contrary, the snapshot representation is the
most appropriate: each timestamp corresponds to a well-formed graph, and
relations are usually not stable from one snapshot to the next.
Finally, for the Game of Thrones dataset, Interval Graphs seem to be clearly
the most appropriate for the original data. However, by having a second look
at the data, we realized that this is due to the way the dataset is provided. A
smoothing window is used, and for each sequence of 10 scenes, the same average
network is provided 10 times, i.e., snapshots 1–10 correspond to the first 10
scenes and are identical. Using our method, we observe that if we aggregate
every 10 scenes, thus removing this bias, the link stream approach becomes the
most efficient.
5 Conclusion
532 R. Cazabet
2. Bost, X., Labatut, V., Gueye, S., Linarès, G.: Narrative smoothing: dynamic con-
versational network for the analysis of TV series plots. In: 2016 IEEE/ACM
International Conference on Advances in Social Networks Analysis and Mining
(ASONAM), pp. 1111–1118. IEEE (2016)
3. Cazabet, R., Amblard, F., Hanachi, C.: Detection of overlapping communities in
dynamical social networks. In: 2010 IEEE Second International Conference on
Social Computing, pp. 309–314. IEEE (2010)
4. Cazabet, R., Boudebza, S., Rossetti, G.: Evaluating community detection algo-
rithms for progressively evolving graphs. J. Complex Netw. (2020)
5. Remy Cazabet, R., Rossetti, G.: Challenges in community discovery on temporal
networks. In: Temporal Network Theory, pp. 181–197. Springer, Heidelberg (2019)
6. Coscia, M., Rossetti, G., Giannotti, F., Pedreschi, D.: Demon: a local-first discovery
method for overlapping communities. In: Proceedings of the 18th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, pp. 615–623
(2012)
7. Fournet, J., Barrat, A.: Contact patterns among high school students. PLoS ONE
9(9), e107878 (2014)
8. Gauvin, L., Génois, M., Karsai, M., Kivelä, M., Takaguchi, T., Valdano, E.,
Vestergaard, C.L.: Randomized reference models for temporal networks. arXiv
preprintarXiv:1806.04032 (2018)
9. Grünwald, P.D., Grunwald, A.: The Minimum Description Length Principle. MIT
Press, Cambridge (2007)
10. Holme, P., Saramäki, J.: Temporal networks. Phys. Rep. 519(3), 97–125 (2012)
11. Latapy, M., Viard, T., Magnien, C.: Stream graphs and link streams for the mod-
eling of interactions over time. Soc. Netw. Anal. Mining 8(1), 61 (2018)
12. Matias, C., Rebafka, T., Villers, F.: A semiparametric extension of the stochastic
block model for longitudinal networks. Biometrika 105(3), 665–680 (2018)
13. Mucha, P.J., Richardson, T., Macon, K., Porter, M.A., Onnela, J.-P.: Commu-
nity structure in time-dependent, multiscale, and multiplex networks. Science
328(5980), 876–878 (2010)
14. Rossi, R.A., Ahmed, N.K.: The network data repository with interactive graph
analytics and visualization. In: AAAI (2015)
15. Stehlé, J., Voirin, N., Barrat, A., Cattuto, C., Isella, L., Pinton, J.-F., Quaggiotto,
M., Van den Broeck, W., Régis, C., Lina, B., Vanhems, P.: High-resolution mea-
surements of face-to-face contact patterns in a primary school. PLOS ONE 6(8),
e23176 (2011)
16. Vanhems, P., Barrat, A., Cattuto, C., Pinton, J.F., Khanafer, N., Regis, C., Kim,
B.A., Comte, B., Voirin, N.: Estimating potential infection transmission routes in
hospital wards using wearable proximity sensors. PLoS ONE 8(9), e73970 (2013)
17. Viard, T., Latapy, M., Magnien, C.: Computing maximal cliques in link streams.
Theoret. Comput. Sci. 609, 245–252 (2016)
Effect of Nonisochronicity on the Chimera
States in Coupled Nonlinear Oscillators
Abstract. We investigate the conditions which enable one to show the phe-
nomenon of swing of synchronized states via amplitude chimera states in non-
locally coupled systems with symmetry breaking with coupled Stuart-Landau
oscillators as an example. Chimera states represent the spatio-temporal patterns
coexisting with synchronized and desynchronized behaviour in coupled identi-
cal oscillators. We identify that the radius of non-local interaction and the non-
isochronicity in the system also play an important role in the observation of such
states. The system shows this notable property neither for smaller values of
nonisochronicity nor for higher values. We also carefully study the occurrence of
different transition routes to the recently observed dynamical state called chimera
death while varying the strength of the nonisochronicity parameter.
1 Introduction
Chimera states are intriguing spatiotemporal patterns in which synchronized and
desynchronized oscillations coexist, and they have attracted considerable attention to the
study of coupled networks with nonlocal topology. It has been realized that nonlocal
coupling plays a crucial role in inducing chimera states [1–6]. Recently, such coexis-
tence behavior has also been addressed in global coupling [7] and local coupling [8–10].
They have also been observed in maps [11], complex networks [12], time discrete and
continuous chaotic systems [13]. This state has also been observed experimentally in
mechanical oscillators with metronomes [14], coupled chemical oscillators [15], coupled
electronic oscillators [16], oscillators with more than one population [17–21],
time varying networks [22] and also in optical coupled map lattices realized by liq-
uid crystal light modulators [23]. These observations of chimera states have helped to
explain various phenomena that occur in practice, such as uni-hemispheric sleep [24],
power distribution in networks [25], and bump states in neural networks [26].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 943, pp. 533–543, 2021.
https://doi.org/10.1007/978-3-030-65347-7_44
534 K. Premalatha et al.
$$\dot{z}_j = z_j - (1 - ic)\,|z_j|^2 z_j + \frac{\varepsilon}{2P} \sum_{k=j-P}^{j+P} \left(\mathrm{Re}[z_k] - \mathrm{Re}[z_j]\right), \qquad (1)$$
where z_j = x_j + i y_j, j = 1, 2, 3, ..., N. All the indices in Eq. (1) are regarded as modulo
N. Here c is the nonisochronicity parameter and N is the total number of oscillators.
The nonlocal coupling in the system is controlled by the coupling strength (ε) and the
coupling range (r = P/N), where P corresponds to the number of nearest neighbors in each
direction (or coupling radius). Here, we have introduced the coupling only in the real
parts of the complex amplitude, and so this coupling introduces a symmetry breaking
in the system. This symmetry breaking is important here as we find that the absence of
symmetry breaking cannot induce the undulation in synchronization as we prove in the
next section.
In our simulations, we choose the number of oscillators N to be equal to 100, and to
solve Eq. (1) we use the fourth-order Runge-Kutta method with time step
0.01 and with symmetric initial conditions between −1 and +1, which are necessary for
the occurrence of the oscillation death state.
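The integration scheme described above can be sketched in a few lines of pure Python. This is only an illustrative sketch, not the authors' code: the function names (`derivative`, `rk4_step`, `simulate`) are hypothetical, the coupling sum is assumed to run over the 2P nearest neighbors of oscillator j on a ring, and the initial conditions are drawn uniformly at random in [−1, 1], which does not reproduce the symmetric initial conditions mentioned in the text.

```python
import random


def derivative(z, c, eps, P):
    """Right-hand side of Eq. (1) for a ring of Stuart-Landau oscillators."""
    n = len(z)
    re = [w.real for w in z]
    out = []
    for j in range(n):
        # Nonlocal coupling acts on the real parts only (symmetry breaking);
        # indices are taken modulo n, and the k = j term contributes zero.
        coupling = sum(re[(j + d) % n] - re[j] for d in range(-P, P + 1))
        local = z[j] - (1 - 1j * c) * abs(z[j]) ** 2 * z[j]
        out.append(local + eps / (2 * P) * coupling)
    return out


def rk4_step(z, dt, c, eps, P):
    """One classical fourth-order Runge-Kutta step."""
    k1 = derivative(z, c, eps, P)
    k2 = derivative([a + 0.5 * dt * b for a, b in zip(z, k1)], c, eps, P)
    k3 = derivative([a + 0.5 * dt * b for a, b in zip(z, k2)], c, eps, P)
    k4 = derivative([a + dt * b for a, b in zip(z, k3)], c, eps, P)
    return [
        a + dt / 6.0 * (p + 2 * q + 2 * r + s)
        for a, p, q, r, s in zip(z, k1, k2, k3, k4)
    ]


def simulate(n=100, P=10, c=3.0, eps=0.04, dt=0.01, steps=1000, seed=0):
    """Integrate Eq. (1) from random initial conditions (an assumption)."""
    rng = random.Random(seed)
    z = [complex(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]
    for _ in range(steps):
        z = rk4_step(z, dt, c, eps, P)
    return z
```

A call such as `simulate(n=100, P=10, c=3.0, eps=0.04)` returns the final complex amplitudes; the real parts give a snapshot of x_j.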
Fig. 1. Strength of incoherence (S) of the system (1) as a function of ε for (a) c = 3 and
P = 10, (b) c = 3 and P = 25, (c) c = 4.5 and P = 10, and (d) c = 7 and P = 10.
system. In order to know the nature of dynamical states in more detail, we look at
the strength of incoherence of the system which was introduced recently by Gopal,
Venkatesan and two of the present authors [31] that will help us to detect interesting
collective dynamical states such as the chimera state. It is defined as
$$S = 1 - \frac{\sum_{m=1}^{M} s_m}{M}, \qquad s_m = \Theta(\delta - \sigma(m)), \qquad (2)$$
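As a small illustration of Eq. (2), the sketch below computes S from a list of already-computed values σ(m), one per bin. The function name is hypothetical, the computation of σ(m) itself is not shown, and the convention Θ(0) = 0 for the Heaviside step is an assumption.

```python
def strength_of_incoherence(sigmas, delta):
    """Evaluate S = 1 - (sum_m s_m) / M with s_m = Theta(delta - sigma(m)).

    `sigmas` holds one precomputed value sigma(m) per bin; `delta` is the
    threshold. Theta(0) is taken to be 0 here (an assumption).
    """
    if not sigmas:
        raise ValueError("at least one bin is required")
    s = [1 if delta - sigma > 0 else 0 for sigma in sigmas]
    return 1.0 - sum(s) / len(s)
```

With every σ(m) below δ the function returns 0, with every σ(m) above δ it returns 1, and a mixture gives an intermediate value.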
Fig. 2. (a) Snapshot of the variable x_j corresponding to the amplitude chimera state with P = 10, c =
3.0 and ε = 0.04. (b) Frequency f_j of the oscillators in the system for the amplitude chimera state, (c)
corresponding center of mass y_c.m averaged over one period of each oscillator for the variable y_j.
found to be unity which represents the desynchronization among the oscillators and by
increasing the coupling S reaches the value zero where the oscillators are in synchro-
nization. Further for the values of ε between 0.036 and 0.08, S has non-zero values, but
below one. We find the occurrence of chimera states in the region 0.036 < ε < 0.082
and 0.15 < ε < 0.19 (as S takes the values between 0 and 1 for these values of ε ). For
the intermediate values of ε (0.082 < ε < 0.15), S is found to have unit value, indicating
that the state is desynchronized. S reaches its minimum value (S = 0) on
further increase of ε. Thus for ε > 0.19 the oscillators return to the
synchronized state. The above analysis shows that the swing in synchronization in the
system is mediated by the desynchronized state in addition to the chimera states. More
specifically, the states of the oscillators in this case follow the transition route
desynchronization → synchronization → chimera state → desynchronization → chimera
state → synchronization. Now let us see whether this type of reappearance occurs for
higher values of c also.
(iv) The results obtained for c = 7.0 are shown in Fig. 1(d), which indicates the
absence of the above type of synchronized state. In this case, we can observe that S has
its maximum value for small values of ε (ε < 0.25). S takes values between 0 and 1 for
0.25 < ε < 0.550, corresponding to a chimera state. Finally, for ε > 0.550, S decreases
to zero, corresponding to the synchronized state. The transition route in the present case
is then as follows: desynchronization → chimera state → synchronization. Hence
a large value of nonisochronicity leads to the absence of swing in the synchronized state.
Thus, for P = 10, we have the phenomenon of swing in synchronization for
2.7 < c < 5.2.
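The reading of S used in the cases above (S = 1 for desynchronization, 0 < S < 1 for a chimera state, S = 0 for synchronization) can be turned into a transition route mechanically. The sketch below is illustrative only: the function names and the numerical tolerance are assumptions, and the S values in the usage note are made-up placeholders rather than data from the paper.

```python
def classify_state(s, tol=1e-6):
    """Map a strength-of-incoherence value to a dynamical state label."""
    if s <= tol:
        return "synchronization"
    if s >= 1.0 - tol:
        return "desynchronization"
    return "chimera state"


def transition_route(s_values, tol=1e-6):
    """Collapse a scan of S over increasing coupling into a route of states."""
    route = []
    for s in s_values:
        label = classify_state(s, tol)
        # Record a state only when it differs from the previous one.
        if not route or route[-1] != label:
            route.append(label)
    return route
```

For a hypothetical scan `[1.0, 1.0, 0.6, 0.3, 0.0]` the function returns the route desynchronization, chimera state, synchronization.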
Nature of Chimera States and Chimera Death States. In order to know the
characteristic nature of chimera states more clearly, we present the space-time plot and
frequency profiles corresponding to the system in the chimera state region. Figure 2(a)
shows that there are fluctuations in the amplitudes of the oscillations of the
desynchronized oscillators for P = 10, c = 3.0 and ε = 0.8. But Fig. 2(b) shows that the
frequencies of the oscillators are identical. This is also confirmed by calculating the
center of mass of the oscillators in Fig. 2(c), which clearly illustrates that the
synchronized oscillators are oscillating periodically with the origin as the center of rotation while
the incoherent oscillators are oscillating periodically with different amplitudes and with
Fig. 3. (a) Snapshot of the variables x_j corresponding to the frequency chimera state with P = 40,
c = 7.0 and ε = 0.08. (b) Frequency f_j of the oscillators in the frequency chimera state shown in (a).
a shifted center of rotation from the origin [27]. Hence we confirm that the swing of
synchronized states occurs through the amplitude chimera states. Interestingly, for a larger
radius of nonlocal interaction, we observe a change in the behavior of the chimera
states. For the case P = 40, c = 7.0 and ε = 0.08, Fig. 3 shows that in addition to the
fluctuations in the amplitude, there are fluctuations in their frequencies also. The
frequency profile of the oscillators in Fig. 3(b) shows that there are two groups of oscillators:
the oscillators from 1 to 73 belong to the first group, and they have different
frequencies, thereby showing the presence of spatial incoherence among these oscillators. The
other group is made up of the oscillators from 74 to 100, which are all found to have
the same frequency and are synchronized. This type of chimera state can be designated
as a frequency chimera state.
Fig. 4. Snapshot of the variables x_j corresponding to the chimera death state with P = 10 and ε =
0.08: (a) for c = 0, (b) for c = 3, (c) for c = 6.
Further we note that the presence of the symmetry breaking term in the coupling causes
the system to transit to a new dynamical state called the chimera death state for large
values of coupling strength. It has the combined properties of chimera and oscillation death
(OD). The population of identical oscillators splits into two coexisting domains: (i)
spatially coherent oscillation death (neighboring oscillators populate the same branch
of the inhomogeneous steady state, either x^(1) or x^(2)) and (ii) spatially incoherent
oscillation death (neighboring oscillators are distributed completely at random between x^(1)
and x^(2)), which is clearly shown in Fig. 4 for three different values of the
nonisochronicity parameter. Other parameter values are fixed as P = 10 and ε = 0.8. In this steady
state all the oscillators in the array are distributed uniformly either in the lower branch
or in the upper branch for c = 0 as shown in Fig. 4(a). Next, on increasing the value
of the nonisochronicity parameter to c = 3, as seen in Fig. 4(b), we observe an
increase in disorder in the distribution of inhomogeneous steady states. On increasing
the nonisochronicity parameter further (c = 6), one finds a further increase in disorder
in the distribution of inhomogeneous steady states as in Fig. 4(c). Thus we conclude that
an increase in the strength of the nonisochronicity parameter leads to an increase in disorder
in the distribution of inhomogeneous steady states. To give a concrete idea about the
different dynamical states with transition routes, we extend our study with a phase diagram
in the next sub-section.
Fig. 5. (Color online) Phase diagrams of the system (1) in the (ε, c) plane for (a) P = 10, (b) P = 40.
(This figure is reproduced from [28].) SY (green), DS (yellow), AC (blue), FC (violet), and CD (red)
regions represent the synchronized state, desynchronized state, amplitude chimera state, frequency
chimera state, and chimera death state, respectively.
increases disorder in the system. At c = 7.0, a variation in the coupling strength causes
the system to transit from a desynchronized state to a synchronized state via the frequency
chimera instead of the amplitude chimera, owing to the increase of the nonisochronicity
parameter, which induces disorder in the dynamical states. The diverse transition routes to
chimera death are briefly explained in Table 1.
3 Conclusion
Even with a small radius of nonlocal interaction, the system shows this notable property
neither for smaller values of nonisochronicity nor for higher values of nonisochronicity.
The swing of the synchronized state initially follows the route
synchronization → amplitude chimera → synchronization, and an increase in the
nonisochronicity causes the synchronized state to be mediated by the desynchronized state
along with the amplitude chimera state, where the transition route can be defined as
synchronization → amplitude chimera → desynchronization → amplitude chimera →
synchronization.
Another interesting feature of these nonlocally coupled systems for a higher radius of
nonlocal interaction P (in which case such peculiarity of synchronization disappears)
is the presence of a frequency chimera state due to disorder introduced by the
nonisochronicity parameter. We also carefully study the occurrence of different transition
routes to the recently observed dynamical state called chimera death while varying the
strength of the nonisochronicity parameter. Since we have studied the model of
nonlocally coupled Stuart-Landau oscillators, which has wide practical applications in
physical, chemical and biological phenomena, this study may help to control the chimera
states which appear in various areas. As an example, chimera states in power
distribution networks may lead to blackouts due to the coexistence of desynchronized and
synchronized states. Controlling chimera states may help to maintain a stable distribution of
power supply.
References
1. Kuramoto, Y., Battogtokh, D.: Coexistence of coherence and incoherence in nonlocally coupled phase oscillators. Nonlinear Phenom. Complex Syst. 5, 380–385 (2002). http://www.j-npcs.org/online/vol2002/v5no4/v5no4p380.pdf
2. Abrams, D.M., Strogatz, S.H.: Chimera states for coupled oscillators. Phys. Rev. Lett. 93, 174102 (2004). https://doi.org/10.1103/PhysRevLett.93.174102
3. Abrams, D.M., Strogatz, S.H.: Chimera states in a ring of nonlocally coupled oscillators. Int. J. Bifurc. Chaos 16(1), 21 (2006). https://doi.org/10.1142/S0218127406014551
4. Abrams, D.M., Mirollo, R., Strogatz, S.H., Wiley, D.A.: Solvable model for chimera states of coupled oscillators. Phys. Rev. Lett. 101, 084103 (2008). https://doi.org/10.1103/PhysRevLett.101.084103
5. Shima, S.I., Kuramoto, Y.: Rotating spiral waves with phase-randomized core in nonlocally coupled oscillators. Phys. Rev. E 69, 036213 (2004). https://doi.org/10.1103/PhysRevE.69.036213
6. Sheeba, J.H., Chandrasekar, V.K., Lakshmanan, M.: Chimera and globally clustered chimera: impact of time delay. Phys. Rev. E 81, 046203 (2010). https://doi.org/10.1103/PhysRevE.81.046203
7. Sethia, G.C., Sen, A.: Chimera states: the existence criteria revisited. Phys. Rev. Lett. 112, 144101 (2014). https://doi.org/10.1103/PhysRevLett.112.144101
8. Laing, C.R.: Chimeras in networks with purely local coupling. Phys. Rev. E 92, 050904(R) (2015). https://doi.org/10.1103/PhysRevE.92.050904
9. Bera, B.K., Ghosh, D., Lakshmanan, M.: Chimera states in bursting neurons. Phys. Rev. E 93, 012205 (2016). https://doi.org/10.1103/PhysRevE.93.012205
10. Premalatha, K., Chandrasekar, V.K., Senthilvelan, M., Lakshmanan, M.: Stable amplitude chimera states in a network of locally coupled Stuart-Landau oscillators. Chaos 28, 033110 (2018). https://doi.org/10.1063/1.5006454
11. Omelchenko, I., Maistrenko, Y., Hövel, P., Schöll, E.: Loss of coherence in dynamical networks: spatial chaos and chimera states. Phys. Rev. Lett. 106, 234102 (2011). https://doi.org/10.1103/PhysRevLett.106.234102
12. Zhu, Y., Zheng, Z., Yang, J.: Chimera states on complex networks. Phys. Rev. E 89, 022914 (2014). https://doi.org/10.1103/PhysRevE.89.022914
13. Omelchenko, I., Riemenschneider, B., Hövel, P., Maistrenko, Y.L., Schöll, E.: Transition from spatial coherence to incoherence in coupled chaotic systems. Phys. Rev. E 85, 026212 (2012). https://doi.org/10.1103/PhysRevE.85.026212
14. Martens, E.A., Thutupalli, S., Fourrière, A., Hallatschek, O.: Chimera states in mechanical oscillator networks. Proc. Natl. Acad. Sci. 110, 10563 (2013). https://doi.org/10.1073/pnas.1302880110
15. Tinsley, M.R., Nkomo, S., Showalter, K.: Chimera and phase-cluster states in populations of coupled chemical oscillators. Nat. Phys. 8, 662 (2012). https://doi.org/10.1038/nphys2371
16. Gambuzza, L.V., Buscarino, A., Chessari, S., Fortuna, L., Meucci, R., Frasca, M.: Experimental investigation of chimera states with quiescent and synchronous domains in coupled electronic oscillators. Phys. Rev. E 90, 032905 (2014). https://doi.org/10.1103/PhysRevE.90.032905
17. Montbrio, T.E., Kurths, J., Blasius, B.: Synchronization of two interacting populations of oscillators. Phys. Rev. E 70, 056125 (2004). https://doi.org/10.1103/PhysRevE.70.056125
18. Pikovsky, A., Rosenblum, M.: Partially integrable dynamics of hierarchical populations of coupled oscillators. Phys. Rev. Lett. 101, 264103 (2008). https://doi.org/10.1103/PhysRevLett.101.264103
19. Martens, E.A., Panaggio, M.J., Abrams, D.M.: Basins of attraction for chimera states. New J. Phys. 18, 022002 (2016). https://doi.org/10.1088/1367-2630/18/2/022002
20. Premalatha, K., Chandrasekar, V.K., Senthilvelan, M., Lakshmanan, M.: Imperfectly synchronized states and chimera states in two interacting populations of nonlocally coupled Stuart-Landau oscillators. Phys. Rev. E 94, 012311 (2016). https://doi.org/10.1103/PhysRevE.94.012311
21. Premalatha, K., Chandrasekar, V.K., Senthilvelan, M., Lakshmanan, M.: Chimeralike states in two distinct groups of identical populations of coupled Stuart-Landau oscillators. Phys. Rev. E 95, 022208 (2017). https://doi.org/10.1103/PhysRevE.95.022208
22. Buscarino, A., Frasca, M., Gambuzza, L.V., Hövel, P.: Chimera states in time-varying complex networks. Phys. Rev. E 91, 022817 (2015). https://doi.org/10.1103/PhysRevE.91.022817
23. Hagerstrom, A., Murphy, T.E., Roy, R., Hövel, P., Omelchenko, I., Schöll, E.: Experimental observation of chimeras in coupled-map lattices. Nat. Phys. 8, 658 (2012). https://doi.org/10.1038/nphys2372
24. Rattenborg, N.C., Amlaner, C.J., Lima, S.L.: Behavioral, neurophysiological and evolutionary perspectives on unihemispheric sleep. Neurosci. Biobehav. Rev. 24, 817–842 (2000). https://doi.org/10.1016/S0149-7634(00)00039-7
25. Filatrella, G., Neilson, A.H., Pedersen, N.F.: Analysis of a power grid using a Kuramoto-like model. Eur. Phys. J. B 61(4), 485–491 (2008). https://doi.org/10.1140/epjb/e2008-00098-8
1 Introduction
Discrete graph dynamical systems are generalizations of cellular automata (CA)
[10, 26]. They serve as a useful formal model in many contexts, including multi-agent
systems, propagation of contagions in social networks and interaction phenomena in
biological systems (see e.g., [1, 17, 25, 27]). Here, we focus on one such class of graph
dynamical systems, namely synchronous discrete dynamical systems (SyDSs). Informally,
a SyDS^1 consists of an undirected graph^2 whose vertices represent entities and
edges represent local interactions among entities. Each node v has a Boolean state and
a local function f_v whose inputs are the current state of v and those of its neighbors;
the output of f_v is the next state of v. The vector consisting of the state values of all
the nodes at each time instant is referred to as the configuration of the system at that
instant. In each time step, all nodes of a SyDS compute and update their states syn-
chronously. Starting from a (given) initial configuration, the time evolution of a SyDS
consists of a sequence of successive configurations, which is also called a trajectory.
^1 Formal definitions associated with SyDSs are presented in Sect. 2.
^2 Synchronous dynamical systems, where the underlying graph is directed, are called
Synchronous Boolean Networks (see e.g., [12, 13, 19]).
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 943, pp. 544–555, 2021.
https://doi.org/10.1007/978-3-030-65347-7_45
Evolution of Similar Configurations 545
and the time used by two SAT solvers (namely, Clasp [7] and Glucose [8]) to solve the
corresponding SAT instances.
Related Work. Computational problems associated with discrete dynamical sys-
tems have been addressed by many researchers. For example, Barrett et al. [3] and
Rosenkrantz et al. [21] studied the reachability problem (i.e., given a SyDS S and two
configurations C1 and C2 , does S starting from C1 reach C2 ?) for undirected graphs.
The same problem for directed graphs has been studied in [5, 19]. Tosic [23, 24] presented
results for counting the number of fixed points^4 for systems with special forms of
local functions. Kosub and Homan [15] presented dichotomy results that delineate
computationally intractable and efficiently solvable versions of counting fixed points, based
on the class of allowable local functions. The complexity of the predecessor existence
problem for various classes of underlying graphs and local functions is investigated in
[4]. A more general version of the predecessor existence problem, where the goal is
to find t-step predecessors for values of t ≥ 2, has been studied in [14, 16]. Problems
similar to predecessor existence have also been considered in the context of cellular
automata [6, 9]. Readers interested in the applications of graph dynamical systems are
referred to [1, 17].
Note: For space reasons, proofs are not included; they can be found in [20].
2 Preliminaries
predecessors are called Garden of Eden (GE) configurations [17]. Given a configura-
tion C, we use the notation σ (C) to denote the successor of C, and Π (C) to denote the
set of all predecessors of C.
SyDSs have been considered in the literature under many classes of local functions
(see e.g., [4, 14]). We now present an example of a SyDS where the local function at
each node is a threshold function. For each integer k ≥ 0, the k-threshold function
has the value 1 iff at least k of its inputs are 1.
Example: The underlying graph of a SyDS is shown in Fig. 1. The threshold value for
each node is shown within parentheses. (Thus, the local function at b is the 2-threshold
function while that at d is the 3-threshold function.) Suppose the initial configuration
of the system is (1, 1, 0, 0, 0); that is, a and b are in state 1 while c, d and e are in state
0. The reader can verify that starting from time 0, the system goes through the
following sequence of configurations: (1, 1, 0, 0, 0) → (1, 1, 1, 0, 0) → (1, 1, 1, 1, 0) →
(1, 1, 1, 1, 1). Once the system reaches the configuration (1, 1, 1, 1, 1) at time step 3,
no further state changes occur in the subsequent time steps; that is, the
configuration (1, 1, 1, 1, 1) is a fixed point.
Fig. 1. An Example of a SyDS where each node has a threshold function. The threshold values are
shown in parentheses.
The phase space PS of a SyDS S is a directed graph defined as follows. There is a node
in PS for each configuration of S. There is a directed edge from a node representing
configuration C1 to that representing configuration C2 if there is a one step transition of
S from C1 to C2. For a SyDS with n nodes, the number of nodes in the phase space is 2^n;
thus, the size of phase space is exponential in the size of a SyDS. Each node in the phase
space has an outdegree of 1 (since our SyDS model is deterministic). Also, in the phase
space, each fixed point of a SyDS is a self-loop and each GE configuration is a node of
indegree zero.
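The synchronous update rule and the notion of a fixed point can be illustrated with a short sketch. The edge set below is hypothetical: it is not taken from Fig. 1 but was chosen so that the stated trajectory is reproduced. The thresholds of b and d follow the text, while those assigned to a, c and e are assumptions. Each node's inputs are its own state and the states of its neighbors.

```python
def syds_step(adj, thresholds, config):
    """Synchronously apply every node's threshold function once."""
    return {
        v: 1 if config[v] + sum(config[u] for u in adj[v]) >= thresholds[v] else 0
        for v in adj
    }


def trajectory(adj, thresholds, config, max_steps=50):
    """Successive configurations, stopping when a fixed point is reached."""
    configs = [dict(config)]
    for _ in range(max_steps):
        nxt = syds_step(adj, thresholds, configs[-1])
        if nxt == configs[-1]:  # self-loop in the phase space
            break
        configs.append(nxt)
    return configs


# Hypothetical 5-node graph (NOT the graph of Fig. 1).
ADJ = {
    "a": ["b", "c", "d"],
    "b": ["a", "c", "d"],
    "c": ["a", "b", "d"],
    "d": ["a", "b", "c", "e"],
    "e": ["d"],
}
# Thresholds of b and d are from the text; the others are assumed.
THRESHOLDS = {"a": 1, "b": 2, "c": 2, "d": 3, "e": 1}
START = {"a": 1, "b": 1, "c": 0, "d": 0, "e": 0}

if __name__ == "__main__":
    for cfg in trajectory(ADJ, THRESHOLDS, START):
        print(tuple(cfg[v] for v in "abcde"))
```

On this assumed graph the run visits the four configurations listed in the example and then stops at the fixed point (1, 1, 1, 1, 1).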
Hamming Distance and Similarity of Configurations. Given two configurations C1
and C2 of a SyDS over the domain {0, 1}, the Hamming Distance between C1 and C2 ,
denoted by H(C1 , C2 ), is the number of positions in which they differ. For example, if
C1 = (1,0,0,1) and C2 = (0,1,0,0), then H(C1 , C2 ) = 3. We say that two configurations C1
and C2 of a SyDS are h-close if H(C1 , C2 ) = h. Two configurations that are h-close for
a small value of h can be thought of as ‘similar’ configurations. We note that in a SyDS
with n nodes, the maximum Hamming distance between any pair of configurations C1
and C2 is n; this occurs when C1 is the bitwise complement of C2 .
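The Hamming distance defined above is straightforward to compute; the following minimal sketch (with a hypothetical function name) represents configurations as tuples of 0/1 values and reproduces the example from the text.

```python
def hamming(c1, c2):
    """Number of positions in which two equal-length configurations differ."""
    if len(c1) != len(c2):
        raise ValueError("configurations must have the same length")
    return sum(a != b for a, b in zip(c1, c2))


if __name__ == "__main__":
    print(hamming((1, 0, 0, 1), (0, 1, 0, 0)))  # example from the text
```

The maximum value n is attained for a configuration and its bitwise complement, for instance `hamming((1, 0, 1), (0, 1, 0))`.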
Similarity Measures for Sets of Configurations. Our focus is on studying the degree
of similarity between predecessors of similar configurations. To do this, we define the
following distance measures between two nonempty sets of configurations S1 and S2.
548 J. D. Priest et al.
$$\mathrm{MinSep}(S_1, S_2) = \min\{H(C, C') : C \in S_1,\ C' \in S_2\}.$$
$$\mathrm{MaxSep}(S_1, S_2) = \max\{H(C, C') : C \in S_1,\ C' \in S_2\}.$$
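A direct, quadratic-time sketch of these set-level measures follows. The function names are hypothetical. `avg_sep` is included because an average separation is used later in the text, but its definition here (the mean of H over all pairs) is an assumption, since the definition is not given in this section.

```python
from itertools import product


def hamming(c1, c2):
    """Number of positions in which two configurations differ."""
    return sum(a != b for a, b in zip(c1, c2))


def _pairwise(s1, s2):
    if not s1 or not s2:
        raise ValueError("both sets must be nonempty")
    return [hamming(c, d) for c, d in product(s1, s2)]


def min_sep(s1, s2):
    """Smallest Hamming distance over all pairs (C, C') in S1 x S2."""
    return min(_pairwise(s1, s2))


def max_sep(s1, s2):
    """Largest Hamming distance over all pairs (C, C') in S1 x S2."""
    return max(_pairwise(s1, s2))


def avg_sep(s1, s2):
    """Mean Hamming distance over all pairs (assumed definition)."""
    d = _pairwise(s1, s2)
    return sum(d) / len(d)
```

Every predecessor in one set is compared with every predecessor in the other, so the cost grows with the product of the two set sizes.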
3 Analytical Results
Overview. In this section, we first show that the problem of finding the predecessors of
a given configuration of a SyDS can be expressed as an instance of SAT. This transfor-
mation forms the basis for the experimental results presented in Sect. 4. In addition, we
present several analytical results regarding the similarities of predecessor sets of two
configurations of a SyDS. Throughout this section, the reader should bear in mind that
for any configuration C, σ (C) denotes the successor of C and Π (C) denotes the set of
all predecessors of C.
Reducing Predecessor Finding to SAT. We assume that the nodes of the underlying
graph of the given SyDS are numbered 1 through n and that the local function at node
Since c_i is a known 0 or 1 value, the expression given in Eq. (1) can be simplified. If
c_i = 0, the above expression simplifies to ¬f_i(x_{i1}, x_{i2}, ..., x_{ik}). Likewise, if c_i = 1, the
above expression simplifies to f_i(x_{i1}, x_{i2}, ..., x_{ik}).
Using P_i to denote the subexpression given by Eq. (1) for node i, the condition to be
satisfied for C′ to be a predecessor of C is given by
$$P_1 \wedge P_2 \wedge \ldots \wedge P_n. \qquad (2)$$
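The conjunction above can be checked by brute force on small systems. The sketch below enumerates all 2^n candidate configurations and keeps those satisfying every P_i. It deliberately replaces the SAT solver by exhaustive search, so it only illustrates the condition and is not the transformation itself. The graph and the thresholds are hypothetical, and threshold local functions over each node's closed neighborhood are assumed.

```python
from itertools import product


def local_output(adj, thresholds, x, i):
    """Value of the threshold function f_i on candidate configuration x."""
    ones = x[i] + sum(x[u] for u in adj[i])
    return 1 if ones >= thresholds[i] else 0


def predecessors(adj, thresholds, target):
    """All configurations x such that P_1 and ... and P_n hold for `target`.

    P_i holds when f_i evaluated on x equals the known value c_i = target[i].
    Exhaustive search over 2^n candidates: usable for small n only.
    """
    nodes = list(adj)
    found = []
    for bits in product((0, 1), repeat=len(nodes)):
        x = dict(zip(nodes, bits))
        if all(local_output(adj, thresholds, x, i) == target[i] for i in nodes):
            found.append(tuple(bits))
    return found


# Hypothetical 5-node system used only for illustration.
ADJ = {
    "a": ["b", "c", "d"],
    "b": ["a", "c", "d"],
    "c": ["a", "b", "d"],
    "d": ["a", "b", "c", "e"],
    "e": ["d"],
}
THRESHOLDS = {"a": 1, "b": 2, "c": 2, "d": 3, "e": 1}
```

A configuration for which `predecessors` returns an empty list is a Garden of Eden configuration of the assumed system.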
Proposition 2. Let G be an arbitrary connected graph, and let n be the number of nodes
in G. Then, there is a SyDS S with underlying graph G, such that S has the following
properties: (i) every configuration has a predecessor and (ii) for every configuration C1,
there is a configuration C2 such that H(C1, C2) = 1 and MaxSep(Π(C1), Π(C2))
= MinSep(Π(C1), Π(C2)) = AvgSep(Π(C1), Π(C2)) = n.
Proof: See [20].
We now show the existence of SyDSs in which there are pairs of configurations which
have the maximum level of dissimilarity but their predecessors are 1-close.
Proposition 3. Let G be an arbitrary graph, and let Δ be the maximum node degree
of G. Then, there is a SyDS S with underlying graph G, such that S has the
following properties: (i) every configuration has a predecessor and (ii) for every
configuration C1, there is a configuration C2 such that H(C1, C2) = Δ + 1 and
MaxSep(Π(C1), Π(C2)) = MinSep(Π(C1), Π(C2)) = AvgSep(Π(C1), Π(C2))
= 1.
Proof: See [20].
The following corollary is a direct consequence of Proposition 3 by taking the underly-
ing graph of the SyDS to be the star graph on n nodes.
Corollary 1. For any integer n ≥ 2, there is a SyDS S with n nodes satisfying the
following property: there is a pair of configurations C1 and C2 with H(C1, C2) = n
and MaxSep(Π(C1), Π(C2)) = 1.
We now present a result that establishes the computational complexity of comput-
ing distance measures for predecessor configurations. The decision problem, which we
call Minimum Predecessor Separation (MPS), is the following: given a SyDS S, two
configurations C1 and C2, and a positive integer q, is MinSep(Π(C1), Π(C2)) ≤ q?
Using the known result that the Predecessor Existence problem (i.e., given a SyDS S
and a configuration C, does C have a predecessor?) is NP-complete [4], it can be shown
that MPS is also NP-complete. This result is stated below.
Proposition 4. The MPS problem is NP-complete.
Proof: See [20].
Our proof of Proposition 4 relies on the fact that it is NP-hard to decide whether a
configuration C has a predecessor. We now present a stronger NP-completeness result. We
show that the MPS problem is NP-complete even when we are given predecessors of C1
and C2. We call the decision problem when this extra information is given Minimum
Predecessor Separation Given Predecessors (MPSGP). Note that since the predecessors
of C1 and C2 are specified in a given MPSGP problem instance, it is unnecessary to
explicitly specify C1 and C2. Thus, we formalize the MPSGP problem as follows: given
a SyDS S, and two configurations C1 and C2, is MinSep(Π(σ(C1)), Π(σ(C2))) <
H(C1, C2)? Our next result points out the NP-hardness of this problem.
Theorem 1. The MPSGP problem is NP-complete.
Proof: See [20].
4 Experimental Results
Overview. The analytical results presented in Sect. 3 show that in general, SyDSs may
exhibit extreme behaviors with respect to evolution of configurations. So, in the exper-
imental phase, our goal was to understand the behavior for restricted classes of graphs
and local functions. We generated SyDSs whose underlying graphs are from special
classes of graphs and whose local functions are from restricted classes of Boolean
functions. We generated pairs of configurations that are h-close for small values of h
and examined the range of Hamming distances for their sets of predecessors. We used
the transformation from the predecessor problem to SAT discussed in Sect. 3.
SyDS Construction. We investigated several types of underlying graph structures
including Erdős–Rényi models, lattice/grid graphs, and Watts-Strogatz small-world net-
works [18]. All graphs were created using the NetworkX library [11]. The Erdős–Rényi
graphs were constructed such that the estimated mean degree of the graph was 16. Grid
graphs were constructed such that each node connected to exactly four other nodes.
Nodes in the Watts-Strogatz small world networks were initially wired to their eight
nearest neighbors; then each edge had a 50% chance to be rewired to a random node in
the graph.
To examine the similarity of configurations, we considered several local functions.
All SyDSs constructed and tested were uniform SyDSs⁵ with threshold functions ranging
from threshold 1 (equivalent to Boolean OR) to threshold 4. We chose threshold
functions as they are monotone Boolean functions. As shown in Sect. 3, SyDSs with
similar configurations and non-monotone local functions (such as exclusive OR) can
have predecessors with very high variability in their Hamming distances. With thresh-
old functions, we expected the Hamming distances of the predecessors of similar con-
figurations to show less extreme variance.
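The following minimal sketch illustrates a uniform threshold SyDS and a brute-force search for predecessors. It assumes that each local function reads the closed neighborhood of a node (the node and its neighbors); the function names and the 4-cycle example are illustrative and are not taken from the experiments.

```python
from itertools import product

def step(adj, config, threshold):
    """One synchronous update: node v becomes 1 iff at least `threshold`
    nodes in its closed neighborhood (v plus its neighbors) are 1."""
    return tuple(
        1 if config[v] + sum(config[u] for u in adj[v]) >= threshold else 0
        for v in range(len(adj))
    )

def hamming(c1, c2):
    """Number of nodes whose states differ between two configurations."""
    return sum(a != b for a, b in zip(c1, c2))

def predecessors(adj, target, threshold):
    """Brute-force predecessor set; feasible only for very small graphs."""
    n = len(adj)
    return [c for c in product((0, 1), repeat=n)
            if step(adj, c, threshold) == target]

# Hypothetical example: a 4-cycle with threshold 1 (Boolean OR).
cycle = [[1, 3], [0, 2], [1, 3], [0, 2]]
base = (1, 1, 1, 1)
preds = predecessors(cycle, base, threshold=1)
print(len(preds))  # 11: every configuration with at least two 1s
```

Exhaustive enumeration grows as 2^n, which is why a SAT-based formulation is needed for graphs of realistic size.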
Procedure for Generating Configurations and Their Predecessors. We imple-
mented the transformation from the predecessor problem to SAT in Python. We limited
the number of predecessors generated for each configuration for the following reasons.
In order to compute the minimum, average, and maximum Hamming distances between
two sets S1 and S2 of predecessors, each predecessor in S1 must be compared with each
predecessor from S2 . For example, with just 10^4 predecessors for each configuration,
the number of such comparisons is 10^8 . In addition to the time used for such a computation,
attempting to exhaustively find and record every predecessor for larger graph sizes
could generate several terabytes of data.
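The all-pairs comparison described above can be sketched as follows. This is an illustrative helper (the name `separation_stats` is an assumption, not from the paper); its cost is |S1| · |S2| distance computations, which is the quadratic blow-up that motivates capping the number of predecessors.

```python
def separation_stats(s1, s2):
    """Return (min, mean, max) Hamming distance over all pairs in s1 x s2."""
    dists = [sum(a != b for a, b in zip(c1, c2)) for c1 in s1 for c2 in s2]
    return min(dists), sum(dists) / len(dists), max(dists)

# Two tiny, made-up predecessor sets on three nodes.
s1 = [(0, 0, 1), (1, 1, 1)]
s2 = [(0, 0, 0), (1, 0, 1)]
print(separation_stats(s1, s2))  # (1, 1.5, 3)
```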
We defined our “base” configuration as the one with all node states set to 1. To
generate a configuration with Hamming distance h from the base configuration, the
states of h random nodes were changed from 1 to 0. In total, 20 configurations with
different Hamming distances were generated. We generated up to 10^4 solutions for each
predecessor problem. We computed the necessary Hamming distance values between
the set of predecessors for the base configuration and the sets of predecessors of the 20
configurations derived from the base configuration. Our results provide an indication of
the minimum and maximum Hamming distances. In the plots shown in this section, the
mean Hamming distance between the predecessors of the base configuration and those
of the 20 derived configurations is shown, with error bars representing the minimum
and maximum Hamming distances of the solution sets. For each threshold value, we fit
a linear trend line to the results.
⁵ A uniform SyDS is one in which all the nodes have the same local function.
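As a rough illustration of how the derived configurations can be produced, the sketch below flips h randomly chosen nodes of the all-ones base configuration from 1 to 0. The node count, the range of h, and the seed are assumptions chosen for the example; the text does not list the exact values of h that were used.

```python
import random

def derived_configuration(n, h, rng):
    """All-ones configuration on n nodes with h randomly chosen nodes set to 0."""
    config = [1] * n
    for v in rng.sample(range(n), h):   # h distinct nodes
        config[v] = 0
    return tuple(config)

rng = random.Random(0)                  # fixed seed only for reproducibility
n = 64                                  # hypothetical graph size
base = (1,) * n
derived = [derived_configuration(n, h, rng) for h in range(1, 21)]

# Each derived configuration is exactly h-close to the base configuration.
distances = [sum(a != b for a, b in zip(base, c)) for c in derived]
```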
Table 1. Table showing minimum, maximum and average Hamming distance values for grids and
Watts-Strogatz small world networks with 16 nodes
Fig. 2. Average Hamming distance values for grid and Watts-Strogatz networks
Hamming Distance Results for Small Networks. Table 1 shows the minimum, aver-
age, and maximum Hamming distances for 16 node grids and Watts-Strogatz networks.
For these small networks, we were able to generate all predecessors for each configu-
ration. The table shows the results for the configurations for which both the grid and
the Watts-Strogatz graph had predecessors. For both classes of graphs and all thresh-
old values, the minimum and average predecessor Hamming distance show a roughly
monotonic non-decreasing trend with increase in the Hamming distance of a configuration
from the base configuration. The maximum Hamming distance also increased
monotonically for the grid graphs; however, for the Watts-Strogatz networks it started at
the highest value (16) and stayed very close to that value.
Hamming Distance Results for Large Networks. Our results for the 1024 node
square grid network and the 1024 node Watts-Strogatz small world network are shown
in Figs. 2a and 2b respectively. The average Hamming distances of predecessors for
these two graphs show similar trends. For both networks, the Hamming distance values
for threshold 1 were lower compared to the other threshold values; moreover, the aver-
age Hamming distance increased linearly with increase in the Hamming distance of a
configuration from the base configuration. Threshold 2 showed the highest values of
average Hamming distances for both networks; further, the average Hamming distance
also showed a stable trend as configuration Hamming distance was increased. For the
square grid graph, threshold 4 also showed a stable trend. In contrast, threshold 4 results
for the Watts-Strogatz graph show a linearly increasing trend similar to Threshold 1. For
both networks, the range of minimum and maximum Hamming distances was within 50
units of the average.
Average Hamming distance values for an Erdős–Rényi graph with 256 nodes are
shown in Fig. 3. There, the minimum and maximum Hamming distances in each
set were within 30 units of the average and are not shown in Fig. 3 to avoid
clutter. The average Hamming distance values for Threshold 1 once again show
a linearly increasing trend with increase in the Hamming distance from the base
configuration. Threshold 4 also shows a linearly increasing trend but with a slope
smaller than that of threshold 1. The values for Threshold 2 show a more or less
stable trend.
Fig. 3. Graph showing average Hamming distance values for a 256 node Erdős-Rényi network
Number of Clauses Generated and SAT Solver Runtime. We conducted tests to
compare the performance of the two most recently updated SAT solvers, namely Clasp
[7] and Glucose [8]. For these experiments, graphs were generated in the same man-
ner as previously mentioned except that Erdős–Rényi graphs for this experiment were
constructed to have an average degree of 4. For each predecessor problem in a sample
of 20, with every local function set to the 1-threshold function, we computed the
number of clauses generated and measured the average CPU time⁶ each SAT solver took
to produce one solution. The results are shown in Table 2.
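To make the clause counts concrete, here is one possible CNF encoding of the predecessor problem for the 1-threshold (OR) function. It is only a sketch under stated assumptions: local functions read the closed neighborhood, literals follow the DIMACS convention of signed integers, and the encoding is not necessarily the transformation from Sect. 3 that produced the numbers in Table 2. The brute-force `count_models` helper stands in for a SAT solver on tiny instances.

```python
from itertools import product

def or_predecessor_cnf(adj, target):
    """Clauses whose satisfying assignments are the predecessors of `target`
    under the OR rule; variable v + 1 is the predecessor state of node v."""
    clauses = []
    for v, nbrs in enumerate(adj):
        closed = [v] + list(nbrs)
        if target[v] == 1:
            clauses.append([u + 1 for u in closed])     # some input is 1
        else:
            clauses.extend([-(u + 1)] for u in closed)  # every input is 0
    return clauses

def count_models(clauses, n):
    """Brute-force model count over n Boolean variables."""
    count = 0
    for bits in product((False, True), repeat=n):
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            count += 1
    return count

cycle = [[1, 3], [0, 2], [1, 3], [0, 2]]           # hypothetical 4-cycle
clauses = or_predecessor_cnf(cycle, (1, 1, 1, 1))
print(len(clauses), count_models(clauses, 4))      # 4 clauses, 11 predecessors
```

Under this encoding a node with target state 1 contributes one clause, while a node with target state 0 contributes one unit clause per input, so the clause count stays close to the number of nodes when few states are 0.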
There was no significant difference between the Glucose and Clasp SAT solvers in
terms of CPU time taken to obtain a single solution to a SAT problem. The only notable
exception is that for the larger Watts-Strogatz graph, Clasp was faster than Glucose
(8.418 s vs 11.219 s). The larger amount of time used for this graph could potentially
be due to the larger average degree. Clasp was eventually chosen for our experiments
because it can generate all the solutions for a given SAT instance.
⁶ Experiments were run on a single core of a 2.80 GHz Intel Core i5-8400 CPU and with 16 GB
of RAM.
Table 2. Table showing the number of clauses in the SAT instance generated from a predecessor
problem and the CPU time to generate a solution for several networks
                  2^16 = 65,536 nodes                  2^18 = 262,144 nodes
Network type      Clauses  Clasp (s)  Glucose (s)      Clauses  Clasp (s)  Glucose (s)
Square Grid       65806    0.059      0.054            262414   0.213      0.201
Watts-Strogatz    77614    0.552      0.772            299310   8.418      11.219
Erdős–Rényi       65686    0.096      0.129            262800   0.882      0.863
Acknowledgments. We thank the referees for their comments. This work is partially supported
by NSF Grants ACI-1443054 (DIBBS), IIS-1633028 (BIG DATA), CMMI-1745207 (EAGER),
OAC-1916805 (CINES), CCF-1918656 (Expeditions) and IIS-1908530.
References
1. Adiga, A., Kuhlman, C.J., Marathe, M.V., Mortveit, H.S., Ravi, S.S., Vullikanti, A.: Graphi-
cal dynamical systems and their applications to bio-social systems. Springer Int. J. Adv. Eng.
Sci. Appl. Math. 11(2), 153–171 (2019)
2. Ahmed, N.K., Alo, R.A., Amelink, C.T., et al.: net.science: a cyberinfrastructure for sus-
tained innovation in network science and engineering. In: Gateway (2020)
3. Barrett, C.L., Hunt III, H.B., Marathe, M.V., Ravi, S.S., Rosenkrantz, D.J., Stearns, R.E.:
Complexity of reachability problems for finite discrete dynamical systems. J. Comput. Syst.
Sci. 72(8), 1317–1345 (2006)
4. Barrett, C., Hunt III, H.B., Marathe, M.V., Ravi, S.S., Rosenkrantz, D.J., Stearns, R.E.,
Thakur, M.: Predecessor existence problems for finite discrete dynamical systems. Theoret.
Comput. Sci. 386(1), 3–37 (2007)
5. Chistikov, D., Lisowski, G., Paterson, M., Turrini, P.: Convergence of opinion diffusion is
PSPACE-complete. CoRR abs/1912.09864 (2019). http://arxiv.org/abs/1912.09864
6. Durand, B.: A random NP-complete problem for inversion of 2D cellular automata. Theoret.
Comput. Sci. 148(1), 19–32 (1995)
7. Gebser, M., Kaufmann, B., Neumann, A., Schaub, T.: Clasp: a conflict-driven answer set
solver. In: Baral, C., Brewka, G., Schlipf, J. (eds.) Logic Programming and Nonmonotonic
Reasoning, pp. 260–265. Springer, Heidelberg (2007)
8. The Glucose SAT solver (2016). https://www.labri.fr/perso/lsimon/glucose/
9. Green, F.: NP-complete problems in Cellular Automata. Complex Syst. 1(3), 453–474 (1987)
10. Gutowitz, H.: Cellular Automata: Theory and Experiment. North Holland (1989)
11. Hagberg, A., Schult, D., Swart, P.: NetworkX reference (2020). https://networkx.github.io/
documentation/latest/_downloads/networkx_reference.pdf
12. Kauffman, S., Peterson, C., Samuelsson, B., Troein, C.: Random Boolean network models
and the yeast transcriptional network. Proc. Natl. Acad. Sci. (PNAS) 100(25), 14796–14799
(2003)
13. Kauffman, S., Peterson, C., Samuelsson, B., Troein, C.: Genetic networks with canalyz-
ing Boolean rules are always stable. Proc. Natl. Acad. Sci. (PNAS) 101(49), 17102–17107
(2004)
14. Kawachi, A., Ogihara, M., Uchizawa, K.: Generalized predecessor existence problems for
Boolean finite dynamical systems. In: 42nd International Symposium on Mathematical Foun-
dations of Computer Science (MFCS 2017), pp. 8:1–8:13 (2017)
15. Kosub, S., Homan, C.M.: Dichotomy results for fixed point counting in Boolean dynamical
systems. In: Proceedings of the 10th Italian Conference on Theoretical Computer Science,
pp. 163–174 (2007)
16. Marathe, M.V., Ravi, S.S., Rosenkrantz, D.J., Stearns, R.E.: Computational aspects of fault
location and resilience problems for interdependent infrastructure networks. In: International
Conference on Complex Networks and their Applications, pp. 879–890. Springer, Heidelberg
(2018)
17. Mortveit, H., Reidys, C.: An Introduction to Sequential Dynamical Systems. Springer, New
York (2007)
18. Newman, M., Barabási, A.L., Watts, D.J.: The Structure and Dynamics of Networks. Prince-
ton University Press, Princeton (2006)
19. Ogihara, M., Uchizawa, K.: Computational complexity studies of synchronous Boolean finite
dynamical systems on directed graphs. Inf. Comput. 256, 226–236 (2017)
20. Priest, J.D., Marathe, M.V., Ravi, S.S., Rosenkrantz, D.J., Stearns, R.E.: Evolution of sim-
ilar configurations in graph dynamical systems. Technical Report for 2020, Network Sys-
tems Science and Advanced Computing (NSSAC) Division, Biocomplexity Institute and
Initiative, University of Virginia, Charlottesville, VA, USA. https://drive.google.com/file/d/
1Bc2idtlFnk7uidLnDEi6U3iggET0O0dh/view?usp=sharing
21. Rosenkrantz, D.J., Marathe, M.V., Ravi, S.S., Stearns, R.E.: Testing phase space proper-
ties of synchronous dynamical systems with nested canalyzing local functions. In: Proceed-
ings of the 17th International Conference on Autonomous Agents and MultiAgent Systems,
AAMAS 2018, Stockholm, Sweden, 10–15 July 2018, pp. 1585–1594 (2018)
22. Information regarding SAT solvers (2018). http://www.satlive.org
23. Tosic, P.T.: On the complexity of enumerating possible dynamics of sparsely connected
Boolean network automata with simple update rules. In: Automata 2010 - 16th International
Workshop on CA and DCS, pp. 125–144 (2010)
24. Tosic, P.T.: Phase transitions in possible dynamics of cellular and graph automata models
of sparsely interconnected multi-agent systems. In: Proceedings of the 16th Conference on
Autonomous Agents and MultiAgent Systems, AAMAS 2017, São Paulo, Brazil, 8-12 May
2017, pp. 474–483 (2017)
25. Valente, T.W.: Social network thresholds in the diffusion of innovations. Soc. Netw. 18, 69–
89 (1996)
26. Wolfram, S.: Theory and Applications of Cellular Automata. World Scientific (1987)
27. Wooldridge, M.: An Introduction to Multi-Agent Systems. Wiley, West Sussex (2002)
Congestion Due to Random Walk Routing
1 Introduction
The study of network capacity, sometimes referred to as load or congestion, is
over half a century old, and goes back to the pioneering work of Ford and Fulk-
erson [1] and Shannon [2] for the single commodity and to early attempts [3–7]
for the multicommodity flow solutions of the problem. This rather large liter-
ature provides a characterization of the load or, more specifically, the minimal
capacity required, in terms of sum of link capacities needed based on cut values,
which in case of the single commodity model are both necessary and sufficient
and for the multicommodity case generally provide necessary conditions.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 943, pp. 556–567, 2021.
https://doi.org/10.1007/978-3-030-65347-7_46
Single commodity or multicommodity network flow models in communication,
transportation and numerous other settings typically assume shortest path
routing. There are natural settings in which alternative routing not involving
shortest paths may be required. For example, it may happen that longer routes
are used for load balancing or, in the case of capacitated networks, to avoid
network expansion [8,9]. Or the inverse problem may be posed: to determine
weights so that shortest path routes determined from these weights result in
smallest load across the network [10]. Given the universality of the network flow
model, there are a vast number of applications of the model, and the list is too
large to enumerate here.
There are few analytical results concerning the multicommodity flow problem
with shortest path routing, in the sense of having a closed form solution as a
function of a small number of parameters characterizing the network and the
commodities. These include characterization of the maximal load for hyperbolic
graphs [11,12]. In this setting, for a network of N nodes one assumes 1 unit of
(directed) flow between all N (N − 1) node pairs, and then asks how the load
scales due to shortest path routing as a function of N . This measure is sometimes
referred to as the betweenness centrality, see [13].
In this paper, we study the near opposite of shortest path routing: when flows
are routed in a uniformly random manner, each flow starting from its source and
moving at each step randomly to a neighboring node and only stopping when
the destination of the flow is reached. More specifically, we consider the case
when one unit of traffic, or a single packet, is injected into the network at every
time step at each node i for each possible destination node j ≠ i. Thus there
are N (N − 1) units of traffic (or packets) injected into the N -node network
at every time step. The network is assumed to be connected, i.e. have a single
component. We first demonstrate that a steady-state distribution is achieved and
then derive an expression for the expected flow, or the average number of packets
passing through each node, in terms of the eigenvalues of the graph Laplacian.
To illustrate the results more concretely, we estimate the largest mean loads for
a few networks whose distribution of Laplacian eigenvalues are known. We note
that similar but not identical measures to the expected load at each node have
been investigated numerically in the context of node ranking, see [14].
2 General Results
2.1 Time Evolution Equations
As described in the previous section, we consider an undirected connected graph
G(N, E) with N nodes, in which packets of traffic are injected at various nodes in
a deterministic manner and move towards specified destinations. The dynamics
are discrete time, i.e. packets of traffic move from node to node at time t =
0, 1, 2, 3 . . . . At each time step, exactly N − 1 packets of unit size are injected
into the graph at each node k, with one packet heading towards each other node
in the graph l ≠ k. Thus there are precisely N (N − 1) packets injected into the
graph at each time step. Any packet that is present at node i at time step t and
whose destination is not i moves to one of the nodes adjacent to i at time t + 1.
For this, one of the di nodes adjacent to i is chosen randomly, with probability
558 O. Narayan et al.
equal to 1/di . However, any packet of traffic that is at its destination at time t
is removed from the network, and is no longer present at time t + 1. Note that
a packet that returns to its source as it moves around randomly continues as it
would from any other node. The congestion or load at any node at any time step
is a random variable equal to the number of packets that are being processed
at that node. We are interested in the expected value of the number of packets
at each node. We expect that in steady state, if and when it exists, packets
are (injected and) removed from each node at the same rate, i.e. N − 1 packets
per time step. We seek to find the steady state load, i.e. the average number of
packets, at all the nodes of the network.
As a byproduct, we obtain the average time τ (or number of steps in its path)
that a packet takes to go from a randomly chosen source node to a randomly
chosen destination. A packet that hops from source to destination in t steps is in
the network for t time steps. (We have assigned one time step each to the source
and destination nodes.) The average number of packets at each node, summed
over all the nodes in the graph, is therefore the product of the total injection
rate N (N − 1) and τ.
Remark 1. We shall use N to represent both the set of nodes in the graph as
well as their count |N | without danger of confusion. Also, we write k ∼ j to
mean that node k is a neighbor of node j, i.e., k and j are adjacent, and k ≁ j
when they are not; and refer to the adjacency matrix $(A_{ij})$, the Laplacian $(L_{ij})$
and the normalized Laplacian $(\mathcal{L}_{ij})$ (for 0 ≤ i, j ≤ N ) of the undirected graph
G(N, E), with their standard definitions:
$$
A_{ij} = \begin{cases} 0, & i = j \\ 1, & i \sim j \\ 0, & i \nsim j \end{cases}, \qquad
L_{ij} = \begin{cases} d_i, & i = j \\ -1, & i \sim j \\ 0, & i \nsim j \end{cases}, \qquad
\mathcal{L}_{ij} = \begin{cases} 1, & i = j \\ -(d_i d_j)^{-1/2}, & i \sim j \\ 0, & i \nsim j \end{cases}
\tag{1}
$$
Proof. We first consider the case of the traffic flowing from a single source node
k to a single destination node l. Let $X_i^{kl}(t)$ be the random variable representing
the number of packets at node i at time t and $Z_{ji}^{kl}(t+1)$ be the random variable
representing the number of packets sent out of node j, a neighbor of i, to i at
time t. This assumes tacitly that an outgoing packet from node j that leaves j
at time t reaches a neighboring node i at time t + 1; an incoming packet to node
i from node j that reaches i at time t must leave node j at time t − 1. Then the
boundary condition Eq. (2), and the no-escape condition from destination l Eq.
(3), both hold:
$$X_i^{kl}(0) = \delta_{ik}, \quad 0 \le i \le N \tag{2}$$
$$Z_{li}^{kl}(t) = 0, \quad \forall t \ge 0. \tag{3}$$
Flow balance for outgoing packets implies that for all neighbors i of a node j ≠ l,
$$X_j^{kl}(t) = \sum_{i \sim j \neq l} Z_{ji}^{kl}(t+1), \quad 1 \le i, j \le N,\ 0 \le t, \tag{4}$$
which simply states that packets at node j at time t move out to its incident
links at time t + 1. These same packets arrive at time t + 1 at adjacent nodes
$$X_i^{kl}(t+1) = \delta_{ik} + \sum_{l \neq j \sim i} Z_{ji}^{kl}(t+1), \quad 1 \le i, j \le N,\ 0 \le t. \tag{5}$$
Notice that the first term on the right hand side of Eq. (5) accounts for the
fact that one packet is injected at node k for destination l at each time step.
The second term represents the packets that move to node i at time t + 1 from
adjacent nodes at time t. The sum in this term excludes the node l because any
packet that was at the node l (the destination) at time t is removed from the
network and is no longer present at time t + 1.
Further, our assumption of uniformly random routing of packets from each
node to its neighbors implies that for any neighbor i of a node j ≠ l,
$$P\{Z_{ji}^{kl}(t+1) = z\} = \binom{X_j^{kl}(t)}{z} \left(\frac{1}{d_j}\right)^{z} \left(1 - \frac{1}{d_j}\right)^{X_j^{kl}(t) - z}, \quad 0 \le z \le X_j^{kl}(t),\ 0 \le t. \tag{6}$$
Taking ensemble expectation of Eqs. (6) and (5) and using the standard expression
for the mean of the binomial distribution for Eq. (6), we get that for all
0 ≤ i, j ≤ N
$$E[Z_{ji}^{kl}(t+1)] = \frac{1}{d_j} E[X_j^{kl}(t)], \quad l \neq j \sim i \tag{7}$$
$$E[X_i^{kl}(t+1)] = \delta_{ki} + \sum_{l \neq j \sim i} E[Z_{ji}^{kl}(t+1)] \tag{8}$$
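Equations (7) and (8) define a deterministic recursion for the expected loads, which can be iterated directly. The sketch below does this for a single source-destination pair; the function name, the fixed number of iterations, and the three-node path graph are illustrative assumptions rather than part of the original derivation.

```python
def expected_load(adj, k, l, steps=200):
    """Iterate E[X_i(t+1)] = delta_ik + sum over neighbors j != l of E[X_j(t)] / d_j."""
    n = len(adj)
    x = [1.0 if i == k else 0.0 for i in range(n)]        # Eq. (2)
    for _ in range(steps):
        nxt = [1.0 if i == k else 0.0 for i in range(n)]  # packet injected at k
        for j in range(n):
            if j == l:
                continue                  # packets at the destination are removed
            share = x[j] / len(adj[j])    # Eq. (7): equal split over d_j neighbors
            for i in adj[j]:
                nxt[i] += share           # Eq. (8): arrivals at neighbor i
        x = nxt
    return x

path = [[1], [0, 2], [1]]                 # path graph 0 - 1 - 2
loads = expected_load(path, k=0, l=2)
print([round(v, 6) for v in loads])       # [2.0, 2.0, 1.0]
```

On this example the load at the destination converges to 1, in line with the statement that in steady state one unit of traffic per time step reaches node l.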
Proof. In steady state, we know that the load flowing into the node l at any
time step must be equal to the load injected into the node k, i.e. unity. Therefore
$\sum_j A_{lj}\, p_j^{kl} / d_j = 1$, and we can extend Eq. (11) as
$$p_i^{kl} = \delta_{ik} - \delta_{il} + \sum_j A_{ij} \frac{p_j^{kl}}{d_j} \tag{14}$$
with $r_l^{kl} = 0$. Here $(L_{ij})$ is the Laplacian for the graph. Since $(L_{ij})$ is a real
symmetric matrix, it has a complete set of real eigenvalues $\lambda_\alpha$ and real orthonormal
eigenvectors $\xi^\alpha$ for α = 0, 1, 2, . . . , N − 1. Using the standard properties of the
Laplacian, all the eigenvalues are non-negative, and since the graph has been
assumed to have one component, there is only one zero eigenvalue $\lambda_0$, with
eigenvector $\xi^0 = (1, 1, 1, \ldots, 1)/\sqrt{N}$. The denominator ensures that the normalization
condition $\sum_i \xi_i^0 \xi_i^0 = 1$ is satisfied.
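We define
$$\pi_{kl}^\alpha = \sum_i \xi_i^\alpha (\delta_{ik} - \delta_{il}) = \xi_k^\alpha - \xi_l^\alpha, \tag{16}$$
which is the projection of the right hand side of Eq. (15) on to the α'th
eigenvector. Note that $\pi_{kl}^0 = 0$. With this definition,
$$r_j^{kl} = \sum_{\alpha=1}^{N-1} \frac{\pi_{kl}^\alpha}{\lambda_\alpha}\, \xi_j^\alpha + c_{kl}\, \xi_j^0, \tag{17}$$
where $c_{kl}$ has to be chosen to make $r_l^{kl}$ equal to zero. Since $\xi_j^0$ is independent of
j, the condition $r_l^{kl} = 0$ yields
$$r_j^{kl} = \sum_{\alpha=1}^{N-1} \frac{\xi_k^\alpha - \xi_l^\alpha}{\lambda_\alpha}\, \xi_j^\alpha - \sum_{\alpha=1}^{N-1} \frac{1}{\lambda_\alpha} \left[ \xi_k^\alpha \xi_l^\alpha - (\xi_l^\alpha)^2 \right]. \tag{18}$$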
Averaging over all the random paths taken by the traffic packets, the steady
state load at any node j ≠ l is $p_j^{kl} = r_j^{kl} d_j$. For the l'th node, the load is
$E[X_l^{kl}] \neq p_l^{kl}$, since we defined $p_l^{kl}$ to be zero. However, in steady state we know
that the traffic flowing out of node l at any time step is unity, and this is equal to
the entire load $E[X_l^{kl}]$ at that time step. Therefore, in steady state, the load at
the j'th node is equal to $\Lambda_j^{kl} = d_j r_j^{kl} + \delta_{jl}$. Note that a unit of load from k to l is
counted at all the nodes it passes through, as well as the source and destination
nodes. Depending on how traffic is actually processed by the network, it may be
appropriate to change the weightage given to the source and destination nodes.
Summing over all source destination pairs, the total steady state load at the
j'th node is
$$\Lambda_j = d_j \sum_l \sum_{k \neq l} r_j^{kl} + N - 1. \tag{19}$$
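$$
\begin{aligned}
\Lambda_j &= (N-1) + d_j \sum_{\alpha=1}^{N-1} \frac{1}{\lambda_\alpha} \left[ N \sum_l (\xi_l^\alpha)^2 - \Big( \sum_l \xi_l^\alpha \Big)^2 \right] \\
&= (N-1) + N d_j \sum_{\alpha \neq 0} \frac{1}{\lambda_\alpha}.
\end{aligned}
\tag{20}
$$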
The load Λj at any node j is linearly dependent on the degree dj of the node.
Unlike the case when traffic between any source and destination flows along the
geodesic path connecting them, there is no concept of a network core.
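As a small sanity check of Eq. (20), consider an example that is not part of the original text: the path graph on N = 3 nodes, with degrees (1, 2, 1). Its Laplacian has eigenvalues 0, 1 and 3, so that

$$
\sum_{\alpha \neq 0} \frac{1}{\lambda_\alpha} = 1 + \frac{1}{3} = \frac{4}{3},
\qquad
\Lambda_j = (N-1) + N d_j \cdot \frac{4}{3} = 2 + 4 d_j,
$$

which gives $(\Lambda_1, \Lambda_2, \Lambda_3) = (6, 10, 6)$. Solving the steady-state balance equations by hand for each of the six source destination pairs and adding the loads node by node gives the same three values. Using the earlier observation that the total load equals $N(N-1)\tau$, the average sojourn time for this graph is $\tau = (6 + 10 + 6)/6 = 11/3$ time steps.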
Remark 2. The result Λj − (N − 1) ∝ dj can be obtained directly. An outline
of the proof is as follows. The traffic from node k to node l can be represented
as a stream of random walkers that diffuse through the network at discrete
time steps. At every time step in addition to the diffusive dynamics, a walker is
introduced at node k, and all the walkers at node l are removed. Comparing with
Eq. (11), the expected number of random walkers at node j at time t is equal
to $p_j^{kl}(t)$. If the random walks corresponding to all source destination pairs take
place simultaneously, with each walker labelled with an index corresponding to
its destination, we have random walkers with N different labels moving through
the network. In addition to the random walk dynamics, walkers are created
and destroyed at their sources and destinations respectively. In steady state,
the number of walkers created and destroyed at any time step are equal to
N − 1 at each node, but they have different labels. If we ignore the labels on
the random walkers, the creation and destruction of random walkers can be
ignored. The steady state solution for $\sum_k \sum_l p_j^{kl}(t)$ is proportional to the steady
state solution for a diffusion process on the graph with no sources or sinks. It
is easy to verify that, in this steady state, the number of random walkers at
any node is proportional to the degree of the node. Although this tells us that
[Λj − (N − 1)]/dj is a constant, independent of j, it does not tell us that this
constant is equal to $N \sum_{\alpha \neq 0} 1/\lambda_\alpha$.
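Remark 3. If instead of using the Laplacian, $L$, of the graph, we had used the
normalized Laplacian, $\mathcal{L}$, the entire proof would have proceeded as presented
except that Eq. (20) would have read as follows
$$
\begin{aligned}
\Lambda_j &= (N-1) + d_j \sum_{\alpha=1}^{N-1} \frac{1}{\nu_\alpha} \left[ N \sum_l \Big( \frac{\zeta_l^\alpha}{\sqrt{d_l}} \Big)^2 - \Big( \sum_l \frac{\zeta_l^\alpha}{\sqrt{d_l}} \Big)^2 \right] \\
&= (N-1) + N d_j \sum_{\alpha \neq 0} \frac{1}{\nu_\alpha}\, \mathrm{Var}\Big( \frac{\zeta_l^\alpha}{\sqrt{d_l}} \Big).
\end{aligned}
\tag{21}
$$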
Remark 4. So far we have dealt with connected undirected graphs. We point out
that when the graph is directed, then assuming that steady state distribution is
achieved, Remark 2 implies that the expected load Λj = N − 1 + Cπj where C
is some constant independent of the node and (πj ) is the principal eigenvector
of the random walk matrix for the directed graph, which for undirected graphs
is equal to (dj ).
Remark 5. We observe that the proofs of both theorems carry through essentially
unchanged if we replace the deterministic arrival of one packet at each source
node for each destination node at each time step with a Poisson arrival process
with a mean of one packet arrival per node per unit time for each destination
node. The same is true if we replace the uniform random routing from each node
to its neighbors with a more general value $w_{jk}/w_j$, with $w_j = \sum_{l \sim j} w_{jl}$, for the
probability of moving from a node j to any of its neighbors k, so long as
$w_{jk} = w_{kj} \neq 0$. However, the normalized Laplacian $(\mathcal{L}_{jk})$ and its eigenvalues
$\{\lambda_\alpha, \alpha < N\}$ in Theorem (2) are now replaced by $(\mathcal{L}^w_{jk})$ and its eigenvalues
$\{\lambda^w_\alpha, \alpha < N\}$, where $(\mathcal{L}^w_{jk})$ is now the weighted normalized Laplacian [15],
defined analogously as $\mathcal{L}^w_{jk} = \delta_{jk} - (1 - \delta_{jk})\, w_{jk}/\sqrt{w_j w_k}$ instead of
$\mathcal{L}_{jk} = \delta_{jk} - (1 - \delta_{jk})/\sqrt{d_j d_k}$, see (1) in Remark 1.
2.3 Discussion
In the large-N limit, the spectral density of the Laplacian, $\sum_\alpha \delta(\lambda - \lambda_\alpha)$, tends
to N ρ(λ), where ρ(λ) is smooth. If ρ(λ → 0) = 0, we have
$$N \sum_{\alpha \neq 0} \frac{1}{\lambda_\alpha} \to N^2 \int \frac{\rho(\lambda)}{\lambda}\, d\lambda \sim N^2 \tag{22}$$
for large N. The simplest example of this is when the graph Laplacian has
a spectral gap in the large N limit. A more subtle case is the Erdös-Rényi
model [17], where the spectral density is empirically found [18] to be close to
that of an infinite regular tree whose nodes all have the same degree as the average
degree of the Erdös-Rényi graph. Even though the infinite tree has a spectral
gap, the corresponding Erdös-Rényi spectral density has a narrow tail extending
down to λ = 0, so that there is no spectral gap [19]. However, in the next section
of this paper, we find numerically that $N \sum_{\alpha \neq 0} \lambda_\alpha^{-1} \sim N^2$ for Erdös-Rényi graphs,
in the large N limit, where d is the average degree of nodes in the graph. If
$\sum_{\alpha \neq 0} \lambda_\alpha^{-1} \sim N$, the average sojourn time in the graph is O(N ). To express this in
know how the diameter grows as N is increased; for small world graphs, τ grows
exponentially as the diameter of the graph is increased. Exponential growth
implies that a shortest-path walk starting at a site k and aimed at site l can
reach destination l exponentially faster on average than the random walk.
3 Numerical Results
In this section, we present numerical results for a few prototypical graph models:
the Erdös-Rényi random graphs in various regimes, the Barabási-Albert model
of preferential attachment [20,21], and hyperbolic grids.
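Because of its zero eigenvalue, the matrix L is not invertible. We define
the matrix M = L + P, where $P_{ij} = 1/N$. Then P is a projection operator:
$P \sum_\alpha c_\alpha \xi^\alpha = c_0 \xi^0$. Therefore
$$M \sum_\alpha c_\alpha \xi^\alpha = \sum_\alpha (\lambda_\alpha + \delta_{\alpha 0})\, c_\alpha \xi^\alpha. \tag{24}$$
Therefore M is an invertible matrix, with $\mathrm{Tr}[M^{-1}] = \sum_\alpha (\lambda_\alpha + \delta_{\alpha 0})^{-1}$, which is
equal to $1 + \sum_{\alpha \neq 0} \lambda_\alpha^{-1}$. We have to numerically evaluate $\mathrm{Tr}[M^{-1}] - 1$.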
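The trace identity above can be tried out on a small graph with the following sketch. It is written in plain Python with exact fractions so that the result can be checked by hand; the helper names are assumptions, and for graphs of the sizes considered in this section a numerical linear algebra library would be the practical choice.

```python
from fractions import Fraction

def inverse_trace(matrix):
    """Trace of the inverse, via Gauss-Jordan elimination on [A | I]."""
    n = len(matrix)
    a = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if a[r][col] != 0)
        a[col], a[pivot] = a[pivot], a[col]
        inv_p = 1 / a[col][col]
        a[col] = [x * inv_p for x in a[col]]
        for r in range(n):
            if r != col and a[r][col] != 0:
                f = a[r][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[col])]
    return sum(a[i][n + i] for i in range(n))

def inverse_eigenvalue_sum(adj):
    """Sum of 1/lambda over nonzero Laplacian eigenvalues: Tr[(L + P)^-1] - 1."""
    n = len(adj)
    m = [[Fraction(1, n) for _ in range(n)] for _ in range(n)]   # P_ij = 1/N
    for i, nbrs in enumerate(adj):
        m[i][i] += len(nbrs)          # L_ii = d_i
        for j in nbrs:
            m[i][j] -= 1              # L_ij = -1 for i ~ j
    return inverse_trace(m) - 1

path = [[1], [0, 2], [1]]             # Laplacian eigenvalues 0, 1, 3
s = inverse_eigenvalue_sum(path)
n = len(path)
loads = [(n - 1) + n * len(nbrs) * s for nbrs in path]   # Eq. (20)
print(s, loads)
```

For the three-node path graph this gives a sum of 4/3 and node loads of 6, 10 and 6.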
Figure 1 shows the results for $N \sum_{\alpha \neq 0} \lambda_\alpha^{-1}$ for the Erdös-Rényi model as N
is increased. Two cases are considered: when the average nodal degree $d_a$ is 2
and 4. Since $d_a > 1$, there is a giant component in each graph, containing an
N -independent fraction of the nodes in the large-N limit. All the other nodes
are in components whose size does not diverge as N is increased. Since we are
considering graphs with a single component in this paper, only the giant
component of each graph is retained. This means that the actual number of nodes in
the graph is a $d_a$-dependent fraction of the N shown in Fig. 1, but this does not
affect the functional form of large-N behavior. Each point shown in the figure
comes from averaging over eighty random graphs. We see that $N \sum_{\alpha \neq 0} \lambda_\alpha^{-1} \sim N^2$.
[Figure 1: vertical axis $[\sum \lambda_\alpha^{-1}]/N$ (0 to 0.6), horizontal axis N (100 to 10000); curves for
ER ($d_a$ = 2), ER ($d_a$ = 4), and SF with (p, q) = (2, 1), (3, 1), (4, 1), (4, 2), (2, 0), (3, 0).]
Fig. 1. Plot of $[\sum_{\alpha \neq 0} \lambda_\alpha^{-1}]/N$ versus N for the Erdös-Rényi model with average nodal
degree of 2 and 4. (For the first of these, the vertical axis is scaled by a factor of 0.25
to fit in the figure.) Also shown are the results for scale free networks, where each node
is born with p edges that link it to preexisting nodes, and the probability of linking
to a preexisting node is proportional to its degree with an offset of q; the results for
various values of (p, q) are shown. The curves are flat for all the cases, demonstrating
that $N \sum_{\alpha \neq 0} \lambda_\alpha^{-1} \sim N^2$.
Figure 1 also shows results for scale free networks. Following the extension
of Ref. [21] of the original model of Ref. [20], nodes enter the network one by
one, with each node born with p edges that link it to pre-existing nodes; the
probability of linking to any preexisting node is proportional to d − q if its degree
is d, where q is a parameter of the model. The figure shows results for (p, q) =
(2, 0), (3, 0), (2, 1), (3, 1), (4, 1) and (4, 2). As with the Erdös-Rényi graphs, each
point in the figure comes from averaging over eighty random graphs. Once again,
$N \sum_{\alpha \neq 0} \lambda_\alpha^{-1} \sim N^2$.
The first panel of Fig. 2 shows the results for $N \sum_{\alpha \neq 0} \lambda_\alpha^{-1}$ for the hyperbolic
grid $H_{3,7}$. The data clearly show that $N \sum_{\alpha \neq 0} \lambda_\alpha^{-1} \sim N^2 \ln N$.
In the Erdös-Rényi model, if da = c ln N instead of being independent of N,
there is a phase transition in the behavior of the model when c is increased to
1: the fraction of the nodes in the giant component approaches 1. The behavior
of graphs constructed using this model is very different in this regime. The
second panel of
−1 Fig. 2 shows the results for N λ−1
α when da = ln N. We
2
see that N λα grows slower than N for large N. Although the data are
566 O. Narayan et al.
Fig. 2. Plot of $[\sum_{\alpha \neq 0} \lambda_\alpha^{-1}]/N$ versus N for a) the
hyperbolic grid $H_{3,7}$, with seven triangles meeting at every node. All the
nodes that are less than some distance r from a center node are included; N
increases with r. With the x-axis on a log scale, the straight line fit
demonstrates that $N \sum_{\alpha} \lambda_\alpha^{-1} \sim N^2 \ln N$. b) the
Erdös-Rényi model (ER) with the average degree of the nodes equal to
$d_a = \ln N$. The straight line shown corresponds to $0.75 N^{-0.185}$.
not conclusive, they suggest a $\sim N^{2-\alpha}$ form. As with the other random
graph models, each point in the figure is obtained by averaging over eighty
random graphs.
As mentioned earlier in this paper, the maximum load for all the nodes in a
graph consists of, apart from an additive term, the product of
$N \sum_{\alpha} \lambda_\alpha^{-1}$ and the highest nodal degree in the graph.
For scale-free graphs, if the probability of a node having a degree d scales as
$p(d) \sim d^{-\gamma}$ for large d, the highest nodal degree in a graph with N
nodes scales as $N^{1/(\gamma-1)}$ for large N.
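The spectral sum that appears in these estimates can be checked on small graphs without an eigenvalue solver. The following is a minimal sketch, not the authors' code: it assumes a connected simple graph given as an edge list over nodes 0..n-1, and it relies on the identity that the sum of 1/λ over the nonzero Laplacian eigenvalues equals the trace of (L + J/n)^{-1} minus 1, where L is the graph Laplacian and J the all-ones matrix. The function names `inverse_spectrum_sum` and `cycle` are hypothetical, and the dense Gauss-Jordan inversion is chosen only to keep the example dependency-free.

```python
def inverse_spectrum_sum(n, edges):
    """Sum of 1/lambda over the nonzero Laplacian eigenvalues.

    Assumes a connected simple graph on nodes 0..n-1 (illustrative helper).
    Uses trace((L + J/n)^-1) - 1, where J is the all-ones matrix.
    """
    # Build A = L + J/n as a dense list-of-lists matrix.
    a = [[1.0 / n] * n for _ in range(n)]
    for u, v in edges:
        a[u][u] += 1.0
        a[v][v] += 1.0
        a[u][v] -= 1.0
        a[v][u] -= 1.0
    # Gauss-Jordan elimination on [A | I] with partial pivoting.
    inv = [[float(i == j) for j in range(n)] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        inv[col], inv[pivot] = inv[pivot], inv[col]
        p = a[col][col]
        a[col] = [x / p for x in a[col]]
        inv[col] = [x / p for x in inv[col]]
        for r in range(n):
            if r != col and a[r][col] != 0.0:
                f = a[r][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[col])]
                inv[r] = [x - f * y for x, y in zip(inv[r], inv[col])]
    return sum(inv[i][i] for i in range(n)) - 1.0


def cycle(n):
    """Edge list of the cycle graph on n nodes."""
    return [(i, (i + 1) % n) for i in range(n)]


# For the cycle on n nodes the sum is (n^2 - 1) / 12, i.e. 2 for n = 5.
s = inverse_spectrum_sum(5, cycle(5))
print(5 * s)  # the quantity N * sum(1/lambda) discussed in the text
```

The dense inversion costs O(n^3) time, so this is only meant for sanity checks on small graphs; multiplying the result by n gives the quantity whose scaling is discussed above.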
4 Conclusions
We showed that, for the uniform multicommodity flow problem on an arbitrary
connected graph under random routing, the mean load (or congestion) at each
node of the graph exists and is unique, and we derived an explicit expression
for it in terms of the spectrum of the graph Laplacian. Using this explicit
expression, we obtained analytical estimates for the mean load for hypercubic
lattices and regular trees in the large-size regime using their known spectral
densities, and we numerically computed the mean load for the Erdös-Rényi random
graphs, the scale-free Barabási-Albert preferential attachment graphs and
hyperbolic grids.
Acknowledgements. This work of Onuttom Narayan and Iraj Saniee was supported
by grants FA9550-11-1-0278 and 60NANB10D128 from AFOSR and NIST.
References
1. Ford, L.R., Fulkerson, D.R.: Maximal flow through a network. Canadian J. Math.
8, 399–404 (1956)
2. Elias, P., Feinstein, A., Shannon, C.: A note on the maximum flow through a
network. IEEE Trans. Inf. Theory 2, 117–119 (1956)
Random Walk Routing 567
3. Hu, T.C.: Multicommodity network flows. Op. Res. 11, 344–360 (1963)
4. Lomonosov, M.V.: Combinatorial approaches to multiflow problems. Discrete Appl.
Math 11, 1–94 (1985)
5. Shahrokhi, F., Matula, D.W.: The maximum concurrent flow problem. J. ACM
37, 318–334 (1990)
6. Papernov, B.A.: Feasibility of multicommodity flows (in Russian). In: Friedman,
A. (ed.) Studies in Discrete Optimization, pp. 230–261. Idzat. “Nauka”, Moscow
(1976)
7. Okamura, H., Seymour, P.D.: Multicommodity flows in planar graphs. J. Combi-
natorial Theory Series B 31, 75–81 (1981)
8. Minoux, M.: Network synthesis and optimum network design problems: models,
solution methods and applications. Networks 19, 313–360 (1989)
9. Magnanti, T.L., Wong, R.T.: Network design and transportation planning: models
and algorithms. Transp. Sci. 18, 1–55 (1984)
10. Applegate, D., Cohen, E.: Making intra-domain routing robust to changing and
uncertain traffic demands: understanding fundamental tradeoffs. In: ACM Sig-
comm 2003, pp. 313–324 (2003)
11. Narayan, O., Saniee, I.: Large-scale curvature of networks. Phys. Rev. E 84, 066108
(2011)
12. Jonckheere, E.A., Lou, M., Bonahon, F., Baryshnikov, Y.: Euclidean versus hyper-
bolic congestion in idealized versus experimental networks. Internet Math. 7, 1–27
(2011)
13. Newman, M.E.J.: Networks: An Introduction. Oxford University Press, Oxford
(2010)
14. Newman, M.E.J.: A measure of betweenness centrality based on random walks.
Soc. Netw. 27, 39–54 (2005)
15. Chung, F.R.K.: Spectral Graph Theory. American Mathematical Society, Provi-
dence (1997)
16. Lovasz, L.: Random walks on graphs: a survey. In: Miklós, D., Sós, V.T., Szônyi, T.
(eds.) Combinatorics-Paul Erdös is Eighty 2, pp. 1–46. Janos Bolyai Mathematical
Society, Keszthely (1993)
17. Erdös, P., Rényi, A.: On random graphs. Publicationes Mathematicae 6, 290–297
(1959)
18. Narayan, O., Saniee, I., Tucci, G.H.: Lack of hyperbolicity in asymptotic Erdös-
Rényi sparse random graphs. Internet Math. 11, 277–288 (2015)
19. Samukhin, A.N., Dorogovtsev, S.N., Mendes, J.F.F.: Laplacian spectra of complex
networks and random walks on them: are scale-free architectures really important?
Phys. Rev. E 77, 036115 (2008)
20. Barabási, A.L., Albert, R.: Emergence of scaling in random networks. Science 286,
509–512 (1999)
21. Dorogovtsev, S.N., Mendes, J.F.F., Samukhin, A.N.: Structure of growing net-
works: exact solution of the Barabasi-Albert model. Phys. Rev. Lett. 85, 4633–4636
(2000)
Strongly Connected Components
in Stream Graphs: Computation
and Experimentations
Connected components are among the most important concepts of graph the-
ory. They were recently generalized to stream graphs [18], a formal object that
captures the dynamics of nodes and links over time. Unlike other generalizations
available in the literature, these generalized connected components partition the
set of temporal nodes. This means that each node at each time instant is in one
and only one connected component. This makes these generalized connected com-
ponents particularly appealing to capture important features of objects modeled
by stream graphs. However, computation of connected components in stream
graphs has not been explored yet. Therefore, up to this date, they remain a for-
mal object with no practical use. In addition, the algorithmic complexity of the
problem is unknown, as well as the insight they may shed on real-world stream
graphs of interest.
After introducing key notations and definitions (Sect. 1), we present two
algorithms for strongly connected components, together with their complex-
ity (Sect. 2). We then apply these algorithms to several large-scale real-world
datasets and demonstrate their ability to describe such datasets (Sect. 3). We
also show that their performances may be improved greatly at the cost of rea-
sonable approximations.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 943, pp. 568–580, 2021.
https://doi.org/10.1007/978-3-030-65347-7_47
Connected Components in Stream Graphs 569
Fig. 1. (Left) An example of a stream graph. We display time T = [0, 10] on the
horizontal axis and nodes V = {A, B, C, D, E, F} on the vertical one. We
represent each node segment by a colored horizontal segment, with one color per
node, and each link segment in grey by a vertical line between the two involved
nodes at the link segment starting time, and a horizontal line from this time to
its ending time. (Right) The 16 strongly connected components of the stream graph.
We call node segment a couple ([b, e], u) such that [b, e] is a segment that
is not included in any other segment of $T_u$, and we denote by $\overline{W}$
the set of all node segments in W. We say that b is an arrival of u, and e a
departure. We denote by $N = |\overline{W}|$ the number of node segments in the
stream. Likewise, we call link segment a couple ([b, e], uv) such that [b, e] is
not included in any other segment of $T_{uv}$, and we denote by $\overline{E}$
the set of all link segments in E. We say that b is an arrival of uv, and e a
departure. We denote by $M = |\overline{E}|$ the number of link segments in the
stream. We call event time any time instant that corresponds to a node or link
arrival or departure. There are at most 2 · N + 2 · M event times. Notice that
the intervals considered above may be singletons, in which case b = e and
[b, e] = {b} = {e}. See Fig. 1 for an illustration.
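As an illustration of these definitions, the sketch below stores a stream as two plain lists of segments and collects its event times. The tuple layouts `(b, e, u)` and `(b, e, u, v)` and the function name `event_times` are assumptions made for this example, not a format prescribed by the paper.

```python
def event_times(node_segments, link_segments):
    """Sorted distinct arrival/departure times of a stream.

    node_segments: iterable of (b, e, u) for the node segment ([b, e], u).
    link_segments: iterable of (b, e, u, v) for the link segment ([b, e], uv).
    """
    times = set()
    for b, e, _u in node_segments:
        times.update((b, e))
    for b, e, _u, _v in link_segments:
        times.update((b, e))
    return sorted(times)


# Three node segments (N = 3), one of them a singleton, and one link segment (M = 1).
node_segments = [(0, 4, "A"), (0, 4, "B"), (2, 2, "C")]
link_segments = [(1, 3, "A", "B")]
times = event_times(node_segments, link_segments)
print(times)  # at most 2*N + 2*M = 8 distinct values
```

Because arrivals and departures may coincide, the number of distinct event times can be well below the 2 · N + 2 · M bound.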
The induced graph G(S) = (V(S), E(S)) is defined by
$V(S) = \{v, T_v \neq \emptyset\}$ and $E(S) = \{uv, \exists t, (t, uv) \in E\}$.
We denote by n = |V(S)| and m = |E(S)| its number of nodes and links,
respectively. We denote by $G_t = (V_t, E_t)$ the graph such that
$V_t = \{v, (t, v) \in W\}$ and $E_t = \{uv, (t, uv) \in E\}$. We denote by $G_t^-$ the
570 L. Rannou et al.
graph that corresponds to the nodes and links present between the event time
just before t and t: $G_t^- = (V_t^-, E_t^-)$ where
$V_t^- = \{v, \exists t' \neq t, [t', t] \subseteq T_v\}$ and
$E_t^- = \{uv, \exists t' \neq t, [t', t] \subseteq T_{uv}\}$.
We take as input a time-ordered sequence of node or link arrivals or
departures. We maintain the set of present nodes and links at the current time
instant t, i.e. the graph Gt , and we store their latest arrival time seen so far.
This has a Θ(N + M ) time and Θ(n + m) space cost for the whole processing
of input data. Therefore, these worst-case complexities are lower bounds for our
algorithms.
One may compute strongly connected components directly from their definition,
by processing event times in increasing order and by maintaining the set of
strongly connected components that begin before or at current event time, and
end after it. We represent each such component as a couple (b, C), meaning that
it starts at b (included or not) and involves nodes in C.
More precisely, we start with a set $\mathcal{C}$ containing ([α, C) for each
connected component C of the graph $G_\alpha$ at the first event time α. Then,
for each event time t > α in increasing order we consider the connected
components of $G_t^-$. For each such component C, if there is no component
(b, X) with X = C in $\mathcal{C}$ then we add (]t', C) to $\mathcal{C}$, where
t' is the event time preceding t. For each element (b, X) of $\mathcal{C}$, if X
is not a connected component of $G_t^-$, then we remove it from $\mathcal{C}$
and we output (b, t'], X). We then turn to the connected components of $G_t$:
for each such component C, if there is no component (b, X) with X = C in
$\mathcal{C}$ then we add ([t, C) to $\mathcal{C}$; and for each element (b, X)
of $\mathcal{C}$, if X is not a connected component of $G_t$, then we remove it
from $\mathcal{C}$ and we output (b, t[, X). Finally, when the last event time
t = ω is reached, we output (b, ω], X) for each element (b, X) of $\mathcal{C}$.
Clearly, this algorithm outputs all strongly connected components of the
considered stream graph. Computing the connected components of each graph
is in O(n + m) time and space. The considered set families (the graph connected
components, as well as the elements of $\mathcal{C}$) form partitions of V.
Therefore, their storage and all set comparisons processed for each event time
have a cost in O(n) time and space. Since there are O(M + N) event times, the
time complexity of this method is O((N + M) · (n + m)), and it needs O(n + m)
space.
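The snapshot-based procedure described above can be sketched as follows. This is an illustrative reading of the text, not the authors' implementation: segments are assumed to be tuples `(b, e, u)` and `(b, e, u, v)` with closed intervals, a bound is encoded as a pair `(time, closed)`, and the graphs are rebuilt from the raw segment lists at every event time, which is simpler but slower than maintaining them incrementally. The names `components` and `scc_snapshots` are hypothetical.

```python
def components(nodes, links):
    """Connected components of an undirected graph, as a set of frozensets."""
    adj = {u: set() for u in nodes}
    for u, v in links:
        if u in adj and v in adj:
            adj[u].add(v)
            adj[v].add(u)
    seen, comps = set(), set()
    for start in adj:
        if start in seen:
            continue
        comp, stack = {start}, [start]
        while stack:
            x = stack.pop()
            for y in adj[x] - comp:
                comp.add(y)
                stack.append(y)
        seen |= comp
        comps.add(frozenset(comp))
    return comps


def scc_snapshots(node_segments, link_segments):
    """Return triples (start bound, end bound, node set).

    A bound is (time, closed): a start (t, True) reads "[t" and (t, False)
    reads "]t"; an end (t, True) reads "t]" and (t, False) reads "t[".
    """
    times = sorted({t for b, e, *_ in list(node_segments) + list(link_segments)
                    for t in (b, e)})
    current, out, prev = {}, [], None  # current: node set -> start bound
    for t in times:
        slices = []
        if prev is not None:
            # Graph on the open interval ]prev, t[ (the graph G_t^- of the text).
            nodes = {u for b, e, u in node_segments if b <= prev and t <= e}
            links = [(u, v) for b, e, u, v in link_segments if b <= prev and t <= e]
            slices.append((components(nodes, links), (prev, True), (prev, False)))
        # Graph at the instant t (the graph G_t of the text).
        nodes = {u for b, e, u in node_segments if b <= t <= e}
        links = [(u, v) for b, e, u, v in link_segments if b <= t <= e]
        slices.append((components(nodes, links), (t, False), (t, True)))
        for comps, end_bound, start_bound in slices:
            for x in [x for x in current if x not in comps]:
                out.append((current.pop(x), end_bound, x))
            for x in comps:
                current.setdefault(x, start_bound)
        prev = t
    out.extend((start, (prev, True), x) for x, start in current.items())
    return out


# Two nodes present on [0, 4], linked on [1, 3].
result = scc_snapshots([(0, 4, "A"), (0, 4, "B")], [(1, 3, "A", "B")])
for start, end, nodes in result:
    print(start, end, sorted(nodes))
```

On this toy stream the sketch yields five components: {A} and {B} on [0, 1[, {A, B} on [1, 3], and {A} and {B} on ]3, 4].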
Without changing its time complexity, this algorithm may be improved by
ignoring event times t such that all events occurring at t are link arrivals between
nodes already in the same connected component. However, one still has to com-
pute graph connected components at each event time with link departures.
Therefore, this improvement is mostly appealing if many link departures occur
at the same event times.
More generally, the approach above is efficient only if many events (node
and/or link arrivals and/or departures) occur at each event time. Then, many
connected components may change at each event time, and computing them from
scratch makes sense. If instead only few events occur at most event times,
managing each event itself and updating the current connected components
accordingly is appealing.
This leads to the following algorithm, which starts with an empty set
$\mathcal{C}$, considers each event time t in increasing order, and performs the
following operations.
1. For each node segment ([b, e], u) such that b = t (node arrival), add
([b, {u}) to $\mathcal{C}$.
2. For each link segment ([b, e], uv) such that b = t (link arrival), let
$C_u = (b_u, X_u)$ and $C_v = (b_v, X_v)$ be the elements of $\mathcal{C}$ such
that $u \in X_u$ and $v \in X_v$; if $C_u \neq C_v$ then replace $C_u$ and $C_v$
by ([t, $X_u \cup X_v$) in $\mathcal{C}$. Then: if $b_u \neq$ [t then output
($b_u$, t[, $X_u$); if $b_v \neq$ [t then output ($b_v$, t[, $X_v$).
3. Let $G'_t = G_t$; then for each link segment ([b, e], uv) such that e = t
(link departure), let $C_v = C_u = (b_u, X_u)$ be the element of $\mathcal{C}$
such that $u \in X_u$ and $v \in X_u$; remove the link uv from $G'_t$; if there
is no path between u and v in $G'_t$ then replace $C_u$ by $C'_u$ = (]t, $X'_u$)
and $C'_v$ = (]t, $X'_v$) in $\mathcal{C}$, where $X'_u$ and $X'_v$ are the
connected components of u and v in $G'_t$, respectively; if $b_u \neq$ ]t then
output ($b_u$, t], $X_u$).
4. For each node segment ([b, e], u) such that e = t (node departure), let
$C_u = (b_u, X_u)$ be the element of $\mathcal{C}$ such that $u \in X_u$; remove
$C_u$ from $\mathcal{C}$; if $b_u \neq$ ]t then output ($b_u$, t], {u}).
We call this algorithm SCC Direct. It clearly outputs the strongly connected
components of the considered stream, like the previous algorithm. It performs
2(M + N ) of the steps above, corresponding to N node arrivals and departures
and M link arrivals and departures. One easily deals with node arrivals and
departures in constant time. If a link arrival induces a merge between two com-
ponents, computing their union is in O(n), as is outputting both components
if needed. Thus the complexity for link arrival steps is in O(M · n). Each link
departure calls for a computation of the connected components of a graph, and
writing a component to the output is in O(n). Thus the complexity for link
departure steps is in O(M · (m + n)). We obtain a total time complexity in
O(M · (m + n) + N ). The space complexity is still in O(n + m) as above.
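The four event-handling steps of SCC Direct can be sketched in a few dozen lines. The code below is an illustrative reading of those steps, not the reference implementation: segments are assumed to be tuples `(b, e, u)` and `(b, e, u, v)`, the input is assumed to be a valid stream (links exist only while both endpoints are present), a bound is encoded as `(time, closed)`, and the names `reach` and `scc_direct` are hypothetical.

```python
def reach(adj, source):
    """Nodes reachable from source in the adjacency map adj."""
    seen, stack = {source}, [source]
    while stack:
        x = stack.pop()
        for y in adj[x] - seen:
            seen.add(y)
            stack.append(y)
    return seen


def scc_direct(node_segments, link_segments):
    """Return triples (start bound, end bound, node set), bounds as (time, closed)."""
    # events[t] = (node arrivals, link arrivals, link departures, node departures)
    events = {}
    for b, e, u in node_segments:
        events.setdefault(b, ([], [], [], []))[0].append(u)
        events.setdefault(e, ([], [], [], []))[3].append(u)
    for b, e, u, v in link_segments:
        events.setdefault(b, ([], [], [], []))[1].append((u, v))
        events.setdefault(e, ([], [], [], []))[2].append((u, v))
    adj, comp_of, out = {}, {}, []  # comp_of: node -> (start bound, node set)
    for t in sorted(events):
        node_in, link_in, link_out, node_out = events[t]
        for u in node_in:  # step 1: a new node starts alone at [t
            adj[u] = set()
            comp_of[u] = ((t, True), frozenset([u]))
        for u, v in link_in:  # step 2: a link arrival may merge two components
            adj[u].add(v)
            adj[v].add(u)
            cu, cv = comp_of[u], comp_of[v]
            if cu != cv:
                for start, nodes in (cu, cv):
                    if start != (t, True):  # skip empty intervals [t, t[
                        out.append((start, (t, False), nodes))
                merged = ((t, True), cu[1] | cv[1])
                for x in merged[1]:
                    comp_of[x] = merged
        for u, v in link_out:  # step 3: a link departure may split a component
            adj[u].discard(v)
            adj[v].discard(u)
            if v not in reach(adj, u):
                start, nodes = comp_of[u]
                if start != (t, False):  # skip empty intervals ]t, t]
                    out.append((start, (t, True), nodes))
                for root in (u, v):
                    part = ((t, False), frozenset(reach(adj, root)))
                    for x in part[1]:
                        comp_of[x] = part
        for u in node_out:  # step 4: a departing node is isolated by now
            start, nodes = comp_of.pop(u)
            del adj[u]
            if start != (t, False):
                out.append((start, (t, True), nodes))
    return out


# Two nodes present on [0, 4], linked on [1, 3].
result = scc_direct([(0, 4, "A"), (0, 4, "B")], [(1, 3, "A", "B")])
for start, end, nodes in result:
    print(start, end, sorted(nodes))
```

Each link departure triggers a graph search, which matches the O(m + n) cost per departure used in the complexity analysis above; this search is the part that the dynamic connectivity structures discussed next aim to speed up.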
The SCC Direct algorithm presented above is strongly related to one of the most
classical algorithmic problems in dynamic graph theory, called fully dynamic
connectivity [2,9,10,13–15,26], which aims at maintaining the connected com-
ponents of an evolving graph. Considering a sequence of link additions and
removals, dynamic connectivity algorithms maintain a data structure able to
tell if two nodes are in the same connected component (query operation) and
to merge or split connected components upon link addition or removal (update
operation).
This data structure and the corresponding operations can be used in the
above algorithm: we can use the data structure to store C , the set of current
connected components (we also need to store the beginning time of each compo-
nent, which has negligible cost). Then, at each link arrival or departure, we can
use the query operation to test whether the two nodes are in the same compo-
nent or not, and the update operation to add or remove the current link to the
data structure, while keeping an up-to-date set of connected components. When
we observe a node appearance it is necessarily isolated, so we have to add the
current time to its component. All the other steps (mainly, writing the output)
are unchanged. We call this algorithm SCC FD.
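To make the query and update operations concrete, here is a deliberately naive stand-in for a fully dynamic connectivity structure. It only illustrates the interface that SCC FD relies on: the class name `NaiveDynamicConnectivity` and its method names are hypothetical, and each query runs a graph search in O(n + m) time, far from the polylogarithmic bounds of the structures cited in this section.

```python
class NaiveDynamicConnectivity:
    """Query/update interface of fully dynamic connectivity (naive version)."""

    def __init__(self):
        self.adj = {}

    def insert(self, u, v):
        """Update operation: add the link uv."""
        self.adj.setdefault(u, set()).add(v)
        self.adj.setdefault(v, set()).add(u)

    def delete(self, u, v):
        """Update operation: remove the link uv."""
        self.adj[u].discard(v)
        self.adj[v].discard(u)

    def connected(self, u, v):
        """Query operation: are u and v in the same connected component?"""
        seen, stack = {u}, [u]
        while stack:
            x = stack.pop()
            if x == v:
                return True
            for y in self.adj.get(x, set()) - seen:
                seen.add(y)
                stack.append(y)
        return False


dc = NaiveDynamicConnectivity()
dc.insert("A", "B")
dc.insert("B", "C")
before = dc.connected("A", "C")  # A and C are linked through B
dc.delete("A", "B")
after = dc.connected("A", "C")   # A is now isolated
print(before, after)
```

In an SCC FD-style algorithm, a link arrival would call `connected` before `insert` to detect a merge, and a link departure would call `delete` and then `connected` to detect a split.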
Several methods efficiently solve the dynamic connectivity problem, the key
challenge being to know if updates and queries may be performed in O(log(n))
time, where n is the number of nodes in the graph. Current exact solutions
perform updates in $O\left(\sqrt{n \cdot (\log\log(n))^2 / \log(n)}\right)$
worst-case time [15], or in $O\left(\log^2(n) / \log\log(n)\right)$ amortized
time [26]. Probabilistic (exact or approximate) methods perform even better,
but they remain above the O(log(n)) time cost [9,10,14].
It is well acknowledged that these algorithmic time and space complexities
hide big constants, and that the underlying algorithms and data structures are
very intricate. As a consequence, implementing these algorithms is an important
challenge in itself [2,13], and the results above should be considered as
theoretical bounds. In practice, the implemented algorithms typically have
$O(\log^3(n))$ amortized time and linear space complexities, still with large
constants [2,13].
3.1 Datasets
First notice that most available datasets record instantaneous interactions only,
either because of periodic measurements, or because only one timestamp is
available. In such situations, one resorts to δ-analysis [18]: one considers that
each interaction lasts for a given duration δ. This transforms a dataset D into a
stream graph S = (T, V, W, E) in which all link segments last for at least δ, and
all interactions in D between the same two nodes that are separated by a delay
lower than δ lead to a unique link segment. Nodes are considered as present only
when they have at least one link.
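The δ-analysis step can be sketched for a single pair of nodes as follows. This is a minimal illustration under a stated assumption: each interaction at time t is taken to last over [t, t + δ], which is one possible convention, since the text does not fix where the duration δ is placed around the timestamp. The function name `delta_segments` is hypothetical.

```python
def delta_segments(timestamps, delta):
    """Turn instantaneous interactions of one node pair into link segments.

    Assumption of this sketch: an interaction at time t lasts over [t, t + delta].
    Overlapping (or touching) intervals are merged into a single segment.
    """
    segments = []
    for t in sorted(timestamps):
        if segments and t <= segments[-1][1]:
            # The new interaction starts before the current segment ends: extend it.
            segments[-1][1] = max(segments[-1][1], t + delta)
        else:
            segments.append([t, t + delta])
    return [tuple(s) for s in segments]


# Interactions at 0 s and 30 s merge for delta = 60 s; the one at 200 s stays apart.
print(delta_segments([0, 30, 200], 60))
```

Every resulting segment lasts at least δ, and interactions closer than δ end up in the same segment, as stated in the text.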
In order to explore the performance of our algorithms in a wide variety of
situations, we considered 14 publicly available datasets that we briefly present
below. Their key stream graph properties are given in Table 1, together with
the value of δ we used. It either corresponds to a natural value underlying the
dataset or is determined by the original timestamp precision.
UC Message (UC) [17] is a capture of messages between University of California
students in an online community. High School 2012 (HS 2012) [6] is a sensor
recording of contacts between students of 5 classes during 7 days in a high
school in Marseille, France in 2012. Digg [17] is a set of links representing
replies of Digg website users to others. Infectious [12] is a recording of
face-to-face contacts between visitors of an exhibition in Dublin in 2009.
Twitter Higgs (Twitter) [4,19] is a recording of all kinds of Twitter activity
for one week around the discovery of the Higgs boson in 2012. Linux Kernel
mailing list (Linux) [17] represents the email replies between users on this
mailing list. Facebook wall posts (Facebook) [25] represents messages exchanged
between Facebook users, through their walls. Epinions [17] is a set of
timestamped trust and distrust link creations on Epinions, an online product
rating site. Amazon [17] contains product ratings on Amazon. Youtube [20] is a
social network of YouTube users and their friendship connections. Movielens [8]
contains movie ratings by users of the Movielens site. Wiki Talk En (Wiki) [17]
is a recording of discussions between contributors to the English Wikipedia.
Mawilab 2020-03-09 (Mawilab) [5] is a 15 min capture of network traffic on a
backbone trans-Pacific router in Japan on March 9, 2020. Each link represents a
packet exchanged between two internet addresses. Stackoverflow [19,22] is a
recording of interactions on the Stack Overflow web site.
Table 1. Key features of the real-world stream graphs we consider, ordered with respect
to their number M of link segments (K indicates thousands, M millions).
Dataset        δ      n      m      |T|     N      M
UC             1 h    2K     14K    189 d   43K    34K
HS 2012        60 s   327    6K     4 d     48K    46K
Digg           1 h    30K    85K    14.5 d  110K   86K
Infectious     60 s   11K    45K    80 d    85K    133K
Twitter        600 s  304K   452K   7 d     543K   488K
Linux          10 h   27K    160K   8 y     450K   544K
Facebook       10 h   46K    183K   4.3 y   957K   588K
Epinions       10 h   132K   711K   2.6 y   404K   743K
Amazon         1 h    2.1M   5.7M   9.5 y   9.9M   5.8M
Youtube        24 h   3.2M   9.4M   226 d   6.7M   9.4M
Movielens      1 h    70K    10M    14 y    8.5M   10M
Wiki           1 h    2.9M   8.1M   14.3 y  18.3M  14.5M
Mawilab        2 s    940K   9.1M   902 s   17M    18.8M
Stackoverflow  10 h   2.6M   28.2M  7.6 y   30M    33.5M
Fig. 2. Time cost of SCC Direct and SCC FD in seconds, along with the number M of
link segments and the number of strongly connected components, for each considered
stream (horizontal axis, ordered with respect to M ).
Fig. 3. Distribution of the size (left) and duration (middle) of SCC in Mawilab dataset.
Duration of each SCC as a function of its size, in log-log scales (right).
nents are due to the frontier effect, that we define as follows. Consider a set X of
nodes, and assume that link segments that start close to a given time b and end
close to a given time e connect them. However, they all start at different times
and end at different times. This leads to a connected component ([b', e'], X),
with b' close to b and e' close to e, but also to many short strongly connected
components that both start and end close to b, or close to e. These components
make little sense, if any, but they account for a huge fraction of all strongly con-
nected components, and so they have a crucial impact on computation time as
explained in the previous section. We show below how to get rid of them while
keeping crucial information.
The fact that link segments start and end at slightly different times induces many
strongly connected components of very low duration, which have little interest.
We therefore propose to consider the following approximation of the stream
graph S = (T, V, W, E). Given an approximation parameter Δ < δ and any
time t in T, we define $\lfloor t \rfloor_\Delta$ as
$\Delta \cdot \lfloor t/\Delta \rfloor$ and $\lceil t \rceil_\Delta$ as
$\Delta \cdot \lceil t/\Delta \rceil$. We then define
$S_\Delta = (T, V, W_\Delta, E_\Delta)$ where
$W_\Delta = \bigcup_{([b,e],v) \in W} [\lceil b \rceil_\Delta, \lfloor e \rfloor_\Delta] \times \{v\}$
and
$E_\Delta = \bigcup_{([b,e],uv) \in E} [\lceil b \rceil_\Delta, \lfloor e \rfloor_\Delta] \times \{uv\}$.
In other words, we replace each node segment ([b, e], v) by a shorter node
segment that starts at the first time after b and ends at the last time before e
that are multiples of Δ. We proceed similarly with link segments.
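The rounding applied to each segment can be sketched as below. This is a minimal illustration with a hypothetical function name, `approximate_segment`; it assumes times and Δ are given as integers (or `fractions.Fraction` values) so that the floor division is exact, which would not be guaranteed with floating-point values such as Δ = 0.002.

```python
def approximate_segment(b, e, delta):
    """Shrink [b, e] to the multiples of delta it contains.

    Returns (ceil_delta(b), floor_delta(e)), or None if no multiple of delta
    lies in [b, e] (this cannot happen when the segment lasts at least delta).
    """
    start = -(-b // delta) * delta  # smallest multiple of delta that is >= b
    end = (e // delta) * delta      # largest multiple of delta that is <= e
    return (start, end) if start <= end else None


print(approximate_segment(3, 17, 5))   # segment [3, 17] with delta = 5
print(approximate_segment(10, 20, 5))  # bounds already multiples of delta
```

Applying this function to every node and link segment yields the segments of the approximated stream; since the new bounds always lie inside the original ones, the approximated stream is included in the original one.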
First notice that SΔ is an approximation of S, in the sense that SΔ may be
computed from S, but not the converse. In addition, each node or link segment in
S lasts at least δ, and since Δ is lower than δ, no node or link segment disappears
when S is transformed into SΔ ; only their starting and ending times change. In
addition, SΔ is included in S: WΔ ⊆ W and EΔ ⊆ E. This has an important
consequence: all paths in SΔ are also paths in S, and so the approximation does
not create any new reachability relation. It, therefore, preserves key information
contained in the original stream.
Let us first observe the effect of the approximation on strongly connected
components in Fig. 4. The number of components rapidly drops from its initial
value of 30 million (for Δ = 0, i.e. no approximation) to less than 6 million for
$\Delta = \delta/10^3 = 0.002$ s. Its decrease is much slower when Δ grows
further, which indicates that the stream no longer contains a large number of
irrelevant components due to the frontier effect. As expected, this has a strong
impact on computation time, which we also display; it also drops very rapidly,
from more than one day to less than one hour, making computations on such
large-scale datasets much quicker.
Figure 5 presents the effect of Δ on the size, duration and span distributions of
strongly connected components. For $\Delta = \delta/10^4$, we notice that while
the number of components has decreased by half, only fifty percent of them
involve more than 30K nodes. Furthermore, as Δ increases, the number of
components tends to be stable (Fig. 4) but the number of components involving
more than 30K nodes continues to drop. This explains the differences observed in
the execution time of SCC Direct (Fig. 4) and confirms that the approximation
eliminates most very short connected components, but not all: the ones that are
not due to the frontier effect are preserved, which is another desirable feature.
Fig. 4. Running time of SCC Direct, number of SCC and number of event times in
MawiLab, as a function of Δ (here, δ = 2 s).
Fig. 5. Box plots representing the distribution of the size (left), duration (middle) and
span (right) of strongly connected components in Mawilab, for various values of Δ (here,
δ = 2 s). We indicate the mean, minimal, and maximal values with dots connected by
horizontal lines, as well as the median and percentiles with vertical boxes.
Fig. 6. Evolution of the LRMSE, the average difference between latencies and the
average latency stretch with respect to Δ in Mawilab. We indicate the number of
missing paths and represent it as a disk of area proportional to this number.
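plays the average latency stretch
$\frac{1}{n \cdot (n-1)} \sum_{u,v \in V, u \neq v} \frac{\ell_\Delta(u,v)+1}{\ell(u,v)+1}$
and the latency root mean square error:
$\mathrm{LRMSE}(S, S_\Delta) = \sqrt{\frac{\sum_{u,v \in V, u \neq v} (\ell(u,v) - \ell_\Delta(u,v))^2}{n(n-1)}}$.
The figure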
also indicates the number of node pairs that were reachable in S but became
unreachable in this approximation. It appears that latencies are not significantly
impacted by the approximation, thus confirming that $S_\Delta$, despite its
reduced number of strongly connected components, captures key information
available in S. More precisely, only 11 temporal paths disappear for
$\Delta = \delta/10^3$ and 115 disappear for $\Delta = \delta/10^2$, among a total
number of 2,888,917. The over-estimate of latencies is very small, with an LRMSE
of 0.51 and 1.61, respectively. This has important consequences. For instance,
one may compute latencies in $S_\Delta$ from its strongly connected components,
which are much easier to compute and store than the ones of S, and obtain this
way fast and accurate upper bounds (or approximations) of latencies in S, as we
did here for the Mawilab dataset.
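The two error measures used in this comparison, the average latency stretch and the latency root mean square error, can be computed as sketched below. This is an illustration under assumptions that the text does not settle: latencies are supplied as dictionaries keyed by ordered node pairs, only pairs present in both dictionaries contribute to the sums, and the normalization is n(n - 1). The function name `latency_errors` is hypothetical.

```python
import math


def latency_errors(lat, lat_approx, n):
    """Average latency stretch and LRMSE between a stream and its approximation.

    lat, lat_approx: dicts mapping ordered node pairs (u, v) to latencies in
    the original and approximated streams. Only pairs present in both dicts
    contribute to the sums (assumption of this sketch); n is the number of nodes.
    """
    pairs = [p for p in lat if p in lat_approx]
    norm = n * (n - 1)
    stretch = sum((lat_approx[p] + 1) / (lat[p] + 1) for p in pairs) / norm
    lrmse = math.sqrt(sum((lat[p] - lat_approx[p]) ** 2 for p in pairs) / norm)
    return stretch, lrmse


# Toy example with two nodes: one latency grows from 1 to 2, the other is unchanged.
lat = {("A", "B"): 1, ("B", "A"): 3}
lat_approx = {("A", "B"): 2, ("B", "A"): 3}
stretch, lrmse = latency_errors(lat, lat_approx, n=2)
print(stretch, lrmse)
```

A stretch close to 1 and an LRMSE close to 0 both indicate that the approximation barely changes latencies.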
4 Related Work
We focus here on connected components defined in [18], but other notions of
connected components in dynamic graphs have been proposed. Several rely on
the notion of reachability, which, in most cases, induces components that may
overlap and are NP-hard to enumerate, see for instance [3,7,11,21]. This makes
them quite different from the connected components considered here.
Akrida and Spirakis [1] study and propose an algorithm for testing whether
a given dynamic graph is connected at all times during a given time interval. If
it is not connected, their algorithm looks for large connected components that
exist for a long duration. Vernet et al. [24] propose an algorithm for computing
all sets of nodes that remain connected for a given duration, and that are not
dominated by other such sets. Unlike our work, these papers do not partition
the set of temporal nodes.
5 Conclusion
References
1. Akrida, E.C., Spirakis, P.G.: On verifying and maintaining connectivity of interval
temporal networks. Parallel Process. Lett. 29, 1950009 (2019)
2. Alberts, D., Cattaneo, G., Italiano, G.F.: An empirical study of dynamic graph
algorithms. Exp. Algorithmics 2, 5-es (1997)
3. Bhadra, S., Ferreira, A.: Complexity of connected components in evolving graphs
and the computation of multicast trees in dynamic networks. ADHOC-NOW (2003)
4. Domenico, M.D., Lima, A., Mougel, P., Musolesi, M.: The anatomy of a scientific
rumor. Sci. Rep. 3, 2980 (2013)
5. Fontugne, R., Borgnat, P., Abry, P., Fukuda, K.: MAWILab: combining diverse
anomaly detectors for automated anomaly labeling and performance benchmarking
(2010)
6. Fournet, J., Barrat, A.: Contact patterns among high school students. PLoS One
9, e107878 (2014)
7. Gómez-Calzado, C., Casteigts, A., Lafuente, A., Larrea, M.: A connectivity model
for agreement in dynamic systems. In: Euro-Par 2015: Parallel Processing (2015)
8. GroupLens Research: MovieLens data sets (2006)
9. Henzinger, M.R., King, V.: Randomized fully dynamic graph algorithms with poly-
logarithmic time per operation. ACM (1999)
10. Huang, S.E., Huang, D., Kopelowitz, T., Pettie, S.: Fully dynamic connectivity in
O(log n (log log n)^2) amortized expected time. ACM-SIAM (2017)
11. Huyghues-Despointes, C., Bui-Xuan, B.M., Magnien, C.: Forte delta-connexité
dans les flots de liens. ALGOTEL (2016)
12. Isella, L., Stehlé, J., Barrat, A., Cattuto, C., Pinton, J.F., den Broeck, W.V.:
What’s in a crowd? Analysis of face-to-face behavioral networks. JTB 271, 166–
180 (2011)
13. Iyer, R., Karger, D., Rahul, H., Thorup, M.: An experimental study of polyloga-
rithmic, fully dynamic, connectivity algorithms. Exp. Algorithmics 6, 4-es (2001)
14. Kapron, B.M., King, V., Mountjoy, B.: Dynamic Graph Connectivity in Polyloga-
rithmic Worst Case Time. In: SODA (2013)
15. Kejlberg-Rasmussen, C., Kopelowitz, T., Pettie, S., Thorup, M.: Faster worst case
deterministic dynamic connectivity. In: ESA (2016)
16. Kempe, D., Kleinberg, J., Kumar, A.: Connectivity and inference problems for
temporal networks. JCSS 64, 820–842 (2002)
17. Kunegis, J.: Konect: the koblenz network collection. In: Proceedings of the 22nd
International Conference on World Wide Web, pp. 1343–1350 (2013)
18. Latapy, M., Viard, T., Magnien, C.: Stream graphs and link streams for the mod-
eling of interactions over time. SNAM 8, 61 (2018)
19. Leskovec, J., Krevl, A.: SNAP datasets: stanford large network dataset collection
(2014)
20. Mislove, A.: Online social networks: measurement, analysis, and applications to
distributed information systems. Ph.D. thesis (2009)
21. Nicosia, V., Tang, J., Musolesi, M., Russo, G., Mascolo, C., Latora, V.: Components
in time-varying graphs. Chaos (2012)
22. Paranjape, A., Benson, A.R., Leskovec, J.: Motifs in temporal networks. In: WSDM
(2017)
23. Rannou, L.: Straph – Python library for the modelisation and analysis of stream
graphs (2020). https://github.com/StraphX/Straph
24. Vernet, M., Pigne, Y., Sanlaville, E.: A study of connectivity on dynamic graphs:
computing persistent connected components (2020)
25. Viswanath, B., Mislove, A., Cha, M., Gummadi, K.P.: On the evolution of user
interaction in facebook. In: SIGCOMM-WOSN (2009)
26. Wulff-Nilsen, C.: Faster deterministic fully-dynamic graph connectivity. In: SODA
(2013)
27. Xuan, B.B., Ferreira, A., Jarry, A.: Computing shortest, fastest, and foremost
journeys in dynamic networks. IJFCS 14, 267–285 (2003)
The Effect of Cryptocurrency Price
on a Blockchain-Based Social Network
Cheick Tidiane Ba, Matteo Zignani, Sabrina Gaito, and Gian Paolo Rossi
1 Introduction
In the last decade, we have witnessed the massive spread of social media
platforms such as Facebook, Twitter, and Instagram. These platforms are usually
controlled by one social media company, which owns all the data and decides its own
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 943, pp. 581–592, 2021.
https://doi.org/10.1007/978-3-030-65347-7_48
582 C. T. Ba et al.
¹ Actually, this is an abstraction of the architecture, since modern online social
networks heavily rely on large-scale data centers and content delivery networks.
The Effect of Cryptocurrency Price on a Blockchain-Based Social Network 583
timestamp, so we can study the impact of the price of the cryptocurrency Steem²
on the social network growth. From the analysis based on temporal correlation,
we noticed that the growth of the network is strongly tied to the fluctuations
in the Steem cryptocurrency price: rising Steem prices trigger the link creation
process and, similarly, when the Steem value drops, so does the network’s
growth. In fact, there is a positive correlation (0.7) between Steem price
values and the creation of “follow” relationships. By analyzing correlation
lags, we also found evidence of a lead-follow relationship between the time
series, where Steem prices influence user behavior: our cross-correlation study
suggests that the full impact is felt with a lag of 25–32 days. Finally, by
analyzing where links form during these correlation phases, we identified
suspicious behaviors in high in-degree nodes which try to favor the diffusion of
particular contents, thereby doping the reward system. In general, we have
highlighted that, in blockchain-based social networks which implement reward
mechanisms relying on cryptocurrencies, the network dynamics and
economic/financial aspects are strongly intertwined, and in the former, growth
patterns not explainable through social theories come into play.
The rest of the paper is structured as follows. In Sect. 2, we introduce the
blockchain-based social network Steemit and briefly review existing studies on
Steemit. Then, in Sect. 3 we describe how we collect data from the Steem
blockchain. In Sect. 4 we describe the methodology to assess the temporal
correlation between the network dynamics and the price of the Steem
cryptocurrency. Finally, in Sect. 5 we discuss our main findings about the
temporal correlations and the identification of suspicious behaviors involving
high in-degree accounts.
² Cryptocurrency price history data is recovered from an already existing web service.
³ Resteemed, in the Steemit jargon.
3 Dataset
In order to study the interplay between the social and financial aspects in Steemit
we needed the longitudinal data of the cryptocurrencies that influence the Steem
value over time. We can retrieve such data from [2], which reports the daily
value of the Steem currency in US Dollars and in other cryptocurrencies. The
prices are updated daily and values are available from April 18th, 2016. From
this platform we collected the Steem price in USD. Alongside the Steem
value, we also consider Bitcoin, the leading cryptocurrency, as it has been shown
to have a strong influence on the value of other cryptocurrencies.
Then, as our goal is to study the currency’s impact on the evolution of
the Steemit social network, we had to recover data describing the relationships
among users. In Steemit every operation can be retrieved from the Steem blockchain:
researchers and application developers have access to the data through a series
of APIs that can be queried through HTTP requests. The collection of these
operations composes a detailed temporal evolution dataset that describes user
activity with a temporal precision of 3 s (see footnote 5). Relying on the
Steem-python library, we were able to specify a Steem node (https://steemdev.api.com)
to handle our data requests. We collected the data from the very first block,
produced on 24th March, 2016, up to block 44301097, which was produced on 16th
June, 2020 - an overall period of 4 years.
5 The action timestamp is derived from its block, and each block is verified every 3 s.
As introduced in Sect. 2, users on Steemit can perform many different actions.
According to the official documentation [1], on the blockchain we can find more
than 50 different types of operation. As we are interested in the relationships in
the social network, we had to extract “follow” information. This data is stored
in custom_json operations. In the “json” field users can input any kind of data
as long as it follows the JSON format, so a filtering step is required to actually
extract “follow” relationships. According to the documentation, we can filter
“follow” operations by looking at the presence of the field id, set to follow. In
our collection, we obtain a total of 157,883,036 “follow” operations with their
timestamp. These operations cannot be directly used to build the social graph,
since Steemit tracks more than just follow actions. For example, users can
decide to perform an “unfollow” action. Moreover, Steemit also allows a user to
completely block another user. Alongside these three main actions, some of the
collected operations may not be properly formatted: usually they are the result
of errors made by developers or user-made scripts. Therefore we must further
filter “follow” operations and, according to the official documentation, group
them based on the what attribute. We focus on three main options: blog, which is
the equivalent of a real following action; ignore, which is the equivalent of a
blocking action; and, finally, we consider empty strings as unfollowing actions.
After filtering, we obtain 134,941,606 “follow” relationships, 20,216,913
“unfollows” and 2,721,355 “ignore/mute” (empty strings or arrays, or explicit
“muted”) actions. It is worth noting that the “follow” actions were not available
in the platform from the beginning: the introduction of this functionality dates
back to June 3rd, 2016 (commit e8472fb), according to the Steem project commit
history (see footnote 6).
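The filtering step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the payload layout (`["follow", {"follower", "following", "what"}]`), the function name `classify_follow_op`, and the sample payloads are assumptions made for the example, and an empty `what` list is treated like an empty string.

```python
import json
from collections import Counter


def classify_follow_op(raw_json):
    """Return 'follow', 'unfollow', 'ignore', or None for malformed payloads."""
    try:
        payload = json.loads(raw_json)
    except (TypeError, ValueError):
        return None  # not valid JSON at all
    # Assumed layout: ["follow", {"follower": ..., "following": ..., "what": [...]}]
    if not (isinstance(payload, list) and len(payload) == 2 and payload[0] == "follow"):
        return None
    body = payload[1]
    if not isinstance(body, dict) or "follower" not in body or "following" not in body:
        return None
    what = body.get("what")
    if isinstance(what, str):
        what = [what]
    if not isinstance(what, list):
        return None
    values = [w for w in what if w != ""]
    if not values:
        return "unfollow"  # empty string (assumption: empty list as well)
    if values == ["blog"]:
        return "follow"
    if values == ["ignore"]:
        return "ignore"
    return None  # unknown value: treated as improperly formatted


# Hypothetical payloads, for illustration only.
sample_ops = [
    '["follow", {"follower": "alice", "following": "bob", "what": ["blog"]}]',
    '["follow", {"follower": "alice", "following": "bob", "what": [""]}]',
    '["follow", {"follower": "carol", "following": "dave", "what": ["ignore"]}]',
    '{"follower": "erin"}',
    'not json at all',
]
summary = Counter(classify_follow_op(op) for op in sample_ops)
print(summary)
```

Returning `None` for anything unexpected keeps the malformed operations countable instead of silently dropping them.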
4 Methods
As mentioned, our objective is studying the relationship between Steem price
values and users’ social behavior. In our analysis, we focus on the daily evolution
of the network. We aggregate “follow” operations by date, obtaining a time series
describing the number of new “follow” relationships in the social network, every
day. Thus, we have obtained two time series: the daily new “follow” links and
the historical data of the daily Steem price.
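As a rough sketch of this aggregation, the snippet below counts operations per calendar day and fills days without activity with zero, so that the series stays aligned with a daily price series. The function name `daily_counts` and the ISO-formatted timestamps are assumptions made for the example.

```python
from collections import Counter
from datetime import date, datetime, timedelta


def daily_counts(timestamps, start, end):
    """Count operations per day between start and end (inclusive)."""
    per_day = Counter(datetime.fromisoformat(ts).date() for ts in timestamps)
    n_days = (end - start).days + 1
    days = [start + timedelta(days=i) for i in range(n_days)]
    # Days with no operation get an explicit zero.
    return [(day, per_day.get(day, 0)) for day in days]


# Hypothetical timestamps of "follow" operations.
follow_timestamps = [
    "2016-06-03T10:15:03",
    "2016-06-03T23:59:57",
    "2016-06-05T00:00:00",
]
series = daily_counts(follow_timestamps, date(2016, 6, 3), date(2016, 6, 6))
for day, count in series:
    print(day.isoformat(), count)
```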
We first look at potential seasonal patterns through the analysis of the Auto-
correlation Function (ACF). It measures the linear relationship between lagged
values of a time series; the resulting plot shows if data has some sort of pattern,
either a long-term trend or a seasonal pattern [8]. The ACF is the function of
autocorrelation values ρy(k) for every lag k, where ρy(k) is defined as
\[
\rho_y(k) = \frac{\sum_{t=k+1}^{T} (y_t - \bar{y})(y_{t-k} - \bar{y})}{\sum_{t=1}^{T} (y_t - \bar{y})^2}
\]
According to [8], if data are trended, ρy(k) values for small lags are large and
positive, because observations close in time are also close in size. So, the ACF of
trended time series will show positive values that slowly decrease as the lags increase.
6 https://github.com/steemit/steem/search?o=asc&q=follow&s=committer-date&type=Commits.
The Effect of Cryptocurrency Price on a Blockchain-Based Social Network 587
When data are seasonal, the autocorrelations ρy (k) will be larger for the seasonal
lags (at multiples of the seasonal frequency) than for other lags. When data are
both trended and seasonal, we can observe both these phenomena.
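The autocorrelation definition above translates directly into a few lines of code. The sketch below uses plain Python and two synthetic series (a linear trend and a repeating 7-day pattern) that are invented for illustration; they are not the Steemit data.

```python
def acf(y, max_lag):
    """Autocorrelation rho_y(k) for k = 0..max_lag, following the definition above."""
    T = len(y)
    mean = sum(y) / T
    denom = sum((v - mean) ** 2 for v in y)
    return [
        # 0-based index t runs over k..T-1, i.e. t = k+1..T in the 1-based formula
        sum((y[t] - mean) * (y[t - k] - mean) for t in range(k, T)) / denom
        for k in range(max_lag + 1)
    ]


# Synthetic examples (assumptions, for illustration only).
trended = [float(t) for t in range(100)]
seasonal = [5.0, 3.0, 2.0, 2.0, 3.0, 6.0, 9.0] * 8  # period of 7 "days"

acf_trend = acf(trended, 10)
acf_season = acf(seasonal, 14)
print([round(r, 2) for r in acf_trend])
print([round(r, 2) for r in acf_season])
```

For the trended series the values stay positive and decay slowly, while for the seasonal series the value at lag 7 stands out over its neighbours, which is the qualitative behaviour described in the text.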
After studying the individual time series, we shift our focus to the relationship
between the two time series. We compare the time plots for the series looking for
potential visual evidence of influences or correlations; then we use scatter plots
and quantify the correlation between the two time series using a correlation coefficient.
We evaluate the correlation by the classical Pearson Coefficient [5]. Given two
time series x and y, we compute the Pearson Coefficient ρ(x, y) as:
\[
\rho(x, y) = \frac{\sum_{t=1}^{T} (x_t - \bar{x})(y_t - \bar{y})}{\sqrt{\sum_{t=1}^{T} (x_t - \bar{x})^2 \, \sum_{t=1}^{T} (y_t - \bar{y})^2}} \tag{1}
\]
with values towards 1 indicating perfect correlation, 0 no correlation and
values around −1 perfect anti-correlation. This measure tells us if there is a linear
relationship between the Steem price and the amount of “follow” actions.
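A minimal sketch of the Pearson coefficient, written in plain Python. The function name `pearson` and the toy inputs are assumptions made for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y):
        raise ValueError("series must have the same length")
    T = len(x)
    mean_x, mean_y = sum(x) / T, sum(y) / T
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Toy data: a perfectly linear and a perfectly inverse relationship.
r_pos = pearson([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
r_neg = pearson([1.0, 2.0, 3.0, 4.0], [8.0, 6.0, 4.0, 2.0])
print(r_pos, r_neg)
```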
Finally, we can determine potential lead-follow relationships between time
series using the normalized cross-correlation measure. Given two time series
x and y, the normalized cross-correlation measure is similar to the correlation
measure: instead of correlating x with y once, we do it multiple times, considering
the time series y shifted by a series of time lags k. We obtain a series of
different correlation values ρ, one for each chosen time lag k. In our work, we
consider lags in days. This measure can be expressed as:
\[
\rho_{xy}(k) = \frac{\sigma_{xy}(k)}{\sigma_x \sigma_y} = \frac{\sum_{t=k+1}^{T} (x_t - \bar{x})(y_{t-k} - \bar{y})}{\sqrt{\sum_{t=1}^{T} (x_t - \bar{x})^2 \, \sum_{t=1}^{T} (y_t - \bar{y})^2}} \tag{2}
\]
This process produces a set of pairs (lag, correlation value). We can better
explore them by analyzing their shape and focusing on the time lags k that
show the highest correlation values. If we find high correlation values for a pos-
itive time lag, then x leads y; vice versa, if the higher values are for a negative
time lag, then we have that time series y is leading x.
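The lag scan can be sketched as below. The code pairs x_t with y_{t−k}, as written in Eq. 2, and extends the same pairing to negative k. The synthetic data (seeded Gaussian noise, with x built as a copy of y delayed by 5 steps) and the function names are assumptions made for the example. Which series is said to "lead" depends on the sign convention of the shift, so the sketch only reports the lag with the highest value.

```python
import math
import random


def cross_correlation(x, y, k):
    """Normalized cross-correlation pairing x[t] with y[t - k]."""
    T = len(x)
    mean_x, mean_y = sum(x) / T, sum(y) / T
    norm = math.sqrt(
        sum((v - mean_x) ** 2 for v in x) * sum((v - mean_y) ** 2 for v in y)
    )
    # Keep only indices where both x[t] and y[t - k] exist.
    start, stop = max(k, 0), min(T, T + k)
    num = sum((x[t] - mean_x) * (y[t - k] - mean_y) for t in range(start, stop))
    return num / norm


def lag_scan(x, y, max_lag):
    """Return (lag, correlation) pairs for every lag in [-max_lag, max_lag]."""
    return [(k, cross_correlation(x, y, k)) for k in range(-max_lag, max_lag + 1)]


# Synthetic data: x[t] equals y[t - 5] for t >= 5.
rng = random.Random(0)
y = [rng.gauss(0.0, 1.0) for _ in range(300)]
delay = 5
x = [rng.gauss(0.0, 1.0) for _ in range(delay)] + y[:-delay]

pairs = lag_scan(x, y, 20)
best_lag, best_value = max(pairs, key=lambda p: p[1])
print(best_lag, round(best_value, 2))
```

White noise is used on purpose: its autocorrelation is close to zero away from lag 0, so the peak of the scan is unambiguous. Strongly trended series, such as those analyzed here, give much flatter curves.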
5 Results
In this section we present the insights obtained by the methodology introduced in
the previous section. Specifically, in Fig. 1 we report the auto-correlation function
for the two time series: a) new “follow” relationships and b) Steem price, on a daily
basis.
As shown in Fig. 1a, the auto-correlation function of the new “follow” time
series lacks repeating peaks, which means there are no seasonal
patterns. Also, we can observe that the new “follow” series is trended, since
its autocorrelation has positive values that slowly decrease as the lags increase. We
obtain a similar result when evaluating the Steem price time series (see Fig. 1b),
which also suggests the lack of seasonal patterns.
Fig. 1. The auto-correlation function for the a) new “follow” links and b) Steem price.
On the Y-axis: the correlation coefficient ρy(k). On the X-axis: the lag k in days.
The Interplay Between Network Growth and Steem Price. Before delving
into a quantitative assessment of the temporal correlation between the network
evolution process and the Steem price, we qualitatively inspected the trends
of the two time series, to verify whether our hypothesis of correlation holds. It
was already observed in [13] that activities such as comments, votes, content
sharing and node arrival could be strongly tied to the fluctuations in Steem
cryptocurrency price, but those observations cover a shorter period than our
dataset. Looking at Fig. 2, we can confirm those observations for the
periods of growth: April 2017 - June 2017 and November 2017 - January 2018 (cyan
regions in Fig. 2).
Analyzing the new dataset, we can also observe a similar influence in
successive periods of time. Around March, 2018 we see that as the price of Steem
falls, so does the activity in terms of “follow” operations. Around the beginning
of April 2018, we can see a small rise in Steem price; it precedes one of the
largest increases in “follow” relationships in the network. This spike is however
short-lived, as the Steem price falls again. Shortly after, the number of “follow”
operations per day starts to shrink, reaching the lowest level recorded up to that
point. The Steem price has never recovered and is now hovering around 0.20 USD.
This is an important new phenomenon: not only is the success of the cryptocurrency
an important catalyst of a social network’s growth, but a drop in value has also
stunted the growth of the network. The crisis that emerges from the data was also
confirmed by the Steemit company: in a post dated 28/11/2018, Ned Scott, at the
time Founder and CEO of Steemit, Inc., stated that Steemit had to lay off 70% of
its workforce as the maintenance costs were becoming too high [17].
Fig. 2. Time plots for “follow” operations (blue) and Steem price value in USD (green).
On x-axis: time in months, from June 3rd, 2016 (“follow” plugin in Steemit) up to June
3rd, 2020. On left Y-axis: volume of “follow” operations per day. On right Y-axis: Steem
price in USD.
Given the previous evidence, we evaluated the strength of the correlation
between the Steem cryptocurrency and the number of “follow” relationships that
are created daily in Steemit. We first compute the Pearson correlation coefficient
(Eq. 1): we obtain a strong positive correlation of 0.71, which confirms that
Steem price changes and user behavior are strongly connected. In our hypothesis,
Steem prices influence user activity; considering that we are not taking into
account the time it would take for users to react to Steem fluctuations, this
is a rather high correlation value. The corresponding scatter plot is displayed in
Fig. 3a and shows that low activity days tend to be linked to low Steem prices and
higher activity days tend to be linked to higher values of the cryptocurrency.
Finally, we determine potential lead-follow relationships between the
two time series using the normalized cross-correlation measure between the
Steem price and the “follow” relationships. We compute the cross-correlation
values by Eq. 2. We test lags in the range of (−90, +90), and we obtain
180 correlation values for these time lags. In Fig. 3b we display the plot of the
pairs of lags and correlation values. In the figure, we look for the range of days
that register the highest correlation values. We obtained moderate positive
cross-correlation values across the whole interval (>0.5). The highest value is 0.87,
obtained with a lag of 32 days. This high correlation value confirms that
there is indeed a lead-follow relationship between the two time series. These
lags suggest that the full impact of Steem price on the network evolution is felt
with a 25–32 days difference.
Fig. 3. a) Scatter plot of Steem price and “follow” relationships per day.
b) Normalized cross-correlation between the Steem price and the daily number of
“follow” relationships. On the X-axis the lag and on the Y-axis the coefficient value.
Active Users and Bot Accounts. Above we discussed the presence
of two periods of growth: April, 2017 - June, 2017 and November, 2017 - January,
2018. We want to understand which users are the most active in the two
periods. Starting from the “follow” actions, we isolate those occurring in the two
periods. This allows us to create two subgraphs representing the network’s
evolution in those intervals. Taking the nodes with the highest in-degree, we can
pinpoint the most popular users. Among the most popular
accounts, some have “resteem” or “bot” in their names.
In Steemit, a resteem is the equivalent of sharing a post. Sharing is an
important activity: when a user shares a post, it is shown to all of his
followers. A user is incentivized to share posts he voted or commented on, as only
popular posts are rewarded with currency. So, sharing a post is a further
investment. We decided to investigate these profiles. One of them even clearly
states that the account will automatically share every post from its followers with
all of its followers. Observing the actions, we can see that the main actions
in their history are indeed “resteem” actions, which seem to be performed automatically.
These bots provide an advantage to all parties: users rely on the bot to gain
visibility for their posts, increasing the chance of their posts becoming popular
and earning rewards. At the same time, the bot has the opportunity to gain from
the curation (vote, comment, sharing) of posts from the users. Sharing a post
makes it more likely to gain popularity, increasing the chance of a reward for
both the author and the bot.
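The selection of popular accounts within a growth period can be sketched as follows. The event tuples, account names, and helper names (`top_in_degree`, `looks_automated`) are hypothetical; the keyword check only mirrors the name-based observation made above and is not a bot detector.

```python
from collections import Counter
from datetime import date


def top_in_degree(follow_events, start, end, k=2):
    """Rank accounts by in-degree in the subgraph of follows created in [start, end]."""
    # A set removes repeated (follower, followed) pairs inside the window.
    edges = {(src, dst) for src, dst, day in follow_events if start <= day <= end}
    in_degree = Counter(dst for _, dst in edges)
    return in_degree.most_common(k)


def looks_automated(name, keywords=("resteem", "bot")):
    """Flag account names containing one of the keywords."""
    return any(keyword in name.lower() for keyword in keywords)


# Hypothetical follow events: (follower, followed, day).
events = [
    ("alice", "resteemhelper", date(2017, 4, 10)),
    ("alice", "resteemhelper", date(2017, 4, 12)),  # repeated pair
    ("bob", "resteemhelper", date(2017, 5, 2)),
    ("carol", "resteemhelper", date(2017, 6, 20)),
    ("alice", "bob", date(2017, 4, 11)),
    ("carol", "bob", date(2017, 5, 5)),
    ("dave", "alice", date(2017, 5, 6)),
    ("erin", "resteemhelper", date(2017, 8, 1)),  # outside the window
]
popular = top_in_degree(events, date(2017, 4, 1), date(2017, 6, 30))
flagged = [name for name, _ in popular if looks_automated(name)]
print(popular, flagged)
```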
6 Conclusions
In this paper, we have studied Steemit, one of the leading blockchain-based
decentralized social networks. In these social networks, creators and curators are
rewarded with cryptocurrency for their efforts. Our objective was to study the
relationship between the cryptocurrency and the growth of the social network.
We did so by analyzing more than 4 years of daily data for the users’ social
activity and the price of the Steem cryptocurrency. The analysis shows that the
growth of the network is strongly tied to the fluctuations in Steem cryptocurrency
price: we observed that rising Steem prices trigger network growth and that
when the Steem value drops, so does the network’s growth. A correlation analysis
confirms that there is a strong positive correlation (0.71) between Steem price
values and network activity. We also confirmed that there is a lead-follow
relationship between the time series, where Steem prices influence user behavior. The
studied lags suggest that the full impact is felt with a 25–32 days lag. In
conclusion, we show that cryptocurrency rewards influence social network growth
and user behavior, for better or worse. This work shows that while a cryptocurrency
reward can be a strong incentive to join a blockchain-based network, it
can also be a social network’s downfall.
References
1. Broadcast Ops. https://developers.steem.io/apidefinitions/broadcast-ops
2. Steem (STEEM) price, charts, market cap, and other metrics. https://coinmarketcap.com/currencies/steem/
3. Steemit Whitepaper. https://steem.com/steem-whitepaper.pdf
4. Chohan, U.W.: The concept and criticisms of steemit. Available at SSRN 3129410
(2018)
5. Freedman, D., Pisani, R., Purves, R.: Statistics (International Student Edition),
4th edn. WW Norton & Company, New York (2007)
6. Gensollen, N., Latapy, M.: Do you trade with your friends or become friends with
your trading partners? A case study in the g1 cryptocurrency. Appl. Netw. Sci.
5(1), NA–NA (2020)
7. Guidi, B.: When blockchain meets online social networks. Perv. Mob. Comput. 62,
101131 (2020)
8. Hyndman, R., Athanasopoulos, G.: Forecasting: Principles and Practice. OTexts,
3rd edn. (2019). https://Otexts.com/fpp3/
9. Ji, Q., Bouri, E., Gupta, R., Roubaud, D.: Network causality structures among
bitcoin and other financial assets: a directed acyclic graph approach. Q. Rev. Econ.
Financ. 70, 203–213 (2018)
10. Jia, P., Yin, C.: Research on the characteristics of community network information
transmission in blockchain environment. In: 2019 IEEE 4th Advanced Information
Technology, Electronic and Automation Control Conference (IAEAC), vol. 1, pp.
2296–2300 (2019)
11. Kim, M.S., Chung, J.Y.: Sustainable growth and token economy design: the case
of steemit. Sustainability 11(1), 167 (2019)
12. Larimer, D.: DPOS consensus algorithm–the missing whitepaper, steemit (2018)
13. Li, C., Palanisamy, B.: Incentivized blockchain-based social media platforms: a case
study of steemit. In: Proceedings of the 10th ACM Conference on Web Science,
pp. 145–154 (2019)
14. Maesa, D.D.F., Marino, A., Ricci, L.: Data-driven analysis of bitcoin properties:
exploiting the users graph. Int. J. Data Sci. Anal. 6(1), 63–80 (2018)
15. Maesa, D.D.F., Marino, A., Ricci, L.: The bow tie structure of the bitcoin users
graph. Appl. Netw. Sci. 4(1), 56 (2019)
16. Nakamoto, S.: Bitcoin: a peer-to-peer electronic cash system. Cryptography Mailing,
March 2009. https://metzdowd.com
17. Scott, N.: Steemit update, November 2018. https://steemit.com/steem/@ned/2fajh9-steemit-update
Multivariate Information in Random
Boolean Networks
1 Introduction
Over the last decades several approaches have tried to investigate the dynamics
that govern complex systems, in order to understand how they process informa-
tion. This old question will remain largely open as long as we fail to grasp the
fundamentals of the dynamics: the interdependencies involving a large number
of agents, wherein the richness of the system lies, rather than in the agents’
features. Recently, building on previous work [2,7,20,21], O-information was
perfected as a promising generalization of mutual information and an effective
metric to quantify statistical high-order interdependencies among the agents of a
system. It distinguishes between the synergistic or redundant nature of such
interdependencies, whose dynamic relevance has been widely demonstrated [6,21].
These advances from information theory may be applied to the analysis and
design of dynamical systems. Of particular interest is the study of complex
network dynamics, where heterogeneous interactions involve multiple influences
between components, favoring complicated and highly non-linear behaviors, and
the dynamics is hard to predict from local interactions. In this context, a nat-
ural candidate are Random Boolean Networks (RBNs), which correspond to a
2 Preliminaries
High-Order Interdependencies in Boolean Networks. Consider a Boolean
network of $n$ nodes, and denote the set of states (or a configuration) at time $t$
as $X^n(t) = (X_1(t), \dots, X_n(t))$, where $X_i(t) \in \{0, 1\}$. Then the state of the
$i$-th node at time $t + 1$ is determined by $X_i(t + 1) = f_i(X^n(t))$, where the Boolean
function $f_i$ includes connectivity, that is, it depends on $X_j(t)$ if and only if there
is an edge from node $j$ to $i$.
An orbit is defined as the sequence of configurations starting from a given
initial condition. With this, the dynamics of the network can be described as
an ensemble of orbits produced from different initial conditions. It can also be
seen as an ensemble $X^n = (X_1, \dots, X_n)$ of individual trajectories, where $X_i$ is a
binary vector describing the states of node $i$ along the orbit. Note that a
probability distribution over the $2^n$ possible initial conditions induces a distribution
for $X^n$. The ensembles of trajectories can be built in several ways, depending on
the duration of the orbit, the sampling of the initial configurations, and possible
restrictions to orbits in attractors of the system; this will be specified in the
different sections.
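The update rule and the notion of orbit can be illustrated with a small simulation. The sketch below assumes synchronous updating. The three-node network, its Boolean functions, and the helper names `step` and `orbit` are invented for the example.

```python
def step(state, inputs, functions):
    """Synchronous update: node i applies f_i to the current states of its inputs."""
    return tuple(
        functions[i](tuple(state[j] for j in inputs[i])) for i in range(len(state))
    )


def orbit(state, inputs, functions, steps):
    """Sequence of configurations X^n(0), ..., X^n(steps) from an initial condition."""
    states = [state]
    for _ in range(steps):
        state = step(state, inputs, functions)
        states.append(state)
    return states


# Hypothetical network: inputs[i] lists the nodes j with an edge j -> i.
inputs = [(2,), (0,), (0, 1)]
functions = [
    lambda v: 1 - v[0],     # node 0 negates node 2
    lambda v: v[0],         # node 1 copies node 0
    lambda v: v[0] & v[1],  # node 2 is the AND of nodes 0 and 1
]

states = orbit((0, 0, 0), inputs, functions, 5)
# Individual trajectories X_i: the states of node i along the orbit.
trajectories = list(zip(*states))
print(states)
print(trajectories)
```

Starting from (0, 0, 0), this particular network returns to its initial configuration after five steps, so the whole orbit lies on an attractor of period 5.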
Multivariate Information in RBNs 595
which describes the average shared information between two or more nodes of
the network, with $X^{-i} = (X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n)$. Note that $H(X_i | X^{-i})$
corresponds to the residual information $R(X_i)$ of the node $i$, that is, all
information that can only be retrieved by directly accessing the dynamic information of
$i$. Then consider the total correlation:
\[
C(X^n) = N(X^n) - \sum_{i=1}^{n} N(X_i) = \sum_{i=1}^{n} H(X_i) - H(X^n), \tag{2}
\]
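As a small numerical illustration of the total correlation, the sketch below estimates entropies (in bits) from a list of observed configurations using the empirical distribution. The estimator, the function names, and the toy ensembles are assumptions made for the example.

```python
import math
from collections import Counter


def entropy(samples):
    """Shannon entropy in bits of the empirical distribution of the samples."""
    total = len(samples)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(samples).values()
    )


def total_correlation(configs):
    """C(X^n): sum of the individual entropies minus the joint entropy."""
    n = len(configs[0])
    marginals = sum(entropy([config[i] for config in configs]) for i in range(n))
    return marginals - entropy(configs)


# Toy ensembles of configurations, each observed with equal probability.
copies = [(0, 0), (1, 1)]                       # two nodes always in the same state
independent = [(0, 0), (0, 1), (1, 0), (1, 1)]  # two unrelated nodes
xor = [(a, b, a ^ b) for a in (0, 1) for b in (0, 1)]

c_copies = total_correlation(copies)
c_independent = total_correlation(independent)
c_xor = total_correlation(xor)
print(c_copies, c_independent, c_xor)
```

Two identical nodes share one bit, two independent nodes share nothing, and the XOR triple has three one-bit marginals but only two bits of joint entropy, which again gives one bit.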
In order to contrast the results with the phase diagram of the RBNs presented
in [3], Fig. 1 corresponds to a 2D projection of $\Omega(X^n)$ as a function of $\bar{K}$ and
$p$. Here it is observed that the O-information is indeed capable of capturing
relevant aspects of the dynamics, since it is clearly sensitive to the structural
characteristics of the network. The most interesting finding is that the critical
point in RBNs is defined by a balance between redundancy and synergy,
corresponding to $\Omega(X^n) \approx 0$. With this, as the chaos in the dynamics increases, the
ability to store the same information in different nodes is lost, in exchange for
the ability to share information in more complex ways spread throughout the
network. Likewise, the tendency to order in the dynamics hinders this
propagation of information in favor of robustness. In general, $\Omega(X^n)$ is able to capture
the phase diagram of the standard RBN model, attributing redundant
characteristics to the ordered regime and synergistic characteristics to the chaotic regime.
Thus, in the interval $p \in [0.5, 1]$, $\Omega(X^n)$ decreases with
$\bar{K}$ and increases with $p$. This allows us to hypothesize that within both the
ordered and chaotic phases there is a certain continuous change determined by
the magnitude of $\Omega(X^n)$. That is, the greater the distance to the critical point,
the more ordered (or chaotic) the dynamics will be.
Fig. 3. Boolean networks with n = 10 and K = 1. The colors on the nodes correspond
to different redundancy families.
This structure is known as Feed Forward Loop (FFL) and, despite its struc-
tural simplicity, it has been widely studied in the literature [4,12,17] for its
significant concentration and relevance in the dynamics of real biological net-
works. Mainly, FFLs are considered as necessary design components to improve
the robustness of biological systems such as human signaling networks, genetic
regulation networks, and others, favoring adaptation against possible external
variations. Thus, FFL structures can strengthen external signal processing or
play the role of molecular clocks, a necessary condition for the existence of
biological rhythms such as circadian rhythms. This quality of synchronization is
strongly validated in [17], by means of a genetic algorithm that finds in the
FFLs the core of rhythm generation (with the size of the ring matching the
period of the attractors), operating as coordinator of the propagation of state
changes.
Here, we will show how the FFL gives rise to redundancy, and thus to the
robustness attributed to it in the literature. To do this, first consider an FFL
without chains, that is $c = n$. It is not difficult to see that in such a structure
there are no transients, therefore $X^n$ is uniform with respect to each of the
$2^c$ possible states. With these conditions the calculation of the O-information
is trivial, since the total entropy, the individual entropies, and the residual
information are maximum: $\sum_{i=1}^{n} H(X_i) = H(X^n) = \sum_{i=1}^{n} R(X_i) = c = n$, and with
this $C(X^n) = B(X^n) = \Omega(X^n) = 0$, and there is no high-order interdependency
between the nodes of the ring.
We now add chains to the analysis. It is useful to understand a circuit as a
signal propagator through the chains (analogously to [17]). Consider $X_i^{\to}$, with
600 S. Orellana and A. Moreira
$i \in [1, \dots, c]$, as the set of all nodes that receive the same signal from the sender
node $i$. We will say that the set $X_i^{\to}$ corresponds to a redundancy family, since
the nodes that compose it maintain an identical dynamic (or complementary
in case of receiving the inverted signal as a consequence of negative edges) and
accessing one of them is enough to retrieve the total information at any time $t$,
that is, they share redundant information. To visualize these redundancy families
in Fig. 3 they are labeled with different colors. There are $c$ redundancy families,
which start on the ring, and have the actual redundancy in the chains that receive
the signal. As expected, this dynamics is reflected in $C(X^n)$. Regardless of the
Boolean function applied by each node, if $K = 1$ the network is a conservative
system, that is, the number of 1’s and 0’s in each node and at all times $t$ is
constant and in fact balanced. With this $H(X_i) = 1$, $\forall i \in [1, \dots, n]$, therefore
the sum of individual entropies is $\sum_{i=1}^{n} H(X_i) = n$. On the other hand, since
the chains do not interfere with the rhythm of the signal, but rather preserve it
until dying out or finding a new subsystem to feed (for cases $K > 1$), the total
entropy $H(X^n)$ remains constant at $c$ even when chains are included. Thus,
$C(X^n) = n - c$, matching the number of chain nodes. This phenomenon sheds
light on the evolution of networks in [17]. There, in the search for robustness,
often the genetic algorithm favored the early creation of long chains, prior to the
generation of a dominant FFL, since it is in such chains that redundancy lies.
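The ring and chain cases can be checked numerically on a small example. The sketch below rests on several assumptions that are not taken from the text: every node copies its single input, the ensemble consists of the configurations reached after a burn-in from all $2^n$ equally likely initial conditions, residual information is computed as $H(X^n) - H(X^{-i})$, and $B$ and $\Omega$ follow the usual definitions $B = H(X^n) - \sum_i R(X_i)$ and $\Omega = C - B$ from [20]. The two topologies and all function names are invented for the example.

```python
import math
from collections import Counter
from itertools import product


def entropy(samples):
    """Shannon entropy in bits of the empirical distribution of the samples."""
    total = len(samples)
    return -sum((c / total) * math.log2(c / total) for c in Counter(samples).values())


def stationary_configs(source, burn_in):
    """Configurations after burn_in steps from every initial condition.

    source[i] is the single input of node i (K = 1, copy functions assumed).
    """
    n = len(source)
    configs = []
    for state in product((0, 1), repeat=n):
        for _ in range(burn_in):
            state = tuple(state[source[i]] for i in range(n))
        configs.append(state)
    return configs


def info_metrics(configs):
    """Total correlation C, dual total correlation B, O-information, residuals."""
    n = len(configs[0])
    h_joint = entropy(configs)
    h_single = [entropy([cfg[i] for cfg in configs]) for i in range(n)]
    # R(X_i) = H(X_i | X^{-i}) = H(X^n) - H(X^{-i})
    residual = [h_joint - entropy([cfg[:i] + cfg[i + 1:] for cfg in configs]) for i in range(n)]
    C = sum(h_single) - h_joint
    B = h_joint - sum(residual)
    return {"H": h_joint, "C": C, "B": B, "Omega": C - B, "R": residual}


# Ring of c = 3 nodes (0 <- 2, 1 <- 0, 2 <- 1), without chains.
ring = info_metrics(stationary_configs([2, 0, 1], burn_in=3))
# Same ring plus a chain of two nodes hanging from node 0 (3 <- 0, 4 <- 3).
ring_chain = info_metrics(stationary_configs([2, 0, 1, 0, 3], burn_in=5))

print(ring)
print(ring_chain)
```

Under these assumptions the ring alone gives $C = B = \Omega = 0$ with one bit of residual information per node, while adding the two chain nodes keeps the joint entropy at $c = 3$ and yields $C = n - c = 2$.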
Ensuring the predominance of redundancy in the structure means that we
should also be able to calculate the effects of chains on $B(X^n)$. For this we
require the calculation of the sum of residual information $\sum_{i=1}^{n} R(X_i)$. A chain
node is necessarily part of some redundancy family of a ring node, therefore
all information contained in it can always be recovered by looking at any other
node of the same color. Thus, the residual information of the chain nodes is
null: $R(X_i) = 0$ for $i \in [c + 1, \dots, n]$. Likewise, the information of the ring
nodes that form a redundancy family with at least one additional node of
the chains can also be completely recovered without the need to access them
directly. On the contrary, if a ring node is unique in its redundancy family, its
dynamic information will also be unique and with it $R(X_i) = 1$. The following
lemma gives the value of the O-information:
Lemma 1. Given a Boolean network where K = 1 for all its nodes,
Example 1. Let $X_1^n$ and $X_2^n$ be the stationary behavior produced by the structures
in Fig. 3a and 3b respectively.
Fig. 4. CFBLs with circuit of size c and c∗ nodes shared with each other. Panels:
c = 3, c∗ = 2; c = 4, c∗ = 3; c = 4, c∗ = 2; c = 5, c∗ = 2. Colors label
different redundancy families.
signal amplification, while if they are negative they promote homeostasis. Fur-
thermore, coherence in CFBLs has been identified as a design principle in human
signaling networks by favoring greater robustness in dynamics [10].
Table 1. Metrics for CFBLs, with circuits of size c and c∗ shared nodes.
We computed the values of $\Omega(X^n)$, $B(X^n)$, and $C(X^n)$ for small examples
of CFBLs. A relevant characteristic is that these metrics are exclusively
influenced by the sign of the circuits that compose the structure. In fact, incoherent
CFBLs are dynamically irrelevant with respect to high-order interdependencies,
since $\Omega(X^n) = B(X^n) = C(X^n) = 0$. Thus, the complex nature of the
networks that contain this structure should only depend on their condition of
coherence. Table 1 presents the results of calculating the metrics on the four variants
of CFBLs in Fig. 4, considering positive and negative coherence in each case.
The general trend is indeed a balance between synergy and redundancy with
$B(X^n) \approx C(X^n) = 0$. The value of the synergistic component in each case seems to
be explained with a similar approach to that used in simple FFLs, since $B(X^n)$
is limited by the number of redundancy families between nodes not shared by
both circuits. Indeed, when coherence is negative, the values of $B(X^n)$ are
very close to their structural maximum, while when coherence is positive,
the magnitude of the interdependencies decreases slightly. A consequence
of increasing the size of the structure by means of non-shared nodes is precisely
the increase of the structural synergistic maximum, and that the magnitude gap
between positive and negative coherence is gradually increased.
Table 2. Ω(X n ) and its components as a function of the depth of the emerging chains
of the CFBL with c = 4 and c∗ = 3.
5 Conclusions
The work presented here serves two purposes: to rediscover RBNs’ phase dia-
gram in terms of high-order statistical interdependencies (thereby validating
O-information as a tool for analyzing heterogeneous dynamical systems), and
to delve into structural aspects characterizing the ordered regime and the edge
of chaos. Specifically, it was determined that the ordered regime is characterized
by redundant shared information, while the chaotic regime is characterized by
synergistic shared information. Furthermore, the critical point between the two
regimes seems to be well defined by a balance between redundancy and synergy.
It is also interesting that such behaviors seem to be conserved despite variations
in network topologies, as evidenced at least for scale-free networks. On the other
hand, the robustness of the interdependencies against perturbations in the initial
configurations for each regime is quantified, finding in the O-information a fairly
consistent dynamic robustness metric. All of the above suggests exploiting this
statistical approach offered by RBNs in other widely studied variants such as the
diversification of topologies, the effect of network size, restriction to particular
Boolean functions, use of asynchronous update schedules, etc.
On the other hand, FFLs are postulated as the fundamental redundancy
generation structures in Boolean networks. Likewise, CFBLs seem to correspond
to fundamental components in the generation of complex dynamics associated
with the critical point between order and chaos, where the quality of coherence
seems to be relevant for information processing. This also opens up the possibility
of using O-information as a tool for analyzing the dynamics induced by motifs.
In work in progress, we use this approach to shed light on the roles of individual
nodes with respect to high-order interdependencies of a Boolean network.
Finally, we want to stress that O-information may be applied to any dynamical
system on complex networks (beyond the Boolean case). However, there is still
work to be done to control the computational complexity of its evaluation (or
estimation), before it can be used as a practical analysis or design tool on large
volumes of data (beyond the theoretical guarantees presented in [20]).
References
1. Aldana, M.: Boolean dynamics of networks with scale-free topology. Phys. D 185,
45–66 (2003)
2. Brillouin, L.: The negentropy principle of information. J. Appl. Phys. 24, 1152–
1163 (1953)
3. Derrida, B., Pomeau, Y.: Random networks of automata: a simple annealed approx-
imation. Europhys. Lett. 1, 45–49 (2007)
4. Drossel, B.: Random boolean networks. In: Reviews of Nonlinear Dynamics and
Complexity, Vol. 1, Ed. H.G. Schuster, Wiley (2008)
5. Gates, A.J., Rocha, L.M.: Control of complex networks requires both structure and
dynamics. Sci. Rep. 6(1), 1–11 (2016)
6. Griffith, V., Koch, C.: Quantifying synergistic mutual information. In: Guided Self-
Organization: Inception. Springer (2014)
7. James, R., Ellison, C., Crutchfield, J.P.: Anatomy of a bit: information in a time
series observation. Chaos 21, 037109 (2011)
8. Kauffman, S.A.: Metabolic stability and epigenesis in randomly constructed genetic
nets. J. Theor. Biol. 22, 437–467 (1969)
9. Kim, J.R., Yoon, Y., Cho, K.H.: Coupled feedback loops form dynamic motifs of
cellular networks. Biophys. J. 94, 359–365 (2008)
10. Kwon, Y.K., Cho, K.H.: Coherent coupling of feedback loops: a design principle of
cell signaling networks. Bioinformatics 24, 1926–1932 (2008)
11. Langton, C.G.: Computation at the edge of chaos: phase transitions and emergent
computation. Phys. D 42, 12–37 (1990)
12. Le, D.H., Kwon, Y.K.: A coherent feedforward loop design principle to sustain
robustness of biological networks. Bioinformatics 29, 630–637 (2013)
13. Lizier, J., Prokopenko, M., Zomaya, A.: The information dynamics of phase transi-
tions in random boolean networks. In Proceedings of 11th International Conference
on the Simulation and Synthesis of Living Systems (ALife XI). MIT Press (2008)
14. Lloyd-Price, J., Gupta, A., Ribeiro, A.S.: Robustness and information propagation
in attractors of random boolean networks. PLoS One 7(7), e42018 (2012)
15. Marques-Pita, M., Rocha, L.: Canalization and control in automata networks: body
segmentation in Drosophila melanogaster. PLoS One 8(3), e55946 (2013)
16. Oosawa, C., Savageau, M.A.: Effects of alternative connectivity on behavior of
randomly constructed Boolean networks. Phys. D 170, 143–161 (2002)
17. Philipp, R., Di Paolo, E.A.: The circular topology of rhythm in asynchronous
random Boolean networks. Biosystems 73, 141–152 (2004)
18. Ribeiro, A., Kauffman, S.A., Lloyd-Price, J., Samuelsson, B., Socolar, J.: Mutual
information in random Boolean models of regulatory networks. Phys. Rev. E 77,
011901 (2008)
19. Richard, A.: Positive and negative cycles in Boolean networks. J. Theor. Biol. 463,
67–76 (2019)
20. Rosas, F., Mediano, P., Gastpar, M., Jensen, H.: Quantifying high-order interde-
pendencies via multivariate extensions of the mutual information. Phys. Rev. E
100, 032305 (2019)
21. Williams, P.L., Beer, R.D.: Nonnegative decomposition of multivariate information.
arXiv preprint arXiv:1004.2515 (2010)
Earth Sciences Applications
Complexity of the Vegetation-Climate System
Through Data Analysis
1 Introduction
Grassland areas account for 40% of the Earth's terrestrial surface and cover more than
5 million hectares in the Iberian Peninsula. In Spain, grasslands represent one of
the most beneficial ecosystems for different purposes: biodiversity, meat production,
landscape, preservation of traditional values and rural population fixation.
Due to their low cost and real-time data acquisition, remote sensing (RS) techniques
have been acknowledged as an appropriate tool to monitor ecosystems. They are based
on obtaining reflectance information from the surface properties. In particular, vegetation
has a distinctive response in the near infrared (740–1110, 1300–2500 nm) and the visible
(400–700 nm) areas of the electromagnetic spectrum [1].
Vegetation indices (VIs) are mathematical combinations of two or more selected
reflectance bands related to biochemical and biophysical vegetation parameters [2].
However, VIs do not identify vegetation activity well when bare soil represents
a large part of the surface to be analysed, as is the case in arid and semi-arid areas.
The Modified Soil Adjusted Vegetation Index (MSAVI) [3] was developed as a solution for
these cases by including a soil adjustment factor.
VI series present time cycles that allow the dynamics of agro-environmental systems
to be described. Previous studies indicate that the behaviour of vegetation indices is controlled by
climatic fluctuations and have revealed a delayed response of vegetation growth as a
result of the interactions between climate and vegetation [4]. In this line, researchers
suggest that there are different lags depending on the variable, ranging from one to two
months [5]. In our case, the cross-correlation method allowed us to measure the correlation
between vegetation indices and climate variables at different lags. Through this analysis,
an optimal lag (τ) is obtained, at which the correlation between the climate variable and
the vegetation index is maximum.
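The lag search described above can be sketched as follows. This is a minimal illustration rather than the procedure actually used in the study: the function names are hypothetical, a positive lag is assumed to mean that the climate variable leads the vegetation index by that many 8-day periods, and the optimal lag is taken as the one with the largest absolute correlation.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def lagged_correlations(vi, climate, max_lag):
    """Correlate vi[t] with climate[t - lag] for lag = 0..max_lag."""
    result = {}
    for lag in range(max_lag + 1):
        # Pair each vegetation value with the climate value `lag` steps earlier.
        vi_part = vi[lag:]
        climate_part = climate[:len(climate) - lag]
        result[lag] = pearson(vi_part, climate_part)
    return result

def optimal_lag(vi, climate, max_lag):
    """Lag whose correlation has the largest absolute value (an assumption)."""
    corr = lagged_correlations(vi, climate, max_lag)
    return max(corr, key=lambda lag: abs(corr[lag]))
```

Using the absolute value lets a strong negative relationship, such as a temperature effect during a dry phase, be selected as well as a positive one.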
Vegetation-climate systems present nonlinear characteristics, as does any complex system.
In 1987, [6] introduced recurrence plots (RPs), which are a simple way to visualize
the periodic or chaotic behaviour of a dynamical system through its phase space. There
are several works in the framework of RPs and VIs. As an example, [7] applied RPs to
measure the determinism and predictability of the NDVI series and its spatial patterns.
This work aims to understand the complex dynamics of the pasture-climate system in
a semi-arid area. To this end, the MSAVI temporal dynamics of this area are identified through
recurrence plots (RPs) and recurrence quantification analysis (RQA).
Soto Del Real, Madrid (Spain), referred to as ZMA, was the pasture area selected for this
work. This site has a characteristic Mediterranean climate with warm summers, scarce
precipitation, and cold winters. The study area is located on the hillsides of the Guadarrama
Sierra (Central Spain), where soil materials such as granite and gneiss are predominant.
ZMA is situated at 958 m a.s.l. and the average slope is 4.7%. The ZMA soil is a Dystric
Cambisol, with a topsoil (0–15 cm) of sandy loam texture, 3% organic matter and a
pH of 5.6. Average precipitation and average temperature are 541 mm and 13.6 °C,
respectively.
The dominant vegetation in the area is Mediterranean grassland that grows during
spring and autumn, with a summer senescence period and vegetative winter dormancy.
Pasture vegetation is grazed by cows and sheep throughout the year with
different intensities depending on pasture production, exposing a mix of bare soil and
vegetation (live or dead).
Pasture plots were selected based on three criteria: i) maximum area covered by
pasture grassland with no woodland, ii) continuous pastureland practices during the
analysed period and iii) pastureland cover in the contiguous area. Finally, three plots of
500 × 500 m between (4°32′00″ W, 4°33′00″ W) and (40°37′00″ N, 40°39′00″ N) were
selected.
daily air temperature (Tm) and daily precipitation were obtained. These variables were
transformed into series in which TMP was the average of Tm over each 8-day period and PCP was
the precipitation accumulated over the same 8 days. The length of these series was the
same as that of the MSAVI series.
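As an illustration of this transformation, the sketch below collapses a daily series into consecutive 8-day periods. The function name and the `how` argument are hypothetical. For simplicity it assumes the periods run back to back from the first day, ignoring the restart of the MODIS 8-day compositing calendar at the beginning of each year.

```python
def aggregate_8day(daily_values, how):
    """Collapse a daily series into consecutive 8-day periods.

    how = "mean" averages each period (as for TMP);
    how = "sum" accumulates each period (as for PCP).
    A trailing period shorter than 8 days uses the days available.
    """
    if how not in ("mean", "sum"):
        raise ValueError("how must be 'mean' or 'sum'")
    periods = []
    for start in range(0, len(daily_values), 8):
        window = daily_values[start:start + 8]
        total = sum(window)
        periods.append(total / len(window) if how == "mean" else total)
    return periods
```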
Based on these phases, linear regressions and Pearson's correlation analyses were
conducted to show the relationship between vegetation indices and climate variables.
VI dynamics fluctuate depending on the season of the year. This indicates that a constant
optimum time lag (τ) throughout the year might be inadequate. Taking into account
the differing time lags for each climate factor, the relationship between the MSAVI
and climate variables was analysed through correlation coefficients, calculated by the
following equation:
$$P_{x,y} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2 \, \sum_{i=1}^{N} (y_i - \bar{y})^2}} \qquad (3)$$
where N is the number of years, x is the MSAVI series and y is the climate variable
series. The correlation between the MSAVI and climate time series was then calculated in each
of the above-mentioned phases, using a cumulative 8-day lag period over the
phase duration.
where N is the number of measured states xi, Θ is the Heaviside step function (i.e. Θ(x) =
1 if ‖xi − xj‖ ≤ ε, and Θ(x) = 0 otherwise), ‖·‖ is a norm and ε is a threshold previously
defined based on the time-series properties. In this study, the phase space trajectories are
based on the Euclidean distance between xi and xj of the series. If Rij = 1, the state is
recurrent and is marked as a black dot at position (i, j). Otherwise, if Rij = 0, the
position is represented as a white dot.
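A minimal sketch of how such a recurrence matrix can be built is given below. It assumes a standard time-delay embedding with dimension m and delay τ, which the definition above does not spell out, and all function names are hypothetical.

```python
import math

def embed(series, m, tau):
    """Time-delay embedding: state i is (x[i], x[i+tau], ..., x[i+(m-1)*tau])."""
    n_states = len(series) - (m - 1) * tau
    return [[series[i + k * tau] for k in range(m)] for i in range(n_states)]

def recurrence_matrix(series, m, tau, eps):
    """R[i][j] = 1 when the Euclidean distance between states i and j is <= eps."""
    states = embed(series, m, tau)
    n = len(states)
    return [[1 if math.dist(states[i], states[j]) <= eps else 0
             for j in range(n)] for i in range(n)]

def recurrence_rate(matrix):
    """Fraction of matrix entries that are recurrence points."""
    n = len(matrix)
    return sum(map(sum, matrix)) / (n * n)
```

For a strictly alternating series with m = 1 and a small ε, every second state recurs, so half of the matrix entries are recurrence points.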
Recurrence Quantification Analysis (RQA) is based on the quantification of the
small-scale structures in RPs [12]. Several measures of complexity have been proposed;
in this work, we focus on Determinism (DET), Average length of structures
(LT), Shannon Entropy (ENTR) and Laminarity (LAM).
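The four measures can be illustrated with the sketch below, which works on any square 0/1 recurrence matrix. It follows the usual textbook definitions rather than the exact implementation of the software used in this work. The minimum line length of 2, the exclusion of the main diagonal for diagonal lines and its inclusion for vertical lines are simplifying assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter

def run_lengths(bits):
    """Lengths of maximal runs of consecutive 1s in a 0/1 sequence."""
    lengths, current = [], 0
    for b in bits:
        if b:
            current += 1
        elif current:
            lengths.append(current)
            current = 0
    if current:
        lengths.append(current)
    return lengths

def rqa_measures(R, min_len=2):
    """DET, LT, ENTR and LAM of a square, symmetric 0/1 recurrence matrix."""
    n = len(R)
    # Diagonal lines above the main diagonal (one triangle suffices by symmetry).
    diag = [length for k in range(1, n)
            for length in run_lengths([R[i][i + k] for i in range(n - k)])]
    # Vertical lines, column by column (main diagonal included: a simplification).
    vert = [length for j in range(n)
            for length in run_lengths([R[i][j] for i in range(n)])]
    long_diag = [length for length in diag if length >= min_len]
    long_vert = [length for length in vert if length >= min_len]
    det = sum(long_diag) / sum(diag) if diag else 0.0
    lam = sum(long_vert) / sum(vert) if vert else 0.0
    mean_len = sum(long_diag) / len(long_diag) if long_diag else 0.0
    counts = Counter(long_diag)
    entr = -sum((c / len(long_diag)) * math.log(c / len(long_diag))
                for c in counts.values()) if long_diag else 0.0
    return {"DET": det, "LT": mean_len, "ENTR": entr, "LAM": lam}
```

A strictly periodic signal gives DET = 1 and LAM = 0 under these definitions, whereas a constant signal gives LAM = 1.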
The crqa R package [13], based on the Cross Recurrence Plot Toolbox developed
by [14], was used to construct the RPs and obtain the RQA measures. First, the MSAVI series were
normalized using the z-score and the distance matrix was rescaled by its maximum
value, following the recommendations of [15] and [16]. The optimizeParam function was
then used to find the optimal values of the three parameters (τ, m, and ε). The
delay (τ) is obtained by finding the local minimum to which the mutual information drops for
both series. The embedding dimension (m) is determined by the false nearest neighbours
algorithm. The threshold ε is estimated by an iterative process based on the standard
deviation (SD) of the time series.
The quantification of RP structures was computed with the crqa function using the
three values obtained from the optimization.
Fig. 1. Box plots of MSAVI and average temperature (A) and accumulated precipitation (B),
8-day period at ZMA (Madrid).
Based on box plots results, linear regression analysis is conducted to study the relation
of each climatic variable in each phase with MSAVI values (Fig. 2).
Fig. 2. Relation of MSAVI to average temperature (TMP) and accumulated precipitation (PCP)
at 8-day period during phases 1 (first column), 2 (second column) and 4 (third column).
In general, temperature is identified as the main driving factor in the vegetation-climate
system, as it shows high R² values (>0.9) in all the phases. Precipitation shows
lower R² values than temperature, the highest (>0.7) being in P2. It is
important to note that the temperature trend varies depending on the phase, being positive
in P1 and negative in P2 and P4. Meanwhile, precipitation keeps the same
trend, being positive in all the phases, which points out that precipitation is consistently
favourable to semi-arid grassland growth. This is in line with previous work, such
as [18], which revealed a positive relationship between the amount of precipitation and
net primary grassland production.
Table 2. Cross-correlation coefficients between MSAVI and temperature (TMP) and precipitation
(PCP) at different lags in each phase at ZMA (Madrid). Bold letter represents the maximum
correlation in each row. Each time lag is an 8-day period.

Phase  Variable  Time lags
                  0       −1      −2      −3      −4      −5      −6
P1     TMP        0.422   0.308   0.259   0.233   0.169   0.109   0.029
       PCP        0.032   0.088   0.186   0.221   0.156   0.102   0.093
P2     TMP       −0.763  −0.757  −0.717  −0.699  −0.695  −0.697  −0.637
       PCP        0.359   0.396   0.381   0.321   0.357   0.309   0.230
P4     TMP       −0.594  −0.600  −0.625  −0.602  −0.603  −0.581  −0.512
       PCP        0.159   0.297   0.393   0.397   0.293   0.275   0.304
Fig. 3. A) Recurrence plots of MSAVI for the ZMA zone. The vegetation index series is normalized by
the z-score method. Time units are represented on the X and Y axes. Each time unit is a period
of 8 days, coinciding with the 8-day composite MODIS images during the study period (2002–2018).
Embedding dimension m = 1, delay τ = 0 and recurrence rate RR = 1.5%. B) Optimized recurrence
plots using normalized VI data and rescaled distance matrix for ZMA.
A higher embedding dimension than expected was obtained. This fact is supported by
findings in the literature [21], which relate higher dimensionality to system complexity.
Ecological systems present nonlinear dynamics, combining periodic and chaotic cycles,
whose governing equations are unknown. In this line, [20] demonstrated the
usefulness of RPs to describe nonlinear behaviours in high-dimensional systems, such
as MSAVI time series.
The quantification of RP structures was computed with the crqa function, and the results
are shown in Table 3.
Table 3. Recurrence Quantification Analysis (RQA) of ZMA = Soto Del Real, using rescaled
data. RQA of artificial series, adapted from [22], were added for comparison. MSAVI = Modified
Soil-Adjusted Vegetation Index, DET = Determinism, LT = Average length of diagonal structures,
ENTR = Shannon Entropy, LAM = Laminarity.
Based on the density of recurrence points, determinism has been related to the chaotic
or periodic behaviour of the system, representing a measure of temporal stochasticity.
Determinism (DET) has been used as an indicator of climate stability [7] or for the
detection of bioclimatic transitions [22]. Our results suggest that a high value of DET
is related to an adequate characterisation of the pasture vegetation pattern through the MSAVI
index.
Increases in LT are interpreted as a longer time of predictability, as reported
by [23]. The MSAVI LT values obtained are low compared to those of periodic series (Table 3),
which indicates that vegetation may be predicted only in the short term, owing to the great complexity
of ecological systems reported by [24].
ENTR refers to the disorder of the system. Standard values obtained by [22] showed
that stochastic systems tend to yield lower ENTR values (0.2) than
periodic systems (2.20). We speculate that the high value of MSAVI ENTR
is a consequence of the high number of precipitation events in the zone. The precipitation box plot
sheds light on the water status in the area: in this case, ZMA precipitation
is higher than the Mediterranean climate average. This is supported by the findings of [20],
which suggest that grassland areas with higher precipitation tend to yield higher ENTR
values.
LAM refers to chaos-chaos transitions and is directly related to the detection of
laminar states [25]. The MSAVI series presents a high number of laminar states, which
indicates that VI values are trapped during certain time frames, decreasing time series
variability and supporting the idea of higher predictability and determinism of the MSAVI
index.
4 Conclusions
In summary, we have applied the cross-correlation method as a prior step to characterize
the complexity of the vegetation-climate system, concluding that temperature is a strong
driving factor. However, it is important to note that precipitation showed a stable positive
trend across the phases, suggesting that precipitation events are beneficial in arid and semi-arid
grasslands regardless of the time of year. In addition, the lag between
the MSAVI and climatic series was found to vary depending on the phase and the climatic variable.
RP and RQA were then applied to the MSAVI time series to measure the complexity
of the pasture-climate system. We detected characteristic dynamics that point to short-term
predictability and high dimensionality of the MSAVI time series. Finally, this
work emphasizes the potential of recurrence plots and recurrence quantification analysis
to characterise and quantify the complexity of a vegetation-climate system.
References
1. Rouse, J.W., Haas, R.H., Schell, J.A., Deering, D.W.: Monitoring the vernal advancement and
retrogradation (green wave effect) of natural vegetation. Prog. Rep. RSC 1978–1, 2–8 (1973)
2. Huete, A., Didan, K., Miura, T., Rodriguez, E.P., Gao, X., Ferreira, L.G.: Overview of the
radiometric and biophysical performance of the MODIS vegetation indices. Remote Sens.
Environ. 83(1–2), 195–213 (2002)
3. Qi, J., Chehbouni, A., Huete, A.R., Kerr, Y.H., Sorooshian, S.: A modified soil adjusted
vegetation index. Remote Sens. Environ. 48(2), 119–126 (1994)
4. Guo, B., Zhou, Y., Wang, S., Tao, H.: The relationship between normalized difference vegetation
index (NDVI) and climate factors in the semiarid region: a case study in Yalu Tsangpo
River basin of Qinghai-Tibet Plateau. J. Mt. Sci. 11(4), 926–940 (2014). https://doi.org/10.1007/s11629-013-2902-3
5. Shen, B., Fang, S., Li, G.: vegetation coverage changes and their response to meteorological
variables from 2000 to 2009 in Naqu, Tibet, China. Can. J. Remote. Sens. 40(1), 67–74 (2014)
6. Eckmann, J.P., Oliffson Kamphorst, O., Ruelle, D.: Recurrence plots of dynamical systems.
Epl 4(9), 973–977 (1987)
7. Li, S.C., Zhao, Z.Q., Liu, F.Y.: Identifying spatial pattern of NDVI series dynamics using
recurrence quantification analysis. Eur. Phys. J. Spec. Top. 164(1), 127–139 (2008)
8. LP DAAC: Land processes distributed active archive center: surface reflectance 8-day L3
global 500 m, NASA and USGS (2014)
9. Baret, F., Guyot, G.: Potentials and limits of vegetation indices for LAI and APAR assessment.
Remote Sens. Environ. 35(2–3), 161–173 (1991)
10. Xu, D., Guo, X.: A study of soil line simulation from landsat images in mixed grassland.
Remote Sens. 5(9), 4533–4550 (2013)
11. Xu, M., Eckstein, Y.: Use of weighted least squares method in evaluation of the relationship
between dispersivity and field scale. Groundwater 33(6), 905–908 (1995)
12. Webber, C.L., Zbilut, J.P.: Dynamical assessment of physiological systems and states using
recurrence plot strategies. J. Appl. Physiol. 76(2), 965–973 (1994)
13. Coco, M.I., Dale, R.: Cross-recurrence quantification analysis of categorical and continuous
time series: an R package. Front. Psychol. 5(1), 1–14 (2014)
14. Marwan, N.: CRP Toolbox 5.22 (R32.4) (2007). http://tocsy.pik-potsdam.de/CRPtoolbox/.
Accessed 28 June 2019
15. Patro, S.G.K., Sahu, K.K.: Normalization: a preprocessing stage. Iarjset, pp. 20–22 (2015)
16. Webber, C.L., Zbilut, J.: Recurrence quantification analysis of nonlinear dynamical systems.
In: Tutorials in contemporary nonlinear methods for the Behavioral Sciences Web Book, no.
June, pp. 26–94 (2005). http://www.nsf.gov/sbe/bcs/pac/nmbs/nmbs.jsp. Accessed 5 June
2019
17. Wang, X., Ge, L., Li, X.: Pasture monitoring using SAR with COSMO-skymed, ENVISAT
ASAR, and ALOS PALSAR in Otway, Australia. Remote Sens. 5(7), 3611–3636 (2013)
18. Heisler-White, J.L., Knapp, A.K., Kelly, E.F.: Increasing precipitation event size increases
aboveground net primary productivity in a semi-arid grassland. Oecologia 158(1), 129–140
(2008)
19. Proulx, R., Parrott, L., Fahrig, L., Currie, D.J.: Long time-scale recurrences in ecology:
detecting relationships between climate dynamics and biodiversity along a latitudinal gradient.
In: Webber, C.L., Marwan, N. (eds.) Recurrence Quantification Analysis – Theory and Best
Practices, no. February, pp. 335–347. Springer, Cham (2015)
20. Marwan, N., Kurths, J., Foerster, S.: Analysing spatially extended high-dimensional dynamics
by recurrence plots. Phys. Lett. Sect. A Gen. At. Solid State Phys. 379(10–11), 894–900 (2015)
21. Belaire-Franch, J., Contreras, D., Tordera-Lledó, L.: Assessing nonlinear structures in real
exchange rates using recurrence plot strategies. Phys. D Nonlinear Phenom. 171(4), 249–264
(2002)
22. Zhao, Z., Liu, J., Peng, J., Li, S., Wang, Y.: Nonlinear features and complexity patterns of
vegetation dynamics in the transition zone of North China. Ecol. Indic. 49, 237–246 (2015)
23. Frilot II, C., Kim, P., Carrubba, S., McCarty, D., Chesson Jr., A.L., Marino, A.: Analysis of
brain recurrence. In: Webber, C.L., Marwan, N. (eds.) Recurrence Quantification Analysis
– Theory and Best Practices, no. February, pp. 213–251. Springer, Cham (2015)
24. Beckage, B., Gross, L.J., Kauffman, S.: The limits to prediction in ecological systems.
Ecosphere 2(11), 1–12 (2011)
25. Marwan, N., Wessel, N., Meyerfeldt, U., Schirdewan, A., Kurths, J.: Recurrence-plot-based
measures of complexity and their application to heart-rate-variability data. Phys. Rev. E -
Stat. Physics, Plasmas, Fluids, Relat. Interdiscip. Top. 66(2), 1–8 (2002)
Towards Understanding Complex Interactions
of Normalized Difference Vegetation Index
Measurements Network and Precipitation
Gauges of Cereal Growth System
Madrid, Spain
Abstract. Earth observations (EO) are nowadays a powerful tool to evaluate vegetation
systems such as crops in order to reach the Sustainable Development Goals (SDGs) of the
2030 Agenda. The Normalized Difference Vegetation Index (NDVI) is a popular and
widespread index in remote sensing to evaluate vegetation dynamics. However,
analytical advances in NDVI long-term series analysis are moving towards understanding
the complex relations of the atmosphere-plant-soil system through temporal and scaling
behaviour. Hence, this research presents the generalized structure function (GSF)
and Hurst exponent as innovative analytical methods to explore a satellite-based
network of NDVI measurements and precipitation series in cereals in semi-arid conditions.
Results suggest that weather supports the anti-persistent structure of the NDVI time series,
since the weather regime in semi-arid areas is essential to understanding the complex processes
of crop growth. The mathematical description of NDVI series coupled with the
GSF and Hurst exponent can reinforce future crop modelling.
1 Introduction
Earth observation time series analysis is increasingly being improved for the evaluation of
multiple vegetated and unvegetated areas. However, characterizing agricultural land processes
coupled to weather is challenging due to the multiplicity of processes and factors affecting
vegetation growth. One of these growth factors in semi-arid areas is, in particular, the rainfall
behaviour on agricultural fields, in which plant, soil, and climate are strongly correlated
with crop yield. These relationships are commonly analysed using vegetation indices
such as the normalized difference vegetation index (NDVI).
The analysis of intensive long-term cereal sequences from earth
observations is very scarce. Even the NDVI long-term series from monoculture and rotational cereal
sequences have not been studied in depth in semi-arid areas, although these remain one driving
factor of soil degradation in those areas [1–3]. Site selection for NDVI spatial analysis
can be treated as a network of measurements to capture vegetation phenology variations.
Some studies that analysed the relation of scaling behaviour through the mass distribution
of land management in crops and soil types (e.g., tillage, land levelling, etc.) introduced
a promising method [4] to complement spatial features and environmental insights to
mitigate soil degradation. This method for scaling properties is the generalized structure
function (GSF). The scaling properties of reflectance signals from satellites can provide
complementary information for specific sites [5–7], increasing the temporal understanding
of cereal phenology sequences. These scaling properties of reflectance signals, along
time series, can be described as a mass distribution on a temporal domain, complementing
classical statistics of the measured signals [8]. Hence, stationary series from a satellite-based
network of cereal NDVI measurements can be related to site weather interactions,
especially with precipitation (pcp) patterns.
2 Methods
2.1 Case Study and Data
The study area is located in north-central Spain in the midlands of the Duero River
basin (Fig. 1). This area overlaps with most of the Avila and Segovia provinces. The area
was delineated using the midlands of the Eresma and Adaja Rivers [9], and it covers
200,197 ha. The land use of the area is mainly rainfed cereal agriculture (70%), of which
41% is barley, 15% is wheat, and 14% is other crops (e.g., canola, sunflower, and peas).
Barley and wheat are the most typical rainfed crops in the area and among the most representative
features of its crop rotation sequence, and are therefore the focus of this study.
Fig. 1. Study area located in north-central Spain in the midlands of the Duero River basin.
Site selection of cereal plots was based on a previous study in which long-term cereal
schemas were identified from remote sensing [9]. These plots comprise several
622 D. Rivas-Tabares and A. M. Tarquis
sub-basins and different soil types. From those, two areas were finally selected as representative
of each river in terms of weather and soil type. The selected final areas are
shown with coloured wheat spikes in Fig. 2.
Fig. 2. Location of the plots of the NDVI measurement network with long-term cereal sequences. The black
spike (right) indicates the selected site from the Adaja River basin and the purple spike (left) indicates
the site from the Eresma River basin.
The MODIS-Terra MOD13Q1 V06 product at 250 m spatial resolution and 16-day
composite images [10] from 2000 to 2019 (451 images) were used to define the NDVI
measurement network. The extracted series from MOD13Q1 comprise data from 02-02-2000
to 14-09-2019 that were checked through the quality and reliability pixel index
of MODIS data, and only high-quality pixels (rank key = 0) were retained for the series.
It is important to highlight that the 16-day composite NDVI series are generated using
the two 8-day composite surface reflectance granules (MOD09A1) in the 16-day period,
and are considered one of the most spatiotemporally reliable products of MODIS [10]. Some of
the depressed values of the series (i.e., fewer than 7 values in the series) were preprocessed
through the Savitzky–Golay filter [11] to smooth the time series, specifically those that
were caused primarily by cloud contamination and atmospheric variability [12]. The
data extraction from the MODIS product was performed using the Google Earth Engine
[13]. Each site represents the average of five cereal plots in which cereals were planted
in all years. Three time series were analyzed for each site: i) the NDVI average for the 5
plots, ii) the NDVI residual series (the former series minus the annual pattern) and iii)
the NDVI anomalies [14], calculated as Z_NDVI = (NDVI_i − μ_NDVI)/SD.
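The residual and anomaly series can be sketched as follows. This is only an illustration under stated assumptions: the annual pattern is taken as the mean of all observations sharing the same position in the year, the mean and sample standard deviation of the anomalies are computed over the whole series passed in, and the function names and the `period_index` argument are hypothetical.

```python
import statistics

def annual_pattern(values, period_index):
    """Mean value for each position in the year (e.g. 0..22 for 16-day composites)."""
    groups = {}
    for value, period in zip(values, period_index):
        groups.setdefault(period, []).append(value)
    return {period: statistics.fmean(vals) for period, vals in groups.items()}

def residual_series(values, period_index):
    """Original series minus the annual pattern of the matching period."""
    pattern = annual_pattern(values, period_index)
    return [value - pattern[period]
            for value, period in zip(values, period_index)]

def z_anomalies(values):
    """Z = (value - mean) / SD, using the sample standard deviation (an assumption)."""
    mu = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(value - mu) / sd for value in values]
```

If anomalies relative to each composite period are preferred, `z_anomalies` can be applied separately to the values of each period.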
Meteorological national data from a network of 17 precipitation gauges [15] were
used to assign weather to the crop plots. This was performed by applying the Thiessen
Polygon Method (TPM) to the rain gauges to obtain a weighted configuration of sub-basin
precipitation series [9]. As with the NDVI series, three time series were analyzed for
precipitation: i) the Pcp average for sub-basin plots, ii) the Pcp residual series (the former
series minus the annual pattern) and iii) the Pcp anomalies.
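The idea behind the Thiessen weighting can be sketched on a grid. Each cell of a sub-basin is assigned to its nearest gauge, and the share of cells per gauge approximates the area weight of its polygon. This is a simplified stand-in for a GIS polygon overlay, not the workflow of [9], and the function names, gauge labels and grid layout are hypothetical.

```python
import math

def thiessen_weights(gauges, cell_centers):
    """Fraction of sub-basin cells whose nearest gauge is each gauge.

    gauges: dict name -> (x, y); cell_centers: list of (x, y) points
    covering the sub-basin on a regular grid.
    """
    counts = {name: 0 for name in gauges}
    for cell in cell_centers:
        nearest = min(gauges, key=lambda name: math.dist(gauges[name], cell))
        counts[nearest] += 1
    total = len(cell_centers)
    return {name: count / total for name, count in counts.items()}

def subbasin_precipitation(weights, gauge_series):
    """Weighted sum of the gauge series, time step by time step."""
    n_steps = len(next(iter(gauge_series.values())))
    return [sum(weights[g] * gauge_series[g][t] for g in weights)
            for t in range(n_steps)]
```

A gauge whose polygon does not intersect the sub-basin receives a weight of zero and does not contribute to the sub-basin series.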
3 Results
Due to the seasonal pattern of NDVI signals, their long-term statistics do not change
significantly over time. The statistics of the NDVI series were developed over the tillage
period from March (vegetation cover >30%) to June until the grain harvest, which is
denoted the growing season in cereals. The confidence interval reveals that in the growing
season, the sites are statistically different when the surface is covered by vegetation. The
ANOVA results confirm that NDVI values for the sample sites during the vegetative
period exhibit significant differences with a confidence level of 99% and p-values <
0.01.
Fig. 3. Generalized Structure Function plots for NDVI residual series of eastern sites (left column)
and western sites (right column) of ζ(q) curve.
The scaling of the NDVI series (original, residual and anomalies) was confirmed
through the GSF calculation. The GSFs relate S_q(i/L) to (i/L), with
L = i_max, for the NDVI residual series of the sites. The maximum increment was
chosen in 32 data points (L = i_max = 64), which is equivalent to a 32-month period
or 2 growing seasons. In this case, values for q were selected between 0.25 and 4 with
0.25 increments (Fig. 3). The Hurst exponent curve results from the GSF plot across all
the q exponents. In this case, the relation reveals that the NDVI residual signals of the
sites are anti-persistent in time (Fig. 4), where red dots correspond to western sites and blue dots
to eastern sites. However, the degree of anti-persistence of these NDVI residual series is
shifted between sites but not statistically different.
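A minimal sketch of the GSF calculation is given below. It assumes the common convention in which S_q at increment i is the mean of |x(t + i) − x(t)|^q, the scaling exponent ζ(q) is the log-log slope of S_q against i/L, and the generalized Hurst exponent is H(q) = ζ(q)/q. These definitions are not stated explicitly in the text, so they should be read as assumptions, and the function names are hypothetical.

```python
import math

def structure_function(series, lag, q):
    """S_q(lag): mean of |x[t + lag] - x[t]| ** q over all valid t."""
    diffs = [abs(series[t + lag] - series[t]) ** q
             for t in range(len(series) - lag)]
    return sum(diffs) / len(diffs)

def scaling_exponent(series, q, max_lag):
    """Slope zeta(q) of log S_q against log(i / L), with L = max_lag."""
    xs, ys = [], []
    for lag in range(1, max_lag + 1):
        s = structure_function(series, lag, q)
        if s > 0:  # the logarithm is undefined for a zero moment
            xs.append(math.log(lag / max_lag))
            ys.append(math.log(s))
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Ordinary least-squares slope of the log-log relation.
    return (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
            / sum((x - mean_x) ** 2 for x in xs))

def generalized_hurst(series, q, max_lag):
    """H(q) = zeta(q) / q (assumed convention)."""
    return scaling_exponent(series, q, max_lag) / q
```

Under this convention, a perfectly linear trend gives H(q) = 1 for every q, and values below 0.5 correspond to the anti-persistent behaviour discussed above.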
Fig. 4. Generalized Hurst exponent H(q) for NDVI residual series of eastern sites (left column)
and western sites (right column). The continuous line corresponds to uncorrelated noise with a
Hurst value of 0.5.
Fig. 5. Generalized Structure Function plots for Pcp residual series of eastern sites (left column)
and western sites (right column) of ζ(q) curve.
The resulting noise exhibited a scaling behavior, and the generalized Hurst exponent
was also anti-persistent (Fig. 6). This can support the anti-persistent response of the
NDVI residuals, given the anti-persistent noise structure of the precipitation
time series. To our knowledge, there is no previous scientific evidence of anti-persistent NDVI
residual series in conjunction with anti-persistent precipitation residual series
from cereal sequences in semi-arid conditions.
Fig. 6. Generalized Hurst exponent H(q) for Pcp residual series of eastern sites (left column) and
western sites (right column). The continuous line corresponds to uncorrelated noise with a Hurst
value of 0.5.
4 Conclusions
The results presented in this work reinforce the idea that knowledge of the spatial variability
of rain gauges is a key component in understanding patterns of vegetation at large
scales, specifically related to cereal yields in semi-arid areas. The NDVI residual series
under rainfed cereal production in the semi-arid climate of Spain exhibits an
anti-persistent structure; this is primarily due to the anti-persistent behaviour of the precipitation
residual series. The time series analysis of vegetation indices, such as a satellite-based
NDVI measurement network, in combination with precipitation time series from dense
rain gauge networks, provides some insights into the understanding of seasonal cereal
yields. This approach aims to obtain feedback and identify the field
features associated with the anti-persistent structure of NDVI residual time series, since
the weather regime in semi-arid areas is essential to understanding the complex processes of
crop growth. The mathematical description of NDVI series coupled with the GSF and Hurst
exponent can reinforce future crop modelling. Defining NDVI measurement
networks constitutes a low-cost and efficient tool to track temporal variations of rainfed
cereal dynamics. NDVI time series provide effective estimates of crop growth states
and accurate estimates of the timing of main phenological events such as
tillering, stem extension, heading and ripening.
References
1. Mao, R., Zeng, D.-H., Li, L.-J., Hu, Y.-L.: Changes in labile soil organic matter fractions
following land use change from monocropping to poplar-based agroforestry systems in a
semiarid region of Northeast China. Environ. Monit. Assess. 184, 6845–6853 (2012)
2. Hernanz, J.L., López, R., Navarrete, L., Sanchez-Giron, V.: Long-term effects of tillage sys-
tems and rotations on soil structural stability and organic carbon stratification in semiarid
central Spain. Soil Tillage Res. 66, 129–141 (2002)
3. Wu, H., Wu, L., Zhu, Q., Wang, J., Qin, X., Xu, J., Kong, L., Chen, J., Lin, S., Khan, M.U.:
The role of organic acids on microbial deterioration in the Radix pseudostellariae rhizosphere
under continuous monoculture regimes. Sci. Rep. 7, 1–13 (2017)
4. Moreno, R.G., Alvarez, M.C., Requejo, A.S., Tarquis, A.M.: Multifractal analysis of soil
surface roughness. Vadose Zo. J. 7, 512–520 (2008)
5. Zeleke, T.B., Si, B.C.: Scaling properties of topographic indices and crop yield. Agron. J. 96,
1082–1090 (2004)
6. Wang, Z., Shu, Q., Liu, Z., Si, B.: Scaling analysis of soil water retention parameters and
physical properties of a Chinese agricultural soil. Soil Res. 47, 821–827 (2010)
7. Zheng-Ying, W., Qiao-Sheng, S.H.U., Li-Ya, X.I.E., Zuo-Xin, L.I.U., Si, B.C.: Joint mul-
tifractal analysis of scaling relationships between soil water-retention parameters and soil
texture. Pedosphere 21, 373–379 (2011)
8. Tarquis, A.M., Morato, M.C., Castellanos, M.T., Perdigones, A.: Comparison of structure
function and detrended fluctuation analysis of wind time series. Nuovo Cim. Della Soc. Ital.
Di Fis. C Geophys. Sp. Phys. 31, 633–651 (2008)
9. Rivas-Tabares, D., Tarquis, A.M., Willaarts, B., De Miguel, Á.: An accurate evaluation of
water availability in sub-arid Mediterranean watersheds through SWAT: Cega-Eresma-Adaja.
Agric. Water Manag. 212, 211–225 (2019). https://doi.org/10.1016/j.agwat.2018.09.012
10. Didan, K.: MOD13Q1 MODIS/Terra vegetation indices 16-day L3 global 250 m SIN grid
V006. NASA EOSDIS L. Process. DAAC (2015)
11. Savitzky, A., Golay, M.J.E.: Smoothing and differentiation of data by simplified least squares
procedures. Anal. Chem. 36, 1627–1639 (1964)
12. Chen, J., Jönsson, P., Tamura, M., Gu, Z., Matsushita, B., Eklundh, L.: A simple method for
reconstructing a high-quality NDVI time-series data set based on the Savitzky-Golay filter.
Remote Sens. Environ. 91, 332–344 (2004)
13. Gorelick, N., Hancher, M., Dixon, M., Ilyushchenko, S., Thau, D., Moore, R.: Google earth
engine: planetary-scale geospatial analysis for everyone. Remote Sens. Environ. 202, 18–27
(2017)
14. Klisch, A., Atzberger, C.: Operational drought monitoring in Kenya using MODIS NDVI
time series. Remote Sens. 8, 267 (2016)
15. AEMET: Daily precipitation, maximum temperature and minimum temperature. Period 2000–
2015 (2013)
Spatio-Temporal Clustering
of Earthquakes Based on Average
Magnitudes
1 Introduction
Our research objective is to develop methods for analyzing huge earthquake
catalogs as large-scale complex networks, where nodes (vertices) correspond to
earthquakes, and links (edges) correspond to the interactions between them.
Technically, we are interested not only in what is happening now and how it
will develop in the future, but also in what happened in the past and how it
was caused by changes in the distribution of the information, as studied in
[10,21]. It therefore seems worthwhile to look for empirical regularities and to
develop explanatory accounts of basic properties of these complex networks.
Such attempts would be valuable for understanding structures and trends, and
could lead to the discovery of new knowledge and insights underlying these
interactions.
The clustering of earthquakes is important for many applications in seis-
mology, including seismic activity modeling, and earthquake prediction. In this
paper, for a given earthquake catalog, we addressed the problem of automatically
extracting several clusters consisting of spatio-temporally similar earthquakes
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 943, pp. 627–637, 2021.
https://doi.org/10.1007/978-3-030-65347-7_52
628 Y. Yamagishi et al.
whose average magnitudes are substantially different from the total one. In
particular, we intended to produce one relatively large cluster and several small
clusters whose average magnitudes differ substantially from the total one.
To this end, we propose a new method that combines techniques
developed in two different fields, i.e., declustering algorithms [2,4,6] from
seismology and a change-point detection algorithm [23,24] from data
mining. In our empirical evaluation using an earthquake catalog covering the
whole of Japan, we confirmed that the method generally yields clustering
results consisting of one relatively large cluster and several small
clusters whose average magnitudes differ substantially from the total one.
The paper is organized as follows. We describe related work in Sect. 2 and
give our problem setting and the proposed methods for clustering an earthquake
catalog in Sect. 3. We report and discuss experimental results using a real catalog
in Sect. 4, and conclude the paper and discuss future work in Sect. 5.
2 Related Work
times, the method by Bottiglieri et al. [3] uses the coefficient of variation of the
times, and Frohlich and Davis [5] proposed the ratio method which also exploits
the inter-event times but without examining their distribution. As another clus-
ter analysis with links, Baiesi and Paczuski [2] proposed a simple space-time
metric to correlate earthquakes with each other and Zaliapin et al. [25] further
defined the rescaled distance and time.
Among these declustering algorithms, we focused on the single-link cluster
analysis proposed by Frohlich and Davis [4,6] and the correlation metric
proposed by Baiesi and Paczuski [2]. In our proposed method, we employ either
of these two algorithms in the tree construction phase.
Our research aim is, in spirit, similar to that of the work by
Kleinberg [10] and Swan and Allan [21]. They observed a huge volume of
time-series data, tried to organize it, and extracted the structures behind it.
This is done in a retrospective framework, i.e., assuming that there is already
a flood of abundant data and a strong need to understand it. Kleinberg’s work is
motivated by the fact that the appearance of a topic in a document stream is
signaled by a “burst of activity”, and that identifying its nested structure
yields a summarization of the activities over a period of time, making it
possible to analyze the underlying content much more easily. Kleinberg’s method used a
hidden Markov model in which bursts appear naturally as state transitions, and
successfully identified the hierarchical structure of e-mail messages. Swan and
Allan’s work is motivated by the need to organize a huge amount of information
in an efficient way. They used a statistical model of feature occurrence over time
based on hypothesis testing and successfully generated clusters of named entities
and noun phrases that capture the information corresponding to major topics
in the corpus and designed a way to nicely display the summary on the screen
(Overview Timelines). We also follow the same retrospective approach, i.e., we
are not predicting the future, but we are trying to understand the phenomena
that happened in the past.
We are interested in detecting spatio-temporal changes in the magnitude of
earthquakes. For this purpose, by defining a set of links with some declustering
algorithm described earlier, we construct a spatio-temporal network (spanning
tree), where the nodes correspond to the observed earthquakes. After that, in
order to analyze the burst of activity in an earthquake catalog and attempt to
present an overview map, we employ a variant of the change-point detection
algorithm proposed by Yamagishi et al. [23,24].
3 Proposed Method
Let $D = \{(x_i, t_i, m_i) \mid 1 \le i \le N\}$ be a set of observed earthquakes, where
$x_i$, $t_i$ and $m_i$ stand for the location vector, time and magnitude of the observed
earthquake $i$, respectively. Here, we assume that these earthquakes are in order
from oldest to most recent, i.e., $t_i < t_j$ if $i < j$. In this paper, from the observed
dataset D, we address the problem of automatically extracting several clusters
consisting of spatio-temporally similar earthquakes whose average magnitudes
are substantially different from the total one. In what follows, we describe some
details of our proposed algorithm consisting of two phases: tree construction and
tree separation.
Here $d_f$ is the fractal dimension, set to $d_f = 1.6$, and $b$ is the parameter of the
Gutenberg-Richter law, set to $b = 0.95$. Again, an earthquake $j$ is regarded
as the aftershock (child node) of $i^{CM}(j)$ if the metric $n_{i,j}$ is minimized, i.e.,
$i^{CM}(j) = \arg\min_{1 \le i < j} n_{i,j}$. Then, based on the correlation-metric strategy, we can
define a spanning tree, where the nodes correspond to the observed earthquakes,
and the links are defined by $T^{CM} = \{(i^{CM}(j), j) \mid 2 \le j \le N\}$.
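The correlation-metric tree construction can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The definition of the metric is not reproduced in this section, so the sketch assumes the standard Baiesi-Paczuski form $n_{i,j} = t_{i,j}\, r_{i,j}^{d_f}\, 10^{-b m_i}$ with no scaling constant, and it assumes planar coordinates with Euclidean distance. The function names and the event layout `((x, y), t, m)` are hypothetical.

```python
import math

D_F = 1.6   # fractal dimension d_f, as set in the text
B = 0.95    # Gutenberg-Richter parameter b, as set in the text


def correlation_metric(parent, child, d_f=D_F, b=B):
    """Assumed metric n_ij = t_ij * r_ij**d_f * 10**(-b * m_i) (Baiesi-Paczuski form)."""
    (x_i, t_i, m_i), (x_j, t_j, _m_j) = parent, child
    t_ij = t_j - t_i                 # inter-event time (positive, events are time-ordered)
    r_ij = math.dist(x_i, x_j)       # assumption: planar coordinates, Euclidean distance
    return t_ij * (r_ij ** d_f) * 10.0 ** (-b * m_i)


def build_cm_tree(events):
    """Return the links (i_CM(j), j) of the spanning tree, using 0-based indices.

    `events` must be sorted from oldest to most recent. The search over all
    earlier events makes this sketch quadratic in the number of events.
    """
    links = []
    for j in range(1, len(events)):
        i_best = min(range(j), key=lambda i: correlation_metric(events[i], events[j]))
        links.append((i_best, j))
    return links


# Toy catalog: one large event, a distant small event, and a nearby small event.
events = [((0.0, 0.0), 0.0, 6.0), ((100.0, 0.0), 1.0, 3.0), ((1.0, 0.0), 2.0, 3.0)]
tree = build_cm_tree(events)
```

In the toy catalog the third event attaches to the large first event rather than to the more recent but distant second one, which is the behavior the metric is meant to capture.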
Note that we employed the definition of sample variance in this paper.
Intuitively, we intend to produce one relatively large cluster and several small
clusters whose average magnitudes differ substantially from (and are typically
larger than) the total one. In fact, since the distribution of magnitudes in a catalog reasonably
obeys the Gutenberg-Richter law (exponential distribution), i.e., the magnitudes
of most earthquakes are relatively small, it is naturally expected that we can
improve the objective function by separating clusters of spatio-temporally
similar earthquakes with relatively large magnitudes. Note that this objective
function can be interpreted as a weighted sum of variances.
In order to compute the resultant set of separation links $R$, we employ a
variant of the change-point detection algorithm proposed by Yamagishi et al. [23,
24]. Namely, from the observed dataset $D$, the tree $T$ constructed by either the
SL or the CM strategy, and a given number of clusters $G$, our algorithm computes
$R$ as follows:
Step 1. Initialize $g \leftarrow 1$ and $R_0 \leftarrow \emptyset$.
Step 2. Compute $e_g \leftarrow \arg\min_{e \in T} f(R_{g-1} \cup \{e\})$, and update $R_g \leftarrow R_{g-1} \cup \{e_g\}$.
Step 3. Set $g \leftarrow g + 1$ and then return to Step 2 if $g \le G - 1$; otherwise set
$g \leftarrow 1$ and $h \leftarrow 0$.
Step 4. Compute $e'_g = \arg\min_{e \in T} f(R_{G-1} \setminus \{e_g\} \cup \{e\})$, update
$R_{G-1} \leftarrow R_{G-1} \setminus \{e_g\} \cup \{e'_g\}$, and then set $h \leftarrow 0$ if $e'_g \ne e_g$; otherwise set $h \leftarrow h + 1$.
Step 5. Output $R_{G-1}$ and then terminate if $h = G - 1$; otherwise set
$g \leftarrow (g \bmod (G - 1)) + 1$ and then return to Step 4.
More specifically, after initializing the variables in Step 1, we compute the
optimal $g$-th link $e_g$ by fixing the already selected set of $(g - 1)$ links in $R_{g-1}$
and add it to $R_{g-1}$, as shown in Step 2. We repeat this procedure from $g = 1$
to $G - 1$, as shown in Step 3. After that, we start with the solution obtained as
$R_{G-1}$, pick a link $e_g$ from the already selected links, fix the rest $R_{G-1} \setminus \{e_g\}$,
and compute a better replacement $e'_g$ for $e_g$, as shown in Step 4, where $\cdot \setminus \cdot$ represents set
difference. We repeat this from $g = 1$ to $G - 1$. If no replacement is possible for
all $g$, i.e., $e'_g = e_g$ for all $g \in \{1, \cdots, G - 1\}$, then no better solution is expected
and the iteration stops, as shown in Step 5. The above algorithm is not
guaranteed to produce the optimal result in theory, but in our empirical
evaluation it always computed optimal or near-optimal solutions [23,24].
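Steps 1 to 5 can be sketched as a greedy selection followed by a cyclic one-link replacement. This is an illustrative Python sketch under stated assumptions, not the authors' code. The objective $f$ is passed in as a function because its definition is not reproduced in this section; the stand-in used in the demo (total within-cluster squared deviation of magnitudes) is an assumption chosen only because it is one example of a weighted sum of variances. The sketch also resets $h$ only on a strict improvement of $f$, a small departure from the test $e'_g \ne e_g$ that avoids cycling on ties. All function names are hypothetical.

```python
def make_objective(parent, mags):
    """Stand-in objective (assumption): total within-cluster squared deviation.

    parent[j] < j is the parent of node j in the spanning tree (parent[0] is unused).
    Removing the links in `removed` splits the tree into clusters.
    """
    n = len(mags)

    def f(removed):
        root = list(range(n))
        for j in range(1, n):
            if (parent[j], j) not in removed:
                root[j] = root[parent[j]]      # j stays in its parent's cluster
        clusters = {}
        for j in range(n):
            clusters.setdefault(root[j], []).append(mags[j])
        total = 0.0
        for vals in clusters.values():
            mu = sum(vals) / len(vals)
            total += sum((v - mu) ** 2 for v in vals)
        return total

    return f


def select_separation_links(tree_links, f, G):
    """Pick G - 1 separation links: greedy pass, then cyclic replacement."""
    R = []
    for _ in range(G - 1):                               # Steps 1-3
        candidates = [e for e in tree_links if e not in R]
        R.append(min(candidates, key=lambda e: f(frozenset(R + [e]))))
    g, h = 0, 0
    while h < G - 1:                                     # Steps 4-5
        rest = R[:g] + R[g + 1:]
        candidates = [e for e in tree_links if e not in rest]
        best = min(candidates, key=lambda e: f(frozenset(rest + [e])))
        if f(frozenset(rest + [best])) < f(frozenset(R)):
            R[g], h = best, 0                            # replacement found
        else:
            h += 1                                       # no improvement for this link
        g = (g + 1) % (G - 1)
    return R


# Toy chain of six events with a burst of large magnitudes in the middle.
parent = [-1, 0, 1, 2, 3, 4]
mags = [3.0, 3.0, 3.0, 6.0, 6.0, 3.0]
tree_links = [(parent[j], j) for j in range(1, len(mags))]
f = make_objective(parent, mags)
R = select_separation_links(tree_links, f, G=3)
```

With $G = 3$ the two selected links isolate the two large-magnitude events from the surrounding small ones.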
4 Experimental Evaluation
Using an earthquake catalog containing source parameters determined by the
Japan Meteorological Agency for the whole of the Japanese Islands, we generated two
original datasets. Namely, by setting the minimum magnitude and the maximum
depth to Mmin = 3.0 and Dmax = 100 km, respectively, we selected N = 104,343
earthquakes during the period from Oct. 01, 1997 to Dec. 31, 2016 as dataset
A, while by setting Mmin = 4.0 and Dmax = 100 km, we selected N = 27,728
earthquakes during the period from Oct. 01, 1977 to Dec. 31, 2016 as dataset B.
[Figure residue: two line plots with horizontal axis 1 to 8 and vertical axes spanning 0.269 to 0.275 and 0.226 to 0.2295; curves and caption not recovered.]

[Figure residue: two heat-map panels with a color scale from 0.4 to 1; caption not recovered. Recovered cell values follow.]

(a) Single-link
       G:2     G:3     G:4     G:5     G:6     G:7     G:8
G:2    1       0.9752  0.9674  0.6313  0.6312  0.5994  0.7
G:3    0.9752  1       0.9921  0.6561  0.656   0.6241  0.7248
G:4    0.9674  0.9921  1       0.6639  0.6638  0.632   0.7326
G:5    0.6313  0.6561  0.6639  1       0.9999  0.9681  0.5408
G:6    0.6312  0.656   0.6638  0.9999  1       0.9682  0.5409
G:7    0.5994  0.6241  0.632   0.9681  0.9682  1       0.5728
G:8    0.7     0.7248  0.7326  0.5408  0.5409  0.5728  1

(b) Correlation-metric
       G:2     G:3     G:4     G:5     G:6     G:7     G:8
G:2    1       0.3898  0.3902  0.3892  0.3867  0.3867  0.3866
G:3    0.3898  1       0.9997  0.9987  0.995   0.995   0.9948
G:4    0.3902  0.9997  1       0.999   0.9953  0.9953  0.9952
G:5    0.3892  0.9987  0.999   1       0.9963  0.9963  0.9962
G:6    0.3867  0.995   0.9953  0.9963  1       1       0.9998
G:7    0.3867  0.995   0.9953  0.9963  1       1       0.9999
G:8    0.3866  0.9948  0.9952  0.9962  0.9998  0.9999  1
Finally, we visually evaluate the results obtained with the different tree
construction strategies, focusing on dataset A. To this end, we transform
the average magnitude in each cluster $N_g$, denoted by $\mu_g$, into the corresponding
b-value, denoted by $b_g$ [1], i.e.,
$$ b_g = \frac{\log_{10} e}{\mu_g - M_{\min}}, \tag{6} $$
where $e$ stands for Napier’s constant (Euler’s number), and recall that the
minimum magnitude in dataset A was set to $M_{\min} = 3.0$. Here the average
magnitude in dataset A is around 3.50, and the corresponding b-value is
mer strategy, we could obtain clearly visible clusters having substantially large
average magnitudes around the region where the 2011 Tohoku earthquake with
Mw = 9.0 (indicated by a red x in the figures), the largest magnitude in Japan,
occurred. On the other hand, comparing the different numbers of
clusters under the single-link strategy, shown in Figs. 3a, b and
c, we obtained somewhat different types of clustering results, which might
help to analyze the dataset from multiple viewpoints. In short, our empirical
evaluation confirms that the proposed method employing the single-link
strategy produces more desirable results for our purpose.
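Eq. (6) above is easy to evaluate directly. The short Python sketch below is an illustration only: the function name is hypothetical, and the input 3.50 is the approximate dataset-A average quoted in the text, so the resulting value is itself approximate.

```python
import math


def b_value(mean_magnitude, m_min):
    """Eq. (6): b = log10(e) / (mean magnitude - minimum magnitude)."""
    if mean_magnitude <= m_min:
        raise ValueError("mean magnitude must exceed the minimum magnitude")
    return math.log10(math.e) / (mean_magnitude - m_min)


# Assumption: the dataset-A average is taken as exactly 3.50 (the text says "around 3.50").
b_total = b_value(3.50, 3.0)
# A cluster with a larger average magnitude maps to a smaller b-value.
b_large = b_value(4.00, 3.0)
```

Because the b-value decreases as the average magnitude grows, clusters with unusually large average magnitudes show up as clusters with unusually small b-values.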
5 Conclusion
In this paper, for a given dataset of observed earthquakes, we addressed the
problem of automatically extracting several clusters consisting of spatio-temporally
similar earthquakes whose average magnitudes are substantially different from the total
average. In particular, we intended to produce one relatively large cluster and
several small clusters whose average magnitudes differ substantially from the
total one. For this purpose, we proposed a new method consisting of two phases.
In the former tree construction phase, we employ one of two declustering
algorithms, called single-link and correlation-metric, developed in the field of
seismology, while in the latter tree separation phase, we employ a variant of the
change-point detection algorithm developed in the field of data mining. In our
empirical evaluation using earthquake catalog data covering the whole of Japan, we
confirmed that the method generally yields clustering results consisting of
one relatively large cluster and several small clusters whose average
magnitudes differ substantially from the total one. Moreover, we showed
that the proposed method employing the single-link strategy produces more
desirable results in terms of both the improvement of the weighted sum of variances and
the visualization results. As future work, we plan to conduct more experiments
to see whether our clustering method can provide new findings on earthquake
statistics, the underlying earthquake dynamics, and so on. Further theoretical
studies to determine the optimal number of clusters are also left for future work.
References
1. Aki, K.: Maximum likelihood estimate of b in the formula log(N) = a − bM and its
confidence limits. Bull. Earthq. Res. Inst. 43, 237–239 (1965)
2. Baiesi, M., Paczuski, M.: Scale-free networks of earthquakes and aftershocks. Phys.
Rev. E, Stat. Nonlinear Soft Matter Phys. 69, 066106 (2004)
3. Bottiglieri, M., Lippiello, E., Godano, C., De Arcangelis, L.: Identification and
spatiotemporal organization of aftershocks. J. Geophys. Res. 114 (2009)
4. Davis, S.D., Frohlich, C.: Single-link cluster analysis, synthetic earthquake cata-
logues, and aftershock identification. Geophys. J. Int. 104(2), 289–306 (1991)
Abstract. Security and privacy have been major concerns of Online Social Net-
works (OSN). Individual users as well as organizations utilize OSNs, such as
Facebook, Twitter, and LinkedIn, to share information with other users within
their networks. While sharing information, users are not always aware of the fact
that an innocent action on their post by a direct friend such as a comment or a
share may turn the post transparent to someone outside their network.
In previous work we have devised a comprehensive Trust-based model that
combines Role based Access Control for the direct circle of friends and Flow Con-
trol for the friends’ networks. In this paper we reinforce this model by analyzing
its strength in terms of OSN features. We simulate attack scenarios carried out by a
community of malicious users that attempt to fake the OSN features of the model.
We analyze the attack of an alleged trustworthy clique of adversaries and show
the futility of such an attack, due to the strength of the model’s parameters and
combination of Trust, Access Control and Flow Control. We also demonstrate the
robustness of the model when facing an optimized attack, which carefully selects
the best network nodes to compromise, as determined by the minimal vertex cover
algorithm.
1 Introduction
The rapid growth of Online Social Networks (OSN) and their increasing popularity in
the past decade as major communication channels have raised new kinds of
security and privacy concerns. In our previous work, we have created a privacy model
that is composed of three main phases addressing three of its major aspects: trust, role-
based access control [1, 2] and information flow, by creating an Information Flow-
Control model for adversary detection [3], or a trustworthy network [4]. We represent
a social network as an undirected graph, where nodes are the OSN users, and edges
represent relations between them such as friendship relations. An Ego node (or Ego
user) is an individual focal node, representing a user whose information flow we aim
to control. An Ego node along with its adjacent nodes is denoted the Ego network. Our
comprehensive Trust-based model uses Access Control for the direct friends of the
Ego-user, and Information Flow Control for users at a greater distance.
We use OSN parameters, such as total number of friends, age of user account, and
friendship duration to characterize the quality of the network connections as we explain
in Sect. 4 of this paper. The robustness of this model is the key objective of this paper.
Several attacks on private information in Social Networks have been described in [5].
A common type of attack in OSN aims at a specific user or network and attempts to
access or act on its information e.g., spread false data or spam for different purposes.
Trust based systems must deal with attacks, in which malicious users initially behave
properly to gain a positive reputation but then start to misbehave and inflict damage on
the community. In this paper we show the robustness of our model and focus on the
latter type of attack, where a user, or its network is the target of an attack initiated by
malicious users. The main scenarios we simulate include a community of spammers
whose profiles conform with the OSN attributes that constitute the Trust aspect of the
model. We use a graph algorithm (minimal vertex cover) to select an optimized set
of candidate nodes to compromise, and show that even in this case, such an attack is
futile. The rest of this paper is structured as follows: Sect. 2 discusses the background
for our work and references related papers; Sect. 3 provides a brief overview of the
Trust model; Sect. 4 discusses the attack scenarios on the model and Sect. 5 presents
the experimental evaluation of the attack scenario based on preliminary evaluations of
the properties conducted in our previous research. In Sect. 6 we conclude the paper and
discuss further research directions.
due to a delay and a malicious user exploits this delay for a personal gain. Attacks on
social networks are presented in relatively early papers such as [12], where a conceptual
framework of a Social Honeypot is described for uncovering social spammers who target
online communities. The idea of creating Social Honeypots is also developed in [13]
where the honeypot profiles were assimilated into an organizational social network.
The honeypot then received suspicious friend requests and mail messages that revealed
basic indications of a potential forthcoming attack. An interesting form of attack, that
is related to our presented attack scenarios is the “friend-in-the-middle” attack [14],
in which a legitimate friend in the social network is used as a gateway for spammers
that harvest social data. This data can then be exploited for large-scale attacks such as
context-aware spam and social-phishing. The network used specifically in this attack
scenario is Facebook. Our Vertex-Cover algorithm can be viewed as a generalization
of the “friend-in-the-middle” attack.
The model we have presented in previous work [1–4] is composed of three main phases
addressing three of its major aspects: trust, role-based access control and information
flow. In the first phase, the Trust phase, we assign trust values to the edges connecting
direct friends to the Ego node in their different roles, e.g., Family, Colleagues etc. In the
second phase, the Role Based Access Control phase, we remove direct friends that do
not have the minimal trust values required to grant a specific permission to their roles.
A cascade removal is carried out in their Ego networks as well.
After this removal, the remaining user nodes and their edges are also assigned
trust values. In the third and last phase, the Information Flow phase, we remove from
the graph edges and nodes that are not directly connected to the Ego-user to construct a
privacy preserving trusted network. To calculate trust values in the first phase we use a
set of OSN parameters carefully selected based on previous research (specifically [1, 2,
4]). We divide these parameters into connection attributes, which relate to edges, and user
attributes, which relate to nodes. In this work we refer to four of these attributes: two
connection attributes, Friendship Duration (FD) and Mutual Friends (MF), and
two user attributes, Total number of Friends (TF) and Age of User Account (AUA). A
Trust value ranges between 0 and 1 to reflect the probability of sharing information with
a certain user: 0 represents total restriction, and 1 represents definite sharing willingness.
The threshold values are denoted here as $T^{property}$ (e.g., for the TF attribute the threshold
value is $T^{TF}$), and their experimental values, obtained in our previous research mentioned
above, are presented in the Evaluation part of this paper. We define the User Trust Value
(UTV) as the weighted average of these properties, taking into consideration the different
weights ($w_i$) that were assessed by experimental results in [1] and [2] for the significance
of every attribute-factor. The value of a certain property ($p_{property}$)
is calculated from these thresholds as follows:
$$ p_{property} = \begin{cases} \dfrac{property}{T^{property}}, & property < T^{property}, \\ 1, & property \ge T^{property}. \end{cases} \tag{1} $$
644 N. Voloch et al.
The User Trust Value (UTV) is calculated as follows, where $|p|$ denotes the number
of attributes and $\langle w \rangle$ denotes the average of their weights:
$$ UTV = \langle w_i p_i \rangle = \frac{\sum_{i=1}^{|p|} w_i p_i}{\langle w \rangle \, |p|} \tag{2} $$
Minimal Trust Value (MTV ) denotes the threshold value for determining whether to
grant a certain access to a data instance. It is calculated in this model as the average of
UTV s within the Ego Network.
These thresholds can be configured per role and per permission according to the
OSN administration policy, or according to user preferences if such exist.
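Eqs. (1) and (2) and the MTV definition can be illustrated with a small Python sketch. It is a hedged example rather than the model's implementation: the thresholds 245 (TF) and 24 months (AUA) are the values quoted in the Evaluation section, but the weights, the two sample users, and all function names are hypothetical, and only two of the four attributes are used.

```python
def property_value(value, threshold):
    """Eq. (1): ratio to the threshold below it, saturating at 1."""
    return value / threshold if value < threshold else 1.0


def user_trust_value(values, thresholds, weights):
    """Eq. (2): weighted average of the property values."""
    names = list(weights)
    p = [property_value(values[n], thresholds[n]) for n in names]
    w = [weights[n] for n in names]
    mean_w = sum(w) / len(w)                     # <w>, the average weight
    return sum(wi * pi for wi, pi in zip(w, p)) / (mean_w * len(p))


def minimal_trust_value(utvs):
    """MTV: average of the UTVs within the Ego network."""
    return sum(utvs) / len(utvs)


THRESHOLDS = {"TF": 245, "AUA": 24}   # thresholds quoted in the Evaluation section
WEIGHTS = {"TF": 2.0, "AUA": 1.0}     # hypothetical weights, for illustration only

alice = {"TF": 122.5, "AUA": 36}      # hypothetical user: few friends, old account
bob = {"TF": 300, "AUA": 6}           # hypothetical user: many friends, new account

utv_alice = user_trust_value(alice, THRESHOLDS, WEIGHTS)
utv_bob = user_trust_value(bob, THRESHOLDS, WEIGHTS)
mtv = minimal_trust_value([utv_alice, utv_bob])
```

Since $\langle w \rangle |p|$ equals the sum of the weights, Eq. (2) is the usual weighted mean, so a UTV always stays between 0 and 1.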
The three-phase model described above and presented in Fig. 1 generates a
trustworthy network of users with which the Ego-user can safely share information. In Phase
3 we use the measure Minimal Path Trust Value (MPTV), presented in [3], which is
the threshold value for the PTV (Path Trust Value). The PTV is computed as the cumulative Trust
value of the edges and nodes on the path from the Ego user to the user being checked
in the network.
Definition 1. An attack is a tuple of the form $\langle G, T^{TF}, T^{AUA}, G^{spm}, t^{spm} \rangle$ where:
For an attack to take place, the model’s major trust attributes must be faked: Mutual
Friends (MF), Total Friends (TF), Age of User Account (AUA) and Friendship Duration
(FD). We divide these attributes into two groups: attributes representing quantities (MF
and TF), and attributes representing durations (AUA and FD). Quantities imply that a
user is well connected, and a user that has enough mutual friends with others demonstrates
human circles of relations within an OSN (family, work, neighborhood, etc.). Duration
attributes represent the steadiness of the profile, as genuine users usually create their
profile once. To fake a user attribute such as MF or TF, an adversary must connect the
fake user profile to other profiles in the network, genuine or not. The minimal number of
fake users to be created must exceed the threshold of every attribute. To impersonate
a real user network, an attack must consist of a network of seemingly trustworthy users that
adhere to all the model’s properties. We consider the extreme scenario of spammers
that are only friends with each other, making the MF property as similar as possible to the
TF property. This attack simulates a closed spammer network $G^{spm}(V^{spm}, E^{spm})$ that is
a clique. In this type of attack the MF attribute is correctly faked, since all the users are
connected to each other. As all the nodes are connected in the spammer clique, the size
of the spammer graph must be at least:
$$ |V^{spm}| \ge T^{TF} \tag{3} $$
For the duration attributes, AUA and FD, we also consider the extreme scenario
of spammers that are only friends with each other, making the FD property as similar
as possible to the AUA property. These properties must also hold for all the users in
the spammer’s network. This is particularly hard due to OSN policies that require a
reasonable duration for a user account to be considered a genuine one. This means that
before the attack can take place the elapsed time should be:
This attack process is shown in Fig. 2, where the two properties are created.
Fig. 2. Attack of a spammer network on the Ego
Fig. 3. Minimal vertex cover: optimized attack
The creation of a spammer network for malicious purposes is detailed in [15], where
the following attack is described: a malicious user creates a set of false identities
and uses them to communicate with a large, random set of innocent users (Random Link
Attack, RLA). The research shows and proves that this is in fact an NP-complete problem.
Practically it means that this kind of attack, carried out naively without heuristics, is very
hard to perform. We extend this basic form of attack one step further by taking into
consideration the attributes of these nodes, making the attack even more difficult to
implement. To perform an efficient attack, we need to assume that some of the requests
of the spammer network will be denied or blocked by the OSN administration; thus the
attack has to involve as many friends as possible from the Ego network. The robustness
of our model is derived from its resilience to these attacks in terms of actual OSN size:
the bigger the network is, the harder it is to fake the attributes of the model. We define
four types of attack, based on their strength and complexity:
A Regular Attack: A black-box attack that does not include preliminary knowledge of
the Ego-user network. In this attack the spammer network tries to connect to $k$ direct
friends of the Ego network, where $1 \le k \le \frac{T^{TF}}{2}$.
In this attack, the number of friend requests to be made is the number of edges from
the spammed network, that is $|E^{spm}|$. Since it is a clique, $|E^{spm}| = \frac{|V^{spm}|(|V^{spm}|-1)}{2}$, and
In this attack the spammer network tries to connect to all the friends within a distance
$d$ from the Ego user. In this case, the number of friend requests that should be made,
which is the number of edges from the spammed network, is:
$$ |E^{\psi}| = \frac{|V^{spm}|(|V^{spm}|-1)}{2} + (T^{TF})^d = \frac{T^{TF}(T^{TF}-1)}{2} + (T^{TF})^d \tag{7} $$
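To give a feeling for the magnitudes in Eq. (7), the following Python sketch evaluates it for the threshold $T^{TF} = 245$ quoted in the Evaluation section. It is a plain arithmetic illustration; the function names are hypothetical, and the spammer clique is assumed to have exactly $T^{TF}$ members, as in the second form of Eq. (7).

```python
def clique_edges(n):
    """Number of edges in a clique of n nodes: n(n - 1) / 2."""
    return n * (n - 1) // 2


def very_strong_attack_requests(t_tf, d):
    """Eq. (7): clique edges plus (T^TF)^d requests toward friends within distance d."""
    return clique_edges(t_tf) + t_tf ** d


T_TF = 245  # Total Friends threshold quoted in the Evaluation section

requests_by_distance = {d: very_strong_attack_requests(T_TF, d) for d in (1, 2, 3)}
```

The clique alone needs 29,890 connections, and the second term grows exponentially with $d$, so the total rises from about thirty thousand requests at $d = 1$ to almost fifteen million at $d = 3$.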
An Optimized Very Strong Attack: An attack that uses an optimization algorithm to
conduct an efficient attack. In the next subsection, we describe a minimization heuristic
that a smart spammer would perform, but as we show this problem is still NP-complete.
The problem of creating a fake friends’ network becomes harder
as the attack strength grows, and the attack is therefore not viable in terms of OSN sizes.
An attempt by a spammer network to reach out to the entire Ego network could create
an anomalous amount of activity in the OSN, which may raise the suspicion of the OSN
administration or community. Techniques for minimizing this amount of activity
may involve graph algorithms that allow the attacker to connect efficiently to several
nodes in the Ego-user’s graph instead of connecting to the entire Ego network. In graph
theory, a vertex cover of a graph is a set of vertices such that each edge in the graph is
incident to at least one vertex of the set. Formally, a vertex cover $V'$ of an undirected
graph $G = (V, E)$ is a subset of $V$ such that $uv \in E \Rightarrow (u \in V' \lor v \in V')$.
That is, every edge has at least one endpoint in the vertex
cover $V'$. Such a set is said to cover the edges of $G$. The problem of finding a minimum
vertex cover in a graph is an optimization problem [16]. We assume that a sophisticated
attacker would create fake attributes only on the vertices (users) in $V'$, the
minimum vertex cover, enabling the attacker to control all of the connections with
a minimal number of users, and hence with minimal effort in terms of the actions required
for the creation of the attack graph. To reduce the problem to the spammer attack we
define the cost $c^{\psi}(v) \ge 0$ as the number of actions required for the creation of $G^{\psi}$ and
formulate it as follows:
minimize $\sum_{v \in V} c^{\psi}(v)\, x_v$ (minimize the total number of actions)
subject to $x_v + x_{v^{spm}} \ge 1$ for all $\{v^{spm}, v\} \in E^{spm}$ (cover every edge of the connected
spammed subgraph that connects a spammer node with a friend node)
$x_v \in \{0, 1\}$ for all $v \in V$ (every vertex is either in the vertex cover or not)
$G^{\psi} \leftarrow G^{spm} \cup V'$ (the connection of the spammer network is to the vertex cover)
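The vertex cover definition above can be made concrete with a brute-force search. This Python sketch is an illustration under stated assumptions, not the procedure used in the paper: it treats all costs as equal (the unweighted case), and it enumerates subsets by increasing size, which is exponential in the number of vertices and therefore only usable on tiny graphs. The function names and the toy Ego network are hypothetical.

```python
from itertools import combinations


def is_vertex_cover(cover, edges):
    """True if every edge has at least one endpoint in `cover`."""
    return all(u in cover or v in cover for u, v in edges)


def minimum_vertex_cover(vertices, edges):
    """Exact minimum vertex cover by exhaustive search (exponential time)."""
    vertices = list(vertices)
    for k in range(len(vertices) + 1):
        for subset in combinations(vertices, k):
            if is_vertex_cover(set(subset), edges):
                return set(subset)      # first hit has the smallest possible size
    return set(vertices)


# Toy Ego network: the Ego user is linked to four friends, two of whom know each other.
ego_vertices = ["ego", "a", "b", "c", "d"]
ego_edges = [("ego", "a"), ("ego", "b"), ("ego", "c"), ("ego", "d"), ("a", "b")]
cover = minimum_vertex_cover(ego_vertices, ego_edges)
```

On this toy graph two vertices suffice, the Ego user plus one endpoint of the friend-to-friend edge. The exhaustive loop also shows why the search effort grows so quickly with network size.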
This delay in time could be very significant in terms of the OSN structure: as time goes
by, properties change, users are added and removed, and the network can be different
from its preliminary status. The changes of the network create a difficulty for an attack
that was pre-ordained to the original network and might not be relevant after the delay
of T AUA . The full attack is described in Algorithm 3.
Algorithm 3. SpammerCommunityMinimalAttackOnOSN
Input: Total Friends threshold T^TF, Age of User Account threshold T^AUA, Graph G, Spammer Vertex v^spm;
Output: Spammed Graph G^ψ
For i = 1 to T^TF
    V^spm ← V^spm ∪ {v_i^spm}    // creating fake users
Graph G^spm ← {V^spm, E^spm};    // creating a spammer network
Wait(T^AUA)    // the threshold time must pass to authenticate the AUA attribute
V' ← minimalVertexCover(G)
For each v in V' and e in E'; 0 ≤ i ≤ |V'|
    e_i ← {v^spm, v_i}    // spammer connects to minimalVertexCover of Ego network
G^ψ ← G^spm ∪ G
return G^ψ
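The flow of Algorithm 3 can be sketched in Python as follows. This is a hedged illustration rather than the authors' code. The waiting period $T^{AUA}$ is only noted in a comment. The minimal vertex cover is replaced by the classic 2-approximation (take both endpoints of each uncovered edge), so the cover may be up to twice the minimum size. The node labels, the single gateway spammer vertex, and the function names are assumptions.

```python
from itertools import combinations


def approx_vertex_cover(edges):
    """Swapped-in 2-approximation: add both endpoints of every uncovered edge."""
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.update((u, v))
    return cover


def spammer_attack_graph(ego_edges, t_tf):
    """Build the edge set of the spammed graph: Ego graph, spammer clique, attack links."""
    spm_nodes = [("spm", i) for i in range(t_tf)]          # creating fake users
    spm_edges = set(combinations(spm_nodes, 2))            # spammer network as a clique
    # In Algorithm 3 the attacker now waits T^AUA so the accounts look old enough (omitted).
    cover = approx_vertex_cover(ego_edges)                 # stand-in for minimalVertexCover
    gateway = spm_nodes[0]                                 # assumed single spammer vertex
    attack_edges = {(gateway, v) for v in cover}           # connect to the cover only
    return set(ego_edges) | spm_edges | attack_edges, cover


# Toy Ego network (triangle) and a deliberately tiny threshold for readability.
ego_edges = [("ego", "a"), ("ego", "b"), ("a", "b")]
spammed_edges, cover = spammer_attack_graph(ego_edges, t_tf=4)
```

With the threshold of 245 quoted in the Evaluation section, the clique alone would contribute 29,890 edges before any link to the Ego network is made.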
5 Evaluation
The experimental evaluation estimates the attacker effort in terms of the size of the
spammer network that is required for a successful attack to take place.
For the OSN attributes threshold we have used the results obtained from our previous
research [2] as presented in Table 1. These threshold values were obtained from a survey
of 282 OSN users that were asked for the importance of various attributes in their deci-
sions to grant various permissions to their private data. The MTV value was calculated
based on UTV values (Eq. 2) using a real OSN dataset that included the attributes of
162 users of an Ego-network [2]. To calculate the sizes of spammer networks we use
the experimental results of the threshold values $T^{TF}$ and $T^{AUA}$ (Table 1) and use Eqs. 3
and 4 presented above. We can see that the basic $T^{TF}$ is 245 for $d = 1$, and $T^{AUA}$ is 24
months. The size of the spammer network in terms of edges being created is expressed
by $|E^{spm}|$, and as described above, since it is a clique, $|E^{spm}| = \frac{|V^{spm}|(|V^{spm}|-1)}{2}$. The result
Table 1. Experimental results for trust values for the model’s parameters.
networks with connectivity level of 0.5. We can see that the number of steps required
to find the minimal vertex cover is very high relative to the size of the network being
attacked. The implementation of the model, with these relevant threshold values for
the parameters is meant to be performed by the OSN administration, per each user’s
network.
The problem of attacks by malicious users in OSN has many aspects and applications.
Using several aspects in a comprehensive Trust-based model that was presented in this
paper is a genuine necessity for OSN privacy. In this research we have established
the strength of the comprehensive model by analyzing the possible attack scenarios
of creating a spammer community that may “contaminate” the model’s raw attributes.
These attributes are hard to fake since they are built on real OSN user presence and
real numerical assets. The comprehensive coverage of Access Control, Flow Control
and Trust provides a solid infrastructure for OSN privacy. We have simulated several
attack scenarios based on the preliminary evaluations of the properties from our previous
research and shown that the effort required by the attacker makes these attacks infeasible.
In current and future work, we are refining the model by considering both the categories
and context of data instances and by learning user profiles from past actions in different
contexts.
650 N. Voloch et al.
References
1. Voloch, N., Levy, P., Elmakies, M., Gudes, E.: A role and trust access control model for
preserving privacy and image anonymization in social networks. In: IFIPTM 2019 - 13th
IFIP WG 11.11 International Conference on Trust Management (2019)
2. Voloch, N., Levy, P., Elmakies, M., Gudes, E.: An access control model for data security
in online social networks based on role and user credibility. In: International Symposium on
Cyber Security Cryptography and Machine Learning (CSCML 2019). Springer, Cham (2019)
3. Gudes, E., Voloch, N.: An information-flow control model for online social networks based on
user-attribute credibility and connection-strength factors. In: CSCML 2018, 2nd International
Symposium on Cyber Security Cryptography and Machine Learning (2018)
4. Voloch, N., Gudes, E.: An MST-based information flow model for security in online social
networks. In: The 11th IEEE International Conference on Ubiquitous and Future Networks
(ICUFN 2019) (2019)
5. Heatherly, R., Kantarcioglu, M., Thuraisingham, B.: Preventing private information inference
attacks on social networks. IEEE Trans. Knowl. Data Eng. 25(8), 1849–1862 (2012)
6. Sandhu, R.S., Coyne, E.J., Feinstein, H.L., Youman, C.E.: Role-based access control models.
Computer 29(2), 38–47 (1996)
7. Lavi, T., Gudes, E.: Trust-based Dynamic RBAC. In: Proceedings of the 2nd International
Conference on Information Systems Security and Privacy (ICISSP), pp. 317–324 (2016)
8. Li, Z., Shen, H., Sapra, K.: Leveraging social networks to combat collusion in reputation
systems for peer-to-peer networks. IEEE Trans. Comput. 62(9), 1745–1759 (2012)
9. Sun, J., Zhu, X., Fang, Y.: A privacy-preserving scheme for online social networks with
efficient revocation. In: 2010 Proceedings IEEE INFOCOM, pp. 1–9. IEEE (2010)
10. Viswanath, B., Bashir, M.A., Crovella, M., Guha, S., Gummadi, K.P., Krishnamurthy, B.,
Mislove, A.: Towards detecting anomalous user behavior in online social networks. In: 23rd
USENIX Security Symposium (USENIX Security 2014), pp. 223–238 (2014)
11. Sirur, S., Muller, T.: The reputation lag attack. In: IFIP International Conference on Trust
Management, pp. 39–56. Springer, Cham (2019)
12. Lee, K., Caverlee, J., Webb, S.: The social honeypot project: protecting online communities
from spammers. In: Proceedings of the 19th International Conference on World Wide Web,
pp. 1139–1140 (2010)
13. Paradise, A., Shabtai, A., Puzis, R., Elyashar, A., Elovici, Y., Roshandel, M., Peylo, C.:
Creation and management of social network honeypots for detecting targeted cyber attacks.
IEEE Trans. Comput. Soc. Syst. 4(3), 65–79 (2017)
14. Huber, M., Mulazzani, M., Weippl, E., Kitzler, G., Goluch, S.: Friend-in-the-middle attacks:
exploiting social networking sites for spam. IEEE Internet Comput. 15(3), 28–34 (2011)
15. Shrivastava, N., Majumder, A., Rastogi, R.: Mining (social) network graphs to detect random
link attacks. In: 2008 IEEE 24th International Conference on Data Engineering, pp. 486–495.
IEEE (2008)
16. Dinur, I., Safra, S.: On the hardness of approximating minimum vertex cover. Ann. Math.
439–485 (2005)
17. https://www.businessofapps.com/data/facebook-statistics/
Media Partisanship During Election: Indonesian
Cases
1 Introduction
The rapid development of online media and social media in recent years has radically
changed the way people consume information. A survey shows that 63% of people read
news digitally [1], while social networks, such as Twitter and Facebook, are the
platforms where people share and discuss the latest news. The combination of online news
media and social media strengthens the role of news outlets as gatekeepers of information
concerning the formation of public opinion [2, 3].
The neutrality of news media is difficult to maintain at election time. This
has increasingly become a public concern, given the ability of news media to influence
individual choices, which may in turn affect the outcome of the election. Scientific
efforts to examine the partisan behavior of news outlets during elections
are constrained by the lack of data about the ideological stance of news media [4–7]. The
majority of news outlets do not openly express their political positions on various issues
[5]. Generally, media alignment is reflected through the content, terminology, and arguments
used in framing reported issues. Consequently, it is difficult to objectively measure
the political biases contained in the media frame. The alternative approach is to infer
the stance of media from the political affiliation of their audiences. This approach is
based on the assumption that people naturally tend to interact with information adhering
to their preferred narrative [6–8].
Social networks like Twitter are very rich in information related to user behavior, e.g.
tweet contents, followers, and hashtags used. This information can be used to identify users’
political affiliations, as well as the political leaning of the news outlets they read. In this
study, we propose a method for political stance detection of online news outlets based
on the behavior of their audience in social media. The method consists of 3 processing
stages, as follows: (i) Hashtag-based user labeling: we use a number of political hashtags,
i.e. hashtags that are strongly associated with certain political groups, to infer the political
affiliations of the users of these hashtags; (ii) Network-based user labeling: we expand the
number of labeled users using the Label Propagation Algorithm; (iii) Media classification:
at this stage, we use a polarity rule to identify the political stance of news outlets based
on the political affiliation of their audiences.
We applied this methodology to a tweet dataset related to the 2019 Indonesian
general election, to observe media alignments during the election. In doing so, we
also report news consumption patterns on Twitter concerning the credibility and partisan
behavior of news sources. This paper is structured as follows: Sections 2 and 3
discuss the data and analysis methods, Sect. 4 presents the results of the analysis, and
the final section discusses the conclusions and contributions of this study.
2 Data
We collected the data from 27 March to 19 May 2019, a period that covered the
campaign period, the general election (April 17, 2019), the vote recapitulation,
and the announcement period (May 21, 2019). Table 1 shows the descriptive statistics
of the data used in this study. Tweet data was extracted from Twitter using the DMI-Tcat
application [9] based on a number of keywords related to the candidates, namely: (i)
Candidate I (Prabowo-Sandiaga Uno): prabowo, sandiaga uno; (ii) Candidate II (Joko
Widodo-KH. Maaruf Amin): joko widodo, jokowi, ma’ruf amin, kiai ma’ruf.
Statistics               Count
# of tweets              13,990,975
# of tweets with a URL   667,821
# of hashtags            74,515
# of unique users        3,958,817
Fig. 1. 10 most used political hashtags used by the candidate: (a) Joko Widodo; (b) Prabowo
We manually labeled 1400 hashtags, 700 for each candidate, out of the total 74,515
hashtags recorded in the dataset. Each of these hashtags has been used by at least 10
different Twitter users. We apply a polarity rule to infer the political affiliation of the users,
as follows [11]:
$$V(u) = 2\,\frac{\dfrac{tf(u, C_0)}{total(C_0)}}{\dfrac{tf(u, C_0)}{total(C_0)} + \dfrac{tf(u, C_1)}{total(C_1)}} - 1 \qquad (1)$$

where $tf(u, C_0)$ is the number of times user u uses hashtags from the group $C_0$
associated with one candidate, and $total(C_0)$ is the total frequency of use of these
hashtags by all users. The quantities for the hashtag group $C_1$ of the other candidate
are defined similarly. The political valence value V(u) lies in the range −1 ≤ V(u) ≤ 1.
To be confident of a user’s political affiliation, we use a threshold value of ±0.2: users
lean to Prabowo if they have a valence score < −0.2, and lean to Joko Widodo if the
valence score is > 0.2. Table 2 shows that at this stage we are able to identify the
political affiliation of 181,145 Twitter users, of which 89,000
are Jokowi’s supporters and 92,145 are Prabowo’s supporters.
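The valence rule and the ±0.2 threshold can be illustrated with a minimal Python sketch. The function names, the argument names, and the choice to return `None` for a user who used none of the labeled hashtags are illustrative assumptions, not part of the original method description. The sketch also assumes that the first hashtag group belongs to Joko Widodo, which is what the sign convention of the thresholds implies.

```python
def political_valence(tf_c0, tf_c1, total_c0, total_c1):
    """Valence of one user, in [-1, 1], from normalized hashtag usage.

    tf_c0, tf_c1       -- how often the user used hashtags of group C0 / C1
    total_c0, total_c1 -- total usage of each hashtag group by all users
    Returns None when the user used no labeled hashtag (assumed convention).
    """
    share_c0 = tf_c0 / total_c0
    share_c1 = tf_c1 / total_c1
    if share_c0 + share_c1 == 0:
        return None
    return 2 * share_c0 / (share_c0 + share_c1) - 1


def affiliation(valence, threshold=0.2):
    """Map a valence score to a label using the +/-0.2 threshold.

    Assumption: positive valence corresponds to Joko Widodo (group C0).
    """
    if valence is None:
        return "unlabeled"
    if valence > threshold:
        return "joko_widodo"
    if valence < -threshold:
        return "prabowo"
    return "unlabeled"


# Hypothetical counts, for illustration only.
example = political_valence(tf_c0=10, tf_c1=0, total_c0=1000, total_c1=1000)
example_label = affiliation(example)
```

Normalizing each count by the total usage of its hashtag group keeps a more active camp from dominating the score simply because it tweets more.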
654 A. Maulana and H. Situngkir
Table 2. Identification of users’ political affiliations using hashtags and network-based labeling
Then, we apply the label propagation algorithm to classify each node in the network
as pro-Joko Widodo or pro-Prabowo. In this study we do not consider the existence
of non-polarized users by assuming that each user is exposed to political information
and therefore will have a tendency towards one of the candidates [11, 14]. Furthermore,
supporters of both candidates who are less polarized tend to consume media that is
considered politically neutral, and hence will place these media in the middle of the
political spectrum. We accommodate this latter possibility by establishing a ‘moderate’
media type in our media classification (see Eq. 2).
Label propagation algorithms are graph-based semi-supervised learning methods,
and use the label information of labeled data to predict the label information of unlabeled
data. At this stage, we used 153,990 labeled nodes identified in the previous stage as seeds
(the list of labeled nodes). We fix the seeds’ labels so they do not change in the process,
since this seed list also serves as our ground truth. This algorithm works iteratively to
update the label of each node based on the majority label of its neighboring nodes. This
process is carried out until the labels of the majority of nodes no longer change [14, 15].
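The propagation step described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the graph is a plain adjacency dictionary, nodes are visited in a fixed sorted order instead of a random one, ties are broken deterministically, and the loop stops when no label changes (or after `max_iter` passes) rather than when the majority of labels are stable. All names are hypothetical.

```python
from collections import Counter


def propagate_labels(adjacency, seeds, max_iter=100):
    """Label propagation with fixed seed labels.

    adjacency -- dict mapping each node to a set of neighboring nodes
    seeds     -- dict mapping seed nodes to their (fixed) labels
    Returns a dict with a label for every node that could be reached.
    """
    labels = dict(seeds)
    for _ in range(max_iter):
        changed = False
        for node in sorted(adjacency):
            if node in seeds:
                continue  # seed labels never change
            votes = Counter(
                labels[nb] for nb in adjacency[node] if nb in labels
            )
            if not votes:
                continue  # no labeled neighbor yet
            top = max(votes.values())
            winners = sorted(l for l, c in votes.items() if c == top)
            current = labels.get(node)
            # Keep the current label on ties; otherwise pick deterministically.
            new_label = current if current in winners else winners[0]
            if new_label != current:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    return labels


# Toy retweet network: two triangles joined by the edge c-d.
toy_graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
toy_seeds = {"a": "JW", "b": "JW", "e": "P", "f": "P"}
result = propagate_labels(toy_graph, toy_seeds)
```

In the toy network the unlabeled nodes `c` and `d` each adopt the label held by the majority of their neighbors, while the four seeds keep their labels.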
Stratified 5-fold cross-validation is applied to the set of 153,990
seeds to validate the results of the label propagation algorithm [14]. We use 4/5 of the seed
nodes as training data, while the remaining nodes are used to evaluate the algorithm’s
performance. The evaluation results in Table 4 show a prediction accuracy of ~98%. This
strengthens confidence in the performance of the classification algorithm that we use.
At this stage, we successfully identified the political affiliation of 388,253 users in the
retweet network.
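A minimal sketch of such a stratified 5-fold evaluation is given below. It assumes a hypothetical `predict` callable that takes the training seeds and returns a label for every node (for example, a label propagation routine). The helper names and the fixed random seed are illustrative assumptions.

```python
import random
from collections import defaultdict


def stratified_folds(seed_labels, k=5, rng_seed=0):
    """Split seed nodes into k folds that preserve the label proportions."""
    rng = random.Random(rng_seed)
    by_label = defaultdict(list)
    for node, label in seed_labels.items():
        by_label[label].append(node)
    folds = [[] for _ in range(k)]
    for label in sorted(by_label):
        nodes = sorted(by_label[label])
        rng.shuffle(nodes)
        # Deal the nodes of each class round-robin across the folds.
        for position, node in enumerate(nodes):
            folds[position % k].append(node)
    return folds


def cross_validate(seed_labels, predict, k=5):
    """Return the accuracy on each held-out fold.

    predict -- hypothetical callable: training seeds -> {node: label}
    """
    accuracies = []
    for held_out in stratified_folds(seed_labels, k):
        if not held_out:
            continue
        held = set(held_out)
        training = {n: l for n, l in seed_labels.items() if n not in held}
        predicted = predict(training)
        correct = sum(
            1 for n in held_out if predicted.get(n) == seed_labels[n]
        )
        accuracies.append(correct / len(held_out))
    return accuracies


# Toy check with 10 seeds per camp and an oracle predictor.
toy_labels = {f"jw{i}": "JW" for i in range(10)}
toy_labels.update({f"p{i}": "P" for i in range(10)})
toy_folds = stratified_folds(toy_labels)
toy_accuracy = cross_validate(toy_labels, lambda training: dict(toy_labels))
```

Each fold holds out 1/5 of the seeds of each camp, so the remaining 4/5 serve as training seeds, matching the split described above.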
where S(v) < 0 means that the media outlet tends to lean to Prabowo, S(v) > 0 means
that it tends to lean to Joko Widodo, and S(v) = 0 means that it tends to be
politically neutral, i.e. a moderate news outlet.
4 Analysis
Figure 2 shows the daily number of articles and unique articles shared by users during the
data collection period. The number of unique articles serves as a proxy for the volume
of articles published by the media outlets. In general, the two indicators do not show
different dynamics. This indicates the influence of the media on the intensity of news
consumption on social media. Although the daily volume fluctuates, the dynamics
clearly show an upward trend ahead of the election. This indicates that election-related
news, as well as readers’ attention, increases toward the election, reaches its peak
on election day, and then shows a downward trend afterward.
In this study, we investigated only 560 of the 700 news outlets found in the
dataset. Overall, we focus only on domestic news media that have been shared by at least
10 different Twitter users.
Fig. 2. Daily number of articles and unique articles shared by twitter users. Dotted lines are trend
lines.
Fig. 4. The position of a number of mainstream media in the spectrum of political alignments:
(left) pro-Prabowo; (right) pro-Joko Widodo.
Fig. 5. Heat map between political valence score vs credibility ranking by: (left) number of media;
(right) number of shares.
The heat map shown in Fig. 5 illustrates the relationship between the political
alignments of news media and their credibility. In this study we use Alexa Rank [16] as a
proxy to assess the credibility of a media outlet.
As shown in Fig. 5 (left), most mainstream media with high credibility ratings
have a neutral valence score or tend to favor Joko Widodo, while Prabowo-leaning
media generally have moderate or low credibility. In addition, most low-credibility
media tend to have extreme political valence scores. In other words, low-credibility
media tend to be more partisan than those with high credibility. Fig. 5 (right)
also shows that, for all political valence scores, the intensity of information dissemination
originating from low-credibility media is not much different from that of
high-credibility media. This means that partisan readers tend not to question the credibility
of the news sources they share. We highlight this empirical fact in relation to the rise of
false news during the election and the potential of low-credibility media as sources of
misinformation.
5 Conclusion
In this study, we use the partisan behavior of media audiences on Twitter to identify
the political alignments of news media during the 2019 Indonesian elections. The
identification method we propose is carried out in 3 stages, as follows: (i) Identification of the
political affiliations of Twitter users based on the political hashtags they used in their tweets;
(ii) Identification of the political affiliations of Twitter users based on their interaction
networks using the label propagation algorithm; (iii) Identification of the political
alignments of news media based on the political affiliation of their audiences. Evaluation results
show that the proposed method is very effective in detecting the political affiliation of
Twitter users as well as predicting the political stance of news media. The position of
media in the spectrum of political valence confirms the general allegations of media
partisanship during the election. Further elaboration regarding news consumption behavior
shows that low-credibility news outlets tend to have extreme political positions, while
partisan readers tend not to question the credibility of the news sources they share.
References
1. Nic, N., Levy, D.A.L., Nielsen, R.K.: Reuters Institute Digital News Report 2018. Reuters
Institute Digital News (2018)
2. Vos, T.P.: Journalists as Gatekeepers. In: Wahl-Jorgensen, K., Hanitzsch, T. (eds.) The
Handbook of Journalism Studies, pp. 90–104. Routledge (2019)
3. Allcott, H., Gentzkow, M.: Social media and fake news in the 2016 election. J. Econ. Perspect.
31, 211–236 (2017)
4. Groeling, T.: Media bias by the numbers: challenges and opportunities in the empirical study
of Partisan news. Ann. Rev. Polit. Sci. 16(1), 129–151 (2013)
5. Barberá, P., Sood, G.: Follow your ideology: measuring media ideology on social networks.
In: Annual Meeting of the European Political Science Association, Vienna (2015)
6. Becatti, C., Caldarelli, G., Lambiotte, R., Saracco, F.: Extracting significant signal of news
consumption from social networks: the case of twitter in italian political elections. Palgrave
Commun. 5(1) (2019)
7. Bakshy, E., Messing, S., Adamic, L.A.: Exposure to ideologically diverse news and opinion
on facebook. Science 348(6239), 1130–1132 (2015)
8. Stefanov, P., Darwish, K., Atanasov, A., Nakov, P.: Predicting the topical stance and political
leaning of media using tweets. In: Proceedings of the 58th Annual Meeting of the Association
for Computational Linguistics (2020)
9. Borra, E., Rieder, B.: Programmed method: developing a toolset for capturing and analyzing
tweets. Aslib J. Inf. Manag. 66(3), 262–278 (2014)
10. Varol, O., Uluturk, I.: Journalists on twitter: self-branding, audiences, and involvement of
bots. J. Comput. Soc. Sci. 3(1), 83–101 (2020)
11. Conover, M., Ratkiewicz, J., Francisco, M., Gonçalves, F., Menczer, F., Flamini, A.: Political
polarization on twitter. In: Proceedings of 5th International Conference on Weblogs and Social
Media (2011)
12. Vicario, M.D., Bessi, A., Zollo, F., Petroni, F., Scala, A., Caldarelli, G., Stanley, H.E., Quat-
trociocchi, W.: The spreading of misinformation online. Proc. Natl. Acad. Sci. U.S.A. 113(3),
554–559 (2016)
13. Zollo, F., Bessi, A., Del Vicario, M., Scala, A., Caldarelli, G., Shekhtman, L., Havlin, S.,
Quattrociocchi, W.: Debunking in a world of tribes. PLoS ONE 12(7), e0181821 (2017)
14. Badawy, A., Ferrara, E., Lerman, K.: Analyzing the digital traces of political manipulation:
the 2016 Russian interference twitter campaign. In: Proceedings of the 2018 IEEE/ACM
International Conference on Advances in Social Networks Analysis and Mining (ASONAM
2018) (2018)
15. Raghavan, U.N., Albert, R., Kumara, S.: Near linear time algorithm to detect community
structures in large-scale networks. Phys. Rev. E 76(3) (2007)
16. Alexa Internet: Keyword Research, Competitor Analysis RESEARCH & Website Ranking.
https://www.alexa.com. Accessed 10 Aug 2019
Media Polarization on Twitter During 2019
Indonesian Election
1 Introduction
The outbreak of extreme political polarization in various democratic countries has been
the problem of this century [1]. This phenomenon cannot be separated from the rise
of the digital information space, e.g. social media and online news media, which makes it
easier for citizens to access and discuss political information [2]. On the one hand, the
combination of online news media and social media increases the chances of individuals
being exposed to information from a variety of perspectives [3]. On the other hand,
mediation and personalization of information by social networks also have the potential
to limit exposure to only information that is politically agreed upon [4], giving rise to
misperceptions of facts and events [5] and leading to the emergence of increasingly
extreme political attitudes [6].
A number of studies have shown empirically the tendency of social media users
to focus on specific narratives and to interact intensively with those who have the same
political preferences [7–9]. This micro tendency may lead to the emergence of
echo-chambers [7, 10] that divide the social media space into politically homogeneous user
communities [11]. In each community, users tend to ignore dissenting information and to
interact with information adhering to their preferred narrative. The study of digital
echo-chamber phenomena is quite challenging [7–12]. However, most research in this area
examines fragmentation and polarization in user networks. Meanwhile, empirical work
investigating information segregation due to the fragmentation of information sources is
constrained by the difficulty of measuring the political tendencies of news media [13].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 943, pp. 660–670, 2021.
https://doi.org/10.1007/978-3-030-65347-7_55
2 Data
This study investigates news consumption patterns on Twitter during the 2019 Indonesian
presidential and legislative elections. We collected the data from
March 27 to May 19, 2019, covering the campaign period, the elections (April 17, 2019),
and the vote recapitulation and announcement period (May 21, 2019). We use the DMI-Tcat
application [14] to extract data from Twitter based on a number of keywords related to
the candidates, namely: (i) Candidate I (Prabowo - Sandiaga Uno): prabowo, sandiaga
uno; (ii) Candidate II (Joko Widodo - KH. Maaruf Amin): joko widodo, jokowi, maaruf
amin, kiyai maaruf.
Table 1 shows the descriptive statistics of the tweet dataset used in this study.
Overall, we focused only on 667,821 of the total 13,990,975 tweets, namely those that
contained news links from 559 Indonesian news media and were shared at least ten times
by Twitter users.
Statistics               Count
# of tweets              13,990,975
# of tweets with a URL   667,821
# of hashtags            74,515
# of unique users        3,958,817
3 Method
3.1 Bipartite Network
In this study we use the Fast Greedy algorithm [17] to analyze the meso structure of the
projection network Ŝ. As shown in Table 3, this algorithm revealed five media communities,
of which two are very dominant clusters covering 98% of the total news outlets
analyzed. To validate the results, we also implemented the Walk Trap algorithm [18] and
compared the results of both algorithms using the Adjusted Rand Index (ARI) [19].
We find an ARI coefficient of 0.8, which indicates that the two algorithms produce similar
results.
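As an illustration of how two community partitions can be compared, the following is a minimal pure-Python sketch of the Adjusted Rand Index computed from pair counts. It is not the authors' code. The function name and the toy partitions are hypothetical, and each partition is assumed to be given as a list of community identifiers in the same node order.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two partitions of the same nodes."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must cover the same nodes")
    n = len(labels_a)
    if n < 2:
        return 1.0
    # Pairs of nodes grouped together in both partitions.
    together_both = sum(
        comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values()
    )
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are trivial
    return (together_both - expected) / (maximum - expected)


# Toy partitions of four outlets (hypothetical community ids).
same_up_to_renaming = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index is 1 when the two partitions agree up to a renaming of the communities and is close to 0 for agreement at chance level, so a value of 0.8 indicates strong but not perfect agreement.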
Table 3. Descriptive statistics of community partitions using the Fast Greedy algorithm
We need to measure the political stance of news outlets in order to investigate the political
polarization that occurs in media networks during the elections. In this study the media stance
was identified based on the partisan behavior of their audience, assuming that people tend
to be selective about information, i.e. only reading and sharing news articles in
accordance with their political preferences. Following [20], the process of media classification
is carried out in 3 stages, as follows: (i) Hashtag-based user labeling: 1400 political
hashtags associated with certain political groups are used to identify the political affiliations
of the users of these hashtags. At this stage, we identified 153,990 labeled users,
which are then used as seed nodes for the label propagation algorithm at a later stage;
(ii) Network-based user labeling: at this stage we apply the Label Propagation algorithm
to expand the number of labeled users [21, 22]. We identified the
political affiliation of 388,253 Twitter users, with a prediction accuracy of ~0.98; (iii) Media
classification: we use a polarity rule [11] to identify media stance based on the political
affiliation of their audiences. Table 4 shows the classification result for 560 Indonesian news
outlets.
4 Analysis
The Fast Greedy algorithm identified two dominant communities in the
Indonesian news media network, covering 98% of the total news outlets analyzed. This
media network has a high modularity value (M = 0.25), which indicates a segregation
of information sources in the news media landscape during the election. Considering
that the grouping of news outlets emerges from the interaction between audiences and
news sources, it is necessary to measure the extent to which segregation occurs between
the two dominant media communities, as follows [26, 27]:
$$p(u) = \frac{y - x}{y + x} \qquad (1)$$
where y (x) is the fraction of Twitter users who share news tweets from outlets in media
community C1 (C2). Figure 2 shows the presence of strong bimodality in the distribution
of news audience activity in each community. This indicates that each media community
is an echo-chamber for its respective audience, that is, a group of like-minded people
cooperating to reinforce a shared narrative.
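The polarization measure can be illustrated with a short Python sketch. As an explicit assumption made only for this example, y and x are taken per user, as the fractions of that user's shared news tweets that come from outlets in C1 and C2; the source text does not spell this out. The function name and the convention of returning `None` for users without shares in either community are also hypothetical.

```python
def polarization(shares_c1, shares_c2):
    """Polarization p(u) = (y - x) / (y + x) for one user.

    shares_c1, shares_c2 -- number of news tweets the user shared from
    outlets in community C1 and C2 (assumed per-user interpretation).
    Returns None if the user shared nothing from either community.
    """
    total = shares_c1 + shares_c2
    if total == 0:
        return None
    y = shares_c1 / total
    x = shares_c2 / total
    return (y - x) / (y + x)


# Hypothetical users: one sharing only from C1, one perfectly balanced.
only_c1 = polarization(10, 0)
balanced = polarization(3, 3)
```

Values near +1 or −1 correspond to users who share almost exclusively from one community, so a bimodal distribution of p(u) with peaks at both extremes is the signature of segregation described above.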
To understand the relationship between segregation in news media networks and the
political alignments of news outlets during the election, we elaborate the composition of
partisan media in the two dominant media clusters. As shown in Table 5, the composition
of the partisan media in each community tends to be politically homogeneous. This fact
confirms the occurrence of political polarization in Indonesian media networks. Table 5
also shows that Joko Widodo-leaning media has a stronger tendency to group in the same
cluster than Prabowo-leaning media, while news outlets with moderate political stance
are relatively small in number and spread evenly in two dominant clusters.
Fig. 3. Political polarization in Indonesian news media during 2019 General Elections
Figure 3 visualizes the political polarization of Indonesian news media during the 2019
Presidential Elections. As seen in the figure, the structure of the media network is divided
into two dominant clusters, each with a relatively homogeneous political identity. This
shows that the pattern of news consumption in the 2019 Indonesian elections is not only
fragmented, but also forms political echo-chambers in which audiences tend to be exposed
to politically homogeneous information coming from news outlets with the same political
tendencies.
We then elaborate on empirical facts about interactions between news media across
political affiliations [20–22, 26–28]. Figure 4 shows the statistical characteristics of
interaction between news outlets, within and between media communities. In general, the
Indonesian news media network has homophily properties, where news outlets with
the same political stance tend to be connected to each other (Joko Widodo-leaning media:
median = 0.842, interquartile range (IQR) = [0.809, 0.873]; Prabowo-leaning media:
median = 0.543, IQR = [0.514, 0.577]). This characteristic is stronger for Joko
Widodo-leaning media than for Prabowo-leaning media. Furthermore, the interaction tendency
from Joko Widodo-leaning media to Prabowo-leaning media (median = 0.045, IQR =
[0.0023, 0.064]) is much smaller than the opposite (median = 0.306, IQR = [0.273,
0.323]). Meanwhile, the interaction tendencies from moderate news media to partisan media are
almost equal (pro Joko Widodo: median = 0.425, IQR = [0.345, 0.527]; pro Prabowo:
median = 0.438, IQR = [0.302, 0.538]).
Fig. 4. Statistical characteristics of interaction between news outlets, intra and inter political
affiliations. White vertical lines are median values, horizontal thick lines are interquartile ranges,
and black thin horizontal lines are 10th and 90th percentile values.
The statistics of interaction between media across political affiliations indicate that
information exposure to Joko Widodo’s supporters is relatively more homogeneous,
coming from news outlets with the same political affiliation, compared to information
exposure to Prabowo’s supporters. This can be understood by looking at the composition
of partisan media in each media community (see Table 5). Moreover, as discussed in
previous studies [20], Prabowo-leaning media is dominated by segmented news media,
such as the Islamic news outlets (e.g. eramuslim, portal-islam, republika)
and opposition media (e.g. viva, rmol, gelora) while mainstream media tends to be
neutral or in favor of Joko Widodo. As a result, Prabowo’s supporters tend to be exposed
to information coming from the pro-Joko Widodo news media, but not vice versa.
We further investigate the role and position of each news outlet in the information
ecosystem during the 2019 Indonesian elections. We use two indicators, namely the
within-module degree z-score ($z_i$) and the participation coefficient ($Pc_i$) [29], to measure
the role of a news outlet based on its relations with other outlets within or between media
communities. The within-module degree z-score ($z_i$) measures the connectivity of a news
outlet inside its own community, as follows:

$$z_i = \frac{k_i - \bar{k}_{s_i}}{\sigma_{k_{s_i}}} \qquad (2)$$

where $k_i$ is the degree of news outlet i within its cluster $s_i$, $\bar{k}_{s_i}$ is the average
degree of all media in cluster $s_i$, and $\sigma_{k_{s_i}}$ is the standard deviation of the degree k
in cluster $s_i$. The greater the value of $z_i$, the higher the connectivity of outlet i relative to
the other outlets in its community. Meanwhile, cross-cluster node connectivity is evaluated
using the participation coefficient ($Pc_i$), as follows:

$$Pc_i = 1 - \sum_{s=1}^{M} \left(\frac{k_{is}}{k_i}\right)^2 \qquad (3)$$

where $k_{is}$ is the number of relations of outlet i to other outlets in cluster s, the sum runs
over the M clusters, and $k_i$ in Eq. 3 is the total degree of outlet i. $Pc_i$ ∼ 0 if
outlet i is connected only to outlets in its own cluster. Conversely, $Pc_i$ ∼ 1
if the relations of an outlet are evenly distributed over all clusters. The combination of these
two indicators forms 7 regions of node roles within the z-P parameter space, namely (i) R1:
ultra-peripheral nodes (z < 2.5, P ≤ 0.05); (ii) R2: peripheral nodes (z < 2.5, 0.05 <
P ≤ 0.62); (iii) R3: non-hub connectors (z < 2.5, 0.62 < P ≤ 0.8); (iv) R4: non-hub
kinless nodes (z < 2.5, P > 0.8); (v) R5: provincial hubs (z ≥ 2.5, P ≤ 0.3); (vi) R6:
connector hubs (z ≥ 2.5, 0.3 < P ≤ 0.75); (vii) R7: kinless hubs (z ≥ 2.5, P > 0.75).
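The two indicators and the resulting role assignment can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the network is an unweighted adjacency dictionary, the population standard deviation is used, z is set to 0 when all outlets in a cluster have the same within-cluster degree, and the region boundaries are treated as non-overlapping. All names are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def classify_role(z, p):
    """Map a (z, P) pair to one of the seven role regions R1..R7."""
    if z < 2.5:  # non-hubs
        if p <= 0.05:
            return "R1"  # ultra-peripheral
        if p <= 0.62:
            return "R2"  # peripheral
        if p <= 0.80:
            return "R3"  # non-hub connector
        return "R4"      # non-hub kinless
    if p <= 0.30:
        return "R5"      # provincial hub
    if p <= 0.75:
        return "R6"      # connector hub
    return "R7"          # kinless hub


def node_roles(adjacency, community):
    """Return {node: (z, P, role)} for an unweighted network.

    adjacency -- dict mapping each node to a set of neighbors
    community -- dict mapping each node to its cluster id
    """
    within = {
        i: sum(1 for j in adjacency[i] if community[j] == community[i])
        for i in adjacency
    }
    degrees_by_cluster = {}
    for i, k in within.items():
        degrees_by_cluster.setdefault(community[i], []).append(k)
    stats = {c: (mean(v), pstdev(v)) for c, v in degrees_by_cluster.items()}

    roles = {}
    for i in adjacency:
        avg, sd = stats[community[i]]
        z = (within[i] - avg) / sd if sd > 0 else 0.0
        k_total = len(adjacency[i])
        per_cluster = Counter(community[j] for j in adjacency[i])
        p = 1 - sum((c / k_total) ** 2 for c in per_cluster.values()) if k_total else 0.0
        roles[i] = (z, p, classify_role(z, p))
    return roles


# Toy network: outlet "a" links equally into clusters 1 and 2.
toy_adjacency = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
toy_community = {"a": 1, "b": 1, "c": 2}
toy_roles = node_roles(toy_adjacency, toy_community)
```

In the toy network, outlet `a` splits its two links evenly between the clusters, so its participation coefficient is 1 − (0.25 + 0.25) = 0.5 and it falls into the peripheral region R2.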
As shown in Table 6, ~96% of news outlets are ultra-peripheral (R1)
or peripheral nodes (R2), i.e. low-degree nodes with little or no cross-cluster connection.
The remaining media fill the R5 region as provincial hubs and the R6 region as connector
hubs.
Table 6. Descriptive statistics of the news media role during election. Media rating is based on
Alexa rank [30]. (JW: Joko Widodo; P: Prabowo; M: Moderate).
How the partisan outlets fill the R5 and R6 regions reveals differences in the
characteristics of the two candidates’ information echo-chambers. As shown in Table 6,
there are only two news outlets in the R5 region, and both are Prabowo-leaning media.
This means that gelora and rmol are central outlets within the information
echo-chamber of Prabowo’s supporters. It also implies that the information structure of
Prabowo’s media community is more centralized than that of Joko Widodo’s media community.
In general, Joko Widodo-leaning media, as well as moderate media, dominate the R6
region as connector hubs, which means that these outlets are consumed by supporters of
both candidates. As shown in Table 6, news outlets in the R6 region have a high median
Alexa Rank, which indicates that this region is dominated by mainstream news media. This
fact highlights the important role of mainstream news media as a bridge of information
between opposing political sides, especially in a heated election.
5 Conclusion
In this study we reveal empirical facts about political polarization in the Indonesian news
media network during the 2019 Indonesian General Elections. By modeling news
consumption patterns as a bipartite network of news outlets and Twitter users, we observed
the emergence of a number of media communities based on audience similarity. By
measuring the political alignment of each news outlet, we reveal a politically fragmented
Indonesian news media landscape, where each media community becomes a political
echo-chamber for its audience. More specifically we find that, compared to the Prabowo
media cluster, which tends to be centralized in a small number of outlets, Joko Widodo’s
supporters have diverse sources of information. However, information exposure to Joko
Widodo’s supporters is relatively more homogeneous, coming from media with the
same political affiliation.
Nowadays, understanding the impact of social media and online news media on
the emergence of extreme polarization in political discourse is one of the most pressing
challenges for both science and society. Our findings highlight the important role of
mainstream media as a bridge of information between political echo-chambers in the social
media environment.
References
1. World Economic Forum: Digital Wildfires in a Hyperconnected World.
https://reports.weforum.org/global-risks-2018/digital-wildfires/. Accessed 21 July 2019
2. Nic, N., Levy, D.A.L., Nielsen, R.K.: Reuters Institute Digital News Report 2018. Reuters
Institute Digital News (2018)
3. Bakshy, E., Rosenn, I., Marlow, C., Adamic, L.: The role of social networks in information
diffusion. In: Proceedings of the 21st Annual Conference on World Wide Web (IW3C2),
pp. 519–528. ACM, Lyon (2012)
4. Pariser, E.: The Filter Bubble: What the Internet Is Hiding from You. Penguin Press, London
(2011)
5. Kull, S., Ramsay, C., Lewis, E.: Misperceptions, the Media, and the Iraq War. Polit. Sci. Q.
118(4), 569–598 (2003)
6. Stroud, N.J.: Media use and political predispositions: Revisiting the concept of selective
exposure. Polit. Behav. 30(3), 341–366 (2008)
7. Quattrociocchi, W., Scala, A., Sunstein, C.R.: Echo Chambers on Facebook. SSRN Electron.
J. (2018)
8. Bessi, A., Petroni, F., Del Vicario, M., Zollo, F., Anagnostopoulos, A., Scala, A., Caldarelli,
G., Quattrocciocchi, W.: Viral misinformation: the role of homophily and polarization. In:
WWW 2015 Companion - Proceedings of the 24th International Conference on World Wide
Web, pp. 355–356. ACM, Florence (2015)
9. Vicario, M.D., Bessi, A., Zollo, F., Petroni, F., Scala, A., Caldarelli, G., Stanley, H.E., Quattro-
ciocchi, W.: The spreading of misinformation online. Proc. Natl. Acad. Sci. U. S. A. 113(3),
554–559 (2016)
10. Del Vicario, M., Vivaldo, G., Bessi, A., Zollo, F., Scala, A., Caldarelli, G., Quattrociocchi,
W.: Echo chambers: emotional contagion and group polarization on Facebook. Sci. Rep. 6(1)
(2016)
670 A. Maulana and H. Situngkir
11. Conover, M., Ratkiewicz, J., Francisco, M., Gonçalves, F., Menczer, F., Flamini, A: Political
polarization on Twitter. In: Proceedings of 5th International Conference on Weblogs and
Social Media (2011)
12. Bessi, A., Zollo, F., Del Vicario, M., Scala, A., Caldarelli, G., Quattrociocchi, W.: Trend of
narratives in the age of misinformation. PLoS ONE 10(8), e0134641 (2015)
13. Groeling, T.: Media bias by the numbers: challenges and opportunities in the empirical study
of Partisan news. Ann. Rev. Polit. Sci. 16(1), 129–151 (2013)
14. Borra, E., Rieder, B.: Programmed method: developing a toolset for capturing and analyzing
tweets. Aslib J. Inf. Manag. 66(3), 262–278 (2014)
15. Tumminello, M., Miccichè, S., Lillo, F., Piilo, J., Mantegna, R.N.: Statistically validated
networks in bipartite complex systems. PLoS ONE 6(3), e17994 (2011)
16. Benjamini, Y., Hochberg, Y.: Controlling the false discovery rate: a practical and powerful
approach to multiple testing. J. Roy. Stat. Soc.: Ser. B (Methodol.) 57(1), 289–300 (1995)
17. Clauset, A., Newman, M.E.J., Moore, C.: Finding community structure in very large networks.
Phys. Rev. E 70(6) (2004)
18. Pons, P., Latapy, M.: Computing communities in large networks using random walks. J. Graph
Algorithms Appl. 10(2), 191–218 (2006)
19. Rand, W.M.: Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc.
66(336), 846–850 (1971)
20. Maulana, A., Situngkir, H.: Media partisanship during election: Indonesian cases. MPRA
Paper, 101950 (2020)
21. Raghavan, U.N., Albert, R., Kumara, S.: Near linear time algorithm to detect communi-ty
structures in large-scale networks. Phys. Rev. E 76(3) (2007)
22. Badawy, A., Ferrara, E., Lerman, K.: Analyzing the digital traces of political manipulation:
the 2016 Russian interference Twitter campaign. In: Proceedings of the 2018 IEEE/ACM
International Conference on Advances in Social Networks Analysis and Mining (ASONAM
2018) (2018)
23. Sunstein, C.R.: #Republic: Divided Democracy in the Age of Social MediaTitle. Princeton
University Press, Princeton (2017)
24. Webster, J.G., Ksiazek, T.B.: The dynamics of audience fragmentation: public attention in an
age of digital media. J. Commun. (2012)
25. Gaol, F.L., Matsuo, T., Maulana, A.: Network model for online news media landscape in
Twitter. Information 10(9), 277 (2019)
26. Schmidt, A.L., Zollo, F., Vicario, M.D., Bessi, A., Scala, A., Caldarelli, G., Stanley, H.E.,
Quattrociocchi, W.: Anatomy of news consumption on Facebook. Proc. Natl. Acad. Sci. U.
S. A. 114(12), 3035–3039 (2017)
27. Del Vicario, M., Gaito, S., Quattrociocchi, W., Zignani, M., Zollo, F.: News consumption
during the Italian referendum: a cross-platform analysis on Facebook and Twitter. In: Pro-
ceedings - 2017 International Conference on Data Science and Advanced Analytics (DSAA
2017), Tokyo, pp. 648–657 (2017)
28. Bakshy, E., Messing, S., Adamic, L.A.: Exposure to ideologically diverse news and opinion
on Facebook. Science 348(6239), 1130–1132 (2015)
29. Guimerà, R., Amaral, L.A.N.: Functional cartography of complex metabolic networks. Nature
433(7028), 895–900 (2005)
30. Alexa Internet: Keyword Research, Competitor Analysis RESEARCH & Website Ranking.
https://www.alexa.com. Accessed 10 Aug 2019
Influence of Retweeting on the Behaviors
of Social Networking Service Users
1 Introduction
In recent years, many social media platforms, including Twitter, Facebook,
and Instagram, have drawn significant attention from people around the world.
Countless people constantly use these platforms to submit various types of
information, such as texts, images, and videos, for different purposes including
personal/group communication, business [6], education [4], and political matters [5].
This collection of information has become a valuable resource/asset shared by
social media users. To further grow these assets, we must determine the factors
that motivate and help people to provide information, as articles and
associated comments must be continuously updated by users.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 943, pp. 671–682, 2021.
https://doi.org/10.1007/978-3-030-65347-7_56
This issue has been studied from different viewpoints, including social
psychology [3,7,21], social network analysis [23], and agent-based simulation with
evolutionary game theory [8,9,12,15]. Zhao and Rosson [23], for example, reported the
potential impacts of micro-blogging sites such as Twitter. They also attempted
to understand why people use micro-blogs as an informal communication tool and
to characterize user behavior. Toriumi et al. [15] modeled the activities of
social media by modifying the public goods game [1], treating posted articles
as shared resources. However, because people incur some costs and
responsibilities by posting articles, they may become lurkers, who just read
articles without posting any. Toriumi et al. introduced the following: 1) rewards,
which correspond to writing comments on posted articles, 2) cooperation, which
corresponds to posting new articles, and 3) meta-rewards, which correspond to a
comment made on an existing comment; they showed that meta-rewards enhance
cooperation [15]. However, the effect of retweeting on the activities of social media
users has not been studied thus far, although it is evident that retweeting promotes
information dissemination and thus increases the motivation for cooperation,
i.e., increases the number of posting activities.
Retweeting, a mechanism implemented by a few social media platforms,
enables users to read the articles posted by strangers (who are within a user’s
social network connections) and present their opinions as a reply to the arti-
cle writers. Consequently, the number of potential readers/commenters of the
posted articles may significantly increase. Additionally, we think that retweet-
ing enhances the importance of micro-blogging/tweeting while incurring only a
small cost. Thus, investigating the influence of retweeting can help understand
the conditions required for ensuring the lasting impact of social media.
Therefore, we extend a reward game model by introducing the retweeting
mechanism, to clarify the effect of retweeting on user behavior. Subsequently,
we experimentally analyze the effect of retweeting on user communication by
using variable parameters, which restrict the spread of retweeted information.
For the analysis, we perform an agent-based simulation using a genetic algorithm
on networks based on a complete graph and connecting nearest neighbor (CNN)
model [19]. Our results indicate that the probability that a user will retweet an
article is moderate, and this enhances cooperative activities (i.e., posting arti-
cles) among users, although a reward game without the retweeting mechanism
cannot maintain cooperation because of the lack of meta-rewards. Additionally,
we observed that agents tend to comment more on dense networks.
2 Related Works
Several studies have been conducted to clarify the role of retweeting; however,
most studies aimed to analyze the contents of retweeting or understand the
behaviors of users from a psychological viewpoint. Boyd et al. [2] investigated
retweeting from a conversational viewpoint and systematically analyzed the syn-
tax of retweets and tried to understand why, how, and what Twitter users
retweet. Suh et al. [13] also performed a mathematical analysis to clarify the
factors associated with the retweet rate and built a predictive retweet model for
further analysis. They found that URLs, hash tags, and the numbers of follow-
ers and followees mostly affect retweetability, which is the number of times a
tweet can be retweeted. Yang et al. [20] proposed a factor graph model to study
the underlying mechanism of the retweeting behaviors of users. Zhang et al. [22]
defined the notion of social influence locality and predicted the retweeting behav-
iors of users by training a logistic regression classifier. Ten Thij et al. [14] designed
a mathematical model to describe the evolution of a retweet graph. Other stud-
ies attempted to identify the typical behaviors of users who frequently retweet.
By focusing on a specific user whose profile was known, Luo et al. [10] aimed
to investigate the type of followers who tend to frequently retweet the tweets
of that specific user. However, these studies focused on the factors that influ-
ence retweeting behaviors, and few studies investigated the manner in which the
presence of the retweeting mechanism influences the behaviors of users toward
posting new articles and commenting on them.
In addition to Toriumi et al. [15], several studies further investigated models
that extend the public goods game mentioned in Sect. 1, to clarify
the conditions for ensuring the lasting impact of social media. Hirahara et al. [8,9]
extended this model by adding low-cost feedback such as the “Like!” button and
“read marks” feature. Through an agent-based simulation executed on complex
networks generated using Facebook data, they showed that such feedback
considerably facilitates cooperation even without a meta-reward mechanism.
Osaka et al. [12] extended the model by introducing direct reciprocity, and they
studied the effect of direct reciprocity and network structure on the continuing
prosperity of social networking services. Toriumi et al. [16] further explored what
types of incentive systems of rewards and punishments promote and maintain
effective cooperation in actual groupware. Toriumi et al. [17,18] updated the
meta-reward model to identify realistic situations in which cooperation can be
achieved in consumer-generated media and analyzed the effects of
information behaviors. However, these studies did not focus on how the existence of
the retweeting mechanism influences user strategies. Therefore, we propose an
evolutionary model based on a reward game to explore how the willingness of users
to post articles and comment would vary in social media depending
on the presence of a retweeting mechanism.
3 Proposed Model
3.1 Reward Game with Retweeting
To model user behaviors, including retweeting, observed in social media, we
propose a retweet (RT) reward game, which is an evolutionary game based on
networked agents. The game is an extension of the reward game proposed by
Toriumi et al. [15]. The game is extended by adding several rounds of retweeting
for each article posted.
Intuitively, retweeting is the behavior of re-posting or forwarding the tweets
of a person to her/his followers. Posting an article is often called cooperating,
\[ N_i = \{\, j \in A \mid e_{ij} \in E \,\}, \]
where e_ij denotes the edge between agents i and j. Agent i has three learning
parameters that decide his/her behavior: cooperation rate B_i, comment rate L_i,
and retweeting rate RT_i; the values of these parameters describe the probabilities
of the corresponding behavior.
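As a minimal sketch of these definitions, the neighbor set of each agent and its three behavior rates could be held as follows. The names `Agent` and `neighbor_sets` are hypothetical, and the edges are assumed to be undirected, which the text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    B: float   # cooperation (posting) rate B_i
    L: float   # comment rate L_i
    RT: float  # retweeting rate RT_i


def neighbor_sets(agent_ids, edges):
    """Return {i: N_i}, where N_i = {j in A | e_ij in E}.

    Assumption: each pair (i, j) in `edges` is an undirected edge.
    """
    nbrs = {i: set() for i in agent_ids}
    for i, j in edges:
        nbrs[i].add(j)
        nbrs[j].add(i)
    return nbrs


# Illustrative values only; they are not taken from the paper.
agents = {0: Agent(0.5, 0.2, 0.1), 1: Agent(0.3, 0.4, 0.0), 2: Agent(0.9, 0.1, 0.6)}
edges = [(0, 1), (1, 2)]
N = neighbor_sets(agents, edges)
```

Here `N[1]` contains agents 0 and 2, whereas `N[0]` contains only agent 1.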
The RT reward game proceeds as follows (see Fig. 1). Parameter S_i^t (0 ≤
S_i^t ≤ 1), which is called the seeing probability, indicates how interesting the
article of agent i is at time t and is randomly decided by the game environment.
For agent i, if S_i^t < B_i, then agent i cooperates (i.e., i posts an article or tweets
with probability B_i). If agent i cooperates, all the agents in N_i receive a positive
reward M, and agent i receives a negative reward F (i.e., a cost) for posting the
article. Agent j ∈ N_i views the article of agent i with probability S_i^t. If agent j
views the article of agent i, then agent j comments on the article
with probability L_j. If agent j comments, then agent j receives a negative reward
C as the cost of writing a comment, and agent i receives a positive reward R.
Whenever agent j views the article of agent i, agent j may also retweet it
with probability RT_j. If agent j retweets the article of agent i, agent
j receives a negative reward 0.5C, and agent i receives a positive reward 0.5R.
Every agent k ∈ N_j then has a chance to see the article of agent i with
probability S_i^t. If agent k views the retweeted article
of agent i and has not commented on it before, agent k can comment on the
Table 1. Parameters
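The posting, commenting, and retweeting rules stated above can be sketched as a single function. This is only an illustration under assumptions: the default payoffs F, M, C, and R are placeholders rather than the values of Table 1, only one retweet hop is modeled, and a comment on a retweeted article is assumed to carry the same cost and reward as a direct comment. The name `play_post` and the dictionary layout are hypothetical.

```python
import random


def play_post(i, rates, neighbors, fitness, rng, F=3.0, M=1.0, C=2.0, R=9.0):
    """One posting opportunity for agent i, followed by one retweet hop.

    rates[a] is the tuple (B, L, RT) of agent a. F, M, C, R are placeholder
    payoffs (assumptions). Returns True if agent i posted an article.
    """
    s = rng.random()                 # seeing probability S_i^t
    if not s < rates[i][0]:          # agent i posts only if S_i^t < B_i
        return False
    fitness[i] -= F                  # cost of posting
    commented, retweeters = set(), []
    for j in sorted(neighbors[i]):
        fitness[j] += M              # every neighbor benefits from the article
        if rng.random() >= s:        # j views the article with probability S_i^t
            continue
        _, L_j, RT_j = rates[j]
        if rng.random() < L_j:       # comment: j pays C, i gains R
            fitness[j] -= C
            fitness[i] += R
            commented.add(j)
        if rng.random() < RT_j:      # retweet: j pays 0.5C, i gains 0.5R
            fitness[j] -= 0.5 * C
            fitness[i] += 0.5 * R
            retweeters.append(j)
    for j in retweeters:
        for k in sorted(neighbors[j]):
            if k == i or k in commented:
                continue             # the poster and earlier commenters are skipped
            if rng.random() < s and rng.random() < rates[k][1]:
                # Assumption: same payoffs as a direct comment.
                fitness[k] -= C
                fitness[i] += R
                commented.add(k)
    return True
```

With L and RT set to zero for every agent, a post changes only the poster's fitness by -F and each neighbor's fitness by +M, which is a convenient sanity check of the bookkeeping.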
where v_j denotes the fitness value of agent j, and v_min denotes the minimum
fitness value among those of the agents in N_i ∪ {i}. Agents with high fitness
values are likely to be selected as parents by their neighboring agents.
After choosing two parent agents, agent i generates a new gene through
uniform crossover, which implies that each bit of the new gene is chosen from
either of the two parent agents with equal probability. After building the new
9-bit gene in the crossover process, each bit is inverted with the probability of 0.01.
This probability is called the mutation rate. Subsequently, the gene obtained is
used as the gene for agent i in the next generation.
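The crossover and mutation steps can be sketched as follows. The 9-bit length and the 0.01 mutation rate come from the text; the function names are hypothetical, and the decoding of the gene into three 3-bit fields (one each for B, L, and RT) is an assumption made only for illustration. Parent selection is left out because its formula is not available here.

```python
import random

GENE_BITS = 9
MUTATION_RATE = 0.01


def uniform_crossover(parent_a, parent_b, rng):
    """Pick each bit from either parent with equal probability."""
    return [a if rng.random() < 0.5 else b for a, b in zip(parent_a, parent_b)]


def mutate(gene, rng, rate=MUTATION_RATE):
    """Invert each bit independently with probability `rate`."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in gene]


def decode(gene):
    """Assumption: three 3-bit fields mapped to rates in [0, 1] for B, L, RT."""
    def field(bits):
        return int("".join(str(b) for b in bits), 2) / 7
    return field(gene[0:3]), field(gene[3:6]), field(gene[6:9])


rng = random.Random(1)
child = mutate(uniform_crossover([0] * GENE_BITS, [1] * GENE_BITS, rng), rng)
B, L, RT = decode(child)
```

Passing a `random.Random` instance explicitly keeps runs reproducible for a given seed.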
4 Experiment
The objective of this experiment was to investigate the dominant strategy that
was common among users. The strategy is specified by B_i, L_i, and RT_i when
a retweeting mechanism is introduced in an SNS. In other words, we investigated
the most beneficial behaviors when the neighbors also have the same or similar
strategies to those of the poster. This dominant strategy also suggests the manner
in which the willingness of users toward posting articles and commenting would
be promoted or diminished by the retweeting mechanism. These behaviors can be
analyzed in terms of the posting rate (or cooperating rate) B, comment rate
(or rewarding rate) L, and retweeting rate RT, by comparing with a strategy
without the retweeting mechanism. Notably, B, L, and RT denote the average
parameter values; for example, B is defined as Σ_{i∈A} B_i / |A|.

Table 2. Characteristics of the CNN networks.
Parameter u          0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9
Average degree       2.20  2.44  2.77  3.19  3.84  4.78  6.40  9.68  17.6
Cluster coefficient  0.39  0.41  0.45  0.44  0.56  0.60  0.68  0.80  0.87
We conducted our experiments using a complete graph and CNN networks,
which are based on the CNN model [19], as they are often used in this type of
experiment. The number of agents in the complete graph and CNN networks
was 20 and 1000, respectively. The characteristics of the CNN networks are
presented in Table 2. When the CNN networks were generated, we varied the
probability of changing a potential edge to a real edge, u, from 0.1 to 0.9.
The other parameter values of the RT reward game are listed in Table 1.
The values are set on the basis of previous studies [1,15]. All the results in this
study are the average values of ten independent runs with different random seeds;
however, the results of the complete graph-based experiments are the average
values of 100 independent runs.
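For readers unfamiliar with the CNN model, the following is a minimal sketch of one common formulation of the growth rule, and not necessarily the exact generator used in these experiments. With probability u, a randomly chosen potential edge becomes a real edge; otherwise, a new node attaches to a random existing node and gains potential edges to that node's neighbors. The function names are hypothetical.

```python
import random


def cnn_network(n, u, seed=None):
    """Grow an undirected network with n nodes; returns {node: set of neighbors}."""
    rng = random.Random(seed)
    adj = {0: set()}
    potential = []                      # potential edges (w, v); never duplicated
    while len(adj) < n:
        if rng.random() < u:
            if potential:               # convert one random potential edge
                idx = rng.randrange(len(potential))
                potential[idx], potential[-1] = potential[-1], potential[idx]
                a, b = potential.pop()
                adj[a].add(b)
                adj[b].add(a)
        else:
            v = len(adj)                # id of the new node
            target = rng.randrange(v)   # existing node chosen uniformly at random
            for w in adj[target]:       # neighbors of the target become potential
                potential.append((w, v))
            adj[v] = {target}
            adj[target].add(v)
    return adj


def average_degree(adj):
    """Mean number of neighbors per node."""
    return sum(len(nb) for nb in adj.values()) / len(adj)


g = cnn_network(1000, 0.5, seed=0)
```

As a rough check, each step adds one edge, and a node is added only with probability 1 − u, so the average degree should approach about 2/(1 − u) as long as potential edges are available. This is broadly in line with the values in Table 2 for small and moderate u.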
We chose the reward game and RT reward game to investigate the effect of
retweeting on user behaviors. We did not choose the (RT) meta-reward game
because a meta-reward game, in which all the agents have chances
to meta-reward, is known to maintain high cooperating and comment rates because of
the effect of posting a comment on another comment (i.e., meta-reward); this
would make it difficult to isolate the effect of retweeting on the behaviors
of users. In contrast, agents that play the reward game on CNN networks and
complete graphs have low activity, and thus, changes are easily observed. First,
we ran the reward game and RT reward game on a complete graph.
The average cooperating and comment rates versus time are plotted in Fig. 3.
In the reward game, although the average cooperating rate of all the genera-
tions was 0.1527 (fairly low), it increased to 0.9060 after introducing retweeting
(i.e., the RT reward game). Simultaneously, there was a minor increase in the
average comment rate, from 0.0287 to 0.0841. The increase in the comment
rate indicates that although the commenting activities were not substantially
affected, the retweeting mechanism considerably activated cooperation, i.e., the
posting/tweeting behavior. To estimate the extent to which the cooperating rate
was changed, we defined the increasing ratio of B as follows:
\[ \mathit{inc} = \frac{B_{rt} - B_{normal}}{B_{normal}} \times 100\%, \qquad (2) \]
where B_normal and B_rt denote the average values of cooperating rates in the
reward game and RT reward game, respectively. The increasing ratio was 4.93
(i.e., 493% when expressed as a percentage) when the network was a complete graph.
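The reported value can be checked directly from the two average cooperating rates given for the complete graph (0.1527 and 0.9060). The function name below is hypothetical.

```python
def increasing_ratio(b_rt, b_normal):
    """Relative increase of the average cooperating rate, in percent."""
    return (b_rt - b_normal) / b_normal * 100.0


# Average cooperating rates on the complete graph, as reported in the text.
inc = increasing_ratio(0.9060, 0.1527)
print(f"{inc:.1f}%")  # about 493%, i.e., a ratio of about 4.93
```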
Fig. 4. Cooperating and comment rates of the reward game on CNN networks.
Fig. 5. Cooperating and comment rates of the RT reward game on CNN networks.
Table 3. List of cooperating rates, comment rates, retweeting rates, and increasing
ratios on various CNN networks.
4.4 Discussion
In the case of complete graphs, retweeting enables an agent who missed
an article when it was originally posted to read it. Furthermore, for agents
who have previously read an article but have not acted on it, retweeting
could prompt them to reread the article and reconsider whether to act.
Whenever an agent retweets an article, the neighboring agents learn
that other agents are interested in the article and get a new chance to read
it. Thus, they can perform activities such as commenting or retweeting.
Therefore, retweeting in complete graphs considerably increases the chance that
an article is read and commented on. In the case of CNN networks, retweeting
may also help activate some friends of the article poster, provided that the
retweeters and posters have mutual friends or that their friends are friends.
Simultaneously, retweeting increases the number of potential readers by allowing
strangers to read and act on an article. All these effects make article
posters highly likely to receive comments, thereby making article posting
significantly more rewarding. In CNN networks, the cooperating rate seems to increase
the most near u = 0.7. We suppose this is because the posting rate of the reward game
falls from u = 0.1 to u = 0.7 and rises from u = 0.7 to u = 0.9.
However, the comment rate also increases in some networks, which would
imply that agents become willing to comment on the articles of others. Those who
comment more should lose more fitness value. After retweeting is implemented,
the posting rate increases with a subsequent increase in the chance of reading;
therefore, agents will have more chances to choose whether to reward others, and
thus, commenters who are less active may also benefit. The results demonstrate
that networks with a high u are dense and that agents in them have increased
comment rates after the implementation of retweeting. A complete graph is an extreme
case of a dense network, and the cooperating rate in it is fairly high.
5 Conclusion
We investigated the effect of retweeting on social media users. We extended
the reward game by introducing the retweeting mechanism. In the new model,
each article undergoes two rounds of retweeting. The new retweeting mechanism
allows users to read the articles of strangers, thereby increasing the number
of potential readers for article posters. We analyzed the manner in which the
posting and comment rates of the agents would change upon the implementation
of retweeting. We found that retweeting motivates agents to post new articles.
In CNN networks, the cooperating rate seems to increase the most for u values
near 0.7.
In the future, we plan to implement meta-rewards in our model, run an agent-
based simulation on real networks like Facebook, and apply the multiple world
genetic algorithm [11] to analyze the diversity of agent strategies.
References
1. Axelrod, R.: An evolutionary approach to norms. Am. Polit. Sci. Rev. 80, 1095–
1111 (1986)
2. Boyd, D., Golder, S., Lotan, G.: Tweet, tweet, retweet: conversational aspects of
retweeting on Twitter. In: 2010 43rd Hawaii International Conference on System
Sciences, pp. 1–10. IEEE (2010)
3. Burke, M., Marlow, C., Lento, T.: Social network activity and social well-being. In:
Proceedings of the SIGCHI Conference on Human Factors in Computing Systems,
pp. 1909–1912 (2010)
4. Chugh, R., Ruhi, U.: Social media in higher education: a literature review of Face-
book. Educ. Inf. Technol. 23(2), 605–616 (2018)
5. Conway, B.A., Kenski, K., Wang, D.: The rise of Twitter in the political cam-
paign: searching for intermedia agenda-setting effects in the presidential primary.
J. Comput.-Mediated Commun. 20(4), 363–380 (2015)
6. Culnan, M.J., McHugh, P.J., Zubillaga, J.I.: How large U.S. companies can use
Twitter and other social media to gain business value. MIS Q. Executive 9(4) (2010)
7. Ellison, N., Steinfield, C., Lampe, C.: Spatially bounded online social networks and
social capital. Int. Commun. Assoc. 36(1-37) (2006)
8. Hirahara, Y., Toriumi, F., Sugawara, T.: Evolution of cooperation in SNS-norms
game on complex networks and real social networks. In: International Conference
on Social Informatics, pp. 112–120. Springer, Heidelberg (2014)
9. Hirahara, Y., Toriumi, F., Sugawara, T.: Cooperation-dominant situations in SNS-
norms game on complex and Facebook networks. New Gen. Comput. 34(3), 273–
290 (2016)
10. Luo, Z., Osborne, M., Tang, J., Wang, T.: Who will retweet me? Finding retweeters
in Twitter. In: Proceedings of the 36th International ACM SIGIR Conference on
Research and Development in Information Retrieval, pp. 869–872 (2013)
11. Miura, Y., Toriumi, F., Sugawara, T.: Multiple-world genetic algorithm to identify
locally reasonable behaviors in complex social networks. In: 2019 IEEE Interna-
tional Conference on Systems, Man and Cybernetics, pp. 3665–3672 (2019)
12. Osaka, K., Toriumi, F., Sugawara, T.: Effect of direct reciprocity and network
structure on continuing prosperity of social networking services. Comput. Soc.
Netw. 4(1), 2 (2017)
13. Suh, B., Hong, L., Pirolli, P., Chi, E.H.: Want to be retweeted? Large scale ana-
lytics on factors impacting retweet in Twitter network. In: 2010 IEEE Second
International Conference on Social Computing, pp. 177–184. IEEE (2010)
14. Ten Thij, M., Ouboter, T., Worm, D., Litvak, N., van den Berg, H., Bhulai, S.:
Modelling of trends in Twitter using retweet graph dynamics. In: International
Workshop on Algorithms and Models for the Web-Graph, pp. 132–147. Springer,
Heidelberg (2014)
15. Toriumi, F., Yamamoto, H., Okada, I.: Why do people use social media? Agent-
based simulation and population dynamics analysis of the evolution of coopera-
tion in social media. In: 2012 IEEE/WIC/ACM International Conferences on Web
Intelligence and Intelligent Agent Technology, vol. 2, pp. 43–50. IEEE (2012)
16. Toriumi, F., Yamamoto, H., Okada, I.: Exploring an effective incentive system on
a groupware. J. Artif. Soc. Soc. Simul. 19(4), 6 (2016).
https://doi.org/10.18564/jasss.3166
17. Toriumi, F., Yamamoto, H., Okada, I.: A belief in rewards accelerates cooperation
on consumer-generated media. J. Comput. Soc. Sci. 3, 19–31 (2019)
18. Toriumi, F., Yamamoto, H., Okada, I.: Rewards visualization system promotes
information provision. In: Annual Conference of the Japanese Society for Artificial
Intelligence, pp. 55–65. Springer, Heidelberg (2019)
19. Vázquez, A.: Growing network with local rules: preferential attachment, clustering
hierarchy, and degree correlations. Phys. Rev. E 67(5), 056104 (2003)
20. Yang, Z., Guo, J., Cai, K., Tang, J., Li, J., Zhang, L., Su, Z.: Understanding
retweeting behaviors in social networks. In: Proceedings of the 19th ACM Inter-
national Conference on Information and Knowledge Management, pp. 1633–1636
(2010)
21. Yu, A.Y., Tian, S.W., Vogel, D., Kwok, R.C.W.: Can learning be virtually boosted?
An investigation of online social networking impacts. Comput. Educ. 55(4), 1494–
1503 (2010)
22. Zhang, J., Liu, B., Tang, J., Chen, T., Li, J.: Social influence locality for modeling
retweeting behaviors. IJCAI 13, 2761–2767 (2013)
23. Zhao, D., Rosson, M.B.: How and why people Twitter: the role that micro-blogging
plays in informal communication at work. In: Proceedings of the ACM 2009 Inter-
national Conference on Supporting Group Work, pp. 243–252 (2009)