Thesis by
Uchenna Akujuobi
Master of Science
(March, 2016)
The thesis of Uchenna Akujuobi is approved by the examination committee
2016
Copyright ©2016
Uchenna Akujuobi
Abstract
Link label prediction is the problem of predicting the missing labels or signs of
all the unlabeled edges in a network. For signed networks, these labels can either be
positive or negative. In recent years, different algorithms have been proposed, based, for example, on regression, trust propagation and matrix factorization. These approaches have tried to solve the problem of link label prediction by using ideas from social theories, where most of them predict a single missing label given that the labels of other edges are known. However, in most real-world social graphs, the number of labeled edges is usually much smaller than that of unlabeled edges. Therefore, predicting a single edge label at a time is impractical in such settings.
In this thesis, we look at the link label prediction problem on a signed citation network
with missing edge labels. Our citation network consists of papers from three major
machine learning and data mining conferences together with their references, and
edges showing the relationship between them. An edge in our network is labeled
either positive (dataset relevant), if the reference is based on the dataset used in the citing paper, or negative otherwise. We present three approaches for predicting the missing labels. The first approach converts the label prediction problem into a standard classification problem. We then generate a set of features for each edge and adopt Support Vector Machines in solving the classification problem. For the second approach, we formalize the graph such that the edges are represented as nodes with links showing similarities between them. We then adopt a label propagation method
to propagate the labels on known nodes to those with unknown labels. In the third approach, we rank the nodes with a modified PageRank algorithm based on the number of incoming positive and negative edges, after which we set a threshold. Based on the ranks, we can infer that an edge is positive if it points to a node above the threshold. Experimental results on our citation network corroborate the efficacy
of these approaches.
With each edge having a label, we also performed additional network analysis where we extracted a subnetwork of the dataset relevant edges and nodes in our citation network, and then detected different communities from this extracted subnetwork. We then performed a study on several dataset communities. The study shows a relationship between the major topic areas in a dataset community and the data sources in the community.
ACKNOWLEDGEMENTS
I would like to thank God for his grace leading me thus far. I would like to acknowledge my parents, Chief and Dr Njoku, whose support and prayers have kept me going. I would also like to acknowledge my supervisor, Prof. Xiangliang Zhang, whose constant advice and corrections have kept me on the right path. Also, I acknowledge King Abdullah University of Science and Technology for granting me this study opportunity. I would also like to thank the Computer Science department professors who taught me one course or the other, bestowing knowledge upon me, without which I wouldn't have come this far.
TABLE OF CONTENTS
Copyright 3
Acknowledgements 6
List of Abbreviations 9
List of Symbols 10
List of Figures 11
List of Tables 13
List of Algorithms 14
1 Introduction 15
1.1 Signed Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.2 Link Label Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.3 Problem Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2 Prior Work 22
2.1 Structural Balance Theory . . . . . . . . . . . . . . . . . . . . . . . 24
2.2 Social Status Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3 Page Rank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4 Label propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.5 Link Prediction Using Propagation . . . . . . . . . . . . . . . . . . 29
3 Methodology 31
4 Experimental Results 52
4.1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2 Dataset description . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.4 Community Network Analysis . . . . . . . . . . . . . . . . . . . . . . 61
References 74
LIST OF ABBREVIATIONS
LIST OF SYMBOLS
LIST OF FIGURES
3.1 Example of a link label prediction problem with one unknown label. . 35
3.2 Example of a linear separating hyperplane of a separable problem in a
2D space. Support vectors (circled) show the margin of the maximal
separation between the two classes. . . . . . . . . . . . . . . . . . . . 36
3.3 A non-linearly separable hyperplane in a 2D space mapped into a new
feature space where data are linearly separable. . . . . . . . . . . . . 40
3.4 A sample graph G. (a) Original graph G given as input to the algorithm, (b) Resulting graph G* produced as output of the algorithm . . 42
3.5 PageRank score. (a) Resulting rank scores of Normal PageRank algo-
rithm, (b) Resulting rank scores of our method, the red line shows the
threshold . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.6 A Directed graph representing a small web of six web pages. . . . . . 50
4.1 The number of collected papers for each of the three conferences from
2001 to 2014 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2 Distribution of the number of citations . . . . . . . . . . . . . . . . . 54
4.3 Full citation graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.4 Performance Results of the evaluated approaches . . . . . . . . . . . 59
4.5 ROC curve of the different approaches. . . . . . . . . . . . . . . . . . 60
4.6 Result of the evaluation on unseen papers . . . . . . . . . . . . . . . 60
4.7 Topic Cloud of the four largest communities . . . . . . . . . . . . . . 64
4.8 Topic Cloud of the two selected communities . . . . . . . . . . . . . . 65
4.9 The largest community. (a) UCI dataset repository - C.L. Blake and
C.J. Merz, (b) UCI KDD archive - S.D. Bay (c) WebACE: A Web
Agent for Document Categorization and Exploration - Han et al . . . 67
4.10 The second largest community. (a) UCI dataset repository - A. Asun-
cion and D. Newman . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.11 The third largest community. (a) RCV1: A New Benchmark Collec-
tion for Text Categorization Research - Lewis et al. (b) NewsWeeder:
Learning to Filter Netnews - Lang et al. (c) Gradient-based Learning
Applied to Document Recognition - Lecun et al. (d) Molecular Clas-
sification of Cancer: Class Discovery and Class Prediction by Gene
Expression Monitoring - Golub et al . . . . . . . . . . . . . . . . . . . 68
4.12 The fourth community. (a) GroupLens: An Open Architecture for
Collaborative Filtering of Netnews - Resnick et al. (b) Cascading Be-
havior in Large Blog Graphs - Leskovec et al. (c) Grouplens: Applying
Collaborative Filtering to Usenet News - Konstan et al. . . . . . . . . 69
4.13 The fifth community. (a) The UCR time Series Data Mining Archive
- Keogh and Folias (b) Emergence of Scaling in Random Networks -
Barabasi et al. (c) The UCR Time Series Classification/Clustering
Homepage - Keogh et al. (d) Exact Discovery of Time Series Motifs -
Abdullah et al. (e) www.cs.ucr.edu - Keogh . . . . . . . . . . . . . . 70
LIST OF TABLES
LIST OF ALGORITHMS
Chapter 1
Introduction
The emergence and rapid growth of Social Network Sites (SNSs) such as Twitter,
LinkedIn, eBay and Epinions have greatly increased the influence of the internet in
everyday life. People now depend more on the web for information, social inter-
actions, decision making, event planning, sales, etc. Due to this rapidly increasing amount of interaction, social network analysis has been attracting a lot of attention in recent years. A considerable amount of these analyses has been on mining and analyzing interesting user behaviors with the aim of enhancing user experience. Citation networks, in particular, are used to visualize and analyze papers or researchers and can be utilized in several ways, e.g., to find the relationships between different papers, rank conferences, rank papers and datasets, find and monitor communities and the prevalent topics within them, etc. [1] [2] [3]. Throughout this thesis, we will use the terms edge and link interchangeably. We will also use the terms sign and label interchangeably.
1.1 Signed Network

Online social networks have traditionally been represented as graphs with positively
weighted edges representing interactions amongst entities and the nodes representing
the entities. However, this representation is inadequate since there are also negative
effects at work in most network settings. For instance, in online social networks
like Slashdot and Epinions, users often tag others as friends or foes, give ratings to
other users or items [4]; and on Wikipedia, users can vote against or in favour of the
nomination of other users to be admins [5]. In a binary signed network, edges are given
positive or negative labels showing the relationship between two nodes. However, the
complexity and nature of many graph problems change once negatively labeled edges
are introduced. For example, the shortest-path problem in the presence of cycles
with negative edges is known to be NP-hard [6]. Several studies have been conducted
on binary signed networks with di↵erent algorithms proposed using methods such as
structural balance theory, social status theory, matrix factorization, trust and distrust
In our directed citation network, the direction of the edges is from a paper to its
reference (see Figure 1.1). We define its sign to be negative or positive based on its
dataset relevancy (i.e., whether the citation is based on the dataset used in the paper).
The underlying question is then: Does the pattern of edge signs in the local vicinity
of a given edge affect the sign of the edge? The local vicinity of an edge comprises the other edges having the same target node as the edge. Knowing the answer to this question would provide insight into the design of computing models where we try to infer the unobserved link relationship between two papers using the negative and positive edges in its local vicinity.
Figure 1.1: Relationship between papers in a signed citation network. The local
vicinity of each edge from Paper C comprises an edge from paper D
1.2 Link Label Prediction

Considering a directed graph, the label of a link can be defined to be positive (representing a positive demeanor of its originator towards the receiver) or negative (representing a negative demeanor of its originator towards the receiver). In an undirected network, the sign can denote similarities between a pair of nodes based on properties of the nodes.
Studies on social networks based on social psychology [9] [10] [11] [12] [4] have
shown that future unknown interactions and perceptions of an entity towards another can be inferred from the known interactions and perceptions of related entities. For instance, if user A is known not to trust user B, and user B is known to
trust user C, it is likely that user A would not trust user C also. Considering online
shops such as eBay and Amazon, for instance, if everyone that purchased item A also
purchased item B, one can infer that if user X buys item A, then user X is also likely
to buy item B. Therefore, understanding the latent tension between the positive and
negative forces is very important in order to solve some networking problems like link label prediction. A natural first step is to examine the relationship between negative and positive links in a network.
For this, we need to know the theories of signed networks, which will enable us to
reason and understand how different configurations of positive and negative edges
provide information for the explanation of the various interactions in the network [4].
Two theories in particular have been used for positive and negative relationship prediction on social networks: social status theory [12] [13] and structural balance theory. Structural balance theory has its origin in social psychology. This theory considers the possible ways a triad (a triangle on three entities)
can be signed. The main idea of this theory is that triads with one or three positive
signs (two friends with one enemy or three mutual friends) are balanced and thus,
more plausible than triads with none or two positive signs (three mutual enemies, or a person with two friends who are enemies of each other).
Social status theory introduced in [12] takes into account the direction and sign
of edges and posits that a negative directed link implies that the initiator views the
recipient as having a lower status while a positively directed link implies that the
initiator views the recipient as having a higher status. Thus, an entity will connect to another entity with a positive sign only if it views it as having a higher
status. Note that although these theories work well and have been proved useful,
neither of them has been able to explain the situation where the two nodes connected by an edge are not embedded in any triad.
Generally, the edge sign prediction problem in most of these existing studies [11] [4] [12] [7] [8] can be formalized as follows: given a network with signs on all its edges except the edge from node u to node v, the aim is to predict the missing sign s(u, v) [4]. Our goal in this thesis, however, is to solve the link label prediction problem on our citation network. The link label prediction problem is defined as follows: given the information about the signs of certain links in a social network, which reveals the nature of relationships that exist among the nodes, we want to predict the sign, positive or negative, of the remaining links [16].
Agrawal et al. [16] introduced the link label prediction problem and proposed a matrix factorization based technique, MF-LiSP (Matrix Factorization for Link Sign Prediction), which exploits the global structure of the network (i.e., whether it is a balanced network, semi-balanced network, etc.). The method uses this information to complete the sign matrix in a way that preserves this global structure.
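To make the general idea concrete, the following is a minimal sketch of sign prediction via low-rank matrix factorization. This is not the exact MF-LiSP algorithm; the squared loss, rank, learning rate, and regularization here are illustrative assumptions:

```python
import numpy as np

def factorize_signs(A, observed, rank=5, lr=0.05, reg=0.1, epochs=200, seed=0):
    """Approximate the partially observed sign matrix A by U @ V.T and
    read predicted signs off the reconstruction (hypothetical parameters)."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    U = rng.normal(scale=0.1, size=(n, rank))
    V = rng.normal(scale=0.1, size=(m, rank))
    for _ in range(epochs):
        for i, j in observed:                  # SGD over observed entries only
            err = A[i, j] - U[i] @ V[j]        # squared loss on known signs
            U[i], V[j] = (U[i] + lr * (err * V[j] - reg * U[i]),
                          V[j] + lr * (err * U[i] - reg * V[j]))
    return np.sign(U @ V.T)                    # predicted signs for all pairs
```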
1.3 Problem Setting

Our problem setting is described as follows: given a citation graph G = (V, E), where the set of nodes V is a set of papers and data sources, the set of edges E is the set of relationships between pairs of nodes. Let Lu,v denote the type of relationship between a pair of nodes. We focus on two types of relationships: dataset (positive) and non-dataset (negative). An edge (u, v) is considered to be dataset related if the paper or data source represented by node v is cited by the paper represented by the node u based on the datasets used or available in v, and
vice-versa. Given the information about the relationship nature of certain links in a
citation network, we want to learn the nature of relationships that exist among the
nodes by predicting the sign, positive or negative, of the remaining links. To solve
the problem of link label prediction on our signed citation network, we present three
approaches which are based, respectively, on SVM, the PageRank algorithm, and a label propagation algorithm.
To study the link label prediction problem, a step-by-step research approach was
followed. First, we mined the dataset used here from the DBLP computer science repository. The collected papers are from the IEEE International Conference on Data Mining (ICDM), the SIAM International Conference on Data Mining (SDM), and the ACM SIGKDD Conference on Knowledge Discovery & Data Mining (KDD), ranging from 2001 to 2014. From the mined dataset, we extracted the dataset relevant references and constructed a citation graph with directed edges from papers to papers or data sources, showing that the paper is referencing the papers or data sources. If a reference is based on the dataset used in the citing paper, we assign a positive label to the edge, else we assign a negative label. Then the following tasks were accomplished:
• Approach the link label prediction problem as a binary classification problem by generating sets of features for each edge and adopting SVM to solve the classification problem.

• Approach the link label prediction problem with label propagation, formalizing the graph in such a way that the edges are transformed to labeled and unlabeled nodes in the converted graph. In the new graph, the edges show the similarities between the nodes. This similarity is based on the nodes the edges are linked to in the original graph. We then adopt a label propagation approach by propagating labels from known nodes to nodes with unknown labels.

• Approach the link label prediction problem with a modified PageRank algorithm, in which incoming positive edges add to the rank of a node while incoming negative edges reduce the rank of a node. Then we find a separating threshold that separates nodes whose incoming edges are mostly positive from the rest.

• Evaluate the proposed algorithms for the link label prediction problem on our citation network.
The rest of this thesis is organized as follows. In chapter 2, we review prior work on the algorithms and similar ideas used in this thesis. We also discuss previous works on link label prediction that try to predict links using ideas from label propagation. In chapter 3, we discuss the methodologies used in this thesis and how we approached the problem. We also discuss in detail how the algorithms used in each approach work. In chapter 4, we discuss the results obtained from the evaluation of these algorithms and show them side by side for comparison. We also discuss the network analysis on the dataset relevant subgraph. Finally, we conclude and propose future research directions.
Chapter 2
Prior Work
A number of papers have investigated the positive and negative relationships between members of signed networks using different approaches. Guha et al. [12] extended already existing works based on trust propagation [18] [19]. Their work introduced distrust propagation. The trust and distrust propagations are obtained by calculating a combined matrix made up of the co-citation, trust coupling, direct propagation and transpose trust matrices.
Based on the social status and structural balance theories, Leskovec et al. [7] extended the work of Guha et al. [12] by using a machine-learning framework. They evaluate which of a range of structural features gives more information for the prediction task. In their work, they define two classes of features for the machine learning approach. The first class is based on the signed degrees of the nodes and the second class is based on social structure principles using triads. Assume we are trying to predict the sign of the edge from u to v. For the first class of features, they construct seven degree features: the outgoing positive and negative edges from u, the incoming positive and negative edges to v, the total out-degree of u, the total in-degree of v, and the embeddedness of the edge. For the second class, they considered each triad involving the edge (u, v), consisting of a node w (see Figure 2.2), and encoded the information in a 16-dimensional vector specifying the number of triads of each type that (u, v) is involved in. Then, using a logistic regression classifier, they combined the information from the two classes of features.
Chiang et al. [8] extended the work of Leskovec et al. [7] by considering not just triads, but longer cycles. In their work, they ignored the edge directions and improved on the approach of [7] by using features derived from longer cycles [8]. However, Agrawal et al. [16] were the first to tackle simultaneous label prediction of multiple links. They cast the problem in a matrix completion setting, where the data is represented as a partially observed matrix [16]. In their work, they proposed a new technique, Matrix Factorization for Link Sign Prediction (MF-LiSP), that is based on matrix factorizations. Note, many approaches based upon
social status theory or structural balance theory have been developed to perform edge sign prediction in signed networks. However, they generally do not perform well in networks with few topological features (i.e., long-range cycles and triads) [15]. The most popular theories in the study of signed social networks are those of social status and structural balance. However, there are some methods used in the link prediction problem that can be extended to solve the link label prediction problem. Prior works based on the approaches used in this thesis are described in the sections below.
Figure 2.1: Structural balance: Each triangle must have 1 or 3 positive edges to be balanced. Figure 2.1 (a) and (c) are balanced, while Figure 2.1 (b) and (d) are unbalanced.
2.1 Structural Balance Theory

The formulation of the structural balance theory is based on social psychology [10]. Based on this theory, Cartwright and Harary [11] formally provided and proved the notion of structural balance for signed graphs. Their notion of structural balance has four basic intuitive explanations, each
representing a particular structure. They further claim that these four structures can
be divided into two groups, namely balanced and unbalanced (see Figure 2.1). The
intuitive explanations for the balanced structures are: A, B and C are all friends
(Figure 2.1(a)); B and C are friends with A as a mutual enemy (Figure 2.1(c)). The
intuitive explanations for the unbalanced structures are: A is friends with B and C,
but B and C are enemies (Figure 2.1(b)); A, B and C are mutual enemies (Figure
2.1(d)). In other words, the balanced triads follow the rules: the friend of my friend is my friend (Figure 2.1(a)) and the enemy of my enemy is my friend (Figure 2.1(c)), while the unbalanced triads violate these rules. Structural balance theory was initially intended for undirected networks but has been applied to directed networks by disregarding the edge direction [4].
Figure 2.2: 16 triads determined by three nodes and a fixed unsigned edge (from A
to B).
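The balance rule above can be stated compactly: a triad is balanced exactly when the product of its three edge signs is +1. A tiny check of this rule:

```python
# A triad is balanced iff it has one or three positive edges,
# i.e. the product of its three signs (+1/-1) equals +1.
def is_balanced(s_ab, s_bc, s_ca):
    return s_ab * s_bc * s_ca == 1

print(is_balanced(1, 1, 1))     # three mutual friends -> True (Figure 2.1(a))
print(is_balanced(1, -1, -1))   # two friends with a mutual enemy -> True (2.1(c))
print(is_balanced(1, 1, -1))    # friend of a friend is an enemy -> False (2.1(b))
print(is_balanced(-1, -1, -1))  # three mutual enemies -> False (2.1(d))
```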
2.2 Social Status Theory

Social status theory was motivated by the work of Guha et al. [12] on the edge sign prediction problem. Leskovec et al. [4] developed the social status theory using the structural balance theory and ideas from Guha et al. [12] to explain
the signed networks. In this theory, they consider the direction and sign of an edge.
Consider a link from A to B. If the link from A to B has a positive sign, it indicates
that A views B as having a higher status and otherwise, if the link has a negative
sign, it indicates that A views B as having a lower status. In this theory of status, A
will positively connect to B only if it views B as having a higher status, and negatively
otherwise. These relative levels of status can be propagated across the network [12].
Assuming all nodes in the network agree to the social status ordering and the sign of
the link from A to B is unknown, we could infer the sign from the context provided by the rest of the network.
2.3 Page Rank

The PageRank algorithm introduced by Brin and Page [20] shows the importance of a node in a network based on the importance of the nodes that link to it. Some works [21] [22] on link prediction adapted what is known as the rooted PageRank approach to their studies. In the rooted PageRank approach, the random walk assumption of the original PageRank is altered as follows: the similarity score between two vertices u and v can be measured as the stationary probability of v in a random walk that returns to u with a fixed probability at each step, together with its counterpart where the roles of u and v are reversed [23]. Given a diagonal degree matrix D and the adjacency matrix A, let N = D^{-1} A be the row-normalized adjacency matrix. The rooted PageRank matrix is then:

RPR = (1 − α)(I − αN)^{−1}        (2.1)
Nowell et al. [21] assigned a connection weight score(u, v) to pairs of nodes (u, v) based on an input graph G. From this score, they produced a ranked list of node pairs in decreasing order of score, where the score measures the similarity or proximity between the pair of nodes (u, v) based on the network topology. The random reset procedure in web page ranking was adopted. However, this work focused on the link prediction problem, which is different from our problem setting.
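A minimal sketch of the rooted PageRank computation in Eq. (2.1), assuming a dense adjacency matrix and α as the walk-continuation probability:

```python
import numpy as np

def rooted_pagerank(A, alpha=0.85):
    deg = A.sum(axis=1)
    deg[deg == 0] = 1.0                       # guard against dangling rows
    N = A / deg[:, None]                      # row-normalize: N = D^-1 A
    I = np.eye(A.shape[0])
    return (1 - alpha) * np.linalg.inv(I - alpha * N)  # Eq. (2.1)

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [0, 1, 0]], dtype=float)
print(rooted_pagerank(A)[0])  # proximity of every node to root node 0
```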
Another work proposed a method built upon Rooted-PageRank that captures the interaction dynamics of the underlying network structure. They define the problem as follows: given a time-interaction graph G_k = (V_k, E_k, W), where V_k are the nodes (in their case actors), T_k is the time slot in which there was an interaction between a set of nodes, and W is the edge weight vector, let score_{x,k}(v_i) denote the RPR score of node v_i when x is used as root on G_k. The resulting list of nodes is sorted in descending order, where n = |V_k| − 1, V_k \ {x} = {v_1, v_2, ..., v_n}, and score_{x,k}(v_i) ≥ score_{x,k}(v_{i+1}) for i = 1, ..., n − 1. The rankings obtained over the n time slots are then aggregated for each root node x and all remaining nodes v_i ∈ V, resulting in an aggregate score aggScore_x(v_i). Finally, just like in [21], the aggregate scores are sorted in descending order, and from the list of nodes L(x), the hierarchy graph G_H is inferred. Their aim is to recover the hierarchy of the underlying social network. Note that this is different from the objective of this thesis.
Though these methods use a similar idea of scoring and ranking nodes, there is no prior work on using PageRank to infer the labels of links in a signed network. We will use PageRank to rank the nodes and, based on their ranks, we will infer the labels of the edges.
2.4 Label propagation

Given a graph with labeled and unlabeled nodes, label propagation diffuses the labels recursively across graph edges from the labeled to the unlabeled nodes, until a stable result is reached.
This results in a final labeling of all the graph nodes. Because of its efficiency and
performance, it has been applied to several problems with modified algorithms be-
ing proposed [25] [24] [26] [27]. Several works [28] [29] applied label propagation to the community detection problem. Some other works [30] [31] [32] extended these works in various directions. Boldi et al. [33] applied label propagation in clus-
tering and compressing very large web graphs and social networks. They proposed an iterative algorithm that reorders nodes with the same labels close to each other at each iteration and yields a compression-friendly ordering. On a similar note, Ugander et al. [34] proposed an efficient algo-
rithm (Balanced Label Propagation) for partitioning massive graphs while greedily
maximizing edge locality. There are also several works [35] [36] [37] [38] on classification and prediction on social networks using label propagation. However, to the best of my knowledge, there is no previous method proposed for solving the link label prediction problem using label propagation.
2.5 Link Prediction Using Propagation

A similar idea of propagating labels on links was used in the algorithm proposed by Kashima et al. [39]. They used ideas from the label propagation method to predict links. Using ancillary information such as node similarities, they devised a node-information based link prediction method using a conjugate gradient method and vec-trick techniques to make the computations efficient. The method operates on pairs formed from two sets of nodes. The link propagation inference principle, which was derived by modifying the label propagation principle, states that two similar node pairs are likely to have the same link strength. The link strengths indicate how likely a link exists for a given node pair (x_i, y_j), and are set to some positive value if a link exists in the observed graph, some negative value if no link exists in the observed graph, or zero otherwise. Applying this principle leads to minimizing the objective:
J(F) = (λ/2) vec(F)^T L vec(F) + (1/2) ||vec(F ∗ G) − vec(F*)||_2^2        (2.3)

with

G_{i,j} = 1 if (i, j) ∈ E, and G_{i,j} = √µ otherwise
Here F is a second order tensor representing link strength (vec(F) is the vectorization of the matrix F), and F* represents the observed parts of the network. The first term of Eq. (2.3) indicates that two link strength values F_{i,j} and F_{l,m} for two pairs of edges (x_i, y_j) and (x_l, y_m) should be close to each other if the similarity between them is high. The second term is the loss function, which fits the predictions to their target values on the observed regions of the network. The second term also acts as a regularization term so as to prevent the predictions from being too far from zero, and for numerical stability (the ∗ is the Hadamard product). The λ > 0 and µ > 0 are regularization parameters which balance the two terms of Eq. (2.3). To minimize the objective, Eq. (2.3) is differentiated with respect to vec(F) [39]. This method can fill in missing parts of the network simultaneously.
Although the qualities of this method were better compared to those of other methods, it suffers from computational time and space constraints, thereby limiting its application. To address this issue, Raymond et al. [41] extended [39], proposing a fast and scalable algorithm for link propagation. The proposed algorithm utilizes matrix factorization to reduce the computational time and space required in solving the linear equations in the Link Propagation algorithm. Note, however, that this solves the link prediction problem, not the link label prediction problem addressed in this thesis.
Chapter 3
Methodology
In this section, we discuss the approaches used in solving our link label prediction
problem.
The link label prediction problem is defined as follows: given a citation graph G = (V, E), the set of nodes V is the set of papers or data sources, and the set of directed edges E is the set of relationships between pairs of nodes. An edge pointing from u to v indicates that v is cited in u. We let Lu,v denote the label of the edge (u, v) from node u to node v; the label is positive if the paper or data source represented by node v is cited by node u based on the dataset it uses, and negative otherwise. In this thesis, we will address the node relationships using their graphical terms (labels or signs). Given the labels of certain edges in the network, we want to learn the nature of relationships that exist among the nodes by predicting
the sign, positive or negative, of the remaining links. We assume that a link (u, v) is most likely to be a positive link if node v is known to have a higher volume of its incoming links to be positive; on the other hand, if node v is known to have a higher volume of its incoming links to be negative, then link (u, v) is most likely to be a negative link. This problem can thus be formulated as follows: given a citation graph G = (V, E), and a set of labels Lu,v ∈ {−1, 1} for some edges, we predict the missing labels for the other edges in the graph. We next introduce three approaches for solving this problem.
Here, we consider the link label prediction problem as a classification problem. For
each edge, we generate a set of features based on the network topology, structural
balance and status theory. Having obtained these features, we convert the link label
prediction problem to a binary classification problem: labeled edges are used for
training a classifier that will predict positive or negative relationship for unlabeled
edges. We adopt Support Vector Machines (SVM) in solving this binary classification problem. A key step in this approach is the selection of a good feature set. Making a poor feature set selection might have negative effects on the results. We generate features for each edge in two categories. The first category is based on the signed degrees of the nodes and the second category is based on the common neighbors of the two nodes:
1. d+in(v) and d−in(v), which denote the number of incoming positive and negative edges to v

2. d+out(u) and d−out(u), which denote the number of outgoing positive and negative edges from u

3. The total degree of u and the total degree of v, which are d(u) and d(v) respectively

4. C(u, v), which denotes the embeddedness of the edge. We describe the embeddedness in an undirected sense, that is, the number of nodes w such that w is linked by an edge in either direction to both u and v

5. H(u, v), which infers the sign of the edge based on social status theory

6. T(v), which is the number of dataset related words contained in the paper title or the name of the data source represented by v
The status heuristic σ(x) gives node x status benefits for each positive link it receives and each positive link it generates, and a status reduction for each negative link it receives. We then predict a positive sign +1 for (u, v) if σ(u) < σ(v), a negative sign −1 if σ(u) > σ(v), and 0 otherwise. The list of dataset related words was compiled manually.
For the second class, we consider each node w such that w has an edge to or from u and also an edge to v. Here, we used the idea that if nodes similar to node u link to node v with a positive sign, then u is most likely to link to v with a positive sign, or with a negative sign if nodes similar to node u link to v with a negative sign.
To further explain the feature setup, a small network with one unknown label is shown in Figure 3.1. In this small network, the features for edge (u, v) in the first category are:

1. d+in(v) = 1, d−in(v) = 4

2. d+out(u) = 0, d−out(u) = 1

4. C(u, v) = 2

6. T(v) is obtained from the paper title or name of the data source represented by v

The features in the second category are:

1. N+ = 0

2. N− = 2

3. P(e(u,v) = 1 | N±) = 0
Figure 3.1: Example of a link label prediction problem with one unknown label.
Having obtained the features, we pass them to an SVM classifier to classify the edges into positive and negative edges using the RBF kernel (see Section 3.2.2). We used the LibSVM implementation of SVM for the classification. Since the network is made up of many lowly cited nodes (see Figure 4.2), we sample the test dataset (edges) from nodes with high in-degree to avoid a situation where we get dangling nodes by randomly sampling all the edges linking (in an undirected sense) a node to any other node in the network. Another reason for this is to keep a considerable number of links to each node for better predictions. For instance, in Figure 3.1, using only edge (d, v) in the training dataset might lead to the prediction of positive labels on other incoming edges to v in the test dataset. This is because the positively signed edge (d, v) would be the only incoming edge to node v in the training dataset. Thus, for each node with a high in-degree, we randomly sample 10% of the incoming links. The SVM parameters were selected by running a grid search cross-validation on the training data and then selecting the parameters that achieved the best result. The evaluations and comparisons are presented in Chapter 4.
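A minimal sketch of this pipeline is shown below. The feature-extraction helper and parameter grid are illustrative assumptions, and scikit-learn's SVC (which wraps LibSVM) stands in for the LibSVM binary we used:

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Hypothetical degree-based features for an edge (u, v); `signs` maps known
# edges to +1/-1, `in_nbrs`/`out_nbrs` map a node to its neighbor sets.
def edge_features(u, v, in_nbrs, out_nbrs, signs):
    d_in_pos = sum(signs.get((w, v)) == 1 for w in in_nbrs[v])
    d_in_neg = sum(signs.get((w, v)) == -1 for w in in_nbrs[v])
    d_out_pos = sum(signs.get((u, w)) == 1 for w in out_nbrs[u])
    d_out_neg = sum(signs.get((u, w)) == -1 for w in out_nbrs[u])
    embed = len((in_nbrs[u] | out_nbrs[u]) & (in_nbrs[v] | out_nbrs[v]))
    return [d_in_pos, d_in_neg, d_out_pos, d_out_neg,
            len(in_nbrs[u]) + len(out_nbrs[u]),   # total degree of u
            len(in_nbrs[v]) + len(out_nbrs[v]),   # total degree of v
            embed]                                # embeddedness C(u, v)

# RBF-kernel SVM with C and gamma selected by grid-search cross-validation.
def train_edge_classifier(X_train, y_train):
    grid = GridSearchCV(SVC(kernel="rbf"),
                        {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 0.1]},
                        cv=5)
    grid.fit(X_train, y_train)
    return grid.best_estimator_
```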
3.2.2 Support Vector Machines

Support Vector Machines (SVM) were introduced in 1979 (Vapnik, 1982). This algorithm has been shown to support both classification and regression tasks and has thus grown in popularity over the years. The basic concept of this algorithm is as follows: given an input vector, find a hyperplane (boundary) that separates a set of objects having different class memberships. This raises the question of how we find a good hyperplane that not only can separate the training data but also will generalize well, as not all hyperplanes that separate the training data will automatically separate the testing data [42]. The optimal hyperplane for this work can be defined as a decision function with maximal margin between positively and negatively signed training examples (x_i, y_i), y_i ∈ {−1, 1}. In our case, x can be seen as the link features and y as the link label. The training examples are linearly separable if there exists a vector w and a scalar b such that the inequalities:
w · x_i + b ≥ +1,  if y_i = +1        (3.2)

w · x_i + b ≤ −1,  if y_i = −1        (3.3)

are valid for all examples of the training set [42]. These inequalities can be combined into:

y_i (x_i · w + b) − 1 ≥ 0  ∀i        (3.4)
Lying on the hyperplane H1 : w · x_i + b = 1 are the points for which the equality in (3.2) holds. These points have a perpendicular distance |w · x + b| / ||w|| = 1 / ||w|| from the optimal hyperplane with normal w. Similarly, lying on the hyperplane H2 : w · x_i + b = −1 are the points for which the equality in (3.3) holds; they lie at the same distance 1 / ||w|| on the other side. Therefore, d+ = d− = 1 / ||w|| and the margin is 2 / ||w||. The optimal hyperplane w* · x + b* = 0 is the one that maximizes this margin. Since the optimal hyperplane can be built from few of its support vectors relative to the size of the training dataset, the generalization ability will be high even in an infinite dimensional space [42]. Furthermore, [14] shows that w can be expressed as a linear combination of a subset of the training dataset, the points that lie exactly on the margin hyperplanes (the support vectors):

w* = Σ_i y_i α_i* x_i        (3.5)

where α_i* ≥ 0. Since α_i* > 0 only for support vectors, equation (3.5) represents a compact expansion of the solution in terms of the support vectors. For data points that fall between H1 and H2, the constraints (3.2) and (3.3) have to be relaxed when necessary (i.e., at a further cost for doing this) by introducing positive slack variables ζ_i, i = 1, 2, ..., n:

w · x_i + b ≥ +1 − ζ_i,  if y_i = +1,  ζ_i ≥ 0 ∀i        (3.6)

w · x_i + b ≤ −1 + ζ_i,  if y_i = −1,  ζ_i ≥ 0 ∀i        (3.7)
One way to assign an extra penalty for errors is to change the objective function to be minimized from ||w||² / 2 to ||w||² / 2 + C (Σ_i ζ_i)^k, where a larger C corresponds to assigning a higher penalty to errors.
A question that may arise is: what if the data are not linearly separable; how can the above methods be generalized? Boser et al. [43] showed that this problem can be solved by mapping the data into a new feature space where the data can be separated (see Figure 3.3). For this, we need to find a function that will perform such a mapping:

Φ : R^N → F
However, this feature space might be of a much higher dimension, and mapping the data into a feature space of such a dimension might affect the performance of the resulting machine. One can show [44] that during training, the optimization problem only uses the training examples to compute pair-wise dot products, x_i · x_j, where x_i, x_j ∈ R^N. This is notable because it turns out that there exist functions that, given two vectors x and y in R^N, implicitly compute the dot product between Φ(x) and Φ(y) in a higher-dimensional space without explicitly mapping x and y into that space. This process uses no extra memory and has a minimal effect on computation time.
These functions are called kernel functions. A kernel function can thus be formulated as:

K(x_i, x_j) = (Φ(x_i), Φ(x_j))_M

where (·, ·)_M is an inner product of R^M, Φ(x) transforms x to R^M (Φ : R^N → R^M), and M > N. Kernel functions K(x_i, x_j) = Φ(x_i) · Φ(x_j) are available in most SVM implementations.
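For instance, a tiny sketch of the kernel trick with the RBF kernel (γ is an assumed bandwidth parameter):

```python
import numpy as np

# The RBF kernel K(x, y) = exp(-gamma * ||x - y||^2) corresponds to a dot
# product Phi(x) . Phi(y) in an (infinite-dimensional) feature space,
# without ever computing Phi explicitly.
def rbf_kernel(x, y, gamma=0.5):
    return np.exp(-gamma * np.sum((x - y) ** 2))

x, y = np.array([1.0, 2.0]), np.array([2.0, 0.0])
print(rbf_kernel(x, y))  # used wherever the optimization needs x_i . x_j
```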
The most popular choice of kernel used in Support Vector Machines is the RBF kernel, primarily because of its finite and localized responses across the range of inputs. This is the kernel we adopt for our classification problem.

Our second approach applies label propagation to the link label prediction problem. Note, "popular" in our case means having most of its incoming links be of one sign. Given the citation graph G = (V, E), whose nodes V are papers and dataset sources and whose edges E are partially labeled (i.e., only some edges are labeled), we want to infer the labels of the unlabeled edges using the labeled edges and the network structure. We formalize the graph such that the edges are represented as labeled and unlabeled nodes V* and edges E* represent
similarities between them. These similarities are given by a weight matrix W : Wi,j is
non-zero iff e_i and e_j have similar target nodes or the target node of one is the source
node of another, i.e., the edges have the following configurations ({ei,j and ek,j } or
{ei,j and ej,k }). We can then propagate the labels along the directed edges according
to the weight of the links in a random walk approach.
Given the original graph G with N nodes, we define a conversion algorithm (see Algorithm 1). This algorithm creates a new graph G* with the edges of G as its nodes and generates edges in such a way that two nodes in G* are linked only if their edges in G have the same target node v_j or the source node of one is the target node of the other. Two nodes in G* are linked together with either a similarity score of two, if the edges in G represented by these nodes have the same target node, or a similarity score of one, if the source node of one is the target node of the other. Figure 3.4 shows an example of this conversion.
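A minimal sketch of this conversion, assuming G is given as a list of directed edges (u, v):

```python
from itertools import combinations

def convert_graph(edges):
    """Build the similarity weights of G* (nodes of G* are the edges of G)."""
    weights = {}
    for ea, eb in combinations(edges, 2):
        if ea[1] == eb[1]:                      # same target node: score 2
            weights[(ea, eb)] = 2
        elif ea[1] == eb[0] or eb[1] == ea[0]:  # target of one is source of the other
            weights[(ea, eb)] = 1
    return weights

G = [("a", "v"), ("b", "v"), ("v", "c")]
print(convert_graph(G))
# {(('a', 'v'), ('b', 'v')): 2, (('a', 'v'), ('v', 'c')): 1, (('b', 'v'), ('v', 'c')): 1}
```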
If our graph were a connected graph, using only the graph produced as output by Algorithm 1 would have been enough. However, our graph has more than one connected component. Therefore, whenever the propagation reaches a dead end, we make a random jump to any other connected component. This is done by generating random edges into each connected component CC_i in G from a node in another connected component CC_j. These edges are given low weights.
Figure 3.4: A sample graph G. (a) Original graph G given as input to the algorithm,
(b) Resulting graph G* produced as output of the algorithm, with weights assigned to its edges.
The sampling distribution is made such that a random edge is more likely to be generated from a node with high in-degree. This preference is made so that the result of the label propagation algorithm is not affected by the generated edges. Based on the label propagation algorithm (see Algorithm 2), starting with labeled nodes 1, 2, ..., l and unlabeled nodes l + 1, ..., n, each node starts to propagate its label to its neighbors in the manner of a Markov chain on the graph, and the process is repeated until convergence [46].
The general label propagation problem setup is defined as follows: denoting a set of items and their corresponding labels (x_1, y_1), ..., (x_n, y_n), let X = {x_1, ..., x_n} be the set of items and Y = {y_1, ..., y_n} the item labels. Let (x_1, y_1), ..., (x_l, y_l) be the items with known labels Y_L = y_1, ..., y_l, and (x_{l+1}, y_{l+1}), ..., (x_n, y_n) be the items with unknown labels Y_{UL} = y_{l+1}, ..., y_n. The aim is to predict the labels of the unlabeled items from X and Y_L. To achieve this goal, a graph G = (V, E) is constructed where the set of nodes V = x_1, ..., x_n is the set of items X and
the weights of the set of edges E show the similarities among the items. Intuitively, similar nodes should have similar labels. The label of one node is propagated to other nodes based on the weight of the edges (i.e., labels are more likely to be propagated through edges with larger weights). Different algorithms have been proposed for label propagation. These include iterative algorithms [25][24] and Markov random walks [26].
Iterative Algorithms
Given the graph G, the idea is to propagate labels from labeled nodes to unlabeled nodes in the graph. Starting with labeled nodes (x_1, y_1), ..., (x_l, y_l), labeled with 1 or −1, and unlabeled nodes (x_{l+1}, y_{l+1}), ..., (x_n, y_n), labeled with 0, each node starts to propagate its label to its neighbors, and the process is repeated until convergence.

Algorithms of this kind have been proposed by Zhu et al. and Zhou et al. [25][24] (see Algorithms (2) and (3)). Labels on labeled and unlabeled data are denoted by Ŷ = (Ŷ_l, Ŷ_u). Zhu et al. proposed and proved the convergence of a propagation algorithm (see Algorithm (2)) in which the initial labels of the data points with known labels (x_1, ..., x_l) are forced on their estimated labels, i.e., Ŷ_l is constrained to be equal to Y_L. A variant known as label spreading was proposed and proved to converge by Zhou et al. [25], in
which at each step, a node i gets a contribution from its neighbors j (weighted by the
normalized weight of the edge (i, j)) and an additional small contribution given by its
initial value [46]. In general, we can expect the convergence rate of these two algorithms to depend on the number of edges in the graph. In the case of a dense weight matrix, the computational time is thus O(n³) [46].
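A minimal sketch of the clamped iterative scheme, assuming a symmetric weight matrix W with no isolated nodes and y holding +1/−1 for the first l labeled nodes and 0 elsewhere:

```python
import numpy as np

def propagate_labels(W, y, n_labeled, max_iter=1000, tol=1e-6):
    P = W / W.sum(axis=1, keepdims=True)   # row-normalized transition matrix
    f = y.astype(float).copy()
    for _ in range(max_iter):
        f_new = P @ f                      # each node absorbs its neighbors' labels
        f_new[:n_labeled] = y[:n_labeled]  # clamp the known labels (Zhu et al. style)
        if np.abs(f_new - f).max() < tol:
            break
        f = f_new
    return np.sign(f)                      # thresholded labels for all nodes
```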
44
Szummer et al. [26] took a different approach for label propagation, working on a similarity graph. Instead of propagating class labels directly, they defined the transition probabilities of Markov random walks on the graph as:

p_{i,k} = W_{i,k} / Σ_j W_{i,j}

where the weight W_{i,j} is given by a Gaussian kernel for neighbors and 0 for non-neighbors, and W_{i,i} = 1 (but one could also use W_{i,i} = 0) [46]. They assume that the starting point of the Markov random walk is chosen uniformly at random, i.e., P(i) = 1/N. A probability P(y = 1 | i) of being of class 1 is associated with each data point x_i. The probability that the walk started from a point of class y_start = 1, given that we arrived at x_j after t steps of random walk, is given by:

P_t(y = 1 | j) = Σ_i P(y = 1 | i) P_{0|t}(i | j)

where P_{0|t}(i | j) is the probability that we started from x_i given that we arrived at x_j after t steps of random walk (this probability can be computed from the p_{i,k}) [26]. They proposed two techniques for estimating the unknown parameters P(y | i). The performance of the method also depends on the length of the random walk t. This parameter can be chosen heuristically (i.e., matched to the scale of the clusters we are interested in) or by cross-validation (if enough data are available).
Here we consider another semi-supervised approach for predicting the missing link
labels in our dataset. In this method, we approach our link label prediction problem
as a task to rank the nodes in the network. The idea is to design an algorithm that
will assign higher scores to nodes with more positive in-degree than those with more
negative in-degree. Here, we rank the nodes using the PageRank algorithm, then we
find a threshold separating the nodes such that an edge directed towards any of the
nodes with a rank score above the threshold would be labeled as a positive edge and
an edge directed towards a node below the threshold would be labeled as a negative
edge. A broad description of how PageRank works can be found in section 3.4.1
below.
Our method represents the graph as a sparse square matrix M whose element M_{u,v} is the probability of moving from node u to node v in one time step. For instance, using the graph in Figure (3.6), our M would be the matrix P (see Section 3.4.1). We then compute the adjustments to make our matrix irreducible and stochastic (see Section 3.4.1). The PageRank scores are then calculated using a modified version of an algorithm that accelerates the convergence of the power method [47]. We modified the algorithm such
that:

r_i = Σ_{j ∈ L_i} { +r_j / N_j  if edge (P_j, P_i) is positive;  −r_j / N_j  if edge (P_j, P_i) is negative }        (3.11)
Equation 3.11 is set such that incoming negative links will decrease the rank score of a node while incoming positive links will increase the rank score of a node. An initial rank score of 1/N was assigned to each node (N is the total number of nodes). Figure (3.5) shows the PageRank score distributions of both the original PageRank algorithm and the modified PageRank algorithm. Another important element of this approach is the threshold: the performance of this approach relies on a good choice of threshold. For our network, we chose the 85th percentile as the threshold.
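A minimal sketch of the modified iteration in Eq. (3.11) together with the percentile threshold (the edge representation, iteration count, and normalization are assumptions):

```python
import numpy as np

def signed_pagerank(n_nodes, edges, n_iter=100):
    """edges: (source, target, sign) triples with sign in {+1, -1}."""
    out_deg = np.zeros(n_nodes)
    for u, v, s in edges:
        out_deg[u] += 1
    r = np.full(n_nodes, 1.0 / n_nodes)             # initial score 1/N per node
    for _ in range(n_iter):
        r_new = np.zeros(n_nodes)
        for u, v, s in edges:                       # positive edges add rank,
            r_new[v] += s * r[u] / out_deg[u]       # negative edges subtract it
        r = r_new / max(np.abs(r_new).sum(), 1e-12) # keep a fixed scale
    return r

# Label inference: an edge (u, v) is predicted positive iff r[v] exceeds
# the chosen percentile threshold, e.g. threshold = np.percentile(r, 85).
```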
3.4.1 PageRank

The PageRank algorithm, first introduced by Brin and Page [20], is one of the algorithms used by Google to rank their search engine results. The main idea behind the PageRank algorithm is that the importance of any web page can be determined based on the pages that link to it. It was intuitively explained using the concept
of an infinitely dedicated web surfer randomly going from page to page by choosing
a random link from one page to get to another. The PageRank of page(i) is the
probability that a random web surfer visits page(i). However, this random surfer can
end up in a loop (cycles around cliques of interconnected pages) or hit a dead end (a
web page with no outgoing links). Therefore, in order to address these problems, an
adjustment was made such that with a certain probability, the random surfer jumps
to a random web page. This random walk is known as a Markov process. Based on the
principle of PageRank, if we include a hyperlink to the web page(i) in our Data mining
lab site, this means that we consider page(i) important and relevant to the topic being
discussed on our site. If lots of other web pages also link to page(i), the logical conclusion is that page(i) is important. Conversely, even if page(i) has only one backlink, if that backlink comes from a very popular site page(j), the logic still asserts that page(i) is important. With this understanding, we can say that the PageRank algorithm is like a counter of an online ballot, where pages vote for the importance of other pages.
Figure 3.5: PageRank score. (a) Resulting rank scores of Normal PageRank algo-
rithm, (b) Resulting rank scores of our method, the red line shows the threshold
This result is then gathered by PageRank and reflected in the search results. For any given graph, PageRank is well-defined and can be used for ranking any set of vertices.
The Algorithm
The original PageRank algorithm described by Brin et al. [20] is given by:

PR(A) = (1 − d) + d (PR(T_1)/C(T_1) + ... + PR(T_n)/C(T_n))        (3.12)
where PR(A) is the PageRank of page A, PR(T_i) is the PageRank of a page T_i which links to page A, C(T_i) is the number of outbound links on page T_i, and d is a damping factor which can be set between 0 and 1. The damping factor d is usually set to 0.85. This means the amount of PageRank that a page has to vote will be its own value × 0.85, and this vote is shared equally among all the pages it links to. To get accurate results, this process is performed iteratively until it converges. The PageRank r_i, i ∈ {1, 2, ..., n}, of page P_i can equivalently be written as:

r_i = Σ_{j ∈ L_i} r_j / N_j,  i = 1, 2, ..., n.        (3.13)

where N_j is the number of outlinks from page P_j and L_i is the set of pages that link to page P_i.
Operation On Matrix
Let us consider Figure (3.6), which is a graphical representation of a small web consisting of six web pages. The graph contains six nodes, each of which represents a web page, with the links showing the relationships between the web pages.

Figure 3.6: A directed graph representing a small web of six web pages.

Here, a hyperlink matrix P is introduced: a row-normalized matrix with P_{i,j} = 1/|P_i| if node(i) has a link to node(j), and 0 otherwise, where |P_i| is the number of outlinks of page i. At each iteration, a single PageRank update is computed as:

π^{(k+1)T} = π^{(k)T} P        (3.14)

where π is the PageRank vector, usually initialized with π_i = 1/N for a web of N pages:
        P1    P2    P3    P4    P5    P6
P1   (   0     0     0     0     0     0  )
P2   (  1/2    0    1/2    0     0     0  )
P3   (  1/3   1/3    0    1/3    0     0  )
P4   (   0     0     0     0     1     0  )
P5   (   0     0     0    1/2    0    1/2 )
P6   (   0     0     0    1/2   1/2    0  )
However, using just the hyperlink structure in building the Markov matrix is not enough because of the presence of dangling nodes. Dangling nodes are nodes with no outgoing edges, like P1 in our example; thus, P is not stochastic. These dangling nodes appear very often on the web. There are two proposed adjustments to deal with this problem. The first adjustment is to make the matrix stochastic. This is formulated mathematically as:
S = P + a (1/N) e^T        (3.15)

where e is a column vector of ones and

a_i = 1 if P_i is a dangling node, and a_i = 0 otherwise
        P1    P2    P3    P4    P5    P6
P1   (  1/6   1/6   1/6   1/6   1/6   1/6 )
P2   (  1/2    0    1/2    0     0     0  )
P3   (  1/3   1/3    0    1/3    0     0  )
P4   (   0     0     0     0     1     0  )
P5   (   0     0     0    1/2    0    1/2 )
P6   (   0     0     0    1/2   1/2    0  )
The matrix S is now stochastic, but it may still be reducible, in which case the power method is not guaranteed to converge to a unique stationary vector of the chain (the PageRank vector) [48]. Therefore, to make the matrix irreducible, a second adjustment is applied:

G = αS + (1 − α)E        (3.16)

where α is a scalar between 0 and 1 and E = (1/N) ee^T is the teleportation matrix. Setting α = 0.85 can be explained as the random surfer following the hyperlink structure of the web 85% of the time and jumping to a random new page 15% of the time.
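Putting Eqs. (3.15) and (3.16) together for the six-page example (the matrix P is as reconstructed above; α = 0.85):

```python
import numpy as np

P = np.array([
    [0,   0,   0,   0,   0,   0  ],   # P1 is dangling
    [1/2, 0,   1/2, 0,   0,   0  ],
    [1/3, 1/3, 0,   1/3, 0,   0  ],
    [0,   0,   0,   0,   1,   0  ],
    [0,   0,   0,   1/2, 0,   1/2],
    [0,   0,   0,   1/2, 1/2, 0  ],
])
N = P.shape[0]
a = (P.sum(axis=1) == 0).astype(float)    # dangling-node indicator
S = P + np.outer(a, np.full(N, 1.0 / N))  # Eq. (3.15): make P stochastic
alpha = 0.85
G = alpha * S + (1 - alpha) / N           # Eq. (3.16): G = aS + (1-a)E

pi = np.full(N, 1.0 / N)                  # uniform starting vector
for _ in range(100):
    pi = pi @ G                           # power iteration: pi^T <- pi^T G
print(pi)                                 # stationary PageRank scores
```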
Chapter 4
Experimental Results
In this chapter, we present and discuss our dataset and the results of our experiments, compare the proposed approaches, and evaluate their predictive performance on our dataset.

4.1 Data Collection

The dataset used in this thesis comprises papers presented at three major conferences: the IEEE International Conference on Data Mining (ICDM), the SIAM International Conference on Data Mining (SDM), and the ACM SIGKDD Conference on Knowledge Discovery & Data Mining (KDD). The dataset spans from 2001 until 2014 and
was crawled from the DBLP computer science repository website (http://dblp.uni-trier.de/db/conf/icdm/index.html and the corresponding pages under http://dblp.uni-trier.de/db/conf/). The main information in the collected dataset includes paper title, authors, abstract and references. Figure 4.1 shows the number of papers collected for each conference from 2001 to 2014. However, to use this dataset efficiently in our work, we split the papers and their references, and present them as separate entries, thus building an index over the whole dataset. For each paper, we create node pairs (u, v_i), i = 1, 2, ..., n,
where u is the paper, v_i is a reference in the paper, and n is the total number of references in the paper.
Figure 4.1: The number of collected papers for each of the three conferences from
2001 to 2014
4.2 Dataset description

The constructed citation network consists of nodes representing papers and their references and directed links representing the citation relationship. In total, there are 4830 papers obtained from the three conferences and, on average, each paper references a dataset source, although there are papers with more than one reference to dataset sources and also papers with no reference to any dataset source. Each link is labeled as positive or negative based on the nature of the reference. If the
Table 4.1: Directed Network Information
reference is based on the dataset used in the paper, it is labeled as positive; otherwise, it is labeled as negative. The labeling was done manually by going through the papers to check which of their references are dataset related. The network contains 51,680 nodes and 101,503 edges, of which ~90% are negative. The full citation network is shown in Figure 4.3. We can see that the network is made up of several connected components. In total, there are 63 connected components. Information about the network can be found in Table 4.1.
[Figure 4.2: Distribution of the number of citations. y-axis: number of citations (0 to 250); x-axis: papers and data sources (index).]
The number of citations of each node in our network (Figure 4.3) can be seen in Figure 4.2, with the most cited paper in our dataset being Latent Dirichlet Allocation - Blei et al. The nodes in Figure 4.2, represented by their indices, are sorted in descending order according to their citation count in our dataset. We can see in Figure 4.2 that the dataset is mainly made up of lowly-cited papers and dataset sources. The highly-cited dataset sources (with positive in-degree greater than 20 and based on different citations) in our dataset, listed according to their citation count, include:
GroupLens: An Open Architecture for Collaborative Filtering of Netnews - Resnick et al.
6. Gradient-based Learning Applied to Document Recognition - Lecun et al.
The UCI Repository of Machine Learning Databases [49], which was created in 1987, currently comprises 348 datasets that encompass a wide variety of data types from a broad range of problem areas.

The UCI KDD Archive [50], published in 1999, is similar to the UCI Repository of Machine Learning Databases. However, its aim is to store large datasets, which may contain large numbers of individual samples (or examples, or objects). This archive also spans a wide variety of data types and application areas.

The GroupLens system - Resnick et al. is integrated into the Usenet news architecture. It is a distributed system for gathering, distributing, and using ratings from some users to predict other users' interest in articles. The news clients can be freely substituted, providing an environment for experimentation in predicting ratings and in user interfaces for collecting ratings and displaying predictions.

Gradient-based Learning Applied to Document Recognition - Lecun et al. studies neural networks and handwriting recognition. In this paper, they created a new database, called the Modified NIST dataset or MNIST. This database comprises over 120,000 handwritten digit images. They also collected datasets of over 500,000 hand-printed characters (comprising upper-case and lower-case letters, digits, and punctuation) spanning the entire printable ASCII set, and over 3,000 words.
4.3 Results
The network that we study is overwhelmingly unbalanced, with most of the nodes having very low in-degree. Thus, we consider and evaluate the three different algorithms by randomly sampling 10% of the edges to each node with high in-degree as the test dataset. The evaluation results are obtained by running the evaluation five times with random samples and then taking the average. However, since our aim is to make predictions for new papers, we also set aside 300 papers as unseen data to be evaluated separately. We compare our methods based on the precision, recall, and AUC (area under the ROC curve) of the algorithms and also show their ROC curves. For the SVM method, we evaluate the following feature sets:
1. SVM method with all 12 features, see Section 3.2.2 (noted as "Degree & Embedded" in Figure 4.4)

2. SVM method with 9 features from node degree information (noted as "Degree" in Figure 4.4)

3. SVM method with the embeddedness features (noted as "Embedded" in Figure 4.4)

4. SVM method with 28 features, comprising our 9 degree features and 16 structural balance features as presented in [7] (noted as "Degree & 16 Triads" in Figure 4.4)

5. SVM method with only the 16 structural balance features as presented in [7] (noted as "16 Triads" in Figure 4.4)
The results shown in Figure 4.4 compare the AUC, precision, and recall of these different approaches. We can see that using just triad information or information obtained from mutual neighbors produces worse results since, as noted by [7], the triad features are only relevant when two nodes have neighbors in common. It is to be expected that they will be most effective for edges of greater embeddedness. However, the network we study has very few such edges with high embeddedness.
The ROC curves are shown in Figure 4.5, which compares the 7 different approaches. We can see that the methods relying on the embeddedness of the network performed worse than the rest. The propagation method in general performed well, while the PageRank method performed worse than the SVM methods. However, in this thesis work, our aim is to obtain a high number of true-positives with low false-positives; hence, the region of the curves with low false-positive rates is of most interest. In Figure 4.5, the SVM method based on the degree and structural balance features performs better than the other methods if a false-positive threshold of 0.1 is selected. However, the PageRank method performs worse than both the label propagation and SVM methods in the region of the curves corresponding to low false-positive rates. For the evaluation on unseen papers, we use all the papers in the training and testing sets for training. We then try to predict the dataset-relevant links in the unseen dataset. Figure 4.6 shows the results of the prediction using the different algorithms. For SVM, we report the best result from using the different feature sets. This is interesting as it shows how the different methods respond to unseen datasets.
[Figure 4.5: ROC curves (true positive rate vs. false positive rate) of the compared approaches: Embedded, 16 Triads, Degree & 16 Triads, Degree & Embedded, Degree, Propagation, and PageRank.]
We can see that the label propagation method out-performs the other methods, even with the unseen datasets, as it was able to recover most of the dataset-relevant links.

4.4 Community Network Analysis

A community in a network is a group of vertices such that the connections among vertices within a community are denser than those from one community to
another. Various algorithms have been developed in recent years for detecting this structure. In our work, we are interested in analyzing the citation structure of data-reference papers. Analyzing the dataset communities can enable us to gain further information on the datasets, such as how the datasets or dataset sources affect the structure of the network, which datasets are more useful for a particular topic, what the prevalent topics within the communities are, which people are more likely to use a dataset, etc. We used the GLay community detection algorithm by Su et al. [52] to detect the communities.
Taking out all the non-dataset related links from our network, we got a new network with only dataset relevant links. Information about the new network can be seen in Table 4.2. We ran the community detection algorithm and selected communities with more than five nodes in order to focus on fairly large communities.
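A minimal sketch of this extraction and detection step. Here networkx's greedy modularity maximization is used as a stand-in for the GLay Cytoscape plugin, and `edges` is assumed to hold (u, v, sign) triples:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def dataset_communities(edges, min_size=5):
    G = nx.Graph()
    G.add_edges_from((u, v) for u, v, s in edges if s == 1)  # dataset links only
    communities = greedy_modularity_communities(G)
    return [c for c in communities if len(c) > min_size]     # fairly large ones
```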
The three largest communities are shown in Figures 4.9, 4.10, and 4.11. We can see that the first and second largest communities, Figures 4.9 and 4.10, are largely made up of papers citing the UCI dataset repository. Although these papers might be using similar datasets, the dataset repository is cited differently according to the UCI repository librarians in charge of the repository at the time of use. The UCI dataset repository is one of the most popular repositories used by the machine learning
community and comprises datasets of different types and varieties. The third largest community, Figure 4.11, consists of two main dataset sources. In one of them, collected rating information is used to learn a new model of the users' interests.
In addition to the three largest communities, two other communities, shown in Figures 4.12 and 4.13, are analyzed. The dataset sources in the fourth community are mainly based on the use and behavior of online articles. These datasets were introduced and discussed in the corresponding papers. One of them comprises posts from 44,362 blogs drawn from a larger dataset [54], which contains 21.3 million posts from 2.5 million blogs from August and September 2005.
The fifth community mainly comprises citations of time-series dataset sources. Al-
though the UCR time series repository is cited in different formats, some of the papers
with different citation formats might be using similar datasets. This is due to the
citation policy of the repository (similar to that of the UCI dataset repository), which
requires citing the repository's librarians at the time of use. The main dataset
sources in this community are:
• The UCR Time Series Data Mining Archive - Keogh and Folias
• The UCR Time Series Classification/Clustering Homepage - Keogh and Ratanamahatana
• www.cs.ucr.edu - Keogh
• Emergence of Scaling in Random Networks - Barabasi et al., whose work covers
the electrical power grid data of the western US with 4,941 vertices, a col-
laboration graph of movie actors with 212,250 vertices, and the world wide web
• Exact Discovery of Time Series Motifs - Abdullah et al., who built and made publicly
available a web page containing all the time series datasets and code used in their
work.
To further analyze these communities, we first look into the research topics of the
papers in each community. A total of 20,542 topics in the fields of machine learning
and data mining were crawled from the Microsoft Academic website [55], since the papers
studied in this thesis come from conferences in these two areas. We then calculate the
frequency of each topic in the titles of the papers represented by the nodes in a
given community. The popular topics in each community are the frequent ones (with
the frequency cut-off varying from community to community). Figures 4.7 and 4.8 show the
topic clouds of the three largest communities and the two selected communities.
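As an illustration of this counting step, the following sketch tallies topic occurrences in the paper titles of one community; the topic list and titles are toy placeholders rather than the crawled data.

# Sketch: counting topic frequencies in paper titles within one community.
# `topics` stands in for the 20,542 crawled topic phrases and `titles` for
# the titles of the community's papers; both are toy placeholders.
from collections import Counter

topics = ["support vector machine", "label propagation", "time series"]
titles = [
    "fast label propagation on signed graphs",
    "a support vector machine approach to time series classification",
    "indexing large time series archives",
]

counts = Counter()
for title in titles:
    t = title.lower()
    for topic in topics:
        if topic in t:
            counts[topic] += 1

# The most frequent topics form the community's topic cloud.
print(counts.most_common())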
[Figure 4.7: Topic clouds of the three largest communities: (a) Community 1, (b) Community 2, (c) Community 3.]
The largest community (Community 1) has three main sources of datasets: the UCI
dataset repository, the UCI KDD archive, and WebACE, an agent for exploring and cat-
egorizing documents on the world wide web. WebACE was introduced in the research
paper WebACE: A Web Agent for Document Categorization and Exploration - Han
et al. The largest dataset source in this community is the UCI dataset repository.
This is expected, given the vast number of datasets in the repository. However, due to
the citation policies of both the UCI dataset repository and the UCI KDD archive,
the exact datasets used in the papers cannot be easily determined by looking
at the repositories (or the citations). The topic cloud confirms that the repositories
cover a wide range of datasets and application areas.
[Figure 4.8: Topic clouds of the two selected communities: (a) Community 4, (b) Community 5.]
The second largest dataset community (Community 2) has one main dataset
source, the UCI dataset repository - A. Asuncion and D. Newman. This
community's topic cloud also confirms the wide range of datasets and application areas
covered by the repository.
The third largest dataset community (Community 3) has four main dataset
sources, introduced in RCV1: A New Benchmark Collection for Text Categorization
Research - Lewis et al., NewsWeeder: Learning to Filter Netnews - Lang et al.,
Gradient-based Learning Applied to Document Recognition - Lecun et al., and
Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene
Expression Monitoring - Golub et al., respectively. We can see (based on the descrip-
tion of the dataset sources above) that three of the largely cited dataset sources in this
community are relevant to text (or character) recognition and classification, hence
the large number of papers on text-related topics. However, due to the fourth dataset
source, which provides gene expression data, topics related to gene and cancer
classification also appear in this community.
The fourth community (Community 4) has two main dataset sources, introduced
and discussed in three research papers: GroupLens: An Open Architecture for Collab-
orative Filtering of Netnews - Resnick et al., Cascading Behavior in Large Blog Graphs
- Leskovec et al., and GroupLens: Applying Collaborative Filtering to Usenet News -
Konstan et al. We can see in Figure 4.8 that this community mainly has papers on
collaborative filtering. This is because the main dataset source in this community is
the GroupLens system, which uses collaborative filtering to predict (and recommend)
Usenet news articles based on obtained user ratings. The second dataset source,
the blog graph data of Leskovec et al., concerns the behavior of blog posts.
The fifth community (Community 5) has three main dataset sources. The first
data source is the UCR time series data mining archive, which is a repository of time
series datasets. The second dataset source, introduced in Exact Discovery of Time
Series Motifs - Abdullah et al., is also made up of time series datasets, while the third
dataset source contains datasets from different networks. Two of the three main
dataset sources in this community thus consist of time series datasets, hence the high
frequency of time-series-related topics in this community's topic cloud.
To further analyze the predicted dataset-relevant links in the unseen data, we
ran the community-detection process with the correctly predicted links (obtained using
the label propagation method) included, and analyzed the distribution of these links
across the communities. 68 of the 109 predicted links were distributed among the 15
largest communities (those with more than 53 papers), with 19 of the 68 links in the
largest community.
Figure 4.9: The largest community. (a) UCI dataset repository - C.L. Blake and C.J.
Merz, (b) UCI KDD archive - S.D. Bay, (c) WebACE: A Web Agent for Document
Categorization and Exploration - Han et al.
Figure 4.10: The second largest community. (a) UCI dataset repository - A. Asuncion
and D. Newman
Figure 4.11: The third largest community. (a) RCV1: A New Benchmark Collection
for Text Categorization Research - Lewis et al. (b) NewsWeeder: Learning to Filter
Netnews - Lang et al. (c) Gradient-based Learning Applied to Document Recognition
- Lecun et al. (d) Molecular Classification of Cancer: Class Discovery and Class
Prediction by Gene Expression Monitoring - Golub et al.
Figure 4.12: The fourth community. (a) GroupLens: An Open Architecture for
Collaborative Filtering of Netnews - Resnick et al. (b) Cascading Behavior in Large
Blog Graphs - Leskovec et al. (c) Grouplens: Applying Collaborative Filtering to
Usenet News - Konstan et al.
Figure 4.13: The fifth community. (a) The UCR Time Series Data Mining Archive -
Keogh and Folias (b) Emergence of Scaling in Random Networks - Barabasi et al. (c)
The UCR Time Series Classification/Clustering Homepage - Keogh et al. (d) Exact
Discovery of Time Series Motifs - Abdullah et al. (e) www.cs.ucr.edu - Keogh
Chapter 5
5.1 Conclusion
We studied the link label prediction problem and proposed three different approaches
for our citation graph. The first approach, a supervised method, uses Support
Vector Machines (SVM) to predict link labels based on generated features. The
second approach transforms the graph into a new graph in which nodes are citation
relations and edges indicate similarities between citations; this similarity is based
on their mutual nodes in the original graph. The known node labels are then
propagated to the nodes with unknown labels. The third approach applies
a PageRank method to assign scores to each node: nodes with a high PageRank
value are classified as dataset-relevant nodes, and nodes with a lower PageRank
value are classified as non-dataset-relevant nodes.
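As a sketch of this classification rule, the snippet below computes PageRank scores with networkx on a toy graph and applies a threshold; the graph, damping factor, and threshold value are illustrative assumptions, not the thesis settings.

# Sketch: PageRank-based classification of nodes as dataset relevant.
# Toy citation graph; edge (u, v) means paper u cites paper v.
import networkx as nx

g = nx.DiGraph([("p1", "p2"), ("p3", "p2"), ("p4", "p2"), ("p4", "p5")])
scores = nx.pagerank(g, alpha=0.85)

threshold = 0.25  # placeholder threshold, chosen here for illustration only
dataset_relevant = {n for n, s in scores.items() if s >= threshold}
print(sorted(dataset_relevant))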
Results from the experiments show that the label propagation method performs best.
Although an approach using the supervised method performed slightly better than the
PageRank method, the supervised method using SVM is computationally demanding
and needs several preprocessing steps, such as generating and choosing the best
features, the best parameters, and the SVM kernel to use. Cross-validation can be
employed to tune the parameters and choose the kernel function, but the computation
cost remains an issue to be resolved later. Therefore, the label propagation approach
is the best choice, due to its lower demand for preprocessing work and its lower
computation cost.
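To illustrate the tuning burden just mentioned, here is a minimal cross-validation sketch with scikit-learn; the feature matrix, labels, and parameter grid are placeholders, not the edge features used in this thesis.

# Sketch: cross-validated selection of SVM kernel and parameters.
# X stands in for the generated edge features, y for the known edge labels;
# both are random placeholders here.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))       # placeholder edge-feature matrix
y = rng.integers(0, 2, size=100)    # placeholder positive/negative labels

param_grid = {
    "kernel": ["linear", "rbf"],
    "C": [0.1, 1.0, 10.0],
    "gamma": ["scale", 0.1],
}
search = GridSearchCV(SVC(), param_grid, cv=5)  # 5-fold cross-validation
search.fit(X, y)
print(search.best_params_, search.best_score_)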
We also analyzed our citation network to extract the communities that are dataset
relevant, and characterized them by studying their data sources. For this study, we
examined five communities. We found that by looking at the data sources available
in a dataset community, one can understand to a certain degree the community and
its main topic areas.
There are several future research directions based on this work. First, the current
network is composed of papers from only three conferences; the results presented in
this thesis could be improved by including more conferences and papers, as adding
more data would improve the prediction accuracy of the algorithms. However, since
real-world data contain some noise, another possible approach would be to use the
label spreading algorithm (Algorithm 3) [25], as it allows changes to the initial la-
beling in order to keep the labels consistent across the network. A disadvantage of
using this method on our citation network is that although enforcing consistency
might remove some noise, it would also change some true labels. For instance, if
node u has an in-degree of five, four of these edges having a true label of negative
and one positive (i.e., one paper cites u because of the dataset used in it), then, in
making the labels consistent among the edges, the algorithm would modify the
label on the positive edge to negative, thereby producing a false label. A future
research direction is to find a way to solve this problem in order to produce a
cleaner dataset.
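To make this trade-off concrete, below is a minimal numpy sketch of the standard label spreading update, F <- alpha * S * F + (1 - alpha) * Y with S the symmetrically normalized affinity matrix, which we assume matches the cited Algorithm 3 [25]; the affinity matrix, seed labels, and alpha are toy placeholders, not our citation data.

# Sketch of the label spreading iteration. W is a toy affinity matrix,
# Y the initial label matrix (one column per class); labeled nodes can
# have their labels changed, which is exactly the risk discussed above.
import numpy as np

W = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)   # toy affinity (similarity) matrix
Y = np.array([[1, 0],                       # node 0 labeled positive
              [0, 1],                       # node 1 labeled negative
              [0, 0],                       # nodes 2 and 3 unlabeled
              [0, 0]], dtype=float)

d = W.sum(axis=1)
S = W / np.sqrt(np.outer(d, d))             # D^{-1/2} W D^{-1/2}

alpha, F = 0.9, Y.copy()
for _ in range(100):                        # iterate to (near) convergence
    F = alpha * S @ F + (1 - alpha) * Y

print(F.argmax(axis=1))                     # predicted label per node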
Second, the approaches proposed in this thesis rely mainly on the topological
information of the graph, together with some minor additional heuristic and title
information. An extension would be to use more side information, such as the
sections in which the references are mentioned in the papers and the sentences in
which the references are mentioned. Third, it would be interesting to study the
affiliations of the authors publishing in these three conferences and how they change
over time. This might include examining the number of authors with new affiliations
taking part in the conference, or the growth in the number of authors with affiliations
already taking part in the conference. Also, it would be interesting to study the
increase (or decrease) in the number of authors within each affiliation over time,
starting from its first involvement in any of the conferences.
REFERENCES
[5] M. Burke and R. Kraut, “Mopping up: Modeling Wikipedia promotion decisions,”
in Proceedings of the Conference on Computer-Supported Cooperative Work, 2008,
pp. 27–36.
[10] F. Heider, “Attitudes and cognitive organization,” Journal of Psychology, vol. 21,
pp. 107–112, 1946.
[15] D. Song and D. A. Meyer, “Link sign prediction and ranking in signed directed
social networks,” Social Network Analysis and Mining, 2015.
[20] S. Brin and L. Page, “The anatomy of a large-scale hypertextual web search
engine,” Computer Networks and ISDN Systems, vol. 30, no. 1-7, pp. 107–117,
1998.
[21] D. Liben-Nowell and J. Kleinberg, “The link prediction problem for social net-
works,” in Proceedings of the Twelfth International Conference on Information
and Knowledge Management (CIKM ’03), 2003, pp. 556–559.
[24] X. Zhu and Z. Ghahramani, “Learning from labeled and unlabeled data with
label propagation,” CMU CALD, Tech. Rep., 2002.
[26] M. Szummer and T. Jaakkola, “Partially labeled classification with Markov ran-
dom walks,” in Advances in Neural Information Processing Systems, T. G. Di-
etterich, S. Becker, and Z. Ghahramani, Eds., vol. 14. MIT Press, 2002.
[32] Z.-H. Wu, Y.-F. Lin, S. Gregory, H.-Y. Wan, and S.-F. Tian, “Balanced multi-
label propagation for overlapping community detection in social networks,”
Journal of Computer Science and Technology, vol. 27, no. 3, pp. 468–479, 2012.
[Online]. Available: http://dx.doi.org/10.1007/s11390-012-1236-x
[40] P. Wang, B. Xu, Y. Wu, and X. Zhou, “Link prediction in social networks: the
state-of-the-art,” Science China Information Sciences, vol. 58, pp. 1–38, 2014.
[41] R. Raymond and H. Kashima, “Fast and scalable algorithms for semi-supervised
link prediction on static and dynamic graphs,” in Proceedings of ECML/PKDD’10,
2010, pp. 131–147.
[50] S. D. Bay, D. Kibler, M. J. Pazzani, and P. Smyth, “The UCI KDD archive
of large data sets for data mining research and experimentation,” SIGKDD
Explor. Newsl., vol. 2, no. 2, pp. 81–85, Dec. 2000. [Online]. Available:
http://doi.acm.org/10.1145/380995.381030